How to get skato to work with Regenie?
Hi!
I am trying to use Regenie to run SKAT-O for a couple of phenotypes for a specific gene. I keep on getting the following error:
"regenie error: unknown chromosome code in set list file."
I am using the set file that is made available by UKB for the final release of the exome data. This is (part of) my code:
"
genotype_prefix_chr4="/mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/ukb23159_c4_b0_v1"
phenotype_file="/mnt/project/Custom_files/input_files/pheno_files/phenos.txt"
path_to_500kwes_helper_files="/mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, PLINK format - final release/helper_files"
./regenie_v3.2.5.3.gz_x86_64_Linux \
--step 2 \
--ignore-pred \
--bgen "${genotype_prefix_chr4}".bgen \
--ref-first \
--sample "${genotype_prefix_chr4}".sample \
--phenoFile "${phenotype_file}" \
--covarFile "${phenotype_file}" \
--phenoCol pheno1,pheno2,pheno3 \
--covarColList sex,age,PC{1:10} \
--set-list "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.sets.txt.gz" \
--anno-file "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.annotations.txt.gz" \
--mask-def "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.masks" \
--aaf-bins 0.01,0.001,0.0001 \
--extract-setlist "GeneX(ENSG00000XXXX)" \
--write-mask \
--vc-tests skato \
--bt \
--out PhenoGroup_GeneX_all_bin
I also tried to run this command with an unzipped version of the sets.txt.gz file, or with adding "chr" before the numbers in column 2. This does not help. Can anyone help me solve this problem?
(I am aware I should not skip step 1 of regenie, this is just to pass a first try of running regenie on UKB RAP with jupyterlab).
Thank you in advance!
Iris
Comments
6 comments
Iris,
Double check the gene set that you want. Your command as written is looking for a nonexistent gene set:
"--extract-setlist "GeneX(ENSG00000XXXX)" \ "
You need to pick a gene that exists in the ukb23158_500k_OQFE.sets.txt.gz. They will be in the form of GENENAME(ENSG00ID) like:
OR4F5(ENSG00000186092)
PCSK9(ENSG00000169174)
See further document\ation here:
https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/using-regenie-to-generate-variant-masks
-Phil
Hi Phil,
Thank you for your reply, I appreciate it a lot. I am aware that the current gene is not correct, I used a general template for privacy reasons. Basically in a similar way as you specified the form: GENENAME(ENSG00ID). But I can give an example with one of the genes that we work on: "TAZ(ENSG00000102125)"
This gives the same error. I am happy to hear about potential other reasons why my code is giving this error.
Best wishes,
Iris
I feel like Regenie developer or people in Regenie github repo might know better.
One thing I can think of is to check if all your input have consistent notation of chromosome name especially the small contigs.
Hi Chai,
Thank you for your answer. I did check for the chromosome names and they are all fine (1-23). In the meantime I got the analysis to work. I basically rewrote the sets.txt.gz file to my temporary folder on DNAnexus while specifying the file to be tab-delimited. And that worked. So maybe it was a problem with permissions for the location of the original file, or it had to do with the formatting. Either way, I am happy it runs fine now.
Hi Iris,
Could you tell me how you solved this problem in detail? Like how to specify the file to be tab-delimited? Thank you!
I needed to re-write this file, restricting to autosomes for Regenie to work. I used a jupyter notebook on the RAP, using the code below, then used the new set file in my Regenie command.
import pandas as pd
# read set file, only keep autosomes
#! dx download "path_to_helper_files/ukb23158_500k_OQFE.sets.txt.gz" # can only read once per session, otherwise need to delete the file
set_col_names = ['Gene', 'Chromosome', 'Position', 'Set']
autosome_range = list(range(1, 23))
sets = (
pd.read_csv('ukb23158_500k_OQFE.sets.txt.gz',
compression='gzip', delimiter='\t', names=set_col_names)
.query('Chromosome in @autosome_range')
)
# write autosomes set file
sets.to_csv('ukb23158_500k_OQFE_autosomes.sets.txt.gz', compression='gzip', sep='\t',
header=False, index=False)
#! dx upload ukb23158_500k_OQFE_autosomes.sets.txt.gz --path /path_to_somewhere_in_your_project/ukb23158_500k_OQFE_autosomes.sets.txt.gz # can only run once per session
Please sign in to leave a comment.