Comments
We noticed that the sample files for the BGEN data are not formatted correctly. We have notified UKB and the data provider, and this will be fixed in a future data release. However, if you want to analyze the GEL and TOPMed imputed datasets in the meantime, the format issue is easy and cheap to fix: change the second row of each sample file from 0 0 0 0 to 0 0 0 D. You can do this in an interactive workstation (e.g. ttyd, Cloud Workstation, JupyterLab, RStudio), or write a script that makes the change and run it in Swiss Army Knife.
I manually changed the chr22 sample file in my testing application and was able to get the sample file to work with PLINK.
However, if anyone runs into other problems with these two datasets (or if the solution above doesn't work), please share with the community.
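For anyone who prefers to script the fix described above, here is a minimal Python sketch. The function name `fix_sample_header` and the choice to write to a separate output path are my own assumptions, not part of any UKB or DNAnexus tooling. It only touches the second row, and only when that row is exactly `0 0 0 0`.

```python
from pathlib import Path


def fix_sample_header(src, dst):
    """Copy a .sample file, changing a second row of '0 0 0 0' to '0 0 0 D'.

    Returns True if the row was changed, False if it was left as found.
    (Hypothetical helper, written for illustration only.)
    """
    lines = Path(src).read_text().splitlines()
    changed = False
    # Row 1 holds the column names; row 2 holds the column type codes.
    if len(lines) >= 2 and lines[1].split() == ["0", "0", "0", "0"]:
        lines[1] = "0 0 0 D"
        changed = True
    Path(dst).write_text("\n".join(lines) + "\n")
    return changed
```

To cover every chromosome, the same function can be called in a loop over the sample files in your project. The file names are not assumed here, so substitute the paths you actually see under Bulk/Imputation.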
May I know how to access the TOPMed imputation files in UKB-RAP? I couldn't find them anywhere in the Bulk folder of UKB-RAP.
It seems that UKB has not yet changed the permissions so that research applications can access this data, which we released in mid-December. I can see it only in my testing application, not in a research application. I will meet with them next week to find out what the issue is and fix it ASAP.
Once it's available, it will show up as two new folders in Bulk/Imputation: one for GEL and one for TOPMed.
Thank you so much and looking forward to using it.
@Akhil Pampana The data has been released now. You can refresh the project to get it.
It seems that UKB lifted the restriction on the data a while ago, but it somehow took longer than expected to take effect.
I made a short Python notebook that loops over the BGEN sample files for all chromosomes.
Thank you so much for the resource. I was able to access the files. It's really helpful.
Hello,
Happy to see the TOPMed release for UKB.
It will greatly improve our approved project.
However, after refreshing the dataset following the instructions, I still cannot access it.
I tried using gfetch 21007 with my approved key and got this error:
Error: Field=21007 is not permitted for download
Download failure
Can you please advise? I don't see any specifics in community discussions.
Thank you.
Per the MTA, the TOPMed and GEL data can only be analyzed on UKB-RAP, so they cannot be downloaded from Showcase.
Thank you. I modified it for TOPMed.
But now I'm wondering how to actually run it.
Can I run it from the dx tools?
Thank you!
Yes, in this example you can use a Jupyter notebook to process them.
See the tutorial on how to run a Jupyter notebook on UKB-RAP here: https://www.youtube.com/watch?v=YIPdhf3qbQA&list=PLRkZ0Fz-n3Z7Jg0Vz4vudLYnBza4EUGLM&index=21
Or you can copy just the code and run it in Python within the ttyd app.
Hello,
I have been trying to analyze the haplotypes of a specific number of individuals from the UK Biobank. Ultimately, I want to compute LD within a specific genomic region and visualize it using tools such as Haploview.
What I have done so far:
1) Downloaded the TOPMed imputed data for the individuals of interest using the DNAnexus Swiss Army Knife tool.
2) Adjusted the .sample according to Chai's comment above.
3) Filtered the genomic region of interest. Btw: I observed the same issue as reported in this link: lack of rsids in the .bgen file (https://community.dnanexus.com/s/question/0D5t000004SBxtyCAD/potential-issues-with-imputed-data)
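I now want to compute LD and visualize haplotype blocks among all SNPs in this region.
Is QCTOOL the best approach for this task? Based on its documentation, it's not clear to me how to calculate LD within a single genotype file. Should I use the same .bgen and .sample files in the command, for example:
qctool -g file.bgen -s file.sample -compute-ld-with file.bgen file.sample -old sqlite://results.sqlite:LD
It apparently works, but I want to make sure I'm getting the correct results (I haven't added any additional arguments so far; I just wanted to test the default options).
Other errors I have gotten while manipulating the .bgen file to use it as input to different tools (all with the purpose of generating input files for Haploview):
plink: Error: '--export haps' must be used with a fully phased dataset.
After conversion to .vcf, the genotypes appear as "0/0, 0/1, 1/1" instead of the expected output for phased genotypes (0|0, 0|1, etc.).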
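One quick way to check whether a converted VCF still carries phase information is to count the separators in the GT fields: phased calls use `|` and unphased calls use `/`. The sketch below does that for the first records of a file. The function name `count_gt_phasing` and the `max_records` parameter are hypothetical, chosen only for this illustration.

```python
import gzip


def count_gt_phasing(vcf_path, max_records=1000):
    """Count phased ('|') and unphased ('/') GT calls in the first records of a VCF.

    Returns a (phased, unphased) tuple. Hypothetical helper for illustration.
    """
    phased = unphased = seen = 0
    # Plain-text and gzip/bgzip-compressed VCFs are both handled.
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 10:
                continue  # no sample columns on this record
            fmt = fields[8].split(":")
            if "GT" not in fmt:
                continue
            gt_index = fmt.index("GT")
            for sample in fields[9:]:
                parts = sample.split(":")
                gt = parts[gt_index] if gt_index < len(parts) else ""
                if "|" in gt:
                    phased += 1
                elif "/" in gt:
                    unphased += 1
            seen += 1
            if seen >= max_records:
                break
    return phased, unphased
```

If every call comes back unphased, the phase was either absent in the source file or dropped during conversion. This check alone cannot tell which.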
I guess my questions are:
a) Given the errors above, is the TOPMed data (field 21007) phased as I am assuming? If so, is there any chance that I might have lost phase information while downloading the data for the individuals of interest?
b) For the purpose of this type of analysis, are the files found in Bulk>Imputation (22828, 21007, 21008) indeed the options one should be using? The definition of field 22438 is a bit unclear to me.
c) If the above qctool command is the appropriate way to go, could anyone be kind enough to help me figure out how to graphically visualize the results stored in "results.sqlite"? I'm not very familiar with manipulating SQLite files.
Thank you very much for any insights from this community.
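As a starting point for question c), the sketch below uses Python's built-in `sqlite3` module to list what is inside a SQLite file and to preview a few rows. It makes no assumption about which tables or columns QCTOOL writes, so it reads them from the file itself. The function names `describe_sqlite` and `preview_rows` are hypothetical.

```python
import sqlite3
from contextlib import closing


def _quote(name):
    """Quote an identifier so unusual table names are handled safely."""
    return '"' + name.replace('"', '""') + '"'


def describe_sqlite(db_path):
    """Return {table_or_view_name: [column names]} for a SQLite file."""
    layout = {}
    with closing(sqlite3.connect(db_path)) as conn:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type IN ('table', 'view') ORDER BY name")]
        for name in names:
            cols = conn.execute(f"PRAGMA table_info({_quote(name)})").fetchall()
            layout[name] = [col[1] for col in cols]  # col[1] is the column name
    return layout


def preview_rows(db_path, table, limit=5):
    """Return the first `limit` rows of one table or view as tuples."""
    with closing(sqlite3.connect(db_path)) as conn:
        query = f"SELECT * FROM {_quote(table)} LIMIT ?"
        return conn.execute(query, (limit,)).fetchall()
```

Calling `describe_sqlite("results.sqlite")` shows which tables and columns are available. Once you know which columns hold the variant pair and the LD statistic, the rows can be loaded into a data frame and drawn as a heatmap with the plotting library of your choice.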
Could you repost this as a new question? These are pretty hard questions, so I want to see if other members of the community can chime in.
I want to note, though, that phased data for the 200k WGS dataset will be coming out around July this year.
Thanks for replying, Chai. Will do. I'm glad to know there will be a phased WGS data release soon.
You are welcome. If you are interested in phased data, you may find these two talks useful.
https://www.youtube.com/watch?v=jF2GKfrWaz4&t=8s
https://www.youtube.com/watch?v=iNtg9PuYj4g&t=1s
Awesome. Thanks for sharing these talks. I've just watched them and they were super informative. Excited for the phased WGS release in the next few months.
I updated my GWAS repo for TOPMed imputed data using PLINK. I will work on adding the regenie version sometime in the near future.
https://github.com/pjgreer/ukb-rap-tools/tree/main/GWAS_pipeline/gwas_topmed_plink
I have a separate question: I see there is a paper on the GEL imputation methods in the Showcase, but there does not seem to be one for the TOPMed imputation. Has anyone seen this paper yet?
Would the notes and resources section of https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=21007 contain the information you are looking for?
Chai,
No, that is really the bare minimum information.
The original HRC imputation paper (https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=530) and the GEL PDF (https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=10510) are really what I am looking for. The TOPMed document just doesn't seem to exist yet.
Specifically, how many SNPs passed QC to be submitted to the imputation server? How large were the batches (HRC: 4,700 per batch; GEL: 26K per batch)? Did they try to submit batches by reported ancestry? Etc.
Thanks for this note, Phil. I will pass this request on to UKB.
Hello,
Any news on the TOPMed QC details?
Best,
Felix