Haplotype/LD analysis using TOPMed impute data
Hello,
I have been trying to analyze the haplotypes of a a specific number of individuals from the UK Biobank. Ultimately, I want to compute LD within a specific genomic region and visualize it using tools such as Haploview.
What I have done so far:
1) Downloaded the imputed data from TopMed for the individuals of interest using DNAnexus SwissArmyKnife tool.
2) Adjusted the .sample according to Chai's comment. https://community.dnanexus.com/s/question/0D5t000004CaydsCAB/have-questions-about-the-gel-or-topmed-impute-data-release-ask-them-here
3) Filtered the genomic region of interest. Btw: I observed the same issue as reported in this link: lack of rsids in the .bgen file (https://community.dnanexus.com/s/question/0D5t000004SBxtyCAD/potential-issues-with-imputed-data)
I now want to compute LD and visualize haplotype blocks among all SNPs in this region.
- Is QCtool the best approach for this task? Based on its documentation, it's not clear to me how to calculate LD within a single genotype file. Should I use the same .bgen and .sample files in the code, for example:
qctool -g file.bgen -s file.sample -compute-ld-with file.bgen file.sample -old sqlite://results.sqlite:LD
It apparently works but I wanted to make sure I'm getting the correct results (I haven't added any additional arguments so far, just wanted to test the default options).
Other errors I have gotten while manipulating the .bgen file in order to use as input in different tools (all with the purpose of generating input files for Haploview).
- plink: Error: '--export haps' must be used with a fully phased dataset.
- plink: error as reported in: https://www.biostars.org/p/9484011/#9484013
- Conversion to hap/sample/legend file using bcftools: same error as reported in https://github.com/samtools/bcftools/issues/730
- After conversion to .vcf, the genotypes appear as "0/0, 0/1,1/1" instead of the expected output for phased genotypes (0|0, 0|1, etc...).
I guess my questions are:
a) Given the errors above, is the TOPMED data (field 21007) phased as I am assuming? If so, any chance that I might've lost phase information while downloading the data for the individuals of interest?
b) For the purpose of this type of analysis, are the files found in Bulk>Imputation (22828, 21007, 21008) the options one indeed should be using? It's a bit unclear to me the definition of field 22438.
c) If the above qctool command is the appropriate way to go, could anyone be kind enough to help me figure out how to graphically visualize the results stored in "results.sqlite" as I'm not very familiar with .sql files manipulation?
Thank you very much for any insights of this community.
Comments
4 comments
I can answer only subset of questions, so I'm looking forward to get help from other experts. My colleague who is good at this is out of town currently.
First, make sure you use EID from .sample rather than what is contained in BGEN. I'm sure you already did. I just want to make sure. See https://community.dnanexus.com/s/question/0D5t000004SBxtyCAD/potential-issues-with-imputed-data
I would recommend against using the 22438. Both necessary metadata and many variant blocks are missing. I looked into this data a while ago. For impute data, the three fields that you list are a good option.
Per my understanding about LD block calculation, there are several tools that could do this including PLINK (would be PLINK2 in this case since it's bgen) which is pretty standard. The main issue I heard about QCtools isn't about the quality of results, but it's very slow because it's running on single thread. I'm not expert enough to tell which is better, so you could also try PLINK or wait for others' opinions.
You may see option on how to use bgenix or sql to extract variants of interest here.
https://enkre.net/cgi-bin/code/bgen/doc/trunk/doc/wiki/bgenix.md
The bgenix is installed in Swiss-army-knife, so you can try if that is useful.
Thanks a lot for your prompt reply, Chai.
You may try out if this library can help check if the bgen is phased or not.
https://bgen-reader.readthedocs.io/en/latest/quickstart.html
https://pypi.org/project/bgen/
I usually use bgen_reader to inspect bgen for my own work.
Excellent!
I confirmed that the bgen file I have is not phased. That explains the errors I got.
Thanks again for the recommendations.
Please sign in to leave a comment.