Haplotype/LD analysis using TOPMed impute data

Hello,

 

I have been trying to analyze the haplotypes of a a specific number of individuals from the UK Biobank. Ultimately, I want to compute LD within a specific genomic region and visualize it using tools such as Haploview.

 

What I have done so far:

 

1) Downloaded the imputed data from TopMed for the individuals of interest using DNAnexus SwissArmyKnife tool.

 

2) Adjusted the .sample according to Chai's comment. https://community.dnanexus.com/s/question/0D5t000004CaydsCAB/have-questions-about-the-gel-or-topmed-impute-data-release-ask-them-here

 

3) Filtered the genomic region of interest. Btw: I observed the same issue as reported in this link: lack of rsids in the .bgen file (https://community.dnanexus.com/s/question/0D5t000004SBxtyCAD/potential-issues-with-imputed-data)

 

I now want to compute LD and visualize haplotype blocks among all SNPs in this region.

 

  • Is QCtool the best approach for this task? Based on its documentation, it's not clear to me how to calculate LD within a single genotype file. Should I use the same .bgen and .sample files in the code, for example:

 

qctool -g file.bgen -s file.sample -compute-ld-with file.bgen file.sample -old sqlite://results.sqlite:LD

 

It apparently works but I wanted to make sure I'm getting the correct results (I haven't added any additional arguments so far, just wanted to test the default options).

 

Other errors I have gotten while manipulating the .bgen file in order to use as input in different tools (all with the purpose of generating input files for Haploview).

 

 

I guess my questions are:

 

a) Given the errors above, is the TOPMED data (field 21007) phased as I am assuming? If so, any chance that I might've lost phase information while downloading the data for the individuals of interest?

 

b) For the purpose of this type of analysis, are the files found in Bulk>Imputation (228282100721008) the options one indeed should be using? It's a bit unclear to me the definition of field 22438.

 

c) If the above qctool command is the appropriate way to go, could anyone be kind enough to help me figure out how to graphically visualize the results stored in "results.sqlite" as I'm not very familiar with .sql files manipulation?

 

Thank you very much for any insights of this community.

Comments

4 comments

  • Comment author
    Chai Fungtammasan DNAnexus Team

    I can answer only subset of questions, so I'm looking forward to get help from other experts. My colleague who is good at this is out of town currently. 

     

    First, make sure you use EID from .sample rather than what is contained in BGEN. I'm sure you already did. I just want to make sure. See https://community.dnanexus.com/s/question/0D5t000004SBxtyCAD/potential-issues-with-imputed-data

     

    I would recommend against using the 22438. Both necessary metadata and many variant blocks are missing. I looked into this data a while ago. For impute data, the three fields that you list are a good option.

     

    Per my understanding about LD block calculation, there are several tools that could do this including PLINK (would be PLINK2 in this case since it's bgen) which is pretty standard. The main issue I heard about QCtools isn't about the quality of results, but it's very slow because it's running on single thread. I'm not expert enough to tell which is better, so you could also try PLINK or wait for others' opinions. 

     

    You may see option on how to use bgenix or sql to extract variants of interest here. 

    https://enkre.net/cgi-bin/code/bgen/doc/trunk/doc/wiki/bgenix.md

    The bgenix is installed in Swiss-army-knife, so you can try if that is useful.

     

     

    0
  • Thanks a lot for your prompt reply, Chai.

     

    • Yes, I using the EID from .sample as recommended. Thanks for bringing that up.
    • I agree with you that 22438 is incomplete.
    • You are right about the use of these tools for LD block calculation. I think the issue I?m having is most probably due to unphased data. What I would like to have clarifications on is if the TOPmed/GEL datasets included under Bulk > Imputation are expected to be phased or not. If they are, that means that I might be losing the phase data while selecting my samples of interest through qctool. However, if unphased data is the expected output, then I know that I need to conduct the phasing step myself. I think that my confusion comes from the TOPmed note (21007) where it is described that the UKB data was phased before imputation, but I guess it does not necessarily mean that the data available is indeed phased.
    • I?ll take a look at bgnix. Thanks for the suggestion.
    0
  • Comment author
    Chai Fungtammasan DNAnexus Team

    You may try out if this library can help check if the bgen is phased or not.

    https://bgen-reader.readthedocs.io/en/latest/quickstart.html

    https://pypi.org/project/bgen/

    I usually use bgen_reader to inspect bgen for my own work.

    0
  • Excellent!

    I confirmed that the bgen file I have is not phased. That explains the errors I got.

    Thanks again for the recommendations.

     

    0

Please sign in to leave a comment.