Have questions about the GEL or TOPMed Impute Data Release? Ask them here!

Chai Fungtammasan DNAnexus Team
The data should be released this week.
https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=21007 (see the Note tab for documentation of the TOPMed data)
https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=21008 (see the Resource tab for documentation of the GEL data)

Comments

22 comments

  • Comment author
    Chai Fungtammasan DNAnexus Team

    We noticed that the sample files for the BGEN data are not formatted correctly. We have notified UKB and the data provider, and this will be fixed in a future data release. However, if you want to analyze the GEL and TOPMed imputed datasets in the meantime, the format issue is easy and very cheap to fix. You just need to change the second row of each sample file from 0 0 0 0 to 0 0 0 D. You can do this in an interactive workstation (e.g. ttyd, Cloud Workstation, JupyterLab, RStudio, etc.), or write a script that does it and run that script in Swiss Army Knife.

     

    I manually changed the chr22 sample file in my testing application and was able to get the sample file to work with PLINK.
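    The edit described above can be sketched in a few lines of Python. This is only an illustration, not an official fix: the function name and the file name in the usage note are hypothetical, and it assumes you are working on a writable copy of the sample file.

```python
from pathlib import Path


def fix_sample_header(path):
    """Change the second row of a .sample file from '0 0 0 0' to '0 0 0 D'.

    Returns True if the file was rewritten, False if the second row
    was not '0 0 0 0' (in which case the file is left untouched).
    """
    path = Path(path)
    lines = path.read_text().splitlines()
    # Only touch files whose second row is exactly the four zeros.
    if len(lines) < 2 or lines[1].split() != ["0", "0", "0", "0"]:
        return False
    lines[1] = "0 0 0 D"
    path.write_text("\n".join(lines) + "\n")
    return True
```

    For example, fix_sample_header("chr22.sample") would rewrite a local copy named chr22.sample in place; the name is a placeholder.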

     

    However, if anyone runs into other problems with these two datasets (or if the solution above doesn't work), please share with the community.

  • Comment author
    Former User of DNAx Community_6

    May I know how to access the TOPMed imputation files in UKB-RAP? I wasn't able to find them anywhere in the Bulk folder of UKB-RAP.

  • Comment author
    Chai Fungtammasan DNAnexus Team

    It seems that UKB has not yet changed the permissions to give research applications access to this data, which we released in mid-December. I can see it only in my testing application, not in a research application. I will meet with them next week to find out what the issue is and fix it ASAP.

    Once it's available, it will show up as two new folders in Bulk/Imputation: one for GEL and one for TOPMed.

  • Comment author
    Former User of DNAx Community_6

    Thank you so much and looking forward to using it.

  • Comment author
    Chai Fungtammasan DNAnexus Team

    @Akhil Pampana The data has now been released. You can refresh the project to get it.

    It seems that UKB lifted the restriction on the data a while ago, but it took longer than expected to take effect.

  • Comment author
    Anastazie Sedlakova DNAnexus Team

    I made a short Python notebook that loops over the BGEN sample files for all chromosomes.
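    The notebook itself is not reproduced in this thread. As a rough sketch of what such a loop could look like (not the notebook's actual code), the function below writes fixed copies of every sample file in a folder to a separate output folder, so the source folder can stay read-only. The function name, folder arguments, and the "*.sample" pattern are assumptions.

```python
from pathlib import Path


def fix_all_sample_files(src_dir, out_dir, pattern="*.sample"):
    """Copy every matching sample file to out_dir, fixing the second row.

    Returns the names of the files whose second row was changed
    from '0 0 0 0' to '0 0 0 D'.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fixed = []
    for src in sorted(Path(src_dir).glob(pattern)):
        lines = src.read_text().splitlines()
        if len(lines) > 1 and lines[1].split() == ["0", "0", "0", "0"]:
            lines[1] = "0 0 0 D"
            fixed.append(src.name)
        # Files that are already correct are copied through unchanged.
        (out_dir / src.name).write_text("\n".join(lines) + "\n")
    return fixed
```

    The returned list makes it easy to confirm that one file per chromosome was actually rewritten.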

  • Comment author
    Former User of DNAx Community_6

    Thank you so much for the resource. I was able to access the files. It's really helpful.

  • Comment author
    Former User of DNAx Community_69

    Hello,

    Happy to see the TOPMed release for UKBB.

    It will greatly improve our approved project.

     

    However, after refreshing the dataset following instructions, I still cannot access it.

    I tried using gfetch 21007 with my approved key and got this error:

     

    Error: Field=21007 is not permitted for download

    Download failure

     

    Can you please advise? I don't see any specifics in community discussions.

    Thank you.

  • Comment author
    Chai Fungtammasan DNAnexus Team

    Per the MTA, the TOPMed and GEL data can be analyzed only on UKB-RAP, so they cannot be downloaded from Showcase.

  • Comment author
    Former User of DNAx Community_69

    Thank you. I modified it for TOPMed.

    But now I am wondering how to actually run it.

    Can I run it from the dx tools?

     

     

  • Comment author
    Former User of DNAx Community_69

    Thank you!

  • Comment author
    Chai Fungtammasan DNAnexus Team

    Yes, in this example, you can use a Jupyter notebook to process them.

    See the tutorial on how to run a Jupyter notebook on UKB-RAP here: https://www.youtube.com/watch?v=YIPdhf3qbQA&list=PLRkZ0Fz-n3Z7Jg0Vz4vudLYnBza4EUGLM&index=21

     

    Or you can copy just the code and run it in Python within the ttyd app.
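    For the dx toolkit route asked about above, one possible sketch is to submit the sample-file fix as a Swiss Army Knife job. The helper below only builds the command line; the project paths in the usage note and the "fixed_" output prefix are placeholders, and the sed expression replaces the second row unconditionally, so it assumes the input still has the original 0 0 0 0 row.

```python
import posixpath
import shlex


def build_sak_command(sample_path, destination):
    """Build a `dx run` command that fixes one sample file in Swiss Army Knife."""
    name = posixpath.basename(sample_path)
    out_name = "fixed_" + name  # hypothetical output name
    # sed rewrites line 2 only and writes the result to a new file,
    # which Swiss Army Knife uploads as a job output.
    fix_cmd = f"sed '2s/.*/0 0 0 D/' {shlex.quote(name)} > {shlex.quote(out_name)}"
    return [
        "dx", "run", "app-swiss-army-knife",
        f"-iin={sample_path}",
        f"-icmd={fix_cmd}",
        "--destination", destination,
        "-y",
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute the real locations in your project.
    cmd = build_sak_command("/path/to/chr22.sample", "/path/to/output/")
    print(shlex.join(cmd))
```

    Printing the command first, rather than running it, lets you check the input path and destination before launching a job.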

  • Hello,

     

    I have been trying to analyze the haplotypes of a specific number of individuals from the UK Biobank. Ultimately, I want to compute LD within a specific genomic region and visualize it using tools such as Haploview.

     

    What I have done so far:

     

    1) Downloaded the TOPMed imputed data for the individuals of interest using the DNAnexus Swiss Army Knife tool.

     

    2) Adjusted the .sample according to Chai's comment above.

     

    3) Filtered the genomic region of interest. Btw: I observed the same issue as reported in this link: lack of rsids in the .bgen file (https://community.dnanexus.com/s/question/0D5t000004SBxtyCAD/potential-issues-with-imputed-data)

     

    I now want to compute LD and visualize haplotype blocks among all SNPs in this region.

     

    • Is QCtool the best approach for this task? Based on its documentation, it's not clear to me how to calculate LD within a single genotype file. Should I use the same .bgen and .sample files in the code, for example:

     

    qctool -g file.bgen -s file.sample -compute-ld-with file.bgen file.sample -old sqlite://results.sqlite:LD

     

    It apparently works but I wanted to make sure I'm getting the correct results (I haven't added any additional arguments so far, just wanted to test the default options).

     

    I have also gotten other errors while manipulating the .bgen file to use it as input for different tools (all with the purpose of generating input files for Haploview).

     

     

    I guess my questions are:

     

    a) Given the errors above, is the TOPMed data (field 21007) phased, as I am assuming? If so, is there any chance that I lost the phase information while downloading the data for the individuals of interest?

     

    b) For this type of analysis, are the files found in Bulk > Imputation (22828, 21007, 21008) indeed the ones I should be using? The definition of field 22438 is a bit unclear to me.

     

    c) If the above qctool command is the appropriate way to go, could anyone be kind enough to help me figure out how to graphically visualize the results stored in "results.sqlite"? I'm not very familiar with manipulating SQLite files.
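    As a first step for question c), the SQLite file can be inspected with Python's built-in sqlite3 module before attempting any plot. The sketch below makes no assumption about the layout qctool writes; it only lists the tables and views in the file along with their column names. The function name is hypothetical.

```python
import sqlite3


def describe_sqlite(db_path):
    """Return {table_or_view_name: [column names]} for a SQLite file."""
    con = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type IN ('table', 'view') ORDER BY name"
            )
        ]
        schema = {}
        for name in names:
            # PRAGMA table_info returns one row per column; index 1 is the name.
            cols = con.execute(f'PRAGMA table_info("{name}")').fetchall()
            schema[name] = [col[1] for col in cols]
        return schema
    finally:
        con.close()
```

    Calling describe_sqlite("results.sqlite") shows which table or view holds the pairwise values and what the columns are called, which is what is needed to load them into a plotting tool.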

     

    Thank you very much for any insights of this community.

     

     

  • Comment author
    Chai Fungtammasan DNAnexus Team

    Could you repost this as a new question? These are pretty hard questions, so I want to see whether other members of the community can chime in.

    I do want to note, though, that phased WGS data for the 200k WGS release will be coming out around July this year.

  • Thanks for replying, Chai. Will do. I'm glad to know there will be a phased WGS data release soon.

  • Comment author
    Chai Fungtammasan DNAnexus Team

    You are welcome. If you are interested in phasing data, you may find these two talks useful.

    https://www.youtube.com/watch?v=jF2GKfrWaz4&t=8s

    https://www.youtube.com/watch?v=iNtg9PuYj4g&t=1s

  • Awesome. Thanks for sharing these talks. I've just watched them and they were super informative. Excited for the phased WGS release in the next few months.

  • I updated my GWAS repo for TOPMed imputed data using PLINK. I will work on adding the regenie version sometime in the near future.

     

    https://github.com/pjgreer/ukb-rap-tools/tree/main/GWAS_pipeline/gwas_topmed_plink

     

    I have a separate question: I see there is a paper on the GEL imputation methods in the Showcase, but there does not seem to be one for the TOPMed imputation. Has anyone seen this paper yet?

     

  • Comment author
    Chai Fungtammasan DNAnexus Team

    Would the Note and Resource sections of https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=21007 contain the information you are looking for?

  • Chai,

     

    No, that is really the bare minimum information.

     

    The original HRC imputation paper (https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=530) and the GEL PDF (https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=10510) are really what I am looking for. The TOPMed document just doesn't seem to exist yet.

     

    Specifically, how many SNPs passed QC to be submitted to the imputation server? How large were the batches (HRC: 4700 per batch; GEL: 26K per batch)? Did they try to submit batches by reported ancestry? Etc.

  • Comment author
    Chai Fungtammasan DNAnexus Team

    Thanks for this note, Phil. I will pass this request on to UKB.

  • Comment author
    Felix Vaura

    Hello,

    Any news on the TOPMed QC details?

    Best,
    Felix

