Potential issues with imputed data

Hello,

I've been analysing the new imputed data (TOPMed and GEL) and have noticed some formatting issues and wanted to see if anybody else encountered them before I report them to UKBB:

- The .bgi files for the TOPMed imputation appear to be missing the rsid field (at least when queried via sqlite3)
- The .bgen files have hardcoded sample IDs that do not match the .sample file. This is not strictly a problem, but some tools (e.g. qctool) expect the sample and bgen file to have identical IDs and it is unclear how this may affect other tools
- Lack of .mfi files (or similar format) including summary imputation statistics

Has anybody experienced similar (or additional) issues?

Thanks!

Comments

7 comments

  • Comment author
    Chai Fungtammasan DNAnexus Team

    Thanks for reporting this. This is very helpful.

     

    1) I don't usually use this file, so I'm not sure. However, for GEL at least, I was able to see the rsid in the bgen file when using Python bgen_reader. Would you mind sharing more about how you use the .bgi file?

     

    2) I checked and saw EIDs in the bgen file (at least for GEL). Most likely, these are the EIDs from the research application of the researchers who created the file. I will send an inquiry to UKB to confirm that these are not the original EIDs, which would be a bigger concern. In the future, these should be removed so that EIDs appear only in the sample file.

     

    3) For TOPMed, the file in helper_files has the standard imputation statistics, which should be useful. GEL doesn't have this, though.

     

    There is also an issue with the header of the sample file that I reported here, but you probably know about it already: https://community.dnanexus.com/s/question/0D5t000004CaydsCAB/have-questions-about-the-gel-or-topmed-impute-data-release-ask-them-here

     

    My colleague saw an issue where two SNPs (one on chr2 and one on chr6) throw an error with the PLINK tool, but not with Hail, PLINK2 --freq or Python bgen_reader, so we think it might be a minor format incompatibility rather than an incorrect format.

     

    Feel free to report issues to UKB directly. I will send this communication thread to them too.
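For anyone who wants to quantify the sample ID mismatch from point 2, here is a minimal sketch. It assumes an Oxford-style .sample file with two header lines (column names, then column types) and the sample ID in the first column; given the header issue mentioned above, that assumption may need adjusting. The function names are hypothetical, and the bgen IDs are expected as a plain list, obtained with whichever reader you prefer.

```python
def parse_sample_ids(lines):
    """Return the first-column IDs from the lines of an Oxford-style .sample file.

    Assumes line 1 holds column names and line 2 holds column types.
    """
    rows = [line.split() for line in lines if line.strip()]
    return [fields[0] for fields in rows[2:]]


def compare_sample_ids(bgen_ids, sample_ids):
    """Summarise how the IDs embedded in a bgen file differ from the .sample file."""
    bgen_ids = [str(i) for i in bgen_ids]
    sample_ids = [str(i) for i in sample_ids]
    return {
        "same_order": bgen_ids == sample_ids,
        "only_in_bgen": sorted(set(bgen_ids) - set(sample_ids)),
        "only_in_sample": sorted(set(sample_ids) - set(bgen_ids)),
    }


# Made-up IDs for illustration only (not real EIDs).
example_sample_lines = [
    "ID_1 ID_2 missing sex",
    "0 0 0 D",
    "1001 1001 0 1",
    "1002 1002 0 2",
]
example_bgen_ids = ["9001", "9002"]
report = compare_sample_ids(example_bgen_ids, parse_sample_ids(example_sample_lines))
print(report)
```

If `same_order` is False but the two files contain the same number of samples, the rows presumably still correspond by position, but that is something to confirm with UKB rather than assume.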

     

  • Comment author
    Anastazie Sedlakova DNAnexus Team

    Yes, I have a problem when doing LD clumping with PLINK.

     

    ./plink ... --clump-p1 1 --clump-r2 0.1 --clump-kb 250 --clump significant_variants.txt --clump-snp-field Name --clump-field Pval

     

    By running this command with the small subset of significant variants I was able to identify problematic SNPs. Excluding those SNPs fixed the problem. However, when I extracted those SNPs, I did not get any error.

     

     

  • Comment author
    Former User of DNAx Community_51

    Thanks for the quick reply, Chai.

     

    1) You are correct that the rsid field is populated in the bgen file, but not in the .bgi file. The .bgi file is helpful if I want to query individual variants rapidly with sqlite3 or bgenix rather than using qctool / plink (possibly very slow). I use the following python3 code to access the bgi file (unsure if formatting works...):

     

    ```
    from pathlib import Path
    import sqlite3

    import pandas as pd

    # The .bgi index that accompanies a .bgen file is a SQLite database
    bgi_file = Path('ukb21007_c21_b0_v1.bgen.bgi')
    bgi_connection = sqlite3.connect(bgi_file)
    bgi_connection.row_factory = sqlite3.Row

    # Load the whole Variant table into a DataFrame
    bgi_table = pd.read_sql('SELECT * FROM Variant', con=bgi_connection)
    ```

     

    On inspecting this pandas DataFrame (bgi_table), you should see that there is a '.' for all rsid fields.

     

    2) Good to know it wasn't just me and thank you for replicating this.

     

    3) You are correct about the helper files and I should have been clearer about the differences between TOPMed and GEL. Since I want to use the GEL imputation, this does not help me. To be clear, the GEL 'helper file' is roughly a .tsv representation of the .bgi file with an additional 'alternate_ids' column. And yes, I was aware of the sample file header issue, but thank you for including it in this thread.

     

    Now that I know I didn't make these issues up, I will report to UKB as well.
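For anyone who wants to repeat the check from point 1 without loading the whole index into pandas, here is a minimal sketch. It assumes the .bgi index has a `Variant` table with `chromosome`, `position` and `rsid` columns, as in the query above; the function names are hypothetical and the demo rows are made up.

```python
import sqlite3


def count_missing_rsids(connection, placeholder="."):
    """Return (n_missing, n_total) for the rsid column of the Variant table."""
    n_total = connection.execute("SELECT COUNT(*) FROM Variant").fetchone()[0]
    n_missing = connection.execute(
        "SELECT COUNT(*) FROM Variant WHERE rsid = ?", (placeholder,)
    ).fetchone()[0]
    return n_missing, n_total


def variants_in_region(connection, chromosome, start, end):
    """Return (chromosome, position, rsid) rows with start <= position <= end."""
    query = (
        "SELECT chromosome, position, rsid FROM Variant "
        "WHERE chromosome = ? AND position BETWEEN ? AND ? "
        "ORDER BY position"
    )
    return connection.execute(query, (chromosome, start, end)).fetchall()


# Tiny in-memory stand-in for a real .bgi file (made-up rows, not UKB data).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Variant (chromosome TEXT, position INT, rsid TEXT)")
demo.executemany(
    "INSERT INTO Variant VALUES (?, ?, ?)",
    [("chr21", 100, "."), ("chr21", 200, "."), ("chr21", 300, "rs1")],
)
print(count_missing_rsids(demo))
print(variants_in_region(demo, "chr21", 150, 350))
```

With a real index, the connection would come from `sqlite3.connect('ukb21007_c21_b0_v1.bgen.bgi')` instead of the in-memory database.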

     

  • Comment author
    Former User of DNAx Community_51

    Hello Anastazie,

     

    Thanks for including this in the thread. It would be good to get to the root of the issue: if the problem shows up in just a small selection of SNPs from one analysis, then it is likely that many others in the file have the same problem, and we can't be sure what the effect(s) may be on other tools that rely on plink or plink-like processing.

  • Comment author
    Former User of DNAx Community_69

    Hi, I've been working on performing GWAS with TOPMed data, using Phil's guide: https://github.com/pjgreer/ukb-rap-tools/tree/main/GWAS_pipeline.

     

    1. For the issue: The .bgi files for the TOPMed imputation appear to be missing the rsid field (at least when queried via sqlite3)

     

    I included --set-missing-var-ids @:#:\$r:\$a when running plink2. It replaces the "." IDs in the bim files with chr:BP:Ref:ALT. This gives at least a trackable ID for downstream analysis.
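To illustrate what that kind of ID template does, here is a minimal Python sketch that fills in chr:BP:Ref:ALT only for variants whose ID is missing. It is not plink2's implementation, just an illustration of the same idea; the function name and the example variants are made up.

```python
def fill_missing_id(chrom, pos, ref, alt, variant_id, missing="."):
    """Return variant_id unchanged, or chrom:pos:ref:alt when the ID is missing."""
    if variant_id != missing:
        return variant_id
    return f"{chrom}:{pos}:{ref}:{alt}"


# Made-up variants: (chromosome, position, ref allele, alt allele, current ID).
variants = [
    ("21", 5030088, "C", "T", "."),
    ("21", 5030105, "G", "A", "rs12345"),
]
for chrom, pos, ref, alt, variant_id in variants:
    print(fill_missing_id(chrom, pos, ref, alt, variant_id))
```

Existing IDs are kept as they are, so only the "." entries change.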

     

    Happy to hear what others do for TOPMed.

    Thanks,

    Alyssa

  • Hi Chai,

     

    Regarding 3) for GEL - is there any indication if (or when) summary imputation statistics would become available?

     

    Thanks!

  • Comment author
    Chai Fungtammasan DNAnexus Team

    At the moment, there is no plan to make such information available. However, feel free to send your feedback to UKB as well, noting that such information would be very useful for your research.

