Potential issues with imputed data

20 February 2023 00:00
7 comments

Hello, I've been analysing the new imputed data (TOPMed and GEL) and have noticed some formatting issues. I wanted to see if anybody else has encountered them before I report them to UKBB:

- The .bgi files for the TOPMed imputation appear to be missing the rsid field (at least when queried via sqlite3).
- The .bgen files have hardcoded sample IDs that do not match the .sample file. This is not strictly a problem, but some tools (e.g. qctool) expect the sample and bgen files to have identical IDs, and it is unclear how this may affect other tools.
- There are no .mfi files (or a similar format) with summary imputation statistics.

Has anybody experienced similar (or additional) issues? Thanks!

Comments

Chai Fungtammasan DNAnexus Team
- 21 February 2023 00:40
Thanks for reporting this. This is very helpful.

1) I don't usually use this file, so I'm not sure. However, for GEL at least, I was able to see the rsid in the bgen file when using Python bgen_reader. Would you mind sharing more about when you use the .bgi file?

2) I checked and saw EIDs in the bgen file (at least for GEL). Most likely, these are the EIDs from the research application of the researchers who created the file. I will send an inquiry to UKB to check that these are not the original EIDs, which would be a bigger concern. In the future, these should be removed so that EIDs appear only in the sample file.

3) For TOPMed, the file in helper_files has the standard imputation statistics, which should be useful. GEL doesn't have this, though.

There is also an issue with the header of the sample file that I reported here, but you probably knew about it already: https://community.dnanexus.com/s/question/0D5t000004CaydsCAB/have-questions-about-the-gel-or-topmed-impute-data-release-ask-them-here

My colleague saw an issue where two SNPs (one on chr2 and one on chr6) throw an error with PLINK, but not with Hail, PLINK2 --freq or Python bgen_reader, so we think it might be a minor format incompatibility rather than an incorrect format.

Feel free to report issues to UKB directly. I will send this communication thread to them too.

Anastazie Sedlakova DNAnexus Team
- 21 February 2023 09:02
Yes, I have a problem when doing LD clumping with PLINK.

./plink ... --clump-p1 1 --clump-r2 0.1 --clump-kb 250 --clump significant_variants.txt --clump-snp-field Name --clump-field Pval

By running this command on a small subset of the significant variants, I was able to identify the problematic SNPs. Excluding those SNPs fixed the problem. However, when I extracted just those SNPs, I did not get any error.

Former User of DNAx Community_51
- 21 February 2023 11:59
Thanks for the quick reply, Chai.

1) You are correct that the rsid field is populated in the bgen file, but not in the .bgi file. The .bgi file is helpful if I want to query individual variants rapidly with sqlite3 or bgenix rather than using qctool / plink (possibly very slow). I use the following python3 code to access the bgi file (unsure if formatting works...):

```
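from pathlib import Path
import sqlite3

import pandas as pd

# A .bgi index is an SQLite database; the Variant table has one row per variant.
bgi_file = Path('ukb21007_c21_b0_v1.bgen.bgi')
bgi_connection = sqlite3.connect(bgi_file)
bgi_connection.row_factory = sqlite3.Row
# Load the whole index into a DataFrame for inspection.
bgi_table = pd.read_sql('SELECT * FROM Variant', con=bgi_connection)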
```

On inspecting this pandas DataFrame (bgi_table), you should see that there is a '.' for all rsid fields.
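
To put a number on this rather than eyeballing the DataFrame, something like the sketch below counts the placeholder rsids directly in SQL. It assumes the index has a `Variant` table with an `rsid` column, as in the query above. `count_missing_rsids` is a hypothetical helper name, and the in-memory table with made-up rows is only a stand-in for a real .bgi file.

```python
import sqlite3


def count_missing_rsids(connection):
    """Return (missing, total) rsid counts for the Variant table of a .bgi index."""
    # In SQLite a comparison evaluates to 0 or 1, so SUM counts the '.' rows.
    # COALESCE covers an empty table, where SUM would otherwise return NULL.
    missing, total = connection.execute(
        "SELECT COALESCE(SUM(rsid = '.'), 0), COUNT(*) FROM Variant"
    ).fetchone()
    return missing, total


# Stand-in for a real index: a tiny in-memory table with made-up rows.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE Variant (chromosome TEXT, position INTEGER, rsid TEXT, "
    "allele1 TEXT, allele2 TEXT)"
)
demo.executemany(
    "INSERT INTO Variant VALUES (?, ?, ?, ?, ?)",
    [
        ("21", 100, ".", "A", "G"),
        ("21", 200, "rs123", "C", "T"),
        ("21", 300, ".", "G", "A"),
    ],
)
print(count_missing_rsids(demo))
```

For a real file, pass `sqlite3.connect('ukb21007_c21_b0_v1.bgen.bgi')` instead of `demo`. If every rsid is missing, the two counts will be equal.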

2) Good to know it wasn't just me and thank you for replicating this.

3) You are correct about the helper files, and I should have been clearer about the differences between TOPMed and GEL. Since I want to use the GEL imputation, this does not help me. To be clear, the GEL 'helper file' is roughly a .tsv representation of the .bgi file with an additional 'alternate_ids' column. And yes, I was aware of the header file issue, but thank you for including it in this thread.

Now that I know I didn't make these issues up, I will report to UKB as well.

Former User of DNAx Community_51
- 21 February 2023 12:01
Hello Anastazie,

Thanks for including this in the thread. It would be good to get to the root of the issue: if the problem shows up in just a small selection of SNPs from a single analysis, then it is likely that many others in the file have the same problem, and we can't be sure what the effect(s) may be on other tools that rely on plink or plink-like processing.

Former User of DNAx Community_69
- 21 February 2023 19:18
Hi, I've been working on performing GWAS with TOPMed data, using Phil's guide: https://github.com/pjgreer/ukb-rap-tools/tree/main/GWAS_pipeline.
1. For the issue: The .bgi files for the TOPMed imputation appear to be missing the rsid field (at least when queried via sqlite3)
I included --set-missing-var-ids @:#:\$r:\$a when running plink2. It replaces the "." in the bim files with chr:BP:Ref:ALT. This gives at least a trackable ID for downstream analysis.
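
For anyone working from the .bgi table in Python instead, the same chr:BP:Ref:ALT naming scheme can be sketched roughly as below. This is only an illustration of the scheme, not plink2's own logic. `fill_missing_id` is a hypothetical helper, the rows are made up, and which allele column holds Ref and which holds ALT is something to check against your own files.

```python
def fill_missing_id(variant_id, chrom, pos, ref, alt, missing="."):
    """Keep a real ID as-is; replace the placeholder with chr:BP:Ref:ALT."""
    if variant_id != missing:
        return variant_id
    return f"{chrom}:{pos}:{ref}:{alt}"


# Made-up rows: (id, chromosome, position, ref allele, alt allele).
rows = [
    (".", "21", 100, "A", "G"),
    ("rs123", "21", 200, "C", "T"),
]
for row in rows:
    print(fill_missing_id(*row))
```

Only rows whose ID is "." are renamed, so any variants that already carry an rsid keep it.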

Happy to hear what others do for TOPMed.
Thanks,
Alyssa
Former User of DNAx Community_76
- 16 May 2023 20:56
Hi Chai,

Regarding 3) for GEL - is there any indication if (or when) summary imputation statistics would become available?

Thanks!

Chai Fungtammasan DNAnexus Team
- 16 May 2023 21:24
At this moment, there is no plan to make such information available. However, feel free to send your feedback to UKB as well, explaining that such information would be very useful for your research.
