WGS BGEN files are ref-last.
Hi,
I am currently working with whole-genome sequencing (WGS) data and performing rare variant collapsing tests using REGENIE. Initially, I used the --ref-first parameter, assuming the BGEN files followed a reference-first allele order. However, I later discovered discrepancies in the results and recalculated allele frequencies using PLINK2. This revealed that the WGS DRAGEN BGEN files are actually in reference-last format, which differs from the reference-first convention used in other UK Biobank BGEN files.
This inconsistency resulted in a significant loss of time and resources. Furthermore, I could not find any documentation indicating this difference. If such information exists, I would appreciate being directed to it. Otherwise, I strongly recommend adding a clear note in the documentation to prevent others from encountering the same issue.
Best,
Ahmet Sayici
Comments
Here are results that confirm what Ahmet reported: the DRAGEN .bgen files are ref-last.
Using the following test SNPs on chromosome 21, I compared different versions of the genotype data:
In the imputed .bgen data, the “first” alleles for these SNPs are G,G,C,C,T. In the DRAGEN .bgen data, the “first” alleles for these SNPs are A,A,T,T,A:
However, in PLINK (and presumably in other tools), results such as allele frequencies and GWAS effect estimates will still be correct whether the data is read as ref-first or ref-last:
If I use the .pgen version of the DRAGEN data, output is consistent with what it would be when the .bgen version is read as ref-last:
I also checked the allele frequencies for these SNPs in the original .bed files for the non-imputed genotyping array data. The allele frequencies are still consistent: in all versions of the data, the major alleles are G, G, C, C, T. So this does appear to be just a matter of the DRAGEN data being ref-last, and not something more sinister like the alleles themselves somehow being swapped.
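For anyone who wants to run the same kind of check programmatically, here is a minimal Python sketch of the comparison described above. The function name `compare_allele_order` is hypothetical. The first alleles are the ones listed above; the second allele in each pair is an assumption on my part (biallelic SNPs, with each source's first allele being the other source's last allele), and the SNP IDs are left out. In practice you would read both alleles from the variant records of each file.

```python
from typing import List, Tuple

# Each variant is represented as (first_allele, last_allele) as stored in the file.
AllelePair = Tuple[str, str]


def compare_allele_order(a: AllelePair, b: AllelePair) -> str:
    """Classify how allele pair b relates to allele pair a for one biallelic variant."""
    if a == b:
        return "same_order"
    if a == (b[1], b[0]):
        # Same two alleles, opposite storage order (ref-first vs ref-last).
        return "swapped_order"
    # The two sources do not even agree on which alleles exist.
    return "allele_mismatch"


# First alleles are taken from the post; second alleles are ASSUMED for illustration.
imputed: List[AllelePair] = [("G", "A"), ("G", "A"), ("C", "T"), ("C", "T"), ("T", "A")]
dragen: List[AllelePair] = [("A", "G"), ("A", "G"), ("T", "C"), ("T", "C"), ("A", "T")]

results = [compare_allele_order(i, d) for i, d in zip(imputed, dragen)]
print(results)
```

A result of "swapped_order" for every variant is what a pure ref-first versus ref-last difference looks like. Any "allele_mismatch" would point to a more serious problem than storage order.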
Caveat to the above: when I say that PLINK or most analysis tools will still produce correct allele frequencies, effect estimates, etc., I mean tools that use one data source for the whole process. With something like REGENIE (where I am guessing Ahmet used the imputed BGENs for step 1 and then the DRAGEN BGENs for step 2), the impact is potentially much worse.
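If results from differently ordered sources have to be combined, one common safeguard is to align every estimate to an explicitly named allele rather than trusting file order. The sketch below is only an illustration of that idea, not anything REGENIE or PLINK does internally. The `Assoc` class, the `align_to` function, and the numbers in the usage lines are all hypothetical and are not real results.

```python
from dataclasses import dataclass


@dataclass
class Assoc:
    """One association result for a biallelic variant (hypothetical container)."""
    effect_allele: str
    other_allele: str
    beta: float  # effect estimate per copy of effect_allele
    eaf: float   # frequency of effect_allele


def align_to(assoc: Assoc, target_effect_allele: str) -> Assoc:
    """Return the result re-expressed so that target_effect_allele is the effect allele."""
    if assoc.effect_allele == target_effect_allele:
        return assoc
    if assoc.other_allele == target_effect_allele:
        # Switching the counted allele negates the effect and complements the frequency.
        return Assoc(
            effect_allele=assoc.other_allele,
            other_allele=assoc.effect_allele,
            beta=-assoc.beta,
            eaf=1.0 - assoc.eaf,
        )
    raise ValueError(
        f"allele {target_effect_allele!r} not present in "
        f"({assoc.effect_allele!r}, {assoc.other_allele!r})"
    )


# Made-up numbers, purely to show the mechanics.
original = Assoc(effect_allele="A", other_allele="G", beta=0.5, eaf=0.25)
flipped = align_to(original, "G")
print(flipped)
```

The point of the design is that the allele being counted travels with the estimate, so a ref-first versus ref-last difference between files cannot silently change the sign of an effect.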
So this change to ref-last is definitely something that needs to be clearly documented by UKB.
And Ahmet, thank you for posting about this and saving other people from the same headache.