WGS BGEN files are ref-last.

07 August 2025 09:39
2 comments

Hi,

I am currently working with whole-genome sequencing (WGS) data and performing rare variant collapsing tests using REGENIE. Initially, I used the --ref-first parameter, assuming the BGEN files followed a reference-first allele order. However, I later discovered discrepancies in the results and recalculated allele frequencies using PLINK2. This revealed that the WGS DRAGEN BGEN files are actually in reference-last format, which differs from the reference-first convention used in other UK Biobank BGEN files.

This inconsistency resulted in a significant loss of time and resources. Furthermore, I could not find any documentation indicating this difference. If such information exists, I would appreciate being directed to it. Otherwise, I strongly recommend adding a clear note in the documentation to prevent others from encountering the same issue.

Best,

Ahmet Sayici

Comments

2 comments

Kristen

Edited 15 August 2025 04:18

Here are results to confirm what Ahmet reported, the DRAGEN .bgen files are ref-last.

Using the following test SNPs on chromosome 21, I compared different versions of the genotype data:

rsid            dragen_bgen_name        dragen_pgen_name
rs114813604     chr21:13457317:G:A      DRAGEN:chr21:13457317:G:A
rs148053639     chr21:13488923:G:A      DRAGEN:chr21:13488923:G:A
rs11911574      chr21:13835732:C:T      DRAGEN:chr21:13835732:C:T
rs112638103     chr21:13888449:C:T      DRAGEN:chr21:13888449:C:T
rs3115464       chr21:13919595:T:A      DRAGEN:chr21:13919595:T:A

In the imputed .bgen data, the “first” alleles for these SNPs are G,G,C,C,T. In the DRAGEN .bgen data, the “first” alleles for these SNPs are A,A,T,T,A:

imputed_bgen="/mnt/project/Bulk/Imputation/UKB imputation from genotype/ukb22828_c21_b0_v3"
bgenix -g "$imputed_bgen.bgen" -incl-rsids test-snps-rsids.txt -list > bgenix-list-imputed.txt
cat bgenix-list-imputed.txt | column -t
alternate_ids  rsid         chromosome  position    number_of_alleles  first_allele  alternative_alleles
rs114813604    rs114813604  21          14829638    2                  G             A
rs148053639    rs148053639  21          14861244    2                  G             A
rs11911574     rs11911574   21          15208053    2                  C             T
rs112638103    rs112638103  21          15260770    2                  C             T
rs3115464      rs3115464    21          15291916    2                  T             A

dragen_bgen="/mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, BGEN format [500k release]/ukb24309_c21_b0_v1"
bgenix -g "$dragen_bgen.bgen" -incl-rsids test-snps-dragen.txt -list > bgenix-list-dragen.txt
cat bgenix-list-dragen.txt | column -t
alternate_ids  rsid                chromosome  position    number_of_alleles  first_allele  alternative_alleles
.              chr21:13457317:G:A  21          13457317    2                  A             G
.              chr21:13488923:G:A  21          13488923    2                  A             G
.              chr21:13835732:C:T  21          13835732    2                  T             C
.              chr21:13888449:C:T  21          13888449    2                  T             C
.              chr21:13919595:T:A  21          13919595    2                  A             T

However, in Plink (and presumably in other tools), things like allele frequencies and GWAS effect estimates will still be correct whether the data is read as ref-first or ref-last:

# Directly reading/converting the .bgen data with Plink would take forever. Instead, I'm extracting
# with bgenix, then using Plink. According to the bgenix documentation at https://enkre.net/cgi-bin/code/bgen/doc/trunk/doc/wiki/bgenix.md
# "By default genotype data is output in the same format as in the input. This makes bgenix fast as it doesn't have to do any processing of the data."
# "bgenix simply outputs an appropriate BGEN header, and then copies bytes from the input file to the output."

dragen_bgen="/mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, BGEN format [500k release]/ukb24309_c21_b0_v1"
bgenix -g "$dragen_bgen.bgen" -incl-rsids test-snps-dragen.txt > testsnps-dragen.bgen

# Allele frequencies when .bgen is read as ref-first
plink2 --freq --bgen testsnps-dragen.bgen ref-first --sample "$dragen_bgen.sample" --out testsnps-dragen
cat testsnps-dragen.afreq | column -t
#CHROM  ID                  REF  ALT  ALT_FREQS  OBS_CT
21      chr21:13457317:G:A  A    G    0.943398   981044
21      chr21:13488923:G:A  A    G    0.944474   980916
21      chr21:13835732:C:T  T    C    0.93019    981074
21      chr21:13888449:C:T  T    C    0.943934   978932
21      chr21:13919595:T:A  A    T    0.80528    981058

# Allele frequencies when .bgen is read as ref-last
plink2 --freq --bgen testsnps-dragen.bgen ref-last --sample "$dragen_bgen.sample" --out testsnps-dragen-reflast
cat testsnps-dragen-reflast.afreq | column -t
#CHROM  ID                  REF  ALT  ALT_FREQS  OBS_CT
21      chr21:13457317:G:A  G    A    0.0566019  981044
21      chr21:13488923:G:A  G    A    0.0555257  980916
21      chr21:13835732:C:T  C    T    0.0698102  981074
21      chr21:13888449:C:T  C    T    0.0560662  978932
21      chr21:13919595:T:A  T    A    0.19472    981058

If I use the .pgen version of the DRAGEN data, output is consistent with what it would be when the .bgen version is read as ref-last:

dragen_plink="/mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release]/ukb24308_c21_b0_v1"
plink2 --freq --pfile "$dragen_plink" --no-pheno --extract test-snps-dragen-plink.txt --out dragen-plink
cat dragen-plink.afreq | column -t
#CHROM  ID                         REF  ALT  ALT_FREQS  OBS_CT
21      DRAGEN:chr21:13457317:G:A  G    A    0.0566019  981044
21      DRAGEN:chr21:13488923:G:A  G    A    0.0555257  980916
21      DRAGEN:chr21:13835732:C:T  C    T    0.0698102  981074
21      DRAGEN:chr21:13888449:C:T  C    T    0.0560662  978932
21      DRAGEN:chr21:13919595:T:A  T    A    0.19472    981058

I also checked the allele frequencies for these SNPs in the original .bed files for non-imputed genotyping array data. The allele frequencies are still consistent - in all versions of the data, the major alleles are G, G, C, C, T. So this does appear to be just a matter of the DRAGEN data being ref-last, and not something more sinister like the alleles themselves somehow being swapped.

array_genos="/mnt/project/Bulk/Genotype Results/Genotype calls/ukb22418_c21_b0_v2"
plink2 --freq --bfile "$array_genos" --extract test-snps-rsids.txt --out raw-array-genos
cat raw-array-genos.afreq | column -t
#CHROM  ID           REF  ALT  PROVISIONAL_REF?  ALT_FREQS  OBS_CT
21      rs114813604  A    G    Y                 0.944925   962158
21      rs148053639  A    G    Y                 0.945571   971858
21      rs11911574   T    C    Y                 0.93041    975204
21      rs112638103  T    C    Y                 0.940584   971728
21      rs3115464    A    T    Y                 0.804213   953136

Kristen
- Edited 15 August 2025 04:20
Caveat to the above: When I say that Plink or most analysis tools will still produce correct allele frequencies, effect estimates, etc. I mean tools that use one data source for the whole process. With something like Regenie (where I am guessing Ahmet used the imputed BGENs for step 1 and then the DRAGEN BGENs for step 2), the impact is potentially much worse.
So this change to ref-last is definitely something that needs to be clearly documented by UKB.
And Ahmet, thank you for posting about this and saving other people from the same headache.

2

Please sign in to leave a comment.