WGS BGEN files are ref-last.

Ahmet Sayici

Hi,

I am currently working with whole-genome sequencing (WGS) data and performing rare variant collapsing tests using REGENIE. Initially, I used the --ref-first parameter, assuming the BGEN files followed a reference-first allele order. However, I later discovered discrepancies in the results and recalculated allele frequencies using PLINK2. This revealed that the WGS DRAGEN BGEN files are actually in reference-last format, which differs from the reference-first convention used in other UK Biobank BGEN files.

This inconsistency resulted in a significant loss of time and resources. Furthermore, I could not find any documentation indicating this difference. If such information exists, I would appreciate being directed to it. Otherwise, I strongly recommend adding a clear note in the documentation to prevent others from encountering the same issue. 

Best,

Ahmet Sayici

Comments

2 comments

  • Comment author
    Kristen
    • Edited

    Here are results to confirm what Ahmet reported, the DRAGEN .bgen files are ref-last.

    Using the following test SNPs on chromosome 21, I compared different versions of the genotype data:

    rsid            dragen_bgen_name        dragen_pgen_name
    rs114813604     chr21:13457317:G:A      DRAGEN:chr21:13457317:G:A
    rs148053639     chr21:13488923:G:A      DRAGEN:chr21:13488923:G:A
    rs11911574      chr21:13835732:C:T      DRAGEN:chr21:13835732:C:T
    rs112638103     chr21:13888449:C:T      DRAGEN:chr21:13888449:C:T
    rs3115464       chr21:13919595:T:A      DRAGEN:chr21:13919595:T:A

     

    In the imputed .bgen data, the “first” alleles for these SNPs are G,G,C,C,T. In the DRAGEN .bgen data, the “first” alleles for these SNPs are A,A,T,T,A:

    imputed_bgen="/mnt/project/Bulk/Imputation/UKB imputation from genotype/ukb22828_c21_b0_v3"
    bgenix -g "$imputed_bgen.bgen" -incl-rsids test-snps-rsids.txt -list > bgenix-list-imputed.txt
    cat bgenix-list-imputed.txt | column -t
    alternate_ids  rsid         chromosome  position    number_of_alleles  first_allele  alternative_alleles
    rs114813604    rs114813604  21          14829638    2                  G             A
    rs148053639    rs148053639  21          14861244    2                  G             A
    rs11911574     rs11911574   21          15208053    2                  C             T
    rs112638103    rs112638103  21          15260770    2                  C             T
    rs3115464      rs3115464    21          15291916    2                  T             A
    
    dragen_bgen="/mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, BGEN format [500k release]/ukb24309_c21_b0_v1"
    bgenix -g "$dragen_bgen.bgen" -incl-rsids test-snps-dragen.txt -list > bgenix-list-dragen.txt
    cat bgenix-list-dragen.txt | column -t
    alternate_ids  rsid                chromosome  position    number_of_alleles  first_allele  alternative_alleles
    .              chr21:13457317:G:A  21          13457317    2                  A             G
    .              chr21:13488923:G:A  21          13488923    2                  A             G
    .              chr21:13835732:C:T  21          13835732    2                  T             C
    .              chr21:13888449:C:T  21          13888449    2                  T             C
    .              chr21:13919595:T:A  21          13919595    2                  A             T

     

    However, in Plink (and presumably in other tools), things like allele frequencies and GWAS effect estimates will still be correct whether the data is read as ref-first or ref-last:

    # Directly reading/converting the .bgen data with Plink would take forever. Instead, I'm extracting
    # with bgenix, then using Plink. According to the bgenix documentation at https://enkre.net/cgi-bin/code/bgen/doc/trunk/doc/wiki/bgenix.md
    # "By default genotype data is output in the same format as in the input. This makes bgenix fast as it doesn't have to do any processing of the data."
    # "bgenix simply outputs an appropriate BGEN header, and then copies bytes from the input file to the output."
    
    dragen_bgen="/mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, BGEN format [500k release]/ukb24309_c21_b0_v1"
    bgenix -g "$dragen_bgen.bgen" -incl-rsids test-snps-dragen.txt > testsnps-dragen.bgen
    
    # Allele frequencies when .bgen is read as ref-first
    plink2 --freq --bgen testsnps-dragen.bgen ref-first --sample "$dragen_bgen.sample" --out testsnps-dragen
    cat testsnps-dragen.afreq | column -t
    #CHROM  ID                  REF  ALT  ALT_FREQS  OBS_CT
    21      chr21:13457317:G:A  A    G    0.943398   981044
    21      chr21:13488923:G:A  A    G    0.944474   980916
    21      chr21:13835732:C:T  T    C    0.93019    981074
    21      chr21:13888449:C:T  T    C    0.943934   978932
    21      chr21:13919595:T:A  A    T    0.80528    981058
    
    # Allele frequencies when .bgen is read as ref-last
    plink2 --freq --bgen testsnps-dragen.bgen ref-last --sample "$dragen_bgen.sample" --out testsnps-dragen-reflast
    cat testsnps-dragen-reflast.afreq | column -t
    #CHROM  ID                  REF  ALT  ALT_FREQS  OBS_CT
    21      chr21:13457317:G:A  G    A    0.0566019  981044
    21      chr21:13488923:G:A  G    A    0.0555257  980916
    21      chr21:13835732:C:T  C    T    0.0698102  981074
    21      chr21:13888449:C:T  C    T    0.0560662  978932
    21      chr21:13919595:T:A  T    A    0.19472    981058

     

    If I use the .pgen version of the DRAGEN data, output is consistent with what it would be when the .bgen version is read as ref-last:

    dragen_plink="/mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release]/ukb24308_c21_b0_v1"
    plink2 --freq --pfile "$dragen_plink" --no-pheno --extract test-snps-dragen-plink.txt --out dragen-plink
    cat dragen-plink.afreq | column -t
    #CHROM  ID                         REF  ALT  ALT_FREQS  OBS_CT
    21      DRAGEN:chr21:13457317:G:A  G    A    0.0566019  981044
    21      DRAGEN:chr21:13488923:G:A  G    A    0.0555257  980916
    21      DRAGEN:chr21:13835732:C:T  C    T    0.0698102  981074
    21      DRAGEN:chr21:13888449:C:T  C    T    0.0560662  978932
    21      DRAGEN:chr21:13919595:T:A  T    A    0.19472    981058

     

    I also checked the allele frequencies for these SNPs in the original .bed files for non-imputed genotyping array data. The allele frequencies are still consistent - in all versions of the data, the major alleles are G, G, C, C, T. So this does appear to be just a matter of the DRAGEN data being ref-last, and not something more sinister like the alleles themselves somehow being swapped.

    array_genos="/mnt/project/Bulk/Genotype Results/Genotype calls/ukb22418_c21_b0_v2"
    plink2 --freq --bfile "$array_genos" --extract test-snps-rsids.txt --out raw-array-genos
    cat raw-array-genos.afreq | column -t
    #CHROM  ID           REF  ALT  PROVISIONAL_REF?  ALT_FREQS  OBS_CT
    21      rs114813604  A    G    Y                 0.944925   962158
    21      rs148053639  A    G    Y                 0.945571   971858
    21      rs11911574   T    C    Y                 0.93041    975204
    21      rs112638103  T    C    Y                 0.940584   971728
    21      rs3115464    A    T    Y                 0.804213   953136

     

    2
  • Comment author
    Kristen
    • Edited

    Caveat to the above: When I say that Plink or most analysis tools will still produce correct allele frequencies, effect estimates, etc. I mean tools that use one data source for the whole process. With something like Regenie (where I am guessing Ahmet used the imputed BGENs for step 1 and then the DRAGEN BGENs for step 2), the impact is potentially much worse.

    So this change to ref-last is definitely something that needs to be clearly documented by UKB.

    And Ahmet, thank you for posting about this and saving other people from the same headache.

    2

Please sign in to leave a comment.