ML-Corrected DRAGEN whole genome sequencing (WGS) release

This updated ML-corrected version of the DRAGEN WGS data contains the results of DRAGEN cohort level variant filtering and format conversion with the original DRAGEN pVCFs (Field 24310) as input. The aim of this release is to reduce data footprint and apply QC filters to improve genotyping rate and genotype consistency.

This data is released in pVCF, PLINK2, and BGEN formats.

Data processing

The data processing to generate version 2 consisted of the following steps.

(1) Multi-allelic variants were split into bi-allelic ones, with each pair of REF and ALT alleles normalised and all sample-level genotypes adjusted accordingly. Each variant is assigned a unique ID composed of CHROM:POS:REF:ALT. This unique variant ID is critical for format conversion from multi-allelic pVCFs into PGEN (PLINK2) and BGEN formats.

(2) For each bi-allelic variant, the DRAGEN ML cohort-level filtering is applied. Based on sample-level quality metrics, a site quality score (MLSQ) is calculated per variant allele and set as the value of QUAL, and variant alleles with QUAL >= 0.1 are assigned FILTER=PASS. This cut-off was chosen based on the cohort-level quality assessment of genotyping rate and trio/twin consistency (see below). Variants not passing the quality filter are assigned LowGTR if the genotyping rate is lower than 90%, or LowMLSQ if the MLSQ is lower than 0.1.

(3) For each bi-allelic variant, the allele count, allele frequency, allele number and number of samples (total, with genotype, missing genotype and no coverage) are provided in the INFO field. In the FORMAT field, only GT is retained. Keeping only these essential INFO and FORMAT metrics significantly reduced the size of the bi-allelic pVCF files, making them easier to transfer and analyse in downstream applications at whole genome scale.

(4) The output biallelic pVCF (Field 24311) is partitioned with the same sharding schema as the input multiallelic pVCF (Field 24310).

(5) The sharded biallelic pVCFs were converted to sharded PGEN files (with companion PVAR and PSAM files) and sharded BGEN files (with additional BGI index files and a sample list file), using PLINK2 (version 2024 March 18). Sample information was removed from the BGEN files using EDIT-BGEN (version 1.1.4). BGEN indexing was done using BGENIX (version 1.1.4).

(6) Sharded PGEN files were concatenated into chromosomal level PGEN files using PLINK2, with the exception that all ALT contig files were concatenated into one ALT contigs PGEN file. Each chromosomal level (and ALT contigs) PGEN file was then converted into a chromosomal level (and ALT contigs) BGEN file using PLINK2.
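Steps (1) and (3) can be illustrated with a minimal Python sketch. It is not the DRAGEN implementation. The function names are hypothetical, normalisation is reduced to trimming shared bases (no left-alignment against a reference), and other ALT alleles are assumed to be recoded as REF in each split record, which the text above does not specify. Sample counts (total, with genotype, missing, no coverage) are omitted.

```python
def trim_alleles(pos, ref, alt):
    """Trim shared trailing, then shared leading bases, keeping >= 1 base per allele."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1  # dropping a leading base shifts the start position
    return pos, ref, alt


def split_multiallelic(chrom, pos, ref, alts, genotypes):
    """Split one multi-allelic record into bi-allelic records.

    genotypes: one tuple of allele indices per sample (0 = REF, k = k-th ALT,
    None = missing allele).
    """
    records = []
    for k, alt in enumerate(alts, start=1):
        new_pos, new_ref, new_alt = trim_alleles(pos, ref, alt)
        # ASSUMPTION: alleles other than k are recoded as 0 (REF).
        gts = [
            tuple(None if a is None else (1 if a == k else 0) for a in gt)
            for gt in genotypes
        ]
        called = [a for gt in gts for a in gt if a is not None]
        an = len(called)   # allele number: called alleles only
        ac = sum(called)   # allele count of this ALT allele
        records.append({
            "id": f"{chrom}:{new_pos}:{new_ref}:{new_alt}",  # CHROM:POS:REF:ALT
            "genotypes": gts,
            "AC": ac,
            "AN": an,
            "AF": ac / an if an else None,
        })
    return records
```

For example, a record at chr1:100 with REF=CAG and ALT=C,CAGAG would yield the IDs chr1:100:CAG:C and chr1:100:C:CAG under these assumptions.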

The output pVCF, PGEN and BGEN files contain all the variants in the original pVCF, in biallelic format. Users can use either the full call set or the subset of variants with FILTER=PASS (available in the pVCF and PVAR files).
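The FILTER labels from step (2) and the PASS subset can be sketched in Python as follows. This is illustrative only. The function names are hypothetical, and two points are assumptions because the text does not state them: that each variant receives a single label, and that LowGTR takes precedence when both conditions hold. The parser assumes a tab-separated pVCF or PVAR header line beginning with #CHROM that includes ID and FILTER columns.

```python
MLSQ_CUTOFF = 0.1   # QUAL (MLSQ) threshold quoted in step (2)
GTR_CUTOFF = 0.90   # genotyping rate threshold quoted in step (2)


def assign_filter(mlsq, genotyping_rate):
    """Return a FILTER label for one bi-allelic variant.

    ASSUMPTION: one label per variant, with LowGTR checked first.
    """
    if genotyping_rate < GTR_CUTOFF:
        return "LowGTR"
    if mlsq < MLSQ_CUTOFF:
        return "LowMLSQ"
    return "PASS"


def pass_variant_ids(lines):
    """Collect IDs of variants whose FILTER column equals PASS."""
    id_col = filter_col = None
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Locate columns by name rather than by fixed position.
            id_col = fields.index("ID")
            filter_col = fields.index("FILTER")
            continue
        if filter_col is None:
            raise ValueError("no #CHROM header line found before data lines")
        if fields[filter_col] == "PASS":
            kept.append(fields[id_col])
    return kept
```

The resulting ID list could then be passed to a downstream tool to restrict an analysis to the PASS subset.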

Quality analysis

The quality of variant calls before and after cohort-level ML filtering is assessed based on the overall genotype missingness across all the UK Biobank samples, and the genotype inconsistency among 1043 trios and between 177 monozygotic twins. The assessment is done in two types of genomic regions: the low confidence region, defined as the union of the sample-level low confidence regions of 7 Genome In A Bottle (GIAB) samples (18% of the genome), and the high confidence region, defined as the complement of the overall low confidence region. For trios, any child genotype with allele(s) not found in the parents' genotypes, or any missing genotype within the trio, is considered inconsistent. For monozygotic twins, any genotype mismatch between the twins, or any missing genotype, is considered inconsistent. At cohort level, genotype missingness is defined as the fraction of variants with genotyping rate lower than 99% among all variants with genotyping rate above 90% (i.e. variants with 1%-10% of samples missing a genotype).
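The trio and twin rules above can be written down as a short Python sketch. It follows the wording literally: a child genotype is flagged only when it carries an allele absent from both parents, which is looser than a strict Mendelian check requiring one allele from each parent. The function names are hypothetical, genotypes are assumed to be unphased tuples of allele indices with None for a missing allele, and ploidy differences on sex chromosomes are not handled.

```python
def is_missing(gt):
    """A genotype is missing if absent or if any allele is uncalled."""
    return gt is None or any(a is None for a in gt)


def trio_consistent(child, father, mother):
    """True if no genotype is missing and every child allele occurs in a parent."""
    if any(is_missing(gt) for gt in (child, father, mother)):
        return False
    parental = set(father) | set(mother)
    return all(a in parental for a in child)


def twins_consistent(twin_a, twin_b):
    """True if neither genotype is missing and the unphased genotypes match."""
    if is_missing(twin_a) or is_missing(twin_b):
        return False
    return sorted(twin_a) == sorted(twin_b)


def inconsistency_rate(consistent_flags):
    """Fraction of evaluated genotypes flagged as inconsistent."""
    flags = list(consistent_flags)
    if not flags:
        return None
    return sum(not ok for ok in flags) / len(flags)
```

Under this reading, a child genotype of 0/1 with both parents 0/0 counts as inconsistent, as does any trio in which one member has a missing genotype.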

For trios, the genotype inconsistency is reduced by 6.5 times from 0.059% (all variants) to 0.009% (PASS variants) in the high confidence regions, and by 17 times from 3.712% (all variants) to 0.219% (PASS variants) in the low confidence regions. For twins, the genotype inconsistency is reduced by 4.9 times from 0.039% (all variants) to 0.008% (PASS variants) in the high confidence regions, and by 11 times from 2.736% (all variants) to 0.245% (PASS variants) in the low confidence regions.
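The fold reductions quoted above can be recomputed from the reported percentages, as in this small Python sketch (the dictionary keys and function name are illustrative). Because the percentages are rounded to three decimals, the recomputed ratios may differ slightly from the quoted fold values.

```python
# Inconsistency rates in percent: (all variants, PASS variants), as quoted above.
INCONSISTENCY = {
    "trios, high confidence": (0.059, 0.009),
    "trios, low confidence": (3.712, 0.219),
    "twins, high confidence": (0.039, 0.008),
    "twins, low confidence": (2.736, 0.245),
}


def fold_reduction(before, after):
    """Ratio of the rate before filtering to the rate after filtering."""
    return before / after


FOLDS = {name: fold_reduction(*pair) for name, pair in INCONSISTENCY.items()}

if __name__ == "__main__":
    for name, fold in FOLDS.items():
        print(f"{name}: {fold:.1f}-fold reduction")
```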

At the 500K cohort level, genotype missingness in the high confidence region is reduced from 0.275% (all variants) to 0.005% (PASS variants). In the low confidence region, genotype missingness is reduced from 21.050% (all variants) to 0.010% (PASS variants).
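The cohort-level missingness metric defined in the quality analysis can be expressed as a short Python function. This is a sketch of the stated definition only. The function name is hypothetical and the input is assumed to be one genotyping rate per variant, expressed as a fraction between 0 and 1.

```python
def cohort_missingness(genotyping_rates, lower=0.90, upper=0.99):
    """Among variants with genotyping rate above `lower`, return the
    fraction whose genotyping rate is below `upper`."""
    eligible = [r for r in genotyping_rates if r > lower]
    if not eligible:
        return None  # metric is undefined when no variant is eligible
    return sum(r < upper for r in eligible) / len(eligible)
```

For instance, with rates 0.85, 0.95, 0.98, 0.995 and 1.0, the first variant is excluded and two of the remaining four fall below 99%, giving a missingness of 0.5.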

As an assessment of the concordance of variants across joint calling datasets, on autosomes and chromosome X after cohort-level ML filtering, 91.9% of DRAGEN PASS variants in the high confidence region and 54.7% of DRAGEN PASS variants in the low confidence region are also called by GraphTyper (AAscore >= 0.8) at the 500K cohort level (which amounts to 99.5% of GraphTyper variants in the high confidence region and 94.8% of GraphTyper variants in the low confidence region). In terms of variants unique to each call set (not shared between DRAGEN and GraphTyper), DRAGEN calls 18 times more variants in the high confidence region and 15 times more variants in the low confidence region, after cohort-level ML filtering. This implies that DRAGEN variant calls are concordant with GraphTyper in high confidence regions while retaining high novelty, and hence high discovery power, in both high and low confidence regions.

In total, 995,249,518 variant sites and 1,210,935,383 variant alleles (1,081,661,407 SNVs and 129,273,976 INDELs) are called with FILTER=PASS on the autosomes (chr1-chr22), the sex chromosomes (chrX, chrY), the mitochondrial genome (chrM) and 3341 ALT contigs.

Data Size

Compared with Release 1, whose pVCF files total 1.2PB, the data footprint of version 2 is substantially reduced, owing to the reduced INFO and FORMAT fields in the pVCF and to the binary genotype file formats PGEN and BGEN. The total size of the per-chromosome PGEN+PVAR+PSAM files is 5.5TB (with information equivalent to the reduced biallelic pVCF), and that of the BGEN+BGI+SAMPLE files is 15.8TB. The total size of the sharded biallelic pVCF files is 29.5TB.
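To put these sizes in proportion, the reduction factors relative to the 1.2PB Release 1 pVCFs can be computed as below. This is a back-of-envelope sketch that assumes decimal units (1 PB = 1000 TB), which the text does not specify. The variable and function names are illustrative.

```python
RELEASE1_PVCF_TB = 1.2 * 1000  # ASSUMPTION: decimal units, 1 PB = 1000 TB

# Version 2 sizes in TB, as quoted above.
VERSION2_TB = {
    "sharded biallelic pVCF": 29.5,
    "BGEN+BGI+SAMPLE": 15.8,
    "PGEN+PVAR+PSAM": 5.5,
}


def reduction_factor(size_tb, baseline_tb=RELEASE1_PVCF_TB):
    """How many times smaller a version 2 format is than the Release 1 pVCFs."""
    return baseline_tb / size_tb


FACTORS = {name: reduction_factor(size) for name, size in VERSION2_TB.items()}

if __name__ == "__main__":
    for name, factor in FACTORS.items():
        print(f"{name}: about {factor:.0f} times smaller than Release 1")
```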

Computational information

For the post-processing work for version 2, 160,000 analyses were launched on the Illumina Connected Analytics (ICA) platform using a non-FPGA software instance (36 vCPU, 72 GB). The total amount of compute for the 500K WGS joint call is 1.3 million CPU hours, completed on ICA in 14 days.

If you need additional guidance or support, please submit a ticket.
