The initial DRAGEN WGS release contains the multiallelic multi-sample VCF (pVCF) of over 490,000 WGS samples, aggregated using the DRAGEN Iterative gVCF Genotyper (IGG) on the Illumina Connected Analytics (ICA) cloud platform. The available files are described on Showcase in Category 185, and the pVCF is available in Field 24310.
Methodology
Data processing consists of two stages. In the first stage, to improve variant calling accuracy, DRAGEN Machine Learning Recalibration (MLR; version 4.2.4) was run at the sample level to recalibrate the variant quality metrics (QUAL and GQ). The inputs to the MLR pipeline were the per-sample gVCF and CRAM files, both generated by the DRAGEN v3.7.8 germline variant calling pipeline with reference genome hg38. The output of the MLR pipeline is a recalibrated per-sample gVCF file.
DRAGEN MLR improves variant precision while maintaining sensitivity. Assessed on Genome in a Bottle (GIAB) samples with truth data (NIST v4.2.1), filtering out variants with ML QUAL less than 3.0 improves SNV precision from 99.74% (before MLR) to 99.91% (after MLR) and SNV sensitivity from 99.72% to 99.75%. For INDELs, precision improves from 99.69% to 99.83% and sensitivity from 99.68% to 99.70%. This ML filtering reduces false positive variant calls at the sample level, before the aggregation step, without harming sensitivity.
In the second stage, the recalibrated gVCF files of the more than 490,000 WGS samples were aggregated using the DRAGEN IGG pipeline (version 4.2.4). Variants with a recalibrated ML variant quality (QUAL) of less than 3.0 were removed before aggregation. The resulting joint call set maximizes the discovery rate at the cohort level and retains the full multiallelic information, providing a baseline for downstream application-specific filtering.
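To put the precision figures above in terms of false positives, the following minimal sketch converts each precision value into the fraction of reported calls that are false (100% minus precision) and computes the relative reduction after MLR. The function names are illustrative only, and the calculation uses nothing beyond the percentages quoted in this article.

```python
def false_call_fraction(precision_pct: float) -> float:
    """Percent of reported calls that are false positives, given precision in percent."""
    return 100.0 - precision_pct


def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Relative drop in the false-call fraction when precision rises from before to after."""
    before = false_call_fraction(before_pct)
    after = false_call_fraction(after_pct)
    return (before - after) / before


# Precision values quoted in the article (GIAB, NIST v4.2.1 truth set).
snv_reduction = relative_reduction(99.74, 99.91)
indel_reduction = relative_reduction(99.69, 99.83)

print(f"SNV false-call fraction reduced by {snv_reduction:.1%}")
print(f"INDEL false-call fraction reduced by {indel_reduction:.1%}")
```

Under this reading, the share of false calls among reported variants falls by roughly 65% for SNVs and roughly 45% for INDELs, even though the precision values themselves move by only a fraction of a percentage point.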
The output pVCF contains 1,109,854,569 multiallelic variant sites and 1,494,611,198 variant alleles (1,289,650,789 SNVs and 204,960,409 INDELs), called on autosomes (chr1-chr22), sex chromosomes (chrX, chrY), mitochondria (chrM) and 3,341 ALT contigs.
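As a quick consistency check on the counts above, this short sketch confirms that the SNV and INDEL counts sum to the total number of variant alleles, and derives two summary ratios. The variable names are illustrative; all figures come from this article.

```python
# Counts quoted in the article.
variant_sites = 1_109_854_569
variant_alleles = 1_494_611_198
snv_alleles = 1_289_650_789
indel_alleles = 204_960_409

# The two allele classes should account for every variant allele.
counts_consistent = (snv_alleles + indel_alleles) == variant_alleles

# Average number of ALT alleles per variant site, and share of INDELs.
alleles_per_site = variant_alleles / variant_sites
indel_share = indel_alleles / variant_alleles

print(f"Counts consistent: {counts_consistent}")
print(f"Mean variant alleles per site: {alleles_per_site:.3f}")
print(f"INDEL share of alleles: {indel_share:.1%}")
```

The mean of about 1.35 alleles per site illustrates why a multiallelic representation matters at this cohort size.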
Quality Analysis
At each variant site, the allele count, allele number, number of samples (total, with genotype, with missing genotype and with no coverage), inbreeding coefficient, HWE p-value, and excess of heterozygosity are provided in the INFO field of the pVCF. The FORMAT field contains the genotype, genotype quality, FILTER and QUAL metrics as in the recalibrated gVCFs, localised allelic depth, localised genotype likelihood, localised allele fractions (chrM only), and the mapping between the localised ALT index and the global (cohort-level) ALT index. The localised metrics, which will be standard in version 4.5 of the VCF specification, are crucial to avoid hitting the VCF record memory limitation at cohort sizes beyond 100,000 samples.
The output pVCFs are partitioned into 157,772 genomic shards, each spanning about 20 kbp (with the exception that on ALT contigs, one shard spans the entire contig). The total size of all pVCFs is 1.2 PB.
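The idea behind the localised metrics can be sketched as follows: each sample stores values only for the reference allele and the few ALT alleles relevant to it, plus a list of 1-based indices that map each local ALT back to the cohort-level ALT list. The function below expands a localised allelic-depth array into a global one. The function and argument names are hypothetical, the article does not name the exact FORMAT tags, and treating alleles that are not stored for a sample as `None` is an assumption of this sketch rather than documented behaviour.

```python
from typing import List, Optional


def expand_local_depths(
    local_depths: List[int],
    local_to_global: List[int],
    n_global_alts: int,
) -> List[Optional[int]]:
    """Expand [REF, local ALT 1, ...] depths to [REF, global ALT 1, ..., global ALT n].

    local_to_global holds, for each local ALT, its 1-based index in the
    cohort-level ALT list. Alleles not stored for the sample are left as None
    (an assumption of this sketch).
    """
    if len(local_depths) != len(local_to_global) + 1:
        raise ValueError("expected one depth for REF plus one per local ALT")

    global_depths: List[Optional[int]] = [None] * (n_global_alts + 1)
    global_depths[0] = local_depths[0]  # REF is always index 0

    for local_idx, global_idx in enumerate(local_to_global, start=1):
        if not 1 <= global_idx <= n_global_alts:
            raise ValueError(f"global ALT index {global_idx} out of range")
        global_depths[global_idx] = local_depths[local_idx]

    return global_depths


# A site with 5 cohort-level ALT alleles; this sample only involves ALTs 2 and 5.
expanded = expand_local_depths([12, 7, 3], [2, 5], n_global_alts=5)
print(expanded)
```

Storing three depths instead of six is a small saving here, but at sites with hundreds of ALT alleles across several hundred thousand samples the localised form keeps the record size manageable.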
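For planning downloads or batch jobs, the figures above give a rough average shard size. The sketch below also includes a hypothetical helper that maps a 1-based position to a fixed-width 20 kbp bin. The actual shard boundaries and file naming are not described here, so the helper is an illustration of fixed-width partitioning only; the 1.2 PB total is assumed to be in decimal units.

```python
TOTAL_BYTES = 1.2e15      # 1.2 PB, assuming decimal petabytes
N_SHARDS = 157_772        # number of genomic shards quoted in the article
SHARD_WIDTH_BP = 20_000   # "about 20 kbp"; exact boundaries are an assumption

# Average size of one shard in gigabytes (decimal).
avg_shard_gb = TOTAL_BYTES / N_SHARDS / 1e9


def fixed_width_bin(position: int, width: int = SHARD_WIDTH_BP) -> int:
    """Return the 0-based bin index of a 1-based position under fixed-width binning.

    Hypothetical helper: real shard boundaries may differ.
    """
    if position < 1:
        raise ValueError("positions are 1-based")
    return (position - 1) // width


print(f"Average shard size: {avg_shard_gb:.1f} GB")
print(f"Bin for position 20,001: {fixed_width_bin(20_001)}")
```

The average works out to several gigabytes per shard. Individual shards will vary with local variant density, and ALT-contig shards span whole contigs.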
Computational Information
In total, the release was produced with 874,000 launched analyses on the Illumina Connected Analytics (ICA) platform using non-FPGA software instances (16 vCPUs and 128 GB RAM for the MLR pipeline, and 36 vCPUs and 72 GB RAM for the IGG pipeline). The total amount of compute for the 500k WGS joint variant call was 7.3 million CPU hours, effectively completed on ICA in 72 days.
If you need additional guidance or support, please submit a ticket.
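The compute figures above can be turned into two rough averages: the number of CPUs busy at any one time over the 72 days, and the CPU hours spent per sample. This is back-of-the-envelope arithmetic only. It assumes a sample count of 490,000 (the article says "over 490,000") and treats the 72 days as continuous wall-clock time.

```python
CPU_HOURS = 7.3e6     # total compute quoted in the article
WALL_DAYS = 72        # elapsed time quoted in the article
N_SAMPLES = 490_000   # assumption: the article states "over 490,000"

wall_hours = WALL_DAYS * 24

# Average number of CPUs in use at any moment, if load were spread evenly.
avg_concurrent_cpus = CPU_HOURS / wall_hours

# Average CPU hours per sample across both the MLR and IGG stages.
cpu_hours_per_sample = CPU_HOURS / N_SAMPLES

print(f"Average concurrent CPUs: {avg_concurrent_cpus:,.0f}")
print(f"CPU hours per sample: {cpu_hours_per_sample:.1f}")
```

On these assumptions the workload averages a little over 4,000 CPUs in continuous use and about 15 CPU hours per sample.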