Quality control and metrics
As part of the whole genome sequencing (WGS) release, a large number of quality control (QC) and metadata metrics have been released (Category 187). These build on those made available with the 200k WGS release, and can help with understanding and using the data effectively.
The WGS project occurred in two phases:
- The vanguard phase, in which the first 50,000 samples were sequenced at the Wellcome Sanger Institute, with bioinformatics provided by Seven Bridges. The first 5,000 samples of this phase formed a pilot.
- The main phase, in which the remaining 440,000 samples were divided for sequencing between the Wellcome Sanger Institute (with bioinformatics provided by Seven Bridges) and deCODE Genetics (with bioinformatics provided in-house).
Field 32051 can be used to identify which of these phases each sample was part of and, for main phase samples, which of the two sequencing providers received the sample.
Further information on sample shipment can come from Field 32053 Shipment batch number, which identifies the shipment each sample was sent in, and Field 32052 Sample plate ID, identifying the 96-well plate within each shipment that the sample was distributed on.
Note: samples were not always processed in the order in which they were delivered, as each sequencing provider independently managed its samples and sequencing priority on receipt. Additionally, the end of the vanguard phase overlapped with the start of shipping for the main phase, so some vanguard samples were shipped and/or sequenced after the main phase began.
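As an illustration of how the phase, shipment and plate fields might be used together, the sketch below counts samples per phase/provider and lists the plates within each shipment. It is a minimal example under stated assumptions: the column names (`p32051`, `p32052`, `p32053`), the `eid` values and all field values shown are hypothetical placeholders, and the real encodings should be taken from the field notes.

```python
from collections import Counter, defaultdict

# Hypothetical example rows. Column names and values are assumptions made
# for illustration; check the actual field encodings before use.
samples = [
    {"eid": 1, "p32051": "Vanguard", "p32053": "B001", "p32052": "PLATE-A"},
    {"eid": 2, "p32051": "Vanguard", "p32053": "B001", "p32052": "PLATE-B"},
    {"eid": 3, "p32051": "Main - deCODE", "p32053": "B002", "p32052": "PLATE-C"},
    {"eid": 4, "p32051": "Main - Sanger", "p32053": "B002", "p32052": "PLATE-C"},
]


def count_by(rows, key):
    """Count how many rows share each value of the given column."""
    return Counter(row[key] for row in rows)


def plates_per_shipment(rows):
    """Map each shipment batch (32053) to the sorted plate IDs (32052) it contains."""
    plates = defaultdict(set)
    for row in rows:
        plates[row["p32053"]].add(row["p32052"])
    return {batch: sorted(ids) for batch, ids in plates.items()}


phase_counts = count_by(samples, "p32051")
shipment_plates = plates_per_shipment(samples)
```

Because samples were not necessarily processed in delivery order, groupings like these describe shipment structure rather than sequencing chronology.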
After delivery to the sequencing provider, DNA concentration was measured (Field 32055 Sample quant reading (sequence provider)), which can be compared against the concentration measured by UK Biobank (Field 32054 Sample quant reading (UKB)). The two readings may not fully align if UKB and the sequencing provider used different measurement technologies. Preparation of received samples for sequencing is then recorded in Field 32056 Library prep plate barcode and Field 32057 Library prep plate position (well). These fields can be used to identify batches within sequencing preparation.
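One way to compare the two concentration readings is to compute the relative difference per sample and flag large discrepancies. The sketch below is illustrative only: the column names (`p32054`, `p32055`), the example values and the 25% tolerance are all assumptions, not thresholds defined by UK Biobank.

```python
# Hypothetical example rows. Column names and values are assumptions.
samples = [
    {"eid": 1, "p32054": 50.0, "p32055": 48.5},
    {"eid": 2, "p32054": 40.0, "p32055": 61.0},
    {"eid": 3, "p32054": 55.0, "p32055": None},  # provider reading missing
]


def quant_discrepancies(rows, rel_tol=0.25):
    """Return (eid, relative difference) for samples whose provider reading
    differs from the UKB reading by more than rel_tol (an arbitrary cut-off)."""
    flagged = []
    for row in rows:
        ukb, provider = row["p32054"], row["p32055"]
        # Skip samples where a comparison is not possible.
        if ukb is None or provider is None or ukb == 0:
            continue
        rel_diff = abs(provider - ukb) / ukb
        if rel_diff > rel_tol:
            flagged.append((row["eid"], round(rel_diff, 3)))
    return flagged


flagged = quant_discrepancies(samples)
```

A flagged sample is not necessarily problematic, since differences can arise simply from the measurement technologies used.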
Within the WGS QC category, seven metrics describing the quality of the sequencing itself are also available to researchers. Field 32061 Average batch coverage and Field 32060 coverage both describe the sequencing coverage achieved, across a sequencing batch and for an individual sample respectively. Field 32058 provides the sequencing yield, and Field 32059 indicates what proportion of mapped read pairs had appropriate orientation and separation. The final three metrics address sample identity and possible contamination: Field 32063 confirms concordance with the genotyping data, while Freemix VerifyBamID (Field 32062) and Read haps (Field 32065) indicate the likelihood of contamination in the sequence.
Samples which passed all required quality metrics are indicated with Field 32064. Details of field-level thresholds, where available, are provided in the notes for each field. Note that in a small number of cases, sequences which did not pass all metrics have been made available, either because no further DNA sample was available or because subsequent sequencing attempts were unsuccessful. Researchers may want to review which quality thresholds these samples failed to meet and consider whether to exclude them from their research.
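A simple way to act on this is to split a cohort by the overall pass indicator and inspect the failing samples before deciding whether to exclude them. The sketch below assumes hypothetical column names (`p32064`, `p32056`, `p32060`) and a hypothetical "Pass" encoding; the real coding of Field 32064 should be checked in its field notes.

```python
from collections import Counter

# Hypothetical example rows. Column names and values are assumptions.
samples = [
    {"eid": 1, "p32064": "Pass", "p32056": "LP01", "p32060": 32.1},
    {"eid": 2, "p32064": "Fail", "p32056": "LP01", "p32060": 21.4},
    {"eid": 3, "p32064": "Pass", "p32056": "LP02", "p32060": 30.8},
]

PASS_VALUE = "Pass"  # assumed encoding of Field 32064, for illustration only


def split_by_qc(rows):
    """Split rows into those that passed all required metrics and the rest."""
    passed = [r for r in rows if r["p32064"] == PASS_VALUE]
    failed = [r for r in rows if r["p32064"] != PASS_VALUE]
    return passed, failed


passed, failed = split_by_qc(samples)

# Count failures per library prep plate (32056) to spot possible batch effects.
failures_per_plate = Counter(r["p32056"] for r in failed)
```

Keeping the failing samples in a separate list, rather than dropping them immediately, makes it easier to check the individual metrics they failed before excluding them.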
Comments
We are using the Dragen (500k) WGS for an analysis. Can we use the array-based PCs and QC for this analysis? Are there any reports of concordance between array-based and WGS data? Is there concern for sample mismatch? Thank you