This article provides information on the whole genome sequencing (WGS) data released on 30th November 2023 from half a million UK Biobank (UKB) volunteers. Access to all sequencing data is available via the UKB-RAP cloud platform under a tier 3 application. For more information about access tiers and the costs associated with accessing UKB data, please see this article. If you wish to know the details on previous releases, category 186 on the Showcase website provides information on a previous subset of this data.
1. What data have been released?
Two versions of the data have been released, one produced using BWA-MEM/GATK pipelines, and a second dataset produced using Illumina DRAGEN v3.7.8.
For the BWA-MEM/GATK data, the released data consists of fields in category 270:
- Individual level CRAM and CRAM index files (Field 23372)
- Individual level gVCF and gVCF index files (Field 23370)
- Joint variant called data produced using GraphTyper2 (Field 23374)
- Genotype concordance metrics (Fields 23378, 23379, 23380, and 23381)
- Measures of sample contamination (Fields 23377, 23383, and 23384)
- Base quality score recalibration (BQSR) (Field 23376)
- Supplementary files produced during data quality control (Field 23382)
For the DRAGEN data, the released data consists of fields in category 185:
- Individual level CRAM and CRAM index files (Field 24048)
- Individual level gVCF and gVCF index files (Field 24051)
- Individual level VCF and VCF index files (Field 24053)
- Joint variant called data produced using DRAGEN (Field 24310)
- Individual measures of copy number variation (CNV) (Fields 24056 and 24058)
- Identified short tandem repeats (STRs) (Fields 24062 and 24064)
- Structural variant data (SV) (Fields 24059 and 24061)
- Genotype calls for CYP2D6 (Field 24065)
- Supplementary and diagnostic information (Fields 24050 and 24055)
Additionally, tabular data containing quality control metrics from the sequencing process are available
in category 187.
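For scripted work it can help to keep the field lists above in one lookup table. The following is a minimal sketch in Python: the field IDs and category numbers are copied from the lists above, while the dictionary layout and the helper name `pipeline_for_field` are illustrative assumptions and not part of any UK Biobank or UKB-RAP API.

```python
from typing import Optional

# Field IDs and categories copied from the lists above.
# The structure and names here are illustrative only.
WGS_FIELDS = {
    "BWA-MEM/GATK": {
        "category": 270,
        "fields": {23370, 23372, 23374, 23376, 23377, 23378,
                   23379, 23380, 23381, 23382, 23383, 23384},
    },
    "DRAGEN": {
        "category": 185,
        "fields": {24048, 24050, 24051, 24053, 24055, 24056, 24058,
                   24059, 24061, 24062, 24064, 24065, 24310},
    },
}


def pipeline_for_field(field_id: int) -> Optional[str]:
    """Return the pipeline that produced a field, or None if it is not listed."""
    for pipeline, info in WGS_FIELDS.items():
        if field_id in info["fields"]:
            return pipeline
    return None


print(pipeline_for_field(24048))  # DRAGEN CRAM field
print(pipeline_for_field(23372))  # BWA-MEM/GATK CRAM field
```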
2. Can this data be accessed or downloaded through Data Showcase?
The whole genome sequencing data can only be used within the UKB-RAP, and researchers are not
permitted to download the data either through UKB-RAP or from Data Showcase.
3. Can I still access the 200k participant whole genome sequencing data?
The individual-level data, such as CRAM and VCF files, from the first 200k participant WGS release has
been merged into the enduring 500k release fields. The joint variant calls and phased datasets
produced for the interim 200k release are still accessible to researchers in category 186 and on UKB-RAP. These fields may be deprecated in the future, but UK Biobank will notify researchers in advance.
4. Will additional genomic data be released?
The next UKB-RAP release (v19) will contain 3 new DRAGEN genomics fields:
- Data-field 24311: ML-corrected DRAGEN population level WGS variants, pVCF format [500k release]
- Data-field 24308: DRAGEN population level WGS variants, PLINK format [500k release]
- Data-field 24309: DRAGEN population level WGS variants, BGEN format [500k release]
Field 24311 is an updated version of the current pVCF (field 24310) in which machine-learning techniques have been employed to improve calling. Fields 24308 and 24309 provide the field 24311 pVCF in PLINK2 and BGEN formats, respectively. If you wish to know more about our future data releases, please see our future timelines page.
5. What sequencing technology has been used for UK Biobank WGS?
The samples have been sequenced using Illumina NovaSeq 6000 instruments, with paired-end
sequencing performed using S4 flowcells (v1.0 chemistry).
6. Do the CRAM files also contain unmapped reads?
Yes. Original sample FASTQs can be re-created from the lossless CRAMs, which contain every read,
mapped or unmapped, together with all original quality scores. Please note that CRAMs should be
name sorted or randomised prior to extracting a FASTQ to ensure uncorrelated read sets for
subsequent parallelised mapping (e.g. BWA).
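One common way to do this is to pipe `samtools collate` (which shuffles reads and groups them by name) into `samtools fastq`. The sketch below only builds the shell command as a string so it can be inspected before being run. It is not an official UK Biobank procedure, and the function name, file names, output naming scheme and thread count are assumptions chosen for illustration.

```python
import shlex
from typing import Optional


def cram_to_fastq_command(cram: str, out_prefix: str,
                          reference: Optional[str] = None,
                          threads: int = 4) -> str:
    """Build a 'samtools collate | samtools fastq' pipeline as a shell string."""
    # A CRAM usually needs its reference FASTA to be decoded; pass it if known.
    ref = ["--reference", reference] if reference else []
    # collate: group read pairs by name; -u uncompressed, -O write to stdout.
    collate = ["samtools", "collate", "-u", "-O",
               "-@", str(threads), *ref, cram]
    # fastq: -1/-2 paired outputs, -0/-s discard other reads,
    # -n keeps read names unchanged, "-" reads from stdin.
    fastq = ["samtools", "fastq", "-@", str(threads), "-n",
             "-1", f"{out_prefix}_R1.fastq.gz",
             "-2", f"{out_prefix}_R2.fastq.gz",
             "-0", "/dev/null", "-s", "/dev/null", "-"]
    return " | ".join(shlex.join(part) for part in (collate, fastq))


# Hypothetical file names, for illustration only.
print(cram_to_fastq_command("sample.cram", "sample", reference="ref.fa"))
```

The printed command can be pasted into a shell on a machine where samtools is installed. Singleton and unflagged reads are sent to `/dev/null` here; point `-0` and `-s` at real files if those reads are needed.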
7. Why are there two versions of the dataset?
During the WGS Main Phase programme, the industry consortium chose to further process the
individual level data using the Illumina DRAGEN v3.7.8 pipeline. This takes advantage of the
potential improvements offered by the mapping and calling algorithms within DRAGEN. In addition,
other large-scale population genomics initiatives have sought to standardise on the DRAGEN v3.7.8
pipeline with the aim of simplifying cross-cohort analyses.
The DRAGEN version of these data should be regarded as the canonical output of the WGS
programme. Given that DRAGEN remains an upcoming standard increasingly used within the genetics community,
and that many researchers will currently use BWA-MEM/GATK as their preferred version, we have
chosen to make both versions of the data available at this time to ensure that it is accessible to as
many researchers as possible. The BWA-MEM/GATK CRAMs may be deprecated in
the future, but UK Biobank will notify researchers in advance.
8. Why are the EIDs in the header of the gVCF and CRAM different to the filename?
The EID in the filename is pseudonymised to match your application EIDs. These EIDs are consistent
across your project space, for all bulk and tabular data. Please disregard any sample IDs within the
gVCF, VCF and CRAM files. Further information can be found on the UKB-RAP FAQ page.
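In practice this means taking the EID from the filename rather than from the file header. The sketch below assumes bulk files are named `<EID>_<field>_<instance>_<array>.<extension>`; that pattern is an assumption not stated in this article, so check it against the files in your own project. The helper name and example path are hypothetical.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# ASSUMED naming pattern: <EID>_<field>_<instance>_<array>.<extension>
# Verify this against your own project files before relying on it.
_BULK_NAME = re.compile(
    r"^(?P<eid>\d+)_(?P<field>\d+)_(?P<instance>\d+)_(?P<array>\d+)\."
)


def eid_from_filename(path: str) -> Optional[str]:
    """Return the application-specific EID from a bulk file name, or None."""
    match = _BULK_NAME.match(PurePosixPath(path).name)
    return match.group("eid") if match else None


# Hypothetical path, for illustration only.
print(eid_from_filename("/data/1234567_24048_0_0.cram"))
```

Returning `None` for names that do not match makes it easy to spot files that follow a different convention instead of silently mislabelling them.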
9. How can I tell whether a participant was sequenced by Wellcome Sanger Institute (WSI) or deCODE Genetics?
The sequencing provider can be determined using Field 32051. WSI sequenced the samples in the
Vanguard Pilot and Vanguard Phase, as well as some of the samples in the main phase of the
sequencing project.