Consequence and Amino Acid Annotations in UKB WES Data

Dr. Mc. Ninja

UK Biobank-Provided WES Annotations on RAP

Consequence Terms: Yes – the UK Biobank’s whole-exome sequencing (WES) data files do include variant effect annotations (using snpEff/Ensembl “ANN/CSQ” fields). Each variant is tagged with its most severe consequence (e.g. missense_variant, stop_gained, synonymous_variant, etc.) across all protein-coding transcripts[1]. UKB’s pipeline annotated variants with snpEff v5 and selected one consequence per variant for filtering (loss-of-function variants were defined by terms like stop_gained, splice_donor/acceptor, frameshift, etc., if the alternate allele was non-ancestral)[1]. These annotations are provided both in the population VCFs (pVCFs) and in companion “helper” tables on the Research Analysis Platform. For example, UKB supplies an “annotations” TSV where each variant is listed with a gene and a functional category. In the final 500k WES release, each variant is classified as “LoF”, “synonymous”, or “missense” (with missense further sub-labeled by predicted deleteriousness from 5 algorithms)[2]. This means researchers can directly filter or query variants by consequence class using the UKB-provided annotation file (e.g. selecting all “LoF” variants in a gene for burden testing)[3][4].
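
As a rough illustration of filtering by consequence class, the sketch below groups variants by gene from a helper-style annotation table. The three-column layout (variant ID, gene, category), the sample rows, and the function name are all assumptions made for the example; check the actual file on RAP before relying on them.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real helper file's columns and labels may differ.
SAMPLE_TSV = (
    "1:1000:A:G\tGENE1\tLoF\n"
    "1:1010:C:T\tGENE1\tmissense(5/5)\n"
    "1:1020:G:A\tGENE1\tsynonymous\n"
    "2:2000:T:C\tGENE2\tLoF\n"
)

def variants_by_gene(handle, wanted_prefix):
    """Group variant IDs by gene, keeping rows whose category starts with wanted_prefix."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        variant_id, gene, category = row[:3]
        if category.startswith(wanted_prefix):
            groups[gene].append(variant_id)
    return dict(groups)

lof = variants_by_gene(io.StringIO(SAMPLE_TSV), "LoF")
missense = variants_by_gene(io.StringIO(SAMPLE_TSV), "missense")
print(lof)
print(missense)
```

Matching on a prefix rather than an exact label keeps all missense sub-labels together; pass a full label such as "missense(5/5)" to restrict to one sub-class.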

Amino Acid Changes (HGVSp): No – UK Biobank’s distributed data do not include the actual protein change (HGVSp) strings. The WES VCFs were annotated with snpEff using the -noProtein flag, so the amino-acid substitution fields were omitted[5]. In other words, the provided annotations stop at the coding DNA position – there are no HGVSp tokens or amino-acid change strings in the UKB files[6]. You therefore cannot retrieve the exact amino acid change (e.g. p.Arg123Trp) directly from UKB’s files, because that information was never written to the ANN/CSQ fields[7]; if you need the HGVSp or protein alteration, you must generate it yourself. UKB’s documentation confirms that there is “nothing to extract” for amino acid changes and that researchers will have to re-annotate or translate variants independently[8]. In practice, there are two main approaches on the RAP: (1) re-run an annotator like Ensembl VEP or snpEff (without the no-protein option) on the exome variants to produce full consequences and HGVSp, or (2) compute the protein changes in-house with a lightweight script (e.g. using reference coding sequences to translate missense/LoF variants)[9][10]. The choice depends on your needs – for rigorous HGVS notations and scores, a full VEP re-run is recommended, whereas for simple filtering (e.g. getting the affected codon or amino acid for missense variants) a small custom script can suffice[11][8]. UKB’s RAP provides reference files (GRCh38 FASTA and Ensembl gene models) in every project to facilitate consistent re-annotation[12].
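
To give a feel for approach (2), here is a minimal sketch that translates a single-base substitution within a coding sequence into an HGVSp-style string. It assumes the coding sequence and the alternate base are already on the transcript strand and that the position is a 1-based offset from the start codon; it ignores indels, splicing, and transcript selection, so it is not a substitute for a full annotator. The function name and the toy sequences are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Ter",
}

def protein_change(cds, cdna_pos, alt_base):
    """Return an HGVSp-style string for a single-base substitution.

    cds: coding sequence on the transcript strand, starting at the start codon.
    cdna_pos: 1-based position of the substituted base within cds.
    """
    index = cdna_pos - 1
    offset = index % 3                    # position of the base within its codon
    codon_start = index - offset
    ref_codon = cds[codon_start:codon_start + 3].upper()
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    residue = codon_start // 3 + 1        # 1-based amino-acid number
    ref_aa = THREE_LETTER[CODON_TABLE[ref_codon]]
    alt_aa = THREE_LETTER[CODON_TABLE[alt_codon]]
    if ref_aa == alt_aa:
        return f"p.{ref_aa}{residue}="    # synonymous change
    return f"p.{ref_aa}{residue}{alt_aa}"

print(protein_change("ATGCGGTAA", 4, "T"))  # CGG -> TGG: p.Arg2Trp
print(protein_change("ATGCGATAA", 4, "T"))  # CGA -> TGA: p.Arg2Ter
print(protein_change("ATGCGGTAA", 6, "A"))  # CGG -> CGA: p.Arg2=
```

For variants on minus-strand genes, the genomic alleles would first need to be reverse-complemented, which this sketch leaves out.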

Access via RAP Tools: On the UKB Research Analysis Platform, you can explore variant annotations through interactive tools. The RAP offers a Variant Browser (in the Cohort Browser interface) where authorized users can query specific variants in the WES dataset. For example, you can search by an rsID or chromosomal position and view that variant’s info, including allele frequencies and the annotated consequence[13]. (Note that this browser uses the 450k/500k exome data in GRCh38 coordinates, whereas the older Showcase “Genomic Search” on the UKB website only covers array/imputation data in GRCh37[14].) The RAP variant browser is useful for quickly checking whether a variant is present and seeing its functional class. However, since UKB didn’t store HGVSp, the browser will show the consequence term (e.g. “missense”) and gene but not the actual amino acid change. To get amino-acid substitutions at scale, you should plan to use one of the re-annotation methods above (for instance, run vep on the pVCF in a RAP Jupyter notebook and then pull out the HGVSp field from the resulting CSQ annotations, e.g. with bcftools +split-vep)[9].
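
Once a VCF has been re-annotated with VEP, the per-transcript annotations sit in a pipe-delimited CSQ INFO entry whose field order is declared in the VCF header. The sketch below shows one way to read that entry in plain Python. The header here lists only a shortened, made-up subset of fields, and the record and identifiers are invented for illustration; real VEP output has many more fields, and HGVSp is only present when VEP was run with its --hgvs option.

```python
import re

# Shortened, illustrative header and record; not real UKB or VEP output.
HEADER = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
          'from Ensembl VEP. Format: Allele|Consequence|SYMBOL|Feature|HGVSp">')
RECORD = ("chr1\t1000\t.\tC\tT\t.\t.\t"
          "CSQ=T|missense_variant|GENE1|ENST0001|ENSP0001:p.Arg2Trp,"
          "T|intron_variant|GENE1|ENST0002|")

def csq_fields(header_line):
    """Read the CSQ field order from the 'Format: ...' part of the header."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    return match.group(1).split("|")

def parse_csq(record_line, fields):
    """Return one dict per transcript annotation found in the CSQ entry."""
    info = record_line.rstrip("\n").split("\t")[7]   # INFO is the 8th VCF column
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            return [dict(zip(fields, item.split("|")))
                    for item in entry[len("CSQ="):].split(",")]
    return []

fields = csq_fields(HEADER)
annotations = parse_csq(RECORD, fields)
hgvsp = [a["HGVSp"] for a in annotations if a["HGVSp"]]
print(hgvsp)
```

Reading the field order from the header rather than hard-coding it keeps the parser working when the set of VEP output options changes.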

Third-Party Resources (Genebass and Variant Portals)

Several external resources have integrated UKB exome results and can provide functional context for variants:

  • Genebass (Broad Institute): The Broad/Neale lab has performed comprehensive association analyses on ~394,000 UKB exomes and made the results available via the Genebass browser. Genebass is a publicly accessible web portal for exploring rare-variant associations in UKB[15]. You can search by gene or variant and review both gene-level burden tests and single-variant results for thousands of phenotypes[16]. Importantly, Genebass uses the same underlying WES data and annotations – each variant in their dataset has a defined consequence category. In their analysis, Broad annotated variants with SnpEff/Ensembl (similar to UKB) and applied LOFTEE-like criteria and in silico predictions to classify variants. For example, any UKB variant labeled stop_gained, splice_acceptor, splice_donor, frameshift, etc. is treated as a putative loss-of-function (pLOF) in the Genebass results[1]. Missense variants are further stratified by predicted deleteriousness (using 5 algorithms) into “likely damaging” vs “benign” categories[17] – akin to UKB’s (0/5, >=1/5, 5/5) missense sub-labels. While the Genebass website focuses on statistical associations, it does display variant-level info that includes the consequence term and gene. For instance, significant variants are often reported with their protein change in papers (e.g. rs139491786, Arg171Trp in SLC9A3R2[18]), indicating that the team mapped variants to amino-acid substitutions internally. Where to find it: You can access Genebass at app.genebass.org[19] (no login required). There, you may query a variant by rsID or coordinates – if it’s a coding variant in UKB, the site will show which gene it falls in and the consequence (e.g. “stop-gained” or “missense”) as part of the result. Genebass does not provide a bulk list of HGVSp for every variant, but it’s an excellent resource to quickly check a variant’s functional class and any phenotype associations. The underlying summary statistics have also been released as a public dataset[20], so advanced users can download those files for research. In summary, Genebass links UKB SNPs to consequence annotations (and association data) and is a practical portal for finding if a given UKB exome SNP is a LOF, missense, etc., even though you might still need to derive the exact HGVSp separately.
  • UKB “SNP Explorer” / Variant Lookup Tools: Aside from Genebass, there are other portals and tools to explore UKB variants. The term “UKB SNP Explorer” likely refers to interfaces for browsing UKB genomic data. One is the UKB Allele Frequency Browser, a public site for UKB’s whole-genome sequencing data (150k genomes) that allows variant queries (hosted at afb.ukbiobank.ac.uk). That browser shows allele frequencies across populations and provides functional context for WGS variants. However, it currently covers the WGS subset (in GRCh38) and was launched for the UKB 150k genome release, not the full 500k exome set[21]. Many exome variants (especially rare ones) will be absent if they weren’t in those 150k genomes. For those variants that are present, the WGS browser does display annotations (e.g. gene, consequence term) since it uses Ensembl/DECIPHER code from deCODE’s platform. In short, the public allele-frequency browser can be useful for quickly checking a variant’s consequence if it’s in the WGS data, but it’s not comprehensive for all exome SNPs. The more directly relevant tool for exome data is the UKB-RAP Variant Browser mentioned earlier, which is essentially a UKB-hosted “SNP explorer” within the RAP environment. This tool is linked to the UKB exome data you have access to, meaning you can look up any UKB exome SNP and see its attributes. It’s a handy way to filter by consequence as well – e.g. you could query all variants in a gene and then filter results to “nonsense or missense” within the interface (or use the helper annotation file in code). Keep in mind that these UKB-run browsers are available to registered researchers with data access; they aren’t open to the general public like Genebass.
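
The consequence classes discussed for these resources can be pictured as a simple mapping from a consequence term to a class, with missense variants sub-labeled by how many of the 5 predictors call them damaging. The sketch below is only an illustration of that idea: the exact term spellings, the function name, and the "other" fallback are assumptions, and it omits the ancestral-allele condition and any LOFTEE-style filtering mentioned above.

```python
# Assumed Sequence Ontology-style spellings of the pLoF terms named in the text.
PLOF_TERMS = {
    "stop_gained",
    "splice_acceptor_variant",
    "splice_donor_variant",
    "frameshift_variant",
}

def classify(consequence, damaging_calls=0):
    """Map a consequence term to a coarse class.

    damaging_calls: how many of the 5 in silico predictors call the variant
    damaging (only used for missense variants).
    """
    if not 0 <= damaging_calls <= 5:
        raise ValueError("damaging_calls must be between 0 and 5")
    if consequence in PLOF_TERMS:
        return "pLoF"
    if consequence == "missense_variant":
        if damaging_calls == 5:
            return "missense(5/5)"
        if damaging_calls >= 1:
            return "missense(>=1/5)"
        return "missense(0/5)"
    if consequence == "synonymous_variant":
        return "synonymous"
    return "other"

print(classify("stop_gained"))
print(classify("missense_variant", damaging_calls=3))
```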

Practical retrieval of consequences and HGVSp: To summarize, the most practical way to get consequence annotations for UKB WES variants is to use the provided UKB annotation files or RAP variant browser, which already categorize each variant (e.g. as missense, stop-gained, etc.). All 500k exome variants have an assigned consequence term in these files, so you can filter on, say, “LoF” or “synonymous” as needed[2]. On the other hand, to retrieve amino-acid changes (HGVSp) for those variants, you will need to run an annotation tool yourself. A straightforward approach is to download the UKB pVCF blocks (or query them in RAP) and run Ensembl VEP with the GRCh38 cache that matches UKB’s reference – this will output the HGVSp for each variant. UK Biobank’s community notes suggest that re-annotating one chromosome of exome data with VEP on a medium RAP instance costs only a few GBP and a few hours of compute[22][11]. If you only need a few specific variants’ amino acid changes, you might use Ensembl’s web VEP or UNC’s SNPdb, but for large-scale work it’s best done within RAP for data governance reasons. Third-party portals like Genebass can confirm the consequence class of a variant and sometimes mention the protein change in research results, but they do not provide a downloadable list of HGVSp for all variants. In practice, many researchers use a hybrid approach: filter variants by consequence using UKB’s files or Genebass (e.g. find all missense variants in a gene of interest), then run a focused VEP job to get the HGVSp for those variants only.
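
For the hybrid approach, the variants picked out by consequence need to be handed to VEP in a format it accepts. A minimal sketch follows, assuming variant IDs of the form chrom:pos:ref:alt (an assumption about the helper file) and writing VEP's default whitespace-separated input format for single-nucleotide variants only. Indels need different start/end coordinates and are skipped here rather than guessed at. The function names and sample IDs are hypothetical.

```python
def to_vep_input(variant_id):
    """Convert a 'chrom:pos:ref:alt' ID to a VEP default-format line (SNVs only).

    Returns None for anything that is not a single-base substitution.
    """
    chrom, pos, ref, alt = variant_id.split(":")
    if len(ref) != 1 or len(alt) != 1:
        return None  # indels need different coordinates; not handled in this sketch
    # Default VEP input columns: chromosome, start, end, ref/alt alleles, strand
    return f"{chrom} {pos} {pos} {ref}/{alt} +"

def build_vep_input(variant_ids):
    """Return VEP input lines for the convertible variants, in the given order."""
    lines = (to_vep_input(v) for v in variant_ids)
    return [line for line in lines if line is not None]

# Hypothetical variant IDs selected earlier by consequence class
selected = ["1:1010:C:T", "1:1030:CA:C", "2:2000:T:C"]
vep_lines = build_vep_input(selected)
print("\n".join(vep_lines))
```

Writing these lines to a file gives a small input for a focused VEP run, so only the variants of interest are re-annotated rather than a whole chromosome.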

Key Takeaways: The UKB WES dataset does include VEP/snpEff consequence terms for each variant (accessible via RAP in the VCF INFO or helper TSV), but it does not include the protein change notation by default[5]. You’ll have to generate amino-acid annotations yourself (re-annotating with VEP or similar) if needed[8]. Third-party resources can assist in the meantime – the Broad’s Genebass browser (app.genebass.org) is a convenient way to see UKB variant consequences and their associations, and the UKB’s own variant browsers on RAP or the public WGS portal let you look up specific SNPs by ID. All these resources are ultimately linked to UKB’s SNPs (by genomic position or rsID) and use the same reference genome build, so you can cross-reference among them. By leveraging the UKB-provided annotation tables for consequence filtering and using tools/portals for quick lookups, you can pinpoint variants of interest, then run a custom annotation to obtain HGVSp amino acid changes for reporting or analysis. This combination of UKB’s official data and third-party portals should cover most needs in retrieving variant consequence and protein-change information for the UKB exome dataset.

Sources:

  • UKB documentation on WES annotation and missing HGVSp[5][8]
  • UKB WES annotation file format (LoF, missense, etc.)[2][4]
  • Nature 2021 – UKB exome pipeline (snpEff consequences and pLOF definition)[1]
  • Broad Institute – Genebass browser release (public portal for UKB exome data)[15][20]
  • UKB RAP forum – using the Variant Browser for exome SNP queries[13]

[1] [17] [18] Exome sequencing and analysis of 454,787 UK Biobank participants | Nature

https://www.nature.com/articles/s41586-021-04103-z

[2] [3] [4] Burden testing with WES | Research Analysis Platform

https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/burden-testing-with-wes

[5] [6] [7] [8] [9] [10] [11] [12] [22] Whole Exome Variant Annotations – UK Biobank

https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/25417791390621-Whole-Exome-Variant-Annotations

[13] How do I filter a particular exome SNP rsID for a particular EID? – UK Biobank

https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/16019586431517-How-do-I-filter-a-particular-exome-SNP-rsID-for-a-particular-EID

[14] [21] Cohort browser for Genomics – UK Biobank

https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/17668482092189-Cohort-browser-for-Genomics

[15] [19] New online resource helps connect rare genetic variants to human health and disease | Broad Institute

https://www.broadinstitute.org/news/new-online-resource-helps-connect-rare-genetic-variants-human-health-and-disease

[16] [20] Sumstats - CTGCatalog

https://catalog.gwaslab.org/Sumstats_Sumstats_README/
