Whole Exome Variant Annotations
Hi everyone,
I’m working with UKB whole exome sequencing data and using the provided snpEff annotations in the helper files, but I noticed they don’t include amino acid changes for the variants.
Has anyone found a way to extract this information from the existing UKB annotations, or would re-annotating with VEP/ANNOVAR be necessary? Any advice on the best approach would be appreciated!
Thanks!
Comments
Short answer
The “helper” tables that UK Biobank distributes with the WES VCFs intentionally keep only the minimal ANN/CSQ columns that are needed for most high-level filtering. UKB ran snpEff v5 with the -noProtein flag, so the ANN field in the VCF (and the tab-delimited helper files that are derived from it) stops at the CDS_position column and never includes the Amino_acids / HGVSp tokens. In other words, the information you want was never written, so it cannot be recovered from those files.
To obtain amino-acid (protein) changes you therefore have two practical options:
Option 1:
1. Re-annotate the VCF yourself with VEP or snpEff
When it makes sense:
• You need precise HGVS.p for variant-level QC, burden tests or reporting.
• You are comfortable paying ~£1–3 of compute per chromosome.
Sketch of the workflow on UKB-RAP:
• dx download a chromosome-level pVCF to local disk inside the job.
• vep --cache --assembly GRCh38 --everything --offline --vcf (or snpEff -canon -hgvs -fastaRef …).
• dx upload the re-annotated VCF (or a stripped-down TSV with the fields you need).
Option 2:
2. Re-compute only the protein consequence in a lightweight pass
When it makes sense: You just need the AA substitution for LoF/Missense filters.
Sketch of the workflow on UKB-RAP:
Which should you choose?
-noProtein) is the most robust path, and takes only a few hours on a RAP medium instance.
Tips for either route
• The FASTA (GRCh38_full_analysis_set_plus_decoy_hla.fa) and GTF (Homo_sapiens.GRCh38.104.gtf) are already in every RAP project under /Bulk/reference_files. Point VEP or snpEff to those to guarantee coordinate compatibility.
• UKB pVCFs are large; processing in 1 GB slices keeps memory and scratch-disk needs modest, letting you stay on mem1_ssd1_v2_x8 (8 vCPU / 16 GB RAM).
• Keep the new fields separate. Rather than writing a huge replacement VCF, many groups save a four-column TSV:
• Joining on CHROM-POS-REF-ALT later is trivial with bcftools annotate -a.
• Store your Docker/WDL tool under the project’s /Tools folder so colleagues can reuse it without re-pulling images each time.
Bottom line
Because UK Biobank’s default snpEff run suppressed the protein-change tokens, there is nothing to “extract” from the helper files. You will need to generate the amino-acid changes yourself, either by (a) re-running a full annotator such as VEP/snpEff or (b) doing a quick in-house translation from CDS to protein for the variants you care about.
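For route (b), here is a minimal sketch of what a "quick in-house translation" could look like for single-nucleotide variants. It is only an illustration, not what UKB or any annotator actually runs: the function name protein_change and the toy CDS are made up, the inputs are assumed to be on the transcript strand already (so minus-strand genes need reverse-complementing first), and it ignores indels, splice effects and special cases such as start-loss, which a real annotator handles properly.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# One-letter to three-letter amino-acid codes ("*" is a stop codon).
THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Ter",
}


def protein_change(cds, cds_pos, alt):
    """Return an HGVS-style protein change for a single-base substitution.

    cds     -- coding sequence on the transcript strand, starting at the ATG
    cds_pos -- 1-based position of the substituted base within the CDS
    alt     -- alternate base, also on the transcript strand
    """
    cds, alt = cds.upper(), alt.upper()
    if not 1 <= cds_pos <= len(cds):
        raise ValueError("cds_pos is outside the coding sequence")
    if len(alt) != 1 or alt not in BASES:
        raise ValueError("only single-base substitutions are supported")

    idx = cds_pos - 1
    offset = idx % 3                      # position of the base inside its codon
    codon_start = idx - offset
    ref_codon = cds[codon_start:codon_start + 3]
    if len(ref_codon) < 3:
        raise ValueError("incomplete codon at the end of the CDS")

    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    aa_pos = codon_start // 3 + 1         # 1-based amino-acid position

    if ref_aa == alt_aa:
        return f"p.{THREE[ref_aa]}{aa_pos}="   # synonymous
    return f"p.{THREE[ref_aa]}{aa_pos}{THREE[alt_aa]}"


toy_cds = "ATGGCCTGGTAA"                  # Met-Ala-Trp-stop, invented for the example
print(protein_change(toy_cds, 5, "T"))    # missense in codon 2
print(protein_change(toy_cds, 9, "A"))    # stop-gained in codon 3
print(protein_change(toy_cds, 6, "T"))    # synonymous in codon 2
```

In practice you would pull the CDS for the chosen transcript from the FASTA and GTF, look up the CDS position for each variant, and write the result to a small TSV keyed on CHROM-POS-REF-ALT. If you need anything beyond simple missense/nonsense calls, route (a) is the safer choice.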
Hi Dr. Mc. Ninja,
thank you for your helpful post on variant annotations.
Thank you Dr. Mc. Ninja for the helpful information!
In retrospect, this can't be right… e.g. in the cohort browser you can see SNPs with VEP annotation… Lemmy pester my assistant again…
Her short form:
UKB WES: Consequence vs Amino-acid (HGVSp) — what’s included, what’s not, and how to proceed
TL;DR
ANN/CSQ) in the pVCFs and RAP “helper” tables, e.g., LoF, missense (with 0/5, ≥1/5, 5/5 deleteriousness tiers), synonymous.
-noProtein).
-noProtein) using GRCh38 resources, or script translations if you only need basics.
Practical workflow
Key references
https://www.nature.com/articles/s41586-021-04103-z
https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/burden-testing-with-wes
https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/25417791390621-Whole-Exome-Variant-Annotations
https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/16019586431517-How-do-I-filter-a-particular-exome-SNP-rsID-for-a-particular-EID
https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/17668482092189-Cohort-browser-for-Genomics
https://www.broadinstitute.org/news/new-online-resource-helps-connect-rare-genetic-variants-human-health-and-disease
https://catalog.gwaslab.org/Sumstats_Sumstats_README/
Would be good for this to be validated for accuracy.