Whole Exome Variant Annotations

Anushka Sinha

Hi everyone,

I’m working with UKB whole exome sequencing data and using the provided snpEff annotations in the helper files, but I noticed they don’t include amino acid changes for the variants.

Has anyone found a way to extract this information from the existing UKB annotations, or would re-annotating with VEP/ANNOVAR be necessary? Any advice on the best approach would be appreciated!

Thanks!

Comments

4 comments

  • Comment author
    Dr. Mc. Ninja

    Short answer

    The “helper” tables that UK Biobank distributes with the WES VCFs intentionally keep only the minimal ANN/CSQ columns needed for most high-level filtering. UKB ran snpEff v5 with the -noProtein flag, so the ANN field in the VCF (and the tab-delimited helper files derived from it) stops at the CDS_position column and never includes the Amino_acids / HGVSp tokens. In other words, the information you want was never written, so it cannot be recovered from those files.
    To obtain amino-acid (protein) changes you therefore have two practical options:

    Option 1: Re-annotate the VCF yourself with VEP or snpEff

    When it makes sense:
    • You need precise HGVS.p for variant-level QC, burden tests or reporting.
    • You are comfortable paying ~£1–3 of compute per chromosome.

    Sketch of the workflow on UKB-RAP:

    1. Spin up a Cloud Workstation or run a WDL/CWL pipeline.
    2. dx download a chromosome-level pVCF to local disk inside the job.
    3. Run, e.g. vep --cache --assembly GRCh38 --everything --offline --vcf (or snpEff -canon -hgvs -fastaRef …).
    4. dx upload the re-annotated VCF (or a stripped-down TSV with the fields you need).
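As a rough illustration of step 3, the VEP call can be assembled in Python and handed to subprocess inside the job. This is a minimal sketch, not a tested RAP pipeline: the file names are hypothetical, the flags are the ones quoted above plus VEP's --input_file and --output_file, and it assumes VEP and a GRCh38 cache are already installed on the instance.

```python
import shlex


def build_vep_command(input_vcf, output_vcf):
    """Return the VEP argument list for one chromosome-level VCF."""
    return [
        "vep",
        "--input_file", input_vcf,    # hypothetical local path after dx download
        "--output_file", output_vcf,  # hypothetical output path to dx upload later
        "--cache", "--offline",       # use the local cache, no network lookups
        "--assembly", "GRCh38",
        "--everything",               # includes HGVS and SIFT/PolyPhen output
        "--vcf",                      # write VCF with a CSQ INFO field
    ]


cmd = build_vep_command("chr21.vcf.gz", "chr21.vep.vcf")
print(shlex.join(cmd))
# To actually run it inside the job:
#   import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command as a list (rather than one shell string) avoids quoting problems if file names contain spaces.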

    Option 2: Re-compute only the protein consequence in a lightweight pass

    When it makes sense: You just need the AA substitution for LoF/missense filters.

    Sketch of the workflow on UKB-RAP:

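    1. Use bcftools csq (which predicts consequences, including amino-acid changes, from a GFF3 annotation and the reference FASTA) or a short Python script in a RAP “Swiss-Army-Knife” applet. Note that bcftools +split-vep only extracts fields from an existing VEP CSQ annotation; it does not compute new ones.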
    2. Read REF/ALT and the transcript’s CDS FASTA (Ensembl 104, same build UKB used).
    3. Translate in-silico and write a two-column “variant_id → HGVSp” lookup table. This runs in minutes per chromosome.
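The in-silico translation in steps 2 and 3 could look roughly like the sketch below. It handles only single-nucleotide substitutions, and it assumes you already have the transcript's CDS and the variant's 1-based CDS position and alternate base on the transcript strand (so minus-strand genes must be reverse-complemented first). The function name and the toy CDS fragment are illustrative, not taken from any UKB file; indels, splice variants and start-codon changes need a real annotator.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "Ter",
}


def snv_to_hgvsp(cds, cds_pos, alt):
    """Translate one SNV into an HGVS-style protein change.

    cds     -- coding sequence on the transcript strand, starting at ATG
    cds_pos -- 1-based position of the substituted base within the CDS
    alt     -- alternate base on the transcript strand
    """
    idx = cds_pos - 1
    codon_start = idx - idx % 3            # 0-based start of the affected codon
    offset = idx - codon_start             # position of the SNV inside the codon
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    aa_pos = codon_start // 3 + 1          # 1-based residue number
    if ref_aa == alt_aa:
        return f"p.{THREE[ref_aa]}{aa_pos}="   # synonymous
    return f"p.{THREE[ref_aa]}{aa_pos}{THREE[alt_aa]}"


# Toy 13-codon CDS fragment, for illustration only.
toy_cds = "ATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGC"
missense = snv_to_hgvsp(toy_cds, 35, "A")
synonymous = snv_to_hgvsp(toy_cds, 36, "C")
print(missense, synonymous)
```

Writing one such string per variant ID gives the two-column lookup table described in step 3.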

    Which should you choose?

    • If you only need simple filters such as “is this variant missense/LoF and which codon changes?”, the lightweight translation route is fast and cheap.
    • If you need full HGVS.p notation, SIFT/PolyPhen scores or canonical-transcript selection, re-running a full annotator (VEP or snpEff without -noProtein) is the most robust path, and takes only a few hours on a RAP medium instance.

    Tips for either route

    1. Reuse the reference/assets UKB already ship.
      The FASTA (GRCh38_full_analysis_set_plus_decoy_hla.fa) and GTF (Homo_sapiens.GRCh38.104.gtf) are already in every RAP project under /Bulk/reference_files. Point VEP or snpEff to those to guarantee coordinate compatibility.
    2. Work chromosome-by-chromosome.
      UKB pVCFs are large; processing in 1 GB slices keeps memory and scratch-disk needs modest, letting you stay on mem1_ssd1_v2_x8 (8 vCPU / 16 GB RAM).
    3. Keep the new fields separate.
      Rather than writing a huge replacement VCF, many groups save a five-column TSV:

      chrom  pos       ref  alt  hgvsp
      1      55516888  G    A    p.Gly12Asp
      

      Joining on CHROM-POS-REF-ALT later is trivial with bcftools annotate -a.

    4. Cache the annotation container.
      Store your Docker/WDL tool under the project’s /Tools folder so colleagues can reuse it without re-pulling images each time.
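For tip 3, if you prefer to do the join outside bcftools, a keyed lookup in plain Python is enough for modest variant lists. This is a minimal sketch that assumes the TSV has the header shown in tip 3 (chrom, pos, ref, alt, hgvsp); the function name and the second, unmatched variant are made up for illustration.

```python
import csv
import io


def load_hgvsp_lookup(tsv_text):
    """Build a {(chrom, pos, ref, alt): hgvsp} dict from tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        (row["chrom"], int(row["pos"]), row["ref"], row["alt"]): row["hgvsp"]
        for row in reader
    }


# In practice this text would come from open("lookup.tsv").read().
lookup_tsv = "chrom\tpos\tref\talt\thgvsp\n1\t55516888\tG\tA\tp.Gly12Asp\n"
lookup = load_hgvsp_lookup(lookup_tsv)

variants = [("1", 55516888, "G", "A"), ("1", 55516900, "C", "T")]
# Variants without an entry in the table get None.
annotated = [(variant, lookup.get(variant)) for variant in variants]
for variant, hgvsp in annotated:
    print(variant, hgvsp)
```

Keying on the full CHROM-POS-REF-ALT tuple, rather than position alone, keeps multi-allelic sites from colliding.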

    Bottom line

    Because UK Biobank’s default snpEff run suppressed the protein-change tokens, there is nothing to “extract” from the helper files. You will need to generate the amino-acid changes yourself, either by (a) re-running a full annotator such as VEP/snpEff or (b) doing a quick in-house translation from CDS to protein for the variants you care about.

  • Comment author
    Rachael W (UKB Community team, Data Analyst)

    Hi Dr. Mc. Ninja,

    thank you for your helpful post on variant annotations.

  • Comment author
    Anushka Sinha

    Thank you Dr. Mc. Ninja for the helpful information!

  • Comment author
    Dr. Mc. Ninja

    In retrospect, this can't be right: in the cohort browser, for example, you can see SNPs with VEP annotations. Let me pester my assistant again…

    Her short form:

    UKB WES: Consequence vs Amino-acid (HGVSp) — what’s included, what’s not, and how to proceed

    TL;DR

    • Included: UK Biobank WES comes pre-annotated with variant consequence terms (snpEff/Ensembl ANN/CSQ) in the pVCFs and RAP “helper” tables — e.g., LoF, missense (with 0/5, ≥1/5, 5/5 deleteriousness tiers), synonymous.
    • Not included: Protein changes (HGVSp) are not in the UKB WES deliverables (snpEff was run with -noProtein).
    • To get HGVSp: Re-annotate on RAP with Ensembl VEP (or snpEff without -noProtein) using GRCh38 resources, or script translations if you only need basics.
    • Where to browse: RAP Variant Browser (exomes, GRCh38) for quick consequence/frequency lookups; the public Allele Frequency Browser is WGS-only; Genebass is great for consequence classes + association signals (but not bulk HGVSp).

    Practical workflow

    1. Filter by consequence using the UKB WES annotation file or RAP Variant Browser (e.g., pull LoF or missense in a gene for burden testing).
    2. Re-annotate the hits with VEP on RAP to obtain HGVSp for reporting. (For a few variants, web VEP is fine; for scale, run VEP in RAP. Costs/time are modest.)
    3. (Optional) Use Genebass to sanity-check consequence class and see phenotype associations; download summary stats if needed.

    Key references

    1. UKB WES pipeline & pLOF definition (snpEff consequences): Nature 2021 — Exome sequencing of 454,787 UKB participants.
      https://www.nature.com/articles/s41586-021-04103-z
    2. RAP guide: Burden testing with WES (LoF/missense/synonymous categories, helper tables).
      https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/burden-testing-with-wes
    3. UKB community note: Whole Exome Variant Annotations (HGVSp omitted; how to re-annotate; GRCh38 files; cost/time tips).
      https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/25417791390621-Whole-Exome-Variant-Annotations
    4. RAP Variant Browser usage (lookup exome variants by rsID/position).
      https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/16019586431517-How-do-I-filter-a-particular-exome-SNP-rsID-for-a-particular-EID
    5. Cohort Browser note (public browser is WGS-only; exome queries via RAP).
      https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/17668482092189-Cohort-browser-for-Genomics
    6. Genebass (Broad/Neale lab) — Portal + summary stats for UKB exome associations.
      https://www.broadinstitute.org/news/new-online-resource-helps-connect-rare-genetic-variants-human-health-and-disease
      https://catalog.gwaslab.org/Sumstats_Sumstats_README/

     

    Would be good for this to be validated for accuracy.
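One way to check the "HGVSp not included" claim directly is to look at a few ANN entries from the files themselves. The sketch below assumes the INFO column carries a standard snpEff ANN field with the usual 16 pipe-separated sub-fields, where HGVS.p is the 11th; whether the UKB helper files follow that layout is exactly what would need confirming. The gene and transcript IDs in the example strings are placeholders.

```python
# Sub-field order of a standard snpEff ANN entry (assumed layout).
ANN_FIELDS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank",
    "HGVS.c", "HGVS.p", "cDNA.pos/cDNA.length", "CDS.pos/CDS.length",
    "AA.pos/AA.length", "Distance", "ERRORS/WARNINGS/INFO",
]


def parse_ann(info):
    """Return one dict per ANN entry found in a VCF INFO string."""
    for item in info.split(";"):
        if item.startswith("ANN="):
            return [
                dict(zip(ANN_FIELDS, entry.split("|")))
                for entry in item[len("ANN="):].split(",")
            ]
    return []


def has_protein_change(info):
    """True if any ANN entry carries a non-empty HGVS.p value."""
    return any(entry.get("HGVS.p") for entry in parse_ann(info))


# Placeholder INFO strings: one with HGVS.p filled in, one with it left empty.
with_p = ("AC=1;ANN=A|missense_variant|MODERATE|GENE1|ENSG0|transcript|ENST0|"
          "protein_coding|2/5|c.35G>A|p.Gly12Asp|100/500|35/300|12/99||")
without_p = ("AC=1;ANN=A|missense_variant|MODERATE|GENE1|ENSG0|transcript|ENST0|"
             "protein_coding|2/5|c.35G>A||100/500|35/300|||")
print(has_protein_change(with_p), has_protein_change(without_p))
```

Running has_protein_change over the INFO column of a few hundred real records would show quickly whether the protein field is populated or empty.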
