Best strategy for annotating and filtering VCF files using VEP on UKB RAP?

I've been struggling with a few issues regarding annotating VCF files on UKB RAP using HAIL. I am trying to annotate pVCF files that I subsetted using BCFtools and QC'd using VCFtools . The files appear to be able to annotate when I use the HAIL annotation database and VEP. But three issues I'm having is:

 

  1. I cannot filter the annotated hail tables as the headers of the tables contain array + dict expressions which are not as straightforward as string expressions
  2. Some of the annotations fail such as Clinvar gene summary and Clinvar variant summary.
  3. I cannot export to VCF ("Error summary: VCFParseError: unexpected end of line")

 

I ran VCF validator which throws up an error with regards to the format flag:

"FORMAT tag [RNC] expected different number of values (2),Could not validate the char value [..]"

 

The VCF line looks like this:

chr1 69026 chr1_69026_T_G T G 38 . AF=1e-06;AQ=38;AC=1;AN=226780 GT:DP:AD:GQ:PL:RNC

 

Usually when I annotate with VEP it's very straightforward and outputs a VCF file which is easily worked with and without a complicated format. This makes downstream filtering of variants in R extremely straightforward. I've attached the code I use for this. I've pasted the code I usually use below.

 

I'm wondering if anyone can help me with a more straightforward solution to annotating filtering variants in HAIL. Or, if there is an alternative method to annotate using VEP and it's plugins (i.e. via docker) on UKB RAP which is easier to work with than HAIL? Any suggestions would be much appreciated. This has really taken me forever to get to this stage even though it's a fairly basic analysis.

 

---------------------------------------------------------------------------------------------------

 

Here is the code I use to annotate the files:

 

(Instance type - mem3_ssd2_V2_x8, HAIL-VEP enabled, spark cluster x3 nodes)

 

import pyspark

import dxpy

import hail as hl

 

sc = pyspark.SparkContext()

spark = pyspark.sql.SparkSession(sc)

hl.init(sc=sc)

 

#import file

vcf_path = 'file:///mnt/project/FI_subset_QC_pVCFs/ukbXXXXX_c1_b0_v1.FI_subset.hwe10-6.maxmissing0.02.vcf.gz'

 

mt = hl.import_vcf(vcf_path, force=True, reference_genome='GRCh38', array_elements_required=False)

 

#annotate files

db = hl.experimental.DB(region='us', cloud='aws')

 

annotated_mt = db.annotate_rows_db(mt, 'CADD', 'clinvar_gene_summary', 'clinvar_variant_summary', 'dbNSFP_genes', 'dbNSFP_variants', 'dbSNP_rsid', 'gnomad_exome_sites')

 

annotated_mt2 = hl.vep(annotated_mt, "file:///mnt/project/vep-GRCh38.json")

 

#move annotations into INFO column in VCF - CADD scores example

annotated_mt2 = annotated_mt2.annotate_rows(info=annotated_mt2.info.annotate(CADD_PHRED=annotated_mt2.CADD["PHRED_score"]))

 

#export VCF

hl.export_vcf(annotated_mt2,"file:///opt/notebooks/test.annotate.vcf.bgz")

 

Here is the code I use to annotate VCFs on my local server:

 

date ; time /path/to/file/ensembl-vep/vep -i test.vcf.gz --force_overwrite --vcf -o test.annotated.vcf --assembly GRCh38 --fork 15 --offline --dir /path/to/file/.vep --symbol --hgvs --numbers --canonical --per_gene --no_check_variants_order --plugin CADD,/path/to/file/reference/CADD_GRCh38_v1.6/whole_genome_SNVs.tsv.gz,path/to/file/reference/CADD_GRCh38_v1.6/gnomad.genomes.r3.0.indel.tsv.gz --custom /path/to/file/reference/gnomAD_GRCh38_v2.1.1/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.gz,gnomAD_v2_exome,vcf,exact,0,AF_nfe --plugin dbNSFP,/path/to/file/reference/dbNSFP/dbNSFP4.1a_grch38.gz,Ensembl_transcriptid,SIFT4G_score,Polyphen2_HDIV_score,MPC_score,REVEL_score,FATHMM_score

 

{@005t0000008A1QgAAK}? 

Comments

1 comment

Please sign in to leave a comment.