Best strategy for annotating and filtering VCF files using VEP on UKB RAP?
I've been struggling with a few issues regarding annotating VCF files on UKB RAP using HAIL. I am trying to annotate pVCF files that I subsetted using BCFtools and QC'd using VCFtools . The files appear to be able to annotate when I use the HAIL annotation database and VEP. But three issues I'm having is:
- I cannot filter the annotated hail tables as the headers of the tables contain array + dict expressions which are not as straightforward as string expressions
- Some of the annotations fail such as Clinvar gene summary and Clinvar variant summary.
- I cannot export to VCF ("Error summary: VCFParseError: unexpected end of line")
I ran VCF validator which throws up an error with regards to the format flag:
"FORMAT tag [RNC] expected different number of values (2),Could not validate the char value [..]"
The VCF line looks like this:
chr1 69026 chr1_69026_T_G T G 38 . AF=1e-06;AQ=38;AC=1;AN=226780 GT:DP:AD:GQ:PL:RNC
Usually when I annotate with VEP it's very straightforward and outputs a VCF file which is easily worked with and without a complicated format. This makes downstream filtering of variants in R extremely straightforward. I've attached the code I use for this. I've pasted the code I usually use below.
I'm wondering if anyone can help me with a more straightforward solution to annotating filtering variants in HAIL. Or, if there is an alternative method to annotate using VEP and it's plugins (i.e. via docker) on UKB RAP which is easier to work with than HAIL? Any suggestions would be much appreciated. This has really taken me forever to get to this stage even though it's a fairly basic analysis.
---------------------------------------------------------------------------------------------------
Here is the code I use to annotate the files:
(Instance type - mem3_ssd2_V2_x8, HAIL-VEP enabled, spark cluster x3 nodes)
import pyspark
import dxpy
import hail as hl
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
hl.init(sc=sc)
#import file
vcf_path = 'file:///mnt/project/FI_subset_QC_pVCFs/ukbXXXXX_c1_b0_v1.FI_subset.hwe10-6.maxmissing0.02.vcf.gz'
mt = hl.import_vcf(vcf_path, force=True, reference_genome='GRCh38', array_elements_required=False)
#annotate files
db = hl.experimental.DB(region='us', cloud='aws')
annotated_mt = db.annotate_rows_db(mt, 'CADD', 'clinvar_gene_summary', 'clinvar_variant_summary', 'dbNSFP_genes', 'dbNSFP_variants', 'dbSNP_rsid', 'gnomad_exome_sites')
annotated_mt2 = hl.vep(annotated_mt, "file:///mnt/project/vep-GRCh38.json")
#move annotations into INFO column in VCF - CADD scores example
annotated_mt2 = annotated_mt2.annotate_rows(info=annotated_mt2.info.annotate(CADD_PHRED=annotated_mt2.CADD["PHRED_score"]))
#export VCF
hl.export_vcf(annotated_mt2,"file:///opt/notebooks/test.annotate.vcf.bgz")
Here is the code I use to annotate VCFs on my local server:
date ; time /path/to/file/ensembl-vep/vep -i test.vcf.gz --force_overwrite --vcf -o test.annotated.vcf --assembly GRCh38 --fork 15 --offline --dir /path/to/file/.vep --symbol --hgvs --numbers --canonical --per_gene --no_check_variants_order --plugin CADD,/path/to/file/reference/CADD_GRCh38_v1.6/whole_genome_SNVs.tsv.gz,path/to/file/reference/CADD_GRCh38_v1.6/gnomad.genomes.r3.0.indel.tsv.gz --custom /path/to/file/reference/gnomAD_GRCh38_v2.1.1/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.gz,gnomAD_v2_exome,vcf,exact,0,AF_nfe --plugin dbNSFP,/path/to/file/reference/dbNSFP/dbNSFP4.1a_grch38.gz,Ensembl_transcriptid,SIFT4G_score,Polyphen2_HDIV_score,MPC_score,REVEL_score,FATHMM_score
{@005t0000008A1QgAAK}?
Comments
1 comment
https://community.dnanexus.com/s/question/0D5t000003yHOfACAW/how-to-run-vep-docker-image-using-swissarmyknife
Please sign in to leave a comment.