How do I load .p.vcf.gz files into the Python environment on the Spark cluster? Do I need to use dxdata?

01 April 2022 00:00
3 comments

I want to annotate the files but can't find documentation on actually loading a file into the environment. The page below gives a rough (not precise) outline of how to annotate, but it does not specify how to load files from the DNAnexus platform into the Python environment: https://documentation.dnanexus.com/user/jupyter-notebooks/dxjupyterlab-spark-cluster

Comments


Ondrej Klempir DNAnexus Team
- 04 April 2022 12:10
Hello,

I think that dxdata is mostly intended for accessing phenotype data, not raw files (such as VCFs). I would check whether the official Hail documentation recommends a function for working with pVCF files. I looked here: https://hail.is/docs/0.2/methods/impex.html and found the function import_vcf().

I played with it a little and was able to import a UKB RAP pVCF in Python. I was also able to reproduce some parts of this publicly available Hail notebook: https://docs.databricks.com/_static/notebooks/genomics/hail-overview.html

Here is my notebook code:

import pyspark
import hail as hl

# Start Spark and hand its context to Hail
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
hl.init(sc=sc)

# Path to a pVCF on the mounted project file system
vcf_path = 'file:///mnt/project/Bulk/.../XYZ.vcf.gz' # replace with some actual path and existing pVCF file
vcf_path

# force_bgz=True tells Hail to read the .gz file as block-gzipped (BGZ)
mt = hl.import_vcf(vcf_path, force_bgz=True, reference_genome='GRCh38')
mt.rows().select().show(5) # show the key fields of the first 5 rows

# Annotate with VEP using the JSON config file
annotated_mt = hl.vep(mt, "file:///mnt/project/vep-GRCh38.json") # followed the doc page https://documentation.dnanexus.com/user/jupyter-notebooks/dxjupyterlab-spark-cluster#using-vep-with-hail
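The /mnt/project prefix in the path above is what lets the notebook read project files in place. Below is a minimal sketch of a helper for building such paths, assuming the project is mounted at /mnt/project as in that example. The name project_file_uri and the folder in the usage line are hypothetical, not part of dxdata or Hail.

```python
from pathlib import PurePosixPath

# Assumption: the project is mounted at /mnt/project, as in the example path above.
PROJECT_MOUNT = PurePosixPath("/mnt/project")


def project_file_uri(project_path: str) -> str:
    """Turn a project-relative path into a file:// URI under the mount point."""
    rel = PurePosixPath(project_path.lstrip("/"))
    # Refuse paths that would climb out of the mounted project.
    if ".." in rel.parts:
        raise ValueError(f"path must stay inside the project: {project_path!r}")
    return "file://" + str(PROJECT_MOUNT / rel)


# Hypothetical folder and file name, for illustration only.
vcf_path = project_file_uri("/Bulk/example_folder/XYZ.vcf.gz")
print(vcf_path)
```

The returned string has the same form as the vcf_path used above, so it could be passed to hl.import_vcf() in the same way.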

Useful information about loading and writing Hail data can also be found here: https://discuss.hail.is/t/ukbiobank-research-analysis-platform-rap-matrixtable-write-issues/2256/11


Ondrej Klempir DNAnexus Team
- 04 April 2022 12:47
Running annotated_mt.vep.show() displayed the annotated Hail matrix table.

Former User of DNAx Community_4
- 04 April 2022 13:08
Hi Ondrej,

Thanks for the detailed reply. I actually found another way of doing this following some code from Dan King:
https://discuss.hail.is/t/how-should-i-use-hail-on-the-dnanexus-rap/2277

I think this code is similar to the code at the link you posted: https://discuss.hail.is/t/ukbiobank-research-analysis-platform-rap-matrixtable-write-issues/2256/11

This seems to work. I think the issue I was really having was how to work with the project file system in JupyterLab without having to download the files. The example given on the DNAnexus Spark cluster page is from the Hail tutorial on working with 1000 Genomes Project data, which isn't the same as working with UKB files on RAP, so that's where I got confused (https://documentation.dnanexus.com/user/jupyter-notebooks/dxjupyterlab-spark-cluster#using-vep-with-hail).
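For anyone else working with the mounted project file system, here is a rough sketch of listing the compressed VCFs in one folder as file:// URIs without downloading anything. It uses only the Python standard library. The function name list_vcf_uris and the folder in the commented usage line are made up for illustration, and it assumes the folder is visible as an ordinary local directory (for example under /mnt/project).

```python
from pathlib import Path
from typing import List


def list_vcf_uris(directory: str, pattern: str = "*.vcf.gz") -> List[str]:
    """Return sorted file:// URIs for files in `directory` whose names match `pattern`."""
    folder = Path(directory).resolve()
    # glob() is not recursive here: only files directly inside the folder are listed.
    return sorted("file://" + str(p) for p in folder.glob(pattern) if p.is_file())


# Hypothetical usage on the cluster (folder name is a placeholder):
# uris = list_vcf_uris("/mnt/project/Bulk/example_folder")
```

hl.import_vcf() accepts a list of paths as well as a single path, so a list like this could be passed to it directly.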
