I want to annotate the files but can't find documentation on actually loading them into the environment. The documentation below gives a rough (not precise) outline of how to annotate, but it does not specify how to load files from the DNAnexus platform into the Python environment:
https://documentation.dnanexus.com/user/jupyter-notebooks/dxjupyterlab-spark-cluster
Comments
Hello,
I think that dxdata is mostly intended for accessing phenotype data, not raw files (such as VCFs). I would check whether the official Hail documentation recommends a function for working with pVCF files. I looked at https://hail.is/docs/0.2/methods/impex.html and found the function import_vcf().
Actually, I played with it a little and was able to import a UKB RAP pVCF in Python, and I was also able to reproduce some parts of this publicly available Hail notebook: https://docs.databricks.com/_static/notebooks/genomics/hail-overview.html
Here is my notebook code:
import pyspark
import hail as hl
# Start Spark and attach Hail to the existing SparkContext
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
hl.init(sc=sc)
# Replace with an actual path to an existing pVCF file
vcf_path = 'file:///mnt/project/Bulk/.../XYZ.vcf.gz'
vcf_path
# force_bgz=True reads the .vcf.gz file as block-gzipped
mt = hl.import_vcf(vcf_path, force_bgz=True, reference_genome='GRCh38')
mt.rows().select().show(5)
# Annotate with VEP, following https://documentation.dnanexus.com/user/jupyter-notebooks/dxjupyterlab-spark-cluster#using-vep-with-hail
annotated_mt = hl.vep(mt, "file:///mnt/project/vep-GRCh38.json")
Useful information about loading and writing Hail data can also be found here: https://discuss.hail.is/t/ukbiobank-research-analysis-platform-rap-matrixtable-write-issues/2256/11
One more note: Here is a screenshot of the JupyterLab env I used.
Running annotated_mt.vep.show() displayed the annotated Hail MatrixTable.
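As a side note on the force_bgz=True argument in the notebook code above: Hail does not read a plain .gz file in parallel, so the flag declares that the file is actually block-gzipped. The sketch below shows one way to pick that option from the file extension. The helper names import_vcf_options and load_vcf are hypothetical, and it assumes the pVCF files are block-gzipped despite their .gz extension, as the notebook code implies.

```python
def import_vcf_options(path, reference_genome="GRCh38"):
    """Build keyword arguments for hl.import_vcf (hypothetical helper)."""
    options = {"reference_genome": reference_genome}
    if path.endswith(".gz"):
        # Assumption: the .gz file is really block-gzipped (bgzip), so
        # force_bgz=True lets Hail read it as such.
        options["force_bgz"] = True
    return options


def load_vcf(path):
    """Import a VCF into a Hail MatrixTable using the options above."""
    # Imported lazily so import_vcf_options can be used without Hail installed.
    import hail as hl

    return hl.import_vcf(path, **import_vcf_options(path))
```

Files ending in .bgz get no force_bgz flag here, since Hail already treats that extension as block-gzipped.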
Hi Ondrej,
Thanks for the detailed reply. I actually found another way of doing this following some code from Dan King:
https://discuss.hail.is/t/how-should-i-use-hail-on-the-dnanexus-rap/2277
I think this code is similar to the link you posted - https://discuss.hail.is/t/ukbiobank-research-analysis-platform-rap-matrixtable-write-issues/2256/11
This seems to work. I think the issue I was really having was how to work with the project file system in JupyterLab without having to download the files. The example given on the DNAnexus Spark cluster page is from the Hail tutorial on working with the 1000 Genomes Project, which isn't the same as working with UKB files on RAP, so that's where I got confused (https://documentation.dnanexus.com/user/jupyter-notebooks/dxjupyterlab-spark-cluster#using-vep-with-hail).
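For anyone else stuck on the same point, here is a minimal sketch of browsing the mounted project directory from the notebook and turning the files into file:// paths like the one used earlier in this thread. It assumes the project is mounted under /mnt/project, as in Ondrej's code. The function name list_vcf_uris and the *.vcf.gz pattern are my own choices for illustration, not part of any DNAnexus or Hail API.

```python
from pathlib import Path


def list_vcf_uris(directory, pattern="*.vcf.gz"):
    """Return sorted file:// URIs for files in `directory` matching `pattern`.

    Hypothetical helper: it only inspects the local (mounted) file system
    and does not download or copy anything.
    """
    root = Path(directory).resolve()
    # An absolute POSIX path starts with "/", so the result is "file:///...".
    return sorted("file://" + str(p) for p in root.glob(pattern) if p.is_file())
```

The returned list can be checked by eye before use. hl.import_vcf accepts either a single path or a list of paths, so a subset of the list could be passed to it directly.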