Hi folks, I have a basic question for you: how can we access the exome sequencing data (preferably as the VCFs) using HAIL on the UKB RAP?

10 August 2022 00:00
4 comments

I was watching your video on the documentation site, and it says that the data should be available as a MatrixTable. I haven't been able to find any further documentation on the site. Best, Jeremy

Comments

Ondrej Klempir DNAnexus Team
- 10 August 2022 09:19
As far as I know, a Hail representation of the pVCFs is not available on the UKB RAP at the moment. One option is to first load the pVCF(s) into Hail, save the result as a Hail MatrixTable (using the dnax connector), and then use that MatrixTable in your analyses.

I found a couple of examples of how to work with a Hail MatrixTable on RAP here: https://discuss.hail.is/t/ukbiobank-research-analysis-platform-rap-matrixtable-write-issues/2256
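
To make the "load the pVCF, then save it as a MatrixTable" route a bit more concrete, here is a minimal sketch. It assumes Hail has already been initialised on a Spark cluster (hl.init(sc=sc)), that the exome pVCFs are block-gzipped and on GRCh38, and that a database object already exists to write into. The helper names dnax_url and vcf_to_matrix_table, and the arguments you pass in, are hypothetical and only there for illustration.

```python
def dnax_url(database_id, table_name="matrix.mt"):
    """Build the dnax:// URL that Hail reads from and writes to."""
    return "dnax://" + database_id + "/" + table_name


def vcf_to_matrix_table(vcf_path, database_id, table_name="matrix.mt"):
    """Import one pVCF into Hail and persist it as a MatrixTable."""
    # Imported lazily so dnax_url can be used without Hail installed.
    import hail as hl

    # Assumptions: the file is block-gzipped (force_bgz) and aligned to GRCh38.
    mt = hl.import_vcf(
        vcf_path,
        force_bgz=True,
        reference_genome="GRCh38",
        array_elements_required=False,
    )

    # Write once, then read back so later steps use the stored MatrixTable.
    url = dnax_url(database_id, table_name)
    mt.write(url)
    return hl.read_matrix_table(url)
```

You would call vcf_to_matrix_table with the path to a single pVCF and the ID of your database object, so only that one file is converted rather than the whole dataset.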

Former User of DNAx Community_7
- 11 August 2022 06:06
Thank you Ondrej! I wonder if it is worth the trouble then. I have two follow-up questions for you.

One, if we do try to do this with HAIL, then it seems like the relevant code from the link you provided is

import pyspark
import dxpy
import hail as hl

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

hl.init(sc=sc)

spark.sql("CREATE DATABASE test LOCATION 'dnax://'")

id = dxpy.find_one_data_object(name="test")["id"]
hail_matrixtable.write("dnax://" + id + "/matrix.mt")

mt = hl.read_matrix_table("dnax://database-XX/matrix.mt")

I don't understand the code beginning with "id = dxpy.find_one_data_object". I presume this line can be used to find a relevant pVCF, but I am not sure how to use dxpy. On this documentation page, there is code to point it to the entire dataset, but I imagine that writing the entire dataset is going to be a prohibitively big task (which is probably related to the warning from the link you provided). Any thoughts here?

Two, could you walk me through/point me to an alternative (and preferably "best practices") solution? Just for overall context, my goal is to find all the variants within a genomic region I specify. It seems like I should be able to find which pVCFs correspond to which region using the qc_metrics_graphtyper_v2.7.1_qc.tab.gz file according to your post on this page and this page, but I'm too much of a newb to even be able to locate this file.

Best,
Jeremy

Ondrej Klempir DNAnexus Team
- 19 August 2022 13:53
1) The dxpy Python library provides Python bindings for interacting with the DNAnexus Platform via its API. Among other things, it provides helper functions that, as in this case, automatically look up the object ID of the newly created database object "test" ("test" is an object in the parent permanent project, i.e. the place you dispensed UKB RAP data into). This works because each file on DNAnexus has its own unique object ID.

Of course, I assume you could also "hardcode" the object ID here by assigning it directly to the variable id. You could also list the contents of the project, e.g. via "dx ls -l", and read the object IDs from there.

I do not think that hail_matrixtable.write("dnax://" + id + "/matrix.mt") will write the entire dataset. In my understanding, it should just write your Hail matrix into the empty, newly created database object.

The documentation for dxpy is here: http://autodoc.dnanexus.com/bindings/python/current/index.html
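
As a rough illustration of the two options above (look the ID up with dxpy, or hardcode it), something like the following sketch could work. The function name resolve_database_id and its parameters are hypothetical. The classname="database" filter is an assumption that the object created by the CREATE DATABASE statement is a DNAnexus database object.

```python
def resolve_database_id(name="test", hardcoded_id=None):
    """Return the ID of a database object, hardcoded or looked up by name."""
    # Option 1: you already know the ID (e.g. from "dx ls -l"), so just use it.
    if hardcoded_id is not None:
        return hardcoded_id

    # Option 2: ask the platform for the object called `name`.
    # Imported lazily so the hardcoded path works without dxpy installed.
    import dxpy

    result = dxpy.find_one_data_object(
        name=name,
        classname="database",  # assumption: restrict the search to database objects
        zero_ok=False,         # raise an error if nothing matches
    )
    return result["id"]
```

The returned ID is what gets concatenated into the "dnax://" + id + "/matrix.mt" path in the snippet being discussed.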

Chai Fungtammasan DNAnexus Team
- 15 September 2022 23:13
Could you try the example Hail notebooks that we published recently? They include examples of how to load a VCF and of the most common operations in Hail.
https://community.dnanexus.com/s/question/0D5t0000043xrVhCAI/hail-tutorial-and-example-notebooks-for-ukbrap-analysis
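
For the goal mentioned earlier in the thread (finding all variants within a specified genomic region), here is a minimal sketch of how that could look once a MatrixTable is loaded. It is not taken from the notebooks. The helper names region_string and filter_to_region are hypothetical, and GRCh38 with "chr"-prefixed contig names is an assumption about the exome data. Please check the Hail documentation for how interval endpoints are treated.

```python
def region_string(contig, start, end):
    """Format a region as 'contig:start-end', e.g. 'chr1:100-200'."""
    if start > end:
        raise ValueError("start must not be greater than end")
    return f"{contig}:{start}-{end}"


def filter_to_region(mt, contig, start, end, reference_genome="GRCh38"):
    """Keep only the rows (variants) of mt that fall inside the region."""
    # Imported lazily so region_string can be used without Hail installed.
    import hail as hl

    interval = hl.parse_locus_interval(
        region_string(contig, start, end),
        reference_genome=reference_genome,
    )
    # filter_intervals takes a list of intervals and keeps matching rows.
    return hl.filter_intervals(mt, [interval])
```

After filtering, calling rows().show() on the result would display the variants that remain in the region.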
