Problems writing Hail Matrix Table
Hi there
I am quite new to the UKBB, so apologies if this is asked regularly, but I couldn't find a recent solution to my problem. I am following the instructions here: https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/pVCF_import.ipynb
to convert vcf to Hail matrix table, and I can see the matrix exists with:
mt.describe()
and I have successfully annotated the matrix table using Hail's annotation database. But I cannot figure out how to write the matrix table somewhere so I can resume using the same matrix table in a new Jupyter session.
Things I have tried:
Following the instructions to create a database in DNAnexus as described in the GitHub notebook above and writing the matrix table to that. I can make a database, as described, but the table does not get written there. I understand we now cannot directly write files to our project space on DNAnexus from JupyterLab?
Using mt.write to try and write the matrix table to the current working directory, as I understand it /opt/notebooks. I don't get an error but nothing happens.
Using mt.export to try and write the thing as a tsv file stored in my current working directory, again no error but no writing either.
I am currently using single participant exome files, to keep the executions short while I learn what to do, so I don't think it's a memory or compute problem.
Any guidance would be much appreciated!
Thank you so much
Comments
My assistant says … (NOTE, she has the habit of optimistically making stuff up, so please let me know if this works or not) …
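You’re not doing anything “wrong”; this is mostly about how Hail + Spark are wired up on RAP. The big gotcha is that Hail there reads and writes through a DNAnexus database (dnax:// paths), not through the notebook's local disk or your project file tree.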
Here’s a concrete pattern you can drop into your notebook.
1. Create (or reuse) a DNAnexus “database” for Hail
Run this once per project (or just reuse an existing DB name):
Optionally, confirm the DB object exists:
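Both steps can be sketched as follows. This is a minimal sketch, not a confirmed recipe: it assumes the CREATE DATABASE ... LOCATION 'dnax://' statement used in the DNAnexus notebook linked in the question, and dxpy.find_one_data_object for the lookup. The database name my_hail_db and all three function names are placeholders for illustration.

```python
import re


def create_database_sql(db_name: str) -> str:
    """Build the Spark SQL that creates a DNAnexus-backed database if it is missing."""
    # Restrict to simple identifiers so the name is safe to interpolate into SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", db_name):
        raise ValueError(f"unexpected database name: {db_name!r}")
    return f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"


def create_database(spark, db_name: str) -> None:
    """Run the statement on an existing SparkSession (the notebook's `spark`)."""
    spark.sql(create_database_sql(db_name))


def find_database_id(db_name: str) -> str:
    """Return the ID of the database object; raises if none is found."""
    import dxpy  # imported lazily: only available inside a DNAnexus environment

    # Assumption: exactly one accessible database object carries this name.
    result = dxpy.find_one_data_object(
        classname="database", name=db_name, zero_ok=False
    )
    return result["id"]
```

In a notebook this would be called as create_database(spark, "my_hail_db") followed by db_id = find_database_id("my_hail_db"). A lower-case name is used here on the assumption that Spark stores database names in lower case.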
2. Initialise Hail pointing at that database
Each time you start a new Jupyter session you should:
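A minimal sketch of that start-up step, assuming the database ID has already been looked up and that a folder called tmp inside the database is acceptable as scratch space. The helper names dnax_url and init_hail and the folder name are placeholders; hl.init and its tmp_dir parameter are part of Hail.

```python
def dnax_url(db_id: str, name: str = "") -> str:
    """Return a dnax:// URL inside the database `db_id`."""
    if not db_id.startswith("database-"):
        raise ValueError(f"expected a DNAnexus database ID, got {db_id!r}")
    return f"dnax://{db_id}/{name.lstrip('/')}"


def init_hail(db_id: str, tmp_name: str = "tmp"):
    """Start (or reuse) Spark and initialise Hail with its temp dir in the database."""
    # Lazy imports: hail and pyspark are only available on a Spark/Hail cluster.
    import hail as hl
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    # Assumption: Hail's temporary files should live under dnax://, not local disk.
    hl.init(sc=spark.sparkContext, tmp_dir=dnax_url(db_id, tmp_name))
    return spark
```

Calling init_hail(db_id) once at the top of each session keeps the temporary directory and the later read/write paths in the same database.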
Key points:
tmp_dir must be a dnax:// path, not /opt/notebooks or /mnt/project.
Keeping the same db_name (and hence db_id) across sessions is what lets you read/write the same MatrixTable later. (Hail Discussion)
3. Write your MatrixTable so you can re-use it
Once you’ve imported and annotated:
This writes a whole directory of files in Hail's native MatrixTable format (not a single file) into the DNAnexus database, not into your project file tree, so you won’t see it in the web UI.
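The write step might look like the sketch below. It assumes mt is the annotated MatrixTable and db_id is the database ID from step 1; the function name and the table name used in the usage line are placeholders.

```python
def write_matrix_table(mt, db_id: str, mt_name: str, overwrite: bool = False) -> str:
    """Write `mt` into the database and return the URL to read it back from."""
    url = f"dnax://{db_id}/{mt_name}"
    # MatrixTable.write refuses to replace an existing path unless overwrite=True.
    mt.write(url, overwrite=overwrite)
    return url
```

For example, url = write_matrix_table(mt, db_id, "exome_annotated.mt"). Keeping the returned URL (or the db_id and table name it is built from) is what makes the table findable in a later session.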
To convince yourself it’s there, you can use Spark or HDFS:
(or more simply from a terminal cell:)
4. Read the same MatrixTable in a new Jupyter session
Next time you start a Spark/Hail notebook:
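A sketch of resuming in a fresh session, assuming Hail has already been initialised and that db_id and the table name match what was used when writing. The function name and the reader parameter are illustrative; the reader hook exists only so the function can be exercised without a cluster, and by default it uses hl.read_matrix_table.

```python
def resume_matrix_table(db_id: str, mt_name: str, reader=None):
    """Re-open a MatrixTable that an earlier session wrote into the database."""
    if reader is None:
        import hail as hl  # lazy import: only available on a Spark/Hail cluster

        reader = hl.read_matrix_table
    # The URL must be built from the same database ID and table name as the write.
    return reader(f"dnax://{db_id}/{mt_name}")
```

For example, mt = resume_matrix_table(db_id, "exome_annotated.mt"), after which mt.describe() should show the same schema as before.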
If you get FileNotFoundError, double-check:
That you are using the same db_name as when you wrote it.
That you are using the same mt_name string.
5. Why /opt/notebooks and mt.export “do nothing”
What’s happening with the things you tried:
mt.write("/opt/notebooks/...")
mt.export("file:///opt/notebooks/whatever.tsv")
… dx upload or dxpy APIs.
… dnax://), because they are many small files and DNAnexus’s project filesystem doesn’t support them efficiently. (Hail Discussion)
6. (Optional) Export a TSV/VCF into project space
If you want a plain TSV/VCF in your project:
Export from Hail into the database filesystem:
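One way this export could look, as a hedged sketch: as far as I'm aware a MatrixTable has no export method of its own, so this goes through mt.rows(), which yields a Hail Table of variant-level fields that Table.export can write as TSV. The function name is a placeholder, and the default file name simply mirrors the one mentioned in this comment.

```python
def export_variants_tsv(mt, db_id: str, file_name: str = "my_variants.tsv.bgz") -> str:
    """Export the row (variant) fields of `mt` as a TSV inside the database."""
    url = f"dnax://{db_id}/{file_name}"
    # rows() drops the per-sample entries; Table.export writes tab-separated text
    # and block-gzips it when the file name ends in .bgz.
    mt.rows().export(url)
    return url
```

If genotypes are needed as well, hl.export_vcf(mt, url) is the Hail call for writing a VCF instead; the same dnax:// destination assumption applies.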
Then move it into your project with hadoop fs + dx upload:
The uploaded /results/my_variants.tsv.bgz will then appear in the project UI like any other file. (Hail Discussion)
If you’d like, I can turn this into a tiny “minimal Hail template” notebook for UKB RAP, so you’ve always got a working reference.