Working with Hail outside of JupyterLab
I've been working on developing pipelines to filter and re-format UKB exome data using Hail. So far, I've been working exclusively in JupyterLab within notebooks by reading in one WES VCF file (filtered down to one gene prior to this pipeline using bcftools), saving it to a matrix table, and then running my code over that matrix table. My team is looking to expand this pipeline to analyze several hundred genes. What would be the easiest/most advisable way to do so?
I assume I'll need to shift from Jupyter notebooks to Python scripts. Once I do, it seems like my options are to create an applet or execute the script using swiss-army-knife. Since running JupyterLab with a Spark-enabled cluster does all the heavy lifting of properly configuring a Spark-enabled environment with Hail and VEP for me, I'm not sure what additional steps I'll need to take to set that up on my own. Is this something that can be done just by importing Hail and building a Spark session at the start of a Python script and then executing that through swiss-army-knife, or are there more steps I would need to take? Similarly, if building an applet, I'd love suggestions on the best way to configure Hail once I have a Spark-enabled applet set up. Thanks for the help, and apologies for the basic question!
Comments
I would consider creating a Jupyter snapshot with bcftools and running the Spark-based JupyterLab with that snapshot in non-interactive mode.
https://documentation.dnanexus.com/user/jupyter-notebooks/references#run-notebooks-non-interactively
I believe this might be easier than implementing a Spark-based applet from scratch.
Thanks! I'm trying to go with this approach, but I'm not sure what the benefit is of creating a snapshot with bcftools as opposed to just following the linked instructions exactly (i.e. just using a papermill cmd string to execute a plain notebook). I am also struggling to specify the correct file paths to get the instructions you linked to run without failure. Should I be using the file:///mnt/project/ prefix, the project ID prefix, or no prefix at all when specifying the path to the input notebook? Where exactly is the /opt/notebooks/ directory that the command runs in? Thanks again for your help, I really appreciate it.
"as opposed to just trying to follow the linked instructions exactly (ie just using a papermill cmd string to execute a plain notebook)"
--> The reason I suggested a snapshot with bcftools: I think bcftools is not part of the Spark-based JupyterLab, but I now understand that you do not need it, since you already processed the VCF file beforehand. In any case, if you need bcftools (or any other tool that is not there by default) inside JupyterLab, just run the Spark-based JupyterLab interactively, install the tool, and then save a snapshot. The snapshot should be saved in your DNAnexus project, and you can use it in future JupyterLab sessions. [https://documentation.dnanexus.com/user/jupyter-notebooks#environment-snapshots]
"Where exactly is the /opt/notebooks/ directory that the command runs in? Thanks again for your help, I really appreciate it."
--> /opt/notebooks/ is the working directory inside the JupyterLab environment.
Several months ago I was successful with running these two commands:
my_cmd="papermill notebook.ipynb output_notebook.ipynb -f config.txt"
dx run dxjupyterlab_spark_cluster -icmd="$my_cmd" -iin="notebook.ipynb" -iin="config.txt"
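If the goal is to launch the same notebook for many genes, the two commands above could be generated from a small Python helper. This is only a sketch under assumptions: the app name and the -icmd / -iin inputs are copied from the commands above, while the gene names, the "project-xxxx" placeholder, the folder layout, and the idea of one papermill config file per gene are hypothetical. The helper only builds and prints the command; it does not submit anything.

```python
import posixpath
import shlex

# App name copied from the dx run command above.
APP = "dxjupyterlab_spark_cluster"


def build_dx_run(notebook, config, output_name):
    """Build the argument list for one non-interactive notebook run.

    notebook and config may be full project paths; only their file
    names are used inside the papermill command string.
    """
    nb_name = posixpath.basename(notebook)
    cfg_name = posixpath.basename(config)
    # Same shape as: papermill notebook.ipynb output_notebook.ipynb -f config.txt
    cmd = shlex.join(["papermill", nb_name, output_name, "-f", cfg_name])
    return ["dx", "run", APP, f"-icmd={cmd}", f"-iin={notebook}", f"-iin={config}"]


# Hypothetical gene list and project layout, for illustration only.
for gene in ["GENE_A", "GENE_B"]:
    args = build_dx_run(
        "project-xxxx:/notebooks/filter.ipynb",
        f"project-xxxx:/configs/{gene}.txt",
        f"filter_{gene}.ipynb",
    )
    # Print the shell-quoted command so it can be reviewed before running it.
    print(shlex.join(args))
```

Printing first makes it easy to check the quoting of the cmd string; the same list could later be handed to subprocess.run once the commands look right.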
https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-rstudio-on-the-research-analysis-platform#working-with-data
https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-rstudio-on-the-research-analysis-platform#accessing-project-data-downloading-project-files
https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-rstudio-on-the-research-analysis-platform#advanced-use-case-reading-from-mnt-project
Thank you, this is extremely helpful! I ended up specifying the full file path to the notebook within the run command (e.g. -iin="project-####:/path/to/input_notebook.ipynb") and then only the file name within the cmd string (e.g. input_notebook.ipynb), and that seemed to work for me. Then, within the notebook, I read in the relevant files using the /mnt/project/... path, since I plan to run this pipeline over ~500 VCFs and don't want to download them all.
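For anyone else sorting out the path conventions discussed in this thread, here is a rough sketch of how the different forms relate. It assumes the file lives in the project the job runs in and that this project is mounted at /mnt/project; the helper names and the example path are made up for illustration. Whether the file:// prefix is required when handing the path to Hail may depend on the cluster configuration, so treat that part as an assumption as well.

```python
import posixpath

# Mount point mentioned in this thread.
MOUNT_ROOT = "/mnt/project"


def mounted_path(project_path):
    """Map 'project-xxxx:/a/b.vcf.gz' or '/a/b.vcf.gz' to its /mnt/project path."""
    # Drop the 'project-xxxx:' prefix if there is one.
    _, sep, rest = project_path.partition(":")
    folder_path = rest if sep else project_path
    return posixpath.join(MOUNT_ROOT, folder_path.lstrip("/"))


def file_uri(project_path):
    """Same location written with the file:// scheme."""
    return "file://" + mounted_path(project_path)


# Hypothetical example path. Inside the notebook, the URI is what would be
# passed to a reader such as hl.import_vcf; Hail is not imported here so the
# sketch runs without a Spark cluster.
print(mounted_path("project-xxxx:/wes/GENE_A.vcf.gz"))
print(file_uri("project-xxxx:/wes/GENE_A.vcf.gz"))
```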