Working with Hail outside of JupyterLab

I've been working on developing pipelines to filter and re-format UKB exome data using Hail. So far, I've been working exclusively in JupyterLab within notebooks by reading in one WES VCF file (filtered down to one gene prior to this pipeline using bcftools), saving it to a matrix table, and then running my code over that matrix table. My team is looking to expand this pipeline to analyze several hundred genes. What would be the easiest/most advisable way to do so?

 

I assume I'll need to shift from Jupyter notebooks to Python scripts. Once I do, it seems like my options are to create an applet or execute the script using swiss-army-knife. Since running JupyterLab with a Spark-enabled cluster does all the heavy lifting of properly configuring a Spark-enabled environment with Hail and VEP for me, I'm not sure what additional steps I'll need to take to set that up on my own. Is this something that can be done just by importing Hail and building a Spark session at the start of a Python script and then executing that through swiss-army-knife, or are there more steps I would need to take? Similarly, if building an applet, I'd love suggestions on the best way to configure Hail once I have a Spark-enabled applet set up. Thanks for the help, and apologies for the basic question!

Comments


  • Ondrej Klempir (DNAnexus Team)

    I would consider creating a JupyterLab snapshot with bcftools installed and running the Spark-based JupyterLab (JL) app with that snapshot in non-interactive mode.

    https://documentation.dnanexus.com/user/jupyter-notebooks/references#run-notebooks-non-interactively

     

    I believe this might be easier than implementing a Spark-based applet from scratch.

  • Thanks! I'm trying to go with this approach, but I'm not sure what the benefit is of creating a snapshot with bcftools as opposed to just following the linked instructions exactly (i.e., just using a papermill cmd string to execute a plain notebook). I am also struggling to specify the correct file paths to get the instructions you linked to run without failure. Should I be using the file:///mnt/project/ prefix, the project ID prefix, or no prefix at all when specifying the path to the input notebook? Where exactly is the /opt/notebooks/ directory that the command runs in? Thanks again for your help, I really appreciate it.

  • Ondrej Klempir (DNAnexus Team)

    "as opposed to just following the linked instructions exactly (i.e., just using a papermill cmd string to execute a plain notebook)"

     

    --> The reason for a snapshot with bcftools: I think that bcftools is not part of the Spark-based JupyterLab environment (but I now understand that you do not need it, as you already processed the VCF file beforehand). Anyway, if you need bcftools (or any other tool that is not there by default) inside JupyterLab, just run the Spark-based JL in interactive mode, install the tool, and then save a snapshot (the snapshot should be saved in your DNAnexus project, and you can use it in future JL sessions). [https://documentation.dnanexus.com/user/jupyter-notebooks#environment-snapshots]

     

    "Where exactly is the /opt/notebooks/ directory that the command runs in? Thanks again for your help, I really appreciate it."

     

    --> /opt/notebooks/ is the working directory inside the JupyterLab environment.

     

    Several months ago, I successfully ran these two commands:

     

       my_cmd="papermill notebook.ipynb output_notebook.ipynb -f config.txt"

       dx run dxjupyterlab_spark_cluster -icmd="$my_cmd" -iin="notebook.ipynb" -iin="config.txt"

     

    1. I provided my bioinformatics pipeline as notebook.ipynb (available in my DNAnexus project), which defined all the steps for processing my file(s) inside the JL session. The input notebook also contained "dx upload" commands to copy newly generated results back to permanent storage.
    2. The pipeline run also generates output_notebook.ipynb (rendered plots, table views, etc.), which is good for visual inspection of the results.
    3. Optional: I also added config.txt as another input to the job. With this I was able to configure which file I wanted to process (paths, etc.) and to pass additional settings for plots, etc. (these could instead be hardcoded directly in the input notebook).
    4. As for paths, I normally use "dx download file-XXXX" to download data onto the worker. Alternatively, with the /mnt/project syntax, I access data using /mnt/project/Bulk... This is specified inside notebook.ipynb; you can test it in interactive mode.
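
    For a pipeline that has to be launched many times, the two commands above could be assembled in Python instead of typed by hand. The following is only a minimal sketch: the helper names build_papermill_cmd and build_dx_run_args are hypothetical, and the app name, -icmd, -iin, and papermill -f are taken from the commands above. It only builds the command; it does not contact the platform.

```python
import shlex


def build_papermill_cmd(notebook, output_notebook, params_file=None):
    """Return the papermill command string that is passed to the app as -icmd."""
    parts = ["papermill", notebook, output_notebook]
    if params_file is not None:
        # -f points papermill at a parameters file, as in the command above.
        parts += ["-f", params_file]
    return shlex.join(parts)


def build_dx_run_args(cmd, inputs, app="dxjupyterlab_spark_cluster"):
    """Return the argument list for one non-interactive `dx run` launch."""
    args = ["dx", "run", app, f"-icmd={cmd}"]
    # Every file the notebook needs on the worker is passed as its own -iin input.
    args += [f"-iin={path}" for path in inputs]
    return args


if __name__ == "__main__":
    cmd = build_papermill_cmd("notebook.ipynb", "output_notebook.ipynb", "config.txt")
    args = build_dx_run_args(cmd, ["notebook.ipynb", "config.txt"])
    # Print the shell-quoted command for inspection before running it.
    print(shlex.join(args))
```

    The argument list can be handed to subprocess.run(args, check=True) once the printed command looks right. Keeping the arguments as a list avoids quoting problems with the spaces inside the cmd string.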

     

    https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-rstudio-on-the-research-analysis-platform#working-with-data

    https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-rstudio-on-the-research-analysis-platform#accessing-project-data-downloading-project-files

    https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-rstudio-on-the-research-analysis-platform#advanced-use-case-reading-from-mnt-project

     

     

  • Thank you, this is extremely helpful! I ended up specifying the full file path to the notebook within the run command (e.g., -iin="project-####:/path/to/input_notebook.ipynb") and then only the file name within the cmd string (e.g., input_notebook.ipynb), and that seemed to work for me. Then, within the notebook, I read in the relevant files using the /mnt/project/... path, since I plan to run this pipeline over ~500 VCFs and don't want to download them all.
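
    The path convention described above (full project path for -iin, bare file name in the cmd string, /mnt/project/... inside the notebook) can be sketched as two small helpers. This is an illustration only: the function names are hypothetical, "project-XXXX" and the example paths are placeholders, and the code does nothing beyond string handling.

```python
import posixpath

# Mount point used for project files inside the JupyterLab worker, per the thread.
MOUNT_ROOT = "/mnt/project"


def notebook_name_for_cmd(dx_path):
    """Return the bare file name to use inside the papermill cmd string.

    `dx_path` is the value given to -iin, e.g. "<project>:/path/to/nb.ipynb".
    """
    # Drop an optional "<project>:" prefix, then keep only the last path component.
    _, _, folder_path = dx_path.rpartition(":")
    return posixpath.basename(folder_path)


def mounted_path(project_relative_path):
    """Map a path inside the project to its location under /mnt/project."""
    return posixpath.join(MOUNT_ROOT, project_relative_path.lstrip("/"))


if __name__ == "__main__":
    # Placeholder values; substitute a real project ID and real paths.
    print(notebook_name_for_cmd("project-XXXX:/path/to/input_notebook.ipynb"))
    print(mounted_path("/path/to/example.vcf.gz"))
```

    Whether a reader such as Hail also needs a file:// scheme in front of the mounted path is not settled in this thread, so it is worth checking in an interactive session first.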

