How to create an applet which uses spark and a custom docker image

I *think* we want to build an applet which runs a docker image with pyspark installed. This spark driver will need to be able to communicate with the cluster manager to spin up and communicate with worker nodes. We know how to create an app and run a docker image. We know how to create a spark app using the whatever image https://documentation.dnanexus.com/developer/apps/developing-spark-apps uses. But how would we go about combining the two?   Ideally we don't want to create a spark session using $SPARK_HOME/bin/spark-submit as in the docs but we want to find or create one using pyspark.

Comments

7 comments

  • Comment author
    Ondrej Klempir DNAnexus Team

    {@005t000000AD9qLAAT}? 

     

    Did you consider running Spark based JupyterLab in non-interactive mode? https://documentation.dnanexus.com/user/jupyter-notebooks/references#run-notebooks-non-interactively

     

    Inside JupyterLab, you should be able to pull docker image on the worker station via Terminal. Later you can save all the software updates as a JupyterLab snapsnot and run a script inside JL non-interactively.

     

    Interactive mode is really great for prototyping the docker functionality. Also, pyspark is already part of it - but I understand that you would like to load a custom docker image with pyspark installed.

     

    I was interested to get ADAM Genomics working on RAP. ADAM is a library and command line tool that enables the use of Apache Spark to parallelize genomic data analysis. ADAM Genomics is actually available as in the form of docker image with pyspark installed.

    https://adam.readthedocs.io/en/latest/

     

    I implemented the following steps inside the Spark based JupyterLab:

     

    ON THE WORKER

     

    docker pull quay.io/biocontainers/adam:1.0--hdfd78af_0 # pull docker image from quay

    docker images # check loaded docker images

     

    mkdir -p data # prep testing data

    wget -O data/small.sam https://raw.githubusercontent.com/bigdatagenomics/adam/master/adam-core/src/test/resources/small.sam

     

    docker run --rm=true -it -v /home/dnanexus/data/:/data quay.io/biocontainers/adam:1.0--hdfd78af_0 # execute interactive mode of docker image

     

    INSIDE THE INTERACTIVE MODE OF DOCKER IMAGE

     

    adam-submit # get help page to see which features are available

     

    Screenshot 2022-09-30 at 12.36.21 

    adam-submit transformAlignments data/small.sam data/small.adam # use dockerized spark tool to process genomics data

     

    I was also able to run ADAM spark based subshell, which loads scala programming lang as default:

     

    Screenshot 2022-09-30 at 12.57.56And finally, for Python with Pyspark, I just ran "python" in order to open python session:

     

    INSIDE PYTHON SESSION

     

    import pyspark

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

    @Katie Sandford? 

     

    1. Which steps have you already tried to implemented your spark+docker applet?
    2. Which bioinformatics functionality you would like to implement?
    3. When implementing your applet, have you seen any error?
    4. Why Spark/PySpark is important for your applet?
    5. Alternatively, does it make sense to run JupyterLab noninteractively, use PySpark, and install the software dependencies without running docker?

     

    0
  • Comment author
    Former User of DNAx Community_8

    Thanks for the detailed answer - yesterday we actually went for the jupyter notebook approach and were able to run our own docker image interactively as you suggested with the dna nexus data mounted :) But we weren't able to run pyspark (which is installed on our docker image) I think because it wasn't able to communicate with the spark cluster manager. We could run pyspark from within the notebook just fine, but just not from within our docker image. How did you manage to get your docker image pyspark to talk to the spark running outside of the docker image?

     

    To answer your questions...

    1) We are really just starting to look at UKBB so have sone very little. We have managed to create a custom app with all the data we need mounted. We've also uploaded a custom docker image as a tarball, used dx download to download it, used docker to load it and run a basic hello world command.

    2) We want to run WES pipelines on cohorts of the full 500k patients. We currently have an in-house pipeline on the 200k patients that you can download data for. But fairly recently UKBB released sequences for 300k more patients, with the caveat that you can no long download the data and have to run pipelines on their platform. This is why we want to use a custom docker image - we want to use our current code to do this and don't want to reimplement / rewrite our pipeline on the UKBB rap. We also imagine running our pipeline on other data in the future and not just UKBB data, and so don't want to be fully tied to the RAP.

    3) We haven't really seen a specific error - it's more that we don't know how to start. I couldn't see any documentation that covers our use case. So it might be that we are expected to achieve our goal some other way, but I still don't know how that would work. But we are stuck on - how to get spark working inside our image. And the best way to access the data. https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data suggests using spark to query databases and not just reading files directly. But we don't know how to find out which data is in all of the spark databases - is there some catalogue for this data?

    4) It is important because we want to run WES and eventually WGR on 500k patients - it is just too slow without spark.

     

    Please let me know if you have any more ideas?

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

    to 3)

     

    There are two basic sets of files on RAP: Bulk (mostly genomics and imaging) and Database object (phenotypic information).

     

    Catalog sits here:

    https://biobank.ndph.ox.ac.uk/showcase/browse.cgi?id=-2&cd=search

     

    Also Cohort Browser might serve as a catalog:

    https://documentation.dnanexus.com/user/cohort-browser

     

    to 4)

     

    I see. With that, I would suggest reading the following tutorial https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/guide-to-analyzing-large-sample-sets.

    For this use case, I am not sure whether Spark would be the best option for you. I believe you can do parallelisation on the level of jobs (run multiple jobs at the same time), instead of running Spark based analysis on just one cluster. It sounds to me more as an example of Large scale analysis on RAP, using Bulk data such as WGS or WES.

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

    Does your in-house pipeline (non-RAP) use Spark? If so, which library/method?

    0
  • Comment author
    Former User of DNAx Community_8

    Yep we use spark in-house. We use the glow library mostly but have some UDFs too I think - although I don't know why because I'm new to the project.

    0
  • Comment author
    Former User of DNAx Community_8

    I'll have a look at that guide - thanks.

    0

Please sign in to leave a comment.