How to create an applet that uses Spark and a custom Docker image

28 September 2022 00:00
7 comments

I *think* we want to build an applet that runs a Docker image with PySpark installed. The Spark driver inside that image will need to communicate with the cluster manager in order to spin up and talk to worker nodes. We know how to create an app and run a Docker image. We also know how to create a Spark app using whatever image the docs at https://documentation.dnanexus.com/developer/apps/developing-spark-apps use. But how would we go about combining the two? Ideally we don't want to create a Spark session using $SPARK_HOME/bin/spark-submit as in the docs; we want to find or create one using PySpark.

Comments


Ondrej Klempir DNAnexus Team
- 30 September 2022 12:13

Have you considered running the Spark-based JupyterLab in non-interactive mode? https://documentation.dnanexus.com/user/jupyter-notebooks/references#run-notebooks-non-interactively

Inside JupyterLab, you should be able to pull a Docker image onto the worker via the Terminal. Later you can save all the software updates as a JupyterLab snapshot and run a script inside JupyterLab non-interactively.

Interactive mode is really great for prototyping the docker functionality. Also, pyspark is already part of it - but I understand that you would like to load a custom docker image with pyspark installed.

I was interested in getting ADAM Genomics working on RAP. ADAM is a library and command-line tool that uses Apache Spark to parallelize genomic data analysis. ADAM Genomics is available as a Docker image with PySpark installed.
https://adam.readthedocs.io/en/latest/

I implemented the following steps inside the Spark based JupyterLab:

ON THE WORKER

docker pull quay.io/biocontainers/adam:1.0--hdfd78af_0 # pull docker image from quay
docker images # check loaded docker images

mkdir -p data # prep testing data
wget -O data/small.sam https://raw.githubusercontent.com/bigdatagenomics/adam/master/adam-core/src/test/resources/small.sam

docker run --rm=true -it -v /home/dnanexus/data/:/data quay.io/biocontainers/adam:1.0--hdfd78af_0 # execute interactive mode of docker image

INSIDE THE INTERACTIVE MODE OF DOCKER IMAGE

adam-submit # get help page to see which features are available

adam-submit transformAlignments data/small.sam data/small.adam # use dockerized spark tool to process genomics data
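For a non-interactive run (for example from an applet script), the same steps can be collapsed into a single `docker run` call without `-it`. The following is a minimal Python sketch, not a tested recipe: it assumes the image lets you pass `adam-submit` as the container command, and it uses absolute `/data/...` paths inside the container because that is where the host directory is mounted. The helper names `docker_run_command` and `run` are made up for illustration.

```python
import shlex
import subprocess

# Image tag taken from the steps above.
IMAGE = "quay.io/biocontainers/adam:1.0--hdfd78af_0"


def docker_run_command(image, host_dir, container_dir, tool_args):
    """Build the argv list for a non-interactive `docker run` (hypothetical helper)."""
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{container_dir}",  # mount the host data directory
        image,
        *tool_args,                           # command executed inside the container
    ]


def run(cmd):
    """Execute the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


cmd = docker_run_command(
    IMAGE,
    "/home/dnanexus/data",
    "/data",
    # Assumption: paths are given relative to the mount point /data.
    ["adam-submit", "transformAlignments", "/data/small.sam", "/data/small.adam"],
)

# Print the command for inspection; call run(cmd) on a worker where Docker is available.
print(shlex.join(cmd))
```

Building the command as a list and only printing it keeps the sketch safe to try anywhere; `run(cmd)` is the step that actually needs Docker.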

I was also able to run the ADAM Spark-based subshell, which loads Scala as the default language.

And finally, for Python with PySpark, I just ran "python" to open a Python session:

INSIDE PYTHON SESSION

import pyspark
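Going one step beyond the import, a session can be found or created from Python with `SparkSession.builder.getOrCreate()` instead of `spark-submit`. The sketch below is only an illustration under assumptions: `SPARK_MASTER_URL` is a hypothetical environment variable (nothing in this thread says the platform sets it), and whether the resulting session can actually reach the cluster from inside a container depends on the container's networking and Spark configuration.

```python
import os


def spark_conf_from_env(env=None):
    """Collect Spark settings from the environment (hypothetical helper).

    SPARK_MASTER_URL is an assumed variable name, not something the platform
    is known to set; if it is absent, Spark falls back to its own defaults.
    """
    env = os.environ if env is None else env
    conf = {"spark.app.name": "custom-image-pyspark"}
    master = env.get("SPARK_MASTER_URL")
    if master:
        conf["spark.master"] = master
    return conf


def get_or_create_session(conf):
    """Return the active SparkSession, or create one with the given settings."""
    # Imported lazily so the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage (requires pyspark and a reachable cluster manager):
# spark = get_or_create_session(spark_conf_from_env())
# print(spark.sparkContext.master)
```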

Ondrej Klempir DNAnexus Team
- 30 September 2022 12:19
@Katie Sandford?
1. Which steps have you already tried in order to implement your Spark + Docker applet?
2. Which bioinformatics functionality would you like to implement?
3. When implementing your applet, have you seen any errors?
4. Why is Spark/PySpark important for your applet?
5. Alternatively, would it make sense to run JupyterLab non-interactively, use PySpark, and install the software dependencies without running Docker?
Former User of DNAx Community_8
- 30 September 2022 12:41
Thanks for the detailed answer. Yesterday we actually went for the Jupyter notebook approach and, as you suggested, were able to run our own Docker image interactively with the DNAnexus data mounted :) But we weren't able to run PySpark (which is installed in our Docker image), I think because it couldn't communicate with the Spark cluster manager. We could run PySpark from within the notebook just fine, just not from within our Docker image. How did you get the PySpark in your Docker image to talk to the Spark running outside the image?
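One way to narrow this down is to check, from inside the container, whether the cluster manager's host and port are reachable at all. The sketch below is a diagnostic idea only, not a confirmed explanation of the problem. It assumes the master URL has the standalone form `spark://host:port` and that you can read it from the working notebook session (for example via `spark.sparkContext.master`). If the port is not reachable, starting the container with Docker's `--network host` option is one thing to try, though that alone may not be sufficient.

```python
import socket
from urllib.parse import urlparse


def parse_master(url):
    """Split a standalone master URL of the form spark://host:port into (host, port)."""
    parsed = urlparse(url)
    if parsed.scheme != "spark" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected spark://host:port, got {url!r}")
    return parsed.hostname, parsed.port


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Usage inside the container; the URL below is a placeholder, not a real address:
# host, port = parse_master("spark://10.0.0.1:7077")
# print(can_connect(host, port))
```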

To answer your questions...
1) We are really just starting to look at UKBB, so we have done very little. We have managed to create a custom app with all the data we need mounted. We've also uploaded a custom Docker image as a tarball, used dx download to download it, used docker to load it, and run a basic hello-world command.
2) We want to run WES pipelines on cohorts of the full 500k patients. We currently have an in-house pipeline for the 200k patients whose data you can download. But fairly recently UKBB released sequences for 300k more patients, with the caveat that you can no longer download the data and have to run pipelines on their platform. This is why we want to use a custom Docker image: we want to use our current code to do this and don't want to reimplement or rewrite our pipeline on the UKBB RAP. We also imagine running our pipeline on other data in the future, not just UKBB data, and so don't want to be fully tied to the RAP.
3) We haven't really seen a specific error; it's more that we don't know how to start. I couldn't see any documentation that covers our use case, so it might be that we are expected to achieve our goal some other way, but I still don't know how that would work. What we are stuck on is how to get Spark working inside our image, and the best way to access the data. https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data suggests using Spark to query databases rather than just reading files directly. But we don't know how to find out which data is in all of the Spark databases. Is there some catalogue for this data?
4) It is important because we want to run WES and eventually WGR on 500k patients, and it is just too slow without Spark.

Please let me know if you have any more ideas?

Ondrej Klempir DNAnexus Team
- 30 September 2022 14:12
to 3)

There are two basic sets of files on RAP: Bulk (mostly genomics and imaging) and Database object (phenotypic information).

Catalog sits here:
https://biobank.ndph.ox.ac.uk/showcase/browse.cgi?id=-2&cd=search

Also Cohort Browser might serve as a catalog:
https://documentation.dnanexus.com/user/cohort-browser
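In addition to those catalogues, Spark SQL itself can list what a session is able to see. The sketch below assumes you already have a working `spark` session (such as the one available in the Spark-based JupyterLab) and that your project has access to the databases. `list_databases_and_tables` is a made-up helper name, and the cap on the number of databases is only there to keep the output short.

```python
def list_databases_and_tables(spark, max_databases=5):
    """Map each visible database name to its table names (hypothetical helper).

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    catalogue = {}
    db_rows = spark.sql("SHOW DATABASES").collect()
    for row in db_rows[:max_databases]:
        # The name of this column differs between Spark versions, so read it by position.
        db_name = row[0]
        table_rows = spark.sql(f"SHOW TABLES IN `{db_name}`").collect()
        catalogue[db_name] = [t["tableName"] for t in table_rows]
    return catalogue


# Usage with an existing session:
# for db, tables in list_databases_and_tables(spark).items():
#     print(db, tables[:10])
```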

to 4)

I see. In that case, I would suggest reading the following tutorial: https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/guide-to-analyzing-large-sample-sets
For this use case, I am not sure Spark is the best option for you. I believe you can parallelise at the level of jobs (running multiple jobs at the same time) instead of running a Spark-based analysis on a single cluster. This sounds to me more like an example of large-scale analysis on RAP using Bulk data such as WGS or WES.
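Job-level parallelism of this kind can be sketched as: split the input files into batches, then launch one job per batch with `dx run`. The Python below only builds the commands. The applet name `my_wes_applet`, the input name `input_files`, and the file IDs are all hypothetical, and the sketch assumes the applet declares an array-of-files input (repeating `-i` with the same input name is how array inputs are passed to `dx run`).

```python
import shlex


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def dx_run_command(applet, input_name, file_ids):
    """Build a `dx run` argv list that passes one batch as an array input."""
    cmd = ["dx", "run", applet, "--yes", "--brief"]
    for file_id in file_ids:
        cmd += ["-i", f"{input_name}={file_id}"]
    return cmd


# Placeholder IDs for illustration only; real IDs come from your project.
file_ids = [f"file-EXAMPLE{i:04d}" for i in range(5)]

commands = [
    dx_run_command("my_wes_applet", "input_files", batch)
    for batch in chunk(file_ids, 2)
]

for cmd in commands:
    # Print for inspection; pass cmd to subprocess.run on a machine with dx-toolkit.
    print(shlex.join(cmd))
```

Each printed command starts an independent job, so the batch size controls the trade-off between the number of jobs and the work done per job.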

Ondrej Klempir DNAnexus Team
- 30 September 2022 14:18
Does your in-house pipeline (non-RAP) use Spark? If so, which library/method?

Former User of DNAx Community_8
- 03 October 2022 15:14
Yep, we use Spark in-house. We mostly use the Glow library but have some UDFs too, I think, although I don't know why because I'm new to the project.

Former User of DNAx Community_8
- 03 October 2022 15:15
I'll have a look at that guide - thanks.
