I *think* we want to build an applet that runs a Docker image with pyspark installed. The Spark driver inside that image will need to communicate with the cluster manager so it can spin up and talk to worker nodes. We know how to create an app and run a Docker image. We also know how to create a Spark app using whatever image https://documentation.dnanexus.com/developer/apps/developing-spark-apps uses. But how would we go about combining the two?
Ideally we don't want to create a spark session using $SPARK_HOME/bin/spark-submit as in the docs but we want to find or create one using pyspark.
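To make the "find or create" idea concrete, here is a minimal sketch of what that could look like from Python. `SparkSession.builder...getOrCreate()` is the standard pyspark call; the environment variable `SPARK_MASTER_URL`, the app name, and the `local[*]` fallback are assumptions made for illustration, not something the DNAnexus docs specify.

```python
import os


def resolve_master(env=None, default="local[*]"):
    """Return the Spark master URL the driver should connect to.

    SPARK_MASTER_URL is a hypothetical variable name used for illustration;
    substitute whatever your cluster actually exposes.
    """
    env = os.environ if env is None else env
    return env.get("SPARK_MASTER_URL", default)


def get_or_create_session(app_name="docker-pyspark-driver", master=None):
    """Reuse an active SparkSession or create one, without spark-submit."""
    # Imported lazily so resolve_master() works even where pyspark is absent.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    builder = builder.master(master or resolve_master())
    return builder.getOrCreate()
```

Calling `get_or_create_session()` inside the container would then return a session bound to whatever master URL was resolved, falling back to a local session if none is set.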
Comments
Did you consider running Spark based JupyterLab in non-interactive mode? https://documentation.dnanexus.com/user/jupyter-notebooks/references#run-notebooks-non-interactively
Inside JupyterLab, you should be able to pull a Docker image onto the worker via the Terminal. Later you can save all the software updates as a JupyterLab snapshot and run a script inside JupyterLab non-interactively.
Interactive mode is really great for prototyping the docker functionality. Also, pyspark is already part of it - but I understand that you would like to load a custom docker image with pyspark installed.
I was interested in getting ADAM Genomics working on RAP. ADAM is a library and command line tool that uses Apache Spark to parallelize genomic data analysis. ADAM Genomics is available as a Docker image with pyspark installed.
https://adam.readthedocs.io/en/latest/
I implemented the following steps inside the Spark based JupyterLab:
ON THE WORKER
docker pull quay.io/biocontainers/adam:1.0--hdfd78af_0 # pull docker image from quay
docker images # check loaded docker images
mkdir -p data # prep testing data
wget -O data/small.sam https://raw.githubusercontent.com/bigdatagenomics/adam/master/adam-core/src/test/resources/small.sam
docker run --rm=true -it -v /home/dnanexus/data/:/data quay.io/biocontainers/adam:1.0--hdfd78af_0 # execute interactive mode of docker image
INSIDE THE INTERACTIVE MODE OF DOCKER IMAGE
adam-submit # get help page to see which features are available
adam-submit transformAlignments data/small.sam data/small.adam # use dockerized spark tool to process genomics data
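For a non-interactive run, the same steps can be collapsed into a single `docker run` without `-it`. The sketch below only builds the command line; the image tag, mount paths, and `adam-submit` arguments are taken from the steps above, while the helper name `docker_run_command` is made up for illustration. It assumes the image has no entrypoint that would swallow the tool name.

```python
import shlex

IMAGE = "quay.io/biocontainers/adam:1.0--hdfd78af_0"


def docker_run_command(image, host_dir, container_dir, tool_args):
    """Build a non-interactive `docker run` argument list (no -it)."""
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{container_dir}",  # mount host data into the container
        image,
        *tool_args,
    ]


cmd = docker_run_command(
    IMAGE,
    "/home/dnanexus/data",
    "/data",
    ["adam-submit", "transformAlignments", "/data/small.sam", "/data/small.adam"],
)
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it from a script or notebook cell.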
I was also able to run the ADAM Spark-based subshell, which loads Scala by default:
INSIDE PYTHON SESSION
import pyspark
@Katie Sandford?
Thanks for the detailed answer - yesterday we actually went for the Jupyter notebook approach and were able to run our own Docker image interactively as you suggested, with the DNAnexus data mounted :) But we weren't able to run pyspark (which is installed on our Docker image), I think because it wasn't able to communicate with the Spark cluster manager. We could run pyspark from within the notebook just fine, just not from within our Docker image. How did you manage to get the pyspark in your Docker image to talk to the Spark running outside of it?
To answer your questions...
1) We are really just starting to look at UKBB so have done very little. We have managed to create a custom app with all the data we need mounted. We've also uploaded a custom Docker image as a tarball, used dx download to download it, and used docker to load it and run a basic hello world command.
2) We want to run WES pipelines on cohorts of the full 500k patients. We currently have an in-house pipeline on the 200k patients that you can download data for. But fairly recently UKBB released sequences for 300k more patients, with the caveat that you can no longer download the data and have to run pipelines on their platform. This is why we want to use a custom Docker image - we want to use our current code to do this and don't want to reimplement / rewrite our pipeline on the UKBB RAP. We also imagine running our pipeline on other data in the future and not just UKBB data, and so don't want to be fully tied to the RAP.
3) We haven't really seen a specific error - it's more that we don't know how to start. I couldn't see any documentation that covers our use case. So it might be that we are expected to achieve our goal some other way, but I still don't know how that would work. We are stuck on two things: how to get Spark working inside our image, and the best way to access the data. https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data suggests using Spark to query databases and not just reading files directly. But we don't know how to find out which data is in all of the Spark databases - is there some catalogue for this data?
4) It is important because we want to run WES and eventually WGS on 500k patients - it is just too slow without Spark.
Please let me know if you have any more ideas.
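One common reason a containerised driver cannot talk to a cluster is that, on Docker's default bridge network, the driver advertises a container-internal address that the workers cannot reach. A possible workaround, sketched below, is to pin the driver ports, publish them with `-p`, and advertise the host's address instead (running the container with `--network host` is the simpler alternative). The Spark property names are real; the IP address, the port numbers, and the helper names are placeholders, and whether this is sufficient on RAP has not been confirmed in this thread.

```python
def driver_network_conf(host_ip, driver_port=7078, block_manager_port=7079):
    """Spark settings for a driver running in a bridged Docker container.

    The port numbers are arbitrary placeholders; any free ports would do.
    """
    return {
        "spark.driver.host": host_ip,           # address executors call back to
        "spark.driver.bindAddress": "0.0.0.0",  # listen on all container interfaces
        "spark.driver.port": str(driver_port),
        "spark.blockManager.port": str(block_manager_port),
    }


def docker_publish_flags(conf):
    """`docker run -p` flags exposing the fixed driver ports on the host."""
    flags = []
    for key in ("spark.driver.port", "spark.blockManager.port"):
        flags += ["-p", f"{conf[key]}:{conf[key]}"]
    return flags


conf = driver_network_conf("10.0.0.5")  # placeholder host IP
print(docker_publish_flags(conf))
```

Each key and value in `conf` can be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`. The pyspark version in the image would also need to match the Spark version running on the cluster.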
to 3)
There are two basic kinds of data on RAP: Bulk files (mostly genomics and imaging) and Database objects (phenotypic information).
Catalog sits here:
https://biobank.ndph.ox.ac.uk/showcase/browse.cgi?id=-2&cd=search
Also Cohort Browser might serve as a catalog:
https://documentation.dnanexus.com/user/cohort-browser
to 4)
I see. With that, I would suggest reading the following tutorial https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/guide-to-analyzing-large-sample-sets.
For this use case, I am not sure Spark is the best option for you. I believe you can parallelise at the level of jobs (run multiple jobs at the same time) instead of running a Spark-based analysis on a single cluster. To me this sounds more like a case of large-scale analysis on RAP using Bulk data such as WGS or WES.
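As a rough illustration of job-level parallelisation, the sketch below splits a list of input files into batches and builds one `dx run` command per batch, so each batch becomes its own job. The applet name `my_wes_applet`, the input field `vcfs`, the batch size, and the file IDs are all hypothetical placeholders; only `dx run`, `-i<name>=<value>`, and `-y` come from the dx toolkit.

```python
import shlex


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def dx_run_command(applet, input_name, file_ids):
    """Build one `dx run` invocation covering one batch of input files."""
    cmd = ["dx", "run", applet, "-y"]  # -y skips the confirmation prompt
    for file_id in file_ids:
        cmd.append(f"-i{input_name}={file_id}")  # repeated -i fills an array input
    return cmd


sample_files = [f"file-{n:04d}" for n in range(10)]  # placeholder IDs
commands = [
    dx_run_command("my_wes_applet", "vcfs", batch)
    for batch in batches(sample_files, 4)
]
for command in commands:
    print(shlex.join(command))
```

Each printed line launches an independent job, so the batches run concurrently without a Spark cluster.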
Does your in-house pipeline (non-RAP) use Spark? If so, which library/method?
Yep, we use Spark in-house. We mostly use the Glow library but have some UDFs too, I think - although I don't know why, because I'm new to the project.
I'll have a look at that guide - thanks.