I have read the documentation on Spark apps (https://documentation.dnanexus.com/developer/apps/developing-spark-apps), but it's not clear to me how to actually run one.
For example, I have run the Jupyterlab with Spark cluster app, and managed to do some simple analyses. I would like to run the same analyses as an app, so that I don't have to worry about the Jupyter notebook timing out. The Spark Apps description references the dx-spark-submit utility (https://documentation.dnanexus.com/developer/apps/developing-spark-apps/dx-spark-submit-utility), but I can't find this utility. Is a special license actually required to run a Spark App, as stated here?
[Image: image]
If so, why can you run Spark in a Jupyterlab notebook?
Comments
In my understanding, yes. @Ben Busby could help you get in touch with the right contact person if you are interested.
Hi @Jeremy Schwartzentruber,
I believe that Spark is available on the UKB RAP and no additional license is required. In my opinion, the doc page and licensing info relate mostly to the DNAnexus Apollo product. An active Spark license should also be the reason why you can run the Spark-based JupyterLab and other Spark apps from the Tools library, such as Table Exporter or Dataset Extender.
I was curious, so I built an example Spark applet (via the "dx build ." command), taking inspiration from the doc page: https://documentation.dnanexus.com/developer/apps/developing-spark-apps
The test run was successful, and my code uses the dx-spark-submit utility.
Thanks Ondrej.
I can't for the life of me figure out how to get the dx-spark-submit utility though. Where is it?
I've installed the dx toolkit and other commands work fine.
I've tried to run it on Cloud Workstations running on DNAnexus, but it's not found there either.
(dx-)spark-submit is part of the Apache Spark cluster. It is available to Spark-based app(let)s, so it is definitely not part of the Cloud Workstation.
To get Apache Spark and the spark-submit utility, dxapp.json must contain a clusterSpec section:
"clusterSpec": {
    "type": "dxspark",                               # Type of the cluster, e.g. dxspark, apachespark
    "version": "3.2.0",                              # Cluster version to use
    "initialInstanceCount": "<num_cluster_nodes>",   # Total number of nodes in the cluster (including 1 master)
    "ports": "9500, 9700-9750",                      # (Optional) Ports (or port range) to be opened between the nodes of the cluster
    "bootstrapScript": "path/to/script.sh"           # (Optional) Bootstrap script that can run on all nodes of the cluster before application code
}
https://documentation.dnanexus.com/developer/apps/developing-spark-apps#cluster-specifications
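Note that the "#" annotations in the snippet above are explanatory only; JSON itself does not allow comments. As a rough sketch, the shell function below writes a minimal dxapp.json without them. Everything except the clusterSpec keys is an assumption for illustration: the applet name, the instance type, the node count of 2, the runSpec values, and the placement of clusterSpec under systemRequirements should all be checked against the linked documentation. The function name write_dxapp_json is hypothetical.

```shell
# Hypothetical helper: write a minimal dxapp.json into the given applet directory.
# All values are illustrative assumptions, not a verified configuration.
write_dxapp_json() {
    local dir="$1"
    mkdir -p "$dir" || return 1
    # Quoted heredoc delimiter so nothing inside is expanded by the shell.
    cat > "$dir/dxapp.json" <<'EOF'
{
  "name": "my_spark_applet",
  "dxapi": "1.0.0",
  "version": "0.0.1",
  "inputSpec": [],
  "outputSpec": [],
  "runSpec": {
    "interpreter": "bash",
    "file": "src/script.sh",
    "distribution": "Ubuntu",
    "release": "20.04",
    "version": "0"
  },
  "systemRequirements": {
    "*": {
      "instanceType": "mem1_ssd1_v2_x4",
      "clusterSpec": {
        "type": "dxspark",
        "version": "3.2.0",
        "initialInstanceCount": 2
      }
    }
  }
}
EOF
}
```

Calling write_dxapp_json my_spark_applet and then running "dx build ." inside that directory would be the next step, once src/script.sh exists.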
I then put dx-spark-submit into script.sh.
https://documentation.dnanexus.com/developer/apps/developing-spark-apps/dx-spark-submit-utility#example
Inside the Apache Spark job, you can submit a Spark application to the cluster either via dx-spark-submit or via $SPARK_HOME/bin/spark-submit
https://documentation.dnanexus.com/developer/apps/developing-spark-apps#submitting-applications-without-dx-spark-submit
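A minimal sketch of how an applet's script might choose between these two options. The function names and the script path are hypothetical, and only the --log-level INFO option is taken from this thread. Calling spark-submit with no extra arguments assumes the cluster environment already configures the Spark master, which may not hold; the linked page describes what is actually required.

```shell
# Hypothetical helpers for an applet's script; names are illustrative only.

# Print the submit command to use: dx-spark-submit if it is on PATH,
# otherwise $SPARK_HOME/bin/spark-submit. Returns 1 if neither is found.
pick_submit_cmd() {
    if command -v dx-spark-submit >/dev/null 2>&1; then
        echo "dx-spark-submit"
    elif [ -n "${SPARK_HOME:-}" ] && [ -x "$SPARK_HOME/bin/spark-submit" ]; then
        echo "$SPARK_HOME/bin/spark-submit"
    else
        echo "no Spark submit command found; is clusterSpec set in dxapp.json?" >&2
        return 1
    fi
}

# Submit a PySpark script with whichever command is available.
submit_pyspark() {
    local script="$1" cmd
    cmd="$(pick_submit_cmd)" || return 1
    if [ "$cmd" = "dx-spark-submit" ]; then
        "$cmd" --log-level INFO "$script"
    else
        # Assumption: no additional arguments are needed here.
        "$cmd" "$script"
    fi
}
```

Inside the applet's script this would be called as submit_pyspark my_pyspark_script.py. On a Cloud Workstation or a local machine, where neither command exists, it prints the error message and returns a non-zero status instead of failing with "command not found".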
Ah, I didn't realise that dx-spark-submit needed to be called from within the app's script. I'll give that a try.
Does this also work for submitting a PySpark script? E.g., within script.sh, would I submit it as follows?
dx-spark-submit \
    --log-level INFO \
    my_pyspark_script.py
Yes, that is the case.