JupyterLab Spark SQL queries keep hanging
Hello community-
I have been using JupyterLab in the Spark (i.e. multi-node) cluster configuration to perform SQL queries of UKB data. I have encountered a recurrent phenomenon in which queries that should complete in a short time do not complete, in some cases for hours. The little circle in the upper right is solid, indicating that the kernel is still running, but stopping and restarting the kernel, then re-running the query, has no impact. In some cases the same query will run fine in one open notebook but not another in the same JupyterLab instance, but when the behavior appears, it is evident before long in any open notebooks. Closing and re-opening the notebook seems to have no effect. The only fix I have found is to create an entirely new JupyterLab instance, which is time-consuming and expensive.
Has anyone else encountered this and if so, do you know of a fix? It's driving me bonkers (that would be "barmy" I believe for those in the UK).
Thanks,
Eric Rose
PS. Here is an example.
First cell contents, to initiate environment:
#import packages
import pyspark
import dxpy
import dxdata
#initialize Spark (do only once; do not rerun this unless you select Kernel->Restart kernel)
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
Second cell contents (runs in seconds when JupyterLab instance is fresh but not after a few different queries are executed):
# Find counts of gp_scripts rows where the only code provided is a dmd code
spark.sql("USE app84142_20220504132912")
retrieve_sql = \
"""
select count(*) from
(select
case when read_2 is null THEN 'N' else 'Y' end as read_2_present,
case when bnf_code is null THEN 'N' else 'Y' end as bnf_code_present,
case when dmd_code is null THEN 'N' else 'Y' end as dmd_present
from gp_scripts)
where read_2_present = 'N'
and bnf_code_present = 'N'
and dmd_present = 'Y'
"""
df=spark.sql(retrieve_sql)
df.show(n=10,truncate=False)
Comments
3 comments
PS. Please ignore the inelegant structure of the SQL query above; it's intended to be tweaked for multiple uses.
I've had the same issue but have been unable to find a fix.
I tried to reproduce this issue by adapting and running the code you posted, but I am not running into these problems. I reran it many times with success.
If a particular Spark command is taking too long to evaluate, you can monitor the Spark status by visiting the Spark console page. To do that, copy the URL of your current JupyterLab session (typically ending in ".dnanexus.cloud/lab?"), open a new browser tab, and paste the URL. Replace "/lab?" with ":8081/jobs/" and press Enter.
You can then share a screenshot showing what you are seeing on that page (it can tell you the reason why the particular Spark jobs keep hanging).
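If it helps, the URL edit described above can be sketched in a few lines of Python. This is only an illustration: the function name spark_console_url and the example session URL are made up, and it assumes the console is reachable on port 8081 at the path /jobs/ on the same host as the JupyterLab session, as described above.

```python
from urllib.parse import urlsplit


def spark_console_url(lab_url: str) -> str:
    """Build the Spark console URL from a JupyterLab session URL.

    Assumption: the console lives on the same host as the JupyterLab
    session, on port 8081, under /jobs/.
    """
    parts = urlsplit(lab_url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not a full URL: {lab_url!r}")
    # Keep the scheme and host; replace the "/lab?" part with ":8081/jobs/".
    return f"{parts.scheme}://{parts.hostname}:8081/jobs/"


# Hypothetical session URL, for illustration only.
example = "https://job-example.dnanexus.cloud/lab?"
print(spark_console_url(example))
```

Paste the printed URL into a new browser tab to open the Spark jobs page.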