JupyterLab Spark SQL queries keep hanging

19 May 2022 00:00
3 comments

Hello community-

I have been using JupyterLab in the Spark (i.e. multi-node) cluster configuration to run SQL queries on UKB data. I keep running into a problem in which queries that should complete quickly do not complete, in some cases for hours. The little circle in the upper right is solid, indicating that the kernel is still running, but stopping and restarting the kernel and then re-running the query has no effect. In some cases the same query runs fine in one open notebook but not in another in the same JupyterLab instance, but once the behavior appears, it soon shows up in every open notebook. Closing and re-opening the notebook seems to have no effect either. The only fix I have found is to create an entirely new JupyterLab instance, which is time-consuming and expensive.

Has anyone else encountered this and if so, do you know of a fix? It's driving me bonkers (that would be "barmy" I believe for those in the UK).

Thanks,

Eric Rose

PS. Here is an example.

First cell contents, to initiate environment:

# Import packages
import pyspark
import dxpy
import dxdata

# Initialize Spark (do this only once; do not rerun this cell unless you select Kernel -> Restart Kernel)
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

Second cell contents (runs in seconds when JupyterLab instance is fresh but not after a few different queries are executed):

# Find counts of gp_scripts rows where the only code provided is a dmd code
spark.sql("USE app84142_20220504132912")

retrieve_sql = """
select count(*) from
  (select
     case when read_2 is null then 'N' else 'Y' end as read_2_present,
     case when bnf_code is null then 'N' else 'Y' end as bnf_code_present,
     case when dmd_code is null then 'N' else 'Y' end as dmd_present
   from gp_scripts)
where read_2_present = 'N'
  and bnf_code_present = 'N'
  and dmd_present = 'Y'
"""

df = spark.sql(retrieve_sql)
df.show(n=10, truncate=False)

Comments


Former User of DNAx Community_62
- 19 May 2022 16:16
PS. Please ignore the inelegant structure of the SQL query above; it's intended to be tweaked for multiple uses.

Former User of DNAx Community_63
- 24 May 2022 15:05
I've had the same issue but have been unable to find a fix.

Ondrej Klempir DNAnexus Team
- 25 May 2022 09:36
I tried to reproduce this issue by adapting and running the code you posted, but I am not running into this problem. I reran it many times with success.

If a particular Spark command is taking too long to evaluate, you can monitor the Spark status by visiting the Spark console page. To do that, copy the URL of your current JupyterLab session (typically ending in ".dnanexus.cloud/lab?"), open a new browser tab, and paste the URL. Replace "/lab?" with ":8081/jobs/" and press Enter.

You can then share a screenshot showing what you are seeing on that page (it can tell you the reason why the particular Spark jobs keep hanging).
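
If you would rather not edit the URL by hand, the rewrite described above can be sketched as a small Python helper. This is only an illustration: the function name spark_console_url and the example hostname are made up, and it assumes the session URL has the usual "https://<something>.dnanexus.cloud/lab?" shape.

```python
from urllib.parse import urlsplit, urlunsplit


def spark_console_url(lab_url: str, port: int = 8081) -> str:
    """Turn a JupyterLab session URL into the Spark console jobs URL.

    Hypothetical helper: it keeps the scheme and hostname of the
    JupyterLab URL, adds the port, and replaces "/lab?" with "/jobs/".
    """
    parts = urlsplit(lab_url)
    if not parts.scheme or parts.hostname is None:
        raise ValueError(f"not a full URL: {lab_url!r}")
    # Drop the original path and query; point at the jobs page on the given port.
    return urlunsplit((parts.scheme, f"{parts.hostname}:{port}", "/jobs/", "", ""))


# Example with a made-up hostname; substitute your own session URL.
print(spark_console_url("https://job-example.dnanexus.cloud/lab?"))
```

Paste the printed URL into a new browser tab to open the jobs page.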
