Issue Retrieving Fields Using dxdata in Spark Cluster

04 July 2025 06:52
9 comments

Dear team,

I am currently using the Spark Cluster on the UKB-RAP and encountered an issue when trying to retrieve fields using dxdata.

I have already defined the entity and the list of fields, and I’m using the following code to extract the data into a Spark DataFrame:

df = participant.retrieve_fields(names=field_names, engine=dxdata.connect())

This approach has worked successfully in the past. However, I now receive the following error message:

ValueError: ca_certs is needed when cert_reqs is not ssl.CERT_NONE

I noticed that the default Spark cluster version in JupyterLab was recently updated to v2.5.0 as of July 2. Could this issue be related to the updated environment?

I would appreciate any suggestions.

Thank you very much.

Comments


Rachael W UKB Community team Data Analyst
- 07 July 2025 12:18
Hi Po-Wen,
there was a temporary problem with an expired SSL certificate on June 30th (see https://status.dnanexus.com/). Please try again now.
If there are still issues, please clear your cache and cookies and try again. Please also make sure you are logged into the UKB platform via https://ukbiobank.dnanexus.com/login.
If this doesn't help, please contact the platform providers, DNAnexus, via the Help tab > Contact Support in the UKB-RAP, or directly by email at ukbiobank-support@dnanexus.com.

0
Erik Andersson
- 08 July 2025 09:58
Just as a further update: I still get the same error while running under the same conditions as above, even after clearing cache and cookies.

0
Rachael W UKB Community team Data Analyst
- 08 July 2025 10:18
Hi Erik,
thank you for the update. Please contact DNAnexus with details.
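
For anyone putting those details together, here is a minimal sketch that prints the Python version and the installed versions of the packages named in the traceback in this thread. The distribution names `dxdata`, `thrift`, and `pyspark` are assumptions and may differ on the platform; a package that cannot be found is reported as not installed.

```python
import sys
from importlib import metadata


def package_versions(names=("dxdata", "thrift", "pyspark")):
    """Map each distribution name to its installed version, or None if absent.

    The default names are assumptions, not confirmed distribution names.
    """
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Collect the details into one dictionary and print them line by line.
report = {"python": sys.version.split()[0], **package_versions()}
for key, value in report.items():
    print(f"{key}: {value if value is not None else 'not installed'}")
```

Pasting this output together with the full traceback into the support request may save a round of follow-up questions.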

0
Clair Enthoven
- 08 July 2025 12:52
Hi all,
Do you have any updates about this issue?
I also get the same error and I'll contact DNAnexus.
Thanks!
Clair

0
Erik Andersson
- 08 July 2025 12:57
Hello Clair,
I also reached out to them. While they work on a fix, they suggested this workaround, which passes the dialect explicitly:
```
dxdata.connect(dialect="hive+pyspark")
```
That worked for me.
3
Kristin Sims Levine
- 08 July 2025 19:08
I'm getting the same error. Has it been fixed yet? Where do I add the dxdata.connect(dialect="hive+pyspark") call?

0
Kristin Sims Levine
- 08 July 2025 19:14
Got it to work by adding here:
# Pull down the fields we need
df = participant.retrieve_fields(names=field_names, coding_values="replace", engine=dxdata.connect(dialect="hive+pyspark"))
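
If a notebook has to run both before and after the fix lands, one option is a small wrapper that tries the default connection first and retries with the explicit dialect only when it hits the error from this thread. This is a sketch, not something provided by dxdata: `connect_with_fallback` is a hypothetical helper, and the connect function is passed in as an argument so the logic can be tried without the platform libraries.

```python
def connect_with_fallback(connect, dialect="hive+pyspark"):
    """Return an engine from connect(), retrying with an explicit dialect.

    `connect` is expected to be dxdata.connect. This helper is hypothetical
    and only wraps the workaround quoted in this thread.
    """
    try:
        return connect()
    except ValueError as err:
        # Only handle the specific error reported in this thread;
        # any other ValueError is re-raised unchanged.
        if "ca_certs is needed" not in str(err):
            raise
        return connect(dialect=dialect)


# On the platform (names taken from the posts above):
#   engine = connect_with_fallback(dxdata.connect)
#   df = participant.retrieve_fields(names=field_names,
#                                    coding_values="replace",
#                                    engine=engine)
```

Once the default `dxdata.connect()` works again, the wrapper simply returns its result and the fallback branch is never taken.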

2
Eljas Roellin
- 11 July 2025 11:53
Hey, this example appears throughout the tutorials and is the first step needed to work with Spark on this platform, yet it still seems not to work unless you use the workaround hidden in this community post?

0

Georgios Tsitsiridis

Edited 23 July 2025 08:13

The problem still persists. There seems to be a bug with the way you're passing cert_reqs=CERT_NONE.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[44], line 1
----> 1 olink_data = olink.retrieve_fields(engine=dxdata.connect(), fields=fields, coding_values="replace", limit=100)

File /opt/conda/lib/python3.12/site-packages/dxdata/engine/__init__.py:73, in _create_engine(*args, **kwargs)
     68         return create_engine("hive+pyspark:///",
     69                              connect_args=thrift_config.as_dict(), **new_kwargs)
     70     #If inside a worker & not on a spark cluster
     71     else:
     72         return create_engine("hive:///",
---> 73                              connect_args=thrift_config.as_dict(), **new_kwargs)
     74 else:
     75     #If not inside a worker, use same thrift as where user is logged in
     76     project_id = os.environ.get("DX_PROJECT_CONTEXT_ID")

File /opt/conda/lib/python3.12/site-packages/dxdata/engine/conn_utils/apollo_thrift.py:117, in ApolloThriftConnConfig.as_dict(self)
    115 if not self.ssl:
    116     raise NotImplementedError()
--> 117 socket = TSSLSocket(self.host, self.port, cert_reqs=CERT_NONE)
    118 transport = TSaslClientTransport(self.sasl_client_factory(), "PLAIN", socket)
    119 return {"thrift_transport": transport, "port": None}

File /opt/conda/lib/python3.12/site-packages/thrift/transport/TSSLSocket.py:267, in TSSLSocket.__init__(self, host, port, *args, **kwargs)
    265 socket_keepalive = kwargs.pop('socket_keepalive', False)
    266 self._validate_callback = kwargs.pop('validate_callback', _match_hostname)
--> 267 TSSLBase.__init__(self, False, host, kwargs)
    268 TSocket.TSocket.__init__(self, host, port, unix_socket,
    269                          socket_keepalive=socket_keepalive)

File /opt/conda/lib/python3.12/site-packages/thrift/transport/TSSLSocket.py:154, in TSSLBase.__init__(self, server_side, host, ssl_opts)
    152 if self._should_verify:
    153     if not self.ca_certs:
--> 154         raise ValueError(
    155             'ca_certs is needed when cert_reqs is not ssl.CERT_NONE')
    156     if not os.access(self.ca_certs, os.R_OK):
    157         raise IOError('Certificate Authority ca_certs file "%s" '
    158                       'is not readable, cannot validate SSL '
    159                       'certificates.' % (self.ca_certs))

ValueError: ca_certs is needed when cert_reqs is not ssl.CERT_NONE
