Issue Retrieving Fields Using dxdata in Spark Cluster

Po-Wen Ku

Dear team,

I am currently using the Spark Cluster on the UKB-RAP and encountered an issue when trying to retrieve fields using dxdata.

I have already defined the entity and the list of fields, and I’m using the following code to extract the data into a Spark DataFrame:

df = participant.retrieve_fields(names=field_names, engine=dxdata.connect())

This approach has worked successfully in the past. However, I now receive the following error message:

ValueError: ca_certs is needed when cert_reqs is not ssl.CERT_NONE

I noticed that the default Spark cluster version in JupyterLab was recently updated to v2.5.0 as of July 2. Could this issue be related to the updated environment?

I would appreciate any suggestions.

Thank you very much.

Comments

9 comments

  • Comment author
    Rachael W (UKB Community team, Data Analyst)

    Hi Po-Wen,

    There was a temporary problem with an expired SSL certificate on June 30th (see https://status.dnanexus.com/). Please try again now.

    If there are still issues, please clear your cache and cookies and try again. Please also make sure you are logged into the UKB platform via https://ukbiobank.dnanexus.com/login.

    If this doesn't help, please contact the platform provider, DNAnexus, via the Help tab > Contact Support in the UKB-RAP, or directly by email at ukbiobank-support@dnanexus.com.


    0
  • Comment author
    Erik Andersson

    Just as a further update: I still get the same error while running under the same conditions as above, even after clearing cache and cookies.

    0
  • Comment author
    Rachael W (UKB Community team, Data Analyst)

    Hi Erik,

    Thank you for the update. Please contact DNAnexus with the details.

     

    0
  • Comment author
    Clair Enthoven

    Hi all,

    Do you have any updates about this issue? 

    I also get the same error, and I'll contact DNAnexus.

    Thanks!

    Clair

    0
  • Comment author
    Erik Andersson

    Hello Clair, 

    I also reached out to them, and they provided a workaround to use while they work on a fix. Pass the dialect explicitly when connecting:

    dxdata.connect(dialect="hive+pyspark")

    That worked for me.

    3
  • Comment author
    Kristin Sims Levine

    I'm getting the same error. Has it been fixed yet? Where do I add the dxdata.connect(dialect="hive+pyspark") call?

    0
  • Comment author
    Kristin Sims Levine

    Got it to work by passing it as the engine argument of retrieve_fields:
    # Pull down the fields we need
    df = participant.retrieve_fields(names=field_names, coding_values="replace", engine=dxdata.connect(dialect="hive+pyspark"))
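
    For notebooks that should keep working once the default connection is fixed, the workaround can be wrapped in a small helper. This is only a sketch: the name retrieve_fields_with_workaround is hypothetical, and it assumes (as the traceback later in this thread suggests) that the ValueError is raised by dxdata.connect() itself. The entity and the connect function are passed in as arguments, so the helper does not import dxdata.

```python
def retrieve_fields_with_workaround(entity, field_names, connect, **retrieve_kwargs):
    """Retrieve fields, retrying with dialect="hive+pyspark" on the ca_certs error.

    Hypothetical helper: `entity` would be e.g. the `participant` entity and
    `connect` would be `dxdata.connect`. Extra keyword arguments such as
    coding_values="replace" are forwarded to retrieve_fields unchanged.
    """
    try:
        # Default connection, as in the original post.
        engine = connect()
    except ValueError as exc:
        # Only handle the specific error reported in this thread.
        if "ca_certs is needed" not in str(exc):
            raise
        # Workaround reported above: name the dialect explicitly.
        engine = connect(dialect="hive+pyspark")
    return entity.retrieve_fields(names=field_names, engine=engine, **retrieve_kwargs)
```

    In a notebook this would be called as df = retrieve_fields_with_workaround(participant, field_names, dxdata.connect, coding_values="replace"). Any other ValueError is re-raised rather than hidden.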

    2
  • Comment author
    Eljas Roellin

    Hey, this example appears throughout the tutorials and is the first step needed to work with Spark on this platform. Is it still not working unless you use the workaround hidden in this community post?

    0
  • Comment author
    Georgios Tsitsiridis

    The problem still persists. There seems to be a bug with the way you're passing cert_reqs=CERT_NONE.

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    Cell In[44], line 1
    ----> 1 olink_data = olink.retrieve_fields(engine=dxdata.connect(), fields=fields, coding_values="replace", limit=100)
    
    File /opt/conda/lib/python3.12/site-packages/dxdata/engine/__init__.py:73, in _create_engine(*args, **kwargs)
         68         return create_engine("hive+pyspark:///",
         69                              connect_args=thrift_config.as_dict(), **new_kwargs)
         70     #If inside a worker & not on a spark cluster
         71     else:
         72         return create_engine("hive:///",
    ---> 73                              connect_args=thrift_config.as_dict(), **new_kwargs)
         74 else:
         75     #If not inside a worker, use same thrift as where user is logged in
         76     project_id = os.environ.get("DX_PROJECT_CONTEXT_ID")
    
    File /opt/conda/lib/python3.12/site-packages/dxdata/engine/conn_utils/apollo_thrift.py:117, in ApolloThriftConnConfig.as_dict(self)
        115 if not self.ssl:
        116     raise NotImplementedError()
    --> 117 socket = TSSLSocket(self.host, self.port, cert_reqs=CERT_NONE)
        118 transport = TSaslClientTransport(self.sasl_client_factory(), "PLAIN", socket)
        119 return {"thrift_transport": transport, "port": None}
    
    File /opt/conda/lib/python3.12/site-packages/thrift/transport/TSSLSocket.py:267, in TSSLSocket.__init__(self, host, port, *args, **kwargs)
        265 socket_keepalive = kwargs.pop('socket_keepalive', False)
        266 self._validate_callback = kwargs.pop('validate_callback', _match_hostname)
    --> 267 TSSLBase.__init__(self, False, host, kwargs)
        268 TSocket.TSocket.__init__(self, host, port, unix_socket,
        269                          socket_keepalive=socket_keepalive)
    
    File /opt/conda/lib/python3.12/site-packages/thrift/transport/TSSLSocket.py:154, in TSSLBase.__init__(self, server_side, host, ssl_opts)
        152 if self._should_verify:
        153     if not self.ca_certs:
    --> 154         raise ValueError(
        155             'ca_certs is needed when cert_reqs is not ssl.CERT_NONE')
        156     if not os.access(self.ca_certs, os.R_OK):
        157         raise IOError('Certificate Authority ca_certs file "%s" '
        158                       'is not readable, cannot validate SSL '
        159                       'certificates.' % (self.ca_certs))
    
    ValueError: ca_certs is needed when cert_reqs is not ssl.CERT_NONE
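
    When reporting this to DNAnexus support, it may help to include the versions of the packages involved. The sketch below uses only the Python standard library. The function name collect_versions is hypothetical, and the package list is an assumption: dxdata and thrift appear in the traceback above, while pyhive and pyspark are guesses based on the "hive+pyspark" dialect name.

```python
import sys
from importlib import metadata


def collect_versions(packages=("dxdata", "thrift", "pyhive", "pyspark")):
    """Return a dict of installed versions for the given distribution names.

    The default package names are assumptions; adjust them to match the
    environment. Packages that are missing are reported as "not installed".
    """
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions
```

    Running print(collect_versions()) in the same JupyterLab session that produced the error gives a short summary that can be pasted into a support request.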
    0
