Attempt to retrieve all data fields from a cohort failed in a dxjupyter notebook. I am wondering what the error means and what the best practice is for retrieving all fields for every participant.

Hello,

 

I am interested in retrieving all fields from my project in RAP for every participant. I was able to retrieve the small subsets of fields I initially tested on, but now I want to retrieve the entire set of fields.

 

What I have tried, with Spark initialized, is:

 

main_entity = dataset.primary_entity
all_fields = main_entity.fields[:]
df = main_entity.retrieve_fields(fields=all_fields,
                                 coding_values='replace',
                                 filter_sql=cohort.sql,
                                 engine=dxdata.connect())

 

But then get this error:

Py4JJavaError: An error occurred while calling o34289.join.

: org.apache.spark.sql.AnalysisException: USING column `eid` cannot be resolved on the left side of the join. The left-side columns: [participant_0001$eid,etc

 

and

 

During handling of the above exception, another exception occurred:

AnalysisException: 'USING column `eid` cannot be resolved on the left side of the join. The left-side columns: [participant_0001$eid,etc

Comments

8 comments

  • Comment author
    Ondrej Klempir DNAnexus Team

    Hello Michael,

     

    I think that retrieving all fields for every participant is not the intended use case, as there may be a very large number of phenotypic columns in the table to be exported. I hope you can preview the scope of fields in Cohort Browser, e.g. in the Table tab, and based on that decide which columns (a "small subset of fields") you would like to export and use in your analyses.

     

    As a workaround, and maybe this will help, I would give TableExporter a chance and try exporting the phenotypic data with it, both the full set and a subset.

  • Hi Ondrej,

     

    Thank you for the response! I can see now that it may not be the intended use case. Do you have any references showing how others have approached retrieving all fields for every participant?

     

    I've tried TableExporter and ran into the same error.

     

     

  • Comment author
    Ondrej Klempir DNAnexus Team

    My friend @Anastazie Sedlakova successfully ran the export using the approach in this notebook (https://github.com/dnanexus/UKB_RAP/blob/main/GWAS/gwas-phenotype-samples-qc.ipynb):

     

    cont_df = participant.retrieve_fields(
        fields=fields,
        filter_sql=cont.sql,
        engine=dxdata.connect(
            dialect="hive+pyspark",
            connect_args={'config': {'spark.kryoserializer.buffer.max': '256m',
                                     'spark.sql.autoBroadcastJoinThreshold': '-1'}},
        ),
    ).to_koalas()
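
For readers unfamiliar with the two Spark settings in the snippet above: setting `spark.sql.autoBroadcastJoinThreshold` to `-1` disables broadcast joins, and `spark.kryoserializer.buffer.max` raises the maximum size of the Kryo serialization buffer. Below is a minimal sketch of building the same `connect_args` dictionary with a small helper. The function name `spark_connect_args` and its parameters are hypothetical; only the two setting names and their values come from the snippet, and whether these settings avoid the `eid` join error is not confirmed in this thread.

```python
def spark_connect_args(kryo_buffer_max="256m", disable_broadcast_joins=True):
    """Build a connect_args dict shaped like the one in the snippet above.

    Hypothetical helper for illustration; it is not part of dxdata.
    """
    config = {"spark.kryoserializer.buffer.max": kryo_buffer_max}
    if disable_broadcast_joins:
        # In Spark, a threshold of -1 turns broadcast joins off.
        config["spark.sql.autoBroadcastJoinThreshold"] = "-1"
    return {"config": config}


connect_args = spark_connect_args()
# In the notebook this would be passed on as in the snippet above:
# dxdata.connect(dialect="hive+pyspark", connect_args=connect_args)
```

Keeping the settings in one place makes it easier to try, for example, a larger buffer value without retyping the nested dictionary.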

  • Comment author
    Ondrej Klempir DNAnexus Team

    Alternatively, the following lines worked for me (you can select a different instance type to get more memory):

     

    import pandas

    df = dataset.primary_entity.retrieve_fields(fields=dataset.primary_entity.fields[:], coding_values='replace', engine=dxdata.connect())
    df2 = dataset.primary_entity.retrieve_fields(engine=dxdata.connect())

    # Convert the Spark DataFrames to pandas DataFrames
    df.toPandas()
    df2.toPandas()

  • Hi Ondrej,

     

    Thanks for following up on this again!

     

    I've tried both of the above suggestions and ran into the same error again (Py4JJavaError: An error occurred while calling o423.join.)

     

    When I try:

     

    import pandas as pd

    main_entity = dataset.primary_entity
    all_fields = main_entity.fields[:]
    # Tabulate the name and title of every field in the primary entity
    all_fields_df = pd.DataFrame(
        {
            'Name': [f.name for f in all_fields],
            'Title': [f.title for f in all_fields]
        }
    )
    all_fields_df

     

    I see that I have 19,990 fields. I assume that my dispensed dataset gives me access to all fields, but I am wondering whether this error has something to do with access. Furthermore, could this error depend on the memory of the instance type? I've mostly been testing on mem1_ssd1_v2_x16.
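
One way to narrow this down, offered as a sketch rather than a confirmed fix, is to request the 19,990 fields in smaller batches and see which batch (if any) reproduces the join error. The helper names `batched` and `retrieve_in_batches` and the batch size of 500 are hypothetical; `retrieve` stands in for a call such as `main_entity.retrieve_fields(fields=batch, ...)` from the snippets in this thread.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    items = list(items)
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def retrieve_in_batches(fields: Iterable[T],
                        retrieve: Callable[[List[T]], R],
                        batch_size: int = 500) -> List[R]:
    """Call `retrieve` once per batch and collect the results in order."""
    return [retrieve(batch) for batch in batched(fields, batch_size)]


# Hypothetical usage inside the notebook (not run here; assumes the
# `main_entity`, `cohort` and `dxdata` objects from the earlier snippets):
#
# frames = retrieve_in_batches(
#     main_entity.fields[:],
#     lambda batch: main_entity.retrieve_fields(
#         fields=batch,
#         coding_values='replace',
#         filter_sql=cohort.sql,
#         engine=dxdata.connect(),
#     ),
# )
```

If the per-batch results are to be combined afterwards, each batch would probably need to include the participant ID field; that detail is left out of the sketch.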

     

  • Comment author
    Ondrej Klempir DNAnexus Team

    Yes, that is exactly what I would do as the next steps. Here is a list of available instance types:

    https://dnanexus-prod-asg-dnanexusprodassets4d7ed69b-i607e894f3ya.s3.us-east-1.amazonaws.com/images/files/UKB_Rate_Card-Current.pdf

     

    Anyway, mem1_ssd1_v2_x16 has 32 GB of memory, which seems large enough in my opinion (even in the case of a Spark cluster). How many cluster nodes are you using?

     

    If the issue persists, you can contact ukbiobank-support@dnanexus.com and share your project with DNAnexus Supporters. They are great and will be able to do some in-depth testing.
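
As a rough sanity check on the memory question, the size of a fully materialised table can be estimated as rows times columns times bytes per value. The sketch below assumes about 500,000 participants and 8 bytes per value; both are assumptions for illustration (the actual cohort size and column types are not stated in this thread), while the 19,990 fields and 32 GB figures come from the posts above. The function name `estimated_table_bytes` is hypothetical.

```python
def estimated_table_bytes(n_rows: int, n_columns: int, bytes_per_value: int = 8) -> int:
    """Rough lower bound on the size of a dense table held in memory."""
    return n_rows * n_columns * bytes_per_value


N_PARTICIPANTS = 500_000            # assumption, not stated in the thread
N_FIELDS = 19_990                   # field count reported above
INSTANCE_MEMORY_BYTES = 32 * 10**9  # 32 GB, as quoted for mem1_ssd1_v2_x16

estimate = estimated_table_bytes(N_PARTICIPANTS, N_FIELDS)
print(f"Estimated size: {estimate / 10**9:.1f} GB")
print("Fits in instance memory:", estimate <= INSTANCE_MEMORY_BYTES)
```

Under these assumptions a single pandas DataFrame holding every field would not fit in 32 GB, which matters for `toPandas()` because it collects the whole table onto one machine. This estimate does not by itself explain the `eid` join error.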

  • Using 2 nodes. How many would you recommend?

  • Comment author
    Ondrej Klempir DNAnexus Team

    Hi Michael,

     

    Were you able to resolve this? If so, what was the solution?

     

    I would use 3, i.e. one master and 2 cluster nodes, but I am not sure whether this will help in this situation.

