Attempt to retrieve all data fields from a cohort failed in a dxjupyter notebook. I am wondering what the error means and what the best practice is for retrieving all fields for every participant.

18 March 2022 00:00
8 comments

Hello,

I am interested in retrieving all fields from my project in RAP for every participant. I was able to successfully retrieve fields for the small subsets of fields I initially tested on, but now want to see the entire scope of fields.

What I have tried, with Spark initialized, is:

main_entity = dataset.primary_entity
all_fields = main_entity.fields[:]
df = main_entity.retrieve_fields(
    fields=all_fields,
    coding_values='replace',
    filter_sql=cohort.sql,
    engine=dxdata.connect(),
)

But then get this error:

Py4JJavaError: An error occurred while calling o34289.join.

: org.apache.spark.sql.AnalysisException: USING column `eid` cannot be resolved on the left side of the join. The left-side columns: [participant_0001$eid,etc

and

During handling of the above exception, another exception occurred:

AnalysisException: 'USING column `eid` cannot be resolved on the left side of the join. The left-side columns: [participant_0001$eid,etc

Comments


Ondrej Klempir DNAnexus Team
- 18 March 2022 16:27
Hello Michael,

I think that retrieving all fields for every participant is not the intended use case, as there may be a large number of phenotypic columns in the table to be exported. I hope you can preview the scope of fields in Cohort Browser, e.g. in the Table tab, and based on that decide which columns ("small subset of fields") you would like to export and use in your analyses.

As a workaround, it may help to try TableExporter for exporting the phenotypic data, either the full set or a subset.
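Picking a subset of columns could look roughly like the sketch below. It assumes only that each field object exposes `name` and `title` attributes (as used in the field listing later in this thread); `select_fields`, the keyword, and the stand-in field entries are hypothetical and for illustration only.

```python
from types import SimpleNamespace


def select_fields(fields, keywords, always_keep=("eid",)):
    """Return the fields whose title mentions any keyword (case-insensitive).

    Fields whose name is listed in always_keep are retained regardless,
    so the participant ID column stays in the selection.
    """
    lowered = [k.lower() for k in keywords]
    selected = []
    for f in fields:
        title = (f.title or "").lower()
        if f.name in always_keep or any(k in title for k in lowered):
            selected.append(f)
    return selected


# Stand-in objects for illustration; in a notebook the input would be
# dataset.primary_entity.fields[:] instead.
demo_fields = [
    SimpleNamespace(name="eid", title="Participant ID"),
    SimpleNamespace(name="field_a", title="Sex"),
    SimpleNamespace(name="field_b", title="Body mass index (BMI)"),
]
subset = select_fields(demo_fields, ["body mass"])
subset_names = [f.name for f in subset]
```

The selected field objects could then be passed as `fields=` to `retrieve_fields`, in the same way the full field list is passed in the original question.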

Former User of DNAx Community_26
- 18 March 2022 18:46
Hi Ondrej,

Thank you for the response! I can see now that it may not be the intended use case. Would you have any reference to how others have approached retrieving all fields for every participant in the past?

I've tried TableExporter and run into the same error.

Ondrej Klempir DNAnexus Team
- 21 March 2022 14:26
My friend @Anastazie Sedlakova was successful in running the export via the following (https://github.com/dnanexus/UKB_RAP/blob/main/GWAS/gwas-phenotype-samples-qc.ipynb):

cont_df = participant.retrieve_fields(
    fields=fields,
    filter_sql=cont.sql,
    engine=dxdata.connect(
        dialect="hive+pyspark",
        connect_args={
            'config': {
                'spark.kryoserializer.buffer.max': '256m',
                'spark.sql.autoBroadcastJoinThreshold': '-1',
            }
        },
    ),
).to_koalas()

Ondrej Klempir DNAnexus Team
- 21 March 2022 14:28
Alternatively, the following lines worked for me (you can select a different instance type to get more memory):

df = dataset.primary_entity.retrieve_fields(fields=dataset.primary_entity.fields[:],coding_values='replace',engine=dxdata.connect())
df2 = dataset.primary_entity.retrieve_fields(engine=dxdata.connect())

import pandas

df.toPandas()
df2.toPandas()

Former User of DNAx Community_26
- 21 March 2022 19:40
Hi Ondrej,

Thanks for following up on this again!

I've tried both of the above suggestions and run into the same error again (Py4JJavaError: An error occurred while calling o423.join.)

When I try:

import pandas as pd

main_entity = dataset.primary_entity
all_fields = main_entity.fields[:]
all_fields_df=pd.DataFrame(
  {
    'Name': [f.name for f in all_fields],
    'Title': [f.title for f in all_fields]
  }
)
all_fields_df

I see that I have 19,990 fields. I assume that I have access to all fields in my dispensed dataset, but I am wondering whether this error has something to do with access. Furthermore, could this error depend on the memory of the instance type? I've mostly been testing on mem1_ssd1_v2_x16.
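With this many fields, one thing that might be worth trying is retrieving them in smaller batches rather than in a single call. Whether this avoids the join error is not confirmed anywhere in this thread, so the following is only a sketch under that assumption; `batch_fields` is a hypothetical helper and the stand-in field objects are made up for illustration.

```python
from types import SimpleNamespace


def batch_fields(fields, batch_size, key_name="eid"):
    """Split fields into consecutive batches of at most batch_size non-key fields.

    The key field(s), matched by name, are placed at the front of every
    batch so that the per-batch results can later be joined on that column.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    keys = [f for f in fields if f.name == key_name]
    rest = [f for f in fields if f.name != key_name]
    return [keys + rest[i:i + batch_size] for i in range(0, len(rest), batch_size)]


# Stand-in objects; in a notebook the input would be main_entity.fields[:].
demo_fields = [SimpleNamespace(name="eid")] + [
    SimpleNamespace(name=f"field_{i}") for i in range(5)
]
batches = batch_fields(demo_fields, batch_size=2)
batch_names = [[f.name for f in batch] for batch in batches]

# Each batch could then be passed to retrieve_fields on its own, e.g.
#   main_entity.retrieve_fields(fields=batch, engine=dxdata.connect())
# (untested here; shown only to indicate where the batches would be used).
```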

Ondrej Klempir DNAnexus Team
- 21 March 2022 20:05
Yes, that is exactly what I would do as the next steps. Here is a list of available instance types:
https://dnanexus-prod-asg-dnanexusprodassets4d7ed69b-i607e894f3ya.s3.us-east-1.amazonaws.com/images/files/UKB_Rate_Card-Current.pdf

Anyway, mem1_ssd1_v2_x16 has 32 GB of memory, which seems large enough in my opinion (even for a Spark cluster). How many cluster nodes are you using?
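A rough back-of-the-envelope estimate can help judge whether a full table would fit in memory if it were collected on a single node (for example via `toPandas()`). The numbers below are assumptions, not measurements: 19,990 fields is taken from the message above, while the participant count of 500,000 and the 8 bytes per cell are illustrative guesses, and real Spark or pandas memory use will differ.

```python
def estimate_table_bytes(n_rows, n_columns, bytes_per_cell=8):
    """Naive size of a dense table: rows x columns x bytes per cell."""
    return n_rows * n_columns * bytes_per_cell


# Assumed inputs (see the caveats above).
n_participants = 500_000   # illustrative guess, not taken from this thread
n_fields = 19_990          # field count reported earlier in this thread

total_bytes = estimate_table_bytes(n_participants, n_fields)
total_gb = total_bytes / 1e9  # decimal gigabytes

print(f"Estimated dense table size: {total_gb:.2f} GB")
```

Under these assumptions the estimate comes out above 32 GB, which would suggest trying a subset of fields or a larger instance type before collecting everything into one in-memory data frame.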

If the issue persists, you can contact ukbiobank-support@dnanexus.com and share your project with DNAnexus Supporters. They are great and will be able to do some in-depth testing.

Former User of DNAx Community_26
- 21 March 2022 20:47
I am using 2 nodes. How many would you recommend?

Ondrej Klempir DNAnexus Team
- 29 March 2022 13:08
Hi Michael,

Were you able to resolve this? If so, what was the solution?

I would use 3 nodes, i.e. one master and 2 cluster nodes, but I am not sure whether this would help in this situation.
