Hi all, I would be grateful for some help on this topic also. I've been trying to create a cohort of specific eids (several hundred) but having no luck. I've tried dxdata.create_cohort but when I run it, it does create a cohort in the right folder, but it always opens with an error. Can anyone give some example code for how to use dxdata.create_cohort properly?
I've seen the solution using the koalas dataframe, but doesn't that involve having to query the entire 500,000 participant dataset for your fields of interest first? Wouldn't that use a huge amount of computing power? I would be grateful for any advice! Thanks, David
Comments
Hi Catarina! Let me take a look!
I don't see the file. Can you tell me which IDs you are using?
Aha, I think this should answer your question, please lmk if it doesn't: https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb
Participant IDs - They will be specific to my project and I don't think I should share them(?)
I have checked that link but I couldn't retrieve all the data fields for a specific participant (or group of participants)
Agreed, don't share the IDs here.
I just needed to know which IDs you meant.
Are you certain all of the fields you wanted were selected in Showcase?
If you just got access to them, you may have to redispense.
Doc on that, for completeness: https://dnanexus.gitbook.io/uk-biobank-rap/getting-started/creating-a-project
This is probably obvious, but (for others) you can check by running dataset.entities
I am going to recheck the first link you sent.
My problem was in retrieving data for a particular set of participants (eids - e.g. 1234567, 12345678).
Thanks!
Check and lmk if the answer is not there.
I'll be back on tomorrow morning at the latest!
Catarina, this may also be helpful:
field_names = []
for feature in feature_list:
    print(feature)
    print(field_names_for_id(feature_code_mapping[feature]))
    field_names += field_names_for_id(feature_code_mapping[feature])
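For anyone reading the loop above without the notebook open, here is a minimal self-contained sketch of what it appears to do: look up a field ID for each feature, then collect every column belonging to that field. Everything in it is a stand-in. `ALL_COLUMNS`, the contents of `feature_code_mapping` and `feature_list`, and this version of `field_names_for_id` are hypothetical, and it assumes column names follow the `p<field>_i<instance>_a<array>` pattern. In a real session the names would come from the dataset, not a hardcoded list.

```python
import re

# Hypothetical stand-in for the participant column names in a dataset.
ALL_COLUMNS = ["eid", "p31", "p21022", "p50_i0", "p50_i1", "p500_i0"]

# Hypothetical mapping from a readable feature name to a field ID.
feature_code_mapping = {"sex": 31, "age_at_recruitment": 21022, "standing_height": 50}
feature_list = ["sex", "standing_height"]


def field_names_for_id(field_id, columns=ALL_COLUMNS):
    """Return every column for one field ID, across instances and arrays."""
    # Anchored pattern, so field 50 does not also match p500_i0.
    pattern = re.compile(r"^p{}(_i\d+)?(_a\d+)?$".format(field_id))
    return sorted(name for name in columns if pattern.match(name))


field_names = []
for feature in feature_list:
    field_names += field_names_for_id(feature_code_mapping[feature])

print(field_names)
```

The anchored regex is the part worth noticing: a plain prefix match on `p50` would also pull in unrelated fields such as `p500_i0`.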
Hi! Instead of bits and pieces, my friend @Ondrej Klempir put everything in one place, using koalas:
# After the phenotypes are successfully loaded into a Spark dataframe (https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb)
import databricks.koalas as ks  # import Koalas
list_of_eids = ["1234567", "12345678"]  # can be hardcoded or, e.g., loaded from a file
df_phenotypes_koalas = df_phenotypes.to_koalas()  # convert the Spark dataframe so it can be filtered with Koalas
print(df_phenotypes_koalas.shape)  # check shape before filtering
filtered_phenotypes = df_phenotypes_koalas[df_phenotypes_koalas["eid"].isin(list_of_eids)]  # apply "isin" filtering
print(filtered_phenotypes.shape)  # check shape after filtering
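Since the snippet mentions loading the eids from a file, here is a minimal sketch of one way to do that with the standard library only. The helper name `load_eids` and the one-eid-per-line file layout are assumptions for illustration, not something from the notebook. The eids are kept as strings to match the list above; whether the `eid` column in your dataframe is string-typed is worth checking before filtering.

```python
import tempfile
from pathlib import Path


def load_eids(path):
    """Read one eid per line, skipping blank lines and duplicates, keeping order."""
    seen = set()
    eids = []
    for line in Path(path).read_text().splitlines():
        eid = line.strip()
        if eid and eid not in seen:
            seen.add(eid)
            eids.append(eid)
    return eids


# Demo with a throwaway file; in practice, point load_eids at your own file.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "eids.txt"
    demo_file.write_text("1234567\n12345678\n\n1234567\n")
    list_of_eids = load_eids(demo_file)

print(list_of_eids)
```

The resulting `list_of_eids` can then be passed to `isin` exactly as in the filtering line above.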