How to define a cohort from a list of EIDs?

18 March 2022 00:00
4 comments

I see that the Cohort Browser lets you add a cohort filter based on participant EID values. I want to query a set of hundreds to thousands of EIDs and do not want to enter each one manually in the Cohort Browser. What would be the best way to work around this, either in the Cohort Browser or possibly with dxdata.create_cohort?

Comments

Ondrej Klempir DNAnexus Team
- 18 March 2022 16:06
For hundreds to thousands of EIDs, I would use JupyterLab and dxdata, as you suggested. I think we answered a similar question here: https://community.dnanexus.com/s/question/0D5t000003jkhhqCAA/hi-question-how-to-retrieve-all-fields-from-phenotypic-data-for-a-specific-sample-or-list-of-samples-provided-a-file-for-example

You may try the following code (shows how to access phenotypes and then filter for specific EIDs using koalas):

# After the phenotypes are successfully loaded into a Spark dataframe (https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb)

import databricks.koalas as ks  # import Koalas

list_of_eids = ["1234567", "12345678"]  # can be hardcoded or, e.g., loaded from a file

df_phenotypes_koalas = df_phenotypes.to_koalas()  # convert the Spark dataframe to enable filtering with the Koalas library
print(df_phenotypes_koalas.shape)  # check shape before filtering

filtered_phenotypes = df_phenotypes_koalas[df_phenotypes_koalas["eid"].isin(list_of_eids)]  # apply "isin" filtering
print(filtered_phenotypes.shape)  # check shape after filtering
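The comment on list_of_eids mentions loading the EIDs from a file. A minimal sketch of that step, assuming a plain-text file with one EID per line (the helper name load_eids and the file name eids.txt are hypothetical, not part of dxdata or Koalas):

```python
from pathlib import Path


def load_eids(path):
    """Read one EID per line, skipping blank lines and duplicates, keeping order."""
    seen = set()
    eids = []
    for line in Path(path).read_text().splitlines():
        eid = line.strip()
        if eid and eid not in seen:
            seen.add(eid)
            eids.append(eid)
    return eids


# Usage (assumes a file named eids.txt and the df_phenotypes_koalas dataframe from the code above):
# list_of_eids = load_eids("eids.txt")
# filtered_phenotypes = df_phenotypes_koalas[df_phenotypes_koalas["eid"].isin(list_of_eids)]
```

The EIDs are kept as strings here to match the example list above; if the eid column in your dataframe is numeric, convert them before filtering.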

Ondrej Klempir DNAnexus Team
- 18 March 2022 16:12
If you want to stick with downstream analysis in the Cohort Browser, here are some other ideas that could work:
- For a short list, you can manually enter the EIDs into the Sample ID (EID) field one by one (obviously not this use case, but it is good to have it here).
- For a longer list of a few hundred, you can use a SQL-backed cohort (generated, e.g., using JupyterLab and the code from my previous post).
- For a very large list (over a few thousand), you can create a new phenotype by using Dataset Extender to add a column marking each participant's qualification for that cohort. At that point you can use the Cohort Browser for a quick filter on that column.
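For the last option, the marker column has to be prepared as a table first. A rough sketch of building such a table with the standard library is below; the column names, the 1/0 encoding, and the CSV layout are assumptions for illustration, and the input format Dataset Extender actually expects should be checked in its documentation.

```python
import csv


def write_cohort_flags(all_eids, cohort_eids, out_path, flag_column="in_my_cohort"):
    """Write a CSV with one row per participant and a 1/0 cohort membership flag.

    flag_column is a hypothetical column name; returns the number of flagged rows.
    """
    cohort = set(cohort_eids)
    flagged = 0
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["eid", flag_column])
        for eid in all_eids:
            is_member = eid in cohort
            flagged += is_member
            writer.writerow([eid, 1 if is_member else 0])
    return flagged


# Usage (hypothetical inputs):
# write_cohort_flags(all_eids, list_of_eids, "cohort_flags.csv")
```

Comparing the returned count against the length of your EID list is a quick way to spot EIDs that are missing from the dataset.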
Former User of DNAx Community_26
- 18 March 2022 20:10
Hi Ondrej,

The workaround you suggested, creating a new phenotype with Dataset Extender, seems interesting. Is there any documentation or a tutorial on how to use that tool properly?

Thanks!

Ondrej Klempir DNAnexus Team
- 21 March 2022 09:18
Hi Michael,

Dataset Extender sits in the Tools Library:
https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/tools-library#utility-apps

The tutorial is here:
https://documentation.dnanexus.com/developer/ingesting-data/dataset-extender
https://documentation.dnanexus.com/developer/ingesting-data/dataset-extender/dataset-extender-usage
