I see that in the Cohort Browser it is possible to add a cohort filter based on participant EID values. I want to query a set of hundreds to thousands of EIDs and do not want to enter each one manually in the Cohort Browser. What would be the best way to work around this, either in the Cohort Browser or possibly with dxdata.create_cohort?
If you want to stick with downstream analysis in Cohort Browser, here are some other ideas that could serve as workarounds:
For a short list, you can manually enter the EIDs into the Sample ID (EID) field one by one (obviously not this use case, but it is good to have it here).
For a longer list of a few hundred, a SQL-backed cohort can be used (generated, e.g., using JupyterLab and the code from my previous post).
For a very large list (over a few thousand), you can create a phenotype and use Dataset Extender to add a column marking each participant's qualification for that cohort. At that point you can use the Cohort Browser for a quick filter.
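For the SQL-backed option, the query text itself can be generated from the EID list rather than typed by hand. The following is only a minimal sketch: the table name `participant` and the helper `build_eid_query` are hypothetical placeholders, not names taken from the platform, and the step of registering the query as a cohort is not shown.

```python
def build_eid_query(eids, table="participant", column="eid"):
    """Return a SQL string that selects the given EIDs.

    `table` and `column` are placeholders (assumptions); replace them
    with the names used in your own database.
    """
    cleaned = []
    for eid in eids:
        eid = str(eid).strip()
        # Accept digits only, so malformed IDs cannot alter the query text.
        if not eid.isdigit():
            raise ValueError(f"unexpected EID: {eid!r}")
        cleaned.append(eid)
    if not cleaned:
        raise ValueError("no EIDs given")
    in_list = ", ".join(f"'{eid}'" for eid in cleaned)
    return f"SELECT {column} FROM {table} WHERE {column} IN ({in_list})"


query = build_eid_query(["1234567", "2345678"])
print(query)
```

Checking that every EID is purely numeric before building the string keeps a stray character in the input list from silently changing the query.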
For hundreds to thousands of EIDs, I would use JupyterLab and dxdata as you suggested. I think we answered a similar question here: https://community.dnanexus.com/s/question/0D5t000003jkhhqCAA/hi-question-how-to-retrieve-all-fields-from-phenotypic-data-for-a-specific-sample-or-list-of-samples-provided-a-file-for-example
You may try the following code (it shows how to access phenotypes and then filter for specific EIDs using Koalas):
# After phenotypes are successfully loaded into a Spark dataframe named df_phenotypes
# (see https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb)
import databricks.koalas as ks  # importing Koalas adds to_koalas() to Spark dataframes
list_of_eids = ["1234567", "12345678"]  # can be hardcoded or, e.g., loaded from a file
df_phenotypes_koalas = df_phenotypes.to_koalas()  # convert the Spark dataframe to enable filtering with Koalas
print(df_phenotypes_koalas.shape)  # check shape before filtering
filtered_phenotypes = df_phenotypes_koalas[df_phenotypes_koalas["eid"].isin(list_of_eids)]  # apply "isin" filtering
print(filtered_phenotypes.shape)  # check shape after filtering
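If the EIDs live in a text file rather than in the notebook, the list can be built along these lines. This is a sketch under assumptions: the one-EID-per-line layout, the file name `eids.txt`, and the helper name `load_eids` are all illustrative and not part of the original answer.

```python
import tempfile
from pathlib import Path


def load_eids(path):
    """Read one EID per line, skipping blank lines and duplicates (order kept)."""
    seen = set()
    eids = []
    for line in Path(path).read_text().splitlines():
        eid = line.strip()
        if eid and eid not in seen:
            seen.add(eid)
            eids.append(eid)
    return eids


# Demo with a throwaway file; in practice, point load_eids at your own file.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "eids.txt"
    demo_file.write_text("1234567\n2345678\n\n1234567\n")
    list_of_eids = load_eids(demo_file)

print(list_of_eids)  # ['1234567', '2345678']
```

The EIDs are kept as strings, matching the string literals used in the filtering code above.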
Hi Ondrej,
The workaround you suggested, creating a new phenotype and using Dataset Extender, seems interesting. Is there any documentation or a tutorial on how to use that tool properly?
Thanks!
Hi Michael,
Dataset Extender sits in the Tools Library:
https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/tools-library#utility-apps
The tutorial is here:
https://documentation.dnanexus.com/developer/ingesting-data/dataset-extender
https://documentation.dnanexus.com/developer/ingesting-data/dataset-extender/dataset-extender-usage
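Before running the tool, the cohort membership needs to exist as tabular data. One way to prepare it is sketched below; note that the two-column layout, the column name `in_my_cohort`, and the helper `write_cohort_flags` are assumptions made for illustration, so please check the exact input format against the tutorial above.

```python
import csv
import io


def write_cohort_flags(eids, handle, flag_column="in_my_cohort"):
    """Write an 'eid,<flag_column>' CSV with the flag set to 1 for each EID.

    The column layout is an assumption, not the documented input format.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["eid", flag_column])
    for eid in eids:
        writer.writerow([eid, 1])


# Demo: write to an in-memory buffer; use open("flags.csv", "w", newline="") for a real file.
buffer = io.StringIO()
write_cohort_flags(["1234567", "2345678"], buffer)
print(buffer.getvalue())
```

Only the listed participants receive a value here; everyone else would simply have no entry in the new column.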