How can I define a cohort without using the Cohort Browser?

24 October 2023 00:00
2 comments

I want to define a complex cohort. For example, I want to exclude every person who has an ICD code that contains the word kidney OR renal, exclude every person who has cancer, and include only men. Cohort Browser crashes when I try this. Can it be done on the command line, in R, or in JupyterLab / Spark JupyterLab? Could someone please provide an example script? All I can find are guides on how to manipulate cohorts that have already been defined in Cohort Browser.

Comments

Ondrej Klempir DNAnexus Team
- 24 October 2023 14:15
You should be able to work with the full data in JupyterLab from the very beginning, with no need to use Cohort Browser. However, I would recommend first exporting the relevant columns and perhaps applying some simple filters. See this post: https://community.dnanexus.com/s/question/0D5t000004SBm0eCAD/query-of-the-week-1-export-phenotypic-data-to-a-file

An example that starts directly from the full data is here: https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb

I think that this doc page explains it well: https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-spark-to-analyze-tabular-data

More specifically, you can use SQL to filter data: https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-spark-to-analyze-tabular-data#tips-for-using-sql

In addition, there is this Query of the Week post that shows somewhat more advanced filtering: https://community.dnanexus.com/s/question/0D5t000004SCYMrCAP/query-of-the-week-2-manipulating-optical-coherence-tomography-data-in-pyspark
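
As a rough illustration of the filtering logic asked about in the question (not of the platform API), the sketch below applies the three rules to a tiny in-memory table in plain Python. Everything in it is hypothetical: the record layout, the names `participants`, `icd10_descriptions`, `is_cancer` and `in_cohort`, the assumption that sex is coded 1 for male, and the simplification that any ICD-10 code starting with "C" counts as cancer. In practice the rows would come from the exported phenotype columns, and the same conditions could be written as a SQL or DataFrame filter.

```python
import re

# Hypothetical toy records; real rows would come from the exported phenotype table.
# Assumption: sex is coded 1 = male, 0 = female.
participants = [
    {"eid": 1, "sex": 1, "icd10": ["I10"]},
    {"eid": 2, "sex": 1, "icd10": ["N18", "I10"]},
    {"eid": 3, "sex": 0, "icd10": []},
    {"eid": 4, "sex": 1, "icd10": ["C61"]},
    {"eid": 5, "sex": 1, "icd10": []},
]

# Hypothetical, illustrative code -> description lookup.
icd10_descriptions = {
    "I10": "Essential (primary) hypertension",
    "N17": "Acute renal failure",
    "N18": "Chronic kidney disease",
    "E27": "Other disorders of adrenal gland",
    "C61": "Malignant neoplasm of prostate",
}

# Word boundaries keep "renal" from matching inside "adrenal".
KIDNEY_OR_RENAL = re.compile(r"\b(kidney|renal)\b", re.IGNORECASE)

# Codes whose description mentions kidney or renal.
excluded_codes = {
    code for code, text in icd10_descriptions.items()
    if KIDNEY_OR_RENAL.search(text)
}


def is_cancer(code):
    # Assumption (a simplification): any code starting with "C" is treated as cancer.
    return code.startswith("C")


def in_cohort(person):
    # Keep men with no kidney/renal code and no cancer code.
    if person["sex"] != 1:
        return False
    return not any(c in excluded_codes or is_cancer(c) for c in person["icd10"])


cohort_eids = [p["eid"] for p in participants if in_cohort(p)]
print(cohort_eids)
```

Building the set of excluded codes first, and only then filtering participants, keeps the text search on the small code dictionary rather than on every participant row.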

Former User of DNAx Community_94
- 24 October 2023 15:05
Thanks Ondrej. I'll work through these. Is there an update for this code block, which appears throughout the documentation but doesn't work any more?
def fields_for_id(field_id):

  from distutils.version import LooseVersion
  field_id = str(field_id)
  fields = participant.find_fields(name_regex=r'^p{}(_i\d+)?(_a\d+)?$'.format(field_id))
  return sorted(fields, key=lambda f: LooseVersion(f.name))

# Returns all field names for a given UKB showcase field id

def field_names_for_id(field_id):
  return [f.name for f in fields_for_id(field_id)]

field_names_for_id('31')

ERROR MESSAGE:
/tmp/ipykernel_119/3738977934.py:5: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
return sorted(fields, key=lambda f: LooseVersion(f.name))
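
One possible update, offered as a sketch rather than an official fix: the message above is a DeprecationWarning, and `distutils` was removed from the standard library in Python 3.12, so the import fails outright on newer kernels. `packaging.version.Version` is not a drop-in replacement here, as far as I can tell, because names like `p31_i0` are not valid version strings. The version below swaps `LooseVersion` for a small natural-sort key built on `re`. The helper name `natural_key` is my own, and passing `participant` as an explicit argument is a change from the original, which relied on a global; it is assumed to expose `find_fields(name_regex=...)` exactly as in the snippet above.

```python
import re


def natural_key(name):
    # re.split with a capturing group alternates text and digit chunks,
    # so every odd-indexed chunk is a run of digits; compare those as integers.
    chunks = re.split(r"(\d+)", name)
    return [int(c) if i % 2 else c for i, c in enumerate(chunks)]


# Returns all fields for a given UKB showcase field id.
# Assumption: `participant` offers find_fields(name_regex=...) as in the original snippet.
def fields_for_id(participant, field_id):
    field_id = str(field_id)
    fields = participant.find_fields(
        name_regex=r"^p{}(_i\d+)?(_a\d+)?$".format(field_id)
    )
    return sorted(fields, key=lambda f: natural_key(f.name))


# Returns all field names for a given UKB showcase field id.
def field_names_for_id(participant, field_id):
    return [f.name for f in fields_for_id(participant, field_id)]
```

With the `participant` object from the notebook in scope, the call would be `field_names_for_id(participant, '31')`. Instance and array indices sort numerically, so `_i2` comes before `_i10`.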
