How to extract selected cohort in csv format
Hello,
I would like to extract a cohort of cases from my dataset with at least one of the following diagnoses (as defined by ICD codes):
- M4712: Other spondylosis with myelopathy (Cervical region)
- M500: Cervical disk disorder with myelopathy
- M9931: Osseous stenosis of neural canal (Cervical region)
- M9941: Connective tissue stenosis of neural canal (Cervical region)
- M9951: Intervertebral disc stenosis of neural canal (Cervical region)
I would then like to describe this cohort's baseline characteristics (age, sex, etc.) and its genotype for a gene defined by 2 SNPs.
Ideally, I would like the data in tabular format, with each variable in a separate column, as a CSV file that I can then read into RStudio. However, I am struggling to do this despite having read the documentation on this page: https://dnanexus.gitbook.io/uk-biobank-rap/getting-started/working-with-ukb-data
Any help would be much appreciated. Thanks in advance.
Comments
Hi Renuka,
These two pages may help:
https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/accessing-phenotypic-data-as-a-file
https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-spark-to-analyze-tabular-data
There are also some useful example notebooks with different approaches:
https://github.com/UK-Biobank/UKB-RAP-Notebooks/blob/main/NBs_Prelim/103_export_participant_data.ipynb
https://github.com/dnanexus/UKB_RAP/blob/main/pheno_data/03-dx_extract_dataset_R.ipynb
My approach would be to identify all the fields of interest, then use either Table Exporter or JupyterLab with Spark to export them to a CSV. Several resources can help you identify the fields of interest:
- the UKB Showcase https://biobank.ndph.ox.ac.uk/showcase/ to find the field IDs (also contains supporting documentation on the data)
- the schema files available in RAP. These are in the Showcase Metadata folder in your project. The fields.tsv file should be useful.
- the data dictionary - get this with the command `dx extract_dataset <dataset> -ddd` (there is an example of this in one of the notebooks above). This gives you all the column names associated with a field (e.g. field 53 will have column names p53_i0, p53_i1, p53_i2, p53_i3, one for each instance).
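As a rough illustration of that last point, here is a minimal Python sketch that picks out every column name belonging to one field ID from a list of names. It assumes the names follow the `p<field>_i<instance>` pattern shown above, optionally with an `_a<array index>` suffix; the function name `columns_for_field` and the toy list are made up for the example, and in practice the names would come from the data dictionary file.

```python
import re

def columns_for_field(column_names, field_id):
    """Return the column names that belong to a single field ID."""
    # Assumed naming pattern: p53, p53_i0, p53_i0_a1 or p53_a1.
    # Anchoring the pattern stops field 53 from also matching p530_i0.
    pattern = re.compile(rf"^p{field_id}(_i\d+)?(_a\d+)?$")
    return [name for name in column_names if pattern.match(name)]

# Toy stand-in for the names listed in the data dictionary (illustrative only).
toy_names = ["eid", "p31", "p53_i0", "p53_i1", "p53_i2", "p53_i3", "p530_i0"]

print(columns_for_field(toy_names, 53))
# ['p53_i0', 'p53_i1', 'p53_i2', 'p53_i3']
```

The resulting list can then be used as the set of fields to export.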
I would extract the fields of interest to a CSV for all participants, and then filter it to create the cohort using your preferred tool (RStudio, etc.).
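The filtering step could look something like the Python sketch below; the same logic carries over to R. It makes several assumptions that you should check against your own export: that the diagnoses sit in a single column named `p41270` (field 41270, Diagnoses - ICD10), that sex is in `p31`, and that each diagnosis cell holds several codes either as a list-style string or separated by a delimiter. The toy rows are invented purely to show the matching.

```python
import csv
import io
import re

# The five ICD-10 codes from the question.
TARGET_CODES = {"M4712", "M500", "M9931", "M9941", "M9951"}

def has_target_diagnosis(cell, targets=TARGET_CODES):
    """Return True if any whole code in the cell is a target code."""
    # Split on anything that is not a letter or digit, so list-style cells
    # ('["M4712","I10"]') and delimited cells ('M4712|I10') both work.
    codes = set(re.findall(r"[A-Za-z0-9]+", cell or ""))
    return not targets.isdisjoint(codes)

def filter_cohort(rows, diagnosis_column="p41270"):
    """Keep only the rows with at least one target diagnosis."""
    return [row for row in rows if has_target_diagnosis(row.get(diagnosis_column))]

# Toy export (illustrative only); a real file would be opened with open(path).
toy_csv = io.StringIO(
    "eid,p31,p41270\n"
    '1,Female,"[""M4712"",""I10""]"\n'
    '2,Male,"[""M501""]"\n'
    "3,Male,\n"
    "4,Female,M9931|E119\n"
)

cohort = filter_cohort(csv.DictReader(toy_csv))
print([row["eid"] for row in cohort])
# ['1', '4']
```

Codes are compared as whole tokens rather than as substrings, so M501 does not count as a match for M500.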
For SNPs, this example notebook might help: https://github.com/UK-Biobank/SNP-filtering