Query of the Week #1: Export Phenotypic Data to a File
Hello everyone!
Lately, I have been trying to understand the possibilities of how to write queries for exporting phenotype data from the dnax database. I decided to summarize 4 possibilities of such "queries" in this post. I am aware that there is this documentation page: https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/accessing-phenotypic-data-as-a-file, which describes some of the methods. However, I wanted to give a closer look at the practical examples. I gave this post the title Query of the week #1, therefore, I have other topics in mind and thought I would like to share them with you. Ideally, I will try to share one post each week.
Let me know in the comments what you think about such a series and I will definitely appreciate your comments and feedback.
For this week, let's implement phenotype export in 4 different ways:
- exporting phenotypes directly in Cohort Browser (max. 30k of rows)
- via dx extract dataset
- via Table Exporter app
- and implementing an SQL query in the Spark based JupyterLab
I got consistent results using all 4 methods.
In my eyes, exporting phenotypes is typically "one time" operation, so the resulting csv can be stored on the dnax permanent storage (UKB RAP project).
For demonstration purposes, we are going to create and work with a cohort. Let's focus on participants having the "Age at recruitment" between 50 AND 55, and having not null information about "Volume of brain". This filtering operation in Cohort Browser lowered down the full >500k dataset to less than 10k.
I saved my cohort by clicking on the small "floppy disk" icon.
1) Exporting phenotypes directly in Cohort Browser
This is the easiest and fastest way. It can be done directly in the Cohort Browser. Just click on the Data Preview tab, and a customizable table will appear. The main limitation is that only about 30,000 rows could be exported. You can add more columns of your interest and later download a csv.
2) dx extract dataset
dx extract dataset is a great utility, but you would need to know how to operate command line interface (CLI). This tool is part of dx toolkit. It can also be accessed using the ttyd app.
There are great notebooks and documentation for those who would like to learn more
https://github.com/dnanexus/OpenBio/tree/master/dx-toolkit super notebooks
https://documentation.dnanexus.com/user/helpstrings-of-sdk-command-line-utilities#extract_dataset
I implemented the following steps:
a) dx extract_dataset query_of_the_week_1 --dump-dataset-dictionary
--> which takes our created cohort as input and generates dictionary metadata
b) cat query_of_the_week_1.data_dictionary.csv | grep "participant" | grep "Participant ID"
cat query_of_the_week_1.data_dictionary.csv | grep "participant" | grep "Age at recruitment"
cat query_of_the_week_1.data_dictionary.csv | grep "participant" | grep "Volume of brain, grey+white
--> To convert field titles into field names
--> In order to construct our export query
c) dx extract_dataset query_of_the_week_1 --fields "participant.eid,participant.p21022,participant.p25010_i2"
--> export took about one minute and we could then mapped field names back to field title using the dumped dictionary
3) Table Exporter
Table exporter is a Spark based app available on the UKB RAP. The main advantage is its graphical interface, which may be suitable for those who do not want to work with CLI. https://documentation.dnanexus.com/developer/apps/developing-spark-apps/table-exporter-application
The settings I used can be seen here on this screenshot:

- FIELD-TITLE is the name of the field as shown in the cohort browser.
- I had to use a backslash because one of my fields contained a comma.
- It took 5 minutes.
- Instance type used was mem3_ssd1_v2_x16.
4) Spark based JupyterLab
Spark based JupyterLab is a very powerful tool for implementing various highly customizable queries. I have built the phenotype export based on the following publicly available notebook:
https://github.com/dnanexus/OpenBio/blob/master/dxdata/getting_started_with_dxdata.ipynb
As we are working with Cohort object type, I have focused the most on "Loading a cohort and retrieving data section".
Apart from that, I also recommend this notebook for other types of SQL like queries and to get info how to export dataframes to csv:
https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb
Comments
39 comments
Hi Chinonso,
Thanks for sharing. May I ask two questions for you: 1) In Table Exporter, did you attach the File containing Field Names, or did you manually type in each Field name in the box? Have you tried with using File containing Field Names? I have not been successful in that approach. 2) Are all the variables requested in your Table Exporter already included in your cohort created by Cohort Browser? If so, why did not you download your saved cohort directly by clicking the "download" button, and why did you have to navigate through Table Exporter to download it?
Thank you!
Yimei
Hi Yimei,
Not sure your first question is clear enough for me to respond. Can you kindly provide further details? I think the first thing to do is to access the variable you want via the cohort browser. save it and download it via table exporter. The reason for this is three-folds, and I guess this also answers your second question:
a) if you download the dataset from the cohort browser, you will end up with column names different from the UK Biobank data dictionary;
b) With the cohort browser, you can only download 30,000 rows but table exporter gives you the leverage to download all the selected variables for all the UK participants assigned to you.
c)With table exporter, you get the opportunity to save the file in UKB format.
Does this answer your question?
Hi Chinonso,
Thanks for your timely response. For my first question "In Table Exporter, did you attach the File containing Field Names, or did you manually type in each Field name in the box?", please see the figure below, where you can have two ways to tell Table Exporter which variables you want to request. One way is to attach the File containing Field Names (the second blue line), the other way is to manually type in Field Names or Field Titles in the Advanced Options section. Which way did you choose?
For my 2nd question, thanks for providing the rationale. In your reason (b), in fact you can download the full participants with Cohort Browser, it only gives you a preview of 30,000 rows, but once you download, it will reveal the full. I have tried with Cohort Browser before, and it gives me 502,376 rows. The reason I do not like Cohort Browser is that it is not efficient for a batch query of > 100 variables.
Thank you!
Yimei
Hi Yimei,
?
Sorry for my late reply.
?
For the section on file containing filed names, I left it blank as I do not think are required to do anything there. I only selected the saved file I created from the cohort browser in the dataset or cohort or dashboard section. See the attached picture for clarity on this. Also in your header style, you are expected to change the FIELD_NAME to UKB-FORMAT, the coding option from REPLACE to RAW.
?
I actually taught the cohort browser only allows 30,000 rows, but thank you for sharing that information. I will put that in mind.
Thank you Chinonso. I see. My problem with Table Exporter specifically lies in using the file containing field names to request a big number of fields, which is different from yours, but thank you for trying to help. After some discussion with DNANexus IT team, they solved my problem of being unable to use File Containing Field Names in Table Exporter. I will share the experience in a new thread.
Hi {@005t0000006BZL2AAO}? , with the help of DNANexus IT Support team, I get some hidden tips to use Table Exporter, which solved my error message in the threads above that I discussed with you earlier. The tips specifically address the error when requesting fields through File Containing Field Names in Table Exporter. I would like to share these tips for everyone to reference:
Hope this can be helpful to everyone who plays with Table Exporter!
Yimei
Hi @Yimei Huang?, this is great! Thanks for sharing this with Community.
If I want to export, say, all demographics + all measurements/fields with “instance 0” in their name, is this the right approach?
(I want all the measurements and info, for the whole population, for doing medical/predictive analytics = I do not know in advance which features are useful, so I want all of them. Thats hundreds of columns)
Please start a new post for your new question, rather than adding to this old thread.
Please sign in to leave a comment.