Query of the Week #1: Export Phenotypic Data to a File

Ondrej Klempir DNAnexus Team

17 February 2023 00:00
39 comments

Hello everyone!

Lately, I have been trying to understand the possibilities of how to write queries for exporting phenotype data from the dnax database. I decided to summarize 4 possibilities of such "queries" in this post. I am aware that there is this documentation page: https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/accessing-phenotypic-data-as-a-file, which describes some of the methods. However, I wanted to give a closer look at the practical examples. I gave this post the title Query of the week #1, therefore, I have other topics in mind and thought I would like to share them with you. Ideally, I will try to share one post each week.

Let me know in the comments what you think about such a series and I will definitely appreciate your comments and feedback.

For this week, let's implement phenotype export in 4 different ways:

- exporting phenotypes directly in Cohort Browser (max. 30k of rows)

- via dx extract dataset

- via Table Exporter app

- and implementing an SQL query in the Spark based JupyterLab

I got consistent results using all 4 methods.

In my eyes, exporting phenotypes is typically "one time" operation, so the resulting csv can be stored on the dnax permanent storage (UKB RAP project).

For demonstration purposes, we are going to create and work with a cohort. Let's focus on participants having the "Age at recruitment" between 50 AND 55, and having not null information about "Volume of brain". This filtering operation in Cohort Browser lowered down the full >500k dataset to less than 10k.

I saved the cohort by clicking on the small disk icon. I saved my cohort by clicking on the small "floppy disk" icon.

1) Exporting phenotypes directly in Cohort Browser

This is the easiest and fastest way. It can be done directly in the Cohort Browser. Just click on the Data Preview tab, and a customizable table will appear. The main limitation is that only about 30,000 rows could be exported. You can add more columns of your interest and later download a csv.

Screenshot 2023-02-17 at 13.19.23

2) dx extract dataset

dx extract dataset is a great utility, but you would need to know how to operate command line interface (CLI). This tool is part of dx toolkit. It can also be accessed using the ttyd app.

There are great notebooks and documentation for those who would like to learn more

https://github.com/dnanexus/OpenBio/tree/master/dx-toolkit super notebooks

https://documentation.dnanexus.com/user/helpstrings-of-sdk-command-line-utilities#extract_dataset

I implemented the following steps:

a) dx extract_dataset query_of_the_week_1 --dump-dataset-dictionary

--> which takes our created cohort as input and generates dictionary metadata

b) cat query_of_the_week_1.data_dictionary.csv | grep "participant" | grep "Participant ID"

cat query_of_the_week_1.data_dictionary.csv | grep "participant" | grep "Age at recruitment"

cat query_of_the_week_1.data_dictionary.csv | grep "participant" | grep "Volume of brain, grey+white

--> To convert field titles into field names

--> In order to construct our export query

c) dx extract_dataset query_of_the_week_1 --fields "participant.eid,participant.p21022,participant.p25010_i2"

--> export took about one minute and we could then mapped field names back to field title using the dumped dictionary

3) Table Exporter

Table exporter is a Spark based app available on the UKB RAP. The main advantage is its graphical interface, which may be suitable for those who do not want to work with CLI. https://documentation.dnanexus.com/developer/apps/developing-spark-apps/table-exporter-application

The settings I used can be seen here on this screenshot:

Screenshot 2023-02-17 at 15.51.06

FIELD-TITLE is the name of the field as shown in the cohort browser.
I had to use a backslash because one of my fields contained a comma.
It took 5 minutes.
Instance type used was mem3_ssd1_v2_x16.

4) Spark based JupyterLab

https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-spark-to-analyze-tabular-data

Spark based JupyterLab is a very powerful tool for implementing various highly customizable queries. I have built the phenotype export based on the following publicly available notebook:

https://github.com/dnanexus/OpenBio/blob/master/dxdata/getting_started_with_dxdata.ipynb

As we are working with Cohort object type, I have focused the most on "Loading a cohort and retrieving data section".

Apart from that, I also recommend this notebook for other types of SQL like queries and to get info how to export dataframes to csv:

https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb

Comments

39 comments

Former User of DNAx Community_77
- 20 March 2023 17:46
Hi Chinonso,
Thanks for sharing. May I ask two questions for you: 1) In Table Exporter, did you attach the File containing Field Names, or did you manually type in each Field name in the box? Have you tried with using File containing Field Names? I have not been successful in that approach. 2) Are all the variables requested in your Table Exporter already included in your cohort created by Cohort Browser? If so, why did not you download your saved cohort directly by clicking the "download" button, and why did you have to navigate through Table Exporter to download it?

Thank you!
Yimei

0
Former User of DNAx Community_84
- 21 March 2023 00:32
Hi Yimei,

Not sure your first question is clear enough for me to respond. Can you kindly provide further details? I think the first thing to do is to access the variable you want via the cohort browser. save it and download it via table exporter. The reason for this is three-folds, and I guess this also answers your second question:

a) if you download the dataset from the cohort browser, you will end up with column names different from the UK Biobank data dictionary;
b) With the cohort browser, you can only download 30,000 rows but table exporter gives you the leverage to download all the selected variables for all the UK participants assigned to you.
c)With table exporter, you get the opportunity to save the file in UKB format.

Does this answer your question?

0
Former User of DNAx Community_77
- 21 March 2023 02:44
Hi Chinonso,
Thanks for your timely response. For my first question "In Table Exporter, did you attach the File containing Field Names, or did you manually type in each Field name in the box?", please see the figure below, where you can have two ways to tell Table Exporter which variables you want to request. One way is to attach the File containing Field Names (the second blue line), the other way is to manually type in Field Names or Field Titles in the Advanced Options section. Which way did you choose?

For my 2nd question, thanks for providing the rationale. In your reason (b), in fact you can download the full participants with Cohort Browser, it only gives you a preview of 30,000 rows, but once you download, it will reveal the full. I have tried with Cohort Browser before, and it gives me 502,376 rows. The reason I do not like Cohort Browser is that it is not efficient for a batch query of > 100 variables.

Thank you!
Yimei

0
Former User of DNAx Community_84
- 23 March 2023 21:06
Hi Yimei,
?
Sorry for my late reply.
?
For the section on file containing filed names, I left it blank as I do not think are required to do anything there. I only selected the saved file I created from the cohort browser in the dataset or cohort or dashboard section. See the attached picture for clarity on this. Also in your header style, you are expected to change the FIELD_NAME to UKB-FORMAT, the coding option from REPLACE to RAW.
?
I actually taught the cohort browser only allows 30,000 rows, but thank you for sharing that information. I will put that in mind.

0
Former User of DNAx Community_77
- 24 March 2023 00:22
Thank you Chinonso. I see. My problem with Table Exporter specifically lies in using the file containing field names to request a big number of fields, which is different from yours, but thank you for trying to help. After some discussion with DNANexus IT team, they solved my problem of being unable to use File Containing Field Names in Table Exporter. I will share the experience in a new thread.

0
Former User of DNAx Community_77
- 24 March 2023 00:41
Hi {@005t0000006BZL2AAO}? , with the help of DNANexus IT Support team, I get some hidden tips to use Table Exporter, which solved my error message in the threads above that I discussed with you earlier. The tips specifically address the error when requesting fields through File Containing Field Names in Table Exporter. I would like to share these tips for everyone to reference:
1. After creating the Field Name.txt file in Windows, use dos2unix to convert it to UNIX format, otherwise it will secretly add an invisible ?^M? at the end of each line, which will cause error when running Table Exporter.
2. The format for Field Name should be p<Field ID>_i<Instance ID>_a<Array ID>. If there is no Instance or Array defined in UKB Showcase, then omit that parameter, just type in p<Field ID>. If you impute i0_a0 for those fields, it will cause error.
3. Output file does not automatically include EID, so need to add eid into your variable list. Also, specify Entity as ?participant? when filling out Table Exporter Analysis Input Advance Options.
Hope this can be helpful to everyone who plays with Table Exporter!
Yimei
2
Ondrej Klempir DNAnexus Team
- 28 March 2023 02:41
Hi @Yimei Huang?, this is great! Thanks for sharing this with Community.

0
Dan Ofer
- 23 April 2024 10:01
If I want to export, say, all demographics + all measurements/fields with “instance 0” in their name, is this the right approach?
(I want all the measurements and info, for the whole population, for doing medical/predictive analytics = I do not know in advance which features are useful, so I want all of them. Thats hundreds of columns)

0
Rachael W UKB Community team Data Analyst
- 23 April 2024 21:16
Please start a new post for your new question, rather than adding to this old thread.

0

Please sign in to leave a comment.