Query of the Week #1: Export Phenotypic Data to a File
Hello everyone!
Lately, I have been exploring how to write queries for exporting phenotype data from the dnax database, and I decided to summarize 4 such "queries" in this post. I am aware of this documentation page: https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/accessing-phenotypic-data-as-a-file, which describes some of the methods; however, I wanted to take a closer look at practical examples. I gave this post the title Query of the Week #1 because I have other topics in mind that I would like to share with you. Ideally, I will try to share one post each week.
Let me know in the comments what you think about such a series; I will definitely appreciate your feedback.
For this week, let's implement phenotype export in 4 different ways:
- exporting phenotypes directly in the Cohort Browser (max. ~30k rows)
- via dx extract_dataset
- via the Table Exporter app
- implementing an SQL query in Spark-based JupyterLab
I got consistent results using all 4 methods.
In my eyes, exporting phenotypes is typically a "one-time" operation, so the resulting csv can be stored in dnax permanent storage (the UKB RAP project).
For demonstration purposes, we are going to create and work with a cohort. Let's focus on participants with "Age at recruitment" between 50 and 55 and a non-null value for "Volume of brain". This filtering operation in the Cohort Browser reduced the full >500k dataset to fewer than 10k participants.
I saved my cohort by clicking on the small "floppy disk" icon.
1) Exporting phenotypes directly in Cohort Browser
This is the easiest and fastest way, done directly in the Cohort Browser. Just click on the Data Preview tab, and a customizable table will appear. The main limitation is that only about 30,000 rows can be exported. You can add more columns of interest and then download a csv.
2) dx extract_dataset
dx extract_dataset is a great utility, but you need to know how to operate a command-line interface (CLI). This tool is part of the dx-toolkit. It can also be accessed using the ttyd app.
There are great notebooks and documentation for those who would like to learn more:
https://github.com/dnanexus/OpenBio/tree/master/dx-toolkit (super notebooks)
https://documentation.dnanexus.com/user/helpstrings-of-sdk-command-line-utilities#extract_dataset
I implemented the following steps:
a) dx extract_dataset query_of_the_week_1 --dump-dataset-dictionary
--> which takes our created cohort as input and generates dictionary metadata
b) cat query_of_the_week_1.data_dictionary.csv | grep "participant" | grep "Participant ID"
cat query_of_the_week_1.data_dictionary.csv | grep "participant" | grep "Age at recruitment"
cat query_of_the_week_1.data_dictionary.csv | grep "participant" | grep "Volume of brain, grey+white
--> to convert field titles into field names, so that we can construct our export query
c) dx extract_dataset query_of_the_week_1 --fields "participant.eid,participant.p21022,participant.p25010_i2"
--> the export took about one minute, and we could then map field names back to field titles using the dumped dictionary
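Mapping field names back to field titles can be scripted instead of done by hand. Below is a minimal sketch, assuming the dumped data_dictionary.csv has header columns named entity, name, and title (the real dump has more columns, which are simply ignored here); the sample rows are illustrative, not copied from a real dump:

```python
import csv
import io

def load_title_map(dict_csv_text):
    """Build a {entity.name: title} map from a data-dictionary CSV.

    Assumes header columns 'entity', 'name', and 'title' exist;
    any extra columns in the real dump are ignored.
    """
    reader = csv.DictReader(io.StringIO(dict_csv_text))
    return {f"{row['entity']}.{row['name']}": row["title"] for row in reader}

# Tiny illustrative sample in the same shape as the dumped dictionary
sample = (
    "entity,name,type,title,units\n"
    "participant,eid,string,Participant ID,\n"
    "participant,p21022,integer,Age at recruitment,years\n"
    'participant,p25010_i2,float,"Volume of brain, grey+white matter | Instance 2",mm3\n'
)

titles = load_title_map(sample)
print(titles["participant.p21022"])  # Age at recruitment
```

With this map you can rename the columns of the exported csv from field names back to human-readable titles in one pass.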
3) Table Exporter
Table Exporter is a Spark-based app available on the UKB RAP. Its main advantage is the graphical interface, which may suit those who do not want to work with the CLI. https://documentation.dnanexus.com/developer/apps/developing-spark-apps/table-exporter-application
The settings I used can be seen in this screenshot:
- FIELD-TITLE is the name of the field as shown in the cohort browser.
- I had to use a backslash because one of my fields contained a comma.
- It took 5 minutes.
- Instance type used was mem3_ssd1_v2_x16.
4) Spark-based JupyterLab
Spark-based JupyterLab is a very powerful tool for implementing highly customizable queries. I built the phenotype export on the following publicly available notebook:
https://github.com/dnanexus/OpenBio/blob/master/dxdata/getting_started_with_dxdata.ipynb
As we are working with the Cohort object type, I focused mostly on the "Loading a cohort and retrieving data" section.
Apart from that, I also recommend this notebook for other SQL-like queries and for info on how to export dataframes to csv:
https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb
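In Spark-based JupyterLab the export essentially boils down to a SELECT over the participant entity. Here is a minimal sketch of building such a query string; the table name "participant" and the filter mirroring my cohort (age 50-55, non-null brain volume) are assumptions based on the field names from section 2 — check the actual table names in your own Spark session:

```python
# Sketch: composing the SQL used inside Spark JupyterLab.
# Field names follow the entity.field pattern from dx extract_dataset.
fields = ["eid", "p21022", "p25010_i2"]

query = (
    f"SELECT {', '.join(fields)} "
    "FROM participant "
    "WHERE p21022 BETWEEN 50 AND 55 "   # Age at recruitment 50-55
    "AND p25010_i2 IS NOT NULL"          # Volume of brain present
)
print(query)
```

In the notebook you would then run something like `spark.sql(query).toPandas()` and write the resulting dataframe to csv, as shown in the linked ukb-rap-pheno-basic notebook.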
Comments
Are there any other ways you prefer to export pheno data?
How often do you export phenotypes? Is it a one-time operation for you?
Hi @Ondrej Klempir, thank you so much for your useful information. Just to be clear on the methods outlined: if I want to use, for example, Table Exporter and/or Spark-based JupyterLab, I need to first set up the billing account, then click on the Start Analysis button, select either of the apps I want to use, and run the query to extract the variable of interest? Is this correct?
Hi @Chinonso Odebeatu, yes, correct. You will need an active UKB RAP account. I believe all of this will be covered in the webinar training this Thursday.
Awesome. Many thanks for your reply. Looking forward to the webinar training then.
@Ondrej Klempir so when using dx extract_dataset I need to specify fields as entity name + field name?
e.g.
$cat query_of_the_week_1.data_dictionary.csv | grep "participant" | grep "Age at recruitment"
participant,p21022,integer,,,,,Population characteristics > Baseline characteristics,,,http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=21022,,,,Age at recruitment,years
I will take the first and second columns, participant and p21022, and join them with '.': participant.p21022? EID remains eid because there the field name and field title are the same?
Yes, this is exactly what I did! Thanks for adding this comment.
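For reference, composing the --fields value from (entity, name) pairs is a one-liner. A quick sketch (the pairs below are the ones from my example above):

```python
# Build the comma-separated --fields argument for dx extract_dataset
pairs = [
    ("participant", "eid"),
    ("participant", "p21022"),
    ("participant", "p25010_i2"),
]
fields_arg = ",".join(f"{entity}.{name}" for entity, name in pairs)
print(fields_arg)  # participant.eid,participant.p21022,participant.p25010_i2
```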
Hi Ondrej,
I tried to use Table Exporter by providing a "File containing Field Names", and I kept getting error message "Invalid characters found in field names on line number(s) 1, 2, 3, 4 of the input." My Cohort is a subset of UKB participants that contains 4 Fields which I used as filters. The "File containing Field Names" includes all 4 Fields in the Cohort, and an additional Field not in Cohort (p40000). It is a txt of 5 rows:
p131354
p31
p20003
p20009
p40000
Could you tell me where I went wrong?
Also, if I want to obtain all instances and arrays under a Field, do I have to spell out all instances and arrays one by one, or can I just use the parent Field Name while omitting all instances and arrays in the name?
Thank you for looking into this for me.
Hi @Ondrej Klempir, to add to my question above: where can we access the data dictionary that contains all Field Names and Field Titles? I want to download a copy of it. Thanks!
I use this catalog a lot: https://biobank.ndph.ox.ac.uk/showcase/list.cgi
Alternatively, I sometimes search Field name for given Field title in Cohort Browser.
For question "Also, if I want to obtain all instances and arrays under a Field, do I have to spell out all instances and arrays one by one, or can I just use the parent Field Name while omitting all instances and arrays in the name?", to my knowledge, you will need to list all the instances one by one.
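Since instances have to be listed one by one, a small helper can expand a parent field into per-instance names. A sketch (the number of instances per field is something you have to look up in the Showcase yourself):

```python
def expand_instances(field, n_instances):
    """Expand a parent field name into per-instance names.

    e.g. p20003 with 4 instances -> p20003_i0 ... p20003_i3
    (arrayed fields would additionally need an _a<N> suffix).
    """
    return [f"{field}_i{i}" for i in range(n_instances)]

print(expand_instances("p20003", 4))
# ['p20003_i0', 'p20003_i1', 'p20003_i2', 'p20003_i3']
```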
I have not yet experimented with field names. I personally prefer using field titles in Table Exporter, but field names sound useful. Before I jump into running an experiment: what happens if you run it on just the subset of 4 field names you have in your cohort?
Thank you @Ondrej Klempir. I have received your reply regarding this catalog https://biobank.ndph.ox.ac.uk/showcase/list.cgi. This morning when I listened to the RAP Overview webinar, there seemed to be a data dictionary csv file stored somewhere on GitHub, but the link they provided in the Q&A of the webinar was no longer valid when I tried it. If we have such a csv file, it will be much easier to pull a long list of Fields.
If anyone knows how to access that file, please post to this thread! Thanks!
Hi @Ondrej Klempir, I just tried with only the 4 Field Names that I have in my cohort, and I still get the same error: "Invalid characters found in field names on line number(s) 1, 2, 3, 4 of the input." This time my input txt looks like this:
p131354
p31
p20003
p20009
Among these, p131354 and p31 only have 1 instance and 1 array, so there shouldn't be any deviation in the Field Name. Not sure why even these two lines do not process.
Thanks.
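I can't say for sure what causes the error in your case, but one thing worth ruling out is invisible characters in the txt file: a UTF-8 BOM, Windows CRLF line endings, non-breaking spaces, or a stray trailing dot would all look "invalid" while the file appears fine on screen. A quick local check, assuming plain field names of the form p<FIELD>[_i<N>[_a<N>]] or eid:

```python
import re

# Assumed shape of a clean field name: eid, or p<digits> with
# optional _i<instance> and _a<array> suffixes.
FIELD_NAME = re.compile(r"^(eid|p\d+(_i\d+)?(_a\d+)?)$")

def check_field_file(text):
    """Return (line_number, repr_of_line) for lines that are not clean field names."""
    bad = []
    for n, line in enumerate(text.split("\n"), start=1):
        if line and not FIELD_NAME.match(line):
            bad.append((n, repr(line)))
    return bad

# Example: a BOM, a stray carriage return, and a trailing dot
sample = "\ufeffp131354\np31\r\np20003\np20009.\n"
print(check_field_file(sample))  # flags lines 1, 2 and 4
```

The repr() output makes hidden characters like '\ufeff' and '\r' visible, so you can see exactly what to strip before re-uploading the file.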
OK, I ran two experiments with field names and it worked. For both, I used the cohort created for the purposes of Query of the Week #1. Please double-check the formatting of the items in your file and also add the particular instance to each field.
A) the same fields as captured in the cohort
prompt> cat list_ids
eid
p21022
p25010_i2
B) not just fields from the cohort but also others
prompt> cat list_ids_2
eid
p21022
p25010_i2
p131354
p31
p20003_i0
Hi @Alexandra Lee, do you know which GitHub file could store this info?
Is this the list? https://biobank.ndph.ox.ac.uk/showcase/schema.cgi?id=1
https://biobank.ndph.ox.ac.uk/showcase/download.cgi
It should be downloadable by clicking Download button
@Ondrej Klempir No, it is a list with Field Names already formatted to the DNAnexus rule (adding a "p" in front), like the one below. The speakers in today's webinar showed it during their demo of dx extract_dataset. I think it was the male speaker who talked about that part. Could you @ him if you know his name?
OK, I see. I am referring here to part 2a).
You can use dx extract_dataset query_of_the_week_1 --dump-dataset-dictionary. This will give you the full dictionary in csv format.
query_of_the_week_1.data_dictionary.csv
Hi Yimei,
Here is the github link that I pointed to during the overview webinar: https://github.com/dnanexus/OpenBio/blob/master/dx-toolkit/dx_extract_dataset_r.ipynb
This GitHub repo doesn't contain the data dictionary file, but rather example code for how you can get this data dictionary file using dx extract_dataset - see the `2. Call 'dx extract_dataset' using a supplied dataset` section of the notebook.
Sorry for the miscommunication, I hope this helps
Oh I see, thank you Alex and Ondrej! Is the coding dictionary csv file only accessible by navigating through dx extract_dataset, rather than stored somewhere in the website or RAP folder?
Thank you!
As far as I know, I don't believe this is on the website or inside the RAP project. To get the coding_dictionary.csv file you can do the following:
Make sure you have dxpy installed on your local machine
Login to the RAP from the terminal by calling: `dx login`
Then you can extract the coding_dictionary.csv + the other two dictionary files by calling: `dx extract_dataset <"project-id:record-id"> -ddd -o -`
You will need to replace <"project-id:record-id"> with your project and record identifiers. You can find the project-id by navigating to the "Settings" tab when you're in RAP, and the record-id will be found on the right-side panel when you select the checkbox associated with the dataset file you want.
Alternatively:
Hope this helps
Thank you Alex! This step-wise tutorial is very helpful! I am clear now.
Thank you for sharing this information Ondrej, I'm curious how folks decide which method they will use for their research? Are there methods that are more suitable for certain analyses?
Good question! Yeah, I think it mostly depends on technical skills and whether the user can interact with the RAP using a command-line interface. For those who prefer a GUI, Table Exporter and the Cohort Browser would be best. Also, as Table Exporter is a supported UKB RAP app, it could be a component of a larger workflow. For more tech-skilled users, I would definitely recommend the programmatic way, as one can create more complex queries and custom analyses. Do you have any other ideas?
And +1 to your question "I'm curious how folks decide which method they will use for their research?" I would love to hear from others what would be their method of choice.
Thanks for the Q/A, Alexandra and Ondrej. My experience is: if we only need to download a few variables, or if we are beginners exploring the data structure, then the Cohort Browser is the most user-friendly way. If we want to download a big batch of variables, we should use Table Exporter or dx extract_dataset instead. However, Table Exporter seems to have some App configuration issues that cause unstable performance, and I have not been successful with it on any jobs.
Three weeks ago in the discussion thread above, I posted questions on Table Exporter, and I have tried manipulating the File containing Field Names with a format like p<FIELD-ID>_i<INSTANCE-ID>_a<ARRAY-ID>. Still no luck. I repeatedly received the same error message "Invalid characters found in field names on line number(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, .... of the input." Every single variable failed to pull. I have submitted a Failure Report to DNAnexus because it suspects an App configuration issue or an internal error. I will share with this community if I hear anything useful back from the DNAnexus Support Team.
Hi Ondrej, I tried to download the phenotype file with the variables you listed above on my Cohort of 15 participants, and it still gave me the error message "Invalid characters found in field names on line number(s) 1, 2, 3, 4, 5 of the input." I basically copied your variables into a txt file, which prints out like this:
eid
p21022
p25010_i2
p131354
p31
p20003_i0
I also followed your example and filled out Entity as "participant". I cannot see where it went wrong in my input settings. The email notification of the error says there may be an issue with App configuration or an internal error, so I reported the failed job to DNAnexus IT Support. If I hear back from them, I will keep the community updated on the mystery of this error.
Thanks.
Here is what I did to get the required variables of interest:
I opened the dataset record, and it led me to the Cohort Browser. I selected all the variables I wanted by adding columns to the data preview. Once I had selected everything I needed, I saved it to my UKB RAP project where my main dataset is (there is a save button in the Cohort Browser). Then I started an analysis and selected Table Exporter. I used the recommended settings from the UKB RAP, including choosing the UKB data format in the settings, and the file format (csv). Then I extracted the saved variables for all the participants and saved the output in the UKB RAP. At this point you can load the data into any analytical tool for analysis in the UKB RAP, such as R workbench, or you can download it if you won't be violating the MTA.
I just thought I would share my experience; maybe it can help someone.