Query of the Week #1: Export Phenotypic Data to a File

Ondrej Klempir DNAnexus Team

Hello everyone!

 

Lately, I have been exploring the different ways to write queries that export phenotype data from the dnax database, and in this post I summarize 4 such "queries". I am aware of the documentation page https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/accessing-phenotypic-data-as-a-file, which describes some of these methods, but I wanted to take a closer look at practical examples. I titled this post Query of the Week #1 because I have other topics in mind that I would like to share with you; ideally, I will try to publish one post each week.

Let me know in the comments what you think about such a series; I would definitely appreciate your comments and feedback.

 

For this week, let's implement phenotype export in 4 different ways:

- exporting phenotypes directly in Cohort Browser (max. 30,000 rows)

- via dx extract_dataset

- via Table Exporter app

- and implementing an SQL query in Spark-based JupyterLab

 

I got consistent results using all 4 methods.

 

In my eyes, exporting phenotypes is typically a "one-time" operation, so the resulting csv can be stored in the permanent dnax storage (your UKB RAP project).

 

For demonstration purposes, we are going to create and work with a cohort. Let's focus on participants with "Age at recruitment" between 50 and 55 and a non-null "Volume of brain" measurement. This filtering operation in Cohort Browser reduced the full >500k dataset to fewer than 10k participants.

 

I saved the cohort by clicking on the small "floppy disk" icon.

 

1) Exporting phenotypes directly in Cohort Browser

 

This is the easiest and fastest way, and it can be done directly in the Cohort Browser. Just click on the Data Preview tab, and a customizable table will appear. The main limitation is that only about 30,000 rows can be exported. You can add more columns of interest and then download a csv.

 

[Screenshot: Data Preview tab in Cohort Browser]

2) dx extract_dataset

 

dx extract_dataset is a great utility, but you need to know how to operate a command line interface (CLI). The tool is part of the dx toolkit and can also be accessed using the ttyd app.

There are great notebooks and documentation for those who would like to learn more:

https://github.com/dnanexus/OpenBio/tree/master/dx-toolkit (example notebooks)

https://documentation.dnanexus.com/user/helpstrings-of-sdk-command-line-utilities#extract_dataset

 

I implemented the following steps:

 

a) dx extract_dataset query_of_the_week_1 --dump-dataset-dictionary

--> takes our created cohort as input and generates the dictionary metadata files

 

b) cat query_of_the_week_1.data_dictionary.csv | grep "participant" | grep "Participant ID"

cat query_of_the_week_1.data_dictionary.csv | grep "participant" | grep "Age at recruitment"

cat query_of_the_week_1.data_dictionary.csv | grep "participant" | grep "Volume of brain, grey+white"

--> converts field titles into field names, so we can construct our export query

 

c) dx extract_dataset query_of_the_week_1 --fields "participant.eid,participant.p21022,participant.p25010_i2"

--> the export took about one minute, and we could then map the field names back to field titles using the dumped dictionary
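As a sketch of how step b) could be automated, the Python snippet below maps field titles to the `entity.name` identifiers that `--fields` expects. It assumes the dumped dictionary csv has `entity`, `name`, and `title` header columns; please check your own file's header, and note the toy sample below is trimmed to three columns for readability:

```python
import csv
import io

def titles_to_fields(dictionary_csv, titles, entity="participant"):
    """Map human-readable field titles to 'entity.name' identifiers
    suitable for the --fields argument of dx extract_dataset."""
    wanted = set(titles)
    fields = {}
    for row in csv.DictReader(dictionary_csv):
        if row["entity"] == entity and row["title"] in wanted:
            fields[row["title"]] = f'{row["entity"]}.{row["name"]}'
    return fields

# Toy dictionary excerpt (real dumped files have many more columns)
sample = io.StringIO(
    "entity,name,title\n"
    "participant,eid,Participant ID\n"
    "participant,p21022,Age at recruitment\n"
)
print(titles_to_fields(sample, ["Participant ID", "Age at recruitment"]))
# {'Participant ID': 'participant.eid', 'Age at recruitment': 'participant.p21022'}
```

The returned values can then be joined with commas to form the `--fields` string.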

 

3) Table Exporter

 

Table Exporter is a Spark-based app available on the UKB RAP. Its main advantage is the graphical interface, which may suit those who do not want to work with the CLI. https://documentation.dnanexus.com/developer/apps/developing-spark-apps/table-exporter-application

 

The settings I used can be seen in this screenshot:

 

[Screenshot: Table Exporter settings]

  • FIELD-TITLE is the name of the field as shown in the Cohort Browser.
  • I had to use a backslash because one of my field titles contained a comma.
  • The run took 5 minutes.
  • The instance type used was mem3_ssd1_v2_x16.
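Since one of the field titles in this example contains a comma, here is a tiny hypothetical helper (not part of the app) that shows the escaping rule in code; Table Exporter treats unescaped commas as separators between field titles:

```python
def escape_field_title(title: str) -> str:
    """Backslash-escape commas so the exporter does not split one
    field title into two (hypothetical helper, not part of the app)."""
    return title.replace(",", "\\,")

print(escape_field_title("Volume of brain, grey+white matter"))
# Volume of brain\, grey+white matter
```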

 

4) Spark-based JupyterLab

 

https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-spark-to-analyze-tabular-data

 

Spark-based JupyterLab is a very powerful tool for implementing highly customizable queries. I built the phenotype export on the following publicly available notebook:

https://github.com/dnanexus/OpenBio/blob/master/dxdata/getting_started_with_dxdata.ipynb

As we are working with a Cohort object, I focused mostly on the "Loading a cohort and retrieving data" section.

 

Apart from that, I also recommend this notebook for other SQL-like queries and for learning how to export dataframes to csv:

https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb
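For orientation, the Spark JupyterLab route effectively runs SQL against the dataset's tables. Below is a hypothetical sketch of the query corresponding to this post's cohort, using the field IDs found in step 2b; the real table name and the exact generated SQL come from the dataset and cohort objects, so treat this only as an illustration:

```python
def example_cohort_sql(entity: str = "participant") -> str:
    """Illustrative SQL for this post's cohort: "Age at recruitment"
    (p21022) between 50 and 55 and non-null "Volume of brain"
    (p25010_i2). The table name is a placeholder; actual dnax table
    names differ per dataset."""
    return (
        f"SELECT {entity}.eid, {entity}.p21022, {entity}.p25010_i2 "
        f"FROM {entity} "
        f"WHERE {entity}.p21022 BETWEEN 50 AND 55 "
        f"AND {entity}.p25010_i2 IS NOT NULL"
    )

print(example_cohort_sql())
```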

 

Comments

39 comments

  • Comment author
    Ondrej Klempir DNAnexus Team

    Are there any other ways you prefer to export pheno data?

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

    How often do you export phenotypes, is it one time operation for you?

    0
  • Comment author
    Former User of DNAx Community_84

Hi @Ondrej Klempir! Thank you so much for your useful information. Just to be clear on the methods outlined: if I want to use, for example, Table Exporter and/or Spark-based JupyterLab, I need to first set up the billing account, then click on the Start Analysis button, select either of the apps I want to use, and run the query to extract the variable of interest? Is this correct?

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

Hi @Chinonso Odebeatu, yes, correct. You will need an active UKB RAP account. I believe this will all be covered in the webinar training this Thursday.

    0
  • Comment author
    Former User of DNAx Community_84

    Awesome. Many thanks for your reply. Looking forward to the webinar training then.

    0
  • Comment author
    Anastazie Sedlakova DNAnexus Team

@Ondrej Klempir so when using dx extract_dataset, I need to pass fields as entity name + field name?

    e.g.

    $cat query_of_the_week_1.data_dictionary.csv | grep "participant" | grep "Age at recruitment"

    participant,p21022,integer,,,,,Population characteristics > Baseline characteristics,,,http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=21022,,,,Age at recruitment,years

     

I will take the first and second columns, participant and p21022, and join them with '.': participant.p21022? EID remains eid because there the field name and field title are the same?

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

    Yes, this is exactly what I did! Thanks for adding this comment.

    0
  • Comment author
    Former User of DNAx Community_77

    Hi Ondrej,

    I tried to use Table Exporter by providing a "File containing Field Names", and I kept getting error message "Invalid characters found in field names on line number(s) 1, 2, 3, 4 of the input." My Cohort is a subset of UKB participants that contains 4 Fields which I used as filters. The "File containing Field Names" includes all 4 Fields in the Cohort, and an additional Field not in Cohort (p40000). It is a txt of 5 rows:

     

    p131354

    p31

    p20003

    p20009

    p40000

     

    Could you tell where I did wrong?

    Also, if I want to obtain all instances and arrays under a Field, do I have to spell out all instances and arrays one by one, or can I just use the parent Field Name while omitting all instances and arrays in the name?

     

    Thank you for looking into this for me.

     

     

     

     

    0
  • Comment author
    Former User of DNAx Community_77

Hi @Ondrej Klempir, to add to my question above, I also want to ask where we can access the data dictionary that contains all Field Names and Field Titles? I want to download a copy of it. Thanks!

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

I use this catalog a lot: https://biobank.ndph.ox.ac.uk/showcase/list.cgi

     

    Alternatively, I sometimes search Field name for given Field title in Cohort Browser.

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

    For question "Also, if I want to obtain all instances and arrays under a Field, do I have to spell out all instances and arrays one by one, or can I just use the parent Field Name while omitting all instances and arrays in the name?", to my knowledge, you will need to list all the instances one by one.

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

I have not yet done experiments with field names. I personally prefer using field titles in Table Exporter, but field names sound useful. Before I jump into running the experiment - what will happen if you run it on just the subset of 4 field names you have in your cohort?

    0
  • Comment author
    Former User of DNAx Community_77

Thank you @Ondrej Klempir. I have received your reply regarding this catalog https://biobank.ndph.ox.ac.uk/showcase/list.cgi. This morning when I listened to the RAP Overview webinar, there seemed to be a data dictionary csv file stored somewhere on Github, but the link they provided in the Q&A of the webinar was no longer valid when I tried it. If we have such a csv file, it will be much easier to pull a long list of Fields.

    If anyone knows how to access that file, please post to this thread! Thanks!

    0
  • Comment author
    Former User of DNAx Community_77

Hi @Ondrej Klempir, I just tried with only the 4 Field Names that I have in my cohort, and I still get the same error "Invalid characters found in field names on line number(s) 1, 2, 3, 4 of the input." This time my input txt looks like this:

    p131354

    p31

    p20003

    p20009.

     

Among these, p131354 and p31 only have 1 instance and 1 array, so there shouldn't be any deviation in the Field Name. Not sure why even these two lines do not process.

     

    Thanks.

     

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

OK, I ran two experiments with field names and it worked. For both, I used the cohort created for the purposes of Query of the Week #1. Please double-check the formatting of the items in your file and also add the particular instance suffix (e.g. _i2) to each field name.

     

A) the same fields as captured in the cohort

[Screenshot: Table Exporter run settings]

prompt> cat list_ids

    eid

    p21022

    p25010_i2

     

B) not just fields from the cohort but also others

     

    prompt> cat list_ids_2

    eid

    p21022

    p25010_i2

    p131354

    p31

    p20003_i0
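Following up on the formatting double-check: a hypothetical pre-flight script like the one below (not part of Table Exporter) can flag lines that don't look like valid field names before you submit the job; for instance, a trailing period or a leading space would be reported:

```python
import re

# Accepts "eid" plus names like p31, p21022, p20003_i0, p25010_i2_a1.
FIELD_NAME = re.compile(r"^(eid|p\d+(_i\d+)?(_a\d+)?)$")

def invalid_lines(text):
    """Return 1-based numbers of lines that do not look like field names."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line and not FIELD_NAME.match(line):
            bad.append(lineno)
    return bad

print(invalid_lines("eid\np21022\np25010_i2\n"))  # []
print(invalid_lines("p20009.\n p31\n"))           # [1, 2]
```

This is only a heuristic for catching typos and invisible characters; the app itself decides what it accepts.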

     

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

Hi @Alexandra Lee, do you know which github file could store this info?

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

    Is this the list? https://biobank.ndph.ox.ac.uk/showcase/schema.cgi?id=1

     

    https://biobank.ndph.ox.ac.uk/showcase/download.cgi

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

    It should be downloadable by clicking Download button

    0
  • Comment author
    Former User of DNAx Community_77

@Ondrej Klempir No, it is a list with Field Names already formatted to the DNAnexus rule (adding a "p" in front), like the one below. The speakers in today's webinar showed it during their demo of dx extract_dataset. I think it was the male speaker who talked about that part. Could you @ him if you know his name?

[Image: the formatted field-name list shown in the webinar]

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

OK, I see. I am referring here to part 2a).

     

You can use `dx extract_dataset query_of_the_week_1 --dump-dataset-dictionary`. This will give you the full dictionary in csv format:

     

    query_of_the_week_1.data_dictionary.csv 

    0
  • Comment author
    Alexandra Lee DNAnexus Team

    Hi Yimei,

     

    Here is the github link that I pointed to during the overview webinar: https://github.com/dnanexus/OpenBio/blob/master/dx-toolkit/dx_extract_dataset_r.ipynb

     

This github doesn't contain the data dictionary file; rather, it contains example code for how you can get this data dictionary file using dx extract_dataset - see the `2. Call dx extract_dataset using a supplied dataset` section of the notebook.

     

    Sorry for the miscommunication, I hope this helps

     

    0
  • Comment author
    Former User of DNAx Community_77

    Oh I see, thank you Alex and Ondrej! Is the coding dictionary csv file only accessible by navigating through dx extract_dataset, rather than stored somewhere in the website or RAP folder?

    Thank you!

    0
  • Comment author
    Alexandra Lee DNAnexus Team

    As far as I know, I don't believe this is on the website or inside the RAP project. To get the coding_dictionary.csv file you can do the following:

     

    Make sure you have dxpy installed on your local machine

     

    Login to the RAP from the terminal by calling: `dx login`

     

    Then you can extract the coding_dictionary.csv + the other two dictionary files by calling: `dx extract_dataset <"project-id:record-id"> -ddd -o -`

     

You will need to replace <"project-id:record-id"> with your project and record identifiers. You can find the project-id by navigating to the "Settings" tab when you're in RAP, and the record-id will be found on the right-side panel when you select the checkbox associated with the dataset file you want.

     

    Alternatively:

    1. The github repo here shows the steps for how to do this in R. You can launch a JupyterLab notebook with an R kernel and run the commands found here: https://github.com/dnanexus/OpenBio/blob/master/dx-toolkit/dx_extract_dataset_r.ipynb
    2. Here is how you can use a python kernel: https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/gwas-ex#getting-the-data

     

    Hope this helps

    0
  • Comment author
    Former User of DNAx Community_77

    Thank you Alex! This step-wise tutorial is very helpful! I am clear now.

    0
  • Comment author
    Alexandra Lee DNAnexus Team

Thank you for sharing this information, Ondrej. I'm curious how folks decide which method they will use for their research? Are there methods that are more suitable for certain analyses?

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

Good question! Yeah, I think it mostly depends on technical skills and whether the user can interact with the RAP using a command line interface. For those who prefer a GUI, Table Exporter and Cohort Browser would be best. Also, as Table Exporter is a supported UKB RAP app, it could also be a component of a larger workflow. For more tech-skilled people, I would definitely recommend the programmatic way, as one can create more complex queries and custom analyses. Do you have any other ideas?

    0
  • Comment author
    Ondrej Klempir DNAnexus Team

    And +1 to your question "I'm curious how folks decide which method they will use for their research?" I would love to hear from others what would be their method of choice.

    0
  • Thanks for the Q/A, Alexandra and Ondrej. My experience is: if we only need to download a few variables, or if we are beginners exploring the data structure, then Cohort Browser is the most user-friendly way. If we want to download a big batch of variables, we should use Table Exporter or dx extract_dataset instead. However, Table Exporter seems to have some app configuration issues that cause unstable performance, and I have not been successful with it on any jobs.

     

Three weeks ago in the discussion thread above, I posted questions on Table Exporter, and I have since tried manipulating the File containing Field Names with a format like p<FIELD-ID>_i<INSTANCE-ID>_a<ARRAY-ID>. Still no luck. I repeatedly received the same error message "Invalid characters found in field names on line number(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, .... of the input." Every single variable failed to pull. I have submitted a Failure Report to DNAnexus, because the failure report suspects an App Configuration issue or an internal error. I will share with this community if I hear anything useful back from the DNAnexus Support Team.

    0
  • Hi Ondrej, I tried to download the phenotype file with the variables you listed above on my Cohort of 15 participants, and it still gave me the error message "Invalid characters found in field names on line number(s) 1, 2, 3, 4, 5 of the input." I basically copied your variables into a txt file, which prints out like this:

     

    eid

    p21022

    p25010_i2

    p131354

    p31

    p20003_i0

     

I also followed your example and filled out Entity as "participant". I cannot see where it went wrong in my input settings. The email notification of the error message says there may be an issue with App configuration or an internal error, so I reported the failed job to DNAnexus IT Support. If I hear back from them, I will keep the community updated on the mystery of this error.

     

    Thanks.

    0
  • Here is what I did to get the required variables of interest:


I opened the dataset record, and it led me to the Cohort Browser. I selected all the variables I wanted by adding columns to the data preview. Once I had selected all that I needed, I saved it to my UKB RAP project where my main dataset is (there is a save button in the Cohort Browser). Then I started an analysis and selected Table Exporter. I used the recommended settings from the UKB RAP, including choosing the UKB data format in the settings and the file format (csv). Then I extracted the saved variables for all the participants and saved the result in the UKB RAP. At this point you can load the data into any analytical tool for analysis in the UKB RAP, such as R workbench, or you can download it if you won't be violating the MTA.


I just thought to share my experience; maybe it can help someone.

    0
