How to extract all the phenotypes available for a single individual

Hello. Is it possible to extract all the phenotype information available for an individual in one command?

Comments

  • Comment author
    Rachael W The helpers that keep the community running smoothly. UKB Community team Data Analyst

    I don't think so. 

    This old thread (about using koalas to extract the tabular fields from the main parquet dataset) might be relevant https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/16019577725341-Hi-Question-How-to-retrieve-all-fields-from-phenotypic-data-for-a-specific-sample-or-list-of-samples-provided-a-file-for-example 

    See also https://community.ukbiobank.ac.uk/hc/en-gb/search?utf8=%E2%9C%93&query=query+of+the+week

     

  • Comment author
    Arezoo Mohajeri

    Thank you, Rachael! What about the following command?

     dx find data --property eid=1234567

    Doesn't this extract all the information, or does it only find the files in Bulk? It seems that the information in the Dataset is not in the Bulk folder.

    How can we extract certain phenotypes for a certain participant (for example, 1234567) using dx extract_dataset? There are materials about the parameters but no actual example of the command with field names etc. Let's say we would like to extract “Date of birth” and “Cognitive function summary” for participant 1234567 using dx extract_dataset. What would the command be? Thank you!

    https://biobank.ndph.ox.ac.uk/ukb/label.cgi?id=100094

    https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=20023

     

     

  • Comment author
    Rachael W

    “dx find” will only find the bulk files, as they have filenames that include the eids. The tabular data is stored in a single very large SQL database. See the “Tabular data” section on the page https://dnanexus.gitbook.io/uk-biobank-rap/getting-started/working-with-ukb-data .

    There are three main ways to access data in the SQL database:

    1. viewing it via the Cohort browser, see https://documentation.dnanexus.com/user/cohort-browser

    2. extracting the rows (participants) and columns (fields) that are required into a csv and then working with the csv

    3. using Spark SQL commands to query the database

    I would recommend extracting what you need into a csv. The Cohort browser is quite nice for preliminary searches, but it has limits on size and functionality, so it is probably not sufficient for your main analysis.

  • Comment author
    Rachael W

    To extract into a csv, consider either Table Exporter from the Tools tab or “dx extract_dataset”.  See this thread https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/16019586199837-Table-exporter-with-all-fields

    and documentation here https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/accessing-phenotypic-data-as-a-file

  • Comment author
    Rachael W

    Hello again, re the example you asked for, here is a walk-through.

    Start by creating a small Cohort (using the Cohort browser) containing just the participant(s) you want data for. Save it into your main project storage, with a meaningful name and a filepath that you choose, e.g. PersonalFolders/AM/SavedCohorts/very_small_cohort_test . It is not necessary to add the columns you want to this Cohort, as it will only be used to set the participant list. My test cohort actually has fields eid and p20001, because I used them to filter the participants down to just a few.

    Next, find the actual columns (with array numbers and instance numbers) that make up each field. For example, field 20023 Mean time to correctly identify matches has data from 4 different Instances (visits) and no Arrays, so the actual columns will be p20023_i0, p20023_i1, p20023_i2 and p20023_i3. Any fields with arrays will need _i0_a0, _i0_a1 etc in the list.
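    As a rough sketch of the naming rule above (not an official tool), a small shell function can list the columns for a field once you know how many instances and arrays it has. The function name field_columns is made up for illustration, and it assumes that instance and array indices run from 0 without gaps, which is worth checking on Showcase for each field.

```shell
# Hypothetical helper: print the column names that make up one field.
# usage: field_columns FIELD_ID [N_INSTANCES] [N_ARRAYS]
# ASSUMPTION: indices start at 0 and have no gaps.
# A field with arrays but no instances is not handled by this sketch.
field_columns() {
  field=$1
  instances=${2:-0}
  arrays=${3:-0}
  # No instances and no arrays: the column is just p<field>
  if [ "$instances" -eq 0 ] && [ "$arrays" -eq 0 ]; then
    echo "p${field}"
    return 0
  fi
  i=0
  while [ "$i" -lt "$instances" ]; do
    if [ "$arrays" -eq 0 ]; then
      echo "p${field}_i${i}"
    else
      a=0
      while [ "$a" -lt "$arrays" ]; do
        echo "p${field}_i${i}_a${a}"
        a=$((a + 1))
      done
    fi
    i=$((i + 1))
  done
}

# Field 20023 has 4 instances and no arrays
field_columns 20023 4
```

    Calling field_columns 20023 4 prints the four p20023 columns listed above, one per line.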

    There is a complication in Category 100094, in that field 33 Date of birth is Restricted, which in this case means not available except in very special cases which would mean huge complications and a delay of months.  Researchers need to use Year of Birth and Month of Birth instead (possibly creating an approximate date of birth with a random day of the month or using the middle of the month in each case).   For this example I will use field 34 Year of birth.
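    A minimal sketch of the middle-of-the-month approach, assuming the year and month of birth have already been extracted and that the month is held as a number from 1 to 12. The function name approx_dob and the choice of day 15 are illustrative only.

```shell
# Hypothetical helper: build an approximate date of birth (YYYY-MM-15).
# usage: approx_dob YEAR MONTH
# ASSUMPTION: MONTH is numeric (1-12); day 15 stands in for the unknown day.
approx_dob() {
  year=$1
  month=${2#0}   # drop one leading zero so that 08 is not read as octal
  printf '%04d-%02d-15\n' "$year" "$month"
}

approx_dob 1956 8
```

    If the month arrives as a name rather than a number, it would need mapping to a number first.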

    Field 34 Year of birth has No Instances and No Arrays, so the only column is p34.

    I found the Array and Instance information from the Showcase page for each field. When you are working with multiple fields, you can use the dx extract_dataset -ddd option to pull out the column names for the fields of interest into a file, and use that file as input to the next dx extract_dataset command.
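    As a hedged sketch of that step: once dx extract_dataset has been run with -ddd, the dictionary files are written as CSVs. Assuming the data dictionary CSV has entity and name as its first two unquoted columns (please check the header of your own file), an awk filter like the one below lists the entity.column names for one field. The function name dictionary_columns and the file name in the usage comment are hypothetical.

```shell
# Hypothetical helper: list entity.column names for one field from a
# data dictionary CSV.
# usage: dictionary_columns DICT_CSV ENTITY FIELD_ID
# ASSUMPTION: column 1 is the entity, column 2 is the column name,
# neither is quoted, and the first row is a header.
dictionary_columns() {
  awk -F',' -v ent="$2" -v fid="$3" '
    NR > 1 && $1 == ent && ($2 == ("p" fid) || index($2, "p" fid "_") == 1) {
      print $1 "." $2
    }' "$1"
}

# Example (file name is made up):
#   dictionary_columns my_dataset.data_dictionary.csv participant 20023 > fields.txt
```

    The match is on p<field> exactly or p<field>_ as a prefix, so that field 20023 does not also pick up a longer field number that starts with the same digits. The resulting names can be joined with commas for --fields; dx extract_dataset --help shows whether your version can also read them from a file.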

    If you wanted data from any of the Record-tables, such as hesin_diag, then you would need to find the associated Entity. For the “normal” fields in the tabular dataset, such as p34 and p20023, the Entity is called Participant.

    So we prefix every required column name with participant:

    participant.eid,participant.p20023_i0,participant.p20023_i1,participant.p20023_i2,participant.p20023_i3,participant.p34

    where the columns are separated by commas, with no spaces, and there is a dot after participant.
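    For longer lists it can be easier to build this string than to type it. The sketch below is just one way to do it; fields_arg is a made-up function name, not part of dx.

```shell
# Hypothetical helper: prefix each column with the entity name and
# join the results with commas, ready for the --fields option.
# usage: fields_arg ENTITY COLUMN...
fields_arg() {
  entity=$1
  shift
  out=""
  for col in "$@"; do
    # Add a comma only when something has already been collected
    out="${out:+${out},}${entity}.${col}"
  done
  printf '%s\n' "$out"
}

fields=$(fields_arg participant eid p20023_i0 p20023_i1 p20023_i2 p20023_i3 p34)
echo "$fields"
```

    The variable can then be passed in double quotes, as in --fields "$fields", so the shell treats the whole list as a single argument.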

    Start a JupyterLab instance from the Tools tab, for example using Priority: Normal, Cluster Configuration: Single Node, Instance Type: mem1_hdd1_v2_x16, Duration: 1 and Feature: Python_R. Wait a few minutes until it says Ready, then click on Open. Wait a few seconds, then open a Terminal.

    Enter the command:

    dx extract_dataset PersonalFolders/AM/SavedCohorts/very_small_cohort_test --fields "participant.eid,participant.p20023_i0,participant.p20023_i1,participant.p20023_i2,participant.p20023_i3,participant.p34"

    In case it is not clear, there should be a single space between --fields and "participant.eid, not a newline. Any line break shown here is only because the command is too long for the forum screen.

    NB if you use a Word document to create the field list, do not copy and paste Word's curly quote marks (“ ”), as Unix shells do not treat them as quotes. Type the straight double quote (") directly from the keyboard, or copy from a plain text file.
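    If the text has already been pasted with curly quotes, one possible fix is to pass it through sed before using it. This is only a sketch; straighten_quotes is a hypothetical name.

```shell
# Hypothetical helper: replace curly double quotes with straight ones.
# Reads text on standard input and writes the cleaned text to standard output.
straighten_quotes() {
  sed -e 's/“/"/g' -e 's/”/"/g'
}

printf '%s\n' '--fields “participant.eid,participant.p34”' | straighten_quotes
```

    Each curly quote is matched as a literal character, so the straight quotes and everything else in the text are left alone.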

    Wait a while. If the command is going to fail, it usually does so within a few seconds. If it works, the extraction can take minutes.

    When the command finishes, there should be an output csv file, called very_small_cohort_test.csv in the JupyterLab instance storage.  Copy it to your main project storage, for example by using this command:

    dx upload very_small_cohort_test.csv --path /PersonalFolders/AM/SavedCohorts/

    If you don't provide a path, the default behaviour of dx upload is to save to the top level of your project.

    Check that the csv file is present in your main project storage (in folder PersonalFolders/AM/SavedCohorts) before you close the JupyterLab.   For a small file, you can use MoreActions:Preview to check that it has the data you want in it.

    See the dx extract_dataset documentation https://documentation.dnanexus.com/user/helpstrings-of-sdk-command-line-utilities#category-data for options to set the output file name, to use the data dictionary, or to extract the data for all participants from your project's main Parquet dataset instead of using a Cohort.

  • Comment author
    Arezoo Mohajeri

    Thank you so much, Rachael!

  • Comment author
    Rachael W

    For more on using the data dictionary with dx extract_dataset, see this new thread https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/20152997905821 

  • Comment author
    Rachael W

    For a video introduction to extracting phenotype data from the parquet database, see https://www.youtube.com/watch?v=dm1xROYy1dA&list=PLRkZ0Fz-n3Z71Pt9h8-0AEweopm0SVlTr&index=5 

  • Comment author
    Rachael W

    For larger amounts of data, such as extracting olink i0 data in entity olink_instance_0 from the Parquet database to a csv, using extract_dataset OR spark OR table exporter, see this thread https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/23300100792221-How-do-I-extract-the-entire-Proteomic-data-without-being-linked-to-a-specific-Phenotype-cohort-from-the-browser 

     

  • Comment author
    Rachael W

    For future reference:

    There are a few fields that say on Showcase that they are arrayed, but that have been condensed to a comma-separated list in the UKB-RAP.   If a particular field is difficult to extract, consider using the dx extract_dataset -ddd option to get the array information, or view the structure of the values in the cohort browser.

    Example condensed field: 6138 Qualifications
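    Once such a field is in the csv, each cell holds several values that need splitting. A rough sketch follows, assuming the cell looks like [1,2,6] or 1,2,6 with coded values; the exact serialisation is an assumption that should be checked against your own extract, and split_condensed is a made-up name.

```shell
# Hypothetical helper: split one condensed cell into one value per line.
# usage: split_condensed VALUE
# ASSUMPTION: values are separated by commas, optionally wrapped in
# square brackets and/or double quotes.
split_condensed() {
  printf '%s\n' "$1" |
    sed -e 's/[]["]//g' -e 's/, */,/g' |
    tr ',' '\n'
}

split_condensed '[1,2,6]'
```

    A cell that holds a single value is printed unchanged.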
