How do I extract the entire proteomics dataset from the browser without linking it to a specific phenotype cohort?

Federica Merella

Hello.

I am trying to understand how to extract the proteomics data (olink_instance_0), given that my phenotype cohort has been generated using an R script. I have no intention of re-generating the same cohort using the browser: because of the multiple conditions it needs to satisfy, that would be extremely challenging and/or error-prone. Obviously my cohort, uploaded to my DNAnexus project, has its own file ID.

Essentially, what I would like to do is download the entire proteomics dataset, intersect it with the participant IDs of my phenotype cohort, and move on from there. It is proving challenging. Could you please advise?

I have been watching the video "Analyzing the UK Biobank Proteomics Data on the UK Biobank Research Analysis Platform" on YouTube, but I remain confused.

 

1) Can I use the cohort browser to select the entire proteomics dataset (olink_instance_0) with the participant IDs, without selecting any phenotype cohort? (So, essentially all ~53,000 participants?) Is there a video describing this? I cannot find anything similar unless I first select my phenotype cohort, which I do not want to do in the browser.

2) The video suggests, as an alternative, using dx (in ttyd?) or JupyterLab. The video does not explain how, and extract_dataset is not among the dx functions.

The video suggests this link:

UKB_RAP/proteomics/0_extract_phenotype_protein_data.ipynb at main · dnanexus/UKB_RAP

which reads: "This notebook extracts proteomics data via the Cohort Browser".

1)

# Automatically discover dispensed dataset ID
dispensed_dataset = dxpy.find_one_data_object(
    typename="Dataset", name="app*.dataset", folder="/", name_mode="glob"
)
dispensed_dataset_id = dispensed_dataset["id"]

What dataset is this code referring to? The phenotype dataset linked to the proteomics data?

2) Does the JupyterLab command

cmd = ["dx", "extract_dataset", dataset, "-ddd", "--delimiter", ","]
subprocess.check_call(cmd)

accept as the dataset identifier (which I believe refers to the phenotype data identifier) an object ID (i.e. file-Gvjy……) rather than the project-xxx:record-xxx generated by the cohort browser?

 I would very much value your help.

Thanks

Federica

 

Comments

3 comments

  • Comment author
    Rachael W (UKB Community team, Data Analyst)
    • Edited

    Hi Federica,

    There are at least two methods to get the Olink data in CSV format that you can then read into an R session:

    Method 1.  Use the cohort browser to create a cohort and then use dx extract_dataset several times (a few proteins at a time).

    Method 2.  Use some of the code in notebook A108 from https://github.com/UK-Biobank/UKB-RAP-Notebooks-Access .

    Method 1 

    In the cohort browser, create a cohort with all the participants having proteomics data at i0:

    Add Filter, Biological samples > Blood assays > Proteomics > Protein biomarkers > Number of proteins measured | Instance 0, Add Cohort Filter IS NOT NULL, Apply Filter (this can take a while)

    This should produce a cohort of ~53,000 participants.

    Save cohort (floppy disc icon) to a suitable folder, such as FM/OLINK/ in your main project filing system, with a meaningful name such as ppts_with_non_null_num_proteins_i0.   

    For a list of all the Olink fields, see https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/accessing-data/accessing-phenotypic-data , which points to https://github.com/dnanexus/UKB_RAP/blob/main/proteomics/field_names.txt . Download the file to your computer (it is not individual-level participant data) and amend it so that olink_instance_0. appears at the start of every row, that is, in front of each protein name and in front of “eid”.

    Split the field_names.txt file into several smaller files, field_names1.txt etc., and include the eid row at the top of each small file. I don't know exactly how small the files need to be: I tested the next steps with a file of 18 proteins and it worked, but with all ~3,000 proteins it failed with a timeout message (even with a very large Spark JupyterLab instance). A small R sketch of the prefixing and splitting is below.
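
    If you would rather script the amending and splitting than do it by hand, here is a minimal R sketch to run on your own computer; the chunk size of 20 proteins and the field_namesN.txt output names are just assumptions you can change:

    # read the downloaded field list, prefix every field with "olink_instance_0.",
    # and write it out in chunks, each with the eid row at the top
    fields <- readLines("field_names.txt")
    fields <- setdiff(fields, "eid")                 # eid is re-added to every chunk below
    prefixed <- paste0("olink_instance_0.", fields)
    chunks <- split(prefixed, ceiling(seq_along(prefixed) / 20))   # ~20 proteins per file
    for (i in seq_along(chunks)) {
      writeLines(c("olink_instance_0.eid", chunks[[i]]),
                 paste0("field_names", i, ".txt"))
    }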

    Copy all of the amended field_namesN.txt files from your computer to your UKB-RAP project, for example by using the Add button.

    Start a Spark JupyterLab. (Tools, JupyterLab, cluster config Spark Cluster, Hail)

    Open a $_ Terminal.

    Copy all of the amended field_namesN.txt files from your UKB-RAP project main storage to the JupyterLab storage, for example by using:

    dx download FM/OLINK/field_names*

    Extract the first set of data from the Parquet database into a csv file in your project main storage FM/OLINK folder, by using:

    dx extract_dataset FM/OLINK/ppts_with_non_null_num_proteins_i0  --fields-file field_names1.txt -o FM/OLINK/olink_data1.csv

    Repeat the dx extract_dataset command for each of the field_namesN.txt files to make olink_dataN.csv (a small loop that does this from R is sketched below).
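
    If you prefer to drive the repeated command from R (for example, from a notebook on the same JupyterLab instance) rather than typing it in the terminal each time, a loop along these lines should work; the number of files here is only a placeholder:

    # run dx extract_dataset once per field_namesN.txt file
    n_files <- 5   # placeholder: set this to however many small files you made
    for (i in seq_len(n_files)) {
      system(sprintf(
        "dx extract_dataset FM/OLINK/ppts_with_non_null_num_proteins_i0 --fields-file field_names%d.txt -o FM/OLINK/olink_data%d.csv",
        i, i))
    }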

    Method 2

    Start a Spark JupyterLab.

    Open a new Notebook (or a new script if you prefer), paste this code into it:

    # Load required packages
    if (!require(pacman)) install.packages("pacman")
    install.packages("sparklyr")
    pacman::p_load(sparklyr, data.table, dplyr, stringr, DBI, purrr, bit64, readr)
    # Connect to the master node that orchestrates the analysis in Spark
    port <- Sys.getenv("SPARK_MASTER_PORT")
    master <- paste("spark://master:", port, sep = "")
    sc <- spark_connect(master)


     

    # Paths to database
    database_path <- system("dx find data --class database", intern =TRUE)
    app_substring <- na.omit(str_extract(database_path, '(app\\d+_\\d+)'))
    database_substring <- str_extract(database_path[str_detect(database_path, app_substring)], 'database-([A-Za-z0-9]+)') %>% tolower()  %>% str_replace("database-", "database_")
    database <- paste0(database_substring, "__", app_substring)
    # Project_id
    project_id <- Sys.getenv('DX_PROJECT_CONTEXT_ID')
    
    
    # Record_id
    record_id <- system("dx find data --type Dataset --delimiter ',' | awk -F ',' '{print $5}'" , intern = TRUE)
    
    
    # Project_record_id
    project_record_id <- paste0(project_id, ":", record_id)
    # Olink tables within database
    tables <- DBI::dbGetQuery(sc, paste0("SHOW TABLES IN ", database))
    tables %>%
        filter(str_detect(tableName, "olink")) %>%
        pull(tableName)

     

    # Instance 0
    table_dataframes_i0 <- replicate(12, data.frame(matrix(ncol = 0, nrow = 0)), simplify = FALSE)
    
    
    # Loop through each table name
    for (i in 1:12) {
      # Construct the table name
      table_name <- paste0("olink_instance_0_00", sprintf("%02d", i))
    
    
      # Construct the SQL query
      query <- paste0("SELECT * FROM ", database, ".", table_name)
    
    
      # Execute the query and store the result in a dataframe
      table_dataframes_i0[[i]] <- sdf_sql(sc, query)
    }

     

    # now combine in a wide-format
    instance_0_sdf_wide <- reduce(table_dataframes_i0, left_join, by = "eid")

     

    # write to instance storage
    readr::write_csv(data.frame(instance_0_sdf_wide), 'olink_i0_npx_wide.csv')

     

    Run the code a cell at a time.

    Ignore the warning message about pacman.

    Beware that if you run the spark_connect code cell more than once, it will crash and you will need to close the kernel and start again; a small guard that avoids this is sketched below.
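
    One way to avoid the accidental double connection (a sketch, not part of notebook A108) is to wrap the call in a check:

    # only connect if there is no open Spark connection already
    if (!exists("sc") || !sparklyr::connection_is_open(sc)) {
      sc <- spark_connect(master)
    }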

    Copy the output from JupyterLab instance storage into your main project storage, by going to the $_ terminal again and running this:

    dx upload olink_i0_npx_wide.csv

    Consider saving the notebook you just created, and uploading that to project storage too.

     

    I hope this helps.  If I missed a step, please post again here with the error message.   As always, please be careful not to post any participant data or IDs.

    Thank you for using the forum.

  • Comment author
    Rachael W (UKB Community team, Data Analyst)

    Hi Federica,

    Method 3 - using your saved cohort and the Table Exporter tool to export a whole entity

     

    Select Tools, Tools Library, Table Exporter [Install if not already done], Run

    Select location for output file, click Next

    Select the cohort created previously, i.e. ppts_with_non_null_num_proteins_i0

    Leave the field names file blank

    Leave Options as the defaults, or amend them if you prefer.

    In Advanced Options, enter Entity olink_instance_0, leave field-names and field-titles blank

    Select Start Analysis

     

    This job takes about 12 minutes, so be patient.

    After it finishes, find the output file, which will be called data.csv and will be in the “location for output file” you specified. Rename the file to something meaningful, such as olink_i0_fromTableExp_data.csv.
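
    If you prefer to rename it from the command line, something like this should work (a sketch; adjust the folder to wherever you sent the Table Exporter output):

    dx mv FM/OLINK/data.csv FM/OLINK/olink_i0_fromTableExp_data.csv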

    Start a single-node JupyterLab.

    Start a $_ terminal, and use dx download to copy the output file into the Instance storage.
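
    Once the file is on the instance, you can read it into R and intersect it with the participant IDs from your own phenotype cohort. A minimal sketch is below; the phenotype file name is hypothetical, and the name of the participant ID column in the exported file depends on the header style you chose, so adjust both to match your data:

    # read the exported Olink data and your own phenotype cohort,
    # then keep only the participants present in both
    library(data.table)
    olink <- fread("olink_i0_fromTableExp_data.csv")
    pheno <- fread("my_phenotype_cohort.csv")   # hypothetical name: dx download your cohort file first
    # if the ID column came out as "olink_instance_0.eid", rename it to "eid" before merging
    if ("olink_instance_0.eid" %in% names(olink)) setnames(olink, "olink_instance_0.eid", "eid")
    merged <- merge(pheno, olink, by = "eid")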


  • Comment author
    Qian Chen
    • Edited

    Your suggestions are very helpful, and I greatly appreciate the efforts of the UKB team. @Rachael W

