How do I extract the entire proteomics dataset from the browser without linking it to a specific phenotype cohort?

Federica Merella

Hello.

I am trying to understand how to extract the proteomics data (olink_instance_0), given that my phenotype cohort has been generated using an R script. I have no intention of re-generating the same cohort using the browser: because of the multiple conditions it needs to satisfy, that would be extremely challenging and/or error-prone. Obviously my cohort, uploaded to my DNAnexus project, has its own file ID.

Essentially, what I would like to do is download the entire proteomics dataset, intersect it with the participant IDs of my phenotype cohort, and move on from there. It is proving challenging. Could you please advise?

I have been watching the video "Analyzing the UK Biobank Proteomics Data on the UK Biobank Research Analysis Platform" on YouTube, but I remain confused.

 

1) Can I use the cohort browser to select the entire proteomics dataset (olink_instance_0) with the participant IDs, without selecting any phenotype cohort? (So, essentially all ~53,000 participants?) Is there a video describing this? I cannot find anything similar unless I first select my phenotype cohort, which I do not want to do in the browser.

2) The video suggests, as an alternative, using dx (in ttyd?) or JupyterLab. The video does not explain how, and extract_dataset is not among the dx functions.

The video suggests this link:

UKB_RAP/proteomics/0_extract_phenotype_protein_data.ipynb at main · dnanexus/UKB_RAP

which reads: "This notebook extracts proteomics data via the Cohort Browser".

1)

# Automatically discover dispensed dataset ID
dispensed_dataset = dxpy.find_one_data_object(
    typename="Dataset", name="app*.dataset", folder="/", name_mode="glob"
)
dispensed_dataset_id = dispensed_dataset["id"]

What dataset is this code referring to? The phenotype dataset linked to the proteomics data?

2) Does the JupyterLab command

cmd = ["dx", "extract_dataset", dataset, "-ddd", "--delimiter", ","]
subprocess.check_call(cmd)

accept as the dataset identifier (which I believe refers to the phenotype data identifier) an object ID (i.e. file-Gvjy……) rather than the project-xxx:record-xxx generated by the cohort browser?

 I would very much value your help.

Thanks

Federica

 

Comments

3 comments

  • Comment author
    Rachael W (UKB Community team, Data Analyst)
    • Edited

    Hi Federica,

    There are at least two methods to get the Olink data in CSV format that you can then read into an R session:

    Method 1.  Use the cohort browser to create a cohort and then use dx extract_dataset several times (a few proteins at a time).

    Method 2.  Use some of the code in notebook A108 from https://github.com/UK-Biobank/UKB-RAP-Notebooks-Access .

    Method 1 

    In the cohort browser, create a cohort with all the participants having proteomics data at i0:

    Add Filter, Biological samples > Blood assays > Proteomics > Protein biomarkers > Number of proteins measured | Instance 0, Add Cohort Filter IS NOT NULL, Apply Filter (this can take a while)

    This should produce a cohort of ~53,000 participants.

    Save cohort (floppy disc icon) to a suitable folder, such as FM/OLINK/ in your main project filing system, with a meaningful name such as ppts_with_non_null_num_proteins_i0.   

    For a list of all the Olink fields, see https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/accessing-data/accessing-phenotypic-data , which points to https://github.com/dnanexus/UKB_RAP/blob/main/proteomics/field_names.txt . Download the file to your computer (it is not individual-level participant data) and amend it so that olink_instance_0. appears at the start of every row, that is, in front of each protein name and in front of “eid”.

    Split the field_names.txt file into several smaller files, field_names1.txt etc., and include the eid row at the top of each small file. I don't know exactly how small the files need to be: I tested the next steps with a file of 18 proteins and it worked, but with all ~3,000 proteins it failed with a timeout message (even with a very large Spark JupyterLab instance). A small R sketch of the prefixing and splitting is below.
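
    If you would rather script the amending and splitting than do it by hand, here is a minimal R sketch to run on your own computer; the chunk size of 20 proteins and the field_namesN.txt output names are just assumptions you can change:

    # read the downloaded field list, prefix every field with "olink_instance_0.",
    # and write it out in chunks, each with the eid row at the top
    fields <- readLines("field_names.txt")
    fields <- setdiff(fields, "eid")                 # eid is re-added to every chunk below
    prefixed <- paste0("olink_instance_0.", fields)
    chunks <- split(prefixed, ceiling(seq_along(prefixed) / 20))   # ~20 proteins per file
    for (i in seq_along(chunks)) {
      writeLines(c("olink_instance_0.eid", chunks[[i]]),
                 paste0("field_names", i, ".txt"))
    }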

    Copy all of the amended field_namesN.txt files from your computer to your UKB-RAP project, for example by using the Add button.

    Start a Spark JupyterLab. (Tools, JupyterLab, cluster config Spark Cluster, Hail)

    Open a $_ Terminal.

    Copy all of the amended field_namesN.txt files from your UKB-RAP project main storage to the JupyterLab storage, for example by using:

    dx download FM/OLINK/field_names*

    Extract the first set of data from the Parquet database into a csv file in your project main storage FM/OLINK folder, by using:

    dx extract_dataset FM/OLINK/ppts_with_non_null_num_proteins_i0  --fields-file field_names1.txt -o FM/OLINK/olink_data1.csv

    Repeat the dx extract_dataset command for each of the field_namesN.txt files to make olink_dataN.csv (a small loop that does this from R is sketched below).
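
    If you prefer to drive the repeated command from R (for example, from a notebook on the same JupyterLab instance) rather than typing it in the terminal each time, a loop along these lines should work; the number of files here is only a placeholder:

    # run dx extract_dataset once per field_namesN.txt file
    n_files <- 5   # placeholder: set this to however many small files you made
    for (i in seq_len(n_files)) {
      system(sprintf(
        "dx extract_dataset FM/OLINK/ppts_with_non_null_num_proteins_i0 --fields-file field_names%d.txt -o FM/OLINK/olink_data%d.csv",
        i, i))
    }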

    Method 2

    Start a Spark JupyterLab.

    Open a new Notebook (or a new script if you prefer), paste this code into it:

    # Load required packages
    if (!require(pacman)) install.packages("pacman")
    install.packages("sparklyr")
    pacman::p_load(sparklyr, data.table, dplyr, stringr, DBI, purrr, bit64, readr)
    # Connect to the master node that orchestrates the analysis in Spark
    port <- Sys.getenv("SPARK_MASTER_PORT")
    master <- paste("spark://master:", port, sep = "")
    sc <- spark_connect(master)


     

    # Paths to database
    database_path <- system("dx find data --class database", intern =TRUE)
    app_substring <- na.omit(str_extract(database_path, '(app\\d+_\\d+)'))
    database_substring <- str_extract(database_path[str_detect(database_path, app_substring)], 'database-([A-Za-z0-9]+)') %>% tolower()  %>% str_replace("database-", "database_")
    database <- paste0(database_substring, "__", app_substring)
    # Project_id
    project_id <- Sys.getenv('DX_PROJECT_CONTEXT_ID')
    
    
    # Record_id
    record_id <- system("dx find data --type Dataset --delimiter ',' | awk -F ',' '{print $5}'" , intern = TRUE)
    
    
    # Project_record_id
    project_record_id <- paste0(project_id, ":", record_id)
    # Olink tables within database
    tables <- DBI::dbGetQuery(sc, paste0("SHOW TABLES IN ", database))
    tables %>%
        filter(str_detect(tableName, "olink")) %>%
        pull(tableName)

     

    # Instance 0
    table_dataframes_i0 <- replicate(12, data.frame(matrix(ncol = 0, nrow = 0)), simplify = FALSE)
    
    
    # Loop through each table name
    for (i in 1:12) {
      # Construct the table name
      table_name <- paste0("olink_instance_0_00", sprintf("%02d", i))
    
    
      # Construct the SQL query
      query <- paste0("SELECT * FROM ", database, ".", table_name)
    
    
      # Execute the query and store the result in a dataframe
      table_dataframes_i0[[i]] <- sdf_sql(sc, query)
    }

     

    # now combine in a wide-format
    instance_0_sdf_wide <- reduce(table_dataframes_i0, left_join, by = "eid")

     

    # write to instance storage
    readr::write_csv(data.frame(instance_0_sdf_wide), 'olink_i0_npx_wide.csv')

     

    Run the code a cell at a time.

    Ignore the warning message about pacman.

    Beware that if you run the spark_connect code cell more than once, it will crash and you will need to close the kernel and start again; a small guard that avoids this is sketched below.
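
    One way to avoid the accidental double connection (a sketch, not part of notebook A108) is to wrap the call in a check:

    # only connect if there is no open Spark connection already
    if (!exists("sc") || !sparklyr::connection_is_open(sc)) {
      sc <- spark_connect(master)
    }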

    Copy the output from JupyterLab instance storage into your main project storage, by going to the $_ terminal again and running this:

    dx upload olink_i0_npx_wide.csv

    Consider saving the notebook you just created, and uploading that to project storage too.

     

    I hope this helps.  If I missed a step, please post again here with the error message.   As always, please be careful not to post any participant data or IDs.

    Thank you for using the forum.

  • Comment author
    Rachael W (UKB Community team, Data Analyst)

    Hi Federica,

    Method 3 - using your saved cohort and the Table Exporter tool to export a whole entity

     

    Select Tools, Tools Library, Table Exporter [Install if not already done], Run

    Select location for output file, click Next

    Select the cohort created previously, i.e. ppts_with_non_null_num_proteins_i0

    Leave the field names file blank

    Leave Options as the defaults, or amend them if you prefer.

    In Advanced Options, enter Entity olink_instance_0, leave field-names and field-titles blank

    Select Start Analysis

     

    This job takes about 12 minutes, so be patient.

    After it finishes, find the output file, which will be called data.csv and will be in the “location for output file” you specified. Rename the file to something meaningful, such as olink_i0_fromTableExp_data.csv.
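
    If you prefer to rename it from the command line, something like this should work (a sketch; adjust the folder to wherever you sent the Table Exporter output):

    dx mv FM/OLINK/data.csv FM/OLINK/olink_i0_fromTableExp_data.csv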

    Start a single-node JupyterLab.

    Start a $_ terminal, and use dx download to copy the output file into the Instance storage.
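
    Once the file is on the instance, you can read it into R and intersect it with the participant IDs from your own phenotype cohort. A minimal sketch is below; the phenotype file name is hypothetical, and the name of the participant ID column in the exported file depends on the header style you chose, so adjust both to match your data:

    # read the exported Olink data and your own phenotype cohort,
    # then keep only the participants present in both
    library(data.table)
    olink <- fread("olink_i0_fromTableExp_data.csv")
    pheno <- fread("my_phenotype_cohort.csv")   # hypothetical name: dx download your cohort file first
    # if the ID column came out as "olink_instance_0.eid", rename it to "eid" before merging
    if ("olink_instance_0.eid" %in% names(olink)) setnames(olink, "olink_instance_0.eid", "eid")
    merged <- merge(pheno, olink, by = "eid")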


  • Comment author
    Qian Chen
    • Edited

    Your suggestions are very helpful, and I greatly appreciate the efforts of the UKB team. @Rachael W

