Extracting phenotypic data

The UK Biobank Research Analysis Platform (UKB-RAP) offers various tools that can be used to extract phenotypic and participant data from the UK Biobank database.

Data structure on the UKB-RAP

It is important to understand the structure of the database on the UKB-RAP prior to extracting and analysing data. Tabular data in the UK Biobank database is stored in entities. Each entity consists of fields of data relevant to that entity. The participant entity is the main dataset and contains comprehensive information about the UK Biobank participants such as sex, year of birth, and standing height. Further entities include death records, COVID-19 datasets, hospital inpatient records and more.

Most entities in the UKB-RAP are comprised of a single table though there are a couple of exceptions to this:

Olink: Olink record tables are split horizontally into multiple tables, therefore querying the whole table is not straightforward, see the A107_OMOP-data-hypertension-case-study_R GitHub notebook for details.
Participant: The participant entity is the main phenotypic dataset and includes comprehensive information about the participants. The participant entity is made up of multiple underlying tables. Keep this in mind if you plan to run a SQL query on fields within this entity.

Retrieving data from the UKB-RAP

We recommend that you use the table exporter tool to retrieve data from the UKB-RAP considering costs and time. Other methods include:

Table Exporter

The table exporter tool enables you to extract fields from all entities tables, pick any number of fields, and also allows for different types of inputs. The table exporter can be run interactively through the tools library, or programmatical on the command line from the terminal on a JupyterLab instance, RStudio session, or directly on the Command Line Interface (CLI). You need to make sure that you have specified the entity you want to extract the data from before running the tool. This can be done in the “Advanced Options” when you are running the tool interactively. The A110_Export_participant_data.Rmd notebook is a useful guide to run table exporter on RStudio.

dx extract dataset

The extract_dataset command from the dx-toolkit contains a number of arguments that enable the extraction of a range of data including field metadata, data dictionaries, and fields. The dx-toolkit can be run on the command line or within a JupyterLab instance. Within an RStudio instance, it is recommended to only use the dx_extract command to extract field metadata.

Spark Cluster in JupyterLab

Within Jupyter notebooks, data can be accessed using a spark cluster via either a Python or R notebook. Notebooks are available on the UK Biobank GitHub with examples of how to use the retrieve_fields command within a JupyterLab Spark Cluster in both Python (A103_Export-participant-data_Python.ipynb) and R (A107_OMOP-data-hypertension-case-study_R). Please note that a Spark Cluster cannot be initiated within an RStudio session.

Cohort Browser

A graphical interface to browse and filter participant data. It allows you visualise and filter the data interactively. You can export a table of up to 30,000 records from the participant entity.

SQL Query

Within a Spark Cluster in JupyterLab it is possible to extract data from the UKB-RAP in an R notebook using a SQL query by using the sdf_sql command from the SparklyR package. An example of how to do this is contained in the A105_Export-participant-data_R.ipynb notebook. Please note that this method cannot be used within an RStudio session.

The participant dataset is an entity comprised of multiple tables. Before extracting data from the participant entity using a SQL query you will need to know the table a field is in and how that relates to other tables/entities. You can find this out using the dx extract_dataset command from the dx-toolkit and specifying that the output should be SQL, this then returns the SQL command needed to access the required data.

e.g. to get the SQL code to extract the ‘age at recruitment – field id 21022’ field:

dx extract_dataset [project_name] --fields participant.p21022 --sql -o

Here, dx extract_dataset is the dx toolkit command, project name is the name of your project (GitHub notebook 101 provides instructions on how to get your project id), --sql -o specifies that you would like the output of your query to be sql code.

This returns the SQL code:

SELECT `participant_0042_1`.`p21022` AS `participant.p21022` FROM `database_xyz__app12345`.`participant_0042` AS `participant_0042_1`;[LK1]

This returns the sql query that you would use in order to select the fields you entered in the original query from within your UK Biobank project.

Watch a tutorial

Useful resources

The UK Biobank Community: Contains a wealth of resources to help maximise your use of the UKB-RAP. Search the articles, FAQs, post a query in the community forum or submit a ticket to contact us.

Information about fields and data in UK Biobank are detailed in the UK Biobank Showcase.
The DNAnexus website. Reach out to DNAnexus for issues relating to the database.
The UK Biobank GitHub page for notebooks showing worked examples of accessing and using data on the RAP.

Extracting phenotypic data

Data structure on the UKB-RAP

Retrieving data from the UKB-RAP

Watch a tutorial

Useful resources

Was this article helpful?

Comments