Returning Derived Data to UK Biobank

Researchers using UK Biobank data are required to return key individual-level derived results generated as part of their approved research project.  Returning these data allows them to be incorporated into the UK Biobank resource and made available for use by other approved researchers.  

This article provides guidance for researchers returning derived data either:

  • From data downloaded from UKB (only for projects where downloads are permitted)
  • From analyses carried out on the UK Biobank Research Analysis Platform (UKB-RAP) 

When should data be returned?

For projects returning results outside the UKB-RAP, results should be returned within 6 months of entering the public domain (whether through published papers, conference presentations, postings of results on websites or social media, etc.), or within 12 months of project end, whichever comes first.

For researchers using the UKB-RAP for their analysis, returned data should be provided at least 1 month before the UKB-RAP project expiry date to allow sufficient time for processing by UK Biobank before your project closes.

What you must return

UK Biobank only requires individual-level derived data generated as part of your approved project (e.g., derived phenotypes). Details of what is required alongside these data are provided in the sections below.

What you do not need to return

  • Summary-level data 
  • Manuscripts (please note that UK Biobank does require notification of your publication(s), but the manuscripts themselves do not need to be returned. See these articles on submitting publications and researcher responsibilities).
  • Posters 
  • Code that is not required to derive the returned data 
  • Temporary or working files 
  • Simple variables (e.g., BMI) that others can easily derive from touchscreen or physical measures
  • GWAS summary statistics (please do not return GWAS summary statistics to the UK Biobank. We ask that all GWAS summary statistics, published or unpublished from arrays or WGS analyses, should be submitted to the GWAS Catalog)

How will my returned data be used? 

Derived data that may be of use to other researchers will be incorporated into the resource and made available via the Data Showcase. Full acknowledgement of the provenance of the data will be provided.

Researchers should be aware that UK Biobank does not perform any quality assurance checks on the code or datasets made available via the Showcase Returns Catalogue. Where possible we try to ensure that datasets which include externally derived variables are accompanied by the code used to generate them, and would encourage users to perform their own review to confirm that the data meet their own quality requirements.

Should you have any questions, or require further clarification, please contact the UK Biobank team via a support ticket.


Return submission route

The route you use depends on where your derived data are held.

1.1 From downloaded data

If you are permitted to download data for analysis outside the UKB-RAP, your derived data files should be returned to UKB via the Returned Results upload site, accessed through AMS. Please follow the Return of Results via AMS User Guide for procedural guidance only, specifically for the steps on where and how to return results via AMS. This document, however, is the primary reference for what needs to be included in the return: the required documents and information are outlined here.

To ensure all returned data is provided in a standardised format and to avoid delays in processing, it is recommended that the folder structure follows the requirements as described in Section 2: Requirements for Returning Data.

The completed folder can be compressed into a single .zip file and uploaded using the “Data File” option in AMS. As this folder will contain all the required information, the other upload options (for example, Manuscript) can be ignored unless they are useful in specific circumstances.
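The compression step could be scripted along the following lines. This is a minimal sketch using Python's standard library; the function name zip_return_folder is hypothetical, and it assumes the return folder has already been assembled as described in Section 2.

```python
import shutil
from pathlib import Path


def zip_return_folder(return_folder: Path) -> Path:
    """Compress a prepared return folder into a single .zip next to it.

    The top-level folder is kept inside the archive, so that unzipping
    recreates UKB_<application_id>_..._<YYYY-MM-DD>/ with its contents.
    """
    return_folder = return_folder.resolve()
    archive = shutil.make_archive(
        base_name=str(return_folder),        # archive path without the .zip suffix
        format="zip",
        root_dir=str(return_folder.parent),  # paths in the archive are relative to this
        base_dir=return_folder.name,         # only this folder is added
    )
    return Path(archive)
```

The resulting file is the one that would be uploaded using the “Data File” option in AMS.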

1.2 From the UKB-RAP

If your analysis was conducted on the UKB-RAP, returned data must be prepared and returned via the UKB-RAP. Returned data must be provided in a standard format so that UK Biobank can extract and process it efficiently. Packaging requirements are described in Section 2. 

Once your return is prepared in line with these requirements:

  1. Contact UK Biobank to confirm your return is ready for extraction.
  2. Include: 
    • Your UK Biobank project ID 
    • A brief description of the data you are returning 
    • The name(s) of the top-level return folder(s) 
  3. Add org-ukb_reviewers to your UKB-RAP project. 

UK Biobank will then review and extract the returned data. To facilitate timely extraction, please ensure that all requirements are followed. Failure to do so may delay processing and may affect other researchers accessing and building on your data. 

Returning Multiple Datasets

Researchers may return more than one dataset from a single approved project. Each dataset must:

  • Be provided as a separate top-level return folder 
  • Independently meet all requirements described in this guidance
  • Not share files or metadata with other return folders

Each return folder is reviewed and extracted independently.  


Requirements for returning data 

2.1 Top-Level Folder Naming

Each returned dataset must be provided as a single top-level folder using the following naming convention: 

UKB_<application_id>_<dataset_short_name>_v<version>_<YYYY-MM-DD>

Example: 

UKB_40541_cardiac_idps_v1_2025-12-15

Versioning Rules 

  • Versions must start at v1
  • Versions must increment using whole numbers only (v1, v2, v3, …)
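
A folder name can be checked against this convention before submission. The sketch below is illustrative only: the function name is hypothetical, and the set of characters allowed in the dataset short name (lowercase letters, digits and underscores) is an assumption, since this guidance does not define it.

```python
import re
from datetime import datetime

# Assumption: the short name uses lowercase letters, digits and underscores.
FOLDER_RE = re.compile(
    r"^UKB_(?P<application_id>\d+)"
    r"_(?P<short_name>[a-z0-9]+(?:_[a-z0-9]+)*)"
    r"_v(?P<version>[1-9]\d*)"           # whole numbers only, starting at v1
    r"_(?P<date>\d{4}-\d{2}-\d{2})$"
)


def parse_return_folder_name(name: str) -> dict:
    """Split a return folder name into its parts, or raise ValueError."""
    match = FOLDER_RE.match(name)
    if match is None:
        raise ValueError(f"Folder name does not follow the convention: {name!r}")
    parts = match.groupdict()
    # strptime raises ValueError for impossible dates such as 2025-13-40.
    datetime.strptime(parts["date"], "%Y-%m-%d")
    parts["version"] = int(parts["version"])
    return parts
```

For example, parse_return_folder_name("UKB_40541_cardiac_idps_v1_2025-12-15") yields the application ID, short name, version and date, while a name using v0 or v1.2 is rejected.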

 

2.2 Required folder structure

Each return folder must contain the following structure:

UKB_<application_id>_<dataset_short_name>_v<version>_<YYYY-MM-DD>/
├── README.md
├── data/
│   ├── <data file(s)>
│   └── <zip_name>.zip (optional zipped bulk files – see Section 2.2.1)
├── metadata/
│   ├── field_spec.csv
│   ├── encodings.csv (required if any encodings are used)
│   └── return_manifest.csv
├── code/ (exceptional - see Section 2.2.2)
└── checksums.md5 (bulk files only – see Section 2.9)
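
Before requesting extraction, the presence of the required files can be checked with a short script. The following is a sketch under stated assumptions, not an official validator: the function name and messages are hypothetical, and it only checks that files exist, not that their contents are valid.

```python
from pathlib import Path

# Files that every return folder is expected to contain.
REQUIRED_FILES = [
    "README.md",
    "metadata/field_spec.csv",
    "metadata/return_manifest.csv",
]


def check_return_folder(root: Path) -> list[str]:
    """Return a list of structural problems (an empty list means none found)."""
    problems = []
    for rel in REQUIRED_FILES:
        if not (root / rel).is_file():
            problems.append(f"missing file: {rel}")

    data_dir = root / "data"
    if not data_dir.is_dir() or not any(data_dir.iterdir()):
        problems.append("data/ is missing or empty")

    # checksums.md5 is only needed when zipped bulk files are present.
    has_zips = data_dir.is_dir() and any(data_dir.glob("*.zip"))
    if has_zips and not (root / "checksums.md5").is_file():
        problems.append("checksums.md5 is missing but data/ contains .zip files")
    return problems
```

The check for metadata/encodings.csv is deliberately left out here, because whether it is required depends on the contents of field_spec.csv.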

 

2.2.1 Optional zipped files

If your project generated very large numbers of individual-level files (e.g., multiple files per participant), these may be zipped and placed in the data/ folder. 

data/<zip_name>.zip

Where <zip_name> should follow a consistent naming convention: 

<eid>_<field_name>_<ins_index>_<arr_index>

<imaging_id> may be used in place of <eid> where appropriate. 

If zipped files are included: 

  • They must be listed in checksums.md5
  • The contents must be described in README.md
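
The zip naming convention in this section can be applied programmatically when many files are generated. The helpers below are a hypothetical sketch; they assume that the instance and array indices are integers and that the participant identifier itself contains no underscores.

```python
def build_zip_name(participant_id, field_name, ins_index, arr_index) -> str:
    """Build <eid>_<field_name>_<ins_index>_<arr_index>.zip."""
    return f"{participant_id}_{field_name}_{ins_index}_{arr_index}.zip"


def parse_zip_name(filename: str):
    """Split a zip file name back into its four parts.

    field_name may itself contain underscores, so the identifier is taken
    from the left and the two indices from the right.
    """
    stem = filename[:-len(".zip")] if filename.endswith(".zip") else filename
    participant_id, rest = stem.split("_", 1)
    field_name, ins_index, arr_index = rest.rsplit("_", 2)
    return participant_id, field_name, int(ins_index), int(arr_index)
```

For instance, build_zip_name(1111111, "t1_brain_struct", 2, 0) gives 1111111_t1_brain_struct_2_0.zip, and parse_zip_name reverses it.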

 

2.2.2 Exceptional inclusion of code

Where possible, code used to derive returned data should be made available via a public repository (e.g., GitHub) and referenced in the README.md. 

If code cannot be published on a public repository (for example, due to intellectual property or other proprietary restrictions), and you believe it is still important to provide the code alongside the returned data, please contact the UK Biobank Access Team via a support ticket before including it.

In such exceptional cases, a code/ folder may be included where: 

  • Only scripts required to generate the returned derived data are included. 
  • A README.md is provided in this folder or in the top-level README.md explaining how to run the code (e.g., script order, required inputs, expected outputs) 

 

2.3 README.md 

Each return must include a README.md describing the dataset. The README.md should include the following sections: 

  1. Dataset overview  
    • Description of the data/what the data contains 
    • Why it is valuable to other researchers 
  2. UK Biobank application details 
    • UK Biobank application ID 
    • Dataset name, version, and creation date 
    • UK Biobank data release version used to derive the data 
  3. Participant coverage  
    • Number of participants included 
    • Any inclusion or exclusion criteria 
  4. Data contents and structure 
    • Data layout (e.g., long or wide) 
    • Brief description of the data files provided 
    • Use of instances or arrays 
    • Description of any encoded or partially encoded fields and reference to metadata/encodings.csv 
  5. Compute and storage 
    • Estimated compute resources used for data generation 
    • Storage size and associated costs 
  6. Software 
    • Software or tools used, including versions where relevant 
  7. Code availability 
    • Where the reproducible code is placed (e.g., link to code repository) 
    • If the code is provided on a public repository, please ensure it is documented sufficiently for others to rerun 
    • If the exceptional code/ folder is included, note this here and include a description of how to run the code, either here or in the README within the code/ folder
  8. Associated publications 
    • Citations and DOI links for any related publications (if applicable) 
  9. Known issues or limitations 
    • Any caveats or limitations others should be aware of 
  10. Contact details 
    • Name of data uploader 
    • Name of the principal investigator of the project 
  11. Licensing terms and publication restrictions

 

2.4 Declaring data layout and data files

The following must be declared in metadata/return_manifest.csv: 

  • The data layout (long or wide) 
  • The data file(s) 

Example: 

key,value
data_layout,long
data_files,data/derived_data.csv

If multiple data files together represent a single dataset with the same structure, all files must be listed in data_files, separated by semicolons:

key,value
data_layout,long
data_files, data/derived_data_pt1.csv; data/derived_data_pt2.csv 

If a dataset cannot be represented using this format, researchers should contact UK Biobank for advice. 
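
The manifest format above is simple enough to read back and sanity-check with a few lines of code. This is a minimal sketch using only the standard library; the function name is hypothetical, and trimming whitespace around keys, values and file paths is an assumption about how the file would be interpreted.

```python
import csv
import io


def read_return_manifest(text: str) -> dict:
    """Parse return_manifest.csv content into a layout and a list of data files."""
    rows = csv.DictReader(io.StringIO(text))
    manifest = {
        (row["key"] or "").strip(): (row["value"] or "").strip() for row in rows
    }

    layout = manifest.get("data_layout")
    if layout not in ("long", "wide"):
        raise ValueError(f"data_layout must be 'long' or 'wide', got {layout!r}")

    # Multiple files are separated by semicolons.
    files = [p.strip() for p in manifest.get("data_files", "").split(";") if p.strip()]
    if not files:
        raise ValueError("data_files must list at least one file")
    return {"data_layout": layout, "data_files": files}
```

Applied to the two-file example above, this returns the layout long and the two paths data/derived_data_pt1.csv and data/derived_data_pt2.csv.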

 

2.5 Data format

Long format

In long format, each row represents a single value for a participant and field. Required columns: 

  • eid – participant ID
  • ins_index – instance index (NA if not applicable) 
  • arr_index – array index (NA if not applicable) 
  • field_name – in a standard format (i.e., no spaces and using underscores) and must exactly match field_name values defined in metadata/field_spec.csv 
  • value – field value 

If a column heading (e.g., instance or array) is not applicable, the column must be present and contain NA values.  

Example: 

eid,ins_index,arr_index,field_name,value
1111111,0,NA,lv_edv,133.5
1111111,0,NA,qc_flag,1

 

Wide format

In wide format, each row represents a participant (or participant/instance/array) and each field is provided as a separate column. Required columns: 

  • eid – participant ID 
  • ins_index and/or arr_index columns if applicable 
  • Column names that exactly match field_name defined in metadata/field_spec.csv

Example: 

eid,ins_index,lv_edv,qc_flag
1111111,0,133.5,1
1111112,0,138.2,1
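
The relationship between the two layouts can be illustrated by reshaping a wide table into long format. The sketch below uses only the standard library and a hypothetical function name; filling absent index columns with NA follows the long-format rule above, while skipping empty cells is an assumption that this guidance does not specify.

```python
import csv
import io

ID_COLUMNS = ("eid", "ins_index", "arr_index")


def wide_to_long(wide_csv: str) -> str:
    """Convert wide-format CSV text into the long format described above."""
    reader = csv.DictReader(io.StringIO(wide_csv))
    # Every column that is not an identifier/index column is treated as a field.
    field_names = [c for c in reader.fieldnames if c not in ID_COLUMNS]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["eid", "ins_index", "arr_index", "field_name", "value"])
    for row in reader:
        for field in field_names:
            if row[field] == "":
                continue  # assumption: empty cells are omitted from long format
            writer.writerow([
                row["eid"],
                row.get("ins_index", "NA"),  # NA when the column is absent
                row.get("arr_index", "NA"),
                field,
                row[field],
            ])
    return out.getvalue()
```

Running this on the wide example above produces one row per participant and field, with arr_index set to NA throughout.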

 

2.6 Field specification 

File: metadata/field_spec.csv

Defines metadata for each returned field. 

Required columns: 

  • field_name
  • value_type – one of: int, real, string, date, datetime, categorical
  • encoding_id (must be populated if the field uses encoded or partially encoded values, otherwise leave as blank) 
  • title
  • description

Optional columns: 

  • categorical_type (only applicable when value_type = categorical). Indicates whether a categorical field can have one or more than one categorical value for the same participant (within a single field/instance/array context).
    • Single is where a field has one categorical value per participant (or per participant/instance/array). Example fields: sex, banana intake
    • Multi is where a field may have more than one categorical value for the same participant (or per participant/instance/array). Example fields: treatment/medication code, qualifications
    • You may find the following lists of existing fields useful when determining the appropriate categorical_type for your returned variables: categorical (single), categorical (multi)
  • units

 

Example: 

field_name,value_type,encoding_id,title,description,categorical_type,units
medications,categorical,meds_v1,Medications taken,List of medications,multi,
lv_edv,real,,LV EDV,Left ventricular end diastolic volume,,mL

Note: Leave optional or non-applicable cells blank 
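
A field specification can be checked against these rules before it is returned. The following is an illustrative sketch rather than an official validator: the function name is hypothetical, the value types are those listed in this section (int, real, string, date, datetime, categorical), and the pattern used for field_name (letters, digits and underscores) is an assumption based on the "no spaces and using underscores" rule.

```python
import csv
import io
import re

REQUIRED_COLUMNS = ["field_name", "value_type", "encoding_id", "title", "description"]
ALLOWED_VALUE_TYPES = {"int", "real", "string", "date", "datetime", "categorical"}


def check_field_spec(text: str) -> list[str]:
    """Return a list of problems found in field_spec.csv content."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing required column(s): {', '.join(missing)}"]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        name = (row["field_name"] or "").strip()
        value_type = (row["value_type"] or "").strip()
        if not re.fullmatch(r"[A-Za-z0-9_]+", name):
            problems.append(f"line {line_no}: invalid field_name {name!r}")
        if value_type not in ALLOWED_VALUE_TYPES:
            problems.append(f"line {line_no}: unknown value_type {value_type!r}")
        for col in ("title", "description"):
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: {col} is empty")
        cat_type = (row.get("categorical_type") or "").strip().lower()
        if cat_type and value_type != "categorical":
            problems.append(f"line {line_no}: categorical_type set on non-categorical field")
        if cat_type and cat_type not in ("single", "multi"):
            problems.append(f"line {line_no}: categorical_type must be single or multi")
    return problems
```

The example specification above passes this check with no problems reported.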

 

2.7 Encoded and partially encoded values 

If a field uses encoded values (including partial encodings, where some values are literal, but others represent special codes such as “unknown”), this must be stated using encoding_id in metadata/field_spec.csv.

Where possible, researchers should reuse an existing UK Biobank encoding scheme rather than defining a new one. 

  • All encoded or partially encoded fields must reference an encoding_id
  • encoding_id identifies a specific encoding scheme.
  • The same encoding_id must be reused across fields where codes have identical meanings.
  • Encoded values must appear in the data as codes, not human-readable meanings. 

For examples of existing encoding schemes, please see the UK Biobank schema page.

encoding_id values should follow the format: 

<description>_v<version>

Examples: 

meds_v1

smoking_status_v1

Example of partial encoding: 

Field 3166 uses standard datetime values but also includes special coded values defined by Data-Coding 439. In this case, the special codes must be captured using an encoding_id, and the corresponding code meanings must be included in metadata/encodings.csv.

 

2.8 Encodings file (required if any encodings are used) 

File: metadata/encodings.csv

This file must be provided if any fields use encoded or partially encoded values.

Required columns: 

  • encoding_id
  • value
  • meaning

Example:  

encoding_id,value,meaning
meds_v1,1,Paracetamol
meds_v1,2,Ibuprofen
meds_v1,-1,Prefer not to answer
smoking_status_v1,0,Current
smoking_status_v1,1,Never
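
Because every encoding_id referenced in metadata/field_spec.csv needs matching rows in metadata/encodings.csv, the two files can be cross-checked. This is a minimal sketch with a hypothetical function name; it only compares identifiers and does not check that the data values themselves are valid codes.

```python
import csv
import io


def missing_encodings(field_spec_csv: str, encodings_csv: str) -> set[str]:
    """Return encoding_ids used in field_spec.csv but not defined in encodings.csv."""
    referenced = {
        row["encoding_id"].strip()
        for row in csv.DictReader(io.StringIO(field_spec_csv))
        if (row.get("encoding_id") or "").strip()  # blank means "not encoded"
    }
    defined = {
        (row.get("encoding_id") or "").strip()
        for row in csv.DictReader(io.StringIO(encodings_csv))
    }
    return referenced - defined
```

An empty set means every referenced scheme is defined; any identifier returned would need rows adding to encodings.csv.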

 

2.9 Checksums 

A file named checksums.md5 must be included, listing MD5 checksums for the bulk files in the return folder, with paths given relative to the return folder. Checksums are not required for tabular data files (e.g., CSV files with data on all participants).

Example: 

d41d8cd98f00b204e9800998ecf8427e             data/1111111_t1_brain_struct_2_0.zip 
0cc175b9c0f1b6a831c399e269772661             data/1111112_t1_brain_struct_2_0.zip 
900150983cd24fb0d6963f7d28e17f72             data/1111113_t1_brain_struct_2_0.zip 
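
A checksums file of this shape can be generated with Python's hashlib module. The sketch below makes some assumptions: the function names are hypothetical, bulk files are taken to be the .zip files directly inside data/, and the hash and path are separated by two spaces, as in the output of the common md5sum tool.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(return_folder: Path) -> list[str]:
    """Write checksums.md5 for the zipped bulk files and return its lines."""
    lines = []
    for path in sorted((return_folder / "data").glob("*.zip")):
        # Paths are recorded relative to the return folder, e.g. data/<name>.zip
        rel = path.relative_to(return_folder).as_posix()
        lines.append(f"{md5_of_file(path)}  {rel}")
    if lines:
        (return_folder / "checksums.md5").write_text(
            "\n".join(lines) + "\n", encoding="utf-8"
        )
    return lines
```

Reading files in chunks keeps memory use low for large imaging archives.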

Questions and support 

A downloadable checklist for returning data via the UKB-RAP is available here: UKB-RAP Return Checklist.

For questions about returning derived data, please contact the UK Biobank team via a support ticket.  

If any required file or resource is missing or cannot be provided, please include an explanation in your ticket so we can advise on how best to proceed. 

 
