Researchers using UK Biobank data are required to return key individual-level derived results generated as part of their approved research project. Returning these data allows them to be incorporated into the UK Biobank resource and made available for use by other approved researchers.
This article provides guidance for researchers returning derived data either:
- From data downloaded from UKB (only for projects where downloads are permitted)
- From analyses carried out on the UK Biobank Research Analysis Platform (UKB-RAP)
When should data be returned?
For projects returning results outside the UKB-RAP, results should be returned within 6 months of entering the public domain (whether through published papers, conference presentations, postings of results on websites or social media, etc.), or within 12 months of project end, whichever comes first.
For researchers using the UKB-RAP for their analysis, returned data should be provided at least 1 month before the UKB-RAP project expiry date to allow sufficient time for processing by UK Biobank before your project closes.
What you must return
UK Biobank only requires individual-level derived data generated as part of your approved project (e.g., derived phenotypes). Details of what is required alongside these data are provided in the sections below.
What you do not need to return
- Summary-level data
- Manuscripts (please note that UK Biobank does require notification of your publication(s), but manuscripts do not need to be submitted as a return. See these articles on submitting publications and researcher responsibilities)
- Posters
- Code that is not required to derive the returned data
- Temporary or working files
- Simple variables that others can easily derive from touchscreen or physical measures (e.g., BMI)
- GWAS summary statistics (please do not return GWAS summary statistics to UK Biobank. We ask that all GWAS summary statistics, published or unpublished, from arrays or WGS analyses, be submitted to the GWAS Catalog)
How will my returned data be used?
Derived data that may be of use to other researchers will be incorporated into the resource and made available via the Data Showcase. Full acknowledgement of the provenance of the data will be provided.
Researchers should be aware that UK Biobank does not perform any quality assurance checks on the code or datasets made available via the Showcase Returns Catalogue. Where possible we try to ensure that datasets which include externally derived variables are accompanied by the code used to generate them, and would encourage users to perform their own review to confirm that the data meet their own quality requirements.
Should you have any questions, or require further clarification, please contact the UK Biobank team via a support ticket.
Return submission route
The route you use depends on where your derived data are held.
1.1 From downloaded data
If you are permitted to download data for analysis outside the UKB-RAP, your derived data files should be returned to UKB via the Returned Results upload site, accessed through AMS. Please use the Return of Results via AMS User Guide for procedural guidance only, that is, for the steps on where and how to return results via AMS. The required documents and information are set out in this article, which should be used as the primary reference for what needs to be included in the return.
To ensure all returned data is provided in a standardised format and to avoid delays in processing, it is recommended that the folder structure follows the requirements as described in Section 2: Requirements for Returning Data.
The completed folder can be compressed into a single .zip file and uploaded using the “Data File” option in AMS. As this folder will contain all the required information, the other upload options (for example, Manuscript) can be ignored unless they are useful in specific circumstances.
1.2 From the UKB-RAP
If your analysis was conducted on the UKB-RAP, returned data must be prepared and returned via the UKB-RAP. Returned data must be provided in a standard format so that UK Biobank can extract and process it efficiently. Packaging requirements are described in Section 2.
Once your return is prepared in line with these requirements:
- Contact UK Biobank to confirm your return is ready for extraction.
- Include:
- Your UK Biobank project ID
- A brief description of the data you are returning
- The name(s) of the top-level return folder(s)
- Add org-ukb_reviewers to your UKB-RAP project.
UK Biobank will then review and extract the returned data. To facilitate timely extraction, please ensure that all requirements are followed. Failure to do so may delay processing and may affect other researchers accessing and building on your data.
Returning Multiple Datasets
Researchers may return more than one dataset from a single approved project. Each dataset must:
- Be provided as a separate top-level return folder
- Independently meet all requirements described in this guidance
- Not share files or metadata with other return folders
Each return folder is reviewed and extracted independently.
Requirements for returning data
2.1 Top-Level Folder Naming
Each returned dataset must be provided as a single top-level folder using the following naming convention:
UKB_<application_id>_<dataset_short_name>_v<version>_<YYYY-MM-DD>
Example:
UKB_40541_cardiac_idps_v1_2025-12-15
Versioning Rules
- Versions must start at v1
- Versions must increment using whole numbers only (v1, v2, v3, …)
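The naming convention and versioning rules above can be checked before upload. The following is a minimal sketch rather than an official validator: the function name `parse_return_folder_name` is hypothetical, and it assumes that the application ID is numeric and that the short name contains only letters, digits and underscores, neither of which is stated in this guidance.

```python
import re
from datetime import date

# Pattern for UKB_<application_id>_<dataset_short_name>_v<version>_<YYYY-MM-DD>.
# Assumptions (not stated in the guidance): numeric application ID, and a short
# name made only of letters, digits and underscores.
FOLDER_RE = re.compile(
    r"^UKB_(?P<application_id>\d+)_(?P<short_name>[A-Za-z0-9_]+?)"
    r"_v(?P<version>[1-9]\d*)_(?P<date>\d{4}-\d{2}-\d{2})$"
)


def parse_return_folder_name(name):
    """Split a return folder name into its parts, or raise ValueError."""
    match = FOLDER_RE.match(name)
    if match is None:
        raise ValueError(f"Folder name does not follow the convention: {name!r}")
    parts = match.groupdict()
    # fromisoformat raises ValueError for impossible dates such as 2025-02-30.
    date.fromisoformat(parts["date"])
    # Versions are whole numbers starting at 1 (v1, v2, v3, ...).
    parts["version"] = int(parts["version"])
    return parts


if __name__ == "__main__":
    print(parse_return_folder_name("UKB_40541_cardiac_idps_v1_2025-12-15"))
```

Names such as `UKB_40541_cardiac_idps_v1.1_2025-12-15` are rejected because the version is not a whole number.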
2.2 Required folder structure
Each return folder must contain the following structure:
UKB_<application_id>_<dataset_short_name>_v<version>_<YYYY-MM-DD>/
├── README.md
├── data/
│   ├── <data file(s)>
│   └── <zip_name>.zip (optional zipped bulk files – see Section 2.2.1)
├── metadata/
│   ├── field_spec.csv
│   ├── encodings.csv (required if any encodings are used)
│   └── return_manifest.csv
├── code/ (exceptional – see Section 2.2.2)
└── checksums.md5 (bulk files only – see Section 2.9)
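A quick local check of this structure can catch missing files before you contact UK Biobank. The sketch below is illustrative only: `missing_required_paths` is a hypothetical helper, and it checks just the items that are always required. The conditional items (encodings.csv, code/ and checksums.md5) are deliberately left out because whether they are needed depends on the dataset.

```python
from pathlib import Path

# Items every return folder needs, relative to the top-level folder.
# encodings.csv, code/ and checksums.md5 are conditional and not checked here.
REQUIRED_PATHS = [
    "README.md",
    "data",
    "metadata/field_spec.csv",
    "metadata/return_manifest.csv",
]


def missing_required_paths(return_folder):
    """Return the required items that are absent from the return folder."""
    root = Path(return_folder)
    missing = [item for item in REQUIRED_PATHS if not (root / item).exists()]
    data_dir = root / "data"
    # An empty data/ folder is treated as missing its data file(s).
    if data_dir.is_dir() and not any(data_dir.iterdir()):
        missing.append("data/<data file(s)>")
    return missing


if __name__ == "__main__":
    print(missing_required_paths("UKB_40541_cardiac_idps_v1_2025-12-15"))
```

An empty list means the always-required items are present. It does not confirm that their contents meet the requirements in the following sections.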
2.2.1 Optional zipped files
If your project generated very large numbers of individual-level files (e.g., multiple files per participant), these may be zipped and placed in the data/ folder.
data/<zip_name>.zip
Where <zip_name> should follow a consistent naming convention:
<eid>_<field_name>_<ins_index>_<arr_index>
<imaging_id> may be used in place of <eid> where appropriate.
If zipped files are included:
- They must be listed in checksums.md5
- The contents must be described in README.md
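As an illustration of the zip naming convention, the sketch below builds and parses names of the form `<eid>_<field_name>_<ins_index>_<arr_index>.zip`. Both function names are hypothetical. The parser assumes that the participant ID contains no underscores and that the last two underscore-separated parts are the instance and array indices, so that field names containing underscores are handled.

```python
def bulk_zip_name(eid, field_name, ins_index, arr_index):
    """Build '<eid>_<field_name>_<ins_index>_<arr_index>.zip'."""
    return f"{eid}_{field_name}_{ins_index}_{arr_index}.zip"


def parse_bulk_zip_name(file_name):
    """Recover the four parts; field_name may itself contain underscores."""
    stem = file_name[: -len(".zip")] if file_name.endswith(".zip") else file_name
    # The first part is the participant (or imaging) ID.
    eid, rest = stem.split("_", 1)
    # The last two parts are the instance and array indices.
    field_name, ins_index, arr_index = rest.rsplit("_", 2)
    return {
        "eid": eid,
        "field_name": field_name,
        "ins_index": ins_index,
        "arr_index": arr_index,
    }


if __name__ == "__main__":
    name = bulk_zip_name(1111111, "t1_brain_struct", 2, 0)
    print(name, parse_bulk_zip_name(name))
```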
2.2.2 Exceptional inclusion of code
Where possible, code used to derive returned data should be made available via a public repository (e.g., GitHub) and referenced in the README.md.
If code cannot be published on a public repository (for example, due to intellectual property or other proprietary restrictions), and you believe it is still important to provide the code alongside the returned data, please contact the UK Biobank Access Team via a support ticket before including it.
In such exceptional cases, a code/ folder may be included where:
- Only scripts required to generate the returned derived data are included.
- A README.md is provided in this folder or in the top-level README.md explaining how to run the code (e.g., script order, required inputs, expected outputs)
2.3 README.md
Each return must include a README.md describing the dataset. The README.md should include the following sections:
- Dataset overview
  - Description of the data/what the data contains
  - Why it is valuable to other researchers
- UK Biobank application details
  - UK Biobank application ID
  - Dataset name, version, and creation date
  - UK Biobank data release version used to derive the data
- Participant coverage
  - Number of participants included
  - Any inclusion or exclusion criteria
- Data contents and structure
  - Data layout (e.g., long or wide)
  - Brief description of the data files provided
  - Use of instances or arrays
  - Description of any encoded or partially encoded fields and reference to metadata/encodings.csv
- Compute and storage
  - Estimated compute resources used for data generation
  - Storage size and associated costs
- Software
  - Software or tools used, including versions where relevant
- Code availability
  - Where the reproducible code is placed (e.g., link to code repository)
  - If the code is provided on a public repository, please ensure it is documented sufficiently for others to rerun
  - If the exceptional code/ folder is included, note this here and include description for running the code here or in the README within the code/ folder
- Associated publications
  - Citations and DOI links for any related publications (if applicable)
- Known issues or limitations
  - Any caveats or limitations others should be aware of
- Contact details
  - Name of data uploader
  - Name of the principal investigator of the project
  - Licensing terms and publication restrictions
2.4 Declaring data layout and data files
The following must be declared in metadata/return_manifest.csv:
- The data layout (long or wide)
- The data file(s)
Example:
key,value
data_layout,long
data_files,data/derived_data.csv
If multiple data files together represent a single dataset with the same structure, all files must be listed in data_files, separated by semicolons:
key,value
data_layout,long
data_files, data/derived_data_pt1.csv; data/derived_data_pt2.csv
If a dataset cannot be represented using this format, researchers should contact UK Biobank for advice.
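The manifest format described in this section can be read back with a few lines of code, which is a convenient way to confirm that it declares what you intend. This is a minimal sketch, not UK Biobank's own parser: `read_return_manifest` is a hypothetical name, and stripping the whitespace around keys, values and file paths is an assumption about how the file is interpreted.

```python
import csv
import io


def read_return_manifest(manifest_text):
    """Parse return_manifest.csv content into a layout and a list of data files."""
    rows = csv.DictReader(io.StringIO(manifest_text))
    manifest = {
        (row.get("key") or "").strip(): (row.get("value") or "").strip()
        for row in rows
    }
    layout = manifest.get("data_layout")
    if layout not in ("long", "wide"):
        raise ValueError(f"data_layout must be 'long' or 'wide', got {layout!r}")
    # Multiple files are separated by semicolons; surrounding spaces are dropped.
    files = [part.strip() for part in manifest.get("data_files", "").split(";")]
    files = [part for part in files if part]
    if not files:
        raise ValueError("data_files must list at least one file")
    return {"data_layout": layout, "data_files": files}


if __name__ == "__main__":
    example = (
        "key,value\n"
        "data_layout,long\n"
        "data_files, data/derived_data_pt1.csv; data/derived_data_pt2.csv\n"
    )
    print(read_return_manifest(example))
```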
2.5 Data format
Long format
In long format, each row represents a single value for a participant and field. Required columns:
- eid – participant ID
- ins_index – instance index (NA if not applicable)
- arr_index – array index (NA if not applicable)
- field_name – in a standard format (i.e., no spaces and using underscores) and must exactly match field_name values defined in metadata/field_spec.csv
- value – field value
If a column heading (e.g., instance or array) is not applicable, the column must be present and contain NA values.
Example:
eid,ins_index,arr_index,field_name,value
1111111,0,NA,lv_edv,133.5
1111111,0,NA,qc_flag,1
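The long-format rules can be checked mechanically before submission. The sketch below is one possible check, with `check_long_format` as a hypothetical name. It assumes that instance and array indices are either NA or non-negative integers, which this guidance implies but does not state.

```python
import csv
import io

LONG_COLUMNS = ["eid", "ins_index", "arr_index", "field_name", "value"]


def check_long_format(data_csv, known_fields):
    """Return a list of problems found in long-format CSV text."""
    reader = csv.DictReader(io.StringIO(data_csv))
    header = reader.fieldnames or []
    missing = [col for col in LONG_COLUMNS if col not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    # Data rows start on line 2, after the header.
    for line_no, row in enumerate(reader, start=2):
        if row["field_name"] not in known_fields:
            problems.append(f"line {line_no}: unknown field_name {row['field_name']!r}")
        for col in ("ins_index", "arr_index"):
            # Assumption: an index is NA or a non-negative integer.
            if row[col] != "NA" and not row[col].isdigit():
                problems.append(f"line {line_no}: {col} must be NA or an integer")
    return problems


if __name__ == "__main__":
    example = (
        "eid,ins_index,arr_index,field_name,value\n"
        "1111111,0,NA,lv_edv,133.5\n"
        "1111111,0,NA,qc_flag,1\n"
    )
    print(check_long_format(example, {"lv_edv", "qc_flag"}))
```

Here `known_fields` would be the set of field_name values taken from metadata/field_spec.csv.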
Wide format
In wide format, each row represents a participant (or participant/instance/array) and each field is provided as a separate column. Required columns:
- eid – participant ID
- ins_index and/or arr_index columns if applicable
- Column names that exactly match field_name defined in metadata/field_spec.csv
Example:
eid,ins_index,lv_edv,qc_flag
1111111,0,133.5,1
1111112,0,138.2,1
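The two layouts carry the same information, so a wide table can be reshaped into the long layout if that is easier to validate. The following sketch uses only the Python standard library. `wide_to_long` is a hypothetical helper, and writing NA for an index column that is absent from the wide file follows the long-format rule that non-applicable columns must be present and contain NA.

```python
import csv
import io

INDEX_COLUMNS = ("eid", "ins_index", "arr_index")


def wide_to_long(wide_csv):
    """Convert wide-format CSV text to long-format CSV text."""
    reader = csv.DictReader(io.StringIO(wide_csv))
    # Every column that is not an index column is treated as a field.
    field_columns = [col for col in reader.fieldnames if col not in INDEX_COLUMNS]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["eid", "ins_index", "arr_index", "field_name", "value"])
    for row in reader:
        for field in field_columns:
            writer.writerow([
                row["eid"],
                row.get("ins_index", "NA"),  # NA when the column is absent
                row.get("arr_index", "NA"),
                field,
                row[field],
            ])
    return out.getvalue()


if __name__ == "__main__":
    example = "eid,ins_index,lv_edv,qc_flag\n1111111,0,133.5,1\n1111112,0,138.2,1\n"
    print(wide_to_long(example))
```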
2.6 Field specification
File: metadata/field_spec.csv
Defines metadata for each returned field.
Required columns:
- field_name
- value_type – one of: int, real, string, date, datetime, categorical
- encoding_id (must be populated if the field uses encoded or partially encoded values, otherwise leave as blank)
- title
- description
Optional columns:
- categorical_type (only applicable when value_type = categorical). Indicates whether a categorical field can have one or more than one categorical value for the same participant (within a single field/instance/array context).
  - Single is where a field has one categorical value per participant (or per participant/instance/array). Example fields: sex, banana intake.
  - Multi is where a field may have more than one categorical value for the same participant (or per participant/instance/array). Example fields: treatment/medication code, qualifications.
  - You may find the following lists of existing fields useful when determining the appropriate categorical_type for your returned variables: categorical (single), categorical (multi)
- units
Example:
field_name,value_type,encoding_id,title,description,categorical_type,units
medications,categorical,meds_v1,Medications taken,List of medications,multi,
lv_edv,real,,LV EDV,Left ventricular end diastolic volume,,mL
Note: Leave optional or non-applicable cells blank.
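The field specification rules lend themselves to an automated check. The sketch below is illustrative rather than exhaustive: `check_field_spec` is a hypothetical name, and accepting categorical_type values case-insensitively as single or multi is an assumption based on the example row above.

```python
import csv
import io

REQUIRED_COLUMNS = ["field_name", "value_type", "encoding_id", "title", "description"]
VALUE_TYPES = {"int", "real", "string", "date", "datetime", "categorical"}


def check_field_spec(spec_csv):
    """Return a list of problems found in field_spec.csv content."""
    reader = csv.DictReader(io.StringIO(spec_csv))
    header = reader.fieldnames or []
    problems = [f"missing column: {col}" for col in REQUIRED_COLUMNS if col not in header]
    if problems:
        return problems
    for row in reader:
        name = row["field_name"]
        if not name or " " in name:
            problems.append(f"{name!r}: field_name must be non-empty with no spaces")
        if row["value_type"] not in VALUE_TYPES:
            problems.append(f"{name!r}: unknown value_type {row['value_type']!r}")
        # categorical_type is optional and only applies to categorical fields.
        cat_type = (row.get("categorical_type") or "").strip().lower()
        if cat_type and row["value_type"] != "categorical":
            problems.append(f"{name!r}: categorical_type set on a non-categorical field")
        if cat_type and cat_type not in ("single", "multi"):
            problems.append(f"{name!r}: categorical_type must be single or multi")
    return problems


if __name__ == "__main__":
    example = (
        "field_name,value_type,encoding_id,title,description,categorical_type,units\n"
        "medications,categorical,meds_v1,Medications taken,List of medications,multi,\n"
        "lv_edv,real,,LV EDV,Left ventricular end diastolic volume,,mL\n"
    )
    print(check_field_spec(example))
```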
2.7 Encoded and partially encoded values
If a field uses encoded values (including partial encodings, where some values are literal, but others represent special codes such as “unknown”), this must be stated using encoding_id in metadata/field_spec.csv.
Where possible, researchers should reuse an existing UK Biobank encoding scheme rather than defining a new one.
- All encoded or partially encoded fields must reference an encoding_id.
- encoding_id identifies a specific encoding scheme.
- The same encoding_id must be reused across fields where codes have identical meanings.
- Encoded values must appear in the data as codes, not human-readable meanings.
For examples of existing encoding schemes, please see the UK Biobank schema page, including:
- Encoding dictionaries, which list the available encoding schemes
- Values of encodings, which provide the individual code values and their meanings for each encoding (for example, Values for simple integer encodings).
encoding_id values should follow the format:
<description>_v<version>
Examples:
meds_v1
smoking_status_v1
Example of partial encoding:
Field 3166 uses standard datetime values but also includes special coded values defined by Data-Coding 439. In this case, the special codes must be captured using encoding_id, and the corresponding code meanings included in metadata/encodings.csv.
2.8 Encodings file (required if any encodings are used)
File: metadata/encodings.csv
This file must be provided if any fields use encoded or partially encoded values.
Required columns:
- encoding_id
- value
- meaning
Example:
encoding_id,value,meaning
meds_v1,1,Paracetamol
meds_v1,2,Ibuprofen
meds_v1,-1,Prefer not to answer
smoking_status_v1,0,Current
smoking_status_v1,1,Never
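Because field_spec.csv and encodings.csv are linked through encoding_id, it can help to compare the two files before submission. The sketch below is a hypothetical helper, `compare_encoding_ids`, that reports IDs referenced in the field specification but not defined in the encodings file, and the reverse. It assumes every referenced encoding is defined in encodings.csv. How a reused existing UK Biobank encoding should be declared is not covered by this sketch.

```python
import csv
import io


def _column_values(csv_text, column):
    """Collect the non-blank values of one column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row.get(column) or "").strip() for row in reader} - {""}


def compare_encoding_ids(field_spec_csv, encodings_csv):
    """Compare encoding_id values used in field_spec.csv with encodings.csv."""
    used = _column_values(field_spec_csv, "encoding_id")
    defined = _column_values(encodings_csv, "encoding_id")
    return {
        "undefined": sorted(used - defined),  # referenced but never defined
        "unused": sorted(defined - used),     # defined but never referenced
    }


if __name__ == "__main__":
    spec = (
        "field_name,value_type,encoding_id,title,description\n"
        "medications,categorical,meds_v1,Medications taken,List of medications\n"
        "lv_edv,real,,LV EDV,Left ventricular end diastolic volume\n"
    )
    encodings = (
        "encoding_id,value,meaning\n"
        "meds_v1,1,Paracetamol\n"
        "smoking_status_v1,0,Current\n"
    )
    print(compare_encoding_ids(spec, encodings))
```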
2.9 Checksums
If the return includes bulk files, a file named checksums.md5 must be included, listing the MD5 checksum of each bulk file together with its path relative to the return folder. Checksums are not required for tabular data files (e.g., CSV files with data on all participants).
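A checksum file of this kind can be generated with the Python standard library. The following is a sketch under stated assumptions, not a required tool: the function names are hypothetical, bulk files are assumed to be the .zip files directly inside data/, checksums.md5 is assumed to sit at the top level of the return folder, and each line uses the two-space separator written by the md5sum utility.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(return_folder, pattern="data/*.zip"):
    """Write checksums.md5 with one '<md5>  <relative path>' line per bulk file."""
    root = Path(return_folder)
    lines = []
    for path in sorted(root.glob(pattern)):
        # Paths are written relative to the return folder, with forward slashes.
        relative = path.relative_to(root).as_posix()
        lines.append(f"{md5_of_file(path)}  {relative}")
    (root / "checksums.md5").write_text("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    for line in write_checksums("UKB_40541_cardiac_idps_v1_2025-12-15"):
        print(line)
```

Reading files in chunks keeps memory use flat even for large imaging archives.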
Example:
d41d8cd98f00b204e9800998ecf8427e data/1111111_t1_brain_struct_2_0.zip
0cc175b9c0f1b6a831c399e269772661 data/1111112_t1_brain_struct_2_0.zip
900150983cd24fb0d6963f7d28e17f72 data/1111113_t1_brain_struct_2_0.zip
Questions and support
A downloadable checklist for returning data via the UKB-RAP is available here: UKB-RAP Return Checklist.
For questions about returning derived data, please contact the UK Biobank team via a support ticket.
If any required file or resource is missing or cannot be provided, please include an explanation in your ticket so we can advise on how best to proceed.