Guidance for using code repositories when working with UK Biobank data

Writing, sharing, and updating code is an important part of research. This is also true when working with UK Biobank data.

UK Biobank supports open research. Sharing code helps other researchers understand your work and repeat your analysis. Many researchers use websites like GitHub, GitLab, or Bitbucket to store and share their code. These websites also help you keep track of changes to your code.

If you are a registered UK Biobank researcher, you must protect the data you use. This is part of your agreement with UK Biobank (called the Material Transfer Agreement or MTA). When you write and share code, you must make sure that it does not include any private or sensitive information. For example, your code should never contain participant data, login details, or anything that could identify individuals.

Please remember: you have a duty to protect UK Biobank data, as explained in your MTA.

To help you do this, UK Biobank has created a short online course. It will become mandatory for all registered researchers to complete this course.

Important Things to Remember When Sharing Code

1. Choose the right place to share your code

Understand who can see your code. Is it just your team, your organisation, or the public?

2. Do not include sensitive data in your code

Never include participant data, passwords, or login details in your code. For example, do not include eIDs or other personal information.

3. Check your code before sharing

You can check your code by reading it carefully. Tools can also help: a .gitignore file keeps data files and login details out of your repository, and Git hooks can stop a commit that contains sensitive information.

4. Check again before you publish your code

A second check helps make sure you are not sharing anything by mistake.
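
As an illustration of the Git hook idea in step 3, here is a minimal sketch of a pre-commit check written in Python. It is not an official UK Biobank tool. It assumes that participant IDs look like standalone 7-digit numbers and that login details appear as simple name = "value" assignments; both patterns, and the function names, are made up for this example and should be adjusted to your own project. Expect false alarms, because any 7-digit number will match.

```python
import re
import subprocess

# Assumption: participant IDs look like standalone 7-digit numbers.
EID_PATTERN = re.compile(r"(?<!\d)[1-9]\d{6}(?!\d)")

# Assumption: credentials are written as name = "value" or name: "value".
SECRET_PATTERN = re.compile(
    r"(password|passwd|token|api_key|secret)\s*[:=]\s*['\"][^'\"]+['\"]",
    re.IGNORECASE,
)


def find_suspect_lines(text):
    """Return (line_number, reason) pairs for lines that need a manual look."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if EID_PATTERN.search(line):
            hits.append((number, "possible participant ID"))
        if SECRET_PATTERN.search(line):
            hits.append((number, "possible hard-coded credential"))
    return hits


def staged_files():
    """List files staged for commit (added, copied, or modified)."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    )
    return [name for name in result.stdout.splitlines() if name]


def main():
    """Check the staged version of each file; return 1 if anything is found."""
    problems = 0
    for path in staged_files():
        # "git show :path" prints the staged content, not the working copy.
        shown = subprocess.run(
            ["git", "show", f":{path}"],
            capture_output=True, text=True, errors="ignore",
        )
        for number, reason in find_suspect_lines(shown.stdout):
            print(f"{path}:{number}: {reason}")
            problems += 1
    # A non-zero return value, passed to sys.exit, makes Git abort the commit.
    return 1 if problems else 0
```

To try it, you could save the script as .git/hooks/pre-commit, make it executable, and end it with a call such as sys.exit(main()) after importing sys. A check like this supports careful reading of your code; it does not replace it.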

If You Share Data by Mistake

If you accidentally share UK Biobank data, contact UK Biobank immediately. The team will help you fix the problem and protect the data.

If You Do Not Report a Problem

If you do not report a possible data breach, your access to UK Biobank data may be stopped. This is because it breaks the rules of your agreement (MTA). Not reporting a problem could allow others to see data they should not have access to.

How Will UKB Monitor This?

When your work becomes public on a code repository, our automated tools search it within 24 hours for pseudonymised participant data. Each day, the tools search the major places where researchers publish code and data (including GitHub, GitLab, Gitee, Hugging Face, and Zenodo) for anything related to UK Biobank.

Repositories that need review go through a two-stage process:

First, a fast scan looks through the entire history of every file, not just the current version, for tell-tale patterns such as external ID (EID) numbers.
Then, our multi-agent AI solution reads the repository the way a human reviewer would: it understands the README, figures out what the project is actually doing, and brings in specialist tools to identify different file types (spreadsheets, notebooks, genetic and imaging files, code, archives, and so on) to judge whether real participant data is genuinely exposed.
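
Because the first stage looks at the whole history, deleting a file in a later commit does not remove it from view. You can run a similar check on your own repository before you publish. The sketch below is not UK Biobank's monitoring tool; it is a minimal example that assumes participant IDs look like standalone 7-digit numbers, and the function names are made up for illustration. It reads every change ever committed and reports added lines that match the pattern. It checks file contents only, not commit messages.

```python
import re
import subprocess

# Assumption: participant IDs look like standalone 7-digit numbers.
EID_PATTERN = re.compile(r"(?<!\d)[1-9]\d{6}(?!\d)")


def scan_patch_text(patch_text):
    """Return (commit, added_line) pairs where an added line matches the pattern."""
    hits = []
    commit = None
    for line in patch_text.splitlines():
        if line.startswith("commit "):
            # Header line written by the --format option used in scan_history.
            commit = line.split()[1]
        elif line.startswith("+") and not line.startswith("+++"):
            # Lines starting with "+" were added in this commit;
            # "+++" lines only name the file, so they are skipped.
            if EID_PATTERN.search(line[1:]):
                hits.append((commit, line[1:].strip()))
    return hits


def scan_history(repo_path="."):
    """Scan the patches of every commit on every branch of a local repository."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--all", "-p",
         "--no-color", "--format=commit %H"],
        capture_output=True, text=True, errors="ignore", check=True,
    )
    return scan_patch_text(result.stdout)
```

Calling scan_history() from inside a repository returns a list of matches to review by hand. Any 7-digit number will match, so an empty list is reassuring but a non-empty list is not proof that data was exposed.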

Each repository is then given a clear rating so that anything sensitive can be flagged and acted on quickly by our internal teams through standardised take-down processes.

Help and Support

UK Biobank offers several resources to help you:

Complete the online course about using code with UK Biobank data
Read our help article for tips on using Git-based code repositories
Use our checklist to make sure your code is safe to share
Look at our Git book for examples of scripts that check your code for UK Biobank data
Join the UK Biobank community forum to ask questions and share advice

If you need more help, you can submit a support ticket. You can also read our policy on using code repositories for more information.
