Repository Management Best Practices for Sensitive Data

Tailored guidance for researchers working with UK Biobank datasets

Managing code repositories when handling sensitive data requires more than technical know-how; it demands a proactive approach to privacy, collaboration, and version control. This guide outlines UK Biobank’s recommended practices for researchers using GitHub, with a focus on protecting participant data throughout the research lifecycle.

 

Good practice guidelines

These guidelines set out good practice for researchers’ use of online code repositories. They help researchers maintain data security within their own organisations and research groups, and ensure that UK Biobank data is not inadvertently published or otherwise made available outside UK Biobank’s normal access procedures.

Provision of roles

Roles and permissions should be assigned such that reviews are required ahead of creating public uploads.


Write permissions (or higher) in repositories should only be provided to those who have undergone appropriate training in data security and standards.

Organisation of code and data

To reduce the risk of inadvertent sharing of results, input data, or other private information on GitHub, a folder structure that separates code from input and output files is strongly recommended.


Other private information, such as IP addresses, passwords, or account credentials, should also be kept outside the folders used for code. This reduces the risk of accidentally committing information that should not be publicly shared to a Git repository.
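
One way to keep credentials out of code folders is to read them at run time from an environment variable or from a file stored outside the repository. The sketch below illustrates the idea; the variable name UKB_API_TOKEN and the file location ~/.config/ukb/token are hypothetical and chosen only for this example.

```python
import os
from pathlib import Path

# Hypothetical names, for illustration only.
TOKEN_ENV_VAR = "UKB_API_TOKEN"
TOKEN_FILE = Path.home() / ".config" / "ukb" / "token"


def load_token(env=os.environ, token_file=TOKEN_FILE):
    """Return a secret from the environment, or from a file outside the repo."""
    value = env.get(TOKEN_ENV_VAR)
    if value:
        return value.strip()
    if Path(token_file).is_file():
        return Path(token_file).read_text(encoding="utf-8").strip()
    raise RuntimeError(
        f"No credential found: set {TOKEN_ENV_VAR} or create {token_file}"
    )
```

Because the secret never appears in a tracked file, it cannot be committed by accident, whichever files are staged.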

 

Project Initialisation

Start with a private repository

While UK Biobank supports open science, we recommend beginning with a private GitHub repository. This protects sensitive data during early development and allows for a controlled transition to public visibility when you are ready to publish.

How to change repository visibility

  1. Go to Settings
  2. Scroll to the Danger Zone
  3. Click Change visibility
  4. Confirm with Change to public

Collaborator access

  • Add collaborators manually, and add only those listed in your UK Biobank application.
  • Apply the least privilege principle: assign only the permissions necessary for each role. Avoid making everyone an administrator.

Data structure and safety

  • Input data: Store outside the repository directory to prevent accidental commits.
  • Output data:
    • Use a dedicated data/ folder inside the repo, or store externally.
    • If stored inside the repo, configure .gitignore to exclude it.
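
A small check can confirm that an input file really does live outside the repository directory before an analysis runs. This is a minimal sketch using only the standard library; the function name is_inside_repo and the example paths are illustrative.

```python
from pathlib import Path


def is_inside_repo(path, repo_root):
    """Return True if `path` resolves to a location inside `repo_root`."""
    path = Path(path).resolve()
    repo_root = Path(repo_root).resolve()
    return path == repo_root or repo_root in path.parents


if __name__ == "__main__":
    # Illustrative paths: input data kept in a sibling folder of the repo.
    print(is_inside_repo("/home/me/project/data/in.csv", "/home/me/project"))
    print(is_inside_repo("/home/me/ukb_input/in.csv", "/home/me/project"))
```

A script could refuse to start when its input path is inside the repository, which turns the guideline into an enforced rule.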

Setting up your .gitignore file

Use .gitignore to prevent sensitive files from being committed. Example patterns to ignore:

  • The entire data/ folder
  • Common output file types (e.g., .csv, .tsv)
  • Files prefixed with results_

Review and update .gitignore regularly as your project evolves.
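
The example patterns above translate into .gitignore syntax as data/, *.csv, *.tsv, and results_*. The sketch below appends any of these that are missing from a repository's .gitignore; the helper name ensure_gitignore is an assumption for illustration, and the pattern list should be adapted to each project.

```python
from pathlib import Path

# The example patterns from this guide, written in .gitignore syntax.
PATTERNS = [
    "data/",      # the entire data/ folder
    "*.csv",      # common output file types
    "*.tsv",
    "results_*",  # files prefixed with results_
]


def ensure_gitignore(repo_root, patterns=PATTERNS):
    """Append missing patterns to <repo_root>/.gitignore; return those added."""
    gitignore = Path(repo_root) / ".gitignore"
    existing = []
    if gitignore.exists():
        existing = gitignore.read_text(encoding="utf-8").splitlines()
    present = {line.strip() for line in existing}
    missing = [p for p in patterns if p not in present]
    if missing:
        gitignore.write_text("\n".join(existing + missing) + "\n", encoding="utf-8")
    return missing
```

Note that .gitignore only affects untracked files. A file that has already been committed stays tracked until it is explicitly removed from the index.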

Project Development

Commit checklist

Before committing, ask yourself:

  • Have I reviewed all staged changes?
  • Why am I committing this code?
  • Have I created any output files?
  • Have I exposed any sensitive data?
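
Part of this checklist can be automated. The sketch below lists the files currently staged (using git diff --cached --name-only) and flags any that sit under a data folder or match output-like name patterns. The pattern list and function names are illustrative assumptions, and the check complements a manual review rather than replacing it.

```python
import subprocess
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Illustrative patterns only; adapt them to your own project.
RISKY_PATTERNS = ["*.csv", "*.tsv", "results_*"]


def staged_files(repo_root="."):
    """Return the paths currently staged for commit (requires git on PATH)."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        cwd=repo_root, capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def flag_risky(paths, patterns=RISKY_PATTERNS):
    """Return paths under a data/ folder or whose file name matches a pattern."""
    flagged = []
    for p in paths:
        parts = PurePosixPath(p).parts
        name = parts[-1] if parts else ""
        in_data_dir = "data" in parts[:-1]
        if in_data_dir or any(fnmatchcase(name, pat) for pat in patterns):
            flagged.append(p)
    return flagged
```

Calling flag_risky(staged_files()) just before a commit gives a short list of files worth a second look.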

Tips for safe commits

  • Keep .gitignore updated
  • Avoid git add -A; use git add -p for precision
  • Clear Jupyter notebook outputs before committing
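
Notebook outputs can embed rows of data, so clearing them matters. A notebook file is JSON, and a minimal sketch for clearing code-cell outputs with the standard library (assuming the common nbformat 4 layout, where each code cell has outputs and execution_count fields) might look like this:

```python
import json
from pathlib import Path


def strip_outputs(notebook):
    """Clear outputs and execution counts from a notebook dict (nbformat 4)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def strip_file(path):
    """Rewrite the .ipynb file at `path` in place with outputs cleared."""
    path = Path(path)
    nb = json.loads(path.read_text(encoding="utf-8"))
    cleaned = json.dumps(strip_outputs(nb), indent=1, ensure_ascii=False)
    path.write_text(cleaned + "\n", encoding="utf-8")
```

This sketch does not touch cell metadata or attachments, so it is still worth opening the notebook and checking it before committing.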

Why it matters, even in private repositories

Git records every change, including files that are later deleted. Sensitive data that has been committed remains in the commit history even after the file is removed. Treat your private repository as if it were public from day one.
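
To see this in practice, the sketch below lists every path that appears in any commit on any branch, using git log --all --name-only. A file deleted in a later commit still shows up in the result. The function names are illustrative.

```python
import subprocess


def parse_name_only(log_output):
    """Collect unique paths from `git log --name-only --pretty=format:` output."""
    return sorted({line.strip() for line in log_output.splitlines() if line.strip()})


def paths_ever_committed(repo_root="."):
    """Return every path touched by any commit on any branch (requires git)."""
    out = subprocess.run(
        ["git", "log", "--all", "--name-only", "--pretty=format:"],
        cwd=repo_root, capture_output=True, text=True, check=True,
    ).stdout
    return parse_name_only(out)
```

Comparing this list with the files currently in the working tree shows which files exist only in history.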

Project Completion

Audit before publishing

Use UK Biobank’s Git Audit tool to scan your repository’s commit history for sensitive files. This Python-based tool generates an audit report to help identify potential risks.

Alternative publishing strategy

Instead of changing visibility, consider:

  • Creating a new public repository
  • Copying clean files from the private repo
  • Excluding the .git folder to avoid transferring commit history

This approach is ideal for one-time publication. For ongoing development, maintaining a single clean repository is more efficient.
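
The copy step can be sketched with shutil.copytree, which accepts an ignore callback. The exclusion list below is an assumption for illustration: besides .git, it also leaves out data folders and output-like files, because files ignored by .gitignore are still present on disk and would otherwise be copied.

```python
import shutil
from pathlib import Path

# Illustrative exclusions; names are matched at every directory level.
EXCLUDE = (".git", "data", "*.csv", "*.tsv", "results_*")


def export_clean_copy(private_repo, public_dir, exclude=EXCLUDE):
    """Copy the working tree to a new folder, without .git or excluded files."""
    public_dir = Path(public_dir)
    shutil.copytree(
        Path(private_repo),
        public_dir,  # must not exist yet
        ignore=shutil.ignore_patterns(*exclude),
    )
    return public_dir
```

The resulting folder has no commit history, so it can be initialised as a fresh repository. Review its contents before the first push.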

Project Maintenance and Continuation

Once your project is public, every change is visible. Continue applying best practices:

  • Review commits carefully
  • Protect sensitive data
  • Maintain a clean and intentional repository structure

How Will UKB Monitor Code Repositories?

When your work goes public on a code repository, our automated tools search it within 24 hours to check for the presence of pseudonymised participant data. Each day, they search the major places where researchers publish code and data (including GitHub, GitLab, Gitee, Hugging Face, and Zenodo) for anything related to UK Biobank.

When the tools find a repository that requires review, the review runs as a two-stage process:

  • First, a fast scan looks through the entire history of every file (not just the current version) for tell-tale patterns such as external ID (EID) numbers.
  • Then, our multi-agent AI solution reads the repository the way a human reviewer would: it understands the README, figures out what the project is actually doing, and brings in specialist tools to identify different file types (spreadsheets, notebooks, genetic and imaging files, code, archives, and so on) to judge whether real participant data is genuinely exposed.

Each repository is then given a clear rating so that anything sensitive can be flagged and acted on quickly by our internal teams through standardised take-down processes.
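
Researchers can run a rough pattern check of their own before publishing. The sketch below is not UK Biobank's scanner, whose patterns are not described in this guide. It assumes, purely for illustration, that identifiers appear under a column named eid and as standalone 7-digit numbers.

```python
import re

# Assumptions for illustration only: an "eid" column header and standalone
# 7-digit numbers. These are not the patterns used by UK Biobank's tools.
EID_HEADER = re.compile(r"\beid\b", re.IGNORECASE)
SEVEN_DIGITS = re.compile(r"(?<!\d)\d{7}(?!\d)")


def scan_text(text, min_ids=3):
    """Return a small report on identifier-like patterns found in `text`."""
    has_header = bool(EID_HEADER.search(text))
    id_count = len(SEVEN_DIGITS.findall(text))
    return {
        "has_eid_header": has_header,
        "id_like_count": id_count,
        "needs_review": has_header and id_count >= min_ids,
    }
```

A check like this produces false positives and false negatives, so a flagged file is a prompt for manual inspection and a clean result is not proof of safety.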

 
