This article introduces Jupyter Notebooks on the UK Biobank Research Analysis Platform (UKB-RAP) for exploring and analysing the uniquely rich UK Biobank dataset. It covers key functionality of the UKB-RAP, including the features of Jupyter notebooks, using Table Exporter to export cohorts as TSV files, and loading and analysing data files.
What is JupyterLab?
JupyterLab is a web-based user interface (UI) for working with Jupyter notebooks. Jupyter notebooks are documents that combine code, text, equations, images, and interactive visualizations. This tool is commonly used for data analysis in Python.
Running Local Analysis vs. Cloud Analysis
| Analysis type | Description |
| --- | --- |
| Local analysis | Everything is stored and analysed on your local computer |
| Cloud analysis | Analysis runs on a temporary computer in the cloud; data must be transferred between this temporary computer and permanent storage |
The UKB-RAP has two types of storage:
- Research Analysis Project Storage: permanent storage attached to your project. Files saved here persist between sessions.
- JupyterLab Storage: temporary storage on a cloud instance. Everything in it will disappear once the session ends.
The two JupyterLab workflows
Before you can begin analysing UKB data in JupyterLab, you first need to create a file containing your data of interest and import it into JupyterLab in the right format. There are two ways to do this, shown in the workflows below.
Workflow without Spark
- Click on the dataset in your project directory to open the Cohort Browser.
- Explore data by clicking on "Add Tile" and using Field Explorer.
- Create cohort using filters and save it to your project.
- Go to the Tools Library, select Table Exporter, and click Run. Then select the project to which you saved your cohort.
- Select the cohort you created.
- Specify the file format, coding options, and header style.
- Export the cohort as a TSV/CSV file. This launches a Table Exporter job, which will show ‘Done’ once the file is ready.
- Load and analyse data in JupyterLab (see Launching JupyterLab)
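Once the exported file is in your project, a typical first step in a notebook is to load it with pandas. The sketch below is illustrative: the filename `data_participant.tsv` and the field names (`eid`, `p21022` for age at recruitment, `p31` for sex) are assumptions standing in for whatever your own export contains.

```python
import pandas as pd
from io import StringIO

# In practice you would read the file exported by Table Exporter, e.g.:
#   df = pd.read_csv("data_participant.tsv", sep="\t")
# Here a tiny in-memory sample (with hypothetical field names) stands in
# for the exported cohort so the example is self-contained.
sample = StringIO(
    "eid\tp21022\tp31\n"
    "1000001\t55\t0\n"
    "1000002\t61\t1\n"
)
df = pd.read_csv(sample, sep="\t")

print(df.shape)             # number of participants x number of fields
print(df["p21022"].mean())  # e.g. mean age at recruitment
```

From here you can filter, summarise, and plot the cohort as you would any pandas DataFrame.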
Workflow with Spark
For examples of how to use dxdata to export participant data into tabular formats, please see the UKB GitHub, for example the A103 notebook.
Launching JupyterLab
- The JupyterLab workstation can be found under the Tools tab by clicking on JupyterLab.
- Launch a JupyterLab workstation by clicking on the button "New JupyterLab."
- Then, specify the project you wish to have access to in your JupyterLab instance. This selection will determine the files you'll have access to.
- Specify several other parameters to configure the instance or remote worker:
Choosing between Single Node and Spark Cluster JupyterLab

| Option | When to use |
| --- | --- |
| Single node JupyterLab | Use when you have already extracted your data into a file and do not need distributed computing, or when you want to use Stata. Suitable for running Python, R, or Stata code. |
| Spark cluster JupyterLab | Use when you want to query the dataset directly, run complex queries, or work with large datasets. A Spark cluster provides distributed computing and supports Spark-based tools such as dxdata, Hail, and Glow. |
| Parameter | Description |
| --- | --- |
| Instance type | Determines the number of GPUs or CPUs, the amount of memory, and the amount of storage. Read more about instance types on the DNAnexus GitHub. |
| Duration | The amount of time you expect to run this instance (the default is four hours, but it can be extended). |
| Feature | Determines which pre-compiled environment, or set of libraries and packages, is available. Options include machine learning (ML), image preprocessing, Python, and R. |
Start Environment
- When you click "Start Environment", the instance will take some time to start up. You can check the status of your instance from the Monitor tab.
- You can access files on the platform by navigating to the DNAnexus tab in the left sidebar, or from the terminal using `dx` commands.
- Alternatively, you can access data using dxFUSE or `dx download`.
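For example, assuming a file named `data_participant.tsv` at the root of your project (the filename and the dxFUSE mount point shown here are illustrative; check the paths on your own instance):

```shell
# Copy a file from project storage into the JupyterLab instance:
dx download /data_participant.tsv

# Or read it in place via the dxFUSE mount of project storage
# (commonly mounted read-only at /mnt/project on UKB-RAP instances):
head /mnt/project/data_participant.tsv
```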
- Once you have finished analysing your data, make sure to upload results from your temporary JupyterLab storage back to your project storage using `dx upload`. This ensures that you do not lose your data once your instance has ended. For example:
dx upload "data_participant.tsv"
- Additionally, to save your software dependencies in a reusable form, you can create a snapshot and save the image to your project. This makes your analysis more reproducible and will save you time in the future. The snapshot is saved in a “.Notebook_snapshots” folder in your project directory.
- The next time you start up a JupyterLab instance, just remember to select it in the ‘snapshot’ option before pressing Start Environment.
- The instance will automatically terminate after the set time duration, but you can terminate it earlier via the monitor tab by clicking on your JupyterLab executable and selecting “Terminate”.
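Termination can also be done from the terminal with the `dx` CLI. The job name and ID below are placeholders; list your own jobs first to find the right one.

```shell
# List your recent jobs (the JupyterLab session appears as a job):
dx find jobs --brief

# Terminate a specific job by its ID (placeholder shown here):
dx terminate job-XXXXXXXXXXXXXXXXXXXXXXXX
```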
Please check out our UKB GitHub for more information on how you can use JupyterLab and RStudio to explore UKB data. New to GitHub? Learn more about notebooks.