Basic UKB usage - accessing data, phenotypes and interactive analysis

So, I have been working with UKB data for several years in HPC computing environments, but now, unfortunately, have had to make the jump to using the RAP for the first time.

With this in mind, I have been going through the introductory YouTube videos and documentation to try and get to grips with using the RAP, but frankly have not been able to find answers to the questions/issues I have. These are all super-basic usage questions, so I feel it's possible they're in the docs somewhere and I just can't find them. So, I'd appreciate either guidance on how to solve these issues or a pointer to where I can find the answers!

 

  1. While a number of the tutorials and videos in the documentation cover submitting jobs/workflows/analyses, I couldn't find any information about "interactive" sessions for testing/running code, i.e. for when you are first writing analysis code and need to read in files/data interactively to check that the analysis works. Is this something that is possible on the RAP? Is there an environment where you can open R, read a file, install packages and test whether a command will work?
  2. I have seen numerous tutorials and videos which mention the phenotype browsers and similar ways to view aspects of the data on the RAP; however, it remains unclear to me how one actually obtains access to phenotypes and genotypes for conducting an analysis. Continuing the previous example: if I want to select a set of phenotypes (age, sex, PCs) which previously we would download and extract, how do I now get access to a file containing these phenotypes?
  3. Likewise for question 2, I'd like to know whether there's an equivalent for record-level files (GP/hospital records etc.).
  4. Perhaps as importantly as questions 1-3, I am very used to testing/coding and then submitting analyses via the command line. The Command Line Quick Set-up guide, however, mostly seemed focused on downloading some other external data and running some kind of existing pipeline, which again left me confused about how to begin scripting something in R to read/analyse/run my own project. Can I use the command line on my laptop to open an interactive session on the RAP, open R and then read files? Can I even use the command line to submit a job/script to run an analysis?

 

Again, I suspect many of these are super basic questions and I feel a bit foolish asking, but I really could not find answers for these anywhere else so any help would be appreciated!

Comments

7 comments

  • Comment author
    Kendall King

    Hi Sebastian,

    If you are interested in exporting only a few phenotypic variables at a time (< 50), I suggest using the dx extract_dataset bash command described here: https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/accessing-data/accessing-phenotypic-data

    # You'll need Python installed; log in to DNAnexus with your username/password:
    dx login

    # Create 'fields.txt' using the 'data_dictionary' output from UKBB for 'name' and 'title'.
    # Save this file in the same folder as the desired data output.

    #Ensure the format is as below:
    participant.eid
    participant.p46_i0
    participant.p21001_i0
    participant.p21022

    #Save field data output in this file:
    C:\file-location\fields.txt

    # Save the exported file as 'exported_data.txt' in the output folder for the correct project/record ID in DNAnexus (here under the 'Tas1r2 Characterization' project):
    dx extract_dataset project-XXXX:record-XXX --fields-file "C:\file-location\fields.txt" --delimiter "," --output "C:\file-location\exported_data.txt"

    You can find your project and record IDs in the RAP once you have started a project and your dataset record is available (this takes a little while after creating a project).

    #Export specific SNP genotype calls
    # Create a 'ukbb_genotype_query' .json file containing SNP rsIDs in the format 'chromosome#_position_ref_alt' (ensure no extra spaces/returns after the final bracket; UTF-8 encoding):
    {
     "allele_id": ["1_18854521_G_C", "1_18854899_T_C"]
    }

    #Run export of SNPs using extract_assay germline command and save file as genotype_output.tsv:
    (dxenv) C:\Users\X > dx extract_assay germline appXXXX.dataset --retrieve-genotype "C:\file-location\ukbb_genotype_query.json" -o "C:\file-location\genotype_output.tsv"


    I hope this is helpful. I was very confused when getting started, and found the cohort browser slow and cumbersome for exporting more than a few phenotypes.

  • Comment author
    Richard Karlsson Linner
    • Edited

    I think my initial experience was similar, until I realised that for my style of analysis I should just use the JupyterLab bash terminal and continue working as I did on our HPC server under the data-download model. Instead of using the old command-line utilities to pull data from the enc_ukb files, we now use the dx extract tools, followed by dx download to get the data onto the worker-node instance. For various reasons I won't use Jupyter Notebooks for my data analysis, but instead execute R/Python scripts from the terminal.
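    The terminal workflow described above can be sketched roughly as below. This is my hedged sketch, not Richard's exact commands: the project/record IDs and field IDs are placeholders, my_analysis.R is a hypothetical script, and the dx steps only execute where the dx toolkit is actually installed (as it is in a RAP JupyterLab terminal):

```shell
# Build a list of fields to extract, in entity.field_name form
# (placeholder field IDs; look up the real ones in the data dictionary).
printf '%s\n' \
    participant.eid \
    participant.p31 \
    participant.p21022 > fields.txt

# Pull the phenotypes and hand them to an R script. Guarded so the sketch
# is harmless on a machine without the dx toolkit or R installed.
if command -v dx >/dev/null 2>&1; then
    dx extract_dataset project-XXXX:record-XXXX \
        --fields-file fields.txt \
        --delimiter "," \
        --output pheno.csv
    command -v Rscript >/dev/null 2>&1 && Rscript my_analysis.R pheno.csv
fi
```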

  • Comment author
    Alejandra Rodriguez Sosa

    I believe that I should also start working in JupyterLab to build my cohort. However, using the Cohort Browser does not cost me money, whereas running an interactive JupyterLab session to play with the data starts the billing count, and that stresses me out (I'm just a PhD researcher and I want to build a small cohort for a VERY simple ML model, so I'm counting on the 40 complimentary pounds from UKBB).

    I have been trying to explore my cohort using the Browser first, to get acquainted with my available data, but more often than not I get timeout errors and the tiles won't load (even though I am only trying to load 3-4 tiles max). I am working with ~8,000 patients maximum. Given the 500k size of the whole cohort, I did not expect it to work so poorly!!

    I have watched the tutorial videos and I cannot find any written documentation that says whether I can explore the cohort without running a paid JupyterLab session. Are you simply accepting the cost of using the terminal?

  • Comment author
    Richard Karlsson Linner

    Personally, so far I don't see a strong reason for making cohorts. Instead, I use table-exporter directly on the .dataset object to extract the set of fields I need (e.g. birth year, BMI, date of assessment), which I typically do for all 500k individuals. Sometimes I pull only a few fields; for a project-wide file I often extract a few thousand.
     

    To do this, the key file you need to create is the list of fields to give table-exporter as input. You can create this file locally using the data dictionary dump.
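    One hedged way to build that field list locally is to filter the data dictionary dump with standard shell tools. The CSV below is a tiny stand-in for the real dump (the real file has more columns and thousands of rows), and the titles matched are just illustrative examples:

```shell
# Tiny stand-in for the real data dictionary dump from the RAP.
cat > data_dictionary.csv <<'EOF'
entity,name,title
participant,eid,Participant ID
participant,p31,Sex
participant,p21001_i0,Body mass index (BMI) | Instance 0
participant,p34,Year of birth
EOF

# Keep fields whose titles match what we want, emitted one per line in the
# entity.name form that table-exporter and dx extract_dataset expect.
awk -F',' 'NR > 1 && $3 ~ /Sex|Body mass index|Year of birth/ { print $1 "." $2 }' \
    data_dictionary.csv > field_names.txt

cat field_names.txt
```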

    Next, I run the table-exporter app using a “mem3_hdd1_x2” instance, which needs about an hour to extract a few thousand fields and costs about 0.2 GBP, perhaps even less.
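    For reference, launching table-exporter from the command line might look roughly like the sketch below. The input names are my assumptions, not verified against the app (confirm them with `dx run table-exporter -h`), and the record ID is a placeholder; saving the command as a small script makes it easy to rerun:

```shell
# Save the (assumed) launch command as a reusable script; the -i input names
# should be confirmed with `dx run table-exporter -h` before use.
cat > run_table_exporter.sh <<'EOF'
#!/bin/bash
# record-XXXX is a placeholder for your dataset record;
# field_names.txt must already be uploaded to the project.
dx run table-exporter \
    -idataset_or_cohort_or_dashboard=record-XXXX \
    -ifield_names_file_txt=field_names.txt \
    -ioutput=pheno_export \
    --instance-type mem3_hdd1_x2 \
    -y
EOF
chmod +x run_table_exporter.sh
```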

    My normal work on phenotypic data in R typically needs a “mem3_ssd1_x2”, which costs about 0.5-0.7 GBP for a whole workday. I run this in the Cloud Workstation app on DNAnexus. Personally, this fits my workflow better than JupyterLab.


    The above costs I consider negligible. Less than a cup of coffee in the university cafeteria.


    For heavier analyses, I would start small and scale up, to get a sense of the instance requirements and expected total costs.


     

  • Comment author
    Alejandra Rodriguez Sosa

    Thank you! I think I am going to consider this approach, as I was expecting the browser to be a quick, interactive way to get an idea of what my data looked like, but it seems that even for 6-8k participants it just doesn't quite work. Much appreciated.

  • Comment author
    Rachael W (UKB Community team, Data Analyst)
    • Edited

    Hi Alejandra,   

    As a PhD student you might qualify as an “early career researcher”, so you might be eligible for further credits (in addition to the 40 pounds) via the Platform Credits Programme. Please see https://www.ukbiobank.ac.uk/enable-your-research/costs/financial-support-for-researchers#TheUKBiobankPlatformCreditsProgramme.

    Any researchers working on a project where the institution is based in a low-income country might be eligible for further credits via the same Platform Credits Programme. They might also be eligible for initial access fee funding via the Global Researcher Access Fund: https://www.ukbiobank.ac.uk/enable-your-research/costs/financial-support-for-researchers#GlobalResearcherAccessFund.

    Researchers who do not qualify as early-career or low-income-country might be able to get temporary funding help with transitioning to the UKB-RAP; see the Transition Credits Programme information: https://www.ukbiobank.ac.uk/enable-your-research/costs/financial-support-for-researchers#TCP.

  • Comment author
    Gabriel Doctor

    Hi Sebastian, 

    I wrote this applet to allow an easy way of accessing UKB RAP data from the command line, with interactive and submit modes.
    https://github.com/gtdoctor/BasicBashWorker_for_UKBiobankDNANexus

     

