Poor Performance Extracting Phenotype Data

Henrik Litsne

I want to download 107 different columns of data, I have tried all the methods listed in “Accessing Phenotype Data”, but all of the are so slow that I shut the jobs down after a while having gotten no data.

The only way I was able to extract any data was by breaking down the SQL query and downloading each table separately (with a Python script). Trying any kind of Join as part of the extraction seems hopeless.

I'm using the default settings for Jupyter lab etc.

How does everyone else solve extracting data? Pay for much better performance?

Comments

5 comments

  • Comment author
    Lauren C The helpers that keep the community running smoothly. UKB Community team
    • Official comment

    Thanks for your question, just to clarify, data can’t be downloaded out of the Research Analysis Platform (RAP). All analysis and data handling must be carried out within the RAP environment, and exporting any individual data externally isn’t permitted. This includes anything at the level of an individual participant identifier.

    If you're running into performance issues with large phenotype extracts or complex joins, you may find it helpful to: 

    - Work with smaller subsets first to check your logic before scaling up. 

    - Use more efficient SQL approaches (e.g., filtering early, limiting joins, indexing temporary tables). 

    - Leverage the RAP’s internal compute options (e.g., using Spark or switching to a more performant compute size if available within your workspace configuration). 

    - Write intermediate outputs to RAP storage rather than trying to pull everything in one step. 

    - For visualisation of images, the UKB-RAP provides the 3D slicer tool.

    But in all cases, extraction and processing must stay inside RAP, downloading raw data out of the environment isn’t supported.

  • Comment author
    Henrik Litsne

    Download = extract to RAP/JupyterLab. I have used spark. As mentioned I am only interested in phenotype data. Since this data will form the basis of a machine learning project it is undesirable to limit the number of participants. 

    So far, my conclusion is that I will have to do all set manipulation on files in RAP storage. Compared to our own databases, with many more rows, on SQL Server the database appears to perform incredibly badly. That's why I wondered if I had missed something.

    Is there something like SSMS in your environment which would make it easier to interact with the database to try using temporary tables? It would also be interesting to see which indexes exist.

    0
  • Comment author
    Henrik Litsne

    I have been running this query SELECT * FROM database_j15kv6qjy304kbz1yxbgxp7z__app596898_20250612144632.participant_0001 LIMIT 1000 for at least 30 min now. Generally I would expect a query like this to return data within seconds. I am running it like this:

    retrieve_sql = "SELECT * FROM database_j15kv6qjy304kbz1yxbgxp7z__app596898_20250612144632.participant_0001 LIMIT 1000"

    temp_df = spark.sql(retrieve_sql.strip(";"))

    if bDebug:
        print("Before executing sql")

    pdf = temp_df.toPandas()

    Obviously there are also some code ahead of it.

    0
  • Comment author
    Anna D The helpers that keep the community running smoothly. UKB Community team Data Analyst

    Dear Henrik,

    Thank you for your questions. May I ask if you have seen our  metabolomics notebook? This provides an example of using a spark instance to extract phenotypic data in JupyterLab (written in R). 

    Within this notebook you will see an example of how to extract specific fields:

    # Filter the field_schema for a specific subset of the metabolomics data
    field_ids_of_interest<- field_schema %>% 
                             filter(main_category == 222| # This category includes all the processing indicators (useful for QC)
                                title %in% c("Acetate", "Acetate, QC Flag", "Date of attending Assessment Centre")| #Specific metabolite and its related QC field (names as they appear on showcase)
                                field_id==53) %>% #Date of attending assessment centre (used to identify when the blood sample was collected from the participant)  
                             pull(field_id)

    You should be able to follow the notebook and then replace the metabolomics related fields with your own fields of interest.

    For reference, running this notebook for all metabolomic related fields (n=~450) for all participants should take less than 30 mins when using the instance ‘mem1_hdd1_v2_x16’. 

    Do let us know if this helps!

     

    0
  • Comment author
    Noah Risse

    I had a similar issue when trying to get data for >250 columns. The fix for me was to use table exporter and set entity to “participant” . Before doing so the job ran many hours before i stopped it. 
    After trying again with specified entity it took only a couple of minutes. 
    in the data dictionary you can find the corresponding entity for your columns.
    Hope that also fixes your problem. 

    0

Please sign in to leave a comment.