Differences Between /mnt/ Directory and Cohort Browser Regarding Accessible IDs

Esra Lenz

Hey everyone! 

I’ve noticed a discrepancy between the data accessible via the Cohort Browser and what’s available in the /mnt/ directory when working with Jupyter Notebooks.

For example, when I query for participants with valid FreeSurfer output (MRI imaging) using the Cohort Browser, I get around 3,000 more participants compared to when I access and extract the same data via the /mnt/ directory.

When extracting data from /mnt/, I unzip files and pull the relevant tables, but it seems that fewer participants appear. This makes me wonder:

  • Is this difference intended behavior?
  • Could it be related to how I’m extracting the files or perhaps data availability in /mnt/?
  • Has anyone else experienced something similar?

I’d prefer to avoid manually clicking through all the FreeSurfer measurement tables in the Cohort Browser to get the data. Any guidance, insights, or tips would be greatly appreciated! 🙏
 

Thanks in advance for your help! 

Example:
This can be tested by exporting all IDs after filtering for valid FreeSurfer outputs (e.g., non-null values for the amygdala) and then using these IDs to access the bulk data under /mnt/. Some IDs will result in errors because they do not exist in the /mnt/ directory.
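
A minimal sketch of that test, for anyone who wants to reproduce it. It assumes that every bulk archive name starts with the participant ID followed by an underscore (something like `<id>_<field>_<instance>_<array>.zip`); that naming pattern, the function name, and the directory argument are assumptions for illustration, not something confirmed in this thread.

```python
from pathlib import Path


def ids_missing_from_bulk(valid_ids, bulk_dir):
    """Return the IDs in valid_ids that have no .zip archive under bulk_dir.

    Assumption: each archive name begins with the participant ID and an
    underscore, e.g. "<id>_<field>_<instance>_<array>.zip".
    """
    # Collect the ID prefix of every archive found in bulk_dir and its subfolders.
    found = {p.name.split("_", 1)[0] for p in Path(bulk_dir).rglob("*.zip")}
    # IDs exported from the Cohort Browser that have no matching archive.
    return sorted(set(map(str, valid_ids)) - found)
```

Calling it with the exported ID list and the bulk folder you are searching returns exactly the IDs that would otherwise raise "file not found" errors, which makes the size of the gap easy to report without sharing any IDs.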

Comments


  • Rachael W (UKB Community team, Data Analyst)

    Hi Esra,

    Please check that your selection of the instance is consistent. If you use data from both instances in one place and from only one instance in the other, this could cause a discrepancy. For more details on instances, see https://community.ukbiobank.ac.uk/hc/en-gb/articles/15955986227357-What-is-an-instance-index .
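
One way to check this on the bulk side is to count participants per instance index. The sketch below assumes archive names of the form `<id>_<field>_<instance>_<array>.zip`; that pattern and the function name are assumptions for illustration, so adjust the parsing to whatever the file names in your folder actually look like.

```python
from collections import defaultdict
from pathlib import Path


def participants_per_instance(bulk_dir):
    """Count distinct participants per instance index found under bulk_dir.

    Assumption: archive names look like "<id>_<field>_<instance>_<array>.zip".
    """
    per_instance = defaultdict(set)
    for path in Path(bulk_dir).rglob("*.zip"):
        parts = path.stem.split("_")
        if len(parts) < 3:
            continue  # name does not match the assumed pattern, skip it
        participant_id, _field, instance = parts[:3]
        per_instance[instance].add(participant_id)
    # Convert the sets of IDs to plain counts so no IDs are exposed.
    return {inst: len(ids) for inst, ids in per_instance.items()}
```

If the Cohort Browser filter uses Instance 2 only, the count for instance "2" is the number to compare against.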

    I am not familiar with the FreeSurfer data, but for most of the UKB data any particular field will be either in the Cohort Browser's Parquet database or in the Bulk folder, not both. Of course, variables that are derived from images would have the derived value in the Parquet database and the original image in the Bulk folder. Quite often there will be more of the original images than of the derived values, but I wouldn't expect it to be the other way round. Could you give an example of the field name or the field ID that you are comparing participant counts for? For example, I can see that there is FreeSurfer data in Field p26547_2 Mean intensity of Amygdala (left hemisphere) | Instance 2, which is visible in the Cohort Browser. Which Bulk field would you expect it to correspond to?

    As always, please be careful not to share any participant data or participant IDs, as the forum is not secure.

  • Esra Lenz

    Thank you for your answer, Rachael,

    My problem is not with the data instances (the time point at which the data were taken) but with the compute instances and their /mnt/ directory. (I mean the /mnt/ directory mentioned, for example, here in the documentation: https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/accessing-data/accessing-bulk-data)  → search for /mnt/

    Goal:

    Unzip the files of all participants in the /mnt/ directory of a Jupyter Notebook compute instance to retrieve the derivatives of the FreeSurfer pipeline, put them in one table, and download it to my project, so I have the data.

    To prefilter for the IDs that actually have valid measurements, I use the Cohort Browser to filter for valid outputs ("[Measurement]" ≠ None), because sometimes the data quality is too bad and the output is not valid.

    If I then use these IDs to filter in the /mnt/ directory provided in the Jupyter instance, the IDs are not the same. Filtering in the Cohort Browser returns more IDs than there are in the /mnt/ directory of the Jupyter compute instance.

     

    So to make it concrete and reproducible:

    1. I open the Cohort Browser. I filter the IDs based on the field p26547_2 Mean intensity of Amygdala (left hemisphere) | Instance 2 ≠ None → This keeps only the derivatives that are valid. I now have all IDs that had valid output from the FreeSurfer pipeline.

    2. I download the list of valid IDs. I start a Jupyter Notebook compute instance and upload the list there.

    3. In this Jupyter Notebook compute instance I loop through the provided /mnt/ directory. I unzip the zipped files and use my list of valid IDs to find the participants that should actually have data. I extract the necessary information.

    → This is where I get the discrepancy! More valid IDs are found in the Cohort Browser than exist in the /mnt/ directory. So the Cohort Browser and the /mnt/ directory don't seem to be in sync. If I want ALL of the information, I would have to either use the “dx download” functionality or select every relevant field by hand in the Cohort Browser.
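
Step 3 can be sketched roughly as follows. This is only an illustration under assumptions: archive names are taken to start with the participant ID and an underscore, and `member_suffix` (for example `"aseg.stats"`) is a hypothetical way of picking the file of interest inside each archive. The function name and its arguments are not part of any UKB or DNAnexus API.

```python
import zipfile
from pathlib import Path


def extract_member_for_ids(valid_ids, bulk_dir, member_suffix):
    """Read one text file from each participant's archive.

    Returns (contents, missing): contents maps ID -> file text, and missing
    lists the IDs with no archive or no matching file inside the archive.
    Assumption: archive names start with "<id>_".
    """
    archives = {}
    # Sorted so that, if an ID has several archives, the first by path is used.
    for path in sorted(Path(bulk_dir).rglob("*.zip")):
        archives.setdefault(path.name.split("_", 1)[0], path)

    contents, missing = {}, []
    for pid in map(str, valid_ids):
        archive = archives.get(pid)
        if archive is None:
            missing.append(pid)  # ID from the Cohort Browser without an archive
            continue
        with zipfile.ZipFile(archive) as zf:
            names = [n for n in zf.namelist() if n.endswith(member_suffix)]
            if names:
                contents[pid] = zf.read(names[0]).decode("utf-8")
            else:
                missing.append(pid)  # archive exists but lacks the expected file
    return contents, missing
```

Recording the missing IDs instead of letting the loop fail makes it possible to report how many IDs are affected and in which folder, without posting the IDs themselves.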

     

    I hope this made my problem clear :)

    Thanks for your reply and looking forward to your help.

  • Rachael W (UKB Community team, Data Analyst)

    Hi Esra,

    One more question: which folder(s) of the /mnt/ directory are you searching in?

