Missing values in proteomics data

Silvia Shen

There appear to be a lot of missing values in the proteomics data. Perhaps I have done something wrong, but just looking at the dataset for instance 0, every single individual is missing values for at least one protein. Is this indeed the case, and if so, why?

Comments

  • Rachael W (UKB Community team, Data Analyst)

    According to coding 143, https://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=143 , and to Resource 4654, https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=4654 , a total of 2923 proteins were assayed in the Olink set. According to Showcase Field 30900, https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30900 , the maximum number of proteins measured per participant in instance_0 is 2922. So yes, every individual is missing values for at least one protein. (It is not caused by any problem with your process.)

    Three assays in particular had a lot of missing items; see the graph at the top of page 10 of Resource 4658, https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=4658 . This includes one assay with almost 100% unsuccessful measurements (it might even be exactly 100%, but it is not easy to tell for sure from the graph).

    Note also that there is a small amount of data still to come, from batch 7, so that will also have an effect.

    See Category 1839, https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=1839 , for more notes and resources.

    See also the DNAnexus Olink documentation, https://dnanexus.gitbook.io/uk-biobank-rap/getting-started/working-with-ukb-data#about-proteomics-data , because the Olink data is a Record-table in AMS but appears as individual fields in the main dataset in the RAP.

  • Silvia Shen

    Great, thank you so much for this comprehensive answer! There's not a huge amount of literature precedent on this, so I'm wondering if you have any recommendations for imputing this data. There appear to be some individuals with a high proportion (>80%) of missing protein values. Would you recommend excluding them?

  • Rachael W (UKB Community team, Data Analyst)

    That is not something I could comment on, sorry.   I have no idea, and UK Biobank does not advise researchers how to use the UKB data.  If you make a new forum post with that question in the title, other researchers might discuss it. 

  • Silvia Shen

    Hi Rachael - thank you for your reply. I am also wondering about the sample size of the UKB PPP. On the UKB RAP, there are 53,016 participants with non-null entries for instance 0. However, the number specified in the proteomics paper after outlier removal is lower than this: 52,790, on page 8 of the supplementary materials, in the section ‘Data pre-processing and quality checking’. This is after outliers, sample swaps, samples with assay warnings etc. have been removed. I am wondering why the sample size is higher in the UKB Showcase/RAP. Is this data not the pre-processed data from the paper?

  • Rachael W (UKB Community team, Data Analyst)

    Hi Silvia, no, presumably not. See https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=4658 for a description of which items were included in the available UKB data on RAP and Showcase.

    (By the way, I think the title of Table 2 in Res 4658 must be a typo).

    I am not sure where the difference arises, but I would guess that it relates to the definition of “outlier” to be excluded, and that the Sun paper analysis was more stringent in excluding extreme values. Alternatively, maybe Sun et al. excluded all results for samples with any associated QC warnings, whereas the UKB data only excluded the specific results with associated QC warnings.

    If you find the answer, please post it here.

  • Rachael W (UKB Community team, Data Analyst)

    See particularly this paragraph of Res 4658:

    For the consortium version data, to err on the side of caution, if the aforementioned plates or samples contained 4 or more panels flagged as potential sample swaps, we removed the remaining panels of those plates or samples. These additional sample removals were labeled as "extra" in the flag column of the dataset. Removing these samples reduced the size of the dataset by about 1.1%. Please note that in the dataset provided to UK Biobank for approved researchers, we did not remove these additional samples; instead, we leave the decision on whether to retain or remove entire panels or plates based on a minority of potential sample swaps to the discretion of individual analysts.

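For anyone wanting to check the missingness pattern discussed in this thread on their own extract, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the toy table, the participant IDs (`p1`, `p2`, `p3`), the protein names (`prot_a` to `prot_e`) and the helper function names are all hypothetical, and none of them are UK Biobank field names or an official method. Missing values are represented as `None`. The 0.8 cut-off simply mirrors the >80% figure raised in the thread; whether to exclude such participants remains the analyst's decision.

```python
from typing import Dict, List, Optional

# One row per participant, one key per protein; None marks a missing value.
Row = Dict[str, Optional[float]]


def missing_fraction_per_participant(rows: Dict[str, Row]) -> Dict[str, float]:
    """Fraction of proteins with no value, for each participant."""
    return {
        pid: sum(value is None for value in row.values()) / len(row)
        for pid, row in rows.items()
    }


def missing_fraction_per_protein(rows: Dict[str, Row]) -> Dict[str, float]:
    """Fraction of participants with no value, for each protein."""
    proteins = list(next(iter(rows.values())).keys())
    n_participants = len(rows)
    return {
        protein: sum(row[protein] is None for row in rows.values()) / n_participants
        for protein in proteins
    }


def participants_above(fractions: Dict[str, float], threshold: float) -> List[str]:
    """Sorted IDs of participants whose missing fraction exceeds the threshold."""
    return sorted(pid for pid, frac in fractions.items() if frac > threshold)


# Hypothetical toy data: prot_e is missing for everyone, which mimics an
# assay that failed for (nearly) all participants.
toy: Dict[str, Row] = {
    "p1": {"prot_a": 0.5, "prot_b": 1.1, "prot_c": -0.2, "prot_d": 0.7, "prot_e": None},
    "p2": {"prot_a": None, "prot_b": 0.3, "prot_c": 0.9, "prot_d": None, "prot_e": None},
    "p3": {"prot_a": None, "prot_b": None, "prot_c": None, "prot_d": None, "prot_e": None},
}

by_participant = missing_fraction_per_participant(toy)
by_protein = missing_fraction_per_protein(toy)

# True when every participant lacks at least one protein value.
everyone_missing_something = all(frac > 0 for frac in by_participant.values())

# Participants above the illustrative 80% missingness cut-off.
high_missing = participants_above(by_participant, 0.8)

print(by_participant)
print(by_protein)
print(everyone_missing_something, high_missing)
```

Looking at the per-protein summary first is probably worthwhile: a single assay that failed for nearly everyone is enough to make every participant "missing at least one protein", without saying anything about the quality of the individual samples. On a real extract the same two summaries can be obtained with a data-frame library, for example `df.isna().mean(axis=1)` and `df.isna().mean(axis=0)` in pandas.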
