A mismatch in cause of death - ICD-10 data

Permanently deleted user
I found a mismatch in the cause of death - ICD-10 data. When I select I21 (Acute myocardial infarction) to exclude, the filter excludes more participants than there are with I21. How is that possible? I would have expected 500,459 participants to remain after excluding. [Image: Ekran Resmi 2023-10-02 16.50.24] [Image: Ekran Resmi 2023-10-02 16.47.05]

Comments

8 comments

  • Comment author
    Rachael W (UKB Community team, Data Analyst)

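    The filter "with death cause record cause of death icd-10 Excludes I21" will show you all the participants who have a death record, except those who have I21 in their death record. Participants without a death record are not included at all, which is why fewer participants remain than you expected.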

     

    Approximately 39514 participants have a death record in your dataset.

     

    By the way, I notice that your RAP dataset was last updated 2022-12-20. If you want more up-to-date data you can refresh the data, see https://dnanexus.gitbook.io/uk-biobank-rap/getting-started/updating-dispensed-data . (This can take a few hours, possibly a day or two at busy times, so don't do it when you need to be working on the data.)

  • Comment author
    Permanently deleted user

    Hi Rachael W,

     

    Thank you for your response. When I applied the filter, I did not tick "exclude results with missing data", so I thought it would not exclude participants who do not have a death record. Why does the filter exclude participants without a death record even though I did not tick that option?

     

    [Image: Ekran Resmi 2023-10-04 12.49.02]

  • Comment author
    Rachael W (UKB Community team, Data Analyst)

    Hi,

     

    I don't know why it does that. I guess whoever wrote the tool thought that people would want to search in that way. You could guess that it might do so from the fact that it says "... with death cause record ... ", so it ignores people without a death cause record.

     

    One way to get the cohort "all participants except the participants who have I21 in their death record":

    Save the full cohort, with a name such as all_participants

    Select and save the cohort of participants that do have I21 in the death cause record, with a name such as with_I21_in_death_cause

    Combine those two cohorts using Subtraction, save the resulting cohort with a name such as all_ppts_except_with+I21_in_death_cause.
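
    The difference between the "Excludes" filter and the Subtraction approach can be sketched in plain Python with sets. This is only an illustration: the participant IDs, the codes and the variable names below are made up, and it does not use the Cohort Browser or any RAP API.

```python
# Toy data: participant IDs and ICD-10 death-cause codes are invented for illustration.
all_participants = {1, 2, 3, 4, 5, 6}

# Only participants who have died appear in the death-cause records.
death_causes = {
    2: ["I21"],
    4: ["C34", "J18"],
    5: ["I21", "E11"],
}

# What the "with death cause record ... Excludes I21" filter appears to return:
# participants who HAVE a death record, minus those with I21 in it.
filter_excludes_i21 = {
    pid for pid, codes in death_causes.items() if "I21" not in codes
}

# The Subtraction approach: save everyone, save those with I21, then subtract.
with_i21_in_death_cause = {
    pid for pid, codes in death_causes.items() if "I21" in codes
}
all_except_i21 = all_participants - with_i21_in_death_cause

print(sorted(filter_excludes_i21))  # [4]
print(sorted(all_except_i21))       # [1, 3, 4, 6]
```

    In this toy example the filter keeps only participant 4, whereas the subtraction keeps everyone except participants 2 and 5.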

     

    You might need to decide whether you only want I21, or whether you want I21.1, I21.9 etc as well.
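
    If you end up doing this in code, the choice between "only I21" and "I21 plus its sub-codes" might look like the sketch below. The helper name is_i21 is hypothetical, and the code assumes that sub-codes may be written either with or without a dot (for example I21.9 or I219); check how the codes are stored in your own dataset.

```python
def is_i21(code: str, include_subcodes: bool = True) -> bool:
    """Return True if an ICD-10 code is I21, optionally counting I21.x sub-codes."""
    # Normalise so that "I21.9", "I219" and "i21.9" are treated the same way.
    normalised = code.strip().upper().replace(".", "")
    if include_subcodes:
        return normalised.startswith("I21")
    return normalised == "I21"


codes = ["I21", "I21.1", "I219", "I22", "i21.9"]
print([c for c in codes if is_i21(c)])                          # ['I21', 'I21.1', 'I219', 'i21.9']
print([c for c in codes if is_i21(c, include_subcodes=False)])  # ['I21']
```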

     

    Note that I have more participants with death records than you do, because my dataset has been refreshed more recently.

     

    Images to illustrate the above: [Image] [Image] [Image] [Image]

    A different way to find the cohort you need would be to use the alternative form of the death data, ie these fields:

    [Image]

    Filters using these fields will behave in the way you were originally expecting. However, if you want to include all the I21 values in the Contributory causes as well as the I21 in the Primary causes, you will need to select and filter using all 32 fields.
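
    The idea of checking every cause-of-death field, while keeping participants who have no value at all, can be sketched as follows. The column names are hypothetical stand-ins for the Primary and Contributory cause fields (only three are shown instead of all 32), and the rows are made-up toy data.

```python
# Hypothetical column names standing in for the Primary and Contributory cause fields.
CAUSE_COLUMNS = ["primary_cause", "contributory_cause_0", "contributory_cause_1"]

# Toy rows: None means no value recorded (e.g. the participant has no death record).
rows = [
    {"eid": 1, "primary_cause": None, "contributory_cause_0": None, "contributory_cause_1": None},
    {"eid": 2, "primary_cause": "I21", "contributory_cause_0": None, "contributory_cause_1": None},
    {"eid": 3, "primary_cause": "C34", "contributory_cause_0": "I21", "contributory_cause_1": None},
    {"eid": 4, "primary_cause": "J18", "contributory_cause_0": "E11", "contributory_cause_1": None},
]


def has_code(row, code, columns=CAUSE_COLUMNS):
    """Return True if the code appears in any of the listed cause columns."""
    return any(row[col] == code for col in columns)


# Exclude anyone with I21 in any column; rows with no values are kept.
kept = [row["eid"] for row in rows if not has_code(row, "I21")]
print(kept)  # [1, 4]
```

    Participant 1, who has no cause recorded, stays in the cohort, which is the behaviour originally expected from the filter.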

     

    At some point you will probably need to use JupyterLab instead of the Cohort Browser. If it is very difficult to select the cohort you need, you could consider doing cohort selection in JupyterLab.

  • Comment author
    Permanently deleted user

    Thank you so much for the detailed answer. Are there any hands-on training videos on how to include or exclude participants from a cohort in JupyterLab? If so, could you share them?

  • Comment author
    Rachael W (UKB Community team, Data Analyst)

    The main advantage of using a Spark JupyterLab (SJL) instead of the Cohort Browser to do cohort selection and save a cohort is that it is quicker to select the fields (columns) that are required.

     

    The main disadvantage of the SJL is that it is more expensive. For example, I used 3 hours of SJL yesterday and the cost was about £2.25 (which isn't too bad if you only need to do it once, but you might not want to do it very often).
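
    As a rough back-of-the-envelope calculation, the figures above imply an hourly rate that can be scaled to other session lengths. This assumes the cost is simply proportional to the time used and that the same instance type and pricing apply, which may not hold for your session; estimate_cost is a hypothetical helper, not part of any RAP tool.

```python
# Figures quoted above: about £2.25 for 3 hours of SJL.
quoted_cost_gbp = 2.25
quoted_hours = 3.0

# Implied hourly rate, assuming cost scales linearly with time.
implied_rate_gbp_per_hour = quoted_cost_gbp / quoted_hours


def estimate_cost(hours: float, rate_gbp_per_hour: float = implied_rate_gbp_per_hour) -> float:
    """Estimate the cost in GBP of a session of the given length."""
    return round(hours * rate_gbp_per_hour, 2)


print(implied_rate_gbp_per_hour)  # 0.75
print(estimate_cost(8))           # 6.0
```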

     

    For both methods, you need to understand the field structure (instances and arrays) and you need to work out how to filter. Filtering in the SJL can be done using either SQL commands or Python commands. (There are some examples in the videos and notebook below.) Writing a lot of filter commands might turn out to be almost as tedious as selecting a lot of fields in the Cohort Browser, but at least you can copy and paste.
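
    To give a feel for the two styles of filtering, here is a small self-contained sketch that does the same exclusion once in SQL and once in Python. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; the SJL itself uses Spark, and the table names, column names and data here are invented for illustration.

```python
import sqlite3

# In-memory toy database (a stand-in for the real dataset; all names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE participant (eid INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE death_cause (eid INTEGER, cause_icd10 TEXT)")
conn.executemany("INSERT INTO participant VALUES (?)", [(i,) for i in range(1, 7)])
conn.executemany(
    "INSERT INTO death_cause VALUES (?, ?)",
    [(2, "I21"), (4, "C34"), (4, "J18"), (5, "I21"), (5, "E11")],
)

# SQL style: keep every participant who has no I21 row in the death-cause table.
sql_result = [
    eid
    for (eid,) in conn.execute(
        """
        SELECT p.eid
        FROM participant AS p
        WHERE NOT EXISTS (
            SELECT 1 FROM death_cause AS d
            WHERE d.eid = p.eid AND d.cause_icd10 = 'I21'
        )
        ORDER BY p.eid
        """
    )
]

# Python style: load the IDs, then filter with ordinary Python.
participants = [eid for (eid,) in conn.execute("SELECT eid FROM participant")]
with_i21 = {
    eid
    for (eid,) in conn.execute("SELECT eid FROM death_cause WHERE cause_icd10 = 'I21'")
}
python_result = sorted(eid for eid in participants if eid not in with_i21)

print(sql_result)     # [1, 3, 4, 6]
print(python_result)  # [1, 3, 4, 6]
conn.close()
```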

     

    Note that there is a difference between a single-node JupyterLab and a Spark JupyterLab. In order to do cohort selection you will need a Spark JL. Once you have selected a cohort (rows) with the fields (columns) you need, a single-node JL should be fine. Single-node JLs are less expensive.

     

    There are 2 videos to watch before you use either kind of JL:

     

    Videos 3 and 4 on this page: https://dnanexus.gitbook.io/uk-biobank-rap/getting-started/research-analysis-platform-training-webinars , i.e. "Introduction to Jupyter" and "Exploring and Analyzing with Jupyter".

     

    The videos refer to a sample Notebook, to be found at https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb

    (There is quite a lot to take in and I needed to watch them more than once. Check that you have understood what the videos say about the difference between Project storage and JL storage).

     

    If you haven't previously used a Python notebook, you might like to ask someone in your institution to show you one and talk about the different kinds of cells. There is a description here, but I think it is better in person.

     

    You might also want a rough understanding of what a Python package (such as pandas) is, and what a pandas DataFrame is.

     

    The Notebook includes some examples of selecting fields, and some examples of filtering (including and excluding participants).

     

  • Comment author
    Georgia Anne Brice

    Hi Rachael W/UK Biobank community,

    I hope you're doing well.

    I am wondering whether the cost of £2.25 for 3 hours of SJL is still accurate. I might use SJL to filter my dataset and would like to know the estimated cost of this.

    Thank you in advance,

    Georgia.

  • Comment author
    Rachael W (UKB Community team, Data Analyst)

    Hi Georgia,

    The current Rate Card for the UKB-RAP is at https://20779781.fs1.hubspotusercontent-na1.net/hubfs/20779781/Product%20Team%20Folder/Rate%20Cards/BiobankResearchAnalysisPlatform_Rate%20Card_Current.pdf

    In general, the cost will depend on the size of the SJL Instance that you choose, and on whether it is Spot or On-demand.

    If you select High priority, it will be the On-demand rate.

    If you select Low priority, it will be the Spot rate (but you might be waiting for a very long time).

    If you select Normal priority, it might be the Spot rate if the system is not too busy, otherwise it will be the On-demand rate.

    To choose the size of the SJL Instance, I suggest you start with something quite small, and increase it if the small one fails. However, if you know you want to export thousands of fields, the smallest Instances won't be big enough. See the documentation here https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/accessing-data/using-spark-to-analyze-tabular-data , which says “The default settings allow for casual interrogation of the data. If you will be running complex queries or analyzing a large amount of data in memory, you may need to select a larger instance type. To increase parallelization efficiency and reduce processing time, you may need to select more nodes.”

    Thank you for using the forum.

     

  • Comment author
    Georgia Anne Brice

    Thank you, Rachael!
