In September 2022, I downloaded our main dataset to a Linux server, checked the MD5 checksum, converted it to CSV, and created a data dictionary, all according to the Data Access Guide.
I have now looked up the column that corresponds to UDI 40001-0.0 (Underlying (primary) cause of death), which is supposed to use data-coding 19 (ICD10 codes).
The following command:
cat ukbXXXXXX.csv | awk -F ',' '{print $11998}' | sort | uniq -c
gives me a lot of floats, dates, strings like "Pulse Wave Velocity", and other irrelevant and obviously misplaced data. I also get some ICD10 codes, but not nearly as many as I find in RAP.
I tested one group (I42) and found 178 participants deceased from this cause in RAP but only 2 in my downloaded data.
What is all this noise, and why is my data missing so many ICD codes?
Best regards,
Carina
Comments
Did you download it through UKB-RAP, or did you download it directly from Showcase?
Did you use the correct "download" command to get the above-mentioned "ukbXXXXXX.csv" file?
You may want to get the ICD10 data from RAP. For this kind of data/field extraction, I normally use JupyterLab and export the data to CSV (it is highly customizable and one can write efficient queries). The relevant field ID is here: https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=40001
The number 178 that you got on RAP for this specific category seems to be correct.
I also confirmed the number 178 in my Cohort Browser, so I am seeing the same as you. Alternatively, for a cohort of fewer than 30k participants, I was able to view the pair of fields "eid, Underlying cause of death" in the Data Preview tab of the Cohort Browser. I could then click the Download button to get the data in CSV format.
@Chai Fungtammasan & @Ondrej Klempir
Thank you for your responses :)
I downloaded the .enc file from Showcase, and had no errors while checking, unpacking, and converting it to the selected format.
I have a very large Linux server available, so, if possible, I wanted to do my work locally instead of waiting and paying for JupyterLab every time I wanted to look at or download data.
Understood. You would need to contact UKB directly about issues with data from Showcase. This community is for UKB-RAP users, so we don't know how to help you.
I would recommend trying UKB-RAP though. There is a free credit of 40 pounds, and the rate is extremely economical. There is much more value in RAP than just computing power. You also gain collaboration, interactive visualization, cohort selection, data provenance, security, and compliance.
For data that your MTA with UKB allows you to download, you can download it from RAP too, which should be easier than downloading from Showcase. In this case the easiest way is probably to use the Table Exporter app or dx extract_dataset to extract and download it.
Thank you for your answer. I will contact UKB.
I do use RAP also, and I am very impressed with the features it has.
It turned out that the dataset also contains commas inside the data fields (according to UKB), so the command I used (cat ukbXXXXXX.csv | awk -F ',' '{print $11998}' | sort | uniq -c) could not work: splitting on every comma shifts the columns. I removed all commas inside the data fields, replacing them with ':', and now it works.
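For anyone hitting the same problem, here is a minimal sketch of an alternative to rewriting the commas: a CSV-aware parser treats a quoted field as one value even when it contains a comma. This assumes that the converted file quotes its fields and that the header label for this field is "40001-0.0"; neither is confirmed in this thread. The function name count_column_values and the sample rows (including the "free_text" column) are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_column_values(lines, column_name):
    """Tally the non-empty values of one column, honouring quoted fields that contain commas."""
    reader = csv.reader(lines)
    header = next(reader)
    index = header.index(column_name)  # look the column up by name, not by a fixed position
    counts = Counter()
    for row in reader:
        if index < len(row) and row[index] != "":
            counts[row[index]] += 1
    return counts


# Made-up sample: the middle column holds quoted values with commas in them.
sample = io.StringIO(
    '"eid","free_text","40001-0.0"\n'
    '"1","a, b","I420"\n'
    '"2","c","I420"\n'
    '"3","d, e","C349"\n'
)

# CSV-aware tally of the third column.
counts = count_column_values(sample, "40001-0.0")

# For comparison: splitting on every comma (what awk -F ',' does) picks up
# fragments of the neighbouring field whenever a value contains a comma.
naive = [line.split(",")[2] for line in sample.getvalue().splitlines()[1:]]

print(counts)
print(naive)
```

On a real file the same function could be called as count_column_values(handle, "40001-0.0") inside `with open("ukbXXXXXX.csv", newline="") as handle:`. Looking the column up by name also avoids depending on a position such as 11998.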
I see. Thanks for sharing the solution. In case you are interested, on UKB-RAP you could use the Table Exporter app to get the data you need and choose whether you want TSV or CSV. It still won't work directly with your awk command, but if you select TSV you could then keep the commas in the header.
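As a rough sketch of how a TSV export could be tallied, the snippet below reads tab-separated input and counts the rows whose code starts with a given prefix, such as I42. The column name "p40001_i0", the raw (uncoded) ICD10 values, and the sample rows are assumptions for illustration; the actual header depends on the export options chosen.

```python
import csv
import io


def count_prefix(lines, column_name, prefix):
    """Count rows whose value in column_name starts with prefix, reading tab-separated input."""
    reader = csv.DictReader(lines, delimiter="\t")
    return sum(1 for row in reader if (row.get(column_name) or "").startswith(prefix))


# Made-up sample: two I42 subcodes, one unrelated code, one participant without a value.
sample_tsv = io.StringIO(
    "eid\tp40001_i0\n"
    "1\tI420\n"
    "2\tI429\n"
    "3\tC349\n"
    "4\t\n"
)

i42_total = count_prefix(sample_tsv, "p40001_i0", "I42")
print(i42_total)
```

With a tab as the delimiter, commas inside values or header labels no longer split a field, which is why the column lookup by name stays reliable here.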