Reading data into RStudio on DNA Nexus

24 May 2022 00:00
9 comments

I'm interested in analysing the UK Biobank healthcare record data using RStudio / RMarkdown through DNA Nexus. I know that using a Python session, it is possible to query the UK Biobank's data tables in the Python environment, e.g. spark.sql("USE " + dispensed_database_name) df=spark.sql("SELECT * FROM hesin") Is there an equivalent that would allow me to be able to do this in RStudio? I'd like to be able to run SQL queries on the UK Biobank's data inside an R script.

Comments

9 comments

Chai Fungtammasan DNAnexus Team
- 31 May 2022 03:46
Currently, there is no method to query spark DB inside DNAnexus using Rstudio, so the data need to be prepared a head of time using Table Exporter (for people who prefer UI and use just a few fields) or use Python spark cluster.

0
Former User of DNAx Community_63
- 31 May 2022 07:56
This makes the DNAnexus platform problematic in terms of making R analysis pipelines reproducible and reliable. Is this going to be fixed at any time?

0
Chai Fungtammasan DNAnexus Team
- 31 May 2022 16:21
Could you elaborate how this would affect reproducibility?
The Table Exporter improvement to allow larger number of field selection is in progress.

0
Former User of DNAx Community_63
- 01 June 2022 09:47
The most reproducible pipeline is one in which one run of the code is sufficient to produce all of the analysis, numbers, figures etc. needed for the paper using only the raw data. This is how, up to now, I have been conducting most of my analysis. ICD10, read3, opcs4 codes etc are defined at the top of the script, and one run of the code pulls out the necessary records, performs the appropriate analysis and outputs a complete paper.

If this were possible on DNA nexus it would be even better than our local analysis because I could send the script to someone of a different institute and their raw data would be in the same file structure. However, creating tables using the table exporter, exporting them, and reading them into R means that running the code is not sufficient to produce the results, and introduces potential points of error,

I hope that's clear. Happy to provide examples if helpful.

0
Chai Fungtammasan DNAnexus Team
- 02 June 2022 01:20
Thank you, Harry. This is very insightful.
The complexity of this issue is not just for engineering work to make this happens, but also licensing with related parties including Rstudio if we have to modify the Rstudio code.

I like to understand more about how do your current workflow work all-in-one Rscript alone though. I would imagine that outside the RAP, you would have to use multiple tools in UKB toolkits to encrypt and download data first. It's then that you have raw data available for data analysis in R. Won't this be the same analogy as having to use table exporter make your data ready in UKB before you analyze with Rstudio? On RAP, this is actually more advance since you could select only the combination of fields you need in Table exporter and it's much faster as well.

Or if you use `system` command in R to handle data download to R, then you could also use system command in R to run table exporter and make your data ready to use in Rstudio as well.

I'm happy to hear your opinion if I misunderstood your intended workflow.

0
Chai Fungtammasan DNAnexus Team
- 02 June 2022 03:06
I have another idea. Since UKB-RAP support workflow, you can make the workflow of two steps 1 table exporter and 2 Rscript wrapped inside a Docker or applet. This won't work for interactive exploration phase, but once you get a good idea of what you are looking for, you can convert script into applet/Docker and use this workflow for production analysis. You can change some parameters for each of the run either at table exporter or applet/Docker that contains your R code.

I just want to share the workaround that could work in a short term. Hopefully, some of this is useful.

0
Former User of DNAx Community_63
- 08 June 2022 11:35
Hi Chai,

Thanks a lot for your detailed response, that's very helpful.

On my current workflow, I have this on Github which is the way I like to do things. Lines 103 to 110 are the way I'd like to be able to do things on the DNA Nexus Platform: https://github.com/hdg204/ProstateCancer/blob/main/prca_paper.Rmd

Using a list of read3 codes stored in a file cancercodes, the script queries the GP record data (gp_clinical_230K_171019) and stores that in a dataframe. This dataframe is then analysed in the same script. Although we do have to download the data first, extracting the relevant records is done in the same script as the analysis. I try, wherever possible, to avoid the practice of saving a table in one script and reading it in from another. DNA Nexus would streamline this process even further if extracting healthcare records from R was possible.

Using the table exporter from system in R sounds like a good workaround and I didn't know that could be done with system. Does the table exporter have access to individual healthcare records though? When I was looking at it earlier I could find a way to extract, e.g., everyone with read_3 code 'B46..', but what I need is the actual records because the timing is important for my research, My case definition is like 'record of cancer (ICD10 / GP) within 2 years of these symptoms (GP)'. So I read in all records of cancer / symptoms then do the relevant data wrangling in dplyr.

0
Chai Fungtammasan DNAnexus Team
- 12 June 2022 03:38
Our team looked into the overall issue around using data within Rstudio and wrote an Rscript that could export phenotypic field by calling the table exporter from within the Rstudio
https://github.com/dnanexus/UKB_RAP/blob/main/rstudio_demo/export_phenotypes.R
Could you give a try if this would help you with your problem? We gave a try on about 50 fields, and it works as expected.
We describe the script around 8:30.
https://www.youtube.com/watch?v=hVDbE1EykXA

0
Former User of DNAx Community_63
- 21 June 2022 14:08
Hi Chai,

I had issues with getting that script to run, but what I have noticed is that storing a copy of the entire healthcare records as a csv is a lot cheaper than I thought it was, so the table exporter idea works fine.

My concerns over the table maker were that the filtering of the data was not part of the analysis script, but that isn't an issue if the tablemaker is used to copy all the records. I've used the table maker to make a csv copy of the entire hesin, hesin_diag, gp_clinical, etc tables using the table maker and can read these in with R very easily. The csv is easier to work with for other members of my team anyway.

If it helps with you to understand how people are using DNA Nexus, I've written this for use by my team https://github.com/hdg204/UKBB. I find this workflow faster than using the Python spark cluster and I think this is how most people at Exeter will be accessing the healthcare records as we start moving towards DNA Nexus. My R script takes about 15 seconds to take an ICD10 code and read the relevant records into an R dataframe.

Thanks a lot for your help, I feel we're now up and running on DNA Nexus and able to use it for all of our UKBB phenotyping going forwards

0

Please sign in to leave a comment.