Creating dataframe and data dictionary in R

Edited 28 October 2025 05:55
2 comments

I have found downloading the actual data into R workbench incredibly difficult. I have watched and read many tutorials, including this one: https://www.youtube.com/watch?v=uT_jD1Ey3Fk (however, I can't find the slides so that I can click on the link). Is there an easy step-by-step tutorial on how to get the data from RAP into R? I have tried cohort browser > table exporter, but I need 70 data fields and it keeps missing columns and is hard to keep track of. I have also tried using the dx toolkit:

library(readr)  # read_csv()
library(dplyr)  # mutate(), filter(), arrange(), pull()
library(glue)   # glue()

# Data dictionary exported for the dataset
data_dictionary <- read_csv("data/appxxx_xxxx.dataset.data_dictionary.csv")

# Titles to match exactly, plus keywords to match anywhere in the title
exact_fields <- c("Sex", "Year of birth")
keywords <- c("circumference", "Sleep duration", "menopause", "HRT")

filt_dict <- data_dictionary %>%
  mutate(ent_field = glue("{entity}.{name}")) %>%
  filter(
    title %in% exact_fields |
      grepl(paste(keywords, collapse = "|"), title, ignore.case = TRUE)
  ) %>%
  filter(!grepl("Olink", entity, ignore.case = TRUE)) %>%  # exclude Olink
  arrange(title)

field_list <- filt_dict %>%
  pull(ent_field)

fields_arg <- paste(field_list, collapse = ",")
outfile <- "data/filtered_data.csv"
if (file.exists(outfile)) file.remove(outfile)

# dataset_id must already be defined (it is not shown in this snippet)
template <- "dx extract_dataset {dataset_id} --fields {fields_arg} -o {outfile}"
cmd <- glue(template)
system(cmd)

# Read from the same path that dx wrote to
dat <- read_csv(outfile)

but I keep getting this error: “Please consider using `--sql` option to generate the SQL query and query via a private compute cluster. Fetch data exceeded timeout [120]. Cancelled”

Comments

Richard Karlsson Linner
- 24 November 2025 09:52
In my experience, the extract_dataset function has limited functionality. Instead, keep your current workflow of creating a field_list, which is good practice, but extract that list of fields using the table-exporter applet:
https://documentation.dnanexus.com/developer/apps/developing-spark-apps/table-exporter-application
The key flags for this applet are:
-ientity="participant"
-ifield_names_file_txt=file.txt

Where file.txt has the following single-column format (easy to create from the dictionary files):
eid
p21_i0
p21_i1
p21_i2
p21_i3
p31
p34
p84_i0_a0
p84_i0_a1
p84_i0_a2
p84_i0_a3
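
As a rough sketch of how such a file could be written from R: the snippet below assumes the dictionary has `entity` and `name` columns (as in the question's code) and that `name` already holds values in the `p21_i0` style. The toy `dict` rows and the helper `write_field_names()` are made up for illustration; with real data you would pass your filtered dictionary instead.

```r
# Toy stand-in for the filtered data dictionary (hypothetical rows).
dict <- data.frame(
  entity = c("participant", "participant", "participant", "other_entity"),
  name   = c("p31", "p34", "p21_i0", "x1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: write one field name per line, with the ID column first.
write_field_names <- function(dict, path, entity = "participant", id_col = "eid") {
  # Keep only the fields that belong to the requested entity
  fields <- dict$name[dict$entity == entity]
  # Put the ID column first and drop duplicates (e.g. if eid is already listed)
  fields <- unique(c(id_col, fields))
  writeLines(fields, path)
  invisible(fields)
}

# Written to a temporary directory here; use your own path in practice
field_file <- file.path(tempdir(), "file.txt")
write_field_names(dict, field_file)
```

The table-exporter input refers to a file in the project rather than on the local disk, so the file would most likely still need to be uploaded (for example with `dx upload`) before the run.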

Here is the full command as pseudo-Unix code:
dx run table-exporter \
--name="runid" \
--priority high \
--yes \
--watch \
-idataset_or_cohort_or_dashboard="see applet docs" \
-ioutput="see applet docs" \
-ioutput_format="see applet docs" \
-icoding_option="see applet docs" \
-ientity="participant" \
-ifield_names_file_txt="file.txt" \
--destination="see applet docs" \
--instance-type="mem2_ssd1_v2_x2"
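
If you prefer to launch this from R, as in the original workflow, a minimal sketch might assemble the command string first and inspect it before running. The helper `build_table_exporter_cmd()` and the placeholder values are hypothetical; it uses only a subset of the flags shown above, and the real values still have to come from the applet docs.

```r
# Hypothetical helper: assemble a `dx run table-exporter` command string.
# Only flags that appear in the command above are used.
build_table_exporter_cmd <- function(dataset, output, field_file,
                                     entity = "participant",
                                     destination = NULL) {
  # Quote every value with sh-style quoting so spaces are handled safely
  q <- function(x) shQuote(x, type = "sh")
  args <- c(
    "dx run table-exporter",
    "--yes",
    paste0("-idataset_or_cohort_or_dashboard=", q(dataset)),
    paste0("-ioutput=", q(output)),
    paste0("-ientity=", q(entity)),
    paste0("-ifield_names_file_txt=", q(field_file))
  )
  if (!is.null(destination)) {
    args <- c(args, paste0("--destination=", q(destination)))
  }
  paste(args, collapse = " ")
}

# Placeholder values only; replace them with your own dataset and output name
cmd <- build_table_exporter_cmd("DATASET_PLACEHOLDER", "filtered_data", "file.txt")
cat(cmd, "\n")
# system(cmd)  # run only in a session where the dx toolkit is logged in
```

Printing the command before calling `system()` makes it easier to check that every field and flag is what you expect.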

Danielle Hiam
- 25 November 2025 02:25
Thank you this worked perfectly!
