Creating dataframe and data dictionary in R

Danielle Hiam

I have found downloading the actual data into R Workbench incredibly difficult. I have watched and read many tutorials, including this one: https://www.youtube.com/watch?v=uT_jD1Ey3Fk (however, I can't find the slides to click on the link). Is there an easy step-by-step tutorial on how to get the data from RAP into R? I have tried the cohort browser > table exporter, but I need 70 data fields, and it keeps missing columns and is hard to keep track of. I have also tried using the dx toolkit:

library(readr)  # read_csv()
library(dplyr)  # mutate(), filter(), arrange(), pull()
library(glue)   # glue()

data_dictionary <- read_csv("data/appxxx_xxxx.dataset.data_dictionary.csv")
exact_fields <- c("Sex", "Year of birth")
keywords <- c("circumference", "Sleep duration", "menopause", "HRT")

# Keep fields whose title matches exactly or contains one of the keywords
filt_dict <- data_dictionary %>%
  mutate(ent_field = glue("{entity}.{name}")) %>%
  filter(
    title %in% exact_fields |
      grepl(paste(keywords, collapse = "|"), title, ignore.case = TRUE)
  ) %>%
  filter(!grepl("Olink", entity, ignore.case = TRUE)) %>%  # exclude Olink
  arrange(title)

field_list <- filt_dict %>%
  pull(ent_field)

fields_arg <- paste(field_list, collapse = ",")
outfile <- "data/filtered_data.csv"
if (file.exists(outfile)) file.remove(outfile)

# dataset_id must already hold the dataset ID; it is not defined in this snippet
template <- "dx extract_dataset {dataset_id} --fields {fields_arg} -o {outfile}"
cmd <- glue(template)
system(cmd)

# Read the same file that dx wrote (data/filtered_data.csv)
dat <- read_csv(outfile)

However, I keep getting this error: “Please consider using `--sql` option to generate the SQL query and query via a private compute cluster. Fetch data exceeded timeout [120]. Cancelled”

 

Comments

  • Richard Karlsson Linner

    In my experience, the extract_dataset function has limited functionality. Keep your current workflow of creating a field_list, which is good practice, but extract that list of fields with the table-exporter applet instead:
    https://documentation.dnanexus.com/developer/apps/developing-spark-apps/table-exporter-application

    The key flags for this applet are:
    -ientity="participant"
    -ifield_names_file_txt=file.txt

    Where file.txt has the following single-column format (easy to create from the dictionary files):
    eid
    p21_i0
    p21_i1
    p21_i2
    p21_i3
    p31
    p34
    p84_i0_a0
    p84_i0_a1
    p84_i0_a2
    p84_i0_a3

    Here is the full command as pseudo-Unix code:
     dx run table-exporter \
           --name="runid" \
           --priority high \
           --yes \
           --watch \
           -idataset_or_cohort_or_dashboard="see applet docs" \
           -ioutput="see applet docs" \
           -ioutput_format="see applet docs"  \
           -icoding_option="see applet docs"  \
           -ientity="participant" \
           -ifield_names_file_txt="file.txt" \
           --destination="see applet docs" \
           --instance-type="mem2_ssd1_v2_x2"
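
A minimal sketch of how the single-column file.txt described above could be written from R. It assumes the filtered dictionary has the `entity` and `name` columns used in the question; the small `filt_dict` built here is a made-up stand-in (the `olink_instance_0` row and its field name are hypothetical), and the file is written to a temporary directory only to keep the example self-contained.

```r
# Toy stand-in for the filtered data dictionary from the question.
# The rows are illustrative only; in practice use the real filt_dict.
filt_dict <- data.frame(
  entity = c("participant", "participant", "participant", "olink_instance_0"),
  name   = c("p31", "p34", "p21_i0", "example_protein"),
  stringsAsFactors = FALSE
)

# Keep only fields from the participant entity, because the applet
# call above is run with -ientity="participant".
participant_fields <- filt_dict$name[filt_dict$entity == "participant"]

# Put eid first and drop any duplicated field names.
field_names <- unique(c("eid", participant_fields))

# Write one field name per line, with no header row.
fields_file <- file.path(tempdir(), "file.txt")
writeLines(field_names, fields_file)
```

As far as I understand, the applet reads this file from the project rather than from local disk, so it would need to be uploaded first (for example with `dx upload file.txt`) before being passed to -ifield_names_file_txt.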

  • Danielle Hiam

    Thank you, this worked perfectly!
