Are there simpler ways of doing things?
I started using the RAP for UKB analyses two weeks ago, and I have noticed that simple tasks take me a very long time and are much more complicated than the UKB analyses I have done on our university cluster. If possible, I'd appreciate some input from experienced users on how I can get things done faster.
Here are two examples, both of which took me multiple days each to figure out:
Task 1: Creating a CSV file of phenotype data for selected fields
I wanted to make a single CSV including phenotype data on all individuals from a selection of ~400 fields of interest. I want to be able to update and re-run this from time to time as I add new fields of interest.
There were too many fields to use the data exporter (which seems to cap out after you select 100 fields), so I used the Apache Spark cluster, building queries using PySpark.
The only way I could figure out how to achieve this was:
- Use spark.sql("Describe TBL") on each of the participant_* tables, to find what table each field is stored in
- Create spark.sql("SELECT ---") queries to select the relevant fields from each table, and use df.join() to combine the results into a single data frame
- Recode all of the columns with array types to strings using col().cast()
- Write this data frame as a set of CSVs to the Hadoop file system using df.write.csv()
- Copy these CSVs from the Hadoop file system to the local file system using a system call to hdfs
- Read the CSVs back in again from the local file system, merge them together and write them back out again
- Upload the combined CSV to the project file system using a system call to dx upload
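For reference, the "read the CSVs back in and merge them" step above can be done without loading everything into memory. This is only a minimal sketch using the Python standard library. It assumes the part files were written with a header row (for example df.write.csv(path, header=True)) and have already been copied to a local directory; the function name and the "part-*.csv" file pattern are illustrative assumptions, not something the RAP provides.

```python
import csv
import glob
import os


def merge_csv_parts(parts_dir, merged_path):
    """Concatenate Spark-style part-*.csv files, writing the header only once.

    Assumes every part file starts with the same header row.
    Returns the number of data rows written.
    """
    part_paths = sorted(glob.glob(os.path.join(parts_dir, "part-*.csv")))
    rows_written = 0
    header = None
    with open(merged_path, "w", newline="") as out_handle:
        writer = csv.writer(out_handle)
        for part_path in part_paths:
            with open(part_path, newline="") as in_handle:
                reader = csv.reader(in_handle)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # skip empty part files
                if header is None:
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"Header mismatch in {part_path}")
                # Stream rows one at a time so large parts are never held in memory
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Calling merge_csv_parts("pheno_parts", "pheno_merged.csv") (both paths hypothetical) would leave one combined file ready for dx upload.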
Task 2: Extracting a subset of SNPs for all samples from the exome data
I have a site VCF including ~1M variants of interest, and I'd like to get a BCF including genotypes for all individuals at these sites from the 200k exome release.
The only way I could figure out how to do this was:
- Write a Dockerfile to make an image that included bcftools, build it on cloud_workstation, and save it using dx upload
- Write a WDL script, and compile it with dxCompiler, to extract the SNPs from a set of VCF files with bcftools isec (streaming the VCFs/indexes etc) and merge them into a BCF
- Write a bash script to generate and run ~100 job submission commands to run this script (each run being given 10 of the exome region chunks)
- Run another job to download these output BCFs, merge them together and upload them again
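The batching step above (roughly 100 submissions of 10 chunks each) can be sketched in Python rather than bash. This is only an illustration under stated assumptions: the executable ID, the input name "vcfs" and the destination folder are hypothetical placeholders, and the functions only build the dx run argument lists without submitting anything.

```python
import shlex


def batch(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_submit_commands(vcf_paths, batch_size, executable, input_name, destination):
    """Build one `dx run` argument list per batch of VCF chunks.

    `executable` and `input_name` are placeholders for whatever applet or
    workflow you compiled and the name of its array-of-files input.
    """
    commands = []
    for index, group in enumerate(batch(vcf_paths, batch_size)):
        command = ["dx", "run", executable]
        for path in group:
            # Repeating -i with the same input name passes an array input
            command += ["-i", f"{input_name}={path}"]
        command += ["--destination", f"{destination}/batch_{index:03d}", "--yes", "--brief"]
        commands.append(command)
    return commands


if __name__ == "__main__":
    # Hypothetical chunk names, only to show the shape of the output
    chunks = [f"chunk_{i}.vcf.gz" for i in range(25)]
    for cmd in build_submit_commands(chunks, 10, "workflow-xxxx", "vcfs", "/exome_subset"):
        print(shlex.join(cmd))
```

Each printed line can be inspected before being run, or passed to subprocess.run() once you are happy with it.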
Can I be doing these things more simply?
So, this all feels like I am overcomplicating things somehow. Previously, these would have been just one or two short commands using our old UKB files: for the first, reading a few big CSVs into R with fread and then merging and subsetting using tidyverse; for the second, setting off a small job array script with a few bcftools commands using SLURM. But on the RAP everything seems to be a multi-stage process involving different servers, filesystems and programs.
Are the ways I am approaching these tasks on the RAP sensible, or have I gone wrong somewhere? Do people have tricks/workarounds/existing workflows/images etc for doing these sorts of common tasks quickly? Or do you just get used to working in these systems eventually and adjust to the different way of doing things?
Comments
For #1, dx extract_dataset might work better for you. In the thread below, you can see multiple options for extracting data. The discussion is a bit long, but IMO dx extract_dataset works best for fast turnaround. It can be run anywhere that has dx-toolkit installed, whether local or cloud.
https://community.dnanexus.com/s/question/0D5t000004SBm0eCAD/query-of-the-week-1-export-phenotypic-data-to-a-file
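As a rough illustration of the dx extract_dataset route, the sketch below builds the command for a list of fields so that it can be re-run as the list grows. It is a sketch under assumptions: the dataset ID "record-xxxx" is a placeholder, and the column naming (p<field>, with optional _i<instance> and _a<array> suffixes under the "participant" entity) should be checked against the output of dx extract_dataset <dataset> --list-fields for your own project. The helper names are made up for this example.

```python
import shlex


def field_name(field_id, instance=None, array_index=None, entity="participant"):
    """Build an entity.field name such as participant.p21001_i0 (assumed naming)."""
    name = f"{entity}.p{field_id}"
    if instance is not None:
        name += f"_i{instance}"
    if array_index is not None:
        name += f"_a{array_index}"
    return name


def build_extract_command(dataset, field_names, output_path):
    """Return the argument list for a dx extract_dataset call (not executed here)."""
    # Always put the participant ID first so the rows can be joined later
    fields = ["participant.eid"] + [n for n in field_names if n != "participant.eid"]
    return [
        "dx", "extract_dataset", dataset,
        "--fields", ",".join(fields),
        "--delimiter", ",",
        "--output", output_path,
    ]


if __name__ == "__main__":
    wanted = [field_name(31), field_name(21001, instance=0)]
    print(shlex.join(build_extract_command("record-xxxx", wanted, "pheno.csv")))
```

Keeping the field list in a small script or text file like this means adding a new field of interest is a one-line change followed by a re-run.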
For #2, you can speed things up for both data retrieval and processing:
Retrieval
Processing
If you are new to the platform, you may find the recorded webinar helpful: https://www.youtube.com/watch?v=uT_jD1Ey3Fk&list=PLRkZ0Fz-n3Z6ku1U9V_C2bV5kqafRwrY7
Have you seen these DNAnexus notebooks? https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb
There are also some newly-released example notebooks on GitHub that you might find useful, see https://github.com/UK-Biobank/UKB-RAP-Notebooks