Are there simpler ways of doing things?

11 October 2023 00:00
2 comments

I started using the RAP for UKB analyses two weeks ago, and I have noticed that simple tasks take me a very long time and are much more complicated than the UKB analyses I have done on our university cluster. If possible, I'd appreciate a bit of input from experienced users on how I can speed up my work.

Here are two examples, each of which took me multiple days to figure out:

Task 1: Creating a CSV file of phenotype data for selected fields

I wanted to make a single CSV including phenotype data on all individuals from a selection of ~400 fields of interest. I want to be able to update and re-run this from time to time as I add new fields of interest.

There were too many fields to use the data exporter (which seems to cap out after you select 100 fields), so I used the Apache Spark cluster, building queries using PySpark.

The only way I could figure out how to achieve this was:

Use spark.sql("DESCRIBE TBL") on each of the participant_* tables, to find which table each field is stored in
Create spark.sql("SELECT ---") queries to select the correct fields from each table, and use df.join() to combine the queries into a single query data frame
Recode all of the columns with array types to strings using col().cast()
Write this query as a set of CSVs to the Hadoop file system using df.write.csv()
Copy these CSVs from the Hadoop file system to the local file system using a system call to hdfs
Read the CSVs back in from the local file system, merge them together and write them back out as a single CSV
Upload the combined CSV to the project file system using a system call to dx upload
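
For readers following along, the first two steps above (finding which table holds each field, then building one SELECT per table) can be sketched in plain Python. This is only an illustration: the table names, the p-prefixed column names and the helper functions are hypothetical, and the column sets stand in for whatever DESCRIBE actually returns in your project.

```python
from collections import defaultdict


def group_fields_by_table(table_columns, wanted):
    """Assign each wanted column to the first table that contains it."""
    grouped = defaultdict(list)
    missing = []
    for field in wanted:
        for table, columns in table_columns.items():
            if field in columns:
                grouped[table].append(field)
                break
        else:
            # No table contained this field.
            missing.append(field)
    return dict(grouped), missing


def build_select(table, fields, key="eid"):
    """Build one SELECT per table, always keeping the join key first."""
    cols = [key] + [f for f in fields if f != key]
    return f"SELECT {', '.join(cols)} FROM {table}"


# Hypothetical result of running DESCRIBE on two participant tables.
table_columns = {
    "participant_0001": {"eid", "p31", "p21022"},
    "participant_0002": {"eid", "p41270"},
}

grouped, missing = group_fields_by_table(table_columns, ["p31", "p41270", "p99999"])
queries = [build_select(table, fields) for table, fields in grouped.items()]

for query in queries:
    print(query)
print("not found:", missing)
```

Each generated string could then be passed to spark.sql() and the resulting data frames joined on eid, as described in the steps above. Keeping the list of wanted fields in one place makes it easier to re-run when new fields are added.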

Task 2: Extracting a subset of SNPs for all samples from the exome data

I have a site VCF including ~1M variants of interest, and I'd like to get a BCF including genotypes for all individuals at these sites from the 200k exome release.

The only way I could figure out how to do this was:

Write a Dockerfile to make an image that includes bcftools, build it on cloud_workstation, and save it using dx upload
Write a WDL script, and compile it with dxCompiler, to extract the SNPs from a set of VCF files with bcftools isec (streaming the VCFs/indexes etc.) and merge them into a BCF
Write a bash script to generate and run ~100 job submission commands to run this script (each run being given 10 of the exome region chunks)
Run another job to download these output BCFs, merge them together and upload the result
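
The batching in the third step above (10 region chunks per job) can be sketched as a small helper. The chunk file names here are made up for illustration and will not match the real exome file names; the example also uses 25 files rather than the full set to keep the output short.

```python
def batches(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical chunk file names; the real exome file names will differ.
chunk_files = [f"chunk_{i:04d}.vcf.gz" for i in range(1, 26)]

jobs = batches(chunk_files, 10)
for n, job in enumerate(jobs, start=1):
    print(f"job {n}: {len(job)} files, {job[0]} .. {job[-1]}")
```

Each batch would then become the input list for one job submission.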

Can I be doing these things more simply?

So, this all feels like I am overcomplicating things somehow. Previously these would have been just one or two short commands using our old UKB files: for the first, it would be reading a few big CSVs into R with fread and then merging and subsetting using tidyverse; for the second, it would be setting off a small job array script with a few bcftools commands using SLURM. But on RAP everything seems like a multi-stage process involving different servers, filesystems and programs.

Are the ways I am approaching these tasks on the RAP sensible, or have I gone wrong somewhere? Do people have tricks/workarounds/existing workflows/images etc. for doing these sorts of common tasks quickly? Or do you just get used to working in these systems eventually and adjust to the different way of doing things?

Comments


Chai Fungtammasan DNAnexus Team
- 11 October 2023 17:36
For #1, dx extract_dataset might work better for you. In the thread below, you can see multiple options for extracting data. The discussion is a bit long. IMO, for fast turnaround, dx extract_dataset would work best. It can be run anywhere that has dx-toolkit installed, whether local or cloud.
https://community.dnanexus.com/s/question/0D5t000004SBm0eCAD/query-of-the-week-1-export-phenotypic-data-to-a-file
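
As a rough sketch of what a dx extract_dataset call might look like, the snippet below only builds and prints the command; it does not run it. The dataset identifier, the field names and the output file name are placeholders, not values from this thread, and the helper function is hypothetical.

```python
import shlex


def extract_dataset_cmd(dataset, fields, output):
    """Build (but do not run) a `dx extract_dataset` command line."""
    return [
        "dx", "extract_dataset", dataset,
        "--fields", ",".join(fields),   # comma-separated entity.field names
        "--output", output,
    ]


# Placeholder dataset ID and example field names; replace with your own.
fields = ["participant.eid", "participant.p31", "participant.p21022"]
cmd = extract_dataset_cmd("project-xxxx:record-yyyy", fields, "pheno.csv")

print(shlex.join(cmd))
# To execute it where dx-toolkit is installed and you are logged in:
#   import subprocess; subprocess.run(cmd, check=True)
```

Keeping the field list in a script like this means the export can be re-run whenever new fields of interest are added.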

For #2, you can speed things up for both data retrieval and processing.
Retrieval
- The SNPs for WES are in the cohort browser, so you may be able to get them from there first without having to access the bulk files. The tips on extracting data from the cohort browser are also in the post that I shared above.
- If you want to use data from the bulk files and you aren't sure which files contain your variants of interest, you may find this tool useful: https://github.com/dnanexus-rnd/multi-tabix Note that you should compile with Cargo rather than rustc.
Processing
- You may use Swiss-army-knife, since those common tools are already installed.
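
A minimal sketch of the Swiss-army-knife suggestion: the snippet builds one dx run command that passes input files with -iin and a shell command with -icmd. The file names and destination folder are hypothetical, and the bcftools view -R call is just one possible way to subset sites (the original poster used bcftools isec), so treat it as an assumption rather than a recommended recipe.

```python
import shlex


def swiss_army_knife_cmd(inputs, shell_cmd, destination):
    """Build (but do not run) a `dx run swiss-army-knife` command line."""
    argv = ["dx", "run", "swiss-army-knife"]
    for path in inputs:
        argv.append(f"-iin={path}")      # one -iin per input file
    argv += [f"-icmd={shell_cmd}", "--destination", destination, "-y"]
    return argv


# Hypothetical file names and output folder.
inputs = ["sites.vcf.gz", "chunk_0001.vcf.gz", "chunk_0001.vcf.gz.tbi"]
shell_cmd = "bcftools view -R sites.vcf.gz -Ob -o chunk_0001.bcf chunk_0001.vcf.gz"
cmd = swiss_army_knife_cmd(inputs, shell_cmd, "/subset_bcf/")

print(shlex.join(cmd))
```

Because bcftools is already available in that app, this would avoid building a custom Docker image and compiling a workflow just to run a single bcftools command.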
If you are new to the platform, you may find the recorded webinar helpful: https://www.youtube.com/watch?v=uT_jD1Ey3Fk&list=PLRkZ0Fz-n3Z6ku1U9V_C2bV5kqafRwrY7
Rachael W UKB Community team Data Analyst
- 12 October 2023 07:30
Have you seen these DNAnexus notebooks? https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb

There are also some newly-released example notebooks on GitHub that you might find useful, see https://github.com/UK-Biobank/UKB-RAP-Notebooks
