Query of the week #3: Bring your custom script to analyse UKB data

Ondrej Klempir DNAnexus Team

Let's imagine a situation where we would like to run something that is not on the platform yet, but we don't have time to dive into how to build an app or use Docker.

 

Typically, you have a simple script that you developed and tested locally, and you would like to use it to analyse UKB data on RAP. However, since you do not plan to repeat this analysis, you would like to avoid turning it into an applet or workflow, which would require additional development steps such as writing dxapp.json, configuring and building the applet, etc. Instead, we are interested in how to submit a script that we have prepared in advance.

 

This post focuses on simplified examples (certainly not an exhaustive list of RAP tools) where you just copy and paste your code or submit a shell script. The methods presented are: ttyd, Cloud Workstation, and running a script via Swiss Army Knife. These concepts are of course applicable to any other custom analysis of UKB data using a script.

 

Today we will try to run a program to query Exome CRAM data. For my experiment, I chose data from Bulk > Exome sequences > Exome OQFE CRAM files. As a prerequisite, I prepared a test BAM file by converting a CRAM file to BAM. This is a fairly simple bioinformatic step, described for example here: https://community.dnanexus.com/s/question/0D5t000004SDQrMCAX/how-can-i-convert-cram-file-to-bam-file-using-swiss-army-knife
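For readers who have not done the CRAM to BAM conversion before, here is a minimal sketch using samtools. It assumes samtools is installed and that you have the reference FASTA the CRAM was encoded against. The function name cram_to_bam and all file names are hypothetical, and this is not necessarily the exact command used for the test file in this post.

```shell
# Convert a CRAM file to BAM with samtools (sketch; assumes samtools is on PATH).
# Usage: cram_to_bam INPUT.cram REFERENCE.fa OUTPUT.bam
cram_to_bam() {
    cram="$1"; ref="$2"; bam="$3"
    # -b: write BAM, -T: reference FASTA used to decode the CRAM, -o: output file
    samtools view -b -T "$ref" -o "$bam" "$cram" || return 1
    # index the BAM (optional, but many downstream tools expect an index)
    samtools index "$bam"
}
```

A call would then look like: cram_to_bam sample.cram reference.fa sample.bam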

 

As for today's querying tool itself, I decided to try samql:

https://github.com/maragkakislab/samql/

It is a SQL-like query language for the SAM/BAM file format. We will try to get consistent results for a specific query using all three methods.

 

This is the script that will execute the required logic:

 

# download the samql executable for Linux (release 1.6) and set execute permissions

wget -nv https://github.com/maragkakislab/samql/releases/download/1.6/samql_linux_amd64 && chmod 775 samql_linux_amd64

# test if this works and get help menu

./samql_linux_amd64 -h

# download testing BAM file from my dnax parent project folder

dx download /user/oklempir/ukb_X_23143tobam_0_0.bam

# using samql, return count of reads

./samql_linux_amd64 --where "(RNAME = 'chr1' AND POS > 1000000)" --count ukb_X_23143tobam_0_0.bam

 

ttyd

 

ttyd is a great app that gives you a terminal on a remote machine, accessible via a web browser. I mostly use it for prototyping, and also for deploying web applications on the ports that ttyd exposes. It is well described here: https://ukbiobank.dnanexus.com/app/ttyd

Another advantage for my work is that it uses dxfuse, i.e. the parent dnax project is mounted on the worker. One tip - don't forget to terminate it when you're done with your work! :)
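Because the project is mounted, a file can in principle be read in place instead of being downloaded first. The sketch below assumes the mount point is /mnt/project (please verify this on the app page). The helper name mounted_path and the PROJECT_MOUNT variable are hypothetical.

```shell
# Assumption: the parent project is mounted at /mnt/project; override if it differs.
PROJECT_MOUNT="${PROJECT_MOUNT:-/mnt/project}"

# Print the path of a project file as seen through the dxfuse mount.
# Usage: mounted_path /folder/in/project/file.bam
mounted_path() {
    # strip a leading slash from the project path, then prefix the mount point
    printf '%s/%s\n' "$PROJECT_MOUNT" "${1#/}"
}

# Example: inspect the test BAM through the mount without downloading it
# ls -lh "$(mounted_path /user/oklempir/ukb_X_23143tobam_0_0.bam)"
```

The resulting path could be passed to samql in place of a downloaded copy. I have not benchmarked whether that is faster than dx download for a file of this size.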

 

As soon as the web terminal was ready, I simply copied and pasted my script and waited about a minute for the results. Here is a screenshot from the ttyd environment:

 

[Screenshot 2023-03-03 at 14.28.40]

If you generate results that you want to keep on RAP, copy the files to the project using the "dx upload" command.
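As a minimal sketch of that upload step, the helper below wraps dx upload. The function name save_result and the folder used in the example are hypothetical; the --path, -p and --brief options are standard dx upload options.

```shell
# Upload a result file into a folder of the current project (sketch).
# Usage: save_result LOCAL_FILE PROJECT_FOLDER
save_result() {
    file="$1"; folder="$2"
    # --path sets the destination (the trailing slash means "into this folder"),
    # -p creates the folder if it does not exist, --brief prints only the new file ID
    dx upload "$file" --path "${folder%/}/" -p --brief
}
```

For example, save_result output_count.txt /results would place the file in a /results folder of the project.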

 

Cloud Workstation (https://documentation.dnanexus.com/developer/cloud-workstation)

 

Working with the Cloud Workstation is quite similar to working with ttyd, with some differences. Instead of a web terminal we get a Byobu terminal multiplexer, and we can connect to it via ssh. What's cool is that it supports snapshots, which is a big advantage if you want to save your environment for future reuse. On the other hand, the project is not mounted via dxfuse, so to interact with the project's directories it is necessary to set up a workspace (https://documentation.dnanexus.com/developer/cloud-workstation#step-3-setting-up-the-workspace), i.e. to run at least these two commands:

 

unset DX_WORKSPACE_ID

dx cd $DX_PROJECT_CONTEXT_ID:
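The two commands can be wrapped in a small helper that also prints the resulting location, which makes it easy to confirm that dx now points at the parent project. This is only a sketch; the function name use_parent_project is hypothetical, and DX_PROJECT_CONTEXT_ID is assumed to be set by the platform inside the job.

```shell
# Point the dx client at the parent project instead of the job's temporary workspace.
use_parent_project() {
    # stop dx from defaulting to the job workspace
    unset DX_WORKSPACE_ID
    # switch to the root of the project the job was launched from
    dx cd "$DX_PROJECT_CONTEXT_ID:" || return 1
    # print the current project and folder as a sanity check
    dx pwd
}
```

After calling use_parent_project, commands such as dx download and dx upload should operate on the project's folders.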

 

After setting up the environment, I copied and pasted my code into the prompt. A screenshot from my session follows:

 

Screenshot 2023-03-03 at 15.01.42 

Run a bash script via Swiss Army Knife

 

I assume that many of you know the Swiss Army Knife app (https://ukbiobank.dnanexus.com/app/swiss-army-knife). I would like to demonstrate that it can be used to run our bash script, which is a good fit for batch processing. I will only show the simplest scenario here for inspiration. The script could of course be written better, for example by taking its inputs as parameters instead of hard-coding them.

 

To create the bash script, I simply saved my code as a text file and uploaded it to my DNAnexus UKB project.

 

"cat samql_script" prints the content of my samql_script:

 

wget -nv https://github.com/maragkakislab/samql/releases/download/1.6/samql_linux_amd64 && chmod 775 samql_linux_amd64 # download the samql executable for Linux (release 1.6) and set execute permissions

./samql_linux_amd64 -h

./samql_linux_amd64 --where "(RNAME = 'chr1' AND POS > 1000000)" --count ukb_X_23143tobam_0_0.bam > output_count.txt

rm samql_linux_amd64
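As one possible way to avoid the hard-coded values, the sketch below takes the BAM file, the WHERE clause and the output file as arguments. The function name count_reads and the SAMQL_URL variable are hypothetical. The final rm mirrors the script above; I assume its purpose is to keep the binary out of the job outputs.

```shell
# Sketch of samql_script with inputs passed as arguments instead of hard-coded.
# Usage: bash samql_script INPUT.bam "WHERE CLAUSE" OUTPUT.txt
SAMQL_URL="https://github.com/maragkakislab/samql/releases/download/1.6/samql_linux_amd64"

count_reads() {
    bam="$1"; where="$2"; out="$3"
    # download the samql executable and make it executable
    wget -nv "$SAMQL_URL" || return 1
    chmod 775 samql_linux_amd64
    # count the reads matching the WHERE clause and store the result
    ./samql_linux_amd64 --where "$where" --count "$bam" > "$out"
    status=$?
    # remove the binary again (assumed reason: keep it out of the job outputs)
    rm -f samql_linux_amd64
    return "$status"
}

# run only when exactly three arguments are supplied
if [ "$#" -eq 3 ]; then
    count_reads "$@"
fi
```

The query from this post would then be run as: bash samql_script ukb_X_23143tobam_0_0.bam "(RNAME = 'chr1' AND POS > 1000000)" output_count.txt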

 

[Screenshot 2023-03-03 at 14.39.06]

[Screenshot 2023-03-03 at 14.54.51]
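A Swiss Army Knife job like this one can also be submitted from the command line with dx run. The sketch below passes both the BAM file and the script as inputs (-iin) and runs the script via the command input (-icmd). The function name submit_samql and the script location in the example are hypothetical. Note that samql_script as written expects the BAM to be named ukb_X_23143tobam_0_0.bam.

```shell
# Submit samql_script to Swiss Army Knife from the command line (sketch).
# Usage: submit_samql PROJECT_PATH_TO_BAM PROJECT_PATH_TO_SCRIPT
submit_samql() {
    bam="$1"; script="$2"
    # -iin adds an input file (repeatable), -icmd is the command the app executes,
    # -y skips the confirmation prompt, --brief prints only the job ID
    dx run swiss-army-knife \
        -iin="$bam" \
        -iin="$script" \
        -icmd="bash $(basename "$script")" \
        -y --brief
}
```

For example: submit_samql /user/oklempir/ukb_X_23143tobam_0_0.bam /user/oklempir/samql_script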

To conclude this week's query, all three methods produced the same COUNT results.

 

Comments

  • Ondrej Klempir DNAnexus Team

    In the text above, I did not mention the (Spark-based) JupyterLab, which has an interactive notebook interface as well as a terminal. Does any of you have an idea how to run your own script on Spark, and what steps would be needed to get the output file into the DNAnexus project?
