Best way to run the same command on all CRAM files?
Hi everyone,
Sorry for the long post.
Aside from apparently not being able to avoid downloading files by prepending '/mnt/project' to file paths (see my other question here:
https://community.dnanexus.com/s/question/0D5t000003yJQvECAW/how-to-use-dxfuse-style-file-paths-in-swiss-army-knife) , I have finally managed to run the two commands I need for a single input file. First, I can retrieve the idxstats with:
dx run swiss-army-knife \
-iin="/Bulk/Whole genome sequences/Whole genome CRAM files/60/6024088_23193_0_0.cram" \
-icmd='samtools idxstats "$in_name" > "${in_prefix}.tsv"' \
--instance-type "mem1_ssd1_v2_x16" \
--destination <projectID>:<out_fd1> \
-y --brief
(obviously using my own project ID)
and extract the number of reads aligning to a particular region of interest with:
dx run swiss-army-knife \
-iin="/Bulk/Whole genome sequences/Whole genome CRAM files/60/6024088_23193_0_0.cram" \
-iin="/Bulk/Whole genome sequences/Whole genome CRAM files/60/6024088_23193_0_0.cram.crai" \
-icmd='samtools view -c "${in_name[0]}" <region> > "${in_prefix[0]}.txt"' \
--instance-type "mem1_ssd1_v2_x16" \
--destination <projectID>:<out_fd2> \
-y --brief
I have also managed to retrieve a list of all available CRAM files and their corresponding folder (although in a quite convoluted way, since I haven't yet found a way to use wildcards on `dx ls`).
I would now need to run these same two commands for each and every one of the 200,028 CRAM files. I know I could "simply" enclose each dx run call within a for loop, where each iteration reads a row from the file list and uses it in the -iin parameters.
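The per-file loop described here could be sketched roughly as follows. This is only a minimal sketch, not a tested submission script: the helper names (`build_idxstats_command`, `iter_cram_paths`) and the file-list format (one CRAM path per line) are assumptions, while the `dx run` arguments are copied from the single-file idxstats command earlier in this post. It only builds and prints the command lines; nothing is submitted.

```python
import shlex


def build_idxstats_command(cram_path, destination):
    """Return the argv list for one swiss-army-knife idxstats job."""
    return [
        "dx", "run", "swiss-army-knife",
        # No shell quoting needed here: each list item is one argument.
        f"-iin={cram_path}",
        # Same command string as in the single-file example.
        '-icmd=samtools idxstats "$in_name" > "${in_prefix}.tsv"',
        "--instance-type", "mem1_ssd1_v2_x16",
        "--destination", destination,
        "-y", "--brief",
    ]


def iter_cram_paths(lines):
    """Yield non-empty, stripped paths from an iterable of lines."""
    for line in lines:
        path = line.strip()
        if path:
            yield path


if __name__ == "__main__":
    # Hypothetical file list; in practice this would be read from a file.
    example_list = [
        "/Bulk/Whole genome sequences/Whole genome CRAM files/60/6024088_23193_0_0.cram\n",
        "\n",
    ]
    for cram in iter_cram_paths(example_list):
        argv = build_idxstats_command(cram, "<projectID>:<out_fd1>")
        # Print a shell-safe version of the command for inspection.
        print(shlex.join(argv))
```

To actually submit, each `argv` list could be passed to `subprocess.run(argv, check=True)`; printing first makes it easy to inspect a few commands before launching thousands of jobs.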
This should work, but I refuse to receive 2x200k emails notifying me of completed jobs. I asked in the other post whether there is a way to avoid receiving emails, but I haven't received any answer so far. No parameter of `dx run` appears to control this, and I haven't found any global setting for it. (If it's not currently possible to avoid sending emails, please introduce that option immediately!).
As the most logical alternative, I've been exploring how to launch batch jobs with `dx run` (as explained in https://documentation.dnanexus.com/user/running-apps-and-workflows/running-batch-jobs), but I must admit I feel even more confused than before reading the documentation.
Is it possible to use --batch-tsv with swiss-army-knife? If so, do the columns in the tsv need to be named in any particular manner? Are the names in the example arbitrary and then bwa deduces which is which, or do they coincide with particular named arguments? If so, what about positional arguments?
(Incidentally, generate_batch_inputs failed to generate a tsv file from the CRAM and their corresponding crai files, but I should be able to generate a suitable tsv without issue as long as I know the expected format).
In the examples, however, swiss-army-knife only appears within a for loop, which worries me.
Any help would be deeply appreciated.
Cheers,
Fran
Comments
C: "I have also managed to retrieve a list of all available CRAM files and their corresponding folder (although in a quite convoluted way, since I haven't yet found a way to use wildcards on `dx ls`)."
--> Did you try to list the files via "dx find data"? Check "dx find data -h" for more information.
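As a rough sketch of this suggestion, the list of CRAM paths could be built from `dx find data` output. Two things are assumed here and should be checked against `dx find data -h` for your dx-toolkit version: that the `--class`, `--name`, `--path` and `--json` flags behave as used below, and that each JSON result carries a `describe` object with `folder` and `name` fields. The helper names are hypothetical.

```python
import json


def find_data_argv(project_folder, pattern="*.cram"):
    """Build a `dx find data` command line (flags assumed; see `dx find data -h`)."""
    return [
        "dx", "find", "data",
        "--class", "file",
        "--name", pattern,
        "--path", project_folder,
        "--json",
    ]


def paths_from_find_json(json_text):
    """Turn the JSON results into sorted 'folder/name' paths.

    Assumes each result looks like {"id": ..., "describe": {"folder": ..., "name": ...}}.
    """
    paths = []
    for result in json.loads(json_text):
        desc = result["describe"]
        folder = desc["folder"].rstrip("/")  # the root folder "/" becomes ""
        paths.append(f"{folder}/{desc['name']}")
    return sorted(paths)


if __name__ == "__main__":
    # Hypothetical output, standing in for the real `dx find data --json` result.
    sample = json.dumps([
        {"id": "file-xxxx", "describe": {"folder": "/Bulk/60", "name": "b.cram"}},
    ])
    print(find_data_argv("<projectID>:/Bulk"))
    print(paths_from_find_json(sample))
```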
C: "I refuse to receive 2x200k emails notifying me of completed jobs. I asked in the other post whether there is a way to avoid receiving emails, but I haven't received any answer so far. No parameter of `dx run` appears to control this, and I haven't found any global setting for it. (If it's not currently possible to avoid sending emails, please introduce that option immediately!)."
--> I was able to use the following API call to turn off the email notifications for my RAP user: https://documentation.dnanexus.com/developer/api/users#api-method-user-xxxx-update
I used option "never": do not email the user about successful or failed executions. See the details below:
~ [1]> dx describe user-myuser --verbose --json | grep email
"email": "myuser@template.com",
"emailWhenJobComplete": "always",
~> dx api user-myuser update '{"policies": {"emailWhenJobComplete": "never"}}'
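The same update could presumably be made from Python. The sketch below assumes the dxpy package is installed and that you are logged in; `dxpy.api.user_update` is the binding that corresponds to the `user-xxxx/update` API method, and the helper names are hypothetical. Only the two values that appear in this thread ("always" and "never") are accepted by the helper, although the API may allow others.

```python
import json


def email_policy_payload(setting):
    """Build the input for the user-xxxx/update API call."""
    # Only the values seen in this thread; the API may accept more.
    if setting not in ("always", "never"):
        raise ValueError(f"unsupported setting: {setting!r}")
    return {"policies": {"emailWhenJobComplete": setting}}


def set_email_policy(user_id, setting):
    """Apply the policy through dxpy (requires dxpy and a valid login)."""
    import dxpy  # imported lazily so the payload helper works without dxpy

    return dxpy.api.user_update(user_id, email_policy_payload(setting))


if __name__ == "__main__":
    # Prints the JSON that the `dx api ... update` command above takes as input.
    print(json.dumps(email_policy_payload("never")))
```

Calling `set_email_policy("user-myuser", "always")` after the batch has finished would restore the previous behaviour.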
I really like interacting with the DNAnexus platform through its API in Python. This can be done with the DNAnexus Python bindings (dxpy package, dxpy.api module).
The documentation is available here: http://autodoc.dnanexus.com/bindings/python/current/index.html
To see the input and output of each API call, including the server address, you can set the _DX_DEBUG environment variable together with the desired command, as in this example:
_DX_DEBUG=2 dx api user-myuser describe
A description of _DX_DEBUG is available at https://github.com/dnanexus/dx-toolkit/blob/master/src/python/Readme.md#debugging
Thanks, Ondrej.
I managed to find a way around it by making dxFUSE finally work (https://community.dnanexus.com/s/question/0D5t000003yJQvECAW/how-to-use-dxfuse-style-file-paths-in-swiss-army-knife), so that I could then run subsets of the files.
If I need to submit again, I will certainly look into using the API to deactivate emails.