How to use --batch-tsv with Swiss Army Knife?
Hi,
I am trying to subset some VCF files, and I need to pass in both each VCF and its corresponding tbi file. However, when I check the logs for my job runs, I see the error: could not retrieve index file for 'XX.diploidSV.vcf.gz'. It seems that only the VCF file is being downloaded and not the tbi file. Here is my code:
dx generate_batch_inputs --path "${projectid}:/Bulk/Whole\ genome\ sequences/Manta-called\ scored\ structural\ variant\ and\ indel\ candidates/10" -iin="(.*)_diploidSV.vcf.gz$" -iin2="(.*)_diploidSV.vcf.gz.tbi$"
head -n 1 dx_batch.0000.tsv > temp.tsv && tail -n +2 dx_batch.0000.tsv | awk '{sub($4, "[&]"); print}' | awk '{sub($5, "[&]"); print}' >> temp.tsv; tr -d '\r' < temp.tsv > new.tsv; rm temp.tsv
The batch file now looks like this (five tab-separated columns: batch ID, in, in2, in ID, in2 ID):
batch ID in in2 in ID in2 ID
XX XX_0_0.diploidSV.vcf.gz XX_0_0.diploidSV.vcf.gz.tbi [project-ID:file-ID] [project-ID:file-ID]
So I have IDs for both the vcf and tbi files. Then I try this:
dx run swiss-army-knife --batch-tsv test.tsv \
-icmd='bcftools view -r chr1:1-100000 * -O z > "$in_prefix"-filtered.vcf.gz' -y --brief --priority normal \
--instance-type mem1_ssd1_v2_x2 --destination "${projectid}:/SV_10/" --detach
Is there a way to make swiss army knife read all the inputs in the batch-tsv?
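The bracket-wrapping step in the question can also be done by field number rather than with `sub()`, which treats the ID as a regular expression and splits on any whitespace. The following is only a sketch, assuming a five-column, tab-separated file as produced by `dx generate_batch_inputs` with the file IDs in columns 4 and 5; `wrap_id_columns` is a hypothetical helper name, not part of the dx toolkit.

```shell
#!/bin/sh
# wrap_id_columns FILE
# Print FILE with columns 4 and 5 of every data row wrapped in square
# brackets. The header row is passed through unchanged and carriage
# returns are stripped, mirroring the one-liner in the question.
# Assumption: FILE is tab-separated and has the IDs in columns 4 and 5.
wrap_id_columns() {
  tr -d '\r' < "$1" | awk 'BEGIN { FS = OFS = "\t" }
    NR == 1 { print; next }                      # keep the header as-is
    { $4 = "[" $4 "]"; $5 = "[" $5 "]"; print }  # bracket both ID columns
  '
}

# Usage (file names taken from the question):
#   wrap_id_columns dx_batch.0000.tsv > new.tsv
```

Splitting on tabs keeps the two-word header names such as `batch ID` intact as single fields.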
Comments
As a non-batch test, are you able to get one of your Swiss Army Knife jobs to reach the Done state? I mean just one VCF and its tbi in non-batch mode.
In your dx generate_batch_inputs command, I noticed the digit "2" in -iin2. I think this is incorrect syntax. As I understand it, SAK takes all of its input files through "-iin", regardless of the position of the input, so try using just -iin for the tbi as well.
@Ondrej Klempir I have tried "-iin" also for the tbi, and it rewrites the column leaving the tbi files only (it looks like there can only be one iin column). Is there another way to overcome this?
Thanks
Hi Ana
If you give the inputs different names, they will both be written on one line (i.e. the code below).
For more information on running batch jobs please see: https://documentation.dnanexus.com/user/running-apps-and-workflows/running-batch-jobs
dx generate_batch_inputs --path "/Bulk/Previous WGS releases/GATK and GraphTyper WGS/Manta-called scored structural variant and indel candidates [Vanguard 50k release]/10" -ivcf="(.*).diploidSV.vcf.gz$" -itbi="(.*).diploidSV.vcf.gz.tbi$"
Thank you for getting in touch, hope this helps.
Hi,
Has this issue been solved? I have been struggling a lot with exactly the same problem. I am trying to run bcftools through Swiss Army Knife in batch mode. I can run it successfully in non-batch mode using this command:
dx run app-J2fv5P89f2ZFbj52533QZKPG -iin="PCMT1_AgingDisease:/Bulk/DRAGEN\ WGS/Whole\ genome\ SV\ call\ files\ \(DRAGEN\)\ [500k\ release]/12/XXXXXXX_24059_0_0.dragen.sv.vcf.gz" -iin="PCMT1_AgingDisease:/Bulk/DRAGEN\ WGS/Whole\ genome\ SV\ call\ files\ \(DRAGEN\)\ [500k\ release]/12/XXXXXXX_24059_0_0.dragen.sv.vcf.gz.tbi" -icmd="bcftools view -r chr6:149749695-149811421 -o PCMT1_SV_b.vcf XXXXXXX_24059_0_0.dragen.sv.vcf.gz" -imount_inputs="true" --priority normal --instance-type '{"main" : "mem1_hdd1_v2_x4"}' -y --brief
However, I have not managed to run it in batch mode through 'dx generate_batch_inputs', despite trying many approaches.
I used this command to generate batch file: dx generate_batch_inputs --path "PCMT1_AgingDisease:/SV_VCFs" -ivcf="^(.*)_0_0\.dragen\.sv\.vcf\.gz$" -itbi="^(.*)_0_0\.dragen\.sv\.vcf\.gz.tbi$"
and my batch file looks like this (just copying header):
batch ID tbi vcf tbi ID vcf ID
(and I have processed the file further to make sure that the file IDs are in square brackets, as described on the documentation page here: https://documentation.dnanexus.com/user/running-apps-and-workflows/running-batch-jobs)
However, using this batch file to run the app with the following command:
dx run app-J2fv5P89f2ZFbj52533QZKPG --batch-tsv new.tsv -icmd="bcftools view -r chr6:149749695-149811421 -o PCMTsv.vcf *" -imount_inputs="true" --priority normal --instance-type '{"main" : "mem1_hdd1_v2_x4"}' --destination "/results/" --detach --tag "count" -y --brief --batch-folders
gave me this error:
Exception: Mismatch in number of launch_args vs. batch_ids (0 != 5)
I understand that the last four columns of the batch TSV header should only be named in and in ID, so I tried that, with a header that looks like this (one in / in ID pair of columns for the vcf.gz and another in / in ID pair for the vcf.gz.tbi):
batch ID in in in ID in ID
(BTW, I made this file using the command dx generate_batch_inputs --path "PCMT1_AgingDisease:/SV_VCFs" -iin="^(.*)_0_0\.dragen\.sv\.vcf\.gz$" -iinb="^(.*)_0_0\.dragen\.sv\.vcf\.gz.tbi$" and then manually edited the header, replacing inb with in, because the generate_batch_inputs command only accepts one -iin.)
I was able to submit my jobs with this batch TSV file, but the jobs failed with the error: Failed to read from XXXXXXX_24059_0_0.dragen.sv.vcf.gz: could not load index
I got this error even though both the vcf.gz and vcf.gz.tbi file IDs are present in the batch TSV file under the header (in). When running Swiss Army Knife in non-batch mode (as shown in the command above), using -iin for both the vcf.gz and the vcf.gz.tbi file worked. Only the batch run has this problem.
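One thing that may be worth testing is a batch file with a single `in` column whose `in ID` cell lists both files for a sample, instead of two columns that share the name `in`. This is an untested assumption about how `--batch-tsv` parses array inputs and nothing in this thread confirms it, so try it on one row before launching a full batch. The sketch below assumes a five-column, tab-separated file with names in columns 2 and 3 and IDs (bracketed or not) in columns 4 and 5. `merge_in_columns` is a hypothetical helper name, and the layout of the merged name column is a guess.

```shell
#!/bin/sh
# merge_in_columns FILE
# Collapse the two name columns and the two ID columns of a dx batch TSV
# into one "in" / "in ID" pair, so each row carries a single array value.
# ASSUMPTION (unverified): --batch-tsv accepts "[id1,id2]" for an
# array:file input such as Swiss Army Knife's "in".
merge_in_columns() {
  tr -d '\r' < "$1" | awk 'BEGIN { FS = OFS = "\t" }
    NR == 1 { print "batch ID", "in", "in ID"; next }  # new 3-column header
    {
      gsub(/^\[|\]$/, "", $4)   # drop brackets added by an earlier step
      gsub(/^\[|\]$/, "", $5)
      print $1, $2 "," $3, "[" $4 "," $5 "]"
    }
  '
}

# Usage (file names are illustrative):
#   merge_in_columns new.tsv > merged.tsv
#   head -n 2 merged.tsv > one_row.tsv   # test with a single sample first
```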
I have spent a lot of time trying to solve this issue without success. I would be grateful if someone could provide some help or technical support.
Thank you
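Since passing `-iin` twice is reported above to work in non-batch mode, one fallback is to skip `--batch-tsv` and launch one ordinary job per row of the generated TSV. The following is a sketch under stated assumptions, not a tested recipe. It assumes a five-column, tab-separated file (batch ID, VCF name, tbi name, VCF ID, tbi ID) and file names without spaces. `print_run_commands` is a hypothetical helper name, and the bcftools command and output name are illustrative. It only prints the commands so they can be inspected before anything is submitted.

```shell
#!/bin/sh
# print_run_commands FILE REGION
# Emit one non-batch `dx run` command per data row of FILE, passing the
# VCF and its index as two separate -iin inputs. Nothing is submitted:
# review the output, then pipe it to `sh` to launch the jobs.
print_run_commands() {
  region=$2
  tr -d '\r' < "$1" | tail -n +2 |
  while IFS="$(printf '\t')" read -r batch_id vcf_name tbi_name vcf_id tbi_id; do
    # Strip square brackets if the IDs were wrapped for --batch-tsv.
    vcf_id=$(printf '%s' "$vcf_id" | tr -d '[]')
    tbi_id=$(printf '%s' "$tbi_id" | tr -d '[]')
    printf 'dx run swiss-army-knife -iin="%s" -iin="%s" -icmd="bcftools view -r %s -O z -o %s-filtered.vcf.gz %s" -y --brief\n' \
      "$vcf_id" "$tbi_id" "$region" "$batch_id" "$vcf_name"
  done
}

# Usage (file name and region are illustrative):
#   print_run_commands new.tsv chr1:1-100000          # inspect the commands
#   print_run_commands new.tsv chr1:1-100000 | sh     # submit them
```

Naming the VCF explicitly in the bcftools command, rather than using `*`, keeps the `.tbi` file from being passed to bcftools as a second input.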