SAK: bcftools batch job error with reference loading for pVCFs
Hello,
Please can I request assistance with running batch scripts in the command line environment? I am having trouble loading the reference sequence with the list of vcfs and running my script when including the tbi column in the tsv.
I have obtained a batch tsv listing vcfs and corresponding tbi files:
dx generate_batch_inputs --path "project-XXXXXX:/Bulk/Exome sequences_Previous exome releases/Population level exome OQFE variants, pVCF format - interim 200k release/" -ivcf="ukbXXXXX_c1_(.*).gz$" -itbi="ukbXXXXX_c1_(.*).gz.tbi$" -otest
Adapted the file:
head -n 1 test.0000.tsv > temp.tsv && tail -n +2 test.0000.tsv | awk '{sub($4, "[" $4 "]"); sub($5, "[" $5 "]"); print}' >> temp.tsv; tr -d '\r' < temp.tsv > new.tsv; rm temp.tsv
Printed only 2 files from the tsv for testing purposes:
head -3 new.tsv > new_first2.tsv
Script run (sh test_run):
#!/bin/bash
dx run swiss-army-knife \
--batch-tsv new_first2.tsv \
-iin="project-XXXXXX:file-G572Pj8JykJZ52BP4z6GqB21" \
-iin="project-XXXXXX:file-G572P18JykJzBBP44pQz4P89" \
-iin="project-XXXXXX:file-G572PkjJykJk0j6Y4k2GGy9z" \
-icmd='bcftools norm -f "project-XXXXXX:file-G572Pj8JykJZ52BP4z6GqB21" -m -any -Oz -o "$vcf_prefix"_norm_decomp.vcf.gz "$vcf_name"' \
--priority normal \
--instance-type mem1_ssd1_v2_x2 \
--destination "project-XXXXXX:/vcf_enquiries/QC’d" \
--tag "norm_decomp"
Without the tbi column (only vcf column) in tsv file, the job submits but returns: Failed to read from GRCh38_full_analysis_set_plus_decoy_hla.fa: unknown file type.
If I do not put the reference fa/fai/dict as ‘-iin’, the job errors with ‘cannot find fai’.
With the tbi and vcf column in the tsv, the job will not submit and returns: Exception: Mismatch in number of launch_args vs. batch_ids (0 != 2).
Any help would be appreciated as I’ve been going around in circles trying to apply multiple fixes from other issues posted on the community pages, but none have worked so far.
Thank you!
Comments
1 comment
Likely the issue experienced here is because SAK wasn't build to handle such complex jobs.
DNAnexus have suggested the following steps to resolve the issue.
1. Generate the batch tsv using the dx generate_batch_inputs command used above:
dx generate_batch_inputs --path "project-XXXXXX:/Bulk/Exome sequences_Previous exome releases/Population level exome OQFE variants, pVCF format - interim 200k release/" -ivcf="ukbXXXXX_c1_(.*).gz$" -itbi="ukbXXXXX_c1_(.*).gz.tbi$" -otest
2. Reformat the tsv with:
tr -d '\r' < test.0000.tsv > tmp ; mv tmp test.0000.tsv
3. Now we create a shell script (let's call it run.sh) like below:
REF_FA="project-GkyXXQjJ7QxzY36kg4Kj2bFb:file-J011qfQJ7QxpG9ggf72Z47y5" # this is the reference fasta
REF_FAI="project-GkyXXQjJ7QxzY36kg4Kj2bFb:file-J011qj8J7QxfpZ934b08BGFk" # we need the .fai of the same reference fasta for bcftools REF_NAME="GRCh38_full_analysis_set_plus_decoy_hla.fa" # use the actual name of the file since Swiss Army Knife only works on local data
VCF_PREFIX="$1"
VCF_NAME="$2"
VCF_FILE_ID="$3"
VCF_OUTPUT="${VCF_PREFIX}_norm_decomp.vcf.gz"
CMD="bcftools norm -f $REF_NAME -m -any -Oz -o $VCF_OUTPUT $VCF_NAME"
dx run swiss-army-knife \
-iin="$REF_FA" \
-iin="$REF_FAI" \
-iin="$VCF_FILE_ID" \
-icmd="$CMD" \
--priority normal \
--instance-type mem1_ssd1_v2_x2 \
--destination="/path/to/output"
4. Put it through a for loop to mimic batch behavior:
sed '1d' test.0000.tsv | while IFS=$'\t' read batch_id tbi vcf tbi_id vcf_id; do bash run.sh "${batch_id/.vcf/}" "$vcf" "$vcf_id"; done
Note: Beware of the numbers of samples to be processed of the test.0000.tsv. It might launch large of number of instances at once and the platform limits 100 jobs simultaneously. So please use either split -l 100 test.0000.tsv chunk_ or refer to this document: https://documentation.dnanexus.com/user/running-apps-and-workflows/running-batch-jobs#batching-multiple-inputs
Once it's launched, please check and monitor whether the instances are launched.
Please sign in to leave a comment.