SAK: bcftools batch job error with reference loading for pVCFs

Laura Chegwidden

Hello,

Please can I request assistance with running batch scripts in the command line environment? I am having trouble loading the reference sequence with the list of vcfs and running my script when including the tbi column in the tsv.

 

I have obtained a batch tsv listing vcfs and corresponding tbi files:

dx generate_batch_inputs --path "project-XXXXXX:/Bulk/Exome sequences_Previous exome releases/Population level exome OQFE variants, pVCF format - interim 200k release/" -ivcf="ukbXXXXX_c1_(.*).gz$" -itbi="ukbXXXXX_c1_(.*).gz.tbi$" -otest

Adapted the file:

head -n 1 test.0000.tsv > temp.tsv && tail -n +2 test.0000.tsv | awk '{sub($4, "[" $4 "]"); sub($5, "[" $5 "]"); print}' >> temp.tsv; tr -d '\r' < temp.tsv > new.tsv; rm temp.tsv

Printed only 2 files from the tsv for testing purposes:

head -3 new.tsv > new_first2.tsv

 

Script run (sh test_run):

#!/bin/bash

dx run swiss-army-knife \

        --batch-tsv new_first2.tsv \

        -iin="project-XXXXXX:file-G572Pj8JykJZ52BP4z6GqB21" \

        -iin="project-XXXXXX:file-G572P18JykJzBBP44pQz4P89" \

        -iin="project-XXXXXX:file-G572PkjJykJk0j6Y4k2GGy9z" \

        -icmd='bcftools norm -f "project-XXXXXX:file-G572Pj8JykJZ52BP4z6GqB21" -m -any -Oz -o "$vcf_prefix"_norm_decomp.vcf.gz "$vcf_name"' \

        --priority normal \

        --instance-type mem1_ssd1_v2_x2 \

        --destination "project-XXXXXX:/vcf_enquiries/QC’d" \

        --tag "norm_decomp"

 

Without the tbi column (only vcf column) in tsv file, the job submits but returns: Failed to read from GRCh38_full_analysis_set_plus_decoy_hla.fa: unknown file type.

If I do not put the reference fa/fai/dict as ‘-iin’, the job errors with ‘cannot find fai’.

With the tbi and vcf column in the tsv, the job will not submit and returns: Exception: Mismatch in number of launch_args vs. batch_ids (0 != 2).

 

Any help would be appreciated as I’ve been going around in circles trying to apply multiple fixes from other issues posted on the community pages, but none have worked so far.

Thank you!

 

Comments

1 comment

  • Comment author
    Louise R

    Likely the issue experienced here is because SAK wasn't build to handle such complex jobs.

    DNAnexus have suggested the following steps to resolve the issue. 
    1. Generate the batch tsv using the dx generate_batch_inputs command used above: 
    dx generate_batch_inputs --path "project-XXXXXX:/Bulk/Exome sequences_Previous exome releases/Population level exome OQFE variants, pVCF format - interim 200k release/" -ivcf="ukbXXXXX_c1_(.*).gz$" -itbi="ukbXXXXX_c1_(.*).gz.tbi$" -otest 
     

    2. Reformat the tsv with:
    tr -d '\r' < test.0000.tsv > tmp ; mv tmp test.0000.tsv


    3. Now we create a shell script (let's call it run.sh) like below:
    REF_FA="project-GkyXXQjJ7QxzY36kg4Kj2bFb:file-J011qfQJ7QxpG9ggf72Z47y5"  # this is the reference fasta
    REF_FAI="project-GkyXXQjJ7QxzY36kg4Kj2bFb:file-J011qj8J7QxfpZ934b08BGFk"  # we need the .fai of the same reference fasta for bcftools REF_NAME="GRCh38_full_analysis_set_plus_decoy_hla.fa"  # use the actual name of the file since Swiss Army Knife only works on local data 
    VCF_PREFIX="$1"
    VCF_NAME="$2"
    VCF_FILE_ID="$3"
    VCF_OUTPUT="${VCF_PREFIX}_norm_decomp.vcf.gz"
    CMD="bcftools norm -f $REF_NAME -m -any -Oz -o $VCF_OUTPUT $VCF_NAME" 
    dx run swiss-army-knife \
       -iin="$REF_FA" \
       -iin="$REF_FAI" \
       -iin="$VCF_FILE_ID" \
       -icmd="$CMD" \
       --priority normal \
       --instance-type mem1_ssd1_v2_x2 \
       --destination="/path/to/output"

    4. Put it through a for loop to mimic batch behavior: 
    sed '1d' test.0000.tsv | while IFS=$'\t' read batch_id tbi vcf tbi_id vcf_id; do bash run.sh "${batch_id/.vcf/}" "$vcf" "$vcf_id"; done


    Note: Beware of the numbers of samples to be processed of the test.0000.tsv. It might launch large of number of instances at once and the platform limits 100 jobs simultaneously. So please use either split -l 100 test.0000.tsv chunk_ or refer to this document: https://documentation.dnanexus.com/user/running-apps-and-workflows/running-batch-jobs#batching-multiple-inputs

    Once it's launched, please check and monitor whether the instances are launched.

     

    0

Please sign in to leave a comment.