How to avoid job failures/restarts?

23 January 2025 11:03
4 comments

Hi all, I have been noticing recently that some jobs I submit via Swiss Army Knife will only succeed after multiple tries. On the tries that fail I get this:

Cause of Failure
The machine running the job was terminated by the cloud provider

I'm not sure how to improve this and save costs on my end. It seems to sometimes also happen even when the jobs are set to high priority. The log file doesn't have an error message, it just cuts off what it is doing and restarts.

Is there anything I can do to minimise the chance of a job failing and restarting itself, losing its progress and increasing my costs?

Comments


Gabriel Doctor

30 January 2025 13:30

Hi Hannah

If you haven't checked whether the instance size is large enough, that's probably a good place to start; also check paths etc. in your script. But if you're anything like me and frequently get booted off the worker, this may help:

I put together a complicated Swiss Army Knife 'wrapper' around a fairly simple "plink --freq" task; this may give you some ideas. Essentially it uploads any output files that I want to save as soon as they are created, rather than waiting until the script completes (the default for SAK), so that if the script is interrupted for whatever reason, at least the work already done is saved. For each input file, it checks whether the corresponding output file has already been uploaded and saved on the platform, and if so it moves to the next input. Thus even if there is a crash, rerunning the script won't have to process everything again. It also allows parallelisation of tasks, and can use both dx-fuse and save directly to the platform.

Gabriel

#####


# example script to retain intermediate files, avoiding reprocessing. 
# g.doctor@ucl.ac.uk

export DXHOME="PROJNAME/Gdoc"
export CHR=$1
export BEDZIP="/plinkbedrange.chr${CHR}/f1.zip"
export OUTFOLDER="$DXHOME/outputs/Graphtyprwgsindex/combinedcoords/chr${CHR}/f1"
export DRAGENPATH="/mnt/project/Bulk/GATK and GraphTyper WGS/GraphTyper population level WGS variants, pVCF format [500k release]/chr$CHR/"
export CORES=4 # select n cores
export raptoken=
export project=

## sort out workspace
unset DX_WORKSPACE_ID
dx cd $DX_PROJECT_CONTEXT_ID:
# wipe all dx env variables out
source ~/.dnanexus_config/unsetenv
dx clearenv
dx login --noprojects --token $raptoken
dx select $project

echo -e "\nBedzip is $BEDZIP"
echo "Output folder is $OUTFOLDER"
unzip $BEDZIP # this creates a local list of files, that have the same 

process_file() {
    BEDPOSFILE="$1"
    # Strip the trailing ".bed" to recover the VCF file name
    VCF="${BEDPOSFILE%.bed}"
    STEM=${VCF%.vcf.gz}
    echo "VCF file is $VCF"
    echo "Stem is $STEM"
    # Set a maximum number of retries
    MAX_RETRIES=4
    RETRY_COUNT=0

    while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
        echo "Check if the afreq file exists in the directory"
        if dx ls "${OUTFOLDER}/${STEM}.afreq" > /dev/null 2>&1; then
            echo "${STEM}.afreq already processed"
            return 0
        else
            echo -e "\nCalculating variants for loci in $VCF"
            plink2 --vcf "$DRAGENPATH/$VCF" \
                --chr "$CHR" \
                --freq  \
                --extract bed1 "f1/${BEDPOSFILE}" \
                --out "$STEM"
           
            PLINK_STATUS=$?   # capture now; every later command overwrites $?
            if [ "$PLINK_STATUS" -eq 0 ]; then
                echo "plink2 command completed successfully."

                echo "Uploading .afreq files and removing them if upload succeeds"
                dx upload -p  "${STEM}.afreq" "${STEM}.log" --path "${OUTFOLDER}/" && \
                echo -e "\n${STEM}.afreq and logfile uploaded" && \
                rm -f "${STEM}.afreq" "${STEM}.log"

                # Confirm the file is really on the platform before reporting success
                if dx ls "${OUTFOLDER}/${STEM}.afreq" > /dev/null 2>&1; then
                    echo -e "${STEM}.afreq successfully processed."
                    return 0
                else
                    echo "File upload failed."
                fi
            else
                echo "plink2 command failed with exit status $PLINK_STATUS."
            fi
            echo -e "Retrying... ($((RETRY_COUNT + 1))/$MAX_RETRIES)"
            ((RETRY_COUNT++))
        fi
    done


    # If retries are exhausted, log failure
    echo -e "Failed to process ${STEM}.afreq after $MAX_RETRIES attempts."
    return 1
}

export -f process_file
# Find files and process them in parallel
find ./ -type f -name "*.bed" -printf "%f\n" | parallel -j "$CORES" process_file  

# this is so that any other files created are not automatically uploaded once the worker completes.
rm -r f1/
rm /home/dnanexus/out/out/*

It is run with this from my command line:

dx run app-swiss-army-knife \
-iin=PROJNAME:Gdoc/scripts/script.sh \
-icmd="bash script.sh 6" \
-y --name "chr6" \
--instance-type XXX \
--priority low
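
The resume-on-rerun idea in the script above can be reduced to a small pattern. The following is only a minimal sketch, not Gabriel's script: remote_exists, save_output and run_once are hypothetical helper names, and a local directory (STORE) stands in for the project output folder so that it runs anywhere. On the platform, the two helpers would presumably wrap `dx ls` and `dx upload` as the script does.

```shell
# Sketch of "skip what is already saved, save each result immediately".
# STORE is a local stand-in (assumption) for the project output folder.
STORE="${STORE:-./saved_outputs}"

# Hypothetical helpers; on the platform these would wrap `dx ls` / `dx upload`.
remote_exists() { [ -e "$STORE/$1" ]; }
save_output()   { mkdir -p "$STORE" && cp "$1" "$STORE/"; }

run_once() {
    input="$1"
    output="$1.out"
    # Already saved on a previous run: nothing to redo.
    if remote_exists "$output"; then
        echo "skip $input"
        return 0
    fi
    # Placeholder for the real work (plink2 in the script above).
    tr 'a-z' 'A-Z' < "$input" > "$output" || return 1
    # Save straight away, then remove the local copy only if the save worked.
    save_output "$output" && rm -f "$output" || return 1
    echo "done $input"
}
```

Calling run_once twice on the same input does the work the first time and skips it the second, which is what makes a restarted job cheap.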

Rachael W UKB Community team Data Analyst
- Edited 03 February 2025 10:21
Hi Hannah,
if the job failures definitely occur even with High Priority, then the issue needs to be investigated individually by the DNAnexus support team.
Please add “org-support” (without quotes) to your UKB-RAP project as a member with VIEW permission. When that is done, please contact DNAnexus support using the Help tab within the UKB-RAP GUI (select Contact support). Describe the issue, and mention that you have added org-support to your project.
You can find more information on project sharing here:
https://documentation.dnanexus.com/getting-started/ui-quickstart#step-2.-add-project-members
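
For those working from the command line, the same sharing step can probably be done with the dx toolkit's `dx invite` command. This is a sketch under assumptions: invite_support is a hypothetical helper, project-xxxx is a placeholder for your own project ID, and the function only prints the command unless DRY_RUN=0 is set.

```shell
# Sketch: give DNAnexus support VIEW access to a project.
# invite_support is a hypothetical helper name, not part of the dx toolkit.
invite_support() {
    project="$1"   # your project ID, e.g. project-xxxx (placeholder)
    if [ "${DRY_RUN:-1}" = "0" ]; then
        # Really send the invitation (requires being logged in with dx).
        dx invite org-support "$project" VIEW
    else
        # Default: only show what would be run.
        echo "dx invite org-support $project VIEW"
    fi
}
```

Running `invite_support project-xxxx` prints the command for checking; `DRY_RUN=0 invite_support project-xxxx` would execute it.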

Thank you for using the forum.

Dr. Mc. Ninja
- 16 May 2025 09:01
https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/24641199060125/comments/24813371215005

So nice, Gabriel Doctor! I wonder if the RAP people could make this the default behaviour!

Eric Kernfeld
- 10 February 2026 22:17
There is a similar tool that does have a default behavior of saving files as they become available: WDL scatter with Smart Reuse.
