How to avoid job failures/restarts?

Hi all, I have noticed recently that some jobs I submit via Swiss Army Knife only succeed after multiple tries. On the tries that fail I get this:

Cause of Failure
The machine running the job was terminated by the cloud provider

I'm not sure how to improve this and save costs on my end. It sometimes happens even when the jobs are set to high priority. The log file doesn't contain an error message; it just cuts off partway through and the job restarts.

 

Is there anything I can do to minimise the chance of a job failing and restarting itself, losing its progress and increasing my costs?

Comments

4 comments

  • Gabriel Doctor

    Hi Hannah

    If you haven't checked whether the instance size is large enough, that's probably a good place to start; also check the paths etc. in your script. But if you're anything like me, I frequently get booted off the worker, so I wrote this:

    I put together a complicated Swiss Army Knife ‘wrapper’ around a fairly simple “plink --freq” task; this may give you some ideas. Essentially, it uploads any output files that I want to save as soon as they are created, rather than waiting until the script completes (the default for SAK), so that if the script is interrupted for whatever reason, at least the work already done is saved. For each input file, it checks whether the corresponding output file is already uploaded and saved on the platform, and if so it moves on to the next input. Thus even if there is a crash, the script won't have to process everything again when it is rerun. It also allows parallelisation of tasks, and can use dx-fuse as well as save directly to the platform.

    Gabriel 

    #####

    
    # example script to retain intermediate files, avoiding reprocessing. 
    # g.doctor@ucl.ac.uk
    
    export DXHOME="PROJNAME/Gdoc"
    export CHR=$1
    export BEDZIP="/plinkbedrange.chr${CHR}/f1.zip"
    export OUTFOLDER="$DXHOME/outputs/Graphtyprwgsindex/combinedcoords/chr${CHR}/f1"
    export DRAGENPATH="/mnt/project/Bulk/GATK and GraphTyper WGS/GraphTyper population level WGS variants, pVCF format [500k release]/chr$CHR/"
    export CORES=4 # select n cores
    export raptoken=
    export project=
    
    ## sort out workspace
    unset DX_WORKSPACE_ID
    dx cd $DX_PROJECT_CONTEXT_ID:
    # wipe all dx env variables out
    source ~/.dnanexus_config/unsetenv
    dx clearenv
    dx login --noprojects --token $raptoken
    dx select $project
    
    echo -e "\nBedzip is $BEDZIP"
    echo "Output folder is $OUTFOLDER"
    unzip $BEDZIP # this creates a local list of files, that have the same 
    
    process_file() {
        BEDPOSFILE="$1"
        # Strip the trailing ".bed" extension to recover the VCF file name
        VCF="${BEDPOSFILE%.bed}"
        STEM=${VCF%.vcf.gz}
        echo "VCF file is $VCF"
        echo "Stem is $STEM"
        # Set a maximum number of retries
        MAX_RETRIES=4
        RETRY_COUNT=0
    
        while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
            echo "Check if the afreq file exists in the directory"
            if dx ls "${OUTFOLDER}/${STEM}.afreq" > /dev/null 2>&1; then
                echo "${STEM}.afreq already processed"
                return 0
            else
                echo -e "\nCalculating variants for loci in $VCF"
                plink2 --vcf "$DRAGENPATH/$VCF" \
                    --chr "$CHR" \
                    --freq  \
                    --extract bed1 "f1/${BEDPOSFILE}" \
                    --out "$STEM"
               
                # Save plink2's exit status now; $? is overwritten by every later command
                PLINK_STATUS=$?
                if [ "$PLINK_STATUS" -eq 0 ]; then
                    echo "plink2 command completed successfully."
    
                    echo "Uploading .afreq files and removing them if upload succeeds"
                    dx upload -p "${STEM}.afreq" "${STEM}.log" --path "${OUTFOLDER}/" && \
                    echo -e "\n${STEM}.afreq and logfile uploaded" && \
                    rm -f "${STEM}.afreq" "${STEM}.log"
    
                    # Confirm the upload by looking for the file on the platform
                    if dx ls "${OUTFOLDER}/${STEM}.afreq" > /dev/null 2>&1; then
                        echo "${STEM}.afreq successfully processed."
                        return 0
                    else
                        echo "File upload failed."
                    fi
                else
                    echo "plink2 command failed with exit status $PLINK_STATUS."
                fi    
                echo -e "Retrying... ($((RETRY_COUNT + 1))/$MAX_RETRIES)"
                ((RETRY_COUNT++))
            fi
        done
    
    
        # If retries are exhausted, log failure
        echo -e "Failed to process ${STEM}.afreq after $MAX_RETRIES attempts."
        return 1
    }
    
    export -f process_file
    # Find files and process them in parallel
    find ./ -type f -name "*.bed" -printf "%f\n" | parallel -j "$CORES" process_file  
    
    # Clean up so that any other files created are not automatically uploaded once the worker completes.
    rm -r f1/
    rm /home/dnanexus/out/out/*

     

    It is run with this from my command line: 

    dx run app-swiss-army-knife \
    -iin=PROJNAME:Gdoc/scripts/script.sh \
    -icmd="bash script.sh 6" \
    -y --name "chr6" \
    --instance-type XXX \
    --priority low
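
The core pattern in the script above (skip inputs whose output is already saved, and save each output the moment it is produced) can be tried out without plink2 or dx. The following is only a minimal sketch under stated assumptions: `REMOTE_DIR`, `WORK_DIR`, `INPUT_DIR`, `do_work` and `process_one` are hypothetical names, a local temporary directory stands in for the platform output folder, and the uppercase conversion is a placeholder for the real computation. On RAP, the existence check and the copy would be replaced by the `dx ls` and `dx upload` calls used in the script.

```shell
#!/usr/bin/env bash
# Checkpoint-and-skip sketch. Local directories stand in for the platform
# folder and the worker's scratch space (both are assumptions for the demo).
REMOTE_DIR="$(mktemp -d)"   # stand-in for the platform output folder
WORK_DIR="$(mktemp -d)"     # stand-in for local scratch space on the worker
INPUT_DIR="$(mktemp -d)"    # stand-in for the unzipped input files

# Placeholder for the real computation (plink2 in the script above).
do_work() {
    local input="$1" output="$2"
    tr 'a-z' 'A-Z' < "$input" > "$output"
}

process_one() {
    local input="$1"
    local name
    name="$(basename "$input").out"

    # Skip inputs whose output is already saved (on RAP: dx ls).
    if [ -e "$REMOTE_DIR/$name" ]; then
        echo "skip $name"
        return 0
    fi

    do_work "$input" "$WORK_DIR/$name" || return 1

    # Save the result immediately (on RAP: dx upload) and delete the
    # local copy only if the save succeeded.
    if cp "$WORK_DIR/$name" "$REMOTE_DIR/$name"; then
        rm -f "$WORK_DIR/$name"
    else
        return 1
    fi
    echo "done $name"
}

# Demo: process two inputs, then simulate a restart of the whole job.
echo "alpha" > "$INPUT_DIR/a.txt"
echo "beta"  > "$INPUT_DIR/b.txt"

FIRST_RUN="$(for f in "$INPUT_DIR"/*.txt; do process_one "$f"; done)"
SECOND_RUN="$(for f in "$INPUT_DIR"/*.txt; do process_one "$f"; done)"

echo "First run:";  echo "$FIRST_RUN"
echo "Second run:"; echo "$SECOND_RUN"
```

On the second pass every input is skipped, which is what keeps a restarted job from repeating (and re-billing) work that was already saved.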
    
  • Rachael W, UKB Community team Data Analyst

    Hi Hannah,

    If the job failures definitely occur even with High Priority, then the issue needs to be investigated individually by the DNAnexus support team.

    Please add “org-support” (without quotes) to your UKB-RAP project as a member with VIEW permission.  When that is done, please contact DNAnexus support using the Help tab within the UKB-RAP GUI (select Contact support).  Describe the issue, and mention that you have added org-support to your project.  

    You can find more information on project sharing here: 
    https://documentation.dnanexus.com/getting-started/ui-quickstart#step-2.-add-project-members

    Thank you for using the forum.

  • Dr. Mc. Ninja

    https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/24641199060125/comments/24813371215005

     

    So nice Gabriel Doctor ! I wonder if RAP people could make this the default behaviour!

  • Eric Kernfeld

    There is a similar tool that does have a default behavior of saving files as they become available: WDL scatter with Smart Reuse
