How to automatically restart jobs upon "SpotInstanceInterruption"?

I'm running many tasks in a WDL workflow. What execution policy should I put in my extras file so that jobs restart on the "SpotInstanceInterruption" error? Currently this error causes the whole workflow to terminate. Many thanks, Barney

Comments

3 comments

  • Comment author
    Ondrej Klempir DNAnexus Team

    According to this doc page: https://documentation.dnanexus.com/developer/api/running-analyses/io-and-run-specifications#run-specification, you could in theory (I have not tested it) add "SpotInstanceInterruption" to the restartOn policy. That should automatically restart the job after a spot instance interruption.

     

    So, in theory, there is no need to recompile the WDL or modify the extras file. I would try running the workflow with something like this:

     

    dx run workflow-XXXX --extra-args '{"executionPolicy":{"restartOn": {"SpotInstanceInterruption": 3,"JMInternalError": 1,"ExecutionError": 1}}}'

    or

    dx run workflow-XXXX --extra-args '{"executionPolicy":{"restartOn": {"*": 1}}}'

     

    "*" matches all restartable failure reasons that are not otherwise present as keys.
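A minimal sketch in Python of how the `--extra-args` value above could be generated rather than typed by hand, which avoids quoting and JSON syntax mistakes. The helper names `build_restart_policy` and `build_dx_run_command` are hypothetical, `workflow-XXXX` is the same placeholder used above, and the retry counts are illustrative only. Like the commands above, this has not been tested against the platform; it only builds and prints the command string.

```python
import json
import shlex


def build_restart_policy(restart_on):
    """Return the JSON string for --extra-args from a restartOn mapping.

    restart_on maps a failure reason (or "*") to a maximum restart count.
    """
    for reason, limit in restart_on.items():
        # Reject booleans explicitly, since bool is a subclass of int in Python.
        if isinstance(limit, bool) or not isinstance(limit, int) or limit < 0:
            raise ValueError(f"restart count for {reason!r} must be a non-negative integer")
    return json.dumps({"executionPolicy": {"restartOn": restart_on}})


def build_dx_run_command(workflow_id, restart_on):
    """Return a shell-safe `dx run` command line (hypothetical helper)."""
    extra_args = build_restart_policy(restart_on)
    # shlex.quote wraps the JSON so the shell passes it as a single argument.
    return " ".join(["dx", "run", workflow_id, "--extra-args", shlex.quote(extra_args)])


# Illustrative policy: retry spot interruptions up to 3 times,
# and any other restartable failure reason once.
policy = {"SpotInstanceInterruption": 3, "*": 1}
command = build_dx_run_command("workflow-XXXX", policy)
print(command)
```

The printed line can be pasted into a terminal once `workflow-XXXX` is replaced with a real workflow ID.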

     

    If neither of the above works, your workflow could be a good candidate for a High Priority job:

    https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/managing-job-priority#high-priority

     

     

  • Comment author
    Former User of DNAx Community_51

    Thank you very much for your response. Hmm, I've been using SpotInstanceInterruption but I'm still getting SpotInstanceInterruption errors. Has anyone else been able to get this option to work?

  • Comment author
    Ondrej Klempir DNAnexus Team

    Has the number of spot-interrupted jobs decreased since you introduced the restartOn policy? How many failed jobs/tasks are we talking about?

     

    Using the dx toolkit, you can get info about each job/subjob. For example:

    dx describe job-XXXX

     

    You can check the executionPolicy, finalPriority, and failureCounts fields in the resulting output.
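To check many jobs at once, the same three fields can be pulled out of the JSON form of the description. The following is a rough Python sketch, assuming the dx toolkit is installed and logged in, and that `dx describe job-XXXX --json` returns these fields at the top level. The helper names and the sample values are hypothetical, and the shape of `failureCounts` (failure reason mapped to a count) is an assumption.

```python
import json
import subprocess


def fetch_job_description(job_id):
    """Run `dx describe <job_id> --json` and return the parsed output.

    Requires the dx toolkit on PATH and an active login; not called below.
    """
    result = subprocess.run(
        ["dx", "describe", job_id, "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def summarize_restart_info(description):
    """Pick out the fields relevant to restart behaviour from a job description."""
    policy = description.get("executionPolicy") or {}
    return {
        "restartOn": policy.get("restartOn", {}),
        "finalPriority": description.get("finalPriority"),
        "failureCounts": description.get("failureCounts") or {},
    }


# Hypothetical sample standing in for real `dx describe --json` output.
sample_description = {
    "id": "job-XXXX",
    "executionPolicy": {"restartOn": {"SpotInstanceInterruption": 3}},
    "finalPriority": "normal",
    "failureCounts": {"SpotInstanceInterruption": 2},
}

summary = summarize_restart_info(sample_description)
print(json.dumps(summary, indent=2))
```

If `restartOn` comes back empty for a failed job, the policy was likely not applied to that job, which would be worth ruling out before looking at anything else.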

     

     

