I am running many tasks in a WDL workflow. What execution policy should I put in my extras file so that jobs restart on the error "SpotInstanceInterruption"? Currently this error causes the whole workflow to terminate.
Many Thanks,
Barney
Comments
3 comments
According to this doc page: https://documentation.dnanexus.com/developer/api/running-analyses/io-and-run-specifications#run-specification, in theory (I have not tested it) you could handle the SpotInstanceInterruption error by adding "SpotInstanceInterruption" to the restartOn policy. That should automatically restart the job if the spot instance is interrupted.
So in theory there is no need to recompile the WDL or modify the extras file. I would try running the workflow with something like this:
dx run workflow-XXXX --extra-args '{"executionPolicy":{"restartOn": {"SpotInstanceInterruption": 3,"JMInternalError": 1,"ExecutionError": 1}}}'
or
dx run workflow-XXXX --extra-args '{"executionPolicy":{"restartOn": {"*": 1}}}'
"*" matches all restartable failure reasons that are not otherwise present as keys.
If neither of the above works, your workflow could be a good candidate for a high-priority job:
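If hand-writing the JSON inside shell quotes gets error-prone, the --extra-args value can be generated instead. The following is only a minimal sketch in Python using the standard library; the helper name build_extra_args is made up for illustration, the retry counts are the ones from the commands above, and workflow-XXXX is still a placeholder. It only prints the command and does not call the platform.

```python
import json
import shlex


def build_extra_args(restart_on):
    """Return the JSON string to pass to `dx run --extra-args`.

    restart_on maps a failure reason (or "*") to a retry count.
    build_extra_args is a hypothetical helper, not part of the dx toolkit.
    """
    for reason, retries in restart_on.items():
        if not isinstance(retries, int) or retries < 0:
            raise ValueError(f"retry count for {reason!r} must be a non-negative integer")
    return json.dumps({"executionPolicy": {"restartOn": restart_on}})


# Retry counts taken from the first example command in this thread.
policy = {"SpotInstanceInterruption": 3, "JMInternalError": 1, "ExecutionError": 1}
extra_args = build_extra_args(policy)

# shlex.quote wraps the JSON so it survives the shell unchanged.
print("dx run workflow-XXXX --extra-args " + shlex.quote(extra_args))
```

Building the string with json.dumps guarantees valid JSON, which rules out one common reason for a policy being silently ignored or rejected.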
https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/managing-job-priority#high-priority
Thank you very much for your response. Hmm, I've been using SpotInstanceInterruption but I'm still getting SpotInstanceInterruption errors. Has anyone else been able to get this option to work?
Has the number of spot-interrupted jobs decreased since you introduced the restartOn policy? How many failed jobs/tasks are we talking about?
Using the dx toolkit, you can get info about each job/subjob. For example:
dx describe job-XXXX
You can check the executionPolicy, finalPriority and failureCounts sections in the resulting output.
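To compare several jobs, it may help to pull out just those three fields from the JSON form of the output (dx describe job-XXXX --json). A rough sketch in Python, assuming the JSON has been captured as a string; summarize_job is a hypothetical helper, and the sample values below are invented for illustration, not real platform output.

```python
import json

# Field names as mentioned in this thread.
FIELDS = ("executionPolicy", "finalPriority", "failureCounts")


def summarize_job(describe_json):
    """Extract the restart-related fields from `dx describe --json` output.

    Fields missing from the output are reported as None.
    """
    desc = json.loads(describe_json)
    return {field: desc.get(field) for field in FIELDS}


# Invented sample standing in for real `dx describe job-XXXX --json` output.
sample = json.dumps({
    "id": "job-XXXX",
    "executionPolicy": {"restartOn": {"SpotInstanceInterruption": 3}},
    "finalPriority": "normal",
    "failureCounts": {"SpotInstanceInterruption": 2},
})

summary = summarize_job(sample)
for field, value in summary.items():
    print(f"{field}: {value}")
```

If executionPolicy comes back empty for a subjob, that would suggest the policy was never applied to it, which is a different problem from the retries being exhausted.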