How to avoid job failures/restarts?
Hi all, I have noticed recently that some jobs I submit via Swiss Army Knife only succeed after multiple tries. On the tries that fail I get this:
Cause of Failure
The machine running the job was terminated by the cloud provider
I'm not sure how to improve this and save costs on my end. It also seems to happen sometimes even when the jobs are set to high priority. The log file doesn't contain an error message; it just cuts off whatever it was doing and the job restarts.
Is there anything I can do to minimise the chance of a job failing and restarting itself, losing its progress and increasing my costs?
Comments
Hi Hannah
If you haven't checked whether the instance size is large enough, that's probably a good place to start; also check the paths etc. in your script. But if you're anything like me, I frequently get booted off workers, so I wrote this:
I put together a complicated Swiss Army Knife "wrapper" around a fairly simple "plink --freq" task; this may give you some ideas. Essentially, it uploads any output files that I want to save as soon as they are created, rather than waiting until the script completes (the default for SAK), so that if the script is interrupted for whatever reason, at least the work that has already been done is saved. For each input file, it checks whether the corresponding output file is already uploaded and saved on the platform, and if so it moves on to the next input. Thus, even if there is a crash, rerunning the script won't have to process everything again. It also allows parallelisation of tasks, and can use both dx-fuse and save directly to the platform.
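For readers who want to try the same idea, here is a minimal Python sketch of the skip-if-done, upload-as-you-go pattern described above. It is not the wrapper script mentioned in this comment. The folder `/freq_results`, the `.bed` to `.frq` naming rule (PLINK 1.9 style) and all function names are assumptions made for illustration, and the `dx find data` and `dx upload` calls should be checked against your installed dx-toolkit before relying on them.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List

DEST_FOLDER = "/freq_results"  # hypothetical folder in the current project


def output_name_for(input_path: str) -> str:
    """Hypothetical naming rule: data/chr1.bed -> chr1.frq."""
    return Path(input_path).with_suffix(".frq").name


def process_pending(
    inputs: Iterable[str],
    already_saved: Callable[[str], bool],
    run_task: Callable[[str, str], None],
    save_output: Callable[[str], None],
) -> List[str]:
    """Run the task only for inputs whose output is not saved yet.

    Returns the names of the outputs produced during this run.
    """
    produced = []
    for input_path in inputs:
        out_name = output_name_for(input_path)
        if already_saved(out_name):
            continue  # finished in an earlier, possibly interrupted, run
        run_task(input_path, out_name)
        save_output(out_name)  # save now instead of at the end of the job
        produced.append(out_name)
    return produced


def dx_already_saved(out_name: str) -> bool:
    """Assumed check: does a file with this name exist under DEST_FOLDER?"""
    result = subprocess.run(
        ["dx", "find", "data", "--name", out_name,
         "--path", DEST_FOLDER, "--brief"],
        capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())


def dx_save_output(out_name: str) -> None:
    """Upload one finished output file to DEST_FOLDER straight away."""
    subprocess.run(
        ["dx", "upload", out_name, "--path", DEST_FOLDER + "/",
         "--parents", "--brief"],
        check=True,
    )


def plink_freq(input_path: str, out_name: str) -> None:
    """Assumed PLINK 1.9 call; writes <out prefix>.frq in the working dir."""
    in_prefix = str(Path(input_path).with_suffix(""))
    out_prefix = str(Path(out_name).with_suffix(""))
    subprocess.run(
        ["plink", "--bfile", in_prefix, "--freq", "--out", out_prefix],
        check=True,
    )
```

Inside a job this could be called as `process_pending(input_paths, dx_already_saved, plink_freq, dx_save_output)`. Because the three steps are passed in as functions, the loop can be tried locally with stand-ins before any compute is spent, and a restarted job only repeats the one input that was in progress when the worker was terminated.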
Gabriel
#####
It is run with this from my command line:
Hi Hannah,
If the job failures definitely occur even with High Priority, then the issue needs to be investigated individually by the DNAnexus support team.
Please add “org-support” (without quotes) to your UKB-RAP project as a member with VIEW permission. When that is done, please contact DNAnexus support using the Help tab within the UKB-RAP GUI (select Contact support). Describe the issue, and mention that you have added org-support to your project.
You can find more information on project sharing here:
https://documentation.dnanexus.com/getting-started/ui-quickstart#step-2.-add-project-members
Thank you for using the forum.
https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/24641199060125/comments/24813371215005
So nice, Gabriel Doctor! I wonder if the RAP people could make this the default behaviour!
There is a similar tool that does have a default behavior of saving files as they become available: WDL scatter with Smart Reuse.