How can we execute modified applets without creating duplicated data?
From what I can tell from experiments, there is no way to tell applets to overwrite output files or to respect the output from previous applet versions. While this isn't a problem thanks to caching *as long as an applet is unchanged*, for those of us who are actively developing applets, this leads to duplicate files being created with the same name when an applet is updated.
Let's say that you run an applet and you realize that some of the tasks didn't complete because of a logic issue. With this logic issue, either the output was successfully produced without error, or the job was terminated. In other words, any output is known to be good, as long as there is output. But to fix the logic issue to allow the rest of the tasks to complete, of course the applet needs to be modified. Now when you run the code, you'll get duplicated output for all of the tasks that worked the first time.
My guess, based on this thread ( https://community.dnanexus.com/s/question/0D582000000Lpb6CAC/how-to-overwrite-preexisting-file-with-the-same-name-and-path-when-using-dx-upload ), is that we will have to manage this type of complex statefulness ourselves without any benefit of a filesystem-like backend.
In my opinion, this is a bad situation to find ourselves in. It would be better to either (1) use the presence of all output files as an indicator that the task was complete (similar to the GCP pipelines API), or (2) to simply overwrite files when they have the same name and path as a pre-existing file. Creating a duplicate file with the same path and name breaks downstream name-based tasks.
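Until something like option (1) exists on the platform, the check can be approximated on the user side. Here is a minimal sketch, not a DNAnexus feature: the function name `tasks_to_run`, the task ids, and the paths are all hypothetical, and the listing of existing paths is assumed to come from somewhere else (for example `dx ls` or dxpy's `find_data_objects`). It only decides which tasks still need to run because their expected outputs are missing.

```python
from typing import Dict, Iterable, List, Set


def tasks_to_run(
    expected_outputs: Dict[str, List[str]],
    existing_paths: Iterable[str],
) -> List[str]:
    """Return ids of tasks whose expected output files are not all present.

    expected_outputs maps a (hypothetical) task id to the output paths that
    task should produce. A task with no declared outputs is always returned,
    since there is nothing to prove it finished.
    """
    present: Set[str] = set(existing_paths)
    return [
        task_id
        for task_id, outputs in expected_outputs.items()
        if not outputs or not all(path in present for path in outputs)
    ]


if __name__ == "__main__":
    # Illustrative values only; real paths would come from a project listing.
    expected = {
        "sample_a": ["/results/sample_a.vcf.gz"],
        "sample_b": ["/results/sample_b.vcf.gz"],
    }
    existing = ["/results/sample_a.vcf.gz"]
    print(tasks_to_run(expected, existing))
```

This relies on the same assumption as above: any output that exists is known to be good. It does not help if a failed job can leave a partial file behind.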
Comments
Hi James,
Thanks for sharing your feedback with the Community. Could you please forward this to ukbiobank-support@dnanexus.com? The support team will be able to document it for the engineering team's consideration.
Have you considered WDL for your applet/pipeline development? Would WDL address some of your suggestions, e.g. the Job Reuse feature in particular? https://documentation.dnanexus.com/user/running-apps-and-workflows/job-reuse
The DNAnexus implementation of WDL Job Reuse causes the situation I am describing.