In a workflow, how can I avoid uploading intermediate files and upload only the output of the final step?

I'm creating a workflow with multiple Swiss Army Knife steps. Each step produces large intermediate files that feed into the next step. The issue is that all of these intermediate files are being uploaded to the project. How can I keep them on the cloud and upload only the final output to the project? For each middle step, I've been adding -o intermediate_file.cram and using that file as an input for the next step.

Comments

2 comments

  • Comment author
    Chai Fungtammasan DNAnexus Team

    I don't think this is doable on the platform at the moment. Currently, a workflow is treated as a graph of multiple jobs. To preserve the provenance of the analysis, all intermediate files must remain intact so that results can be traced back to the original inputs.

     

    There is a workaround if you don't want to keep all those intermediate files, though. You can set the workflow to direct all intermediate files to a throwaway location in the project, such as /trash or /tmp, and set the output location for the files you need to the proper place. This way, you can regularly clean up unwanted files in /trash or /tmp to save storage without compromising your analysis flow. If the workflow fails partway through, you will still have the intermediate files and can continue without rerunning everything from scratch.
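The folder routing described above could be sketched as follows. This is a minimal, hedged example: the workflow name `firstsecond`, the stage indices `0` and `1`, and the folders `/results` and `/trash` are all hypothetical. It assumes the `dx run` options `--destination` and `--stage-output-folder`, and it only prints the command so it can be inspected before anything is launched.

```shell
# Print (do not run) a dx run command that sends the outputs of the listed
# intermediate stages to a scratch folder, while everything else goes to
# the final destination folder.
# Arguments: WORKFLOW FINAL_FOLDER SCRATCH_FOLDER INTERMEDIATE_STAGE...
# All names used in the example call below are hypothetical placeholders.
build_run_cmd() {
  workflow=$1
  final=$2
  scratch=$3
  shift 3
  cmd="dx run $workflow --destination $final"
  for stage in "$@"; do
    # One --stage-output-folder pair per intermediate stage
    cmd="$cmd --stage-output-folder $stage $scratch"
  done
  printf '%s\n' "$cmd"
}

build_run_cmd firstsecond /results /trash 0 1
```

Once the final outputs have been checked, the scratch folder can be emptied with something like `dx rm -r /trash`.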

  • Comment author
    Ondrej Klempir DNAnexus Team

    For more information, I am sharing my experience with building a DNAnexus workflow manually. I am not sure whether this is a good fit for your use case or whether it can be combined with Swiss Army Knife stages, but some time ago I built a workflow in the way I describe here.

     

    The key piece here is the workflow update API method: https://documentation.dnanexus.com/developer/api/running-analyses/workflows-and-analyses#api-method-workflow-xxxx-update

     

    There is the following option:

    outputSpecMods mapping (optional): Update(s) to how the stage output specification is exported for the workflow; any subset can be provided. This field follows the same syntax as for inputSpecMods defined above and behaves roughly the same but modifies outputSpec instead. The exception in behavior occurs for the hidden field. If an output has hidden set to true, its data object value (if applicable) will not be cloned into the parent container when the stage or analysis is done. This may be a useful feature if a stage in your analysis produces many intermediate outputs that are not relevant to the analysis or are not ultimately useful once the analysis has finished.
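To make the shape of that request body concrete, here is a small hedged sketch of a helper that prints the JSON for hiding one output of one stage. The function name `hidden_output_payload` is hypothetical, and `stage-XXXX` and `out` are placeholders for a real stage ID and output field name. Building the JSON with `printf` avoids the shell quoting pitfalls of splicing variables into a single-quoted string.

```shell
# Print the JSON body for a workflow update call that marks one stage
# output as hidden.
# Arguments: EDIT_VERSION STAGE_ID OUTPUT_FIELD
# (hypothetical helper; the IDs and field name below are placeholders)
hidden_output_payload() {
  printf '{"editVersion": %d, "stages": {"%s": {"outputSpecMods": {"%s": {"hidden": true}}}}}\n' \
    "$1" "$2" "$3"
}

hidden_output_payload 2 stage-XXXX out
```

The printed body could then be passed to the API, for example `dx api workflow-XXXX update "$(hidden_output_payload 2 stage-XXXX out)"`. The `editVersion` value has to match the workflow's current edit version, which appears in the output of `dx describe workflow-XXXX --json`.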

     

    I manually built my testing workflow using "first" and "second" applets.

    https://documentation.dnanexus.com/developer/workflows/workflow-build-process

     

    Here is a skeleton of my builder script. It shows the main logic and structure for building a workflow, and what you can in principle do:

     

    # Clean up before building a new from scratch

    dx rm firstsecond

     

    # Create a new 'open' workflow

    dx new workflow firstsecond

     

    # Adds a stage that outputs a file

    dx add stage firstsecond first --name first

     

    # Describe the workflow

    dx describe firstsecond --json > firstsecond.json

     

    # Get the stage ID of first

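    # -r makes jq print the raw ID without surrounding double quotes
    FIRST_STAGE_ID=$(jq -r '.stages[0].id' firstsecond.json)
    WORKFLOW_ID=$(jq -r '.id' firstsecond.json)

    echo "$FIRST_STAGE_ID"
    echo "$WORKFLOW_ID"

    # A stage that takes the output as an input.
    # The '"$VAR"' pattern steps out of the single-quoted JSON so the shell expands the variable.
    dx add stage firstsecond bar -j '{"in": {"$dnanexus_link": { "stage": "'"$FIRST_STAGE_ID"'", "outputField": "out" }}}'

    # Equivalent command with a literal stage ID instead of the variable
    dx add stage firstsecond bar -j '{"in": {"$dnanexus_link": { "stage": "stage-XXXX", "outputField": "out" }}}'

    # Sets the output to hidden (literal IDs, one entry per stage).
    # editVersion must match the workflow's current edit version.
    dx api workflow-XXXX update '{"editVersion": 2, "stages": { "stage-XXXX": { "outputSpecMods": { "out": { "hidden": true } } }, "stage-XXXX": { "outputSpecMods": { "out": { "hidden": true } } } } }'

    # The same call using the variables captured above.
    # The key under outputSpecMods is the stage's output field name ("sorted_bam" here).
    dx api "$WORKFLOW_ID" update '{"editVersion": 2, "stages": {"'"$FIRST_STAGE_ID"'": { "outputSpecMods": { "sorted_bam": { "hidden": true } } } } }'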

     

    # Prevents any further modifications

    dx close firstsecond

     

     

     
