How can I tell that a job has failed due to an out-of-memory error when it completes with state "done"?

I am running an applet on the DNAnexus RAP that pipes output from bcftools into another program. Sometimes the applet runs out of memory, but it does not die with an error: it completes with status "Done", and I end up with partial outputs uploaded. This happens even though I have used 'set -euxo pipefail' in my script.

 

The only indication of an error is in the job logs that I can see on the web UI, where I see this:

event_loop invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

Out of memory: Killed process 10219 (bcftools) total-vm:7020000kB, anon-rss:6620320kB, file-rss:0kB, shmem-rss:0kB, UID:100000 pgtables:13696kB oom_score_adj:0

 

Surprisingly, when I try to get logs for the job using 'dx watch --get-streams job-Gf09PK8JpY8FB5jqF0p2YxxB', it does not report the oom-killer error above!

 

Of course, the output is incomplete - but it is difficult to know that this is the case without comparing with the original file. Since I'm trying to process all 150,000+ pVCF files for the WGS, it's not feasible to compare with the original files.

 

Is there any way to see whether a job has run into an out of memory condition like the above? If the job actually died, I would be able to detect it, but as it is, I won't even know which jobs have failed.

Thanks for any help!

 

Comments

3 comments

  • Comment author
    Former User of DNAx Community_47

    I realised that passing --get-streams to dx watch restricts the output to the stdout and stderr streams of the app, whereas if I omit this parameter (just dx watch jobid) then I get more detail, and can see the "ALERT" output.

    I'm still perplexed why the job completes successfully, but at least I have a way to tell which jobs failed.
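Building on the observation above, a batch of finished jobs could be screened by searching the full job log for the kernel messages quoted in the question. This is only a rough sketch: it assumes that plain `dx watch` exits once it has printed the log of a finished job, and the file names `job_ids.txt` and `oom_jobs.txt` are hypothetical.

```shell
# Succeeds (exit 0) if the log text on stdin contains a kernel OOM-killer message.
# Both patterns are copied from the log lines quoted in the question.
# Output goes to /dev/null instead of using "grep -q", so grep reads the whole
# stream and the upstream command is not cut short by a closed pipe.
has_oom_kill() {
    grep -E 'invoked oom-killer|Out of memory: Killed process' > /dev/null
}

# Hypothetical usage, with one job ID per line in job_ids.txt (assumed name):
# while read -r job_id; do
#     if dx watch "$job_id" < /dev/null 2>&1 | has_oom_kill; then
#         echo "$job_id"
#     fi
# done < job_ids.txt > oom_jobs.txt
```

Jobs listed in `oom_jobs.txt` would then be candidates for a rerun on a larger instance type.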

     

  • Comment author
    Ondrej Klempir DNAnexus Team

    Hi @Jeremy Schwartzentruber, I would suggest sending this possible OOM-masking error to ukbiobank-support@dnanexus.com for detailed inspection.

  • Comment author
    Former User of DNAx Community_47

    Interestingly, I found that this OOM error is only "masked" when the offending process is given as a process substitution file input.

    That is, when I run the command as below, the job may incorrectly show as completed successfully, even though bcftools was killed by the oom-killer:

    ./my_program <( bcftools norm -Ov -m -any myvcf.gz )

     

    However, when I run it like this, then it correctly shows as failed when it runs out of memory:

    bcftools norm -Ov -m -any myvcf.gz | ./my_program
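The difference is consistent with how bash treats the two forms: the exit status of a command inside `<( ... )` is not part of the exit status of the command that reads from it, so neither `set -e` nor `pipefail` ever sees bcftools being killed. If `./my_program` has to receive a file path rather than stdin, one possible workaround is a named pipe with an explicit `wait` on the producer. The sketch below assumes that approach; `run_via_fifo` is a hypothetical helper, not part of any DNAnexus tooling.

```shell
# Run a producer command in the background writing to a named pipe, pass the
# pipe's path to the consumer as its last argument, and fail if either exits
# non-zero (for example, the producer being killed by the OOM killer).
# Usage: run_via_fifo "producer command" consumer [args...]
run_via_fifo() {
    producer_cmd=$1
    shift
    fifo_dir=$(mktemp -d) || return 1
    fifo="$fifo_dir/stream"
    mkfifo "$fifo" || return 1

    # Start the producer; it blocks until the consumer opens the pipe.
    sh -c "$producer_cmd" > "$fifo" &
    producer_pid=$!

    consumer_status=0
    "$@" "$fifo" || consumer_status=$?

    # Unlike process substitution, the producer's exit status is collected here.
    producer_status=0
    wait "$producer_pid" || producer_status=$?

    rm -rf "$fifo_dir"
    [ "$consumer_status" -eq 0 ] && [ "$producer_status" -eq 0 ]
}

# Hypothetical usage with the commands from this thread:
# run_via_fifo 'bcftools norm -Ov -m -any myvcf.gz' ./my_program
```

One caveat with this sketch: if the consumer exits without ever opening the pipe, the producer stays blocked and `wait` will hang, so the plain pipe form remains the simpler choice whenever the program can read from stdin.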

     
