Multiple job failures + data retention

Hi all,

I launched a total of 14 batched jobs using dxpy --batch-tsv. Each job contains 250 CRAM files (and the corresponding index files) along with all the required reference files. Each job is expected to run for approximately 18 hours. Out of those 14 jobs, only two succeeded.

I've seen two different errors: "The machine running the job became unresponsive" (n=8) and "The machine running the job was terminated by the cloud provider" (n=4).

I'm guessing that the 4 jobs that failed with "terminated by the cloud provider" were spot VM interruptions, and that there is nothing I can do about those. Is that right?

I'm more concerned about the 8 "unresponsive" VMs. Looking at the log, it doesn't look like I was "stressing" them. I'm using all the available cores (though there is no load information) and there is sufficient memory and storage left as shown in the last three LOG lines below.

CPU: 96% (4 cores) * Memory: 20888/31649MB * Storage: 25/142GB * Net: 2↓/0↑MBps

CPU: 95% (4 cores) * Memory: 17602/31649MB * Storage: 25/142GB * Net: 4↓/0↑MBps

CPU: 89% (4 cores) * Memory: 13117/31649MB * Storage: 25/142GB * Net: 4↓/0↑MBps

Is there something I can do about this, or is this just bad luck?

The other thing is that those jobs managed to process 98% of the data (246 / 250 CRAM files) before failing. For now, I'm using dx-upload-all-outputs at the end of the (Bash) script, as I couldn't find a way to upload the files as they are generated. My applet generates a single output of class array:file and accepts any number of CRAM files as input. Is there a way to upload the files as they are generated, but still as a single array output?

Thanks for your help!

Comments

  • Chai Fungtammasan, DNAnexus Team

    I can think of two likely causes:

    1. An unresponsive worker due to the spot market.

    2. High I/O load when all jobs upload all of their outputs at the same time.

     

    The easiest way to distinguish them is to see when the error occurs: in the middle of the run, or when the jobs are about to finish?

     

    The fixes are a bit different, but it doesn't hurt to address both at the same time.

     

    1. For the unresponsive worker, you can make each job run for a shorter time, for example 50 or 100 CRAM files per job, to reduce the chance of spot termination. If cost is not a concern, you can also run the jobs on on-demand instances.

    2. For the high I/O load, you can upload each output as soon as it is done, just as you said. In the loop that processes each CRAM file, you could have code like this:

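      # Upload the finished file and capture its file ID
      temp_variant_link=$(dx upload sample646464.vcf.gz --brief)
      # Append the uploaded file to the array:file output named vcf_output
      dx-jobutil-add-output vcf_output "$temp_variant_link" --class=array:file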

    This adds each output to the array output as soon as it is available. You can then delete sample646464.vcf.gz to free up space.

    The same applies to downloading inputs: write a for loop in which each iteration downloads only the input it needs. After the output is uploaded, you can remove all input, intermediate, and output files that apply only to the finished sample. This way, you don't need a bigger instance to handle more files.
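
Putting the per-sample download, upload, and cleanup steps together, the loop might look roughly like the sketch below. This is only an illustration under a few assumptions: process_sample is a hypothetical placeholder for the real per-sample command, the function name upload_as_generated and the local file naming are made up for the example, and downloading the matching index file (which would be done the same way as the CRAM) is left out for brevity.

```shell
# Hypothetical placeholder: replace the body with the real per-sample command.
# Here it only creates an empty output file so the sketch can run end to end.
process_sample() {
    : > "$2"
}

# Usage: upload_as_generated <output_field> <cram_file_id>...
# Handles one CRAM at a time so local disk only ever holds one sample.
upload_as_generated() {
    local output_name cram_id cram vcf link
    output_name="$1"
    shift
    for cram_id in "$@"; do
        cram="${cram_id}.cram"
        vcf="${cram_id}.vcf.gz"

        # Download only the input needed for this iteration.
        dx download "$cram_id" -o "$cram" || return 1

        process_sample "$cram" "$vcf" || return 1

        # Upload the result right away and append it to the array output.
        link=$(dx upload "$vcf" --brief) || return 1
        dx-jobutil-add-output "$output_name" "$link" --class=array:file || return 1

        # Free local storage before moving on to the next sample.
        rm -f "$cram" "$vcf"
    done
}
```

Inside the applet script this would be called with the output field name followed by the input file IDs, for example `upload_as_generated vcf_output file-xxxx file-yyyy` (placeholder IDs). Because each result is attached to the output as soon as it is uploaded, the final dx-upload-all-outputs call should no longer be needed for that field.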

  • Thank you for the quick response!

    I tried using a series of dx upload / dx-jobutil-add-output calls, but I might not have done it correctly. I'll try again and update this post.

  • I implemented your fix and it looks like it will solve my issue. I'll launch smaller batches and upload the output files as they are generated.

    Thanks again for your help.

  • Chai Fungtammasan, DNAnexus Team

    That's fast! Best of luck!
