Bulk-downloading job logs and investigating failed jobs

Kristen

Getting logs for many jobs at once

Here are some commands that can help you fetch your job logs in bulk and write each one to a separate file for easy searching/grepping:

# Make a list of jobs to get logs for
dx find jobs --tag "$batch_tag" --delim -n9999 | awk '{print $1,$3,$4}' > joblist.txt

# Look for jobs that had a status other than "done"
grep -v done joblist.txt 

# Get the logs (for all jobs, even those that appeared to succeed)
# Put standard error and standard output in separate files for easier checking
# Jobs that are still "active" and producing new log output must be excluded or these commands will get stuck
grep -Ev 'running|runnable|restartable' joblist.txt | awk '{print $1,$3}' | xargs -n 2 -P 100 bash -c 'dx watch "$1" --get-stdout --quiet > "$0.$1.out"'
grep -Ev 'running|runnable|restartable' joblist.txt | awk '{print $1,$3}' | xargs -n 2 -P 100 bash -c 'dx watch "$1" --get-stderr --quiet > "$0.$1.err"'

# Find all unique lines output to standard error (excluding lines starting with + which record commands being run)
# Not everything here is an error. Common non-error messages written to stderr by Swiss Army Knife are excluded using grep.
cat *.err | grep -v '^+' | grep -v '^Downloading files using' | grep -v 'tar: Removing leading' | sort | uniq

# Grep (case insensitive) for any instances of the words "warning" or "error" that were written to standard output rather than standard error
grep -Ei '(warning|error)' *.out
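
The log-fetching commands above exclude three active states by name. A minimal sketch of the opposite approach, selecting only jobs that have finished, is shown below. It assumes that joblist.txt has three whitespace-separated columns (job name, state, job ID) and that done, failed and terminated are the finished states you care about; finished_jobs is a hypothetical helper name, not part of the dx toolkit.

```shell
# Print "name job_id" pairs for jobs whose state is done, failed or terminated.
# Assumption: input columns are name, state, job ID (as in joblist.txt above).
finished_jobs() {
  awk '$2 == "done" || $2 == "failed" || $2 == "terminated" { print $1, $3 }'
}

# Example with made-up job names and IDs:
finished_jobs <<'EOF'
step1_chr1 done job-aaa
step1_chr2 running job-bbb
step1_chr3 failed job-ccc
step1_chr4 waiting_on_input job-ddd
EOF
# prints:
# step1_chr1 job-aaa
# step1_chr3 job-ccc
```

If that layout matches your joblist.txt, the output of finished_jobs < joblist.txt could be piped into the same xargs commands in place of the grep and awk steps.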

Troubleshooting jobs that failed without an obvious explanation

If you have jobs that failed without writing relevant output to their job log, here are some things you can check:

  1. Is job state "failed" or something else?
  2. Is the job directing its output to the place where you expect to find it?
  3. What are the failure reason and failure message?
  4. Is Total Price >= costLimit?
  5. Did the job fail on its 9th try?
    1. DNAnexus has a maxRestarts limit of 9, as described here. Jobs preempted or otherwise restarted 9 times will fail.
    2. You can set lower limits for specific restart types. For example, if you don't want your job to re-run after more than 3 SpotInstanceInterruptions, you could give dx run the flag --extra-args '{"executionPolicy":{"restartOn": {"SpotInstanceInterruption": 3}}}'
    3. The upper limit of 9 for maxRestarts is a hard limit that applies to all restart causes combined. I asked DNAnexus support and they said it is not currently possible to set maxRestarts higher than 9.
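
Getting the quoting right for the --extra-args JSON can be fiddly, so here is a small sketch that builds the same JSON for any restart limit. The helper name restart_policy_json is hypothetical, and the dx run line is left as a comment because the app and its inputs depend on your own job.

```shell
# Build the --extra-args JSON for a given SpotInstanceInterruption restart limit.
# restart_policy_json is a hypothetical helper, not part of the dx toolkit.
restart_policy_json() {
  printf '{"executionPolicy":{"restartOn":{"SpotInstanceInterruption":%d}}}\n' "$1"
}

restart_policy_json 3
# prints {"executionPolicy":{"restartOn":{"SpotInstanceInterruption":3}}}

# Possible use (the $app variable is a placeholder for your own app or applet):
# dx run "$app" --extra-args "$(restart_policy_json 3)"
```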

You can get a lot of details about your job with dx describe job-abc123, where job-abc123 is the job ID. Here is a command to grep that output for some of the most common causes of failure:

dx describe job-abc123 | grep -Ew 'Try|State|Output folder|Total Price|costLimit|failureCounts|Failure'

Here is example output for that command:

Try                               9
State                             failed
Output folder                     /QC/rareqc_step1
Try created                       Wed Aug 20 20:45:17 2025
Failure reason                    SpotInstanceInterruption
Failure message                   The machine running the job was terminated by the cloud provider
Total Price                       £0.04
costLimit                         0.5
failureCounts                     {"SpotInstanceInterruption": 8, "UnresponsiveWorker": 1}

The job in this example failed because it hit the maxRestarts limit of 9. It was preempted 8 times, and also had one restart for the reason “UnresponsiveWorker”.
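
To check the arithmetic without reading the counts by eye, the sketch below sums the numbers on the failureCounts line. It assumes the text layout shown in the example output above; sum_failure_counts is a hypothetical helper name, not part of the dx toolkit.

```shell
# Sum the per-reason counts on the failureCounts line of `dx describe` text output.
# Prints 0 if there is no failureCounts line.
sum_failure_counts() {
  grep '^failureCounts' \
    | grep -oE ': *[0-9]+' \
    | tr -dc '0-9\n' \
    | awk '{ total += $1 } END { print total + 0 }'
}

# Example using the failureCounts line from the output above:
printf '%s\n' 'failureCounts                     {"SpotInstanceInterruption": 8, "UnresponsiveWorker": 1}' \
  | sum_failure_counts
# prints 9
```

With a real job this would be run as dx describe job-abc123 | sum_failure_counts. A result of 9 suggests the job used up its restarts.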

Checking an entire batch of jobs for non-obvious causes of failure

You can also check a whole batch of jobs for common causes of “failure without a log entry”. The command below uses jq to parse JSON on the command line; if jq is not already installed, it is available from most Linux and macOS package managers. This example searches for all failed jobs that share the same tag, but you can modify the dx find jobs command to search by other criteria, such as date or job name.

Code:

dx find jobs --tag "$batch_tag" --state failed -n99999999 --json | jq -r '
  [ .[]
    | {
      job_id: .id,
      state: .state,
      try_number: .try,
      total_price: .totalPrice,
      cost_limit: (if (.costLimit) then .costLimit else "NA" end),
      final_failure_reason: .failureReason,
      failure_counts: (
        if (.failureCounts and (.failureCounts | length) > 0)
        then (.failureCounts | to_entries | map("\(.key):\(.value)") | join(","))
        else "NA"
        end
      ),
      failure_message: .failureMessage
    }
  ]
  | (.[0] | keys_unsorted) as $keys # Get names for header rows
  | ($keys, (.[] | [.[ $keys[] ]])) # Combine header rows and the data rows that go under them (putting the columns of the data rows in the same order as the headers)
  | @tsv
' | column -t -s $'\t' | less -S

Example output:

job_id                        state   try_number  total_price           cost_limit  final_failure_reason      failure_counts                                   failure_message
job-redacted1                 failed  9           0.043327517333333336  0.5         SpotInstanceInterruption  SpotInstanceInterruption:9                       The machine running the job was terminated by the cloud provider
job-redacted2                 failed  9           0.050323937           0.5         SpotInstanceInterruption  SpotInstanceInterruption:9                       The machine running the job was terminated by the cloud provider
job-redacted3                 failed  9           0.04065517866666667   0.5         SpotInstanceInterruption  SpotInstanceInterruption:8,UnresponsiveWorker:1  The machine running the job was terminated by the cloud provider
job-redacted4                 failed  9           0.03854973733333334   0.5         SpotInstanceInterruption  SpotInstanceInterruption:9                       The machine running the job was terminated by the cloud provider
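
The table can also be filtered rather than read by eye. As a rough sketch of the "Is Total Price >= costLimit?" check, the awk filter below reads the tab-separated rows (the jq output before it is passed to column) and prints the IDs of jobs whose price reached their cost limit. It assumes the header names total_price, cost_limit and job_id used above; over_cost_limit is a hypothetical helper name.

```shell
# Print job IDs whose total_price is at least their cost_limit.
# Input: tab-separated rows with a header line naming the columns.
over_cost_limit() {
  awk -F '\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to column numbers
    $(col["cost_limit"]) != "NA" &&
      $(col["total_price"]) + 0 >= $(col["cost_limit"]) + 0 { print $(col["job_id"]) }
  '
}

# Example with made-up rows:
printf 'job_id\tstate\ttotal_price\tcost_limit\n'  > example.tsv
printf 'job-a\tfailed\t0.04\t0.5\n'               >> example.tsv
printf 'job-b\tfailed\t0.5\t0.5\n'                >> example.tsv
printf 'job-c\tfailed\t1.2\tNA\n'                 >> example.tsv
over_cost_limit < example.tsv
# prints job-b
rm -f example.tsv
```

Jobs without a cost limit (NA) are skipped, since there is nothing to compare against.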
