Bulk-downloading job logs and investigating failed jobs
Getting logs for many jobs at once
Here are some commands that can help you fetch your job logs in bulk and write each one to a separate file for easy searching/grepping:
# Make a list of jobs to get logs for
dx find jobs --tag $batch_tag --delim -n9999 | awk '{print $1,$3,$4}' > joblist.txt
# Look for jobs that had a status other than "done"
grep -v done joblist.txt
# Get the logs (for all jobs, even those that appeared to succeed)
# Put standard error and standard output in separate files for easier checking
# Jobs that are still "active" and producing new log output must be excluded or these commands will get stuck
cat joblist.txt | egrep -v 'running|runnable|restartable' | awk '{print $1,$3}' | xargs -n 2 -P 100 bash -c 'dx watch $1 --get-stdout --quiet > $0.$1.out'
cat joblist.txt | egrep -v 'running|runnable|restartable' | awk '{print $1,$3}' | xargs -n 2 -P 100 bash -c 'dx watch $1 --get-stderr --quiet > $0.$1.err'
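The two commands above exclude three active states by name. An alternative is to keep only jobs that have reached a terminal state. This is a minimal sketch, assuming the terminal job states are done, failed, and terminated, and that no job name in joblist.txt contains one of those words as a separate word. The helper name finished_jobs and the sample lines are hypothetical.

```shell
# Keep only lines that contain a terminal job state as a whole word.
# Assumption: the terminal states are done, failed, and terminated.
finished_jobs() {
  grep -Ew 'done|failed|terminated' "$@"
}

# Made-up sample lines: the "running" and "terminating" jobs are filtered out
printf 'job-1 done\njob-2 running\njob-3 terminating\njob-4 failed\n' | finished_jobs
```

With the real list, finished_jobs joblist.txt could stand in for the cat and egrep -v stages of the pipelines above.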
# Find all unique lines output to standard error (excluding lines starting with + which record commands being run)
# Not everything here is an error. Common non-error messages written to stderr by Swiss Army Knife are excluded using grep.
cat *.err | grep -v '^+' | grep -v '^Downloading files using' | grep -v 'tar: Removing leading' | sort | uniq
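When there are many .err files, it can help to know how often each message occurs, not just which messages exist. The sketch below is a variation on the command above that adds counts and sorts the most frequent lines first. The function name count_stderr_lines is made up for illustration.

```shell
# Count how often each distinct stderr line occurs, most frequent first.
# Lines starting with "+" (commands echoed by the shell) are skipped.
# Arguments: one or more .err files (reads standard input if none are given).
count_stderr_lines() {
  grep -hv '^+' "$@" | sort | uniq -c | sort -rn
}

# Usage (run in the directory that holds the downloaded logs):
#   count_stderr_lines *.err | less -S
```

A message that appears once per job usually points to routine output, while a message that appears in only a handful of files is often a better lead.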
# Grep (case insensitive) for any instances of the words "warning" or "error" that were written to standard output rather than standard error
egrep -i '(warning|error)' *.out
Troubleshooting jobs that failed without an obvious explanation
If you have jobs that failed without writing relevant output to their job log, here are some things you can check:
- Is job state "failed" or something else?
- Is the job directing its output to the place where you expect to find it?
- What are the failure reason and failure message?
- Is Total Price >= costLimit?
- Did the job fail on its 9th try?
- DNAnexus has a maxRestarts limit of 9, as described here. Jobs preempted or otherwise restarted 9 times will fail.
- You can set lower limits for specific restart types. For example, if you don't want your job to re-run after more than 3 SpotInstanceInterruptions, you could give dx run the flag
--extra-args '{"executionPolicy":{"restartOn": {"SpotInstanceInterruption": 3}}}'
- The upper limit of 9 for maxRestarts is a hard limit that applies to all restart causes combined. I asked DNAnexus support and they said it is not currently possible to set maxRestarts higher than 9.
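Quoting JSON by hand on the command line is easy to get wrong, especially if the text is copied from a page that turns straight quotes into curly ones. As a rough sketch, a small helper can build the executionPolicy JSON for one restart type. The function name restart_policy_json is hypothetical, and the app name and input in the commented dx run line are placeholders.

```shell
# Build the --extra-args JSON that caps restarts for one restart type.
# $1 = restart type (for example SpotInstanceInterruption)
# $2 = maximum number of restarts allowed for that type
restart_policy_json() {
  printf '{"executionPolicy":{"restartOn":{"%s":%d}}}\n' "$1" "$2"
}

# Placeholder usage (app-example and its input are not real names):
#   dx run app-example -iinput=file-xxxx \
#     --extra-args "$(restart_policy_json SpotInstanceInterruption 3)"

# Print the JSON so it can be checked before it is passed to dx run
restart_policy_json SpotInstanceInterruption 3
```

The last line prints {"executionPolicy":{"restartOn":{"SpotInstanceInterruption":3}}}.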
You can get a lot of details about your job with dx describe job-abc123, where job-abc123 is the job ID. Here is a command to grep that output for some of the most common causes of failure:
dx describe job-abc123 | egrep -w 'Try|State|Output folder|Total Price|costLimit|failureCounts|Failure'
Here is example output for that command:
Try 9
State failed
Output folder /QC/rareqc_step1
Try created Wed Aug 20 20:45:17 2025
Failure reason SpotInstanceInterruption
Failure message The machine running the job was terminated by the cloud provider
Total Price £0.04
costLimit 0.5
failureCounts {"SpotInstanceInterruption": 8, "UnresponsiveWorker": 1}
The job in this example failed because it hit the maxRestarts limit of 9. It was preempted 8 times and also had one restart for the reason “UnresponsiveWorker”.
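To check a job against the restart limit without adding up the counts by eye, the numbers on the failureCounts line can be summed. This is a minimal sketch that assumes the line has the same shape as in the example output; sum_failure_counts is a made-up helper name.

```shell
# Sum the restart counts on the failureCounts line of `dx describe` text output.
# Reads the describe output on standard input and prints the total (0 if the line is absent).
sum_failure_counts() {
  grep -w 'failureCounts' \
    | grep -oE ':[[:space:]]*[0-9]+' \
    | tr -dc '0-9\n' \
    | awk '{ total += $1 } END { print total + 0 }'
}

# Check against the failureCounts line from the example output: 8 + 1
printf 'failureCounts  {"SpotInstanceInterruption": 8, "UnresponsiveWorker": 1}\n' | sum_failure_counts
```

For the example line this prints 9. With a real job, the input would come from dx describe job-abc123 | sum_failure_counts.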
Checking an entire batch of jobs for non-obvious causes of failure
You can also check for common causes of “failure without a log entry” in bulk for a batch of jobs. This uses the jq command to parse JSON on the command line (jq is available for most Linux and macOS systems, though it may need to be installed first). In this example I search for all failed jobs that share the same tag, but you can modify the dx find jobs command to search by other criteria such as date or job name.
Code:
dx find jobs --tag $batch_tag --state failed -n99999999 --json | jq -r '
  [ .[]
    | {
        job_id: .id,
        state: .state,
        try_number: .try,
        total_price: .totalPrice,
        cost_limit: (if (.costLimit) then .costLimit else "NA" end),
        final_failure_reason: .failureReason,
        failure_counts: (
          if (.failureCounts and (.failureCounts | length) > 0)
          then (.failureCounts | to_entries | map("\(.key):\(.value)") | join(","))
          else "NA"
          end
        ),
        failure_message: .failureMessage
      }
  ]
  | (.[0] | keys_unsorted) as $keys  # Get the column names for the header row
  | ($keys, (.[] | [.[ $keys[] ]]))  # Emit the header row, then each data row with its columns in the same order as the header
  | @tsv
' | column -t -s $'\t' | less -S
Example output:
job_id         state   try_number  total_price           cost_limit  final_failure_reason      failure_counts                                   failure_message
job-redacted1  failed  9           0.043327517333333336  0.5         SpotInstanceInterruption  SpotInstanceInterruption:9                       The machine running the job was terminated by the cloud provider
job-redacted2  failed  9           0.050323937           0.5         SpotInstanceInterruption  SpotInstanceInterruption:9                       The machine running the job was terminated by the cloud provider
job-redacted3  failed  9           0.04065517866666667   0.5         SpotInstanceInterruption  SpotInstanceInterruption:8,UnresponsiveWorker:1  The machine running the job was terminated by the cloud provider
job-redacted4  failed  9           0.03854973733333334   0.5         SpotInstanceInterruption  SpotInstanceInterruption:9                       The machine running the job was terminated by the cloud provider
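For a large batch, the table can be filtered instead of read row by row. The sketch below assumes the tab-separated output was saved to a file before the column -t step (failed_jobs.tsv is a hypothetical name) and that the columns are in the order shown in the header above. The function name and the two labels it prints are also made up for illustration.

```shell
# Flag jobs that look like they hit a limit rather than a bug in the job itself.
# Input: tab-separated rows with a header line, where
#   column 3 = try_number, column 4 = total_price, column 5 = cost_limit ("NA" if unset)
flag_limit_failures() {
  awk -F '\t' '
    NR == 1 { next }   # skip the header row
    $5 != "NA" && ($4 + 0) >= ($5 + 0) { print $1 "\tcost_limit_reached" }
    ($3 + 0) >= 9                      { print $1 "\ttry_9_or_later" }
  ' "$@"
}

# Usage, after saving the jq output without the column and less steps:
#   flag_limit_failures failed_jobs.tsv
```

Jobs that are not flagged failed for some other reason, so their failure_message column is the next thing to read.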