How to make a report with cost/runtime/re-runs/preemptions/etc. for your jobs

Kristen

Here is a command that will generate .tsv data with information about your jobs, including the total cost for each job, the number of re-runs each job had (due to preemption/etc.), the runtime for the last run of the job, and the total runtime for the job across all attempts.

The code below assumes:

  • You have the dx toolkit commands installed.
  • You have already run dx login and dx select, and the project you want to generate the report for is your active project.
  • You have access to the jq command for parsing JSON from the command line (available for almost any Linux or macOS system).

You can change the options to dx find jobs to search by date, user, job tag, etc.

Once you have generated the report, you can read it with R or another analysis tool of your choice. To view it from the command line with nicely aligned, human-readable columns, redirect the command's output to a file (called $outfile here) and run cat $outfile | column -s $'\t' -t | less -S
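For a quick total without leaving the shell, here is a minimal awk sketch that sums the total_cost column of a saved report. It assumes the report was redirected to a file; the temporary file and the three rows (copied from the example output further down, with most columns left out) are only there to make the snippet self-contained.

```shell
# Stand-in for a saved report: a temporary file with a header and three sample rows.
outfile=$(mktemp)
printf 'jobname\ttags\ttotal_cost\n' > "$outfile"
printf 'qcstep1.chr9chunk004\tqcstep1.run2\t0.0523\n' >> "$outfile"
printf 'qcstep1.chr9chunk003\tqcstep1.run2\t0.0094\n' >> "$outfile"
printf 'qcstep1.chr9chunk002\tqcstep1.run2\t0.0281\n' >> "$outfile"

# Find the total_cost column by its header name, then sum it, skipping "NA" values.
total=$(awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "total_cost") col = i; next }
  $col != "NA" { sum += $col }
  END { printf "%.4f\n", sum }
' "$outfile")
echo "Total cost: $total"
rm -f "$outfile"
```

Looking the column up by name rather than by position means the same snippet should work on either version of the report, since the detailed version has extra columns.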

I hope this will save someone else some time figuring out how to do it!!

A simple version that only looks at jobs' most recent run attempts

This version only looks at each job's most recent run attempt. Total cost is already a total across all run attempts, so you can use this version to get total cost. But if you want other information like the total number of minutes your job ran across all run attempts, that will require the more detailed version below.

The code:

dx find jobs --created-after 2024-01-01 -n 999999999999999999 --json | jq -r '
  [ .[] |
    {
      jobname: .name,
      tags: (if (.tags | length) > 0 then (.tags | join(",")) else "NA" end),
      executable: .executableName,
      state: .state,
      total_cost: ( if .totalPrice then (.totalPrice * 10000 | floor | . / 10000) else "NA" end),
      instanceType: .instanceType,
      priority: .priority,
      last_runtime_minutes: (if (.startedRunning and .stoppedRunning) then ((.stoppedRunning - .startedRunning) / 1000 / 60 * 10 | floor / 10) else "NA" end),
      failureCounts: (
        if (.failureCounts and (.failureCounts | length) > 0)
        then (.failureCounts | to_entries | map("\(.key):\(.value)") | join(", "))
        else "NA"
        end
      ),
      jobid: .id,
      submitted: (if .created then (.created / 1000 | strflocaltime("%Y-%m-%d %H:%M")) else "" end),
      user: (.launchedBy | sub("^user-"; ""))
    }
  ]
  | (.[0] | keys_unsorted) as $keys # Get names for header rows
  | ($keys, (.[] | [.[ $keys[] ]])) # Combine header rows and the data rows that go under them (putting the columns of the data rows in the same order as the headers)
  | @tsv
'

The output will look like this:

jobname               tags          executable        state  total_cost  instanceType     priority  last_runtime_minutes  failureCounts               jobid         submitted         user
qcstep1.chr9chunk004  qcstep1.run2  swiss-army-knife  done   0.0523      mem3_ssd1_v2_x4  low       35.4                  SpotInstanceInterruption:4  job-redacted5  2025-06-18 22:42  redacted
qcstep1.chr9chunk003  qcstep1.run2  swiss-army-knife  done   0.0094      mem3_ssd1_v2_x4  low       16.9                  NA                          job-redacted4  2025-06-18 22:42  redacted
qcstep1.chr9chunk002  qcstep1.run2  swiss-army-knife  done   0.0281      mem3_ssd1_v2_x4  low       18.2                  SpotInstanceInterruption:3  job-redacted3  2025-06-18 22:42  redacted
qcstep1.chr9chunk001  qcstep1.run2  swiss-army-knife  done   0.0201      mem3_ssd1_v2_x4  low       18.4                  SpotInstanceInterruption:2  job-redacted2  2025-06-18 22:42  redacted
qcstep1.chr8chunk011  qcstep1.run2  swiss-army-knife  done   0.0104      mem3_ssd1_v2_x4  low       18.6                  NA                          job-redacted1  2025-06-18 22:42  redacted
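If the last three lines of the jq program look cryptic, this small sketch shows the same header-plus-rows pattern on its own. The two records and their field names are made up for illustration; only the jq idiom is taken from the code above.

```shell
# Two made-up records; the field names are illustrative only.
json='[{"jobname":"a","state":"done"},{"jobname":"b","state":"failed"}]'

tsv=$(printf '%s' "$json" | jq -r '
  (.[0] | keys_unsorted) as $keys     # column names, in the order they appear in the first object
  | ($keys, (.[] | [.[ $keys[] ]]))   # header row first, then one row per object in the same column order
  | @tsv
')
printf '%s\n' "$tsv"
```

Because the column order comes from the first object, adding or reordering a field in the object definition is enough to change the report; the header and the data rows stay in sync.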

 

A more detailed version that combines information across run attempts/preemptions

This version adds fields that combine information across run attempts, such as the total runtime summed over every attempt/restart of a job that got preempted.

For this version, you need the --include-restarted flag for dx find jobs so it will output one entry per run attempt (instead of only showing each job's most recent run attempt).

The code:

dx find jobs --created-after 2024-01-01 -n 999999999999999999 --include-restarted --json | jq -r '
  # For each run attempt, compute attempt_runtime_minutes as: (time of the last state transition of any kind) - (last time the job entered the "running" state)
  # map() works kind of like a for loop, and . + {} means "take the current JSON object (one run attempt for one job) and add the key attempt_runtime_minutes"
  map(
    . + { attempt_runtime_minutes: (
            (.stateTransitions | map(select(.newState == "running")) | sort_by(.setAt) | last) as $run
            | (.stateTransitions | sort_by(.setAt) | last) as $end
            | if ($run and $end and $run.setAt and $end.setAt) then (($end.setAt - $run.setAt) / 1000 / 60) else null end
        ) }
  )
  # Take current array of JSON objects (one per job run attempt) and collapse it down to an array with one object per job (covering all of its run attempts)
  # This uses group_by(.id) to group objects by job ID. We will create new JSON objects below that summarize each group of objects.
  # When we create the new objects with {}, using .[0].instanceType accesses the value for that key from the first job in the group
  # For total_restarts, "length" gives the total number of run attempts in each group, so the number of restarts is length - 1
  # For total_runtime_minutes, we will use map() to iterate over the groups of run attempts for each job and add up all non-null values (and round to one post-decimal digit)
  | [
    group_by(.id)[]
    | {
        jobname: .[0].name,
        tags: (if (.[0].tags | length) > 0 then (.[0].tags | join(",")) else "NA" end),
        executable: .[0].executableName,
        state: .[0].state,
        total_cost: (if (.[0].totalPrice) then (.[0].totalPrice * 10000 | floor | . / 10000) else "NA" end),
        instanceType: .[0].instanceType,
        priority: .[0].priority,
        last_runtime_minutes: (if (.[0].startedRunning and .[0].stoppedRunning) then ((.[0].stoppedRunning - .[0].startedRunning) / 1000 / 60 * 10 | floor / 10) else "NA" end),
        total_restarts: (length - 1),
        failureCounts: (
          if (.[0].failureCounts and (.[0].failureCounts | length) > 0)
          then (.[0].failureCounts | to_entries | map("\(.key):\(.value)") | join(", "))
          else "NA"
          end
        ),
        total_runtime_minutes: (
          ([.[] | .attempt_runtime_minutes] | map(select(. != null))) as $runtimes
          | if ($runtimes | length) > 0 then (($runtimes | add) * 10 | floor / 10) else "NA" end
        ),
        jobid: .[0].id,
        submitted: (if .[0].created then (.[0].created / 1000 | strflocaltime("%Y-%m-%d %H:%M")) else "" end),
        user: (.[0].launchedBy | sub("^user-"; ""))
      }
    ]
  | (.[0] | keys_unsorted) as $keys # Get names for header rows
  | ($keys), (.[] | [.[ $keys[] ]]) # Combine header rows and the data rows that go under them (putting the columns of the data rows in the same order as the headers)
  | @tsv
'

The output will look like this:

jobname                          tags                   executable                  state       total_cost  instanceType      priority  last_runtime_minutes  total_restarts  failureCounts                                     total_runtime_minutes  jobid                         submitted         user
qcstep1.chr8chunk011             qcstep1.run2           swiss-army-knife            done        0.0104      mem3_ssd1_v2_x4   low       18.6                  0               NA                                                18.6                   job-redacted1                 2025-06-18 22:42  redacted
qcstep1.chr9chunk001             qcstep1.run2           swiss-army-knife            done        0.0201      mem3_ssd1_v2_x4   low       18.4                  2               SpotInstanceInterruption:2                        62.7                   job-redacted2                 2025-06-18 22:42  redacted
qcstep1.chr9chunk002             qcstep1.run2           swiss-army-knife            done        0.0281      mem3_ssd1_v2_x4   low       18.2                  3               SpotInstanceInterruption:3                        95.5                   job-redacted3                 2025-06-18 22:42  redacted
qcstep1.chr9chunk003             qcstep1.run2           swiss-army-knife            done        0.0094      mem3_ssd1_v2_x4   low       16.9                  0               NA                                                16.9                   job-redacted4                 2025-06-18 22:42  redacted
qcstep1.chr9chunk004             qcstep1.run2           swiss-army-knife            done        0.0523      mem3_ssd1_v2_x4   low       35.4                  4               SpotInstanceInterruption:4                        153.2                  job-redacted5                 2025-06-18 22:42  redacted
