Speed/throttling when submitting 1000s of jobs

Gabriel Doctor

Hi

Please could you let me know if there is any constraint against queuing up e.g. 1000 or 2000 jobs to be done as low-priority workers become available? The jobs access a few small, similar files from my platform, as well as different large ones, but I don't think that is the problem, as they aren't even starting.

When I ran just c. 100 jobs, all seemed to start within a few hours and completed quickly.

With 1000 queued, I've noticed that very few have even started. Am I being throttled because of some flag raised, or is it just the time of day and patience required?

Comments

4 comments

  • Comment author
    Dr. Mc. Ninja

    Let me know if the below is accurate or just slop.

    kthxbi

     

    There is no rule that stops you from submitting thousands of jobs, but two built-in throttles explain why only a few of your 1 000 low-priority jobs are actually starting:

    | Where the brake comes from          | What it does                                      | What you can do                                 |
    | ----------------------------------- | ------------------------------------------------- | ----------------------------------------------- |
    | **Per-user worker quota**           | By default a DNAnexus account can have only       | - Ask UKB-RAP support to raise your *Running    |
    |                                     | **100 workers running at once**. Once you hit     |   Workers* limit if your project really needs   |
    |                                     | that, every extra job stays in *runnable* state   |   more parallelism.                             |
    |                                     | until one of the active jobs finishes or the      | - Or design the pipeline so each job processes  |
    |                                     | quota is raised.                                  |   more than one "unit", reducing the number of  |
    |                                     |                                                   |   jobs you launch.                              |
    | ----------------------------------- | ------------------------------------------------- | ----------------------------------------------- |
    | **Spot-capacity waiting**           | Low-priority jobs run on AWS Spot instances. If   | - Try smaller or more common instance types.    |
    | (low-priority only)                 | EC2 has no spare capacity of the instance type    | - Switch a backlog of urgent jobs to *normal*   |
    |                                     | you requested, the job simply waits—sometimes     |   priority (they fall back to on-demand after   |
    |                                     | for hours or even days—until a spot VM becomes    |   15 min) or to *high* priority (always         |
    |                                     | available.                                        |   on-demand).                                   |

    Practical guidance for huge submissions

    • DNAnexus’ large-batch best-practice notes suggest keeping the “live” queue well below 5 000 jobs and, for HLA-typing as an example, they actually ran 2 000 jobs in parallel while bundling 100 samples per job.
      Submitting in waves (e.g. 500 at a time) lets you spot errors early and keeps the monitor view usable.
    • Nothing is being “flagged” because you queued 1 000 jobs; they are just sitting behind the 100-worker ceiling and/or waiting for spot capacity. You can confirm this by running the commands below (--brief prints one job ID per line, and -n raises the default cap of 10 results):
          dx find jobs --state running --brief -n 2000 | wc -l     # how many are actually on workers
          dx find jobs --state runnable --brief -n 2000 | head     # the rest are waiting for capacity

    Bottom line: the platform happily accepts thousands of queued jobs, but only the first ~100 per user can run simultaneously, and low-priority work is further gated by spot-instance availability. Raise your worker quota or batch submissions more coarsely if you need faster turn-around.
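
    One way to act on the "each job processes more than one unit" suggestion is to bundle inputs before submitting. The sketch below is a minimal illustration, not a tested pipeline: `my_applet`, its `sample_ids` string input, and the `bundled-batch` tag are hypothetical placeholders, while `--priority`, `--tag`, `-y` and `--brief` are standard `dx run` options. It prints the commands instead of running them, so a batch can be reviewed first.

    ```shell
    #!/usr/bin/env bash
    # Sketch only: "my_applet", "sample_ids" and "bundled-batch" are
    # hypothetical names, not taken from this thread.

    # Number of jobs needed when each job handles up to $2 units
    # (integer ceiling division).
    jobs_needed() {
        local total="$1" per_job="$2"
        echo $(( (total + per_job - 1) / per_job ))
    }

    # Read IDs (one per line) from file $1 and print one "dx run" command
    # per bundle of up to $2 IDs. Nothing is submitted; commands are only printed.
    print_bundled_submissions() {
        local list="$1" per_job="$2"
        awk -v n="$per_job" '
            NF == 0 { next }                                  # skip blank lines
            { buf = (c == 0 ? $0 : buf "," $0); c++ }         # append ID to bundle
            c == n  { print buf; buf = ""; c = 0 }            # bundle is full
            END     { if (c > 0) print buf }                  # last partial bundle
        ' "$list" | while IFS= read -r ids; do
            printf 'dx run my_applet -isample_ids=%s --priority low --tag bundled-batch -y --brief\n' "$ids"
        done
    }
    ```

    For example, `jobs_needed 1000 10` gives 100, so bundling ten units per job would let a 1 000-unit workload fit under the default 100-worker ceiling in a single round. Redirecting the output of `print_bundled_submissions` to a file also makes it easy to submit in waves by running only part of that file at a time.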

  • Comment author
    Gabriel Doctor

    Thanks, this is a reasonable answer.

  • Comment author
    Hayley Jane Power

    Hi, I am aiming to submit a batch of 1000 jobs as you've spoken about here and just wondered if you could confirm that jobs which exceed the 100 limit and remain in the queue aren't charged?

  • Comment author
    Dr. Mc. Ninja

    Short answer: yes, that’s correct 👍

    On the UK Biobank RAP (DNAnexus), you are only charged once a job actually starts running on an instance. Jobs that exceed the concurrent running limit (for example, you submit 1000 jobs but only ~100 are allowed to run at once) will simply sit in the queue, and queued jobs do not incur any compute charges.

    A few concrete points to make it crisp:

    • 🕒 Queued jobs = no billing
      While a job is queued (the *runnable* state, waiting for a worker) and has not been assigned an instance, there is zero compute cost.
    • ▶️ Billing starts at execution
      Charges begin only when the job transitions to running and an AWS instance is allocated.
    • 📊 100-job concurrency limit
      Submitting 1000 jobs is fine. The platform throttles execution automatically; the excess jobs just wait their turn.
    • 💾 No hidden storage costs for queued jobs
      A queued job does not spin up disks, containers, or temporary storage. Those only appear once the job starts.
    • ⚠️ Edge case to be aware of
      If a job starts running and then retries or restarts (e.g. spot eviction, failure), each running attempt is billed, but again only for the time it is actually running.

    This is exactly why large scatter-style submissions (hundreds or thousands of jobs) are a normal and supported pattern on RAP. You can safely fire off the whole batch without worrying about being charged for the backlog sitting patiently in the wings 🐦‍⬛.

    If you want, I can also show you:

    • how to sanity-check job states via dx describe / dx find jobs, or
    • patterns to stagger submissions or tag jobs so you can monitor cost cleanly at scale.
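
    A minimal sketch of the state check mentioned above, assuming the jobs were submitted with a tag (here the hypothetical `my-batch`, set via `dx run ... --tag my-batch`). It also assumes the dx toolkit is installed, you are logged in, and the batch has fewer than 5 000 jobs; adjust `-n` otherwise.

    ```shell
    #!/usr/bin/env bash
    # Sketch only: "my-batch" is a hypothetical tag, not one from this thread.

    # Count jobs carrying tag $1 that are currently in state $2.
    # --brief prints one job ID per line; -n raises the default cap of 10 results.
    count_jobs() {
        local tag="$1" state="$2"
        dx find jobs --tag "$tag" --state "$state" --brief -n 5000 | grep -c '^job-'
    }

    # Print one line per state for a tagged batch: waiting, on workers,
    # finished, and failed.
    summarize_batch() {
        local tag="$1" state
        for state in "runnable" "running" "done" "failed"; do
            printf '%-9s %s\n' "$state" "$(count_jobs "$tag" "$state")"
        done
    }
    ```

    Running `summarize_batch my-batch` periodically shows how much of the backlog is still waiting in *runnable* and how many jobs are actually occupying workers.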
