How to submit a large input-json with dx run?

I have a large input JSON (it lists all ~60K WGS VCFs) that I am trying to pass to a workflow, and I am hitting the following timeout:

 

```
$ x=workflow-GGyX92QJ4Q9k9yxx9xPZz9z2
dnanexus@job-GGzK130J4Q9V5ZfPBppqJX2g:~$ dx run $x -y --input-json-file all.json --destination ukbb200k_sites/

WARNING:dxpy:[Thu Oct 6 14:41:00 2022] POST http://10.0.3.1:8124/workflow-GGyX92QJ4Q9k9yxx9xPZz9z2/dryRun: urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='10.0.3.1', port=8124): Read timed out. (read timeout=600). Waiting 1 seconds before retry 1 of 6...
WARNING:dxpy:[Thu Oct 6 14:51:01 2022] POST http://10.0.3.1:8124/workflow-GGyX92QJ4Q9k9yxx9xPZz9z2/dryRun: urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='10.0.3.1', port=8124): Read timed out. (read timeout=600). Waiting 2 seconds before retry 2 of 6...
WARNING:dxpy:[Thu Oct 6 15:01:03 2022] POST http://10.0.3.1:8124/workflow-GGyX92QJ4Q9k9yxx9xPZz9z2/dryRun: urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='10.0.3.1', port=8124): Read timed out. (read timeout=600). Waiting 2 seconds before retry 3 of 6...
WARNING:dxpy:[Thu Oct 6 15:11:05 2022] POST http://10.0.3.1:8124/workflow-GGyX92QJ4Q9k9yxx9xPZz9z2/dryRun: urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='10.0.3.1', port=8124): Read timed out. (read timeout=600). Waiting 8 seconds before retry 4 of 6...
```

 

Any suggestions?

Comments

  • Chai Fungtammasan (DNAnexus Team)

    There is a limit on how long each API call can take and on how many input files can be used for each job. Are you looking for an operation that needs all the files together at once, or do you need to process each of them and plan to use this job to orchestrate that?

     

    Assuming the former, you might need to create an applet that takes a folder path, or a file containing a list of input files, as its input and consumes the files internally within the applet itself.

     

    Assuming the latter, you may find this guide useful: https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/guide-to-analyzing-large-sample-sets#create-job-submissions-where-each-job-processes-100-samples

    This is the simplest way to manage it. You can also use more advanced tricks, such as using each instance for more than one input file, but that is harder to optimize.

     

  • Kind of both. I'm trying to do a scatter-gather with WDL (here is the code). Each file is first processed independently (so 60K jobs), and then those outputs are processed together in one job.

     

    Thanks for the links - I will take a look.
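
For the scatter half, the batching idea from the linked guide (one submission per group of about 100 samples) could be sketched roughly as below. This is only an illustration under assumptions: the input field name `vcfs`, the `batch_NNNN.json` file naming, and both helper functions are hypothetical, and the real field name depends on the applet or workflow being run. The sketch only writes the per-batch input JSON files and builds the `dx run` command lines, reusing the flags from the question; it does not submit anything.

```python
import json
from pathlib import Path


def chunk(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_batch_inputs(file_ids, out_dir, batch_size=100, input_name="vcfs"):
    """Write one input JSON per batch and return the paths in batch order.

    `input_name` is a hypothetical input field; replace it with the real
    field name of the applet or workflow being run.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, batch in enumerate(chunk(file_ids, batch_size)):
        # Input JSON refers to a data object as {"$dnanexus_link": "<id>"}.
        payload = {input_name: [{"$dnanexus_link": fid} for fid in batch]}
        path = out_dir / f"batch_{index:04d}.json"
        path.write_text(json.dumps(payload))
        paths.append(path)
    return paths


def dx_run_command(executable, json_path, destination):
    """Build (but do not execute) a dx run command line for one batch."""
    return ["dx", "run", executable, "-y",
            "--input-json-file", str(json_path),
            "--destination", destination]
```

Each returned command list could then be passed to `subprocess.run` from a session that has the dx toolkit installed and is logged in. Keeping every input JSON small should keep each API call well below the size of the single 60K-entry request from the question.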

     

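
For the gather half, the first reply's suggestion of passing a single file that lists the inputs, rather than 60K separate file inputs, might look roughly like this inside the applet. It is a minimal sketch under assumptions: the manifest layout (one file ID per line, `#` for comments), the ID pattern, and the `read_manifest` helper are all hypothetical. Inside a real applet the manifest itself would first have to be downloaded to local disk, for example with dxpy, before it can be parsed.

```python
import re
from pathlib import Path

# Assumed ID shape: "file-" followed by 24 alphanumeric characters.
FILE_ID = re.compile(r"^file-[0-9A-Za-z]{24}$")


def read_manifest(path):
    """Return the file IDs listed in a manifest, one ID per line.

    Blank lines and lines starting with '#' are skipped; any other line that
    does not look like a file ID raises ValueError with its line number.
    """
    ids = []
    for number, raw in enumerate(Path(path).read_text().splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not FILE_ID.match(line):
            raise ValueError(f"line {number}: not a file ID: {line!r}")
        ids.append(line)
    return ids
```

With this layout the job receives one input (the manifest) no matter how many files it lists, so the submission request stays small and the applet decides how to fetch and process the listed files.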
