How to set a time limit policy on dx run jobs with 'cmd' input, and how to check the limit

Hi!

I am trying to run some GWAS analyses with Hail on the UK Biobank RAP. To automate things, I am using the Spark JupyterLab app through the dx run command. I could not find much information on how to set up time limits in this case (since I would like to be in full control of what I am running).

From the manual page here: https://platform.dnanexus.com/app/app-dxjupyterlab_spark_cluster

It is specified that this is how to run a dxjupyterlab or dxjupyterlab_spark_cluster command: 

my_cmd="papermill notebook.ipynb output_notebook.ipynb"
dx run dxjupyterlab -icmd="$my_cmd" -iin="notebook.ipynb"

It is also specified that:

the "duration" argument will be ignored when running the app with "cmd". The app can be run from commandline with the --extra-args flag to limit the runtime, e.g. "dx run dxjupyterlab --extra-args '{"timeoutPolicyByExecutable": {"app-xxxx":{"*": {"hours": 1}}}}'".

I tried using this command to run my notebook (including the timeoutPolicyByExecutable):

my_cmd="papermill notebook.ipynb output_notebook.ipynb"
dx run dxjupyterlab_spark_cluster -icmd="$my_cmd" -iin="notebook.ipynb" --instance-type "mem1_ssd1_v2_x16" --tag="test1" --priority "high" --name "test1" --extra-args '{"timeoutPolicyByExecutable": {"app-dxjupyterlab_spark_cluster":{"*": {"minutes": 5}}}}'

But the job kept running beyond the 5-minute limit I set. Moreover, when I look for information on the timeout policy in the JSON of the running job by typing:

dx describe job-xxxx --json | jq '.timeoutPolicyByExecutable'

I get null as output.

What am I doing wrong? Is there some clear information around on this?

Thanks

Comments

7 comments

  • Comment author
    Dr. Mc. Ninja

    Hi Gabriele Maria Sgarlata 

     

    Here is a suggestion from my assistant. Let me know if it's any help or not. Please bear in mind that, while enthusiastic, she's a bit confused by the amount of documentation I give her to process ;-)

     

    The flag itself is fine — the problem is the key you pass inside timeoutPolicyByExecutable.

    • timeoutPolicyByExecutable expects the executable ID (app-xxxxxxxxxxxxxxxxxxxxxxxx) as the first-level key, not the human-readable name dxjupyterlab_spark_cluster.
      The platform validates the key against the regex ^(app|applet)-[0-9A-Za-z]{24}$ and silently ignores anything that does not match, so the policy you sent is dropped and the job inherits the default 30-day timeout.
    • In the docs the working examples always use real IDs, e.g. the REGENIE snippet (note the long alphanumeric IDs).
    • At runtime, the policy is looked up by ID, then by entry-point name, exactly as described in the “Specifying Job Timeouts” section.
      Because no key matched, dx describe … | jq '.timeoutPolicyByExecutable' quite correctly prints null.

    How to apply a timeout that actually works

    # 1. Get the ID of the executable you are launching (note the app- prefix, which dx describe needs to resolve the app by name)
    APP_ID=$(dx describe app-dxjupyterlab_spark_cluster --json | jq -r .id)
    # returns something like app-G1PQbYj0f9Yk8Zy061p0Q123
    
    # 2. Build your JSON with that ID
    EXTRA='{"timeoutPolicyByExecutable": {"'"$APP_ID"'": {"*": {"minutes": 5}}}}'
    
    # 3. Launch
    my_cmd='papermill notebook.ipynb output_notebook.ipynb'
    dx run dxjupyterlab_spark_cluster \
      -icmd="$my_cmd" -iin=notebook.ipynb \
      --instance-type mem1_ssd1_v2_x16 \
      --tag test1 --priority high --name test1 \
      --extra-args "$EXTRA"
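
    Before launching, it may be worth piping the JSON you just built through jq to make sure it is well-formed (a quick sanity check, nothing DNAnexus-specific about it):

    # pretty-prints the policy on success; jq exits non-zero if the string is not valid JSON
    echo "$EXTRA" | jq .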
    
    • Five minutes after the container starts, the root JupyterLab job will be terminated with JobTimeoutExceeded.
    • Any sub-jobs it spawns inherit the same 5-minute limit unless they carry their own override.
    • You can confirm the policy was set with
      dx describe <job-id> --json | jq '.timeoutPolicyByExecutable' (a way to capture the job ID and run this check right at launch is sketched below).
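
    If you are scripting this, one way (just a sketch) to capture the job ID at launch time and run that check straight away is to launch with -y (skip the confirmation prompt) and --brief (print only the job ID):

    JOB_ID=$(dx run dxjupyterlab_spark_cluster \
      -icmd="$my_cmd" -iin=notebook.ipynb \
      --extra-args "$EXTRA" -y --brief)
    dx describe "$JOB_ID" --json | jq '.timeoutPolicyByExecutable'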

    Alternative shortcuts

    • Limit just this one job instead of the whole tree:

      --extra-args '{"timeoutPolicy":{"*":{"minutes":5}}}'
      

      Here you omit the executable ID completely; the policy is attached directly to the job’s own timeoutPolicy field (a full command using this form is sketched after this list).

    • Use hours or days – any combination of "days", "hours", "minutes" is accepted.
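
    Putting the simpler per-job form together with the launch command above could look like this (just a sketch; the two-hour limit is an arbitrary example):

    my_cmd='papermill notebook.ipynb output_notebook.ipynb'
    dx run dxjupyterlab_spark_cluster \
      -icmd="$my_cmd" -iin=notebook.ipynb \
      --instance-type mem1_ssd1_v2_x16 \
      --extra-args '{"timeoutPolicy": {"*": {"hours": 2}}}'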

    Why the duration input does not help

    When you supply -icmd=…, the app switches to “headless” mode and the high-level duration parameter is explicitly ignored (see the app help page).
    timeoutPolicyByExecutable or timeoutPolicy are therefore the only supported ways to enforce a wall-clock limit for scripted JupyterLab/Spark launches.

    TL;DR
    Replace app-dxjupyterlab_spark_cluster with the real 24-character app ID (or use the simpler timeoutPolicy form) and the timeout will be honoured.

  • Comment author
    Gabriele Maria Sgarlata

    Thank you! 

    This helped solve my problem and now the job stops at the set time limit.

    However, when I type dx describe <job-id> --json | jq '.timeoutPolicyByExecutable' I still get null. Note that I run this command while the job is already running.

    Thank you,

    Gabriele

  • Comment author
    Dr. Mc. Ninja

    Glad this worked. Can you paste the whole of `dx describe <job-id> --json` (with anything sensitive redacted)? I'm not familiar with the output, but I wonder if timeoutPolicyByExecutable is the right key.
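
    If the full output turns out to be very long, listing the top-level keys (or just the ones with "time" in the name) might already tell us which field holds the limit; something along these lines should do it:

    # list all top-level keys of the job description
    dx describe <job-id> --json | jq 'keys'

    # or keep only the entries whose key name mentions "time"
    dx describe <job-id> --json | jq 'to_entries | map(select(.key | ascii_downcase | contains("time"))) | from_entries'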

     

  • Comment author
    Gabriele Maria Sgarlata

    Yeah, sure! Here it is. This is an example of an analysis to which I gave a time limit of 10 minutes.

    I hope it helps. Thank you!

    {
        "id": "job-J0g0K08Jy7K0q8K0qQ976KqK",
        "region": "aws:eu-west-2",
        "name": "chr22_bgen2mt",
        "tags": [
            "chr22_bgen2mt"
        ],
        "properties": {},
        "executable": "app-GykQjbQ0k912pQjJ1bJ9z88J",
        "executableName": "dxjupyterlab_spark_cluster",
        "class": "job",
        "created": 1747846913718,
        "modified": 1747847060522,
        "project": "project-xxx",
        "billTo": "",
        "costLimit": null,
        "invoiceMetadata": null,
        "folder": "/",
        "parentJob": null,
        "originJob": "job-J0g0K08Jy7K0q8K0qQ976KqK",
        "parentAnalysis": null,
        "analysis": null,
        "stage": null,
        "rootExecution": "job-J0g0K08Jy7K0q8K0qQ976KqK",
        "state": "running",
        "function": "main",
        "workspace": "container-J0g0K10J2VV3fxX3X0j5GF38",
        "launchedBy": "user-xxxxx",
        "detachedFrom": null,
        "priority": "high",
        "workerReuseDeadlineRunTime": {
            "state": "reuse-off",
            "waitTime": -1,
            "at": -1
        },
        "dependsOn": [],
        "singleContext": false,
        "failureCounts": {},
        "stateTransitions": [
            {
                "newState": "runnable",
    "setAt": 1747846917146
            },
            {
                "newState": "running",
                "setAt": 1747847060098
            }
        ],
        "ignoreReuse": true,
        "httpsApp": {
            "ports": [
                443,
                8081
            ],
            "shared_access": "NONE",
            "dns": {
                "url": "https://job-J0g0K08Jy7K0q8K0qQ976KqK.dnanexus.cloud"
            },
            "enabled": true,
            "loginURL": "https://ukbiobank.dnanexus.com/login",
            "isolatedBrowsing": false
        },
        "rank": 0,
        "parentJobTry": null,
        "detachedFromTry": null,
        "tryCreated": 1747846913718,
        "details": {},
        "systemRequirements": {
            "*": {
                "instanceType": "mem1_ssd1_v2_x16",
                "nvidiaDriver": "R535",
                "clusterSpec": {
                    "type": "spark",
                    "version": "3.5.2",
                    "initialInstanceCount": 2,
    "bootstrapScript": "#!/usr/bin/env bash\n\nset -e -o pipefail\n\n# Requires bash version 4\n# Uncomment particular regions once the app is enabled in them\ndeclare -A TARGET_PROJECTS\ndeclare -r IMAGES_ENV_FILE='/home/dnanexus/images.env'\ndeclare -r OLDEST_SUPPORTED_SNAPSHOT_APP_VERSION=2.0.0\n\nTARGET_PROJECTS[azure:westus]=\"App and Applet Assets Azure\"\nTARGET_PROJECTS[aws:us-east-1]=\"App and Applet Assets\"\nTARGET_PROJECTS[aws:eu-central-1]=\"App and Applet Assets Germany\"\nTARGET_PROJECTS[aws:eu-west-2]=\"App and Applet Assets London\"\nTARGET_PROJECTS[aws:eu-west-2-g]=\"App and Applet Assets Europe (London)\"\nTARGET_PROJECTS[aws:me-south-1]=\"App and Applet Assets Bahrain\"\nTARGET_PROJECTS[azure:uksouth-ofh]=\"App and Applet Assets OFH-TRE (London)\"\n# Adding VEP involves two steps:\n# - downloading ensemblorg/ensembl-vep Docker image, loading it, and tagging. The loaded image\n#   will have the \"latest\" set by default, a tag with a descriptive version should be additionally set.\n# - downloading and unpacking the cache files used by vep and storing them in /cluster/vep\n# The downloaded tarballs have to be available in public DNAnexus projects \"App and Applet Assets\"\n\nfunction locate_image() {\n    # returns path to the image used by docker\n    # the flavor image paths are in variables sourced from images.env\n    # if a snapshot is provided it has priority\n\n    # use snapshot path if provided\n    if [[ -n $snapshot ]]; then\n        echo \"$snapshot\"\n    else\n        # selecting flavor\n        local DOCKER_IMAGE=$spark_cluster_image\n        # apps and applets have different env variables for resource location\n        if [[ \"$DX_RESOURCES_ID\" != \"\" ]]; then\n            local DX_ASSETS_ID=\"$DX_RESOURCES_ID\"\n        else\n            # in apps images are on the top level in resource container,\n            # in applets they are in a folder inside the project\n            local DX_ASSETS_ID=\"$DX_PROJECT_CONTEXT_ID\"\n            local DIR_NAME=${DOCKER_IMAGE//_*//} # dxjupyterlab-r_1.2.3 -> dxjupyterlab-r/\n            local DOCKER_IMAGE=$DIR_NAME$DOCKER_IMAGE\n        fi\n\n        echo \"$DX_ASSETS_ID:$DOCKER_IMAGE\"\n    fi\n}\n\nversion_comp() {\n    # converts semantic version with three numbers to format that can be compared with test -lt\n    # if it is not a three number semantic version return 000000000\n    version=$(echo \"$1\" | sed 's/\\([0-9]*\\.[0-9]*\\.[0-9]*\\).*+build.*/\\1/' | awk -F. '{ printf \"%d.%d.%d\\n\", $1, $2, $3 }')\n    if [[ $version =~ ^[0-9]+\\.[0-9]+\\.[0-9]+$ ]]; then\n        # shellcheck disable=SC2046,SC2183\n        printf \"%03d%03d%03d\" $(echo \"$version\" | tr '.' ' ')\n    else\n        echo \"000000000\"\n    fi\n}\n\ncheck_snapshot(){\n    # if using snapshot, check that it is not from a too old app version\n    if [[ -n $snapshot ]] ; then\n        snapshot_app_version=$(dx describe \"$snapshot\" --json | jq -r '.details.app_version')\n        current_app_version=$(cat /home/dnanexus/dnanexus-executable.json | jq -r '.version')\n        if [[ 10#$(version_comp \"$snapshot_app_version\") -lt 10#$(version_comp $OLDEST_SUPPORTED_SNAPSHOT_APP_VERSION) ]]; then\n        echo \"AppError: Cannot use snapshot created by app version $snapshot_app_version with this version of the app ('$current_app_version'). 
Please use a different version of the app with this snapshot, or re-create the snapshot using this version of the app and use it with this version of the app.\"\n        dx-jobutil-report-error \"AppError: Cannot use snapshot created by app version $snapshot_app_version with this version of the app ('$current_app_version'). Please use a different version of the app with this snapshot, or re-create the snapshot using this version of the app and use it with this version of the app.\"\n        exit 1\n        fi\n    fi\n}\n\nload_docker(){\n    # load docker image\n    IMG_LOCATION=$(locate_image)\n    echo \"Loading the Docker image: $IMG_LOCATION\"\n    out=$(dx cat \"$IMG_LOCATION\" | docker load)\n    echo \"$out\"\n    imagename=${out#Loaded image: }\n    echo -e \"\\nexport imagename=$imagename\" >>/cluster/dx-cluster.environment\n}\n\nsetup_java_python(){\n    # take env from conda of image to host\n    docker run -v /opt:/scratch $imagename cp -r /opt/conda /scratch\n    source /opt/conda/bin/activate\n    export PYSPARK_PYTHON=/opt/conda/bin/python\n    export PYSPARK_DRIVER_PYTHON=/opt/conda/bin/python\n\n    # because we can not overwrite java in /cluster/hadoop/etc/hadoop/hadoop-env.sh\n    # which is generated by this script: https://github.com/dnanexus/nucleus/blob/6e6f607a3576c07eaa5dc1cd71f06506f1490360/tip/packages/ubuntu/custom/cluster-pkg/cluster/hadoop/etc/hadoop/hadoop-env.sh#L33\n    # so force to create new symlink\n    ln -sf /opt/conda/bin/java /usr/bin/java \n    echo -e \"\\nexport JAVA_HOME=/opt/conda\" >>/cluster/dx-cluster.environment\n}\n\n# restart java\nrestart_java_service(){\n    setup_java_python\n    restart_worker(){\n        echo \"Stopping Spark HDFS worker\"\n        $HADOOP_HOME/bin/hdfs --config \"${HADOOP_CONF_DIR}\" --daemon stop datanode\n        $SPARK_HOME/sbin/stop-worker.sh spark://master:$SPARK_MASTER_PORT\n\n        echo \"Restarting Spark HDFS worker\"\n        source /cluster/dx-cluster.environment\n        $HADOOP_HOME/bin/hdfs --config \"${HADOOP_CONF_DIR}\" --daemon start datanode\n        $SPARK_HOME/sbin/start-worker.sh spark://master:$SPARK_MASTER_PORT\n    }\n\n    restart_master(){\n        echo \"Stopping Spark HDFS head node\"\n        $HADOOP_HOME/bin/hdfs --config \"${HADOOP_CONF_DIR}\" --daemon stop namenode\n        $SPARK_HOME/sbin/stop-master.sh --properties-file $SPARK_CONF_DIR/spark-standalone.conf\n\n        echo \"Restarting Spark HDFS head node\"\n        source /cluster/dx-cluster.environment\n        $HADOOP_HOME/bin/hdfs --config \"${HADOOP_CONF_DIR}\" --daemon start namenode\n        $SPARK_HOME/sbin/start-master.sh --properties-file $SPARK_CONF_DIR/spark-standalone.conf\n        if [ $DNAX_INSTANCE_COUNT -eq 1 ]; then\n            restart_worker\n        fi\n    }\n  \n    if [ -z \"$DX_CLUSTER_MASTER_IP\" ]; then\n        echo \"Restarting Spark HDFS head node\"\n        restart_master\n    else\n        echo \"Restarting Spark HDFS worker...\"\n        restart_worker\n    fi\n}\n\ninstall_vep() {\n    # Both sha and tag need to be updated when updating VEP\n    VEP_DOCKER_SHA=bc0984bf18c78c968e5cfe59e819cedbeed41f4b564ec3750dd7843b51a63dfb\n    VEP_DOCKER_VERSION=1.0.9\n    echo \"VEP version: $VEP_DOCKER_VERSION\"\n\n    REGION=$(cat /home/dnanexus/dnanexus-job.json | jq -r .region)\n\n    echo \"Downloading and loading the VEP Docker image..\"\n    DOCKER_TARBALL=docker_dnanexus_vep_$VEP_DOCKER_VERSION.tar.gz\n    dx cat \"${TARGET_PROJECTS[$REGION]}:/jupyterlab/vep/$DOCKER_TARBALL\" | docker load &\n\n    
echo \"Downloading and unpacking VEP cache file..\"\n    dx cat \"${TARGET_PROJECTS[$REGION]}:/jupyterlab/vep/homo_sapiens_vep_GRCh38_${VEP_DOCKER_VERSION}.tar.gz\" | tar zxf - -C /cluster/\n\n    echo \"Downloading and unpacking LOFTEE plugin files.\"\n    gzip -d /cluster/vep/loftee.sql.gz\n    gzip -d /cluster/vep/human_ancestor.fa.gz\n    wait\n    docker tag dnanexus/dxjupyterlab-vep:$VEP_DOCKER_VERSION dnanexus/dxjupyterlab-vep:latest\n\n    chmod a+rwx /cluster/vep\n}\n\n# cluster-adv-pkg debian package will contain any third party libraries that needs to be installed\n# in all cluster nodes. Glow is one such example. The package has the following folder structure.\n#   /cluster-adv/third-party/<feature>/<feature-version>/*\n# Based on the feature, we might have to follow specific installation steps. In the case of glow,\n# it requires that all the glow related jars to be copied to spark classpath.\ninstall_glow_or_hail() {\n\n    JOB_INPUT_FILE=/home/dnanexus/job_input.json\n    sudo apt-get install cluster-adv-pkg\n\n    if [ -f \"$JOB_INPUT_FILE\" ]; then\n        feature=\"$(jq .feature --raw-output $JOB_INPUT_FILE)\"\n        echo \"Selected feature input is $feature. Installing..\"\n        if [ \"${feature}\" == 'GLOW' ]; then\n            /cluster-adv/third-party/glow/install.sh 2.0.0\n        elif [ \"${feature}\" == 'HAIL' ]; then\n            /cluster-adv/third-party/hail/install.sh 0.2.132\n        elif [ \"${feature}\" == 'HAIL-VEP' ]; then\n            /cluster-adv/third-party/hail/install.sh 0.2.132\n            install_vep\n        fi\n    fi\n}\n\n\nif [ \"$DX_JOB_ID\" != \"\" ]; then\n    # Attach log from prebootstrap script to the job log\n    if [[ -f $PRE_BOOTSTRAP_LOG ]]; then\n        cat $PRE_BOOTSTRAP_LOG\n    fi\n    source $IMAGES_ENV_FILE\n    load_docker\n    check_snapshot\n    setup_java_python\n    echo \"Executing bootstrap script on all nodes of the cluster\"\n    restart_java_service || echo \"Failed to restart Java service, continuing...\"\n    install_glow_or_hail\n    /cluster/dx-cluster.sh hdfs-enable-dnax\n    echo \"Done executing bootstrap script\"\nfi\n",
                    "ports": "9000, 40000-55000"
                }
            }
        },
        "executionPolicy": {
            "restartOn": {
                "UnresponsiveWorker": 2,
                "JMInternalError": 1,
                "ExecutionError": 1
            }
        },
        "instanceType": "mem1_ssd1_v2_x16",
        "finalPriority": "high",
        "networkAccess": [
            "*"
        ],
        "runInput": {
            "cmd": "",
            "in": [
                {
                    "$dnanexus_link": {
                        "project": "project-xxx",
    "id": "file-J0bxp2QJy7K82gKpy5bXkBfB"
                    }
                }
            ]
        },
        "originalInput": {
            "cmd": "",
            "in": [
                {
                    "$dnanexus_link": {
                        "project": "project-xxx",
                        "id": "file-J0bxp2QJy7K82gKpy5bXkBfB"
                    }
                }
            ],
            "duration": 240,
            "feature": "HAIL"
        },
        "input": {
            "cmd": "",
            "in": [
                {
                    "$dnanexus_link": "file-J0bxp2QJy7K82gKpy5bXkBfB"
                }
            ],
            "duration": 240,
            "feature": "HAIL"
        },
        "output": null,
        "clusterSlaves": [
            {
                "host": "ec2-18-170-99-229.eu-west-2.compute.amazonaws.com",
                "sshPort": 22,
                "internalIp": "10.60.33.38"
            }
        ],
        "host": "ec2-13-40-105-76.eu-west-2.compute.amazonaws.com",
        "debug": {},"app": "app-GykQjbQ0k912pQjJ1bJ9z88J",
        "resources": "container-GykQjbQJ3PgXpQjJ1bJ9z88j",
        "projectCache": "container-Gykvj90Jy7K5yQyqXx9FbZF4",
        "startedRunning": 1747846993000,
        "delayWorkspaceDestruction": false,
        "clusterID": "cluster-J0g0K08Jy7K0q8K0qQ976KqG",
        "clusterSpec": {
            "type": "spark",
            "version": "3.5.2",
            "initialInstanceCount": 2,
            "bootstrapScript": "#!/usr/bin/env bash\n\nset -e -o pipefail\n\n# Requires bash version 4\n# Uncomment particular regions once the app is enabled in them\ndeclare -A TARGET_PROJECTS\ndeclare -r IMAGES_ENV_FILE='/home/dnanexus/images.env'\ndeclare -r OLDEST_SUPPORTED_SNAPSHOT_APP_VERSION=2.0.0\n\nTARGET_PROJECTS[azure:westus]=\"App and Applet Assets Azure\"\nTARGET_PROJECTS[aws:us-east-1]=\"App and Applet Assets\"\nTARGET_PROJECTS[aws:eu-central-1]=\"App and Applet Assets Germany\"\nTARGET_PROJECTS[aws:eu-west-2]=\"App and Applet Assets London\"\nTARGET_PROJECTS[aws:eu-west-2-g]=\"App and Applet Assets Europe (London)\"\nTARGET_PROJECTS[aws:me-south-1]=\"App and Applet Assets Bahrain\"\nTARGET_PROJECTS[azure:uksouth-ofh]=\"App and Applet Assets OFH-TRE (London)\"\n# Adding VEP involves two steps:\n# - downloading ensemblorg/ensembl-vep Docker image, loading it, and tagging. The loaded image\n#   will have the \"latest\" set by default, a tag with a descriptive version should be additionally set.\n# - downloading and unpacking the cache files used by vep and storing them in /cluster/vep\n# The downloaded tarballs have to be available in public DNAnexus projects \"App and Applet Assets\"\n\nfunction locate_image() {\n    # returns path to the image used by docker\n    # the flavor image paths are in variables sourced from images.env\n    # if a snapshot is provided it has priority\n\n    # use snapshot path if provided\n    if [[ -n $snapshot ]]; then\n        echo \"$snapshot\"\n    else\n        # selecting flavor\n        local DOCKER_IMAGE=$spark_cluster_image\n        # apps and applets have different env variables for resource location\n        if [[ \"$DX_RESOURCES_ID\" != \"\" ]]; then\n            local DX_ASSETS_ID=\"$DX_RESOURCES_ID\"\n        else\n            # in apps images are on the top level in resource container,\n            # in applets they are in a folder inside the project\n            local DX_ASSETS_ID=\"$DX_PROJECT_CONTEXT_ID\"\n            local DIR_NAME=${DOCKER_IMAGE//_*//} # dxjupyterlab-r_1.2.3 -> dxjupyterlab-r/\n            local DOCKER_IMAGE=$DIR_NAME$DOCKER_IMAGE\n        fi\n\n        echo \"$DX_ASSETS_ID:$DOCKER_IMAGE\"\n    fi\n}\n\nversion_comp() {\n    # converts semantic version with three numbers to format that can be compared with test -lt\n    # if it is not a three number semantic version return 000000000\n    version=$(echo \"$1\" | sed 's/\\([0-9]*\\.[0-9]*\\.[0-9]*\\).*+build.*/\\1/' | awk -F. '{ printf \"%d.%d.%d\\n\", $1, $2, $3 }')\n    if [[ $version =~ ^[0-9]+\\.[0-9]+\\.[0-9]+$ ]]; then\n        # shellcheck disable=SC2046,SC2183\n        printf \"%03d%03d%03d\" $(echo \"$version\" | tr '.' ' ')\n    else\n        echo \"000000000\"\n    fi\n}\n\ncheck_snapshot(){\n    # if using snapshot, check that it is not from a too old app version\n    if [[ -n $snapshot ]] ; then\n        snapshot_app_version=$(dx describe \"$snapshot\" --json | jq -r '.details.app_version')\n        current_app_version=$(cat /home/dnanexus/dnanexus-executable.json | jq -r '.version')\n        if [[ 10#$(version_comp \"$snapshot_app_version\") -lt 10#$(version_comp $OLDEST_SUPPORTED_SNAPSHOT_APP_VERSION) ]]; then\n        echo \"AppError: Cannot use snapshot created by app version $snapshot_app_version with this version of the app ('$current_app_version'). 
Please use a different version of the app with this snapshot, or re-create the snapshot using this version of the app and use it with this version of the app.\"\n        dx-jobutil-report-error \"AppError: Cannot use snapshot created by app version $snapshot_app_version with this version of the app ('$current_app_version'). Please use a different version of the app with this snapshot, or re-create the snapshot using this version of the app and use it with this version of the app.\"\n        exit 1\n        fi\n    fi\n}\n\nload_docker(){\n    # load docker image\n    IMG_LOCATION=$(locate_image)\n    echo \"Loading the Docker image: $IMG_LOCATION\"\n    out=$(dx cat \"$IMG_LOCATION\" | docker load)\n    echo \"$out\"\n    imagename=${out#Loaded image: }\n    echo -e \"\\nexport imagename=$imagename\" >>/cluster/dx-cluster.environment\n}\n\nsetup_java_python(){\n    # take env from conda of image to host\n    docker run -v /opt:/scratch $imagename cp -r /opt/conda /scratch\n    source /opt/conda/bin/activate\n    export PYSPARK_PYTHON=/opt/conda/bin/python\n    export PYSPARK_DRIVER_PYTHON=/opt/conda/bin/python\n\n    # because we can not overwrite java in /cluster/hadoop/etc/hadoop/hadoop-env.sh\n    # which is generated by this script: https://github.com/dnanexus/nucleus/blob/6e6f607a3576c07eaa5dc1cd71f06506f1490360/tip/packages/ubuntu/custom/cluster-pkg/cluster/hadoop/etc/hadoop/hadoop-env.sh#L33\n    # so force to create new symlink\n    ln -sf /opt/conda/bin/java /usr/bin/java \n    echo -e \"\\nexport JAVA_HOME=/opt/conda\" >>/cluster/dx-cluster.environment\n}\n\n# restart java\nrestart_java_service(){\n    setup_java_python\nrestart_worker(){\n        echo \"Stopping Spark HDFS worker\"\n        $HADOOP_HOME/bin/hdfs --config \"${HADOOP_CONF_DIR}\" --daemon stop datanode\n        $SPARK_HOME/sbin/stop-worker.sh spark://master:$SPARK_MASTER_PORT\n\n        echo \"Restarting Spark HDFS worker\"\n        source /cluster/dx-cluster.environment\n        $HADOOP_HOME/bin/hdfs --config \"${HADOOP_CONF_DIR}\" --daemon start datanode\n        $SPARK_HOME/sbin/start-worker.sh spark://master:$SPARK_MASTER_PORT\n    }\n\n    restart_master(){\n        echo \"Stopping Spark HDFS head node\"\n        $HADOOP_HOME/bin/hdfs --config \"${HADOOP_CONF_DIR}\" --daemon stop namenode\n        $SPARK_HOME/sbin/stop-master.sh --properties-file $SPARK_CONF_DIR/spark-standalone.conf\n\n        echo \"Restarting Spark HDFS head node\"\n        source /cluster/dx-cluster.environment\n        $HADOOP_HOME/bin/hdfs --config \"${HADOOP_CONF_DIR}\" --daemon start namenode\n        $SPARK_HOME/sbin/start-master.sh --properties-file $SPARK_CONF_DIR/spark-standalone.conf\n        if [ $DNAX_INSTANCE_COUNT -eq 1 ]; then\n            restart_worker\n        fi\n    }\n  \n    if [ -z \"$DX_CLUSTER_MASTER_IP\" ]; then\n        echo \"Restarting Spark HDFS head node\"\n        restart_master\n    else\n        echo \"Restarting Spark HDFS worker...\"\n        restart_worker\n    fi\n}\n\ninstall_vep() {\n    # Both sha and tag need to be updated when updating VEP\n    VEP_DOCKER_SHA=bc0984bf18c78c968e5cfe59e819cedbeed41f4b564ec3750dd7843b51a63dfb\n    VEP_DOCKER_VERSION=1.0.9\n    echo \"VEP version: $VEP_DOCKER_VERSION\"\n\n    REGION=$(cat /home/dnanexus/dnanexus-job.json | jq -r .region)\n\n    echo \"Downloading and loading the VEP Docker image..\"\n    DOCKER_TARBALL=docker_dnanexus_vep_$VEP_DOCKER_VERSION.tar.gz\n    dx cat \"${TARGET_PROJECTS[$REGION]}:/jupyterlab/vep/$DOCKER_TARBALL\" | docker load &\n\n    echo 
\"Downloading and unpacking VEP cache file..\"\n    dx cat \"${TARGET_PROJECTS[$REGION]}:/jupyterlab/vep/homo_sapiens_vep_GRCh38_${VEP_DOCKER_VERSION}.tar.gz\" | tar zxf - -C /cluster/\n\n    echo \"Downloading and unpacking LOFTEE plugin files.\"\n    gzip -d /cluster/vep/loftee.sql.gz\n    gzip -d /cluster/vep/human_ancestor.fa.gz\n    wait\n    docker tag dnanexus/dxjupyterlab-vep:$VEP_DOCKER_VERSION dnanexus/dxjupyterlab-vep:latest\n\n    chmod a+rwx /cluster/vep\n}\n\n# cluster-adv-pkg debian package will contain any third party libraries that needs to be installed\n# in all cluster nodes. Glow is one such example. The package has the following folder structure.\n#   /cluster-adv/third-party/<feature>/<feature-version>/*\n# Based on the feature, we might have to follow specific installation steps. In the case of glow,\n# it requires that all the glow related jars to be copied to spark classpath.\ninstall_glow_or_hail() {\n\n    JOB_INPUT_FILE=/home/dnanexus/job_input.json\n    sudo apt-get install cluster-adv-pkg\n\n    if [ -f \"$JOB_INPUT_FILE\" ]; then\n        feature=\"$(jq .feature --raw-output $JOB_INPUT_FILE)\"\n        echo \"Selected feature input is $feature. Installing..\"\n        if [ \"${feature}\" == 'GLOW' ]; then\n            /cluster-adv/third-party/glow/install.sh 2.0.0\n        elif [ \"${feature}\" == 'HAIL' ]; then\n            /cluster-adv/third-party/hail/install.sh 0.2.132\n        elif [ \"${feature}\" == 'HAIL-VEP' ]; then\n            /cluster-adv/third-party/hail/install.sh 0.2.132\n            install_vep\n        fi\n    fi\n}\n\n\nif [ \"$DX_JOB_ID\" != \"\" ]; then\n    # Attach log from prebootstrap script to the job log\n    if [[ -f $PRE_BOOTSTRAP_LOG ]]; then\n        cat $PRE_BOOTSTRAP_LOG\n    fi\n    source $IMAGES_ENV_FILE\n    load_docker\n    check_snapshot\n    setup_java_python\n    echo \"Executing bootstrap script on all nodes of the cluster\"\n    restart_java_service || echo \"Failed to restart Java service, continuing...\"\n    install_glow_or_hail\n    /cluster/dx-cluster.sh hdfs-enable-dnax\n    echo \"Done executing bootstrap script\"\nfi\n",
            "ports": "9000, 40000-55000"
        },
        "nvidiaDriver": "R535",
        "preserveJobOutputs": null,
        "detailedJobMetrics": false,
        "try": 0,
        "egressReport": {},
        "timeout": 600000,
        "treeTurnaroundTime": 190
    }

     

  • Comment author
    Dr. Mc. Ninja

    "timeout": 600000, ← This looks like what you want?

  • Comment author
    Gabriele Maria Sgarlata

    Thank you so much, Dr. Mc. Ninja!!

    Yes, this is what I wanted. It is good to know that the time is expressed in milliseconds. I would also suggest updating the documentation page to specify that the right key for checking a job's time limit is timeout, that is:

    dx describe <job-id> --json | jq '.timeout'

    Best,

    Gabriele

  • Comment author
    Dr. Mc. Ninja

    Glad it helped. 

    Yeah, in theory the GitBook documentation lives somewhere on GitHub and can be improved by the community, but I haven't found the right repo to make a PR against… I forget what I wanted to change. Perhaps someone from RAP can do it :-)

