Is there a way to access locally stored files in a project from a submitted job without having to use "dx download"?

I am new to the DNAnexus platform and have been trying to get to grips with the way analysis is done.

I apologise if this question has been asked in some form before, but I have to say the documentation on this feature of the DNAnexus/UKBB platform is really unclear.

I have been running a custom applet that analyses WGS data from distinct loci.

The data of interest are in our local storage folder: "Project_name:/Bulk/Whole\ genome\ sequences/Whole\ genome\ CRAM\ files/"

The script that the applet wraps goes as follows:

```
main() {
  dx download "$reference" -o reference

  length=${#reads[@]} # number of samples in batch

  for i in $(seq 0 $(($length - 1)))
  do
    readspath=${reads[$i]}
    prefix=${output_prefix[$i]}

    echo "Processing Cram file: '$readspath'"

    dx download "$readspath" -o mapped_reads #DOWNLOAD SAMPLE WGS FILE (LONG WAIT TIME)
    dx download "${readspath}".crai -o mapped_reads.crai

    ####DO STUFF WITH WGS DATA TO MAKE VCF OUTPUT

    output_vcf=$(dx upload "$prefix".vcf.gz --brief --path "$output_folder"/"$prefix".vcf.gz -p)

    # The following line(s) use the utility dx-jobutil-add-output to format and
    # add output variables to your job's output as appropriate for the output
    # class. Run "dx-jobutil-add-output -h" for more information on what it
    # does.

    # Remove this sample's intermediate files before the next iteration
    rm "$prefix"*
  done
}
```

I am running this app in batches on sets of 50 CRAM files (my total number of files is ~200000). See below:

 

```
##JOB1
dx run ehv5_multi -ireads="Project_name:/Bulk/Whole\ genome\ sequences/Whole\ genome\ CRAM\ files/24/2411023_23193_0_0.cram" -ireads= ...... -ioutput_prefix=4334617 -ireference=GRCh38_full_analysis_set_plus_decoy_hla.fa -ithreads=16 -ioutput_folder=output_dir/

##JOB2
dx run ehv5_multi -ireads="Project_name:/Bulk/Whole\ genome\ sequences/Whole\ genome\ CRAM\ files/23/1411023_23193_0_0.cram" -ireads= ...... -ioutput_prefix=1411023 -ireference=GRCh38_full_analysis_set_plus_decoy_hla.fa -ithreads=16 -ioutput_folder=output_dir/

......
```
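
The batching described above could be scripted rather than typed by hand. A minimal sketch, assuming a text file with one CRAM project path per line (the file and the function name `print_batch_commands` are hypothetical). It only prints the `dx run` lines and leaves out the other inputs (`-ioutput_prefix`, `-ireference`, `-ithreads`, `-ioutput_folder`), which would still need to be added per batch:

```shell
# Print one "dx run" command per batch of CRAM paths (nothing is executed).
# Assumption: $1 is a text file with one project path per line, written
# without backslash escapes, since each path is wrapped in double quotes.
print_batch_commands() {
  cram_list=$1
  batch_size=${2:-50}
  awk -v n="$batch_size" '
    # Collect one -ireads argument per input line.
    { args = args " -ireads=\"" $0 "\"" }
    # Emit a command every n lines and start a new batch.
    NR % n == 0 { print "dx run ehv5_multi" args; args = "" }
    # Emit the final, possibly shorter, batch.
    END { if (args != "") print "dx run ehv5_multi" args }
  ' "$cram_list"
}
```

Printing the commands first makes it possible to inspect a batch before submitting anything, for example with `print_batch_commands cram_paths.txt 50 | head -n 1`.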

Note the line marked "DOWNLOAD SAMPLE WGS FILE (LONG WAIT TIME)" in the applet script above: I have to use dx download to access a file that is already in my local storage. This step massively increases the run time of each job. Hence I'm wondering: is this completely necessary? Why do I have to download something that is already in my local storage?

 

Having looked around, it seems that dxfuse may be a potential solution, whereby I could possibly replace Project_name:/Bulk with /mnt/project/Bulk/. See the following links that suggest this:

 

 

https://community.dnanexus.com/s/question/0D5t000003lClcbCAC/is-there-a-way-to-mount-projectbucket-folders-on-vm-workers-directly-like-nfs-mount

https://community.dnanexus.com/s/question/0D582000000L513CAC/dxfuse-automatically-mount-mntproject-on-custom-docker-images
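
The substitution being asked about can be written as a small shell helper. This is only a sketch: it assumes the project really is mounted at /mnt/project on the worker, which is not guaranteed for a custom applet, and `to_mount_path` is a made-up name.

```shell
# Convert "Project_name:/Bulk/..." into "/mnt/project/Bulk/...".
# Assumption: the project is mounted at /mnt/project (or at $2 if given).
to_mount_path() {
  project_path=$1
  mount_root=${2:-/mnt/project}
  # Strip everything up to and including the first ":" (the project name).
  folder_path=${project_path#*:}
  # Drop shell-style "\ " escapes; they are not part of the real file name.
  folder_path=$(printf '%s' "$folder_path" | sed 's/\\ / /g')
  printf '%s%s\n' "$mount_root" "$folder_path"
}

to_mount_path 'Project_name:/Bulk/Whole\ genome\ sequences/Whole\ genome\ CRAM\ files/'
# prints: /mnt/project/Bulk/Whole genome sequences/Whole genome CRAM files/
```

The result contains real spaces, so it has to be double-quoted wherever it is used.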

 

I have installed dxfuse on my local computer, but the /mnt/project path does not seem to carry over to my submitted jobs. Clearly I'm lost.

Is this just a design feature of the DNAnexus platform?

 

Alternatively, could I write an app that extracts the read data for the loci of interest from the large files into subset BAM files, and then use these much more manageable downloads in my jobs?

 

Comments

11 comments

  • Comment author
    Ondrej Klempir DNAnexus Team

    My idea would be: if I am able to use a tool that can stream data sequentially via dxfuse, e.g. samtools, I do not need to download the whole file to the worker.
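
A minimal sketch of that idea, assuming samtools is installed on the worker, the CRAM is visible through a dxfuse mount, and its index sits next to it as `<file>.cram.crai`. The function name `extract_region` and every path and region in the usage comment are placeholders, not values from this thread:

```shell
# Extract one region from a CRAM into a small BAM without copying the whole file.
extract_region() {
  cram=$1       # CRAM on the mounted project, e.g. somewhere under /mnt/project/
  reference=$2  # local copy of the reference FASTA used to decode the CRAM
  region=$3     # region string such as chr1:1000-2000 (placeholder)
  out_bam=$4    # where to write the subset BAM
  # With the index available, samtools only needs to read the parts of the
  # file that overlap the requested region.
  samtools view -b -T "$reference" -o "$out_bam" "$cram" "$region"
}

# Hypothetical use:
#   extract_region "/mnt/project/path/to/sample.cram" reference chr1:1000-2000 subset.bam
```

How much data is actually transferred still depends on how dxfuse serves the reads underneath, so it is worth timing this on one sample before running it on a batch.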

  • Comment author
    Ondrej Klempir DNAnexus Team

    From what I am reading, and since you have developed your own custom DNAnexus app(let), I am wondering whether you also added dxfuse as a prerequisite/installation to your applet. My understanding is that if you build your own tool and you want to interact with UKB data via dxfuse (i.e. via /mnt/project/...), you have to include all dependencies during the applet build phase.

     

    Alternatively, you could run an already published public app that has dxfuse installed (e.g. Swiss Army Knife). It contains a set of standard bioinformatics tools for interacting with UKB data. In my experience, I was able to use a publicly available dockerized tool as the environment for Swiss Army Knife, and this approach saved me a lot of time that I would otherwise have spent developing an applet from scratch.

  • Comment author
    Former User of DNAx Community_14

    @Ondrej Klempir

    Ah right, I did not know that dxfuse needed to be included in the prerequisites. How do I specify this?

    Currently my dxapp.json file looks like this:

     

    ```
    "runSpec": {
      "execDepends": [
        {"name": "bcftools"},
        {"name": "tabix"}
      ]
    ```

    Can I just put {"name":"dxfuse"} in there and it will know where to download it from?

  • Comment author
    Former User of DNAx Community_14

    dxfuse does not seem to be installable in this way...

     

    dxpy.utils.exec_utils.DXExecDependencyError: Error while installing apt packages [{'name': 'bcftools'}, {'name': 'tabix'}, {'name': 'dxfuse'}]

  • Comment author
    Ondrej Klempir DNAnexus Team

    The GitHub page is here: https://github.com/dnanexus/dxfuse. You can test it and later install it from there.

    You can create an asset using a Makefile (https://documentation.dnanexus.com/developer/apps/dependency-management/asset-build-process#asset-directory-structure) and then build your applet.

  • Comment author
    Former User of DNAx Community_14

    @Ondrej Klempir

    Thank you for your reply.

    I did not understand the documentation that you linked me to. However, I tried the following.

    I built the dxfuse package on my local computer:

    git clone git@github.com:dnanexus/dxfuse.git
    cd dxfuse
    go build -o dxfuse cli/main.go

    Then I copied the binary to my applet resources folder:

    cp dxfuse/dxfuse applet/resources/usr/bin/

    and then I edited my applet bash script and ran it as follows:

    dx run -ireads=/mnt/project/.... -ireads=/mnt/project/.... applet

    ###BASH APP
    main(){
      dx-mount-all-inputs
      length=${#reads[@]} # number of samples in batch
      for i in $(seq 0 $(($length - 1)))
      do
        readspath=${reads[$i]}
      done
    }

     

    However, this results in the following error:

    File "/usr/local/bin/dx-mount-all-inputs", line 77, in <module>
      dxpy.mount_all_inputs(exclude=args.exclude, verbose=args.verbose)
    File "/usr/local/lib/python3.8/dist-packages/dxpy/bindings/mount_all_inputs.py", line 149, in mount_all_inputs
      dxfuse_version = subprocess.check_output([dxfuse_cmd, "-version"])
    File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
      return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
    File "/usr/lib/python3.8/subprocess.py", line 493, in run
      with Popen(*popenargs, **kwargs) as process:
    File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
      self._execute_child(args, executable, preexec_fn, close_fds,
    File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child
      raise child_exception_type(errno_num, err_msg, err_filename)
    OSError: [Errno 8] Exec format error: '/usr/bin/dxfuse'

     

    Is there something that I am missing? I would appreciate your input.
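
The `OSError: [Errno 8] Exec format error` in the traceback above generally means the file is not an executable format the machine can run, which would fit a binary compiled on a local computer for a different operating system or CPU than the worker. A minimal sketch of a check to run before bundling the binary, assuming the worker expects a Linux ELF executable; the helper name `is_elf` is made up for illustration:

```shell
# Succeed if the file starts with the ELF magic bytes: 0x7f 'E' 'L' 'F'.
is_elf() {
  magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$magic" = "7f454c46" ]
}

# Check the binary placed in the applet resources folder, if it is there.
if [ -f applet/resources/usr/bin/dxfuse ]; then
  if is_elf applet/resources/usr/bin/dxfuse; then
    echo "dxfuse looks like an ELF binary"
  else
    echo "dxfuse is not an ELF binary and will not run on a Linux worker"
  fi
fi
```

This only checks the file format, not the CPU architecture, so a passing check does not guarantee that the binary will run.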

  • Comment author
    Former User of DNAx Community_14

    I resolved this issue by downloading a binary executable of dxfuse from https://github.com/dnanexus/dxfuse/releases and putting it in applet/resources/usr/bin/. Hence dxfuse now seems to work in my applet. However, /mnt/project/ still does not seem to exist in the applet.

    My code goes:

    dx run -ireads=/mnt/project/.... -ireads=/mnt/project/.... applet

    ###BASH APP
    main(){
      dx-mount-all-inputs
      length=${#reads[@]} # number of samples in batch
      for i in $(seq 0 $(($length - 1)))
      do
        readspath=${reads[$i]}
        ......
      done
    }

     

    And yet when I run the app, it errors out with:

     

    STDOUT 2023-12-08T09:21:52,[/mnt/project/Bulk/Whole\ is not a path to an existing file]

     

    Hence the dx-mount-all-inputs command does not seem to work... Does anyone know why this might be?
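
The error message stops at `Whole\`, which suggests the path was cut at its first space with the backslash kept as a literal character. That would happen if a backslash-escaped path is stored inside quotes and later expanded without quotes. This is a guess from the message alone, not a confirmed diagnosis. A small demonstration in plain shell, independent of dx, with illustrative helper names:

```shell
# Helpers that report how many arguments they received, and the first one.
count_args() { printf '%s\n' "$#"; }
first_arg() { printf '%s\n' "$1"; }

# Inside double quotes "\ " is not an escape: the backslash stays in the string.
escaped="/mnt/project/Bulk/Whole\ genome\ sequences"
plain="/mnt/project/Bulk/Whole genome sequences"

count_args $escaped    # unquoted expansion: split at the spaces, prints 3
first_arg $escaped     # prints /mnt/project/Bulk/Whole\
count_args "$plain"    # quoted expansion: one argument, prints 1
```

If this is the cause, the path should be written without backslashes when it is already inside double quotes, and the variable should always be expanded as `"$readspath"`.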

  • Comment author
    Former User of DNAx Community_14

    From the DNAnexus IT help desk:

    You are able to access the data from your project by mounting it to the job. You can do that by using dxfuse in the source code when building the applet. However, please note that even if the project is streamed to the job, if the tool/software needs to read a file, under the hood that file needs to be downloaded to the job.

     

    Regarding your last question, DNAnexus is a cloud-based platform; the compute site (job/worker) is separated from the storage site (project). Therefore, in order for the job to access your data, it needs to be downloaded from your project.

  • Comment author
    Ondrej Klempir DNAnexus Team

    Yes, you did the same installation things/steps I would do as well. I do not have any working example ready now, just sharing my thoughts about this.

     

    What I would try next is to make sure that dxfuse is installed correctly on the cloud workstation and to check which path you have to specify in order to access it. I would either try installing dxfuse from scratch in an interactive session (e.g. ttyd or a cloud workstation), or run your job with debug-on-hold and access the failed job via ssh to inspect the folder structure inside.

  • Comment author
    Former User of DNAx Community_14

    OK, so in response to your previous post I have rerun the job with --allow-ssh --debug-on AppInternalError, and when I ssh to the job (dx ssh job-Gbv4F20JY20ZfPY55jJgJQPV) I get the following:

    dnanexus@job-Gbv4F20JY20ZfPY55jJgJQPV:~$ dx pwd
    job-Gbv4F20JY20ZfPY55jJgJQPV-workspace:/
    dnanexus@job-Gbv4F20JY20ZfPY55jJgJQPV:~$ pwd
    /home/dnanexus
    dnanexus@job-Gbv4F20JY20ZfPY55jJgJQPV:~$ dxfuse
    usage:
      dxfuse [options] MOUNTPOINT PROJECT1 PROJECT2 ...
      dxfuse [options] MOUNTPOINT manifest.json
    options:
      -debugFuse
          Tap into FUSE debugging information
      -gid int
          User group id (gid)
      -help
          display program options
      -uid int
          User id (uid)
      -verbose int
          Enable verbose debugging
      -version
          Print the version and exit

     

    Hence the basic dxfuse binary seems to be fine (?). But when I look at what is in the /mnt folder with ls /mnt, there seems to be nothing in there.

    Why is this? Is the dx-mount-all-inputs command wrong in some way?

  • Comment author
    Ondrej Klempir DNAnexus Team
    1. Could it be related to https://github.com/dnanexus/dxfuse?tab=readme-ov-file#common-problems? Could you make sure that the project is mounted?
    2. I have no idea why or when dx-mount-all-inputs should be used when building applets. Did you actively add it yourself? I would "turn it off" if possible. From what I know, dx-mount-all-inputs is used with Swiss Army Knife. I would not use dx-mount-all-inputs and would rather access the project folder directly via dxfuse. Sorry, I might be wrong; this goes beyond my knowledge without experimenting with it hands-on.
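
Accessing the project directly via dxfuse could look roughly like the sketch below, which follows the usage line `dxfuse [options] MOUNTPOINT PROJECT1 PROJECT2 ...` printed earlier in this thread. It is untested on a real worker. The function name `mount_project`, the mountpoint under `$HOME`, and the use of the job's `DX_PROJECT_CONTEXT_ID` environment variable to identify the project are all assumptions:

```shell
# Mount one project at the given directory with dxfuse.
mount_project() {
  mountpoint=$1
  project=$2
  # The mountpoint has to exist before dxfuse is started.
  mkdir -p "$mountpoint" || return 1
  dxfuse "$mountpoint" "$project"
}

# Hypothetical use inside main(), before any file is read:
#   mount_project "$HOME/projects" "$DX_PROJECT_CONTEXT_ID"
#   ls "$HOME/projects"
```

Listing the mountpoint right after mounting shows the directory layout that dxfuse actually creates, which is the path prefix the applet would then need to use for its inputs.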

     

     

