Is there a way to access locally stored files in a project from a submitted job without having to use "dx download"?

06 December 2023 00:00
11 comments

I am new to the DNAnexus platform and have been trying to get to grips with the way analysis is done.

I apologise if this question has been asked in some form before, but I have to say the documentation on this feature of the DNAnexus/UKBB platform is really unclear.

I have been running a custom applet that analyses WGS data from distinct loci.

The data of interest are in our local storage folder: "Project_name:/Bulk/Whole\ genome\ sequences/Whole\ genome\ CRAM\ files/"

The script that the applet wraps goes as follows:

```
main() {
    dx download "$reference" -o reference

    length=${#reads[@]} # number of samples in batch
    for i in $(seq 0 $(($length - 1)))
    do
        readspath=${reads[$i]}
        prefix=${output_prefix[$i]}
        echo "Processing Cram file: '$readspath'"

        dx download "$readspath" -o mapped_reads #DOWNLOAD SAMPLE WGS FILE (LONG WAIT TIME)
        dx download "${readspath}".crai -o mapped_reads.crai

        ####DO STUFF WITH WGS DATA TO MAKE VCF OUTPUT

        output_vcf=$(dx upload "$prefix".vcf.gz --brief --path "$output_folder"/"$prefix".vcf.gz -p)

        # The following line(s) use the utility dx-jobutil-add-output to format and
        # add output variables to your job's output as appropriate for the output
        # class. Run "dx-jobutil-add-output -h" for more information on what it
        # does.

        rm "$prefix"*
    done
}
```

I am running this app in batches on sets of 50 CRAM files (my total number of files is ~200000). See below:

```
##JOB1
dx run ehv5_multi -ireads="Project_name:/Bulk/Whole\ genome\ sequences/Whole\ genome\ CRAM\ files/24/2411023_23193_0_0.cram" -ireads= ...... -ioutput_prefix=4334617 -ireference=GRCh38_full_analysis_set_plus_decoy_hla.fa -ithreads=16 -ioutput_folder=output_dir/

##JOB2
dx run ehv5_multi -ireads="Project_name:/Bulk/Whole\ genome\ sequences/Whole\ genome\ CRAM\ files/23/1411023_23193_0_0.cram" -ireads= ...... -ioutput_prefix=1411023 -ireference=GRCh38_full_analysis_set_plus_decoy_hla.fa -ithreads=16 -ioutput_folder=output_dir/

......
```

See the line of code marked "#DOWNLOAD SAMPLE WGS FILE (LONG WAIT TIME)", where I am having to use dx download to access a file that is already in my local storage. This step massively increases the run time of each job. Hence I'm wondering: is this completely necessary? Why do I have to download something that is already in my local storage?

Having looked around, it seems that dxfuse may be a potential solution: could I possibly replace Project_name:/Bulk with /mnt/project/Bulk/? The following links suggest this:

https://community.dnanexus.com/s/question/0D5t000003lClcbCAC/is-there-a-way-to-mount-projectbucket-folders-on-vm-workers-directly-like-nfs-mount

https://community.dnanexus.com/s/question/0D582000000L513CAC/dxfuse-automatically-mount-mntproject-on-custom-docker-images
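
The path substitution described above can be sketched as a small bash helper. This is only an illustration under assumptions: the function name `to_mount_path` and the default mount point `/mnt/project` are placeholders, and it presumes the project really is mounted at that location inside the job, which is not guaranteed for a custom applet.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite "Project_name:/some/path" into a path under a
# dxfuse mount point. It only manipulates the string and does not contact
# the platform.
to_mount_path() {
    local project_path="$1"
    local mount_root="${2:-/mnt/project}"   # assumed mount point
    # Strip everything up to and including the first ":" (the project name).
    local rel="${project_path#*:}"
    # Drop the leading slash so the two parts join cleanly.
    rel="${rel#/}"
    printf '%s/%s\n' "${mount_root%/}" "$rel"
}

to_mount_path "Project_name:/Bulk/Whole genome sequences/Whole genome CRAM files/24/2411023_23193_0_0.cram"
```

Note that the example passes the path with plain spaces rather than backslash-escaped ones; inside double quotes a backslash before a space is kept as a literal character.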

I have installed dxfuse on my local computer, but the /mnt/project mount does not seem to carry over to my submitted jobs. Clearly I'm lost.

Is this just a design feature of the DNAnexus platform?

Alternatively, could I write an app to extract the read data for the loci of interest from the large files into subset BAM files, and then use these much more manageable downloads in my jobs?

Comments


Ondrej Klempir DNAnexus Team
- 29 October 2023 16:06
My idea would be: if I am able to use a tool that can stream data sequentially via dxfuse (e.g. samtools), I do not need to download the whole file to the worker.
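
To make the streaming idea concrete, here is a rough sketch that assembles a `samtools view` call for a single region of a CRAM file read through a dxfuse mount. The mount path, reference path, region, and the function name `build_region_cmd` are all placeholders, and the sketch assumes the `.crai` index sits next to the CRAM so that samtools can seek to the region. It only prints the command unless `RUN=1` is set.

```shell
#!/usr/bin/env bash
# Fill the global array `cmd` with a samtools call that extracts one region
# from a CRAM into a small BAM. All paths are placeholders.
build_region_cmd() {
    local cram="$1" reference="$2" region="$3" out_bam="$4"
    cmd=(samtools view -b -T "$reference" -o "$out_bam" "$cram" "$region")
}

build_region_cmd \
    "/mnt/project/path/to/sample.cram" \
    "reference.fa" \
    "chr4:3074000-3076000" \
    "sample.subset.bam"

if [ "${RUN:-0}" = "1" ]; then
    "${cmd[@]}"                      # needs samtools and a mounted project
else
    printf '%q ' "${cmd[@]}"; echo   # dry run: show the shell-quoted command
fi
```

Building the command as an array keeps paths with spaces intact when the command is finally executed. Whether this actually avoids transferring most of the file depends on how dxfuse serves the random reads that samtools issues, which is not verified here.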

Ondrej Klempir DNAnexus Team
- 07 December 2023 09:26
From what I am reading, and since you have developed your own custom DNAnexus app(let), I am wondering whether you also added dxfuse as a prerequisite/installation in your applet. My understanding is that if you build your own tool and want to interact with UKB data via dxfuse (i.e. via /mnt/project/...), you have to include all dependencies during the applet build phase.

Alternatively, you could run an already published public app that has dxfuse installed (e.g. Swiss Army Knife). It contains a set of standard bioinformatics tools for interacting with UKB data. In my experience, I was able to run a publicly available dockerized tool as the environment for Swiss Army Knife, and this approach saved a lot of time that I would otherwise have spent developing an applet from scratch.

Former User of DNAx Community_14
- 07 December 2023 11:10
@Ondrej Klempir
Ah right, I did not know that dxfuse needed to be included in the prerequisites. How do I specify this?
Currently my dxapp.json file looks like this:

```
"runSpec": {
  "execDepends": [
    {"name": "bcftools"},
    {"name": "tabix"}
  ]
}
```
Can I just put {"name":"dxfuse"} in there and it will know where to download it from?

Former User of DNAx Community_14
- 07 December 2023 11:47
dxfuse does not seem to be installable in this way...

dxpy.utils.exec_utils.DXExecDependencyError: Error while installing apt packages [{'name': 'bcftools'}, {'name': 'tabix'}, {'name': 'dxfuse'}]

Ondrej Klempir DNAnexus Team
- 07 December 2023 11:52
The GitHub page is here: https://github.com/dnanexus/dxfuse. You can test it and later install it from there.

You can create an asset using a Makefile (see https://documentation.dnanexus.com/developer/apps/dependency-management/asset-build-process#asset-directory-structure) and then build your applet.

Former User of DNAx Community_14
- 07 December 2023 15:07
Thank you for your reply.
I did not understand the documentation that you linked me to. However I tried the following:

I built the dxfuse package on my local computer.

git clone git@github.com:dnanexus/dxfuse.git
cd dxfuse
go build -o dxfuse cli/main.go

Then I copied the binary to my applet resources folder:

cp dxfuse/dxfuse applet/resources/usr/bin/
and then I edited my applet bash script as follows:

dx run -ireads=/mnt/project/.... -ireads=/mnt/project/.... applet

###BASH APP
main(){
dx-mount-all-inputs
  length=${#reads[@]} # number of samples in batch
  for i in $(seq 0 $(($length - 1)))
  do
    readspath=${reads[$i]}

done
}

However, this results in the following error:
File "/usr/local/bin/dx-mount-all-inputs", line 77, in <module>
dxpy.mount_all_inputs(exclude=args.exclude, verbose=args.verbose)
File "/usr/local/lib/python3.8/dist-packages/dxpy/bindings/mount_all_inputs.py", line 149, in mount_all_inputs
dxfuse_version = subprocess.check_output([dxfuse_cmd, "-version"])
File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib/python3.8/subprocess.py", line 493, in run
with Popen(*popenargs, **kwargs) as process:
File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 8] Exec format error: '/usr/bin/dxfuse'

Is there something that I am missing? I would appreciate your input.
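
An `Exec format error` generally means the kernel on the worker does not recognise the file as an executable for that machine, which can happen when a binary is compiled on a different operating system or CPU architecture than the one the job runs on. As a hedged sketch (the function name `is_elf` is made up, and it assumes the worker runs Linux), one could check that a binary at least starts with the ELF magic bytes before packaging it into the applet:

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if the file begins with the four ELF
# magic bytes (0x7f 'E' 'L' 'F'). This does not verify the CPU architecture.
is_elf() {
    local magic
    magic=$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "7f454c46" ]
}

bin="applet/resources/usr/bin/dxfuse"   # path taken from the post above
if is_elf "$bin"; then
    echo "$bin looks like a Linux (ELF) binary"
else
    echo "$bin is missing or is not an ELF binary"
fi
```

A binary that fails this check cannot run on a Linux worker; one that passes may still target the wrong architecture, so this is only a first sanity check.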

Former User of DNAx Community_14
- 08 December 2023 09:57
I resolved this issue by downloading a prebuilt dxfuse binary from https://github.com/dnanexus/dxfuse/releases and putting it in applet/resources/usr/bin/. dxfuse now seems to work in my applet. However, /mnt/project/ still does not seem to exist in the applet.
My code is:
dx run -ireads=/mnt/project/.... -ireads=/mnt/project/.... applet
###BASH APP
main(){
dx-mount-all-inputs
  length=${#reads[@]} # number of samples in batch
  for i in $(seq 0 $(($length - 1)))
  do
    readspath=${reads[$i]}
......
done
}

And yet when I run the app, it errors out with:

STDOUT 2023-12-08T09:21:52,[/mnt/project/Bulk/Whole\ is not a path to an existing file]

Hence the dx-mount-all-inputs command does not seem to work... Does anyone know why this might be?
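
The truncated path in that message (`/mnt/project/Bulk/Whole\`) stops at the first space, which hints that the path may be getting split on whitespace somewhere between the `dx run` call and the tool. That is only a guess from the error text, not a confirmed diagnosis. The following self-contained bash sketch, using a made-up path and a made-up helper `count_args`, shows how an unquoted variable containing spaces is split into several words while a quoted one stays whole, and how a backslash inside double quotes is kept as a literal character:

```shell
#!/usr/bin/env bash
# Hypothetical helper that reports how many arguments it received.
count_args() { echo "$#"; }

reads=("/mnt/project/Bulk/Whole genome sequences/sample.cram")
readspath=${reads[0]}

# Unquoted expansion is split on spaces; quoted expansion is not.
echo "unquoted: $(count_args $readspath) argument(s)"
echo "quoted:   $(count_args "$readspath") argument(s)"

# Inside double quotes, "\ " stays as backslash + space.
escaped="Whole\ genome"
plain="Whole genome"
echo "escaped length: ${#escaped}, plain length: ${#plain}"
```

With the default `IFS`, the unquoted call reports 3 arguments and the quoted call reports 1, and the escaped string is one character longer than the plain one because of the literal backslash.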

Former User of DNAx Community_14
- 08 December 2023 11:05
From the DNAnexus IT help desk:
You are able to access the data from your project by mounting it to the job. You can do that by using dxfuse in the source code when building the applet. However, please note that even if the project is streamed to the job, if the tool/software needs to read a file, under the hood that file needs to be downloaded to the job.

Regarding your last question: DNAnexus is a cloud-based platform, and the compute site (job/worker) is separate from the storage site (project). Therefore, in order for the job to access your data, it needs to be downloaded from your project.

Ondrej Klempir DNAnexus Team
- 11 December 2023 08:37
Yes, you did the same installation steps I would have done. I do not have a working example ready right now, so I am just sharing my thoughts about this.

What I would try next is to make sure that dxfuse is installed correctly on the cloud workstation and to find out which path you have to specify in order to access it. I would either try installing dxfuse from scratch in an interactive session (e.g. ttyd or cloud workstation), or run your job with debug-on-hold and access the failed job via ssh to inspect the folder structure inside.

Former User of DNAx Community_14
- 12 December 2023 12:02
OK, so in response to your previous post I have rerun the job with --allow-ssh --debug-on AppInternalError, and when I ssh into the job (dx ssh job-Gbv4F20JY20ZfPY55jJgJQPV) I get the following:
dnanexus@job-Gbv4F20JY20ZfPY55jJgJQPV:~$ dx pwd
job-Gbv4F20JY20ZfPY55jJgJQPV-workspace:/
dnanexus@job-Gbv4F20JY20ZfPY55jJgJQPV:~$ pwd
/home/dnanexus
dnanexus@job-Gbv4F20JY20ZfPY55jJgJQPV:~$ dxfuse
usage:
  dxfuse [options] MOUNTPOINT PROJECT1 PROJECT2 ...
  dxfuse [options] MOUNTPOINT manifest.json
options:
-debugFuse
    Tap into FUSE debugging information
-gid int
    User group id (gid)
-help
    display program options
-uid int
    User id (uid)
-verbose int
    Enable verbose debugging
-version
    Print the version and exit

Hence the basic dxfuse binary seems to be fine (?). However, when I look at what is in the /mnt folder with ls /mnt, there seems to be nothing in there.
Why is this? Is the dx-mount-all-inputs command wrong in some way?

Ondrej Klempir DNAnexus Team
- 18 December 2023 11:58
1. Could it be related to https://github.com/dnanexus/dxfuse?tab=readme-ov-file#common-problems? Could you make sure that the project is mounted?
2. I am not sure why and when dx-mount-all-inputs should be used when building applets. Did you actively add it yourself? I would "turn it off" if possible. From what I know, dx-mount-all-inputs is used with Swiss Army Knife. I would not use dx-mount-all-inputs and would rather access the project folder directly via dxfuse. Sorry, I might be wrong; this goes beyond my knowledge without experimenting with it hands-on.
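
Accessing the project directly via dxfuse, as suggested in point 2, might look roughly like the sketch below. It follows the `dxfuse [options] MOUNTPOINT PROJECT1 PROJECT2 ...` usage printed earlier in the thread. The function name `mount_project`, the mount point, and the `DXFUSE_BIN` override are assumptions for illustration, and the sketch presumes the job is allowed to create the mount point and to use FUSE. `DX_PROJECT_CONTEXT_ID` is the environment variable a job carries for the project it was launched from.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: create a mount point and hand it to dxfuse together
# with a project ID (defaults to the job's project context).
mount_project() {
    local mountpoint="$1"
    local project="${2:-${DX_PROJECT_CONTEXT_ID:-}}"
    local dxfuse_bin="${DXFUSE_BIN:-dxfuse}"   # override exists for testing only

    if [ -z "$project" ]; then
        echo "no project given and DX_PROJECT_CONTEXT_ID is unset" >&2
        return 1
    fi
    mkdir -p "$mountpoint" || return 1
    "$dxfuse_bin" "$mountpoint" "$project"
}

# Inside a job one might call (the path is a placeholder):
#   mount_project "$HOME/projects"
#   ls "$HOME/projects"
```

After a successful mount, listing the mount point is a quick way to see where the project's folders actually appear before pointing any tool at them.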
