Is there a way to access locally stored files in a project from a submitted job without having to use "dx download"?
I am new to the DNAnexus platform and have been trying to get to grips with the way analysis is done.
I apologise if this question has been asked in some form before, but I have to say the documentation on this feature of the DNAnexus/UKBB platform is really unclear.
I have been running a custom applet that analyses WGS data from distinct loci.
The data of interest are in our local storage folder: "Project_name:/Bulk/Whole\ genome\ sequences/Whole\ genome\ CRAM\ files/"
The script that the applet wraps goes as follows:
```
main() {
    dx download "$reference" -o reference
    length=${#reads[@]} # number of samples in batch
    for i in $(seq 0 $(($length - 1)))
    do
        readspath=${reads[$i]}
        prefix=${output_prefix[$i]}
        echo "Processing Cram file: '$readspath'"
        dx download "$readspath" -o mapped_reads #DOWNLOAD SAMPLE WGS FILE (LONG WAIT TIME)
        dx download "${readspath}".crai -o mapped_reads.crai
        ####DO STUFF WITH WGS DATA TO MAKE VCF OUTPUT
        output_vcf=$(dx upload "$prefix".vcf.gz --brief --path "$output_folder"/"$prefix".vcf.gz -p)
        # The following line(s) use the utility dx-jobutil-add-output to format and
        # add output variables to your job's output as appropriate for the output
        # class. Run "dx-jobutil-add-output -h" for more information on what it
        # does.
        rm "$prefix"*
    done
}
```
I am running this app in batches of 50 CRAM files (my total number of files is ~200,000). See below:
```
##JOB1
dx run ehv5_multi -ireads="Project_name:/Bulk/Whole\ genome\ sequences/Whole\ genome\ CRAM\ files/24/2411023_23193_0_0.cram" -ireads= ...... -ioutput_prefix=4334617 -ireference=GRCh38_full_analysis_set_plus_decoy_hla.fa -ithreads=16 -ioutput_folder=output_dir/
##JOB2
dx run ehv5_multi -ireads="Project_name:/Bulk/Whole\ genome\ sequences/Whole\ genome\ CRAM\ files/23/1411023_23193_0_0.cram" -ireads= ...... -ioutput_prefix=1411023 -ireference=GRCh38_full_analysis_set_plus_decoy_hla.fa -ithreads=16 -ioutput_folder=output_dir/
......
```
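The batch submissions above can be generated from a list of CRAM paths instead of being written out by hand. The following is a minimal sketch, not the poster's actual workflow: the helper name `build_batch_args` is hypothetical, and it assumes that the output prefix is the part of the file name before the first underscore, as in the JOB2 example (1411023_23193_0_0.cram with prefix 1411023).

```shell
#!/usr/bin/env bash
# Build the per-sample arguments for one `dx run` batch.
# Hypothetical helper: fills the global array BATCH_ARGS.
build_batch_args() {
    local path name
    BATCH_ARGS=()
    for path in "$@"; do
        name=${path##*/}    # strip the folders, e.g. 1411023_23193_0_0.cram
        # Assumption: the prefix is everything before the first underscore.
        BATCH_ARGS+=("-ireads=$path" "-ioutput_prefix=${name%%_*}")
    done
}

# Usage sketch (prints the command instead of submitting it);
# cram_list.txt is a hypothetical file with one CRAM path per line:
# mapfile -t crams < <(head -n 50 cram_list.txt)
# build_batch_args "${crams[@]}"
# printf '%q ' dx run ehv5_multi "${BATCH_ARGS[@]}"; echo
```

Keeping the arguments in an array means paths that contain spaces stay intact as single words when the command is finally run.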
Note the line commented "DOWNLOAD SAMPLE WGS FILE (LONG WAIT TIME)": I have to use dx download to access a file that is already in my local storage. This step massively increases the run time of each job. Is this really necessary? Why do I have to download something that is already in my local storage?
Having looked around, it seems that dxfuse may be a potential solution, whereby I could replace Project_name:/Bulk with /mnt/project/Bulk/. See the following links that suggest this:
I have installed dxfuse on my local computer, but the /mnt/project path does not seem to carry over to my submitted jobs. Clearly I'm lost.
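To make the path substitution described here concrete, this is a small sketch that rewrites a project path into its mounted equivalent. It only does string manipulation: the function name `project_to_mount_path` is hypothetical, and it assumes the project really is mounted at /mnt/project inside the job, which is exactly what is in question in this thread.

```shell
#!/usr/bin/env bash
# Translate "Project_name:/Bulk/..." into "/mnt/project/Bulk/...".
# Assumption: the project is mounted at /mnt/project (override with argument 2).
project_to_mount_path() {
    local dx_path=$1 mount_root=${2:-/mnt/project}
    local escaped_space='\ '
    local rel=${dx_path#*:}             # drop everything up to the first colon
    rel=${rel//"$escaped_space"/ }      # turn literal "\ " back into plain spaces
    printf '%s/%s\n' "${mount_root%/}" "${rel#/}"
}

# Usage sketch:
# project_to_mount_path 'Project_name:/Bulk/Whole\ genome\ sequences/sample.cram'
```

The result contains real spaces, so it has to be quoted wherever it is used.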
Is this just a design feature of the DNAnexus platform?
Alternatively, could I write an app to extract the read data for the loci of interest from the large files into subset BAM files, and then use these much more manageable downloads in my jobs?
Comments
My idea would be: if I can use a tool that streams data sequentially via dxfuse, e.g. samtools, I do not need to download the whole file to the worker.
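As a rough illustration of that idea, the sketch below assembles a samtools command that reads only one region from a CRAM and writes it to a small BAM. It is an assumption-laden example rather than a tested recipe: the helper name `build_region_cmd`, the region and the file names are hypothetical, and region access relies on the .crai index sitting next to the CRAM.

```shell
#!/usr/bin/env bash
# Assemble (but do not run) a samtools command that extracts one region.
# Hypothetical helper: fills the global array REGION_CMD.
build_region_cmd() {
    local cram=$1 reference=$2 region=$3 out=$4
    # -b: write BAM, -T: reference FASTA needed to decode CRAM, -o: output file
    REGION_CMD=(samtools view -b -T "$reference" -o "$out" "$cram" "$region")
}

# Usage sketch inside the job (placeholder paths and region):
# build_region_cmd "$readspath" reference "chr4:3074000-3076000" subset.bam
# "${REGION_CMD[@]}"
```

If the CRAM is read through a mounted path, only the blocks covering the region should need to be fetched, which is the point of streaming instead of downloading.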
From what I am reading, and since you have developed your own custom DNAnexus app(let), I wonder whether you also added dxfuse as a prerequisite/installation to your applet. My understanding is that if you build your own tool and want to interact with UKB data via dxfuse (i.e. via /mnt/project/...), you have to include all dependencies during the applet build phase.
Alternatively, you could run an already published public app that has dxfuse installed (e.g. Swiss Army Knife). It contains a set of standard bioinformatics tools for interacting with UKB data. In my experience, I was able to use a publicly available dockerized tool as the environment for Swiss Army Knife, and this approach saved me a lot of time that I would otherwise have spent developing an applet from scratch.
@Ondrej Klempir?
Ah right, I did not know that dxfuse needed to be included in the prerequisites. How do I specify this?
Currently my dxapp.json file looks like this:
```
"runSpec": {
"execDepends": [
{"name": "bcftools"},
{"name": "tabix"}
]
}
```
Can I just put {"name":"dxfuse"} in there and it will know where to download it from?
dxfuse does not seem to be installable in this way...
dxpy.utils.exec_utils.DXExecDependencyError: Error while installing apt packages [{'name': 'bcftools'}, {'name': 'tabix'}, {'name': 'dxfuse'}]
The GitHub page is here: https://github.com/dnanexus/dxfuse. You can test it and later install it from there.
You can create an asset using a Makefile (https://documentation.dnanexus.com/developer/apps/dependency-management/asset-build-process#asset-directory-structure) and then build your applet.
Thank you for your reply.
I did not understand the documentation that you linked me to. However I tried the following:
I built the dxfuse package on my local computer.
git clone git@github.com:dnanexus/dxfuse.git
cd dxfuse
go build -o dxfuse cli/main.go
Then I copied the binary to my applet resources folder:
cp dxfuse/dxfuse applet/resources/usr/bin/
and then I edited my applet bash script as follows:
dx run -ireads=/mnt/project/.... -ireads=/mnt/project/.... applet
###BASH APP
main(){
dx-mount-all-inputs
length=${#reads[@]} # number of samples in batch
for i in $(seq 0 $(($length - 1)))
do
readspath=${reads[$i]}
done
}
However, this results in the following error:
File "/usr/local/bin/dx-mount-all-inputs", line 77, in <module>
dxpy.mount_all_inputs(exclude=args.exclude, verbose=args.verbose)
File "/usr/local/lib/python3.8/dist-packages/dxpy/bindings/mount_all_inputs.py", line 149, in mount_all_inputs
dxfuse_version = subprocess.check_output([dxfuse_cmd, "-version"])
File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib/python3.8/subprocess.py", line 493, in run
with Popen(*popenargs, **kwargs) as process:
File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 8] Exec format error: '/usr/bin/dxfuse'
Is there something that I am missing? I would appreciate your input.
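An "Exec format error" generally means the kernel does not recognise the file as an executable for that machine, for example a binary compiled on macOS being run on a Linux worker. A quick check before copying a binary into applet/resources/usr/bin/ is to look at its first four bytes. This is a sketch; the function name `is_elf_binary` is hypothetical, and the check only confirms the ELF format, not the CPU architecture.

```shell
#!/usr/bin/env bash
# Succeed if the file starts with the ELF magic bytes (7f 45 4c 46), i.e. it is
# a Linux-style executable rather than, say, a macOS Mach-O binary.
is_elf_binary() {
    local magic
    magic=$(od -An -tx1 -N4 "$1" 2>/dev/null | tr -d '[:space:]')
    [ "$magic" = "7f454c46" ]
}

# Usage sketch:
# is_elf_binary dxfuse/dxfuse || echo "not a Linux binary; do not bundle it"
#
# When building from source on another OS, Go can cross-compile
# (assumption: the worker is 64-bit x86 Linux):
# GOOS=linux GOARCH=amd64 go build -o dxfuse cli/main.go
```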
I resolved this issue by downloading a binary executable version of dxfuse from https://github.com/dnanexus/dxfuse/releases
and putting it in applet/resources/usr/bin/. dxfuse now seems to work in my applet. However, /mnt/project/ still does not seem to exist in the applet:
My code goes:
dx run -ireads=/mnt/project/.... -ireads=/mnt/project/.... applet
###BASH APP
main(){
dx-mount-all-inputs
length=${#reads[@]} # number of samples in batch
for i in $(seq 0 $(($length - 1)))
do
readspath=${reads[$i]}
......
done
}
And yet when I run the app, it errors out with:
STDOUT 2023-12-08T09:21:52,[/mnt/project/Bulk/Whole\ is not a path to an existing file]
Hence the dx-mount-all-inputs command does not seem to work. Does anyone know why this might be?
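The path in that error message stops at the first space and still ends in a backslash, which is what bash produces when a variable holding literal "\ " sequences is expanded without quotes. The sketch below reproduces that behaviour in isolation. It is one possible reading of the message, not a confirmed diagnosis of the applet, and sample.cram is a placeholder file name.

```shell
#!/usr/bin/env bash
# Helpers that show how many words a command receives, and the first one.
count_args() { echo "$#"; }
first_arg() { printf '%s\n' "$1"; }

# Backslashes inside quotes are stored literally; they no longer escape anything.
path='/mnt/project/Bulk/Whole\ genome\ sequences/sample.cram'

count_args $path      # unquoted: split at every space -> 3
first_arg $path       # -> /mnt/project/Bulk/Whole\
count_args "$path"    # quoted: 1 word, but the backslashes are still in it

clean=${path//\\/}    # remove the literal backslashes
count_args "$clean"   # quoted and clean: 1 word with real spaces
```

If this is what is happening, passing the path with plain spaces and quoting every expansion ("$readspath") would keep it in one piece.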
From the DNAnexus IT help desk:
You are able to access the data from your project by mounting it to the job. You can do that by using dxfuse in the source code when building the applet. However, please note that even though the project is streamed to the job, if the tool/software needs to read a file, that file still has to be downloaded to the job under the hood.
Regarding your last question, DNAnexus is a cloud-based platform in which the compute site (job/worker) is separate from the storage site (project). Therefore, for the job to access your data, the data needs to be downloaded from your project.
Yes, you did the same installation things/steps I would do as well. I do not have any working example ready now, just sharing my thoughts about this.
What I would try next is to check that dxfuse is installed correctly on the cloud workstation and which path you have to specify to access it. I would either try installing dxfuse from scratch in an interactive session (e.g. ttyd or cloud workstation), or run your job with debug-on-hold and access the failed job via ssh to inspect the folder structure inside.
OK, so in response to your previous post I have rerun the job with --allow-ssh --debug-on AppInternalError, and when I ssh to the job (dx ssh job-Gbv4F20JY20ZfPY55jJgJQPV) I get the following:
dnanexus@job-Gbv4F20JY20ZfPY55jJgJQPV:~$ dx pwd
job-Gbv4F20JY20ZfPY55jJgJQPV-workspace:/
dnanexus@job-Gbv4F20JY20ZfPY55jJgJQPV:~$ pwd
/home/dnanexus
dnanexus@job-Gbv4F20JY20ZfPY55jJgJQPV:~$ dxfuse
usage:
dxfuse [options] MOUNTPOINT PROJECT1 PROJECT2 ...
dxfuse [options] MOUNTPOINT manifest.json
options:
-debugFuse
Tap into FUSE debugging information
-gid int
User group id (gid)
-help
display program options
-uid int
User id (uid)
-verbose int
Enable verbose debugging
-version
Print the version and exit
Hence the basic dxfuse binary seems to be fine (?). However, when I look at what is in the /mnt folder with ls /mnt, there seems to be nothing in there.
Why is this? Is the dx-mount-all-inputs command wrong in some way?
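Since the usage text above shows that dxfuse takes a mount point followed by one or more projects, one thing to try from the ssh session is mounting the project by hand and listing the result. This is only a sketch under assumptions: the helper name `build_mount_cmd` and the mount directory are hypothetical, DX_PROJECT_CONTEXT_ID is assumed to hold the ID of the project the job runs in, and the mount may need elevated privileges on the worker.

```shell
#!/usr/bin/env bash
# Assemble a dxfuse mount command following the usage text:
#   dxfuse [options] MOUNTPOINT PROJECT1 PROJECT2 ...
# Hypothetical helper: fills the global array MOUNT_CMD.
build_mount_cmd() {
    local mountpoint=$1
    shift
    MOUNT_CMD=(dxfuse "$mountpoint" "$@")
}

# Usage sketch inside the job:
# mkdir -p "$HOME/projects"
# build_mount_cmd "$HOME/projects" "$DX_PROJECT_CONTEXT_ID"
# "${MOUNT_CMD[@]}"
# ls "$HOME/projects"
```

If the listing shows the project contents, the paths passed to the tools would need to start at that mount point rather than at /mnt/project.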