I ask because it seems to take about 12 minutes to download a 50 GB CRAM from a bucket folder to a worker, and we have a few hundred thousand WGS CRAM files. In addition, samtools view by default pulls a 3 GB reference cache from EBI each time. It would be nice if I could access a pre-built FASTA cache in the bucket directly from a worker.
If I have to download the files, I assume it would be more efficient to download the cache from the bucket than from http://www.ebi.ac.uk/ over the internet?
Thanks for your help.
Comments
14 comments
Hi Yong,
If you prepend /mnt/project/ to your file paths (such as /mnt/project/Bulk Files/...), you can use the dxFUSE file system to access files in project storage without downloading them first. Note that it is currently read-only.
This works in bash, Python, and R code.
Hope that helps.
Ted
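As a minimal sketch of what this looks like from Python, assuming the job runs somewhere /mnt/project is actually mounted: the helper names project_path and read_first_lines below are made up for illustration, and the mount point is a parameter only so the code can be tried against any directory.

```python
import os

MOUNT_POINT = "/mnt/project"  # dxFUSE mount point mentioned in this thread


def project_path(relative_path, mount_point=MOUNT_POINT):
    """Map a project-relative path to its location under the mount (hypothetical helper)."""
    return os.path.join(mount_point, relative_path.lstrip("/"))


def read_first_lines(relative_path, n=5, mount_point=MOUNT_POINT):
    """Read up to n lines of a text file straight from the mount, without a download step."""
    lines = []
    with open(project_path(relative_path, mount_point), "r") as handle:
        for _ in range(n):
            line = handle.readline()
            if not line:  # end of file reached before n lines
                break
            lines.append(line.rstrip("\n"))
    return lines
```

Because the file is opened with an ordinary open() call, the same path should also work with any library that accepts a file path, as long as it only reads.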
Hi Ted,
Thank you! That's really good to know. According to the GitHub page https://github.com/dnanexus/dxfuse, the performance of downloading and of direct access through dxFUSE seems to be similar. What is the preferred, or more common, method to access project data: download or /mnt/project?
Yong
Actually, if I do this
subprocess.check_call('ls -l /mnt/project/', shell=True)
in the python code, I am getting
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
STDERR raise CalledProcessError(retcode, cmd)
STDERR subprocess.CalledProcessError: Command 'ls -l /mnt/project/' returned non-zero exit status 2
What did I miss?
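One way to see why the command failed is to capture its stderr instead of letting check_call raise with only the exit status. This is a generic sketch, not anything specific to the platform; run_and_report is a hypothetical helper name.

```python
import subprocess


def run_and_report(cmd):
    """Run cmd (a list of arguments) and return (exit_code, stdout, stderr)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Example use inside an applet:
#   code, out, err = run_and_report(["ls", "-l", "/mnt/project/"])
#   if code != 0:
#       print(f"ls exited with status {code}: {err.strip()}")
```

With GNU ls, exit status 2 usually comes with a message on stderr such as a missing directory, which would be consistent with /mnt/project not being mounted in that environment.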
Hi Yong,
I believe that is because of how the platform is designed. Folders themselves are not data objects on the platform, they are represented in the metadata for each of the data objects.
Hence, any usage of /mnt/project/ must refer to a particular data object, so I think just calling /mnt/project/ will return an error.
We now recommend that users use /mnt/project/ because it is more convenient for them.
Best,
Ted
Just keep in mind that it is currently read-only. You won't be able to do something like write.csv(my_file, "/mnt/project/my_folder/myfile.csv"); you'll have to write results locally and use dx upload to get them off the worker and into the project.
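A rough Python equivalent of that workflow is sketched below: write the result to local disk on the worker, then hand it to dx upload. The function names and the example folder are assumptions for illustration, and the sketch assumes the dx command-line client is installed and logged in.

```python
import csv
import subprocess


def write_rows_locally(rows, local_path):
    """Write rows to a CSV on the worker's local disk (not under /mnt/project)."""
    with open(local_path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return local_path


def dx_upload_command(local_path, project_folder):
    """Build the dx upload call; --path sets the destination in the project."""
    return ["dx", "upload", local_path, "--path", project_folder]


def save_to_project(rows, local_path, project_folder):
    """Write locally, then upload. Requires the dx CLI on PATH."""
    write_rows_locally(rows, local_path)
    subprocess.check_call(dx_upload_command(local_path, project_folder))


# Example (hypothetical folder): save_to_project(rows, "myfile.csv", "/my_folder/")
```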
Hi Ted,
Thank you, Ted for the prompt answer.
So we can open a file for reading, but we can't do something like
subprocess.check_call('cp /mnt/project/bucketfolder/myfile myfile.copy', shell=True)
or
subprocess.check_call('samtools view /mnt/project/MyBam/test.cram', shell=True)
assuming samtools is specified in the runSpec in dxapp.json?
Yong
In all cases, dxfuse is only performant for sequential (streaming) reads, so I think "samtools view /mnt/project/MyBam/test.cram" might be a good use case for it. On the other hand, "cp /mnt/project/bucketfolder/myfile myfile.copy" has to read (that is, download or stream) the entire file, so there will not be much difference between dxfuse and dx download.
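To make "sequential read" concrete, here is a small sketch that walks through a file front to back in fixed-size chunks and never seeks. Computing an MD5 is just an example workload chosen for illustration; stream_md5 is a hypothetical name.

```python
import hashlib


def stream_md5(path, chunk_size=1024 * 1024):
    """Hash a file by reading it once, in order, one chunk at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at EOF
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

This is the access pattern described above as a good fit. Tools that jump around inside a file (for example, index-based region queries) would be a different pattern, and their performance over the mount is not something this thread establishes.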
Hello, as far as I know, dxfuse (i.e. "/mnt/project/") is not available or preinstalled everywhere. It is part of Swiss Army Knife, JupyterLab, and ttyd, but it is not, for instance, available in applets (although it can be installed there).
Are you trying 'ls -l /mnt/project/' from JupyterLab or your custom-made applet? If the latter, I would guess that dxfuse is not available.
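If the same code has to run both in environments that have the mount and in ones that don't, a simple existence check can decide which route to take. This is only a sketch under the assumption that a readable directory at /mnt/project means the mount is present; both function names are hypothetical.

```python
import os


def dxfuse_available(mount_point="/mnt/project"):
    """Return True if the mount point exists as a readable directory."""
    return os.path.isdir(mount_point) and os.access(mount_point, os.R_OK)


def input_strategy(mount_point="/mnt/project"):
    """Pick how to fetch inputs: read through the mount, or fall back to dx download."""
    return "dxfuse" if dxfuse_available(mount_point) else "dx download"
```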
I am trying to access the project files from my applet. I did notice that the JupyterLab terminal allows /mnt/project access. Thanks.
Sorry for the naive question: can you include Swiss Army Knife in a custom applet? Is there an applet example for that?
Thank you.
a) samtools view is part of Swiss Army Knife (samtools command is specified via -icmd parameter)
https://ukbiobank.dnanexus.com/app/swiss-army-knife
b) if needed, you can run SAK from your custom made bash applet via
dx run app-swiss-army-knife
For more options and details:
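For anyone launching it from Python rather than bash, the call in (b) could be assembled roughly as follows. The -icmd input is mentioned above; the -iin input name for files, the --destination and -y options, and the helper name are stated here as assumptions to check against the app's own documentation.

```python
import subprocess


def swiss_army_knife_command(input_files, cmd, destination=None):
    """Build a dx run call for Swiss Army Knife (hypothetical helper)."""
    args = ["dx", "run", "app-swiss-army-knife"]
    for path in input_files:
        args.append(f"-iin={path}")   # assumed input name for files
    args.append(f"-icmd={cmd}")       # command to run inside the app
    if destination:
        args += ["--destination", destination]
    args.append("-y")                 # skip the interactive confirmation
    return args


def launch(input_files, cmd, destination=None):
    """Submit the job. Requires the dx CLI on PATH and a logged-in session."""
    subprocess.check_call(swiss_army_knife_command(input_files, cmd, destination))
```

Passing the arguments as a list avoids shell quoting problems when the command string itself contains spaces or quotes.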
Thank you! This really helps.
For those who, like me, tried this in their own apps based on this response: this doesn't seem to be true in general; it only seems to hold for specific apps that DNAnexus has created.
I've opened up a question to see if there is a way to enable this in our own apps: https://community.dnanexus.com/s/question/0D582000000L513CAC/dxfuse-automatically-mount-mntproject-on-custom-docker-images