Hi,
I just tried dxfuse https://github.com/dnanexus/dxfuse
I understand the DNAnexus FS is NOT POSIX.
I understand that the speed is limited by the bandwidth.
Listing the files worked fine. However, when I used my local bcftools to read a remote file, it raised an I/O error.
```
$ bcftools view "FUSE_UKBIOBANK/xxxxxxx/xxxxxdiploidSV.vcf.gz"
Failed to read from "FUSE_UKBIOBANK/xxxxxxx/xxxxxdiploidSV.vcf.gz Input/output error
```
So is there a way to use my bcftools to read the remote files?
Does it 'cost' anything to access the files via dxfuse?
I never tried this use case before, but this sounds very useful.
Before looking into dxfuse, can you check whether you have download permission for these files? The MTA between UKB researchers and UKB prohibits researchers from downloading WES/WGS and imputed data from UKB-RAP (except data that were available before UKB-RAP was created), so the platform has a download block as a guardrail to prevent users from accidentally downloading the data.
If this is one of the files that you are not allowed to download, you should get an error with any other attempt to transfer the file out, such as `dx cat <file-id/path> | bcftools view` or `dx download <file-id/path>`.
dxfuse should not cost anything just to list files, since that only reads metadata. However, once you view the content of a file or download it from outside the UKB-RAP, you would most likely pay an egress fee. If you do it inside the RAP, you pay for the EC2 instance rental, but no egress fee.
If you want to do some additional testing and experiment more with dxfuse on your local machine, this is what I would try:
I would download a small publicly available vcf.gz file and upload it to UKB-RAP.
Check whether the file can be seen through dxfuse (list the directory).
I would try running commands such as cat, head, or file on /mnt/project/XYZ/.../my.vcf.gz and test whether they give you the same I/O error.
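The checks above could be scripted roughly as follows. This is only a minimal sketch, not a dxfuse feature: `probe_file` is a hypothetical helper name, and it uses nothing beyond the standard `ls`, `head`, and `file` commands. The 1024-byte read size is an arbitrary choice.

```shell
# probe_file: hypothetical helper that checks whether a file on a mount
# can be listed and whether its first bytes can actually be read.
probe_file() {
    path="$1"

    # Step 1: metadata only (this is what listing a directory exercises).
    if ! ls -l -- "$path" >/dev/null 2>&1; then
        echo "cannot list: $path" >&2
        return 1
    fi

    # Step 2: read some content. A download block or an I/O problem
    # should show up here rather than in step 1.
    if ! head -c 1024 -- "$path" >/dev/null 2>&1; then
        echo "listed but not readable: $path" >&2
        return 2
    fi

    # Step 3: report the detected file type, if `file` is installed.
    if command -v file >/dev/null 2>&1; then
        file -- "$path"
    fi
    echo "readable: $path"
}
```

You would call it as `probe_file /mnt/project/XYZ/.../my.vcf.gz`, once on the small public test file and once on the restricted one. An exit status of 2 means the file is visible but its content could not be read, which is the situation described in the question.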
Out of curiosity, I would then test some other bioinformatics tools, not just bcftools, to read the file. I am not sure whether this applies to bcftools, but dxfuse is only performant with sequential (streaming) reads in order. With out-of-order reads, performance will be significantly worse. By "out-of-order" I mean a random access pattern.
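To make the sequential versus random distinction concrete, here is a small sketch using only the standard `cat`, `wc`, and `dd` commands. `read_sequential` and `read_at_offset` are hypothetical helper names, and nothing here is dxfuse-specific. Whether a given tool (bcftools included) seeks within a particular file is something you would have to test.

```shell
# read_sequential: stream the whole file front to back, the in-order
# access pattern described above. Prints the number of bytes read.
read_sequential() {
    cat -- "$1" | wc -c | tr -d ' '
}

# read_at_offset: jump to a byte offset and read a fixed number of bytes,
# i.e. a single random-access read, as a tool using an index might do.
# Arguments: path, offset in bytes, length in bytes.
# bs=1 keeps the arithmetic simple but is slow for large lengths.
read_at_offset() {
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null
}
```

Timing one `read_sequential` call against several `read_at_offset` calls at scattered offsets on the mounted test file should give a rough idea of how much the access pattern matters on your connection. If a tool accepts standard input, piping `cat` into it is one way to keep its reads in order.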
@Ondrej Klempir, thank you for your answer. I was exploring ways to run workflows and read the data. I'll leave WDL and dxfuse for now and focus on running Nextflow on the UKBB side.
No, you're right, I was not allowed to download those files!