dxfuse + bcftools "Input/output error" when reading

Hi,

I just tried dxfuse https://github.com/dnanexus/dxfuse. I understand that the DNAnexus file system is NOT POSIX and that the speed is limited by the bandwidth. Listing the files worked fine. However, when I used my local bcftools to read a remote file, it raised an I/O error:

```
$ bcftools view "FUSE_UKBIOBANK/xxxxxxx/xxxxxdiploidSV.vcf.gz"
Failed to read from "FUSE_UKBIOBANK/xxxxxxx/xxxxxdiploidSV.vcf.gz Input/output error
```

So, is there a way to use my local bcftools to read the remote files? Does it cost anything to access the files via dxfuse?

Comments

4 comments

  • Comment author
    Chai Fungtammasan DNAnexus Team

    I never tried this use case before, but this sounds very useful.

     

    Before looking into dxfuse, can you check whether you have download permission for these files? The MTA between UKB and UKB researchers prohibits researchers from downloading WES/WGS and imputed data from UKB-RAP (except data that were available before UKB-RAP was created), so the platform has a download block as a guardrail to prevent users from accidentally downloading the data.

     

    If this is one of the files that you are not allowed to download, you should also get an error from any other attempt to transfer the file out, for example `dx cat <file-id/path> | bcftools view` or `dx download <file-id/path>`.
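    The check described above could be scripted roughly as follows. This is only a sketch: `can_stream` is a hypothetical helper name, not part of dx-toolkit, and it simply reports whether a streaming command produced any bytes. It cannot tell a download block apart from an empty file or a missing `dx` login.

```shell
# Hypothetical helper (not part of dx-toolkit): run a command that streams
# a file to stdout and succeed only if at least one byte could be read.
can_stream() {
    # "$@" is the streaming command, e.g. dx cat "<file-id/path>"
    # Errors are discarded; only the first 1 KiB of output is counted.
    bytes=$("$@" 2>/dev/null | head -c 1024 | wc -c | tr -d '[:space:]')
    [ "$bytes" -gt 0 ]
}

# Usage (assumes dx-toolkit is installed and you are logged in):
#   can_stream dx cat "<file-id/path>" \
#       && echo "content is readable" \
#       || echo "blocked, empty, or unreadable"
```

    If this reports a failure for a file that `dx ls` can see, the download block is a likely cause, and dxfuse would be expected to fail on the same file.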

     

    Using dxfuse should not cost anything if you only list files, since that just reads metadata. However, once you view the content of a file or download it from outside the UKB-RAP, you would most likely pay an egress fee. If you do it inside the RAP, you would pay the EC2 instance rental cost, but no egress fee.

  • Comment author
    Former User of DNAx Community_46

    No, you're right: I was not allowed to download those files!

  • Comment author
    Ondrej Klempir DNAnexus Team

    If you want to test and experiment further with dxfuse on your local machine, this is what I would try:

    1. I would download a small, publicly available vcf.gz file and upload it to UKB RAP.
    2. Check whether the file can be seen through dxfuse (list the directory).
    3. I would try running commands such as cat, head or file on /mnt/project/XYZ/.../my.vcf.gz and test whether they give you the same I/O error.
    4. Out of curiosity, I would then test some other bioinformatics tools, not just bcftools, to read the file. I am not sure whether this applies to bcftools, but dxfuse is only performant with sequential (streaming) reads in order. With out-of-order reads, i.e. a random access pattern, read performance will be significantly worse.
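
    Steps 2 to 4 could be sketched as a small shell function. `probe_file` is a hypothetical name, and the choice of `ls`, `head`, `cat` and `tail` is an assumption about which checks are informative: `tail -c` is used here only as a cheap stand-in for a non-sequential read, since it typically seeks to the end of the file.

```shell
# Hypothetical helper: run basic read checks on one file in a dxfuse mount
# and stop at the first step that fails.
probe_file() {
    path=$1
    # Step 2: is the file visible at all? (metadata only)
    ls -l "$path" >/dev/null 2>&1 \
        || { echo "FAIL: cannot list $path"; return 1; }
    # Step 3: sequential read of the first 4 KiB.
    head -c 4096 "$path" >/dev/null 2>&1 \
        || { echo "FAIL: cannot read start of $path"; return 2; }
    # Step 3: full sequential (streaming) read.
    cat "$path" >/dev/null 2>&1 \
        || { echo "FAIL: cannot stream $path"; return 3; }
    # Step 4: read the last 4 KiB, which usually involves a seek.
    tail -c 4096 "$path" >/dev/null 2>&1 \
        || { echo "FAIL: cannot read end of $path"; return 4; }
    echo "OK: $path"
}

# Usage (path taken from step 3 above, elided as in the original):
#   probe_file /mnt/project/XYZ/.../my.vcf.gz
```

    The return code indicates which step failed, which may help distinguish a permission problem (nothing readable) from an access-pattern problem (only the seek fails).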
  • Comment author
    Former User of DNAx Community_46

    @Ondrej Klempir, thank you for your answer. I was exploring ways to run workflows and to read the data. I'll leave WDL and dxfuse for now and focus on running Nextflow on the UKBB side.

