Using WGS files with Hail

Andrew Anighoro

Hello,

I am trying to load WGS pVCF files into Hail. My workflow is modelled after this tutorial:

https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/pVCF_import.ipynb

My code:

from pyspark.sql import SparkSession
import hail as hl

# Start (or reuse) a Spark session and hand its context to Hail
builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

# Path to one chr22 pVCF block on the mounted project
PATH_TO_VCF = "file:///mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/chr22/ukb24310_c22_b2446_v1.vcf.gz"

# Import the bgzipped VCF as a MatrixTable
mt = hl.import_vcf(
    PATH_TO_VCF,
    force_bgz=True,
    reference_genome="GRCh38",
    array_elements_required=False,
)

Error summary:

2024-02-29 15:37:00.736 Hail: WARN: 'file:///mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/chr22/ukb24310_c22_b2446_v1.vcf.gz' refers to no files
[...]
Hail version: 0.2.116-cd64e0876c94
Error summary: HailException: arguments refer to no files

Can you please suggest how to fix this?

Comments

  • Comment author
    Rachael W (UKB Community team, Data Analyst)

    Could this be related to this post? https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/16019642090141-How-to-properly-write-paths-with-the-CLI-and-dx

  • Comment author
    Andrew Anighoro

    Based on what I understood from that post, I tried changing the path definition to:

    PATH_TO_VCF = "file:///mnt/project/Bulk/DRAGEN\ WGS/DRAGEN\ population\ level\ WGS\ variants,\ pVCF\ format\ */chr22/ukb24310_c22_b2446_v1.vcf.gz"

    Unfortunately, that did not help.

    Hail does seem to resolve the path internally: the first line of the error message contains the string [500k release], which I had replaced with the wildcard * to avoid using brackets:

    2024-03-07 15:48:11.752 Hail: WARN: 'file:/mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/chr22/ukb24310_c22_b2446_v1.vcf.gz' refers to no files

    Running ls in a terminal on the path that follows file: in the error message finds the file:

    root@job-[...]:/opt/notebooks# ls "/mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/chr22/ukb24310_c22_b2446_v1.vcf.gz"
    '/mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/chr22/ukb24310_c22_b2446_v1.vcf.gz'
  • Comment author
    Mike DNAnexus Team

    This is an issue with Hail when the file path contains square brackets []. In this case, the path to data-field 24310 is /Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/.
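
A likely explanation, which is an assumption and not confirmed in this thread, is that the path is interpreted as a glob pattern, so "[500k release]" is read as a character class and not as literal text. The Python sketch below illustrates that behaviour with the standard library's fnmatch module. It is only an analogy for glob-style matching in general and does not call Hail or reproduce its exact rules.

```python
from fnmatch import fnmatchcase
import glob

folder = "pVCF format [500k release]"

# Used as a pattern, "[500k release]" is a character class that matches ONE
# character from the set, so the pattern no longer matches the literal name.
literal_match = fnmatchcase(folder, folder)

# Escaping the metacharacters makes the pattern match the literal name again.
escaped_match = fnmatchcase(folder, glob.escape(folder))

# A name without brackets matches itself, which is what the rename achieves.
renamed = "pVCF format - 500k release"
renamed_match = fnmatchcase(renamed, renamed)

print(literal_match, escaped_match, renamed_match)
```

Whether an escaped path would also work in Hail is not established here; the rename described next avoids the question entirely.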

    The workaround is to create a new folder without the square brackets, then move all chromosome subfolders to it. I have created a bash script for this task.

    Please save it as a .sh file, upload it to your project, then open a ttyd/SAK job and run it. The script creates a new /Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format - 500k release/ directory (without square brackets) and moves the chromosome subfolders into it.

    #!/usr/bin/env bash
    set -xeuo pipefail
    
    # create a new folder without square brackets []
    dx mkdir "/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format - 500k release/"
    
    # move each subfolder to the new location
    for folder in $(dx ls "/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/"); do
        echo "Moving folder ${folder}"
        dx mv "/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/${folder}" "/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format - 500k release/${folder}"
    done
    echo "Done"
  • Comment author
    Andrew Anighoro

    Thanks for your reply. Will running this incur any extra data storage costs?

  • Comment author
    Rachael W (UKB Community team, Data Analyst)

    If Hail is internally seeing the [ ] correctly, then maybe there is a different issue. Can you pass Hail a file path that doesn't contain any [ ]?

    I notice that the RAP files have distinct file IDs. Could Hail accept file IDs instead of file names?

  • Comment author
    Mike DNAnexus Team

    Andrew Anighoro The dx mv command only moves the UKB-sponsored files within the same project. No new files are created, so there are no extra storage costs.

    Rachael W Yes, you can give Hail a file path that does not contain any [ ]. I tried providing file IDs to Hail, but that did not work.

  • Comment author
    Weichen Song

    Mike Thanks for the guidance. Even after removing the [], Hail still cannot find the VCF:

    Error summary: HailException: arguments refer to no files: Vector(file:///mnt/project/Bulk/DRAGEN WGS/chr11/ukb24310_c11_b2973_v1.vcf.gz).

     

    With dx ls "/Bulk/DRAGEN WGS/chr11/ukb24310_c11_b2973_v1.vcf.gz", I confirmed that the path is correct after the dx mv step. Is there any other possible reason? Thanks for your help.
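
One way to narrow this down: dx ls queries the project itself, whereas Hail reads through the /mnt/project mount, so it may help to check the mounted path directly from Python before calling Hail. A minimal sketch follows; the helper names to_local_path and exists_on_mount are hypothetical.

```python
import os


def to_local_path(uri: str) -> str:
    """Strip a leading 'file:' scheme (file:/x or file:///x) and return /x."""
    if uri.startswith("file:"):
        return "/" + uri[len("file:"):].lstrip("/")
    return uri


def exists_on_mount(uri: str) -> bool:
    """Return True if the path behind the URI exists on the local filesystem."""
    return os.path.exists(to_local_path(uri))


vcf = "file:///mnt/project/Bulk/DRAGEN WGS/chr11/ukb24310_c11_b2973_v1.vcf.gz"
print(to_local_path(vcf), exists_on_mount(vcf))
```

If this reports that the file is missing while dx ls succeeds, the mount may not reflect the move yet. That is an assumption, and restarting the session is one way to test it.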

  • Comment author
    Alan Alexander Dimitriev

    Mike, Weichen Song: has this issue been resolved?

    I am looking to use Hail on the DRAGEN pVCF files and ran into the same problem loading the VCF files from the default directory. When I made a copy in a different folder in my project, it worked fine.

    I tried to run the .sh script, but the SAK job fails on submission and ttyd opens in read-only mode, so it cannot alter files. Any advice would be greatly appreciated.

  • Comment author
    Alan Alexander Dimitriev

    For anyone reading this: I managed to get Mike's suggestion to work and moved all the pVCF files to a folder without square brackets in its name, using his code in a .sh file:
     

    #!/usr/bin/env bash
    set -xeuo pipefail
    
    # create a new folder without square brackets []
    dx mkdir "/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format - 500k release/"
    
    # move each subfolder to the new location
    for folder in $(dx ls "/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/"); do
        echo "Moving folder ${folder}"
        dx mv "/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/${folder}" "/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format - 500k release/${folder}"
    done
    echo "Done"

     

    I had trouble running the .sh file because I work on a Windows machine, but it turns out you can open the terminal in a Jupyter session and call bash <your_script>.sh there, and it works fine.

     

  • Comment author
    Salvatore Loguercio

    Hi,

    I am trying to move the files using the bash script suggested above, but I get:
    Error while creating /mnt/project/Bulk/pVCF_current in project-J4xXF10JyZfGvFyg268y3bFz  Folder "/mnt/project/Bulk" does not exist in project "project-J4xXF10JyZfGvFyg268y3bFz", code 404

    My bash script is in /opt/notebooks, maybe it doesn't see /mnt from there?

    Thanks
     

  • Comment author
    Yuanyuan Ye

    Hi, may I ask whether this issue has been resolved? I ran into the same problem and would be glad to hear from anyone who got this working. Thank you.

  • Comment author
    Rachael W (UKB Community team, Data Analyst)

    It is not possible to create or amend files through the dxfuse “/mnt/project” mount, because it is read-only.
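
Separately, the 404 reported above suggests that a dxfuse mount path was passed to dx: the error names "/mnt/project/Bulk" as a folder inside the project, whereas the dx commands in the script earlier in this thread use project paths that start at "/Bulk/". A minimal Python sketch of the conversion, assuming the project is mounted at /mnt/project as in this thread (the helper name to_project_path is hypothetical):

```python
from pathlib import PurePosixPath

# Assumption: the project is mounted here, as in the paths used in this thread.
MOUNT_ROOT = PurePosixPath("/mnt/project")


def to_project_path(mount_path: str) -> str:
    """Turn a path under the dxfuse mount into the project path dx expects."""
    try:
        relative = PurePosixPath(mount_path).relative_to(MOUNT_ROOT)
    except ValueError:
        raise ValueError(f"{mount_path!r} is not under {MOUNT_ROOT}") from None
    return str(PurePosixPath("/") / relative)


project_path = to_project_path("/mnt/project/Bulk/pVCF_current")
print(project_path)
```

The result is the form to pass to dx mkdir or dx mv; paths beginning with /mnt/project are only meaningful for reading through the mount.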

    I suggest you copy the files you need from your Project storage into your Instance storage using the “dx download” command. If you are working within a JupyterLab Instance, you can enter dx commands in a “$_” Terminal. The dx commands are already installed in all UKB-RAP JupyterLab Instances.
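
As a rough illustration of the copy-then-import approach, the Python sketch below copies one file into a bracket-free directory on instance storage and returns a file:// URI. It uses shutil.copy on a locally readable file as a stand-in for dx download. The helper name stage_for_hail and the destination directory are hypothetical, and the demo runs on throwaway temporary files.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Characters that glob-style matchers commonly treat as special.
GLOB_CHARS = re.compile(r"[\[\]{}*?]")


def stage_for_hail(src: str, dest_dir: str) -> str:
    """Copy src into dest_dir and return a file:// URI with no glob characters."""
    target = Path(dest_dir).resolve() / Path(src).name
    if GLOB_CHARS.search(str(target)):
        raise ValueError(f"destination still contains glob characters: {target}")
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, target)
    return "file://" + str(target)


# Demo on temporary files: the source folder mimics the bracketed name.
root = Path(tempfile.mkdtemp())
src_dir = root / "pVCF format [500k release]" / "chr22"
src_dir.mkdir(parents=True)
src_file = src_dir / "demo.vcf.gz"
src_file.write_bytes(b"")

staged_uri = stage_for_hail(str(src_file), str(root / "staged"))
print(staged_uri)
```

On a real instance, the source would be the mounted pVCF file and the returned URI would be passed to hl.import_vcf as in the original question. Whether a file on the driver's local disk is visible to all Spark workers depends on the cluster setup, so treat this as a starting point.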

    For more on dx commands, see https://documentation.dnanexus.com/user/helpstrings-of-sdk-command-line-utilities . For a general list of resources and documentation about the UKB-RAP, see https://community.ukbiobank.ac.uk/hc/en-gb/articles/15956808110749-UKB-RAP-resources-and-documentation .

     

