Using WGS files with Hail
Hello,
I am trying to load WGS pVCF files into Hail. My workflow is modelled after this tutorial:
https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/pVCF_import.ipynb
My code:
from pyspark.sql import SparkSession
import hail as hl
import os
builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)
PATH_TO_VCF = "file:///mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/chr22/ukb24310_c22_b2446_v1.vcf.gz"
mt = hl.import_vcf(
    PATH_TO_VCF,
    force_bgz=True,
    reference_genome="GRCh38",
    array_elements_required=False,
)
Error summary:
2024-02-29 15:37:00.736 Hail: WARN: 'file:///mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/chr22/ukb24310_c22_b2446_v1.vcf.gz' refers to no files
[...]
Hail version: 0.2.116-cd64e0876c94
Error summary: HailException: arguments refer to no files
Can you please suggest how to fix this?
Comments
12 comments
Could this be related to this post? https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/16019642090141-How-to-properly-write-paths-with-the-CLI-and-dx
From what I could get from that post, I tried changing the path definition to:
Unfortunately, that did not help.
It seems like internally it can resolve the path, as the first line of the error message includes the string [500k release], which I represented with the wildcard * to avoid using brackets:
Listing the path after file: in the error message within a terminal correctly identifies the file:
This is an issue with Hail when the file path contains square brackets []. In this case the path to data-field 24310 is /Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/.
The workaround is to create a new folder without the square brackets, then move all chromosome subfolders to it. I have created a bash script for this task.
Please save it to an .sh file and upload the script to your project, then open a ttyd/SAK job and run this script. This will create a new /Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format - 500k release/ directory (without square brackets), then move subfolders to it.
Thanks for your reply. Can you comment if by running this I'll be incurring any extra data storage costs?
If Hail is internally seeing the [ ] correctly, then maybe there is a different issue. Can you pass Hail a file path that doesn't have any [ ] ?
I notice that the RAP files have distinct file IDs. Could Hail accept file IDs instead of file names?
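A note on why the square brackets might matter. As far as I know, the hl.import_vcf documentation says paths may include glob expressions such as *, ? or [abc123], so a bracketed part of a folder name could be read as a character class rather than as literal text. The sketch below only illustrates that idea with Python's own fnmatch module; it is not Hail's or Hadoop's matcher, and whether Hail accepts the same kind of escaping is not established anywhere in this thread.

```python
import fnmatch
import glob

# Folder name as it appears in the thread.
folder = "DRAGEN population level WGS variants, pVCF format [500k release]"

# Read as a glob pattern, "[500k release]" is a character class that matches
# exactly ONE character from that set, so the pattern does not match the
# literal folder name it was copied from.
matches_itself = fnmatch.fnmatchcase(folder, folder)

# Replacing the bracketed part with "*" gives a pattern that does match.
wildcard = "DRAGEN population level WGS variants, pVCF format *"
matches_wildcard = fnmatch.fnmatchcase(folder, wildcard)

# glob.escape() neutralises the special characters for Python's matcher only.
escaped = glob.escape(folder)
matches_escaped = fnmatch.fnmatchcase(folder, escaped)

print(matches_itself, matches_wildcard, matches_escaped)
```

If Hail behaves in a similar way, that would be consistent with "refers to no files" being reported for a path that a plain ls in a terminal can find.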
Andrew Anighoro The dx mv command only moves the UKB sponsored files around in the same project; no new files are created, so there are no extra storage costs.
Rachael W Yes, you can provide Hail a file path that does not have any [ ]. I tried providing file IDs to Hail but it did not work.
Mike Thanks for the guidance. Now it seems that even after removing the [], Hail still could not find the VCF:
With dx ls "/Bulk/DRAGEN WGS/chr11/ukb24310_c11_b2973_v1.vcf.gz", I confirmed that the path is correct after the dx mv step. Is there any other possible reason? Thanks for your help.
Wondering if this issue has been resolved? Mike, Weichen Song?
I am looking to use Hail on the DRAGEN pVCF files and faced similar challenges loading the VCF files from the default directory. When I made a copy in a different folder in my project it worked fine.
I tried to run the .sh script but SAK fails on submission and the ttyd opens in read only mode and cannot alter files. Any advice would be greatly appreciated.
For anyone reading this I managed to get Mike's suggestion to work and move all the pVCF files to a folder without square brackets in the name using his code in a .sh file:
I faced issues running the .sh file as I work on a Windows machine, but it turns out you can use the terminal in a Jupyter session and call bash <your_script>.sh, and it works just fine!
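Since the script itself is not shown in the thread, here is a minimal sketch of the workaround as described (create a bracket-free folder, then move the chromosome subfolders into it), written in Python around the dx mkdir and dx mv commands. This is not Mike's script. The helper names are made up for illustration, and the subfolder list is an assumption based on the chr11 and chr22 folders mentioned above. It defaults to a dry run that only prints the commands; whether dx needs the square brackets in the source path escaped is not covered here.

```python
import shlex
import subprocess

# Source and destination folders as named in the thread (project paths, not /mnt/project paths).
SRC = "/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]"
DST = "/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format - 500k release"


def build_move_commands(src, dst, subfolders):
    """Return dx commands (as argument lists) that create dst and move each subfolder into it."""
    commands = [["dx", "mkdir", "-p", dst]]
    for name in subfolders:
        commands.append(["dx", "mv", f"{src}/{name}", f"{dst}/"])
    return commands


def run(commands, dry_run=True):
    """Print the commands, or execute them with the dx CLI when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Hypothetical subfolder list: only chr11 and chr22 are mentioned in this thread.
commands = build_move_commands(SRC, DST, ["chr11", "chr22"])
run(commands)  # dry run: prints the commands without touching the project
```

Passing the arguments as a list avoids shell quoting problems with the spaces and commas in the folder names. Check the printed commands before calling run(commands, dry_run=False).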
Hi,
I am trying to move the files using the bash script as suggested before, but getting:
Error while creating /mnt/project/Bulk/pVCF_current in project-J4xXF10JyZfGvFyg268y3bFz Folder "/mnt/project/Bulk" does not exist in project "project-J4xXF10JyZfGvFyg268y3bFz", code 404
My bash script is in /opt/notebooks, maybe it doesn't see /mnt from there?
Thanks
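The 404 above looks like a mix-up between two kinds of path: /mnt/project/... is where the project is mounted inside the instance, while dx commands expect paths relative to the project root, so dx goes looking for a project folder literally called /mnt/project/Bulk. A small sketch of the conversion follows; the helper name to_project_path is made up for illustration.

```python
from pathlib import PurePosixPath

# Where the project is mounted (read-only) inside the instance.
MOUNT_ROOT = PurePosixPath("/mnt/project")


def to_project_path(path):
    """Strip the /mnt/project prefix so the path can be passed to dx commands."""
    p = PurePosixPath(path)
    try:
        relative = p.relative_to(MOUNT_ROOT)
    except ValueError:
        # Not under the mount point: assume it is already a project path.
        return str(p)
    return "/" if str(relative) == "." else "/" + str(relative)


print(to_project_path("/mnt/project/Bulk/pVCF_current"))
```

With this, the folder from the error message becomes /Bulk/pVCF_current, which is the form dx mkdir and dx mv work with.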
Hi, may I ask if this issue has been resolved? I also encountered the same problem. Looking forward to successful experiences. Thank you.
It is not possible to create or amend files using the dxfuse “/mnt/project” method, because it is read-only.
I suggest you copy the files you need to use from your Project storage into your Instance storage using the “dx download” command. If you are working within a JupyterLab Instance then you can enter dx commands within a “$_” Terminal. The dx commands are already installed in all UKB-RAP JupyterLab Instances.
For more on dx commands, see https://documentation.dnanexus.com/user/helpstrings-of-sdk-command-line-utilities . For a general list of resources and documentation about the UKB-RAP, see https://community.ukbiobank.ac.uk/hc/en-gb/articles/15956808110749-UKB-RAP-resources-and-documentation .
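A minimal sketch of the dx download route, assuming Hail has already been initialised as in the original question. The helper names and the local directory are made up for illustration. The file is saved under a bracket-free local path and then passed to hl.import_vcf with the same options as in the question. Two things this does not handle: any escaping dx may need for special characters in the source path, and making a driver-local file visible to worker nodes on a multi-node Spark cluster.

```python
import os
import subprocess


def build_download_command(project_path, local_dir):
    """Return the dx download command and the local path the file will be saved to."""
    local_path = os.path.join(local_dir, os.path.basename(project_path))
    return ["dx", "download", project_path, "-o", local_path], local_path


def download_and_import(project_path, local_dir, reference_genome="GRCh38"):
    """Download one VCF to instance storage, then load it with Hail."""
    import hail as hl  # imported here so the helper above works without Hail installed

    cmd, local_path = build_download_command(project_path, local_dir)
    os.makedirs(local_dir, exist_ok=True)
    subprocess.run(cmd, check=True)
    return hl.import_vcf(
        "file://" + os.path.abspath(local_path),
        force_bgz=True,
        reference_genome=reference_genome,
        array_elements_required=False,
    )


# Project path from the original question; /tmp/pvcf is a hypothetical local directory.
SOURCE = (
    "/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, "
    "pVCF format [500k release]/chr22/ukb24310_c22_b2446_v1.vcf.gz"
)
cmd, local_path = build_download_command(SOURCE, "/tmp/pvcf")
print(local_path)
```

Only build_download_command runs here; call download_and_import(SOURCE, "/tmp/pvcf") inside a session where dx and Hail are available and the instance has enough storage for the file.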