How to import a VCF file? File not found error when importing VCF file with Hail
I am trying to import a VCF file from a Spark + Hail Jupyter notebook,
but the VCF file is not found.
In preparation for the notebook:
- instance type is mem1_hdd1_v2_x4
- I dx-downloaded a single VCF file to the /opt/notebooks folder.
- I changed the permissions of that VCF file to rwx
The code I am running in the Jupyter notebook is:
import pyspark
import hail as hl
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
hl.init(sc=sc)
vcf_path = 'file:///opt/notebooks/XXXXXXXXXXXXXX.vcf.gz'
mt = hl.import_vcf(vcf_path, force=True, reference_genome='GRCh38', array_elements_required=False)
mt.show(5)  # forces computation; the error occurs here
Error: File not found file file:///opt/notebooks/XXXXXXXXXXXXXX.vcf.gz does not exist
Thank you for your help
Felipe
Comments
Could you try with the example Hail notebooks that we published today? They include examples of how to load a VCF and of the most common operations in Hail.
https://community.dnanexus.com/s/question/0D5t0000043xrVhCAI/hail-tutorial-and-example-notebooks-for-ukbrap-analysis
Hi Chai, I get a similar error while following your new notebook (import pVCF with Hail).
It keeps failing at the step that writes the matrix table (the last step), or at the print(f"Num partitions: {mt.n_partitions()}") call, which I have commented out because it runs forever without failing.
The error I get is:
Hail version: 0.2.78-b17627756568
Error summary: FileNotFoundException: /mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c19_b59_v1.vcf.gz (Transport endpoint is not connected)
However the file should be available because it did manage to read it to get a proper description of its fields in mt.describe().
What is the problem?
It seems to happen frequently that files which clearly exist when I check from my notebook with Python fail with "FileNotFoundException" once they are submitted to a Hail/Spark job, as happens when following your import pVCF tutorial.
Is it possible that some or all of the worker nodes lose access to the mounted files (/mnt), even though they are clearly mounted properly when I check through Python?
These are very basic operations, and they do not seem able to access the mounted file system reliably.
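One way to test the worker-node question above is to run the existence check inside Spark tasks instead of only on the driver. The following is a minimal sketch, not a tested diagnostic for this platform: probe_path and probe_path_on_workers are hypothetical helper names, sc is assumed to be the SparkContext already created in the notebook, and Spark does not guarantee that the tasks land on every worker, so a clean result is suggestive rather than conclusive.

```python
import os
import socket


def probe_path(path):
    """Return (hostname, exists) for whichever node this call runs on.

    `path` is a plain file system path such as /mnt/project/..., without
    the file:// scheme.
    """
    return socket.gethostname(), os.path.exists(path)


def probe_path_on_workers(sc, path, n_tasks=32):
    """Run probe_path in n_tasks Spark tasks and return the distinct results.

    `sc` is assumed to be an existing pyspark.SparkContext. The number of
    tasks is arbitrary; more tasks make it likelier that every worker is hit.
    """
    rdd = sc.parallelize(range(n_tasks), n_tasks)
    return sorted(set(rdd.map(lambda _: probe_path(path)).collect()))
```

In the notebook, print(probe_path(path)) shows what the driver sees and print(probe_path_on_workers(sc, path)) shows what the executors see. A hostname paired with False would point to a node where the mount is missing or disconnected.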
I ran into a problem trying this myself with the pVCF from WGS, and the engineering team is looking into it. It worked with the pVCF from WES when they ran the test in early September.
We get the same issue now (24th October) when trying to read plink files with glow.
Just want to keep you all posted that this is still under investigation. Here is what we know so far.
1) The BGEN needs to be compressed with zlib rather than zstd in order to be used with Hail.
2) The BGEN also must not be multi-allelic.
3) The Hail version needs to be updated.
4) We don't quite understand it yet, but pVCF for WES is working while the pVCF for WGS isn't working. We are still investigating it.
Once the investigation is finished, the plan is to specify in the guidelines which data can be used with Hail, and to give the conversion command for data that cannot.
@Katie Sandford for Glow, you may create a new topic or contact ukbiobank-support@dnanexus.com.
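For point 1, the compression scheme of a BGEN file can be read from its header without any extra tools. This is a minimal sketch based on the public BGEN format description, in which the two lowest bits of the header flags encode the compression (0 = none, 1 = zlib, 2 = zstd); bgen_compression is a hypothetical helper name, and the sketch only reports the flag, it does not convert anything.

```python
import struct

# Meaning of the two lowest flag bits according to the BGEN format description.
COMPRESSION = {0: "none", 1: "zlib", 2: "zstd"}


def bgen_compression(path):
    """Return the compression scheme recorded in a BGEN file header."""
    with open(path, "rb") as f:
        head = f.read(20)
        if len(head) < 20:
            raise ValueError("file is too short to be a BGEN file")
        # Little-endian uint32 fields: offset to the first variant block,
        # header length, number of variants, number of samples.
        _offset, header_len, _n_variants, _n_samples = struct.unpack("<IIII", head[:16])
        if head[16:20] not in (b"bgen", b"\x00\x00\x00\x00"):
            raise ValueError("BGEN magic number not found")
        # The header block starts at byte 4; its last 4 bytes hold the flags.
        f.seek(4 + header_len - 4)
        flag_bytes = f.read(4)
        if len(flag_bytes) < 4:
            raise ValueError("truncated BGEN header")
        (flags,) = struct.unpack("<I", flag_bytes)
    return COMPRESSION.get(flags & 0b11, "unknown")
```

Calling bgen_compression on a file under /mnt/project returns "zlib" or "zstd", which shows whether the file falls under point 1 before any Hail job is started.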
I don't know if it will help with Glow, but I found that to work with plink files (in general on UKB RAP) I have to create symbolic links first, because of the special characters in the paths that UKB uses (spaces etc., which plink is more sensitive to than other tools).
Hi @Felipe Golib and @Or Yaacov,
I would like to let you know that @Chai Fungtammasan has recently published the following post about Hail troubleshooting for UKB data:
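The symbolic-link workaround can be sketched in Python as follows. This is only an illustration of the idea: link_files is a hypothetical helper, the directory and prefix in the usage note are placeholders, and it links only files whose names start with the given prefix into a directory whose path has no spaces or commas.

```python
import os


def link_files(src_dir, link_dir, prefix):
    """Symlink every file in src_dir whose name starts with prefix into link_dir.

    Returns the list of link paths. Links that already exist are left alone.
    """
    os.makedirs(link_dir, exist_ok=True)
    links = []
    for name in sorted(os.listdir(src_dir)):
        if not name.startswith(prefix):
            continue
        target = os.path.join(src_dir, name)
        link = os.path.join(link_dir, name)
        if not os.path.lexists(link):
            os.symlink(target, link)
        links.append(link)
    return links
```

For example, link_files("/mnt/project/Bulk/Some folder, with spaces", "/opt/notebooks/links", "my_prefix") would leave the .bed, .bim and .fam files reachable under /opt/notebooks/links; both paths and the prefix are made up here. Since the links live on the driver node, this helps tools that run on the driver, and it is not expected to make the files visible to Spark workers.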
https://community.dnanexus.com/s/question/0D5t000004AflSiCAJ/hail-troubleshooting-for-ukb-data
@Felipe Golib I looked into your case, and I think the problem is that the cluster can't see your file. When you run dx download, the file goes to the driver node only, and it usually needs HDFS to transfer such files to the worker nodes. Rather than downloading the file, you can refer to it through dxfuse, which makes it visible to both the driver and the worker nodes:
vcf_path = "file:///mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c1_b88_v1.vcf.gz"
Then you can pass vcf_path as the input to hl.import_vcf.
Could you try again with this pattern instead?
For your cases, you can check whether the driver or a worker has died at https://job-xxxx.dnanexus.cloud:8081/jobs/ (replace xxxx with the job ID).
I see a driver or worker die from time to time. In most cases it is a memory issue. I personally like to use mem3_ssd1 because Hail is more memory-intensive than storage-intensive (I haven't run into the need for ssd2 or hdd2 yet).
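Put together, the dxfuse pattern could look like the sketch below. It rests on several assumptions: project_file_uri and import_project_vcf are hypothetical helper names, hl.init(sc=sc) has already been called as in the original question, and force_bgz=True assumes the .vcf.gz is block-gzipped (otherwise force=True, as in the question, would be the fallback). Note that the spaces in the path are left as they are, as in the vcf_path example.

```python
import os


def project_file_uri(relative_path, mount_root="/mnt/project"):
    """Build a file:// URI for a project file exposed through the dxfuse mount."""
    return "file://" + os.path.join(mount_root, relative_path.lstrip("/"))


def import_project_vcf(relative_path, reference_genome="GRCh38"):
    """Import a VCF from the project mount into a Hail MatrixTable."""
    import hail as hl  # imported lazily; assumes hl.init(sc=sc) has already run

    return hl.import_vcf(
        project_file_uri(relative_path),
        force_bgz=True,  # assumption: the file is block-gzipped
        reference_genome=reference_genome,
        array_elements_required=False,
    )
```

Here relative_path is the path of the file inside the project, that is, everything after /mnt/project/ in the vcf_path example.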
Hello, I am getting the same error: when I try to import the directly genotyped data into Hail, I get a file not found exception even though I can see those files in /mnt/project/. How can I fix it?
@Akhil Pampana I recommend that you open a new thread, otherwise there is a high chance that people won't see your question.
The data need to be in the project first before you can see them with dxfuse.
https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data