How to import a VCF file? File not found error when importing VCF file with Hail

Former User of DNAx Community_14

19 August 2022 00:00
13 comments

I am trying to import a VCF file from a Spark+Hail Jupyter Notebook.

But the VCF file is not found.

In preparation for the notebook:

The instance type is mem1_hdd1_v2_x4.
I dx-downloaded a single VCF file to the /opt/notebooks folder.
I changed the permissions of that VCF file to rwx.

The code I am using in the Jupyter notebook is:

import pyspark
import hail as hl

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
hl.init(sc=sc)

vcf_path = 'file:///opt/notebooks/XXXXXXXXXXXXXX.vcf.gz'
mt = hl.import_vcf(vcf_path, force=True, reference_genome='GRCh38', array_elements_required=False)
mt.show(5)  # force computation, error here!

Error: File not found file file:///opt/notebooks/XXXXXXXXXXXXXX.vcf.gz does not exist

Thank you for your help

Felipe

Comments


Chai Fungtammasan DNAnexus Team
- 09 September 2022 21:22
Could you try the example Hail notebooks that we published today? They include examples of how to load a VCF and of the most common operations in Hail.
https://community.dnanexus.com/s/question/0D5t0000043xrVhCAI/hail-tutorial-and-example-notebooks-for-ukbrap-analysis

Former User of DNAx Community_22
- 30 September 2022 15:15
Hi Chai, I have a similar error while following your new notebook (import pVCF with Hail).

It keeps failing at the step of writing the matrix table (the last step), or at print(f"Num partitions: {mt.n_partitions()}"), which I have commented out because it takes forever without failing.

The error I get is:

Hail version: 0.2.78-b17627756568
Error summary: FileNotFoundException: /mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c19_b59_v1.vcf.gz (Transport endpoint is not connected)

However, the file should be available, because Hail did manage to read it and give a proper description of its fields in mt.describe().

What is the problem?

Former User of DNAx Community_22
- 03 October 2022 18:28
It seems to happen frequently that files clearly exist when checked from my notebook (via Python) but fail with "FileNotFoundException" when submitted to a Hail/Spark job, as when following your import pVCF tutorial.

Is it possible that some/all of the worker nodes lose access to the mounted files (/mnt), even when they are clearly properly mounted when I check through python?

These are very basic operations, yet they do not seem able to reliably access the mounted file system, and this affects multiple users.
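
A minimal sketch of one way to check this from the notebook, assuming an existing SparkContext named sc (as in the original question). The helper names probe_path and probe_path_on_workers are hypothetical, not part of Spark or Hail. Each Spark task reports the host it ran on and whether the path exists there. Spark does not guarantee that tasks land on every worker, so a clean result is suggestive rather than conclusive.

```python
import os
import socket


def probe_path(path):
    """Return (hostname, exists) for the machine this function runs on."""
    return socket.gethostname(), os.path.exists(path)


def probe_path_on_workers(sc, path, n_tasks=32):
    """Run probe_path in n_tasks Spark tasks and return the distinct results.

    `sc` is assumed to be a live pyspark.SparkContext. More tasks make it
    more likely (but not certain) that every worker is sampled.
    """
    rdd = sc.parallelize(range(n_tasks), n_tasks)
    results = rdd.map(lambda _: probe_path(path)).collect()
    return sorted(set(results))
```

Calling probe_path_on_workers(sc, path) with the /mnt/project path from the error message returns one (hostname, exists) pair per host that ran a task. Any pair whose second element is False would point to a worker that cannot see the mount.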

Chai Fungtammasan DNAnexus Team
- 03 October 2022 18:30
I also ran into a problem when trying this myself with a pVCF from WGS. The engineering team is looking into it. It was working with the pVCF from WES when they ran the test in early September.

Former User of DNAx Community_8
- 24 October 2022 11:49
We get the same issue now (24 October) when trying to read PLINK files with Glow.

Chai Fungtammasan DNAnexus Team
- 25 October 2022 15:35
Just want to keep you all posted that this is still under investigation. Here is what we know so far.

1) The BGEN needs to be compressed with zlib rather than zstd in order to be used with Hail.
2) The BGEN also cannot be multi-allelic.
3) The Hail version needs to be updated.
4) We don't quite understand it yet, but pVCF for WES is working while the pVCF for WGS isn't working. We are still investigating it.

Once the investigation is finished, the plan is to specify in the guidelines which data can be used with Hail, and to give the conversion commands for data that cannot.

Chai Fungtammasan DNAnexus Team
- 28 November 2022 06:10
@Katie Sandford, for Glow, you may create a new topic or contact ukbiobank-support@dnanexus.com.

Former User of DNAx Community_22
- 28 November 2022 14:36
I don't know if it will help with Glow, but I found that in order to work with PLINK files (in general on UKB RAP) I have to create symbolic links first, because of the special characters in the paths that UKB uses (spaces, etc., to which PLINK is more sensitive than other tools).
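
A minimal sketch of that symlink workaround in Python, using only the standard library. The function name link_with_safe_name and the choice of replacement characters are assumptions for illustration. A link created this way exists only on the machine where the code runs (normally the driver), so it is meant for local tools such as PLINK rather than for Spark workers.

```python
import os
import re


def link_with_safe_name(source_path, link_dir):
    """Symlink source_path into link_dir under a name without spaces or commas.

    link_dir itself should be a path free of special characters.
    Returns the path of the symlink.
    """
    # Keep letters, digits, '.', '_' and '-'; collapse anything else to '_'.
    safe_name = re.sub(r"[^A-Za-z0-9._-]+", "_", os.path.basename(source_path))
    os.makedirs(link_dir, exist_ok=True)
    link_path = os.path.join(link_dir, safe_name)
    if not os.path.islink(link_path):
        os.symlink(source_path, link_path)
    return link_path
```

The tool can then be pointed at the returned link path instead of the original /mnt/project path.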

Ondrej Klempir DNAnexus Team
- 07 December 2022 14:34
Hi @Felipe Golib and @Or Yaacov,

I would like to let you know that @Chai Fungtammasan has recently published the following post about Hail troubleshooting for UKB data:
https://community.dnanexus.com/s/question/0D5t000004AflSiCAJ/hail-troubleshooting-for-ukb-data

Chai Fungtammasan DNAnexus Team
- 09 January 2023 03:30
@Felipe Golib I looked into your case, and I think it's because the cluster can't see your file. When you run dx download, the file goes to the driver node only. It usually needs HDFS to transfer such files to the worker nodes. Rather than downloading the file, you can refer to it through dxfuse. This makes the file visible to both the driver and the worker nodes:
vcf_path = "file:///mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c1_b88_v1.vcf.gz"
Then you can give vcf_path as the input to hl.import_vcf.
Could you try again with this pattern instead?
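
A minimal sketch of this pattern, assuming the project is mounted by dxfuse at /mnt/project as in the path above and that Hail has already been initialised in the notebook. The helper names dxfuse_uri and import_vcf_via_dxfuse are hypothetical. The hl.import_vcf arguments mirror the ones in the original question.

```python
import posixpath

DXFUSE_ROOT = "/mnt/project"  # dxfuse mount point used in this thread


def dxfuse_uri(project_path):
    """Turn a path inside the project into a file:// URI under the dxfuse mount."""
    return "file://" + posixpath.join(DXFUSE_ROOT, project_path.lstrip("/"))


def import_vcf_via_dxfuse(project_path, reference_genome="GRCh38"):
    """Import a VCF through dxfuse so driver and workers read the same path."""
    import hail as hl  # imported lazily so dxfuse_uri works without Hail

    return hl.import_vcf(
        dxfuse_uri(project_path),
        # force=True is copied from the question; if the file is known to be
        # block-gzipped, force_bgz=True lets Hail read it in parallel instead.
        force=True,
        reference_genome=reference_genome,
        array_elements_required=False,
    )
```

The project path is passed relative to the project root, so the folder names with spaces and commas go in unchanged.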

Chai Fungtammasan DNAnexus Team
- 09 January 2023 03:34
For your cases, you can check whether the driver or a worker has died using https://job-xxxx.dnanexus.cloud:8081/jobs/ (replace xxxx with the job ID).

I see a driver or worker die from time to time. In most cases, it's a memory issue. I personally like to use mem3_ssd1 because Hail is more memory-intensive than storage-intensive (I haven't run into the need for ssd2 or hdd2 yet).

Former User of DNAx Community_6
- 02 February 2023 15:43
Hello, I am getting this same error: when I try to import the directly genotyped data into Hail, I get a file not found exception even though I can see those files in /mnt/project/. How can I fix it?

Chai Fungtammasan DNAnexus Team
- 02 February 2023 18:05
@Akhil Pampana I recommend that you open a new thread; otherwise there is a high chance that people won't see your question.
The data need to be in the project first before you can see them with dxfuse.
https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data
