Store Hail MT from WES data to DNAX

Ning Li

23 February 2025 22:16

I used Hail on a Spark cluster to perform QC on WES data, but I have not been able to write the MatrixTable (MT) to DNAX. My code is below. When I ran mt.write(url), the process took an extremely long time and eventually failed.

The instance type I used was mem1_ssd1_v2_36, with 2 nodes. The log file showed the following resource usage:

INFO CPU: 0% (36 cores) * Memory: 5235/70303MB * Storage: 24/842GB * Net: 0↓/0↑MBps
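
As a rough check of what that line implies, here is a minimal sketch that parses it into utilization figures. The function name parse_usage and the regular expressions are illustrative assumptions based only on the single log line quoted above, not on a documented log format.

```python
import re

# The resource-usage line copied from the job log quoted above.
LOG_LINE = "INFO CPU: 0% (36 cores) * Memory: 5235/70303MB * Storage: 24/842GB * Net: 0↓/0↑MBps"


def parse_usage(line):
    """Extract CPU, memory and storage utilization from one log line.

    The patterns are an assumption derived from the single line shown in
    this post; other log lines may be formatted differently.
    """
    cpu = re.search(r"CPU:\s*(\d+)%\s*\((\d+) cores\)", line)
    mem = re.search(r"Memory:\s*(\d+)/(\d+)MB", line)
    sto = re.search(r"Storage:\s*(\d+)/(\d+)GB", line)
    if not (cpu and mem and sto):
        raise ValueError("line does not match the expected resource-usage format")
    return {
        "cpu_percent": int(cpu.group(1)),
        "cores": int(cpu.group(2)),
        "memory_fraction": int(mem.group(1)) / int(mem.group(2)),
        "storage_fraction": int(sto.group(1)) / int(sto.group(2)),
    }


usage = parse_usage(LOG_LINE)
print(f"CPU: {usage['cpu_percent']}% of {usage['cores']} cores")
print(f"Memory used: {usage['memory_fraction']:.1%}")
print(f"Storage used: {usage['storage_fraction']:.1%}")
```

For this line, memory use works out to roughly 7% and storage to roughly 3%, with the CPU reported as idle.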

Despite the available resources, the write did not appear to be making any progress. Could you provide any insights or suggestions on how to resolve this issue?

import pyspark
import dxpy

# Start Spark and create a DNAX-backed database to hold the MatrixTable.
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
spark.sql("CREATE DATABASE my_database LOCATION 'dnax://'")

import hail as hl

# Look up the ID of the database created above.
my_database = dxpy.find_one_data_object(
    name="my_database",
    project=dxpy.find_one_project()["id"]
)["id"]

# Initialize Hail on the existing SparkContext, with temporary files on DNAX.
hl.init(sc=sc, tmp_dir=f'dnax://{my_database}/tmp/')

# Import the ukb23157_c21_* pVCF files from the mounted project.
file_url = 'file:///mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c21_*.vcf.gz'
mt = hl.import_vcf(file_url,
                   force_bgz=True,
                   reference_genome="GRCh38",
                   array_elements_required=False)
print(f"Num partitions: {mt.n_partitions()}")
mt.describe()

# Write the MatrixTable to DNAX (this is the step that stalls and fails).
mt_name = "test.mt"
url = f"dnax://{my_database}/{mt_name}"
mt.write(url)
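
One way to narrow this down might be to write only a small slice of the MatrixTable first, to see whether any write to DNAX completes at all. The sketch below is an assumption-laden illustration, not a confirmed fix: the helper names dnax_url and write_smoke_test, the default of 1000 rows, and the test_subset.mt name are all hypothetical. It relies only on MatrixTable.head and MatrixTable.write with overwrite=True.

```python
def dnax_url(database_id, name):
    """Build a dnax:// URL the same way the script above does."""
    return f"dnax://{database_id}/{name}"


def write_smoke_test(mt, url, n_rows=1000):
    """Write only the first n_rows rows of mt to url and return the URL.

    mt is expected to be a Hail MatrixTable. The row count is an
    arbitrary illustrative default, not a recommended value.
    """
    subset = mt.head(n_rows)           # keep only the first n_rows variants
    subset.write(url, overwrite=True)  # replace any earlier test output
    return url
```

With the variables from the script above, this would be called as write_smoke_test(mt, dnax_url(my_database, "test_subset.mt")). If the small write succeeds quickly, the problem is more likely related to the size of the full import than to the DNAX connection itself.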

Comments

1 comment

Ashwin Lakshman Koppayi
10 April 2026 19:28
Stuck on this process too… Did you find an efficient way to store the Hail MT?