Store Hail MT from WES data to DNAX

Ning Li

I used Hail on a Spark cluster to perform QC on WES data, but I did not write the MatrixTable (MT) to DNAX. Below is my code. When I attempted to run mt.write(url), the process took an extremely long time and eventually failed.

The instance type I used was mem1_ssd1_v2_36, with 2 nodes. The log file showed the following resource usage:

INFO CPU: 0% (36 cores) * Memory: 5235/70303MB * Storage: 24/842GB * Net: 0↓/0↑MBps
 

It seems that despite the available resources, the writing process was not progressing efficiently. Could you provide any insights or suggestions on how to resolve this issue?

import pyspark 
import dxpy 
sc = pyspark.SparkContext() 
spark = pyspark.sql.SparkSession(sc) 
spark.sql("CREATE DATABASE my_database LOCATION 'dnax://'")

import hail as hl
my_database = dxpy.find_one_data_object(
    name="my_database", 
    project=dxpy.find_one_project()["id"]
)["id"]

hl.init(sc=sc, tmp_dir=f'dnax://{my_database}/tmp/')

file_url = 'file:///mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c21_*.vcf.gz'
mt = hl.import_vcf(file_url, 
                   force_bgz=True, 
                   reference_genome="GRCh38", 
                   array_elements_required=False)
print(f"Num partitions: {mt.n_partitions()}")
mt.describe()
mt_name = "test.mt"                      
url = f"dnax://{my_database}/{mt_name}"
mt.write(url)

Comments

1 comment

  • Comment author
    Ashwin Lakshman Koppayi

    Stuck on this process too… Did you find an efficient way to store the Hail MT?

    0

Please sign in to leave a comment.