Problems writing Hail Matrix Table

Emma Mary Wade

Hi there

I am quite new to the UKBB, so apologies if this is asked regularly, but I couldn't find a recent solution to my problem. I am following the instructions here: https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/pVCF_import.ipynb 
to convert a VCF to a Hail matrix table, and I can see that the matrix table exists with:
mt.describe() 
and I have successfully annotated the matrix table using Hail's annotation database. But I cannot figure out how to write the matrix table somewhere so I can resume using the same matrix table in a new Jupyter session. 
Things I have tried:

Following the instructions to create a database in DNAnexus, as described in the GitHub notebook above, and writing the matrix table to that. I can make a database as described, but the table does not get written there. I understand we can no longer write files directly to our project space on DNAnexus from JupyterLab?

Using mt.write to try and write the matrix table to the current working directory, as I understand it /opt/notebooks. I don't get an error but nothing happens.

Using mt.export to try and write the thing as a tsv file stored in my current working directory, again no error but no writing either.

I am currently using single participant exome files, to keep the executions short while I learn what to do, so I don't think it's a memory or compute problem. 

Any guidance would be much appreciated!

Thank you so much

 

Comments

  • Comment author
    Dr. Mc. Ninja

    My assistant says … (NOTE, she has the habit of optimistically making stuff up, so please let me know if this works or not) …


    You’re not doing anything “wrong” – this is mostly about how Hail + Spark are wired up on RAP. The big gotcha is:

    You cannot persist Hail Tables/MatrixTables to /opt/notebooks or the project FUSE mount and expect them to survive once the cluster shuts down.
    They must live in a DNAnexus database (dnax://…), or you must explicitly export them with hadoop fs + dx upload. (Hail Discussion)

    Here’s a concrete pattern you can drop into your notebook.

    1. Create (or reuse) a DNAnexus “database” for Hail

    Run this once per project (or just reuse an existing DB name):

    import pyspark
    import dxpy
    import hail as hl
    
    # 1. Start Spark (Hive support depends on the cluster configuration; this code does not enable it itself)
    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc)
    
    # 2. Create a database that lives in DNAnexus "dnax://" storage
    db_name = "hail_db"  # ⚠️ keep it lowercase: Hive stores database names in lowercase, so a mixed-case name may not match when you look it up later
    
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'")
    spark.sql(f"SHOW DATABASES LIKE '{db_name}'").show()
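
    The lowercase advice can be applied mechanically before the CREATE DATABASE statement runs. This is a minimal sketch, assuming a hypothetical helper called normalize_db_name (it is not part of Spark, Hail, or dxpy) and assuming that lowercase letters, digits, and underscores are the only characters worth allowing:

```python
import re

# Hypothetical helper for illustration; not part of Spark, Hail, or dxpy.
# Assumption: names made of lowercase letters, digits, and underscores are safe.
_VALID_DB_NAME = re.compile(r"[a-z][a-z0-9_]*")


def normalize_db_name(name: str) -> str:
    """Lowercase a candidate database name and reject risky characters."""
    candidate = name.strip().lower()
    if not _VALID_DB_NAME.fullmatch(candidate):
        raise ValueError(f"unsuitable database name: {name!r}")
    return candidate


db_name = normalize_db_name("Hail_DB")
print(db_name)
```

    Using the normalized value in the CREATE DATABASE statement and in the later dxpy lookup keeps both sides on the same string.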
    

    Optionally, confirm the DB object exists:

    db_obj = dxpy.find_one_data_object(name=db_name, classname="database")
    db_id = db_obj["id"]
    print("Database ID:", db_id)
    

    2. Initialise Hail pointing at that database

    Each time you start a new Jupyter session, run:

    import pyspark
    import dxpy
    import hail as hl
    
    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc)
    
    db_name = "hail_db"
    db_id = dxpy.find_one_data_object(name=db_name, classname="database")["id"]
    base = f"dnax://{db_id}"
    
    hl.init(sc=sc, tmp_dir=f"{base}/tmp/")
    

    Key points:

    • tmp_dir must be a dnax:// path, not /opt/notebooks or /mnt/project.
    • Using the same db_name (and hence db_id) across sessions is what lets you read/write the same MatrixTable later. (Hail Discussion)
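
    Because every later step builds paths from the same base, it can help to construct them in one place. The sketch below assumes a hypothetical helper called dnax_path (not a dxpy or Hail function) and assumes that database IDs begin with "database-", as in the dnax://database-xxxx/... example used in this answer:

```python
# Hypothetical helper for illustration; not part of dxpy or Hail.
# Assumption: DNAnexus database IDs start with "database-".
def dnax_path(db_id: str, *parts: str) -> str:
    """Join a database ID and path components into a dnax:// URI."""
    if not db_id.startswith("database-"):
        raise ValueError(f"expected a database ID, got {db_id!r}")
    # Drop empty components and stray slashes so the result has no "//" inside.
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "/".join([f"dnax://{db_id}", *cleaned])


# "database-xxxx" is a placeholder, not a real ID.
print(dnax_path("database-xxxx", "tmp"))
print(dnax_path("database-xxxx", "exome_single_sample_mt"))
```

    Passing the project ID or the database name by mistake raises an error immediately instead of producing a path that Hail fails on later.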

    3. Write your MatrixTable so you can re-use it

    Once you’ve imported and annotated:

    # mt is your MatrixTable
    mt_name = "exome_single_sample_mt"
    mt_path = f"{base}/{mt_name}"   # e.g. dnax://database-xxxx/exome_single_sample_mt
    
    mt.write(mt_path, overwrite=True)
    

    This writes a whole directory of files in Hail's native MatrixTable format into the DNAnexus database, not into your project file tree, so you won't see it in the project's file listing in the web UI.

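    To convince yourself it's there, you can list the path through the Hadoop filesystem API:

    # From Python: resolve the filesystem from the path itself so the dnax:// scheme is used
    hadoop_path = spark._jvm.org.apache.hadoop.fs.Path(mt_path)
    fs = hadoop_path.getFileSystem(spark._jsc.hadoopConfiguration())
    for status in fs.listStatus(hadoop_path):
        print(status.getPath().toString())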
    

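    (or more simply from a terminal cell, after looking up the database ID; the cut step drops a project-xxxx: prefix if dx prints one:)

    
    DB_ID=$(dx find data --name hail_db --class database --brief | cut -d: -f2)
    hadoop fs -ls "dnax://$DB_ID/exome_single_sample_mt"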
    

    4. Read the same MatrixTable in a new Jupyter session

    Next time you start a Spark/Hail notebook:

    import pyspark
    import dxpy
    import hail as hl
    
    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc)
    
    db_name = "hail_db"
    db_id = dxpy.find_one_data_object(name=db_name, classname="database")["id"]
    base = f"dnax://{db_id}"
    
    hl.init(sc=sc, tmp_dir=f"{base}/tmp/")
    
    mt_path = f"{base}/exome_single_sample_mt"
    mt = hl.read_matrix_table(mt_path)
    
    mt.describe()
    

    If the read fails with a file-not-found error, double-check:

    • Same db_name as when you wrote it.
    • Same mt_name string.
    • No accidental capital letters in the DB name. (community.ukbiobank.ac.uk)
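
    The checklist above can also be run as a quick comparison between the names used when writing and the names used when reading. This is only a sketch: diagnose_names is a hypothetical helper, and it compares the strings you typed, not anything stored in DNAnexus.

```python
# Hypothetical helper for illustration; it only compares strings you supply.
def diagnose_names(written: tuple, requested: tuple) -> list:
    """Compare (db_name, mt_name) pairs and describe any differences."""
    hints = []
    labels = ("database name", "table name")
    for label, w, r in zip(labels, written, requested):
        if w == r:
            continue
        if w.lower() == r.lower():
            hints.append(f"{label} differs only in letter case: {w!r} vs {r!r}")
        else:
            hints.append(f"{label} differs: {w!r} vs {r!r}")
    return hints


written = ("hail_db", "exome_single_sample_mt")
requested = ("Hail_DB", "exome_single_sample_mt")
for hint in diagnose_names(written, requested):
    print(hint)
```

    An empty list means the two sessions used identical names, so the cause lies elsewhere (for example, the write never completed).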

    5. Why /opt/notebooks and mt.export “do nothing”

    What’s happening with the things you tried:

    1. mt.write("/opt/notebooks/...")
      • Hail interprets this as a path on the cluster’s Hadoop filesystem, not your JupyterLab container’s local disk.
      • The workers write into an ephemeral HDFS location that disappears when the Spark cluster shuts down. That’s why you see no files in the Jupyter file browser.
    2. mt.export("file:///opt/notebooks/whatever.tsv")
      • Same issue: export goes to the cluster filesystem, not to the Jupyter container.
      • Again, it vanishes along with the cluster.
    3. “We can’t write directly into project space anymore?”
      • You can still write “normal” files to the project using dx upload or dxpy APIs.
      • But Hail Tables/MatrixTables must live in a DNAnexus database (dnax://), because they are many small files and DNAnexus’s project filesystem doesn’t support them efficiently. (Hail Discussion)
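
    The three cases above come down to the URI scheme of the output path. The sketch below encodes that explanation as a small check you could run before calling mt.write; describe_destination is a hypothetical helper, and the persistence labels simply restate the reasoning in this answer and have not been verified against RAP.

```python
from urllib.parse import urlparse


# Hypothetical helper for illustration; the labels restate the explanation
# above and are assumptions, not behaviour checked against the platform.
def describe_destination(path: str) -> tuple:
    """Return (persists_after_cluster, note) for an output path."""
    scheme = urlparse(path).scheme
    if scheme == "dnax":
        return True, "DNAnexus database storage"
    if scheme == "file":
        return False, "local disk on the cluster side"
    if scheme == "":
        return False, "resolved against the cluster's default filesystem"
    return False, f"unrecognised scheme {scheme!r}"


for candidate in (
    "dnax://database-xxxx/exome_single_sample_mt",  # placeholder ID
    "/opt/notebooks/exome_single_sample_mt",
    "file:///opt/notebooks/whatever.tsv",
):
    persists, note = describe_destination(candidate)
    print(candidate, persists, note)
```

    Refusing to write when the first element is False is a cheap guard against losing a long-running import to an ephemeral location.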

    6. (Optional) Export a TSV/VCF into project space

    If you want a plain TSV/VCF in your project:

    1. Export from Hail into the database filesystem:

      out_path = f"{base}/results/my_variants.tsv.bgz"
      mt.rows().select().export(out_path)
      
    2. Then move it into your project with hadoop fs + dx upload:

      
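      # --brief may print project-xxxx:database-xxxx; cut keeps only the database ID
      DB_ID=$(dx find data --name hail_db --class database --brief | cut -d: -f2)
      hadoop fs -cat "dnax://$DB_ID/results/my_variants.tsv.bgz" \
        | dx upload - --path /results/my_variants.tsv.bgz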
      

    The uploaded /results/my_variants.tsv.bgz will then appear in the project UI like any other file. (Hail Discussion)

    If you’d like, I can turn this into a tiny “minimal Hail template” notebook for UKB RAP that:

    1. Starts Spark/Hail correctly
    2. Creates the DB if needed
    3. Imports one exome pVCF
    4. Writes and re-reads a MatrixTable

    so you’ve always got a working reference.
