How to read the UKB exome data from Python?
I've tried to run the pipeline described in this notebook.
The problem is that it's super slow - not just when loading the data, but on every single operation. For example, each time I run mt.describe() it takes over an hour.
Is there a way to work interactively and have Spark/Hail load the data only when I actually need it (and only the specific variants that I need)? I want to get the genotypes only for a small genomic region of about a single gene, so it really shouldn't be that slow.
Here's the code I run:
from pyspark.sql import SparkSession
import hail as hl
import os
builder = (
SparkSession
.builder
.enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)
file_url = "file:///mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c22_b0_v1.vcf.gz"
mt = hl.import_vcf(file_url,
force_bgz=True,
reference_genome="GRCh38",
array_elements_required=False)
mt.describe()
Comments
6 comments
You could speed it up by increase the number of node and core in instance type in the analysis. It would cost more though.
What analysis you are looking to perform? There are quite a number of tutorial in Tools and Tutorials section. Those are non-interactive type of analysis, but common analyses like GWAS or filter for specific variant, it's actually quite efficient to do in non-interactive mode with file base operation.
Thank you for the suggestions!
I want to perform a rather specific custom analysis, not something that falls into an existing pipeline (such as GWAS). For a start, I'd like to extract the genotypes for all the variants within a specific gene into a numpy matrix (or a scipy sparse matrix if its more efficient). Do you have a suggestion how to do that efficiently?
I see. How about narrow down the pvcf file first using this https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=837 to figure out which file contain your regions of interest and use bcftools in swiss army knife to trim them to specific region. Then you can load variant interactively.
It doesn't exactly solve the issue you have, but should reduce the cost and operation time in Hail. The short non-interactive jobs can be easily managed with spot instance, and you will use on-demand only for interactive part.
I don't think it really solves the fundamental problem. In the end I'll want to run the same analysis genomewide. I can in principle write my own VCF parser that reads the file line by line and that would be a great deal faster, but I find it weird that the official VCF parser is so slow compared to some patchy script I could write on my own. Surely there's something I don't understand about how to use it properly..
I see. I got the wrong impression that you are only interested in subset of gene.
The other thing you might do is to check in spark UI if the clusters are doing the job they suppose to do. I share my experience in Hail troubleshooting here: https://community.dnanexus.com/s/question/0D5t000004AflSiCAJ/hail-troubleshooting-for-ukb-data
This is all I have for now. We can wait and see if other members have other tricks.
Hail is optimized for its native formats (Hail tables and matrix tables) rather than VCFs (https://hail.is/docs/0.2/methods/impex.html).
Have you tried writing the import result as a Hail matrix table and then working against it?
As an example,
# Import and write in the native MatrixTable format
hl.import_vcf(file_url, ...).write('example.mt')
# Read the native format (reusable)
mt = hl.read_matrix_table('example.mt')
# Query from the native format
mt.describe()
Please sign in to leave a comment.