index_bgen in RAP using imputation files
Hello!
I am following the instructions provided at https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/BGEN_import.ipynb to try to use UKB bgen files with Hail.
from pyspark.sql import SparkSession
import hail as hl
import os
builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)
# Generate hail index
bgen_path = "/mnt/project/Bulk/Imputation/UKB imputation from genotype"
file_url = f"file://{bgen_path}/ukb22828_c9_b0_v3.bgen"
hl.index_bgen(file_url, reference_genome="GRCh37")
When I run the code above, the job never ends. I don't know how long it should take, but it has been 7 hours now. When I run os.listdir(bgen_path), I see that the file "ukb22828_c9_b0_v3.bgen" is in there. What am I missing?
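One thing that may be worth ruling out: by default Hail writes the index next to the BGEN file, and if the mounted project directory is read-only (an assumption, not confirmed in this thread) that write cannot succeed. The sketch below builds an index_file_map that points the index at a separate location. The helper name build_index_file_map and the target "hdfs:///bgen_index" are hypothetical choices for illustration.

```python
import os

def build_index_file_map(bgen_dir, filenames, index_root):
    """Map each BGEN file URL to a separate location for its .idx2 index.

    index_root is given without a trailing slash. Hail expects index
    paths to end in .idx2.
    """
    index_file_map = {}
    for name in filenames:
        bgen_url = f"file://{os.path.join(bgen_dir, name)}"
        stem = os.path.splitext(name)[0]
        index_file_map[bgen_url] = f"{index_root}/{stem}.idx2"
    return index_file_map

bgen_path = "/mnt/project/Bulk/Imputation/UKB imputation from genotype"
# "hdfs:///bgen_index" is an assumed writable location, not taken from the thread.
index_file_map = build_index_file_map(
    bgen_path, ["ukb22828_c9_b0_v3.bgen"], "hdfs:///bgen_index"
)

# With Hail initialised as in the question, the call would then look like:
# hl.index_bgen(list(index_file_map),
#               index_file_map=index_file_map,
#               reference_genome="GRCh37")
```

The same map would also need to be passed to hl.import_bgen later, since the index is no longer stored beside the BGEN file.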
Comments
Hello, you can inspect the steps that Hail is taking in the Spark UI (https://documentation.dnanexus.com/developer/apps/developing-spark-apps#monitoring-the-spark-ui) and check whether the job is stuck. You may also consider changing the instance type to get more memory, but I would first check the Spark UI as mentioned above.
Hi! That worked, thank you very much!
However, I am still experiencing quite a few issues when trying to run index_bgen() (Hail) in JupyterLab.
First, 90% of the time I get the error "transport endpoint is not connected". I found the solution for this here: https://community.dnanexus.com/s/question/0D582000000LYT6CAO/dxfuse-mounted-filesystem-eventually-fails-with-transport-endpoint-is-not-connected. It "works", but after remounting with the commands provided in the link, the error usually reappears when I run index_bgen() again, and I have to remount. Is this normal?
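A small pre-flight check can make this failure visible before a long Hail job starts. The sketch below assumes the dropped mount surfaces in Python as an OSError with errno ENOTCONN, which is the error code behind the "transport endpoint is not connected" message. The helper name mount_is_alive is hypothetical, and the check only detects the problem; it does not remount anything.

```python
import errno
import os

def mount_is_alive(path):
    """Return True if the directory can be listed, False otherwise."""
    try:
        os.listdir(path)
        return True
    except OSError as exc:
        if exc.errno == errno.ENOTCONN:
            # This is the errno behind "transport endpoint is not connected".
            print(f"{path}: transport endpoint is not connected; remount needed")
        else:
            print(f"{path}: not readable ({exc.strerror})")
        return False

# Example use before indexing (path taken from the question):
# if mount_is_alive("/mnt/project/Bulk/Imputation/UKB imputation from genotype"):
#     hl.index_bgen(file_url, reference_genome="GRCh37")
```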
The second error is that my JupyterLab session gets completely disconnected. In fact, I see the instance as "Ended" even though I still had time left. Eventually, without me doing anything, the instance becomes "Ready" again, but when I click on it to open it, I get "502 Bad Gateway" (see attached image).
Any help is greatly appreciated.
For the second error, I first get:
"Server Connection Error: A connection to the Jupyter server could not be established. JupyterLab will continue trying to reconnect. Check your network connection or Jupyter server configuration."
while JupyterLab is in principle still running.
Sorry for all the messages. Another error that I often get when running index_bgen() is the following:
Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (ip-10-60-159-19.eu-west-2.compute.internal executor driver): java.lang.AssertionError: assertion failed
I am using mem1_hdd1_v2_x4 with 4 nodes.
Does the job eventually finish with a failed status? This still looks to me like a memory issue. Regarding the file size, have you first tried testing the indexing on a smaller file, e.g. just a subset of the BGEN?
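Subsetting a BGEN file needs an external tool, so a cheaper first test may be to index the smallest BGEN file in the directory instead. The sketch below lists the .bgen files in a directory from smallest to largest. The helper name bgen_files_by_size is hypothetical, and it assumes the directory can be listed with ordinary file calls, as os.listdir(bgen_path) was in the question.

```python
import os

def bgen_files_by_size(directory):
    """Return (size_in_bytes, filename) pairs for .bgen files, smallest first."""
    entries = []
    for name in os.listdir(directory):
        if name.endswith(".bgen"):
            full_path = os.path.join(directory, name)
            entries.append((os.path.getsize(full_path), name))
    return sorted(entries)

# Example use (path taken from the question):
# bgen_path = "/mnt/project/Bulk/Imputation/UKB imputation from genotype"
# size, smallest = bgen_files_by_size(bgen_path)[0]
# hl.index_bgen(f"file://{bgen_path}/{smallest}", reference_genome="GRCh37")
```

If the smallest file indexes quickly and the chromosome 9 file still fails, that would point towards resources rather than the code itself.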
For running Hail, I would use an instance type with a lot more memory, and SSD instead of HDD storage.
If the issue persists, feel free to forward it to the DNAnexus technical support team: ukbiobank-support@dnanexus.com. They will be able to check the particular job and its job log.