index_bgen in RAP using imputation files

10 November 2023 00:00
7 comments

Hello!

I am following the instructions provided at https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/BGEN_import.ipynb to try to use UKB bgen files with Hail.

from pyspark.sql import SparkSession
import hail as hl
import os

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

# Generate hail index
bgen_path = "/mnt/project/Bulk/Imputation/UKB imputation from genotype"
file_url = f"file://{bgen_path}/ukb22828_c9_b0_v3.bgen"
hl.index_bgen(file_url, reference_genome="GRCh37")

When I run the code above, the job never ends. I don't know how long it should take, but it has been 7 hours now. When I run os.listdir(bgen_path) I see that the file "ukb22828_c9_b0_v3.bgen" is in there. What am I missing?
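
One detail that may be worth checking, offered as an assumption rather than a confirmed diagnosis: by default hl.index_bgen writes the index next to the BGEN file (path + ".idx2"), and /mnt/project is typically a read-only mount. Passing index_file_map to put the index somewhere writable might help. The sketch below builds such a map. The helper names and the "hdfs:///bgen_index" target are hypothetical, and the Hail call is kept inside a function so nothing heavy runs until it is called.

```python
import os


def build_index_file_map(bgen_dir, filenames, index_dir):
    """Map each BGEN file URL to an index path in a writable location.

    Hail requires custom index paths to end in ".idx2".
    index_dir is expected without a trailing slash.
    """
    index_file_map = {}
    for name in filenames:
        stem, _ = os.path.splitext(name)
        bgen_url = f"file://{bgen_dir}/{name}"
        index_file_map[bgen_url] = f"{index_dir}/{stem}.idx2"
    return index_file_map


def index_with_custom_location(index_file_map, reference_genome="GRCh37"):
    """Index every BGEN in the map; needs an initialised Hail session."""
    import hail as hl  # imported lazily so the helper above works without Hail

    hl.index_bgen(
        list(index_file_map.keys()),
        index_file_map=index_file_map,
        reference_genome=reference_genome,
    )


index_map = build_index_file_map(
    "/mnt/project/Bulk/Imputation/UKB imputation from genotype",
    ["ukb22828_c9_b0_v3.bgen"],
    "hdfs:///bgen_index",  # hypothetical writable location, an assumption
)
# index_with_custom_location(index_map)  # run this inside the Hail session
```

If the index is written to a custom location like this, the same index_file_map would also need to be passed to hl.import_bgen later so that Hail can find it.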

Comments

Ondrej Klempir DNAnexus Team
- 10 November 2023 16:12
Hello, you can try to inspect all the steps that Hail is taking: https://documentation.dnanexus.com/developer/apps/developing-spark-apps#monitoring-the-spark-ui and monitor whether the job is stuck. You may also consider changing the instance type to possibly get more memory, but I would first check the Spark UI as mentioned above.

Former User of DNAx Community_13
- 11 November 2023 12:59
Hi! That worked, thank you very much!

However, I am still experiencing quite a few issues when trying to run index_bgen() (Hail) using JupyterLab.

First, 90% of the time I get the error "transport endpoint is not connected". I found the solution for this here: https://community.dnanexus.com/s/question/0D582000000LYT6CAO/dxfuse-mounted-filesystem-eventually-fails-with-transport-endpoint-is-not-connected. It "works", but after fixing the mount using the commands provided in the link, the error reappears most of the time when I run index_bgen() again, and I have to remount. Is this normal?

The second error is that my JupyterLab gets completely disconnected. In fact, I see the instance as "Ended" even though I still had time left. Eventually, without me doing anything, the instance becomes "Ready" again, but when I click on it to open it, I get "502 Bad Gateway" (see attached image). Any help is greatly appreciated.
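
For the first error, a small guard before each index_bgen() call can at least make a dropped mount explicit instead of letting it surface deep inside a Spark task. This is a minimal sketch using only the Python standard library. mount_is_readable is a hypothetical helper name, and it only detects the problem; it does not remount anything.

```python
import errno
import os


def mount_is_readable(path):
    """Return True if the directory can be listed, False on any OS-level error.

    A dropped FUSE mount usually surfaces as OSError with errno ENOTCONN
    ("Transport endpoint is not connected").
    """
    try:
        os.listdir(path)
        return True
    except OSError as exc:
        if exc.errno == errno.ENOTCONN:
            print(f"{path}: transport endpoint is not connected; remount needed")
        else:
            print(f"{path}: not readable ({exc.strerror})")
        return False
```

Calling mount_is_readable(bgen_path) right before indexing would show whether the mount was already gone when the job started.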

Former User of DNAx Community_13
- 11 November 2023 13:25
For the second error, I first get:
"Server Connection Error: A connection to the Jupyter server could not be established. JupyterLab will continue trying to reconnect. Check your network connection or Jupyter server configuration."
even though JupyterLab is, in principle, still running.

Former User of DNAx Community_13
- 12 November 2023 18:11
Sorry for all the messages. Another error that I often get when running index_bgen() is the following:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (ip-10-60-159-19.eu-west-2.compute.internal executor driver): java.lang.AssertionError: assertion failed

I am using mem1_hdd1_v2_v4 with 4 nodes.

Ondrej Klempir DNAnexus Team
- 13 November 2023 12:42
Does the job eventually finish with a failed status? It still looks like a memory issue to me. Regarding the file size, have you first tried testing the indexing on a smaller file, e.g. just a subset of the BGEN?
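
One way to act on the "smaller file" suggestion is to list the BGEN files by size and try the smallest one first. A minimal sketch, assuming the directory is mounted and readable; bgen_files_by_size is a hypothetical helper name, not part of Hail or the DNAnexus tooling.

```python
import os


def bgen_files_by_size(directory):
    """Return (size_in_bytes, filename) pairs for .bgen files, smallest first."""
    sizes = []
    for name in os.listdir(directory):
        if name.endswith(".bgen"):
            full_path = os.path.join(directory, name)
            sizes.append((os.path.getsize(full_path), name))
    return sorted(sizes)
```

The first entry of bgen_files_by_size(bgen_path) would then be a natural candidate for a trial run of index_bgen().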

Ondrej Klempir DNAnexus Team
- 13 November 2023 12:44
For running Hail, I would use an instance type with a lot more memory, and with SSD instead of HDD storage.

Ondrej Klempir DNAnexus Team
- 13 November 2023 12:45
If the issue persists, feel free to forward it to the DNAnexus technical support team: ukbiobank-support@dnanexus.com. They will be able to check the particular job and its job log.
