Running HAIL-based SampleQC Script in JupyterLab

18 July 2023 00:00
1 comment

I am attempting to run a HAIL-based WES Sample QC script that our group has developed in-house on the nearly 500k WES pVCF.

I have uploaded the necessary files for this script (the reference VCF, the LCR interval list, the hail_sample_qc.py script, and the coding interval file) into the permanent storage in our project.

I am starting small by running the script only on the pVCF files for Chr21, as it is the smallest chromosome in our analysis.

I create a JupyterLab Spark cluster environment and, in the terminal, download all of the files from permanent storage into the temporary local storage:

dx download hail_sample_qc.py

dx download LCR-hg38-noHLA.interval_list

dx download with_chr_noMT_dbsnp144.b38.vcf.gz

dx download xgen_plus_spikein.GRCh38.bed

When I try running the script in the terminal:

python hail_sample_qc.py LCR-hg38-noHLA.interval_list 'test' with_chr_noMT_dbsnp144.b38.vcf.gz 'file:///mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c21_b*_v1.vcf.gz' --coding-intervals xgen_plus_spikein.GRCh38.bed

I get the following error:

WARN: 'with_chr_noMT_dbsnp144.b38.vcf.gz' refers to no files

ERROR: HailException: arguments refer to no files

From is.hail.utils.HailException: arguments refer to no files
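One possible explanation for the "refers to no files" warning, offered as an assumption rather than a confirmed diagnosis: on a Spark cluster, Hail may resolve a bare relative path such as with_chr_noMT_dbsnp144.b38.vcf.gz against the cluster's default filesystem (often HDFS) instead of the terminal's working directory, so a file fetched with dx download is not found. The pVCF argument in the command avoids this by using an explicit file:// URI under /mnt/project. The sketch below builds the same kind of URI for the other inputs. The helper name project_uri is hypothetical, and so is the assumption that the files sit at the root of the project; adjust the folder to wherever they were uploaded.

```python
import posixpath

# Assumption: project storage is mounted at /mnt/project on the cluster,
# as the pVCF path in the command above suggests.
PROJECT_MOUNT = "/mnt/project"


def project_uri(name, folder=""):
    """Build an explicit file:// URI for a file in project storage.

    `folder` is the (hypothetical) project folder holding the file;
    the default assumes the project root.
    """
    path = posixpath.join(PROJECT_MOUNT, folder.strip("/"), name)
    return "file://" + path


# Print the URI form of each input named in the question.
for name in (
    "LCR-hg38-noHLA.interval_list",
    "with_chr_noMT_dbsnp144.b38.vcf.gz",
    "xgen_plus_spikein.GRCh38.bed",
):
    print(project_uri(name))
```

Whether each argument accepts a URI depends on how hail_sample_qc.py reads it, so this is something to try rather than a guaranteed fix.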

When I try running the script in a Python Jupyter Notebook:

!python hail_sample_qc.py LCR-hg38-noHLA.interval_list 'test' with_chr_noMT_dbsnp144.b38.vcf.gz 'file:///mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c21_b*_v1.vcf.gz' --coding-intervals xgen_plus_spikein.GRCh38.bed

I get the following error:

TaskSchedulerImpl: WARN: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Does anyone have suggestions for the best route to perform this step? I am fairly new to bioinformatics as a whole, and would greatly appreciate any help.

Comments


Former User of DNAx Community_6
- 18 July 2023 18:05
I believe the Spark-enabled Jupyter notebook already has Hail installed. Have you tried that? In a Jupyter notebook, you don't have to run it as a Linux script; just copy and paste the script into the notebook cells. Check hail.is for more info; I believe they have a tutorial on running Hail in a notebook.

Also try this `python hail_sample_qc.py LCR-hg38-noHLA.interval_list 'test' with_chr_noMT_dbsnp144.b38.vcf.gz '/mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c21_b*_v1.vcf.gz' --coding-intervals xgen_plus_spikein.GRCh38.bed`

I would also like to ask what steps you are trying to perform within the code.
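As a rough sketch of what running Hail directly in the notebook could look like, the cell below attaches Hail to the notebook's Spark context and reads the chromosome 21 pVCF shards, instead of launching a second Python process with !python (which may end up competing with the notebook kernel for executors). The function name load_chr21_pvcf is made up for illustration, and force_bgz=True and reference_genome="GRCh38" are assumptions about how these files should be read, not settings taken from hail_sample_qc.py.

```python
# Path copied from the command in the question.
CHR21_PVCF = (
    "file:///mnt/project/Bulk/Exome sequences/"
    "Population level exome OQFE variants, pVCF format - final release/"
    "ukb23157_c21_b*_v1.vcf.gz"
)


def load_chr21_pvcf(path=CHR21_PVCF):
    """Start Hail on the notebook's Spark context and import the pVCF shards."""
    # Imported inside the function so the cell can be defined anywhere;
    # the imports only run when the function is called on the cluster.
    import pyspark
    import hail as hl

    sc = pyspark.SparkContext.getOrCreate()  # reuse an existing context if any
    hl.init(sc=sc)
    # Assumptions: the .vcf.gz shards are block-gzipped and aligned to GRCh38.
    return hl.import_vcf(path, force_bgz=True, reference_genome="GRCh38")
```

Calling mt = load_chr21_pvcf() and then mt.describe() would print the schema of the resulting MatrixTable, which is a cheap way to check that the files were found before running any QC steps.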
