SAK (bcftools) errors when streaming VCFs in batch jobs
Hello,
I am running batch jobs which use SAK (bcftools) to filter and extract variants from the exome sequencing VCFs.
Some of the jobs complete cleanly, but the majority show unexpected errors in their logs, even though every job runs to completion with status "Done".
I've attached the batch script, the job script, and a text file with a few lines copied from the log of one job.
In a nutshell, the batch script submits ~300 jobs. The input for each job includes a text file listing the paths to 100 exome VCFs (e.g. "/mnt/project/Bulk/Exome sequences_Alternative exome processing/Exome variant call files (gnomAD) (VCFs)/ukb24068_c10_b1004_v1.vcf.gz").
On the worker, the job script iterates through each line (a file path) in that text file and passes the path to bcftools as input, so each VCF is streamed from /mnt/project/ while bcftools filters and reformats the variants of interest.
In order, the most common errors in the log look like:
- [E::bgzf_uncompress] Inflate operation failed: progress temporarily not possible, or in() / out() returned an error
- [E::bgzf_read_block] Invalid BGZF header at offset 1152366786
- [E::bgzf_uncompress] Inflate operation failed: progress temporarily not possible, or in() / out() returned an error
- [E::vcf_parse_format] Number of columns at chrX:16818593 does not match the number of samples (11166 vs 454671)
- Error: VCF parse error
I can't really account for the source of these errors, especially as some of the jobs finish without any errors. Could you advise? Looking online, these errors are typical of corrupted or incorrectly formatted data. This made me think that there may be a problem with streaming using /mnt/project/... Perhaps I should be changing my approach. I've put some thoughts below. Are any of these a potential solution?
- Download the VCFs to the worker (dx download, or naming as -iin inputs), rather than streaming using /mnt/project/
- Reduce, or stop entirely, parallelisation in the worker. (Currently the job script sends up to five bcftools commands to run in the background at once.)
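The second option can be sketched as below. This is a minimal Python sketch, not the attached job script: `run_all` is a hypothetical helper, and the commands passed to it are whatever bcftools invocations the job already uses. It caps how many commands run at once and collects any non-zero exit codes, so that a job could be made to fail visibly instead of finishing as "Done" with errors only in the log. It assumes the failing commands actually return a non-zero exit status.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_workers=2):
    """Run each command (a list of arguments) with at most max_workers at once.

    Returns a list of (command, returncode, stderr_text) for every command
    that exited with a non-zero status.
    """
    def run_one(cmd):
        # capture_output keeps each command's stderr separate, so error
        # messages from parallel commands are not interleaved in one log.
        proc = subprocess.run(cmd, capture_output=True, text=True)
        return cmd, proc.returncode, proc.stderr

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_one, commands))
    return [result for result in results if result[1] != 0]
```

Setting `max_workers=1` runs the commands strictly one after another, which is a simple way to test whether parallelism plays any role. If `run_all` returns a non-empty list, the calling script can exit with a non-zero status so the job is marked as failed.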
I'd be grateful for any suggestions!
Alex
Comments
We probably need the support team (ukbiobank-support@dnanexus.com) to look into it. Personally, I would check a few things first.
1) Check whether the error is deterministic. For example, do any of the failed jobs succeed when rerun?
2) Are any of these files corrupted or empty? An empty compressed VCF still has a small, non-zero size, so check whether each file contains any records or just the header.
I have also shared this with the support team, but we need them to help with the investigation, since we don't have access to your project.
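For check 2, a rough sketch in Python is shown here. `count_vcf_lines` is a hypothetical helper name; it relies only on the fact that a bgzipped VCF can be read with the standard `gzip` module, and that header lines start with `#`. It stops after the first few records, so it only answers "header only, or records too?" and will not detect damage deeper in the file.

```python
import gzip


def count_vcf_lines(path, max_records=1):
    """Return (header_lines, records_seen), stopping after max_records records."""
    header_lines = 0
    records = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                header_lines += 1
            elif line.strip():
                # First non-header, non-blank line: a variant record.
                records += 1
                if records >= max_records:
                    break
    return header_lines, records
```

A result such as `(n, 0)` would mean the file holds a header and no records; any exception raised by `gzip` while reading would itself be a sign that the start of the file cannot be decompressed.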
Hi Chai, thanks for the quick reply.
I've written to the support team as you suggested.
1) The errors are reproducible. I get identical error messages when I rerun the jobs with the batch script or run the commands interactively in JupyterLab.
2) None of the files are empty. All of the files seem to contain variant records, as well as the header.
So I am concerned that maybe some of the files are corrupted... But these files are the "raw data" which has been dispensed to my project by UKB, and I have not modified them in any way. Is there a good way to verify the integrity of the data? Are there any other steps I should take?
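One way to approach the integrity question, sketched in Python under stated assumptions: a bgzipped VCF is a series of BGZF blocks, each a gzip member with a `BC` extra subfield giving the block size, and a well-formed file ends with a fixed 28-byte empty block. The hypothetical helper `check_bgzf` below walks the blocks, inflates each one, and compares the stored CRC32 and length. It is a plain-Python illustration of the block layout, not a replacement for the checks htslib performs, and it will be slow on large files.

```python
import struct
import zlib

# The standard 28-byte empty BGZF block that marks end of file.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43"
    " 02 00 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def check_bgzf(path):
    """Return (blocks_ok, problem); problem is None if every block checks out."""
    n_blocks = 0
    last_block = b""
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.read(12)
            if not header:
                break  # reached end of file on a block boundary
            if len(header) < 12 or header[:4] != b"\x1f\x8b\x08\x04":
                return n_blocks, f"invalid BGZF header at offset {offset}"
            xlen = struct.unpack("<H", header[10:12])[0]
            extra = handle.read(xlen)
            # Look for the 'BC' subfield, which stores (block size - 1).
            bsize, pos = None, 0
            while pos + 4 <= len(extra):
                si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
                if (si1, si2, slen) == (66, 67, 2) and pos + 6 <= len(extra):
                    bsize = struct.unpack("<H", extra[pos + 4:pos + 6])[0]
                pos += 4 + slen
            if len(extra) < xlen or bsize is None:
                return n_blocks, f"missing block size at offset {offset}"
            rest_len = bsize + 1 - 12 - xlen
            rest = handle.read(max(rest_len, 0))
            if rest_len < 8 or len(rest) < rest_len:
                return n_blocks, f"truncated block at offset {offset}"
            crc, isize = struct.unpack("<II", rest[-8:])
            try:
                data = zlib.decompress(rest[:-8], -15)  # raw deflate
            except zlib.error as exc:
                return n_blocks, f"inflate failed at offset {offset}: {exc}"
            if zlib.crc32(data) != crc or len(data) != isize:
                return n_blocks, f"CRC or length mismatch at offset {offset}"
            last_block = header + extra + rest
            n_blocks += 1
    if last_block != BGZF_EOF:
        return n_blocks, "missing BGZF EOF marker (file may be truncated)"
    return n_blocks, None
```

Running the same check on a file read through /mnt/project/ and on a locally downloaded copy of that file could help separate a damaged file from a problem in how it is being read.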
Thanks again for your help.
Alex
I have found a work-around for this issue.
By downloading the data to the worker using dx download, rather than streaming from /mnt/project/, I am able to run these jobs successfully.
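The work-around can be sketched roughly as follows. This is an illustration in Python rather than the actual job script: `staged_commands` is a hypothetical helper, the filter expression is a placeholder, and it assumes that a path under /mnt/project/ corresponds to the same path inside the project, and that the project ID is available to the job (for example from the `DX_PROJECT_CONTEXT_ID` environment variable). It only builds the two commands, one `dx download` and one `bcftools view` on the local copy.

```python
import os

MOUNT_PREFIX = "/mnt/project"  # where the project is mounted on the worker


def staged_commands(mounted_path, workdir, include_expr, project_id):
    """Build (download_cmd, filter_cmd) argument lists for one VCF.

    mounted_path : path as listed in the input text file (under /mnt/project/)
    workdir      : local directory on the worker for the downloaded copy
    include_expr : bcftools filter expression (placeholder, supplied by caller)
    project_id   : ID of the project that holds the file
    """
    if not mounted_path.startswith(MOUNT_PREFIX + "/"):
        raise ValueError(f"not under {MOUNT_PREFIX}: {mounted_path}")
    # Assumption: stripping the mount prefix yields the in-project path.
    project_path = mounted_path[len(MOUNT_PREFIX):]
    name = os.path.basename(mounted_path)
    local_copy = os.path.join(workdir, name)
    output = os.path.join(workdir, "filtered." + name)
    download_cmd = ["dx", "download", f"{project_id}:{project_path}",
                    "-o", local_copy]
    filter_cmd = ["bcftools", "view", "-i", include_expr,
                  "-Oz", "-o", output, local_copy]
    return download_cmd, filter_cmd
```

Passing each command to `subprocess.run(cmd, check=True)` as an argument list avoids shell quoting problems with the spaces and parentheses in these folder names. Deleting the local copy after filtering keeps disk use on the worker bounded when a job processes 100 files.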
The support team are still looking into the cause of the original issue.
Alex
Thanks for sharing, Alex. I'm surprised that this helps, given that only some samples have issues. I will also report this to the product team in case there is a systematic limitation with dxfuse.