About I/O limitations on the cloud when analyzing multiple VCFs in parallel
Hi,
I launched Swiss Army Knife on a mem2_ssd1_v2_x2 instance and used xargs to run bcftools on two WGS VCFs in parallel:
process_i() {
  i=$1
  bcftools query -i 'AC=1' -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%INFO[\t%GT]\n' "file:///mnt/project/Bulk/DRAGEN WGS/chr22/ukb24310_c22_b${i}_v1.vcf.gz" | awk '…' > chr22.b${i}.singleton
  wc -l chr22.b${i}.singleton
}
export -f process_i
seq $start $end | xargs -n 1 -P 2 -I {} bash -c 'process_i "$@"' _ {}
and I found that the speed is almost the same as when running in series (-P 1). From the job log, memory is sufficient (roughly 60% used) and CPU usage looks ideal (~95%). I therefore suspect that the bottleneck is I/O.
Are there any settings to improve the speed of reading VCFs from the file system in parallel? Thanks for your help.
Comments
1 comment
bcftools is extremely CPU-light once the variant records are in RAM. The wall-clock time you're now seeing almost certainly reflects how fast the VCFs can be delivered to the tool rather than how fast the tool can crunch them.
On a RAP Swiss Army Knife (SAK) job the files are read from UKB object storage through dxfuse, the FUSE layer that lazily streams data a few MB at a time. Two independent bcftools processes therefore contend for the same dxfuse mount and the same per-job bandwidth limits, so running "-P 2" brings almost no benefit.
Below are the levers that have proven to make the biggest difference:
A minimal pattern that scales well
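As a rough sketch of that pattern (not a tested recipe): download each VCF block to the instance's local disk first, then point bcftools at the local copy. This assumes the dx CLI is available inside the job and that DX_PROJECT_CONTEXT_ID holds the project ID. WORK_DIR, REMOTE_DIR, stage_and_query and run_blocks are illustrative names, and the awk step from the question is left out.

```shell
#!/usr/bin/env bash
# Sketch only: stage each VCF on local disk, then run bcftools on the copy.
# Assumptions (not from the original post): the dx CLI is available in the
# job, DX_PROJECT_CONTEXT_ID holds the project ID, and WORK_DIR, REMOTE_DIR,
# stage_and_query and run_blocks are illustrative names.

WORK_DIR="${WORK_DIR:-$PWD/local_vcf}"
REMOTE_DIR="${DX_PROJECT_CONTEXT_ID:-}:/Bulk/DRAGEN WGS/chr22"
export WORK_DIR REMOTE_DIR

stage_and_query() {
  local i=$1
  local name="ukb24310_c22_b${i}_v1.vcf.gz"
  mkdir -p "$WORK_DIR"
  # One bulk download to local disk instead of streaming through dxfuse.
  dx download "$REMOTE_DIR/$name" -o "$WORK_DIR/$name" || return 1
  # Same query as in the question; the awk post-processing is omitted here.
  bcftools query -i 'AC=1' \
    -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%INFO[\t%GT]\n' \
    "$WORK_DIR/$name" > "chr22.b${i}.singleton"
  rm -f "$WORK_DIR/$name"   # free local disk before the next block
  wc -l "chr22.b${i}.singleton"
}
export -f stage_and_query

# Process blocks start..end with up to `jobs` workers in parallel.
run_blocks() {
  local start=$1 end=$2 jobs=${3:-2}
  seq "$start" "$end" |
    xargs -P "$jobs" -I {} bash -c 'stage_and_query "$1"' _ {}
}
```

You would call it as run_blocks "$start" "$end" 2. Deleting each local copy after the query keeps disk usage bounded to roughly one file per worker, which matters on the smaller ssd1 instances.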
With the files on the local disk you should see nearly linear scaling up to the number of cores you have (two on mem2_ssd1_v2_x2, eight on mem1_ssd2_x8, etc.).
Why not just spawn more SAKs?
Sometimes you only want a small tweak, but if you are planning to scan many chromosomes or do cohort-wide aggregation, think about switching from an interactive SAK session to:
Take-away
Two bcftools processes reading through the same dxfuse mount are still bottlenecked by a single object-storage stream. Move the hot data onto the node's NVMe (or give the job a faster NVMe/network pipe), and the parallelism you're already using will translate directly into speed-ups.