about I/O limitations on the cloud when analyzing multiple vcf in parallel

26 March 2025 01:17
1 comment

Hi,

I launched Swiss Army Knife on a mem2_ssd1_v2_x2 instance and used xargs to run bcftools on two WGS VCFs in parallel:

process_i() {
 i=$1
 bcftools query -i 'AC=1' -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%INFO[\t%GT]\n' "file:///mnt/project/Bulk/DRAGEN WGS/chr22/ukb24310_c22_b${i}_v1.vcf.gz" | awk ‘…’  > chr22.b${i}.singleton
 wc -l chr22.b${i}.singleton
}
export -f process_i
seq $start $end | xargs -n 1 -P 2 -I {} bash -c 'process_i "$@"' _ {}

and I found that the speed is almost the same as running in series (-P 1). From the job log, memory is sufficient (roughly 60% used) and CPU usage is ideal (~95%). I therefore suspect that the bottleneck is I/O. Are there any settings to improve the speed of reading VCFs from the file system in parallel? Thanks for your help.

Comments


Dr. Mc. Ninja
- 27 June 2025 13:59
bcftools is extremely CPU-light once the variant records are in RAM – the wall-clock time you’re now seeing almost certainly reflects how fast the VCFs can be delivered to the tool rather than how fast the tool can crunch them.
On a RAP Swiss-Army-Knife (SAK) job the files are read from UKB object-storage through dxfuse, the FUSE layer that lazily streams data a few MB at a time. Two independent bcftools processes therefore contend for the same dxfuse mount and the same per-job bandwidth limits, so running “-P 2” brings almost no benefit.
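One way to check this before changing anything is to time raw reads through the mount, with no bcftools involved. The sketch below is only an illustration: `time_reads` is a name I made up, and it assumes GNU `date` and `xargs` are available in the job.

```shell
#!/usr/bin/env bash
# Time how long it takes to read the given files with N concurrent readers.
# Usage: time_reads <n_parallel> <file>...
# time_reads is a hypothetical helper, not part of dx-toolkit or bcftools.
time_reads() {
    local jobs=$1
    shift
    local t0 t1
    t0=$(date +%s.%N)    # %N (nanoseconds) needs GNU date
    # cat only pulls bytes through the filesystem: no decompression, no parsing.
    printf '%s\0' "$@" | xargs -0 -n 1 -P "$jobs" sh -c 'cat "$1" > /dev/null' _
    t1=$(date +%s.%N)
    # Print the elapsed wall-clock seconds with two decimals.
    echo "$t1 $t0" | awk '{ printf "%.2f\n", $1 - $2 }'
}
```

Run `time_reads 1` on one pair of VCFs and `time_reads 2` on a different pair that has not been read yet in the session, since cached data would flatter the second run. If two readers take about as long as one, the mount or the per-job bandwidth is the limit rather than the CPU.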
Below are the levers that have proven to make the biggest difference:
A minimal pattern that scales well
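```
# 1. Parallel pre-fetch to local NVMe.
#    dx download takes a platform path (project-ID:/folder/file), not a file:// URL.
#    /mnt/workspace stands for any directory on the instance's local disk.
parallel -j 2 'dx download "$DX_PROJECT_CONTEXT_ID:/Bulk/DRAGEN WGS/chr22/ukb24310_c22_b{1}_v1.vcf.gz" \
               -o /mnt/workspace/b{1}.vcf.gz' ::: $(seq "$start" "$end")

# 2. Process (still in parallel) from the NVMe copy.
#    seq expands the whole batch range, as in the original xargs command.
parallel -j 2 '
    bcftools query -i "AC=1" -f "%CHROM\t%POS\t%REF\t%ALT\n" \
        /mnt/workspace/b{1}.vcf.gz > /mnt/workspace/b{1}.singletons &&
    wc -l /mnt/workspace/b{1}.singletons
' ::: $(seq "$start" "$end")
```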
With the files on the local disk you should see nearly linear scaling up to the number of cores on the instance (two on mem2_ssd1_v2_x2, eight on mem1_ssd2_x8, etc.).
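If GNU parallel is not installed in the image (I have not checked), the same fetch-then-process idea can be written with the xargs pattern from the question. This is a sketch under assumptions: `LOCAL_DIR` and `run_batches` are names I made up, the project-qualified path passed to `dx download` is my guess at how the file resolves inside a job, and the awk step from the question is left out because it was elided there.

```shell
#!/usr/bin/env bash
# Sketch: fetch each batch to local disk, query the local copy, then delete it.
# LOCAL_DIR is an illustrative name; point it at a directory on the instance's SSD.
LOCAL_DIR=${LOCAL_DIR:-$PWD/vcf_local}
export LOCAL_DIR

process_i() {
    local i=$1
    local vcf="$LOCAL_DIR/b${i}.vcf.gz"
    # Assumption: this project-qualified path resolves inside the job.
    dx download "${DX_PROJECT_CONTEXT_ID}:/Bulk/DRAGEN WGS/chr22/ukb24310_c22_b${i}_v1.vcf.gz" \
        -o "$vcf" --no-progress || return 1
    bcftools query -i 'AC=1' -f '%CHROM\t%POS\t%REF\t%ALT\n' "$vcf" \
        > "chr22.b${i}.singleton" || return 1
    rm -f "$vcf"    # free local disk before this worker takes its next batch
    wc -l "chr22.b${i}.singleton"
}
export -f process_i

# Run batches <first>..<last> with one worker per core reported by nproc.
run_batches() {
    mkdir -p "$LOCAL_DIR"
    seq "$1" "$2" | xargs -P "$(nproc)" -I {} bash -c 'process_i "$1"' _ {}
}
```

Calling `run_batches "$start" "$end"` keeps at most one local VCF per worker on disk at a time, which matters on the smaller instance types where the SSD is not large.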
Why not just spawn more SAKs?
Sometimes you only want a small tweak, but if you are planning to scan many chromosomes or do cohort-wide aggregation, think about switching from an interactive SAK session to:
- A WDL/CWL workflow compiled with dxCompiler so the scatter runs on independent jobs – each job then streams its own VCF and you can crank the scatter width as far as your project quota allows.
- Spark over the RAP dataset for operations that can be expressed as SQL or Glow functions (variant counts, QC metrics, etc.).
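For the plain "more SAKs" route, a submission loop along these lines may be enough before reaching for a workflow. Treat it as a sketch: `submit_batch` and `DRY_RUN` are names I made up, the job name is arbitrary, and it assumes Swiss Army Knife places the `-iin` file in the job's working directory so the command can refer to it by file name. The awk step from the question is omitted because it was elided there.

```shell
#!/usr/bin/env bash
# Sketch: one Swiss Army Knife job per batch, so each job reads through its own mount.
# DRY_RUN is an illustrative switch: by default the dx command is printed, not run.
DRY_RUN=${DRY_RUN:-1}

submit_batch() {
    local i=$1
    local name="ukb24310_c22_b${i}_v1.vcf.gz"
    local vcf="/Bulk/DRAGEN WGS/chr22/${name}"
    # Command executed inside the job; \t and \n stay literal for bcftools to interpret.
    local cmd="bcftools query -i 'AC=1' -f '%CHROM\t%POS\t%REF\t%ALT\n' ${name} > chr22.b${i}.singleton"
    local args=(run swiss-army-knife
        -iin="$vcf"
        -icmd="$cmd"
        --instance-type mem2_ssd1_v2_x2
        --name "singletons_chr22_b${i}"
        -y --brief)
    if [ "$DRY_RUN" = 1 ]; then
        # Show the exact command with shell quoting so it can be inspected first.
        printf 'dx'
        printf ' %q' "${args[@]}"
        printf '\n'
    else
        dx "${args[@]}"
    fi
}
```

Run `for i in $(seq "$start" "$end"); do submit_batch "$i"; done` from a terminal with the project selected, check the printed commands, then repeat with `DRY_RUN=0` to submit for real.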
Take-away
Two bcftools processes reading through the same dxfuse mount are still bottlenecked by a single object-storage stream. Move the hot data onto the node’s NVMe (or give the job a faster NVMe/network pipe), and the parallelism you’re already using will translate directly into speed-ups.