About I/O limitations on the cloud when analyzing multiple VCFs in parallel

Weichen Song

Hi,

I launched Swiss Army Knife on a mem2_ssd1_v2_x2 instance and used xargs to run bcftools on two WGS VCFs in parallel:

process_i() {
 i=$1
 bcftools query -i 'AC=1' -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%INFO[\t%GT]\n' "file:///mnt/project/Bulk/DRAGEN WGS/chr22/ukb24310_c22_b${i}_v1.vcf.gz" | awk ‘…’  > chr22.b${i}.singleton
 wc -l chr22.b${i}.singleton
}
export -f process_i
seq $start $end | xargs -n 1 -P 2 -I {} bash -c 'process_i "$@"' _ {}

and I found that the speed is almost the same as running in series (-P 1). According to the job log, memory is sufficient (roughly 60% used) and CPU usage looks ideal (~95%). I therefore suspect that the bottleneck is I/O. Are there any settings that improve the speed of reading VCFs from the file system in parallel? Thanks for your help.

 

Comments

1 comment

  • Comment author
    Dr. Mc. Ninja

    bcftools is extremely CPU-light once the variant records are in RAM – the wall-clock time you’re now seeing almost certainly reflects how fast the VCFs can be delivered to the tool rather than how fast the tool can crunch them.
    On a RAP Swiss-Army-Knife (SAK) job the files are read from UKB object-storage through dxfuse, the FUSE layer that lazily streams data a few MB at a time. Two independent bcftools processes therefore contend for the same dxfuse mount and the same per-job bandwidth limits, so running “-P 2” brings almost no benefit.
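
    One way to check whether the mount, rather than bcftools, is the limit is to time a plain read of a VCF and compare one reader against two concurrent readers. The helper below is only a sketch: read_throughput is a hypothetical name, and the sub-second timestamps assume GNU date (the default on Linux job instances).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any DNAnexus tool): read a file once,
# discard the bytes, and print the observed throughput in MB/s.
read_throughput() {
    local path=$1
    local bytes start end
    bytes=$(wc -c < "$path")      # file size in bytes
    start=$(date +%s.%N)          # assumes GNU date for sub-second resolution
    cat "$path" > /dev/null       # sequential read, output discarded
    end=$(date +%s.%N)
    awk -v b="$bytes" -v s="$start" -v e="$end" 'BEGIN {
        d = e - s
        if (d <= 0) d = 0.000001  # guard against a zero or negative interval
        printf "%.1f MB/s\n", b / d / 1e6
    }'
}

# Usage sketch, with paths taken from the question:
#   read_throughput "/mnt/project/Bulk/DRAGEN WGS/chr22/ukb24310_c22_b${i}_v1.vcf.gz"
# Run it once alone, then for two different batches at the same time
# (each call backgrounded with &, followed by wait), and compare the rates.
```

    If two concurrent reads of different batches each report about half the single-reader rate, the shared mount is the likely bottleneck. Read each file only once per test, since a repeat read may be served from the page cache.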

    The lever that has proven to make the biggest difference is copying the VCFs to the instance's local disk before processing them:

    A minimal pattern that scales well

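    # 1. Parallel pre-fetch to local NVMe.
    #    dx download expects a platform path (project-ID:/folder/file), not the
    #    dxfuse mount path; DX_PROJECT_CONTEXT_ID is set inside a running job.
    #    /mnt/workspace is the target used here; any local directory will do.
    mkdir -p /mnt/workspace
    parallel -j 2 'dx download "$DX_PROJECT_CONTEXT_ID:/Bulk/DRAGEN WGS/chr22/ukb24310_c22_b{1}_v1.vcf.gz" \
                   -o /mnt/workspace/b{1}.vcf.gz' ::: $(seq "$start" "$end")
    
    # 2. Process (still in parallel) from the NVMe copy.
    #    $(seq ...) expands to every batch number, not just the two endpoints.
    parallel -j 2 '
        bcftools query -i "AC=1" -f "%CHROM\t%POS\t%REF\t%ALT\n" \
            /mnt/workspace/b{1}.vcf.gz > /mnt/workspace/b{1}.singletons &&
        wc -l /mnt/workspace/b{1}.singletons
    ' ::: $(seq "$start" "$end")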
    

    With the files on the local disk you should see nearly linear scaling up to the number of cores on the instance (two on mem2_ssd1_v2_x2, eight on mem1_ssd2_x8, etc.).
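
    Because the useful degree of parallelism depends on the instance, the worker count can be derived at run time instead of hard-coded. A minimal sketch, assuming a Linux instance; pick_jobs is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: cap the number of parallel workers at the number of
# cores the instance reports, or at the number of tasks if that is smaller.
pick_jobs() {
    local n_tasks=$1
    local cores
    cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
    if [ "$n_tasks" -lt "$cores" ]; then
        echo "$n_tasks"
    else
        echo "$cores"
    fi
}

# Usage sketch; process_i stands for the exported per-batch function
# from the question, pointed at the local copies:
#   jobs=$(pick_jobs $(( end - start + 1 )))
#   seq "$start" "$end" | xargs -P "$jobs" -I {} bash -c 'process_i "$@"' _ {}
```

    Each batch in the question runs a bcftools-plus-awk pipeline, so a single task can already occupy more than one core; treat the computed value as an upper bound rather than a target.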

    Why not just spawn more SAKs?

    Sometimes you only want a small tweak, but if you are planning to scan many chromosomes or do cohort-wide aggregation, think about switching from an interactive SAK session to:

    • A WDL/CWL workflow compiled with dxCompiler so the scatter runs on independent jobs – each job then streams its own VCF and you can crank the scatter width as far as your project quota allows.
    • Spark over the RAP dataset for operations that can be expressed as SQL or Glow functions (variant counts, QC metrics, etc.).
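
    Short of a full WDL/CWL workflow, the same one-job-per-batch fan-out can be approximated with a plain loop over dx run. This is a different, simpler technique than the dxCompiler scatter described above. The sketch below only prints the commands (a dry run); emit_sak_jobs is a hypothetical name, and the instance type and bcftools command are carried over from this thread as assumptions to adjust.

```shell
#!/usr/bin/env bash
# Sketch: print one `dx run` command per batch. Nothing is submitted;
# pipe the output to bash once the commands look right.
# Assumptions: app name, instance type, and file layout mirror this thread.
emit_sak_jobs() {
    local start=$1 end=$2
    local i vcf cmd
    for i in $(seq "$start" "$end"); do
        vcf="ukb24310_c22_b${i}_v1.vcf.gz"
        # Files passed with -iin are staged into the job's working directory,
        # so the command refers to the bare file name.
        cmd="bcftools query -i 'AC=1' -f '%CHROM\t%POS\t%REF\t%ALT\n' ${vcf} > chr22.b${i}.singleton"
        # %q shell-quotes each argument so the printed line can be re-read by bash.
        printf 'dx run swiss-army-knife -iin=%q -icmd=%q --instance-type mem2_ssd1_v2_x2 -y --brief\n' \
            "/Bulk/DRAGEN WGS/chr22/${vcf}" "$cmd"
    done
}

# Usage sketch:
#   emit_sak_jobs "$start" "$end"          # inspect the commands
#   emit_sak_jobs "$start" "$end" | bash   # submit them
```

    Each job then gets its own instance and its own copy of the input, so the batches no longer share one mount or one job's bandwidth.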

    Take-away

    Two bcftools processes reading through the same dxfuse mount are still bottlenecked by a single object-storage stream. Move the hot data onto the node’s NVMe (or give the job a faster NVMe/network pipe), and the parallelism you’re already using will translate directly into speed-ups.
