About I/O limitations on the cloud when analyzing multiple VCFs in parallel

Weichen Song

Hi,

I launched Swiss Army Knife on a mem2_ssd1_v2_x2 instance and used xargs to run bcftools on two WGS VCF files in parallel:

process_i() {
 i=$1
 bcftools query -i 'AC=1' -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%INFO[\t%GT]\n' "file:///mnt/project/Bulk/DRAGEN WGS/chr22/ukb24310_c22_b${i}_v1.vcf.gz" | awk ‘…’  > chr22.b${i}.singleton
 wc -l chr22.b${i}.singleton
}
export -f process_i
seq "$start" "$end" | xargs -P 2 -I {} bash -c 'process_i "$@"' _ {}

I found that the speed is almost the same as running in series (-P 1). According to the job log, memory is sufficient (roughly 60% in use) and CPU usage looks ideal (~95%). I therefore suspect that the bottleneck is I/O. Are there any settings that would speed up reading VCFs from the file system in parallel? Thanks for your help.
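To test the I/O hypothesis separately from bcftools and awk, a minimal sketch like the one below could time plain reads of the same kind of files at different levels of parallelism. The helper name time_reads is hypothetical (it is not part of any toolkit), and the sketch assumes GNU xargs and second-level timing from date, which is coarse but should be enough for multi-gigabyte files.

```shell
# Time how long it takes to read the given files with a given number of
# parallel readers; no bcftools or awk is involved, so only reading is measured.
# Usage: time_reads <parallel_jobs> <file>...
# NOTE: time_reads is a hypothetical helper written for this post.
time_reads() {
  local jobs=$1
  shift
  local t0 t1
  t0=$(date +%s)
  # NUL-delimit the paths so names containing spaces (e.g. "DRAGEN WGS") survive.
  printf '%s\0' "$@" | xargs -0 -n 1 -P "$jobs" cat > /dev/null
  t1=$(date +%s)
  echo "jobs=$jobs files=$# seconds=$((t1 - t0))"
}
```

For example, calling time_reads 1 on two chunks and then time_reads 2 on two different chunks (different so that any caching of the first pair does not distort the second run) would show whether two readers finish in roughly half the time. If they do not, that would support the idea that reading from the mounted project is the limit rather than CPU.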

 
