I/O limitations on the cloud when analyzing multiple VCF files in parallel
Hi,
I launched Swiss Army Knife on a mem2_ssd1_v2_x2 instance and used xargs to run bcftools on two WGS VCF files in parallel:
process_i() {
i=$1
bcftools query -i 'AC=1' -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%INFO[\t%GT]\n' "file:///mnt/project/Bulk/DRAGEN WGS/chr22/ukb24310_c22_b${i}_v1.vcf.gz" | awk '…' > chr22.b${i}.singleton
wc -l chr22.b${i}.singleton
}
export -f process_i
seq "$start" "$end" | xargs -P 2 -I {} bash -c 'process_i "$@"' _ {}
I found that the speed is almost the same as running in series (-P 1). According to the job log, memory is sufficient (roughly 60% in use) and CPU usage looks ideal (~95%). I therefore suspect that the bottleneck is I/O. Are there any settings that would improve the speed of reading VCF files from the file system in parallel? Thanks for your help.
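To check whether reading is really the limiting step, one option is to time raw reads of the same files with no bcftools or awk in the pipeline. The following is only a rough sketch: the function names read_all and time_read are made up for illustration, it assumes an xargs that supports -0 and -P, and the timing has one-second resolution.

```shell
#!/usr/bin/env bash
# Sketch: time plain reads of a set of files with a chosen number of
# parallel readers, so decompression and filtering cost is excluded.
# read_all and time_read are hypothetical helper names.

# read_all JOBS FILE...: read each file to /dev/null, JOBS readers at a time.
read_all() {
  local jobs=$1
  shift
  # NUL-separated names keep paths with spaces (e.g. "DRAGEN WGS") intact.
  printf '%s\0' "$@" | xargs -0 -n 1 -P "$jobs" sh -c 'cat "$1" > /dev/null' _
}

# time_read JOBS FILE...: print the elapsed wall-clock seconds of read_all.
time_read() {
  local start end
  start=$(date +%s)
  read_all "$@"
  end=$(date +%s)
  echo $(( end - start ))
}
```

For example, time_read 1 on two blocks followed by time_read 2 on two different blocks of similar size (different files, so that cached data does not flatter the second run). If the parallel run takes about as long as the serial one, that would be consistent with the reads themselves not scaling; if it is clearly faster, the bottleneck is more likely elsewhere in the pipeline.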