How to speed up sample subsetting from UK Biobank exome pVCF files?
I’m working with the UKB exome release (pVCF format) on DNAnexus RAP. Each pVCF file contains around 450k samples, and I need to extract a subset of about 60k samples based on a list of EIDs.
I’ve been using bcftools view -S keep.ids -Oz -o subset.vcf.gz with 8 threads on a mem1_ssd1_v2_x16 instance, but subsetting a single file can take over 30 minutes. With 80+ files per chromosome, that adds up to roughly 40 hours or more per chromosome if the files are processed one after another.
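For reference, the per-file step looks roughly like the sketch below. The function name subset_one, the input file name, and the DRY_RUN switch are placeholders added for illustration, and passing the 8 threads as --threads 8 is an assumption about how the thread count is set.

```shell
#!/usr/bin/env bash
# Minimal sketch of the per-file subsetting step described above.
# Assumptions (not from the original command): the function name, the
# DRY_RUN switch, and passing the thread count as --threads 8.

subset_one() {
    local in_vcf="$1"    # one pVCF chunk (placeholder file name)
    local keep="$2"      # sample list, one ID per line
    local out_vcf="$3"   # bgzipped output VCF
    # When DRY_RUN is set and non-empty, print the command instead of running it.
    ${DRY_RUN:+echo} bcftools view -S "$keep" --threads 8 -Oz -o "$out_vcf" "$in_vcf"
}
```

Calling subset_one input.vcf.gz keep.ids subset.vcf.gz runs the same command as above on one file; setting DRY_RUN=1 first only prints it, which is handy for checking the arguments before launching a job.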
Could anyone share tips or best practices to make the sample subsetting faster on DNAnexus?
Any advice or examples of efficient pipelines would be greatly appreciated!
Thanks!