Is there a way to determine which instance should be used for a particular application?

Permanently deleted user

24 March 2023 00:00
1 comment

I'm trying to run some QC on the pVCFs, which involves creating intermediate files after left normalization. So far I'm using WDL to run the analyses in batches of 100 VCFs (not sure if this is ideal). What would be the best way to calculate how much memory and storage I need? One approach I was considering is to run a single file, monitor peak memory (if that is doable), and check storage once the task is done, then scale the storage by the number of files. I'm not sure how to scale memory usage, though. I would appreciate any guidelines from anyone who has already done analyses like these. Thanks a lot.

Comments

Chai Fungtammasan DNAnexus Team
- 25 March 2023 02:24
This is a really good question, and a critical one for efficient cloud computing. I can share my tips, but I would love to hear what other people in the community think. I hope you don't mind a super long post.

I usually start with a model or an educated guess, but I always confirm it with an experiment. Starting from one file, as you mention, is an excellent approach.
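
As a rough sketch of that one-file experiment, the snippet below runs a command and reports the peak memory of its child process. It assumes a Unix-like system, because it relies on Python's standard `resource` module. The name `run_and_measure` is made up for illustration, and note that `ru_maxrss` is reported in kibibytes on Linux but in bytes on macOS.

```python
import resource
import subprocess


def run_and_measure(cmd):
    """Run cmd to completion and return (exit_code, peak_rss).

    peak_rss is the largest resident set size among child processes
    that this Python process has waited for so far. The unit is
    kibibytes on Linux and bytes on macOS.
    """
    proc = subprocess.run(cmd, check=False)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return proc.returncode, usage.ru_maxrss
```

Pass your own single-file command as a list of arguments, then compare the reported peak with the memory of the instance you plan to use. Measuring disk usage of the working directory after the run gives the matching storage number.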

Storage is usually the easiest to model. If I don't know anything about the program, I give it a bit more than twice the input size as a starting point. In bioinformatics, the output is usually smaller than the input. If the program shrinks the representation (like BAM to VCF, or genome assembly), I give it less. If it produces a lot of processed output (like FASTQ mapping that yields both a BAM and a sorted BAM), I give it more.
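
The rule of thumb above can be written down as a small calculation. This is only a sketch: the function name, the `output_ratio` parameter, and the 10% headroom are assumptions chosen to reproduce "a bit more than twice the input size", not fixed recommendations.

```python
def estimate_storage_gb(input_gb, output_ratio=1.0, headroom=1.1):
    """Estimate scratch storage as input plus output, with some headroom.

    output_ratio is the expected output size relative to the input:
    about 1.0 when nothing is known, well below 1.0 when the program
    shrinks the representation, above 1.0 when it writes several
    processed copies. headroom=1.1 adds an assumed 10% safety margin.
    """
    if input_gb < 0 or output_ratio < 0 or headroom < 1:
        raise ValueError("sizes must be non-negative and headroom >= 1")
    return input_gb * (1 + output_ratio) * headroom
```

With the defaults, a 10 GB input gives roughly 22 GB. Lower `output_ratio` for a shrinking step and raise it for a step that keeps several large intermediate files.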

For compute cores, I start by checking whether the program can multi-thread. If it can't, it doesn't make sense to give it lots of cores. In that case I would try a memory- or storage-heavy instance type rather than a higher core count on a type with less memory or storage per core.

Memory is tricky. It depends heavily on the type of processing. If the program needs to load all the data (like a de Bruijn graph genome assembler) or a massive index (like transcriptome processing), you will need a lot of memory.

Now combine all three: as a starting point, you are looking for the lowest-cost instance type that meets the requirements. Things get more complicated once you consider that SSD instances are more expensive than HDD instances with the same storage but give faster I/O. Also, more cores mean faster download/upload speed. Ultimately, experiment trumps speculation.
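
A minimal sketch of that "cheapest instance that fits" starting point is shown below. The instance names, sizes, and prices in `EXAMPLE_CATALOG` are entirely made up for illustration; substitute the real instance types and rates available to your project. The sketch deliberately ignores the SSD/HDD and transfer-speed effects mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    cores: int
    mem_gb: float
    storage_gb: float
    price_per_hour: float


# Hypothetical catalog: names, sizes, and prices are NOT real.
EXAMPLE_CATALOG = [
    Instance("example_small", 2, 8, 100, 0.10),
    Instance("example_himem", 4, 32, 200, 0.30),
    Instance("example_large", 8, 32, 400, 0.50),
]


def cheapest_fit(catalog, cores, mem_gb, storage_gb):
    """Return the lowest-priced instance meeting all three needs, or None."""
    fits = [
        inst for inst in catalog
        if inst.cores >= cores
        and inst.mem_gb >= mem_gb
        and inst.storage_gb >= storage_gb
    ]
    return min(fits, key=lambda inst: inst.price_per_hour, default=None)
```

For example, a single-threaded job that needs 16 GB of memory and 150 GB of storage selects `example_himem` from this made-up catalog. Treat the result as the first candidate to test, not the final answer.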

When analyzing a large number of inputs, ask yourself what is going on in the operation. Is it analyzing each file separately, one after another? If so, only storage needs to increase relative to the single-file run. If you process multiple samples simultaneously within a single job, multiply the memory requirement by the number of samples you allow to be processed at once. However, if you perform some kind of joint calling, memory is hard to predict. Most likely you will have to try it to find out.
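
The scaling rules for independent per-file processing can be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes every file's inputs and outputs stay on disk until the job ends. It does not cover joint calling, where memory has to be measured rather than computed.

```python
def batch_requirements(mem_gb_per_file, storage_gb_per_file, n_files, concurrent=1):
    """Scale single-file measurements to one job handling n_files.

    Storage grows with the total number of files in the job.
    Memory grows only with the number of files processed at once.
    """
    if n_files < 1 or not 1 <= concurrent <= n_files:
        raise ValueError("need n_files >= 1 and 1 <= concurrent <= n_files")
    return {
        "mem_gb": mem_gb_per_file * concurrent,
        "storage_gb": storage_gb_per_file * n_files,
    }
```

For a batch of 100 files processed one at a time, with 2 GB of peak memory and 5 GB of storage per file, this gives 2 GB of memory and 500 GB of storage. Running 4 files at once raises the memory figure to 8 GB and leaves storage unchanged.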

You may also want to think about whether you really need to batch them. How many jobs do you expect in total? If it's on the order of a few thousand, you may as well submit them as individual jobs. If you would have more than 10k jobs, grouping becomes important to keep things manageable.
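
One way to turn that guideline into a batch size is sketched below. It treats the 10k figure as a soft cap on the number of jobs, which is an assumption made for illustration rather than a platform limit, and `plan_batches` is a hypothetical name.

```python
import math


def plan_batches(n_files, max_jobs=10_000):
    """Return (batch_size, n_jobs) keeping n_jobs at or below max_jobs.

    When n_files is already within max_jobs, each file becomes its own
    job (batch_size 1). Otherwise files are grouped into the smallest
    batches that keep the job count under the cap.
    """
    if n_files < 1 or max_jobs < 1:
        raise ValueError("n_files and max_jobs must be at least 1")
    batch_size = math.ceil(n_files / max_jobs)
    n_jobs = math.ceil(n_files / batch_size)
    return batch_size, n_jobs
```

With 3,000 files this returns one file per job, while 250,000 files are grouped into batches of 25 for 10,000 jobs.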

Just for fun: we wrote two opinion articles on this topic, https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009757 and https://medium.com/dnanexus/how-to-perform-large-scale-data-processing-in-bioinformatics-4006e8088af2 . We kept them fairly generic for a broad audience, but we plan to offer training on this that is specific to RAP in the future.
