Cost-effective use of a Python tool

02 January 2025 04:28

Hi all, I'm hoping to get some feedback re the optimal solution for my problem.

I have a lightweight Python script that calls blood groups from VCFs. We have all UKB exomes on our HPC, and the script parses them all in ~1 hr on ~100 CPUs. The input is a folder of ~500,000 symlinks, in which the script looks for *vcf.gz. We now want to do the same for the WGS DRAGEN VCFs, but we are no longer allowed to download the data, so I need to make my script work on DNAnexus. I've written a test applet to see how that works, and now I need to convert my Python script into an applet.

I'm tossing up between (a) trimming all DRAGEN VCFs with a BED file, dumping them somewhere the applet can see, and using the script pretty much as is; or (b) modifying the script in the applet so it just pulls down full VCFs, uses them, then deletes them. Option (b) seems more cloud native but is very slow in tests. Or (c) … any suggestions?
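
To make option (b) concrete, here is a minimal sketch of the download, use, delete loop, not the actual script. The names process_one, process_batch and dx_fetch are hypothetical, call_blood_groups stands in for the existing caller, and dx_fetch assumes the dx command-line client is available on the worker. Downloads run in a small thread pool so that several transfers overlap, which may help if the slowness comes from fetching files one at a time.

```python
import os
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor


def dx_fetch(file_id, local_path):
    """Hypothetical fetcher: download one platform file with the dx CLI.

    Assumes the dx client is installed and logged in on the worker.
    """
    subprocess.run(["dx", "download", file_id, "-o", local_path, "-f"], check=True)


def process_one(file_id, fetch, call_blood_groups, workdir):
    """Fetch one VCF, run the caller on it, and always delete the local copy."""
    local_path = os.path.join(workdir, f"{file_id}.vcf.gz")
    fetch(file_id, local_path)
    try:
        return file_id, call_blood_groups(local_path)
    finally:
        # Remove the file even if the caller raised, so disk use stays bounded.
        if os.path.exists(local_path):
            os.remove(local_path)


def process_batch(file_ids, fetch, call_blood_groups, max_workers=8):
    """Process a batch of files concurrently; returns {file_id: result}."""
    with tempfile.TemporaryDirectory() as workdir:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = pool.map(
                lambda fid: process_one(fid, fetch, call_blood_groups, workdir),
                file_ids,
            )
            return dict(results)
```

In an applet this would be called as process_batch(ids, dx_fetch, my_caller), where ids is the list of file IDs assigned to that job and my_caller is whatever function the existing script exposes. At most max_workers full VCFs sit on local disk at any one time.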

Comments