Efficient ways to use Swiss Army Knife

I have been using the RAP on DNAnexus to run analyses on the 150k WGS data. I have mostly been using the Swiss Army Knife tool through the graphical interface to convert VCF files to PLINK files, which I then use for my analyses. I uploaded all of the roughly 57,000 VCF files to Swiss Army Knife. Because only 2,000 files can be uploaded at a time, I had to repeat this step many times. I ran batch jobs on these files, and there also seemed to be some kind of interference when uploading a new set of 2,000 files before all the previous ones had finished running (the jobs that had not yet run were blocked and I had to relaunch them). This was of course very time-consuming and costly. Now that the 200k data are available, and hopefully the 500k soon will be too, I would like to know whether there is a more efficient way to run these analyses.

Comments

2 comments

  • Chai Fungtammasan (DNAnexus Team)

    Swiss Army Knife is versatile and a good starting point, but if you want to handle large-scale data effectively, it may be useful to develop an applet that handles this data specifically. See an example here: https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/guide-to-analyzing-large-sample-sets

     

    Here is a webinar on developing with Docker: https://www.youtube.com/watch?v=aOP_iSZpR6g&t=1s

  • The comment above points to some more advanced tools such as Docker, batch processing, and WDL workflows. Those advanced tools are great, and you should definitely look into them once you are comfortable.

     

    I tend to use Swiss Army Knife from the command line through dx run. Using dx run is similar to submitting Slurm or PBS jobs on an HPC cluster.
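As a rough illustration of the dx run approach, the sketch below assembles a Swiss Army Knife command line in Python and prints it for review rather than submitting it. The app's `in` and `cmd` inputs and the `dx run` options shown are the ones I understand the dx-toolkit to provide. The file paths, the instance type, the job name, and the plink2 command are placeholders chosen for illustration, not values from this thread.

```python
import shlex


def build_sak_command(cmd, input_files, destination,
                      instance_type="mem1_ssd1_v2_x4",
                      priority=None, name=None):
    """Return the argument list for one Swiss Army Knife job.

    All paths and the default instance type are illustrative assumptions.
    """
    args = ["dx", "run", "swiss-army-knife"]
    # Each input file is passed as a separate -iin=... argument.
    for path in input_files:
        args.append(f"-iin={path}")
    # The shell command that runs inside the job.
    args.append(f"-icmd={cmd}")
    args += ["--instance-type", instance_type, "--destination", destination]
    if priority is not None:
        args += ["--priority", priority]
    if name is not None:
        args += ["--name", name]
    # -y skips the confirmation prompt; --brief prints only the job ID.
    args += ["-y", "--brief"]
    return args


if __name__ == "__main__":
    example = build_sak_command(
        cmd="plink2 --vcf chunk.vcf.gz --make-pgen --out chunk",
        input_files=["/hypothetical/path/chunk.vcf.gz"],
        destination="/hypothetical/output/",
        name="vcf_to_pgen_chunk",
    )
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(example))
```

Printing the command first makes it easy to check one job by hand before looping over many files; the same list can be handed to `subprocess.run` once it looks right.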

     

    1) Spot instances on AWS are the least expensive way to run compute on the RAP. It is best to keep the total runtime of a job as short as possible so that it does not get pre-empted. Once a pre-empted job is restarted, it is moved to "on-demand", which is much more expensive. In my experience, most jobs with a runtime of under 6 hours will seldom get moved over to "on-demand".

     

    2) Always work on an individual chromosome basis. The chromosome is a fundamental unit of parallelization. This will allow you to run 22+ independent jobs at the same time, with your rate-limiting step being the largest chromosome. Once all analyses are complete, you can gather the results together at the end.

     

    3) Never try to do too much in a single job. Do you need to run 5 commands in series? Consider splitting them up into two jobs, with two commands in one and three in the other.

     

    4) Consider using a smaller sub-sample of controls. Jobs always scale with the number of variants and the number of subjects. You do not always need to run your analysis on all 200K+ subjects, especially when you are building out your initial workflow.
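Points 2 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file naming pattern (`<prefix>_c<chrom>_b<block>_...`) is hypothetical and should be adapted to the actual file names, and the helper names `group_by_chromosome` and `subsample_controls` are made up for this example.

```python
import random
import re
from collections import defaultdict

# Hypothetical naming pattern: "<prefix>_c<chrom>_b<block>_v1.vcf.gz".
CHROM_PATTERN = re.compile(r"_c(\d+|X|Y)_b\d+_")


def group_by_chromosome(file_names):
    """Group VCF chunk names by chromosome so each group can be one job set."""
    groups = defaultdict(list)
    for name in file_names:
        match = CHROM_PATTERN.search(name)
        if match is None:
            raise ValueError(f"cannot find a chromosome label in {name!r}")
        groups[match.group(1)].append(name)
    return dict(groups)


def subsample_controls(control_ids, n, seed=0):
    """Pick a reproducible random subset of control IDs for workflow testing."""
    ids = list(control_ids)
    if n >= len(ids):
        return sorted(ids)
    # A seeded generator gives the same subset on every run.
    rng = random.Random(seed)
    return sorted(rng.sample(ids, n))


if __name__ == "__main__":
    names = [
        "ukb_c1_b0_v1.vcf.gz",
        "ukb_c1_b1_v1.vcf.gz",
        "ukb_c22_b0_v1.vcf.gz",
    ]
    for chrom, files in group_by_chromosome(names).items():
        print(f"chromosome {chrom}: {len(files)} file(s)")
    print(len(subsample_controls(range(1000), 50)), "controls sampled")
```

The per-chromosome groups can then be submitted as independent jobs, and the sampled IDs can be written to a file and used as a sample filter while the workflow is still being developed.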

     

    -Phil Greer

