How can I run an R script faster on UKB RAP?

I have an R script that takes the SNPs on the 22 chromosomes as input, one chromosome at a time. When I ran it in RStudio on UKB RAP, it took more than 24 hours to process only 10k SNPs. I also ran the script on different instance types, but it took almost the same amount of time on each of them. What other resources are available (like Regenie), and how can I use them to run the code faster on RAP? Is there a way to parallelize my job? If so, please explain how I can use it for my purpose.

Comments

  • Chai Fungtammasan (DNAnexus Team)

    I can share what I know, but let's also hear from the R experts in the community.

    How fast you can run things mainly depends on the nature of the algorithm, its implementation, and how the compute is parallelized or distributed. Regenie is fast because it uses a low-compute algorithm.

    With that, the answer to this question depends on what type of operation you are trying to do in R. You may ask yourself whether there is a lower-compute tool that you could use instead of your current option, or whether a GPU-accelerated version of it is available.

    Maybe a more important question is "why do you want it to be fast?" This may sound trivial, but it is an important question. Do you want it fast just because you don't want to wait? Why does the waiting time matter to you? Or do you want it fast so that you don't have to pay for a lot of compute? You need to know what you actually want to optimize.

    If you are after a lower cost, the lowest-hanging fruit is to develop the script in RStudio but execute it in production using Swiss Army Knife. This way, you can get the spot rate, which is much cheaper than the on-demand rate.

    If this is code you wrote yourself, see whether you can vectorize the core computation to make it faster.

    However, if you really want a super-fast program, R might not be the best option.

  • Hi, Chai,

    Thanks for your explanation.

    • My code contains 5 multiple linear regressions, each regressed on the genotypes of individual SNPs.
    • Across the 22 chromosomes, there are around 600k SNPs. All I need is to run my script on these 22 chromosomes. In RStudio, it takes a long time to run just one of them, and I do not know how to run the 22 analyses in parallel.
    • I noticed that even when I try different instance types, the time to process a given number of SNPs is almost the same.
    • Yes, I would like the analysis to be both time- and cost-efficient.
    • Can I run my own R script on some other, faster platform that is available?
    • Can someone suggest how I can run my R scripts in swiss-army-knife?

    Please let me know if more clarification is needed. I would also ask the experts in the community for their suggestions.

    Thanks and regards,

    -Saurabh

  • Chai Fungtammasan (DNAnexus Team)

    The easiest way to achieve this is to wrap your code in an R script, e.g. myscript.r.

    You can then see an example of how to run Rscript in the swiss-army-knife documentation here: https://ukbiobank.dnanexus.com/app/app-GKyyzJQ951j4Bkfq4jFkGX1K

    You can then use a for loop to submit 22 jobs, one per chromosome.

    It's probably best to design the script so that it can be used for all the jobs, like `Rscript myscript.r input1.bgen`. Basically, you can write your R script to take the input file as an argument and use a variable to name the output based on the input name.
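
    A minimal sketch of such a submission loop is below. It assumes the dx command-line client is installed and logged in, and it uses hypothetical names throughout: the project folder /my_analysis, input files named chr1.bgen to chr22.bgen, and a myscript.r that reads its input path from `commandArgs(trailingOnly = TRUE)`. By default it only prints the commands (DRY_RUN=1) so they can be checked before anything is submitted.

    ```shell
    #!/usr/bin/env bash
    # Sketch: submit one Swiss Army Knife job per chromosome.
    # Hypothetical names: /my_analysis, chrN.bgen, myscript.r. Adjust to your project.
    set -euo pipefail

    PROJECT_DIR="/my_analysis"   # hypothetical project folder
    DRY_RUN="${DRY_RUN:-1}"      # 1 = only print the commands, 0 = submit them

    # Build (and optionally run) the dx command for a single chromosome.
    submit_chr() {
      local chr="$1"
      local cmd=(dx run app-swiss-army-knife
        -iin="${PROJECT_DIR}/myscript.r"
        -iin="${PROJECT_DIR}/chr${chr}.bgen"
        -icmd="Rscript myscript.r chr${chr}.bgen"
        --priority low
        --name "myscript_chr${chr}"
        --destination "${PROJECT_DIR}/results"
        -y --brief)
      if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "${cmd[*]}"
      else
        "${cmd[@]}"
      fi
    }

    # One job per autosome; the jobs run independently of each other.
    for chr in $(seq 1 22); do
      submit_chr "$chr"
    done
    ```

    Running it with `DRY_RUN=0` would submit the 22 jobs for real. The `--priority low` flag is what requests the cheaper spot pricing mentioned earlier in the thread.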

     

    This should speed up your processing at least 12-fold and cut the cost 3-fold. I think that is a decent improvement.

  • Hi Chai,

     

    I tried to implement your suggestion and run my R script in SAK (Swiss Army Knife), but I ran into problems loading the R packages because SAK has an older version of R. I have contacted the support team and hope they upgrade it soon. Once it is upgraded, I will try running my script again, and I hope it performs as fast as you mentioned.

    A few of the packages I installed (e.g., tidyverse) took 5-10 minutes to install, whereas they take barely 1 minute in RStudio.

    Do I need to install the packages again every time I run a new script, or, once they are installed, can I load them from a different R script?

  • Chai Fungtammasan (DNAnexus Team)

    I see. R is tricky with dependencies. I'm a little worried that after the update, some dependencies might be too new and you would end up with another incompatibility.

    Installing specific libraries on the fly when the script is launched is another option. However, as you mentioned, you would then pay for the 10-minute installation on every run. To avoid this, you can wrap everything up in Docker. Basically, you install all the libraries you need and save your script inside the Docker image. Then you can run Swiss Army Knife to invoke the script and the installed libraries within the Docker image without having to install them again. The manual on how to use Docker in Swiss Army Knife is at the same link I shared with you last time. You can find the Docker training at the link below. The video also points you to example code.

    https://www.youtube.com/watch?v=aOP_iSZpR6g
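
    As a rough sketch of that workflow, the steps could look like the script below. Everything named here is an assumption for illustration: a Dockerfile in the current directory that installs the packages and copies the script to /opt/myscript.r, the image tag my-r-analysis:0.1, and the project folder /my_analysis. It also assumes the job's input files are visible in the container's working directory, which is worth checking against the Swiss Army Knife app page. By default it only prints each command (DRY_RUN=1).

    ```shell
    #!/usr/bin/env bash
    # Sketch: build an image with the R packages once, then reuse it in Swiss Army Knife.
    # Hypothetical names: my-r-analysis:0.1, /opt/myscript.r, /my_analysis.
    set -euo pipefail

    IMAGE="my-r-analysis:0.1"
    TARBALL="my-r-analysis_0.1.tar"
    PROJECT_DIR="/my_analysis"
    DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

    # Print the command in dry-run mode, otherwise execute it.
    run() {
      if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
      else
        "$@"
      fi
    }

    # 1. Install the packages once, at image build time.
    run docker build -t "$IMAGE" .
    # 2. Export the image to a tarball and compress it.
    run docker save -o "$TARBALL" "$IMAGE"
    run gzip -f "$TARBALL"
    # 3. Store the compressed image in the project.
    run dx upload "${TARBALL}.gz" --path "${PROJECT_DIR}/"
    # 4. Run the script from inside the image; no package installation at run time.
    run dx run app-swiss-army-knife \
      -iimage_file="${PROJECT_DIR}/${TARBALL}.gz" \
      -iin="${PROJECT_DIR}/chr1.bgen" \
      -icmd="Rscript /opt/myscript.r chr1.bgen" \
      --priority low -y --brief
    ```

    Steps 1 to 3 only need to be repeated when the package list or the script changes; step 4 is the part that would go inside a per-chromosome loop.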

    It may sound like a big investment to learn Docker if you have not used it before, but it has become best practice in the bioinformatics industry. I think it would be a good use of time.

    You can decide what works best for you, though. Honestly, a 10-minute installation doesn't cost much on RAP, since it has a special pricing model from AWS. If you run with "low" priority and get the spot price, it's almost nothing.

  • Yes, you are right. The dependency issue in R is quite annoying sometimes. It is what pushed me to explore Docker, which is very new to me. I first heard about Docker here and have already started exploring it. As you said, it seems like a big investment, but once I learn to use it, it will be of great help. Thank you for the motivation.

    Also, the 10-minute installation is not a problem for me if everything else works fine. I was just wondering why the same packages take different amounts of time to install in R on the two platforms.

  • Chai Fungtammasan (DNAnexus Team)

    My guess for the longer installation in Swiss-army-knife is the number of dependencies that need to be installed compared with what already comes with the program (RStudio vs. R in Swiss-army-knife). There could be other reasons, such as a difference in network bandwidth between instance types, but a 10-minute install is a much larger gap than a bandwidth difference alone would explain.

    Hope you enjoy learning Docker. One compromise is to use a public Docker image that has R installed. You can then use mounting (-v) to make your R script available inside the container and execute it there. However, unless you are very lucky, you would run into the original issue: the public Docker image doesn't have the R library versions you want, and you would end up having to install libraries on the fly again.
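
    For illustration, the mounting approach could look like the sketch below. The image rocker/r-ver:4.3.1 is only an assumed stand-in for "a public Docker image with R", and myscript.r and input1.bgen are assumed to sit in the current directory. The script prints the command by default (DRY_RUN=1) rather than running it.

    ```shell
    #!/usr/bin/env bash
    # Sketch: run a local R script inside a public R image via a bind mount (-v).
    # Assumed image: rocker/r-ver:4.3.1; assumed files: myscript.r, input1.bgen.
    set -euo pipefail

    IMAGE="rocker/r-ver:4.3.1"
    DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = run it

    # -v mounts the current directory at /work inside the container,
    # -w makes /work the working directory, --rm removes the container afterwards.
    cmd=(docker run --rm
      -v "${PWD}:/work"
      -w /work
      "$IMAGE"
      Rscript myscript.r input1.bgen)

    if [ "$DRY_RUN" = "1" ]; then
      printf '%s\n' "${cmd[*]}"
    else
      "${cmd[@]}"
    fi
    ```

    Any package the script loads that is not already in the image would still have to be installed at run time, which is the limitation described above.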

     

    On the other hand, if you find a Dockerfile with R that someone has shared, you can modify it to install the library versions you need. The most time-consuming step with Docker is writing the Dockerfile you need, so this approach might be the most likely path for you.

     

    Hope these aren't too many possibilities.

  • Thanks for telling me about the issues I may run into while learning Docker. They'll be quite helpful, I am sure.
