How do we spawn AWS instances with GPUs for deep learning projects?

10 July 2024 10:19
4 comments

Due to the recent data policy change, I now need to use the RAP to train my deep learning models.

Currently, it only seems possible to use Jupyter notebooks for such work. Jupyter notebooks are an unreliable way to monitor training progress and make changes during training. This is particularly so because some runs will take weeks to finish.

How can I train a large-scale deep learning model on, let's say, imaging data via a command-line tool?

Thank you ;D

Comments

Esra Lenz
- 21 October 2024 15:16
Hello Hang,
I am facing the same issue as you. I want to train on MRI images to, let's say, classify them.
I cannot wrap my head around having to load all the images into my instance first (around 60,000 MRIs) and then run my analyses in a notebook.
The solution I have come up with so far is to SSH into a cloud instance, set it up the way I want, and then run the training there.
My code uses PyTorch Lightning, Hydra, and Weights & Biases for logging anyway.
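As a rough illustration of the "SSH in and run it as a script" approach, here is a minimal sketch of a resumable command-line training loop. It uses only the Python standard library. The placeholder `train_one_epoch`, the `--epochs` and `--run-dir` flags, and the `state.json` / `train.log` file names are all made up for this example; the real training step would replace the placeholder.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Resumable training loop (sketch)")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--run-dir", type=Path, default=Path("runs/example"))
    return parser.parse_args(argv)


def train_one_epoch(epoch):
    """Hypothetical placeholder for the real training step; returns a fake loss."""
    return 1.0 / (epoch + 1)


def main(argv=None):
    args = parse_args(argv)
    args.run_dir.mkdir(parents=True, exist_ok=True)
    state_path = args.run_dir / "state.json"
    log_path = args.run_dir / "train.log"

    # Resume from the last saved state if one exists, otherwise start fresh.
    state = {"next_epoch": 0, "losses": []}
    if state_path.exists():
        state = json.loads(state_path.read_text())

    with log_path.open("a") as log:
        for epoch in range(state["next_epoch"], args.epochs):
            loss = train_one_epoch(epoch)
            state["losses"].append(loss)
            state["next_epoch"] = epoch + 1

            # Write to a temporary file first, then rename it into place,
            # so an interrupted run does not leave a half-written state file.
            tmp_path = state_path.with_suffix(".tmp")
            tmp_path.write_text(json.dumps(state))
            tmp_path.replace(state_path)

            # One line per epoch, flushed so progress can be followed live.
            log.write(f"epoch={epoch} loss={loss:.4f}\n")
            log.flush()
    return state


if __name__ == "__main__":
    main()
```

Saved as, say, `train.py`, this could be started with `nohup python train.py --epochs 100 --run-dir runs/mri &` so it survives the SSH session closing, and followed with `tail -f runs/mri/train.log`. Rerunning the same command after an interruption continues from the last completed epoch.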
I also think that the “workflows” could be something that helps, but I am not sure yet. I am also completely new to the RAP.
Did you find a good solution that worked for you?

Best

Hang Yuan
- 06 November 2024 11:32
Just came across this thread as the Forum doesn't seem to send me notifications via email.
I am actually surprised that you can get Lightning to work, because we can only run jobs on one instance instead of multiple instances?
My honest opinion is that if your data is small enough, stick to the notebook workflow. In your case, it might not be.
You might want to consider applying for the access exemption, given the limited support for deep learning projects at scale.

H

Esra Lenz
- Edited 06 November 2024 12:30
Hi Hang,
My Lightning code unfortunately does not work on multiple GPUs yet, only on one. I am not sure whether this is due to my code or some underlying problem, but I think in general it should be possible. I'll let you know if I get it working.
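For context, here is a minimal sketch of how multi-GPU training on a single instance is typically requested in Lightning. It assumes a recent `lightning` install (older ones import from `pytorch_lightning` instead). The helper names `trainer_kwargs` and `build_trainer` are hypothetical; `accelerator`, `devices`, and `strategy` are Trainer arguments.

```python
def trainer_kwargs(num_gpus: int) -> dict:
    """Build Trainer keyword arguments for a single machine with num_gpus GPUs."""
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    if num_gpus == 0:
        # No GPU available: fall back to CPU training.
        return {"accelerator": "cpu", "devices": 1}
    kwargs = {"accelerator": "gpu", "devices": num_gpus}
    if num_gpus > 1:
        # Distributed data parallel: one process per GPU on the same machine.
        kwargs["strategy"] = "ddp"
    return kwargs


def build_trainer(num_gpus: int, max_epochs: int):
    """Create a Trainer; imported lazily because it assumes `lightning` is installed."""
    from lightning.pytorch import Trainer

    return Trainer(max_epochs=max_epochs, **trainer_kwargs(num_gpus))
```

As far as I know, `strategy="ddp"` starts one process per GPU and is meant for scripts launched from the command line rather than notebook cells, so it may be worth checking how the code is being started.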

And yes, we are also going for this access exemption.
If you are up for it, maybe you could share some information on how you obtained it.
In our case, it is also simply because we are working on the NIfTI files directly and are using clustering etc., which needs a lot of epochs.
We also want to try some transformers.

Hang Yuan
- 06 November 2024 12:42
Esra, we don't have plans to move our ML projects to the RAP in the short term, given what we have seen of the RAP so far.
Good luck with your application!

