Using the Swiss-Army-Knife Tool to run long PyTorch Lightning training runs

David Marvin Hart

Hi everyone,

Hoping someone here has solved a problem I've been stuck on for a while.

I've been using JupyterLab on the UK Biobank RAP to train some deep learning models on eye images. The JupyterLab setup works okay for short training runs, but a really long run won't complete. You can use `nohup` to keep it running in the background, but then you have to keep checking that it hasn't crashed or finished, so you don't waste money keeping the interactive session open.

So I've been trying to move to the swiss-army-knife tool, so that longer runs can keep going unattended and the session stops automatically when they finish. I found I could use the Notebook Snapshot from the JupyterLab session as the Docker image for the swiss-army-knife tool.

I've almost got everything working, but I keep running into one more problem. For the dataloaders, if I use `num_workers > 0`, the job crashes saying that the “workers have run out of shared memory”. From what I have found, it seems to be related to the `/dev/shm` directory, where the workers store data they share with the main process. In JupyterLab, that directory has 32 GB, but in the swiss-army-knife setup it only seems to have 64 MB. I have tried, both from the system side and from Python, to increase the size or change its location, but nothing seems to work.
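In case it helps anyone reproduce this, here is a minimal sketch of how the `/dev/shm` size can be checked from Python, using only the standard library. The helper names (`shm_size_bytes`, `pick_num_workers`) and the 1 GiB threshold are my own placeholders, not anything provided by the platform or by PyTorch Lightning. It only detects the small mount and falls back to `num_workers=0`; it does not fix the underlying limit.

```python
import os
import shutil


def shm_size_bytes(path="/dev/shm"):
    """Return the total size in bytes of the shared-memory mount, or None if it is missing."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total


def pick_num_workers(requested, min_shm_bytes=1 << 30, path="/dev/shm"):
    """Hypothetical helper: return 0 workers when shared memory looks too small.

    The 1 GiB default threshold is an arbitrary assumption, not a documented limit.
    """
    size = shm_size_bytes(path)
    if size is None or size < min_shm_bytes:
        return 0
    return requested


if __name__ == "__main__":
    size = shm_size_bytes()
    if size is None:
        print("/dev/shm not found")
    else:
        print(f"/dev/shm total: {size / 2**20:.0f} MiB")
    print("num_workers to use:", pick_num_workers(4))
```

The returned value would then be passed as `num_workers` when building the `DataLoader`. With 0 workers the data is loaded in the main process, so the crash goes away, but at the cost of slower loading.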

I'm curious whether anyone else has had success running PyTorch Lightning with the swiss-army-knife tool, or has a better alternative to this approach.

Thanks in advance!
