My custom image model crashed hard after 3 days of training

Honestly, I was trying to train a model on a set of 5,000 product photos to get a specific style. It ran fine for about 72 hours, then the whole thing stopped with a CUDA out-of-memory error. I had to go back and cut my batch size in half, from 8 to 4, which added another full day to the training time. Tbh, I suspect a few corrupted image files in my dataset messed everything up. Has anyone else run into this with PyTorch lately and found a better fix?

3 comments

jake_hall88 2mo ago

Have you tried gradient checkpointing? It worked for me when I had the same issue. And @jade_hernandez is right, it's just part of the grind. Also, running a quick script to find broken images before training saved me a ton of headaches later.
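For anyone who hasn't used gradient checkpointing, here's a rough sketch of the idea in PyTorch. The tiny `CheckpointedNet` model, its layer sizes, and the random batch are all made up for illustration and have nothing to do with the original poster's setup. The only real API involved is `torch.utils.checkpoint.checkpoint`, which skips storing the activations inside each wrapped block and recomputes them during the backward pass, trading extra compute for lower memory.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class CheckpointedNet(nn.Module):
    """Toy model (hypothetical) whose blocks are checkpointed."""

    def __init__(self, width=64, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(depth)]
        )
        self.head = nn.Linear(width, 1)

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are not kept; they are
            # recomputed when backward() reaches this block.
            x = checkpoint(block, x, use_reentrant=False)
        return self.head(x)


model = CheckpointedNet()
batch = torch.randn(4, 64)  # stand-in for a real batch of size 4
loss = model(batch).mean()
loss.backward()  # gradients still reach every parameter
```

In a real image model you'd wrap the heavy blocks of the network the same way. Expect each step to run slower, since part of the forward pass is done twice.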

luna_green22d ago

The gradient checkpointing thing is actually a solid hack; I've used it before too. It's funny how this stuff mirrors real life, though: you'd think adding more people to a project would make it go faster, but instead you just get more communication errors and slowdowns, similar to how bigger batch sizes eat up memory. In my case, a few weird EXIF headers in my dataset caused the crash, so definitely run that validation script first next time.
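A validation script doesn't have to be fancy. Below is a minimal, standard-library-only sketch of a first pass that flags files that look truncated, by checking the JPEG and PNG start and end markers. The function names `looks_truncated` and `find_suspect_images` are made up for this example. It will not catch bad EXIF headers or every kind of corruption, and it can flag valid JPEGs that carry extra trailing bytes, so treat it as a cheap filter. Actually decoding each file with an imaging library (for example Pillow's `Image.verify()`) is more thorough.

```python
from pathlib import Path

JPEG_START = b"\xff\xd8"          # JPEG start-of-image marker
JPEG_END = b"\xff\xd9"            # JPEG end-of-image marker
PNG_START = b"\x89PNG\r\n\x1a\n"  # PNG file signature
PNG_END = b"IEND\xaeB`\x82"       # IEND chunk type plus its fixed CRC


def looks_truncated(path):
    """Return True if the file is not a complete-looking JPEG or PNG."""
    data = Path(path).read_bytes()
    if data.startswith(JPEG_START):
        # Some writers pad the file with zero bytes after the end marker.
        return not data.rstrip(b"\x00").endswith(JPEG_END)
    if data.startswith(PNG_START):
        return not data.endswith(PNG_END)
    return True  # empty file or unrecognized format


def find_suspect_images(root, extensions=(".jpg", ".jpeg", ".png")):
    """Walk `root` recursively and list image files that look broken."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in extensions and looks_truncated(p)
    )
```

Running `find_suspect_images("path/to/dataset")` before kicking off training gives you a list of files to inspect or drop. The path here is just a placeholder.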

jade_hernandez 2mo ago

Eh, CUDA errors happen all the time though. Cutting the batch size is just part of the process, not some huge disaster.