6
c/ai-innovations • sean_hunt61 • 2mo ago

My custom image model crashed hard after 3 days of training

Honestly, I was training a model on a set of 5,000 product photos to get a specific style. It ran fine for about 72 hours, then the whole thing stopped with a CUDA out of memory error. I had to go back and cut my batch size in half, from 8 to 4, which added another full day to the training time. Tbh, I suspect my dataset had a few corrupted image files that contributed to the crash. Has anyone else run into this with PyTorch lately and found a better fix?
3 Comments
jake_hall88
Have you tried gradient checkpointing? It worked for me when I hit the same issue. And @jade_hernandez is right, it's just part of the grind. Also, running a quick script to find any broken images before training saved me a ton of headaches later.
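
For anyone who hasn't used it, here's a rough sketch of what gradient checkpointing looks like in PyTorch. The toy network, its layer sizes, and the name `CheckpointedNet` are made up for illustration and are not the original poster's model; the only real API involved is `torch.utils.checkpoint.checkpoint`.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class CheckpointedNet(nn.Module):
    """Toy model (hypothetical) whose blocks are wrapped in checkpointing."""

    def __init__(self, width=64, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(depth)]
        )
        self.head = nn.Linear(width, 1)

    def forward(self, x):
        for block in self.blocks:
            # Intermediate activations inside `block` are not kept;
            # they are recomputed when backward() reaches this block.
            x = checkpoint(block, x, use_reentrant=False)
        return self.head(x)


model = CheckpointedNet()
batch = torch.randn(4, 64)  # batch size 4, as in the post
loss = model(batch).mean()
loss.backward()  # gradients still flow through the checkpointed blocks
```

The trade-off is extra compute for lower memory: each checkpointed block runs its forward pass a second time during backward, so steps get slower, but you may be able to keep a larger batch size.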
10
luna_green • 22d ago
The gradient checkpointing thing is actually a solid hack, I've used it before too. It's funny how this stuff mirrors real life though: you'd think adding more people to a project would make it go faster, but instead you just get more communication errors and slowdowns, a bit like how bigger batch sizes eat up memory. In my case, a few images with weird EXIF headers caused the crash, so definitely run that validation script first next time.
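
A pre-training validation pass could look something like the sketch below. It only uses the standard library and only checks that JPEG and PNG files start and end with the expected byte markers, so it catches truncated or mislabeled files but not odd EXIF data. The function names (`looks_intact`, `find_broken_images`) are hypothetical, and the check is a heuristic: a valid JPEG with trailing padding after its end marker would get flagged too.

```python
from pathlib import Path

JPEG_START = b"\xff\xd8"          # JPEG start-of-image marker
JPEG_END = b"\xff\xd9"            # JPEG end-of-image marker
PNG_START = b"\x89PNG\r\n\x1a\n"  # PNG file signature
PNG_END = b"IEND\xaeB`\x82"       # IEND chunk type plus its fixed CRC


def looks_intact(path):
    """Return True if the file starts and ends like a complete JPEG or PNG."""
    data = Path(path).read_bytes()
    if data.startswith(JPEG_START):
        return data.endswith(JPEG_END)
    if data.startswith(PNG_START):
        return data.endswith(PNG_END)
    return False  # unknown signature: treat as broken


def find_broken_images(root, extensions=(".jpg", ".jpeg", ".png")):
    """Walk `root` recursively and list image files that fail the check."""
    broken = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            try:
                if not looks_intact(path):
                    broken.append(path)
            except OSError:
                broken.append(path)  # unreadable files count as broken
    return broken
```

Calling `find_broken_images("path/to/dataset")` before training returns the suspect files so they can be removed or re-downloaded. A decoder-based check, such as Pillow's `Image.verify()`, is more thorough if Pillow is installed.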
2
jade_hernandez
Eh, CUDA errors happen all the time though. Cutting the batch size is just part of the process, not some huge disaster.
1