24
Question about AI model training data quality
I keep seeing people in my local tech meetup talk about training models on any data they can scrape, focusing only on volume. Last month, a developer from Austin showed a project that failed because the training set was full of duplicate and low-quality forum posts. The model just repeated nonsense. I think clean, verified data matters more than sheer size. Has anyone else run into this and found a good way to source better datasets?
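To make "clean" a bit more concrete, here is a minimal sketch of the kind of first-pass cleanup I mean: drop posts that are exact duplicates after basic normalization, and drop very short posts. The names `normalize`, `clean_posts`, and `min_words` are made up for illustration, and the 5-word threshold is an arbitrary assumption, not something from the Austin project. Real pipelines usually need near-duplicate detection and better quality signals than length.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def clean_posts(posts, min_words=5):
    """Keep the first copy of each post; drop short posts and exact duplicates.

    min_words is an assumed, arbitrary cutoff for "too short to be useful".
    """
    seen = set()
    kept = []
    for post in posts:
        norm = normalize(post)
        # Crude low-quality filter: skip posts with too few words.
        if len(norm.split()) < min_words:
            continue
        # Hash the normalized text so duplicates are detected cheaply.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(post)
    return kept
```

Calling `clean_posts(list_of_scraped_posts)` returns the surviving posts in their original order, so it can sit in front of whatever training step comes next.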
3 comments
michael895 2mo ago
Wait, they just used any forum posts they could find? I mean, that's basically asking for a model to just spit back garbage.
3
river_gonzalez66 2mo ago
Isn't it more about how they filter and clean the data first?
1
gavin_allen48 1mo ago
I heard one team actually spent months just on the filtering part.
0