The problem with this analogy is that the companies are large enough to have the kind of leverage over the legal system that would likely let them avoid any consequences. Even if case law eventually comes down in favor of the original copyright holders of the training data, it's the customers of the ML companies who would most likely be directly liable for infringement, not the ML companies themselves (though the customers could then try to sue the ML companies for damages).
Since there is no legal precedent and the law itself isn't clear about this use case, it's basically a huge gamble on a legal gray area at this point. For VCs the risk doesn't really matter, as ML startups only need to exist long enough to provide an exit with high ROI, and for enterprise companies it doesn't matter either, as ML products are just one of many ventures for them.
It's worth noting that unlike in Germany, where book and newspaper publishers have won rather unusual copyright claims against companies like Google, in the US the big publishing industries to worry about are movies and music, and most ML projects right now seem to focus on generating images or text rather than music or video. If "AI generated music" caught on like DALL-E 2 did, I think we'd see a lot more contention over how copyright law applies to ML training data.