The original point, that LLMs are plagiarising their inputs, is a common and commonsense opinion.
There are court cases currently addressing exactly this, and if you think about how LLMs operate, it looks an awful lot like plagiarism to a reasonable person.
If you want to claim it is not plagiarism, that requires a good argument, because it is unclear that LLMs can produce novelty, since they're literally trying to recreate the input data as faithfully as possible.
I need you to prove to me that it's not plagiarism when you write code that uses a library after reading documentation, I guess.
> since they're literally trying to recreate the input data as faithfully as possible.
Is that how they are able to produce unique code based on libraries that didn't exist in their training set? Or libraries that they themselves wrote? Is that how you can give them the documentation for an API and have them write code that uses it? Your desire to make LLMs "not special" has made you completely blind to reality. Come back to us.
Oh wild, I was operating under the assumption that the law requires you to prove that a law was broken, but it turns out you need to prove it wasn't. Thanks!
The LLM is trained on a corpus of text, and when it is given a sequence of tokens, it finds the set of tokens such that appending one of them makes the resulting sequence most like the text in that corpus.
If it is given a sequence of tokens that is unlike anything in its corpus, all bets are off and it produces garbage, just like machine learning models in general: if the input is outside the learned distribution, quality goes downhill fast.
The fact that they've added a Monte Carlo feature to the sequence generation, which makes it sometimes select a token slightly further from the closest match in the corpus, does not change this.
LLMs are fuzzy lookup tables over existing text that hallucinate text for out-of-distribution queries.
This is LLM 101.
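For what it's worth, the "Monte Carlo feature" above is just sampling the next token with a temperature. A toy sketch of that one step, using a made-up five-word vocabulary and made-up logits standing in for the model's real output, looks roughly like this:

    import numpy as np

    # Toy illustration of temperature sampling for the next token.
    # The vocabulary and logits are invented; a real LLM produces one
    # logit per token in a vocabulary of tens of thousands at every step.
    vocab = ["the", "cat", "sat", "on", "mat"]
    logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])

    def sample_next_token(logits, temperature=0.8):
        # Lower temperature -> almost always the highest-scoring token;
        # higher temperature -> more chance of picking a less likely one.
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return np.random.default_rng().choice(len(logits), p=probs)

    print(vocab[sample_next_token(logits)])

None of this changes where the probabilities come from in the first place, which is the point being argued.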
If the LLM were trained only on documentation, there would be no problem: it would generate a design, look at the documentation, understand the semantics of both, and translate the design into code using the documentation as a guide.
But that's not how it works. It has open source repositories in its corpus, which it then recreates by chaining together examples via the stochastic-parrot method I described above.