The original point, that LLMs are plagiarising their inputs, is a common and commonsense opinion.
There are court cases currently addressing exactly this, and if you think about how LLMs operate, it looks an awful lot like plagiarism to a reasonable person.
If you want to claim it is not plagiarism, that requires a good argument, because it is unclear that LLMs can produce novelty, since they're literally trying to recreate the input data as faithfully as possible.
I need you to prove to me that it's not plagiarism when you write code that uses a library after reading documentation, I guess.
> since they're literally trying to recreate the input data as faithfully as possible.
Is that how they are able to produce unique code based on libraries that didn't exist in their training set? Or libraries that they themselves wrote? Is that how you can give them the documentation for an API and have them write code that uses it? Your desire to make LLMs "not special" has made you completely blind to reality. Come back to us.
Oh wild, I was operating under the assumption that the law requires you to prove that a law was broken, but it turns out you need to prove it wasn't. Thanks!
The LLM is trained on a corpus of text, and when it is given a sequence of tokens, it finds the set of tokens such that appending one of them makes the resulting sequence most like the text in that corpus.
If it is given a sequence of tokens that is unlike anything in its corpus, all bets are off and it produces garbage, just like machine learning models in general: if the input is outside the learned distribution, quality goes downhill fast.
The fact that they've added a Monte Carlo feature to the sequence generation, which makes it sometimes select a token slightly further from the closest match in the corpus, does not change this.
LLMs are fuzzy lookup tables over existing text that hallucinate text for out-of-distribution queries.
This is LLM 101.
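For what it's worth, the "Monte Carlo feature" above is just sampling the next token with a temperature. A toy sketch of that one step, using a made-up five-word vocabulary and made-up logits standing in for the model's real output, looks roughly like this:

    import numpy as np

    # Toy illustration of temperature sampling for the next token.
    # The vocabulary and logits are invented; a real LLM produces one
    # logit per token in a vocabulary of tens of thousands at every step.
    vocab = ["the", "cat", "sat", "on", "mat"]
    logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])

    def sample_next_token(logits, temperature=0.8):
        # Lower temperature -> almost always the highest-scoring token;
        # higher temperature -> more chance of picking a less likely one.
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return np.random.default_rng().choice(len(logits), p=probs)

    print(vocab[sample_next_token(logits)])

None of this changes where the probabilities come from in the first place, which is the point being argued.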
If the LLM were trained only on documentation, there would be no problem: it would generate a design, look at the documentation, understand the semantics of both, and translate the design into code using the documentation as a guide.
But that's not how it works. It has open source repositories in its corpus, which it then recreates by chaining together examples via the stochastic-parrot method I described above.