More

Oarch · 2025-12-16T07:14:27 1765869267

Bob Dylan used to regularly play a small annual festival (Hop Farm) near me in Kent, England.

Growing up, I always found it slight strange that this big global name was playing down the road.

IAmBroom · 2025-12-16T14:37:37 1765895857

James Corden pulled a stunt in a small pub where, if you picked a Beatles song on the jukebox, instead of the record playing a curtain would open, and Sir Paul McCartney and his band would play the song. Then the curtain would close.

Then ensuing mob of (well-behaved, deliriously happy) fans rushing to the pub was videotaped as well.

All in all a very cute stunt.

Oarch · 2025-12-16T04:27:36 1765859256

I imagine this would be a great use case for AI helping out?

lodovic · 2025-12-16T07:01:22 1765868482

I'm thinking of installing the extension in a sandbox and then use a local agent to have endless fake conversations with it

automatedideas · 2025-12-16T06:16:02 1765865762

“There’s too much human harmful code to review and too few human reviewers.”

“I know, let’s have an AI do all the work for us instead. Let’s take a coffee break.”

free_bip · 2025-12-16T04:49:10 1765860550

No way that could backfire... Prompt injection is a solved problem right?

Oarch · 2025-12-15T00:24:57 1765758297

Earlier this year I thought that rare proprietary knowledge and IP was a safe haven from AI, since LLMs can only scrub public data.

Then it dawned on me how many companies are deeply integrating Copilot into their everyday workflows. It's the perfect Trojan Horse.

findjashua · 2025-12-15T00:36:05 1765758965

providers' ToS explicitly states whether or not any data provided is used for training purposes. the usual that i've seen is that while they retain the right to use the data on free tiers, it's almost never the case for paid tiers

sotrusting · 2025-12-15T00:42:19 1765759339

Right, so totally cool to ignore the law but our TOS is a binding contract.

mc32 · 2025-12-15T00:47:27 1765759647

Yes, they can be sued for breach of contract. And it’s not a regular ToS but a signed MSA and other legally binding documents.

blibble · 2025-12-15T00:55:45 1765760145

the license on my open source code is a contract, and they ignored that

if they can get away with it (say by claiming it's "fair use"), they'll ignore corporate ones too

LPisGood · 2025-12-15T07:21:17 1765783277

If I were to go out on a limb, those companies spend more on tech companies than you and they have larger legal teams than you. That is a carrot and a stick for AI companies to follow the contract.

blibble · 2025-12-15T15:32:29 1765812749

no, it's not an incentive to follow the contract

it's an incentive to pretend as if you're following the contract, which is not the same thing

protocolture · 2025-12-15T00:48:59 1765759739

Where are they ignoring the law?

sotrusting · 2025-12-15T01:24:01 1765761841

https://www.reuters.com/business/environment/musks-xai-opera...

protocolture · 2025-12-15T04:18:25 1765772305

Thats an allegation. Doesnt an allegation need to be tested?

yieldcrv · 2025-12-15T01:09:19 1765760959

people that say this tend to have a misinterpretation of copyright, and use all the court cases brought by large rights holders as validation

despite all 3 branches of the government disagreeing with them over and over again

torginus · 2025-12-15T09:57:41 1765792661

I bet companies are circumventing this in a way that allows them to derive almost all the benefit from your data, yet makes it very hard to build a case against them.

For example, in RL, you have a train set, and a test set, which the model never sees, but is used to validate it - why not put proprietary data in the test set?

I'm pretty sure 99% of ML engineers would say this would constitute training on your data, but this is an argument you could drag out in courts forever.

Or alternatively - it's easier to ask for forgiveness than permission.

I've recently had an apocalyptic vision, that one day we'll wake up, an find that AI companies have produced an AI copy of every piece of software in existence - AI Windows, AI Office, AI Photoshop etc.

Oarch · 2025-12-15T01:47:38 1765763258

Given the conduct we've seen to date, I'd trust them to follow the letter - but not the spirit - of IP law.

There may very well be clever techniques that don't require directly training on the users' data. Perhaps generating a parallel paraphrased corpus as they serve user queries - one which they CAN train on legally.

The amount of value unlocked by stealing practically ~everyone's lunch makes me not want to put that past anyone who's capable of implementing such a technology.

bdangubic · 2025-12-15T01:30:40 1765762240

it is amazing in almost 2026 there is anyone believing this… amazing

GCUMstlyHarmls · 2025-12-15T00:41:29 1765759289

I wonder how much wiggle there is for collect now (to provide service, context history, etc), then later anonymise (some how, to some level) and then train on it?

Also I wonder if the ToS covers "queries & interaction" vs "uploaded data" - I could imagine some tricky language in there that says we wont use your word document, but we may at some time use the queries you put against it, not as raw corpus but as a second layer examining what tools/workflows to expand/exploit.

danielheath · 2025-12-15T02:31:45 1765765905

“We don’t train on your data” doesn’t exclude metadata, training on derived datasets via some anonymisation process, etc.

There’s a range of ways to lie by omission, here, and the major players have established a reputation for being willing to take an expansive view of their legal rights.

matt-p · 2025-12-15T01:12:06 1765761126

Even if they're were doing this (I highly doubt it) so much would be lost to distillation I'm not convinced there would be much that actually got in, apart from perhaps internal codenames or whatever which will be obvious.

kankerlijer · 2025-12-15T01:52:00 1765763520

Well, perhaps this is naive of me from the perspective of not fully understanding the training process. However, at some point, with all available training data having been exhausted, gains with synthetic data exhausted, and a large pool of publicly available AI generated code, at what point is it 'smart' to scrape codebases from what you identify as high quality code based, clean it up to remove identifiers, and use that for training a smaller model?

phendrenad2 · 2025-12-15T01:39:36 1765762776

Ironically (for you), copilot is the one provider that is doing a good job of provably NOT training on user data. The rest are not up to speed on that compliance angle, so many companies ban them (of course, people still use them).

Aurornis · 2025-12-15T01:57:30 1765763850

Do you have a source for this?

There are claims all through this thread that “AI companies” are probably doing bad things with enterprise customer data but nobody has provided a single source for the claim.

This has been a theme on HN. There was a thread a few weeks back where someone confidently claimed up and down the thread that Gemini’s terms of service allowed them to train on your company’s customer data, even though 30 seconds of searching leads to the exact docs that say otherwise. There is a lot of hearsay being spread as fact, but nobody actually linking to ToS or citing sections they’re talking about.

phendrenad2 · 2025-12-15T16:00:16 1765814416

Sources aren't hard to find[1]. But getting software developers to look outside their idiot-savant caves and not dismiss the entire legal system as "unrealistic", is much harder to accomplish.

[1] - https://www.microsoft.com/en-us/trust-center/privacy/data-ma...

gaigalas · 2025-12-15T00:46:47 1765759607

What kind of rare proprietary knowledge?

Oarch · 2025-12-15T01:51:12 1765763472

It could be a wide range of things depending on your field: highly particular materials, knowledge or processes that give your products or services a particular edge, and which a company has often incurred high R&D costs to discover.

Many businesses simply couldn't afford to operate without such an edge.

Aurornis · 2025-12-15T00:31:19 1765758679

Using an LLM on data does not ingest that data into the training corpus. LLMs don’t “learn” from the information they operate on, contrary to what a lot of people assume.

None of the mainstream paid services ingest operating data into their training sets. You will find a lot of conspiracy theories claiming that companies are saying one thing but secretly stealing your data, of course.

Retric · 2025-12-15T00:48:59 1765759739

Companies have already shifting from not using customer data to giving them an option to opt out ex:

“How can I control whether my data is used for model training?

If you are logged into Copilot with a Microsoft Account or other third-party authentication, you can control whether your conversations are used for training the generative AI models used in Copilot. Opting out will exclude your past, present, and future conversations from being used for training these AI models, unless you choose to opt back in. If you opt out, that change will be reflected throughout our systems within 30 days.” https://support.microsoft.com/en-us/topic/privacy-faq-for-mi...

At this point suggesting it has never and will her happen is wildly optimistic.

Aurornis · 2025-12-15T01:43:41 1765763021

An enterprise Copilot contract will have already decided this for the organization.

Retric · 2025-12-15T03:10:17 1765768217

That possibility in no way address the underlying concern here.

olyjohn · 2025-12-15T02:59:56 1765767596

30 days to opt out? That's skeezy as fuck.

leptons · 2025-12-15T00:34:31 1765758871

> LLMs don’t “learn” from the information they operate on, contrary to what a lot of people assume.

Nothing is really preventing this though. AI companies have already proven they will ignore copyright and any other legal nuisance so they can train models.

lioeters · 2025-12-15T00:39:14 1765759154

They're already using synthetic data generated by LLMs to further train LLMs. Of course they will not hesitate to feed "anonymized" data generated by user interactions. Who's going to stop them? Or even prove that it's happening. These companies have already been allowed to violate copyright and privacy on a historic global scale.

Archelaos · 2025-12-15T00:42:06 1765759326

How should they dinstinguish between real and fake data? It would be far to easy to pollute their models with nonesense.

leptons · 2025-12-15T05:33:31 1765776811

I have no doubt that Microsoft has already classified the nature of my work and quality of my code. Of course it's probably "anonymized". But there's no doubt in my mind that they are watching everything you give them access to, make no mistake.

tick_tock_tick · 2025-12-15T00:48:02 1765759682

I mean is it really ignoring copyright when copyright doesn't limit them in anyway on training?

leptons · 2025-12-15T05:25:02 1765776302

Tell that to all the people suing them for using their copyrighted work. In some cases the data was even pirated.

Aurornis · 2025-12-15T01:52:11 1765763531

> Nothing is really preventing this though

The enterprise user agreement is preventing this.

Suggesting that AI companies will uniquely ignore the law or contracts is conspiracy theory thinking.

leptons · 2025-12-15T05:31:39 1765776699

It already happened.

"Meta Secretly Trained Its AI on a Notorious Piracy Database, Newly Unredacted Court Docs Reveal"

https://www.wired.com/story/new-documents-unredacted-meta-co...

They even admitted to using copyrighted material.

"‘Impossible’ to create AI tools like ChatGPT without copyrighted material, OpenAI says"

https://www.theguardian.com/technology/2024/jan/08/ai-tools-...

cess11 · 2025-12-15T10:36:49 1765795009

Though the porn they copied was just for personal use, because clearly that's an important perk of being employed there:

https://www.vice.com/en/article/meta-says-the-2400-adult-mov...

lwhi · 2025-12-15T00:50:49 1765759849

Information about the way we interact with the data (RLHF) can be used to refine agent behaviour.

While this isn't used specifically for LLM training, it can involve aggregating insights from customer behaviour.

Aurornis · 2025-12-15T01:50:13 1765763413

That’s a training step. It requires explicitly collecting the data and using it in the training process.

Merely using an LLM for inference does not train it on the prompts and data, as many incorrectly assume. There is a surprising lack of understanding of this separation even on technical forums like HN.

lwhi · 2025-12-15T11:11:12 1765797072

That's definitely a fair point.

However, let's say I record human interactions with my app; for example when a user accepts or rejects an AI sythesised answer.

This data can be used by me, to influence the behaviour of an LLM via RAG or by altering application behaviour.

It's not going to change the weighting of the model, but it would influence its behaviour.

AuthAuth · 2025-12-15T00:36:29 1765758989

They are not directly ingesting the data into their trainning sets but they are in most cases collecting it and will be using it to train future models.

Aurornis · 2025-12-15T01:58:43 1765763923

Do you have any source for this at all?

AuthAuth · 2025-12-15T18:09:07 1765822147

Its stated in the private policy.

nerdponx · 2025-12-15T00:33:48 1765758828

If they weren't, then why would enterprise level subscriptions include specific terms stating that they don't train on user provided data? There's no reason to believe that they don't, and if they don't now then there's no reason to believe that they won't later whenever it suits them.

Aurornis · 2025-12-15T01:45:51 1765763151

> then why would enterprise level subscriptions include specific terms stating that they don't train on user provided data?

What? That’s literally my point: Enterprise agreements aren’t training on the data of their enterprise customers like the parent commenter claimed.

TheRoque · 2025-12-15T01:13:04 1765761184

Just read the ToS of the LLM products please

doctorpangloss · 2025-12-15T01:16:20 1765761380

This is so naive. The ToS permits paraphrasing of user conversations, by not excluding it, and then training on THAT. You’d never be able to definitively connected paraphrased data to yours, especially if they only train on paraphrased data that covers frequent, as opposed to rare, topics.

Aurornis · 2025-12-15T01:50:34 1765763434

Do it have a citation for this?

doctorpangloss · 2025-12-15T04:14:48 1765772088

“Hey DoctorPangloss, how can we train on user data without training on user data?”

“You can use an LLM to paraphrase the incoming requests and save that. Never save the verbatim request. If they ask for all the request data we have, we tell them the truth, we don’t have it. If they ask for paraphrased data, we’d have no way of correlating it to their requests.”

“And what would you say, is this a 3 or a 5 or…”

Everything obvious happens. Look closely at the PII management agreements. Btw OpenAI won’t even sign them because they’re not sure if paraphrasing “counts.” Google will.

Aurornis · 2025-12-15T01:58:06 1765763886

I have. Have you? Can you quote the sections you’re talking about?

TheRoque · 2025-12-15T07:03:37 1765782217

https://www.anthropic.com/news/updates-to-our-consumer-terms

"We will train new models using data from Free, Pro, and Max accounts when this setting is on (including when you use Claude Code from these accounts)."

fzeroracer · 2025-12-15T00:38:25 1765759105

> You will find a lot of conspiracy theories claiming that companies are saying one thing but secretly stealing your data, of course.

It's not really a conspiracy when we have multiple examples of high profile companies doing exactly this. And it keeps happening. Granted I'm unaware of cases of this occuring currently with professional AI services but it's basic security 101 that you should never let anything even have the remote opportunity to ingest data unless you don't care about the data.

james_marks · 2025-12-15T00:54:32 1765760072

> never let anything even have the remote opportunity to ingest data unless you don't care about the data

This is objectively untrue? Giants swaths of enterprise software is based on establishing trust with approved vendors and systems.

Aurornis · 2025-12-15T01:46:38 1765763198

> It's not really a conspiracy when we have multiple examples of high profile companies doing exactly this.

Do you have any citations or sources for this at all?

mulquin · 2025-12-15T01:05:45 1765760745

To be pedantic, it is still a conspiracy, just no longer a theory.

rightbyte · 2025-12-15T19:28:31 1765826911

To be pedantic, a theory that has been proven correct is still a theory, right?

popalchemist · 2025-12-15T01:40:02 1765762802

Wrong, buddy.

Many of the top AI services use human feedback to continuously apply "reinforcement learning" after the initial deployment of a pre-trained model.

https://en.wikipedia.org/wiki/Reinforcement_learning_from_hu...

Aurornis · 2025-12-15T01:48:28 1765763308

RLHF is a training step.

Inference (what happens when you use an LLM as a customer) is separate from training.

Inference and training are separate processes. Using an LLM doesn’t train it. That’s not what RLHF means.

popalchemist · 2025-12-15T02:18:11 1765765091

I am aware, I've trained my own models. You're being obtuse.

The big companies - take Midjourney, or OpenAI, for example - take the feedback that is generated by users, and then apply it as part of the RLHF pass on the next model release, which happens every few months. That's why they have the terms in their TOS that allow them to do that.

agumonkey · 2025-12-15T01:12:06 1765761126

maybe prompts are enough to infer the rest ?

sotrusting · 2025-12-15T00:39:38 1765759178

[flagged]

protocolture · 2025-12-15T00:53:53 1765760033

>Ah yes, blindly trusting the corpo fascists that stole the entire creative output of humanity to stop now.

Stealing implies the thing is gone, no longer accessible to the owner.

People aren't protected from copying in the same way. There are lots of valid exclusions, and building new non competing tools is a very common exclusion.

The big issue with the OpenAI case, is that they didn't pay for the books. Scanning them and using them for training is very much likely to be protected. Similar case with the old Nintendo bootloader.

The "Corpo Fascists" are buoyed by your support for the IP laws that have thus far supported them. If anything, to be less "Corpo Fascist" we would want more people to have more access to more data. Mankind collectively owns the creative output of Humanity, and should be able to use it to make derivative works.

Oarch · 2025-12-15T01:55:11 1765763711

> Stealing implies the thing is gone, no longer accessible to the owner.

Isn't this a little simplistic?

If the value of something lies in its scarcity, then making it widely available has robbed the owner of a scarcity value which cannot be retrieved.

A win for consumers, perhaps, but a loss for the owner nonetheless.

protocolture · 2025-12-15T04:21:21 1765772481

No calling every possible loss due to another persons actions "Stealing" is simplistic. We have terms for all these things, like "intellectual property infringement".

Trying to group (Thing I dont like) with (Thing everyone doesnt like) is an old semantic trick that needs to be abolished. Taxonomy is good, if your arguments are good, you dont need emotively charged imprecise language.

Oarch · 2025-12-15T05:24:06 1765776246

I literally reused the definition of stealing you gave in the post above.

sotrusting · 2025-12-15T01:18:45 1765761525

> Stealing implies the thing is gone, no longer accessible to the owner.

You know a position is indefensible when you equivocation fallacy this hard.

> The "Corpo Fascists" are buoyed by your support for the IP laws

You know a position is indefensible when you strawman this hard.

> If anything, to be less "Corpo Fascist" we would want more people to have more access to more data. Mankind collectively owns the creative output of Humanity, and should be able to use it to make derivative works.

Sounds about right to me, but why you would state that when defending slop slingers is enough to give me whiplash.

> Scanning them and using them for training is very much likely to be protected.

Where can I find these totally legal, free, and open datasets all of these slop slingers are trained on?

protocolture · 2025-12-15T06:38:26 1765780706

>You know a position is indefensible when you equivocation fallacy this hard.

No its quite defensible. And if that was equivocation, you can simply outline that you didn't mean to invoke the specific definition of stealing, but were just using it for its emotive value.

>You know a position is indefensible when you strawman this hard.

Its accurate. No one wants thes LLM guys stopped more than other big fascistic corporations, plenty of oppositional noise out there for you to educate yourself with.

>Sounds about right to me, but why you would state that when defending slop slingers is enough to give me whiplash.

Cool, so if you agree all data should usable to create derivative works then I don't see what your complaint is.

>Where can I find these totally legal, free, and open datasets all of these slop slingers are trained on?

You invoked "strawman" and then hit me with this combo strawman/non sequitur? Cool move <1 day old account, really adds to your 0 credibility.

I literally pointed out they should have to pay the same access fee as anyone else for the data, but once obtained, should be able to use it any way. Reading the comment explains the comment.

Unless, charitably, you are suggesting that if a company is legally able to purchase content, and use it as training data, that somehow compels them to release that data for free themselves?

Weird take if true.

Oarch · 2025-12-14T06:46:30 1765694790

I had the same sense. I can see why people like the shows, but to me there's a subtle arrogance to the rich, white American guy just holding court everywhere he goes and explaining local matters as if he's an expert. The food aspect of his shows was often secondary.

rdtsc · 2025-12-14T07:41:56 1765698116

> I had the same sense. I can see why people like the shows, but to me there's a subtle arrogance to the rich, white American guy just holding court everywhere he goes and explaining local matters as if he's an expert. The food aspect of his shows was often secondary.

My most memorable moment from the show was when Bourdain visited some poor farmer to see how they were harvesting yuca (or maybe yams, I forgot) and he went into the typical (I am paraphrasing) "oh look, this is the life, so perfect being one with nature, etc...". And the farmer shut him up pretty quickly with something like "How about a trade: you stay here and farm yams in the rain, in the perfect unity with the nature, and I go to live in your apartment in New York?"

jorvi · 2025-12-14T14:30:36 1765722636

Jon Krakauer pointed something similar out in relation to the native people around Everest attaining a higher quality of life (and thus more Western lifestyle) as more and more commercial Everest expeditions started happening.

Climbing tourists would be complaining that the local culture was being destroyed and that the huts they would visit would have the local kids be wearing, say, a fashion shirt and the huts themselves had amenities like a heater instead of burning dung for heat.

Basically, wealthy climber tourists wanted these people to live in stasis in a lifestyle of poverty just so the atmosphere of quaint mountain life was maintained for them. Almost like an open-air museum.

somenameforme · 2025-12-15T04:00:24 1765771224

I think there's a more charitable way to interpret their perspective, as well as that of Bourdain. Climbing Everest is pain, suffering, and a fairly significant chance at death. And practically speaking, to even try to do it in modern times you generally need to be wealthy. So why are these people doing it? Because wealth doesn't provide contentment or satisfaction in life in and of itself. It's people searching for a meaning and purpose in life.

And so when you see people who live lives that are indeed much harder, but for whom there seems to be true meaning and purpose, there's going to be some major internal conflicts in seeing them striving to push that away to pursue something that one knows leads to just vapidness and emptiness in the end. Obviously you might argue that wealth need not trend towards the end of culture, but scarce is the society with a rich culture and a rich economy. Does it even exist?

Like don't you see a paradox in effectively equating a higher quality of life and a more Western lifestyle, when in the West a vast (and rising) percent of people are drugged out on various psychotropic pharmaceuticals just to make it through day to day. Yet look at poorer cultures and it's not like 1 in 6 people are walking around with untreated mental conditions - they simply seem to be far healthier from a mental, to say nothing of physical, perspective.

So I think wealth and quality of life have a far more nuanced relationship than most appreciate. And the ostensible subset relationship (a rich man can easily become poor if he so chooses, but the other way around is much more difficult) is not so simple. Many people are endlessly addicted to things that they genuinely believe make their life worse, and that they could easily cast away, yet find it difficult to do so. See: social media. And obviously casting away wealth is going to be many orders of magnitude more difficult than something like social media.

djmips · 2025-12-15T11:57:01 1765799821

nice essay, food for thought. Addiction to wealth.

PunchyHamster · 2025-12-14T07:16:21 1765696581

It's always funny when I watch stuff about some foreigner visiting my home country and they either focus on something not all that important, or get something completely wrong.

The funniest part is trying to present some dish as "traditional" that everyone here eats, while it's some super niche thing only one region does, occasionally, if you have grandma that remembers how to make it

michtzik · 2025-12-14T19:40:21 1765741221

If you only eat it when you have a grandma that remembers how to make it, I would consider that the very definition of "traditional". And also interesting to hear about!

(But yea, perhaps not "everyone here eats" in that case. And yet, if everyone knew what it was -- even if it's "what grandma used to eat" -- I'd even let that slide. I don't eat what my grandparents ate, but I know more about it than a foreigner.)

Oarch · 2025-12-05T11:47:07 1764935227

Always has been.

Oarch · 2025-12-02T23:53:20 1764719600

I enjoyed this far too much!

Oarch · 2025-12-02T10:15:13 1764670513

I was like you once...

takes long drag from cigarette

Oarch · 2025-11-29T18:49:23 1764442163

I enjoyed this thoughtful write up. It's a vitally important area for good, transparent work to be done.

Oarch · 2025-11-23T08:39:55 1763887195

Google released Genie 3 back in August, which seemed more compelling than this. I was surprised by how little fanfare it received.

Oarch · 2025-11-18T16:23:02 1763482982

Neigh, a Trojan Horse.

collingreen · 2025-11-18T16:44:20 1763484260

This pun made me actually laugh out loud. I almost lost some coffee.

gnarlouse · 2025-11-18T17:20:26 1763486426

Something something, "my kingdom for a horse." Obligatory upvote, wp