Many are speculating that Sam Altman could just move on and create another OpenAI 2.0 because he could easily attract talent and investors.
What this misses is all the regulatory capture that he’s been campaigning for. All the platforms have now closed their gardens. Authors and artists are much more vigilant about copyright etc. So it’s now a totally different game compared to 3 years ago because the data is not just there up for grabs anymore.
"Why can't Sam Altman (who can have the funding, talent, and vision OpenAI has right now) just create OpenAI 2.0?" is an amazing question whose answer also tells you what OpenAI's moat is.
People speculated it was the funding, attracting talent, or having "access". Turns out it was none of them (obviously they all play a part, but having all three doesn't mean you can beat OpenAI, which points to the fundamental reason it is so hard to compete with them).
Call it BoringAI or LibreAI and it's just a fork where you ripped out all the old, bad stuff. (This joke doesn't really work because OpenAI is not old enough for legacy cruft and isn't actually open enough to be forked.)
I don't think getting training data is that hard, even now. The biggest platforms that locked down their public APIs still use internal ones for their mobile apps, which can easily be reverse engineered to find keys or undocumented endpoints (or, in Reddit's case, an entirely different internal API with fewer limits and a lot more info leaks...).
Text really doesn't take up that much space, and on top of that it compresses pretty well.
The entire English language Wikipedia is only around 60GB in a format that can be readily searched and randomly accessed (ZIM), for example: https://kiwix.org/
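As a rough illustration of how well plain text compresses, here's a toy sketch using Python's stdlib gzip. The sample repeats one sentence, which inflates the ratio well past what real prose achieves (typically ~3-4x with gzip), so treat the printed number as an upper bound:

```python
import gzip

# Toy demo that plain text compresses well. The repeated sentence makes the
# input highly redundant, so the ratio here is far better than real prose.
text = ("The quick brown fox jumps over the lazy dog. " * 200).encode("utf-8")
compressed = gzip.compress(text)
print(f"{len(text)} bytes -> {len(compressed)} bytes "
      f"({len(text) / len(compressed):.1f}x smaller)")
```

ZIM goes further than plain gzip by chunking and indexing the compressed data so individual articles stay randomly accessible.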
Does Kiwix actually work? I see people hyping it here but I could never get it to actually, y'know, download the file and display the wikipedia on my phone.
Kiwix worked for me. IIRC there may be difficulties opening an archive that was downloaded outside of the mobile app, but archives downloaded in-app were fine.
For the mobile app I used one of the smaller Wikipedia subsets, since I didn't want to take up too much space on my phone. The full offline Wikipedia download is saved to my laptop.
Assuming the Reddit app does not use certificate pinning, you can use your computer to provide internet to your phone and then use an app like Charles Proxy to inspect requests being made from an app. Pretty easy to reverse engineer the API.
If the app does use certificate pinning, then you can use an Android phone and a modified app that removes the logic that enforces certificate pinning. This is more involved but also not impossible.
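For the inspection step, a minimal mitmproxy addon can stand in for Charles Proxy. This is a sketch, not a confirmed recipe: the hostnames below are assumptions for illustration, and it presumes the phone's Wi-Fi proxy settings point at the machine running mitmproxy (run with `mitmdump -s log_api.py`):

```python
# Hypothetical hostnames for illustration, not confirmed Reddit endpoints.
API_HOSTS = {"oauth.reddit.com", "gateway.reddit.com"}

def request(flow):
    # mitmproxy invokes this hook for every intercepted HTTP(S) request,
    # letting you log the methods and paths an app actually uses.
    if flow.request.host in API_HOSTS:
        print(flow.request.method, flow.request.host + flow.request.path)
```

The hook uses duck typing, so the script needs no imports; mitmproxy discovers `request` by name when the addon is loaded.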
That does not sound like the proper way to do an OpenAI 2.0. If Reddit ever hears that's how an AI company scraped them, that company will get sued for fun and profit.
It's essentially impossible to prove in court that training data was obtained or used improperly unless you go and tell on yourself. And even then, it requires you to make someone with a lot of money mad, or to not have enough money yourself. Microsoft would certainly have caught a lot of flak for training their models on every GitHub repo; instead they got a minor paddling in the public eye that went away after not much time had passed.
The point is that the data is easily accessible. If you wanted to get your hands on the data while simultaneously keeping them clean, contract with a Russian contracting company to give you a data dump. You don't need to know how they got it.
They make a point of never directly asking for the crime when they do that. They just increase pressure on subcontractors until corners get cut, the law included.
It is harder to prove to a "should have known" standard than, say, buying stolen speakers off the back of a truck for 20% of the list price.
There’s an implicit assumption in your argument that you’re going to directly ask for a crime to be committed. Why are you assuming that? You’ll go to a contractor and say “we want Reddit data.” Anyone with even mild technical competence can figure out how to get it.
LLMs know the contents of books because those books are analyzed, reviewed, and talked about everywhere. Pick some obscure book that doesn't show up on any social media and ask about its contents. GPT won't have a clue.
Did you read the article? (This one misstates the case, but look at the linked one about the lawsuit.) This is a lawsuit; nothing has been proven. The burden of proof is on you.
I'm building a magazine encyclopedia and I would estimate that 99.9% of all magazines ever published are not available electronically. And that the content in magazines probably exceeds the content in books by an order of magnitude.
I know this is getting off-topic, but as a non-native speaker, I'm interested in hearing why a third data point would be needed to judge whether things differ "by an order of magnitude".
I was under the impression that "an order of magnitude" meant "one more digit", meaning very roughly a 10x difference. "a >= 10*b" can be determined without the need of a third data point. Is there some other meaning to the phrase I haven't come across?
Not the original poster, but you have it more or less correct. An order of magnitude is 10X, and the plural "orders of magnitude" refers to at least 100X. Colloquially, "orders of magnitude" just means "significantly more/less."
What has been crawled stays crawled, and there are plenty of copies of token sets that can be used to retrain a model. For a bit of money you can probably get any set that you really want ("a bit" meaning billions, but that's pocket change for anything that is going to go head to head with OpenAI).
Is the logic here that training a base model the way OpenAI did in the past is no longer as easy, or even possible, and that their trained model is valuable because, even with all the code and experience, it couldn't be reproduced today under the new restrictions?
The data has to come from somewhere, and all of the outlets that were used to train ChatGPT, stable diffusion, etc. have since been locked down. Any new company that Sam Altman makes in the AI space won't be competing just on merits of talent and product, they will also need to pay for and negotiate access to data.
I'd actually expect this to get far worse going forward, now that other organizations have an idea of how valuable their data is. It's also trivial to justify locking it down under the guise of protecting people, privacy, etc.
OpenAI has enough momentum and has built enough of a moat that Sam Altman cannot replicate it. If he actually could replicate it and overtake OpenAI, then the business itself has no legs: it would be easily commoditized and any moat nullified in no time.
I've never considered this angle, but god it'd be hilarious if this ended up being the case, the dude ruining everything because of his own greed ultimately fucking himself over because of it.
Here's to hoping there's still some poetic irony left to dish out in the world.
You are assuming he wouldn't steal it from OpenAI. He could have a low-level employee steal it and manage to keep it a secret until AGI is born, then he takes over the world.
This is a pretty wild comment. That's a very safe assumption: no low-level employee will do Sam's bidding in an illegal enterprise, and keeping it a secret isn't going to work either. Whether or not AGI is 'born' (and who will bear it) is an open question to which I hope the answer is 'not for a while', because we haven't even figured out how to get humans to cooperate, which I think should be a prerequisite.
> no low-level employee will do Sam's bidding in an illegal enterprise
Many people have betrayed their country to foreign governments in exchange for mere thousands of dollars. It is never safe to rule out the willingness of employees to engage in corporate espionage, even in exchange for truly pitiful rewards. It would be a stupid idea, but that doesn't mean it won't happen.