Many are speculating that Sam Altman could just move on and create another OpenAI 2.0 because he could easily attract talent and investors.
What this misses is all the regulatory capture that he’s been campaigning for. All the platforms have now closed their gardens. Authors and artists are much more vigilant about copyright etc. So it’s now a totally different game compared to 3 years ago because the data is not just there up for grabs anymore.
"Why can't Sam Altman (who can have the funding, talent, and vision OpenAI has right now) just create OpenAI 2.0?" is an amazing question whose answer also tells you what OpenAI's moat is.
People speculated it was the funding, attracting talent, or having "access". Turns out it was none of them (obviously they all play a part, but having all three doesn't mean you can beat OpenAI, which points to the fundamental reason it is so hard to compete with them).
Call it BoringAI or LibreAI and it's just a fork where you ripped out all the old, bad stuff. (This joke doesn't really work because OpenAI is not old enough for legacy cruft and isn't actually open enough to be forked.)
I don't think getting training data is that hard, even now. The biggest platforms that locked down their public APIs still use internal ones for their mobile apps, which can easily be reverse engineered to find keys or undocumented endpoints (or, in Reddit's case, an entirely different internal API with fewer limits and a lot more info leaks...).
Text really doesn't take up that much space, and on top of that it compresses pretty well.
The entire English language Wikipedia is only around 60GB in a format that can be readily searched and randomly accessed (ZIM), for example: https://kiwix.org/
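As a rough illustration of how well plain text compresses, here's a toy sketch using Python's stdlib gzip. The sample repeats one sentence, which inflates the ratio well past what real prose achieves (typically ~3-4x with gzip), so treat the printed number as an upper bound:

```python
import gzip

# Toy demo that plain text compresses well. The repeated sentence makes the
# input highly redundant, so the ratio here is far better than real prose.
text = ("The quick brown fox jumps over the lazy dog. " * 200).encode("utf-8")
compressed = gzip.compress(text)
print(f"{len(text)} bytes -> {len(compressed)} bytes "
      f"({len(text) / len(compressed):.1f}x smaller)")
```

ZIM goes further than plain gzip by chunking and indexing the compressed data so individual articles stay randomly accessible.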
Does Kiwix actually work? I see people hyping it here but I could never get it to actually, y'know, download the file and display the wikipedia on my phone.
Kiwix worked for me. IIRC there may be difficulties opening an archive that was downloaded outside of the mobile app, but archives downloaded in-app were fine.
For the mobile app I used one of the smaller Wikipedia subsets, since I didn't want to take up too much space on my phone. The full offline Wikipedia download is saved to my laptop.
Assuming the Reddit app does not use certificate pinning, you can use your computer to provide internet to your phone and then use an app like Charles Proxy to inspect requests being made from an app. Pretty easy to reverse engineer the API.
If the app does use certificate pinning, then you can use an Android phone and a modified app that removes the logic that enforces certificate pinning. This is more involved but also not impossible.
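For the inspection step, a minimal mitmproxy addon can stand in for Charles Proxy. This is a sketch, not a confirmed recipe: the hostnames below are assumptions for illustration, and it presumes the phone's Wi-Fi proxy settings point at the machine running mitmproxy (run with `mitmdump -s log_api.py`):

```python
# Hypothetical hostnames for illustration, not confirmed Reddit endpoints.
API_HOSTS = {"oauth.reddit.com", "gateway.reddit.com"}

def request(flow):
    # mitmproxy invokes this hook for every intercepted HTTP(S) request,
    # letting you log the methods and paths an app actually uses.
    if flow.request.host in API_HOSTS:
        print(flow.request.method, flow.request.host + flow.request.path)
```

The hook uses duck typing, so the script needs no imports; mitmproxy discovers `request` by name when the addon is loaded.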
That does not sound like the proper way to do an OpenAI 2.0. If Reddit ever hears that's how an AI company scraped them, that company will get sued for fun and profit.
It's essentially impossible to prove in court that training data was obtained or used improperly unless you go and tell on yourself. And even then, it requires you to make someone with a lot of money mad, or to not have enough money yourself. Microsoft would certainly have caught a lot of flak for training their models on every GitHub repo; instead they got a minor paddling in the public eye that went away after not much time had passed.
The point is that the data is easily accessible. If you wanted to get your hands on the data while simultaneously keeping them clean, contract with a Russian contracting company to give you a data dump. You don't need to know how they got it.
They make a point of never directly asking for the crime when they do that. They just increase pressure on subcontractors until corners get cut, the law included.
It is harder to prove to a "should have known" standard than, say, buying stolen speakers off the back of a truck for 20% of the list price.
There’s an implicit assumption in your argument that you’re going to directly ask for a crime to be committed. Why are you assuming that? You’ll go to a contractor and say “we want Reddit data.” Anyone with even mild technical competence can figure out how to get it.
LLMs know the contents of books because those books are analyzed, reviewed, and talked about everywhere. Pick some obscure book that doesn't show up on any social media and ask about its contents. GPT won't have a clue.
Did you read the article? (This one misstates the case, but look at the linked one about the lawsuit.) This is a lawsuit; nothing has been proven. The burden of proof is on you.
I'm building a magazine encyclopedia and I would estimate that 99.9% of all magazines ever published are not available electronically. And that the content in magazines probably exceeds the content in books by an order of magnitude.
I know this is getting off-topic, but as a non-native speaker, I'm interested in hearing why a third data point would be needed to judge whether things differ "by an order of magnitude".
I was under the impression that "an order of magnitude" meant "one more digit", meaning very roughly a 10x difference. "a >= 10*b" can be determined without the need of a third data point. Is there some other meaning to the phrase I haven't come across?
Not the original poster, but you have it more or less correct. An order of magnitude is 10X, and the plural "orders of magnitude" refers to at least 100X. Colloquially, "orders of magnitude" just means "significantly more/less."
What has been crawled stays crawled, and there are plenty of copies of token sets that can be used to retrain a model. For a bit of money you can probably get any set that you really want ("a bit" meaning billions, but that's pocket change for anything that is going to go head to head with OpenAI).
Is the logic here that training a base model the way OpenAI did in the past is no longer as easy, or even possible, and that their trained model is valuable because, even with all the code and experience, it couldn't be reproduced today under the new restrictions?
The data has to come from somewhere, and all of the outlets that were used to train ChatGPT, stable diffusion, etc. have since been locked down. Any new company that Sam Altman makes in the AI space won't be competing just on merits of talent and product, they will also need to pay for and negotiate access to data.
I'd actually expect this to get far worse going forward, now that other organizations have an idea of how valuable their data is. It's also trivial to justify locking it down under the guise of protecting people, privacy, etc.
OpenAI has enough momentum and has built enough of a moat that Sam Altman cannot replicate it. If he actually could replicate it and overtake OpenAI, then the business itself has no legs: it would be easily commoditized and any moat nullified in no time.
I've never considered this angle, but god it'd be hilarious if this ended up being the case, the dude ruining everything because of his own greed ultimately fucking himself over because of it.
Here's to hoping there's still some poetic irony left to dish out in the world.
You are assuming he wouldn't steal it from OpenAI. He could have a low-level employee steal it and manage to keep it a secret until AGI is born, then he takes over the world.
This is a pretty wild comment. That's a very safe assumption: no low-level employee will do Sam's bidding in an illegal enterprise, and keeping it a secret isn't going to work either. Whether or not AGI is 'born' (and who will bear it) is an open question to which I hope the answer is 'not for a while', because we haven't even figured out how to get humans to cooperate, which I think should be a prerequisite.
> no low-level employee will do Sam's bidding in an illegal enterprise
Many people have betrayed their country to foreign governments in exchange for mere thousands of dollars. It is never safe to rule out the willingness of employees to engage in corporate espionage, even in exchange for truly pitiful rewards. It would be a stupid idea, but that doesn't mean it won't happen.