One thing that makes me skeptical is the lack of specific labels on the first two accuracy graphs. They just say it's a "log scale", without giving even a ballpark on the amount of time it took.
Did the 80% accuracy test results take 10 seconds of compute? 10 minutes? 10 hours? 10 days? It's impossible to say with the data they've given us.
The coding section indicates "ten hours to solve six challenging algorithmic problems", but it's not clear to me if that's tied to the graphs at the beginning of the article.
The article contains a lot of facts and figures, which is good! But it doesn't inspire confidence that the authors chose to obfuscate the data in the first two graphs in the article. Maybe I'm wrong, but this reads a lot like they're cherry picking the data that makes them look good, while hiding the data that doesn't look very good.
> Did the 80% accuracy test results take 10 seconds of compute? 10 minutes? 10 hours? 10 days? It's impossible to say with the data they've given us.
The gist of the answer is hiding in plain sight: it took so long, on an exponential cost function, that they couldn't afford to explore any further.
The better their max demonstrated accuracy, the more impressive this report is. So why stop where they did? Why omit actual clock times, or some cost proxy for them, from the report? Obviously, it's because continuing was impractical, and because those times/costs were already so large that they'd unfavorably affect how people respond to this report.
See also: them still sitting on Sora seven months after announcing it. They've never given any indication whatsoever of how much compute it uses, so it may be impossible to release in its current state without charging an exorbitant amount of money per generation. We do know from people who have used it that it takes between 10 and 20 minutes to render a shot, but how much hardware is being tied up during that time is a mystery.
It's also entirely possible they are simply sincere about their fear that it may be used to influence the upcoming US election.
Plenty of people (me included) are sincerely concerned about the way even mere still image generators can drown out the truth with a flood of good-enough-at-first-glance fiction.
If they were sincere about that concern, they wouldn't build it at all. If it's ever made available to the public, it will eventually be available during an election; it's not like the 2024 US presidential election is the end of history.
The risk is not “interfering with the US elections”, but “being on the front page of everything as the only AI company interfering with US elections”. This would destroy their peacocking around AGI/alignment while raising billions from pension funds.
OpenAI is in a very precarious position. Maybe they could survive that hit in four years, but it would be fatal today. No unforced errors.
I think the hope is that by the next presidential election no one will trust video anymore anyway, so the new normal won't be as chaotic as if it dropped in the middle of an already contentious election.
As for not building it at all: it's the obvious next step for generative AI models, so if they don't make it, someone else will anyway.
I'd give it about 20 years before humanoid robots can be indistinguishable from originals without an x-ray or similar — covering them in vat-grown cultures of real human skin etc. is already possible but the robots themselves aren't good enough to fool anyone.
Unfortunately that would mean two things: firstly, only swing states would get to hear what politicians are actually saying, and secondly, to reach everyone the primary process would have to start even earlier so that candidates would have a chance to give enough speeches before early voting.
Even if Kamala wins (praise be to god that she does), those people aren't just going to go away until social media does. Social media is the cause of a lot of the conspiracy theory mania.
So yeah, better to never release the model...even though Elon would in a second if he had it.
But this cat got out of the bag years ago, didn't it? Trump himself is using AI-generated images in his campaign. I'd go even further: the more fake images appear, the faster society as a whole will learn to distrust anything by default.
Their public statements say that the only way to safely learn how to deal with the things AI can do is to show what it can do and get feedback from society:
"""We want to successfully navigate massive risks. In confronting these risks, we acknowledge that what seems right in theory often plays out more strangely than expected in practice. We believe we have to continuously learn and adapt by deploying less powerful versions of the technology in order to minimize “one shot to get it right” scenarios.""" - https://openai.com/index/planning-for-agi-and-beyond/
I don't know if they're actually correct, but it at least passes the sniff test for plausibility.
Isn't this balloon video shared by OpenAI? How is it not counted? For the others I don't have evidence, but this balloon video case is enough to cast doubt.
As someone experienced with operations / technical debt / weird company-specific nonsense (Platform Engineer): no, you have to solve nuclear fusion at <insert-my-company>. You gotta do it over and over again. If it were that simple we wouldn't have even needed AI; we would have hand-written a few things, and then everything would have been legos, and legos of legos. But it takes a LONG time to find new true legos.
Yeah you’re right, all businesses are made of identical, interchangeable parts that we can swap out at our leisure.
This is why enterprises change ERP systems frictionlessly, and why the field of software engineering is no longer required. In fact, given that apparently, all business is solved, we can probably just template them all out, call it a day and all go home.
Yeah, but that's not a Lego. A Lego is something that fits everywhere else, not just with previous work. There's a lot of previous work. There are very few true Legos.
AlphaFold simulated the structure of over 200 million proteins. Among those, there could be revolutionary ones that could change the medical scientific field forever, or they could all be useless. The reasoning is sound, but that's as far as any such tool can get, and you won't know it until you attempt to implement it in real life. As long as those models are unable to perfectly recreate the laws of the universe to the maximum resolution imaginable and follow them, you won't see an AI model, let alone a LLM, provide anything of the sort.
With these methods the issue is the log scale of compute. Let's say you ask it to solve fusion. It may be able to solve it, but the problem is that it's unverifiable which answer was correct.
So it may generate 10 billion answers to fusion and only 1-10 of them are correct.
There would be no way to know which one is correct without first knowing the answer to the question.
This is my main issue with these methods. They assume the future via RL, and then when the model gets it right they mark that.
We should really be looking at methods that measure the percentage of the time it was wrong, rather than whether it was right a single time.
Which is why it is incredibly depressing that OpenAI will not publish the raw chain of thought.
“Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.”
Maybe they will enable showing the CoT for limited use, like 5 prompts a day for Premium users, or for Enterprise users with an agreement not to steal the CoT or something like that.
If OpenAI sees this: please allow users to see the CoT for a few prompts per day, or add it to Azure OpenAI for Enterprise customers with legal clauses not to steal the CoT.
Imagine if this tech was available in the middle ages and it was asked to 'solve' alchemy or perpetual motion, and responded that it was an impossible problem... people would (irrationally from our perspective) go Luddite on it I suspect. Now apply to the 'fusion power' problem.
The new thing that can do more at the "ceiling" price doesn't remove your ability to still use the 100x cheaper tokens for the things that were doable on that version.
That exact pattern is always true of technological advance. Even for a pretty broad definition of technology. I'm not sure if it's perfectly described by the name "induced demand" but it's basically the same thing.
- At the high end, there is a likely nonlinear relationship between answer quality and compute.
- We've gotten used to a flat-price model. With AGI-level models, we might have to pay more for more difficult and more important queries. Such is the inherent complexity involved.
- All this stuff will get better and cheaper over time, within reason.
I'd say let's start by celebrating that machine thinking of this quality is possible at all.
I don't think it's worth any debate. You can simply find out how it does for you, now(-ish, rolling out).
In contrast: Gemini Ultra, the best, non-existent Google model for the past few months now, which people are nonetheless happy to extrapolate excitement over.
When one axis is on a log scale and the other is linear, and the plot points appear roughly linear, doesn't that mean there's a roughly exponential relationship between the two axes?
It'd be more accurate to call it a logarithmic relationship, since compute time is our input variable. Which itself is a bit concerning, as that implies that modest gains in accuracy require exponentially more compute time.
In either case, that still doesn't excuse not labeling your axis. Taking 10 seconds vs 10 days to get 80% accuracy implies radically different things on how developed this technology is, and how viable it is for real world applications.
Which isn't to say a model that takes 10 days to get an 80% accurate result can't be useful. There are absolutely use cases where that could represent a significant improvement on what's currently available. But the fact that they're obfuscating this fairly basic statistic doesn't inspire confidence.
> Which itself is a bit concerning, as that implies that modest gains in accuracy require exponentially more compute time
This is more of what I was getting at. I agree they should label the axis regardless, but I think the scaling relationship is interesting (or rather, concerning) on its own.
The absolute time depends on hardware, optimizations, exact model, etc; it's not a very meaningful number to quantify the reinforcement technique they've developed, but it is very useful to estimate their training hardware and other proprietary information.
A linear graph with a log scale on the vertical axis means the original graph had near exponential growth.
A linear graph with a log scale on the horizontal axis means the original graph had the law of diminishing returns kicking in (somewhat similar to logarithmic growth, but with a vertical asymptote).
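To make that concrete, a toy numeric sketch (the coefficients here are invented, not taken from the article's plots):

```python
import numpy as np

# Toy illustration with invented numbers (not the article's data): if accuracy
# looks linear against log10(compute), then accuracy ~= a + b * log10(compute),
# so every fixed accuracy gain costs a constant *multiplicative* factor of compute.
a, b = 0.20, 0.15                       # hypothetical intercept and slope
compute = np.array([1, 10, 100, 1000])  # arbitrary compute units
accuracy = a + b * np.log10(compute)    # -> 0.20, 0.35, 0.50, 0.65

for c, acc in zip(compute, accuracy):
    print(f"compute={c:>4} -> accuracy={acc:.2f}")
# Each +0.15 of accuracy costs 10x the compute: a straight line on a log-x plot
# is exactly the "exponentially more compute for linear gains" situation.
```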
Super hand-waving rough estimate: Going off of five points of reference / examples that sorta all point in the same direction.
1. Looks like they scale up by about ~100-200x on the x axis when showing that test-time result.
2. Based on the o1-mini post [1], there's an "inference cost" plot where you can see GPT-4o and GPT-4o mini as dots in the bottom corner, haha (you can extract the X values; I've done so below).
3. There's a video showing the "speed" in the chat ui (3s vs. 30s)
4. The pricing page [2]
5. On their API docs about reasoning, they quantify "reasoning tokens" [3]
First, from the original plot, we have roughly 2 orders of magnitude to cover (~100-200x)
Next, from the cost plots: super hand-waving guess, but since 5.77 / 0.32 = ~18, and the relative cost for gpt-4o vs gpt-4o-mini is ~20-30x, this roughly lines up. This implies that o1 costs ~1000x as much as gpt-4o-mini for inference (not due to model cost, just due to the raw number of chain-of-thought tokens it produces). So my first "statement" is that I trust the "Math performance vs Inference Cost" plot on the o1-mini page to accurately represent the "cost" of inference for these benchmark tests. This gives us a relative set of "cost" numbers between the o1 and 4o models.
I'm also going to assume that o1 is inherently roughly the same size as 4o, and from that and the SVG I'll roughly estimate that they did a "net" decoding of ~100x for the o1 benchmarks in total (5.77 vs (354.77 - 635)).
Next, from the CoT examples they gave us: they actually show a CoT preview where (for the math example) it says "...more lines cut off...". A quick copy-paste of what they did include comes to ~10k tokens (not sure how reliable copy-paste is, though), and from the cipher-text example I got ~5k tokens of CoT while there are only ~800 in the response. So this implies ~10x as many decoded tokens as response tokens in the examples shown. It's possible these are "middle of the pack" / "average quality" examples rather than the "full CoT reasoning decoding" they claim they use (e.g., on the log-scale plot these would come from the middle, essentially 5k or 10k tokens of chain of thought). This also feels reasonable given that they show some limits on "reasoning_tokens" (which they also count) in their API [3].
All together, the CoT examples, the pricing page, and the reasoning page all imply that the reasoning itself can vary in length by about ~100x (2 orders of magnitude), e.g., from 500 or 5k tokens (seen in the examples) up to 65,536 tokens of reasoning output (directly called out as a maximum output token limit).
Taking them at their word that "pass@1" is honest and they are not doing k-ensembles, I think the only reasonable thing to assume is that they're decoding their CoT for "longer times". Given the roughly ~128k context size limit for the model, I suspect the "top end" of this plot is ~100k tokens of "chain of thought" self-reflection.
Finally, at around 100 tokens per second (gpt-4o decoding speed), this leaves my guess for their "benchmark" decoding time at the "top-end" to be between ~16 minutes (full 100k decoding CoT, 1 shot) for a single test-prompt, and ~10 seconds on the low end. So for that X axis on the log scale, my estimate would be: ~3-10 seconds as the bottom X, and then 100-200x that value for the highest value.
All together, to answer your question: I think the 80% accuracy result took about ~10-15 minutes to complete.
I also believe that the "decoding cost" of o1 model is very close to the decoding cost of 4o, just that it requires many more reasoning tokens to complete. (and then o1-mini is comparable to 4o-mini, but also requiring more reasoning tokens)
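A quick sanity check of that decode-time arithmetic (both the ~100 tokens/s speed and the token budgets are my own assumptions, not anything OpenAI has published):

```python
# Back-of-envelope check of the timing guess above. Both inputs are assumed:
# ~100 tokens/s decoding speed (roughly gpt-4o-class) and chain-of-thought
# budgets ranging from ~1k up to ~100k tokens.
TOKENS_PER_SECOND = 100.0

def decode_time(cot_tokens: int, tps: float = TOKENS_PER_SECOND) -> str:
    seconds = cot_tokens / tps
    return f"{cot_tokens:>7,} CoT tokens -> ~{seconds:.0f} s (~{seconds / 60:.1f} min)"

for budget in (1_000, 10_000, 65_536, 100_000):
    print(decode_time(budget))
# 1k tokens is ~10 s, 100k tokens is ~17 min, which is where the
# "~10-15 minutes for the 80% result" guess lands.
```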
1. I wish that Y-axes would switch to logit instead of linear, to help show power-law scaling on these 0->1 measures (a minimal sketch follows after this list). In this case (20% -> 80%) it doesn't really matter, but for other papers (e.g. [2] below) it would make the power-law behavior much easier to see.
2. The power law behavior of inference compute seems to be showing up now in multiple ways. Both in ensembles [1,2], as well as in o1 now. If this is purely on decoding self-reflection tokens, this has a "limit" to its scaling in a way, only as long as the context length. I think this implies (and I am betting) that relying more on multiple parallel decodings is more scalable (when you have a better critic / evaluator).
For now, instead of assuming they're doing any ensemble like top-k or self-critic + retries, the single rollout with increasing token size does seem to roughly match all the numbers, so that's my best bet. I hypothesize we'd see a continued improvement (in the same power-law sort of way, fundamentally along with the x-axis of "flop") if we combined these longer CoT responses, with some ensemble strategy for parallel decoding and then some critic/voting/choice. (which has the benefit of increasing flops (which I believe is the inference power-law), while not necessarily increasing latency)
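On point 1, a minimal matplotlib sketch of what a logit y-axis looks like; the curve here is synthetic, just to show the transform:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch only: plot pass rate on a logit axis instead of a linear one.
# The data is made up; the point is that scaling behavior near 0 and 1
# is much easier to read on a logit scale.
compute = np.logspace(0, 4, 20)                      # arbitrary compute units
pass_rate = 1.0 / (1.0 + (compute / 50.0) ** -0.7)   # synthetic saturating curve

fig, ax = plt.subplots()
ax.plot(compute, pass_rate, marker="o")
ax.set_xscale("log")
ax.set_yscale("logit")  # matplotlib's built-in logit scale; values must lie in (0, 1)
ax.set_xlabel("test-time compute (arbitrary units)")
ax.set_ylabel("pass rate")
plt.show()
```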
> On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
This shows that as they increase the ensemble size k, they can keep pushing accuracy higher, all the way up to 93% with 1000 samples.
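For intuition, a rough sketch of what "consensus among 64 samples" and "re-ranking 1000 samples" mean as techniques; `generate` and `score` are hypothetical placeholders, and none of this reflects OpenAI's actual pipeline:

```python
from collections import Counter
from typing import Callable

def consensus(generate: Callable[[], str], k: int = 64) -> str:
    """Majority vote: sample k final answers and return the most common one."""
    answers = [generate() for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

def rerank(generate: Callable[[], str], score: Callable[[str], float], n: int = 1000) -> str:
    """Best-of-n: sample n candidates and keep the one a learned scorer rates highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)
```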
Yeah, this hiding of the details is a huge red flag to me. Even if it takes 10 days, it’s still impressive! But if they’re afraid to say that, it tells me they are more concerned about selling the hype than building a quality product.
It's not AGI - it's tree of thoughts, driven by some RL-derived heuristics.
I suppose what this type of approach provides is better prediction/planning by using more of what the model learnt during training, but it doesn't address the model being able to learn anything new.
It'll be interesting to see how this feels/behaves in practice.
There are already 440 nuclear reactors operating in 32 countries today.
Sam Altman owns a stake in Oklo, a small modular reactor company. Bill Gates has a huge stake in his TerraPower reactor company. In China, 5 reactors are being built every year. You just don't hear about it... yet.
No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather. It's like range anxiety in an electric car. If you have N days of battery storage and the sun doesn't shine for N+1 days, you're in trouble.
Nuclear fission is safe, clean, secure, and reliable.
An investor might consider buying physical uranium (via ticker SRUUF in America) or buying Cameco (via ticker CCJ).
Cameco is the dominant Canadian uranium mining company that also owns Westinghouse. Westinghouse licenses the AP1000 pressurized water reactor used at Vogtle in the U.S. as well as in China.
Hey, I got a random serious comment about nuclear power :-)))
To your point:
> No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather.
Like nuclear winter caused by a nuclear power plant blowing up and everyone confusing the explosion with the start of a nuclear war? :-p
On a more serious note:
> No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather. It's like range anxiety in an electric car. If you have N days of battery storage and the sun doesn't shine for N+1 days, you're in trouble.
We still have hydro plants, wind power, geothermal, long distance electrical transmission, etc. Also, what's "doesn't shine"? Solar panels generate power as long as it's not night and it's never night all the time around the world.
Plus they're developing sodium batteries, if you want to put your money somewhere, put it there. Those will be super cheap and they're the perfect grid-level battery.
> No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather.
Sure there is, let's do some math. Just like we can solve all of the Earth's energy needs with a solar array the size of Lithuania or West Virginia, we can do some simple math to see how many batteries we'd need to protect a solar grid.
Let's say the sun doesn't shine for an entire year. That seems like a large enough N such that we won't hit N+1. If the sun doesn't shine for an entire year, we're in some really serious trouble, even if we're still all-in on coal.
Over 1 year, humanity uses roughly 24,000 terawatt-hours of energy. Let's assume batteries are 100% efficient storage (they're not) and that we're using lithium-ion batteries, which we'll say have an energy density of 250 watt-hours per liter (Wh/L). The math then says we need 96 km³ of batteries to protect a solar grid from the sun not shining for an entire year.
Thus, the amount of batteries needed to protect a solar grid is 1.92 quadrillion 18650 batteries, or a cube 4.6 kilometers along each side. This is about 24,000 years' worth of current worldwide battery production.
That's quite a lot! If we try N = 4 months for winter, that is to say, if the sun doesn't shine at all in the winter, then we'd need 640 trillion 18650 cells, or 8,000 years of current global production, but at least this would only be 32 km³, or a cube with 3.2 km sides.
Still wildly out of reach, but this is for all of humanity, mind you.
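A rough check of the arithmetic (the ~12.5 Wh per 18650 cell and ~1 TWh/year of global cell production are my assumptions; the 24,000 TWh figure and 250 Wh/L density are the ones used above):

```python
# Sanity check of the battery math above. Per-cell capacity (~12.5 Wh for an
# 18650) and ~1 TWh/year of global cell production are assumed values.
ANNUAL_DEMAND_WH = 24_000e12           # 24,000 TWh expressed in Wh
ENERGY_DENSITY_WH_PER_L = 250          # Li-ion volumetric density from the comment
CELL_WH = 12.5                         # assumed capacity of one 18650 cell
GLOBAL_PRODUCTION_WH_PER_YEAR = 1e12   # assumed ~1 TWh/year of cell production

volume_km3 = ANNUAL_DEMAND_WH / ENERGY_DENSITY_WH_PER_L / 1e12  # liters -> km^3
cells = ANNUAL_DEMAND_WH / CELL_WH
years_of_production = ANNUAL_DEMAND_WH / GLOBAL_PRODUCTION_WH_PER_YEAR

print(f"volume: {volume_km3:.0f} km^3 (cube ~{volume_km3 ** (1 / 3):.1f} km per side)")
print(f"cells: {cells:.2e} (~{years_of_production:,.0f} years of production)")
# -> 96 km^3, cube ~4.6 km, 1.92e15 cells, ~24,000 years; divide by 3 for the
#    four-month winter case (32 km^3, 6.4e14 cells, ~8,000 years).
```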
Anyway, point is, they said Elon was mad for building the original gigafactory, but it turns out that was a prudent investment. It now accounts for some 10% of the world's lithium ion battery production and demand for lithium-ion batteries doesn't seem to be letting up.
Well, you have to take into account that if something like that were to happen, within a week we'd have curfews and rationing everywhere. So those 24,000 TWh would probably become 5,000-6,000, or something like that.
Plus we'd still have hydro, wind, geothermal, etc, etc.
It's not obviously achievable. For instance, we don't have the compute power to simulate cellular organisms of much complexity, and we haven't found efficiencies that would let us scale that.