
Super hand-waving rough estimate, going off of five points of reference / examples that all point in roughly the same direction:

1. It looks like they scale up by about ~100-200x on the x axis when showing that test-time result.
2. The o1-mini post [1] has an "inference cost" plot where you can see GPT-4o and GPT-4o mini as dots in the bottom corner, haha (you can extract the x values from the SVG; I've done so below).
3. There's a video showing the "speed" in the chat UI (3s vs. 30s).
4. The pricing page [2].
5. Their API docs about reasoning quantify "reasoning tokens" [3].

First, from the original plot, we have roughly 2 orders of magnitude to cover (~100-200x).

Next, from the cost plot: super hand-waving guess, but since 5.77 / 0.32 ≈ 18, and the known price ratio for gpt-4o vs gpt-4o-mini is ~20-30x, the plot's x axis roughly lines up with real pricing. Taking o1's x values (~355-635) at face value, this implies that o1 costs ~1000x as much as gpt-4o-mini for inference (not because of model size, just because of the raw number of chain-of-thought tokens it produces). So my first "statement" is that I trust the "Math performance vs Inference Cost" plot on the o1-mini page to accurately represent the cost of inference for these benchmark runs. That gives us a set of relative cost numbers between the o1 and 4o models.

I'm also going to assume that o1 is roughly the same size as 4o, and from that plus the SVG values, estimate that they did a "net" decoding of roughly ~100x more tokens for the o1 benchmarks in total (5.77 vs 354.77-635, i.e. ~60-110x).
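As a quick sanity check on those ratios, a few lines of Python over the x values I extracted from the SVG (listed under [1] below):

  # Relative-cost ratios from the "Math performance vs Inference Cost" SVG [1].
  # Values are the extracted x coordinates; o1 spans a range of two points.
  costs = {
      "gpt-4o-mini": 0.3175,
      "gpt-4o": 5.7785,
      "o1": (354.7745, 635.0),
  }

  print(costs["gpt-4o"] / costs["gpt-4o-mini"])      # ~18x  (4o vs 4o-mini)
  print(costs["o1"][0] / costs["gpt-4o-mini"])       # ~1100x (o1 vs 4o-mini)
  print(costs["o1"][0] / costs["gpt-4o"],            # ~61x ...
        costs["o1"][1] / costs["gpt-4o"])            # ... to ~110x (o1 vs 4o)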

Next, the CoT examples they gave us: they show a CoT preview where (for the math example) it says "...more lines cut off...". A quick copy-paste of what they did include comes to ~10k tokens (not sure how reliable copy-pasting is, though), and from the ciphertext example I got ~5k tokens of CoT, while the final response is only ~800 tokens. So the shown examples imply roughly a ~10x blow-up in decoded tokens relative to the response. It's possible these are "middle of the pack" / average-quality examples rather than the full CoT decoding they claim to use (i.e., on the log-scale plot these would sit in the middle, at roughly 5k-10k tokens of chain of thought). This also feels reasonable given that their API docs [3] put explicit limits on "reasoning_tokens" (which they also count and report).
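(For the token counts above, I just pasted the shown CoT text into a tokenizer; something like this with tiktoken, using the gpt-4o encoding as a stand-in since whatever o1 actually uses isn't public:)

  import tiktoken

  # Count tokens in a copy-pasted CoT example. The gpt-4o encoding (o200k_base)
  # is an assumption here; the filename is just a placeholder for the pasted text.
  enc = tiktoken.encoding_for_model("gpt-4o")

  with open("cipher_cot_paste.txt") as f:
      cot_text = f.read()

  print(len(enc.encode(cot_text)))  # ~5k for the cipher example, ~10k for the math one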

All together, the CoT examples, pricing page, and reasoning docs imply that the reasoning itself can vary in length by about ~100x (2 orders of magnitude): e.g. ~500 or ~5k tokens (from the examples) up to 65,536 tokens of reasoning output (directly called out as a maximum output token limit).

Taking them at their word that "pass@1" is honest and they are not doing k-ensembles, I think the only reasonable assumption is that they're decoding the CoT for longer. Given the model's roughly ~128k context limit, I suspect the "top end" of the plot is ~100k tokens of chain-of-thought self-reflection.

Finally, at around 100 tokens per second (gpt-4o decoding speed), my guess for the "benchmark" decoding time per test prompt ranges from ~10 seconds on the low end up to ~16 minutes at the top end (a full 100k-token CoT decode, one shot). So for the log-scale x axis, my estimate would be ~3-10 seconds at the bottom, and 100-200x that value at the top.
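The arithmetic behind that range, spelled out (the ~100 tok/s decode speed is just my eyeball estimate of gpt-4o):

  # Back-of-envelope wall-clock time for a single rollout at ~100 tokens/sec.
  decode_tok_per_s = 100

  low_end_tokens = 1_000       # a short CoT -> ~10 seconds
  high_end_tokens = 100_000    # near the ~128k context limit -> ~16.7 minutes

  print(low_end_tokens / decode_tok_per_s)         # 10.0 seconds
  print(high_end_tokens / decode_tok_per_s / 60)   # ~16.7 minutes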

All together, to answer your question: I think the 80% accuracy result took about ~10-15 minutes to complete. I also believe that the per-token decoding cost of the o1 model is very close to that of 4o; it just requires many more reasoning tokens to finish (and o1-mini is comparable to 4o-mini, also requiring more reasoning tokens).

[1] https://openai.com/index/openai-o1-mini-advancing-cost-effic...

  Extracting "x values" from the SVG:
  GPT-4o-mini: 0.3175
  GPT-4o: 5.7785
  o1: (354.7745, 635)
  o1-preview: (278.257, 325.9455)
  o1-mini: (66.8655, 147.574)
[2] https://openai.com/api/pricing/

  gpt-4o:
  $5.00 / 1M input tokens
  $15.00 / 1M output tokens

  o1-preview:
  $15.00 / 1M input tokens
  $60.00 / 1M output tokens
[3] https://platform.openai.com/docs/guides/reasoning

  usage: {
    total_tokens: 1000,
    prompt_tokens: 400,
    completion_tokens: 600,
    completion_tokens_details: {
      reasoning_tokens: 500
    }
  }
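Putting [2] and [3] together, a rough sketch of pricing a single o1-preview call (assuming reasoning tokens are billed at the output-token rate, which is how I read the docs):

  # Price the example usage object above with the o1-preview rates from [2].
  # Assumes the 500 reasoning tokens are billed inside completion_tokens.
  usage = {"prompt_tokens": 400, "completion_tokens": 600}

  input_price = 15.00 / 1_000_000    # $ per input token
  output_price = 60.00 / 1_000_000   # $ per output token

  cost = (usage["prompt_tokens"] * input_price
          + usage["completion_tokens"] * output_price)
  print(f"${cost:.4f}")  # $0.0420 for this small example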


Some other follow-up reflections:

1. I wish these y-axes were logit instead of linear, to help see power-law scaling on 0->1 measures (there's a small matplotlib sketch of what I mean after point 2). In this case (20% -> 80%) it doesn't really matter, but for other papers (e.g. [2] below) it would make the power-law behavior much easier to see.

2. The power-law behavior of inference compute seems to be showing up in multiple ways now: both in ensembles [1,2] and now in o1. If it comes purely from decoding self-reflection tokens, the scaling has a built-in limit, bounded by the context length. I think this implies (and I am betting) that relying more on multiple parallel decodings is the more scalable route, provided you have a good critic/evaluator.
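For point 1, a minimal matplotlib sketch of what I mean by a logit y-axis (the numbers are made up, purely illustrative):

  import matplotlib.pyplot as plt

  # Accuracy vs. inference compute on log-x / logit-y axes, so power-law-ish
  # improvement in a 0->1 measure shows up as roughly a straight line.
  compute = [1, 10, 100, 1000]          # made-up x values
  accuracy = [0.20, 0.45, 0.70, 0.88]   # made-up y values (must stay in (0, 1))

  fig, ax = plt.subplots()
  ax.plot(compute, accuracy, marker="o")
  ax.set_xscale("log")
  ax.set_yscale("logit")                # instead of the usual linear y axis
  ax.set_xlabel("inference compute")
  ax.set_ylabel("accuracy")
  plt.show()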

Back on point 2: for now, instead of assuming they're doing any ensembling like top-k or self-critic + retries, a single rollout with increasing token count does seem to roughly match all the numbers, so that's my best bet. I hypothesize we'd see continued improvement (in the same power-law way, fundamentally along the "flops" axis) if these longer CoT responses were combined with an ensemble strategy: parallel decoding plus some critic/voting/choice step. That has the benefit of increasing flops (which I believe is what the inference power-law is really over) without necessarily increasing latency.
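In hypothetical pseudocode-ish Python, the combination I'm imagining looks like this (generate() and score() are placeholder stand-ins, not any real API):

  from concurrent.futures import ThreadPoolExecutor

  # Sketch: several long-CoT rollouts decoded in parallel, then a critic picks one.
  # generate() and score() are hypothetical placeholders.
  def solve(prompt, n_samples=8, max_cot_tokens=50_000):
      def one_rollout(_):
          return generate(prompt, max_tokens=max_cot_tokens)   # one long CoT decode

      with ThreadPoolExecutor(max_workers=n_samples) as pool:
          candidates = list(pool.map(one_rollout, range(n_samples)))

      # Extra flops are spent in parallel (more samples) rather than serially
      # (an even longer single CoT), so latency stays roughly constant.
      return max(candidates, key=lambda c: score(prompt, c))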

[1] https://arxiv.org/abs/2402.05120

[2] https://arxiv.org/abs/2407.21787


oh, they do talk about it

  On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
This shows that as they increase the ensemble size k, accuracy keeps going up, all the way to 93% when re-ranking 1000 samples.
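The "consensus among 64 samples" presumably means something like majority voting over extracted final answers; a rough sketch (sample() and extract_answer() are hypothetical placeholders):

  from collections import Counter

  # consensus@k: sample k independent solutions, extract each final answer,
  # and return the majority vote. sample() / extract_answer() are placeholders.
  def consensus(prompt, k=64):
      answers = [extract_answer(sample(prompt)) for _ in range(k)]
      return Counter(answers).most_common(1)[0][0]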


I'd be curious to know whether ensemble size is another scaling dimension for compute, alongside "thinking time".



