1673 ELO is wild If its actually true in practice, I sincerely cannot imagine a ...

usaar333 · on Sept 12, 2024

I'm not sure how well Codeforces percentiles correlate to software engineering ability. Looking at all the data, it still isn't a game changer. Key notes:

1. AlphaCode 2 was already at 1650 last year.

2. SWE-bench Verified under an agent has jumped from 33.2% to 35.8% under this model (which doesn't really matter). The full model is at 41.4%, which still isn't a game changer either.

3. It's not handling open-ended questions much better than GPT-4o.
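For a sense of scale on the 1650 vs. 1673 comparison, here is a rough sketch using the standard Elo expected-score formula. Codeforces ratings are Elo-like but not computed exactly this way, so treat this as an approximation only; `expected_score` is a name made up for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B.

    Assumption: the plain logistic Elo model with a 400-point scale.
    Codeforces uses its own Elo-like system, so this is only a rough guide.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# A 23-point gap (1673 vs. 1650) is close to a coin flip under this model:
# the higher-rated side is expected to score roughly 0.53.
gap_score = expected_score(1673, 1650)
print(f"{gap_score:.3f}")
```

Under that assumption, a 23-point difference is a small edge, not a step change.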

deisteve · on Sept 12, 2024

I think you're right, actually. Initially I got excited, but now I think OpenAI pulled the hype card again to seem relevant as they struggle to be profitable.

Claude, on the other hand, has been fantastic and seems to do similar reasoning behind the scenes with RL.

usaar333 · on Sept 12, 2024

To be fair, the model is really impressive. The question is just how economically relevant it is.

deisteve · on Sept 12, 2024

Currently my workflow is: generate some code and run it. If it doesn't run, I tell the LLM what I expected, it produces new code, and I frequently tell it how to reason about the problem.
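Roughly, that manual loop could be sketched like this. It's a minimal sketch, not a real integration: `generate` is a hypothetical stand-in for whatever LLM call you use, and `run_snippet` and `generate_until_it_runs` are names invented here. It assumes the generated code is a Python snippet whose stdout can be compared against what you expected.

```python
import subprocess
import sys
from typing import Callable, Optional, Tuple


def run_snippet(code: str, timeout: float = 10.0) -> Tuple[bool, str, str]:
    """Run a Python snippet in a separate interpreter; return (ok, stdout, stderr)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "", f"timed out after {timeout} seconds"
    return proc.returncode == 0, proc.stdout, proc.stderr


def generate_until_it_runs(
    generate: Callable[[str], str],  # hypothetical LLM call: prompt -> code
    task: str,
    expected: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Generate code, run it, and feed failures back until the output matches."""
    prompt = task
    for _ in range(max_rounds):
        code = generate(prompt)
        ok, stdout, stderr = run_snippet(code)
        if ok and stdout.strip() == expected:
            return code
        # This is the part currently done by hand: tell the model what
        # was expected and what actually happened.
        prompt = (
            f"{task}\n\nPrevious attempt:\n{code}\n\n"
            f"Expected output: {expected}\n"
            f"Actual stdout: {stdout}\nstderr: {stderr}"
        )
    return None  # gave up after max_rounds attempts
```

The function returns the first snippet whose output matches, or `None` if none does within `max_rounds`. Running untrusted generated code this way is only reasonable in a sandbox.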

O1 being in the 89th percentile would mean it should be able to think at a junior-to-intermediate level with very strong consistency.

I don't think people in the comments realize the implication of this. Previously LLMs were only able to "pattern match", but now it's able to evaluate itself (with some guidance ofc), essentially steering the software into the depths of edge cases and reasoning about them in a way that feels natural to us.

Currently I'm copying and pasting stuff and telling the LLM the results, but once O1 is available it's going to significantly lower how often I have to do that.

For example, I expect it to self-evaluate the code it generates and think at a higher level.

ex) "oooh, looks like this user shouldn't be able to escalate privileges in this case because it would lead to security issues, or it could conflict with the code I generated 3 steps ago. I'll fix it myself."