According to the data provided by OpenAI, that isn't true anymore. And I trust data more than anecdotal claims made by people whose job is being threatened by systems like these.
>According to the data provided by OpenAI, that isn't true anymore
OpenAI's main job is to sell the idea that their models are better than humans. I still remember when they were marketing their GPT-2 weights as too dangerous to release.
I remember that too; it's when I started following the space (shout out Computerphile/Robert Miles). IIRC the reason they gave was not "it's too dangerous because it's so badass". They were basically correct that it could produce sufficiently "human" output to break typical bot detectors on social media, which is a legitimate problem. Whether the repercussions of that failure to detect botting are meaningful enough to be considered "dangerous" is up to the reader to decide.
Also worth noting: I don't agree with the comment you're replying to, but I did want to add context to the GPT-2 situation.
What? Surely there is some area of your life you are above-average knowledgeable about. Have a conversation with ChatGPT about it, with whatever model, and you can see for yourself that it is far from expert level.
You are not "trusting data more than anecdotal claims"; you are trusting marketing over reality.
Benchmarks can be gamed. Statistics can be manipulated. Demonstrations can be cherry picked.
PS: I stand to gain heavily if AI systems can perform at an expert level; this is not a claim from someone 'whose job is being threatened'.
> For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.
Did you read the post? OpenAI clearly states that the results are cherry-picked. A single random query will have far worse results. To get equal results you need to ask the same query dozens of times and then have enough expertise to pick the best answer, which might be quite hard for a problem you have little idea about.
Combine this with the fact that this blog post is a sales pitch presenting the very best results out of probably many more benchmark runs we will never see, and it seems obvious that human experts are still several orders of magnitude ahead.
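For reference, the "test-time selection strategy" they describe boils down to best-of-N sampling with a filter and a ranker. A minimal sketch of that idea (all three callables are hypothetical placeholders, not OpenAI's actual pipeline):

    # Hedged sketch of best-of-N test-time selection: sample many
    # candidates, keep the ones that pass the public test cases (the
    # post says model-generated tests were used as a filter too),
    # rank survivors with a learned scorer, submit only the top k.
    def select_submissions(problem, generate, passes_public_tests, score,
                           n_samples=1000, k=50):
        candidates = [generate(problem) for _ in range(n_samples)]
        viable = [c for c in candidates if passes_public_tests(problem, c)]
        viable.sort(key=lambda c: score(problem, c), reverse=True)
        return viable[:k]

That filtering-and-ranking step, versus submitting 50 random samples, is what the quote credits with nearly 60 points.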
When I read that line I was very confused too, lol. I interpreted it as them saying they basically took other contestants' submissions, allowed the model to see those "solutions" as part of its context, and then had the model generate its own "solution" to be used for the benchmark. I fail to see how that counts as "solving" an IOI-level question.
What is interesting is the following paragraph in the post:
" With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy. "
So they didn't allow sampling from other contestants' solutions here? If that's the case, it's quite interesting, since the model is effectively able to brute-force questions IMO, provided you have some form of validator that tells it when to halt.
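Something like this loop, in other words (a hedged sketch; `generate` and `judge` are hypothetical stand-ins for the model and the IOI grader, not actual contest infrastructure):

    # Hypothetical sketch of the relaxed 10,000-submission setting:
    # keep sampling and submitting until the grader accepts a full
    # solution or the budget runs out. The grader is the "validator"
    # that lets the loop halt.
    def brute_force(problem, generate, judge, budget=10_000):
        best = 0
        for _ in range(budget):
            best = max(best, judge(problem, generate(problem)))
            if best == 100:  # full marks on an IOI problem
                break
        return best

With 10,000 tries per problem and the official grader as the stopping signal, even fairly low per-sample success rates are enough to clear the gold threshold.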