The bicycle frame is a bit wonky but the pelican itself is great: https://gist.g...

stkai · 2026-02-05T18:56:40 1770317800

Would love to find out they're overfitting for pelican drawings.

fdeage · 2026-02-05T23:28:49 1770334129

OpenAI claims not to: https://x.com/aidan_mclau/status/1986255202132042164

mattacular · 2026-02-06T13:01:28 1770382888

That settles it

fragmede · 2026-02-05T19:54:41 1770321281

The estimation I did 4 months ago:

> there are approximately 200k common nouns in English, and then we square that, we get 40 billion combinations. At one second per, that's ~1200 years, but then if we parallelize it on a supercomputer that can do 100,000 per second that would only take 3 days. Given that ChatGPT was trained on all of the Internet and every book written, I'm not sure that still seems infeasible.

https://news.ycombinator.com/item?id=45455786

eli · 2026-02-05T20:07:21 1770322041

How would you generate a picture of Noun + Noun in the first place in order to train the LLM with what it would look like? What's happening during that 1 estimated second?

fragmede · 2026-02-06T12:59:02 1770382742

Use any of the image generation models (eg Nanobanana, Midjourney, or ChatGPT) to generate a picture of a noun on a noun. Simonw's test is to have a Language (text) model generate a Scalar Vector Graphic, which the language model has to do by writing curves and colors, like draw a spline from point 150,100 to 200,300 of type cubic, using width 20, color orange.

In that hypothetical second is freaking fascinating. It's a denoising algorithm, and then a bunch of linear algebra, and out pops a picture of a pelican on a bicycle. Stable diffusion does this quite handily. https://stablediffusionweb.com/image/6520628-pelican-bicycle...

metalliqaz · 2026-02-05T21:51:52 1770328312

its pelicans all the way down

Terretta · 2026-02-05T20:39:20 1770323960

This is why everyone trains their LLM on another LLM. It's all about the pelicans.

AnimalMuppet · 2026-02-05T21:28:47 1770326927

But you need to also include the number of prepositions. "A pelican on a bicycle" is not at all the same as "a pelican inside a bicycle".

There are estimated to be 100 or so prepositions in English. That gets you to 4 trillion combinations.

jodrellblank · 2026-02-06T15:41:15 1770392475

The prompt was "a pelican riding a bicycle"; not prepositions but every verb. Potentially every adverb+verb combination - "a pelican clumsily pushing a bicycle"

theanonymousone · 2026-02-05T21:44:45 1770327885

Even if not intentionally, it is probably leaking into training sets.

andy_ppp · 2026-02-05T19:16:19 1770318979

Yes, Racoon on a unicycle? Magpie on a pedalo?

throw310822 · 2026-02-05T21:27:51 1770326871

Correct horse battery staple:

https://claude.ai/public/artifacts/14a23d7f-8a10-4cde-89fe-0...

Schlagbohrer · 2026-02-06T15:58:24 1770393504

That is the nastiest, ugliest horse ever

HappMacDonald · 2026-02-07T23:46:29 1770507989

wait, how do you know my pw?

ta988 · 2026-02-05T21:49:58 1770328198

no staple?

iwontberude · 2026-02-05T22:24:54 1770330294

it looks like a bodge wire

_kb · 2026-02-05T22:38:37 1770331117

Platypus on a penny farthing.

gcanyon · 2026-02-05T18:44:25 1770317065

One aspect of this is that apparently most people can't draw a bicycle much better than this: they get the elements of the frame wrong, mess up the geometry, etc.

arionmiles · 2026-02-05T20:08:32 1770322112

There's a research paper from the University of Liverpool, published in 2006 where researchers asked people to draw bicycles from memory and how people overestimate their understanding of basic things. It was a very fun and short read.

It's called "The science of cycology: Failures to understand how everyday objects work" by Rebecca Lawson.

https://link.springer.com/content/pdf/10.3758/bf03195929.pdf

devilcius · 2026-02-05T22:02:17 1770328937

There’s also a great art/design project about exactly this. Gianluca Gimini asked hundreds of people to draw a bicycle from memory, and most of them got the frame, proportions, or mechanics wrong. https://www.gianlucagimini.it/portfolio-item/velocipedia/

rcxdude · 2026-02-05T20:32:39 1770323559

A place I worked at used it as part of an interview question (it wasn't some pass/fail thing to get it 100% correct, and was partly a jumping off point to a different question). This was in a city where nearly everyone uses bicycles as everyday transportation. It was surprising how many supposedly mechanical-focused people who rode a bike everyday, even rode a bike to the interview, would draw a bike that would not work.

gcanyon · 2026-02-05T21:54:01 1770328441

I wish I had interviewed there. When I first read that people have a hard time with this I immediately sat down without looking at a reference and drew a bicycle. I could ace your interview.

throwuxiytayq · 2026-02-05T21:15:43 1770326143

This is why at my company in interviews we ask people to draw a CPU diagram. You'd be surprised how many supposedly-senior computer programmers would draw a processor that would not work.

niobe · 2026-02-05T21:25:09 1770326709

If I was asked that question in an interview to be a programmer I'd walk out. How many abstraction layers either side of your knowledge domain do you need to be an expert in? Further, being a good technologist of any kind is not about having arcane details at the tip of your frontal lobe, and a company worth working for would know that.

duped · 2026-02-06T00:26:41 1770337601

I mean gp is clearly a joke but

A fundamental part of the job is being able to break down problems from large to small, reason about them, and talk about how you do it, usually with minimal context or without deep knowledge in all aspects of what we do. We're abstraction artists.

That question wouldn't be fundamentally different than any other architecture question. Start by drawing big, hone in on smaller parts, think about edge cases, use existing knowledge. Like bread and butter stuff.

I much more question your reaction to the joke than using it as a hypothetical interview question. I actually think it's good. And if it filters out people that have that kind of reaction then it's excellent. No one wants to work with the incurious.

niobe · 2026-02-06T03:25:06 1770348306

If it was framed as "show us how you would break down this problem and think about it" then sure. If it's the gotcha quiz (much more common in my experience) then no.

But if that's what they were going for it should be something on a completely different and more abstract topic like "develop a method for emptying your swimming pool without electricity in under four hours"

kortilla · 2026-02-06T04:00:19 1770350419

It has nothing to do with “incurious”. Being asked to draw the architecture for something that is abstracted away from your actual job is a dickhead move because it’s just a test for “do you have the same interests as me?”

It’s no different than asking for the architecture of the power supply or the architecture of the network switch that serves the building. Brilliant software engineers are going to have gaps on non-software things.

gedy · 2026-02-05T21:26:42 1770326802

That's reasonable in many cases, but I've had situations like this for senior UI and frontend positions, and they: don't ask UI or frontend questions. And ask their pet low level questions. Some even snort that it's softball to ask UI questions or "they use whatever". It's like, yeah no wonder your UI is shit and now you are hiring to clean it up.

selcuka · 2026-02-06T00:13:53 1770336833

Poe's Law [1]:

> Without a clear indicator of the author's intent, any parodic or sarcastic expression of extreme views can be mistaken by some readers for a sincere expression of those views.

[1] https://en.wikipedia.org/wiki/Poe%27s_law

rsc · 2026-02-05T21:46:35 1770327995

Raises hand.

gnatolf · 2026-02-05T19:03:14 1770318194

Absolutely. A technically correct bike is very hard to draw in SVG without going overboard in details

falloutx · 2026-02-05T19:26:49 1770319609

Its not. There are thousands of examples on the internet but good SVG sites do have monetary blocks.

https://www.freepik.com/free-photos-vectors/bicycle-svg

jefftk · 2026-02-05T21:08:48 1770325728

Several of those have incorrect frames:

https://www.freepik.com/free-vector/cyclist_23714264.htm

https://www.freepik.com/premium-vector/bicycle-icon-black-li...

Or missing/broken pedals:

https://www.freepik.com/premium-vector/bicycle-silhouette-ic...

https://www.freepik.com/premium-vector/bicycle-silhouette-ve...

http://freepik.com/premium-vector/bicycle-silhouette-vector-...

gnatolf · 2026-02-05T21:41:48 1770327708

From smaller to larger nitpick, there's basically something wrong with all of the first 15 or so of these drawings. Thanks for agreeing :)

RussianCow · 2026-02-05T20:22:23 1770322943

I'm not positive I could draw a technically correct bike with pen and paper (without a reference), let alone with SVG!

nateglims · 2026-02-05T20:33:17 1770323597

I just had an idea for an RLVR startup.

cyanydeez · 2026-02-05T19:32:03 1770319923

Yes, but obviously AGI will solve this by, _checks notes_ more TerraWatts!

hackernudes · 2026-02-05T19:43:26 1770320606

The word is terawatts unless you mean earth-based watts. OK then, it's confirmed, data centers in space!

seanhunter · 2026-02-05T19:41:21 1770320481

…in space!

franze · 2026-02-05T20:43:11 1770324191

here the animated version https://claude.ai/public/artifacts/3db12520-eaea-4769-82be-7...

gryfft · 2026-02-05T20:48:05 1770324485

That's hilarious. It's so close!

einrealist · 2026-02-05T18:10:17 1770315017

They trained for it. That's the +0.1!

zahlman · 2026-02-05T21:10:09 1770325809

Do you find that word choices like "generate" (as opposed to "create", "author", "write" etc.) influence the model's success?

Also, is it bad that I almost immediately noticed that both of the pelican's legs are on the same side of the bicycle, but I had to look up an image on Wikipedia to confirm that they shouldn't have long necks?

Also, have you tried iterating prompts on this test to see if you can get more realistic results? (How much does it help to make them look up reference images first?)

simonw · 2026-02-05T23:07:49 1770332869

I've stuck with "Generate an SVG of a pelican riding a bicycle" because it's the same prompt I've been using for over a year now and I want results that are sort-of comparable to each other.

I think when I first tried this I iterated a few times to get to something that reliably output SVG, but honestly I didn't keep the notes I should ahve.

eaf7e281 · 2026-02-05T18:11:14 1770315074

There's no way they actually work on training this.

fragmede · 2026-02-05T19:49:18 1770320958

The people that work at Anthropic are aware of simonw and his test, and people aren't unthinking data-driven machines. How valid his test is or isn't, a better score on it is convincing. If it gets, say, 1,000 people to use Claude Code over Codex, how much would that be worth to Anthropic?

$200 * 1,000 = $200k/month.

I'm not saying they are, but to say that they aren't with such certainty, when money is on the line; unless you have some insider knowledge you'd like to share with the rest of the class, it seems like an questionable conclusion.

margalabargala · 2026-02-05T18:29:32 1770316172

I suspect they're training on this.

I asked Opus 4.6 for a pelican riding a recumbent bicycle and got this.

https://i.imgur.com/UvlEBs8.png

WarmWash · 2026-02-05T18:54:37 1770317677

It would be way way better if they were benchmaxxing this. The pelican in the image (both images) has arms. Pelicans don't have arms, and a pelican riding a bike would use it's wings.

ryandrake · 2026-02-05T19:43:03 1770320583

Having briefly worked in the 3D Graphics industry, I don't even remotely trust benchmarks anymore. The minute someone's benchmark performance becomes a part of the public's purchasing decision, companies will pull out every trick in the book--clean or dirty--to benchmaxx their product. Sometimes at the expense of actual real-world performance.

seanhunter · 2026-02-05T19:42:38 1770320558

Pelicans don’t ride bikes. You can’t have scruples about whether or not the image of a pelican riding a bike has arms.

jevinskie · 2026-02-05T19:48:38 1770320918

Wouldn’t any decent bike-riding pelican have a bike tailored to pelicans and their wings?

actsasbuffoon · 2026-02-05T21:23:30 1770326610

Sure, that’s one solution. You could also Isle of Dr Moreau your way to a pelican that can use a regular bike. The sky is the limit when you have no scruples.

cinntaile · 2026-02-05T20:03:54 1770321834

Now that would be a smart chat agent.

mrandish · 2026-02-05T18:39:18 1770316758

Interesting that it seems better. Maybe something about adding a highly specific yet unusual qualifier focusing attention?

TheDong · 2026-02-06T06:35:50 1770359750

I don't think that really proves anything, it's unsurprising that recumbent bicycles are represented less in the training data and so it's less able to produce them.

Try something that's roughly equally popular, like a Turkey riding a Scooter, or a Yak driving a Tractor.

riffraff · 2026-02-05T19:53:22 1770321202

perhaps try a penny farthing?

KeplerBoy · 2026-02-05T18:15:38 1770315338

There is no way they are not training on this.

collinmanderson · 2026-02-05T18:16:47 1770315407

I suspect they have generic SVG drawing that they focus on.

athrowaway3z · 2026-02-05T18:15:02 1770315302

This benchmark inspired me to have codex/claude build a DnD battlemap tool with svg's.

They got surprisingly far, but i did need to iterate a few times to have it build tools that would check for things like; dont put walls on roads or water.

What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.

As a next benchmark you could try having 1 agent and tell it to use a coding agent (via tmux) to build you a pelican.

hoeoek · 2026-02-05T18:07:12 1770314832

This really is my favorite benchmark

etwigg · 2026-02-05T22:00:30 1770328830

If we do get paperclipped, I hope it is of the "cycling pelican" variety. Thanks for your important contribution to alignment Simon!

bityard · 2026-02-05T19:37:10 1770320230

Well, the clouds are upside-down, so I don't think I can give it a pass.

beemboy · 2026-02-05T20:57:01 1770325021

Isn't there a point at which it trains itself on these various outputs, or someone somewhere draws one and feeds it into the model so as to pass this benchmark?

nine_k · 2026-02-05T19:43:51 1770320631

I suppose the pelican must be now specifically trained for, since it's a well-known benchmark.

7777777phil · 2026-02-05T18:13:30 1770315210

best pelican so far would you say? Or where does it rank in the pelican benchmark?

mrandish · 2026-02-05T18:21:13 1770315673

In other words, is it a pelican or a pelican't?

canadiantim · 2026-02-05T21:07:45 1770325665

You’ve been sitting on that pun just waiting for it to take flight

nubg · 2026-02-05T18:04:23 1770314663

What about the Pelo2 benchmark? (the gray bird that is not gray)

risyachka · 2026-02-05T20:06:33 1770321993

Pretty sure at this point they train it on pelicans

ares623 · 2026-02-05T18:02:27 1770314547

Can it draw a different bird on a bike?

simonw · 2026-02-05T18:10:37 1770315037

Here's a kākāpō riding a bicycle instead: https://gist.github.com/simonw/19574e1c6c61fc2456ee413a24528...

I don't think it quite captures their majesty: https://en.wikipedia.org/wiki/K%C4%81k%C4%81p%C5%8D

zahlman · 2026-02-05T21:14:19 1770326059

Now that I've looked it all up, I feel like that's much more accurate to a real kākāpō than the pelican is to a real pelican. It's almost as if it thinks a pelican is just a white flamingo with a different beak.

DetroitThrow · 2026-02-05T18:01:01 1770314461

The ears on top are a cute touch

6thbit · 2026-02-05T20:23:30 1770323010

do you have a gif? i need an evolving pelican gif

Kye · 2026-02-06T00:54:07 1770339247

A pelican GIF in a Pelican(TM) MP4 container.

MaysonL · 2026-02-06T02:22:31 1770344551

Except for both its legs being on the same side of the bike.

copilot_king_2 · 2026-02-05T18:17:07 1770315427

I'm firing all of my developers this afternoon.

RGamma · 2026-02-05T19:11:10 1770318670

Opus 6 will fire you instead for being too slow with the ideas.

insane_dreamer · 2026-02-05T20:58:41 1770325121

Too late. You’ve already been fired by a moltbot agent from your PHB.

behnamoh · 2026-02-05T18:35:02 1770316502

[flagged]

smokel · 2026-02-05T19:25:16 1770319516

I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.

blibble · 2026-02-05T20:50:46 1770324646

it ceases to be a useful benchmark of general ability when you post it publicly for them to train against

quinnjh · 2026-02-05T19:26:50 1770319610

the field is advancing so fast it's hard to do real science as their will be a new SOTA by the time you're ready to publish results. i think this is a combination of that and people having a laugh.

Would you mind sharing which benchmarks you think are useful measures for multimodal reasoning?

techpression · 2026-02-05T20:50:10 1770324610

A benchmark only tests what the benchmark is doing, the goal is to make that task correlate with actually valuable things. Graphic benchmarks is a good example, extremely hard to know what you will get in a game by looking at 3D Mark scores, it varies by a lot. Making a SVG of a single thing doesn’t help much unless that applies to all SVG tasks.

fullstackchris · 2026-02-05T21:21:52 1770326512

[flagged]

dang · 2026-02-05T22:18:44 1770329924

Personal attacks are not allowed on HN. No more of this, please.