Hacker News | anonyfox's comments

the sad part is ... it's likely true.

give 5.4 a shot - it's strange but surprisingly good for once. speaking as a daily opus user.

Used codex cli (5.4) for the first time (had never used codex or gpt for coding before - was using Opus 4.5 for everything), and it seems quite good. One thing I like is it's very focused on tests: it will just start setting up unit tests for specs without you asking (whereas Opus would never do that unless asked) - I like that and think it's generally good. One thing I don't like about GPT is that it pauses too much between tasks, even when both the immediate plan and the broader plan are already extremely well defined in agents.md. It keeps saying "the next logical task is X", and I say yeah go ahead, instead of it just proceeding to the next task, which I'd rather it do. I suppose that is a preference that should be put in some document? (agents.md?)
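For what it's worth, AGENTS.md is free-form markdown instructions, so this preference can just be stated in it directly. The wording below is illustrative, not an official schema:

```markdown
## Workflow
- When the current task is done and the next task is already defined in this
  file or in the agreed plan, proceed to it without asking for confirmation.
- Only pause to ask when a step is genuinely ambiguous or destructive
  (schema migrations, deleting files, changing public APIs).
```

Whether the agent actually honors it varies by model and harness, but it's the intended place for this kind of standing instruction.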

well I have a running model (ha!) in my head of the frontier providers that's roughly like this:

- chatgpt is kinda autistic and must follow procedures no matter what, and writes in some bland, soulless but kinda correct style. great at research, horrible at creativity, slow at getting things done but at least it gets there. good architect, mid builder, horrible designer/writer.

- claude is the sensitive diva that can produce really elegant code but has to be reminded of correctness checks and quality gates repeatedly, so it arrives at something good very fast (sometimes oneshot) but then loses time in correction loops and "those details". great overall balance, but permanent helicoptering needed or else it derails into weird loops.

- grok is the maker: super fast and on target, but doesn't think as deeply as the others. it's entirely goal/achievement focused and does just enough to get there. uniquely, it doesn't argue or self-monologue constantly about doubts or safety or ethics, but drives forward where others struggle, and faster than the others. can't concentrate for too long, but delivers fast. tons of quick edits? grok it is. "experimental" stuff that is not safe to talk about... definitely grok.

- gemini is whatever you quickly need in your GSuite, plus looking at what the others are doing and helping out with a sometimes different perspective, but beyond that it's worse than all the others.

- kimi: currently using it on the side; not bad at all so far, but nothing distinct has crystallized in my head yet.


Tried using 5.4 xhigh/codex yesterday with very narrow direction to write bazel rules for something. This is a pretty boilerplate-y task with specific requirements. All it had to do was produce a normal rule set such that one could write declarative statements to use them, just like any other language integration. It gave back a dumpster fire, just shoehorning specific imperative build scripts into starlark. Asked opus 4.6 and got a normal, sane ruleset.

5.4 seems terrible at anything that's even somewhat out-of-distribution.
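For context, the "normal sane ruleset" being asked for is the standard declarative Bazel shape: a rule implementation that declares outputs and registers actions, so BUILD files stay declarative. A minimal Starlark sketch (the "foo" language, the `//tools:fooc` compiler label, and the attribute names are all made up for illustration):

```starlark
# Hypothetical rule for an imaginary "foo" language integration.
def _foo_library_impl(ctx):
    out = ctx.actions.declare_file(ctx.label.name + ".foolib")
    ctx.actions.run(
        outputs = [out],
        inputs = ctx.files.srcs,
        executable = ctx.executable._compiler,
        arguments = [f.path for f in ctx.files.srcs] + ["-o", out.path],
    )
    return [DefaultInfo(files = depset([out]))]

foo_library = rule(
    implementation = _foo_library_impl,
    attrs = {
        "srcs": attr.label_list(allow_files = [".foo"]),
        "_compiler": attr.label(
            default = "//tools:fooc",
            executable = True,
            cfg = "exec",
        ),
    },
)
```

With that in place, a BUILD file just says `foo_library(name = "mylib", srcs = ["a.foo"])` - the declarative usage the comment describes, as opposed to imperative build scripts wrapped in starlark.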


I got it to build a stereoscopic Metal raytracing renderer of a tesseract for the Vision Pro in less than half a day.

It surprisingly went at it progressively, starting with a basic CPU renderer and working all the way to a basic special-purpose Metal shader. Now it's cutting its teeth on adding passthrough support. YMMV.


because gemini, despite what the stats say, still produces garbage once the problem gets harder. it nails lab conditions, but on messy reality or creativity or even code quality it's a far cry from opus or the latest gpt5.4, by a long shot. and it always has been. it's pretty good inside GSuite because of the integrations, but standalone it's near worthless compared to even grok-code-fast, which doesn't think much at all (but damn, it is fast). at this point google keeps throwing AI noodle pots against every wall in reach to see what sticks, which is more a kind of desperation that still works to boost wall street highscores than a streak or breakthrough - just rapid-fire shotgun launches. no one serious talks about Gemini because it's still not worth considering for real things outside shiny presentations and artificial benchmarks.

Gemini schools the other two when doing code reviews.

I used to think tokens were a commodity, but it's becoming clear that the jagged frontier is different enough, even for the easiest SWE use case, that there's room for two if not three providers of different foundation models. It isn't winner-takes-all; they're all winning together. Cursor isn't properly taking advantage of the situation yet.


My experience exactly. The more "real" the problems become, the more the other models become unsuitable compared to claude, with the sole exceptions being deepseek/kimi: speaking strictly w.r.t. metrics and basic tasks they are not better, but they are more interesting and handle odd, totally out-of-domain stuff better than the US models. An example: code I wrote for a hypercomplex sedenion-based artificial neural network broke claude so badly it started saying it is chatgpt and can't evaluate/run code. Similar experience with all US models, which are characterized by being extremely brittle at the fringes, though claude least among them. Meanwhile chinese models are less capable at cookie-cutter stuff but keep swinging when things get really weird and unusual. It's like US models optimize for the lowest minima achievable, and god help you if the distribution changes, while chinese models seem to optimize for the flattest minima, giving poorer quality across the board but far more robust behaviour.
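To make "sedenion-based" concrete: sedenions are 16-dimensional hypercomplex numbers built by repeatedly applying the Cayley-Dickson construction (reals → complex → quaternions → octonions → sedenions). A minimal sketch of the multiplication such code would rest on - not the poster's actual code, just an illustration of why this sits out of distribution:

```python
# Cayley-Dickson arithmetic. An element is a list of floats whose length is a
# power of two; 16 components = a sedenion.

def cd_conj(a):
    """Conjugate: negate every component except the real part."""
    return [a[0]] + [-x for x in a[1:]]

def cd_mult(a, b):
    """Multiply two Cayley-Dickson numbers of equal power-of-two dimension."""
    n = len(a)
    if n == 1:
        return [a[0] * b[0]]
    h = n // 2
    p, q = a[:h], a[h:]
    r, s = b[:h], b[h:]
    # Cayley-Dickson doubling: (p, q)(r, s) = (p r - conj(s) q, s p + q conj(r))
    left = [x - y for x, y in zip(cd_mult(p, r), cd_mult(cd_conj(s), q))]
    right = [x + y for x, y in zip(cd_mult(s, p), cd_mult(q, cd_conj(r)))]
    return left + right

def basis(i, dim=16):
    """i-th basis element e_i of the dim-dimensional algebra."""
    e = [0.0] * dim
    e[i] = 1.0
    return e
```

The quaternion subalgebra still behaves (e1·e2 = e3, i.e. i·j = k), but from octonions on multiplication is non-associative, and sedenions even have zero divisors - exactly the kind of property that breaks code (and models) assuming ordinary algebra.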

not the poster, but I guess that's kinda american thinking that actually believes voting with your wallet will make any difference in this late-stage crony capitalism in a post-facts world.

realistically: AI WILL get used in the military and for killing autonomously, like it or not, believe it or not. I am also against that in principle, but I accept the fact that my opinion just doesn't matter and practice radical acceptance of reality as-is. twitter/X is also alive and kicking, despite musk and the anti-musk hate. xAI/Grok is genuinely really good too compared to OAI/Claude - a bit different, but very good. At this point all the "outcries" feel like noise I just skip on principle. But it could turn up the fire under the OAI team to go aggressive feature/pricing-wise in order to retain/increase their userbase again, which is ... good, after all.


Well, slightly different take: it's like telling an artist the world doesn't need another song about love; those already exist and can be re-heard as needed. Put more sharply: a CRM or TODO list is a solved problem in theory, right? tons of solutions out there, even free ones. still, look at what people are building and selling - CRM and TODO-list variations. because, in fact, it's not solved, and every solution has certain tradeoffs that don't fit some people.

you're getting it backwards. anyone can get to something that looks alright in a browser... until you actually click something and it fails spectacularly, leaks secrets, doesn't scale beyond 10 users, and is such a swamp of a codebase that it prevents clean ongoing extension = hard wall for non-techies. suddenly the magical LLM stops producing results and makes things worse.

All this senior engineering experience is a critical advantage in these new times: if you are that experienced, you implicitly ask for things slightly differently and circumvent these showstoppers without even thinking. You don't even need to read the code closely - just a glimpse at the folder and scrolling through meters of files with inline "pragmatic" snippets, and you know it's wrong without even stepping through it, even if the autogenerated vanity unit tests say all green.

Don't feel let down. It's a bit like when Google sprang into existence - everyone has access and can find stuff, but knowing how to search well is an art most people don't have even today, and it makes a dramatic difference in everyday usage. That's amplified now with AI search results, which are often just convincing nonsense most people cannot see through. That intuitive feel from hard-won experience for what is "wrong", even without an instant answer for what would be "right", is increasingly the differentiator.

Anyone can force their vibe-coded app into some shape that's sufficient for their own daily use - they're used to avoiding the pitfalls of the tool they created and know are there - but as soon as some kind of scaling (scope, users, revenue, ...) is involved, true experts are needed.

Even the new agent tools, like the Claude-for-X products, ultimately perform dramatically differently in the hands of someone who knows the domain in depth.


> ChatGPT has a good name

I don't know; around here ordinary people all say "Chatty" nowadays, and even when writing the proper name, most people quite often fail to spell "gpt" right in chat.



last night I accidentally got the update to the latest iOS with this liquid glass stuff - and it's shockingly bad in every dimension. keyboard input lags, many things need MORE clicks/touches than before, weird context-menu popovers that don't even register taps 50% of the time, general lag and sluggishness and UI artifacts everywhere. It's really, really a degradation of UI/UX, even though I personally am a fan of the glass-style design in itself.


guess people upvote whatever they're interested in right now. still a lot better than the next AI slop trying to hype something.

