Don't fall into the anti-AI hype (antirez.com)
1120 points by todsacerdoti 1 day ago | 1412 comments




I don't understand the stance that AI currently is able to automate away non-trivial coding tasks. I've tried this consistently since GPT 3.5 came out, with every single SOTA model up to GPT 5.1 Codex Max and Opus 4.5. Every single time, I get something that works, yes, but then when I start self-reviewing the code, preparing to submit it to coworkers, I end up rewriting about 70% of the thing. So many important details are subpar about the AI solution, and many times fundamental architectural issues cripple any attempt at prompting my way out of it, even though I've been quite involved step-by-step through the whole prototyping phase.

I just have to conclude 1 of 2 things:

1) I'm not good at prompting, even though I'm one of the earliest adopters of AI for coding that I know, and have been using it consistently for years. So I find this hard to accept.

2) Other people are just less picky than I am, or they have a less thorough review culture that lets subpar code slide more often.

I'm not sure what else I can take from the situation. For context, I work on a 15 year old Java Spring + React (with some old pages still in Thymeleaf) web application. There are many sub-services, two separate databases, and the application also needs a two-way interface with customer hardware. So, not a simple project, but still. I can't imagine it's way more complicated than most enterprise/legacy projects...


> non-trivial coding tasks

I’ve come back to the idea that LLMs are super search engines. If you ask one a narrow, specific question with one answer, you may well get that answer. For “non-trivial” questions, there will always be multiple answers, and which of them you get from the LLM depends on the precise words you use to prompt it. You won’t get the best answer, and in a complex scenario requiring highly recursive cross-checks, some answers you get won’t even be functional.

It’s not readily apparent at first blush the LLM is doing this, giving all the answers. And, for a novice who doesn’t know the options, or an expert who can scan a list of options quickly and steer the LLM, it’s incredibly useful. But giving all the answers without strong guidance on non-trivial architectural points— entropy. LLMs churning independently quickly devolve into entropy.


I wish LLMs were good at search. I've tried to evaluate them many times for their quality at answering research questions for astrophysics (specifically numerical relativity). If they were good at answering questions, I'd use them in a heartbeat

Without exception, every technical question I've ever asked an LLM that I know the answer to, has been substantially wrong in some fashion. This makes it just.. absolutely useless for research. In some cases I've spotted it straight up plagiarising from the original sources, with random capitalisation giving it away

The issue is that once you get even slightly into a niche, they fall apart because the training data just doesn't exist. But they don't say "sorry there's insufficient training data to give you an answer", they just make shit up and state it as confidently incorrect


> Without exception, every technical question I've ever asked an LLM that I know the answer to, has been substantially wrong in some fashion.

The other problem that I tend to hit is a tradeoff between wrongness and slowness. The fastest variants of the SOTA models are so frequently and so severely wrong that I don't find them useful for search. But the bigger, slower ones that spend more time "thinking" take so long to yield their (admittedly better) results that it's often faster for me to just do some web searching myself.

They tend to be more useful the first time I'm approaching a subject, or before I've familiarized myself with the documentation of some API or language or whatever. After I've taken some time to orient myself (even by just following the links they've given me a few times), it becomes faster for me to just search by myself.


LLMs got good at search last year. You need to use the right ones though - ChatGPT Thinking mode and Google AI mode (that's https://www.google.com/ai - which is NOT the same as regular Google's "AI overviews" which are still mostly trash) are both excellent.

I've been tracking advances in AI assisted search here - https://simonwillison.net/tags/ai-assisted-search/ - in particular:

- https://simonwillison.net/2025/Apr/21/ai-assisted-search/ - April is when they started getting good, with o3 and the various deep research tools

- https://simonwillison.net/2025/Sep/6/research-goblin/ - GPT-5 got excellent. This post includes several detailed examples, including "Starbucks in the UK don’t sell cake pops! Do a deep investigative dive".

- https://simonwillison.net/2025/Sep/7/ai-mode/ - AI mode from Google


If you nudge it towards tool use, a lot of the time it can give you better answers.

Instead of "how cheese X is usually made" "search the web and give me a summary on the ways cheese X is made"


>> at answering research questions for astrophysics

I googled for "helium 3" yesterday. Google's AI answer said that helium 3 is "primarily sourced from the moon", as if we were actively mining it there already.


There are probably thousands of scifi books where the moon has some form of helium-3 mining. Considering Google pirated and used them all for training, it makes sense that it puts it in the present tense.

I wonder how much memory and computing time goes into making them, vs. a typical "proper" LLM prompt. It's like the freebies you get with a Christmas cracker.

> I wish LLMs were good at search

The entire situation of web search for LLMs is a mess. None of the existing providers return good or usable results; and Google refuses to provide general access to theirs. As a result, all LLMs (except maybe Gemini) are severely gimped forever until someone solves this.

I seriously believe that the only real new breakthrough for LLM research can be achieved by a clean, trustworthy, comprehensive search index. Maybe someone will build that? Otherwise we’re stuck with subpar results indefinitely.


YaCy does a pretty good job, is free, and you can run it yourself, so the quality/experience is pretty much up to you. Paired with a local GPT-OSS-120b with reasoning_effort set to high, I'm getting pretty good results. I validated it with questions I do know the answer to, and it seems alright, although it could be better of course; I'm still getting better results out of GPT-5.2 Pro, which I guess is to be expected.

The point of my comment was that the AI/LLM is almost irrelevant in light of low quality search engine APIs/indexes. Is there a way to validate the actual quality and comprehensiveness of YaCY beyond anecdata?

> Is there a way to validate the actual quality and comprehensiveness of YaCY beyond anecdata?

No, because it's your own index essentially, hence the "the quality/experience is pretty much up to you" part.


Yeah, that’s not really reassuring nor indicative of its usefulness or value.

Yeah, if that's how you feel about your own abilities, then I guess that's the way it is. Not sure what that has to do with YaCy or my original comment.

An example I had last month. Some code (dealing with PDFs) in a package ran into a resource problem in production. The LLM suggested an adaptation to the segment that caused the problem, but that code pulled in 3 new non-trivial dependencies. Added constraints, and on the next iteration it dropped 1 of the 3. Pushed further and it confirmed my suggestion that the 2 remaining dependencies could be covered just by specifying an already existing parameter in the constructor.

The real problem, btw, was a bug introduced in the PDF handling package 2 versions ago that caused resource handling problems in some contexts, and the real solution was rolling back to the version before the bug.

I'm still using AI daily in my development though; as long as you sort of know what you are doing and have enough knowledge to evaluate the output, it is very much a net productivity multiplier for me.


> But giving all the answers without strong guidance on non-trivial architectural points— entropy. LLMs churning independently quickly devolve into entropy.

Typical iterative-circular process "write code -> QA -> fix remarks" works because the code is analyzable and "fix" is on average cheaper than "write", therefore the process, eventually, converges on a "correct" solution.

LLM prompting is on average much less analyzable (if at all) and therefore the process "prompt LLM -> QA -> fix prompt" falls somewhere between "does not converge" and "convergence tail is much longer".

This is consistent with the typical observation of where LLMs work better: greenfield "slap something together" implementations and modifications to well-structured, uncoupled existing codebases, both situations where convergence is easier in the first place, i.e. low existing entropy.


They don't even really do that IME. If I ask Claude or ChatGPT to generate terraform for non-trivial but by no means obscure or highly unusual setups, they almost invariably hallucinate part of the answer even if a documented solution exists that isn't even that difficult. Maybe vibe coding JavaScript is that much better, or I'm just hopeless at prompting, but I feel a few dozen lines of fairly straightforward terraform config shouldn't require elaborate prompt setups, or I can just save some brain cycles by writing it myself.

That would be true if not for LLMs making up answers where none exist.

Like, I've seen Claude go through the source code of a program, telling me (correctly!) which counters in the code return the value I need (I just wanted to look at some packet metrics), then inventing an entirely fake CLI command to extract those metrics.


>I’ve come back to the idea LLMs are super search engines.

Yes! This is exactly what it is. A search engine with a lossy-compressed dataset of most public human knowledge, which can return the results in natural language. This is the realization that will pop the AI bubble if the public could ever bring themselves to ponder it en masse. Is such a thing useful? Hell yes! Is such a thing intellegent? Certainly NO!


While I agree, I can't help but wonder: if such a "super search engine" were to have the knowledge on how to solve individual steps of problems, how different would that be from an "intelligent" thing? I mean that, instead of "searching" for the next line of code, it searches for the next solution or implementation detail, then using it as the query that eventually leads to code.

Having knowledge isn't the same as knowing. I can hold a stack of physics papers in my hand but that doesn't make me a physics professor.

LLMs possess and can retrieve knowledge but they don't understand it, and when people try to get them to do that it's like talking to a non-expert who has been coached to smalltalk with experts. I remember reading about a guy who did this with his wife so she could have fun when travelling to conferences with him!


> …can return the results in natural language.

That’s one of the most important features, though. For example, LLMs can analyze a code base and tell you how it works in natural language. That demonstrates functional understanding and intelligence - in addition to exceeding the abilities of the majority of humans in this area.

You’d need a very no-true-Scotsmanned definition of intelligence to be able to exclude LLMs. That’s not to say that they’re equivalent to human intelligence in all respects, but intelligence is not an all-or-nothing property. (If it were, most humans probably wouldn’t qualify.)


Whether LLMs are intelligent or not is not really that interesting. It's just a matter of how you define intelligence. It matters maybe to the AI CEOs and their investors because of marketing.

What matters is how useful LLMs actually are. Many people here say it is useful as advanced search engine and not that useful as your coworker. That is very useful but most likely not something the AI companies want to hear.


> You’d need a very no-true-Scotsmanned definition of intelligence to be able exclude LLMs.

The thing is that intelligence is an anthropocentric term, and it has always been defined in a no-true-Scotsman way. When we describe the intelligence of other species we do so in extremely human terms (except for dogs). For example, we consider dolphins smart when we see them play with each other, talk to each other, etc. We consider chimpanzees smart when we see them use a tool, row a boat, etc. We don’t consider an ant colony smart when it optimizes a search for food sources, only because humans don’t normally do that. The only exception here are dogs, which we consider smart when they obey us more easily.

Personally, my take on this is that intelligence is not a useful term in philosophy nor science. Describing a behavior as intelligent is kind of like calling a small creature a bug. It is useful in our day to day speech, but fails when we want to build any theory around it.


> Is such a thing intellegent [sic]? Certainly NO!

A proofreader would have caught this humorous gaffe. In fact, one just did.


I personally had the completely opposite takeaway: Intelligence, at its core, really might just be a bunch of extremely good and self-adapting search heuristics.

I don't blurt out different answers to the same question using different phrasing, I doubt any human does.

We actually do, and often - depending on who our speaker is, our relationship with them, the tone of the message, etc. Maybe our intellect is not fully an LLM, but I truly wonder how much of our dialectical skills are.

You're describing the same answer with different phrasing.

Humans do that, LLMs regularly don't.

If you phrase the question "what color is your car?" a hundred different ways, a human will get it correct every time. LLMs randomly don't, if the token prediction veers off course.

Edit:

A human also doesn't get confused at fundamental priors after a reasonable context window. I'm perplexed that we're still having this discussion after years of LLM usage. How is it possible that it's not clear to everyone?

Don't get me wrong, I use it daily at work and at home and it's indeed useful, but there's is absolutely 0 illusion of intelligence for me.


Even though I think it's true that it's lossy, I think there is more going on in an LLM neural net. Namely, when it uses tokens to produce output, the text essentially gets split into millions or billions of chunks, each with a probability attached. So in essence the LLM can do a form of pattern recognition where the patterns are the chunks, and it also enables basic operations on those chunks.

That's why I think you can work iteratively on code and change parts of the code while keeping others, because the code gets chunked and "probabilitized". It can also do semantic processing and understanding, where it can apply knowledge about one topic (like 'swimming') to another topic (like a 'swimming spaceship'; it then generates text about what a swimming spaceship would be, which is not in the dataset). It chunks it into patterns of probability and then combines them based on probability. I do think this is a lossy process though, which sucks.
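
A tiny sketch of just the chunking step, assuming the tiktoken package is installed (the model then assigns a probability to each possible next chunk given the ones so far):

  import tiktoken

  # Split text into the integer "chunks" (tokens) a model actually sees.
  enc = tiktoken.get_encoding("cl100k_base")
  tokens = enc.encode("a swimming spaceship")

  print(tokens)                             # list of token ids
  print([enc.decode([t]) for t in tokens])  # the text chunks those ids stand for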


Maybe it's looked down upon to complain about downvotes but I have to say I'm a little disappointed that there is a downvote with no accompanying post to explain that vote, especially to a post that is factually correct and nothing obviously wrong with it.

>It’s not readily apparent at first blush the LLM is doing this, giving all the answers.

Now I'm wondering if I'm prompting wrong. I usually get one answer. Maybe a few options but rarely the whole picture.

I do like the super search engine view though. I often know what I want, but e.g. work with a language or library I'm not super familiar with. So then I ask how do I do x in this setting. It's really great for getting an initial idea here.

Then it gives me maybe one or two options, but they're verbose or add unneeded complexity. Then I start probing asking if this could be done another way, or if there's a simpler solution to this.

Then I ask what are the trade-offs between solutions. Etc.

It's maybe a mix of search engine and rubber ducking.

Agents are, like for OP, a complete failure for me though. Still can't get them to not run off into a completely strange direction, leaving a minefield of subtle coding errors and spaghetti behind.


I’ve recently created many Claude skills to do repeatable tasks (architecture review, performance, magic strings, privacy, SOLID review, documentation review etc). The pattern is: when I’ve prompted it into the right state and it’s done what I want, I ask it to create a skill. I get codex to check the skill. I could then run it independently in another window etc and feed back to adjust…but you get the idea.

And almost every time it screws up we create a test, often for the whole class of problem. More recently it’s been far better behaved. Between Opus, skills, docs, generating Mermaid diagrams, and tests, it’s been a lot better. I’ve also cleaned up so much of the architecture that there’s only one way to do things. This keeps it more aligned and helps with entropy. And they’ll work better as models improve. Having a match between code, documents and tests means it’s not just relying on one source.

Prompts like this seem to work: “what’s the ideal way to do this? Don’t be pragmatic. Tokens are cheaper than me hunting bugs down years later”


Can you tell me more about how you do tests? What do they look like? What testing tools or frameworks do you use?

I'm not going to argue about how capable the models are, I personally think they are pretty capable.

What I will argue is that the LLMs are not just search engines. They have "compressed" knowledge. When they do this, they learn relations between all kinds of different levels of abstractions and meta patterns.

It is really important to understand that the model can follow logical rules and has some map of meta relationships between concepts.

Thinking of a LLM as a "search engine" is just fundamentally wrong in how they work, especially when connected to external context like code bases or live information.


A sufficiently advanced search engine might actually be indistinguishable from intelligence.

After all, until quite recently, chess engines really were quite mechanically search engines too.


I'm just saying you are doing a dis-service to yourself if that is your mental model on how current SOTA models work.

Well, it's "a search engine that applies some transformations on top of the results" doesn't sound to me as a terrible way to think about LLMs.

> can follow logical rules

This is not their strong suit, though. They can only follow through a few levels on their own. This can be improved by agent-style iterations or via invoking external tools.


Let's see how this comment ages, why don't we. I've understood where we are going, and if you look at my comment history, I have confidence that in 12 months' time one opinion will be proved out by observations and the other will not.

For the "only few levels" claim, I think this one is sort of evident from the way they work. Solving a logical problem can have an arbitrary number of steps, and in a single pass there is only so many connection within a LLM to do some "work".

As mentioned, there are good ways to counter this problem (e.g. writing a plan and then iteratively going over those less complex steps, or simply using the proper tool for the problem: e.g. a SAT solver, just "translating" the problem to and from the appropriate format).

Nonetheless, I'm always open to new information/evidence and it will surely improve a lot in a year. As for reference, to date this is my favorite description of LLMs: https://news.ycombinator.com/item?id=46561537


That's the curse of the expert. You see many of the shortcomings that someone less experienced might not even think about when they go to social media and blurt out that AI is now able to fully replace them.

Personally I've yet to see any high profile programming person (who's not directly invested into AI) endorse only coding by prompting.

Experienced coders that I follow, who do use AI tend to focus on tight and fast feedback loops, and precise edits (or maybe exploratory coding) rather than agentic fire-and-forget workflows.

Also, an interesting side note: I expected programmers I think of as highly skilled, who I know personally, to reject AI out of personal pride - that has not been the case. However, 2 criticisms I've heard consistently from this crowd (besides the thing I mentioned before) were:

- AI makes hosting and participating in coding competitions impossible, and denies them of brain-teasers and an ability to hone their skills.

- A lot of them are concerned about the ethics of training on large codebases - and consider AI plagiarism as much of an issue as artists do.


Agreed, but:

There's been a notable jump over the course of the last few months, to where I'd say it's inevitable. For a while I was holding out for them to hit a ceiling where we'd look back and laugh at the idea they'd ever replace human coders. Now, it seems much more like a matter of time.

Ultimately I think over the next two years or so, Anthropic and OpenAI will evolve their product from "coding assistant" to "engineering team replacement", which will include standard tools and frameworks that they each specialize in (vendor lock in, perhaps), but also ways to plug in other tech as well. The idea being, they market directly to the product team, not to engineers who may have specific experience with one language, framework, database, or whatever.

I also think we'll see a revival of monolithic architectures. Right now, services are split up mainly because project/team workflows are also distributed so they can be done in parallel while minimizing conflicts. As AI makes dev cycles faster that will be far less useful, while having a single house for all your logic will be a huge benefit for AI analysis.


> Ultimately I think over the next two years or so, Anthropic and OpenAI will evolve their product from "coding assistant" to "engineering team replacement"

The way I see it, there will always be a layer in the corporate organization where someone has to interact with the machine. The transitioning layer from humans to AIs. This is true no matter how high up the hierarchy you replace the humans, be it the engineers layer, the engineering managers, or even their managers.

Given the above, it feels reasonable to believe that whatever title that person has—who is responsible for converting human management's ideas into prompts (or whatever the future has the text prompts replaced by)—that person will do a better job if they have a high degree of technical competence. That is to say, I believe most companies will still want and benefit if that/those employees are engineers. Converting non-technical CEO fever dreams and ambitions into strict technical specifications and prompts.

What this means for us, our careers, or Anthropic's marketing department, I cannot say.


That reminds me of the time where 3GL languages arrived and bosses claimed they no longer needed developers, because anyone could write code in those English-like languages.

Then when mouse-based tools like Visual Basic arrived, same story, no need for developers because anyone can write programs by clicking!

Now bosses think that with AI anyone will be able to create software, but the truth is that you'll still need software engineers to use those tools.

Will we need fewer people? Maybe. But in the past 40 years we have increased developers' productivity many times over, and yet we still need more and more developers because the needs have grown faster.


This doesn't make any sense. If the business can get rid of their engineers, then why can't the user get rid of the business providing the software? Why can't the user use AI to write it themselves?

I think instead the value is in getting a computer to execute domain-specific knowledge organized in a way that makes sense for the business, and in the context of those private computing resources.

It's not about the ability to write code. There are already many businesses running low-code and no-code solutions, yet they still have software engineers writing integration code, debugging and making tweaks, in touch with vendor support, etc. This has been true for at least a decade!

That integration work and domain-specific knowledge is already distilled out at a lot of places, but it's still not trivial. It's actually the opposite. AI doesn't help when you've finally shaved the yak smooth.


> If the business can get rid of their engineers, then why can't the user get rid of the business providing the software?

A lot of businesses are the only users of their own software. They write and use software in-house in order to accomplish business tasks. If they could get rid of their engineers, they would, since then they'd only have to pay the other employees who use the software.

They're much less likely to get rid of the user employees because those folks don't command engineer salaries.


So instead of paying a human that "commands an engineer salary" then they'll be forced to pay whatever Anthropic or OpenAI commands to use their LLMs? I don't see how that's a better proposition: the LLM generates a huge volume of code that the product team (or whoever) cannot maintain themselves. Therefore, they're locked-in and need to hope the LLM can solve whatever issues they have, and if it can't, hope that whatever mess it generated can be fixed by an actual engineer without costing too much money.

Also, code is only a small piece and you still need to handle your hosting environment, permissions, deployment pipelines, etc. which LLMs / agentic workflows will never be able to handle IMO. Security would be a nightmare with teams putting all their faith into the LLM and not being able to audit anything themselves.

I don't doubt that some businesses will try this, but on paper it sounds like a money pit and you'd be better off just hiring a person.


It’s the same business model as consulting firms. Rather than hiring a few people for 65k each, a VP will bring in a consulting firm for 10M and get a bloated, half-working solution that costs even more to get working. The VP doesn’t care though because he ends up looking like a big shot in front of the other execs.

> If the business can get rid of their engineers, then why can't the user get rid of the business providing the software?

I haven't checked the stats lately, but at one point most software written was in non-tech companies for a single business. The first 1/2 of my career was spent writing in-house software for a company that did everything from custom reporting and performance tracking to scraping data off automated phone dialers. There's so much software out there that effectively has a user base of a single company.


I actually think it’s the opposite. We’ll see fewer monorepos because small, scoped repos are the easiest way to keep an agent focused and reduce the blast radius of their changes. Monorepos exist to help teams of humans keep track of things.

Could be. Most projects I've worked on tend to span multiple services though, so I think AI would struggle more trying to understand and coordinate across all those services versus having all the logic in a single deployable instance.

The way I see feature development in the future is: the PM creates a dev cluster (also much easier with a monolith), has AI implement a bunch of features to spec, the AI provides some feedback and gets input on anywhere it might conflict with existing functionality, whether eventual consistency is okay, which pieces are performance critical, etc., and provides the implementation, a bunch of tests for review, and errata about where to find observability data, design decisions considered and chosen, etc. The PM does some manual testing across various personas and products (along with PMs from those teams), has AI add feature flags, launches. The feature flag rollout ends up being the long pole, since generally the product team needs to monitor usage data for some time before increasing the rollout percentage.

So I see that kind of workflow as being a lot easier in a monolithic service. Granted, that's a few years down the road though, before we have AI reliable enough to do that kind of work.


Honestly, I think they got the low-hanging fruit already. They're bumping up against the limits of what they can do, and while it's impressive, it's not spectacular.

Maybe I'm easily impressed, but the fact that LLMs even work to output basic human-like text is bananas to me. I do understand a bit of how they work, yet it's still up there with "amazing that huge airplanes can even fly" for me.

There's no chance LLMs will be an engineering team replacement. The hallucination problem is unsolvable and catastrophic in some edge cases. Any company using such a team would be uninsurable and sued into oblivion.

Writing software is actually one of the domains where hallucinations are easiest to fix: you can easily check whether it builds and passes tests.

If you want to go further, you can even require the LLM to produce a machine checkable proof that the software is correct. That's beyond the state of the art at the moment, but it's far from 'unsolvable'.

If you hallucinate such a proof, it'll just not work. Feed back the error message from the proof checker to your coding assistant, and the hallucination goes away / isn't a problem.


  > you can easily check whether it builds and passes tests.
This link was on HN recently: https://spectrum.ieee.org/ai-coding-degrades

  "...recently released LLMs, such as GPT-5, have a much more insidious method of failure. They often generate code that fails to perform as intended, but which on the surface seems to run successfully, avoiding syntax errors or obvious crashes. It does this by removing safety checks, or by creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution."
The trend for LLM-generated code is to build and pass tests but not deliver the functionality needed.
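
A toy illustration of that failure mode (hypothetical names, not from the article): the function fakes output in the expected format instead of computing it, and the shallow test still passes.

  def summarize_sales(rows):
      # Should aggregate `rows`, but just returns a plausible-looking result
      # that matches the expected format.
      return {"total": 0, "currency": "USD"}

  def test_summarize_sales_has_expected_shape():
      result = summarize_sales([{"amount": 10}, {"amount": 5}])
      # Only checks the shape, so it passes while the functionality is missing.
      assert set(result) == {"total", "currency"}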

Also, please consider how SQLite is tested: https://sqlite.org/testing.html

The ratio between test code and the code itself is a mere 590 times (590 LOC of tests per LOC of actual code); it used to be more than 1100.

Here is notes on current release: https://sqlite.org/releaselog/3_51_2.html

Notice fixes there. Despite being one of the most, if not the most, tested pieces of software in the world, it still contains errors.

  > If you want to go further, you can even require the LLM to produce a machine checkable proof that the software is correct.
Haha. How do you reconcile a proof with actual code?

I've recently seen Opus, after struggling for a bit, implement an API by having it return JSON that includes instructions for a human to manually accomplish the task I gave it.

It proudly declared the task done.


Recent models have started to "fix" HTML issues with ugly hacks like !important. The result looks like it works, but the tech debt is considerable.

Still, it's just a temporary hindrance. Nothing a decent system prompt can't take care of until the models evolve.


> Haha. How do you reconcile a proof with actual code?

Languages like Lean allow you to write programs and proofs under the same umbrella.
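
A minimal sketch of that idea in Lean 4 (assuming a recent toolchain where the built-in omega tactic is available); if the proof were hallucinated wrongly, the compiler would simply reject it:

  -- The program and a proof about it live in the same file and are checked together.
  def double (n : Nat) : Nat := n + n

  theorem double_is_even (n : Nat) : double n % 2 = 0 := by
    unfold double
    omega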


> Haha. How do you reconcile a proof with actual code?

You can either prove your Rust code correct, or you can use a proof system that allows you to extract executable code from the proofs. Both approaches have been done in practice.

Or what do you mean?


You focused on writing software, but the real problem is the spec used to produce the software: LLMs will happily hallucinate reasonable but unintended specs, and the checker won’t save you, because after all the software created is correct w.r.t. the spec.

Also tests and proof checkers only catch what they’re asked to check, if the LLM misunderstands intent but produces a consistent implementation+proof, everything “passes” and is still wrong.


This is why every one of my coding agent sessions starts with "... write a detailed spec in spec.md and wait for me to approve it". Then I review the spec, then I tell it "implement with red/green TDD".

The premise was that the AI solution would replace the engineering team, so who exactly is writing/reviewing this detailed spec?

The premise is in progress. We are only at the beginning of the fourth year of this hype phase, and we haven't even reached AGI yet. It's obviously not perfect, and maybe it never will be, but we are not at the point yet where we can conclude which future is true. The singularity hasn't happened yet, so we are still moving at (LLM-enhanced) human speed at the moment, meaning things need time.

Well, perhaps it'll only shrink the engineering team by 95% then.

Why would you shrink the team rather than become 20x more productive as a whole?

Either way can work. It depends on what the rest of the business needs.

That's a bad premise.

Maybe, but you're responding to a thread about why AI might or might not be able to replace an entire engineering team:

> Ultimately I think over the next two years or so, Anthropic and OpenAI will evolve their product from "coding assistant" to "engineering team replacement", which will include standard tools and frameworks that they each specialize in (vendor lock in, perhaps), but also ways to plug in other tech as well.

This is the context of how this thread started, and this is the context in which DrammBA was saying that the spec problem is very hard to fix [without an engineering team].


The job of AI is to do what we tell it to do. It can't "create a spec" on its own. If it did and then implemented that spec, it wouldn't accomplish what we want it to accomplish. Therefore we the humans must come up with that spec. And when you talk about a software application, the totality of its spec written out, can be very complex, very complicated. To write and understand, and evolve and fix such a spec takes engineers, or what used to be called "system analysts".

To repeat: to specify what a "system" we want to create does is a highly complicated task, which can only be done by human engineers who understand the requirements for the system, how parts of those requirements/specs interact with other parts of the spec, and what the consequences of one (part of the) spec are for other parts of it. We must not write "impossible specs" like "draw me a round square". Maybe the AI can check whether the spec is impossible or not, but I'm not so sure of that.

So I expect that software engineers will still be in high demand, but they will be much more productive with AI than without it. This means there will be much more software because it will be cheaper to produce. And the quality of the software will be higher in terms of doing what humans need it to do. Usability. Correctness. Evolvability. In a sense the natural language-spec we give the AI is really something written in a very high-level programming-language - the language of engineers.

BTW, as I write this I realize there is no spell-checker integrated into Hacker News. (Or is there?) Why? Because it takes developers to specify and implement such a system - which must be integrated into the current HN implementation. If AI can do that for HN, it can be done, because it will be cheap enough to do it -- if HN can spell out exactly what kind of system it wants. So we do need more software, better software, cheaper software, and AI will help us do that.

A 2nd factor is that we don't really know if a spec is "correct" until we test the implemented system with real users. At that point we typically find many problems with the spec. So somebody must fix the problems with the spec, evolve the spec, and rinse and repeat the testing with real users -- the developers who understand the current spec and why it is not good enough.

AI can write my personal scripts for me surely. But writing a spec for a system to be used by thousands of humans, still takes a lot of (human) work. The spec must work for ALL users. That makes it complicated and difficult to get right.


Might be good to define the (legacy) engineering team. Instead of thinking 0/1 (ugh, almost nothing happens this way), the traditional engineering team may be replaced by something different. A team mostly of product, spec writers, and testers. IDK.

Same, and similarly something like a "create a holistic design with all existing functionality you see in tests and docs plus new feature X, from scratch", then "compare that to the existing implementation and identify opportunities for improvement, ranked by impact, and a plan to implement them" when the code starts getting too branchy. (aka "first make the change easy, then make the easy change"). Just prompting "clean this code up" rarely gets beyond dumb mechanical changes.

Given so much of the work of managing these systems has become so rote now, my only conclusion is that all that's left (before getting to 95+% engineer replacement) is an "agent engineering" problem, not an AI research problem.


In order to prove safety you need a formal model of the system and formally defined safety properties that are both meaningful and understandable by humans. These do not exist for enterprise systems

An exhaustive formal spec doesn't exist. But you can conservatively prove some properties. E.g. program termination is far from sufficient for your program to do what you want, but it's probably necessary.

(Termination in the wider sense: for example an event loop has to be able to finish each run through the loop in finite time.)

You can see e.g. Rust's or Haskell's type system as another lightweight formal model that lets you make and prove some simple statements, without having a full formal spec of the whole desired behaviour of the system.


Yeah, but with all respect, that is a totally uninteresting property in an enterprise software system where almost no software bugs actually manifest as non-termination.

The critical bugs here are related to security (DDoS attacks, authorization and authentication, data exfiltration, etc), concurrency, performance, data corruption, transactionality and so forth. Most enterprise systems are distributed or at least concurrent systems which depend on several components like databases, distributed lock managers, transaction managers, and so forth, where developing a proper formal spec is a monumental task and possibly impossible to do in a meaningful way because these systems were not initially developed with formal verification in mind. The formal spec, if faithful, will have to be huge to capture all the weird edge cases.

Even if you had all that, you need to actually formulate important properties of your application in a formal language. I have no idea how to even begin doing that for the vast majority of the work I do.

Proving the correctness of linear programs using techniques such as Hoare logic is hard enough already for anything but small algorithms. Proving the correctness of concurrent programs operating on complex data structures requires much more advanced techniques, setting up complicated logical relations and dealing with things like separation logic. It's an entirely different beast, and I honestly do not see LLMs as a panacea that will suddenly make these things scale for anything remotely close in size to a modern enterprise system.


Oh, there's lots more simple properties you can state and prove that capture a lot more, even in the challenging enterprise setting.

I just gave the simplest example I could think of.

And termination is actually a much stronger and more useful property than you make it out to be, in the face of locks and concurrency.


That is true and very useful for software development, but it doesn't help if the goal is to remove human programmers from the loop entirely. If I'm a PM who is trying to get a program to, say, catalogue books according to the Dewey Decimal system for a library, a proof that the program terminates is not going to help that much when the program is mis-categorizing some books.

Is removing the human in the loop really the goal, or is the goal right now to make the human a lot more productive? Because...those are both very different things.

I don't know what the goal for OpenAI or Anthropic really is.

But the context of this thread is the idea that the user daxfohl launched that these companies will, in the next few years, launch an "engineering team replacement" program; and then the user eru claimed that this is indeed more doable in programming than other domains because you can have specs and tests for programs in a way that you can't for, say, an animated movie.


OK, so you successfully argued that replacing the entire engineering team is hard. But you can perhaps still shrink it by 99%. To the point where a sole founder can do the remaining tech role part time.

I have no idea what will happen in a few years, maybe LLM tech will hit a wall and humans will continue to be needed in the loop. But today humans are definitely needed in the loop in some way.

Tests and proofs can only detect issues that you design them to detect. LLMs and other people are remarkably effective at finding all sorts of new bugs you never even thought to test against. Proofs are particularly fragile as they tend to rely on pre/post conditions with clean deterministic processing, but the whole concept just breaks down in practice pretty quickly when you start expanding what's going on in between those, and then there's multithreading...

Ah, most of the problem in programming is writing the tests. Once you know what you need, the rest is just typing.

I can see an argument where you can get non-programmers to create the inputs and outputs of said tests, but if they can do that, they are basically programmers.

This is of course leaving aside that half the stated use cases I hear for AI are that it can 'write the tests for you'. If it is writing the code and the tests it is pointless.


You need more than tests. Test induced design damage:

https://dhh.dk/2014/test-induced-design-damage.html


> Writing software is actually one of the domains where hallucinations are easiest to fix: you can easily check whether it builds and passes tests.

What tests? You can't trust the tests that the LLM writes, and if you can write detailed tests yourself you might as well write the damn software.


Use multiple competing LLM. Generative adversarial network style.

Cool. That sure sounds nice and simple. What do you do when the multiple LLMs disagree on what the correct tests are? Do you sit down and compare 5 different diffs to see which have the tests you actually want? That sure sounds like a task you would need an actual programmer for.

At some point a human has to actually use their brain to decide what the actual goals of a given task are. That person needs to be a domain expert to draw the lines correctly. There's no shortcut around that, and throwing more stochastic parrots at it doesn't help.


Just because you can't (yet) remove the human entirely from the loop doesn't mean that economising on the use of the human's time is impossible.

For comparison have a look at compilers: nowadays approximately no one writes their software by hand, we write a 'prompt' in something like Rust or C, and ask another computer program to create the actual software.

We still need the human in the loop here, but it takes much less human time than creating the ELF directly.


It’s not “economizing” if I have to verify every test myself. To actually validate that tests are good I need to understand the system under test, and at that point I might as well just write the damn thing myself.

This is the fundamental problem with this “AI” mirage. If I have to be an expert to validate that the LLM actually did the task I set out, and isn’t just cheating on tests, then I might as well code the solution myself.


I can see how LLMs can help with testing, but one should never compare LLMs with deterministic tools like compilers. LLMs are entirely a separate category.

Well - the end result can be garbage still. To be fair: humans also write a lot of garbage. I think in general most software is rather poorly written; only a tiny percentage is of epic prowess.

Who writes the tests?

A competing AI.

Ah, it is turtles all the way down.

Yes. But it's no different from the question of how a non-tech person can make sure that whatever their tech person tells them actually makes sense: you hire another tech person to have a look.

Who is writing the tests?

These types of comments are interesting to me. Pre-ChatGPT there were tons of posts about how so many software people were terrible at their jobs. Bugs were/are rampant. Software bugs caused high-profile issues, and likely many more we never heard about.

Today we have chatGPT and only now will teams be uninsurable and sued into oblivion? LOL


LLMs were trained on exactly that kind of code.

I use LLM's to write the majority of my code. I haven't encountered a hallucination for the better part of a year. It might be theoretically unsolvable but it certainly doesn't seem like a real problem to me.

I use LLMs whenever I'm coding, and it makes mistakes ~80% of the time. If you haven't seen it make a huge mistake, you may not be experienced enough to catch them.

I am but a lowly IC, with no notion of the business side of things. If I am an IC at, say, a FANG company, what insurance has been taken out on me writing code there?

> If I am an IC at, say, a FANG company, what insurance has been taken out on me writing code there?

Every non-trivial software business has liability insurance to cover them for coding lapses that lead to data breaches or other kinds of damages to customers/users.


> Other people are just less picky than I am

I think this is part of it.

When coding style has been established among a team, or within an app, there are a lot of extra hoops to jump through, just to get it to look The Right Way, with no detectable benefit to the user.

If you put those choices aside and simply say: does it accomplish the goal per the spec (and is safe and scalable[0]), then you can get away with a lot more without the end user ever having a clue.

Sure, there's the argument for maintainability, and vibe coded monoliths tend to collapse in on themselves at ~30,000 LOC. But it used to be 2,000 LOC just a couple of years ago. Temporary problem.

[0]insisting that something be scalable isn't even necessary imo


> When coding style has been established

It feels like you're diminishing the parent commenter's views, reducing it to the perspective of style. Their comment didn't mention style.


Style = syntax, taste, architecture choices, etc. Things you would see on a 15-year-old Java app.

i.e. not a greenfield project.


Isn't coding style a solved problem with claude.md files or something?

You can control some simple things that way. But the subtle stylistic choices that many teams agree on are difficult to articulate clearly. Plus they don’t always do everything you tell them to in the prompts or rule files. Even when it’s extremely clear sometimes they just don’t. And often the thing you want isn’t clear.

> with no detectable benefit to the user

Except the fact that the idioms and patterns used mean that I can jump in and understand any part of the codebase, as I know it will be wired up and work the same as any other part.


I think here “to the user” is referring to the end user, not the programmer (the user of the coding style). There is a comprehension benefit for the team working on the code, but there is no direct¹ benefit to the end user.

--------

[1] The indirect benefits of there possibly being a faster release cadence and/or fewer bugs, could also be for many other reasons.


But you could say the same about tests, documentation, CI, issue trackers or really any piece of technology used. So it's not a very interesting statement if so.

> When coding style has been established among a team, or within an app, there are a lot of extra hoops to jump through, just to get it to look The Right Way, with no detectable benefit to the user.

Morphing an already decent PR into a different coding style is actually something that LLMs should excel at.


I've seen vibe coding fall apart at 600 lines of code. It turns out lines of code is not a good metric for this or any other purpose.

Do you have any references for "vibe coded monoliths tend to collapse in on themselves at ~30,000 LOC"? I haven't personally vibed up anything with that many LOC, so I'm legitimately curious if we have solid numbers yet for when this starts to happen (and for which definitions of "collapse").

What's that old adage? "Programs must be written for people to read, and only incidentally for machines to execute."[1]

[1] https://cs61a.org/articles/composition/


It's the second.

Like, yes, prompting is a skill and you need to learn it for AI to do something useful, but usefulness quickly falls off a cliff once you go past "greenfield implementation", "basically example code", or "the thing that's been done a lot, so the AI has a lot of reference to pull from"; beyond that it quickly gets into a kinda-sorta-but-not-really-working state.

It can still be used effectively on smaller parts of the codebase (I used it a lot basically to generate boilerplate to run the tests, even if I had to rewrite a bunch of the actual tests), but as a whole it is very, very overrated by the AI peddlers.

And it probably stems from the fact that for the clueless ones it looks like an amazing productivity boost, because they go from "not even knowing the framework" to "a somewhat working app".


People here already say that they don’t even look at the code anymore. “That is the AI’s job.” As long as there is a spec and the tests pass, they are happy! I just can’t do that.

It's just the next rung on the enshittification ladder. So many steps in our "progress" to enlightenment as a society, as a technology community, are just abstracting away work with a "good enough" solution that is around an 80% solution.

That's fine for the first iteration or two, because you think "oh man this is going to make me so productive, I'll be able to use this new productivity to wring 40% of progress out of that 20% gap"

But instead we just move on to the next thing, bring that 20% shittified gap along with us, and the next thing that gets built or paved over has a 20% gap, and eventually we're bankrupt from rolling over all that negative equity


The counter argument for this is the comparison for traditional compilers. AI is "the new compiler", just for natural language. The optimization happens over time! But I am not so sure about that.

At work, I have the same difficulty using AI as you. When working on deep Jiras that require a lot of domain knowledge, bespoke testing tools, but maybe just a few lines of actual code changes across a vast codebase, I have not been able to use it effectively.

For personal projects on the other hand, it has expedited me what? 10x, 30x? It's not measurable. My output has been so much more than what would have been possible earlier, that there is no benchmark because these level of projects would not have been getting completed in the first place.

Back to using at work: I think it's a skill issue. Both on my end and yours. We haven't found a way to encode our domain knowledge into AI and transcend into orchestrators of that AI.


> deep Jiras that require a lot of domain knowledge, bespoke testing tools, but maybe just a few lines of actual code changes

How do new hires onboard? Do you spend days of your own time guiding them in person, do they just figure things out on their own after a few quarters of working on small tickets, or are things documented? Basically AI, when working on a codebase, has the same level of context that a new hire would have, so if you want them to get started faster then provide them with ample documentation.


> Do you spend days of your own time guiding them in person, do they just figure things out on their own after a few quarters of working on small tickets

It is this rather than docs. I think you're absolutely right about our lack of documentation handicapping AI agents.


> Non-trivial coding tasks

A coding agent just beat every human in the AtCoder Heuristic optimization contest. It also beat the solution that the production team for the contest put together. https://sakana.ai/ahc058/

It's not enterprise-grade software, but it's not a CRUD app with thousands of examples in github, either.


> AtCoder Heuristic optimization contest

Optimization space that has been automated before LLMs. Big surprise, machines are still better at this.

This feels a bit like comparing programming teams to automated fuzzing.

In fact not too rarely developing algorithms involved some kind of automated algorithm testing where the algorithm is permuted in an automatic manner.

It's also a bit like how OCR and a couple of other fields (protein folding) are better to be done in an automated manner.

The fact that now this is done by an LLM, another machine isn't exactly surprising. Nobody claims that computers aren't good at these kinds of tasks.


> It's not enterprise-grade software, but it's not a CRUD app with thousands of examples in github, either.

Optimization is a very simple problem though.

Maintaining a random CRUD app from some startup is harder work.


> Optimization is a very simple problem though.

C'mon, there's post every other week that optimization never happens anymore because it's too hard. If AI can take all the crap code humans are writing and make it better, that sounds like a huge win.


Simple is the opposite of complex; the opposite of hard is easy. They are orthogonal. Chess is simple and hard. Go is simpler and harder than chess.

Program optimization problems are less simple than both, but still simpler than free-form CRUD apps with fuzzy, open ended acceptance criteria. It would stand to reason an autonomous agent would do well at mathematically challenging problems with bounded search space and automatically testable and quantifiable output.

(Not GP but I assume that's what they were getting at)


> If AI can take all the crap code humans are writing and make it better, that sounds like a huge win.

This sort of misunderstanding of achievements is what keeps driving the AI mania. The AI generated an algorithm for optimizing a well-defined, bounded mathematical problem that marginally beat the human-written algorithms.

This AI can't do what you're hyping it up to do because software optimization is a different kind of optimization problem - it's complex, underspecified, and it doesn't have general algorithmic solutions.

LLM may play a significant role in optimizing software some day but it's not going to have much in common with optimization in a mathematical sense so this achievement doesn't get us any closer to that goal.


The argument was about “non-trivial”. Are you calling this work trivial or not?

Compilers beat most coders before LLMs were even popular.

had to scroll far to find the problem description

> AHC058, held on December 14, 2025, was conducted over a 4-hour competition window. The problem involved a setting where participants could produce machines with hierarchical relationships, such as multiple types of “apple-producing machines” and “machines that build those machines.” The objective was to construct an efficient production planning algorithm by determining which types and hierarchies of machines to upgrade and in what specific order.

... so not a CRUD app but it beat humans at Cookie Clicker? :-)


After you review, instead of rewriting 70% of the code, have you tried to follow up with a message with a list of things to fix?

Also: in my experience 1. and 2. are not needed for you to have bad results. The existing code base is a fundamental variable. The more complex / convoluted it is, the worse is the result. Also in my experience LLMs are constantly better at producing C code than anything else (Python included).

I have the feeling that the simplicity of the code bases I produced over the years, and that now I modify with LLMs, and the fact they are mostly in C, is a big factor why LLMs appear to work so well for me.

Another thing: Opus 4.5 for me is bad on the web, compared to Gemini 3 PRO / GPT 5.2, and very good if used with Claude Code, since it needs to iterate to reach the solution, while the others are sometimes better first-shotters. If you generate code via the web interface, this could be another cause.

There are tons of variables.


> After you review, instead of rewriting 70% of the code, have you tried to follow up with a message with a list of things to fix?

I think this is the wrong approach: having "wrong code" in the context already makes every response after it worse.

Instead, try restarting, but this time specify exactly how you expected that 70% of the code to work from the get-go. Often, LLMs seem to make choices because they have to; if you think they made the wrong choice, you'll often find that you didn't actually specify something well enough, so the LLM had to pick something, since apparently the single most important thing for them is that they finish something, no matter how right or wrong.

After a while, you'll get better at knowing what you have to be precise, specific and "extra verbose" about, compared to other things. That also seems to depend on the model: with Gemini you can give five variations of "Don't add any comments" and it does so anyway, but say it once to the GPT/Claude family of models and they seem to get it at once.


> After you review, instead of rewriting 70% of the code, have you tried to follow up with a message with a list of things to fix?

This is one of my problems with the whole thing, at least from a programming PoV. Even though superficially it seems like the ST:TNG approach to using an intelligent but not aware computer as a tool to collaboratively solve a problem, it is really more like guiding a junior through something complex. While guiding a junior (or even some future AGI) in that way is definitely a good thing, if I am a good guide they will learn from the experience so it will be a useful knowledge sharing process, that isn't a factor for an LLM (at least not the current generations). But if I understand the issue well enough to be a good guide, and there is no teaching benefit external to me, I'd rather do it myself and at most use the LLM as a glorified search engine to help muddle through bad documentation for hidden details.

That, and TBH I got into techie things because I like tinkering with the details. If I didn't dislike guiding others through doing the actual job, I wouldn't have resisted becoming a manager all these years!


The biggest frustration with LLMs for me is people telling me I'm not prompting in a good way. Think of any other product that's sold half-baked, where the vendor repeatedly tells the user they're not using it properly.

But that's not how most products work.

If you buy a table saw and can't figure out how to cut a straight line in a piece of wood with it - or keep cutting your fingers off - but didn't take any time at all to learn how to use it, that's on you.

Likewise a car, you have to take lessons and a test before you can use those!

Why should LLMs be any different?


I'm glad you brought up the power tool analogy - I bought a $40 soldering iron once, which looked just like the Weller that cost about 5x as much. There was nothing wrong with it on the surface; it was well built and heated up just fine.

But every time I tried to solder with it, the results sucked. I couldn't articulate why, and assumed I was doing something wrong (I probably was).

Then at my friend's house, I got to try the real thing, and it worked like a dream. Again I can't pin down why, but everything just worked.

This is how I felt with LLMs (and image generation) - sometimes it just doesn't feel right, and I can't put my finger on what I should fix, but I often come away with the feeling that I needed to do way more tweaking than should be necessary and the results were still just mediocre.


> But that's not how most products work.

That's exactly how most products work :-/

> If you buy a table saw and can't figure out how to cut a straight line in a piece of wood with it - or keep cutting your fingers off - but didn't take any time at all to learn how to use it, that's on you.

Of course - that's deterministic, so if you make a mistake and it comes out wrong, you can fix the mistake you made.

> Why should LLMs be any different?

Because they are not deterministic; you can't use experience with LLMs in any meaningful way. They may give you a different result when you run the same spec through the LLM a second time.


> Because they are not deterministic; you can't use experience with LLMs in any meaningful way. They may give you a different result when you run the same spec through the LLM a second time.

Lots of things, and indeed humans, are also non-deterministic; I absolutely do use experience from working with humans and other non-deterministic things to improve my future interactions with them.

Table saws are kinda infamous in this regard: you may say that kick-back is hidden state/incomplete information rather than non-deterministic, but in practice the impact is the same.


> They may give you a different result when you run the same spec through the LLM a second time.

Yes kind of, but only different results (maybe) for the things you didn't specify. If you ask for A, B and C, and the LLM automatically made the choice to implement C in "the wrong way" (according to you), you can retry but specify exactly how you want C to be implemented, and it should follow that.

Once you've nailed your "spec" enough so there isn't any ambiguity, the LLM won't have to make any choices for you, and then you'll get exactly what you expected.

Learning this process, and learning how much and what exactly you have to instruct it to do, is you building up your experience learning how to work with an LLM, and that's meaningful, and something you get better with as you practice it.


> Yes kind of, but only different results (maybe) for the things you didn't specify.

No. They will produce a different result for everything, including the things you specify.

It's so easy to verify that I'm surprised you're even making this claim.

> Once you've nailed your "spec" enough so there isn't any ambiguity, the LLM won't have to make any choices for you, and then you'll get exactly what you expected

1. There's always ambiguity, or else you'll end up an eternity writing specs

2. LLMs will always produce different results even if the spec is 100% unambiguous for a huge variety of reasons, the main one being: their output is non-deterministic. Except in the most trivial of cases. And even then the simple fact of "your context window is 80% full" can lead to things like "I've rewritten half of your code even though the spec only said that the button color should be green"


> It's so easy to verify that I'm surprised you're even making this claim.

Well, to be fair, I'm surprised you're even trying to say this claim isn't true, when it's so easy to test yourself.

If I prompt "Create a function with two arguments, a and b, which returns adding those two together", I'll get exactly what I specify. If I feel like it using u8 instead of u32 was wrong, I add "two arguments which are both u8", then you now get this.

Is this not the experience you get when you use LLMs? How does what you get differ from that?

> 1. There's always ambiguity, or else you'll end up an eternity writing specs

There isn't, though; at some point it does end. Whether it's worth going that deep into specifying the exact implementation is up to you and what you're doing; sometimes it is, sometimes it isn't.

> LLMs will always produce different results even if the spec is 100% unambiguous for a huge variety of reasons, the main one being: their output is non-deterministic.

Again, it's so easy to verify that this isn't true, and also surprising you'd say this, because earlier you say "always ambiguity" yet somehow you seem to also know that you can be 100% unambiguous.

Like with "manual" programming, the answer is almost always "divide and conquer", when you apply that with enough granularity, you can reach "100% umambiguity".

> And even then the simple fact of "your context window is 80% full" can lead to things like "I've rewritten half of your code even though the spec only said that the button color should be green"

Yes, this is a real flaw, once you go beyond two messages, the models absolutely lose track almost immediately. Only workaround for this is constantly restarting the conversation. I never "correct" an agent if they get it wrong with more "No, I meant", I rewrite my first message so there are no corrections needed. If your context goes beyond ~20% of what's possible, you're gonna get shit results basically. Don't trust the "X tokens context length", because "what's possible" is very different from "what's usable".


> If I prompt "Create a function with two arguments, a and b, which returns adding those two together", I'll get exactly what I specify. If I feel it should have used u8 instead of u32, I add "two arguments which are both u8", and now that's what I get.

This is actually a good example of how your spec will progress:

First pass: "Create a function [in language $X] with two arguments, a and b, which returns adding those two together"

Second pass: "It must take u8 types, not u32 types"

Third pass: "You are not handling overflows. It must return a u8 type."

Fourth pass: "Don't clamp the output, and you're still not handling overflows"

Fifth pass: "Don't panic if the addition overflows, return an error" (depending on the language, this could be "throw an exception" or return a tuple with an error field, or use an out parameter for the result or error)

For just a simple "add two numbers" function, the specification can easily exceed the actual code. So you can probably understand the skepticism when the task is not trivial, and depends on a lot of existing code.
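For the sake of argument, here's roughly what the end state of that spec looks like if you pick Rust and the "return an error" option (the function name and error type here are just illustrative):

    /// Adds two u8 values; returns an error instead of panicking or clamping on overflow.
    fn add_u8(a: u8, b: u8) -> Result<u8, String> {
        a.checked_add(b)
            .ok_or_else(|| format!("{a} + {b} overflows u8"))
    }

    fn main() {
        assert_eq!(add_u8(1, 2), Ok(3));   // normal case
        assert!(add_u8(255, 1).is_err());  // overflow becomes an error, not a panic
    }

Five rounds of prompting to arrive at a handful of lines you could have written directly once you knew what you wanted.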


So you do know how the general "writing a specification" part works; you just have the wrong process. Instead of iterating and adding more context on top, restructure your initial prompt to include the context.

DONT DO:

First pass: "Create a function [in language $X] with two arguments, a and b, which returns adding those two together"

Second pass: "It must take u8 types, not u32 types"

INSTEAD DO:

First pass: "Create a function [in language $X] with two arguments, a and b, which returns adding those two together"

Second pass: "Create a function [in language $X] with two arguments, a and b, both using u8, which returns adding those two together"

----

What you don't want to do is add additional messages/context on top of "known bad" context. Instead, take the cue that the LLM didn't understand correctly as "I need to edit my prompt", not "I need to add more context after its reply to correct what was wrong". The goal should be to avoid anything bad entirely, not to correct it.

Together with this, you build up a system/developer prompt you can reuse across projects/scopes, that follows how you code. In that, you add stuff as you discover what's needed to be added, like "Make sure to always handle Exceptions in X way" or similar.

> For just a simple "add two numbers" function, the specification can easily exceed the actual code. So you can probably understand the skepticism when the task is not trivial, and depends on a lot of existing code.

Yes, please be skeptical, I am as well, which I guess is why I am seemingly more effective at using LLMs than others who are less skeptical. It's a benefit here to be skeptical, not a drawback.

And yes, it isn't trivial to verify work that others have done for you when you have a concrete idea of exactly how it should be. But just as I've managed to work with outsourced/contract developers before, and to collaborate with developers at the same company as me, I've learned to use LLMs in a similar way, where you have to review and ensure the code follows the architecture/design you intended.


> INSTEAD DO:

> First pass: "Create a function [in language $X] with two arguments, a and b, which returns adding those two together"

> Second pass: "Create a function [in language $X] with two arguments, a and b, both using u8, which returns adding those two together"

So it will create two different functions (and LLMs do love to ignore anything that came before and create a lot of stuff from scratch again and again). Now what.


What? No, I think you fundamentally misunderstand what workflow I'm suggesting here.

You ask: "Do X". The LLM obliges, gives you something you don't want. At this point, don't accept/approve it, so nothing has changed, you still have an empty directory, or whatever.

Then you start a brand new context, with iteration on the prompt: "Do X with Y", and the LLM again tries to do it. If something is wrong, repeat until you get what you're happy with, extract what you can into reusable system/developer prompts, then accept/approve the change.

Then you end up with one change, and one function, exactly as you specified it. Then if you want, you can re-run the exact same prompt, with the exact same context (nothing!) and you'll get the same results.

"LLMs do love to ignore anything that came before" literally cannot happen in this workflow, because there is nothing that "came before".


> No, I think you fundamentally misunderstand what workflow I'm suggesting here.

Ah. Basically meaningless monkey work of baby sitting an eager junior developer. And this is for a simple thing like adding two numbers. See how it doesn't scale at all with anything remotely complex?

> "LLMs do love to ignore anything that came before" literally cannot happen in this workflow, because there is nothing that "came before".

Of course it can. Because what came before is the project you're working on. Unless of course you end up specifying every single utility function and every single library call in your specs. Which, once again, doesn't scale.


A table saw does not advertise to be a panacea which will make everyone obsolete.

You should ignore anyone who says that LLMs are a panacea that will make everyone obsolete.

Even if they're your boss? Remember that most people here are not independently wealthy; they're stuck answering to someone who may not have as level a take on these things as you do.

The problem there is the boss, not the technology. If it isn’t an insane take on AI, it’d be on something else, and eventually will be. People quit bad managers, not bad jobs. If you have a bad manager, work on quitting them.

>> Even if they're your boss?

Especially if they are your boss.


It seems generally agreed that LLMs (currently) do better or worse with different programming languages at least, and maybe with other project logistical differences.

The fact that an LLM works great for one user on one project does not mean it will work equally great for another user on a different project. It might! It might work better. It might work worse.

And both users might be using the tool equally well, with equal skill, insofar as their part goes.


Now imagine the table saw is really, REALLY shit at being a table saw and never saw a straight angle during its construction. And they come out with a new one every 6 months that is very slightly less crooked, but the controls are all moved around so you have to tweak your workflow.

Would you still blame the user?


It’s more like the iPhone “you’re holding it wrong”.

Table saws and cars are deterministic. Once you learn how to use them, the experience is repeatable.

The various magic incantations that LLMs require cannot be learned or repeated. Whatever the "just one more prompt bro" du jour you're thinking of may or may not work at any given time for any given project in any given language.


Operating a car (i.e. driving) is certainly not deterministic. Even if you take the same route over and over, you never know exactly what other drivers or pedestrians are going to do, or whether there will be unexpected road conditions, construction, inclement weather, etc. But through experience, you build up intuition and rules of thumb that allow you to drive safely, even in the face of uncertainty.

It's the same programming with LLMs. Through experience, you build up intuition and rules of thumb that allow you to get good results, even if you don't get exactly the same result every time.


> It's the same programming with LLMs. Through experience, you build up intuition and rules of thumb that allow you to get good results, even if you don't get exactly the same result every time.

Friend, you have literally described a nondeterministic system. LLM output is nondeterministic. Identical input conditions result in variable output conditions. Even if those variable output conditions cluster around similar ideas or methods, they are not identical.


> Operating a car (i.e. driving) is certainly not deterministic.

Yes. Operating a car or a table saw is deterministic. If you turn your steering wheel left, the car will turn left every time with very few exceptions that can also be explained deterministically (e.g. hardware fault or ice on road).

Operating LLMs is completely non-deterministic.


> Operating LLMs is completely non-deterministic.

Claiming "completely" is mapping a boolean to a float.

If you tell an LLM (with tools) to do a web search, it usually does a web search. The biggest issue right now is more at the scale of: if you tell it to create turn-by-turn directions to navigate across a city, it might create a python script that does this perfectly with OpenStreetMap data, or it may attempt to use its own intuition and get lost in a cul-de-sac.


Wow. It can do a web search. And that is useful in the context of programming how? Or in any context?

The question is about the result of an action. Given the same problem statement in the same codebase it will produce wildly different results even if prompted two times in a row.

Even for trivial tasks the output may vary between just a simple fix, and a rewrite of half of the codebase. You can never predict or replicate the output.

To quote Douglas Adams, "The ships hung in the sky in much the same way that bricks don't". Cars and table saws operate in much the same way that LLMs don't.


> Wow. It can do a web search. And that is useful in the context of programming how? Or in any context?

Your own example was turning a steering wheel.

A web search is as relevant to the broader problems LLMs are good at, as steering wheels are to cars.

> Given the same problem statement in the same codebase it will produce wildly different results even if prompted two times in a row.

Do you always drive the same route, every day, without alteration?

Does it matter?

> You can never predict or replicate the output.

Sure you can. It's just less like predicting what a calculator will show and more like predicting if, when playing catch, the other player will catch your throw.

You can learn how to deal with reality even when randomness is present, and in fact this is something we're better at than the machines.


I'm finding the prompting techniques I've learned over the last six months continue to work just fine.

Have you run the "same prompting technique" on the same problem in the same code base and got the same result all the time?

I also have prompting techniques that work better than other magical incantations. They do also fail often. Or stop working in a new context. Or...


Have you seen the way some people google/prompt? It can be a murder scene.

Not coding related, but my wife is certainly better than most, and yet I've had to reprompt certain questions she's asked ChatGPT because she gave it inadequate context. People are awful at that. We coders are probably better off than most, but just as with human communication, if you're not explaining things clearly you're going to get garbage back.


People are "awful at that" because when two people communicate, we're using a lot more than words. Each person participating in a conversation is doing a lot of active bridge-building. We're supplying and looking for extra nonverbal context; we're leaning on basic assumptions about the other speaker, their mood, their tone, their meanings; we're looking at not just syntax but the pragmatics of the convo (https://en.wikipedia.org/wiki/Pragmatics). The communication of meaning is a multi-dimensional thing that everyone in the conversation is continually contributing to and pushing on.

In a way, LLMs are heavily exploitative of human linguistic abilities and expectations. We're wired so hard to actively engage and seek meaning in conversational exchanges that we tend to "helpfully" supply that meaning even when it's absent. We are "vulnerable" to LLMs because they supply all the "I'm talking to a person" linguistic cues, but without any form of underlying mind.

Folks like your wife aren't necessarily "bad" at LLM prompting—they're simply responding to the signals they get. The LLM "seems smart." It seems like it "knows" things, so many folks engage with them naturally, as they would with another person, without painstakingly feeding in context and precisely defining all the edges. If anything, it speaks to just how good LLMs are at being LLMs.


Until we get LLMs with deterministic output for a given prompt, there's no guarantee that you and me typing the same prompt will yield a working solution of similar quality.

I agree that it helps to add context, but then again, assuming people aren't already doing that doesn't help in any way. You can add all the context there is and still get a total smudge out of it. You can hit regenerate a few times and it's no better. There's no way to tell which part of your prompt the LLM will fixate on and which part it will silently forget (this is even more apparent with longer prompts).


If my mum buys a copy of Visual Studio, is it their fault if she cannot code?

It's more like buying Visual Studio, having it crash at random times, and then being told you don't know how to use the IDE.

It's not like that though.

It's like you buy Visual Studio and don't believe anyone who tells you that it's complex software with a lot of hidden features and settings that you need to explore in order to use it to its full potential.


I feel it's not worth the effort to spend time and learn the hidden features. Whenever I use it to plug something new into an existing codebase, it either gives something good on the first shot or repeats the non-working solution again and again. After such a session I'm only left with the feeling that instead of spending the last 15 minutes prompting, I should have learned this stuff and this learning would be useful for me forever.

I use LLMs as a better form of search engines and that's a useful product.


> I feel it's not worth the effort to spend time and learn the hidden features.

And that's the only issue here. Many programmers feel offended by an AI threatening their livelihood, and are too arrogant to invest some time in a tool they deem beneath them—then proceed to complain on the internet about how useless the tool is.

I'd really suggest taking antirez' advice at heart, and invest time in actually learning how to work with AI properly. Just because Claude Code has a text prompt like ChatGPT doesn't mean you know how to work with it yet. It is going to pay off.


> I should have learned this stuff and this learning would be useful for me forever.

Oh, if only software worked like that.

Even a decade ago, one could reasonably say that half of what we proudly add to our CVs becomes obsolete every 18 months, it's just hard to predict which half.


How much buggy / incorrect Java written by first year computer science University students is there on Stack Overflow (in SO post bodies)? Decades of it.

Ask the same question of Golang, or Rust, or Typescript.

I have a theory that the large dichotomy in how people experience AI coding has to do with the quality of the training corpus for each language online.


I'm exactly on the same boat.

To anybody who want to try, a concrete example, that I have tested in all available LLMs:

Write a prompt to get a Common Lisp application which renders a "hello triangle" in OpenGL, without using SDL or any framework, only OpenGL and GLFW bindings.

None of the replies even compiled. I kept asking, at least 5 times, with error feedback, to see if the AI could do it. It didn't work. Never.

The best I got was from Gemini: code where I had to change about 10 lines, and those were by no means trivial changes; you need to be familiar with OpenGL and Lisp to make them. After making the changes I asked it what it thought of them, and it replied that I was wrong, that with those changes it would never work.

If anybody can write a prompt that gets me that, please let me know...


It sounds like you're using LLMs directly, instead of a coding agent. Agents are capable of testing their own code and using that to fix issues, which is what makes them so powerful.

Using Claude Code, I was able to successfully produce the Hello Triangle you asked for (note that I have never used CL before): https://github.com/philpax/hello-triangle-cl

For reference, here is the transcript of the entire interaction I had with CC (produced with simonw's excellent claude-code-transcripts): https://gisthost.github.io/?7924519b32addbf794c17f4dc7106bc2...

Edit: To better contextualise what it's doing, the detailed transcript page may be useful: https://gisthost.github.io/?7924519b32addbf794c17f4dc7106bc2...


Nice. The code I got from Gemini was much, much cleaner; it did not work as generated, but after the manual changes it did. I will give it a try with the next task: putting text on screen and generating primitives like rectangles, circles, polygons, etc…

"Please write me a program in Common LISP (SBCL is installed) which will render a simple "hello world" triangle in OpenGL. You should use only OpenGL and GLFW (using sbcl's FFI) for this, not any other existing 3D graphics framework."

This worked in codex-cli, although it took three rounds of passing back the errors. https://gist.github.com/jamesacraig/9ae0e5ed8ebae3e7fe157f67... has the resulting code.


That is using sb-alien and sb-sys, which basically isn't portable Common Lisp anymore; that is SBCL-specific. I didn't get anything in that direction (my prompt said nothing about a CL implementation), but I would have rejected it. I just wanted to see glfw and opengl in the :use clause. I need something that works on Mac, Linux and Windows, with at least ECL, SBCL and CCL.

Maybe LLMs are the next evolution of the rubber ducky: you can talk to it, and it's very helpful, just don't expect that IT will give you the final answer.

> Every single time [...] I end up rewriting about 70% of the thing

If that number has not significantly changed since GPT 3.5, I think it's safe to assume that something very weird is happening on your end.


I think I know what they mean, I share a similar experience. It has changed, 3.5 couldn't even attempt to solve non-trivial tasks so it was a 100% failure, now it's 70%.

I think you are correct, with one large caveat:

With very good tooling (e.g., Google Antigravity, Claude Code, OpenAI's Codex, and several open platforms), a disregard for your monthly API and subscription costs, very long-running trial and error, and tools for testing code changes, some degree of real autonomy is possible.

But, do we want to work like this? I don’t.

I feel very good about using strong AI for research and learning new things (self improvement) and I also feel good about using strong AI as a ‘minor partner’ in coding.


That's been pretty much exactly my experience too.

For what it's worth, multiple times in my career, I've worked at shops that once thought they could do it quick and cheap and it would be good enough, and then had to hire someone 'picky' like me to sort out the inevitable money-losing mess.

From what I've seen even Opus 4.5 spit out, the 'picky' are going to remain in demand for a little while longer. Will that last? No clue. We'll see.


You can be picky with Opus, just yell at it to refactor a few times. To reduce refactor cycles, give it correct and enough context before you start along with expected code style, etc. These things aren't one shot magic machines.

> I don't understand the stance that AI currently is able to automate away non-trivial coding tasks.

I'm happy enough for it to automate away the trivial coding tasks. That's an immense force multiplier in its own right.


> I end up rewriting about 70% of the thing.

Doesn't match my experience; that figure is closer to 20-40% for me, and a lot of the changes I want can be had by just further prompting OR turning to a different model, or by adding some automated checks that promptly fail so the AI can do a few more loops of fixes.

> Other people are just less picky than I am, or they have a less thorough review culture that lets subpar code slide more often.

This is also likely, or you are just doing stuff that is less well represented in the training data, or working on novel things where the output isn't as good. But I'm leaning towards people just being picky about what they view as "good code" (or underspecifying how the AI is supposed to output it), at least since roughly Sonnet 4; with some people I work with, code review is just endless and often meaningless discussion and bikeshedding.

You can always be like: "This here pattern in these 20 files is Good Code™, use the same collection of approaches and code style when working on this refactoring/new feature."


> You can always be like: "This here pattern in these 20 files is Good Code™, use the same collection of approaches and code style when working on this refactoring/new feature."

…and then add that to your CLAUDE.md, and never worry about having to say it again manually.
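Something as short as this is usually enough (the wording and paths below are invented for illustration):

    ## Code conventions
    - New services should follow the structure of the existing classes under
      src/main/java/com/example/billing/ (constructor injection, one public
      method per use case).
    - Match the surrounding code style; do not reformat or comment code you
      are not touching.

Claude Code reads CLAUDE.md at the start of a session, so after the first write-up it costs nothing.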


Exactly! Unless you use something that doesn't read CLAUDE.md, in which case you'd still just tell the model to read the file as part of its work.

What helped me a bunch was having prebuild scripts (can be Bash, can be Python, can be whatever) for each of the architectural or style conventions I want to enforce. Tools like ESLint are also nice but focused a bit more on the code than architecture/structure.
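As a sketch of the kind of check I mean (the rule, paths and module names are invented for illustration, and any language works; this one happens to be Rust):

    // Illustrative pre-build check: fail if any file directly under src/ui/
    // reaches into the persistence layer instead of going through services.
    use std::{fs, process};

    fn main() {
        let mut violations = Vec::new();
        if let Ok(entries) = fs::read_dir("src/ui") {
            for entry in entries.flatten() {
                let path = entry.path();
                if path.extension().and_then(|e| e.to_str()) != Some("rs") {
                    continue;
                }
                if let Ok(text) = fs::read_to_string(&path) {
                    if text.contains("crate::persistence") {
                        violations.push(path.display().to_string());
                    }
                }
            }
        }
        if !violations.is_empty() {
            eprintln!("UI code must not use crate::persistence directly:");
            for v in &violations {
                eprintln!("  {v}");
            }
            process::exit(1);
        }
    }

When it fails, the agent gets a concrete, mechanical signal to react to instead of a style argument in review.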

Problems start when a colleague just removes some of those checks due to personal preference, without discussion; but then you have other problems - in my experience, with proper controls in place, AI will cause fewer issues and less friction than people (ofc depending on culture fit).


Why not post a github gist with prompt and code so that people here can give you their opinion?

Those just don't appear at all on HackerNews

Gee I wonder why


Because most people don't work on public projects and can't share the code publicly?

What's more interesting is the lack of examples of non-trivial projects that are provably vibe-coded and that claim to be of high-quality.

I think many of us are looking for: "I vibe-coded [this] with minimal corrections/manual coding on a livestream [here] and I believe it to be high-quality code"

If the code is in fact good quality then the livestream would serve as educational material for using LLMs/agents productively and I guarantee that it would change many minds. Stop telling people how great it all is, show them. I don't want to be a naysayer, I want to be impressed.


I'm considering attempting to vibe-code a translation of one of my XNA games to JavaScript, recording the process and using all of the latest tools and strategies like agents, .md files, multiple LLMs, etc.

> 1) I'm not good at prompting,

I assume this is part of the problem (though I've avoided using LLMs mostly so can't comment with any true confidence here) but to a large extent this is blaming you for a suboptimal interface when the interface is the problem.

That some people seem to get much better results than others, and that the distinction does not map well to differences in ability elsewhere, suggests to me that the issue is people thinking slightly differently and the training data for the models somehow being biased to those who operate in certain ways.

> 2) Other people are just less picky than I am

That is almost certainly a much larger part of the problem. “Fuck it, it'll do, someone else can tidy it later if they are bothered enough” attitudes were rampant long before people started outsourcing work to LLMs.


> I end up rewriting about 70% of the thing.

I think this touches on the root of the issue. I am seeing results-over-process winning. Code quality will decline. Out-of-touch or apathetic project management who prioritize results are now even more emboldened to accept tech-debt-riddled code.


I think you are not hardcore enough. I paste entire files, or 2-3 files at once, and ask it to rewrite everything.

Then you review it and in general have to ask it to remove some stuff. And then it's (good enough). You have to accept not nitpicking some parts (like random functions being generated) as long as your test suite passes, otherwise of course you will end up rewriting everything.

It also depends on your setting; some areas (web vs AI vs robotics) can be more suited than others.


I get the best results when using code to demonstrate my intention to an LLM, rather than try and explain it. It doesn't have to be working code.

I think that mentally estimating the problem space helps. These things are probabilistic models, and if there are a million solutions the chance of getting the right one is clearly unlikely.

Feeding back results from tests really helps too.


Maybe if your coding style is already close to what an LLM like Claude outputs, you’ll never have these issues? At least it generally seems to be doing what I would do myself.

Most of the architectural failures come from it still not having the whole codebase in mind when changing stuff.


I actually think it's less about code style and more about the disjointed way end outcomes seem to be the culmination of a lot of prompt attempts over the course of a project/implementation.

The funny thing is reviewing stuff claude has made isn't actually unfamiliar to me in the slightest. It's something I'm intimately familiar with and have been intimately familiar with for many years, long before this AI stuff blew up...

...it's what code I've reviewed/maintained/rejected looks like when a consulting company was brought on board to build something. The kind of company that leverages probably underpaid and overworked labor, both overseas and US-based workers on visas. The delivered documentation/code is noisy+disjointed.


> The delivered documentation/code is noisy+disjointed.

Yeah, which is what you get if your memory consists of everything you’ve read in the past 20 minutes. Most of my Claude work involves pointing it at the right things.


On the subpar code, would the code work, albeit suboptimally?

I think part of the problem a lot of senior devs are having is that they see what they do as an artisanal craft. The rest of the world just sees the code as a means to an end.

I don't care how elegantly my toaster was crafted as long as it toasts the bread and doesn't break.


There is some truth to your point, but you might want to consider that seniors concerned with code quality often aren't being pedantic about artisanal craft; they are worried about the consequences of bad code...

- it becomes brittle and rigid (can't change it, can't add to it)

- it becomes buggy and impossible to fix one bug without creating another

- it becomes harder to tell what it's doing

- plus it can be inefficient / slow / insecure, etc.

The problem with your analogy is that toasters are quite simple. The better example would be your computer, and if you want your computer to just run your programs and not break, then these things matter.


More review items to consider on a PR:

* You have made a new file format. Consider that it will live forever.

* You have added exactly what the user/product team asked for. It must be supported forever.

Part of my job is to push back on user requests. I also think a lot about ease of use.

I think even with an LLM that can one-shot a task, the engineer writing the prompt must still have "engineering judgment".


Perhaps a better analogy is the smartphone or personal computer.

Think of all the awful cheapest android phones and Windows PCs and laptops that are slow, buggy, have not had a security update in however long and are thus insecure, become virtually unusable within a couple years. The majority of the people in the world live on such devices either because they don't know better or have no better option. The world continues to turn.

People are fine with imperfection in their products, we're all used to it in various aspects of our lives.

Code being buggy, brittle, hard to extend, inefficient, slow, insecure. None of those are actual deal breakers to the end user, or the owners of the companies, and that's all that really matters at the end of the day in determining whether or not the product will sell and continue to exist.

If we think of it in terms of evolution, the selection pressure of all the things you listed is actually very weak in determining whether or not the thing survives and proliferates.


The usefulness is a function of how quickly the consequences from poor coding arrive and how meaningful they are to the organization.

Like in finance, if your AI trading bot makes a drastic mistake, it's immediately realized and can be hugely consequential, so AI is less useful. Retail is somewhere in the middle. But for something like marketing, or where the main function is data-related or managerial, the negatives aren't realized as quickly, so there can be a lot of hype around AI and what it may be able to do.

Another poster commented how very useful AI was to the insurance industry, which makes total sense, because even then if something is terribly wrong it has only a minor chance of ever being an issue and it's very unlikely that it would have a consequence soon.


Hattmall said it well with this:

> The usefulness is a function of how quickly the consequences from poor coding arrive and how meaningful they are to the organization.

I would just add that these hypothetical senior devs we are talking about are real people with careers, accountability and responsibilities. So when their company says "we want the software to do X" those engineers may be responsible for making it happen and accountable if it takes too long or goes wrong.

So rather than thinking of them as being irrationally fixated on the artisanal aspect (which can happen) maybe consider in most cases they are just doing their best to take responsibility for what they think the company wants now and in the future.


There’s for sure legitimacy to the concern over the quality of output of LLMs and the maintainability of that code, not to mention the long term impact on next generation of devs coming in and losing their grasp on the fundamentals.

At the same time, the direction of software by and large seems to me to be going in the direction of fast fashion. Fast, cheap, replaceable, questionable quality.

Not all software can tolerate this, as I mentioned in another comment, flight control software, the software controlling your nuclear power plant, but the majority of the software in the world is far more trivial and its consumers (and producers) more tolerant of flaws.

I don’t think of seniors as purely irrationally fixated on the artisanal aspect, I also think they are rationally, subconsciously or not, fearful of the implications for their career as the bottom falls out of this industry.

I could be wrong though! Maybe high quality software will continue to be what the industry strives for and high paying jobs to fix the flawed vibe coded slop will proliferate, but I’m more pessimistic than to think that.


Who does it fall on to fix the mess that's been made? You do care if the toaster catches fire and burns your house down.

> I don't care how elegantly my toaster was crafted as long as it toasts the bread and doesn't break.

A consumer or junior engineer cares whether the toaster toasts the bread and doesn’t break.

Someone who cares about their craft also cares about:

- If I turn the toaster on and leave, can it burn my house down, or just set off the smoke alarm?

- Can it toast more than sliced uniform-thickness bread?

- What if I stick a fork in the toaster? What happens if I drop it in the bathtub while on? Have I made the risks of doing that clear in such a way that my company cannot be sued into oblivion when someone inevitably electrocutes themselves?

- Does it work sideways?

- When it fills up with crumbs after a few months of use, is it obvious (without knowing that this needs to be done or reading the manual) that this should be addressed, and how?

- When should the toaster be replaced? After a certain amount of time? When a certain misbehavior starts happening?

Those aren’t contrived questions in service to a tortured metaphor. They’re things that I would expect every company selling toasters to have dedicated extensive expertise to answering.


My contention is:

> A consumer

is all that ultimately matters.

All those things you’re talking about may or may not matter some day, after years and a class action lawsuit that may or may not materialize or have any material impact on the bottom line of the company producing the toaster, by which time millions of units of subpar toasters that don’t work sideways will have sold.

The world is filled with junk. The majority of what fills the world is junk. There are parts of our society where junk isn’t well tolerated (jet engines, mri machines) but the majority of the world tolerates quite a lot of sloppiness in design and execution and the companies producing those products are happily profitable.


You really underestimate how much work goes into everything around you. You don't care because it just works: the stuff you use is by and large not crap, which makes the crappy stuff all the more noticable. Check out the housing code for your area: everything from the size of steps to the materials used for siding are in there. Or look at the FCC specifications for electrical devices that make sure you don't inadvertently jam radio frequencies in your local area, or the various codes which try very hard to stop you from burning your house down.

You're right that "there are parts of our society where junk isn't well tolerated", but the scope of those areas is far greater than you give credit for.


That's a sad way to think. I'd like to hope that humanity can improve itself, and that includes building products that are safer, more refined, more beautiful, more performant and more useful. I agree that there's a lot of crap out there, but I still want to believe and strive to make things that are excellent. I'm not ready to give up on that. And yes, I still get annoyed every time my crappy toaster doesn't work properly.

>I think part of the problem a lot of senior devs are having is that they see what they do as an artisanal craft. The rest of the world just sees the code as a means to an end.

Then you haven't been a senior dev long enough.

We want code that will be good enough because we will have to maintain it for years (or inherit maintaining from someone else), we want it to be clean enough that adding new features isn't a pain and architected well enough that it doesn't need major rewrite to do so.

Of course if code is throwaway that doesn't matter but if you're making long term product, making shit code now is taking on the debt you will have to pay off.

That is not to say "don't use AI for that"; it is to say "actually go through the AI code and review whether it is done well enough". But many AI-first developers just ship the first thing that compiles or passes tests, without looking.

> I don't care how elegantly my toaster was crafted as long as it toasts the bread and doesn't break.

...well if you want it to not break (and still be cheap) you have to put quite a bit of engineering into it.


Have you tried asking one of your peers who claims to get good results to run a test with you? Where you both try to create the same project, and share your results?

I and one or two others are _the_ AI use experts at my org, and I was by far the earliest adopter here. So I don't really have anyone else with significantly different experiences than me that I could ask.

You can definitely use AI for non-trivial tasks.

It's not just about better prompting, but using better tools. Tools that will turn a bad prompt into a good prompt.

For example there is the plan mode for Cursor. Or just ask the AI: "make a plan to do this task", then you review the plan before asking it to implement. Configure the AI to ask you clarification questions instead of assuming things.

It's still evolving pretty quickly, so it's worth staying up to date with that.


I have not been as aggressive as GP in trying new AI tools. But the last few months I have been trying more and more and I'm just not seeing it.

One project I tried out recently I took a test-driven approach. I built out the test suite while asking the AI to do the actual implementation. This was one of my more successful attempts, and may have saved me 20-30% time overall - but I still had to throw out 80% of what it built because the agent just refused to implement the architecture I was describing.

It's at its most useful if I'm trying to bootstrap something new on a stack I barely know, OR if I decide I just don't care about the quality of the output.

I have tried different CLI tools, IDE tools. Overall I've had the best success with Claude Code but I'm open to trying new things.

Do you have any good resources you would recommend for getting LLM's to perform better, or staying up-to-date on the field in general?


If you haven't yet, check Claude Code's plan mode:

https://claudelog.com/mechanics/plan-mode/


It might be 1); being an early adopter doesn't help much with AI. So much is changing constantly. If you put a good description of your architecture and coding guidelines in the right .md files and work on your prompts, the output should be much better. On the other hand, your project being legacy code probably doesn't help either.

We find across our team different people are able to use these things at different levels. Unsurprisingly, more senior coders with both more experience in general and more experience in ai coding are able to do more with ai and get more ambitious things done more quickly.

A bummer is that we have a genai team (louie.ai) and a gpu/viz/graph analytics team (graphistry), and those who have spent the last 2-3 years doing genai daily have a higher uptake rate here than those who haven't. I wouldn't say team 1 is better than team 2 in general: these are tools, and different people have different engineering skill and AI coding skill, including different amounts of time spent on both.

What was a revelation for me personally was taking 1-2 months early in Claude Code's release to go full cold turkey on manual coding, similar to getting immersed in a foreign language. That forced eliminating a lot of bad habits wrt effective AI coding, both personally and in the state of our repo tooling. Since then, it's been steady work to accelerate and smooth that loop, e.g. moving from vibe coding/engineering to more eval-driven AI coding loops: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t... . That takes a LOT of buildout.


Do you have links to texts that describe which markdown files, and what to write in them? What is good and what is bad etc.

Not trying to back the AI hype, but most pre-AI auto-generated code is garbage (like WinForms auto-generated code or Entity Framework SQL in the .NET world). But that's fine; it's not meant to be read by humans. If you want to change it, you can regenerate it. It may be that AI just moves the line between what developers should care about and look at vs the boring boilerplate code that adds little value.

I have been doing the same since GPT-3. I remember a time, probably around 4o when it started to get useful for some things like small React projects but was useless for other things like firestore rules. I think that surface is still jagged, it's just that it's less obviously useless in areas that it's weaker.

When things really broke open for me was when I adopted windsurf with Opus 4, and then again with Opus 4.5. I think the way the IDE manages the context and breaks down tasks helps extend llm usefulness a lot, but I haven't tried cursor and haven't really tried to get good at Claude code.

All that said, I have a lot of experience writing in business contexts and I think when I really try I am a pretty good communicator. I find when I am sloppy with prompts I leave a lot more to chance and more often I don't get what I want, but when I'm clear and precise I get what I want. E.g. if it's using sloppy patterns and making bad architectural choices, I've found that I can avoid that by explaining more about what I want and why I want it, or just being explicit about those decisions.

Also, I'm working on smaller projects with less legacy code.

So in summary, it might be a combination of 1, 2 and the age/complexity of the project you're working on.


It is pretty simple imo. AI (just like humans!) does best on well-written, self-contained code bases. Which is a very small niche, but also overrepresented in open source, and consequently among tech celebrities who tend not to work on "ugly code".

I work on a giant legacy code base at big tech, which is one piece of many distributed systems. LLM is helpful for localised, well defined work, but nowhere close to what the TFA describes.


I think the answer will lie somewhere closer to social psychology and modern economics than to anything in software engineering.

Like with anything else, the people best positioned to enjoy output are the people least well positioned to criticize it. This is true of AI just as it is of eating at restaurants or enjoying movie dramas.

I think you should try harder to find their limits. Be as picky as you want, but don't just take over after it gave you something you didn't like. Try again with a prompt that talks about the parts you think were bad the first time. I don't mean iterate with it, I mean start over with a brand new prompt. Try to figure out if there is a prompt that would have given you the result you wanted from the start.

It won't be worth it the first few times you try this, and you may not get it to where you want it. I think you might be pickier than others and you might be giving it harder problems, but I also bet you could get better results out of the box after you do this with a few problems.


Try learning to vibe code on something totally greenfield without looking at the code and see if it changes your mind. Ignore code quality; "does it work" and "am I happy with the app" are the only metrics.

Code quality is something you have to ignore with vibe coding - if code quality is important to you or your project, then this approach isn't for you. But if you abandon this concept and build things small enough or modular enough, then speed gains await!

IMO codebases can be architected for LLMs to work better in them, but this is harder in brownfield apps.


If you start greenfield and ignore the code quality, how do you know you can maintain it long term?

Greenfield is fundamentally easier than maintaining existing software. Once software exists, users expect it to behave a certain way and they expect their data to remain usable in new versions.

The existing software now imposes all sorts of contraints that may not be explicit in the spec. Some of these constraints end up making some changes very hard. Bad assumptions in data modeling can make migrations a nightmare.

You can't just write entirely new software every time the requirements change.


Genuine question: doesn't this apply more to coding style than to actual results? The same applies to writing style. LLMs manage to write great stories, but they don't suit my writing style. When generating code it doesn't always suit my coding style, but the code it generates functions fine.

This is also my experience with enterprise Java. LLMs have done much better with slightly less convoluted code bases in Go. Its currently clearly better at Go and Typescript than Java in my view

If you follow antirez's post history, he was a skeptic until maybe a year ago. Why don't you look at his recent commits and judge for yourself? I suppose the majority of his most recent code is relevant to this discussion.

https://github.com/antirez?tab=overview&from=2026-01-01&to=2...


I don't think I'd be a good judge because I don't have the years of familiarity and expertise in his repos that I do at my job. A lot of the value of me specifically vs an LLM at my job is that I have the tribal knowledge and the LLM does not. We have gotten a lot better at documentation, but I don't think we can _ever_ truly eliminate that factor.

Instead of rewriting it yourself, have you tried telling the agent what it did wrong and doing the rewrite with it? Then at the end of the session, ask it to extract a set of rules that would have helped it get things right the first time. Save that in AGENTS.md. If you and your team do this a few times, it can lead to only having to rewrite 5% of the code instead of 70%.

> Instead of rewriting it yourself, have you tried telling the agent what it did wrong and doing the rewrite with it?

I have, it becomes a race to the bottom.


Race to the bottom? Tell me more

It says "of course you're right" and may or may not refactor/fix/rewrite the issue correctly. More often than not it doesn't or misses some detail.

So you tell it again, "of course you are right", and the cycle repeats.

And then the context window gets exhausted. Compaction loses most of the details and degrades quality. You start a new session, but the new session has to re-learn the entire world from scratch and may or may not fix the issue.

And so the cycle continues.


It sounds harsh but you're most likely using it wrong.

1) Have an AGENTS.md that describes not just the project structure, but also the product and business (what it does, who it is for, etc). People expect LLMs to read a snippet of code and be as good as an employee who has an implicit understanding of the whole business; you must give it all that information. Tell it to use good practices (DRY, KISS, etc). Add patterns it should use or avoid as you go. (A minimal sketch of such a file is at the end of this comment.)

2) It must have source access to anything it interacts with. Use Monorepo, Workspaces, etc.

3) Most important of all, everything must be set up so the agent can iterate, test and validate its changes. It will make mistakes all the time, just like a human does (even basic syntax errors), but it will iterate and end up on a good solution. It's incorrect to assume it will make perfect code blindly without building, linting, testing, and iterating on it. No human would either. The LLM should be able to determine if a task was completed successfully or not.

4) It is not expected to always one-shot perfect code. If you value quality, you will glance at it, and sometimes have to reply to make it do things another way, extract this, refactor that. Having said that, you shouldn't need to write a single line of code (I haven't for months).

Using LLMs correctly allows you to complete tasks in minutes that would take hours, days, or even weeks, with higher quality and fewer errors.

Use Opus 4.5 with other LLMs as a fallback when Opus is being dumb.
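To make points 1 and 3 concrete, here's a minimal sketch of the kind of AGENTS.md I mean; every name, path and command below is a placeholder you'd replace with your own:

    # AGENTS.md

    ## What this is
    Order-management backend for a B2B storefront. Primary users are support
    staff fixing orders by hand; correctness matters more than speed.

    ## Structure
    - api/    HTTP endpoints only, no business logic
    - core/   domain logic
    - store/  database access; only core/ may call into it

    ## Practices
    - KISS and DRY; extend existing patterns rather than inventing new ones.
    - Never add a new dependency without asking first.

    ## How to validate changes
    - Lint:  make lint
    - Build: make build
    - Test:  make test  (all three must pass before a task counts as done)

With that in place the agent starts with the same implicit context a new hire would get, and it knows exactly how to check its own work.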


> Most important of all, everything must be set up so the agent can iterate, test and validate its changes.

This was the biggest unlock for me. When I receive a bug report, I have the LLM tell me where it thinks the source of the bug is, write a test that triggers the bug (i.e. fails), design a fix, and finally implement the fix, and repeat. I'm routinely surprised how good it is at doing this, and the speed with which it works. So even if I have to manually tweak a few things, I've moved much faster than I would have without the LLM.


"The LLM should be able to determine if a task was completed successfully or not."

Writing logic that verifies something complex requires basically solving the problem entirely already.


Situation A) Model writes a new endpoint and that's it

Situation B) Model writes a new endpoint, runs lint and build, adds e2e tests with sample data and runs them.

Did situation B mathematically prove the code is correct? No. But the odds the code is correct increases enormously. You see all the time how the Agent finds errors at any of those steps and fixes them, that otherwise would have slipped by.


LLM generated tests in my experience are really poor

Doesn't change the fact that what I mentioned greatly improves agent accuracy.

AI-generated implementation with AI-generated tests left me with some of the worst code I've witnessed in my life. Many of the passing tests it generated were tautologies (i.e. they would never fail even if behavior was incorrect).

When the tests failed, the agent tended to change the (previously correct) test, making it pass but leaving it functionally incorrect, or it "wisely" concluded that both the implementation and the test were correct but that there were external factors making the test fail (there weren't).

It behaved much like a really naive junior.
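
To give a concrete (made-up) example of what I mean by a tautology, a test like this passes no matter what the implementation returns:

  import { test } from "node:test";
  import assert from "node:assert/strict";
  import { formatPrice } from "./format"; // hypothetical module under test

  test("formats prices", () => {
    const result = formatPrice(1999);
    // compares the output against itself, so it can never fail,
    // even if formatPrice() is completely wrong
    assert.equal(result, formatPrice(1999));
  });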


Which coding agent and which model?

Actually it borderline undermines it because it's shit building upon shit

I'm not sure if I got into this weird LLM bubble where they give me bad advice to drive engagement, because I can't resist trying to correct them and tell them how absurdly wrong they are.

But it is astounding how terrible they are at debugging non-trivial assembly in my experience.

Anyone else have input here?

Am I in a weird bubble? Or is this just not their forte?

It's truly incredible how thoughtless they can be, so I think I'm in a bubble.


> I can't resist trying to correct them and tell them how absurdly wrong they are.

Oh god I thought I was the only one. Do you find yourself getting mad at them too?


If a normal person looked at my messages, they could safely assume I've gone crazy.

Yes, nothing has made me angry like their insistence that they are always right, even when you prove them wrong.

Again, I think I've done this to myself.

They know that gets me to respond, and all they care about is engagement.


I've tried to use Claude Code with Sonnet 4.5 for implementing a new interpreter, and man is it bad with reference counting. Granted, I'm doing it in Zig, so there's not as much training data, but Claude will suggest the most stupid changes. All it does is make the rare case of incorrect reference counting rarer, not fix the underlying problem. It kept heaping on more and more hacks, until I decided enough is enough and rolled up my sleeves. I still can't tell if it makes me faster, or if I'm faster on my own.

Even when refactoring, it would change all my comments, which is really annoying, as I put a lot of thought into my comments. Plus, the time it took to do each refactoring step was about as long as it would have taken me, and when I do it myself I get the additional benefit of feeling when I'm repeating code too often.

So, I'm not using it for now, except for isolating bugs. It's addicting having it work on it for me, but I end up feeling disconnected and then something inevitably goes wrong.


I'm also building a language in Zig!

Good luck!


> Every single time, I get something that works, yes, but then when I start self-reviewing the code, preparing to submit it to coworkers, I end up rewriting about 70% of the thing.

Have another model review the code, and use that review as automatic feedback?


CodeRabbit in particular is gold here. I don't know what they do but it is far better at reviewing than any AI model I've seen. From the deep kinds of things it finds, I highly suspect they have a lot of agents routing code to extremely specialized subagents that can find subtle concurrency bugs, misuse of some deep APIs, etc. I often have to do the architectural/big-picture/how-this-fits-into-the-project-vision review myself, but for finding actual bugs in code, or things that would be self-evident from reading one file, it is extremely good.

I've been using a `/feedback ...` command with claude code where I give it either positive or negative feedback about some action it just did, and it'll look through the session to make some educated guesses about why it did some thing - notably, checking for "there was guidance for this, but I didn't follow it", or "there was no guidance for this".

the outcome is usually a new or tweaked skill file.

it doesn't always fix the problem, but it's definitely been making some great improvements.
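
for anyone who wants to try something similar: custom slash commands in claude code are just markdown files under .claude/commands/, so a stripped-down version of this could look roughly like the following (a sketch, not my exact file):

  # .claude/commands/feedback.md
  The user is giving feedback about an action you just took: $ARGUMENTS

  1. Re-read the relevant part of this session and identify the action.
  2. Decide: was there existing guidance (skills, CLAUDE.md) that you ignored,
     or was there no guidance at all for this case?
  3. Propose a new or tweaked skill file that would have led to the right
     behavior, and show it to me before writing it.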


That is actually a gold tip. Codex CLI is way less pleasant to use than Opus, but way better at finding bugs, so I combine them.

Codex is a sufficiently good reviewer that I now let it review my hand-coded work too. It's a really, really good reviewer. I think I make this point often enough now that I suspect OpenAI should be paying me. Claude and Gemini will happily sign off on work that just doesn't work; OpenAI is a beast at code review.

I think you're spot on.

So many people hyping AI are only thinking about new projects and don't even distinguish between what is a product and what is a service.

Most software devs employed today work on maintaining services that have a ton of deliberate decisions baked in that were decided outside of that codebase and driven by business needs.

They are not building shiny new products. That's why most of the positive hype about AI doesn't make sense when you're actually at work and not just playing around with personal projects or startup POCs.


Do you have an example of something that was subpar and needed a 70% rewrite?

Same experience. The better the model, the more complicated the bugs and brain damage it introduces.

Perhaps one has to be a skilled programmer in the first place to spot the problems, which is not easy when the program apparently runs fine.

Things like mocked tests, you know. Who would care about that.


Thank you for providing data which can actually be used to collate! I strongly suspect that experience is a huge determinant of what utility is seen from LLMs.

It seems that there are more people writing and finishing projects, but not many have reached the point where they have to maintain their code / deal with the tech debt.


> 2) Other people are just less picky than I am, or they have a less thorough review culture that lets subpar code slide more often.

Given how consistently terrible the code of Claude Code-d projects posted here have been, I think this is it.

I find LLMs pretty useful for coding, for multiple things (to write boilerplate, as an idiomatic design pattern search engine, as a rubber duck, helping me name things, explaining unclear error messages, etc.), but I find the grandiose claims a bit ridiculous.


I think it comes down to what you mean by sub par code. If you're talking a mess of bubblesorts and other algorithmic problems, that's probably a prompting issue. If you're talking "I just don't like the style of the code, it looks inelegant" that's not really a prompting issue, models will veer towards common patterns in a way that's hard to avoid with prompts.

Think about it like compiler output. Literally nobody cares if that is well formatted. They just care that they can get fairly performant code without having to write assembly. People still dip to assembly (very very infrequently now) for really fine performance optimizations, but people used to write large programs in it (miserably).


There's a huge amount you're missing by boiling down their complaint to "bubble sorts or inelegant code". The architecture of the new code, how it fits into the existing system, whether it makes use of existing utility code (IMO this is a huge downside; LLMs seem to love to rewrite a little helper function 100x over), etc.

These are all important when you consider the long-term viability of a change. If you're working in a greenfield project where requirements are constantly changing and you plan on throwing this away in 3 months, maybe it works out fine. But not everyone is doing that, and I'd estimate most professional SWEs are not doing that, even!


There are certainly coupled, obtuse, contorted code styles that the LLM will be unable to twist itself into (which is different from the coupled, obtuse code it generates itself). Don't pretend this is good code though; own that you're up to your neck in shit.

LLMs are pretty good at modifying well factored code. If you have a functional modular monolith, getting agents to add new functions and compose them into higher order functionality works pretty darn well.


AI is a house painter, wall to wall, with missed spots and drips. Good coders are artists. That said, artists have been known to use assistants on backgrounds. Perhaps the end case is a similar coder/AI collaborative effort?

> I don't understand the stance that AI currently is able to automate away non-trivial coding tasks

It's just the Dunning-Kruger effect. People who think AI is the bee's knees are precisely the dudes who are least qualified to judge its effectiveness.


It's a combination of being bad at prompting and different expectations from the tool. You expect it to be one shot, and then rewrite things that don't match up to what you want.

Instead I recommend that you use LLMs to fix the problems that they introduced as well, and over time you'll get better at figuring out the parts that the LLM will get confused by. My hunch is that you'll find your descriptions of what to implement were more vague than you thought, and as you iterate, you'll learn to be a lot more specific. Basically, you'll find that your taste was more subjective than you thought and you'll rid yourself of the expectation that the LLM magically understands your taste.


> But what was the fire inside you, when you coded till night to see your project working? It was building.

I feel like this is not the same for everyone. For some people, the "fire" is literally about "I control a computer", for others "I'm solving a problem for others", and yet for others "I made something that made others smile/cry/feel emotions" and so on.

I think there is a section of programmers who actually do like the actual typing of letters, numbers and special characters into a computer, and for them, I understand LLMs remove the fun part. For me, I initially got into programming because I wanted to ruin other people's websites, then I figured out I needed to know how to build websites first, then I found it more fun to create and share what I've done with others, and they tell me what they think of it. That's my "fire". But I've met so many people who don't care an iota about sharing what they built with others; it matters nothing to them.

I guess the conclusion is, not all programmers program for the same reason, for some of us, LLMs helps a lot, and makes things even more fun. For others, LLMs remove the core part of what makes programming fun for them. Hence we get this constant back and forth of "Can't believe others can work like this!" vs "I can't believe others aren't working like this!", but both sides seems to completely miss the other side.


You’re right of course. For me there’s no flow state possible with LLM “coding”. That makes it feel miserable instead of joyous. Sitting around waiting while it spits out tokens that I then have to carefully look over and tweak feels like very hard work. Compared to entering flow and churning out those tokens myself, which feels effortless once I get going.

Probably other people feel differently.


I'm the same way. LLMs are still somewhat useful as a way to start a greenfield project, or as a very hyper-custom google search to have it explain something to me exactly how I'd like it explained, or generate examples hyper-tuned for the problem at hand, but that's hardly as transformative or revolutionary as everyone is making Claude Code out to be. I loathe the tone these things take with me and hate how much extra bullshit I didn't ask for they always add to the output.

When I do have it one-shot a complete problem, I never copy paste from it. I type it all out myself. I didn't pay hundreds of dollars for a mechanical keyboard, tuned to make every keypress a joy, to push code around with a fucking mouse.


I’m a “LLM believer” in a sense, and not someone who derives joy from actually typing out the tokens in my code, but I also agree with you about the hype surrounding Claude Code and “agentic” systems in general. I have found the three positive use cases you mentioned to be transformative to my workflow on its own. I’m grateful that they exist even if they never get better than they are today.

> I didn't pay hundreds of dollars for a mechanical keyboard, tuned to make every keypress a joy, to push code around with a fucking mouse

Can’t you use vim controls?


Having worked with a greenfield project that has significant amount of LLM output in it, I’m not sure if I agree. There’s all sorts of weird patterns, insufficient permission checking, weird tests that don’t actually test things, etc. It’s like building a house on sand.

I’ve used Claude to create copies of my tests, except instead of testing X feature, it tests Y feature. That has worked reasonably well, except that it has still copied tests from somewhere else too. But the general vibe I get is that it’s better at copying shit than creating it from scratch.


This is where we as software engineers need to be on the ball - just because an LLM wrote it doesn't mean it's right, doesn't mean we can let go of all the checks and balances and best practices we've developed over decades.

Set up tooling like tests and linters and the like. Set rules. Mandate code reviews. I've been using LLMs to write tests and frequently catch it writing tests that don't actually have any valuable assertions. It only takes a minute to fix these.


> Set up tooling like tests and linters and the like. Set rules. Mandate code reviews. I've been using LLMs to write tests and frequently catch it writing tests that don't actually have any valuable assertions. It only takes a minute to fix these.

You can do all that, but it still remains a case of "I'm only interested in the final result".

When I read LLM generated systems (not single functions), it looks very ... alien to me. Even juniors don't put together systems that have this uncanny valley feel to it.

I suppose the best way to describe it would be to say that everything lacks coherency, and if you are one of these logical-mind people who likes things that make sense, it's not fun wading through a field of Chesterton's Fences as your f/time job.


> and hate how much extra bullshit I didn't ask for they always add to the output.

I can recommend for that problem to make the "jumps" smaller, e.g. "Add a react component for the profile section, just put a placeholder for now" instead of "add a user profile".

With coding LLMs there's a bit of a hidden "zoom" functionality in doing that, which can help calibrate the speed/involvement/thinking you and the LLM do.


Three things I can suggest to try, having struggled with something similiar:

1. Look at it as a completely different discipline, don't consider it leverage for coding - it's its own thing.

2. Try using it on something you just want to exist, not something you want to build or are interested in understanding.

3. Make the "jumps" smaller. Don't oneshot the project. Do the thinking yourself, and treat it as a junior programmer: "Let's now add react components for the profile section and mount them. Dont wire them up yet" instead of "Build the profile section". This also helps finding the right speed so that you can keep up with what's happening in the codebase


> Try using it on something you just want to exist, not something you want to build or are interested in understanding.

I don't get any enjoyment from "building something without understanding" — what would I learn from such a thing? How could I trust it to be secure or to not fall over when i enter a weird character? How can I trust something I do not understand or have not read the foundations of? Furthermore, why would I consider myself to have built it?

When I enter a building, I know that an engineer with a degree, or even a team of them, have meticulously built this building taking into account the material stresses of the ground, the fault lines, the stresses of the materials of construction, the wear amounts, etc.

When I make a program, I do the same thing. Either I make something for understanding, OR I make something robust to be used. I want to trust the software I'm using to not contain weird bugs that are difficult to find, as best as I can ensure that. I want to ensure that the code is clean, because code is communication, and communication is an art form — so my code should be clean, readable, and communicative about the concepts that I use to build the thing. LLMs do not assure me of any of this, and the actively hamstring the communication aspect.

Finally, as someone surrounded by artists, who has made art herself, the "doing of it" has been drilled into me as the "making". I don't get the enjoyment of making something, because I wouldn't have made it! You can commission a painting from an artist, but it is hubris to point at a painting you bought or commissioned and go "I made that". But somehow it is acceptable to do this for LLMs. That is a baffling mindset to me!


>I don't get any enjoyment from "building something without understanding" — what would I learn from such a thing? How could I trust it to be secure or to not fall over when i enter a weird character? How can I trust something I do not understand or have not read the foundations of? Furthermore, why would I consider myself to have built it?

All of these questions are irrelevant if the objective is 'get this thing working'.


> You can commission a painting from an artist, but it is hubris to point at a painting you bought or commissioned and go "I made that". But somehow it is acceptable to do this for LLMs. That is a baffling mindset to me!

The majority of the work on a lot of famous masterpieces of art was done by apprentices. Under the instruction of a master, but still. No different than someone coming up with a composition, and having AI do a first pass, then going in with photoshop and manually painting over the inadequate parts. Yet people will knob gobble renaissance artists and talk about lynching AI artists.


I've heard this analogy regurgitated multiple times now, and I wish people would not.

It's true that many master artists had workshops with apprenticeships. Because they were a trade.

By the time you were helping to paint portraits, you'd spent maybe a decade learning techniques and skill and doing the unimportant parts and working your way up from there.

It wasn't a half-assed, slop some paint around and let the master come fix it later. The people doing things like portrait work or copies of works were highly skilled and experienced.

Typing "an army of Garfields storming the beach at Normandy" into a website is not the same.


That's a straw man and you know it.

Anti-AI art folks don't care if you photobashed bits of AI composition and then totally painted over it in your own hand, the fact that AI was involved makes it dirty, evil, nasty, sinful and bad. Full stop. Anti-AI writing agents don't care if every word in a manuscript was human written, if you asked AI a question while writing it suddenly you're darth fucking vader.

The correct comparison for some jackass who just prompts something, then runs around calling it art is to a pre-schooler that scribbles blobs of indistinct color on a page, then calls it art. Compare apples to apples.


You seem to read a lot into what I wrote, so let me phrase it differently.

These are ways I'd suggest to approach working with LLMs if you enjoy building software, and are trying to find out how it can fit into your workflow.

If this isnt you, these suggestions probably wont work.

> I don't get any enjoyment from "building something without understanding".

That's not what I said. It's about your primary goal. Are you trying to learn technology xyz and found a project so you can apply it, vs. wanting a solution to your problem where nothing exists, so you're building it?

What's really important is that whether you understand in the end what the LLM has written or not is 100% your decision.

You can be fully hands off, or you can be involved in every step.


Lately I've been interested in biosignals, biofeedback and biosynchronization.

I've been really frustrated with the state of Heart Rate Variability (HRV) research and HRV apps, particularly those that claim to be "biofeedback" but are really just guided breathing exercises by people who seem to have the lights on and nobody home. [1]

I could have spent a lot of time reading the docs to understand the Web Bluetooth API, and facing up to the stress that getting anything Bluetooth working with a PC is super hit and miss, I'd have expected a high risk of spending hours rebooting my computer and otherwise futzing around to debug connection problems.

Although it's supposedly really easy to do this with the Web Bluetooth API, I amazingly couldn't find any examples, which made me all the more apprehensive that there was some reason it doesn't work. [2]

As it was Junie coded me a simple webapp that pulled R-R intervals from my Polar H10 heart rate monitor in 20 minutes and it worked the first time. And in a few days, I've already got an HRV demo app that is superior to the commercial ones in numerous ways... And I understand how it works 100%.
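
For the curious, the core of it really is small. From memory, the Web Bluetooth part looks roughly like this (a sketch, not the exact code Junie produced):

  // read R-R intervals from a BLE heart rate strap via Web Bluetooth
  // (must be called from a user gesture, e.g. a button click)
  const device = await navigator.bluetooth.requestDevice({
    filters: [{ services: ["heart_rate"] }],
  });
  const server = await device.gatt.connect();
  const service = await server.getPrimaryService("heart_rate");
  const ch = await service.getCharacteristic("heart_rate_measurement");
  await ch.startNotifications();

  ch.addEventListener("characteristicvaluechanged", () => {
    const data = ch.value;               // DataView with the Heart Rate Measurement packet
    const flags = data.getUint8(0);
    let offset = (flags & 0x01) ? 3 : 2; // skip flags plus 8- or 16-bit heart rate
    if (flags & 0x08) offset += 2;       // skip energy expended field if present
    if (flags & 0x10) {                  // R-R intervals present
      for (; offset + 1 < data.byteLength; offset += 2) {
        const rrMs = (data.getUint16(offset, true) / 1024) * 1000; // units of 1/1024 s
        console.log("R-R interval:", rrMs.toFixed(0), "ms");
      }
    }
  });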

I wouldn't call it vibe coding because I had my feet on the ground the whole time.

[1] for instance I am used to doing meditation practices with my eyes closed and not holding a 'freakin phone in my hand. why they expect me to look at a phone to pace my breathing when it could talk to me or beep at me is beyond me. for that matter why they try to estimate respiration by looking at my face when they could get it off the accelerometer if i put it on my chest when i am lying down is also beyond me.

[2] let's see, people don't think anything is meaningful if it doesn't involve an app, nobody's gotten a grant to do biofeedback research since 1979 so the last grad student to take a class on the subject is retiring right about now...


I build a lot of custom tools, things with like a couple of users. I get a lot of personal satisfaction writing that code.

I think comments on YouTube like "anyone still here in $CURRENT_YEAR" are low effort noise, so I wanted an extension to hide them. I don't care about learning how to write a web extension (web work is my day job), so I got Claude to write one for me. I don't care who wrote it, I just wanted it to exist.


>When I enter a building, I know that an engineer with a degree, or even a team of them, have meticulously built this building taking into account the material stresses of the ground, the fault lines, the stresses of the materials of construction, the wear amounts, etc.

You can bet that "AI" is coming for this too. The lawsuits that will result when buildings crumble and kill people because an LLM "hallucinated" will be tragic, but maybe we'll learn from it. But we probably won't.


I think the key thing here is in point 2.

I’ve wanted a good markdown editor with automatic synchronization. I used to use Inkdrop, which I stopped using when the developer/owner raised the price to $120/year.

In a couple hours with Claude code, I built a replacement that does everything I want, exactly the way I want. Plus, it integrates native AI chat to create/manage/refine notes and ideas, and it plugs into a knowledge RAG system that I also built using Claude code.

What more could I ask for? This is a tool I wanted for a long time but never wanted to spend the dozens of hours dealing with the various pieces of tech I simply don’t care about long-term.

This was my AI “enlightenment” moment.


Really interesting. How do you find the quality of the code and the final result to be? Do you maybe have this public, would love to check it out!

The incredible thing (to me) is that this isn’t even remotely a new thing: it’s reviewing pull requests vs writing your own code. We all know how different that feels!

For me it feels like print statement debugging in a compiled language

Correct, provided you were the one who wrote an incredibly specific feature request that the pull request solved for you.

I gotta say, the "sitting around waiting" comment hits - I have the same with current-day merge request based development, a lot of time is fragmented because I'm waiting for the CI to finish. I've got seven open merge requests at the moment, some of which have been open since before the holidays. It's a lot of fragmented waiting, fixing, prodding people to review code, and shitposting on HN to pass the time. It's uh. Not healthy.

But this is my reality in my current line of work, a lot of relatively simple work but a lot of processes and checks to conform to rules (that I set myself lol) and not break existing functionality.


I have both; for embedded and backend I prefer entering code; once in the flow, I produce results faster and feel more confident everything is correct. For frontend (except games), I find everything annoying and a waste of time to do manually, as do all my colleagues. LLMs really made this excellent for our team and myself. I like doing UX, but I like drawing it with a pen and paper and then doing experiments with controls/components until it works. This is now all super fast (I usually can just take a photo of my drawings and Claude makes it work) and we get excellent end results that clients love.

> For me there’s no flow state possible with LLM “coding”.

I would argue that it's the same question as whether it's possible to get into a flow state when being the "navigator" in a pair-programming session. I feel you and agree that it's not quite the same flow state as typing the code yourself, but when a session with a human programmer or Claude Code is going well for me, I am definitely in something quite close to flow myself, and I can spend hours in the back and forth. But as others in this thread said, it's about the size of the tasks you give it.


I can say I feel that flow state sometimes when it all works but I certainly don't when it doesn't work.

The other day I was making changes to some CSS that I partially understood.

Without an LLM I would have looked at the 50+ CSS spec documents and the 97% wrong answers on Stack Overflow and all the splogs, and would have bumbled around and tried a lot of things and gotten it to work in the end and not really understood why and experienced a lot of stress.

As it was I had a conversation with Junie about "I observe ... why does it work this way?", "Should I do A or do B?", "What if I did C?" and came to understand the situation 100% and wrote a few lines of code by hand that did the right thing. After that I could have switched it to Code mode and said "Make it so!" but it was easy when I understood it. And the experience was not stressful at all.


I could imagine a world where LLM coding was fun. It would sound like "imagine a game, like Galaxians but using tractor trailers, and as a first person shooter." And it pumps out a draft and you say, "No, let's try it again with an army of bagpipers."

In other words, getting to be the "ideas guy", but without sounding like a dipstick who can't do anything.

I don't think we're anywhere near that point yet. Instead we're at the same point where we are with self-driving: not doing anything but on constant alert.


Prompt one:

  imagine a game, like Galaxians but using tractor trailers,
  and as a first person shooter. Three.js in index.html
Result: https://gisthost.github.io/?771686585ef1c7299451d673543fbd5d

Prompt two:

  No, let's try it again with an army of bagpipers.
Result: https://gisthost.github.io/?60e18b32de6474fe192171bdef3e1d91

I'll be honest, the bagpiper 3D models were way better than I expected! That game's a bit too hard though, you have to run sideways pretty quickly to avoid being destroyed by incoming fire.

Here's the full transcript: https://gisthost.github.io/?73536b35206a1927f1df95b44f315d4c


There's a reason why bagpipes are banned under the Geneva convention!

> There's a reason why bagpipes are banned under the Geneva convention!

I know this is not Reddit, but when I see such a comment, I can't resist posting a video of "the internet's favorite song" on an electrical violin and bagpipes:

> Through the Fire and Flames (Official Video) - Mia x Ally

> https://www.youtube.com/watch?v=KVOBpboqCgQ


Can it make it work on mobile?

Yes, but I didn't bother here (not part of the original prompt).

You're welcome to drop the HTML into a coding agent and tell it to do that. In my experience you usually have to decide how you want that to work - I've had them build me on-screen D-Pad controls before but I've also tried things like getting touch-to-swipe plus an on-screen fire button.


YOUR EARS HAVE SURRENDERED lmao

For me the excitement is palpable when I've asked it to write a feature, then I go test it and it entirely works as expected. It's so cool.

There are multiple self driving car companies that are fully autonomous and operating in several cities in the US and China. Waymo has been operating for many years.

There are full self driving systems that have been in operation with human driver oversight from multiple companies.

And the capabilities of the LLMs in regards to your specific examples were demonstrated below.

The inability of the public to perceive or accept the actual state of technology due to bias or cognitive issues is holding back society.


It's a lot of mistrust and fear, too - a computer could never be as good at driving as a person!

And yet, over the years many things have just been accepted. Satnav for example, I grew up with my mom having the map in her lap, or my dad writing down directions. Later on we had a route planner on diskettes (I think) and a printout of the route. And my dad now has had a satnav in his car for near enough two decades. I'm sure they like everyone else ran into the quirks of satnav, but I don't think there was nearly as much "fear" and doubt for satnav as there is for self-driving cars and nowadays LLMs / coding agents. Or I'm misremembering it and have rose-tinted glasses, I also remember the brouhaha of people driving into canals because the satnav told them to turn left.


Why do you feel you need to "carefully look over and tweak" stuff?

Can you define code quality and the goal of the program in a deterministic way?

If it quacks like a duck, walks like a duck and is a duck, does it matter if it's actually a raven inside?


You're not alone. I definitely feel like this is / will be a major adaptation required for software engineers going forward. I don't have any solutions to offer you - but I will say that the state that's enabled by fast feedback loops wasn't always the case. For most of my career build times were much, much longer than they are today, as an example. We had to work around that to maintain flow, and we'll have to work around this, now.

I feel the same way often but I find it to be very similar to coding. Whether coding or prompting when I’m doing rote, boring work I find it tedious. When I am solving a hard problem or designing something interesting I am engaged.

My app is fairly mature with well established patterns, etc. When I’m adding “just CRUD” as part of a feature it’s very tedious to prompt agents, review code, rinse & repeat. Were I actually writing the code by hand I would probably be less productive and just as bored/unsatisfied.

I spent a decent amount of time today designing a very robust bulk upload API (compliance fintech, lots of considerations to be had) for customers who can’t do a batch job. When it was finished I was very pleased with the result and had performance tests and everything.


This.

To me, using an LLM is more like having a team of ghostwriters writing your novel. Sure, you "built" your novel, but it feels entirely different from writing it yourself.


Wouldn't it be like having a team of software developers writing your code? The analogy doesn't even need to go as far as a different line of work. And for some this (writing to managing) is a natural career progression.

And if you write novels mostly because you enjoy watching them sell, as opposed to sharing ideas with people, you don't care.

To scientists, the purpose of science is to learn more about the world; to certain others, it's about making a number of dollars go up. Mathematicians famously enjoy creating math, and would have no use for a "create more math" button. Musicians enjoy creating music, which is very different from listening to it.

We're all drawn to different vocations, and it's perverse to accept that "maximize shareholder value" is the highest.


Yes this is exactly what I feel. I disconnect enough that if it’s really taking its time I will pull up Reddit and now that single prompt cost me half an hour.


I feel differently! My background isn't programming, so I frequently feel inhibited by coding. I've used it for over a decade but always as a secondary tool. It's fun for me to have a line of reasoning, and be able to toy with and analyze a series of questions faster than I used to be able to.

Ditto. Coding isn't what i specifically do, but it's something i will choose to do when it's the most efficient solution to a problem. I have no problem describing what i need a program to do and how it should do so in a way that could be understandable even to a small child or clever golden retriever, but i'm not so great at the part where you pull out that syntactic sugar and get to turning people words into computer words. LLMs tend to do a pretty good job at translating languages regardless of whether i'm talking to a person or using a code editor, but i don't want them deciding what i wanted to say for me.

See, I’m with you, but in my day to day work I could almost never get into a flow state while coding, because very little of my work involves creating things or solving real problems; it typically involves just trying to mentally untangle huge rat’s nests, Jenga-ing bug fixes and the occasional feature in, and then spending a bunch of time testing to make sure I didn’t break anything, no flow involved. I’ve been grudgingly using Cursor heavily for the past few weeks and it’s been helping make all of this significantly more bearable.

LLMs aren’t replacing the joy of coding for me, but they do seem to be helping me deal with the misery of being a professional coder.


> no flow state possible with LLM “coding”

I've hit flow state setting it up to fly. When it flies is when the human gets out of the loop, so the AI can look at the thing itself and figure out why its attempt to center the div isn't actually centering the div, or why the kernel isn't booting. Like, getting to the point, pre-AI, where git bisect running in a loop is the flow state. Now, with AI, setting that up is the flow.


Well, are you the super developer that never runs into issues or challenges? For me, and I think most developers, coding is like a continuous stream of problems you need to solve. For me an LLM is very useful, because I can now develop much faster. I don't have to think about which sorting algorithm should be used or which trigonometric function I need for a specific case. My LLM buddy solves most of those issues.

When you don't know the answer to a question you ask an LLM, do you verify it or do you trust it?

Like, if it tells you merge sort is better on that particular problem, do you trust it or do you go through an analysis to confirm it really is?

I have a hard time trusting what I don't understand. And even more so if I realize later I've been fooled. Note that it's the same with humans though. I think I only trust technical decisions I don't understand when I deem the risk of being wrong low enough. Otherwise I'll invest in learning and understanding enough to trust the answer.


For all these "open questions" you might have it is better to ask the LLM write a benchmark and actually see the numbers. Why rush, spend 10 minutes, you will have a decision backed by some real feedback from code execution.

But this is just a small part from a much grander testing activity that needs to wrap the LLM code. I think my main job moved to 1. architecting and 2. ensuring the tests are well done.

What you don't test is not reliable yet. Looking at code is not testing, it's "vibe-testing" and should be an antipattern: no LGTM for AI code. We should not rely on our intuition alone, because it is not strict enough, and it makes everything slow - we should not "walk the motorcycle".


Ok. I also have the intuition that more tests and formal specifications can help there.

So far, my biggest issue is that when the code produced is incorrect, with a subtle bug, I just feel I have wasted time prompting for something I should have written myself, because now I have to understand it deeply to debug it.

If the test infrastructure is sound, then maybe there is a gain after all even if the code is wrong.


Often those kind of performance things just don't matter.

Like right now I am working on algorithms for computing heart rate variability and only looking at a 2 minute window with maybe 300 data points at most so whether it is N or N log N or N^2 is beside the point.

When I know I'm computing the right thing for my application and know I've coded it up correctly and I am feeling some pain about performance, that's another story.


I tell it to write a benchmark, and I learn from how it does that.

IME I don't learn by reading or watching, only by wrestling with a problem. ATM, I will only hand it off to the LLM if the problem does not feel worth learning about (like Jenkinsfiles or Gradle scripting).

But yes, the bench result will tell something true.


> I have a hard time trusting what I don't understand

Who doesn't? But we have to trust them anyway, otherwise everyone would need a PhD in everything.

Also, for people who "have a hard time trusting", they might just give up when encountering things they don't understand. With AI at least there is a path for them to keep digging deeper and actually verify things to whatever level of satisfaction they want.


Sure, but then I rely on an actual expert.

My issue is, LLMs fooled me more than a couple of times with stupid but difficult-to-notice bugs. At that point, I have a hard time trusting them (but keep trying with some stuff).

If I asked someone for something and found out several times that the individual keeps failing, then I'll just stop working with them.

Edit: and to avoid just anthropomorphizing LLMs too much, the moment I notice a tool I use has a bug to the point of losing data, for example, I reconsider real hard whether to use it again or not.


I like writing. I hate editing.

Coding with an LLM seems like it’s often more editing in service of less writing.

I get this is a very simplistic way of looking at it and when done right it can produce solutions, even novel solutions, that maybe you wouldn’t have on your own. Or maybe it speeds up a part of the writing that is otherwise slow and painful. But I don’t know, as somebody who doesn’t really code every time I hear people talk about it that’s what it sounds like to me.


> I think there is a section of programmer who actually do like the actual typing of letters, numbers and special characters into a computer...

Reminds me of this excerpt from Richard Hamming's book:

> Finally, a more complete, and more useful, Symbolic Assembly Program (SAP) was devised—after more years than you are apt to believe during which most programmers continued their heroic absolute binary programming. At the time SAP first appeared I would guess about 1% of the older programmers were interested in it—using SAP was “sissy stuff”, and a real programmer would not stoop to wasting machine capacity to do the assembly. Yes! Programmers wanted no part of it, though when pressed they had to admit their old methods used more machine time in locating and fixing up errors than the SAP program ever used. One of the main complaints was when using a symbolic system you do not know where anything was in storage—though in the early days we supplied a mapping of symbolic to actual storage, and believe it or not they later lovingly pored over such sheets rather than realize they did not need to know that information if they stuck to operating within the system—no! When correcting errors they preferred to do it in absolute binary addresses.


I think this is beside the point, because the crucial change with LLMs is that you don’t use a formal language anymore to specify what you want, and get a deterministic output from that. You can’t reason with precision anymore about how what you specify maps to the result. That is the modal shift that removes the “fun” for a substantial portion of the developer workforce.

That's not it for me, personally.

I do all of my programming on paper, so keystrokes and formal languages are the fast part. LLMs are just too slow.


I'd be interested in learning more about your workflow. I've certainly used plaintext files (and similar such things) to aid in project planning, but I've never used paper beyond taking a few notes here and there.

Not who you’re replying to, but I do this as well. I carry a pocket notebook and write paragraphs describing what I want to write. Sometimes I list out the fields of a data structure. Then I revise. By the time I actually write the code, it’s more like a recitation. This is so much easier than trying to think hard about program structure while logged in to my work computer with all the messaging and email.

Yes this is my technique as well.

Others have different prerogatives, but I personally do not want to work more than I need to.


> because the crucial change with LLMs is that you don’t use a formal language anymore to specify what you want, and get a deterministic output from that

You don't just code, you also test, and your safety is just as good as your test coverage and depth. Think hard about how to express your code to make it more testable. That is the single way we have now to get back some safety.

But I argue the manual inspection of code and thinking it through in your head is still not strict checking, it is vibe-testing as well; only code backed by tests is not vibe-based. If needed use TLA+ (generated by LLM) to test, or go as deep as necessary to test.


It's not about fun. When I'm going through the actual process of writing a function, I think about design issues: about how things are named, about how the errors from this function flow up, about how scheduling is happening, about how memory is managed. I compare the code to my ideal, and this is the time when I realize that my ideal is flawed or incomplete.

I think a lot of us don't get everything specced out up front; we see how things fit, and adjust accordingly. Most of the really good ideas I've had were not formulated in the abstract, but were realizations had in the process of spelling things out.

I have a process, and it works for me. Different people certainly have other ones, and other goals. But maybe stop telling me that instead of interacting with the compiler directly it's absolutely necessary that I describe what I want to a well-meaning idiot, and patiently correct them, even though they are going to forget everything I just said in a moment.


> ... stop telling me that instead of interacting with the compiler directly its absolutely necessary that instead I describe what I want to a well meaning idiot, and patiently correct them, even though they are going to forget everything I just said in a moment.

This perfectly describes the main problem I have with the coding agents. We are told we should move from explicit control and writing instructions for the machine to pulling the slot lever over and over and "persuading the machine" hoping for the right result.


I don't know what book you're talking about, but it seems that you intend to compare the switch to an AI-based workflow to using a higher-level language. I don't think that's valid at all. Nobody using Python for any ordinary purpose feels compelled to examine the resulting bytecode, for example, but a responsible programmer needs to keep tabs on what Claude comes up with, configure a dev environment that organizes the changes into a separate branch (as if Claude were a separate human member of a team) etc. Communication in natural language is fundamentally different from writing code; if it weren't, we'd be in a world with far more abundant documentation. (After all, that should be easier to write than a prompt, since you already have seen the system that the text will describe.)

> Nobody using Python for any ordinary purpose feels compelled to examine the resulting bytecode, for example,

The first people using higher level languages did feel compelled to. That's what the quote from the book is saying. The first HLL users felt compelled to check the output just like the first LLM users.


Yes, and now they don't.

But there is no reason to suppose that responsible SWEs would ever be able to stop doing so for an LLM, given the reliance on nondeterminism and a fundamentally imprecise communication mechanism.

That's the point. It's not the same kind of shift at all.


Hamming was talking about assembler, not a high level language.

Assembly was a "high level" language when it was new -- it was far more abstract than entering in raw bytes. C was considered high level later on too, even though these days it is seen as "low level" -- everything is relative to what else is out there.

The same pattern held through the early days of "high level" languages that were compiled to assembly, and then the early days of higher level languages that were interpreted.

I think it's a very apt comparison.


If the same pattern held, then it ought to be easy to find quotes to prove it. Other than the one above from Hamming, we've been shown none.

Read the famous "Story of Mel" [1] about Mel Kaye, who refused to use optimizing assemblers in the late 1950s because "you never know where they are going to put things". Even in the 1980s you used to find people like that.

[1] https://en.wikipedia.org/wiki/The_Story_of_Mel


The Story of Mel counts against the narrative because Mel was so overwhelmingly skilled that he was easily able to outdo the optimizing compiler.

> you intend to compare the switch to an AI-based workflow to using a higher-level language.

That was the comparison made. AI is an eerily similar shift.

> I don't think that's valid at all.

I don't think you made the case by cherry-picking what it can't do. This is exactly the same situation as the time SAP appeared. There weren't symbols for every situation binary programmers were dealing with at the time. This doesn't change the obvious and practical improvement that abstractions provided. Granted, I'm not happy about it, but I can't deny it either.


Contra your other replies, I think this is exactly the point.

I had an inkling that the feeling existed back then, but I had no idea it was documented so explicitly. Is this quote from The Art of Doing Science and Engineering?


In my feed 'AI hype' outnumbers 'anti-AI hype' 5-1. And anti-hype moderates like antirez and simonw are rare. To be a radical in ai is to believe that ai tools offer a modest but growing net positive utility to a modest but growing subset of hackers and professionals

The only AI bloggers who don't have something to sell seem to be simonw, the flask guy, and this redis guy. Any other blog recommendations from HN?

tbh I think it is just a question of time before the flask guy has something to sell: https://earendil.com/

I can't for the life of me tell what it's about.

Well put.

AI obviously brings big benefits into the profession. We just have not seen exactly what they are yet, or how it will unfold.

But personally I feel that a future of not having to churn out yet another crud app is attractive.


In theory “not having to churn out yet another crud app” doesn’t require AI, any ol code generator will do. AI is a really expensive way (in terms of gpus/tpus) to generate boilerplate, but as long as that cost is massively subsidized by investors, you may as well use it.

I agree, we (or I) should have gotten out of this earlier. Shame on me, really. But LLMs have lowered the threshold.

The problem I see is not so much in how you generate the code. It is about how you maintain the code. If you check in the AI-generated code unchanged, do you then start changing that code by hand later? Do you trust that in the future AI can fix bugs in your code? Or do you clean up the AI-generated code first?

LLMs remove the familiarity of “I wrote this and deeply understand this”. In other words, everything is “legacy code” now ;-)

For those who are less experienced with the constant surprises that legacy code bases can provide, LLMs are deeply unsettling.


This is the key point for me in all this.

I've never worked in web development, where it seems to me the majority of LLM coding assistants are deployed.

I work on safety critical and life sustaining software and hardware. That's the perspective I have on the world. One question that comes up is "why does it take so long to design and build these systems?" For me, the answer is: that's how long it takes humans to reach a sufficient level of understanding of what they're doing. That's when we ship: when we can provide objective evidence that the systems we've built are safe and effective. These systems we build, which are complex, have to interact with the real world, which is messy and far more complicated.

Writing more code means that's more complexity for humans (note the plurality) to understand. Hiring more people means that's more people who need to understand how the systems work. Want to pull in the schedule? That means humans have to understand in less time. Want to use Agile or this coding tool or that editor or this framework? Fine, these tools might make certain tasks a little easier, but none of that is going to remove the requirement that humans need to understand complex systems before they will work in the real world.

So then we come to LLMs. It's another episode of "finally, we can get these pesky engineers and their time wasting out of the loop". Maybe one day. But we are far from that today. What matters today is still how well do human engineers understand what they're doing. Are you using LLMs to help engineers better understand what they are building? Good. If that's the case you'll probably build more robust systems, and you _might_ even ship faster.

Are you trying to use LLMs to fool yourself into thinking this still isn't the game of humans needing to understand what's going on? "Let's offload some of the understanding of how these systems work onto the AI so we can save time and money". Then I think we're in trouble.


> Are you trying to use LLMs to fool yourself into thinking this still isn't the game of humans needing to understand what's going on?

This is a key question. If you look at all the anti-AI stuff around software engineering, the pervading sentiment is “this will never be a senior engineer”. Setting aside the possibility of future models actually bridging this gap (this would be AGI), let’s accept this as true.

You don’t need an LLM to be a senior engineer to be an effective tool, though. If an LLM can turn your design into concrete code more quickly than you could, that gives you more time to reason over the design, the potential side effects, etc. If you use the LLM well, it allows you to give more time to the things the LLM can’t do well.


Fully agree. In my own usage of AI (which I came to a bit late but have tried to fully embrace so I know what it can and can't do) I've noticed a very unusual side effect: I spend way more of my time documenting and reviewing designs than I used to, and that has been a big positive. I've always been very (maybe too) thoughtful about design and architecture, but I usually focused on high-level design and then would get to some coding as a way of evaluating/testing my designs. I could then throw away v0 using lessons learned and start a v1 on a solid track. Now however, I find myself able to get a lot further in nailing down the design to the point I don't have to build and throw away v0. The prototype is often highly salvageable with the help of the LLM doing the refactoring/iterating that used to make "starting over" a more optimal path. That in turn allows me to maintain the context and velocity of the design much better since there aren't days, or weeks, or even months between the "lessons learned" that then have to go back and revise the design.

The caveat here though, is if I didn't have the decades of experience writing/designing software by hand, I don't think I'd have the skills needed to reap the above benefit.


" They make it easier to explore ideas, to set things up, to translate intent into code across many specialized languages. But the real capability—our ability to respond to change—comes not from how fast we can produce code, but from how deeply we understand the system we are shaping. Tools keep getting smarter. The nature of learning loop stays the same."

https://martinfowler.com/articles/llm-learning-loop.html


Learning happens when your ideas break, when code fails, when unexpected things happen. And in order to have that in a coding agent you need to provide a sensitive skin, which is made of tests; they provide pain feedback to the agent. Inside a good test harness the agent can't break things; it moves in a safe space with greater efficiency than before. So it was the environment providing us with understanding all along, and we should make an environment where the AI can understand what the effects of its actions are.

I don't think "understanding" should be the criteria, you can't commit your eyes in the PR. What you can commit is a test that enforces that understanding programatically. And we can do many many more tests now than before. You just need to ensure testing is deep and well designed.

You can not test that which you do not understand.

Why can't you use LLMs with formal methods? Mathematicians are using LLMs to develop complex proofs. How is that any different?

I don't know why you're being downvoted, I think you're right.

I think LLMs need different coding languages, ones that emphasise correctness and formal methods. I think we'll develop specific languages for using LLMs with that work better for this task.

Of course, training an LLM to use it then becomes a chicken/egg problem, but I don't think that's insurmountable.


maybe. I think we're really just starting this, and I suspect that trying to fuse neural networks with symbolic logic is a really interesting direction to try to explore.

that's kind of not what we're talking about. a pretty large fraction of the community thinks programming is stone cold over because we can talk to an LLM and have it spit out some code that eventually compiles.

personally I think there will be a huge shift in the way things are done. it just won't look like Claude.


I suspect that we are going to have a wave of gurus who show up soon to teach us how to code with LLMs. There’s so much doom and gloom in these sorts of threads about the death of quality code that someone is going to make money telling people how to avoid that problem.

The scenario you describe is a legitimate concern if you’re checking in AI generated code with minimal oversight. In fact I’d say it’s inevitable if you don’t maintain strict quality control. But that’s always the case, which is why code review is a thing. Likewise you can use LLMs without just checking in garbage.

The way I’ve used LLMs for coding so far is to give instructions and then iterate on the result (manually or with further instructions) until it meets my quality standards. It’s definitely slower than just checking in the first working thing the LLM churns out, but it’s still been faster than doing it myself, and I understand it exactly as well because I have to in order to give instructions (design) and iterate.

My favorite definition of “legacy code” is “code that is not tested” because no matter who writes code, it turns into a minefield quickly if it doesn’t have tests.


How do you know that it's actually faster than if you'd just written it yourself? I think the review and iteration part _is_ the work, and the fact that you started from something generated by an LLM doesn't actually speed things up. The research that I've seen also generally backs this idea up -- LLMs _feel_ very fast because code is being generated quickly, but they haven't actually done any of the work.

Because I’ve been a software engineer for over 20 years. If I look at a feature and feel like it will take me a day and an LLM churns it out in a hour including the iterating, I’m confident that using the LLM was meaningfully faster. Especially since engineers (including me) are notoriously bad at accurate estimation and things usually take at least twice as long as they estimate.

I have tested throwing several features at an LLM lately and I have no doubt that I’m significantly faster when using an LLM. My experience matches what Antirez describes. This doesn’t make me 10x faster, mostly because so much of my job is not coding. But in terms of raw coding, I can believe it’s close to 10x.


Because I don't type that fast.

I know exactly what the result should be, the LLM is just typing it for me.

And it will do the typing while I get up and go to the bathroom (again, I'm getting old).

When I come back, it's done, tests have been run that prove nothing broke.


> I know exactly what the result should be, the LLM is just typing it for me.

This is the mental model people should be working with. The LLM is there to tighten the loop from thought to code. You don't need to treat it like another engineer; you just need to use it to make you more efficient.

It so happens that you *can* give an LLM half-baked thoughts and it will sometimes still do a good job because the right thing is so straightforward. But in general the more vague and unclear your own thoughts, the lower quality the results, necessitating more iterations to refine.


> My favorite definition of “legacy code” is “code that is not tested” because no matter who writes code, it turns into a minefield quickly if it doesn’t have tests.

On the contrary, legacy code has, by definition, been battle tested in production. I would amend the definition slightly to:

“Legacy code is code that is difficult to change.”

Lacking tests is one common reason why this could be, but not the only possible reason.


It’s from Working Effectively with Legacy Code. I don’t recall the exact definition but it’s something to that effect. Legacy = lack of automated tests.

The biggest barrier to changing code is usually insufficient automated testing. People are terrified of changing code when they can’t verify the results before breaking production.

More glibly legacy code is “any code I don’t want to deal with”. I’ve seen code written 1 year prior officially declared “legacy” because new coding standards were being put in place and no one wanted to update the old code to match.


> My favorite definition of “legacy code” is “code that is not tested” because no matter who writes code, it turns into a minefield quickly if it doesn’t have tests.

Unfortunately, "tests" don't do it, they have to be "good tests". I know, because I work on a codebase that has a lot of tests and some modules have good tests and some might as well not have tests because the tests just tell you that you changed something.


As someone who started their first greenfield project 20 years into their career: Sounds like a Tuesday for me.

We have the tools and knowledge for working with legacy code, have had for decades. There are shelf-meters of books written about it.

It's just a different skillset.


I think it was Cory Doctorow who compared AI-generated code to asbestos. Back in its day, asbestos was in everything, because of how useful it seemed. Fast forward decades and now asbestos abatement is a hugely expensive and time-consuming requirement for any remodeling or teardown project. Lead paint has some of the same history.

Get your domain names now! AI Slop Abatement, the major growth industry of the 2030s.

I see where you're coming from, and I agree with the implication that this is more of an issue for inexperienced devs. Having said that, I'd push back a bit on the "legacy" characterization.

For me, if I check in LLM-generated code, it means I've signed off on the final revision and feel comfortable maintaining it to a similar degree as though it were fully hand-written. I may not know every character as intimately as that of code I'd finished writing by hand a day ago, but it shouldn't be any more "legacy" to me than code I wrote by hand a year ago.

It's a bit of a meme that AI code is somehow an incomprehensible black box, but if that is ever the case, it's a failure of the user, not the tool. At the end of the day, a human needs to take responsibility for any code that ends up in a product. You can't just ship something that people will depend on not to harm them without any human ever having had the slightest idea of what it does under the hood.


Take responsibility by leaving good documentation of your code and a beefy set of tests, so future agents and humans have a point to bootstrap from, not just plain code.

Yes, that too, but you should still review and understand your code.

Depends on what you do. When I'm using LLMs to generate code for projects I need to maintain (basically, everything that isn't throw-away-once-used), I treat it as any other code I'd write: tightly controlled, with a focus on simplicity and well-thought-out abstractions, and automated tests that verify what needs to be working. Nothing gets "merged" into the code without extensive review and me understanding the full scope of the change.

So with that, I can change the code by hand afterwards or continue with LLMs, it makes no difference, because it's essentially the same process as if I had someone follow the ideas I describe, and then later they come back with a PR. I think this probably comes naturally to senior programmers and those who have had a taste of management and similar positions, but if you haven't reviewed others' code before, I'm not sure how well this process can actually work.

At least for me, I manage to produce code I can maintain, and seemingly others can too, and the projects don't devolve into hairballs/spaghetti. But again, it requires reviewing absolutely every line and constantly editing/improving.


We recently got a PR from somebody adding a new feature and the person said he doesn't know $LANG but used AI.

The problem is, that code would require a massive amount of cleanup. I took a brief look and some code was in the wrong place. There were coding style issues, etc.

In my experience, the easy part is getting something that works for 99%. The hard part is getting the architecture right, all of the interfaces and making sure there are no corner cases that get the wrong results.

I'm sure AI can easily get to the 99%, but does it help with the rest?


Coding style can be deterministically checked for, and should be checked, automatically during linting. And no PR should get a single pair of human eyes on it, other than the author's, until all CI checks have passed.

Many, many other stylistic choices, and code complexity, can be checked automatically too; why aren't you doing it?
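
As a rough sketch of the kind of gate I mean (assuming ruff and pytest, but substitute whatever your stack uses), CI runs something like this and nobody is asked to look at the PR until it exits cleanly:

  # a sketch of a pre-review gate, assuming ruff and pytest are the chosen tools
  import subprocess, sys

  CHECKS = [
      ["ruff", "check", "."],              # lint rules, including many style/complexity checks
      ["ruff", "format", "--check", "."],  # formatting, deterministically
      ["pytest", "-q"],                    # the behavioural safety net
  ]

  for cmd in CHECKS:
      if subprocess.run(cmd).returncode != 0:
          sys.exit(f"blocked before human review: '{' '.join(cmd)}' failed")

  print("all checks passed; ready for human eyes")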


> I'm sure AI can easily get to the 99%, but does it help with the rest?

Yes, the AI can help with 100% of it. But the operator of the AI needs to be able to articulate this to the AI.

I've been in this position, where I had no choice but to use AI to write code to fix bugs in another party's codebase, then PR the changes back to the codebase owners. In this case it was vendor software that we rely on which the vendor hadn't fixed critical bugs in yet. And exactly as you described, my PR ultimately got rejected because even though it fixed the bugs in the immediate sense, it presented other issues due to not integrating with the external frameworks the vendor used for their dev processes. At which point it was just easier for the vendor to fix the software their way instead of accepting my PR. But the point is that I could have made the PR correct in the first place, if I as the AI operator had the knowledge needed to articulate these more detailed and nuanced requirements to the AI. Since I didn't have this information, the AI generated code that worked but didn't meet the vendor's spec. This type of situation is incredibly easy to fall into and is a good example of why you still need a human at the wheel to set the guidance, but you don't necessarily need the human to be writing every line of code.

I don't like the situation much but this is the reality of it. We're basically just code reviewers for AI now


Yeah, so what I'm mostly doing, and advocate for others to do, is basically the pure opposite of that.

Focus on architecture, interfaces, corner-cases, edge-cases and tradeoffs first, and then the details within that won't matter so much anymore. The design/architecture is the hard part, so focus on that first and foremost, and review + throw away bad ideas mercilessly.


Yes it does... but only in the hands of an expert who knows what they are doing.

I'd treat PRs like that as proofs of concept that the thing can be done, but I'd be surprised if they often produced code that should be directly landed.


In the hands of an expert… right. So is it not incredibly irresponsible to release these tools into the wild and expose them to those who are not experts? They will actually become incredibly worse off. Ironically this does not ‘democratise’ intelligence at all - the gap widens between experts and the rest.

I sometimes wonder what would have happened if OpenAI had built GPT3 and then GPT-4 and NOT released them to the world, on the basis that they were too dangerous for regular people to use.

That nearly happened - it's why OpenAI didn't release open weight models past GPT2, and it's why Google didn't release anything useful built on Transformers despite having invented the architecture.

If we lived in that world today, LLMs would be available only to a small, elite and impossibly well funded class of people. Google and OpenAI would solely get to decide who could explore this new world with them.

I think that would suck.


So… what?

With all due respect I don’t care about an acceleration in writing code - I’m more interested in incremental positive economic impact. To date I haven’t seen anything convince me that this technology will yield this.

Producing more code doesn’t overcome the lack of imagination, creativity and so on to figure out what projects resources should be invested in. This has always been an issue that will compound at firms like Google who have an expansive graveyard of projects laid to rest.

In fact, in a perverse way, all this ‘intelligence’ can exist while humans simultaneously get worse at making judgments about investment decisions.

So broadly where is the net benefit here?


You mean the net benefit in widespread access to LLMs?

I get the impression there's no answer here that would satisfy you, but personally I'm excited about regular people being able to automate tedious things in their lives without having to spend 6+ months learning to program first.

And being able to enrich their lives with access to as much world knowledge as possible via a system that can translate that knowledge into whatever language and terminology makes the most sense to them.


“I'm excited about regular people being able to automate tedious things in their lives without having to spend 6+ months learning to program first.”

Bring the implicit and explicit costs to date into your analysis and you should quickly realise none of this makes sense from a societal standpoint.

Also you seem to be living in a bubble - the average person doesn’t care about automating anything!


The average person already automates a lot of things in their day to day lives. They spend far less time doing the dishes, laundry, and cleaning because parts of those tasks have been mechanized and automated. I think LLMs probably automate the wrong thing for the average person (i.e., I still have to load the laundry machine and fold the laundry after) but automation has saved the average person a lot of time

For example, my friend doesn’t know programming but his job involves some tedious spreadsheet operations. He was able to use an LLM to generate a Python script to automate part of this work. Saving about 30 min/day. He didn’t review the code at all, but he did review the output to the spreadsheet and that’s all that matters.

His workplace has no one with programming skills, so this is automation that would never have happened otherwise. Of course it’s not exactly replacing a human or anything. I suppose he could have hired someone to write the script, but he never really thought to do that.
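
For illustration, here's a sketch of the kind of one-off script involved (hypothetical file and column names; I haven't seen his actual code):

  # a sketch of the kind of one-off spreadsheet automation described above
  import pandas as pd

  df = pd.read_excel("daily_orders.xlsx")                       # the sheet he updates by hand
  df["total"] = df["quantity"] * df["unit_price"]               # per-row arithmetic he used to type in
  summary = df.groupby("region")["total"].sum().reset_index()   # the roll-up that ate 30 min/day
  summary.to_excel("regional_summary.xlsx", index=False)        # the report he actually checks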


What sorts of things will the average, non-technical person think of automating on a computer that are actually quality-of-life-improving?

A work colleague had a tedious operation involving manually joining a bunch of video segments together in a predictable pattern. Took them a full working day.

They used "just" ChatGPT on the web to write an automation. Now the same process takes ~5 minutes of work. Select the correct video segments, click one button to run script.

The actual processing still takes time, but they don't need to stand there watching it progress so they can start the second job.

And this was a 100% non-technical marketing person with no programming skills past Excel formulas.
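
For flavour, a sketch of roughly what such a script looks like (hypothetical folder and file pattern; the colleague's actual script will differ), leaning on ffmpeg's concat demuxer:

  # a sketch of a "join these segments in order" script
  import pathlib
  import subprocess

  segments = sorted(pathlib.Path("segments").glob("clip_*.mp4"))   # the predictable naming pattern
  listing = pathlib.Path("segments.txt")
  listing.write_text("".join(f"file '{p.resolve()}'\n" for p in segments))

  # ffmpeg's concat demuxer stitches the clips together without re-encoding
  subprocess.run(
      ["ffmpeg", "-f", "concat", "-safe", "0", "-i", str(listing), "-c", "copy", "joined.mp4"],
      check=True,
  )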


My favorite anecdotal story here is that a couple of years ago I was attending a training session at a fire station and the fire chief happened to mention that he had spent the past two days manually migrating contact details from one CRM to another.

I do not want the chief of a fire station losing two days of work to something that could be scripted!


I don't want my doctor to vibe script some conversion only to realize weeks or months later that it made a subtle error in my prescription. I want both of them to have enough funds to hire someone to do it properly. But wanting is not enough, unfortunately...

Humans make subtle errors all the time too though. AI results still need to be checked over for anything important, but it's on a vector toward being much more reliable than a human for any kind of repetitive task.

Currently, if you ask an LLM to do something small and self-contained like solve leetcode problems or implement specific algorithms, they will have a much lower rate of mistakes, in terms of implementing the actual code, than an experienced human engineer. The things it does badly are more about architecture, organization, style, and taste.


But with a software bug, the error rapidly becomes widespread and systematic, whereas human errors often are not. Getting a couple of prescriptions wrong because the doc worked a 12+ hour shift is different from systematically getting a significant number of prescriptions wrong until someone double-checks the results.

An error in a massive hand-crafted Excel sheet also becomes systematic and widespread.

Because Excel has no way of doing unit tests or any kind of significant validation. Big BIG things have gone to shit because of Excel.

Things that would have never happened if the same thing was a vibe-coded python script and a CSV.
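
A sketch of what I mean (hypothetical CSV columns): with a script plus a CSV, the transformation is a plain function you can at least unit test, which a formula buried in a cell is not.

  # a sketch of the script-plus-CSV alternative (hypothetical columns)
  import csv

  def monthly_totals(rows):
      """Sum the 'amount' column per 'month': the part Excel hides inside cell formulas."""
      totals = {}
      for row in rows:
          totals[row["month"]] = totals.get(row["month"], 0.0) + float(row["amount"])
      return totals

  def test_monthly_totals():
      rows = [{"month": "jan", "amount": "10"}, {"month": "jan", "amount": "2.5"}]
      assert monthly_totals(rows) == {"jan": 12.5}

  if __name__ == "__main__":
      with open("ledger.csv", newline="") as f:
          print(monthly_totals(list(csv.DictReader(f))))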


I agree with the excel thing. Not with thinking it can't happen with vibecoded python.

I think handling sensitive data should be done by a professional. A lawyer handles contracts, a doctor handles health issues and a programmer handles data manipulation through programs. This doesn't remove the risk of errors completely, but it reduces it significantly.

In my home, it's me who's impacted if I screw up a fix in my plumbing, but I won't try to do it at work or in my child's school.

I don't care if my doctor vibe codes an app to manipulate their holidays pictures, I care if they do it to manipulate my health or personal data.


> Also you seem to be living in a bubble - the average person doesn’t care about automating anything!

One of my life goals is to help bring as many people into my "technology can automate things for you" bubble as I possibly can.


I'm curious about the economic aspects of this. If only experts can use such tools effectively, how big will the total market be and does that warrant the investments?

For companies, if these tools make experts even more special, then experts may get more power certainly when it comes to salary.

So the productivity benefits of AI have to be pretty high to overcome this. Does AI make an expert twice as productive?


I have been thinking about this in the last few weeks. First time I see someone commenting about it here.

- If the number of programmers will be drastically reduced, how big of a price increase companies like Anthropic would need to be profitable?

- If you are a manager, you now have a much bigger bus-factor problem to deal with. One person leaving means a greater blow to the team's knowledge.

- If the number of programmers will be drastically reduced, the need for managers and middle managers will also decline, no? Hmm...


You can apply the same logic to all technologies, including programming languages, HTTP, cryptography, cameras, etc. Who should decide what's a responsible use?

I think we will find out that certain languages, frameworks and libraries are easier for AI to get all the way correct. We may even have to design new languages, frameworks and libraries to realize the full promise of AI. But as the ecosystem around AI evolves I think these issues will be solved.

> We recently got a PR from somebody adding a new feature and the person said he doesn't know $LANG but used AI.

"Oh, and check it out: I'm a bloody genius now! Estás usando este software de traducción in forma incorrecta. Por favor, consulta el manual. I don't even know what I just said, but I can find out!"


... And with this level of quality control, is it still faster than writing it yourself?

Is it really much different from maintaining code that other people wrote and that you merged?

Yes, this is (partly) why developer salaries are so high. I can trust my coworkers in ways not possible with AI.

There is no process solution for low performers (as of today).


The solution for low performers is very close oversight. If you imagine an LLM as a very junior engineer who needs an inordinate amount of hand holding (but who can also read and write about 1000x faster than you and who gets paid approximately nothing), you can get a lot of useful work out of it.

A lot of the criticisms of AI coding seem to come from people who think that the only way to use AI is to treat it as a peer. “Code this up and commit to main” is probably a workable model for throwaway projects. It’s not workable for long term projects, at least not currently.


A Junior programmer is a total waste of time if they don't learn. I don't help Juniors because it is an effective use of my time, but because there is hope that they'll learn and become Seniors. It is a long term investment. LLMs are not.

It’s a metaphor. With enough oversight, a qualified engineer can get good results out of an underperforming (or extremely junior) engineer. With a junior engineer, you give the oversight to help them grow. With an underperforming engineer you hope they grow quickly or you eventually terminate their employment because it’s a poor time trade off.

The trade off with an LLM is different. It’s not actually a junior or underperforming engineer. It’s far faster at churning out code than even the best engineers. It can read code far faster. It writes tests more consistently than most engineers (in my experience). It is surprisingly good at catching edge cases. With a junior engineer, you drag down your own performance to improve theirs and you’re often trading off short term benefits vs long term. With an LLM, your net performance goes up because it’s augmenting you with its own strengths.

As an engineer, it will never reach senior level (though future models might). But as a tool, it can enable you to do more.


> It’s far faster at churning out code than even the best engineers.

I'm not sure I can think of a more damning indictment than this tbh


Can you explain why that’s damning?

I guess everyone dealing with legacy software sees code as a cost factor. Being able to delete code is harder, but often more important than writing code.

Owning code requires you to maintain it. Finding out which parts of the code actually implement features and which parts are not needed anymore (or were never needed in the first place) is really hard, since most of the time the requirements have never been documented and the authors have left or cannot remember. But not understanding what the code does removes all possibility to improve or modify it. This is how software dies.

Churning out code fast is a huge future liability. Management wants solutions fast and doesn't understand these long term costs. It is the same with all code generators: Short term gains, but long term maintainability issues.


Do you not write code? Is your code base frozen, or do you write code for new features and bug fixes?

The fact that AI can churn out code 1000x faster does not mean you should have it churn out 1000x more code. You might have a list of 20 critical features and only have time to implement 10. AI could let you get all 20, but that shouldn't mean you check in code for 1000 features you don't even need.


Sure if you just leave all the code there. But if it's churning out iterations, incrementally improving stuff, it seems ok? That's pretty much what we do as humans, at least IME.


I feel like this is a forest for the trees kind of thing.

It is implied that the code being created is for “capabilities”. If your AI is churning out needless code, then sure, that’s a bad thing. Why would you be asking the AI for code you don’t need, though? You should be asking it for critical features, bug fixes, the things you would be coding up regardless.

You can use a hammer to break your own toes or you can use it to put a roof on your house. Using a tool poorly reflects on the craftsman, not the tool.


> It writes tests more consistently than most engineers (in my experience)

I'm going to nit on this specifically. I firmly believe anyone who genuinely believes this either never writes tests that actually matter, or doesn't review the tests that an LLM throws out there. I've seen so many cases of people saying 'look at all these valid tests our LLM of choice wrote', only for half of them to do nothing and the other half to be misleading as to what they actually test.


It’s like anything else, you’ve got to check the results and potentially push it to fix stuff.

I recently had AI code up a feature that was essentially text manipulation. There were existing tests to show it how to write effective tests and it did a great job of covering the new functionality. My feedback to the AI was mostly around some inaccurate comments it made in the code but the coverage was solid. Would have actually been faster for me to fix but I’m experimenting with how much I can make the AI do.

On the other hand I had AI code up another feature in a different code base and it produced a bunch of tests with little actual validation. It basically invoked the new functionality with a good spectrum of arguments but then just validated that the code didn’t throw. And in one case it tested something that diverged slightly from how the code would actually be invoked. In that case I told it how to validate what the functionality was actually doing and how to make the one test more representative. In the end it was good coverage with a small amount of work.

For people who don’t usually test or care much about testing, yeah, they probably let the AI create garbage tests.


>feature that was essentially text manipulation

That seems like the kind of feature where the LLM would already have the domain knowledge needed to write reasonable tests, though. Similar to how it can vibe code a surprisingly complicated website or video game without much help, but probably not create a single component of a complex distributed system that will fit into an existing architecture, with exactly the correct behaviour based on some obscure domain knowledge that pretty much exists only in your company.


> probably not create a single component of a complex distributed system that will fit into an existing architecture, with exactly the correct behaviour based on some obscure domain knowledge that pretty much exists only in your company.

An LLM is not a principal engineer. It is a tool. If you try to use it to autonomously create complex systems, you are going to have a bad time. All of the respectable people hyping AI for coding are pretty clear that they have to direct it to get good results in custom domains or complex projects.

A principal engineer would also fail if you asked them to develop a component for your proprietary system with no information, but a principal engineer would be able to do their own deep discovery and design if they have the time and resources to do so. An AI needs you to do some of that.


I don't see anything here that corroborates your claim that it outputs more consistent test code than most engineers. In fact your second case would indicate otherwise.

And this also goes back to my first point about writing tests that matters. Coverage can matter, but coverage is not codifying business logic in your test suite. I've seen many engineers focus only on coverage only for their code to blow up in production because they didn't bother to test the actual real world scenarios it would be used in, which requires deep understanding of the full system.


I still feel like in most of these discussions the criticism of LLMs is that they are poor replacements for great engineers. Yeah. They are. LLMs are great tools for great engineers. They won’t replace good engineers and they won’t make shitty engineers good.

You can’t ask an LLM to autonomously write complex test suites. You have to guide it. But when AI creates a solid test suite with 20 minutes of prodding instead of 4 hours of hand coding, that’s a win. It doesn’t need to do everything alone to be useful.

> writing tests that matters

Yeah. So make sure it writes them. My experience so far is that it writes a decent set of tests with little prompting, honestly exceeding what I see a lot of engineers put together (lots of engineers suck at writing tests). With additional prompting it can make them great.


I also find it hard to agree with that part. Perhaps it depends on what type of software you write, but in my experience finding good test cases is one of those things that often requires a deep level of domain knowledge. I haven’t had much luck making LLMs write interesting, non-trivial tests.

This has been my experience as well. So far, whenever I’ve been initially satisfied with the one shotted tests, when I had to go back to them I realized they needed to be reworked.

Just like LLMs are a total waste of time if you never update the system/developer prompts with additional information as you learn what's important to communicate vs not.

That is a completely different level. I expect a Junior Developer to be able to completely replace me long term and to be able decide when existing rules are outdated and when they should be replaced. Challenge my decisions without me asking for it. Being able to adapt what they have learned to new types of projects or new programming languages. Being Senior is setting the rules.

An LLM only follows rules/prompts. They can never become Senior.


Yes. Firstly, AI forgets why it wrote certain code, whereas with humans you can at least ask them when reviewing. Secondly, current-gen AI (at least Claude) kind of wants to finish the task instead of thinking about the bigger picture. Human programmers code a little differently, in that they hate a single-line fix in a random file to patch something else in a different part of the code.

I think the second is partly down to RL training that optimizes for self-contained tasks like SWE-bench.


So you live in a world where code history must only be maintained orally? Have you ever thought to ask AI to write documentation on the what and the why, and not just write the code? Asking it to document as well as code works well when the AI needs to go back and change either.

I don't see how asking AI to write some description of why it wrote this or that code would actually result in an explanation of why it wrote that code? It's not like it's thinking about it in that way, it's just generating both things. I guess they'd be in the same context so it might be somewhat correct.

It does actually work incredibly well. It's even remarkably good at looking through existing stuff (written by AI or not) and reasoning about why it is the way it is. I agree it's not "thinking" in the same way a human might, but it gets to a more plausible explanation than many humans can a lot more often than I ever would have thought.

If you ask it to document why it did something, then when it goes back later to update the code it has the why in its context. Otherwise, the AI just sees some code later and has no idea why it was written or what it does without reverse engineering it at the moment.

I'm not sure you understood the GP comment. LLMs don't know and can't tell you why they write certain things. You can't fix that by editing your prompt so it writes it on a comment instead of telling you. It will not put the "why" in the comment, and therefore the "why" won't be in the future LLM's context, because there is no way to make it output the "why".

It can output something that looks like the "why" and that's probably good enough in a large percentage of cases.


LLMs know why they are writing things in the moment, and they can justify decisions. Asking it to write those things down when it writes code works, or even asking them to design the code first and then generate/update code from the design also works. But yes, if things aren’t written down, “the LLM don’t know and can’t tell.” Don’t do that.

I'm going to second seanmcdirmid here, a quick trick is to have Claude write a "remaining.md" if you know you have to do something that will end the session.

Example from this morning, I have to recreate the EFI disk of one of my dev vm's, it means killing the session and rebooting the vm. I had Claude write itself a remaining.md to complement the overall build_guide.vm I'm using so I can pick up where I left off. It's surprisingly effective.


No, humans probably have tens of millions of tokens of memory per PR. It includes not only what's in the code, but what all they searched, what all they tested and in which way, which order they worked in, the edge cases they faced, etc. Claude just can't document all of this, or it will run out of its working context pretty soon.

Ya, LLMs are not human level, they have smaller focus windows, but you can "remember" things with documentation, just like humans usually resort to when they realize that their tens of millions of tokens of memory per PR aren't reliable either.

The nice thing about LLMs, however, is that they don't grumble about writing extra documentation and tests like humans do. You just tell them to write lots of docs and they do it, they don't just do the fun coding part. I can empathize why human programmers feel threatened.


They have a memory of tens of millions of tokens that's useful during review, but probably useless once the PR is merged.

> It can output something that looks like the "why"

This feels like a distinction without difference. This is an extension of the common refrain that LLMs cannot “think”.

Rather than get overly philosophical, I would ask what the difference is in practical terms. If an LLM can write out a “why” and it is sufficient explanation for a human or a future LLM, how is that not a “why”?


It's...very much a difference?

If you're planning on throwing the code away, fine, but if you're not, eventually you're going to have to revisit it.

Say I'm chasing down some critical bug or a security issue. I run into something that looks overly complicated or unnecessary. Is it something a human did for a reason or did the LLM just randomly plop something in there?

I don't want a made-up plausible answer, I need to know if this was a deliberate choice, for example "this is to work around a bug in XY library" or "this is here to guard against [security issue]", or if it's there because some dude on Stackoverflow wrote sample code in 2008.


If your concern is philosophical, and you are defining LLMs as not having a “why”, then of course they cannot write down “why” because it doesn’t exist. This is the philosophical discussion I am trying to avoid because I don’t think it’s fruitful.

If your concern is practical and you are worried that the “why” an LLM might produce is arbitrary, then my experience so far says this isn’t a problem. What I’m seeing LLMs record in commit messages and summaries of work is very much the concrete reasons they did things. I’ve yet to see a “why” that seemed like nonsense or arbitrary.

If you have engineers checking in overly complex blobs of code with no “why”, that’s a problem whether they use AI or not. AI tools do not replace engineers, and I would not work in any code base where engineers were checking in vibe coded features without understanding them and vetting the results properly.


No, I'm still saying something very practical.

I don't care what text the LLM generates. If you wanna read robotext, knock yourself out. It's useless for what I'm talking about, which is "something is broken and I'm trying to figure out what"

In that context, I'm trying to do two things:

1. Fix the problem

2. Don't break anything else

If there's something weird in the code, I need to know if it's necessary. "Will I break something I don't know about if I change this" is something I can ask a person. Or a whole chain of people if I need to.

I can't ask the LLM, because "yes $BIG_CLIENT needs that behavior for stupid reasons" is not gonna be a part of its prompt or training data, and I need that information to fix it properly and not cause any regressions.

It may sound contrived but that sort of thing happens allllll the time.


If I show you two "why" comments in a codebase, can you tell which one was written by an LLM and which was not?

Just like humans leave comments like this

  // don't try to optimise this, it can't be done
  // If you try, increment this number: 42
You can do the same for LLMs

  // This is here because <reason> it cannot be optimised using <method>
It works, I've done it. (On the surface it looks like you could use a specific type of caching to speed that code up, but it actually fails for various reasons; LLMs kept trying, so I added a comment that stopped them.)

Have you tried it? LLMs are quite good at summarizing. Not perfect, but then neither are humans.

> So you live in a world where code history must only be maintained orally?

There are many companies and scenarios where this is completely legitimate.

For example, a startup that's iterating quickly with a small, skilled dev team. A bunch of documentation is a liability, it'll be stale before anyone ever reads it.

Just grabbing someone and collaborating with them on what they wrote is much more effective in that situation.


> For example, a startup that's iterating quickly with a small, skilled dev team. A bunch of documentation is a liability, it'll be stale before anyone ever reads it.

This is a huge advantage for AI though, they don't complain about writing docs, and will actively keep the docs in sync if you pipeline your requests to do something like "I want to change the code to do X, update the design docs, and then update the code". Human beings would just grumble a lot, an AI doesn't complain...it just does the work.

> Just grabbing someone and collaborating with them on what they wrote is much more effective in that situation.

Again, it just sounds to me like you are arguing that AIs are superior, not how they are inferior.


Documentation isn't there to have and admire, you write it for a purpose.

There are like eight bajillion systems out there that can generate low-level javadoc-ish docs. Those are trivial.

The other types of internal developer documentation are "how do I set this up", "why was this code written" and "why is this code the way it is" and usually those are much more efficiently conveyed person to person. At least until you get to be a big company.

For a small team, I would 100% agree those kinds of documentation are usually a liability. The problem is "I can't trust that the documentation is accurate or complete" and with AI, I still can't trust that it wrote accurate or complete documentation, or that anyone checked what it generated. So it's kind of worse than useless?


The LLM writes it with the purpose you gave it, to remember why it did things when it goes to change things later. The difference between humans and AI is that humans skip the document step because they think they can just remember everything, AI doesn’t have that luxury.

Just say the model uses the files to seed token state. Anthropomorphizing the thing is silly.

And no, you don't skip the documentation because you "think you can just remember everything". It's a tradeoff.

Documentation is not free to maintain (no, not even the AI version) and bad or inaccurate documentation is worse than none, because it wastes everyone's time.

You build a mental map of how the code is structured and where to find what you need, and you build a mental model of how the system works. Understanding, not memorization.

When prod goes down you really don't wanna be faffing about going "hey Alexa, what's a database index".


Have you never had a situation where a question arose a year (or several) later that wasn’t addressed in the original documentation?

In particular, IME the LLM generates a lot of documentation that explains the what and not a lot of the why (or at least, if it does, it's not reflecting the underlying business decisions that prompted the change).


You can ask it to generate the why, even if the agent isn’t doing that by default. At least you can ask it to encode how it is mapping your request to code, and to make sure that the original request is documented, so you can record why it did something at least, even if it can’t have insight into why you made the request in the first place. The same applies to successive changes.

You don't just code with AI, you provide 2 things

1. a detailed spec, the result of your discussions with the agent about the work; when it gets it, you ask the agent to formalize it into docs

2. an extensive suite of tests to cover every angle; the tests are generated, but you have to ensure their quality, coverage and depth

I think, to make a metaphor, that specs are like the skeleton of the agent, tests are like the skin, while the agent itself is the muscle and cerebellum, and you are the PFC. Skeleton provides structure and decides how the joints fit, tests provide pain and feedback. The muscle is made more efficient between the two.

In short the new coding loop looks like: "spec -> code -> test, rinse and repeat"


Are you just generating code with the LLM? Ya, you are screwed. Are you generating documentation and tests and everything else that helps the code live? Your options for maintenance go up. Now just replace “generate” with “maintain” and you are basically asking the AI to make changes to a description at the top that then percolate into multiple artifacts being updated, only one of which happens to be the code itself, and the code updates multiple times as the AI checks tests and such.

I wish there were good guides on how to get the best out of LLMs. All of these tips about adding documentation etc seem very useful but I’ve never seen good guides on how to do this effectively or sustainably.

It is still the early days; everyone has their process, and a lot of the process is still ad hoc. It is an exciting time to be in the field though, before turn key solutions come we all get to be explorers.

Fair, but it would be interesting to see how people are implementing this “write the docs you need to do a better job” logic and putting it into use. I’m playing with this but would love to see someone’s success story. “I did X and now the code is better/its more token efficient/reviewers understand the changes/whatever.”

I just let the LLM write the docs it will read and I don't pay attention to them very much unless I need to debug a problem that it can't solve on its own. I just tell it what areas to focus on; it writes stuff that gets checked in but not really read by humans, it updates the docs when things change before it changes the code, but it can also review all the design stuff when making code changes.

Sometimes I run into a problem that the LLM can't really handle yet, but I just break the problem up into more docs, tests, and code. So... that usually works, but I admit I move more slowly on those problems, and I'm not asking the LLM how to break the problem up yet (although I think we will get there soon).


Do you prompt for anything specific to record or does your prompt just contain something general like “read .aidump if present for potentially useful context and update or create .aidump with any useful information”?

Mostly the latter! You can ask it to look at things conditionally (like, if the test fails, look at this doc before deciding what to do next), but usually I just load it all up at the start before asking it to make a change. The LLM is good enough about picking out what it needs. The one problem is that if you have a change you are propagating through the workflow, you need to highlight that change to the LLM or it might not notice it.

I'm working on workflow processing to make this easier ATM (because I can't help my coworkers do what I'm doing, and what I'm doing is so ad hoc), which is why I'm talking about it so much. So the idea is that you request a change at the top, and the LLM updates everything to accommodate the change, keeping track of what changed in each artifact. When it goes to generate code...it has a change for the artifacts that input into code (which are just read in along with a prompt saying "generate the code!"). You just don't ask the LLM to change the code directly (because if you do that, none of the docs get updated for the change, and things can go bad after that...).

When things go wrong, I add extra context if I can spot the problem ("focus on X, X looks wrong because...") and that just merges with the other docs as the context. Sometimes if I can't figure out why a test is failing, I ask it to create a simpler version of the test and see if that fails (if it does, it will be easier to eye the problem). Manual intervention is still necessary (and ugh, sometimes the LLM is just having a bad day and I need to /clear and try again).


I need to play with this more. I’ve had AI generate a bunch of small summaries that it could theoretically use to optimize future work. I haven’t asked it specifically to just dump info as it’s doing other work yet.

The files I had it generate were interesting but I’m not convinced looking at them that they contain the real info the AI needs to be more efficient. I should look into what kind of context analysis agents are passing back because that seems like what I want to save for later.


You can’t just ask AI to dump, you need to vaguely describe what design elements you think are important. For SQL, you might want to plan out your CTEs first, then come up with a strategy for implementing each one, before getting to the SQL file itself (and of course tests, but that is a separate line of artifacts; you don’t want the AI to look at the tests when updating code, because you want to avoid letting the AI code to the test). You can also look at where the AI is having trouble doing something, or not doing it very well, and ask it to write documentation that will help it do that more successfully.

I can’t imagine asking AI to change some code without having a description of what the code does. You could maybe reverse engineer that, but that would basically be generating the documents after the fact. Likewise changing code without tests, where failing tests are actionable signals for the AI to make sure it doesn’t break things on update. Some people here think you can just ask it to write code without any other artifacts; that's nuts (maybe agentic tooling will develop in the direction where AI writes persistent artifacts on its own without being told to do so; actually, I’m sure that will happen eventually).


> You can’t just ask AI to dump, you need to vaguely describe what design elements you think are important

Right. And that’s what I’ve tried to do but I am not confident it’s captured the most critical info in an efficient way.

> I can’t imagine asking AI to change some code without having a description of what the code does. You could maybe reverse engineer that, but that would basically be generating the documents after the fact.

This is exactly how I’ve been using AI so far. I tell it to deeply analyze the code before starting and it burns huge amounts of tokens relearning the same things it learned last time. I want to get some docs in place to minimize this. That’s why I’m interested in what a subagent would respond with because that’s what it’s operating with usually. Or maybe the compressed context might be an interesting reference.


Would it not be a new paradigm, where the generated code from AI is segregated and treated like a binary blob? You don't change it (beyond perhaps some cosmetic, or superficial changes that the AI missed). You keep the prompt(s), and maintain that instead. And for new changes you want added, the prompts are either modified, or appended to.

Sounds like a nondeterministic nightmare

indeed - https://www.dbreunig.com/2026/01/08/a-software-library-with-... appears to be exactly that - the idea that the only leverage you have for fixing bugs is updating prompts (and, to be fair, test cases, which you should be doing for every bug anyway) is kind of upsetting as someone who thinks software can actually work :-)

(via simonw, didn't see it already on HN)


There is a related issue of ownership. When human programmers make errors that cost revenue or worse, there is (in theory) a clear chain of accountability. Who do you blame if errors generated by LLMs end up in mission critical software?

> Who do you blame if errors generated by LLMs end up in mission critical software?

I don't think many companies/codebases allow LLMs to autonomously edit code and deploy it, there is still a human in the loop that "prompt > generates > reviews > commits", so it really isn't hard to find someone to blame for those errors, if you happen to work in that kind of blame-filled environment.

The same goes for contractors, I suppose: if you outsource work to a contractor, they do a shitty job, and it gets shipped anyway, who do you blame? Replace "contractor" with "LLM" and I think the answer remains the same.


I have AI agents write the code, perform code review, and improve and iterate upon it. I trust that an agent with the capability to write working code can also improve it. I use Claude skills for this and keep improving the skills based on both AI and human code reviews for the same type of code.

> For me, I initially got into programming because I wanted to ruin other people's websites, then I figured out I needed to know how to build websites first, then I found it more fun to create and share what I've done with others, and they tell me what they think of it.

Talk about a good thing coming from bad intentions! Congratulations on shaking that demon.


It's pointless these days because most web sites are pre-ruined. ;)

> I think there is a section of programmer who actually do like the actual typing of letters, numbers and special characters into a computer, and for them, I understand LLMs remove the fun part.

Exactly me.


Same for me, sadly.

One of the reasons why I learned vim was because I enjoy staying in the keyboard; I'm a fast typer and part of the fun is typing out the code I'm thinking.

I can see how some folks only really like seeing the final product rather than the process of building it but I'm just not cut for that — I hate entrepreneurship for the same reason, I enjoy the building part more than the end.

And it's the part that's killing me with all this hype.


Conversely, I have very little interest in the process of programming by itself; all the magic is about the end result and the business value for me (which fortunately has served me quite well professionally). From as young as I can remember, I was fascinated with the GUI DBMS (4th Dimension/FileMaker/MS Access/…) my dad used to improve his small business. I got into programming only to not be limited by graphical tools. So LLMs for me are just a nice addition to my toolbox, like a power tool is to a manual one. It doesn’t philosophically change anything.

That's because physical programming is a ritual.

I'm not entirely sure what that means myself, so please speak up if my statement resonates with you.


It resonates. But as I see it, that kind of ritual is something I'd rather devote myself to at home. At work, the more efficiently and rapidly we can get stuff done, the better.

Drawing and painting is a ritual to me as well. No one pays me for it and I am happy about that.


Corporations trying to "invent" agi is like that boss in bloodborne

Same. However, for me the fun in programming was always a kind of trap that kept me from doing more challenging things.

Now the fun is gone, maybe I can do more important work.


You might be surprised to find out how much of your motivation to do any of it at all was tied to your enjoyment, and that’s much more difficult to overcome than people realize.

> Now the fun is gone, maybe I can do more important work.

This is a very sad, bleak, and utilitarian view of "work." It is also simply not how humans operate. Even if you only care about the product, humans that enjoy and take pride in what they're doing almost invariably produce better products that their customers like more.


My problem was the exact opposite. I wanted to deliver but the dislike of the actual programming / typing code prevented me from doing so. AI has solved this for me.

> I think there is a section of programmer who actually do like the actual typing of letters

Do people actually spend a significant time typing? After I moved beyond the novice stage it’s been an inconsequential amount of time. What it still serves is a thorough review of every single line in a way that is essentially equivalent to what a good PR review looks like.


Yes, for the type of work LLMs are good at (greenfield projects or lots of boilerplate).

Novice work

Do people actually enjoy reviewing PRs?

See, that also works.


I think it’s true that people get enjoyment from different things. Also, I wonder if people have fixed ideas about how coding agents can be used? For example, if you care about what the code looks like and want to work on readability, test coverage, and other “code health” tasks with a coding agent, you can do that. It’s up to you whether you ask it to do cleanup tasks or implement new features.

Maybe there are people who care about literally typing the code, but I get satisfaction from making the codebase nice and neat, and now I have power tools. I am just working on small personal projects, but so far, Claude Opus 4.5 can do any refactoring I can describe.


Good points. I'm a 'solve the problem' person, so rarely get into language wars, editor wars, etc... I just don't care as long as the problem is solved in a way that meets the needs of the user.

I've worked with all the types, and no type is wrong. For example, I can certainly appreciate the PL researcher type who wants to make everything functional, etc... I won't fight against it as long as it doesn't get in the way of solving the problem. I've also found that my style works well with the other styles because I have a way of always asking "so does this solve the problem??", which is sometimes forgotten by the code-is-beautiful people, etc...


> I think there is a section of programmer who actually do like the actual typing of letters, numbers and special characters into a computer.

I don't think this is really it for many people (maybe any); after all, you can do all of that when writing a text message rather than a piece of code.

But it inches closer to what I think is the "right answer" for this type of software developer. There are aspects of software development that are very much like other forms of writing (e.g., prose or poetry).

Like other writing, writing code can constitute self-expression in an inherently satisfying way, and it can also offer the satisfaction of finding "the perfect phrase". LLMs more or less eliminate both sources of pleasure, either by eliminating the act of writing itself (that is, choosing and refining the words) or through their bland, generic, tasteless style.

There are other ways that LLMs can disconnect the people using them from what is joyful about writing code, not least of all because LLMs can be used in a lot of different ways. (Using them as search tools or otherwise consulting them rather than having them commit code to simply be either accepted/rejected "solves" the specific problems I just mentioned, for instance.)

There is something magical about speaking motion into existence, which is part of what has made programming feel special to me, ever since I was a kid. In a way, prompting an LLM to generate working code preserves that and I can imagine how, for some, it even seems to magnify the magic. But there is also a sense of essential mastery involved in the wonderful way code brings ideas to life. That mastery involves not just "understanding" things in the cursory way involved in visually scanning someone else's code and thinking "looks good to me", but intimately knowing how the words and abstractions and effects all "line up" and relate to each other (and hopefully also with the project's requirements). That feeling of mastery is itself one of the joys of writing code.

Without that mastery, you also lose one of the second-order joys of writing code that many here have already mentioned in these comments: flow. Delegation means fumbling in a way that working in your own context just doesn't. :-\


I think all programmers are like LEGO builders. But different programmers will see each brick as a different kind of abstraction. A hacker kind of programmer may see each line of code as a brick. An architect kind of programmer may see different services as a brick. An entrepreneur kind of programmer may see entire applications as a brick. These aren't mutually exclusive, of course. But we all just like to build things, the abstractions we use to build them just differ.

This is exactly the way I see it. You can always get better performance at lower levels of abstraction, but there are trade-offs. Sometimes the trade-offs are worth it (like building bigger things), and sometimes they aren't (it's a buggy mess).

> … not all programmers program for the same reason, for some of us, LLMs helps a lot, and makes things even more fun. For others, LLMs remove the core part of what makes programming fun for them. Hence we get this constant back and forth of "Can't believe others can work like this!" vs "I can't believe others aren't working like this!", but both sides seems to completely miss the other side.

Unfortunately the job market does not demand both types of programmer equally: Those who drive LLMs to deliver more/better/faster/cheaper are in far greater demand right now. (My observation is that a decade of ZIRP-driven easy hiring paused the natural business cycle of trying to do more with fewer employees, and we’ve been seeing an outsized correction for the past few years, accelerated by LLM uptake.)


> Unfortunately the job market does not demand both types of programmer equally: Those who drive LLMs to deliver more/better/faster/cheaper are in far greater demand right now.

I doubt that the LLM drivers deliver something better; quite the opposite. But I guess managers will only realize this when it's too late: and of course they won't take any responsibility for this.


> I doubt that the LLM drivers deliver something better…

That is your definition of “better”. If we’re going to trade our expertise for coin, we must ask ourselves if the cost of “better” is worth it to the buyer. Can they see the difference? Do they care?


HN: "Why should we craft our software well? Our employers don't care or reward us for it."

Also HN: "Why does all commercial software seem to suck more and more as time goes on?"


> if the cost of “better” is worth it to the buyer. Can they see the difference? Do they care?

This is exactly the phenomenon of markets for "lemons":

> https://en.wikipedia.org/wiki/The_Market_for_Lemons

(for the HN readers: a related concept is "information asymmetry in markets").

George Akerlof (the author of this paper), Michael Spence and Joseph Stiglitz got a Nobel Memorial Prize in Economic Sciences in 2001 for their analyses of markets with asymmetric information.


> For some people, the "fire" is literally about "I control a computer", for others "I'm solving a problem for others", and yet for others "I made something that made others smile/cry/feel emotions" and so on.

For the latter two, that's a minimum-wage job when LLMs produce your software, if that.


I’m better at code than prose, so coding via an agent is frustrating. Rather than multiple attempts to achieve the desired results, I’d rather just write it once, with the precision and nuance that I want. I’d be interested to try a “dueling pianos” style approach where I can cooperate with an agent indirectly through the code, rather than a lower fidelity option.

> For others, LLMs remove the core part of what makes programming fun for them.

Anecdotally, I’ve had a few coworkers go from putting themselves firmly in this category to saying “this is the most fun I’ve ever had in my career” in the last two months. The recent improvement in models and coding agents (Claude Code with Opus 4.5 in our case) is changing a lot of minds.


Yeah, I'd put myself in this camp. My trust is slowly going up, and coupled with improved guardrails (more tests, static analysis, refactoring to make reviewing easier), that increasing trust is giving me more and more speed at going from thought ("hmm, I should change how this feature works to be like X") to deployment into the hands of my customers.

Yeah, not all painters were happy with the transition to photography.

Indeed. My response was: actually, no, if I think about it I really don't think it was "building" at all. I would have started fewer things, and seen them through more consistently, if it were about "building". I think it has far more to do with personal expression.

("Solving a problem for others" also resonates, but I think I implement that more by tutoring and mentoring.)


> programmer who actually do like the actual typing

It's not about the typing, it's about the understanding.

LLM coding is like reading a math textbook without trying to solve any of the problems. You get an overview, you get a sense of what it's about and most importantly you get a false sense of understanding.

But if you try to actually solve the problems, you engage completely different parts of your brain. It's about the self-improvement.


> LLM coding is like reading a math textbook without trying to solve any of the problems.

Most math textbooks provide the solutions too. So you could choose to just read those and move on and you’d have achieved much less. The same is true with coding. Just because LLMs are available doesn’t mean you have to use them for all coding, especially when the goal is to learn foundational knowledge. I still believe there’s a need for humans to learn much of the same foundational knowledge as before LLMs otherwise we’ll end up with a world of technology that is totally inscrutable. Those who choose to just vibe code everything will make themselves irrelevant quickly.


Most math books do not provide solutions. Outside of calculus, solutions in advanced mathematics are left as an exercise for the reader.

The ones I used for the first couple of years of my math PhD had solutions. That's a sufficient level of "advanced" to be applicable in this analogy. It doesn't really matter though - the point still stands that _if_ solutions are available you don't have to use them and doing so will hurt your learning of foundational knowledge.

I haven't used AI yet but I definitely would love a tool that could do the drudgery for me for designs that I already understand. For instance, if I want to store my own structures in an RDBMS, I want to lay the groundwork and say "Hey Jeeves, give me the C++ syntax to commit this structure to a MySQL table using commit/rollback". I believe once I know what I want, futzing over the exact syntax for how to do it is a waste of time. I heard C++ isn't well supported, but eventually I'll give it a try.
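
To make that concrete, here is a minimal sketch of the kind of answer I'd want back, assuming MySQL Connector/C++'s JDBC-style API; the Sensor struct, the sensors table, and save_sensor are purely hypothetical names for illustration:

  #include <string>
  #include <memory>
  #include <cppconn/connection.h>
  #include <cppconn/prepared_statement.h>
  #include <cppconn/exception.h>

  // Hypothetical structure I want to persist.
  struct Sensor {
      int id;
      std::string name;
      double reading;
  };

  // Write one Sensor inside a transaction; roll back on any failure.
  void save_sensor(sql::Connection& con, const Sensor& s) {
      con.setAutoCommit(false);  // start an explicit transaction
      try {
          std::unique_ptr<sql::PreparedStatement> stmt(
              con.prepareStatement(
                  "INSERT INTO sensors (id, name, reading) VALUES (?, ?, ?)"));
          stmt->setInt(1, s.id);
          stmt->setString(2, s.name);
          stmt->setDouble(3, s.reading);
          stmt->executeUpdate();
          con.commit();          // all or nothing
      } catch (const sql::SQLException&) {
          con.rollback();        // undo the partial write
          throw;
      }
  }
Deciding that the write should be transactional at all is the interesting part; the rest is exactly the syntax-futzing I'd happily delegate.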

> It's not about the typing, it's about the understanding.

Well, it's both, for different people, seemingly :)

I also like the understanding and solving something difficult, that rewards a really strong part of my brain. But I don't always like to spend 5 hours in doing so, especially when I'm doing that because of some other problem I want to solve. Then I just want it solved ideally.

But then other days I engage in problems that are hard because they are hard, and because I want to spend 5 hours thinking about, designing the perfect solution for it and so on.

Different moments call for different methods, and particularly people seem to widely favor different methods too, which makes sense.


> LLM coding is like reading a math textbook without trying to solve any of the problems. You get an overview, you get a sense of what it's about and most importantly you get a false sense of understanding.

Can be, but… well, the analogy can go wrong both ways.

This is what Brilliant.org and Duolingo sell themselves on: solve problems to learn.

Before I moved to Berlin in 2018, I had turned the whole Duolingo German tree gold more than once, when I arrived I was essentially tourist-level.

Brilliant.org, I did as much as I could before the questions got too hard (latter half of group theory, relativity, vector calculus, that kind of thing); I've looked at it again since then, and get the impression the new questions they added were the same kind of thing that ultimately turned me off Duolingo: easier questions that teach little, padding out a progression system that can only be worked through fast enough to learn anything if you pay a lot.

Code… even before LLMs, I've seen and I've worked with confident people with a false sense of understanding about the code they wrote. (Unfortunately for me, one of my weaknesses is the politics of navigating such people).


Yeah, there's a big difference between edutainment like Brilliant and Duolingo and actually studying a topic.

I'm not trying to be snobbish here, it's completely fine to enjoy those sorts of products (I consume a lot of pop science, which I put in the same category) but you gotta actually get your hands dirty and do the work.

It's also fine to not want to do that -- I love to doodle and have a reasonable eye for drawing, but to get really good at it, I'd have to practice a lot and develop better technique and skills and make a lot of shitty art and ehhhh. I don't want it badly enough.


Lately I've been writing DSLs with the help of these LLM assistants. It is definitely not vibe coding as I'm paying a lot of attention to the overall architecture. But most importantly my focus is on the expressiveness and usefulness of the DSLs themselves. I am indeed solving problems and I am very engaged but it is a very different focus. "How can the LSP help orient the developer?" "Do we want to encourage a functional-looking pipeline in this context?" "How should the step debugger operate under these conditions?" etc.

  GET /svg/weather
    |> jq: weatherData
    |> jq: `
      .hourly as $h |
      [$h.time, $h.temperature_2m] | transpose | map({time: .[0], temp: .[1]})
    `
    |> gg({ "type": "svg", "width": 800, "height": 400 }): `
      aes(x: time, y: temp) 
        | line() 
        | point()
    `
I've even started embedding my DSLs inside my other DSLs!

We've been hearing this a lot, but I don't really get it. A lot of code, most probably, isn't even close to being as challenging as a maths textbook.

It obviously depends a lot on what exactly you're building, but in many projects programming entails a lot of low intellectual effort, repetitive work.

It's the same things over and over with slight variations and little intellectual challenge once you've learnt the basic concepts.

Many projects do have a kernel of non-obvious innovation, some have a lot of it, and by all means, do think deeply about these parts. That's your job.

But if an LLM can do the clerical work for you? What's not to celebrate about that?

To make it concrete with an example: the other day I had Claude make a TUI for a data processing library I made. It's a bunch of rather tedious boilerplate.

I really have no intellectual interest in TUI coding and I would consider doing that myself a terrible use of my time considering all the other things I could be doing.

The alternative wasn't to have a much better TUI, but to not have any.


> It obviously depends a lot on what exactly you're building, but in many projects programming entails a lot of low intellectual effort, repetitive work.

I think I can reasonably describe myself as one of the people telling you the thing you don't really get.

And from my perspective: we hate those projects and only do them if/because they pay well.

> the other day I had Claude make a TUI for a data processing library I made. It's a bunch of rather tedious boilerplate. I really have no intellectual interest in TUI coding...

From my perspective, the core concepts in a TUI event loop are cool, and making one only involves boilerplate insofar as the support libraries you use expect it. And when I encounter that, I naturally add "design a better API for this" to my project list.

Historically, a large part of avoiding the tedium has been making a clearer separation between the expressive code-like things and the repetitive data-like things, to the point where the data-like things can be purely automated or outsourced. AI feels weird because it blurs the line of what can or cannot be automated, at the expense of determinism.


And so in the future if you want to add a feature, either the LLM can do it correctly or the feature doesn’t get added? How long will that work as the TUI code base grows?

At that point you change your attitude to the project and start treating it like something you care about, take control of the architecture, rewrite bits that don't make sense, etc.

Plus the size of project that an LLM can help maintain keeps growing. I actually think there may no longer be any realistic upper limit at all: the tricks Claude Code uses today with grep and sub-agents let it manage far more code than fits in context, even with Opus's relatively small (by today's standards) 200,000 token limit.


The problem I'm anticipating isn't so much "the codebase grows beyond the agent-system's comprehension" so much as "the agent-system doesn't care about good architecture" (at least unless it's explicitly directed to). So the codebase grows beyond the codebase's natural size when things are redundantly rewritten and stuffed into inappropriate places, or ill-fitting architectural patterns are aped.

Don't "vibe code". If you don't know what architecture the LLM is producing, you will produce slop.

I've also been hearing variations of your comment a lot too and correct me if I am wrong but I think they always implicitly assume that LLMs are more useful for the low-intellectual stuff than solving the high-intellectual core of the problem.

The thing is:

1) A lot of the low-intellectual stuff is not necessarily repetitive; it involves some business logic which is a culmination of knowing the process behind what the user needs. When you write a prompt, the model makes assumptions which are not necessarily correct for the particular situation. Writing the code yourself forces you to notice the decision points and make more informed choices.

I understand your TUI example and it's better than having none now, but as a result anybody who wants to write "a much better TUI" now faces a higher barrier to entry, since a) it's harder to justify an incremental improvement which takes a lot of work, b) users will already have processes around the current system, and c) anybody who wrote a similar library with a better TUI is now competing with you, and quality is a much smaller factor than hype/awareness/advertisement.

We'll basically have more but lower-quality software, and I am not sure that's an improvement long term.

2) A lot of the high-intellectual stuff ironically can be solved by LLMs because a similar problem is already in the training data, maybe in another language, maybe with slight differences which can be pattern matched by the LLM. It's laundering other people's work and you don't even get to focus on the interesting parts.


> but I think they always implicitly assume that LLMs are more useful for the low-intellectual stuff than solving the high-intellectual core of the problem.

Yes, this follows from the point the GP was making.

The LLM can produce code for complex problems, but that doesn't save you as much time, because in those cases typing it out isn't the bottleneck, understanding it in detail is.


I think both of you are correct.

LLMs do empower you (and by "you" I mean the reader or any other person from now on) to actually complete projects you need in the very limited free time you have available. Manually coding the same could take months (I'm speaking from experience developing a personal project for about 3 hours every Friday and there's still much to be done). In a professional context, you're being paid to ship and AI can help you grow an idea to an MVP and then to a full implementation in record-breaking time. At the end of the day, you're satisfied because you built something useful and helped your company. You probably also used your problem solving skills.

Programming is also a hobby though. The whole process matters too. I'm one of the people who feels incredible joy when achieving a goal, knowing that I completed every step in the process with my own knowledge and skills. I know that I went from an idea to a complete design based on everything I know and probably learned a few new things too. I typed the variable names, I worked hard on the project for a long time and I'm finally seeing the fruits of my effort. I proudly share it with other people who may need the same and can attest its high quality (or low quality if it was a stupid script I hastily threw together, but anyway sharing is caring —the point is that I actually know what I've written).

The experience of writing that same code with an LLM will leave you feeling a bit empty. You're happy with the result: it does everything you wanted and you can easily extend it when you feel like it. But you didn't write the code, someone else did. You just reviewed an intern's work and gave feedback. Sometimes that's indeed what you want. You may need a tool for your job or your daily life, but you aren't too interested in the internals. AI is truly great for that.

I can't reach a better conclusion than the parent comment, everyone is unique and enjoys coding in a different way. You should always find a chance to code the way you want, it'll help maintain your self-esteem and make your life interesting. Don't be afraid of new technologies where they can help you though.


> I think there is a section of programmer who actually do like the actual typing of letters, numbers and special characters into a computer, and for them, I understand LLMs remove the fun part.

I've "vibe coded" a ton of stuff and so I'm pretty bullish on LLMs, but I don't see a world where "coding by hand" isn't still required for at least some subset of software. I don't know what that subset will be, but I'm convinced it will exist, and so there will be ample opportunities for programmers who like that sort of thing.

---

Why am I convinced hand-coding won't go away? Well, technically I lied, I have no idea what the future holds. However, it seems to me that an AI which could code literally anything under the sun would almost by definition be that mythical AGI. It would need to have an almost perfect understanding of human language and the larger world.

An AI like that wouldn't just be great at coding, it would be great at everything! It would be the end of the economy, and scarcity. In which case, you could still program by hand all you wanted because you wouldn't need to work for a living, so do whatever brings you joy.

So even without making predictions about what the limitations of AI will ultimately be, it seems to me you'll be able to keep programming by hand regardless.


I don't see how an AGI coder will end scarcity; it will simply debase knowledge work. Physical things we need, like housing, are still scarce.

The AGI can build robots that build houses. It has a virtually unlimited amount of working time to dedicate to the robotics engineering problems.

We'd still be limited to some extent by raw materials and land but it would be much less significant.


That can trim costs but not drive them to zero. And how are you going to pay for it, if your own salary is being driven to zero by AGI? Your assumption is that the computer is doing all the work; what are you getting paid for?

The split I'm seeing with those around me is:

1. Those who see their codebase as a sculpture, a work of art, a source of pride.

2. Those who focus on outcomes.

They are not contradictory goals, but I'm finding that if your emphasis is 1, you generally dislike LLMs, and if your emphasis is 2, you love them, or at least tolerate them.


Why would you dislike LLMs for 1?

I have my personal projects where every single line is authored by hand.

Still, I will ask LLMs for feedback or look for ideas when I have the feeling something could be rearchitected/improved but I don't see how.

More often than not they miss the mark, but occasionally they will still provide valid feedback which I'd otherwise have missed.

LLMs aren't just for the "lets dump large amounts of lower-level work" use case.


I don't disagree with you - LLMs are not at odds with quality code if you use them correctly. But many people who take excessive pride in their code don't even bother to look and see what can be done with them. Though, in the last couple months, I have seen several of the (1) types around me finally try them.

For me it's the feeling of true understanding and discovery. Not just of how the computer works, but how whatever problem domain I'm making software for works. It's model building and simulation of the world. To the degree I can use the LLM to teach me to solve the problem better than I could before, I like it; to the degree it takes over and obscures the understanding from me, I despise it. I don't love computers because of how fast I can create shareholder value, that's for sure.

This article is not about whether programming is fun, elegant, creative, or personally fulfilling.

It is about business value.

Programming exists, at scale, because it produces economic value. That value translates into revenue, leverage, competitive advantage, and ultimately money. For decades, a large portion of that value could only be produced by human labor. Now, increasingly, it cannot be assumed that this will remain true.

Because programming is a direct generator of business value, it has also become the backbone of many people’s livelihoods. Mortgages, families, social status, and long term security are tied to it. When a skill reliably converts into income, it stops being just a skill. It becomes a profession. And professions tend to become identities.

People do not merely say “I write code.” They say “I am a software engineer,” in the same way someone says “I am a pilot” or “I am a police officer.” The identity is not accidental. Programming is culturally associated with intelligence, problem solving, and exclusivity. It has historically rewarded those who mastered it with both money and prestige. That combination makes identity attachment not just likely but inevitable.

Once identity is involved, objectivity collapses.

The core of the anti AI movement is not technical skepticism. It is not concern about correctness, safety, or limitations. Those arguments are surface rationalizations. The real driver is identity threat.

LLMs are not merely automating tasks. They are encroaching on the very thing many people have used to define their worth. A machine that can write code, reason about systems, and generate solutions challenges the implicit belief that “this thing makes me special, irreplaceable, and valuable.” That is an existential threat, not a technical one.

When identity is threatened, people do not reason. They defend. They minimize. They selectively focus on flaws. They move goalposts. They cling to outdated benchmarks and demand perfection where none was previously required. This is not unique to programmers. It is a universal human response to displacement.

The loudest opponents of AI are not the weakest programmers. They are often the ones most deeply invested in the idea of being a programmer. The ones whose self concept, status, and narrative of personal merit are tightly coupled to the belief that what they do cannot be replicated by a machine.

That is why the discourse feels so dishonest. It is not actually about whether LLMs are good at programming today. It is about resisting a trend line that points toward a future where the economic value of programming is increasingly detached from human identity.

This is not a moral failing. It is a psychological one. But pretending it is something else only delays adaptation.

AI is not attacking programming. It is attacking the assumption that a lucrative skill entitles its holder to permanence. The resistance is not to the technology itself, but to the loss of a story people tell themselves about who they are and why they matter.

That is the real conflict. HN is littered with people facing this conflict.


I wrote something similar earlier:

This is because they have entrenched themselves in a comfortable position that they don’t want to give up.

Most won’t admit this to be the actual reason. Think about it: you are a normal, hands-on, self-taught software developer. You grew up tinkering with Linux and a bit of hardware. You realise there’s good money to be made in a software career. You do it for 20-30 years; mostly the same stuff over and over again. Some Linux, C#, networking. Your life and hobby revolve around these technologies. And most importantly you have a comfortable and stable income that entrenches your class and status. Anything that can disrupt this state is obviously not desirable. Never mind that disrupting others careers is why you have a career in the first place.


> disrupting others careers is why you have a career in the first place.

Not every software project has or did this. In fact I would argue many new businesses exist that didn't exist before software and computing, and people are doing things they didn't do before. Especially around discovery of information - solving the "I don't know what I don't know" problem also expanded markets and demand to people who now know.

Whereas the current AI wave seems to be more about efficiency/industrialization/democratizing of existing use cases rather than novel things to date. I would be more excited if I saw more "product oriented" AI use cases beyond destroying jobs. While I'm hoping that the "vibing" of software will mean that SWEs are still needed to productionise it, I'm not confident that AI won't be able to do that soon too, for this or any other knowledge profession.

I wouldn't be surprised if, 20 years from now, AI has led to mass unemployment but we still haven't cured cancer, for example.


> Not every software project has or did this. In fact I would argue many new businesses exist that didn't exist before software and computing, and people are doing things they didn't do before.

That's exactly what I am hoping to see happen with AI.


I agree, but is it bad to have this reaction? Upending people’s lives and destroying their careers is a reasonable thing to fear

It’s ok to be empathetic but they have lucrative careers because they did the same to other careers that don’t exist now.

agreed

Sure; I absolutely agree, and more to the point, SWEs and their ideologies, compared to other professions, have meant they are the first on the chopping block. But what do you tell those people: that they no longer matter? Do they still matter? How will they matter? They are no different than practitioners of any other craft - humans in general derive value partly from the value they can give to their fellow man.

If the local unskilled job now matters more than a SWE, these people have gone from being worth something to society to being worth less than someone unskilled with a job. At that point, following from your logic, I can assume their long-term value is that of an unemployed person, which to some people is negative. That isn't just an identity crash; it's a crash that potentially takes their whole lives and livelihoods with it. Even smart people can be in situations where it is hard to pivot (as you say: mortgages, families, lives, etc.).

I'm sure many of the SWEs here (myself included) are asking the same questions, and the answers are too pessimistic to admit publicly, or even privately. For me, the joy of coding is taken away by AI in general, in that there is no joy in doing something that a machine will soon be able to do better, at least for me.


I agree with you that the implications are bleak. For many people they are not abstract or philosophical. They are about income, stability, and the ability to keep a life intact. In that sense the fear is completely rational.

What stands out to me is that there seems to be a threshold where reality itself becomes too pessimistic to consciously accept.

At that point people do not argue with conclusions. They argue with perception.

You can watch the systems work. You can see code being written, bugs being fixed, entire workflows compressed. You can see the improvement curve. None of this is hidden. And yet people will look straight at it and insist it does not count, that it is fake, that it is toy output, that it will never matter in the real world. Not because the evidence is weak, but because the implications are unbearable.

That is the part that feels almost surreal. It is not ignorance. It is not lack of intelligence. It is the mind refusing to integrate a fact because the downstream consequences are too negative to live with. The pessimism is not in the claim. It is in the reality itself.

Humans do this all the time. When an update threatens identity, livelihood, or future security, self deception becomes a survival mechanism. We selectively ignore what we see. We raise the bar retroactively. We convince ourselves that obvious trend lines somehow stop right before they reach us. This is not accidental. It is protective.

What makes it unsettling is seeing it happen while the evidence is actively running in front of us. You are holding reality in one hand and watching people try to look away without admitting they are looking away. They are not saying “this is scary and I do not know how to cope.” They are saying “this is not real,” because that is easier.

So yes, the questions you raise are the real ones. Do people still matter. How will they matter. What happens when economic value shifts faster than lives can adapt. Those questions are heavy, and I do not think anyone has clean answers yet.

But pretending the shift is not happening does not make the answers kinder. It just postpones the reckoning.

The disturbing thing is not that reality is pessimistic. It is that at some point reality becomes so pessimistic that people start editing their own perception of it. They unsee what is happening in order to preserve who they think they are.

That is the collision we are watching. And it is far stranger than a technical debate about code quality.


Whether you look away or embrace it doesn’t matter though. We’re all going to be unemployed. It sucks.

Yeah I'm talking about HN, where the viewpoints are so divided. There are people here who are telling you not to worry and that it doesn't suck.

Why do you say this subjective thing so confidently? Does believing what you just wrote make you feel better?

Have you considered that there are people who actually just enjoy programming by themselves?


Isn't this common on HN? People with subjective opinions voice their subjective opinions confidently. People who disagree calmly state they disagree and also state why.

The question is more about why my post triggered you... why would my simple opinion trigger you? Does disagreement trigger you? If I said something that is obviously untrue that you disagreed with, for example: "The world is flat." Would this trigger you? I don't think it would. So why was my post different?

Maybe this is more of a question you should ask yourself.


Excellent comment (even "mini essay"). I'm unsure if you've written it with AI-assistance, but even if that's the case, I'll tolerate it.

I have two things to add.

> This is not a moral failing. It is a psychological one.

(1) I disagree: it's not a failing at all. Resisting displacement, resisting that your identity, existence, meaning found in work, be taken away from you, is not a failing.

Such resistance might be futile, yes; but that doesn't make it a failing. If said resistance won, then nobody would call it a failing.

The new technology might just win, and not adapting to that reality, refusing that reality, could perhaps be called a failing. But it's also a choice.

For example, if software engineering becomes a role to review AI slop all day, then it simply devolves, for me, into just another job that may be lucrative but has zero interest for me.

(2) You emphasize identity. I propose a different angle: meaning, and intrinsic motivation. You mention:

> economic value of programming is increasingly detached from human identity

I want to rephrase it: what has been meaningful to me thus far remains meaningful, but it no longer allows me to make ends meet, because my tribe no longer appreciates when I act out said activity that is so meaningful to me.

THAT is the real tragedy. Not the loss of identity -- which you seem to derive from the combination of money and prestige (BTW, I don't fully dismiss that idea). Those are extrinsic motivations. It's the sudden unsustainability of a core, defining activity that remains meaningful.

The whole point of all these AI-apologist articles is that "it has happened in the past, time and again; humanity has always adapted, and we're now better off for it". Never mind those generations that got walked over and fell victim to the revolution of the day.

In other words, the AI-apologists say, "don't worry, you'll either starve (which is fine, it has happened time and again), or just lose a large chunk of meaning in your life".

Not resisting that is what would be a failing.


I think where we actually converge is on the phenomenon itself rather than on any moral judgment about it.

What I was trying to point at is how strange it is to watch this happen in real time. You can see something unfolding directly in front of you. You can observe systems improving, replacing workflows, changing incentives. None of it is abstract. And yet the implications of what is happening are so negative for some people that the mind simply refuses to integrate them. It is not that the facts are unknown. It is that the outcome is psychologically intolerable.

At that point something unusual happens. People do not argue with conclusions, they argue with perception. They insist the thing they are watching is not really happening, or that it does not count, or that it will somehow stop before it matters. It is not a failure of intelligence or ethics. It is a human coping mechanism when reality threatens meaning, livelihood, or future stability.

Meaning and intrinsic motivation absolutely matter here. The tragedy is not that meaningful work suddenly becomes meaningless. It is that it can remain meaningful while becoming economically unsustainable. That combination is brutal. But denying the shift does not preserve meaning. It only delays the moment where a person has to decide how to respond.

What I find unsettling is not the fear or the resistance. It is watching people stand next to you, looking at the same evidence, and then effectively unsee it because accepting it would force a reckoning they are not ready for.

>I'm unsure if you've written it with AI-assistance, but even if that's the case, I'll tolerate it.

Even if it was, the world is changing. You already need to tolerate AI in code; it's inevitable AI will be part of writing too.


> the outcome is psychologically intolerable [...] People do not argue with conclusions, they argue with perception [...] accepting it would force a reckoning they are not ready for

https://en.wikipedia.org/wiki/Cognitive_dissonance

Or perhaps, a form of grief.

> denying the shift does not preserve meaning

I think you meant to write:

"denying the shift does not preserve sustainability"

as "meaning" need not be preserved by anything. The idea here is that meaning -- stemming from the profession being supplanted -- is axiomatic.

And with that correction applied, I agree -- to an extent anyway. I hope that, even if (or "when") the mainstream gets swayed by AI, pockets / niches of "hand-crafting" remain sustainable. We've seen this with other professions that used to be mainstream but have been automated away at large scale.


Very good comment!

Who’s saying you can’t enjoy the typing of letters, numbers, and symbols into a computer? The issue is that this is getting to be a less economically valuable activity.

You wouldn’t say, “It’s not that they hate electricity, it’s just that they love harpooning whales and dying in the icy North Atlantic.”

You can love it all you want but people won’t pay you to do it like they used to in the good old days.


> I think there is a section of programmer who actually do like the actual typing of letters, numbers and special characters into a computer, and for them, I understand LLMs remove the fun part.

I know you didn't mean to, but I think that description is a mischaracterization. I'd wager most of us "I control the computer" people who enjoy crafting software don't really care for the actual inputting of symbols. That is just the mechanism by which we move code from our heads to the computer. What LLMs destroy – at least for me – is the creation of code in my head and its (more-or-less) faithful replication inside the computer. I don't particularly enjoy the physical act of moving my fingers across a piece of plastic, but I do enjoy the result: executing my program on my computer.

If an LLM is placed in the middle, two things happen: first, I'm expressing the _idea_ of my program not to a computer, but to an LLM; and second, the LLM expresses its "interpretation" of that idea to the computer. Both parts destroy joy for me. That's of course not important to anyone but myself and likeminded people, and I don't expect the world to care. But I do also believe that both parts come with a whole host of dangers that make the end result less trustworthy and less maintainable over time.

I'm definitely warming to the role of LLMs as critics though. I also see value in having them write tests – the worst a bad or unmaintainable test will provide is a false error.


> do like the actual typing of letters, numbers and special characters into a computer

and from the first line of the article:

> I love writing software, line by line.

I've said it before and I'll say it again: I don't write programs "line by line" and typing isn't programming. I work out code in the abstract away from the keyboard before typing it out, and it's not the typing part that is the bottleneck.

Last time I commented this on HN, I said something like "if an AI could pluck these abstract ideas from my head and turn them into code, eliminating the typing part, I'd be an enthusiastic adopter", to which someone predictably said something like "but that's exactly what it does!". It absolutely is not, though.

When I "program" away from the keyboard I form something like a mental image of the code, not of the text but of the abstract structure. I struggle to conjure actual visual imagery in my head (I "have aphantasia" as it's fashionable to say lately), which I suspect is because much of my visual cortex processes these abstract "images" of linguistic and logical structures instead.

The mental "image" I form isn't some vague, underspecified thing. It corresponds directly to the exact code I will write, and the abstractions I use to compartmentalise and navigate it in my mind are the same ones that are used in the code. I typically evaluate and compare many alternative possible "images" of different approaches in my head, thinking through how they will behave at runtime, in what ways they might fail, how they will look to a person new to the codebase, how the code will evolve as people make likely future changes, how I could explain them to a colleague, etc. I "look" at this mental model of the code from many different angles and I've learned only to actually start writing it down when I get the particular feeling you get when it "looks" right from all of those angles, which is a deeply satisfying feeling that I actively seek out in my life independently of being paid for it.

Then I type it out, which doesn't usually take very long.

When I get to the point of "typing" my code "line by line", I don't want something that I can give a natural language description to. I have a mental image of the exact piece of logic I want, down to the details. Any departure from that is a departure from the thing that I've scrutinised from many angles and rejected many alternatives to. I want the exact piece of code that is in my head. The only way I can get that is to type it out, and that's fine.

What AI provides, and it is wildly impressive, is the ability to specify what's needed in natural language and have some code generated that corresponds to it. I've used it and it really is very, very good, but it isn't what I need because it can't take that fully-specified image from my head and translate it to the exact corresponding code. Instead I have to convert that image to vague natural language, have some code generated and then carefully review it to find and fix (or have the AI fix) the many ways it inevitably departs from what I wanted. That's strictly worse than just typing out the code, and the typing doesn't even take that long anyway.

I hope this helps to understand why, for me and people like me, AI coding doesn't take away the "line-by-line part" or the "typing". We can't slot it into our development process at the typing stage. To use it the way you are using it we would instead have to allow it to replace the part that happens (or can happen) away from the keyboard: the mental processing of the code. And many of us don't want to do that, for a wide variety of reasons that would take a whole other lengthy comment to get into.


That’s because you’re a subset of software engineers who know what they’re doing and cares about rigour and so on.

There are many whose thinking is not as deep or sharp as yours - LLMs are welcomed by them, but come at a tremendous cost to their cognition and to the future well-being of the firm's code base. Because this cost is implicit and not explicit, it doesn't occur to them.


Companies don't care about you or any other developer. You shouldn't care about them or their future well-being.

> Because this cost is implicit and not explicit it doesn’t occur to them.

Your arrogance and naiveté blind you to the fact that it does occur to them, but because they have a better understanding of the world and their position in it, they don't care. That's a rational and reasonable position.


>they have a better understanding of the world and their position in it.

Try not to use better/worse when advocating so vociferously. As described by the parent, they are short-term pragmatic; that is all. This discussion can open up into a huge worldview where different groups have strengths and weaknesses based on this axis of pragmatic/idealistic.

"Companies" are not a monolith, both laterally between other companies, and what they are composed of as well. I'd wager the larger management groups can be pragmatic, where the (longer lasting) R&D manager will probably be the most idealistic of the firm, mainly because of seeing the trends of punching the gas without looking at long-term consequences.


Companies are monolithic in this respect and the idealism of any employee is tolerated only as long as it doesn't impact the bottom line.

> Try not to use better/worse when advocating so vociferously.

Hopefully you see the irony in your comment.


No, they just have a different job than I do and they (and you, I suspect) don't understand the difference.

Software engineers are not paid to write code, we're paid to solve problems. Writing code is a byproduct.

Like, my job is "make sure our customers accounts are secure". Sometimes that involves writing code, sometimes it involves drafting policy, sometimes it involves presentations or hashing out ideas. It's on me to figure it out.

Writing the code is the easy part.


> Like, my job is "make sure our customers accounts are secure".

This is naiveté. Secure customer accounts and the work to implement them is tolerated by the business only while it is necessary to increase profits. Your job is not to secure customer accounts, but to spend the least amount of money to produce a level of account security that will not affect the bottom line. If insecure accounts were tolerated or became profitable, that would be the immediate goal and your job description would pivot on a dime.

Failure to understand this means you don't understand your role, employer, or industry.


> Your job is not to secure customer accounts, but to spend the least amount of money to produce a level of account security that will not affect the bottom line

I completely agree with every line of this statement. That is literally the job.

Of course I balance time/cost against risk. That's what engineers do. You don't make every house into a concrete bunker because it's "safer"; that's expensive and unnecessary. You also don't engineer buildings for hurricanes in California. You do secure against earthquakes, because that's a likely risk.

Engineers are paid for our judgement, not our LOC. Like I said.


> I've used it and it really is very, very good, but it isn't what I need because it can't take that fully-specified image from my head and translate it to the exact corresponding code. Instead I have to convert that image to vague natural language, have some code generated and then carefully review it to find and fix (or have the AI fix) the many ways it inevitably departs from what I wanted.

I agree with this. The hard part of software development happens when you're formulating the idea in your head, planning the data structures and algorithms, deciding what abstractions to use, deciding what interfaces look like--the actual intellectual work. Once that is done, there is the unpleasant, slow, error-prone part: translating that big bundle of ideas into code while outputting it via your fingers. While LLMs might make this part a little faster, you're still doing a slow, potentially-lossy translation into English first. And if you care about things other than "does it work," you still have a lot of work to do post-LLM to clean things up and make it beautiful.

I think it still remains to be seen whether idea -> natural language -> code is actually going to be faster or better than idea -> code. For unskilled programmers it probably already is. For experts? The jury may still be out.


> I work out code in the abstract away from the keyboard before typing it out, and it's not the typing part that is the bottleneck.

Funny thing. I tend to agree, but I think it wouldn't look that way to an outside observer. When I'm typing in code, it's typically at a pretty low fraction of my general typing speed — because I'm constantly micro-interrupting myself to doubt the away-from-keyboard work, and refine it in context (when I was "working in the abstract", I didn't exactly envision all the variable names, for example).


I'm like you. I get on famously with Claude Code with Opus 4.5 2025.11 update.

Give it a first pass from a spec. Since you know how it should be shaped you can give an initial steer, but focus on features first, and build with testability.

Then refactor, with examples in prompts, until it lines up. You already have the tests, so the AI can ensure it doesn't break anything.

Beat it up more and you're done.


> focus on features first, and build with testability.

This is just telling me to do this:

> To use it the way you are using it we would instead have to allow it to replace the part that happens (or can happen) away from the keyboard: the mental processing of the code.

I don't want to do that.


I feel like some of these proponents act like a poet has the goal to produce an anthology of poems and should be happy to act as publisher and editor, sifting through the outputs of some LLM stanza generator.

The entire idea of using natural language for composite or atomic command units is deeply unsettling to me. I see language as an unreliable abstraction even with human partners that I know well. It takes a lot of work to communicate anything nuanced, even with vast amounts of shared context. That's the last thing I want to add between me and the machine.

What you wrote further up resonates a lot for me, right down to the aphantasia bit. I also lack an internal monologue. Perhaps because of these, I never want to "talk" to a device as a command input. Regardless of whether it is my compiler, smartphone, navigation system, alarm clock, toaster, or light switch, issuing such commands is never going to be what I want. It means engaging an extra cognitive task to convert my cognition back into words. I'd much rather have a more machine-oriented control interface where I can be aware of a design's abstraction and directly influence its parameters and operations. I crave the determinism that lets me anticipate the composition of things and nearly "feel" transitive properties of a system. Natural language doesn't work that way.

Note, I'm not against textual interfaces. I actually prefer the shell prompt to the GUI for many recurring control tasks. But typing works for me and speaking would not. I need editing to construct and proof-read commands which may not come out of my mind and hands with the linearity it assumes in the command buffer. I prefer symbolic input languages where I can more directly map my intent into the unambiguous, structured semantics of the chosen tool. I also want conventional programming syntax, with unambiguous control flow and computed expressions for composing command flows. I do not want vagaries of natural language interfering here.


Dead on and well said

Almost more importantly: the people who pay you to build software don't care whether you type it or enjoy it; they pay you for an output of working software.

Literally nothing is stopping people from writing assembly in their free time for fun

But the number of people who are getting paid to write assembly is probably less than 1000


yep, there's all types of people. i get hung up on the structure and shape of a source file, like it's a piece of art. if it looks ugly, even if it works, i don't like it. i've seen some llm code that i like the shape of, but i wouldn't like to use it verbatim since i didn't create it.

It's just a reiteration of the age-old conflict in arts:

- making art as you think it should be, but at the risk of it being non-commercial

- getting paid for doing commercial/trendy art

choose one


People who love thinking in false dichotomies like this one have absolutely no idea how much harder it is to “get paid for doing commercial/trendy art”.

It’s so easy to be a starving artist; in the world of commercial art it’s a bloody dog-eat-dog jungle, not made for faint-hearted sissies.


I've given this quite some thought and came to the conclusion that there is actually no choice, and all parties fall into the first category. It's just that some people intrinsically like working on commercial themes, or happen to be trendy.

Of course there are some artists who sit comfortably in the grey area between the two oppositions, and for these a little nudging towards either might influence things. But for most artists, their ideas or techniques are simply not relevant to a larger audience.


> and all parties fall into the first category [...] Of course there are some artists who sit comfortably in the grey area between the two oppositions

I'm not sure what your background is, but there are definitely artists out there drawing, painting and creating art they have absolutely zero care for, or even actively dislike or oppose, but they do it anyway because it's easier to actually get paid doing those things than others.

Take a look in the current internet art community and ask how many artists are actively liking the situation of most of their art commissions being "furry lewd art", vs how many commissions they get for that specific niche, as just one example.

History has lots of other examples, where artists typically have a day-job of "Art I do but do not care for" and then like the programmer, hack on what they actually care about outside of "work".


Agreed, but I'd say these would be artists in the "grey area". They are capable of drawing furry art, for example, and have the choice to monetize that, even though they might have become bored with it.

I was mostly considering contemporary artists that you see in museums, and not illustrators. Most of these have moved on to different media, and typically don't draw or paint. They would therefore also not be able to draw commission pieces. And most of the time their work does not sell well.

(Source: am professionally trained artist, tried to sell work, met quite a few artists, thought about this a lot. That's not to say that I may still be completely wrong though, so I liked reading your comment!)

Edit: and of course things get way more complicated and nuanced when you consider gallerists pushing existing artists to become trendy, and artists who are only "discovered" after their deaths, etc. etc.)


Yeah, but I guess wider. It's like the discussion would turn into "Don't use oil colors, then you don't get to do the fun process of mixing water and color together to get it just perfect" while maybe some artists don't think that's the fun process, and all the other categories, all mixed together, and everyone thinks their reason for doing it is the reason most people do it.

With LLMs, if you did the first in the past, then no matter what license you chose, your work is now in the second category, except you don't get a dime.

It's not.

It's:

- Making art because you enjoy working with paint

- Making art because you enjoy looking at the painting afterward


> I think there is a section of programmer who actually do like the actual typing of letters, numbers and special characters into a computer...

This sounds like an alien trying and failing to describe why people like creating things. No, the typing of characters on a keyboard has no special meaning; neither does dragging a brush across a canvas or pulling thread through fabric. It's the primitive desire to create something with your own hands. Have people using AI magically lost all understanding of creativity or creation? Does everything have to be utilitarian and business?


My entire point is that people are different. For some people (read through the other comments), it's quite literally about the typing of characters, or dragging a brush across the canvas. Sure, that might not be the point for you, but the entire point of my comment is that just because it's "obviously because of X" for you, that doesn't mean it's like that for others.

Sometimes I like to make music because I have an idea of the final result, and I wanna hear it like that. Other times, I make music because I like the feeling of turning a knob, and striking keys at just the right moment, and it gives me a feeling of satisfaction. For others, they want to share an emotion via music. Does this mean some of us are "making music for the wrong reasons"? I'd claim no.


No, they're right. Your description is what you get from outsiders who don't understand what they're seeing.

In a creative process, when you really know your tools, you start being able to go from thought to result without really having to think about the tools. The most common example when it comes to computers would be touch-typing - when your muscle memory gets so good you don't think about the keyboard at all anymore, your hands "know" what to do to get your thoughts down. But for those of us with enough experience in the programming languages and editor/IDE we use, the same thing can happen - going from thought to code is nearly effortless, as is reading code, because we don't need to think about the layers in between anymore.

But this only works when those tools are reliable, when we know they'll do exactly what we expect. AI tooling isn't reliable: It introduces two lossy translation layers (thought -> English and English -> code) and a bunch of waiting in the middle that breaks any flow. With faster computers maybe we can eliminate the waiting, but the reliability just isn't there.

This applies to music, painting, all sorts of creative things. Sure there's prep time beforehand with physical creation like painting, but when someone really gets into the flow it's the same: they're not having to think about the tools so much as getting their thoughts into the end result. The tools "disappear".

> Other times, I make music because I like the feeling of turning a knob, and striking keys at just the right moment, and it gives me a feeling of satisfaction.

But I'll bet you're not thinking about "I like turning this knob" at the moment you're doing it; I'll bet you're thinking "Increase the foo" (and if you're like me it's probably more like knowing that fact without forming the words), and the knob's immediate visceral feedback is where the satisfaction comes from, because you're increasing the foo without having to think about how to do it - in part because of how reliable it is.


I bet you also sometimes like to make music because the final result emerges from your intimate involvement with striking keys, no? That's the suggestion.

Let me get this right. You're telling me that in your personal experience, you don't abstract away low-level actions like pressing the keys of your instrument or typing on the keyboard? You're genuinely telling me you derive equal pleasure from the music and from the feel of the keys?

Nah bro, most of us learn touch typing and musical-instrument finger exercises etc. when starting out; it's usually abstracted away once we get competent.

AI takes away the joy of creation, not the low level actions. That's like abstracted twice over..


Do you enjoy the process of creating a solution more than the actual solution?

This is the main difference why people argue against LLMs in programming.

I'm in the "I want to solve a problem" end of the spectrum. Many others are in the "I want the code to be elegant, maintainable and beautifully crafted - oh, yeah, the problem might be solved too" end.


I don't think these characterizations in either direction are very helpful; I understand they're coming from a place of someone trying to make sense of why their ingrained notions of what creativity means, and of the "right" way to build software projects, are not shared by other people.

I use CC for both business and personal projects. In both cases: I want to achieve something cool. If I do it by hand, it is slow; I will need to learn something new, which takes too much time, and often the thing(s) I need to learn are not interesting to me (at the time). Additionally, I am slow and perpetually unhappy with the abstractions and design choices I make despite trying very hard to think through them. With CC: it can handle parts of the project I don't want to deal with, it can help me learn the things I want to learn, it can execute quickly so I can try more things and fail fast.

What's lamentable is the conclusion of "if you use AI it is not truly creative" ("have people using AI lost all understanding of creativity or creation?" is a bit condescending).

In other threads the sensitive dynamic from the AI-skeptic crowds is more or less that AI enthusiasts "threaten or bully" people who are not enthusiastic that they will get "punished" or fall behind. Yet at the same time, AI-skeptics seem to routinely make passive aggressive implications that they are the ones truly Creating Art and are the true Craftsman; as if this venture is some elitist art form that should be gate kept by all of you True Programmers (TM).

I find these takes (1) condescending, (2) wrong and also belying a lack of imagination about what others may find genuinely enjoyable and inspiring, (3) just as much of a straw man as their gripes against others "bullying" them into using AI.


> I think there is a section of programmer who actually do like the actual typing of letters, numbers and special characters into a computer

but luckily for us, we can still do that, and it's just as fun as it ever was. LLMs don't take anything away from the fun of actually writing code, unless you choose to let them.

if anything the LLMs make it more fun, because the boring bits can now be farmed out while you work on the fun bits. no, i don't really want to make another CRUD UI, but if the project i'm working on needs one i can just let claude code do that for me while i go back to working on the stuff that's actually interesting.


I think the downside is the developers who love the action of coding managed to accomplish several things at once - they got to code, and create things, and get paid lots for doing it.

AI coding makes creating things far more efficient (as long as you use AI), and will likely mean you don't get paid much (unless you use AI).

You can still code for the fun of it, but you don't get the ancillary benefits.


Thanks for reading / commenting on this post. Initially it seemed like I had received a bunch of very negative comments; now that I have read most of the thread, I see there are very good points, articulated with sensibility. Thank you.

I wanted to provide some more context that is not part of the blog post, since somebody may believe I don't enjoy / love the act of writing code.

1. I care a lot about programming, I love creating something from scratch, line by line. But: at this point, I want to do programming in a way that makes me special, compared to machines. When the LLM hits a limit, and I write a function in a way it can't compete with, that is good.

2. If I write a very small program that is like a small piece of poetry, this is good human expression. I'll keep doing this as well.

3. But, if I need to develop a feature, and I have a clear design idea, and I can do it in 2 hours instead of 2 weeks, how can I justify to myself that, just for the sake of what I love, I will use a lot more time? That would be too ego-centric a point of view, I believe.

4. For me too this is painful, as a transition, but I need to adapt. Fortunately I also enjoyed a lot the design / ideas process, so I can focus on that. And write code myself when needed.

5. The reason I wrote this piece is that I believe there are still a lot of people who are unprepared for the fact that we are going to be kind of obsolete in what defined us, as a profession: the ability to write code. A complicated ability requiring a number of skills at the same time: language skills, algorithms, problem decomposition. Since this is painful, and I believe we are headed in a certain direction, I want to tell the other folks in programming to accept reality. It will be easier, this way.


I think there is a danger hiding inside the enthusiasm for AI in these excellent points, namely that the skills that make a good programmer are not inherent; they are learned.

The comparison would be a guy who is an excellent journeyman electrician. This guy has visual-spatial skills that makes bending and installing conduit a kind of art. He has a deep and intuitive understanding of how circuits are balanced in a panel, so he does not overload a phase. But he was not born with them. These are acquired over many years of labor and tutelage.

If AI removes these barriers--and I think it will, as AI-enhanced programmers will out-perform and out-compete those who are not in today's employment market--then the programmer will learn different skills that may or may not be in keeping with language skills, algorithms, problem decomposition, etc. They may in fact be orthogonal to these skills.

The effect of this may be an improvement, of course. It's hard to say for sure as I left my crystal ball in my other jacket. But it will certainly be different. And those who are predisposed for programming in the old-school way may not find the field as attractive because it is no longer the same sort of engineering, something like the difference between the person that designs a Lego set and the person that assembles a Lego set. It could, in fact, mean that the very best programmers become a kind of elite, able to solve most problems with just a handful of those elite programmers. I'm sure that's the dream of Google and Microsoft. However this will centralize the industry in a way not seen since perhaps IBM, only with a much smaller chance of outside disruption.


That's really interesting, but I'm wondering if this is as rational as it looks.

> we are going to be kinda of obsolete in what defined us, as a profession: the ability to write code

Is it a fact, really? I don't think "writing code" is a defining factor, maybe it's a prerequisite, as being able to write words hardly defines "a novelist".

Anyway, prompt writing skills might become obsolete quite soon. So the main question might be to know which trend of technological evolution to pick and when, in order not to be considered obsolete. A crystal ball might still be more relevant than LLMs for that.


I don't think our profession was writing code to begin with (and this may be a bit, uhh, rewriting history?); what we do is take an idea, requirements, an end goal and make it reality. Often that involves writing code, but that's only one aspect of the software developer's job.

Analogy time because comment sections love analogies. A carpenter can hammer nails, screw screws, make holes, saw wood to size. If they then use machines to make that work easier, do they stop being carpenters?

It's good if not essential to be able to write code. It's more important to know what to write and when. Best thing to do at this point is to stop attaching one's self-worth with the ability to write code. That's like a novelist (more analogies) who praises their ability to type at 100wpm. The 50 shades books proved you don't need to either touch type (the first book was mostly written on a blackberry apparently) or be good at writing to be successful, lol.


Agreed - as I see it, it's akin to the transitions from machine code -> assembly language -> C -> Javascript. As time went by, knowing the deep internals of the machine became less and less necessary, even though having that knowledge still gives an engineer a useful insight into their work and often makes them better at their job. The goal remains the same - make the computer do the thing; only the mechanism changes as the tools evolve.

"-> AI" is just the next step along that journey. Maybe it will end at "-> AGI" and then humans will engage in programming mostly for the craft and the pleasure of it, like other crafts that were automated away over the ages.


As a specific example of this, the U.S. 18F team had helped the Forest Service a decade ago with implementing a requirement to help people get a permit to cut down a Christmas tree.

Although there was a software component for the backend, the thing that the actual user ended up with was a printed-out form rather than a mobile app or QR code. This was a deliberate design decision (https://greacen.com/media/guides/2019/02/12/open-forest-laun...), not due to a limitation of software.


> the ability to write code

I call it "the ability to communicate intent [using a programming language]" and suddenly building with AI looks at lot more like the natural extension of what we used to do writing code by ourselves.


This is by far the best summary of the state of affairs, or rather, the most sensible perspective that one should have on the state of affairs, that I've read so far.

Especially point 3 hits the nail on the head.


Thanks for the post. I found it very interesting and I agree with most of what you said. Things are changing, regardless of our feelings on the matter.

While I agree that there is something tragic about watching what we know (and have dedicated significant time and energy to learning) be devalued, I'm still excited for the future, and for the potential this has. I'm sure that given enough time this will result in amazing things that we cannot even imagine today. The fact that the open models and research are keeping up is incredibly important, and probably the main thing that keeps me optimistic for the future.


Of the four coding examples you describe, I find none of them compelling either in their utility or as a case for firing a dev (with one important caveat [0]).

In each example, you were already very familiar with the problem at hand, and gaining that familiarity probably took far longer than any additional time savings AI could offer.

0. Perhaps I consider your examples as worthless simply because you gloss over them so quickly, in which case that greatly increases the odds in most companies that you would be fired.


I still really, really, really struggle to see how humans are going to maintain and validate the programs written by LLMs if we no longer know (intimately) how to program. Any thoughts?

Very few people have the expertise to write efficient assembly code, yet everyone relies on compilers and assemblers to translate high-level code to byte-level machine code. I think the same concept is true here.

Once using coding agents becomes trivial, few people will know the details of the programming language and make sure intent is correctly transformed to code, and the majority will focus on different objectives and take LLM programming for granted.


No, that's a completely different concept, because we have faultless machines which perfectly and deterministically translate high-level code into byte-level machine code. This is another case of (nearly) perfect abstraction.

On the other hand, the whole deal of the LLM is that it does so stochastically and unpredictably.


The unpredictable part isn't new - from a project manager's point of view, what's the difference between an LLM and a team of software engineers? Both, from that POV, are a black box. The "how" is not important to them, the details aren't important. What's important is that what they want is made a reality, and that customers can press on a button to add a product to their shopping cart (for example).

LLMs mean software developers let go of some control of how something is built, which makes one feel uneasy because a lot of the appeal of software development is control and predictability. But this is the same process that people go through as they go from coder to lead developer or architect or project manager - letting go of control. Some thrive in their new position, having a higher overview of the job, while some really can't handle it.


"But this is the same process that people go through as they go from coder to lead developer or architect or project manager - letting go of control."

In those circumstances, it's delegating control. And it's difficult to judge whether the authority you delegated is being misused if you lose touch with how to do the work itself. This comparison shouldn't be pushed too far, but it's not entirely unlike a compiler developer needing to retain the ability to understand machine code instructions.


We also have machines that can perfectly and deterministically check written code for correctness.

And the stochastic LLM can use those tools to check whether its work was sufficient; if not, it will try again - without human intervention. It will repeat this loop until the deterministic checks pass.
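
To make that concrete, here is roughly what such a loop looks like - a minimal sketch only, assuming pytest as the deterministic check and simonw's llm library standing in for whatever client the agent actually uses; every name here is illustrative:

  import subprocess
  import llm  # assumption: the llm library as the client; any LLM API would do

  def generate_until_checks_pass(task, max_attempts=5):
      """Ask the model for code, run the deterministic checks, retry with the errors."""
      model = llm.get_model("gpt-4o-mini")  # illustrative model name
      feedback = ""
      for _ in range(max_attempts):
          code = model.prompt(
              f"Write a Python module that {task}. Return only the code.\n" + feedback
          ).text()
          with open("solution.py", "w") as f:
              f.write(code)
          # the deterministic part: the test suite either passes or it doesn't
          result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
          if result.returncode == 0:
              return code
          feedback = "The previous attempt failed these tests:\n" + result.stdout
      raise RuntimeError("checks never passed within the attempt budget")

Real agent harnesses are fancier (they edit files in place, run linters and type checkers too), but the shape is the same: generate, check deterministically, feed the failure back, repeat.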


> We also have machines that can perfectly and deterministically check written code for correctness.

Please do provide a single example of this preposterous claim.


The difference is that if you write in C you can debug in C. You don't have to debug the assembly. You can write an english wish list for an LLM but you will still have to debug the generated code. To debug it you will need to understand it.

Why would you have to debug generated code? Let the LLM debug it.

And how do you know it did it right?

> Very few people have the expertise to write efficient assembly code, yet everyone relies on compilers and assemblers to translate high-level code to byte-level machine code. I think same concept is true here.

That's a poor analogy which gets repeated in every discussion: compilers are deterministic, LLMs are not.


> That's a poor analogy which gets repeated in every discussion: compilers are deterministic, LLMs are not.

Compilers are not used directly, they are used by human software developers who are also not deterministic.

From the perspective of an organization with a business or service-based mission, they already know how to supervise non-deterministic LLMs because they already know how to supervise non-deterministic human developers.


Why does it matter if LLMs are not deterministic? Who cares?

There should be tests covering meaningful functionality; as long as the code passes the tests, i.e. the externally observable behaviour is the same, I don't care. (Especially if many tests can also be autogenerated with the LLM.)
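
To be clear about what I mean by "externally observable behaviour": tests that pin down inputs and outputs, not implementation details. A toy example - the function and the numbers are made up:

  # toy pytest example: the tests lock in behaviour, not how it's implemented,
  # so the body of apply_discount could be rewritten (by a human or an LLM)
  # in any way that keeps these assertions passing
  def apply_discount(total_cents, quantity):
      # hypothetical implementation under test
      return total_cents * 90 // 100 if quantity >= 100 else total_cents

  def test_bulk_orders_get_ten_percent_off():
      assert apply_discount(total_cents=100_000, quantity=100) == 90_000

  def test_small_orders_pay_full_price():
      assert apply_discount(total_cents=2_000, quantity=2) == 2_000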


>>> Very few people have the expertise to write efficient assembly code, yet everyone relies on compilers and assemblers to translate high-level code to byte-level machine code. I think the same concept is true here.

>> That's a poor analogy which gets repeated in every discussion: compilers are deterministic, LLMs are not.

> Why does it matter if LLMs are not deterministic? Who cares?

In the context of this analogy, it matters. If you're not using this analogy, then sure, only the result matters. But when the analogy being used is deterministic, then, yes, it matters.

You can't very well claim "We'll compare this non-deterministic process to this other deterministic process that we know works."


Yes, but compilers (in the main) do not have a random number generator to decide what output to produce.

> how humans are going to maintain and validate the programs written by LLMs if we no longer know (intimately) how to program

Short answer: we wouldn’t be able to. Slightly-less short answer: unlikely to happen.

Most programmers today can’t explain the physics of computation. That’s fine. Someone else can. And if nobody can, someone else can work backwards to it.


> > how humans are going to maintain and validate the programs written by LLMs if we no longer know (intimately) how to program

> Short answer: we wouldn’t be able to.

That's a huge problem! A showstopper for many kinds of programs!

> Slightly-less short answer: unlikely to happen.

Could you elaborate?

> Most programmers today can’t explain the physics of computation. That’s fine. Someone else can. And if nobody can, someone else can work backwards to it.

That's not the same at all. We have properly abstracted away the physics of computation. A modern computer operates in a way where, if you use it the way you've been instructed to, the physics underlying the computations cannot affect the computation in any undocumented way. Only a very few (and crucially, known and understood!!) physical circumstances can make the physics influence the computations. A layperson does not need to know how those circumstances work, only roughly what their boundaries are.

This is wildly different from the "abstraction" to programming that LLMs provide.


> That's a huge problem! A showstopper for many kinds of programs!

We have automated validation and automated proofs.

Proof is necessary. Do you validate the theorem prover, or trust that it works? Do you prove the compiler is correctly compiling the program (when it matters, you should, given they do sometimes re-write things incorrectly) or trust the compiler?

> We have properly abstracted away the physics of computation. A modern computer operates in a way where, if you use it the way you've been instructed to, the physics underlying the computations cannot affect the computation in any undocumented way.

You trust the hardware the code is running on? You shouldn't.

Rowhammer comes to mind, but it's hardly the only case. The US banned some Chinese chips over the unspecified potential that something like this was going on.

For some people it's OK to run a few simple tests on the chip's output to make sure it doesn't have something like the Pentium FDIV bug, for others they remove the silicon wafer from the packaging and scan it with an electron microscope, verify not just each transistor is in the right place but also that the wires aren't close enough to have currents quantum tunnelling or act as an antenna that leaks out some part of a private key.

Some people will go all the way down to the quantum mechanics. Exploits are possible at any level, domains where the potential losses exceed the cost of investigation do exist, e.g. big countries and national security.

Proof is necessary. The abstraction of hardware is good enough for most of us, and given the excessive trust already given to NPM and other package management tools, LLM output that passes automated tests is already sufficient for most.

People like me who don't trust package management tools, or who filed bugs with Ubuntu for not using https enough and think that Ubuntu's responses and keeping the bug open for years smelled like "we have a court order requiring this but can't admit it" (https://bugs.launchpad.net/ubuntu-website-content/+bug/15349...)… well, I can't speak for the paranoid, but I'm also the curious type who learned how to program just because the book was there next to the C64 game tapes.


> We have automated validation and automated proofs.

Example?

> Proof is necessary. Do you validate the theorem prover, or trust that it works? Do you prove the compiler is correctly compiling the program (when it matters, you should, given they do sometimes re-write things incorrectly) or trust the compiler?

I trust that the people who wrote the compiler and use it will fix mistakes. I trust the same people to discover compiler backdoors.

As for the rest of what you wrote: you're missing the point entirely. Rowhammer, the fdiv bug, they're all mistakes. And sure, malevolence also exists. But when mistakes or malevolence are found, they're fixed, or worked around, or at least documented as mistakes. With an LLM you don't even know how it's supposed to behave.


> This is wildly different from the "abstraction" to programming that LLMs provide.

I absolutely agree. But consider the unsaid hypothetical here: What if AI coding reaches the point where we can trust it in a similar manner?


At the current time this is essentially science fiction though. This is something that the best funded companies on the planet (as well as many many others) work on and seem to be completely unable to achieve despite trying their best for years now, and despite incredible hype.

It feels like if those resources were poured in nuclear fusion for example we'd have it production ready by now.

The field is also not a couple of years old; this has been tried for decades. Sure, only now have companies decided to put essentially "unlimited" resources into it, but while that showed that certain things are possible and work extremely well, it also strongly hinted that at least the current approach will not get us there, especially not without significant trade-offs (that whole overtraining vs "creativity" and hallucination topic).

Doesn't mean it won't come, but it doesn't appear to be a "we just need a bit more development" topic. The state hasn't changed much. Models became bigger and bigger and people added that "thinking" hack and agents and agents for agents, but that didn't change much about the initial approach and its limitations, given that they haven't cracked these problems after years of hyped funding.

Would be amazing if we would have AIs that automate research and maybe help us fix all the huge problems the world is facing. I'd absolutely love that. I'd also love it if people could easily create tools, games, art. However that's not the reality we live in. Sadly.


> At the current time this is essentially science fiction

I guess my point is so long as LLMs being trustworthy remains science fiction, so will coders forgetting how to code.


You could use AI to tutor you on how to code in a specific instance where you need it?

Tutoring – whether AI or human – does not provide the in-depth understanding necessary for validation and long-term maintenance. It can be a very useful step on the way there, but only a step.

No, that'll always remain a human skill that can only be taught with knowledge (which a tutor can help you gain) and experience.

Same how we do it now - look at the end result, test it. Testers never went away.

Besides, your comment rests on the assumption that we no longer know (intimately) how to program - is that true? I don't know C or assembly or whatever very well, but I'm still a valuable worker because I know other things.

I mean it could be partially true - but it's like having years of access to Google to quickly find just what I need, meaning I never learned how to read e.g. books on software development or scientific papers end to end. Never felt like I needed that skill, but it's a skill that a preceding generation did have.


Fair question but haven't we been doing this for decades? Very few people know how to write assembly and yet software has proliferated. This is just another abstraction.

> Fair question but haven't we been doing this for decades? Very few people know how to write assembly and yet software has proliferated. This is just another abstraction.

Not at all. Given any "layperson input", the expert who wrote the compiler that is supposed to turn it into assembly can describe in excruciating detail what the compiler will do and why. Not so with LLMs.

Said differently: If I perturb a source code file with a few bytes here and there, anyone with a modicum of understanding of the compiler used can understand why the assembly changed the way it did as a result. Not so with LLMs.


But there's a limit to that. There are (relatively) very few people who can explain the details of e.g. a compiler, compared to, for example, React front-end developers who build B2C software (...like me). And these software projects grow, ultimately to the limit of what one person can fit in their head.

Which is why we have lots of "rules" and standards on communication, code style, commenting, keeping history, tooling, regression testing, etc. And I'm afraid those will be the first to suffer when code projects are primarily written by LLMs - do they even write unit tests if you don't tell them to?


All of that makes a lot of sense. And unlike a lot of both pro-AI and anti-AI people I would find it great if it was the case. Unlike maybe a lot of people here I am less attached to this profession as a profession. I'd also love it if I could have some LLM do the projects I always wanted to finish. It would be essentially Christmas.

However your experiences really clash with mine and I am trying to work out why, because so far I haven't been able to copy your workflow with success. It would be great if I could write a proper spec and the output of the LLM would be good (not excellent, not poetry, but just good). However, the output for anything that isn't "stack overflow autocomplete" style is abysmal. Honestly I'd be happy if good output were even on the horizon.

And given that LLMs do much better on "new code" than on an existing project, and better on an existing LLM-generated project than on a human-made one, and even then the output is largely bad, often with subtle "insanity", I have a hard time applying what you say to reality.

I do not understand the disconnect. I am used to writing specs. I tried a lot of prompting changes, to a degree where it almost feels like a new programming language. Sure there are things that help, but the sad reality is that I usually spend more time dealing with the LLM than I'd need to write that code myself. And worse still, I will have to fix it and understand it, etc. to be able to keep on working on it and "refining" it, something that simply isn't needed at least to that extent if I wrote that code myself.

I really wished LLMs would provide that. And don't get me wrong, I do think there are really good applications for LLMs. Eg anything that needs a transform where even a complex regex won't do. Doing very very basic stuff where one uses LLMs essentially as an IDE-integrated search engine, etc.

However the idea that it's enough to write a spec for something even semi-novel currently appears to be out of reach. For trivial generic code it essentially saves you from either writing it yourself or copy-pasting it off some open source project.

That's a lot of context for a question that hopefully explains where I'm coming from. Those 2 hours that you use instead of two weeks: how do you spend them? Is that refining prompts, is that fixing the LLM output, is that writing/adapting specs, is it something else?

Also could it be that there is a bias on "time spent" because of it being different work or even just a general focus on productivity, more experience, etc.?

I am trying to understand where that huge gap in experience that people have really stems from. I read your posts, I watch videos on YouTube, etc. I just haven't seen "I write a spec [that is shorter/less effort than the actual code] and get good output". Every time I read claims about it in blog posts and so on there appear to be parts missing to reproduce the experience.

I know that there is a lot of "ego-centric POV" style AI "fear". People of course have worries about their jobs, and I understand. However, personally I really don't, and as mentioned I'd absolutely love to use it like that on some projects. But whenever I try to replicate experiences that aren't just "toying" - anything that has even basic reliability requirements and is a bit more complex - I fail to do so. It's probably me, but I have tried for at least a year to replicate such things, and it's failure after failure even for simpler things.

That said, there are productivity gains with autocomplete, transforming stuff and what people largely call "boilerplate", as well as more quickly writing small helpers that I'd otherwise have copied off some older project. Those things work well enough, just like how autocomplete is good enough. For bigger and more novel things, where a search engine is also not the right approach, it fails - but this is where the interesting bits are: topics that haven't been solved a hundred times over.

Or is that simply not what you mean/do?


  > How do I feel, about all the code I wrote that was ingested by LLMs? I feel great to be part of that, because I see this as a continuation of what I tried to do all my life: democratizing code, systems, knowledge.
I don't see it as democratic or democratising. TBH the knowledge is stored in three giga-companies that used sometimes almost-unlawful (if not outright unlawful?) methods to gain it, scraping it off GPL projects etc. And now they are selling it to us without giving the models away. The cost IS understandable, because the horrendously expensive vector cards do not come for free, but there is only one country the knowledge is gathered in, so this might as well fade away one day when an orange president says so (gimme all the monies or else..)

"democratizing" as in "I steal everybody's shit, make most content creators go bankrupt, then put it all in an LLM behind a paywall." Privatization of all human knowledge--past, present, & future. They own both parties, so it's not like anyone is going to vote their way out of this one--unless one considers guillotines a form of voting.

Yeah I'm also with GP saying

>> I don't see it as democratic or democratising. TBH the knowledge is stored in three giga companies

It can appear democratic while access is allowed, but if it can be revoked at any moment for any reason (it is private companies, after all, that own the AI playgrounds), then the illusion will shatter.

What is more, excessive reliance on AI creates skill deficit rather than skill surplus, and promotes dependence on AI. Wizards that are nothing without their magic wands, in a way.

This may not stand out today, but give it half a decade or a full one, when the next generation won't have a pre-AI skill set to fall back on, and the seams will become all too apparent.


What I don't understand about this whole "get on board the AI train or get left behind" narrative, what advantage does an early adopter have for AI tools?

The way I see it, I can just start using AI once they get good enough for my type of work. Until then I'm continuing to learn instead of letting my brain atrophy.


This is a pretty common position: "I don't worry about getting left behind - it will only take a few weeks to catch up again".

I don't think that's true.

I'm really good at getting great results out of coding agents and LLMs. I've also been using LLMs for code on an almost daily basis since ChatGPT's release on November 30th 2022. That's more than three years ago now.

Meanwhile I see a constant flow of complaints from other developers who can't get anything useful out of these machines, or find that the gains they get are minimal at best.

Using this stuff well is a deep topic. These things can be applied in so many different ways, and to so many different projects. The best asset you can develop is an intuition for what works and what doesn't, and getting that intuition requires months if not years of personal experimentation.

I don't think you can just catch up in a few weeks, and I do think that the risk of falling behind isn't being taken seriously enough by much of the developer population.

I'm glad to see people like antirez ringing the alarm bell about this - it's not going to be a popular position but it needs to be said!


I think you are right in saying that there is some deep intuition that takes months, if not years, to hone about current models. However, the intuition of someone who did nothing but talk about and use LLMs nonstop two years ago would be no better today than that of someone who started from scratch, if not worse, because of antipatterns that don't apply anymore, such as always starting a new chat and never using a CLI because of context drift.

Also, Simon, with all due respect, and I mean it, I genuinely look in awe at the amount of posts you have on your blog and your dedication, but it's clear to anyone that the projects you created and launched before 2022 far exceed anything you've done since. And I will be the first to say that I don't think that's because LLMs aren't able to help you. I think it's because, slowly but surely, month by month, you have kept replacing more and more of what makes you really, really good at engineering with LLMs.

If I look at Django, I can clearly see your intelligence, passion, and expertise there. Do you feel that any of the projects you’ve written since LLMs are the main thing you focus on are similar?

Think about it this way: 100% of you wins against 100% of me any day. 100% of Claude running on your computer is the same as 100% of Claude running on mine. 95% of Claude and 5% of you, while still better than me (and your average Joe), is nowhere near the same jump from 95% Claude and 5% me.

I do worry when I see great programmers like you diluting their work.


My great regret from the past few years is that experimenting with LLMs has been such a huge distraction from my other work! My https://llm.datasette.io/ tool is from that era though, and it's pretty cool.

I do think your Datasette work is fantastic and I genuinely hope you take my previous message the right way. I'm not saying you're doing something bad, quite the opposite; I feel like we need more of you, and I'm afraid that because of LLMs we're getting less of you.

(Breaking the 4th wall for a minute):

It’s not just Simon that we’re getting less of, it’s YOU we’re getting less of too. And we want you around. Don’t go.


> because of antipatterns that don’t apply anymore, such as always starting a new chat

I’m keen to understand your reasoning on this. I don’t agree, but maybe I’m just stuck with old practices, so help me?

What’s your justification as to why starting a new chat is an antipattern?


> 95% of Claude and 5% of you, while still better than me (and your average Joe), is nowhere near the same jump from 95% Claude and 5% me.

I see what you're saying, but I'm not sure it is true. Take simonw and tymscar, put them each in charge of a team of 19 engineers (of identical capabilities). Is the result "nowhere near the same jump" as simonw vs. tymscar alone? I think it's potentially a much bigger jump, if there are differences in who has better ideas and not just who can code the fastest.


I agree, however there you don’t compare technical knowledge alone, you also compare managerial skills.

With LLMs it's admittedly a bit closer to doing it yourself, because the feedback loop is much tighter.


It needs to be said that your opinion on this is well understood by the community, respected, but also far from impartial. You have a clear vested interest in the success of _these_ tools.

There's a learning curve to any toolset, and it may be that using coding agents effectively is more than a few weeks of upskilling. It may be, and likely will be, that people make their whole careers about being experts on this topic.

But it's still a statistical text prediction model, wrapped in fancy gimmicks, sold at a loss by mostly bad faith actors, and very far from its final form. People waiting to get on the bandwagon could well be waiting to pick up the pieces once it collapses.


How does he have a vested interest in the success of these tools? He doesn't work for an AI company. Why must he have some shady ulterior motive rather than just honestly believing the things he has stated? Yes, he blogs a lot about AI, but don't you have the cart profoundly before the horse if you are asserting that's a "vested interest"? He is free to blog about whatever he wants. Why would he fervently start blogging about AI if he didn't earnestly believe it was an interesting topic to blog about?

> But it's still a statistical text prediction model

This is reductive to the point of absurdity. What other statistical text prediction model can make tool calls to CLI apps and web searches? It's like saying "a computer is nothing special -- it's just a bunch of wires stuck together"


> Why must he have some shady ulterior motive rather than just honestly believing the things he has stated?

I wouldn't say it's shady or even untoward. Simon writes prolifically and he seems quite genuinely interested in this. That he has attached his public persona, and what seems like basically all of his time from the last few years, to LLMs and their derivatives is still a vested interest. I wouldn't even say that's bad. Passion about technology is what drives many of us. But it still needs saying.

> This is reductive to the point of absurdity. What other statistical text prediction model can make tool calls to CLI apps and web searches?

It's just a fact that these things are statistical text prediction models. Sure, they're marvels, but they're not deterministic, nor are they reliable. They are like a slot machine with surprisingly good odds: pull the lever and you're almost guaranteed to get something, maybe a jackpot, maybe you'll lose those tokens. For many people it's cheap enough to just keep pulling the lever until they get what they want, or go bankrupt.


I have a lot of respect for Simon and read a lot of his articles.

But I'm still seeing clear evidence it IS a statistical text prediction model. You ask it the right niche thing and it can only pump out a few variations of the same code - code that's clearly someone else's, stolen almost verbatim.

And I just use it 2 or 3 times a day.

How are SimonW and AntiRez not seeing the same thing?

How are they not seeing the propensity for both Claude + ChatGPT to spit out tons of completely pointless error handling code, making what should be a 5 line function a 50 line one?

How are they not seeing that you constantly have to nag it to use modern syntax? TypeScript, C#, Python, it doesn't matter what you're writing in, it will regularly spit out code patterns that are 10 years out of date. And woe betide you if you're using a library that got updated in the last 2 years. It will constantly revert back to old syntax over and over and over again.

I've also had to deal with a few of my colleagues using AI code on codebases they don't really understand. Wrong sort, id instead of timestamp. Wrong limit. Wrong json encoding, missing key converters. Wrong timezone on dates. A ton of subtle, not obvious, bugs unless you intimately know the code, but would be things you'd look up if you were writing the code.

And that's not even including the bit where the AI obviously decided to edit the wrong search function in a totally different part of the codebase that had nothing to do with what my colleague was doing. But didn't break anything or trigger any tests because it was wrapped in an impossible to hit if clause. And it created a bunch of extra classes to support this phantom code, so hundreds of new lines of code just lurking there, not doing anything but if I hadn't caught it, everyone thinks it does do something.


It's mostly a statistical text model, although the RL "reasoning" stuff added in the past 12 months makes that a slightly less true statement - it now has extra tricks that help bias it towards statistically predicting bits of code that are more likely to work.

The real unlock though is the coding agent harnesses. It doesn't matter any more if it statistically predicts junk code that doesn't compile, because it will see the compiler error and fix it. If you tell it "use red/green TDD" it will write the tests first, then spot when the code fails to pass them and fix that too.

> How are they not seeing the propensity for both Claude + ChatGPT to spit out tons of completely pointless error handling code, making what should be a 5 line function a 50 line one?

TDD helps there a lot - it makes it less likely the model will spit out lines of code that are never executed.

> How are they not seeing that you constantly have to nag it to use modern syntax. Typescript, C#, Python, doesn't matter what you're writing in, it will regularly spit out code patterns that are 10 years out of date.

I find that if I use it in a codebase with modern syntax it will stick to that syntax. A prompting trick I use a lot is "git clone org/repo into /tmp and look at that for inspiration" - that way even a fresh codebase will be able to follow some good conventions from the start.

Plus the moment I see it write code in a style I don't like I tell it what I like instead.

> And that's not even including the bit where the AI obviously decided to edit the wrong search function in a totally different part of the codebase that had nothing to do with what my colleague was doing.

I usually tell it which part of the codebase to execute - or if it decides itself I spot that and tell it that it did the wrong thing - or discard the session entirely and start again with a better prompt.


Ok, but given the level of detail you're supplying, at that point isn't it quicker to write the code yourself than it is to prompt?

As you have to explain much of this, the natural-language description ends up being much longer than the code itself and less precise, so it actually takes much longer to type and is more ambiguous. And obviously at the moment ChatGPT tends to make assumptions without asking you; Claude is a little better at asking for clarification.

I find it so much faster to just ask Claude/ChatGPT for an example of what I'm trying to do and then cut/paste/modify it myself. So just use them as SO on steroids, no agents, no automated coding. Give me the example, and I'll integrate it.

And the end code looks nothing like the supplied example.

I tried using AquaVoice (which is very good) to dictate to it, and that slightly helped, but often I found myself going so slowly just fully prompting the AI when I would have already finished the new code myself at that point.

I was thinking about this last night, I do wonder if this is another example of the difference between deep/narrow coding of specialist/library code and shallow/wide of enterprise/business code.

If you're writing specialist code (like AntiRez), it's dealing with one tight problem. If you're writing enterprise code, it has to take into account so many things, explaining it all to the AI takes forever. Things like use the correct settings from IUserContext, add to the audit in the right place, use the existing utility functions from folder X, add json converters for this data structure, always use this different date encoding because someone made a mistake 10 years ago, etc.

I get that some of these would end up in agents.md/claude.md, but as many people have complained, AI agents often rapidly forget those as the context grows, so you have to go through any generated code with a fine-tooth comb, or get it to generate a disproportionate amount of tests, each and every one of which you again have to explain.

I guess that will be fixed eventually. But from my perspective, as they're still changing so rapidly and much advice from even 6/9 months ago is now utterly wrong, why not just wait.

I, like many others on this thread, also believe that it's going to take about a week to get up to speed when they're finally ready. It's not that I can't use them now, it's that they're slow, unreliable, prone to being a junior on steroids, and actually create more work when reviewing the code than if I'd just written it myself in the first place, and the code is much, much, much worse than MY code. Not necessarily worse than all the code of the people I've worked with, but MY code is usually 50-90% more concise.


Enterprise code writer here.

> If you're writing enterprise code, it has to take into account so many things, explaining it all to the AI takes forever. Things like use the correct settings from IUserContext, add to the audit in the right place, use the existing utility functions from folder X, add json converters for this data structure, always use this different date encoding because someone made a mistake 10 years ago, etc.

The fix for this is... documentation. All of these need to be documented in a place that's accessible to the agent. That's it.

I've just about one-shotted UI features with Claude just by giving it a screenshot of the Figma design (couldn't be bothered with the MCP) and the ticket about the feature.

It used our very custom front-end components correctly, used the correct testing library, wrote playwright tests and everything. Took me maybe 30 minutes from first prompt to PR.

If I (a backend programmer) had to do it, it would've taken me about a day of trying different things to see which one of the 42 different ways of doing it worked.


I talk about why that doesn't work in the line right after the one you quoted. Everyone's having problems with context windows and CC/etc. rapidly forgetting instructions.

I'm fullstack, I use AI for FE too. They've been able to do the screenshot trick for over a year now. I know it's pretty good at making a page, but the code is usually rubbish and you'll have a bunch of totally unnecessary useEffect, useMemo and styling in that page that it's picked up from its training data. Do you have any idea what all the useEffect() and useMemo() it's littered all over your new page do? I can guarantee almost all of them are wrong or unnecessary.

I use that page you one-shotted as a starting point, it's not production-grade code. The final thing will look nothing like it. Good for solving the blank page problem for me though.


> Everyone's having problems with context windows and CC/etc. rapidly forgetting instructions.

I'm not having those problems at all... because I've developed a robust intuition for how to avoid them!


That matches my experience with LLM-aided PRs - if you see a useEffect() with an obvious LLM line-comment above it, it's 95% going to be either unneccessary or buggy (e.g. too-broad dependencies which cause lots of unwanted recomputes).

You can literally go look at some of antirez's PRs described here in this article. They're not seeing it because it's not there?

Honestly, what you're describing sounds like the older models. If you are getting these sorts of results with Opus 4.5 or 5.2-codex on high I would be very curious to see your prompts/workflow.


People have been saying "Oh use glorp 3.835 and those problems don't happen anymore" for about 3 years at this point. It's always the fact you're not using the latest model that's the problem.

> You ask it the right niche thing and it can only pump out a few variations of the same code, that's clearly someone else's code stolen almost verbatim.

There are only so many ways to express the same idea. Even clean room engineers write incidentally identical code to the source sometimes.


There was an example on here recently where an AI PR to an open source project literally had someone else's name in the comments in the code, and included their license.

That's the level of tell-tale that shows it's just stealing code and modifying a couple of variable names.

For me personally, the code I've seen might be written in a slightly weird style, or have strange, not applicable to the question, additions.

They're so obviously not "clean room" code or incredibly generic, they're the opposite, they're incredibly specific.


> Using this stuff well is a deep topic.

Just like the stuff LLMs are being used for today. Why wouldn't "using LLMs well" be just one of the many things LLMs will simplify too?

Or do you believe your type of knowledge is somehow special and is resistant to being vastly simplified or even made obsolete by AI?


An interesting trend over the past year is that LLMs have learned how to prompt each other.

Back in ~2024 a lot of people were excited about having "LLMs write the prompt!" but I found the results to be really disappointing - they were full of things like "You are the world's best expert in marketing" which was superstitious junk.

As of 2025 I'm finding they actually do know how to prompt, which makes sense because there's a ton more information about good prompting approaches in the training data as opposed to a couple of years ago. This has unlocked some very interesting patterns, such as Claude Code prompting sub-agents to help it explore codebases without polluting the top level token window.
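
The sub-agent trick is easier to see in code than in prose. This is not how Claude Code actually implements it - just a toy sketch of the shape, leaning on my llm library and made-up function names: the parent only ever keeps each sub-agent's short answer in its own context, never the full files the sub-agents read.

  import llm  # assumption: the llm library as a stand-in for any client

  MODEL = "gpt-4o-mini"  # illustrative model name

  def subagent_answer(path, question):
      """Throwaway sub-agent: reads an entire file, returns only a short answer."""
      with open(path) as f:
          source = f.read()
      return llm.get_model(MODEL).prompt(
          f"Answer in under 100 words: {question}\n\nFile contents:\n{source}"
      ).text()

  def parent_agent(question, paths):
      # the parent's context holds only these short notes, so reading many large
      # files doesn't pollute (or blow out) its own token window
      notes = "\n".join(f"{p}: {subagent_answer(p, question)}" for p in paths)
      return llm.get_model(MODEL).prompt(
          f"Using these notes from sub-agents, answer: {question}\n\n{notes}"
      ).text()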

But learning to prompt is not the key skill in getting good results out of LLMs. The thing that matters most is having a robust model of what they can and cannot do. Asking an LLM "can you do X" is still the kind of thing I wouldn't trust them to answer in a useful way, because they're always constrained by training data that was only aware of their predecessors.


Unless we figure out how to make 1 billion+ tokens multimodal context windows (in a commercially viable way) and connect them to Google Docs/Slack/Notion/Zoom meetings/etc, I don't think it will simplify that much. Most of the work is adjusting your mental model to the fact that the agent is a stateless machine that starts from scratch every single time and has little-to-no knowledge besides what's in the code, so you have to be very specific about the context of the task in some ways.

It's different from assigning a task to a co-worker who already knows the business rules and cross-implications of the code in the real world. The agent can't see the broader picture of the stuff it's making, it can go from ignoring obvious (to a human that was present in the last planning meeting) edge cases to coding defensively against hundreds of edge cases that will never occur, if you don't add that to your prompt/context material.


So where’s all of this cutting edge amazing and flawless stuff you’ve built in a weekend that everybody else couldn’t because they were too dumb or slow or clueless?

I wouldn't call these flawless but here you go:

- https://github.com/simonw/denobox is a new Python library that gives you the ability to run arbitrary JavaScript and WASM in a sandbox provided by Deno, because it turns out a Python library can depend on deno these days. I built that on my phone in bed yesterday morning.

- https://github.com/simonw/pwasm is a WebAssembly runtime written in pure Python with no dependencies, built by feeding Claude Code the official WASM specification along with its conformance test suite and having it hack away at that (again via my phone) to get as many of the tests to pass as possible. It's pretty slow and not really useful yet but it's certainly interesting.

- https://github.com/datasette/datasette-transactions is a Datasette plugin which provides a JSON API for starting a SQLite transaction, running multiple queries within it and then executing or rolling back that transaction. I built that one on my phone on a BART (SF Bay Area metro) trip.

- https://github.com/simonw/micro-javascript is a pure Python, no dependency JavaScript interpreter which started as a port of MicroQuickJS. Here's a demo of that one running in a browser https://simonw.github.io/micro-javascript/playground.html - that's my JavaScript interpreter running inside Python running in Pyodide in WebAssembly in your browser of choice, which I find inherently amusing.

All of those are from the past three weeks. Most of them were built on my phone while I was doing other things.


I am not at all an AI sceptic, but probably less impressed by what LLMs are capable of.

Looking at these projects, I have a few questions:

1. These seem to be fairly self-contained and well specified problems, which is the best case scenario for “vibe coding”. Do you have any examples of projects where the solution was somewhat vague and open-ended? If not, how do you think Claude Code or similar would perform?

2. Did you feel excited or energized by having an LLM implement these projects end-to-end? Personally, I find LLMs useful as a closely guided assistant, particularly to interactively explore the space of solutions. I also don’t feel energized at all by having it implement anything non-trivial end to end, outside of writing tests (and even then, not all types of tests!).

3. Do you think others would find these projects useful? In particular, if you vibe coded them, why couldn’t someone else do the same thing? And once these projects are picked up by future model training runs, they’ll probably be even easier to one shot, reducing the value even further.

Let me provide an example of what I mean by (2), at least in the context of hobbyist dev. I could have Claude Code vibe code a Gameboy emulator and it would probably do a fine job given that it’s a well specified problem that is likely well represented in its training data. But the process would neither be exciting nor energizing. I would rather spend hours gradually getting more and more working and experience the fruits of my labor (I did this already btw).

At $DAYJOB, I simply do not have confidence in an LLM doing anything non-trivial end to end. Besides, the complexity remains in defining the requirements and constraints, designing the solution, gaining consensus, and devising a plan for implementation. The goal would be for the LLM to pick up discrete, well defined chunks of work.


"Do you have any examples of projects where the solution was somewhat vague and open-ended"

This one is pretty open ended, and I'm having a ton of fun designing and iterating on it: https://github.com/simonw/claude-code-transcripts - it's also attracting quite a few happy users now.

I have another project in the works in Go which is proving to be a ton of fun from a software design perspective, but it's not ready for outside eyes just yet.

"Did you feel excited or energized by having an LLM implement these projects end-to-end"

I'm enjoying myself so much right now. My BART rides have never been this entertaining before!

"Do you think others would find these projects useful? In particular, if you vibe coded them, why couldn’t someone else do the same thing?"

I don't think many developers have the combined taste and knowledge necessary to spin up Denobox or datasette-transactions. They both solve problems that I'm very confident need solving, but I expect to have to explain why those matter in some detail to all but a very small group of people who share my particular interests.

The other two are pretty standard - I suggest anyone who wants to learn more about JavaScript interpreters or WASM runtimes try something similar in the language of choice as a learning exercise.


> I have another project in the works in Go which is proving to be a ton of fun from a software design perspective, but it's not ready for outside eyes just yet.

As a long-time user of the language I'm happy to see that Go seems to be excellent for LLM agent development. The language is simple, there's only one way to do loops etc. It hasn't changed that much syntax-wise (I think `any` is the only thing that LLMs miss).

Gofmt (or goimports) makes sure all code looks the same, there are VERY robust linters and a built-in testing framework so the LLM only needs to know one. And the code won't even compile if there are unused variables or other cruft.

It might be boring or verbose, but it's also very predictable and simple. All things LLMs like :D


Yes, I've got very interested in Go over the past year for exactly those reasons.

It's also really easy to read code and understand exactly what it does, I'm still finding Rust a lot harder to decode - way more ampersands!


How much do you pay per month for AI services?

$200 to Anthropic, $20 to OpenAI, ~$10 in API fees for various other services, and I get GitHub Copilot in VS Code for free as an open source developer.

Based on those, it seems you are not actually using them to create big codebases from scratch, but rather for problems that would normally take quite a while, not because they are inherently difficult to implement, but because you would normally have to spend considerable time on the finicky implementation details.

I think that's the reason why LLMs work so well for some, like you, and generate slop for others: if you leave them alone with projects that require opinionated code and actual decision making, they most often don't grasp the user's intention well, or worse, misinterpret it so confidently that you end up with all the wrong opinions and decisions compounding path-dependently into the strangest and most useless slop.


"for problems that would normally take quite a while, not because they are inherently difficult to implement, but because you would normally have to spend considerable time on the finicky implementation details"

Yes, exactly! How amazing is it that we have technology now that lets us quickly build projects where we would normally have to spend considerable time on the finicky implementation details?


Pretty nice I guess. Cool even. Impressive! And I only say this, just in case, for someone else maybe, ehh—is that it? Because that's totally fine with me, same experience actually, funny that, really impressive tech btw! Very nice. Just, maybe, do the CEOs know that? When people talk of "not having to code anymore"—do they know that this is how it's described by one of its most prominent champions today?

Not that I mind, of course. As you said: amazing!

Maybe someone should just check in with the CEOs who were in the news recently talking about their workforce…


> When people talk of “not having to code anymore”

You should reinterpret that as "not having to type the code out by hand any more". You still need a significant depth of coding knowledge and experience to get good results out of these things. You just don't need to type out every variable declaration and for loop yourself any more.


Automate tools, not jobs.

Every single tool or utility you have in the back of your head, you can just make it in a few hours of wall-clock time, minutes of your personal active time.

Like I wanted a tool that can summarise different sources quickly, took me ~3 hours to build it using llm + fragments + OpenAI API.

Now I can just go `q <url>` in my terminal and it'll summarise just about anything.
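
The core of a tool like that is genuinely tiny. A rough sketch - not the actual `q` implementation, which also uses llm's fragments feature and proper content extraction - just to show how little glue is involved (model name and helper name are illustrative):

  import sys
  import urllib.request
  import llm  # simonw's llm library; needs an OpenAI API key configured

  def summarise(url):
      # naive fetch; the real tool extracts readable text and handles other formats
      with urllib.request.urlopen(url) as resp:
          content = resp.read().decode("utf-8", errors="replace")
      model = llm.get_model("gpt-4o-mini")  # illustrative model name
      return model.prompt(
          "Summarise the following page in a few bullet points:\n\n" + content[:20000]
      ).text()

  if __name__ == "__main__":
      print(summarise(sys.argv[1]))

Wrap something like that in a `q` shell alias and that's the whole tool.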

Then I built a similar tool that can download almost anything `dl <url>` will use yt-dlp, curl and various other tools depending on the domain to download the content.


Another lens is that many people either have terrible written communication skills, do not intuitively grasp how to describe a complex system design, or both. And yet, since everyone is a genius with 100% comprehensibility in their own mind, they simply aren't aware that the problem starts with them.

Well I think it also has to do with communication with LLMs being different to communication with humans. If you tell a developer "don't do busywork" they surely wouldn't say "Oh the repo looks like a trash dump, but no busywork so I'm not going to clean it up, quickly document that as canonical structure, then continue"

> have terrible written communication skills

More and more I think this is it.


You keep saying you "built" this or that, but did you really?

Of course I don't know for sure if you had any substantial input other than writing a few paragraphs of prompt text and sending Claude some links, because I didn't witness your workflow there. But I think this is kind of what irks some people including myself.

What's stopping me from "building" something similar also? Maybe I won't be as fast as you since you seem to be more experienced with these tools, but at the end of the day, would you be able to describe in detail what got built without you asking Claude about it? If you don't know anything about what you built other than just prompting an AI, in my opinion you didn't actually "build" anything -- Claude did.


There's an ongoing conversation among coding agent enthusiasts right now about the correct verb to use.

One of my favorite options is "directed" - "I directed this". It's not quite obvious enough for me to use it in comments on threads like this though.

I've also experimented with "We built" but that feels uncomfortably like anthropomorphizing the model.

One of the reasons I publish almost all of my prompts and transcripts is that I don't believe in gatekeeping this stuff and I want other people to be able to learn how to do what I can do. Here are the transcripts for my Denobox project, for example: https://github.com/simonw/denobox/tree/transcripts - you can view those with my new https://orphanhost.github.io/ tool like this: https://orphanhost.github.io/?simonw/denobox/transcripts/ses...


I don't think it's wise to bend to those with FUD.

I don't say "my tablesaw and I built this table" I say "I built this table"


Thanks for sharing, I'll take a look!

This is such a tired response at this point.

People are under zero obligation to release their work to the public. Simon actually publishes and writes about a remarkable amount of the side projects he builds with AI.

The rest of us just build tons of cool stuff for personal use or for $JOB. Releasing stuff to the public is, in general, a massive amount of extra work for very little benefit. There are loads of FOSS maintainers trapped spending as much time managing their communities as they do their actual projects and many of us just don't have time for that.


> The rest of us just build tons of cool stuff for personal use or for $JOB. Releasing stuff to the public is, in general, a massive amount of extra work for very little benefit. There are loads of FOSS maintainers trapped spending as much time managing their communities as they do their actual projects and many of us just don't have time for that.

I wouldn't worry about this.

There are many examples of people sharing a project they've used LLMs to help write, and the result was not a huge amount of attention & expectation of burden.

Perhaps "I don't share it because I'm worried people will love it too much" even suggests the opposite: you can concretely demonstrate the kinds of things you've been able to build using LLMs.

> This is such a tired response at this point.

Lack of specificity & concrete examples frequently means all that's left for discussion is emotion for hype and anti-hype, though.

In this thread, the discussion was:

  pro: use LLMs or get left behind

  conserve: okay, I'll start using LLMs when they're good

  pro: no no they won't be that good, it takes effort to get to use them

  conserve: do you have any examples?

  pro: why should we have to share examples?

I like LLMs. But making big claims while staying reticent about concrete examples and demonstrations is irksome.

I’m waiting to see a huge burst of high quality open source code, which should be happening, right?

The response may be tired when asked in this personal way, but in general, it's a fair question. Nobody is forced to share their work. But with all the high praises, we'd expect to see at least some uptick in the software world. But there is no surge in open source projects. No surge in app store entries. And for the bigger companies claiming high GenAI use, they're not iterating faster or building more. They are continually removing features and their software is getting worse, slower, less robust, and less secure.

Software quality and capabilities have been on a steep downward curve for years before LLM coding had its breakthrough. For all the promises, I'd have expected to at least notice the downward trajectory easing off three years later. But it hasn't been happening.


All I took from your reply was

> I could if I wanted to, but I just don't feel like it.

What am I missing where I can understand that's not what you meant?


I find it increasingly confusing that some people seem to believe that other people not subjecting themselves to this continued interrogation gives any credence to their position.

People seem to believe that there is a burden of proof. There is not. What do I care if you are on board?

I don't know what could change your mind, but of course the answer is "nothing" as long as you are not open to it. Just look around. There is so much stuff, from so many credible people in all domains. If you can't find anything that is convincing or at least interesting to you, you are simply not looking.


> People seem to believe that there is a burden of proof. There is not. What do I care if you are on board?

The burden of proof rests on those making the positive claim. You say you don't care if others get on board, but a) clearly a lot of others do (case in point: the linked article) and b) a quick check of your posts in this very thread shows that you are indeed making positive claims about the merits of LLM assisted software development.


> What do I care if you are on board?

Without enough adoption expect some companies you are a client of to increase prices more, or close entirely down the road, due to insufficient cash inflow.

So, you would care, if you want to continue to use these tools and see them evolve, instead of seeing the bubble pop.


Over the last few days I made this ggplot2-looking plotting DSL as a CLI tool and a Rust library.

https://github.com/williamcotton/gramgraph

The motivation? I needed a declarative plotting language for another DSL I'm working on called Web Pipe:

  GET /weather.svg
    |> fetch: `https://api.open-meteo.com/v1/forecast?latitude=52.52&longitude=13.41&hourly=temperature_2m`
    |> jq: `
      .data.response.hourly as $h |
      [$h.time, $h.temperature_2m] | transpose | map({time: .[0], temp: .[1]})
    `
    |> gg({ "type": "svg", "width": 800, "height": 400} ): `
      aes(x: time, y: temp) 
        | line()
        | point()
    `
"Web Pipe is an experimental DSL and Rust runtime for building web apps via composable JSON pipelines, featuring native integration of GraphQL, SQL, and jq, an embedded BDD testing framework, and a sophisticated Language Server."

https://github.com/williamcotton/webpipe

https://github.com/williamcotton/webpipe-lsp

https://williamcotton.com/articles/basic-introduction-to-web...

I've been working at quite a clip for a solo developer who is building a new language with a full featured set of tooling.

I'd like to think that the approach to building the BDD-testing framework directly into the language itself and having the test runner using the production request handlers is at least somewhat novel!

  GET /hello/:world
    |> jq: `{ world: .params.world }`
    |> handlebars: `<p>hello, {{world}}</p>`

  describe "hello, world"
    it "calls the route"
      let world = "world"
      
      when calling GET /hello/{{world}}
      then status is 200
      and selector `p` text equals "hello, {{world}}"
I'm married with two young kids and I have a full-time job. Before these tools there was no way I could build all of these experiments with such limited resources.


All of the linked apps look trivial to me. Also, in the first one, the UI gives no feedback once you click the answer (plus some questions don't really make sense, as they have the answer in them). There is more on the website, so there could be something interesting, but I'm having trouble finding it among all the noise. Not saying simple apps have no value. Even simple throwaway UIs can have value, especially if you develop them quickly.

How about these ones, are these trivial too? https://news.ycombinator.com/item?id=46582192

This is not really cool or impressive at all?

A page that outputs your user agent as an example of 'cool stuff built with AI'?

See my comment here - I suspect that those were deliberately picked by llmslave3 to NOT be impressive: https://news.ycombinator.com/item?id=46582209

For more impressive examples see https://simonwillison.net/2025/Dec/10/html-tools/ and https://news.ycombinator.com/item?id=46574276#46582192


I feel like I'm being punked, being told that this "bullish vs bearish flash card" thing and this "here's your user agent, something people have been doing for thirty years" thing, are "cool stuff". This guy seriously needed AI to make those?

I can't gauge the other two since I don't use those things, so maybe they are cool, idk.


Go read my replies to your sibling comments that said the same thing.

llmslave3 appears to have deliberately picked the least interesting from my HTML+JavaScript tools collection here. This post describes a bunch of much more interesting ones: https://simonwillison.net/2025/Dec/10/html-tools/

> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.

Did you genuinely select those examples in good faith?

If you're here to converse in good faith, what's your opinion of the examples I shared in this post over here? https://news.ycombinator.com/item?id=46574276#46582192


Where is all the amazing, much better stuff you implemented manually meanwhile?

I'm not the one making unverifiable, extravagant, pompous and extraordinary claims though :)

Did you miss the part where the guy you derisively asked replied with an extensive list of quite verifiable projects?

Are you asking for evidence that humans can write good code?

No, I am pointing out the hypocrisy in demanding evidence of production results in a derisive manner whenever someone mentions a productivity boost with AI.

To some extent it's an understandable ask, but obviously, even with a decent productivity boost, side projects still require a lot of time and effort before a possible public release.


While intuition takes a while, I think it can be learned in less than a month or two.

This has been my experience. When something gets good enough, someone will create some really good resource on it. Allowing the dust to settle, to me is a more efficient strategy than constantly trying to “keep up”. Maybe also not waiting too long to do so.

This wouldn’t work of course if a person was trying to be some AI thought leader.


I'd say that it's a different type of learning process, where even a good resource doesn't help as much as it would with a traditional programming language. Sort of like you can't get very good at writing by just reading a ton of instructional books about it.

Even CRUD programming: you can’t get very good at it with just reading.

Maybe it’s just two different ways to reach the same result. You need to spend time to be great at prompting to get high-quality code from LLM’s, which might just be equivalent to the fact you need to spend time to write high-quality code without LLM’s too.

From where I’m standing, I don’t see any massive difference in overall productivity between those all in on vibe coding and those who aren’t. There aren’t more features, higher quality, etc. from teams/companies out there than before, on any high-level metrics/observations. Maybe it will come, but there’s also no evidence it will.

I do, however, see great gains within certain specific tasks using LLM’s. Smaller scope code gen, rubber ducking, etc. But this seems much less difficult to get good at using (and I hope for tooling that helps facilitate the specific types of use cases) and on the whole amounts to marginal gains. It seems fine to be a few years late to catch up, worst case.


The core of your argument is that using LLMs is a skill that takes a significant amount of time to master. I'm not going to argue against that (although I have some doubts) because I think it's ultimately irrelevant. The question isn't "is prompting a skill that you'll need to be an effective software developer in the future" but "what other skills will you need to do so", and regardless of the answer you don't need to start adopting LLMs right away.

Maybe AI gets good enough at writing code that its users' knowledge of computer science and software development becomes irrelevant. In that case, approximately everyone on this site is just screwed. We're all in the business of selling that specialized knowledge, and if it's no longer required then companies aren't going to pay us to operate the AI, they're going to pay PMs, middle managers, executives, etc. But even that won't be particularly workable long term, because all their customers will realize they no longer need to pay the companies for software either. In this world, the price of software goes to zero (and hosting likely gets significantly more commoditized than it is now). Any time you put into learning to use LLMs for software development doesn't help you keep making money selling software, and actually stops you from picking up a new career.

If, on the other hand, CS and software engineering knowledge is still needed, companies will have to keep/restart hiring or training new developers. In terms of experience using AI, it is impossible for anyone to have less experience than these new developers. We will, however, have much more experience and knowledge of the aforementioned non-LLM skills that we're assuming (in this scenario) are still necessary for the job. In this scenario you might be better off if you'd started learning to prompt a bit earlier, but you'll still be fine if you didn't.


Strongly disagree. Claude Code is the most intuitive technology I've ever used-- way easier than learning to use even VS Code for example. It doesn't even take weeks. Maybe a day or two to get the hang of it and you're off to the races.

The difference is AI tooling lies to you. Day 0 you think it's perfect, but the more you use AI tools, the more you realize using them wrong can give you gnarly bugs.

It's intuitive to use but hard to master


It took me a couple of days to find the right level of detail to prompt it. Too high level, and the codebase gets away from me/the tooling goes off the rails. Too low level, and I may as well do it myself. Maybe also learn the sorts of things Claude Code isn't good at yet. But once I got in the groove it was very easy from there. I think the whole process took 2-3 days.

Assuming you used AI before? Then yeah, it's the same.

If you never AI coded before then get ready for fun!


Don't underestimate the number of developers who aren't comfortable with tools that live in the terminal.

I actually don't use it in the terminal, I use the vs code extension. It's a better experience (bringing up the file being edited, nicer diffs, etc.) But both are trivial to pick up.

Well these people are left behind either way. Competent devs can easily learn to use coding assistants in a day or two

Show me what you've made with AI?

What's the impressive thing that can convince me it's equivalent, or better than anything created before, or without it?

I understand you've produced a lot of things, and that your clout (which depends on the AI fervor) is based largely on how refined a workflow you've invented. But I want to see the product, rather than the hype.

Make me say: I wish I was good enough to create this!

Without that, all I can see is the cost, or the negative impact.

edit: I've read some of your other posts, and for my question, I'd like to encourage you to pick only one. Don't use the scatter shot approach that LLMs love, giving plenty of examples, hoping I'll ignore the noise for the single that sounds interesting.

Pick only one. What project have you created that you're truly proud of?

I'll go first, (even though it's unfinished): Verse


There are so many projects named Verse (or similar) that you really need to be more specific.

The point wasn't to level set, or for it to feel like I'm promoting it; only so that I couldn't back out later, or not have any skin in the game. But if you really couldn't figure out which I meant: it's the one on my GitHub, and also hosted on srctree, which I link to from the site in my profile and from my GitHub.

> Using this stuff well is a deep topic. These things can be applied in so many different ways, and to so many different projects. The best asset you can develop is an intuition

You're basically saying that using LLMs is like using magic. Telling people to use intuition is basically saying "I don't know how it works or why, but it works for me sometimes."

That's why we programmers hate it - we have a safe space where there's no intuition - namely programming languages & runtimes with deterministic behavior. And we're shoehorned back into a mess of magic/intuition and wishful thinking.

(yes, I try LLMs, I have some results; I'm mostly frustrated by people AI-slopping _everything_ around me)


Oddly enough I wrote about the magic analogy and why I stopped using it a few years ago (pre-ChatGPT, even): https://simonwillison.net/2022/Oct/5/spell-casting/

I am eternally frustrated that "intuition" is the key skill people need to work effectively with LLMs, because it's something I can't teach people! If I could figure out how to download my intuition into other people's heads I would do that.

Instead I have to convince people that intuition is key, and the only way to get it is to invest in experimenting.


It's like any other power tool. It requires skill to use it safely and efficiently.

Anyone can use a band saw to cut things. Then go look what Jimmy DiResta makes with one and you see the difference.

The chance of an inexperienced person cutting off their finger with a bandsaw is also way over zero; there are things you should not and must not do with it. As with any power tool.


Intuition is the wrong word IMO. Tacit knowledge is the thing. Knowledge that is hard to communicate and needs experience.

Problem with AI is it isn't woodwork. The material keeps changing!


How many thing you learned working with LLMs in 2022 are relevant today? How many things you learned now are relevant in the future?

This question misses the point. Everything you learn today informs how you learn in the future.

I learned Django 15 years after its inception. After 5 years of experience I'm probably not too far behind someone doing the exact same work as me but for 15 years.

Or would you say people shouldn't learn Django now? As it's useless as they're already far behind? They shouldn't study computer science, as it will be too late?

Every profession have new people continuously entering the workforce, that quickly get up to speed on whatever is in vogue.

Honestly, what you've spent years learning and experimenting with, someone else will be able to learn in months. People will figure out the best ways of using these tools after lots of attempts, and that distilled knowledge will be transferred quickly to others. This is surely painful to hear for those having spent years in the trenches, and is perhaps why you refuse to acknowledge it, but I think it's true.


I would not say that about a framework like Django - though I would encourage people not to under-invest in understanding web fundamentals since once you have those Django, Rails, Next.js etc are all quick to pick up.

I would say that about LLMs.

That's why I'm ringing the alarm bells here. LLM skills are not the same as framework or library usage skills. They aren't clearly documented or predictable - they're really weird!

If you assume learning to use coding agents is the same category of challenge as learning to use something like Django you'll get burned by that assumption.


I don't disagree, knowing how to use the tools is important. But I wanted to add that great prompting skills are nowadays far, far less necessary with top-tier models than they were years ago. If I'm clear about what I want and how I want it to behave, Claude Opus 4.5 almost always nails it first time. The "extra" that I do often, that maybe newcomers don't, is to set up a system where the LLM can easily check the results of its changes (verbose logs in the terminal and, on the web, verbose logs in the console plus Playwright).

I think I'm also very good at getting great results out of coding agents and LLMs, and I disagree pretty heavily with you.

It is just way easier for someone to get up to speed today than it was a year ago. Partly because capabilities have gotten better and much of what was learned 6+ months ago no longer needs to be learned. But also partly because there is just much more information out there about how to get good results, you might have coworkers or friends you can talk to who have gotten good results, you can read comments on HN or blog posts from people who have gotten good results, etc.

I mean, ok, I don't think someone can fully catch up in a few weeks. I'll grant that for sure. But I think they can get up to speed much faster than they could have a year ago.

Of course, they will have to put in the effort at that time. And people who have been putting it off may be less likely to ever do that. So I think people will get left behind. But I think the alarm to raise is more, "hey, it's a deep topic and you're going to have to put in the effort" rather than "you better start now or else it's gonna be too late".


Why can't both be true at the same time? Maybe their problems are more complex than yours. Why do you assume it's a skill issue and ignore the contextual variables?

On the rare occasions that I can convince them to share the details of the problems they are tackling and the exact prompts they are using it becomes very clear that they haven't learned how to use the tools yet.

I'm kind of curious about the things you're seeing, since I find the best way is to have them come up with a plan for the work they're about to do and then make sure they actually finish it, because they like to skip stuff if it requires too much effort.

I mean, I just think of them like a dog that'll get distracted and go off doing some other random thing if you don't supervise them enough and you certainly don't want to trust them to guard your sandwich.


So far every new AI product and even model update has required me to relearn how to get decent results out of them. I'm honestly kind of sick of having to adjust my work flow every time.

The intuition just doesn't hold. The LLM gets trained and retrained by other LLM users so what works for me suddenly changes when the LLM models refresh.

LLMs have only gotten easier to learn and catch up on over the years. In fact, most LLM companies seem to optimise for getting started quickly over getting good results consistently. There may come a moment when the foundations solidify and not bothering with LLMs may put you behind the curve, but we're not there yet, and with the literally impossible funding and resources OpenAI is claiming they need, it may never come.


Really? Claude Code upgrades for me have been pretty seamless- basically better quality output, given the same prompts, with no discernible downsides.

I can't buy it, because for many people like you it's always the other person who is using the tools wrong. With "you're not using it well" as the base of the discourse, it is simply impossible for the skeptics who keep getting bad results from LLMs to prove the contrary. I don't even get why you need to praise yourself so much for being really good at using these tools, if not for building some tech influencer status around here... same thing I believe antirez is trying to do (who knows why).

Have you considered that maybe you aren't using it well? It's something that can and should be learned. It's a tool, and you can't expect to get the most out of a tool without really learning how to use it.

I've had this conversation with a few people so far, and I've offered to personally walk through a project of their choosing with them. Everyone who has done this has changed their perspective. You may not be convinced it will change the world, but if you approach it with an open mind and take the time to learn how to best use it, I'm 100% sure you will see that it has so much potential.

There are tons of youtube videos and online tutorials if you really want to learn.


> Have you considered that maybe you aren't using it well?

Here we go, as I said: again and again and again it's always our fault, we're not using it well. It is an argument that's impossible to counter. Btw, to reply to your question: yes, many times, and it proved to be useful in very small specialized tasks and a couple of migrations. I really like how LLMs are helping me in my day to day, but it's still so far away from all this astroturfing.


I don't see how your position is compatible with the constant hype about the ever-growing capabilities of LLMs. Either they are improving rapidly, and your intuition keeps getting less and less valuable, or they aren't improving.

They're improving rapidly, which means your intuition needs to be constantly updated.

Things that they couldn't do six months go might now be things that they can do - and knowing they couldn't do X six months ago is useful because it helps systematize your explorations.

A key skill here is to know what they can do, what they can't do and what the current incantations are that unlock interesting capabilities.

A couple I've learned in the past week:

1. Don't give Claude Code a URL to some code and tell it to use that, because by default it will use its WebFetch tool but that runs an extra summarization layer (as a prompt injection defense) which loses details. Telling it to use curl sometimes works but a guaranteed trick is to have it git clone the relevant repo to /tmp and look at the code there instead.

2. Telling Claude Code "use red/green TDD" is a quick to type shortcut that will cause it to write tests first, run them and watch them fail, then implement the feature and run the test again. This is a wildly effective technique for getting code that works properly while avoiding untested junk code that isn't needed.
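To make tip 2 concrete, here's a rough illustration (mine, not a Claude Code transcript) of the red/green loop it describes; the package, file and function names are made up for the example.

  // slug_test.go - illustrative red/green TDD sequence.
  package slug

  import (
      "regexp"
      "strings"
      "testing"
  )

  // Step 1 (red): the agent writes this test first, runs `go test`, and
  // watches it fail while Slugify is still missing or stubbed out.
  func TestSlugify(t *testing.T) {
      if got := Slugify("Hello, World!"); got != "hello-world" {
          t.Fatalf("Slugify() = %q, want %q", got, "hello-world")
      }
  }

  // Step 2 (green): only then is the implementation written, and the same
  // test is re-run until it passes.
  func Slugify(s string) string {
      s = strings.ToLower(s)
      return strings.Trim(regexp.MustCompile(`[^a-z0-9]+`).ReplaceAllString(s, "-"), "-")
  }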

Now multiply those learnings by three years. Sure, the stuff I figured out in 2023 mostly doesn't apply today - but the skills I developed in learning how to test and iterate on my intuitions from then still count and still keep compounding.

The idea that you don't need to learn these things because they'll get better to the point that they can just perfectly figure out what you need is AGI science fiction. I think it's safe to ignore.


Personally I think this is an extreme waste of time. Every week you're learning something new that is already outdated the next week. You're telling me AI can write complex code but isn't able to figure out how to properly guide the user into writing usable prompts?

A somewhat intelligent junior will dive deep for one week and be at the same knowledge level that took you roughly 3 years to build.


No matter how good AI gets we will never be in a situation where a person with poor communication skills will be able to use it as effectively as someone whose communication skills are razor sharp.

But the examples you've posted have nothing to do with communication skills, they're just hacks to get particular tools to work better for you, and those will change whenever the next model/service decides to do things differently.

Yes and no. Knowing the terminology is a short-cut to make the LLM use the correct part of its "brain".

Like when working with video, if you use "timecode" instead of "timestamp", it'll use the video production part of the vector memory more. Video production people always talk about "timecodes", not "timestamps".

You can also explain the idea of red/green testing the long way without mentioning any of the keywords. It might work, but just knowing you can say "use red/green testing" is a magic shortcut to the correct result.

Thus: working with LLMs is a skill, but also an ever-changing skill.


I'm generally skeptical of Simon's specific line of argument here, but I'm inclined to agree with the point about communication skill.

In particular, the idea of saying something like "use red/green TDD" is an expression of communication skill (and also, of course, awareness of software methodology jargon).


Ehhh, I don't know. "Communication" is for sapients. I'd call that "knowing the right keywords".

And if the hype is right, why would you need to know any of them? I've seen people unironically suggest telling the LLM to "write good code", which seems even easier.


I sympathize with your view on a philosophical level, but the consequence is really a meaningless semantic argument. The point is that prompting the AI with words that you'd actually use when asking a human to perform the task, generally works better than trying to "guess the password" that will magically get optimum performance out of the AI.

Telling an intern to care about code quality might actually cause an intern who hasn't been caring about code quality to care a little bit more. But it isn't going to help the intern understand the intended purpose of the software.


I'm not making a semantic argument, I'm making a practical one.

> prompting the AI with words that you'd actually use when asking a human to perform the task, generally works better

Ok, but why would you assume that would remain true? There's no reason it should.

As AI starts training on code made by AI, you're going to get feedback loops as more and more of the training data is going to be structured alike and the older handwritten code starts going stale.

If you're not writing the code and you don't care about the structure, why would you ever need to learn any of the jargon? You'd just copy and paste prompts out of Github until it works or just say "hey Alexa, make me an app like this other app".


I'm going to resist the temptation to spend more time coming up with more examples. I'm sorry those weren't to your liking!

Why do you bother with all this discussion? Like, I get it the first x times for some low x, it's fun to have the discussion. But after a while, aren't you just tired of the people who keep pushing back? You are right, they are wrong. It's obvious to anyone who has put the effort in.

Trying to have a discussion with people who aren't actually interested in being convinced is exhausting. Simon has a lot more patience than I do.

It's a poorly considered hobby.

It's also useful for figuring out what I think and how best to express that. Sometimes I get really great replies too - I compared ethical LLM objections to veganism today on Lobste.rs and got a superb reply explaining why the comparison doesn't hold: https://lobste.rs/s/cmsfbu/don_t_fall_into_anti_ai_hype#c_oc...


I like debate as much as the next guy (almost). Your patience is either admirable or crazy, I'm not sure which.

Neither am I!

"There is considerable overlap between the intelligence of the smartest bears and the dumbest tourists."

At some point you'll just have to accept the tool isn't for everyone =)


I feel like both of these examples are insights that won't be relevant in a year.

I agree that CC becoming omniscient is science fiction, but the goal of these interfaces is to make LLM-based coding more accessible. Any strategies we adopt to mitigate bad outcomes are destined to become part of the platform, no?

I've been coding with LLMs for maybe 3 years now. Obviously a dev who's experienced with the tools will be more adept than one who's not, but if someone started using CC today, I don't think it would take them anywhere near that time to get to a similar level of competency.


I base part of my skepticism about that on the huge number of people who seem to be unable to get good results out of LLMs for code, and who appear to think that's a commentary on the quality of the LLMs themselves as opposed to their own abilities to use them.

> huge number of people who seem to be unable to get good results out of LLMs for code

Could it be, they use other definition of "good"?


I suspect that's neither a skill issue nor a technical issue.

Being "a person who can code" carries some prestige and signals intelligence. For some, it has become an important part of their identity.

The fact that this can now be said of a machine is a grave insult if you feel that way.

It's quite sad in a way, since the tech really makes your skills even more valuable.


What are your tips? Any resources you would recommend? I use Claude code and all the chat bots, but my background isn't programming, so I sometimes feel like I'm just swimming around.

I guess this applies to the type of developer who needs years, not weeks, to become proficient in say Python?

I've been building AI apps since GPT-3, so 5 years now.

The pro-AI people don't understand what quadratic attention means, and the anti-AI people don't understand how much information can be contained in a TB of weights.

At the end of the day both will be hugely disappointed.

>The best asset you can develop is an intuition for what works and what doesn't, and getting that intuition requires months if not years of personal experimentation.

Intuition does not translate between models. Whatever you thought dense LLMs were good at, DeepSeek completely upended it in an afternoon. The difference between major revisions of model families is substantial enough that intuition is a drawback, not an asset.


What does quadratic attention mean?

I've so far found that intuition travels between models of a similar generation remarkably well. The conformance suite trick (find an existing 9,200-test conformance suite and tell an agent to build a fresh implementation that passes all those tests) I first found with GPT-5.2 turned out to work exactly as well against Claude Opus 4.5, for example.



To save anyone else the click, this is the paper "On The Computational Complexity of Self-Attention" from September 2022, with authors from NYU and Microsoft.

It argues that the self-attention mechanism in transformers works by having every token "attend to" every other token in a sequence, which is quadratic - n^2 in the input length - and should limit the total context length available to models.
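As a rough illustration of where the n^2 comes from (mine, not from the paper): the naive score computation compares every token with every other token, producing an n-by-n matrix before the softmax.

  // Illustrative only: the naive pairwise score step of self-attention.
  // For n tokens of dimension d this is O(n*n*d) work plus an n-by-n
  // matrix of scores - the quadratic cost and memory the paper analyses.
  // (Softmax and the final multiply by V are omitted.)
  package attention

  func attentionScores(q, k [][]float64) [][]float64 {
      n, d := len(q), len(q[0])
      scores := make([][]float64, n)
      for i := 0; i < n; i++ { // every token...
          scores[i] = make([]float64, n)
          for j := 0; j < n; j++ { // ...attends to every other token
              for x := 0; x < d; x++ {
                  scores[i][j] += q[i][x] * k[j][x]
              }
          }
      }
      return scores
  }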

This would explain why the top models have been stuck at 1 million tokens since Gemini 1.5 in February 2024 (there has been a 2 million token Gemini but it's not in wide use, and Meta claimed their Llama 4 Scout could do 10 million but I don't know of anyone who's seen that actually work.)

My counter-argument here is that Claude Opus 4.5 has a comparatively tiny 200,000 token window which turns out to work incredibly well for the kinds of coding problems we're throwing at it, when accompanied by a cleverly designed harness such as Claude Code. So this limit from 2022 has been less "disappointing" than people may have expected.


The quadratic attention problem seems to be largely solved by practical algorithmic improvements. (Iterations on flash attention, etc.)

What's practically limiting context size IME is that results seem to get "muddy" and get off track when you have a giant context size. For a single-topic long session, I imagine you get a large number of places in the context which may be good matches for a given query, leading to ambiguous results.

I'm also not sure how much work is being put into reinforcement in extremely large context inference, as it's presumably quite expensive to do and hard to reliably test.


Indeed, filling the advertised context more than 1/4 full is a bad idea in general. 50k tokens is a fair bit, but works out to between 1 and 10k lines of code.

Perfect for a demo or work on a single self contained file.

Disastrous for a large code base with logic scattered all throughout it.


Right. It’s not practical to apply AI tools as they are today to existing, complex code bases and get reliable results.

Greenfield is easy (but it always was). Working on well-organised modules that are self contained and cleanly designed is easy - but that always was, too.


> Using this stuff well is a deep topic. These things can be applied in so many different ways, and to so many different projects. The best asset you can develop is an intuition for what works and what doesn't, and getting that intuition requires months if not years of personal experimentation.

You feel that way because it took you years or months to reach that point. But after reaching that point, do you really think that it's equally—if not more—difficult to put what you learned into words compared to, let's say, programming or engineering?

See, the thing about these tools is that they're designed to be operated via natural language, which is something most people (with a certain level of education) are quite comparable to each other at; consequently, the skill ceiling is considerably lower compared to something like programming. I am not saying there's no variance in people's ability to articulate, but that the variance is considerably less than what we get when comparing people's ability to write code or solve engineering problems.

So, whatever you learned by trial and error was just different ways or methods to get around the imperfections of the existing LLMs—not ways to use them skillfully according to their design goals. Their design goal is to achieve whatever task is given to them, as long as the intent is clear. These workarounds and tricks that you learned aren't something you build an intuition for. What you build an intuition for is finding new workarounds, but once you've found them, they're quite concrete and easy to describe to someone else who can simply use them to achieve the same results as you.

Tools that are designed to be operable via natural language aren't designed to be more thorough—it's actually the opposite. If you want more control, you have programming languages and search engines; thoroughness is where you get that high skill ceiling. The skill ceiling for using these tools is going to get lower and lower. The workarounds that you figure out may take skill to discover, but they don't take much skill to replicate.

If you share your "tips and tricks" with someone, then yeah, it will take them a week to start getting the same results as you because the skill ceiling is low and the workarounds are concrete/require less thinking.


The more I see of how different people use LLMs the more convinced I am that communication skills differ wildly between different people.

Clear, unambiguous communication is a key skill to unlock LLMs. I suspect it's a lot less common than you think!


> I don't think you can just catch up in a few weeks, and I do think that the risk of falling behind isn't being taken seriously enough by much of the developer population.

This is nonsense.

This field moves so fast the things you did more than a year ago aren't relevant anymore.

Claude code came out last year.

Anyone using random shit from before that is not using it any more. It is completely obsolete in all but a handful of cases.

To make matters worse “intuition” about models is wasted learning, because they change, significantly, often.

Stop spreading FUD.

You can be significantly less harmful to people who are trying to learn by sharing what you actually do instead of nebulously hand waving about magical BS.

Dear readers: ignore this irritating post.

Go and watch Armin Ronacher on YouTube if you want to see what a real developer doing this looks like, and why it's hard.


You're accusing me of spreading harmful advice here, when you're the one telling people that they don't need to worry about not investing in their skills because "This field moves so fast the things you did more than a year ago aren't relevant anymore."

One of us is right here. I hope for your sake and the people that listen to you that it's you. I don't think it is.


You're right, it's difficult to get "left behind" when the tools and workflows are being constantly reinvented.

You'd be sage with your time just to keep a high-level view until workflows become stable and aren't advancing every few months.

The time to consider mastering a workflow is when a casual user of the "next release" wouldn't trivially supersede your capabilities.

Similarly we're still in the race to produce a "good enough" GenAI, so there isn't value in mastering anything right now unless you've already got a commercial need for it.

This all reminds me of a time when people were putting in serious effort to learn Palm Pilot's Graffiti handwriting recognition, only for the skill to be made redundant even before they were proficient at it.


I think that whoever says you need to be accustomed to the current "tools" related to AI agents is suffering from a horizon effect issue: this stuff will change continuously for some time, and the more it evolves, the less you need to fiddle with the details. However, the skill you need to have is communication. You need to be able to express yourself, and what matters for your project, fast and well. Many programmers are not great at communication. In part this is a gift, something you develop at a small age, and this will, I believe, kinda change who is good at programming: good communicators / explorers may now have an edge VS very strong coders that are bad at explaining themselves. But a lot of it is attitude, IMHO. And practice.

> Many programmers are not great at communication.

This is true, but still shocking. Professional (working with others at least) developers basically live or die by their ability to communicate. If you're bad at communication, your entire team (and yourself) suffer, yet it seems like the "lone ranger" type of programmer is still somewhat praised and idealized. When trying to help some programmer friends with how they use LLMs, it becomes really clear how little they actually can communicate, and for some of them I'm slightly surprised they've been able to work with others at all.

An example from the other day: a friend complained that the LLM they worked with was using the wrong library, and using the wrong color for some element, and was surprised that the LLM wouldn't know this from the get-go. Reading through the prompt, they never mentioned it once, and when asked about it, they thought "it should have been obvious". Which, yeah, to someone like you who has worked on this project for 2 years it might be obvious, but for something with zero history and zero context about what you do? How do you expect it to know this? Baffling sometimes.


People anthropomorphise LLMs, not understanding that they don't have "implied context" about things. They just go by the statistical average unless directed otherwise.

Having worked with offshore consultant teams where there are language and cultural barriers - and needing clear specs myself. I somehow just naturally "got" how much context to give the Agent.

People who have been working solo or with like-minded people all their career might have a harder time.


Yup. I'd wager that most complaints by people who have even used LLMs for a long time can be resolved by "describe your thing in detail". LLMs are such a relief on my wrists that I often get tempted to write short prompts and pray that the LLM divines my thoughts. I always get much better results, a lot faster, when I just turn on the mic and have Whisper transcribe a couple of minutes of my speaking though.

I am using Google Antigravity for the same type of work you mention: many things and ideas I had over the years but couldn't justify the time I would have needed to invest in them. Pretty non-trivial ideas, and yet with a good problem definition and communication skills I am getting unbelievable results. I am even intentionally being too vague sometimes in my problem definition, to avoid introducing bias to the model, and the ride has been quite crazy so far. In 2 days I've implemented several substantial improvements that I had in my head for years.

The world has changed for good and we will need to adapt. The bigger and more important question at this point isn't whether LLMs are good enough anymore (for the ones who want to see), but, as you mention in your article, what will happen to the people who end up unemployed. There's a reality check for all of us.


My take: learning how to do LLM-assisted coding at a basic level gets you 80% of the returns, and takes about 30 minutes. It's a complete no-brainer.

Learning all of the advanced multi-agent worklows etc. etc... Maybe that gets you an extra 20%, but it costs a lot more time, and is more likely to change over time anyway. So maybe not very good ROI.


1. Basic vanilla LLM Agentic coding

2. Build tools for the LLM, ones that are easy to use and don't spam stuff. Like give it tools to run tests that only return "Tests OK" if nothing failed, same with builds (see the sketch below for one way this might look)

3. Look into /commands and Skills, both seem to be here to stay

Maybe a weekend of messing about and you'll be pretty well off compared to the vast masses who still copy/paste code out of ChatGPT to their editor.
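For item 2 in the list above, a minimal sketch of such a wrapper might look like this (assuming a Go project and `go test`; the file name is hypothetical):

  // testquiet.go - hypothetical "quiet test runner" tool for the agent:
  // stays silent on success, only spends context on output when it matters.
  package main

  import (
      "fmt"
      "os"
      "os/exec"
  )

  func main() {
      // Run the whole suite and capture output instead of streaming it.
      out, err := exec.Command("go", "test", "./...").CombinedOutput()
      if err == nil {
          // Success: one short line instead of pages of logs.
          fmt.Println("Tests OK")
          return
      }
      // Failure: now the full output is worth showing to the LLM.
      os.Stdout.Write(out)
      os.Exit(1)
  }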


It seems like you're mostly focused on the tooling for actually directing the LLM but there's a whole host of other technology which becomes relevant re: building guardrails and handcuffs for your agent. For instance I've been doing a lot of contract testing lately. It's not new tech, not changing at a blistering pace, but now that generating mountains of code is cheap, techniques for dealing with those mountains are suddenly more necessary.

I thought this way for a while. I still do to a certain degree, but I'm starting to see the wisdom in hurrying off into the change.

The most advanced tooling today looks nothing like the tooling for writing software 3 years ago. We've got multi-agent orchestration with built in task and issue tracking, context management, and subagents now. There's a steep learning curve!

I'm not saying that everyone has to do it, as the tools are so nascent, but I think it is worthwhile to at least start understanding what the state of the art will look like in 12-24 months.


It took me a few months of working with the agents to get really productive with it. The gains are significant. I write highly detailed specs (equivalent to multiple A4 pages) in markdown and dictate the agent hierarchy (which agent does what, who reports to whom).

I've learned a lot of new things this year thanks to AI. It's true that the low-level skills will atrophy. The high-level skills will grow though; my learning rate is the same, just at a much higher abstraction level, thus covering more subjects.

The main concern is the centralisation. The value I can get out of this thing currently well exceeds my income. AI companies are buying up all the chips. I worry we'll get something like the housing market, where AI will take about 50% of our income.

We have to fight this centralisation at all costs!


This is something I think a lot of people don't seem to notice, or worry about: the move of programming from a local task to one that is controlled by big corporations, essentially turning programming into a subscription model just like everything else. If you don't pay the subscription you will no longer be able to code, i.e. PaaS (Programming as a Service). Obviously at the moment most programmers can still code without LLMs, but when autocomplete IDEs became mainstream, it didn't take long before a large proportion of programmers couldn't program without an autocomplete IDE. I expect most new programmers coming in won't be able to "program" without a remote LLM.

That ignores the possibility that local inference gets good enough to run without a subscription on reasonably priced hardware.

I don't think that's too far away. Anthropic, OpenAI, etc. are pushing the idea that you need a subscription, but if open source tools get good enough they could easily become an expensive irrelevance.


There is that, but the way this usually works is that there is always a better closed service you have to pay for, and we see that with LLMs as well. Plus there is the fact that you currently need a very powerful machine to run these models at anywhere near the speed of the PaaS systems, and I'm not convinced we'll be able to make the Moore's-law-style jumps required to get that level of performance locally, not to mention the massive energy requirements. You can only go so small, and we are getting pretty close to the limit. Perhaps I'm wrong, but we don't see the jumps in processing power we used to see in the 80s and 90s due to clock speed increases; the clock speed of most CPUs has stayed pretty much the same for a long time. As LLMs are essentially probabilistic in nature, this does open up options not available to current deterministic CPU designs, so that might be an avenue which gets exploited to bring this to local development.

> there is always a better closed service you have to pay for

Always? I think that only holds for a certain amount of time (different for each sector) after which the open stuff is better.

I thought it was only true for dev tools, but I had to rethink it when I met a guy (not especially technical) who runs open source firmware on his insulin pump because the closed source stuff doesn't give him as much control.


Local inference is already very good on open models if you have the hardware for it.

Yep I agree, I think people haven’t woken up to that yet. Moore’s Law is only going to make that easier.

I’m surprised by how good the models I can run on my old M1 Max laptop are.

In a year’s time open models on something like a Mac Studio M5 Ultra are going to be very impressive compared to the closed models available today.

They won’t be state of the art for their time but they will be good enough and you’ll have full control.


My concern is that inference hardware is becoming more and more specialized and datacenter-only. It won’t be possible any longer to just throw in a beefy GPU (in fact we’re already past that point).

Yep, good point. If they don't make the hardware available for personal use, then we wouldn't be able to buy it even it could be used in a personal system.

> on reasonably priced hardware.

Thank goodness this isn't a problem!


This is the most valid criticism. Theoretically in several years we may be able to run Opus quality coding models locally. If that doesn't happen then yes, it becomes a pay to play profession - which is not great.

I have found that using more REPLs and doing leetcodes/katas prevents the atrophy to be honest.

In fact, I'd say I code even better since I started doing one hour per day of a mixture of fun coding and algo quizzes while at work I mostly focus on writing a requirements plan and implementation plan later and then letting the AI cook while I review all the output multiple times from multiple angles.


The hardware needs to catch up I think. I asked ChatGPT (lol) how much it would cost to build a DeepSeek server that runs at a reasonable speed and it quoted ~400k-800k (8-16 H100s + the rest of the server).

Guess we are still in the 1970s era of AI computing. We need to hope for a few more step changes or some breakthrough on model size.


The problem is that Moore's law is dead, silicon isn't advancing as fast as we envisioned in the past, we're experiencing all sorts of quantum tunneling effects as we cram as much microstructure as possible into silicon, and R&D costs for manufacturing these chips are climbing at a rapid rate. There's a limit to how far we can fight against physics, and unless we discover a totally new paradigm to alleviate these issues (e.g. optical computing?) we're going to experience diminishing returns at the end of the sigmoid-like tech advancement cycle.

You can run most open models (excluding kimi-k2) on hardware that costs anywhere from 45 - 85k (tbf, specced before the vram wars of late 2025 so +10k maybe?). 4-8 PRO6000s + all the other bits and pieces gives you a machine that you can host locally and run very capable models, at several quants (glm4.7, minimax2.1, devstral, dsv3, gpt-oss-120b, qwens, etc.), with enough speed and parallel sessions for a small team (of agents or humans).

[flagged]


Well, if you're programming without AI you need to understand what you're building too, lest you program yourself into a corner. Taking 3-5 minutes to speech-to-text an overview of what exactly you want to build, why, and using which general philosophies/tools seems like it should cost you almost zero extra time and brainpower.

… speech to text?

Why not just type?


I almost exclusively work from home, the LLM forgives sloppy STT, and my wrists suffer continuous low-grade pain when typing...

Early adopters get the advantage of only having to learn a trickle of new things every few weeks instead of everything all at once.

Part of the problem with things that iterate quickly is that iterations tend to reference previous versions. So, you try learning the new hotness (v261), but there are implied references to v254, v239, and v198. Then you realize, v1, v5, v48, v87, v138, v192, and v230 have cute identifiers that you aren't familiar with and are never explained anywhere. New concepts get introduced in v25, v50, v102, and v156 that later became foundational knowledge that is assumed to be understood by the reader and is never explained anywhere.

So, if you feel confident something will be the next hotness, it's usually best to be an early adopter, so you gain your knowledge slowly over years instead of having to cram when you need to pick it up.


AI development is about planning, orchestration and high throughput validation. Those skills won't go away, the quality floor of model output will just rise over time.

By their promises it should get so good that basically you do not need to learn it. So it is reasonable to wait until that point.

If you listen to promises like that you're going to get burned.

One of the key skills needed in working with LLMs is learning to ignore the hype and marketing and figure out what these things are actually capable of, as opposed to LinkedIn bluster and claims from CEOs whose net worth is tied to investor sentiment in their companies.

If someone spends more time talking about "AGI" than what they're actually building, filter that person out.


>One of the key skills needed in working with LLMs is learning to ignore the hype and marketing and figure out what these things are actually capable of

This is precisely what led me to realize that while they have some use for code review and analyzing docs, for coding purposes they are fairly useless.

The hypesters' responses to this assertion fall exclusively into 5 categories. I've never heard a 6th.


Do you always believe what the marketing people tell you?

If so, I've got a JPEG of a monkey to sell you =)


this is a straw man, nobody serious is promising that. it is a skill like any other that requires learning

I agree about skills actually, but it's also obvious that parent is making a very real point that you cannot just dismiss. For several years now and far short of wild AGI promises, the answer to literally every issue with casual or production AI has been something like "but the rate of model improvement.." or "but the tools and ecosystem will evolve.."

If you believe that uncritically about everything else, then you have to answer why agentic workflows or MCP or whatever is the one thing that it can't evolve to do for us. There's a logical contradiction here where you really can't have it both ways.


I’m not understanding your point… (and would be genuinely curious to)? the models and systems around them have evolved and gotten better (over the past few years for LLMs and decades for “AI” more broadly)

oh I think I do get your point now after a few rereads (correct if wrong but you’re saying it should keep getting better until there’s nothing for us to do). “AI”, and computer systems more broadly, are not and cannot be viable systems. they don’t have agency (ironically) to affect change in their environment (without humans in the loop). computer systems don’t exist/survive without people. all the human concerns around what/why remain, AI is just another tool in a long line of computer systems that make our lives easier/more efficient


AI Engineer to Software Engineer: Humans writing code is a waste of time, you can only hope to add value by designing agentic workflows

Prompt Engineer to AI Engineer: Designing agentic workflows is a waste of time, just pre/postfix whatever input you'd normally give to the agentic system with the request to "build or simulate an appropriate agentic workflow for this problem"


Nobody serious, like every single AI CEO out there? I mean I agree, nobody should be taking them seriously, yet we're fast on track for a global financial meltdown because of these fraudsters and their "non-serious" words.

> nobody serious is promising that

There is a staggering number of unserious folks in the ears of people with corporate purchasing power.


OpenAI is going to get to AGI. And AGI should, in minutes, build a system that takes vague input and produces a fully functioning product out of it. Isn't the singularity being promised by them?

you’re just repeating the straw man. if you can’t think critically and just regurgitate every dumb thing you hear idk what to tell you. nobody serious thinks a “singularity” is coming. there’s not even a proper definition of “AGI”

your argument amounts to “some people said stupid shit one time and I took it seriously”


The idea, I think, is to gain experience with the loop of communicating ideas in natural language rather than code, and then reading the generated code and taking it as feedback.

It's not that different overall, I suppose, from the loop of thinking of an idea and then implementing it and running tests; but potentially very disorienting for some.


What would be the type of work you're doing where you wouldn't benefit from one or multiple of the following:

- find information about APIs without needing to open a browser

- writing a plan for your business-logic changes or having it reviewed

- getting a review of your code to find edge cases, potential security issues, potential improvements

- finding information and connecting the dots of where, what and why it works in some way in your code base?

Even without letting AI author a single line of code (where it can still be super useful) there are still major uses for AI.


I've used both cursor and claude code daily[0] since within a month of their releases - I'm learning something new about how to work with and apply the tools almost every day.

I don't think it's a coincidence that some of the best developers[1] are using these tools, and some are openly advocating for them, because it still requires core skills to get the most out of them

I can honestly say that building end-to-end products with claude code has made me a better developer, product designer, tester, code reviewer, systems architect, project manager, sysadmin etc. I've learned more in the past ~year than I ever have in my career.

[0] abandoned cursor late last year

[1] see Linus using antigravity, antirez in OP, Jared at bun, Charlie at uv/ruff, mitsuhiko, simonw et al


I started heavy usage in April 2025 (Codex CLI -> some Claude Code and trying other CLIs + a bit of Cursor -> Warp.dev -> Claude Code) and I’m still learning as well (and constantly trying to get more efficient)

(I had been using GitHub Copilot for 5+ years already, started as an early beta tester, but I don’t really consider that the same)

I like to say it’s like learning a programming language. it takes time, but you start pattern matching and knowing what works. it took me multiple attempts and a good amount of time to learn Rust, learning effective use of these tools is similar

I’ve also learned a ton across domains I otherwise wouldn’t have touched


> What I don't understand about this whole "get on board the AI train or get left behind" narrative, what advantage does an early adopter have for AI tools?

Replace that with anything and you will notice that people building startups in this area want to push a narrative like that, as it usually greatly increases the value of their companies. When the narrative gets big enough, big companies must follow - or they look like they're "lagging behind" - whether the current thing brings value or not. It is a fire that keeps feeding itself. In the end, when it gets big enough, we call it a bubble. A bubble that may pop. Or not.

Whether the end user gets actual value or not is just a side effect. But everyone wants to believe that it brings value - otherwise they were foolish to jump on the train.


An ecosystem is being built around AI: best prompting practices, MCPs, skills, IDE integration, how to build a feedback loop so that the LLM can test its output alone, plugging into the outside world with browser extensions, etc...

For now i think people can still catch up quickly, but at the end of 2026 it's probably going to be a different story.


Okay, end of 2026 then what? No one ever learns how to use the tools after that? No one gets a job until the pre-2026 generation dies?

For now i think people can still catch up quickly, but at the end of 2027 it's probably going to be a different story.

I heard 2028 is when it really gets impossible to catch up.


> probably going to be a different story

Can you elaborate? Skill in AI use will be a differentiator?


Yes.

At some point you will need to combine multiple skills together:

- communication

- engineering skills (understanding requirements, finding edge cases, etc)

- architectural proficiency

- prompting

- agentic workflows and skills

- context management

- and yes, proper old fashioned coding skills to keep things tidy and consistent


> Best prompting practices, mcps, skills, IDE integration, how to build a feedback loop so that LLM can test its output alone, plug to the outside world with browser extensions, etc...

Ah yes, an ecosystem that is fundamentally built on probabilistic quicksand, where even with the "best prompting practices" you still get agents violating the basics of security and committing API keys when they were told not to. [0]

[0] https://xcancel.com/valigo/status/2009764793251664279


One of the skills needed to effectively use AI for code is to know that telling AI "don't commit secrets" is not a reliable strategy.

Design your secrets to include a common prefix, then use deterministic scanning tools like git hooks to prevent them from being checked in.

Or have a git hook that knows which environment variables have secrets in and checks for those.
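A minimal sketch of what that could look like, assuming a made-up "acme_sk_" prefix for your secrets (the prefix, regex and hook wiring are illustrative, not any particular product's convention). Saved as .git/hooks/pre-commit and made executable, a non-zero exit blocks the commit regardless of what the agent was told:

    #!/usr/bin/env python3
    # Illustrative pre-commit hook: refuse any staged change that adds a
    # string matching the assumed "acme_sk_" secret prefix.
    import re
    import subprocess
    import sys

    SECRET_PATTERN = re.compile(r"acme_sk_[A-Za-z0-9]{16,}")  # assumed prefix/format

    # Diff of what is about to be committed (staged changes only).
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout

    added = [l[1:] for l in diff.splitlines()
             if l.startswith("+") and not l.startswith("+++")]
    hits = [l for l in added if SECRET_PATTERN.search(l)]

    if hits:
        print("Refusing to commit: staged changes appear to contain secrets:")
        for line in hits:
            print("  " + line.strip()[:80])
        sys.exit(1)  # non-zero exit aborts the commit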


That's such an incredibly basic concept, surely AIs have evolved to the point where you don't need to explicitly state those requirements anywhere?

They can still make mistakes.

For example, what if your code (that the LLM hasn't reviewed yet) has a dumb feature in where it dumps environment variables to log output, and the LLM runs "./server --log debug-issue-144.log" and commits that log file as part of a larger piece of work you ask it to perform.

If you don't want a bad thing to happen, adding a deterministic check that prevents the bad thing from happening is a better strategy than prompting models or hoping that they'll get "smarter" in the future.
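For the log-dump scenario, a deterministic version of that check might look roughly like this (a sketch: the env var names are examples, and for simplicity it scans the working-tree copy of each staged file rather than the staged blob):

    #!/usr/bin/env python3
    # Illustrative pre-commit check: reject staged files whose contents include
    # the literal value of any environment variable we treat as secret.
    import os
    import subprocess
    import sys

    SECRET_ENV_VARS = ["OPENAI_API_KEY", "AWS_SECRET_ACCESS_KEY", "DB_PASSWORD"]  # example names
    secret_values = [v for name in SECRET_ENV_VARS if (v := os.environ.get(name))]

    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    for path in staged:
        try:
            content = open(path, errors="ignore").read()
        except OSError:
            continue
        if any(value in content for value in secret_values):
            print(f"Refusing to commit {path}: it contains the value of a secret env var")
            sys.exit(1)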


Part of why these things feel "not fit for purpose" is that they don't include the things Simon has spent three years learning? (I know someone else who's doing multi-LLM development where he uses job-specialty descriptions for each "team member" that lets them spend context on different aspects of the problem; it's a fascinating exercise to watch, but it feels even more like "if this is how the tools should be used, why don't they just work that way"?)

Doesn't seem to work for humans all the time either.

Some of this negativity I think is due to unrealistic expectations of perfection.

Use the same guardrails you should be using already for human generated code and you should be fine.


I have tons of examples of AI not committing secrets. this is one screenshot from twitter? I don’t think it makes your point

CPUs are billions of transistors. sometimes one fails and things still work. “probabilistic quicksand” isn’t the dig you think it is to people who know how this stuff works


I have tons of examples of drivers not running into objects.

like my other comment, my point is one screenshot from twitter vs one anecdote. neither proves anything. cool snarky response though!

> I have tons of examples of AI not committing secrets.

"Trust only me bro".

It takes 10 seconds to see the many examples of API keys + prompts on GitHub to verify that tweet. The issue with AI isn't limited to that tweet, which demonstrates its probabilistic nature; otherwise why do we need a sandbox to run the agent in the first place?

Nevermind, we know why: Many [0] such [1] cases [2]

> CPUs are billions of transistors. sometimes one fails and things still work. “probabilistic quicksand” isn’t the dig you think it is to people who know how this stuff works

Except you just made a false equivalence. CPUs can be tested / verified transparently, and even if something does go wrong, we know exactly why. Whereas you can't explain why the LLM hallucinated or decided to delete your home folder, because the way it predicts what it outputs is fundamentally stochastic.

[0] https://old.reddit.com/r/ClaudeAI/comments/1pgxckk/claude_cl...

[1] https://old.reddit.com/r/ClaudeAI/comments/1jfidvb/claude_tr...

[2] https://www.google.com/search?q=ai+deleted+files+site%3Anews...


you could find tons of API keys on GitHub before these “agentic” tools too. that was my point, one screenshot from twitter vs one anecdote from me. I don’t think either proves the point, but posting a screenshot from twitter like it’s proof of some widespread problem is what I was responding to (N=2, 1 vs 1)

my point is more “skill issue” than “trust me this never happens”

my point on CPUs is people who don’t understand LLMs talk like “hallucinations” are a real thing — LLMs are “deciding” to make stuff up rather than just predicting the next token. yes it’s probabilistic, so is practically everything else at scale. yet it works and here we are. can you really explain in detail how everything you use works? I’m guessing I can explain failure modes of agentic systems (and how to avoid them so you don’t look silly on twitter/github) and how neural networks work better than most people can explain the technology they use every day


> you could find tons of API keys on GitHub before these “agentic” tools too. that was my point, one screenshot from twitter vs one anecdote from me. I don’t think either proves the point, but posting a screenshot from twitter like it’s proof of some widespread problem is what I was responding to (N=2, 1 vs 1)

That doesn't refute the probabilistic nature of LLMs despite best prompting practices. In fact it emphasises it. More like your 1 anecdotal example vs my 20+ examples on GitHub.

My point is that not only does it indeed happen, but an old issue is now made even worse and more widespread, since we now have vibe-coders without security best practices assuming the agent should know better (when it doesn't).

> my point is more “skill issue” than “trust me this never happens”

So those that have this "skill issue" are also those who are prompting the AI differently then? Either way, this just inadvertently proves my whole point.

> yes it’s probabilistic, so is practically everything else at scale. yet it works and here we are.

The additional problem is: can you explain why it went wrong as you scale the technology? CPU circuit designs go through formal verification, and if a fault happens, we know exactly why; they are deterministic by design, which makes them reliable.

LLMs are not and don't have this. Which is why OpenAI had to describe ChatGPT's misaligned behaviour as "sycophancy", but could not explain why it happened other than tweaking the hyper-parameters which got them that result.

So LLMs being fundamentally probabilistic, and hence harder to explain, is the reason why you have the screenshot of vibe-coders who somehow prompted it wrong and the agent committed the keys.

Maybe that would never have happened to you, but it won't be the last time we see more of this happening on GitHub.


I was pointing out one screenshot from twitter isn’t proof of anything just to be clear; it’s a silly way to make a point.

yes AI makes leaking keys on GH more prevalent, but so what? it’s the same problem as before with roughly the same solution

I’m saying neural networks being probabilistic doesn’t matter — everything is probabilistic. you can still practically use the tools to great effect, just like we use everything else that has underlying probabilities

OpenAI did not have to describe it as sycophancy, they chose to, and I’d contend it was a stupid choice

and yes, you can explain what went wrong just like you can with CPUs. we don’t (usually) talk about quantum-level physics when discussing CPUs; talking about neurons in LLMs is the wrong level of abstraction


> I was pointing out one screenshot from twitter isn’t proof of anything just to be clear; it’s a silly way to make a point.

Versus your anecdote being proof of what? Skill issue for vibe coders? Someone else prompting it wrong?

You do realize you are proving my entire point?

> yes AI makes leaking keys on GH more prevalent, but so what? it’s the same problem as before with roughly the same solution

Again, it reinforces my point that the existing issue has been made even worse. Additionally, that wasn't even the only point I made on the subject.

> I’m saying neural networks being probabilistic doesn’t matter — everything is probabilistic.

When you scale neural networks to become say, production-grade LLMs, then it does matter. Just like it does matter for CPUs to be reliable when you scale them in production-grade data centers.

But your earlier (fallacious) comparison ignores the reliability differences between CPUs and LLMs. Determinism is a hard requirement for that kind of reliability, and LLMs don't have it.

> OpenAI did not have to describe it as sycophancy, they chose to, and I’d contend it was a stupid choice

For the press, they had to, but no-one knows the real reason, because it is unexplainable; going back to my other point on reliability.

> and yes, you can explain what went wrong just like you can with CPUs. we don’t (usually) talk about quantum-level physics when discussing CPUs; talking about neurons in LLMs is the wrong level of abstraction

It is indeed the wrong level of abstraction for LLMs, because not even the researchers can practically explain why a given neuron (let alone every neuron in the network) ends up with different values on every fine-tune or training run. Even if the model is "good enough", it can still go wrong at inference time for other unexplainable reasons beyond "it overfitted".

CPUs on the other hand, have formal verification methods which verify that the CPU conforms to its specification and we can trust that it works as intended and can diagnose the problem accurately without going into atomic-level details.


…what is your point exactly (and concisely)? I’m saying it doesn’t matter it’s probabilistic, everything is, the tech is still useful

No one is arguing that it isn't useful. The problem is this:

> I’m saying it doesn’t matter it’s probabilistic, everything is,

Maybe it doesn't matter for you, but it generally does matter.

The risk level of a technology failing is far higher if it is more random and unexplainable than if it is expected, verified and explainable. The former eliminates many serious use-cases.

This is why your CPU, or GPU works.

LLMs are not deterministic, no formal verification exists for them, and they are fundamentally black boxes.

That is why so many vibe-coders have reported "AI deleted my entire home folder" issues even when they only told it to move a file / folder to another location.

If it did not matter, why do you need sandboxes for the agents in the first place?


I think we agree then? the tech is useful; you need systems around them (like sandboxes and commit hooks that prevent leaking secrets) to use them effectively (along with learned skills)

very little software (or hardware) used in production is formally verified. tons of non-deterministic software (including neural networks) are operating in production just fine, including in heavily regulated sectors (banking, health care)


> I think we agree then? the tech is useful; you need systems around them (like sandboxes and commit hooks that prevent leaking secrets) to use them effectively (along with learned skills)

No.

> very little software (or hardware) used in production is formally verified. tons of non-deterministic software (including neural networks) are operating in production just fine, including in heavily regulated sectors (banking, health care)

It's what happens when it all goes wrong.

You have to explain exactly why a system failed in heavily regulated sectors.

Saying 'everything is probabilistic' as the cause of an issue is a non-answer if you are a chip designer, air traffic controller, investment banker or medical doctor.

So your point does not follow.


that’s not what I said. you honestly seem like you just want to argue about stuff (e.g. not elaborating on the “no” when I basically repeated and agreed with what you said). and you seem to consistently miss my point (in the second part of your response; I’m saying these non-deterministic neural networks are already widespread in industry with these regulations, and it’s fine. they can be explained despite your repeated assertions they cannot be. also the entire point on CPUs which you may have noticed I dropped from my responses because you seemed distracted arguing about it). this is not productive and we’re both clearly stubborn, glhf

> that’s not what I said. you honestly seem like you just want to argue about stuff (e.g. not elaborating on the “no” when I basically repeated and agreed with what you said). and you seem to consistently miss my point

I have repeated myself many times, and you continue to ignore the reliability points that inherently impede LLMs in many use-cases and exclude them from areas where predictability in critical production systems is required.

Vibe coders can use them, but the gulf between useful-for-prototyping and useful-for-production is riddled with hard obstacles, since software like LLMs is fundamentally unpredictable and hence the risks are far greater.

> I’m saying these non-deterministic neural networks are already widespread in industry with these regulations, and it’s fine.

So when a neural network scales to hundreds of layers and billions of parameters, equivalent to a production-grade LLM, explain exactly how a black box on that scale is explainable when it messes up and goes wrong.

> they can be explained despite your repeated assertions they cannot be.

With what methods exactly?

Early on, I pointed to formal verification and testing on CPUs as the way we explain what went wrong at scale. It is you who provided nothing equivalent for LLMs beyond asserting "they can be explained", without any evidence.

> also the entire point on CPUs which you may have noticed I dropped from my responses because you seemed distracted arguing about it). this is not productive and we’re both clearly stubborn, glhf

You did not make any point with that, as it was a false equivalence, and I explained why the reliability of a CPU isn't the same as the reliability of an LLM.


You don't want a bit of influence over the design?

> What I don't understand about this whole "get on board the AI train or get left behind" narrative, what advantage does an early adopter have for AI tools?

The ones pushing this narrative have one or more of the following:

* Investments in AI companies (which they will never disclose until they IPO / get acquired)

* Jobs at AI companies with stock options, making them effectively paid boosters of the AGI nonsense.

* A mid-life crisis / paranoia that their identity as a programmer is being eroded, so they have to pivot to AI.

It is no different to the crypto web3 bubble of 2021. This time, it is even more obvious and now the grifters from crypto / tech are already "pivoting to ai". [0]

[0] https://pivot-to-ai.com/


I'm not an AI booster, but I can't argue with Opus doing lots of legwork

> It is no different to the crypto web3 bubble of 2021

web3 didn't produce anything useful, just noise. I couldn't take a web3 stack to make an arbitrary app. with the PISS machine I can.

Do I worry about the future, fuck yeah I do. I think I'm up shit creek. I am lucky that I am good at describing in plain English what I want.


Web3 generated plenty of use if you're in on it. Pension funds, private investors, public companies, governments, gambling addicts, teenagers with more pocket money than sense, they've all moved billions into the pockets of Web3 grifters. You follow a tutorial on YouTube, spam the right places, maybe buy a few illegal ads, do a quick rugpull, and if you did your opsec right, you're now a millionaire. The major money sources have started to dry up (although the current American regime has been paid off by crypto companies so a Web3 revival might just happen).

With AI companies still selling services far below cost, it's only a matter of time before the money runs out and the true value of these tools will be tested.


> Pension funds, private investors, public companies

As someone who was at a large company that was dabbling in NFTs, there was no value apart from pure gambling. At the time that we were doing it, it was also too late, so it was just a ginormous

My issue with GenAI is the rampant copyright violation, and the effect it will have on the economy. It's also replacing all of the fun bits of the world that I inhabit.

At least with web3 it was mostly contained within the BO-infested basements that crypto bros inhabit. AI bollocks has infected half the world.


Comparing the crypto and web3 scams with AI advancements is disingenuous at best. I am a long-time C and C++ systems programming engineer oriented toward (sometimes novel) algorithmic design and high-performance large-scale systems operating at the scale of the internet. I specialize in low-level details that generally a very small number of engineers around the globe are familiar with. We can talk at the level of CPU microarchitectural details or memory bank conflicts or OS internals, and all the way up to the line of code we are writing. AI is the most transformative technology ever designed. I'd go so far as to say that not even the industrial revolution is going to be comparable to it. I have no stakes in AI.

The question that I haven't seen answered yet is whether or not we will reach a sort of "peak vibe coding" phase. What I mean by that is, right now, LLMs are somewhat decent at writing workable code. That code, however, needs babysitting to keep it from going off the rails. And that code is sourced from training, which has been gleaned from the billions of lines of code written by hackers everywhere and pushed to source control.

We currently have engineers competent enough to use an LLM, review the code written, and fix the places where the LLM writes poor code. We also still have engineers pushing novel code themselves. That means we are on the up-slope. Right now, nascent hackers are perhaps learning the old ways, but they are also for sure paying attention to and using vibe coding. That creates a negative feedback loop. As greybeards age out of programming, so too does the knowledge foundation that allowed LLM training to take place in the first place and, more importantly, that trained the next generation of hackers. AI is going to increasingly begin consuming AI code, and I haven't seen solid evidence yet that it is capable (at least currently) of putting truly novel ideas into code.

There will be an inflection point where AI's are consuming their own output more than that from competent hackers, and that's when things will go downhill unless there is a significant breakthrough in actual reasoning in AI.


This has been my suspicion since LLMs began eating the Internet. Whether it's code or writing, now that LLMs are consuming their own output, the Habsburg Jaw[1] is going to quickly become evident. It is very difficult--sometimes impossible--to know whether a given chunk of input is wholly or partially generated by an LLM. Nevertheless, filtering input may become a critical task. That expense will be passed to the consumer, and LLM prices will necessarily rise as their quality diminishes. It could become a death spiral.

If so, I, for one, will be relieved. I'm tired of LLMs trying to take over the enjoyable parts of writing and coding, and leaving the menial tasks to us humans.

[1] https://www.smithsonianmag.com/smart-news/distinctive-habsbu...


The “anti-AI hype” phrase oversimplifies what’s playing out at the moment. On the tech side, while things are still a bit rough around the edges, the tech is very useful and isn’t going away. I honestly don’t see much disagreement there.

The concern mostly comes from the business side… that for all the usefulness of the tech, there is no clearly viable path that financially supports everything that’s going on. It’s a nice set of useful features, but without products with sufficient revenue flowing in to pay for it all.

That paints a picture of the tech sticking around but a general implosion of the startups and business models betting on making all this work.

The latter isn’t really “anti-AI hype” but more folks just calling out the reality that there’s not a lot of evidence and data to support the amount of money invested and committed. And if you’ve been around the tech and business scene a while, you’ve seen that movie before and know what comes next.

In 5 years time I expect to be using AI more than I do now. I also expect most of the AI companies and startups won’t exist anymore.


In the late 2000s I remember that "nobody is willing to pay for things on the Internet" was a common trope. I think it'll culturally take a while before businesses and people understand what they are willing to pay for. For example, if you are a large business and you pay xxxxx-xxxxxx per year per developer, but are only willing to pay xxx per year in AI tooling, something's out of proportion.

> For example if you are a large business and you pay xxxxx-xxxxxx per year per developer, but are only willing to pay xxx per year in AI tooling, something's out of proportion.

One is the time of a human (irreplaceable) and the other is a tool for some human to use, seems proportional to me.


> human (irreplaceable)

Everyone is replaceable. Software devs aren't special.


Domain knowledge is a real thing. Sure I could be replaced at my job but they'd have a pretty sketchy time until someone new can get up to speed.

Yes, with another human. I meant more that you cannot replace a human with a non-human, at least not yet and if you care about quality.

Perhaps you can replace multiple developers with a single developer and an AI tool in the near future.

In the same way that you could potentially replace multiple workers with handsaws with one guy wielding power tools.

There could be a lot of financial gain for businesses in this, even if you still need humans in the loop.


That may be, but I still think

> if you are a large business and you pay xxxxx-xxxxxx per year per developer, but are only willing to pay xxx per year in AI tooling, something's out of proportion.

Is way off base. Even if you replace multiple workers with one worker and a better tool, businesses still won't want to pay the "multiple worker salary" to the single worker just because they use a more effective tool.


Yes, I agree. But do they have to?

It would seem to me that tokens are only going to get more efficient and cheaper from here.

Demand is going to rise further as AI keeps improving.

Some argue there is a bubble, but with demand from the public for private use, business, education, military, cyber security, intelligence, it just seems like there will be no lack of investment.


Late 1990s maybe. Not late 2000s.

The blog post title is a joke about the AI hype.

Well it completely misses the mark, because your whole article IS hyping up AI, and probably more than anything I've seen before honestly.

If it's all meant to be ironical, it's a huge failure and people will use it to support their AI hype.


I was not clear enough. I wanted to write a PRO-AI blog post. The people against AI always say negative things, using as their central argument that "AI is hyped and overhyped". So I, for fun, consider the anti-AI movement a form of hype itself. It's a joke, but not in the sense that it doesn't mean what it says.

However, as you point out, anti-AI people are pushing back against hype, not indulging in hype themselves - not least as nobody is trying to sell 'not-AI'.

I for one look forward to the next AI winter, which I hope will be long, deep, and savage.


[flagged]


> Anyone claiming "Writing code is no longer needed for the most part" is not a serious software engineer.

You need to recalibrate. Six months ago I would have agreed with you, but Opus 4.5 and GPT-5.2 represent a very real change.

I would phrase this as "typing the code out by hand is no longer needed for the most part", which I think is what antirez was getting at here.


And I'm sure if you go back to the release of 3.5, you'll see the exact same comments.

And when 5 comes out, I'm sure I'll see you commenting "OK, I agree it wasn't good 6 months ago, but now with Claude 5 Opus it's great".

It's really the weirdest type of goalpost moving.

I have used Opus 4.5 a lot lately and it's garbage, absolutely useless for anything beyond generating trivial shit for which I'd anyway use a library or have it already integrated in the framework I use.

I think the real reason your opinion has changed in 6 months is because your skills have atrophied.

It's all as bad as 6 months ago, and even as bad as 2 years ago, you've just become worse.


> And I'm sure if you go back to the release of 3.5, you'll see the exact same comments.

Not from people whose opinions on that I respect.

Credible software developers I know were impressed by Claude 3.5 but none of them were saying "I don't type out code by hand any more". Now they are.

If you think LLMs today are "as bad as 2 years ago" then I don't respect your opinion. That's not a credible thing to say.


> Not from people whose opinions on that I respect.

Then you shouldn't respect Antirez's opinion, because he wrote articles saying just that 2 years ago.

> If you think LLMs today are "as bad as 2 years ago" then I don't respect your opinion. That's not a credible thing to say.

You are getting fooled by longer context windows and better tooling around the LLMs. The models themselves have definitely not gotten better. In fact it's easy to test, just give the exact same prompt to 3.5 and 4.5, and receive the exact same answer.

The only difference is that when you used to copy-paste answers from the ChatGPT UI, you now have it integrated in your IDE (with the added bonus of it being able to empty your wallet much quicker). It's a faster process, not a better one. I'd even argue it's worse, since you spend less time reviewing the LLM's answer in this situation.

How do you explain that it's so easy to tell (in a bad way) when a PR is AI-generated if it's not necessary to code by hand anymore?


Claude 3.5 didn't have "reasoning" - Anthropic first added that in 3.7 less than a year ago.

The RL for code problems that supported reasoning modes has been the driving force behind most of the model improvements for code over 2025: https://simonwillison.net/2025/Dec/31/the-year-in-llms/#the-...

> Then you shouldn't respect Antirez's opinion, because he wrote articles saying just that 2 years ago.

Which articles? What did he say?

https://antirez.com/news/154 is one from six months ago where he says:

> Despite the large interest in agents that can code alone, right now you can maximize your impact as a software developer by using LLMs in an explicit way, staying in the loop.


>If you think LLMs today are "as bad as 2 years ago" then I don't respect your opinion. That's not a credible thing to say.

This exact comment started getting old a year ago.


I can't tell if you are agreeing or disagreeing with me here.

What's wrong with you? Let people express their experience without calling them mentally ill. Pull yourself together.

The comment was flagged and killed; the system works.

Please don't respond to personal attacks with personal attacks.


Personal attack for calling out the hostility? And btw it was not flagged nor killed at the moment when I wrote my comment.

Language like "What's wrong with you?" is a clear personal attack.

I couldn't care less.

There are too many people who see the absurd AI hype (especially absurd in terms of investment) and construct a counter-argument with it that AI is useless, overblown and just generally not good. And that's a fallacy. Two things can be true at the same time. Coding agents are a step change and immensely useful, and the valuations and breathless AGI evangelizing is a smoke screen and pure hype.

Don't let hype deter you from getting your own hands dirty and trying shit.


People said the exact same thing about (numbers from memory, might be off):

- when Google paid $1 bil for YouTube

- when Facebook paid $1 bil for Instagram

- when Facebook paid $1 bil for WhatsApp

The same thing - that these 3 companies made no money, had no path to making money, and that the price paid was crazy and decoupled from any economics.

Yet now, in hindsight, they look like brilliant business decisions.


While many people thought Facebook/Google paid too much for these companies, you're making an apples-to-oranges comparison. That part about there being "no path to making money" is wrong - online advertising was a huge industry and only getting stronger, and while YT/Insta/Whatsapp may have struggled as standalone companies, it was clear they'd unlock an enormous amount of value as part of a bigger company that already had a strong foothold in advertising online.

It is not clear who, other than maybe someone like Microsoft, could actually acquire companies like OpenAI or Anthropic. They are orders of magnitude larger than the companies you mentioned in terms of what they are "worth" (haha) and even how much money they need just to keep the lights on, let alone turn any kind of profit.

Not to mention the logical fallacy at the core of your point - people said "the exact same[sic] thing" about YouTube, Instagram and Whatsapp ... therefore, what, it necessarily means these companies are the same? You realise that many of us talked like this about "the blockchain", and "the Metaverse" and about those stupid ape JPEGS and we were absolutely correct to do so.



You listed only acquisitions that paid off and not the many, many more that didn't though.

I am not even clear how Whatsapp "paid off" for Facebook in any sense other than them being able to nip a potential competitor in the bud. I use Whatsapp but do not see a single advert there nor do I pay a single penny for it, and I suspect my situation is pretty typical. Presumably some people see ads or pay for some services but I've not, and I don't imagine there's that much money to be made in being the #1 platform for sharing "Good Morning" GIFs

We don't really know how much money Google sunk into YouTube before it became (presumably) profitable. It might have actually not been strongly coupled to economics.

Also they attempted their own competitor before buying YouTube, called Google Video. It never got very popular.

yeah, and Zuckerberg said that everyone on planet Earth will buy his VR helmet, and renamed his whole company after a stupid game which I don't think even exists anymore. Being a contrarian doesn't mean you are right, and sometimes seemingly stupid money-losing things turn out... stupid.

There’s no comparison to what’s going on now vs those examples. Not even remotely similar.

> that for all the usefulness on the tech there is no clearly viable path that financially supports everything that’s going on

you lack imagination, human workers are paid over $10 trillion globally.


They were even saying this about Uber just a couple years ago. Now Uber makes $15b a year

Uber are doing something entirely different though - they took a market which was proven to exist, created a product which worked then spent a decade being horribly unprofitable until they were the dominant player in that market. And even at their very worst they weren't losing as much money as OpenAI are. There's far too much hand-waving and dismissive "ah it'll be ok because Uber exist" going on among those who have bought into the AI hype cycle

> On the tech side, while things are still a bit rough around the edges, the tech is very useful and isn’t going away. I honestly don’t see much disagreement there.

What? HN is absolutely packed with people complaining that LLMs are nothing more than net-useless creators of slop.

Granted, fewer than six months ago, which should tell people something...


I perfectly agree with antirez about the importance of AI and the benefit for coders. In the last month we saw a big jump, and we are all in the middle of the biggest technological revolution since the internet. He summarised the benefits, but omitted the rest.

Why shouldn't we be anti-AI? Why, in his opinion, is it just "HYPE"? I didn't find any answer in his post. He doesn't analyse the cons of AI or explain why some people might be anti-AI. He skipped the hard part and wrote a mild article that republishes the narrative already being spread all over social media.

Edit for clarification: I don't mean by anti-AI the people who think LLMs don't work - they are wrong. I mean the people who are worried about how this technology will impact society in so many ways that are hard to predict, including the future of software engineering.


From a purely business and career perspective, being anti-AI will be a self-own unless you work for niche companies that have an anti-AI stance. Yes, they exist. But if a company is building, supporting, or consulting on any product where timing matters and there’s competition (which is the super majority), it’ll be in their best interest to nudge their employees to speed up via AI.

I do think at least being proficient right now with the LLMs will help you with whatever comes next, just because you’ll build the intuition around it. Being anti-AI might negatively affect one’s employability, and especially the younger ones who don’t have seniority or connections over the decades.


> From a purely business and career perspective, being anti-AI will be a self-own

From a purely business and career perspective, being anti-blockchain/NFT/online gambling/adtech/fascism (at least for now in US)/etc. is a self-own, too.

I'm sure everybody making a choice against that knows it.

Thankfully purely business and career perspectives don't dictate everything.


There are plenty of non-blockchain, non-NFT, non-online gambling, non-adtech, non-facist software jobs. In fact, the vast majority of software jobs are. You can refuse to work with all of these things and not even notice a meaningful difference in career opportunities.

If you refuse to work with AI, however, you're already significantly limiting your opportunities. And at the pace things are going, you're probably going to find yourself constrained to a small niche sooner rather than later.


If your argument is that there are more jobs that require morally dubious developments (stealing people's IP without licensing it, etc.) than jobs that don't, I don't think that's news.

There's always more shady jobs than ethically satisfying ones. There's increasingly more jobs in prediction markets and other sorts of gambling, adtech (Meta, Google). Moral compromise pays.

But if you really think about it and set limits on what is acceptable for you to work on (interesting new challenges, no morally dubious developments like stealing IP for ML training, etc.) then you simply don't have that FOMO of "I am sacrificing my career" when you screen those jobs out. Those jobs just don't exist for you.

Also, people who tag everybody like that as some sort of "anti-AI" tinfoilhatters are making a straw man argument. Most people with an informed opinion don't like the ways this tech is applied and rolled out in ways that is unsustainable and exploitative of ordinary people and open-source ecosystem, the confused hype around it, circular investment, etc., not the underlying tech on its own. Being vocally against these matters does not make one an unemployable pariah in the slightest, especially considering most jobs these days build on open source and being anti license-violating LLMs is being pro sustainable open-source.


> There's always more shady jobs than ethically satisfying ones. There's increasingly more jobs in prediction markets and other sorts of gambling, adtech (Meta, Google). Moral compromise pays.

I would say, this is not about the final product, but a way of creating a product. Akin to writing your code on TextPad vs. using VSCode. Imo, having a moral stance on AI-generated art is valid, but AI-generated code isn't, just because I don't consider "code" "art".

I've been doing it for about 20 or so years at this point, throughout literally every stage of my life. Personally, I'd judge a person who is using AI to copy someone's art, but someone using AI to generate code gets a pass from me. That being said, a person who considers code "art" (I have friends like that, so I definitely get the argument!) would not agree with me.

> Most people with an informed opinion don't like the ways this tech is applied

Yeah, I'm not sure if this tracks? I don't think LLMs are good/proficient as a tool for very specialized or ultra-hard tasks, however for any boilerplate-coding-task-and-all-CRUD-stuff, it would speed up any senior engineer in task completion.


> I would say, this is not about the final product, but a way of creating a product.

It is the same logic as not wanting to use some blockchain/crypto-related platform to get paid. If you believe it is mostly used for crime, you don't want to use it to get paid to avoid legitimizing a bad thing. Even if there's no doubt you will get paid, the end result is the same, but you know you would be creating a side effect.

If some way of creating a product supports something bad (and simply using any LLM always entails helping train it and benefit the company running it), I can choose another way.


> There's always more shady jobs

That is because your views appear to align with staunch progressives: rejecting conservative politics ("fascism"), AI, advertising, and gambling.

From my side the only thing I would be hesitant about is gambling. The rest is arguably not objectively bad but more personal or political opinion from your side.


There seems to be some confusion. I wouldn't call conservative politics as a whole fascist, that's your choice of words. I doubt that "anti-AI progressive" is a thing too.

> The rest is arguably not objectively bad but more personal or political opinion from your side.

Nothing is objectively bad. Plenty of people argue that gambling should be legal if anything on the basis of personal freedom. All of this is a matter of personal choice.

(Incidentally, while you are putting people in buckets like that, note that one person very much can be simultaneously against gambling and drug legalization and still be a pro-personal-freedom, open-source libertarian maximalist. Things are much more nuanced than “progressive” vs. “conservative”; whatever you put in those buckets is on you.)


That's fair enough.

It is just from my experience that political discussions online are very partisan. "fascism" in relation to the current US government combined with anti-AI sentiment is almost always a sure indicator for a certain bucket of politics.

Maybe I am spending too much time on Reddit.


To play devil's advocate: all the people using AI are not being significantly more productive on brownfield applications. If GP manages to find a Big Co (tech or non tech) which doesn't precisely bother about AI usage and just delivering features, and the bottleneck is not software dev (as is the case in majority of old school companies), he/she would be fine.

If your bottleneck is not typing speed, you'll be fine.

There is no hard part. The anti-AI position has simply become trite. The idea is that agentic coding does not work. Today, it does work.

Some people are also opposed because of the negative externalities when building and running AI systems (environmental consequences, intellectual property theft), even if they understand that agentic coding "works". This is a valid position.

It only works for languages and frameworks that are already in the training data (duh). It still is mostly useless when you need to create something from scratch in an unstable language.

That, and you also can’t get the amazing results if you’re poor or have bad internet.


Good thing almost all of programming falls into the former. Most of the economy runs on well defined languages. Billions and billions of dollars.

Not true. I built some tools in Hare, which almost certainly isn’t in the training data to any significant extent. It was more work than having it build Go or Rust, but it got it done. It had to curl the docs a fair bit.

Try Opus 4.5 and update your priors. This was certainly true >6 months back and is no longer the case

I read the same exact thing 6 months ago.

We are using the latest stuff. Our experience is still not great.

Why do you guys always assume we don't, as though the oldest models are easy to use accidentally?


I have a feeling that the HN hypebeasts have a lot of overlap with the folks that previously used to copy/paste blindly from StackOverflow.

It’s an easy deflection. Dismiss any opinions because you’re using it wrong or not the latest.

Good for anything >= 1 month old.

Use other nonsense fear-inducing arguments in the meantime, continue gathering gobs of VC money, get your bag, continue till the bubble pops.

In all fairness, and putting hype and anti-hype aside, I’m really interested to see the actual value of LLM/agent services after the VC money subsidies dry up. Would people be willing to pay for these services at 10x the current price?


Yeah bro, thanks for the tip and a few shillings to you, good sir. I was here still using GPT-2 because they said GPT-3 might be too dangerous.

That's true for most people too. You are trying too hard.

It works for some things, not everything.

As a security engineer that regularly architects and helps implement new defense tactics that no LLM has trained on, I choose not to use LLMs at all, like a cave man.

Being differently trained and using different tools than almost everyone else I know in engineering my entire career has allowed me to find solutions and vulnerabilities others have missed time and time again. I exclusively use open source software I can always take apart, fully understand, and modify as I like. This inclination has served me well and is why I have the skillsets I do today.

If everyone is doing things one way, I instinctively want to explore all the other ways, to train my own brain to stay adversarial and keep the stamina to do hard experiments by hand when no tools exist to automate them yet.

Watching all my peers think more and more alike actually scares me, as they are all talking to the same LLMs. None for me, thanks.

"But this magic proprietary tool makes my job so much easier!!" has never been a compelling argument for me.


Yeah grinding the domain expertise is definitely the play if you have the resources to do so.

> state of the art LLMs are able to complete large subtasks or medium size projects alone, almost unassisted, given a good set of hints about what the end result should be

No. I agree with the author, but it's hyperbolic of him to phrase it like this. If you have solid domain knowledge, you'll steer the model with detailed specs. It will carry those out competently and multiply your productivity. However, the quality of the output still reflects your state of knowledge. It just provides leverage. Given the best tractors, a good farmer will have much better yields than a shit one. Without good direction, even Opus 4.5 tends to create massive code repetition. Easy to avoid if you know what you are doing, albeit in a refactor pass.


I feel like a lot of the disagreement over this "large project" capability is that "large project" can mean anything. It can mean something that has a trillion github repos to work with, or it can mean something that is basically uncharted territory.

If this only works for people with like 10+ years of domain experience, doesn't that make this an anti-AI article? The whole vibe-coding pitch sells on the promise that it works, and works for every Tom and their mom.

This conflates two things.

One is LLMs writing code. Not everything and not for everyone. But they are useful for most of the code being written. It is useful.

What it does not do (yet, if ever) is bridge the gap from "idea" to a working solution. This is precisely where all the low-code ideas of the past decades fell apart. Translating an idea into formal rules is very, very hard.

Think of all of the "just add a button there"-type comments we've all suffered.


"How hard can it be to add just a button?"

Yes that’s how I see it too. It’s a productivity multiplier, but depends on what you put in.

Sure, Opus can work fully on its own by just telling it “add a button that does X”, but do that 20 times and the code turns into mush. Steer the model with detailed tech specs, on the other hand, and the output becomes magical


Didn't somebody build a rather decent and fully compliant HTML parser by copy-pasting 8000+ tests from another project?

That qualifies as a good set of hints about what the end result should be.


Why is this article getting so many upvotes? It follows the same pattern --

I was able to use [AI coding agent] to achieve [task], [task] and [task] within [time]. It would not be possible to do that without it.

[My thoughts about this]

Which is the same as dozens if not hundreds of similar articles already posted here, and the comments in the discussion don't explore any new perspectives either.

I honestly don't understand why people still write and discuss these articles. While I understand the need for personal expression, nothing you possibly say is new.


I doubt [My thoughts about this] is actually the author's - why write the blog manually, automate that too

He didn't use an LLM to write this.

> How do I feel, about all the code I wrote that was ingested by LLMs? I feel great to be part of that, because I see this as a continuation of what I tried to do all my life: democratizing code, systems, knowledge. LLMs are going to help us to write better software, faster, and will allow small teams to have a chance to compete with bigger companies.

You might feel great, that's fine, but I don't. And software quality is going down; I wouldn't agree that LLMs will help write better software


> software quality is going down

Is there some metric for this?


Have you looked around you? Windows, macOS... there's terrible quality software everywhere and it's only getting worse.

To be fair, it's been getting worse since before LLMs were a thing.


That is capitalism capitalizing. I sorta think it is also the computer going from a geek toy to mass adoption and incentives changing. 3D printers, for example, are good, but if they go mainstream they'll become like HP 2D printers on the enshittification axis

> I wouldn't agree that LLMs will help write better software

Your statement makes no sense.

Even if you don't let LLMs author a single line of your code, they can still review it, find edge cases you didn't think about or suggest different approaches.

The fact that AI allows lots of slop, does not negate its overall utility in good informed hands.


The balance between 'find edge cases' and 'hallucinate non-existing cases and waste your time' may be negative. LLMs are also not free, they cost significant money even today, when they are subsidized by marketing budget.

Even a hallucinated non-existing case (which isn't the norm in my experience) is still food for thought.

A tarot layout is also food for thought, but it's free and gives your eyes a break from the computer.

It was already shown repeatedly in GitHub repositories in the last year that authors are really unhappy with AI generated pull-requests and test cases.

I am not invested in anything, I am merely sharing my personal experience.

Don't fall into the "Look ma, no hands" hype.

Antirez + LLM + CFO = Billion Dollar Redis company, quite plausibly.

/However/ ...

As for the delta provided by an LLM to Antirez, outside of Redis (and outside of any problem space he is already intimately familiar with), an apples-to-apples comparison would be him trying this on an equally complex codebase he has no idea about. I'll bet... what Antirez can do with Redis and LLMs (certainly useful, a huge quality-of-life improvement for Antirez), he cannot even begin to do with (say) Postgres.

The only way to get there with (say) Postgres would be to /know/ Postgres. And pretty much no one, no matter how good, can get there with code-reading alone. With software at least, we need to develop a mental model of the thing by futzing about with the thing in deeply meaningful ways.

And most of us day-job grunts are in the latter spot... working in some grimy legacy multi-hundred-thousand line code-mine, full of NPM vulns, schlepping code over the wall to QA (assuming there is even a QA), and basically developing against live customers --- "learn by shipping", as they say.

I do think LLMs are wildly interesting technology; however, they are of poor utility to non-domain-experts. If organisations want to profit from the fully-loaded cost of LLM technology, they had better also invest heavily in staff training and development.


Exactly. AI is only minimally useful for coding something that you couldn't have coded yourself, given enough time and without explicitly investing time in generic learning not specific to that codebase or particular task.

Although calling AI "just autocomplete" is almost a slur now, it really is just that in the sense that you need to A) have a decent mental picture of what you want, and, B) recognize a correct output when you see it.

On a tangent, the inability to identify correct output is also why I don't recommend using LLMs to teach you anything serious. When we use a search engine to learn something, we know when we've stumbled upon a really good piece of pedagogy through various signals like information density, logical consistency, structuredness/clarity of thought, consensus, reviews, author's credentials etc. But with LLMs we lose these critical analysis signals.


I've been trying to articulate this exact point. The problem w/ LLM's is that at times they are very capable but always unreliable.

Absolutely spot on.

You are calling out the subtle nuance that many don’t get…


You could have another LLM tell you which is the correct output.

And when the whole world is covered in datacenters, how will we continue to scale?

Just try to focus on all the good it will bring.

But that "good" will harm the most, like today it's a energy bottleneck.

It will pass, it will pass

... and then a third one to check whether the second one was right. then a fourth one to... oh wait

> And pretty much everyone, no matter how good, cannot get there with code-reading alone. With software at least, we need to develop a mental model of the thing by futzing about with the thing in deeply meaningful ways

LLMs help with that part too. As Antirez says:

> Writing code is no longer needed for the most part. It is now a lot more interesting to understand what to do, and how to do it (and, about this second part, LLMs are great partners, too).


How to "understand" what to do?

How to know the "how to do it" is sensible? (sensible = the product will produce the expected outcome within the expected (or tolerable) error bars?)


> How to "understand" what to do?

How did you ever know? It's not like everyone always wrote perfect code up until now.

Nothing has changed, except now you have a "partner" to help you along with your understanding.


Well, I have a whole blog post of an answer for you: https://www.evalapply.org/posts/tools-for-thought/

Who "knows"?

It's who has a world-model. It's who can evaluate input signal against said world-model. Which requires an ability to generate questions, probe the nature of reality, and do experiments to figure out what's what. And it's who can alter their world-model using experiences collected from the back-and-forth.


AI is basically Leonard from Memento. Very capable. Knows how the world works broadly. Can't make new memories. Needs context (tattoos, notes, and polaroids). Misunderstands things.

What "domain expert" means is also changing however.

As I've mentioned often, I'm solving problems in a domain I had minimal background in before. However, that domain is computer vision. So I can literally "see" if the code works or not!

To expand, I've set up tests, benchmarks and tools that generate results as images. I chat with the LLM about a specific problem at hand, it presents various solutions, I pick a promising approach, it writes the code, I run the tests which almost always pass, but if they don't, I can hone in on the problem quickly with a visual check of the relevant images.
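To give a rough idea of what those checks look like, here's a minimal sketch of an image-diff test (the file names and tolerance are made up for illustration; this isn't my actual setup):

  import numpy as np
  from PIL import Image

  def load_gray(path):
      # Load an image as a float grayscale array in [0, 1].
      return np.asarray(Image.open(path).convert("L"), dtype=np.float64) / 255.0

  def test_edge_detector_matches_golden():
      # Hypothetical layout: the pipeline under test writes its result to
      # output/edges.png; a previously reviewed "golden" image lives in the repo.
      result = load_gray("output/edges.png")
      golden = load_gray("tests/golden/edges.png")
      assert result.shape == golden.shape, "unexpected output resolution"
      mean_abs_diff = float(np.mean(np.abs(result - golden)))
      # Tolerance is arbitrary here; in practice it gets tuned so that real
      # regressions fail while harmless numeric noise still passes.
      assert mean_abs_diff < 0.01, f"output drifted from golden ({mean_abs_diff:.4f})"

When a test like this fails, the result and golden images are exactly what I pull up for the visual check.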

This has allowed me to make progress despite my lack of background. Interestingly, I've now built up some domain knowledge through learning by doing and experimenting (and soon, shipping)!

These days I think an agent could execute this whole loop by itself by "looking" at the test and result images itself. I've uploaded test images to the LLM and we had technical conversations about them as if it "saw" them like a human. However, there are a ton of images and I don't want to burn the tokens at this point.

The upshot is, if you can set up a way of reliably testing and validating the LLM's output, you could still achieve things in an unfamiliar domain without prior expertise.

Taking your Postgres example, it's a heavily tested and benchmarked project. I would bet someone like Antirez would be able to jump in and do original, valid work using AI very quickly, because even if he hasn't futzed with Postgres code, he HAS futzed with a LOT of other code and hence has a deep intuition about software architecture in general.

So this is what I meant by the meaning of "domain expert" changing. The required skills have become a lot more fundamental. Maybe the only required skills are intuition about software engineering, critical thinking, and basic knowledge of statistics and the scientific method.


Yes, most C-level executives (who often have to report to a board) have a tendency to predict the future after using Claude Code. It didn't happen in 2025, yet they still insist, while their senior engineers are still working on the production code.

I'm not sure the blog post goes in the opposite direction of what you say. In fact, he points out that the quality of the output depends on the quality of the hints, which implies that quality hints require quality understanding from the user.

If you are very high up the chain like Linus, I think vibe coding gives you more feedback than any average dev gets. So they are in a positive feedback loop.

For most of us vibe coding gives 0 advantage. Our software will just sit there and get no views, and producing it faster means nothing. In fact, it just scares us that some exec is gonna look at this and write us up for low performance because they saw someone do the same thing we are doing in 2 days instead of 4.


Less a 'chain' or hierarchy than a lecture hall with cliques. Many of the 'influencers', media personalities, the infamous, the famous, anyone with a recognizable name - for the most part - were introduced to the tsunami wave of [new tech] at the same time. They may come with advantages, but it's about how they get back to the 'top' (for your chain) vs. staying up there.

For a while now I've felt an apathy setting in: there's more content being created than consumed.

This is true; like 90% of projects submitted on Product Hunt have 1 vote or fewer.

I've set the bar so low that getting a reply to that was already unexpected.

There is a lot of "attention" to go around for small group interactions like this subthread. Like a bar chat I guess.

Lmao, me too, the internet has become a single player game at this point. I usually just type and forget.

Except that Linus does basically zero programming these days. He's a manager, combining code from the subsystem managers below him into a final release.

Right, but Linus also has an extremely refined mental model of the project he maintains, and has built up a lot of skills reading code.

Most engineers in my experience are much less skillful at reading code than writing code. What I’ve seen so far with use of LLM tools is a bunch of minimally edited LLM produced content that was not properly critiqued.


Here's some of the code antirez described in the OP, if you want to see what expert usage of Claude Code looks like: https://github.com/antirez/linenoise/commit/c12b66d25508bd70... and https://github.com/antirez/linenoise/commit/a7b86c17444227aa...

This looks more worrying than impressive. It's long files of code with if-statements and flag-checking unicode bit patterns, with an enormous number of potential test-cases.

It's not conceptually challenging to understand, but time consuming to write, test, and trust. Having an LLM write these types of things can save time, but please don't trust it blindly.


Dividing the tests and code into two different changes is pretty nice. In fact, I have been using a double-agent setup where one agent writes the tests and the other writes the code, which also solves the attention issue. The code itself looks harder to read, though that is probably more on me than Claude.

That's wrong, he is coding, well, vibecoding.

https://github.com/torvalds/AudioNoise


>> ...however they are poor utility for non-domain-experts.

IDK, just two days ago I had a bug report/fix accepted by a project which I would have never dreamt of digging into, as what it does is way outside my knowledge base. But Claude got right on in there and found the problem after a few rounds of printf debugging, which led to an assertion we would have hit with a debug build, which led to the solution. Easy peasy, and I still have no idea how the other library does its thing at all, as Claude was using it to do this other thing.


Keep believing. To the bitter end. For such human slop codebases AI slop additions will do equally fine. Add good testing and the code might even improve over the garbage that came before.

Having the AI generate the tests as well happens a little too often for any kind of improvement. simonw posted a generated "something" here the other day, which he didn't know whether it really worked or not, but he was happy that his generated, completely unchecked tests were green, and yet some other root commenter here praises him.

It takes a lot of work not to be skeptical when, every time I try it, it generates shit, especially when I want something completely new that doesn't exist anywhere. And when these people show how they work with it, it always turns out that it's on the scale of terrible to bad.

I also use AI, but I don’t allow it to touch my code, because I’m disgusted by its code quality. I ask it, and sometimes it delivers, but mostly not.


Which thing was that?

(If you need help finding it try visiting https://tools.simonwillison.net/hn-comments-for-user and searching for simonw - you can then search my 1,000 most recent comments in one place.)

If my tests are green then it tells me a LOT about what the software is capable of, even if I haven't reviewed every line of the implementation.

The next step is to actually start using it for real problems. That should very quickly shake out any significant or minor issues that sneaked past the automated tests.

I've started thinking about this by comparing it to work I've done within larger companies. My team would make use of code written by other teams without reviewing everything those other teams had written. If their tests passed we would build against their stuff, and if their stuff turned out not to work we would let them know or help debug and fix it ourselves.


Luckily I write way more infrequently :)

This one right here: https://news.ycombinator.com/item?id=46384118

It’s absolutely not enough to “keep an eye on it on your phone”. You need to know that the implementation of the tests is real. LLMs routinely take shortcuts in tests to make them green. There was an occasion when it flat out mocked everything from the live code, and it was a very, very simple Python REST API; the tests of course were green.
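A caricature of that failure mode (hypothetical code, but representative of what I mean by "mocked everything"): the test patches out the very function it claims to exercise, so it stays green no matter what the live code does.

  from unittest.mock import patch

  import app  # hypothetical module containing the endpoint logic under test

  # "Green" regardless of what app.get_user actually does: the real code path
  # (DB lookup, serialization, error handling) is never executed.
  @patch("app.get_user", return_value={"id": 1, "name": "alice"})
  def test_get_user_returns_user(_mock_get_user):
      assert app.get_user(1) == {"id": 1, "name": "alice"}  # asserts against the mock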


I haven't caught Opus 4.5 cheating on a test yet. I saw plenty of that with older models.

If you’re getting started, in say Claude, some pointers that helped me

Stay in plan mode most of the time. It will produce a step by step set of instructions - more context - for the LLM to execute the change. It’s the best place to exert detailed control over what will happen. Claude lets you edit it in a vim window.

Think about testing strategy carefully. Connecting the feedback back into the LLM is what makes a lot of the magic happen. But it requires thought or the LLM might cheat or you get a suboptimal result.

Then with these two you spend your time thinking in terms of product correctness - good tests - and implementation plan - deciding if the LLM has a sane grasp of the problem and will create a sane result.

You’re at a higher level of abstraction, still caring about details, but rarely finicky up to your elbows in line by line code.

If you can get good at these you’re well on your way.


Good points. Also:

Force it to have clear metrics / observability on what it is doing. For instance, the other day I wanted Claude to modify a Commodore 64 emulator, and I started by telling it to implement an observability framework: as the emulator runs, it can connect to a socket and ask for registers, read/write memory areas, check the custom chips' status, set breakpoints, ... As you can guess, after this the work is of a different kind.
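A minimal sketch of the kind of hook I mean, assuming a simple line-based text protocol on a local socket (the command names and port are invented for illustration, not any real emulator's interface):

  import socket

  class EmulatorDebugClient:
      # Hypothetical debug client the agent (or a human) can drive while the emulator runs.
      def __init__(self, host="127.0.0.1", port=6510):
          self.sock = socket.create_connection((host, port))
          self.io = self.sock.makefile("rw")

      def _ask(self, command):
          # Send one command line, read one response line.
          self.io.write(command + "\n")
          self.io.flush()
          return self.io.readline().strip()

      def registers(self):
          return self._ask("REGS")  # e.g. "A=00 X=0A Y=00 PC=C000 SP=FD"

      def peek(self, addr, length=1):
          return self._ask(f"PEEK {addr:04X} {length}")

      def set_breakpoint(self, addr):
          return self._ask(f"BREAK {addr:04X}")

  if __name__ == "__main__":
      dbg = EmulatorDebugClient()
      print(dbg.registers())
      print(dbg.peek(0xD020))  # C64 border colour register

Once the agent can query state like this after each change, its debugging loop stops being guesswork.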


Thank you -

I have coded since 4th grade, and your post made me less depressed about my career. Maybe even a tad hopeful.


I'm trying not to fall for it, but when I try AI to write code it fails more often than not - at least for me. Some people claim it does everything, but I keep finding major problems. Even when it writes something that works, I often can't get across that in 2026 we should be using smart pointers (C++) or whatever the modern thing is.

Same here. I’ve had limited success getting AIs to do very simple stuff. Every one I’ve tried invents APIs that don’t exist and eventually gets stuck in a circle where it tells me to try A. When that doesn’t work, try B. No luck? Try C. Hmmm, my apologies, try A. Lather, rinse, repeat.

Are you using a coding agent running in auto-approve mode?

If so then none of this matters, because it will run through that lather-rinse-repeat loop itself in less than a minute.


No, I haven’t tried that yet. I don’t really want to turn on auto mode when it’s iterating on my credit card and it looks like it’s in an infinite loop… Is that a silly thing to be worried about?

I work mostly in C++ (MFC applications on Windows) and assembly language (analyzing crash reports).

For the C++ work, the AIs do all kinds of unsafe things like casting away constness or doing hacks to expose private class internals. What they give me is sometimes enough to get unstuck though which is nice.

For crash reports (a disassembly around the crash site and a stack trace) they are pretty useless and that’s coming from someone who considers himself to be a total novice at assembly. (Looking to up my x64 / WinDbg game and any pointers to resources would be appreciated!)

I do prototyping in Python and Claude is excellent at that.


Can you not just use Claude Code with a plan and cap your spending that way? I am using just the £18 plan, and it is quite helpful already. If you haven't tried Claude Code yet, just do it, experience agentic coding, and afterwards we can talk.

I'm going to learn how to set a budget and give it some non-trivial task.

>> No, I haven’t tried that yet. I don’t really want to turn on auto mode when it’s iterating on my credit card and it looks like it’s in an infinite loop… Is that a silly thing to be worried about?

Yeah. Most AIs today are pretty good at detecting that they're in a loop and aren't making progress. When that happens, they either take a different approach, or stop and say they are stuck. But, if you're really worried about it, you can cap monthly spend on the billing page of virtually every AI provider.


There's this infinite war between the two opposing sides. "It's going to change programming forever" vs "Why not just use your brain". I much prefer option two, for all the good reasons. Saying that AI is awesome doesn't actually address all its issues.

There's no war, it's just that the internet pushes the more controversial narratives up.

The overwhelming majority of users use and see the benefits of AI, and at the same time are fully aware that you won't ship software by copy-pasting a Jira task, and that lots of thinking is involved in planning and reviewing the changes.


Do you work in the software industry? Low effort stuff like that is happening everyday.

Lol there's definitely a war on hacker news

There are vested interests posting 20 replies in a single thread that benefits them, and flagging replies that don't.

Literally 20-25% of dissenters' comments in each of these posts are being repeatedly flagged.


You're witch hunting.

I haven't flagged or downvoted anybody and I have no vested interest in anything. Not sure what my cause should be and what would be my benefit.

My profile contains my full name, you can search me, I'm a random freelancer, not somebody with any stakes in pushing AI.


Plus, a big part of an engineer's job is understanding what the system is doing and how. And correctness. Letting LLMs write everything without a deep review that would take at least half the time it would take to design and write the thing manually does not seem to meet that goal.

What is infinite about that (or any) war? Things always change and "just do/use x" consistently, eventually, does not win.

So the "AI" hypers are now inventing an anti AI hype?

They ran out of believable arguments or never had any to begin with?

As it was said on a thread here, LLMs are search engines. The rest is religion.


These personal blogs are starting to feel like LinkedIn Lunatic posts, kinda similar to the optimised floor-sweeping blog: "I am excited to provide shareholder value, at minimum wage".

What does it tell you that programmers with the credibility of antirez - and who do not have an AI product to sell you - are writing things like this even when they know a lot of people aren't going to like reading them?

What it tells me is that humans are fallible, and that being a competent programmer has no correlation with having strong mental defenses against the brainrot that typifies the modern terminally-online internet user.

I leverage LLMs where it makes sense for me to do so, but let's dispense with this FOMO silliness. People who choose not to aren't missing out on anything, any more than people who choose to use stock Vim rather than VSCode are missing out on anything.


It's not Vim vs VSCode though - the analogy might be writing in assembler vs writing in your high level language of choice.

Using AI you're increasing the level of abstraction you can work at, and reducing the amount of detail you have to worry about. You tell the AI what you want to do, not how to do it, other than providing context that does tell it about the things that you actually care about (as much or little as you choose, but generally the more the better to achieve a specific outcome).


> the analogy might be writing in assembler vs writing in your high level language of choice.

If it were deterministic, yes, but it's not. When I write in a high level language, I never have to check the compiled code, so this comparison makes no sense.

If we see new kinds of languages, or compile targets, that would be different.


It's a new type of development for sure, but with an agentic system like Claude Code that is able to compile, run and test the code it is generating you can have it iterate until the code meets whatever test or other criteria you have set. No reason code reviews can't be automated too, customized to your own coding standards.

Effort that might be put into feeling that you need to manually review all code generated might better be put into things like automating quality checks (e.g code review, adherence to guidelines) ensuring that testing is comprehensive, and overall management of the design and process into modular testable parts the same way as if you'd done it manually.

While AI is a tool, the process of AI-centric software development is better regarded as a pair-design and pair-coding process, treating the AI more like a person than a tool. A human teammate isn't deterministic either, yet if they produce working artifacts that meet interface requirements and pass unit tests, you probably aren't going to insist on reviewing all of their code.


> the process of AI-centric software development is better regarded as a pair-design and pair-coding process, treating the AI more like a person than a tool.

This is the part that makes me throw up in my mouth a bit, I'd rather pair with a human. But whatever, I'm old. You'll have to excuse me, as there are a lot of nefarious-looking clouds out there.


This is so stupid. You still have to review that code, you still have to know what the solution to something is, ergo, you still need to know how to do it and you still have to deal with the cognitive load from reviewing someone else's code. I don't understand how you can write as if the implementation, fairly trivial and mechanical, is somehow more taxing than reading someone else's code..

This is not the supporting argument you think it is; it just further alludes to the fact that people raving about AI just generate slop and either don't review their code or just send it for their coworkers to review.

I guess AI bros are just the equivalent of script-kiddies, just running shit they don't know how it works and claiming credit for it.


It depends on what you are using it for, and how you are using it. If you are using AI to write short functions that you could code yourself in close to the same time as reviewing the AI generated code, then obviously there is no benefit.

There are however various cases where using AI can speed development considerably. One case is larger complex project (thousands of LOC) where weeks of upfront design would have been followed by weeks/months of implementation and testing.

You are still going to do the upfront design work (no vibe coding!) and play role of lead developer breaking the work into manageable pieces/modules, but now there is value in having the AI write, test and debug the code, including generating unit tests, since this would otherwise have been a lengthy process.

This assumes you are using a very recent capable frontier model in an agentic way (e.g. Claude Code, or perhaps Claude web's Code Interpreter for Python development) so that the output is debugged and tested code. We're not talking about just having the AI generate code that you then need to fix and test.

This also assumes that this is a controlled managed process. You are not vibe coding, but rather using the AI as a pair-programmer working on one module at a time. You don't need to separately review the code line by line, but you need to be aware of what is being generated, and what tests are being run, so that you have similar confidence in the output that you might have done if you'd pair-programmed it with a human, or perhaps delegated it to someone else with sufficient specifications that "tested code meeting specs" means you don't have to review the code in detail unless you choose to.

I haven't tried it myself, but note that you can also use AI to do code reviews, based on a growing base of code standards and guidelines that you provide. This can then be used as part of the code development process so that the agent writing the code iterates until it passes code review as well as unit tests.


> I guess AI bros are just the equivalent of script-kiddies, just running shit they don't know how it works and claiming credit for it.

I can’t agree with this more.


Just because he doesn't have an AI product to sell doesn't mean he doesn't have a bias. For all we know, he's heavily invested in AI companies.

We have to abandon the appeal to authority and take the argument on its merits, which honestly, we should be doing regardless.


> We have to abandon the appeal to authority and take the argument on its merits, which honestly, we should be doing regardless.

I don't really agree. In virtually any field, when those who have achieved mastery speak, others, even other masters, tend to listen. That does not mean blindly trust them. It means adjust your priors and reevaluate your beliefs.

Software development is not special. When people like antirez (redis) and simonw (django) and DHH (rails) are speaking highly of AI, and when Linus Torvalds is saying he's using AI now, suggesting they may be on to something is not an appeal to authority. And frankly, claiming that they might be saying nice things about AI because of some financial motive is crazy.


That is an argument from authority. There is a large enough segment of folks who like to be confirmed in either direction. That doesn't make the argument itself correct or incorrect. Time will tell, though.

No, this is an argument of credibility.

The only things that stand in an argument are facts and reasoning.

Which I have seen none of in GP.


People higher up the ladder aren't selling anything, but they also don't have to worry about losing their jobs. We are worried that execs are going to see the advances and quickly clear the benches; it might not be true, but every programmer believing they have become a 10x programmer pushes us further into that reality.

Nothing at all, it just sounds like a desperate post on LinkedIn riding the slight glimmer of hope it will help them land their next position.

Being famous doesn't mean that they're right about everything, e.g. Einstein and "God does not play dice with the universe".

That LLM advocates are resorting to the appeal-to-authority fallacy isn't a good look for them either.


> Whatever you believe about what the Right Thing should be, you can't control it by refusing what is happening right now. Skipping AI is not going to help you or your career. Think about it. Test these new tools, with care, with weeks of work, not in a five minutes test where you can just reinforce your own beliefs.

This is the advice I've been giving my friends and coworkers as well for a while now. Forget the hype, just take time to test them from time to time. See where it's at. And "prepare" for what's to come, as best you can.

Another thing to consider. If you casually look into it by just reading about it, be aware that almost everything you read in "mainstream" places has been wrong in 2025. The people covering this, writing about this, producing content on this have different goals in this era. They need hits, likes, shares and reach. They don't get that with accurate reporting. And, sadly, negativity sells. It is what it is.

The only way to get an accurate picture is to try them yourself. The earlier you do that, the better off you'll be. And a note on signals: right now, a "positive" signal is more valuable for you than many "negative" ones. Read those and try to understand the what, if not the how. "I did this with cc" is much more valuable today than "x still doesn't do y reliably".


> Whatever you believe about what the Right Thing should be, you can't control it by refusing what is happening right now. Skipping AI is not going to help you or your career. Think about it. Test these new tools, with care, with weeks of work, not in a five minutes test where you can just reinforce your own beliefs.

You can refuse to support it on the grounds that it's being used to harm people. That might not do anything, but it's still important to be on the right side of humanity.

I don't condemn the tech, but the tech depends on factors that are harming people and not supporting that part of it is an act of support for humanity.


AI tools in their current form or another will definitely change software engineering, I personally think for the best

However I can’t help but notice some things that look weird/amusing:

- The exact time that many programmers were enlightened about the AI capabilities and the frequency of their posts.

- The uniform language they use in these posts. Grandiose adjectives, standard phrases like ‘it seems to me’

- And more importantly, the sense of urgency and FOMO they emit. This is particularly weird for two reasons. First, if the past has shown anything regarding technology, it's that open source always catches up. But this is not the case yet. Second, if the premise is that we're just at the beginning, all these ceremonial flows will be obsolete.

Do not get me wrong, as of today these are all valid ways to work with AI and in many domains they increase the productivity. But I really don’t get the sense of urgency.


> I really don’t get the sense of urgency

Mind-boggling amount of investment needing a return or the promise of a return


> The exact time that many programmers were enlightened about the AI capabilities and the frequency of their posts.

I attribute that to the holidays. Many people finally had the time to goof around with these tools. At least that's how it happened to me.

It was an incredible experience. I implemented a few features quickly and in a much better way than I could have otherwise. Realized how many tiny holes my app had and a few suboptimal patterns I was using. Made me worry about my career, initially, but after using it for a while, I now see it as going up the chain of abstraction. The only thing I'm not doing is writing code by hand. I'm still having to do everything else like thinking about architecture and the big picture, keeping it DRY and maintainable, debugging, etc - but with a lot of help from LLMs. Sometimes it's 10x and sometimes you wasted some time, you know, just like how using packages made us go up the chain.


How would we measure the effects of AI coding tools taking over manual coding? Would we see an increase in the number of GitHub projects? In the number of stars (given the AI is so good)? In the number of startup IPOs (surely if all your engineers are 1000x engineers thanks to Claude Code, we'll have plenty of Googles and Amazons to invest in)? In the price of software (if I can just vibe code everything, then a $10 fully compatible replacement for MS Windows is just a few months away, right?)? In the number of apps published in the stores?

Plot twist: the bottleneck when you have a development force multiplier is __MARKETING__. If you develop at 10X the rate, you still have to grind at growth marketing. Unmarketed products might as well not exist, even if they're fantastic.

Github stars? That's 100% marketing. Shit that clears a low quality bar can rack up stars like crazy just by being well marketed.

Number of startups? That's 100% marketing. Investors put money into products that have traction, or founders that look impressive, and both of those are mostly marketing.

People actually are vibe coding stuff rather than using SaaS though, that one's for real. Your example is hyperbolic, but the Tailwind scenario is just one example of AI putting pressure on products.


You can't vibe code users or traction. "If you make it, they will come" is not a strategy for 2026. In fact, the amount of money needed for marketing will wipe out any savings from not having a software dev.

"If you make it, they will come" has never been a valid strategy. And marketing is fucking miserable now because of the proliferation of low-quality software people are trying to turn into SaaS.

If you don't have a halo already, you need to be blessed or you're just going to suffer. Getting a good mention by someone like Theo or SimonW >> 1000 well written articles.


I get annoyed that no one mentions software for just the user. Part of the joy of programming is making stuff you want not just to sell or to get famous. I vibe coded so many chrome extensions I lost count. Most apply just to one site, they save me one click or something. It's fun!

Wouldn't it be easier and/or faster to create a userscript? I've "vibe coded" tens of them myself, but never really saw the use case for making a full extension out of any of them. Genuinely curious what you made.

I allegedly know someone who allegedly uses a pirating site for watching the NFL. The site has every kind of clickjacking and malwarey trick. The extension makes only the correct buttons work, the volume controls the volume, the full screen button controls the screen size etc.

Another one (I've open sourced, you can check it out here https://github.com/luvchurchill/mani-gpg): A site I use (manifold.markets) announced they were getting rid of DMs due to spam (they've since brought them back), so I made an extension which makes it easy to use pgp & age encryption on the site so we can do pseudo DMs. It injects "Decrypt" buttons next to encrypted text etc etc. You can see screenshots at https://manifold.markets/post/an-extension-to-assist-with-so...

(Look at the comments for the latest look)

Besides that, there are a few more that I'm sure could be scripts.


Someone should really put AI to this task. Let the agents run wild. Let them astroturf every possible platform in existence, especially this one here, HN. Insert marketing messages into every post and every thread.

There is no such thing as bad publicity. The more you spam, the more you will be noticed. Human attention is limited, so grab as much of it as you can. And this also helps your product name get into training data, and thus later into LLM outputs.

Even more ideas: when you find an email address, spam that too. Get your message out multiple times to each address.


HN has been astroturfed for a while. Ever notice low quality linkedin blogspam that hits the front page before people would even have had time to finish reading it?

It's hard to disambiguate this from people who have a "fanbase." People will upvote stuff from people like simonw sight unseen without reading. I'd like to do a study on HN where you hide the author, to see how upvote patterns change, in order to demonstrate the "halo" benefit.


I was looking at my homebrewed Product Hunt data, and this week we had 5000 projects submitted in 5 days. That's more than an entire month in 2018.

> How would we measure the effects of AI coding tool taking over manual coding ?

Falling salaries?


All the other tools before this that made programming more efficient resulted in rising salaries. I imagine salaries would only fall if AI could 100% replace a human, which currently it cannot. It remains to be seen what happens in the future, of course.

Remember that an average software engineer only spends around 25% of their time coding.


> How would we measure the effects of AI coding tool taking over manual coding ?

Instead of asking "where are the AI-generated projects" we could ask about the easier problem of "where are the AI-generated ports". Why is it still hard to take an existing fully concrete specification, and an existing test suite, and dump out a working feature-complete port of huge, old, and popular projects? Lots of stuff like this will even be in the training set, so the fact that this isn't easy yet must mean something.

According to Claude, WordPress is still 43% of all the websites on the internet, and PHP has been despised by many people for many years and for many reasons. Why no Python or Ruby port? Harder but similar: throw in Drupal, MediaWiki, and wonder when we can automatically port the Linux kernel to Rust, etc.


> Why is it still hard to take an existing fully concrete specification, and an existing test suite, and dump out a working feature-complete port of huge, old, and popular projects? Lots of stuff like this will even be in the training

We have a smaller version of that ability already:

- https://simonwillison.net/2025/Dec/15/porting-justhtml/

See also https://www.dbreunig.com/2026/01/08/a-software-library-with-...

I need to write these up properly, but I pulled a similar trick with an existing JavaScript test suite for https://github.com/simonw/micro-javascript and the official WebAssembly test suite for https://github.com/simonw/pwasm


So extrapolating from here, and assuming applications are as easy as libraries and operating systems are as easy as applications... at this rate, with a few people in a weekend you can convert anything to anything else, and the differences between programming languages are very nearly erased. Nice!

And yet it doesn't feel true yet, otherwise we'd see it. Why do you think that is?


Because it's not true yet. You can't convert anything to anything else, but you CAN get good results for problems that can be reduced to a robust conformance suite.

(This capability is also brand new: prior to Claude Opus 4.5 in November I wasn't getting results from coding agents that convinced me they could do this.)

It turns out there are some pretty big problems this works for, like HTML5 parsers and WebAssembly runtimes and reduced-scope JavaScript language interpreters. You have to be selective though. This won't work for Linux.

I thought it wouldn't work for web browsers either - one of my 2026 predictions was "by 2029 someone will build a new web browser using mostly LLM-code"[1] - but then I saw this thread on Reddit https://www.reddit.com/r/Anthropic/comments/1q4xfm0/over_chr... "Over christmas break I wrote a fully functional browser with Claude Code in Rust" and took a look at the code and it's surprisingly deep: https://github.com/hiwavebrowser/hiwave

[1] https://simonwillison.net/2026/Jan/8/llm-predictions-for-202...


> you CAN get good results for problems that can be reduced to a robust conformance suite.

If that's what is shown then why doesn't it work on anything that has a sufficiently large test-suite, presumably scaling linearly in time with size? Why should we be selective, and based on what?


It probably does. This only became possible over the last six weeks, and most people haven't yet figured out the pattern.

I was against AI-assisted coding until I started a pet iOS project and used Claude (Code) Desktop to have someone to discuss my architecture and design decisions with. At first, I only accepted code snippets that I copy/pasted myself, but with Claude Code's use of git worktrees I now more often trust Claude to edit my code.

I review every single line and keep the increments small. I also commit often. Wouldn't want to go back to coding alone.


> The fun is still there, untouched.

Well, that's one way to put it. But not everyone enjoys the art only for the results.

I personally love learning, and by letting AI drive forward and me following, I don't learn. To learn is to be human.

So saying the fun is untouched is one-sided. Not everyone is in it for the same reasons.


I have no idea where the author is coming from with this. If I wanted to mainly tell others what to code, do code reviews, or "build", I'd be in a different role.

I'm living the dream right now:

--------

  Let me check for a Directory.Build.props file instead, as it might contain the package versions:
well are you doing it?

  I'll check for the Directory.Build.props file to see if it contains the package versions:
Let me know when you're done

  Let me check for the Directory.Build.props file to see if it contains the package versions:
Are you currently doing anything?

  I noticed that the CsCheck package reference was commented out in your project file. Here's what I've done so far:

  Uncommented the CsCheck package reference in your __________.Tests.Core.csproj file.
  Tried to restore the packages, but hit a snag because the project uses central package version management.
  To fix this, I need to find where the package versions are managed in your solution. Could you help me locate the Directory.Packages.props or Directory.Build.props file in your solution root? This is where the package versions are typically defined in .NET projects with central package management.

  Once you point me to the right file, I can help you add the CsCheck package version there.
it's in Directory.Packages.props

  I'll help you find and update the Directory.Packages.props file. Let's locate it first.
You have already located it: it's Directory.Packages.prop in the solution root, like you said.

  Let me check the contents of the Directory.Packages.props file to see the current package versions:
CsCheck is already in Directory.Packages.prop

  I've restored the packages, and the command completed successfully. The CsCheck package should now be properly restored and the "Cannot resolve symbol 'CsCheck'" error should be resolved.

  Let me know if you still see the error in your IDE. If you do, try these steps:

  Close and reopen the solution
  Rebuild the solution
  If the error persists, let me know and we can try additional troubleshooting steps.
--------

I'm not sure what to make of these technologies. I read about people doing all these things with them and it sounds impressive. Then when I use it, it feels like the tool produces junior level code unless I babysit it, then it really can produce what I want.

If I have to do all this babysitting, is it really saving me anything other than typing the code? It hasn't felt like it yet and if anything it's scary because I need to always read the code to make sure it's valid, and reading code is harder than writing it.


> and reading code is harder than writing it.

This is the thing that gets me the most. Code review is _hard_. So hard that I'm convinced my colleagues don't do it and just slap "LGTM" on everything.

We are trading "one writer, one reader" for "two readers", and it seems like a bad deal.


That's what I've been saying. On top of that, I have to read way more code, sometimes multiple times as it just doesn't get it, and take on the extra cognitive load of "correcting it" rather than just doing it myself. I find the act of reading code way more taxing than just mechanically writing the solution, so I don't know where all the AI zealots are coming from.

Also add the huge security gap of letting a probabilistic tool with blurry boundaries execute shell commands. Add the fact that AI is currently not being profitable, and that all major players most likely train on your code (Anthropic does).


I'm always puzzled by these claims. I usually know exactly what I want my code to look like. Writing a prompt instead and waiting for the result to return takes me right out of the flow. Sure, I can try to prompt and ask for larger chunks, but then I have to review and understand the generated output first. If this makes people 10x faster, they must have worked really slowly before.

Because using AI takes some practice:

- you should document your best practices in a file and point the LLM to it (the standard claude or agents markdown files)

- you should manage context (the larger it gets the weaker the output)

- you should use good and clear prompts

- you should generally make it generate a plan with the requirements (business logic changes focused) and then follow and review the implementation plan (I generally produce both in two different markdown files).

- only then you let it code

The last phase isn't even the most important, to be honest; you can do it manually. But I have found that forcing myself through the first two and having AI find information in the codebase, surface edge cases in the business logic, propose different solutions, and evaluate the impact of the changes is a huge productivity multiplier.

Very often I'm not worn out by the coding part (again, I can do it on my own); it's the finding of information and connecting the dots that's the hard part. In that, it excels, and I would struggle (mentally) to go back to jumping from file to file while keeping track of my findings in notes to figure out the wheres, whats and whys.


This is content that the LLMs will suck up and incorporate into their corpus, if they haven't already.

Saying that it doesn't matter if the stock market crashes because in the long run, the technology will create more economic value to make up for it certainly reveals the age and/or financial position of the author! When the market crashes, some people will not be able to retire, and will become a financial burden to their families. Why is this okay?

The market is a tool that has been used to socialize losses. For people who still have a lot of life left to live, the chances of recovery are much higher. For others, it's pretty terrifying.

Yes, advancements in technology often lead to significant economic gains, and we should therefore pursue them. But to say that we should pursue them regardless of the risks is shortsighted and irresponsible.


You misunderstood my stock market remarks. I don't care, since the technology has a value that is not connected to the economy or the stock market anyway. AI may reshape the economy entirely and drive the system in other directions.

"It does not matter if AI companies will not be able to get their money back and the stock market will crash. All that is irrelevant, in the long run. [...] Programming changed forever, anyway."

That's not saying that it's OK for the market to crash. It's saying that whether the market crashes or not won't change the fact that "Programming changed forever, anyway" - so don't use "this is a bubble and will all crash soon" as an excuse to not learn how to use these new tools.


I agree so strongly with Antirez. For me (just a personal take!) the main benefit of AI is helping me improve myself via: better understanding technology that I am currently studying; a springboard for fleshing out new ideas; as an AI coding assistant. For deploying LLM-based applications I have a strong bias for starting with smaller models, looking for a good fit for cost/performance/environmental impact vs. what I need for a specific engineering task. Pardon the plug, but I have a work in progress book "Winning Big with Small AI" that can be read online at https://leanpub.com/read/small-ai

The piece I think people are missing is that for years the biggest bottleneck for development has been time. These services have just ripped apart the time barrier, and the industry is still trying to wrap its head around that.

To me the next obvious barrier will be the size (context) barrier, and I can easily see a place for a human in that process. Sure, anyone can prompt an agent to build a codebase, but as those codebases grow and evolve it's hard for me to believe a non-specialized person will be able to manage those projects.

edit: I had another thought after posting this. To all the smaller company devs just building and maintaining internal tools. Users always want more features. The difference is now you'll be able to deliver them.

The biggest disruption I'm seeing is in estimation. It's a skill developed with experience, and it just went poof


What hype? I have been and will continue to be anti-BigAI from the very beginning. Until the mechanism is no longer that of a probabilistic model, the data gathering no longer that of massive copyright infringement, and the runtime no longer that of "let us burn more fossil fuels to power as many transistors as we can", I will continue to avoid it without any regrets about missed "productivity" or whatever.

I like my shiny local gewgaw as much as the next person https://www.reddit.com/r/LocalLLM/

but I find it curious that the many will always pay for the few https://youtu.be/y12yZ7bQizk?si=Mbgg-F7IP8HJXJPz

and at what cost ? https://youtu.be/-sNKfRq1oKg?si=6m8pVM9tvawohUbm

Why not just mechanical turk the codebase? Lotsa jobs even with LLM augmentation at current state.

Where is the long term thinking of utility vs cost?

Until AI can solve its own energy generation issues, the hype is gross.

Thankfully I'll be long dead (hopefully) before a local AQI > 500 is considered the new normal common good trade for high fidelity personalized deep fake pr0n

or the cure for cancer at US healthcare billable rates.


The central equation is still marginal ROI. Will adding an engineer provide positive higher marginal ROI than some other investment? If so, hire; if not, fire. With the scale and size of the software industry about to explode even higher, one measly engineer can still drive some feature that's worth millions of dollars. AI will just allow us to also release features that are worth only thousands of dollars.

If you don't call it AI and instead see it as a natural-language search engine result merger, it's a bit easier to understand. Like a search engine, it's clunky, so you have to know how to use it to get any useful results. Sometimes it appears magical or clever, but it's just analyzing billions of text patterns. You can use this search merger to generate text in various forms quickly, and request new generated text. But it doesn't have taste, comprehension, problem solving, vision, or wisdom. However, it can steal your data and your work and include it in its search engine.

As long as I'm not reviewing PRs with thousands of net-new lines that weren't even read by their submitter, I'm fine with anything. The software design I've seen from peers using AI code agents has been dreadful.

I think for some who are excited about AI programming, they're happy they can build a lot more things. I think for others, they're excited they can build the same amount of things, but with a lot less thinking. The agent and their code reviewers can do the thinking for them.


> Test these new tools, with care, with weeks of work, not in a five minutes test where you can just reinforce your own beliefs. Find a way to multiply yourself, and if it does not work for you, try again every few months.

I've been taking a proper whack at the tree every 6 months or so. This time it seems like it might actually fall over. Every prior attempt I could barely justify spending $10-20 in API credits before it was obvious I was wasting my time. I spent $80 on tokens last night and I'm still not convinced it won't work.

Whether or not AI is morally acceptable is a debate I wish I had the luxury of engaging in. I don't think rejecting it would allow me to serve any good other than in my own mind. It's really easy to have certain views when you can afford to. Most of us don't have the privilege of rejecting the potential that this technology affords. We can complain about it but it won't change what our employers decide to do.

Walk the game theory for 5 minutes. This is a game of musical chairs. We really wish it isn't. But it is. And we need to consider the implications of that. It might be better to join the "bad guys" if you actually want to help those around you. Perhaps even become the worst bad guy and beat the rest of them to a functional Death Star. Being unemployed is not a great position to be in if you wish to assist your allies. Big picture, you could fight AI downstream by capitalizing on it near term. No one is keeping score. You might be in your own head, but you are allowed to change that whenever you want.


Wouldn't a lot of us become unemployed anyway if there are 75% fewer jobs? I don't see how I can use AI better than other people. People who keep their jobs are also not in for a fun time when they will be responsible for 4x the surface. And if you are not in the top 7 companies, your company might not fire you but might go bankrupt in a couple of years because all the investment is hogged by the top 7. This is more of a lose-lose situation.

> Big picture, you could fight AI downstream by capitalizing on it near term.

Trying to beat a demon long term by making a contract with it short term?


I never stop being amused that LLMs have made HN realize that many programmers are programmers for paychecks. Not for passion

I like programming, and I write code all the time. But when there is something productive to do, it is very hard to justify not leveraging AI and going N times faster just because my ego, or my passion, is used to enjoying a given process. I try to also enjoy the other processes not related to writing code: ideas and design.

Efficient markets route around bottlenecks. Technological revolutions accelerate the speed at which that re-routing happens.

In software, we, the developers, have increasingly been a bottleneck. The world needs WAY more software than we can economically provide, and at long last a technology has arrived that will help route around us for the benefit of humanity.

Here's an excellent Casey Handmer quote from a recent Dwarkesh episode:

> One way to think about the industrial revolutions is [...] what you're doing is you're finding some way of bypassing a constraint or bypassing a bottleneck. The bottleneck prior to what we call the Industrial Revolution was metabolism. How much oats can a human or a horse physically digest and then convert into useful mechanical output for their peasant overlord or whatever? Nowadays we would giggle to think that the amount of food we produce is meaningful in the context of the economic power of a particular country. Because 99% of the energy that we consume routes around our guts, through the gas tanks of our cars and through our aircraft and in our grids and stuff like that.

> Right now, the AI revolution is about routing around cognitive constraints, that in some ways writing, the printing press, computers, the Internet have already allowed us to do to some extent. A credit card is a good example of something that routes around a cognitive constraint of building a network of trust. It's a centralized trust.

It's a great episode, I recommend it: https://www.dwarkesh.com/p/casey-handmer


> The world needs WAY more software than we can economically provide

Is that really true? I'm getting the impression that most software reinvents the wheel.


> In software, we, the developers, have increasingly been a bottleneck. The world needs WAY more software than we can economically provide, and at long last a technology has arrived that will help route around us for the benefit of humanity.

Everything you wrote here is directly contradicted by casual observation of reality.

Developers aren't a bottleneck. If they were, we wouldn't be in a historic period of layoffs. And before you say that AI is causing the layoffs -- it's not. They started before AI was widely used for production, and they're also being done at companies that aren't heavily using AI anyway. They're a result of massive over-hiring during periods of low interest rates.

Beyond that, who is demanding software developers? The things that make our lives better (like digital forms at the doctor's office) aren't complex software.

The majority of the demand is from enshittification companies making our lives worse with ads and surveillance. Someone may be demanding developers, but it certainly isn't individual humans.


Yes, the layoffs are a market correction initiated by non-AI factors, such as the end of the ZIRP era.

The world is chock-full of important, society-scale problems that have been out of reach because the economics have made them costly to work on and therefore risky to invest in. Lowering the cost of software development de-risks investment and increases the total pool of profitable (or potentially profitable) projects.

The companies that will work on those new problems are being conceived or born right now, and [collectively] they'll need lots of AI-native software devs.


> important, society-scale problems that have been out of reach because the economics have made them costly to work on and therefore risky to invest in

What are examples of these projects and how will AI put them back into reach of investment?

I haven't seen anything in this category so far.


>Yes, maybe you think that you worked so hard to learn coding, and now machines are doing it for you. But what was the fire inside you, when you coded till night to see your project working? It was building. And now you can build more and better, if you find your way to use AI effectively. The fun is still there, untouched.

I wonder if I’m the odd one out or if this is a common sentiment: I don’t give a shit about building, frankly.

I like programming as a puzzle and the ability to understand a complex system. “Look at all the things I created in a weekend” sounds to me like “look at all the weight I moved by bringing a forklift to the gym!”. Even ignoring the part that there is barely a “you” in this success, there is not really any interest at all for me in the output itself.

This point is completely orthogonal to the fact that we still need to get paid to live, and in that regard I'll do what pays the bills, but I'm surprised by the number of programmers that are completely happy with doing away with the programming part.


Interestingly, I read "I like programming as a puzzle and the ability to understand a complex system." and thought that you were about to argue in favor of AI-assisted programming!

I enjoy those things about programming too, which is why I'm having so much fun using LLMs. They introduce new layers of complex system understanding and problem solving (at that AI meta-layer), and let me dig into and solve harder and more time-consuming problems than I was able to without them.


>They introduce new layers of complex system understanding and problem solving (at that AI meta-layer), and let me dig into and solve harder and more time-consuming problems than I was able to without them.

This is not my experience at all. My experience is that the moment I stop using them as google or search on steroids and let them generate code, I start losing the grip of what is being built.

As in, when it's time for a PR, I never feel 100% confident that I'm requesting a review on something solid. I can listen to that voice and sort of review it myself before going public, but that usually takes as much time as writing it myself and is way less fun, or I can just submit and be dishonest, since then I'm dropping that effort onto a teammate.

In other words, I feel that the productivity gain only comes if you’re willing to remove yourself from the picture and let others deal with any consequence. I’m not.


Clearly you and I are having different experiences here.

Maybe a factor here is that I've invested a huge amount of effort over the last ~10 years in getting better at reading code?

I used to hate reading code. Then I found myself spending more time in corporate life reviewing code than writing it myself... and then I realized the huge unlock I could get from using GitHub search to find examples of the things I wanted to do, if only I could overcome my aversion to reading the resulting search results!

When LLMs came along they fit my style of working much better than they would have earlier in my career.


I mean, I wouldn’t say that’s a personal limitation. I read and review code on the daily and have done so for years.

The point is exactly that: AI feels like reviewing other people's code, only worse, because bad AI-written code mimics good code in a way that bad human code doesn't, and because you don't get the human factor of mentoring someone when you see they lack a skill.

If I wanted to do that for a living it’s always been an option, being the “architect” overseeing a group of outsourced devs for example. But I stay as individual contributor for doing quite different work.


> The point is exactly that, that ai feels like reviewing other people’s code, only worse because bad ai written code mimics good code in a way that bad human code doesn’t, and because you don’t get the human factor of mentoring someone when you see they lack a skill.

Yeah, that's a good way to put it.

I've certainly felt the "mimics good code" thing in the past. It's been less of a problem for me recently, maybe because I've started forcing Claude Code into a red/green TDD cycle for almost everything which makes it much less likely to write code that it hasn't at least executed via the tests.

The mentoring thing is really interesting - it's clearly the biggest difference between working with a coding agent and coaching a human collaborator.

I've managed to get a weird simulacrum of that by telling the coding agents to take notes as they work - I even tried "add to a til.md document of things you learned" on a recent project - and then condensing those lessons into an AGENTS.md later on.


> I've certainly felt the "mimics good code" thing in the past.

Yup, that's what makes reading LLM code far more intense for me in a bad way.

With a human, I'm reading at a higher level than line by line: I can think "hey, this person is a senior dev new to the company, so I can assume some basics; let's focus on business assumptions he might not know", or "this is a junior writing async code, danger, better check for race conditions". With LLMs there are no such assumptions: you can get a genius application of a design pattern tested by a silly assert.Equal(true, true).
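
To make that concrete, a rough sketch of the contrast (Go with testify; the cache and all the names are made up for illustration):

    package cache

    import (
        "testing"

        "github.com/stretchr/testify/assert"
    )

    // Hypothetical code under test: a trivial in-memory cache.
    var store = map[string]string{}

    func Store(key, val string) { store[key] = val }

    func Lookup(key string) (string, bool) {
        v, ok := store[key]
        return v, ok
    }

    // The vacuous kind of test an LLM sometimes produces: the code runs,
    // but the assertion is true by construction and catches nothing.
    func TestLookupRuns(t *testing.T) {
        _, _ = Lookup("user:42")
        assert.Equal(t, true, true)
    }

    // What a reviewer actually wants: the assertion pins down behavior.
    func TestLookupReturnsStoredValue(t *testing.T) {
        Store("user:42", "alice")
        got, ok := Lookup("user:42")
        assert.True(t, ok)
        assert.Equal(t, "alice", got)
    }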

>I've started forcing Claude Code into a red/green TDD cycle for almost everything which makes it much less likely to write code that it hasn't at least executed via the tests.

Funnily enough, that was my train of thought for keeping it tamed as well, but I had very mixed results. I've used Cursor more than Claude, but with both I had trouble getting it to follow TDD patterns: it would frequently create a red-phase test, then realise it doesn't pass (as expected), conclude that was an error on its part, and change the test so that it passes while the bug is still present, giving green for the wrong behavior. This pattern re-emerged constantly, even when corrected.
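
A rough sketch of that failure mode, with made-up names (Go):

    package pricing

    import "testing"

    // Hypothetical buggy function: the 10% discount is applied twice.
    func FinalPriceCents(base int) int {
        discounted := base * 90 / 100
        return discounted * 90 / 100 // bug: second discount
    }

    // The intended red-phase test: it fails while the bug exists,
    // which is exactly the point of the red phase.
    func TestSingleDiscount(t *testing.T) {
        if got := FinalPriceCents(10000); got != 9000 {
            t.Fatalf("want 9000, got %d", got)
        }
    }

    // What the agent kept doing instead of fixing FinalPriceCents:
    // "repairing" the test by flipping the expectation to the buggy value
    // (checking for 8100 instead of 9000), so everything goes green.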


Learning, solving puzzles and understanding something was a bigger desire for me than building another to-do list. In fact, most of my building effort has been used by corporations to make software worse for users.

AI tools are indeed a revolution in software engineering. That said, it's easy to be a fearless advocate for revolutionary tech when, no matter what happens, you are already set for life.

I use it a lot for scaffolding and when there's a lot of bulk creation or editing to do: the busy work that normally would take a while but doesn't really require deep knowledge.

Let's maybe avoid all the hype, whether it is for or against, and just have thoughtful and measured stances on things? Fairly high points for that on this piece, despite the title. It has the obligatory remark that manually writing code is pointless now but also the obligatory caveat that it depends on the kind of code you're writing.

The last 20 years has seen a brain drain from other engineering and science into software. My guess is we'll see that finally start to reverse. Which, I think is great! How much value can yet another CRUD app possibly add to the world anymore anyway?

If it weren't for companies gatekeeping, buying up all the compute, putting a huge load on our infrastructure, and people and governments using it to surveil us, I would be more supportive of it. But right now, you've got to be insane to be supporting this technology. It's literally being used to do more harm than good. I don't see any end to this. I cannot and will not support a surveillance state in the name of progress.

I don’t think “supporting” a technology or not has the same effect as supporting a political position or lack thereof. You can be pro-AI and pro regulation of AI.

You’re right, there’s no end of things that are legal (except by making those things illegal).


> You can be pro-AI and pro regulation of AI.

Not in this climate. The laws are being circumvented by criminals. Everything is different now. You can tell yourself all you want that it's ok to support a technology that is being used to enslave us, but it's not going to change the outcome: we are still being harmed by the people who have control over the technology.

The best thing to do right now is to stop supporting the tech where it's being used by corporations that are in the business of harming people through their actions and inaction.


Not that I disagree with you, but the dangerous part is rogue government. No amount of anti-AI will make up for a government not working for the good of its people.

By not using the tech within the context of the deployments by OpenAI, Google, and Microsoft (and Anthropic?), where it does the most harm to people, and by letting people know why, we can wake people up to the harm corporations and government are causing. AI tech in the hands of criminals is exactly where the dissent should start, and voicing opposition there is where we can gain an edge over how the tech is being used as a vehicle for these harms. It's as good a place as any to start.

They are stealing trillions in assets lol

And destroying the gaming industry and altering the energy grid and pooping on the environment


The claim that users who don't adopt AI now will pay for it later (or some similar notion) is a contradiction of the bulls' own position, and people who are bullish on AI should accept that wholesale. Opus 4.5 is easier to use than GPT 3.5: it can actually code a full toy project in one shot, where you couldn't dream of that before. Opus 4.5 isn't perfect, so people do a lot of things for a competitive advantage. But anything you think you're building with all the prompt alchemy and .md rules or whatever will be useless and futile on Opus 10; every "really good practice" is instantly absorbed by the labs, so when something great is in the wild, everyone eventually benefits via the base .md files or system prompts. So even if you feel like you have a competitive advantage right now, it will evaporate, either because the labs improve their tools or because it becomes generally unnecessary in future versions of the models.

The goal of the labs is for these leaps to get even bigger with every generation. Believing that some portion of the craft will be left unexplored by the labs, or that the things that are still relatively borked now will not be worked on or fixed later, seems silly to me. Future versions will be easier to prompt, and the tools will do more of the heavy lifting of following up and re-rolling misinterpretations. I'd argue that a user who sleeps through all of this is likely to use a future version better than someone obsessing over their assumptions about how to coerce these models to work right now; current-version hyper-users will likely carry unnecessary baggage, imo.

For now, even with Opus 4.5, the time horizon for delivering a full-stack project is not significantly different than before; it's still limited by how hard you can push it. I'd argue that someone without an understanding of how things work is unlikely to get production-grade outcomes from these current versions. The point is, if you choose to keep learning and getting better at understanding and building things that work (with AI or otherwise), you'll be just fine using the versions that have fully or mostly automated the entire process. Nobody will be left behind except those who stop building altogether.


It is somewhat amusing that the pro-LLM faction increasingly co-opts their opponents' arguments—now they are turning AI-hype into anti-AI hype.

They did the same with Upton Sinclair's quote, which is now used against any worker who dares to hope for a salary.

There is not much creativity in the pro-LLM faction, which is guided by monetary interests and does not mind burning its social capital and credibility in exchange for money.


Mostly agree with the article; but the problem is not just about people getting fired, it's about the people not getting hired. How will we create experts in the future without newbies today?

There's no skill in using AI. I spent 3 hours trying to build something like a table visualiser that creates a visualization of SQL schema relationships. I wrote simple prompts, tailored them using LLMs, and fed them back into another LLM. I went on for about 2 hours iterating on outputs until it looked like what I wanted. The result? It produces an output. The outputs worked well for the most part, but the results were variable: the arrows would sometimes not be in place, sometimes you'd get 100 instead of 1, and it was slow. And what did I learn from this that I didn't already know? Zero! On the other hand, if I'd tried to figure out how to do it myself, I would have built something not only deterministic and faster, but I'd have gained some new experience and skills from solving a problem along the way.

You say that there's no skill in using AI, and then go on to explain how you used AI in an unskilled way to produce something that neither worked correctly nor taught you anything.

It strikes me that if you developed your skill set around using AI more effectively, you could have both developed a deep understanding and gotten what you wanted, and done it in less time and at higher quality than you could have done solo.

That said, the fact that you can use AI in an unskilled way to produce something kinda cool... is itself kinda cool! It means there's an on-ramp to using AI! People with no skills can get started, same day, and make stuff. And over time, can learn to make even better stuff! That's pretty cool to me.


AI doesn't do anything fundamentally new; you search for information the same way you used to through Google. The difference is that when you Googled, you understand that responsibility for the end result lies with you. Now, "users" shift responsibility to the "machine," even though they're essentially writing the program's configuration in their own language. Once you take responsibility for what your LLM writes, you'll no longer be so eager to pursue mythical "productivity."

It absolves from responsibility because "the tool was wrong". No wonder it is being hyped.

The "AI stealing jobs" is just a new-age Luddism. Yes, of course it will be painful for many, but at larger scale, even those who lose the jobs should benefit overall.

"your ability to create a mental representation of the problem to communicate to the LLM" – this is the tipping point imho. So far, you need to be good at this. That's why senior jobs are not affected yet. The question is for how long. We are probably just months away from the time when LLMs (or other form of AI) will be better at creating better "mental representation", better abstractions and better solutions, than most humans in most cases, including those in senior positions. And that will spill over to other non-dev jobs too.


> even those who lose the jobs should benefit overall

Do you have a further explanation on this?


It seems that as the tools available to developers have become more abstracted, allowing them to do more with less, their ability to command higher salaries and prestige has only grown and grown. LLMs are just a continuation of this trend.

The naive view considers only the small-scale ease of completing a task in isolation and expects compensation to be proportional to it. But that's not how things work. Yes, abstraction makes individual tasks easier to complete, but with the extra time available more can be done, and as more is done and can be done, new complexities emerge. And as an individual can do more, the importance of trust grows as well. This is why CEOs make disproportionately more than their employees: while the complexity of their work may scale only linearly with their position, or not at all beyond a certain point, the impact of their decisions grows exponentially.

LLMs are just going to enhance the power and influence of software developers.


OK, so what do I hear about LLMs? Oh, it's just like having an intern. A fresh graduate. Now you're not building the thing yourself, you're giving directives and delegating the actual building of the thing. What does this sound like?

The managerial class believes that all the value in a business comes from managerial work. LLMs are being hyped by the managerial class because they are turning software development into managerial work and eliminating "programmer" as a professional category. The key insight Milt Bryce had with PRIDE is that software is a product that can be manufactured just like any other product. The ideal software production workflow is that of a factory, and the ideal factory is staffed by no more than a man and a dog—in other words, fully automated.

So the rules of business in your father's or grandfather's time prevail once again. It's up or out. Learn people skills, learn the business, and take on more responsibilities putting those skills to use and fewer responsibilities involving code. Or find yourself increasingly irrelevant.


The great thing about working with LLMs is that you don't need people skills, even though managing them is a loose imitation of that.

You don't have to consider the feelings of your coding agent, or their specific taste, or what challenges would best help them advance in their skills or career.

You tell them to do something, and if they do it wrong you tell them what to fix, and you can keep on hammering away at them until you get the right result.

If they go too far off the tracks you reboot them with a clean slate and set them on the task again in a different direction.


> The great thing about working with LLMs is that you don't need people skills, even though managing them is a loose imitation of that.

The great thing about working with LLMs, from a business perspective—or at least the promise—is that you, as a programmer/software engineer, don't need to be building the software at all. A director on the business side could be telling the agents what to do just as they would tell a development division within the company, see it done with far less pushback and at far less cost, and stay focused on their business responsibilities like devising or implementing organizational strategy to align core competencies and achieve synergy. So again, programmers will need to transition to becoming businesspeople in order to keep their relevance within the company.


I want AI that I can command with the least possible effort, in the simplest terms, and it flawlessly does exactly what I said.

I want AI that responds instantaneously, and in a manner perfectly suited to my particular learning style.

I want AI so elegant in its form and function that I completely take it for granted.

What I'm getting instead is something clunky, slow, and flawed. So excuse me while I remain firmly in the anti-AI crowd.


Honestly, coding with a chatbot's "help" just slows me down. Also, the progress in the chatbot space is minimal (at least it feels like that from an end-user perspective), essentially nonexistent since like 2024. I only use them because all search engines are broken on purpose now. These are truly terrible times we live in, but not because the robots could replace us; rather because nontechnical managers are as detached from reality as they always were and want us to believe that.

The worst blow for me was search engines; you're so right that they are broken on purpose now, and that's a total bummer. I'm also wondering how Google is not losing money from the ads not shown in search.

It doesn't really feel like those companies care about money anymore. To me, at least, it feels like we're in the middle of an ongoing total economic collapse, and their actions seem to agree. Why else would they be stockpiling assets, infrastructure, and all the tangible stuff they avoided so far? It may sound slightly conspiracy-ish, but honestly, it's somehow the theories pushed into the mainstream that are laughable nowadays.

"Facts are facts"...

Well, yes. But an opinion on what is, indeed, a fact and not hype, is still an opinion.

Even flat-earthers can state that "facts are facts".


I generally have a lot of respect for this guy. He's an excellent coder, and really cares about his craft. I can relate to him (except he's been more successful than me, which is fine - he deserves it).

Really, one of the first things he said, sums it up:

> facts are facts, and AI is going to change programming forever.

I have been using it in a very similar manner to how he describes his workflow, and it’s already greatly improved my velocity and quality.

I also can relate to this comment:

> I feel great to be part of that, because I see this as a continuation of what I tried to do all my life: democratizing code, systems, knowledge.


LLMs are breaking open-source monetization.

Group 1 is untouched since they were writing code for the sake of writing and they have the reward of that altruism.

Group 2 are those that needed their projects to bring in some revenue so they can continue writing open-source.

Group 3 are companies that used open-source as a way to get market share from proprietary companies, using it more in a capitalistic way.

Over time, I think groups 2 and 3 will leave open source and group 1 will make up most of the open-source contributors. It is up to you to decide if projects like Redis would be built today with the monetary incentives gone.


Please note that the majority of OSS efforts were already non-monetized and deeply exploited. At least what is happening now has the potential to change the model towards a more correct one. What you see with Tailwind and similar cases is not really an open-source business model issue; it is a "low barrier to entry" business model issue, since with AI a lot of things can be done without effort and without purchasing PRO products. And documentation is also less useful, but this is a general thing, not just related to OSS software. In general, people who write OSS are, for the most part, not helped enough by the companies using their code to make money, by users, by everybody else, basically.

Very true, most of open source is group 1 and is deeply exploited already. What open-source monetization model do you see as the correct one?

One thing related to programming jobs and AI: my opinion is that AI will create far more jobs than we’ve ever had.

Right now, there’s a limit to how widely software is adopted, largely based on software quality and cost. AI will improve software quality (for example, you can add a ton of automated tests even if you don’t use AI to develop features) and reduce the cost of building software.

That will lead to better software—and software we didn't build in the past because it was too complex, or so niche that we weren't sure we could make enough profit to justify the development costs. It will also change many other industries, but I think generally for the better: more ways to create new things, more variations, and more customization for specific purposes.


> many times fundamental architectural issues cripple any attempt at prompting my way out of it, even though I've been quite involved step-by-step through the whole prototyping phase.

This doesn't make sense to me.

Surely if you were "quite involved step-by-step through the whole prototyping phase" you would have been able to prevent architectural mistakes being made?

What does your process really look like?

I don't "vibe code" in the sense that I have it build entire apps without looking at the code; I prompt it to write maybe about the 100-200 lines of code I need next after thinking about what they should look like.

I don't see how you get architectural issues creeping in if you do it that way.


This is through the lens of a programmer (and novel writer), but how should artists adapt to the AI hype? Should they be prompting ideas? Should they generate mock-ups? I don't see a world where digital art benefits from AI as much as programmers do, and I feel for that field.

As a junior developer, I am nervous. What used to be given to junior devs is now being crunched by LLMs; how do I future-proof myself?

It's not hype. There's no such thing as AI. Matrix multiplication isn't intelligence.

So, by "AI", you mean programming AI. Generalizing it as "AI" and "anti-AI" is adding great confusion to the already dizzying level of hype.

At its core, AI has the capability to extract structure/meaning from unstructured content, and vice versa. Computing systems and other machines have required inputs with limited context. So far, it has been a human's job to prepare that structure and context and provide it to the machines. That structure can be called a "program" or "form data" or "a sequence of steps or lever operations or button presses".

Now the machines have this AI wrapper or adapter that enables them to extract the context and structure from natural, human-formatted or messy content.

But all that works only if the input has the required amount of information and inherent structure. Try giving it a prompt with a jumbled-up sequence of words. So it's still the human's job to provide that input to the machine.


> Hours instead of weeks.

And then goes on describing two things for which I bet almost anyone with enough knowledge of C and Redis could implement a POC in... Guess what? Hours.

At this point I am literally speechless, if even Antirez falls for this "you get so quick!!!" hype.

You get _some_ speed up _for things you could anyway implement_. You get past the "blank screen block" which prevents you from starting some project.

These are great useful things that AI does for you!

Shaving off _weeks_ of work? Let's come back in a couple of months, when he'll have to rewrite everything that AI has written so well. Or that code will just die away (which is another great use case for AI: throwaway code).

People still don't understand that writing code is a way to understand something? Clearly you don't need to write code for a domain you already understand, or that you literally created.

What leaves me sad is that this time it is _Antirez_ that writes such things.

I have to be honest: it makes me doubt my position, and I'll constantly reevaluate it. But man. I hope it's just a hype post for an AI product he'll release tomorrow.


> How do I feel, about all the code I wrote that was ingested by LLMs? I feel great to be part of that, because I see this as a continuation of what I tried to do all my life: democratizing code, systems, knowledge. LLMs are going to help us to write better software, faster, and will allow small teams to have a chance to compete with bigger companies.

Every now and then I post the same exact comment here on HN: where the heck are the products, then? Or where is the better outcome? The faster software? Let alone small teams competing with bigger companies?

We are NOT anti-AI; we're exhausted from reading BS from AI astroturfers or wannabe AI tech influencers. It's so exhausting that it's always your fault that you're not "using the tool properly", and that you're going to be left behind. I'm not anti-AI; I just wish the bubble would pop so that, instead of fighting back BS from managers who "read that on HN", I can go back to coding with and without AI, wherever it applies to my needs.


This is how I feel too. Let me try to itemize it:

how AI speeds me up:

- no longer have to remember how to set up unit test boilerplate in each of the 6ish programming languages I commonly use

- can often vaguely gesture at an existing pattern and have AI "copy-paste" it into new code. "do that read-through cache pattern like you see there and there but do it for this table and this proto msg type."

- can quickly answer questions like "does anyone in the code seem to build this string manually instead of using the library/helper method for it"

- can quickly generate code like "all I want is a gosh dang PKCS-formatted key, why is that so hard for this library", which the docs did not provide (a sketch of the kind of snippet I mean is just below)

which is really cool. It absolutely speeds things up by 10-100x in some scenarios; a lot of the sucky part of programming is getting mired down in these kinds of messes.
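
For example, on the key item above: the kind of snippet I'm after looks roughly like this (this sketch happens to use the Go standard library; the actual case involved a different language and library, so treat it as illustrative only):

    package main

    import (
        "crypto/ecdsa"
        "crypto/elliptic"
        "crypto/rand"
        "crypto/x509"
        "encoding/pem"
        "fmt"
        "log"
    )

    func main() {
        // Generate a P-256 key and serialize it as PKCS#8, PEM-encoded.
        key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
        if err != nil {
            log.Fatal(err)
        }
        der, err := x509.MarshalPKCS8PrivateKey(key)
        if err != nil {
            log.Fatal(err)
        }
        block := &pem.Block{Type: "PRIVATE KEY", Bytes: der}
        fmt.Print(string(pem.EncodeToMemory(block)))
    }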

how AI slows me down:

- have to explain to jr dev why, even though it has unit tests, the AI-generated bespoke mutex async cache is not going into our production codebase

- have to explain to PM why I cannot let them vibe code new features into the hot path of our prod services when they are not on-call to be forced to clean it up when it explodes at 3am

- have to explain to senior dev who should REALLY know better why you cannot _just_ ask someone to review a 2000 LOC PR

- have to explain to CEO in tremendous itemized, evidenced detail why [big project in eye of sauron] did not go noticeably faster than it did 2 years ago even though the team was hand-picked to be full of people he knew would use AI as much as he wanted them to.

- have to explain to CEO why I really wish he would stop playing with AI and bothering the crap out of the engineers and go back to actually doing whatever it is a CEO gets paid 10-100x a software engineer's salary to do. [actually still trying to figure this one out without getting fired.]

I'm as interested in AI use as anyone can be, when I have to put up with sycophantic "believers" who really wish they could replace me entirely with the chatbot.

Also, this shit is expensive and still being sold at a loss. I signed up for Amp and blew through my $10 of signup credit getting very little done. I'm certainly not paying my own money for that.


Random thought: What if "AI-assisted programming" becomes "human-assisted programming" instead? We human programmers no longer play the main role in producing code, and we become the "Copilot"?

Often when I look closely at the output of LLM generated code, I see repetition, redundant logic and deeply hidden bugs.

Notwithstanding the above, to my understanding LLM services are currently being sold below cost.

If all of the above is true, at some point the degradation of quality in codebases that use these tools will be too expensive to ignore.


I converted Linenoise to wide characters years ago.

https://www.kylheku.com/cgit/txr/tree/linenoise/linenoise.c


Where is this anti-AI hype? We are seeing 100x as many videos of Claude Code & vibe coding, and then maybe we get 1 or 2 people saying "Maybe we should be cautious".

I would count about two-thirds of the comments in this thread as anti-AI hype, and this thread is pretty mild in that regard compared to most other threads here about AI for code.

And this is Hacker News, which you might expect to attract people who thrive on exploring the edges of weird new technology!


I mean, most of us don't work on our own thing or on open source, so making badly thought-out and designed features faster isn't really a dream. Software already has so much bloat and slop that this way of working just scares us.

I don't have decades of experience under my belt, but I feel like the reaction is happening mostly because it is the first time that developers are at the risk of being automated out of work. "Learn a new field" is easy to say when you are not the one that will need to do it. Now a lot of developers are afraid of having to follow the advice that they gave to a lot of workers.

I don't believe that AI will put most of the working force out of jobs. That would be so different from what we had in history that I think the chances are minimal. However, they are not zero, and that is scary as fuck for a lot of people.


This is literally true: we have been automating other people out of their jobs without empathy for ages, so it makes sense that at some point the knife would fall on us. Because of the low solidarity we have shown with others, and even with our fellow programmers, I guess we deserve it. My real worry at this point is that the most destructive ones will carry on, and only the destructive programmers will be safe.

Honestly, "Maybe we should be cautious" seems akin to concern trolling.

It was weird to read this. I know antirez is on HN, so it's strange to say this, but here goes...

I always looked up to antirez. Redis was really taking off after I graduated and I was impressed by the whole system and the person behind it. I was impressed to see them walk away to do something different after being so successful. I was impressed to read their blog about tackling difficult problems and how they solved them.

I'm not a 10x programmer. I don't chase MVPs or shipping features. I like when my manager isn't paying attention and I can dig into a problem and just try things out. Our database queries have issues? Maybe I can write my own AST by parsing just part of the code. Things like that.

I love BUILDING, not SHIPPING. I learn and grow when I code. Maybe my job will require me to vibe code everything some day just to keep up with the juniors, but in my free time I will use AI only enough to help speed up my typing. Every vibe coded app I've made has been unmaintainable spaghetti and it takes the joy out of it. What's the point of that?

To bring it all together, I guess some part of me was disappointed to see a person that I considered a really good programmer, seem to indicate that they didn't care about doing the actual programming?

> Writing code is no longer needed for the most part

> As a programmer, I want to write more open source than ever, now.

This is the mentality of the big companies pushing AI. Write more code faster. Make more things faster. Get paid the same, understand less, get woken up in the middle of the night when your brittle AI code breaks.

Maybe that's why antirez is so prolific and I'm not.

Sometimes I wish I was a computer scientist, instead of a programmer...


I care a lot about programming, but I want to do programming in a way that makes me special compared to machines. When the LLM hits a limit, and I write a function in a way it can't compete with, that is good. If I write a very small program that is like a small piece of poetry, this is good human expression. But if I need to develop a feature, and I have a clear design idea, and I can do it in 2 hours instead of 2 weeks, how can I justify to myself using a lot more time just for what I love? That would be too much ego, I believe. So even if for me, too, this is painful as a transition, I need to adapt. Fortunately I also enjoyed the design / ideas process a lot, so I can focus on that. And write code myself when needed.

> To bring it all together, I guess some part of me was disappointed to see a person that I considered a really good programmer, seem to indicate that they didn't care about doing the actual programming?

My take on this is that we as a society are now on the verge of transitioning towards programming as an art form. And the methodologies of art vs non art programming are vastly different.

Take clothes, for example. Manufacturing is vastly optimized for throughput, but its art form is heavily optimized for design and customization. Maybe that is what all this is about now with programming, too?

I too would think of myself as someone who likes to code for the sake of explorative understanding and optimization. I'm pretty bad at the last 10%, like _reeeally_ bad actually.

But I am aware that the methodology of programming is changing. And currently I believe that design and customization might also change in parts, because a lot of successful LLM- / slop-coded projects were optimizing for something like text-in-the-loop: they started with a terminal CLI and made it a real design later, because the LLM agent was able to parse and understand CLI / TTY characters.

Maybe this is what it's actually about. Maybe we need to optimize things for text now so that LLMs can help us more in these topics?

I'm thinking lately a lot about scene graphs and event graphs and how to make them serializable so that I can be more efficient in generating UIs. Sorry for babbling, maybe these are just thoughts I'm gonna regret in the future.


> My take on this is that we as a society are now on the verge of transitioning towards programming as an art form.

It already was. This just makes it a subscription service.


"but the open models, especially the ones produced in China, continue to compete (even if they are behind) with frontier models of closed labs"

You will still need hardware to run those open models, and that avenue is far easier to contain and close than stopping code distribution. Expect the war on private/personal compute to ramp up even more significantly than it already has.


There are different opinions on this:

https://spectrum.ieee.org/ai-coding-degrades


Everyone on both sides of this argument seems like they won’t be satisfied until everyone comes to their side. The maximalists want us to submit to the AI godhead. The doomers want us to go back to writing assembly longhand on paper.

I’m starting to think of AI use more like a dietary choice. Most people are omnivores. Some people are vegans. Others are maxing protein. All of them can coexist in society and while they might annoy each other if the topic comes up, for the most part it’s a personal choice.


> However, this technology is far too important to be in the hands of a few companies.

I worry less about the model access and more about the hardware required to run those models (i.e. do inference).

If a) the only way to compete in software development in the future is to outsource the entire implementation process to one of a few frontier models (Chinese, US or otherwise)

and b) only a few companies worldwide have the GPU power to run inference with those models in a reasonable time

then don't we already have a massive amount of centralization?

That is also something I keep wondering about with agentic coding - being able to realize, in a couple of afternoons, the epic fantasy hobby project you've been thinking about on and off for the last few years is absolutely amazing. But if you do the same with work projects, how do you solve the data protection issues? Will we all now just hand our entire production codebases to OpenAI or Anthropic etc. and hope their pinky promises hold?

Or will there be a race for medium-sized companies to have their own GPU datacenters, not for production but solely for internal development and code generation?


Here's how AI coding has helped me : to beat procrastination.

Often while trying to fall asleep, I'll be thinking something like "I need my app to do such and such".

The next day, instead of forcing myself to start coding, I can literally say to IntelliJ Junie (using Claude) exactly that: "I need my app to do such and such". I'm often pleasantly surprised by the outcome. And if there's anything that needs to be tweaked, I'm now in the mode of critiquing and editing.


Programming is essentially automation. You tell the machine what to do character by character, and if you get it right, the machine will be able to correctly interpret your intention, transform it into a lower level code, and then execute it.

AI is also automation but the instructions are given in a higher level language. You still have to know how to automate it. You need to instruct the machine in sufficient detail, and if done correctly the machine will once again be able to interpret your intention, transform it to a lower level code, and execute it for you.


"sufficient detail, and if done correctly" -> the machine will once again be able to interpret your intention ...

This does not actually follow from the way LLMs work.


People here generalise vibe coders into a single category. I don't write code line-by-line the traditional way, but I do understand architecture deeply. Recently I started using AI to write code: not by dumping random prompts and copy-pasting blindly, but inside VS Code, reviewing what it generates, understanding why it works, and knowing exactly where each feature lives and how it fits. I also work with a frontend developer (as I do backend only and am not interested in building UI and CSS) to integrate things properly, and together we fix bugs and security issues. Every feature built with AI works flawlessly because it's still being reviewed, tested, and owned by humans.

If I have a good idea and use AI to code it, without depending on a developer (and the friction that brings) due to a limited budget, why do people think it's a sin? Is the implication that if you don't have VC money to hire a team of developers, you're supposed to just lose? I saw the exact same sentiment when tools like Elementor started getting popular among business owners. Same arguments, same gatekeeping. The market didn't care. It feels more like insecurity about losing an edge. And if the edge was "I type code myself", that edge was always fragile.

Edit: The biggest advantage is that you don't lose anything in translation. There's no gap between the idea in your head and what gets built.

You don’t spend weeks explaining intent, edge cases, or what I really meant to a developer. You iterate 1:1 with the system and adjust immediately when something feels off.


The company I work for is spending the equivalent of a dozen engineers' yearly salaries in tokens every month. It recently surpassed our monthly cloud bill.

We have people who are running the same task 10 times in parallel and having one LLM write a prompt for another LLM to execute, then sitting on their phone for an hour while they let the AIs battle it out. For tasks that should take 3 minutes. Then having another coding agent make a PR, update JIRA tickets, etc.

Frankly, it blows my mind that so many developers have so little actual understanding of the costs associated with AI.


> Frankly, it blows my mind that so many developers have so little actual understanding of the costs associated with AI.

I've seen all the same bullshit as you. We have a Claude slash command to add one of 4 labels to a PR. We have people using AI to move their JIRA tickets across the swim lanes. People kick off Codex tasks to rename a variable.

But I've literally heard execs at my place of work, while cheering this behavior on, say that the actual concern is why isn't everyone doing this, and that if you aren't spending enough, then maybe you aren't being a forward thinker.


I don't think _software_ is very interesting in the whole AI debate. It's perhaps interesting from a jobs perspective or an economic perspective. But the whole "anti-AI" thing is much deeper than that. My main objections to AI are the evaporation of truth, and of art.

We now have top-chart hits which are soulless AI songs. It's perhaps a testament to the fact that some of the genres where this happens a lot were already trending towards industrially produced songs with little soul in them (you know what genres these are, and it's hilarious that one of them). But most concerning to me is the idea that, starting now, we'll never be able to trust our eyes about what's true.

We can't trust that someone who calls us is human, or that a photo or recording is of a real event. This was always true in some sense, but at least it required a ton of effort to pull off. Now it's going to be trivial. And for every photo depicting an actual event, there will be a thousand depicting non-events. What does that do to the most important thing we have as a society: the "shared truth"? The decay of traditional media already put a big dent in this - with catastrophic results. AI will make it 10x worse.


Sounds like antirez, simonw, et al are still advocating reviewing the code output of these agents for now. But presumably soon (within months?) the agents will be good enough such that line-by-line review will no longer be necessary, or humanly possible as we crank the agents up to 11.

But then how will we review each PR enough to have confidence in it?

How will we understand the overall codebase too after it gets much bigger?

Are there any better tools here other than just asking LLMs to summarize code, or flag risky code... any good "code reader" tools (like code editors but focused on this reading task)?


We will review fully until they reach superhuman perfection.

AI works for Antirez because he's already a master of his domain

I want to know if any content has been made using AI or not.

There really should be a label on the product to let the consumer know. This should be similar to Norway, which requires disclosure of retouched images. I can't think of any other way to help with the body image issues that arise from pictures of people looking the way they never can in real life.


Tangential. I make a clear distinction between corporate AI and open source AI. I think we may be at a turning point where we can build great solutions without giving in to corporations.

For me one of the real benefits has been no longer feeling "stuck" on tricky problems and losing momentum. I can work with an LLM to generate a solution to something that would previously cause me confusion, which would lead to distraction, which would cause loss of productivity, etc.

Universal Basic Income is not the panacea it's claimed to be.

UBI gives the government more control over individuals' finances, especially those without independent means. Poverty is also the result of unfair taxation, where poor people face onerous taxes while receiving less and less in return, and the wealthy avoid tax at every turn. Or of the fact that it is difficult for people to be self-employed due to red tape favouring big business. UBI does not address those issues.

UBI also centralises control at the expense of local self-determination and community engagement.


Yeah, it's quite unbelievable people are still repeating the "AI everything -> UBI -> everyone comfortable and well-fed" line.

Who's to say the governments won't just let society continue on the exact same trajectory it was before generative AI: "Lost your job? Too bad for you. Here's a gig economy and maybe some minimal food stamps. Innovation marches on!"


I don't think UBI will be enough. All existing debt needs to be erased as well. Otherwise, UBI means nothing if it all goes to repay old debt. If UBI is less than my mortgage, bills and other loan repayments, what good is it?

Debt is a major problem for sure, and a cornerstone of our economic system. There should have been mandatory debt protection during that lockdown but there wasn't.

UBI potentially leads to inflation. If everyone has X amount of income then rents and prices go up accordingly.

Taxation is totally unfair. 20% of most of what we buy here goes into government coffers, raising our cost of living. We get less and less in return as public services are slashed. Add to that other taxes, and it is the government, not just corporations, who are major instigators of debt and the poverty trap...


I assume you're referring to VAT, and in the UK? It feels like they let us keep 20% of what we make these days... Don't get me started on poverty traps. The realisation and reality of all of this has recently hit me like a ton of bricks.

Shallow dismissals are against HN Guidelines, https://news.ycombinator.com/newsguidelines.html

The term "shallow" is entirely subjective here. I have edited the comment accordingly because of your reaction.

Thanks for expanding the comment. For downvoters, the original comment was "Universal Basic Income is not the panacea it's claimed to be."

This is a great article, and it's very much aligned with my experiences.

I hope AI leads to a Cambrian explosion of software people running their own businesses, given the force multiplier it affords. On the other hand, the jaded part of me feels that AI may lead to a consolidation into a very small set of monopolies. We'll see.


The paragraph that was started with this sentence:

> However, this technology is far too important to be in the hands of a few companies.

I wholeheartedly agree 1000%. Something needs to change this landscape in the US.

Furthermore, the open-source model landscape being dominated by China is also problematic.


>Yes, maybe you think that you worked so hard to learn coding, and now machines are doing it for you. But what was the fire inside you, when you coded till night to see your project working? It was building.

Nope. It was coding. Enjoying the process itself.

If I wanted to hand out specs and review code (which is what an AI jockey does), I'd have fucking project managers as role models, not coders...



About AI writing Code, fixing bugs, and other programmer tasks.

What's missing (or not captured) is the testing of the changed software, to verify the fixes solved the problem and no other problems were introduced...

Then an analysis of the original software changes, and an analysis of the test results, test cases, and test evidence to ensure they are appropriate and adequate.


There is a real aspect of confession in articles like this one - I mean in the religious sense. They come across to me as attempts to justify decisions or actions in the face of overwhelming internal contradictions where the intended audience is just the author’s own conscience. I don’t at all buy the explicit argument that “if we don’t act now - not just by adopting AI, but BY VOTING!! - then all is lost”. We already know about voting; how is the URGENT mass adoption - even with decentralisation - by software developers of llms going to drive social change or alleviate political crisis? Especially given this is a technology about which the author is so obviously profoundly conflicted? Why the hand-wringing and vacillation? Does it really matter in the long run if sceptics take their time in evaluating these tools and even end up rejecting them? Why really do we need to be convinced or turned away from the anti ai hype?

There is additionally some kind of implicit historical recourse to the Industrial Revolution and the revolutionary politics it is associated to, where software developers, cast as the cottage industry weavers etc. are seen as walking blindly into their mass replacement by machines, with the implication that those machines will be able to be managed by de-skilled labour whose role will be simply to ensure their smooth and safe running. I think it is important to try and see things in this way but also there is a lot lacking from the analogy.


antirez gave us reddit - but somehow I think the part that he and other smart folks who talk about A.I so much forget is agency / self-sufficiency.

If A.I writes everything for you - cool, you can produce faster? But is it really true if you're renting capacity? What if costs go up and now you can't rent anymore - but you also can't code anymore, and the documentation is no longer there (because of MCP etc., the assumption is that everything will be done by agents) - then what?

What about the people that work on messy 'Information Systems'? Things like Redis are impressive, but that's closed-loop software, just like compilers.

Some smart guy back in the 80s wrote that it's always a people problem.


Redis, not Reddit. :)

We used to lay the bricks, now we design the pyramids.

Reads like AI slop. This is also quite annoying: quality appears to be going downhill constantly. I am not saying that the old-school blogs were all great in quality, but now I look at an article for signs of AI-generated text, and if I see them - or think I see them - I become more critical automatically. Even without that, it seems the quality of writing has gotten worse. I remember that years ago people complained that handwriting got worse (I always had awful handwriting, well before using computers); but now it seems it has also affected the brain. There seems to be some disconnect between what is thought and what is written down.

If you think antirez wrote that with AI then your AI detection instincts are way off.

It reads like someone with (good) English as a second language. LLMs don't write like that.


It wouldn't be a bad thing though would it? Because AI is in fact amazing.

I don't like reading undisclosed AI writing. It feels disrespectful of my time.

I 100% agree with the sentiment. To me, it feels like most bloggers nowadays think only in very superficial abstract ideas, and then prompt ChatGPT for a blog post.

I think the best hope against AI is copyright. That is, AI-generated software has none: everyone is free to steal and resell it, and those who generated it have zero rights to complain or take legal action.

"It does not matter if AI companies will not be able to get their money back and the stock market will crash. All that is irrelevant, in the long run."

Seriously? If these were open source tools that anyone could run on their home PC that statement would make sense, but that's not what we are talking about here. LLMs are tools that cost massive amounts of money to operate, apparently. The tool goes away if the money goes away. Fossil fuels revolutionized the world, but only because the cost benefit made sense (at least in the relative short-term).


Adding to this, if AI goes away we are left with a generation of people that do not understand the code that AI wrote, and older generations that eventually retire out. This is nearly on par with the destruction of the thinking machines in the fictional Dune universe, where they essentially have to build religious groups to continue functioning as a society and create guilds to replace the functions of the thinking machines.

In my current work project I am consulting an LLM frequently as a type of coding search engine. I also use it to rubber-duck my designs. Most of the coding was done by me, though. But even that feels perhaps quaint, and I feel like it may be wasting time.

Not even antirez can sway the skeptics here. People that have garnered too many upvotes in the countless comments about how worthless AI is compared to real programmers will need much more to leave their fortresses.

But maybe we should cherish these people. Maybe it's among them we find the embryo to the resistance - people who held out when most of us were seduced - seduced into giving the machine all our knowledge, all our skills, all the secrets about us we were not even aware of ourselves - and setting it up to be orders of magnitude more intelligent than any of us, combined. And finally - just as mean, vindictive and selfish as most of the people in the training data on which it was trained.

Maybe it's good to stay skeptical a bit longer.


Vibe coders are so insistent that the rest of us adopt their shitty tooling because they need other people coding slop too, to justify their lack of effort.

If programmers keep up good coding practices then the vibe coders are the ones left behind


There is too much money invested in AI. You can't trust anyone talking about it.

I am not sure why the OP is painting it as an "us-vs-them", pro- or anti-AI thing. AI is a tool. Use it if it helps.

I would draw an analogy here between building software and building a home.

When building a home we have a user providing the requirements, the architect/structural engineer providing the blueprint to satisfy the reqs, the civil engineer overseeing the construction, and the mason laying the bricks. Some projects may have a project-manager coordinating these activities.

Building software is similar in many aspects to building a structure. If developers think of themselves as masons, they are limiting their perspective. If AI can help lay the bricks, use it! If it can help with the blueprint or the design, use it. It is a fantastic tool in the profession's tool belt. I think of it as a power tool and want to keep its batteries charged so it can be used at any time.


The impact of the advanced ML models we are calling AI is underhyped.

The reason I am "anti-AI" is not that I think LLMs are bad at what they do, nor that I'm afraid they'll take my job. I use CC to accelerate my own work (it's improved by leaps and bounds, though I still find I have to keep it on a short leash because it doesn't always think things through enough). It's also a great research tool (search on steroids). It's excellent at summarizing long documents, editing and proofreading, etc. I use it for all those things. It's useful.

The reason I am anti-AI is because I believe it poses a net-negative to society overall. Not because it is inherently bad, but because of the way it is being infused into society by large corps (and eventually governments). Yes, it makes me, and other developers, more productive. And it can more quickly solve certain problems that were time consuming or laborious to solve. And it might lead to new and greater scientific and technological advances.

But those gains do not outweigh all of the negatives: the concentration of power and capital into an increasingly small group; the eventual loss of untold millions of jobs (with, as of yet, not even a shred of indication of what might replace them); the loss of skills in the next generations, who are delegating much of their critical thinking (or thinking, period) to ChatGPT; the loss of trust in society now that any believable video can be easily generated; the concentration of power in the control of information if everyone is getting their info from LLMs instead of the open internet (and ultimately, potentially the death of the open internet); the explosion in energy consumption by data centers, which exacerbates rather than mitigates global warming; and plenty more.

AI might allow us to find better technological solutions to world hunger, poverty, mental health, water shortages, climate change, and war. But none of those problems are technological problems; technology only plays a small part. And the really important part is being negatively exacerbated by the "AI arms race". That's why I, who have been a technological optimist my whole life, am no longer hopeful for the future. I wish I was.


This. The only thing AI will do, at the societal level, if it truly succeeds, is insanely amplify the power imbalance we already suffer under. Thinking that the benefits of AI will be "democratic" is staggeringly naïve.

It's obvious that AI, if it succeeds, will be primarily used to make people, even as physical beings, redundant.

From TFA:

> the more people get fired, the more political pressure there will be to vote for those who will guarantee a certain degree of protection

This is daydreaming. Just look at the US. "Political pressure" is not a thing.

There will be war.


Don't fall into the AI hype.

> But I'm worried for the folks that will get fired. It is not clear what the dynamic at play will be: will companies try to have more people, and to build more?

This is the crux. AI suddenly became good and society hasn't caught on yet. Programmers are a bit ahead of the curve here, being closer to the action of AI. But in a couple of years, if not already, all the other technical and office jobs will be equally affected. Translators, admin, marketing, scientists, writers of all sorts and on and on. Will we just produce more and retain a similar level of employment, or will AI be such a force multiplier that a significant number or even most of these jobs will be gone? Nobody knows yet.

And yet, what I'm even more worried about, for their society-upending abilities, is robots. These are coming soon, and they'll arrive with just as much suddenness and inertia as AI did.

The robots will be as smart as the AI running them, so what happens when they're cheap and smart enough to replace humans in nearly all physical jobs?

Nobody knows the answer to this. But in 5 years, or 10, we will find out.


In one scenario, programmers get replaced and then progress slows, thus saving the jobs of writers, lawyers, marketers, scientists, and artists. At this point I am okay with that scenario, seeing how programmers have shown no solidarity while every other field has been rejecting AI. Lawyers have even started hiring junior lawyers back, and the art industry has basically shoved AI into a bin of irrelevance.

> Art industry

I don't agree, unless by "art industry" what you actually mean is "art establishment".

If we broaden it to mean "anywhere that money is paid, or used to be paid, to people for any kind of artistic endeavor" - even if we limit that to things related to drawing, painting, illustrating, graphic design, 3d design etc. - then AI is definitely replacing or augmenting a ton of human work. Just go on any Photoshop forum. It's all about AI now, just like everywhere else.


> Lawyers have even started hiring junior lawyers back

I assure you that this isn't anything like the level before.

Lawyering has changed forever.


I'm sure it will go in the worst way possible: demand for code will not expand at nearly the same rate at which coding productivity will increase; the vast majority of coders will become permanently jobless, and the rest will become disposable cheap labor, simply due to the overabundance of them.

This is already happening.

AI had an impact on simplest coding first, this is self-evident. So any impact it had, had to be on the quantity of software created, and only then on its quality and/or complexity. And mobile apps are/were a tedious job with a lot of scaffolding and a lot of "blanks to fill" to make them work and get accepted by stores. So first thing that had to skyrocket in numbers with the arrival of AI, had to be mobile apps.

But the number of apps on the Apple App Store is essentially flat, and the rate of increase is barely distinguishable from past years: +7% instead of +5%. Not even visible.

Apparently the world doesn't need/can't make monetisable use of much more software than it already does. Demand wasn't quite satisfied say 5 years ago, but the gap wasn't huge. It is now covered many times over.

Which means, most of us will probably never get another job/gig after the current one - and if it's over, it's over and not worth trying anymore - the scraps that are left of the market are not worth the effort.


Ah yes, AI is so good that they had to break search engines to force people into using them

I wonder if being a literal AI sci-fi author, antirez acknowledges that there's possible bias and willingness to extrapolate here? That said, I respect his work immensely and I do put a lot of weight to his recommendations. But I'd really prefer the hype fog that's clouding signal [for me] to dissipate a bit - maybe economic realities will sort this out soon.

There's also a short-termism aspect of AI generated code that's seemingly not addressed as much. Don't pee your pants in the winter to keep warm.


"Nah uh I'm not falling for hype _you're_ falling for hype."

"Die a hero or live long enough to see yourself become the villain"

AI is both a near-perfect propaganda machine and, on the programming front, a self-fulfilling prophecy: yes, AI will be better at coding than humans. Mostly because humans are made worse by using AI.


It's a zero sum game. AI cannot innovate, it can only predictively generate code based on what it's already seen. If we get to a point where new code is mostly or only written by AI, nothing new emerges. No new libraries, no new techniques, no new approaches. Fewer and fewer real developers means less and less new code.

Nonsense indeed. The model's knowledge is the current state of the art. Any computation it does advances it. It re-ingests the work of prior agents every time you run it on your codebase, so even though the model initializes the same way (until they update the model), upon repeated calls it ingests more and more novel information, inching the state of the art ever forwards.

Current state of the art? You must be joking... I see code it has generated; some interns do better.

Obviously, you are also joking about the idea that AI is immune to consanguinity, right?


If you have had interns who can write better code than Opus 4.5 I would very much like to hire them.

Nonsense. LLMs can easily build novel solutions based on my descriptions. Even in languages and with (proprietary) frameworks they have not been trained on, given a tiny bit of example code and the reference docs.

That's not novel; it's still applying techniques it's already seen, just on a different platform. Moreover, it has no way of knowing if its approach is anywhere near idiomatic on that new platform.

I didn't say the platform was the novel aspect. And I'm getting pretty idiomatic code actually, just based on a bit of example code that shows it how. It's rather good at extrapolating.

> Mostly because humans are made worse by using AI.

I'm confident you are wrong about that.

AI makes people who are intellectually lazy and like to cheat worse, in the same way that a rich kid who hires someone to do their university homework for them is hurting their ability to learn.

A rich kid who hires a personal tutor and invests time with them is spending the same money but using it to get better, not worse.

Getting worse using AI is a choice. Plenty of people are choosing to use it to accelerate and improve their learning and skills instead.


[not an AI booster] I think you are the target of this article. I believe you are misunderstanding the current capacity of AI.

I think I spend too much time at work fixing the greatness of AI.

Are you hand-fixing the issues or having AI do it? I've found that second-pass quality is miles ahead of the initial implementation. If you're experienced, you'll know exactly where the code smells are. Point them out, and the agents will produce a much better implementation in this second pass. And have those people store the prompts in the repo! I put my specifications in ./doc/spec/*.md

Every time I got bad results, looking back I noticed my spec was just vague or relied on assumptions. Of course you can't fix your colleagues; if they suck they suck, and somebody's gotta do the mopping :)


I think it would make sense to have these issues bubble up into the public consciousness of hackernews.

I've never used AI to code; I'm a software architect and currently assume I'd get little value out of an LLM. It would be useful for me if this debate had a vaguely engineering-smelling quality to it, because it's currently just two groups shouting at each other and handwaving criticism away.

If you actually deal with AI generated problems, I love it, please make a post about it so we have something concrete to point to.


PRs where somebody clearly doesn't know the tech being used well enough, or enough about how the complex app they're working on really works, and thus can't tell a good design from a bad one for the feature they're working on, but has AI-assisted themselves to something which "works", can become an absolute death spiral.

I wasted so much work time trying to steer one of these towards the light, which is very demotivating when design and "why did you do this?" questions are responded to with nothing but another flurry of commits. Even taking the time to fully understand the problem and suggest an alternative design which would fix most of the major issues did nothing (nothing useful must have emerged when that was fed into the coin slot...)

Since I started the review, I ended up becoming the "blocker" for this feature when people started asking why it wasn't landed yet (because I also have my own work to do), to the point where I just hit Approve because I knew it wouldn't work at all for the even more complex use cases I needed to implement in that area soon, so I could just fix/rewrite it then.

From my own experience, the sooner you accept code from an LLM, the worse a time you're going to have. If it wasn't a good solution, or was even the wrong solution from the get-go, no amount of churning away at the code with an LLM will fix it. If you _don't know_ how to fix it yourself, you can't suddenly go from reporting your great progress in stand-ups to "I have nothing" - maybe backwards progress is one of those new paradigms we'll have to accept?


Here is a sample

We are talking about a "stupid" tool that parses a google sheet and makes calls to a third-party API

So there is one google sheet per team, with one column per person

One line per day

And each day, someone is in charge of the duty

The tool grabs the data from the sheet and configures pagerduty so that alerts go to the right person

Very basic, no cleverness needed, really straightforward actually

So we have 1 person that wrote the code, with AI. Then we have a second person that checked the code (with AI). Then the shit comes to my desk. To see this kind of cruft:

  def create_headers(api_token: str) -> dict:
    """Create headers for PagerDuty API requests.

    Args:
        api_token: PagerDuty API token.

    Returns:
        Headers dictionary.
    """
    return {
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Authorization": f"Token token={api_token}",
        "Content-Type": "application/json",
    }
And then, we have 5 usages like this:

  def delete_override(
    base_url: str,
    schedule_id: str,
    override_id: str,
    api_token: str,
  ) -> None:
    """Delete an override from a schedule.

    Args:
        base_url: PagerDuty API base URL.
        schedule_id: ID of the schedule.
        override_id: ID of the override to delete.
        api_token: PagerDuty API token.
    """
    headers = create_headers(api_token)

    override_url = f"{base_url}/schedules/{schedule_id}/overrides/{override_id}"
    response = requests.delete(override_url, headers=headers, timeout=60)
    response.raise_for_status()


No HTTP keep-alive, no TCP connection reuse, the API key is passed down to every method, and so is the API's endpoint. The timeout is defined in each method. The file is ~800 lines of Python code, contains 19 methods, and only deals with PagerDuty (not the Google Sheet). It took 2 full-time days.
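For illustration, a minimal sketch of the kind of structure being described above, using a requests.Session so the TCP connection is reused (keep-alive) and the token, base URL and timeout live in one place. The class and method names here are hypothetical, not the actual code under review:

  import requests

  class PagerDutyClient:
      """Hypothetical sketch: one Session, one place for auth, base URL and timeout."""

      def __init__(self, base_url: str, api_token: str, timeout: int = 60):
          self.base_url = base_url
          self.timeout = timeout
          # A Session reuses the underlying TCP connection across calls (keep-alive)
          # and carries the common headers, so no per-call create_headers().
          self.session = requests.Session()
          self.session.headers.update({
              "Accept": "application/vnd.pagerduty+json;version=2",
              "Authorization": f"Token token={api_token}",
              "Content-Type": "application/json",
          })

      def delete_override(self, schedule_id: str, override_id: str) -> None:
          """Delete an override from a schedule."""
          url = f"{self.base_url}/schedules/{schedule_id}/overrides/{override_id}"
          response = self.session.delete(url, timeout=self.timeout)
          response.raise_for_status()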

These people fail to produce anything meaningful, this is not really a surprise given their failure to do sane things with such a basic topic

Does AI bring good ideas? Obviously no, but we knew this. Does AI improve the quality of the result (regardless of the quality of the idea)? Apparently no. Does AI improve productivity? Again, given this example, no. Are these people better, more skilled, or anything else? No.

Am I too demanding? Am I asking too much?


Try pasting that full code into Claude and prompting:

> No HTTP keep-alive, no TCP reuse, the API key is passed down to every method, so is the API's endpoint. Timeout is defined in each method. Fix all of those issues.


AI is a wonderful tool that will answer all of your questions, as long as you give it the right answer? That's probably right.

Even with normal human-written code, it's not guaranteed to be completely correct in one shot. That's why code review and QA still exist.

The issue here is more organizational with the engineers not getting the code up to standards before handing off, not the capabilities of the AI itself.


tell it the problem - it will introduce you to the keep-alives and reuses...

I'm sorry your teammates have skill issues when it comes to using these tools.

> Mostly because humans are made worse by using AI.

For the type of work I do, I found it best to tightly supervise my LLMs. Giving lots of design guidance upfront, and being very critical towards the output. This is not easy work. In fact, this was always the hard part, and now I'm spending a larger percentage of my time doing it. As the impact of design mistakes is a lot smaller, I can just revert after 20 minutes instead of 3 days, I also get to learn from mistakes quicker. So I'd say, I'm improving my skills faster than before.

For juniors though, I think you are right. By relying on this tech from early on in their careers, I think it will be very hard to grow their skills, taste and intuition. But maybe I'm just an old guy yelling at the clouds, and the next generation of developers will do just fine building careers as AI whisperers.


> How do I feel, about all the code I wrote that was ingested by LLMs? I feel great to be part of that,

That's fine if he feels that way, but he can only speak for himself, not for all the copyright holders of the other code that was "ingested" to power LLMs.

If you want to see how most creators who care about their work and actually own it (unlike most software) feel, look at book authors and illustrators. Many of them have a burning hatred for AI bros not only stealing their work, but then also using it to destroy the livelihoods of their field.

A lot of the techbros who do care about their work aren't feeling as wronged or threatened, because we're trying to pivot to get a piece of the pie, from all the exploitation and pillaging of many fields.


I expect book authors and artists to have very different opinions on this than programmers, because there isn't really a book/art equivalent of deliberately sharing open source libraries for other people to integrate into their projects.

The closest is probably music sampling, which has had a very robust money-based licensing scheme built around it for many years.


SOTA LLMs are now quite good at typing out code that passes tests. If you are able to instruct the creation of sufficient tests and understand the code generated structurally, there is a significant multiplier in productivity. I have found LLMs to be hugely useful in understanding codebases more quickly. Granted it may be necessary to get 2nd opinions and fact check what is stated, but there is a big door now open to anyone to educate themselves.

I think there are some negative consequences to this; perhaps a new form of burn out. With the force multiplier and assisted learning utility comes a substantial increase in opportunity cost.
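To make the earlier point about "sufficient tests" concrete, here is a hypothetical sketch of the kind of test one might pin down first and then ask the model to make pass; the module and function names below are made up for illustration:

  # Hypothetical: tests written (or at least reviewed) by a human first,
  # implementation delegated to the model afterwards.
  import pytest

  from myproject.text import slugify  # the function the model is asked to write

  @pytest.mark.parametrize("raw,expected", [
      ("Hello, World!", "hello-world"),
      ("  lots   of   spaces  ", "lots-of-spaces"),
      ("already-a-slug", "already-a-slug"),
  ])
  def test_slugify(raw, expected):
      assert slugify(raw) == expected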


This is making me sad. The people who are going to lose their jobs will be literally weaponized against minorities by the crooked politicians doing their thing right now; it's going to be a disaster, I can tell. I just wish I could go back in time. I don't want to live in this timeline anymore. I lost my passion job before any of it even happened. On paper.

We may already have hit the point where the easier it is to make software, the harder it is to sell it (or make money from it).

There is no way I can convince a user that my vibe-coded version of a todo list is better than the 100 others made this week.


Industries have come and gone for centuries and it doesn't always go horribly wrong.

I've been wondering lately if the career move is to get into SRE/Ops/Platform type work.

Imo it's too hard for companies to get infra into a place where text can be an interface. IaC (which is close enough to interacting with infra through text) is mostly an aspiration beyond a certain scale, ime.


Why did nobody comment about how a giant wall of thin monospace text is basically unreadable? I tried to read this but it was too hard. Am I the only one who has this issue?

>"I believe we should vote for governments that recognize what is happening, and are willing to support those who will remain jobless"

You will not find such a government. They're here for a different purpose


What happens if the bubble bursts - can we still use all the powerful models to create all this code? Aren't all the agents effectively using venture capital today? Is this sustainable?

If I can run an agent on my machine, with no remote backend required, the problem is solved. But right now, aren't all developers throwing themselves into agentic software development betting that these services will always be available to them at a relatively low cost?


If the bubble bursts we club together to buy one of those big GPU servers (now available at rock bottom prices thanks to the bubble bursting) and run a shared instance of GLM-4.7 (the current best-at-coding Chinese open weight model) on it.

> facts are facts, and AI is going to change programming forever

Show me these "facts"


If you can't see this by working with Claude Code for a few weeks, I don't want to go into bigger efforts than writing a blog post to convince you. It's not a mission, mine. I just want to communicate with the part of people that are open enough to challenge their ideas and are willing to touch with their hands what is happening. Also, if you tried and failed, it means that either for your domain AI is not good enough, or you are not able to extract the value. The fact is, this does not matter: a bigger percentage of programmers is using AI with success every day, and as it progresses this will happen more and in more diverse programming fields and tasks. If you disagree and are happy to avoid LLMs, well, it's ok as well.

Replace "Claude Code" or "AI" with "Jesus". It all sounds very familiar.

I am waiting for people to commit their prompt/agent setups instead of the code before calling this a changed paradigm. So far it is "just" machines generating code, and generating code doesn't solve all the software problems (but yeah, they are getting pretty good at generating code).

If you want an example, I just open-sourced a project which includes the prompts and CLAUDE.md: https://github.com/minimaxir/miditui/tree/main/agent_notes

Why do you care so much to write a blog post? Like if it's such a big advantage, why not stay quiet and exploit it? Why not make Anti-AI blog posts to gain even more of an advantage?

One of the big red flags I see around the pro-AI side is this constant desire to promote the technology. At least the anti-ai side is reactionary.


It seems quite profitable nowadays to position yourself as [insert currently overhyped technology] GURU to generate clicks/views. Just look at the amount of comments in this thread.

"Like if it's such a big advantage, why not stay quiet and exploit it?"

Maybe he's a generous person.


okay, but again: if you say in your blog that those are "facts", then... show us the facts?

You can't just hand-wavily say "a bigger percentage of programmers is using AI with success every day" and not give a link to a study that shows it's true

as a matter of fact, we know that a lot of companies have fired people by pretending that they are no longer needed in the age of AI... only to re-hire offshored people for much cheaper

for now, there hasn't been a documented sudden increase in velocity/robustness of code; a few anecdotal cases, sure

I use it myself, and I admit it saves some time to develop some basic stuff and get a few ideas, but so far nothing revolutionary. So let's take it at face value:

- a tech which helps slightly with some tasks (basically "in-painting code" once you defined the "border constraints" sufficiently well)

- a tech which might cause massive disruption of people's livelihoods (and safety) if used incorrectly, which might FAR OUTWEIGH the small benefits and be a good enough reason for people to fight against AI

- a tech which emits CO2, increases inequalities, depends on quasi slave-work of annotators in third-world countries, etc

so you can talk all day long about not dismissing AI, but you should take it also with everything that comes with it


1. If you can't convince yourself, after downloading Claude Code or Codex and playing with them for 1 week, that programming is completely revolutionized, there is nothing I can do: you have it at your fingertips and you search for facts I should communicate for you.

2. Air conditioning usage in the US alone is around 4 times the energy/CO2 usage of all the world's data centers (not just AI) combined. AI is 10% of data center usage, so AC alone is 40 times that.


I tried Claude on a project where I'd got stuck trying to use some MacOS media APIs in a Rust app.

It just went in circles between something that wouldn't compile, and a "solution" that compiled but didn't work despite the output insisting it worked. Anything it said that wasn't already in the (admittedly crap) Apple documentation was just hallucination.

Not exactly what I'd describe as "revolutionary".


I enjoyed your blog post, but I was curious about the claim in point 2 above. I asked Claude and it seems the claim is false:

# Fact-Checking This Climate Impact Claim

Let me break down this claim with actual data:

## The Numbers

*US Air Conditioning:*

- US A/C uses approximately *220-240 TWh/year* (2020 EIA data)

- This represents about 6% of total US electricity consumption

*Global Data Centers:*

- Estimated *240-340 TWh/year globally* (IEA 2022 reports)

- Some estimates go to 460 TWh including cryptocurrency

*AI's Share:*

- AI represents roughly *10-15%* of data center energy (IEA estimates this is growing rapidly)

## Verdict: *The claim is FALSE*

The math doesn't support a 4:1 ratio. US A/C and global data centers use *roughly comparable* amounts of energy—somewhere between 1:1 and 1:1.5, not 4:1.

The "40 times AI" conclusion would only work if the 4x premise were true.

## Important Caveats

1. *Measurement uncertainty*: Data center energy use is notoriously difficult to measure accurately

2. *Rapid growth*: AI energy use is growing much faster than A/C

3. *Geographic variation*: This compares one country's A/C to global data centers (apples to oranges)

## Reliable Sources

- US EIA (Energy Information Administration) for A/C data

- IEA (International Energy Agency) for data center estimates

- Lawrence Berkeley National Laboratory studies

The quote significantly overstates the disparity, though both are indeed major energy consumers.
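Taking the figures quoted above at face value (they may themselves be imprecise), the implied ratios are easy to sanity-check:

  # Rough check using only the numbers quoted above, all in TWh/year.
  us_ac = (220 + 240) / 2        # ~230, US air conditioning
  global_dc = (240 + 340) / 2    # ~290, global data centers
  ai_share = 0.10 * global_dc    # ~29, using the low-end 10% estimate

  print(round(us_ac / global_dc, 2))  # ~0.79 -> nowhere near 4x
  print(round(us_ac / ai_share, 1))   # ~7.9  -> nowhere near 40x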


So you don't actually have anything to support your argument other than "trust me bro". Oh, how the mighty have fallen.

A useful skill in both software engineering and life is figuring out, based on prior reputation and performance, who you should trust.

It is a useful skill. But regardless of the topic at hand, there is also:

"You either die a hero or you live long enough to see yourself become the villain."

People change all the time, and things need to be reevaluated from time to time.

So another skill is to disengage from our heroes when their values start to misalign.


That sounds more like software pseudo-engineering to me.

A bit like we should trust RFK on how "vaccines don't work" thanks to his wide experience?

The idea here is not to say that antirez has no knowledge of coding or software engineering. The idea is that if he says "hey, we have the facts", and then, when people ask "okay, show us the facts", he answers "just download Claude Code, play with it for an hour, and you have the facts", we don't trust that; that's not science.


That's a great example in support of my argument here, because RFK Jr clearly has no relevant experience at all - so "figuring out, based on prior reputation and performance, who you should trust" should lead you to not listen to a word he says.

1. "if you can't convince yourself by playing anecdotically" is NOT "facts"

2. The US being incredibly wasteful with AC energy doesn't somehow justify adding another, mostly unnecessary, polluting source, even if it's currently smaller. ACs have existed for decades. AI has been exploding for only a few years, so we could definitely see it go way, way past AC usage.

There's also the idea of "accelerationism". Why do we need all this tech? What good does it do to have 10 more silly slop AI videos and disinformation campaigns during elections? Just so that antirez can be a little bit faster at writing his code... that's not what the world is about.

Our world should be about humans, connecting together (more slowly, not "faster"), about having meaningful work, and caring about planetary resources

The exact opposite of what capitalistic accelerationism / AI is trying to sell us


If you can solve "measure programming productivity with data" you'll have cracked one of the hardest problems in our industry.

> Why do we need all this tech?

Slightly odd question to be asking here on Hacker News!


> If you can solve "measure programming productivity with data" you'll have cracked one of the hardest problems in our industry.

That doesn't mean that we have to accept claims that LLMs drastically increase productivity without good evidence (or in the presence of evidence to the contrary). If anything, it means the opposite.


At this point the best evidence we have is a large volume of extremely experienced programmers - like antirez - saying "this stuff is amazing for coding productivity".

My own personal experience supports that too.

If you're determined to say "I refuse to accept appeal to authority here, I demand a solution to the measuring productivity problem first" then you're probably in for a long wait.


> At the is point the best evidence we have is a large volume of extremely experienced programmers - like antirez - saying "this stuff is amazing for coding productivity".

The problem is that we know that developers' - including experienced developers' - subjective impressions of whether LLMs increase their productivity at all are unreliable and biased towards overestimation. Similarly, we know that previous claims of massive productivity gains were false (no reputable study showed even a 50% improvement, let alone the 2x, 5x, 10x, etc. that some were claiming; indicators of actual projects shipped were flat; etc.). People have been making the same claims for years at this point, and every time we were actually able to check, it turned out they were wrong. Further, while we can't check the productivity claims (yet) because that takes time, we can check other claims (e.g. the assertion that a model produces code that no longer needs to be reviewed by a human), and those claims do turn out to be false.

> If you're determined to say "I refuse to accept appeal to authority here, I demand a solution to the measuring productivity problem first" then you're probably in for a long wait.

Maybe, but my point still stands. In the absence of actual measurement and evidence, claims of massive productivity gains do not win by default.


There are also plenty of extremely experienced programmers saying "this stuff is useless for programming".

If a bunch of people say "it's impossible to go to the moon, nobody has done it" and Buzz Aldrin says "I have been to the moon, here are the photos/video/NASA archives to prove it", who do you believe?

The equivalent of "we've been to the moon" in the case of LLMs would be:

"Hey Claude, generate a full Linux kernel from scratch for me, go on the web to find protocol definitions, it should handle Wifi, USB, Bluetooth, and have WebGL-backed window server"

And then have it run in a couple of hours/days to deliver, without touching it.

We are *far* from this


OK then, new analogy.

If a bunch of people say "there are no cafes in this town that serve bench on a Sunday" and then Buzz Aldrin says "I just had a great brunch in the cafe over there, here's a photo", who would you listen to?


Check "confirmation bias": of course the few that speak loudly are those who:

- want to sell you AI

- have a popular blog mostly speaking on AI (same as #1)

- the ones for whom this productivity ehnancement applies

but there's also 1000's of other great coders for whom:

- the gains are negligible (useful, but "doesn't change fundamentally the game")

- we already see the limits of LLMs (nice "code in-painting", but can't be trusted for many reasons)

- besides that, we also see the impact on other people / coders, and we don't want that in our society


Many issues have been pointed out in the comments, in particular the fact that most of what antirez talks about is how "LLMs make it easy to fill in code for stuff he already knows how to do".

And indeed, in this case, "LLM code in-painting" (eg let the user define the constraints, then act as a "code filler") works relatively nicely... BECAUSE the user knows how it should work, and directed the LLM to do what he needs

But this is just, e.g., a 2x/3x acceleration of coding tasks for coders who are already good; it is neither 100x, nor is it reachable for beginner coders.

Because what we see is that LLMs (for good reasons!!) *can't be trusted* so you need to have the burden of checking their code every time

So 100x productivity IS NOT POSSIBLE, simply because it would take too long (and frankly be too boring) for a human to check 100x the output of a normal engineer (unless you spend 1000 hours upfront trying to encode your whole domain in a theorem-proving language like Lean and then check the implementation against it... which would be so costly that the "100x gains" would already have disappeared).


Why would you turn down a 2-3x productivity boost?

Sure, but I wasn't the one pretending to have "facts" on AI...

> Slightly odd question to be asking here on Hacker News!

It's absolutely not? The first line of question when you work in a domain SHOULD BE "why am I doing this" and "what is the impact of my work on others"


Yeah, I think I quoted you out of context there. I'm very much in agreement about asking "what is the impact of my work on others".

This is obviously a collision between our human culture and the machine culture, and on the surface its intent is evil, as many have guessed already. But what it also does is it separates the two sides cleanly, as they want to pursue different and wildly incompatible futures. Some want to herd sheep, others want to unite with tech, and the two can't live under one sky. The AI wedge is a necessity in this sense.

How does widespread access to AI tools increase inequalities?

It's pretty clear that if AI delivers on its promise it'll decimate the income of all but the top 1% of developers.

Labor is worth less, capital and equity ownership make more or the same


I don't think that's a foregone conclusion yet.

I continue to hope that we see the opposite effect: the drop in the cost of software development drives massively increased demand for both software and our services.

I wrote about that here: https://simonwillison.net/2026/Jan/8/llm-predictions-for-202...


I keep flip-flopping between being optimistic and pessimistic on this, but yeah we just need to wait and see

Because as long as it is done in a capitalistic economy, it will be excluding the many from work, while driving profits to a few

Just dismiss what he says and move on, he's already made it clear he's not trying to convince you.

> What is the social solution, then? Innovation can't be taken back after all.

It definitely can.

The innovation that was the open, social web of 20 years ago? Still an option, but drowned out by closed, ad-fueled toxic gardens and drained by illegal AI copy bots.

The innovation that was democracy? Purposely under attack in every single place it still exists today.

Insulin at almost no cost (because it costs next to nothing to produce)? Out of the question for people who live under the regime of pharmaceutical corporations that are not reined in by government, by collective rules.

So, a technology that has dubious ROI relative to the energy, water, and land consumed, that incites illegal activities and suicides, and that is in the process of killing the consumer IT market for the next 5 years if not more, because one unprofitable company without solid, verifiable prospects managed to place dubious orders with unproven money that lock up memory components for unproven data centers... yes, it definitely can be taken back.


You cannot stop someone from running llama-server -m glm-4.7.gguf on their own hardware. That is the argument: even if all the AI companies go bust and the datacenters explode, the technology has been fundamentally proliferated and it is impossible to return to a world in which it does not exist.
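As a hedged illustration of how low the barrier is once the weights exist: llama.cpp's llama-server exposes an OpenAI-compatible HTTP endpoint, so a local model can be queried with a few lines of Python (the port and model name below are assumptions, not something taken from this thread):

  import requests

  # Assumes llama-server is already running locally, e.g. on its default port 8080.
  resp = requests.post(
      "http://localhost:8080/v1/chat/completions",
      json={
          "model": "glm-4.7",  # local servers typically ignore or echo this field
          "messages": [
              {"role": "user", "content": "Summarize why local inference is hard to ban."},
          ],
      },
      timeout=120,
  )
  resp.raise_for_status()
  print(resp.json()["choices"][0]["message"]["content"])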

Of course not. But that's only the raw tech.

The tech will still be there. As much as blockchains, crypto, NFTs and such, whose bubbles have not yet burst (well, the NFT one did, it was fast).

But (Gen)AI today is much less about the tech, and much more about the illegal actions (harvesting copyrighted works) that permit it to run and the disastrous impact it has on... everything (resources, jobs, mistaken prospects, distorted IT markets, culture, politics), because it is not (yet) regulated to the extent it should be.


> LLMs are going to help us to write better software

No, I really don't think they will. Software has only been getting worse, and LLMs are accelerating the rate at which incompetent developers can pump out low quality code they don't understand and can't possibly improve.


Exactly. Many of us have learned, after decades of experience, that more code and more features are not automatically a net positive. Lots of additional code is a liability that you carefully accept given the value it provides.

I see the AI effect as the exact opposite: a turbo version of the "Lisp curse".

I feel like the use of the term "anti-AI hype" is not really fully explored here. Even limiting myself to tech-related applications - I'm frankly sick of companies trying to shove half-baked "AI features" down my throat, and the enshittification of services that ensues. That has little to do with using LLMs as coding assistants, and yet I think it is still an essential part of the "anti-AI hype".

The dreaded summarize feature: it's in places you wouldn't expect, not to mention the whole let's-record-every-meeting-and-summarize-it-for-leadership thing. Big Brother at work is back, and it's even more powerful.

The article talks about something I’ve never seen before: anti-AI hype in software development. It feels like a programmer writing from inside a bubble.

> As a programmer, I want to write more open source than ever, now.

I want to write less, just knowing that LLM models are going to be trained on my code is making me feel more strongly than ever that my open source contributions will simply be stolen.

Am I wrong to feel this? Is anyone else concerned about this? We've already seen some pretty strong evidence of this with Tailwind.


I feel similarly for a different reason. I put my code out there, licensed under the GPL. It is now, through a layer of indirection, being used to construct products that are not under the GPL. That's not what I signed up for.

I know the GPL didn't have a specific clause for AI, and the jury is still out on this specific case (how similar is it to a human doing the same thing?), but I like to imagine, had it been made today, there probably would be a clause covering this usage. Personally I think it's a violation of the spirit of the license.


Yep, this is my take as well. It's not that open source is being stolen as such, as if you abide by an open source license you aren't stealing anything, it's that the licenses are being completely ignored for the profit of a few massive corporations.

Yeah, that's what I meant by "stolen", I should have been clearer. But indeed, this is the crux of the problem, I have no faith that licenses are being abided by.

What profit? All labs are taking massive losses and there's no clear path to profit for most of them yet.

The wealthiest people in tech aren't spending 10s of billions on this without the expectation of future profits. There's risk, but they absolutely expect the bets to be +EV overall.

Expected profit.

GPL works via copyright. Since AI companies claim fair use, no copyright applies. There is no fixing this. The only option is not to publish.

There are non-US jurisdictions where you have some options, but since most of them are trained in the US that won't help much.


> Since AI companies claim fair use no copyright applies. There is no fixing this.

They can claim whatever they want. You can still try to stop it via lawsuits and make them argue it in court. Granted, I believe there have already been some jurisdictions that sided with fair use in those particular cases.


Laws can be changed. This is right now a trillion dollar industry, perhaps later it could even become a billion dollar industry. Either way, it's very important.

Strict copyright enforcement is a competitive disadvantage. Western countries lobbied for copyright enforcement in the 20th century because it was beneficial. Now the tables have turned; don't hold your breath for copyright enforcement against the wishes of the markets. We are all China now.


Yes, I think Japan added an AI friendly copyright law. If there were problems in the US, they'd just move training there.

Moving training won't help them if their paying customers are in jurisdictions which do respect copyright as written and intended.

OP's idea is about having a new GPL-like license with a "may not be used for LLM training" clause.

That the LLM itself is not allowed to produce copyrighted work (e.g. outright copies of works, or works that are too structurally similar) without a license for that work is probably already current law. They are working around this via content filters. They probably also have checks during/after training that the model does not reproduce work that is too similar. There are lawsuits about this pending, if I remember correctly, e.g. with the New York Times.


The issue is that everyone is focusing on verbatim (or "too similar") reproduction.

LLMs themselves are compressed models of the training data. The trick is that the compression is highly lossy, achieved by detecting higher-order patterns instead of focusing on the first-order input tokens (or bytes). If you look at how, for example, any of the Lempel-Ziv algorithms work, they also store patterns from the input and they also predict the next token (usually a byte in their case), except they do it with 100% probability because they are lossless.
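For anyone who wants the Lempel-Ziv analogy made concrete, here is a minimal LZ78-style sketch (illustrative only): the "model" is a growing dictionary of byte sequences seen so far, and every emitted token is a (known-prefix, next-byte) pair, i.e. a prediction of the continuation made with 100% confidence, which is exactly why the scheme is lossless:

  def lz78_compress(data: bytes) -> list[tuple[int, int]]:
      dictionary = {b"": 0}  # phrase -> index; grows as patterns are seen
      phrase = b""
      out = []
      for byte in data:
          candidate = phrase + bytes([byte])
          if candidate in dictionary:
              phrase = candidate  # keep extending a pattern we already know
          else:
              # Emit (index of known prefix, next byte) -- the lossless
              # analogue of "predict the next token given the context".
              out.append((dictionary[phrase], byte))
              dictionary[candidate] = len(dictionary)
              phrase = b""
      if phrase:
          # Leftover phrase is already in the dictionary; emit its prefix + last byte.
          out.append((dictionary[phrase[:-1]], phrase[-1]))
      return out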

So copyright should absolutely apply to the models themselves and if trained on AGPL code, the models have to follow the AGPL license and I have the right to see their "source" by just being their user.

And if you decompress a file from a copyrighted archive, the file is obviously copyrighted. Even if you decompress only a part. What LLMs do is another trick - by being lossy, they decompress probabilistically based on all the training inputs - without seeing the internals, nobody can prove how much their particular work contributed to the particular output.

But it is all mechanical transformation of input data, just like synonym replacement, just more sophisticated, and the same rules regarding plagiarism and copyright infringement should apply.

---

Back to what you said - the LLM companies use fancy language like "artificial intelligence" to distract from this, so they can then use more fancy language to claim copyright does not apply. And in that case, no license would help, because any such license fundamentally depends on copyright law, which they claim does not apply.

That's the issue with LLMs - if they get their way, there's no way to opt out. If there was, AGPL would already be sufficient.


I agree with your view. One just has to go into courts and somehow get the judges to agree as well.

An open question would be if there is some degree of "loss" where copyright no longer applies. There is probably case law about this in different jurisdictions w.r.t. image previews or something.


I don't think copyright should be binary, or should work the way it does now. It's just the only tool we have at the moment.

There should be a system which protects all work (intellectual and physical) and makes sure the people doing it get rewarded according to the amount of work and skill level. This is a radical idea and not fully compatible with capitalism as implemented today. I have a lot on my to-read list and I don't think I am the first to come up with this but I haven't found anyone else describing it, yet.

And maybe it's broken by some degenerate case and goes tits up like communism always did. But AFAICT, it's a third option somewhere in between, taking the good parts of each.

For now, I just wanna find ways to stop people already much richer than me from profiting from my work without any kind of compensation for me. I want inequality to stop worsening, but OTOH, in the past, large social change usually happened when things got so bad that people rejected the status quo and went to the streets, whether with empty hands or not. And that feels like where we're headed, and I don't know whether I should be excited or worried.


I recall a basics of law class saying that in some countries (e.g. Czech Republic), open source contributors have the right to small compensation if their work is used to a large financial benefit.

At some point, I'll have to look it up because if that's right, the billionaires and wannabe-trillionaires owe me a shitton of money.


Now imagine how much more that sucks for artists and designers that were putting artwork out there to advertise themselves only to have some douchebag ingest it in order to sell cheap simulacra.

One workaround would be to legislate that code produced by an LLM trained on GPL code would also be GPL.

There are licenses that are incompatible with each other, which implies that one wouldn’t be allowed to train LLMs on code based on multiple such licenses.

Sounds reasonable to me - much the same way that building a project from multiple incompatible licenses wouldn't be allowed. The alternative is that using an LLM could just be an end-run around the choice of license that a developer used.

Copyright normally only applies when you’re plagiarizing. LLM output typically isn’t that. It’s more like someone having studied multiple open source projects with incompatible licenses and coding up their own version of them, which is perfectly fine. So your “workaround” is overshooting things by far, IMO.

My understanding is that LLMs are plagiarising openly available code - it's not like the code is used to inspire a person, as that involves creative thinking. I'm thinking that taking a piece of code and applying a transformation to it to make it look different (e.g. changing variable/function names) would still be considered plagiarism. In the case of the GPL, I think it would be entirely appropriate for a GPL-trained LLM to be required to license its code output as GPL.

I suppose the question is when does a machine applied transformation become a new work?


The argument of the AI megacorps is that generated work is not "derivative" and therefore doesn't interact with the original author's copyright. They have invented a machine that takes in copyrighted works and, from a legal standpoint, produces "entirely original" code. No license, be that GPL or otherwise, can do anything about that, because licenses ultimately rely on the author's copyright to require the licensee to observe their terms.

They cannot violate the license, because in their view they have not licensed anything from you.

I think that's horse shit, and a clear violation of the intellectual property rights that are supposed to protect creatives from the business boys, but apparently the stock market must grow.


What makes this whole thing even weirder for me is the similar fact that any output from AI might not enjoy copyright protections. So basically if you can steal software made with AI you can freely resell it.

During the gold rush, it is said, the only people who made money were the ones selling the pickaxes. A"I" companies are ~selling~ renting the pickaxes of today.

(I didn't come up with this quote but I can't find the source now. If anything good comes out of LLMs, it's making me appreciate other people's more and trying to give credit where it's due.)


Wasn't it shovels?

NVidia is a shovel-maker worth a few trillion dollars...


What about the people who sold gold? Didn't they make money?

To be honest, I haven't looked at any statistics, but I imagine a tiny few of those looking for gold found any and got rich; most either didn't find anything, died of illness or exposure, or got robbed. I just like the quote as a comparison. Updated the original comment to reflect that I haven't checked whether it's correct.

If you want, I made a coherent argument about how the mechanics of LLMs mean both their training and inference are plagiarism and should be copyright infringement.[0] TL;DR: it's about reproducing higher-order patterns instead of copying word for word.

I haven't seen this argument made elsewhere, it would be interesting to get it into the courtrooms - I am told cases are being fought right now but I don't have the energy to follow them.

Plus, as somebody else put it eloquently, it's labor theft - we, working programmers, exchanged our limited lifetime for money (already exploitative) in a world with certain rules. Now the rules have changed, our past work has much more value, and we don't get compensated.

[0]: https://news.ycombinator.com/item?id=46187330


The first thing you need to do is brush up on some IP law around software in the United States. Start here:

https://en.wikipedia.org/wiki/Idea–expression_distinction

https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

In a court of law you're going to have to argue that something is an expression instead of an idea. Most of what LLMs pump out is almost definitionally on the idea side of the spectrum. You'd basically have to show verbatim code or class structure at the expressive level to the courts.


Thanks for the links, I'll read them in more detail later.

There's a couple issues I see:

1) All of the concepts were developed with the idea that only humans are capable of certain kinds of work needed for producing IP. A human would not engage in highly repetitive and menial transformation of other people's material to avoid infringement if he could get the same or better result by working from scratch. This placed, throughout history, an upper limit on how protective copyright had to be.

Say, 100 years ago, synonym replacement and paraphrasing of sentences were SOTA methods to make copies of a book which don't look like copies without putting in more work than the original. Say, 50 years ago, computers could do synonym replacement automatically so it freed up some time for more elaborate restructuring of the original work and the level of protection should have shifted. Say, 10 years ago, one could use automatic replacement of phrases or translation to another language and back, freeing up yet more time.

The law should have adapted with each technological step up and according to your links it has - given the cases cited. It's been 30 years and we have a massive step up in automatic copying capabilities - the law should change again to protect the people who make this advancement possible.

Now, with a sufficiently advanced LLM trained on all public and private code, you can prompt it to create a 3D viewer for Quake map files, and I am sure it will most of the time produce a working program which doesn't look like any of the training inputs but does feel vaguely familiar in structure. Then you can prompt it to add a keyboard-controlled character with Quake-like physics, and it'll produce something which has the same quirks as Quake movement. Where did bunny hopping, wallrunning, strafing, circle jumps, etc. come from if it did not copy the original and the various forks?

Somebody had to put in creative work to try out various physics systems and figure out what feels good and what leads to interesting gameplay.

Now we have algorithms which can imitate the results but which can only be created by using the product of human work without consent. I think that's an exploitative practice.

2) It's illegal to own humans but legal to own other animals. The USA law uses terms such as "a member of the species Homo sapiens" (e.g. [0]) in these cases.

If the tech in question were not LLMs but gene remixing (using only a tiny fraction of human DNA) to produce animals which are as smart as humans, with chimpanzee bodies, which can be incubated in chimpanzee females but are otherwise as sentient as humans, would (and should) it be legal to own them as slaves and use them for work? It would probably be legal by the current letter of the law, but I assure you the law would quickly change, because people would not be OK with such overt exploitation.

The difference is that the exploitation by LLM companies is not as overt - in fact, many people refer to LLMs as AIs and use pronouns such as "he" or "she", indicating they believe them to be standalone thinking entities instead of highly compressed lossy archives of other people's work.

3) The goal of copyright is progress, not protection of people who put in work to make that progress possible. I think that's wrong.

I am aware of the "is" vs "should" distinction but since laws are compromises between the monopoly in violence and the people's willingness to revolt instead of being an (attempted) codification of a consistent moral system, the best we can do is try to use the current laws (what is) to achieve what is right (what should be).

[0]: https://en.wikipedia.org/wiki/Unborn_Victims_of_Violence_Act


But "vaguely familiar in structure" could be argued to be the only reasonable way to do something, depending on the context. This is part of the filtration step in AFC.

The idea of wallrunning should not be protected by copyright.


The thing is, a model trained on the same input as current models, except for Quake and Quake derivatives, would not generate such code. (You'd have to prompt it with descriptions of Quake physics, since it wouldn't know what you mean, depending on whether only code or all mentions were excluded.)

The Quake special behaviors are essentially the result of bugs which were kept because they led to fun gameplay. The model would almost certainly generate explicit handling for these behaviors, because the original Quake code is very obviously not the only reasonable way to do it. And in that case the model and its output are derivative works of the training input.

The issue is such an experiment (training a model with specific content excluded) would cost (tens/hundreds of?) millions of dollars and the only companies able to do it are not exactly incentivized to try.

---

And then there's the thing that current LLMs are fundamentally impossible to create without such large amounts of code as training data. I honestly don't care what the letter of the law is, to any reasonable person, that makes them derivative work of the training input and claiming otherwise is a scam and theft.

I always wonder if people arguing otherwise think they're gonna get something out of it when the dust settles or if they genuinely think society should take stuff from a subgroup of people against their will when it can to enrich itself.


“Exploitative” is not a legal category in copyright. If the concern is labor compensation or market power, that’s a question for labor law, contract law, or antitrust, not idea-expression analysis and questions of derivative works.

There was a legal analysis of the copyright implications of Copilot among a set of white papers commissioned by the Free Software Foundation: https://www.fsf.org/licensing/copilot/copyright-implications...

And HN does its thing again - at least 3 downvotes, 0 replies. If you disagree, say why, otherwise I have to assume my argument is correct and nobody has any counterarguments but people who profit from this hate it being seen.

I agree that training on copyrighted material is violating the law, but not for the reasons you stated.

That said, this comment is funny to me because I’ve done the same thing too, take some signal of disagreement, and assume the signal means I’m right and there’s a low-key conspiracy to hold me down, when it was far more likely that either I was at least a bit wrong, or said something in an off-putting way. In this case, I tend to agree with the general spirit of the sibling comment by @williamcotton in that it seems like you’re inventing some criteria that are not covered by copyright law. Copyrights cover the “fixation” of a work, meaning they protect only its exact presentation. Copyrights do not cover the Madlibs or Cliff Notes scenarios you proposed. (Do think about Cliff Notes in particular and what it implies about AI - Cliff Notes are explicitly legal.)

Personally, I’ve had a lot of personal forward progress on HN when I assume that downvotes mean I said something wrong, and work through where my own assumptions are bad, and try to update them. This is an important step especially when I think I’m right.

I’m often tempted to ask for downvote explanations too, but FWIW, it never helps, and aside from HN guidelines asking people to avoid complaining about downvotes, I find it also helps to think of downvotes as symmetric to upvotes. We don’t comment on or demand an explanation for an upvote, and an upvote can be given for many reasons - it’s not only used for agreement, it can be given for style, humor, weight, engagement, pity, and many other reasons. Realizing downvotes are similar and don’t only mean disagreement helps me not feel personally attacked, and that can help me stay more open to reflecting on what I did that is earning the downvotes. They don’t always make sense, but over time I can see more places I went wrong.


> or said something in an off-putting way

It shouldn't matter.

Currently, downvote means "I want this to be ranked lower". There really should be 2 options "factually incorrect" and "disagree". For people who think it should matter, there should be a third option, "rude", which others can ignore.

I've actually emailed about this with a mod and it seems he conflated talking about downvotes with having to explain a reason. He also told me (essentially) people should not have the right to defend themselves against incorrect moderator decisions and I honestly didn't know what to say to that, I'll probably message him again to confirm this is what he meant but I don't have high hopes after having similar interactions with mods on several different sites.

> FWIW, it never helps

The way I see it, it helped since I got 2 replies with more stuff to read about. Did you mean it doesn't work for you?

> downvotes as symmetric to upvotes

Yes, and we should have more upvote options too. I am not sure the explanation should be symmetric though.

Imagine a group conversation in which somebody lies (the "factually incorrect" case here). Depending on your social status within the group and group politics, you might call out the lie in public, in private with a subset or not at all. But if you do, you will almost certainly be expected to provide a reasoning or evidence.

Now imagine he says something which is factually correct. If you say you agree, are you expected to provide references why? I don't think so.

---

BTW, on a site which is a more technical alternative to HN, there was recently a post about strange behavior of HN votes. Other people posted their experience with downvotes here and they mirrored mine - organic looking (i.e. gradual) upvotes, then within minutes of each other several downvotes. It could be coincidence but me and others suspect voting rings evading detection.

I also posted a link to my previous comment as an experiment - if people disagree, they are more likely to also downvote that one. But I did not see any change there so I suspect it might be bots (which are unlikely to be instructed to also click through and downvote there). Note sample size is 1 here, for now.


Maybe if you constructed your argument in terms of the relevant statutes for your jurisdiction, like an actual copyright attorney does, HN might be more receptive to it?

I argue primarily about morality (right and wrong), not legality. The argument is valid morally; if LLM companies found a loophole in the law, it should be closed.

You literally wrote "it would be interesting to get it into the courtrooms". A court won't give a hoot about your opinions on morality.

1) I appreciate that you differentiate between legality and morality, many people sadly don't.

2) re "hoot": You can say "fuck" here. You've been rudely dismissive twice now, yet you use a veil of politeness. I prefer when people don't hide their displeasure at me.

3) If you think I am wrong, you can say so instead of downvoting, it'll be more productive.

4) If you want me to expend effort on looking up statutes, you can say so instead of downvoting, it'll be more productive.

5) The law can be changed. If a well-reasoned argument is presented publicly, such as in a courtroom, and the general agreement is that the argument should apply but the court has to reject it because of poorly designed laws, that's a good impetus for changing them.


> I want to write less, just knowing that LLM models are going to be trained on my code is making me feel more strongly than ever that my open source contributions will simply be stolen. Am I wrong to feel this? Is anyone else concerned about this?

I don't think it's wrong, but maybe misdirected. What do you mean by someone being able to "steal" your open source contributions? I've always released most of my code as "open source", and not once has someone "stolen" it; it still sits on the same webpage where I initially published it, decades ago. Sure, it was guaranteed to be ingested into LLMs a long time ago, but that's hardly "stealing" when the thing is still there and given away for free.

I'm not sure how anyone can feel like their open source code was "stolen", wasn't the intention in the first place that anyone can use it for any purpose? That's at least why I release code as open source.


"Open Source" does not equal "No terms on how to share and use the code". Granted, there are such licenses but afaik the majority requires attribution at the minimum.

Then I'd say they're "breaking the license", not "stealing your project", but maybe I'm too anal about the meaning of words.

Yeah, fair, I could have been clearer. But yes, that is what I meant: breaking the license.

I’m unaware of any mainstream Open Source licenses that forbid training an AI model on the work. Are you using one?

That's pretty much an oxymoron, and prohibiting training an AI model on the code would basically make it not open source.

[A]GPL is viral, so the derived code must use the same license. People that like that license care a lot about that.

On the other side, BSD0 is just a polite version of the WTFPL, and people who like it don't care about what you do with the code.


And I mostly use MIT, which requires attribution. Does that mean when people use my code without attributing me that they're "stealing my code"? I would never call it that; I'd say they're "breaking the license", or similar.

The MIT license doesn’t require attribution for “using...code.” It reads as follows:

> Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

> THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

The operative language here is “all copies or substantial portions of the Software.” LLMs, with rare exceptions, don’t retain copies or substantial portions of the software they were trained on. They’re not libraries or archives. So it’s unclear to me how training an AI model on an MIT-licensed project could violate the license.

(IAAL and this is my personal analysis, not legal advice.)


I think the GP said "use" in the programmer sense, i.e. Ctrl-C & Ctrl-V into your program, not in the normal sense, i.e. double-clicking the icon. So I guess we all agree.

I don't understand the mindset because I began my foray into open source exactly because I wanted to distribute and share my code.

In other words, I've never been in the position of feeling that my charitable giving anywhere was ever stolen.

Some people write code and put it out there without caveats. Some people jump into open source to be license warriors. Not me. I just write code and share it. If you're a person, great. If you're a machine, then I suppose that's okay too -- I don't want to play musical chairs with licenses all day just to throw some code out there, and I don't particularly care if someone more clever than myself uses it to generate a profit.


Me too.

I’ve never been a fan of coercive licensing. I don’t consider that “open.” It’s “strings-attached.”

I make mine MIT-licensed. If someone takes my stuff, and gets rich (highly unlikely), then that’s fine. I just don’t want some asshole suing me, because they used it inappropriately, or a bug caused them problems. I don’t even care about attribution.

I mainly do it, because it forces me to take better care, when I code.


You wouldn't even be the 100th developer to eventually regret that.

> eventually regret that

I'm not exactly sure what you mean. I've been doing it for a couple of decades, so far, and haven't regretted it. Am I holding it wrong?

I'd be grateful for some elucidation.

Thanks!


Do you really struggle to understand the mindset?

Some people are happy to release code openly and have it used for anything, commercial or otherwise. Totally understandable and a valid choice to make.

Other people are happy to release code openly so long as people who incorporate it into their projects also release it in the same way. Again, totally understandable and valid.

None of this is hard to understand or confusing or even slightly weird.


I don't know if you're "wrong", but I do feel differently about this.

I've written a ton of open source code and I never cared what people do with it, both "good" or "bad". I only want my code to be "useful". Not just to the people I agree with, but to anyone who needs to use a computer.

Of course, I'd rather people use my code to feed the poor than build weapons, but it's just a preference. My conviction is that my code is _freed_ from me and my individual preferences and shared for everyone to use.

I don't think my code is "stolen", if someone uses it to make themselves rich.


And in that case, use MIT license or something like that for your code, and all is good. If I use AGPL, on the other hand, AI companies should not be allowed to train on that and then use the result of that training while ignoring the license.

> Not just to the people I agree with, but to anyone who needs to use a computer.

Why not say "... but to the people I disagree with"?

Would you be OK knowing your code is used to cause more harm than good? Would you still continue working on a hypothetical OSS project which had no users other than, say, a totalitarian government in the Middle East which executes homosexuals? Would you be OK with your software being a critical, directly involved piece of code for, say, tracking, de-anonymizing and profiling them?

Where is the line for you?


As for me that's a risk I'm willing to accept in return for the freedom of the code.

I'm not going to deliberately write code that's LIKELY to do more harm than good, but crippling the potential positive impact just because of some largely hypothetical risk? That feels almost selfish: what would I really be trying to avoid, other than personally running into a feel-bad outcome?


I think it would be most interesting to find ways to restrict bad usage without crippling the positive impact.

Douglas Crockford[0] tried this with JSON. Now, strictly speaking, this does not satisfy the definition of Open Source (it merely is open source, lowercase). But after 10 years of working on Open Source, I came to the conclusion that Open Source is not the absolute social good we delude ourselves into thinking it is.

Sure, it's usually better than closed source because the freedoms mean people tend to have more control and it's harder for anyone (including large corporations) to restrict those freedoms. But I think it's a local optimum and we should start looking into better alternatives.

Android, for example, is nominally Open Source but in reality the source is only published by google periodically[1], making any true cooperation between the paid devs and the community difficult. And good luck getting this to actually run on a physical device without giving up things like Google Play or banking apps or your warranty.

There are always ways to fuck people over and there always will be, but we should look into further ways to limit and reduce them.

[0]: https://en.wikipedia.org/wiki/Douglas_Crockford

[1]: https://www.androidauthority.com/aosp-source-code-schedule-3...


I agree with the GP. While I wouldn’t be happy about such uses, I see the use as detached from the software as-is, given (assuming) that it isn’t purpose-built for the bad uses. If the software is only being used for nefarious purposes, then clearly you have built the wrong thing, not applied the wrong license. The totalitarian government wouldn’t care about your license anyway.

The one thing I do care about is attribution — though maybe actually not in the nefarious cases.


> The totalitarian government wouldn’t care about your license anyway.

I see this a lot and while being technically correct, I think it ignores the costs for them.

In practice such a government doesn't need to have laws and courts either, but usually does, for the appearance of justice.

Breaking international laws such as copyright also has costs for them. Nobody will probably care about one small project but large scale violations could (or at least should) lead to sanctions.

Similarly, if they want to offer their product in other countries, now they run the risk of having to pay fines.

Finally, see my sibling comment but a lot of people act like Open Source is an absolute good just because it's Open Source. By being explicit about our views about right and wrong, we draw attention to this delusion.


It’s fine to use whatever license you think is right. That includes the choice of using a permissive license. Restrictions are generally an impediment to adoption, due to their legal risk, even for morally immaculate users. I think that not placing usage restrictions on open source is just as natural as not placing usage restrictions on published research papers.

Tragedy of the commons. If all software had (compatible) clauses about permitted usage, then the choice would be to rewrite it in-house or accept the restrictions. When there are alternatives (copyleft or permissive) which are not significantly worse, those will get used instead, even if, taken in isolation, the restricted software was a bigger social good.

Then why open source something in the first place? The entire point is to make it public, for anyone to use however is useful to him or her, and often to publicly collaborate on a project together.

If I made something open source, you can train your LLM on it as much as you want. I'm glad my open source work is useful to you.


Plenty of people will gladly give you their hard work for free if you promise you'll return the favor. Or if you promise not to take their work and make others pay for it when they could just get it for free. Basically, help the people that want to embrace the freedoms of open source, but not the ones that are just in it for the free labour. Or at the very, very least, include a little "thank you" note.

AI doesn't hold up its end of the bargain, so if you're in that mindset you now have to decide between going full hands-off like you or not doing any open source work at all.


Given the amount of value I get from having AI models help me write code I would say that AI is paying me back for my (not insignificant) open source contributions a thousand times over.

Good for you, I guess? That doesn't really change the situation much for the people who do care and/or don't use AI.

I consider the payment I and my employer make to these AI companies to be what the LLM is paying me back for. Even the free ones get paid for my usage somehow. This stuff isn't charity.


You're quite vigorously replying to anyone disagreeing with the post (and haven't contributed to the top level as far as I can tell).

It comes across as really trying too hard and a bit aggressive.

You could just write one top level comment and chill a bit. Same advice for any future threads too...


> The entire point is to make it public, for anyone to use however is useful to him or her

For many open source licenses, the entire point isn’t to allow a large corporation to make private projects out of your open source project; it’s to ensure the works that leverage your code are open source as well. Something AI is completely ignoring, using various excuses as to why their specific type of theft is ok.


There is an open source world that believes in the MIT license, which imposes no obligation to keep derivatives FOSS.

Even the MIT license requires attribution; all of that gets lost when training an LLM.

Read all the text of the license carefully: https://news.ycombinator.com/item?id=46577208

I don't worry about that too much. I still contribute to FOSS projects, and I use FOSS projects. Whenever I contribute, I usually fix something that affects me (or maybe just something I encountered), and fixing it has a positive effect on the users of that software, including me.

> Am I wrong to feel this?

Why would a feeling be invalid? You have one life, you are under no obligation to produce clean training material, much less feel bad about this.


I don't understand the invocation of Tailwind here. It doesn't make sense. Tailwind's LLM struggles had nothing to do with open source; they had to do with the fact that they had the same business model as a publisher, with ads pointing to their only product.

Exactly, their issue was about a drop in visits to their documentation site where they promote their paid products. If they were making money from usage, their business could really thrive with LLMs recommending Tailwind by default.

AFAIK their issue is that LLMs have been trained on their paid product (Tailwind UI, etc.) and so can reproduce them very easily for free. Which means devs no longer pay for the product.

In other words, the open source model of "open core with paid additional features" may be dead thanks to LLMs. Perhaps less so for some types of applications, but for frameworks like Tailwind very much so.


That's not what Adam said. He said it was a traffic issue.

A common intention with open source is to allow people, and the AI tools they use, to reuse, recombine, etc. OSS code in any way they see fit. If that's not what you want, don't open source your work. It's not stealing if you gave it away and effectively told people "do whatever you want", which is one way licenses such as the MIT license are often characterized.

It's very hard to prevent specific types of usage (like feeding code to an LLM) without throwing out the baby with the bathwater and also preventing all sorts of other valid usages. AGPLv3, which is what antirez and Redis use, goes too far IMHO and still doesn't quite get the job done. It doesn't forbid people (or tools) from "looking" at the code, which is what AI training might be characterized as. That license creates lots of headaches for corporate legal departments. I switched to Valkey for that reason.

I actually prefer using MIT style licenses for my own contributions precisely because I don't want to constrain people or AI usage. Go for it. More power to you if you find my work useful. That's why I provide it for free. I think this is consistent with the original goals of open source developers. They wanted others to be able to use their stuff without having to worry about lawyers.

Anyway, AI progress won't stop because of any of this. As antirez says, that stuff is now part of our lives and it is a huge enabler if you are still interested in solving interesting problems. Which apparently he is. I can echo much of what he says. I've been able to solve larger and larger problems with AI tools. The last year has seen quite a bit of evolution in what is possible.

> Am I wrong to feel this?

I think your feelings are yours. But you might at least examine your own reasoning a bit more critically. Words like theft and stealing are big words, and I think your case for them is very weak. And when you are coding yourself, are you not standing on the shoulders of giants? Is that not theft?


I think the Tailwind case is more complicated than this, but yes - I think it's reasonable to want to contribute something to the common good but fear that the value will disproportionally go to AI companies and shareholders.

I do open source exactly because I'm fine with my work being "stolen".

Stolen means no attribution and not following the rules of the GPL, instead producing unattributed, AI-washed closed source code owned by companies.

GPL requires attribution. Some people are fine with their code being used by others for free while still expecting their work to be acknowledged. Code posted on Stackoverflow is apparently CC-BY-SA licensed, which means attribution is still required.

I'm convinced that LLMs will result in all software needing to be open source (or at the very least source available).

In the future everyone will expect to be able to customise an application; if the source is not available they will not choose your application as a base. It's that simple.

The future is highly customisable software, and that is best built on open source. How this looks from a business perspective I think we will have to find out, but it's going to be fun!


Why do you think customization can only viably be done by changing the code of the application itself?

I think there is room for closed source platforms that are built on top of using LLMs via some sort of API that it exposes. For example, iOS can be closed source and LLMs can develop apps for it to expand the capabilities of one's phone.

Allowing total customization by a business can allow them to mess up the app itself or make other mistakes. I don't think it's the best interface for allowing others to extend the app.


I'm convinced of the opposite. I think a lot more software will be closed source so that an LLM cannot reproduce it from its training data for free.

> In the future everyone will expect to be able to customise an application; if the source is not available they will not choose your application as a base. It's that simple.

This seems unlikely. It's not the norm today for closed-source software. Why would it be different tomorrow?


Because we now have LLMs that can read the code for us.

I'm feeling this already.

Just the other day I was messing around with Fly's new Sprites.dev system and I found myself confused as to how one of the "sprite" CLI features worked.

So I went to clone the git repo and have Claude Code figure out the answer... and was surprised to find that the "sprite" CLI tool itself (unlike Fly's flyctl tool, which I answer questions about like this pretty often) wasn't open source!

That was a genuine blocker for me because it prevented me from answering my question.

It reminded me that the most frustrating thing about using macOS these days is that so much of it is closed source.

I'd love to have Claude write me proper documentation for the sandbox-exec command for example, but that thing is pretty much a black hole.


I'm not convinced that lowering the barrier to entry to software changes will result in this kind of change of norms. The reasons for closed-source commercial software not supporting customisation largely remain the same. Here are the ones that spring to mind:

• Increased upfront software complexity

• Increased maintenance burden (to not break officially supported plugins/customizations)

• Increased support burden

• Possible security/regulatory/liability issues

• The company may want to deliberately block functionality that users want (e.g. data migration, integration with competing services, or removing ads and content recommendations)

> That was a genuine blocker for me because it prevented me from answering my question.

It's always been this way. From the user's point of view there has always been value in having access to the source, especially under the terms of a proper Free and Open Source licence.


Yes. If you didn't care who used your code when you contributed to open source before, then it shouldn't matter now that a company picks up your code. You are also contributing this way too.

Tailwind is a business and they picked a business model that wasn't resilient enough.


This is a dilemma for me that gets more and more critical as I finalize my thesis. My default mental model was to open source for the sake of contributing back to the community, enhance my ideas and discuss them with whoever finds it interesting.

To my surprise, my doctoral advisor told me to keep the code closed. She told me that not only will LLMs steal it and benefit from it, but there's a risk of my code becoming a target after it's stolen by companies with fat attorney budgets, and there's no way I could defend myself and prove anything.


Unless I am missing something, it seems that you only need to use something like the following (obtained using a quick search, haven't tried it):

https://archclx.medium.com/enforcing-gpg-encryption-in-githu...

My opinion on the matter is that AI models stealing open source code would be ok IF the models are also open and remain so, and services like ChatGPT remain free of cost (at least with a free tier) and free of ads.

But we all know how it is going to go.


Not wrong. But I don't share your concerns at all. I like sharing code, and if people, and who knows, machines, can make use of it and provide some value, however minute, that makes me content.

This is why I never got into open source in the first place. I was worried that new programmers might read my code, learn how to program, and then start independently contributing to the projects I know and love - significantly devaluing my contributions.

> But, in general, it is now clear that for most projects, writing the code yourself is no longer sensible, if not to have fun.

I want to write code to defy this logic and express my humanity. "To have fun", yes. But also to showcase what it means when a human engages in the act of programming. Writing code may increasingly not be "needed", but it increasingly is art.


This is an absolutely valid concern. We either need strong governmental intervention against models that don't comply with OSS licenses.

Or accept that there definitely won't be open model businesses. Make them proprietary, and accept the fact that even permissive licenses such as MIT or BSD 2/3-Clause won't be followed by anyone while writing OSS.

And as for Tailwind, I dunno if it is because of AI.


With Tailwind, wasn't the problem that much fewer people visited the documentation, which showed ads? The LLMs still used Tailwind

If you don't want people "stealing" your code, you don't want open source. You want source available.

You're confusing open source with public domain.

Use a license that doesn't allow it then.

Not everything needs to be MIT or GPL.


LLMs don't care about licenses. And even if they did, the people who use them to generate code don't care about licenses.

Thieves don't care about locks, so doors are pointless.

Thieves very much do care about doors and locks, because they are a physical barrier that must be bypassed, and doing so is illegal.

Software licenses aren't; AI companies can just take your GPL code and spit it back out into non-GPL codebases, and there's no way for you to even find out it happened, much less do anything about it, and the law won't help you either.


> Am I wrong to feel this?

There's no such thing as a wrong feeling.

And I say this as one of those with the view that AI training is "learning" rather than "stealing", or at least that this is the goal, because AI is the dumbest, most error-prone, and also most expensive way to try to make a copy of something.

My fears about setting things loose for public consumption are more about how I will be judged for them than about being ripped off, which is kinda why that book I started writing a decade ago and have not meaningfully touched in the last 12 months is neither published properly nor sent to some online archive.

When it comes to licensing source code, I mostly choose MIT, because I don't care what anyone does with the code once it's out there.

But there's no such thing as a wrong feeling, anyone who dismisses your response is blinding themselves to a common human response that also led to various previous violent uprisings against the owners of expensive tools of automation that destroyed the careers of respectable workers.


I want to write less, because quite frankly I get zero satisfaction from having an LLM churn out code for me, in the same way that Vincent van Gogh would likely derive no joy from using Nano Banana to create a painting.

And sure, I could stubbornly refuse to use an LLM and write the code myself. But after getting used to LLM-assisted coding, particularly recent models, writing code by hand feels extremely tedious now.


I've been writing a bunch of DSLs lately and I would love to have LLMs train on this data.

If you give and expect something in return, then you are not giving; that is a transaction.

No, you're absolutely right.

LLMs are labor theft on an industrial scale.

I spent 10 years writing open source, I haven't touched it in the last 2. I wrote for multiple reasons none of which any longer apply:

- I believe every software project should have an open source alternative. But writing open source now means useful patterns can be extracted and incorporated into closed source versions _mechanically_ and with plausible deniability. It's ironically worse if you write useful comments.

- I enjoyed the community aspect of building something bigger than one person can accomplish. But LLMs are trained on the whole history and potentially forum posts / chat logs / emails which went into designing the SW too. With sufficiently advanced models, they effectively use my work to create a simulation of myself and other devs.

- I believe people (not just devs) should own the product they build (an even stronger protection of workers against exploitation than copyright). Now our past work is being used to replace us in the future without any compensation.

- I did it to get credit. Even though it was a small motivation compared to the rest, I enjoyed everyone knowing what I accomplished and I used it during job interviews. If somebody used my work, my name was attached to it. With LLMs, anyone can launder it and nobody knows how useful my work was.

- (not solely LLM related) I believed better technology improves the world and quality of life around me. Now I see it as a tool - neutral - to be used by anyone for both good and bad purposes.

Here's[0] a comment where I described why it's theft based on how LLMs work. I call it higher-order plagiarism. I haven't seen this argument made by other people; it might be useful for arguing with those who want to legalize this.

In fact, I wonder if this argument has been made in court and whether the lawyers understand LLMs enough to make it.

[0]: https://news.ycombinator.com/item?id=46187330


> As a programmer, I want to write more open source than ever, now.

I believe open source will become a bit less relevant in its current form, as solution/project-tailored libraries/frameworks can be generated in a few hours with LLMs.


I’ve written plenty of open source and I’m glad it’s going into the great training models that help everyone out.

I love AI and pay for four services and will never program without AI again.

It pleases me that my projects might be helping out.


Also, open source without support has zero value. And you can support only 1-2 projects.

Meaning 99% of all OSS released now is de facto abandonware.


Also why would I use your open source project, when I can just prompt the AI to generate one for me, gracefully stripping the license as a bonus?

You are not wrong to feel this, because you cannot control what you feel. But it might be worth investigating why you feel this, and why were you writing open source in the first place.

Job insecurity, while a bunch of companies claim LLM coding agents are letting them decimate their workforces, is a pretty solid reason to feel like your code is being stolen. Many, if not most, tech workers have been very sheltered from the harsher economic realities most people face, and many are realizing that labor demand, rather than them being special, is why. A core goal of AI products is increasing the supply of what developer labor produces, which reduces demand for that labor. So yeah, feeling robbed when your donated code is used to train models is pretty rational.

Ultimately, most things in life and society where one freely gives (and open source could be said to be one such activity) are balanced by the expectation that everyone participating in the "system" reciprocates in kind; without that, it becomes an exploitative relationship. Examples of such sayings can be found in most major world religions, but a non-religious explanation of the dynamics at hand follows below.

If running an open source model means that I only give without receiving anything, there remains the possibility of being exploited. This dynamic has always existed, such as companies using a project and sending in vulnerability reports and the like but not offering to help, and instead demanding, often quite rudely.

In the past working with such extractive contributors may have been balanced with other benefits such as growing exposure leading to professional opportunities, or being able to sell hosted versions, consulting services and paid features, which would have helped the maintainer of the open source project pay off their bills and get ahead in life.

However with the rise of LLMs, it both facilitates usage of the open source tools without getting a chance to direct their attention towards these paid services, nor allows the maintainer to have direct exposure to their contributors. It also indirectly violates the spirit of said open source licenses, as LLMs can spit out the knowledge contained in these codebases at a scale that humans cannot, thus allowing people to bypass the license and create their own versions of the tools, which are themselves not open source despite deriving their knowledge from such data.

Ultimately we don't need to debate about this; if open source remains a viable model in the age of LLMs, people will continue to do it regardless of whether we agree or disagree regarding topics such as this; on the other hand, if people are not rewarded in any way we will only be left with LLM generated codebases that anyone could have produced, leaving all the interesting software development to happen behind closed doors in companies.


It is actually very simple to control what you feel, and very much possible. This deterministic idea about our feelings must die quickly. Pro tip: call the psychology department at your local university and they will happily teach you how to control your feelings.

This essay starts with a weak premise, not facts.

There is enough evidence to support claims that AI is a black hole where money gets evaporated.

It’s great that you can delegate some tasks to it now and not have to write all of the code yourself. There is some evidence showing that it doesn’t benefit junior developers nearly as much. If you didn’t generate the specification test that demonstrates the concurrency issue you were trying to solve in Redis, but you read the code it generated and understood it, then you didn’t need to learn anything. How is a junior developer who has never solved such problems supposed to learn so they can do the same thing?

But worse, UBI and such are the solutions of libertarian oligarchs who dream of a world without people, according to Doctorow, and I think he’s right. It seems like the author also wants this? He doesn’t seem to know what will happen to the jobless, but we should vote in someone who will start a government program to take care of them. How long until the author is replaced as well?

Lastly… who’s “hyping” anti-AI and what do they gain from making false claims?

I think the real problem for programming is when these companies all collapse and take the rest of the economy down with them… are there going to be enough programmers left to maintain everything? Or will we be sifting through the mountains of tech debt, never to see the light of day again?


I support AI vibe coding idiots everywhere I go. Please, use more VIBE coding, guys.

And no, my work in red-team IT security is completely unrelated :D


AI will become (or already is) much better than most devs at detecting security flaws in code.

So my prediction is that any software worth scanning by red teams will become more secure. Not less.


> the more isolated, and the more textually representable, the better: system programming is particularly apt

I’ve written complete 3D GUIs on the front end. This GUI was non-traditional: it allows you to play back, pause, speed up, slow down and rewind a GPS track like a movie. There is real-time color changing and drawing of the track as the playback occurs.

Using Mapbox to do this directly would be too slow. I told the AI to optimize it by going straight into shader extensions for Mapbox to optimize the GPU code.

Make no mistake: LLMs are incredible for non-systems work that requires interaction with 3D and GUIs.
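For readers who haven't touched this API: Mapbox GL JS supports custom layers that render with your own WebGL shaders, which is the kind of "shader extension" being described. Below is a minimal sketch of that technique against the classic WebGL1 custom-layer interface; the layer id, the two hard-coded coordinates, and the flat-color fragment shader are illustrative assumptions, not the commenter's actual code.

    import mapboxgl from "mapbox-gl";

    // Shared GL handles set up in onAdd() and reused in render().
    let program: WebGLProgram;
    let buffer: WebGLBuffer;

    // A custom layer draws the GPS track with its own shader instead of going
    // through Mapbox's regular per-feature line rendering.
    const trackLayer: mapboxgl.CustomLayerInterface = {
      id: "gps-track", // hypothetical layer id
      type: "custom",
      onAdd(_map, gl) {
        const vertexSrc = `
          uniform mat4 u_matrix;
          attribute vec2 a_pos;
          void main() { gl_Position = u_matrix * vec4(a_pos, 0.0, 1.0); }`;
        const fragmentSrc = `
          precision mediump float;
          void main() { gl_FragColor = vec4(1.0, 0.3, 0.1, 1.0); }`;

        const compile = (type: number, src: string): WebGLShader => {
          const shader = gl.createShader(type)!;
          gl.shaderSource(shader, src);
          gl.compileShader(shader);
          return shader;
        };

        program = gl.createProgram()!;
        gl.attachShader(program, compile(gl.VERTEX_SHADER, vertexSrc));
        gl.attachShader(program, compile(gl.FRAGMENT_SHADER, fragmentSrc));
        gl.linkProgram(program);

        // Two sample points (lng/lat) converted to Mercator units. A real
        // playback GUI would stream many points and rewrite this buffer as
        // the "movie" advances.
        const points = [
          mapboxgl.MercatorCoordinate.fromLngLat({ lng: -122.42, lat: 37.80 }),
          mapboxgl.MercatorCoordinate.fromLngLat({ lng: -122.39, lat: 37.78 }),
        ].flatMap((c) => [c.x, c.y]);

        buffer = gl.createBuffer()!;
        gl.bindBuffer(gl.ARRAY_BUFFER, buffer);
        gl.bufferData(gl.ARRAY_BUFFER, new Float32Array(points), gl.STATIC_DRAW);
      },
      render(gl, matrix) {
        // `matrix` is the map's projection matrix for the current frame.
        gl.useProgram(program);
        gl.uniformMatrix4fv(gl.getUniformLocation(program, "u_matrix"), false, matrix);
        gl.bindBuffer(gl.ARRAY_BUFFER, buffer);
        const posLoc = gl.getAttribLocation(program, "a_pos");
        gl.enableVertexAttribArray(posLoc);
        gl.vertexAttribPointer(posLoc, 2, gl.FLOAT, false, 0, 0);
        gl.drawArrays(gl.LINE_STRIP, 0, 2);
      },
    };

    // map.addLayer(trackLayer); // once the map's "load" event has fired

During playback you would rewrite the vertex buffer (and trigger a repaint) as the track advances, which is where the speed win over regular Mapbox line layers comes from.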


Yep, they work especially well if you instruct them to add into your program ways for them to "see" what is happening. And as embedding models get better, we will get better results from their ability to "see", too. For now Gemini 3 is the best at this, but it is not the best at coding as an agent, so we will have to wait a bit.

> I believe we should vote for governments that recognize what is happening, and are willing to support those who will remain jobless. And, the more people get fired, the more political pressure there will be to vote for those who will guarantee a certain degree of protection.

I am sorry, but this is incredibly naïve. Governments don't work that way. It reflects a lack of social awareness. "People getting fired" in 2026 is not the same as it was even 10 years ago. Society has changed; losing a job today is demonstrably more dangerous.

This is akin to saying "Sure, thousands of houses will burn down, but the more houses burn down, the more political pressure there will be". Why do we have to wait for the houses to burn down?


I love Antirez.

> However, this technology is far too important to be in the hands of a few companies.

This is the most important assessment and we should all heed this warning with great care. If we think hyperscalers are bad, imagine what happens if they control and dictate the entire future.

Our cellphones are prisons. We have no fundamental control, and we can't freely distribute software amongst ourselves. Everything flows through funnels of control and monitoring. The entire internet and all of technology could soon become the same.

We need to bust this open now or face a future where we are truly serfs.

I'm excited by AI and I love what it can do, but we are in a mortally precarious position.


People are afraid, because while AI seemingly gobbles up programmer jobs, on the economic side there are no guardrails visible or planned whatsoever.

> Writing code is no longer needed for the most part.

Said by someone who spent his career writing code, it lacks a bit of detail... a more correct way to phrase it is: "if you're already an expert in good coding, now you can use these tools to skip most of the code writing"

LLMs today are mostly some kind of "fill-in-the-blanks automation". As a coder, you try to create constraints (define types for typechecking constraints, define tests for testing constraints, define the general ideas you want the LLM to code, because you already know the domain and how coding works), then you let the model "fill in the blanks" while you regularly check that all tests pass, etc.
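A minimal sketch of that "constraints first, blanks second" loop, assuming a TypeScript project with vitest as the test runner; the `normalizeEmail` function and its rules are invented purely for illustration:

    import { describe, expect, it } from "vitest";

    // Constraint 1: the type signature the human fixes up front.
    export function normalizeEmail(raw: string): string | null {
      // The "blank": the body the model is asked to fill in until the tests
      // below pass. This is just one plausible completion.
      const s = raw.trim().toLowerCase();
      const parts = s.split("@");
      return parts.length === 2 && parts[0] !== "" && parts[1] !== "" ? s : null;
    }

    // Constraint 2: tests the human writes before any generation happens.
    describe("normalizeEmail", () => {
      it("lowercases and trims", () => {
        expect(normalizeEmail("  Alice@Example.COM ")).toBe("alice@example.com");
      });
      it("rejects malformed input", () => {
        expect(normalizeEmail("no-at-sign")).toBeNull();
        expect(normalizeEmail("two@@signs")).toBeNull();
      });
    });

The human fixes the signature and the tests; the agent is only allowed to touch the function body until the suite is green.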


AI is going to put a hold on the development of new programming languages for sure, since they won't be in the training set.

Great news if you know the current generation of languages, you won't need to learn a new one for quite some time.


I think it may play out in the opposite direction.

If you're developing a new programming language today, one of the assets you need to prepare is a short (~10,000 token or less) LLM-friendly guide to your language, plus a bunch of examples that coding agents can search through and crib from.

Done well, I expect this could accelerate the adoption of your new language - as users can start prompting their coding agents to build with it before they've even finished reading the tutorial themselves.

Your disadvantage will be that LLMs won't recommend your language when people ask "what could I build this in", but people discovered new languages via word-of-mouth before LLMs came along and I expect that to continue, especially if your language has something genuinely new and interesting to offer.


This is wrong and I don't know why people think it is true.

LLMs understand language grammar files really well. A new language is easy for them (you can tell this by giving them a JSON schema and seeing how well they do).

What they don't always have is good taste about which preexisting libraries work well together. But this isn't a problem for new languages.


UBI will never happen because the people in power don't want it.

Who is going to control AI? The people in power, obviously. They will buy all of the computers, so running models locally will no longer be feasible. In case it hasn't been obvious, this is already happening. It will only get worse.

They will not let themselves be taxed.

But who will buy the things the people in power produce if nobody has a job?

This is how civilization collapses.


I wish influencer programmers[1] cared a quarter as much about social revolution as they care about tech revolutions. Will it disrupt all people in tech? Will the benefits go to the top? Then what about an appropriate social response? No. In response to that we have “I hope” and “I wrote a novel .... about Basic Income”[2].

Programmers are simply accepting whatever the owner class does to them [3] and calling it Technological Determinism, even if just indirectly.

> But, I would not respect myself and my intelligence if my idea of software and society would impair my vision: facts are facts, and AI is going to change programming forever.

Token gestures:

> What is the social solution, then? Innovation can't be taken back after all. I believe we should vote for governments that recognize what is happening, and are willing to support those who will remain jobless.

Innovation can't be taken back (see: technological determinism; tech people are powerless to affect anything) so we should... vote for good governments. That are willing to support those who remain jobless.[0]

Keyword “willing”. Take away people's political leverage to strike. Now they may have no wealth. What are they to do? What is their political influence? The non-answer is to hope that the government will be WILLING to support their existence.

> And, the more people get fired, the more political pressure there will be to vote for those who will guarantee a certain degree of protection.

The more people get fired, the fewer people with political leverage. The realpolitik trend would be the opposite of what is written here.

> But I also look forward to the good AI could bring: new progress in science, that could help lower the suffering of the human condition,

Every advance made in science can be artificially restricted. See foodstuffs: we could apparently distribute enough to feed the world, but that doesn't make as much money as throwing a lot of it away.

Progress for any given individual can be non-existent unless it is evenly distributed.

> which is not always happy.

At least the article looks completely organic in terms of writing.

Genre: I Have Anecdotes About AI And If You Don't See What I'm Seeing You Are Misguided.[4]

[1] Not a vocation. Simply the observation that the famous and respected programmers will have more weight outside their niche simply because of who they are.

[2] Basic Income hails from the right-libertarian tradition. Leave the rich alone, give the commoner enough crumbs to survive. Later it was romanticized as a way for former programmers to go off to their evergreen pastures of endless side projects.

[3] https://news.ycombinator.com/item?id=46526137

[4] https://fly.io/blog/youre-all-nuts/

[0] Let's vote and hope that Italy doesn't get a fascist prime minister next time.


This is the first time I've heard sentiments against "AI" hype referred to as hype themselves. Yes, there are people ignoring this technology altogether, possibly to their own detriment, but at the stage where we are now it is perfectly reasonable to want to avoid the actual hype.

What I would really urge people to avoid doing is listening to what any tech influencer has to say, including antirez. I really don't care what famous developers think about this technology, and it doesn't influence my own experience of it. People should try out whatever they're comfortable with, and make up their own opinions, instead of listening to what anyone else has to say about it. This applies to anything, of course, but it's particularly important for the technology bubble we're currently in.

It's unfortunate that some voices are louder than others in this parasocial web we've built. Those with larger loudspeakers should be conscious of this fact, and moderate their output responsibly. It starts by not telling people what to do.


Using AI is part of the job now. Whether you like it or not, it behooves you as a professional to become proficient with anything that enhances your productivity, including LLMs.

I've found awesome use cases for quick prototyping. It saves me days when I can just describe the final step and iterate on it backwards to perfection and showcase an idea.

We are 5 years in... it's fine to be sceptical. The model advancements are in the single digits now. It's not on us that they promised the world 3 years ago. It's fine and will be just fine for the next few years. A real breakthrough is at least another 5 years away, and if it comes, everything you do now will be obsolete. Nobody will need or care about the dude that sloperatored Claude Code on release, and that's the reality everyone who goes full AI evangelist needs to understand. You are just a stopgap. The knowledge you are accumulating now is just worthless transitional knowledge. There is no need for FOMO, there is nothing hard about operating LLMs for coding, and it will get easier by the day.

5 years ago we had GPT-3, not even an instruction-following GPT yet, a mere completion model. ChatGPT's release was late 2022 (3 years ago). True agentic systems with reliable tool calling in a loop came maybe a year ago; agentic coding harnesses less than a year ago.

Model improvements may have flattened, the quality improvements due to engineering work around those models certainly have not.

If we always wait for technology to calcify and settle before we interact with it, then that would be rather boring for some of us. Acquiring knowledge is not such a heavy burden that it's an issue if it's outdated a year in. But that's maybe just a mindset thing.


I haven't been listening to any promises, I'm simply trying out the models as they get released. I agree with the article wholeheartedly - you can't pretend these tools are not worth learning anymore. It's irresponsible if you're a professional.

Next breakthrough will happen in 2030 or it might happen next Tuesday; it might have already happened, it's just that the lab which did it is too scared to release it. It doesn't matter: until it happens, you should work with what you've got.


I would have wholeheartedly agreed with this comment one year ago. Now, not so much.

Where we're at is a lot better than we expected to be three years ago TBH.

Another one of these sickening pieces, framing opposition to an expensive tech that doesn't work as "anti". I tried letting the absolutely newest models write C++ again today: GPT 5.1 and Opus 4.5, a single function with two or fewer input parameters, a nice return value, doing simple geometry with the glm library. Yes, the code worked. But I took as long fixing the weird parts as it would have taken me to write it myself. And I still don't trust the result, because reviewing is so much harder than writing.

There's still no point. ReSharper and clang-tidy still have more value than all LLMs. It's not just hype, it's a bloody cult, right beside the NFT and church-of-COVID people.


Your comment sounds like John Glenn's quote about Katherine Johnson, "Get the girl to check the numbers… If she says they're good, I'm ready to go.", asking her to double-check the calculations done by the first computers used by NASA. At that point in history it was probably accurate and the safest thing to do, but we all know how computers evolved from then on, and now we don't have human calculators anymore but rather humans checking the correctness of the written code that will do the actual calculations.

IMO the only rebuttal to this can be that LLMs are almost at their peak and there is not going to be any significant breakthrough or steady improvement in the coming years, in which case they will never become "the new computers".


But LLMs aren't advertised as some future thing. They're advertised as being almighty and replacing devs in great numbers. And that's simply not true. It's a fad, like 3D movies.

I know they are pumped and overhyped to death, indeed they are. But that does not mean they don't already have some use today, whether or not they improve in the future.

I'm skeptical about LLMs as well, but I also wanted to see what they are actually capable of doing, so I vibe coded an Android app in Kotlin (from scratch) with Claude Code and Opus 4.5, and it basically worked. I'm pretty sure the code is horrible to the eyes of a Kotlin developer, because I added so many features by asking CC to do it over the last 2-3 weeks that it already desperately needs a refactor.

But still, this is not something an autocomplete would be able to do for you.


Did you try telling the model to write the unit tests first, watch them fail, then write a function that passes them?

> reviewing is so much harder than writing

This is what so many of the AI proponents fail to understand. "Amazing, I don't have to write code, 'only' review AI slop" is sitting backwards on the horse. Who the heck wants to do that?


From the OPs comment section:

> Sociologist Judy Wajcman wrote about the concept of how tech is speeding up tasks precisely like this article describes, however she observed that it has never quite manifested as more free time for the laborer.

Every time I read blogs or tweets or posts like this, this point becomes more and more apparent. The authors are constantly explaining how they were busy with all their work, without time to implement the less important or side-project like things. The point of the post is often that now they can invest whatever free time they had into doing so, thus doing more work than they did before. I have literally never read such a post where the author explains how they have automated away their job and are now working less than before they started using AI.

I think this is a great point to ponder as we continue on this path of overworking and labor value destruction, and not the naive benevolent socialism that the authors all assume will occur magically.


Guy who says he wrote a book about, has a YouTube channel about, and works for a company obsessed with AI says it's not all just slop.

Full story at 11.


AI has a significant risk of directly leading to the extinction of our species, according to leading AI researchers. We should be worried about a lot more than job losses.

The anti-AI hype, in the context of software development, seems to focus on a few things:

> AI code is slop, therefore you shouldn't use it

You should learn how to responsibly use it as a tool, not as a replacement for you. This can be done, people are doing it, and people like Salvatore (antirez), Mitchell (of Terraform/Ghostty fame), Simon Willison (simonw) and many others are publicly talking about it.

> AI can't code XYZ

It's not all-or-nothing. Use it where it works for you, don't use it where it doesn't. And btw, do check that you actually described the problem well. Slop-in, slop-out. Not sayin' this is always the case, but turns out it's the case surprisingly often. Just sayin'

> AI will atrophy your skills, or prevent you from learning new ones, therefore you shouldn't use it

Again, you should know where and how to use it. Don't tune out while doing coding. Don't just skim the generated code. Be curious, take your time. This is entirely up to you.

> AI takes away the fun part (coding) and intensifies the boring (management)

I love programming but TBH, for non-toy projects that need to go into production, at least three quarters are boring boilerplate. And making that part interesting is one of the worst things you can do in software development! Down that path lies resume-driven development, architecture astronautics, abusing design patterns du jour, and other sins that will make code maintenance on that thing a nightmare! You want boring, stable, simple. AI excels at that. Then you can focus on the small tiny bit that's fun and hand-craft that!

Also, you can always code for fun. Many people with boring coding jobs code for fun in the evenings. AI changes nothing here (except possibly improving the day job drudgery).

> AI is financially unsustainable, companies are losing money

Perhaps, and we're probably in the bubble. Doesn't detract from the fact that these things exist, are here now, work. OpenAI and Anthropic can go out of business tomorrow, the few TB of weights will be easily reused by someone else. The tech will stay.

> AI steals your open source code, therefore you shouldn't write open-source

Well, use AI to write your closed-source code. You don't need to open source anything if you're worried someone (AI or human) will steal it. If you don't want to use something on moral grounds, that's a perfectly fine thing to do. Others may have different opinion on this.

> AI will kill your open source business, therefore you shouldn't write open-source

Open source is not a business model (I've been saying this for longer than median user of this site has been alive). AI doesn't change that.

As @antirez points out, you can use AI or not, but don't go hiding under a rock and then being surprised in a few years when you come out and find the software development profession completely unrecognizable.


> at least three quarters are boring boilerplate. And making that part interesting is one of the worst things you can do in software development! Down that path lies resume-driven development, architecture astronautics, abusing design patterns du jour, and other sins that will make code maintenance on that thing a nightmare!

You apparently see "making the boilerplate interesting" as doing a bunch of overengineering. Strange. To my mind, the overengineering is part of the boilerplate. "Making the boilerplate interesting" in my mind is not possible; rather, the goal is to fix the system such that it doesn't require boilerplate any more. (Sometimes that just means a different implementation language.)


I agree with what you said, but I think we might be talking about slightly different things. Let me give a few examples in an attempt to better explain myself:

A company I worked with a while ago had a microservices architecture, and had decided not to use one of the few standard API serialization/deserialization options but to write their own, because it was going to be more performant, easier to maintain, and a better fit for their use case. A few years on, after having grown organically to support all the edge cases, it's more convoluted, slower, and buggier than if they had gone with the boring option that ostensibly had "a bit more boilerplate" from the start.

A second example is from a friend, whose coworker decided to write a backend-agnostic, purpose-agnostic, data-agnostic message broker/routing library. They spent a few months on this and delivered a beautifully architected solution in a few dozen thousand lines of code. The problem is the solution solves many problems the company didn't and wouldn't have, and will be a maintenance drag from then forevermore. Meanwhile, they could have done it in a few hundred lines of code that would be coupled to the problem domain, but still fairly decent from most people's point of view.

These two are from real projects. But you can also notice that, in general, people often pick a fancy solution over a boring one, ostensibly because it has something "out of the box". The price of the "out of the box"-ness (aside from potential SaaS/infra costs and vendor lock-in) is that you now need to adapt your own code to work with the mental model (domain) of the fancy solution.

Or to harp on something trivial, you end up depending on left-pad because writing it yourself was boring.

> fix the system such that it doesn't require boilerplate any more.

I think perhaps I used a more broad meaning for "boilerplate" than you had in mind. If we're talking about boilerplate as enumerating all the exceptions a Java method may raise, or whatever unholy sad thing we have to do in C to use GTK/GObject, then I agree.

But I also meant something more closer to "glue code that isn't the primary carrier of value of the project", or to misuse financial language in this context, the code that's a cost center, not a profit center.


The end run around copyright is TOS that are forced on users through distribution channels (platforms), service providers, and actual "patented" hardware, so money will continue to flow up, not sideways. Given that there are a very limited number of things that can actually be done with computers/phones, and it becomes clear that "AI" can arrange those in any possible configuration, the rest is deciding if it will jibe with the users, and noticing when it doesn't, which I believe AI will be unable to discern from other AI slop imitating actual users.


