Articles like this keep popping up because they're catnip to those who hate AI or feel threatened by it. On coding, I routinely see people trashing vibe coding and jumping on the slightest mistake agents may make, never mind that human devs screw up all the time. And write-ups citing stats on AI coding tend to be written by folks who either don't code for a living or never earnestly tried it.
I use Claude Code regularly at work and can tell you it is absolutely fantastic and getting better. You obviously need to guide it well (use plan mode first) and point it to hand-coded stuff to follow, and it will save you an enormous amount of time and effort. Please don't put off trying AI coding after reading misinformed articles like this.
I think devs have a natural inclination to resist a seismic shift in their industry, which is understandable.
However, I agree that a lot of this stuff is FUD. AI dev is a new skill, and it takes time to master. It took me a few months, but I'm comfortably more productive and having more fun at work with Claude Code.
There's a BIG difference, at least with tools like Claude Code: plan mode. I'm now using Claude Code a lot at work, and the first thing I do is enter plan mode, where I can have a "conversation" asking it to explain how it would implement something. A few back-and-forths later I end up refining its plan to conform to good (or what I think is "good") design, after which it will tell me exactly what it is going to do (with code diffs), which I sign off on (again, potentially after a few iterations). Only then does it generate the code.
By contrast, on one project many years ago I was reviewing the code generated by an overseas team, and I couldn't make head or tail of it; it was an absolute tangled mess that was impossible to fix.
Well, I'm sure we've all seen code produced by human developers that is 10x worse than what my Claude Code produces (certainly I have), so let's be real. And it's improving scarily fast.
This seems like a lack of experience. The more I work with LLMs, the better I get at predicting what they’ll get wrong. I then shape my prompts to avoid the mistakes.
I think the bar has been raised, for sure. There's code I work on from prior seniors that is worse than what our current juniors write. I'm assuming AI is assisting with that, but as long as the PR looks good, it's no different to me.
I've noticed that generally OK design patterns and adherence to idiomatic code have increased, while attention to small but critical details has stayed the same or maybe slightly decreased.
Hard disagree. Humans fail in ways I know, can predict, and know where to look for. ML coding assistants fail in all sorts of idiotic ways and thus every damn line needs to be scrutinized.
What actually scares me is that with humans you can at least follow their train of thought. But if an LLM just rewrites everything each time, that is impossible to follow, and the same review work has to be done over and over again.
Questioning science is not automatically "anti-science"; IMO it's best to remain skeptical and stay focused on the evidence. The fact of the matter is that current "best medical advice" is not the best, either in terms of quality of life or prognosis. I've had a distant family member lose sight in an eye, develop short-term memory issues, and rapidly deteriorate from cancer in spite of following the best medical advice, guided by top physicians. My family is full of physicians, and I see even them questioning traditional methods. I would caution against the media's rush to label anything going against the mainstream narrative as "anti-science" or "misinformation." Yes, there are quacks and morons, but let's not put labels on anyone questioning bad outcomes.
Fair comment. You would think medicine would be evidence-based, but a lot of it is pattern matching and working on the 80/20 rule, given the limited time they have with clients.
> You would think medicine would be evidence-based (…)
Your comment sounds like it refers to the front line contacts with the patient.
It has been a while, but my own experience was that (1) the studies I wanted to see did not exist; (2) the doctor was not forthcoming about their own statistics/outcomes; (3) outcomes were not tracked by anyone past (a very small N) years; (4) there was no access to prior complaints against the doctor.
I’ll stop the list there, but when things go wrong it is evident that science is not being done.
The best related published account I know of is of the best cystic fibrosis treatment centers in the country. (Sorry, no reference.)
It badly hallucinated in my test. I asked it "Rust crate to access Postgres with Arrow support" and it made up an arrow-postgres crate. It even gave sample Rust code using this fictional crate! Below is its response (code example omitted):
I can recommend a Rust crate for accessing PostgreSQL with Arrow support.
The primary crate you'll want to use is arrow-postgres, which combines the PostgreSQL connectivity of the popular postgres crate with Apache Arrow data format support.
This crate allows you to:
Query PostgreSQL databases using SQL
Return results as Arrow record batches
Use strongly-typed Arrow schemas
Convert between PostgreSQL and Arrow data types efficiently
Are you sure it searched the web? You have to go and turn on the web search feature, and then the interface is a bit different while it's searching. The results will also have links to what it found.
Exactly. An LLM is not a conventional search engine and shouldn't be prompted as if it were one. The difference between "Rust crate to access Postgres with Arrow support" and "What would a hypothetical Rust crate to access Postgres with Arrow support look like?" isn't that profound from the perspective of a language model. You'll get an answer, but it's entirely possible that you'll get the answer to a question that isn't the one you thought you were asking.
Some people aren't very good at using tools. You can usually identify them without much difficulty, because they're the ones blaming the tools.
It's absolutely how LLMs should work, and IME they do. Why write a full question if a search phrase works just as well? Everything in "Could you recommend xyz to me?" except "xyz" is redundant and only useful when you talk to actual humans with actual social norms to observe. (Sure, there used to be a time when LLMs would give better answers if you were polite to them, but I doubt that matters anymore.) Indeed I've been thinking of codifying this by adding a system prompt that says something like "If the user makes a query that looks like a search phrase, phrase your response non-conversationally as well".
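For what it's worth, a minimal sketch of how that kind of system prompt could be wired up through the API; the model id is a placeholder and the wording is just my guess, not a tested setting:

```python
# Minimal sketch: a system prompt telling the model to answer terse,
# search-style queries non-conversationally. Model id is a placeholder.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = (
    "If the user's message looks like a search phrase rather than a full "
    "question, phrase your response non-conversationally as well: just the "
    "answer, no preamble."
)

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=512,
    system=SYSTEM,
    messages=[{"role": "user", "content": "rust crate postgres arrow"}],
)
print(message.content[0].text)
```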
Totally agree here. I tried the following and had a very different experience:
"Answer as if you're a senior software engineer giving advice to a less experienced software engineer. I'm looking for a Rust crate to access PostgreSQL with Apache Arrow support. How should I proceed? What are the pluses and minuses of my various options?"
Think about it: how much marginal influence does it really have if you use OP's version vs. a fully formed sentence? The keywords are what get it into the right area.
That is not correct. The keywords mean nothing by themselves. To a transformer model, the relationships between words are where meaning resides. The model wants to answer your prompt with something that makes sense in context, so you have to help it out by providing that context. Feeding it a sentence fragment or a disjointed series of keywords may not have the desired effect.
To mix clichés, "I'm feeling lucky" isn't compatible with "Attention is all you need."
I find that providing more context and details initially leads to far more success for my uses. Once there’s a bit of context, I can start barking terms and commands tersely.
I find more hallucination that way; it's like when you're taught as a child to reflect the question back at the start of your answer.
If I am not careful and ask the question in a way that assumes X, the LLM often takes X to be true. ChatGPT has gotten better at correcting this with its web searches.
I am able to get better results with Claude when I ask for answers that include links to the relevant authoritative source of information. But sometimes it still makes up stuff that is not in the source material.
Is this really the case, or is it only the case with Claude etc. because they've already been prompted to act as a "helpful assistant"? If you take a raw LLM and just type Google-search style, it might just continue it as a story or something.
It's funny because many people type full sentence questions into search engines too. It's usually a sign of being older and/or not very experienced with computers. One thing about geeks like me is we will always figure out what the bare minimum is (at least for work, I hope everyone has at least a few things they enjoy and don't try to optimise).
It's not about being young or old; search engines have moved away from pure keyword searches, and typing your actual query often gives better results than searching for keywords, especially with Google.
Wonder if that's why so many people hate its results lol. It shifted from keyword searching to full-sentence searching, but many of us didn't follow the shift.
Well, compare it to the really good answer from Grok (https://x.com/i/grok/share/MMGiwgwSlEhGP6BJzKdtYQaXD) for the same prompt. Also, framing it as a question still pointed Claude to the non-existent postgres-arrow crate.
That's primarily how I do it, though it depends on the search, of course. I use Kagi, though.
I've not yet found much value in the LLM itself. Facts/math/etc. are too likely to be incorrect; I need them to make some attempt at hydrating real information into the response, and to link sources.
This was pretty much my first experience with LLM code generation when these things first came out.
It's still an issue whenever I go light on prompt details, and I _always_ get caught out by it and it _always_ infuriates me.
I'm sure there are endless discussions on front-running overconfident false positives and getting better at prompting and seeding project context, but 1-2 years in this world is like 20 in regular space, and it shouldn't be happening anymore.
Oftentimes I come up with a prompt, then stick it into an LLM to enhance it and identify what I've left out, then finally actually execute the prompt.
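Roughly, the flow looks like the sketch below; the model id and the draft prompt are placeholders, not a tested recipe, and I keep a human review step between the two passes:

```python
# Rough sketch of the two-pass prompting flow described above.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"  # placeholder model id

def ask(prompt: str) -> str:
    msg = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

draft = "Add retry logic to the HTTP client"  # hypothetical task

# Pass 1: have the model tighten the prompt and flag what it leaves out.
review = ask(
    "Rewrite this coding prompt to be more complete and list anything it "
    f"leaves unspecified:\n\n{draft}"
)

# Pass 2: after reading (and usually hand-editing) the review, execute it.
answer = ask(review)
print(answer)
```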
Cite things from ID-based specs. You're facing a skill issue. The reason most people don't see it as such is that an LLM doesn't just "fail to run" here. If this were code you wrote in a compiled language, would you post and say the language infuriates you because it won't compile your syntax errors? As this kind of dev style becomes prevalent and output expectations adjust, work performance reviews won't care that you're mad. So my advice is:
1. Treat it like regular software dev, where you define tasks with ID prefixes for everything, plus acceptance criteria and exceptions. Ask the LLM to reference them in code right before the implementation code (see the sketch below).
2. "Debug" by asking the LLM to self-reflect on the decision-making process that caused the issue; this can give you useful heuristics to use later to further reduce the issues you mentioned.
“It” happening is a result of your lack of time investment into systematically addressing this.
_You_ should have learned this by now. Complain less, learn more.
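To make point 1 concrete, here is a rough sketch of what an ID-prefixed spec and code that references it might look like; the task IDs, acceptance criteria, and function are all invented for illustration:

```python
# Hypothetical ID-prefixed spec, referenced from the implementation below.
#
# SPEC-101: Retry failed HTTP GET requests up to 3 times with exponential backoff.
#   AC-101.1: Retry only on 5xx responses and connection errors.
#   AC-101.2: Raise an error after the final attempt fails.
#   EXC-101.1: Never retry non-GET requests.

import time
import requests

def get_with_retry(url: str, attempts: int = 3) -> requests.Response:
    # Implements SPEC-101 (AC-101.1, AC-101.2); GET-only per EXC-101.1.
    for i in range(attempts):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code < 500:
                return resp  # AC-101.1: don't retry non-5xx responses
        except requests.ConnectionError:
            pass  # AC-101.1: connection errors are retryable
        if i < attempts - 1:
            time.sleep(2 ** i)  # exponential backoff: 1s, 2s, 4s
    # AC-101.2: surface the failure after the final attempt
    raise RuntimeError(f"SPEC-101: all {attempts} attempts failed for {url}")
```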
Yes. Global EV sales were 1.9 million in Dec 2024, 1.3 million in Jan 2025, and 1.4 million in Feb 2025. This is apparently typical after the holiday spending each year.
> The plunge in Tesla electric vehicle sales has continued into February, according to the latest official data, with combined sales of the Model Y and Model 3 EVs plunging 71.9 per cent in the month of February, compared to the same month a year earlier.
Meanwhile, EV sales appear to be UP globally year-over-year (in this case, January 2024 to January 2025).
I've been using the latest v13.2.2 and it regularly goes 100% on its own, door to door, over 1-2 hour trips that I've driven, navigating side roads, highways, lane changes, roadblocks, everything, without a single intervention from me. I just sit and watch; it's incredible, actually. Again, this is NOT just highway experience, but door to door. I've driven the earlier versions and they were pretty good, but this latest one (v13.2.2) is a huge improvement that makes me feel it's arrived.
That is a fundamental misunderstanding of the reliability level needed for fully autonomous vehicle operation. A fully self-driving vehicle operated by consumers, outside of a testing safety protocol, requires an average disengagement rate not on the order of 1-2 hours but of 1,000-2,000 hours to be considered in the vicinity of fully self-driving. You are literally presenting anecdotes that are inadequate by a factor of 1,000x to provide evidence "it's arrived".
Individual driver experience is basically useless for assessing whether it has "arrived", due to the fundamental lack of data any one human can generate; however, individual driver experience is adequate for assessing whether it has not "arrived".
To explain, suppose a manufacturer claims that their widget fails 1 in 1,000,000 times. Suppose a regular human can use 1,000 units. Even if a regular human finds zero failures in their 1,000-unit random sample, that does not provide evidence that the manufacturer's claim is true. You need over 1,000,000 samples, usually on the order of 10,000,000 samples, and you need to observe a number of failures consistent with the claimed rate to make any such claim in a statistically rigorous fashion.
In contrast, if 100,000 samples are collected and 10 failures are observed, that is a rate of 1 in 10,000 (a failure interval already 10x more units than any individual person would ever use), and you can assert with high confidence that the manufacturer's claim is a lie, even though you have not observed the 1,000,000+ samples that would be required to prove the claim true. If a regular human discovers more than 1 failure in their 1,000 units, an observed failure rate of over 1 in 1,000, then you can assert with extreme confidence that the manufacturer is lying. The bar for adequate data to establish statistical significance is the number of failures, not the number of units.
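A quick numeric check of that asymmetry, using the claimed rate and sample sizes assumed in the example above (a sketch only; scipy's binomial functions do the arithmetic):

```python
# Why small samples can refute a reliability claim but not confirm it.
# Numbers are the ones assumed in the example above.
from scipy.stats import binom

claimed_rate = 1e-6    # manufacturer's claim: 1 failure in 1,000,000
n_personal = 1_000     # units one person might realistically use

# Zero failures in 1,000 units is likely even if the TRUE rate is 1,000x
# worse than claimed (~0.37 vs ~0.999), so that observation barely
# distinguishes the two hypotheses.
print(binom.pmf(0, n_personal, 1e-3))          # ~0.368
print(binom.pmf(0, n_personal, claimed_rate))  # ~0.999

# But 10+ failures in 100,000 samples under the claimed rate (expected
# failures = 0.1) is essentially impossible, so observing 10 failures
# decisively refutes the claim.
print(binom.sf(9, 100_000, claimed_rate))      # effectively zero (~1e-17)
```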
Individual drives fall short by a factor of over 1,000x of establishing the requisite reliability level, and thus have about 0.1% of the evidentiary power needed to establish success, but they can be used to establish failure. Even a literal lifetime of human experience with zero faults is barely adequate to establish success, while even a handful of failures over a human lifetime is sufficient to reject the claim of a reliability level adequate for fully self-driving operation, outside of a testing protocol, in the hands of consumers.
This is basic statistics, and it is literally impossible that the people at Tesla do not know that the level of evidence needed to support their claims is over 1,000x more than the auditable evidence they have presented. They just choose to deceive their customers to support their stock price, and then blast soundbite arguments that sound good, but are intentionally deceptive, to overwhelm the discourse.
You're correct. Throwing a disc may be a fun activity, but by no means is it a mark of "greatness." If a pro sport can't entertain, IMO it's a waste of time and contributes little to humanity. I would be happier if people celebrated wins in science.