> non-trivial coding tasks I’ve come back to the idea LLMs are super search engi...

20k · 2026-01-12T09:34:19 1768210459

I wish LLMs were good at search. I've tried to evaluate them many times for their quality at answering research questions for astrophysics (specifically numerical relativity). If they were good at answering questions, I'd use them in a heartbeat

Without exception, every technical question I've ever asked an LLM that I know the answer to, has been substantially wrong in some fashion. This makes it just.. absolutely useless for research. In some cases I've spotted it straight up plagiarising from the original sources, with random capitalisation giving it away

The issue is that once you get even slightly into a niche, they fall apart because the training data just doesn't exist. But they don't say "sorry there's insufficient training data to give you an answer", they just make shit up and state it as confidently incorrect

simonw · 2026-01-12T14:19:51 1768227591

LLMs got good at search last year. You need to use the right ones though - ChatGPT Thinking mode and Google AI mode (that's https://www.google.com/ai - which is NOT the same as regular Google's "AI overviews" which are still mostly trash) are both excellent.

I've been tracking advances in AI assisted search here - https://simonwillison.net/tags/ai-assisted-search/ - in particular:

- https://simonwillison.net/2025/Apr/21/ai-assisted-search/ - April is when they started getting good, with o3 and the various deep research tools

- https://simonwillison.net/2025/Sep/6/research-goblin/ - GPT-5 got excellent. This post includes several detailed examples, including "Starbucks in the UK don’t sell cake pops! Do a deep investigative dive".

- https://simonwillison.net/2025/Sep/7/ai-mode/ - AI mode from Google

locknitpicker · 2026-01-12T17:16:38 1768238198

> LLMs got good at search last year. You need to use the right ones though - ChatGPT Thinking mode and Google AI mode (that's https://www.google.com/ai - which is NOT the same as regular Google's "AI overviews" which are still mostly trash) are both excellent.

I disagree. You might have seen some improvements in the results, but all LLMs still hallucinate quite hard on simple queries where you prompt them to cite their sources. You'll see ChatGPT insist quite hard that the source of their assertions is the 404 link that it asserts is working.

pxc · 2026-01-12T15:38:18 1768232298

> Without exception, every technical question I've ever asked an LLM that I know the answer to, has been substantially wrong in some fashion.

The other problem that I tend to hit is a tradeoff between wrongness and slowness. The fastest variants of the SOTA models are so frequently and so severely wrong that I don't find them useful for search. But the bigger, slower ones that spend more time "thinking" take so long to yield their (admittedly better) results that it's often faster for me to just do some web searching myself.

They tend to be more useful the first time I'm approaching a subject, or before I've familiarized myself with the documentation of some API or language or whatever. After I've taken some time to orient myself (even by just following the links they've given me a few times), it becomes faster for me to just search by myself.

elzbardico · 2026-01-12T13:48:51 1768225731

If you nudge it towards tool use, A lot of time it can give you better answers.

Instead of "how cheese X is usually made" "search the web and give me a summary on the ways cheese X is made"

sandworm101 · 2026-01-12T12:39:29 1768221569

>> at answering research questions for astrophysics

I googled for "helium 3" yesterday. Google's AI answer said that helium 3 is "primarily sourced from the moon", as if we were actively mining it there already.

fishtacos · 2026-01-12T16:42:07 1768236127

On a similar note, Gemini told that I was born in 2025 when I did a cursory search for my real name. It's rather confident.

BYazfVCcq · 2026-01-12T12:53:37 1768222417

There are probably thousands of scifi books where the moon has some forms of helium 3 mining. Considering Google pirated and used them all for training it makes sense that it puts it in present tense.

HPsquared · 2026-01-12T13:49:43 1768225783

I wonder how much memory and computing time goes into making them, vs. a typical "proper" LLM prompt. It's like the freebies you get with a Christmas cracker.

yunohn · 2026-01-12T10:15:34 1768212934

> I wish LLMs were good at search

The entire situation of web search for LLMs is a mess. None of the existing providers return good or usable results; and Google refuses to provide general access to theirs. As a result, all LLMs (except maybe Gemini) are severely gimped forever until someone solves this.

I seriously believe that the only real new breakthrough for LLM research can be achieved by a clean, trustworthy, comprehensive search index. Maybe someone will build that? Otherwise we’re stuck with subpar results indefinitely.

embedding-shape · 2026-01-12T10:51:09 1768215069

YaCy does a pretty good job, and is free, and you can run yourself, so the quality/experience is pretty much up to you. Paired together with a local GPT-OSS-120b with reasoning_effort set to high, I'm getting pretty good results. Validated with questions I do know the answer to, and seems alright although could be better of course, still getting better results out of GPT5.2 Pro which I guess is to be expected.

yunohn · 2026-01-12T12:43:59 1768221839

The point of my comment was that the AI/LLM is almost irrelevant in light of low quality search engine APIs/indexes. Is there a way to validate the actual quality and comprehensiveness of YaCY beyond anecdata?

embedding-shape · 2026-01-12T12:56:10 1768222570

> Is there a way to validate the actual quality and comprehensiveness of YaCY beyond anecdata?

No, because it's your own index essentially, hence the "the quality/experience is pretty much up to you" part.

immibis · 2026-01-12T23:50:07 1768261807

How to build a search engine, apparently:

1. Install YaCy

2. Draw the rest of the owl

yunohn · 2026-01-12T13:12:33 1768223553

Yeah, that’s not really reassuring nor indicative of its usefulness or value.

embedding-shape · 2026-01-12T13:17:44 1768223864

Yeah, if that's how you feel about your own abilities, then I guess that's the way it is. Not sure what that has to do with YaCy or my original comment.

yunohn · 2026-01-12T18:11:00 1768241460

Respectfully, you said:

> YaCy does a pretty good job

I assume that should be qualified with some basic amount of evidence beyond “I said so”? Anyways, thanks for pointing me in the direction of YaCy, will try it out.

PeterStuer · 2026-01-12T07:39:56 1768203596

An example I had last month. Some code (dealing with PDF's) package ran into a resources problem in production. LLM suggested an adaptation to the segment that caused the problem, but that code pulled in 3 new non-trivial dependecies. Added constraints and the next iteration it dropped 1 of the 3. Pushed further and it confirmed my suggestion that the 2 remaining dependencies could be covered just by specifying an already existing parameter in the constructor.

The real problem btw was a bug introduced in the PDF handeling package 2 versions ago that caused resource handeling problems in some contexts, and the real solution was roling back to the version before the bug.

I'm still using AI daily in my development though, as as long as you sort of know what you are doing and have enough knowledge to evaluate it is very much a net productivity multiplier for me.

friendzis · 2026-01-12T08:38:46 1768207126

> But giving all the answers without strong guidance on non-trivial architectural points— entropy. LLMs churning independently quickly devolve into entropy.

Typical iterative-circular process "write code -> QA -> fix remarks" works because the code is analyzable and "fix" is on average cheaper than "write", therefore the process, eventually, converges on a "correct" solution.

LLM prompting is on average much less analyzable (if at all) and therefore the process "prompt LLM -> QA -> fix prompt" falls somewhere between "does not converge" and "convergence tail is much longer".

This is consistent with typical observation where LLMs are working better: greenfield implementations of "slap something together" and "modify well structured, uncoupled existing codebase", both situations where convergence is easier in the first place, i.e. low existing entropy.

0xf8 · 2026-01-12T16:28:05 1768235285

I very much agree with how you’ve categorized the initial state condition that is amenable to LLM assisted SWE and works well to a greater state of beneficial order. And implicitly I also agree most of the complement to that set of applied contexts yields ~medium to not so productive results.

But what do you mean by “LLM prompting is on average much less analyzable” ? Isn’t structured prompting (what that should optimally look like) the most objective and well defined part of the whole workflow. it’s the lowest entropy part of the situation, we know pretty well what a good LLM prompt is and what will be ineffective, even LLMs “know” that. Do you mean “context engineering” is hard to optimize around ? That’s often thought of interchangeably I think, but regardless that has in fact become the “hard problem” (user facing) in effectively leveraging LLM for dev work. Ever since the reasoning class models were introduced I think, it became more about context engineering in practice than prompting. Nowadays from the very onset Even resuming a session efficiently often requires a non-trivial approach that we’ve already started to design patterns and built tools around, (like CLI coding workflows adding /compact as user directive, etc).

I’m not a software engineer by trade, so I can’t pretend to know what that fully entails at the tail ends of enterprise scale and complexity, but I’ve spent a decent amount of time programming and as far as LLMs go, I think there’s probably somewhere down the road where we get so methodical about context engineering and tooling and memory management, all of the vast still somewhat nebulous surrounding space and scaffolding to LLM workflows that have a big impact on productive use of them—we may eventually engineer that aspect to an extent that will be able to much more consistently yield better results across more applied contexts than the “clean code”/“trivial app” dichotomy. But … I think the depth of additional effort and knowledge and skill required by human user to do this optimal context engineering (once we fully understand how even) to get the best out of LLMs… I think that quickly just converges to — what it means to be a competent software engineer already. the meta layers around just “writing code” that are required to build robust systems and maintain them, the amount of work required to coerce non-deterministic models into effectively internalizing that, or at minimum not fvcking it up… that juice might not be worth the squeeze when it’s essentially what a good developer’s job is already. If that’s true then there will likely remain a ceiling of finite productivity you can expect from LLM assisted development for a long time… (I conjecture).

dividedbyzero · 2026-01-12T11:13:14 1768216394

They don't even really do that IME. If I ask Claude or ChatGPT to generate terraform for non-trivial but by no means obscure or highly unusual setups, they almost invariably hallucinate part of the answer even if a documented solution exists that isn't even that difficult. Maybe vibe coding JavaScript is that much better, or I'm just hopeless at prompting, but I feel a few dozen lines of fairly straightforward terraform config shouldn't require elaborate prompt setups, or I can just save some brain cycles by writing it myself.

JohnMakin · 2026-01-12T16:54:22 1768236862

For better or for worse have spent a large amount of time in terraform since 0.13 and I can confidently say LLM's are very, very bad at it. My favorite is when it invents internal functions (that look suspiciously like python) that do not exist, even when corrected, it will still keep going back to them. A year or two ago there were bad problems with hallucinated resource field names but I haven't seen that as much these days.

It however, is pretty good at refactoring given a set of constraints and an existing code base. It is decent at spitting out boilerplate code for well-known resources (such as AWS), but then again, those boilerplate examples are mostly coming straight from the documentation. The nice thing about refactoring with LLM's in terraform is, even if you vibe it, the refactor is trivially verifiable because the plan should show no changes, or the exact changes you would expect.

PunchyHamster · 2026-01-12T13:31:36 1768224696

that would be true if not for LLM making up answers where none exists.

Like, I've seen Claude go thru source code of the program, telling (correctly!) what counters are in code that return value I need (I just wanted to look at some packet metrics), then inventing entirely fake CLI command to extract those metrics

IAmGraydon · 2026-01-12T04:19:12 1768191552

>I’ve come back to the idea LLMs are super search engines.

Yes! This is exactly what it is. A search engine with a lossy-compressed dataset of most public human knowledge, which can return the results in natural language. This is the realization that will pop the AI bubble if the public could ever bring themselves to ponder it en masse. Is such a thing useful? Hell yes! Is such a thing intellegent? Certainly NO!

10c8 · 2026-01-12T04:26:09 1768191969

While I agree, I can't help but wonder: if such a "super search engine" were to have the knowledge on how to solve individual steps of problems, how different would that be from an "intelligent" thing? I mean that, instead of "searching" for the next line of code, it searches for the next solution or implementation detail, then using it as the query that eventually leads to code.

chongli · 2026-01-12T05:02:41 1768194161

Having knowledge isn't the same as knowing. I can hold a stack of physics papers in my hand but that doesn't make me a physics professor.

LLMs possess and can retrieve knowledge but they don't understand it, and when people try to get them to do that it's like talking to a non-expert who has been coached to smalltalk with experts. I remember reading about a guy who did this with his wife so she could have fun when travelling to conferences with him!

IAmGraydon · 2026-01-12T21:41:05 1768254065

I've spent a lot of time thinking about that - what if the realization that we need is not that LLMs are intelligent, but that our own brains work in the same way as the LLMs. There is certainly a cognitive bias to believe that humans are somehow special and that our brains are not simply machinery.

The difference, to me, is that an LLM can very efficiently recall information, or more accurately, a statistical model of information. However, they seem to be unable to actually extrapolate from it or rationalize about it (they can create the illusion of rationalization be knowing what the rationalization would look like). A human would never be able to ingest and remember the amount of information that an LLM can, but we seem to have the incredible ability of extrapolation - to reach new conclusions by deeply reasoning about old ones.

This is much like the difference in being "book smart" and "actually smart" that some people use to describe students. Some students can memorize vast amounts of information, pass all tests with straight A's, only to fail when they're tasked with thinking on their own. Others perform terribly on memorization tasks, but naturally are gifted at understanding things in a more intuitive sense.

I have seen heaps of evidence that LLMs have zero ability to reason, so I believe that there's something very fundamental missing. Perhaps the LLM is a small part of the puzzle, but there doesn't seem to be any breakthroughs that seem like we might be moving towards actual reasoning. I do think that the human brain can very likely be emulated if we cracked the technology. I just don't believe we're close.

antonvs · 2026-01-12T09:42:00 1768210920

> …can return the results in natural language.

That’s one of the most important features, though. For example, LLMs can analyze a code base and tell you how it works in natural language. That demonstrates functional understanding and intelligence - in addition to exceeding the abilities of the majority of humans in this area.

You’d need a very no-true-Scotsmanned definition of intelligence to be able exclude LLMs. That’s not to say that they’re equivalent to human intelligence in all respects, but intelligence is not an all-or-nothing property. (If it were, most humans probably wouldn’t qualify.)

omnimus · 2026-01-12T10:34:59 1768214099

LLMs being intelligence or not is not really that interesting. It's just matter of how you define intelligence. It matters maybe to the AI CEOs and their investors because of marketing.

What matters is how useful LLMs actually are. Many people here say it is useful as advanced search engine and not that useful as your coworker. That is very useful but most likely not something the AI companies want to hear.

runarberg · 2026-01-12T10:50:30 1768215030

> You’d need a very no-true-Scotsmanned definition of intelligence to be able exclude LLMs.

The thing is, that intelligence is an anthropocentric term. And has always been defined in a no-true-Scotsman way. When we describe the intelligence of other species we do so in extremely human terms (except for dogs). For example we consider dolphins smart when we see them play with each other, talk to each other, etc. We consider chimpanzees when we see them use a tool, row a boat, etc. We don’t consider an ant colony smart when they optimize a search for food sources, only because humans don’t normally do that. The only exception here are dogs, who we consider smart when they obey us more easily.

Personally, my take on this is that intelligence is not a useful term in philosophy nor science. Describing a behavior as intelligent is kind of like calling a small creature a bug. It is useful in our day to day speech, but fails when we want to build any theory around it.

antonvs · 2026-01-12T17:27:17 1768238837

In the context of "AI", the use of the word "intelligence" has referred to human-comparable intelligence for at least the last 75 years, when Alan Turing described the Turing Test. That test was explicitly intended to test for a particular kind of human equivalent intelligence. No other animal has come close to passing the Turing Test. As such, the distinction you're referring to isn't relevant to this discussion.

> Personally, my take on this is that intelligence is not a useful term in philosophy nor science.

Hot take.

runarberg · 2026-01-12T23:03:27 1768259007

The Turing test was debunked by John Searle in 1980 with the Chinese room thought experiment. And even looking past that, the existence, and the pervasiveness, of the Turing test proves my point that this term is and always has been extremely anthropocentric.

In statistics there has been a prevailing consensus for a really long time that artificial intelligence is not only a misnomer, but also rather problematic, and maybe even confusing. There has been a concerted effort the past 15 years to move away from this term onto something like machine learning (machine learning is not without its own set of downsides, but is still miles better then AI). So honestly my take is not that hot (at least not in statistics; maybe in psychology and philosophy).

But I want to justify my take in psychology. Psychometricians have been doing intelligence testing for well over a century now, and the science is not much further along then it was a century ago. No new prediction, no new subfields, etc. This is a hallmark of a scientific dead end. And on the flip side, psychological theories that don‘t use intelligence at all are doing just fine.

lutusp · 2026-01-12T07:22:02 1768202522

> Is such a thing intellegent [sic]? Certainly NO!

A proofreader would have caught this humorous gaffe. In fact, one just did.

lxgr · 2026-01-12T08:09:45 1768205385

I personally had the completely opposite takeaway: Intelligence, at its core, really might just be a bunch of extremely good and self-adapting search heuristics.

croon · 2026-01-12T08:52:50 1768207970

I don't blurt out different answers to the same question using different phrasing, I doubt any human does.

easyThrowaway · 2026-01-12T09:25:19 1768209919

We actually do, and often - depending on who our speaker is, our relationship with them, the tone of the message, etc. Maybe our intellect is not fully an LLM, but I truly wonder how much of our dialectical skills are.

croon · 2026-01-12T11:50:36 1768218636

You're describing the same answer with different phrasing.

Humans do that, LLMs regularly don't.

If you phrase the question "what color is your car?" a hundred different ways, a human will get it correct every time. LLMs randomly don't, if the token prediction veers off course.

Edit:

A human also doesn't get confused at fundamental priors after a reasonable context window. I'm perplexed that we're still having this discussion after years of LLM usage. How is it possible that it's not clear to everyone?

Don't get me wrong, I use it daily at work and at home and it's indeed useful, but there's is absolutely 0 illusion of intelligence for me.

ffwd · 2026-01-12T07:02:09 1768201329

Even though I think it's true that it's lossy, I think there is more going on in an LLM neural net. Namely that when it uses tokens to produce output, you essentially split the text into millions or billions of chunks, each with probability of those chunks. So in essence the LLM can do a form of pattern recognition where the patterns are the chunks and it also enables basic operations on those chunks.

That's why I think you can work iteratively on code and change parts of the code while keeping others, because the code gets chunked and "probabilitized'. It can also do semantic processing and understanding where it can apply knowledge about one topic (like 'swimming') to another topic (like a 'swimming spaceship', it then generates text about what a swimming spaceship would be which is not in the dataset). It chunks it into patterns of probability and then combines them based on probability. I do think this is a lossy process though which sucks.

ffwd · 2026-01-12T13:11:02 1768223462

Maybe it's looked down upon to complain about downvotes but I have to say I'm a little disappointed that there is a downvote with no accompanying post to explain that vote, especially to a post that is factually correct and nothing obviously wrong with it.

carlmr · 2026-01-12T07:50:56 1768204256

>It’s not readily apparent at first blush the LLM is doing this, giving all the answers.

Now I'm wondering if I'm prompting wrong. I usually get one answer. Maybe a few options but rarely the whole picture.

I do like the super search engine view though. I often know what I want, but e.g. work with a language or library I'm not super familiar with. So then I ask how do I do x in this setting. It's really great for getting an initial idea here.

Then it gives me maybe one or two options, but they're verbose or add unneeded complexity. Then I start probing asking if this could be done another way, or if there's a simpler solution to this.

Then I ask what are the trade-offs between solutions. Etc.

It's maybe a mix of search engine and rubber ducking.

Agents are, like for OP, a complete failure for me though. Still can't get them to not run off into a completely strange direction, leaving a minefield of subtle coding errors and spaghetti behind.

richardw · 2026-01-12T08:18:39 1768205919

I’ve recently created many Claude skills to do repeatable tasks (architecture review, performance, magic strings, privacy, SOLID review, documentation review etc). The pattern is: when I’ve prompted it into the right state and it’s done what I want, I ask it to create a skill. I get codex to check the skill. I could then run it independently in another window etc and feed back to adjust…but you get the idea.

And almost every time it screws up we create a test, and often for the whole class of problem. More recent it’s been far better behaved. Between Opus, skills, docs, generating Mermaid diagrams, tests it’s been a lot better. I’ve also cleaned up so much of the architecture so there’s only one way to do things. This keeps it more aligned and helps with entropy. And they’ll work better as models improve. Having a match between code, documents and tests means it’s not just relying on one source.

Prompts like this seem to work: “what’s the ideal way to do this? Don’t be pragmatic. Tokens are cheaper than me hunting bugs down years later”

FooBarWidget · 2026-01-12T08:41:50 1768207310

Can you tell me more about how you do tests? How do they look like? What testing tools or frameworks do you use?

XenophileJKO · 2026-01-12T07:35:55 1768203355

I'm not going to argue about how capable the models are, I personally think they are pretty capable.

What I will argue is that the LLMs are not just search engines. They have "compressed" knowledge. When they do this, they learn relations between all kinds of different levels of abstractions and meta patterns.

It is really important to understand that the model can follow logical rules and has some map of meta relationships between concepts.

Thinking of a LLM as a "search engine" is just fundamentally wrong in how they work, especially when connected to external context like code bases or live information.

lxgr · 2026-01-12T08:07:48 1768205268

A sufficiently advanced search engine might actually be indistinguishable from intelligence.

After all, until quite recently, chess engines really were quite mechanically search engines too.

XenophileJKO · 2026-01-12T09:28:22 1768210102

I'm just saying you are doing a dis-service to yourself if that is your mental model on how current SOTA models work.

gf000 · 2026-01-12T08:45:17 1768207517

Well, it's "a search engine that applies some transformations on top of the results" doesn't sound to me as a terrible way to think about LLMs.

> can follow logical rules

This is not their strong suite, though. They can only follow through a few levels on their own. This can be improved by agent-style iterations or via invoking external tools.

XenophileJKO · 2026-01-12T09:30:34 1768210234

Let's see how this comment ages why don't we. I've understood where we are going and if you look at my comment history. I have confidence that in 12 months time. One opinion will be proved out with observations and the other will not.

gf000 · 2026-01-12T13:26:23 1768224383

For the "only few levels" claim, I think this one is sort of evident from the way they work. Solving a logical problem can have an arbitrary number of steps, and in a single pass there is only so many connection within a LLM to do some "work".

As mentioned, there are good ways to counter this problem (e.g. writing a plan and then iteratively going over those less-complex ones, or simply using the proper tool for the problem: use e.g. a SAT solver and just "translate" the problem to and from the appropriate format)

Nonetheless, I'm always open to new information/evidence and it will surely improve a lot in a year. As for reference, to date this is my favorite description of LLMs: https://news.ycombinator.com/item?id=46561537