
And if you ask why it's accurate it'll spaff out another list of pretty convincing answers.


It does indeed, but at the end it added:

>However, I should note: without access to the actual crash file, the specific curl version, or ability to reproduce the issue, I cannot verify this is a valid vulnerability versus expected behavior (some tools intentionally skip cleanup on exit for performance). The 2-byte leak is also very small, which could indicate this is a minor edge case or even intended behavior in certain code paths.

Even when biased towards positivity, it's still giving me the correct answer.

Given a neutral "judge this report" prompt, we get:

"This is a low-severity, non-security issue being reported as if it were a security vulnerability." with a lot more detail as to why.

So positive, neutral, or negative biased prompts all result in the correct answer that this report is bogus.


Yet this is not reproducible. This is the whole issue with LLMs: they are random.

You cannot trust that it'll do a good job on all reports, so you'll have to manually review the LLM's verdicts anyway, or hope that real issues didn't get rejected as false negatives and fake ones accepted as false positives.

This is what I've seen most LLM proponents do: they gloss over the issues and tell everyone it's all fine. Who cares about the details? They don't review the gigantic pile of slop code/answers/results they generate. They skim and say YOLO. Worked for my narrow set of anecdotal tests, so it must work for everything!

IIRC DOGE did something like this to analyze which government jobs were needed and which weren't, and then fired people based on that. Guess how good the result was?

This is a very similar scenario: making a judgement call based on a small set of data. LLMs absolutely suck at it. And I'm not even going to get into the issue of liability, which is another can of worms.


Is it not reproducible? Someone up thread reproduced it and expanded on it. It worked for me the first time I prompted. Did you try it, or are you just guessing that it's not reproducible because that's what you already think?

I'm not talking about completely replacing humans; the goal of this exercise was to demonstrate how to use an LLM to filter out garbage. Low-quality, semi-anonymous reports don't deserve a whole lot of accuracy, and being conservative and rejecting most reports, even when you throw out some legitimate ones, is fine.
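To be concrete, here's a minimal sketch of the kind of filter I mean. It uses the OpenAI Python client purely as an example; the model name, prompt wording, and the rule of escalating only when every pass agrees are all placeholder choices on my part, not a claim about how any real project triages its reports:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    TRIAGE_PROMPT = (
        "You are triaging an inbound security report for an open source project. "
        "Reply with exactly one word: VALID if it plausibly describes a real "
        "vulnerability, REJECT otherwise.\n\n"
    )

    def worth_a_human_look(report_text: str, passes: int = 3) -> bool:
        """Conservative filter: escalate only if every pass votes VALID."""
        for _ in range(passes):
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[{"role": "user", "content": TRIAGE_PROMPT + report_text}],
                temperature=0,
            )
            verdict = resp.choices[0].message.content.strip().upper()
            if not verdict.startswith("VALID"):
                return False  # a single REJECT vote drops the report
        return True

Requiring unanimous VALID votes is exactly the conservatism I mean: a flaky run occasionally costs you a legitimate report, not an accepted fake one.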

It seems like, regardless of the evidence presented, your prejudices will lead you to the same conclusions, so what's the point of discussing anything? I looked for, found, and shared evidence; you're sharing your opinion.

>IIRC DOGE did something like this to analyze government jobs that were needed or not and then fired people based on that. Guess how good the result was?

I'm talking about filtering spammy communication channels, which requires nothing like the care needed for employment decisions.

Your comment is plainly just bad faith and prejudice.


> Is it not reproducible? Someone up thread reproduced it and expanded on it. It worked for me the first time I prompted. Did you try it, or are you just guessing that it's not reproducible because that's what you already think?

I assumed you knew how LLMs work. They are random by nature, not "because I'm guessing it". There's a reason why, if you ask an LLM the exact same prompt hundreds of times, you'll get hundreds of different answers.
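If you don't believe that, run the same prompt in a loop and tally the answers. A rough sketch, again using the OpenAI client only as a stand-in (model name and prompt are placeholders); any hosted model sampled at a nonzero temperature shows the same spread:

    from collections import Counter
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set
    PROMPT = ("Is this crash report a real security vulnerability? "
              "Answer YES or NO.\n\n<paste report here>")

    verdicts = Counter()
    for _ in range(20):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,      # sampling on: identical prompts, varying outputs
        )
        verdicts[resp.choices[0].message.content.strip().upper().rstrip(".")] += 1

    print(verdicts)  # rarely 20-0; the split is the randomness in question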

>I looked for, found, and shared evidence

Anecdotal evidence. Studies have shown how unreliable LLMs are exactly because they are not deterministic. Again, it's a fact, not an opinion.

>I'm talking about filtering spammy communication channels

So if we make tons of mistakes there, who cares, right?

I only used this as an example because it's one of the few very public uses of LLMs to make judgement calls where people accepted the output as true and faced the consequences.

I'm sure there are plenty more people getting screwed over by similar mistakes, but folks generally aren't stupid enough to say so publicly. Maybe the huge Salesforce mistake qualifies too? Incidentally, it also involved people's jobs.

Regardless, the point stands: they are unreliable.

Want to trust LLMs blindly for your weekend project? Great! The only potential victim of its mistakes is you. For anything serious, like a huge open source project? That's irresponsible.



