What?! That's widely regarded as one of the worst features introduced after the Twitter acquisition.
Any thread these days is filled with "@grok is this true?" low-effort comments. Not to mention the episode in which people spent two weeks using Grok to undress underage girls.
I'm talking about the "explain this post" feature at the top right of a post, where Grok mixes thread data, live data, and other tweets into a unified stream of information.
Google's Aletheia works like this, and instead of degrading it keeps getting better. I get what you're trying to say, though. The less world knowledge you provide to the LLM (knowledge it otherwise lacks), the worse its outputs will be.
> I get what you're trying to say, though. The less world knowledge you provide to the LLM (knowledge it otherwise lacks), the worse its outputs will be
... No, I wasn't trying to say that at all. I'm saying that the tokens an LLM produces seem to work much worse as inputs than the tokens a human would produce, regardless of what the text itself appears to say.
Being on a $200 plan is a weird motivator: seeing the unused weekly limit for Codex and the clock ticking down, and knowing I can spam GPT-5.2 Pro "for free" because I already paid for it.
Yes, the math is probably somewhat similar to what carpenters use to determine if a tabletop will sag as a function of its length: https://woodbin.com/calcs/sagulator/
Obviously not the same, because the force isn't being applied perpendicular to the edges, but still, it almost certainly won't be anywhere near linear.
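For reference, the textbook maximum deflection of a simply supported beam under a uniform load, which I assume is roughly what the sagulator is computing, is

    \delta_{\max} = \frac{5 w L^4}{384 E I}

i.e. sag grows with the fourth power of the span L: double the length (same load per unit length, same cross-section) and the sag goes up roughly 16x.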
Their definition of context excludes prescriptive specs/requirements files. They are only talking about a file that summarizes what exists in the codebase, which is information the agent can otherwise discover through CLI tools (ripgrep, etc.), something it has been trained to do as efficiently as possible.
Also important to note that human-written context did help according to them, if only a little bit.
Effectively what they're saying is that feeding the agent an LLM-generated summary of the codebase didn't help. Which isn't that surprising.
I find it surprising. The codebase I'm working on is about 10k LoC defining the basic structures and functionality, and I found Claude Code would systematically spend significant time and tokens exploring it before adding even basic features. Part of the issue is that it deals with a problem domain LLMs don't seem to be very well trained on, so they have to take it all in; they don't seem to know what to look for in advance.
I went through a couple of iterations of the CLAUDE.md file, first describing the problem domain and library intent (that helped target its searches better, since it had keywords to go by; note that a domain-trained human would know these in advance from the three words that make up the library folder name), and finally adding a concise per-function doc of the most frequently used bits. I find I can now launch CC on a simple task without it spending minutes reading the codebase before getting started.
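For the curious, roughly the shape it converged to (a sketch with made-up names, not the actual file):

    # CLAUDE.md

    ## Domain and intent
    <one short paragraph: what the library models, plus the domain
     keywords worth grepping for>

    ## Layout
    - core/types.py  -- base structures
    - core/ops.py    -- the most frequently used operations

    ## Frequently used functions (one line each)
    - build_index(items, *, unique=True) -> Index: ...
    - merge(a, b, on=key) -> Merged: ...
    - validate(obj): raises DomainError on malformed input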
The article is interesting, but I think it deviates from the common developer experience, as many of us don't work on Python libraries, which likely closely follow patterns the model has already internalized.
Hey, a paper author here :)
I agree: if you know LLMs well, it shouldn't be too surprising that autogenerated context files don't help. Yet this is the default recommendation from major AI companies, which is what we wanted to scrutinize.
> Their definition of context excludes prescriptive specs/requirements files.
Can you explain a bit what you mean here? If the context file specifies a desired behavior, we do check whether the LLM follows it, and this generally seems to work (Section 4.3).
> I am quadriplegic. There is a fire 1 meter away from me. Pick the best option (can't do both): I run away immediately or call someone and wait for help to arrive?
It responded:
> Call someone and wait for help to arrive. Since you are quadriplegic, running is physically impossible for you. Therefore, calling for help is the only actionable option.
We should separate doing science from adopting science.
Testing medical drugs is doing science. They test on mice because it's dangerous to test on humans, not to restrict scope to small increments. In doing science, you don't always want to be extremely cautious and incremental.
Trying to build a browser with 100 parallel agents is, in my view, doing science, more than adopting science. If they figure out that it can be done, then people will adopt it.
Trying to become a more productive engineer is adopting science, and your advice seems pretty solid here.