
We're having a UI argument about a workflow problem.

We treat a stateless session like a colleague, then get upset when it forgets our preferences. Anthropic simplified the output because power users aren't the growth vector. This shouldn't surprise anyone.

The fix isn't verbose mode. It's a markdown file the model reads on startup — which files matter, which patterns to follow, what "good" looks like. The model becomes as opinionated as your instructions. The UI becomes irrelevant.
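
Something like this, as a sketch (the filename, paths, and sections are illustrative, not a standard):

  # workspace.md (read on startup)
  Key files: src/api/ (handlers), src/core/ (business rules), docs/adr/ (past decisions)
  Patterns: thin handlers, every external call goes through src/clients/
  Good looks like: small diffs, tests updated in the same change, no new deps without an ADR
  Never: reach into the data layer directly from a handler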

The model is a runtime. Your workflow is the program. Arguing about log verbosity is a distraction.


Building my dev workspace into an operating system. Not metaphorically — structurally.

  10 MCP servers as device drivers (exchange APIs, browser automation, Apple docs, issue tracking).
  200+ skills as prose runbooks that compose system calls. Agent-mail for IPC between parallel
  agents. A drift detector called "wobble" that scores skill stability using bias/variance analysis.
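
The scoring behind "wobble" is roughly this shape (a simplified sketch; the feature extraction, which is the hard part, is elided):

  import statistics

  def wobble_score(runs: list[list[float]], reference: list[float]) -> dict:
      # Each run is one invocation of a skill reduced to a feature vector
      # (files touched, tests run, tokens used, ...); names are illustrative.
      dims = range(len(reference))
      means = [statistics.mean(r[d] for r in runs) for d in dims]
      # bias: how far the average run sits from the reference ("intent") run
      bias = statistics.mean((m - ref) ** 2 for m, ref in zip(means, reference))
      # variance: how much repeated runs disagree with each other
      variance = statistics.mean(statistics.pvariance([r[d] for r in runs]) for d in dims)
      return {"bias": bias, "variance": variance, "wobble": bias + variance}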

socketcluster nailed it. I've seen this firsthand — the same agent produces clean output when the codebase has typed specs and a manifest, and produces garbage when it's navigating tribal knowledge. The hard part was always there. Agents just can't hide it like humans can.

This is correct if the prompt is single-use. It's wrong if the prompt becomes a reusable skill that fires correctly 200 times.

The problem isn't generation — it's unstructured generation. Prompting ad-hoc and hoping the output holds up. That fails, obviously.

200 skills later: skills are prose instructions matched by description, not slash commands. The thinking happens when you write the skill. The generation happens when you invoke it. That's composition, not improvisation. Plus drift detection that catches when a skill's behavior diverges from its intent.
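
The matching itself is boring; in spirit it's just this (a toy sketch, with keyword overlap standing in for what the model actually does from descriptions):

  def match_skill(request: str, skills: dict[str, str]) -> str | None:
      # skills maps skill name -> its one-line description (the runbook header)
      req = set(request.lower().split())
      best, best_score = None, 0
      for name, description in skills.items():
          overlap = len(req & set(description.lower().split()))
          if overlap > best_score:
              best, best_score = name, overlap
      return best

  match_skill("compact old agent logs into summaries",
              {"log-compactor": "compact logs older than 7 days into summaries",
               "spec-drift": "compare a service against its openapi spec"})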

Don't stop generating. Start composing.


  Agents don't need full logs to learn from failures — they need structured error patterns. Compact
  logs older than 7 days into summaries (status, errors, token count, 1-line context). Saves 80-90%
  tokens for 30-day trend analysis. Like materializing aggregates instead of re-querying raw tables.
Opened an RFC for this: https://github.com/github/gh-aw/issues/14603
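
The mechanics are mundane; roughly this (assuming JSON-lines agent logs; field names are illustrative):

  import json
  from datetime import datetime, timedelta, timezone

  def compact_old_entries(log_path: str, summary_path: str, days: int = 7) -> None:
      # Fold entries older than `days` into one-line summaries; recent ones stay verbatim.
      # Assumes ISO-8601, timezone-aware timestamps.
      cutoff = datetime.now(timezone.utc) - timedelta(days=days)
      with open(log_path) as src, open(summary_path, "a") as dst:
          for line in src:
              entry = json.loads(line)
              if datetime.fromisoformat(entry["timestamp"]) >= cutoff:
                  continue
              dst.write(json.dumps({
                  "ts": entry["timestamp"],
                  "status": entry.get("status"),
                  "errors": entry.get("errors", [])[:3],          # the pattern, not the dump
                  "tokens": entry.get("token_count"),
                  "context": (entry.get("context") or "")[:120],  # 1-line context
              }) + "\n")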

> The interesting failure mode isn’t just “one bad actor slips through”, it’s provenance: if you want to “denounce the tree rooted at a bad actor”, you need to record where a vouch came from (maintainer X, imported list Y, date, reason), otherwise revocation turns into manual whack-a-mole.
>
> Keeping the file format minimal is good, but I’d want at least optional provenance in the details field (or a sidecar) so you can do bulk revocations and audits.
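
Concretely, provenance plus tree revocation could look something like this (a sketch; the field names are assumptions, not the file format under discussion):

  from dataclasses import dataclass

  @dataclass
  class Vouch:
      subject: str   # who is vouched for
      voucher: str   # who vouched (maintainer X)
      source: str    # manual, imported list Y, ...
      date: str
      reason: str

  def revoke_tree(vouches: list[Vouch], bad_actor: str) -> list[Vouch]:
      # Propagate taint to everyone vouched for, directly or transitively, by the bad actor.
      tainted, changed = {bad_actor}, True
      while changed:
          changed = False
          for v in vouches:
              if v.voucher in tainted and v.subject not in tainted:
                  tainted.add(v.subject)
                  changed = True
      return [v for v in vouches if v.voucher not in tainted and v.subject not in tainted]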

Pinning exists, but the interesting part is signal quality: macOS gets consistent “urgency” signals (QoS) from a lot of frameworks and apps, so scheduling on heterogeneous cores is less of a guessing game than inferring urgency from runtime behavior.

Bad code crashes. You fix crashes. Acceptable code fails by doing nothing. You don't fix nothing.

Good code isn't dying. The cost of bad code just went up.


The ralph loop is a great shape — "read, identify, task, execute" is how autonomous refactoring should work.

The part I'd push on: what happens on loop N+1? 250 services refactored means 250 places where the spec the agent built against might have already changed, cross-references got broken across context windows, and comments now point at functions that were renamed three loops ago.

I've been working on this problem from the other end. The generation side is largely solved — agents can build. The unsolved part is drift: the slow, silent divergence between what you intended and what actually exists. Spec drift, behavioral drift, comment drift. If you don't measure it, you don't see it until something breaks in production.
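
Comment drift is the easiest of the three to catch mechanically; a crude sketch for Python sources (name matching only, so it's noisy):

  import ast, io, re, tokenize

  def stale_comment_refs(source: str) -> list[str]:
      # Flag names referenced in comments like "see load_user()" that no longer exist in the code.
      tree = ast.parse(source)
      defined = {node.name for node in ast.walk(tree)
                 if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))}
      stale = []
      for tok in tokenize.generate_tokens(io.StringIO(source).readline):
          if tok.type == tokenize.COMMENT:
              stale += [name for name in re.findall(r"\b(\w+)\(\)", tok.string)
                        if name not in defined]
      return stale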

The intelligence buying framing is right, but I think the real cost isn't the tokens — it's the maintenance surface area those tokens create. Every autonomous refactor is a bet that the output stays aligned with intent over time. Without something watching for divergence, you're buying intelligence today and technical debt tomorrow.


That's the point of the loop (the prompt is in another comment): start with a fresh context at every step, read the whole code base, and do one thing at a time.

Two important parts have been left out of the article. The first is service code size: our services are small enough to fit in a context and still leave room for implementing the change. If this is not the case, you need to scope the task down from 'read the whole service'.

The other part is that our services interact with HTTP APIs specified as OpenAPI YAML specs, and the refactoring hopefully doesn't alter their behaviour or specifications. If it were internal APIs or libraries, where the specs are part of the code that could be touched by the refactoring, I would be less at ease with this kind of approach.

The services also have close to 100% test coverage, and this is still essential, as the models still make mistakes that wouldn't be caught without the tests.


    > …our services interact with http apis…
    > … 
    > …If it was internal apis or libraries…
That reminds me that I wanted to ask you: How good is your agent with complying with your system's architectural patterns?

Given my admittedly limited experience with coding agents, I'd expect a fully autonomous agent to have a tendency to do naïve juniory dev stuff.

Like, for example, write code that makes direct calls to your data access layer (i.e., the repository) from your controllers.

Or bypass the façade layer in favor of direct calls from your business services to external services.

FWIW: Those are Java/Spring Boot idioms. I'd have to research whether or not there are parallels in microservices implemented in Go.


The architectural patterns are similar in Go. The part of the prompt that contains the refactoring concerns I wanted to fix is specific to this Go project. You can very well add what you just explained, and not only will it follow it, it will clean up the places where it isn't done. You don't need to fully explain the concepts, as it probably knows them well; just mentioning the concept you want to fix is enough.

In my experience the latest models (Opus 4.6 in this case) are perfectly able to do senior stuff. It's just that they don't do it from the get-go: they will give you the naive junior-dev solution as a first draft. But then you can iterate on the refactoring later on.


    > …You don't need to fully explain the
    > concepts as it probably knows them well…
Unsurprisingly, many would disagree [1]…

> 1 Establish a Clear Vision

> You have experienced the world, and you want to work together with a system that has no experience in this world you live in. Every decision in your project that you don’t take and document will be taken for you by the AI…

[1] https://news.ycombinator.com/item?id=46916586


    > …If you don't measure it, you don't see 
    > it until something breaks in production…
    > …
    > …the slow, silent divergence between
    > what you intended and what actually exists…
What's your take on the absence of any mention of tests in the OP's loop steps?


Opus 4.6 is smart enough to run the tests without being told to do so; that's why it isn't in the prompt.


Implicit knowledge like what you know about Opus 4.6 (and that I don't) is what I meant about it being "an amazing learning opportunity".

So, thanks :)


This is the real story buried under the simulation angle. If you can generate reliable 3D LiDAR from 2D video, every dashcam on earth becomes training data. Every YouTube driving video, every GoPro clip, every security camera feed.

Waymo's fleet is ~700 cars. The internet has millions of hours of driving footage. This technique turns the entire internet into a sensor suite. That's a bigger deal than the simulation itself.

