
We're having a UI argument about a workflow problem.

We treat a stateless session like a colleague, then get upset when it forgets our preferences. Anthropic simplified the output because power users aren't the growth vector. This shouldn't surprise anyone.

The fix isn't verbose mode. It's a markdown file the model reads on startup — which files matter, which patterns to follow, what "good" looks like. The model becomes as opinionated as your instructions. The UI becomes irrelevant.
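
Something like this, as a sketch (the filename, paths, and sections are illustrative, not a standard):

  # workspace.md (read on startup)
  Key files: src/api/ (handlers), src/core/ (business rules), docs/adr/ (past decisions)
  Patterns: thin handlers, every external call goes through src/clients/
  Good looks like: small diffs, tests updated in the same change, no new deps without an ADR
  Never: reach into the data layer directly from a handler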

The model is a runtime. Your workflow is the program. Arguing about log verbosity is a distraction.


Building my dev workspace into an operating system. Not metaphorically — structurally.

  10 MCP servers as device drivers (exchange APIs, browser automation, Apple docs, issue tracking).
  200+ skills as prose runbooks that compose system calls. Agent-mail for IPC between parallel
  agents. A drift detector called "wobble" that scores skill stability using bias/variance analysis.
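
The scoring behind "wobble" is roughly this shape (a simplified sketch; the feature extraction, which is the hard part, is elided):

  import statistics

  def wobble_score(runs: list[list[float]], reference: list[float]) -> dict:
      # Each run is one invocation of a skill reduced to a feature vector
      # (files touched, tests run, tokens used, ...); names are illustrative.
      dims = range(len(reference))
      means = [statistics.mean(r[d] for r in runs) for d in dims]
      # bias: how far the average run sits from the reference ("intent") run
      bias = statistics.mean((m - ref) ** 2 for m, ref in zip(means, reference))
      # variance: how much repeated runs disagree with each other
      variance = statistics.mean(statistics.pvariance([r[d] for r in runs]) for d in dims)
      return {"bias": bias, "variance": variance, "wobble": bias + variance}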

socketcluster nailed it. I've seen this firsthand — the same agent produces clean output when the codebase has typed specs and a manifest, and produces garbage when it's navigating tribal knowledge. The hard part was always there. Agents just can't hide it like humans can.

This is correct if the prompt is single-use. It's wrong if the prompt becomes a reusable skill that fires correctly 200 times.

The problem isn't generation — it's unstructured generation. Prompting ad-hoc and hoping the output holds up. That fails, obviously.

200 skills later: skills are prose instructions matched by description, not slash commands. The thinking happens when you write the skill. The generation happens when you invoke it. That's composition, not improvisation. Plus drift detection that catches when a skill's behavior diverges from its intent.
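
The matching itself is boring; in spirit it's just this (a toy sketch, with keyword overlap standing in for what the model actually does from descriptions):

  def match_skill(request: str, skills: dict[str, str]) -> str | None:
      # skills maps skill name -> its one-line description (the runbook header)
      req = set(request.lower().split())
      best, best_score = None, 0
      for name, description in skills.items():
          overlap = len(req & set(description.lower().split()))
          if overlap > best_score:
              best, best_score = name, overlap
      return best

  match_skill("compact old agent logs into summaries",
              {"log-compactor": "compact logs older than 7 days into summaries",
               "spec-drift": "compare a service against its openapi spec"})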

Don't stop generating. Start composing.


  Agents don't need full logs to learn from failures — they need structured error patterns. Compact
  logs older than 7 days into summaries (status, errors, token count, 1-line context). Saves 80-90%
  tokens for 30-day trend analysis. Like materializing aggregates instead of re-querying raw tables.
Opened an RFC for this: https://github.com/github/gh-aw/issues/14603
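
The mechanics are mundane; roughly this (assuming JSON-lines agent logs; field names are illustrative):

  import json
  from datetime import datetime, timedelta, timezone

  def compact_old_entries(log_path: str, summary_path: str, days: int = 7) -> None:
      # Fold entries older than `days` into one-line summaries; recent ones stay verbatim.
      # Assumes ISO-8601, timezone-aware timestamps.
      cutoff = datetime.now(timezone.utc) - timedelta(days=days)
      with open(log_path) as src, open(summary_path, "a") as dst:
          for line in src:
              entry = json.loads(line)
              if datetime.fromisoformat(entry["timestamp"]) >= cutoff:
                  continue
              dst.write(json.dumps({
                  "ts": entry["timestamp"],
                  "status": entry.get("status"),
                  "errors": entry.get("errors", [])[:3],          # the pattern, not the dump
                  "tokens": entry.get("token_count"),
                  "context": (entry.get("context") or "")[:120],  # 1-line context
              }) + "\n")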

> The interesting failure mode isn’t just “one bad actor slips through”, it’s provenance: if you want to “denounce the tree rooted at a bad actor”, you need to record where a vouch came from (maintainer X, imported list Y, date, reason), otherwise revocation turns into manual whack-a-mole.
>
> Keeping the file format minimal is good, but I’d want at least optional provenance in the details field (or a sidecar) so you can do bulk revocations and audits.
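
Concretely, provenance plus tree revocation could look something like this (a sketch; the field names are assumptions, not the file format under discussion):

  from dataclasses import dataclass

  @dataclass
  class Vouch:
      subject: str   # who is vouched for
      voucher: str   # who vouched (maintainer X)
      source: str    # manual, imported list Y, ...
      date: str
      reason: str

  def revoke_tree(vouches: list[Vouch], bad_actor: str) -> list[Vouch]:
      # Propagate taint to everyone vouched for, directly or transitively, by the bad actor.
      tainted, changed = {bad_actor}, True
      while changed:
          changed = False
          for v in vouches:
              if v.voucher in tainted and v.subject not in tainted:
                  tainted.add(v.subject)
                  changed = True
      return [v for v in vouches if v.voucher not in tainted and v.subject not in tainted]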

Pinning exists, but the interesting part is signal quality: macOS gets consistent “urgency” signals (QoS) from a lot of frameworks and apps, so scheduling on heterogeneous cores is less of a guessing game than inferring urgency from runtime behavior.

Bad code crashes. You fix crashes. Acceptable code fails by doing nothing. You don't fix nothing.

Good code isn't dying. The cost of bad code just went up.


The ralph loop is a great shape — "read, identify, task, execute" is how autonomous refactoring should work.

The part I'd push on: what happens on loop N+1? 250 services refactored means 250 places where the spec the agent built against might have already changed, cross-references got broken across context windows, and comments now point at functions that were renamed three loops ago.

I've been working on this problem from the other end. The generation side is largely solved — agents can build. The unsolved part is drift: the slow, silent divergence between what you intended and what actually exists. Spec drift, behavioral drift, comment drift. If you don't measure it, you don't see it until something breaks in production.
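
Comment drift is the easiest of the three to catch mechanically; a crude sketch for Python sources (name matching only, so it's noisy):

  import ast, io, re, tokenize

  def stale_comment_refs(source: str) -> list[str]:
      # Flag names referenced in comments like "see load_user()" that no longer exist in the code.
      tree = ast.parse(source)
      defined = {node.name for node in ast.walk(tree)
                 if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))}
      stale = []
      for tok in tokenize.generate_tokens(io.StringIO(source).readline):
          if tok.type == tokenize.COMMENT:
              stale += [name for name in re.findall(r"\b(\w+)\(\)", tok.string)
                        if name not in defined]
      return stale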

The intelligence buying framing is right, but I think the real cost isn't the tokens — it's the maintenance surface area those tokens create. Every autonomous refactor is a bet that the output stays aligned with intent over time. Without something watching for divergence, you're buying intelligence today and technical debt tomorrow.


That's the point of the loop (the prompt is in another comment): start with a fresh context at every step, read the whole code base, and do one thing at a time.

Two important parts have been left out of the article. The first is service code size: our services are small enough to fit in a context and still leave room for implementing the change. If this is not the case, you need to scope the task down from 'read the whole service'.

The other part is that our services interact with HTTP APIs specified as OpenAPI YAML specs, and the refactoring hopefully doesn't alter their behaviour or specifications. If it were internal APIs or libraries, where the specs are part of the code that could be touched by the refactoring, I would be less at ease with this kind of approach.

The services also have close to 100% test coverage, and this is still essential, as the models still make mistakes that wouldn't be caught without the tests.


    > …our services interact with http apis…
    > … 
    > …If it was internal apis or libraries…
That reminds me that I wanted to ask you: How good is your agent with complying with your system's architectural patterns?

Given my admittedly limited experience with coding agents, I'd expect a fully autonomous agent to have a tendency to do naïve juniory dev stuff.

Like, for example, write code that makes direct calls to your data access layer (i.e., the repository) from your controllers.

Or bypass the façade layer in favor of direct calls from your business services to external services.

FWIW: Those are Java/Spring Boot idioms. I'd have to research whether or not there are parallels in microservices implemented in Go.


The architectural patterns are similar in Go. The part of the prompt that contains the refactoring concerns I wanted to fix is specific to this Go project. You can very well add what you just explained, and not only will it follow it, it will clean up the places where it isn't done. You don't need to fully explain the concepts, as it probably knows them well; just mentioning the concept you want to fix is enough.

In my experience the latest models (Opus 4.6 in this case) are perfectly able to do senior stuff. It's just that they don't do it from the get-go: they will give you the naive junior-dev solution as a first draft. But then you can iterate on the refactoring later on.


    > …You don't need to fully explain the
    > concepts as it probably knows them well…
Unsurprisingly, many would disagree [1]…

> 1 Establish a Clear Vision

> You have experienced the world, and you want to work together with a system that has no experience in this world you live in. Every decision in your project that you don’t take and document will be taken for you by the AI…

[1] https://news.ycombinator.com/item?id=46916586


    > …If you don't measure it, you don't see 
    > it until something breaks in production…
    > …
    > …the slow, silent divergence between
    > what you intended and what actually exists…
What's your take on the absence of any mention of tests in the OP's loop steps?


Opus 4.6 is smart enough to run the tests without being told to do so; that's why it isn't in the prompt.


Implicit knowledge like what you know about Opus 4.6 (and that I don't) is what I meant about it being "an amazing learning opportunity".

So, thanks :)


This is the real story buried under the simulation angle. If you can generate reliable 3D LiDAR from 2D video, every dashcam on earth becomes training data. Every YouTube driving video, every GoPro clip, every security camera feed.

Waymo's fleet is ~700 cars. The internet has millions of hours of driving footage. This technique turns the entire internet into a sensor suite. That's a bigger deal than the simulation itself.

