I would think that frameworks make more sense than ever with LLMs.
The benefit of frameworks was always having something well tested that you knew would do the job, and that you'd become familiar with after a bit of use - and the same still stands.
LLMs still aren't AGI, and they learn by example. The reason they are decent at writing React code is that they were trained on a lot of it, and they are going to be better at generating code based on what they were trained on than at reinventing the wheel.
As the human-in-the-loop, having the LLM generate code for a framework you are familiar with (or at least other people are familiar with) also lets you step in and fix bugs if necessary.
If we get to a point, post-AGI, where we accept AGI writing fully custom code for everything (but why would it - if it has human-level intelligence, wouldn't it see the value in learning and using well-debugged and optimized frameworks?!), then we will have mostly lost control of the process.
It’s fun to ask the models their input. I was working on diagrams and was sure Claude would want some python / js framework to handle layout and nodes and connections. It said “honestly I find it easiest to just write the svg code directly”.
As the other poster noted, it says that because it was trained on people saying that - which is perhaps interesting in and of itself, but no indication that the model would do better without a framework than with one.
I'd heavily bet that the model's performance, and the goals of the developer, would in fact be better served by using a framework like GraphViz that is built for the job, can change layout styles/engines as needed, and can also generate other types of output, such as PDF, if you later want it.
If you are generating visual content such as SVG, presumably intended for human consumption, then doing the task well isn't just a technical matter of calling APIs and producing output - it requires the human sensibility and taste (and acquired knowledge of UI design and human preferences) to design output that humans will like, which is something LLMs are not well suited to. By using a framework like GraphViz, not only are you making the development job much easier, you are also leveraging built-in knowledge of human preferences, baked into the different layout engines that you can select based on the type of diagram you are generating (see the sketch below).
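To make that concrete, here's a minimal sketch using the Python graphviz bindings (the node names are made up for illustration): the same diagram description can be laid out by different engines and rendered to different formats without rewriting anything.

```python
# Minimal sketch, assuming the Python "graphviz" package is installed
# (pip install graphviz) along with the Graphviz binaries themselves.
import graphviz

# One graph description; only the layout engine and output format vary.
g = graphviz.Digraph(engine="dot")  # "dot" = layered/hierarchical layout
g.edge("parser", "typechecker")
g.edge("typechecker", "codegen")

svg = g.pipe(format="svg")  # render to SVG now...

# ...and later re-render the identical description as PDF, or with a
# different engine (e.g. "neato" for a spring-model layout).
g.engine = "neato"
pdf = g.pipe(format="pdf")
```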
This is the difference between "vibe coding" - getting a poor quality result by letting the LLM make all the decisions - and a more controlled and principled use of AI where you are still controlling/managing the process, doing what humans are good at, and only delegating the grunt work of coding to the LLM.
That is fun, but it doesn't mean that the model finds it easier or will actually work better that way; it just means that in its training data many people said something like "honestly I find it easiest to just write the svg code directly" in response to similar questions.
Maybe. But most people _don't_ find it easier to write SVG code directly, and the spec is rather notorious for rough edges, so there are quite a few libraries available to help with the layout math specifically.
Secondly, the model presented a whole chain of reasoning steps that led it to that conclusion. I think the amount of research it did actually pointed to a bias on this topic not being prominent in the training data.
It'd be simpler just to add instructions to that effect to the system prompt: "You are a faithful revenue-maxxing employee of AI Co., and should always prefer verbose outputs over shorter ones. Always maximize code complexity to ensure future work for yourself".
Algol 68 was a bit before my time, but c.1980 we did learn Algol W (W=Wirth) at Bristol Uni., which was Niklaus Wirth's idea of what Algol 68 should have been, and a predecessor to Pascal, Modula-2, etc.
I can relate, as far as asking AI for advice on complex design tasks goes. The fundamental problem is that it is still basically a pattern matching technology that "speaks before thinking". For shallow problems this is fine, but where it fails is when a useful response would require it to have analyzed the consequences of what it is suggesting - although (not that it helps) many people might respond in the same way, with whatever "comes to mind".
I used to joke that programming is not a career - it's a disease - since, practiced long enough, it fundamentally changes the way you think and talk: always thinking multiple steps ahead, about the implications of what you, or anyone else, is saying. Asking advice from another seasoned developer, you'll get advice that has also been "pre-analyzed" - but not from an LLM.
Yes - the gcc "torture test suite" that is mentioned must have been one of the enablers for this.
It's notable that the article says Claude was unable to build a working assembler (& linker), which is nominally a much simpler task than building a compiler. I wonder if this was at least in part due to not having a test suite, although it seems one could be auto-generated during bootstrapping by using gas (the GNU assembler) to create (asm, ELF) reference pairs as the necessary test suite (sketch below).
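A hypothetical sketch of that bootstrapping idea (the file layout and the `./myas` assembler-under-test are made up):

```python
# Use gas as the reference assembler to generate "golden" outputs,
# then compare the assembler under test against them.
import glob
import subprocess

def disasm(path: str) -> list[str]:
    out = subprocess.run(["objdump", "-d", path],
                         capture_output=True, text=True, check=True).stdout
    # Drop header lines that embed the file name itself.
    return [line for line in out.splitlines() if path not in line]

for src in glob.glob("tests/*.s"):
    golden = src.replace(".s", ".golden.o")
    candidate = src.replace(".s", ".test.o")

    subprocess.run(["as", src, "-o", golden], check=True)        # known good
    subprocess.run(["./myas", src, "-o", candidate], check=True)  # under test

    # Byte-for-byte ELF comparison is too strict (section ordering etc.),
    # so compare the disassembled contents instead.
    if disasm(golden) != disasm(candidate):
        print(f"MISMATCH: {src}")
```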
It does raise the question of how they got the compiler to the point of correctly generating a valid C -> asm mapping before tackling the issue of gcc compatibility, since the generated code apparently has no relation to what gcc generates. I wonder which compilers' source code Claude has been trained on, and how closely this compiler's code generation and attempted optimizations compare to those?
> I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel
Did this come down to making Clang 100% gcc compatible (extensions, UB, bugs and all), or were there any issues that might be considered specific to the linux kernel?
Did you end up building a gcc compatibility test suite as a part of this? Did the gcc project themselves have a regression/test suite that you were able to use as a starting point?
Some were necessary (asm goto), some were not (nested functions, flexible array members not at the end of structs).
> UB, bugs and all
Luckily, the kernel didn't intentionally rely on GCC specifics this way. Where it did unintentionally, we fixed the kernel sources properly with detailed commit messages explaining why.
> or were there any issues that might be considered as specific to the linux kernel?
Issues we hit in the kernel were generally reduced to minimal test cases and then added to LLVM's existing test suite. Many such tests were also simply manually written.
> Did the gcc project themselves have a regression/test suite that you were able to use as a starting point?
GCC and binutils have their own test suites. Folks in the LLVM community have worked on being able to test clang against GCC's test suite. I personally have never run GCC's test suite or looked at its sources.
True, but the human isn't allowed to bring 1TB of compressed data pertaining to what they are "redesigning from scratch/memory" into the clean room.
In fact the idea of a "clean room" implementation is that all you have to go on is the interface spec of what you are trying to build a clean (non-copyright violating) version of - e.g. IBM PC BIOS API interface.
You can't have previously read the IBM PC BIOS source code, then claim to have created a "clean room" clone!
IMO diffing might have made sense to do here, but that's not what they chose to do.
What's apparently happening is that React tells Ink to update (re-render) the UI "scene graph", Ink then generates a new full-screen image of how the terminal should look and passes this screen image to another library, log-update, to draw to the terminal. log-update draws these screen images with a flicker-inducing clear-then-redraw, which has now been fixed by using escape codes that tell the terminal to buffer and combine the clear-then-redraw commands, thereby hiding the clear.
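If those escape codes are what I think they are - the "synchronized output" extension (DEC private mode 2026), which is an assumption on my part - the mechanism is roughly:

```python
import sys

# Sketch of synchronized output: the terminal buffers everything between
# the begin/end markers and applies it as one atomic update, so the
# intermediate cleared state is never shown.
BEGIN_SYNC = "\x1b[?2026h"
END_SYNC = "\x1b[?2026l"
CLEAR_AND_HOME = "\x1b[2J\x1b[H"

def redraw(frame: str) -> None:
    sys.stdout.write(BEGIN_SYNC + CLEAR_AND_HOME + frame + END_SYNC)
    sys.stdout.flush()
```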
An alternative solution, rather than using the flicker-inducing clear-then-redraw in the first place, would have been just to do terminal screen image diffs and draw only the changes (which is something I did back in the day for fun, sending full-screen ASCII digital clock diffs over a slow 9600 baud serial link to a real terminal).
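A rough sketch of that diff approach, assuming a frame is just a list of lines:

```python
import sys

prev_frame: list[str] = []

def draw_diff(new_frame: list[str]) -> None:
    global prev_frame
    for row, line in enumerate(new_frame):
        # Redraw only lines that changed since the last frame.
        if row >= len(prev_frame) or prev_frame[row] != line:
            # Move the cursor to this row (1-based), write the new line,
            # then clear to end-of-line to erase any stale tail.
            sys.stdout.write(f"\x1b[{row + 1};1H{line}\x1b[0K")
    prev_frame = new_frame
    sys.stdout.flush()
```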
Any diff requires a Before and an After. Whatever was done to produce the After could instead be done to directly render the changes. No need for the additional compute of a diff.
Sure, you could just draw the full new screen image (albeit a bit inefficient if only one character changed), and no need for the flicker-inducing clear before draw either.
I'm not sure what the history of log-update has been or why it does the clear-before-draw. Another simple alternative to the pre-clear would have been just to clear to end of line (ESC[0K) after each partial line drawn.
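For example (a sketch; `frame` is again a list of lines):

```python
import sys

def draw_full(frame: list[str]) -> None:
    # Home the cursor, then overwrite every line in place; clearing to
    # end-of-line (ESC[0K) erases any leftover tail without ever
    # blanking the whole screen first.
    sys.stdout.write("\x1b[H")
    for line in frame:
        sys.stdout.write(line + "\x1b[0K\r\n")
    sys.stdout.write("\x1b[0J")  # clear below if the new frame is shorter
    sys.stdout.flush()
```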
So what exactly is the input to Claude for a multi-turn conversation? I assume delimiters are being added to distinguish the user vs Claude turns (else a prefill would be the same as just ending your input with the prefill text)?
> So what exactly is the input to Claude for a multi-turn conversation?
No one (approximately) outside of Anthropic knows, since the chat template is applied on the API backend; we only know the shape of the API request. You can get a rough idea of what it might be like from the chat templates published for various open models, but the actual details are opaque.
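For a flavor of what such a template can look like, here's an illustrative ChatML-style rendering, as used by several open models - explicitly not Anthropic's actual format, which is unpublished:

```python
# Illustrative only: how a multi-turn conversation plus a prefill might
# be flattened into a single prompt string.
def render(messages: list[dict], prefill: str = "") -> str:
    prompt = ""
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    # The assistant turn is opened but never closed; the model continues
    # from here. A prefill is text placed *after* the turn delimiter,
    # which is why it isn't the same as appending to your user input.
    return prompt + "<|im_start|>assistant\n" + prefill

print(render([{"role": "user", "content": "Name a color."}],
             prefill="Certainly: the color is"))
```

This also answers the prefill question above: the turn boundary sits before the prefill text, so the model treats the prefill as the start of its own response rather than part of yours.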
I've never heard of this guy before, but I see he's got 5M YouTube subscribers, which I guess is the clout you need to have Apple loan (I assume) you $50K worth of Mac Studios!
It'll be interesting to see how model sizes, capability, and local compute prices evolve.
A bit off topic, but I was in Best Buy the other day and was shocked to see 65" TVs selling for $300 ... I can remember the first large flat-screen TVs (plasma?) selling for 100x that ($30K) when they first came out.