Yes, and transformer models can do more than text.
There are almost certainly better options out there, given that it looks like we don't need so many examples to learn from, though I'm not at all clear whether we need those better approaches or whether we can get by without them thanks to the abundance of training data.
If you come up with a new system, you're going to want to integrate AI into the system, presuming AI gets a bit better.
If AI can only learn after people have used the system for a year, then your system will just get ignored: after all, it lacks AI. And hence it will never get enough training data to get AI integration.
Learning needs to get faster. Otherwise, we will be stuck with the tools that already exist. New tools won't just need to be possible to train humans on, but also to train AIs on.
Edit: a great example here is the Tamarin protocol prover. It would be great, and feasible, to get AI assistance to write these proofs. But there aren't enough proofs out there to train on.
That seems to already be happening with o1 and Orion.
Instead of rewarding the network directly for finding a correct answer, reasoning chains that end up with the correct answer are fed back into the training set.
That way you're training it to develop reasoning processes that end up with correct answers.
And for math problems, you're training it to find ways of generating "proofs" that happen to produce the right result.
While this means that reasoning patterns that are not, strictly speaking, 100% consistent can be learned, that's not necessarily even a disadvantage, since it allows the model to find arguments that are "good enough" to produce the correct output, even where a fully watertight proof may be beyond it.
Kind of like how physicists took shortcuts such as the Dirac delta function, even before mathematicians could verify that the math was correct.
Anyway, by allowing AIs to generate their own proofs, the number of proofs/reasoning chains for all sorts of problems can be massively expanded, and AI may even invent new ways of reasoning that humans are not even aware of. (For instance, because they require combining more factors in one logical step than can fit into human working memory.)
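A rough sketch of that feedback loop, in the rejection-sampling/STaR spirit described above. The "model" here is a toy stub that guesses at arithmetic, purely to show the shape; a real system would sample reasoning chains from an LLM instead:

    import random

    # Toy stand-in for the described loop: sample several "reasoning
    # chains" per problem, keep only the ones whose final answer checks
    # out, and collect them as training data for the next round.
    def sample_chain(a, b):
        guess = a + b + random.choice([-1, 0, 0, 0, 1])   # sometimes wrong
        return f"{a} plus {b}: add columnwise to get {guess}", guess

    kept = []
    for _ in range(100):                                  # 100 toy problems
        a, b = random.randint(1, 99), random.randint(1, 99)
        for _ in range(8):                                # 8 attempts each
            chain, answer = sample_chain(a, b)
            if answer == a + b:                           # verifier keeps only correct chains
                kept.append(chain)
                break
    print(f"{len(kept)} verified chains feed back into the training set")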
If the user manual fits into the context window, existing LLMs can already do an OK-but-not-great job. I hadn't previously heard of Tamarin; a quick Google suggests that's a domain where the standard is theoretically "you need to make zero errors" but in practice is "be better than your opponent, because neither of you is close to perfect"? In either case, have you tried giving the entire manual to the LLM context window?
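The experiment is cheap to run. A minimal sketch with the OpenAI Python client, where the file name, model choice, and prompts are all placeholders:

    from openai import OpenAI

    # Stuff the entire manual into the context window and ask for a proof.
    # Assumes `pip install openai`, an API key in the environment, and a
    # plain-text dump of the manual.
    manual = open("tamarin-manual.txt").read()
    client = OpenAI()
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You write Tamarin proofs. The full manual follows:\n" + manual},
            {"role": "user",
             "content": "Prove secrecy of the session key in the attached protocol model."},
        ],
    )
    print(reply.choices[0].message.content)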
If the new system can be interacted with in a non-destructive manner at low cost and with useful responses, then existing AI can self-generate the training data.
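Something like this toy loop, where the "new system" is stubbed out as a calculator just to show the shape; in practice it would be the real tool behind a cheap, safe interface:

    import json, random

    # Self-generate training data by querying the system and recording
    # the transcripts. The stub below stands in for the real system's API.
    def new_system(query):
        a, op, b = query.split()
        return str(eval(f"{a}{op}{b}"))                   # toy calculator only

    with open("transcripts.jsonl", "w") as f:
        for _ in range(1000):
            q = f"{random.randint(1, 99)} {random.choice('+-*')} {random.randint(1, 99)}"
            f.write(json.dumps({"prompt": q, "response": new_system(q)}) + "\n")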
If it merely takes a year, businesses will rush to get that training data even if they need to pay humans for a bit. Cars are an example of "real data is expensive or destructive": it's clearly taking a lot more than a year to get there, and there's a lot of investment in just that.
Pay 10,000 people USD 100,000 each for a year; that billion-dollar investment then gets reduced to 2.4 million/year in ChatGPT Plus subscription fees or whatever. Plenty of investors will take that deal… if you can actually be sure it will work.
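The arithmetic, spelled out (assuming the current USD 20/month Plus price):

    # Back-of-envelope check on the numbers above (USD 20/month assumed).
    people, salary = 10_000, 100_000
    print(people * salary)      # 1_000_000_000: the one-off bill for human data
    print(people * 20 * 12)     # 2_400_000: the yearly subscription bill it becomes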
2. You might need only several hundred examples for fine-tuning. (OpenAI's minimum is 10 examples.)
3. I don't think research into fine-tuning efficiency has exhausted its possibilities. Fine-tuning is just not a very hot topic, given that general models work so well. In image generation, where it matters, they quickly got to a point where 1-2 examples are enough. So I won't be surprised if doc-to-model becomes a thing (a sketch of today's flow follows below).
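For reference, the mechanics already exist; a minimal sketch using OpenAI's current fine-tuning API, where the file contents and base-model name are placeholders:

    from openai import OpenAI

    # Upload chat-formatted examples and start a fine-tuning job. Each
    # line of the JSONL file holds one {"messages": [...]} example; 10 is
    # the documented minimum, a few hundred is the regime described above.
    client = OpenAI()
    training = client.files.create(
        file=open("examples.jsonl", "rb"),
        purpose="fine-tune")
    job = client.fine_tuning.jobs.create(
        training_file=training.id,
        model="gpt-4o-mini-2024-07-18")    # placeholder base model
    print(job.id)                          # poll this id until the job finishes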
Or rather, it was trained to reproduce the patterns within internal monologues that lead to correct answers to problems, particularly STEM problems.
While this still uses text at some level, it's no longer regurgitation of human-produced text, but something more akin to AlphaZero's training to become superhuman at games like Go or Chess.
> While this still uses text at some level, it's no longer regurgitation of human-produced text, but something more akin to AlphaZero's training to become superhuman at games like Go or Chess.
How did you know that? I've never seen that anywhere. For all we know, it could just be a very elaborate CoT algorithm.
Notice that the CoT is trained via RL, meaning the CoT itself is a model (or part of the main model).
Also, RL means it's not limited to the original data the way traditional LLMs are. It implies that the CoT process itself is trained based on its own performance, meaning the steps of the CoT from previous runs are fed back into the training process as more data.
At the very least you could say "parsing and predicting text, images, and audio", and you would be correct: physical embodiment and spatial reasoning are missing.
It's all just text though: both images and audio are presented to the LLM as text, the training data is text, and all it does is append small bits of text to a larger text iteratively. So the parent poster was correct.
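That "append small bits iteratively" is literally the decode loop; a minimal greedy-decoding sketch with GPT-2 via HuggingFace transformers (assumes `pip install torch transformers`):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    # The autoregressive loop: score every possible next token, append the
    # most likely one, and repeat with the extended sequence as input.
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    ids = tokenizer.encode("The Tamarin prover is", return_tensors="pt")
    for _ in range(20):                                    # generate 20 tokens
        with torch.no_grad():
            logits = model(ids).logits                     # [1, seq_len, vocab]
        next_id = logits[0, -1].argmax()                   # greedy pick
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # append and repeat
    print(tokenizer.decode(ids[0]))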