
Backtracking is very nearly a solved problem in the context of Prolog-like languages or mathematical theorem provers (as you probably well know). There are many ways you could integrate an LLM-like system into a tactic-based theorem prover without having to restart from the beginning for each alternative. Simply checkpointing and backtracking to a checkpoint would naively improve upon your described Monte Carlo algorithm. More likely, I assume they are using RL to unwind state backwards and update based on the negative result, which would be significantly more complicated but also much more powerful (essentially it would one-shot learn from each failure).
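
To make the checkpointing idea concrete, here is a toy sketch in Python (purely illustrative, not anyone's actual system; propose_tactics and apply_tactic are made-up stand-ins for an LLM policy and a prover kernel): a depth-first tactic search where a failed branch backtracks to the last choice point instead of restarting the whole proof attempt.

    def propose_tactics(goal):
        # Hypothetical policy (stand-in for an LLM): candidate "tactics" for a
        # toy goal, here an integer n that counts as proved once it reaches 0
        # by subtracting allowed step sizes.
        return [t for t in (5, 3) if t <= goal]

    def apply_tactic(goal, tactic):
        # Toy kernel: a tactic either closes the goal or leaves one smaller subgoal.
        remaining = goal - tactic
        return [] if remaining == 0 else [remaining]

    def prove(goals, trace=()):
        if not goals:
            return list(trace)                  # no goals left: return the tactic trace
        goal, rest = goals[0], goals[1:]
        for tactic in propose_tactics(goal):
            subgoals = apply_tactic(goal, tactic)
            result = prove(subgoals + rest, trace + ((goal, tactic),))
            if result is not None:
                return result                   # this branch succeeded
            # otherwise: backtrack to this choice point and try the next candidate
        return None                             # every candidate failed here

    print(prove([11]))   # [(11, 5), (6, 3), (3, 3)], after abandoning the 11 -> 5 -> 5 dead end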

That's just what I came up with after thinking on it for 2 minutes. I'm sure they have even better ideas.



You can also consider the ChatGPT app itself as an RL environment. The environment consists of the agent (the AI), a second agent (the human), and some tools (web search, code, APIs, vision). This grounds the AI in human and tool responses, which generate feedback that can be incorporated into the model with RL methods.

Basically every reply from a human can be interpreted as a reward signal. If the human restates the question, that is a negative reward: the AI didn't get it. If the human corrects the AI, that is another negative reward, but if they simply continue the thread, it is positive. You can have GPT-4 judge all chat logs, turn by turn and end to end, to annotate them.
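
As a rough illustration of that turn-level labeling (my own toy heuristics; the cue list and similarity threshold are invented, and in practice you would use GPT-4 as the judge rather than string matching):

    from difflib import SequenceMatcher

    CORRECTION_CUES = ("no,", "that's wrong", "that is wrong", "not what i asked")

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def label_rewards(transcript):
        """transcript: list of (role, text) pairs, alternating 'user' / 'assistant'."""
        rewards = []
        for i, (role, text) in enumerate(transcript):
            if role != "assistant" or i + 1 >= len(transcript):
                continue                        # score only assistant turns with a follow-up
            reply = transcript[i + 1][1]
            prev_user = transcript[i - 1][1] if i > 0 else ""
            if any(cue in reply.lower() for cue in CORRECTION_CUES):
                rewards.append((i, -1.0))       # the human corrected the AI
            elif similarity(reply, prev_user) > 0.8:
                rewards.append((i, -1.0))       # the human restated the question
            else:
                rewards.append((i, +1.0))       # the thread simply continued
        return rewards

    chat = [
        ("user", "How do I reverse a list in Python?"),
        ("assistant", "Use sorted(my_list)."),
        ("user", "No, that's wrong, sorted() sorts it. How do I reverse it?"),
        ("assistant", "Use my_list[::-1] or my_list.reverse()."),
        ("user", "Thanks, that worked."),
    ]
    print(label_rewards(chat))   # [(1, -1.0), (3, 1.0)]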

The great thing about chat-based feedback is that it is scalable. OpenAI has 100M users, and they generate these chat sessions by the millions every day. OpenAI then just needs to do a second pass (expensive, yes) to annotate the chat logs with RL reward signals and retrain. But they get the human in the loop for free, and that is the best source of feedback.

AI-human chat data is in-domain for both the AI and the human, something we can't say about other training data. It will contain the kinds of mistakes the AI makes and the kinds of tasks humans want to solve with AI. My bet is that OpenAI has realized this and created GPTs in order to enrich and empower the AI to create the best training data for GPT-5.

The secret sauce of OpenAI is not their people, or Sam, or the computers, but the training set, especially the augmented and synthetic parts.


There are certainly efforts along the lines of what you suggest. There are problems, though. The number of backtracks is 10^k, where k is not 2, or 3, or 4...

Another issue is that of autoformalisation. This is the one part of the problem where an LLM might be able to help, if it were reliable enough (it currently isn't) or if it could truly understand the logical structure of mathematical problems (currently it can't).
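
For a sense of what autoformalisation has to produce, here is a toy example (Lean 4, assuming Mathlib for the tactics); the hard part for an autoformaliser is producing a statement that actually matches the English, not the proof:

    import Mathlib.Tactic

    -- Informal statement: "the sum of two even numbers is even."
    -- One reasonable formalisation of that sentence:
    theorem even_add_even (a b : ℕ)
        (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
        ∃ k, a + b = 2 * k := by
      obtain ⟨m, hm⟩ := ha
      obtain ⟨n, hn⟩ := hb
      exact ⟨m + n, by omega⟩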


> That's just what I came up with after thinking on it for 2 minutes. I'm sure they have even better ideas.

The thing is that ideas are not necessarily easy to implement. There will be many obstacles along the route you described:

- quality of provers: are there good enough provers that can also run at large scale (say, billions of facts)?

- you need some formalization approach; an LLM will probably do some of that work, but we don't know what the quality will be

- an LLM will likely generate many individual factoids that are loosely compatible, contradictory, etc., and nontrivial effort is required to reconcile and connect them (a rough sketch of one such reconciliation pass is below)
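
To illustrate the last point, here is a rough sketch of one possible reconciliation pass, assuming the factoids have already been formalized as constraints (which is itself the hard part), using Z3's Python bindings:

    from z3 import Solver, Ints, sat

    x, y = Ints("x y")

    # Pretend these came from an LLM, already formalized as constraints
    # (a big assumption; in practice the formalization step is the hard part).
    factoids = [
        ("x is positive",       x > 0),
        ("y is at least x",     y >= x),
        ("y is negative",       y < 0),     # contradicts the two facts above
        ("x is less than 100",  x < 100),
    ]

    accepted = []
    solver = Solver()
    for label, fact in factoids:
        solver.push()                       # checkpoint the current fact set
        solver.add(fact)
        if solver.check() == sat:
            accepted.append(label)          # still consistent: keep it
        else:
            solver.pop()                    # backtrack: drop the contradicting fact

    print(accepted)   # ['x is positive', 'y is at least x', 'x is less than 100']

Note that this greedy pass is order-dependent: which factoid gets dropped depends on which ones were accepted first, which is part of why reconciling them properly is nontrivial.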




