
If I had to guess, the name Q* is pronounced "Q star"; the Q probably refers to Q values, i.e. estimated rewards from reinforcement learning, and the star refers to a search-and-prune algorithm, like A* (A star).

Possibly they combined deep reinforcement learning with self-training and search and got a bot that could learn without needing to ingest the whole internet. Usually DRL agents are good at playing games, but they can't complete any task that requires prior knowledge, like reading English. Whereas language models can read, but they can't do tasks that are trivial for DRL, like making a robot walk based on proprioceptive data.
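
To make "Q values" concrete, here's a toy tabular Q-learning sketch. This has nothing to do with whatever OpenAI actually built; the states, actions, and constants are all made up for illustration.

    import random
    from collections import defaultdict

    # Q[state][action]: estimated long-term reward of taking `action` in `state`.
    Q = defaultdict(lambda: defaultdict(float))

    alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount, exploration rate

    def update(state, action, reward, next_state, next_actions):
        # Classic Q-learning backup: nudge the estimate toward
        # immediate reward + discounted value of the best next action.
        best_next = max((Q[next_state][a] for a in next_actions), default=0.0)
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

    def act(state, actions):
        # Epsilon-greedy: mostly exploit current Q estimates, occasionally explore.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[state][a])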

I'm excited to see the paper.



Given that the topic they were excited about was "basic math problems being solved", it immediately suggested to me as well that this is a completely separate approach, likely in the vein of DeepMind's focus with things like AlphaZero.

In which case it's pretty appropriate to get excited about solving grade school math if you were starting from scratch with persistent self-learning.

Though with OpenAI's approach to releasing papers on their work lately, we may be waiting a long time to see a genuine paper on this. (More likely we'll see a paper from the parallel development at a different company after staff shift around bringing along best practices.)


OK, if it started from scratch, with zero knowledge, and then solved grade-school math, this would be FUCKING HUGE.


Paper? You mean more like the Bing Search plugin?


I think more likely it's for fine-tuning a pre-trained model like GPT-4, kind of like RLHF, but in this case using reinforcement learning somewhat similar to AlphaZero. The model gets pre-trained and then fine-tuned to achieve mastery in tasks like mathematics and programming, using something like what you describe, probably combined with something like tree of thought and some self-reflection to generate the data that reinforcement learning then improves on.

What you get then is a way for a pre-trained model to keep practicing certain tasks, like chess, Go, math, programming, and many other things, as it gets figured out how to apply this to them.
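
A very rough sketch of the "keep practicing" loop I'm imagining; every function and field name here is a hypothetical placeholder, not anything confirmed about Q*:

    # Toy self-improvement loop: sample attempts, keep the ones a checker rewards,
    # fine-tune on those, repeat. All helpers below are stand-ins.

    def generate_attempts(model, problem, n=8):
        return [model(problem) for _ in range(n)]            # e.g. sampled chains of thought

    def reward(problem, attempt):
        return 1.0 if attempt == problem["answer"] else 0.0  # verifiable tasks: math, code, games

    def finetune(model, examples):
        return model                                         # stand-in for an RL / fine-tuning step

    def self_improve(model, problems, rounds=3):
        for _ in range(rounds):
            good = [(p, a) for p in problems
                           for a in generate_attempts(model, p)
                           if reward(p, a) > 0]
            model = finetune(model, good)   # practice on its own successful attempts
        return model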


I do not think that is correct, as the RL in RLHF already stands for reinforcement learning. :^)

However, I do think you are right that self-play, and something like reinforcement learning, will be involved more in the future of ML. Traditional "data-first" ML has limits. Tesla conceded to RL for parking lots, where the action and state space was too unknowable for hand-designed heuristics to work well. In deep reinforcement learning, making a model just copy data is called "behavior cloning", and in every paper I have seen it results in considerably worse peak performance than letting the agent learn from its own efforts.

Given that wisdom alone, we are under the performance ceiling with pure language models.
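
For anyone who hasn't seen the term: behavior cloning is just supervised imitation of a fixed dataset, while RL optimizes the agent's own rollouts. Roughly, in standard notation (D is the demonstration data, pi_theta the policy, gamma the discount, r the reward; this is textbook notation, not from any specific paper):

    % Behavior cloning: supervised imitation of a fixed dataset D of (state, action) pairs
    \mathcal{L}_{\mathrm{BC}}(\theta) = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\bigl[\log \pi_\theta(a \mid s)\bigr]

    % Reinforcement learning: maximize return over trajectories the agent generates itself
    J_{\mathrm{RL}}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Bigl[\textstyle\sum_{t} \gamma^{t}\, r(s_t, a_t)\Bigr]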


This assumes that they will publish a paper that has substantive details about Q*.



To give context on this video for anyone who doesn't understand: in this video PI* refers to an idealized policy, the choice of actions that results in the maximum possible reward. (In reinforcement learning, PI is just the rule for which action you take in a given situation.) To use chess as an example, if you were to play the perfect move at every turn, that would be PI*. Q* is a function that tells you, optimally and with perfect information, the value of any move you could make. (Just like how Stockfish can tell you how many points a move in chess is worth.)
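
In the standard notation this is just (gamma being the discount factor; restating the above, nothing more):

    % Q*(s, a): expected discounted reward from taking action a in state s,
    % then following the best possible policy afterwards
    Q^{*}(s, a) = \max_{\pi}\, \mathbb{E}_{\pi}\Bigl[\textstyle\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t) \,\Big|\, s_0 = s,\ a_0 = a\Bigr]

    % The optimal policy just picks the highest-valued action in each state
    \pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)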

Now my personal comment: for games that are deterministic, there is no difference between a policy that takes the optimal move given only the current state of the board, and a policy that takes the optimal move given even more information, say a stack of future possible turns, etc.

However, in real life you need to predict the future states and sum across the best action taken at each of those future states as well. That is unrealistic in the real world, where the space of actions is infinite and the universe to observe is not all simultaneously knowable (hidden information).
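
Put as equations, these are the standard Bellman optimality backups (f is a deterministic transition function, P the transition probabilities; illustrative notation only):

    % Deterministic transitions: the next state s' = f(s, a) is known exactly
    Q^{*}(s, a) = r(s, a) + \gamma \max_{a'} Q^{*}\bigl(f(s, a), a'\bigr)

    % Stochastic / partially observed: average over every state that might come next
    Q^{*}(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^{*}(s', a')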

Given the traditional educational background of professionals in RL, maybe they were referring to the Q* from traditional RL. But I don't see why that would be novel or notable, as it is a very old idea, old, old math, from the 60s I think. So I sort of assumed it's not. Could be relevant, or just a name collision.


Man, it's so sad to see how far Lex has fallen: from a graduate-level guest lecturer at MIT to a glorified Joe Rogan.


Nothing wrong with being a podcaster and getting tens of thousands of people excited by ideas and interviewing a great collection of people from an informed perspective (at a minimum more informed than the average podcaster/talking head).

Not everyone needs to be doing hard academic stuff. There's plenty of value in communication.


I could have given this lecture, and I think I could have made it much more entertaining, with fun examples.

Lex should stick to what he likes, though his interviews can be somewhat dull. On occasion I learn things from his guests I would have had no other chance of exposure to.


So they implemented it in a semantically grounded way, or what? That video is more technical than I can handle; I'm struggling to figure out what this could be.



