The number of GPUs they have (which may well be export-legal H800s, as Nvidia believes they are) goes hand in hand with the amount it cost to train (however you define that), and is something people trying to replicate their approach can verify (or not).

It seems obvious that you need a model trained, or fine-tuned, on some reasoning data (with backtracking etc.) so that reasoning behavior is part of its repertoire before you can use RL to get it, hopefully, to apply that reasoning toward whatever goals you are setting. I'd not be surprised if they used o1 outputs to bootstrap the model in this way, although o1's visible reasoning traces are a deliberate obfuscation of what it is really doing (an after-the-fact summary), so even if this is the case, that should be borne in mind!

OTOH, while reasoning data may be scarce in the wild, it's presumably not entirely unavailable, and DeepSeek may also have created some themselves, so who knows what mix they used for this initial bootstrapping stage. As you say, this aspect remains "secret sauce".

Of course, once they've got their first-stage model trained, they then use it to generate data for the second/final stage.