The primary CoT research paper discusses how to train models on formal languages rather than only natural language. I'm guessing this is one piece of how the model learns tree-like reasoning.
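To ground what a formal language might buy you here, a toy sketch (my own illustration, not anything from the paper): the same reasoning written as free prose versus a tree-shaped formal expression that can be checked step by step.

```python
# Toy illustration (hypothetical, not from the paper): the same reasoning step
# expressed as free-form natural language vs. a small formal, tree-shaped
# structure that a model could be trained to emit and that is easy to verify.

natural_language_cot = (
    "First split 48 into 40 and 8, double each part, then add the results."
)

# Hypothetical formal representation: nested tuples form an explicit
# expression tree, so each step can be checked mechanically.
formal_cot = ("add", ("mul", 2, 40), ("mul", 2, 8))

def evaluate(node):
    # Recursively evaluate the expression tree.
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    l, r = evaluate(left), evaluate(right)
    return l + r if op == "add" else l * r

assert evaluate(formal_cot) == 2 * 48  # 96: the tree is checkable, the prose is not
```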
Based on some quick searching, it seems like they use RL to give positive/negative feedback on which "paths" the model should take when performing CoT.
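To make that idea concrete, here's a minimal sketch of what reward feedback over reasoning paths might look like. This is my own assumption of the general mechanism, a REINFORCE-style update over a few hypothetical candidate paths with a toy verifier, not the paper's actual training setup.

```python
# Minimal sketch (assumed mechanism, not the paper's method): treat each
# candidate chain-of-thought "path" as an action, give +1/-1 reward based on
# whether the final answer checks out, and nudge a softmax policy toward
# paths that earn positive feedback.

import math
import random

# Hypothetical candidate reasoning paths for one problem.
PATHS = ["decompose-then-add", "guess-and-check", "work-backwards"]

def reward(path: str) -> float:
    # Toy verifier: only one path leads to a correct answer in this sketch.
    return 1.0 if path == "decompose-then-add" else -1.0

# Policy: unnormalized preferences (logits) over paths.
logits = {p: 0.0 for p in PATHS}

def softmax_probs() -> dict:
    exps = {p: math.exp(l) for p, l in logits.items()}
    total = sum(exps.values())
    return {p: e / total for p, e in exps.items()}

def sample_path() -> str:
    # Sample a path proportionally to its current probability.
    probs = softmax_probs()
    r, acc = random.random(), 0.0
    for p, prob in probs.items():
        acc += prob
        if r <= acc:
            return p
    return PATHS[-1]

LEARNING_RATE = 0.1

for step in range(500):
    path = sample_path()
    r = reward(path)
    probs = softmax_probs()
    # REINFORCE-style update: raise the preference for the sampled path in
    # proportion to its reward, lower the others slightly.
    for p in PATHS:
        grad = (1.0 if p == path else 0.0) - probs[p]
        logits[p] += LEARNING_RATE * r * grad

print(logits)  # the rewarded path should end up with the highest preference
```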