Asked it to write PyTorch code that trains an LLM, and it produced 23 reasoning steps in 62 seconds.
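For reference, the kind of script I was asking for is roughly this — a minimal sketch with made-up hyperparameters, not what the model actually output:

    import torch
    import torch.nn as nn

    # Tiny transformer LM trained on random tokens. Every
    # hyperparameter here is illustrative, not from the model.
    vocab, d_model, seq_len, bs = 256, 128, 64, 32
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    emb = nn.Embedding(vocab, d_model).to(device)
    enc = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True).to(device)
    head = nn.Linear(d_model, vocab).to(device)
    params = list(emb.parameters()) + list(enc.parameters()) + list(head.parameters())
    opt = torch.optim.AdamW(params, lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()
    mask = nn.Transformer.generate_square_subsequent_mask(seq_len - 1).to(device)

    for step in range(100):
        x = torch.randint(0, vocab, (bs, seq_len), device=device)
        h = enc(emb(x[:, :-1]), src_mask=mask)   # causal mask: next-token prediction
        logits = head(h)                         # (bs, seq_len - 1, vocab)
        loss = loss_fn(logits.reshape(-1, vocab), x[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()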

With gpt-4o, the code failed immediately with random errors like mismatched tensor shapes.
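My guess at the kind of shape bug it kept hitting (illustrative, not its actual code): CrossEntropyLoss wants the class dimension second, but LM code naturally produces (batch, seq, vocab) logits.

    import torch
    import torch.nn as nn

    loss_fn = nn.CrossEntropyLoss()
    logits = torch.randn(8, 16, 100)          # (batch, seq, vocab)
    targets = torch.randint(0, 100, (8, 16))  # (batch, seq)

    # loss_fn(logits, targets) raises a shape error as-is;
    # flattening both (or transposing the class dim) fixes it:
    loss = loss_fn(logits.reshape(-1, 100), targets.reshape(-1))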

The code produced by o1 seemed to work at first, but after some training time it crashed with mismatched batch sizes. Also, o1 enabled CUDA on its own, while with gpt-4o I had to spell it out explicitly (it always defaulted to CPU). However, showing o1 the error output resulted in broken code again.
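My guess at the batch-size failure (a made-up repro, not the generated code): the loop hard-coded the batch dimension somewhere, so the smaller final batch from the DataLoader blew up mid-epoch. drop_last=True sidesteps it; the device line is the one gpt-4o kept omitting.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # 1000 samples with batch_size=32 would leave a final batch
    # of 8; drop_last=True keeps every batch exactly 32 rows.
    data = TensorDataset(torch.randint(0, 256, (1000, 64)))
    loader = DataLoader(data, batch_size=32, shuffle=True, drop_last=True)

    for (x,) in loader:
        x = x.to(device)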

I noticed that back-and-forth iteration when it makes mistakes is a worse experience now, because every turn adds a 30-60 second delay. It took 5 back-and-forths before it produced something that doesn't crash (just like with gpt-4o). I also suspect that too many tokens inside the CoT context can make it accidentally forget earlier details.

So there's some improvement, but we're still not there...


