LLMs solve school problems because they have been trained on the solutions. Those problems are actually very repetitive, so it's not surprising that for each 'new' one there was something similar in the training set. For research-level problems there is nothing comparable in the training set, and that's why the models don't perform well on them.
Just today I gave GPT4 a simple task: given the mouse position in a zoomed and scrolled image, find its position in the original image. GPT4 happily wrote the code, but it was completely wrong. I had to fix it manually.
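For reference, one way to write that mapping is sketched below. It is only a sketch under stated assumptions: the mouse position is measured from the top-left corner of the visible viewport, the scroll offset is given in zoomed (screen) pixels, and the zoom is a single uniform factor. The function name `view_to_image` and its parameters are made up for illustration. If the scroll offset is stored in original-image pixels instead, the formula changes to `mouse / zoom + scroll`.

```python
from typing import Tuple


def view_to_image(mouse_x: float, mouse_y: float,
                  scroll_x: float, scroll_y: float,
                  zoom: float) -> Tuple[float, float]:
    """Map a mouse position in the viewport to original-image coordinates.

    Assumptions (not taken from any particular GUI toolkit):
      - mouse_x, mouse_y are relative to the viewport's top-left corner
      - scroll_x, scroll_y are the scroll offset in zoomed (screen) pixels
      - zoom is a uniform scale factor (2.0 means the image is shown at 200%)
    """
    if zoom <= 0:
        raise ValueError("zoom must be positive")
    # First undo the scroll to get the position inside the zoomed image,
    # then undo the zoom to get back to original-image pixels.
    image_x = (mouse_x + scroll_x) / zoom
    image_y = (mouse_y + scroll_y) / zoom
    return image_x, image_y
```

For example, with a 2x zoom, a scroll offset of (200, 150) and the mouse at (100, 50) in the viewport, `view_to_image(100, 50, 200, 150, 2.0)` gives (150.0, 100.0).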
However, performance can be improved if several threads work on the solution, some suggesting and others analyzing the proposed solution(s). At the very least this will increase the size of the 'active' memory. It will also decrease the load on each thread, letting them be more specialized and go deeper. This requires more resources, of course, and good management of how the task is split. Maybe a dedicated thread for that.
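The suggest-and-analyze split could look roughly like the loop below. This is a toy sketch, not a real multi-agent setup: `Proposer` and `Critic` are hypothetical plain functions standing in for separate model calls, everything runs sequentially rather than in actual threads, and the manager role is reduced to a simple loop that forwards objections back to the proposers.

```python
from typing import Callable, List, Optional

# Hypothetical stand-ins for model calls; these are not a real API.
Proposer = Callable[[str, List[str]], str]    # (task, feedback so far) -> candidate
Critic = Callable[[str, str], Optional[str]]  # (task, candidate) -> objection, or None


def solve_with_review(task: str,
                      proposers: List[Proposer],
                      critics: List[Critic],
                      max_rounds: int = 3) -> Optional[str]:
    """Return the first candidate that no critic objects to, or None."""
    feedback: List[str] = []  # objections collected so far, shared with proposers
    for _ in range(max_rounds):
        for propose in proposers:
            candidate = propose(task, feedback)
            objections = []
            for critique in critics:
                objection = critique(task, candidate)
                if objection is not None:
                    objections.append(objection)
            if not objections:
                return candidate  # every critic accepted this candidate
            feedback.extend(objections)  # pass objections on to the next attempt
    return None  # no candidate survived review within max_rounds
```

The point of the structure is that each function sees only its own narrow job (suggest, or check), while the accumulated `feedback` list plays the role of the larger shared memory.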