The OP's point is that it's likely impossible to do what is claimed here in general. Imagine the LLM states something like Fermat's Last Theorem. To verify it, you'd have to either 1) have a proof assistant powerful enough to construct a proof, or 2) use a second ML model to guess truthfulness. The former is technically challenging, and the latter is just another model, with its own biases and factual inconsistencies.
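To make the asymmetry concrete, here's a hypothetical Lean 4 sketch (names are illustrative, not from any real formalization). Merely *stating* Fermat's Last Theorem takes a few lines, but the `sorry` marks the proof obligation that no proof assistant can currently discharge automatically; full formalization is still an open, multi-year effort.

```lean
-- Stating FLT is easy: for every exponent n > 2, there are no
-- positive naturals a, b, c with a^n + b^n = c^n.
theorem fermat_last_theorem :
    ∀ n : Nat, 2 < n →
      ∀ a b c : Nat, 0 < a → 0 < b → 0 < c →
        a ^ n + b ^ n ≠ c ^ n := by
  sorry  -- the hard part: no complete formal proof exists today
```

So even a "verify with a theorem prover" pipeline only checks proofs the system can already find or that a human supplies; it doesn't magically decide truth.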