Do you have plans to improve the quality of the LLM-as-judge so that it reaches better parity with human clinician annotators, for example by fine-tuning models?
I would think the comparative clinician judgements themselves would make useful fine-tuning material.