This is a good point. We have not tested the clinicians, but I believe they would not score each other perfectly: we also observed some disagreement between the scores, which reflects differing opinions among clinicians.
It is nice to have an accurate measure of things, and a human baseline would be helpful too.
Many things can be useful before they reach the level of the world's best. With AI, though, non-intuitive failure modes must be taken into consideration too.