So the results are meaningful in terms of establishing that LLM therapeutic performance can be evaluated.
But not meaningful in terms of comparing LLMs with human clinicians.
So in that case, how can you justify the title you used for submission, "New benchmark shows top LLMs struggle in real mental health care"?
How are they struggling? Struggling relative to what? For all your work shows, couldn't they be outperforming the average human? Or even if they're below that, couldn't they still have a large net positive effect with few negative outcomes?
I don't understand where the negative framing of your title is coming from.
LLMs have room for improvement (we show that their scores are medium-low on several dimensions).
Maybe the average human also has lots of room for improvement. One thing does not necessarily depend on the other.
In the same way, we can say that LLMs still have room for improvement on a specific task (let's say mathematics), but the average human is also bad at mathematics...
We don't make any claims about human therapists. Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy. Showing this is the first step toward improving them.
But you chose the word "struggle". And now you say:
> Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy.
That implies they're not currently good at therapy. But you haven't shown that, have you? How are you defining that a score of 4 isn't already "good"? How do you know that isn't already correlated with meaningfully improved outcomes, and therefore already "good"?
Everybody has room for improvement if you say 6 is perfection and something isn't reaching 6 on average. But that doesn't mean everybody's struggling.
I take no issue with your methodology. But your broader framing, and title, don't seem justified or objective.