I think you're restating (in a longer and more accurate way) what I understood the original criticism to be, that this grading test isn't testing what's it's supposed to, partly because a grade is too few tokens.
The model could "assess" the code qualitatively the same and still give slightly different letter grades.
The model could "assess" the code qualitatively the same and still give slightly different letter grades.