Are you being sarcastic or serious? Meeting requirements is implicitly part of any task. Quality/quantification will be embedded in the tasks (e.g. X must be Y <unit>); code style and quality guidelines are probably there somewhere in his tasks templates. Implicitly, explicit portions of tasks will be covered by testing.
I do think it's overly complex though; but it's a novel concept.
Everything you said is also done for regular non-ai development, OP is saying there is no way to compare the two (or even compare version x of gas town to version y of gas town) because there are 0 statistics or metrics on what gas town produces.
I do think it's overly complex though; but it's a novel concept.