Hand signals are pretty standardised and obvious, and even the first Kinect (an Xbox accessory from 2010) can estimate people's skeletons well enough to figure out construction worker hand signals. I would expect that, once you have the skeleton, you can solve this pretty well even without any ML at all.
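To make the "no ML" idea concrete, here is a rough sketch of how a rule-based classifier on top of skeleton data could look. Everything in it is an assumption for illustration: the joint names, the coordinate convention (y increasing upward), the 25 degree tolerance, and the pose labels are all made up, and it does not use any real Kinect SDK call. Mapping poses to actual signal meanings would have to come from whatever signalling standard applies on the site.

```python
import math

# Hypothetical input: a dict of joint name -> (x, y) position,
# with y increasing upward. These names are NOT from a real SDK.

def arm_angle(shoulder, wrist):
    """Elevation of the shoulder-to-wrist vector in degrees.

    0 = arm horizontal, 90 = straight up, -90 = straight down.
    The horizontal offset is taken as an absolute value, so left and
    right arms are measured the same way.
    """
    dx = abs(wrist[0] - shoulder[0])
    dy = wrist[1] - shoulder[1]
    return math.degrees(math.atan2(dy, dx))


def classify_pose(joints, tol=25.0):
    """Map arm angles to a coarse pose label using fixed thresholds."""
    left = arm_angle(joints["left_shoulder"], joints["left_wrist"])
    right = arm_angle(joints["right_shoulder"], joints["right_wrist"])

    def is_up(angle):
        return angle > 90.0 - tol

    def is_out(angle):
        return abs(angle) <= tol

    # Check two-arm poses first so they are not shadowed by one-arm rules.
    if is_up(left) and is_up(right):
        return "both_arms_up"
    if is_out(left) and is_out(right):
        return "both_arms_out"
    if is_up(right):
        return "right_arm_up"
    if is_up(left):
        return "left_arm_up"
    if is_out(right):
        return "right_arm_out"
    if is_out(left):
        return "left_arm_out"
    return "no_signal"


if __name__ == "__main__":
    frame = {
        "left_shoulder": (-0.2, 1.4),
        "right_shoulder": (0.2, 1.4),
        "left_wrist": (-0.2, 0.8),   # left arm hanging down
        "right_wrist": (0.8, 1.4),   # right arm held out sideways
    }
    print(classify_pose(frame))
```

In practice you would probably also want to smooth over a few frames and detect motion (waving, circling) rather than only static poses, but those can be handled with similar hand-written rules over joint trajectories.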