In my experience, if you ask 4o or Claude for the definition of a Vietnamese word, they can fumble the tones, mixing up the meaning of an unmarked word with that of a marked variant.
For instance, when prompted for the meaning of "da", they correctly provide "skin", but also offer "already", confusing it with "đã". I didn't notice any meaningful difference in error rate between Claude and 4o, so it's looking like I'll need to build a system for someone to manually review what is generated.
My main motivation for working on this is the sheer drought of materials. SOTA LLMs are certainly capable and reasonably effective, but I wouldn't say they make great teachers of more obscure languages yet. I recall the Llama model cards explicitly stating which languages the model is proficient in.