Author here. I think it's more of a capability issue than a safety issue. Since ...

JoshTriplett · 2025-10-21T16:30:43 1761064243

Audio models for speech not understanding pitch, seems similar to how text LLMs often don't understand spelling: it's not what they were trying to recognize.

smusamashah · 2025-10-21T20:35:53 1761078953

There was an example, of ChatGPT copying and responding in the speakers voice mid conversation, on OpenAI blog. This was presented an example on non-alignment.

oezi · 2025-10-21T16:39:15 1761064755

> generated from text via a text-to-speech

Yes, frustratingly we don't have good speech-to-text (STT/ASR) to transcribe such differences.

I recently finetuned a TTS* to be able to emit laughter and hunting for transcriptions which include non-verbal sounds was the hardest part of it. Whisper and other popular transcription systems will ignore sigh, sniff, laugh, etc and can't detect mispronounciations etc.

* = https://github.com/coezbek/PlayDiffusion

jasonjayr · 2025-10-21T16:42:46 1761064966

IIRC -- the 15.ai dev was training on fan-made "My Little Pony" transcriptions, specificaly because they included more emotive clues in the transcription, and supported a syntax to control the emotive aspect of the speech.

dotancohen · 2025-10-21T18:26:17 1761071177

Where can I read about this?

jasonjayr · 2025-10-22T03:05:20 1761102320

> During this phase, 15 discovered the Pony Preservation Project, a collaborative project started by /mlp/, the My Little Pony board on 4chan.[47] Contributors of the project had manually trimmed, denoised, transcribed, and emotion-tagged thousands of voice lines from My Little Pony: Friendship Is Magic and had compiled them into a dataset that provided ideal training material for 15.ai.[48]

From https://en.wikipedia.org/wiki/15.ai#2016%E2%80%932020:_Conce...

vvolhejn · 2025-10-22T10:36:16 1761129376

I had no idea this existed, the internet is amazing

wordglyph · 2025-10-21T22:38:32 1761086312

I used aistudio and it understood pitch and and even emotion with an uploaded mp3

j45 · 2025-10-21T16:04:25 1761062665

Accent detection or consciously ignoring it is a filter step.