I've been experimenting with something similar to this approach recently. IndexTTS2 gives you emotion vectors as an input, I used an external emotion classification model on the LLM output to modulate the TTS emotion vectors. You need to manage the state of the current affect with a bit of care or it sounds unhinged, but it's worked surprisingly well so far. I wired it together using Cats Effect.
As you'd expect latency isn't great, but I think it can be improved.
As you'd expect latency isn't great, but I think it can be improved.