This fork of Sesame CSM has several performance improvements that enable real-time voice streaming (generation time is shorter than the duration of the generated speech) with lower latency: https://github.com/davidbrowne17/csm-streaming
Have you incorporated those changes into your project?
Thanks and regards.