SAPI 5.0 TTS output to buffer

Yash Girdhar

2014-06-01 07:00:57 UTC

Thank you so much for this sample code. It helped me a lot :)

Here is my working solution if anyone needs it.

https://github.com/itsyash/MS-SAPI-demo

Thanks for your suggestions. As it happens, I baulked at implementing
ISpAudio, and used the Win32 global stream object to do all the hard work in
an implementation of ISpStreamFormat as you suggest.
This has met all of my immediate requirements but does not allow control of
the synthesis processing. That is, using ISpStream allows you to set where
the synthesis output goes, but does not allow real-time control of the
synthesis so that blocks of audio are only produced as required by the
output device.
Is there any way of achieving this kind of control, other than implementing
ISpAudio?
I suppose that using the Win32 global stream object to implement the
fundamental IStream of the ISpAudio object means that the code would be
reasonably straightforward...
Daniel Heckenberg
Design Engineer
Lake Technology Ltd.

Hi Daniel,
So there shouldn't be any problems implementing ISpAudio as you described.
- You may want to just implement ISpStreamFormat as this may be simpler.
- The threading model on your new object needs to be set to "Both" (for
TTS it won't actually be called on multiple threads, but you still need
the setting).
- There are some problems when an application calls ISpVoice::Pause()
when using a custom audio object so do not use this method.
Note there may be an easier way to do all this by using the SpStream helper
class. This class already implements ISpStreamFormat, and allows the output
to be set to either a wave file or an IStream. This (oft-repeated) sample
uses CreateStreamOnHGlobal to make an IStream using Win32 global memory and
sets it as the output of the voice:

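    CComPtr<ISpStream> cpStream;
    CComPtr<IStream> cpBaseStream;
    GUID guidFormat;
    WAVEFORMATEX* pWavFormatEx = NULL;

    // Create the SAPI stream helper object.
    HRESULT hr = cpStream.CoCreateInstance(CLSID_SpStream);
    if (SUCCEEDED(hr))
    {
        // Back the stream with Win32 global memory.
        hr = CreateStreamOnHGlobal(NULL, FALSE, &cpBaseStream);
    }
    if (SUCCEEDED(hr))
    {
        // Translate the format enum into a format GUID and WAVEFORMATEX.
        hr = SpConvertStreamFormatEnum(SPSF_22kHz16BitMono, &guidFormat,
                                       &pWavFormatEx);
    }
    if (SUCCEEDED(hr))
    {
        // Attach the memory stream to the SpStream with that format.
        hr = cpStream->SetBaseStream(cpBaseStream, guidFormat,
                                     pWavFormatEx);
        cpBaseStream.Release();
    }
    if (SUCCEEDED(hr))
    {
        // cpVoice is an ISpVoice created elsewhere.
        hr = cpVoice->SetOutput(cpStream, TRUE);
    }
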
Then, when you want to access the memory, use the Win32 functions
GetHGlobalFromStream and then GlobalLock.
You certainly don't have to do it this way but it might be simpler.
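
Once the memory is locked, the audio can be consumed in small blocks. The following is a minimal, portable sketch of that last step only, not part of SAPI. It assumes the locked buffer holds raw 16-bit mono PCM with no header, matching SPSF_22kHz16BitMono; that the pointer comes from GlobalLock; and that the byte count is the stream's actual size (for example from IStream::Stat, since the size of the global memory block may be rounded up). The names ReadPcmBlock and PcmDurationMs are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copies up to blockSamples 16-bit mono samples out of
// a raw PCM byte buffer, starting at sample index `offset`. Returns the
// number of samples copied (0 once the buffer is exhausted).
std::size_t ReadPcmBlock(const std::uint8_t* data, std::size_t byteCount,
                         std::size_t offset, std::int16_t* out,
                         std::size_t blockSamples)
{
    assert(data != nullptr && out != nullptr);
    const std::size_t totalSamples = byteCount / sizeof(std::int16_t);
    if (offset >= totalSamples)
        return 0;
    const std::size_t n = std::min(blockSamples, totalSamples - offset);
    // memcpy avoids alignment and aliasing problems with the byte buffer.
    std::memcpy(out, data + offset * sizeof(std::int16_t),
                n * sizeof(std::int16_t));
    return n;
}

// Duration in milliseconds of byteCount bytes of 16-bit mono audio.
std::uint64_t PcmDurationMs(std::size_t byteCount, std::uint32_t sampleRate)
{
    assert(sampleRate != 0);
    const std::uint64_t samples = byteCount / sizeof(std::int16_t);
    return samples * 1000 / sampleRate;
}
```

A caller would loop, advancing `offset` by the returned count, until ReadPcmBlock returns 0. Note that this only slices audio that has already been synthesized; it does not pace the synthesis itself.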
Hope this helps,

I'm trying to use the SAPI 5.0 API to do TTS and produce output into a
buffer. My application involves realtime behaviour and is CPU intensive,
so I want to produce output in small blocks (around 1000 samples) as
required. Ideally the TTS processing should be spread evenly over the time
required to output the speech.
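
The block-at-a-time requirement can be pictured as a small fixed-capacity sample FIFO sitting between the synthesizer and the output device. The sketch below is illustrative only: SampleFifo is a hypothetical class, not a SAPI interface, and whether a SAPI engine tolerates an output object that accepts only part of a write (or blocks until space is free) is an assumption that would need testing. It is also single-threaded; a real audio path would need locking around both calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-capacity FIFO of 16-bit samples. The producer can only
// run ahead of the consumer by `capacity` samples.
class SampleFifo
{
public:
    explicit SampleFifo(std::size_t capacity)
        : buf_(capacity), head_(0), count_(0)
    {
        assert(capacity > 0);
    }

    // Accepts as many samples as currently fit; returns how many were taken.
    std::size_t Write(const std::int16_t* src, std::size_t n)
    {
        const std::size_t accepted = std::min(n, buf_.size() - count_);
        for (std::size_t i = 0; i < accepted; ++i)
            buf_[(head_ + count_ + i) % buf_.size()] = src[i];
        count_ += accepted;
        return accepted;
    }

    // Removes up to n samples into dst; returns how many were read.
    std::size_t Read(std::int16_t* dst, std::size_t n)
    {
        const std::size_t taken = std::min(n, count_);
        for (std::size_t i = 0; i < taken; ++i)
            dst[i] = buf_[(head_ + i) % buf_.size()];
        head_ = (head_ + taken) % buf_.size();
        count_ -= taken;
        return taken;
    }

    std::size_t Available() const { return count_; }
    std::size_t Space() const { return buf_.size() - count_; }

private:
    std::vector<std::int16_t> buf_;
    std::size_t head_;   // index of the oldest sample
    std::size_t count_;  // number of samples currently stored
};
```

With a capacity of, say, two 1000-sample blocks, the output device calls Read for one block at a time, and the producer is limited to refilling whatever Space() reports, which bounds how far synthesis can run ahead.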
After fishing through the documentation and sample code for the SAPI 5.0
beta, my understanding is that I can implement a class with the ISpAudio
interface that should perform the required functions.
This seems like a rather generic scenario, so I'm wondering whether anyone
else has done this and whether my approach is littered with pitfalls.
The doco is pretty sketchy at the moment, so I fear some surprises.
Daniel Heckenberg
Design Engineer
Lake Technology Ltd.