Going beyond Azure AI Text to Speech Maximum Media Duration of 10 Minutes

John Kilmister · 6 min read

If you have ever used the Azure AI Speech Services, on either the free or the standard plan, you may have encountered the following error message when you process text-to-speech longer than 10 minutes:

Error code: 1007. Error details: The processed audio has exceeded the configured maximum media duration of 600000ms

In this post we will cover two ways to resolve this issue: splitting the input into smaller segments, or using the Azure Speech Batch Synthesis API endpoints to process the audio.

Splitting the Input

Last year I wrote a popular post on Enhancing Accessibility - Building a Read This Page feature with Azure Speech Service and C#. This used the free tier of Azure AI Speech to generate audio versions of blog posts. Recently, I have been writing longer, more in-depth posts and have run up against the 10-minute limit. To stay within the free tier, I decided to update the code to split up the text.

Text Groups

To split the text, I very roughly calculated the number of words in an 8-minute clip spoken at 130 words per minute. This gave me a rough number of words that could be processed in a batch. I then split the text into chunks of this size using LINQ's Chunk method.

var maxBatchSize = 130 * 8; //8 minutes of audio at 130 wpm
var batches = fullText.Split(" ").Chunk(maxBatchSize).ToArray();

This returns an array of batches/chunks. Each of these entries is a string array of the words in that batch. We can then re-join the text in the batch to process each batch.

var text = string.Join(" ", batches[0]);
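Putting the two snippets together, the chunking logic can be sketched as a small self-contained method. This is only an illustration: the ChunkText name and the word-count test data are mine, not from the original code, and a simple whitespace split is assumed.

```csharp
using System;
using System.Linq;

class ChunkDemo
{
    // Split text into batches of roughly 8 minutes of speech at 130 wpm.
    public static string[] ChunkText(string fullText, int wordsPerMinute = 130, int minutesPerBatch = 8)
    {
        var maxBatchSize = wordsPerMinute * minutesPerBatch; // 1040 words per batch
        return fullText
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Chunk(maxBatchSize)
            .Select(words => string.Join(" ", words))
            .ToArray();
    }

    static void Main()
    {
        // 2500 words should produce three batches: 1040 + 1040 + 420.
        var text = string.Join(" ", Enumerable.Range(0, 2500).Select(i => $"word{i}"));
        var batches = ChunkText(text);
        Console.WriteLine(batches.Length);               // 3
        Console.WriteLine(batches[0].Split(' ').Length); // 1040
        Console.WriteLine(batches[2].Split(' ').Length); // 420
    }
}
```

Note that Enumerable.Chunk requires .NET 6 or later.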

Requesting Audio Segments

Unfortunately, in my original code, we cannot simply loop over the chunked segments, as the API returns the audio asynchronously via a callback. I did later find a way to do this without the callback, which I will come back to further down.

The easiest way to work around this is to run the processes sequentially and wait for the first segment to complete before requesting the next. To do this we can use a recursive function to process each segment.

public static async Task SynthesizeAudioAsync(SpeechConfig config, string[][] batches, int index){

    var text = string.Join(" ", batches[index]);
    using var synthesizer = new SpeechSynthesizer(config, null as AudioConfig);

    synthesizer.SynthesisCompleted += async (s, e) =>
    {
        //save code omitted here
        if(index < batches.Length - 1){
            await SynthesizeAudioAsync(config, batches, index + 1);
        }
    };

    using var result = await synthesizer.SpeakTextAsync(text);
    //error handling code omitted here
}

As this code uses an asynchronous callback, we have the risk that the code will exit the SynthesizeAudioAsync function before the audio is returned.

Running this in a console application, it could exit too early, so I added a counter variable to track how many segments had been processed. Finally, I added code to wait until all parts had completed.

while (TotalCompleted < batches.Length)
{
    await Task.Delay(1000);
}
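The counter pattern can be illustrated without the Speech SDK. In this sketch the fire-and-forget ProcessSegmentAsync stands in for the synthesis call and its SynthesisCompleted callback; the Interlocked usage and method names are my assumptions, not the original code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class CompletionCounterDemo
{
    // In the real code this is incremented inside the SynthesisCompleted
    // callback; Interlocked keeps it safe if callbacks fire on other threads.
    static int TotalCompleted;

    static async Task ProcessSegmentAsync()
    {
        await Task.Delay(10); // stands in for the text-to-speech call
        Interlocked.Increment(ref TotalCompleted);
    }

    public static async Task<int> RunAsync(int batchCount)
    {
        for (var i = 0; i < batchCount; i++)
            _ = ProcessSegmentAsync(); // fire-and-forget, like the callback chain

        // Keep the console app alive until every segment has been saved.
        while (Volatile.Read(ref TotalCompleted) < batchCount)
            await Task.Delay(100);

        return TotalCompleted;
    }

    static async Task Main() =>
        Console.WriteLine($"completed {await RunAsync(3)} segments");
}
```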

Combining the results

In the previous version of the processor (where the code had not been batched), I used the following code to save the results to an MP3 file, called from the SynthesisCompleted callback.

static void SaveOutput(byte[] wavFile, string fileName)
{
    using (var inputStream = new MemoryStream(wavFile))
    using (var waveReader = new WaveFileReader(inputStream))
    using (var lameWriter = new LameMP3FileWriter(fileName, waveReader.WaveFormat, 128))
    {
        waveReader.CopyTo(lameWriter);
    }
}

We now have multiple segments that need to be combined into a single file in the correct order. Luckily this can easily be done by moving the lameWriter up, passing it through the recursive method and just copying each set of wav bytes in as they arrive.

static void SaveOutput(byte[] wavFile, LameMP3FileWriter lameWriter)
{
    using (var inputStream = new MemoryStream(wavFile))
    using (var waveReader = new WaveFileReader(inputStream))
    {
        waveReader.CopyTo(lameWriter);
    }
}

In order to create the LameMP3FileWriter we need to know the format of the audio. I have gathered this from a previous segment of audio returned from the API and then hard-coded in the values.

 var wavFormat = new WaveFormat(16000, 16, 1);
 using var lameWriter = new LameMP3FileWriter(outputFile, wavFormat, 128);
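The overall wiring can be shown without the NAudio types: create the shared writer once (here a plain MemoryStream stands in for the LameMP3FileWriter), thread it through the recursion, and append each segment's bytes as they arrive. The names and the raw byte copy are illustrative only; the real code decodes each WAV segment before writing.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class CombineDemo
{
    // Stands in for SaveOutput: append one segment's bytes to the shared writer.
    public static void SaveOutput(byte[] segment, Stream writer)
    {
        using var inputStream = new MemoryStream(segment);
        inputStream.CopyTo(writer);
    }

    // Stands in for the recursive SynthesizeAudioAsync: the single writer is
    // threaded through every call, so segments land in the output in order.
    public static async Task ProcessAsync(byte[][] segments, int index, Stream writer)
    {
        await Task.Yield(); // stands in for the speech synthesis call
        SaveOutput(segments[index], writer);
        if (index < segments.Length - 1)
            await ProcessAsync(segments, index + 1, writer);
    }

    static async Task Main()
    {
        var segments = new[] { new byte[] { 1, 2 }, new byte[] { 3 }, new byte[] { 4, 5 } };
        using var output = new MemoryStream();
        await ProcessAsync(segments, 0, output);
        Console.WriteLine(BitConverter.ToString(output.ToArray())); // 01-02-03-04-05
    }
}
```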

The final result is a full audio file of the text, that was longer than 10 minutes in length.

Thoughts and Notes

While we can now process longer text, it is worth remembering that the free tier is limited to 20 transactions per 60 seconds and 0.5 million characters per month.

The above code is a very rough example and could be improved in many ways. For a more complete version see the code on my GitHub.

Interestingly, after I had made this change to my code, I found an example that took a different approach: splitting the text into paragraphs and then processing them sequentially without a callback.

The code is in the SDK samples and uses an approach that polls for the status rather than using the callback event. It also writes directly to an MP3 file rather than using the LameMP3FileWriter class, though I have not tried this method.

It is worth noting that both of these examples only really work with plain text, as they do not handle any markup that may span the split boundaries.

Azure Speech Batch Synthesis API

An alternative approach is to move our code to the batch-processing API. This API is a replacement for the older Long Audio API that is due to be retired.

This requires the standard tier of the Azure Speech Service, and the code is a little more complex as it uses HTTP calls rather than the SDK. However, it does not require the text to be split into chunks, and it has the advantage of accepting SSML input in addition to plain text.

The process works by making an HTTP request, before polling the status of the request until it is complete. The audio is then downloaded from the URL provided in the response. The full details are documented on the Microsoft Learn Website.
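The request-then-poll flow can be sketched as follows. The polling structure and the "Succeeded"/"Failed" terminal states follow the documented pattern, but the helper names here are illustrative stand-ins, not the real API surface; the statusCheck delegate would be an HTTP GET on the job's status URL in practice.

```csharp
using System;
using System.Threading.Tasks;

class BatchPollDemo
{
    // Polls a status function until it reports a terminal state.
    // statusCheck stands in for a GET on the batch synthesis job URL.
    public static async Task<string> WaitForCompletionAsync(
        Func<Task<string>> statusCheck, TimeSpan pollInterval)
    {
        while (true)
        {
            var status = await statusCheck();
            if (status == "Succeeded" || status == "Failed")
                return status;
            await Task.Delay(pollInterval);
        }
    }

    static async Task Main()
    {
        // Simulate a job that succeeds on the third poll.
        var polls = 0;
        var final = await WaitForCompletionAsync(
            () => Task.FromResult(++polls < 3 ? "Running" : "Succeeded"),
            TimeSpan.FromMilliseconds(10));
        Console.WriteLine($"{final} after {polls} polls"); // Succeeded after 3 polls
    }
}
```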

A code sample has been provided along with many other examples in the SDK samples GitHub repository.

Conclusion

The 10-minute limit can at first seem frustrating; however, with a little work it is possible to get around it. If you are already on the standard tier of the Azure Speech Service you can use the Batch Synthesis API, while if you are on the free tier then splitting the text is a good option. Just remember to keep an eye on the quotas.

Title Image by Nile from Pixabay
