# AMAI on-premise text-to-speech engine

Ultra-realistic, streaming voice synthesis with customizable emotions, in a self-hosted version. The main payload of this Python HTTP server is a JSON object describing what to voice and how; the response is a speech audio stream indistinguishable from a human voice.

## Launch requirements

- Docker runtime
- Write or obtain the `.env` file. It must contain the variables `LICENSE_KEY` and `ACTIVATION_TOKEN`, which you can get from Telegram [@maxbaluev](https://t.me/maxbaluev).

## Running the TTS server

```sh
docker network create tts_network
docker-compose up --build
```

Then open http://localhost:8080 to run the UI and conveniently test the synthesis.

## Usage

The server accepts POST requests with a JSON body. Let's voice one sentence in English:

```json
POST /synth HTTP/1.1

{
    "format": "wav",
    "realtime": true,
    "data": [
        {
            "type": "text",
            "data": {
                "text": "Hi, I'm Kathrine, the voice of AMAI. I am a creation of algorithms!",
                "lang": "en",
                "speaker": "Kathrine",
                "emotion": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
            }
        }
    ]
}
```

Data from the request is first sent to the `DPI` module, where it is cleaned. It is then stressed and normalized by the `TSP` module. Finally, audio is generated by the `TTS` module, and the result can be streamed to the client continuously.

In the next example, let's voice a dialogue between a man and a woman speaking different languages:

```json
POST /synth HTTP/1.1

{
    "format": "wav",
    "data": [
        {
            "type": "text",
            "data": {
                "text": "Hi, I'm Kathrine, the voice of AMAI. I am a creation of algorithms!",
                "lang": "en",
                "speaker": "Kathrine",
                "emotion": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
                "pauseAfter": 2500,
                "pauseBefore": 1000
            }
        },
        {
            "type": "text",
            "data": {
                "text": "Привет, я Михаил. Но тебя не понимаю. Ты говоришь по-русски?",
                "lang": "ru",
                "speaker": "Michael",
                "emotion": [0, 0, 0, 0, 0.7, 0, 0, 0, 0, 0.3],
                "pauseAfter": 2500
            }
        }
    ]
}
```

(The Russian line reads: "Hi, I'm Michael. But I don't understand you. Do you speak Russian?")

The properties `pauseAfter` and `pauseBefore` add a pause, in milliseconds, after or before the synthesized fragment in the stream. You can test this request with `tests/test_post.py`.

### Emotions

You have to provide an array of emotion weights, each in the range from 0 to 1, where 0 corresponds to the absence of the emotion and 1 to its maximal vividness. **The weights must always sum to 1.** You can use various values to get unique mixes of emotions; a sketch for building such arrays by name is shown below, after the Logging section. Keep in mind, though, that not all combinations are tested yet, and some of them, especially contradictory ones like happiness + sadness, may not work properly.

The indices and their corresponding emotions in the array are:

```
0. Love
1. Sadness
2. Curiosity
3. Disgust
4. Happiness
5. Disappointment
6. Fear
7. Surprise
8. Anger
9. Default
```

## Languages and speakers

Voice availability depends on your license. The basic voices are Kathrine for English and Anna for Russian. If your license provides unique voices, you can find support on Telegram [@maxbaluev](https://t.me/maxbaluev).

## Performance

Both the number of lines that can be handled and the synthesis speed depend on the hardware. With a `Tesla V100` GPU, the client receives the first bytes of the response in under `200ms`; on the `CPU` of an average laptop, in about `800ms`.

To run on the `CPU` regardless of whether a `GPU` is present, use the `-c` argument. We recommend passing it whenever you have no `GPU` at all, since it keeps some warning messages from appearing during server initialization.

## Logging

To get verbose logging, pass the `-v` argument to the server. It is enabled by default in the Docker container.
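Since keeping track of the emotion positions by hand is error-prone, here is a minimal client-side helper sketch. It is not part of the server: the names and their order come from the list in the Emotions section, and the normalization step enforces the sum-to-1 rule.

```python
# Client-side helper sketch: builds the 10-element emotion array expected
# by /synth, using the index order documented in the Emotions section.
EMOTIONS = [
    "love", "sadness", "curiosity", "disgust", "happiness",
    "disappointment", "fear", "surprise", "anger", "default",
]

def emotion_vector(**weights: float) -> list[float]:
    """Build the emotion array from named weights, normalized to sum to 1."""
    vec = [0.0] * len(EMOTIONS)
    for name, weight in weights.items():
        vec[EMOTIONS.index(name)] = weight  # raises ValueError on unknown names
    total = sum(vec)
    if total == 0:
        vec[EMOTIONS.index("default")] = 1.0  # fall back to the neutral voice
        return vec
    return [w / total for w in vec]  # enforce the sum-to-1 rule

# 70% happiness mixed with 30% default, as in the dialogue example above
print(emotion_vector(happiness=0.7, default=0.3))
```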
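And here is a minimal Python client sketch tying the pieces together. It assumes the server is reachable on localhost:8080 as started above and that `/synth` streams audio bytes back; `tests/test_post.py` remains the authoritative example.

```python
import requests

payload = {
    "format": "wav",
    "realtime": True,
    "data": [
        {
            "type": "text",
            "data": {
                "text": "Hi, I'm Kathrine, the voice of AMAI.",
                "lang": "en",
                "speaker": "Kathrine",
                "emotion": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
            },
        }
    ],
}

# stream=True lets us consume the audio as the server produces it,
# instead of waiting for the whole message to be synthesized
with requests.post("http://localhost:8080/synth", json=payload, stream=True) as resp:
    resp.raise_for_status()
    with open("synth.wav", "wb") as out:
        for chunk in resp.iter_content(chunk_size=4096):
            out.write(chunk)
```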
## Chunked and whole synth

By default, we process data in chunks, but we can also synthesize the whole message at once. Chunked synthesis with lossy audio compression will have gaps; to circumvent this issue, we use the `ogg` format by default. If you want to synthesize the whole message instead, set the `realtime` argument to `false`.

## Format

The output format can be specified in the `format` argument.

### Supported formats

* ogg - the default format; strictly speaking it is oga, i.e. FLAC in an Ogg container.
* wav - provided in two variants, `wav` and `wav_head`. The former sends WAV without headers for all but the first chunk, while the latter retains the headers on every chunk. Those headers, however, turn into gaps in normal playback; in browsers, on the other hand, the plain `wav` format may not work as intended in realtime synthesis.
* mp3 - lossy, may produce gaps in realtime
* aac - same caveats as mp3

### Custom formats

You can use any custom format as long as it is supported by ffmpeg. To do this, provide the desired format with the `custom:` prefix, like `custom:3gp` or `custom:pvf`. You are on your own here: we cannot guarantee adequate performance with any untested format.
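To illustrate both switches, the sketch below requests whole-message synthesis in a custom ffmpeg format. `custom:3gp` is only an example value and the payload shape follows the Usage section; as noted above, untested formats come with no performance guarantees.

```python
import requests

payload = {
    "format": "custom:3gp",  # any ffmpeg-supported format name works here
    "realtime": False,       # synthesize the whole message instead of chunks
    "data": [
        {
            "type": "text",
            "data": {
                "text": "Whole-message synthesis avoids gaps between chunks.",
                "lang": "en",
                "speaker": "Kathrine",
                "emotion": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
            },
        }
    ],
}

resp = requests.post("http://localhost:8080/synth", json=payload)
resp.raise_for_status()
with open("synth.3gp", "wb") as out:
    out.write(resp.content)
```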