# Integrating Soniox as a New STT Provider in Kalimera

## Introduction

We have added support for **Soniox** as a speech-to-text (STT) provider on the orchestration side. We only need the configuration parameters from the UI side, as described below, and we are ready to go ahead with this.

This document describes:

- Why we chose Soniox
- The configuration parameters to be passed from config/UI
- The complete parameter list and usage of each
- Soniox API limits
- Things to remember while implementing
- Best practices

---

## Why Soniox?

- Soniox works well when dealing with **numbers, codes, technical vocabulary, and domain-specific terms**.
- It also performs well with Greek.
- We found it transcribes **email** noticeably better.
- Offers built-in language identification, diarization, endpoint detection, and most importantly, **the ability to bias transcription using context**.
- Pricing is token-based: you pay for input audio tokens, input text tokens (such as custom context), and output text tokens.
- For typical conversational speech, streaming works out to roughly **$0.12/hr**.

---

## Configuration Parameters

Below is an example of how we initialize the Soniox session in code. All parameters under `sonioxConfig` must be supplied from the Kalimera UI side.

```csharp
using Newtonsoft.Json.Linq;

// Configuration object used to start the Soniox session.
var startConfig = new JObject
{
    ["api_key"] = sonioxConfig?.ApiKey,
    ["model"] = sonioxConfig?.Model,
    ["audio_format"] = AudioConstants.AUDIO_FORMAT,
    ["sample_rate"] = AudioConstants.SAMPLE_RATE,
    ["num_channels"] = AudioConstants.CHANNELS,
    ["language_hints"] = new JArray(sonioxConfig?.LanguageCode ?? "en"),
    ["enable_speaker_diarization"] = sonioxConfig?.EnableSpeakerDiarization ?? true,
    ["enable_language_identification"] = sonioxConfig?.EnableLanguageIdentification ?? true,
    ["enable_endpoint_detection"] = true,
    ["context"] = sonioxConfig?.Context
};
```

## Parameters needed from the Kalimera UI side/config

- **api_key**
  - API key from the Soniox dashboard.
  - Field type: *string*
  - Required.
- **model**
  - Soniox model name (e.g., `"stt-rt-preview-v2"`). [See the model list here](https://soniox.com/docs/stt/models)
  - Field type: *string*
  - Required.
- **enable_speaker_diarization**
  - Enables detection of different speakers.
  - Field type: *boolean*
  - Not required. Default value is `false`.
- **enable_language_identification**
  - Enables automatic detection of the spoken language. Useful in multilingual environments.
  - Field type: *boolean*
  - Not required. Default value is `false`.
- **language_hints**
  - List of language codes.
  - Field type: *array of strings*
  - Not required. Default value is an empty array (`[]`).
- **context**
  - Free-form text or structured input that guides the AI. Significantly improves recognition accuracy for domain-specific terminology, brand names, technical jargon, and rare or custom words, and reduces transcription errors for unique vocabularies.
  - You provide context through the `context` object, which can include up to four sections, each improving accuracy in a different way.
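For illustration, here is a minimal sketch of what the config object carrying these UI-supplied values might look like on the orchestration side. The class name `SonioxConfig` is an assumption made for this sketch; the property names match the ones referenced in the `startConfig` snippet above.

```csharp
// Illustrative sketch only: the class name is an assumption, not the actual
// Kalimera type. Property names match those used in the startConfig snippet.
public class SonioxConfig
{
    public string ApiKey { get; set; }                      // required: Soniox API key
    public string Model { get; set; }                       // required: e.g. "stt-rt-preview-v2"
    public string LanguageCode { get; set; }                 // used to build language_hints (falls back to "en")
    public bool? EnableSpeakerDiarization { get; set; }      // optional: Soniox default is false
    public bool? EnableLanguageIdentification { get; set; }  // optional: Soniox default is false
    public string Context { get; set; }                      // optional: free-form biasing text
}
```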
## Complete parameters list Soniox supports

| **Parameter** | **Type** | **Required** | **Default Value** | **Description** |
|----------------|-----------|---------------|-------------------|-----------------|
| `api_key` | string | **Yes** | *None* | Your API key (permanent or temporary) to authenticate the session. |
| `model` | string | **Yes** | *None* | Model to use, e.g., `"stt-rt-preview"`, `"stt-rt-preview-v2"`. |
| `audio_format` | string | No | `"auto"` | Audio format of binary frames. `"auto"` detects common containers (mp3, wav, aac, etc.). For raw PCM use `"pcm_s16le"`. |
| `sample_rate` | integer | Required for raw PCM | *None* | Sample rate in Hz for raw PCM audio, e.g., 8000 or 16000. |
| `num_channels` | integer | Required for raw PCM | *None* | Number of channels (typically 1 for mono). |
| `language_hints` | array of strings | No | `[]` (empty array) | List of language codes to bias transcription accuracy, e.g., `["en", "es"]`. |
| `context` | string | No | `""` (empty string) | Free text context to bias recognition with domain-specific terms. Max ~10,000 chars / 8,000 tokens. |
| `enable_speaker_diarization` | boolean | No | `false` | Enables speaker diarization (speaker labeling). Up to 15 speakers supported. |
| `enable_language_identification` | boolean | No | `false` | Enables per-token language identification in transcripts. |
| `enable_endpoint_detection` | boolean | No | `false` | Enables endpoint detection; the model emits an `<end>` token when an utterance is finalized. |
| `translation` | object | No | *None* | Optional translation configuration object; the translation mode is `"one_way"` or `"two_way"`. |

## Soniox API Limits

If the outlined API quota limits don't work for us, we can submit a quota increase request and describe our requirements.

| **Type** | **Limit** | **Notes** |
|-----------|------------|-----------|
| Streaming session duration | 60 minutes | Max length per WebSocket session. Reconnect required after 1 hour. |
| Concurrent requests | 10 | Max active streaming sessions. |
| Requests per minute | 100 | Rate limiting may apply. |

## Things to remember while implementing

- If `audio_format` is omitted, `"auto"` is assumed, and Soniox will auto-detect the codec/container from the binary frames.
- If you use `"auto"`, do not specify `sample_rate` or `num_channels`.
- If you send raw PCM audio (`"pcm_s16le"`), `sample_rate` and `num_channels` must be specified.
- If any optional parameters are omitted, the defaults above are used (booleans default to `false`; hints/context default to an empty array/string).
- Always provide a valid `api_key` and specify the `model`.

### Best Practices for tuning

To improve **accuracy**, **latency**, and **robustness**:

#### 🧠 Provide `context` and `language_hints`

- **Use `context`** to bias recognition toward domain-specific vocabulary such as product names, jargon, technical terms, or client names. Example:

```json
"context": "Celebrex, Zyrtec, acme corp, quarterly report"
```

- Use `language_hints` to specify the expected spoken languages (ISO codes like `"en"`, `"es"`) for better language recognition accuracy.
- This is especially important for real-time sessions involving multiple possible languages or accents.
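As a concrete illustration of the two recommendations above, the sketch below shows how a session config might combine multiple language hints with a context string. It reuses the same `JObject` style as the earlier `startConfig` example; the API key placeholder, language codes, and context terms are illustrative values, not real Kalimera configuration.

```csharp
using Newtonsoft.Json.Linq;

// Sketch: a start config biased toward English + Greek speech and domain terms.
// The API key, language codes, and context terms are placeholders.
var startConfig = new JObject
{
    ["api_key"] = "<soniox-api-key>",
    ["model"] = "stt-rt-preview-v2",
    ["language_hints"] = new JArray("en", "el"),  // expected languages, e.g. English and Greek
    ["context"] = "Celebrex, Zyrtec, acme corp, quarterly report"  // domain-specific terms
};
```

Since custom context is billed as input text tokens, keeping the context string short and focused on genuinely ambiguous terms also helps keep per-session cost down.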