# Integrating Soniox as a New STT Provider in Kalimera
## Introduction
We have added support for **Soniox** as a speech-to-text (STT) provider on the orchestration side. The only remaining work is supplying the configuration parameters described below from the UI side; once those are in place, the integration is ready to go.
This document describes:
- Why we chose Soniox
- The configuration parameters to be passed from config/UI
- The complete parameter list and what each parameter does
- Soniox API limits
- Things to remember while implementing
- Best Practices
---
## Why Soniox?
- Soniox works well when dealing with **numbers, codes, technical vocabulary, and domain-specific terms**.
- It also performs well with **Greek**.
- We found that it transcribes **email** addresses noticeably better.
- It offers built-in language identification, speaker diarization, endpoint detection, and, most importantly, **the ability to bias transcription using context**.
- Pricing is **token-based**: you pay for input audio tokens, input text tokens (such as custom context), and output text tokens.
- For typical conversational speech, streaming works out to roughly **$0.12/hr**.
---
## Configuration Parameters
Below is an example of how we initialize the Soniox session in code. All parameters under `sonioxConfig` must be supplied from the Kalimera UI side.
```csharp
// Start-request payload: this JSON object is sent as the first message of the Soniox streaming session.
var startConfig = new JObject
{
    ["api_key"] = sonioxConfig?.ApiKey,                     // required: API key from the Soniox dashboard
    ["model"] = sonioxConfig?.Model,                        // required: e.g. "stt-rt-preview-v2"
    ["audio_format"] = AudioConstants.AUDIO_FORMAT,         // e.g. "pcm_s16le" for raw PCM
    ["sample_rate"] = AudioConstants.SAMPLE_RATE,           // required for raw PCM, e.g. 16000
    ["num_channels"] = AudioConstants.CHANNELS,             // required for raw PCM, typically 1
    ["language_hints"] = new JArray(sonioxConfig?.LanguageCode ?? "en"),
    ["enable_speaker_diarization"] = sonioxConfig?.EnableSpeakerDiarization ?? true,
    ["enable_language_identification"] = sonioxConfig?.EnableLanguageIdentification ?? true,
    ["enable_endpoint_detection"] = true,                   // model emits an <end> token when an utterance finalizes
    ["context"] = sonioxConfig?.Context                     // optional domain-specific bias text
};
```
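For reference, here is a minimal sketch of how this start config could be sent over a streaming connection: the start config goes out as the first text message, followed by binary audio frames. The WebSocket URL, chunk size, and surrounding plumbing are illustrative assumptions and should be checked against the current Soniox docs rather than treated as our final orchestration code.

```csharp
using System;
using System.Net.WebSockets;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class SonioxStreamingSketch
{
    // Assumed real-time endpoint; verify against the current Soniox documentation.
    private const string SonioxWsUrl = "wss://stt-rt.soniox.com/transcribe-websocket";

    public static async Task RunAsync(JObject startConfig, byte[] pcmAudio, CancellationToken ct)
    {
        using var socket = new ClientWebSocket();
        await socket.ConnectAsync(new Uri(SonioxWsUrl), ct);

        // 1. The start config is the first (text) message of the session.
        var configBytes = Encoding.UTF8.GetBytes(startConfig.ToString());
        await socket.SendAsync(new ArraySegment<byte>(configBytes),
            WebSocketMessageType.Text, endOfMessage: true, ct);

        // 2. Audio is then streamed as binary frames (chunked here for illustration).
        const int chunkSize = 3200; // ~100 ms of 16 kHz mono 16-bit PCM
        for (var offset = 0; offset < pcmAudio.Length; offset += chunkSize)
        {
            var count = Math.Min(chunkSize, pcmAudio.Length - offset);
            await socket.SendAsync(new ArraySegment<byte>(pcmAudio, offset, count),
                WebSocketMessageType.Binary, endOfMessage: true, ct);
        }

        // 3. Transcription results arrive as JSON text messages; read one as an example.
        var buffer = new byte[8192];
        var result = await socket.ReceiveAsync(new ArraySegment<byte>(buffer), ct);
        Console.WriteLine(Encoding.UTF8.GetString(buffer, 0, result.Count));
    }
}
```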
## Parameters needed from the Kalimera UI side / config
- **api_key**
  - API key from the Soniox dashboard.
  - Field type: *string*
  - Required.
- **model**
  - Soniox model name (for example, `"stt-rt-preview-v2"`). [See the model list here](https://soniox.com/docs/stt/models)
  - Field type: *string*
  - Required.
- **enable_speaker_diarization**
  - Enables detection of different speakers.
  - Field type: *boolean*
  - Optional. Defaults to `false`.
- **enable_language_identification**
  - Enables automatic detection of the spoken language. Useful in multilingual environments.
  - Field type: *boolean*
  - Optional. Defaults to `false`.
- **language_hints**
  - List of language codes.
  - Field type: *array of strings*
  - Optional. Defaults to an empty array (`[]`).
- **context**
  - Free-form text or structured input that guides the model. It significantly improves recognition accuracy for domain-specific terminology, brand names, technical jargon, and rare or custom words, and reduces transcription errors for unique vocabularies.
  - Context is provided through the context object, which can include up to four sections, each improving accuracy in a different way (see the Soniox docs for details). A sketch of a matching UI-side config class follows this list.
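To make the UI-side contract concrete, here is a minimal sketch of what the `SonioxConfig` object could look like. The property names mirror the ones already referenced by the initialization code above (`ApiKey`, `Model`, `LanguageCode`, `EnableSpeakerDiarization`, `EnableLanguageIdentification`, `Context`); the exact class shape is an assumption and should follow whatever conventions the Kalimera UI/config layer uses.

```csharp
// Sketch of the UI/config-side DTO matching the parameters above.
// Property names mirror the fields referenced by the startConfig initialization code.
public class SonioxConfig
{
    // Required: API key from the Soniox dashboard.
    public string ApiKey { get; set; }

    // Required: model name, e.g. "stt-rt-preview-v2".
    public string Model { get; set; }

    // Optional: language hint (ISO code such as "en" or "el"); the orchestrator falls back to "en".
    public string LanguageCode { get; set; }

    // Optional: speaker diarization; Soniox defaults to false if omitted.
    public bool? EnableSpeakerDiarization { get; set; }

    // Optional: per-token language identification; Soniox defaults to false if omitted.
    public bool? EnableLanguageIdentification { get; set; }

    // Optional: free-form or structured context to bias recognition.
    public string Context { get; set; }
}
```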
## Complete parameter list supported by Soniox
| **Parameter** | **Type** | **Required** | **Default Value** | **Description** |
|----------------|-----------|---------------|-------------------|-----------------|
| `api_key` | string | **Yes** | *None* | Your API key (permanent or temporary) to authenticate the session. |
| `model` | string | **Yes** | *None* | Model to use, e.g., `"stt-rt-preview"`, `"stt-rt-preview-v2"`. |
| `audio_format` | string | **No** | `"auto"` | Audio format of binary frames. `"auto"` detects common containers (mp3, wav, aac, etc.). For raw PCM use `"pcm_s16le"`. |
| `sample_rate` | integer | Required if `audio_format` = raw PCM | *None* | Sample rate in Hz for raw PCM audio, e.g., 8000 or 16000. Required if using raw PCM format. |
| `num_channels` | integer | Required if `audio_format` = raw PCM | *None* | Number of channels (typically 1 for mono). Required with raw PCM. |
| `language_hints` | array of strings | No | `[]` (empty array) | List of language codes to bias transcription accuracy, e.g., `["en", "es"]`. |
| `context` | string | No | `""` (empty string) | Free text context to bias recognition with domain-specific terms. Max ~10,000 chars / 8,000 tokens. |
| `enable_speaker_diarization` | boolean | No | `false` | Enables speaker diarization (speaker labeling). Up to 15 speakers supported. |
| `enable_language_identification` | boolean | No | `false` | Enables per-token language identification in transcripts. |
| `enable_endpoint_detection` | boolean | No | `false` | Enables endpoint detection; model emits `<end>` token on utterance finalization. |
| `translation` | object | No | *None* | Optional translation configuration object; the translation mode can be `"one_way"` or `"two_way"`. |
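If we ever enable translation, the start config would gain an optional `translation` object. The nested field names below (`type`, `target_language`) are an assumption based on our reading of the Soniox docs and must be verified before use; this is only a sketch of where the object would be attached.

```csharp
// Illustrative only: attaching an optional translation object to the start config.
// The nested field names ("type", "target_language") are assumptions to verify
// against the Soniox translation documentation.
startConfig["translation"] = new JObject
{
    ["type"] = "one_way",         // or "two_way"
    ["target_language"] = "en"    // assumed field name for the one-way target language
};
```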
## Soniox API Limits
If the quota limits below do not work for us, we can submit a quota-increase request to Soniox and describe our requirements.
| **Type** | **Limit** | **Notes** |
|-----------|------------|-----------|
| Streaming session duration | 60 minutes | Max length per WebSocket session; reconnect required after 1 hour (see the reconnect sketch below the table). |
| Concurrent requests | 10 | Max active streaming sessions. |
| Requests per minute | 100 | Rate limiting may apply. |
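Because of the 60-minute session cap, long calls need a reconnect strategy. A minimal sketch of one approach is shown below; the 55-minute budget and the `streamOneSessionAsync` delegate are illustrative assumptions, not the final orchestration code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SonioxSessionLoop
{
    // Reconnect proactively before the 60-minute streaming session limit.
    public static async Task StreamWithReconnectAsync(
        Func<CancellationToken, Task> streamOneSessionAsync,
        CancellationToken ct)
    {
        var sessionBudget = TimeSpan.FromMinutes(55); // stay safely under the 60-minute cap

        while (!ct.IsCancellationRequested)
        {
            using var sessionCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
            sessionCts.CancelAfter(sessionBudget);

            try
            {
                // One full Soniox WebSocket session (connect, send start config, stream audio).
                await streamOneSessionAsync(sessionCts.Token);
                break; // the call ended normally, no reconnect needed
            }
            catch (OperationCanceledException) when (!ct.IsCancellationRequested)
            {
                // The session budget elapsed: loop around and open a fresh session.
            }
        }
    }
}
```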
## Things to remember while implementing
- If `audio_format` is omitted, `"auto"` is assumed and Soniox auto-detects the codec/container from the binary frames.
- If you use `"auto"`, do not specify `sample_rate` or `num_channels`.
- If you send raw PCM audio (`"pcm_s16le"`), `sample_rate` and `num_channels` must be specified (see the sketch after this list).
- If any optional parameter is omitted, the defaults above are used (`false` for booleans, empty array/string for hints/context).
- Always provide a valid `api_key` and specify a `model`.
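To make the `audio_format` rules above concrete, here is a small sketch of the two valid shapes (the literal values are illustrative; in our code they come from `AudioConstants`):

```csharp
using Newtonsoft.Json.Linq;

// Option A: raw PCM — sample_rate and num_channels are required.
var rawPcmConfig = new JObject
{
    ["audio_format"] = "pcm_s16le",
    ["sample_rate"] = 16000,   // illustrative; our real value comes from AudioConstants
    ["num_channels"] = 1       // mono
};

// Option B: containerized audio (mp3, wav, aac, ...) — let Soniox auto-detect
// the format and omit sample_rate / num_channels entirely.
var autoConfig = new JObject
{
    ["audio_format"] = "auto"
};
```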
### Best Practices for tuning
To improve **accuracy**, **latency**, and **robustness**:
#### 🧠 Provide `context` and `language_hints`
- **Use `context`** to bias recognition toward domain-specific vocabulary such as product names, jargon, technical terms, or client names.
Example:
```json
{
  "context": "Celebrex, Zyrtec, acme corp, quarterly report"
}
```
- **Use `language_hints`** to specify the expected spoken languages (ISO codes such as `"en"`, `"es"`) for better recognition accuracy.
- This is especially important for real-time sessions that may involve multiple languages or accents; a sketch is shown below.
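For example, since Greek is one of our key use cases, the orchestrator could pass both Greek and English hints. This is a sketch only: the current initialization code passes a single `LanguageCode`, so supporting multiple hints would need a small config change.

```csharp
// Sketch: bias toward the languages we actually expect on a call.
// The current implementation passes a single LanguageCode; multiple hints
// would require a small, hypothetical extension of SonioxConfig.
startConfig["language_hints"] = new JArray("el", "en"); // Greek + English
```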