# Ordered Message Processing at Scale with Kafka and Go

## Introduction

Apache Kafka is a distributed event streaming platform designed for high-throughput, fault-tolerant message processing. Messages are written to topics, split into partitions, and consumed sequentially within each partition. Ordering is guaranteed at the partition level, and consumers track progress using offsets.

In systems where messages represent state transitions, such as chat sessions, financial transactions, or workflow orchestration, maintaining strict ordering becomes critical. While Kafka guarantees ordering within a partition, preserving that ordering across application logic, retries, parallel workers, and crash recovery requires deliberate architectural design.

I implemented the core Kafka messaging layer in Go for a real-time chat pipeline. Every message flows through multiple processing stages and must return a response fast. The constraint is unforgiving: messages within a session must be processed in strict sequence, end-to-end (across at least four processing stages), in under 200ms, with zero message loss. Ordering is not a nice-to-have; if message 2 processes before message 1, the system responds to the wrong context entirely.

Here is the architecture I implemented to achieve zero ordering violations at 14,000 messages per second.

---

## The Core Problem

Kafka gives you partition-level ordering. You hash a key (in this case, `sessionID`) to route all messages for a session to the same partition. That's step one.

The problem is what happens *after* the broker. The moment you introduce a worker pool for throughput, you break ordering. Message 1 lands on Worker A, message 2 lands on Worker B. Add retries, and it gets worse: message 1 fails and waits in backoff while message 2 overtakes it.

You need parallelism *across* sessions and strict sequencing *within* a session simultaneously. These feel contradictory, but they're not.

---

## Session Affinity via Consistent Hashing

The producer assigns a partition key using `sessionID`. On the consumer side, we mirror this with a 256-goroutine worker pool where each worker owns a buffered channel (size 10). We started with 256 as a deliberately large pool; the count can be tuned based on observed load and utilization patterns.

```go
workerIndex := fnv32a(sessionID) % 256
workers[workerIndex].channel <- msg
```

FNV-1a is fast and produces a stable, deterministic mapping. Session A always routes to Worker 17. Session B always routes to Worker 42. They run in parallel, but within each worker, processing is strictly sequential. You get both.

```
Sessions                     fnv32a(id) % 256        Workers (256 goroutines)

session_a8f2 [1][2][3] ──┐                       ┌─────────────────────┐
                         ├──▶ hash ─────────────▶│ Worker 17           │
session_c3d9 [1][2] ─────┘                       │ [A1][A2][A3][C1]    │ ← sequential
                                                 └─────────────────────┘
session_e1b7 [1][2][3] ──┐                       ┌─────────────────────┐
                         ├──▶ hash ─────────────▶│ Worker 42           │
session_b3f1 [1][2] ─────┘                       │ [B1][B2]            │ ← sequential
                                                 └─────────────────────┘

             Worker 17 ∥ Worker 42   (parallel across sessions)
```

The 256 worker count came from profiling under load. We started at 64 workers and observed CPU cores sitting idle at 40% utilization while message latency spiked. Doubling to 128 improved throughput but still left headroom. At 256 workers, CPU utilization hit 75-80% under peak load with stable latency. Beyond 256, we saw diminishing returns: context-switching overhead started eating the gains.
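Concretely, the pool wiring is small. Here is a minimal, self-contained sketch; the names (`Message`, `worker`, `pool`, `route`) are illustrative rather than the production code, and `fnv32a` is a plain FNV-1a implementation:

```go
// Illustrative sketch, not the production code.

const numWorkers = 256

type Message struct {
	SessionID string
	EventID   string // unique per message; used for idempotent replay handling
	Payload   []byte
}

type worker struct {
	ch chan *Message // buffered (size 10): this worker's strictly ordered queue
}

type pool struct {
	workers [numWorkers]*worker
}

// newPool spins up one goroutine per worker. Each goroutine drains its own
// channel, so messages routed to the same worker are processed sequentially.
func newPool(process func(*Message)) *pool {
	p := &pool{}
	for i := range p.workers {
		w := &worker{ch: make(chan *Message, 10)}
		p.workers[i] = w
		go func() {
			for msg := range w.ch {
				process(msg)
			}
		}()
	}
	return p
}

// fnv32a: FNV-1a (32-bit). Stable and deterministic, so a given session
// always maps to the same worker index.
func fnv32a(s string) uint32 {
	h := uint32(2166136261)
	for i := 0; i < len(s); i++ {
		h ^= uint32(s[i])
		h *= 16777619
	}
	return h
}

// route sends a message to the worker that owns its session.
func (p *pool) route(msg *Message) {
	p.workers[fnv32a(msg.SessionID)%numWorkers].ch <- msg
}
```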
Each goroutine costs ~2KB of stack, so 256 workers is roughly 512KB, negligible compared to the message buffers. The sweet spot is where CPU stays busy without thrashing. For this workload, that was 256.

---

## Why Not Partition-Per-Session or Single-Threaded Consumers?

A natural question is why not rely purely on Kafka partitions for ordering, or use one consumer thread per partition and avoid custom worker routing altogether.

Partition-per-session is not viable at scale. With thousands or millions of concurrent sessions, the required partition count would exceed operational limits and degrade broker performance. Kafka performs best when partition counts are controlled and predictable.

A single-threaded-per-partition model also becomes inefficient when partitions carry interleaved session traffic. While partition ordering is guaranteed, session-level sequencing inside that partition still requires application-level coordination if you want parallelism across independent sessions.

The consistent hashing approach allows strict ordering within a session while preserving high parallelism across sessions, without requiring unbounded partition growth.

---

## Retry Without Breaking Order

This is where most implementations fall apart. When a message fails, you can't just re-enqueue it; the messages behind it will overtake it.

The approach: **session blocking**. Say Session A has messages `[1, 2, 3, 4, 5]`. Message 3 fails.

1. Worker 17 marks Session A as `RETRYING` and stores message 3 as the `blockedMessage`
2. Messages 4 and 5 arrive — Worker 17 sees `RETRYING` and queues them in-memory rather than processing them
3. The retry sleep happens in a **separate goroutine** so Worker 17 stays unblocked for other sessions
4. After the backoff delay, message 3 is pushed into the retry channel and re-processed
5. On success (or after max retries → DLQ), the session state resets to `NORMAL` and the queued messages drain in order: 4, then 5

The retry backoff uses exponential delays with jitter (100ms → 200ms → 400ms, capped at 2s), preventing synchronized retry storms under failure conditions.

We also classify errors before retrying. A `PoisonPillError` (an unmarshal failure or a recovered panic) bypasses all retry logic and goes straight to the dead letter queue. Retrying a fundamentally bad message wastes time and blocks the session indefinitely.

Error classification happens in two tiers:

- **Tier 1:** transient network errors and downstream service timeouts. These retry with backoff.
- **Tier 2:** poison pills (unmarshal failures, schema violations, panics). These skip retry entirely and route to the DLQ with full context for offline debugging.

The circuit breaker prevents retry storms when a downstream service degrades; sessions back off collectively rather than hammering a failing endpoint.

---

## Crash Recovery Scenario

Consider a worker crash while a session is in `RETRYING` state. Because offsets are only committed up to the highest contiguous completed watermark, any uncommitted offsets will replay on restart. The retry state is reconstructed naturally from replayed messages, and idempotent processing guarantees correctness even if certain stages execute more than once.

This design intentionally trades minimal replay work for deterministic recovery. In distributed systems, replaying safely is preferable to risking silent data loss.
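To make that per-session retry state concrete, here is a minimal sketch of what a worker might keep in memory. The types and names are illustrative, not the production code, and it assumes the worker from the earlier routing sketch also carries a `sessions map[string]*session` and a `retries chan *Message`:

```go
type sessionState int

const (
	stateNormal sessionState = iota
	stateRetrying
)

// session is the in-memory, per-session bookkeeping owned by exactly one worker.
// After a crash it is rebuilt naturally as uncommitted messages replay.
type session struct {
	state   sessionState
	blocked *Message   // the message currently waiting out its backoff
	pending []*Message // messages that arrived while blocked, kept in arrival order
}

// handle applies the session-blocking rule: while one message is retrying,
// later messages for the same session queue behind it instead of overtaking it.
func (w *worker) handle(msg *Message, process func(*Message) error) {
	s := w.sessions[msg.SessionID]
	if s == nil {
		s = &session{}
		w.sessions[msg.SessionID] = s
	}
	if s.state == stateRetrying {
		s.pending = append(s.pending, msg)
		return
	}
	if err := process(msg); err != nil {
		s.state = stateRetrying
		s.blocked = msg
		// Back off in a separate goroutine so this worker keeps serving its
		// other sessions; the retry channel feeds the message back to the worker.
		go func() {
			// exponential backoff with jitter (omitted here), then:
			w.retries <- msg
		}()
	}
}
```

On success, or once max retries route the message to the DLQ, the real flow resets the session to `NORMAL` and drains `pending` in order; that part is omitted from the sketch.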
---

## Offset Commits Without Data Loss

With 256 workers processing in parallel, messages complete out of order. Naively committing the latest completed offset is dangerous: if offset 104 completes before 102, committing 104 means a crash would skip 102 entirely.

We solve this with a **contiguous watermark**. For each partition, we maintain two maps: `inFlight` (offsets currently being processed) and `completed` (offsets that finished). Every 5 seconds, a commit loop advances the watermark:

```
Partition log (snapshot at commit time):

  offset:  99         100    101        102    103        104
  state:   committed  done   in-flight  done   in-flight  done (last)

Watermark advances: 99 → 100 → STOP at 101 (in-flight gap)
→ Commit offset 101. On crash, replay from 101.
  100 already done. 102, 104 re-process (idempotent).
```

If the consumer crashes now, replay starts at offset 101. Offset 100 was already processed. Offsets 102 and 104 will re-process, which is safe because message processing is idempotent.

Idempotency is enforced at the event level. Every message carries a unique event ID. Before processing, we check: "Have I seen this event ID for this session?" If yes, skip—it's a replay. If no, process and mark it seen. For external API calls (e.g., LLM inference), we derive request IDs from `sessionID + eventID`, ensuring the same message never triggers duplicate state transitions even if it is replayed after a crash.

The key insight: **only commit up to the highest contiguous completed offset**. Gaps left by in-flight messages act as blockers, preventing any unsafe advancement.

---

## Backpressure

When workers are saturated, the system does not drop messages or crash; it pauses fetching.

`routeToWorker()` uses a 1-second timeout when sending to a worker's channel. If that times out, the channel is full. We call `PauseFetchPartitions()` on the franz-go client to stop pulling from that partition. A monitor goroutine checks worker capacity every second:

```
capacity = available buffer slots / total buffer slots   (across all 256 workers)
```

When capacity recovers above 50%, we call `ResumeFetchPartitions()`. The message that timed out is re-delivered from Kafka — it was never committed, so nothing is lost.

---

## Benchmark Results

The benchmark ran on AWS c5.2xlarge instances (8 vCPU, 16GB RAM), against a 3-broker Kafka cluster with 10 partitions per topic and replication factor 3. Each consumer instance handled 3-4 partitions, giving 3 consumer instances in the group. The pipeline simulated realistic per-stage latency: ingest (~5ms), NLU (~20ms), LLM (~50-80ms), delivery (~10ms). We injected failures into 10% of messages, with exponential backoff, to validate retry behavior under pressure.

Testing against a 4-stage pipeline (ingest → NLU → LLM → delivery) with simulated per-stage processing:

| Metric | Result | Target |
|---|---|---|
| Raw throughput | 14,027 msg/s | >10k msg/s |
| E2E latency p50 | 11.7ms | <200ms |
| E2E latency p95 | 115ms | <200ms |
| E2E latency p99 | 185ms | <200ms |
| Ordering violations | 0 | 0 |
| Session affinity | 100% | 100% |

The ordering result is the one that matters most—zero violations across 50,000 messages and 10 sessions, with failures injected at a 10% message rate. The p95/p99 latency needs tuning (reducing `FetchMaxWait`, increasing partition count), but the ordering and throughput story is solid. Failure conditions were deliberately injected to validate the ordering guarantees under retry pressure.

---

## Scaling Considerations

The worker pool size (256) is not fixed. It was chosen based on observed CPU utilization, memory profile, and expected session concurrency.
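Besides CPU, one signal worth watching when sizing the pool is the capacity ratio the backpressure monitor already computes; it falls straight out of the worker channels. A minimal sketch, reusing the hypothetical `pool` and `worker` types from the routing sketch earlier:

```go
// capacity returns the fraction of free buffer slots across the pool.
// The backpressure monitor polls this every second to decide when it is
// safe to resume fetching; the same ratio is a useful saturation gauge
// when deciding whether to grow the pool or add consumer instances.
func (p *pool) capacity() float64 {
	free, total := 0, 0
	for _, w := range p.workers {
		free += cap(w.ch) - len(w.ch) // unused slots in this worker's buffer
		total += cap(w.ch)
	}
	return float64(free) / float64(total)
}
```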
Scaling horizontally can be achieved by:

- Increasing partition count to distribute load across more consumer instances
- Increasing worker pool size per instance
- Running additional consumer group members

The critical invariant across all of these strategies is that `sessionID` remains the partition key. As long as that invariant holds, the consistent hashing model preserves session ordering regardless of horizontal expansion.

---

## Why Not Kafka Streams or Flink?

Kafka Streams and Flink both offer stateful stream processing with ordering guarantees. We evaluated both.

Kafka Streams felt heavy for our use case. Its state store abstraction is powerful, but we didn't need windowed aggregations or complex joins, just sequential message routing. The operational overhead (RocksDB state stores, rebalancing complexity) outweighed the benefits when our state is ephemeral session metadata. We needed microsecond routing decisions, not stateful transformations.

Flink is overkill for this pipeline. It's designed for complex event-time processing, exactly-once semantics across multiple stages, and large-scale analytics. Our problem is simpler: route messages to workers, retry on failure, preserve order. Flink's checkpoint overhead and operational complexity (JobManager/TaskManager coordination) weren't justified when a custom consumer group with consistent hashing solves the problem cleanly.

The lesson: understand your actual requirements before reaching for frameworks. We needed strict ordering + parallel throughput + fine-grained retry control. A focused implementation with franz-go and session affinity delivered that at a fraction of the complexity. Sometimes the right tool is just Go's concurrency primitives and a clear mental model.

---

## What Actually Matters

The architecture reduces to three core primitives:

- **Consistent hashing for session affinity** — same session, same worker, always. This guarantees ordering at the application level.
- **Session blocking during retries** — workers must treat retries as a session-level state, not an individual message event.
- **Contiguous watermark commits** — offsets are only committed when every preceding offset has completed. This eliminates silent data loss during crashes.

Everything else (DLQ fallback tiers, backpressure control, poison pill detection) is operational hardening built on top of these foundations.

The key lesson is that ordering guarantees do not stop at the broker boundary. They must be enforced deliberately at every layer of the system: in worker routing, in retry state management, and in the offset commit strategy.

This architecture is running in production, handling millions of messages daily across thousands of concurrent chat sessions. The zero-violation ordering guarantee has held under load spikes, broker failures, and application crashes. The design is simple enough to debug at 3 AM, which matters more than theoretical elegance.

If you're building anything where message order carries semantic weight (financial transactions, workflow state machines, real-time chat), the primitives here transfer: consistent hashing for affinity, session-level retry state, and contiguous watermark commits.