> The system is not instantaneously consistent, but eventually consistent, and must be traceable and repairable.

## Step 6: Eventual Consistency Architecture — Traceable and Repairable System Design

### Core Principle: Why Do We Accept "Eventual Consistency"?

In flash sale systems, **instantaneous consistency** (all databases updated at the same moment) causes:

- After Redis deducts inventory, a synchronous MySQL write is slow and blocks the user response
- The payment callback waits for the MySQL transaction to complete → timeouts and retry storms
- Any link failing rolls back the entire flow → poor user experience

**Eventual consistency** advantages:

- **Fast response**: once Redis deducts inventory, the request returns immediately; the user does not wait for the MySQL write
- **Traffic shaping**: high-concurrency requests enter a Kafka queue and are consumed gradually, so the database is not overwhelmed
- **Fault tolerance**: if a link fails (e.g., MySQL is down), messages remain in Kafka and processing resumes after recovery

**But the cost is:**

- Redis says "inventory deducted", but MySQL may not have written the order yet
- Payment succeeded, but the order status may still be `PENDING` (Kafka consumption delay)
- **Without tracking and repair mechanisms, the data stays permanently inconsistent!**

### How Do We Achieve "Traceable and Repairable"?
#### **1. Every Link Records Status + Unique ID**

| Component | Recorded Content | Purpose |
|---|---|---|
| **Redis** | `stock:{skuId}` current inventory | Real-time deduction, but unreliable (lost on crash) |
| **Kafka** | Message includes `messageId`, `orderId`, `timestamp` | Replayable, traceable, prevents duplicate consumption |
| **MySQL `orders`** | `status`, `created_at`, `updated_at` | Order state machine; can query "stuck at which step" |
| **MySQL `payments`** | `transaction_id`, `status`, `callback_time` | Third-party payment reconciliation, retry records |
| **`inventory_audit`** | `operation`, `quantity`, `ref_id`, `timestamp` | **Append-only audit log**; every inventory change can be traced back |

**Example:**

```sql
-- inventory_audit table design
CREATE TABLE inventory_audit (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    sku_id INT NOT NULL,
    operation ENUM('reserve', 'decrease', 'release', 'refund'),
    quantity INT NOT NULL,
    ref_id VARCHAR(50),  -- order_id or payment_id
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

**Why does it matter?**

- An order's inventory was deducted but payment failed → query `inventory_audit` for the `reserve` record → the compensation mechanism releases the inventory
- Redis crashes but MySQL data survives → the Redis inventory can be rebuilt from `inventory_audit`
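That last bullet can be sketched in a few lines: rebuild the Redis counter by replaying the audit log. This is a minimal sketch, with an assumption about the log semantics: `reserve` rows are the Redis-side holds, `decrease` rows are their later MySQL mirrors (so they are skipped here), and `release`/`refund` rows return stock to the pool. The function name and row shape are illustrative, not from the original.

```javascript
// Rebuild a SKU's Redis stock after a crash by replaying inventory_audit.
// `auditRows` stands in for:
//   SELECT operation, quantity FROM inventory_audit WHERE sku_id = ? ORDER BY id
function rebuildRedisStock(initialStock, auditRows) {
  let stock = initialStock;
  for (const { operation, quantity } of auditRows) {
    if (operation === 'reserve') {
      stock -= quantity; // Redis-side hold: stock leaves the pool
    } else if (operation === 'release' || operation === 'refund') {
      stock += quantity; // stock returns to the pool
    }
    // 'decrease' mirrors an earlier 'reserve' in MySQL; no Redis-side effect.
  }
  return stock;
}
```

After computing the value, the repair task would `SET stock:{skuId}` back into Redis.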
#### **2. State Machine Design: Make Every State Traceable**

```
Order state machine:
PENDING     (order placed, waiting for payment)
→ PAID      (payment completed, waiting for confirmation)
→ CONFIRMED (order confirmed, preparing to ship)
→ SHIPPED   (shipped)
→ DELIVERED (delivered)
→ REFUNDED  (refunded)

Payment state machine:
PENDING     (waiting for payment)
→ COMPLETED (payment successful)
→ FAILED    (payment failed)
→ REFUNDED  (refunded)
```

**Key points:**

- **Each state is a clear checkpoint** → when the system gets stuck, we know at which step
- **State transitions have rules** → an order cannot jump directly from `PENDING` to `SHIPPED`, which prevents data corruption
- **Payment and order status are decoupled** → `payments.status = COMPLETED` does not imply `orders.status = CONFIRMED` (Kafka consumption delay)

**How to repair:**

In the normal flow:

1. User places an order → order status: `PENDING`
2. Payment succeeds → payment status: `COMPLETED`
3. The system (a Kafka consumer) moves the order from `PENDING` to `CONFIRMED`

But sometimes things go wrong: the Kafka consumer crashes, the program crashes, or network jitter loses a message. The result is a "payment succeeded, but the order is still `PENDING`" inconsistency.

The SQL below finds exactly those abnormal orders — payment received, but order never updated:

```sql
-- Scheduled task: find payments that succeeded while the order is still PENDING
SELECT o.id, o.status, p.status
FROM orders o
JOIN payments p ON o.id = p.order_id
WHERE p.status = 'COMPLETED'
  AND o.status = 'PENDING'
  AND o.updated_at < NOW() - INTERVAL 5 MINUTE;

-- Found an anomaly → re-trigger the Kafka message or update the order status directly
```
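The "state transitions have rules" point can be enforced with a small transition table checked before any status update hits the database. A minimal sketch, assuming only the linear chain from the diagram above (a real system would add cancel/refund edges such as `PENDING → CANCELLED`); the names are illustrative.

```javascript
// Allowed transitions, mirroring the order state machine above.
// Assumption: only the linear chain from the diagram; real systems add
// extra edges (e.g. PENDING -> CANCELLED for unpaid timeouts).
const ORDER_TRANSITIONS = {
  PENDING: ['PAID'],
  PAID: ['CONFIRMED'],
  CONFIRMED: ['SHIPPED'],
  SHIPPED: ['DELIVERED'],
  DELIVERED: ['REFUNDED'],
  REFUNDED: [],
};

// Rejects illegal jumps such as PENDING -> SHIPPED before they corrupt data.
function canTransition(from, to) {
  return (ORDER_TRANSITIONS[from] || []).includes(to);
}
```

Every `UPDATE orders SET status = ...` would first pass through a check like this, so an out-of-order Kafka message cannot push an order backwards or skip a checkpoint.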
#### **3. Idempotent Design: Prevent Duplicate Operations**

**Problem scenarios:**

- Payment callbacks sent repeatedly (network jitter, third-party retries)
- Kafka messages consumed twice (consumer rebalance)
- Users clicking the order button repeatedly

**Solutions:**

| Scenario | Idempotency Mechanism | Implementation |
| --- | --- | --- |
| **Payment callback** | Check whether `transaction_id` was already processed | `INSERT IGNORE INTO payments ...` or Redis `SETNX` |
| **Kafka consumption** | Check `messageId` or `orderId` | Redis `SET message:{id} 1 NX EX 3600` |
| **Inventory deduction** | Atomic check in a Lua script | `if stock > 0 then decr else return 0` |

**Code example:**

```javascript
// Idempotent payment-callback handling
async function handlePaymentCallback(orderId, transactionId) {
  const lockKey = `lock:payment:${orderId}`;
  // Distributed lock: prevents two callbacks being processed at the same time
  const acquired = await redis.set(lockKey, '1', 'NX', 'EX', 10);
  if (!acquired) {
    console.log('Duplicate callback, ignore');
    return;
  }
  try {
    // Idempotency check: has this transactionId already been processed?
    const existing = await db.query(
      'SELECT id FROM payments WHERE transaction_id = ?',
      [transactionId]
    );
    if (existing.length > 0) {
      console.log('This transactionId was already processed');
      return;
    }
    // Update payment and order status
    await db.query('UPDATE payments SET status = "COMPLETED" WHERE order_id = ?', [orderId]);
    await db.query('UPDATE orders SET status = "CONFIRMED" WHERE id = ?', [orderId]);
    // Notify downstream consumers
    await kafka.send('payment.completed', { orderId, transactionId });
  } finally {
    await redis.del(lockKey);
  }
}
```
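The "atomic check in a Lua script" row in the table above deserves its own sketch: Redis executes a Lua script atomically, so the check and the decrement cannot interleave with another request. The script text is an assumption about the usual shape of such a script; the local `deduct` helper mimics its semantics so the logic can be seen without a Redis server.

```javascript
// Lua script (runs atomically inside Redis): deduct only if enough stock remains.
const DEDUCT_LUA = `
local stock = tonumber(redis.call('GET', KEYS[1]) or '0')
if stock >= tonumber(ARGV[1]) then
  redis.call('DECRBY', KEYS[1], ARGV[1])
  return 1
end
return 0
`;

// In production (hypothetical client call):
//   const ok = await redis.eval(DEDUCT_LUA, 1, `stock:${skuId}`, quantity);

// Local stand-in with the same check-then-decrement semantics, for illustration.
// `store` is a Map playing the role of Redis.
function deduct(store, key, quantity) {
  const stock = store.get(key) || 0;
  if (stock >= quantity) {
    store.set(key, stock - quantity); // only deduct when stock suffices
    return 1;
  }
  return 0; // sold out (or insufficient stock): nothing changed
}
```

Because the whole script runs as one atomic unit, two concurrent buyers can never both pass the check on the last unit of stock — which is exactly the oversell bug this row of the table guards against.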
#### **4. Compensation Mechanism: Repair Inconsistencies**

**Problem scenarios:**

- Redis deducts inventory → the Kafka message is lost → MySQL has no order record
- Payment succeeds → the order status update fails (MySQL down) → the user's money is taken, but the order stays `PENDING`
- User cancels an order → inventory is not released back to Redis

**Solution: the Outbox Pattern**

```sql
-- outbox table: guarantees the message will eventually be sent
CREATE TABLE outbox (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    aggregate_id VARCHAR(50),   -- order_id
    event_type VARCHAR(50),     -- 'order.created', 'payment.completed'
    payload JSON,
    processed BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- When placing an order: write orders + outbox in the same transaction
BEGIN;
INSERT INTO orders (id, user_id, total, status) VALUES (...);
INSERT INTO order_items (...);
INSERT INTO outbox (aggregate_id, event_type, payload)
VALUES (order_id, 'order.created', JSON_OBJECT('orderId', order_id, ...));
COMMIT;

-- Background task: periodically scan the outbox table and send to Kafka
SELECT * FROM outbox WHERE processed = FALSE ORDER BY created_at LIMIT 100;
-- After sending to Kafka, mark processed = TRUE
```

**Why does it work?**

- The MySQL transaction guarantees: the order write succeeded ↔ the outbox record exists
- Even if Kafka is temporarily down, outbox messages stay in the database and are retried later
- **This guarantees eventual consistency: the message will definitely be sent**

#### **5. Async Task Queue: Kafka Consumer Responsibilities**

The Kafka consumer is what actually delivers eventual consistency:

- Redis deducts inventory first → this is the "quasi-real-time inventory" (deduct early to prevent overselling)
- The Kafka consumer then **truly syncs the MySQL inventory** (the final, authoritative record)
- Then it writes the audit log (traceable)
- Then it marks the message as processed (idempotent, so inventory is never deducted twice)

| Phase | Synchronous Phase (User Request) | Async Phase (Kafka Consumer) |
| --- | --- | --- |
| What it does | Redis deducts inventory<br>(Redis is fast, supports high concurrency,<br>absorbs traffic peaks)<br>MySQL writes the order with status `PENDING` | Kafka consumer consumes `order.created`<br>MySQL deducts the real inventory (the final record must be accurate)<br>Writes the audit log (traceable, no black box, enables reconciliation)<br>Marks processing complete (idempotent, no duplicate deduction) |
| Characteristics | Fast, lightweight, no complex logic | Doesn't block the user, can be slow, can retry, guarantees eventual consistency |
| Purpose | Let the user immediately see "order placed successfully" | Make the system's "accounts" ultimately accurate and consistent |

```javascript
// Kafka consumer: consume the "order.created" message
kafka.subscribe('order.created', async (message) => {
  const { orderId, skuId, quantity } = message;

  // Idempotency check: skip messages we have already handled
  const processed = await redis.get(`processed:order:${orderId}`);
  if (processed) return;

  // Sync the real inventory to MySQL
  await db.query(
    'UPDATE skus SET stock = stock - ? WHERE id = ?',
    [quantity, skuId]
  );

  // Append to the audit log
  await db.query(
    'INSERT INTO inventory_audit (sku_id, operation, quantity, ref_id) VALUES (?, "decrease", ?, ?)',
    [skuId, quantity, orderId]
  );

  // Mark as processed (kept for 24 hours)
  await redis.set(`processed:order:${orderId}`, '1', 'EX', 86400);
});
```
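Because the consumer syncs MySQL only eventually, Redis and MySQL legitimately diverge by the reservations still in flight. The hourly reconciliation task can exploit that invariant: Redis stock should equal MySQL stock minus not-yet-synced reservations, and anything else is drift to investigate. A sketch with hypothetical data shapes (plain objects standing in for query results):

```javascript
// Reconciliation sketch: flag SKUs where the Redis counter disagrees with
// MySQL stock minus in-flight (not yet consumed) reservations.
// All three arguments are hypothetical { skuId: quantity } maps.
function findStockDrift(redisStock, mysqlStock, pendingReserved) {
  const drift = {};
  for (const skuId of Object.keys(mysqlStock)) {
    // Invariant: redis = mysql - reservations the consumer hasn't synced yet
    const expected = mysqlStock[skuId] - (pendingReserved[skuId] || 0);
    const actual = redisStock[skuId] || 0;
    if (actual !== expected) {
      drift[skuId] = { redis: actual, expected };
    }
  }
  return drift; // empty object means the "accounts" currently balance
}
```

A flagged SKU is not automatically corrected; the audit log (`inventory_audit`) is consulted first to decide whether to repair Redis or MySQL.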
### How Does the System Become "Traceable and Repairable"?

| Problem Scenario | How to Discover? | How to Repair? |
| --- | --- | --- |
| **Redis deducted inventory, MySQL has no order** | Compare Redis `stock` with MySQL `skus.stock` | Find the orphan deductions in `inventory_audit`, then create the order manually or release the inventory |
| **Payment succeeded, order still `PENDING`** | Query `payments.status = 'COMPLETED'` with `orders.status = 'PENDING'` | Resend the Kafka message or update the order status directly |
| **Kafka message lost** | `outbox.processed = FALSE` for more than 10 minutes | Background task resends it |
| **Inventory held by an expired, unpaid order** | `orders.expired_at < NOW()` and payment not completed | Release the inventory |
| **User cancelled an order, inventory not released** | Order status `REFUNDED` but the Redis counter did not increase | Compensation task: `INCR stock:{skuId}` + write `inventory_audit` |

**Monitoring tools:**

- **Reconciliation task**: hourly comparison of Redis inventory vs MySQL inventory
- **Anomaly alerts**: orders stuck at `PENDING` for more than 30 minutes
- **Audit log queries**: `SELECT * FROM inventory_audit WHERE ref_id = 'order_123'` shows an order's complete inventory change history

### Summary: Cost and Benefits of Eventual Consistency

#### **What Did We Sacrifice?**

- **Instantaneous consistency**: after Redis deducts inventory, the MySQL order may lag by a few seconds
- **Complexity**: we need a state machine, idempotency, compensation, and monitoring

#### **What Did We Gain?**

- **High-concurrency capability**: 10,000 people in a flash sale, Redis responds fast, MySQL is not overwhelmed
- **System resilience**: any component (Redis, Kafka, MySQL) can fail and the system recovers
- **Traceability**: every operation leaves a log (`inventory_audit`, `outbox`, Kafka offsets)
- **Repairability**: through compensation mechanisms and reconciliation tasks, the data eventually converges
#### **Key Design Principles**

1. **Every operation has a unique ID** (orderId, transactionId, messageId)
2. **Every state is queryable** (state machine + timestamps)
3. **Every change is recorded** (append-only audit log)
4. **Every message is idempotent** (no duplicate processing)
5. **Every failure is compensatable** (Outbox Pattern + scheduled tasks)

> **"The system is not instantaneously consistent, but eventually consistent, and must be traceable and repairable."**
> This is not a compromise; it is the only feasible architectural choice under high concurrency.
> The key is: **when inconsistency occurs, we have the ability to discover and repair it.**