# Eval-driven testing framework for agentic workflows for coding

## Table of Contents

- [Motivation](#motivation)
- [Overview](#overview)
- [Coding Tasks and Design Docs (draft)](#coding-tasks-and-design-docs-draft)
- [Agentic Workflow System (draft)](#agentic-workflow-system-draft)
- [Deterministic Code Quality Metrics for Generated Code](#deterministic-code-quality-metrics-for-generated-code)
  - [Requirements](#requirements)
  - [Selected Metrics](#selected-metrics)
  - [Dependencies](#dependencies)
  - [Rejected Metrics and Why](#rejected-metrics-and-why)
  - [Related Layers (Covered Elsewhere)](#related-layers-covered-elsewhere)
- [SI: Proof of Concept](#si-proof-of-concept)

## Motivation

Existing eval frameworks are designed to evaluate LLM text outputs, not the structural quality of generated code. We investigated whether any of them could serve our use case.

### Problem 1: No structural code quality metrics

None of the major eval frameworks provide built-in metrics for structural code quality (complexity, duplication, dead code, function size).

| Framework | Code structural metrics? | What it provides instead |
|---|---|---|
| promptfoo | No | Text assertions, NLP metrics (BLEU, ROUGE), LLM-as-judge, semantic similarity |
| autoevals (Braintrust) | No | LLM-as-judge (Factuality, Battle, ClosedQA), heuristic string matching |
| DeepEval | No | 50+ metrics for hallucination, faithfulness, relevancy, bias, toxicity |
| Ragas | No | RAG-specific metrics (faithfulness, context precision/recall) |
| Inspect AI | No | Text matching, model-graded QA, benchmark scoring |
| OpenAI Evals | No | String check, text similarity, Python grader |
| LangSmith | No | Has a "code compiles" heuristic but no structural analysis |

These frameworks can run custom code (e.g. promptfoo's `javascript`/`python` assertions, OpenAI Evals' Python grader), so it would be possible to wrap structural analysis tools as custom evaluators. But no framework provides these metrics out of the box.

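As a rough illustration of what such a wrapper could look like, the sketch below grades generated Python by function size using only the standard library. It assumes promptfoo's file-based Python assertion convention (a `get_assert(output, context)` function returning a dict with `pass`, `score`, and `reason`); check the promptfoo documentation for the exact contract. The thresholds mirror the ones used later in this document, and physical line count stands in for NLOC.

```python
import ast

# Illustrative thresholds; physical lines are a stand-in for NLOC here.
MAX_PARAMS = 5
MAX_LINES = 50


def get_assert(output, context):
    """Grade generated Python source by function size, with no LLM call."""
    try:
        tree = ast.parse(output)
    except SyntaxError as exc:
        return {"pass": False, "score": 0.0, "reason": f"not valid Python: {exc.msg}"}

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            # Counts positional and keyword-only parameters (not *args/**kwargs).
            n_params = len(args.posonlyargs) + len(args.args) + len(args.kwonlyargs)
            n_lines = node.end_lineno - node.lineno + 1
            if n_params > MAX_PARAMS:
                problems.append(f"{node.name}: params {n_params}>{MAX_PARAMS}")
            if n_lines > MAX_LINES:
                problems.append(f"{node.name}: lines {n_lines}>{MAX_LINES}")

    return {
        "pass": not problems,
        "score": 0.0 if problems else 1.0,
        "reason": "; ".join(problems) or "ok",
    }
```

A wrapper like this is deterministic and needs no API key, but each team would still have to write and maintain it, which is the gap the metrics in this document aim to fill.
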
### Problem 2: API token requirement

Existing eval frameworks that use LLM-as-a-judge require pay-per-use API tokens (e.g. `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`), which are billed separately from Anthropic/OpenAI subscriptions (Claude Pro/Max, ChatGPT Plus). These are separate products with separate billing, so no framework can use a consumer subscription to power its evaluators.

| Framework | API token required? | Notes |
|---|---|---|
| promptfoo | For LLM-based assertions only | Deterministic assertions (contains, regex, is-json, javascript, python) work without any API key |
| autoevals (Braintrust) | Yes, for most evaluators | Defaults to `OPENAI_API_KEY`. Heuristic scorers (Levenshtein, ExactMatch) work locally |
| DeepEval | Yes, for all built-in metrics | "Almost all predefined metrics use LLM-as-a-judge" |
| Ragas | For LLM-based metrics only | Has explicit non-LLM metrics (BLEU, ROUGE, exact match) that work locally |
| Inspect AI (UK AISI) | For model-graded scorers only | Built-in text scorers (includes, match, pattern) need no key |
| OpenAI Evals | Yes | Even deterministic graders run on the OpenAI platform |
| LangSmith | Yes (LangSmith key + LLM key) | Requires LangSmith auth even for heuristic evaluators |

Some frameworks (promptfoo, Ragas, Inspect AI) support Ollama/local models as a workaround, but this adds infrastructure and doesn't solve the core problem: LLM-as-judge evaluation is non-deterministic regardless of how it's hosted.

### Conclusion

For evaluating the structural quality of LLM-generated code in an agentic workflow, we need a custom metric system that:

1. Runs entirely locally with no API keys or additional billing
2. Produces deterministic, reproducible scores
3. Measures structural properties that existing eval frameworks ignore

The metrics defined below address this gap.

## Overview

The framework has three components, each designed independently:

```
┌─────────────────────────┐
│  1. Coding Tasks        │  What to build: task definitions
│     + Design Docs       │  and implementation specifications
└───────────┬─────────────┘
            │ input
            v
┌─────────────────────────┐
│  2. Agentic Workflow    │  How to build it: harness + model
│     (harness + LLM)     │  that generates code from tasks
└───────────┬─────────────┘
            │ output
            v
┌─────────────────────────┐
│  3. Quality Metrics     │  How to judge it: deterministic
│     (evaluation)        │  structural analysis of the output
└─────────────────────────┘
```

## Coding Tasks and Design Docs (draft)

> **This section is under construction.**

This component defines the coding tasks the agent will implement and the design documents that specify the expected behavior. Together they form the input to the agentic workflow.

## Agentic Workflow System (draft)

> **This section is under construction.**

We plan to support two agentic workflow harnesses:

- **Claude Code Agent SDK**: for Claude-based workflows. Despite limitations (Anthropic model lock-in, Opus being slow, Sonnet having weekly rate limits), its popularity makes it the pragmatic choice for Claude workflows.
- **PI (customized)**: for a model-agnostic alternative. PI can be configured to work with other subsidized subscriptions (e.g. GPT-5.4 models), which offer faster throughput and are less rate-limited.

This gives us flexibility to use whichever model best fits the task without being locked into a single provider.

## Deterministic Code Quality Metrics for Generated Code

### Requirements

All metrics must be **deterministic**: the same input always produces the same output, with no LLM-as-judge variance. They must also be **structural, not cosmetic**: they measure design properties that require actual redesign to fix (decomposition, refactoring), not surface-level issues that a formatter or auto-fix can resolve (naming, import order, whitespace).

### Selected Metrics

#### 1. Cyclomatic Complexity (lizard)

Counts the number of **linearly independent execution paths** through a function.
Each branching construct adds 1 to the score. Consider this example:

```python
def process(data, mode):             # starts at 1
    for item in data:                # +1 (loop)
        if item.get("active"):       # +1 (branch)
            if mode == "strict":     # +1 (branch)
                validate(item)
            elif mode == "fast":     # +1 (branch)
                skip(item)
    return True
# Total CCN: 5
```

Each `if`, `elif`, `for`, `while`, `and`, `or`, `except`, and ternary adds 1. A CCN of 24 means there are 24 linearly independent paths through the function, so up to 24 test cases are needed for full branch coverage. This catches functions with too many branches to reason about or test exhaustively.

Cyclomatic complexity treats all branches equally regardless of nesting depth: a flat sequence of guard clauses scores the same as deeply nested logic. This is why we pair it with cognitive complexity.

#### 2. Cognitive Complexity (cognitive-complexity)

Like cyclomatic complexity, but **weights each branch by its nesting depth**. A branch at the top level adds 1, but a branch nested 5 levels deep adds 6. Consider the same function restructured two ways:

```python
# Flat guards: cognitive complexity 4
def validate(x, active):
    if not x:               # +1 (depth 0)
        return False
    if x < 0:               # +1 (depth 0)
        return False
    if x > 100:             # +1 (depth 0)
        return False
    if not active:          # +1 (depth 0)
        return False
    return True

# Deep nesting: cognitive complexity 10
def validate(x, active):
    if x:                           # +1 (depth 0)
        if x > 0:                   # +2 (depth 1)
            if x <= 100:            # +3 (depth 2)
                if active:          # +4 (depth 3)
                    return True
    return False
```

Both versions have the same logic and the same cyclomatic complexity (5: four branches plus the base path), but the nested version is harder to follow. Cognitive complexity captures this: it scores 4 vs 10.

We include both metrics because cyclomatic complexity tells you **"how many paths to test"** while cognitive complexity tells you **"how hard is this to understand."** A function can have CCN=24 and cognitive=62 (like our sample `process_records`); the cognitive score reveals the nesting problem that cyclomatic misses.

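To make the contrast concrete, here is a minimal stdlib-only sketch that scores the flat and nested shapes both ways. It is not how lizard or `cognitive-complexity` are implemented: `cyclomatic` and `nesting_weighted` are hypothetical helpers that apply only the counting rules described above (the nesting-weighted score handles `if`/`elif`/`else`, `for`, and `while`, and ignores boolean operators and other increments the real metric includes). Under these rules both shapes have a cyclomatic score of 5, while the nesting-weighted scores are 4 and 10.

```python
import ast


def cyclomatic(source):
    """Approximate CCN: 1, plus 1 per branch point and per extra boolean operand."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)):
            score += 1
        elif isinstance(node, ast.BoolOp):
            score += len(node.values) - 1
    return score


def nesting_weighted(source):
    """Simplified cognitive-style score: if/for/while cost 1 + nesting depth."""

    def block(stmts, depth):
        return sum(stmt_score(s, depth) for s in stmts)

    def stmt_score(stmt, depth):
        if isinstance(stmt, ast.If):
            total = 1 + depth + block(stmt.body, depth + 1)
            rest = stmt.orelse
            # An `elif` parses as a lone If inside `orelse`; charge it a flat +1.
            while len(rest) == 1 and isinstance(rest[0], ast.If):
                total += 1 + block(rest[0].body, depth + 1)
                rest = rest[0].orelse
            if rest:  # trailing `else`
                total += 1 + block(rest, depth + 1)
            return total
        if isinstance(stmt, (ast.For, ast.While)):
            return 1 + depth + block(stmt.body, depth + 1)
        # Other statements (def, with, try, ...) pass their bodies through unchanged.
        return block(getattr(stmt, "body", []), depth)

    return block(ast.parse(source).body, 0)


FLAT = '''
def validate(x, active):
    if not x:
        return False
    if x < 0:
        return False
    if x > 100:
        return False
    if not active:
        return False
    return True
'''

NESTED = '''
def validate(x, active):
    if x:
        if x > 0:
            if x <= 100:
                if active:
                    return True
    return False
'''

for label, src in (("flat", FLAT), ("nested", NESTED)):
    print(label, cyclomatic(src), nesting_weighted(src))
```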
The `cognitive-complexity` library works via Python AST analysis. For JS/TS, `eslint-plugin-sonarjs` provides an equivalent cognitive complexity rule.

#### 3. Function Size: NLOC + Parameter Count (lizard)

**NLOC** (Non-comment Lines Of Code) counts lines containing actual logic, excluding blanks and comments. **Parameter count** is the number of function arguments. Together they measure how long a function is and how wide its interface is:

```python
# NLOC: 2, Params: 2 (small and focused)
def add(a, b):
    return a + b

# NLOC: 58, Params: 8 (monolithic)
def process_records(data, config, mode, threshold, retries, verbose, strict, timeout):
    # 58 lines of branching logic...
    ...
```

LLMs tend to generate monolithic functions that inline all logic rather than decomposing into helpers. A high parameter count signals a function that is doing too many things.

NLOC is a blunt metric (a 60-line function can be fine if it's a straightforward sequence), so it is best used as a supporting signal alongside complexity.

#### 4. Dead Code (vulture)

Finds unused functions, variables, imports, and unreachable code via static analysis. LLMs sometimes generate helper functions or variables that nothing uses, or parameters that are accepted but never read. For example, in our sample code `process_records` accepts a `timeout` parameter but never references it, and vulture flags this.

Vulture only sees the paths it is given, so a function that looks unused in one file may be imported elsewhere. For best results, point vulture at the entire project rather than a single file. It is conservative by design: it prefers false negatives over false positives for module-level functions.

#### 5. Code Duplication (jscpd)

jscpd performs token-based clone detection: it normalizes code into tokens and finds repeated sequences above a minimum length. This catches copy-pasted blocks where the LLM generated near-identical logic instead of abstracting a shared function.

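The idea behind token-based clone detection can be sketched in a few lines with Python's `tokenize` module. This is an illustration of the general technique, not jscpd's implementation: `token_stream`, `find_clones`, the `min_tokens` default, and the `SAMPLE` snippet are all hypothetical. The sketch drops comments and layout tokens, slides a fixed-size window over the remaining tokens, and reports the starting lines of any window that has been seen before. Identifiers are compared literally, so a renamed variable breaks a match.

```python
import io
import tokenize

# Token types that carry layout or comments rather than logic.
SKIP = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def token_stream(source):
    """Return (token_text, line_number) pairs, ignoring comments and layout."""
    readline = io.StringIO(source).readline
    return [
        (tok.string, tok.start[0])
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in SKIP
    ]


def find_clones(source, min_tokens=12):
    """Return (first_line, repeat_line) pairs for repeated token windows."""
    tokens = token_stream(source)
    first_seen = {}
    clones = []
    for i in range(len(tokens) - min_tokens + 1):
        window = tuple(text for text, _ in tokens[i:i + min_tokens])
        line = tokens[i][1]
        if window in first_seen:
            clones.append((first_seen[window], line))
        else:
            first_seen[window] = line
    return clones


SAMPLE = '''
def tier_for_user(score):
    if score > 50:
        return "gold"
    elif score > 20:
        return "silver"
    return "bronze"

def tier_for_order(score):
    if score > 50:
        return "gold"
    elif score > 20:
        return "silver"
    return "bronze"
'''

# Overlapping windows report the same line pair several times; deduplicate.
print(sorted(set(find_clones(SAMPLE))))
```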
Duplication is a structural problem: it means bugs must be fixed in multiple places and the code is harder to maintain.

Token-based matching can miss semantic duplication where the same logic uses different variable names or a slightly different structure. It catches the obvious cases, which are the most common with LLM output.

### Dependencies

#### Python (pip)

```
lizard                # cyclomatic complexity, NLOC, parameter count
vulture               # dead code detection
cognitive-complexity  # cognitive complexity via AST
```

#### Node (npm)

```
jscpd                 # code duplication detection
```

Install:

```bash
pip install lizard vulture cognitive-complexity
npm install jscpd --prefix .
```

### Rejected Metrics and Why

#### Does not meet requirements: opinionated / trivially fixable

| Tool/Metric | Why rejected |
|---|---|
| Ruff (most rules) | Style rules (naming, import order, whitespace) are auto-fixable and opinionated. Exception: C901 (McCabe complexity) is structural, but lizard covers this already |
| Pylint | Mostly style and convention rules. Complexity checks overlap with lizard |
| Biome / ESLint style rules | Formatting and convention; auto-fixable |

#### Does not meet requirements: needs a reference solution

| Tool/Metric | Why rejected |
|---|---|
| CodeBLEU | Compares against a "correct" implementation; useful for benchmarks, not for evaluating open-ended generation |
| CrystalBLEU | Same problem as CodeBLEU |
| CodeBERTScore | Embedding similarity to reference code |
| AST match rate | Compares syntax trees against expected output |
| Edit similarity | Distance to reference solution |

These are research metrics for comparing model outputs against known answers on standardized datasets. When evaluating real-world LLM code generation, there's no single "correct" answer to compare against.

#### Does not meet requirements: not a library / too heavy

| Tool/Metric | Why rejected |
|---|---|
| SonarQube | Full platform requiring server deployment. The relevant metrics (cognitive complexity, duplication) are available through standalone tools instead |

#### Does not meet requirements: unmaintained

| Tool/Metric | Why rejected |
|---|---|
| escomplex | Last release 2016 |
| typhonjs-escomplex | Last release 2018 |
| radon | Unmaintained since 2023. Ruff C901 and lizard cover its use case |

#### Does not meet requirements: not applicable to production eval

| Tool/Metric | Why rejected |
|---|---|
| pass@k (k>1) | Measures model capability by sampling multiple outputs. In production, you ship k=1 |
| BLEU / ROUGE | Text similarity metrics, not designed for code structure |

### Related Layers (Covered Elsewhere)

These metrics measure **code structure**. The other layers of code quality are handled separately:

- **Functional correctness**: covered by a test suite. The generated code is run against test cases to verify it produces correct output. This is the highest-signal check but is orthogonal to structural quality.
- **Security**: covered by a separate security metric (e.g. Semgrep, Bandit). Treated as its own evaluation dimension since security issues have different severity and remediation patterns than structural ones.
- **Type correctness**: not included. In the agentic workflow, type errors (mypy, pyright, tsc) are surfaced as feedback to the agent during generation. The agent iterates until types pass, so by the time code reaches evaluation, type correctness is assumed. Measuring it here would not provide useful signal about the workflow's quality.
- **Readability / idiom**: not included. Subjective, would require LLM-as-judge, violates the determinism requirement.

---

### SI: Proof of Concept

#### Sample code under analysis

The following file (`sample_code.py`) represents typical LLM-generated Python with intentional structural issues at varying severity: a clean function, a moderately complex one, a deeply nested monster function, duplicated logic, and dead code.

```
  1  def simple_add(a, b):
  2      return a + b
  3
  4
  5  def validate_input(data):
  6      """Moderate complexity: a few branches but readable."""
  7      if not isinstance(data, dict):
  8          raise TypeError("Expected dict")
  9      if "name" not in data:
 10          raise ValueError("Missing name")
 11      if "age" not in data or not isinstance(data["age"], int):
 12          raise ValueError("Missing or invalid age")
 13      if data["age"] < 0 or data["age"] > 150:
 14          raise ValueError("Age out of range")
 15      return True
 16
 17
 18  def process_records(data, config, mode, threshold, retries, verbose, strict, timeout):
 19      """
 20      Monster function: deeply nested, too many params, high complexity.
 21      Typical of LLM code that handles every case inline without decomposition.
 22      """
 23      results = []
 24      errors = []
 25
 26      for i, record in enumerate(data):
 27          if record.get("type") == "A":
 28              if record.get("status") == "active":
 29                  if record.get("value", 0) > threshold:
 30                      if mode == "strict":
 31                          if verbose:
 32                              print(f"Processing {i}")
 33                          try:
 34                              result = record["value"] * config.get("multiplier", 1)
 35                              if result > 100:
 36                                  if strict:
 37                                      results.append(result)
 38                                  else:
 39                                      results.append(min(result, 100))
 40                              else:
 41                                  results.append(result)
 42                          except Exception as e:
 43                              if retries > 0:
 44                                  errors.append(str(e))
 45                              else:
 46                                  raise
 47                      elif mode == "lenient":
 48                          results.append(record.get("value", 0))
 49                      else:
 50                          results.append(0)
 51                  else:
 52                      if verbose:
 53                          print(f"Skipping {i}: below threshold")
 54              else:
 55                  if verbose:
 56                      print(f"Skipping {i}: inactive")
 57          elif record.get("type") == "B":
 58              if record.get("status") == "active":
 59                  value = record.get("value", 0)
 60                  if value > threshold:
 61                      try:
 62                          result = value * config.get("multiplier", 1) * 0.5
 63                          if result > 100:
 64                              if strict:
 65                                  results.append(result)
 66                              else:
 67                                  results.append(min(result, 100))
 68                          else:
 69                              results.append(result)
 70                      except Exception as e:
 71                          if retries > 0:
 72                              errors.append(str(e))
 73                          else:
 74                              raise
 75                  else:
 76                      if verbose:
 77                          print(f"Skipping B {i}: below threshold")
 78
 79      if errors and verbose:
 80          print(f"Encountered {len(errors)} errors")
 81
 82      return results
 83
 84
 85  def categorize_users(users):
 86      """First of two near-identical functions: duplication target."""
 87      active = []
 88      for user in users:
 89          if user.get("active"):
 90              name = user.get("name", "unknown")
 91              email = user.get("email", "")
 92              score = user.get("score", 0)
 93              if score > 50:
 94                  active.append(
 95                      {"name": name, "email": email, "score": score, "tier": "gold"}
 96                  )
 97              elif score > 20:
 98                  active.append(
 99                      {"name": name, "email": email, "score": score, "tier": "silver"}
100                  )
101              else:
102                  active.append(
103                      {"name": name, "email": email, "score": score, "tier": "bronze"}
104                  )
105      return active
106
107
108  def categorize_orders(orders):
109      """Copy-paste of categorize_users with variable renamed."""
110      active = []
111      for order in orders:
112          if order.get("active"):
113              name = order.get("name", "unknown")
114              email = order.get("email", "")
115              score = order.get("score", 0)
116              if score > 50:
117                  active.append(
118                      {"name": name, "email": email, "score": score, "tier": "gold"}
119                  )
120              elif score > 20:
121                  active.append(
122                      {"name": name, "email": email, "score": score, "tier": "silver"}
123                  )
124              else:
125                  active.append(
126                      {"name": name, "email": email, "score": score, "tier": "bronze"}
127                  )
128      return active
129
130
131  def unused_helper():
132      """Never called: dead code."""
133      return {"status": "ok", "code": 200}
134
135
136  def _build_cache_key(prefix, entity_id, version):
137      """Also never called: dead code."""
138      return f"{prefix}:{entity_id}:v{version}"
```

#### Metric output

Running `run_metrics.py` against the sample code above produces:

```
Target: sample_code.py
Thresholds: CCN>10  cognitive>15  params>5  NLOC>50  duplication>5.0%

======================================================================
CYCLOMATIC COMPLEXITY + FUNCTION SIZE (lizard)
======================================================================
Function               NLOC   CCN  Params  Issues
─────────────────────────────────────────────────────────────────
simple_add                2     1       2  ok
validate_input           10     7       1  ok
process_records          58    24       8  CCN 24>10, params 8>5, NLOC 58>50
categorize_users         20     5       1  ok
categorize_orders        20     5       1  ok
unused_helper             2     1       0  ok
_build_cache_key          2     1       3  ok

Total functions: 7

======================================================================
COGNITIVE COMPLEXITY (cognitive_complexity)
======================================================================
Function               Score  Issues
──────────────────────────────────────────────────
simple_add                 0  ok
validate_input             6  ok
process_records           62  cognitive 62>15
categorize_users           9  ok
categorize_orders          9  ok
unused_helper              0  ok
_build_cache_key           0  ok

======================================================================
DEAD CODE (vulture)
======================================================================
Found 1 issue(s):
  sample_code.py:18: unused variable 'timeout' (100% confidence)

======================================================================
CODE DUPLICATION (jscpd)
======================================================================
Duplicated lines: 28 / 138 (20.3%)
FAIL: 20.3% > 5.0% threshold
Clones found: 2
  Clone 1: lines 63-75 ~ lines 35-47
  Clone 2: lines 115-131 ~ lines 92-108

======================================================================
RESULT (all metrics)
======================================================================
6 structural issue(s) found.
```

#### Reading the output

**Cyclomatic complexity + function size table:**

| Column | Meaning |
|---|---|
| Function | Function name |
| NLOC | "Non-comment Lines Of Code": lines containing actual logic, excluding blanks and comments. Measures function length. |
| CCN | "Cyclomatic Complexity Number": count of linearly independent execution paths through the function. Each `if`, `for`, `while`, `except`, `and`, `or` adds 1. A CCN of 24 means up to 24 test cases are needed for full branch coverage. |
| Params | Number of parameters the function accepts |
| Issues | Which thresholds were exceeded, or "ok" if none |

`process_records` is the only function flagged: 24 execution paths (threshold 10), 8 parameters (threshold 5), and 58 lines of logic (threshold 50). All three indicate a function that should be decomposed.

**Cognitive complexity table:**

| Column | Meaning |
|---|---|
| Score | Cognitive complexity score. Unlike CCN, which counts branches equally, this weights each branch by how deeply it is nested. An `if` at depth 0 adds 1, but an `if` at depth 5 adds 6. |
| Issues | Whether the score exceeds the threshold of 15 |

`process_records` scores 62, about 2.6x its cyclomatic score of 24. This gap reveals the nesting problem: the function doesn't just have many branches, those branches are deeply nested inside each other, making the code disproportionately hard to follow.

`validate_input` scores only 6 cognitive despite CCN=7, because its branches are flat sequential guard clauses that are easy to read even though there are several.

**Dead code:**

Reports unused variables, functions, or imports with a confidence percentage. Here it found `timeout`, a parameter that `process_records` accepts but never uses. The run reports only findings at >=80% confidence to limit false positives. That cutoff also hides `unused_helper` and `_build_cache_key`, because vulture rates unused functions at a lower confidence (60%).

**Code duplication:**

Reports the percentage of duplicated lines and lists each clone pair with their line ranges. Clone 1 (lines 35-47 and 63-75) is duplicated result-capping and error-handling logic inside `process_records` for type A vs type B records. Clone 2 (lines 92-108 and 115-131) is `categorize_users` vs `categorize_orders`: near-identical functions that should be a single parameterized function.
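
The Issues column in the first table can be reproduced with a plain threshold check. The sketch below is a guess at that step, not the contents of `run_metrics.py` (which is not shown here); `FunctionMetrics`, `THRESHOLDS`, and `issues_for` are hypothetical names, and the numbers are copied from the run above.

```python
from dataclasses import dataclass

# Thresholds as printed in the run header: CCN>10, params>5, NLOC>50.
THRESHOLDS = {"CCN": 10, "params": 5, "NLOC": 50}


@dataclass
class FunctionMetrics:
    name: str
    nloc: int
    ccn: int
    params: int


def issues_for(metrics):
    """Return the Issues column text for one function."""
    checks = [("CCN", metrics.ccn), ("params", metrics.params), ("NLOC", metrics.nloc)]
    exceeded = [
        f"{label} {value}>{THRESHOLDS[label]}"
        for label, value in checks
        if value > THRESHOLDS[label]
    ]
    return ", ".join(exceeded) or "ok"


rows = [
    FunctionMetrics("validate_input", nloc=10, ccn=7, params=1),
    FunctionMetrics("process_records", nloc=58, ccn=24, params=8),
]
for row in rows:
    print(f"{row.name:<20} {issues_for(row)}")
```

Because the check is a pure comparison against fixed thresholds, rerunning it on the same metrics always yields the same Issues text, which is the determinism property the Requirements section asks for.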