# Agentic Audit POC: Technical Architecture
This README explains **what we built**, **how the pieces fit together**, and **how to run and extend it** inside DevX/AWS. It covers services, Lambda functions, IAM, networking, config, testing, and troubleshooting.
---
## 1) High‑level goal
Build a **low‑cost, explainable POC** that:
* Runs **audit checks** over AP data stored in **Amazon S3** using **Amazon Athena**
* Exposes two capabilities:
1. **Duplicate invoice check (AP‑R5/AP‑C5)**
2. **General read‑only SQL** generated by an **LLM (Bedrock)** for simple, ad‑hoc questions (e.g., “list vendor IDs”)
* Returns **auditable results** (columns + rows + presigned CSV link) and **makes each step traceable** in logs
---
## 2) Architecture (services & data flow)
```
[User / Auditor]
│ JSON request ("text": "check duplicate invoices" or "largest invoices by amount")
▼
[Lambda: gp_Agent (router + SQL generator)]
- Calls Athena info_schema to build schema hint
- Calls Bedrock (inference profile) to classify/route
- If duplicate_check → invokes duplicateCheck_AthenaQuery
- If ad_hoc_sql → validates/runs Athena SQL directly
│
├──▶ [Amazon Bedrock]
│ └─ Intent JSON (duplicate_check / ad_hoc_sql)
│
├──▶ [Amazon Athena]
│ └─ Query S3 data (Lake in S3)
│
└──▶ [Amazon S3]
└─ Stores Athena result CSV
[Lambda: duplicateCheck_AthenaQuery]
- Purpose‑built Athena SQL for AP‑R5 duplicate invoices
- Returns rows + presigned CSV URL
(Optional) [EC2/Flask UI]
- Simple button to trigger gp_Agent via API Gateway/Lambda URL
Observability: [CloudWatch Logs] for all Lambdas
Security/Access: [IAM Roles], [SSM]
```
**Core AWS services used**
* **Amazon S3** – Data lake (CSV files) & Athena results output bucket
* **Amazon Athena** – Serverless SQL over S3 (no DB servers)
* **AWS Lambda** – Three functions
* `gp_Agent` (router + LLM intent + ad‑hoc SQL executor)
* `duplicateCheck_AthenaQuery` (fixed duplicate logic)
* `duplicateCheck_TaskAgent` (alternative narrow router that only supports duplicate intent)
* **Amazon Bedrock** – LLM inference for intent classification & simple SQL drafting
* **Amazon CloudWatch Logs** – Execution logs & timing
* **(Optional) AWS Glue** – For automatic schema cataloging (not required for POC)
* **(Optional) API Gateway or Lambda Function URL + EC2/Flask** – Frontend trigger
---
## 3) Lambda functions (purpose, inputs/outputs)
### 3.1 `duplicateCheck_AthenaQuery`
**Purpose:** Run a **known‑good Athena query** that finds duplicate invoices (AP‑R5) by (`vendor_id`,`invoice_number`,`invoice_amount`,`invoice_date`).
**Event →**
```json
{ "table": "invoices_test" }
```
**Response ←**
```json
{
"statusCode": 200,
"queryId": "...",
"columns": ["vendor_id","invoice_number","invoice_amount","invoice_date","duplicate_count"],
"rows": [["ACME","INV-1001","1200.0","", "2"], ...],
"s3_result": {"bucket":"<bucket>","key":"<path>.csv"},
"download_url": "https://...presigned..."
}
```
**Notes**
* Uses env vars: `ATHENA_DB`, `ATHENA_WORKGROUP`, `ATHENA_OUTPUT` (s3://…/athena-results/)
* Waits for Athena completion and returns both **rows** and a **presigned URL** to the CSV
* Timeouts: set Lambda timeout ≥ 30s (Athena latency)
---
### 3.2 `gp_Agent`
**Purpose:** Orchestrator + general agent.
* Builds a **schema hint** by querying `information_schema.columns` for allow‑listed tables
* Calls **Bedrock** with a very strict system prompt to return **JSON only** with one of three actions:
* `duplicate_check` → invokes `duplicateCheck_AthenaQuery`
* `ad_hoc_sql` → validates **read‑only SELECT** then runs it in Athena
* `unsupported` → explains limitation
**Modes & Events**
1. **Chat/Router mode**
```json
{ "text": "please check duplicate invoices" }
```
2. **Adapter (direct dup pass‑through)**
```json
{ "table": "invoices_test" }
```
**Response (examples)**
* Duplicate path:
```json
{
"statusCode": 200,
"path": "duplicate_check",
"bedrock_invoked": true,
"result": { ...rows + download_url... }
}
```
* Ad‑hoc SQL path:
```json
{
"statusCode": 200,
"path": "ad_hoc_sql",
"bedrock_invoked": true,
"submitted_sql": "SELECT vendor_id FROM ap_demo.invoices_test LIMIT 100;",
"result": { ...rows + download_url... }
}
```
* Unsupported:
```json
{
"statusCode": 200,
"path": "unsupported",
"bedrock_invoked": true,
"message": "Supported now: duplicate check, read-only SELECT queries (ad-hoc)."
}
```
**Env vars**
* `ATHENA_DB=ap_demo`
* `TABLE_ALLOWLIST=invoices_test` (CSV list)
* `ATHENA_WORKGROUP=primary`
* `ATHENA_OUTPUT=s3://<bucket>/athena-results/`
* `DUP_CHECK_ARN=arn:aws:lambda:...:function:duplicateCheck_AthenaQuery`
* `BEDROCK_REGION=us-east-1` *(or a Region where your model/profile is available)*
* `BEDROCK_MODEL_ID=us.anthropic.claude-3-5-haiku-20241022-v1:0` *(**inference profile ID**, not base model)***
**Important behaviors**
* **SQL validation**: allows only `SELECT` and requires the query to reference an allow‑listed table **with schema prefix** (e.g., `ap_demo.invoices_test`). Adds `LIMIT 100` if missing
* **Schema hint**: Minimizes hallucinations by telling the LLM which columns exist
* **Error surfacing**: Returns Athena `StateChangeReason` when a query fails (e.g., column doesn’t exist)
---
### 3.3 `duplicateCheck_TaskAgent` (narrow demo)
**Purpose:** A minimal, LLM‑based intent classifier that only recognizes **duplicate invoice** instructions and delegates to `duplicateCheck_AthenaQuery`.
**Event →**
```json
{ "text": "please check identical invoices" }
```
**Response ←**
```json
{
"statusCode": 200,
"action": "duplicate_invoice_check",
"bedrock_invoked": true,
"result": { ...rows + download_url... }
}
```
**When to use**
* Use `gp_Agent` for the general POC.
* Keep `duplicateCheck_TaskAgent` as a tiny, focused example or fallback.
---
## 4) Bedrock configuration
* **Model access**: Grant model access in Bedrock console
* **Inference region**: We use `us-east-1` where the **inference profile** is available
* **Model identifier**: Use a **system‑defined inference profile ID** for on‑demand (example):
* `BEDROCK_MODEL_ID=us.anthropic.claude-3-5-haiku-20241022-v1:0`
* **IAM permission (router Lambda)**: `bedrock:InvokeModel` on `*` (or scoped to the chosen profile ARN)
* **Logging**: Router prints `BEDROCK_RAW_TEXT_LEN`, and on success: intent JSON; on failure: parsing hints
> If you call a base model ID that doesn’t support on‑demand, you’ll get a `ValidationException` directing you to use an inference profile. Keep `BEDROCK_REGION` aligned with the profile’s Regions.
---
## 5) Data & Athena setup
* **S3 buckets**
* `/data/` – your CSV(s) for `ap_demo.invoices_test`
* `/athena-results/` – query output location configured in env var `ATHENA_OUTPUT`
* **Athena database/workgroup**
* `ATHENA_DB=ap_demo`, `ATHENA_WORKGROUP=primary`
* **Table**
* External table pointing to S3 CSV; define via DDL or use Glue crawler (optional)
* **(Optional) AWS Glue**
* If used, Glue Crawler can infer schema & create the table automatically; otherwise, create the table manually in Athena
---
## 6) IAM & permissions (minimum)
**For `gp_Agent` Lambda role**
* `bedrock:InvokeModel` on chosen profile
* `athena:StartQueryExecution`, `athena:GetQueryExecution`, `athena:GetQueryResults`
* `s3:GetObject`, `s3:PutObject`, `s3:ListBucket` on result bucket/prefix
* `lambda:InvokeFunction` on `duplicateCheck_AthenaQuery`
**For `duplicateCheck_AthenaQuery` Lambda role**
* `athena:StartQueryExecution`, `athena:GetQueryExecution`, `athena:GetQueryResults`
* `s3:GetObject`, `s3:PutObject`, `s3:ListBucket` on data & results bucket
**(Optional) For `duplicateCheck_TaskAgent`**
* `bedrock:InvokeModel`
* `lambda:InvokeFunction` on `duplicateCheck_AthenaQuery`
**Trust policies**
* Each Lambda role trusted by `lambda.amazonaws.com`
**(If using VPC‑attached Lambdas)**
* Add appropriate **VPC endpoints** (Athena/Glue/S3 interface or gateway) or NAT for public endpoints
---
## 7) Environment variables (summary)
| Function | Key | Example |
| -------- | ------------------ | ---------------------------------------------------------------- |
| All | `ATHENA_DB` | `ap_demo` |
| All | `ATHENA_WORKGROUP` | `primary` |
| All | `ATHENA_OUTPUT` | `s3://devx-apc5-demo-agentic/athena-results/` |
| gp_Agent | `TABLE_ALLOWLIST` | `invoices_test` |
| gp_Agent | `DUP_CHECK_ARN` | `arn:aws:lambda:us-east-2:…:function:duplicateCheck_AthenaQuery` |
| gp_Agent | `BEDROCK_REGION` | `us-east-1` |
| gp_Agent | `BEDROCK_MODEL_ID` | `us.anthropic.claude-3-5-haiku-20241022-v1:0` |
---
## 8) Deployment & testing
1. **Upload CSVs** to S3 and set/create the `ap_demo.invoices_test` table (Athena or Glue)
2. **Create buckets** and set `ATHENA_OUTPUT` to your results prefix
3. **Deploy Lambdas** (zip/upload or from console), set env vars, attach IAM roles
4. **Test `duplicateCheck_AthenaQuery` directly** with:
```json
{ "table": "invoices_test" }
```
5. **Set `DUP_CHECK_ARN` in `gp_Agent`** to the ARN from step 4
6. **Test `gp_Agent`** with events:
```json
{ "text": "please check duplicate invoices" }
{ "text": "list vendor_id from ap_demo.invoices_test" }
```
7. **Observe logs** in CloudWatch: intent JSON, timing, Athena reasons on failure
*(Optional)* Expose via **API Gateway or Lambda Function URL** and call from EC2/Flask UI.
---
## 9) Security & guardrails
* **SELECT‑only** enforcement for ad‑hoc SQL; rejects DDL/DML/CTAS/UNLOAD
* **Table allow‑list** requires full `db.table` reference
* **Limit injection** adds `LIMIT 100` if absent
* **Schema hint** constrains LLM to existing columns → fewer hallucinations
* **No PII** in prompts; only structure (column names/types)
* **Presigned URLs** expire (default 15 min) for results
* **IAM least‑privilege**: scope S3/Athena access to specific buckets and workgroup
---
## 10) Cost profile (POC)
* Athena: ~$5/TB scanned; keep CSVs compact/partitioned, add LIMITs
* Lambda: ms‑level charges; tiny for this POC
* Bedrock: per‑token; schema hints are small (cheap) and only one call per request
* S3: storage + GET for result files
---
## 11) Troubleshooting
**Symptom:** `"Athena FAILED"` with no details
* We now surface `StateChangeReason`; check `error` string in response
**Symptom:** LLM returns non‑JSON → `JSONDecodeError`
* Router prints `BEDROCK_PAYLOAD_NONJSON` or `BEDROCK_JSON_PARSE_ERR`; our parser strips code fences & extracts first `{...}` block
* Tighten prompt or reduce schema text
**Symptom:** `ValidationException … on‑demand not supported`
* You used a **base model ID**. Switch to **inference profile ID** (e.g., `us.anthropic.claude-3-5-haiku-20241022-v1:0`) and ensure `BEDROCK_REGION` supports it
**Symptom:** Duplicate path returns `Missing 'text' in request.`
* `DUP_CHECK_ARN` points to the wrong Lambda (a chat function). Set it to `duplicateCheck_AthenaQuery` ARN
**Symptom:** LLM wrote SQL with unknown columns (e.g., `created_by`)
* Add that column to the dataset/schema, or adjust the ask. The schema hint + prompt now says “Use **only** allowed columns.”
---
## 12) Extending the POC
* **More controls**: add new Task Lambdas (e.g., 3‑way match, bank‑account verification hits)
* **Tax compliance (AP‑R10/AP‑C10)**: use Textract for W‑9/W‑8 and rule‑based Bedrock agent to flag exceptions
* **Step Functions**: orchestrate multi‑step flows (classify → pick task → validate → report writer)
* **Knowledge Base/Guardrails**: Bedrock KB for policy retrieval; Guardrails for prompt safety & deny‑lists
* **Report Agent**: generate standardized audit report (Markdown/PDF) with traceable evidence links
---
## 13) Appendix – Code placement
* `duplicateCheck_AthenaQuery` – **fixed SQL** for AP‑R5 duplicates
* `gp_Agent` – **router + Bedrock** + **ad‑hoc SELECT** (with validators)
* `duplicateCheck_TaskAgent` – **narrow intent** demo (duplicate only), optional
> Keep `duplicateCheck_AthenaQuery` separate so `gp_Agent` can invoke it via `DUP_CHECK_ARN` without recursion.
---
## 14) Quick reference (copy/paste test events)
**Duplicate check**
```json
{ "text": "please check duplicate invoices" }
```
**Ad‑hoc list**
```json
{ "text": "list vendor_id from ap_demo.invoices_test" }
```
**Direct duplicate (adapter mode)**
```json
{ "table": "invoices_test" }
```
---
**End of README**