# plan-for-custom-deep-research

Awesome. Let's stand up a "DIY Deep Research" pipeline you can own and tune for your domains. Below is a **practical, shippable blueprint** with JS-first code you can paste into a repo and iterate on. It assumes the OpenAI **Responses API**, plus **your own web/search adapters** (you can swap in any search/crawl stack later).

---

# DIY Deep Research (owner-operated)

## Goals

* **High-quality, citable briefs** from web + docs, with uncertainty handling.
* **Composable orchestration**: planner → searchers → readers → extractors → synthesizer → critic.
* **Domain-aware prompts** tuned for your three tracks and their intersections.
* **Reproducible outputs** (Markdown + JSON sidecar) with **traceable citations**.

---

## Architecture (one process to start)

**Roles (agents)**

1. **Planner** – decomposes the query into sub-questions & an evidence plan.
2. **Searcher** – web & catalog queries; returns ranked candidates (URLs, metadata).
3. **Reader** – fetch + sanitize + chunk; extracts facts/claims with per-claim provenance.
4. **Extractor** – turns raw text into structured notes (claims, quotes, stats, definitions).
5. **Synthesizer** – writes the brief; cites every non-trivial claim.
6. **Critic/Verifier** – spot-checks: contradiction, missing perspectives, stale links, overclaiming; pushes revisions.
7. **Formatter** – outputs Markdown report + JSON manifest (citations, confidence, lineage).
8. **Indexer (optional)** – stores notes & embeddings for reuse (GraphRAG/ontology later).

**Data flow (happy path)**

```
user query
  -> Planner.plan()
  -> Searcher.search() per sub-question
  -> Reader.read() + Extractor.extract()
  -> Synthesizer.write()
  -> Critic.review() -> Synthesizer.revise()
  -> Formatter.emit()
  -> Indexer.ingest() (optional)
```

**Storage**

* `/runs/<timestamp>/`
  * `report.md`
  * `report.json` (citations, claims, confidence)
  * `notes.ndjson` (per-source notes)
  * `raw/` (source HTML/text snapshots, hashed filenames)
* `cache.sqlite` (URL → `normalized_text`, `fetched_at`)

---

## Output contracts

### `report.md` (human-readable)

* Executive summary (bulleted, ≤10 lines)
* Answer sections, organized by sub-question
* Domain intersection callouts (AI × Higher-Ed, AI × Arts, etc.)
* "What we're confident about" / "Open questions"
* Full citations (numbered, inline like `[12]`)

### `report.json` (machine-readable)

```json
{
  "query": "How are top U.S. universities integrating AI in humanities pedagogy?",
  "run_id": "2025-09-06T13-04-18Z",
  "model": "gpt-4o",
  "claims": [
    {
      "id": "clm_001",
      "text": "Several R1 institutions ran AI-assisted close-reading pilots in 2024–25.",
      "citation_ids": ["src_03", "src_07"],
      "confidence": 0.72,
      "domain_tags": ["higher-ed", "humanities", "ai-teaching"],
      "risk_flags": ["generalization"]
    }
  ],
  "citations": [
    {
      "id": "src_03",
      "url": "https://example.edu/center/ai-humanities-pilot",
      "title": "AI & Humanities Pilot 2025",
      "author": "Center for Teaching",
      "published": "2025-03-11",
      "accessed": "2025-09-06",
      "hash": "sha256:…"
    }
  ]
}
```
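To keep runs honest to this contract, here's a minimal validation sketch you could call from the Formatter (plain JS, no schema library; `validateReport` is a hypothetical helper, and the required fields simply mirror the example above):

```js
// Hypothetical helper: sanity-check a report.json manifest against the contract above.
export function validateReport(report) {
  const problems = [];

  for (const key of ["query", "run_id", "claims", "citations"]) {
    if (!(key in report)) problems.push(`missing top-level key: ${key}`);
  }

  for (const c of report.claims ?? []) {
    if (!c.id || !c.text) problems.push(`claim missing id/text: ${JSON.stringify(c)}`);
    if (!Array.isArray(c.citation_ids) || c.citation_ids.length === 0)
      problems.push(`claim ${c.id} has no citations`);
    if (typeof c.confidence !== "number" || c.confidence < 0 || c.confidence > 1)
      problems.push(`claim ${c.id} confidence is not a number in [0, 1]`);
  }

  // every cited source id must exist in the citations list
  const citationIds = new Set((report.citations ?? []).map(s => s.id));
  for (const c of report.claims ?? []) {
    for (const id of c.citation_ids ?? []) {
      if (!citationIds.has(id)) problems.push(`claim ${c.id} cites unknown source ${id}`);
    }
  }

  return problems; // empty array = contract satisfied
}
```

Run it in `formatter.js` (or a test) so a report never ships with uncited claims or dangling citation IDs.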
---

## Domain lenses (prompt adapters)

We'll **prepend** one of these lenses to the Planner & Synthesizer prompts depending on topic classification (or combine two for intersections); a small classifier sketch follows the starter prompts below.

### Lens A — AI research & strategy

* Emphasize: method provenance, benchmarks, scaling laws, org impact, governance.
* Penalize: vendor hype, uncited claims, outdated references.
* Require: dates on findings; distinguish between *model capability* and *productized feature*.

### Lens B — Higher Ed & Humanities pedagogy

* Emphasize: assessment design, learning outcomes, academic integrity, accessibility, equity.
* Require: concrete course designs, assignment archetypes, workload/latency tradeoffs, faculty development pathways.

### Lens C — Media/Arts/Multimodal

* Emphasize: creative workflow diagrams, pipeline steps (capture → ingest → edit → render → publish), IP/ethics, performance/latency constraints.
* Require: tool-agnostic descriptions; reproducible patterns; examples.

---

## Prompts (starter)

### `prompts/planner.txt`

```
You are a research planner. Given a query, produce:
1) 3–7 sub-questions that, if answered, fully address the query.
2) For each, the evidence types needed (news, peer-reviewed, policy, case study, syllabi, GitHub, videos).
3) Disambiguation notes & high-risk pitfalls (stale claims, vendor hype, ambiguous metrics).
4) A brief plan for triangulation & contradiction checks.

Domain lens: {{DOMAIN_LENS}}
Query: {{QUERY}}

Return JSON with keys: sub_questions[], evidence_plan{}, pitfalls[], triangulation[].
```

### `prompts/extractor.txt`

```
You are an evidence extractor. Given source text + metadata, output:
- atomic claims with quoted support (≤30 words per quote),
- the exact span(s) used,
- normalized entities (orgs, people, models, courses),
- dates found, and
- uncertainty markers in the source.

Return NDJSON where each line is a JSON object with:
{claim, quotes[], spans[], entities[], dates[], source_id, confidence}
```

### `prompts/synthesizer.txt`

```
Write a rigorous, neutral report that answers the user query.
Rules:
- Every nontrivial claim must cite [#].
- Separate 'We're confident' vs 'Open questions'.
- Highlight intersections across domains explicitly.
- Keep hype in check; state limitations.
- Prefer specifics (course patterns, workflow steps, evaluation rubrics).

Domain lens: {{DOMAIN_LENS}}
Use this JSON notes bundle:
{{NOTES_JSON_CHUNKED}}

Output Markdown only.
```

### `prompts/critic.txt`

```
Act as a critical reviewer. Find:
- overclaims (claims exceeding source support),
- missing perspectives (e.g., humanities equity angle),
- contradictions across sources,
- stale references (>18 months) if recency is critical.

Return JSON with {issues[], required_fixes[], confidence_adjustments{}}.
```
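Lens selection is manual (`--lens`) in the CLI below; if you want the topic classification mentioned above to be automatic, here is a minimal sketch, assuming the same Responses API call shape as the other modules (`classifyLens` is a hypothetical helper, not part of the skeleton):

```js
// Hypothetical helper: pick one or two lenses for a query before planning.
import OpenAI from "openai";

const client = new OpenAI();

export async function classifyLens(query) {
  const resp = await client.responses.create({
    model: process.env.OPENAI_MODEL,
    input:
      'Classify this research query into one or two of: "ai", "higher-ed", "media-arts".\n' +
      'If it clearly spans tracks, return ["intersection"].\n' +
      'Return JSON like {"lenses": ["higher-ed"]}.\n\n' +
      `Query: ${query}`,
    text: { format: { type: "json_object" } } // JSON mode, as in planner.js below
  });
  const { lenses = ["intersection"] } = JSON.parse(resp.output_text);
  return lenses;
}
```

Wire it into the orchestrator ahead of `plan()` if you'd rather not pass `--lens` by hand.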
---

## Repo skeleton

```
diy-deep-research/
  .env
  package.json
  src/
    index.js
    orchestrator.js
    planner.js
    searcher.js
    reader.js
    extractor.js
    synthesizer.js
    critic.js
    formatter.js
    utils/
      cache.js
      chunk.js
      normalize.js
      cite.js
      clock.js
  prompts/
    planner.txt
    extractor.txt
    synthesizer.txt
    critic.txt
  runs/
```

**`package.json`**

```json
{
  "name": "diy-deep-research",
  "type": "module",
  "scripts": {
    "start": "node --env-file=.env src/index.js",
    "run": "node --env-file=.env src/index.js --q"
  },
  "dependencies": {
    "openai": "^4.56.0",
    "node-html-parser": "^6.1.12",
    "undici": "^6.19.8",
    "p-limit": "^6.1.0",
    "better-sqlite3": "^9.6.0",
    "yargs": "^17.7.2",
    "uuid": "^9.0.1"
  }
}
```

**`.env`**

```
OPENAI_API_KEY=sk-…
OPENAI_MODEL=gpt-4o
MAX_TOKENS=4000
```

(The scripts pass `--env-file=.env`, so Node 20.6+ loads these variables without a dotenv dependency.)

---

## Core modules (minimal versions)

**`src/index.js`**

```js
import yargs from "yargs";
import { hideBin } from "yargs/helpers";
import { runResearch } from "./orchestrator.js";

const argv = yargs(hideBin(process.argv))
  .option("q", { type: "string", describe: "query to research" })
  .option("lens", {
    type: "string",
    default: "higher-ed",
    choices: ["ai", "higher-ed", "media-arts", "intersection"]
  })
  .demandOption("q")
  .help().argv;

runResearch({ query: argv.q, lens: argv.lens }).then(p => {
  console.log("✓ report:", p.reportPath);
  console.log("✓ manifest:", p.jsonPath);
}).catch(e => {
  console.error("run failed:", e);
  process.exit(1);
});
```

**`src/orchestrator.js`**

```js
import { plan } from "./planner.js";
import { searchAll } from "./searcher.js";
import { readAll } from "./reader.js";
import { extractAll } from "./extractor.js";
import { synthesize } from "./synthesizer.js";
import { review } from "./critic.js";
import { formatOut } from "./formatter.js";
import { uidRunFolder } from "./utils/clock.js";

export async function runResearch({ query, lens }) {
  const run = uidRunFolder();
  const planOut = await plan({ query, lens, run });
  const hits = await searchAll({ plan: planOut, run });
  const sources = await readAll({ hits, run });
  const notes = await extractAll({ sources, run, lens });

  let draft = await synthesize({ query, lens, notes, run });
  const critique = await review({ draft, notes, lens, run });
  if (critique?.required_fixes?.length) {
    draft = await synthesize({ query, lens, notes, run, critique });
  }

  const out = await formatOut({ query, lens, draft, notes, run });
  return out;
}
```
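`p-limit` sits in the dependency list but isn't used in these minimal modules; here's a sketch of a shared throttle/backoff helper (a hypothetical `src/utils/limit.js`, not in the skeleton's utils list) you could wrap around the fetch and model calls below:

```js
// src/utils/limit.js (sketch) — bounded concurrency plus naive retry/backoff
import pLimit from "p-limit";

// at most 4 concurrent fetches / model calls at a time
export const limit = pLimit(4);

export async function withRetry(fn, { tries = 3, baseMs = 500 } = {}) {
  let lastErr;
  for (let i = 0; i < tries; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // exponential backoff between attempts
      await new Promise(r => setTimeout(r, baseMs * 2 ** i));
    }
  }
  throw lastErr;
}

// usage inside readAll(): await limit(() => withRetry(() => request(h.url)))
// and the same wrapper around client.responses.create(...) calls
```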
LENSES["intersection"]) .replace("{{QUERY}}", query); const resp = await client.responses.create({ model: process.env.OPENAI_MODEL, input: prompt, response_format: { type: "json_object" } }); const json = JSON.parse(resp.output_text); await fs.mkdir(`runs/${run}`, { recursive: true }); await fs.writeFile(`runs/${run}/plan.json`, JSON.stringify(json, null, 2)); return json; } ``` **`src/searcher.js`** (plug your own providers; return ranked hits) ```js import { normalizeUrl } from "./utils/normalize.js"; // stub: replace with your web/search APIs async function webSearch(q) { // return [{url, title, snippet, score}] return []; } export async function searchAll({ plan, run }) { const results = []; for (const sq of plan.sub_questions) { const q = `${sq} site:.edu OR site:.org OR site:.gov OR (peer-reviewed)`; const hits = await webSearch(q); hits.forEach(h => results.push({ ...h, url: normalizeUrl(h.url), for: sq })); } // dedupe by URL const seen = new Set(); const dedup = []; for (const h of results) { if (seen.has(h.url)) continue; seen.add(h.url); dedup.push(h); } return dedup.slice(0, 60); } ``` **`src/reader.js`** (fetch + sanitize + cache) ```js import { request } from "undici"; import fs from "fs/promises"; import crypto from "crypto"; function sha(s){ return crypto.createHash("sha256").update(s).digest("hex"); } export async function readAll({ hits, run }) { const out = []; await fs.mkdir(`runs/${run}/raw`, { recursive: true }); for (const h of hits) { try { const res = await request(h.url); const html = await res.body.text(); const id = sha(h.url); await fs.writeFile(`runs/${run}/raw/${id}.html`, html); out.push({ id, ...h, html }); } catch {} } return out; } ``` **`src/extractor.js`** ```js import OpenAI from "openai"; import fs from "fs/promises"; import { htmlToText } from "./utils/normalize.js"; const client = new OpenAI(); export async function extractAll({ sources, run, lens }) { const notes = []; const tmpl = await fs.readFile("prompts/extractor.txt", "utf8"); for (const s of sources) { const text = htmlToText(s.html).slice(0, 150_000); // guardrails const prompt = `${tmpl}\n\nSOURCE_META:\n${JSON.stringify({url:s.url,title:s.title},null,2)}\n\nSOURCE_TEXT:\n${text}`; const resp = await client.responses.create({ model: process.env.OPENAI_MODEL, input: prompt }); // each line is JSON (NDJSON style) const lines = resp.output_text.split("\n").filter(Boolean); for (const line of lines) { try { notes.push({ ...JSON.parse(line), source_id: s.id }); } catch {} } } await fs.writeFile(`runs/${run}/notes.ndjson`, notes.map(n=>JSON.stringify(n)).join("\n")); return notes; } ``` **`src/synthesizer.js`** ```js import OpenAI from "openai"; import fs from "fs/promises"; const client = new OpenAI(); export async function synthesize({ query, lens, notes, run, critique }) { const tmpl = await fs.readFile("prompts/synthesizer.txt", "utf8"); // chunk notes to fit model context const notesChunk = JSON.stringify(notes.slice(0, 600)); // TODO: paginate if huge const lensNote = lens; const prompt = tmpl .replace("{{DOMAIN_LENS}}", lensNote) .replace("{{NOTES_JSON_CHUNKED}}", notesChunk) + (critique ? 
**`src/extractor.js`**

```js
import OpenAI from "openai";
import fs from "fs/promises";
import { htmlToText } from "./utils/normalize.js";

const client = new OpenAI();

export async function extractAll({ sources, run, lens }) {
  const notes = [];
  const tmpl = await fs.readFile("prompts/extractor.txt", "utf8");

  for (const s of sources) {
    const text = htmlToText(s.html).slice(0, 150_000); // guardrails
    const prompt = `${tmpl}\n\nSOURCE_META:\n${JSON.stringify({ url: s.url, title: s.title }, null, 2)}\n\nSOURCE_TEXT:\n${text}`;
    const resp = await client.responses.create({
      model: process.env.OPENAI_MODEL,
      input: prompt
    });
    // each line is JSON (NDJSON style); carry url/title so cite.js can build references
    const lines = resp.output_text.split("\n").filter(Boolean);
    for (const line of lines) {
      try { notes.push({ ...JSON.parse(line), source_id: s.id, url: s.url, title: s.title }); } catch {}
    }
  }

  await fs.writeFile(`runs/${run}/notes.ndjson`, notes.map(n => JSON.stringify(n)).join("\n"));
  return notes;
}
```

**`src/synthesizer.js`**

```js
import OpenAI from "openai";
import fs from "fs/promises";

const client = new OpenAI();

export async function synthesize({ query, lens, notes, run, critique }) {
  const tmpl = await fs.readFile("prompts/synthesizer.txt", "utf8");
  // chunk notes to fit model context
  const notesChunk = JSON.stringify(notes.slice(0, 600)); // TODO: paginate if huge (see chunkNotes sketch below)
  const lensNote = lens;
  const prompt = tmpl
    .replace("{{DOMAIN_LENS}}", lensNote)
    .replace("{{NOTES_JSON_CHUNKED}}", notesChunk)
    + (critique ? `\n\nREVIEW_ISSUES:\n${JSON.stringify(critique, null, 2)}` : "");

  const resp = await client.responses.create({
    model: process.env.OPENAI_MODEL,
    input: prompt
  });
  const md = resp.output_text;
  await fs.writeFile(`runs/${run}/draft.md`, md);
  return md;
}
```

**`src/critic.js`**

```js
import OpenAI from "openai";
import fs from "fs/promises";

const client = new OpenAI();

export async function review({ draft, notes, lens, run }) {
  const tmpl = await fs.readFile("prompts/critic.txt", "utf8");
  const input = `${tmpl}\n\nDRAFT:\n${draft}\n\nNOTES:\n${JSON.stringify(notes.slice(0, 500))}`;
  const resp = await client.responses.create({
    model: process.env.OPENAI_MODEL,
    input,
    text: { format: { type: "json_object" } } // JSON output (Responses API equivalent of response_format)
  });
  const json = JSON.parse(resp.output_text);
  await fs.writeFile(`runs/${run}/critic.json`, JSON.stringify(json, null, 2));
  return json;
}
```

**`src/formatter.js`**

```js
import fs from "fs/promises";
import { extractCitations } from "./utils/cite.js";

export async function formatOut({ query, lens, draft, notes, run }) {
  const { refs, manifest } = extractCitations(draft, notes);
  const report = `${draft}\n\n---\n## References\n` +
    refs.map((r, i) => `[${i + 1}] ${r.title ?? r.url} — ${r.url}`).join("\n");

  const reportPath = `runs/${run}/report.md`;
  const jsonPath = `runs/${run}/report.json`;
  const json = { query, lens, run_id: run, citations: manifest };
  await fs.writeFile(reportPath, report);
  await fs.writeFile(jsonPath, JSON.stringify(json, null, 2));
  return { reportPath, jsonPath };
}
```

**`src/utils/cite.js`** (very simple placeholder)

```js
export function extractCitations(md, notes) {
  // find [#] patterns and map to sources present in notes.source_id
  // for now, collapse to unique URLs in order of first appearance
  const urls = new Map();
  for (const n of notes) {
    if (!urls.has(n.source_id)) urls.set(n.source_id, n.url || n.source_url || "");
  }
  const refs = [...urls.values()].map(u => ({ url: u, title: "" }));
  const manifest = refs.map((r, idx) => ({
    id: `src_${String(idx + 1).padStart(2, "0")}`,
    url: r.url,
    title: r.title
  }));
  return { refs, manifest };
}
```

**`src/utils/normalize.js`**

```js
import { parse } from "node-html-parser";

export function normalizeUrl(u) {
  try { return new URL(u).toString(); } catch { return u; }
}

export function htmlToText(html) {
  const root = parse(html);
  // remove script/style/nav
  root.querySelectorAll("script,style,nav,footer,header").forEach(n => n.remove());
  return root.text.replace(/\s+/g, " ").trim();
}
```

**`src/utils/clock.js`**

```js
export function uidRunFolder() {
  const d = new Date().toISOString().replace(/[:.]/g, "-");
  return d;
}
```
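`utils/chunk.js` is listed in the skeleton but not shown, and `synthesize()` carries a TODO about paginating notes; here's a minimal sketch that chunks by character budget (a rough stand-in for token counting; swap in a real tokenizer if you need precision):

```js
// src/utils/chunk.js (sketch) — split notes into roughly size-bounded groups
export function chunkNotes(notes, maxChars = 120_000) {
  const chunks = [];
  let current = [];
  let size = 2; // account for the surrounding "[]" when stringified

  for (const note of notes) {
    const len = JSON.stringify(note).length + 1;
    if (size + len > maxChars && current.length) {
      chunks.push(current);
      current = [];
      size = 2;
    }
    current.push(note);
    size += len;
  }
  if (current.length) chunks.push(current);
  return chunks;
}
```

One way to use it: synthesize a draft per chunk and run a final merge pass, or draft from the first chunk and fold later chunks in as revisions.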
---

## Domain-aware search scaffolds (queries to seed Searcher)

* **AI research & strategy**
  * “site:.org OR site:.edu Transformer ‘evaluation’ 2024..2025 filetype:pdf”
  * “capability eval ‘hallucination penalty’ 2024..2025”
  * “enterprise AI governance rubric case study 2024..2025”
* **Higher-ed & humanities**
  * “site:.edu syllabus ‘AI policy’ humanities 2024..2025”
  * “writing pedagogy ‘LLM’ assignment redesign”
  * “academic integrity generative AI statement 2024..2025”
* **Media/arts/multimodal**
  * “film studies ‘AI video essay’ workflow”
  * “theatre performance AI ‘interactive’ case study”
  * “game dev education ‘procedural generation’ syllabus”
* **Intersections**
  * “studio classroom ‘AI’ multimodal assignment rubric”
  * “critical making AI humanities lab case study”

*(Wire these patterns into `searcher.js` as seed templates; rotate & mix.)*

---

## Quality controls & latency knobs

* **Penalize confident errors** in prompts (“award partial credit for uncertainty; penalize confident wrong answers”).
* **Contradiction passes** in the Critic: if claims A and B conflict, require a resolution or mark the pair as an “Open question”.
* **Freshness policy**: for time-sensitive topics (models, policies), down-rank sources older than 18 months unless the topic is theory/history.
* **Diversity constraint**: at least 1 primary source, 1 critical perspective, and 1 practitioner example per sub-question.
* **Latency slider**: `max_sources_per_sq` and `extract_depth` to trade speed vs. rigor.

---

## Running it

```bash
node src/index.js --q "How should humanities courses redesign research assignments to productively include AI while safeguarding academic integrity?" --lens higher-ed
```

Produces:

```
runs/2025-09-06T13-04-18Z/report.md
runs/2025-09-06T13-04-18Z/report.json
runs/…/raw/*.html
runs/…/notes.ndjson
```

---

## Extensions you’ll probably want (given your stack)

* **Ontology tags** (Palantir-style): add `facet_tags[]` to claims (e.g., governance | assessment | workflow | equity).
* **GraphRAG**: ingest `notes.ndjson` into a small graph (entities ←→ claims ←→ sources) and query across sub-graphs (AI × Higher-Ed × Arts).
* **Airtable/Supabase**: push `claims`, `citations`, and `notes` to tables; link to your Prism app for browsing & cross-run diffing.
* **Source adapters**: JSTOR/CORE/arXiv/YouTube transcripts, library catalogs, institutional policy hubs; plus a **syllabus scraper**.
* **Evaluator harness**: a small test set of queries with expected “key points” and a rubric (coverage, citation sufficiency, recency, balance).

---

## Minimal evaluator rubric (drop in a JSON file)

```json
{
  "criteria": [
    { "name": "Coverage", "weight": 0.3, "desc": "Addresses all sub-questions." },
    { "name": "Citation Sufficiency", "weight": 0.25, "desc": "Every nontrivial claim cited; primary sources present." },
    { "name": "Recency & Relevance", "weight": 0.2 },
    { "name": "Balance/Perspectives", "weight": 0.15 },
    { "name": "Clarity/Actionability", "weight": 0.1 }
  ]
}
```

---

## What you’ll customize first

1. **`searcher.js`** – plug in your preferred search/crawl stack and add source whitelists (e.g., .edu/.org, society journals).
2. **Prompts** – tighten each lens with your institutional priorities (AI & Civil Discourse, integrity, accessibility).
3. **Formatter** – match your house Markdown and add Harvard-style references if desired.
4. **Cache & rate limits** – add retries/backoff; persist URL fetches.

---

If you want, I can turn this into a ready-to-run repo (with functioning web search adapters you can later swap) or tailor the prompts to a first real query in your pipeline (e.g., “design an AI-inclusive research essay assignment for Expos that resists shortcutting”).