# mk-hackmd-utilities-project

## Framing the problem (2 key choices)

1. **Source of truth**

   * **GitHub-first** (archival + version history; great for Next.js SSG and PR workflows)
   * **Airtable-first** (curation/metadata, editorial workflows; Next.js builds pull from Airtable)
   * **Dual** (GitHub for content, Airtable for metadata, with a manifest tying them together)

2. **Sync style**

   * **Batch archive** (annual or quarterly snapshots)
   * **Incremental** (nightly or on-demand via Slack)
   * **On-demand Book-to-App** (generate a Next.js app from one HackMD “Book” or index at a time)

You can run all of these under a single monorepo with `/tools` (Node scripts), `/apps/next-site`, and `/packages` (remark plugins), with **pnpm** workspaces.

---

## Option A: GitHub-first with scheduled + on-demand sync

**When**: You want immutable yearly archives, PR-based review, and fast static builds.

**Flow**

```
HackMD API → (Node CLI) normalize + frontmatter → /content/** in GitHub
          │
          ├─ assets pulled to /public/assets/<hackmdId>/
          └─ slugMap.json (id↔slug) for link rewriting
                ↓
GitHub Actions (on push / nightly) → build Next.js (Contentlayer or next-mdx-remote) → Vercel
                ↓
Slack bot can trigger “sync now” which opens a PR (preview deploy)
```

**Pros**

* Best for versioning, code review, Vercel preview links per PR.
* Clear year-over-year archives (`/content/2024/**`, `/content/2025/**`).

**Cons**

* Editors need GitHub PRs (unless Slack/automation handles merging).
* Airtable becomes optional/add-on, not canonical.

**Notes**

* Keep **raw** HackMD markdown in `/raw/` (for provenance) and **normalized MDX** in `/content/`.
* Use **Contentlayer** to type metadata (e.g., readingTime, tags, team, year).

---

## Option B: Airtable-first (headless CMS; GitHub optional)

**When**: You want curation + rich metadata (owners, status, programs, courses), and less dependence on Git.

**Flow**

```
HackMD API → normalize → push record per note to Airtable.
  Fields include content_md, frontmatter_json, hackmdId, title, tags, updatedAt, checksum, …
          ↓
Next.js getStaticProps (ISR) or Build Step → query Airtable → render pages
          ↓
Optional “Mirror to GitHub” job writes MDX files + assets for long-term archive
```

**Pros**

* Editors can tag/filter/curate in Airtable; permissions are familiar.
* Next.js builds are simple data pulls; no PR bottleneck for minor copy edits.

**Cons**

* You’ll want a secondary archive (GitHub or blob store) for long-term provenance.
* Larger markdown bodies stored in Airtable long text can feel cramped (but workable).

**Notes**

* Use Airtable tables: **Notes**, **Books**, **Teams/Years**, **SyncLog**.
* A formula field computes the Next.js route from `[year]/[team]/[slug]`.

---

## Option C: Hybrid (GitHub = content, Airtable = metadata; “content lake” optional)

**When**: You want the best parts of A + B without duplicating effort.

**Flow**

```
HackMD API → ETL
  ├─ GitHub: /content/** MDX + assets
  └─ Airtable: metadata (id, slug, title, team, year, book, tags, visibility, owner, updatedAt, checksum)
          ↓
Next.js build:
  - Read MDX from repo
  - Join with Airtable metadata by hackmdId (or slug) for UI facets, filters, cards
```

**Pros**

* Git versioning & PR previews + Airtable for curation & faceting.
* Clean separation: heavy content in Git, light metadata in Airtable.

**Cons**

* Two destinations to keep in sync (solved by single ETL writing both).

**Optional**: Add a “**content lake**” (S3/R2/Vercel Blob) to store raw zips, original PDFs, big assets. Git holds normalized MDX only.

---

## Option D: Annual “Snapshot” Archiver (immutable)

**When**: End-of-year freeze of a HackMD Team, creating a tamper-proof snapshot.

**Flow**

```
Manual or API zip export as proof → store in content lake w/ manifest.json
          ↓
Normalizer runs on the zip → writes /content/<year>/<team>/** to GitHub
          ↓
Create Git tag: archive-<team>-<year>, and a release that links the zip + manifest
```

**Pros**

* Well suited to audits, provenance, FOI requests, or institutional memory.
* Extremely simple runtime story: everything immutable.

**Cons**

* Not for daily updates; offers editors nothing beyond the archive itself.

---

## Option E: “Book → Next App” generator (one-click site)

**When**: A faculty member’s Book should become a standalone site/app quickly.

**CLI**

```
pnpm dlx learninglab/hackmd-tools book-to-next \
  --book https://hackmd.io/@team/book-id \
  --template nextra|your-next-template \
  --dest ./apps/<short-name> \
  --org learning-lab
```

**What it does**

* Crawls the Book, preserves order/structure, downloads notes + assets.
* Writes `_meta.json` (Nextra) or section index files; builds sidebar/menu from Book.
* Adds frontmatter + link rewriting (id→slug), generates `slugMap.json`.
* Optional: pushes to GitHub with a ready Vercel deploy hook.

**Pros**

* Perfect for fast showcases, conferences, or course microsites.
* Repeatable and sharable as a pattern.

---

## Option F: Slack-first control plane (ops from chat)

**When**: You want ops via `/hackmd` slash commands + interactive modals.

**Commands**

* `/hackmd sync [team] [--year 2025] [--since 2025-08-01]`
* `/hackmd book-to-next [book-url] [--template nextra]`
* `/hackmd archive [team] [--year 2024]`
* `/hackmd convert [note-url|id]`
* `/hackmd to-airtable [note-url|id]`
* `/hackmd status` (last run, deltas, failures, rate limits)

**Behavior**

* Commands post an ephemeral summary; long-running logs stream in a thread.
* For collisions/approvals, open a **Slack modal** (pick slug/collection, set visibility).
* On success, post links to: PR preview, Airtable record, and any warnings.

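
The command grammar listed under Option F could be parsed with a small helper before dispatching to the CLIs. The sketch below is only an illustration: `parseHackmdCommand`, the `HackmdCommand` shape, and the rule that the first bare word is the target are all hypothetical choices, not part of any Slack SDK. It takes the text Slack sends after `/hackmd` and splits it into an action, a target, and flags.

```typescript
// Hypothetical shape for a parsed `/hackmd` command (not part of any Slack SDK).
export interface HackmdCommand {
  action: string;                          // e.g. "sync", "archive", "status"
  target?: string;                         // team name, note ID, or URL
  flags: Record<string, string | boolean>; // e.g. { year: "2025", "dry-run": true }
}

const KNOWN_ACTIONS = ['sync', 'book-to-next', 'archive', 'convert', 'to-airtable', 'status'];

export function parseHackmdCommand(text: string): HackmdCommand {
  const tokens = text.trim().split(/\s+/).filter((t) => t.length > 0);
  const action = tokens[0] ?? '';
  if (KNOWN_ACTIONS.indexOf(action) === -1) {
    throw new Error(`Unknown /hackmd action: "${action}"`);
  }
  const cmd: HackmdCommand = { action, flags: {} };
  for (let i = 1; i < tokens.length; i++) {
    const token = tokens[i] ?? '';
    if (token.slice(0, 2) === '--') {
      const name = token.slice(2);
      const next = tokens[i + 1];
      if (next !== undefined && next.slice(0, 2) !== '--') {
        cmd.flags[name] = next; // flag followed by a value
        i++;                    // skip the value we just consumed
      } else {
        cmd.flags[name] = true; // bare flag, treated as boolean
      }
    } else if (cmd.target === undefined) {
      cmd.target = token;       // first bare word is the target
    }
  }
  return cmd;
}
```

One limitation of this simple grammar: a boolean flag followed by a bare word is read as `flag=value`, so boolean flags such as `--dry-run` should come last.
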
---

## Option G: Incremental sync with change detection (low cost, always fresh)

**When**: You don’t want to re-pull everything every time.

**Mechanics**

* Maintain a local **state.json** (or Airtable SyncLog) keyed by `hackmdId` → `checksum` (md5 of content) + `updatedAt`.
* Each run:
  * `listNotes()` → fetch only notes whose `updatedAt` is later than the last sync, then compare checksums to skip notes whose content did not actually change.
  * Download attachments only for changed notes (etag compare).
* Write only diffs to Git/Airtable; avoid noisy commits.

---

## Core building blocks (reusable utilities)

### 1) Frontmatter schema (YAML)

```yaml
---
title: "Guide to X"
slug: "guide-to-x"
hackmdId: "abc123def456"
team: "learning-lab"
year: 2025
bookId: "book-xyz"        # if applicable
bookOrder: 12             # for sorted menus (from Book)
authors: ["S. Person", "T. Person"]
tags: ["ai", "teaching"]
visibility: "public"      # or "internal", "private"
updatedAt: "2025-08-20T14:22:00Z"
source:
  platform: "hackmd"
  url: "https://hackmd.io/abc123def456"
  checksum: "md5:..."
readingTime: 6
summary: "One-liner used for cards."
---
```

### 2) Directory layout (GitHub)

```
/raw/<year>/<team>/<hackmdId>.md        # raw pulls (optional)
/content/<year>/<team>/<slug>.mdx       # normalized, link-rewritten
/public/assets/<hackmdId>/*             # images/assets
/content/_maps/slugMap.json             # { "abc123…": "/2025/team/guide-to-x", ... }
```

### 3) Link rewriting strategy

* Build `slugMap` from frontmatter (`hackmdId → route`).
* A **remark** plugin:
  * Rewrites `https://hackmd.io/<id>` → internal `/[year]/[team]/[slug]` if known.
  * Keeps the external HackMD link if no match is found (or marks it as TODO).
* Optional fallback: a Next.js dynamic route `/h/[hackmdId]` that looks up the slug at runtime (to catch any misses).

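
The first and last bullets of the link rewriting strategy (building `slugMap` and resolving an ID for the `/h/[hackmdId]` fallback) might look roughly like this. It is a sketch under assumptions: `NoteMeta`, `buildSlugMap`, and `resolveHackmdId` are hypothetical names, and the route shape simply follows the `/[year]/[team]/[slug]` convention used in this plan.

```typescript
// Hypothetical minimal frontmatter shape; field names follow the frontmatter schema in this plan.
export interface NoteMeta {
  hackmdId: string;
  slug: string;
  team: string;
  year: number;
}

// Build the hackmdId -> route map from parsed frontmatter.
export function buildSlugMap(notes: NoteMeta[]): Record<string, string> {
  const map: Record<string, string> = {};
  for (const n of notes) {
    map[n.hackmdId] = `/${n.year}/${n.team}/${n.slug}`;
  }
  return map;
}

// Resolve an ID for the fallback route: the internal route if known,
// otherwise the original HackMD URL so the link still works.
export function resolveHackmdId(
  id: string,
  slugMap: Record<string, string>,
): { url: string; internal: boolean } {
  const known = Object.prototype.hasOwnProperty.call(slugMap, id);
  const route = known ? slugMap[id] : undefined;
  if (route !== undefined) return { url: route, internal: true };
  return { url: `https://hackmd.io/${id}`, internal: false };
}
```

A `/h/[hackmdId]` page could call `resolveHackmdId` and redirect to `url`; whether unknown IDs should redirect out to HackMD or return a 404 is a policy choice left open here.
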
**Minimal remark plugin (Node/TS, sketch)**

```ts
import { visit } from 'unist-util-visit';

// Rewrites links to known HackMD notes so they point at local routes.
export function remarkRewriteHackMDLinks(slugMap: Record<string, string>) {
  return (tree: any) => {
    visit(tree, 'link', (node: any) => {
      const url: string = node.url || '';
      // Capture the first path segment after the hackmd.io host as the note ID.
      const m = url.match(/https?:\/\/hackmd\.io\/([A-Za-z0-9_-]+)/);
      if (m) {
        const id = m[1];
        const internal = slugMap[id];
        if (internal) node.url = internal; // rewrite to local route
      }
    });
  };
}
```

### 4) Normalizer pipeline

* **gray-matter** → read/write YAML frontmatter
* **remark** plugins:
  * `remark-gfm` (tables/task lists),
  * `remark-math` + `rehype-katex` (if needed),
  * your **link rewriting** plugin,
  * image URL transformer → `/assets/<hackmdId>/<filename>`
* **title/slug**:
  * prefer HackMD metadata title → kebab-case slug,
  * collision resolver: append short hash or `-2`, record in Airtable.

### 5) Scripts (CLI)

* `hackmd-archive` (annual snapshot)
* `hackmd-sync` (incremental for team/year/since)
* `hackmd-book-export` (Book → structured content + \_meta)
* `hackmd-to-airtable` (push/update metadata rows)
* `airtable-to-github` (mirror curated selections back into repo)
* `link-audit` (find unresolved links)
* `asset-sync` (download/update images; prune orphans)
* `dry-run` (no writes; prints a plan)

All CLIs accept `--dry-run`, `--since`, `--team`, `--year`, `--book`, `--only <noteId>`, `--dest github|airtable|both`.

---

## Airtable schema (if you make it canonical or hybrid)

**Tables**

1. **Notes**
   * `hackmdId` (primary key)
   * `title`, `slug`, `team`, `year`
   * `bookId` (link to Books), `bookOrder` (number)
   * `tags` (multi-select), `visibility` (single-select)
   * `updatedAt` (datetime), `checksum` (text)
   * `content_md` (long text), optional if you store content in Git only
   * `route` (formula: `"/" & year & "/" & team & "/" & slug`)
   * `assets` (attachments), optional
   * `status` (Draft/Published/Archived)
2. **Books**
   * `bookId`, `title`, `team`, `year`
   * `order` (array or linked child “BookItems” with order)
3. **SyncLog**
   * `runId`, `startedAt`, `finishedAt`, `actor` (Slack user)
   * `created`, `updated`, `unchanged`, `errors` (JSON or rollups)

**Why**: This lets you filter by program/course, assign owners, toggle visibility, and drive Next.js cards/menus.

---

## Next.js ingestion patterns

**Static (fast)**

* Use **Contentlayer** to scan `/content/**`, build typed docs at build time.
* If Airtable is used for metadata, pull it at build, join by `hackmdId`.

**ISR (fresh)**

* Build once, then revalidate pages on a timer (e.g., every 60–300s) or on demand, triggered by:
  * Slack command (`/hackmd revalidate /2025/team/foo`)
  * Webhook (Airtable automation → Vercel deploy hook or custom API route)

**MDX**

* `@next/mdx` or `next-mdx-remote` + custom components (callouts, citations).
* Keep a consistent **remark** stack across the CLI and the Next build for parity.

---

## “Book” handling details

* Pull Book structure and write a `_meta.json` (for Nextra) or a `collection.json` your app reads for sidebar order.
* Preserve nesting if the Book uses sections; otherwise make sections by heading level or Airtable grouping.
* Store `bookOrder` on each note for stable menus.

---

## Observability & safety

* **Logs**: Each run creates a `SyncLog` row, and posts a Slack thread with a summary diff.
* **Dry-run** first in Slack (`/hackmd sync --dry-run` gives a preview).
* **Secrets**: Keep the HackMD, Airtable, and GitHub tokens in environment variables managed by Doppler or 1Password.
* **Rate limits**: Backoff and checkpoint; small page size + pagination; resume with `--since`.
* **PII / Privacy**: Respect visibility flags; skip or mask “private” notes by default unless `--include-private`.

---

## Minimal, get-started plan (practical and quick)

1. **Start Hybrid (Option C)**
   * Content in GitHub (`/content/…`), Metadata in Airtable (`Notes`, `Books`, `SyncLog`).
2. **Build utilities now**
   * `hackmd-sync` (incremental) + `hackmd-book-export` (Book → site).
   * Include `remarkRewriteHackMDLinks` and `asset-sync`.
3. **Wire Slack (Option F)**
   * `/hackmd sync [team] [--since 2025-07-01]` → creates a PR with changes and posts preview.
   * `/hackmd book-to-next [url]` → generates a new app folder + PR.
4. **Annual archive (Option D)**
   * End of academic year: snapshot zip + manifest, tag release.

This gives you:

* Day-to-day freshness (incremental sync),
* Editorial curation (Airtable),
* Immutable, auditable archives (Git tag + zip),
* One-click Book→Site.

---

## Example: once-a-year archive script (outline)

```ts
// pnpm dlx tsx tools/hackmd-archive.ts --team learning-lab --year 2025 --zip ./exports/2025.zip
import fs from 'node:fs/promises';
import path from 'node:path';
import { parseArgs } from 'node:util';
import { getNotesForTeam, downloadZip, getNote, getAssets } from './lib/hackmd'; // your thin wrappers
import { normalizeMd, toFrontmatter, writeFileSafe } from './lib/normalize';
import { buildSlugMap } from './lib/slugmap';

interface ArchiveArgs { team: string; year: number; zip: string; }

// Read --team, --year, and --zip from the command line.
function parseCliArgs(): ArchiveArgs {
  const { values } = parseArgs({
    options: { team: { type: 'string' }, year: { type: 'string' }, zip: { type: 'string' } },
  });
  if (!values.team || !values.year || !values.zip) {
    throw new Error('Usage: hackmd-archive --team <team> --year <year> --zip <path>');
  }
  return { team: values.team, year: Number(values.year), zip: values.zip };
}

async function run({ team, year, zip }: ArchiveArgs) {
  // 1) Store raw proof
  await downloadZip(team, zip);

  // 2) List all notes via API (for metadata & IDs)
  const notes = await getNotesForTeam(team);

  // 3) For each note, fetch content & assets
  const outDir = path.join(process.cwd(), 'content', String(year), team);
  const assetsDir = path.join(process.cwd(), 'public', 'assets');
  await fs.mkdir(outDir, { recursive: true });
  await fs.mkdir(assetsDir, { recursive: true });

  const map: Record<string, string> = {};
  for (const n of notes) {
    const raw = await getNote(n.id);
    const { mdx, slug, frontmatter } = await normalizeMd(raw, n); // adds YAML, cleans links/images
    map[n.id] = `/${year}/${team}/${slug}`;
    await writeFileSafe(path.join(outDir, `${slug}.mdx`), mdx);
    await getAssets(n.id, path.join(assetsDir, n.id));
  }

  // 4) Write slug map and manifest
  await writeFileSafe(path.join(process.cwd(), 'content/_maps/slugMap.json'), JSON.stringify(map, null, 2));
  await writeFileSafe(
    path.join(process.cwd(), `manifest-${team}-${year}.json`),
    JSON.stringify({ team, year, count: notes.length }, null, 2),
  );
}

// Exit non-zero on failure so CI and Slack runners can detect it.
run(parseCliArgs()).catch((err) => {
  console.error(err);
  process.exit(1);
});
```

---

## Example: “Book → Next app” generator (what it produces)

```
apps/my-book/
  next.config.mjs
  package.json
  content/
    _meta.json            # sidebar/menu order
    intro.mdx
    chapter-1/
      part-a.mdx
      part-b.mdx
  public/assets/<hackmdId>/*
  content/_maps/slugMap.json
  README.md (how to deploy to Vercel)
```

---

## Slack UX suggestions

* After `/hackmd sync`, reply with:
  * “✅ 12 updated, 88 unchanged, 1 failed (link)”
  * Buttons:
    * “Open PR Preview”
    * “View SyncLog”
    * “Revalidate Site” (calls a secure Next API route)
* For collisions, open a modal: pick slug, set visibility, choose collection (Book/Team).

---

## What I’d recommend you pilot

* **Week 1:** Implement **Option C** (Hybrid) with:
  * `hackmd-sync` (incremental, `--since`), writing MDX to Git + metadata to Airtable.
  * `remarkRewriteHackMDLinks` + `asset-sync`.
  * Contentlayer in your Next app reading `/content/**`.
* **Week 2:** Add Slack commands for on-demand runs and PR previews.
* **Week 3:** Add the **Book→Next** generator for faculty showcases.
* **End of Term:** Run **annual snapshot** tool and tag a GitHub release.

This keeps the day-to-day experience smooth (Slack + Airtable), while preserving a rock-solid Git archive for long-term Next.js content.

If you’d like, I can rough in the Node CLI skeletons (with pnpm + monorepo workspaces) and the Contentlayer config next.

---

Great question. Given the 100k-character limit on Airtable long text fields, **you don’t have to throw out most of the plan**; you just need to make Airtable “metadata-only (plus optional conveniences)” and keep **GitHub as the canonical store for full Markdown**. Below is a quick viability matrix and then a concrete “how to adapt” for each option.

---

## What still works (and how)

| Option | Original intent | Still viable? | What changes with the 100k limit |
| --- | --- | ---: | --- |
| **A. GitHub-first** | Content & history in Git; Next.js reads from repo | **Yes (no change)** | Keep all `.md/.mdx` in Git. Airtable holds only metadata (and optional short excerpts). |
| **B. Airtable-first** | Content in Airtable; Next.js reads from Airtable | **Not viable as stated** | Make it **Airtable-first metadata** only. Content lives in Git (or blob). Optionally store small docs or an attachment pointer in Airtable. |
| **C. Hybrid** (Git=content, Airtable=metadata) | Best of both | **Ideal** | Exactly what you described: full text in Git; Airtable fields for frontmatter/curation; optional “convenience” field for small docs. |
| **D. Annual Snapshot** | Immutable year-end archives | **Yes** | Store raw zip + normalized content in Git (and/or blob). Push only metadata to Airtable. |
| **E. Book → Next app** | One-click site from a HackMD Book | **Yes** | Generator writes MDX and assets to Git. If you track Books in Airtable, store structure/order there but not the full text. |
| **F. Slack control plane** | `/hackmd …` commands for ops | **Yes** | Commands route content to Git, metadata to Airtable. Post links to PR previews + Airtable rows. |
| **G. Incremental sync** | Change-aware, low-cost syncs | **Yes** | Add a size/char-count gate: large notes skip Airtable `content_md` and only update Git + Airtable metadata. |

---

## Updated working model (recommended)

**Canonical rule:**

* **GitHub = source of truth for markdown and assets.**
* **Airtable = source of truth for metadata/curation.**
* Next.js reads **MDX from Git** and **joins metadata from Airtable** (via `hackmdId` or `slug`).

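
The join in the last bullet of the canonical rule could be sketched as below. All names here (`ContentDoc`, `AirtableMeta`, `joinMetadata`) are hypothetical, and the record shapes are deliberately reduced to a few fields; a real build would load the docs from `/content/**` and the rows from the Airtable API before calling something like this.

```typescript
// Hypothetical shapes: `ContentDoc` is what the MDX loader yields,
// `AirtableMeta` is one row of the Notes table reduced to a few fields.
export interface ContentDoc { hackmdId: string; slug: string; body: string; }
export interface AirtableMeta { hackmdId?: string; slug?: string; tags: string[]; visibility: string; }
export interface JoinedDoc extends ContentDoc { meta: AirtableMeta | null; }

export function joinMetadata(docs: ContentDoc[], rows: AirtableMeta[]): JoinedDoc[] {
  // Prototype-free objects so keys like "constructor" cannot collide with built-ins.
  const byId: Record<string, AirtableMeta> = Object.create(null);
  const bySlug: Record<string, AirtableMeta> = Object.create(null);
  for (const row of rows) {
    if (row.hackmdId) byId[row.hackmdId] = row;
    if (row.slug) bySlug[row.slug] = row;
  }
  return docs.map((doc) => {
    // Prefer the stable hackmdId, fall back to slug, and use null when Airtable has no row.
    const meta = byId[doc.hackmdId] ?? bySlug[doc.slug] ?? null;
    return { ...doc, meta };
  });
}
```

Returning `meta: null` instead of dropping unmatched docs keeps Git authoritative: a page still renders from MDX even if its Airtable row is missing, and the UI can decide how to treat it.
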
**Airtable fields** (Notes table):

* `hackmdId` (key), `title`, `slug`, `team`, `year`, `tags`, `visibility`, `bookId`, `bookOrder`, `owner`
* `route` (formula), `updatedAt`, `content_length`, `content_checksum`
* `content_excerpt` (long text, e.g., first 2–4k chars)
* `content_md_small` (long text, **optional convenience**; only populated if below threshold)
* `content_attachment` (optional; attached `.md` file for people who prefer to skim in Airtable)
* `github_path` (e.g., `/content/2025/team/slug.mdx`)
* `site_url` (computed or automation)

> **Policy:** Treat `content_md_small` as read-only/ephemeral. Next.js never reads content from Airtable.

**Sync tool gating logic**

* If `charCount <= SMALL_THRESHOLD` (e.g., **60,000** for safety):
  * Write full text to Git **and** mirror to `content_md_small` for convenience.
* If `charCount > SMALL_THRESHOLD`:
  * Write full text to Git only.
  * In Airtable, write metadata + `content_excerpt`, update `github_path`, optionally attach the `.md` file.
* Always compute `content_checksum` and `content_length` for drift detection.

**Why a lower threshold than 100k?** Headroom avoids edge cases (formatting overhead, future expansions, accidental edits). You can set it wherever you’re comfortable; **50–80k** works well.

---

## Option-by-option tweaks

### A) GitHub-first

* **No changes needed.**
* You can still mirror short notes into Airtable’s `content_md_small` for quick search/snippets.

### B) Airtable-first → **Airtable-first metadata**

* Next.js stops reading `content_md` from Airtable.
* Use `github_path` (and optional attachment) to give Airtable users a way to open the canonical file.
* Keep editorial flow (tags, owners, visibility, “featured”, Book order) in Airtable.

### C) Hybrid (recommended default)

* Keep your frontmatter **in YAML in Git** and **duplicated as structured fields in Airtable**.
* Treat Git YAML as the “ground truth” for programmatic fields; the sync script can **round-trip** changes (either direction) if/when you want that.

### D) Annual Snapshot

* Store the **raw HackMD export zip** in a blob store (or Git release assets) + write normalized MDX to Git.
* Add an Airtable `Snapshot` record (team/year/count/sha/tag) for auditability.

### E) Book → Next app

* Generator writes **content only to Git**.
* If you use Airtable to curate menus, write just `bookId`, `bookOrder`, titles, and routes there.

### F) Slack control plane

* Add output notices like:
  * “Note X is 124,388 chars → skipped Airtable `content_md_small`, stored in Git. View: \[PR] \[Airtable row].”

### G) Incremental sync

* Include `content_length` and `content_checksum` in your state file and Airtable to avoid refetching and to detect size-driven routing changes.

---

## Next.js ingestion (unchanged, but clearer)

* **Contentlayer** (or `next-mdx-remote`) reads `/content/**`.
* A build-time step (or ISR API route) fetches Airtable metadata and joins by `hackmdId`/`slug`.
* Keep `slugMap.json` for link rewriting; also provide a fallback route `/h/[hackmdId]` that resolves to the current slug.

---

## Guardrails & UX

* **Read-only cues in Airtable**: Add a label/emoji to the `content_md_small` field name (e.g., “Content (preview; read-only)”) and place it in a **read-only interface** for most users.
* **Slack diffs**: Post char count and whether a note was mirrored to Airtable or Git-only.
* **Automations**: If someone edits `content_md_small`, an automation can revert (by re-mirroring from Git) or at least warn in Slack.

---

## Bottom line

* **Keep Options A, C, D, E, F, G**: they work as-is with minor tweaks.
* **Do not rely on Airtable as canonical content storage** (Option B as originally framed).
* Use Airtable for **metadata and curation**, and optionally mirror **small docs/excerpts** for convenience. GitHub remains the **source of truth** for all Markdown and assets.

If you want, I can sketch the tiny TypeScript helper that decides “Git-only vs. Git + Airtable mirror” and the Airtable field updates you can drop into your `hackmd-sync` script.
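
The gating logic described under “Sync tool gating logic” could be sketched as follows. This is a minimal illustration under stated assumptions: `planAirtableMirror`, `MirrorPlan`, and `EXCERPT_CHARS` are hypothetical names, the checksum is assumed to be computed by the caller, and the returned field names simply mirror the Airtable Notes fields listed in this plan. It does not call the Airtable API; it only decides what would be written.

```typescript
// Hypothetical helper: names and defaults are illustrative, not from an existing library.
export const SMALL_THRESHOLD = 60000; // chars; headroom below the 100k limit
export const EXCERPT_CHARS = 2000;    // assumption: low end of the 2–4k excerpt range

export interface MirrorPlan {
  mirrorFullText: boolean; // true → Git + Airtable mirror, false → Git only
  airtableFields: {
    content_length: number;
    content_checksum: string;
    content_excerpt: string;
    content_md_small: string; // '' when the note is Git-only
    github_path: string;
  };
}

export function planAirtableMirror(
  markdown: string,
  checksum: string, // computed by the caller, e.g. "md5:..."
  githubPath: string,
  threshold: number = SMALL_THRESHOLD,
): MirrorPlan {
  // String length counts UTF-16 code units, which may differ slightly
  // from how Airtable counts characters; the threshold headroom absorbs that.
  const length = markdown.length;
  const mirrorFullText = length <= threshold;
  return {
    mirrorFullText,
    airtableFields: {
      content_length: length,
      content_checksum: checksum,
      content_excerpt: markdown.slice(0, EXCERPT_CHARS),
      // Clearing the field for large notes is a design choice here,
      // so a stale mirror never lingers after a note grows past the threshold.
      content_md_small: mirrorFullText ? markdown : '',
      github_path: githubPath,
    },
  };
}
```

In `hackmd-sync`, the Git write would happen unconditionally, and `airtableFields` would be passed to whatever Airtable update call the script already uses; `mirrorFullText` can also drive the Slack notice about skipped mirrors.
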