# mk-hackmd-utilities-project
## Framing the problem (2 key choices)
1. **Source of truth**
   * **GitHub‑first** (archival + version history; great for Next.js SSG and PR workflows)
   * **Airtable‑first** (curation/metadata, editorial workflows; Next.js builds pull from Airtable)
   * **Dual** (GitHub for content, Airtable for metadata, with a manifest tying them together)
2. **Sync style**
   * **Batch archive** (annual or quarterly snapshots)
   * **Incremental** (nightly or on‑demand via Slack)
   * **On‑demand Book‑to‑App** (generate a Next.js app from one HackMD “Book” or index at a time)
You can run all of these under a single monorepo with `/tools` (Node scripts), `/apps/next-site`, and `/packages` (remark plugins), with **pnpm** workspaces.
---
## Option A — GitHub‑first with scheduled + on‑demand sync
**When**: You want immutable yearly archives, PR‑based review, and fast static builds.
**Flow**
```
HackMD API → (Node CLI) normalize + frontmatter → /content/** in GitHub
│
├─ assets pulled to /public/assets/<hackmdId>/
└─ slugMap.json (id↔slug) for link rewriting
↓
GitHub Actions (on push / nightly) → build Next.js (Contentlayer or next-mdx-remote) → Vercel
↓
Slack bot can trigger “sync now” which opens a PR (preview deploy)
```
**Pros**
* Best for versioning, code review, Vercel preview links per PR.
* Clear year‑over‑year archives (`/content/2024/**`, `/content/2025/**`).
**Cons**
* Editors need GitHub PRs (unless Slack/automation handles merging).
* Airtable becomes optional/add‑on, not canonical.
**Notes**
* Keep **raw** HackMD markdown in `/raw/` (for provenance) and **normalized MDX** in `/content/`.
* Use **Contentlayer** to type metadata (e.g., readingTime, tags, team, year).
---
## Option B — Airtable‑first (headless CMS; GitHub optional)
**When**: You want curation + rich metadata (owners, status, programs, courses), and less dependence on Git.
**Flow**
```
HackMD API → normalize → push record per note to Airtable. Fields include content_md, frontmatter_json, hackmdId, title, tags, updatedAt, checksum, …
↓
Next.js getStaticProps (ISR) or Build Step → query Airtable → render pages
↓
Optional “Mirror to GitHub” job writes MDX files + assets for long-term archive
```
**Pros**
* Editors can tag/filter/curate in Airtable; permissions are familiar.
* Next.js builds are simple data pulls; no PR bottleneck for minor copy edits.
**Cons**
* You’ll want a secondary archive (GitHub or blob store) for long‑term provenance.
* Larger markdown bodies stored in Airtable long text can feel cramped (but workable).
**Notes**
* Use Airtable tables: **Notes**, **Books**, **Teams/Years**, **SyncLog**.
* A formula field computes the Next.js route from `[year]/[team]/[slug]`.
---
## Option C — Hybrid (GitHub = content, Airtable = metadata; “content lake” optional)
**When**: You want the best parts of A + B without duplicating effort.
**Flow**
```
HackMD API → ETL
├─ GitHub: /content/** MDX + assets
└─ Airtable: metadata (id, slug, title, team, year, book, tags, visibility, owner, updatedAt, checksum)
↓
Next.js build:
- Read MDX from repo
- Join with Airtable metadata by hackmdId (or slug) for UI facets, filters, cards
```
**Pros**
* Git versioning & PR previews + Airtable for curation & faceting.
* Clean separation: heavy content in Git, light metadata in Airtable.
**Cons**
* Two destinations to keep in sync (solved by single ETL writing both).
**Optional**: Add a “**content lake**” (S3/R2/Vercel Blob) to store raw zips, original PDFs, big assets. Git holds normalized MDX only.
---
## Option D — Annual “Snapshot” Archiver (immutable)
**When**: End‑of‑year freeze of a HackMD Team, creating a tamper‑evident snapshot.
**Flow**
```
Manual or API zip export as proof → store in content lake w/ manifest.json
↓
Normalizer runs on the zip → writes /content/<year>/<team>/** to GitHub
↓
Create Git tag: archive-<team>-<year>, and a release that links the zip + manifest
```
**Pros**
* Well suited to audits, provenance, FOI requests, and institutional memory.
* Extremely simple runtime story: everything immutable.
**Cons**
* Not for daily updates; offers no editing workflow beyond the archive itself.
---
## Option E — “Book → Next App” generator (one-click site)
**When**: A faculty member’s Book should become a standalone site/app quickly.
**CLI**
```
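# --template takes either `nextra` or the name of your own Next.js template;
# replace <short-name> with the app folder name.
pnpm dlx learninglab/hackmd-tools book-to-next \
  --book https://hackmd.io/@team/book-id \
  --template nextra \
  --dest ./apps/<short-name> \
  --org learning-lab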
```
**What it does**
* Crawls the Book, preserves order/structure, downloads notes + assets.
* Writes `_meta.json` (Nextra) or section index files; builds sidebar/menu from Book.
* Adds frontmatter + link rewriting (id→slug), generates `slugMap.json`.
* Optional: pushes to GitHub with a ready Vercel deploy hook.
**Pros**
* Perfect for fast showcases, conferences, or course microsites.
* Repeatable and sharable as a pattern.
---
## Option F — Slack‑first control plane (ops from chat)
**When**: You want ops via `/hackmd` slash commands + interactive modals.
**Commands**
* `/hackmd sync [team] [--year 2025] [--since 2025-08-01]`
* `/hackmd book-to-next [book-url] [--template nextra]`
* `/hackmd archive [team] [--year 2024]`
* `/hackmd convert [note-url|id]`
* `/hackmd to-airtable [note-url|id]`
* `/hackmd status` (last run, deltas, failures, rate limits)
**Behavior**
* Commands post an ephemeral summary; long‑running logs stream in a thread.
* For collisions/approvals, open a **Slack modal** (pick slug/collection, set visibility).
* On success, post links to: PR preview, Airtable record, and any warnings.
---
## Option G — Incremental sync with change detection (low cost, always fresh)
**When**: You don’t want to re‑pull everything every time.
**Mechanics**
* Maintain a local **state.json** (or Airtable SyncLog) keyed by `hackmdId` → `checksum` (md5 of content) + `updatedAt`.
* Each run:
  * `listNotes()` → fetch only changed notes (`updatedAt > lastSync` or checksum mismatch).
  * Download attachments only for changed notes (etag compare).
  * Write only diffs to Git/Airtable; avoid noisy commits.
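
The change-detection step above could be sketched as two small pure functions. This is a minimal sketch under assumptions: the `ListedNote` and `NoteState` shapes are hypothetical (the real HackMD list response may name or format its fields differently), and the checksum is assumed to be computed elsewhere once the note body has been fetched.

```typescript
// Hypothetical shapes for illustration; adapt to the actual HackMD API response.
interface ListedNote { id: string; updatedAt: string }        // ISO 8601 timestamp
interface NoteState { checksum: string; updatedAt: string }   // one entry of state.json
type SyncState = Record<string, NoteState>;                   // keyed by hackmdId

/** IDs worth fetching: never seen before, or updated since the stored timestamp. */
function selectNotesToFetch(listed: ListedNote[], state: SyncState): string[] {
  return listed
    .filter((n) => {
      const prev = state[n.id];
      return !prev || Date.parse(n.updatedAt) > Date.parse(prev.updatedAt);
    })
    .map((n) => n.id);
}

/** After fetching, write to Git/Airtable only when the content checksum differs. */
function needsWrite(id: string, checksum: string, state: SyncState): boolean {
  return state[id]?.checksum !== checksum;
}
```

Checking the timestamp first keeps API calls low; the checksum comparison then filters out notes that were touched but not actually changed, which is what avoids noisy commits.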
---
## Core building blocks (reusable utilities)
### 1) Frontmatter schema (YAML)
```yaml
---
title: "Guide to X"
slug: "guide-to-x"
hackmdId: "abc123def456"
team: "learning-lab"
year: 2025
bookId: "book-xyz" # if applicable
bookOrder: 12 # for sorted menus (from Book)
authors: ["S. Person", "T. Person"]
tags: ["ai", "teaching"]
visibility: "public" # or "internal", "private"
updatedAt: "2025-08-20T14:22:00Z"
source:
  platform: "hackmd"
  url: "https://hackmd.io/abc123def456"
checksum: "md5:..."
readingTime: 6
summary: "One-liner used for cards."
---
```
### 2) Directory layout (GitHub)
```
/raw/<year>/<team>/<hackmdId>.md # raw pulls (optional)
/content/<year>/<team>/<slug>.mdx # normalized, link-rewritten
/public/assets/<hackmdId>/* # images/assets
/content/_maps/slugMap.json # { "abc123…": "/2025/team/guide-to-x", ... }
```
### 3) Link rewriting strategy
* Build `slugMap` from frontmatter (`hackmdId → route`).
* A **remark** plugin rewrites:
  * `https://hackmd.io/<id>` → internal `/[year]/[team]/[slug]` if known.
  * Keep the external HackMD link if no match is found (or mark it as TODO).
* Optional fallback: a Next.js dynamic route `/h/[hackmdId]` that looks up the slug at runtime (to catch any misses).
**Minimal remark plugin (Node/TS, sketch)**
```ts
import { visit } from 'unist-util-visit';

// Returns a remark transformer that rewrites known HackMD note links to local routes.
export function remarkRewriteHackMDLinks(slugMap: Record<string, string>) {
  return (tree: any) => {
    visit(tree, 'link', (node: any) => {
      const url: string = node.url || '';
      // Capture the note ID from https://hackmd.io/<id>
      const m = url.match(/https?:\/\/hackmd\.io\/([A-Za-z0-9_-]+)/);
      if (m) {
        const id = m[1];
        const internal = slugMap[id];
        if (internal) node.url = internal; // rewrite to local route; unknown IDs stay external
      }
    });
  };
}
```
### 4) Normalizer pipeline
* **gray-matter** → read/write YAML frontmatter
* **remark** plugins:
* `remark-gfm` (tables/task lists),
* `remark-math` + `rehype-katex` (if needed),
* your **link rewriting** plugin,
* image URL transformer → `/assets/<hackmdId>/<filename>`
* **title/slug**:
* prefer HackMD metadata title → kebab‑case slug,
* collision resolver: append short hash or `-2`, record in Airtable.
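
The title-to-slug step and the `-2` collision rule could look roughly like this. It is a sketch, not a fixed implementation: the names `toSlug` and `uniqueSlug` are hypothetical, the `untitled` fallback is an assumption, and it uses the numeric-suffix variant rather than a short hash.

```typescript
/** Kebab-case a title: strip accents, lowercase, collapse non-alphanumerics to "-". */
function toSlug(title: string): string {
  const slug = title
    .normalize('NFKD')
    .replace(/[\u0300-\u036f]/g, '') // drop combining accent marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')     // any run of other characters becomes one hyphen
    .replace(/^-+|-+$/g, '');        // trim leading/trailing hyphens
  return slug || 'untitled';         // assumption: fallback for titles with no usable characters
}

/** Reserve a unique slug within `taken`, appending -2, -3, ... on collision. */
function uniqueSlug(title: string, taken: Set<string>): string {
  const base = toSlug(title);
  let slug = base;
  for (let n = 2; taken.has(slug); n++) slug = `${base}-${n}`;
  taken.add(slug);
  return slug;
}
```

Seeding `taken` with the slugs already present under `/content/<year>/<team>/` keeps results stable across runs; the chosen slug can then be recorded in Airtable as described above.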
### 5) Scripts (CLI)
* `hackmd-archive` (annual snapshot)
* `hackmd-sync` (incremental for team/year/since)
* `hackmd-book-export` (Book → structured content + `_meta.json`)
* `hackmd-to-airtable` (push/update metadata rows)
* `airtable-to-github` (mirror curated selections back into repo)
* `link-audit` (find unresolved links)
* `asset-sync` (download/update images; prune orphans)
* `dry-run` (no writes; prints a plan)
All CLIs accept `--dry-run`, `--since`, `--team`, `--year`, `--book`, `--only <noteId>`, `--dest github|airtable|both`.
---
## Airtable schema (if you make it canonical or hybrid)
**Tables**
1. **Notes**
   * `hackmdId` (primary key)
   * `title`, `slug`, `team`, `year`
   * `bookId` (link to Books), `bookOrder` (number)
   * `tags` (multi‑select), `visibility` (single‑select)
   * `updatedAt` (datetime), `checksum` (text)
   * `content_md` (long text); optional if you store content in Git only
   * `route` (formula: `"/" & year & "/" & team & "/" & slug`)
   * `assets` (attachments); optional
   * `status` (Draft/Published/Archived)
2. **Books**
   * `bookId`, `title`, `team`, `year`
   * `order` (array or linked child “BookItems” with order)
3. **SyncLog**
   * `runId`, `startedAt`, `finishedAt`, `actor` (Slack user)
   * `created`, `updated`, `unchanged`, `errors` (JSON or rollups)
**Why**: This lets you filter by program/course, assign owners, toggle visibility, and drive Next.js cards/menus.
---
## Next.js ingestion patterns
**Static (fast)**
* Use **Contentlayer** to scan `/content/**`, build typed docs at build time.
* If Airtable is used for metadata, pull it at build, join by `hackmdId`.
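
The build-time join by `hackmdId` mentioned above might be sketched as follows. The `Doc` and `NoteMeta` shapes are assumptions for illustration; in practice they would come from your Contentlayer document type and your Airtable query respectively.

```typescript
// Hypothetical shapes: Doc stands in for a document read from /content/**,
// NoteMeta for a row fetched from the Airtable Notes table.
interface Doc { hackmdId: string; slug: string; title: string }
interface NoteMeta {
  hackmdId: string;
  tags: string[];
  visibility: 'public' | 'internal' | 'private';
  owner?: string;
}
type Card = Doc & Partial<NoteMeta>;

/** Left join: every doc is kept; Airtable fields are added when a row matches. */
function joinByHackmdId(docs: Doc[], rows: NoteMeta[]): Card[] {
  const byId = new Map(rows.map((r) => [r.hackmdId, r] as const));
  return docs.map((doc) => {
    const meta = byId.get(doc.hackmdId);
    // Spread doc last so fields read from Git win over any same-named Airtable field.
    return meta ? { ...meta, ...doc } : doc;
  });
}
```

A left join means a note that has not reached Airtable yet still builds; whether such pages should be published or hidden is a policy decision this sketch leaves open.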
**ISR (fresh)**
* Build once, then revalidate pages on a timer (e.g., every 60–300 s) or on demand, triggered by:
  * Slack command (`/hackmd revalidate /2025/team/foo`)
  * Webhook (Airtable automation → Vercel deploy hook or custom API route)
**MDX**
* `@next/mdx` or `next-mdx-remote` + custom components (callouts, citations).
* Keep a consistent **remark** stack across the CLI and the Next build for parity.
---
## “Book” handling details
* Pull Book structure and write a `_meta.json` (for Nextra) or a `collection.json` your app reads for sidebar order.
* Preserve nesting if the Book uses sections; otherwise make sections by heading level or Airtable grouping.
* Store `bookOrder` on each note for stable menus.
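
As an illustration of the first and last points, a `_meta.json` body can be derived from `bookOrder`. This sketch assumes the flat Nextra convention in which keys are page file names and values are sidebar titles; the `BookNote` shape and the `buildMeta` name are hypothetical.

```typescript
// Hypothetical shape: one entry per note in the Book.
interface BookNote { slug: string; title: string; bookOrder: number }

/** Build a _meta.json body: keys are page slugs, values are sidebar titles, in Book order. */
function buildMeta(notes: BookNote[]): string {
  const ordered = [...notes].sort((a, b) => a.bookOrder - b.bookOrder);
  const meta: Record<string, string> = {};
  // String keys keep insertion order in JS objects; purely numeric slugs (e.g. "2025")
  // would be reordered first, so prefix those if they occur.
  for (const n of ordered) meta[n.slug] = n.title;
  return JSON.stringify(meta, null, 2);
}
```

Nested sections would need one such file per folder; that part is left out here.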
---
## Observability & safety
* **Logs**: Each run creates a `SyncLog` row, and posts a Slack thread with a summary diff.
* **Dry‑run** first in Slack (`/hackmd sync --dry-run` gives a preview).
* **Secrets**: HackMD token, Airtable token, GitHub token in `doppler`/`1Password` envs.
* **Rate limits**: Backoff and checkpoint; small page size + pagination; resume with `--since`.
* **PII / Privacy**: Respect visibility flags; skip or mask “private” notes by default unless `--include-private`.
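
For the rate-limit point, a generic retry wrapper with exponential backoff is one possible shape. This is a sketch under assumptions: the base delay, cap, and retry count are illustrative defaults, and it retries on any thrown error because the error shape of your HackMD or Airtable wrapper is not specified here.

```typescript
/** Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs. */
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

/** Run `fn`, retrying with exponential backoff; rethrows the last error after `retries` retries. */
async function withBackoff<T>(fn: () => Promise<T>, retries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      // Wait before the next try; the delay doubles each time up to the cap.
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

Wrapping each per-note fetch (rather than the whole run) pairs well with checkpointing: a run that still fails can resume with `--since` without repeating completed notes.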
---
## Minimal, get‑started plan (practical and quick)
1. **Start Hybrid (Option C)**
   * Content in GitHub (`/content/…`), Metadata in Airtable (`Notes`, `Books`, `SyncLog`).
2. **Build utilities now**
   * `hackmd-sync` (incremental) + `hackmd-book-export` (Book → site).
   * Include `remarkRewriteHackMDLinks` and `asset-sync`.
3. **Wire Slack (Option F)**
   * `/hackmd sync [team] [--since 2025-07-01]` → creates a PR with changes and posts preview.
   * `/hackmd book-to-next [url]` → generates a new app folder + PR.
4. **Annual archive (Option D)**
   * End of academic year: snapshot zip + manifest, tag release.
This gives you:
* Day‑to‑day freshness (incremental sync),
* Editorial curation (Airtable),
* Immutable, auditable archives (Git tag + zip),
* One‑click Book→Site.
---
## Example: once‑a‑year archive script (outline)
```ts
// pnpm dlx tsx tools/hackmd-archive.ts --team learning-lab --year 2025 --zip ./exports/2025.zip
import fs from 'node:fs/promises';
import path from 'node:path';
import { parseArgs } from 'node:util';
import { getNotesForTeam, downloadZip, getNote, getAssets } from './lib/hackmd'; // your thin wrappers
import { normalizeMd, writeFileSafe } from './lib/normalize';

interface ArchiveArgs { team: string; year: number; zip: string }

async function run({ team, year, zip }: ArchiveArgs) {
  // 1) Store raw proof
  await downloadZip(team, zip);

  // 2) List all notes via API (for metadata & IDs)
  const notes = await getNotesForTeam(team);

  // 3) For each note, fetch content & assets
  const outDir = path.join(process.cwd(), 'content', String(year), team);
  const assetsDir = path.join(process.cwd(), 'public', 'assets');
  await fs.mkdir(outDir, { recursive: true });
  await fs.mkdir(assetsDir, { recursive: true });
  const map: Record<string, string> = {};
  for (const n of notes) {
    const raw = await getNote(n.id);
    const { mdx, slug } = await normalizeMd(raw, n); // adds YAML, cleans links/images
    map[n.id] = `/${year}/${team}/${slug}`;
    await writeFileSafe(path.join(outDir, `${slug}.mdx`), mdx);
    await getAssets(n.id, path.join(assetsDir, n.id));
  }

  // 4) Write slug map and manifest
  await writeFileSafe(path.join(process.cwd(), 'content/_maps/slugMap.json'), JSON.stringify(map, null, 2));
  await writeFileSafe(path.join(process.cwd(), `manifest-${team}-${year}.json`), JSON.stringify({ team, year, count: notes.length }, null, 2));
}

// Parse --team/--year/--zip with Node's built-in parseArgs and fail fast if any is missing.
const { values } = parseArgs({
  options: { team: { type: 'string' }, year: { type: 'string' }, zip: { type: 'string' } },
});
if (!values.team || !values.year || !values.zip) {
  throw new Error('Usage: hackmd-archive --team <team> --year <year> --zip <path>');
}
run({ team: values.team, year: Number(values.year), zip: values.zip }).catch((err) => {
  console.error(err);
  process.exitCode = 1;
});
```
---
## Example: “Book → Next app” generator (what it produces)
```
apps/my-book/
  next.config.mjs
  package.json
  content/
    _meta.json                  # sidebar/menu order
    intro.mdx
    chapter-1/
      part-a.mdx
      part-b.mdx
  public/assets/<hackmdId>/*
  content/_maps/slugMap.json
  README.md                     # how to deploy to Vercel
```
---
## Slack UX suggestions
* After `/hackmd sync`, reply with:
  * “✅ 12 updated, 88 unchanged, 1 failed (link)”
  * Buttons:
    * “Open PR Preview”
    * “View SyncLog”
    * “Revalidate Site” (calls a secure Next API route)
* For collisions, open a modal: pick slug, set visibility, choose collection (Book/Team).
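
The reply described above could be assembled as a Slack Block Kit payload along these lines. This is a sketch: `SyncCounts`, both function names, and the two URL parameters are hypothetical, and the “Revalidate Site” button is omitted because it needs an interactive action handler rather than a plain link.

```typescript
// Hypothetical shape for the per-run counters.
interface SyncCounts { updated: number; unchanged: number; failed: number }

/** One-line summary, e.g. "✅ 12 updated, 88 unchanged, 1 failed (link)". */
function formatSyncSummary(c: SyncCounts, failureHint?: string): string {
  const failed = c.failed > 0 && failureHint ? `${c.failed} failed (${failureHint})` : `${c.failed} failed`;
  return `✅ ${c.updated} updated, ${c.unchanged} unchanged, ${failed}`;
}

/** Block Kit blocks: a summary section plus two link buttons. URLs are supplied by the caller. */
function buildSyncBlocks(c: SyncCounts, prUrl: string, syncLogUrl: string, failureHint?: string) {
  const button = (label: string, url: string) => ({
    type: 'button',
    text: { type: 'plain_text', text: label },
    url,
  });
  return [
    { type: 'section', text: { type: 'mrkdwn', text: formatSyncSummary(c, failureHint) } },
    { type: 'actions', elements: [button('Open PR Preview', prUrl), button('View SyncLog', syncLogUrl)] },
  ];
}
```

The returned array can be passed as the `blocks` field of the message your Slack app posts in the command thread.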
---
## What I’d recommend you pilot
* **Week 1:** Implement **Option C** (Hybrid) with:
  * `hackmd-sync` (incremental, `--since`), writing MDX to Git + metadata to Airtable.
  * `remarkRewriteHackMDLinks` + `asset-sync`.
  * Contentlayer in your Next app reading `/content/**`.
* **Week 2:** Add Slack commands for on‑demand runs and PR previews.
* **Week 3:** Add the **Book→Next** generator for faculty showcases.
* **End of Term:** Run **annual snapshot** tool and tag a GitHub release.
This keeps the day‑to‑day experience smooth (Slack + Airtable), while preserving a rock‑solid Git archive for long‑term Next.js content.
**Revision for Airtable's long‑text limit.** An Airtable long text field holds at most 100,000 characters, so long notes cannot be stored there in full. Most of the plan survives: make Airtable “metadata‑only (plus optional conveniences)” and keep **GitHub as the canonical store for full Markdown**.
Below is a quick viability matrix and then a concrete “how to adapt” for each option.
---
## What still works (and how)
| Option | Original intent | Still viable? | What changes with the 100k limit |
| ---------------------------------------------- | ------------------------------------------------- | -----------------------: | ------------------------------------------------------------------------------------------------------------------------------------------- |
| **A. GitHub‑first** | Content & history in Git; Next.js reads from repo | **Yes (no change)** | Keep all `.md/.mdx` in Git. Airtable holds only metadata (and optional short excerpts). |
| **B. Airtable‑first** | Content in Airtable; Next.js reads from Airtable | **Not viable as stated** | Make it **Airtable‑first metadata** only. Content lives in Git (or blob). Optionally store small docs or an attachment pointer in Airtable. |
| **C. Hybrid** (Git=content, Airtable=metadata) | Best of both | **Ideal** | Exactly what you described: full text in Git; Airtable fields for frontmatter/curation; optional “convenience” field for small docs. |
| **D. Annual Snapshot** | Immutable year-end archives | **Yes** | Store raw zip + normalized content in Git (and/or blob). Push only metadata to Airtable. |
| **E. Book → Next app** | One‑click site from a HackMD Book | **Yes** | Generator writes MDX and assets to Git. If you track Books in Airtable, store structure/order there but not the full text. |
| **F. Slack control plane** | `/hackmd …` commands for ops | **Yes** | Commands route content to Git, metadata to Airtable. Post links to PR previews + Airtable rows. |
| **G. Incremental sync** | Change‑aware, low-cost syncs | **Yes** | Add a size/char‑count gate: large notes skip Airtable `content_md` and only update Git + Airtable metadata. |
---
## Updated working model (recommended)
**Canonical rule:**
* **GitHub = source of truth for markdown and assets.**
* **Airtable = source of truth for metadata/curation.**
* Next.js reads **MDX from Git** and **joins metadata from Airtable** (via `hackmdId` or `slug`).
**Airtable fields** (Notes table):
* `hackmdId` (key), `title`, `slug`, `team`, `year`, `tags`, `visibility`, `bookId`, `bookOrder`, `owner`
* `route` (formula), `updatedAt`, `content_length`, `content_checksum`
* `content_excerpt` (long text, e.g., first 2–4k chars)
* `content_md_small` (long text, **optional convenience**; only populated if below threshold)
* `content_attachment` (optional; attached `.md` file for people who prefer to skim in Airtable)
* `github_path` (e.g., `/content/2025/team/slug.mdx`)
* `site_url` (computed or automation)
> **Policy:** Treat `content_md_small` as read‑only/ephemeral. Next.js never reads content from Airtable.
**Sync tool gating logic**
* If `charCount <= SMALL_THRESHOLD` (e.g., **60,000** for safety):
  * Write full text to Git **and** mirror to `content_md_small` for convenience.
* If `charCount > SMALL_THRESHOLD`:
  * Write full text to Git only.
  * In Airtable, write metadata + `content_excerpt`, update `github_path`, optionally attach the `.md` file.
* Always compute `content_checksum` and `content_length` for drift detection.
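
The gating rule above fits in one small helper that returns the Airtable content fields for a note. This is a sketch under assumptions: `airtableContentFields` and `EXCERPT_CHARS` are hypothetical names, the excerpt is populated for every note (not only large ones), and `String.length` counts UTF-16 code units, which may not match exactly how Airtable counts characters.

```typescript
const SMALL_THRESHOLD = 60000; // headroom below Airtable's 100k long-text limit
const EXCERPT_CHARS = 4000;    // assumption: upper end of the "first 2–4k chars" range

interface AirtableContentFields {
  content_length: number;
  content_checksum: string;
  content_excerpt: string;
  content_md_small: string | null; // null means "Git-only": clear the convenience mirror
  github_path: string;
}

/** Decide "Git-only" vs. "Git + Airtable mirror" and build the matching field update. */
function airtableContentFields(
  markdown: string,
  checksum: string,
  githubPath: string,
  threshold: number = SMALL_THRESHOLD,
): AirtableContentFields {
  const isSmall = markdown.length <= threshold;
  return {
    content_length: markdown.length,
    content_checksum: checksum,
    content_excerpt: markdown.slice(0, EXCERPT_CHARS),
    content_md_small: isSmall ? markdown : null,
    github_path: githubPath,
  };
}
```

Returning `null` rather than omitting the field means a note that grows past the threshold has its stale mirror cleared on the next sync.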
**Why a lower threshold than 100k?**
Headroom avoids edge cases (formatting overhead, future expansions, accidental edits). You can set it wherever you’re comfortable; **50–80k** works well.
---
## Option‑by‑option tweaks
### A) GitHub‑first
* **No changes needed.**
* You can still mirror short notes into Airtable’s `content_md_small` for quick search/snippets.
### B) Airtable‑first → **Airtable‑first metadata**
* Next.js stops reading `content_md` from Airtable.
* Use `github_path` (and optional attachment) to give Airtable users a way to open the canonical file.
* Keep editorial flow (tags, owners, visibility, “featured”, Book order) in Airtable.
### C) Hybrid (recommended default)
* Keep your frontmatter **in YAML in Git** and **duplicated as structured fields in Airtable**.
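* Treat Git YAML as the “ground truth” for programmatic fields (e.g., `hackmdId`, `updatedAt`, `checksum`) and Airtable as the source of truth for curation fields (e.g., `tags`, `visibility`, `owner`); the sync script can **round‑trip** changes in either direction if and when you want that.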
### D) Annual Snapshot
* Store the **raw HackMD export zip** in a blob store (or Git release assets) + write normalized MDX to Git.
* Add an Airtable `Snapshot` record (team/year/count/sha/tag) for auditability.
### E) Book → Next app
* Generator writes **content only to Git**.
* If you use Airtable to curate menus, write just `bookId`, `bookOrder`, titles, and routes there.
### F) Slack control plane
* Add output notices like:
  * “Note X is 124,388 chars → skipped Airtable `content_md_small`, stored in Git. View: [PR] [Airtable row].”
### G) Incremental sync
* Include `content_length` and `content_checksum` in your state file and Airtable to avoid refetching and to detect size‑driven routing changes.
---
## Next.js ingestion (unchanged, but clearer)
* **Contentlayer** (or `next-mdx-remote`) reads `/content/**`.
* A build‑time step (or ISR API route) fetches Airtable metadata and joins by `hackmdId`/`slug`.
* Keep `slugMap.json` for link rewriting; also provide a fallback route `/h/[hackmdId]` that resolves to the current slug.
---
## Guardrails & UX
* **Read‑only cues in Airtable**: Add a label/emoji to the `content_md_small` field name (e.g., “Content (preview; read‑only)”) and place it in a **read‑only interface** for most users.
* **Slack diffs**: Post char count and whether a note was mirrored to Airtable or Git‑only.
* **Automations**: If someone edits `content_md_small`, an automation can revert (by re‑mirroring from Git) or at least warn in Slack.
---
## Bottom line
* **Keep Options A, C, D, E, F, G**—they work as‑is with minor tweaks.
* **Do not rely on Airtable as canonical content storage** (Option B as originally framed).
* Use Airtable for **metadata and curation**, and optionally mirror **small docs/excerpts** for convenience. GitHub remains the **source of truth** for all Markdown and assets.