Below is a practical, production‑ready plan for a **single “q” search** that works well for type‑ahead UX, scales on Postgres, and is easy to evolve.
---
## TL;DR (recommended approach)
1. **Turn on Postgres extensions**: `unaccent` (accent‑insensitive) and `pg_trgm` (fuzzy/trigram).
2. **Add per‑table generated `tsvector` columns + GIN indexes** on:
* `customers` (name, company, title)
* `customer_contacts` (value/email/phone)
* `locations` (full mailing string)
This keeps the matching step **index-driven and join-free**.
3. **Search pipeline** (one API):
* Build a **tsquery** from `q` and hit each table’s `search_vector` separately (index scan).
* **UNION** the matching `customer_id`s with a weighted rank from each source.
* Add a **fuzzy fallback** (trigram similarity) for partial/prefix/misspells.
* Aggregate to one row per customer, sort by final score, paginate.
4. **UX**:
* Debounce 250–300ms, return top 10–20 with a score.
* Smarter heuristics: detect email/phone/ZIP, bias results near **primary location**.
This gives excellent recall/precision without pulling in an external search engine—and you can add a materialized view later if you want even faster lookups.
---
## 1) Postgres extensions & indexes (Alembic snippets)
> Create extensions once in an early migration (Postgres only):
```sql
CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- unaccent() is only STABLE, but generated columns and index expressions
-- require IMMUTABLE functions. Wrap it once (safe as long as the unaccent
-- dictionary is left alone) and use the wrapper in all DDL below:
CREATE OR REPLACE FUNCTION immutable_unaccent(text)
RETURNS text LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT
AS $$ SELECT public.unaccent('public.unaccent', $1) $$;
```
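Since the heading promises Alembic, here is a minimal migration sketch wrapping that DDL; the revision identifiers are placeholders:

```python
"""Sketch of the search-extensions migration; revision ids are placeholders."""
from alembic import op

revision = "0001_search_extensions"  # placeholder
down_revision = None


def upgrade() -> None:
    # op.execute runs raw SQL; these statements are Postgres-only.
    op.execute("CREATE EXTENSION IF NOT EXISTS unaccent")
    op.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
    op.execute(
        """
        CREATE OR REPLACE FUNCTION immutable_unaccent(text)
        RETURNS text LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT
        AS $$ SELECT public.unaccent('public.unaccent', $1) $$
        """
    )


def downgrade() -> None:
    op.execute("DROP FUNCTION IF EXISTS immutable_unaccent(text)")
    op.execute("DROP EXTENSION IF EXISTS pg_trgm")
    op.execute("DROP EXTENSION IF EXISTS unaccent")
```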
### Generated `tsvector` columns (per table)
**Customers**
```sql
ALTER TABLE customers
  ADD COLUMN IF NOT EXISTS search_vector tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('simple', immutable_unaccent(coalesce(company, ''))), 'A') ||
    setweight(to_tsvector('simple', immutable_unaccent(coalesce(last_name, ''))), 'A') ||
    setweight(to_tsvector('simple', immutable_unaccent(coalesce(first_name, ''))), 'B') ||
    setweight(to_tsvector('simple', immutable_unaccent(coalesce(title, ''))), 'D')
  ) STORED;

CREATE INDEX IF NOT EXISTS ix_customers_search_vector
  ON customers USING GIN (search_vector);

-- Useful fuzzy/prefix indexes
CREATE INDEX IF NOT EXISTS ix_customers_company_trgm
  ON customers USING GIN ((immutable_unaccent(lower(coalesce(company, '')))) gin_trgm_ops);
CREATE INDEX IF NOT EXISTS ix_customers_name_trgm
  ON customers USING GIN (
    (immutable_unaccent(lower(coalesce(first_name, '') || ' ' || coalesce(last_name, '')))) gin_trgm_ops
  );
```
**Contacts**
```sql
ALTER TABLE customer_contacts
  ADD COLUMN IF NOT EXISTS search_vector tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('simple', immutable_unaccent(coalesce(type, ''))), 'D') ||
    setweight(to_tsvector('simple', immutable_unaccent(coalesce(value, ''))), 'B')
  ) STORED;

CREATE INDEX IF NOT EXISTS ix_customer_contacts_search_vector
  ON customer_contacts USING GIN (search_vector);

-- Fuzzy for email/phone substrings
CREATE INDEX IF NOT EXISTS ix_customer_contacts_value_trgm
  ON customer_contacts USING GIN ((immutable_unaccent(lower(value))) gin_trgm_ops);

-- Fast digit-only phone contains
CREATE INDEX IF NOT EXISTS ix_customer_contacts_digits_trgm
  ON customer_contacts USING GIN ((regexp_replace(value, '\D', '', 'g')) gin_trgm_ops);
```
**Locations**
```sql
ALTER TABLE locations
  ADD COLUMN IF NOT EXISTS search_vector tsvector
  GENERATED ALWAYS AS (
    setweight(
      to_tsvector('simple',
        immutable_unaccent(
          -- concat_ws() is only STABLE, so spell out the concatenation
          coalesce(line1, '') || ' ' || coalesce(line2, '') || ' ' ||
          coalesce(city, '') || ' ' || coalesce(state, '') || ' ' ||
          coalesce(zip, '') || ' ' || coalesce(country, '')
        )
      ), 'C')
  ) STORED;

CREATE INDEX IF NOT EXISTS ix_locations_search_vector
  ON locations USING GIN (search_vector);

-- Fuzzy address substring
CREATE INDEX IF NOT EXISTS ix_locations_addr_trgm
  ON locations USING GIN (
    (immutable_unaccent(lower(
      coalesce(line1, '') || ' ' || coalesce(line2, '') || ' ' ||
      coalesce(city, '') || ' ' || coalesce(state, '') || ' ' ||
      coalesce(zip, '') || ' ' || coalesce(country, '')
    ))) gin_trgm_ops
  );
```
> Notes
>
> * We use the `'simple'` text search config to avoid English stemming on names/addresses.
> * `unaccent()` and `concat_ws()` are only STABLE in Postgres, while generated columns and index expressions require IMMUTABLE functions; hence the `immutable_unaccent` wrapper and the plain `||` concatenation in the locations DDL.
> * Weights: **A** (strong) for `company`/`last_name`, **B** for `first_name`/contact value, **C** for address, **D** for misc.
---
## 2) ORM mappings (optional but recommended)
Map the generated columns so SQLAlchemy can reference them:
```python
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import TSVECTOR
from sqlalchemy.orm import Mapped, mapped_column


class Customer(DbEntityBase):
    __tablename__ = "customers"
    # ... existing fields ...

    # FetchedValue marks the column as database-generated, so SQLAlchemy
    # never tries to write it on INSERT/UPDATE.
    search_vector: Mapped[str] = mapped_column(
        TSVECTOR, nullable=False, server_default=sa.FetchedValue()
    )


class CustomerContact(DbEntityBase):
    __tablename__ = "customer_contacts"
    # ... existing fields ...
    search_vector: Mapped[str] = mapped_column(
        TSVECTOR, nullable=False, server_default=sa.FetchedValue()
    )


class Location(DbEntityBase):
    __tablename__ = "locations"
    # ... existing fields ...
    search_vector: Mapped[str] = mapped_column(
        TSVECTOR, nullable=False, server_default=sa.FetchedValue()
    )
```
If you prefer not to modify models, you can reference these columns with `sa.literal_column("search_vector")`, but mapping is cleaner.
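For reference, a minimal sketch of that unmapped approach (qualifying the column with the table name is the only addition, so it stays unambiguous once joins are involved):

```python
import sqlalchemy as sa

# Ad-hoc reference to the unmapped generated column; the text is injected
# into the SQL verbatim, so keep it table-qualified.
search_vector = sa.literal_column("customers.search_vector")

stmt = sa.select(Customer.id).where(
    search_vector.op("@@")(sa.func.websearch_to_tsquery("simple", "acme"))
)
```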
---
## 3) The search function (SQLAlchemy 2.x)
This function:
* Searches all relevant fields with full‑text first
* Adds fuzzy similarity fallback
* Handles email/phone substrings
* Scopes by `office_id` / `office_region_id` if provided
* Returns customers + a combined `score`, ordered and paginated
```python
from __future__ import annotations

import re
from typing import Optional

import sqlalchemy as sa
from sqlalchemy import select, func, literal, or_, and_, desc
from sqlalchemy.orm import Session

TRIGRAM_THRESHOLD = 0.25  # tune between 0.2 and 0.35 for your data


def search_customers(
    session: Session,
    q: str,
    *,
    office_id: Optional[int] = None,
    office_region_id: Optional[int] = None,
    status_in: Optional[list] = None,  # e.g. [CustomerStatus.ACTIVE]
    limit: int = 20,
    offset: int = 0,
):
    q = (q or "").strip()
    if not q:
        return []

    lower_q = q.lower()
    digits = re.sub(r"\D", "", q)
    is_email = "@" in q and " " not in q
    has_digits = len(digits) >= 4

    # Build tsquery; unaccent the query to match our unaccented documents.
    ts_query = func.websearch_to_tsquery("simple", func.unaccent(lower_q))

    # ---- 1) Index-driven full-text matches on each table --------------------

    # Customers FTS
    cust_fts = select(
        Customer.id.label("customer_id"),
        func.ts_rank_cd(Customer.search_vector, ts_query).label("rank"),
        literal(1.00).label("w"),  # strongest source
    ).where(Customer.search_vector.op("@@")(ts_query))

    # Contacts FTS
    contact_fts = select(
        CustomerContact.customer_id.label("customer_id"),
        func.ts_rank_cd(CustomerContact.search_vector, ts_query).label("rank"),
        literal(0.65).label("w"),
    ).where(CustomerContact.search_vector.op("@@")(ts_query))

    # Locations FTS
    loc_fts = select(
        Location.customer_id.label("customer_id"),
        func.ts_rank_cd(Location.search_vector, ts_query).label("rank"),
        literal(0.45).label("w"),
    ).where(Location.search_vector.op("@@")(ts_query))

    fts_union = sa.union_all(cust_fts, contact_fts, loc_fts).subquery("fts_union")
    fts_scored = (
        select(
            fts_union.c.customer_id,
            # take the best weighted rank among sources per customer
            func.max(fts_union.c.rank * fts_union.c.w).label("fts_score"),
        )
        .group_by(fts_union.c.customer_id)
        .subquery("fts_scored")
    )

    # ---- 2) Fuzzy similarity fallback (trigram) -----------------------------

    # Name/company similarity
    full_name = func.unaccent(
        func.lower(func.concat_ws(" ", Customer.first_name, Customer.last_name))
    )
    company_l = func.unaccent(func.lower(Customer.company))

    # Address similarity (across any location rows)
    addr_concat = func.unaccent(
        func.lower(
            func.concat_ws(
                " ",
                Location.line1,
                Location.line2,
                Location.city,
                Location.state,
                Location.zip,
                Location.country,
            )
        )
    )

    # Per-customer fuzzy score (max across related rows). Grouping by the
    # primary key lets Postgres accept the ungrouped customer expressions.
    fuzzy_scores = (
        select(
            Customer.id.label("customer_id"),
            func.greatest(
                func.similarity(company_l, lower_q),
                func.similarity(full_name, lower_q),
                func.coalesce(func.max(func.similarity(addr_concat, lower_q)), 0.0),
            ).label("fuzzy_score"),
        )
        .join(Location, Location.customer_id == Customer.id, isouter=True)
        .group_by(Customer.id)
        .subquery("fuzzy_scores")
    )

    # ---- 3) Email/phone heuristics (ZIPs already hit the locations FTS) -----

    # Email contains (fast path). DISTINCT so a customer with several
    # matching contacts does not yield duplicate rows after the outer join.
    email_match = None
    if is_email:
        email_match = (
            select(CustomerContact.customer_id)
            .distinct()
            .where(
                func.lower(CustomerContact.value).contains(lower_q),
                # optional: restrict to type = 'email'
            )
            .subquery("email_match")
        )

    # Phone digits contains (fast path)
    phone_match = None
    if has_digits:
        phone_match = (
            select(CustomerContact.customer_id)
            .distinct()
            .where(
                func.regexp_replace(CustomerContact.value, r"\D", "", "g").contains(digits)
            )
            .subquery("phone_match")
        )

    # ---- 4) Assemble final query --------------------------------------------

    # Base filter scope
    scope_conditions = []
    if office_id is not None:
        scope_conditions.append(Customer.office_id == office_id)
    if office_region_id is not None:
        scope_conditions.append(Customer.office_region_id == office_region_id)
    if status_in:
        scope_conditions.append(Customer.status.in_(status_in))

    # Join scored subqueries
    scored = (
        select(
            Customer.id,
            Customer.first_name,
            Customer.last_name,
            Customer.company,
            # coalesce: keep ordering well-defined when only one side matched
            func.coalesce(fts_scored.c.fts_score, 0.0).label("fts_score"),
            func.coalesce(fuzzy_scores.c.fuzzy_score, 0.0).label("fuzzy_score"),
            # Simple linear blend; tune weights to taste
            (
                func.coalesce(fts_scored.c.fts_score, 0.0) * 1.0
                + func.coalesce(fuzzy_scores.c.fuzzy_score, 0.0) * 0.35
            ).label("score"),
        )
        .outerjoin(fts_scored, fts_scored.c.customer_id == Customer.id)
        .outerjoin(fuzzy_scores, fuzzy_scores.c.customer_id == Customer.id)
    )

    # Require at least one source to match (FTS, fuzzy, or heuristics)
    where_any_match = [
        fts_scored.c.fts_score.is_not(None),
        fuzzy_scores.c.fuzzy_score >= TRIGRAM_THRESHOLD,
    ]
    if email_match is not None:
        scored = scored.outerjoin(email_match, email_match.c.customer_id == Customer.id)
        where_any_match.append(email_match.c.customer_id.is_not(None))
    if phone_match is not None:
        scored = scored.outerjoin(phone_match, phone_match.c.customer_id == Customer.id)
        where_any_match.append(phone_match.c.customer_id.is_not(None))

    if scope_conditions:
        scored = scored.where(and_(*scope_conditions))
    scored = scored.where(or_(*where_any_match))

    # Sorting & pagination
    scored = (
        scored.order_by(desc(sa.literal_column("score")), desc(sa.literal_column("fts_score")))
        .limit(limit)
        .offset(offset)
    )

    rows = session.execute(scored).all()

    # Post-process into plain dicts; swap in your DTOs as needed.
    return [
        {
            "id": r.id,
            "full_name": f"{r.first_name or ''} {r.last_name or ''}".strip(),
            "company": r.company,
            "score": float(r.score or 0.0),
            "fts_score": float(r.fts_score or 0.0),
            "fuzzy_score": float(r.fuzzy_score or 0.0),
        }
        for r in rows
    ]
```
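Calling it is then a one-liner per debounced keystroke; an illustrative invocation (session wiring depends on your application):

```python
# Illustrative call; `session` comes from your app's session factory.
results = search_customers(
    session,
    "jane acme",
    office_id=42,  # optional scoping, assumed field
    limit=10,
)
for hit in results:
    print(f"{hit['score']:.3f}  {hit['full_name'] or hit['company']}")
```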
### Why this shape?
* **Index‑friendly**: matching happens on each table’s `tsvector`/trigram indexes, then we union customer IDs. We avoid big ORs over many columns and avoid cartesian blowups from joining first.
* **Ranking that feels right**: Full‑text rank dominates, fuzzy gives helpful lift for typos/short prefixes.
* **Heuristics for special tokens**: Email and phone detection use narrow, fast filters to surface the obvious hit.
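One caveat for type-ahead specifically: `websearch_to_tsquery` has no prefix syntax, so a half-typed final token like `acm` won't full-text-match `acme` (the trigram fallback papers over this). If you want true prefix matching, a common technique is to build the tsquery yourself with `to_tsquery` and a `:*` suffix on the last token. A minimal sketch of that helper (not part of the function above):

```python
import re

from sqlalchemy import func


def prefix_tsquery(q: str):
    """Illustrative alternative to websearch_to_tsquery for type-ahead:
    AND all tokens together and prefix-match the one still being typed."""
    tokens = [t for t in re.split(r"\W+", q.lower()) if t]
    if not tokens:
        return None
    tokens[-1] += ":*"  # prefix operator; supported by to_tsquery only
    return func.to_tsquery("simple", func.unaccent(" & ".join(tokens)))
```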
---
## 4) UX affordances (front‑end)
* **Debounce** input 250–300ms; show spinner; return top N (10–20).
* **Highlight**: you can return the original `q` and let the FE highlight in the displayed name/company/address.
* **Bias primary location**: optionally boost customers whose `primary_location.city/state` match the user's chosen region by adding a small bonus to `score` (see the sketch after this list).
* **Empty state**: display guidance like “Try name, company, email, phone, or address”.
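For the location-bias bullet above, a server-side sketch: a small conditional bonus folded into the blended score. The `Location.is_primary` flag, the `0.15` bonus, and the state comparison are assumptions; adapt them to however you model primary locations:

```python
import sqlalchemy as sa
from sqlalchemy import case, func

user_state = "CA"  # e.g. taken from the requesting user's region

# Correlated scalar subquery: grant a 0.15 bonus when the customer's
# primary location (assumed `is_primary` flag) sits in the user's state.
location_bonus = func.coalesce(
    sa.select(case((func.upper(Location.state) == user_state, 0.15), else_=0.0))
    .where(Location.customer_id == Customer.id, Location.is_primary.is_(True))
    .correlate(Customer)
    .limit(1)
    .scalar_subquery(),
    0.0,
)

# ...then add `+ location_bonus` into the blended `score` expression
# from section 3.
```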
---
## 5) Performance notes & knobs
* **Thresholds**: `TRIGRAM_THRESHOLD` is the main recall/precision dial. Start at `0.25`; lower increases recall, higher increases precision. Note that bare `similarity()` calls do not use the trigram indexes; to get index-assisted candidate filtering, pre-filter with the `%` operator (which honors `pg_trgm.similarity_threshold`) against the same expressions the indexes were built on.
* **GIN memory**: If tables are large, ensure `maintenance_work_mem` and `work_mem` are adequate when building indexes.
* **Selective partial indexes**: If most customers are ACTIVE, add partial indexes with `WHERE status = 'ACTIVE'` to speed common queries.
* **Pagination**: `LIMIT/OFFSET` is fine for type‑ahead. For deep pagination, switch to keyset (seek) pagination on `(score, id)`, as sketched below.
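A sketch of that keyset page, assuming the client echoes back the last row's `score` and `id`, applied to the `scored` select from section 3 before its limit/offset. The computed label has to be wrapped in a subquery, since SQL allows labels in ORDER BY but not in WHERE:

```python
import sqlalchemy as sa


def apply_keyset(scored_stmt, last_score: float, last_id: int, page_size: int = 20):
    # Wrap the scored select so the computed `score` label is addressable
    # in WHERE; ties on score are broken by id descending.
    page = scored_stmt.subquery("page")
    return (
        sa.select(page)
        .where(
            sa.or_(
                page.c.score < last_score,
                sa.and_(page.c.score == last_score, page.c.id < last_id),
            )
        )
        .order_by(page.c.score.desc(), page.c.id.desc())
        .limit(page_size)
    )
```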
---
## 6) Optional: materialized search view (later)
For very large datasets or strict p95 latency budgets, create a denormalized `customer_search` **materialized view** with:
* `customer_id`
* a single merged `tsvector` (customers + primary contact + primary location)
* GIN index on that vector
Then query just that MV with FTS + trigram. Schedule `REFRESH MATERIALIZED VIEW CONCURRENTLY` or maintain a regular table via triggers. This trades freshness for speed but can cut tail latency further.
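A rough Alembic sketch of that MV (the `is_primary` flags are assumptions about your schema; the unique index is what `REFRESH ... CONCURRENTLY` requires):

```python
from alembic import op


def upgrade() -> None:
    # Merge the per-table vectors from section 1 into one row per customer.
    op.execute(
        """
        CREATE MATERIALIZED VIEW customer_search AS
        SELECT c.id AS customer_id,
               c.search_vector
                 || coalesce(ct.search_vector, '')
                 || coalesce(l.search_vector, '') AS search_vector
        FROM customers c
        LEFT JOIN customer_contacts ct ON ct.customer_id = c.id AND ct.is_primary
        LEFT JOIN locations l ON l.customer_id = c.id AND l.is_primary
        """
    )
    # The unique index makes REFRESH MATERIALIZED VIEW CONCURRENTLY possible.
    op.execute("CREATE UNIQUE INDEX ux_customer_search_id ON customer_search (customer_id)")
    op.execute("CREATE INDEX ix_customer_search_vector ON customer_search USING GIN (search_vector)")


def downgrade() -> None:
    op.execute("DROP MATERIALIZED VIEW IF EXISTS customer_search")
```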
---
### Summary
* The **per‑table tsvector + union** strategy delivers fast, accurate, and maintainable search entirely in Postgres with SQLAlchemy.
* The provided function is drop‑in: it does full‑text first, fuzzy fallback, and special handling for email/phone—then ranks, dedupes, and paginates cleanly.
* You can tune weights/thresholds without schema changes and optionally evolve to a materialized view if/when you need more speed.