# The Search Commons
Our understanding of search is stuck in the 1990s. We can build search with better
properties than the perennial search engine.
## Architecture Overview
Some parties provide corpora. These corpora can be either crawls (eg. of the web)
or datasets from any kind of party.
Some parties provide indices over corpora. (Fuck those plurals.) A corpus might
get indexed via Compute Over Data (COD). Indices are meant to be composable so that you can search,
filter, and sort by joining over multiple indices that weren't designed to work
together.
Ideally, clients access the indices directly: the indices are built on enough
commonality that they can be processed interoperably, and are laid out in such a
way that a query only requires loading a fraction of the index tree.
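
To make "a query only requires loading a fraction of the index tree" concrete, here is a minimal sketch, assuming the index is a static search tree stored as content-addressed blocks. `BlockStore`, `build`, and `lookup` are hypothetical names (not an existing library), and the in-memory dict stands in for whatever network storage would actually hold the blocks.

```python
import hashlib
import json
from bisect import bisect_right


class BlockStore:
    """Hypothetical content-addressed store: each block is keyed by the SHA-256 of its bytes."""

    def __init__(self):
        self.blocks = {}
        self.loads = 0  # counts fetches, standing in for network round trips

    def put(self, node):
        data = json.dumps(node, sort_keys=True).encode()
        cid = hashlib.sha256(data).hexdigest()
        self.blocks[cid] = data
        return cid

    def get(self, cid):
        self.loads += 1
        return json.loads(self.blocks[cid])


def build(store, entries, fanout=4):
    """Build a static search tree over a non-empty list of (term, posting) pairs.

    Returns the id of the root block.
    """
    entries = sorted(entries)
    level = []
    # Leaf blocks hold the actual entries.
    for i in range(0, len(entries), fanout):
        chunk = entries[i:i + fanout]
        cid = store.put({"leaf": True, "entries": chunk})
        level.append((chunk[0][0], cid))
    # Inner blocks hold the first key of each child plus the child's id.
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            chunk = level[i:i + fanout]
            cid = store.put({
                "leaf": False,
                "keys": [key for key, _ in chunk],
                "children": [child for _, child in chunk],
            })
            parents.append((chunk[0][0], cid))
        level = parents
    return level[0][1]


def lookup(store, root, term):
    """Walk from the root to one leaf, fetching only the blocks on that path."""
    node = store.get(root)
    while not node["leaf"]:
        # Pick the last child whose first key is <= term.
        i = max(bisect_right(node["keys"], term) - 1, 0)
        node = store.get(node["children"][i])
    for key, posting in node["entries"]:
        if key == term:
            return posting
    return None


store = BlockStore()
root = build(store, [(f"term{i:03d}", [i]) for i in range(64)], fanout=4)
print(lookup(store, root, "term042"), store.loads, len(store.blocks))
```

With 64 terms and a fanout of 4, the tree has 21 blocks and a lookup fetches 3 of them. A real index would need a much larger fanout and some way to handle updates, which this sketch ignores.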
Index gateways (essentially an HTTP-based UI atop search infrastructure) are possible as
well, though ideally we wouldn't require them or
rely on them excessively because they are wont to be sticky.
## Questions
(For some of these the answer might be "can't" or "tough", but we still need an
answer.)
- How do corpus providers get paid (in the cases in which they need support)?
- How are corpus providers kept honest (for corpora such as crawls, where there are
incentives to cheat)?
- Is it possible for a corpus to be indexable (eg. with COD) but not provided in
full for download? I can imagine that some entities would be happy to expose their
content to search in exchange for discoverability, but would not want to provide
the data wholesale.
- Can we provide provenance and rights reservation? (Eg. anyone can read your
stories but you don't want them used in generative AI.)
- How do indexers get paid?
- How are indexers kept honest?
- How are the retrieval costs of querying indices covered?
- Can we make indices composable? (Eg. Dave makes a full-text index of the web and
Dietrich makes an automated rating of the jankiness of pages; how do I query
Dave's index for a string and rank the results by Dietrich's rating?)
- Do we need a Bluesky-like infrastructure of larger indexers over smaller corpus
providers? Conversely, should this be a generalisation of that architecture such
that Bluesky aggregators basically fall out of it?
- What kind of metadata do corpora need to be generally indexable?
- Should we ensure that there is always something that can be linked to and that
renders usefully (as opposed to, say, an entry in a DB)?
- How are indexers notified of changes in corpora?
- Does this tie in to IPNI?
- Does this tie in to Saturn?
- Does this tie in to [Content Claims Protocol](https://hackmd.io/IiKMDqoaSM61TjybSxwHog?both)?
- Can solving this *also* solve Bluesky aggregators?
- It would be cool if users could initiate WARC-file archives for the links they click through, so the corpus could be further enhanced with proper archives. (- Aram)
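
For the composability question above (query Dave's full-text index, rank by Dietrich's jankiness rating), here is a rough sketch of the simplest possible join. It assumes both indices key documents by the same identifier (a URL here), which is exactly the kind of commonality the question is asking about. The data, the `search_ranked` name, and the choice to rank unrated pages last are all made up for illustration.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for Dave's full-text index: term -> set of document URLs.
dave_fulltext: Dict[str, Set[str]] = {
    "commons": {"https://a.example/", "https://b.example/", "https://c.example/"},
    "search": {"https://a.example/", "https://c.example/"},
}

# Hypothetical stand-in for Dietrich's rating index: URL -> jankiness (0.0 = clean, 1.0 = janky).
dietrich_jank: Dict[str, float] = {
    "https://a.example/": 0.8,
    "https://b.example/": 0.1,
    # https://c.example/ has no rating
}


def search_ranked(
    term: str,
    fulltext: Dict[str, Set[str]],
    ratings: Dict[str, float],
    default: float = 1.0,
) -> List[str]:
    """Look the term up in one index, then order the hits by a score from the other.

    Pages missing from the rating index get `default` (assumed worst case here).
    Ties are broken by URL so the ordering is deterministic.
    """
    hits = fulltext.get(term, set())
    return sorted(hits, key=lambda url: (ratings.get(url, default), url))


print(search_ranked("commons", dave_fulltext, dietrich_jank))
```

The join only works because the two indices agree on what identifies a document. If one used URLs and the other used content hashes, something would have to map between them.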
## Capture Threat Model
- [ ] Which decisions does each component make?
- [ ] Who is affected by that decision?
- [ ] How can those who are affected have voice in the process?
## Todo
- [ ] Check out AramZS's list of IndieSearch resources (@robin-berjon)
- Here you go! https://context.center/topics/indie-search/ (- Aram)
- [ ] Also check out [Merkle Search Tree](https://github.com/DavidBuchanan314/merkle-search-tree)
- [ ] Also https://github.com/mikeal/prolly-trees
- [ ] Promising: https://presearch.io/
- [ ] https://github.com/Mubelotix/admarus
- [ ] https://github.com/izihawa/summa
- [ ] Has anyone put [Common Crawl](https://commoncrawl.org/) on IPFS?
- [ ] Should we make it available if not?
- [ ] There's a ton in https://commoncrawl.org/the-data/examples/
- [ ] What is a good indexing system that works with IPFS?
- [ ] The stuff that Mauve and Quinn were talking about with respect to Prolly Trees
sounds like you could load a few layers of an index and then be able to search
pretty fast.
- Links? (- Aram)
- mauve explainifying: https://www.youtube.com/watch?v=TblRt1NA39U and slides from that https://blog.mauve.moe/slides/p2p-deebees/
- [ ] There were several mentions at Thing 2023 Interplanetary Databases of composable
indices: because of content addressing, multiple parties can write to the same
index without coordination.
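
One possible reading of "multiple parties can write to the same index without coordination", sketched under the assumption that every index entry is identified by the hash of its own content. The `entry_id`, `make_index`, and `merge` names and the entry shape are hypothetical, and the talks may well have meant something more sophisticated.

```python
import hashlib
import json


def entry_id(entry: dict) -> str:
    """Content address of an entry: the SHA-256 of its canonical JSON form."""
    data = json.dumps(entry, sort_keys=True).encode()
    return hashlib.sha256(data).hexdigest()


def make_index(entries):
    """An index here is just a mapping from content address to entry."""
    return {entry_id(e): e for e in entries}


def merge(a: dict, b: dict) -> dict:
    """Union of two indices.

    Two entries with the same id have the same content, so there is nothing
    to reconcile and no writer needs to know about the other.
    """
    return {**a, **b}


# Two writers index overlapping documents independently.
writer_a = make_index([
    {"term": "commons", "doc": "doc1"},
    {"term": "search", "doc": "doc1"},
])
writer_b = make_index([
    {"term": "search", "doc": "doc1"},
    {"term": "search", "doc": "doc2"},
])

merged = merge(writer_a, writer_b)
print(len(merged))
```

Because the merge is a set union, it gives the same result in either order and applying it twice changes nothing. Structures like Prolly Trees and Merkle Search Trees aim for a similar property at the level of the tree itself, so that the same set of entries produces the same tree regardless of who inserted what and when.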
---
<small>This work is part of [The Web Commons](/dFpEb1jeSrKp0Fx8IrdZAg).</small>