# The Search Commons

Our understanding of search is stuck in the 1990s. We can build search with better properties than the perennial search engine.

## Architecture Overview

Some parties provide corpora. These corpora can be either crawls (eg. of the web) or datasets from any kind of party.

Some parties provide indices over corpora. (Fuck those plurals.) A corpus might get indexed via Compute Over Data (COD). Indices are meant to be composable so that you can search, filter, and sort by joining over multiple indices that weren't designed to work together.

Clients access the indices directly (ideally) because they are built on enough commonality that they can be processed interoperably, and are laid out in such a way that a query only requires loading a fraction of the index tree. Index gateways (essentially an HTTP-based UI atop search infrastructure) are possible as well, though ideally we wouldn't require them or rely on them excessively because they are wont to be sticky.

## Questions

(For some of these the answer might be "can't" or "tough", but we still need an answer.)

- How do corpus providers get paid (in cases where they need support)?
- How are corpus providers kept honest (for corpora like crawls where there are incentives to cheat)?
- Is it possible for a corpus to be indexable (eg. with COD) but not provided in full for download? I can imagine that some entities would be happy to expose their content to search in exchange for discoverability, but not provide the entire data wholesale.
- Can we provide provenance and rights reservation? (Eg. anyone can read your stories but you don't want them used in generative AI.)
- How do indexers get paid?
- How are indexers kept honest?
- How are retrieval costs tied to querying indices covered?
- Can we make indices composable? (Eg.
Dave makes a full-text index of the web and Dietrich makes an automated rating of the jankiness of pages; how do I query Dave's index for a string and rank the results by Dietrich's rating?)
- Do we need a Bluesky-like infrastructure of larger indexers over smaller corpus providers? Conversely, should this be a generalisation of that architecture such that Bluesky aggregators basically fall out of it?
- What kind of metadata do corpora need to be generally indexable?
- Should we ensure that there is always something that can be linked to and that renders usefully (as opposed to, say, an entry in a DB)?
- How are indexers notified of changes in corpora?
- Does this tie in to IPNI?
- Does this tie in to Saturn?
- Does this tie in to [Content Claims Protocol](https://hackmd.io/IiKMDqoaSM61TjybSxwHog?both)?
- Can solving this *also* solve Bluesky aggregators?
- It would be cool if users could initiate WARC-file archives on the links they click through on so the corpus could be further enhanced with proper archives. (- Aram)

## Capture Threat Model

- [ ] Which decisions does each component make?
- [ ] Who is affected by those decisions?
- [ ] How can those who are affected have a voice in the process?

## Todo

- [ ] Check out AramZS's list of IndieSearch resources (@robin-berjon)
  - Here you go! https://context.center/topics/indie-search/ (- Aram)
- [ ] Also check out [Merkle Search Tree](https://github.com/DavidBuchanan314/merkle-search-tree)
- [ ] Also https://github.com/mikeal/prolly-trees
- [ ] Promising: https://presearch.io/
- [ ] https://github.com/Mubelotix/admarus
- [ ] https://github.com/izihawa/summa
- [ ] Has anyone put [Common Crawl](https://commoncrawl.org/) on IPFS?
  - [ ] Should we make it available if not?
  - [ ] There's a ton in https://commoncrawl.org/the-data/examples/
- [ ] What is a good indexing system that works with IPFS?
- [ ] The stuff that Mauve and Quinn were talking about with respect to Prolly Trees sounds like you could load a few layers of an index and then be able to search pretty fast.
  - Links? (- Aram)
    - mauve explainifying: https://www.youtube.com/watch?v=TblRt1NA39U and slides from that https://blog.mauve.moe/slides/p2p-deebees/
- [ ] There were several mentions at Thing 2023 Interplanetary Databases of composable indices: because of content addressing, multiple parties can write to the same index without coordination.

---

<small>This work is part of [The Web Commons](/dFpEb1jeSrKp0Fx8IrdZAg).</small>
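
To make the composability question above concrete (Dave's full-text index, Dietrich's jankiness rating), here is a minimal sketch of what "joining over indices that weren't designed to work together" could look like. Everything in it is an assumption for illustration: the class names, the `search_ranked` helper, the example URLs, and above all the idea that both indices happen to key their entries by the same document identifier (a URL here, though a content address would play the same role). It says nothing about how real indices would be stored, fetched, or paid for.

```python
from dataclasses import dataclass, field


@dataclass
class FullTextIndex:
    """Hypothetical stand-in for Dave's index: term -> set of document keys."""
    postings: dict[str, set[str]] = field(default_factory=dict)

    def add(self, key: str, text: str) -> None:
        # Naive whitespace tokenisation, purely for illustration.
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(key)

    def lookup(self, term: str) -> set[str]:
        return self.postings.get(term.lower(), set())


@dataclass
class RatingIndex:
    """Hypothetical stand-in for Dietrich's index: document key -> jankiness."""
    scores: dict[str, float] = field(default_factory=dict)

    def rate(self, key: str, jankiness: float) -> None:
        self.scores[key] = jankiness


def search_ranked(term: str, text_index: FullTextIndex,
                  rating_index: RatingIndex) -> list[str]:
    """Query one index, then rank the hits using the other.

    The join works only because both indices share document keys.
    Unrated documents sort last; ties break on the key for stable output.
    """
    hits = text_index.lookup(term)
    unrated = float("inf")
    return sorted(hits, key=lambda k: (rating_index.scores.get(k, unrated), k))


# Two indices built by different parties, without coordination.
dave = FullTextIndex()
dave.add("https://a.example/", "prolly trees explained")
dave.add("https://b.example/", "Merkle trees and prolly trees")
dave.add("https://c.example/", "unrelated page")

dietrich = RatingIndex()
dietrich.rate("https://a.example/", 0.8)  # fairly janky
dietrich.rate("https://b.example/", 0.1)  # not very janky

results = search_ranked("prolly", dave, dietrich)
print(results)
```

The only thing the two indices have to agree on in this sketch is the key space, which is one way of reading the "enough commonality" requirement in the architecture overview. Whether that is sufficient in practice is exactly the open question.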