# The Legitimacy Engine: Balancing AI Measurements and Human Values in Algorithmic Funding
> "it is a well-known fact that those people who must want to rule people are, ipso facto, those least suited to do it."
> Douglas Adams, The Restaurant at the End of the Universe
---
## The Woman Whose House Burned for Making a List
In a village in Kenya, a field officer named Saida did her job. She compiled a list, a "vulnerability assessment" for a cash transfer program. Who was poorest? Who most needed help? She gathered data, made judgments, submitted names. Close relatives went first. Does that seem fair? "How else could we do it?" she asked. "Everyone is vulnerable, and my family needs help too." When the aid organization left and the money stopped, people got angry. They thought Saida had kept their money or blocked them from the list. They burned her house down and her uncle was found dead. Lists make favorites, and favorites have costs.
I've been thinking a lot about Saida while watching how our Web3 community attempts to solve the same problem at scale. Web3 is building digital lists, algorithms that decide which open-source contributors get funded, which public goods deserve support, which projects merit retroactive rewards. Billions of dollars now flow through these systems. **We've convinced ourselves that because the lists are made by code rather than by Saida, we've escaped her dilemma. But we couldn't be more wrong.**
## Public Goods Funding Is A Wicked Problem
A fundamental insight from the scientific study of algorithmic decision systems: **every funding algorithm makes a value judgment.** The choice of what to measure *is* the choice of what matters. As Octant founder Julian Zawistowski puts it bluntly:
> Funding is Governance is Ideology
When Optimism's Retro Funding uses gas fees as its primary impact metric, it makes a claim: *transaction volume is what matters.* When Deep Funding weights "originality" in its dependency graphs, it asserts: *novelty is more valuable than maintenance.* These are political positions disguised as engineering choices.
In the academic literature this is called the value alignment problem, and such problems are inherently social. Over fifty years ago, design theorists Horst Rittel and Melvin Webber identified a category of problems they called **"wicked problems"** [1]. Wicked problems (in contrast to "tame problems") have no definitive formulation. For example, calculating the optimal gas limit for a block is a tame problem; deciding who deserves block space is a wicked one. The way you frame the problem determines what solutions are possible. Wicked problems have no stopping rule: you never know if you've reached the optimal answer. And most importantly, solutions to wicked problems are not "true or false" but "good or bad". They are irreducibly political.
**Deciding what to fund** is an inherently wicked problem. You cannot mathematically prove that funding open-source libraries is "correct" while funding user onboarding is "wrong." You cannot optimize your way to consensus on whether maintenance work is as valuable as startup innovation. These are questions about values, and values require human judgment.
Our technocratic shift in algorithmic funding is built on a fallacy: the belief that public goods funding is a puzzle that yields to the right optimization function, when in fact it is a wicked problem. The moment you announce that gas fees determine funding, projects optimize for gas fees. The moment you weight GitHub commits, contributors game commit frequency. This is **Goodhart's Law** in action: "when a measure becomes a target, it ceases to be a good measure".
## Collective or Computer Decisions: A Brief History of Web3 PGF
To understand how we got here, let's briefly trace the history of public goods funding in Web3.
| Era | Key Strength | Pitfalls |
|:----------|:-------------------|:------------------------|
| **Democratic (2018-2024)** | Collective input & legitimacy | Attacks, fatigue, and bias toward insiders |
| **Technocratic (since 2024)** | Efficiency and consistency | Embedded biases and lack of transparency |
| **Hybrid (Proposed)** | Balanced and accountable | Implementation challenges |
_**Table 1:** Comparative analysis of the different eras of web3 public goods funding._
**Era One: The Democratic Experiment (2018 - 2024)**
Early experiments in decentralized funding were animated by an optimistic faith in collective intelligence. Quadratic Funding, DAO voting, badge-holder governance: harness the wisdom of the crowd. Let the community decide. The approach had genuine merits: it created legitimacy through participation, surfaced community priorities that top-down approaches missed, and distributed power away from gatekeepers. But the problems ran deeper than mere gaming.
Because identity on a blockchain is cheap, Quadratic Funding rounds became playgrounds for Sybil attacks, with bad actors spinning up thousands of fake wallets to drain matching pools. And even among verified humans, collusion rings emerged: cartels of projects cross-voted to extract maximum value. Grant rounds stopped being impact evaluations and started being extraction games.
But even in rounds with robust identity verification and no collusion, structural failures emerged. Turnout decreased over time, often to single-digit percentages of eligible participants, which meant motivated minorities dominated outcomes. The projects that won weren't necessarily the most impactful; they were the ones whose founders had the time and social capital to mobilize voters. "Wisdom of crowds" assumptions, which work reasonably well for estimating the weight of an ox, don't hold for evaluating whether a cryptographic library is well-architected. Most voters lacked the technical context to distinguish genuine contribution from impressive-sounding nonsense.
The social dynamics were worse. We often assume open source is a meritocracy, but the data suggests otherwise. In a 2017 GitHub survey of 5,500 contributors, 95% of respondents identified as men and just 3% as women. Furthermore, a study of 700,000 pull requests found that acceptance rates for women were higher (78.7%) than men's (74.6%) only when their gender was not identifiable; when they were outsider contributors and their gender was visible, their acceptance rate dropped significantly below men's. Democratic voting amplified rather than corrected these biases: people voted for names they recognized, and recognition flowed through English-language Twitter, Western conferences, and existing social networks. A brilliant contributor in Lagos or Jakarta, working in a timezone disconnected from the discourse, building tools for users who weren't on Crypto Twitter, had almost no path to visibility. The democratic era didn't democratize funding; it transferred gatekeeping power from explicit committees to implicit social networks.
Even where participants were honest, they faced cognitive collapse. In Optimism's early RetroPGF rounds, badgeholders confronted hundreds of projects demanding evaluation. Overwhelmed by the sheer informational load, many stopped doing due diligence entirely. They voted for projects they recognized, founders they'd met at conferences, names that appeared in their Twitter feeds.
**Era Two: The Technocratic Correction (Since 2024)**
The response was predictable. If humans are biased and exhaustible, then let machines decide and strip out the messy subjectivity. This is where Deep Funding, Optimism Retro Funding 7, Proof-of-Impact and similar systems enter. Their designers, talented and well-intentioned, believed they could engineer around the wickedness by creating "objective" allocation mechanisms.
But objectivity is the wrong goal. The real question is whether the algorithm's biases are legible, contestable, and democratically authorized. We know from AI ethics research that all algorithms embed the biases of their designers. What matters is whether anyone can see them, challenge them, and change them. On this question, the new systems fall short.
## Case Study: Deep Funding
Let me examine one system I've watched closely and participated in: **Deep Funding**. I want to be fair to the designers, who are navigating genuinely hard problems, while also being honest about the structural limitations I observed emerge during implementation.
### What Deep Funding Got Right
Before critiquing, I want to acknowledge the genuine innovations that Deep Funding brought. Vitalik Buterin correctly identified that human judgment doesn't scale for technical evaluation: you can't ask thousands of people to assess compiler optimizations. The insight that sparse expert labels can be extrapolated via machine learning is genuinely clever and points toward sustainable funding infrastructure rather than one-off grants. The "AI as the Engine, Humans as the Steering Wheel" framing articulates exactly the right north star and opened up a novel design space for algorithmic funding, one this essay is part of. The team also operated under real constraints that explain many of their choices. The dependency graph approach, despite being opinionated, creates legible structure that can be audited and improved. These are hard problems, and earnest attempts to solve them deserve respect even when they fall short.
### Deep Funding's Architecture & Bitter Lesson
Rather than asking a crowd to vote on every project, the system assembles a jury of experts who label a small subset of contribution pairs: does Alice deserve twice as much credit as Bob for this codebase? Machine learning models then extrapolate from these sparse labels to allocate credit across vast dependency networks, including Ethereum's graph of 40,000+ edges. Funds flow upstream automatically, creating self-sustaining reciprocity rather than one-off grants, while AI is used to *scale* human judgment rather than *replace* it.
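To make the extrapolation step concrete, here is a deliberately tiny sketch in Python. The project names and ratio labels are invented, and the least-squares fit over log-ratios is only a stand-in for Deep Funding's actual models; it simply shows how a handful of pairwise judgments can be turned into a full, normalized allocation.

```python
"""Toy illustration (not Deep Funding's real model): fit per-project log-weights
so that fitted credit ratios match the sparse juror labels, then normalize."""
import numpy as np

nodes = ["geth", "solidity", "ethers.js", "openzeppelin", "hardhat"]
index = {name: i for i, name in enumerate(nodes)}

# Sparse juror labels: (a, b, ratio) means "a deserves ~ratio times the credit of b".
labels = [
    ("geth", "hardhat", 3.0),
    ("solidity", "ethers.js", 1.5),
    ("openzeppelin", "hardhat", 2.0),
    ("geth", "solidity", 1.2),
]

# Linear system: w_a - w_b = log(ratio) over unknown log-weights w.
A = np.zeros((len(labels) + 1, len(nodes)))
b = np.zeros(len(labels) + 1)
for row, (a_name, b_name, ratio) in enumerate(labels):
    A[row, index[a_name]] = 1.0
    A[row, index[b_name]] = -1.0
    b[row] = np.log(ratio)
A[-1, :] = 1.0  # anchor the overall scale: log-weights sum to ~0

log_w, *_ = np.linalg.lstsq(A, b, rcond=None)
weights = np.exp(log_w)
allocation = weights / weights.sum()  # shares of the funding pool

for name, share in sorted(zip(nodes, allocation), key=lambda x: -x[1]):
    print(f"{name:>14}: {share:.1%}")
```

Even in this toy, design choices such as which pairs get labeled and how disagreements are averaged quietly shape the final split, which is exactly where the structural issues below enter.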
However, even when the conceptual vision is clear, the practical design of a pilot inevitably encodes assumptions and constraints into the system. And without deep expertise in both machine learning and democratic governance, these encoded biases can compound in ways that undermine the mechanism's legitimacy. This happened to Deep Funding, where three structural issues emerged during the pilot phase:
**First, jury selection.** The mechanism's legitimacy depends entirely on whose judgments get amplified by the model. Deep Funding selects jurors through "nomination, invitation, or application", a process that inherently favors insiders and people already legible to the technical team. Rather than representative democracy, this creates a credentialed expert panel with extra steps. The composition of that panel determines everything downstream, yet the selection criteria remain opaque to the broader community. Ultimately, the subset of experts willing to participate often has a different distribution of perspectives than the full set of invited experts, but this selection bias stays invisible.
**Second, labeling quality.** After a year of operation, only ~600 data points had been collected, and 64.5% of comparisons received only a single annotation, meaning most jurors' opinions received no cross-validation. Statistically, the true error margins were around ±11%, high enough that the signal risks being drowned out by noise (a toy simulation below illustrates why single annotations leave margins this wide). Ultimately, resource constraints and project management challenges bottlenecked the process, and the system's technical elegance couldn't overcome practical implementation gaps.
**Third, the ML model itself.** The machine learning algorithm that extrapolates from sparse labels to full allocations is a black box, but the deeper problem is that it can't be validated. With so little evaluation data, there's no reliable ground truth to test against, and selecting a "winning" model is statistically meaningless when your validation data is this noisy. The result is an allocation mechanism that is neither interpretable nor verifiable: participants cannot trace why a project received a specific amount, and no one can confirm whether the model captures genuine expert judgment or artifacts of sparse data.
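To see why single annotations matter so much, here is a toy simulation (the noise level is an assumption; the ~600-label and ±11% figures above come from the pilot, not from this code). Each simulated juror observes a pair's true credit ratio with noise; with one annotation there is nothing to average against, so the per-pair error is just the raw juror noise.

```python
"""Toy simulation: per-pair error shrinks only when comparisons get multiple annotations."""
import numpy as np

rng = np.random.default_rng(0)
true_ratio = 2.0        # ground-truth credit ratio for some pair (invented)
juror_noise_sd = 0.25   # assumed juror noise on the log scale

def estimated_ratio(n_annotations: int) -> float:
    """Average n noisy log-ratio annotations and return the implied ratio."""
    noisy_logs = np.log(true_ratio) + rng.normal(0.0, juror_noise_sd, size=n_annotations)
    return float(np.exp(noisy_logs.mean()))

for n in (1, 3, 10):
    estimates = np.array([estimated_ratio(n) for _ in range(10_000)])
    relative_error = np.abs(estimates - true_ratio) / true_ratio
    print(f"{n:>2} annotation(s): median relative error ~ {np.median(relative_error):.0%}")
```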
### Deep Funding II Comeback?
TBD
## Algorithmic Accountability
There's a name for what happens when decision systems become this opaque. Philosopher Helen Nissenbaum called it **the Problem of Many Hands** [2].
When something goes wrong, whether a deserving project gets zero funding or a bad actor games the system, who's responsible? In a traditional grant committee, you can read the meeting minutes, identify the decision-makers, demand explanations. In an algorithmic regime, responsibility diffuses. The developer blames the training data. The data labeler blames the guidelines. The user blames the "black box."
This matters practically as well as theoretically. The **EU AI Act** (2024) specifically addresses opacity in high-stakes algorithmic systems [3]. [Article 13](https://artificialintelligenceact.eu/article/13) requires that affected parties understand *why* decisions were made. [Article 14](https://artificialintelligenceact.eu/article/14) mandates "human oversight" for systems that determine access to essential services. And [Article 86](https://artificialintelligenceact.eu/article/86/) establishes a right to explanation of individual decision-making.
Current Web3 funding systems, relying on high-dimensional correlations that no human can intuitively trace, fail these standards. This matters for institutional legitimacy as much as legal compliance. When participants can't understand why they were ranked where they were, when there's no clear chain of accountability, the system becomes opaque. Accountability requires not just transparency but participation, which brings us to Elinor Ostrom's work on commons governance.
## Local Knowledge and Lived Experience
Nobel laureate Elinor Ostrom [4] studied why top-down technocratic solutions might miss important information from local communities.
Ostrom spent her career studying *commons*, shared resources managed by communities. She found that sustainable governance required local knowledge. Outsiders, however well-intentioned, lack the context to make good rules. Her third design principle for successful commons holds that "most individuals affected by the operational rules can participate in modifying the operational rules."
This principle addresses both fairness and epistemic accuracy. Local participants possess information that distant rule-makers lack. They know which contributor is genuinely struggling and which is gaming the system. They understand context that shows up nowhere in the data. When Deep Funding's technical team constructs the dependency graph, or Optimism's engineers select metrics, they write operational rules without this participation.
The result is what Ostrom predicted: gaming, disengagement, and institutional decay. People don't just object to being excluded from decisions about their fate. They also make the system work worse, because their knowledge never gets incorporated.
A legitimate funding system must be **plural**, allowing different communities to define "impact" differently rather than imposing a single centralized metric [4].
## What the Science of Participatory Budgeting Teaches Us
Since 1989, when Porto Alegre, Brazil pioneered participatory budgeting, thousands of cities worldwide have run participatory budgeting experiments: using citizen assemblies to give affected communities real authority over how money gets spent.
Studies show that it works. Porto Alegre saw major improvements in sanitation, education, and health, with resources flowing to poorer neighborhoods that traditional budgeting had ignored. When residents set priorities, they often fund what technocrats miss. However, after 35 years of practice, these studies have also exposed real problems. Elite capture and lack of representation remain risks: the educated and articulate tend to dominate citizen assemblies, recreating the same power dynamics under a participatory label. Turnout skews toward people already engaged in civic life. And evaluating complex budget trade-offs is genuinely exhausting for ordinary participants, who often lack the time or expertise.
The cities that do it well share some patterns: they usually facilitate real deliberation instead of just holding votes, they use tiered assemblies that start local and build upward, they employ staff to help translate community priorities into actual workable proposals, and they run multi-year cycles that let people build trust and learn from mistakes. Digital platforms like Barcelona's Decidim reached more people but created new problems: organized groups gaming the system, users clicking through without thinking, and participation scaling faster than the infrastructure to support it. It is a wicked problem nonetheless.
However, one big lesson is that quality matters more than scale. A hundred people deliberating well beats a thousand voting blind, because participation without structure can reproduce old pathologies. That's why the framework below invests as much in deliberation infrastructure as in measurement tools.
## The Two-Stage Framework: Toward Legitimate Algorithmic Funding
I believe the path forward requires separating two activities we've been conflating in public goods funding: **scientific measurements** and **value judgments.** Bailey Flanigan, a researcher at the intersection of computer science and political theory, articulates this distinction [5]. Effective democratic decision-making involves two distinct stages:
1. **The Scientific/Measurement Stage:** Establishing empirical facts, mapping trade-off curves, modeling counterfactuals. What are the actual consequences of different choices?
2. **The Value Judgment Stage:** Deciding which point on that trade-off curve to choose. What should we prioritize given those consequences?
Algorithms excel at the first but cannot legitimately perform the second. Asking an AI to decide allocation means asking it to make value judgments, a category error. When scientists or their code become final decision-makers, they gain unchecked political authority they neither deserve nor want. We saw this problem during COVID-19 when public health experts were forced into value trade-offs like whether to close schools.
### Stage One: Scientific Measurements as Public Goods
We cannot value what we cannot measure, so let AI do what it does best: map dependency graphs, track onchain activity, model counterfactuals. What would Ethereum look like if this library didn't exist?
The key is using AI as a **defense against gaming**, not as a judge. Think of it as an anomaly detector, not a rigid filter. Suspicious clustering that suggests collusion? Flag it for human review. Behavioral patterns that look like Sybil attacks? Flag it for human review. The output is information, not decisions. Humans get better data to work with but keep the final say.
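A minimal sketch of what "flag, don't decide" could look like in code. The features and thresholds are invented for illustration rather than taken from any production Sybil detector; the point is the shape of the output, which is a list of reasons for human reviewers, never an allocation.

```python
"""Anomaly flags as information for Stage Two reviewers, not as funding decisions."""
from dataclasses import dataclass

@dataclass
class ContributionProfile:
    project: str
    unique_funders: int    # distinct addresses contributing
    funder_overlap: float  # share of funders shared with other projects in the round
    burst_ratio: float     # share of activity arriving within a single 24h window

def review_flags(profiles: list[ContributionProfile]) -> list[dict]:
    """Return flags with human-readable reasons; humans decide what to do with them."""
    flags = []
    for p in profiles:
        reasons = []
        if p.funder_overlap > 0.6:
            reasons.append("funder set overlaps heavily with other projects (possible collusion ring)")
        if p.burst_ratio > 0.8 and p.unique_funders > 100:
            reasons.append("contributions arrive in one burst across many wallets (possible Sybil farm)")
        if reasons:
            flags.append({"project": p.project, "reasons": reasons, "status": "needs human review"})
    return flags

print(review_flags([
    ContributionProfile("HypeChain", unique_funders=450, funder_overlap=0.72, burst_ratio=0.91),
    ContributionProfile("LibP2P-Brazil", unique_funders=120, funder_overlap=0.05, burst_ratio=0.10),
]))
```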
### Stage Two: Resilient Deliberation and The "Tired Juror"
If Stage One gives us the data, Stage Two requires a legitimate human decision. But this brings us to the hardest constraint in governance: **Cognitive Bandwidth.** We are currently witnessing an attack vector I call **"Judge Tiredness."** When we ask a governance committee to evaluate 500 complex grant proposals in a week, we aren't getting 500 decisions; we are getting cognitive collapse. Overwhelmed humans default to heuristics: they vote for names they recognize, they follow the herd, or they check out entirely.
To solve this, we cannot rely on a single mechanism. We need a **Legitimacy Stack** that adapts to the scale of the decision, trading off high-touch human deliberation against high-speed AI automation based on the volume of decisions.
#### Tier 1: The Citizen Assembly (Sortition)
*For Setting the Rules (1 - 10 Decisions)*
For the most "wicked" questions, there is no substitute for human eyes and human debate. Questions like *"Should we prioritize privacy or auditability?"* or *"What are the criteria for this round?"* cannot be automated.
Here, we use **Sortition**. We select a representative jury of ~30-50 people. Because the number of decisions is small (Tier 1), we can demand high engagement. These jurors are paid for their time, briefed by experts with conflicting views (through the Stage One "Information Phase"), and asked to deliberate. Their output is a **Value Manifesto**, the constitutional principles that guide the lower tiers.
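A minimal sketch of what drawing such a jury could look like, with invented strata and quotas. Real sortition algorithms, such as the fair-selection methods in [5], additionally balance overlapping attributes and equalize selection probabilities; this only shows the basic idea of random draws within stakeholder quotas.

```python
"""Stratified sortition sketch: fill each stakeholder quota by random draw (illustrative only)."""
import random

GROUPS = ("protocol_dev", "dapp_builder", "researcher", "end_user")
REGIONS = ("africa", "asia", "europe", "latam", "north_america")

# Invented candidate pool: 25 volunteers per (group, region) combination.
candidate_pool = [
    (f"cand-{i}", group, region)
    for i, (group, region) in enumerate(
        (g, r) for g in GROUPS for r in REGIONS for _ in range(25)
    )
]

quotas = {"protocol_dev": 10, "dapp_builder": 10, "researcher": 10, "end_user": 10}

def draw_jury(pool, quotas, seed=42):
    """Randomly sample each stakeholder quota from the eligible candidates."""
    rng = random.Random(seed)
    jury = []
    for group, quota in quotas.items():
        eligible = [c for c in pool if c[1] == group]
        jury.extend(rng.sample(eligible, quota))
    return jury

jury = draw_jury(candidate_pool, quotas)
print(f"{len(jury)} jurors drawn across {len({j[2] for j in jury})} regions")
```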
#### Tier 2: Augmented Democracy (Simocracy)
*For Allocating the Funds (10 - 1,000 Decisions)*
Once the Value Manifesto is set, we face the list-making problem. A human jury cannot read 500 whitepapers without degrading into randomness. This is the domain of **Augmented Democracy**.
César Hidalgo proposes that we don't replace the human; we give them a staff. In projects like **Simocracy**, participants use "Digital Twins" or Sims.
The mechanism works on a "One Person, One Agent, One Decision" basis:
1. **Calibration:** You train your Sim on the *Value Manifesto* created in Tier 1.
2. **Simulation:** The Sims enter a "Sim Senate." They digest the thousands of pages of technical documentation, debate amongst themselves, and negotiate.
3. **Ratification:** The human user receives a digest: *"Your Sim recommends funding Project X because it aligns with your preference for security tools, but notes that Project X has high developer churn."*
The human remains the judge, but the AI acts as the Clerk, doing the reading, summarizing the precedents, and highlighting the contradictions.
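The following sketch shows only the control flow of that judge-and-clerk loop, not a real LLM Sim: the scoring is a placeholder weighted sum over invented criteria, and the names are made up. What matters is the shape of the interaction: the Sim proposes with a legible rationale, and the human ratifies or overrides.

```python
"""Tier 2 control flow sketch: calibrate, let the Sim propose with a rationale, human ratifies."""
from dataclasses import dataclass, field

@dataclass
class Project:
    name: str
    scores: dict  # criterion -> score in [0, 1], produced by Stage One measurement

@dataclass
class Sim:
    manifesto_weights: dict                        # criterion -> weight from Tier 1 + calibration
    decisions: dict = field(default_factory=dict)  # project -> human verdict

    def recommend(self, project: Project) -> dict:
        total = sum(self.manifesto_weights.get(k, 0.0) * v for k, v in project.scores.items())
        rationale = ", ".join(
            f"{k}={v:.2f} (your weight {self.manifesto_weights.get(k, 0.0):.1f})"
            for k, v in project.scores.items()
        )
        return {"project": project.name, "score": round(total, 3), "rationale": rationale}

    def ratify(self, recommendation: dict, human_verdict: str) -> bool:
        """The human keeps the final say; the Sim only records the verdict."""
        self.decisions[recommendation["project"]] = human_verdict
        return human_verdict == "approve"

sim = Sim(manifesto_weights={"security": 0.8, "maintenance": 0.6, "innovation": 0.2})
rec = sim.recommend(Project("LibP2P-Brazil", {"security": 0.70, "maintenance": 0.94, "innovation": 0.30}))
print(rec)
print("funded" if sim.ratify(rec, human_verdict="approve") else "held for review")
```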
#### Tier 3: Liquid AI Democracy (Autopilot)
*For Micro-Governance (1,000 - ∞ Decisions)*
Finally, we reach the scale of the "Long Tail"—tipping individual GitHub commits, rewarding minor bug fixes, or streaming continuous funding to a dependency graph. Here, human ratification is impossible; the friction is too high.
At this scale, the Sims run on **Autopilot**. The "Digital Twin" votes autonomously, acting as a proxy based on the user's past behavior and stated values. The agent's authority is derived directly from the user's Tier 2 behavior. It is **Liquid Democracy** where the delegate is not another politician, but your own aligned software agent.
However, we must be intellectually honest about the "Autopilot" approach. Current Large Language Models (LLMs) are not "digital twins" in a literal sense; they are probabilistic engines prone to hallucination, sycophancy (echoing the user's bias), and simplification of complex technical nuances. To maintain legitimacy, Autopilot cannot mean "unattended."
Much like a pilot in a modern cockpit, the human remains the ultimate authority, monitoring the system for "instrument failure." This is managed through **Exception-Based Governance**:
* **Boundary Triggers**: The agent executes autonomously within narrow parameters but must "hand back the stick" to the human if it encounters a project that contradicts the Value Manifesto or involves a high-stakes funding threshold.
* **Aggregated Audits**: Instead of reviewing 1,000 individual micro-tips, the user performs periodic "spot checks" on batches of decisions, ensuring the agent’s trajectory remains aligned with their evolving intent.
* **Dynamic Mandates**: The agent's authority is not a blank check; it is a temporary lease of power that must be renewed through active participation in Tier 1 and Tier 2 deliberation.
By automating the mundane without abandoning the cockpit, we ensure that even at the smallest scales, every cent distributed is still tethered to human values. Or, more simply, we reconnect with Deep Funding's initial north star: AI as the Engine, Humans as the Steering Wheel.
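A sketch of those exception rules under assumed parameters (the $100 boundary, the conflict flag, and the mandate check are all placeholders): the autopilot executes only within its bounds, escalates everything else with a logged hand-off, and loses its authority entirely if the user stops participating upstream.

```python
"""Tier 3 exception-based governance sketch: execute within bounds, escalate and log the rest."""
from dataclasses import dataclass

@dataclass
class MicroGrant:
    recipient: str
    amount_usd: float
    manifesto_conflict: bool  # flagged upstream if it contradicts the Value Manifesto

HIGH_STAKES_THRESHOLD = 100.0  # assumed boundary: above this, always hand back to the human

def autopilot(grants, mandate_active: bool):
    """Split grants into executed vs escalated, producing an open exception log."""
    if not mandate_active:  # dynamic mandate: no recent participation, no autonomy
        return [], list(grants), ["mandate expired: all decisions returned to the human"]
    executed, escalated, exception_log = [], [], []
    for g in grants:
        if g.manifesto_conflict or g.amount_usd > HIGH_STAKES_THRESHOLD:
            escalated.append(g)
            exception_log.append(f"hand-off: {g.recipient} (${g.amount_usd:.0f}) needs human review")
        else:
            executed.append(g)
    return executed, escalated, exception_log

done, held, log = autopilot(
    [MicroGrant("fix-typo-pr", 5.0, False), MicroGrant("new-l2-bridge", 250.0, False)],
    mandate_active=True,
)
print(f"executed {len(done)}, escalated {len(held)}")
print("\n".join(log))
```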
### Making The AI Engine More Transparent
A reasonable objection arises: if we critique traditional ML models as unaccountable black boxes, aren't LLM-based Sims merely a more sophisticated version of the same problem?
The contradiction vanishes only if we apply transparency at the **interface layer** across the entire stack. The key is identifying where opacity is a tolerable byproduct of complexity and where human oversight must be absolute. We propose to implement a **Descending Requirement of Transparency**.
#### Tier 1: Social Transparency (Audit the Intent)
In the **Citizen Assembly**, transparency is not about code; it is about **procedural legitimacy**.
* **The Information Phase**: The data provided to jurors from Stage 1 (Measurement) must be fully auditable. We must expose the "opinionated" nature of our metrics, admitting for instance that "impact" is measured by code commits rather than community sentiment.
* **The Manifesto**: The transformation of raw human debate into a machine-readable Value Manifesto must be a public "ceremony". Transparency here ensures that the "intent" of the engine is a democratic output, not a hard-coded preference by the developers.
#### Tier 2: Algorithmic Interpretability (Audit the Reasoning)
Tier 2 (Sims) can tolerate internal neural-network opacity if, and only if, the system maintains **Interpretable Agency**:
1. **Sandwiching**: Humans control the input (Manifesto) and the output (Ratification). The "black box" is relegated to a synthesis tool, never a sovereign decider.
2. **Reasoning Logs**: Sims must generate legible chain-of-thought rationales. If an agent ranks Project X higher than Project Y, it must surface the logic: *"Project X aligns with your 0.8 weight on 'security,' despite its lower 'innovation' score."* This provides a "forensic trail" for the human judge to verify.
3. **Value Dashboards**: By utilizing frameworks like *Values in the Wild*, we can visualize the agent's expressed priorities in real time. This turns "alignment" from a philosophical hope into a trackable metric, showing exactly when a model begins to drift from its original mandate (a minimal drift-check sketch follows this list).
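To make the Value Dashboard idea concrete, here is a minimal drift check (the mandate weights, the recent recommendations, and the tolerance are all invented): it compares the priorities the agent has actually been expressing against its mandate and flags criteria that have drifted beyond tolerance.

```python
"""Value-dashboard sketch: flag criteria where expressed priorities drift from the mandate."""
mandate = {"security": 0.8, "maintenance": 0.6, "innovation": 0.2}

def expressed_priorities(recent_recommendations):
    """Average the criterion weights implied by the agent's recent recommendations."""
    totals = {k: 0.0 for k in mandate}
    for rec in recent_recommendations:
        for criterion, weight in rec.items():
            totals[criterion] += weight
    n = max(len(recent_recommendations), 1)
    return {k: v / n for k, v in totals.items()}

def drift_report(recent_recommendations, tolerance=0.15):
    expressed = expressed_priorities(recent_recommendations)
    return {
        k: {"mandate": mandate[k], "expressed": round(expressed[k], 2),
            "drifting": abs(expressed[k] - mandate[k]) > tolerance}
        for k in mandate
    }

recent = [
    {"security": 0.50, "maintenance": 0.40, "innovation": 0.50},
    {"security": 0.55, "maintenance": 0.50, "innovation": 0.45},
]
print(drift_report(recent))  # security and innovation drift from the mandate; maintenance does not
```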
#### Tier 3: Systemic Transparency (Audit the Boundaries)
In **Autopilot**, where decisions occur at a velocity that precludes individual human audit, transparency shifts to **Systemic Accountability**.
* **Open Exception Logs**: Systemic transparency is ultimately defined by knowing when the machine "knows its limits." Every time the Autopilot triggers a boundary, handing control back to the human because a project was too "wicked" or high-stakes, that hand-off must be logged and publicized.
By stacking these requirements, we ensure that while the *calculations* may be complex, the *authority* is always visible. We use Citizen Assemblies to ensure the **values** are human, Augmented Democracy to ensure the **analysis** is faithful, and Autopilot to ensure the **reach** is scalable, while keeping the "steering wheel" firmly in human hands.
## What Would This Look Like in Practice?
Let me sketch a concrete example that walks through all three tiers.
**Stage One: Scientific Measurement**
Imagine a funding round for Ethereum developer tooling. The measurement infrastructure produces a transparency report: here are the dependency graphs showing which libraries underpin which applications; here are counterfactual impact estimates modeling what would break if each project disappeared; here are anomaly flags where contribution patterns suggest possible gaming. The AI has identified three projects with suspicious Sybil-like behavior—here's the data, evaluate for yourself. This is information, not judgment.
**Tier 1: The Citizen Assembly**
A sortition algorithm selects a jury of 40 people, stratified to represent different stakeholder groups (protocol developers, dApp builders, researchers, end users, and community members from different regions) and randomly drawn within each stratum to prevent insider capture.
Instead of evaluating hundreds of projects, they engage in structured deliberation over *principles*: How should we weight maintenance versus innovation? How do we value developer experience versus end-user impact? What counts as "contribution" in an ecosystem context? Should geographic diversity factor into allocation?
Their output is a **Value Manifesto**, the constitutional principles that constrain the lower tiers. Importantly, their deliberation is public, recorded, and contestable. Different communities can make different choices: a Latin American developer collective might weight maintenance heavily because reliable infrastructure matters more than new features in their context.
In 2024, researchers demonstrated **Collective Constitutional AI**: a representative sample of ~1,000 Americans used deliberative voting to generate a constitution for an AI system. The model trained on this public input showed lower bias across nine social dimensions compared to the standard corporate-designed constitution—without sacrificing capability [CCAI]. The "wisdom of the crowd" aligned the system better than expert-designed alternatives.
**Tier 2: Augmented Democracy (Simocracy)**
Once the Value Manifesto is set, participants enter Tier 2. Maria, a developer from São Paulo, spends 20 minutes calibrating her **delegate agent** (or "Sim")—ranking five sample projects to teach it her priorities, answering questions about how she weighs the Manifesto's principles against each other.
Her Sim then evaluates 847 funding applications overnight, producing a digest:
> "I recommend funding LibP2P-Brazil because it aligns with your stated preference for infrastructure maintenance (weight: 0.8 in your calibration), and 94% of their commits are classified as maintenance work. However, I flag that the team had governance disputes in Q2—you may want to review their resolution process. I recommend *against* funding HypeChain despite their high GitHub activity, as their contribution pattern triggered the Sybil anomaly detector and their stated mission conflicts with the Manifesto's emphasis on end-user benefit over speculation."
Maria reviews the top 20 recommendations, overrides two where she disagrees with the Sim's reasoning, and ratifies the rest. The key insight: the Sim *proposes*, Maria *disposes*. She retains veto power and can audit the reasoning chain.
The Sims then enter a **Sim Senate** where they negotiate, surface disagreements, and identify where human values might conflict through structured deliberation at scale.
**Tier 3: Liquid AI Democracy (Autopilot)**
For the remaining 500 micro-allocations under $100 (tipping individual commits, streaming funds to dependency maintainers), Maria's Sim operates on autopilot, making decisions based on her calibrated preferences without requiring ratification.
This is **Liquid AI Democracy**: delegation not to another politician, but to your own aligned software agent. Legitimacy is derived directly from Maria's Tier 2 behavior and remains bounded by the Tier 1 Manifesto.
**The Legitimacy Stack in Summary**
| Tier | Scale | Mechanism | Human Role | AI Role |
|------|-------|-----------|------------|---------|
| 1 | 1-10 decisions | Sortition jury | Final authority | Information provision |
| 2 | 10-1,000 decisions | Sim delegation + ratification | Veto power, spot-checks | Proposal, analysis |
| 3 | 1,000+ decisions | Autopilot | Periodic audit | Autonomous within bounds |
By stacking these three tiers, we solve the paradox of scale: Sortition ensures the *values* are human-deliberated. Augmented Democracy ensures the *analysis* is thorough. Autopilot ensures the *reach* extends to the long tail without cognitive collapse.
## Building the Science: Beyond Ad-Hoc Experimentation
The sketch above of course leaves open significant design questions: how exactly do you stratify stakeholder groups fairly? How do you prevent the deliberation itself from being captured? How do we prevent AI hallucination? How do we measure values and ensure transparency? All these require ongoing research and experimentation.
The AI research community accelerated progress by building shared infrastructure: datasets like ImageNet, benchmarks that allowed comparison across labs, norms of publishing findings including failures. I believe that algorithmic funding needs its own **"ImageNet moment."** This is especially critical because wicked problems cannot be "solved," only managed through ongoing adaptation. Every mechanism will be gamed eventually. The question is whether we detect the gaming quickly enough to respond. This means:
**Shared datasets.** Anonymized preference data from funding rounds should be public, allowing researchers to test new mechanisms. Currently, each project guards its data jealously. We're making the same mistakes in isolation, over and over.
**Continuous evaluation.** Every funding round should be treated as an experiment, with pre-registered hypotheses and post-hoc analysis. What worked? What didn't? What would we change? This discipline is how engineering fields mature.
**Red-teaming and adversarial testing.** Before deployment, mechanisms should face systematic attempts to break them. What would a well-resourced attacker do? What collusion strategies become viable? What happens when Goodhart's Law kicks in?
**Academic bridges.** We need to stop reinventing mechanism design in governance forums. Collaborate with researchers who've spent careers on social choice theory, computational democracy, deliberative systems. Don't dismiss peer review as gatekeeping. Treat it as quality control.
The goal is a **sim-to-real pipeline**: test mechanisms in simulations, validate against historical data, pilot with small stakes, scale only what survives scrutiny.
## Conclusion: The Legitimacy Engine in Practice
The fantasy persists that with enough data, enough compute, enough cleverness, we can engineer our way past the hard work of value judgments. We can't. Wicked problems don't have solutions. They have resolutions, temporary and contestable, that require ongoing democratic engagement to maintain.
As we move toward a future of algorithmic funding, our goal is not to build a machine that decides for us, but to build a "Legitimacy Engine": a stack that uses scientific measurement to inform, representative deliberation to provide moral authority, and augmented democracy to scale that authority without burning out the participants.
For the practitioner, building this engine is less about writing the perfect optimization function and more about answering a series of uncomfortable, essential questions. Before deploying any system, we must hold it against this alignment framework to ensure it serves the community rather than just the code:
### The Practitioner’s Alignment Framework
* **Acknowledge the Wickedness**: Does your documentation admit this is a political trade-off rather than an "objective" optimization?
* **Audit Hidden Values**: Who decided which metrics matter, and have you documented what those choices exclude?
* **Predict Goodhart’s Law**: How will you detect when your chosen metrics stop correlating with actual impact and start being gamed?
* **Ensure Meaningful Recourse**: Is there a clear, human-led appeals process for those the algorithm misses?
* **Separate Measurement from Judgment**: Does the AI provide information for humans to review, or does it hold the final authority over the purse strings?
* **Resist Capture**: Are you using sortition or stratified sampling to prevent the same "insider" groups from setting the rules?
* **Bridge the Translation Gap**: Is the process of converting human values into mathematical weights transparent and contestable?
* **Fund Rails, Not Just Lists**: Does the mechanism support long-term reciprocity or does it merely pick a new set of favorites?
* **Embrace Plurality**: Are different sub-communities empowered to weight impact differently based on their local context?
* **Build for Adaptation**: How will the mechanism learn from its failures and evolve as the community changes?
Lastly, Saida's neighbors burned her house down because she made a list without collective legitimacy. They weren't objecting to the technology, since she was using paper and pen. They were objecting to the authority she assumed: the right to decide, on behalf of a community, who mattered and who didn't.
We can build better measurement tools. We can build more representative processes. We can build infrastructure for reciprocity instead of ranking. But we cannot build a machine that legitimately makes these choices for us without us. The algorithm can count. It cannot decide what's worth counting. That's our job.
## References
[1] Rittel, H. W., & Webber, M. M. (1973). Dilemmas in a general theory of planning. *Policy Sciences*, 4(2), 155-169.
[2] Nissenbaum, H. (1996). Accountability in a computerized society. *Science and Engineering Ethics*, 2(1), 25-42.
[3] European Parliament & Council. (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act).
[4] Ostrom, E. (1990). *Governing the commons: The evolution of institutions for collective action*. Cambridge University Press.
[5] Flanigan, B., et al. (2021). Fair Algorithms for Selecting Citizens' Assemblies. *Nature*.
[6] Flanigan, B., et al. (2021). Fair, Manipulation-Robust, and Transparent Sortition. *arXiv preprint arXiv:2102.06646*.
[7] Small, C. T., et al. (2021). Pol.is in Taiwan: Computational Democracy in Practice. *Computational Democracy Project*.
[8] Hidalgo, C. A., et al. (2024). Large Language Models as Agents for Augmented Democracy. *Philosophical Transactions of the Royal Society A*.
[9] Ruddick, W. O. (2024). Grassroots Economics. *Grassroots Economics Research*.
[10] Buterin, V. (2025). AI as the Engine, Humans as the Steering Wheel. *vitalik.eth.limo*.
[11] Wampler, B. (2007). *Participatory Budgeting in Brazil: Contestation, Cooperation, and Accountability*. Penn State Press. See also: Cabannes, Y. (2004). Participatory budgeting: a significant contribution to participatory democracy. *Environment and Urbanization*, 16(1), 27-46.