# Challenges in Software Engineering

(notes for the final lecture in my Algorithm Analysis course, May 9, 2024)

The World Wide Web creates value by aggregating information. In the early 2000s Google revolutionized content aggregation. Some of my friends still remember where they were when they first tried Google search. Google understood that its content aggregation was built on top of a well-working knowledge-commons. Famously, Google captured its unwavering commitment to this knowledge-commons in the slogan "Don't be evil" (now abandoned). Brin and Page warned in 1998 that "we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers."

Now, 25 years later, not least due to the rise of generative AI, what used to be the world's knowledge-commons starts resembling the dystopia of the movie The Matrix, where AIs and corporations feed on people who have lost all agency.

What can we do about this?

There is a lot of enthusiasm in software engineering circles around the idea of decentralization. Moreover, not least thanks to LLMs as code generators ("The hottest new programming language is English"), it has never been as easy as it is today to develop, deploy and maintain software. In principle, we don't even need "the cloud" anymore; we could create our own, just by connecting our phones. Technology for this already exists.

So why do the many decentralization projects not make more progress? Given that we have the technology and enthusiasm, **what is missing**?

I believe the most important question to understand is the following.

**What value do online platforms provide? Can this value also be provided in a decentralized way?**

As much as I would like to investigate these questions from a mathematical point of view, I must admit that I do not know how to start.
I superficially browsed the literature in economics, network science, cognitive science, psychology and biology and found a lot of beautiful applied math ... but nothing that stood out to me as being able to provide an answer to the questions above. **Let me know if you have any pointers that I should pursue.**

This semester, I started to try to answer these questions in an experimental way by developing software that attempts to decentralize services that are currently provided by centralized platforms. In particular, with students in my lab, I am looking at two topics:

- Decentralizing Moderation on Social Media
- Aggregating Distributed Information

I believe that these problems are at the heart of an answer to the question of why decentralization does not make more progress.

These problems are difficult to solve. They require finding the right balance between centralization and decentralization.

From an engineering point of view, for any solution to be trustworthy, it must be free and open source software maintained by a diverse community. We must find new ways of building interoperable software systems and of sharing the effort of maintenance and continued development.

From an economics point of view, we need to find new ways of rewarding individual contributions to building the knowledge-and-software-commons.

From a network science point of view, note that, by definition of the problem, both the decentralizing-moderation project and the aggregating-distributed-information project will have to introduce aspects of centralization into a decentralized network. It will be important to conceptualize the tension between centralization and decentralization. Related to this is the observation that networks consist of processes that operate on different time-scales. We need to find ways to think about how to reconcile short-term and long-term goals. Maybe the general question here is the following.

How can we build mathematical models that help us understand how to extract value from a network without affecting the processes that keep the network alive and healthy?

Without going into further details now, the last question also connects to areas such as biology, anthropology, psychology, philosophy, ethics (and many more).

## Appendix

In class, I used a [picture](https://excalidraw.com/#json=RymgzYkjMi99_DFazJXqb,PDclODJZtbQ2lxpkfCZyiQ) similar to

![image](https://hackmd.io/_uploads/HkQt36vHA.png)

to discuss the relationship between, say, Google and the knowledge-commons, or LLMs and the free-and-open-source software community.

**Example:** The output of LLMs can be very valuable, in particular when they are used for generating code.

- *How much of this value should we consider as being generated and how much as being extracted?* [^how-much]

  When GPT-3 and GPT-4 came out and we were awed by the quality of their code, the first reaction was to attribute the progress and innovation to the LLMs themselves. But it didn't take long for software engineers to point out that the quality of the generated code depended on the quality of the code used for training. This code had been carefully collected and curated, over decades, by the software engineering community on free and open source platforms such as GitHub and Stack Overflow, using old-fashioned symbolic AI including compilers, type checkers and other formal methods.

- *How do negative externalities produced by LLMs affect the knowledge-commons on which the LLMs' quality and success depend?*

  (i) The lawsuit Doe v. GitHub drew attention to the negative impact that code-generating LLMs have on the communities of developers that create the code on which LLMs are trained.

  (ii) It seems that we have already reached the point where LLM-generated code of low quality has a negative impact on the training data (hence quality) of next-generation LLMs.
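
As a purely illustrative sketch (not a model proposed in the lecture), the tension between extracting value and keeping the commons healthy can be mimicked by a textbook toy from resource economics: logistic regrowth with proportional harvesting, dx/dt = r·x·(1 − x/K) − h·x. Everything here is an assumption made for illustration: the "stock" `x` stands in for the health of the commons, and the names `simulate_commons`, `regeneration_rate` (r), `capacity` (K) and `extraction_rate` (h) are hypothetical. For h < r the stock settles at K·(1 − h/r); for h ≥ r it collapses to zero.

```python
def simulate_commons(extraction_rate, regeneration_rate=0.5, capacity=1.0,
                     initial_stock=0.5, steps=2000, dt=0.05):
    """Euler-integrate dx/dt = r*x*(1 - x/K) - h*x and return the final stock.

    Toy assumption: the commons regrows logistically and a platform
    extracts value in proportion to the current stock.
    """
    stock = initial_stock
    for _ in range(steps):
        growth = regeneration_rate * stock * (1.0 - stock / capacity)
        extracted = extraction_rate * stock
        # The stock of a commons cannot become negative.
        stock = max(0.0, stock + dt * (growth - extracted))
    return stock


def long_run_yield(extraction_rate, **kwargs):
    """Value extracted per unit time once the stock has settled."""
    return extraction_rate * simulate_commons(extraction_rate, **kwargs)


if __name__ == "__main__":
    # Compare a modest, a yield-maximizing (h = r/2) and an excessive rate.
    for h in (0.1, 0.25, 0.6):
        stock = simulate_commons(h)
        print(f"h={h:.2f}  stock={stock:.3f}  yield={long_run_yield(h):.3f}")
```

In this toy, extracting harder than the commons can regenerate yields nothing in the long run, while a moderate rate sustains both the stock and the extracted value. Whether anything like this carries over to real knowledge-commons, where regeneration depends on people's motivation rather than on a fixed rate, is exactly the open question.
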

**Example:** Given the rate at which low-quality AI-generated content is produced just for the purpose of ranking higher in Google search, will the commercial internet destroy itself? How can people protect themselves against AI-generated content? How can software engineers contribute to creating and maintaining an internet that is a fun space to socialise in?

[^how-much]: This question does not have a scientific, quantitative answer. Nevertheless, it is an important question that will have an impact on, for example, how judges will decide in the various court cases that are building up around LLMs.