# Hyperreal Enterprises Whitepaper
Joseph Corneli
July 21, 2020 (draft)
# 1 Introduction
Our vision is to make meaningful and rewarding work in the knowledge economy possible for everyone. This would have easily recognisable economic, humanitarian, social, and ecological benefits [24]. While the Internet and Web 2.0 have not achieved this goal, they give hope that it might be possible. The advances of the 1990s and early 2000s have considerably broadened access and, in the process, generated a large pool of open data. Hyperreal Enterprises plans to use this data to bootstrap AI tools that support knowledge work. Our first product will be an AI tutor that helps people learn how to program and connects them with practical projects. This whitepaper gives some background on this project, and summarises the technical state of play.
# 2 Background
PlanetMath, launched by Aaron Krowne in 2000, was one of the early examples of “commons-based peer production” [3]; for details, see Krowne’s Master’s thesis [13]. PlanetMath.org, Ltd., was incorporated as a nonprofit circa 2004. Further design considerations were developed in Joseph Corneli’s PhD thesis, which focused on building better support for “peer learning” on PlanetMath [5]. Since 2018, the contents of the PlanetMath encyclopedia have been archived on GitHub, and the site remains online but is not actively maintained.

In the mid-2000s, several PlanetMath contributors discussed a further iteration of the basic design that would go far beyond a collaboratively written but largely static repository and become a computationally meaningful and increasingly complete symbolic mathematics software system. Whereas the field of computer mathematics has focused primarily on representing mathematics in logical formalisms, we imagined a system that could interface directly with mathematical texts as they are written by mathematicians and students. Success criteria would include passing preliminary exams, tutoring students, or writing original mathematics papers. Since this would effectively become a “simulation” of mathematics (much as mathematics itself might be thought of as a simulation of the real world), we called this largely notional project the Hyperreal Dictionary of Mathematics. Completing such an endeavour would require not only mathematical knowledge (and content), but also HCI, linguistics, AI, and organisational work.

On the back of this speculative design we imagined an organisation that would use computers to represent still broader forms of knowledge, ranging from learning materials in other disciplines to logistics and exchange systems. We believed that such representations might be used to solve problems far beyond mathematics. We called this hypothetical organisation Hyperreal Enterprises. In 2019, we officially incorporated a company in the UK under this name. The company will initially focus on “Research and experimental development on natural sciences and engineering.” More specifically, having observed that there is a large demand for technical talent and a corresponding under-production of the same, we decided to focus the new enterprise on building software that can support technical training. Although “intelligent tutoring systems” is a long-established field, we have a new take on it, since we will build our tutoring system on top of a large collection of open data. Although the application area (programming rather than mathematics) is different, the technical facets of the project can be revisited under the high-level division mentioned above. The intervening years have seen considerable improvements in all of these areas, with both off-the-shelf and research-grade software at our disposal for commercial exploitation.
# 3 HCI
We plan to present the learning materials in the form of an interactive game. We think that this, in itself, will be a considerable advance for new users, when compared with learning on one’s own with only a debugger and written documentation, with competitors that mainly teach using videos and minimal interaction, or with Q&A sites, which are not friendly or useful for beginners.
## 3.1 Game engine
Game engines like Unity can support interaction and visualisation. Within a game engine, we plan several additional features to support users:
- Storytelling Module (involves rewriting the core documentation of the language in a visual manner and integrating it into the engine)
- Interaction Module (e.g., notice when users get stuck and give hints)
- Visualisation Module (to illustrate programming challenges, including internals and an interactive “world”)
- Analytics Module (e.g., to track how many hints are used and when, and adapt the problem accordingly; a sketch of this and the Interaction Module follows this list)
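To make the Interaction and Analytics modules concrete, here is a minimal sketch of the stuck-detection and hint-bookkeeping logic, written in Python for readability even though the game engine itself would use its own scripting language. All names (`ChallengeSession`, `stuck_threshold_s`, and so on) are hypothetical placeholders rather than a committed design.

```python
from __future__ import annotations
import time
from dataclasses import dataclass, field

@dataclass
class Hint:
    text: str
    cost: int = 1  # penalty applied to the learner's score when shown

@dataclass
class ChallengeSession:
    """Hypothetical per-challenge state shared by the Interaction and Analytics modules."""
    challenge_id: str
    hints: list[Hint]
    stuck_threshold_s: float = 120.0  # idle time before we offer help
    last_progress_at: float = field(default_factory=time.monotonic)
    hints_shown: int = 0

    def record_progress(self) -> None:
        """Called whenever the learner edits code or passes a test case."""
        self.last_progress_at = time.monotonic()

    def maybe_hint(self) -> Hint | None:
        """Interaction Module: offer the next unseen hint if the learner looks stuck."""
        idle = time.monotonic() - self.last_progress_at
        if idle >= self.stuck_threshold_s and self.hints_shown < len(self.hints):
            hint = self.hints[self.hints_shown]
            self.hints_shown += 1
            return hint
        return None

    def analytics_event(self) -> dict:
        """Analytics Module: snapshot used to adapt the difficulty of the next problem."""
        return {"challenge": self.challenge_id, "hints_used": self.hints_shown}
```

The design point worth noting is that the Interaction Module and the Analytics Module share the same per-challenge state, so hints given in the moment also feed the longer-term adaptation of problems.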
## 3.2 Content authoring
The engine can provide a nice look and feel, but ultimately it will only be as good as the content, which will need a well-thought-through learning design. The initial workflow for content authoring will rely on gathering relevant problems from around the internet and other sources, analysing how they depend on each other, and generating hints and links between content for each step where people could go wrong. All of this will then be saved as a playable file ready for the game engine.
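As a rough illustration of the end of this workflow, the sketch below serialises an authored challenge, with its dependency structure and hints, into a hypothetical JSON “playable file”. The schema is an assumption made for illustration only; the real format will be dictated by the game engine.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Step:
    prompt: str        # what the learner is asked to do
    hints: list        # hints for the places people typically go wrong
    depends_on: list   # ids of steps that must be completed first

def export_playable(challenge_id: str, steps: dict, path: str) -> None:
    """Serialise an authored challenge into a (hypothetical) playable file
    that the game engine can load. The schema here is illustrative, not final."""
    payload = {
        "challenge": challenge_id,
        "steps": {step_id: asdict(step) for step_id, step in steps.items()},
    }
    with open(path, "w") as fh:
        json.dump(payload, fh, indent=2)

# Example: a two-step challenge adapted from an existing exercise.
steps = {
    "read-input": Step("Read a list of integers from stdin.",
                       ["Remember to convert strings to int."], []),
    "sum-even":   Step("Print the sum of the even numbers.",
                       ["x % 2 == 0 tests for evenness."], ["read-input"]),
}
export_playable("sum-even-numbers", steps, "sum_even.playable.json")
```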
# 4 Linguistics
Ultimately we want to be able to transform open source materials from sites like Stack Exchange and Github into instructional materials automatically. This is an ambitious proposal, but there are several recent precedents that could put it within reach over the next five years.
## 4.1 Named entity recognition and graph theory
We should stress that we do not need to fully parse and process content in order to benefit from tool integration. Even simple named-entity recognition provides useful affordances [10], exposing graph structures that can then be reasoned about computationally [14].
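The sketch below illustrates the idea at its simplest: a glossary-based matcher stands in for a real named-entity recogniser (NNexus-style concept linking uses an index of encyclopedia titles in a broadly similar spirit), and the recognised terms are collected into a graph, built here with the networkx library, that can then be queried. The glossary and document identifiers are invented.

```python
import re
import networkx as nx  # graph structure we can reason about computationally

# A toy glossary standing in for a real named-entity recogniser.
GLOSSARY = {"recursion", "stack", "hash table", "binary search"}

def link_concepts(doc_id: str, text: str, graph: nx.DiGraph) -> None:
    """Add an edge doc -> concept for every glossary term found in the text."""
    lowered = text.lower()
    for term in GLOSSARY:
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            graph.add_edge(doc_id, term, relation="mentions")

g = nx.DiGraph()
link_concepts("post-101", "Binary search needs a sorted array.", g)
link_concepts("post-102", "Recursion uses the call stack.", g)

# Concepts shared across documents become candidates for prerequisite links.
print(sorted(g.edges(data=True)))
```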
## 4.2 Argumentation theory
We applied ideas from the field of argumentation to model mathematical creativity. One branch of this work focused on constrained models of argumentative process with formally derivable features [18]. Another was more open-ended and sought to identify the actual dialogue moves used [7], and to model them computationally [6].
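As a toy illustration of what “modelling dialogue moves computationally” can mean (not the formalisms of [6] or [18] themselves), the following sketch represents an annotated fragment of a problem-solving dialogue as typed moves with reply links; the move inventory and example utterances are invented.

```python
from __future__ import annotations
from dataclasses import dataclass
from enum import Enum

class Move(Enum):
    ASSERT = "assert"
    QUERY = "query"
    CHALLENGE = "challenge"
    AGREE = "agree"

@dataclass
class Utterance:
    speaker: str
    move: Move
    content: str
    replies_to: int | None = None  # index of the utterance this responds to

# A toy fragment of a problem-solving dialogue, annotated with moves.
dialogue = [
    Utterance("A", Move.ASSERT, "The loop terminates because i decreases."),
    Utterance("B", Move.CHALLENGE, "What if i starts negative?", replies_to=0),
    Utterance("A", Move.ASSERT, "The precondition rules that out.", replies_to=1),
    Utterance("B", Move.AGREE, "OK, agreed.", replies_to=2),
]
```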
## 4.3 Vector-based and deep learning based language models
Given progress within computational linguistics, it is reasonable to expect that argumentation structures that we are able to extract by hand could be identified automatically [11] [26]. Recent years have seen considerable and well-publicised advances in deep learning based language models.
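For a sense of the pipeline involved, here is a deliberately simple baseline: a bag-of-words classifier that assigns coarse discourse roles to sentences. The labels and training sentences are invented for the example, and in practice a pretrained deep language model would replace the feature-extraction step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training examples: sentences labelled with a coarse discourse role.
sentences = [
    "I think the bug is in the loop condition.",     # hypothesis
    "Running the test suite gives three failures.",  # evidence
    "Therefore the off-by-one fix is correct.",      # conclusion
    "Maybe the cache is stale?",                     # hypothesis
]
labels = ["hypothesis", "evidence", "conclusion", "hypothesis"]

# A simple bag-of-words baseline; a deep language model would replace this step.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(sentences, labels)
print(clf.predict(["The profiler output shows most time in parsing."]))
```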
## 4.4 Linguistics of technical languages
In the current application we would be looking at discourse around computer programming rather than mathematical language. However, the kinds of language and reasoning used are broadly similar, so it is worth pointing to fairly recent advances in linguistics applied to mathematical knowledge [8], and to related work in language-aware mathematical problem solving [9].
# 5 AI
We are interested in understanding computer programs and supporting the process of learning how to program. Recent advances, largely based on the same kind of AI that supports computational linguistics, have made it possible to carry out some impressive feats of code generation. However, current technologies are not yet good at code explanation.
## 5.1 Dataflow
Some of the ideas behind recent advances in computational linguistics have also been applied to computer languages. Of particular interest are dataflow models such as code2vec [1] and “Neural Code Comprehension” [2].
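The core idea, stripped of all machine learning, is that programs expose def-use structure that can be turned into graphs and then into learnable representations. The toy sketch below extracts def-use edges from Python source using the standard `ast` module; it illustrates the underlying idea only, not the code2vec or Neural Code Comprehension models themselves.

```python
import ast

def def_use_edges(source: str):
    """Toy dataflow extraction: pair each assigned variable with the
    names read on its right-hand side."""
    edges = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
            reads = [n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)]
            for tgt in targets:
                for name in reads:
                    edges.append((name, tgt))  # data flows from `name` into `tgt`
    return edges

print(def_use_edges("a = 1\nb = a + 2\nc = a * b\n"))
# -> [('a', 'b'), ('a', 'c'), ('b', 'c')]
```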
## 5.2 Arxana and AtomSpace
Work begun in the mid-2000s on a framework for representing knowledge in computation-friendly formats resulted in the Arxana prototype (https://repo.or.cz/w/arxana.git). We explored using this together with our work on modelling the process of mathematical proofs [6]. Our work was inspired in part by the Conceptual Dependency diagrams of Schank (see [16]) and the Conceptual Graphs of Sowa [21]. AtomSpace (https://github.com/opencog/atomspace) is a broadly similar open source tool developed by the OpenCog Foundation [12]. These or similar graph technologies will be useful for representing domain knowledge and exposition in computational forms.
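To indicate the flavour of such graph representations (this is not the Arxana or AtomSpace API, only an illustration in plain Python), a knowledge fragment can be held as typed nodes joined by typed links and queried by pattern; the node types, link types, and facts shown are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    type: str   # e.g. "Concept", "Predicate"
    name: str

@dataclass(frozen=True)
class Link:
    type: str        # e.g. "Inheritance", "Evaluation"
    outgoing: tuple  # ordered tuple of Nodes (or nested Links)

# A tiny knowledge fragment in an AtomSpace-like style (illustrative only).
kb = {
    Link("Inheritance", (Node("Concept", "quicksort"),
                         Node("Concept", "sorting algorithm"))),
    Link("Evaluation", (Node("Predicate", "average-complexity"),
                        Node("Concept", "quicksort"),
                        Node("Concept", "O(n log n)"))),
}

def facts_about(name: str):
    """Return every link that mentions the named concept."""
    return [l for l in kb if any(getattr(a, "name", None) == name for a in l.outgoing)]

print(facts_about("quicksort"))
```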
# 6 Mathematics
We are concerned with content that has an explicit computational interpretation (code), with accompanying expository text, as well as with the process of interaction with these materials. Some portion of these concerns can benefit from mathematical modelling, even when the contents are not themselves “mathematics” per se.
## 6.1 Category theory
Ologs give a simple graphical formalism for representing knowledge objects [22] (with similar affordances to Conceptual Graphs). They are computationally equivalent to a data query language which supports efficient data integration from multiple sources [4] and has a recent industry-grade implementation (https://conexus.com/cql). Monocl is a simple process modelling language based on category-theoretic diagrams derived from dataflow analysis [17, 20]. It was initially demonstrated as an abstraction layer over a collection of data science programs. In the present project, Monocl (or similar ideas) will be used to create a general-purpose ontology of programs and programming.
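The following sketch conveys the flavour of an olog in ordinary code: objects become types, aspects become functions between them, and composite aspects are ordinary function composition. It is only an analogy for exposition, not CQL or Monocl, and the example objects are invented.

```python
from dataclasses import dataclass

# Olog objects, read as boxes: "an email address", "an author", "a commit".
@dataclass
class Email:
    address: str

@dataclass
class Author:
    name: str
    email: Email

@dataclass
class Commit:
    sha: str
    author: Author

# Aspects (arrows) are functional relationships between objects.
def has_author(c: Commit) -> Author:
    return c.author

def has_email(a: Author) -> Email:
    return a.email

# Composing arrows yields a new arrow: "a commit has an author, who has an email".
def commit_email(c: Commit) -> Email:
    return has_email(has_author(c))
```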
## 6.2 Statistical analysis
We have demonstrated the application of techniques for semiparametric analysis [23] to understand properties of learning behaviour at scale [5, Chapter 6]. This and other techniques can be adapted within the Analytics Module mentioned above to provide real-time feedback for learners.
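As one concrete example of the kind of analysis meant here (a generic semiparametric survival model, not the specific estimator of [23] or the analysis in [5, Chapter 6]), the sketch below fits a Cox proportional hazards model to invented gap-time data using the lifelines library, relating hint usage to how quickly a learner returns with another contribution.

```python
import pandas as pd
from lifelines import CoxPHFitter  # a standard semiparametric survival model

# Invented data: gap time (hours) between a learner's consecutive contributions,
# whether another contribution was observed, and how many hints they used.
df = pd.DataFrame({
    "gap_hours":  [5, 48, 12, 96, 3, 30, 7, 150],
    "observed":   [1, 1, 1, 0, 1, 1, 1, 0],
    "hints_used": [0, 3, 1, 5, 0, 2, 1, 4],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="gap_hours", event_col="observed")
cph.print_summary()  # effect of hint usage on the hazard of returning
```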
# 7 Organisation
We plan to build on open source content. Others have already created various interesting and impressive analyses and applications that can inform our work.
## 7.1 Stack Exchange
Many research papers have investigated Stack Exchange content. For our purposes question difficulty estimation [15] and models of learning efficacy [25] are particularly interesting pieces of prior work. Another interesting line of work connected Stack Overflow to the IDE, automatically sourcing relevant questions [19]. A limited demonstration of code autocompletion based on Stack Overflow questions was published in 2016 (https://emilschutte.com/stackoverflow-autocomplete/), and in 2017 Microsoft released a bot that works with Stack Overflow contents and retrieves relevant questions or code based on a textual query (https://github.com/Microsoft/BotFramework-Samples/tree/master/StackOverflow-Bot).
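Retrieval of this kind is straightforward to prototype against the public Stack Exchange API. The sketch below queries the v2.3 `search/advanced` endpoint for questions related to a learner’s current error; the wrapper function name is ours, and how results would be ranked and surfaced inside the tutor is a separate design question.

```python
import requests

def related_questions(query: str, tag: str = "python", limit: int = 5):
    """Fetch Stack Overflow questions related to a learner's current problem,
    using the public Stack Exchange API (v2.3 search/advanced endpoint)."""
    resp = requests.get(
        "https://api.stackexchange.com/2.3/search/advanced",
        params={
            "q": query,
            "tagged": tag,
            "site": "stackoverflow",
            "order": "desc",
            "sort": "relevance",
            "pagesize": limit,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return [(item["title"], item["link"]) for item in resp.json()["items"]]

for title, link in related_questions("list index out of range"):
    print(title, "->", link)
```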
## 7.2 Peeragogy
There is some overlap between the concerns of Hyperreal Enterprises and those of the Peeragogy project, an open source collaboration that has been gathering design patterns for peer learning (http://www.peeragogy.org/).
# 8 Conclusion
We are working to develop an immersive, engaging and adaptive gamified online learning experience for people upskilling in tech. While this may bring to mind existing players in the EdTech space, such as Udacity, Udemy, EdX, Hacker Rank, Codecademy, Grasshopper, Treehouse, Lambda School, c0d3.com, freecodecamp.com, and scrimba.com (among others), our solution is sufficiently different that the easiest way to explain it is that we are solving an essentially different problem. Whereas the organisations and tools mentioned support skill acquisition, they do not address the more challenging problem of modelling skilled performance, at scale and in detail. Accordingly, it is more accurate to think of Hyperreal Enterprises as a competitor to Andela, a startup that prepares people to deliver high-level offshoring services. However, we will be able to offer a wider range of training and services, ultimately covering the whole tech landscape. The advantages of our approach correspond to the disciplines we incorporate in our offering:
HCI - When compared with the standard EdTech offerings we can deliver a much better price-to-performance ratio, through the use of engaging experiences built on top of meaningful models of code and programming practices.
Linguistics - Wide and ultimately comprehensive coverage that, moreover, stays up to date as the field changes, without the need for expensive content development.
AI - The potential to automate routine skill evaluation, supporting independent learning and/or an improved teacher-to-student ratio.
Mathematics - A route to extending our models and learning materials to other domains of knowledge, across STEM and beyond.
Organisation - The ability to certify skilled practice, including soft skills and collaboration ability, in ways that will be meaningful to employers.
Until recently, it would not have been possible to create a computational model of Stack Exchange and GitHub without countless years of hand-coding, such that the map would be out of date by the time it was finished. Thanks to the developments surveyed above, we are now in a position to bring to market an innovative, interactive upskilling interface to the world of open source software. To be sure, all of the areas touched on above will need further work, and realising our ambitions will require further research as well as creative use of existing technologies. One purpose of this document is to help clarify the division of labour, and to begin outlining the relevant interfaces between the technologies we depend on.
# References
[1] Uri Alon et al. “code2vec: Learning distributed representations of code”. In: Proceedings of the ACM on Programming Languages 3.POPL (2019), pp. 1–29.
[2] Tal Ben-Nun, Alice Shoshana Jakobovits, and Torsten Hoefler. “Neural code comprehension: A learnable representation of code semantics”. In: Advances in Neural Information Processing Systems. 2018, pp. 3585–3597.
[3] Yochai Benkler. “Coase’s Penguin, or, Linux and “The Nature of the Firm””. In: The Yale Law Journal 112.3 (2002), pp. 369–446. issn: 00440094. url: http://www.jstor.org/stable/1562247.
[4] Kristopher S. Brown, David I. Spivak, and Ryan Wisnesky. “Categorical data integration for computational science”. In: Computational Materials Science 164 (2019), pp. 127–132. issn: 0927-0256. doi: 10.1016/j.commatsci.2019.04.002. url: http://www.sciencedirect.com/science/article/pii/S0927025619302046.
[5] Joseph Corneli. “Peer produced peer learning: A mathematics case study”. PhD thesis. The Open University, 2014. url: http://oro.open.ac.uk/40775/.
[6] Joseph Corneli et al. “Modelling the Way Mathematics is Actually Done”. In: Proceedings of the 5th ACM SIGPLAN International Workshop on Functional Art, Music, Modeling, and Design. FARM 2017. Oxford, UK: ACM, 2017, pp. 10–19. isbn: 978-1-4503-5180-5. doi: 10.1145/3122938.3122942. url: http://doi.acm.org/10.1145/3122938.3122942.
[7] Joseph Corneli et al. “Argumentation Theory for Mathematical Argument”. In: Argumentation 33.2 (June 2019), pp. 173–214. issn: 1572-8374. doi: 10.1007/s10503-018-9474-x. url: https://doi.org/10.1007/s10503-018-9474-x.
[8] Mohan Ganesalingam. “The language of mathematics”. In: The Language of Mathematics. Springer, 2013, pp. 17–38.
[9] Mohan Ganesalingam and William Timothy Gowers. “A fully automatic theorem prover with human-style output”. In: Journal of Automated Reasoning 58.2 (2017), pp. 253–291.
[10] Deyan Ginev and Joseph Corneli. “NNexus Reloaded”. English. In: Intelligent Computer Mathematics. Ed. by Stephen M. Watt et al. Vol. 8543. Lecture Notes in Computer Science. Springer International Publishing, 2014, pp. 423–426. isbn: 978-3-319-08433-6. doi: 10.1007/978-3-319-08434-3_31. url: http://dx.doi.org/10.1007/978-3-319-08434-3_31.
[11] Deyan Ginev and Bruce R. Miller. Scientific Statement Classification over arXiv.org. 2019. arXiv: 1908.10993 [cs.CL].
[12] Hendy Irawan and Ary Setijadi Prihatmanto. “Implementation of graph database for OpenCog artificial general intelligence framework using Neo4j”. In: 2015 4th International Conference on Interactive Digital Media (ICIDM). IEEE. 2015, pp. 1–6.
[13] Aaron Phillip Krowne. “An architecture for collaborative math and science digital libraries”. MA thesis. Virginia Tech, 2003.
[14] Pierre Raymond de Lacaze. BABAR: Wikipedia Knowledge Extraction.
[15] Jing Liu et al. “Question difficulty estimation in community question answering services”. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013, pp. 85–90.
[16] Steven L Lytinen. “Conceptual dependency and its descendants”. In: Computers & Mathematics with Applications 23.2-5 (1992), pp. 51–73.
[17] Evan Patterson et al. “Teaching machines to understand data science code by semantic enrichment of dataflow graphs”. In: arXiv preprint arXiv:1807.05691 (2018).
[18] Alison Pease et al. “Lakatos-style collaborative mathematics through dialectical, structured and abstract argumentation”. In: Artificial Intelligence 246 (2017), pp. 181–219. issn: 00043702. doi: 10.1016/j.artint.2017.02.006. url: http://www.sciencedirect.com/science/article/pii/S0004370217300267.
[19] Luca Ponzanelli et al. “Mining StackOverflow to turn the IDE into a self-confident programming prompter”. In: Proceedings of the 11th Working Conference on Mining Software Repositories. 2014, pp. 102–111.
[20] Ioana Monica Baldini Soares et al. Generating semantic flow graphs representing computer programs. US Patent 10,628,282. Apr. 2020.
[21] John F Sowa. Knowledge representation: logical, philosophical and computational foundations. Brooks/Cole Publishing Co., 1999.
[22] David I. Spivak and Robert E. Kent. “Ologs: A Categorical Framework for Knowledge Representation”. In: PLoS ONE 7.1 (Jan. 2012). Ed. by Chris Mavergames, e24274. issn: 19326203. doi: 10.1371/journal.pone.0024274. url: http://dx.doi.org/10.1371/journal.pone.0024274.
[23] Timothy Teravainen. “Semiparametric Estimation of a Gaptime-Associated Hazard Function”. PhD thesis. Columbia University, 2014.
[24] Roberto Unger et al. Imagination unleashed: Democratising the knowledge economy. url: https://www.nesta.org.uk/report/imagination-unleashed/.
[25] Utkarsh Upadhyay, Isabel Valera, and Manuel Gomez-Rodriguez. “Uncovering the Dynamics of Crowdlearning and the Value of Knowledge”. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 2017, pp. 61–70.
[26] Amy X Zhang, Bryan Culbertson, and Praveen Paritosh. “Characterizing online discussion using coarse discourse sequences”. In: Eleventh International AAAI Conference on Web and Social Media. 2017.