# What is language?
All science starts with metaphysics, with "What is" questions. For this course we start with: What is a language?
We are really interested in programming languages, but it is worth asking the more general question first. [^history]
[^history]: There is a historical reason and a cutting edge reason. The historical reason is that the mathematical model of programming language syntax (mainly associated with [Noam Chomsky](https://en.wikipedia.org/wiki/Noam_Chomsky)) was imported from linguistics. The cutting edge reason lies in the amazing advances that Natural Language Processing (NLP) has been doing in recent years.
**Remark:** In the math-based sciences we often answer what-is questions in two different ways. One is to construct the mathematical objects, typically using set theory, logic, discrete mathematics, etc. Another is to axiomatise the properties of a class of models. We will see examples of both in the following.
Some metaphysical questions related to this course, in decreasing order of relevance.
- What is a language?
- What is meaning?
- What is computation?
- What is information?
- What is an algorithm?
- What is intelligence? (Novelty? Interestingness? Curiosity? Knowledge?)
- What is consciousness?
- What is life?
- What is a self?
We will see mathematical answers to the first three questions. [Information theory](https://en.wikipedia.org/wiki/Information_theory) provides an answer to the next question, but it does not capture what we mean by information in phrases like "information processing" or "putting information in a database", so it is limited and the general question is still open. The remaining questions are wide open, I would say, but central to research in Artificial General Intelligence.
**Discussion:**[^discussion] Go through your science knowledge and try to answer some "What is" questions. [^axiomatic] Here are some suggestions of words that have a vague and intuitive meaning in ordinary language and then a different (but related) mathematical meaning in science:
- geometry
- force
- energy
- number, number line
- set
- probability
- instantaneous change
- logic
Going back to languages, the mathematical definition that is used in theoretical computer science and programming languages is that of a "set of words". And the definition of "word" is parameterised by another set, the "alphabet".
**Definition:** Let $\Sigma$ be a set, called the *alphabet*. Elements of $\Sigma$ are called *letters* or symbols. A **word** (aka *string*) is a finite sequence of letters. The set of all words over an alphabet $\Sigma$ is denoted by $\Sigma^\ast$. We write $\epsilon$ for the word of length $0$, called the *empty word*.[^finseq]
**Definition:** A language over the alphabet $\Sigma$ is a subset of $\Sigma^\ast$.
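To make the definitions of word and language concrete, here is a small sketch in Haskell (the implementation language we will use later in this course); the alphabet `"ab"` and all names below are my own choices for illustration.

```haskell
-- A word over an alphabet is a finite sequence of letters; taking the
-- alphabet to be Char, words are ordinary Haskell Strings.

-- All words of length n over the given alphabet.
wordsOfLength :: [a] -> Int -> [[a]]
wordsOfLength sigma n = sequence (replicate n sigma)

-- All words over the alphabet, as an infinite list grouped by length:
-- words of length 0, then length 1, then length 2, ...
allWords :: [a] -> [[[a]]]
allWords sigma = map (wordsOfLength sigma) [0..]
```

For example, `wordsOfLength "ab" 2` enumerates the four two-letter words over the alphabet $\{a,b\}$, and `wordsOfLength "ab" 0` is the singleton list containing only the empty word.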
**Example:** There are many ways to instantiate the definition. We are mainly interested in programming languages, say C. The following two instantiations will play an important role in this course and in Compiler Construction. In the first, the language is the set of legal words (aka tokens) that can appear in a C program; in the second, it is the set of legal C programs.
||letter|word|
|:---:|:---:|:---:|
|lexing | character | token |
|parsing | token | program |
**Remark:** Equivalently, we can say that a language is a function [^char-function] $$\Sigma^\ast\to\mathbb B,$$ where $\mathbb B$ is the set of Boolean truth values.
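In Haskell, the characteristic-function view of a language is simply a predicate on words; the example language and the names below are my own, for illustration only.

```haskell
-- A language over the alphabet Char, viewed as a characteristic function.
type Language = String -> Bool

-- Example: the language of words over {'a','b'} with an even number of 'a's.
evenAs :: Language
evenAs w = all (`elem` "ab") w && even (length (filter (== 'a') w))
```

For instance, `evenAs "abba"` holds, while `evenAs "ab"` does not.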
Historically, the mathematical notion of language was introduced to account for natural languages such as English, but it is now more important in computer science.
**Discussion:** When applying our mathematical notion of language to English, what potential difficulties might arise?
Some of these difficulties have been addressed in natural language processing by replacing languages $\Sigma^\ast\to\mathbb B$ by probability distributions.
**Definition:** A **language model** is a probability distribution $$\Sigma^\ast\to[0,1].$$
**Remark:** If you consult the NLP literature you may find slightly different definitions, but this captures the big idea.
If you have heard of BERT or GPT-3, the basic idea behind them is that such a probability distribution can be learned simply by churning large amounts of data through a large neural network. Such language models can then be used for translation, generation, autocompletion, etc.
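As a toy illustration only (real language models are learned from data, not written down by hand), a probability distribution on a finite set of words can be represented as a table; the words and probabilities below are made up.

```haskell
import qualified Data.Map as Map

-- A toy language model: a finite table of probabilities. Words not in
-- the table get probability 0. (A real model assigns probabilities to
-- all of Sigma^*, with parameters learned from data.)
type Model = Map.Map String Double

toyModel :: Model
toyModel = Map.fromList [("hello", 0.5), ("world", 0.3), ("hi", 0.2)]

prob :: Model -> String -> Double
prob m w = Map.findWithDefault 0 w m
```

Note that the probabilities in the table sum to 1, as required of a distribution.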
**General Remark on Science, Math and Data:** To understand this definition I like to think of three levels of mathematical modelling. First, we want to study a phenomenon in the *real world*, say English. Second, we decide how to model the real world using a class of *mathematical models*. In our NLP example, the models are "good" probability distributions $\Sigma^\ast\to[0,1]$ specified by certain parameters. [^parameters] Third, we need to learn to *estimate the parameters* of the model from data.
[^parameters]: Parameters could be, for example, the biases and weights of a neural net. The choice of how to parameterize a class of models is where much of the art of mathematical modelling lies.
**<font color=red>Homework</font>:**[^homework] As a warm-up, and to revise some discrete mathematics we will need later, think about how different operations on languages (union, intersection, complementation, iteration, etc.) correspond to logical operations on Booleans (and, or, xor, implication, negation, etc.). Extend the following table
|sets|logic|
|:---:|:---:|
|$\cap$ (intersection) | $\wedge$ (and)
|... | ... |
where on the left we think of languages as subsets of $\Sigma^\ast$ and on the right as functions $\Sigma^\ast\to\mathbb B$. For example, the intersection $L_1\cap L_2$ of two languages is given by the logical operation "and", written in symbolic notation as
$$L_1\cap L_2 = \{w\in\Sigma^\ast \mid w\in L_1 \textrm{ and } w\in L_2\}= \{w\in\Sigma^\ast \mid L_1(w) \wedge L_2(w) = 1\}.$$
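This intersection example can be written directly in Haskell, viewing languages as characteristic functions; the names are mine.

```haskell
-- A language over the alphabet Char, as a characteristic function.
type Language = String -> Bool

-- Intersection is pointwise "and": w is in L1 ∩ L2 iff L1(w) and L2(w).
intersection :: Language -> Language -> Language
intersection l1 l2 w = l1 w && l2 w
```

The other rows of the homework table can be filled in the same way, each using the corresponding Boolean operation.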
**Remarks on Notation:**
- I use the same notation $L_i$ in two different ways. When I write $w\in L_1$, I think of $L_1$ as a set and when I write $L_1(w)$ I think of $L_1$ as a characteristic function.
- Here is the same again using more math notation. Let us write the Booleans as 0 and 1, that is, $$\mathbb B=\{0,1\}.$$ Now, given a set $A$ and a subset $X\subseteq A$, the characteristic function of the subset is given by $$X(a) = \begin{cases}
1 & \textrm{if } a\in X\\
0 & \textrm{if } a\notin X
\end{cases}$$
Conversely, given a function $X:A\to \mathbb B$, the corresponding subset of $A$ is given by
$$\{a\in A\mid X(a)=1\}.$$
- Finally I used some logical notation for operations on Booleans. Here is a quick summary of the most commonly used:
|symbol | informal meaning |
|:---:|:---:|
| $X\wedge Y$ | $X$ and $Y$ |
| $X\vee Y$ | $X$ or $Y$ |
| $X\to Y$ | if $X$ then $Y$ |
| $\neg X$ | not $X$ |
As an operation on Booleans they are defined via [truth tables](https://en.wikipedia.org/wiki/Truth_table). Note that we can think of truth tables as providing a **formal meaning** to logic.[^formal]
To get this notation under your belt, I recommend writing out the homework in full formal detail.
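The back-and-forth between subsets and characteristic functions described in the remarks above can be sketched in Haskell for the finite case; `toFun` and `toSet` are hypothetical names of my own.

```haskell
-- Characteristic function of a subset, given as a list (finite case).
toFun :: Eq a => [a] -> (a -> Bool)
toFun xs = (`elem` xs)

-- Recover the subset {a in A | X(a) = 1} from a characteristic
-- function, relative to a finite universe of candidate elements.
toSet :: [a] -> (a -> Bool) -> [a]
toSet universe f = filter f universe
```

For example, `toSet [1..5] (toFun [2,4])` gives back `[2,4]`.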
**Further Study:**[^further-study] Noah Smith has a great [introduction to language modelling](https://drive.google.com/file/d/1cK43rSzH491oI9NIrLlDAeP8P2F7LXTJ/view). (If you are interested in doing a research project on NLP get in touch.)
**What's next?** In this course we will study in depth a variety of small "toy" programming languages.
- A calculator. Only arithmetic, no variables, no functions, no side effects. But everything we will learn here will be useful for more expressive languages.
- $\lambda$-calculus. Lambda calculus is a very small programming language that only has variables and functions. Nevertheless, it is [Turing complete](https://en.wikipedia.org/wiki/Turing_completeness).
- Extensions of $\lambda$-calculus to a small functional programming language.
- Further extensions to a small imperative programming language.
To understand what is really going on under the hood of programming languages, we will implement (in Haskell) interpreters for these languages. This will be complemented by some programming languages theory.
**What's next next?** Next semester, in **Compiler Construction**, we will see that the methods we learn in Programming Languages scale up to writing a compiler for a real-world programming language. We will implement a compiler from a fragment of C to WebAssembly. On the theoretical side, we will learn about regular expressions, LR-parsing and some basic type theory.
**Final Remark:** We don't have time for AI in this course, but if you are interested in AI you may want to consider questions such as: What is randomness? What is learning? What is an explanation? There are some very interesting mathematical answers to these questions and there is still a lot to discover.
[^finseq]: The set of finite sequences of length $n$ with elements from $\Sigma$ can be written as $\Sigma^n$: $\Sigma^0 = \{\epsilon\}$, $\Sigma^1=\Sigma$, $\Sigma^2 = \Sigma\times\Sigma$, etc. With this notation we have $\Sigma^\ast=\bigcup_{n=0}^\infty \Sigma^n = \Sigma^0 \cup \Sigma^1 \cup \ldots \cup \Sigma^n \cup \ldots$.
[^axiomatic]: Also think about whether your answer is "axiomatic" or "by construction". Btw, these two methods are not exclusive. For example, the real numbers are defined axiomatically and then one proves that there is only one mathematical model (up to isomorphism), thus turning the axiomatic definition into an "axiomatic construction".
[^discussion]: On **Discussions**. There is too much technical material in this course to spend much time on discussions in class. Moreover, while discussion topics typically aim to deepen the understanding of the technical content, they are not strictly speaking required. But I invite you to discuss after class, on the discussion forum, in the office hours and in your report.
[^char-function]: The function $X\to\mathbb B$ corresponding to a subset of $X$ is known as the [characteristic function](https://en.wikipedia.org/wiki/Indicator_function) of the subset. In computer science, characteristic functions are often implemented as [bit vectors](https://en.wikipedia.org/wiki/Bit_array#Basic_operations).
[^homework]: On **Homework**. Exercises labelled "Homework" I consider essential for understanding this course. They will not be graded, but I am always happy to give feedback on homework in office hours. Homework can also be good material to include in your report.
[^formal]: "Formal" means roughly the same thing as to have a precise mathematical definition.
[^further-study]: On **Further Study**. Programming Languages is a central topic in computer science and has far too many ramifications to explore in this course. Suggestions labelled "Further Study" can be skipped; I recommend following them if they line up with your other interests. Further Study can also provide great material for the discussion forum and for the report. This particular Further Study is recommended for those interested in (computational) linguistics and natural language processing.