# Timeline

- February
  - [x] finalize statistical modelling choices (with Morgan's input, potentially Peng's)
  - [x] by February 16: write 1-page abstract of thesis (below)
  - decide on intro framing (Ch.1) and begin writeup drafts of Typos chapter (Ch.5)
  - **February 23rd: Draft of typos chapter (Ch.5) Methods and Results sections**
- March
  - **March 8th: detailed outline of Theoretical Review Chapter (Ch.3)** (all section headings, and text in all sections, to flesh out)
  - Reach out to Boston-area people who might know of/have postdoc funding
    - Ted Gibson, Tomer Ullman, Sam Gershman, Gina Kuperberg, Najoung Kim
  - (also, March 27 is internal deadline for NIH application)
  - End of March (March 23): get **draft of Introduction (Ch.1)** to Tim for comments
- April
  - April 8: submit NIH funding application
  - mid April: get **full draft of theory chapter** to Tim for comments
  - late April: **Typo chapter intro and discussion** to Tim for comments
  - revise OverviewCh, TheoryCh, and TypoCh intro sections (to work together)
  - **Goal: have complete draft by end of April**
  - Late April-early May: **submit a full draft to your thesis committee** so they can make final suggestions
  - Also by late April-early May (2-3 months before defence): determine examiners (internal and external). Send the following to Meghan:
    - title and abstract
    - timeline
    - [x] external examiner list in order of preference: Vera Demberg, Marten van Schijndel, Tal Linzen
    - [x] internal examiner list in order of preference: Ross Otto, Michael Wagner, Meghan Clayards
- May
  - _by mid May, check that examiners are secured_
  - Deadline: late May, **Submit defence draft**/initial submission (6-8 weeks before defence)
    - (this will be sent out to the examiners, who have 4-6 weeks to return a written report)
- June
  - Receive comments and incorporate revisions.
  - Late June (1 month before defence): schedule defence, and determine Oral Defence Committee (different from the thesis committee)
- July
  - _Deadline: late July, **thesis defence** (1-2 weeks before final submission)_
  - revise after defence, for final submission
- August
  - _Deadline: early August, deposit **final submission with revisions** (after defence). Latest possible day: Aug 15_
- (September: wedding on Sept 7th, Hope, Maine.)

-----

Dissertation **Chapters**:

0. **Overview**
0. **Processing cost as informativeness gain**
0. **_The Plausibility of Sampling as an Algorithmic Theory of Sentence Processing_**
0. **When Unpredictable Does Not Mean Difficult** (new manuscript; typos study)
0. **_Linguistic Dependencies and Statistical Dependence_**
0. **Discussion and conclusion**

# Outline

## Chapter 0    Overview

This chapter should be short (one section): a framing and overview of the entire dissertation, within the perspective of information-theoretic approaches to sentence structure and processing. This dissertation is concerned with the structures and mechanisms by which linguistic information is encoded and processed, and how these interact with the observable patterns of language use.

> _Two questions are central to the study of human language: What is its structure? And how are these structures processed in real time? In this dissertation I explore answers to these two questions from the broad perspective of information theory. The main focus of the dissertation is on how mechanisms for language processing are reflected in the patterns of language use (first three content chapters).
> In particular, the statistics of word occurrence in context play important roles in theories of how language is processed, and modern language models provide useful tools for estimating such probability distributions. In the final content chapter, I shift gears and examine the connection between word cooccurrence and linguistic structure. In all of this work I make use of large pretrained language models as powerful tools for estimating the underlying statistics of word occurrence in language use, but also outline ways in which their direct application has important limitations._

## Chapter 1    Informativeness and processing cost

The remaining chapters are concerned with the computational complexity of incremental language comprehension. What makes a given word harder or easier to process when it is encountered in context? I advocate for the hypothesis that the amount of computational effort required to process a word is primarily a function of how much information it communicates. In this chapter I will lay out the framework within which I will discuss theories of processing cost, and propose a new way of quantifying the information gained from a word (KL theory, generalizing surprisal theory).

### Introduction

The main idea in this area is **Surprisal Theory**, which proposes that the time to process a word is primarily a function of its surprisal in context. We can write this as

$$
\text{Surprisal theory hypothesis: } \mathrm{cost} = f(\mathrm{surprisal})
$$

- Starts with the intuition from Hale 2001. Levy picks this up, connects it with $\operatorname{KL}(post\|prior)$, and gives a justification in terms of rational analysis and incremental update cost. However, no algorithm.
- There's a promising result involving IS, but two things: it is exponential in the KL, and it is the KL with respect to a proposal. Let's take sampling seriously; a few things come up. If we pretend that all earlier work had sampling in mind, then the assumptions are:
  - linear link
  - R is zero
  - D is zero

  We'll tackle these assumptions one at a time, keeping in mind that they can all interact.
  - Is it possible that surprisal theory is right, but the link isn't linear? (OpMi)
  - When is KL like surprisal? Not always. (typo)

1. **KL Theory: Generalizing surprisal theory** (theoretical section)
   - **Motivation**: Shortcomings of pure surprisal theory
     - no algorithmic motivation
     - intuition: there are different kinds of unpredictability, sometimes "useful" and sometimes not. Surprisal theory _by definition_ can't distinguish them. If the intuition is correct, this is a problem.
   - **KL theory definition**:
     - Derivation: $KL = surprisal + R$ (one way such a decomposition can arise is sketched below)
     - interpretation (and hypothesis that cost scales with KL, not surprisal), and description in an incremental setting
   - **Situation wrt literature**: here describe the relationship (or lack thereof) of KL theory to noisy-channel surprisal and the other information-theoretic theories reviewed below.
   - **Implications**
     - **Potential for algorithms**: KL can be related to processing cost via algorithms which take longer for larger updates, such as sampling: motivation for Chapter 4.
     - **New Predictions**: KL theory allows a non-monotonic relationship between surprisal and processing cost in principle. In particular, for some high-surprisal items, KL may in principle be near zero. Thus it can offer an explanation for the intuition that some types of unpredictable things are not as difficult as others. This motivates the typo study in Chapter 5.

   (put review after introducing KL theory)
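One way such a decomposition can arise (a sketch only; the notation and sign conventions here are placeholders, and the chapter itself will fix the definitions): write $c$ for the context, $w$ for the observed word, and $T$ for the latent interpretation, with prior $p(T \mid c)$ and posterior $p(T \mid w, c)$. Then, by Bayes' rule,

$$
\begin{aligned}
\operatorname{KL}\bigl(p(T \mid w, c) \,\big\|\, p(T \mid c)\bigr)
&= \sum_{T} p(T \mid w, c)\,\log\frac{p(T \mid w, c)}{p(T \mid c)} \\
&= \sum_{T} p(T \mid w, c)\,\log\frac{p(w \mid T, c)}{p(w \mid c)} \\
&= \underbrace{-\log p(w \mid c)}_{\text{surprisal of } w}
 \;+\; \underbrace{\mathbb{E}_{T \sim p(T \mid w, c)}\bigl[\log p(w \mid T, c)\bigr]}_{\text{residual term}} .
\end{aligned}
$$

If the word is a deterministic function of the interpretation (a noise-free signal), the residual term is zero and the KL is exactly the surprisal, which is the "R is zero" case above; with a noisy likelihood the two come apart. Whether the residual is called $R$ and written with a plus or a minus sign is a definitional choice to settle when writing the derivation section.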
2. **Review of information-theoretic theories of processing cost** (lit review section)
   - Surprisal theory
     - justifications via rational analysis (but no algorithm)
   - Entropy measures (entropy reduction, successor word entropy)
   - Other related claims (uniform information density, lookahead information gain, ...)
   - Extensions of surprisal to a noisy channel

## Chapter 2    _The Plausibility of Sampling ..._

(Open Mind paper, as published)

## Chapter 3    When Unpredictable Does Not Mean Difficult

(typos study chapter, follow standard experimental psychology format)

1. **Introduction**

   While KL and surprisal are equivalent under certain assumptions, there is a potential for (radically) different predictions when these assumptions do not hold. This motivates a case study of constructions where we expect the predictions to differ. In this chapter we look at processing difficulty on typographical errors. These are interesting for the following reasons:
   - Naïve/traditional surprisal theory predicts these items should have infinite surprisal/be impossible to process, since they would not be predicted (they aren't even words in the grammar).
   - Surprisal theory defined in a joint model (noisy channel) still can't predict that these items are easy, _even with a 'smart likelihood'_. If humans ever actually process typos roughly as if they were non-typos, surprisal theory can't explain this.
   - KL theory can explain typos being easy, if they are interpreted as something which has high prior probability.
1. **Methods**
   - stimulus generation: expected/unexpected words, with/without typos, in controlled contexts
   - SPRT experiment design, procedure, and participants
   - human norming experiment (TODO)
   - statistical model design/choice for interpretation: fit a linear mixed-effects model of RTs, and compare the differences between conditions (e.g. expected_typo vs expected_nontypo). Fit the same simple model to predict surprisal, and look at the same comparisons. (A sketch of this analysis follows this chapter outline.)
1. **Results**

   We see robust evidence for the pattern predicted by our intuitive sketch of KL theory, contra surprisal theory (as validated by looking at surprisals from LMs): typos are not difficult to process, despite being unpredictable. We also see an interesting trend in the surprisals from LMs as they get more recent/bigger/better.
1. **Discussion**

   Better LMs' surprisals do go in the direction of being more like RTs. This can be seen as implying that they are learning a smarter likelihood function/noise model. HOWEVER, there is a bound on how far this can go (TODO: should be able to sketch such a bound on some examples). The point: the mismatch between surprisals and human RTs isn't a matter of needing better surprisal estimates. Even surprisals from a perfect LM will not be able to distinguish items that are surprising because they communicate unexpected meanings from items that are unexpected in ways irrelevant to meaning.

   Discuss implications and limitations of this study. Outline extensions to be done. For one: we should be able to choose a likelihood function/family and estimate KL for predicting human reading times, rather than just arguing from an intuition of how KL should behave (there are some trickinesses with tokenization etc. that make this non-trivial, but some version should be doable). A limitation to address: modify the experiment to discern whether people actually visually perceived the typos.
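A minimal sketch of the condition-comparison analysis described under Methods, assuming hypothetical column and file names (`sprt_trials.csv`, `rt`, `surprisal`, `expected`, `typo`, `participant`) and a deliberately simple random-effects structure; the real model choices are to be settled with Morgan/Peng:

```python
# Hypothetical sketch of the RT/surprisal condition analysis (names assumed).
# Expects a long-format trials file with one row per target-word trial:
#   rt          reading time on the target word
#   surprisal   LM surprisal of the target word, precomputed elsewhere
#   expected    1 if the target word is expected in context, else 0
#   typo        1 if the target word contains a typo, else 0
#   participant participant identifier
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("sprt_trials.csv")  # hypothetical file name

# Linear mixed-effects model of RTs: 2x2 fixed effects (expectedness x typo)
# with by-participant random intercepts. A fuller random-effects structure
# (by-item effects, random slopes) would likely be added in the real analysis.
rt_model = smf.mixedlm(
    "rt ~ expected * typo", data=df, groups=df["participant"]
).fit()
print(rt_model.summary())

# The same fixed-effect structure fit to LM surprisal of the target word,
# so condition contrasts in RTs and in surprisal can be compared side by side.
surprisal_model = smf.mixedlm(
    "surprisal ~ expected * typo", data=df, groups=df["participant"]
).fit()
print(surprisal_model.summary())
```

Reading the same contrasts (in particular the expected x typo interaction) off both models is what supports the RT-vs-surprisal comparison described in the Results.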
## Chapter 4    _Linguistic Dependencies and Statistical Dependence_

(EMNLP paper, as published)

## Chapter 5    Discussion

(summary and conclusion)

For future directions:

- Brief elaboration/update of future work mentioned in the Plausibility paper (Ch.4): algorithms for parsing that use adaptive beams etc.; inference controller application.
- Extensions of the typo study (Ch.5) to other types of likelihood functions (e.g. semantic rather than orthographic).