# Survey papers for thesis

# Outline
- Show papers
    - A Pseudo-Semantic Loss
    - Controllable Abstractive Summarization
    - Constrained Abstractive Summarization
    - KITAB
- Comparison

# Papers
## [A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints](https://openreview.net/attachment?id=hVAla2O73O&name=pdf)
### Introduction
Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning.

Instead of attempting to enforce the constraint on the entire output distribution, we optimize the likelihood of the constraint **under a pseudolikelihood-based approximation** centered around a model sample.

* A pseudolikelihood-based approximation is a method for approximating the likelihood of a probability distribution by **factorizing it into a product of conditional probabilities**.
* It is a local, high-fidelity approximation of the likelihood, exhibiting low entropy and KL-divergence around the model sample.
* Our approach can be thought of as penalizing the neural network for all the probability mass it allocates to the local perturbations of a model sample that violate the logical constraint.

![image](https://hackmd.io/_uploads/SkFo683H6.png =60%x)
### Main Point
Neuro-Symbolic Losses

$$
\underset{\boldsymbol{\theta}}{\operatorname{argmax}} p_{\boldsymbol{\theta}}(\alpha)=\underset{\boldsymbol{\theta}}{\operatorname{argmax}} \mathbb{E}_{\boldsymbol{y} \sim p_{\boldsymbol{\theta}}}[\mathbb{1}\{\boldsymbol{y} \models \alpha\}]=\underset{\boldsymbol{\theta}}{\operatorname{argmax}} \sum_{\boldsymbol{y} \models \alpha} p_{\boldsymbol{\theta}}(\boldsymbol{y})
$$

- This equation quantifies how close the neural network comes to satisfying the constraint.
- It reduces the problem of probability computation to weighted model counting (WMC): summing up the models of $\alpha$, each weighted by its likelihood under $p_{\boldsymbol{\theta}}$.
- The negative logarithm of this expectation yields a loss function called semantic loss.

Pseudo-Semantic Loss

- The pseudolikelihood objective aims to measure our ability to predict the value of each variable given a full observation of all other variables.
- The pseudolikelihood objective is attempting to match all of **the model’s conditional distributions** to **the conditional distributions computed from the data**.
- If it succeeds in matching them exactly, then a Gibbs sampler run on the model’s conditional distributions **attains the same invariant distribution** as a Gibbs sampler run on the true data distribution.
- The above would still not be sufficient to ensure the tractability of the expectation in calculating Neuro-Symbolic Losses.
    - Intuitively, different solutions depend on different sets of conditionals, meaning we would have to compute the probabilities of many of the solutions of the constraint from scratch.
- Instead, we compute the pseudolikelihood of the constraint in the neighborhood of a model sample.

$$
\begin{aligned}
\tilde{p}(\alpha)=\mathbb{E}_{\boldsymbol{y} \sim \tilde{p}}[\mathbb{1}\{\boldsymbol{y} \models \alpha\}] \approx & \mathbb{E}_{\boldsymbol{y} \sim p} \mathbb{E}_{\tilde{\boldsymbol{y}} \sim \tilde{p}_{\boldsymbol{y}}}[\mathbb{1}\{\tilde{\boldsymbol{y}} \models \alpha\}]=\mathbb{E}_{\boldsymbol{y} \sim p} \tilde{p}_{\boldsymbol{y}}(\alpha)=\mathbb{E}_{\boldsymbol{y} \sim p} \sum_{\tilde{\boldsymbol{y}} \models \alpha} \tilde{p}_{\boldsymbol{y}}(\tilde{\boldsymbol{y}}) \\
& \quad \text { where } \tilde{p}_{\boldsymbol{y}}(\tilde{\boldsymbol{y}}):=\prod_i p\left(\tilde{\boldsymbol{y}}_i \mid \boldsymbol{y}_{-i}\right)
\end{aligned}
$$

$$
\begin{aligned}
\mathcal{L}_{\text {pseudo }}^{\mathrm{SL}}\left(\alpha, p_{\boldsymbol{\theta}}\right):= & -\log \mathbb{E}_{\boldsymbol{y} \sim p} \tilde{p}_{\boldsymbol{y}}(\alpha)=-\log \mathbb{E}_{\boldsymbol{y} \sim p} \sum_{\tilde{\boldsymbol{y}} \models \alpha} \tilde{p}_{\boldsymbol{y}}(\tilde{\boldsymbol{y}})
\end{aligned}
$$

- Our pseudo-semantic loss between $\alpha$ and $p_{\boldsymbol{\theta}}$ can be thought of as **penalizing the neural network** for all the local perturbations $\tilde{\boldsymbol{y}}$ of the model sample $\boldsymbol{y}$ that violate the logical constraint $\alpha$.

![image](https://hackmd.io/_uploads/ByfwckJHT.png =70%x)
#### Main difference
Pseudolikelihood is a computationally efficient approximation of the true likelihood that only considers the conditional probabilities of each variable given its neighbors, whereas the general likelihood considers the joint probability of all variables in the model. In other words, pseudolikelihood assumes that each variable is conditionally independent of all other variables given its neighbors, which allows the computation of the likelihood to be **decomposed into a product of conditional probabilities**. This approximation is often used in cases where the joint probability distribution is intractable or too complex to compute directly.

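To make the difference concrete, here is a minimal sketch that computes both the exact semantic loss and the pseudo-semantic loss by brute-force enumeration on a toy model. Everything in it is an assumption made for illustration: three binary variables, the constraint "exactly one variable is 1", the two hand-made probability tables, and all function names. The expectation over $\boldsymbol{y} \sim p$ is taken exactly only because the toy model is tiny; it is not how the paper computes it.

```python
import itertools
import math


def normalise(weights):
    """Turn a table of positive weights into a probability table."""
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


def satisfies(y):
    """Toy constraint alpha (assumed): exactly one variable is 1."""
    return sum(y) == 1


def semantic_loss(p):
    """-log of the total probability mass on assignments satisfying alpha."""
    return -math.log(sum(prob for y, prob in p.items() if satisfies(y)))


def conditional(p, i, value, y):
    """p(y_i = value | y_{-i}), read off the joint table p."""
    def with_value(v):
        return y[:i] + (v,) + y[i + 1:]
    return p[with_value(value)] / (p[with_value(0)] + p[with_value(1)])


def pseudo_prob(p, y_tilde, y):
    """Product over i of p(y_tilde_i | y_{-i}), centred on the sample y."""
    return math.prod(conditional(p, i, v, y) for i, v in enumerate(y_tilde))


def pseudo_semantic_loss(p):
    """-log E_{y ~ p}[ sum over y_tilde satisfying alpha of pseudo_prob ]."""
    expected = sum(
        prob * sum(pseudo_prob(p, yt, y) for yt in p if satisfies(yt))
        for y, prob in p.items()
    )
    return -math.log(expected)


states = list(itertools.product([0, 1], repeat=3))

# A correlated toy model: the first two variables interact.
correlated = normalise(
    {y: math.exp(-abs(sum(y) - 1) + 0.8 * y[0] * y[1]) for y in states}
)

# A fully independent toy model built from per-variable marginals.
marginals = [0.2, 0.5, 0.7]
independent = {
    y: math.prod(m if v else 1 - m for m, v in zip(marginals, y))
    for y in states
}

print("correlated :", semantic_loss(correlated), pseudo_semantic_loss(correlated))
print("independent:", semantic_loss(independent), pseudo_semantic_loss(independent))
```

With the independent table the two losses coincide, because every conditional reduces to a marginal and the product of conditionals recovers the joint. With the correlated table the product of conditionals is only a local approximation around each sample.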
However, this approximation may not be accurate in cases where the variables are not conditionally independent, which can lead to biased estimates of the model parameters. General likelihood, on the other hand, considers the joint probability of all variables in the model, which provides a more accurate estimate of the model parameters but is often computationally expensive or intractable to compute.

### Experiments
![image](https://hackmd.io/_uploads/BJUjmAbra.png)
### Discussion
Instead of attempting to enforce the constraint on the entire distribution, our approach does so on a local distribution centered around a model sample.

- the proposed approach may not be suitable for all types of constraints or datasets
- the paper focuses on autoregressive models
- the computational complexity

### Appendix
- The Gibbs sampler, a Markov Chain Monte Carlo (MCMC) algorithm, is employed for sampling from a probability distribution. It iteratively samples from the conditional distribution of each variable given the values of the others. The algorithm starts with an initial state and updates variables iteratively until convergence to the target distribution. It is widely used in Bayesian statistics and machine learning. In the paper it motivates **enforcing logical constraints using a pseudolikelihood-based approximation**: if the pseudolikelihood matches the true conditional distributions, a Gibbs sampler run on the model's conditionals attains the same invariant distribution as a Gibbs sampler run on the true data distribution.

---------------------

## [Controllable Abstractive Summarization](https://arxiv.org/pdf/1711.05217.pdf)
### Introduction
1. Our generated summaries follow the specified preferences, e.g. the length of the generated summary
2. These control variables guide the learning process and improve generation even when they are set automatically during inference

### Main points
1. Convolutional Sequence-to-Sequence
2. Length-Constrained Summarization
3. Entity-Centric Summarization
4. Source-Specific Summarization
5. Remainder Summarization

### Experiments
![image](https://hackmd.io/_uploads/Bk40BgLr6.png =70%x)
### Discussion
The control variables are effective without user input, which we demonstrate by assigning them fixed values tuned on a held-out set.

-----------------------

## [Constrained Abstractive Summarization: Preserving Factual Consistency with Constrained Generation](https://arxiv.org/pdf/2010.12723.pdf)
### Introduction
We specify tokens as constraints that must be present in the summary.

We adopt lexically constrained decoding, a technique generally applicable to autoregressive generative models, to fulfill Constrained Abstractive Summarization (CAS) and conduct experiments in two scenarios:
- **Automatic summarization** without human involvement
- **Human-guided** interactive summarization

First, the added constraints can often replace their unfaithful counterparts in the unconstrained summary (produced by the same model) and help reduce model hallucination.

Second, when adding important entities not found in the unconstrained summary as constraints, the model is more likely to generate summaries that are focused on these factual entities (“Christmas”) and more specific (“a council” changed to “Gwynedd council”).

![image](https://hackmd.io/_uploads/H1ratmOra.png =70%x)
### Main points
Task Formulation
- Given a document-reference pair $(d, r)$ and an abstractive model $\mathcal{M}$, we aim to create a constraint set $C$ that has a high overlap with $r$ to ensure its quality and a low overlap with the unconstrained system summary $s'$ to bring additional information.

Constraint Creation
- Automatic Constraints
    - we adopt a state-of-the-art supervised keyphrase extraction method, BERT-KPE, to extract keyphrases from the source document
    - we focus on the **factual consistency** of entities and noun phrases
    - We use spaCy to find named entities and noun phrases in the reference summaries of the training set, and treat those appearing in the source documents as positive training
examples.
    - During test time, we exclude the extracted keyphrases appearing in $s'$ such that only constraints bringing additional information are used.
- Manual Constraints
    - We simulate human guidance by taking tokens in the reference summary as manual constraints.
    - A user may revise a summary by adding entities they deem important but that are not in the system summary.
    - We simulate such edits by taking entities that appear in the reference but not in the system summary as constraints.

Lexically Constrained Decoding
- Dynamic beam allocation (DBA) divides the beam during beam search to store hypotheses satisfying different numbers of constraints and adds unmet constraints at each decoding step.
- DBA ensures the presence of constraints by allowing the EOS token only when all the constraints are met.

### Experiments
![image](https://hackmd.io/_uploads/ByMDfTqSa.png =70%x)
### Discussion
Constrained abstractive summarization can be easily incorporated into existing abstractive models to preserve factual consistency.

We can explore and facilitate alternative means beyond lexically constrained decoding to fulfill CAS.

------------------------

## [KITAB: EVALUATING LLMS ON CONSTRAINT SATISFACTION FOR INFORMATION RETRIEVAL](https://openreview.net/attachment?id=b3kDP3IytM&name=pdf)
### Introduction
Many current retrieval benchmarks are either saturated or do not measure constraint satisfaction.

[KITAB](https://huggingface.co/datasets/microsoft/kitab) consists of book-related data across more than 600 authors and 13,000 queries, and also offers an associated dynamic data collection and constraint verification approach for acquiring similar test data for other authors.

Constraint satisfaction queries in information retrieval (IR) are queries that include a set of constraints to be satisfied by the model output.

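As a rough illustration of such a query, the sketch below checks a model's output list against an author constraint plus one book constraint. The `Book` record, the toy catalogue with placeholder titles, and the helpers `by_author`, `starts_with`, and `score_output` are all hypothetical names made up for these notes; they are not KITAB's actual verification functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Book:
    title: str
    author: str
    year: int


# A constraint is any predicate over a book record (assumed representation).
Constraint = Callable[[Book], bool]


def by_author(name: str) -> Constraint:
    return lambda book: book.author == name


def starts_with(letter: str) -> Constraint:
    return lambda book: book.title.lower().startswith(letter.lower())


def satisfies_all(book: Book, constraints: List[Constraint]) -> bool:
    return all(check(book) for check in constraints)


def score_output(
    predicted_titles: List[str],
    catalogue: List[Book],
    constraints: List[Constraint],
) -> Dict[str, List[str]]:
    """Sort predicted titles into satisfied / unsatisfied / not in catalogue,
    and list the catalogue titles that satisfy the query but were not returned."""
    by_title = {book.title.lower(): book for book in catalogue}
    result: Dict[str, List[str]] = {
        "satisfied": [],
        "unsatisfied": [],
        "not_in_catalogue": [],
    }
    for title in predicted_titles:
        book = by_title.get(title.lower())
        if book is None:
            result["not_in_catalogue"].append(title)
        elif satisfies_all(book, constraints):
            result["satisfied"].append(title)
        else:
            result["unsatisfied"].append(title)
    expected = [b.title for b in catalogue if satisfies_all(b, constraints)]
    result["missed"] = [t for t in expected if t not in result["satisfied"]]
    return result


# Placeholder data, invented for the example.
catalogue = [
    Book("Alpha Rising", "Author A", 1990),
    Book("Another Day", "Author A", 2005),
    Book("Beyond", "Author A", 1995),
    Book("Apple Tree", "Author B", 1991),
]
constraints = [by_author("Author A"), starts_with("a")]
predicted = ["Alpha Rising", "Beyond", "Made Up Title"]

result = score_output(predicted, catalogue, constraints)
print(result)
```

Representing each constraint as a plain predicate keeps the author constraint and the book constraints composable, so a query is just a list of checks that must all pass.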
We use KITAB to test LLMs across different controlled conditions:
- i) their baseline ability to retrieve all books from an author (ALL-BOOKS)
- ii) performance on queries that have both an author constraint and book constraints using only the model’s knowledge (NO-CONTEXT)
- iii) performance when the model has access to a complete context with all books from the author, to differentiate between parametric and retrieval-augmented settings (WITH-CONTEXT)
- iv) performance for standard chain-of-thought prompts and prompts that require the model to first construct its own context with all books from the author, as a **self-sufficient retrieval approach** that does not use other systems (SELF-CONTEXT).

### Related work
Factual Queries
- Pre-retrieval in RAG can however introduce new challenges across many domains.

Constraint Satisfaction
- We contribute a dataset (and functional evaluation) that is challenging even for large proprietary models like GPT4
- Yuksekgonul et al. (2023) propose an attention-based method for mechanistic understanding and detecting failures of open-source models using model internals.

Constraint and Query Complexity
- We measure the complement of the ratio between the number of solutions $S$ that satisfy the constraint and the total number of items in the domain $N$ (higher constrainedness, more complex), i.e., $\kappa = 1 - S/N$

### Main points
Our goal is to dissect model performance and create transparency around when and how current LLMs fail on constrained queries.

KITAB DATA COLLECTION
- **Author sampling**
    - We first sample 20,000 authors randomly from WikiData
    - Next, we cross-reference these authors with the Open Library repository using the author name and year of birth
- **Book collection**
    - Using the name of the author and their year of birth, we cross-reference the Open Library corpus and collect all books from the author that are tagged to be in English by the API, or where the language field is empty.
    - Then, we make an additional check using the Azure Cognitive Services Language API for language detection such that we keep only the earliest English edition titles, given that our prompts are also in English.
    - Further, the data cleaning process involves a number of quality and consistency checks, namely on deduplication, cross-checking the authorship and publication year of the book on both the Open Library and WikiData.
- **Constraints and queries**
    - the first constraint is always fixed to an author
    - the following can be *lexical*, *temporal*, and *named entity* book constraints

![image](https://hackmd.io/_uploads/HkUZB7oB6.png)

To enable offline model evaluation, KITAB not only provides book metadata and constraint verification functions, but it also includes a mapping of all books that satisfy each of the 12,989 queries. Altogether, this provides a convenient tool also for the evaluation of LLM generated output.

### Experiments
For counting constraints, we also consider titles that have one word more or less than the specified constraint as satisfied, to add more tolerance to the evaluation.

Surprisingly, even with all of this leeway, SOTA models still perform poorly on KITAB. Improvement on constraint satisfaction tasks may not come simply by scaling up.

![image](https://hackmd.io/_uploads/rJxrDGz2B6.png)
![image](https://hackmd.io/_uploads/S1wdNzhBT.png)
### Discussion
We presented KITAB, a dataset and dynamic data collection approach for evaluating abilities of large language models to filter information using constraints.
Limitations remain: models fabricate irrelevant information when only parametric knowledge is used, or they fail to satisfy specified constraints even when provided with the most complete and relevant context to filter upon.

The paper's limitations include:
- the use of a single dataset, which may not be representative of all possible scenarios
- the use of a specific language model architecture, which may not generalize to other models
- the paper focuses on factual queries and may not be applicable to other types of queries
- the paper does not provide a comprehensive analysis of the impact of different factors on model performance

## What I learned
Pseudo-Semantic Loss improves the model by predicting the value of each variable given a full observation of all other variables.

Constraints are effective for extracting keyphrases from a document.

In the general case, constraints are applied to the output sentence, while KITAB constrains the output retrieved from documents.

![image](https://hackmd.io/_uploads/Hyx9jI2rp.png =50%x)

# Comparison
|Paper|Type of constraint|Contribution|Tasks|Datasets|
|-|-|-|--|--|
|[A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints](https://openreview.net/attachment?id=hVAla2O73O&name=pdf)|Soft|We optimize the likelihood of the constraint under a pseudolikelihood-based approximation centered around a model sample.|Warcraft shortest-path finding, $\newline$ LLMs detoxification, comparing the entropy of our local approximation against that of the GPT-2 distribution|[Sudoku Puzzles](https://www.kaggle.com/datasets/rohanrao/sudoku/data), [REALTOXICITYPROMPTS](https://huggingface.co/datasets/allenai/real-toxicity-prompts), [openwebtext](https://huggingface.co/datasets/Skylion007/openwebtext)|
|[Controllable Abstractive Summarization](https://arxiv.org/pdf/1711.05217.pdf)|--|Neural summarization model that allows users to control the shape of their summaries. The model is based on a convolutional sequence-to-sequence architecture and incorporates user-specified control variables for length, style, entities, and amount of text.|abstractive summarization|[CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail)|
|[Constrained Abstractive Summarization: Preserving Factual Consistency with Constrained Generation](https://arxiv.org/pdf/2010.12723.pdf)|Hard|Constrained Abstractive Summarization (CAS), which preserves **factual consistency** by specifying tokens as constraints that must be present in the summary.|abstractive summarization|[CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail), [XSum](https://paperswithcode.com/dataset/xsum)|
|[KITAB: EVALUATING LLMS ON CONSTRAINT SATISFACTION FOR INFORMATION RETRIEVAL](https://openreview.net/attachment?id=b3kDP3IytM&name=pdf)|Hard|KITAB, a new dataset and methodology for **evaluating the ability of large language models to filter information using constraints**. The dataset provides a flexible way to control the type and complexity of constraints in queries that expect longer lists of outputs, beyond simple facts.|evaluate the ability of language models to retrieve relevant information in response to more complex queries that involve multiple constraints.|[KITAB](https://huggingface.co/datasets/microsoft/kitab), [The Pile](https://paperswithcode.com/dataset/the-pile), [WebQuestions](https://huggingface.co/datasets/web_questions), [TriviaQA](https://paperswithcode.com/dataset/triviaqa), [Natural Questions](https://huggingface.co/datasets/natural_questions)|

------------------------

# Key point
- Why are constraints an advantage over other approaches?
    - Fine-tuning or adapter
        - [TO LEARN EFFECTIVE FEATURES: UNDERSTANDING THE TASK-SPECIFIC ADAPTATION OF MAML](https://openreview.net/pdf?id=FPpZrRfz6Ss)
- Does previous work have any weaknesses?
    - How do others do it?
- Will I use a small model?
    - LLaMA-2
- How do I improve?
- Title
    - Constraint: retrieve what you expected

## The advantage of constrained text generation
* Fine-tuning is a general approach for adapting pre-trained models to specific tasks.
    * Process: The pre-trained model is initialized with weights learned from a large corpus of data. These weights are then adjusted based on the task-specific dataset during the fine-tuning process.
* Constrained Text Generation (CTG) is focused on generating text that meets certain constraints or conditions.
    * Process: The text generation model is trained with constraints, and during inference, the generation process is guided to ensure that the output meets the specified conditions.
* Adapter methods involve adding task-specific modules to a pre-trained model without modifying its core parameters, allowing it to handle various tasks.

Prompt Learning:
* Prompt Learning typically refers to the use of specific prompts or keywords in the input to trigger the model to generate a particular type of text.
* This approach focuses on utilizing predefined prompts without necessarily emphasizing constraint analysis on the input text.

Constrained Text Generation:
* Constrained Text Generation places more emphasis on analyzing the input text and ensuring that the generated text meets certain conditions or constraints.
* I want to analyze the input text to detect keywords and adjust the generated text accordingly.

Constrained text generation finds utility in various scenarios, contributing to:
1. **Domain-Specific Translation:**
    - Incorporating in-domain terminology in machine translation.
2. **Semantic Enhancement:**
    - Improving semantic correctness to ensure more meaningful content.
3. **Dialog System Improvement:**
    - Avoiding generic and meaningless responses in dialogue systems by grounding facts.
4. **Paraphrase Generation:**
    - Paraphrase generation in monolingual text rewriting.
5. **Search Query Enhancement:**
    - Re-writing a user search query as a fluent sentence.
6. **Text Summarization:**
    - Controlling attributes such as tense and length of summaries in text summarization.
7. **Addressing Dialogue Model Limitations:**
    - Overcoming limitations of neural text generation models in dialogue, including issues like genericness and repetitiveness of responses.

## Previous weakness
## Feedback
:::success
**Feedback**

SIGIR Retrieval conference

Emotional control

Different prompt but same output

Theoretical method? Novelty?

Conditional generative
:::

:::info
**Direction**

Different prompt but same output
- Understand what the user wants to ask

Theoretical method?

Conditional generative

What I want to do is handle different domains in the same task
- domain-adaptation
:::