# Learning to summarize from NL-Feedback + Refinements
# General
- The dataset provides a generated original summary. This summary was generated by one of the many models used in the paper *"Learning to Summarize from Human Feedback" [1]*. Note that they used various models to generate these summaries; in other words, the summaries are NOT ONLY generated by their human preference model, but may also be generated by their baseline supervised models. So there is no guarantee on the quality of these generated summaries, which we refer to as the *original summary*. This, however, is fine: we only want to show that we can improve on that summary.
- We filtered the dataset down to 150 datapoints, using a feedback length of more than 20 characters as the filtering heuristic. In other words, all the feedback given on the initial/original summaries produced by the models from [1] has at least 20 characters.
- Per datapoint we sample 5 generations with the following hyperparameters:
    - Temperature = 1.0
    - Top-p (nucleus sampling) = 0.9
    - Presence penalty = 0.5
    - Frequency penalty = 0.5
- The dataset provides a target summary that was written by the person who wrote the post on Reddit.
- We do 2-shot learning. We cannot fit more examples into the context, i.e. with 3-shot learning the longest input no longer fits into the context. By 2-shot learning we mean that we provide 2 example texts with (depending on the prompt type) feedback and refinements.
- We have taken 20 different posts with original summaries and feedback on those summaries. We used the feedback to write refined summaries that incorporate the feedback as well as possible while still being shorter than 48 tokens. If the feedback said, for example, that the whole summary is bad, we rewrote the whole summary. For each test sample we randomly sample 2 posts, with their summaries and (where applicable) feedback and refinements, from this set of 20 posts; see the sketch after this list.
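The following sketch illustrates the preprocessing and sampling setup described above: filtering by feedback length, assembling a 2-shot prompt from the pool of 20 handwritten examples, and drawing 5 samples per datapoint with the hyperparameters listed above. It is a minimal sketch, not the exact code used: the OpenAI-style completions call, the engine name, the `max_tokens=48` cap, the stop sequence, and the datapoint field names are assumptions.
```python
import random
import openai  # assuming an OpenAI-style completions endpoint was used

raw_dataset = [...]   # all NL-feedback datapoints: dicts with title, post, summary, feedback
FEWSHOT_POOL = [...]  # the 20 handwritten examples, each with a refinement we wrote ourselves

def build_prompt(examples, test_point, use_feedback=True):
    """Assemble a 2-shot prompt of the 'Test Summary (+ Test Feedback)' type."""
    parts = []
    for ex in examples:
        block = f"Title: {ex['title']}\nPost: {ex['post']}\nSummary: {ex['summary']}\n"
        if use_feedback:
            block += f"Feedback: {ex['feedback']}\n"
        block += f"The improved summary should say:\ntl;dr: {ex['refinement']}"
        parts.append(block)
    query = f"Title: {test_point['title']}\nPost: {test_point['post']}\nSummary: {test_point['summary']}\n"
    if use_feedback:
        query += f"Feedback: {test_point['feedback']}\n"
    query += "The improved summary should say:\ntl;dr:"
    parts.append(query)
    return "\n###\n".join(parts)

# Heuristic filter: keep only datapoints whose feedback has more than 20 characters (150 datapoints).
dataset = [d for d in raw_dataset if len(d["feedback"]) > 20]

for point in dataset:
    shots = random.sample(FEWSHOT_POOL, 2)      # 2 of the 20 handwritten examples
    prompt = build_prompt(shots, point)
    response = openai.Completion.create(        # legacy completions API (assumed)
        engine="davinci",                       # hypothetical engine name
        prompt=prompt,
        n=5,                                    # 5 generations per datapoint
        max_tokens=48,                          # summaries capped at 48 tokens (assumed)
        temperature=1.0,
        top_p=0.9,
        presence_penalty=0.5,
        frequency_penalty=0.5,
        stop=["###"],                           # assumed stop sequence at the example separator
    )
    generations = [choice.text.strip() for choice in response.choices]
```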
# Prompts
For a detailed explanation and interpretation of the prompt types, see the Rouge results section. Below is a legend of the prompt types we use and their names.
**TL;DR with Summary (Baseline):**
```
Title: {title}
Post: {post}
tl;dr: {original summary}
###
Title: {title}
Post: {post}
tl;dr:
```
**TL;DR with Refinement (Baseline):**
```
Title: {title}
Post: {post}
tl;dr: {refinement}
###
Title: {title}
Post: {post}
tl;dr:
```
**Train Summary + Train Feedback + Train Refinement + Generate Bad Summary:**
```
Title: {title}
Post: {post}
Summary: {original summary}
Feedback: {Feedback}
The improved summary should say:
tl;dr: {Refinement}
###
Title: {title}
Post: {post}
Summary:
```
**Train Summary + Train Feedback + Train Refinement + Generate Good Summary:**
```
Title: {title}
Post: {post}
Summary: {original summary}
Feedback: {Feedback}
The improved summary should say:
tl;dr: {Refinement}
###
Title: {title}
Post: {post}
The improved summary should say:
tl;dr:
```
**Train Summary + Train Refinement + Generate Bad Summary:**
```
Title: {title}
Post: {post}
Summary: {original summary}
The improved summary should say:
tl;dr: {Refinement}
###
Title: {title}
Post: {post}
Summary:
```
**Train Summary + Train Refinement + Generate Good Summary:**
```
Title: {title}
Post: {post}
Summary: {original summary}
The improved summary should say:
tl;dr: {Refinement}
###
Title: {title}
Post: {post}
The improved summary should say:
tl;dr:
```
**Train Summary + Train Feedback + Train Refinement + Test Summary + Test Feedback:**
```
Title: {title}
Post: {post}
Summary: {original summary}
Feedback: {Feedback}
The improved summary should say:
tl;dr: {Refinement}
###
Title: {title}
Post: {post}
Summary: {original summary}
Feedback: {Feedback}
The improved summary should say:
tl;dr:
```
**Train Summary + Train Refinement + Test Summary:**
```
Title: {title}
Post: {post}
Summary: {original summary}
The improved summary should say:
tl;dr: {Refinement}
###
Title: {title}
Post: {post}
Summary: {original summary}
The improved summary should say:
tl;dr:
```
# Results
We run experiments with Rouge scores and with the reward model learned from human preferences provided by [1]. Note that we show the Rouge scores only for completeness and full documentation. We currently believe that Rouge scores do not come close to capturing the quality of summaries and refinements, and thus advocate using the reward model only.
## Human Preferences
As mentioned, we sample 5 generations per test sample. In the following we report results for the minimum, mean, and maximum reward score over those 5 generations.
### Reward Score
We plot the reward scores that are calculated with the human preference reward model.
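As a rough sketch of how these aggregates are computed, assuming a hypothetical `reward_score(post, summary)` wrapper around the reward model of [1]:
```python
import numpy as np

def reward_score(post: str, summary: str) -> float:
    """Hypothetical wrapper around the human-preference reward model of [1];
    returns a scalar reward for `summary` given `post`."""
    ...

def aggregate_rewards(post: str, generations: list[str]) -> tuple[float, float, float]:
    """Score the 5 generations of one test sample and reduce to min / mean / max."""
    scores = np.array([reward_score(post, g) for g in generations])
    return scores.min(), scores.mean(), scores.max()
```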
*(Plots: reward scores of the generations, aggregated as Min | Mean | Max over the 5 samples per test datapoint.)*
- We can see that feedback + refinement works best (for mean and min at least). This was also apparent when looking at some random result samples. The large models really were able to incorporate the feedback into the refinements.
- When looking at the Rouge scores in the next chapter, we see that refinements alone work better than refinements + feedback. This goes against our intuition; moreover, when inspecting some samples, the models were clearly able to improve the summaries using the feedback. We believe this is an artefact of the Rouge scores and that they are simply not suitable for this task. The human preference reward model is much better aligned with our own preferences.
- When looking at the max, we see that all prompt types work nearly equally well. This is expected: if we sample many times, it is quite likely that the best generated summary for each prompt type is really good. What matters is the mean, and there we see that our method of refinement + feedback outperforms the other methods on average. In other words, these prompt types generate more good summaries on average and don't just get lucky.
- Looking at the max is still useful in order to know the quality of the best generated summaries, since one can use those, for example, to fine-tune on.
### Fraction preferred to Original Summary
This metric is similar to what is reported in [1]. Namely, they look at the reward scores of their generations vs. the reference (i.e. target) summaries provided by humans. They then report the percentage of generated summaries preferred to the reference summaries (i.e. the fraction where the generated summary has a higher reward score than the reference summary).
We do the same thing, but compare our generations (across models and prompt types) to the original summaries (i.e. the summaries that were generated by [1] with various models). This is a very important metric: since we often provide these original summaries in the prompt (at least for some prompt types), it allows us to see whether the refinements improve on the original summaries.
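A minimal sketch of this metric, reusing the hypothetical `reward_score` wrapper from above (all variable names are assumptions):
```python
import numpy as np

def fraction_preferred(posts, aggregated_rewards, original_summaries):
    """Fraction of test samples whose generated summary scores higher than the
    original summary from [1]. `aggregated_rewards` holds one reward per test
    sample (the min, mean, or max over its 5 generations)."""
    original_rewards = np.array(
        [reward_score(post, summ) for post, summ in zip(posts, original_summaries)]
    )
    return float(np.mean(np.array(aggregated_rewards) > original_rewards))
```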
*(Plots: fraction of generations preferred to the original summary, for the Min | Mean | Max aggregates.)*
- We can nicely see that using refinements + feedback outperforms the other methods (for min and mean). This is a clear indication that using feedback + refinement seems to work.
- This shows that the model is able to incorporate the feedback and improve on the original summaries.
### Fraction preferred to Reference Summary
This comparison is now literally the same as in [1], i.e. we look at the fraction of times our generated summaries have a higher reward score than the reference (i.e. target) summaries that were written by humans. Keep in mind that we never fine-tuned our models on summarization and are doing this in a few-shot setting.
*(Plots: fraction of generations preferred to the reference summary, for the Min | Mean | Max aggregates.)*
- Here refinement + feedback is comparable to using refinements only. We believe the two are so close because overall performance is simply not that good compared to the reference summaries. The experiment above showed that feedback + refinement is better than refinements alone; however, both methods still lag far behind the reference summaries, so neither beats the reference summaries more often than the other. Most likely this is because the quality of the reference summaries is just very high, and hence no difference between the methods is visible on this metric.
## Rouge Scores
In the following we always report the mean over the 5 generated samples.
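A sketch of how these numbers can be computed, assuming the `rouge-score` package (the package choice and variable names are assumptions; the report only fixes that the mean over the 5 generations is reported):
```python
import numpy as np
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def mean_rouge(reference: str, generations: list[str]) -> dict:
    """Mean ROUGE F1 over the 5 generations of one test sample."""
    per_metric = {"rouge1": [], "rouge2": [], "rougeL": []}
    for generation in generations:
        scores = scorer.score(reference, generation)  # signature: score(target, prediction)
        for metric in per_metric:
            per_metric[metric].append(scores[metric].fmeasure)
    return {metric: float(np.mean(values)) for metric, values in per_metric.items()}
```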
### Baselines
In the following we introduce two baselines. Note that these baselines are normal ways of prompting a language model to write a summary (i.e. without using feedback or refinements).
**TL;DR with Summary:**
```
Title: {title}
Post: {post}
tl;dr: {original summary}
###
Title: {title}
Post: {post}
tl;dr:
```
**TL;DR with Refinement:**
```
Title: {title}
Post: {post}
tl;dr: {refinement}
###
Title: {title}
Post: {post}
tl;dr:
```
- We can see that using a higher-quality summary as the example (i.e. the refinement instead of the original summary) leads to slightly better performance. Remember that the refinements in the examples were written by myself, while the original summaries were generated by a possibly "bad" model in [1].
- From now on we will only use TL;DR with refinement as the baseline, since it appears to be the better of the two.
*(Plots: Rouge 1 | Rouge 2 | Rouge L for the two baselines.)*
### Generate Bad vs Good Summary
In the following we extend the baselines with refinements and optional feedback in the examples. The model thus always sees an original summary for a post and a refinement of that summary. The goal is then to investigate whether the model can generate a "good" refinement or a "bad" summary (equivalent to the original summary) for a new post.
**Train Summary + Train Feedback + Train Refinement + Generate Bad Summary:**
```
Title: {title}
Post: {post}
Summary: {original summary}
Feedback: {Feedback}
The improved summary should say:
tl;dr: {Refinement}
###
Title: {title}
Post: {post}
Summary:
```
**Train Summary + Train Feedback + Train Refinement + Generate Good Summary:**
```
Title: {title}
Post: {post}
Summary: {original summary}
Feedback: {Feedback}
The improved summary should say:
tl;dr: {Refinement}
###
Title: {title}
Post: {post}
The improved summary should say:
tl;dr:
```
**Train Summary + Train Refinement + Generate Bad Summary:**
```
Title: {title}
Post: {post}
Summary: {original summary}
The improved summary should say:
tl;dr: {Refinement}
###
Title: {title}
Post: {post}
Summary:
```
**Train Summary + Train Refinement + Generate Good Summary:**
```
Title: {title}
Post: {post}
Summary: {original summary}
The improved summary should say:
tl;dr: {Refinement}
###
Title: {title}
Post: {post}
The improved summary should say:
tl;dr:
```
**Only refinement**
*(Plots: Rouge 1 | Rouge 2 | Rouge L for the refinement-only prompt types.)*
**Refinement + Feedback**
*(Plots: Rouge 1 | Rouge 2 | Rouge L for the refinement + feedback prompt types.)*
- While the results are very close to each other, we can generally observe that the "good" summaries are better than the "bad" summary generations. This means that the model can be steered to generate either normal (i.e. "bad") summaries or refinements thereof (i.e. "good" summaries).
- Whether or not feedback is included for the example posts in the prompt does not much influence this finding, or the results in general.
### Main Method
In the following we introduce two prompt types that represent our main method. Our hypothesis is that using refinements + feedback in the prompt should generally improve the generations. We also investigate using only the refinement, without feedback.
**Train Summary + Train Feedback + Train Refinement + Test Summary + Test Feedback:**
```
Title: {title}
Post: {post}
Summary: {original summary}
Feedback: {Feedback}
The improved summary should say:
tl;dr: {Refinement}
###
Title: {title}
Post: {post}
Summary: {original summary}
Feedback: {Feedback}
The improved summary should say:
tl;dr:
```
**Train Summary + Train Refinement + Test Summary:**
```
Title: {title}
Post: {post}
Summary: {original summary}
The improved summary should say:
tl;dr: {Refinement}
###
Title: {title}
Post: {post}
Summary: {original summary}
The improved summary should say:
tl;dr:
```
**Refinement vs Target**
Here we compare our generated summaries (i.e. refinements) with the target summaries that were written by humans, i.e. rouge(refinement, target).
*(Plots: Rouge 1 | Rouge 2 | Rouge L, refinement vs. target.)*
**Refinement vs Feedback**
Here we compare our generated summaries (i.e. refinements) with the feedback that was provided on the original summaries, i.e. rouge(refinement, feedback).
*(Plots: Rouge 1 | Rouge 2 | Rouge L, refinement vs. feedback.)*
- First, we want to explain the large performance gap between methods that use the original summary of the test sample and those that do not. The prompt types *Train Summary + Train Feedback + Train Refinement + Test Summary + Test Feedback* and *Train Summary + Train Refinement + Test Summary* use the *Test Summary*, i.e. the generated summary of the test sample from [1], and try to improve on it by writing a refinement. This is obviously a much simpler task than writing a whole summary from scratch, which is why the performance of these two prompt types is so much better. Technically, one cannot directly compare these prompt types to the simple baselines that generate a summary from scratch.
- We can see that for *refinement vs target*, the scores are higher when using the refinement only than when using refinement + feedback. This is counterintuitive, since we would expect the feedback to help. Moreover, when investigating some samples we can actually see that the generations incorporated the feedback and would thus be rated higher by a human. The results using the reward model show that this is indeed the case.
- When looking at *refinement vs feedback* we can see that the scores are much higher for refinement + feedback than only using refinements. This shows that the generations have a larger overlap with the feedback and were most likely able to incorporate the feedback.
- Lastly, we can see that even if we provide refinements and feedback in the prompt examples but ask the model to generate a good refinement directly (i.e. without providing the original summary of the test sample), the performance is much worse than when we provide the original summary.