# What is finetuning and why do we need it?
* Finetuning is an approach to transfer learning in which the parameters of a pre-trained model, such as a Generative Pre-trained Transformer (GPT), are further trained on new data.
* Lets the model specialize; in our case, to write posts as instructed by the user
* More consistent (hopefully good) performance
# Goal:
* Better instruction following
* Align the user's prompt with their most desired result
# How to do finetuning
Task type: Given the previous words, predict the next word
## QLoRA (Quantized Low-Rank Adaptation)
### Overview
QLoRA keeps the pretrained weights frozen in quantised (4-bit) form and trains small low-rank adapters on top, which makes finetuning a large model feasible on modest GPU hardware.
### Quantisation: a way to store numbers more efficiently[^quantisation_blog]

[^quantisation_blog]: https://huggingface.co/blog/hf-bitsandbytes-integration
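
As a rough illustration of how the quantised base model is loaded, here is a minimal sketch using `bitsandbytes` through `transformers` (the Llama 3 checkpoint name and the NF4 settings are assumptions for illustration, not our exact configuration):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantisation with bfloat16 compute, as used in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Checkpoint name is illustrative; swap in the model we actually finetune
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```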
### LoRA
Meta trained Llama 3 from $W_0$ to $W$ given the model architecture. This process is expensive and time-consuming.
We want to finetune our model starting from their pretrained weights.
Instead of training directly towards $W^{*}$, the parameters specialised for writing posts, we model the change $\Delta W$, so that $W + \Delta W = W^{*}$.
Low rank: represent a matrix as the product of two much smaller matrices,
$\Delta W_{d\times d} = A_{d\times r}\, B_{r \times d}$
where $r \ll d$, so far fewer parameters need to be trained.

Once we have learned $\Delta W$, we merge it into $W$ to obtain the finetuned model.
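
As a sketch of how the low-rank adapters are attached in practice with the `peft` library (the rank, scaling factor, and target modules below are illustrative assumptions, not tuned values):

```python
from peft import LoraConfig, get_peft_model

# `model` is the 4-bit base model loaded in the previous sketch.
# Trainable low-rank matrices A (d x r) and B (r x d) are added to the
# attention projections; the quantised base weights W stay frozen.
lora_config = LoraConfig(
    r=16,                      # rank r of the decomposition (assumption)
    lora_alpha=32,             # scaling factor (assumption)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the parameters is trainable

# After training, merging folds Delta W back into W:
# merged_model = model.merge_and_unload()
```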

## DPO (Direct Preference Optimization)
Further alignment on human preference.
Finetuning only tells the model what it should do. DPO additionally informs the model what it should not do: it trains the model towards the chosen response (the preference) and away from the rejected one.


The DPO objective is

$\mathcal{L}_{DPO}(\pi; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log\frac{\pi(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log\frac{\pi(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right]$

where $\pi$ is the model being optimised, $\pi_{ref}$ is the model finetuned previously, $x$ is the prompt, $y_w$ is the chosen (good) response, $y_l$ is the rejected one, $\sigma$ is the sigmoid function, and $\beta$ is a hyperparameter which controls the scale of the penalty when the model prefers the rejected response.
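
A direct translation of this loss into PyTorch, as a minimal sketch (it assumes each argument is the log-probability of the full response summed over its tokens; in practice a library such as `trl` handles this bookkeeping):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Per-example DPO loss; beta=0.1 is a common default, not our tuned value."""
    # beta * log-ratio for the chosen and rejected responses
    chosen_logratios = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_logratios = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigma(chosen - rejected): push pi towards y_w and away from y_l
    return -F.logsigmoid(chosen_logratios - rejected_logratios)
```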

# Our Dataset
## Dataset Quality
### Prompts
Each channel (except Wordpress) has two pairs of `system` and `user` prompts, corresponding to generating content from a specific topic or via a website URL. In addition, there is an extra prompt utilised by the chatbot function of Sarah. For the purposes of this fine-tuning we will be using the pairs of `system` and `user` prompts corresponding to content generated from a given topic and from links (which will require parsing the links for the HTML content, as in the actual model).
We additionally require information about the companies, such as their `company_name`, `usp`, and `audience`, in order to fully reconstruct the prompts for the fine-tuning.
To increase the effectiveness of the contrastive learning, we will choose a random selection of content whose prompts specify `No Markdown` and add Markdown formatting to the negative examples, incentivising the model to avoid this in future. For Wordpress, this can be reversed by removing the `HTML` formatting from the negative examples.
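
As a sketch of how such a synthetic negative example could be constructed (the corruption rules below are hypothetical illustrations, not our final heuristics):

```python
def make_markdown_negative(clean_text: str) -> str:
    """Hypothetical corruption step: add Markdown formatting to a clean post so it
    can serve as the rejected example for a prompt that specifies `No Markdown`."""
    lines = clean_text.splitlines() or [clean_text]
    lines[0] = "## " + lines[0]                                    # first line becomes a heading
    lines[1:] = ["**" + l + "**" if l else l for l in lines[1:]]   # bold the remaining lines
    return "\n".join(lines)


def make_preference_row(prompt: str, published_text: str) -> dict:
    # DPO-style triple: the published post is the chosen response,
    # the Markdown-corrupted copy is the rejected one
    return {
        "prompt": prompt,
        "chosen": published_text,
        "rejected": make_markdown_negative(published_text),
    }
```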
### Content
This dataset consists of $3,040$ rows. In order to determine what content was perceived as effective and what was dismissed, we required a framework to identify whether the content was used or not. For this we used the four columns `status`, `final`, `deleted`, and `version`, with the following criteria (a code sketch of these rules follows the list):
- If `final` is True:
  - If `status` is `published`, `scheduled`, or `publish_failed`, the content should be used as a positive example
  - The original version of the content should be used as the negative example
- If `final` does not exist for any of the versions, we take the latest version to be the final version
- If `status` is `cancelled` OR `deleted` is `TRUE`, the content cannot be used
  - It cannot even be used as a negative example with contrastive learning, since no corresponding positive example exists
- If `status` is `proposal`, `draft`, `in_progress`, or `human_review`, the content can potentially be used, but we need further verification on each of these categories
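
A rough pandas sketch of these rules (the column names and status values come from the list above; the grouping key `content_id` and the `content` text column are assumptions about the schema):

```python
import pandas as pd

POSITIVE_STATUSES = {"published", "scheduled", "publish_failed"}

def extract_preference_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the usability criteria to the content table, one row per version."""
    # Cancelled or deleted content cannot be used, not even as a negative example
    df = df[~(df["status"].eq("cancelled") | df["deleted"].eq(True))]

    pairs = []
    for _, group in df.groupby("content_id"):
        group = group.sort_values("version")
        final_rows = group[group["final"].eq(True)]
        # If no version is marked final, take the latest version as the final one
        final_row = final_rows.iloc[-1] if len(final_rows) else group.iloc[-1]
        if final_row["status"] in POSITIVE_STATUSES:
            pairs.append({
                "positive": final_row["content"],      # the final, published version
                "negative": group.iloc[0]["content"],  # the original version
            })
    return pd.DataFrame(pairs)
```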
There are $623$ entries marked as final, and approximately this many entries are likely to be usable in fine-tuning, given that some are in the `draft` category.
### Templates
This data is extremely large, with $\sim60,000$ rows; however, there is no indication of whether the content was utilised or not. There are two headers labelled `sarah_output` and `user_edit`, and in the absence of further evidence we deemed it sensible to assume that content with no difference between the two entries was not used. Filtering on this criterion reduces the dataset to $\sim6,000$ rows. Upon further manual inspection, however, a large number of these entries are Wordpress entries with only `HTML` formatting added. We subsequently decided to strip `HTML` formatting and whitespace from both columns and then refilter, which yielded only $\sim1,000$ different examples, many of which appeared identical under visual inspection.
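
A minimal sketch of this normalisation and refiltering step (the HTML stripping uses a crude regex; `sarah_output` and `user_edit` are the columns mentioned above):

```python
import re
import pandas as pd

def normalise(text: str) -> str:
    """Strip HTML tags and collapse whitespace before comparing the two columns."""
    text = re.sub(r"<[^>]+>", " ", str(text))  # crude HTML tag removal
    return re.sub(r"\s+", " ", text).strip().lower()

def filter_real_edits(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only rows where the user made a substantive change to Sarah's output
    changed = df["sarah_output"].map(normalise) != df["user_edit"].map(normalise)
    return df[changed]
```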
We suspect that the quality of this data is very low and recommend that it not be used in the fine-tuning process. It might be possible to extract a small number of decent datapoints, but this should be a secondary effort.
## How could improved tracking reduce these issues in the future?
The key issue constraining current fine-tuning attempts is the lack of good-quality datapoints we can use. This is addressed in the following section, but there are a couple of other things that could improve the effectiveness of the fine-tuning process:
1. Storing whether edit versions are regenerations (i.e. telling Sarah to change a section) or manual edits (e.g. changing spellings or the ordering of paragraphs).
   - This makes it possible to better determine which versions are potentially positive and negative examples, possibly squeezing more than one positive example out of the same prompt if multiple user edits are made.
2. Following up with users to determine (potentially automatically) whether drafts were eventually used, and incentivising them to provide the version they actually submitted.
   - This would make it possible to use these drafts as fine-tuning datapoints with full confidence.
## Potential Ways to Enrich Our Dataset
### Utilize Large Language Models (LLMs) to Classify Quality Posts
We can leverage LLMs to identify "good" posts from our existing dataset, selecting the most promising content from each draft page.
- **Advantages**: This method is cost-effective, as it leverages automation to process large volumes of data without significant human intervention.
- **Disadvantages**: The quality of the data may be questionable, as the LLMs might not always align with human judgment in determining what constitutes a "good" post. There is a risk of inconsistency in the classifications, which could affect the overall reliability of the dataset.
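
A rough sketch of how such a classifier could look using the OpenAI chat completions API (the model name, rubric, and scoring scheme are placeholder assumptions):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are a marketing editor. Rate the following post from 1 to 10 for clarity, "
    "relevance to the stated audience, and strength of the call to action. "
    "Reply with the number only."
)

def score_post(post: str, model: str = "gpt-4o-mini") -> int:
    """Ask an LLM to rate a draft post; posts above some threshold would be kept
    as positive fine-tuning examples."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": post},
        ],
    )
    return int(response.choices[0].message.content.strip())
```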
### Use Our Evaluation Tool
By involving marketing specialists (or others), we can have experts review and possibly enhance the drafts, labeling those deemed "publish-worthy".
- **Advantages**: This approach ensures high-quality, expert-reviewed data. The involvement of marketing specialists guarantees that the posts meet industry standards and are more likely to resonate with the target audience.
- **Disadvantages**: The process is expensive and time-consuming, requiring significant investment in human resources. The reliance on experts also limits scalability, as the volume of posts they can review is constrained by their availability.
### Scout Web for Quality Marketing Posts
We can identify and scrape marketing posts that are widely regarded as effective, then reverse-engineer the required information for prompts, potentially using LLMs for assistance.
- **Advantages**: This method is low cost and allows us to tap into proven marketing strategies. The amount of available data is technically unlimited.
- **Disadvantages**: Achieving consensus on which companies' marketing posts are "good" can be challenging, and those companies might not be the same "type" as ours (e.g. in terms of size). Additionally, there are potential privacy concerns related to scraping and using content from other sources without explicit permission.
### Wait for User Engagement
Collect data as users interact with our services, allowing natural data enrichment through real-world usage.
- **Advantages**: This method yields data that directly reflects user behavior and preferences, ensuring relevance and applicability.
- **Disadvantages**: This process is time-dependent and slower. It requires a substantial user base and ongoing interaction to gather sufficient data, which may significantly delay the finetuning process.