# 112-1 ADL Final Project
## - Image to Text
### Pipeline Proposal
* Images to captions -> captions to prompt -> GPT-2? (how to fine-tune?) -> Story (see the sketch after this list)
* Stanford dataset picture -> BLIP answers 5 questions -> prompt to GPT-2 -> use the Stanford dataset summary as the label to fine-tune GPT-2
* Images to story (storytelling...)
    * Pretrained caption models -> fine-tune for story
    * example: https://huggingface.co/google/pix2struct-textcaps-large
* captions + images -> GPT-4 API -> story
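
A minimal sketch of the captions -> prompt -> GPT-2 idea, assuming an off-the-shelf Hugging Face captioner and the base `gpt2` checkpoint as stand-ins (the prompt template and the fine-tuned story model are still open):

```
from transformers import pipeline

# Stand-in models: any image-to-text checkpoint works as the captioner,
# and plain `gpt2` stands in for the (to-be-fine-tuned) story generator.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
generator = pipeline("text-generation", model="gpt2")

caption = captioner("example.jpg")[0]["generated_text"]

# Hypothetical prompt template wrapping the caption before story generation.
prompt = f"Write a short story about the following scene: {caption}\nStory:"
story = generator(prompt, max_new_tokens=200, do_sample=True, top_p=0.9)[0]["generated_text"]
print(story)
```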
- ### **BLIP-2**
- Image captioning (minimal captioning sketch below)
- https://github.com/salesforce/LAVIS/tree/main/projects/blip2
- demo: https://github.com/salesforce/LAVIS/blob/main/examples/blip_image_captioning.ipynb
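
A minimal BLIP-2 captioning sketch following the LAVIS demo linked above (the `blip2_opt` / `pretrain_opt2.7b` variant is just one of the checkpoints listed in the repo):

```
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
# One of the BLIP-2 variants from the LAVIS model zoo; swap for another as needed.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)
raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image}))  # e.g. ['a photo of ...']
```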
- ### **Semantic-Segment-Anything**
- Very strong object recognition
- https://github.com/fudan-zvg/Semantic-Segment-Anything
- ### **Image2Paragraph**
- Full Pipeline
- https://github.com/showlab/Image2Paragraph
- ### **vit-gpt2-image-captioning**
- Image captioning (minimal inference sketch below)
- https://huggingface.co/nlpconnect/vit-gpt2-image-captioning
- fine-tuning: https://www.linkedin.com/pulse/fine-tuning-image-to-text-algorithms-withlora-daniel-puente-viejo/
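
A minimal inference sketch for this checkpoint, mirroring its model card (the LoRA fine-tuning described in the linked article is not shown here):

```
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image

model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

image = Image.open("example.jpg").convert("RGB")
pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_length=16, num_beams=4)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```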
- ### **GIT**
- https://huggingface.co/docs/transformers/main/en/model_doc/git
- fine-tuning: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/GIT/Fine_tune_GIT_on_an_image_captioning_dataset.ipynb
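
A minimal GIT captioning sketch based on the Transformers docs linked above (`microsoft/git-base-coco` is one example checkpoint; the notebook covers fine-tuning):

```
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```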
# Dec 11th, 2023:
Previous difficulties:
1. Hard to reverse-generate a caption from keywords alone.
2. Music generated solely from the caption sounds bad.

New plan:
1. Generate music-style keywords
2. Generate Chinese poems:
    * BLIP2 -> paragraph -> ??? -> Llama 2 -> Chinese poem
    * Dataset: GPT-4 generated (x=paragraph, y=poem); see the data-format sketch after this list
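
A possible data-format sketch for the GPT-4-generated (paragraph, poem) pairs, assuming a simple prompt/completion JSONL layout; the instruction template and the exact format expected by the Llama 2 / Taiwan-LLM fine-tune are still open questions:

```
import json

# Hypothetical instruction template; the wording used for the actual fine-tune is undecided.
TEMPLATE = "Write a Chinese poem that captures the following scene.\n\nScene: {paragraph}\n\nPoem:"

def to_training_example(paragraph, poem):
    """Turn one GPT-4-generated (paragraph, poem) pair into a prompt/completion record."""
    return {"prompt": TEMPLATE.format(paragraph=paragraph), "completion": poem}

pairs = [("A quiet harbor at dusk ...", "暮色下的寧靜港灣 ...")]  # placeholder pair
with open("paragraph_poem.jsonl", "w", encoding="utf-8") as f:
    for paragraph, poem in pairs:
        f.write(json.dumps(to_training_example(paragraph, poem), ensure_ascii=False) + "\n")
```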
Title Proposal:
1. Shot GMP (generate a musical poem from an image)
2. Story GMP
3. ShotP&M
To-Do:
1. Picture-paragraph dataset 陳德維
2. Paragraph-poem dataset 黃品翰
3. Fine-tune LLM (e.g. Llama 2 / Taiwan-LLM) 吳憶茹
4. Picture to prompt for MusicGen (see the MusicGen sketch after this list) 陳德維
5. Pipeline (Gradio...) 陳儷其 陳德維
6. PowerPoint for presentation 陳儷其
7. GPT-4 prompts 黃品翰
8. Try keyword dataset with poem fine-tuning 洪子涵
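
For To-Do item 4, a minimal sketch of driving MusicGen from a text prompt via Transformers (checkpoint and prompt are placeholders; in the real pipeline the prompt would be built from the image-derived keywords):

```
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# Placeholder prompt; the pipeline would compose this from the image keywords.
inputs = processor(text=["calm lo-fi piano, rainy evening"], padding=True, return_tensors="pt")
audio_values = model.generate(**inputs, max_new_tokens=256)

sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
```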
### - InstructBLIP prompts:
```
prompts = {
    'paragraph': "Write a detailed description for the photo",
    'vibe': "Question: Give 3 adjectives to describe the vibe of image. Answer:",
    'time': "Question: What time of the day does it represent? Answer:",
    'music-era': "Question: What time era does it represent? Answer:",
    'emotion': "Question: What emotions do you think the person taking this image is experiencing? Answer:",
    'music-style': "Question: Describe the music style. Answer:",
}
```
```
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image
from tqdm.auto import tqdm
import json
import jsonlines
import argparse
import pandas as pd


def load_model(path):
    """Load the InstructBLIP model and processor, moving the model to GPU if available."""
    model = InstructBlipForConditionalGeneration.from_pretrained(path)
    processor = InstructBlipProcessor.from_pretrained(path)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    return model, processor, device


def inference(model, processor, device, urls, prompts, args):
    """Run every prompt on every image and stream each result to a JSONL file."""
    result = []
    failed = []
    total = len(urls)
    if args.append:
        # Resume mode: reuse the fields already generated in a previous run.
        with open(args.prev_output_path, 'r') as json_file:
            data = json.load(json_file)
    with jsonlines.open("InstructBlip_output.jsonl", mode='a') as writer:
        for idx, url in tqdm(iterable=enumerate(urls), total=total, colour="green"):
            try:
                obj = {}
                if args.append:
                    obj = data[idx]
                obj['url'] = url
                # Images are pre-downloaded; the filename is the last segment of the URL.
                img_path = "/tmp2/b10902138/.cache/stanford_images/" + url.split('/')[-1]
                image = Image.open(img_path).convert("RGB")
                for prompt in prompts:
                    inputs = processor(images=image, text=prompts[prompt], return_tensors="pt").to(device)
                    outputs = model.generate(
                        **inputs,
                        do_sample=False,
                        num_beams=5,
                        max_length=256,
                        min_length=1,
                        top_p=0.9,
                        repetition_penalty=1.5,
                        length_penalty=1.0,
                        temperature=1,
                    )
                    generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
                    obj[prompt] = generated_text
                writer.write(obj)
                result.append(obj)
            except Exception:
                # Keep going on missing files or bad URLs and report them at the end.
                failed.append(url)
    return result, failed


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true", default=False, help="Only process the first 5 images")
    parser.add_argument("--append", action="store_true", default=False, help="Append to fields from a previous output file")
    parser.add_argument("--model_path", type=str, help="The path (or hub id) of the InstructBLIP model")
    parser.add_argument("--data_path", type=str, help="The path of the input CSV")
    parser.add_argument("--column_name", type=str, help="The CSV column that holds the image URLs")
    parser.add_argument("--prev_output_path", type=str, help="The path of the previous output file (for --append)")
    parser.add_argument("--output_path", type=str, help="The path of the output JSON file")
    args = parser.parse_args()

    # Get model
    model, processor, device = load_model(args.model_path)

    # Get data
    data = pd.read_csv(args.data_path)
    urls = data[args.column_name]
    if args.debug:
        urls = [urls[i] for i in range(5)]

    prompts = {
        'paragraph': "Write a detailed description for the photo",
        'vibe': "Question: Give 3 adjectives to describe the vibe of image. Answer:",
        'time': "Question: What time of the day does it represent? Answer:",
        'music-era': "Question: What time era does it represent? Answer:",
        'emotion': "Question: What emotions do you think the person taking this image is experiencing? Answer:",
        'music-style': "Question: Describe the music style. Answer:",
    }

    # Inference
    result, failed = inference(model, processor, device, urls, prompts, args)

    # Output
    with open(args.output_path, "w", encoding="utf-8") as output_file:
        json.dump(result, output_file, indent=2, ensure_ascii=False)
    print(failed)
```
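
Assuming the script above is saved as `instructblip_inference.py` (the filename is arbitrary), a typical run would be `python instructblip_inference.py --model_path Salesforce/instructblip-vicuna-7b --data_path stanford.csv --column_name url --output_path InstructBlip_output.json`, where the data path and column name are placeholders for the Stanford image-URL CSV; `--debug` limits the run to the first five images, and `--append --prev_output_path` resumes from an earlier output file.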