# 112-1 ADL Final Project
## - Image to Text
### Pipeline Proposal
* Images to captions -> captions to prompt -> GPT-2? (how to fine-tune?) -> Story (see the sketch after this list)
* Stanford dataset picture -> BLIP answers 5 questions -> prompt to GPT-2 -> use the Stanford dataset summary as the label to fine-tune GPT-2
* Images to story (storytelling...)
    * Pretrained caption models -> fine-tune for story
    * example: https://huggingface.co/google/pix2struct-textcaps-large
* captions + images -> GPT-4 API -> story
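
A minimal sketch of the captions -> prompt -> GPT-2 idea, assuming an off-the-shelf Hugging Face captioner and the base `gpt2` checkpoint as stand-ins (the prompt template and the fine-tuned story model are still open):

```
from transformers import pipeline

# Stand-in models: any image-to-text checkpoint works as the captioner,
# and plain `gpt2` stands in for the (to-be-fine-tuned) story generator.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
generator = pipeline("text-generation", model="gpt2")

caption = captioner("example.jpg")[0]["generated_text"]

# Hypothetical prompt template wrapping the caption before story generation.
prompt = f"Write a short story about the following scene: {caption}\nStory:"
story = generator(prompt, max_new_tokens=200, do_sample=True, top_p=0.9)[0]["generated_text"]
print(story)
```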
- ### **BLIP-2**
- Image captioning (minimal captioning sketch below)
- https://github.com/salesforce/LAVIS/tree/main/projects/blip2
- demo: https://github.com/salesforce/LAVIS/blob/main/examples/blip_image_captioning.ipynb
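
A minimal BLIP-2 captioning sketch following the LAVIS demo linked above (the `blip2_opt` / `pretrain_opt2.7b` variant is just one of the checkpoints listed in the repo):

```
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
# One of the BLIP-2 variants from the LAVIS model zoo; swap for another as needed.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)
raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image}))  # e.g. ['a photo of ...']
```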
- ### **Semantic-Segment-Anything**
- Very strong object recognition
- https://github.com/fudan-zvg/Semantic-Segment-Anything
- ### **Image2Paragraph**
- Full Pipeline
- https://github.com/showlab/Image2Paragraph
- ### **vit-gpt2-image-captioning**
- Image captioning (minimal inference sketch below)
- https://huggingface.co/nlpconnect/vit-gpt2-image-captioning
- fine-tuning: https://www.linkedin.com/pulse/fine-tuning-image-to-text-algorithms-withlora-daniel-puente-viejo/
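
A minimal inference sketch for this checkpoint, mirroring its model card (the LoRA fine-tuning described in the linked article is not shown here):

```
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image

model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

image = Image.open("example.jpg").convert("RGB")
pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_length=16, num_beams=4)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```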
- ### **GIT**
- https://huggingface.co/docs/transformers/main/en/model_doc/git
- fine-tuning: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/GIT/Fine_tune_GIT_on_an_image_captioning_dataset.ipynb
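
A minimal GIT captioning sketch based on the Transformers docs linked above (`microsoft/git-base-coco` is one example checkpoint; the notebook covers fine-tuning):

```
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```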
# Dec 11th, 2023:
Previous difficulties:
1. Hard to reverse-generate a caption from keywords alone.
2. Music generated solely from the caption sounds bad.

New plan:
1. Generate music-style keywords
2. Generate Chinese poems:
    * BLIP2 -> paragraph -> ??? -> Llama 2 -> Chinese poem
    * Dataset: GPT-4 generated (x=paragraph, y=poem); see the data-format sketch after this list
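
A possible data-format sketch for the GPT-4-generated (paragraph, poem) pairs, assuming a simple prompt/completion JSONL layout; the instruction template and the exact format expected by the Llama 2 / Taiwan-LLM fine-tune are still open questions:

```
import json

# Hypothetical instruction template; the wording used for the actual fine-tune is undecided.
TEMPLATE = "Write a Chinese poem that captures the following scene.\n\nScene: {paragraph}\n\nPoem:"

def to_training_example(paragraph, poem):
    """Turn one GPT-4-generated (paragraph, poem) pair into a prompt/completion record."""
    return {"prompt": TEMPLATE.format(paragraph=paragraph), "completion": poem}

pairs = [("A quiet harbor at dusk ...", "暮色下的寧靜港灣 ...")]  # placeholder pair
with open("paragraph_poem.jsonl", "w", encoding="utf-8") as f:
    for paragraph, poem in pairs:
        f.write(json.dumps(to_training_example(paragraph, poem), ensure_ascii=False) + "\n")
```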
Title Proposal:
1. Shot GMP (generate a musical poem from an image)
2. Story GMP
3. ShotP&M
To-Do:
1. Picture-paragraph dataset 陳德維
2. Paragraph-poem dataset 黃品翰
3. Fine-tune LLM (e.g. Llama 2 / Taiwan-LLM) 吳憶茹
4. Picture to prompt for MusicGen (see the MusicGen sketch after this list) 陳德維
5. Pipeline (Gradio...) 陳儷其 陳德維
6. PowerPoint for presentation 陳儷其
7. GPT-4 prompts 黃品翰
8. Try keyword dataset with poem fine-tuning 洪子涵
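
For To-Do item 4, a minimal sketch of driving MusicGen from a text prompt via Transformers (checkpoint and prompt are placeholders; in the real pipeline the prompt would be built from the image-derived keywords):

```
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# Placeholder prompt; the pipeline would compose this from the image keywords.
inputs = processor(text=["calm lo-fi piano, rainy evening"], padding=True, return_tensors="pt")
audio_values = model.generate(**inputs, max_new_tokens=256)

sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
```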
### - InstructBLIP prompts:
```
prompts = {
    'paragraph': "Write a detailed description for the photo",
    'vibe': "Question: Give 3 adjectives to describe the vibe of image. Answer:",
    'time': "Question: What time of the day does it represent? Answer:",
    'music-era': "Question: What time era does it represent? Answer:",
    'emotion': "Question: What emotions do you think the person taking this image is experiencing? Answer:",
    'music-style': "Question: Describe the music style. Answer:",
}
```
```
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image
from tqdm.auto import tqdm
import json
import jsonlines
import argparse
import pandas as pd


def load_model(path):
    """Load the InstructBLIP model and processor, moving the model to GPU if available."""
    model = InstructBlipForConditionalGeneration.from_pretrained(path)
    processor = InstructBlipProcessor.from_pretrained(path)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    return model, processor, device


def inference(model, processor, device, urls, prompts, args):
    """Run every prompt on every image and stream each result to a JSONL file."""
    result = []
    failed = []
    total = len(urls)
    if args.append:
        # Resume mode: reuse the fields already generated in a previous run.
        with open(args.prev_output_path, 'r') as json_file:
            data = json.load(json_file)
    with jsonlines.open("InstructBlip_output.jsonl", mode='a') as writer:
        for idx, url in tqdm(iterable=enumerate(urls), total=total, colour="green"):
            try:
                obj = {}
                if args.append:
                    obj = data[idx]
                obj['url'] = url
                # Images are pre-downloaded; the filename is the last segment of the URL.
                img_path = "/tmp2/b10902138/.cache/stanford_images/" + url.split('/')[-1]
                image = Image.open(img_path).convert("RGB")
                for prompt in prompts:
                    inputs = processor(images=image, text=prompts[prompt], return_tensors="pt").to(device)
                    outputs = model.generate(
                        **inputs,
                        do_sample=False,
                        num_beams=5,
                        max_length=256,
                        min_length=1,
                        top_p=0.9,
                        repetition_penalty=1.5,
                        length_penalty=1.0,
                        temperature=1,
                    )
                    generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
                    obj[prompt] = generated_text
                writer.write(obj)
                result.append(obj)
            except Exception:
                # Keep going on missing files or bad URLs and report them at the end.
                failed.append(url)
    return result, failed


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true", default=False, help="Only process the first 5 images")
    parser.add_argument("--append", action="store_true", default=False, help="Append to fields from a previous output file")
    parser.add_argument("--model_path", type=str, help="The path (or hub id) of the InstructBLIP model")
    parser.add_argument("--data_path", type=str, help="The path of the input CSV")
    parser.add_argument("--column_name", type=str, help="The CSV column that holds the image URLs")
    parser.add_argument("--prev_output_path", type=str, help="The path of the previous output file (for --append)")
    parser.add_argument("--output_path", type=str, help="The path of the output JSON file")
    args = parser.parse_args()

    # Get model
    model, processor, device = load_model(args.model_path)

    # Get data
    data = pd.read_csv(args.data_path)
    urls = data[args.column_name]
    if args.debug:
        urls = [urls[i] for i in range(5)]

    prompts = {
        'paragraph': "Write a detailed description for the photo",
        'vibe': "Question: Give 3 adjectives to describe the vibe of image. Answer:",
        'time': "Question: What time of the day does it represent? Answer:",
        'music-era': "Question: What time era does it represent? Answer:",
        'emotion': "Question: What emotions do you think the person taking this image is experiencing? Answer:",
        'music-style': "Question: Describe the music style. Answer:",
    }

    # Inference
    result, failed = inference(model, processor, device, urls, prompts, args)

    # Output
    with open(args.output_path, "w", encoding="utf-8") as output_file:
        json.dump(result, output_file, indent=2, ensure_ascii=False)
    print(failed)
```
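
Assuming the script above is saved as `instructblip_inference.py` (the filename is arbitrary), a typical run would be `python instructblip_inference.py --model_path Salesforce/instructblip-vicuna-7b --data_path stanford.csv --column_name url --output_path InstructBlip_output.json`, where the data path and column name are placeholders for the Stanford image-URL CSV; `--debug` limits the run to the first five images, and `--append --prev_output_path` resumes from an earlier output file.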