# Give Me a Video and an Audio Clip, and I'll Make the Person in the Video Talk

## Introduction
This code has to be used together with [other files](https://drive.google.com/file/d/1o3KlPs3D-ce4-oR_7Zxw_8MG6fRyM_Aw/view?usp=sharing), because SadTalker, the tool that generates the video, does not provide an API, so I modified SadTalker so it can be called directly from code. If you actually want to use it, just run sadtalker.py inside that folder.

## Summary
Grab an image of the professor from a video, then combine it with an audio file to generate a talking video.

## Dependencies
An Anaconda virtual environment with Python 3.10.
```
# Background removal
pip install transparent-background

# YOLOv8 API
pip install ultralytics

# PyTorch
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113

# FFmpeg
conda install ffmpeg

# Copy the lines below into requirements.txt, then run pip install -r requirements.txt
numpy==1.23.4
face_alignment==1.3.5
imageio==2.19.3
imageio-ffmpeg==0.4.7
librosa==0.9.2 # numba
resampy==0.3.1
pydub==0.25.1
scipy==1.10.1
kornia==0.6.8
tqdm
yacs==0.1.8
pyyaml
joblib==1.1.0
scikit-image==0.19.3
basicsr==1.4.2
facexlib==0.3.0
gradio
gfpgan
av
safetensors
```

## Pipeline
### Variables
video_path is the path of the video that frames are sampled from. image_path is the path where the processed image is saved.
```python=
model = YOLO('yolov8x.pt')
video_path = r"./data\1. AI教學影片第一部_人工智慧的定義與應用.mp4"
image_path = r"./data\background_removed.jpg"
cap = cv2.VideoCapture(video_path)
sample_cycle = round(cap.get(cv2.CAP_PROP_FPS) / 2)
detected_count = 0
remover = Remover(mode="fast", device="cuda:0")
```

### Frame sampling
Sample one frame every sample_cycle frames; running every single frame through the pipeline would waste a lot of time. sample_cycle defaults to half of the video's FPS, so roughly two frames are sampled per second.
```python=
while cap.isOpened():
    # Sample once every sample_cycle frames
    for i in range(sample_cycle):
        cap.grab()
    ret, frame = cap.read()
    if not ret:
        break
```

### Person detection
Use YOLOv8 to detect people in the frame and confirm that exactly one person, i.e. the professor, is present. A frame is only processed further once a single person has been detected for the second time, because transition animations add effects such as brightening the whole frame, and the first qualifying frame would then show the professor washed out in white.
```python=
# Detect people
result = model(frame, classes=0, retina_masks=True)[0]

# Check that there is exactly one person in the frame
if len(result.boxes) != 1:
    continue
detected_count = detected_count + 1

# Only start processing after enough detections, so transition animations don't degrade the image
if detected_count < 2:
    continue
```

### Clearing everything outside the bounding box
Paint everything outside the detection box white so the background-removal tool does not lock onto the wrong subject.
```python=
# Blank out everything outside the bounding box
box = result.boxes[0]
x1, y1, x2, y2 = box.xyxy[0].int().cpu().numpy()
frame = cv2.rectangle(frame, (0, 0), (frame.shape[1], y1), color=(255, 255, 255), thickness=-1)
frame = cv2.rectangle(frame, (0, 0), (x1, frame.shape[0]), color=(255, 255, 255), thickness=-1)
frame = cv2.rectangle(frame, (x2, y1), (frame.shape[1], frame.shape[0]), color=(255, 255, 255), thickness=-1)
frame = cv2.rectangle(frame, (x1, y2), (x2, frame.shape[0]), color=(255, 255, 255), thickness=-1)
```

### Background removal
The background-removal tool only accepts PIL images, so convert the frame first, then convert the result back to a NumPy array.
```python=
# Remove the background
frame = Image.fromarray(frame)
frame = np.array(remover.process(frame, type="white"))
```

### Centering
Crop the professor out with the bounding box, then paste the crop onto a white canvas so it is horizontally centered and flush with the bottom of the frame (a standalone sketch of the same arithmetic follows the code below).
```python=
# Center the person on a white canvas
cropped = frame[y1:y2, x1:x2, :]
cropped_width = x2 - x1
cropped_height = y2 - y1
frame = np.zeros(frame.shape)
frame[:, :, :] = 255
frame[frame.shape[0] - cropped_height:, frame.shape[1] // 2 - cropped_width // 2:frame.shape[1] // 2 - cropped_width // 2 + cropped_width, :] = cropped
```
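To make the slicing above easier to follow, here is a small standalone sketch of the same bottom-aligned, centered paste. It is my own illustration rather than part of the pipeline, and paste_bottom_center is a hypothetical helper name. For example, with a 1920-pixel-wide canvas and a 400-pixel-wide crop, x_offset = 1920 // 2 - 400 // 2 = 760, so the crop occupies columns 760 to 1159 and sits flush with the bottom edge.
```python=
import numpy as np

def paste_bottom_center(canvas_shape, crop):
    """Paste `crop` onto a white canvas, horizontally centered and bottom-aligned."""
    h, w = canvas_shape[:2]
    ch, cw = crop.shape[:2]
    canvas = np.full(canvas_shape, 255, dtype=crop.dtype)  # white background
    x_offset = w // 2 - cw // 2                             # horizontal centering
    y_offset = h - ch                                       # bottom alignment
    canvas[y_offset:y_offset + ch, x_offset:x_offset + cw] = crop
    return canvas
```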
### Video generation
vocal_path is the path to the audio file; the rest is handled by SadTalker.
```python=
vocal_path = r"./examples\driven_audio\japanese.wav"
sadtalker_inference(image_path, vocal_path, "gfpgan", True, "full")
```

## Full code
```python=
import numpy as np
import cv2
from transparent_background import Remover
from PIL import Image
import matplotlib.pyplot as plt
from ultralytics import YOLO
from sadtalker_inference import sadtalker_inference

model = YOLO('yolov8x.pt')
video_path = r"./data\1. AI教學影片第一部_人工智慧的定義與應用.mp4"
image_path = r"./data\background_removed.jpg"
cap = cv2.VideoCapture(video_path)
sample_cycle = round(cap.get(cv2.CAP_PROP_FPS) / 2)
detected_count = 0
remover = Remover(mode="fast", device="cuda:0")

while cap.isOpened():
    # Sample once every sample_cycle frames
    for i in range(sample_cycle):
        cap.grab()
    ret, frame = cap.read()
    if not ret:
        break

    # Detect people
    result = model(frame, classes=0, retina_masks=True)[0]

    # Check that there is exactly one person in the frame
    if len(result.boxes) != 1:
        continue
    detected_count = detected_count + 1

    # Only start processing after enough detections, so transition animations don't degrade the image
    if detected_count < 2:
        continue

    # Show the original frame
    plt.imshow(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    plt.show()

    # Blank out everything outside the bounding box
    box = result.boxes[0]
    x1, y1, x2, y2 = box.xyxy[0].int().cpu().numpy()
    frame = cv2.rectangle(frame, (0, 0), (frame.shape[1], y1), color=(255, 255, 255), thickness=-1)
    frame = cv2.rectangle(frame, (0, 0), (x1, frame.shape[0]), color=(255, 255, 255), thickness=-1)
    frame = cv2.rectangle(frame, (x2, y1), (frame.shape[1], frame.shape[0]), color=(255, 255, 255), thickness=-1)
    frame = cv2.rectangle(frame, (x1, y2), (x2, frame.shape[0]), color=(255, 255, 255), thickness=-1)

    # Remove the background
    frame = Image.fromarray(frame)
    frame = np.array(remover.process(frame, type="white"))

    # Center the person on a white canvas
    cropped = frame[y1:y2, x1:x2, :]
    cropped_width = x2 - x1
    cropped_height = y2 - y1
    frame = np.zeros(frame.shape)
    frame[:, :, :] = 255
    frame[frame.shape[0] - cropped_height:, frame.shape[1] // 2 - cropped_width // 2:frame.shape[1] // 2 - cropped_width // 2 + cropped_width, :] = cropped

    # Show and save the result
    frame = cv2.cvtColor(frame.astype(np.uint8), cv2.COLOR_BGR2RGB)
    plt.imshow(frame)
    plt.show()
    Image.fromarray(frame).save(image_path)
    break

cap.release()

vocal_path = r"./examples\driven_audio\japanese.wav"
sadtalker_inference(image_path, vocal_path, "gfpgan", True, "full")
```
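For reference, sadtalker_inference is my own wrapper rather than an official SadTalker entry point. The sketch below shows how its positional arguments are assumed to map onto SadTalker's stock command-line script; the flag names follow the public inference.py of the SadTalker repository and may differ between versions, and run_sadtalker_cli is a hypothetical helper name.
```python=
import subprocess

def run_sadtalker_cli(source_image, driven_audio, enhancer="gfpgan",
                      still=True, preprocess="full"):
    """Call SadTalker's stock inference.py with options assumed to match the wrapper."""
    cmd = [
        "python", "inference.py",
        "--source_image", source_image,
        "--driven_audio", driven_audio,
        "--enhancer", enhancer,      # face enhancer, e.g. gfpgan
        "--preprocess", preprocess,  # "full" keeps the whole source image in the result
    ]
    if still:
        cmd.append("--still")        # reduce head motion; usually paired with "full"
    subprocess.run(cmd, check=True)

# Roughly equivalent to sadtalker_inference(image_path, vocal_path, "gfpgan", True, "full"):
# run_sadtalker_cli(r"./data\background_removed.jpg", r"./examples\driven_audio\japanese.wav")
```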