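Abstract
Diffusion models perform very well on image style transfer, but they have two main drawbacks: sampling is slow, and the stochasticity of the diffusion process affects the generated content. Most existing methods either fine-tune the diffusion model or rely on an additional neural network. Our approach instead requires no extra training: it uses an additional loss function to steer the output of a pretrained diffusion model in the desired direction. This lets users set up a style-transfer pipeline quickly, without spending much time fine-tuning the diffusion model. We also compare the approach with other existing image style transfer methods.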
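Introduction
Style transfer converts the style of a given image into another style while preserving its content. Over the past few years, many GAN-based methods have been proposed. More recently, combining a pretrained image generator with a pretrained image-text encoder has made text-guided style transfer possible with little or no additional training of the network.
In recent years, diffusion models have shown very high quality in image generation and editing, and many works combine diffusion models with different techniques to achieve style transfer.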
Literature review
Image style
paper
source code
Introduction
Conditional diffusion model: requires a paired dataset with matched source and target styles
$\downarrow$
Unconditional diffusion model: the process of going from noise back to an image is stochastic (?), which makes the image content inconsistent
$\downarrow$
DDIB: two domains?
Method
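The training-free, loss-guided sampling idea described in the abstract can be illustrated with a small toy example. This is only a minimal sketch under stated assumptions, not the method used in this work: the data is one-dimensional, `toy_eps_model` is a hypothetical analytic stand-in for a pretrained noise predictor (it is the optimal predictor when the data follows N(0, 1)), the loss is a simple squared distance to a hypothetical `target`, and the guidance is applied by adding the loss gradient to the predicted noise inside a deterministic DDIM-style update. All function names, the linear beta schedule, and the `scale` parameter are illustrative choices.

```python
import math

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def toy_eps_model(x_t, alpha_bar):
    """Hypothetical stand-in for a pretrained noise predictor.

    For data drawn from N(0, 1), E[eps | x_t] = sqrt(1 - alpha_bar) * x_t.
    """
    return math.sqrt(1.0 - alpha_bar) * x_t

def guided_ddim_sample(x_T, target, scale, alpha_bars):
    """Deterministic DDIM-style sampling with an extra loss gradient."""
    x = x_T
    for t in range(len(alpha_bars) - 1, -1, -1):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0

        eps = toy_eps_model(x, ab_t)
        # Clean-sample estimate implied by the noise prediction.
        x0_hat = (x - math.sqrt(1.0 - ab_t) * eps) / math.sqrt(ab_t)

        # Loss = 0.5 * (x0_hat - target)^2. For this toy model
        # d(x0_hat)/d(x_t) = sqrt(ab_t); a real network would need autograd.
        grad = (x0_hat - target) * math.sqrt(ab_t)

        # Shift the noise prediction along the loss gradient.
        eps_guided = eps + scale * math.sqrt(1.0 - ab_t) * grad

        # DDIM update using the guided noise prediction.
        x0_pred = (x - math.sqrt(1.0 - ab_t) * eps_guided) / math.sqrt(ab_t)
        x = math.sqrt(ab_prev) * x0_pred + math.sqrt(1.0 - ab_prev) * eps_guided
    return x

alpha_bars = make_alpha_bars()
unguided = guided_ddim_sample(x_T=-1.5, target=2.0, scale=0.0, alpha_bars=alpha_bars)
guided = guided_ddim_sample(x_T=-1.5, target=2.0, scale=5.0, alpha_bars=alpha_bars)
print("unguided:", unguided, "guided:", guided)
```

With `scale=0.0` the loop reduces to plain DDIM sampling with the toy model. In this toy setting a positive `scale` pulls the final sample toward `target` without changing any model parameter, which is the property the abstract relies on.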
paper with code
[x] Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization
requirements
Python 3.10
torchvision==0.14.1
numpy==1.25.2
ignore opencv-python and pkg-resources
result
hyperlink : https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
conclusion: mostly concept descriptions; there does not seem to be much of a main point
Speech processing plays an important role in any speech system whether its Automatic Speech Recognition (ASR) or speaker recognition or something else. Mel-Frequency Cepstral Coefficients (MFCCs) were very popular features for a long time; but more recently, filter banks are becoming increasingly popular. In this post, I will discuss filter banks and MFCCs and why are filter banks becoming increasingly popular.
Explains why filter banks (a signal-processing feature) are becoming popular
Computing filter banks and MFCCs involve somewhat the same procedure, where in both cases filter banks are computed and with a few more extra steps MFCCs can be obtained. In a nutshell, a signal goes through a pre-emphasis filter; then gets sliced into (overlapping) frames and a window function is applied to each frame; afterwards, we do a Fourier transform on each frame (or more specifically a Short-Time Fourier Transform) and calculate the power spectrum; and subsequently compute the filter banks. To obtain MFCCs, a Discrete Cosine Transform (DCT) is applied to the filter banks retaining a number of the resulting coefficients while the rest are discarded. A final step in both cases, is mean normalization.
Describes how MFCCs are obtained from filter banks through a few extra steps
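The pipeline quoted above (pre-emphasis, overlapping frames, windowing, FFT, power spectrum, filter banks, DCT, mean normalization) can be sketched in NumPy as follows. This is a rough illustration rather than the code from the linked post: the function name, the default values (25 ms frames, 10 ms step, 512-point FFT, 40 filters, 12 cepstral coefficients, pre-emphasis 0.97), the Hamming window, the mel-scale formula, and the unnormalized DCT-II are all assumptions chosen as common defaults. The signal is assumed to be at least one frame long.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def filter_banks_and_mfcc(signal, sample_rate, frame_len=0.025, frame_step=0.010,
                          n_fft=512, n_filters=40, n_ceps=12, pre_emphasis=0.97):
    # 1. Pre-emphasis filter: y[n] = x[n] - a * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    # 2. Slice into overlapping frames (a trailing partial frame is dropped)
    flen = int(round(frame_len * sample_rate))
    fstep = int(round(frame_step * sample_rate))
    n_frames = 1 + (len(emphasized) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx]

    # 3. Apply a window function to each frame
    frames = frames * np.hamming(flen)

    # 4. FFT of each frame and power spectrum
    magnitude = np.abs(np.fft.rfft(frames, n_fft))
    power = magnitude ** 2 / n_fft

    # 5. Triangular filters spaced evenly on the mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / (right - center)
    energies = power @ fbank.T
    energies = np.where(energies == 0, np.finfo(float).eps, energies)  # avoid log(0)
    log_fbank = 10.0 * np.log10(energies)  # filter bank features in dB

    # 6. DCT-II over the filter axis; keep coefficients 1..n_ceps, discard the rest
    n = np.arange(n_filters)
    k = np.arange(n_filters)[:, None]
    dct_basis = np.cos(np.pi / n_filters * (n + 0.5) * k)
    mfcc = (log_fbank @ dct_basis.T)[:, 1:n_ceps + 1]

    # 7. Mean normalization of both feature sets across frames
    log_fbank = log_fbank - log_fbank.mean(axis=0)
    mfcc = mfcc - mfcc.mean(axis=0)
    return log_fbank, mfcc

# Usage on a synthetic one-second signal (440 Hz tone plus a little noise)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 440.0 * t) + 0.01 * rng.standard_normal(sample_rate)
fbank_feats, mfcc_feats = filter_banks_and_mfcc(signal, sample_rate)
print(fbank_feats.shape, mfcc_feats.shape)
```

The two outputs share every step up to the log filter bank energies, which matches the note above: MFCCs are the filter bank features plus a DCT and a truncation.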
link : this
review
Text-to-image by diffusion model
1. Introduction
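This section is roughly saying that recent models (StableDiffusion, MidJourney) were trained on very large datasets (LAION-5B, 5B images), that these datasets contain many copyrighted images, and that this raises intellectual-property concerns. The sentence below says that Glaze adds some kind of perturbation to an image (?)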
Glaze works by taking a piece of artwork, and computing a minimal perturbation.