FACEGAN(facial attribute reenactment GAN)

Introduction

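The goal of the FACEGAN paper is to transfer the facial expression and head pose of a driving image onto the identity of a source image.
In other words, the generated image should look like the person in the source image, but with the same expression and head pose as the driving image.
FACEGAN does this by taking the landmarks of the source face and incorporating the motion of the driving face.
The driving face's motion is extracted as action units, which cover both facial muscle movements and head pose.
This approach still produces high-quality output images when the source and driving faces have very different facial structures.
To make the background of the output image more realistic, FACEGAN also processes the face region and the background separately, unlike existing methods that distort or over-blur the background in order to generate the facial expression.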

Related Work

3D-face-model (Face2Face)
- 3D face models use 3D model parameters to encode identity and motion features
- The reenacted face is generated from the identity parameters of the source face and the motion parameters of the driving face
- Although 3D face models can produce good output, building the 3D representation of a face takes considerable effort, so the approach does not scale to larger datasets
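VAE (X2Face)
- A VAE encodes the input image into a latent representation vector
- The image is generated from the latent representations of the source face and the driving face
- However, it is hard for a VAE to disentangle identity and motion in the latent representation, so the results are not very good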
Deep GAN (FSGAN)
- Deep GAN methods train on the source face conditioned on the landmarks of the driving face
- However, landmarks preserve the facial structure, so to some extent they also leak the identity of the driving image
- So if the facial structures of the source and driving faces are very different, the output quality suffers
landmark transfer model (ReenactGAN)
- Landmark transfer models try to remove the identity features from the driving face's landmarks
- They use PCA to separate shape and expression parameters, then combine the source face's shape parameters with the driving face's expression parameters to produce the transformed landmarks
- However, this approach is trained on individual identity pairs and its decoder is identity-specific
- So its scalability is limited as well
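action units (ICFACE)
- Action units were proposed by a Swedish anatomist, who encoded facial movements such as raising the eyebrows or lowering the mouth corners as the basic units that make up facial expressions
- The ICFACE paper uses action units for face reenactment
- Because action units are independent of facial structure, they can disentangle the driving face's motion information from its identity information
- However, using action units alone does not work as well as landmark-based methods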

Method

So this paper proposes combining the source face's landmarks with the driving face's action units to produce high-quality reenactment results without the identity leakage problem.
The architecture consists of three modules: landmark transformer, face reenactor, and background mixer.
landmark transformer
- The landmark transformer concatenates the source face's landmarks with the driving face's action units
- A fully-connected network turns this into the transformed landmarks; the goal of this module is to produce landmarks that carry the source identity and the driving facial motion
face reenactor
- The face reenactor converts the landmarks produced above into a single-channel image using a 2D Gaussian function
- This single channel is concatenated with the source image and fed into the generator
- The idea is to use the transformed landmarks to change the source image's facial expression and head pose
- The generator outputs a reenacted face image and a segmentation map
- The segmentation has three classes: face, hair, and background
- The segmentation is used to remove the background from the generated reenacted face
background mixer
- The background-removed reenacted face is one input of the third module, the background mixer
- The second input is the background of the source image with the face removed
- The two are concatenated so that a face-only image is combined with a background-only image, producing an output that has the source identity, the driving facial motion, and a high-quality background

landmark transformer

The landmark transformer is a fully-connected network with two inputs.
The first is the source image's facial landmarks, reshaped into a 136×1 vector l_s.
The second is the driving image's action units, a 20×1 vector AU_d.
The two are concatenated and passed through the fully-connected network,
which predicts the landmark movement Δl_s; adding Δl_s to the input l_s gives the final output l_t.
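The residual update described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the single hidden layer, its width (`HIDDEN`), the ReLU activation, and the random weights are all assumptions. Only the 136-dimensional l_s, the 20-dimensional AU_d, and the rule l_t = l_s + Δl_s come from the notes.

```python
import random

def linear(x, weights, bias):
    """Fully-connected layer: y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_in, n_out, rng):
    """Small random weights; a trained model would load learned values instead."""
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def landmark_transformer(l_s, au_d, layers):
    """Predict a landmark displacement from [l_s ; AU_d] and add it back to l_s."""
    h = l_s + au_d                      # list concatenation: 136 + 20 = 156 values
    for i, (weights, bias) in enumerate(layers):
        h = linear(h, weights, bias)
        if i < len(layers) - 1:         # no activation on the output layer
            h = relu(h)
    delta_l_s = h                       # predicted landmark movement
    l_t = [a + d for a, d in zip(l_s, delta_l_s)]
    return l_t, delta_l_s

rng = random.Random(0)
HIDDEN = 64                             # hypothetical hidden width
layers = [init_layer(136 + 20, HIDDEN, rng), init_layer(HIDDEN, 136, rng)]

l_s = [rng.random() for _ in range(136)]   # 68 (x, y) source landmarks, flattened
au_d = [rng.random() for _ in range(20)]   # driving action units
l_t, delta_l_s = landmark_transformer(l_s, au_d, layers)
```

Predicting a displacement rather than the landmarks themselves means a network that outputs zeros already returns the source landmarks unchanged, which is a sensible starting point when the driving motion is small.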
The landmark transformer uses three loss functions.
The first is the reconstruction loss: the MAE between the generated landmarks l_t and the driving face's landmarks l_d, plus an L2-weighted penalty that regularizes the predicted Δl_s.
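As a rough sketch of this reconstruction loss, the function below adds the MAE term and the penalty on the displacement. The weight `reg_weight` and the choice of a summed squared-L2 penalty are assumptions; the notes only say the penalty is "L2-weighted".

```python
def landmark_reconstruction_loss(l_t, l_d, delta_l_s, reg_weight=0.01):
    """MAE(l_t, l_d) plus a weighted squared-L2 penalty on the displacement."""
    mae = sum(abs(t - d) for t, d in zip(l_t, l_d)) / len(l_t)
    l2_penalty = sum(v * v for v in delta_l_s)
    return mae + reg_weight * l2_penalty

# Toy 2-value example: the transformed landmarks miss the driving ones by 0.5 each.
loss = landmark_reconstruction_loss(
    l_t=[0.5, -0.5], l_d=[1.0, 0.0], delta_l_s=[0.5, -0.5], reg_weight=0.1
)
```

The penalty discourages large displacements, so the network only moves landmarks as far as the driving motion requires.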
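Because the reconstruction loss is computed on landmarks, it cannot capture subtle changes in facial expression.
So the authors use another fully-connected network, L_a, to compute a loss on facial expression changes.
The authors do not say what L_a outputs, but the loss function suggests they want L_a to map landmarks to an action-unit-like representation,
so that a loss can be computed on facial expression changes.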
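To keep the transformed landmarks structurally consistent with the source face's landmarks, the authors use a third loss function, the connectivity loss.
This loss looks at the source face's landmarks and the transformed landmarks separately, measuring in each the distances between adjacent landmark points.
D is an x-dimensional vector, where x reflects how many landmark points the loss uses; it stores the distances between adjacent landmark points.
For example, with 10 points it stores the distance from point 1 to 2, from 2 to 3, and so on.
This loss makes the two sets of landmarks more similar.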
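The adjacent-distance vector D and the resulting loss can be sketched as follows. The mean absolute difference used to compare the two D vectors is an assumption (the original formula is not reproduced in these notes), as is chaining all points in one sequence rather than per facial part.

```python
import math

def adjacent_distances(landmarks):
    """D: distances between consecutive (x, y) landmark points."""
    return [math.dist(p, q) for p, q in zip(landmarks, landmarks[1:])]

def connectivity_loss(l_s_points, l_t_points):
    """Mean absolute difference between the two adjacent-distance vectors."""
    d_s = adjacent_distances(l_s_points)
    d_t = adjacent_distances(l_t_points)
    return sum(abs(a - b) for a, b in zip(d_s, d_t)) / len(d_s)

source = [(0, 0), (3, 4), (3, 8)]             # adjacent distances: 5 and 4
shifted = [(1, 1), (4, 5), (4, 9)]            # same shape, translated
stretched = [(0, 0), (6, 8), (6, 16)]         # same shape, scaled by 2

loss_shifted = connectivity_loss(source, shifted)
loss_stretched = connectivity_loss(source, stretched)
```

Because only distances between neighbours enter the loss, moving the whole face costs nothing, while changing its proportions is penalized.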
So the landmark transformer's loss function can be written as the sum of these three loss functions.

face reenactor

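This module first maps the transformed landmarks l_t from the previous module into a single-channel image H_t using a 2D Gaussian function.
H_t is then concatenated with the source image I_s and fed into the generator G_r to change the source image's facial expression and head pose; G_r produces the reenacted RGB image I_fr.
It also produces a segmentation map S_fr with three classes: face, hair, and background.
The segmentation is used to remove the background from the reenacted face, giving I'_fr.
The segmentation is produced by a CNN attached after the second-to-last layer of the generator.
For the generator and discriminator, the authors use the method proposed in the paper High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs, published at CVPR 2018.
The generator and discriminator are a coarse-to-fine generator and a multi-scale discriminator, respectively.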
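The landmark-to-image step at the start of this module can be illustrated with a small pure-Python sketch. The standard deviation `sigma`, the image size, and taking the per-pixel maximum over landmarks (instead of, say, a sum) are assumptions; the notes only state that a 2D Gaussian function is used.

```python
import math

def landmarks_to_heatmap(points, height, width, sigma=2.0):
    """Render (x, y) landmarks as one channel; each pixel keeps its strongest Gaussian."""
    heatmap = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for y in range(height):
            for x in range(width):
                value = math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * sigma ** 2))
                if value > heatmap[y][x]:
                    heatmap[y][x] = value
    return heatmap

# Two toy landmarks on a 16x16 canvas; a real input would hold all 68 points.
h_t = landmarks_to_heatmap([(4, 4), (10, 6)], height=16, width=16)
```

Stacking this single channel with the three RGB channels of the source image gives a 4-channel tensor, which is the kind of input the generator consumes here.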

coarse-to-fine generator

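First, the coarse-to-fine generator, which uses two generators.
Both consist of a convolutional front end, a series of residual blocks, and a transposed-convolution back end.
One is the local enhancer, G2, whose input is 2048×1024.
The other is the global generator, G1, whose input is the original image downsampled to 1024×512.
This lets the generator be trained at different resolutions.
The image first goes through the first half of the local enhancer G2 to get a feature map.
At the same time the downsampled image goes through the global generator G1; G1's feature map and G2's feature map are then combined by element-wise sum and passed through the second half of G2 to get the final result.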
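The data flow between G1 and G2 can be traced with a toy sketch. The three callables passed in are placeholders for learned sub-networks (here replaced by simple pooling, identity, and pixel repetition, all assumptions); only the routing, the 2x downsampling for G1, and the element-wise sum come from the description above.

```python
def downsample2x(image):
    """Halve height and width by averaging each 2x2 block (even sizes assumed)."""
    return [
        [(image[y][x] + image[y][x + 1] + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
         for x in range(0, len(image[0]), 2)]
        for y in range(0, len(image), 2)
    ]

def upsample2x(image):
    """Double height and width by repeating each pixel (nearest neighbour)."""
    rows = []
    for row in image:
        wide = [v for v in row for _ in range(2)]
        rows.append(wide)
        rows.append(list(wide))
    return rows

def coarse_to_fine(image, g2_front, g1, g2_back):
    local_features = g2_front(image)               # first half of G2, full-size input
    global_features = g1(downsample2x(image))      # G1, 2x-downsampled input
    fused = [[a + b for a, b in zip(r1, r2)]       # element-wise sum of the two maps
             for r1, r2 in zip(local_features, global_features)]
    return g2_back(fused)                          # second half of G2

image = [[float(x + y) for x in range(8)] for y in range(4)]
output = coarse_to_fine(image, g2_front=downsample2x, g1=lambda f: f, g2_back=upsample2x)
```

The point of the sketch is the resolution bookkeeping: the first half of G2 has to bring the full-size input down to G1's resolution before the two feature maps can be summed.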

multi-scale discriminator

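For the multi-scale discriminator, the authors use three discriminators.
They all share the same architecture but operate at different image scales.
The image is downsampled by a factor of 2 and by a factor of 4 in both height and width, which together with the original size gives three images.
Each goes through a discriminator with the same architecture.
The discriminator that handles the smallest image has the largest receptive field relative to the image,
so it focuses on the overall realism of the image.
The discriminator that handles the original-size image has the smallest receptive field,
so it focuses on the details of the image.
The GAN loss is therefore computed over all three discriminators.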
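The three-scale setup can be sketched as an image pyramid with one discriminator per level. Average pooling as the downsampling operator is an assumption, and `report_size` is a hypothetical stand-in for a real discriminator so that the example stays self-contained.

```python
def downsample(image, factor):
    """Average-pool a 2D image (list of rows) by an integer factor."""
    h, w = len(image) // factor, len(image[0]) // factor
    return [
        [sum(image[y * factor + dy][x * factor + dx]
             for dy in range(factor) for dx in range(factor)) / factor ** 2
         for x in range(w)]
        for y in range(h)
    ]

def multi_scale_scores(image, discriminators):
    """Run one discriminator per scale: original, 1/2, and 1/4 resolution."""
    pyramid = [image, downsample(image, 2), downsample(image, 4)]
    return [d(scaled) for d, scaled in zip(discriminators, pyramid)]

def report_size(img):
    """Stand-in 'discriminator': just reports the size of the image it sees."""
    return (len(img), len(img[0]))

image = [[float(x * y) for x in range(16)] for y in range(8)]
sizes = multi_scale_scores(image, [report_size] * 3)
```

With identical architectures, the same convolutional window covers a larger share of the 1/4-scale image than of the original, which is why the coarsest discriminator judges global structure.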

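Now back to the second module.
The face reenactor's loss function has four terms.
The first is again a reconstruction loss, computed between the background-removed reenacted face and the driving face,
because the image produced by the face reenactor is supposed to learn the driving face's expression.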
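The second and third losses come from the generator and discriminator paper mentioned above:
the VGG perceptual loss and the adversarial loss.
The VGG perceptual loss uses VGG-16 as the loss network and compares the feature maps of the generated image and the ground truth.
V_i is the feature map at layer i, and the factor 1/(C_i H_i W_i) divides by the total number of elements in that feature map.
So the perceptual loss is computed on features rather than pixels.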
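A minimal sketch of the per-layer normalization follows. It assumes the feature maps V_i have already been extracted by the loss network and flattened into lists, and it assumes an L1 distance between them; the toy feature values are made up for illustration.

```python
def perceptual_loss(features_generated, features_target):
    """Sum over layers of the L1 feature distance, each divided by C_i * H_i * W_i."""
    total = 0.0
    for v_gen, v_tgt in zip(features_generated, features_target):
        n_elements = len(v_gen)          # C_i * H_i * W_i for a flattened feature map
        total += sum(abs(a - b) for a, b in zip(v_gen, v_tgt)) / n_elements
    return total

# Two hypothetical layers with 4 and 2 elements respectively.
generated = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0]]
target = [[1.0, 2.0, 3.0, 0.0], [1.0, 1.0]]
loss = perceptual_loss(generated, target)
```

Dividing by the element count keeps large early-layer feature maps from dominating the small deep-layer ones.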
The adversarial loss differs from the usual GAN adversarial loss in that it is computed over three discriminators.
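For reference, the cited CVPR 2018 paper writes its multi-scale objective as a sum of the standard GAN loss over the three discriminators. Transcribed into this module's notation it would look roughly like the following, where using G_r for the generator and D_1 to D_3 for the discriminators at the three scales is my own labeling; any extra weighting or conditioning that FACEGAN applies is not covered in these notes.

```latex
\min_{G_r} \; \max_{D_1, D_2, D_3} \; \sum_{k=1}^{3} \mathcal{L}_{\mathrm{GAN}}(G_r, D_k)
```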
The last loss, L_ce, is on the segmentation: the authors first use a pretrained face-segmentation network to obtain the ground truth,
then compute the cross-entropy loss against the generated segmentation.
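The segmentation loss can be sketched as a per-pixel cross entropy over the three classes. Averaging over pixels and the toy probabilities are assumptions; in practice the targets would come from the pretrained segmentation network mentioned above.

```python
import math

CLASSES = ("face", "hair", "background")

def segmentation_cross_entropy(predicted_probs, target_labels):
    """Mean per-pixel cross entropy; predicted_probs[i] is a 3-way distribution."""
    eps = 1e-12                                   # avoids log(0)
    losses = [-math.log(probs[CLASSES.index(label)] + eps)
              for probs, label in zip(predicted_probs, target_labels)]
    return sum(losses) / len(losses)

# Three toy pixels, one per class.
predicted = [(0.8, 0.1, 0.1), (0.2, 0.7, 0.1), (0.1, 0.1, 0.8)]
target = ["face", "hair", "background"]
loss = segmentation_cross_entropy(predicted, target)
```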
So the face reenactor's final loss can be written as the sum of these four losses.

background mixer

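Finally, the background mixer. Its inputs are the reenacted image with the background removed, I'_fr,
and the source image with the face and hair removed, called I_b.
The generator G_b produces an RGB image I_c and a single-channel mask M.
These two outputs and I_b are then used to compute the final output I_r.
M times I_c focuses only on the hair and face, while (1-M) times I_b focuses only on the background, that is, I_r = M * I_c + (1 - M) * I_b.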
This way the face and the background are handled separately, keeping the background as close as possible to the one in the source image.
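The mask-based mixing can be written out per pixel. This sketch assumes the final output is the sum of the two masked terms, M times I_c plus (1 - M) times I_b, applied to every RGB channel; the 1x3 toy image is made up for illustration.

```python
def blend(mask, face_rgb, background_rgb):
    """I_r = M * I_c + (1 - M) * I_b, applied per pixel and per channel."""
    return [
        [tuple(m * c + (1.0 - m) * b for c, b in zip(pc, pb))
         for m, pc, pb in zip(mask_row, face_row, bg_row)]
        for mask_row, face_row, bg_row in zip(mask, face_rgb, background_rgb)
    ]

# 1x3 toy image: pure face pixel, half-and-half edge pixel, pure background pixel.
M = [[1.0, 0.5, 0.0]]
I_c = [[(200.0, 150.0, 100.0)] * 3]
I_b = [[(0.0, 50.0, 100.0)] * 3]
I_r = blend(M, I_c, I_b)
```

Where the mask is 0 the source background passes through untouched, which is what keeps background details from being blurred or warped by the generator.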
For the mask, the loss function uses total variation regularization, which removes noise while preserving the original edges, plus an L2 weight penalty on the mask.
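To make the mask regularizer concrete, here is a small sketch. The anisotropic (absolute-difference) form of total variation, the squared-L2 penalty on mask values, and the two weights are assumptions, since the exact formula is not reproduced in these notes.

```python
def mask_regularizer(mask, tv_weight=1.0, l2_weight=1.0):
    """Anisotropic total variation of the mask plus a squared-L2 penalty on its values."""
    h, w = len(mask), len(mask[0])
    tv = sum(abs(mask[y][x + 1] - mask[y][x]) for y in range(h) for x in range(w - 1))
    tv += sum(abs(mask[y + 1][x] - mask[y][x]) for y in range(h - 1) for x in range(w))
    l2 = sum(v * v for row in mask for v in row)
    return tv_weight * tv + l2_weight * l2

smooth = [[0.0, 0.0, 1.0, 1.0],
          [0.0, 0.0, 1.0, 1.0]]      # one clean edge
noisy = [[0.0, 1.0, 0.0, 1.0],
         [1.0, 0.0, 1.0, 0.0]]       # checkerboard noise, same number of ones

tv_smooth = mask_regularizer(smooth, l2_weight=0.0)
tv_noisy = mask_regularizer(noisy, l2_weight=0.0)
```

Both masks contain the same number of active pixels, yet the checkerboard is penalized five times more, while the single clean edge stays cheap.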
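The loss function has three other terms: a reconstruction loss and a perceptual loss between the reenacted image and the driving image,
and an adversarial loss for the generator and discriminator.
These three losses are computed the same way as in the previous module.
The background mixer's loss function can then be written as the sum of these four losses.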

Experiments

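The training dataset is built from IJB-C videos: faces are found with a landmark detector and tracked with a centroid tracker to form image crops, for a total of 400,000 training images.
The FaceForensics++ dataset is used as test data, pre-processed the same way as IJB-C, for a total of 200,000 images.
All networks are first trained separately for a few epochs, then trained together end-to-end.
The landmark transformer and the generator both use a learning rate of 0.0002.
The landmark transformer uses a batch size of 32, and the generator uses a batch size of 1.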

Experiment 1

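The first experiment evaluates the importance of the landmark transformer by comparing self-reenactment and cross-reenactment with and without it.
Self-reenactment means the source image and the driving image show the same person.
Cross-reenactment means they show different people.
In self-reenactment the identity is the same, so the facial contour is essentially the same and only the expression differs.
The driving image can therefore serve directly as ground truth, and the authors evaluate self-reenactment with pixel-wise MSE.
In cross-reenactment the identities differ, so the driving image cannot serve as ground truth and MSE cannot be used.
The authors therefore use three other metrics to evaluate cross-reenactment.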

CSIM

The first is CSIM, the cosine similarity between image embeddings.
A deep face recognition network is used to obtain the embeddings of the source image and the reenacted image.
It measures whether the model preserves the source identity during reenactment.
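The metric itself is a plain cosine similarity. In the sketch below the two embeddings are made-up 3-dimensional stand-ins; real ones would come from the face recognition network and be much longer.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, 0 orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

source_embedding = [0.2, 0.9, 0.4]        # toy stand-in for the source identity
reenacted_embedding = [0.25, 0.85, 0.45]  # toy stand-in for the reenacted output
csim = cosine_similarity(source_embedding, reenacted_embedding)
```

A value close to 1 indicates that the reenacted face still maps to nearly the same identity embedding as the source.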

PSIM

The second is PSIM, the pose cosine similarity between images.
PSIM measures whether the model preserves the head pose of the driving image.
OpenFace is used to compute the head pose of the driving image and the reenacted image.

ED

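The third is ED, the expression difference.
It is the Euclidean distance between action units, again using OpenFace to compute the action units of the driving image and the reenacted image.
It measures whether the model preserves the facial expression of the driving image.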
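As a tiny illustration of ED, the distance below is taken between two action-unit vectors. The 3-element vectors are toy values; the real vectors would be the action-unit intensities estimated for the driving and reenacted images.

```python
import math

def expression_difference(au_driving, au_reenacted):
    """Euclidean distance between two action-unit intensity vectors; lower is better."""
    return math.dist(au_driving, au_reenacted)

ed = expression_difference([0.0, 3.0, 1.0], [4.0, 0.0, 1.0])
```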

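Looking at the self-reenactment scores first, the model without the landmark transformer actually does better.
This is because the source and driving images share the same identity, so the driving image's landmarks are already the best output the landmark transformer could produce.
But the score with the landmark transformer is only slightly worse than the score without it,
which shows that the driving face's expression and head pose are transferred well to the reenacted output.
For the cross-reenactment scores, the authors consider PSIM and ED to be about the same and focus their discussion on CSIM.
But I think there is still a noticeable gap in PSIM: with the landmark transformer, the head pose of the output is actually slightly worse.
My own guess is that with the landmark transformer, the contour of the whole face still affects the head pose.
In the example figure, the landmark transformer preserves the source landmarks, so the head is not rotated as much.
Back to CSIM, which the authors consider more important: a higher CSIM means more of the source identity is preserved in the reenacted output.
So the difference in CSIM shows that the landmark transformer improves reenactment quality and reduces identity leakage.
The figure on the right shows how the landmark transformer affects the output.
I think the most obvious difference is the facial contour: in several examples the face shrinks because of the driving face,
which already affects the source identity.
Overall there are slight differences in head pose, while the expressions look about the same to me, which matches the scores in the table on the left.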

Experiment 2

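The second experiment compares FACEGAN with several existing methods; FACEGAN's results preserve the source identity well throughout.
Compared with the other methods it shows no serious identity leakage, and this difference comes mainly from the landmark transformer.
The other methods also only process the face region and neglect integrating the background or other body parts.
Because FACEGAN adds the background mixer and handles the face and background separately,
details such as lines in the background are still restored well, whereas some other methods blur or distort them.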

Experiment 3

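The third experiment presents the comparison above as numbers, using the metrics mentioned earlier (MSE, CSIM, PSIM, and ED) plus a new metric the authors propose for analyzing subtle structural changes: the landmark similarity score, LSIM.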

LSIM

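LSIM is the landmark similarity score.
Its computation is more involved: first randomly pick two source images of the same identity, called source image 1 and source image 2.
Then find a driving image whose action units are very similar to those of source image 2 but whose identity is different.
Run FACEGAN on source image 1 and the driving image.
The resulting reenacted face is compared with source image 2.
Source image 2 serves as ground truth here because it has the same identity as source image 1 and the same expression as the driving image.
To focus on small structural changes, the authors compute MSE on the corresponding landmark locations of the reenacted face and source image 2 rather than pixel-level MSE.
So if the reenacted face preserves both the facial shape and the expression, LSIM will be lower.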
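The final comparison step of LSIM can be sketched as an MSE over corresponding landmark locations. Averaging the squared point-to-point distance over the number of points is an assumption, and the landmark coordinates are toy values; detecting the landmarks and selecting the image triplet are outside this sketch.

```python
def lsim(reenacted_points, reference_points):
    """MSE between corresponding (x, y) landmark locations; lower is better."""
    squared = [(xr - xg) ** 2 + (yr - yg) ** 2
               for (xr, yr), (xg, yg) in zip(reenacted_points, reference_points)]
    return sum(squared) / len(squared)

reenacted = [(10.0, 10.0), (20.0, 20.0)]   # landmarks found on the reenacted face
reference = [(10.0, 12.0), (23.0, 24.0)]   # landmarks found on source image 2
score = lsim(reenacted, reference)
```

Unlike pixel-level MSE, this score ignores texture and lighting and reacts only to where the facial structure ends up.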

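Back to the comparison table: in self-reenactment, all methods except ICFace have similar MSE.
ICFace uses only action units and no landmarks, so this shows the benefit of using landmarks.
In cross-reenactment, FACEGAN gets the highest CSIM, showing that it preserves the source identity best.
FSGAN gets the highest PSIM, the metric for whether the model preserves the driving image's head pose.
FSGAN uses only landmarks, without action units.
I think FACEGAN misses the top PSIM score for the same reason I gave earlier:
because of the landmark transformer, the facial contour still slightly affects the head pose.
X2Face gets the lowest PSIM; as mentioned at the beginning, X2Face uses a VAE, so its results are not very good to begin with.
ICFACE gets the second-lowest PSIM, which again shows that action units alone make it hard to produce a head pose close to the driving image.
For ED, FACEGAN gets the best score, though the other methods are close.
Finally, LSIM evaluates whether both the facial shape and the expression are preserved.
FACEGAN uses both landmarks and action units for exactly this purpose, and it also gets the best LSIM score.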

Experiment 4

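The last experiment shows how the reenacted face's action units and background can be changed.
FACEGAN uses 17 action units for facial expression and 3 for head pose.
In the earlier examples the action units all come from the driving face, but they can also come from the source face.
The source face's action units can be modified and fed to the model as the driving information.
In the figure, the x-axis lists the action units and the y-axis shows changes to the value of each action unit.
Adjusting the action unit values produces reenacted faces with different expressions and head poses.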
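The editing procedure amounts to copying the source action-unit vector and overwriting one entry at a time. In this sketch the ordering (17 expression units followed by 3 head-pose units), the chosen index, and the value range are all hypothetical; each resulting vector would then be fed to the model in place of the driving action units.

```python
def sweep_action_unit(source_aus, au_index, values):
    """Copy the source AU vector once per value, overwriting a single AU each time."""
    variants = []
    for value in values:
        edited = list(source_aus)
        edited[au_index] = value
        variants.append(edited)
    return variants

source_aus = [0.0] * 20      # assumed layout: 17 expression AUs, then 3 head-pose AUs
driving_vectors = sweep_action_unit(source_aus, au_index=3, values=[0.0, 0.5, 1.0])
```

Each row of such a sweep corresponds to one column of the figure: the same identity rendered with a single action unit at increasing intensity.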
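As for changing the background, the background mixer can combine the reenacted face with any background.
In the figure, each pair of images shares the same source identity: the left one is the source image,
and the right one is the reenacted image formed with a different background.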
Conclusion

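FACEGAN combines facial landmarks and action units to produce high-quality reenactment.
It reduces the identity leakage problem without imposing the many restrictions on source and driving image pairs that earlier methods do.
FACEGAN also processes the face region and the background separately, which even allows swapping in different backgrounds and enables more applications.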
