Perceptual Losses for Real-Time Style Transfer and Super-Resolution

hydragon 2020. 5. 16. 13:54

1. Overview

Instead of using per-pixel loss functions, this paper proposes using perceptual loss functions to train feed-forward networks for image transformation tasks.

Per-pixel loss function

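Compares two images based on their individual pixel values. As a result, two images that are perceptually identical but differ at the pixel level (for example, the same image shifted by a single pixel) come out as very different under a per-pixel loss function.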
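The sensitivity of a per-pixel loss to small shifts can be illustrated with a toy example. The sketch below is only an illustration: the striped image, the `mse` helper, and the wrap-around `shift_right` helper are hypothetical, and plain Python lists stand in for image tensors.

```python
def mse(a, b):
    """Mean squared error between two equally sized grayscale images."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            count += 1
    return total / count


def shift_right(image):
    """Shift every row one pixel to the right, wrapping around the edge."""
    return [row[-1:] + row[:-1] for row in image]


# Hypothetical 4x8 image of alternating black (0.0) and white (1.0) columns.
stripes = [[float(x % 2) for x in range(8)] for _ in range(4)]
shifted = shift_right(stripes)

print(mse(stripes, stripes))  # identical images
print(mse(stripes, shifted))  # same stripe pattern, moved by one pixel
```

Both images show the same stripe pattern, yet after the one-pixel shift every pixel disagrees, so the per-pixel error reaches its maximum for images with values in [0, 1].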

Perceptual loss functions

Compare two images based on high-level representations from a pretrained Convolutional Neural Network (trained for image classification on the ImageNet dataset).

The authors evaluate their approach on two image transformation tasks:
(i) Style Transfer
(ii) Single-Image Super Resolution

For style transfer, they train feed-forward networks that try to solve the optimization problem proposed by Gatys et al.

For super-resolution, they experiment with perceptual losses and show that these give visually better results than per-pixel loss functions.

2. Method

The proposed model architecture consists of two components:
(i) Image Transformation Network (f_{W})
(ii) Loss Network (Φ)

Image Transformation Network

The image transformation network is a deep residual Convolutional Neural Network trained to solve the optimization problem proposed by Gatys et al.

Given an input image (x), this network transforms it into an output image (ŷ).

The weights (W) of this network are learned using losses computed by comparing the output image (ŷ) with the following:

- For style transfer, the representations of the style image (y_{s}) and the content image (y_{c}).
- For super-resolution, only the content image (y_{c}).

The Image Transformation Network is trained with Stochastic Gradient Descent to find the weights (W) that minimize the weighted sum of all loss functions.
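Written as a formula, this training objective takes roughly the following form. The symbols ℓ_i (the individual loss functions), λ_i (their scalar weights), and y_i (the target each loss compares against) are notation introduced here for illustration and do not appear elsewhere in this post.

```latex
W^{*} = \arg\min_{W} \; \mathbb{E}_{x,\{y_i\}}
        \left[ \sum_{i} \lambda_i \, \ell_i\big(f_{W}(x),\, y_i\big) \right]
```

For style transfer the targets are y_{c} and y_{s}; for super-resolution there is a single target, y_{c}.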

https://towardsdatascience.com/perceptual-losses-for-real-time-style-transfer-and-super-resolution-637b5d93fa6d

Loss Network

The Loss Network (Φ) is a VGG16 pretrained on the ImageNet dataset.

The loss network is used to extract content and style representations from the content and style images.

(i) The content representation is taken from the 'relu3_3' layer.
(ii) The style representation is taken from the layers 'relu1_2', 'relu2_2', 'relu3_3', and 'relu4_3'.
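When implementing this, the layer names have to be mapped to positions inside the network. The sketch below assumes VGG16 is stored as a flat sequence of convolution, ReLU, and max-pooling modules without batch normalization (the layout used by, for example, torchvision's `vgg16().features`). The helper `relu_indices` and the constant names are hypothetical.

```python
# VGG16 convolutional configuration: numbers are output channels, "M" is max-pooling.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def relu_indices(cfg):
    """Map names such as 'relu3_3' to positions in a flat conv/ReLU/pool module list."""
    indices = {}
    position = 0          # index of the next module in the flat list
    block, conv = 1, 0    # current block number and conv count within the block
    for item in cfg:
        if item == "M":
            position += 1                  # one pooling module
            block, conv = block + 1, 0     # pooling closes the current block
        else:
            conv += 1
            indices[f"relu{block}_{conv}"] = position + 1  # ReLU follows its conv
            position += 2                  # conv module + ReLU module
    return indices


CONTENT_LAYER = "relu3_3"
STYLE_LAYERS = ["relu1_2", "relu2_2", "relu3_3", "relu4_3"]

layer_index = relu_indices(VGG16_CFG)
print(layer_index[CONTENT_LAYER])
print([layer_index[name] for name in STYLE_LAYERS])
```

With such a mapping, a forward pass through the loss network only needs to run up to the deepest requested layer ('relu4_3') and keep the activations at the listed positions.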

These representations are used to define two types of loss.

Feature Reconstruction Loss

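The feature reconstruction loss compares the content representation of the output image (ŷ) at the 'relu3_3' layer with that of the content image: it is the squared Euclidean distance between the two feature maps, normalized by the size of the feature map.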
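In the notation of this post, the feature reconstruction loss can be written as below. Here Φ_j(·) denotes the activations of layer j of the loss network (j = 'relu3_3' in this case) and C_j, H_j, W_j are the channel count, height, and width of that feature map; these subscripted symbols are introduced here for illustration, and W_j is unrelated to the network weights W.

```latex
\ell_{feat}^{\Phi,j}(\hat{y}, y_c)
  = \frac{1}{C_j H_j W_j}
    \left\lVert \Phi_j(\hat{y}) - \Phi_j(y_c) \right\rVert_2^2
```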

Style Reconstruction Loss

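The style reconstruction loss compares the style representations of the output image (ŷ) and the style image at the layers 'relu1_2', 'relu2_2', 'relu3_3', and 'relu4_3': at each layer it is the squared Frobenius norm of the difference between the Gram matrices of the two feature maps, and the per-layer losses are summed.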
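As a formula, the style representation at layer j is the Gram matrix of the feature map, and the loss is computed on Gram matrices. The indices h, w, c, c' and the sizes C_j, H_j, W_j (channels, height, width of the layer-j feature map) are notation introduced here for illustration; W_j is unrelated to the network weights W.

```latex
G_j^{\Phi}(x)_{c,c'}
  = \frac{1}{C_j H_j W_j}
    \sum_{h=1}^{H_j} \sum_{w=1}^{W_j}
    \Phi_j(x)_{h,w,c} \, \Phi_j(x)_{h,w,c'}

\ell_{style}^{\Phi,j}(\hat{y}, y_s)
  = \left\lVert G_j^{\Phi}(\hat{y}) - G_j^{\Phi}(y_s) \right\rVert_F^2
```

Because the Gram matrix sums over all spatial positions, it keeps information about which features occur together but discards where they occur, which is why it captures style rather than layout.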

For style transfer, the total loss is typically a weighted sum of the feature reconstruction loss and the style reconstruction loss. For super-resolution, it is the feature reconstruction loss alone, multiplied by a weight.
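The feature reconstruction loss, the style reconstruction loss, and their weighted combination can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: feature maps are plain Python lists of channels (each channel a flattened list of H*W activations), the toy values are made up, and the weights `content_weight` and `style_weight` are hypothetical defaults.

```python
def feature_loss(feat_out, feat_target):
    """Squared Euclidean distance between two feature maps, divided by their size."""
    total, count = 0.0, 0
    for ch_out, ch_tgt in zip(feat_out, feat_target):
        for a, b in zip(ch_out, ch_tgt):
            total += (a - b) ** 2
            count += 1
    return total / count


def gram_matrix(feat):
    """C x C matrix of channel inner products, normalized by C * H * W."""
    size = len(feat) * len(feat[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / size for cj in feat]
            for ci in feat]


def style_loss(feat_out, feat_style):
    """Squared Frobenius norm of the difference between the two Gram matrices."""
    g_out, g_style = gram_matrix(feat_out), gram_matrix(feat_style)
    return sum((a - b) ** 2
               for row_o, row_s in zip(g_out, g_style)
               for a, b in zip(row_o, row_s))


def perceptual_loss(out_feats, content_feats, style_feats,
                    content_weight=1.0, style_weight=5.0):
    """Weighted sum of feature and style losses over the layers given in each dict."""
    content_term = sum(feature_loss(out_feats[k], v) for k, v in content_feats.items())
    style_term = sum(style_loss(out_feats[k], v) for k, v in style_feats.items())
    return content_weight * content_term + style_weight * style_term


# Toy activations at one layer: 2 channels, 2x2 spatial positions flattened to 4 values.
out_feats = {"relu3_3": [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]}
content_feats = {"relu3_3": [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]]}
style_feats = {"relu3_3": [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]}

# Style transfer: both terms contribute.
print(perceptual_loss(out_feats, content_feats, style_feats))
# Super-resolution: only the weighted feature reconstruction term remains.
print(perceptual_loss(out_feats, content_feats, {}))
```

Passing an empty style dictionary reduces the same function to the super-resolution case, where only the weighted feature reconstruction term is left.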

These losses are used to learn the weights of the Image Transformation Network.

3. Results

Style Transfer

Network trained on COCO Dataset (for content images).
80k training images resized to 256x256 patches.
Batch size: 4
With 40k iterations (~2 epochs)
Optimizer used: Adam
Learning rate: 1e-3
Training takes ~4 hours on Titan X GPU
Compared against the method proposed by Gatys et al.
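The "~2 epochs" figure follows directly from the batch size, iteration count, and dataset size listed above, as this quick check shows (variable names are illustrative):

```python
train_images = 80_000   # COCO training images used as content images
batch_size = 4
iterations = 40_000

images_seen = batch_size * iterations   # total images processed during training
epochs = images_seen / train_images     # passes over the training set
print(images_seen, epochs)
```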

Single Image Super-resolution

Trained with 288x288 patches from 10k images from the MS-COCO dataset
Prepared low-resolution inputs by blurring with a Gaussian kernel of width σ=1.0 and downsampling with bicubic interpolation.
Batch size: 4
Iterations: 200k
Optimizer: Adam
Learning rate: 1e-3
Compared against SRCNN

The paper's method produces results similar to those of Gatys's method while being nearly three orders of magnitude faster at inference time.

The biggest drawback of this method is that one network has to be trained per style or per resolution. In other words, a single network cannot perform arbitrary style transfer.

With Gatys's method, arbitrary style transfer was possible using only a single network, which is not possible with the method proposed in this paper.