Fully Convolutional Networks for Semantic Segmentation

Machine Learning, Deep Learning Study 2020. 3. 1. 13:28

Compared with classification and detection, segmentation is a considerably harder task.

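Image Classification: classifies the object in an image (recognizes the object class).
Object Detection: classifies and localizes the objects in an image using bounding boxes. This means the class, position, and size of each object must be determined.
Semantic Segmentation: classifies the object class of every pixel in an image. This means every pixel has its own label.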
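To make the difference between the three outputs concrete, here is a minimal sketch. The image size, the class ids, and the box format `(class, x, y, width, height)` are all hypothetical choices made for illustration; they do not come from the paper.

```python
import numpy as np

# Hypothetical 4x6 image with 3 classes: 0 = background, 1 = cat, 2 = dog.
H, W = 4, 6

# Image classification: a single label for the whole image.
classification_output = 1  # "cat"

# Object detection: one (class, x, y, width, height) tuple per object.
detection_output = [(1, 0, 0, 3, 2), (2, 3, 1, 3, 3)]

# Semantic segmentation: one class label per pixel, so the output
# has the same height and width as the image.
segmentation_output = np.zeros((H, W), dtype=np.int64)
segmentation_output[0:2, 0:3] = 1  # pixels that belong to the cat
segmentation_output[1:4, 3:6] = 2  # pixels that belong to the dog
```

The segmentation output is the only one whose size grows with the image, which is part of why the task is harder.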

Original Image (Leftmost), Ground Truth Label Map (2nd Left), Predicted Label Map (2nd Right), Overlap Image and Predicted Label (Rightmost)

This post covers the following points from the paper.

From Image Classification to Semantic Segmentation
Upsampling Via Deconvolution
Fusing the Output
Results

From Image Classification to Semantic Segmentation

In classification, the input image is typically downsized and passed through convolution layers and fully connected (FC) layers, which output a single predicted label for the whole image, as shown below.

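For example, the original VGG network attaches fully connected layers of size 4096, 4096, and 1000 after the extracted features to perform classification. The paper does not keep these as fully connected layers; it recasts them as convolutional layers, treated here as 1x1 convolutional layers. So suppose every FC layer is replaced by a 1×1 convolutional layer.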
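The replacement works because a fully connected layer applied to the feature vector at one spatial location is the same computation as a 1x1 convolution with the same weights. The NumPy sketch below checks this on random data; the weights, the 3x5 feature map, and the function names `fc_layer` and `conv1x1` are hypothetical and only serve the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 4096-dim feature vector mapped to 21 class scores.
in_features, num_classes = 4096, 21
fc_weight = rng.standard_normal((num_classes, in_features))
fc_bias = rng.standard_normal(num_classes)

def fc_layer(x):
    """Fully connected layer on one feature vector of shape (in_features,)."""
    return fc_weight @ x + fc_bias

def conv1x1(feature_map):
    """1x1 convolution on a feature map of shape (in_features, H, W).

    The same weights are applied independently at every spatial location.
    """
    scores = np.einsum("oc,chw->ohw", fc_weight, feature_map)
    return scores + fc_bias[:, None, None]

# On a 1x1 feature map the two layers compute the same thing.
x = rng.standard_normal(in_features)
single = conv1x1(x[:, None, None])

# On a larger feature map the 1x1 convolution returns one score
# vector per location instead of a single prediction.
feature_map = rng.standard_normal((in_features, 3, 5))
score_map = conv1x1(feature_map)
print(single.shape, score_map.shape)
```

Because the convolutional form does not fix the spatial size of its input, the same weights can be run on images of any size.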

If, in addition, the input image is not downsized, the output is no longer a single label. Instead it is a map that is smaller than the input image (because of max pooling).
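How much smaller the output gets can be estimated with simple arithmetic. The sketch below assumes five 2x2 max-pooling layers with stride 2 (as in VGG16), convolutions that preserve spatial size, and floor division at each pooling step. The helper name `size_after_pooling` is hypothetical, and the exact sizes in a real implementation depend on its padding.

```python
def size_after_pooling(size, num_pools=5, stride=2):
    """Spatial size after repeated stride-2 pooling (floor division assumed)."""
    for _ in range(num_pools):
        size = size // stride
    return size

# Five stride-2 poolings give an overall stride of 2**5 = 32.
total_stride = 2 ** 5

for side in (224, 512):
    print(side, "->", size_after_pooling(side))
```

Under these assumptions a 224-pixel side shrinks to 7 and a 512-pixel side to 16, so each output location summarizes a 32-pixel step in the input.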

Transforming fully connected layers into convolution layers enables a classification net to output a heatmap

In the figure above you can see the 1x1 conv layers labeled 4096, 4096, and 1000. (The figure shows 1000 classes, but the actual experiments use 21 classes, i.e. 1000 -> 21.) Each depth channel of the last 1x1 conv layer corresponds to one class, so picking out the feature map for tabby cat among the 1000 classes gives the tabby cat heatmap shown above.

Upsampling this output gives a per-pixel output (label map), as shown below. That is, if the original image has width W and height H, the result is a dense heatmap of size WxHx21 (since the experiments use 21 classes).
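Going from the dense heatmap to a label map is a per-pixel argmax over the class channels. The sketch below uses random scores in place of a real network output; the image size and the class index 8 are hypothetical values picked for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dense output for a 4x6 image with 21 classes, stored as (H, W, classes).
H, W, num_classes = 4, 6, 21
dense_scores = rng.standard_normal((H, W, num_classes))

# Pretend the network is very confident about class 8 in one region.
dense_scores[1:3, 2:5, 8] = 100.0

# Heatmap of a single class: one channel of the dense output.
heatmap_class8 = dense_scores[..., 8]

# Label map: the highest-scoring class at every pixel.
label_map = dense_scores.argmax(axis=-1)
print(label_map)
```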

Feature Map / Filter Number Along Layers

Upsampling Via Deconvolution

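Convolution (together with striding and pooling) typically shrinks the output. To upsample and make the output larger again, an operation called deconvolution is used. (The name deconvolution is often misread as the inverse of convolution, but it is not actually an inverse operation. See this post.) Deconvolution is also called up convolution or transposed convolution.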
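A minimal single-channel sketch of transposed convolution in NumPy follows, assuming a square kernel and no padding. Each input value scales a copy of the kernel, and the copies are added into the output at positions spaced `stride` apart. The function name `transposed_conv2d` and the all-ones kernel are illustrative choices; in a network the kernel values would be learned.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride):
    """Single-channel transposed convolution with a square kernel and no padding."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            # Paste a scaled copy of the kernel at the strided position.
            top, left = i * stride, j * stride
            out[top:top + k, left:left + k] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
kernel = np.ones((2, 2))

up = transposed_conv2d(x, kernel, stride=2)
print(up)
```

With a 2x2 kernel of ones and stride 2 the copies do not overlap, so the result is plain 2x nearest-neighbor upsampling. When the kernel is larger than the stride, neighboring copies overlap and are summed, which is how a kernel can produce smoother interpolation.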

Fusing the Output

As shown below, the output after conv7 is small, so it is upsampled by 32x to bring it back to the size of the input image. Because a coarse result is being upsampled, however, the resulting segmentation inevitably loses fine detail.

The reason is that spatial location information is lost as features get deeper. Put differently, layers closer to the input retain more location information. To exploit this, the paper uses a skip combining technique that brings in feature maps from earlier layers.

Fully Convolutional Networks with Skip Combining : FCN-16s

FCN-32s produces its segmentation by upsampling the final conv7 output by 32x. FCN-16s instead upsamples the final conv layer output by 2x and combines it with the output of the pool4 layer. The combined result is then upsampled by 16x to produce the FCN-16s segmentation. Here, "combine" simply means element-wise addition. The FCN-8s architecture, built the same way, is shown below.
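The data flow of the three variants can be sketched as follows. This is a shape-level illustration under stated assumptions, not the paper's implementation: nearest-neighbor upsampling stands in for the learned deconvolution layers, the score maps are random arrays that are assumed to have already been mapped to 21 class channels, the input size is a hypothetical one divisible by 32, and FCN-8s is assumed to repeat the fusion step with pool3.

```python
import numpy as np

def upsample(score, factor):
    """Nearest-neighbor upsampling of a (C, H, W) score map by an integer factor."""
    return score.repeat(factor, axis=1).repeat(factor, axis=2)

num_classes = 21
H, W = 64, 96  # hypothetical input size, divisible by 32

rng = np.random.default_rng(0)
score_conv7 = rng.standard_normal((num_classes, H // 32, W // 32))
score_pool4 = rng.standard_normal((num_classes, H // 16, W // 16))
score_pool3 = rng.standard_normal((num_classes, H // 8, W // 8))

# FCN-32s: upsample the conv7 scores by 32x in one step.
fcn32s = upsample(score_conv7, 32)

# FCN-16s: 2x upsample conv7, add the pool4 scores, then upsample by 16x.
fused16 = upsample(score_conv7, 2) + score_pool4
fcn16s = upsample(fused16, 16)

# FCN-8s: 2x upsample the fused map, add the pool3 scores, then upsample by 8x.
fused8 = upsample(fused16, 2) + score_pool3
fcn8s = upsample(fused8, 8)

print(fcn32s.shape, fcn16s.shape, fcn8s.shape)
```

All three outputs have the input's spatial size, but the final upsampling factor drops from 32 to 16 to 8, so the finer variants start from a less coarse map.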

Fully Convolutional Networks with Skip Combining : FCN-8s

Results

Pascal VOC 2011 dataset (Left), NYUDv2 Dataset (Middle), SIFT Flow Dataset (Right)

FCN-8s is the best in Pascal VOC 2011
FCN-16s is the best in NYUDv2
FCN-16s is the best in SIFT Flow

Refining fully convolutional nets by fusing information from layers with different strides improves segmentation detail. The first three images show the output from our 32, 16, and 8 pixel stride nets

Fully convolutional segmentation nets produce state-of-the-art performance on PASCAL.

ABOUT ME

Minhyeok Lee
