Convolutional Neural Networks

Jiwon Kim

2023. 10. 14. 22:59

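Fully-Connected Layer

- Every input value is directly connected to every output value

- A fully-connected layer models relationships from every input value to every output value

- Therefore, it is assumed that any output value can be affected by any input value

The MLP (multi-layer perceptron) covered so far has exactly this fully-connected structure: the weight matrix W is multiplied with the input X, and the result is passed through an activation function.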
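
As a rough illustration of the structure just described, here is a minimal sketch of one fully-connected layer in plain Python. The sigmoid activation and the numbers in `x`, `W`, and `b` are assumptions chosen for the example; the notes above do not specify them.

```python
import math

def sigmoid(z):
    """Logistic activation, used here only as an example non-linearity."""
    return 1.0 / (1.0 + math.exp(-z))

def fully_connected(x, W, b):
    """Compute activation(W x + b); every output reads every input."""
    outputs = []
    for row, bias in zip(W, b):
        z = sum(w * xi for w, xi in zip(row, x)) + bias
        outputs.append(sigmoid(z))
    return outputs

# Hypothetical layer with 3 inputs and 2 outputs: W holds 2 * 3 = 6 separate weights.
x = [1.0, 2.0, 3.0]
W = [[0.1, 0.2, 0.3],
     [0.0, -0.5, 0.5]]
b = [0.0, 0.0]
y = fully_connected(x, W, b)
```

Each row of `W` is a separate set of weights for one output, which is the property the convolution operation gives up.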

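Convolution Operation

- Unlike the fully-connected structure, a kernel slides over the input vector and a linear model is applied at each position!

- It is still a linear transformation; the difference is that, instead of a separate set of weights for each i-th row, the same fixed kernel is multiplied at every position

- This idea exploits "spatial locality"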
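
The sliding-kernel idea can be sketched in one dimension as follows. This is an illustrative sketch only: the kernel values are hypothetical, and, as is usual in deep learning, the kernel is not flipped (strictly speaking this is cross-correlation).

```python
def conv1d(x, kernel):
    """Slide one fixed kernel over x (stride 1, no padding)."""
    k = len(kernel)
    return [
        sum(kernel[j] * x[i + j] for j in range(k))
        for i in range(len(x) - k + 1)
    ]

x = [1, 2, 4, 7, 11]
kernel = [-1, 1]        # hypothetical kernel: difference between neighbours
y = conv1d(x, kernel)   # every output reuses the same two weights
```

Here `y` is `[1, 2, 3, 4]`: four outputs computed from only two shared weights, each looking at just two neighbouring inputs.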

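Thinking about how the convolution operation could be applied to an image:

(1) Move a filter (kernel) across the image, compute the inner product at each position, and return a result map.

(2) Apply a threshold to the result map to decide whether the target object is present.

(3) A precise search may need 2 to 4 nested loops.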
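
The three steps above can be sketched with a hand-made filter. The image, kernel, and threshold below are hypothetical values picked so that exactly one location matches; a learned filter would replace the hand-made one.

```python
def response_map(image, kernel):
    """Inner product of the kernel with every patch of a 2-D image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    result = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):              # loop 1: patch row
        for c in range(out_w):          # loop 2: patch column
            for i in range(kh):         # loop 3: kernel row
                for j in range(kw):     # loop 4: kernel column
                    result[r][c] += kernel[i][j] * image[r + i][c + j]
    return result

def detect(image, kernel, threshold):
    """Return (row, col) positions whose response reaches the threshold."""
    result = response_map(image, kernel)
    return [(r, c)
            for r, row in enumerate(result)
            for c, value in enumerate(row)
            if value >= threshold]

# Hypothetical 4x4 binary image containing one 2x2 bright square.
image = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
kernel = [[1, 1],
          [1, 1]]
hits = detect(image, kernel, threshold=4)
```

Only the patch starting at row 1, column 1 reaches the threshold, so `hits` is `[(1, 1)]`.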

This raises a question: how do we design a suitable filter for finding the target object in an image?

A filter applied to an image has two components: its size and the values inside it. The values, i.e. the parameter values, are learned from data, so we only need to supply the size, i.e. the structure, as a hyperparameter. (The architecture should also preserve spatial locality.)

Assumptions of a Convolutional Neural Network

1) Spatial Locality : each filter looks at nearby pixels only

2) Positional Invariance : same filters are applied to all locations in the image

(We do not know where in the image the target object will be, so every filter has to be applied across the whole image.)

Convolution Layer

# For a Monochrome (Grayscale) Image

- The image is grayscale, so channel = 1

- Convolve the filter with the image (slide over the image spatially, computing dot products)

# For an RGB Image

- The input image arrives with one plane per R/G/B color, so there are 3 channels.

- For any one 3*3 patch of the image, 3*3*3 = 27 products plus the bias, 28 numbers in total, are summed.

- w^T * x + b
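
To make the count of 28 concrete, the sketch below evaluates w^T x + b for a single 3*3 RGB patch and counts the summed terms. The patch and weight values are placeholders, not values from the lecture.

```python
def patch_response(patch, weights, bias):
    """w^T x + b for one patch; patch and weights are indexed [row][col][channel]."""
    total = bias
    terms = 1                                   # the bias counts as one term
    for patch_row, weight_row in zip(patch, weights):
        for patch_px, weight_px in zip(patch_row, weight_row):
            for x, w in zip(patch_px, weight_px):
                total += w * x
                terms += 1
    return total, terms

# Hypothetical 3x3 RGB patch (all ones) and 3x3x3 filter (all 0.5).
patch = [[[1.0, 1.0, 1.0] for _ in range(3)] for _ in range(3)]
weights = [[[0.5, 0.5, 0.5] for _ in range(3)] for _ in range(3)]
value, terms = patch_response(patch, weights, bias=1.0)
```

With these placeholder values, `terms` is 28 (27 products plus the bias) and `value` is 27 * 0.5 + 1.0 = 14.5.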

# Convolution over the entire image: what is the output size of the activation map?

In the figure above, each green cell is one computed number. Its meaning is:

the result of taking a dot product between the filter and a small 5*5*3 chunk of the image: a 5*5*3 = 75-dimensional dot product + bias

# What if there are multiple filters?

Multiple filters = Multiple Activation Maps

The same operation (w^T x + b) is carried out for every map. The only difference between filters is their values, i.e. their weights.

In the case above, with 4 filters instead of 1, the output activation map has size 28 * 28 * 4.

Now suppose we apply 10 filters of size 5*5*4 to this 28 * 28 * 4 volume.

The output activation map will then have size 24 * 24 * 10.
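
The shape bookkeeping above can be checked with a small, unoptimized sketch. It assumes a 32 * 32 * 3 input, which is inferred from the 28 * 28 map and the 5*5 filter rather than stated in the text, and it uses constant placeholder weights; `conv_layer`, `constant_volume`, and `shape` are helper names made up for this example.

```python
def conv_layer(volume, filters, biases):
    """Apply K filters (stride 1, no padding) to an H x W x C volume.

    volume:  nested lists indexed [H][W][C]
    filters: K filters, each indexed [F][F][C]
    returns: nested lists indexed [H-F+1][W-F+1][K]
    """
    height, width = len(volume), len(volume[0])
    size = len(filters[0])
    out = []
    for r in range(height - size + 1):
        out_row = []
        for c in range(width - size + 1):
            pixel = []
            for filt, bias in zip(filters, biases):   # one map per filter
                total = bias
                for i in range(size):
                    for j in range(size):
                        for w, x in zip(filt[i][j], volume[r + i][c + j]):
                            total += w * x
                pixel.append(total)
            out_row.append(pixel)
        out.append(out_row)
    return out

def constant_volume(height, width, channels, value):
    """Demo helper: a volume filled with a single value."""
    return [[[value] * channels for _ in range(width)] for _ in range(height)]

def shape(volume):
    return (len(volume), len(volume[0]), len(volume[0][0]))

image = constant_volume(32, 32, 3, 1.0)                        # assumed input size
layer1 = [constant_volume(5, 5, 3, 0.01) for _ in range(4)]    # 4 filters of 5x5x3
layer2 = [constant_volume(5, 5, 4, 0.01) for _ in range(10)]   # 10 filters of 5x5x4

maps1 = conv_layer(image, layer1, [0.0] * 4)
maps2 = conv_layer(maps1, layer2, [0.0] * 10)
print(shape(maps1), shape(maps2))   # (28, 28, 4) (24, 24, 10)
```

Note that the depth of each second-layer filter (4) has to match the number of maps produced by the first layer.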

Nested Conv-layers

High-level features: the filters, as many as we specified, trained so that they best separate the inputs into the c classes.

The layers toward the back learn to make higher-level distinctions.

Questions about nested convolutional layers:

1) The larger the filter, the more quickly the output (the activation map) shrinks after each layer, which makes it hard to stack many stages (may prevent us from nesting many layers)

2) If the input image is high-resolution (e.g. 4K resolution: 3840 * 2160), conv layers require too much computation.

"Stride" and "Padding" were introduced to address these problems.

Stride

- 7 x 7 input image, 3 x 3 filter, with stride = 2

activation map : 3 x 3 output

- 7 x 7 input image, 3 x 3 filter, with stride = 3

activation map : cannot be applied - doesn't fit !

To generalize:

When the input is N x N, the filter is F x F, and the stride is s,

Output Size = (N-F) / s + 1

ex) s=1 : (7-3) / 1 + 1 = 5
s=2 : (7-3) / 2 + 1 = 3
s=3 : (7-3) / 3 + 1 = 2.33 (X, not an integer, so the filter does not fit)
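
The same arithmetic as a small helper. Returning `None` for a stride that does not fit is a choice made for this sketch; real libraries handle that case in their own ways.

```python
def output_size(n, f, stride):
    """Return (N - F) / s + 1, or None when the filter does not fit evenly."""
    if (n - f) % stride != 0:
        return None
    return (n - f) // stride + 1

# 7 x 7 input with a 3 x 3 filter at strides 1, 2 and 3.
sizes = {s: output_size(7, 3, s) for s in (1, 2, 3)}
```

`sizes` comes out as `{1: 5, 2: 3, 3: None}`, matching the three examples.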

Padding

In practice, it is common to zero pad the border (fill the edges with zeros, i.e. black)

If a 7 x 7 image is padded with one ring of zeros, it becomes 9 x 9, as in the figure below.

Applying a 3x3 filter with stride 1 to it gives an output activation map of size 7x7.

Generalizing the activation map size again, now in terms of padding size, filter size, and stride:

Output size = (N-F+2P) / s + 1

(With stride 1, setting the padding to P = (F-1) / 2 preserves the map size.)
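
A minimal sketch of zero padding and the padded size formula follows. The function names are made up for this example, and `same_padding` assumes an odd filter size so that (F-1) / 2 is an integer.

```python
def zero_pad(image, p):
    """Surround a 2-D image (list of rows) with p rings of zeros."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]            # top rings
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)    # left and right rings
    padded.extend([0] * width for _ in range(p))        # bottom rings
    return padded

def padded_output_size(n, f, stride, p):
    """(N - F + 2P) / s + 1, assuming the filter fits exactly."""
    return (n - f + 2 * p) // stride + 1

def same_padding(f):
    """Padding that preserves the map size at stride 1 (odd F only)."""
    return (f - 1) // 2

image = [[1] * 7 for _ in range(7)]
padded = zero_pad(image, 1)                  # 7 x 7 becomes 9 x 9
size = padded_output_size(7, 3, 1, 1)        # 3 x 3 filter, stride 1, P = 1
```

Here `size` is 7, so the activation map keeps the original 7 x 7 size.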

Convolutional Layer Summary

Given an input volume of W x H x C, a convolutional layer needs 4 hyperparameters.
(1) Number of filters : K
(2) The filter size : F
(3) The stride : S
(4) The zero padding : P

This will produce an output of size W' x H' x K,

W' = (W-F+2P) / S + 1
H' = (H-F+2P) / S + 1
Number of parameters = K (F^2 * C + 1)
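
The summary can be turned into a small calculator. The layer sizes used below reuse the earlier example and assume a 32 * 32 * 3 input, which is inferred rather than stated in the text.

```python
def conv_layer_summary(w, h, c, k, f, s, p):
    """Return ((W', H', K), parameter count) for a convolutional layer."""
    out_w = (w - f + 2 * p) // s + 1
    out_h = (h - f + 2 * p) // s + 1
    params = k * (f * f * c + 1)      # F*F*C weights plus 1 bias per filter
    return (out_w, out_h, k), params

# First layer: 4 filters of 5x5 on an assumed 32 x 32 x 3 input.
shape1, params1 = conv_layer_summary(32, 32, 3, k=4, f=5, s=1, p=0)
# Second layer: 10 filters of 5x5 on the resulting 28 x 28 x 4 volume.
shape2, params2 = conv_layer_summary(28, 28, 4, k=10, f=5, s=1, p=0)
```

The first layer has 4 * (75 + 1) = 304 parameters and the second 10 * (100 + 1) = 1010, independent of the image size.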

Fully-Connected vs. Conv

A convolutional layer is a special case of a fully-connected layer.
Each value in the output is determined by
- all input values, with a fully-connected layer
- values within a small region, with a convolutional layer
Thus, a conv layer is equivalent to a fully-connected layer in which all weights outside the filter range are zero (and the remaining weights are shared across positions).
+ Conversely, with a filter the same size as the image, a conv layer behaves like a fully-connected layer, so one can also think of conv layers as including fully-connected layers.
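
Both directions of this equivalence can be checked on a small 1-D example. This is an illustrative sketch with made-up numbers: `conv_as_dense_matrix` builds the mostly-zero weight matrix that a fully-connected layer would need in order to reproduce the convolution.

```python
def conv1d(x, kernel):
    """Slide one fixed kernel over x (stride 1, no padding)."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

def conv_as_dense_matrix(kernel, n):
    """Dense weight matrix equivalent to conv1d on inputs of length n."""
    k = len(kernel)
    rows = []
    for i in range(n - k + 1):
        row = [0] * n               # zeros everywhere outside the filter range
        row[i:i + k] = kernel       # the same kernel, shifted by i
        rows.append(row)
    return rows

def dense(W, x):
    """Plain fully-connected layer without bias or activation: W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

x = [3, 1, 4, 1, 5]
kernel = [1, 0, -1]
W = conv_as_dense_matrix(kernel, len(x))
same = dense(W, x) == conv1d(x, kernel)      # conv is a constrained dense layer

# Conversely, a kernel as long as the input gives one output using every input.
full = conv1d(x, [1, 1, 1, 1, 1])
```

`same` is `True`, and `full` holds a single value (14) that depends on all five inputs, just like one unit of a fully-connected layer.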

Pooling Layer

- with downsampling, makes the representations smaller and more manageable.
+ some level of denoising / controlling overfitting

- operates over each activation map (channel) independently

Pooling Layer Summary

Given an input volume of W x H x C, a pooling layer needs 2 hyperparameters.
(1) The spatial extent : F
(2) The stride : S

This will produce an output of size W' x H' x C (the channel count never changes because pooling is done on each channel separately),

W' = (W-F) / S + 1
H' = (H-F) / S + 1
Number of parameters = 0 (NO learning happens)
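
As one possible instance of a pooling layer, here is a sketch of max pooling on a single activation map. Max pooling is an assumption here; the notes above do not say which pooling function is used. The input values are made up.

```python
def max_pool(channel, f, s):
    """Max-pool one 2-D activation map with window size f and stride s."""
    out_h = (len(channel) - f) // s + 1
    out_w = (len(channel[0]) - f) // s + 1
    return [
        [max(channel[r * s + i][c * s + j]
             for i in range(f) for j in range(f))
         for c in range(out_w)]
        for r in range(out_h)
    ]

channel = [[1, 3, 2, 4],
           [5, 6, 1, 0],
           [7, 2, 9, 8],
           [0, 1, 3, 4]]
pooled = max_pool(channel, f=2, s=2)   # 4 x 4 map becomes 2 x 2
```

`pooled` is `[[6, 4], [7, 9]]`. For a W x H x C volume the same function would be applied to each of the C channels separately, and there is nothing to learn.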

Case Study

In 2012, AlexNet was announced as the first CNN that could be applied successfully to image classification.
