Weight Initialization

Jiwon Kim

2023. 10. 22. 18:41

We have seen how to construct a Neural Network architecture, and how to preprocess the data. Before we can begin to train the network we have to initialize its parameters.

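Assuming the data is well normalized, we can expect the final trained weights to be centered around 0, with roughly half of them positive and half negative. So the question is: why not simply initialize every weight to w = 0? The problem is that if all weights are identical, every neuron computes the same output = wx, receives the same gradient, and goes through exactly the same parameter updates. Stacking multiple neurons then serves no purpose, because they all stay identical.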
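To make the symmetry problem concrete, here is a minimal sketch with a tiny one-hidden-layer tanh network. The shapes, the squared-error loss, and the helper name `hidden_weight_grad` are assumptions made up for illustration, not part of the original notes. It compares the gradient of the hidden weights under all-zero, all-constant, and small random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))   # 8 samples, 4 input features (hypothetical sizes)
y = rng.standard_normal((8, 1))   # random regression targets

def hidden_weight_grad(W1, W2):
    """Gradient of 0.5 * mean((y_hat - y)^2) w.r.t. W1, where y_hat = tanh(x W1) W2."""
    h = np.tanh(x @ W1)                   # hidden activations, shape (8, 3)
    y_hat = h @ W2                        # predictions, shape (8, 1)
    d_out = (y_hat - y) / len(x)          # dL/dy_hat
    d_h = (d_out @ W2.T) * (1 - h ** 2)   # backprop through tanh
    return x.T @ d_h                      # one column per hidden neuron, shape (4, 3)

# Every weight is 0: no gradient reaches the hidden layer at all.
grad_zero = hidden_weight_grad(np.zeros((4, 3)), np.zeros((3, 1)))

# Every weight is the same constant: all hidden neurons get the same gradient.
grad_const = hidden_weight_grad(np.full((4, 3), 0.1), np.full((3, 1), 0.1))

# Small random weights: each hidden neuron gets its own gradient.
grad_rand = hidden_weight_grad(0.01 * rng.standard_normal((4, 3)),
                               0.01 * rng.standard_normal((3, 1)))

print("zero init, gradient columns:\n", grad_zero)
print("constant init, gradient columns:\n", grad_const)
print("random init, gradient columns:\n", grad_rand)
```

With identical weights the columns of the gradient are identical, so the neurons can never become different from each other during training. Random initialization is what breaks this symmetry.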

Small Gaussian Random

- Initialize the weights to values that are close to 0 but not exactly 0.

- implementation for this weight matrix : W = 0.01 * np.random.randn(d_in, d_out)

- This draws the weights from a normal distribution with mean 0 and standard deviation 0.01 (that is, variance 0.0001).

- This works for shallow networks, but in a deep network with many stacked layers the gradients can vanish during backpropagation.

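With tanh as the activation function, plotting the distribution of each layer's output tanh(Wx) shows that the mean stays close to 0 while the variance shrinks rapidly from layer to layer. This happens because multiplying by a very small W makes Wx close to 0, and tanh maps inputs near 0 to outputs near 0. As more layers are stacked, the small weights drive almost all activations toward 0. The gradient of each W is proportional to the activations flowing into that layer, so the gradients in the backward pass also approach 0, which means training becomes very slow.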
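The shrinking variance is easy to reproduce. The following is a rough sketch, assuming a hypothetical stack of 6 tanh layers of width 512 fed with unit-variance random inputs (these sizes are not from the original notes). It prints the standard deviation of the activations after each layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, width = 6, 512                 # hypothetical depth and layer width
x = rng.standard_normal((1000, width))   # unit-variance inputs

stds = []
for _ in range(n_layers):
    W = 0.01 * rng.standard_normal((width, width))   # small Gaussian init
    x = np.tanh(x @ W)                               # output of this layer
    stds.append(x.std())

for i, s in enumerate(stds, start=1):
    print(f"layer {i}: activation std = {s:.2e}")
```

Under these assumptions the standard deviation drops by a large factor at every layer, which matches the collapse toward 0 described above.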

Large Gaussian Random

- Since the weights being too small was the problem, this time we make them larger.

- implementation for this weight matrix : W = 0.5 * np.random.randn(d_in, d_out)

- This draws the weights from a normal distribution with mean 0 and standard deviation 0.5 (that is, variance 0.25).

- This time almost all neurons saturate at either -1 or 1, which also drives the gradient to 0.

Because W is somewhat larger, Wx is larger too, and by the shape of tanh almost all outputs land close to +1 or -1. As this repeats across layers, the output distribution becomes more and more polarized.

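Differentiating the loss function with respect to W gives a gradient that contains the factor 1 - tanh^2(Wx + b), which is the derivative of tanh. When the neurons are saturated, tanh^2(Wx + b) is almost 1, so this factor, and with it the gradient, is almost 0. Once again we run into the vanishing gradient problem.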
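Saturation can be checked numerically in the same way. This sketch assumes 3 tanh layers of width 512 (hypothetical sizes) and counts a unit as saturated when |tanh| > 0.99, which is an arbitrary threshold chosen for illustration. It also reports the typical size of the local gradient factor 1 - tanh^2.

```python
import numpy as np

rng = np.random.default_rng(0)
width = 512                              # hypothetical layer width
h = rng.standard_normal((1000, width))   # unit-variance inputs

saturated_fractions = []
median_local_grads = []
for _ in range(3):
    W = 0.5 * rng.standard_normal((width, width))   # large Gaussian init
    h = np.tanh(h @ W)
    # Fraction of activations that sit very close to -1 or +1.
    saturated_fractions.append(np.mean(np.abs(h) > 0.99))
    # Local derivative of tanh at each unit: 1 - tanh^2.
    median_local_grads.append(np.median(1 - h ** 2))

for i, (frac, grad) in enumerate(zip(saturated_fractions, median_local_grads), start=1):
    print(f"layer {i}: saturated = {frac:.1%}, median (1 - tanh^2) = {grad:.1e}")
```

Most units end up in the flat regions of tanh, where the local derivative is tiny, so very little gradient can flow back through them.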

So when initializing the weights, we need to make them neither too small nor too large. Xavier Initialization was proposed as a solution to this.

Xavier Initialization

- weight = np.random.randn(d_in, d_out) / np.sqrt(d_in)

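- Dividing the weights by sqrt(d_in) as in the expression above conveniently keeps the variance of the output approximately equal to the variance of the input. (The detailed derivation is omitted here.)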

- As a result, as in the figure above, the outputs no longer collapse to 0 or pile up at +1/-1 even as layers are stacked.
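The intuition behind the 1 / sqrt(d_in) factor is that, for independent zero-mean weights and inputs, Var(sum of d_in terms w_i * x_i) = d_in * Var(w) * Var(x), so choosing Var(w) = 1 / d_in makes the two variances match. Below is a minimal numerical check of that idea. The helper name `xavier_init` and the sizes (width 512, 6 layers) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(d_in, d_out):
    # Hypothetical helper wrapping the formula from the bullet list above.
    return rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)

width = 512                              # hypothetical layer width
x = rng.standard_normal((1000, width))   # unit-variance inputs

# 1) Purely linear layer: output variance should be close to input variance.
linear_ratio = (x @ xavier_init(width, width)).var() / x.var()
print(f"Var(xW) / Var(x) = {linear_ratio:.3f}")

# 2) Stack of tanh layers: activations should stay in a usable range.
tanh_stds = []
h = x
for _ in range(6):
    h = np.tanh(h @ xavier_init(width, width))
    tanh_stds.append(h.std())

for i, s in enumerate(tanh_stds, start=1):
    print(f"layer {i}: activation std = {s:.3f}")
```

With tanh the standard deviation still decreases a little from layer to layer, because tanh squashes its input, but it stays far away from both 0 and the saturated +1/-1 regime.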

https://proceedings.mlr.press/v9/glorot10a.html

Understanding the difficulty of training deep feedforward neural networks


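However, this initialization also runs into trouble when the activation function is ReLU. ReLU passes positive inputs through unchanged (output = input for x > 0) but sets negative inputs to 0, so roughly half of each layer's signal is discarded. As layers are stacked, values near 0 keep being multiplied by weights near 0, and the activation distribution concentrates around 0. The derivative of Wx with respect to W is x, so x converging to 0 means the gradient also approaches 0.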

He(Kaiming) Initialization for ReLU

- weight = np.random.randn(d_in, d_out) * np.sqrt(2 / d_in)

- With ReLU, this Kaiming Initialization is what is usually used.
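To compare the two scalings under ReLU, here is a small sketch that tracks the mean squared activation through a stack of ReLU layers. The helper name `relu_mean_squares` and the sizes (10 layers, width 512) are hypothetical choices for illustration. The only difference between the two runs is the scale applied to the random weights: 1 / sqrt(d_in) for Xavier and sqrt(2 / d_in) for He.

```python
import numpy as np

def relu_mean_squares(scale, n_layers=10, width=512, seed=0):
    """Mean squared activation after each ReLU layer for weights randn * scale."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((1000, width))   # unit-variance inputs
    history = []
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * scale
        h = np.maximum(0.0, h @ W)           # ReLU
        history.append(np.mean(h ** 2))
    return history

width = 512
xavier_ms = relu_mean_squares(1 / np.sqrt(width))   # Xavier scaling
he_ms = relu_mean_squares(np.sqrt(2 / width))       # He (Kaiming) scaling

for i, (xv, he) in enumerate(zip(xavier_ms, he_ms), start=1):
    print(f"layer {i:2d}: Xavier = {xv:.2e}, He = {he:.2e}")
```

With Xavier scaling the signal roughly halves at every ReLU layer, because ReLU zeroes out about half of the pre-activations. The extra factor of 2 in He initialization compensates for that, so the signal stays at roughly the same scale.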
