Loss function & Optimization


Jiwon Kim

2023. 10. 11. 18:38

Q. How do we set the values of the parameters W?

Machine Learning: a data-driven approach

We (humans) only design the form of the model (e.g. f(W, x) = Wx) and
initialize the parameter values (W) randomly (we only do the initialization; the machine determines the final values).
Then we feed the training data X through the model to estimate the labels (y hat),
compare the estimates (y hat) to the ground truth labels (y),
measuring how good or bad the model currently is (= Loss Function),
update the parameters (W) based on this loss (= Optimization),
and repeat until y hat is close to the real y.

Loss Function

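: quantifies how good (or bad) our current machine learning model is

A function of our predicted value (y hat) and the actual ground truth label (y): L(y hat, y)

Depending on the difference between y hat and y, the loss function outputs a non-negative value that penalizes the model.
(That is, the loss is 0 when there is no difference, and the model is heavily penalized when the difference is large.)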

# General definition of a loss function

▶ Discriminative Setting

The loss is determined by the sign and magnitude of the product of our prediction (y hat) and the true value (y), i.e. the margin y · y hat.

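Margin-based Loss Comparisons:

- exponential: strongly affected by outliers (noise) because it assigns them an excessively large loss, so it is a poor fit for noisy data

- hinge & log loss: the commonly used loss functions

- hinge loss (SVM): its gradient is simply -1 or 0, so it is cheap to compute.

- log loss (logistic regression): its output is itself p(y|x), so it is more interpretable.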
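
The comparison above can be checked numerically by writing each loss as a function of the margin m = y · y hat, with y in {-1, +1}. This is only a sketch: the function names are mine, and the formulas are the usual textbook forms (exp(-m), max(0, 1 - m), log(1 + exp(-m))), which I am assuming match the curves being compared.

```python
import math

def exponential_loss(m):
    # grows exponentially for badly misclassified points (very negative margin)
    return math.exp(-m)

def hinge_loss(m):
    # zero once the margin reaches 1, linear otherwise
    return max(0.0, 1.0 - m)

def hinge_grad(m):
    # (sub)gradient with respect to the margin: only ever -1 or 0
    return -1.0 if m < 1.0 else 0.0

def log_loss(m):
    # negative log of the sigmoid probability of the correct label
    return math.log1p(math.exp(-m))

def prob_correct(m):
    # p(y|x) under logistic regression, as a function of the margin
    return 1.0 / (1.0 + math.exp(-m))

for m in (-5.0, -1.0, 0.0, 1.0, 3.0):
    print(m, exponential_loss(m), hinge_loss(m), log_loss(m))
```

At m = -5 (a badly misclassified, possibly noisy point) the exponential loss is far larger than the other two, which is the outlier sensitivity noted in the list.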

In the example above the SVM was a binary classifier, but multi-class classification is of course also possible, as in the example below.

In the hinge loss graph, the x-axis is s_y_i (the score for the true category); the loss decreases as s_y_i grows.
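
A minimal sketch of the multi-class hinge (SVM) loss for a single example, assuming the usual form: the sum over incorrect classes j of max(0, s_j - s_y_i + 1). The scores below are made-up numbers for illustration, and the function name is mine.

```python
def multiclass_svm_loss(scores, true_class, margin=1.0):
    """Sum of max(0, s_j - s_true + margin) over the incorrect classes j."""
    s_true = scores[true_class]
    return sum(
        max(0.0, s_j - s_true + margin)
        for j, s_j in enumerate(scores)
        if j != true_class
    )

# hypothetical scores for 3 classes; class 0 is the true class
loss_low_score = multiclass_svm_loss([3.2, 5.1, -1.7], true_class=0)

# same example, but with a larger score for the true class
loss_high_score = multiclass_svm_loss([6.5, 5.1, -1.7], true_class=0)

print(loss_low_score, loss_high_score)
```

Raising the true-class score from 3.2 to 6.5 pushes it more than the margin above every other score, so the loss drops to zero.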

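Q1. Right after weight initialization, when almost all scores s are approximately 0, what is the value of the SVM loss?

A1. Num(class) - 1: each incorrect class contributes max(0, 0 - 0 + 1) = 1 (with a margin of 1). Checking the initial loss against this value is reportedly a common debugging trick.

Q2. Suppose we found a W that makes the loss 0. Is that W the only value that makes the loss 0?

A2. It is not unique: if the scores s in the example above are doubled, working through the calculation shows the loss is still 0.

Q3. Then should we choose W as the weights, or nW (n > 1)?

A3. To settle on a single choice we apply regularization: extending the loss function with a regularization penalty $R(W)$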
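
The three answers can be checked on a toy example. The sketch below assumes a linear model s = Wx, a margin of 1, and an L2 penalty R(W) = sum of squared weights with a hypothetical strength lam; the notes only say "a regularization penalty R(W)", so the specific penalty, the weights, and the input are my own choices.

```python
def scores_for(W, x):
    """Linear scores s = W x; W is a list of rows, one row per class."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def svm_loss(W, x, true_class, margin=1.0):
    s = scores_for(W, x)
    return sum(max(0.0, s_j - s[true_class] + margin)
               for j, s_j in enumerate(s) if j != true_class)

def l2_penalty(W):
    # assumed form of R(W): sum of squared weights
    return sum(w * w for row in W for w in row)

def full_loss(W, x, true_class, lam=0.1):
    return svm_loss(W, x, true_class) + lam * l2_penalty(W)

x = [1.0, 2.0]          # hypothetical input
num_classes = 3

# Q1: all scores are 0 at initialization, so the loss is num_classes - 1
W_init = [[0.0, 0.0] for _ in range(num_classes)]
initial_loss = svm_loss(W_init, x, true_class=0)

# Q2: W reaches zero data loss, and so does 2W
W = [[2.0, 1.0], [0.0, 0.0], [-1.0, 0.0]]
W_double = [[2.0 * w for w in row] for row in W]
data_loss = svm_loss(W, x, true_class=0)
data_loss_double = svm_loss(W_double, x, true_class=0)

# Q3: with the penalty added, the smaller W is preferred
print(initial_loss, data_loss, data_loss_double)
print(full_loss(W, x, 0), full_loss(W_double, x, 0))
```

Both W and 2W have zero data loss, so only the penalty term separates them.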

▶ Probabilistic Setting

Here y_i takes on the meaning of a probability (given by the softmax function).

< SVM vs Softmax >

In common: we compute the same score vector f (e.g. by matrix multiplication in this section).
Difference: the interpretation of the scores in f

- The SVM interprets these as class scores and its loss function encourages the correct class (class 2, in blue) to have a score higher by a margin than the other class scores.

- The Softmax classifier instead interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high (equivalently the negative of it to be low).

Caution: these numbers (loss 1.58 / 1.04) are not comparable; they are only meaningful in relation to losses computed within the same classifier and with the same data.
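
To make the two interpretations concrete, here is a small sketch that feeds one score vector to both losses. The score vector is an assumption on my part: the notes do not list the scores, and I picked values that happen to reproduce the two quoted losses with class index 2 as the correct class. The softmax loss uses the natural log.

```python
import math

def svm_loss(scores, true_class, margin=1.0):
    # hinge loss: wants the true score to beat every other score by the margin
    s_true = scores[true_class]
    return sum(max(0.0, s_j - s_true + margin)
               for j, s_j in enumerate(scores) if j != true_class)

def softmax_cross_entropy(scores, true_class):
    # scores are treated as unnormalized log probabilities
    shift = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    prob_true = exps[true_class] / sum(exps)
    return -math.log(prob_true)              # negative log probability of the true class

scores = [-2.85, 0.86, 0.28]   # assumed example scores
true_class = 2

hinge = svm_loss(scores, true_class)
xent = softmax_cross_entropy(scores, true_class)
print(round(hinge, 2), round(xent, 2))   # 1.58 1.04
```

The two numbers come from different formulas applied to the same scores, which is why comparing them directly says nothing about which classifier is better.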

Optimization

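So far we have found a loss function that measures, for a given set of weights, how far the predictions are from the true values. From here on we look at how to minimize that loss function, i.e. how to find the weights that minimize it by updating them repeatedly. This is called optimization.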

# Vanilla gradient descent
while True:
    weights_grad = evaluate_gradient(loss_function, data, weights)
    weights -= step_size * weights_grad  # step against the gradient

(Here step_size is the learning rate, a hyperparameter.)

If the red region is where the loss is smallest, repeating the step above keeps moving the weights toward its center.
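
A runnable sketch of the same loop on a toy problem. The bowl-shaped loss, the starting point, and the step size are all made up for illustration, evaluate_gradient here takes only the weights, and a fixed number of steps replaces `while True` so that the script terminates.

```python
def loss(w):
    # bowl-shaped loss whose minimum (the "center") is at (1, -2)
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def evaluate_gradient(w):
    # analytic gradient of the loss above
    return [2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)]

weights = [5.0, 5.0]    # arbitrary starting point
step_size = 0.1         # learning rate
history = [loss(weights)]

for _ in range(100):
    weights_grad = evaluate_gradient(weights)
    weights = [w - step_size * g for w, g in zip(weights, weights_grad)]
    history.append(loss(weights))

print(weights, history[0], history[-1])
```

Each update shrinks the distance to the center by the same factor, so the weights end up very close to (1, -2).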

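With gradient descent ('Gradient Descent') as above, we can obtain theta (the final updated weights).
However, because every single update requires looking at all of the data, convergence to a local minimum is very slow. (the process is repeated [number of weight parameters + 1] times)

To compensate for this, we can use 'Stochastic Gradient Descent' instead.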

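▶ The idea of Stochastic Gradient Descent:

Instead of computing the gradient over the entire training set, compute it only on a randomly sampled subset (a minibatch).

- the data is drawn in minibatches (of size 32, 64, 128, 256, ..., 8192)

- with a small minibatch each gradient estimate is fast to compute; as the minibatch grows each estimate gets slower, and memory use grows with it (roughly doubling whenever the batch size doubles)

e.g. in a ConvNet, compute the gradient on a batch of 256 examples and use it to update the parameters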

# Minibatch Gradient Descent
while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights -= step_size * weights_grad  # perform parameter update
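
The loop above can be made runnable on a toy regression problem. Everything below is an assumption for illustration: the synthetic data (y = 3x - 1 plus noise), the model y_hat = w * x + b, the squared-error loss baked into evaluate_gradient (so it takes no loss_fun argument), and the fixed 500 steps in place of `while True`.

```python
import random

random.seed(0)

# hypothetical toy data: y = 3x - 1 plus a little Gaussian noise
data = []
for _ in range(2000):
    x = random.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x - 1.0 + random.gauss(0.0, 0.1)))

def sample_training_data(data, batch_size):
    # draw a random minibatch without replacement
    return random.sample(data, batch_size)

def evaluate_gradient(batch, weights):
    """Gradient of the mean squared error for the model y_hat = w * x + b."""
    w, b = weights
    grad_w = grad_b = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        grad_w += 2.0 * err * x
        grad_b += 2.0 * err
    n = len(batch)
    return [grad_w / n, grad_b / n]

weights = [0.0, 0.0]    # [w, b]
step_size = 0.1

for step in range(500):
    data_batch = sample_training_data(data, 256)           # sample 256 examples
    weights_grad = evaluate_gradient(data_batch, weights)
    weights = [w - step_size * g for w, g in zip(weights, weights_grad)]

print(weights)
```

Each step touches only 256 of the 2000 examples, yet the weights still end up near the slope and intercept used to generate the data.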

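▶ Drawbacks of Stochastic Gradient Descent:

1. Jittering

In a case like the figure above, the gradient is large in the vertical direction and small in the horizontal direction.

The loss therefore changes far more along one direction (vertical) than along the other.

The path toward the minimum of the loss zigzags and is inefficient, so convergence is very slow.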
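
The zigzag can be reproduced on a made-up loss that is steep in one direction and shallow in the other. This sketch uses the exact gradient (no minibatch noise) to isolate the effect of the loss shape; the loss, starting point, and step size are assumptions chosen to make the behaviour visible.

```python
def gradient(w):
    # loss = 0.5 * (w[0]**2 + 50 * w[1]**2):
    # shallow along w[0] (horizontal), steep along w[1] (vertical)
    return [w[0], 50.0 * w[1]]

w = [10.0, 1.0]
step_size = 0.035
path = [list(w)]

for _ in range(20):
    g = gradient(w)
    w = [w[0] - step_size * g[0], w[1] - step_size * g[1]]
    path.append(list(w))

# count how often the steep coordinate jumps to the other side of the valley
sign_flips = sum(1 for a, b in zip(path, path[1:]) if a[1] * b[1] < 0)
print(sign_flips, w)
```

The steep coordinate overshoots and changes sign on every step, while the shallow coordinate has covered only about half of the distance to the minimum after 20 steps.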

2. Local Optimum and Saddle Points

At a local optimum (max or min) or a saddle point the gradient is 0, so SGD stops updating the weights.

Saddle points in particular become more likely as the dimensionality grows.
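
A tiny illustration of getting stuck, using a made-up loss with a saddle point at the origin. The loss is unbounded below, so it only serves to show the zero-gradient behaviour and is not a realistic training objective.

```python
def gradient(w):
    # loss = w[0]**2 - w[1]**2 has a saddle point at (0, 0):
    # a minimum along w[0], a maximum along w[1]
    return [2.0 * w[0], -2.0 * w[1]]

def run(w, steps=50, step_size=0.1):
    for _ in range(steps):
        g = gradient(w)
        w = [w[0] - step_size * g[0], w[1] - step_size * g[1]]
    return w

stuck = run([0.0, 0.0])      # gradient is exactly zero, so the weights never move
escaped = run([0.0, 1e-3])   # a tiny offset along w[1] lets the update move away

print(stuck, escaped)
```

Starting exactly on the saddle, every update is zero; a small perturbation along the descending direction is enough to leave it.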

3. Inaccurate Gradient Estimation

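When the dataset is very large, a gradient estimated from a mini-batch uses only a small part of the data, so the estimate is inevitably inaccurate.

(Even with a minibatch of 4096, if the full dataset has 100 million examples that is only about 0.004% of the data; and the minibatch cannot simply be made larger, because memory is limited.)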
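
Both points can be checked with a few lines: the share of the data that one minibatch covers, and how the error of the gradient estimate depends on the batch size. The per-example gradients below are synthetic random numbers for a single scalar parameter, purely an assumption for illustration.

```python
import random

random.seed(0)

# share of the data seen in one step, in percent
fraction_pct = 4096 / 100_000_000 * 100

# hypothetical per-example gradients for one scalar parameter
per_example_grads = [random.gauss(1.0, 5.0) for _ in range(20000)]
full_grad = sum(per_example_grads) / len(per_example_grads)

def mean_abs_error(batch_size, trials=200):
    """Average |minibatch gradient - full gradient| over random minibatches."""
    total = 0.0
    for _ in range(trials):
        batch = random.sample(per_example_grads, batch_size)
        total += abs(sum(batch) / batch_size - full_grad)
    return total / trials

err_small = mean_abs_error(32)
err_large = mean_abs_error(4096)
print(fraction_pct, err_small, err_large)
```

The larger batch gives a noticeably more accurate estimate, which is the trade-off against memory described above.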

▶ SGD + Momentum

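- so that the update does not stop at a local optimum or saddle point, as in the figure on the left

- continue moving in the same general direction as the previous iterations

- when updating the weights, we do not step along the raw gradient alone; we step along a 'velocity' term that adds in the direction in which the loss had been decreasing.

- the 'rho' term plays the role of 'friction' (0.9 or 0.99) and controls the degree of momentum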
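
A minimal sketch of one common way to write the momentum update, v = rho * v + gradient followed by w = w - step_size * v, compared with plain gradient descent. I am assuming this is the formulation the notes refer to; the toy loss, step size, and step count are made up.

```python
def gradient(w):
    # loss = 0.5 * (w[0]**2 + 50 * w[1]**2): shallow along w[0], steep along w[1]
    return [w[0], 50.0 * w[1]]

def loss(w):
    return 0.5 * (w[0] ** 2 + 50.0 * w[1] ** 2)

def sgd(w, step_size=0.01, steps=100):
    for _ in range(steps):
        g = gradient(w)
        w = [wi - step_size * gi for wi, gi in zip(w, g)]
    return w

def sgd_momentum(w, step_size=0.01, rho=0.9, steps=100):
    v = [0.0 for _ in w]                                   # velocity starts at zero
    for _ in range(steps):
        g = gradient(w)
        v = [rho * vi + gi for vi, gi in zip(v, g)]        # decay old velocity, add gradient
        w = [wi - step_size * vi for wi, vi in zip(w, v)]  # step along the velocity
    return w

start = [1.0, 1.0]
loss_plain = loss(sgd(start))
loss_momentum = loss(sgd_momentum(start))
print(loss_plain, loss_momentum)
```

With the same step size and number of steps, the accumulated velocity carries the weights much further along the shallow direction, so the final loss is clearly lower than with plain updates.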

▶ AdaGrad
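
A minimal sketch of the commonly cited AdaGrad update, assuming the standard form: keep a running sum of squared gradients per parameter and divide each step by its square root. The toy loss, step size, and eps are my own choices for illustration.

```python
import math

def gradient(w):
    # loss = 0.5 * (w[0]**2 + 50 * w[1]**2): shallow along w[0], steep along w[1]
    return [w[0], 50.0 * w[1]]

def adagrad(w, step_size=0.5, eps=1e-7, steps=100):
    grad_squared = [0.0 for _ in w]          # per-parameter sum of squared gradients
    for _ in range(steps):
        g = gradient(w)
        grad_squared = [s + gi * gi for s, gi in zip(grad_squared, g)]
        # each parameter's step is scaled down by its own gradient history
        w = [wi - step_size * gi / (math.sqrt(s) + eps)
             for wi, gi, s in zip(w, g, grad_squared)]
    return w

w_final = adagrad([1.0, 1.0])
print(w_final)
```

Because each coordinate is divided by its own gradient scale, the steep and the shallow direction make progress at almost the same rate on this toy loss.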

▶ Adam
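
A minimal sketch of the standard Adam update, assuming the usual formulation: a momentum-like first moment, an AdaGrad/RMSProp-like second moment, and bias correction for both. The hyperparameter values and the toy loss are illustrative assumptions.

```python
import math

def gradient(w):
    # loss = 0.5 * (w[0]**2 + 50 * w[1]**2)
    return [w[0], 50.0 * w[1]]

def loss(w):
    return 0.5 * (w[0] ** 2 + 50.0 * w[1] ** 2)

def adam(w, step_size=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = [0.0 for _ in w]    # first moment: running average of gradients
    v = [0.0 for _ in w]    # second moment: running average of squared gradients
    for t in range(1, steps + 1):
        g = gradient(w)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # bias correction: both averages start at zero and are too small early on
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        w = [wi - step_size * mh / (math.sqrt(vh) + eps)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w

start = [1.0, 1.0]
w_final = adam(start)
print(loss(start), loss(w_final))
```

The first moment supplies the velocity idea from SGD + Momentum, and the second moment supplies the per-parameter scaling from AdaGrad.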

References

https://www.youtube.com/watch?v=h7iBpEHGVNc
