Decision Tree: How Is scikit-learn's feature_importances_ Computed? + Permutation Importance


Jiwon Kim

|2023. 9. 8. 03:02

they are computed as the mean and standard deviation of accumulation of the impurity decrease within each tree

- scikit-learn page-

In other words, it is a normalized value expressing how effectively a feature reduced the Gini index (impurity) when the tree was split on it.

1. Why do we need to know feature importance?

An AI model has two related properties: interpretability and explainability.

Interpretability in AI: ability to understand the decision-making process of an AI model

This means understanding why the model produced a particular result. In linear regression, for example, the weight assigned to each feature can be used to understand the result. - transparent in its operation and provides information about the relationships between inputs and outputs

Explainability in AI: ability to explain the decision-making process of an AI model in terms understandable to the end user

This means explaining the reasons and grounds for a model's conclusion, and how the model works. In linear regression, for example, explaining why a particular feature received a larger weight falls under explainable AI. - provides a clear and intuitive explanation of the decisions made, enabling users to understand why the model produced a particular result. In other words, explainability focuses on why an algorithm made a specific decision and how that decision can be justified.

We use 'feature importance' as a tool for interpreting ML models.

2. How is feature importance computed?

To make this concrete, assume the following classification setting.

- binary classification problem: predict whether the class is 'valid' or 'invalid'

- three features: response size / latency / total impressions

- the model is trained with DecisionTreeClassifier

- the data before any split has 2,000 samples: 1,000 valid and 1,000 invalid

- the Gini index is used as the impurity measure
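As a quick check of this setup, the Gini index of the root node can be computed directly from the class counts. This is only a minimal sketch: the helper name gini_index is made up for illustration and is not a scikit-learn function.

```python
def gini_index(class_counts):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    if total == 0:
        return 0.0  # an empty node is treated as pure (assumption for this sketch)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# Root node of the example: 1,000 valid and 1,000 invalid samples.
root_impurity = gini_index([1000, 1000])
print(root_impurity)

# A node containing a single class has impurity 0.
pure_impurity = gini_index([10, 0])
print(pure_impurity)
```

A perfectly balanced binary node gives the maximum Gini value for two classes, which is why the root impurity in the worked example is 0.5.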

Computing feature importance then takes two steps.

[ 1. Compute the importance of each node. ]

Node importance =

{ (percentage of samples reaching the node) * (impurity of the node: Gini index)

- (percentage of samples reaching the left child) * (impurity of the left child)

- (percentage of samples reaching the right child) * (impurity of the right child) } / 100
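The formula can be written as a small function. This is a sketch rather than scikit-learn's own code: the function name and argument names are hypothetical, and sample shares are passed as percentages (0 to 100) to match the formula. The call uses the root-node figures from the worked example (100% of samples at impurity 0.5, children at 52.35% / 0.086 and 47.65% / 0).

```python
def node_importance(pct_node, imp_node, pct_left, imp_left, pct_right, imp_right):
    """Weighted impurity decrease of one split; sample shares are in percent."""
    weighted_decrease = (
        pct_node * imp_node
        - pct_left * imp_left
        - pct_right * imp_right
    )
    return weighted_decrease / 100.0


# Root node: 100% of samples, Gini 0.5; left child 52.35% / 0.086; right child 47.65% / 0.
root_importance = node_importance(100, 0.5, 52.35, 0.086, 47.65, 0.0)
print(round(root_importance, 3))  # 0.455
```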

For example, with every sample share written as a percentage:

1st Node importance = (100*0.5 - 52.35*0.086 - 47.65*0) / 100 = 0.455

2nd Node importance = (52.35*0.086 - 48.8*0 - 3.5*0.448) / 100 = 0.0293

3rd Node importance = (3.5*0.448 - 2.4*0.041 - 0) / 100 = 0.0147

4th Node importance = (2.4*0.041 - 0 - 0) / 100 = 0.00098

[ 2. Compute the importance of each feature. ]

Feature Importance for feature K =

( sum of the importances of all nodes split on feature K ) / ( sum of the importances of all nodes )

The four node importances sum to about 0.5, so for example

importance of feature 'Total Impressions' = 0.455 / 0.5 = 0.91

importance of feature 'Total Response Size' = (0.0293 + 0.0147) / 0.5 = 0.088

importance of feature 'Total Latency' = 0.00098 / 0.5 = 0.002

The three values sum to 1.
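The second step is a group-by-feature sum followed by normalization. The sketch below shows only that mechanic. The function name is hypothetical, and the node list is a made-up toy input, not the tree from the example above.

```python
from collections import defaultdict


def feature_importances(node_records):
    """Sum node importances per splitting feature, then normalize to sum to 1.

    node_records: iterable of (feature_name, node_importance) pairs,
    one pair per internal (split) node of the tree.
    """
    totals = defaultdict(float)
    for feature, importance in node_records:
        totals[feature] += importance
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {feature: 0.0 for feature in totals}
    return {feature: value / grand_total for feature, value in totals.items()}


# Toy input (hypothetical): feature_a is used in two splits, feature_b in one.
toy_nodes = [("feature_a", 0.30), ("feature_b", 0.10), ("feature_a", 0.10)]
result = feature_importances(toy_nodes)
for feature, value in result.items():
    print(feature, round(value, 3))
```

A feature that is never chosen for a split does not appear in the input and therefore has importance 0.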

Computing per-feature importance with feature_importances_ in scikit-learn should give the same kind of result, ranked impressions > response size > latency.
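One way to convince yourself that the two-step recipe is what feature_importances_ reports is to recompute it from the fitted tree's internal arrays. The sketch below does this on synthetic data from make_classification, which stands in for the article's dataset (the real data is not available here), so the printed importances will not match the numbers above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 2,000 samples, 3 features, 2 classes.
X, y = make_classification(
    n_samples=2000, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)

tree = clf.tree_
n_root = tree.weighted_n_node_samples[0]
manual = np.zeros(X.shape[1])

for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    if left == -1:  # leaf nodes have no children and contribute nothing
        continue
    # Step 1: weighted impurity decrease of this split, as a fraction of all samples.
    decrease = (
        tree.weighted_n_node_samples[node] * tree.impurity[node]
        - tree.weighted_n_node_samples[left] * tree.impurity[left]
        - tree.weighted_n_node_samples[right] * tree.impurity[right]
    ) / n_root
    # Step 2 (first half): credit the decrease to the feature used for the split.
    manual[tree.feature[node]] += decrease

# Step 2 (second half): normalize so the importances sum to 1.
manual /= manual.sum()

matches = np.allclose(manual, clf.feature_importances_)
print(manual, clf.feature_importances_, matches)
```

Here the sample share is a fraction of the root's sample count rather than a percentage divided by 100, which is the same quantity.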

But can this feature importance actually be used for feature engineering (feature selection)?

For decision trees, and for the random forests and other ensembles built on them, scikit-learn's feature importance is driven, as described above, by the impurity (Gini index) reductions of the splits chosen while building the best tree structure on the training data. As a result, even a feature that has nothing to do with the target can receive a high importance.
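A small experiment illustrates the concern. In the sketch below, all data is synthetic and the column names are made up: one column carries signal, the other is pure random noise generated independently of the target. A fully grown tree still splits on the noise column to fit the training labels, so it receives a nonzero impurity-based importance.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_samples = 1000

signal = rng.normal(size=n_samples)
noise = rng.normal(size=n_samples)  # unrelated to the target by construction

# The target depends on `signal` plus some label noise, never on `noise`.
y = (signal + 0.5 * rng.normal(size=n_samples) > 0).astype(int)
X = np.column_stack([signal, noise])

# No depth limit: the tree keeps splitting until the training data is fit.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

importances = dict(zip(["signal", "noise"], clf.feature_importances_))
print(importances)
```

How large the noise column's share turns out to be depends on the data and on the tree's depth. Limiting max_depth usually shrinks it, but does not remove the underlying issue that the score is computed from training-set splits.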

This is where "Permutation Importance" comes in. The idea is that if shuffling the values of a particular variable in the test data barely changes prediction accuracy, that variable has little influence on the model's predictive power.
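The idea fits in a few lines. The sketch below is a hand-rolled version for illustration only, not scikit-learn's implementation: the function name and its parameters are hypothetical, and the data is synthetic. Each column of the test set is shuffled in turn, and the drop in accuracy relative to the unshuffled baseline is recorded.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=2000, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)


def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean and std of the accuracy drop when each column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)  # accuracy for classifiers
    drops = np.zeros((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for rep in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column only, breaking its link to the target.
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops[col, rep] = baseline - model.score(X_perm, y)
    return drops.mean(axis=1), drops.std(axis=1)


mean_drop, std_drop = manual_permutation_importance(clf, X_test, y_test)
for col, (mean, std) in enumerate(zip(mean_drop, std_drop)):
    print(f"feature {col}: {mean:.3f} +/- {std:.3f}")
```

A mean drop near zero (or slightly negative) suggests the model does not rely on that column for its test-set predictions.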

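In scikit-learn, you can import permutation_importance and measure each feature's importance this way, as the mean difference in accuracy over repeated shuffles.

The larger that mean, the more important the feature is (the more it affects the model's predictive power).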
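A minimal usage sketch of sklearn.inspection.permutation_importance follows. The data is synthetic, and the feature names are only labels borrowed from the example setting, so the ranking printed here says nothing about the article's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labels borrowed from the example; the columns themselves are synthetic.
feature_names = ["response_size", "latency", "total_impressions"]

X, y = make_classification(
    n_samples=2000, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times on the held-out set and record the accuracy drop.
result = permutation_importance(
    clf, X_test, y_test, scoring="accuracy", n_repeats=10, random_state=0
)

for name, mean, std in zip(feature_names, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Evaluating on held-out data rather than the training set is what separates this measure from the impurity-based feature_importances_.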

Sources: https://medium.com/data-science-in-your-pocket/how-feature-importance-is-calculated-in-decision-trees-with-example-699dc13fc078

https://www.xcally.com/news/interpretability-vs-explainability-understanding-the-importance-in-artificial-intelligence/
