6.4 Influential Instances
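- A machine learning model is a product of its training data, so deleting any training instance changes the training result. If deleting a particular training instance changes the model considerably, that instance is called an influential instance. Analyzing influential instances helps us inspect the model.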
- Deletion Diagnostics : delete the instance from the training data, retrain the model on the reduced training dataset and observe the difference in the model parameters or predictions
- Influence functions : upweight a data instance by approximating the parameter changes based on the gradients of the model parameters.
6.4.1 Deletion Diagnostics
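- DFBETA: measures the effect of removing an instance on the model parameters.
  - Works for models with parameters, such as logistic regression or neural networks.
  - Not suitable for decision trees, tree ensembles, or some support vector machines.
- Cook's distance: measures the effect of removing an instance on the model predictions.
  - General-purpose, but most commonly used for linear regression and generalized linear models.
  - One problem is that the MSE is not meaningful for some models (e.g. classification).
- The author uses the numerator of Cook's distance as the influence measure.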
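For reference, the two diagnostics above are usually written as follows. This is a sketch from the standard definitions, with notation that is assumed rather than taken from these notes: $\hat{\beta}^{(-i)}$ and $\hat{y}_j^{(-i)}$ are the coefficients and the prediction for instance $j$ after retraining without instance $i$, $p$ is the number of features, and MSE is the mean squared error of the full model. The last line is the model-agnostic measure built on the numerator, which (to the best of our reading of the text) averages absolute rather than squared differences.

```latex
\begin{align*}
\mathrm{DFBETA}_i &= \hat{\beta} - \hat{\beta}^{(-i)} \\
D_i &= \frac{\sum_{j=1}^{n} \bigl(\hat{y}_j - \hat{y}_j^{(-i)}\bigr)^2}{p \cdot \mathrm{MSE}} \\
\mathrm{Influence}^{(-i)} &= \frac{1}{n} \sum_{j=1}^{n} \bigl|\hat{y}_j - \hat{y}_j^{(-i)}\bigr|
\end{align*}
```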
Deletion diagnostics example
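- When debugging the data and the model, first find the influential instances, because errors in these instances affect the model's predictions.
- Next, ask what makes these instances influential: model the influence of an instance as a function of its feature values. This shows which feature values have a large effect on the model (in the book's example, age >= 35).
- Every instance has to be removed and the model retrained each time, so the computation can take too long.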
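The retrain-per-instance loop described above can be sketched as follows. This is a minimal illustration, not the book's code: it assumes an ordinary least-squares model fitted with NumPy, synthetic data with one planted outlier, and hypothetical helper names (`fit_ols`, `deletion_diagnostics`).

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients; X is assumed to already contain an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def deletion_diagnostics(X, y):
    """Retrain once per instance; return DFBETA rows and mean absolute prediction change."""
    n = X.shape[0]
    beta_full = fit_ols(X, y)
    pred_full = X @ beta_full
    dfbeta = np.empty((n, X.shape[1]))
    influence = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # drop instance i
        beta_i = fit_ols(X[keep], y[keep])       # retrain on the reduced data
        dfbeta[i] = beta_full - beta_i           # change in parameters
        influence[i] = np.mean(np.abs(pred_full - X @ beta_i))  # change in predictions
    return dfbeta, influence

# Synthetic data (an assumption for illustration): y is roughly 2*x, plus one bad point.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(scale=0.1, size=30)
x[0], y[0] = 4.0, -8.0                           # planted erroneous instance
X = np.column_stack([np.ones_like(x), x])

dfbeta, influence = deletion_diagnostics(X, y)
most_influential = int(np.argmax(influence))
print("most influential instance:", most_influential)
```

The loop makes the cost visible: n instances mean n extra model fits, which is what makes deletion diagnostics slow for large datasets or expensive models.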
6.4.2 Influence Functions
Increase the weight of a given instance and observe how the loss changes.
Instead of deleting training instances, the method approximates how much the model changes when the instance is upweighted in the empirical risk (sum of the loss over the training data).
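The upweighting idea can be written down compactly. The following is a sketch in the usual influence-function notation, which is assumed here rather than taken from these notes: $z$ is the upweighted training instance, $\epsilon$ the small extra weight, $L$ the loss, $\hat{\theta}$ the fitted parameters and $H_{\hat{\theta}}$ the Hessian of the empirical risk. Some presentations additionally scale the empirical risk by $(1-\epsilon)$.

```latex
\begin{align*}
\hat{\theta}_{\epsilon,z} &= \arg\min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n} L(z_i,\theta) + \epsilon\, L(z,\theta) \\
I_{\text{up,params}}(z) &= \left.\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0}
  = -H_{\hat{\theta}}^{-1}\,\nabla_{\theta} L(z,\hat{\theta}),
  \qquad H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n}\nabla^2_{\theta} L(z_i,\hat{\theta}) \\
I_{\text{up,loss}}(z, z_{\text{test}}) &= -\nabla_{\theta} L(z_{\text{test}},\hat{\theta})^{\top} H_{\hat{\theta}}^{-1}\,\nabla_{\theta} L(z,\hat{\theta})
\end{align*}
```

Removing $z$ corresponds to $\epsilon = -\tfrac{1}{n}$, so the parameter change after deletion is approximately $-\tfrac{1}{n} I_{\text{up,params}}(z)$, with no retraining needed.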
This only works for the subset of models that expose the loss gradient with respect to the model parameters.
Sadly, the models we use most often do not qualify QQ:
The method of influence functions requires access to the loss gradient with respect to the model parameters, which only works for a subset of machine learning models. Logistic regression, neural networks and support vector machines qualify, tree-based methods like random forests do not.
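Rather than looking at the change in loss directly, it is enough to look at the change in the parameters.
The formula then yields several variants of the result.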
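To make the parameter-change view concrete, here is a minimal NumPy sketch for L2-regularized logistic regression. Everything in it is an assumption for illustration: the synthetic data, the regularization strength `lam`, the Newton solver, and the helper names (`fit_logreg`, `hessian`). It approximates the effect of removing one instance as `-I_up_params / n` and compares that with actually refitting without the instance.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def hessian(X, theta, w, lam):
    """Hessian of the weighted mean log-loss plus (lam/2)*||theta||^2."""
    p = sigmoid(X @ theta)
    return (X.T * (w * p * (1 - p))) @ X / len(X) + lam * np.eye(X.shape[1])

def fit_logreg(X, y, w=None, lam=0.1, iters=50):
    """Newton's method; w are per-instance weights (1 = keep, 0 = remove)."""
    n, d = X.shape
    w = np.ones(n) if w is None else w
    theta = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (w * (sigmoid(X @ theta) - y)) / n + lam * theta
        theta = theta - np.linalg.solve(hessian(X, theta, w, lam), grad)
    return theta

# Synthetic binary classification data (illustrative only).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = (rng.random(n) < sigmoid(X @ np.array([0.0, 1.5, -1.0]))).astype(float)

theta = fit_logreg(X, y)
H = hessian(X, theta, np.ones(n), 0.1)
grads = X * (sigmoid(X @ theta) - y)[:, None]   # one loss gradient per instance
I_up_params = -np.linalg.solve(H, grads.T).T    # row i: -H^{-1} grad_i

# Removing instance i corresponds to upweighting it by eps = -1/n.
approx_change = -I_up_params / n

# Check against real retraining for the instance predicted to matter most.
i = int(np.argmax(np.linalg.norm(approx_change, axis=1)))
w = np.ones(n)
w[i] = 0.0
actual_change = fit_logreg(X, y, w=w) - theta
print("approximate:", approx_change[i])
print("actual:     ", actual_change)
```

The approximation needs one Hessian solve for all instances at once, whereas the exact answer needs one refit per instance; the price is a first-order approximation error.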
Applications
- Understanding model behavior
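- Handling domain mismatches / Debugging model errors (differences between the train and test distributions)
- Fixing training data: correct the highly influential erroneous instances that hurt model training
Advantages (see the bold text in the article)
Looking at influential instances emphasizes the role of training data in the learning process. This makes influence functions and deletion diagnostics one of the best debugging tools for machine learning models.
Disadvantages
The text mentions that for some models, the influence function method becomes applicable once the loss function is adjusted.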