How to get feature importance in xgboost?
Problem Overview
I'm using xgboost to build a model and am trying to find the importance of each feature using get_fscore(), but it returns {}. My training code is:
import xgboost as xgb

dtrain = xgb.DMatrix(X, label=Y)
watchlist = [(dtrain, 'train')]
param = {'max_depth': 6, 'learning_rate': 0.03}
num_round = 200
bst = xgb.train(param, dtrain, num_round, watchlist)
So is there a mistake in my training code? How do I get feature importance in xgboost?
Python Solutions
Solution 1 - Python
In your code you can get the importance of each feature in dict form:
bst.get_score(importance_type='gain')
>>{'ftr_col1': 77.21064539577829,
'ftr_col2': 10.28690566363971,
'ftr_col3': 24.225014841466294,
'ftr_col4': 11.234086283060112}
Explanation: get_score() is a method of the Booster object returned by train(), and is defined as:
get_score(fmap='', importance_type='weight')
- fmap (str (optional)) – The name of feature map file.
- importance_type
- ‘weight’ - the number of times a feature is used to split the data across all trees.
- ‘gain’ - the average gain across all splits the feature is used in.
- ‘cover’ - the average coverage across all splits the feature is used in.
- ‘total_gain’ - the total gain across all splits the feature is used in.
- ‘total_cover’ - the total coverage across all splits the feature is used in.
https://xgboost.readthedocs.io/en/latest/python/python_api.html
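For example, to compare the different importance types side by side on the booster from the question (a minimal sketch, assuming bst has been trained as above):

for imp_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    # each call returns a {feature_name: score} dict for the trained booster
    print(imp_type, bst.get_score(importance_type=imp_type))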
Solution 2 - Python
Get the table containing scores and feature names, and then plot it.
feature_important = model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())
data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by = "score", ascending=False)
data.nlargest(40, columns="score").plot(kind='barh', figsize = (20,10)) ## plot top 40 features
Solution 3 - Python
Using the sklearn API and XGBoost >= 0.81:
clf.get_booster().get_score(importance_type="gain")
or
regr.get_booster().get_score(importance_type="gain")
For this to work correctly, when you call regr.fit (or clf.fit), X must be a pandas.DataFrame.
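A minimal end-to-end sketch of that requirement (the column names and toy data here are made up for illustration):

import pandas as pd
from xgboost import XGBRegressor

# X as a DataFrame so the booster records the real column names
X = pd.DataFrame({'age': [21, 35, 52, 44, 63, 29],
                  'income': [30, 58, 81, 62, 74, 41]})
y = [0.5, 1.2, 2.3, 1.8, 2.1, 0.9]

regr = XGBRegressor(n_estimators=10)
regr.fit(X, y)

# keys are the column names instead of f0, f1, ...
print(regr.get_booster().get_score(importance_type='gain'))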
Solution 4 - Python
Build the model from XGBoost first:
from xgboost import XGBClassifier, plot_importance
model = XGBClassifier()
model.fit(train, label)
model.feature_importances_ returns an array, which we can sort in descending order:

import numpy as np

sorted_idx = np.argsort(model.feature_importances_)[::-1]
Then we can print the sorted importances together with the column names (assuming the data was loaded with pandas):

for index in sorted_idx:
    print([train.columns[index], model.feature_importances_[index]])
Furthermore, we can plot the importances with XGBoost's built-in function:

from matplotlib import pyplot

plot_importance(model, max_num_features=15)
pyplot.show()
Use max_num_features in plot_importance to limit the number of features if you want.
Solution 5 - Python
For feature importance, try this:
Classification:
pd.DataFrame(list(bst.get_fscore().items()), columns=['feature', 'importance']).sort_values('importance', ascending=False)
Regression:
xgb.plot_importance(bst)
Solution 6 - Python
For anyone who comes across this issue while using xgb.XGBRegressor(), the workaround I'm using is to keep the data in a pandas.DataFrame() or numpy.array() and not to convert it to a dmatrix(). Also, I had to make sure the gamma parameter is not specified for the XGBRegressor.
fit = alg.fit(dtrain[ft_cols].values, dtrain['y'].values)
ft_weights = pd.DataFrame(fit.feature_importances_, columns=['weights'], index=ft_cols)
After fitting the regressor, fit.feature_importances_ returns an array of weights which I'm assuming is in the same order as the feature columns of the pandas dataframe.
My current setup is Ubuntu 16.04, the Anaconda distro, Python 3.6, xgboost 0.6, and scikit-learn 0.18.1.
Solution 7 - Python
I don't know how to get the values directly, but there is a good way to plot feature importance:

import xgboost as xgb
import matplotlib.pyplot as plt

model = xgb.train(params, d_train, 1000, watchlist)
fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()
Solution 8 - Python
Try this (note that in newer XGBoost versions booster() has been renamed get_booster()):

fscore = clf.best_estimator_.get_booster().get_fscore()
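For context, clf here is presumably a fitted search object such as sklearn's GridSearchCV. A minimal sketch, assuming X_train and y_train already exist:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# hypothetical grid search; X_train and y_train are assumed to exist
clf = GridSearchCV(XGBClassifier(), {'max_depth': [3, 6]}, cv=3)
clf.fit(X_train, y_train)

fscore = clf.best_estimator_.get_booster().get_fscore()
print(fscore)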
Solution 9 - Python
According to this post, there are 3 different ways to get feature importance from Xgboost:
- use built-in feature importance,
- use permutation based importance,
- use shap based importance.
Built-in feature importance
Code example:

from xgboost import XGBRegressor
import matplotlib.pyplot as plt

# assumes the Boston housing data split into X_train / y_train
xgb = XGBRegressor(n_estimators=100)
xgb.fit(X_train, y_train)
sorted_idx = xgb.feature_importances_.argsort()
plt.barh(boston.feature_names[sorted_idx], xgb.feature_importances_[sorted_idx])
plt.xlabel("Xgboost Feature Importance")
Please be aware of what type of feature importance you are using. There are several types of importance; see the docs. The scikit-learn-like API of Xgboost returns gain importance while get_fscore returns the weight type.
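To see the difference concretely, a quick sketch, assuming xgb is the fitted XGBRegressor from the snippet above:

booster = xgb.get_booster()
# feature_importances_ on the sklearn wrapper corresponds to gain for tree boosters
print(booster.get_score(importance_type='gain'))
# get_fscore() is equivalent to get_score(importance_type='weight')
print(booster.get_fscore())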
Permutation based importance
from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(xgb, X_test, y_test)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(boston.feature_names[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
This is my preferred way to compute the importance. However, it can fail in case of highly collinear features, so be careful! It's using permutation_importance from scikit-learn.
SHAP based importance
import shap

explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")
To use the above code, you need to have the shap package installed.
I was running the example analysis on the Boston data (house price regression from scikit-learn). Below are the three feature importance plots: built-in importance, permutation-based importance, and SHAP importance.
All plots are for the same model! As you see, there is a difference in the results. I prefer permutation-based importance because I have a clear picture of which feature impacts the performance of the model (if there is no high collinearity).
Solution 10 - Python
In case you are using XGBRegressor, try with: model.get_booster().get_score(). That returns the results that you can directly visualize through the plot_importance command.
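Putting the two together (a sketch, assuming model is a fitted XGBRegressor):

import matplotlib.pyplot as plt
from xgboost import plot_importance

# raw scores as a dict (defaults to importance_type='weight')
print(model.get_booster().get_score())

# the same scores rendered as a bar chart
plot_importance(model)
plt.show()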
Solution 11 - Python
None of the above worked for me; this is the code I ended up with to sort features by importance.
from collections import Counter

Counter({k: v for k, v in sorted(model.get_fscore().items(), key=lambda item: item[1], reverse=True)}).most_common()
Just replace model with the name of your model and everything will be there. Of course, I'm doing the same thing twice here; there's no need to order a dict before passing it to Counter, but I figure it wouldn't hurt to leave it in case anyone hates Counters.
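If you'd rather avoid Counter altogether, a plain sorted() call gives the same ordering as a list of (feature, score) pairs:

# same result without Counter: (feature, score) pairs, highest score first
sorted(model.get_fscore().items(), key=lambda item: item[1], reverse=True)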