How can I implement incremental training for xgboost?

PythonMachine LearningXgboost

Python Problem Overview


The problem is that my train data could not be placed into RAM due to train data size. So I need a method which first builds one tree on whole train data set, calculate residuals build another tree and so on (like gradient boosted tree do). Obviously if I call model = xgb.train(param, batch_dtrain, 2) in some loop - it will not help, because in such case it just rebuilds whole model for each batch.

Python Solutions


Solution 1 - Python

Try saving your model after you train on the first batch. Then, on successive runs, provide the xgb.train method with the filepath of the saved model.

Here's a small experiment that I ran to convince myself that it works:

First, split the boston dataset into training and testing sets. Then split the training set into halves. Fit a model with the first half and get a score that will serve as a benchmark. Then fit two models with the second half; one model will have the additional parameter xgb_model. If passing in the extra parameter didn't make a difference, then we would expect their scores to be similar.. But, fortunately, the new model seems to perform much better than the first.

import xgboost as xgb
from sklearn.cross_validation import train_test_split as ttsplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

X = load_boston()['data']
y = load_boston()['target']

# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train, 
                                                     y_train, 
                                                     test_size=0.5,
                                                     random_state=0)

xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:linear', 'verbose': False}
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')

# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')

print(mse(model_1.predict(xg_test), y_test))     # benchmark
print(mse(model_2_v1.predict(xg_test), y_test))  # "before"
print(mse(model_2_v2.predict(xg_test), y_test))  # "after"

# 23.0475232194
# 39.6776876084
# 27.2053239482

reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py

Solution 2 - Python

There is now (version 0.6?) a process_update parameter that might help. Here's an experiment with it:

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

boston = load_boston()
features = boston.feature_names
X = boston.data
y = boston.target

X=pd.DataFrame(X,columns=features)
y = pd.Series(y,index=X.index)

# split data into training and testing sets
rs = ShuffleSplit(test_size=0.3, n_splits=1, random_state=0)
for train_idx,test_idx in rs.split(X):  # this looks silly
    pass

train_split = round(len(train_idx) / 2)
train1_idx = train_idx[:train_split]
train2_idx = train_idx[train_split:]
X_train = X.loc[train_idx]
X_train_1 = X.loc[train1_idx]
X_train_2 = X.loc[train2_idx]
X_test = X.loc[test_idx]
y_train = y.loc[train_idx]
y_train_1 = y.loc[train1_idx]
y_train_2 = y.loc[train2_idx]
y_test = y.loc[test_idx]

xg_train_0 = xgb.DMatrix(X_train, label=y_train)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:linear', 'verbose': False}
model_0 = xgb.train(params, xg_train_0, 30)
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model=model_1)

params.update({'process_type': 'update',
               'updater'     : 'refresh',
               'refresh_leaf': True})
model_2_v2_update = xgb.train(params, xg_train_2, 30, xgb_model=model_1)

print('full train\t',mse(model_0.predict(xg_test), y_test)) # benchmark
print('model 1 \t',mse(model_1.predict(xg_test), y_test))  
print('model 2 \t',mse(model_2_v1.predict(xg_test), y_test))  # "before"
print('model 1+2\t',mse(model_2_v2.predict(xg_test), y_test))  # "after"
print('model 1+update2\t',mse(model_2_v2_update.predict(xg_test), y_test))  # "after"

Output:

full train	 17.8364309709
model 1 	 24.2542132108
model 2 	 25.6967017352
model 1+2	 22.8846455135
model 1+update2	 14.2816257268

Solution 3 - Python

I created a gist of jupyter notebook to demonstrate that xgboost model can be trained incrementally. I used boston dataset to train the model. I did 3 experiments - one shot learning, iterative one shot learning, iterative incremental learning. In incremental training, I passed the boston data to the model in batches of size 50.

The gist of the gist is that you'll have to iterate over the data multiple times for the model to converge to the accuracy attained by one shot (all data) learning.

Here is the corresponding code for doing iterative incremental learning with xgboost.

batch_size = 50
iterations = 25
model = None
for i in range(iterations):
    for start in range(0, len(x_tr), batch_size):
        model = xgb.train({
            'learning_rate': 0.007,
            'update':'refresh',
            'process_type': 'update',
            'refresh_leaf': True,
            #'reg_lambda': 3,  # L2
            'reg_alpha': 3,  # L1
            'silent': False,
        }, dtrain=xgb.DMatrix(x_tr[start:start+batch_size], y_tr[start:start+batch_size]), xgb_model=model)

        y_pr = model.predict(xgb.DMatrix(x_te))
        #print('    MSE itr@{}: {}'.format(int(start/batch_size), sklearn.metrics.mean_squared_error(y_te, y_pr)))
    print('MSE itr@{}: {}'.format(i, sklearn.metrics.mean_squared_error(y_te, y_pr)))

y_pr = model.predict(xgb.DMatrix(x_te))
print('MSE at the end: {}'.format(sklearn.metrics.mean_squared_error(y_te, y_pr)))

XGBoost version: 0.6

Solution 4 - Python

looks like you don't need anything other than call your xgb.train(....) again but provide the model result from the previous batch:

# python
params = {} # your params here
ith_batch = 0
n_batches = 100
model = None
while ith_batch < n_batches:
    d_train = getBatchData(ith_batch)
    model = xgb.train(params, d_train, xgb_model=model)
    ith_batch += 1

this is based on https://xgboost.readthedocs.io/en/latest/python/python_api.html enter image description here

Solution 5 - Python

If your problem is regarding the dataset size and you do not really need Incremental Learning (you are not dealing with an Streaming app, for instance), then you should check out Spark or Flink.

This two frameworks can train on very large datasets with a small RAM, leveraging disk memory. Both framework deal with memory issues internally. While Flink had it solved first, Spark has caught up in recent releases.

Take a look at:

Solution 6 - Python

To paulperry's code, If change one line from "train_split = round(len(train_idx) / 2)" to "train_split = len(train_idx) - 50". model 1+update2 will changed from 14.2816257268 to 45.60806270012028. And a lot of "leaf=0" result in dump file.

Updated model is not good when update sample set is relative small. For binary:logistic, updated model is unusable when update sample set has only one class.

Solution 7 - Python

One possible solution that I have not tested is to used a dask dataframe which should act the same as a pandas dataframe but (I assume) utilize disk and reads in and out of RAM. here are some helpful links. this link mentions how to use it with xgboost also see also see. further there is an experimental options from XGBoost as well here but it is "not ready for production"

Solution 8 - Python

Hey guys you can use my simple code for incremental model training with xgb base class :

    batch_size = 10000000


    X_train="your pandas training DataFrame" 
    y_train="Your lables"
    
    #Store eval results
    evals_result={}
    Deval = xgb.DMatrix(X_valid, y_valid)
    eval_sets = [(Dtrain, 'train'), (Deval, 'eval')]
    for start in range(0, n, batch_size):
           model = xgb.train({'refresh_leaf': True, 
                         'process_type': 'default', 
                         'max_depth': 5, 
                         'objective': 'reg:squarederror', 
                         'num_parallel_tree': 2,
                        'learning_rate':0.05,
                        'n_jobs':-1},
                        dtrain=xgb.DMatrix(X_train, y_train), evals=eval_sets, early_stopping_rounds=5,num_boost_round=100,evals_result=evals_result,xgb_model=model) 

Solution 9 - Python

I agree with @desertnaut in his solution.

I have a dataset where I split it into 4 batches. I have to do an initial fit without the xgb_model parameter first, then the next fits will have the xgb_model parameter, like in this (I'm using the Sklearn API):

for i, (X_batch, y_batch) in enumerate(zip(self.X_train_batched, self.y_train_batched)):
    print(f'Step: {i}',end = ' ')
    if i == 0:
        model_xgbc.fit(X_batch, y_batch, eval_set=[(self.X_valid, self.y_valid)],
                        verbose=False, eval_metric = ['logloss'],
                        early_stopping_rounds = 400)
    else:
        model_xgbc.fit(X_batch, y_batch, eval_set=[(self.X_valid, self.y_valid)],
                        verbose=False, eval_metric = ['logloss'],
                        early_stopping_rounds = 400, xgb_model=model_xgbc)
            
    preds = model_xgbc.predict(self.X_valid)
    
    rmse = metrics.mean_squared_error(self.y_valid, preds,squared=False)

Solution 10 - Python

It's not based on xgboost, but there is a C++ incremental decision tree.
see gaenari.

Continuous chunking data can be inserted and updated, and rebuilds can be run if concept drift reduces accuracy.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionMarat ZakirovView Question on Stackoverflow
Solution 1 - PythonAlainView Answer on Stackoverflow
Solution 2 - PythonpaulperryView Answer on Stackoverflow
Solution 3 - PythonShubham ChaudharyView Answer on Stackoverflow
Solution 4 - PythonMobigitalView Answer on Stackoverflow
Solution 5 - PythonAlberto Castelo BecerraView Answer on Stackoverflow
Solution 6 - PythonTao ChengView Answer on Stackoverflow
Solution 7 - PythonPhillip MaireView Answer on Stackoverflow
Solution 8 - PythonPankaj KalalView Answer on Stackoverflow
Solution 9 - PythonJ RView Answer on Stackoverflow
Solution 10 - PythongreenfishView Answer on Stackoverflow