How can I implement incremental training for xgboost?

Python Problem Overview

The problem is that my training data does not fit into RAM because of its size. So I need a method that first builds one tree on the whole training set, calculates the residuals, builds another tree, and so on (as gradient boosted trees do). Obviously, calling model = xgb.train(param, batch_dtrain, 2) in a loop will not help, because in that case it just rebuilds the whole model for each batch.

Python Solutions

Solution 1 - Python

Try saving your model after you train on the first batch. Then, on successive runs, provide the xgb.train method with the filepath of the saved model.

Here's a small experiment that I ran to convince myself that it works:

First, split the Boston dataset into training and testing sets, then split the training set into halves. Fit a model on the first half and record its score as a benchmark. Then fit two models on the second half; one of them gets the additional parameter xgb_model. If the extra parameter made no difference, the two scores would be similar. Fortunately, the model that continues from the first one seems to perform much better than the one trained on the second half alone.

import xgboost as xgb
from sklearn.model_selection import train_test_split as ttsplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

X = load_boston()['data']
y = load_boston()['target']

# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train, 
                                                     y_train, 
                                                     test_size=0.5,
                                                     random_state=0)

xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:linear', 'verbose': False}
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')

# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')

print(mse(model_1.predict(xg_test), y_test))     # benchmark
print(mse(model_2_v1.predict(xg_test), y_test))  # "before"
print(mse(model_2_v2.predict(xg_test), y_test))  # "after"

# 23.0475232194
# 39.6776876084
# 27.2053239482

reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py

Solution 2 - Python

There is now (version 0.6?) a process_type parameter, which can be set to 'update', that might help. Here's an experiment with it:

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

boston = load_boston()
features = boston.feature_names
X = boston.data
y = boston.target

X=pd.DataFrame(X,columns=features)
y = pd.Series(y,index=X.index)

# split data into training and testing sets
rs = ShuffleSplit(test_size=0.3, n_splits=1, random_state=0)
train_idx, test_idx = next(rs.split(X))  # n_splits=1, so take the only split

train_split = round(len(train_idx) / 2)
train1_idx = train_idx[:train_split]
train2_idx = train_idx[train_split:]
X_train = X.loc[train_idx]
X_train_1 = X.loc[train1_idx]
X_train_2 = X.loc[train2_idx]
X_test = X.loc[test_idx]
y_train = y.loc[train_idx]
y_train_1 = y.loc[train1_idx]
y_train_2 = y.loc[train2_idx]
y_test = y.loc[test_idx]

xg_train_0 = xgb.DMatrix(X_train, label=y_train)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:linear', 'verbose': False}
model_0 = xgb.train(params, xg_train_0, 30)
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model=model_1)

params.update({'process_type': 'update',
               'updater'     : 'refresh',
               'refresh_leaf': True})
model_2_v2_update = xgb.train(params, xg_train_2, 30, xgb_model=model_1)

print('full train\t',mse(model_0.predict(xg_test), y_test)) # benchmark
print('model 1 \t',mse(model_1.predict(xg_test), y_test))  
print('model 2 \t',mse(model_2_v1.predict(xg_test), y_test))  # "before"
print('model 1+2\t',mse(model_2_v2.predict(xg_test), y_test))  # "after"
print('model 1+update2\t',mse(model_2_v2_update.predict(xg_test), y_test))  # "after"

Output:

full train	 17.8364309709
model 1 	 24.2542132108
model 2 	 25.6967017352
model 1+2	 22.8846455135
model 1+update2	 14.2816257268

Solution 3 - Python

I created a gist of a Jupyter notebook to demonstrate that an xgboost model can be trained incrementally. I used the Boston dataset to train the model and ran 3 experiments: one-shot learning, iterative one-shot learning, and iterative incremental learning. In incremental training, I passed the Boston data to the model in batches of size 50.

The gist of the gist is that you'll have to iterate over the data multiple times for the model to converge to the accuracy attained by one-shot (all data) learning.

Here is the corresponding code for doing iterative incremental learning with xgboost.

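import sklearn.metrics
import xgboost as xgb

# x_tr, y_tr, x_te, y_te: training and test features/labels, loaded beforehand
batch_size = 50
iterations = 25
model = None
# NOTE: on the very first call model is None, so there are no trees to refresh;
# newer XGBoost releases may need the first batch trained with the default process_type
for i in range(iterations):
    for start in range(0, len(x_tr), batch_size):
        # pass the previous model back in so that each batch updates it
        model = xgb.train({
            'learning_rate': 0.007,
            'updater': 'refresh',
            'process_type': 'update',
            'refresh_leaf': True,
            #'reg_lambda': 3,  # L2
            'reg_alpha': 3,  # L1
            'silent': False,
        }, dtrain=xgb.DMatrix(x_tr[start:start+batch_size], y_tr[start:start+batch_size]), xgb_model=model)

        y_pr = model.predict(xgb.DMatrix(x_te))
        #print('    MSE itr@{}: {}'.format(int(start/batch_size), sklearn.metrics.mean_squared_error(y_te, y_pr)))
    # report the test error once per full pass over the data
    print('MSE itr@{}: {}'.format(i, sklearn.metrics.mean_squared_error(y_te, y_pr)))

y_pr = model.predict(xgb.DMatrix(x_te))
print('MSE at the end: {}'.format(sklearn.metrics.mean_squared_error(y_te, y_pr)))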

XGBoost version: 0.6

Solution 4 - Python

It looks like you don't need anything other than calling xgb.train(....) again and passing in the model that resulted from the previous batch:

# python
params = {} # your params here
ith_batch = 0
n_batches = 100
model = None
while ith_batch < n_batches:
    d_train = getBatchData(ith_batch)
    model = xgb.train(params, d_train, xgb_model=model)
    ith_batch += 1

This is based on https://xgboost.readthedocs.io/en/latest/python/python_api.html
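
The getBatchData function is not defined in this answer. One hypothetical way to produce batches without loading the whole file is to read a CSV in fixed-size chunks. The sketch below uses only the standard library; the helper name iter_csv_batches, the header row, the all-numeric columns, and the label sitting in the last column are all assumptions for illustration.

```python
import csv
import io
from itertools import islice


def iter_csv_batches(handle, batch_size, label_column=-1):
    """Yield (features, labels) for successive chunks of a numeric CSV file.

    Hypothetical helper: assumes a header row and purely numeric columns,
    with the label stored in `label_column`.
    """
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row if there is one
    while True:
        # Read at most batch_size rows; earlier rows are not kept in memory
        rows = [[float(v) for v in row] for row in islice(reader, batch_size)]
        if not rows:
            return
        # Remove the label from each row, leaving only the features
        labels = [row.pop(label_column) for row in rows]
        yield rows, labels


# Small in-memory file standing in for a CSV that is too large for RAM
demo = io.StringIO("x1,x2,y\n1,2,3\n4,5,9\n7,8,15\n")
batches = list(iter_csv_batches(demo, batch_size=2))
print(batches)
```

Each (features, labels) pair could then be converted to a NumPy array, wrapped in xgb.DMatrix(features, label=labels), and passed to xgb.train together with xgb_model=model, as in the loop above.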

Solution 5 - Python

If your problem is the dataset size and you do not really need incremental learning (you are not dealing with a streaming app, for instance), then you should check out Spark or Flink.

These two frameworks can train on very large datasets with little RAM by leveraging disk storage. Both frameworks deal with memory issues internally. While Flink had it solved first, Spark has caught up in recent releases.

Take a look at:

"XGBoost4J: Portable Distributed XGBoost in Spark, Flink and Dataflow": http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html
Spark Integration: http://dmlc.ml/2016/10/26/a-full-integration-of-xgboost-and-spark.html

Solution 6 - Python

Regarding paulperry's code: if you change one line from "train_split = round(len(train_idx) / 2)" to "train_split = len(train_idx) - 50", the "model 1+update2" score changes from 14.2816257268 to 45.60806270012028, and many "leaf=0" entries appear in the dump file.

The updated model is not good when the update sample set is relatively small. For binary:logistic, the updated model is unusable when the update sample set contains only one class.
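
One way to act on this is to check each batch before using it for an update. The following is a small sketch of such a guard; the function name and the thresholds min_rows and n_classes are illustrative assumptions, not XGBoost settings, and sensible values depend on your data.

```python
def batch_is_safe_to_refresh(labels, min_rows=1000, n_classes=2):
    """Return True if a batch looks large and varied enough for an update.

    `min_rows` and `n_classes` are illustrative thresholds only.
    """
    enough_rows = len(labels) >= min_rows
    enough_classes = len(set(labels)) >= n_classes
    return enough_rows and enough_classes


print(batch_is_safe_to_refresh([0, 1] * 600))  # large, both classes present
print(batch_is_safe_to_refresh([1] * 5000))    # large, but only one class
print(batch_is_safe_to_refresh([0, 1] * 10))   # both classes, but very small
```

A batch that fails the check could be held back and merged with the next one instead of being passed to the update step on its own.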

Solution 7 - Python

One possible solution that I have not tested is to use a Dask dataframe, which should act the same as a pandas dataframe but (I assume) utilizes disk and reads data in and out of RAM as needed. Dask can also be used with XGBoost. Further, XGBoost itself has an experimental option as well, but it is "not ready for production".

Solution 8 - Python

You can use my simple code for incremental model training with the xgb base class:

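    import xgboost as xgb

    batch_size = 10000000

    # X_train, y_train: your pandas training DataFrame and labels (a pandas Series is assumed)
    # X_valid, y_valid: a held-out validation set
    n = len(X_train)
    model = None  # no previous model yet, so the first batch trains from scratch

    # Store eval results
    evals_result = {}
    Deval = xgb.DMatrix(X_valid, y_valid)
    params = {'refresh_leaf': True,
              'process_type': 'default',
              'max_depth': 5,
              'objective': 'reg:squarederror',
              'num_parallel_tree': 2,
              'learning_rate': 0.05,
              'n_jobs': -1}
    for start in range(0, n, batch_size):
        # Train on the current batch only, continuing from the previous model
        Dtrain = xgb.DMatrix(X_train.iloc[start:start + batch_size],
                             y_train.iloc[start:start + batch_size])
        eval_sets = [(Dtrain, 'train'), (Deval, 'eval')]
        model = xgb.train(params, dtrain=Dtrain, evals=eval_sets,
                          early_stopping_rounds=5, num_boost_round=100,
                          evals_result=evals_result, xgb_model=model)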

Solution 9 - Python

I agree with @desertnaut in his solution.

I have a dataset that I split into 4 batches. I have to do an initial fit without the xgb_model parameter first; the subsequent fits then pass the xgb_model parameter, like this (I'm using the Sklearn API):

for i, (X_batch, y_batch) in enumerate(zip(self.X_train_batched, self.y_train_batched)):
    print(f'Step: {i}',end = ' ')
    if i == 0:
        model_xgbc.fit(X_batch, y_batch, eval_set=[(self.X_valid, self.y_valid)],
                        verbose=False, eval_metric = ['logloss'],
                        early_stopping_rounds = 400)
    else:
        model_xgbc.fit(X_batch, y_batch, eval_set=[(self.X_valid, self.y_valid)],
                        verbose=False, eval_metric = ['logloss'],
                        early_stopping_rounds = 400, xgb_model=model_xgbc)
            
    preds = model_xgbc.predict(self.X_valid)
    
    rmse = metrics.mean_squared_error(self.y_valid, preds,squared=False)

Solution 10 - Python

It's not based on xgboost, but there is a C++ incremental decision tree; see gaenari.

Data can be inserted and updated continuously in chunks, and a rebuild can be run if concept drift reduces accuracy.

Content Type	Original Author	Original Content on Stackoverflow
Question	Marat Zakirov	View Question on Stackoverflow
Solution 1 - Python	Alain	View Answer on Stackoverflow
Solution 2 - Python	paulperry	View Answer on Stackoverflow
Solution 3 - Python	Shubham Chaudhary	View Answer on Stackoverflow
Solution 4 - Python	Mobigital	View Answer on Stackoverflow
Solution 5 - Python	Alberto Castelo Becerra	View Answer on Stackoverflow
Solution 6 - Python	Tao Cheng	View Answer on Stackoverflow
Solution 7 - Python	Phillip Maire	View Answer on Stackoverflow
Solution 8 - Python	Pankaj Kalal	View Answer on Stackoverflow
Solution 9 - Python	J R	View Answer on Stackoverflow
Solution 10 - Python	greenfish	View Answer on Stackoverflow