Does the SVM in sklearn support incremental (online) learning?

Python · Machine Learning · Scikit Learn · Svm

Python Problem Overview


I am currently in the process of designing a recommender system for text articles (a binary case of 'interesting' or 'not interesting'). One of my specifications is that it should continuously update to changing trends.

From what I can tell, the best way to do this is to make use of a machine learning algorithm that supports incremental/online learning.

Algorithms like the Perceptron and Winnow support online learning but I am not completely certain about Support Vector Machines. Does the scikit-learn python library support online learning and if so, is a support vector machine one of the algorithms that can make use of it?

I am obviously not completely tied down to using support vector machines, but they are usually the go-to algorithm for binary classification due to their all-around performance. I would be willing to change to whatever fits best in the end.

Python Solutions


Solution 1 - Python

While online algorithms for SVMs do exist, it has become important to specify whether you want a kernel or a linear SVM, as many efficient algorithms have been developed for the special case of linear SVMs.

For the linear case, if you use the SGD classifier in scikit-learn with the hinge loss and L2 regularization you will get an SVM that can be updated online/incrementally. You can combine this with feature transforms that approximate a kernel to get something similar to an online kernel SVM.
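As a minimal sketch of this combination: `SGDClassifier` with hinge loss plus L2 penalty behaves like a linear SVM trained by SGD, and `RBFSampler` provides an approximate RBF kernel feature map. The data, shapes, and hyperparameters below are illustrative assumptions, not part of the question:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.kernel_approximation import RBFSampler

rng = np.random.RandomState(0)

# hinge loss + L2 penalty ~= a linear SVM trained by SGD
clf = SGDClassifier(loss="hinge", penalty="l2", random_state=0)

# fixed random feature map approximating an RBF kernel;
# fit() only needs the input dimensionality (10 features here)
rbf = RBFSampler(gamma=1.0, n_components=100, random_state=0)
rbf.fit(np.zeros((1, 10)))

classes = np.array([0, 1])
for _ in range(5):                    # stream of mini-batches
    X = rng.randn(20, 10)             # 20 articles, 10 raw features
    y = (X[:, 0] > 0).astype(int)     # toy labels
    clf.partial_fit(rbf.transform(X), y, classes=classes)
```

Training on the transformed features makes the linear model approximately equivalent to a kernel SVM while keeping `partial_fit` updates cheap.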

>One of my specifications is that it should continuously update to changing trends.

This is referred to as concept drift, and will not be handled well by a simple online SVM. Using the PassiveAggressive classifier will likely give you better results, as its learning rate does not decrease over time.
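A short sketch of using it incrementally, with toy data standing in for real article features; because each update is driven by the current example's margin rather than a decaying step size, the model keeps adapting to later batches:

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.RandomState(0)
clf = PassiveAggressiveClassifier(C=1.0, random_state=0)

classes = np.array([0, 1])
for _ in range(10):                   # stream of mini-batches
    X = rng.randn(16, 5)              # illustrative feature vectors
    y = (X[:, 0] > 0).astype(int)     # toy labels
    clf.partial_fit(X, y, classes=classes)
```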

Assuming you get feedback while training / running, you can attempt to detect decreases in accuracy over time and begin training a new model when the accuracy starts to decrease (and switch to the new one when you believe that it has become more accurate). JSAT has 2 drift detection methods (see jsat.driftdetectors) that can be used to track accuracy and alert you when it has changed.

It also has more online linear and kernel methods.

(bias note: I'm the author of JSAT).
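JSAT is a Java library, but the underlying idea of its drift detectors (compare accuracy over a recent window against the long-run average and flag a large drop) can be sketched in Python. The class name and thresholds below are illustrative, not JSAT's API:

```python
from collections import deque

class AccuracyDriftMonitor:
    """Flag suspected concept drift when windowed accuracy falls
    well below the overall running accuracy."""

    def __init__(self, window=50, drop=0.15):
        self.recent = deque(maxlen=window)  # last `window` outcomes
        self.total_correct = 0
        self.total_seen = 0
        self.drop = drop                    # tolerated accuracy drop

    def update(self, correct):
        """Record one prediction outcome; return True if drift is suspected."""
        self.recent.append(1 if correct else 0)
        self.total_correct += 1 if correct else 0
        self.total_seen += 1
        if len(self.recent) < self.recent.maxlen:
            return False                    # not enough recent evidence yet
        overall = self.total_correct / self.total_seen
        windowed = sum(self.recent) / len(self.recent)
        return windowed < overall - self.drop
```

When `update` returns True, you would start training a replacement model as described above.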

Solution 2 - Python

Maybe it's me being naive, but I think it is worth mentioning how to actually update the scikit-learn SGD classifier when you present your data incrementally:

from sklearn import linear_model

clf = linear_model.SGDClassifier()

# first batch of data and labels; the full set of classes
# must be passed on the first call to partial_fit
x1 = some_new_data
y1 = the_labels
clf.partial_fit(x1, y1, classes=[0, 1])

# a later batch: partial_fit updates the existing model in place
x2 = some_newer_data
y2 = the_labels
clf.partial_fit(x2, y2)

Solution 3 - Python

Technical aspects

The short answer is no. The sklearn implementation (as well as most existing ones) does not support online SVM training. It is possible to train an SVM incrementally, but it is not a trivial task.

If you want to limit yourself to the linear case, then the answer is yes, as sklearn provides you with Stochastic Gradient Descent (SGD), which has an option to minimize the SVM criterion.

You can also try out the pegasos library instead, which supports online SVM training.

Theoretical aspects

The problem of trend adaptation is currently very popular in the ML community. As @Raff stated, it is called concept drift, and there are numerous approaches to it, often meta-models which analyze "how the trend is behaving" and change the underlying ML model (for example, by forcing it to retrain on a subset of the data). So you have two independent problems here:

  • the online training issue, which is purely technical and can be addressed by SGD or by libraries other than sklearn
  • concept drift, which is currently a hot topic and has no "just works" answers. There are many possibilities, hypotheses and proofs of concept, but there is no single, generally accepted way of dealing with this phenomenon; in fact, many PhD dissertations in ML are currently based on this issue.

Solution 4 - Python

SGD for batch learning tasks normally has a decreasing learning rate and makes multiple passes over the training set. So, for purely online learning, make sure learning_rate is set to 'constant' in sklearn.linear_model.SGDClassifier() and eta0=0.1 or any desired value. The process is then as follows:

from sklearn import linear_model

# constant learning rate so step sizes never decay between batches
clf = linear_model.SGDClassifier(learning_rate='constant', eta0=0.1,
                                 shuffle=False, max_iter=1)

# get x1, y1 as a new instance (pass classes on the first partial_fit call)
clf.partial_fit(x1, y1, classes=[0, 1])
# get x2, y2
# update accuracy if needed
clf.partial_fit(x2, y2)

Solution 5 - Python

If you are interested in online learning with concept drift, here is some previous work:

  1. Learning under Concept Drift: an Overview https://arxiv.org/pdf/1010.4784.pdf

  2. The problem of concept drift: definitions and related work http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.58.9085&rep=rep1&type=pdf

  3. A Survey on Concept Drift Adaptation http://www.win.tue.nl/~mpechen/publications/pubs/Gama_ACMCS_AdaptationCD_accepted.pdf

  4. MOA Concept Drift Active Learning Strategies for Streaming Data http://videolectures.net/wapa2011_bifet_moa/

  5. A Stream of Algorithms for Concept Drift http://people.cs.georgetown.edu/~maloof/pubs/maloof.heilbronn12.handout.pdf

  6. MINING DATA STREAMS WITH CONCEPT DRIFT http://www.cs.put.poznan.pl/dbrzezinski/publications/ConceptDrift.pdf

  7. Analyzing time series data with stream processing and machine learning http://www.ibmbigdatahub.com/blog/analyzing-time-series-data-stream-processing-and-machine-learning

Solution 6 - Python

A way to scale SVM could be to split your large dataset into batches that can be safely consumed by an SVM algorithm, then find the support vectors for each batch separately, and then build a resulting SVM model on a dataset consisting of all the support vectors found in all the batches.
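The batch-wise scheme above can be sketched as follows: fit an SVC on each batch, keep only its support vectors, then fit a final SVC on the pooled support vectors. The synthetic data and batch count are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(300, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels

sv_X, sv_y = [], []
for Xb, yb in zip(np.array_split(X, 3), np.array_split(y, 3)):
    clf = SVC(kernel="rbf").fit(Xb, yb)   # fit one batch at a time
    sv_X.append(Xb[clf.support_])         # keep only its support vectors
    sv_y.append(yb[clf.support_])

# final model trained on the pooled support vectors from all batches
final = SVC(kernel="rbf").fit(np.vstack(sv_X), np.concatenate(sv_y))
```

Since non-support vectors do not affect the decision boundary, this keeps the final training set small while preserving most of each batch's information.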

Updating to trends could be achieved by maintaining a time window each time you run your training pipeline. For example, if you do your training once a day and there is enough information in a month's historical data, create your training dataset from the historical data obtained in the most recent 30 days.
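A minimal sketch of that windowing step, assuming training records are stored as (timestamp, features, label) tuples (the record format and helper name are illustrative):

```python
from datetime import datetime, timedelta

def recent_window(records, now, days=30):
    """Keep only records whose timestamp falls within the last `days` days.

    records: iterable of (timestamp, features, label) tuples.
    """
    cutoff = now - timedelta(days=days)
    return [r for r in records if r[0] >= cutoff]
```

The filtered list would then be fed to the batch training pipeline described above on each daily run.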

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        | Original Author  | Original Content on Stackoverflow
--------------------|------------------|----------------------------------
Question            | Michael Aquilina | View Question on Stackoverflow
Solution 1 - Python | Raff.Edward      | View Answer on Stackoverflow
Solution 2 - Python | Jariani          | View Answer on Stackoverflow
Solution 3 - Python | lejlot           | View Answer on Stackoverflow
Solution 4 - Python | Alaleh Rz        | View Answer on Stackoverflow
Solution 5 - Python | SemanticBeeng    | View Answer on Stackoverflow
Solution 6 - Python | Sergey Zakharov  | View Answer on Stackoverflow