From Machine Learning to Continual Learning

Machine Learning

The field of machine learning aims to find automatic algorithms that extract knowledge from data, whether for pattern recognition, image/text generation, anomaly detection, manipulation, locomotion, and so on. The training setting is generally the following: there are two types of data, the training data and the testing data.

The goal of the algorithm is to train a model (in deep learning, the model is a deep neural network) on the training data to capture some hidden information. The algorithm is then assessed on the test data to evaluate its ability to generalize. The ultimate goal is to learn from the training data and generalize to any data of the same kind.

In this setting, learning consists in optimizing a loss function over the model parameters with a gradient descent heuristic. The training data is the support of this optimization process.

A second step may arise: model selection. Indeed, the model might not be completely adapted to the learning setting and may need modification or tuning. The architecture can be changed by adjusting the number of layers, the number of parameters per layer, the connections between layers, and so on. The parameters that control these model characteristics are called hyper-parameters. They are not learned by gradient descent (and they should not be!); they are chosen empirically from some heuristic. To select good hyper-parameters, a common solution is to set aside part of the training data as validation data. The validation data makes it possible to check whether the hyper-parameters are well chosen. Once the hyper-parameters are selected and the model trained, we can use the test data for the final assessment.
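A minimal sketch of this selection loop, again on an assumed toy linear task: the learning rate plays the role of the hyper-parameter, the candidate values and split sizes are arbitrary choices for illustration.

```python
import random

# Hypothetical toy task: y = 3.0 * x plus noise.
random.seed(0)
data = [(x, 3.0 * x + random.gauss(0, 0.1)) for x in
        [random.uniform(-1, 1) for _ in range(300)]]
# Part of the training data is set aside as validation data.
train, val, test = data[:200], data[200:250], data[250:]

def mse(w, pairs):
    return sum((w * x - y) ** 2 for x, y in pairs) / len(pairs)

def fit(pairs, lr, steps=50):
    """Train the parameter w by gradient descent with learning rate lr."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        w -= lr * grad
    return w

# The hyper-parameter is NOT learned by gradient descent: we try a few
# candidate values and keep the one with the lowest validation loss.
best_lr = min([0.001, 0.1, 0.5], key=lambda lr: mse(fit(train, lr), val))

# Final assessment, once, on the untouched test data.
final_w = fit(train, best_lr)
print(best_lr, mse(final_w, test))
```

Note the separation of roles: the training data updates the parameters, the validation data ranks the hyper-parameters, and the test data is reserved for the final score.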

This full process is how solutions are found in classical machine learning.

Machine Learning in the Wild

The classical machine learning paradigm has been used to solve big challenges in many domains: classification, detection, game playing, data generation, and so on. Nevertheless, it is worth noticing that a huge effort went into creating the algorithm behind each of those solutions. A general algorithm that can "easily" find a solution for any given learning problem does not yet exist (and probably never will).

So even in classical machine learning the existing solutions have limitations, mainly two: first, an algorithm may generalize poorly to a new setting; second, even when the algorithm is adapted, training might be very long and tedious (hyper-parameters can be difficult to find and parameters slow to train).

On the other hand, in classical machine learning the training set has a very convenient property: it does not change. While training the algorithm, we make the hypothesis that the data will not suddenly change and that, if we need it, we can access it later. However, if the goal of machine learning is to be applied in real-life settings, we cannot make this assumption. Data generated by the world suffers from constant concept drift, perturbations, modifications, and interactions that are difficult to predict or model.

Continual Learning

Continual learning aims at moving machine learning algorithms from classical benchmarks to real-life settings. The first rule is that CL algorithms deal with a particular data availability: they do not have access to all the data at the same time. There is a drift between the data accessible at the beginning of the algorithm's learning life and the data accessible at its end. The algorithm should therefore learn a solution that depends on all the data seen during its life, even though it can no longer access that data. We call the learning life of a CL algorithm the continuum; it can be modeled as a stream of data fed to the algorithm to train it.
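One way to picture the continuum is as a generator that yields batches one at a time, with a drifting underlying concept; past batches cannot be revisited. The linear tasks and drift schedule below are illustrative assumptions, not a standard benchmark.

```python
import random

random.seed(0)

def continuum(n_batches=5, batch_size=20):
    """Hypothetical continuum: batches from a drifting task y = w_t * x."""
    for t in range(n_batches):
        w_t = 1.0 + t          # the concept drifts at every step
        yield [(x, w_t * x) for x in
               [random.uniform(-1, 1) for _ in range(batch_size)]]

# The stream is consumed once, in order: a CL algorithm would update
# its model on each batch and then lose access to it.
seen = 0
for batch in continuum():
    seen += len(batch)
print(seen)                    # total examples seen over the whole life
```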

The particular difficulty of this setting, compared with a classical machine learning setting, is that since the data distribution changes, the loss function to optimize changes as well. The parameters are always modified so as to optimize the current loss function; therefore, when the loss function changes, the learned parameters are modified and may no longer be adapted to the previous loss function. This phenomenon is called forgetting. Moreover, the process can be extremely fast, which is why it is referred to as "catastrophic forgetting".
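Forgetting can be demonstrated even with a single parameter. In this assumed toy setup, a model is trained on task A (slope 2), then on task B (slope -2); optimizing the current loss drags the parameter away from the old optimum, and the loss on task A climbs back up.

```python
import random

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(100)]
task_a = [(x, 2.0 * x) for x in xs]    # first loss function (slope 2)
task_b = [(x, -2.0 * x) for x in xs]   # the distribution has drifted

def mse(w, pairs):
    return sum((w * x - y) ** 2 for x, y in pairs) / len(pairs)

def train(w, pairs, lr=0.5, steps=100):
    """Plain gradient descent on the current task only."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        w -= lr * grad
    return w

w = train(0.0, task_a)
loss_a_before = mse(w, task_a)   # near zero: task A is learned
w = train(w, task_b)             # keep optimizing the *current* loss
loss_a_after = mse(w, task_a)    # task A performance has collapsed
print(loss_a_before, loss_a_after)
```

Nothing in plain gradient descent protects the old solution: after training on task B, the parameter sits at the task B optimum and the task A loss is large again. Continual learning methods aim precisely at mitigating this effect.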

Finding algorithms that can deal with catastrophic forgetting would unlock many applications of machine learning.