
7 SUPERVISED TRANSFER LEARNING

Summary: Most machine learning models implicitly assume stationarity of the data, meaning that the data distribution does not change over time. Whenever this stationarity assumption is violated, models trained at one point in time may not correctly process later data. Transfer learning methods try to account for the difference between training and test data and learn mappings between the two. We propose a novel transfer learning framework in which a mapping from test to training data is learned based on a supervised loss on the training data. We implement our framework for linear transfer mappings and the loss functions of generalized learning vector quantization as well as labeled Gaussian mixture models. On artificial data we demonstrate that we are able to successfully transfer target data back to the source space even in cases where reference methods from the literature fail, and that our approach is orders of magnitude faster compared to training a new model.

Publications: This chapter is based on the following publications.

• Paaßen, Benjamin, Alexander Schulz, and Barbara Hammer (2016). “Linear Supervised Transfer Learning for Generalized Matrix LVQ”. In: Proceedings of the Workshop New Challenges in Neural Computation (NC2 2016). (Hannover, Germany). Ed. by Barbara Hammer, Thomas Martinetz, and Thomas Villmann. Best presentation award, pp. 11–18. URL: https://www.techfak.uni-bielefeld.de/~fschleif/mlr/mlr_04_2016.pdf#page=14.

• Paaßen, Benjamin et al. (2017). “An EM transfer learning algorithm with applications in bionic hand prostheses”. In: Proceedings of the 25th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2017). (Bruges, Belgium). Ed. by Michel Verleysen. i6doc.com, pp. 129–134. URL: http://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2017-57.pdf.

• — (2018). “Expectation maximization transfer learning and its application for bionic hand prostheses”. In: Neurocomputing 298, pp. 122–133. DOI: 10.1016/j.neucom.2017.11.072.

Source Code: The MATLAB(R) source code corresponding to the content of this chapter is available at http://doi.org/10.4119/unibi/2912671.

The aim of machine learning is to identify patterns in a set of training data such that these patterns hold for unseen and new data. The ability to correctly apply patterns to unseen data is called generalization (Bishop 2006). Generalization is simple if the training data and the new data are similar, in the sense that they stem from the same underlying distribution. However, in many scenarios, this assumption is violated (Cortes et al. 2008).

For example, the training data may have been selected in a biased way, and thus patterns that hold for the training data may not hold for the overall population (Cortes et al. 2008).

Further, the generative process of the data may change over time, for example due to external disturbances (Ditzler et al. 2015). Finally, one may want to generalize to data which are generated from another source (Ben-David et al. 2006). Each of these scenarios leads to a mismatch between a model derived from the training data and the new data, which in turn may limit generalization.

Figure 7.1: An illustration of two kinds of concept drift. Left: Virtual concept drift, also known as covariate shift or sample selection bias. Right: Real concept drift, where source and target data are related by rotation. Colors and shapes illustrate class assignments. Source data is drawn dashed and transparent, while target data is opaque.

In cases of abrupt changes in the data distribution, many classical approaches suggest discarding the entire learned model and learning a new model using only data from the new distribution (Ditzler et al. 2015). However, if only a few data points from the target distribution are available, this newly trained model may be inaccurate. Instead, we propose to re-use the trained model from the source domain, and to only learn the transfer function between the source and target domain, which makes our proposed framework an instance of transfer learning (S. J. Pan and Q. Yang 2010; Weiss, Khoshgoftaar, and D. Wang 2016). In particular, our proposed approach can be regarded as a case of heterogeneous domain adaptation, which is concerned with learning mappings between domains such that knowledge can be transferred from one to the other (Weiss, Khoshgoftaar, and D. Wang 2016).

In more detail, we propose to learn a mapping $h$ from the target domain to the source domain, using only a few target domain samples, such that the loss of the source model on these mapped samples is minimized. In other words, we adapt the representation of the target space data to the source model. The main contributions of this chapter are to formalize this supervised transfer learning framework, and to provide two instances of supervised transfer learning, one for learning vector quantization models and one for labeled Gaussian mixture models. Note that both models can be seen as instances of metric learning. In particular, our transfer learning approach adapts the target space representation such that the source space metric becomes applicable to the target space.
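To make the general idea concrete, the following Python sketch illustrates the framework under simplifying assumptions: the source model is a fixed nearest-prototype (LVQ-style) classifier, the transfer mapping is a linear map $H$, and $H$ is fitted with a general-purpose optimizer by minimizing a soft GLVQ-style loss of the fixed source model on a handful of labeled target samples. The prototype values, the particular cost function, and all names are illustrative choices, not the exact formulation derived later in this chapter.

import numpy as np
from scipy.optimize import minimize

# Fixed source model: prototypes W with labels c (e.g. obtained by GLVQ training).
W = np.array([[0.0, 1.0], [0.0, -1.0]])   # prototype positions in the source space
c = np.array([0, 1])                       # prototype labels

def glvq_like_loss(H_flat, X_tgt, y_tgt, W, c, m):
    """Soft GLVQ-style loss of the fixed source model on mapped target data."""
    H = H_flat.reshape(m, m)
    X_mapped = X_tgt @ H.T                 # map target points into the source space
    loss = 0.0
    for x, y in zip(X_mapped, y_tgt):
        d = np.sum((W - x) ** 2, axis=1)   # squared distances to all prototypes
        d_plus = np.min(d[c == y])         # closest prototype with the correct label
        d_minus = np.min(d[c != y])        # closest prototype with a wrong label
        mu = (d_plus - d_minus) / (d_plus + d_minus)
        loss += 1.0 / (1.0 + np.exp(-mu))  # sigmoidal GLVQ cost per sample
    return loss / len(y_tgt)

def learn_transfer(X_tgt, y_tgt, W, c):
    m = X_tgt.shape[1]
    H0 = np.eye(m).ravel()                 # start from the identity mapping
    res = minimize(glvq_like_loss, H0, args=(X_tgt, y_tgt, W, c, m))
    return res.x.reshape(m, m)

# Example: the target data correspond to the source concept rotated by 180 degrees;
# a few labeled target samples suffice to estimate the back-transformation.
X_tgt = np.array([[0.1, -1.0], [-0.2, -0.9], [0.0, 1.1], [0.15, 0.95]])
y_tgt = np.array([0, 0, 1, 1])
H = learn_transfer(X_tgt, y_tgt, W, c)
print(H)   # ideally close to -I, i.e. a 180 degree back-rotation

Note that the source model itself is never retrained; only the small matrix $H$ is optimized, which is what makes the scheme cheap when few target samples are available.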

We begin by covering some related work on changing data distributions and adaptations, then describe our own method, before we evaluate our approach experimentally and close with a conclusion.

7.1 RELATED WORK

In this chapter, we consider classification tasks. In particular, we assume a list of tuples $(\vec x_1, y_1), \ldots, (\vec x_M, y_M)$, which we call the source dataset. Each of these tuples consists of an input data point $\vec x_i \in \mathbb{R}^m$ for some $m \in \mathbb{N}$, and a label of interest $y_i \in \{1, \ldots, L\}$ for some $L \in \mathbb{N}$. Our task is to construct a machine learning model $f : \mathbb{R}^m \to \{1, \ldots, L\}$, such that $f$ predicts the correct label for the source dataset, and generalizes to target data.

However, the literature covers several scenarios in which the model $f$ may be able to correctly predict the source data, but may fail to generalize.

For example, Shimodaira (2000) has introduced the notion of covariate shift, which refers to differences in the marginal density $p(\vec x)$ between source and target data, while the conditional label distribution $P(y|\vec x)$ remains the same. In that case, the target data may contain more samples in a region of the data space where the model is inaccurate, and thus the model may fail to generalize (see Figure 7.1, left).

Similarly, Cortes et al. (2008) have established sample selection bias correction theory, which assumes that a true underlying distribution $P(y, \vec x)$ exists, but that the source data is sampled not from this distribution directly but only from a limited region of the space. In that case, the model may fail to correctly predict samples in the regions from which no samples were available and thus fail to generalize.

Note that both scenarios assume that the change from source to target data is discrete, without regard for the time dimension. By contrast, research on concept drift is concerned with changes over time. In particular, a change in the marginal distribution $p(\vec x)$ is called virtual concept drift, while a change that also affects the conditional distribution $P(y|\vec x)$ is called real concept drift (Ditzler et al. 2015). Furthermore, one can distinguish between gradual drift and sudden drift (Ditzler et al. 2015). From the perspective of concept drift, covariate shift and sample selection bias would be special cases of sudden, virtual concept drift. In our work, we focus on cases of real concept drift, because in these cases even target data that are close to source data may be misclassified (see Figure 7.1, right).
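This distinction can be summarized compactly. Writing $S$ and $T$ for the source and target distributions (subscripts introduced here only for this summary), we have

virtual drift: $p_S(\vec x) \neq p_T(\vec x)$ while $P_S(y \mid \vec x) = P_T(y \mid \vec x)$,
real drift: $P_S(y \mid \vec x) \neq P_T(y \mid \vec x)$.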

A final perspective is provided by the fields of transfer learning and domain adaptation, which are concerned with settings in which source and target data stem from different domains (Ben-David et al. 2006; S. J. Pan and Q. Yang 2010; Weiss, Khoshgoftaar, and D. Wang 2016). In these cases, a model $f$ learned on the source data is a priori not applicable and needs to be adapted to the target domain.

The first step in adapting to changes between source and target data is to detect whether a change has occurred. In some cases, a change may be obvious, for example in the case of domain adaptation. For non-obvious cases, various change detection tests exist, for example based on deviations in the sample mean, the sample variance, or the classification error (Ditzler et al. 2015). Once a change has been detected, the next step is to adapt to the change.

In case of gradual concept drift, be it virtual or real, one can apply incremental learning schemes to smoothly adapt a model to a new distribution via single samples or mini-batches, such as incremental support vector machines, Learn++, on-line random forests, or incremental learning vector quantization (Ditzler et al. 2015; Gepperth and Hammer 2016; Losing, Hammer, and Wersing 2016a).

In case of a sudden virtual concept drift, such as covariate shift or sample selection bias, the source data can be augmented by re-weighting the source data points $\vec x_i$, such that the distribution of the re-weighted source data corresponds to the distribution of the target data (Cortes et al. 2008; Jiayuan Huang et al. 2007; Sugiyama et al. 2008). If the drift is sudden and real, the source data is typically considered to be invalid and should be forgotten entirely, which also means that the old model $f$ should be discarded and replaced by a new one (Ditzler et al. 2015). Note that models are typically optimized only for either sudden or gradual drift. To our knowledge, only the long-and-short-term-memory model by Losing, Hammer, and Wersing (2016b) has the ability to adapt to both kinds of drift.
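The re-weighting mentioned above is typically realized as importance weighting: under covariate shift, the expected target-domain loss can be estimated from source samples weighted by the density ratio, for instance

$w_i = p_T(\vec x_i) / p_S(\vec x_i)$, and $\hat{\mathcal{E}}_T(f) = \frac{1}{M} \sum_{i=1}^{M} w_i \, \ell(f(\vec x_i), y_i)$,

where $\ell$ denotes some classification loss. The symbols $p_S$, $p_T$, $\ell$, and $\hat{\mathcal{E}}_T$ are introduced here only for illustration; the cited works differ mainly in how the weights $w_i$ are estimated without explicit density estimation.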

A lacuna in all these approaches is that they do not take the relatedness between source and target data into account. By contrast, transfer learning and domain adaptation approaches assume that source and target data can be embedded in a common latent space in which a model can be learned that applies to all data (S. J. Pan and Q. Yang 2010; Weiss, Khoshgoftaar, and D. Wang 2016). One class of transfer learning approaches is concerned with invariant feature representations that can be computed for both source and target data and then permit a correct classification of both, such as the first layers of deep convolutional neural networks or scale-invariant features (Glorot, Bordes, and Bengio 2011; Long et al. 2015; Lowe 1999). Note that this approach does not help in cases of real concept drift where the label for a region of the data space changes, because this region would have to be mapped to different locations for correct classification, which a single mapping can intrinsically not do.

By contrast, Blitzer, McDonald, and Pereira (2006) and Blöbaum, Schulz, and Hammer (2015), as well as others, use different mappings from the source and target space to a common latent space. As such, this class of approaches is conceptually strong enough to deal with real concept drift and to re-use a model learned on source data for the target domain. However, these approaches do not take label information in the target data into account, which leads to failure in all cases where the relation between source and target data is ambiguous.

Consider the right plot in Figure 7.1 as an example. In this case, source and target data are related by a 180° rotation. However, without label information for the target data, it would be equally plausible to assume that no change between source and target data has occurred, because the marginal density $p(\vec x)$ is the same for source and target data.

Only a few approaches to date have also taken label information into account.

The first is the adaptive support vector machine (a-SVM) (J. Yang, Yan, and Hauptmann 2007), which assumes that source and target space are the same, but that real concept drift has occurred. In turn, a model $f$ trained on the source data may misclassify some target data points. The a-SVM learns an auxiliary support vector machine model $f'$ that predicts the difference between the labels predicted by $f$ for some target sample points and the actual labels of these points. As such, the source model $f$ is re-used for all data points that are still correctly classified, and is adapted for all other points. However, the a-SVM may still fail for the real concept drift example in Figure 7.1, because it has to re-learn the entire model and does not exploit the simple, linear relationship between source and target data.
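Schematically, and in simplified notation not taken from the original paper, the a-SVM decision function for a target point can be thought of as the fixed source decision function plus a learned correction,

$f_T(\vec x) = f(\vec x) + \Delta f(\vec x)$,

where only the correction term $\Delta f$ is fitted to the labeled target samples.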

By contrast, the asymmetric regularized cross-domain transformation (ARC-t) approach (Kulis, Saenko, and Darrell 2011) learns a linear mapping $H$ between source data points $\vec x$ and target data points $\hat x$ by maximizing the inner product $\vec x^\top \cdot H \cdot \hat x$ if $\vec x$ and $\hat x$ have the same label and minimizing it otherwise. The mapping can then be used to transfer source data to the target space and train a target domain classifier there. In line with our framework, it is also possible to transfer target space data to the source space to make a source classifier applicable again. Note, however, that ARC-t is challenged whenever classes are multi-modal, because in that case maximizing the inner product between all points within classes may yield conflicting objectives.
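A schematic version of such an objective, written here only to illustrate the idea described above (the regularizer $r$ and the pairwise losses $\ell_{ij}$ are illustrative and not taken from the original paper), is

$\min_H \; r(H) + \sum_{(i,j)} \ell_{ij}\!\left(\vec x_i^\top H \hat x_j\right)$,

where $\ell_{ij}$ penalizes a small inner product if $\vec x_i$ and $\hat x_j$ share a label and a large inner product otherwise.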

A more flexible approach is offered by heterogeneous feature augmentation (HFA) (Duan, Xu, and I. Tsang 2012), which learns two linear mappings $P$ and $Q$ from the source and the target space to a shared latent space such that the loss of a support vector machine trained on all data in the latent space is minimized. Note that this bears some similarity to our proposed framework, as the transfer mappings $P$ and $Q$ are also learned based on a classifier loss function, namely that of the support vector machine. In contrast to our method, though, the mappings $P$ and $Q$ are learned only implicitly in a kernel-based approach and can not be used to transfer target space data to the source space, which would be necessary to re-use an already trained source space classifier. Instead, HFA has to train a new classifier in the latent space.

As such, our proposed framework fills a notable gap in the existing literature by a) learning a transfer mapping explicitly (other than a-SVM and HFA) that b) permits the application of an already learned source space classifier without retraining (other than HFA) and c) is trained based on the loss of that classifier (other than ARC-t). In the following section, we will formalize this problem and develop two learning approaches, one based on learning vector quantization, and one based on labeled Gaussian mixture models.
