to multivariate discrimination

Multivariate discrimination or classification is one of the best-studied problems in machine learning, with a plethora of well-tested and well-performing algorithms. There are also several good general textbooks [1–9] on the subject written for an average engineering, computer science, or statistics graduate student; most of them are also accessible to an average physics student with some background in computer science and statistics. Hence, instead of writing a generic introduction, we concentrate here on relating the subject to a practicing experimental physicist. After a short introduction on the basic setup (Section 1) we delve into the practical issues of complexity regularization, model selection, and hyperparameter optimization (Section 2), since it is this step that makes high-complexity non-parametric fitting so different from low-dimensional parametric fitting. To emphasize that this issue is not restricted to classification, we illustrate the concept on a low-dimensional but non-parametric regression example (Section 2.1). Section 3 describes the common algorithmic-statistical formal framework that unifies the main families of multivariate classification algorithms. We explain here the large-margin principle that partly explains why these algorithms work. Section 4 is devoted to the description of the three main (families of) classification algorithms: neural networks, the support vector machine, and AdaBoost.

example, $\mathcal{C} = \{\text{signal}, \text{background}\}$), we are talking about classification. The quality of $g$ on an arbitrary pair $(x,y) \in \mathcal{X} \times \mathcal{C}$ is measured by an error or loss function $L\big(g,(x,y)\big)$ that depends on the type of problem. In regression, the goal is to get $g(x)$ as close to $y$ as possible, so the loss grows with the difference between $g(x)$ and $y$. The most widely used loss function in this case is the quadratic or squared loss
$$L_2\big(g,(x,y)\big) = \big(g(x) - y\big)^2.$$
In classification, typically there is no distance or similarity defined between the classes, so all we can measure is whether or not $g$ predicts the class $y$ correctly. The usual loss in this case is the one-loss or zero-one loss
$$L_I\big(g,(x,y)\big) = I\{g(x) \neq y\}, \qquad (1)$$
where the indicator function $I\{A\}$ is 1 if its argument $A$ is true and 0 otherwise. The goal of learning algorithms is not to memorize $D$ but to generalize from it. Indeed, it is trivial to construct a function $g$ that has zero loss on every sample point $(x_i,y_i)$ by explicitly setting $g(x_i) = y_i$ on all sample points from $D$, and, say, $g(x) = 0$ everywhere else. We obviously have not learned anything from $D$; we have simply memorized it. It is also clear that this function has very little chance to perform well on points not in the set $D$ (unless 0 is a good prediction everywhere, in which case the problem itself is trivial). To formalize this intuitive notion of generalization, it is usually assumed that all sample points (including those in $D$) are independent samples from a fixed but unknown distribution $\mathcal{D}$, and the goal is to minimize the expected loss (a.k.a. risk or error)
$$R(g) = E_{(x,y)\sim\mathcal{D}}\big\{L\big(g,(x,y)\big)\big\},$$
where $E_{(x,y)\sim\mathcal{D}}\{L\}$ denotes the expectation of $L$ with respect to the random variable $(x,y)$ drawn from the distribution $\mathcal{D}$. When we know $\mathcal{D}$, the optimal function is the one that minimizes the risk, that is, $g^* = \arg\min_g R(g)$.
With this notation, since the expectation of the indicator of an event is the probability of the event, the misclassification probability $P\{g(x) \neq y\}$ is the risk generated by the one-loss,
$$R_I(g) = E_{(x,y)\sim\mathcal{D}}\big\{L_I\big(g,(x,y)\big)\big\} = P\{g(x) \neq y\},$$
and it can be shown that the optimal classifier (the so-called Bayes classifier) is the one that predicts the most probable class given the observation,
$$g^*_I(x) = \arg\max_{y \in \mathcal{C}} P\{y \mid x\}.$$
In regression, the risk is the expectation of the squared error, $R_2(g) = E_{(x,y)\sim\mathcal{D}}\big\{\big(g(x) - y\big)^2\big\}$, and it can be shown easily that the optimal regressor is the conditional expectation of $y$ given $x$, that is, $g^*_2(x) = E\{y \mid x\}$.
In practice, of course, the data distribution $\mathcal{D}$ is unknown, and the goal is to get close to the performance of $g^*$ by only using the sample $D$. As our imaginary example demonstrates, finding a function $g$ that minimizes the empirical risk (or training error)
$$\hat{R}(g) = \frac{1}{n} \sum_{i=1}^{n} L\big(g,(x_i,y_i)\big)$$
is not necessarily a good idea. Nevertheless, it turns out that the estimator
$$\hat{g}^* = \arg\min_{g \in \mathcal{G}} \hat{R}(g) \qquad (4)$$
can have good properties both in theory and in practice if the function class $\mathcal{G}$ is not too "rich", so that the functions in $\mathcal{G}$ are not too complex. In fact, the real error $R(\hat{g}^*)$ of the empirically best classifier $\hat{g}^*$ is usually underestimated by the training error $\hat{R}(\hat{g}^*)$, but the bias can be controlled, and the minimization (4) can lead to a good solution if the complexity of the function class $\mathcal{G}$ is "small" compared to the number of data points $n$. How to control the complexity of $\mathcal{G}$ and how to find an optimal function in $\mathcal{G}$ computationally efficiently are the two main subjects of the design of supervised learning algorithms. Finding $\hat{g}^*$ in $\mathcal{G}$ is the subject of algorithmic design. The process is called training or learning, and the data set $D$ on which $\hat{R}(g)$ is minimized is the training set. Of course, once $D$ is used for finding $\hat{g}^*$, it is tainted from a statistical point of view: $\hat{R}(\hat{g}^*)$ is no longer an unbiased estimator of $R(\hat{g}^*)$. Overfitting is the term that describes the situation when the risk $R(\hat{g}^*)$ is significantly larger than the training error $\hat{R}(\hat{g}^*)$. To detect and to assess overfitting, $\hat{g}^*$ has to be evaluated on a test set $D'$, independent of $D$.
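The gap between memorizing and generalizing can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than part of the original text: the one-dimensional target rule, the sample sizes, and the small class of threshold classifiers standing in for $\mathcal{G}$. It compares the "memorizer" described above with an empirical risk minimizer over that small class, on the training set and on an independent test set.

```python
import random

def target(x):
    # Hypothetical ground truth: class 1 if x > 0.5, else class 0.
    return 1 if x > 0.5 else 0

def sample(n, rng):
    # Draw n independent points (x, y) with x uniform on [0, 1].
    return [(x, target(x)) for x in (rng.random() for _ in range(n))]

def one_loss(g, x, y):
    # Zero-one loss: 1 if the prediction is wrong, 0 otherwise.
    return 1 if g(x) != y else 0

def empirical_risk(g, data):
    # Average one-loss of g over a data set.
    return sum(one_loss(g, x, y) for x, y in data) / len(data)

rng = random.Random(0)
train = sample(200, rng)   # plays the role of D
test = sample(2000, rng)   # plays the role of D'

# Memorizer: returns the stored label on training points and 0 elsewhere.
lookup = dict(train)

def memorizer(x):
    return lookup.get(x, 0)

# A small function class G: threshold classifiers on a coarse grid.
def make_threshold(t):
    return lambda x: 1 if x > t else 0

thresholds = [i / 20 for i in range(21)]
best_t = min(thresholds, key=lambda t: empirical_risk(make_threshold(t), train))
erm = make_threshold(best_t)

train_err_mem = empirical_risk(memorizer, train)
test_err_mem = empirical_risk(memorizer, test)
train_err_erm = empirical_risk(erm, train)
test_err_erm = empirical_risk(erm, test)
print(train_err_mem, test_err_mem, train_err_erm, test_err_erm)
```

Both functions reach zero training error here, so the training error alone cannot tell them apart; only the independent test set reveals that the memorizer has learned nothing.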

Parametric fitting and the maximum likelihood method
The main subject of the chapter is non-parametric (model-less) multivariate classification, and the setup of Section 1 is designed to formalize the theoretical framework for this approach. Nevertheless, it is worth noting that "classical" low-dimensional fitting, be it maximum likelihood or least squares, also fits into this framework. These methods are treated in other chapters, so here we only sketch them to illuminate their relationship to the setup of Section 1.
When we know a generative probabilistic model, one of the possibilities of fitting a model is to maximize the likelihood $p(z)$. This can also be cast into the framework of this section with the usual loss function, the negative log-likelihood
$$L_P(p, z) = -\ln p(z),$$
and with the risk that $L_P$ implies,
$$R_P(p) = E_{z\sim\mathcal{D}}\{-\ln p(z)\}.$$
For example, when the data points $z_i$ are triplets $(x_i,y_i,\sigma_i)$ with the generative model of $y_i \sim \mathcal{N}\big(g(x_i),\sigma_i\big)$, maximum likelihood reduces to weighted least squares ("chi square"), that is,
$$\hat{g}^* = \arg\min_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^{n} \frac{\big(g(x_i) - y_i\big)^2}{\sigma_i^2}.$$
Although the least-squares method is mostly used in parametric fitting, when $\mathcal{G}$ is a low-complexity function class parametrized with a few parameters, neural networks and other model-less approaches can also be used with a (weighted) quadratic loss, and in this case all considerations about overfitting, complexity regularization, and hyperparameter optimization, usually associated with non-parametric classification, are also valid.
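The reduction from maximum likelihood to weighted least squares can be written out in one step. The derivation below is a sketch under the assumption that the $y_i$ are independent and that $\sigma_i$ denotes the known standard deviation of the Gaussian noise on $y_i$:

```latex
\begin{align*}
-\ln p(y_1,\dots,y_n \mid x_1,\dots,x_n)
  &= \sum_{i=1}^{n} \left[ \frac{\big(y_i - g(x_i)\big)^2}{2\sigma_i^2}
     + \ln\!\big(\sigma_i\sqrt{2\pi}\big) \right] \\
  &= \frac{1}{2} \sum_{i=1}^{n} \frac{\big(y_i - g(x_i)\big)^2}{\sigma_i^2}
     + \text{const}.
\end{align*}
```

The constant does not depend on $g$, so maximizing the likelihood over $g$ is the same as minimizing the sum of squared residuals weighted by $1/\sigma_i^2$.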

Complexity regularization (model selection) and hyperparameter optimization
The simplest way to control the complexity of the function class G is to fix it either explicitly (for example, by taking the set of all linear functions) or implicitly (for example, by taking the set of all functions that a neural network with N neurons and one hidden layer can represent). In parametric fitting, G is a set of functions parametrized with a low number of parameters, and the dimensionality d of the input space is usually low. Most importantly, the number of parameters is fixed independently of the size n of the training data set, so the overfitting issue only shows up in goodness-of-fit tests (for example, in setting the degrees of freedom of the chi-square distribution). The situation is different in non-parametric fitting. Contrary to what its name suggests, nonparametric fitting means that we have a lot of parameters. This approach is used when we have little knowledge about the model that generated the observations, so we want to avoid the bias caused by fitting a misspecified rigid model. More importantly, we do not want to decide the complexity or the smoothness of the solution beforehand; we prefer that the data speak for itself.
Formally, we can define a (possibly infinite) set of function classes $\mathcal{G}_\theta$, parametrized by a vector $\theta$ of so-called hyperparameters. Hyperparameters can take diverse forms. Most of them are related to the complexity of the class; the number of neurons in a neural network, the depth of decision trees, or the number of boosting iterations are typical examples. Other hyperparameters are penalty coefficients in a framework called complexity regularization [4] or structural risk minimization [9]. In this setup, instead of finding the empirical minimizer in $\mathcal{G}$, we penalize the complexity of the functions $g \in \mathcal{G}$ by an appropriate penalty $C(g)$, and find
$$\hat{g}^*_\beta = \arg\min_{g \in \mathcal{G}} \big(\hat{R}(g) + \beta\, C(g)\big), \qquad (5)$$
where $\beta$ is a penalty coefficient (a hyperparameter) which has to be tuned. The classical practices of weight decay or early stopping in neural networks fall in the category of structural risk minimization. Another typical example is the penalty coefficient (usually denoted by $C$) in support vector machines. A third family of hyperparameters are simple "technical" parameters, such as the learning rate in neural networks, or the number of features $x^{(j)}$ tested at each cut in a decision tree.
There is a vast literature on how to tune hyperparameters on the training set $D$ itself. This so-called model selection problem has both Bayesian and frequentist solutions, but most of the time, at least in model-less non-parametric fitting, optimality results are only valid asymptotically (as the training sample size $n \to \infty$), and so they only provide some ballpark hints in practice. The solution adopted in practice is called validation. In the simplest setup, we choose $\hat{g}^*_\theta$ by minimizing the training error in each set $\mathcal{G}_\theta$,
$$\hat{g}^*_\theta = \arg\min_{g \in \mathcal{G}_\theta} \hat{R}(g; D), \qquad (6)$$
then we choose the overall best predictor $\hat{g}^* = \hat{g}^*_{\hat\theta}$ by minimizing the error on a held-out validation set $D''$,
$$\hat\theta = \arg\min_{\theta} \hat{R}(\hat{g}^*_\theta; D''),$$
where we explicitly added $D$ and $D''$ as an argument of the error to emphasize that the two minimizations use two different sets. Indeed, $D''$ has to be independent of both the training set $D$ and the eventual test set $D'$ (used for estimating the performance of the final classifier). This procedure is known as hyperparameter optimization or hyperparameter tuning in machine learning. The particularity of hyperparameter optimization is that the evaluation of the function $f(\theta) = \hat{R}(\hat{g}^*_\theta; D'')$ is computationally expensive. Indeed, the evaluation of $f(\theta)$ requires the optimization of $g_\theta \in \mathcal{G}_\theta$, that is, training $g_\theta$ on $D$ and testing it on $D''$, which can take hours or days, depending on the particular learning algorithm, the data size $n$, and the data dimensionality $d$. This feature relates hyperparameter optimization to experimental design where, say, we want to optimize a handful of parameters of a detector by evaluating its performance using expensive simulations. When the number of hyperparameters is small (say, not more than 3), the usual procedure is simple grid search: we discretize the components of $\theta$, and evaluate $f(\theta)$ for all members of the discrete set.
For example, we train AB for T ∈{ 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10 000} iterations using decision trees with a number of leaves of N ∈{2, 3, 5, 10, 20, 50, 100}, and then select the best (T, N) pair out of the 70 trained functions. When the number of hyperparameters exceeds three, this procedure rapidly becomes infeasible because the number of grid points blows up exponentially. Simple heuristics, such as Latin hypercube search, simple random search [10], or gradient search [11] can be applied, but the principled solution is surrogate optimization: we replace f by a, usually non-parametric, smooth function which we train iteratively on a sequence of evaluation points θ t , f (θ t ) . In each iteration t,a surrogate function f t (θ) is regressed over the set , then a new evaluation point is selected based on f t (θ). Gaussian process regression [12] is one of the most popular concrete choices for the regression method since it provides not only a regressor but also a conditional distribution p f (θ) | θ , and so it can be used with probabilistic criteria to select the next evaluation point θ t . The best-known such criterion is expected improvement. The recent success of deep neural networks [13][14][15], with often tens of hyperparameters, generated a surge of papers on the subject [10,[16][17][18][19], showing that the jury is still out on what the best strategy is.

An example in low-dimensional fitting
To illustrate that overfitting and model selection do not only concern multivariate classification, we start with a simple example using Gaussian process (GP) regression [12] on a one-dimensional fitting problem. The main ingredient of a GP is the kernel function $K(x, x')$ that defines the smoothness of the regressor functions $g(x)$. In this example we use a squared exponential kernel of the form
$$K(x, x') = a \exp\left(-\frac{(x - x')^2}{2w^2}\right),$$
where the amplitude $a$ and the width $w$ are hyperparameters to be tuned. Without going into the details, the GP regression function is given in the form of
$$\hat{g}(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x).$$
The coefficient vector $\alpha = (\alpha_1,\ldots,\alpha_n)$ is given by
$$\alpha = (K + \Sigma)^{-1} y,$$
where $K$, with entries $K_{ij} = K(x_i, x_j)$, is the Gram matrix and $y = (y_1,\ldots,y_n)$ is the label vector. The diagonal matrix $\Sigma$ contains either the squared uncertainties $\sigma_i^2$ in the diagonal in case they are known, or a constant $\sigma^2$ which becomes a third hyperparameter that has to be tuned along with $a$ and $w$.
There are both Bayesian and frequentist techniques to find the optimal hyperparameters $a$, $w$, and $\sigma$ using a single training set $D$. These methods work well in practice, so validation is not a common technique in GP regression. Nevertheless, for the sake of illustrating the general technique, we show on an example how validation works in this case. We generate 50 training and 50 validation points $(x_i,y_i)$ by first drawing $x_i$ uniformly from the interval $[3,100]$, and then drawing the corresponding $y_i$ at random around $g(x_i)$, where $g$ is our target regression function. Figure 1 depicts $g(x)$ in blue, and the training set $D$ in red (the validation set $D''$ is not shown). For the sake of simplicity, we fix $a = 10$ and $\sigma = 0.4$ and only tune the kernel width $w$. The role of $w$ is to set the smoothness of the regressor function. We plot three cases in Figure 1. When $w = 10$, the estimated regressor $\hat{g}(x)$ clearly overfits the data: it follows the training data too closely, achieving a root mean square error (RMSE) $\sqrt{\hat{R}_2(\hat{g})} = 0.33$ on the training set. At the same time, on the validation set its RMSE is 0.47, giving a qualitative indication of overfitting. When $w = 80$, we are underfitting (oversmoothing) the data. The training and validation RMSEs are close (0.47 and 0.48, respectively), but they are both suboptimal. At the optimum $w = 40$, selected by the validation procedure, we can achieve a 0.39 training RMSE and a 0.38 validation RMSE. The validation RMSE slightly underestimates the expected RMSE since $w$ was selected based on this sample, but this "hyper-overfitting" is negligible. In fact it is comparable to overfitting a single-parameter function in a parametric setup.
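The validation procedure for the kernel width can be sketched in plain Python. Several ingredients are assumptions made only for illustration: the target function `target` (a sine wave, chosen arbitrarily), the Gaussian noise model, the kernel convention $a\exp(-(x-x')^2/(2w^2))$, the list of candidate widths, and the small Gaussian-elimination helper `solve`. The numbers it produces are therefore not expected to match those quoted above.

```python
import math
import random

def sq_exp_kernel(x1, x2, a, w):
    # Squared exponential kernel (one common convention, assumed here).
    return a * math.exp(-((x1 - x2) ** 2) / (2.0 * w ** 2))

def solve(A, b):
    # Gaussian elimination with partial pivoting; returns x with A x = b.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_gp(xs, ys, a, w, sigma):
    # alpha = (K + sigma^2 I)^{-1} y, then g(x) = sum_i alpha_i K(x_i, x).
    n = len(xs)
    A = [[sq_exp_kernel(xs[i], xs[j], a, w) + (sigma ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(A, ys)
    return lambda x: sum(al * sq_exp_kernel(xi, x, a, w)
                         for al, xi in zip(alpha, xs))

def rmse(g, xs, ys):
    return math.sqrt(sum((g(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

def target(x):
    # Hypothetical target function, chosen only for this sketch.
    return math.sin(x / 10.0)

def draw(n, rng, sigma):
    xs = [rng.uniform(3.0, 100.0) for _ in range(n)]
    ys = [target(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys

rng = random.Random(1)
a, sigma = 10.0, 0.4
x_tr, y_tr = draw(50, rng, sigma)   # training set D
x_va, y_va = draw(50, rng, sigma)   # validation set D''

results = {}
for w in [2.0, 5.0, 10.0, 20.0, 40.0, 80.0]:
    g_hat = fit_gp(x_tr, y_tr, a, w, sigma)
    results[w] = (rmse(g_hat, x_tr, y_tr), rmse(g_hat, x_va, y_va))

# Validation: keep the width with the smallest validation RMSE.
best_w = min(results, key=lambda w: results[w][1])
for w, (tr, va) in results.items():
    print(f"w={w:5.1f}  train RMSE={tr:.3f}  validation RMSE={va:.3f}")
print("selected w:", best_w)
```

Printing both columns side by side is the useful part: a training RMSE well below the validation RMSE points to overfitting, while two similar but large values point to oversmoothing.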

Convex losses in binary classification
Finding the empirical minimizer $\hat{g}^*$ (4), $\hat{g}^*_\theta$ (6), or $\hat{g}^*_\beta$ (5) can be algorithmically challenging even in relatively "simple" function sets $\mathcal{G}$. For example, finding the linear binary classifier that minimizes the training error $\hat{R}_I(g)$ is NP-hard [20]. One way to make the learning problem tractable is to relax the minimization of $\hat{R}_I(g)$ by defining convex losses that upper bound the one-loss $L_I\big(g,(x,y)\big)$ (1).
To formalize this, we need to introduce some notions used in binary classification. Most learning algorithms do not directly output a $\{\pm 1\}$-valued classifier $g$; rather, they learn a real-valued discriminant function $f$, and obtain $g$ by thresholding the output of $f(x)$ at 0, that is,
$$g(x) = \operatorname{sign}\big(f(x)\big).$$
Using the real-valued output of the discriminant function, the classifier can be evaluated on a finer scale by defining the (functional) margin
$$\rho = \rho\big(f,(x,y)\big) = f(x)\,y.$$
With this notation, the misclassification indicator (one-loss) of a discriminant function $f$ on a sample point $(x,y)$ can be re-defined as a function of the margin $\rho$ by
$$L_I(\rho) = I\{\rho < 0\}.$$
Besides the sign of $\rho$ that represents the classification error, the magnitude of $\rho$ is also important: it indicates the confidence or the robustness of the classification. Indeed, the combination of a large $\rho$ with a Lipschitz-type (slope) penalty on $f$ means that the geometric margin $\rho^{(g)}$ is also large, that is, $x$ is far from the decision border (defined by the set of points for which $f(x) = 0$, see Figure 2).
The idea behind large margin classification is to design loss functions that "push" the decision border away from the training points. The goal is not just to minimize the training zero-one error $\hat{R}_I(g)$ but also to increase the margin of the correctly classified points. The common feature of these loss functions is that 1) they penalize larger errors (negative margins) more than smaller ones, and 2) they keep penalizing even correctly classified points, especially if they are close to the decision border (their margin is close to zero). Formally, one can define convex upper bounds of the margin-based one-loss $L_I(\rho)$, and minimize the corresponding empirical risks instead of the training error $\hat{R}_I$. The most common losses, depicted in Figure 3, are the exponential loss
$$L_{\exp}(\rho) = \exp(-\rho), \qquad (7)$$
the hinge loss
$$L_{\text{hinge}}(\rho) = \max(0, 1 - \rho), \qquad (8)$$
and the logistic loss
$$L_{\log}(\rho) = \log_2\big(\exp(-\rho) + 1\big). \qquad (9)$$
Given the margin-based loss $L_\bullet(\rho)$, the corresponding margin-based empirical risk can be defined as
$$\hat{R}_\bullet(f) = \frac{1}{n} \sum_{i=1}^{n} L_\bullet\big(f(x_i)\,y_i\big).$$

Figure 2: The discriminant function $f(x)$ and the induced classifier $g(x)$. The blue point (with label $y = -1$) is misclassified and the two red points (with label $y = +1$) are correctly classified, but the second red point has a larger margin $f(x)y$, indicating that its classification is more robust.
Minimizing the margin-based empirical risks has both algorithmic and statistical advantages. First, combining the minimization of these risks with Lipschitz-type penalties often leads to convex optimization problems that can be solved with standard algorithms (e.g., linear, quadratic, or convex programming). Second, it can also be shown within the complexity regularization framework that the minimization of these penalized convex risks leads to large margin classifiers with good generalization properties, confirming the intuitive explanation of the previous paragraph.
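The three margin-based losses and the corresponding empirical risk are easy to write down and to check numerically. The sketch below assumes the standard forms $\exp(-\rho)$, $\max(0, 1-\rho)$, and $\log_2(1 + \exp(-\rho))$, and takes the one-loss to be 1 for strictly negative margins; the function names and the toy discriminant are illustrative only.

```python
import math

def one_loss(rho):
    # Misclassification indicator as a function of the margin rho = f(x) * y.
    return 1.0 if rho < 0 else 0.0

def exp_loss(rho):
    return math.exp(-rho)

def hinge_loss(rho):
    return max(0.0, 1.0 - rho)

def logistic_loss(rho):
    # Base-2 logarithm, so that the loss equals 1 at rho = 0.
    return math.log2(1.0 + math.exp(-rho))

def margin_risk(loss, f, data):
    # Margin-based empirical risk: average loss of the margins f(x_i) * y_i.
    return sum(loss(f(x) * y) for x, y in data) / len(data)

# Check on a grid of margins that each convex loss upper-bounds the one-loss.
margins = [i / 10.0 for i in range(-50, 51)]
is_upper_bound = all(
    loss(rho) >= one_loss(rho)
    for loss in (exp_loss, hinge_loss, logistic_loss)
    for rho in margins
)

# Toy usage with a hypothetical linear discriminant f(x) = 2x - 1.
data = [(0.1, -1), (0.4, -1), (0.6, +1), (0.9, +1), (0.45, +1)]
f = lambda x: 2.0 * x - 1.0
risks = {name: margin_risk(loss, f, data)
         for name, loss in [("one", one_loss), ("exp", exp_loss),
                            ("hinge", hinge_loss), ("logistic", logistic_loss)]}
print(is_upper_bound, risks)
```

All three convex losses take the value 1 at $\rho = 0$ and stay positive for small positive margins, which is what keeps "pushing" correctly classified points away from the decision border.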

Binary classification algorithms
The three most popular binary classification algorithms are the multi-layer perceptron or neural network (NN) [1,21], the support vector machine (SVM) [9,22,23], and AdaBoost [24]. They have very different origins, and the algorithmic details also make them stand apart; nevertheless, they share some important concepts, and their practical success is explained, at least qualitatively, by the same theory on large margin classification.
An important pragmatic similarity is that they all learn generalized linear discriminant functions of the form
$$f(x) = \sum_{t=1}^{T} \alpha^{(t)} h^{(t)}(x). \qquad (11)$$
In neural networks $h^{(t)}(x)$ is a perceptron, that is, a linear combination of the input features followed by a sigmoidal nonlinearity $\sigma$ (such as arctan),
$$h^{(t)}(x) = \sigma\Big(\sum_{j=1}^{d} w^{(t)}_j x^{(j)}\Big).$$
In the case of binary classification the neural network is usually trained to minimize the logistic loss (9). One of the advantages of NNs is their versatility: as long as the loss function is differentiable, it can be plugged into the algorithm. The differentiability condition imposed by the gradient descent optimization, on the other hand, constrains the loss function and the nonlinearities used in the hidden units: one could not use, say, a Heaviside-type nonlinearity or the one-loss. The invention of the multilayer perceptron in 1986 [21] was a technological breakthrough: complex pattern recognition problems (e.g., handwritten character recognition) could suddenly be solved efficiently using neural networks. At the same time, the theory of machine learning in those early days was not yet developed to the point of being able to explain the principles behind algorithmic design. Most of the engineering techniques (such as weight decay or early stopping) were developed using a trial-and-error approach, and they were theoretically justified only much later within the complexity regularization and large margin framework [25]. Partly because of the "empirical" nature of neural network design, a common belief developed about the "user-unfriendliness" of neural networks. Added to this reputation was the uneasiness of having a non-convex optimization problem: back-propagation cannot be guaranteed to converge to a global minimum of the empirical risk. This is basically a non-issue in practice (especially on today's large problems); still, it is a common criticism from people who usually have never experimented with neural networks on real problems.
As a consequence, neural networks were slightly overshadowed in the 90s with the appearance of the support vector machine and AdaBoost. Today neural networks are, again, becoming more popular, partly because of user-friendly program packages (e.g., [26]), partly due to the computational efficiency of the training algorithm (especially its stochastic version), and partly because of the recent success of unsupervised feature-learning techniques that use deep neural architectures to solve hard computer vision and language processing problems [13–15]. Support vector machines also learn generalized linear discriminant functions of the form (11) with
$$h^{(t)}(x) = y_t K(x_t, x),$$
where $K(x, x')$ is a positive semidefinite kernel function that expresses the similarity between two observations $x$ and $x'$. $K(x, x')$ is usually monotonically decreasing with the distance between $x$ and $x'$. As in Gaussian process regression (Section 2.1), the most common choice for $K$ is the squared exponential (a.k.a. Gaussian) kernel. The index $t$ ranges over $t = 1,\ldots,n$, which means that each training point $(x_t,y_t) \in D$ contributes to $f$ a kernel function centered at $x_t$ with a sign equal to the label $y_t$, so the final discriminant function can be interpreted as a weighted nearest neighbor classifier where the weight comprises the "similarity term" $K(x_t, x)$ and the "importance term" $\alpha^{(t)}$.
The objective of training the SVM is to find the weight vector $\alpha = \big(\alpha^{(1)},\ldots,\alpha^{(T)}\big)$ (here $T = n$) that minimizes the hinge loss (8) with a complexity penalty (the $L_2$ loss on the weights of features in the Hilbert space induced by $K(x, x')$). The objective function was explicitly designed based on the theory of large margin classification to ensure good generalization properties. A great advantage of the setup is that the objective is a quadratic function of $\alpha$ with linear constraints $0 \le \alpha^{(t)} \le C$ (the so-called box constraints), so the global optimum can be found using quadratic programming. The result is sparse: only a subset of the training points in $D$ have nonzero coefficients $\alpha^{(t)}$. These points are called support vectors, giving the method its name.
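How a trained kernel discriminant is evaluated, and why sparsity matters at test time, can be sketched as follows. The sketch only evaluates $f(x) = \sum_t \alpha^{(t)} y_t K(x_t, x)$; it does not solve the quadratic program. The coefficients `alpha`, the kernel width `w`, and the four training points are hypothetical values picked by hand, not the output of an SVM solver.

```python
import math

def gaussian_kernel(x1, x2, w):
    # Squared exponential kernel on d-dimensional points (assumed convention).
    d2 = sum((u - v) ** 2 for u, v in zip(x1, x2))
    return math.exp(-d2 / (2.0 * w ** 2))

def discriminant(x, train, alpha, w):
    # f(x) = sum_t alpha_t * y_t * K(x_t, x); points with alpha_t = 0
    # (the non-support vectors) are skipped and cost nothing at test time.
    return sum(a * y * gaussian_kernel(xt, x, w)
               for a, (xt, y) in zip(alpha, train) if a != 0.0)

def classify(x, train, alpha, w):
    return 1 if discriminant(x, train, alpha, w) >= 0 else -1

# Hypothetical training points (x_t, y_t) and hand-picked coefficients.
train = [((0.0, 0.0), -1), ((0.2, 0.1), -1), ((1.0, 1.0), +1), ((0.9, 1.2), +1)]
alpha = [0.0, 1.0, 1.0, 0.0]   # two support vectors out of four points
w = 0.5

n_support = sum(1 for a in alpha if a != 0.0)
label_a = classify((1.0, 0.9), train, alpha, w)
label_b = classify((0.1, 0.0), train, alpha, w)
print(n_support, label_a, label_b)
```

The cost of one evaluation grows with the number of support vectors, which is the reason the test-phase speed of an SVM depends on how sparse the solution is.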
The biggest disadvantage of the technique is its training time: naïve quadratic programming solvers run in super-quadratic time, and even with the most sophisticated tricks it is hard to beat the $O(n^2)$ barrier. The second disadvantage of the method is that in high-dimensional problems the number of support vectors is comparable to the size of the training set, so, when evaluating $f$ at the test phase is a bottleneck (see Section 6.1), SVMs can be prohibitively slow. Third, unlike neural networks, SVMs are designed for binary classification, and the generic extensions for multi-class classification and regression do not reproduce the performance of binary SVM. Despite these disadvantages, the appearance of the support vector machine in the middle of the 90s revolutionized the technology of pattern recognition. Besides its remarkable generalization performance, the small number of hyperparameters and the appearance of turn-key software packages made SVMs the method of choice for a wide range of applications involving small-to-moderate size training sets. With the appearance of extra-large training sets in the last decade, training time and optimization became more important issues than generalization [27], so SVMs somewhat lost their dominant role.
AB learns a generalized linear discriminant function of the form (11) in an iterative fashion. It can be considered as constructing a neural network by adding one neuron at a time, but since backpropagation is no longer applied, there is no differentiability restriction on the base classifiers h (t) . This opens the door to using domain-dependent features, making AB easy to adapt to a wide range of applications. The basic binary classification algorithm (Figure 4) consists of elementary steps that can be implemented by a first year computer science student in an hour. The only tricky step is the implementation of the base learner (line 3) whose goal is to return h (t) with a small weighted error but most of the simple learning algorithms (e.g., decision trees, linear classifiers) can be easily adapted to this modified risk. The weights w = (w 1 ,...,w n ) over the training points are initialized uniformly (line 1), and then updated in each iteration by a simple and intuitive rule (lines 7-10): if h (t) misclassifies (x i ,y i ), the weight w i increases (line 8), whereas if h (t) correctly classifies (x i ,y i ), the weight w i decreases (line 10). In this way subsequent base classifiers will concentrate on points that were "missed" by previous base classifiers. The coefficients α (t) are also set analytically to the formula in line 5 which is monotonically decreasing with respect to the weighted error.
AB D, B(·, ·), T 1 w (1) ← (1/n,...,1/n) ⊲ initial weights for i ← 1 to n ⊲ re-weighting the training points ⊲ weighted "vote" of base classifiers This simple algorithm has several beautiful mathematical properties. First it can be shown [24] that if each base classifier h (t) is slightly better than a random guess (that is, every weighted error ǫ (t) ≤ 1 2 − δ with δ>0), then the (unweighted) training error R I of the final discriminant function f (T ) becomes zero after at most T = log n 2δ 2 + 1 steps. The logarithmic dependence of T on the data size n implies that the technique is formally a boosting algorithm from which the second part of its name derives. 8 Second, it can be shown that 8 The algorithm is also Aptive because the coefficients α (t) depend on the errors ǫ (t) of the base classifiers h (t) .

02001-p.11
the algorithm minimizes the exponential risk R  ( f ) (7) using a greedy functional gradient approach [28,29]. In this alternative but equivalent formulation, in each iteration h (t) is selected to maximize the decrease (derivative) of R  f (t−1) + αh (t) at α = 0, and then α (t) is set to Furthermore, for inseparable data sets (for which no linear combination of base classifiers achieves 0 training error), the exponential risk R  ( f (T ) ) goes to the minimum achievable exponential risk as T →∞ [30]. It can also be proven [31] that for the normalized discriminant function , the margin-based training error also goes to zero exponentially fast for all θ< ρ * /2, where is the maximum achievable (normalized) minimum margin. This results shows that AB, similarly to the support vector machine, leads to a large margin classifier. It also explains the surprising experimental observation that the generalization error R I ( f ) (estimated on a test set) often decreases even after the training error R I ( f ) reaches 0. AB arrived to the pattern recognition scene in the late 90s when support vector machines were in full bloom. AB has a lot of advantages over its main rival: it is fast (its time complexity is linear in n and T ), it is an any-time algorithm 9 , it has few hyperparameters, it is resistant to overfitting, and it can be directly extended to multi-class classification [32]. Due to these advantages, it rapidly became the method of choice of machine learning experts in certain types of problems where the natural setup is to define a large set of plausible features of which the final solution f (T ) uses only a few (a so called sparse classifier). Arguably the most famous application is the first face detector running real-time on 30 frame-per-second video [33]. At the same time, AB is much less known among users with no computer science background than the support vector machine. 
The reason is, paradoxically, the simplicity of AdaBoost: since it is so easy to implement with a little background in programming, no machine learning expert took the effort to provide a turn-key software package easily usable for non-experts. In fact it is not surprising that the only large non-computer-scientist community in which AdaBoost is much more popular than SVM is experimental physics: physicists, especially in high energy physics, are heavy computer users with considerable programming skills. In the last five years AdaBoost (and other similar ensemble methods) seem to have been taking over the field of large-scale applications. In recent large-scale challenges [34,35] the top entries are dominated by ensemble-based solutions, and SVM is almost non-existent in the most efficient approaches.
Although binary AB (Figure 4) is indeed simple to implement, multi-class AB has some nontrivial tricks. There are several multi-class extensions of the original binary algorithm, and it is not well-known, even among machine learning experts, that the original AB.M1 and A-B.M2 algorithms [24] are largely suboptimal compared to AB.MH published a couple of years later [32]. 11 On the other hand, the AB.MH paper [32] did not specify the implementation details of multi-class base learning, making the implementation non-trivial. For more information on multi-class AB.MH we refer the reader to the documentation of MB, available from the http://multiboost.org website.

Applications
In this section we illustrate the versatility of the abstract supervised learning model described in this chapter through presenting some of the machine learning applications we have worked on. It turns out that real-world applications are never so simple as just taking a turn-key implementation out of the box and running it. To make a system work, one needs both domain expertise and machine learning expertise, and so it often requires an interdisciplinary approach and considerable communication effort from experts of different backgrounds. Nevertheless, once the problem is reduced to the abstract setup described in Section 1, machine learning algorithms can be very efficient.
As the examples will show, classification or regression is only one step in the pattern recognition pipeline (Figure 5). Data collection is arguably the most costly step in most of the applications. In particular, harvesting the labels $y_i$ usually involves human interaction. How to do this in a cost-effective way by, for example, using CAPTCHAs to obtain character labels [36] or making people label images by playing a game [37,38], is itself a challenging research subject. On the other hand, in experimental physics simulators can be used to sample from the distribution $p(x \mid y)$, so in these applications data collection is usually not an issue. The second step of the pipeline is preprocessing the data. There is a wide range of techniques that can be applied here, usually with the objective of simplifying the supervised learning task. First, if the observations are not already in the format of a real vector $x \in \mathbb{R}^d$, features (column vectors $\big(x^{(j)}_1,\ldots,x^{(j)}_n\big)$ of the data matrix) should be extracted from the raw data. The goal here is to find features that are plausibly correlated with the label $y$, so domain knowledge is almost always necessary to carry out this step. Since the performance of learning algorithms usually deteriorates as the dimension of the input $x$ increases (because of the so-called curse of dimensionality), dimensionality reduction is often part of data preprocessing. Principal component analysis is the basic tool that is usually applied, but nonlinear manifold algorithms [39,40] may also be useful if the training set is not too large. The output space $\mathcal{C}$ can also be transformed to massage the original problem into a setup that can be solved by standard machine learning tools. This can be done in an ad hoc way, but there also exist principled reduction techniques between machine learning problems (binary/multi-class classification, regression, even reinforcement learning) [41].
After training, postprocessing the results can also be an important step. If, for example, the original complex problem was reduced to an easy-to-solve setup, re-transforming the obtained labels and even re-calibrating the solution is often a good idea. In well-documented challenges of the last five years [34,35,42] we have also learned that the most competitive results are obtained when diverse models trained using different algorithms are combined, so model aggregation techniques are also becoming part of the generic machine learning toolbox.
Sometimes an application poses a specific problem that cannot be solved with existing tools. Solving such a problem and generalizing the solution to make it applicable to similar problems is one of the motivational engines behind the development of the machine learning domain.

Music classification, web page ranking, muon counting
In [43] we use AB for telling apart speech from music. The system starts by constructing the spectrogram on a set of recorded audio signals which constitute the observation vectors x in this case. AB is then run in a feature space inspired by image classification on the spectrogram "images". The output y of the system is binary, that is, y ∈C= {, }. In [44] we stay within the music classification domain but we tackle a more difficult problem: finding the performing artist and the genre of a song based on the audio signal of the first [30]s of the song. The first module of the classifier pipeline is an elaborate signal processor that collects a vector of features for each [2]s segment and then aggregates them to constitute an observation vector x with about 800 components per song. We train two systems, one for finding the artist performing the song and another for predicting its genre. This application also stretches the limits of the classical multi-class (single-label) setup. It is plausible that a song belongs to several genres leading to a so-called multi-label classification. It may also be useful to train a unique system for predicting both the artist and the genre at the same time. This problem is known as multi-task learning.
The multivariate regression problem can be illustrated by our recent work [45] in which we aim to predict the number of muons in a signal recorded in a water Cherenkov detector of the Pierre Auger experiment [46]. The pipeline, again, starts by extracting features that are plausibly correlated with the muon content of the signal. The 172 features are strongly correlated, so we first apply principal component analysis to reduce the size of the observation vector x to 19. Then we train a neural network [1,26] and convert its output into a point estimate and a pointwise error bar (uncertainty) of the number of muons.
In [47,48] we use AB.MH for web page ranking. The input x of the classifier g is a set of features representing a search query and a document, and the label y represents the relevance of the document to the query. The goal is to rank the documents in order of their relevance. Of course, if the relevance of the document is correctly predicted, the implied ranking will also be good. Nevertheless, it is hard to formally derive a meaningful loss for each (document, query, relevance) triplet from the particular loss defined on rankings. The solution to this problem is a post-processing technique called calibration: instead of directly using the output of the classifier g, we send it through another regressor or ranker which is now trained to minimize the desired risk. Our concrete system also contains another 02001-p.14 postprocessing module called model aggregation: instead of training one classifier g, we train a large number of classifiers by varying data preprocessing and algorithmic hyperparameters, and combine the results using a simple weighted voting scheme.

Three open research problems
In this section we briefly describe three open research problems that are either relevant to or even motivated by certain unorthodox applications of multivariate discrimination in experimental physics.

Trigger design: classification with test-time constraints
One of the recent applications of multivariate discriminants in high-energy physics is trigger design [49]. The goal of a trigger is to separate signal events, generated by the phenomenon to be detected or measured, from background events, which merely happen to look like real signal because of random fluctuations or some uninteresting phenomenon. The final goal of the analysis is to collect a large sample of observations that, with high probability, were generated by the targeted phenomenon. Since the background is often several orders of magnitude more frequent than the signal, part of the background/signal separation cannot be done offline. Due to either limited disk capacity or limited bandwidth, most of the raw signal has to be discarded by online triggers.
In the machine learning paradigm, a trigger is just a binary classifier with C = {signal, background}. There are several attributes that make the trigger a special binary classifier. First, it is extremely unbalanced: the probability of background is practically 1 in a lot of cases. This makes the classical setup of minimizing the error probability $R_I(g)$ (2) inadequate: it is very hard to beat the trivial constant classifier $g(x) \equiv \mathrm{background}$, which has an error of $R_I(g) = P(y = \mathrm{signal})$. Indeed, the natural gain in trigger design is the true positive or hit indicator
$$G_{\mathrm{TP}}\big(g,(x,y)\big) = I\{g(x) = \mathrm{signal} \text{ and } y = \mathrm{signal}\}.$$
Taking the complement of the true positive indicator as the loss, the implied risk is
$$R_{\mathrm{TP}}(g) = 1 - P\big(g(x) = \mathrm{signal},\, y = \mathrm{signal}\big).$$
Again, minimizing $R_{\mathrm{TP}}(g)$ is trivial by setting $g(x) \equiv \mathrm{signal}$ this time, so the goal is to minimize $R_{\mathrm{TP}}(g)$ under the constraint that the false positive rate $R_{\mathrm{FP}}(g) = P\big(g(x) = \mathrm{signal} \mid y = \mathrm{background}\big)$ is kept below a fixed level $p_{\mathrm{FP}}$. As we mentioned above, in triggers we have $P(y = \mathrm{signal}) \ll P(y = \mathrm{background})$, so the false positive rate $R_{\mathrm{FP}}(g)$ is approximately equal to the unconditional positive rate $P(g(x) = \mathrm{signal})$. In experimental physics terminology this means that a constraint is imposed on the trigger rate.
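The constrained setup can be made concrete with a small sketch. It assumes that a real-valued discriminant score is already available for each event and that an event is accepted as signal when its score exceeds a threshold; the function names, the toy scores, and the 20% false positive budget are all hypothetical.

```python
def threshold_for_fp_rate(background_scores, max_fp_rate):
    """Smallest threshold t such that the fraction of background events
    with score > t does not exceed max_fp_rate."""
    ranked = sorted(background_scores, reverse=True)
    # Number of background events we may accept (rounded down, so conservative).
    n_allowed = int(max_fp_rate * len(ranked))
    if n_allowed >= len(ranked):
        return float("-inf")  # the constraint is vacuous: accept everything
    # At most n_allowed scores are strictly larger than the (n_allowed + 1)-th one.
    return ranked[n_allowed]


def acceptance_rate(scores, threshold):
    """Fraction of events classified as signal by the rule score > threshold."""
    return sum(s > threshold for s in scores) / len(scores)


# Toy scores, invented for illustration.
background = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
signal = [0.85, 0.9, 0.5, 0.99]

t = threshold_for_fp_rate(background, max_fp_rate=0.2)
print("threshold:", t)
print("false positive rate:", acceptance_rate(background, t))
print("true positive rate:", acceptance_rate(signal, t))
```

Fixing the threshold on background alone and then reading off the true positive rate is only one operating point; comparing discriminants at the same false positive budget is what the constrained criterion amounts to.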
The second attribute that makes trigger design special is that we have strict computational constraints imposed on the evaluation of the classifier g on test samples x. Typically, observations x arrive at a given rate and g(x) has to be run online. The designer can have some flexibility on the parallel handling of the incoming events, but computational resources are often limited. In certain cases the power consumption may also be limited, and harsh conditions may require the use of robust hardware with a low clock rate.
Trigger design shares these two attributes (unbalanced classes and test-time constraints) with object detection in computer vision, where machine learning has been applied widely. For example, when the goal is to detect faces in images, the probability of the signal class (face) is much lower than the probability of background (everything else). We also have computational constraints at test time if the goal is to detect faces online in video recordings with a given frame rate and the detector hardware must fit into a compact camera. What makes trigger design slightly more challenging is the sometimes extremely low signal probability and the fact that the computational cost of each feature (component of x) can vary over a large range. For example, the LHCb trigger [49,51] can use "cheap" observables that can be evaluated fast to rapidly obtain a rough classifier. One can also almost fully reconstruct the collision event, which can take up a large portion of the allotted time, but the resulting feature can be used reliably to discard background events.
This example shows why a natural answer to these challenges is to design cascade classifiers both in experimental physics and in object detection [33,52–56]. A cascade classifier g(x) is composed of a list of simpler binary classifiers $h_1(x), \ldots, h_N(x)$ evaluated sequentially. For j < N, if $h_j(x)$ classifies the observation x negatively, the final classification of g(x) is background, and if the output of $h_j(x)$ is positive, the observation is sent to the next, (j + 1)th stage (or level in the terminology of experimental physics). The classifier $h_N(x)$ at the last stage is the only one that can classify x as signal. In computer vision, the stage classifiers $h_j(x)$ are usually learned using classification algorithms (most often AdaBoost). Some of the newest detection algorithms also attempt to learn the cascade structure automatically. Nevertheless, manual experimentation is usually required to set hyperparameters (stagewise false positive/false negative rates, computational complexity of the stage classifiers $h_j(x)$). In [57], we present a principled approach that can be used to automatically design test-time constrained classifiers. The automatic design also allows us to go beyond the cascade structure which is, although quite intuitive, an artificial constraint to keep the classifier structure simple and to accommodate manual tuning.
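The sequential evaluation of a cascade can be sketched in a few lines. The stage classifiers below are invented threshold cuts on a three-component observation, chosen only to show the control flow; they are not taken from any of the cited systems.

```python
def cascade_classify(x, stages):
    """Evaluate a cascade of binary stage classifiers on observation x.

    stages: list of callables h_j(x) -> bool, where True means "positive".
    Returns the final label and the number of stages actually evaluated,
    which is a rough proxy for the test-time cost.
    """
    n_evaluated = 0
    for h in stages:
        n_evaluated += 1
        if not h(x):
            # A single negative stage output rejects the observation.
            return "background", n_evaluated
    # Only an observation accepted by every stage is labeled signal.
    return "signal", n_evaluated


# Hypothetical stages, ordered from cheap to expensive.
stages = [
    lambda x: x[0] > 1.0,          # cheap cut on one observable
    lambda x: x[0] + x[1] > 3.0,   # cut combining two observables
    lambda x: abs(x[2]) < 0.5,     # stand-in for an expensive reconstructed feature
]

print(cascade_classify((0.5, 5.0, 0.0), stages))  # rejected by the first stage
print(cascade_classify((2.0, 2.0, 0.1), stages))  # passes all three stages
```

Because most observations are background, most of them leave the cascade after the first, cheapest stages, which is what keeps the average evaluation cost low.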

Learning to discover
The main application of multivariate discrimination in high-energy particle physics is, interestingly, not classification. The goal in these applications is to increase the sensitivity of counting-based statistical tests [58,59]. Intuitively, the goal is to find regions of the feature space where the signal is present or where it is amplified with respect to its average abundance. Once the subregion is found, we claim the discovery of a novel phenomenon (particle) when the number of events in the region is significantly higher than that predicted by the pure background hypothesis. The simplest formal goal, motivated by a Poisson test, is to maximize
$$G(f) = \frac{s}{\sqrt{b}}, \qquad (12)$$
where s and b are the numbers of signal and background events, respectively, that fall into the selected region. Although G(f) has the right "flavor", it does not take into consideration the statistical fluctuations of the test when s and/or b are small, so other, more sophisticated, approximate criteria have also been derived [60]. Whereas all these criteria (including (12)) are clearly related to the classical classification error $R_I(f)$, the two are not equivalent. A notable difference is that the expectation of G depends on the sample size n, whereas increasing the sample size only decreases the variance of $R_I$. Nevertheless, the standard practice is to learn a discriminant function $f : \mathbb{R}^d \to \mathbb{R}$ on D using standard classification methods that minimize the classification error $R_I$, and then use G only to optimize a threshold $\theta$ which defines the function $g_\theta$ by
$$g_\theta(x) = \begin{cases} \mathrm{signal} & \text{if } f(x) > \theta, \\ \mathrm{background} & \text{otherwise.} \end{cases}$$
This way of using multivariate discriminants raises several interesting research questions. First, given a concrete test, what should the training criterion G be? Second, what is the relationship between G and $R_I(f)$? Finally, how should classical classification algorithms be adapted to the given criterion?
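The standard practice of tuning only the threshold can be sketched as a simple scan. The sketch assumes the simple counting significance s/√b as the criterion, with s and b the numbers of signal and background events whose discriminant value exceeds the threshold; event weights, which matter in real analyses, are left out, and the function name and toy data are hypothetical.

```python
import math


def best_threshold(scores, is_signal):
    """Scan the observed scores as candidate thresholds and return the pair
    (theta, significance) that maximizes s / sqrt(b).

    scores:    discriminant values f(x_i).
    is_signal: booleans, True if event i is labeled signal.
    Thresholds that select no background event are skipped, because
    s / sqrt(b) is undefined there.
    """
    best_theta, best_value = None, 0.0
    for theta in sorted(set(scores)):
        s = sum(1 for f, y in zip(scores, is_signal) if f > theta and y)
        b = sum(1 for f, y in zip(scores, is_signal) if f > theta and not y)
        if b == 0:
            continue
        value = s / math.sqrt(b)
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta, best_value


# Toy labeled sample, invented for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
is_signal = [True, True, False, True, True, False, False, False]

theta, significance = best_threshold(scores, is_signal)
print(theta, significance)
```

Skipping the thresholds with b = 0 is exactly where the simple criterion breaks down for small counts, which is one motivation for the more refined approximations mentioned in the text.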

Deep learning for automatic feature construction
Finally, let us raise a completely open research question. The current technology of multivariate discrimination allows us to classify objects that are represented by vectors $x = (x^{(1)}, \ldots, x^{(d)})$ of fixed length d. The representation has to be "semantically aligned", that is, $x_1^{(j)}$ must be comparable to $x_2^{(j)}$ in two objects $x_1$ and $x_2$. At the same time, raw observations, be they pixels of an image or electronic channels in a detector event, seldom come in this form. Today, the process of translating the raw observation into a semantically aligned vector of features is in the hands of the domain expert. One of the most important movements of the last decade in machine learning is the development of a family of techniques for building deep unsupervised neural architectures [13]. The goal is to construct representations in a mostly unsupervised way that can capture deep invariances in the raw data. Although the field is rather young, it has already had a major impact on natural language processing [14] and computer vision [15]. Interesting and natural questions are whether these techniques can be adapted to scientific data, whether the invariances in, say, raw detector data are sufficient to construct representations that could be useful in the ultimate analysis, and whether automatic or semi-automatic feature extraction can improve on the manually constructed representations.