4.2 Machine Learning Techniques
While the focus of the previous section was defining the main problems and properties that can be addressed with machine learning techniques, details about the actual computational and statistical procedures used for learning were not provided. In this section, the basis of the two classes of algorithms that are used elsewhere in this work is described in detail: boosted decision trees and artificial neural networks. These families of learning methods are also those most commonly used in machine learning within experimental particle physics, mostly to solve supervised learning problems, as will be described in Section 4.3. The overview included here is by no means a comprehensive treatment of these approaches or of alternative popular statistical learning techniques such as random forests or support vector machines; the following references provide a more extensive review [113], [115], [116].
4.2.1 Boosted Decision Trees
The term boosted decision trees (BDT) refers to a large family of algorithms that are based on additively constructing ensembles of decision trees for supervised learning tasks [117]–[119] such as those described in Section 4.1.1. A subset of these techniques, often referred to as gradient boosting, is particularly useful for classification and regression problems. The basis for these methods is that a strong model can be obtained by combining the outcomes of a set of weak models, e.g. shallow binary decision trees, if they are built to minimise the residual error at each stage. Gradient boosting algorithms can be applied to any supervised task as long as it can be specified by a differentiable loss function, and they can be understood as gradient descent (which will be discussed in Section 4.2.2) in function space [120].
While it can be applied to other weak learners, gradient boosting is often used to learn ensembles of decision trees. A decision tree is a hierarchical branched structure that associates an outcome with each input \(\boldsymbol{x}\in\mathcal{X}\) by partitioning the input space into disjoint subsets \(R = (\mathcal{X}_0, ..., \mathcal{X}_L)\), each leaf associated with a constant prediction \(w_r\). A generic type of decision tree, referred to as classification and regression trees (CART) [121], can be expressed as a function of the input \(t(\boldsymbol{x})\) as a sum over the indicator functions \(\mathbb{1}_{\mathcal{X}_r}(\boldsymbol{x})\) of each subspace (see Equation 3.6) as follows: \[ t(\boldsymbol{x}) = \sum_{\mathcal{X}_r \in R} w_r \mathbb{1}_{\mathcal{X}_r}(\boldsymbol{x}) \qquad(4.13)\] where \(w_r\) is the outcome for each subspace, noting that the summands will be zero for all subsets \(\mathcal{X}_r\) except one because they are disjoint. The indicator function \(\mathbb{1}_{\mathcal{X}_r}(\boldsymbol{x})\) of a given subspace is specified by a series of binary decisions, each on a single feature. If the leaf predictions \(w_r\) are categorical, the resulting model \(t(\boldsymbol{x})\) is referred to as a classification tree; if \(w_r\) are numerical, \(t(\boldsymbol{x})\) is a regression tree. In the context of gradient boosting, regression trees are often more useful, even for classification tasks, since regression trees can be used in conjunction with soft classification loss functions (e.g. cross entropy). The rest of this section therefore focusses on gradient boosting with regression trees. A schematic representation of a regression tree is provided in Figure 4.1, which corresponds to the first tree in the ensemble used for signal versus background classification in the analysis described in Chapter 5.
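As a concrete illustration of Equation 4.13, the following minimal Python sketch (with hypothetical feature indices, thresholds and leaf weights) evaluates a small regression tree by descending its binary decisions until a leaf subset \(\mathcal{X}_r\) is reached:

```python
import numpy as np

# Minimal sketch of a CART regression tree with a hypothetical structure:
# each internal node tests a single feature against a threshold, and each
# leaf r holds a constant prediction w_r, so that t(x) = w_r for the
# unique disjoint subset X_r that contains x.
def cart_predict(x):
    if x[0] < 0.5:          # first binary decision, on feature 0
        if x[2] < 1.3:      # second binary decision, on feature 2
            return 0.8      # leaf weight w_0
        return -0.2         # leaf weight w_1
    return 0.1              # leaf weight w_2

print(cart_predict(np.array([0.3, 1.7, 0.9])))  # -> 0.8
```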
Given its structural limitations, a single CART tree of small maximum depth \(d\) performs rather poorly on complex non-linear supervised learning tasks. If \(d\) is very large, the problem of learning an optimal tree based on data is computationally very demanding, and the resulting model would not generalise well to unseen data. This motivates the use of tree ensembles, where the final prediction combines the predictions of several small trees. For an ensemble of \(K\) CART trees, the final model prediction \(T(\boldsymbol{x})\) can be expressed as: \[ T(\boldsymbol{x}) = \sum^K_{j=1} t_j(\boldsymbol{x}) \qquad(4.14)\] where each \(t_j(\boldsymbol{x})\) is a CART model, as described in Equation 4.13. Other regression tree ensembles based on alternative methods such as bagging [122] can also be expressed by a similar combination of predictions. The learning problem can be expressed as empirical risk minimisation in the space of possible tree ensembles over the learning set of labelled observations \(S=\{(\boldsymbol{x}_0,\boldsymbol{y}_0),...,(\boldsymbol{x}_n,\boldsymbol{y}_n)\}\), as expressed in Equation 4.3. The total empirical risk functional \(R(T)\) for an ensemble of \(K\) trees can usually be written as: \[ R(T) = \sum_{(\boldsymbol{x}_i,\boldsymbol{y}_i) \in S} L(\boldsymbol{y}_i, T(\boldsymbol{x}_i)) + \sum_{j=1}^K \Omega(t_j) \qquad(4.15)\] where \(L(\boldsymbol{y}_i, T(\boldsymbol{x}_i))\) is the preferred loss function for the task (e.g. binary cross entropy as defined in Equation 4.8) and \(\Omega(t_j)\) is a regularisation term that depends on the properties of each tree and controls the complexity of the model in order to avoid overfitting.
Because learning the structure and leaf weights \(w_r\) of all trees in the ensemble at the same time is intractable, boosting is based on sequentially learning trees. At each step, a tree \(t_j\) is built to improve over the previous ensemble of trees \(T_{(j-1)}(\boldsymbol{x})\). The prediction for each observation in the learning set at a given step \(j\) of the training procedure can then be expressed as: \[ T_j(\boldsymbol{x}_i) = T_{(j-1)}(\boldsymbol{x}_i) + t_j(\boldsymbol{x}_i) \qquad(4.16)\] which can be used to redefine the equivalent risk from Equation 4.15 at each training step, where the tree \(t_j(\boldsymbol{x})\) is being created, as: \[ R(T_j) = \sum_{(\boldsymbol{x}_i,\boldsymbol{y}_i) \in S} L(\boldsymbol{y}_i, T_{(j-1)}(\boldsymbol{x}_i) + t_j(\boldsymbol{x}_i)) + \Omega(t_j) + \mathrm{const.} \qquad(4.17)\] where the regularisation terms of the previously built trees have been absorbed into the constant, and the loss \(L(\boldsymbol{y}_i, T_{(j-1)}(\boldsymbol{x}_i) + t_j(\boldsymbol{x}_i))\) can be expanded as a second-order Taylor series in \(t_j(\boldsymbol{x}_i)\), given that at step \(j\) the ensemble \(T_{(j-1)}(\boldsymbol{x})\) is constant. Omitting constant terms, which do not play any role in risk minimisation, the risk at a given training step can be expressed as: \[ \begin{aligned} R(T_j) \approx \sum_{(\boldsymbol{x}_i,\boldsymbol{y}_i) \in S} \bigg( & \underbrace{\frac{ \partial L(\boldsymbol{y}_i, T_{(j-1)}(\boldsymbol{x}_i))}{ \partial T_{(j-1)}(\boldsymbol{x}_i)}}_{g_i} t_j(\boldsymbol{x}_i) \\ &+ \frac{1}{2} \underbrace{\frac{\partial^2 L(\boldsymbol{y}_i, T_{(j-1)}(\boldsymbol{x}_i))}{ \partial T^{2}_{(j-1)}(\boldsymbol{x}_i)}}_{h_i} t_j^2(\boldsymbol{x}_i) \bigg) + \Omega(t_j) \end{aligned} \qquad(4.18)\] where \(g_i\) and \(h_i\) are so-called gradient statistics, computed using the first and second partial derivatives of the loss function with respect to the ensemble prediction at the previous step \(T_{(j-1)}(\boldsymbol{x}_i)\). At each step the learning problem can then be reduced to choosing a tree structure and weights, characterised by the function \(t_j\), that minimise \(R(T_j)\). This technique can therefore be applied to any supervised learning task as long as the associated loss function is differentiable.
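For concreteness, the gradient statistics take a particularly simple form for the binary cross-entropy loss when the ensemble output is interpreted as a raw log-odds score, the convention followed by common gradient boosting implementations. The following sketch, under that assumption, computes \(g_i\) and \(h_i\):

```python
import numpy as np

# Sketch of the gradient statistics of Equation 4.18 for the binary
# cross-entropy loss, assuming the ensemble prediction T_{j-1}(x_i) is a
# raw log-odds score.
def gradient_statistics(y, t_prev):
    p = 1.0 / (1.0 + np.exp(-t_prev))  # sigmoid of current ensemble score
    g = p - y                          # first derivative  dL/dT
    h = p * (1.0 - p)                  # second derivative d^2L/dT^2
    return g, h

y = np.array([1.0, 0.0, 1.0])          # labels
t_prev = np.array([0.2, -0.4, 1.1])    # predictions of T_{j-1}
g, h = gradient_statistics(y, t_prev)
```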
A common regularisation term, the one used by the xgboost library [123] employed for training the classifier in Chapter 5, is a combination of the number of leaves \(L\) and the squared sum of the leaf weights \(w_r\) over all the leaves: \[ \Omega(t_j) = \gamma L + \frac{1}{2} \lambda\sum_{\mathcal{X}_r\in R} w_r^2 \qquad(4.19)\] where \(\gamma\) and \(\lambda\) are constants that regulate the relative importance of each regularisation component. Using the previous regularisation term, it is possible to redefine the risk of a given tree structure and set of leaf weights at a given training step as: \[ R(T_j) \approx \sum_{\mathcal{X}_r \in R} \left ( w_r \underbrace{\sum_{\boldsymbol{x}_i \in S} g_i \mathbb{1}_{\mathcal{X}_r}(\boldsymbol{x}_i)}_{G_r} + \frac{1}{2} w^2_r \bigg( \underbrace{\sum_{\boldsymbol{x}_i \in S} h_i \mathbb{1}_{\mathcal{X}_r}(\boldsymbol{x}_i)}_{H_r} + \lambda \bigg) \right ) + \gamma L \qquad(4.20)\] where \(G_r\) and \(H_r\) represent the sums of \(g_i\) and \(h_i\) over all the observations in the learning set that correspond to the leaf indexed by \(r\). The previous expression can in turn be used to obtain the optimal leaf weight \(w_r^{\star}\) and simplify the risk at a given step as follows: \[ w_r^{\star} = - \frac{G_r}{H_r + \lambda} \Rightarrow R(T_j) = - \frac{1}{2} \sum_{\mathcal{X}_r \in R} \frac{G^2_r}{H_r + \lambda} + \gamma L \qquad(4.21)\] where \(\mathcal{X}_r\) are the subsets of the input space corresponding to the leaves of the last tree \(j\). The last expression for \(R(T_j)\) can be used to compare tree structures to be added to the ensemble in a principled manner.
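A minimal sketch of Equation 4.21, assuming the per-leaf sums \(G_r\) and \(H_r\) have already been computed (the values below are hypothetical), could read:

```python
import numpy as np

# Sketch of Equation 4.21: given per-leaf gradient sums G_r and H_r,
# compute the optimal leaf weights w_r* and the simplified risk used to
# score the tree structure.
def structure_score(G, H, lam=1.0, gamma=0.1):
    G, H = np.asarray(G), np.asarray(H)
    w_star = -G / (H + lam)                        # optimal leaf weights
    risk = -0.5 * np.sum(G**2 / (H + lam)) + gamma * len(G)
    return w_star, risk

# hypothetical sums for a three-leaf tree
w_star, risk = structure_score(G=[4.0, -2.5, 0.7], H=[10.0, 6.0, 2.0])
```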
In practice, the number of possible tree structures is infinite, so the problem of finding the optimal tree at each step is still intractable. A greedy heuristic is instead used, which proceeds one level of the tree at a time. For each input feature, the optimal splitting at a given level can be found by maximising the splitting gain, which can be done very efficiently by sorting the observations in that feature and finding the threshold that maximises the gain \(\mathcal{G}\), defined as: \[ \mathcal{G} = \frac{1}{2} \left( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L+G_R)^2}{H_L + H_R + \lambda} \right) - \gamma \qquad(4.22)\] where \(G_L\) and \(H_L\) are the sums of gradient statistics left of the threshold and \(G_R\) and \(H_R\) are those right of the threshold. If the gain is negative for every threshold, not splitting on the considered feature is preferred. Once the optimal splitting is determined for all the features, the feature that provides the largest gain, i.e. the minimal risk as defined in Equation 4.21, is chosen. The algorithm then proceeds to the next tree level until the maximum tree depth is reached or any additional splitting degrades the performance.
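The following sketch illustrates this exact greedy split search for a single feature: observations are sorted by feature value and the gain of Equation 4.22 is evaluated at every candidate threshold while accumulating the left and right gradient sums.

```python
import numpy as np

# Sketch of the exact greedy split search for one feature: sort by
# feature value, then scan thresholds while accumulating left/right
# gradient sums to evaluate the gain of Equation 4.22.
def best_split(x_feat, g, h, lam=1.0, gamma=0.1):
    order = np.argsort(x_feat)
    g, h = g[order], h[order]
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(g) - 1):
        G_L += g[i]; H_L += h[i]
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_thr = 0.5 * (x_feat[order][i] + x_feat[order][i + 1])
    return best_gain, best_thr  # negative best_gain -> prefer no split
```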
Boosted tree ensembles are prone to overfitting to the learning set, so additional heuristics are often used to improve generalisation. A common approach, after each step that produces a tree \(t_j\) by the procedure outlined before, is to define the ensemble for the next step by weighting the contribution from the last tree, \(T_j(\boldsymbol{x}_i) = T_{(j-1)}(\boldsymbol{x}_i) + \eta\, t_j(\boldsymbol{x}_i)\), where \(\eta\) is referred to as the learning rate or shrinkage. The use of \(\eta<1\) produces a less efficient learning procedure, so additional trees are required in the ensemble; however, the resulting model is less prone to overfitting. Other policies against overfitting include subsampling the set of observations or the feature vector dimensions. Early stopping, as defined in Section 4.1.1, can also be trivially applied to boosted tree ensembles simply by leaving out the last \(n\) trees in the summation so that the risk over the validation set is minimised.
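As an illustration, these heuristics map directly onto configuration parameters of the xgboost library; the following sketch uses hypothetical parameter values, not those of the classifier trained in Chapter 5:

```python
import xgboost as xgb

# Hypothetical configuration sketch: the shrinkage and regularisation
# heuristics above expressed as xgboost parameters.
model = xgb.XGBClassifier(
    n_estimators=500,      # number of trees K in the ensemble
    max_depth=4,           # maximum depth d of each weak learner
    learning_rate=0.1,     # shrinkage factor eta
    gamma=0.1,             # penalty per leaf (gamma in Equation 4.19)
    reg_lambda=1.0,        # L2 penalty on leaf weights (lambda)
    subsample=0.8,         # observation subsampling fraction per tree
    colsample_bytree=0.8,  # feature subsampling fraction per tree
)
# Training with validation-based early stopping (API details depend on
# the library version):
# model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
```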
4.2.2 Artificial Neural Networks
An alternative way to carry out empirical risk minimisation is based on considering a function \(f(\boldsymbol{x}; \boldsymbol{\phi})\), which depends on a vector of parameters \(\boldsymbol{\phi}\), and attempting to find the values of \(\boldsymbol{\phi}\) that minimise the risk \(R_S(f)\) over the learning set \(S=\{(\boldsymbol{x}_0,\boldsymbol{y}_0),..., (\boldsymbol{x}_n,\boldsymbol{y}_n)\}\). If \(f(\boldsymbol{x}; \boldsymbol{\phi})\) is differentiable with respect to the parameter vector \(\boldsymbol{\phi}\), the minimisation from Equation 4.4 can be attempted with gradient-based methods. The simplest gradient-based optimisation technique is referred to as gradient descent (GD), and can be applied to the previous problem by initialising the parameter vector at random \(\boldsymbol{\phi}^0\) and then iteratively updating the model parameters \(\boldsymbol{\phi}\) at each step \(t\) according to: \[ \boldsymbol{\phi}^{t+1} = \boldsymbol{\phi}^{t} - \eta(t) \nabla_{\boldsymbol{\phi}} R_S(\boldsymbol{\phi}^t) = \boldsymbol{\phi}^{t} - \eta(t) \nabla_{\boldsymbol{\phi}} \frac{1}{n} \sum_{(\boldsymbol{x}_i,\boldsymbol{y}_i) \in S} \left ( L(\boldsymbol{y}_i,f(\boldsymbol{x}_i; \boldsymbol{\phi}^t)) + \Omega(\boldsymbol{\phi}^t) \right ) \qquad(4.23)\] where \(\nabla_{\boldsymbol{\phi}}\) is the gradient operator with respect to the model parameters, \(\eta(t)\) is the learning rate or step size and \(\Omega(\boldsymbol{\phi})\) is a generic regularisation term added to the loss to constrain model complexity. Many other gradient-based optimisation methods exist [124], e.g. using second-order derivative information. The previous flavour of gradient descent is often referred to as batch gradient descent, because the whole learning set \(S\) is used to compute the parameter updates at each step. Batch gradient descent can be very computationally demanding when the number of observations in \(S\) is large and the computation of the gradient of the loss for each labelled observation is costly. In addition, batch gradient descent is a deterministic optimisation method and likely to get stuck in a local minimum if the optimisation surface is non-convex.
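A minimal sketch of batch gradient descent, assuming a linear model with a squared loss and an L2 regularisation term on synthetic data, could read:

```python
import numpy as np

# Sketch of batch gradient descent (Equation 4.23) for a linear model
# f(x; phi) = x . phi with squared loss, on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

phi = rng.normal(size=3)   # random initialisation phi^0
eta, lam = 0.1, 1e-3       # learning rate and L2 coefficient
for t in range(200):
    residual = X @ phi - y
    grad = X.T @ residual / len(y) + lam * phi  # gradient over full set S
    phi = phi - eta * grad                      # update rule
```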
A variation of the previous technique, referred to as stochastic gradient descent (SGD) [125], overcomes the mentioned issues by using a random subset \(B=\{(\boldsymbol{x}_0,\boldsymbol{y}_0),..., (\boldsymbol{x}_m,\boldsymbol{y}_m)\}\) of \(m\) observations from the training set \(S\) at each step. If \(m\) is small, the updates can be computed much faster, the trade-off being noisier estimates of \(\mathbb{E}_{(\boldsymbol{x}_i,\boldsymbol{y}_i) \in S} \left [ \nabla_{\boldsymbol{\phi}} L(\boldsymbol{y}_i,f(\boldsymbol{x}_i; \boldsymbol{\phi}^t)) \right ]\). The parameter update rule from Equation 4.23 in SGD can instead be expressed as: \[ \boldsymbol{\phi}^{t+1} = \boldsymbol{\phi}^{t} - \eta(t) \nabla_{\boldsymbol{\phi}} R_B(\boldsymbol{\phi}^t) = \boldsymbol{\phi}^{t} - \eta(t) \nabla_{\boldsymbol{\phi}} \frac{1}{m} \sum_{(\boldsymbol{x}_i,\boldsymbol{y}_i) \in B } \left ( L(\boldsymbol{y}_i,f(\boldsymbol{x}_i; \boldsymbol{\phi}^t)) + \Omega(\boldsymbol{\phi}^t) \right ) \qquad(4.24)\] where \(B\) is a random subset of size \(m\) of the learning set \(S\). In the original formulation \(m=1\), yet nowadays a larger value of \(m\) is often used, in what is referred to as mini-batch SGD, to balance the noise of the gradient estimate and take advantage of vectorised computations. Several variations of SGD exist, which in some cases can provide convergence advantages over the previous update rule by using adaptive learning rates or momentum in the update dynamics [126]. Stochastic gradient descent methods are a key element for training complex differentiable machine learning models \(f(\boldsymbol{x}; \boldsymbol{\phi})\) such as artificial neural networks, which will be discussed in the rest of this section. SGD in combination with a non-decomposable loss function is also used in Chapter 6 to learn inference-aware summary statistics.
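The mini-batch variant only changes how the gradient is estimated; a sketch on the same synthetic linear regression problem as above could read:

```python
import numpy as np

# Sketch of mini-batch SGD (Equation 4.24) on a synthetic linear
# regression problem: each update uses a random subset B of m observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

phi = rng.normal(size=3)       # random initialisation phi^0
eta, lam, m = 0.1, 1e-3, 16    # step size, L2 coefficient, batch size m
for t in range(500):
    idx = rng.choice(len(y), size=m, replace=False)  # random mini-batch B
    grad = X[idx].T @ (X[idx] @ phi - y[idx]) / m + lam * phi
    phi = phi - eta * grad     # noisier but much cheaper update
```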
A particularly promising family of parametric functions \(f(\boldsymbol{x}; \boldsymbol{\phi})\) is referred to as artificial neural networks. Artificial neural networks are differentiable functions based on the composition of simple (and possibly non-linear) operations. The simplest type of artificial neural network, depicted in Figure 4.2, is referred to as a feed-forward neural network; it maps an input \(\boldsymbol{x}\) to an output \(\boldsymbol{y}\) by means of a series of forward transformations, referred to as neural network layers. In the simplest configuration, the values at a given layer \(k\) other than the input layer can be computed as a non-linear transformation of a linear combination of the output of the previous layer, after the addition of a bias term. The previous transformation can be expressed very compactly in matrix form as: \[ \boldsymbol{a}^k = g( (\boldsymbol{W}^k)^T \boldsymbol{a}^{k-1} + \boldsymbol{b}^k) \qquad(4.25)\] where \(\boldsymbol{a}^k\) is the outcome in vector notation after the layer transformation, \(\boldsymbol{a}^{k-1}\) is the vector of values from the previous transformation (or \(\boldsymbol{a}^0=\boldsymbol{x}\) if it is the first layer after the input), \(\boldsymbol{W}^k\) is a matrix with all the linear combination coefficients and \(\boldsymbol{b}^k\) is the bias vector that is added after the linear combination. The activation function \(g(\boldsymbol{z})\) is applied element-wise, and it is often based on a simple non-linear function. The sigmoid function \(\sigma(z)=1/(1+e^{-z})\) used to be a common choice for the activation function, but nowadays the rectified linear unit (ReLU) function \(g(z)=\max(0,z)\) and its variants are most frequently used instead.
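A direct transcription of Equation 4.25 into code, assuming a ReLU activation and randomly initialised weights for illustration, is sketched below:

```python
import numpy as np

# Sketch of the layer transformation of Equation 4.25: an affine map
# followed by an element-wise ReLU activation.
def dense_layer(a_prev, W, b):
    return np.maximum(0.0, W.T @ a_prev + b)  # g(W^T a + b) with g = ReLU

a0 = np.array([0.5, -1.2])          # input vector x (a^0)
W1 = np.random.normal(size=(2, 4))  # linear combination coefficients W^1
b1 = np.zeros(4)                    # bias vector b^1
a1 = dense_layer(a0, W1, b1)        # output of the first layer a^1
```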
The full feed-forward model \(f(\boldsymbol{x}; \boldsymbol{\phi})\) is based on the composition of transformations of the type described in Equation 4.25. When a single transformation is applied, i.e. \(\boldsymbol{y} = g( \boldsymbol{W}^T \boldsymbol{x} + \boldsymbol{b})\), the model can be referred to as a perceptron. If the model is instead based on the composition of several transformations, it can also be called a multi-layer perceptron (MLP), and each of the intermediate transformations (which can comprise an arbitrary number of computational units) is referred to as a hidden layer. The model in Figure 4.2 is an MLP. The advantage of using models based on feed-forward neural networks with hidden layers is that they can approximate any arbitrary continuous function, as stated by the universal approximation theorem [127]. In fact, while it is still the focus of theoretical research, the use of a large number of hidden layers has been found to increase the expressiveness and facilitate the training of powerful neural network models. The experimental success of this family of techniques has led to the concept of deep learning, where multiple transformation layers are used for learning data representations in many learning tasks.
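Building on the previous sketch, an MLP is simply the composition of several such transformations; the illustrative function below applies a ReLU-activated hidden layer for every parameter pair except the last, which acts as a linear output layer:

```python
import numpy as np

# Sketch of a multi-layer perceptron as a composition of the layer
# transformations of Equation 4.25. `params` is a list of (W, b) pairs,
# one per layer.
def mlp(x, params):
    a = x
    for W, b in params[:-1]:
        a = np.maximum(0.0, W.T @ a + b)  # hidden layers, g = ReLU
    W_out, b_out = params[-1]
    return W_out.T @ a + b_out            # linear output layer
```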
A good choice of depth and overall structure for a neural network model depends on the problem at hand as well as on the characteristics and size of the learning set available, thus it frequently has to be defined by trial and error, based on the performance on a validation set as discussed in Section 4.1.1. The output size and the choice of activation function in the last transformation often depend on the task at hand. For binary classification tasks, it is practical to use the sigmoid function \(\sigma(z)=1/(1+e^{-z})\) as the activation function of the last layer, in combination with a loss function for soft classification (e.g. binary cross entropy from Equation 4.8). For multi-class classification problems, such as the one discussed in Section 4.3.2.1, the size of the output vector usually matches the number of categories, given that the softmax function (see Equation 4.11) is often used in the last layer to approximate conditional class probabilities in combination with a cross entropy loss (see Equation 4.10). For learning tasks different from classification, different output structures and constraints might be used, e.g. the output vector size in the use case in Chapter 6 corresponds to the number of dimensions of the resulting summary statistic, which is based on a transformation of the input using a multi-layer neural network.
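For illustration, the two task-dependent output activations mentioned above can be sketched as:

```python
import numpy as np

# Sketch of task-dependent output activations: a sigmoid head for binary
# classification and a softmax head for multi-class problems.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # scalar output in (0, 1)

def softmax(z):
    e = np.exp(z - np.max(z))        # shifted for numerical stability
    return e / e.sum()               # probability vector summing to one
```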
The SGD update rule from Equation 4.24 requires the computation of the gradients of the loss function with respect to the model parameters. For complex models, e.g. those put together by stacking layers such as those described in Equation 4.25, the computation of derivatives by numerical finite differences or symbolic differentiation may become rather challenging. The former requires evaluating the loss function after variations of at least twice the number of parameters and is affected by round-off and truncation errors, while a naive use of the latter could lead to very large expressions for the exact derivative that cannot be easily simplified. Given that a numerical function as implemented in a computer program is a sequence of simple operations (e.g. addition, subtraction, exponentiation, etc.), it is possible to efficiently obtain gradients and other derivatives by applying the chain rule repeatedly, based on the structure of the program, the derivatives of the simple operations and a record of the intermediate values.
The previous family of techniques, which will not be discussed in depth in this work, is referred to as automatic differentiation (AD) [128]. The most efficient way of computing the gradient of a one-dimensional function that depends on many parameters, such as the gradient of the empirical risk for a batch of observations from Equation 4.24, is by means of reverse-mode automatic differentiation, which is also referred to as backpropagation in the context of neural network training. The computational cost of computing the full gradient of the loss to numerical precision using backpropagation is of the same order as a single forward evaluation of the loss, which provides a great advantage relative to finite differences. In addition, when implemented in a computational framework, it can be applied to any numerical function as long as it can be expressed as a computational graph, e.g. an arbitrary program containing control flow statements, without requiring the complex expression simplification that would be needed for symbolic differentiation. In fact, modern computational frameworks that include automatic differentiation, such as TensorFlow [129] or PyTorch [130], may also be used to compute higher-order gradients (e.g. Hessian matrix elements), which are useful in Chapter 6 to build a differentiable approximation of the covariance matrix based on a summary statistic.
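As a minimal sketch of these capabilities, using PyTorch with a toy differentiable loss (the function and values are purely illustrative), reverse-mode AD yields the gradient in a single backward pass and, by retaining the graph, second derivatives as well:

```python
import torch

# Sketch of reverse-mode automatic differentiation (backpropagation) on a
# toy scalar loss depending on a parameter vector phi.
phi = torch.tensor([0.5, -1.0], requires_grad=True)
x = torch.tensor([1.0, 2.0])
loss = torch.sum((x * phi) ** 2)  # illustrative differentiable loss

# Gradient of the loss w.r.t. phi in one backward pass; create_graph=True
# keeps the graph so higher-order derivatives (e.g. Hessian elements)
# can be requested afterwards.
(grad,) = torch.autograd.grad(loss, phi, create_graph=True)
(hess_row0,) = torch.autograd.grad(grad[0], phi, retain_graph=True)
```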
As mentioned before, reverse-mode automatic differentiation can be used to compute the gradients of an arbitrary function as long as it can be represented as a computational graph containing differentiable simple operations. Thus the neural network model \(f(\boldsymbol{x}; \boldsymbol{\phi})\) is not restricted to the composition of layers of the type described in Equation 4.25, which are often referred to as fully connected or dense layers. Alternative function components are useful for dealing with data that cannot be represented by a fixed-length vector [115], e.g. convolutional layers are often useful for working with 2D images, while recurrent layers extend the application of neural networks to sequences that vary in length between observations. Both convolutional and recurrent layers are used in the neural network model for jet flavour-tagging described in Section 4.3.2.1. Other differentiable neural network components have also been developed to deal with permutation-invariant sets [131] or graphs [132] as input data structures, which could have promising applications in analyses at particle collider experiments.