4.3 Applications in High Energy Physics

Machine learning techniques, in particular supervised learning, are increasingly being used in experimental particle physics analyses at the LHC [133]. In this section, the main use cases are described, linking each learning task with the statistical problems and properties described in Chapter 3. In broad terms, most supervised learning at collider experiments can be viewed as a way to approximate the latent variables of the generative model based on simulated observations. Those latent variable approximations are often very informative about the parameters of interest and can then be used to construct summary statistics of the observations, which allow likelihood-free inference to be carried out efficiently.

4.3.1 Signal vs Background Classification

The mixture structure of the statistical model for the outcome of collisions, discussed in Chapter 3, facilitates its framing as a classification problem. Intuitively, the classification objective can be stated as the separation of detector outcomes coming from processes that contain information about the parameters of interest from those that do not, which will be referred to as signal and background respectively, following the nomenclature of Section 3.1.1.4. The two classes are often non-separable - i.e. a given detector outcome \(\boldsymbol{x}\) (or any function of it) could have been produced either by signal or background processes, and only probabilistic statements of class assignment can be made.

In order to use supervised machine learning techniques to classify detector outcomes, labelled samples are required, yet only the detector readout \(\boldsymbol{x}\) is known for collected data. Realistic simulated observations, generated specifically to model events from a given set of processes (e.g. signal and background), can instead be used as training data, where the categorical latent variable \(z_i\) that represents a given set of processes can effectively be used as the classification label. If the simulator model is misspecified, e.g. due to the effect of known unknowns as discussed in Section 3.1.4, the resulting classifiers would be optimising the classification objective for distributions that differ from those of the actual data.

To understand the role of classification in the larger goal of statistical inference of a subset of parameters of interest in a mixture model, let us consider the general problem of inference for a two-component mixture. One of the components will be denoted as signal \(p_s(\boldsymbol{x}| \boldsymbol{\theta})\) and the other as background \(p_b(\boldsymbol{x} | \boldsymbol{\theta})\), where \(\boldsymbol{\theta}\) is the vector of all parameters the distributions might depend on. As discussed in Section 3.1.1.3, it is often the case that \(p_s(\boldsymbol{x}| \boldsymbol{\theta})\) and \(p_b(\boldsymbol{x} | \boldsymbol{\theta})\) are not known and observations can only be simulated; this does not affect the validity of the following discussion. The probability distribution function of the mixture can be expressed as: \[ p(\boldsymbol{x}| \mu, \boldsymbol{\theta} ) = (1-\mu) p_b(\boldsymbol{x} | \boldsymbol{\theta}) + \mu p_s(\boldsymbol{x} | \boldsymbol{\theta}) \qquad(4.26)\] where \(\mu\) is a parameter corresponding to the signal mixture fraction, which will be the only parameter of interest for the time being. As discussed in Section 3.1.1.4, most of the parameters of interest in analyses at the LHC, such as cross sections, are proportional to the mixture coefficient of the signal in the statistical model. The results presented here would also be valid if alternative mixture coefficient parametrisations such as the one considered in Section 6.5.1 are used, e.g. \(\mu=s/(s+b)\) where \(s\) and \(b\) are the expected numbers of events for signal and background respectively, as long as \(b\) is known and fixed and \(s\) is the only parameter of interest.

4.3.1.1 Likelihood Ratio Approximation

Probabilistic classification techniques will effectively approximate the conditional probability of each class, as discussed in Equation 4.9 for binary classification. A way to approximate the density ratio \(r(\boldsymbol{x})\) between two arbitrary distribution functions \(\rho(\boldsymbol{x})\) and \(q(\boldsymbol{x})\) is then to train a classifier - e.g. a neural network optimising cross-entropy. If samples from \(\rho(\boldsymbol{x})\) are labelled as \(y=1\), while \(y=0\) is used for observations from \(q(\boldsymbol{x})\), the density ratio can be approximated from the soft BCE classifier output \(s(\boldsymbol{x})\) as: \[ \frac{s(\boldsymbol{x})}{1-s(\boldsymbol{x})} \approx \frac{p(y = 1| \boldsymbol{x})}{p(y = 0| \boldsymbol{x})} = \frac{p(\boldsymbol{x} | y = 1) p(y = 1)}{p(\boldsymbol{x} | y = 0) p(y = 0)} = r(\boldsymbol{x}) \frac{p(y = 1)}{p(y = 0)} \qquad(4.27)\] thus the density ratio \(r(\boldsymbol{x})\) can be approximated by a simple function of the trained classifier output directly from samples of observations. The factor \(p(y = 1)/p(y = 0)\) is independent of \(\boldsymbol{x}\), and can be simply estimated as the ratio between the total number of observations from each category in the training dataset - i.e. equal to 1 if the latter is balanced.
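As an illustration of this density-ratio trick, the following sketch uses a toy one-dimensional problem (two unit-variance Gaussians, chosen purely for this example) and a logistic-regression classifier trained by plain gradient descent; here the true log ratio is linear in \(x\), so the model is well specified and \(s(\boldsymbol{x})/(1-s(\boldsymbol{x}))\) should recover the exact ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy densities: rho = N(1, 1) labelled y=1, q = N(0, 1) labelled y=0.
# The true log density ratio is log r(x) = x - 0.5, so a logistic model
# sigmoid(w*x + b) is well specified, with optimum w = 1, b = -0.5.
n = 20_000
x = np.concatenate([rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n)])
y = np.concatenate([np.ones(n), np.zeros(n)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Full-batch gradient descent on the binary cross-entropy loss.
w, b = 0.0, 0.0
for _ in range(2000):
    s = sigmoid(w * x + b)
    grad = s - y
    w -= 0.5 * np.mean(grad * x)
    b -= 0.5 * np.mean(grad)

# Density-ratio estimate r_hat(x) = s(x) / (1 - s(x)); the classes are
# balanced, so the prior factor p(y=1)/p(y=0) equals 1.
def r_hat(xq):
    s = sigmoid(w * xq + b)
    return s / (1.0 - s)

xq = np.array([-1.0, 0.0, 1.0, 2.0])
r_true = np.exp(xq - 0.5)  # exact ratio of the two Gaussian densities
```

With balanced classes the prior factor is one, so the transformed classifier output can be compared directly with the analytic ratio \(r(x)=\exp(x-1/2)\).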

Density ratios are very useful for inference, particularly for hypothesis testing, given that the likelihood ratio \(\Lambda\) from Equation 3.39 is the most powerful test statistic to distinguish between two simple hypotheses and can be expressed as a function of density ratios. Returning to the two-component mixture from Equation 4.26, for discovery the null hypothesis \(H_0\) corresponds to background-only \(p(\boldsymbol{x}| \mu = 0, \boldsymbol{\theta})\) while the alternate is often a given mixture of signal and background \(p(\boldsymbol{x}| \mu = \mu_0, \boldsymbol{\theta})\), where \(\mu_0\) is fixed. For the time being, the other distribution parameters \(\boldsymbol{\theta}\) will be assumed to be known and fixed to the same values for both hypotheses. The likelihood ratio in this case can be expressed as: \[ \Lambda( \mathcal{D}; H_0, H_1) = \prod_{\boldsymbol{x} \in \mathcal{D}} \frac{p(\boldsymbol{x}| H_0)}{ p(\boldsymbol{x} |H_1)} = \prod_{\boldsymbol{x} \in \mathcal{D}} \frac{p(\boldsymbol{x}| \mu = 0, \boldsymbol{\theta})}{ p(\boldsymbol{x}| \mu = \mu_0, \boldsymbol{\theta})} \qquad(4.28)\] where the \(p(\boldsymbol{x}| \mu = 0, \boldsymbol{\theta})/p(\boldsymbol{x}| \mu = \mu_0, \boldsymbol{\theta})\) factor could be approximated from the output of a probabilistic classifier trained to distinguish observations from \(p(\boldsymbol{x}| \mu = 0, \boldsymbol{\theta})\) and those from \(p(\boldsymbol{x}| \mu = \mu_0, \boldsymbol{\theta})\). A certain \(\mu_0\) would have to be specified to generate \(p(\boldsymbol{x}| \mu = \mu_0, \boldsymbol{\theta})\) observations in order to train the classifier. The same classifier output could be repurposed to model the likelihood ratio when \(H_1\) is \(p(\boldsymbol{x}| \mu = \mu_1, \boldsymbol{\theta})\) with a simple transformation, yet the mixture structure of the problem allows for a more direct density ratio estimation alternative, which is the one regularly used in particle physics analyses.

Let us consider instead the inverse of the likelihood ratio \(\Lambda\) from Equation 4.28; each factor term is proportional to the following ratio: \[ \Lambda^{-1} \sim \frac{p(\boldsymbol{x} | H_1)}{ p(\boldsymbol{x} | H_0 )} = \frac{ (1-\mu_0) p_\textrm{b}(\boldsymbol{x}| \boldsymbol{\theta}) + \mu_0 p_\textrm{s}(\boldsymbol{x}| \boldsymbol{\theta})}{p_\textrm{b}(\boldsymbol{x}| \boldsymbol{\theta})} \qquad(4.29)\] which can in turn be expressed as: \[ \Lambda^{-1} \sim 1 + \mu_0 \left ( \frac{p_\textrm{s}(\boldsymbol{x}| \boldsymbol{\theta})}{ p_\textrm{b}(\boldsymbol{x}| \boldsymbol{\theta})}-1 \right) \qquad(4.30)\] thus each factor in the likelihood ratio is a bijective function of the ratio \(p_\textrm{s}(\boldsymbol{x}| \boldsymbol{\theta}) /p_\textrm{b}(\boldsymbol{x}| \boldsymbol{\theta})\). This density ratio can be approximated by training a classifier to distinguish signal and background observations, which is computationally more efficient and easier to interpret intuitively than the direct \(p(\boldsymbol{x}| H_0)/p(\boldsymbol{x} |H_1)\) approximation mentioned before.
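The equivalence between the mixture form of the inverse likelihood-ratio factor and its expression in terms of the signal-to-background density ratio can be checked numerically; the Gaussian toy densities and the value of \(\mu_0\) below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1D signal and background densities (assumed Gaussians, for
# illustration only).
def p_s(x):
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

def p_b(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

mu0 = 0.2
x = rng.normal(0.0, 1.5, 1000)

# Per-event inverse likelihood-ratio factor, written two ways:
# directly from the mixture ...
direct = ((1 - mu0) * p_b(x) + mu0 * p_s(x)) / p_b(x)
# ... and as a bijective function of the density ratio p_s/p_b.
via_ratio = 1 + mu0 * (p_s(x) / p_b(x) - 1)
```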

From a statistical inference point of view, supervised machine learning framed as the classification of signal versus background can be viewed as a way to approximate the likelihood ratio directly from simulated samples, bypassing the need for a tractable density function (see Section 3.2.1). It is worth noting that because it is only an approximation, in order to be useful for inference it requires careful calibration. Such calibration is usually carried out using a histogram and a holdout dataset of simulated observations, effectively building a synthetic likelihood over the whole classifier output range, or of the number of observed events after a cut on the classifier output is imposed (see Section 3.1.3.4). Alternative density estimation techniques could also be used for the calibration step, which could reduce the loss of information due to the histogram binning.
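A minimal sketch of such a histogram-based calibration, assuming toy Beta-distributed classifier scores (in practice the score distributions would come from simulated holdout samples):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-event classifier scores: background peaks near 0,
# signal near 1 (assumed Beta shapes, for illustration only).
def score_b(n):
    return rng.beta(2, 5, n)

def score_s(n):
    return rng.beta(5, 2, n)

# Calibration step: histogram a holdout set of simulated scores to build
# a synthetic (binned) likelihood for each class.
edges = np.linspace(0, 1, 21)
hb, _ = np.histogram(score_b(100_000), bins=edges, density=True)
hs, _ = np.histogram(score_s(100_000), bins=edges, density=True)

def binned_log_likelihood_ratio(scores, mu):
    """Sum of per-event log [p(score|H1)/p(score|H0)], H1 a mu mixture."""
    idx = np.clip(np.digitize(scores, edges) - 1, 0, len(hb) - 1)
    # Floor empty bins to avoid log(0) in sparsely populated regions.
    p0 = np.maximum(hb[idx], 1e-3)
    p1 = np.maximum((1 - mu) * hb[idx] + mu * hs[idx], 1e-3)
    return np.sum(np.log(p1 / p0))

# Pseudo-data containing 10% signal should favour H1 over H0.
data = np.concatenate([score_b(9000), score_s(1000)])
llr = binned_log_likelihood_ratio(data, mu=0.1)
```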

The effect of nuisance parameters, due to known unknowns, has also to be accounted for during the calibration step. The true density ratio between signal and background depends on any parameter \(\boldsymbol{\theta}\) that modifies the signal \(p_s(\boldsymbol{x} | \boldsymbol{\theta})\) or background \(p_b(\boldsymbol{x} | \boldsymbol{\theta})\) probability densities, thus its approximation using machine learning classification can become complicated. In practice, the classifier can be trained for the most likely values of the nuisance parameters and their effect can be adequately accounted for during calibration, yet the resulting inference will be degraded. While this issue can be somewhat ameliorated using parametrised classifiers [134], the main motivation for using the likelihood ratio - i.e. the Neyman-Pearson lemma - does not apply because the hypotheses considered are not simple when nuisance parameters are present.

4.3.1.2 Sufficient Statistics Interpretation

Another interpretation of the use of signal versus background classifiers, which more generally applies to any type of statistical inference, is based on applying the concept of statistical sufficiency (see Section 3.1.3.3). Starting from the mixture distribution function in Equation 4.26, and both dividing and multiplying by \(p_b(\boldsymbol{x} | \boldsymbol{\theta})\) we obtain: \[ p(\boldsymbol{x}| \mu, \boldsymbol{\theta} ) = p_b(\boldsymbol{x} | \boldsymbol{\theta}) \left ( 1-\mu + \mu \frac{p_s(\boldsymbol{x} | \boldsymbol{\theta})}{p_b(\boldsymbol{x} | \boldsymbol{\theta})} \right ) \qquad(4.31)\] from which we can already prove that the density ratio \(s_{s/ b}(\boldsymbol{x})= p_s(\boldsymbol{x} | \boldsymbol{\theta}) / p_b(\boldsymbol{x} | \boldsymbol{\theta})\) (or alternatively its inverse) is a sufficient summary statistic for the mixture coefficient parameter \(\mu\), according to the Fisher-Neyman factorisation criterion defined in Equation 3.30. The density ratio can be approximated directly from signal versus background classification as indicated in Equation 4.27.

In the analysis presented in Chapter 5 and in the synthetic problem considered in Section 6.5.1, as well as for most LHC analyses using classifiers to construct summary statistics, the summary statistic \[ s_{s/(s+b)}(\boldsymbol{x})= \frac{p_s(\boldsymbol{x} | \boldsymbol{\theta})}{ p_s(\boldsymbol{x} | \boldsymbol{\theta}) + p_b(\boldsymbol{x} | \boldsymbol{\theta})}\] is used instead of \(s_{s/ b} (\boldsymbol{x})\). The advantage of \(s_{s/(s+b)}(\boldsymbol{x})\) is that it represents the conditional probability of one observation \(\boldsymbol{x}\) coming from the signal assuming a balanced mixture, so it can be approximated by simply taking the classifier output. In addition, being a probability it is bounded between zero and one, which greatly simplifies its visualisation and non-parametric likelihood estimation. Taking Equation 4.31 and manipulating the subexpression depending on \(\mu\) by adding and subtracting \(\mu\) we have: \[ p(\boldsymbol{x}| \mu, \boldsymbol{\theta} ) = p_b(\boldsymbol{x} | \boldsymbol{\theta}) \left ( 1-2\mu + \mu \frac{p_s(\boldsymbol{x} | \boldsymbol{\theta}) + p_b(\boldsymbol{x} | \boldsymbol{\theta})}{p_b(\boldsymbol{x} | \boldsymbol{\theta})} \right ) \qquad(4.32)\] which in turn can be expressed as: \[ p(\boldsymbol{x}| \mu, \boldsymbol{\theta} ) = p_b(\boldsymbol{x} | \boldsymbol{\theta}) \left ( 1-2\mu + \mu \left ( 1- \frac{p_s(\boldsymbol{x} | \boldsymbol{\theta})}{p_s(\boldsymbol{x} | \boldsymbol{\theta}) +p_b(\boldsymbol{x} | \boldsymbol{\theta})} \right )^{-1} \right ) \qquad(4.33)\] hence proving that \(s_{s/(s+b)}(\boldsymbol{x})\) is also a sufficient statistic and theoretically justifying its use for inference about \(\mu\). The advantage of both \(s_{s/(s+b)}(\boldsymbol{x})\) and \(s_{s/b}(\boldsymbol{x})\) is that they are one-dimensional and do not depend on the dimensionality of \(\boldsymbol{x}\), hence allowing much more efficient non-parametric density estimation from simulated samples.
Note that we have only been discussing sufficiency with respect to the mixture coefficients and not the additional distribution parameters \(\boldsymbol{\theta}\). In fact, if a subset of the \(\boldsymbol{\theta}\) parameters are also relevant for inference (e.g. they are nuisance parameters) then \(s_{s/(s+b)}(\boldsymbol{x})\) and \(s_{s/b}(\boldsymbol{x})\) are not sufficient statistics, unless \(p_s(\boldsymbol{x}| \boldsymbol{\theta})\) and \(p_b(\boldsymbol{x}| \boldsymbol{\theta})\) have very specific functional forms that allow a similar factorisation.
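The factorisation argument can be verified numerically: likelihood ratios between two values of \(\mu\) computed from the full mixture densities coincide exactly with those computed from \(s_{s/(s+b)}(\boldsymbol{x})\) alone, since the \(\mu\)-independent factor \(p_b(\boldsymbol{x}|\boldsymbol{\theta})\) cancels. The Gaussian toy densities below are an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1D signal and background densities (assumed Gaussians).
def p_s(x): return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)
def p_b(x): return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def mixture(x, mu):
    return (1 - mu) * p_b(x) + mu * p_s(x)

# The summary statistic s_{s/(s+b)}(x).
def s_stat(x):
    return p_s(x) / (p_s(x) + p_b(x))

x = rng.normal(0.5, 1.2, 5000)
s = s_stat(x)

# Fisher-Neyman factorisation: the mu-dependent factor of the mixture
# density, written as a function of the summary statistic only.
def g(s, mu):
    return 1 - 2 * mu + mu / (1 - s)

# Log-likelihood differences between two values of mu agree exactly,
# whether computed from the full densities or from s(x) alone.
full = np.log(mixture(x, 0.3)) - np.log(mixture(x, 0.1))
summ = np.log(g(s, 0.3)) - np.log(g(s, 0.1))
```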

In summary, probabilistic signal versus background classification is an effective proxy to construct summary statistics that asymptotically approximate sufficient statistics directly from simulated samples, when the distributions of signal and background are fully defined and \(\mu\) (or \(s\) in the alternative parametrisation mentioned before) is the only unknown parameter. If the statistical model depends on additional nuisance parameters, probabilistic classification does not provide any sufficiency guarantees, so useful information that could be used to constrain the parameters of interest might be lost if a low-dimensional classification-based summary statistic is used in place of \(\boldsymbol{x}\). This theoretical observation will be demonstrated in practice in Chapter 6, where a new technique to construct summary statistics is presented that is not based on classification and accounts for the effect of nuisance parameters.

4.3.2 Particle Identification and Regression

While the categorical latent variable \(z_i\), denoting the interaction process that occurred in a given collision, is very useful to define an event selection or directly as a summary statistic, information about other latent variables can also be recovered using supervised machine learning. As discussed in Section 2.3.3, event reconstruction techniques are used to cluster the raw detector output so the various readouts are associated with a list of particles produced in the collision. It is possible that in the near future the algorithmic reconstruction procedure will be replaced by supervised learning techniques, trained directly on simulated data to predict the set of latent variables at parton level, especially given the recent progress in learning from sequences and other non-tabular data structures. For the time being, machine learning techniques are instead often used to augment the event reconstruction output, mainly for particle identification and fine-tuned regression.

The set of physics objects obtained from event reconstruction, when adequately calibrated using simulation, can effectively estimate a subset of the latent variables \(\boldsymbol{z}\) associated with the resulting parton level particles, such as their transverse momenta and direction. Due to the limitations of the hand-crafted algorithms used, some latent information is lost in the standard reconstruction process, particularly for composite objects such as jets. Supervised machine learning techniques can be used to regress some of these latent variables, using simulated data and considering both low-level and high-level features associated with the relevant reconstructed objects. This information could be used to complement the reconstruction output for each object and design better summary statistics, e.g. adding it as an input to the classifiers discussed in Section 4.3.1.

The details of the application of machine learning techniques in particle identification and regression depend on the particle type and the relevant physics case. In the remainder of this section, the application of new deep learning techniques to jet tagging within CMS is discussed in more detail. The integration of deep learning jet taggers with the CMS experiment software infrastructure was one of the secondary research goals of the project embodied in this document. Leveraging better machine learning techniques for jet tagging and regression could substantially increase the discovery reach of analyses at the LHC that are based on final states containing jets, such as the search for Higgs boson pair production described in Chapter 5.

4.3.2.1 Deep Learning for Jet Tagging

The concept of jet tagging, introduced in Section 2.3.3.4, consists of augmenting the information of reconstructed jets, using their properties to provide additional details about latent variables associated with the physics object which are not provided by the standard reconstruction procedure. Heavy flavour tagging, and in particular b-tagging, is extremely useful to distinguish and select events containing final states from relevant physical interactions. The efficiency of b-tagging algorithms in CMS has been gradually improving for each successive data taking period since the first collisions in 2010. The advance in b-tagging performance, which was already exemplified by Figure 2.12, is mainly due to the combined effect of using additional or more accurate jet associated information (e.g. secondary vertex reconstruction or lepton information) and better statistical techniques.

Jet tagging can generally be posed as a supervised machine learning classification problem. Let us take for example the case of b-tagging, i.e. distinguishing jets originating from b-quarks from those originating from lighter quarks or gluons, which can be framed as a binary classification problem: predicting whether a jet is coming from a b-quark or not given a set of inputs associated with each jet. The truth label is available for simulated samples, which are used to train the classifier. The CSVv2 b-tagging algorithm (and older variants) mentioned in Section 2.3.3.4 is based on the output of supervised classifiers trained on simulation, i.e. a combination of three shallow neural networks depending on vertex information for CSVv2. The CMVAv2 tagger, which is used in the CMS analysis included in Chapter 5, is instead based on a boosted decision tree binary classifier that uses the outputs of other simpler b-tagging algorithms as input. Similar algorithms based on binary classification have also been developed for charm quark tagging and for double b-quark tagging of large radius jets.

The first attempt to use some of the recent advances in neural networks (see Section 4.2.2) for jet tagging within CMS was commissioned using 2016 data, and is referred to as the DeepCSV tagger. The purpose of the development of this tagger was to quantify the performance gain due to the use of deep neural networks for jet tagging in CMS, which had been demonstrated effective using a simplified detector simulation framework [135], [136]. Thus, a classifier based on a 5-layer neural network, each layer with 100 nodes using ReLU activation functions, was trained on the information considered for the CSVv2 tagger. A vector of variables from up to six charged tracks, one secondary vertex and 12 global variables was considered as input, amounting to 66 variables in total. Another change with respect to previous taggers is that flavour tagging is posed as a multi-class classification problem, which is a principled and simple way of tackling the various flavour tagging problems simultaneously.
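The shape of such a network can be sketched as follows; the weights here are random and the code is only meant to convey the architecture (66 inputs, five hidden ReLU layers of 100 nodes, five softmax outputs), not the actual trained tagger:

```python
import numpy as np

rng = np.random.default_rng(4)

# Shape-only sketch of a DeepCSV-like dense network: 66 inputs, five
# hidden layers of 100 ReLU nodes, and 5 softmax outputs (the flavour
# categories). Weights are random here; in practice they are learnt by
# minimising the cross-entropy loss on simulated jets.
sizes = [66, 100, 100, 100, 100, 100, 5]
weights = [rng.normal(0, np.sqrt(2 / m), (m, n))
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ w + b, 0.0)          # ReLU hidden layers
    logits = h @ weights[-1] + biases[-1]
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)    # softmax over 5 classes

probs = forward(rng.normal(size=(8, 66)))       # a batch of 8 jets
```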

Five exclusive categories were defined based on the generator-level hadron information¹: the jet contains exactly one B hadron, at least two B hadrons, exactly one C hadron and no B hadrons, at least two C hadrons and no B hadrons, or none of the previously defined categories. The softmax operator (see Equation 4.11) was used to normalise the category outputs as probabilities and to construct a loss function based on cross entropy (see Equation 4.10). As was shown in Figure 2.12 for b-tagging performance, the DeepCSV tagger is considerably better than CSVv2 over the b-jet efficiency/misidentification range - e.g. about 25% more efficient at a light jet and gluon mistag rate of \(10^{-3}\). In fact, DeepCSV outperforms the CMVAv2 super-combined tagger, which uses additional leptonic information. While not shown in this document, the performance for c-tagging was also found to be comparable with that of dedicated c-taggers [85].

The very favourable results obtained for DeepCSV motivated the use of newer machine learning technologies, such as convolutional and recurrent layers, which were readily available in open-source software libraries [129], [137], as well as advances in hardware (i.e. more powerful GPUs for training). The large number of jets available in simulated data, e.g. in 2016 about \(10^9\) \(\textrm{t}\bar{\textrm{t}}\) events were simulated for CMS (each with two b-quarks and probably several light quarks), conceptually justifies the use of more complex machine learning models because over-fitting is unlikely. Thus, a new multi-class jet tagger referred to as DeepJet (formerly known as DeepFlavour) was developed, whose architecture, depicted in Figure 4.3, is characterised by a more involved input structure and both convolutional and recurrent layers.

Figure 4.3: Scheme of the DeepJet tagger architecture. Four different sets of inputs are considered: a sequence of charged candidates, a sequence of neutral candidates, a sequence of secondary vertices and 15 global variables. Sequences first go through a series of 1x1 convolution filters that learn a more compact feature representation, and then through a recurrent layer that summarises the information of the sequence into a fixed-size vector. All the inputs are then fed into a 7-layer dense network. A total of six exclusive output categories are considered depending on the generator-level components: b, bb, leptonic b, c, light or gluon. Figure adapted from [138].

Instead of a fixed input vector, optionally padded with zeroes for elements that do not exist (e.g. when no secondary vertex has been reconstructed), a complex input object is considered for DeepJet. Variable-size sequences are directly taken as input for charged candidates, neutral candidates and secondary vertices, each element in the sequence characterised by 16, 8 and 12 features respectively. Each of the three input sequences goes through three layers of 1x1 convolutions in order to obtain a more compact element representation, 8-dimensional for charged candidates and secondary vertices and 4-dimensional for neutral candidates. The output of the convolutional layers is connected to a recurrent layer, which transforms the variable-size input into a fixed-size embedding. The fixed-size outputs of the recurrent layers, as well as a set of 15 global jet variables, are fed into a 6-layer dense network with 100 cells (200 for the first layer) with ReLU activation functions per layer.
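The key structural idea - a 1x1 convolution acting as a per-element dense projection, followed by a recurrent summarisation into a fixed-size embedding - can be sketched in a few lines of numpy; the dimensions and random weights below are illustrative, not those of the real DeepJet model:

```python
import numpy as np

rng = np.random.default_rng(5)

# Minimal numpy sketch of the DeepJet input processing: a 1x1 convolution
# over a sequence is just a dense projection applied independently to
# every element, and a recurrent layer then compresses the variable-length
# sequence into a fixed-size vector.
n_feat, n_proj, n_hidden = 16, 8, 50   # e.g. charged-candidate features

W_proj = rng.normal(0, 0.3, (n_feat, n_proj))   # "1x1 convolution"
W_in = rng.normal(0, 0.3, (n_proj, n_hidden))   # vanilla RNN weights
W_rec = rng.normal(0, 0.3, (n_hidden, n_hidden))

def summarise(seq):
    """Map a (length, n_feat) sequence to a fixed n_hidden vector."""
    proj = np.tanh(seq @ W_proj)       # per-element compact representation
    h = np.zeros(n_hidden)
    for step in proj:                  # simple recurrent update
        h = np.tanh(step @ W_in + h @ W_rec)
    return h

# Two jets with different numbers of charged candidates yield embeddings
# of the same size, ready to be concatenated with the global variables.
emb_a = summarise(rng.normal(size=(3, n_feat)))
emb_b = summarise(rng.normal(size=(11, n_feat)))
```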

A total of six mutually exclusive output categories are considered based on the generator-level particle content associated to the jet:

  • b - exactly one B hadron that does not decay to a lepton.
  • bb - at least two B hadrons.
  • lepb - exactly one B hadron decaying to a soft lepton.
  • c - at least one C hadron and no B hadrons.
  • l - no heavy hadrons, originating from a light quark.
  • g - no heavy hadrons, originating from a gluon.

The DeepJet tagger aims to provide gluon-quark discrimination in addition to b-tagging, c-tagging and double b-tagging. The output probabilities are normalised by using the softmax operator (see Equation 4.11). The training loss function was constructed based on cross entropy (see Equation 4.10). Additional details regarding the architecture and training procedure are available at [139].
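For reference, the softmax normalisation and categorical cross-entropy loss used for the output categories can be written compactly as follows (the logits are toy values for illustration, not real tagger outputs):

```python
import numpy as np

# Softmax normalisation and cross-entropy loss, as used for the DeepJet
# six-category output.
def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable form
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true category."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

logits = np.array([[4.0, 0.5, 0.1, 0.2, 0.1, 0.1],    # confident "b"
                   [0.2, 0.1, 3.0, 0.3, 0.2, 0.2]])   # confident "lepb"
probs = softmax(logits)
loss = cross_entropy(probs, np.array([0, 2]))
```

The loss is small when the probability mass concentrates on the true category, and grows without bound as the assigned probability approaches zero, which is what drives the training towards well-calibrated category probabilities.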

The b-tagging performance of DeepJet, by means of the misidentification versus efficiency curve compared with the DeepCSV tagger, is shown in Figure 4.4. The additional model complexity and input variables lead to a clear performance improvement, about 20% additional efficiency at a mistag rate of \(10^{-3}\) for light quark and gluon originated jets. Larger relative enhancements with respect to DeepCSV are seen for b-jet versus c-jet identification. The performance for c-tagging and quark-gluon discrimination is slightly improved in comparison with dedicated approaches, with the advantage of using a single model for all the flavour tagging variations. The expected relative performance boost, especially when compared with non-deep-learning-based taggers (CSVv2 or CMVAv2), can significantly increase the discovery potential of analyses targeting final states containing several b-tagged jets, such as the one presented in Chapter 5. In addition, similar model architectures have since been successfully applied to large radius jet tagging [140] and could also be extended to other jet related tasks, such as providing a better estimation of the jet momenta by means of a regression output.

Figure 4.4: Misidentification probability (in log scale) for jets originating from c quarks (dashed lines) or light quarks and gluons (solid lines) as a function of the b-tagging efficiency for both DeepCSV and DeepJet taggers. The corrected mistag/efficiency and its uncertainty for the loose, medium and tight working points are also included. Figure adapted from [138].

While both advances in model architecture and the addition of input features allow notable jet tagging performance gains, they can complicate the integration of these tools within the CMS experiment software framework [141], often referred to as CMSSW. Training and performance evaluation of both DeepCSV and DeepJet were carried out using the Keras [137] and TensorFlow [129] open-source libraries. Integrating the jet tagging models in the standard CMS reconstruction sequence is non-trivial, since the sequence has rather stringent CPU and memory requirements per event: it is run for both acquired and simulated data on commodity hardware, distributed around the world in the LHC computing grid [142]. In addition, the lwtnn open-source library [143], the low-overhead C++ interface used for the integration of DeepCSV, did not support multi-input models with recurrent layers at the time.

An alternative path to integrate DeepJet into production was thus required. Given that the TensorFlow backend is written in C++ and provides a basic interface for evaluating trained models, direct evaluation of the machine learning model with its native TensorFlow backend was chosen as the best alternative. In addition, this way the integration effort and the basic interface developed could be re-used in future deep learning use cases in the CMS experiment (e.g. large radius jet tagging), leading to the development of the CMSSW-DNN module [144]. The integration process was made more challenging by the difficulty of recovering the same features at reconstruction level, the strict memory requirements and multi-threading conflicts. After resolving all the mentioned issues [145], the output of the DeepJet model in production was verified to match that of the training framework [146] to numerical precision. The successful integration, which is currently in use, facilitated the measurement of the DeepJet b-tagging performance on data for the main discriminator working points, as shown in Figure 4.4.


  1. Here by B and C hadrons we refer to hadrons containing b-quarks or c-quarks as valence quarks respectively, which often have a lifetime large enough to fly away from the primary vertex, as discussed in Section 2.3.3.4.