6.1 Introduction

Simulator-based inference is currently at the core of many scientific fields, such as population genetics, epidemiology, and experimental particle physics. In many cases, the implicit generative procedure defined by the simulation is stochastic and/or lacks a tractable probability density \(p(\boldsymbol{x}| \boldsymbol{\theta})\), where \(\boldsymbol{\theta} \in \mathcal{\Theta}\) is the vector of model parameters. Given a set of experimental observations \(D = \{\boldsymbol{x}_0,...,\boldsymbol{x}_n\}\), a problem of special relevance for these disciplines is statistical inference on a subset of model parameters \(\boldsymbol{\omega} \in \mathcal{\Omega} \subseteq \mathcal{\Theta}\). This can be approached via likelihood-free inference algorithms such as Approximate Bayesian Computation (ABC) [95], simplified synthetic likelihoods [187], or density-estimation-by-comparison approaches [188].

Because the relation between the parameters of the model and the data is only available via forward simulation, most likelihood-free inference algorithms tend to be computationally expensive due to the need for repeated simulations to cover the parameter space. When data are high-dimensional, likelihood-free inference can rapidly become inefficient, so low-dimensional summary statistics \(\boldsymbol{s}(D)\) are used instead of the raw data for tractability. The choice of summary statistics then becomes critical: naive choices may discard relevant information and correspondingly degrade the power of the resulting statistical inference.
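
To make the reliance on repeated forward simulation and on the chosen summary statistic concrete, the following is a minimal sketch of ABC rejection sampling for a toy one-parameter simulator; the simulator, prior, summary, and tolerance used here are illustrative placeholders rather than part of any analysis described in this work.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, n=1000):
    # toy stochastic simulator: samples depend on theta through a location shift
    return rng.normal(loc=theta, scale=1.0, size=n)

def summary(x):
    # low-dimensional summary statistic s(D); here simply the sample mean
    return np.mean(x)

def abc_rejection(x_obs, prior_sample, eps=0.1, n_draws=20_000):
    # keep prior draws whose simulated summary lies within eps of the observed one
    s_obs = summary(x_obs)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample()             # draw a parameter value from the prior
        s_sim = summary(simulator(theta))  # forward-simulate and summarise
        if abs(s_sim - s_obs) < eps:
            accepted.append(theta)
    return np.array(accepted)              # approximate posterior samples

# usage: observations generated at theta = 0.5, uniform prior on [-2, 2]
x_obs = rng.normal(loc=0.5, scale=1.0, size=1000)
posterior = abc_rejection(x_obs, prior_sample=lambda: rng.uniform(-2.0, 2.0))
\end{verbatim}

Even for this one-dimensional toy problem, each accepted posterior sample requires many forward simulations, and replacing the sample mean with a less informative summary would typically yield a broader, less informative approximate posterior.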

For the particular problem of high energy physics data analyses at the LHC, the properties of the underlying generative model discussed in Chapter 3 make the likelihood intractable, but its structure facilitates the construction of simulation-based likelihoods of low-dimensional summary statistics that approximate latent variables. The ultimate aim is nevertheless to extract information about Nature from the large amounts of high-dimensional data on the subatomic particles produced by energetic collisions of protons and acquired by the highly complex detectors built around the collision point. Accurate data modelling is only available via stochastic simulation of a complicated chain of physical processes, from the underlying fundamental interaction to the subsequent particle interactions with the detector elements and their readout. As a result, the density \(p(\boldsymbol{x}| \boldsymbol{\theta})\) cannot be computed analytically.

Due to the high dimensionality of the observed data, a low-dimensional summary statistic has to be constructed in order to perform inference. A well-known result of classical statistics, which was also discussed in Section 3.2.2 as the Neyman-Pearson lemma[97], establishes that the likelihood-ratio \(\Lambda(\boldsymbol{x})=p(\boldsymbol{x}| H_0)/p(\boldsymbol{x}| H_1)\) is the most powerful test when two simple hypotheses are considered. As \(p(\boldsymbol{x}| H_0)\) and \(p(\boldsymbol{x}| H_1)\) are not available, simulated samples are used in practice to obtain an approximation of the likelihood ratio by casting the problem as supervised learning classification.
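
The connection between classification and the likelihood ratio is the standard density-ratio trick: for a well-calibrated probabilistic classifier trained on balanced samples drawn from the two hypotheses, the output approximates the conditional class probability, so that
\[
s(\boldsymbol{x}) \approx p(H_1 \,|\, \boldsymbol{x}) = \frac{p(\boldsymbol{x}| H_1)}{p(\boldsymbol{x}| H_0) + p(\boldsymbol{x}| H_1)}
\qquad \Longrightarrow \qquad
\Lambda(\boldsymbol{x}) = \frac{p(\boldsymbol{x}| H_0)}{p(\boldsymbol{x}| H_1)} \approx \frac{1-s(\boldsymbol{x})}{s(\boldsymbol{x})} \, .
\]
In the idealised limit of a perfectly trained classifier, its output therefore carries, up to a monotonic transformation, the same information as the likelihood ratio itself.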

Within high energy physics analyses, the nature of the generative model (a mixture of different processes) allows the problem to be treated as signal (S) versus background (B) classification [189], where the task becomes one of effectively estimating an approximation of \(p_{S}(\boldsymbol{x})/p_{B}(\boldsymbol{x})\), which varies monotonically with the likelihood ratio. This has been discussed at length in Section 4.3.1. While the use of classifiers to learn a summary statistic can be effective and increase the discovery sensitivity, the simulations used to generate the training samples often depend on additional uncertain parameters (commonly referred to as nuisance parameters). These nuisance parameters are not of immediate interest but have to be accounted for in order to make quantitative statements about the model parameters based on the available data. Classification-based summary statistics cannot easily account for these effects, so their inferential power is degraded when nuisance parameters are finally taken into account.
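
As an illustration of this standard approach, the following is a minimal sketch of classifier-based density-ratio estimation on toy Gaussian samples standing in for simulated signal and background; the samples, features, helper names, and the choice of scikit-learn's GradientBoostingClassifier are assumptions made for the sake of the example, not the configuration of the analyses discussed in this work.

\begin{verbatim}
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# toy simulated samples standing in for signal (S) and background (B) events
x_sig = rng.normal(loc=1.0, scale=1.0, size=(5000, 2))
x_bkg = rng.normal(loc=0.0, scale=1.0, size=(5000, 2))

X = np.vstack([x_sig, x_bkg])
y = np.concatenate([np.ones(len(x_sig)), np.zeros(len(x_bkg))])

# classifier trained to separate S from B on balanced simulated samples
clf = GradientBoostingClassifier().fit(X, y)

def density_ratio(x):
    # approximate p_S(x) / p_B(x) from the classifier output s(x) ~ p(S|x)
    s = clf.predict_proba(np.atleast_2d(x))[:, 1]
    s = np.clip(s, 1e-6, 1.0 - 1e-6)  # guard against saturated probabilities
    return s / (1.0 - s)

# the classifier score (or the ratio itself) can then serve as a
# one-dimensional summary statistic for downstream inference
print(density_ratio(np.array([[1.5, 0.5], [-1.0, -1.0]])))
\end{verbatim}

Note that such a statistic is learned for fixed values of the simulation parameters; if the signal or background samples shift as nuisance parameters vary, the learned ratio is no longer guaranteed to preserve the relevant information, which is precisely the degradation discussed above.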

In this chapter, we present a new machine learning method that constructs non-linear sample summary statistics by directly optimising, using simulated samples, the expected amount of information about the subset of parameters of interest, explicitly taking into account the effect of nuisance parameters. In addition, the learned summary statistics can be used to build synthetic sample-based likelihoods and to perform robust and efficient classical or Bayesian inference from the observed data, so they can readily replace classification-based or domain-motivated summary statistics in existing scientific data analysis workflows.