3.1 Statistical Modelling

An essential element for carrying out statistical inference is the availability of an adequate statistical model. In this section, the main characteristics of the statistical models used in particle collider analyses will be formally developed from first principles, allowing a mathematical treatment of their structure and factorisation. This will prove useful to establish a formal link between the techniques discussed in the next chapters and the simulation-based generative models that are often used to describe the data. Additionally, the role and importance of event selection, event reconstruction and dimensionality reduction - i.e. the compression of the relevant information from high-dimensional data into a lower-dimensional representation, such as the output of a multivariate classifier - will be described in the larger statistical framework of an LHC analysis. Lastly, the main approaches commonly followed to construct synthetic3 likelihoods that efficiently connect summaries of the detector observations with the parameters of interest will be illustrated.

3.1.1 Overview

Let us suppose that we record a collection of raw detector readouts \(D = \{\boldsymbol{x}_0,...,\boldsymbol{x}_{n-1}\}\) for a total of \(n\) bunch crossings at a particle collider experiment, such as CMS at the LHC (see Section 2.2). Note that vector notation is used for each individual readout, also referred to as an event, because for mathematical simplicity we will assume that each detector observation can be embedded - in the mathematical sense - as a member of a fixed-size \(d\)-dimensional space, i.e. \(\boldsymbol{x} \in \mathcal{X} \subseteq \mathbb{R}^d\), even though variable-size sets or tree-like structures might be a more compact and useful representation in practice, as will be discussed later. As a starting point, let us assume for simplicity that the detector readout for every bunch crossing is recorded, i.e. no trigger filtering system such as the one described in Section 2.2.7 is in place, hence after each bunch crossing \(i\) a given raw detector readout \(\boldsymbol{x}_i\) will be obtained. From here onwards, each event/observation/readout will be assumed to be independent and identically distributed (i.i.d.), a reasonable approximation if the experimental conditions are stable during the acquisition period, as discussed at the beginning of Section 2.3; consequently, neither the event ordering nor the index \(i\) is relevant.

3.1.1.1 Experiment Outcome

Within the above framework, we could begin by posing the question of how we expect the readout output, which can be effectively treated as a random variable \(\boldsymbol{x}\), to be distributed, and how such a distribution is related to the (theoretical) parameters we are interested in measuring or the model extensions we are interested in testing using the experiment. We would then like to model the probability density function generating a given observation \(\boldsymbol{x}_i\) conditional on the parameters of interest, that is: \[ \boldsymbol{x}_i \sim p ( \boldsymbol{x}|\boldsymbol{\theta} ) \qquad(3.1)\] where \(\boldsymbol{\theta} \in \mathcal{\Theta} \subseteq \mathbb{R}^p\) denotes all the parameters we are interested in and that affect the detector outcome of collisions. As will be extensively discussed in this chapter, an analytical or even tractable approximation of \(p ( \boldsymbol{x}|\boldsymbol{\theta})\) is not attainable, given that we are considering \(\boldsymbol{x}\) to be a representation of the raw readout of all sub-detectors, thus its dimensionality \(d\) can be of the order \(\mathcal{O}(10^8)\). It is worth mentioning that even though \(d\) is very high, each observation is usually extremely sparse, given that most of the detector channels would not sense any signal. The total number of observations \(n\) is also very large at modern colliders, e.g. a bunch crossing occurs every \(25\ \textrm{ns}\) at the LHC. Furthermore, the known interactions that produce the set of particles of the event as well as the subsequent physical processes that generate the readouts in the detectors are extremely complex, and a realistic modelling can only be obtained through simulation, as jointly reviewed in Section 1.3 and Section 2.3.

3.1.1.2 Mixture Structure

While a detailed closed-form description of \(p(\boldsymbol{x}|\boldsymbol{\theta})\) cannot be obtained, we can safely make a very useful remark about its basic structure, which is fundamental for simplifying the statistical treatment of particle collider observations and simulations, and was already hinted at in Section 1.3.1 when discussing the possible outcomes of fundamental proton-proton interactions. The underlying process generating \(\boldsymbol{x}\) can be treated as a mixture model, which can be expressed as the probabilistic composition of samples from multiple probability distributions corresponding to different types of interaction processes occurring in the collision. If we knew the probability distribution function of each mixture component \(p_j(\boldsymbol{x}|\boldsymbol{\theta})\) then \(p ( \boldsymbol{x}|\boldsymbol{\theta} )\) could be expressed as: \[ p ( \boldsymbol{x}|\boldsymbol{\theta} ) = \sum^{K-1}_{j=0} \phi_j \ p_j ( \boldsymbol{x}|\boldsymbol{\theta} ) \qquad(3.2)\] where \(K\) is the number of mixture components and \(\phi_j\) is the mixture weight/fraction, i.e. the probability for a sample to originate from mixture component \(j\). The specifics of the mixture expansion as well as the total number of mixture components are not uniquely defined, but are based on the independence of groups of physical processes, as will be discussed later. In practice, each \(p_j(\boldsymbol{x}|\boldsymbol{\theta})\) will be intractable for the exact same reason that \(p ( \boldsymbol{x}|\boldsymbol{\theta} )\) is intractable, thus a more sensible description of the mixture model is its generative definition, described by the following two-step sampling procedure: \[ z_i \sim \textrm{Categorical}(\boldsymbol{\phi}) \quad \longrightarrow \quad \boldsymbol{x}_i \sim p_{z_i}( \boldsymbol{x} | \boldsymbol{\theta}) \qquad(3.3)\] describing the sampling of a random integer \(z_i \in \{0, \dots, K -1 \}\) from a categorical4 distribution and the subsequent sampling of the corresponding mixture component indexed by \(z_i\), where \(\boldsymbol{\phi} = \{\phi_0, \dots, \phi_{K-1} \}\) is the vector of probabilities for each of the mixture components. From here onwards, mixture models might in some cases be portrayed using the analytical form of Equation 3.2, always noting that the generative approach might be more convenient for the actual estimation of expectation values when the mixture component distributions \(p_j(\boldsymbol{x}|\boldsymbol{\theta})\) are not tractable.
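The two-step sampling procedure of Equation 3.3 can be illustrated with a minimal sketch in which toy Gaussian components stand in for the intractable \(p_j(\boldsymbol{x}|\boldsymbol{\theta})\); all numbers and component shapes below are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy stand-ins for the intractable mixture components p_j(x|theta):
# each component is a 2D Gaussian with a different mean (illustrative only).
component_means = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
phi = np.array([0.90, 0.07, 0.03])  # mixture fractions, summing to one

def sample_event():
    """Two-step generative sampling of Equation 3.3: draw the latent
    category z_i, then draw x_i from the selected mixture component."""
    z = rng.choice(len(phi), p=phi)                    # z_i ~ Categorical(phi)
    x = rng.normal(loc=component_means[z], scale=1.0)  # x_i ~ p_{z_i}(x | theta)
    return z, x

# Only x would be observable in data; the component label z is latent.
events = [sample_event() for _ in range(5)]
```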

3.1.1.3 Mixture Components

The mixture model structure can be directly linked to the physical processes happening in fundamental proton-proton collisions and within the detectors used to study them, as described in previous chapters. As an additional simplification for now, let us neglect the effect of multiple hadron interactions (pileup), described in Section 2.1.3. For each proton bunch crossing, hard interactions (i.e. ones associated with a large characteristic energy scale \(Q^2\), whose cut-off does not have to be specified for this particular argument) between partons might or might not occur, given the stochastic nature of the scattering processes. We could nevertheless associate a probability \(\phi_{\textrm{hard}}\) with a hard interaction happening, as well as a probability \(\phi_{\textrm{not-hard}} = 1-\phi_{\textrm{hard}}\) with it not happening. Given the proton colliding conditions at the LHC, the latter case is much more likely, i.e. \(\phi_{\textrm{not-hard}} \gg \phi_{\textrm{hard}}\), yet the relative probabilities depend on the energy scale cut-off considered.

We can further break each previously mentioned category into sub-components corresponding to different types of processes. The hard interaction category can itself be expressed as a mixture of groups of physical interactions that can produce a hard scattering5, so the probability \(\phi_{\textrm{hard}}\) can be expressed as the following sum: \[ \phi_{\textrm{hard}} = \phi_0 + \dots + \phi_{K-2} = \sum_{k \in H} \phi_k \qquad(3.4)\] where \(H\) represents a given set of independent contributions \(k\), each characterised by a distribution \(p_k(\boldsymbol{x}|\boldsymbol{\theta})\), which depends on the group \(k\) of processes that produce hard scatterings. Neither such a set nor its number of elements is uniquely defined, given that any two components \(a\) and \(b\) in \(H\) can be substituted by \(c\), where \(\phi_c = \phi_a + \phi_b\) and \[p_c(\boldsymbol{x}|\boldsymbol{\theta})= \frac{\phi_a}{\phi_a+\phi_b} \ p_a(\boldsymbol{x}|\boldsymbol{\theta}) + \frac{\phi_b}{\phi_a+\phi_b} \ p_b(\boldsymbol{x}|\boldsymbol{\theta}) \qquad(3.5)\] which can be applied recursively to alter the number of components in the set. Independently of the basis chosen for the mixture expansion, in general it is not possible to infer the latent category \(z_i\) (see Equation 3.3) given an observation \(\boldsymbol{x}_i\), because \(\boldsymbol{x}_i\) may be in the support of several mixture components \(p_j(\boldsymbol{x}|\boldsymbol{\theta})\). Only probabilistic statements about the generative group \(j\) can be made based on the observations.

A convenient definition for the set \(H\) is one that is aligned with the way theoretical calculations are carried out, given that the relative probability of a given process \(\phi_{pp\rightarrow X}\) will be proportional to its total cross section \(\sigma (pp\rightarrow X)\), while its readout distribution will depend on its differential cross section \(d\sigma (pp\rightarrow X)\) and its support (i.e. the subset of the function domain not mapped to zero). In fact, given that the total and differential cross sections are proportional to the matrix element squared (see Section 1.1.1) of a given process, \(d\sigma (pp\rightarrow X) \propto |\mathcal{M}|^2\), it is often possible to further divide each process according to its Feynman diagram expansion (including interference terms). This can be a very useful notion for some analysis use cases, and is related to the approach that will be used in Chapter 5.

3.1.1.4 Signal and Background

Oftentimes, we are interested in studying a subset \(S \subset H\) of all the hard interaction processes, which will be referred to as the signal set in what follows. This can be a single type of physical process \(\sigma (pp\rightarrow X)\), e.g. the inclusive production of a pair of Higgs bosons \(\sigma (pp\rightarrow \textrm{HH} + \textrm{other})\), or several, in which case they can effectively be viewed as one mixture component using Equation 3.5. We can accordingly define the background subset \(B = H \setminus S\), comprising all the other generating processes in \(H\) that we are not interested in, a definition which could also be extended to include collisions where non-hard processes occurred if needed. Such a distinction between generating processes of interest \(S\) and background \(B\) is at the root of every analysis at the LHC, and it is motivated by the fact that small changes in the parameters of the SM or of its theoretical extensions/alternatives affect, at leading order, only a subset of the produced processes: those governed by the interactions linked to the parameter.

As a matter of fact, statistical inference at the LHC is customarily not carried out directly on the parameters of the SM or the extension being studied, but on the relative frequency of the set of processes of interest \(\phi_S\) or on the properties of its distribution \(p_S(\boldsymbol{x}|\boldsymbol{\theta})\). As previously mentioned, the former is proportional to the cross section of the signal processes \(\sigma_S\) (see Section 1.3), while the latter can include properties such as the mass of an intermediate particle resonance6 (e.g. the Higgs mass \(m_\textrm{H}\)) or the general behaviour of the differential distribution (i.e. using unfolding methods to remove the experimental effects, which are not discussed in this work). These parametric proxies can then be compared with the theoretical predictions of the SM or of the alternative considered, in order to exclude the model or constrain its fundamental parameters (i.e. those that appear in the Lagrangian).

3.1.1.5 Event Selection

Given the mixture model structure expected for \(p ( \boldsymbol{x}|\boldsymbol{\theta} )\) and the fact that we are only interested in a small fraction of the readout-generating processes for each collision, because in general \(\phi_S \ll \phi_B \ll \phi_{\textrm{not-hard}}\), the effect of the trigger or of any other event selection should be considered. The role of event selection is to reduce the fraction of events that do not contain useful information for the inference task of interest. Trigger selection can be thought of as a technical requirement, reducing the total rate of detector readouts recorded to match the available hardware for data acquisition, as discussed in Section 2.2.7. The purpose of analysis selection, as will be discussed in Chapter 5, is instead to reduce the expected contribution of background processes that are not well modelled by simulation, as well as to increase the expected fraction of signal events in synthetic counting likelihoods, such as those which will be detailed in Section 3.1.3.4.

In general mathematical terms, any deterministic event selection can be thought of as an indicator function \(\mathbb{1}_\mathcal{C} : \mathcal{X} \longrightarrow \{0,1\}\) of a given subset \(\mathcal{C} \subseteq \mathcal{X}\) of the set of possible detector readouts. The indicator function \(\mathbb{1}_\mathcal{C}(\boldsymbol{x})\) can be defined as:

\[\mathbb{1}_\mathcal{C}(\boldsymbol{x}) = \begin{cases} 1 \ \textrm{if} \ \boldsymbol{x} \in \mathcal{C} \\ 0 \ \textrm{if} \ \boldsymbol{x} \notin \mathcal{C} \\ \end{cases} \qquad(3.6)\]

where the specific definition of such a function depends on the definition of the subset \(\mathcal{C}\), e.g. a simple cut \(f(\boldsymbol{x}) > t_{{\textrm{cut}}}\) on a one-dimensional function \(f : \mathcal{X} \longrightarrow T \subseteq \mathbb{R}\) of the readout. Any indicator function can also be viewed as a boolean predicate function, so the event selection can also be expressed as a combination of selection functions, i.e. if the set \(\mathcal{C}=\mathcal{A} \cap \mathcal{B}\) is the intersection between two subsets, the indicator function of \(\mathcal{C}\) can simply be expressed as the product \(\mathbb{1}_\mathcal{C}=\mathbb{1}_\mathcal{A} \cdot \mathbb{1}_\mathcal{B}\). This framework is flexible enough to represent all deterministic event selections, and it could also be extended by an independent non-deterministic term without affecting the rest of the considerations presented in this chapter. A non-deterministic factor could be useful to model, for example, trigger prescales: trigger decisions in which only a randomly selected fraction of all the accepted events is recorded, ensuring that the total rate remains manageable.
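As a small sketch of this composition of selections, the following example builds two threshold selections on hypothetical one-dimensional summaries of the readout (the names `met` and `lead_pt` and the cut values are purely illustrative assumptions) and combines them as a product of indicator functions:

```python
# Hypothetical one-dimensional summaries f(x) of the readout, used here only
# to define simple threshold selections; names and thresholds are illustrative.
def indicator_A(event):
    return 1 if event["met"] > 50.0 else 0       # 1_A(x): cut f(x) > t_cut

def indicator_B(event):
    return 1 if event["lead_pt"] > 30.0 else 0   # 1_B(x)

def indicator_C(event):
    # C = A ∩ B, hence 1_C = 1_A * 1_B (a boolean AND of the two predicates)
    return indicator_A(event) * indicator_B(event)

event = {"met": 62.0, "lead_pt": 25.0}
assert indicator_C(event) == 0  # passes the met cut but fails the lead_pt cut
```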

In practice, a given selection \(\mathbb{1}_\mathcal{C}(\boldsymbol{x})\), likely based on a composition of simple criteria, would have been imposed on the recorded detector readouts before any statistical analysis is carried out. The structure of the statistical model \(g(\boldsymbol{x} | \boldsymbol{\theta} )\) resulting from applying an arbitrary selection \(\mathbb{1}_\mathcal{C}(\boldsymbol{x})\) to a mixture model such as the one described in Equation 3.2 can be obtained by multiplying the probability density by \(\mathbb{1}_\mathcal{C}(\boldsymbol{x})\). After including the relevant normalisation term, the resulting probability distribution can be expressed as:

\[ g(\boldsymbol{x} | \boldsymbol{\theta} ) = \frac{ \mathbb{1}_\mathcal{C}(\boldsymbol{x}) \sum^{K-1}_{j=0} \phi_j \ p_j ( \boldsymbol{x}|\boldsymbol{\theta})}{ \int \left (\mathbb{1}_\mathcal{C}(\boldsymbol{x}) \sum^{K-1}_{j=0} \phi_j \ p_j ( \boldsymbol{x}|\boldsymbol{\theta}) \right ) d \boldsymbol{x}} = \sum^{K-1}_{j=0} \left ( \frac{ \phi_j \epsilon_j }{ \sum^{K-1}_{k=0} \phi_k \epsilon_k } \right ) g_j (\boldsymbol{x}|\boldsymbol{\theta}) \qquad(3.7)\]

where \(g_j (\boldsymbol{x}|\boldsymbol{\theta}) = \mathbb{1}_\mathcal{C}(\boldsymbol{x}) p_j (\boldsymbol{x}|\boldsymbol{\theta}) / \epsilon_j\) is the probability density function of each mixture component after the selection, \(\epsilon_j=\int \mathbb{1}_\mathcal{C}(\boldsymbol{x}) p_j (\boldsymbol{x}|\boldsymbol{\theta}) d\boldsymbol{x}\) is the efficiency of the selection for each mixture component, and the integral in the denominator of the last expression has been simplified by noting that \(\int g_j ( \boldsymbol{x}|\boldsymbol{\theta}) d \boldsymbol{x} = 1\). From Equation 3.7 it becomes clear that the statistical model after any event selection is also a mixture model, whose mixture components are \(g_j (\boldsymbol{x}|\boldsymbol{\theta})\) and whose mixture fractions are \(\chi_j=\phi_j\epsilon_j/\sum^{K-1}_{k=0} \phi_k\epsilon_k\). This fact will be very relevant to build statistical models of the observed data after an event selection is in place.
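A short numerical sketch can make the effect of Equation 3.7 on the mixture fractions explicit; the pre-selection fractions and per-component efficiencies below are illustrative placeholders, with the last component playing the role of a signal-like process that the selection enhances:

```python
import numpy as np

# Illustrative pre-selection mixture fractions phi_j and per-component
# selection efficiencies epsilon_j (placeholder numbers, not from any analysis).
phi = np.array([0.95, 0.04, 0.01])
eps = np.array([1e-4, 5e-3, 0.30])

# Post-selection mixture fractions of Equation 3.7: chi_j ∝ phi_j * eps_j
chi = phi * eps / np.sum(phi * eps)
print(chi)  # the signal-like third component now dominates the selected mixture
```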

So far, no explicit assumptions on the probability distribution functions of each mixture component \(j\) or the details of the event selection function \(\mathbb{1}_\mathcal{C}(\boldsymbol{x})\) have been considered, in order to keep the previously developed modelling framework as general as possible. In the next sections, it will become increasingly clear how \(p_j (\boldsymbol{x}|\boldsymbol{\theta})\), and in turn \(g_j (\boldsymbol{x}|\boldsymbol{\theta})\) and the efficiency \(\epsilon_j\), can be modelled by generating simulated detector readouts produced by a given process \(j\).

3.1.2 Simulation as Generative Modelling

The physical principles underlying the simulation of detector readouts, or events, for a given hard proton-proton interaction process were reviewed in Section 1.3 and Section 2.3. Rather than the procedural details of event generation, the focus of this section is the study of the simulation chain as a generative statistical model, together with its basic structure and properties, which will be useful later to understand many analysis techniques that are commonly used in experimental particle physics.

For simplicity, we will be considering the statistical model describing the distribution of observations of detector readouts before any event selection, which was referred to as \(p ( \boldsymbol{x}|\boldsymbol{\theta} )\) in the previous section. It should always be kept in mind that the distribution after any arbitrary deterministic event selection \(\mathbb{1}_\mathcal{C}(\boldsymbol{x})\) is also a mixture model (see Equation 3.7), and that samples under the corresponding probability distribution functions \(g_j (\boldsymbol{x}|\boldsymbol{\theta})\) and mixture fractions \(\chi_j\) can easily be obtained from the non-selected simulated events, as is actually done in practice.

3.1.2.1 Observable and Latent Variables

The first step to build a generative statistical model is to define which are the observed variables and which are the hidden quantities, referred to as latent variables, that explain the structure of the data. For particle collider experiments, we may consider the full detector readout \(\boldsymbol{x} \in \mathcal{X} \subseteq \mathbb{R}^d\) as the only observable variable, given that any other observable can be expressed as a function of the raw readout, as will be discussed in Section 3.1.3. The probability density function of the data \(p ( \boldsymbol{x}|\boldsymbol{\theta} )\) from a generative standpoint can be written as an integration of the joint distribution \(p(\boldsymbol{x}, \boldsymbol{z} | \boldsymbol{\theta})\) over all latent variables \(\boldsymbol{z}\) of an event: \[p ( \boldsymbol{x}|\boldsymbol{\theta} ) = \int p(\boldsymbol{x}, \boldsymbol{z} | \boldsymbol{\theta})d\boldsymbol{z} \qquad(3.8)\] where \(\boldsymbol{\theta}\) is a vector with all model parameters, which normally are global (the same for all the observations) and include the theory parameters of interest as well as any other parameter that affects the detector readouts. While the true generative model of the data \(p(\boldsymbol{x}, \boldsymbol{z} | \boldsymbol{\theta})\) is unknown, knowledge about the underlying physical processes described in Section 1.3 and Section 2.3 can be used to build a generative approximation of \(p(\boldsymbol{x}, \boldsymbol{z} | \boldsymbol{\theta})\) which can describe the observed data realistically and be used to carry out inference on the parameters of interest.

In fact, one of the most relevant latent variables at particle colliders has been already introduced with the generative definition of a mixture model in Equation 3.3, the mixture assignment integer \(z_i \in \{0, \dots, K -1 \}\). This latent variable represents which type of fundamental interaction occurred in the event, and is useful to exemplify the main property of latent variables: that they are not observed but can only (at most) be inferred. Let us consider the problem of finding out the type of interaction \(j\) that caused a single detector readout observation \(\boldsymbol{x}_i\). As long as \(\boldsymbol{x}_i\) is in the support space of more than one of the mixture components \(p_j( \boldsymbol{x}|\boldsymbol{\theta})\), which is almost always the case, only probabilistic statements about the type of interaction originating \(\boldsymbol{x}_i\) can be made, even if the \(p_j( \boldsymbol{x}|\boldsymbol{\theta})\) are known. In practice, \(p_j( \boldsymbol{x}|\boldsymbol{\theta})\) are not known analytically so probabilistic classification techniques can be used to estimate the conditional probabilities based on simulated samples, as discussed in Chapter 4.

3.1.2.2 Structure of Generative Model

Other than the basic mixture model structure, our understanding of the underlying physical processes occurring in proton-proton collisions can be used to recognise additional structure in the generative model, by factorising the joint distribution \(p(\boldsymbol{x}, \boldsymbol{z} | \boldsymbol{\theta})\) into conditional factors matching the various simulation steps and their dependencies: \[p(\boldsymbol{x}, \boldsymbol{z} | \boldsymbol{\theta}) = p ( \boldsymbol{x} | \boldsymbol{z}_\textrm{d}) p ( \boldsymbol{z}_\textrm{d} | \boldsymbol{z}_\textrm{s}) p ( \boldsymbol{z}_\textrm{s} | \boldsymbol{z}_\textrm{p}) \sum^{K-1}_{j=0} p ( z_i = j |\boldsymbol{\theta}) p ( \boldsymbol{z}_\textrm{p}|\boldsymbol{\theta}, z_i = j) \qquad(3.9)\] where each factor can be defined as follows:

  • \(p( z_i = j|\boldsymbol{\theta}) = \phi_j(\boldsymbol{\theta})\) is the probability of a given type of process \(j\) occurring, which is usually a function of theory parameters \(\boldsymbol{\theta}\).
  • \(p ( \boldsymbol{z}_\textrm{p}|\boldsymbol{\theta}, z_i = j)\) is the conditional probability density of a given set of parton-level particles and their four-momenta (characterised by the latent representation \(\boldsymbol{z}_\textrm{p} \in \mathcal{Z}_\textrm{p}\)) being the outcome of a group of fundamental proton interaction processes \(pp \longrightarrow X\) indexed by the latent variable \(z_i \in \mathcal{Z}_i\), which might also be a function of the theory parameters \(\boldsymbol{\theta}\).
  • \(p( \boldsymbol{z}_\textrm{s} | \boldsymbol{z}_\textrm{p})\) is the conditional density of a given parton-shower outcome \(\boldsymbol{z}_\textrm{s} \in \mathcal{Z}_\textrm{s}\) as a function of the parton-level outcome.
  • \(p ( \boldsymbol{z}_\textrm{d} | \boldsymbol{z}_\textrm{s})\) is the conditional density of a set of detector interactions and readout noise \(\boldsymbol{z}_\textrm{d} \in \mathcal{Z}_\textrm{d}\) as a function of the parton-shower output.
  • \(p ( \boldsymbol{x} | \boldsymbol{z}_\textrm{d})\) is the conditional density of a given detector readout \(\boldsymbol{x} \in \mathcal{X}\) as a function of the detector material interactions and detector readout noise.

The dimensionality of the latent space greatly increases with each simulation step, from a single integer for \(\mathcal{Z}_i\), to \(\mathcal{O}(10)\) parton four-momenta variables within \(\mathcal{Z}_p\), to \(\mathcal{O}(100)\) after the parton-shower \(\mathcal{Z}_s\), and finally to \(\mathcal{O}(10^8)\) in the detector interaction latent space \(\mathcal{Z}_d\) and also in the observable readout space \(\mathcal{X}\). In the factorisation presented in Equation 3.9, the dependence on the parameters has only been made explicit for \(p( z_i|\boldsymbol{\theta})\) and \(p ( \boldsymbol{z}_\textrm{p}|\boldsymbol{\theta}, z_i)\), because the theoretical parameters of interest \(\boldsymbol{\theta}\) often affect only the rates of the different fundamental processes and their differential distributions, which correspond to the mentioned conditional probability distributions. In the actual simulation chain, all conditional factors typically depend on additional parameters which might be uncertain, and whose effect and modelling will be discussed in Section 3.1.4.
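The factorised structure of Equation 3.9 can be emulated by ancestral sampling, drawing each latent variable from its conditional factor in turn. The following sketch is purely schematic: every conditional below is an invented toy stand-in (multiplicities, smearing and the final digitisation are arbitrary assumptions), whereas a real chain would rely on matrix-element generators, parton-shower programs and detector simulation:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_process(phi):
    return rng.choice(len(phi), p=phi)               # z_i ~ p(z_i | theta)

def sample_parton_level(z_i):
    n_partons = 2 + z_i                              # toy multiplicity per process
    return rng.exponential(scale=50.0, size=(n_partons, 4))  # z_p ~ p(z_p | theta, z_i)

def sample_shower(z_p):
    # z_s ~ p(z_s | z_p): each parton fragments into a random number of particles
    return np.concatenate(
        [np.repeat(parton[None, :], rng.integers(1, 5), axis=0) for parton in z_p]
    )

def sample_detector(z_s):
    # z_d ~ p(z_d | z_s): smear particle-level quantities to mimic detector response
    return z_s + rng.normal(scale=1.0, size=z_s.shape)

def sample_readout(z_d):
    # x ~ p(x | z_d): project onto a fixed-size readout vector (toy digitisation)
    counts, _ = np.histogram(z_d[:, 0], bins=32, range=(0.0, 300.0))
    return counts

phi = np.array([0.7, 0.2, 0.1])
z_i = sample_process(phi)
z_p = sample_parton_level(z_i)
z_s = sample_shower(z_p)
z_d = sample_detector(z_s)
x = sample_readout(z_d)   # only x is observable; (z_i, z_p, z_s, z_d) remain latent
```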

As previously mentioned, computer programs can be used to realistically simulate detector observations. For simulated observations, not only the final readout is observed, but all latent variables can be obtained from the intermediate steps of the generative chain. These variables, in particular \(\boldsymbol{z}_\textrm{p}\) and \(\boldsymbol{z}_\textrm{s}\), are commonly referred to as generator-level observables, and are extremely useful to construct techniques that approximate the latent variables from the detector readouts. In fact, the whole simulation chain can be viewed as a probabilistic program [86], [87], thus each of the factors in Equation 3.9 can be further broken down as a sequence of random samples, which can be used to speed up latent variable inference based on the execution traces, i.e. the recorded sequences of random numbers generated for each observation.

Some joint factorisations are particularly useful for data analysis and simulation, such as the one making explicit the dependence between the differential partonic cross sections and the parton configuration in the collision, because it allows factoring out the density of the latent variables \(\boldsymbol{z}_\textrm{PDF}\) associated with the parton components (i.e. the flavour and momenta of each interacting parton and the factorisation scale \(\mu_F^2\), as described in Section 1.3.3). Each mixture component \(j\) in Equation 3.9, which represents a group of fundamental interactions between protons \(pp \longrightarrow X\), can be expressed as the product of the probability of a given parton configuration \(p(\boldsymbol{z}_\textrm{PDF}|\boldsymbol{\theta}_\textrm{PDF})\) and a mixture over all partonic processes that can produce \(pp \longrightarrow X\), whose set is referred to as \(L\) in the following expression: \[ p(z_i|\boldsymbol{\theta}) \ p ( \boldsymbol{z}_\textrm{p}|\boldsymbol{\theta}, z_i) = p(\boldsymbol{z}_\textrm{PDF}|\boldsymbol{\theta}_\textrm{PDF}) \sum_{g \in L} p(z_f = g| \boldsymbol{\theta}, \boldsymbol{z}_\textrm{PDF}) p ( \boldsymbol{z}_\textrm{p}|\boldsymbol{\theta}, z_f = g) \qquad(3.10)\] where \(p(z_f = g| \boldsymbol{\theta}, \boldsymbol{z}_\textrm{PDF})\) is the relative probability of a given partonic process \(g\) given a parton configuration \(\boldsymbol{z}_\textrm{PDF}\) and \(p(\boldsymbol{z}_\textrm{p}|\boldsymbol{\theta}, z_f = g)\) is the probability distribution function of the parton-level particles produced as a result of the interaction for a given partonic process \(g\), which is proportional to the partonic differential cross section \(d\sigma(ij \rightarrow X)\). This factorisation is basically a probabilistic model version of Equation 1.32, dealing with the QCD factorisation of the parton distribution functions and the hard process differential cross section.

Another relevant phenomenon that can be made explicit in the joint distribution \(p(\boldsymbol{x}, \boldsymbol{z} | \boldsymbol{\theta})\) is the effect of multiple hadron interactions in the collision, or pileup, as discussed in Section 2.1.3. Given that each proton-proton interaction is independent of the others, the effect of pileup interactions can be included by augmenting the factor representing the conditional probability density of the detector interactions and noise given the hard interaction parton-shower output, \(p ( \boldsymbol{z}_\textrm{d} | \boldsymbol{z}_\textrm{s})\), with an additional pileup latent variable as follows: \[ p ( \boldsymbol{z}_\textrm{d}, \boldsymbol{z}_\textrm{pileup} | \boldsymbol{z}_\textrm{s}, \boldsymbol{\theta}_\textrm{pileup}) = p(\boldsymbol{z}_\textrm{d} | \boldsymbol{z}_\textrm{s},\boldsymbol{z}_\textrm{pileup}) p(\boldsymbol{z}_\textrm{pileup} | \boldsymbol{\theta}_\textrm{pileup}) \qquad(3.11)\] where \(\boldsymbol{z}_\textrm{pileup}\) is a latent variable representing the details of the pileup interactions that happened in a given collision (i.e. the number of interactions and their corresponding particle outcome), and \(\boldsymbol{\theta}_\textrm{pileup}\) are the bunch crossing and luminosity parameters that affect the pileup distribution.

Further structure in the generative model can often be found, depending on the process being generated, the modelling assumptions, and the latent space representation chosen. As an example, it is often useful to factorise out the latent subspace that depends directly on the subset of parameters of interest from the one that does not. The conditional distributions in that latent subspace can sometimes be expressed analytically, or their dimensionality may be low enough to use non-parametric density estimation techniques effectively, which can greatly simplify the modelling of changes in the parameters of interest.

3.1.2.3 Simulated Observations

The mixture structure of the probability distribution function \(p(\boldsymbol{x} | \boldsymbol{\theta})\) mentioned above greatly simplifies the simulation of realistic observations, because large datasets \(S_j = \{\boldsymbol{x}_0,...,\boldsymbol{x}_{m-1}\}\) of simulated observations for each type of interaction \(j\) can be generated before any event selection. The expected value of any measurable function of the detector readout \(f(\boldsymbol{x})\) for events coming from a given process \(j\) can be expressed as: \[ \mathop{\mathbb{E}}_{x \sim p_j ( \boldsymbol{x}|\boldsymbol{\theta} )} \left [ f(\boldsymbol{x}) \right ] = \int f(\boldsymbol{x}) p_j ( \boldsymbol{x}|\boldsymbol{\theta} ) d\boldsymbol{x} \approx \frac{1}{m} \sum_{\boldsymbol{x}_s \in S_j } f(\boldsymbol{x}_s) \qquad(3.12)\] where the last term approximates the integral by the average over all stochastic simulations for a given process. The previous Monte Carlo approximation can be used to estimate the selection efficiency \(\epsilon_j\), as defined in Equation 3.7, after any deterministic event selection \(\mathbb{1}_\mathcal{C}(\boldsymbol{x})\): \[ \epsilon_j = \mathop{\mathbb{E}}_{x \sim p_j ( \boldsymbol{x}|\boldsymbol{\theta} )} \left [ \mathbb{1}_\mathcal{C}(\boldsymbol{x}) \right ] = \int \mathbb{1}_\mathcal{C}(\boldsymbol{x}) p_j ( \boldsymbol{x}|\boldsymbol{\theta} ) d\boldsymbol{x} \approx \frac{1}{m} \sum_{\boldsymbol{x}_s \in S_j } \mathbb{1}_\mathcal{C}(\boldsymbol{x}_s) \qquad(3.13)\] which simply corresponds to the number of simulated observations that pass the selection divided by the total number of simulated observations \(m\). Lastly, the expected value of any measurable function \(f(\boldsymbol{x})\) after a given event selection \(\mathbb{1}_\mathcal{C}(\boldsymbol{x})\) for events generated by a given process \(j\) can be approximated by: \[ \mathop{\mathbb{E}}_{x \sim g_j ( \boldsymbol{x}|\boldsymbol{\theta} )} \left [ f(\boldsymbol{x}) \right ] = \frac{1}{\epsilon_j } \int f(\boldsymbol{x}) \mathbb{1}_\mathcal{C}(\boldsymbol{x}) p_j ( \boldsymbol{x}|\boldsymbol{\theta} ) d\boldsymbol{x} \approx \frac{1}{\epsilon_j m} \sum_{\boldsymbol{x}_s \in S_j } f(\boldsymbol{x}_s) \mathbb{1}_\mathcal{C}(\boldsymbol{x}_s) \qquad(3.14)\] which corresponds to the mean of \(f(\boldsymbol{x})\) over all the events that passed the selection, noting that if all the events passed the selection (i.e. \(\mathbb{1}_\mathcal{C}(\boldsymbol{x}) = 1\)), then Equation 3.12 would be recovered.
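The Monte Carlo estimates of Equations 3.12-3.14 reduce to simple averages over simulated events. A minimal sketch follows, in which a one-dimensional toy summary stands in for the readout and an exponential distribution and threshold selection are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy simulated sample S_j for one process j: each readout is reduced to a
# single summary value drawn from an (assumed) exponential distribution.
x_sim = rng.exponential(scale=40.0, size=100_000)

passed = x_sim > 100.0        # 1_C(x) evaluated on every simulated event

eps_j = passed.mean()         # Equation 3.13: fraction of events passing the selection
mean_before = x_sim.mean()    # Equation 3.12: E[f(x)] before selection, with f(x) = x
mean_after = x_sim[passed].mean()   # Equation 3.14: E[f(x)] over the selected events
```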

While we have been dealing independently with the estimation of arbitrary expected values for a given mixture component \(j\), the computation of the expected value of any measurable function \(f(\boldsymbol{x})\) under the total mixture distribution can easily be expressed as a function of the expectations over the mixture components: \[ \mathop{\mathbb{E}}_{x \sim g ( \boldsymbol{x}|\boldsymbol{\theta} )} \left [ f(\boldsymbol{x}) \right ] = \int f(\boldsymbol{x}) \sum^{K-1}_{j=0} \chi_j g_j (\boldsymbol{x}|\boldsymbol{\theta}) d\boldsymbol{x} = \sum^{K-1}_{j=0} \chi_j \mathop{\mathbb{E}}_{x \sim g_j ( \boldsymbol{x}|\boldsymbol{\theta} )} \left [ f(\boldsymbol{x}) \right ] \qquad(3.15)\] where \(\chi_j=\phi_j\epsilon_j/\sum^{K-1}_{k=0} \phi_k\epsilon_k\) is the mixture fraction after selection (see Equation 3.7). While the problem of estimating expected values might seem unrelated to the inference problem at hand, in Section 3.1.3 it will become evident that the construction of non-parametric likelihoods of summary statistics can be reduced to the estimation of expected values.

Oftentimes, the simulated observations are generated using a somewhat different probability distribution than that of the experimental data, for example because some of the generating parameters are not known precisely beforehand (e.g. the properties of pileup interactions). Alternatively, we might want to use a single set of simulated observations to realistically model observables corresponding to a different value of the parameters \(\boldsymbol{\theta}\), or even to compute observables under a different process \(j\). Let us suppose that the samples were generated under \(p_Q(\boldsymbol{x} | \boldsymbol{\theta}_Q)\) while we want to model samples under \(p_R(\boldsymbol{x} | \boldsymbol{\theta}_R)\). In that case, if both distributions have the same support, we can express the expectation value under the desired distribution as: \[ \mathop{\mathbb{E}}_{x \sim p_R ( \boldsymbol{x}|\boldsymbol{\theta}_R )} \left [ f(\boldsymbol{x}) \right ] = \frac {\int f(\boldsymbol{x}) \frac{p_R ( \boldsymbol{x}|\boldsymbol{\theta}_R )}{ p_Q ( \boldsymbol{x}|\boldsymbol{\theta}_Q )} p_Q ( \boldsymbol{x}|\boldsymbol{\theta}_Q ) d \boldsymbol{x}}{ \int \frac{p_R ( \boldsymbol{x}|\boldsymbol{\theta}_R )}{ p_Q ( \boldsymbol{x}|\boldsymbol{\theta}_Q )} p_Q ( \boldsymbol{x}|\boldsymbol{\theta}_Q ) d \boldsymbol{x}} \approx \frac{\sum_{\boldsymbol{x}_s \in S_j } w(\boldsymbol{x}_s) f(\boldsymbol{x}_s)}{ \sum_{\boldsymbol{x}_s \in S_j } w(\boldsymbol{x}_s) } \qquad(3.16)\] which is analogous to what was done in Equation 3.12, but accounting for a weight \(w(\boldsymbol{x}_s) = p_R ( \boldsymbol{x}_s|\boldsymbol{\theta}_R )/ p_Q ( \boldsymbol{x}_s|\boldsymbol{\theta}_Q )\) for each simulated observation. This technique can also be used together with an arbitrary event selection \(\mathbb{1}_\mathcal{C}(\boldsymbol{x})\), simply by considering as event weight the product \(w_{\mathcal{C}}(\boldsymbol{x}_s) = \mathbb{1}_\mathcal{C} (\boldsymbol{x}_s) w(\boldsymbol{x}_s)\), which amounts to summing over the selected events. In particle physics experiments, the probability distribution functions \(p_Q(\boldsymbol{x} | \boldsymbol{\theta}_Q)\) and \(p_R(\boldsymbol{x} | \boldsymbol{\theta}_R)\) are most likely intractable, thus the estimation of \(w_{\mathcal{C}}(\boldsymbol{x}_s)\) either has to be carried out by non-parametric density estimation in a lower-dimensional space of the detector readouts (discussed in Section 3.1.3) or by directly estimating the density ratio via probabilistic classification, as will be discussed in Chapter 4.
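The re-weighting of Equation 3.16 can be sketched with one-dimensional toy distributions for which the densities are tractable (the exponential shapes and scales below are arbitrary assumptions; in a realistic setting the density ratio itself would have to be estimated):

```python
import numpy as np

rng = np.random.default_rng(3)

# Events generated under p_Q (exponential, scale 40) while expectations are
# wanted under p_R (exponential, scale 50); both densities are tractable only
# because this is a toy example.
x_sim = rng.exponential(scale=40.0, size=100_000)

def pdf_exp(x, scale):
    return np.exp(-x / scale) / scale

w = pdf_exp(x_sim, 50.0) / pdf_exp(x_sim, 40.0)   # per-event weights p_R / p_Q

passed = x_sim > 100.0
w_sel = w * passed                                 # w_C(x) = 1_C(x) * w(x)

# Self-normalised estimate of the post-selection expectation under p_R
mean_reweighted = np.sum(w_sel * x_sim) / np.sum(w_sel)
```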

As previously mentioned, an advantage of using simulated observations is that the latent variables \(\mathcal{H}_j= \{\boldsymbol{z}_0,...,\boldsymbol{z}_{m-1}\}\) for a given simulated set of observations \(S_j = \{\boldsymbol{x}_0,...,\boldsymbol{x}_{m-1}\}\) are known. This allows rewriting the weight \(w(\boldsymbol{x}_s,\boldsymbol{z}_s)\) for a given event as the ratio of joint distributions: \[ w(\boldsymbol{x}_s, \boldsymbol{z}_s) = \frac{p_R(\boldsymbol{x}_s,\boldsymbol{z}_s | \boldsymbol{\theta}_R)}{ p_Q(\boldsymbol{x}_s,\boldsymbol{z}_s | \boldsymbol{\theta}_Q) } = \frac { p_R ( \boldsymbol{x} | \boldsymbol{z}_\textrm{d}) p_R ( \boldsymbol{z}_\textrm{d} | \boldsymbol{z}_\textrm{s}) p_R ( \boldsymbol{z}_\textrm{s} | \boldsymbol{z}_\textrm{p}) p_R ( \boldsymbol{z}_\textrm{p}|\boldsymbol{\theta}_R) }{ p_Q ( \boldsymbol{x} | \boldsymbol{z}_\textrm{d}) p_Q ( \boldsymbol{z}_\textrm{d} | \boldsymbol{z}_\textrm{s}) p_Q ( \boldsymbol{z}_\textrm{s} | \boldsymbol{z}_\textrm{p}) p_Q ( \boldsymbol{z}_\textrm{p}|\boldsymbol{\theta}_Q) } \qquad(3.17)\] where the last term is an expansion of each joint distribution as a product of the conditional distributions discussed in Equation 3.9. If the difference between \(p_R(\boldsymbol{x} | \boldsymbol{\theta}_R)\) and \(p_Q(\boldsymbol{x} | \boldsymbol{\theta}_Q)\) is contained in one of the factors of the joint distribution, which is often the case, most of the factors in Equation 3.17 cancel out and we are left with a much simpler problem of density ratio estimation in the latent space. This is often what is done to model the effect of a different pileup distribution or of alternative parton distribution functions, further factorising the joint distribution to include explicit dependencies with respect to \(\boldsymbol{z}_\textrm{pileup}\) or \(\boldsymbol{z}_\textrm{PDF}\), as done in Equation 3.11 and Equation 3.10 respectively. The case when the difference between distributions is contained in a subset of the parton-level latent variables is of special relevance, because the event weight for a given event \(w(\boldsymbol{z}_s)\) can then be expressed as the ratio: \[ w(\boldsymbol{z}_s) = \frac{p_R ( \boldsymbol{z}_\textrm{p}|\boldsymbol{\theta}_R)}{p_Q ( \boldsymbol{z}_\textrm{p}|\boldsymbol{\theta}_Q)} \qquad(3.18)\] which is referred to as generator-level re-weighting, a procedure that in some cases can even be done analytically. The concept of re-weighting will be useful to model different parameter points in Chapter 5 with a single set of simulated observations, as well as to understand how the effect of varying parameters can be modelled via differentiable transformations in Chapter 6.

3.1.3 Dimensionality Reduction

In the previous overview of the basic statistical modelling principles of experimental high-energy physics, the structure and properties of the probability distribution of the full detector readout \(\boldsymbol{x} \in \mathcal{X}\) have been considered. The consideration of the detector readout as a single observable variable \(\boldsymbol{x}\) in the generative model greatly simplifies the modelling narrative, and also allows including the effect of any arbitrary event selection as a deterministic function \(\mathbb{1}_\mathcal{C} (\boldsymbol{x})\). Nevertheless, the high dimensionality of the readout space \(\boldsymbol{x} \in \mathcal{X}\) (i.e. \(\mathcal{O}(10^8)\)) complicates its direct use when comparing simulated and recorded observations, which is crucial when carrying out any type of statistical inference.

The high-dimensionality of the raw detector readout space \(\boldsymbol{x} \in \mathcal{X}\) also makes it very difficult to specify an effective event selection \(\mathbb{1}_\mathcal{C} (\boldsymbol{x})\) that is able to reduce the contributions from non-interesting or not well-modelled background processes. This motivates the use of a dimensionality reduction function \(\boldsymbol{f}(\boldsymbol{x}) : \mathcal{X} \longrightarrow \mathcal{Y}\), from the raw detector readout space \(\mathcal{X} \subseteq \mathbb{R}^{d}\) to a lower dimensional space \(\mathcal{Y} \subseteq \mathbb{R}^{b}\). Here \(\boldsymbol{f}(\boldsymbol{x})\) represents any deterministic function of the detector readout, but in practice it can be implemented by a series of consecutive transformations.

Let us denote as \(\boldsymbol{y} \in \mathcal{Y}\) the resulting variable after the transformation \(\boldsymbol{f}(\boldsymbol{x})\) is applied to the observed detector readout. If the function \(\boldsymbol{f}\) were differentiable and bijective (i.e. with a one-to-one correspondence between \(\boldsymbol{x}\) and \(\boldsymbol{y}\)), the probability density function of \(\boldsymbol{y}\) could be obtained as: \[ p(\boldsymbol{y} | \boldsymbol{\theta}) = p(\boldsymbol{x} | \boldsymbol{\theta}) \left | \det \frac{d\boldsymbol{x} }{d\boldsymbol{y} } \right | \qquad(3.19)\] where the last term is the Jacobian determinant of the inverse of \(\boldsymbol{f}\). The transformations commonly used in particle colliders are non-bijective and sometimes non-differentiable, and Equation 3.19 is in any case of little use when \(p(\boldsymbol{x} | \boldsymbol{\theta})\) is intractable. However, the expectation value of \(\boldsymbol{y}\), as well as of any other deterministic transformation of the detector readout \(\boldsymbol{x}\) after any arbitrary event selection \(\mathbb{1}_\mathcal{C} (\boldsymbol{x})\), can be obtained using simulated samples for a given interaction process as shown in Equation 3.14, independently of whether the transformation is invertible or differentiable. In the rest of this section, the main procedures followed to reduce the dimensionality of the observable space, and their objectives from a statistical perspective, will be discussed.

3.1.3.1 Event Reconstruction

The methods of event reconstruction, as described in Section 2.3.3, provide a very efficient way to transform the high-dimensional detector readout into a lower-dimensional space that can more easily be interpreted from a physical standpoint. In fact, reconstruction can be viewed as a complex procedural technique of inference on a subset of the latent variables given the detector readout \(\boldsymbol{x}\) of an event. These methods attempt to walk back the generative chain described in Equation 3.9 to recover the subset of the parton-level latent variables \(\boldsymbol{z}_\textrm{p}\) (and \(\boldsymbol{z}_\textrm{s}\) or \(\boldsymbol{z}_\textrm{d}\) in some cases) on which the detector readout strongly depends, providing a compressed summary of the information in the event about the parameters of interest \(\boldsymbol{\theta}\). The dimensionality of the output of the reconstruction procedure \(\boldsymbol{y}_\textrm{reco}\) depends on the subset of variables considered for each physical object, typically amounting to a total of \(\mathcal{O}(100)\) dimensions, a significant reduction from \(\dim(\mathcal{X}) \approx \mathcal{O}(10^8)\).

Due to the detector noise and characteristics, the reconstruction function \(\boldsymbol{f}_\textrm{reco}(\boldsymbol{x}) : \mathcal{X} \longrightarrow \mathcal{Y}_\textrm{reco}\) cannot fully recover \(\boldsymbol{z}_\textrm{p} \in \mathcal{Z}_\textrm{p}\). This is the case for neutrinos that leave the detector undetected, when the measured four-momentum of a given particle differs from the real value, or when the reconstructed particle does not even exist in \(\boldsymbol{z}_\textrm{p}\). Simulated events can then be used to make calibrated probabilistic statements about the resulting reconstructed physical objects and their relation with the actual unobserved particles going through the detector. Particle identification (e.g. jet b-tagging) and fine-tuned momentum regressions on the reconstructed objects can also be thought of as inference of latent variables, which amounts to using the additional detector information around an object to measure its properties more precisely. These properties include the type of particle that produced the clustered detector readouts, in the case of particle identification, or a more precise determination of the momentum, in the case of regression.

One aspect of the generative model that complicates both reconstruction and statistical inference, and which has not been discussed yet, is that efficient representations of the latent state of simulated events are not easily expressed as a fixed-size real vector \(\boldsymbol{z} \in \mathcal{Z} \subseteq \mathbb{R}^b\). Let us consider as an example the parton-level latent information \(\boldsymbol{z}_\textrm{p}\), which amounts to a short list of produced particles. The total number of particles and the number of particles of each type are variable, thus \(\boldsymbol{z}_\textrm{p}\) is better represented by a set (or several sets, one for each particle type): \[ \boldsymbol{z}_\textrm{p}^\textrm{set} = \{ \boldsymbol{z}_\textrm{p}^i \ | \ i \in \{{1,...,n_\textrm{p}} \} \} \qquad(3.20)\] where \(n_\textrm{p}\) is the total number of particles produced at parton level and \(\boldsymbol{z}_\textrm{p}^\textrm{i}\) are the latent variables associated with each particle (i.e. type, four-momenta, charge, colour and spin). A similar set structure can be attributed to the latent variables describing long-lived particles after the parton shower \(\boldsymbol{z}_\textrm{s}\), where additional variables might be associated with each particle (e.g. production vertex) and the total number of particles and their type diversity would be considerably larger. Because the number of particles and their types vary greatly between different interaction processes, mapping this structure onto the observable variable space is very useful. In fact, the result of the general event reconstruction process at CMS can also be expressed as a set of physical objects: \[ \boldsymbol{y}_\textrm{reco}^\textrm{set} = \{ \boldsymbol{y}_\textrm{reco}^\textrm{i} \ | \ i \in \{{1,...,n_\textrm{reco}} \} \} \qquad(3.21)\] where \(n_\textrm{reco}\) is the total number of reconstructed physical objects and \(\boldsymbol{y}_\textrm{reco}^\textrm{i}\) are the reconstructed variables for each physical object (i.e. reconstructed type, reconstructed four-momenta, reconstructed charge and any other reconstructed attributes). The calibration between the reconstructed physical objects \(\boldsymbol{y}_\textrm{reco}^\textrm{set}\) and the actual particles produced in the collision \(\boldsymbol{z}_\textrm{p/s}^\textrm{set}\) hence amounts to matching set elements (typically based on a \(\Delta R\) distance criterion, see Section 2.2.1) and comparing their reconstructed and generated attributes.

The fact that both the reconstructed and latent spaces have a variable-size set structure greatly complicates the application of inference and learning techniques directly based on \(\boldsymbol{y}_\textrm{reco}^\textrm{set}\), because they often can only deal with a fixed-size vector of real numbers \(\mathbb{R}^b\). Similarly to what is done for event selection, the elements in the set of reconstructed objects in an event are often reduced by imposing conditions on their attributes (e.g. type, isolation or momenta). There exist naive ways to embed a set such as \(\boldsymbol{y}_\textrm{reco}^\textrm{set}\) as a fixed-size vector in \(\mathbb{R}^b\), such as taking the relevant attributes of the first \(n_\textrm{sel}\) objects according to a specific ordering convention after a given object selection, and possibly padding with zeros or alternative numerical values the entries that do not exist for a given event. Some of the newer machine learning techniques that will be presented in Chapter 4 can deal with variable-size inputs, such as sequences, sets or graphs, by embedding them in vector representations internally, providing new ways to deal with the mentioned representational issue.
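A minimal sketch of such a naive fixed-size embedding is shown below; the choice of three attributes per object, the ordering by transverse momentum and the zero padding are all illustrative assumptions:

```python
import numpy as np

# Hypothetical reconstructed objects: (pt, eta, phi) per object, variable count.
event_objects = [
    (85.2, 0.3, 1.1),
    (42.7, -1.2, -2.0),
    (23.4, 2.1, 0.4),
]

def embed_fixed_size(objects, n_sel=4, pad_value=0.0):
    """Naive embedding of a variable-size set into R^(3*n_sel): order by pt
    (descending), keep the first n_sel objects, pad the missing entries."""
    ordered = sorted(objects, key=lambda obj: obj[0], reverse=True)[:n_sel]
    flat = [value for obj in ordered for value in obj]
    flat += [pad_value] * (3 * n_sel - len(flat))
    return np.array(flat)

y_fixed = embed_fixed_size(event_objects)   # fixed-size, zero-padded vector
```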

3.1.3.2 Summary Statistics

The attributes of the subset of reconstructed objects selected in an event for a given analysis, commonly represented as a fixed-size vector \(\boldsymbol{y}_\textrm{sel} \in \mathcal{Y}_\textrm{sel} \subseteq \mathbb{R}^{b}\), are often still too high-dimensional to be considered directly for statistical inference. The effectiveness of the likelihood-free techniques that will be presented later in this chapter strongly depends on the dimensionality of the observable space considered. Hence, it is desirable to further combine the reconstructed outputs into a lower-dimensional summary statistic, which can be a function either of each single observation or of a set of multiple observations, so that simpler statistical models relating the parameters of interest to the observations can be constructed.

Until now, we have been dealing with the problem of how a single event is distributed, \(p ( \boldsymbol{x}|\boldsymbol{\theta})\); however, in practice a collection \(D = \{\boldsymbol{x}_0,...,\boldsymbol{x}_{n-1}\}\) of events is considered for inference. Let us first consider again the set \(D\), before any trigger or event selection, similarly to what was done at the beginning of Section 3.1.1. Because of the independence between events, the probability density of a given set \(D\) can be expressed as the product of the individual probability densities for each event \(\boldsymbol{x}_i\): \[ p(D | \boldsymbol{\theta}) = \prod_{\boldsymbol{x}_i \in D} p ( \boldsymbol{x}_i|\boldsymbol{\theta} ) \qquad(3.22)\] where \(p ( \boldsymbol{x}_i|\boldsymbol{\theta} )\) can only be modelled realistically by forward simulation, and has the mixture model structure and latent factorisation discussed before. After an arbitrary event selection \(\mathbb{1}_\mathcal{C} (\boldsymbol{x})\), only a subset of events \(D_\mathcal{C} = \{\boldsymbol{x}_0,...,\boldsymbol{x}_{n_\mathcal{C}-1}\} \subseteq D\) remains. These events are also independent, so their probability density can be expressed as: \[ g(D_\mathcal{C} | \boldsymbol{\theta}) = \prod_{\boldsymbol{x}_i \in D_\mathcal{C}} g ( \boldsymbol{x}_i|\boldsymbol{\theta} ) \qquad(3.23)\] where the dependence between the distribution function after the event selection \(g ( \boldsymbol{x}_i|\boldsymbol{\theta} )\) and that before it, \(p ( \boldsymbol{x}_i|\boldsymbol{\theta} )\), was already described in Equation 3.7. If we focussed only on the probability distribution of the events in \(D_\mathcal{C}\), we would be neglecting an important quantity that can also provide information about the parameters of interest: the total number of events that pass the event selection \(n_\mathcal{C}\). Because this quantity depends on the set of recorded readouts \(D\), where each individual readout \(\boldsymbol{x}_i\) is assumed to be an independent and identically distributed variable, the total number of selected events \(n_\mathcal{C}\) after a deterministic selection \(\mathbb{1}_\mathcal{C} (\boldsymbol{x})\) can be modelled using a binomial distribution: \[ p( n_\mathcal{C} | n, \boldsymbol{\theta}) = \textrm{Binomial}(n, \epsilon) \approx \textrm{Poisson}(n\epsilon) \qquad(3.24)\] where \(n\) is the total number of events, and the dependence on the parameters is contained in the total efficiency \(\epsilon\), i.e. the probability for \(\boldsymbol{x} \sim p(\boldsymbol{x}|\boldsymbol{\theta})\) of passing the selection criteria, which can be defined as \(\epsilon = \int \mathbb{1}_\mathcal{C}(\boldsymbol{x}) p (\boldsymbol{x}|\boldsymbol{\theta}) d\boldsymbol{x}\). The Poisson approximation is justified because the number of trials \(n\) is very large (i.e. 40 million bunch crossings per second) while the total selection efficiency is very small, \(\epsilon \leq 2.5 \times 10^{-5}\) already at trigger level, as discussed in Section 2.2.7. This type of stochastic process is also referred to in the literature as a multi-dimensional homogeneous Poisson point process [88]. The expected value of \(n_\mathcal{C}\) coincides with the Poisson mean \(n\epsilon\).
It can be more intuitively linked with the parameters of interest \(\boldsymbol{\theta}\) by making explicit the contributions from the different mixture processes: \[ \mathop{\mathbb{E}}_{D \sim p ( D |\boldsymbol{\theta} )} \left [ n_\mathcal{C} \right ] = n \sum^{K-1}_{j=0} \phi_j \mathop{\mathbb{E}}_{x \sim p_j ( \boldsymbol{x}|\boldsymbol{\theta} )} \left [\mathbb{1}_\mathcal{C} (\boldsymbol{x}) \right ] = n \sum^{K-1}_{j=0} \phi_j \epsilon_j \qquad(3.25)\] where the efficiency for each process \(\epsilon_j= \int \mathbb{1}_\mathcal{C}(\boldsymbol{x}) p_j (\boldsymbol{x}|\boldsymbol{\theta}) d\boldsymbol{x}\) can be estimated using simulated observations, as shown in Equation 3.13. In principle, all possible processes \(j\) that could occur have to be considered, i.e. the cases when no hard collision occurred as well as the inclusive contribution of each possible hard process, as described in Equation 3.4. However, if the product of the expected probability of a given process occurring \(\phi_j\) and its event selection efficiency \(\epsilon_j\) is low enough relative to the total efficiency \(\epsilon=\sum^{K-1}_{j=0} \phi_j \epsilon_j\), the effect of that mixture component can be safely neglected.
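The quality of the Poisson approximation in Equation 3.24 can be checked numerically with a short sketch using the order-of-magnitude numbers quoted above (illustrative values, not tied to any specific selection):

```python
import numpy as np

rng = np.random.default_rng(4)

n = 40_000_000      # bunch crossings per second at the LHC (order of magnitude)
eps = 2.5e-5        # total selection efficiency, of the order of the trigger level

# Equation 3.24: n_C ~ Binomial(n, eps) is well approximated by Poisson(n * eps)
n_binom = rng.binomial(n, eps, size=100_000)
n_pois = rng.poisson(n * eps, size=100_000)

print(n * eps, n_binom.mean(), n_pois.mean())   # means all close to 1000
print(n_binom.var(), n_pois.var())              # variances also nearly identical
```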

The situation discussed above often applies to events where no hard collision occurred, after some basic event selection: \(\epsilon_\textrm{not-hard} \approx 0\), so this contribution can be neglected. For the subset of bunch crossings where hard interactions occur, the probability of a given type of interaction before any event selection can be expressed as the product of its cross section \(\sigma_j\) and the total integrated luminosity \(\mathcal{L}_\textrm{int}\) of the data-taking period, divided by the total number of bunch crossings, thus the expected number of observations \(n_\mathcal{C}\) after an event selection \(\mathbb{1}_\mathcal{C} (\boldsymbol{x})\) that reduces the contribution of non-hard processes to a negligible fraction can also be expressed as: \[ \mathop{\mathbb{E}}_{D \sim p ( D |\boldsymbol{\theta} )} \left [ n_\mathcal{C} \right ] = n \sum^{K-1}_{j=0} \frac{\mathcal{L}_\textrm{int} \sigma_j}{n }\epsilon_j = \mathcal{L}_\textrm{int} \sum^{K-1}_{j=0} \ \sigma_j \ \epsilon_j \qquad(3.26)\] where \(n_j=\mathcal{L}_\textrm{int} \ \sigma_j \ \epsilon_j\) is the expected number of events coming from a given process \(j\), which can be estimated with theoretical input regarding \(\sigma_j\), simulated observations to estimate \(\epsilon_j\) and an experimental measurement of the integrated luminosity \(\mathcal{L}_\textrm{int}\).

The number of observations \(n_\mathcal{C}\) that pass a given event selection \(\mathbb{1}_\mathcal{C} (\boldsymbol{x})\), which normally includes the trigger and some additional analysis-dependent selection, is the quantity that serves as the basis of the simplest statistical model used in particle physics to link theoretical parameters and observations. This type of summary statistic is very effective when the parameter of interest is the cross section of a single process \(\sigma_S\) and the rest of the background processes are well modelled by theoretical predictions and simulated observations. In that case, if all parameters but \(\sigma_S\) are known, a cut-and-count sample-based likelihood can be built based on Equation 3.24, corresponding to the following probability distribution: \[ p ( n_\mathcal{C} | \sigma_S) = \textrm{Poisson} \left ( \mathcal{L}_\textrm{int} \left ( \sigma_S\epsilon_S + \sum_{j \in B} \sigma_j\epsilon_j \right ) \right) \qquad(3.27)\] which can be used to carry out statistical inference about \(\sigma_S\), given an observed number of events \(n_\mathcal{C}^\textrm{obs}\) that pass the event selection, using classical techniques.
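A minimal sketch of such a cut-and-count likelihood and its maximisation follows; the luminosity, efficiencies, background yield and observed count are hypothetical placeholders, and a simple grid scan stands in for a proper minimiser:

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical inputs for the counting model of Equation 3.27.
lumi = 140.0            # integrated luminosity (fb^-1)
eps_s = 0.01            # signal selection efficiency
bkg_sigma_eps = 3.2     # sum of sigma_j * eps_j over background processes (fb)
n_obs = 520             # observed number of selected events

def neg_log_likelihood(sigma_s):
    mu = lumi * (sigma_s * eps_s + bkg_sigma_eps)   # expected selected events
    return -poisson.logpmf(n_obs, mu)

# Grid-scan maximum-likelihood estimate of the signal cross section (fb)
grid = np.linspace(0.0, 100.0, 2001)
sigma_s_hat = grid[np.argmin([neg_log_likelihood(s) for s in grid])]
```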

The previous concept can be applied simultaneously to several disjoint subsets of \(\mathcal{X}\), \(T=\{\mathcal{C}_0,...,\mathcal{C}_b\}\), each characterised by a different indicator function \(\mathbb{1}_{\mathcal{C}_t} (\boldsymbol{x})\) defining an arbitrary event selection, as long as their pairwise intersections are empty. The probability function of the variable \(\boldsymbol{n}_T = \{n_{\mathcal{C}_0},...,n_{\mathcal{C}_b}\}\), given that each \(n_{\mathcal{C}_i}\) is independent, can be obtained as: \[ p ( \boldsymbol{n}_T | \boldsymbol{\theta}) = \prod_{\mathcal{C}_i \in T} \textrm{Poisson} \left (\sum_{j \in H} n^{\mathcal{C}_i}_j(\boldsymbol{\theta}) \right ) \qquad(3.28)\] where \(n^{\mathcal{C}_i}_j(\boldsymbol{\theta})\) is the expected number of observed events coming from process \(j\) after the selection \(\mathcal{C}_i\). As long as a parametrisation of \(n^{\mathcal{C}_i}_j(\boldsymbol{\theta})\) exists, which can often be estimated as \(n^{\mathcal{C}_i}_j(\boldsymbol{\theta}) =\mathcal{L}_\textrm{int} \ \sigma_j \ \epsilon^{\mathcal{C}_i}_j(\boldsymbol{\theta})\), Equation 3.28 can be used to construct a likelihood to carry out inference on the parameters \(\boldsymbol{\theta}\) based on the observed value of the sample summary statistic \(\boldsymbol{n}_T^\textrm{obs}\).
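As a sketch of the product structure of Equation 3.28, the snippet below evaluates the log-likelihood of a two-selection counting model for the common special case in which \(\boldsymbol{\theta}\) reduces to a single signal-strength modifier scaling the signal yields; all yields and observed counts are invented placeholders:

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical expected yields per selection and per process group.
expected = {
    "C0": {"signal": 12.0, "background": 240.0},
    "C1": {"signal": 35.0, "background": 80.0},
}
observed = {"C0": 260, "C1": 122}

def log_likelihood(mu):
    """Sum of independent Poisson log-terms (Equation 3.28); mu scales the
    signal yields, with mu = 1 corresponding to the nominal prediction."""
    total = 0.0
    for selection, yields in expected.items():
        expectation = mu * yields["signal"] + yields["background"]
        total += poisson.logpmf(observed[selection], expectation)
    return total

print(log_likelihood(1.0))   # log-likelihood at the nominal signal strength
```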

3.1.3.3 Sufficient Statistics

The selection count vector \(\boldsymbol{n}_T^\textrm{obs}(D)\), which has not been explicitly specified yet, can also be written as a sum of a function \(\boldsymbol{n}_T(\boldsymbol{x}) : \mathcal{X} \subseteq \mathbb{R}^{d} \longrightarrow \mathcal{Y} \subseteq \{0,1\}^b \subset \mathbb{R}^{b}\) applied to each event in \(D = \{\boldsymbol{x}_0,...,\boldsymbol{x}_n\}\) as follows: \[ \boldsymbol{n}_T^\textrm{obs}(D) = \sum^{\boldsymbol{x}_i \in D} \boldsymbol{n}_T(\boldsymbol{x}_i) \qquad(3.29)\] where \(\boldsymbol{n}_T^\textrm{obs}(D)\) can be described as a summary statistic of the whole collection of observations while \(\boldsymbol{n}_T(\boldsymbol{x}_i)\) summarises a single event \(\boldsymbol{x}_i\).
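For intuition, a binned count vector is exactly such a sum of per-event indicator vectors; the following small numpy sketch (with a hypothetical one-dimensional per-event summary) makes the equivalence with a histogram explicit.

```python
import numpy as np

rng = np.random.default_rng(0)
events = rng.exponential(scale=50.0, size=1000)     # hypothetical per-event summary
edges = np.array([0.0, 25.0, 50.0, 100.0, 1000.0])  # disjoint selections C_0..C_3

def n_T(x):
    """Per-event indicator vector: one-hot encoding of the selection the event falls in."""
    return (np.digitize(x, edges) - 1 == np.arange(len(edges) - 1)).astype(int)

n_T_obs = sum(n_T(x) for x in events)               # Equation 3.29
assert np.array_equal(n_T_obs, np.histogram(events, edges)[0])
print(n_T_obs)
```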

There are infinite ways to choose a lower-dimensional summary statistic of the detector readout \(\boldsymbol{s}(\boldsymbol{x}) : \mathcal{X} \subseteq \mathbb{R}^{d} \longrightarrow \mathcal{Y}\subseteq \mathbb{R}^{b}\). Functions of the type \(\boldsymbol{n}_T(\boldsymbol{x})\) are only a reduced, yet still infinite, subset of the possible space of functions. Regardless of the likelihood-free inference method considered (see Section 3.2), the need for a low-dimensional summary statistic can be thought of as an effective consequence of the curse of dimensionality, because the number of simulated observations required to realistically model the probability density function or to compute useful distance measures rapidly increases with the number of dimensions.

In general, the selection of a summary statistic \(\boldsymbol{s}(\boldsymbol{x})\) is far from trivial, and naive choices can lead to large losses of useful information about the parameters of interest \(\boldsymbol{\theta}\). From classical statistics, we can define a sufficient summary statistic as a function of the set of observations that can be used in place of the original dataset for carrying out inference about the model parameters \(\boldsymbol{\theta}\) of a given statistical model without losing information [89]. Such a sufficient statistic contains all the information in the observed sample useful to compute any estimate of the model parameters, and no complementary statistic can add any additional information about \(\boldsymbol{\theta}\) contained in the set of observations. Sufficient statistics can be formally characterised using the Fisher-Neyman factorisation criterion, which states that a summary statistic \(\boldsymbol{s}(\boldsymbol{x})\) is sufficient for the parameter vector \(\boldsymbol{\theta}\) if and only if the probability distribution function of \(\boldsymbol{x}\) can be factorised as follows: \[ p(\boldsymbol{x} | \boldsymbol{\theta}) = q(\boldsymbol{x}) r(\boldsymbol{s}(\boldsymbol{x}) | \boldsymbol{\theta}) \qquad(3.30)\] where \(q(\boldsymbol{x})\) is a non-negative function that does not depend on the parameters and \(r\) is also a non-negative function whose dependence on \(\boldsymbol{x}\) enters only through the summary statistic \(\boldsymbol{s}(\boldsymbol{x})\). The identity function \(\boldsymbol{s}(\boldsymbol{x})=\boldsymbol{x}\) is a sufficient summary statistic according to the criterion in Equation 3.30; however, we are only interested in summaries that reduce the original data dimensionality without losing useful information about the parameters \(\boldsymbol{\theta}\).

The definition of sufficiency can also be applied to a set of observations \(D = \{\boldsymbol{x}_0,...,\boldsymbol{x}_n\}\). In fact, if we assume they are independent and identically distributed, and \(\boldsymbol{s}(\boldsymbol{x})\) is sufficient for each observation \(\boldsymbol{x}_i\), we may rewrite Equation 3.22 as: \[ p ( D |\boldsymbol{\theta}) = \prod^{\boldsymbol{x}_i \in D} q(\boldsymbol{x}_i) \prod^{\boldsymbol{x}_i \in D} r(\boldsymbol{s}(\boldsymbol{x}_i) | \boldsymbol{\theta}) = q(D) r(\boldsymbol{s}(D)| \boldsymbol{\theta}) \] where the set of sufficient summary statistics for each observation, \(\boldsymbol{s}(D) = \{ \ \boldsymbol{s}(\boldsymbol{x}_i) \ | \ \forall \boldsymbol{x}_i \in D \}\), is a sufficient summary statistic for the whole dataset, and the dependence on the summary statistic is contained in the product of independent factors, one per observation.
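As a textbook illustration of the criterion (a standard statistics example, not specific to collider data), consider \(n\) i.i.d. observations \(x_i \sim \textrm{Poisson}(\mu)\). The joint probability factorises as \[ p(D|\mu) = \prod^{x_i \in D} \frac{e^{-\mu}\mu^{x_i}}{x_i!} = \underbrace{\left ( \prod^{x_i \in D} \frac{1}{x_i!} \right )}_{q(D)} \ \underbrace{e^{-n\mu}\ \mu^{\sum_i x_i}}_{r(\boldsymbol{s}(D)|\mu)} \] so the scalar sum \(\boldsymbol{s}(D) = \sum_i x_i\) is a sufficient summary statistic for \(\mu\): the \(n\)-dimensional sample can be compressed to a single number without losing any information about \(\mu\).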

Because \(p(\boldsymbol{x} | \boldsymbol{\theta})\) is not available in closed form in particle collider experiments, the general task of finding a sufficient summary statistic that reduces the dimensionality cannot be tackled directly by analytical means. However, for finite mixture models where the only model parameters are a function of the mixture coefficients \(\phi_j\), probabilistic classification can be used to obtain (approximate) sufficient summary statistics. We will return to this topic in Chapter 4. When the parameters of interest or additional unknown parameters affect the mixture components \(p_j(\boldsymbol{x} | \boldsymbol{\theta})\), the construction of sufficient summary statistics cannot be addressed directly, thus a fraction of the information about the parameters \(\boldsymbol{\theta}\) is very likely to be lost in the dimensionality reduction step. An automated way to obtain powerful summary statistics in those cases using machine learning techniques will be presented in Chapter 6.

3.1.3.4 Synthetic Likelihood

The advantage of using lower-dimensional summary statistics \(\boldsymbol{s}(D) : \mathcal{X}_D \subseteq \mathbb{R}^{d\times n} \longrightarrow \mathcal{Y}_D \subseteq \mathbb{R}^{b\times n}\) of the detector readout collected by the experiment is that the generative model of \(p(\boldsymbol{x} | \boldsymbol{\theta})\) can often be used to build non-parametric likelihoods of \(\boldsymbol{s}(D)\) that link the observations with the model parameters, so classical inference algorithms can be used. These likelihoods are referred to here as synthetic because they are not based on the actual generative model of \(\boldsymbol{x}\) but on approximations constructed using simulated observations.

For summary statistics of the type \(\boldsymbol{n}_T^\textrm{obs}(D) : \mathcal{X}_D \subseteq \mathbb{R}^{d \times n } \longrightarrow \mathcal{Y}_D \subseteq \mathbb{N}^{b}\) the likelihood can be expressed as a product of independent Poisson count likelihoods, as shown in Equation 3.28. Such a likelihood can be evaluated for the observed data \(D\) and specific parameters \(\boldsymbol{\theta}_R\), even when \(\boldsymbol{\theta}\) modifies the distribution of the mixture components \(p_j(\boldsymbol{x} | \boldsymbol{\theta})\), by forward approximating \(n^{\mathcal{C}_i}_j(\boldsymbol{\theta}_R)\) (or alternatively \(\epsilon^{\mathcal{C}_i}_j(\boldsymbol{\theta}_R)\)) using simulated observations for each process \(j\) generated at \(\boldsymbol{\theta}_R\). This process would rapidly become computationally very demanding if it had to be repeated for each likelihood evaluation during the whole inference procedure. Re-weighting procedures such as those described in Equation 3.18 can often be applied to re-use events already simulated at \(\boldsymbol{\theta}_R\) to model events corresponding to different values of the parameters \(\boldsymbol{\theta}_Q\).
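The forward approximation of \(\epsilon^{\mathcal{C}_i}_j(\boldsymbol{\theta})\) from simulation amounts to the (possibly weighted) fraction of simulated events passing the selection. A minimal sketch with a hypothetical one-dimensional observable, a threshold selection and a Gaussian mean-shift parameter, for which the re-weighting factor is known analytically (analogous in spirit to Equation 3.18):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(theta, size=100_000):
    """Hypothetical simulator: one-dimensional observable centred at the parameter value."""
    return rng.normal(loc=theta, scale=1.0, size=size)

def efficiency(x, weights=None, threshold=1.5):
    """Forward-estimated selection efficiency, optionally using per-event weights."""
    passed = x > threshold
    if weights is None:
        return passed.mean()
    return np.sum(weights * passed) / np.sum(weights)

x_R = simulate(theta=0.0)
print("efficiency at theta_R:", efficiency(x_R))

# Re-using the same simulated events to model theta_Q = 0.3 via per-event weights
# given by the ratio of Gaussian densities N(x; 0.3, 1) / N(x; 0.0, 1)
w = np.exp(0.3 * x_R - 0.5 * 0.3**2)
print("efficiency at theta_Q (re-weighted):   ", efficiency(x_R, weights=w))
print("efficiency at theta_Q (new simulation):", efficiency(simulate(theta=0.3)))
```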

A more economical approach, commonly used in LHC analyses that employ binned Poisson likelihoods based on the formalism introduced in Equation 3.28, is to parametrise the effect of varying parameters by interpolating between the values of \(\epsilon^{\mathcal{C}_i}_j(\boldsymbol{\theta}_k)\) (or directly \(n^{\mathcal{C}_i}_j(\boldsymbol{\theta}_k)\)) evaluated at a set of parameter points \(\boldsymbol{\theta}_k\). Such a parametrisation allows an analytical approximation of the likelihood derived from Equation 3.28, and simplifies the computation of gradients with respect to the parameters. This is particularly relevant to model the effect of nuisance parameters, which are uncertain but not of direct interest, and have to be accounted for in the inference procedure; this issue will be discussed in Section 3.1.4. Different interpolation conventions exist [90], but they are normally based on the marginal one-dimensional interpolation of the effect of a single parameter \(\theta_i \in \boldsymbol{\theta}\) at three equally spaced values (the nominal parameter value and the up/down variations). In that case the total effect on \(\epsilon^{\mathcal{C}_i}_j(\boldsymbol{\theta}_k)\) is accounted for by adding absolute shifts or by multiplying marginal effects.
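The following minimal sketch illustrates one common piecewise-linear and piecewise-exponential interpolation convention of the kind discussed in [90], for a single bin and a single nuisance parameter \(\theta\) expressed in units of its variation; the yields are hypothetical.

```python
import numpy as np

def interpolate_yield(theta, nominal, up, down, multiplicative=True):
    """Interpolate an expected yield between the nominal and the +/-1 sigma templates.

    theta is expressed in units of the variation, so theta = +1 (-1) reproduces
    the up (down) template exactly.
    """
    if multiplicative:
        # piecewise-exponential convention: raise the marginal ratio to |theta|
        ratio = (up / nominal) if theta >= 0 else (down / nominal)
        return nominal * ratio ** abs(theta)
    # piecewise-linear convention: add the absolute shift scaled by |theta|
    shift = (up - nominal) if theta >= 0 else (down - nominal)
    return nominal + abs(theta) * shift

nominal, up, down = 100.0, 110.0, 92.0  # hypothetical yields in a single bin
for theta in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(theta, interpolate_yield(theta, nominal, up, down))
```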

Even assuming that the marginal description is accurate when a single parameter varies, which is not ensured by the interpolation, and that the effect of each parameter is factorised in \(p_j(\boldsymbol{x} | \boldsymbol{\theta})\), the integral definition of \(\epsilon^{\mathcal{C}_i}_j (\boldsymbol{\theta}_k)\) from Equation 3.13 does not ensure that the correlated effect of the variation of multiple \(\theta_i \in \boldsymbol{\theta}\) is accurately modelled. This issue can be easily exemplified by considering the product of relative variations in the two-parameter case \(\boldsymbol{\theta}_R = (\theta^R_0,\theta^R_1)\). Let us consider the expected value for the efficiency after a given selection \(\mathbb{1}_{\mathcal{C}_i}(\boldsymbol{x})\): \[ \begin{aligned} \epsilon^{\mathcal{C}_i}_j (\boldsymbol{\theta}_R) &= \int \mathbb{1}_{\mathcal{C}_i}(\boldsymbol{x}) p_j ( \boldsymbol{x}|\boldsymbol{\theta}_R ) d\boldsymbol{x} \\ &= \int \mathbb{1}_{\mathcal{C}_i}(\boldsymbol{x}) p_j ( \boldsymbol{x}|\boldsymbol{\theta}_Q ) \frac{p_j ( \boldsymbol{x}| (\theta^R_0,\theta^Q_1))}{p_j ( \boldsymbol{x}|\boldsymbol{\theta}_Q )} \frac{p_j ( \boldsymbol{x}| (\theta^Q_0,\theta^R_1))}{p_j ( \boldsymbol{x}|\boldsymbol{\theta}_Q )} d\boldsymbol{x} \end{aligned} \qquad(3.31)\] where \(\boldsymbol{\theta}_R\) is the parameter point we want to simulate by interpolating around a nominal point \(\boldsymbol{\theta}_Q\). The last expression in Equation 3.31 is only correct when the effect of each parameter is independent, i.e. when the underlying probability density function can be factorised as a product of independent factors. However, even in that case the expression does not simplify to a product of marginal efficiency ratios: \[ \epsilon^{\mathcal{C}_i}_j (\boldsymbol{\theta}_R) \neq \epsilon^{\mathcal{C}_i}_j (\boldsymbol{\theta}_Q) \frac{\epsilon^{\mathcal{C}_i}_j ((\theta^R_0,\theta^Q_1))}{ \epsilon^{\mathcal{C}_i}_j (\boldsymbol{\theta}_Q)} \frac{\epsilon^{\mathcal{C}_i}_j ((\theta^Q_0,\theta^R_1))}{ \epsilon^{\mathcal{C}_i}_j (\boldsymbol{\theta}_Q)} \qquad(3.32)\] because the integral of a product of functions is not the product of their integrals, unless the volume of the selected region \(\mathcal{C}_i\) is infinitesimally small - an irrelevant case as it would correspond to null efficiencies. This effect also applies when additive variations are considered, and can become more pronounced as the number of parameters grows.
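A toy numerical check of Equation 3.32 (a sketch under simple assumptions, not part of the original argument): take \(\boldsymbol{x} = (x_0, x_1)\) with independent unit-variance Gaussian components centred at \((\theta_0, \theta_1)\) and the selection \(x_0 + x_1 > c\). Even though the density factorises and each parameter affects a single component, the product of marginal efficiency ratios does not reproduce the joint efficiency.

```python
import numpy as np
from scipy.stats import norm

def efficiency(theta0, theta1, c=2.0):
    """P(x0 + x1 > c) for independent unit Gaussians centred at (theta0, theta1)."""
    return norm.sf(c, loc=theta0 + theta1, scale=np.sqrt(2.0))

theta_Q = (0.0, 0.0)  # nominal point
theta_R = (1.0, 1.0)  # target point reached by varying both parameters

joint = efficiency(*theta_R)
marginal_product = (efficiency(*theta_Q)
                    * efficiency(theta_R[0], theta_Q[1]) / efficiency(*theta_Q)
                    * efficiency(theta_Q[0], theta_R[1]) / efficiency(*theta_Q))

print("joint efficiency:          ", joint)             # ~ 0.50
print("product of marginal ratios:", marginal_product)  # ~ 0.73, clearly different
```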

The previously mentioned modelling issue, which to the best of our knowledge has not been made explicit in the literature before, affects a multitude of analyses at the LHC, i.e. those that use template interpolation as implemented in the standard statistical libraries used in particle physics experiments [91], [92]. A possible solution would be a multi-dimensional interpolation, but it would naively require evaluating at least all 3-point combinatorial variations of the parameters, amounting to a minimum of \(3^p\) evaluations of \(\epsilon^{\mathcal{C}_i}_j (\boldsymbol{\theta})\), where \(p\) is the number of parameters. If the effect of the parameters can be factored out in the joint distribution and the same simulated event set can be modified to describe each marginal variation, as reviewed around Equation 3.17, the non-marginal terms can be estimated from the product of per-event marginal terms by considering the finite sum approximation of the last expression in Equation 3.31, which would only require \((2p+1)\) parameter variation evaluations. Alternatively, the basis of the approach presented in Chapter 6, where the effect of parameter variations and their derivatives are computed in place over the simulated observations by specifying the full computational graph, could also be used in analyses where the discussed assumption fails to realistically describe the data.

3.1.4 Known Unknowns

So far we have assumed that the simulated observations can model the data and that the only parameters \(\boldsymbol{\theta}\) that affect the generative model are those we are interested in carrying out inference on. However, simulated observations effectively depend on the modelling of the physical processes occurring in the proton-proton collisions and the detector, of which we only have an approximate description. Those mis-modelling effects have to be accounted for in the inference procedure to obtain unbiased estimates, and are represented by additional nuisance parameters in the statistical model when their effect is known and can be approximated. For cases where simulation does not provide the desired level of accuracy, the contribution from some of the mixture components can often be estimated from data directly, using what are referred to as data-driven estimation techniques.

3.1.4.1 Nuisance Parameters

The general definition of nuisance parameters in a statistical model refers to all the uncertain parameters of the statistical model that are not of immediate interest but have to be accounted for in the inference procedure. These parameters can include uncertain theoretical parameters (e.g. the top quark mass or an expected background rate), account for limitations of the experimentally measured parametrisations of certain phenomena (e.g. parton density function uncertainties) or represent the accuracy limits of the calibration between data and simulation. Nuisance parameters can also represent additional degrees of freedom of the model that cover possible incorrect assumptions or quantify imprecisions due to the limited number of simulated observations.

Because the actual generative process for the experimental data is not known perfectly, the simulation-based model is extended with additional parameters that portray the possible variability of the distribution of the detector readouts. The formalism developed in the previous part of Section 3.1 still applies, noting that the parameter vector \(\boldsymbol{\theta}=\{\boldsymbol{\theta}_\iota,\boldsymbol{\theta}_\nu\}\) now includes both parameters of interest \(\boldsymbol{\theta}_\iota\) and nuisance parameters \(\boldsymbol{\theta}_\nu\). While the (usually theoretical) parameters of interest typically only affect the parton-level latent factor \(p(\boldsymbol{z}_\textrm{p} | \boldsymbol{\theta})\), some nuisance parameters account for possible mis-modelling in subsequent steps of the simulation and can thus affect the other factors in Equation 3.9.

The effect of the variation of nuisance parameters on any observable or summary statistic considered in a given analysis can be estimated by re-simulating the affected observations with the chosen parameter values - often prohibitively expensive - or by re-weighting already simulated observations as described in Equation 3.17 - which is much faster and reduces the statistical fluctuations between variations associated with the random sampling of the full latent space. Unprincipled modelling shortcuts, such as considering the additive or multiplicative effect of marginal efficiencies to account for combined effects, are also frequently used for count vector observables \(n^{\mathcal{C}_i}_j(\boldsymbol{\theta})\), as discussed in Section 3.1.3.4 together with possible solutions to some of the associated issues.

The re-weighting approach from Equation 3.16 is extremely effective for modelling the effect of parameters in conditional factors that deal with low-dimensional latent variables, such as \(p(\boldsymbol{z}_\textrm{p} | \boldsymbol{\theta})\), because the rest of the factors in the joint distribution cancel and we are left with a low-dimensional density estimation problem (even analytically tractable in some cases). For conditional factors that deal with higher-dimensional latent or observable spaces, such as \(p(\boldsymbol{z}_\textrm{d} |\boldsymbol{z}_\textrm{s}, \boldsymbol{\theta})\) or \(p(\boldsymbol{x} |\boldsymbol{z}_\textrm{d}, \boldsymbol{\theta})\), the ratio can be very hard to estimate unless additional simplifications are possible. For those nuisance parameters, it is easier to consider the effect on the lower-dimensional summary statistic instead of the detector readout \(\boldsymbol{x}\), because the ratio: \[ w(\boldsymbol{s}(\boldsymbol{x})) = \frac{p_R(\boldsymbol{s}(\boldsymbol{x})|\boldsymbol{\theta}_R)}{ p_Q(\boldsymbol{s}(\boldsymbol{x})|\boldsymbol{\theta}_Q)} \qquad(3.33)\] can be simpler to estimate through density estimation, or can approximately factorise if the summary statistic is chosen carefully. This fact motivates an alternative way to model the effect of some of the nuisance parameters, especially those related to differences between simulation and data in the observables of reconstructed objects after calibration. Let us consider the case where the summary statistic \(\boldsymbol{s}(\boldsymbol{x}) : \mathcal{X} \subseteq \mathbb{R}^{d} \longrightarrow \mathcal{Y}_\textrm{sum} \subseteq \mathbb{R}^{b}\) is effectively a function of the reconstructed objects and their properties \(\boldsymbol{y}_\textrm{reco} \in \mathcal{Y}_\textrm{reco}\), which can be schematically represented by the following function composition chain: \[ \mathcal{X} \overset{g}{\longrightarrow} \mathcal{Y}_\textrm{reco} \overset{h}{\longrightarrow} \mathcal{Y}_\textrm{sum} \qquad(3.34)\] where \(\boldsymbol{y}_\textrm{reco} = g(\boldsymbol{x})\) and \(\boldsymbol{y}_\textrm{sum} = h(\boldsymbol{y}_\textrm{reco})\). This compositional approach can be extended to also include event selection at trigger or analysis level, or other intermediate summaries of \(\boldsymbol{x}\) complementary to reconstruction, as part of the definition of the summary statistic \(\boldsymbol{s}(\boldsymbol{x})\). In all cases where \(\boldsymbol{s}(\boldsymbol{x})\) is a deterministic function, all differences between simulated observations and data in any expected observable originate from the differences between the simulation-based generative definition of \(p(\boldsymbol{x} | \boldsymbol{\theta})\) and the true unknown generative process \(p_\textrm{true}(\boldsymbol{x})\). While the task of evaluating and parametrising these differences directly by studying the raw detector output is quite convoluted, the differences can be corrected and their uncertainty assessed for the lower-dimensional intermediate states of the composition chain depicted in Equation 3.34.
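A minimal sketch of how the ratio in Equation 3.33 can be estimated on a summary statistic via simple density estimation; the one-dimensional summary distributions and the histogram-based estimator are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical simulated samples of a one-dimensional summary statistic s(x)
s_Q = rng.normal(loc=0.0, scale=1.0, size=200_000)  # nominal parameters theta_Q
s_R = rng.normal(loc=0.2, scale=1.1, size=200_000)  # varied parameters theta_R

edges = np.linspace(-6.0, 6.0, 61)
p_Q, _ = np.histogram(s_Q, edges, density=True)
p_R, _ = np.histogram(s_R, edges, density=True)

def weight(s):
    """Per-event weight w(s(x)) ~ p_R(s) / p_Q(s) estimated from binned densities."""
    idx = np.clip(np.digitize(s, edges) - 1, 0, len(p_Q) - 1)
    return np.where(p_Q[idx] > 0, p_R[idx] / np.maximum(p_Q[idx], 1e-12), 1.0)

# The re-weighted nominal sample approximately reproduces the varied sample mean
w = weight(s_Q)
print(np.average(s_Q, weights=w), s_R.mean())
```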

For example, if the momenta of a certain subset of the reconstructed objects \(\boldsymbol{y}_\textrm{reco}\) statistically differ between experimental data and the simulated observations, based on a subset of the data that is assumed to be well modelled, the momenta of the simulated observations can be corrected to better model the data. The limited accuracy of such a procedure, due to various factors, leads to a set of nuisance parameters that describe the limits of the aforementioned calibration as a function of the value of \(\boldsymbol{y}_\textrm{reco}\). The effect of these types of nuisance parameters can often be modelled in the simulation by using a function of the simulated intermediate outputs, e.g. in the case of reconstructed objects: \[ \mathop{\mathbb{E}}_{\boldsymbol{x} \sim p( \boldsymbol{x}| \boldsymbol{\theta} )} \left [ \boldsymbol{s}(\boldsymbol{x}) \right ] = \mathop{\mathbb{E}}_{\boldsymbol{y}_\textrm{reco} \sim p( \boldsymbol{y}_\textrm{reco}| \boldsymbol{\theta}_o )} \left [ h(r(\boldsymbol{y}_\textrm{reco}, \boldsymbol{\theta}_\rho))\right ] \qquad(3.35)\] so \(p( \boldsymbol{x}| \boldsymbol{\theta} )\) can be approximated by computing observables after applying the re-parametrisation \(r(\boldsymbol{y}_\textrm{reco}, \boldsymbol{\theta}_\rho)\) to the simulated observations, where \(\boldsymbol{\theta}_\rho\) is the vector of parameters representing the different uncertainty factors.
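A minimal sketch of Equation 3.35, assuming a hypothetical sample of reconstructed jet transverse momenta and a simple multiplicative energy-scale re-parametrisation \(r(p_\textrm{T}, \theta_\rho) = \theta_\rho \, p_\textrm{T}\):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical simulated reconstructed jet transverse momenta [GeV]
pt_reco = rng.gamma(shape=3.0, scale=40.0, size=100_000)

def r(pt, theta_rho):
    """Re-parametrisation of the reconstructed objects: a multiplicative jet energy scale."""
    return theta_rho * pt

def h(pt, threshold=100.0):
    """Summary statistic: indicator that the event passes a pT selection."""
    return (pt > threshold).astype(float)

# Expectation of the summary statistic for different values of the nuisance
# parameter, computed from the same simulated sample (Equation 3.35)
for theta_rho in (0.97, 1.00, 1.03):
    print(theta_rho, h(r(pt_reco, theta_rho)).mean())
```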

In general, the effects of all relevant nuisance parameters can be modelled by a combination of re-weighting the simulated observations by \(w(\boldsymbol{x}_i,\boldsymbol{z}_i | \boldsymbol{\theta}_w )\) and transforming intermediate simulated observables \(\boldsymbol{y}_\textrm{new} = r(\boldsymbol{y}_\textrm{sim},\boldsymbol{z}_i | \boldsymbol{\theta}_\rho)\). The former is based on importance sampling [93] to estimate the properties of a distribution different from the one originally sampled, while the latter assumes that the mis-modelling can be accounted for by a parametrisation of the simulated intermediate observables. If the functions \(w(\boldsymbol{x}_i,\boldsymbol{z}_i | \boldsymbol{\theta}_w )\) and \(r(\boldsymbol{y}_\textrm{sim},\boldsymbol{z}_i | \boldsymbol{\theta}_\rho)\) are differentiable or can be approximated by differentiable functions, the gradient (and higher-order derivatives) with respect to the parameters \(\boldsymbol{\theta}\) of any expectation value can be very efficiently approximated. This can be very useful for statistical inference (e.g. likelihood maximisation), although to our knowledge it has not been used in LHC analyses so far. This is one of the core concepts of the technique to construct summary statistics presented in Chapter 6.
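A minimal sketch of the re-weighting case, for a hypothetical Gaussian mean-shift parameter whose weight function and derivative are known analytically; with automatic differentiation the same construction extends to much more complex weight or transformation functions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical simulated observable, generated at the nominal value theta = 0
x = rng.normal(loc=0.0, scale=1.0, size=500_000)
s = x**2  # some per-event summary statistic

def weight(x, theta):
    """Re-weighting factor: likelihood ratio N(x; theta, 1) / N(x; 0, 1)."""
    return np.exp(theta * x - 0.5 * theta**2)

def d_weight_d_theta(x, theta):
    """Analytic derivative of the weight with respect to theta."""
    return (x - theta) * weight(x, theta)

theta = 0.2
expectation = np.mean(weight(x, theta) * s)          # E_theta[s], true value 1 + theta^2
gradient = np.mean(d_weight_d_theta(x, theta) * s)   # d/dtheta E_theta[s], true value 2 * theta

# Cross-check against a central finite difference on the same simulated events
eps = 1e-4
finite_diff = (np.mean(weight(x, theta + eps) * s)
               - np.mean(weight(x, theta - eps) * s)) / (2 * eps)
print(expectation, gradient, finite_diff)
```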

The inference results of a given analysis depend strongly on the assumptions implicit in the statistical model. The determination, assessment and practical definition of the effect of the nuisance parameters that are relevant for a given analysis is one of the most challenging yet important aspects of experimental particle physics at the LHC. When nuisance parameters are quantitatively taken into account in the statistical model, they lead to an increase of the uncertainty and to larger interval estimates (or exclusion limits) on the parameters of interest. The choice of summary statistics may also significantly affect the subsequent inference, and while nuisance parameters are usually qualitatively considered when building simple summary statistics from physics-inspired combinations of reconstructed variables, they are not considered at all when the automatic multivariate techniques described in Chapter 4 are applied to construct complex non-linear observables. This issue is addressed by the method proposed in Chapter 6.

3.1.4.2 Data-Driven Estimation

For some fundamental processes, the generative modelling provided by simulated observations might not be accurate enough for the purposes of a given LHC analysis. In a subset of those cases, the simulated observations can be calibrated to better describe the observations in well-modelled data regions, as mentioned in the previous section. However, if the description of the summary statistics considered in the analysis provided by the simulated observations from process \(j\) is substandard, e.g. because the number of observations that can realistically be simulated is not sufficient, then the contribution from the mentioned mixture component might have to be estimated directly from experimental observations.

The actual procedure used for modelling the contribution of a given mixture component \(j\) from data depends on the specifics of the process as well as on the details of the analysis, but often includes some re-weighting factor obtained from simulated observations or from additional experimental observations selected with an orthogonal criterion. Such data-driven estimation techniques are often used for background processes, but are hard to combine with the non-linear summary statistics constructed by machine learning techniques such as those described in Chapter 4. In the CMS analysis presented in Chapter 5, we describe and utilise a fully data-driven background estimation technique tailored to the modelling of the QCD multijet background in the search for Higgs pair production decaying to four b-quarks.


  1. In this work, synthetic likelihood will be used to refer to likelihoods that are not based on the probability distribution function of the generative model, but on non-parametric approximations using low-dimensional summaries of the data.

  2. Here categorical distribution refers to the special case of the multinomial distribution where the number of trials is one.

  3. The term group/type of interactions here generally refers to a set of processes that could be generatively modelled independently, not to quantum mechanical amplitudes or intensities of a process. For example, each group can correspond to a group of processes with a given final state \(pp \rightarrow X\) which could be modelled by sampling its differential cross section from Equation 1.32 followed by parton showering and detector simulation. The group category is a latent/hidden variable for each event, i.e. it is not observed.

  4. In particle physics, a resonance is a peak around a certain energy in the differential cross section associated with the production of subatomic particles.