3.2 Statistical Inference

In the previous section, the main characteristics of the generative statistical model \(p(D | \boldsymbol{\theta})\) relating the parameters \(\boldsymbol{\theta}\) with the set of observations \(D = \{\boldsymbol{x}_0,...,\boldsymbol{x}_n\}\) were reviewed. In addition, we discussed the role of lower-dimensional summary statistics as functional transformations of each detector readout \(\boldsymbol{s}(\boldsymbol{x}_i)\) or even of the whole dataset \(\boldsymbol{s}(D)\), as well as how the effect of additional uncertain parameters can be included in the simulation-based generative model of the data. In this section, we deal with the actual problem of inferring the subset of parameters of interest \(\boldsymbol{\theta}_\iota\) once a summary statistic has been chosen and the final statistical model \(p(\boldsymbol{s}(D) | \boldsymbol{\theta})\) has been fully specified.

3.2.1 Likelihood-Free Inference

One of the main properties of the statistical models at particle colliders we focussed on in the last section was their generative-only nature, whereby their probability density \(p(\boldsymbol{x} | \boldsymbol{\theta})\) cannot be expressed analytically, but only by means of forward simulation of observations. This fact greatly complicates the application of standard inference techniques, which require the explicit definition of a likelihood \[L(\boldsymbol{\theta} | D) =\prod_{\boldsymbol{x}_i \in D} p(\boldsymbol{x}_i | \boldsymbol{\theta}) \qquad(3.36)\] in order to make quantitative statements about the parameters of interest, because the likelihood expresses the extent to which a set of values for the model parameters is consistent with the observed data. Problems where the likelihood cannot be expressed directly are common in many scientific disciplines, because a link between observations and the underlying parameters can often only be provided by a probabilistic computer program. This is frequently the case when the system under study is complex, e.g. when it can only be described by a hierarchy or a sequence of stochastic processes.

The evaluation of the likelihood for complex generative models rapidly becomes impractical, especially when the dimensionality of the observations or the parameter space is very high. Various statistical techniques for dealing with these cases exist, generally referred to as likelihood-free or simulation-based inference techniques. A well-established group of techniques for inference when the likelihood function is unknown is referred to as Approximate Bayesian Computation (ABC) [94], [95]. The fundamental concept behind ABC is the generation of a simulated sample \(S_0 = \{\boldsymbol{x}_0,...,\boldsymbol{x}_{m-1}\}\) using a given vector of parameters \(\boldsymbol{\theta}_0\), which is then compared, using a distance criterion, to the actual observed dataset \(D\). If the data and the simulation are close enough, then \(\boldsymbol{\theta}_0\) is retained as a sample from the posterior. The process is repeated until the posterior is estimated with the desired accuracy. The quality of the posterior approximation produced by ABC techniques, as well as the number of sampling steps required to reach a given accuracy, strongly depend on the distance definition. When the dimensionality of the output is high, a summary statistic vector \(\boldsymbol{s}(D)\) has to be used in practice to increase the computational efficiency of the previous procedure, which would otherwise be intractable.
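A minimal sketch of ABC rejection sampling for a toy one-parameter Gaussian generative model is shown below. The forward simulator, the mean-based summary statistic, the uniform prior and the tolerance epsilon are illustrative assumptions, not part of any actual analysis.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate(theta, m=500):
    """Toy forward simulator: observations drawn from a Gaussian with unknown mean theta."""
    return rng.normal(loc=theta, scale=1.0, size=m)

def summary(sample):
    """Low-dimensional summary statistic s(D): here simply the sample mean."""
    return np.mean(sample)

def abc_rejection(observed, prior_sampler, epsilon=0.05, n_draws=20_000):
    """ABC rejection sampling: theta_0 is kept as a posterior sample whenever the
    simulated summary is within epsilon of the observed one."""
    s_obs = summary(observed)
    kept = []
    for _ in range(n_draws):
        theta_0 = prior_sampler()                # draw candidate parameter from the prior
        s_sim = summary(simulate(theta_0))       # forward-simulate and summarise
        if abs(s_sim - s_obs) < epsilon:         # distance criterion in summary space
            kept.append(theta_0)
    return np.array(kept)

# Usage: data generated with true theta = 1.5, flat prior on [0, 3]
observed_data = rng.normal(loc=1.5, scale=1.0, size=500)
posterior = abc_rejection(observed_data, prior_sampler=lambda: rng.uniform(0.0, 3.0))
print(posterior.mean(), posterior.std())
```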

The approach commonly used when carrying out inference at particle physics experiments at the LHC is somewhat related to the aforementioned family of techniques. The observations are also reduced to a lower-dimensional summary statistic space, but then a non-parametric likelihood is constructed so that standard inference techniques can be applied. The likelihood is often based on the product of Poisson count terms, as depicted in Equation 3.27 and Equation 3.28, where the dependence of the expectations on the parameters is based on the simulation and the mixture structure. Alternative approaches include the use of a simple one-dimensional parametrisation for a continuous background and a bump-like signal, which is common when the reconstructed mass of an intermediate object is used as summary statistic and its distribution is well-controlled, e.g. a Higgs boson decaying to two photons. An additional alternative approach, which has not been used in LHC analyses to date, could be to use non-parametric density estimation techniques to obtain an unbinned likelihood directly from simulated data. This approach has recently been referred to as Approximate Frequentist Computation (AFC) [96], and can also be combined with the technique presented in Chapter 6.
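To make the structure of such count-based likelihoods concrete, the sketch below evaluates a negative log-likelihood built as a product of per-bin Poisson terms, with expectations given by a signal-plus-background mixture scaled by the signal strength; the templates and observed counts are made-up numbers used purely for illustration.

```python
import numpy as np
from scipy.stats import poisson

def binned_nll(mu, observed_counts, signal_template, background_template):
    """Negative log-likelihood from a product of Poisson terms, one per summary-statistic
    bin, with simulation-based expectations mu * s_i + b_i (mixture structure)."""
    expected = mu * signal_template + background_template
    return -np.sum(poisson.logpmf(observed_counts, expected))

# Hypothetical per-bin expectations obtained from simulated signal and background events
signal_template = np.array([2.0, 5.0, 9.0, 4.0])
background_template = np.array([50.0, 35.0, 12.0, 3.0])
observed_counts = np.array([53, 42, 20, 8])

print(binned_nll(0.0, observed_counts, signal_template, background_template))
print(binned_nll(1.0, observed_counts, signal_template, background_template))
```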

3.2.2 Hypothesis Testing

Statistical inference within experimental particle physics is often framed as a hypothesis testing problem. The goal of statistical testing is to make a quantitative statement about how well observed data agree with an underlying model or prediction, which is often referred to as a hypothesis. The statistical model under consideration is often referred to as the null hypothesis \(H_0\). Classical statistical testing techniques often require the definition of an alternative hypothesis \(H_1\), whose agreement with the data is compared with that of the null. A hypothesis is said to be simple when all the distribution (or generative model) parameters are fully specified, i.e. \(p(\boldsymbol{x} | H_s) =f(\boldsymbol{x})\) does not depend on any non-fixed parameter. A composite hypothesis instead depends on one or more parameters \(\boldsymbol{\theta}\), i.e. the distribution under the hypothesis can be expressed as \(p(\boldsymbol{x} | H_c) =f(\boldsymbol{x},\boldsymbol{\theta})\).

In order to carry out hypothesis testing based on a set of observations \(D = \{\boldsymbol{x}_0,...,\boldsymbol{x}_n\}\), a test statistic \(t(D)\) that is a function of the observations is constructed. The choice of test statistic is especially challenging when \(\boldsymbol{x}\) is high-dimensional and \(p(\boldsymbol{x}_i | \boldsymbol{\theta})\) is not known. The concepts of test statistic and summary statistic, the latter discussed in Section 3.1.3.2, are closely related. A test statistic is in fact a sample summary statistic1 \(s(D)\) that is used within a statistical test to accept or reject hypotheses, so all the concerns regarding sufficiency from Section 3.1.3.3 also apply. Regarding the dimensionality of \(t(D) : \mathcal{X}_D \subseteq \mathbb{R}^{d \times n } \longrightarrow \mathcal{T}\), while it can be a multi-dimensional vector (e.g. one could even use \(t(D)=(\boldsymbol{x}_0,...,\boldsymbol{x}_n)\)), a one-dimensional variable is usually considered in order to simplify the process of making calibrated statistical statements.

Let us refer to the test statistic for the set of observations as \(t_\textrm{obs}\) from here onwards. The outcome of the statistical test is whether the hypothesis \(H_0\) can be rejected in favour of \(H_1\), which happens when the null is deemed sufficiently unlikely. In practice, in order to make a principled decision, a critical region \(\mathcal{T}_C \subseteq \mathcal{T}\) in the space of the test statistic has to be defined before looking at the set of observations. Once the critical region has been chosen, a test can then be characterised by its significance level \(\alpha\) and power \(1-\beta\). The significance level, which is also referred to as the Type I error rate, is the probability of rejecting \(H_0\) when it is actually true. For a given test based on the summary statistic \(t(D)\) and its critical region \(\mathcal{T}_C\), the significance level can be defined as: \[ \alpha = P ( t \in \mathcal{T}_C | H_0) = \int_{\mathcal{T}_C} g( t| H_0) dt \stackrel{\textrm{1D}}{\Rightarrow} \int_{t_\textrm{cut}}^\infty g( t| H_0) dt \qquad(3.37)\] where \(g( t| H_0)\) is the distribution of the test statistic under the null hypothesis \(H_0\), and the latter simplification applies for one-dimensional summary statistics where the critical region is defined by a given threshold \(t_\textrm{cut}\). The power of a test \(1-\beta\) is instead the probability of rejecting the null hypothesis when the alternative is actually true, where \(\beta\) is referred to as the Type II error rate. The Type II error rate \(\beta\) can be defined as the probability of not being in the critical region under the alternative hypothesis: \[ \beta = P ( t \not\in \mathcal{T}_C | H_1) = 1 - \int_{\mathcal{T}_C} g( t| H_1) dt \stackrel{\textrm{1D}}{\Rightarrow} \int_{-\infty}^{t_\textrm{cut}} g( t| H_1) dt \qquad(3.38)\] where \(g( t| H_1)\) is the distribution of the test statistic under the alternative hypothesis \(H_1\), and the last term corresponds to the one-dimensional case based on a threshold. Both the significance level and the power of a hypothesis test depend on the definition of its test statistic and the critical region. The significance level \(\alpha\) of a test is often fixed at a given value in order to reject the null in favour of an alternative. It is then beneficial to design the test so that its power is as high as possible, which is equivalent to having a Type II error rate as low as possible.
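As a toy numerical example, the significance level and power of a one-dimensional threshold test can be estimated directly from samples of the test statistic under each hypothesis; the Gaussian stand-ins used below for \(g(t|H_0)\) and \(g(t|H_1)\) are an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples standing in for the test statistic distributions g(t|H0) and g(t|H1),
# as would be obtained e.g. from simulated observations.
t_h0 = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
t_h1 = rng.normal(loc=2.0, scale=1.0, size=1_000_000)

t_cut = 1.64  # threshold defining the critical region T_C = [t_cut, inf)

alpha = np.mean(t_h0 >= t_cut)   # Type I error rate: P(t in T_C | H0), Equation 3.37
beta = np.mean(t_h1 < t_cut)     # Type II error rate: P(t not in T_C | H1), Equation 3.38
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, power = {1 - beta:.3f}")
```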

From the definition of Type I and Type II error rates in Equation 3.37 and Equation 3.38, it is evident that either the probability distribution function of the test statistic under both the null and alternative hypotheses or a way to estimate the corresponding integrals from simulated observations is required. The main advantage of one-dimensional test statistics, similarly to the low-dimensional summary statistics discussed in Section 3.1.3.2, is that they allow for an efficient estimation of the probability distribution function using non-parametric techniques. When both the null \(H_0\) and alternative hypothesis \(H_1\) are simple, the Neyman-Pearson lemma [97] states that the likelihood ratio, which is a one-dimensional test statistic defined as: \[ \Lambda( D; H_0, H_1) = \frac{p(D| H_0)}{p(D| H_1)} = \prod_{\boldsymbol{x} \in D} \frac{p(\boldsymbol{x}| H_0)}{ p(\boldsymbol{x} |H_1)} \qquad(3.39)\] is the most powerful test statistic at any threshold \(t_\textrm{cut}\), which is associated with a significance \(\alpha=P(\Lambda(D; H_0, H_1) \leq t_\textrm{cut} | H_0)\). The last equality in Equation 3.39 requires independence between the different observations. While the likelihood ratio can be proven to be the most powerful test statistic, it cannot be evaluated exactly if the likelihood is not known, which is often the case for LHC inference problems as discussed in Section 3.2.1. The alternative hypothesis is usually composite at particle colliders because the signal mixture fraction \(\mu\) (or equivalently its cross section) is one of the parameters of interest. The likelihood ratio test can nevertheless be expressed in this case as a function of the parameter \(\mu\), and it will be the most powerful test for a given \(\mu\) if it is the only unknown parameter.
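A minimal sketch of the likelihood ratio test statistic of Equation 3.39 for two simple hypotheses is given below, assuming for illustration that both hypotheses are unit-width Gaussians with known means; in a realistic likelihood-free setting this quantity could not be evaluated exactly.

```python
import numpy as np
from scipy.stats import norm

def log_likelihood_ratio(data, loc0=0.0, loc1=1.0, scale=1.0):
    """ln Lambda(D; H0, H1) = sum_i [ln p(x_i|H0) - ln p(x_i|H1)] for independent
    observations and two simple (fully specified) Gaussian hypotheses."""
    return np.sum(norm.logpdf(data, loc=loc0, scale=scale)
                  - norm.logpdf(data, loc=loc1, scale=scale))

rng = np.random.default_rng(1)
data = rng.normal(loc=1.0, scale=1.0, size=50)   # dataset actually drawn from H1
print(log_likelihood_ratio(data))                # low values favour H1 over H0
```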

It is worth noting that while the likelihood ratio defined in Equation 3.39 defines the most powerful test, a likelihood ratio based on a summary statistic \(\boldsymbol{s}(D)\) can also be defined; however, it is not the most powerful test for inference based on \(D\) unless \(\boldsymbol{s}(D)\) is a sufficient summary statistic with respect to the parameters \(\boldsymbol{\theta}\) which fully define the null \(p(\boldsymbol{x} | H_0) = p(\boldsymbol{x} | \boldsymbol{\theta}_0)\) and alternate \(p(\boldsymbol{x} | H_1) = p(\boldsymbol{x} | \boldsymbol{\theta}_1)\) hypotheses. This fact motivates the use of machine learning techniques to approximate the likelihood ratio directly from simulated observations, as discussed in Section 4.3.1.1. The likelihood ratio can then be calibrated by means of non-parametric probability density estimation techniques or count-based likelihoods.

Another relevant issue when defining test statistics is that hypotheses are rarely simple (or composite only in the restricted way previously described). Let us suppose that \(\mu\) is the parameter of interest, e.g. the mixture coefficient for the signal. The statistical model often depends on additional nuisance parameters \(\boldsymbol{\theta}\), as discussed in Section 3.1.4. The likelihood ratio from Equation 3.39 is not guaranteed to be the most powerful test statistic when the hypotheses are composite. In this case, test statistics based on the profile likelihood ratio are often used, which can be defined for LHC searches as: \[ \lambda(\mu) = \frac{L(\mu, \hat{\hat{\boldsymbol{\theta}}})}{ L(\hat{\mu}, \hat{\boldsymbol{\theta}})} \qquad(3.40)\] where \(\hat{\hat{\boldsymbol{\theta}}}\) in the numerator refers to the value of the nuisance parameters that maximises the likelihood for a given \(\mu\), and \(\hat{\mu}\) and \(\hat{\boldsymbol{\theta}}\) in the denominator are the standard maximum likelihood estimators. The property that motivates the use of the profile likelihood ratio, other than its convergence to the likelihood ratio when the hypotheses are simple, is that its distribution for large numbers of observations can be effectively approximated, as demonstrated by Wilks and Wald [98], [99].
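The sketch below evaluates the profile likelihood ratio of Equation 3.40 numerically for a hypothetical binned Poisson model with a single background-normalisation nuisance parameter; the templates, observed counts, optimiser choices and bounds are illustrative assumptions.

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import minimize, minimize_scalar

# Hypothetical binned model: expected counts mu * s_i + theta * b_i, where theta is an
# uncertain background normalisation treated as a nuisance parameter.
signal = np.array([2.0, 5.0, 9.0, 4.0])
background = np.array([50.0, 35.0, 12.0, 3.0])
observed = np.array([55, 43, 22, 6])

def nll(mu, theta):
    return -np.sum(poisson.logpmf(observed, mu * signal + theta * background))

def profile_lr_test_statistic(mu):
    """-2 ln lambda(mu), with lambda(mu) = L(mu, theta_hathat) / L(mu_hat, theta_hat)."""
    # conditional fit: only the nuisance parameter is free, at fixed mu
    conditional = minimize_scalar(lambda th: nll(mu, th), bounds=(0.5, 1.5), method="bounded")
    # global fit: both mu and theta are free
    global_fit = minimize(lambda p: nll(p[0], p[1]), x0=[1.0, 1.0], method="Nelder-Mead")
    return 2.0 * (conditional.fun - global_fit.fun)

print(profile_lr_test_statistic(1.0))   # small values indicate compatibility with mu = 1
```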

For a discussion of the different test statistics based on the profile likelihood ratio as well as their asymptotic approximations, the reference [100] is recommended. In particular, the use of the Asimov dataset, where the observed sample summary statistic of the type outlined in Equation 3.29 is assumed to be equal to its expectation, is instrumental for the technique described in Chapter 6. The statistical framework of hypothesis testing is used to decide whether or not to reject the null hypothesis in favour of the alternate. Alternatively, it can also be useful to estimate the probability of obtaining data (or a test statistic value) at least as extreme as that observed under the null hypothesis, which is referred to as the p-value, or alternatively as the Z-value when standard deviation units are used. When the null hypothesis \(H_0\) is not rejected, the statistical test can be recast to obtain exclusion upper limits at a given confidence level (usually 95%), as is done in the non-resonant Higgs production search included in Chapter 5.

For obtaining exclusion upper limits, it is useful to define a modified test statistic \(\widetilde{q}(\mu)\): \[ \widetilde{q}(\mu) = \begin{cases} -2\ln \frac{L(\mu, \hat{\hat{\boldsymbol{\theta}}}(\mu))}{ L(0, \hat{\hat{\boldsymbol{\theta}}}(0))} \quad &\textrm{if}\ \hat{\mu} < 0 \\ -2\ln \frac{L(\mu, \hat{\hat{\boldsymbol{\theta}}}(\mu))}{ L(\hat{\mu}, \hat{\boldsymbol{\theta}})} \quad &\textrm{if}\ 0 \leq \hat{\mu} \leq \mu \\ 0 \quad &\textrm{if}\ \hat{\mu} > \mu \\ \end{cases} \] which does not regard negative background fluctuations or cases where \(\hat{\mu} > \mu\) as evidence against \(\mu\). When using \(\widetilde{q}(\mu)\) or similar profile-likelihood-based one-dimensional test statistics, the observed exclusion upper limit can be defined as the largest value of \(\mu\) for which the probability of obtaining a test statistic equal to or larger than the observed one is not smaller than \(\alpha\) (e.g. \(\alpha=0.05\) for 95% confidence level upper limits), where this probability can be expressed as the following integral: \[ P(\widetilde{q}(\mu) \geq \widetilde{q}_\textrm{obs}(\mu) | \mu) = \int^{\infty}_{\widetilde{q}_\textrm{obs}(\mu)} g(\widetilde{q}(\mu) | \mu)\, d\widetilde{q}(\mu) \qquad(3.41)\] where \(\widetilde{q}_\textrm{obs}(\mu)\) is the observed test statistic and \(g(\widetilde{q}(\mu) | \mu)\) is its distribution under the hypothesis with signal fraction \(\mu\). This integral can be approximated using Monte Carlo simulations or by the asymptotic approximations described in [100]. A different upper limit definition is often used to avoid excluding an alternative hypothesis with a fixed probability \(\alpha\) even when the analysis has no sensitivity, referred to as the CLs procedure [101], [102], in which a value of \(\mu\) is excluded only if \(P(\widetilde{q}(\mu) \geq \widetilde{q}_\textrm{obs}(\mu) | \mu)/P(\widetilde{q}(\mu) \geq \widetilde{q}_\textrm{obs}(\mu) | 0) \leq \alpha\), which solves the mentioned issue at the cost of over-coverage.
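A minimal sketch of the CLs computation from toy pseudo-experiments is shown below; the toy distributions of the test statistic under the signal-plus-background and background-only hypotheses are arbitrary stand-ins for distributions that would come from Monte Carlo simulation or from the asymptotic formulae.

```python
import numpy as np

def cls_value(q_obs, q_toys_mu, q_toys_bkg):
    """CLs = CLs+b / CLb, estimated from toy distributions of the test statistic under
    the hypothesis with signal strength mu and under the background-only hypothesis."""
    cl_sb = np.mean(q_toys_mu >= q_obs)    # P(q_tilde(mu) >= q_obs | mu)
    cl_b = np.mean(q_toys_bkg >= q_obs)    # P(q_tilde(mu) >= q_obs | 0)
    return cl_sb / cl_b if cl_b > 0 else 0.0

# Arbitrary stand-in toy distributions of q_tilde(mu) from pseudo-experiments
rng = np.random.default_rng(3)
q_toys_mu = rng.chisquare(df=1, size=100_000) * 0.3
q_toys_bkg = rng.chisquare(df=1, size=100_000) * 2.0

print("CLs =", cls_value(3.2, q_toys_mu, q_toys_bkg))  # mu excluded at 95% CL if CLs <= 0.05
```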

Most data analyses at the LHC, and particularly searches such as the one discussed in Chapter 5, are carried out in a blinded manner to reduce the experimenter’s bias, i.e. the subset of observations or results relevant for statistical inference are not considered (or are concealed) until all the analysis procedures have been defined. In order to optimise the various analysis components (e.g. the event selection or the summary statistic), it is useful to compute a figure of merit that is representative of the prospective sensitivity of the analysis. The expected significance is the expectation value under the alternative hypothesis of the probability value from Equation 3.37. In practice, the median rather than the expectation is often considered in order to preserve monotonicity with Z-values, and several approximations exist for simple cut-and-count likelihoods. Both the expected and median significance depend on the signal fraction \(\mu\) assumed, so they are particularly useful to optimise analyses where the order of magnitude expected for \(\mu\) is known, e.g. cross section measurements of SM processes.
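For a simple single-bin cut-and-count likelihood, one widely used approximation of the median discovery significance is the Asimov-based formula \(Z = \sqrt{2\left((s+b)\ln(1+s/b)-s\right)}\), which reduces to \(s/\sqrt{b}\) for \(s \ll b\) (see the asymptotic approximations referenced above); the sketch below, with made-up signal and background yields, illustrates the comparison.

```python
import numpy as np

def asimov_significance(s, b):
    """Median expected discovery significance for a single-bin counting experiment,
    using the asymptotic (Asimov) approximation Z = sqrt(2*((s+b)*ln(1+s/b) - s))."""
    return np.sqrt(2.0 * ((s + b) * np.log(1.0 + s / b) - s))

s, b = 10.0, 100.0                 # hypothetical expected signal and background yields
print(asimov_significance(s, b))   # approximately 0.98 sigma
print(s / np.sqrt(b))              # naive s/sqrt(b) estimate for comparison
```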

Alternatively, the expected median upper limit can be defined as the exclusion upper limit obtained using the median test statistic \(\widetilde{q}_\textrm{med}(\mu)\) under the null hypothesis instead of the observed statistic. In addition to the median expected limit, it is common practice in LHC searches to also compute the so-called 1-sigma and 2-sigma bands, which correspond to the \(50.0\pm34.1\) and \(50.0\pm47.7\) percentiles instead of the median. The upper limit bands provide a quantitative estimation of the possible limit variation if no signal is present in the data. Both the expected significance and the expected upper limit can be estimated asymptotically for summary statistics like the one described in Equation 3.29. The effect of nuisance parameters can also be included both in the asymptotic approximations and in the Monte Carlo based estimation. The asymptotic approximations are found to be good empirically, within 10% to 30% of the Monte Carlo based estimation (the latter figure for situations where the number of events is small), and thus they are frequently used for obtaining limits and significances in New Physics searches.

3.2.3 Parameter Estimation

Another inference problem that can be defined based on the observed data is parameter estimation, whose goal can generally be defined as the determination of the possible or optimal values of the parameters of a statistical model given a set of observations. Two types of parameter estimation problems are often considered: point estimation and interval estimation. If the aim is to obtain the best estimate (i.e. a single value) of a vector of parameters based on a set of observations, we are dealing with a point estimation problem. When we are instead interested in using a set of observations to make statistical statements about a range or region of values that the statistical model parameters may take, we are dealing with an interval estimation problem.

Parameter estimation can be addressed either from a classical (also known as frequentist) standpoint, where the true values of the parameters are assumed to be fixed but unknown and intervals represent the region of parameters for which the set of observed data could be obtained upon repeated sampling; or from a Bayesian perspective, where probabilistic statements representing the degree of belief in the values of the parameters are updated based on the set of observations. A classical inference approach is predominantly adopted in this document, where the definition of probability is based on the relative frequency of the outcome when repeated trials are carried out. Classical interval estimation, often referred to as confidence interval estimation, is strongly related to hypothesis testing, as reviewed in Section 3.2.2. The \(100(1-\alpha)\%\) confidence interval (CI) for a one-dimensional parameter \(\theta\) can be defined as the interval \([\hat{\theta}^{-},\hat{\theta}^{+}]\) such that:

\[ P(\hat{\theta}^{-} \leq \theta \leq \hat{\theta}^{+}) = 1 - \alpha \qquad(3.42)\]

where \(\hat{\theta}^-\) and \(\hat{\theta}^+\) are referred to as the lower and upper limits. A confidence interval in the context of classical parameter estimation is thus a range of values for a given parameter which, upon repeated trials, would contain the true value \(100(1-\alpha)\%\) of the time. The concept of confidence interval can also be extended to that of a confidence region when a multi-dimensional parameter vector or several disjoint intervals are considered. While the definition of a confidence interval based on its coverage properties is rather simple, its construction based on a set of observations \(D = \{\boldsymbol{x}_0,...,\boldsymbol{x}_n\}\) can be quite challenging. It is worth noting that both the upper and lower limits are estimators, quantities calculated by applying a given procedure to the set of observations, and thus \(\hat{\theta}^- (D)\) and \(\hat{\theta}^+ (D)\) explicitly depend on the set of data.

The Neyman construction [103] provides a principled procedure to define \(100(1-\alpha)\%\) confidence intervals which guarantee the property defined in Equation 3.42, by inverting an ensemble of hypothesis tests (as defined in Section 3.2.2) using simulated datasets for the different values that the parameter \(\theta\) can take. Confidence intervals can be one-sided, e.g. the exclusion upper limits defined in Equation 3.41, or two-sided, as in the definition provided in Equation 3.42. In particle collider analyses, there is often a dichotomy between one-sided intervals for null results and two-sided intervals for non-null results, which can be solved by extending the Neyman construction with a likelihood-ratio ordering criterion [104].
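A schematic implementation of the Neyman construction for a toy model (the mean of a unit-width Gaussian, using the sample mean as test statistic and central acceptance intervals built from simulated datasets) is sketched below; the model, the parameter grid and the toy settings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def neyman_interval(x_obs, theta_grid, alpha=0.3173, n_toys=5_000, n_obs=20):
    """For each theta on a grid, build the central 1 - alpha acceptance interval of the
    test statistic from simulated datasets; the confidence interval is the set of theta
    values whose acceptance interval contains the observed statistic (test inversion)."""
    accepted = []
    for theta in theta_grid:
        toys = rng.normal(loc=theta, scale=1.0, size=(n_toys, n_obs)).mean(axis=1)
        lo, hi = np.quantile(toys, [alpha / 2.0, 1.0 - alpha / 2.0])
        if lo <= x_obs <= hi:
            accepted.append(theta)
    return min(accepted), max(accepted)

data = rng.normal(loc=1.0, scale=1.0, size=20)
print(neyman_interval(data.mean(), theta_grid=np.linspace(0.0, 2.0, 201)))
```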

Confidence interval procedures based on the Neyman construction work very well for simple statistical models with one or two parameters, but they rapidly become computationally intractable for larger numbers of parameters. Even though the number of parameters of interest in LHC analyses is usually small, nuisance parameters play an important role in inference, as reviewed in Section 3.1.4.1, and cannot be accounted for in a straightforward manner in the previous procedure. Thus, when the total number of parameters is high, confidence intervals are usually computed with alternative approximations, often based on some of the properties of the profile likelihood ratio discussed in Section 3.2.2.

Before discussing the fundamentals of these confidence interval approximations, it is useful to formally define the maximum likelihood estimator \(\boldsymbol{\theta}_{\textrm{ML}}\) of the model parameters based on a set of observations \(D = \{\boldsymbol{x}_0,...,\boldsymbol{x}_n\}\) as:

\[ \boldsymbol{\theta}_\textrm{ML} = \mathop{\textrm{arg max}}_{\boldsymbol{\theta} \in \Theta} L(D; \boldsymbol{\theta}) \qquad(3.43)\]

where \(L(D; \boldsymbol{\theta})\) is the likelihood function given the set of observations \(D\), expressed as a function of the model parameters \(\boldsymbol{\theta}\). The maximum likelihood estimator of the model parameters was already used to define the profile likelihood ratio test statistic in Equation 3.40, and it is a very common point estimator because it is asymptotically consistent and efficient. In addition, the maximum likelihood estimator coincides with the maximum a posteriori (MAP) point estimator in Bayesian inference when the parameter priors are uniform, because the posterior is then proportional to the likelihood.
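As a simple illustration, the maximum likelihood estimator of Equation 3.43 can be obtained by numerical minimisation of the negative log-likelihood; the Gaussian model and the optimiser choice below are assumptions made for the sake of the example.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(5)
data = rng.normal(loc=2.0, scale=1.5, size=500)   # toy observations

def negative_log_likelihood(params):
    mu, log_sigma = params                         # log-parametrisation keeps sigma positive
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

# theta_ML = arg max L(D; theta), obtained here as arg min of -ln L(D; theta)
result = minimize(negative_log_likelihood, x0=[0.0, 0.0], method="Nelder-Mead")
mu_ml, sigma_ml = result.x[0], np.exp(result.x[1])
print(mu_ml, sigma_ml)                             # close to the true values (2.0, 1.5)
```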

The shape of the likelihood function around the maximum likelihood estimator \(\boldsymbol{\theta}_{\textrm{ML}}\) can be used to approximate confidence intervals. Using asymptotic theory developed by Wilks [98], the \(100(1-\alpha)\%\) confidence region for the parameter vector \(\boldsymbol{\theta}\) can be determined using the following relation:

\[ - \ln L(D; \boldsymbol{\theta}) \leq - \ln L(D; \boldsymbol{\theta}_{\textrm{ML}}) + \Delta \ln L \qquad(3.44)\]

where \(\ln L(D; \boldsymbol{\theta}_{\textrm{ML}})\) is the natural logarithm of the likelihood at the maximum likelihood estimator and \(\Delta \ln L\) depends on the number of parameter dimensions and the desired coverage \(1-\alpha\). For example, for a one-dimensional parameter, the values of \(\boldsymbol{\theta}\) inside the \(68.27\%\) (i.e. 1-sigma) confidence region are those for which the previous relation is verified using \(\Delta \ln L = 0.5\). If \(\boldsymbol{\theta}\) is one-dimensional and the negative log-likelihood \(-\ln L(D; \boldsymbol{\theta})\) is convex, the confidence interval limits \(\hat{\theta}^- (D)\) and \(\hat{\theta}^+ (D)\) can be obtained by finding the most extreme values of \(\theta\) that verify Equation 3.44 on each side of the maximum likelihood estimator \(\boldsymbol{\theta}_{\textrm{ML}}\).
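A sketch of this construction for a toy one-parameter Gaussian model is given below, where the interval limits are found as the crossings of \(-\ln L\) with \(-\ln L(D; \boldsymbol{\theta}_{\textrm{ML}}) + 0.5\) on each side of the maximum; the model and the root-bracketing ranges are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

rng = np.random.default_rng(11)
data = rng.normal(loc=2.0, scale=1.0, size=100)

def nll(theta):
    """Negative log-likelihood for the mean of a unit-width Gaussian."""
    return -np.sum(norm.logpdf(data, loc=theta, scale=1.0))

theta_ml = data.mean()        # analytic maximum likelihood estimator for this simple model
nll_min = nll(theta_ml)

def delta_nll_crossing(theta):
    # roots correspond to -ln L(theta) = -ln L(theta_ML) + 0.5 (Equation 3.44, 1 sigma)
    return nll(theta) - nll_min - 0.5

theta_lo = brentq(delta_nll_crossing, theta_ml - 2.0, theta_ml)
theta_hi = brentq(delta_nll_crossing, theta_ml, theta_ml + 2.0)
print(theta_ml, (theta_lo, theta_hi))   # expected to be close to theta_ML +/- 1/sqrt(100)
```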

As discussed in Section 3.1.4.1, we are often interested in confidence intervals for the subset of parameters of interest \(\boldsymbol{\theta}_\iota\), while regarding the others as nuisance parameters \(\boldsymbol{\theta}_\nu\). The previous procedure can be extended to compute approximate confidence intervals for the parameters of interest by considering the profiled likelihood [105] \(\hat{L}(D; \boldsymbol{\theta}_\iota)\) instead of the full likelihood in Equation 3.44, which is defined as:

\[ \hat{L}(D; \boldsymbol{\theta}_\iota) = \max_{\boldsymbol{\theta}_\nu \in \Theta_\nu} L(D; \boldsymbol{\theta}_\iota, \boldsymbol{\theta}_\nu) \qquad(3.45)\]

so the nuisance parameters \(\boldsymbol{\theta}_\nu\) are profiled by taking, for each value of the parameters of interest \(\boldsymbol{\theta}_\iota\), the values that maximise the conditional likelihood. Note that a constant denominator in the likelihood would cancel out on each side of Equation 3.44, and similarly when the profiled likelihood from Equation 3.45 is used; both procedures can therefore be theoretically linked with the profile likelihood ratio test statistic defined in Equation 3.40. Algorithms for likelihood maximisation and computation of intervals based on the profiled likelihood are implemented in the minos routine as part of the minuit software library [106], which can also account for bounded parameters. Confidence intervals based on the profiled likelihood will be used for benchmarking different ways of constructing summary statistics in Chapter 6.

Another important subtlety when dealing with nuisance parameters (which also applies, to a lesser degree, to the combination of measurements) is that they are often constrained by theory or by external measurements. This can be included in the previous likelihood-based techniques by considering the likelihood as the product of the likelihood derived from the statistical model for the set of observations \(L_D(D; \boldsymbol{\theta})\) with the available constraints \(L_C^i(\boldsymbol{\theta})\), as follows: \[ L (D; \boldsymbol{\theta}) = L_D(D; \boldsymbol{\theta}) \prod_{i=0}^{c} L_C^i(\boldsymbol{\theta}) \qquad(3.46)\] where simplified likelihoods (e.g. a normal approximation) are often used in the constraint terms \(L_C^i(\boldsymbol{\theta})\), although they could in principle also depend on an independent set of observations. The constraint terms can also be understood as prior probability distributions in a Bayesian setting, obtained from previous evidence.
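The sketch below shows how a single constraint term of the form in Equation 3.46, here taken as a normal approximation for an auxiliary measurement of a background normalisation, can be combined with a Poisson data likelihood for a hypothetical single-bin counting model; all numbers are illustrative.

```python
import numpy as np
from scipy.stats import poisson, norm

# Hypothetical single-bin counting model with a background normalisation nuisance
# parameter theta constrained by an auxiliary measurement theta = 1.00 +/- 0.05.
signal, background, observed = 10.0, 100.0, 115

def constrained_nll(mu, theta):
    """-ln L(D; mu, theta) = -ln L_D(D; mu, theta) - ln L_C(theta), cf. Equation 3.46,
    with a normal approximation used for the single constraint term."""
    nll_data = -poisson.logpmf(observed, mu * signal + theta * background)
    nll_constraint = -norm.logpdf(theta, loc=1.0, scale=0.05)
    return nll_data + nll_constraint

print(constrained_nll(1.0, 1.00))   # nominal background normalisation
print(constrained_nll(1.0, 1.10))   # pulling theta away is penalised by the constraint term
```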

In order to obtain approximate confidence intervals from the shape of the likelihood or profile likelihood function around the maximum likelihood estimator, several likelihood evaluations (together with a constrained optimisation problem if \(\hat{L}(D; \boldsymbol{\theta}_\iota)\) is used) are often required to estimate a confidence interval accurately. A cruder but often useful approximation can be obtained from the curvature of the negative log-likelihood function at \(\boldsymbol{\theta}_{\textrm{ML}}\). In more than one dimension, the local curvature can be expressed by the Hessian matrix \(\boldsymbol{H}\). The expectation of the Hessian of \(- \ln L(D; \boldsymbol{\theta})\) is also referred to as the Fisher information matrix \({\boldsymbol{I}(\boldsymbol{\theta})}\) [107] and is defined as: \[ {\boldsymbol{I}(\boldsymbol{\theta})}_{ij} = \mathop{\mathbb{E}}_{D \sim p ( D |\boldsymbol{\theta} )} \left [ {\boldsymbol{H}(\boldsymbol{\theta})}_{ij} \right ] = \mathop{\mathbb{E}}_{D \sim p ( D |\boldsymbol{\theta} )} \left [ \frac{\partial^2}{\partial {\theta_i} \partial {\theta_j}} \left ( - \ln L(D; \boldsymbol{\theta}) \right ) \right ] \qquad(3.47)\] which can be evaluated at any given \(\boldsymbol{\theta}\), e.g. by using numerical differentiation. The Cramér-Rao lower bound [108], [109] provides a link between the inverse of the Fisher information matrix and the covariance of an unbiased estimator \(\hat{\boldsymbol{\theta}}\): \[ \textrm{cov}_{\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}) \geq I(\boldsymbol{\theta})^{-1} \qquad(3.48)\] which becomes an equality in the large-sample limit for an efficient parameter estimator such as the maximum likelihood estimator \(\boldsymbol{\theta}_\textrm{ML}\). The diagonal elements of the inverse of the information matrix \(\sigma_i^2=\left( I(\boldsymbol{\theta})^{-1} \right)_{ii}\) may be used to construct a \(68.3\%\) confidence interval for the parameter \(\theta_i\), where the effect of the rest of the parameters has been profiled, as \([(\boldsymbol{\theta}_\textrm{ML})_i-\sigma_i, (\boldsymbol{\theta}_\textrm{ML})_i+\sigma_i]\). This approximation is equivalent to profiling under the assumption that \(- \ln L(D; \boldsymbol{\theta})\) can be described by a multi-dimensional paraboloid centered at \(\boldsymbol{\theta}_\textrm{ML}\), and it thus leads to symmetric intervals. In the Bayesian literature, an analogous approach is used to extend MAP estimation in order to obtain a multi-dimensional normal approximation of the posterior, which is often referred to as the Laplace approximation [110]. An important advantage of this approximation, which will be used in Chapter 6 to construct an inference-aware machine learning loss function, is that it can be interpreted both in the context of classical and Bayesian inference.
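The sketch below computes the information matrix via a numerical Hessian of \(-\ln L\) evaluated at the maximum likelihood estimate for a toy Gaussian model, and derives symmetric intervals from the diagonal of its inverse; using the observed Hessian at \(\boldsymbol{\theta}_\textrm{ML}\) in place of the expectation in Equation 3.47, as well as the toy model itself, are simplifying assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(13)
data = rng.normal(loc=2.0, scale=1.5, size=1000)   # toy observations

def nll(params):
    mu, sigma = params
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

def numerical_hessian(f, x, eps=1e-3):
    """Hessian of f at x via central finite differences."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    hess = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            hess[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                          - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4.0 * eps**2)
    return hess

theta_ml = np.array([data.mean(), data.std()])   # maximum likelihood estimates (mu, sigma)
info = numerical_hessian(nll, theta_ml)          # Hessian of -ln L at the MLE
covariance = np.linalg.inv(info)                 # approximate covariance, cf. Equation 3.48
sigmas = np.sqrt(np.diag(covariance))
print(theta_ml, sigmas)                          # e.g. interval [mu_ML - sigma_0, mu_ML + sigma_0]
```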


  1. Here a statistic is a function of the observations, and a sample summary statistic refers to a statistic that summarises a set of observations, \(\boldsymbol{s}(D) : \mathcal{X}_D \subseteq \mathbb{R}^{d\times n} \longrightarrow \mathcal{Y}_D \subseteq \mathbb{R}^{b\times n}\).