Statistical Learning and Inference at Particle Collider Experiments

5.3 Analysis Strategy

The goal of this analysis is to carry out statistical inference on the occurrence of \(\textrm{pp} \rightarrow \textrm{HH}\rightarrow \textrm{b}\bar{\textrm{b}}\textrm{b}\bar{\textrm{b}}\), as predicted by the SM or by BSM effective field theory extensions, based on experimental data acquired by the CMS detector in 2016. The type of statistical inference applicable to this search is hypothesis testing, as introduced in Section 3.2.2. In principle, we would like to test whether the null hypothesis \(H_0\), corresponding to the SM without HH production, can be rejected. Several alternative hypotheses \(H_1\) are considered, all based on the SM including HH production, either through SM production processes or through EFT extensions. However, we do not expect to reject \(H_0\), so the objective is to set upper limits on the signal cross section for a given model including Higgs pair production. Thus, we would like to adopt an analysis strategy that maximises the sensitivity to the presence of HH production, which in statistical terms amounts to minimising the Type II error rate for a fixed Type I error rate. The Type II error rate in turn depends on the alternative hypothesis \(H_1\) considered, which for the optimisation of the analysis strategy is taken to be the SM including HH production through SM processes at an enhanced rate.
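The trade-off between the two error rates can be illustrated with a toy single-bin counting experiment. The sketch below is not part of the analysis: the yields `b` and `s`, the helper names and the choice of a simple Poisson counting test are all assumptions made for illustration. It fixes the Type I error rate at \(\alpha\) under the background-only hypothesis and evaluates the resulting Type II error rate under a signal-plus-background hypothesis.

```python
import math


def poisson_sf(k, mean):
    """Return P(N >= k) for N ~ Poisson(mean), with mean > 0."""
    cdf = sum(
        math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))
        for n in range(k)
    )
    return 1.0 - cdf


def critical_count(b, alpha):
    """Smallest count k such that P(N >= k | background only) <= alpha."""
    k = 0
    while poisson_sf(k, b) > alpha:
        k += 1
    return k


def type2_error(b, s, alpha):
    """Probability of failing to reject H0 when the true mean is b + s."""
    k = critical_count(b, alpha)
    return 1.0 - poisson_sf(k, b + s)


if __name__ == "__main__":
    # Hypothetical yields: same signal, loose versus tight selection.
    for b in (100.0, 10.0):
        beta = type2_error(b, s=10.0, alpha=0.05)
        print(f"b = {b:5.1f}, s = 10.0 -> Type II error rate = {beta:.3f}")
```

In this toy setting, a selection that keeps the same expected signal with less background gives a lower Type II error rate at the same \(\alpha\), which is the sense in which the analysis strategy is optimised.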

The event selection in this analysis will include some custom online requirements, which were set at trigger level to reduce the total rate of data collection while keeping a large fraction of events relevant for this analysis, as well as an offline selection to reduce the contribution of background processes that are not well modelled, in order to simplify the construction of powerful summary statistics. The online trigger requirements as well as the characteristics of the datasets considered in this analysis are described in Section 5.4, while the adopted event selection is described in detail in Section 5.5.

After a basic event selection, which mainly consists of requiring four or more b-tagged jets⁹, four of the reconstructed jets in each event are paired to construct two di-jet candidates, in an attempt to recover the kinematic properties of the Higgs bosons, including their reconstructed masses. The information from the two di-jet candidates can in turn be combined to compute variables that approximate the features of the Higgs pair system, which are also useful for inference. A set of variables from the selected jets, the H candidates and the HH system is combined into a single discriminating variable obtained by training a probabilistic classification model, specifically a machine learning model based on boosted decision trees (see Section 4.2.1), to separate signal from background, in a manner analogous to that described in Section 4.3.1.
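As a rough illustration of the pairing step, the sketch below enumerates the three ways of splitting four jets into two di-jet candidates and keeps the pairing with the smallest difference between the two candidate masses. This criterion is an assumption chosen for simplicity; the pairing actually adopted is the one described in Section 5.5. Jets are represented as hypothetical `(E, px, py, pz)` tuples in GeV, and all function names are illustrative.

```python
import math

# The three distinct ways of splitting four jets into two pairs.
PAIRINGS = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]


def add(p, q):
    """Component-wise sum of two four-momenta (E, px, py, pz)."""
    return tuple(a + b for a, b in zip(p, q))


def mass(p):
    """Invariant mass of a four-momentum; negative m^2 is clipped to zero."""
    e, px, py, pz = p
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


def pair_jets(jets):
    """Return (pairing, h1, h2) minimising |m(h1) - m(h2)| (assumed criterion)."""
    best = None
    for (a, b), (c, d) in PAIRINGS:
        h1 = add(jets[a], jets[b])
        h2 = add(jets[c], jets[d])
        diff = abs(mass(h1) - mass(h2))
        if best is None or diff < best[0]:
            best = (diff, ((a, b), (c, d)), h1, h2)
    _, pairing, h1, h2 = best
    return pairing, h1, h2


if __name__ == "__main__":
    # Toy event: jets 0 and 2, and jets 1 and 3, each form a 125 GeV candidate.
    jets = [
        (125.0, 125.0, 0.0, 0.0),
        (62.5, 0.0, 62.5, 0.0),
        (31.25, -31.25, 0.0, 0.0),
        (62.5, 0.0, -62.5, 0.0),
    ]
    pairing, h1, h2 = pair_jets(jets)
    print("pairing:", pairing)
    print("m(H1), m(H2):", mass(h1), mass(h2))
    print("m(HH):", mass(add(h1, h2)))
```

The two candidate four-momenta and their sum provide the kind of H and HH system variables (masses, momenta) that can then be fed to the classifier.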

The statistical inference in this analysis is based on constructing a binned likelihood from the expected distribution of the classifier output for events originating from signal and background processes. This likelihood, which also accounts for the effect of nuisance parameters as discussed in Section 3.1.3.4, is used to extract information about the parameter of interest (i.e. the HH production cross section times the branching ratio) from the observed data. While both the SM and the various BSM signal models can be modelled using simulated observations, the main background of the analysis, multi-jet QCD production, is hard to model by simulation. Thus, a data-driven background estimation method, described in detail in Section 5.6, is used both for training the probabilistic classifier and for modelling the background contribution in the binned likelihood.
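A minimal sketch of such a binned likelihood is given below, under strong simplifying assumptions that are not taken from the analysis: a single signal strength `mu` scaling the signal template, a single nuisance parameter `theta` acting as a Gaussian-constrained background normalisation with relative uncertainty `sigma_b`, and a coarse grid scan in place of a proper minimiser. The templates and all names are hypothetical.

```python
import math


def nll(mu, theta, observed, signal, background, sigma_b=0.1):
    """Negative log-likelihood: Poisson term per bin plus a Gaussian constraint."""
    total = 0.0
    for n, s, b in zip(observed, signal, background):
        expected = mu * s + (1.0 + sigma_b * theta) * b
        total -= n * math.log(expected) - expected - math.lgamma(n + 1)
    # Unit Gaussian constraint on the nuisance parameter (constant dropped).
    total += 0.5 * theta ** 2
    return total


def profile_nll(mu, observed, signal, background, sigma_b=0.1):
    """Minimise the NLL over theta on a fixed grid in [-3, 3]."""
    thetas = [i * 0.05 for i in range(-60, 61)]
    return min(
        nll(mu, t, observed, signal, background, sigma_b) for t in thetas
    )


def scan_mu(observed, signal, background, mu_values, sigma_b=0.1):
    """Return the mu value on the grid with the lowest profiled NLL."""
    return min(
        mu_values,
        key=lambda mu: profile_nll(mu, observed, signal, background, sigma_b),
    )


if __name__ == "__main__":
    # Hypothetical templates for three bins of the classifier output.
    background = [100.0, 50.0, 10.0]
    signal = [1.0, 5.0, 8.0]
    observed = [101, 55, 18]  # equals background + 1.0 * signal
    mu_grid = [0.0, 0.5, 1.0, 1.5, 2.0]
    print("best-fit mu on grid:", scan_mu(observed, signal, background, mu_grid))
```

In practice the likelihood contains many nuisance parameters and is minimised numerically; the grid scan here only serves to show how the parameter of interest and a nuisance parameter enter the per-bin expectation.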

After including the effect of the relevant sources of systematic uncertainty, which are listed in Section 5.7, upper limits are obtained on the \(\textrm{pp} \rightarrow \textrm{HH}\rightarrow \textrm{b}\bar{\textrm{b}}\textrm{b}\bar{\textrm{b}}\) cross section for each of the benchmarks listed in Table 5.1, as well as for the SM HH production process. The results, which are presented in Section 5.8, include the upper limit on this cross section as a function of the Higgs self-coupling parameter \(\kappa_\lambda\) when \(\kappa_\textrm{t}=1\) and the other EFT parameters are set to zero. While the analysis could be repeated for any arbitrary EFT point by recomputing the limits for that particular model, the benchmarks have been constructed to represent the main differences in the differential cross section over a large part of the EFT parameter space. Approximate limits can therefore be obtained by taking the limit for the closest benchmark, as determined by the distance measure from [174].

⁹ Events with a different b-tagged jet definition will also be used to define a data control region, as discussed in Section 5.6.2.