Statistical Learning and Inference at Particle Collider Experiments

5.4 Trigger and Datasets

The experimental data considered in this analysis were collected by the CMS detector in 2016 from proton-proton collisions at a centre-of-mass energy of \(\sqrt{s} = 13\ \textrm{TeV}\). The certified set of datasets used in this analysis corresponds to a total integrated luminosity of \(35.9\ \textrm{fb}^{-1}\) at the CMS interaction point. This set comprises only the data-taking periods in which the relevant detector systems were operating normally and no problematic anomalies were found during data quality monitoring (DQM).

Because the rates of the main background processes of this analysis, namely QCD multi-jet production, are expected to be much higher than those of the signal, an efficient online trigger selection is essential for maximising the sensitivity of the analysis. While the set of standard CMS trigger paths includes some that select events with several high-energy jets, a more practical strategy is to include b-tagging requirements within the high-level trigger (HLT) sequence. Hence, this analysis re-uses the multi-jet trigger paths that were developed for the search for the resonant process \(\textrm{pp} \rightarrow \textrm{X} \rightarrow \textrm{HH} \rightarrow \textrm{b}\bar{\textrm{b}}\textrm{b}\bar{\textrm{b}}\) [175], where \(\textrm{X}\) is a heavy mediating particle. Both paths require at least three jets to be b-tagged by the online version of the Combined Secondary Vertex (CSV) algorithm [85].

The full specification of the trigger selection is rather complex; however, it may be represented by the logical OR of the following two HLT paths, which were in place during the CMS 2016 data-taking period:

HLT_DoubleJet90_Double30_TripleBTagCSV_p087
HLT_QuadJet45_TripleBTagCSV_p087

Each path represents a particular online selection sequence at the HLT. Each sequence is preceded by a given set of L1 trigger seeds, as conceptually reviewed in Section 2.2.7. The L1 seeds differ between the two HLT paths, but in both cases they are based on the logical OR of several conditions that require either a certain number of L1 jets above a given energy or the total energy deposited in the calorimeter, \(H_T\), to be above a certain threshold. At the HLT, both paths impose quality criteria on the reconstructed primary vertex and require at least four reconstructed jets within the pseudo-rapidity range \(|\eta| < 2.6\). The first path additionally requires two of the reconstructed jets to have \(p_T>90\ \textrm{GeV}\) and two other jets to have \(p_T>30\ \textrm{GeV}\). The second path instead requires at least four reconstructed jets with \(p_T>45\ \textrm{GeV}\). As mentioned, both paths include a b-tagging requirement, namely that the online CSV discriminator exceeds 0.87, which is defined as the “medium working point” of the algorithm, for at least three of the eight most energetic reconstructed jets in the event.
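The HLT-level logic described above can be illustrated with a minimal Python sketch. It is a simplified emulation and not the CMS trigger software: it ignores the L1 seeds and the primary-vertex quality criteria, it takes "most energetic" to mean ordered by \(p_T\), and it assumes the b-tagging requirement applies to jets within the \(|\eta|\) acceptance. The function name `passes_hlt_or` and its arguments are hypothetical.

```python
import numpy as np

def passes_hlt_or(pt, eta, csv, eta_max=2.6, csv_cut=0.87):
    """Simplified emulation of the OR of the two HLT paths for one event.

    pt, eta, csv: 1D sequences with one entry per reconstructed jet
    (pt in GeV, csv = online CSV discriminator value).
    L1 seeds and primary-vertex criteria are NOT emulated (assumption).
    """
    pt, eta, csv = (np.asarray(a, dtype=float) for a in (pt, eta, csv))

    # Keep only jets inside the pseudo-rapidity acceptance.
    in_acceptance = np.abs(eta) < eta_max
    pt, csv = pt[in_acceptance], csv[in_acceptance]
    if pt.size < 4:
        return False

    # Order jets by decreasing pT (assumed meaning of "most energetic").
    order = np.argsort(pt)[::-1]
    pt, csv = pt[order], csv[order]

    # b-tagging: at least three of the (up to) eight leading jets pass the cut.
    btag_ok = np.count_nonzero(csv[:8] > csv_cut) >= 3

    # DoubleJet90_Double30: two jets above 90 GeV plus two others above 30 GeV.
    double90_double30 = (np.count_nonzero(pt > 90.0) >= 2
                         and np.count_nonzero(pt > 30.0) >= 4)

    # QuadJet45: at least four jets above 45 GeV.
    quad45 = np.count_nonzero(pt > 45.0) >= 4

    return bool(btag_ok and (double90_double30 or quad45))
```

For instance, an event with jet \(p_T\) values of 80, 80, 40 and 40 GeV fails both kinematic requirements in this sketch even if all four jets are b-tagged, whereas four b-tagged jets of 50 GeV each satisfy the second path.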

Samples of simulated observations from Higgs pair production are generated using MadGraph5_aMC@NLO [176] at leading order, following the relevant prescriptions, including the event-by-event loop factor detailed in [177]. A total of 300,000 events were simulated for the SM production component; samples were also generated for an older version of the clustering benchmarks discussed in Section 5.2 and for the \(\kappa_\lambda=0\) box model. Regarding the parton distribution function used for generation, the NNPDF30_LO_AS_0130_NF_4 set [178] was used for all samples.

The datasets for the benchmark points listed in Table 5.1, or for any other EFT point, can be generated from the previous samples by means of generator re-weighting. As described in Section 3.1.2.3, the latent variables of the simulator can be used to model a different point of the parameter space of the underlying theory by computing observables after assigning to each event a weight proportional to the ratio between probability density functions. In this case, the effect of varying the EFT parameters in Equation 5.2 can be fully characterised at leading order by two parton-level variables: the Higgs pair invariant mass \(m_\textrm{HH}\) and \(\lvert \cos \theta^{*} \rvert\), where \(\theta^{*}\) is the polar angle of either Higgs boson with respect to the beam axis. Once these two variables are specified, the rest of the simulation does not depend on the EFT parameters. A set of simulated HH production events generated for a given vector of EFT parameters \(\boldsymbol{\theta}_\textrm{EFT}=(\kappa_\lambda, \kappa_\textrm{t}, c_2, c_\textrm{g}, c_\textrm{2g})\), when re-weighted by \[ w \left ( m_\textrm{HH}, \lvert \cos \theta^{*} \rvert \right ) = \frac{p \left ( m_\textrm{HH}, \lvert \cos \theta^{*} \rvert \ \mid \ {\boldsymbol{\theta}'}_\textrm{EFT} \right )}{ p(m_\textrm{HH}, \lvert \cos \theta^{*} \rvert \ \mid \ {\boldsymbol{\theta}}_\textrm{EFT})} \qquad(5.3)\] can be used to model events generated at the EFT point \({\boldsymbol{\theta}'}_\textrm{EFT}\), as long as the denominator does not vanish in any region where the numerator is non-zero. The same idea extends to an arbitrary source distribution \(p \left ( m_\textrm{HH}, \lvert \cos \theta^{*} \rvert \right )\); e.g. a large sample uniformly distributed in this 2D space could be re-weighted to model any EFT parameter point.
While the density ratio in Equation 5.3 can also be computed exactly as the ratio between the matrix elements [179], a non-parametric density estimation approach was adopted in this analysis.
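
The reason a weight depending only on \(m_\textrm{HH}\) and \(\lvert \cos \theta^{*} \rvert\) suffices can be sketched as follows. The shorthand \(\boldsymbol{z} = (m_\textrm{HH}, \lvert \cos \theta^{*} \rvert)\), the symbol \(\boldsymbol{x}\) for the detector-level observation and the generic function \(f\) are notation introduced only for this sketch. Assuming, as stated above, that the rest of the simulation does not depend on the EFT parameters once \(\boldsymbol{z}\) is fixed, the density of observations factorises as \[ p(\boldsymbol{x} \mid \boldsymbol{\theta}_\textrm{EFT}) = \int p(\boldsymbol{x} \mid \boldsymbol{z}) \, p(\boldsymbol{z} \mid \boldsymbol{\theta}_\textrm{EFT}) \, d\boldsymbol{z} \] so that the expectation of any observable \(f(\boldsymbol{x})\) at the point \({\boldsymbol{\theta}'}_\textrm{EFT}\) can be written as a weighted expectation at \(\boldsymbol{\theta}_\textrm{EFT}\): \[ \mathbb{E}_{{\boldsymbol{\theta}'}_\textrm{EFT}} \left [ f(\boldsymbol{x}) \right ] = \iint f(\boldsymbol{x}) \, p(\boldsymbol{x} \mid \boldsymbol{z}) \, \frac{p(\boldsymbol{z} \mid {\boldsymbol{\theta}'}_\textrm{EFT})}{p(\boldsymbol{z} \mid \boldsymbol{\theta}_\textrm{EFT})} \, p(\boldsymbol{z} \mid \boldsymbol{\theta}_\textrm{EFT}) \, d\boldsymbol{z} \, d\boldsymbol{x} = \mathbb{E}_{\boldsymbol{\theta}_\textrm{EFT}} \left [ w(\boldsymbol{z}) \, f(\boldsymbol{x}) \right ] \] where \(w(\boldsymbol{z})\) is the weight of Equation 5.3. The identity holds provided \(p(\boldsymbol{z} \mid \boldsymbol{\theta}_\textrm{EFT}) > 0\) wherever \(p(\boldsymbol{z} \mid {\boldsymbol{\theta}'}_\textrm{EFT}) > 0\).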

A large sample of HH production events was formed by concatenating all non-resonant Higgs pair events simulated in each of the 14 samples, creating what will be referred to as the pangea sample. For each EFT point of interest, 50,000 events (300,000 for the SM production) were generated at parton level, which is computationally inexpensive. The per-event weight in Equation 5.3 is estimated as the ratio of two 2D histograms, which approximates the density ratio. With this procedure the weighted pangea sample can represent any EFT parameter point at leading order, so it is used to model the signal characteristics of all the models considered in this work.
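
A histogram-ratio estimate of the weight in Equation 5.3 can be sketched in Python with NumPy as below. This is a toy illustration under stated assumptions rather than the code used in the analysis: the function name `histogram_ratio_weights`, the binning and the toy source and target distributions are all hypothetical, and the toy distributions do not correspond to any physical EFT point.

```python
import numpy as np

def histogram_ratio_weights(src, tgt, bins, ranges):
    """Per-event weights that make the sample `src` mimic the sample `tgt`.

    src, tgt: arrays of shape (n_events, 2), columns (m_HH, |cos theta*|).
    The weight of a source event is the ratio of the normalised 2D
    histograms (target over source) in the bin the event falls into.
    """
    src, tgt = np.asarray(src, dtype=float), np.asarray(tgt, dtype=float)
    h_src, m_edges, c_edges = np.histogram2d(
        src[:, 0], src[:, 1], bins=bins, range=ranges, density=True)
    h_tgt, _, _ = np.histogram2d(
        tgt[:, 0], tgt[:, 1], bins=[m_edges, c_edges], density=True)

    # Ratio per bin; bins with no source events get zero weight.
    ratio = np.divide(h_tgt, h_src, out=np.zeros_like(h_tgt), where=h_src > 0)

    # Bin indices of each source event (clipped so indexing is always valid).
    i = np.clip(np.searchsorted(m_edges, src[:, 0], side="right") - 1,
                0, len(m_edges) - 2)
    j = np.clip(np.searchsorted(c_edges, src[:, 1], side="right") - 1,
                0, len(c_edges) - 2)

    # Events outside the histogram range are given zero weight.
    inside = ((src[:, 0] >= m_edges[0]) & (src[:, 0] <= m_edges[-1])
              & (src[:, 1] >= c_edges[0]) & (src[:, 1] <= c_edges[-1]))
    return np.where(inside, ratio[i, j], 0.0)

# Toy demonstration: a source sample uniform in the 2D plane is re-weighted
# towards an arbitrary (non-physical) target distribution.
rng = np.random.default_rng(0)
n = 100_000
source = np.column_stack([rng.uniform(250.0, 1000.0, n),
                          rng.uniform(0.0, 1.0, n)])
target = np.column_stack([250.0 + 750.0 * rng.beta(2, 5, n),
                          rng.beta(1, 2, n)])
weights = histogram_ratio_weights(source, target, bins=(30, 10),
                                  ranges=[(250.0, 1000.0), (0.0, 1.0)])
weighted_mean_mhh = np.average(source[:, 0], weights=weights)
```

In this toy setup the weighted mean of the first column of `source` should approach the mean of the same column in `target`. The bin widths trade statistical fluctuations in sparsely populated bins against the resolution with which the density ratio is approximated.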