[Validation] Double b-tagging performance

*<div style="text-align: center;" markdown="1">`ROC` `efficiency` `error propagation` `Monte Carlo` `statistics`</div>*

## Introduction

Raw LHC data consist of purely digital information from each part of the detector. For physics analysis, this information has to be reconstructed into particle-level quantities, e.g. vertices, tracks, energy and charge, by vertexing, tracking and clustering algorithms. Some searches require a new algorithm to reconstruct a target particle from new physics that has not yet been observed in data. In that case the data cannot provide a pure source, the so-called **control sample**, with which to validate the algorithm. A ***Monte Carlo*** (MC) based analysis is therefore used to test it. The MC contains the particle truth information and simulates both the data-taking procedure and the particle reconstruction.

The search described in [\[Particle search\] $\text{b}’$ quarks decaying to b quark and Higgs boson in LHC data](https://hackmd.io/s/S1mZHTeNb) requires a dedicated reconstruction algorithm. For $\text{b}$-tagging of a boosted Higgs jet, i.e.
a single boosted fat jet containing both decay products ($\text{H}\to\text{b}\bar{\text{b}}$), the distance between the two b quarks from the Higgs depends on the Higgs mass and transverse momentum ($p_{\text{T}}$), as shown in the figure:

<div style="text-align: center;" markdown="1"><img src="https://i.imgur.com/c2mvQip.png" height="300" ></div>
<br>

The distance $\Delta R$ between $\text{b}$ and $\bar{\text{b}}$ is defined as

$$
\Delta R = \sqrt{(\eta_{\text{b}}-\eta_{\bar{\text{b}}})^2+(\phi_{\text{b}}-\phi_{\bar{\text{b}}})^2}\ ,
$$

where we use a right-handed coordinate system and describe a particle's direction by the *azimuthal angle* ($\phi$) in the $x$-$y$ plane and the *pseudorapidity* ($\eta$) along the $z$ direction, a relativistic measure of the polar angle $\theta$ given by $\eta=-\ln[\tan(\theta/2)]$.

In the boosted Higgs $p_{\text{T}}$ regime, the standard tagging (classification) algorithm described in [\[Validation\] b-tagging commissioning](https://hackmd.io/s/rkavdgSEG) no longer performs adequately. We therefore set out to develop a new technique to identify fat jets formed by two merged $\text{b}$ jets.

The alternative algorithm uses the *"inclusive vertex finder"* (IVF) to record all vertices from the particle tracks in the data before jet clustering. This differs from the standard $\text{b}$-tagging algorithms, which run the jet clustering first; we reverse the order so that no candidate tracks or vertices are lost. By comparing the vertices with the jet through their distance $\Delta R(\text{vertex},\,\text{jet})$, the algorithm provides a discriminator that classifies the jet as containing two $\text{b}$ jets or not. Since there is no pure source of Higgs particles in data, we use MC to validate the algorithm.

## Techniques

The MC sample contains the particle truth, which helps to validate the performance of the algorithm.
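
The matching and selection steps in this section all rely on the $\Delta R$ distance defined in the Introduction. The following is a minimal Python sketch of that definition, not the analysis code: the function names are illustrative, and wrapping $\Delta\phi$ into $[-\pi,\,\pi)$ is an added assumption that the formula leaves implicit (without it, two directions on either side of $\phi=\pm\pi$ would appear far apart).

```python
import math


def pseudorapidity(theta):
    """eta = -ln[tan(theta/2)] for a polar angle theta in (0, pi)."""
    return -math.log(math.tan(theta / 2.0))


def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into [-pi, pi) (assumed convention)."""
    return (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi


def delta_r(eta1, phi1, eta2, phi2):
    """Delta R = sqrt(d_eta^2 + d_phi^2) between two directions."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))


def is_matched(eta_h, phi_h, eta_j, phi_j, max_dr=0.5):
    """True if a truth Higgs and a fat jet are closer than max_dr.

    The default 0.5 is the Delta R(H, j) requirement quoted in the text;
    the function itself is a hypothetical helper.
    """
    return delta_r(eta_h, phi_h, eta_j, phi_j) < max_dr


# Hypothetical directions, for illustration only.
print(delta_r(0.1, 3.0, -0.2, -3.0))
print(is_matched(0.5, 1.0, 0.6, 1.2))
```

The wrap matters only when the two objects straddle $\phi=\pm\pi$; elsewhere the result is identical to the plain difference in the formula.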
The MC samples used for the IVF-based double $\text{b}$ tagger are $\text{b}'\to\text{H}\text{b},\, \text{H}\to\text{b}\bar{\text{b}}$, generated with different $\text{b}'$ masses. $\text{b}'$ is a hypothetical massive particle that can decay to high-$p_{\text{T}}$ products. To validate the algorithm, we match the true Higgs to the fat jet by their distance $\Delta R(\text{H},\,j)$. The efficiency can then be obtained as a function of the double b-tagger discriminator. We also care about the mistagging rate in data, i.e. how often the algorithm wrongly identifies a jet as a boosted Higgs. The background MC is used to estimate this rate.

### 1. Signal

From the MC data containing a boosted $\text{H}$ decaying to a close $\text{b}\bar{\text{b}}$ pair, events are selected to have

- Separation between the $\text{b}$ and $\bar{\text{b}}$ quarks: $\Delta R(\text{b},\,\bar{\text{b}})>0.8$
- High transverse momentum of the fat jet: $p_{\text{T}}(j)>300\ \text{GeV}$
- Higgs mass constraint on the fat jet: $75<\text{mass}(j)<175\ \text{GeV}$

Based on the above selection, we calculate the matching efficiency, $\epsilon_{m}$, between the Higgs and the fat jet by requiring the distance $\Delta R(\text{H},\,j)<0.5$. Since the double b-tagging can still fail in a matched fat jet, the tagging efficiency is also considered and denoted $\epsilon_{b}$. The signal b-tagging efficiency is thus defined as

$$
\epsilon=\epsilon_{m}\epsilon_{b}\ .
$$

It is shown below as a function of the $\text{H}$ $p_{\text{T}}$ for different $\text{b}'$ masses:

<div style="text-align: center;" markdown="1"><img src="https://i.imgur.com/w9OatRt.png" height="300" ></div>
<br>

This shows that the algorithm has a low efficiency, about 20% over the whole range. The statistical uncertainties are calculated from the binomial distribution and are quoted as asymmetric errors.

### 2. Background

In real LHC data, $\text{pp}$ collisions produce a large number of random jets, which makes the analysis difficult.
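
Both the signal efficiency $\epsilon=\epsilon_{m}\epsilon_{b}$ above and the rate at which such background jets are tagged are ratios of counts with binomial, asymmetric uncertainties. The sketch below illustrates one common way to compute them, assuming an exact (Clopper-Pearson) interval at 68.27% confidence; the text does not state which interval was actually used. The function names and event counts are hypothetical. It also assumes that tagged jets are a subset of matched jets, so that $\epsilon$ reduces to a single binomial ratio.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with 0 < p < 1 (log-space terms)."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return total


def _bisect_decreasing(f, iters=60):
    """Root of a monotonically decreasing function f on (0, 1)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def clopper_pearson(k, n, cl=0.6827):
    """Exact binomial interval (lower, upper) for k passes out of n."""
    half_alpha = 0.5 * (1.0 - cl)
    # Lower bound: P(X >= k | p) = alpha/2
    lower = 0.0 if k == 0 else _bisect_decreasing(
        lambda p: half_alpha - (1.0 - binom_cdf(k - 1, n, p)))
    # Upper bound: P(X <= k | p) = alpha/2
    upper = 1.0 if k == n else _bisect_decreasing(
        lambda p: binom_cdf(k, n, p) - half_alpha)
    return lower, upper


def signal_efficiency(n_selected, n_matched, n_tagged, cl=0.6827):
    """Return (eps, err_down, err_up) with eps = eps_m * eps_b."""
    eps_m = n_matched / n_selected   # matching efficiency
    eps_b = n_tagged / n_matched     # tagging efficiency of matched jets
    eps = eps_m * eps_b              # equals n_tagged / n_selected
    lower, upper = clopper_pearson(n_tagged, n_selected, cl)
    return eps, eps - lower, upper - eps


# Hypothetical counts, chosen only to give an efficiency of 20%.
eps, err_down, err_up = signal_efficiency(500, 400, 100)
print(f"eps = {eps:.3f} -{err_down:.3f} +{err_up:.3f}")
```

The same `clopper_pearson` call applies to a single unweighted background sample; combining several weighted samples needs the additional error propagation described for the background.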
Because of this jet background, a good algorithm must have a low mistagging rate. To obtain the mistagging rate, the background MC is used. We selected events containing a fat jet with the same $p_{\text{T}}$ and mass requirements as for the signal. The mistagging rate of the double b-tagger as a function of the fat-jet $p_{\text{T}}$ is shown in the following figure:

<div style="text-align: center;" markdown="1"><img src="https://i.imgur.com/fEfmjcP.png" height="300" ></div>

The overall mistagging rate is about 2%. The statistical uncertainties are calculated from the binomial distribution, with error propagation over the different background samples and their weights; see the [details](https://i.imgur.com/PDkvVxJ.png).

## Results

With a signal efficiency of about 20% against a mistagging rate of about 2%, the performance looks adequate for the analysis. Since the algorithm outputs a discriminator for the tagging, the final results are presented as a **receiver operating characteristic (ROC)** plot. The data analysis can use it to choose the discriminator value applied to the $\text{H}$ jet.

<div style="text-align: center;" markdown="1"><img src="https://i.imgur.com/RHUczwY.png" height="300" ></div>

## References

- Appendix of CERN-CMS Physics NOTE: https://www.dropbox.com/s/0d8wxarxjfze540/AN2013_185_v4.pdf?dl=0
- Poster in CMS week: https://www.dropbox.com/s/63cmj68rwag7hvn/CMSWEEK2013_PosterSession2bTaggerShadow.pdf?dl=0