TP4b Spring 2022 - logbook

### Some precisions FoM method bkg MC |$N^{ref}_{MC}$ | $N^{sel}_{MC}$ --------|----------------|-------------------- bkg data|$N^{ref}_{data}$ | $N^{sel}_{data}$ Assuming a BDT2 cut efficiency similar for data and MC, we have the proportionality relation $N^{sel}_{data} = N^{sel}_{MC} \cdot \frac{N^{ref}_{data}}{N^{ref}_{MC}}$ # TP4b Spring 2022 - logbook Report suggested plan: 1. **Introduction**: rapide intro au SM et LFU, mention du signal decay comme contexte à ce tp, et délimitation du projet (en gros c'est un résumé de tout le projet) 2. **Motivation**: inclut tout ce que tu as dit sur le LFU, puis pourquoi on s'intéresse à B->JpsiKmumu à LHCb. Aussi que tu veux tester une certaine sélection. 3. **LHCb detector** 4. **Selection efficiency** 1. Boosted decision tree training on signal soft muon (mention signal sample, preselection and combinatorial background rejection) 2. Tag and probe method $J/\psi\rightarrow \mu^+\mu^-$ 3. Effect of BDT2 on $J/\psi\rightarrow \mu^+\mu^-$ dimuon mass peak 4. Efficiency map on P-ETA plane (data and MC) 5. Comparison with another selection 5. **Conclusion** ## Week 13 * [ ] Derniers sur le 2d histo efficiency P-ETA * [x] Check mon code pour l'utilisation du RDataFrame et du snapshot * [ ] Fits: augmenter range sigma 2 et donner 2 means différents pour les 2 gauss. Est-ce que cela améliore les fits? * [ ] Idem pour MC (sans fits). Path to be coming * [ ] Si temps: efficiency plot en fonction de P pour les data ## Week 12 : 16.05.2022 ## Week 11 : 09.05.2022 Goal of today: update of the BDT2 optimisation started some months ago and that needs to be finished. Work basis: `/panfs/bouchiba/TP4b_spring22/OPTIMISATION/loop_cutBDT2_opts.py` `python loop_cutBDT2_opts.py -F 1/2 -r all/range` Try to adapt the ranges of BDT2 cuts. ### BDT2 cut optimisation with figures of merit A figure of merit is a quantity that can probe the performance of a method in function of its set of options. In the case of a multivariate analysis, a figure of merit is used to determine the optimal signal selection versus background rejection. Two separate optimisations is done: one for combinatorial first, and then for soft muon rejection. Here is a list of possible figures of merit (among other possibilities): 1) $\frac{S}{\sqrt{B}}$. 2) $\frac{S}{\frac{5}{2}+\sqrt{B}}$ 3) $\frac{S}{\sqrt{S+B}}$ $S$ and $B$ are absolute numbers of events that passed a certain selection. * **$S$: absolute number fo signal events taht passed the selection** Let $S_{ref}$ be the reference number of signal events **without any selection** (so as created in the pp collision). In principle is can be evaluated using branching ratios evaluations: $S_{ref} = \mathcal{L}(2018)\cdot \sigma(pp\rightarrow B^+X, 13 TeV)\cdot\mathcal{B}(B^+\rightarrow K^+J/\psi)\cdot\mathcal{B}(J/\psi \rightarrow \mu^+\mu^-)\cdot \alpha^2$ * [ ] Do you understand this expression? Where can you find these numbers? Now, of course reconstructed events will have some loss due to the efficiencies of the several steps of our selection: reconstruction of the detector, stripping, preselection, combinatorial BDT, `isMuon` and then soft muon BDT. This overall efficiency (without BDT2) can be evaluated as: $\epsilon = \epsilon_{reco\times stripping}\cdot\epsilon_{preselection}\cdot\epsilon_{BDT1}$ The number of signal events that pass this whole selection is be given by $S = S_{ref}\cdot\epsilon\cdot\epsilon_{BDT2}$. * **$B$: background (that is aimed to be discarded for the current cut) absolute number of events.** In the case of a soft muon BDT study, the associated background will be the cases of misidentified muons. We have a simulation of $B^+\rightarrow K^+ J/\psi /pi^+ \pi^-$ available. Again let $B_{ref}$ be the reference number of background events **without any selection** (so as created in the pp collision). In a similar way as for $S_{ref}$, it can be evaluated by: $B_{ref} = \mathcal{L}(2018)\cdot \sigma(pp\rightarrow B^+X, 13 TeV)\cdot\mathcal{B}(B^+\rightarrow K^+J/\psi\pi^+\pi^-)\cdot\mathcal{B}(J/\psi \rightarrow \mu^+\mu^-)\cdot \mathcal{B}(\pi\rightarrow \mu \nu_{\mu})^2$ Also in a very similar way, the number of these background events that will remain after a complete selection applied of efficiency $\epsilon$ will be given by $B=B_{ref}\cdot \epsilon$, with $\epsilon$ defined as above (and that has a priori a different value that for the signal). * What do we do explicitely? We have some data of $B^+\rightarrow K^+ J/\psi \mu^+ \mu^-$. We have some prior models (MC simulations) of signal and misID. We can fit the invariant $K^+ J/\psi \mu^+ \mu^-$ mass using these models + exponential background. And then see how the numbers $S$ and $B_{misID}$ vary un function of several BDT2 cut applied (to both soft muons).  ## Week 10 : 02.05.2022 New BDT2 trained, more files added. Folder: `/panfs/bouchiba/TP4b_spring22/NTUPLES/JpsiMuMu_samples/BDT2added_220502` Fix a BDT2 cut at, say, 0. Redo the plot $\epsilon_{BDT2}$ in a 2d histo in function of the momentum and the pseudorapidity of the probe muon. ## Week 9 - 25.04.22 Continuation of week 8 tasks. ## Week 8 - 11.04.22 ### Efficiency extraction We are going to plot something a bit different. The efficiency of our selection, applied on top of `isMuon`, is evaluated in the following way: $$\epsilon_{BDT2} = \frac{\mu_{probe}(isMuon+BDT2)}{\mu_{probe}(isMuon)}$$ Denominator: *In a given probe mu P area (tag muon can have any momentum)*, this yield is computed as a reference (not recomputed each time). This is the number of muons `probe` on which we apply only `isMuon` only in given P area, tag muon can be any, get number of events in the signal peak (background substracted) Numerator: the exact same sample, but on top on this we apply several BDT2 cuts. To do: * [ ] For low momenta region (3-6 GeV), plot this efficiency as a scan of BDT2. * [ ] plot $\epsilon_{BDT2}$ in function of the momentum of probe muon. * [ ] plot $\epsilon_{BDT2}$ in a 2d histo in function of the momentum and the pseudorapidity of the probe muon. Hints: * Use`TH2` to produce 2-dimensionnal histograms: https://root.cern.ch/doc/master/classTH2.html * Pseurapidity of the probe muon can be found in the sample as `probe_Brunel_ETA`  ## Week 7 - 28.03.22 * [ ] Study the distribution in $J/\psi$ invariant mass in 3 intervals of `probe_Brunel_P` values. * [ ] For no cuts * [ ] For n cuts of `BDT2_mu3` * [ ] For n cuts of `BDT2_mu3` and `BDT2_mu4` ## Week 6 - 21.03.22 ### New BDT2 selection (updated) * [ ] Compute efficiencies for different cuts of BDT2. ## Week 5 - 21.03.22 ### Scan on selection efficiencies * [ ] Compute efficiencies for different cuts of BDT2. Remark: a bit of a "lighter" week since I was sick and talked with Bastien remotely - I helped him quite less. Meeting set this week. ## Week 4 - 14.03.22 ### Selection efficiency #### Muon calibration using $J/ \psi \rightarrow \mu^+ \mu^-$ An abundant source of muons is provided in our experiment by the $J/ \psi \rightarrow \mu^+ \mu^-$ inclusive decay - which means several possible production channels of $J/\psi$ are accepted, both prompt and from b decays. In particular, particles tracks originating from a b decay shows a very clean signature since most of the random backgroud tracks coming from the primary vertex are discarded by the VELO (at online level. Therefore they and are good candidates to measure the efficiciency of some dedicatated muon selection. [comment]:LHCb [comment]:####TISTOS [//]: # (This may be the most platform independent comment) #### Data samples technicalities - LPHE cluster location: `/panfs/bouchiba/TP4b_spring22/NTUPLES/JpsiMuMu_samples/MagnetDown` `/panfs/bouchiba/TP4b_spring22/NTUPLES/JpsiMuMu_samples/MagnetUp` - Year: 2016 - Luminosity:? - Tree structure: `Jpsinopt_MuPTuple/DecayTree` $\mu_{probe} = \mu^+$ `Jpsinopt_MuMTuple/DecayTree` $\mu_{probe} = \mu^-$ Branches prefixes, based on the tree `Jpsinopt_MuPTuple` - `probe`: "Probe" or "calibration"= here the first (positive) muon. Basically thre is no cuts applied on this muon. - `tag`: "Tag" muon, on which some selection is applied in order to evaluate the efficiency when compared to the probe. - `_Brunel`: All the variables (almost all...) come there twice, with and without the Brunel prefix. This because this data is from the "Turbo" stream (i.e. direct HLT2 output) but then it was reconstructed twice: once directly in the HLT2 and then again "offline". This allows to use this data to calibrate both the "Turbo" and "Stripping" analyses. "Brunel" means offline (stripping-like) reconstruction, so we should use the Brunel variables. - `_ANNPID`: code to get (some of) the variables which were used in the training of the original ProbNN classifier. ***Tasks*** The files (temporary) are located in `/panfs/bouchiba/TP4b_spring22/NTUPLES/JpsiMuMu_samples/MagnetDown/BDT2mu4_added_Muon4Aliased_BDT2mu3_added_MuonAliased*root` * [x] Plot invariant dimuon mass. How to get it from the variables you have? Hint: `Jpsi_M`. Tree `DecayTree` * [x] Plot it for different cuts in `BDT2_mu3` and `BDT2_mu4`. * [x] Plot the ratio of selected events over total number of events. #### Efficiency computation methods Last week you evaluated some selection (e.g. some BDT response) efficiency $\epsilon_{sel}$ on the signal sample computing $\epsilon_{sel} = \frac{n_{sel}}{N}$, where $n_{sel}$ is the number of events that passed the selection. This is possible to do on a signal sample, since in principle all the events are well identified as signal. Now if we want to do this evaluation on data, which is basically a pile-up of signal and a list of backgrounds we need a more solid method. One possible method is the fit couting. Basically, for the main method we're going to use, the efficiency of the selection is now defined as: $$ \epsilon_{sel} = \frac{n_{sig}}{n_{sig}+n_{bkg}} = \frac{N_{sel=true}}{N_{sel=true}+N_{sel=false}}$$ The $N$ are called *yields*, they are numbers of events extracted from fits. Another method can be used. If we write the total efficiency as the following: $$ \epsilon_{tot} = \frac{S}{S+B} \epsilon_{sel} + \frac{B}{S+B} \epsilon_{bkg}$$ where $S$ and $B$ are the number of $\mu_{probe}$ of the $J/\psi$ sample. This method can be used later as a cross-check. #### The RooFit package (Work in progress) * Tutorial: https://indico.cern.ch/event/72320/contributions/2082589/attachments/1037201/1478048/roofit-intro-roostats-v11a.pdf ***Tasks*** * [x] Example file located at `/panfs/bouchiba/TP4b_spring22/FITS/RooFitExample.py` * [x] Fit the dimuon mass profile: a gaussian + exponential background ## Week 3 - 07.03.22 ### **STUDENT TASKS** #### What is a classifier? Our muon selection uses some "machine learning" operation. A **decision tree** is a specific algorithm consisting of tree-structured true/false questions nodes, and aiming to classify an array of *events* into *labels*. In the present case, the labels are “signal” and “background”. It is specific to a given type of events and must be grown/trained on a subset of these events. The creation of the decay tree is called the *growning* (or *training*). It operates in the following way: an array of events is given in input of a node that asks an optimum binary question. The incoming array separated into two in function of the Boolean answer. Then each of the smaller arrays reaches a new node with another optimum question splitting them in two. The process continues until no question can be asked anymore, namely when only one event remains in the array or all parameters have the same value for all the events of the array. This is called a *leaf*. The optimum question is the best reducer of “mixing amount” between the input and the average of the output arrays, namely the one that increases at maximum the probability to attribute the correct label to each event. This “mixing amount” can be chosen without quality disparity between different objects, with the constant idea to give, symetrically for the labels $S$ and $B$, the probability to randomly assign a correct label to an instance of the array. In the present case, the *Gini paramter* has been used, given by $p\cdot(1-p)$. There are numerous ways of applying a classifer to some data sample. In ROOT there is a build-in library called **TMVA**. #### First look at TMVA selection TMVA reference: https://root.cern.ch/doc/master/classTMVA_1_1MethodBDT.html * [x] Explore content of TMVA scripts `/panfs/bouchiba/TP4b_spring22/SELECTION/`, in particular the TMVA file `BDT2.py` * [x] Have a look at rootfile`/panfs/bouchiba/TP4b_spring22/TMVA_files/ofile2_inclusive_JpsiKpipi_E69_BDT_211124.root`. Command to open TMVA gui: in a root shell: `TMVA::TMVAGui("myfile.root")` #### Description of the documents * **`./NTUPLES` folder:** Parent analysis samples with `BDT2mu4_added_Muon4Aliased_BDT2mu3_added_MuonAliased_BDT1added_` prefix. This means they contains `BDT1`, `BDT2_mu3` and `BDT2_mu4` leaves (each of them being a classifier output). * **Data** samples with magnet polarised down and up respectively: `preselectedDimuon_inclusiveJpsi_gangaAll_2018_Data_MD_211104.root` `preselectedDimuon_inclusiveJpsi_gangaAll_2018_Data_MU_211104.root` * **Signal MC simulation** samples with magnet polarised down and up respectively: `preselectedDimuon_inclusiveJpsi_gangaAll_2018_MC69_MD_211104.root` `preselectedDimuon_inclusiveJpsi_gangaAll_2018_MC69_MU_211104.root` * **MisID MC simulation** samples (already magnets polarisation cat. merged and truthmatched). `truthmatch_preselected_JKpipi_as_mm.root` * `JpsiMuMu_samples/MagnetDown` and `/MagnetUp`: $J/\psi \rightarrow \mu^+ \mu^-$ data samples * **`./SELECTION` folder** * TMVA output script `BDT2.py` * `loader` folder * Aliasers (technical) `MuonAliaser.py` and `Muon4Aliaser.py` * Truthmatching script (technical) `JpsiKpipi_truthmatch_v2.py` * **./`TMVA_files` folder** * TMVA Output files: `ofile2_inclusive_JpsiKpipi_E69_BDT_211124.root`, can be read with `TMVA::TMVAGui` #### Tasks * Reproduce BDT2 efficiency on signal MC samples (MU&MD) * [ ] Plot: `B_LOKI_MASS_JpsiConstr` with `BDT1>0` (quite loose), and adding different `BDT2_mu3` (and `BDT2_mu4`) cuts. What is the effect? The selection efficiency? * [ ] Reproduce top-left plot, slide n°11 of my SPS presentation. Some hacks in PyROOT: * To get the number $N$ of entries in a tree: `tree.GetEntries()` * Number of events that passed the selection $n_{sel}$: ```nsel = tree.Draw(var1+"var1>>some_histo","some condition", "goff")``` * Efficiency of `some condition`: $\epsilon_{sel} = \frac{n_{sel}}{N}$ * Plot this efficiency in function of `mu3_P`. Is it flat? Is it better for low/big P? * Plot this for `BDT2_mu3<i`, with $i \in [0,0.8]$ **TO DO (Sonia)**: * [x] Move `/afs/cern.ch/work/s/sbouchib/Multilepton/BATCH_CHAINS/BDTbranch.sh` * [x] Move BDT2 loader, xmlfile: `"/afs/cern.ch/work/s/sbouchib/Multilepton/BDT/BDT2/JpsiKpipi/211108/loader/weights/factory_BDTG.weights.xml"` * [x] Move TMVA BDT2 last rootfile * [x] Move aliasers * [x] Alias `probe_Brunel` muons to `muX` * [ ] Add BDT2 response with `branch adder` script * [ ] Reproduce for `tag` muons ## Week 2 - 28.02.22 **STUDENT TASKS**: 1) Files exploration Explore files in `/panfs/bouchiba/TP4b_spring22`, in `MagnetUp` and `MagnetDown` There are some plotter scripts (python) and some **rootfiles** containing ntuples. Let's get acquiatined to theses rootfiles exploration. * From a TBrowser (can be slow) ``` root -l blah.root // will laucnh root shell and open blah.root as _file0 TBrowser randomName ``` Select one of the rootfiles-> what's its structure? (tree, branches, entries...) Name of the tree? Construction -> *Muon Identification performance at LHCb with the 2010 data*, §3.Datasets What is an event? How many events by rootfile? * From root shell ```` root -l blah.root // will open blah.root as _file0 TTree *tree = new TTree() _file0->Get("Jpsinopt_MuPTuple/DecayTree") // you can get this structure by exploring the file with TBrowser // Draw leves content in histos tree->Draw("leafName") // draw tree->Draw("leafName","leafName>34") // draw with selection tree->Draw("leafName","leafName>34","same") // draw with selection, on the same histo tree.GetEntries() // number of entries tree.Print() // print list of branches // Other type object to merge several trees WITH THE SAME BRANCHES TChain *chain = new TChain() chain->Add("blah.root/Jpsinopt_MuPTuple/DecayTree") // Pretty much same functions that with a tree ```` 3) Get acquainted with LHCb physics, reconstruction, trigger... Ressources: * LHCb Starterkit:https://lhcb.github.io/starterkit-lessons/, with highlight on: * First analysis steps: * Physicst LHCb * LHCb data flow * TupleTools and branches (contains some tuple tools definitions that appears in the branches) * Second analysis steps: * TisTos DIY, for the definition * and all the ones _you_ would find interesting! * LHCb-INT-2011-048: mainly §3 and §4 (§4.1 more important, the rest for yur curiosity). Eq.(6) will be used in our case (most likely), stress on Figure 5 and how it is obtained. * Some of my slides: SPS presentation, and notice (writing ongoing, should be ready next week) 2) Open-read-write files * From a PyROOT script Explore `/panfs/bouchiba/TP4b_spring22/scripts/Plotter_example.py` -> How to open TFile? -> How to write a variable in a histogram? -> How to draw with a condition of one variable? Two? Formulas? -> How to draw a canvas? Challenge! - print, explore the tree, what are their names standing for etc etc. - draw with some selection `&& ||` - fancy histo (SetLineColor, SetLineWidth...) -> cf TH1 class ref (Google) - #### cp /home/vieites/blah.py . gedit blah-py -> utiliser gedit installé sur le cluster pour éditer ton fichier blah.py ## Cluster usage