This is a note on doing analysis with SPINE. To make samples, submit batch jobs, or do analysis on the SBND machines, see: [**SBND Coding Notes**](https://hackmd.io/xjAl8h0vQiCk2QOJFqPZgQ?view)
# SBND ML Analysis
[toc]
# Setup
## Log in to SLAC
1. Log in to your account at: https://s3df.slac.stanford.edu/pun/sys/dashboard/batch_connect/sys/slac-ood-jupyter/session_contexts/new
2. Launch Jupyter:

3. If you are going to run grid jobs / `spine_prod` jobs or use any `sbatch` commands, log in via:
`ssh castalyf@s3dflogin.slac.stanford.edu`
and then
`ssh neutrino`
Alternative way from a Mac terminal (interactive GPU session):
```
srun --partition ampere --account neutrino:icarus-ml --gpus a100:1 --mem-per-cpu=30G --pty /bin/bash
```
---
## Samples that we have
### mpvmpr test sample (~10k events)
* **Name**: multi particle vertex, multi particle rain.
* MPV: Simulate neutrino interactions (multiple particles originated from a common vertex)
* MPR: Simulate interactions produced by cosmic rays
* **Feature**: It’s our training sample that’s independent of any neutrino or cosmic generators. We use it to tune particle abundance, energies, etc. for training ML reco.
* 250k for training.
* 50k for testing.
* **Location**: `/sdf/data/neutrino/bearc/make_hdf5/mpvmpr/train1`
### bnb cosmics rockbox sample (~4.5k events)
* **Name**: bnb + cosmics + rockbox.
    * BNB = Booster Neutrino Beam, i.e. neutrino interactions in the TPC.
* Rockbox = neutrino interactions upstream of the detector, but the resulting products of the interaction still get into the TPC.
### In-time sample
* **Name**: In-time = a cosmic that crosses the detector at the same time as the BNB spill.
    * PMTs will pick up light from this cosmic. The PMTs have a ~10 us window from the start of the beam, and the BNB pulse lasts about 1.6 us. Most cosmics are out of time, so they fall outside this window.
* **Feature**: The sample is dominated by cosmic muons, though a muon can occasionally interact outside the detector and produce a shower, so it does not contain only muons. Because of the large number of muons, this sample has the highest fraction of minimally ionizing muons. A particle is minimally ionizing where the dE/dx curve is flat (high momentum, large residual range), so dE/dx is roughly constant in this region (such muons are a standard candle for the dQ/dx to dE/dx conversion).
### Nue sample
* **Feature**: The sample with electron neutrino interactions only.
## Check the git package and update:
`git checkout develop`
---
# Inferences
## Particle IDs
1. `pid=0`: Photons
2. `pid=1`: Electrons
3. `pid=2`: Muons
4. `pid=3`: Pions
5. `pid=4`: Protons
6. `pid=5`: Kaons
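For analysis scripts it can be handy to have this mapping as code. A minimal sketch (only the integer codes come from the list above; the dictionary and helper names are made up):
```
# Hypothetical helper mapping SPINE PID codes to particle names
PID_LABELS = {0: "photon", 1: "electron", 2: "muon", 3: "pion", 4: "proton", 5: "kaon"}

def pid_name(pid):
    """Return a readable label for a PID code, e.g. pid_name(2) -> 'muon'."""
    return PID_LABELS.get(pid, f"unknown ({pid})")
```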
## Energy attributes
To avoid confusion: three common types of energy related to the detection can be found in `MyAnalyzer_module.cc`:
| Attribute | Eng | IDEEnergy | SedE |
|--------------------------|-----------------------------|-----------------------------------|----------------------------------|
| **Data product** | `MCParticle` | `SimChannel` | `SimChannel` |
| **Source** | GEANT4 | LArSoft (wirecell 2d drift simulation) | GEANT4 |
| **Type** | Particle's total energy | Ionization electrons energy | Sim energy deposit ("Sed") |
| **Content** | Total energy | Ionization energy deposited | Energy deposited in 3D cubes at specific times |
| **Usage** | Particle interaction analysis | Track analysis in detector | Track analysis in detector |
| **Granularity** | Particle level | Voxel-like properties + energy in the channels | Voxel level within the detector |
* Each energy has corresponding `TrackID` and related parameters.
* If `IDEEnergy` or `SedEnergy` is less than `Eng`, the value is sensible (deposited energy should not exceed the particle's initial energy).
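A quick sanity-check sketch for that last point (the file name and exact column names are assumptions; adjust them to the actual analyzer output):
```
import pandas as pd

# Hypothetical dump of the analyzer output; column names follow the table above
df = pd.read_csv("myanalyzer_output.csv")

# Deposited energies should not exceed the particle's initial energy
bad = df[(df["IDEEnergy"] > df["Eng"]) | (df["SedE"] > df["Eng"])]
print(f"{len(bad)} entries with deposited energy above the initial energy")
```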
# dQ/dx analysis
## Particles in Matter
- Bethe-Bloch formula (how much energy a charged particle loses when traveling through matter):
$$-\frac{dE}{dx} = K z^2 \frac{Z}{A} \frac{1}{\beta^2} \left[\frac{1}{2} \ln\frac{2 m_e c^2 \beta^2 \gamma^2 T_{\text{max}}}{I^2} - \beta^2 - \frac{\delta(\beta \gamma)}{2} \right]
$$
- $\frac{dE}{dx}$: Energy loss per unit path length
- $K$: A constant
- $z$: Charge of the particle
- $Z$: Atomic number of the medium
- $A$: Atomic mass of the medium
- $\beta$: Velocity of the particle in units of the speed of light
- $\gamma$: Lorentz factor ($\gamma = 1/\sqrt{1 - \beta^2}$)
- $m_e$: Electron mass
- $c$: Speed of light
- $I$: Mean excitation energy of the medium
    - $\delta(\beta \gamma)$: Density effect correction
    - $T_{\text{max}}$: Maximum kinetic energy that can be transferred to a free electron in a single collision
- For small velocities ($\beta \ll 1$), the non-relativistic limit is:
$$-\frac{dE}{dx} = \frac{4 \pi n z^2}{m_e v^2} \left(\frac{e^2}{4 \pi \varepsilon_0}\right)^2 \ln\frac{2 m_e v^2}{I}
$$
    - $n$: Electron number density of the medium
    - $v$: Velocity of the particle
    - $e$: Elementary charge
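As a numerical sanity check of the full formula above, a rough sketch for a muon in liquid argon (density correction neglected; the constants below are standard textbook values, not SBND-calibrated ones):
```
import numpy as np

# Rough constants (MeV, cm, g); assumed textbook values, not calibrated SBND numbers
K    = 0.307075   # MeV mol^-1 cm^2
me   = 0.511      # electron mass [MeV/c^2]
M_MU = 105.658    # muon mass [MeV/c^2]
Z_A  = 0.4509     # Z/A of argon
RHO  = 1.396      # liquid argon density [g/cm^3]
I_AR = 188e-6     # mean excitation energy of argon [MeV]

def dedx_bethe(ke_mev, mass=M_MU, z=1.0):
    """Mean -dE/dx [MeV/cm] from Bethe-Bloch, with the density correction neglected."""
    gamma = 1.0 + ke_mev / mass
    beta2 = 1.0 - 1.0 / gamma**2
    bg2 = beta2 * gamma**2
    # Maximum kinetic energy transferable to a single electron
    tmax = 2 * me * bg2 / (1 + 2 * gamma * me / mass + (me / mass) ** 2)
    bracket = 0.5 * np.log(2 * me * bg2 * tmax / I_AR**2) - beta2
    return K * z**2 * Z_A / beta2 * bracket * RHO

# A few-GeV muon sits near the minimum-ionizing value (~2 MeV/cm in LAr;
# slightly overestimated here because the density correction is dropped)
print(dedx_bethe(1000.0))
```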
### Bragg peak vs. Bethe-Bloch
| Aspect | Bragg Peak | Bethe-Bloch Curve |
|-------------------------|------------------------------------------------------|---------------------------------------------------------|
| Definition | Peak in energy deposition of charged particles | Average rate of energy loss of charged particles |
| Occurrence | Near the end of particle's range in a medium | Throughout the particle's traversal in a medium |
| Particle Type | Typically ions | Charged particles (electrons, muons, protons, etc.) |
| Energy Loss Mechanism | Primarily due to ionization and excitation | Primarily due to ionization and excitation |
| Significance | Important in radiation therapy for tumor treatment | Essential for understanding particle behavior in detectors |
| Energy Range | High energy, usually MeV to GeV range | Wide energy range, from MeV to TeV |
| Stopping Power | Sharp peak indicating maximum energy deposition | Continuous curve showing energy loss per unit path length |
| Stopping Energy | Maximum energy deposited at the peak (Bragg peak) | Energy loss per unit path length (Bethe-Bloch formula) |
### Remarks
* Charged particles traversing the LArTPC lose energy in matter, so analyzing dE/dx is useful for **particle identification**.
* **Deposition energy**: the amount of energy ionizing radiation deposits *in matter*. Thus, the Bethe-Bloch curve shows that dE/dx is highest at the stopping point and decreases with residual range.
* **Minimum Ionizing Particle (MIP)**: a particle that loses a relatively small and nearly constant amount of energy as it travels through matter (detector).
* As a particle moves through something, it can knock out electrons and leave a trail of ions (charged particles). Most particles lose energy in this process, but a MIP is special because it loses just enough energy to ionize atoms, but not more than that. It's "minimum ionizing" because it's at the point where energy loss per unit distance is at its lowest.
* When a neutrino hits something in the detector, it can produce other particles like muons or pions, which often behave like MIPs. By tracking these MIPs, we can reconstruct what happened in the neutrino interaction and gather insights into the properties of neutrinos.

* Observation from MicroBooNE: electrons cluster in the MIP dE/dx peak, while most photons sit around 4 MeV/cm.
* Note that it is difficult to distinguish **muons & pions**.
* We can use a **calorimeter** to record how much energy is deposited in matter and, at the same time, record the positions (coordinates).
    * It can also be used to estimate "missing energy" carried away by, e.g., neutrinos.


## From ADC to MeV
When a muon zips through liquid argon in a detector like SBND’s TPC, it loses energy by knocking electrons off atoms, creating a charge trail (dQ/dx, in ADC/cm). We want to know the TPC gain—how many electrons each ADC unit represents. The energy loss can be modelled in two ways. The **Bethe-Bloch Equation** gives the average energy loss:
$$- \frac{dE}{dx} = K \cdot z^2 \cdot \frac{Z}{A} \cdot \frac{\rho}{\beta^2} \left[ \frac{1}{2} \ln\left(\frac{2 m_e c^2 \beta^2 \gamma^2 T_{\text{max}}}{I^2}\right) - \beta^2 - \frac{\delta}{2} \right]$$
Here, $K = 0.307075\ \text{MeV mol}^{-1}\text{cm}^2$, $z = 1$, $Z/A = 0.4509$, and $\rho = 1.396\ \text{g/cm}^3$ set the scale, while $\beta$ (speed) and $\gamma$ (relativity) tweak it—like a kid losing energy bumping through a crowd, averaged out. But detectors see the *most likely* loss, not the average. The distribution is skewed—not a neat bell curve—because energy loss is random: small losses are common, but rare big losses stretch the tail. The **most probable value (MPV)** is the peak of this distribution, where the energy loss is most likely to occur. Hence the **Landau-Vavilov fit** steps in:
$$\Delta_p / x = \xi \left[ \ln\left(\frac{2 m_e c^2 \beta^2 \gamma^2}{I}\right) + \ln\left(\frac{\xi}{I}\right) + j - \beta^2 - \delta \right] / x$$
$$\frac{dE}{dx}_{\text{MPV}} = (\Delta_p / x) \cdot \rho$$
With $\xi$ based on thickness and speed, this is the usual stumble—think of the kid's typical trip. In short, the Landau-Vavilov expression accounts for the fluctuations in the detector, so it's more practical for the later analysis.
SBND measured dQ/dx as a function of the distance the muon has left before stopping (residual range), grouping the data into bins. For each bin, we plotted a histogram and fit a Landau-Gaussian curve—spiky from random energy loss, smoothed by detector blur—to find the peak charge (MPV):
$$\text{MPV}_{\text{true}} = \text{MPV} - 0.22278 \cdot \eta$$
The 0.22278 correction comes from the definition of the [Landau PDF in ROOT](https://root-forum.cern.ch/t/definition-of-landau-pdf-in-root/39051). Then we compared these MPVs to the theoretical dE/dx from the Landau model, adjusted for **recombination** (some electrons get lost):
$$R = \frac{\ln(\alpha + \beta \cdot \frac{dE}{dx} / \rho)}{\beta \cdot \frac{dE}{dx} / \rho}, \quad \alpha = 0.930, \quad \beta = \frac{0.212}{\text{Electric field}}$$
Here $\alpha$ and $\beta$ are recombination-model parameters, not the kinematic $\beta$ above. The gain links them:
$$\frac{dQ}{dx} = \frac{\frac{dE}{dx} \cdot [e^- \text{per MeV}] \cdot R}{\text{gain}}$$
By fitting this equation to match data and theory, SBND found a gain, showing detector sensitivity.
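To make the fit concrete, here is a minimal sketch of the dE/dx → dQ/dx prediction implied by the relations above (the function names and the 0.5 kV/cm drift field are assumptions for illustration; the actual fit lives in `dqdx_study_sbnd.py`):
```
import numpy as np

W_ION   = 23.6e-6          # MeV needed to create one ionization electron in argon
RHO     = 1.396            # liquid argon density [g/cm^3]
E_FIELD = 0.5              # drift field [kV/cm]; assumed nominal value
ALPHA   = 0.930            # recombination parameters from the text
BETA_R  = 0.212 / E_FIELD

def recombination(dedx):
    """Recombination survival fraction R for a given dE/dx [MeV/cm]."""
    x = BETA_R * dedx / RHO
    return np.log(ALPHA + x) / x

def predicted_dqdx(dedx_mpv, gain):
    """Predicted dQ/dx [ADC/cm] from a theoretical dE/dx MPV and a TPC gain [e-/ADC]."""
    electrons_per_cm = dedx_mpv * recombination(dedx_mpv) / W_ION
    return electrons_per_cm / gain

# The gain is then extracted by fitting predicted_dqdx(dedx_mpv, gain) to the
# measured Landau-Gaussian MPVs in each residual-range bin.
```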
## Shower energy reconstruction
Another expression commonly used in LArTPCs is the one for the shower energy:
$$
E_{\text{shower}} = W_i \cdot C_{\text{cal}} \cdot C_{\text{adj}} \cdot \frac{1}{R} \cdot \sum_{Q} e^{\frac{t_{\text{drift}}}{\tau}}Q
$$
* $W_i\ [\text{MeV}/e^-]$ is the work function, which is $23.6\ \text{eV}/e^-$ ($23.6 \times 10^{-6}\ \text{MeV}/e^-$) for Ar.
* $C_{\text{cal}} [e^-/\text{ADC}]$ is the calibration constant: the TPC **gain**, which accounts for converting ADC into MeV. This is what we obtained from the MPV fit above.
* $C_{\text{adj}}$ is the **adjustment factor** (or **shower fudge factor**).
* $R$ represents the **recombination** factor (~0.68). This term accounts for charge lost to the recombination process, so it enters as $1/R$.
* $Q [\text{ADC}]$ represents the deposited charge.
* $t_{\text{drift}}$ is the electron's drift time.
* $\tau$ is the electron lifetime, corresponding to the purity of the argon.
The deposited charge is summed because $t_{\text{drift}}$ differs for each charge deposit.
This expression is calculated in the post-processor, which gives the shower energy for us to do analysis. Sometimes we have updates for the factors, then we need to change the post-processor in a `.cfg` and re-process the new `.h5` with it. The steps can be found [here](https://hackmd.io/xjAl8h0vQiCk2QOJFqPZgQ#Batch-job-to-create-h5-files-eg-with-updated-config).
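For reference, a minimal sketch of the shower-energy formula above (all constant values below are placeholders, not the calibrated SBND numbers; the real calculation lives in the post-processor):
```
import numpy as np

W_ION = 23.6e-6   # work function [MeV per ionization electron]
C_CAL = 200.0     # TPC gain [e-/ADC]; placeholder, use the value from the MPV fit
C_ADJ = 1.0       # shower adjustment ("fudge") factor; placeholder
R     = 0.68      # effective recombination factor
TAU   = 10.0e3    # electron lifetime [us]; placeholder, depends on argon purity

def shower_energy(q_adc, t_drift_us):
    """Reconstructed shower energy [MeV] from per-deposit charges [ADC] and drift times [us]."""
    q = np.asarray(q_adc, dtype=float)
    t = np.asarray(t_drift_us, dtype=float)
    lifetime_corr = np.exp(t / TAU)   # undo the charge lost during drift
    return W_ION * C_CAL * C_ADJ * (1.0 / R) * np.sum(lifetime_corr * q)

# Example with two charge deposits at different drift times
print(shower_energy([1.2e4, 8.0e3], [300.0, 850.0]))
```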
### Recombination
In liquid argon, a charged particle can knock electrons off argon atoms, creating free electrons and argon ions (Ar⁺). **Recombination** is the process in which an ionized electron, before drifting far, rejoins an Ar ion.

Recombination depends on the strength of the drift E-field, its orientation with respect to the particle track, and the particle's stopping power (dE/dx).
## Setup
1. Get the `.h5` sample with proper config (post-processor, etc.)
    * One thing that has to be ensured for the calibration study is the ordering of the post-processors: `direction` --> `find_cathode_crossers` --> `apply_calibrations`. This way, the cathode-crossing algorithm has the track direction info it needs, and the calibration algorithm gets tracks with their drift coordinate corrected (needed for the lifetime correction). The priorities should be 4, 3, and 2, respectively.
2. Navigate to the working area (e.g. `/sdf/home/c/castalyf/dqdx_analysis`), make sure you have the following scripts:
* To process the `.h5` sample: `dqdx_anaconfig.cfg`
* To produce dQ/dx parameters using SPINE: `dqdx_analyzer.py`
    * To do analysis (used later): `dqdx_study.py` (currently named `dqdx_study_sbnd.py`)
3. For the analyzer to work, make sure the script is in place, i.e. `/sdf/home/c/castalyf/spine/spine/ana/script/dqdx_analyzer.py`
and under the same directory, add a line into `__init__.py`:
`from .dqdx_analyzer import *`
4. Make sure the config is also under the previous directory, i.e.
`/sdf/home/c/castalyf/spine/spine/ana/dqdx_anaconfig.cfg`
## Run
5. Open `dqdx_anaconfig.cfg`. Make sure `file_keys` points to the output `.h5` file to be analyzed.
6. Navigate to the directory where you want the H5 to be produced and saved, e.g. `/sdf/home/c/castalyf/dqdx_analysis`
7. Run the following command on the interactive terminal:
```
python3 /sdf/home/c/castalyf/spine/bin/run.py -c dqdx_anaconfig.cfg
```
    * It's sometimes useful to run the following command with `nohup`, so the process keeps running in the background even if the connection drops:
```
nohup python3 /sdf/home/c/castalyf/spine/bin/run.py -c dqdx_anaconfig.cfg &
```
* To check the status: `ps aux | grep run.py`
8. Two CSVs should be generated. We will focus on `/sdf/home/c/castalyf/spine/spine/ana/logs/dqdx....csv`
9. You can produce plots of dQ/dx vs. residual range to see what the distribution looks like; see the sketch below.
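For step 9, a quick plotting sketch (the CSV filename and the column names `rr` and `dqdx` are assumptions; check the actual headers of the analyzer output):
```
import pandas as pd
import matplotlib.pyplot as plt

# Assumed path and column names; adjust to the actual analyzer output
df = pd.read_csv("logs/dqdx_output.csv")

plt.hist2d(df["rr"], df["dqdx"], bins=(100, 100), cmap="viridis")
plt.xlabel("Residual range [cm]")
plt.ylabel("dQ/dx [ADC/cm]")
plt.colorbar(label="Entries")
plt.savefig("dqdx_vs_rr.png")
```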
## Analysis
We are going to use `dqdx_study.py` to proceed with the analysis.
10. Navigate to the working area (e.g. `/sdf/home/c/castalyf/dqdx_analysis`), make sure you have the following:
* `ICARUSTPCGain` folder
* `betheBloch_library.py`
11. Now, open a terminal on the local machine, do `ssh` things (`ssh castalyf@s3dflogin.slac.stanford.edu` and then `ssh neutrino`) to log in, and navigate to the working area you were in (e.g. `/sdf/home/c/castalyf/dqdx_analysis`).
12. Create a python virtual environment, `python3 -m venv .venv` (To remove: `rm -r -f .venv`)
13. Enter the virtual env: `source .venv/bin/activate`
14. Install the packages:
    * `pip install numpy scipy pandas matplotlib` (plus anything else you need)
* `pip install git+https://github.com/gputnam/landau.git`
15. Run the following command, which applies the analysis script to the CSV we produced under `logs`:
```
python3 dqdx_study_sbnd.py --in_path /sdf/home/c/castalyf/dqdx_analysis/logs
```
After that, we obtain the gain value and the plots (dQ/dx vs. residual range, Langaus fits, etc.) in the directory.
In the future when we repeat the same analysis, we basically just need to repeat steps **1, 5-9, 11, 13, and 15**.
### Notes
* The segment length can be modified in `dqdx_anaconfig.cfg`; whenever we adjust it, the corresponding `dl` parameter in the `dqdx_study.py` script needs to be updated.
### Appendix: How to do PR (Pull Request) with `git`?
Sometimes we update certain files in a directory and would like to synchronize the changes on GitHub. Here are the steps:
1. First things first, make sure you have the repository locally (**fork** the repository to your own account on the website and then do `git clone <URL>` in your area).
2. Navigate to the local repository, e.g. `cd spine_prod`.
3. Make sure it's up-to-date: `git checkout main` and `git pull origin main`.
    - NOTE: Since we forked the repository, it is sometimes not synchronized with the upstream, so we might need to do the following:
        - Add the upstream remote and check it: `git remote add upstream https://github.com/DeepLearnPhysics/spine_prod.git` and then `git remote -v`
        - Fetch the info from upstream: `git fetch upstream`
        - Reset the local main branch to match upstream/main: `git reset --hard upstream/main`
4. It's better to create a new branch and switch to that branch to make changes: `git checkout -b <branch_name>`
5. Make changes (e.g. `nano config/sbnd/sbnd_full_chain_data_250328.cfg`)
6. Add the updated files: `git add <updated_file>` (to make sure what files were changed, do `git status`)
7. Commit messages: `git commit -m "<leave_message_here>"`
8. Push the changes: `git push origin <branch_name>`
9. At this stage you'll probably have to authenticate with your username and password. For the password, you might have to go to the GitHub menu --> `Settings` --> `Developer settings` --> `Personal access tokens (classic)` to generate or retrieve a token. The token is your password, so just copy and paste it.
10. Go to the repository on GitHub, find the "Compare & pull request" button, and create the pull request. You'll be done!
If you've done this before and are committing to an existing branch, here are the quick steps:
```
cd spine_prod
nano config/sbnd/sbnd_full_chain_data_250328.cfg
git add config/sbnd/sbnd_full_chain_data_250328.cfg
git status
git commit -m "..."
git push origin <branch name>
```
# $\nu_e$ Selection
* **Why selection?** When neutrinos are produced (e.g., in nuclear reactions or in cosmic events), they are mixed with other particles. To study a certain type of neutrino interaction, a selection algorithm is required.
* **What is selection?** It's essentially a process to quantify how well the reconstruction algorithm is able to capture signals.
* **How to do selection?** Taking $\nu_e$ as an example, the crucial signature is that an *electron* is produced in $\nu_e$ interactions. In general: $$\nu_e+Ar \rightarrow e^- + \text{hadronic products } (p, \pi,...)$$ E.g. for CCQE interactions, $\nu_e+Ar \rightarrow e^- + Np$, where $N\geq 1$ represents the number of protons produced. In this case, we look for topologies whose final states include 1 electron and $N$ protons (i.e. $1eNp$). The signal is determined by several *cuts*:
1. **Containment cut**: The entire interaction should be *contained*, with all activity at least 5 cm from the detector boundary in all three directions.
    * **Active volume (AV)**: The volume filled with liquid argon.
2. **Fiducial cut**: Interaction (vertex) within the *fiducial volume* (FV)
    * **Fiducial volume (FV)**: The volume free of the *space charge effect*. Interactions whose vertex falls outside the FV are rejected.
    * **Space charge effect (SCE)**: Particles/cosmic rays ionize Ar in the TPC. Electrons drift quickly toward the wires, while the positive ions (Ar$^+$) drift slowly and eventually accumulate near the edges. The accumulated ions distort the electric field and hence the reconstructed positions of the ionization. This affects the reconstruction of dE/dx and therefore the resolution of the reconstructed energy.
3. **Final state topology**: At least 1 *primary electron* and 1 *primary proton* are required, each with $\geq$ 50 MeV of energy, to count in the final state, ensuring they form tracks and are distinct from muons.
4. Electron neutrino *charged-current (CC)* interaction: *at least one primary electron in the final state* is required.
5. As in 3, other particles are treated similarly but with a $\geq$ 25 MeV threshold.
* **Flash matching**: Since the BNB beam window also contains background events, particularly cosmic rays, we identify which of the many interactions in a given event is associated with the light produced in time with the beam spill. In short, resolving the interaction time using optical information handles *cosmic rejection*.
Flash matching predicts what the light from an interaction should look like and matches it to a PDS flash coincident with the beam. If the prediction matches correctly, you then know which interaction in each event is the neutrino.
* **What to show?** The confusion matrix is a significant metric, because "selection" quantifies how well the reconstruction identifies, for example, an interaction's topology.
* **Efficiency**: How well the true signals can be captured, $$\text{Efficiency} = \frac{\text{# of reco signals that are also true signals}}{\text{# of true signals}}$$

* **Purity**: How less the selected samples are contaminated with false signals, $$\text{Purity} = \frac{\text{# of true positives}}{\text{# of true+false positives}}$$
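A minimal sketch of these two metrics (the function and its inputs are hypothetical; in practice the flags come from the selection CSV):
```
import numpy as np

def efficiency_purity(is_true_signal, is_selected):
    """Selection efficiency and purity from per-interaction boolean flags."""
    t = np.asarray(is_true_signal, dtype=bool)
    s = np.asarray(is_selected, dtype=bool)
    tp = np.sum(t & s)                      # true signals that were also selected
    return tp / np.sum(t), tp / np.sum(s)   # (efficiency, purity)

# Example: 3 true signals, 3 selected, 2 in common -> efficiency = purity = 2/3
print(efficiency_purity([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```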
## Running analysis script
The methodology is to process the CSV from a sample (e.g. the nue or bnb_cosmics sample) and then apply the selection cuts:
- Cosmic cut
- FV cut
- Electron cut
Then we can obtain the selected $\nu_e$ CC signals; a toy illustration of this cut flow is sketched below.
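A toy pandas sketch of this cut flow (the column names are assumptions for illustration; the real ones come from `nuecc.py` and the logger block in `save.cfg`):
```
import pandas as pd

# Assumed CSV and column names; check the logger block in save.cfg
df = pd.read_csv("logs/nueccr2t_nuecc.csv")

selected = df[
    df["flash_matched"]                   # cosmic cut: matched to an in-time flash
    & df["is_fiducial"]                   # FV cut: vertex inside the fiducial volume
    & (df["n_primary_electrons"] >= 1)    # electron cut: at least one primary electron
]
print(f"Selected {len(selected)} / {len(df)} candidate nu_e CC interactions")
```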
The main script for the selection is `nuecc.py`, and configuration is `save.cfg`.
1. Ensure `nuecc.py` is under the `spine` directory, specifically, `.../spine/spine/ana/script/nuecc.py`
2. In the same directory, there is an `__init__.py`, and you have to include `from .nuecc import *` in the script.
3. Run the following command to process:
```
python3 /sdf/home/c/castalyf/spine/bin/run.py --config save.cfg
```
To ensure it can be processed in background:
```
nohup python3 /sdf/home/c/castalyf/spine/bin/run.py --config save.cfg &
```
(Make sure you are in the right directory with `save.cfg` and have the correct `spine` path.)
Then you will obtain a CSV (`nueccr2t_nuecc.csv`) in the logs folder to be analyzed.
### From samples to CSVs (alternative way)
1. First of all, make a configuration file under the directory you'd like to run in (e.g. `anaconfig.cfg`):
```
analysis:
  iteration: -1 # change as how many you'd like to run; -1 iterates all images.
  profile: True
  log_dir: LOG_DIRECTORY
  data_builders:
    - ParticleBuilder
    - InteractionBuilder
  convert_to_cm: False
  # force_build: True
reader:
  name: HDF5Reader
  file_keys:
    - /sdf/data/neutrino/sbnd/simulation/mpvmpr_v01/h5/output_ana/larcv_test_2500_mlreco_ana.h5 # This line will be replaced
scripts:
  run_bidirectional_inference:
    logger:
      append: False
      particles:
        - id
        - interaction_id
        - pid
        - is_ccrosser
        - is_contained
        - nu_id
        - is_primary
        - size
        - semantic_type
        - start_point
        - end_point
        - creation_process
        - momentum
        - truth_start_dir
        - energy_init
        - energy_deposit
        - children_counts
        - reco_start_dir
        - reco_end_dir
        - reco_length
        - csda_ke
        - calo_ke
        - ke
        - depositions_sum
        - truth_depositions_sum
        - primary_scores
        - pid_scores
      interactions:
        - id
        - size
        - nu_id
        - volume_id
        - count_primary_particles
        - count_truth_primary_particles
        - topology
        - truth_topology
        - is_contained
        - is_fiducial
        - vertex
        - truth_vertex
        - nu_info
        - flash_match_info
```
2. Set up your environment using a [Singularity container](https://nuslac.github.io/ComputingWiki/#/tools/containers-whatis). The first step is to enter the cluster (i.e. do the first two `ssh` steps to log in). Then:
(1) Make a job submission bash script (e.g. `job.sh`):
```
#!/bin/bash
#SBATCH --account=neutrino:icarus-ml
#SBATCH --partition=ampere
#SBATCH --mem-per-cpu=4g
#SBATCH --output=/sdf/group/neutrino/castalyf/slurm/slurm-%j.out
#SBATCH --error=/sdf/group/neutrino/castalyf/slurm/slurm-%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --job-name=nue
#SBATCH --time=24:00:00
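# The actual payload goes below the #SBATCH directives. A likely form (an assumption
# based on steps 3-4 below) is to wrap the analysis command in the singularity container:
# singularity exec --bind /sdf,/fs/ddn/sdf --nv /sdf/group/neutrino/images/develop.sif \
#     python3 -u $YOUR_LARTPC_PATH/analysis/run.py YOUR_ANACONFIG.cfg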
```
(2) Submit the job: `sbatch job.sh`
- Check the status: `squeue -u $YOUR_USERNAME`
- More info about Slurm: https://slurm.schedmd.com/srun.html
3. Call the singularity image and run bash inside it:
```
singularity exec --bind /sdf,/fs/ddn/sdf --nv /sdf/group/neutrino/images/develop.sif bash
```
**Comments:**
- `singularity exec [--flags] [container_name] [application]` will try to run your application inside the singularity container that you specified.
- The `--bind` flag makes certain directories visible when running things inside the container. So here, `--bind /sdf,/fs/ddn/sdf` will mount these directories to the container so that when running the main script, these directories are available to load files etc.
- `--nv` flag allows us to use GPUs.
- The `/fs/ddn...` part is where the singularity container lives. We are using the recent version at `/sdf/group/neutrino/images/latest.sif` (if that doesn't work, try `…/develop.sif`).
- If your job crashes, the error log will be saved here: `#SBATCH --error=/sdf/group/neutrino/castalyf/slurm/slurm-%j.err`
4. Run analysis tools and log information into CSVs:
```
python3 -u $YOUR_LARTPC_PATH/analysis/run.py YOUR_ANACONFIG.cfg
```
For example:
```
python3 -u /sdf/home/c/castalyf/lartpc_mlreco3d/analysis/run.py /sdf/home/c/castalyf/nue_selection/nue/anaconfig.cfg
```
5. Now the CSVs should be generated (check the `LOG_DIRECTORY` folder) and you are ready to do analysis! To keep a long run alive in the background, use `nohup [command] &`.