[**:house: Home**](https://hackmd.io/s/rkkDP_l4M) | [:boy: **About**](https://hackmd.io/s/B149Z8v7b) | [**:microscope: Researches**](https://hackmd.io/s/rJPFNKlVz) | [**:rocket: Side projects**](https://hackmd.io/s/H1aS2qe4G) | [**:airplane: Life gallery**](https://hackmd.io/s/HJN4JslNM)
---
# [Note] Fundamental statistics in data analysis (working on...)
*<div style="text-align: center;" markdown="1">`statistics` `LATEX`</div>*
In this document I summarize the fundamental statistics used in physics, HEP data analysis, and machine learning. Further details and more complicated methods are not included. If you are interested in the details, please go through the books listed in [\[Note\] Course taking & book reading](https://hackmd.io/s/r16bkjl4f).
## Probability
There are two types of probability definition, ***frequentist*** and ***Bayesian***.
### 1. Frequentist probability
The probability is defined as the relative frequency of an outcome, which can be obtained with repeatable measurements.
### 2. Bayesian probability
The definition of the Bayesian probability is
$$
\begin{split}
p(\theta_i|x)&=\frac{p(x\cap\theta_i)}{p(x)}\\
&=\frac{p(x|\theta_i)p(\theta_i)}{p(x)}\ ,
\end{split}
$$
where $x$ is the variable of interest; $\theta_i$ is an unknown parameter which affects the probability; $p(\theta_i|x)$ is the probability of $\theta_i$ given the observed $x$, called the ***posterior***; $p(\theta_i)$ is usually the unknown probability of the possible parameter value, called the ***prior***, which affects the result of the *posterior* and cannot be obtained by repeatable measurements. $p(x)$ can be expanded with the *law of total probability* as
$$
\begin{split}
p(x)&=\sum_{i=1}^np(x|\theta_i)p(\theta_i)\\
&=\sum_{i=1}^np_i(x)\ ,
\end{split}
$$
hence the probability of the variable $x$ consists of $n$ contributions, each of which is the probability of $x$ under a given parameter $\theta_i$. Thus, the Bayesian probability can be rewritten as
$$
p(\theta_i|x)=\frac{p(x|\theta_i)p(\theta_i)}{\sum_{i=1}^np(x|\theta_i)p(\theta_i)}\ ,
$$
hence $p(\theta_i|x)$ is the normalized probability of $p(x|\theta_i)p(\theta_i)$. However, if several parameter values affect the probability of $x$, an alternative way to compare them is through their ratio, e.g.
$$
\frac{p(\theta_1|x)}{p(\theta_2|x)}=\frac{p(x|\theta_1)}{p(x|\theta_2)}\frac{p(\theta_1)}{p(\theta_2)}\ .
$$
The total probability of $x$ cancels out in this way. On the other hand, in a two-partition measurement case, the Bayesian probability becomes
$$
p(A|B)=\frac{p(B|A)p(A)}{p(B)}\ .
$$
This gives the useful relation between the two partitions' probabilities:
$$
p(A|B)p(B)=p(B|A)p(A)=p(A\cap B)\ .
$$
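As a quick numerical illustration of the posterior normalization above, here is a minimal sketch (assuming NumPy; the prior and likelihood values for three hypothetical parameter values $\theta_1,\theta_2,\theta_3$ are made up purely for illustration):
```python
import numpy as np

# Made-up numbers for three hypothetical parameter values theta_1..theta_3.
prior = np.array([0.5, 0.3, 0.2])          # p(theta_i), sums to 1
likelihood = np.array([0.10, 0.40, 0.70])  # p(x | theta_i) for the observed x

evidence = np.sum(likelihood * prior)      # p(x), law of total probability
posterior = likelihood * prior / evidence  # p(theta_i | x)

print(posterior, posterior.sum())          # the posterior sums to 1
# The posterior ratio equals the likelihood ratio times the prior ratio,
# so the evidence p(x) never needs to be computed for a comparison.
print(posterior[0] / posterior[1],
      (likelihood[0] / likelihood[1]) * (prior[0] / prior[1]))
```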
## Probability distribution function
The probability distribution function (PDF) expresses the continuous or discrete probability as a function $f(x)$ of the variable of interest $x$. The total probability for a certain case (region) $A$ is
$$
p(A)=\int_A f(x_1,\dots,x_n)d^nx
$$
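As a quick illustration of this integral, here is a minimal one-dimensional sketch (assuming NumPy; the exponential PDF and the region $A=[0,1]$ are arbitrary choices for demonstration):
```python
import numpy as np

def f(x, tau=2.0):
    """Exponential PDF with mean tau (an arbitrary example PDF)."""
    return np.exp(-x / tau) / tau

# Region A = [0, 1]; approximate p(A) with a simple Riemann sum.
edges = np.linspace(0.0, 1.0, 100001)
dx = edges[1] - edges[0]
mid = 0.5 * (edges[:-1] + edges[1:])
p_A = np.sum(f(mid)) * dx
print(p_A)  # analytic value: 1 - exp(-1/2) ~ 0.3935
```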
Here are three common distributions, ***binomial***, ***Poisson*** and ***Gaussian***. They are related to each other and exhibit different properties. The details are introduced with the following example: there are $N$ data in total in a pool; each datum is independently an interesting one with probability $p$; and $n$ is the number of interesting samples found among the $N$. The expected number of interesting samples is $\nu$, so the probability can be written as
$$
p=\frac{\nu}{N}\ .
$$
### 1. Binomial distribution
The probability of finding exactly $n$ interesting samples among the $N$ is
$$
f(n;N,p)=\dbinom{N}{n}p^n(1-p)^{N-n}\ ,
$$
where
$$
\dbinom{N}{n}=\frac{N!}{n!(N-n)!}
$$
### 2. Poisson distribution
When $N\gg n$ (i.e. $N\to\infty$ and $p\to 0$ with $\nu=Np$ fixed), the binomial distribution can be simplified to
$$
f(n;\nu)=\frac{\nu^ne^{-\nu}}{n!}\ .
$$
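A minimal numerical check of this limit (standard-library Python only; the values of $N$ and $\nu$ are illustrative):
```python
from math import comb, exp, factorial

N, nu = 10_000, 3.0   # many trials, fixed expected count
p = nu / N            # small per-trial probability

# The binomial PMF approaches the Poisson PMF as N grows with nu = N*p fixed.
for n in range(6):
    binomial = comb(N, n) * p**n * (1 - p)**(N - n)
    poisson = nu**n * exp(-nu) / factorial(n)
    print(n, round(binomial, 6), round(poisson, 6))
```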
### 3. Gaussian distribution
$$
f(x;\nu,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\nu)^2}{2\sigma^2}}\ ,
$$
where $\sigma^2=\nu$ if the underlying distribution obeys the Poisson distribution.
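A minimal check that the Gaussian with mean $\nu$ and $\sigma^2=\nu$ approximates the Poisson distribution when $\nu$ is large (the value $\nu=50$ is illustrative only):
```python
from math import exp, factorial, sqrt, pi

nu = 50.0
for n in (40, 50, 60):
    poisson = nu**n * exp(-nu) / factorial(n)
    gauss = exp(-(n - nu)**2 / (2 * nu)) / sqrt(2 * pi * nu)
    print(n, round(poisson, 5), round(gauss, 5))
```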
NOTE: ***Stirling's approximation***
$$
N!\approx N^Ne^{-N}\sqrt{2\pi N}
$$
$$
\ln N!\approx N\ln N - N\ ,
$$
where the term with $\sqrt{2\pi N}$ can be ignored in the logarithmic form when $N\gg1$.
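A quick numerical check of the logarithmic form (standard-library Python; $N=100$ is illustrative only):
```python
from math import lgamma, log, pi

N = 100
exact = lgamma(N + 1)                          # ln N!
stirling = N * log(N) - N                      # leading terms only
corrected = stirling + 0.5 * log(2 * pi * N)   # including the sqrt(2*pi*N) factor
print(exact, stirling, corrected)              # ~363.74, ~360.52, ~363.74
```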
## Entropy
The entropy in thermal physics is defined as
$$
S=k\ln\Omega\ ,
$$
where $k$ is the Boltzmann constant and $\Omega$ is the number of possible combinations (microstates), e.g. the number of ways to arrange the samples in a two-partition case.
### 1. Two partition mixing case
Consider two separated ideal gases (two partitioned samples) $A$ and $B$, which are then released and mixed randomly. The total number of gas molecules (samples) is $N$, consisting of $N_A$ and $N_B$. The fraction of $A$ is $p_A=p=\frac{N_A}{N}$, and that of $B$ is $p_B=1-p=\frac{N_B}{N}$. Thus $N_A$ and $N_B$ become
$$
\begin{split}
N_A&=Np\\
N_B&=N(1-p)\ .
\end{split}
$$
The ***mixing entropy*** is
$$
\Delta S = S_{A+B} - S_{A,B}\ ,
$$
where $S_{A,B}$ is the entropy before mixing, i.e. $A$ and $B$ are kept independently in different rooms, and $S_{A+B}$ is the entropy after mixing; they are defined as
$$
S_{A,B}=S_A+S_B=k\ln\Omega_A+k\ln\Omega_B\ ,
$$
$$
S_{A+B}=k\ln\Omega_{A+B}\ .
$$
where $\Omega_A=\Omega_B=1$, since each gas stays in its own room, which has only one possible arrangement, so $S_{A,B}$ is zero. $\Omega_{A+B}$ is the number of arrangements of $A$ and $B$ after mixing. It can be simplified with *Stirling's approximation*:
$$
\begin{split}
\Omega_{A+B}=\dbinom{N}{N_A}=\dbinom{N}{N_B}&=\frac{N!}{N_A!N_B!}\\
&\approx\frac{N^N}{N_A^{N_A}N_B^{N_B}}\\
&=p_A^{-N_A}p_B^{-N_B}\\
&=p^{-Np}(1-p)^{-N(1-p)}\ .
\end{split}
$$
The *mixing entropy* thus can be obtained as
$$
\Delta S=-Nk\left[p\ln{p}+(1-p)\ln(1-p)\right]\ .
$$
This is the same functional form as the ***cross entropy*** in **information theory**, which is used as a loss function in *neural networks (NN)* and other machine-learning cases.
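A minimal sketch relating the mixing-entropy form above to the binary cross entropy used as a machine-learning loss (assuming NumPy; the numerical values are illustrative only):
```python
import numpy as np

def binary_entropy(p):
    """-[p ln p + (1-p) ln(1-p)], i.e. Delta S / (N k) from the mixing example."""
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def binary_cross_entropy(y, p):
    """-[y ln p + (1-y) ln(1-p)]: same functional form, with label y and prediction p."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

p = 0.3
print(binary_entropy(p))             # maximal at p = 0.5
print(binary_cross_entropy(1.0, p))  # loss for a single labelled example
```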
## Cost function :construction:
### 1. Likelihood
### 2. $\chi^2$ method
## Appendix
### 1. Uniform distribution
For a uniform distribution on the interval $[a,b]$, the mean and the probability density are
$$
\langle x\rangle = \frac{a+b}{2}\ ,
$$
and
$$
p(x)=\frac{1}{b-a}\ ,
$$
respectively. The variance is
$$
\begin{split}
V[x]&=\langle x^2\rangle-\langle x\rangle^2 \\
&=\frac{a^2+ab+b^2}{3}-\frac{a^2+2ab+b^2}{4} \\
&=\frac{(b-a)^2}{12} ,
\end{split}
$$
where
$$
\begin{split}
\langle x^2\rangle&=\int_a^bx^2p(x)dx\\
&=\frac{1}{b-a}\int_a^bx^2dx \\
&=\frac{1}{b-a}(\frac{b^3}{3}-\frac{a^3}{3})\\
&=\frac{a^2+ab+b^2}{3}\ .
\end{split}
$$
Thus, the uncertainty (standard deviation) of the uniform distribution becomes
$$
\sigma=\frac{b-a}{\sqrt{12}}\ .
$$
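A quick Monte Carlo check of these results (assuming NumPy; the interval $[a,b]=[2,5]$ is an arbitrary choice):
```python
import numpy as np

a, b = 2.0, 5.0
x = np.random.default_rng(0).uniform(a, b, size=1_000_000)

print(x.mean(), (a + b) / 2)           # sample mean vs (a+b)/2
print(x.std(), (b - a) / np.sqrt(12))  # sample sigma vs (b-a)/sqrt(12)
```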
<br>
---
[:ghost: Github](https://github.com/juifa-tsai) | [:busts_in_silhouette: Linkedin ](https://www.linkedin.com/in/jui-fa-tsai-08ba0a93)