5468 Statistical Computing Midterm
===
```python=
# The packages used in this project.
import numpy as np
import matplotlib.pyplot as plt
import scipy.special as sc
from scipy.stats import binom
from scipy import stats
from statsmodels.distributions.empirical_distribution import ECDF
import seaborn as sns
sample_size = 10**4
```
## Problem 1. Dirichlet Distribution
### a. Write down the algorithm to sample using acceptance-rejection method with a proposal g(x).
1. Generate $U = (u_1, u_2, \dots, u_d)$ uniformly on the simplex $\{u : u_i \ge 0,\ \sum_i u_i = 1\}$ (for $d = 3$, take the spacings of two ordered $U(0,1)$ draws)
2. $V = \prod_i u_i^{\alpha_i - 1}$
3. Generate $W \sim U(0, 10)$
4. If $W \cdot B(\alpha) > V$
    - go back to 1.
5. Return $U$
### b. Write a code to sample 10^4^ data points through your algorithm in (a) with d = 3, (a~1~, a~2~, a~3~) = (2, 3, 4), and g being the uniform distribution on [0, 1]^3^. Report the following:
```python=
# Beta function: B(alpha) = prod(Gamma(alpha_i)) / Gamma(sum(alpha_i))
beta = lambda alpha: \
    np.prod(list(map(sc.gamma, alpha))) / sc.gamma(sum(alpha))

# Init
alpha = np.array([2, 3, 4])
k = 2            # two uniforms parametrize the 2-simplex for d = 3
num = 0
U = np.array([])
m = beta(alpha)
```
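A quick check that the constant 10 in step 3 actually gives a valid envelope: by a Lagrange-multiplier argument, $V = \prod_i u_i^{\alpha_i-1}$ is maximized on the simplex at $u_i \propto \alpha_i - 1$, and that maximum ($1/432 \approx 2.3\times10^{-3}$) stays below $10\,B(\alpha) \approx 3.0\times10^{-3}$. A minimal numerical sketch (`u_star` and `v_max` are names introduced here, not part of the assignment):
```python
# Envelope check: max of V on the simplex vs. the bound 10*B(alpha)
u_star = (alpha - 1)/(alpha - 1).sum()   # maximizer: (1/6, 2/6, 3/6)
v_max = np.prod(u_star**(alpha - 1))     # 1/432, about 2.31e-3
print(v_max <= 10*beta(alpha))           # True: 2.31e-3 < 2.98e-3
```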
```python=11
# Generate
while num < sample_size:
    u = np.random.uniform(0, 1, k)
    # Spacings of the two ordered uniforms: a uniform point on the simplex
    u = np.array([u.min(), np.abs(u[1] - u[0]), 1 - u.max()])
    v = 10*np.random.uniform(0, 1)*m
    u_ = np.prod(np.power(u, alpha - 1))
    if v < u_:
        U = np.append(U, u)
        num += 1
U = U.reshape(-1, 3)
```
#### histogram for X~1~, X~2~ and X~3~, individually
```python=21
# Plot the empirical CDFs (cumulative histograms)
_ = plt.hist(np.random.dirichlet((2, 3, 4), sample_size),
             100, cumulative=1, density=1, histtype='step')
_ = plt.hist(U, 100, cumulative=1, density=1, histtype='step')
plt.legend(['$Y_1$', '$Y_2$', '$Y_3$', '$X_1$', '$X_2$', '$X_3$'])
```
> CDF of Y~i~ and X~i~, where the Y~i~ are sampled with **numpy.random.dirichlet**.
#### the sample mean and covariance matrix
```python=25
print(U.mean(axis=0))
print(np.cov(U.T))
# Means
#[0.22245965 0.33119373 0.44634662]
# Cov
#[[ 0.01734423 -0.00713978 -0.01020445]
#[-0.00713978 0.0218523 -0.01471252]
#[-0.01020445 -0.01471252 0.02491697]]
```
| | X~1~ | X~2~ | X~3~ |
|:---------------:|:-----------:|:-----------:|:-----------:|
| $\mu$ | 0.22245965 | 0.33119373 | 0.44634662 |
| **Cov(X~i~, X~j~)** | | | |
| X~1~ | 0.01734423 | -0.00713978 | -0.01020445 |
| X~2~ | -0.00713978 | 0.0218523 | -0.01471252 |
| X~3~ | -0.01020445 | -0.01471252 | 0.02491697 |
### c. There's a way to sample the Dirichlet distribution through the Gamma distribution. Write down the corresponding algorithm.
1. Generate $X = (X_1, X_2, \dots, X_d)$ with $X_i \sim \mathrm{Gamma}(\alpha_i, 1)$
2. $Y_i = X_i / \sum_j X_j$
3. Return $Y$
### d. Write a code to sample 10^4^ data points through your algorithm in (c) with d = 3 and (a~1~, a~2~, a~3~) = (2, 3, 4).
```python=
# Gamma(a, 1) sampler for a > 1 via rejection from an Exp(a) proposal:
# with V ~ Exp(1) the candidate is a*V, accepted when an independent
# W ~ Exp(1) satisfies W >= (a-1)*(V - log V - 1).
def gamma_generator(a, sample_size):
    a_ = a - 1
    while sample_size:
        V = -np.log(np.random.uniform(0, 1))
        W = -np.log(np.random.uniform(0, 1))
        if W >= a_*(V - np.log(V) - 1):
            sample_size -= 1
            yield a*V

def gamma(a, sample_size):
    return list(gamma_generator(a, sample_size))
```
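As a spot check (not required by the problem), the sampler can be compared against scipy's Gamma(3, 1) CDF with a one-sample KS test; a large p-value is expected. A minimal sketch, with `sample` a name introduced here:
```python
# Goodness-of-fit check of the rejection sampler against Gamma(3, 1)
sample = gamma(3, 1000)
print(stats.kstest(sample, 'gamma', args=(3,)))
```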
```python=15
# Generate
X = np.array([gamma(a, sample_size) for a in alpha])
Y = X/np.sum(X, axis=0)   # normalize: Y ~ Dirichlet(alpha)
```
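For comparison, the same construction in two lines with numpy's built-in Gamma sampler (`X_np`, `Y_np` are names introduced here):
```python
# Same algorithm via numpy: row-wise Gamma(alpha_i, 1) draws, then normalize
X_np = np.random.gamma(alpha.reshape(-1, 1), 1, (3, sample_size))
Y_np = X_np / X_np.sum(axis=0)
```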
#### histogram for X~1~, X~2~ and X~3~, individually
```python=18
# Plot the empirical CDFs (cumulative histograms)
_ = plt.hist(np.random.dirichlet((2, 3, 4), sample_size),
             100, cumulative=1, density=1, histtype='step')
_ = plt.hist(Y.T, 100, cumulative=1, density=1, histtype='step')
plt.legend(['$Y_1$', '$Y_2$', '$Y_3$', '$X_1$', '$X_2$', '$X_3$'])
```
> CDF of Y~i~ and X~i~, where the Y~i~ are sampled with **numpy.random.dirichlet**.
#### the sample mean and covariance matrix
```python=25
print(Y.mean(axis=1))
print(np.cov(Y))
# Means
#[0.22266105 0.33398685 0.4433521 ]
# Cov
#[[ 0.01719933 -0.0074986 -0.00970073]
# [-0.0074986 0.02206054 -0.01456194]
# [-0.00970073 -0.01456194 0.02426267]]
```
| | X~1~ | X~2~ | X~3~ |
|:---------------:|:-----------:|:-----------:|:-----------:|
| $\mu$ | 0.22266105 | 0.33398685 | 0.4433521 |
| **Cov(X~i~, X~j~)** | | | |
| X~1~ | 0.01719933 | -0.0074986 | -0.00970073 |
| X~2~ | -0.0074986 | 0.02206054 | -0.01456194 |
| X~3~ | -0.00970073 | -0.01456194 | 0.02426267 |
\pagebreak
## Problem 2. Marshall-Olkin Copula
### a. Write down the algorithm of generating a Marshall-Olkin vector (X, Y) with parameter (λ~1~, λ~2~, λ~3~).
1. Generate $V_i \sim U(0,1)$ for $i = 1, 2, 3$
2. $Z_i = -\ln(V_i)/\lambda_i$, so that $Z_i \sim Exp(\lambda_i)$
3. $X = \min(Z_1, Z_3)$, $Y = \min(Z_2, Z_3)$
4. $U_j = e^{-(\lambda_j + \lambda_3)\min(Z_j, Z_3)}$ for $j = 1, 2$ (the copula observations)
### b. Write a code to sample 10^4^ data points (X, Y) through your algorithm in (a) with (λ~1~, λ~2~, λ~3~) = (1, 2, 3).
```python=
# Init: rates and the marginal CDFs / inverse CDFs of X and Y
ld = np.array([1, 2, 3])
F = lambda x: 1 - np.exp(-(ld[0]+ld[2])*x)   # X ~ Exp(ld1+ld3)
G = lambda y: 1 - np.exp(-(ld[1]+ld[2])*y)   # Y ~ Exp(ld2+ld3)
F_rvs = lambda x: -1/(ld[0]+ld[2])*np.log(1-x)
G_rvs = lambda y: -1/(ld[1]+ld[2])*np.log(1-y)
```
```python=7
# The algorithm above, vectorized
Z = -np.log(np.random.uniform(0, 1, (sample_size, 3)))/ld   # Z_i ~ Exp(ld_i)
X = np.min(Z.T[[0, 2]], axis=0)
Y = np.min(Z.T[[1, 2]], axis=0)
U1 = np.exp(-(ld[0]+ld[2]) * X)   # copula observations
U2 = np.exp(-(ld[1]+ld[2]) * Y)
```
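Since $X = \min(Z_1, Z_3)$ is a minimum of independent exponentials, $X \sim Exp(\lambda_1 + \lambda_3)$, and likewise $Y \sim Exp(\lambda_2 + \lambda_3)$; this is also why `F` and `G` above are the marginal CDFs. A quick check with scipy (the scale argument is the reciprocal rate):
```python
# The marginals should be Exp(4) and Exp(5); expect large p-values
print(stats.kstest(X, 'expon', args=(0, 1/(ld[0]+ld[2]))))
print(stats.kstest(Y, 'expon', args=(0, 1/(ld[1]+ld[2]))))
```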
\pagebreak
```python=13
# Plot (X,Y)
h = sns.jointplot(X,Y,kind='kde')
h.set_axis_labels('$X$', '$Y$', fontsize=16)
h.fig.subplots_adjust(top=0.9)
plt.suptitle("Marshall-Olkin Expon", fontsize=18)
```
> The joint plot of (X, Y).
\pagebreak
#### Plot the rank plot between X and Y .
```python=18
# Rank plot: the copula observations (U1, U2)
h = sns.jointplot(U1, U2, kind='kde')
h.set_axis_labels('$U_1$', '$U_2$', fontsize=16)
h.fig.subplots_adjust(top=0.9)
plt.suptitle("Marshall-Olkin Copula", fontsize=18)
```
> Rank plot of the Marshall-Olkin copula.
\pagebreak
### c. Write down the algorithm of generating a (Z, W) such that Z ~ N(μ~Z~, σ~Z~^2^), W ~ N(μ~W~, σ~W~^2^), and the copula of (Z, W) is a Marshall-Olkin copula with parameter (λ~1~, λ~2~, λ~3~).
1. $Z = F_Z^{-1}(U_1)$, where $F_Z^{-1}$ is the quantile function of $N(\mu_Z, \sigma_Z^2)$
2. $W = F_W^{-1}(U_2)$, where $F_W^{-1}$ is the quantile function of $N(\mu_W, \sigma_W^2)$
### d. Write a code to sample 10^4^ data points (Z, W) through your algorithm in (c) with (μ~Z~, σ~Z~^2^) = (0, 1), (μ~W~, σ~W~^2^) = (-2, 4), and (λ~1~, λ~2~, λ~3~) = (1, 2, 3).
```python=
# Generate normal marginals by the quantile transform
Z = stats.norm( 0, 1).ppf(U1)
W = stats.norm(-2, 2).ppf(U2)   # scale = sqrt(4) = 2
```
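A minimal sanity check of the transformed marginals (the sample moments should sit near the targets):
```python
# Sample moments of Z and W
print(Z.mean(), Z.std())   # expect roughly ( 0, 1)
print(W.mean(), W.std())   # expect roughly (-2, 2)
```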
\pagebreak
#### Plot the scatter plot of (Z,W)
```python=
# Joint plot
h = sns.jointplot(Z,W, kind='kde')
h.set_axis_labels('$Z$', '$W$', fontsize=16)
h.fig.subplots_adjust(top=0.9)
plt.suptitle("Multivariate Normal Distribution\non Marshall-Olkin Copula", fontsize=18)
```
> Joint plot of (Z, W).
\pagebreak
## Problem 3. Bootstrap and Jackknife
```python=
p3_data = '0.0839 0.0205 0.3045 0.7816 0.0003 \
0.0095 0.4612 0.9996 0.9786 0.7580 0.0002 0.7310 \
0.0777 0.4483 0.4449 0.7943 0.1447 0.0431 0.8621 \
0.3273'
p3_data = list(map(float,p3_data.split()))
real_med = np.median(p3_data) # 0.3861
```
### a. Jackknife
#### Write down the procedure of Jackknife.
1. $X$ = p3_data
2. $\theta = Median(X)$
3. $\widetilde\theta_i = Median(X \setminus X_i)$ for each $i$ (the leave-one-out medians)
4. $\overline\theta = Mean(\widetilde\theta)$
5. $n = length(X)$
6. $Bias = (n-1)(\overline\theta - \theta)$
7. $Estimate = \theta - Bias$
8. $SE = \sqrt{\frac{n-1}{n}\sum_i(\widetilde\theta_i - \overline\theta)^2}$
9. $zScore = \sqrt{2}\,erf^{-1}(0.95) \approx 1.96$
10. $\delta^* = zScore \cdot SE$
11. Return $Interval_{.95} = [Estimate - \delta^*, Estimate + \delta^*]$
#### Write a code for it.
```python=
def jackknife(data, statistic):
    # Leave-one-out resamples (data is a list, so + concatenates)
    resample = np.array([data[:idx] + data[idx+1:]
                         for idx, x in enumerate(data)])
    n = len(data)
    stat_data = statistic(data)
    jack_stat = statistic(resample, axis=1)
    mean_jack_stat = np.mean(jack_stat, axis=0)
    bias = (n-1)*(mean_jack_stat - stat_data)
    std_err = np.sqrt((n-1)*np.mean((jack_stat - mean_jack_stat)**2, axis=0))
    estimate = stat_data - bias
    z_score = np.sqrt(2.0)*sc.erfinv(0.95)   # two-sided 95% z, ~1.96
    conf_interval = estimate + z_score*np.array((-std_err, std_err))
    return jack_stat, conf_interval

jack_stat, interval = jackknife(p3_data, np.median)
```
#### Report the 95%-confidence interval you obtained.
> Interval.95 = [-0.11624515, 0.88844515]
#### plot the histogram of your jackknife sample.
```python=16
# Plot histogram
_ = plt.hist(jack_stat,20)
```
> Histogram of the resampled medians.
### b. Bootstrap
#### Write down the procedure of Bootstrap.
1. $X$ = p3_data
2. $\theta = Median(X)$
3. Resample $X^* = \bigcup_{i=1}^{10^4}\bigcup_{j=1}^{20}{X^*_{ij}}$, each $X^*_{ij}$ drawn uniformly with replacement from $X$
4. $\widetilde\theta = \bigcup_i Median(X^*_i)$
5. $\delta^* = \widetilde\theta - \theta$
6. $Interval_{.95} = [\theta - \delta^*_{.975},\ \theta - \delta^*_{.025}]$
#### Write a code for it.
```python=
# Basic (pivotal) bootstrap CI: [theta - delta_{.975}, theta - delta_{.025}]
def confidence_interval(real_stat, theta, confidence=0.95):
    h = theta - real_stat                      # bootstrap deltas
    c = (1 - confidence)/2
    h_ = np.percentile(h, ((1-c)*100, c*100))  # (97.5th, 2.5th) percentiles
    return real_stat - h_

# Bootstrap: resample the data with replacement
bt_sp = np.random.choice(p3_data, size=(sample_size, len(p3_data)), replace=True)
bt_theta = np.median(bt_sp, axis=1)
print(confidence_interval(real_med, bt_theta))
```
#### construct the 95%-confidence interval with 10^4^ bootstrap sampling.
> Interval.95 = [0.0277, 0.6914]
#### plot the histogram of your bootstrap sample.
```python=11
# Plot histogram
_ = plt.hist(bt_theta, 100, cumulative=0, histtype='step')
```
> Histogram of the resampled medians.
### c. Parametric Bootstrap
Suppose now we know that the data come from a Beta distribution **beta**(a, b) with unknown (a, b).
#### Write down the procedure of Parametric Bootstrap.
1. $X$ = p3_data
2. $\theta = Median(X)$
3. $\mu = Mean(X)$, $\sigma = Std(X)$
4. Estimate $(a, b)$ from $\mu$ and $\sigma$ by the method of moments (derivation after the code)
5. Generate $X^* = \bigcup_{i=1}^{10^4}\bigcup_{j=1}^{20}{X^*_{ij}}, X^*_{ij} \sim Beta(a, b)$
6. $\widetilde\theta = \bigcup_i Median(X^*_i)$
7. $\delta^* = \widetilde\theta - \theta$
8. $Interval_{.95} = [\theta - \delta^*_{.975},\ \theta - \delta^*_{.025}]$
#### Write a code for it.
```python=
# Method of moments: solve the Beta(a, b) mean/variance equations for (a, b)
def calculate_beta_parm(mu, std):
    alpha = -mu*(std**2 + mu**2 - mu)/std**2
    beta = (mu-1)*(std**2 + mu**2 - mu)/std**2
    return [alpha, beta]

mu = np.mean(p3_data)
std = np.std(p3_data)
a, b = calculate_beta_parm(mu, std)
```
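The expressions in `calculate_beta_parm` are the method-of-moments solution: matching the Beta mean and variance and solving for $(a, b)$ gives

$$\mu = \frac{a}{a+b}, \qquad \sigma^2 = \frac{ab}{(a+b)^2(a+b+1)} \quad\Longrightarrow\quad a = \mu\left(\frac{\mu(1-\mu)}{\sigma^2} - 1\right), \quad b = (1-\mu)\left(\frac{\mu(1-\mu)}{\sigma^2} - 1\right),$$

which matches the code since $\mu(1-\mu) - \sigma^2 = -(\sigma^2 + \mu^2 - \mu)$.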
```python=9
# Parametric bootstrap: resample from the fitted Beta(a, b)
beta_sp = np.random.beta(a, b, (sample_size, len(p3_data)))
beta_theta = np.median(beta_sp, axis=1)
print(confidence_interval(real_med, beta_theta))
```
#### construct the 95%-confidence interval with 10^4^ bootstrap sampling.
> Interval.95 = [0.09163249, 0.6674887]
#### plot the histogram of your bootstrap sample.
```python=12
_ = plt.hist(beta_theta, 100, histtype='step')
```

\pagebreak
## Problem 4. Goodness-of-Fit and Bootstrap
### a. Compute the corresponding Kolmogorov-Smirnov statistic
```python=
def KS(data):
    # KS distance between the ECDF and the fitted Binomial(8, p) CDF
    n = 8
    p = data.mean()/n   # moment estimate, since E[X] = n*p
    x = np.sort(data)
    f = ECDF(x)
    # The second term shifts the ECDF down one step to cover its left limit
    d = np.max([np.abs(f(x) - binom(n, p).cdf(x)),
                np.abs(f(x) - 1/(n+1) - binom(n, p).cdf(x))])
    return d

KSdist = KS(p4_data)   # p4_data: the observed sample from the problem statement
```
> KSdist = 0.37344311354803494
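For reference, the quantity `KS` approximates is the two-sided Kolmogorov-Smirnov distance, where the second term in the code stands in for the left limit of the ECDF (taken here with step $1/(n+1)$, following the listing above):

$$D = \sup_x\left|F_N(x) - F(x)\right| = \max_i \max\left(\left|F_N(x_i) - F(x_i)\right|,\ \left|F_N(x_i^-) - F(x_i)\right|\right)$$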
### b. Write down the algorithm of approximating the p-value of this Kolmogorov-Smirnov statistic based on bootstrap with 10^4^ samples.
1. $X$ = p4_data
2. $\theta = KSdist$
3. Estimate $p = Mean(X)/8$, since $Mean(X) \approx E[X] = np$
4. $X^* = \bigcup_{i=1}^{10^4}\bigcup_{j=1}^{20}{X^*_{ij}}, X^*_{ij} \sim Binom(n, p)$
5. $\widetilde\theta = \bigcup_i KS(X^*_i)$
6. $pValue = \frac{\#\{\widetilde\theta \ge \theta\}}{\#\widetilde\theta}$
### c. Write a program based on your algorithm in (b). Determine the result of the hypothesis testing.
```python=10
n = 8
p = p4_data.mean()/n
resample = np.random.binomial(n, p, (sample_size, len(p4_data)))
ks_stat = np.array(list(map(KS, resample)))
p_value = np.mean(ks_stat >= KSdist)   # fraction of bootstrap stats >= observed
p_value
```
> p-value = 0.0001 < 0.05, so we reject the null hypothesis that the data come from a Binomial(8, p) distribution.
###### tags: `NCTU` `statistical-computing` `simulation`