5468 Statistical Computing Midterm
===

```python=
# The packages used in this project.
import numpy as np
import matplotlib.pyplot as plt
import scipy.special as sc
from scipy.stats import binom
from scipy import stats
from statsmodels.distributions.empirical_distribution import ECDF
import seaborn as sns

sample_size = 10**4
```

## Problem 1. Dirichlet Distribution

### a. Write down the algorithm to sample using the acceptance-rejection method with a proposal g(x).

1. Generate $U = (u_1, u_2, \dots, u_d)$ uniformly on the simplex, i.e. $u_i \ge 0$ and $\sum_i u_i = 1$ (take the spacings of $d-1$ sorted $U(0,1)$ draws).
2. $V = \prod_i u_i^{\alpha_i - 1}$
3. Generate $W \sim U(0, 10)$; the constant 10 is an envelope bound, since for $\alpha = (2, 3, 4)$ the target density $V/B(\alpha)$ never exceeds about 7.8.
4. If $W \cdot B(\alpha) > V$, reject and go back to 1.
5. Return $U$.

### b. Write a code to sample 10^4^ data points through your algorithm in (a) with d = 3, (a~1~, a~2~, a~3~) = (2, 3, 4), and g being the uniform distribution on [0, 1]^3^. Report the following:

Note that the proposal actually sampled below is uniform on the simplex (the support of the Dirichlet), obtained from the spacings of two sorted $U(0,1)$ draws.

```python=
# Beta function B(alpha) = prod_i Gamma(alpha_i) / Gamma(sum_i alpha_i)
beta = lambda alpha: \
    np.prod(list(map(sc.gamma, alpha)))/sc.gamma(sum(alpha))

# Init
alpha = np.array([2, 3, 4])
k = 2                  # d-1 uniforms determine a point on the simplex
num = 0
U = np.array([])
m = beta(alpha)        # normalizing constant B(alpha)
```

```python=11
# Generate by acceptance-rejection
while num < sample_size:
    u = np.random.uniform(0, 1, k)
    u = np.array([u.min(axis=0), np.abs(u[1]-u[0]), \
                  1-u.max(axis=0)]).T       # spacings: uniform on the simplex
    v = 10*np.random.uniform(0, 1)*m        # W * B(alpha), W ~ U(0, 10)
    u_ = np.prod(np.power(u, alpha-1))      # V = prod_i u_i^(alpha_i - 1)
    if v < u_:                              # accept
        U = np.append(U, u)
        num += 1
U = U.reshape(-1, 3)
```

#### histogram for X~1~, X~2~ and X~3~, individually

```python=22
# Plot the empirical CDFs against numpy's Dirichlet sampler
_ = plt.hist(np.random.dirichlet((2,3,4), sample_size), \
             100, cumulative=1, density=1, histtype='step')
_ = plt.hist(U, 100, cumulative=1, density=1, histtype='step')
plt.legend(['$Y_1$', '$Y_2$', '$Y_3$', '$X_1$', '$X_2$', '$X_3$'])
```

- ![](https://i.imgur.com/wLzKrZG.png =600x)
- > CDFs of Y~i~ and X~i~, where the Y~i~ are sampled with **numpy.random.dirichlet**.

#### the sample mean and covariance matrix

```python=27
print(U.mean(axis=0))
print(np.cov(U.T))

# Means
#[0.22245965 0.33119373 0.44634662]

# Cov
#[[ 0.01734423 -0.00713978 -0.01020445]
# [-0.00713978  0.0218523  -0.01471252]
# [-0.01020445 -0.01471252  0.02491697]]
```

|                 |    X~1~     |    X~2~     |    X~3~     |
|:---------------:|:-----------:|:-----------:|:-----------:|
|      $\mu$      | 0.22245965  | 0.33119373  | 0.44634662  |
| Cov(X~i~, X~j~) |             |             |             |
|      X~1~       | 0.01734423  | -0.00713978 | -0.01020445 |
|      X~2~       | -0.00713978 | 0.0218523   | -0.01471252 |
|      X~3~       | -0.01020445 | -0.01471252 | 0.02491697  |
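As a quick sanity check (not part of the assignment), the sample moments above can be compared with the closed-form Dirichlet moments $E[X_i] = \alpha_i/\alpha_0$ and $\mathrm{Cov}(X_i, X_j) = \frac{\delta_{ij}\,\alpha_i\alpha_0 - \alpha_i\alpha_j}{\alpha_0^2(\alpha_0+1)}$, where $\alpha_0 = \sum_i \alpha_i$; a minimal sketch reusing `alpha` from the init block:

```python=
# Theoretical Dirichlet moments for alpha = (2, 3, 4); both should be
# close to the sample estimates reported above.
a0 = alpha.sum()
print(alpha/a0)   # [0.2222 0.3333 0.4444]
print((np.diag(alpha)*a0 - np.outer(alpha, alpha))/(a0**2*(a0+1)))
# diagonal     ~ [ 0.0173  0.0222  0.0247]
# off-diagonal ~ [-0.0074 -0.0099 -0.0148]
```

The sampled moments agree with these values to about two decimal places.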
### c. There's a way to sample the Dirichlet distribution through the Gamma distribution. Write down the corresponding algorithm.

1. Generate $X = (X_1, X_2, \dots, X_d)$ with $X_i \sim \text{Gamma}(\alpha_i, 1)$.
2. $Y_i = X_i / \sum_j X_j$
3. Return $Y$.

### d. Write a code to sample 10^4^ data points through your algorithm in (c) with d = 3 and (a~1~, a~2~, a~3~) = (2, 3, 4).

```python=
# Gamma(a, 1) sampler for a > 1: rejection from an Exp(mean a) proposal.
# Accept a*V, V ~ Exp(1), when W ~ Exp(1) satisfies W >= (a-1)(V - ln V - 1).
def gamma_generator(a, sample_size):
    a_ = a - 1
    while sample_size:
        u = np.random.uniform(0, 1)
        V = -np.log(u)                      # V ~ Exp(1)
        u = np.random.uniform(0, 1)
        W = -np.log(u)                      # W ~ Exp(1)
        if W >= a_*(V - np.log(V) - 1):     # acceptance condition
            sample_size -= 1
            yield a*V

def gamma(a, sample_size):
    return [i for i in gamma_generator(a, sample_size)]
```

```python=16
# Generate
X = np.array([gamma(a, sample_size) for a in alpha])
Y = X/np.sum(X, axis=0)        # normalize: Y ~ Dirichlet(alpha)
```

#### histogram for X~1~, X~2~ and X~3~, individually

```python=19
# Plot the empirical CDFs against numpy's Dirichlet sampler
_ = plt.hist(np.random.dirichlet((2,3,4), sample_size), \
             100, cumulative=1, density=1, histtype='step')
_ = plt.hist(Y.T, 100, cumulative=1, density=1, histtype='step')
plt.legend(['$Y_1$', '$Y_2$', '$Y_3$', '$X_1$', '$X_2$', '$X_3$'])
```

- ![](https://i.imgur.com/2RfbCaY.png =600x)
- > CDFs of Y~i~ and X~i~, where the Y~i~ are sampled with **numpy.random.dirichlet**.

#### the sample mean and covariance matrix

```python=24
print(Y.mean(axis=1))
print(np.cov(Y))

# Means
#[0.22266105 0.33398685 0.4433521 ]

# Cov
#[[ 0.01719933 -0.0074986  -0.00970073]
# [-0.0074986   0.02206054 -0.01456194]
# [-0.00970073 -0.01456194  0.02426267]]
```

|                 |    X~1~     |    X~2~     |    X~3~     |
|:---------------:|:-----------:|:-----------:|:-----------:|
|      $\mu$      | 0.22266105  | 0.33398685  | 0.4433521   |
| Cov(X~i~, X~j~) |             |             |             |
|      X~1~       | 0.01719933  | -0.0074986  | -0.00970073 |
|      X~2~       | -0.0074986  | 0.02206054  | -0.01456194 |
|      X~3~       | -0.00970073 | -0.01456194 | 0.02426267  |

\pagebreak

## Problem 2. Marshall-Olkin Copula

### a. Write down the algorithm of generating a Marshall-Olkin vector (X, Y) with parameter (λ~1~, λ~2~, λ~3~).

1. Generate $V_i \sim U(0,1)$ for $i = 1, 2, 3$.
2. $Z_i = -\ln(V_i)/\lambda_i$, so that $Z_i \sim \text{Exp}(\lambda_i)$.
3. $X = \min(Z_1, Z_3)$, $Y = \min(Z_2, Z_3)$.
4. $U_j = e^{-(\lambda_j+\lambda_3)\min(Z_j,\, Z_3)}$ for $j = 1, 2$ (the copula coordinates).

### b. Write a code to sample 10^4^ data points (X, Y) through your algorithm in (a) with (λ~1~, λ~2~, λ~3~) = (1, 2, 3).

```python=
# Init
ld = np.array([1, 2, 3])
F = lambda x: 1 - np.exp(-(ld[0]+ld[2])*x)        # marginal CDF of X
G = lambda y: 1 - np.exp(-(ld[1]+ld[2])*y)        # marginal CDF of Y
F_rvs = lambda x: -1/(ld[0]+ld[2])*np.log(1-x)    # inverse of F
G_rvs = lambda y: -1/(ld[1]+ld[2])*np.log(1-y)    # inverse of G
```

```python=7
# The algorithm above
Z = -np.log(np.random.uniform(0, 1, (sample_size, 3)))/ld   # Z_i ~ Exp(ld_i)
X = np.min(Z.T[[0,2]], axis=0)
Y = np.min(Z.T[[1,2]], axis=0)
U1 = np.exp(-(ld[0]+ld[2]) * X)   # copula (rank) coordinates
U2 = np.exp(-(ld[1]+ld[2]) * Y)
```

\pagebreak

```python=13
# Plot (X, Y)
h = sns.jointplot(X, Y, kind='kde')
h.set_axis_labels('$X$', '$Y$', fontsize=16)
h.fig.subplots_adjust(top=0.9)
plt.suptitle("Marshall-Olkin Expon", fontsize=18)
```

- ![](https://i.imgur.com/Xi1mkbT.png =600x)
- > The joint plot of (X, Y).

\pagebreak

#### Plot the rank plot between X and Y.

```python=18
# Rank plot
h = sns.jointplot(U1, U2, kind='kde')
h.set_axis_labels('$X$', '$Y$', fontsize=16)
h.fig.subplots_adjust(top=0.9)
plt.suptitle("Marshall-Olkin Copula", fontsize=18)
```

- ![](https://i.imgur.com/mk8xHH9.png =600x)
- > Rank plot of the Marshall-Olkin copula.
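As a sanity check (not required by the problem), the Marshall-Olkin construction gives exponential marginals, $X \sim \text{Exp}(\lambda_1+\lambda_3)$ and $Y \sim \text{Exp}(\lambda_2+\lambda_3)$ (exactly the CDFs `F` and `G` defined in the init block), so a Kolmogorov-Smirnov test of each marginal should return a large p-value:

```python=
# KS tests of the marginals against Exp(ld1+ld3) and Exp(ld2+ld3).
# Note scipy parametrizes the exponential by scale = 1/rate.
print(stats.kstest(X, stats.expon(scale=1/(ld[0]+ld[2])).cdf))
print(stats.kstest(Y, stats.expon(scale=1/(ld[1]+ld[2])).cdf))
```

Uniformity of the copula coordinates U1 and U2 can be checked the same way against `stats.uniform.cdf`.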
\pagebreak

### c. Write down the algorithm of generating a (Z, W) such that Z ~ N(μ~Z~, σ~Z~^2^), W ~ N(μ~W~, σ~W~^2^), and the copula of (Z, W) is a Marshall-Olkin copula with parameter (λ~1~, λ~2~, λ~3~).

1. Generate $(U_1, U_2)$ from the Marshall-Olkin copula as in (a).
2. $Z = F_Z^{-1}(U_1)$, where $F_Z$ is the CDF of $N(\mu_Z, \sigma_Z^2)$.
3. $W = F_W^{-1}(U_2)$, where $F_W$ is the CDF of $N(\mu_W, \sigma_W^2)$.

### d. Write a code to sample 10^4^ data points (Z, W) through your algorithm in (c) with (μ~Z~, σ~Z~^2^) = (0, 1), (μ~W~, σ~W~^2^) = (-2, 4), and (λ~1~, λ~2~, λ~3~) = (1, 2, 3).

```python=
# Generate normals by the inverse-CDF (quantile) transform of (U1, U2)
Z = stats.norm( 0, 1).ppf(U1)
W = stats.norm(-2, 2).ppf(U2)   # scale = sigma = 2, since sigma^2 = 4
```

\pagebreak

#### Plot the scatter plot of (Z, W)

```python=
# Joint plot
h = sns.jointplot(Z, W, kind='kde')
h.set_axis_labels('$Z$', '$W$', fontsize=16)
h.fig.subplots_adjust(top=0.9)
plt.suptitle("Multivariate Normal Distribution\non Marshall-Olkin Copula",
             fontsize=18)
```

- ![](https://i.imgur.com/KLUnhus.png =600x)
- > Joint plot of (Z, W).

\pagebreak

## Problem 3. Bootstrap and Jackknife

```python=
p3_data = '0.0839 0.0205 0.3045 0.7816 0.0003 \
0.0095 0.4612 0.9996 0.9786 0.7580 0.0002 0.7310 \
0.0777 0.4483 0.4449 0.7943 0.1447 0.0431 0.8621 \
0.3273'
p3_data = list(map(float, p3_data.split()))
real_med = np.median(p3_data)   # 0.3861
```

### a. Jackknife

#### Write down the procedure of Jackknife.

1. $X$ = p3_data, with $n = \text{length}(X)$
2. $\theta = \text{Median}(X)$
3. $\widetilde\theta_i = \text{Median}(X \setminus \{X_i\})$ for $i = 1, \dots, n$
4. $\overline\theta = \text{Mean}(\widetilde\theta)$
5. $\text{Bias} = (n-1)(\overline\theta - \theta)$
6. $\text{Estimate} = \theta - \text{Bias}$
7. $SE = \sqrt{\frac{n-1}{n}\sum_i(\widetilde\theta_i - \overline\theta)^2}$
8. $z = \sqrt{2}\,\mathrm{erf}^{-1}(0.95) \approx 1.96$
9. $\delta^* = z \cdot SE$
10. Return $\text{Interval}_{.95} = [\text{Estimate}-\delta^*,\ \text{Estimate}+\delta^*]$

#### Write a code for it.

```python=
def jackknife(data, statistic):
    # Leave-one-out resamples
    resample = np.array([data[:idx] + data[idx+1:]
                         for idx, x in enumerate(data)])
    n = len(data)
    stat_data = statistic(data)
    jack_stat = statistic(resample, axis=1)
    mean_jack_stat = np.mean(jack_stat, axis=0)
    bias = (n-1)*(mean_jack_stat - stat_data)
    std_err = np.sqrt((n-1)*np.mean((jack_stat - mean_jack_stat)**2, axis=0))
    estimate = stat_data - bias
    z_score = np.sqrt(2.0)*sc.erfinv(0.95)   # 97.5% normal quantile
    conf_interval = estimate + z_score*np.array((-std_err, std_err))
    return jack_stat, conf_interval

jack_stat, interval = jackknife(p3_data, np.median)
```

#### Report the 95%-confidence interval you obtained.

> Interval~.95~ = [-0.11624515, 0.88844515]

Note that the interval is implausibly wide (it even extends below 0 although the data lie in [0, 1]); the delete-one jackknife is known to be unreliable for non-smooth statistics such as the median.

#### plot the histogram of your jackknife sample.

```python=17
# Plot histogram
_ = plt.hist(jack_stat, 20)
```

- ![](https://i.imgur.com/UAKU9w2.png =500x)
- > Histogram of the leave-one-out medians.

### b. Bootstrap

#### Write down the procedure of Bootstrap.

1. $X$ = p3_data
2. $\theta = \text{Median}(X)$
3. Resample $X^* = \bigcup_{i=1}^{10^4}\bigcup_{j=1}^{20} X^*_{ij}$, where each $X^*_{ij}$ is drawn from $X$ with replacement
4. $\widetilde\theta_i = \text{Median}(X^*_i)$
5. $\delta^* = \widetilde\theta - \theta$
6. $\text{Interval}_{.95} = [\theta - \delta^*_{.975},\ \theta - \delta^*_{.025}]$

#### Write a code for it.

```python=
# Basic bootstrap confidence interval
def confidence_interval(theta_hat, theta, confidence=0.95):
    h = theta - theta_hat                        # delta* = theta~ - theta
    c = (1-confidence)/2
    h_ = np.percentile(h, ((1-c)*100, c*100))    # 97.5% and 2.5% quantiles
    return theta_hat - h_

# Bootstrap
bt_sp = np.random.choice(p3_data, size=(sample_size, len(p3_data)),
                         replace=True)
bt_theta = np.median(bt_sp, axis=1)
print(confidence_interval(real_med, bt_theta))
```

#### construct the 95%-confidence interval with 10^4^ bootstrap sampling.

> Interval~.95~ = [0.0277, 0.6914]

#### plot the histogram of your bootstrap sample.

```python=13
# Plot histogram
_ = plt.hist(bt_theta, 100, cumulative=0, histtype='step')
```

- ![](https://i.imgur.com/cFzxuVI.png =500x)
- > Histogram of the bootstrap medians.
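As an optional cross-check, the plain percentile interval can be read off the same bootstrap medians; it is typically similar in width to the basic interval above, though reflected about the observed median. The second print assumes SciPy >= 1.7 for the built-in `scipy.stats.bootstrap`:

```python=
# Percentile interval from the same bootstrap sample ...
print(np.percentile(bt_theta, [2.5, 97.5]))

# ... and SciPy's built-in bootstrap as an independent reference.
from scipy.stats import bootstrap
res = bootstrap((p3_data,), np.median, n_resamples=sample_size,
                confidence_level=0.95, method='percentile')
print(res.confidence_interval)
```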
### Parametric Bootstrap

Suppose now we know that the data is coming from a Beta distribution **beta**(a, b) with unknown (a, b).

#### Write down the procedure of Parametric Bootstrap.

1. $X$ = p3_data
2. $\theta = \text{Median}(X)$, $\mu = \text{Mean}(X)$, $\sigma = \text{Std}(X)$
3. Estimate $(a, b)$ from $\mu$ and $\sigma$ by the method of moments
4. Generate $X^* = \bigcup_{i=1}^{10^4}\bigcup_{j=1}^{20} X^*_{ij}$, $X^*_{ij} \sim \text{Beta}(a, b)$
5. $\widetilde\theta_i = \text{Median}(X^*_i)$
6. $\delta^* = \widetilde\theta - \theta$
7. $\text{Interval}_{.95} = [\theta - \delta^*_{.975},\ \theta - \delta^*_{.025}]$

#### Write a code for it.

```python=
# Method-of-moments estimates of the Beta(a, b) parameters from mu and std
def calculate_beta_parm(mu, std):
    alpha = -mu*(std**2 + mu**2 - mu)/std**2
    beta = (mu-1)*(std**2 + mu**2 - mu)/std**2
    return [alpha, beta]

# Get a, b from mu, std
mu = np.mean(p3_data)
std = np.std(p3_data)
a, b = calculate_beta_parm(mu, std)
```

```python=11
# Resample from the fitted Beta(a, b) instead of from the data
beta_sp = np.random.beta(a, b, (sample_size, len(p3_data)))
beta_theta = np.median(beta_sp, axis=1)
print(confidence_interval(real_med, beta_theta))
```

#### construct the 95%-confidence interval with 10^4^ bootstrap sampling.

> Interval~.95~ = [0.09163249, 0.6674887]

#### plot the histogram of your bootstrap sample.

```python=15
_ = plt.hist(beta_theta, 100, histtype='step')
```

- ![](https://i.imgur.com/utgRs0y.png =500x)
- > Histogram of the parametric bootstrap medians.

\pagebreak

## Problem 4. Goodness-of-Fit and Bootstrap

### a. Compute the corresponding Kolmogorov-Smirnov statistics

```python=
# p4_data: the observed sample from the problem statement (not listed here)
def KS(data):
    n = 8
    p = data.mean()/n                 # moment estimate, since E[X] = np
    x = np.sort(data)
    f = ECDF(x)
    # D = max over data points of |F_n - F|, checking both sides of each jump
    d = np.max([np.abs(f(x) - binom(n, p).cdf(x)), \
                np.abs(f(x) - 1/len(x) - binom(n, p).cdf(x))])
    return d

KSdist = KS(p4_data)
```

> KsDist = 0.37344311354803494

### b. Write down the algorithm of approximating the p-value of this Kolmogorov-Smirnov statistics based on bootstrap with 10^4^ samples.

1. $X$ = p4_data
2. $\theta$ = KsDist
3. Estimate $\hat p = \text{Mean}(X)/8$, since $E[X] = np$ with $n = 8$
4. $X^* = \bigcup_{i=1}^{10^4}\bigcup_{j=1}^{20} X^*_{ij}$, $X^*_{ij} \sim \text{Binom}(n, \hat p)$
5. $\widetilde\theta_i = KS(X^*_i)$
6. $\text{p-value} = \dfrac{\#\{i : \widetilde\theta_i \ge \theta\}}{\#\widetilde\theta}$

### c. Write a program based on your algorithm in (b). Determine the result of the hypothesis testing.

```python=13
n = 8
p = p4_data.mean()/n
resample = np.random.binomial(n, p, (sample_size, len(p4_data)))
ks_stat = np.array(list(map(KS, resample)))   # KS statistic of each resample
p_value = np.mean(ks_stat >= KSdist)
p_value
```

> p-value = 0.0001 < 0.05, so we reject the null hypothesis that the data come from a Binomial(8, p̂) distribution.

###### tags: `NCTU` `statistical-computing` `simulation`