# Exploratory data analysis (EDA)

1. Promoted by [John Tukey](https://en.wikipedia.org/wiki/John_Tukey) (known for the FFT algorithm and the box plot)
2. Also known as data visualization or data exploration
3. EDA is not an exact science; it is a very important art!

![](https://i.imgur.com/4vZ4yVJ.png)

---

## Main Reasons to do EDA

1. Detection of mistakes
2. Checking of assumptions
3. Preliminary selection of appropriate models
4. Determining relationships among the explanatory variables
5. Assessing the direction and rough size of relationships between explanatory and outcome variables

## Example

This research examined customers' default payments in Taiwan and compared the predictive accuracy of the probability of default among six data mining methods [(Yeh and Lien, 2009)](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).

Let's see [24_EDA.ipynb](https://drive.google.com/file/d/1d0_R-ItPGU61Sg3nLE_MMpcYiJd1gNcz/view?usp=sharing)

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:

- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
- X2: Gender (1 = male; 2 = female).
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- X4: Marital status (1 = married; 2 = single; 3 = others).
- X5: Age (year).
- X6-X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; ...; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
- X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; ...; X17 = amount of bill statement in April, 2005.
- X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; ...; X23 = amount paid in April, 2005.

<!--
### Hw 3 (Due 10/19)

Follow Hw 2 and continue. Recall the geometric Brownian motion under the risk-neutral probability:
$$\frac{d S_t}{S_t} = r\, dt +\sigma\, dW_t,$$
where $r$ is the risk-free rate, $\sigma$ is the volatility, and $W_t$ is a standard Brownian motion. Suppose we use equal time steps to simulate the stock price at times $t_0=0,t_1,t_2,\ldots, t_d=T$, where $t_{i+1}-t_{i}=\Delta = T/d$. With Ito's lemma and Euler discretization, we can use the following iteration formula to generate a path of the stock price:
$$S_{t_{i+1}}=S_{t_i}\exp\left\{(r-\sigma^2/2)\Delta + \sigma\sqrt{\Delta}Z_i\right\},$$
where $Z_i\stackrel{i.i.d.}{\sim}N(0,1)$ are the normal innovations.

1. Consider $S_T=S_{0}\exp\left\{(r-\sigma^2/2)T + \sigma\sqrt{T}Z\right\}$, where $Z\sim N(0,1)$. Derive $E[S_T]=S_0 \exp\left\{(r-\sigma^2/2)T+(\sigma^2/2) T\right\}=S_0 \exp\{rT\}$.
2. Define a function to generate a path of daily stock prices, called gen_GBM, with input arguments $S_0$, $r$, $\sigma$, $T$, $d$. Use the function to generate a path of the stock price for 1/12 year (= 1 month = 22 days). Plot this simulated path of the stock price.
3. Use Monte Carlo simulation to estimate the above European call option price using $n=1,000$ simulated paths. Let $S^{(j)}_T$ denote the $j$-th simulated stock price at maturity. We estimate the European call option price by
$$\hat{c}=\frac{1}{n}\sum_{j=1}^n e^{-rT}\max(S_T^{(j)}-K,0).$$
4. Plot the 1,000 simulated paths from item 3.
5. Find the 95% confidence interval of the option price estimate.
The formula of the 95% confidence interval is
$$\left[\hat{c}-1.96\frac{s}{\sqrt{n}}, \hat{c}+1.96\frac{s}{\sqrt{n}}\right],$$
where
$$s=\sqrt{\frac{1}{n-1}\sum_{j=1}^{n}\left(e^{-rT}\max(S_T^{(j)}-K, 0)-\hat{c}\right)^2}.$$
6. Does the confidence interval in item 5 cover the theoretical price calculated in Hw 2?

---

:::info
$$\log\frac{S_{t_{i+1}}}{S_{t_i}} = (r-\sigma^2/2)\Delta+\sigma\sqrt{\Delta}Z_i.$$
:::

:::info
A standard Brownian motion $W_t$
1. $W_t$ is continuous
2. $W_0=0$
3. $W_t$ has independent and normal increments:
(i) $(W_{t_2}-W_{s_2})$ and $(W_{t_1}-W_{s_1})$ are independent for $s_1 < t_1 \le s_2 < t_2$.
(ii) $(W_{t}-W_{s})\sim N(0, t-s)$ for $s < t$.
:::
!-->
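
The variable coding in the Example lends itself to two of the EDA goals listed earlier: detecting mistakes, and getting a rough sense of how an explanatory variable relates to the outcome. The following is a minimal sketch in plain Python, not the notebook's actual code. It assumes each record is a dictionary keyed by variable name (X2, X3, X4, X6) plus a response `Y`. The `toy_rows` are made up for illustration and are not taken from the real dataset, and the helper names `undocumented_values` and `default_rate_by` are hypothetical.

```python
from collections import Counter

# Documented codes, copied from the variable list in the Example.
DOCUMENTED_CODES = {
    "X2": {1, 2},                            # gender
    "X3": {1, 2, 3, 4},                      # education
    "X4": {1, 2, 3},                         # marital status
    "X6": {-1, 1, 2, 3, 4, 5, 6, 7, 8, 9},   # repayment status, September 2005
}

# Made-up toy rows for illustration only (NOT real data).
# Row 5 deliberately contains codes outside the documented sets.
toy_rows = [
    {"X2": 1, "X3": 2, "X4": 1, "X6": -1, "Y": 0},
    {"X2": 2, "X3": 1, "X4": 2, "X6": -1, "Y": 0},
    {"X2": 2, "X3": 2, "X4": 2, "X6": 2, "Y": 1},
    {"X2": 1, "X3": 3, "X4": 1, "X6": 1, "Y": 1},
    {"X2": 2, "X3": 0, "X4": 2, "X6": 0, "Y": 0},
    {"X2": 1, "X3": 2, "X4": 3, "X6": 2, "Y": 1},
    {"X2": 2, "X3": 1, "X4": 1, "X6": -1, "Y": 0},
    {"X2": 2, "X3": 2, "X4": 2, "X6": 1, "Y": 0},
]


def undocumented_values(rows, codes):
    """Return {column: Counter of values that fall outside the documented codes}."""
    found = {}
    for col, allowed in codes.items():
        bad = Counter(r[col] for r in rows if r[col] not in allowed)
        if bad:
            found[col] = bad
    return found


def default_rate_by(rows, col, target="Y"):
    """Return {value of col: share of rows with target == 1}, sorted by value."""
    totals = Counter()
    defaults = Counter()
    for r in rows:
        totals[r[col]] += 1
        defaults[r[col]] += r[target]
    return {v: defaults[v] / totals[v] for v in sorted(totals)}


# Detection of mistakes: which columns contain codes the documentation does not list?
print(undocumented_values(toy_rows, DOCUMENTED_CODES))
# {'X3': Counter({0: 1}), 'X6': Counter({0: 1})}

# Direction and rough size of a relationship: default rate per repayment status.
print(default_rate_by(toy_rows, "X6"))
# {-1: 0.0, 0: 0.0, 1: 0.5, 2: 1.0}
```

In a notebook, the same two checks are commonly written with pandas `value_counts` and `groupby`; plain Python is used here only to keep the sketch dependency-free.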