# Introduction
This project applies statistical methods that we learned during this semester in the course "Statistica Matematica". More specifically: we are given data from 52 observations of mineral water, namely the PH value and the lead level of each sample. We are going to investigate properties of the underlying random variables (representing the PH value and the lead level) using the programming language R. The water observations can be found in the following tables:
*(Tables of the 52 mineral-water observations: PH values and lead levels PB.)*
# Exercise 1: General properties of the data
First we compute the main descriptive statistics to get a better view of the available sample data of mineral water. Here we show the (relative) frequencies and the (relative) cumulative frequencies, ordered from lowest to highest value:
| PH | Frequency | Cum. Freq. | Rel. Freq. | Cum. Rel. Freq. |
|:---:|:---------:|:----------:|:----------:|:---------------:|
| 4.7 | 1 | 1 | 0.019 | 0.019 |
| 5.5 | 1 | 2 | 0.019 | 0.038 |
| 5.8 | 2 | 4 | 0.038 | 0.077 |
| 6.0 | 1 | 5 | 0.019 | 0.096 |
| ... | ... | ... | ... | ... |
| 8.0 | 2 | 47 | 0.038 | 0.904 |
| 8.1 | 3 | 50 | 0.058 | 0.962 |
| 8.5 | 1 | 51 | 0.019 | 0.981 |
| 8.8 | 1 | 52 | 0.019 | 1 |

| PB | Frequency | Cum. Freq. | Rel. Freq. | Cum. Rel. Freq. |
|:-----:|:---------:|:----------:|:----------:|:---------------:|
| 0.12 | 1 | 1 | 0.019 | 0.019 |
| 0.123 | 1 | 2 | 0.019 | 0.038 |
| 0.128 | 1 | 3 | 0.019 | 0.057 |
| 0.138 | 1 | 4 | 0.019 | 0.076 |
| ... | ... | ... | ... | ... |
| 0.52 | 1 | 49 | 0.019 | 0.943 |
| 0.521 | 1 | 50 | 0.019 | 0.962 |
| 0.525 | 1 | 51 | 0.019 | 0.981 |
| 0.527 | 1 | 52 | 0.019 | 1 |
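These tables can be reproduced in R along the following lines (a sketch, shown for PH and assuming the observations are stored in a numeric vector `PH`; `PB` is analogous):
```
# Sketch: frequency tables for PH (assumes PH is the numeric data vector)
freq <- table(PH)                                  # absolute frequencies
cumfreq <- cumsum(freq)                            # cumulative frequencies
relfreq <- round(freq / length(PH), digits = 3)    # relative frequencies
cumrel <- round(cumfreq / length(PH), digits = 3)  # cumulative relative frequencies
cbind(freq, cumfreq, relfreq, cumrel)
```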
Now we look at some basic properties of our samples, calculated with R:
The *sample mean*, which is given by $\overline{x}=\frac{1}{n}\sum_{i=1}^nx_i$:
$$\overline{PH}=7.096154~~~~~~~~\overline{PB}=0.3224038$$
The *sample variance*, which is given by $s^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\overline{x})^2$:
$$s_{PH}^2=0.6741026~~~~~~s_{PB}^2=0.01302585$$
The *sample standard deviation* $s$, which is given by $\sqrt{s^2}$:
$$s_{PH}=0.8210375~~~~~s_{PB}=0.1141309$$
The *median*, which for even $n$ is given by $\frac{1}{2}(x'_{n/2}+x'_{n/2+1})$, where $x'$ denotes the sorted sample:
$$median(PH)=7.2~~~~~~median(PB)=0.328$$
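For reference, these values can be obtained directly with R's built-in functions (a sketch, assuming the samples are stored in the vectors `PH` and `PB`):
```
# Built-in estimators for the statistics above
mean(PH);   mean(PB)     # sample means
var(PH);    var(PB)      # sample variances (denominator n-1)
sd(PH);     sd(PB)       # sample standard deviations
median(PH); median(PB)   # medians
```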
The *mode* of a sample is the value that appears most often in the data set. For the mode of our two samples we have to do a little more work, as R provides no predefined mode function. We use the following function:
```
# Return the most frequent value of the vector v
getmode <- function(v) {
  uniqv <- unique(v)                           # distinct values
  uniqv[which.max(tabulate(match(v, uniqv)))]  # value with the highest count
}
```
With that function we now get: $$getmode(PH)=7.2~~~~~getmode(PB)=0.424$$
The *skewness*, which is given by $skw(x)=\sum_{i=1}^n\frac{(x_i-\overline{x})^3}{ns_x^3}$ and is a measure of asymmetry:
$$skw(PH)=-0.4199191~~~~skw(PB)=-0.04034458$$
As both values are negative, we know (this is how skewness is interpreted) that the mass of both sample distributions is concentrated on the right side, with a longer tail to the left. For the PH values this can be seen very well in the following graphic:
*(Figure: histogram of the PH sample.)*
For the lead level PB the concentration to the right is very hard to see with the naked eye:
*(Figure: histogram of the PB sample.)*
The reason for that is that the absolute value of $skw(PB)$ is very small.
The *kurtosis*, which is given by $kurt(x)=\sum_{i=1}^n\frac{(x_i-\overline{x})^4}{ns_x^4}$ and which gives information about the tail of the distribution:
$$kurt(PH)=2.946431~~~~~kurt(PB)=2.130508$$
Some background to interpret these numbers: the normal distribution has kurtosis 3. If the kurtosis of a sample is higher than 3, the tails of the distribution are heavier than those of the normal distribution (and if it is smaller than 3, lighter). That the kurtosis of PH is close to 3 is a first hint that the PH sample was generated by normally distributed random variables; we will return to this in Exercise 4.
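Skewness and kurtosis are not built into base R; the values above can be reproduced with a direct implementation of the two formulas (a sketch; packages such as `moments` provide equivalent functions):
```
# Direct implementation of the skewness and kurtosis formulas above
skw <- function(x) {
  n <- length(x)
  sum((x - mean(x))^3) / (n * sd(x)^3)
}
kurt <- function(x) {
  n <- length(x)
  sum((x - mean(x))^4) / (n * sd(x)^4)
}
skw(PH); skw(PB)
kurt(PH); kurt(PB)
```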
# Exercise 2: Interval estimation for the mean and standard deviation of the two samples
As we saw in Exercise 1, it is not very difficult to determine the mean and standard deviation of given sample data. This exercise, however, is about the mean and standard deviation of the random variables from which our samples are drawn. First of all we have to mention that we cannot calculate these parameters exactly; but we can determine intervals that contain them with a given confidence level.
For the mean we can use following result from the lecture notes:
**Theorem 3.2:**
*If $\overline{X}$ and $S$ are the sample mean and sample standard deviation of a random sample of size $n$ drawn from a normal population of unknown variance, a confidence interval of coefficient $1-\alpha$ for the unknown mean value $\mu$ of the population is:*
$$\left(\overline{X}-t_{\alpha/2;n-1}\frac{S}{\sqrt{n}},\ \overline{X}+t_{\alpha/2;n-1}\frac{S}{\sqrt{n}}\right)$$
*where $t_{\alpha;\nu}$ denotes the upper $\alpha$-quantile of Student's distribution with $\nu$ degrees of freedom.*
The implementation can be seen in the following code (here for PH; PB is analogous):
```
# Interval estimation of the mean of PH
# (mean_PH and sd_PH are the sample mean and standard deviation, lPH the sample size)
standarderror <- sd_PH / sqrt(lPH)
alpha <- 0.05
quantile1 <- qt(1 - (alpha / 2), df = lPH - 1)
amplitude1 <- quantile1 * standarderror
# Calculate the borders of the interval
lower_PH <- mean_PH - amplitude1
upper_PH <- mean_PH + amplitude1
lower_PH
upper_PH
```
With that we get the 90% ($\alpha=0.1$) confidence intervals (rounded to three digits) of the mean:
- PH: $(6.905,7.287)$
- PB: $(0.296,0.349)$
For the 95% ($\alpha=0.05$) intervals we get:
- PH: $(6.868,7.325)$
- PB: $(0.291,0.354)$
And for the 99% ($\alpha=0.01$) intervals we furthermore get:
- PH: $(6.792,7.401)$
- PB: $(0.280,0.365)$
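These intervals can be cross-checked with R's built-in `t.test` function, which reports the same Student-t confidence interval for the mean:
```
# Cross-check of the confidence intervals for the mean
t.test(PH, conf.level = 0.95)$conf.int
t.test(PB, conf.level = 0.99)$conf.int
```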
For the confidence intervals of the standard deviation we can use the following corollary from the lecture notes, which follows from Theorem 5.2 together with Proposition 5.2:
**Corollary:**
*Let $S^2$ be the sample variance of a random sample of size n taken from a normal population of unknown mean. Then a confidence interval with coefficient $1−\alpha$ for the standard deviation $\sigma$ of the population is given by:*
$$\left(\sqrt{\frac{(n-1)S^2}{\chi^2_{\alpha/2;n-1}}},\ \sqrt{\frac{(n-1)S^2}{\chi^2_{1-\alpha/2;n-1}}}\right)$$
The implementation of this can be seen in the following code:
```
# Interval estimation of the standard deviation of PH
# (var_PH is the sample variance, lPH the sample size)
alpha <- 0.01
quantile1 <- qchisq(1 - alpha / 2, df = lPH - 1)
quantile2 <- qchisq(alpha / 2, df = lPH - 1)
# Calculate the extremes of the interval
sdlower_PH <- sqrt((lPH - 1) * var_PH / quantile1)
sdupper_PH <- sqrt((lPH - 1) * var_PH / quantile2)
sdlower_PH
sdupper_PH
```
With that we get the 90% ($\alpha=0.1$) confidence intervals (rounded to three digits) of the standard deviation:
- PH: $(0.708,0.983)$
- PB: $(0.098,0.137)$
For the 95% ($\alpha=0.05$) intervals we get:
- PH: $(0.688,1.018)$
- PB: $(0.096,0.142)$
And for the 99% ($\alpha=0.01$) intervals we furthermore get:
- PH: $(0.653,1.094)$
- PB: $(0.091,0.152)$
# Exercise 3: Hypothesis Test
This exercise is about testing the null hypothesis "the mean of the random variable from which our PH sample is generated is 7.0" against the alternative hypothesis "the mean is smaller than 7.0". Before we come to the implementation of this hypothesis test, we have to deal with some theory, as we cannot use results from the lecture notes directly: we have to find a critical region *C* for a most powerful test of the given amplitude $\alpha$.
A first result which helps us on the way to finding *C* is the following:
**Proposition 3.1:**
*Let $(X_1,...,X_n)$ be a random sample drawn from a normal population of unknown mean $\mu$ and known variance $\sigma^2$. The likelihood-ratio test of amplitude $\alpha$ for the null hypothesis $H_0:\mu\le\mu_0$ against the alternative hypothesis $H_1:\mu>\mu_0$ is the one with critical region:*
$$C=\{(x_1,...,x_n):\overline{x}\ge\mu_0+z_\alpha\frac{\sigma}{\sqrt{n}}\}$$
This proposition requires that the sample is drawn from a normal population; we will show in Exercise 4 that we may assume this for our PH sample. At first sight the proposition may not look very helpful, but with a few thoughts we can derive from it a result that only has to be implemented in the end.
First we define the random sample $(Y_1,...,Y_n):=(-X_1,...,-X_n)$, where $(X_1,...,X_n)$ is a random sample drawn from a normal population; then $(Y_1,...,Y_n)$ is one too, with $\mu_Y=-\mu_X$ and $\sigma_Y=\sigma_X$. If we now apply *Proposition 3.1* to this sample with threshold $-\mu_0$, we have:
- the null hypothesis $H_0:\mu_Y\le-\mu_0$ ($\Leftrightarrow \mu_X\ge\mu_0$)
- the alternative hypothesis $H_1:\mu_Y>-\mu_0$ ($\Leftrightarrow \mu_X<\mu_0$)
- $C=\{(y_1,...,y_n):\overline{y}\ge-\mu_0+z_\alpha\frac{\sigma_Y}{\sqrt{n}}\}=\{(x_1,...,x_n):-\overline{x}\ge-\mu_0+z_\alpha\frac{\sigma_X}{\sqrt{n}}\}=\{(x_1,...,x_n):\overline{x}\le\mu_0-z_\alpha\frac{\sigma_X}{\sqrt{n}}\}$
With that we can already derive the following result:
**Proposition:**
*Let $(X_1,...,X_n)$ be a random sample drawn from a normal population of unknown mean $\mu$ and known variance $\sigma^2$. The likelihood-ratio test of amplitude $\alpha$ for the null hypothesis $H_0:\mu\ge\mu_0$ against the alternative hypothesis $H_1:\mu<\mu_0$ is the one with critical region:*
$$C=\{(x_1,...,x_n):\overline{x}\le\mu_0-z_\alpha\frac{\sigma}{\sqrt{n}}\}$$
We still have two problems to handle:
- the null hypothesis is not the same as in the exercise,
- the proposition requires the variance to be known, which is not the case.
We deal with the first problem first: we consider the hypothesis test with the same alternative hypothesis as in the previous proposition, but with "$=$" instead of "$\ge$" in the null hypothesis. It is easy to see that the critical region of this test has to be the same as in the proposition, because in the case $\overline{x}>\mu_0$ we obviously prefer the hypothesis $\mu=\mu_0$ over the hypothesis $\mu<\mu_0$. We get:
**Proposition:**
*Let $(X_1,...,X_n)$ be a random sample drawn from a normal population of unknown mean $\mu$ and known variance $\sigma^2$. The likelihood-ratio test of amplitude $\alpha$ for the null hypothesis $H_0:\mu=\mu_0$ against the alternative hypothesis $H_1:\mu<\mu_0$ is the one with critical region:*
$$\mathcal{C}=\{(x_1,...,x_n):\overline{x}\le\mu_0-z_\alpha\frac{\sigma}{\sqrt{n}}\}$$
Now we handle the second of the problems mentioned above. We skip the proof at this point and refer to the connection between Propositions 3.2 and 3.3: they treat the same hypotheses, with known variance in 3.2 and unknown variance in 3.3. In the critical region only the standard deviation $\sigma$ changes to the sample standard deviation $S$, and the upper quantile $z_\alpha$ of the standard normal distribution becomes the upper quantile $t_{\alpha;n-1}$ of Student's distribution with $n-1$ degrees of freedom. This correspondence between results with known and unknown variance appears in several chapters, which is why the proof is not of great importance here.
**Proposition:**
*Let $(X_1,...,X_n)$ be a random sample drawn from a normal population of unknown mean $\mu$ and unknown variance. The likelihood-ratio test of amplitude $\alpha$ for the null hypothesis $H_0:\mu=\mu_0$ against the alternative hypothesis $H_1:\mu<\mu_0$ is the one with critical region:*
$$\mathcal{C}=\{(x_1,...,x_n):\overline{x}\le\mu_0-t_{\alpha;n-1}\frac{S}{\sqrt{n}}\}$$
In R the implementation of this hypothesis test with $\alpha=0.05$ looks like this:
```
# Test statistic: the sample mean
valuestat <- mean_PH
valuestat
alpha <- 0.05
# Critical value: mu0 - t_{alpha;n-1} * s / sqrt(n) with mu0 = 7.0
crtval <- 7.0 - qt(1 - alpha, df = lPH - 1) * (sd_PH / sqrt(lPH))
crtval
# Refuse H0 if the sample mean falls into the critical region
if (valuestat <= crtval) {
  print("Refuse Ho")
} else {
  print("Accept Ho")
}
```
Running this program prints "Accept H_0", so we accept the null hypothesis. But careful: this result by itself does not show that the mean of the random variable that generated our PH sample is close to 7.0. Even with a sample mean of 500 we would have accepted the null hypothesis, although the true mean would then be much higher. Still, our sample mean is close to 7.0, so the hypothesis "the mean of PH is 7.0" is a realistic statement.
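As a cross-check, R's built-in `t.test` performs the same one-sided test; we accept $H_0$ whenever the reported p-value is at least $\alpha$:
```
# Built-in equivalent: H0 mu = 7.0 against H1 mu < 7.0
t.test(PH, mu = 7.0, alternative = "less")
```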
Now we turn to the second part of the exercise: what changes if the alternative hypothesis becomes "the mean is not equal to 7.5"? First of all, this different alternative hypothesis has the effect that there are now also values greater than $7.0$ for which the null hypothesis will be refused.
As the lecture notes contain no result that helps us directly, we have to think a little about the theory before we continue with the implementation. The following result from the lecture notes will help us:
**Proposition 3.3:**
*Let $(X_1,...,X_n)$ be a random sample drawn from a normal population of unknown mean $\mu$ and unknown variance. The likelihood-ratio test of amplitude $\alpha$ for the null hypothesis $H_0:\mu=\mu_0$ against the alternative hypothesis $H_1:\mu\neq\mu_0$ is the one with critical region:*
$$\mathcal{C}=\{(x_1,...,x_n):|\overline{x}-\mu_0|\ge t_{\alpha/2;n-1}\frac{S}{\sqrt{n}}\}$$
Now we have to ask how the critical region changes when the alternative hypothesis becomes $H_1:\mu\neq\mu_1$ with $\mu_1\neq\mu_0$. The answer is simple: nothing changes. This is easy to see: consider the case $\overline{x}=7.5$, which contradicts the alternative hypothesis "the mean is not equal to 7.5" the most. We still have to refuse the null hypothesis, because $\overline{x}$ is too far away from 7.0; so we decide for the alternative hypothesis even though the data suggest it is false.
The implementation of the hypothesis test can be seen in the following code:
```
# Test statistic: distance of the sample mean from mu0 = 7.0
valuestat <- abs(mean_PH - 7.0)
valuestat
alpha <- 0.05
# Critical value: t_{alpha/2;n-1} * s / sqrt(n)
crtval <- qt(1 - alpha / 2, df = lPH - 1) * (sd_PH / sqrt(lPH))
crtval
if (valuestat >= crtval) {
  print("Refuse Ho")
} else {
  print("Accept Ho")
}
```
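Again this can be cross-checked with the built-in two-sided test:
```
# Built-in equivalent: H0 mu = 7.0 against H1 mu != 7.0
t.test(PH, mu = 7.0, alternative = "two.sided")
```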
# Exercise 4: Chi-square test for normality
Now we perform a Chi-square goodness-of-fit test to see whether our PH data come from a normal distribution. For that we use the following proposition:
**Proposition 4.1:**
*Under the condition that the null hypothesis $H_0:p_i=\pi_i$ is true, as $n$ diverges the statistic*
$$\Xi =\sum_{i=1}^{k} \frac{(N_i - n\pi_i)^2}{n\pi_i},$$
*where the random variables $N_1,N_2,...,N_k$ are the absolute frequencies of the classes, approaches a chi-square distribution with $k-1$ degrees of freedom.*
Following the indications, we first divide our data into 5 classes such that the expected frequencies of all classes are higher than 5. In particular, we have selected these classes:
| Interval | [4.7,6.2] | (6.2,6.7] | (6.7,7.2] | (7.2,7.7] | (7.7,8.8] |
|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| Observed | 6 | 11 | 12 | 9 | 13 |
Under the null (normal) hypothesis the class probabilities are:
- $\pi_1:0.138$
- $\pi_2:0.177$
- $\pi_3:0.236$
- $\pi_4:0.219$
- $\pi_5:0.231$
And the expected frequencies:
- $n \cdot \pi_1:7.15$
- $n \cdot \pi_2:9.21$
- $n \cdot \pi_3:12.25$
- $n \cdot \pi_4:11.37$
- $n \cdot \pi_5:12.01$
These probabilities are calculated by R from the normal distribution with the estimated parameters, i.e. with density function
$$f(x)=\frac{1}{\sqrt{2\pi}\,S}e^{-\frac{(x-\overline{x})^2}{2S^2}}.$$
The implementation looks like this:
```
# Observed class frequencies
# (include.lowest = TRUE keeps the minimum value 4.7 in the first class)
breaks1 <- c(4.7, 6.2, 6.7, 7.2, 7.7, 8.8)
frequency_PH <- table(cut(PH, breaks = breaks1, include.lowest = TRUE))
frequency_PH
# Class probabilities under the fitted normal distribution
distnorm <- c(pnorm(6.2, mean = mean_PH, sd = sd_PH),
              pnorm(6.7, mean = mean_PH, sd = sd_PH) - pnorm(6.2, mean = mean_PH, sd = sd_PH),
              pnorm(7.2, mean = mean_PH, sd = sd_PH) - pnorm(6.7, mean = mean_PH, sd = sd_PH),
              pnorm(7.7, mean = mean_PH, sd = sd_PH) - pnorm(7.2, mean = mean_PH, sd = sd_PH),
              1 - pnorm(7.7, mean = mean_PH, sd = sd_PH))
distnorm
# Expected frequencies n * pi_i
expfreq2 <- round(lPH * distnorm, digits = 2)
expfreq2
```
Using now *Proposition 4.1* we can calculate the value assumed by the Chi-square statistic:
$$\chi^2 = \sum_{i=1}^{5} \frac{(n_i-52\cdot\pi_i)^2}{52\cdot\pi_i}=1.11 $$
Because $\chi^2_{0.05;3}=7.815$, the critical region is $\mathcal{C}=\{(x_1,x_2,...,x_n):\chi^2 \geq 7.815\}$. The computed Chi-square value $\chi^2=1.11$ is less than $\chi^2_{0.05;3}$, so the observed realization does not belong to the critical region $\mathcal{C}$ and we accept $H_0$. We implemented this in R in the following way:
```
# Chi-square statistic and decision at level alpha = 0.05
chisquare <- round(sum((frequency_PH - expfreq2)^2 / expfreq2), digits = 2)
chisquare
alpha <- 0.05
degofreed <- 3
criticalval <- qchisq(1 - alpha, df = degofreed)
criticalval
if (chisquare >= criticalval) {
  print("Refuse Ho")
} else {
  print("Accept Ho")
}
```
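The statistic can also be cross-checked with R's built-in `chisq.test`. Note that it uses $k-1=4$ degrees of freedom and does not account for the estimated mean and standard deviation, so only the value of the statistic itself is directly comparable:
```
# Cross-check: same statistic, but k-1 = 4 degrees of freedom
chisq.test(as.vector(frequency_PH), p = distnorm)
```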
# Exercise 5: Linear regression
In the last exercise we derive the estimated regression line $y=a+bz$. Using a scatter plot we represent the data pairs $(PH,PB)$:
*(Figure: scatter plot of the data pairs (PH, PB).)*
We calculated the regression line with these commands:
```
lm(PB ~ PH)          # fit the linear model PB = a + b * PH
abline(lm(PB ~ PH))  # add the fitted line to the scatter plot
```
This gives us $b=0.05278$ and $a=-0.05212$. If we now add the line $a+bz$ to the plot, we get the following graphic:
*(Figure: scatter plot with the fitted regression line.)*
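The coefficients can also be verified by hand with the least-squares formulas $b=\frac{S_{XY}}{S_{XX}}$ and $a=\overline{y}-b\overline{x}$:
```
# Least-squares coefficients computed from the formulas
b <- cov(PH, PB) / var(PH)    # slope
a <- mean(PB) - b * mean(PH)  # intercept
c(a = a, b = b)
```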
Now we have to compute the residuals, i.e. the differences between the observed values and the fitted ones. We do this in R as follows:
```
modello <- lm(PB ~ PH)          # the fitted linear model ("modello" = model)
modello$coefficients
stime <- modello$fitted.values  # fitted values ("stime" = estimates)
modello$residuals
segments(PH, stime, PH, PB, col = "green")  # draw the residuals into the plot
```
We now calculate the *coefficient of determination*, which measures how well the model fits the data: $R^2=\frac{S^2_{XY}}{S_{XX}S_{YY}}=1-\frac{SS_R}{S_{YY}}$. We can compute this in R easily with the following line:
```
summary(modello)
```
This returns:

```
Call:
lm(formula = PB ~ PH)

Residuals:
      Min        1Q    Median        3Q       Max
-0.215718 -0.085010  0.008671  0.070116  0.195393

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.05212    0.12990  -0.401   0.6900
PH           0.05278    0.01819   2.902   0.0055 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1066 on 50 degrees of freedom
Multiple R-squared:  0.1442,    Adjusted R-squared:  0.127
F-statistic: 8.422 on 1 and 50 DF,  p-value: 0.005501
```
Looking at our $R^2$ value, which is close to zero (PH explains only about 14% of the variance of PB), we conclude that the linear model is not a good predictor. This is confirmed by the previous image.
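As a quick check: in simple linear regression $R^2$ equals the squared sample correlation coefficient:
```
# R^2 as squared correlation
cor(PH, PB)^2   # approximately 0.1442, matching the summary output
```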

Looking at the residuals, there is no obvious transformation of the data that would improve the prediction.