---
tags: credibility
spellcheck: true
---

Credibility Theory
===

$$
\renewcommand{\labelenumi}{{(\arabic{enumi})}} \def\empP{\mathbb{P}_n} \def\seq#1{\{#1\}_{n\geq1}} \def\E#1{\mathbb{E}{\left(#1\right)}} \def\EP#1#2{\mathbb{E}_{#1}{\left(#2\right)}} \def\VarP#1#2{{\rm Var}_{#1}{\left(#2\right)}} \def\Var#1{{\rm Var}\left(#1\right)} \def\Cov#1#2{{\rm Cov}\left(#1,#2\right)} \def\Pr#1{{\rm Pr}\left(#1\right)} \def\st{\scriptstyle} \def\msp{(\Omega,\mathcal{F},P)} \def\sst{\scriptscriptstyle} \def\ts{\textstyle} \def\eqd{\buildrel \rm d \over =} \def\eqdef{\buildrel \rm def \over =} \def\wc{\buildrel {\rm d} \over \longrightarrow} \def\uc{\hookrightarrow} \def\asc{\buildrel {\rm a.s.} \over \longrightarrow} \def\convp{\buildrel {\rm p} \over \longrightarrow} \def\convLp#1{\buildrel {L^{#1}} \over \longrightarrow} \def\calL#1{{\calligraphic L}\left(#1\right)} \def\calF{{\mathcal{F}}} \def\calI{{\mathcal{I}}} \def\calA{{\mathcal{A}}} \def\calD{{\mathcal{D}}} \def\calC{{\mathcal{C}}} \def\calH{{\mathcal{H}}} \def\calS{{\mathcal{S}}} \def\calB{{\mathcal{B}}} \def\calG{{\mathcal{G}}} \def\calM{{\mathcal{M}}} \def\calN{{\mathcal{N}}} \def\calP{{\mathcal{P}}} \def\dF{{\rm F}} \def\SF{{\rm S}} \def\vX{\tilde{X}} \def\vx{\tilde{x}} \def\v#1{\tilde{#1}} \def\Real{\mathbb{R}} \def\iff{if and only if } \def\wrt{with respect to } \def\TVnorm#1{\Vert #1 \Vert_{\scriptscriptstyle \textsf{TV}}} \def\Lpnorm#1#2{\Vert #1 \Vert_{#2}} \def\Supnorm#1{\Vert #1 \Vert_{\infty}} \def\KL#1#2{\mathsf{KL}\left(#1,#2\right)} \def\Hell#1#2{\mathsf{H}\left(#1,#2\right)} \def\Ind#1{\mathbb{1}_{#1}} \def\Indset#1{\mathbf{1}\left\{#1\right\}} \DeclareMathOperator{\sgn}{sgn} \newcommand{\comp}[1]{{#1}^{\mathsf{c}}} \DeclareMathOperator{\tr}{tr} \def\lL{\dot{\it l}_{\theta_0}^\ast} \def\l{\dot{\it l}_{\theta_0}} \DeclareMathOperator{\contig}{\triangleleft} \DeclareMathOperator{\mcontig}{\triangleleft\,\triangleright} \def\dpost{P_{\bar{H}_n|\vX_n}} \def\lik#1{P_{n, #1}}
$$

## Introduction

In this note we present a streamlined and straightforward way of looking at credibility theory. We present it from the Bayesian point of view, even though that is not how it originated. Some of the arguments we present are likely not to be found anywhere else - Enjoy!

Credibility theory tries to solve the following type of problem: we have collected loss data for, say, $n$ periods, and using these data we wish to predict or estimate the mean of the losses in the $(n+1)$-st period. The idea is that when $n$ is small, one needs to rely more on *expert prior knowledge* of the loss-generating process to obtain well-behaved predictors, and then smoothly transition towards relying exclusively on the data as $n$ increases. In the modern day, such a motivation clearly points to the Bayesian methodology, so it should come as no surprise that we adopt the Bayesian point of view below. It is worth pointing out, though, that Bayesian inference is far more encompassing than credibility theory.

## Some General Results:

In the following we suppose that, given $\theta$, $X_1,X_2,\ldots,X_n,X_{n+1}$ are independent random variables with means $\mu_i(\theta)$ and variances $\sigma_i^2(\theta)$, for $i=1,\ldots,n+1$. Also, it is assumed that our prior, uncertain knowledge of $\theta$ is expressed through a distribution, which we denote by $\pi(\cdot)$.
While we at times will find it convenient to work with the random variable $\Theta$, we emphasize that the underlying $\theta$ is non-random; the probabilistic setup in which our uncertain knowledge of $\theta$ is modeled by a distribution is just a framework leading to a coherent methodology.

In the Bayesian setup, if we are interested in estimating $\mu_{n+1}(\theta)$, then given a loss function $L(\cdot,\cdot)$, our estimate would be
$$
\hat{\mu}_{n+1}(\vX_n):=\arg\min_a \E{L(\mu_{n+1}(\Theta),a)\vert \tilde{X}_n},
$$
where $\tilde{X}_n:=(X_1,\ldots,X_n)$. In particular, if the loss function is squared error then we have
$$
\hat{\mu}_{n+1}(\vX_n):= \E{\mu_{n+1}(\Theta)\vert \tilde{X}_n}.
$$
Since credibility theory originated in an era of poor computing resources, the credibility estimator was constrained to be a linear function of the data (*i.e.* a linear statistic/estimator), and the loss function was taken to be squared error.

**Theorem 1:** All of the following ways of defining the credibility estimator are equivalent:

1. $$\arg\min_{\phi \hbox{ linear}} \E{(X_{n+1}-\phi(\vX_n))^2}$$
2. $$\arg\min_{\phi \hbox{ linear}} \E{(\mu_{n+1}(\Theta)-\phi(\vX_n))^2}$$
3. $$\arg\min_{\phi \hbox{ linear}} \E{(\E{\mu_{n+1}(\Theta)\vert \tilde{X}_n}-\phi(\vX_n))^2}$$

<details>
<summary> Proof: </summary>

**1. is equivalent to 2.**
$$
\begin{aligned}
\E{(X_{n+1}-\phi(\vX_n))^2} &= \E{\E{(X_{n+1}-\phi(\vX_n))^2\vert \Theta,\vX_n}}\\
&=\E{\E{(X_{n+1}-\mu_{n+1}(\Theta)-[\phi(\vX_n)-\mu_{n+1}(\Theta)])^2\vert \Theta,\vX_n}}\\
&=\E{\E{(X_{n+1}-\mu_{n+1}(\Theta))^2\vert \Theta,\vX_n}} \\
&\quad +\E{\E{(\phi(\vX_n)-\mu_{n+1}(\Theta))^2\vert \Theta,\vX_n}}\\
&\quad -2 \E{\E{(X_{n+1}-\mu_{n+1}(\Theta))(\phi(\vX_n)-\mu_{n+1}(\Theta))\vert \Theta,\vX_n}}.
\end{aligned}
$$
Note that
$$
\E{(X_{n+1}-\mu_{n+1}(\Theta))(\phi(\vX_n)-\mu_{n+1}(\Theta))\vert \Theta,\vX_n}=(\phi(\vX_n)-\mu_{n+1}(\Theta))\E{(X_{n+1}-\mu_{n+1}(\Theta))\vert \Theta,\vX_n}=0.
$$
Hence,
$$
\E{(X_{n+1}-\phi(\vX_n))^2} =\E{(X_{n+1}-\mu_{n+1}(\Theta))^2} +\E{(\phi(\vX_n)-\mu_{n+1}(\Theta))^2}.
$$
As the first term is free of $\phi$, we have the equivalence of 1. and 2.

**2. is equivalent to 3.**
We begin by observing, as above, that
$$
\begin{aligned}
\E{(\phi(\vX_n)-\mu_{n+1}(\Theta))^2}&= \E{\E{(\phi(\vX_n)-\hat{\mu}_{n+1}(\vX_n)-[\mu_{n+1}(\Theta)-\hat{\mu}_{n+1}(\vX_n)])^2\vert \vX_n}}\\
&=\E{\E{(\phi(\vX_n)-\hat{\mu}_{n+1}(\vX_n))^2\vert \vX_n}}\\
&\quad + \E{\E{(\mu_{n+1}(\Theta)-\hat{\mu}_{n+1}(\vX_n))^2\vert \vX_n}}\\
&\quad - 2\E{\E{(\phi(\vX_n)-\hat{\mu}_{n+1}(\vX_n))(\mu_{n+1}(\Theta)-\hat{\mu}_{n+1}(\vX_n))\vert \vX_n}}.
\end{aligned}
$$
Again, the cross-product term equals zero, as
$$
\begin{aligned}
\E{(\phi(\vX_n)-\hat{\mu}_{n+1}(\vX_n))(\mu_{n+1}(\Theta)-\hat{\mu}_{n+1}(\vX_n))\vert \vX_n}&=(\phi(\vX_n)-\hat{\mu}_{n+1}(\vX_n))\E{(\mu_{n+1}(\Theta)-\hat{\mu}_{n+1}(\vX_n))\vert \vX_n}\\
&=0.
\end{aligned}
$$
Hence,
$$
\E{(\phi(\vX_n)-\mu_{n+1}(\Theta))^2}= \E{(\phi(\vX_n)-\hat{\mu}_{n+1}(\vX_n))^2} + \E{(\mu_{n+1}(\Theta)-\hat{\mu}_{n+1}(\vX_n))^2}.
$$
As the second term is free of $\phi$, we have the equivalence of 2. and 3.
</details>
<P> </P>

**Remarks:** Note that the first objective function relates to a prediction problem, the second to an estimation problem, and the third to an approximation problem. Isn't that neat?
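As a quick numerical sanity check of Theorem 1, here is a small Monte Carlo sketch in R: for a conjugate Poisson-Gamma model (the prior, the number of periods and the number of replications are purely illustrative choices), each of the three population objectives is approximated by ordinary least squares on a large simulated sample, and the three fitted linear functions essentially coincide.
<pre><tt>
set.seed(1)
B <- 100000   # Monte Carlo replications
n <- 5        # number of observed periods

# Illustrative conjugate model: Theta ~ Gamma(shape=3, rate=2) and, given Theta,
# X_1,...,X_{n+1} are iid Poisson(Theta), so that mu_{n+1}(Theta) = Theta.
theta <- rgamma(B, shape = 3, rate = 2)
X     <- matrix(rpois(B*(n + 1), rep(theta, n + 1)), nrow = B)
Xn    <- X[, 1:n]                        # the data X_1,...,X_n
Xnext <- X[, n + 1]                      # the quantity to be predicted
postm <- (3 + rowSums(Xn))/(2 + n)       # E(Theta | X_1,...,X_n), closed form here

coef(lm(Xnext ~ Xn))   # objective 1: predict X_{n+1}
coef(lm(theta ~ Xn))   # objective 2: estimate mu_{n+1}(Theta) = Theta
coef(lm(postm ~ Xn))   # objective 3: approximate the posterior mean
# All three coefficient vectors agree up to Monte Carlo error.
</tt></pre>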
Now we will prove some elementary lemmas:

**Lemma 1:** Let $X$ and $Y$ be two zero-mean random variables with finite second moments. Then
$$
\arg \min_{w\in \Real} \Var{wX+(1-w)Y}=\frac{\sigma_Y^2-\sigma_{X,Y}}{\sigma_X^2+\sigma_Y^2-2\sigma_{X,Y}},
$$
where $\sigma_{X,Y}$ is the covariance between $X$ and $Y$.

<details>
<summary> Proof: </summary>

It is straightforward to see that
$$
\Var{wX+(1-w)Y}=w^2\Var{X}+(1-w)^2\Var{Y}+2w(1-w)\Cov{X}{Y}.
$$
Differentiating with respect to $w$ and equating to zero yields the above expression for the optimal solution. Note that the coefficient of $w^2$ equals $\Var{X}+\Var{Y}-2\Cov{X}{Y}$, and by the Cauchy-Schwarz inequality we have
$$
\vert \Cov{X}{Y} \vert \leq \sqrt{\Var{X}\Var{Y}},
$$
which implies that
$$\Var{X}+\Var{Y}-2\Cov{X}{Y}\geq 0.$$
Hence, $\Var{wX+(1-w)Y}$ is an upright parabola in $w$, and the extremum found is the global minimum.
</details>
<P></P>

In the following we will use the notation
$$
\mu=\E{\mu(\Theta)},\quad \nu=\E{\sigma^2(\Theta)},\quad \alpha=\Var{\mu(\Theta)},\quad \hbox{and} \quad \kappa=\frac{\E{\sigma^2(\Theta)}}{\Var{\mu(\Theta)}}.
$$

**Lemma 2:** Assume that, conditioned on $\Theta$, $\{X_i\}_{i\geq 1}$ are independent and have identical first two (conditional) moments. Now consider the optimization problem
$$
\min_{a,b\in \Real} \E{(\mu(\Theta)-(a+b\bar{X}_n))^2}.
$$
The optimal solution is given by
$$
b=\frac{n}{n+\kappa}, \quad \hbox{and} \quad a=\frac{\kappa}{n+\kappa}\mu.
$$

<details>
<summary> Proof: </summary>

Note that for any fixed $b$, the optimal $a$ will be the one that satisfies
$$
\E{\mu(\Theta)-(a+b\bar{X}_n)}=0;
$$
this yields $a=(1-b)\mu$. So our objective function can be written as
$$
\E{([\mu(\Theta)-\mu]-b[\bar{X}_n-\mu])^2}.
$$
If we define $Y:=[\mu(\Theta)-\mu]$, and $Y-X:=[\bar{X}_n-\mu]$ (so that $X=\mu(\Theta)-\bar{X}_n$), then we can apply Lemma 1 to conclude that the optimal $b$ is given by
$$
b=\frac{\sigma_{\mu(\Theta)}^2-\sigma_{\mu(\Theta)-\bar{X}_n,\mu(\Theta)}}{\sigma_{\mu(\Theta)-\bar{X}_n}^2+\sigma_{\mu(\Theta)}^2-2\sigma_{\mu(\Theta)-\bar{X}_n,\mu(\Theta)}}.
$$
We now observe that
$$
\sigma_{\mu(\Theta)-\bar{X}_n}^2=\Var{\mu(\Theta)-\bar{X}_n}=\E{\Var{\bar{X}_n\vert \Theta}}=\E{\sigma^2(\Theta)}/n,
$$
and $\sigma_{\mu(\Theta)-\bar{X}_n,\mu(\Theta)}=0$ since $\E{\mu(\Theta)-\bar{X}_n\vert \Theta}=0$. Hence,
$$
\begin{aligned}
b&=\frac{\sigma_{\mu(\Theta)}^2-\sigma_{\mu(\Theta)-\bar{X}_n,\mu(\Theta)}}{\sigma_{\mu(\Theta)-\bar{X}_n}^2+\sigma_{\mu(\Theta)}^2-2\sigma_{\mu(\Theta)-\bar{X}_n,\mu(\Theta)}}\\
&=\frac{\Var{\mu(\Theta)}}{\E{\sigma^2(\Theta)}/n+\Var{\mu(\Theta)}}\\
&=\frac{n}{n+\E{\sigma^2(\Theta)}/\Var{\mu(\Theta)}}=\frac{n}{n+\kappa}.
\end{aligned}
$$
</details>
<P></P>

Note that Lemma 2 says that the optimal estimator, according to any of the objective functions of Theorem 1, which is moreover linear in $\bar{X}_n$, is given by
$$
\frac{n}{n+\kappa} \bar{X}_n + \frac{\kappa}{n+\kappa}\mu.
$$
So to find the optimal linear estimator based on $\vX_n$, all we need to show is that the optimal linear estimator is a function of $\bar{X}_n$. This is done in the following lemma.

**Lemma 3:** The optimal linear estimator, under the assumptions of Lemma 2, is a function of $\bar{X}_n.$

<details>
<summary> Proof: </summary>

We first note that, by a simple application of the Cauchy-Schwarz inequality, if $Y_i,\;i=1,\dots,m$ all have the same variance then
$$
\Var{\bar{Y}}\leq \Var{Y_1}.
$$
This is so as
$$|\Cov{Y_i}{Y_j}|\leq \Var{Y_1}.$$
Now let us denote by $s_k(\vX_n)$ the vector $(X_{n-k+1},\ldots,X_n,X_1,\ldots,X_{n-k})$, for $k=1,\ldots,n-1$, and set $s_0(\vX_n):=\vX_n$.
Note that the linear function $\phi(\cdot)$ that is a solution of
$$\arg\min_{\phi \hbox{ linear}} \E{(\mu(\Theta)-\phi(\vX_n))^2}$$
is such that $\E{\mu(\Theta)-\phi(\vX_n)}=0$, and hence, using the previous observation, we have for any optimal $\phi(\cdot)$ that
$$
\begin{aligned}
\Var{\mu(\Theta)-\phi(\overline{X}_n\cdot\mathbf{1})}&=\Var{\frac{1}{n} \sum_{k=0}^{n-1}(\mu(\Theta)-\phi(s_k(\vX_n)))}\quad \hbox{(by linearity of $\phi$)}\\
&\leq \Var{\mu(\Theta)-\phi(\vX_n)} \quad\hbox{(as each summand has the same variance)}\\
&=\E{(\mu(\Theta)-\phi(\vX_n))^2} \quad (\hbox{as }\E{\mu(\Theta)-\phi(\vX_n)}=0).
\end{aligned}
$$
Since $\E{\mu(\Theta)-\phi(\overline{X}_n\cdot\mathbf{1})}=0$ as well (the $X_i$ share the same unconditional mean), the left-hand side equals $\E{(\mu(\Theta)-\phi(\overline{X}_n\cdot\mathbf{1}))^2}$, and so replacing $\phi(\vX_n)$ by $\phi(\overline{X}_n\cdot\mathbf{1})$, a function of $\bar{X}_n$, does not increase the objective. This completes the proof.
</details>

**Remarks:** To see why the weights intuitively make sense, note that if $\Var{\mu(\Theta)}=0$ then the prior mean $\mu$ is a zero-error predictor of $\mu(\Theta)$, and hence the weight on $\mu$ should be $1$. On the other hand, $\Var{\bar{X}_n}=\Var{\mu(\Theta)}+\frac{1}{n}\E{\sigma^2(\Theta)}$, so the larger $\E{\sigma^2(\Theta)}$ is, the lower the weight on $\bar{X}_n$ should be.

## The Models and the Credibility Estimators

**B&uuml;hlmann Model:** We assume that the $X_i$, for $i\geq 1$, are conditionally independent given $\Theta$, and have identical conditional means and variances, which we denote by $\mu(\Theta)$ and $\sigma^2(\Theta)$.

**Credibility Estimator in the case of the B&uuml;hlmann Model:** Note that the assumptions of the above lemmas are satisfied under the B&uuml;hlmann model, and hence the credibility estimator of the premium for the $(n+1)$-st year using the data of the first $n$ years is given by
$$ \frac{n}{n+\kappa} \bar{X}_n + \frac{\kappa}{n+\kappa}\mu.$$
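For concreteness, here is a minimal R sketch of the B&uuml;hlmann credibility premium just displayed, with the credibility factor $Z=n/(n+\kappa)$ made explicit; the function and argument names, and the numbers in the example call, are just illustrative.
<pre><tt>
# Buhlmann credibility premium:  Z * xbar + (1 - Z) * mu,  with  Z = n/(n + kappa).
# x     : the observed losses X_1,...,X_n
# mu    : the structural prior mean  E(mu(Theta))
# kappa : E(sigma^2(Theta)) / Var(mu(Theta))
buhlmann_premium <- function(x, mu, kappa) {
  n <- length(x)
  Z <- n/(n + kappa)               # the credibility factor
  Z*mean(x) + (1 - Z)*mu
}

# Example: three years of losses, prior mean 100, kappa = 5
buhlmann_premium(c(80, 120, 95), mu = 100, kappa = 5)
</tt></pre>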
**B&uuml;hlmann-Straub Model:** We assume that the $X_i$, for $i\geq 1$, are conditionally independent given $\Theta$, and have identical conditional means denoted by $\mu(\Theta)$; but the conditional variance of $X_i$ is given by $\sigma^2(\Theta)/m_i$.

**Credibility Estimator in the case of the B&uuml;hlmann-Straub Model:** The textbook and the other treatments I have looked at all reinvent the wheel here, as they lack the following insight. While the B&uuml;hlmann-Straub model is clearly more general than the B&uuml;hlmann model ($m_i\equiv 1$), the Aha! observation is that there is a B&uuml;hlmann model related to the B&uuml;hlmann-Straub model through which we can simply deduce the optimal estimator under the B&uuml;hlmann-Straub model - we describe this in the following:

---

Note that the obvious motivation for the B&uuml;hlmann-Straub model is to account for the fact that the $i$-th year's data is compressed and stored as an average. So let us suppose that we have the raw data, *i.e.* $Y_{i:1}, \ldots, Y_{i:m_i}$; in other words, $X_i$ in the B&uuml;hlmann-Straub model arises as
$$
X_i:=\frac{1}{m_i}\sum_{j=1}^{m_i} Y_{i:j}=:Y_{i\cdot}.
$$
So what assumptions on $\{Y_{i:j}\vert j=1,\ldots,m_i;i=1,\ldots n\}$ would be consistent with the B&uuml;hlmann-Straub model? The answer is that we can assume that
$$\{Y_{i:j}\vert j=1,\ldots,m_i;i=1,\ldots n\}$$
are, conditional on $\Theta$, all independent with identical conditional means and variances denoted by $\mu(\Theta)$ and $\sigma^2(\Theta)$. So these variables satisfy the conditions of the B&uuml;hlmann model, and hence the optimal linear predictor of $\mu(\Theta)$ is given by
$$ \frac{\sum m_i}{\sum m_i + \kappa} \left(\frac{1}{\sum m_i}\right)\sum_{i,j}Y_{i:j} + \frac{\kappa}{\sum m_i+\kappa}\mu.$$
But wait a minute, this can be rewritten as
$$\frac{m}{m + \kappa} \sum_{i}\frac{m_i}{m}X_{i} + \frac{\kappa}{m+\kappa}\mu,$$
which is a function of the compressed data available under the B&uuml;hlmann-Straub model (note that $m=\sum m_i$). Hence, since even in the presence of the "full data" the optimal linear estimator is a function of the compressed data, the above is the optimal linear estimator under the B&uuml;hlmann-Straub model as well. Done!

---
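A correspondingly small R sketch of the B&uuml;hlmann-Straub premium just derived (again with illustrative names); note that it reduces to the B&uuml;hlmann premium when every $m_i$ equals $1$.
<pre><tt>
# Buhlmann-Straub credibility premium based on the compressed data X_1,...,X_n,
# where X_i is the average of m_i raw observations:
#   Z * sum_i (m_i/m) X_i + (1 - Z) * mu,  with  m = sum_i m_i  and  Z = m/(m + kappa).
buhlmann_straub_premium <- function(x, m_i, mu, kappa) {
  m <- sum(m_i)
  Z <- m/(m + kappa)
  Z*sum(m_i*x)/m + (1 - Z)*mu
}

# With m_i = 1 for every year this is exactly the Buhlmann premium:
buhlmann_straub_premium(c(80, 120, 95), m_i = c(1, 1, 1),    mu = 100, kappa = 5)
buhlmann_straub_premium(c(80, 120, 95), m_i = c(10, 25, 40), mu = 100, kappa = 5)
</tt></pre>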
There are two ways to use the above credibility estimators:

1. We interpret the distribution of $\Theta$ as summarizing our prior information on the distribution of the random variables.
2. We observe $r$ different sub-problems, where the $i$-th sub-problem is driven by $\Theta_i$, with $\Theta_i$, for $i=1,\ldots,r$, assumed to be a random sample from an unknown distribution. In this case, there is an estimation problem to deal with. This setting is that of **Empirical Bayes** in modern statistical parlance.

First, I present an example which is analyzed further later using the parallel programming paradigm.

**Example 1:** Consider the case where $X_i$, for $i=1,\ldots,n$, are, conditioned on $\Lambda=\lambda$, a random sample from a Poisson distribution with parameter $\lambda$. To make things a little bit more interesting, we assign a non-conjugate *prior* of $U(0,1)$ to $\Lambda$. The goal is to derive both the Bayesian estimator of $\Lambda$ under squared-error loss and the B&uuml;hlmann credibility estimator.

<details>
<summary> Solution: </summary>

**B&uuml;hlmann credibility estimator:** Note that
$$\mu=\E{\Lambda}=\frac{1}{2},\quad \nu=\E{\Lambda}=\frac{1}{2},\quad \hbox{and } \alpha=\Var{\Lambda}=\frac{1}{12}.$$
Hence, $\kappa=\nu/\alpha=6$, which results in the credibility estimator
$$ \frac{n}{n+6} \bar{X}_n + \frac{6}{n+6} \left(\frac{1}{2}\right).$$

**Bayesian estimator:** Note that the Bayesian estimator under squared-error loss is given by the posterior expectation of $\Lambda$. Let $S_n$ denote the sum of the $X_i$, $i=1,\ldots,n$. It is easy to see that the posterior density is proportional to
$$
\lambda^{S_n} \exp\{-n\lambda\} I_{[0,1]}(\lambda),
$$
and hence the Bayesian posterior expectation is given by
$$
\begin{aligned}
\frac{\int_0^1 \lambda^{S_n+1} \exp\{-n\lambda\} {\rm d}\lambda}{\int_0^1 \lambda^{S_n} \exp\{-n\lambda\} {\rm d}\lambda}&= \frac{\int_0^1 (\Gamma(S_n+2))^{-1}(n\lambda)^{S_n+1} \exp\{-n\lambda\} {\rm d}(n\lambda)}{\int_0^1 (\Gamma(S_n+1))^{-1}(n\lambda)^{S_n} \exp\{-n\lambda\} {\rm d}(n\lambda)}\frac{\Gamma(S_n+2)}{n\Gamma(S_n+1)}\\
&=\left(\frac{F_{Gamma(S_n+2,scale=1/n)}(1)}{F_{Gamma(S_n+1,scale=1/n)}(1)}\right)\frac{S_n+1}{n}.
\end{aligned}
$$
Can you see the limit of (think :chicken:LLN)
$$\left(\frac{F_{Gamma(S_n+2,scale=1/n)}(1)}{F_{Gamma(S_n+1,scale=1/n)}(1)}\right)?$$
Of course, the ratio can be computed using the <tt>pgamma</tt> function in base R. But there is always room for a little ingenuity, isn't there? We start by observing that
$$\int_0^1 \lambda^{k+1} \exp\{-n\lambda\} {\rm d}\lambda= \frac{\exp\{-n\}}{k+2} + \frac{n}{k+2}\int_0^1 \lambda^{k+2} \exp\{-n\lambda\} {\rm d}\lambda.$$
From this we see that
$$
\begin{aligned}
{\int_0^1 \lambda^{S_n+1} \exp\{-n\lambda\} {\rm d}\lambda}&=\frac{\exp\{-n\}}{n}\left[ \sum\limits_{j=S_n+1}^m \prod_{k=S_n+1}^j\frac{n}{k+1} \right] \\
&\qquad+ \left(\prod_{k=S_n+1}^m\frac{n}{k+1}\right) \int_0^1 \lambda^{m+1}\exp\{-n\lambda\} {\rm d}\lambda\\
&\approx \frac{\exp\{-n\}}{n}\sum\limits_{j=S_n+1}^m \prod_{k=S_n+1}^j\frac{n}{k+1}.
\end{aligned}
$$
This results in
$$
\begin{aligned}
\frac{\int_0^1 \lambda^{S_n+1} \exp\{-n\lambda\} {\rm d}\lambda}{\int_0^1 \lambda^{S_n} \exp\{-n\lambda\} {\rm d}\lambda}&\approx \frac{\sum\limits_{j=S_n+1}^m \prod_{k=S_n+1}^j\left(\frac{n}{k+1}\right)}{\sum\limits_{j=S_n}^m \prod_{k=S_n}^j\left(\frac{n}{k+1}\right)}\\
&=\frac{\sum\limits_{j=S_n+1}^m \prod_{k=S_n+1}^j\left(\frac{n}{k+1}\right)}{1+\sum\limits_{j=S_n+1}^m \prod_{k=S_n+1}^j\left(\frac{n}{k+1}\right)}\left(\frac{S_n+1}{n}\right).
\end{aligned}
$$
This approximation is what is implemented in my R code below (with the truncation point hard-coded at $m=S_n+30$):
<pre><tt>
Bayes_est_mod <- function(s){ # the Bayes estimator as a function of the sample sum
  return((1 - 1/(1 + sum(cumprod(n/(s + (2:31))))))*(s + 1)/n)
}
s <- 23; n <- 2
system.time(for(i in 1:1000000) {k <- Bayes_est_mod(s)})
##    user  system elapsed
##   2.199   0.000   2.199
system.time(for(i in 1:1000000) {k <- pgamma(1, s + 2, n)/pgamma(1, s + 1, n)*(s + 1)/n})
##    user  system elapsed
##   3.247   0.000   3.248
Bayes_est_mod(s) - pgamma(1, s + 2, n)/pgamma(1, s + 1, n)*(s + 1)/n
## 4.218847e-15
</tt></pre>
</details>
<P></P>
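As a follow-up to Example 1, the sketch below compares the exact Bayes estimate, computed with <tt>pgamma</tt>, to the B&uuml;hlmann credibility estimate on a small simulated Poisson sample; the true $\lambda$ and the sample size are arbitrary illustrative choices.
<pre><tt>
set.seed(7)
n      <- 10
lambda <- 0.4                      # an arbitrary true value inside (0,1)
x      <- rpois(n, lambda)
S      <- sum(x)

# Exact Bayes estimate under the U(0,1) prior (posterior mean of Lambda):
bayes <- pgamma(1, S + 2, rate = n)/pgamma(1, S + 1, rate = n)*(S + 1)/n

# Buhlmann credibility estimate with mu = 1/2 and kappa = 6, as derived above:
cred <- n/(n + 6)*mean(x) + 6/(n + 6)*(1/2)

c(sample_mean = mean(x), credibility = cred, bayes = bayes)
</tt></pre>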
## Estimation of $\mu$ and $\kappa$

As mentioned above, outside the Bayesian setting, in order to use the credibility estimators we need estimates of $\mu$ and $\kappa$. We will look at the two models one by one.

### Estimation in the B&uuml;hlmann Model:

We assume the setting where we observe $r$ different risks for $n$ years each, where the $i$-th risk is driven by $\Theta_i$, with $\Theta_i$, for $i=1,\ldots,r$, assumed to be a random sample from an unknown distribution. We will denote these random variables by $X_{ij}$, for $i=1,\ldots,r;j=1,\ldots, n$. So the assumption is that, for each $i$, the $X_{ij}$, for $j=1,\ldots,n$, are conditioned on $\Theta_i$ independent and have conditional mean $\mu(\Theta_i)$ and conditional variance $\sigma^2(\Theta_i)$. Also, the $\Theta_i$, for $i=1,\ldots,r$, form a random sample of size $r$. In the following, replacing any subscript by $\cdot$ means that we average over that subscript. Note that the credibility estimator for the $i$-th risk is given by
$$
\frac{n}{n+\hat{\kappa}} X_{i\cdot} + \frac{\hat{\kappa}}{n+\hat{\kappa}}\hat{\mu}\;.
$$
So in the following we will derive consistent estimators for $\mu$ and $\kappa$.

**Estimator for $\mu$:** Note that in this setting $X_{i\cdot}$ is a consistent estimate of $\mu(\Theta_i)$ by the :muscle:LLN. So a natural estimator for $\mu$, unbiased and consistent when $r$ and $n$ both tend to infinity (again by the :muscle:LLN), is given by $X_{\cdot\cdot}$.

**Estimator for $\nu$:** Recall that $\nu=\E{\sigma^2(\Theta)}$. As above, and from standard properties of sample statistics, we see that
$$\frac{1}{n-1}\sum_{j=1}^n(X_{ij}-X_{i\cdot})^2\rightarrow \sigma^2(\Theta_i), \quad \hbox{as }n\rightarrow\infty,$$
by the :muscle:LLN, and moreover it is an unbiased estimator of $\nu$. Of course, we get a better estimator by pooling information from all of the risks, so the estimator of choice will be
$$
\hat{\nu}:=\frac{1}{r(n-1)}\sum_{i=1}^r\sum_{j=1}^n(X_{ij}-X_{i\cdot})^2.
$$

**Estimator for $\alpha$:** Recall that $\alpha=\Var{\mu(\Theta)}$. Since, as observed above, $X_{i\cdot}$ is a consistent estimate of $\mu(\Theta_i)$,
$$
\frac{1}{r-1} \sum_{i=1}^r (X_{i\cdot}-X_{\cdot\cdot})^2
$$
is a consistent estimate of $\alpha$ as both $r$ and $n$ approach infinity. But for small $n$ we can do better, and in fact get an unbiased estimate. To see this, note that, unconditionally, the $X_{i\cdot}$ are iid random variables, and hence the above is an unbiased estimate of
$$\Var{X_{i\cdot}}=\Var{\mu(\Theta_i)}+\frac{1}{n}\E{\sigma^2(\Theta_i)}=\alpha+\frac{1}{n}\nu. $$
This suggests that the following is an unbiased estimator for $\alpha$:
$$
\hat{\alpha}:=\frac{1}{r-1} \sum_{i=1}^r (X_{i\cdot}-X_{\cdot\cdot})^2 - \frac{1}{n}\hat{\nu}.
$$
Note that the correction term is $O(1/n)$, and this correction could, for small $n$, make $\hat{\alpha}$ negative - in which case I would simply drop the correction. Finally, we construct
$$\hat{\kappa}:=\frac{\hat{\nu}}{\hat{\alpha}};$$
importantly, note that while $\hat{\kappa}$ inherits :muscle: consistency from that of $\hat{\nu}$ and $\hat{\alpha}$, it does not inherit unbiasedness.
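To make the above recipe concrete, here is a small R sketch (with illustrative names) that computes $\hat{\mu}$, $\hat{\nu}$, $\hat{\alpha}$ and $\hat{\kappa}$ from an $r\times n$ matrix of losses, one row per risk, and returns the resulting credibility premiums; as remarked above, the bias correction in $\hat{\alpha}$ is dropped if it turns the estimate negative.
<pre><tt>
# Nonparametric estimation in the Buhlmann model.
# X: an r x n matrix of losses, with X[i, j] the loss of risk i in year j.
buhlmann_estimates <- function(X) {
  r   <- nrow(X); n <- ncol(X)
  Xi. <- rowMeans(X)                                   # X_{i.}
  X.. <- mean(Xi.)                                     # X_{..}
  mu  <- X..
  nu  <- sum((X - Xi.)^2)/(r*(n - 1))                  # pooled within-risk sample variance
  alpha <- sum((Xi. - X..)^2)/(r - 1) - nu/n           # bias-corrected between-risk variance
  if (alpha <= 0) alpha <- sum((Xi. - X..)^2)/(r - 1)  # drop the correction if it overshoots
  kappa <- nu/alpha
  Z     <- n/(n + kappa)
  list(mu = mu, nu = nu, alpha = alpha, kappa = kappa,
       premiums = Z*Xi. + (1 - Z)*mu)                  # credibility premium for each risk
}

# Illustration on simulated data: 20 Poisson risks observed for 5 years
set.seed(42)
theta <- rgamma(20, shape = 4, rate = 2)
X <- matrix(rpois(20*5, rep(theta, 5)), nrow = 20)
buhlmann_estimates(X)[c("mu", "kappa")]   # true values: mu = 2, kappa = 2
</tt></pre>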
### Estimation in the B&uuml;hlmann-Straub Model:

In this model we will denote by $X_{ijk}$, for $i=1,\ldots,r;j=1,\ldots, n_i;k=1,\ldots,m_{ij}$, the $k$-th realization of the $i$-th risk in the $j$-th year. It is also assumed that the $X_{ijk}$, given $\Theta_i$, are independent random variables with conditional mean and variance given by $\mu(\Theta_i)$ and $\sigma^2(\Theta_i)$. Also, the $\Theta_i$, for $i=1,\ldots,r$, form a random sample of size $r$. Importantly, only the compressed data $X_{ij\cdot}$, for $i=1,\ldots,r;j=1,\ldots, n_i$, is available. Note that the credibility estimator for the $i$-th risk is given by
$$
\frac{m_i}{m_i+\hat{\kappa}} X_{i\cdot\cdot} + \frac{\hat{\kappa}}{m_i+\hat{\kappa}}\hat{\mu}\;,
$$
where $m_i:=\sum_{j=1}^{n_i} m_{ij}$.

**Estimator for $\mu$:** Note that in this setting
$$X_{i\cdot\cdot}=\frac{1}{m_i}\sum_{j=1}^{n_i} m_{ij}X_{ij\cdot}$$
is a consistent estimate of $\mu(\Theta_i)$ by the :muscle:LLN. So a natural estimator for $\mu$, unbiased and consistent when $r$ and the $m_i$ tend to infinity (again by the :muscle:LLN), is given by
$$\hat{\mu}:=X_{\cdots}=\frac{1}{m}\sum_{i=1}^{r} m_{i}X_{i\cdot\cdot}, \quad \hbox{where } m:=\sum_{i=1}^r m_i.$$

**Estimator for $\nu$:** Recall that $\nu=\E{\sigma^2(\Theta)}$. We will start by looking at the following:
$$\sum_{j=1}^{n_i}\sum_{k=1}^{m_{ij}}(X_{ijk}-X_{i\cdot\cdot})^2.$$
Note that the summands have the same distribution and, conditional on $\Theta_i$, the expectation of the sum equals $(m_i-1)\sigma^2(\Theta_i)$. But the catch is that the raw data is not available; so we proceed as follows:
$$
\begin{aligned}
\sum_{j=1}^{n_i}\sum_{k=1}^{m_{ij}}(X_{ijk}-X_{i\cdot\cdot})^2&= \sum_{j=1}^{n_i}\sum_{k=1}^{m_{ij}}(X_{ijk}-X_{ij\cdot}+X_{ij\cdot}-X_{i\cdot\cdot})^2\\
&=\sum_{j=1}^{n_i}\sum_{k=1}^{m_{ij}}(X_{ijk}-X_{ij\cdot})^2+ \sum_{j=1}^{n_i}m_{ij}(X_{ij\cdot}-X_{i\cdot\cdot})^2.
\end{aligned}
$$
Note that the cross-product term equals zero, and the only term that is a function of the compressed data is the second term on the right. But note that
$$\E{\sum_{j=1}^{n_i}\sum_{k=1}^{m_{ij}}(X_{ijk}-X_{i\cdot\cdot})^2\vert \Theta_i}=(m_i-1)\sigma^2(\Theta_i),$$
and
$$\E{\sum_{j=1}^{n_i}\sum_{k=1}^{m_{ij}}(X_{ijk}-X_{ij\cdot})^2\vert \Theta_i}=\sum_{j=1}^{n_i}(m_{ij}-1)\sigma^2(\Theta_i)=(m_i-n_i)\sigma^2(\Theta_i).$$
These together imply that
$$ \E{\sum_{j=1}^{n_i}m_{ij}(X_{ij\cdot}-X_{i\cdot\cdot})^2\vert \Theta_i}=(n_i-1)\sigma^2(\Theta_i),$$
which in turn suggests the following unbiased and :muscle:consistent estimator for $\nu$ (with $r$, the $m_i$'s and the $n_i$'s approaching infinity):
$$
\hat{\nu}:=\frac{1}{\sum_{i=1}^r n_i -r}\sum_{i=1}^r\sum_{j=1}^{n_i}m_{ij}(X_{ij\cdot}-X_{i\cdot\cdot})^2.
$$

**Estimator for $\alpha$:** Recall that $\alpha=\Var{\mu(\Theta)}$. We will proceed by first observing the following:
$$
\begin{eqnarray*}
\sum_{i=1}^r m_i(X_{i \cdot \cdot}-X_{\cdots} )^2&=& \sum_{i=1}^r m_i(X_{i \cdot \cdot}-\mu(\Theta_i)+\mu(\Theta_i)-\mu+\mu-X_{\cdots} )^2 \\
&=& \underbrace{ \sum_{i=1}^r m_i(X_{i \cdot \cdot}-\mu(\Theta_i) )^2}_{T_1} +\underbrace{\sum_{i=1}^r m_i(\mu(\Theta_i)-\mu )^2}_{T_2} +\underbrace{\left(\sum_{i=1}^r m_i \right)(\mu-X_{\cdots} )^2}_{T_3} \\
&&\qquad+\underbrace{2\sum_{i=1}^r m_i(X_{i \cdot \cdot}-\mu(\Theta_i) )(\mu(\Theta_i)-\mu)}_{T_4}\\
&&\qquad+\underbrace{2\sum_{i=1}^r m_i(X_{i \cdot \cdot}-\mu(\Theta_i) )(\mu-X_{\cdots})}_{T_5}\\
&&\qquad+\underbrace{2\sum_{i=1}^r m_i(\mu(\Theta_i)-\mu)(\mu-X_{\cdots})}_{T_6}.
\end{eqnarray*}
$$
We will further analyze each term below:

1. $\mathbf{T_1}:$ Note that $X_{i \cdot \cdot}$, conditioned on $\Theta_i$, has mean $\mu(\Theta_i)$ and variance $\frac{\sigma^2(\Theta_i)}{m_i}.$ Hence, $$\E{T_1|\Theta_1,\ldots,\Theta_r}=\sum_{i=1}^r m_i \cdot \frac{\sigma^2(\Theta_i)}{m_i}=\sum_{i=1}^r \sigma^2(\Theta_i),$$ which in turn implies that $$\E{T_1}=r\nu.$$
2. $\mathbf{T_2}:$ We note that $$\mathbb{E}(T_2)=\sum_{i=1}^r m_i \mathbb{E}(\mu(\Theta_i)-\mu )^2=\left(\sum_{i=1}^r m_i\right)\alpha=m\alpha,$$ where $m:=\sum_{i=1}^r m_i.$
3. $\mathbf{T_3}:$ We begin by observing that $$ \begin{eqnarray*} T_3&=&m \left(\sum_{i=1}^r\frac{ m_i }{m} X_{i \cdot \cdot}-\sum_{i=1}^r \frac{m_i}{m} \mu(\Theta_i) + \sum_{i=1}^r \left(\frac{ m_i }{m}\right) (\mu(\Theta_i)-\mu) \right)^2 \\ &=& m \left(\sum_{i=1}^r\frac{ m_i }{m} \left(X_{i \cdot \cdot}- \mu(\Theta_i) \right) + \sum_{i=1}^r \frac{ m_i }{m} (\mu(\Theta_i)-\mu) \right)^2, \end{eqnarray*} $$ so that $$ \begin{eqnarray*} \mathbb{E} (T_3|\Theta_1, \ldots,\Theta_r)=m \left[\underbrace{ \sum_{i=1}^r\frac{ m_i^2 }{m^2} \frac{\sigma^2(\Theta_i)}{m_i}}_{\text{Using the step for } T_1 \text{ above}} + \underbrace{ \left( \sum_{i=1}^r \left(\frac{ m_i }{m}\right) (\mu(\Theta_i)-\mu) \right)^2}_{\text{All other cross-product terms vanish}} \right]. \end{eqnarray*} $$ Hence, $$ \begin{eqnarray*} \mathbb{E}(T_3)= \sum_{i=1}^r\frac{ m_i }{m} \nu +m \sum_{i=1}^r\frac{ m_i^2 }{m^2} \alpha = \nu +\sum_{i=1}^r\frac{ m_i^2 }{m} \alpha. \end{eqnarray*} $$
4. $\mathbf{T_4}:$ This part is trivial: $$\mathbb{E}(T_4|\Theta_1, \ldots,\Theta_r)=0 \Rightarrow \mathbb{E} (T_4)=0.$$
5. $\mathbf{T_5}:$ Note that $$ \begin{eqnarray*} T_5&=&2m \left(X_{ \cdots}-\sum_{i=1}^r\frac{ m_i\mu(\Theta_i)}{m} \right)(\mu-X_{\cdots})\\ &=&2m \left(X_{ \cdots}-\sum_{i=1}^r\frac{ m_i\mu(\Theta_i)}{m} \right) \left(\mu-\sum_{i=1}^r\frac{ m_i\mu(\Theta_i)}{m}+\sum_{i=1}^r\frac{ m_i\mu(\Theta_i)}{m}-X_{\cdots}\right). \end{eqnarray*} $$ Hence, $$ \mathbb{E} (T_5|\Theta_1, \ldots,\Theta_r)=-2m\, \mathbb{E}\left(\left(X_{ \cdots}-\sum_{i=1}^r\frac{ m_i\mu(\Theta_i)}{m} \right)^2 \Bigg\vert \Theta_1, \ldots,\Theta_r\right) = -2m\,{\rm Var}\left(X_{\cdots}\vert \Theta_1, \ldots,\Theta_r\right), $$ which implies that $$ \mathbb{E} (T_5)=-2m \sum_{i=1}^r\frac{ m_i^2 }{m^2} \frac{\mathbb{E}\,\sigma^2(\Theta_i)}{m_i}=-2 \nu. $$
6. $\mathbf{T_6}:$ Note that $$T_6=-2 \sum_{i=1}^r m_i (\mu(\Theta_i)-\mu)(X_{\cdots}-\mu),$$ which leads to $$ \mathbb{E} (T_6|\Theta_1, \ldots,\Theta_r)=-2\sum_{i=1}^r m_i (\mu(\Theta_i)-\mu) \left(\sum_{l=1}^r\frac{ m_l\mu(\Theta_l)}{m}-\mu \right), $$ and this in turn implies that $$ \mathbb{E} (T_6)=-2 \sum_{i=1}^r\frac{ m_i^2 }{m}\alpha. $$

We summarize the above partial results in the following table of coefficients:

| | coefficient of $\nu$ | coefficient of $\alpha$ |
|:---------:|:---------:|:---------:|
| $\mathbb{E}(T_1)$ | $r$ | -- |
| $\mathbb{E}(T_2)$ | -- | $m$ |
| $\mathbb{E}(T_3)$ | $1$ | $\sum_{i=1}^r\frac{ m_i^2 }{m}$ |
| $\mathbb{E}(T_4)$ | -- | -- |
| $\mathbb{E}(T_5)$ | $-2$ | -- |
| $\mathbb{E}(T_6)$ | -- | $-2\sum_{i=1}^r\frac{ m_i^2 }{m}$ |
| Total | $r-1$ | $m-\sum_{i=1}^r\frac{ m_i^2 }{m}$ |

Hence,
$$
\hat{\alpha}:=\frac{\sum_{i=1}^r m_i(X_{i \cdot \cdot}-X_{\cdots} )^2-(r-1)\hat{\nu} }{m-\sum_{i=1}^r\frac{ m_i^2 }{m}}
$$
is an unbiased estimator for $\alpha$. Finally, we construct
$$\hat{\kappa}:=\frac{\hat{\nu}}{\hat{\alpha}};$$
as observed before, $\hat{\kappa}$ inherits :muscle: consistency from that of $\hat{\nu}$ and $\hat{\alpha}$, but fails to inherit unbiasedness.

**Remark:** The above estimators are non-parametric in nature - that is, they do NOT assume any parametric family for the distribution of $X_i$ given $\Theta_i$. When we do assume one, other estimators may naturally arise. For example, in the case of a Poisson distribution we have $\mu(\Theta)=\sigma^2(\Theta)$, and hence $\hat{\mu}$ works as $\hat{\nu}$ as well.
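Finally, a sketch (with hypothetical argument names) of the B&uuml;hlmann-Straub estimators above, taking as input the compressed data $X_{ij\cdot}$ and the exposure counts $m_{ij}$ as lists of per-risk vectors; no safeguard for a negative $\hat{\alpha}$ is included here.
<pre><tt>
# Nonparametric estimation in the Buhlmann-Straub model from compressed data.
# x: list of length r; x[[i]] is the vector (X_{i1.}, ..., X_{i n_i .}) of yearly averages.
# m: list of the same shape; m[[i]] is the vector (m_{i1}, ..., m_{i n_i}) of exposures.
buhlmann_straub_estimates <- function(x, m) {
  r    <- length(x)
  n_i  <- sapply(x, length)
  m_i  <- sapply(m, sum)                      # m_i = sum_j m_{ij}
  mtot <- sum(m_i)                            # m = sum_i m_i
  Xi   <- mapply(function(xi, mi) sum(mi*xi)/sum(mi), x, m)    # X_{i..}
  Xbar <- sum(m_i*Xi)/mtot                    # X_{...}, the estimate of mu
  nu   <- sum(mapply(function(xi, mi, xid) sum(mi*(xi - xid)^2), x, m, Xi)) /
            (sum(n_i) - r)
  alpha <- (sum(m_i*(Xi - Xbar)^2) - (r - 1)*nu)/(mtot - sum(m_i^2)/mtot)
  kappa <- nu/alpha
  Z     <- m_i/(m_i + kappa)
  list(mu = Xbar, nu = nu, alpha = alpha, kappa = kappa,
       premiums = Z*Xi + (1 - Z)*Xbar)        # credibility premium for each risk
}

# Tiny illustration: three risks with different numbers of years and exposures
x <- list(c(100, 120), c(90, 95, 110), c(150))
m <- list(c(10, 12),   c(8, 9, 11),    c(5))
buhlmann_straub_estimates(x, m)
</tt></pre>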