Running Proof of Welford's algorithm for variance, batched version

\[ s_{n+k}-s_n = \sum_{i=1}^{n+k} (x_i-u_{n+k})^2 - \sum_{i=1}^n (x_i-u_n)^2 \\ = \sum_{i=n+1}^{n+k} (x_i-u_{n+k})^2 + \sum_{i=1}^{n} (x_i-u_{n+k})^2 - \sum_{i=1}^n (x_i-u_n)^2 \\ = \sum_{i=n+1}^{n+k} (x_i-u_{n+k})^2 + \sum_{i=1}^{n} (x_i-u_{n+k})^2 - (x_i-u_n)^2 \\ = \sum_{i=n+1}^{n+k} (x_i-u_{n+k})^2 + \sum_{i=1}^{n} (x_i-u_{n+k} + x_i-u_n) (x_i-u_{n+k} - x_i+u_n) \\ = \sum_{i=n+1}^{n+k} (x_i-u_{n+k})^2 + (\sum_{i=1}^{n} (x_i-u_{n+k})+\sum_{i=1}^{n}(x_i-u_n)) (-u_{n+k}+u_n) \\ = \sum_{i=n+1}^{n+k} (x_i-u_{n+k})^2 + (\sum_{i=1}^{n} x_i-u_{n+k}) (u_n-u_{n+k}) \\ = \sum_{i=n+1}^{n+k} (x_i-u_{n+k})^2 + (- \sum_{i=n+1}^{n+k} (x_i-u_{n+k})) (u_n-u_{n+k}) \quad \text{(see below)}\\ = \sum_{i=n+1}^{n+k} (x_i-u_{n+k})^2 + \sum_{i=n+1}^{n+k} (x_i-u_{n+k}) (u_{n+k}-u_n)\\ = \sum_{i=n+1}^{n+k} (x_i-u_{n+k})^2 + (x_i-u_{n+k}) (u_{n+k}-u_n)\\ = \sum_{i=n+1}^{n+k} (x_i-u_{n+k}) (x_i-u_{n+k} +u_{n+k}-u_n)\\ = \sum_{i=n+1}^{n+k} (x_i-u_{n+k})(x_i-u_n)\\ \]

Therefore
\[s_{n+k}-s_{n} = \sum_{i=n+1}^{n+k} (x_i-u_{n+k})(x_i-u_n)\]
And so:

\[s_{n+k} = s_{n} + \sum_{i=n+1}^{n+k} (x_i-u_{n+k})(x_i-u_n)\]

Proof of a)

\[ \sum_{i=1}^{n} x_i-u_{n+k} =\\ \sum_{i=1}^{n} x_i-u_{n+k} + \sum_{i=n+1}^{n+k} (x_i-u_{n+k}) - \sum_{i=n+1}^{n+k} (x_i-u_{n+k}) =\\ =- \sum_{i=n+1}^{n+k} (x_i-u_{n+k})\\ \]

Python version

class RunningVariance:
    def __init__(self):
        self.n=0
        self.s=0
        self.mean=0
        
    def update(self,x:np.ndarray):
        k = x.shape[0]
        self.n += k
        # (x_i-u_{n+k})(x_i-u_n)
        diff = (x - self.mean)
        self.mean = self.mean + diff / self.n
        diff *= (x - self.mean)
        # s_{n+k} = s_n + diff
        self.s= self.s + diff.sum(axis=0)
        
    def var(self):
        if self.n<=1:
            raise ValueError(f"Not enough samples to compute variance (n={self.n})")
        return self.s/self.n
    
            

Select a repo