# Research Investigation - Invariant Federated Learning

### Paper

"Learning Explanations that are Hard to Vary"

#### 1. Problem

The conventional method employs the logical OR pattern (arithmetic mean), which averages the gradients across examples. However, this approach raises several problems that can smother the learning of invariances.

1. **Training stops once the loss is low enough.** If optimization has learned spurious patterns by the time it converges, invariances will no longer be learned. (Question: Why does the logical AND pattern not have this issue?)
2. Since the gradients for each example are computed independently, **each signal is identical to the one for an equivalent dataset of size 1.**
3. Gradient descent with averaged gradients greedily maximizes learning speed, but in some situations we would like to **trade some convergence speed for invariance**.

#### 2. Proposed Solution

Use the AND pattern (geometric mean) instead of the OR pattern (arithmetic mean). The AND pattern first calculates the product of the per-environment values and then takes the $E$-th root of the product.

For simplicity, assume each Hessian matrix $H_e \in \{H\}$ is diagonal and all of its eigenvalues $\lambda_i^e$ are positive.

**OR Pattern:** (Arithmetic Mean)

$$H^{+} = diag\left( \frac{1}{E} \sum\nolimits_{e \in E} \lambda_1^e, \dots, \frac{1}{E} \sum\nolimits_{e \in E} \lambda_n^e \right)$$

Now optimizing from a point $\theta^k$ towards the local minimizer $\theta^*$, the gradient descent update reads

$$\theta^{k+1} = \theta^k - \eta H^{+} (\theta^k - \theta^*),$$

where $\eta$ denotes the learning rate.
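As a reading aid (our addition, following the standard Taylor-expansion view rather than anything specific to the paper): near a local minimizer the loss is approximately quadratic, which is what lets the Hessian appear in the update.

```latex
% Second-order Taylor expansion around the local minimizer \theta^*,
% where \nabla L(\theta^*) = 0:
L(\theta) \;\approx\; L(\theta^*) + \tfrac{1}{2}\,(\theta - \theta^*)^{\top} H \,(\theta - \theta^*)
% Differentiating gives the gradient at the current iterate:
\nabla L(\theta^k) \;=\; H\,(\theta^k - \theta^*)
% so the update above is ordinary gradient descent,
%   \theta^{k+1} = \theta^k - \eta \nabla L(\theta^k),
% and (\theta^k - \theta^*) never needs to be computed explicitly.
```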
<span style="color:red">Question: Why does <a href="https://www.codecogs.com/eqnedit.php?latex=H^&plus;(\theta&space;^k&space;-&space;\theta&space;^*)" target="_blank"><img src="https://latex.codecogs.com/gif.latex?H^&plus;(\theta&space;^k&space;-&space;\theta&space;^*)" title="H^+(\theta ^k - \theta ^*)" /></a> equals <a href="https://www.codecogs.com/eqnedit.php?latex=\bigtriangledown&space;L&space;(\theta&space;^k)" target="_blank"><img src="https://latex.codecogs.com/gif.latex?\bigtriangledown&space;L&space;(\theta&space;^k)" title="\bigtriangledown L (\theta ^k)" /></a>? How to calculate <a href="https://www.codecogs.com/eqnedit.php?latex=(&space;\theta&space;^k&space;-&space;\theta&space;^*)" target="_blank"><img src="https://latex.codecogs.com/gif.latex?(&space;\theta&space;^k&space;-&space;\theta&space;^*)" title="( \theta ^k - \theta ^*)" /></a>, since the local minimizer <a href="https://www.codecogs.com/eqnedit.php?latex=\theta&space;^*" target="_blank"><img src="https://latex.codecogs.com/gif.latex?\theta&space;^*" title="\theta ^*" /></a> is unknown?</span> <span style="color:green"> Discussion: Since <a href="https://www.codecogs.com/eqnedit.php?latex=H^&plus;" target="_blank"><img src="https://latex.codecogs.com/gif.latex?H^&plus;" title="H^+" /></a> is the product of means of the eigenvalues of the Hessian matrixes. (First calculate the eigenvalues of Hessian matrixes, then calculate the means of the eigenvalues, then multiply the means to calculate the product). But <a href="https://www.codecogs.com/eqnedit.php?latex=\bigtriangledown&space;L(&space;\theta&space;^k)" target="_blank"><img src="https://latex.codecogs.com/gif.latex?\bigtriangledown&space;L(&space;\theta&space;^k)" title="\bigtriangledown L( \theta ^k)" /></a> denotes the gradient of current point <a href="https://www.codecogs.com/eqnedit.php?latex=\theta&space;^k" target="_blank"><img src="https://latex.codecogs.com/gif.latex?\theta&space;^k" title="\theta ^k" /></a>.</span> **AND Pattern:** (Geometric Mean) <a href="https://www.codecogs.com/eqnedit.php?latex=H^{\land}&space;=&space;diag(&space;(\prod\nolimits_{e&space;\in&space;E}&space;\lambda&space;_1^e)^{&space;\frac{1}{E}},...,&space;(\prod\nolimits_{e&space;\in&space;E}&space;\lambda&space;_n^e)^{&space;\frac{1}{E}})" target="_blank"><img src="https://latex.codecogs.com/gif.latex?H^{\land}&space;=&space;diag(&space;(\prod\nolimits_{e&space;\in&space;E}&space;\lambda&space;_1^e)^{&space;\frac{1}{E}},...,&space;(\prod\nolimits_{e&space;\in&space;E}&space;\lambda&space;_n^e)^{&space;\frac{1}{E}})" title="H^+ = diag( (\prod\nolimits_{e \in E} \lambda _1^e)^{ \frac{1}{E}},..., (\prod\nolimits_{e \in E} \lambda _n^e)^{ \frac{1}{E}})" /></a> Now optimizing from a point <a href="https://www.codecogs.com/eqnedit.php?latex=\theta&space;^k" target="_blank"><img src="https://latex.codecogs.com/gif.latex?\theta&space;^k" title="\theta ^k" /></a> to the local minimizer <a href="https://www.codecogs.com/eqnedit.php?latex=\theta&space;^*" target="_blank"><img src="https://latex.codecogs.com/gif.latex?\theta&space;^*" title="\theta ^*" /></a>, the gradient descent reads <a href="https://www.codecogs.com/eqnedit.php?latex=\theta&space;^{k&plus;1}&space;=&space;\theta&space;^k&space;-&space;\eta&space;H^{\land}&space;(&space;\theta&space;^k&space;-&space;\theta&space;^*)" target="_blank"><img 
src="https://latex.codecogs.com/gif.latex?\theta&space;^{k&plus;1}&space;=&space;\theta&space;^k&space;-&space;\eta&space;H^{\land}&space;(&space;\theta&space;^k&space;-&space;\theta&space;^*)" title="\theta ^{k+1} = \theta ^k - \eta H^{\land} ( \theta ^k - \theta ^*)" /></a>, where <a href="https://www.codecogs.com/eqnedit.php?latex=\eta" target="_blank"><img src="https://latex.codecogs.com/gif.latex?\eta" title="\eta" /></a> denotes the learning rate. <span style="color:red">Question: Why does <a href="https://www.codecogs.com/eqnedit.php?latex=H^{\land}(\theta&space;^k&space;-&space;\theta&space;^*)" target="_blank"><img src="https://latex.codecogs.com/gif.latex?H^{\land}(\theta&space;^k&space;-&space;\theta&space;^*)" title="H^+(\theta ^k - \theta ^*)" /></a> equals <a href="https://www.codecogs.com/eqnedit.php?latex=\bigtriangledown&space;L&space;(\theta&space;^k)" target="_blank"><img src="https://latex.codecogs.com/gif.latex?\bigtriangledown&space;L&space;(\theta&space;^k)" title="\bigtriangledown L (\theta ^k)" /></a>? How to calculate <a href="https://www.codecogs.com/eqnedit.php?latex=(&space;\theta&space;^k&space;-&space;\theta&space;^*)" target="_blank"><img src="https://latex.codecogs.com/gif.latex?(&space;\theta&space;^k&space;-&space;\theta&space;^*)" title="( \theta ^k - \theta ^*)" /></a>, since the local minimizer <a href="https://www.codecogs.com/eqnedit.php?latex=\theta&space;^*" target="_blank"><img src="https://latex.codecogs.com/gif.latex?\theta&space;^*" title="\theta ^*" /></a> is unknown?</span> <span style="color:green"> Discussion: Since <a href="https://www.codecogs.com/eqnedit.php?latex=H^{\land}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?H^{\land}" title="H^^" /></a> is the product of means of the eigenvalues of the Hessian matrixes. (First calculate the eigenvalues of Hessian matrixes, then calculate the means of the eigenvalues, then multiply the means to calculate the product). But <a href="https://www.codecogs.com/eqnedit.php?latex=\bigtriangledown&space;L(&space;\theta&space;^k)" target="_blank"><img src="https://latex.codecogs.com/gif.latex?\bigtriangledown&space;L(&space;\theta&space;^k)" title="\bigtriangledown L( \theta ^k)" /></a> denotes the gradient of current point <a href="https://www.codecogs.com/eqnedit.php?latex=\theta&space;^k" target="_blank"><img src="https://latex.codecogs.com/gif.latex?\theta&space;^k" title="\theta ^k" /></a>.</span> <span style="color:red">Question: Why AND pattern is superior than OR pattern?</span> <span style="color:blue">Answer: AND pattern is more likely to capture the consistency than OR pattern.</span> <span style="color:blue">Based on the inequality of arithmetic and geometric means, or AM–GM inequality, for a specific list of non-negative real numbers, the arithmetic mean is always greater than or equal to its geometric mean. 
<span style="color:red">Question: Why is the AND pattern superior to the OR pattern?</span>

<span style="color:blue">Answer: The AND pattern is more likely to capture consistency than the OR pattern. By the inequality of arithmetic and geometric means (the AM–GM inequality), for a list of non-negative real numbers the arithmetic mean is always greater than or equal to the geometric mean, and the two means are equal only if every number in the list is the same. For example, assume $a > 0$ and $b > 0$; then $\frac{1}{2}(a+b) \geq \sqrt{ab}$, with equality if and only if $a = b$.</span>

![means](https://i.imgur.com/dq5GxSy.png)

<span style="color:blue">Therefore, under the geometric mean, if the curvature is consistent, i.e. the gradients agree with each other, the training speed is maximized. Conversely, if the curvature is inconsistent, i.e. the gradients differ from each other, the training speed is reduced. This makes it easier to find a local minimizer that is consistent across environments.</span>

#### 3. Problems with the Proposed Geometric Mean Solution

There are several problems concerning the proposed geometric mean solution.

1. Previously, we assumed the eigenvalues are all positive real numbers, which means the loss is locally convex. However, **the geometric mean does not work in non-convex settings**.
2. Since multiplying many floating-point values can overflow (or underflow), we use a sum of logarithms for numerical stability (see the sketch after this list). However, **the logarithm is computationally expensive**.
3. Adam rescales the gradients, so <span style="color:red">the exact magnitude of the geometric mean would be ignored</span>, and most of the difference from the arithmetic mean comes from the zeroed components. <span style="color:red">Question: Why would the magnitude of the geometric mean be ignored? Why does the MASK solution mitigate this issue?</span>
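A minimal illustration of the numerical-stability point in item 2 (our own toy values; `eps` is an assumed small constant as used later in this document):

```python
import numpy as np

def geometric_mean_logspace(values, eps=1e-10):
    """Geometric mean of positive values via a sum of logarithms.

    Multiplying many small floats directly can underflow to 0 (or
    overflow for large values); summing logs avoids both.
    """
    values = np.asarray(values, dtype=np.float64)
    return np.exp(np.mean(np.log(values + eps)))

small = np.full(500, 1e-3)
print(np.prod(small))                  # 0.0   -- direct product underflows
print(geometric_mean_logspace(small))  # ~1e-3 -- stable in log space
```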
#### 4. Geometric Mean with MASK Solution

Suppose we have the *gradients* $grads = [[1, 3, 5], [0, 3, -5]]$.

<span style="color:blue">(1) First we calculate the **signs** of the gradients.</span>

- If the value is larger than 0, the sign is 1.
- If the value is smaller than 0, the sign is -1.
- If the value is equal to 0, the sign is 0.

$$signs = [[1, 1, 1], [0, 1, -1]]$$

<span style="color:blue">(2) Then we calculate the **mask** of the gradients.</span>

Specifically, we pre-define an agreement threshold, and then calculate the average of the signs in each dimension across the different samples (e.g. the average of the signs in the first dimension of Sample A and Sample B). <span style="color:green">If the absolute average of the signs is smaller than the agreement threshold, we mask out that dimension of the gradients, because if the gradients have different signs across samples, they are definitely inconsistent.</span>

$$mask = \lvert avg(signs) \rvert \geq agreement\_threshold$$

Suppose $agreement\_threshold = 0.2$ in our case. The absolute average of the signs in each dimension across the samples is $\lvert avg(signs) \rvert = [0.5, 1, 0]$. Since 0.5 and 1 are larger than the agreement threshold 0.2, while 0 is smaller, the mask is $mask = [1, 1, 0]$.
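A minimal NumPy sketch of steps (1) and (2), reproducing the example above:

```python
import numpy as np

grads = np.array([[1.0, 3.0, 5.0],
                  [0.0, 3.0, -5.0]])  # rows = samples, columns = dimensions
agreement_threshold = 0.2

signs = np.sign(grads)                  # [[1, 1, 1], [0, 1, -1]]
avg_signs = np.abs(signs.mean(axis=0))  # [0.5, 1.0, 0.0]
mask = (avg_signs >= agreement_threshold).astype(float)

print(mask)  # [1. 1. 0.]
```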
<a href="https://www.codecogs.com/eqnedit.php?latex=geo(grads)&space;=&space;sign(avg(grads))&space;\times&space;e^&space;{\frac{1}{n}&space;sum(&space;\log_{e}&space;\lvert{grads&space;&plus;&space;\varepsilon&space;}\lvert)}&space;\\&space;=&space;sign(avg(grads))&space;\times&space;e^{\log_{e}(prod(\lvert{grads&space;&plus;&space;\varepsilon&space;}\lvert))^{\frac{1}{n}}}\\&space;=&space;sign(avg(grads))&space;\times&space;prod(\lvert{grads&space;&plus;&space;\varepsilon&space;}\lvert))^{\frac{1}{n}}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?geo(grads)&space;=&space;sign(avg(grads))&space;\times&space;e^&space;{\frac{1}{n}&space;sum(&space;\log_{e}&space;\lvert{grads&space;&plus;&space;\varepsilon&space;}\lvert)}&space;\\&space;=&space;sign(avg(grads))&space;\times&space;e^{\log_{e}(prod(\lvert{grads&space;&plus;&space;\varepsilon&space;}\lvert))^{\frac{1}{n}}}\\&space;=&space;sign(avg(grads))&space;\times&space;prod(\lvert{grads&space;&plus;&space;\varepsilon&space;}\lvert))^{\frac{1}{n}}" title="geo(grads) = sign(avg(grads)) \times e^ {\frac{1}{n} sum( \log_{e} \lvert{grads + \varepsilon }\lvert)} \\ = sign(avg(grads)) \times e^{\log_{e}(prod(\lvert{grads + \varepsilon }\lvert))^{\frac{1}{n}}}\\ = sign(avg(grads)) \times prod(\lvert{grads + \varepsilon }\lvert))^{\frac{1}{n}}" /></a> For example, in our case, we first calculate the geometric mean by using logarithm: <a href="https://www.codecogs.com/eqnedit.php?latex=e^&space;{\frac{1}{n}&space;sum(&space;\log_{e}&space;\lvert{grads&space;&plus;&space;\varepsilon&space;}\lvert)}&space;=&space;[1e-05,&space;3,&space;5]" target="_blank"><img src="https://latex.codecogs.com/gif.latex?e^&space;{\frac{1}{n}&space;sum(&space;\log_{e}&space;\lvert{grads&space;&plus;&space;\varepsilon&space;}\lvert)}&space;=&space;[1e-05,&space;3,&space;5]" title="e^ {\frac{1}{n} sum( \log_{e} \lvert{grads + \varepsilon }\lvert)} = [1e-05, 3, 5]" /></a>. In order to prevent the base of logarithm is 0, we add a constant value <a href="https://www.codecogs.com/eqnedit.php?latex=\varepsilon&space;=&space;1e-10" target="_blank"><img src="https://latex.codecogs.com/gif.latex?\varepsilon&space;=&space;1e-10" title="\varepsilon = 1e-10" /></a>. Then we calculate the signs of the arithmetic mean of gradients: <a href="https://www.codecogs.com/eqnedit.php?latex=sign(avg(grads))&space;=&space;[1,&space;1,&space;0]" target="_blank"><img src="https://latex.codecogs.com/gif.latex?sign(avg(grads))&space;=&space;[1,&space;1,&space;0]" title="sign(avg(grads)) = [1, 1, 0]" /></a>. The geometric mean would be <a href="https://www.codecogs.com/eqnedit.php?latex=sign(avg(grads))&space;\times&space;e^&space;{\frac{1}{n}&space;sum(&space;\log_{e}&space;\lvert{grads&space;&plus;&space;\varepsilon&space;}\lvert)}&space;=&space;[1e-05,&space;3,&space;0]" target="_blank"><img src="https://latex.codecogs.com/gif.latex?sign(avg(grads))&space;\times&space;e^&space;{\frac{1}{n}&space;sum(&space;\log_{e}&space;\lvert{grads&space;&plus;&space;\varepsilon&space;}\lvert)}&space;=&space;[1e-05,&space;3,&space;0]" title="sign(avg(grads)) \times e^ {\frac{1}{n} sum( \log_{e} \lvert{grads + \varepsilon }\lvert)} = [1e-05, 3, 0]" /></a>. 
Then we apply $mask = [1, 1, 0]$ to the geometric mean of the gradients, which gives $mask(geo(grads)) = [1e{-}05, 3, 0]$. Then we rescale the gradients by dividing by the average value of the mask, $avg(mask) = 0.667$. To prevent the denominator from being 0, we add a small constant $\varepsilon = 1e{-}10$. Then $final\_grads = [1.5e{-}05, 4.5, 0]$.

<span style="color:red">Question: Why do we need to rescale the gradients by dividing by the average value of the mask?</span>
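Putting the three steps together, a minimal end-to-end sketch of the geometric-mean-with-mask update, reproducing $[1.5e{-}05, 4.5, 0]$ (the helper name is ours):

```python
import numpy as np

def geom_mask_grads(grads, agreement_threshold=0.2, eps=1e-10):
    """Mask and rescale the geometric mean of per-sample gradients."""
    grads = np.asarray(grads, dtype=np.float64)
    # (1)-(2): sign-agreement mask
    mask = (np.abs(np.sign(grads).mean(axis=0)) >= agreement_threshold)
    mask = mask.astype(np.float64)
    # geometric mean: magnitude via sum of logs, sign from the arithmetic mean
    magnitude = np.exp(np.mean(np.log(np.abs(grads + eps)), axis=0))
    geo = np.sign(grads.mean(axis=0)) * magnitude
    # (3): apply mask, then rescale by the fraction of surviving components
    return mask * geo / (mask.mean() + eps)

print(geom_mask_grads([[1, 3, 5], [0, 3, -5]]))  # [1.5e-05 4.5e+00 0.0e+00]
```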
#### 5. AND MASK Solution

Suppose we have the same *gradients* $grads = [[1, 3, 5], [0, 3, -5]]$.

<span style="color:blue">(1) First we calculate the **signs** of the gradients.</span>

- If the value is larger than 0, the sign is 1.
- If the value is smaller than 0, the sign is -1.
- If the value is equal to 0, the sign is 0.

$$signs = [[1, 1, 1], [0, 1, -1]]$$

<span style="color:blue">(2) Then we calculate the **mask** of the gradients.</span>

As before, we pre-define an agreement threshold and calculate the average of the signs in each dimension across the different samples. <span style="color:green">If the absolute average of the signs is smaller than the agreement threshold, we mask out that dimension of the gradients, because if the gradients have different signs across samples, they are definitely inconsistent.</span>

$$mask = \lvert avg(signs) \rvert \geq agreement\_threshold$$

Suppose $agreement\_threshold = 0.2$ again. The absolute average of the signs is $\lvert avg(signs) \rvert = [0.5, 1, 0]$, so the mask is $mask = [1, 1, 0]$.

<span style="color:blue">(3) Finally, we calculate the final gradients by applying the mask to **the arithmetic mean** of the gradients. Meanwhile, we **rescale** the gradients.</span>

$$final\_grads = \frac{1}{avg(mask) + \varepsilon} \times mask(avg(grads))$$

For example, in our case, the average value of the gradients in each dimension across the samples is $avg(grads) = [0.5, 3, 0]$. Then we apply $mask = [1, 1, 0]$ to the average value of the gradients, which gives $mask(avg(grads)) = [0.5, 3, 0]$. Then we rescale the gradients by dividing by the average value of the mask, $avg(mask) = 0.667$. To prevent the denominator from being 0, we add a small constant $\varepsilon = 1e{-}10$. Then $final\_grads = [0.75, 4.5, 0]$.

<span style="color:red">Question: Why do we need to rescale the gradients by dividing by the average value of the mask?</span>
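And the corresponding end-to-end sketch for the AND mask, reproducing $[0.75, 4.5, 0]$ (same caveats as above):

```python
import numpy as np

def and_mask_grads(grads, agreement_threshold=0.2, eps=1e-10):
    """Mask and rescale the arithmetic mean of per-sample gradients."""
    grads = np.asarray(grads, dtype=np.float64)
    # (1)-(2): sign-agreement mask, identical to the geometric variant
    mask = (np.abs(np.sign(grads).mean(axis=0)) >= agreement_threshold)
    mask = mask.astype(np.float64)
    # (3): apply mask to the arithmetic mean, then rescale
    return mask * grads.mean(axis=0) / (mask.mean() + eps)

print(and_mask_grads([[1, 3, 5], [0, 3, -5]]))  # [0.75 4.5  0.  ]
```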
#### 6. Application to Federated Learning

Basic configuration:

- Number of users: 15 (Total: 53)
- Training epochs per round: 50
- Total training rounds: 10
- Batch size: 64
- Agreement threshold: 0.2
- Model architecture: three-layer neural network

<span style="color:blue">Result: We are unable to reproduce the result stated in the paper in the federated setting when using the ICU data.</span>

| Equal Size | Federated Learning Algorithm | Train Loss | Validation Loss | Train Accuracy | Validation Accuracy |
|------------|------------------------------|------------|-----------------|----------------|---------------------|
| True | Default | 2.072 | 7.791 | 0.979 | 0.922 |
| True | Geometric Mask | 0.156 | 1.349 | 0.990 | 0.941 |
| True | And Mask | 0.098 | 1.205 | 0.995 | 0.932 |
| False | Default | 1.487 | 6.708 | 0.985 | 0.933 |
| False | Geometric Mask | 0.077 | 1.597 | 0.992 | 0.930 |
| False | And Mask | 0.066 | 2.177 | 0.995 | 0.927 |

<span style="color:red">Questions:</span>

<span style="color:red">1. How do we evaluate domain adaptation in the federated setting (currently we use the loss and accuracy in each round)?</span>
<span style="color:red">2. How do we handle the geometric mean and the AND mask for users with different batch sizes?</span>
<span style="color:red">3. Why do we use the geometric mean instead of GAN approaches to solve domain adaptation in the federated setting?</span>
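For context on question 2, here is a hedged sketch of one plausible server-side placement of the AND mask across per-client model updates. This is our assumption for illustration, not necessarily the configuration used for the table above; `client_updates` and the toy deltas are hypothetical.

```python
import numpy as np

def server_aggregate(client_updates, agreement_threshold=0.2, eps=1e-10):
    """Aggregate per-client updates with the AND mask instead of plain FedAvg.

    client_updates: array of shape (num_clients, num_params), each row a
    flattened model delta reported by one client for this round.
    """
    updates = np.asarray(client_updates, dtype=np.float64)
    mask = (np.abs(np.sign(updates).mean(axis=0)) >= agreement_threshold)
    mask = mask.astype(np.float64)
    return mask * updates.mean(axis=0) / (mask.mean() + eps)

# Toy round with three clients and four parameters: the last parameter has
# no sign agreement across clients, so it is zeroed out.
deltas = np.array([[0.1, -0.2,  0.3,  0.0],
                   [0.2, -0.1, -0.3,  0.1],
                   [0.1, -0.3,  0.2, -0.1]])
print(server_aggregate(deltas))
```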