# Paper review
## Summary
This paper addresses the out-of-distribution generalization challenge in federated learning through a client-sampling approach inspired by information theory. Specifically, the authors propose to minimize a "self-information weighted empirical risk" whose generalization bound leads to two client sampling strategies. These strategies aim to maximize the cross-entropy between the data distribution of a participating client and that of a new client, under the assumption that novel clients have very different data distributions. Empirical results suggest that the proposed sampling method outperforms a range of baselines.
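For concreteness, I am reading the cross-entropy between two clients' data distributions as the standard quantity below; this is my assumption about the intended meaning, and the paper should state it explicitly:

$$H(P_{Z_i}, P_{Z_j}) = -\,\mathbb{E}_{Z \sim P_{Z_i}}\!\left[\log P_{Z_j}(Z)\right].$$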
## Strengths
- The paper presents clear definitions of the new risk functions (Definitions 1 and 2).
- The information-theoretic generalization bounds (Theorems 1 and 2) are presented well and explained with useful remarks.
- In particular, the authors show that, from the client participation perspective, the bounds can be minimized by careful client sampling and by setting the right participation weights $\alpha_i$.
## Weaknesses
- The paper contains many undefined notations, especially from Section 4 onward. For example, the authors do not sufficiently explain what $F_i(\cdot)$ and $\mathbf{w}$ are in Assumption 3. Is $F_i(\cdot)$ the usual empirical risk or its information-theoretic counterpart? Similarly, with respect to what is the local gradient $\mathbf{g}_t^i$ taken?
- This notational ambiguity prevents me from fully understanding Algorithm 2. In particular, at line 5, which local loss function does client $i$ minimize?
- The authors aim to optimize the participation-dependent term in the generalization bound, $\min_{j \in \mathcal{I}_t} \sum_{i \in \mathcal{I}_p \setminus \mathcal{I}_t} H(P_{Z_i}, P_{Z_j})$. However, the relationship between this problem and the formulations in (11) and (12) is insufficiently clear (see also the sketch after this list). For example, what do the authors mean by the claim that cosine similarity is "suitable for FL"? Moreover, I do not see a proof that this optimization problem is equivalent to the one in (12).
- In Section 4.2.2, similarly, it is unclear over what set the convex hull is taken.
- Empirically, it is unusual to update client parameters with only a single round of gradient descent per communication round, as Algorithm 2 prescribes, because communication is expensive relative to local computation; standard FL methods such as FedAvg typically run multiple local steps or epochs between communications.
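To make my question about (11) and (12) concrete, below is a minimal sketch of what I understand a cosine-similarity-based client selection rule to look like. The representation of each client by a vector (e.g., its latest model update), the greedy selection rule, and all function names are my own assumptions for illustration, not necessarily the authors' method; if the paper intends something different, this should be spelled out.

```python
# Minimal sketch (my assumptions, not necessarily the authors' method):
# each client is represented by a vector (e.g., its last model update),
# cosine similarity serves as a proxy for distributional similarity, and
# clients are selected greedily to be maximally dissimilar from the
# clients that remain unselected.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two client representation vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def select_clients(reprs: dict[int, np.ndarray], k: int) -> list[int]:
    """Greedily pick k clients whose representations are least similar
    (on average) to the remaining, non-selected clients."""
    selected: list[int] = []
    remaining = set(reprs)
    for _ in range(k):
        best, best_score = None, np.inf
        for j in remaining:
            others = remaining - {j}
            score = (np.mean([cosine_similarity(reprs[j], reprs[i]) for i in others])
                     if others else 0.0)
            if score < best_score:
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: 5 clients with random 10-dimensional update vectors.
rng = np.random.default_rng(0)
client_reprs = {i: rng.normal(size=10) for i in range(5)}
print(select_clients(client_reprs, k=2))
```

If this greedy reading is roughly what (11) and (12) implement, the paper should explain how minimizing average cosine similarity relates to the cross-entropy objective quoted above; if it is not, the actual selection rule needs to be stated more explicitly.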
## Recommendation
Reject or borderline reject. There is a lot to clarify in this paper.