# Solution Documentation of Team puffle (CMU) at the US/UK PETs Prize Challenge

**Authors**: [Ken Liu](https://kenziyuliu.github.io/), [Virginia Smith](https://www.cs.cmu.edu/~smithv/)

<img src="https://blog.ml.cmu.edu/wp-content/uploads/2023/04/feature-v2-2-970x333.png">

<!-- wp:paragraph -->
***TL;DR**: In this post, we study the application of differential privacy in personalized, cross-silo federated learning settings (<a href="https://arxiv.org/abs/2206.07902">full paper</a> at NeurIPS'22), describe how these insights helped us develop a 1st place solution to the pandemic forecasting problem at the <a href="https://www.whitehouse.gov/ostp/news-updates/2023/03/31/us-uk-annouce-winners-innovation-pets-democratic-values/">US/UK Privacy-Enhancing Technologies (PETs) Prize Challenge</a>, and share some challenges and learnings along the way.*
<!-- /wp:paragraph -->

<!-- wp:separator --> <hr class="wp-block-separator"/> <!-- /wp:separator -->

<!-- wp:heading {"level":3} -->
## Part 1: Privacy, Personalization, and Cross-Silo Federated Learning
<!-- /wp:heading -->

<!-- wp:paragraph -->
<strong>Federated learning</strong> (FL) is a technique to train models using decentralized data without directly communicating such data. Typically,
<!-- /wp:paragraph -->

<!-- wp:list -->
- a central server sends a model to the participating <em>clients</em> (e.g., mobile devices);
- the clients train that model using their local data, and send back the updated models; and
- the server aggregates the updates (e.g., simple average, as in <a href="https://arxiv.org/abs/1602.05629">FedAvg</a>),
<!-- /wp:list -->

<!-- wp:paragraph -->
and the cycle repeats. Already, companies like Apple and Google have deployed FL to millions of devices to train models for <a href="https://ai.googleblog.com/2017/04/federated-learning-collaborative.html">predictive keyboards</a>, <a href="https://ai.googleblog.com/2021/11/predicting-text-selections-with.html">text</a> <a href="https://ai.googleblog.com/2023/03/distributed-differential-privacy-for.html">selection</a>, and <a href="https://machinelearning.apple.com/research/improving-on-device-speaker">speaker verification</a>.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
Despite its ability to mitigate systemic privacy risks, FL's data minimization principle alone isn't sufficient for data privacy, as the client model updates can <a href="https://arxiv.org/abs/2104.07586">still</a> <a href="https://arxiv.org/abs/2202.00580">reveal</a> sensitive information.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
This is where <strong>differential privacy</strong> (DP) can come in handy. DP is nice because it is both a formal guarantee and an effective empirical mitigation against attacks like <a href="https://arxiv.org/abs/2112.03570">membership inference</a> and <a href="https://arxiv.org/abs/2204.00032">data poisoning</a>.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
In a nutshell, DP is a <em>statistical</em> notion of privacy where we add randomness to a query on a "dataset" to create <em>quantifiable</em> uncertainty about whether any one "data point" has contributed to the query output.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
DP is typically measured by two scalars $(\varepsilon, \delta)$ (the smaller, the more private). In the context of ML, the model (or its gradients) usually serves as the "query" on the data.
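For reference, the formal guarantee reads: a randomized algorithm $M$ satisfies $(\varepsilon, \delta)$-DP if for any two "datasets" $D, D'$ differing in a single "data point" and any set of outputs $Y$, $$\Pr[M(D) \in Y] \le \exp(\varepsilon) \cdot \Pr[M(D') \in Y] + \delta.$$ We will instantiate this definition for graphs in Part 2.4.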
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
In the above, "dataset" and "data point" are in quotes because <strong>privacy granularity</strong> (aka *privacy unit*) matters. For our usual CIFAR-10 classifier, they may respectively be the training set and a labeled image (<strong>"example-level DP"</strong>). In the context of FL, it is typical to apply "<strong>client-level DP</strong>", where the federated clients, such as mobile devices (aka “user-level”) or institutions (“silo-level”), are the "data points". In general, we want the granularity to correspond to the individual persons or data subjects that actually require privacy protection.
<!-- /wp:paragraph -->

<!-- wp:heading {"level":4} -->
### 1.1. What is a good privacy granularity for <em><strong>cross-silo</strong></em> FL?
<!-- /wp:heading -->

<!-- wp:paragraph -->
While client-level DP is great for <strong>cross-device</strong> FL (e.g. training on millions of iPhones) as each client naturally corresponds to a person, it may not be suitable for <strong>cross-silo FL</strong>, where there are fewer (2-100) clients but each holds many data subjects that require protection --- think hospitals, banks, or schools with many patient, customer, or student records.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
Under cross-silo FL, client-level DP protection:
<!-- /wp:paragraph -->

<!-- wp:list -->
- can be overkill, as all records within a data silo are treated as a single unit and the entities enjoying the protection are not the ones we want to protect;
- can have worse privacy-utility tradeoffs, as clients typically participate in all rounds (no subsampling privacy amplification) and utility doesn't benefit from large local dataset sizes; and
- may be compromised nevertheless, as laws and regulations may mandate that institutions participating in FL be <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7430624/">disclosed publicly</a>.
<!-- /wp:list -->

<!-- wp:image {"align":"center","id":16778,"width":552,"height":196,"sizeSlug":"large","linkDestination":"none"} -->
<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img src="https://blog.ml.cmu.edu/wp-content/uploads/2023/04/dp-models-1024x365.png" alt="" class="wp-image-16778" width="552" height="196"/><figcaption>Fig. 1. Visualizing <strong>client-level</strong> DP vs. <strong>silo-specific example-level</strong> DP in federated learning.</figcaption></figure></div>
<!-- /wp:image -->

<!-- wp:paragraph -->
The above motivates us to consider the notion of <strong>"silo-specific example-level DP"</strong> as a natural alternative (see figure above). *In short, it says that the $k$-th data silo may set its own $(\varepsilon_k, \delta_k)$ example-level DP target for any learning algorithm with respect to its local dataset.* It is more aligned with real-world use cases where each data subject contributes a single example;[^1] e.g. voting / student / vaccination records across regions / schools / clinics for a particular election / exam / disease. This privacy granularity is very easy to implement: each silo can simply use <a href="https://arxiv.org/pdf/1607.00133.pdf">DP-SGD</a> when performing local gradient steps, with calibrated per-step noise to guarantee example-level privacy of the locally computed model updates.
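As a rough sketch of what this could look like inside a single silo, the snippet below implements one DP-SGD step for a logistic regression model with per-example gradient clipping and Gaussian noise; the function and parameter names (`clip`, `sigma`, etc.) are illustrative, and the per-step privacy accounting (composition over the training steps) is omitted.

```python
import numpy as np

def dp_sgd_step(w, X_batch, y_batch, lr=0.1, clip=1.0, sigma=1.0, rng=None):
    """One DP-SGD step for logistic regression: clip each example's gradient,
    sum, add Gaussian noise scaled to clip * sigma, then average and descend."""
    rng = np.random.default_rng() if rng is None else rng
    preds = 1.0 / (1.0 + np.exp(-X_batch @ w))            # sigmoid predictions
    grads = (preds - y_batch)[:, None] * X_batch          # per-example gradients, shape (b, d)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)  # per-example L2 norms
    grads = grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    noisy_sum = grads.sum(axis=0) + rng.normal(scale=clip * sigma, size=w.shape)
    return w - lr * noisy_sum / len(y_batch)
```

Each silo would run steps like this during local training and track its own $(\varepsilon_k, \delta_k)$ budget with a standard accountant; the resulting model updates can then be shared with the server, and DP's post-processing property (more on this below) keeps the guarantee intact.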
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
This privacy notion is not new, and prior work studied its <a href="https://arxiv.org/abs/2106.09779">statistical rates</a>, its <a href="https://arxiv.org/abs/2102.11158">use with $f$-DP</a>, its application to <a href="https://arxiv.org/abs/2103.07491">extracting vaccine adverse event mentions</a>, how its relaxation can <a href="https://arxiv.org/abs/1909.05830">boost utility</a>, and how it <a href="https://arxiv.org/abs/2007.05553">meshes with security primitives</a>. In our post and <a href="https://arxiv.org/abs/2206.07902">full paper</a>, we instead consider aspects of data heterogeneity, DP noise reduction, the personalization spectrum, and their interplay, as we describe below.
<!-- /wp:paragraph -->

<!-- wp:heading {"level":4} -->
### 1.2. The interplay of privacy, heterogeneity, and model personalization
<!-- /wp:heading -->

<!-- wp:paragraph -->
<strong>Model personalization</strong> is a key technique for improving model performance under <em>data heterogeneity</em> (i.e. data non-iidness) across data silos.[^2] Indeed, <a href="https://arxiv.org/pdf/2206.09262.pdf">existing</a> <a href="https://arxiv.org/abs/2210.04620">benchmarks</a> suggest that we may very well have highly heterogeneous settings where directly fitting local models on local data is the best we could do.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
When it comes to privacy, we should also clarify the *trust model* we are assuming and its relationship with model personalization. For <em>cross-device</em> personalized FL, it is natural to assume that local training---i.e., clients train their fully personalized models without any communication---is completely 'private' as the clients <em>are</em> the individuals to protect; in fact, one could even <a href="https://proceedings.mlr.press/v162/bietti22a/bietti22a.pdf">leverage this assumption</a> to improve the privacy-utility tradeoffs.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
On the other hand, for <em>cross-silo</em> personalized FL settings, we may want <em>application-specific</em> trust models. Here, we focus on the natural case where only the <em>data curator</em> within each silo is trusted with full data access (e.g. university IT over student records), and the artefacts (trained model, computed statistics, ...) of either local or federated learning must be obtained and curated privately, as they may be served to non-curator users within the silo through an API.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
The notion of <em>silo-specific example-level privacy</em> fits the above trust model, and it has several implications for the dynamics of federated learning:
<!-- /wp:paragraph -->

<!-- wp:list {"ordered":true} -->
1. <strong>Silos incur privacy costs with <em>queries</em> to their data, but not with <em>participation</em> in FL</strong>. This is due to DP's <a href="https://en.wikipedia.org/wiki/Differential_privacy#Robustness_to_post-processing">robustness to post-processing</a>: silos’ model updates already satisfy their own example-level DP targets, and simply sending these updates does not involve extra dataset queries. One immediate consequence is that local training and FedAvg have <em>identical</em> privacy costs, as far as the data silos are concerned.
2. <strong>A trusted server isn't necessary for the DP guarantees</strong>, as a corollary of the above.
3. <strong>A tradeoff emerges between utility costs from privacy and data heterogeneity.</strong>
   - As DP noise is added independently at each silo, it is reflected in the silos’ model updates and can thus be smoothed out when these updates are averaged (e.g. via FedAvg), leading to a smaller utility drop from DP for the federated model.
   - On the other hand, federation also means that the shared model may suffer from client heterogeneity (non-iid data across silos).
   - The figure below helps illustrate the nuance:
<!-- /wp:list -->

<!-- wp:image {"align":"center","id":17035,"width":592,"height":154,"sizeSlug":"large","linkDestination":"none"} -->
<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img src="https://blog.ml.cmu.edu/wp-content/uploads/2023/04/finetune-gap-annotated-1024x265.png" alt="" class="wp-image-17035" width="592" height="154"/><figcaption>Fig. 2. Consider two interesting phenomena illustrated by a simple experiment where all silos use (ε = 1, δ = 1e-7) example-level DP for their own dataset. <strong>Left</strong>: FedAvg can smooth out the independent, per-silo DP noise and lead to a smaller average utility drop from DP; <strong>Mid/Right</strong>: Finetuning (FedAvg followed by further local training) may not improve utility as expected, as the effect of noise reduction is removed when finetuning begins.</figcaption></figure></div>
<!-- /wp:image -->

<!-- wp:paragraph -->
This "**privacy-heterogeneity cost tradeoff**" is interesting because it suggests that *model personalization can play a key and distinct role* in cross-silo FL. Intuitively, local training (no FL participation) and FedAvg (full FL participation) can be viewed as two ends of a <strong>personalization spectrum</strong> with identical privacy costs, and various personalization algorithms (<a href="https://arxiv.org/abs/1910.10252">finetuning</a>, <a href="http://arxiv.org/abs/2006.04088">clustering</a>, ...) essentially navigate this spectrum in different ways.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
*If local training minimizes the effect of data heterogeneity but enjoys no DP noise reduction, and vice versa for FedAvg, are there personalization methods that lie in between and achieve better utility? If so, what methods would work best?*
<!-- /wp:paragraph -->

<!-- wp:heading {"level":4} -->
### 1.3. MR-MTL is a simple and strong baseline for differentially private cross-silo learning
<!-- /wp:heading -->

<!-- wp:image {"id":17102,"sizeSlug":"large","linkDestination":"none"} -->
<figure class="wp-block-image size-large"><img src="https://blog.ml.cmu.edu/wp-content/uploads/2023/04/pareto_merged-1024x269.png" alt="" class="wp-image-17102"/><figcaption>Fig. 3. <strong>Privacy-utility tradeoffs</strong> for representative personalization methods under silo-specific example-level DP across four cross-silo FL datasets. <strong><a href="https://arxiv.org/abs/1910.10252">Finetune</a></strong>: a simple but very strong baseline for model personalization; <strong><a href="https://arxiv.org/abs/2006.04088">IFCA</a>/<a href="https://arxiv.org/abs/2002.10619">HypCluster</a></strong>: hard clustering of client models; <strong><a href="https://arxiv.org/abs/2012.04221">Ditto</a></strong>: a strong, fair, and robust personalization baseline.
<strong><a href="http://www0.cs.ucl.ac.uk/staff/M.Pontil/reading/mt-kdd.pdf">MR-MTL</a></strong>: the family of mean-regularized multi-task learning methods that consistently outperforms the other baselines.</figcaption></figure>
<!-- /wp:image -->

<!-- wp:paragraph -->
Perhaps we could start by asking for the desirable properties of a good personalization algorithm under our privacy granularity. Among properties like scalability, fairness, and robustness, our earlier observations motivate the following:
<!-- /wp:paragraph -->

<!-- wp:list {"ordered":true} -->
1. <strong>Noise reduction through federation</strong>: As in FedAvg, we prefer methods that <em>consistently</em> smooth out the independent DP noise from each silo. <em>Local finetuning is a counter-example (as we've seen above).</em>
2. <strong>Minimal privacy overhead</strong>: The personalization technique should require little to no additional local dataset queries; in effect, such overhead shifts the utility tradeoff curve downwards. <em><a href="https://arxiv.org/abs/2012.04221">Ditto</a> may be a counter-example (see Fig. 3 above).</em>
3. <strong>Smooth interpolation along the personalization spectrum:</strong> If there is indeed an optimal point between local training and FedAvg, the interpolation provided by the personalization should be fine-grained (if not continuous) such that we could attain that optimum. <em><a href="http://arxiv.org/abs/2006.04088">Client clustering</a> may be a counter-example when there isn't a clear structure to the data heterogeneity.</em>
<!-- /wp:list -->

<!-- wp:paragraph -->
These properties are rather restrictive. For example, <a href="https://arxiv.org/abs/2001.01523">model</a> <a href="https://arxiv.org/abs/2202.05318">mixture</a> and <a href="https://arxiv.org/abs/2002.04758">local</a> <a href="https://arxiv.org/abs/2108.07313">adaptation</a> methods can incur linear overhead in dataset iterations, and so can <a href="https://arxiv.org/abs/2012.04221">multi</a>-<a href="https://arxiv.org/abs/1705.10467">task</a> <a href="https://arxiv.org/abs/1910.01991">learning</a> methods that benefit from additional training. Clustering methods can also incur overhead with <a href="https://arxiv.org/abs/2006.04088">cluster selection</a>, <a href="https://arxiv.org/abs/2109.08119">distillation</a>, or <a href="https://arxiv.org/abs/1910.01991">training restarts</a>, and they may discretize the personalization spectrum in a way that depends on the number of clients, clusters, or top-down partitions.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
These considerations point to <strong>mean-regularized multi-task learning (MR-MTL)</strong> as one of the simplest yet particularly suitable forms of personalization. It simply asks each client $k$ to train its own local model $w_k$, regularize it towards the mean of others' models $\bar w$ via a penalty $\frac{\lambda}{2} \| w_k - \bar w \|_2^2$, and keep $w_k$ across rounds (i.e. clients are stateful). The mean model $\bar w$ is maintained by the FL server (as in FedAvg) and may be updated in every round.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
To be a bit more explicit, each local update step on client $k$ looks like:
<!-- /wp:paragraph -->

<!-- wp:image {"align":"center","id":17297,"width":317,"height":97,"sizeSlug":"large","linkDestination":"none"} -->
<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img src="https://blog.ml.cmu.edu/wp-content/uploads/2023/04/image-1-1024x316.png" alt="" class="wp-image-17297" width="317" height="97"/></figure></div>
<!-- /wp:image -->

<!-- wp:paragraph -->
The hyperparameter $\lambda$ serves as a smooth knob between local training and FedAvg: $\lambda = 0$ recovers local training, and a larger $\lambda$ forces the personalized models to be closer to each other (intuitively, “federate more”).
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
MR-MTL has some nice properties matching our desiderata above:
<!-- /wp:paragraph -->

<!-- wp:list {"ordered":true} -->
1. noise reduction is attained throughout training via the soft proximity constraint towards an averaged model;
2. the mean-regularization itself has <em>no privacy overhead</em>;[^3] and
3. $\lambda$ provides a smooth interpolation along the personalization spectrum.
<!-- /wp:list -->

<!-- wp:paragraph -->
To see why the above is interesting, consider the following experiment where we try a range of $\lambda$ values.
<!-- /wp:paragraph -->

<!-- wp:image {"align":"right","id":17335,"width":315,"height":159,"sizeSlug":"large","linkDestination":"none"} -->
<div class="wp-block-image"><figure class="alignleft size-large is-resized"><img src="https://blog.ml.cmu.edu/wp-content/uploads/2023/04/vehicle_t400_d1_eps05_bump_annotated-1-1024x519.png" alt="" class="wp-image-17335" width="315" height="159"/><figcaption>Fig. 4. Test acc ± std of MR-MTL on a simple cross-silo FL task with varying λ. A "sweet spot" λ* exists where it outperforms both ends of the personalization spectrum (local / FedAvg) under the <em>same</em> privacy budget. Results correspond to ε = 0.5 in the first subplot of the privacy-utility tradeoff curves.</figcaption></figure></div>
<!-- /wp:image -->

Observe that:

<!-- wp:list {"ordered":true} -->
1. In both private and non-private settings, $\lambda$ serves to roughly interpolate between the two ends of the personalization spectrum (local / FedAvg).
2. <strong>The utility at $\lambda^\ast$ can outperform that of the endpoints.</strong> This is significant as MR-MTL$(\lambda)$ uses an <em>identical privacy budget</em> for every $\lambda$.
3. The <strong>utility advantage</strong> of MR-MTL over the endpoints is larger under privacy.
4. The optimal $\lambda^\ast$ is also larger under privacy. Intuitively, silos are encouraged to “federate more” for noise reduction.
5. <a href="https://arxiv.org/abs/2012.04221">Ditto</a> resembles MR-MTL in terms of the training procedure and exhibits similar interpolation behaviors; we included it as a baseline to illustrate the effect of the privacy overhead from 2x local training iterations.
<!-- /wp:list -->

<!-- wp:paragraph -->
The above provides rough intuition on why MR-MTL may be a strong baseline for private cross-silo FL, as we revisit the privacy-utility tradeoff comparisons in Fig. 3. It also has additional benefits such as being extensible to <em>cluster preconditioning</em> to handle some structured heterogeneity known a priori among the clients; see the <a href="https://arxiv.org/abs/2206.07902">full paper and appendix</a> for more details and additional analyses and results!
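Before moving on, here is a minimal sketch of what one MR-MTL round could look like in code, assuming each silo privatizes its data gradients with a DP-SGD-style step as in Part 1.1; `private_grad` stands in for any such clipped-and-noised minibatch gradient, and the remaining names are illustrative.

```python
import numpy as np

def mr_mtl_round(local_models, silo_data, mean_model, private_grad,
                 lam=0.1, lr=0.1, local_steps=10):
    """One round of mean-regularized multi-task learning (MR-MTL).

    Each silo k takes private gradient steps on
        loss_k(w_k) + (lam / 2) * ||w_k - mean_model||_2^2,
    and the server then refreshes the mean model by simple averaging."""
    new_models = []
    for w_k, (X_k, y_k) in zip(local_models, silo_data):
        w = w_k.copy()
        for _ in range(local_steps):
            g_data = private_grad(w, X_k, y_k)  # clipped + noised minibatch gradient (DP-SGD)
            g_reg = lam * (w - mean_model)      # pull towards the mean model; no raw-data access
            w = w - lr * (g_data + g_reg)
        new_models.append(w)                    # clients are stateful: w_k persists across rounds
    new_mean = np.mean(new_models, axis=0)      # server-side aggregation (post-processing only)
    return new_models, new_mean
```

The mean-regularization gradient only touches the local weights and the (already privatized) mean model, so it adds no privacy cost; `lam = 0` recovers local training, while a large `lam` pushes the silos toward a shared model, matching the interpolation described above.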
<!-- /wp:paragraph -->

<!-- wp:separator --> <hr class="wp-block-separator"/> <!-- /wp:separator -->

<!-- wp:heading {"level":3} -->
## Part 2: Federated Pandemic Forecasting at the US/UK PETs Challenge
<!-- /wp:heading -->

<!-- wp:image {"align":"left","id":17433,"width":317,"height":356,"sizeSlug":"large","linkDestination":"none"} -->
<div class="wp-block-image"><figure class="align size-large is-resized"><img src="https://blog.ml.cmu.edu/wp-content/uploads/2023/04/pandemic-small.png" alt="" class="wp-image-17433" width="317" height="356"/><figcaption>Fig. 5. Illustration of the pandemic forecasting problem at the US/UK PETs challenge (<a href="https://prepare-vo.org/synthetic-pandemic-outbreaks">image source</a>). The task is to predict an individual's risk of infection using data siloed across administrative regions, while preserving the privacy of the individual.</figcaption></figure></div>
<!-- /wp:image -->

<!-- wp:paragraph -->
Let's now take a look at a federated pandemic forecasting problem, announced last year at the <a href="https://petsprizechallenges.com/">US/UK Privacy-Enhancing Technologies (PETs) prize challenge</a>, and how we could apply our learnings from Part 1.
<!-- /wp:paragraph -->

<!-- wp:heading {"level":4} -->
### 2.1. Problem setup
<!-- /wp:heading -->

The pandemic forecasting problem asks the following: *Given a person's demographic attributes (e.g. age, household size), locations, activities, infection history, and the contact network, what is the likelihood of infection in the next $t_\text{pred}=7$ days? Can we make predictions while protecting the privacy of the individuals? Moreover, what if the data are siloed across administrative regions?*

There's a lot to unpack in the above. First, the pandemic outbreak problem follows <a href="https://people.wku.edu/lily.popova.zhuhadar/">a discrete-time SIR model</a> (<strong>S</strong>usceptible → <strong>I</strong>nfectious → <strong>R</strong>ecovered) and we begin with a subset of the population infected. Subsequently,

- each person goes about their usual daily activities, like working or shopping, and gets into contact with others either directly or indirectly (e.g. at a mall)---this forms a <strong>contact graph</strong> where individuals are nodes and direct contacts are edges;
- each person may get infected with different risk levels depending on a myriad of factors---their age, the nature and duration of their contact(s), their node centrality, what they do all day, etc.; and
- such infection can also be <em>asymptomatic</em>---the individual can appear in the S state while being secretly infectious.

<!-- wp:paragraph -->
<a href="https://prepare-vo.org/synthetic-pandemic-outbreaks">The challenge dataset</a> models a synthetic pandemic outbreak in Virginia and contains roughly <strong>7.7 million</strong> nodes (persons) and <strong>186 million</strong> edges (contacts) with health states over 63 days (with the same daily contacts repeating every day, as in <a href="https://en.wikipedia.org/wiki/Groundhog_Day_(film)">Groundhog Day</a>); so the actual contact graph is fairly large but also quite sparse.
<!-- /wp:paragraph -->

There are a few extra factors that make this problem challenging:

<!-- wp:paragraph -->
**There is generally a very small fraction of positive cases**: fewer than 5% of people are ever in the I or R state, and roughly 0.3% of people become infected in the final week.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<strong>The data are siloed</strong>: the actual contact graph will be cut along administrative boundaries, e.g., by grouped <a href="https://en.wikipedia.org/wiki/FIPS_county_code">FIPS codes/counties</a>. Each silo only sees a local <em>subgraph</em>, but people may still travel and make contacts across multiple regions. There are up to 10 silos (clients), and in the official evaluation the population size can vary by more than an order of magnitude across silos (up to 2.6M nodes for a single silo).
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<strong>The temporal nature of the problem is subtle</strong>: we are given the first $t_\text{train} = 56$ days of each person's health states (S/I/R) and asked to predict the risk of infection for any individual any time in the subsequent $t_\text{pred} = 7$ days. *What is a training example in this case? How should we perform temporal partitioning? How does this relate to privacy accounting?*
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<strong>Graphs generally complicate DP</strong>: when applying DP to machine learning, we are used to settings where we can clearly define the <em>privacy granularity</em> and how it relates to an actual individual (e.g. tabular data or medical images of patients). This is tricky when it comes to graphs: people can make different numbers of contacts, each of a different nature, and their influence may propagate to their neighbors, neighbors' neighbors, and so on. At a high level (and as specified by <a href="https://www.drivendata.org/competitions/98/nist-federated-learning-1/page/525/#scope-of-sensitive-data">the scope of sensitive data of the competition</a>), what we care about is known as <strong>node-level DP</strong>---the model output is "roughly" the same if we add/remove/replace a node, <em>along with its edges</em>. See Part 2.4 below for more!
<!-- /wp:paragraph -->

<!-- wp:heading {"level":4} -->
### 2.2. Applying MR-MTL with silo-specific example-level privacy
<!-- /wp:heading -->

<!-- wp:paragraph -->
One simple and clean approach to the problem is to just operate on the <strong>individual level</strong> and view it as <strong>(federated) binary classification</strong>: if we could build a feature vector that summarizes what we need to know about an individual, then risk scores are simply the sigmoid probabilities of near-term infection.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
Of course, the problem lies in what that feature vector (and the corresponding label) is---we'll get to this in the following section. Regardless, we can already see that <strong>MR-MTL</strong> with <strong>silo-specific example-level privacy</strong> (from Part 1) is a nice framework for this type of cross-silo problem:
<!-- /wp:paragraph -->

<!-- wp:list -->
- There are a small number of clients (silos), but each holds many data subjects, and client-level DP isn't suitable.
- <strong>Model personalization</strong> is most likely needed as the silos are large and somewhat heterogeneous by construction (geographic regions can't all be similar).
- <strong>Simplicity, usability, efficiency, and scalability</strong>: MR-MTL is remarkably easy to implement, and its simple construction implies minimal resource overhead (over FedAvg and local training) and high scalability. This is crucial for real-world applications.
- <strong>Adaptability and explainability</strong>: The framework is highly adaptable to any learning algorithm that can take DP-SGD-style updates.
It also preserves the explainability of the underlying ML algorithm and the applicability of many interpretability methods, as the privacy mechanism does not obfuscate the model weights, updates, or predictions.
- <strong>Extensibility</strong>: We could also add heuristics to (qualitatively) improve robustness and privacy-utility tradeoffs.
  - For robustness, the (honest) server may inspect client updates and apply clipping, zero out coordinates with extreme values, or leverage existing robust aggregators (e.g. <a href="https://arxiv.org/pdf/1703.02757.pdf">this work</a>, <a href="https://arxiv.org/abs/1912.13445">this work</a>) to fend off malicious clients.
  - For privacy, we may introduce extra heuristics that can help with the DP-utility tradeoff, such as manually removing extreme outliers (e.g. people in isolated locations) or merging categorical features (e.g. activity types).
<!-- /wp:list -->

<!-- wp:paragraph -->
It is also helpful to look at the **threat models** we might be dealing with throughout the FL lifecycle and how the framework behaves under these settings.
<!-- /wp:paragraph -->

<!-- wp:image {"align":"center","id":17668,"width":529,"height":198,"sizeSlug":"large","linkDestination":"none"} -->
<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img src="https://blog.ml.cmu.edu/wp-content/uploads/2023/04/vulnerabilities-1024x384.png" alt="" class="wp-image-17668" width="529" height="198"/><figcaption>Fig. 6. <strong>Various sources of vulnerabilities</strong>. During training, the adversary may have control over (1) participating clients; (2) communication channels; or (3) the central server. During diagnosis/deployment, it may also have (4-5) white-box access to personalized and global models and their iterates.</figcaption></figure></div>
<!-- /wp:image -->

<!-- wp:paragraph -->
We make worst-case assumptions that the adversary is computationally unbounded, has <em>white-box</em> access to personalized/global models (we assume they will be released), and may compromise or eavesdrop on the clients, the server, and the communication channels. <em>Within a client</em> $k$, we make the minimal assumption that the <em>data curator</em> will faithfully execute a private training protocol (e.g. DP-SGD) without nefariously leaking the data directly to an adversary, but the curator may incorporate custom training objectives such as data augmentation or adversarial training as long as it adheres to its $(\varepsilon_k, \delta_k)$-DP budget.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
The <strong>adversary's goal</strong> may be to extract information about any person (say Alice) from the models; this could be things like whether Alice participated in the training (i.e. <em>membership inference</em>), Alice's age and social contacts (i.e. <em>attribute inference</em>), or completely reconstructing Alice's data (i.e. <em>model inversion</em>). The strength of an attack should be correlated with its success probability, ease of execution, and the accuracy/completeness of the extracted information.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
Now let's take a look at what could go wrong with the sources of vulnerability illustrated in the figure above, and how DP's <a href="https://en.wikipedia.org/wiki/Differential_privacy#Robustness_to_post-processing">post-processing property</a> can be quite useful.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
(Source 1) <strong>Clients:</strong> Malicious clients may try to <a href="https://arxiv.org/abs/2007.08432">poison</a> the data or models (e.g. intentionally train with wrong S/I/R labels, or scale up their model updates), but they won't affect honest clients' example-level DP guarantees (by post-processing). They may also <a href="https://arxiv.org/abs/1807.00459">plant backdoors</a>, but the efficacy may be limited by DP, and backdoors are not directly useful for training data extraction.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
(Source 2) <strong>Communication channels</strong>: The post-processing property also means that silos' own DP guarantees will hold even if the channels are compromised. This is a useful advantage of statistical privacy over cryptographic approaches that may require key distribution over insecure channels (e.g. SecAgg).
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
(Source 3) <strong>Central server:</strong> Since the server only performs simple aggregation, a malicious server is no more dangerous than an adversary with full control over all client-server channels, as above.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
(Sources 1, 2, 3) <strong>Collusion:</strong> As a corollary of the above, the DP guarantees for honest clients are not compromised even if all of the above parties are colluding.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
(Sources 4, 5) <strong>Non-interactive / inference-time attacks</strong>: Instead of tampering with training, the adversary may just try attacking the final models with attacks like membership/attribute inference and model inversion. While the models are certainly private artefacts of the data and the post-processing property holds, the tricky part is whether DP is always empirically effective against such attacks. In practice, there is related empirical evidence [<a href="https://arxiv.org/abs/2112.03570">1</a>, <a href="https://www.usenix.org/system/files/sec19-jayaraman.pdf?ref=https://githubhelp.com">2</a>, <a href="https://arxiv.org/abs/2204.02500">3</a>, <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9378274">4</a>] suggesting that DP can be an effective countermeasure, even without strong theoretical guarantees (i.e., small $\varepsilon$).
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
Of course, there are important caveats to the privacy story. As with modeling, the main caveat is that **the definition of a "training example" is intricate**, and it affects how example-level DP in this case translates to actual individual protection, as well as what $\varepsilon$ values would be acceptable on the <em>spectrum</em> of privacy-utility tradeoffs. See the appendix of the [full paper](https://arxiv.org/pdf/2206.07902.pdf) for additional discussions!
<!-- /wp:paragraph -->

### 2.3. Building training examples

<!-- wp:paragraph -->
We now describe how to convert individual information and the contact network into a <strong>tabular dataset</strong> for every silo $k$ with $n_k$ nodes.
<!-- /wp:paragraph -->

<!-- wp:image {"align":"right","id":17736,"width":239,"height":238,"sizeSlug":"large","linkDestination":"none"} -->
<div class="wp-block-image"><figure class="alignright size-large is-resized"><img src="https://blog.ml.cmu.edu/wp-content/uploads/2023/04/neighborhood-aggregation-1024x1022.png" alt="" class="wp-image-17736" width="239" height="238"/><figcaption>Fig. 7. Illustration of our iterative, ℓ-hop neighborhood aggregation.
Here, green nodes are the sampled neighbors and the yellow node(s) can't be sampled.</figcaption></figure></div>
<!-- /wp:image -->

<!-- wp:paragraph -->
Recall that our task is to predict if a person is likely to get infected within $t_\text{pred} = 7$ days, and that each silo only sees its local subgraph. We formulate this via a set of examples $(X_k \in \mathbb{R}^{n_k \times d}, Y_k \in \{0, 1\}^{n_k})$, where the features of each example $X_k^{(i)} \in \mathbb{R}^d$ describe the <strong><em>local neighborhood</em> around a person $i$</strong> (see figure), with binary label $Y_k^{(i)}$ denoting whether the person is infected in the next $t_\text{pred}$ days.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
Each example $(X_k^{(i)}, Y_k^{(i)})$ will have the following features:
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
(1) <strong>Individual features</strong>: Basic (normalized) demographic features like age, gender, and household size; activity features like whether this person has ever been to work, school, church, shopping, etc.; and the individual's *infection history* as concatenated one-hot vectors (which depends on how we create labels; see below).
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
(2) <strong>Contact features:</strong> One of our key simplifying heuristics is that each node's $\ell$-hop neighborhood in the contact graph should contain most of the information we need to predict infection.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
We may want $\ell > 1$ as higher-order information can help (e.g. an individual's immediate contacts could be asymptomatically infected), but it could also be more expensive to compute. Since the graph is irregular, we may also want to *subsample* the graph to save some computation and equalize the feature dimensions across nodes.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
The contact features are built as follows:
<!-- /wp:paragraph -->

<!-- wp:list -->
- For each node, we encode every one of its sampled neighbors using the neighbor's <em>individual features</em> along with the <em>edge features</em> describing the contact---e.g., the location, the duration, and the activity type.
- We use <strong>iterative neighborhood sampling</strong> (figure above), meaning that we first select a set of $S_1$ 1-hop neighbors, and then sample $S_2$ 2-hop neighbors from those 1-hop neighbors, and so on; this is so that the 1-hop edge features can be re-used and the feature dimension $d$ kept low. In the final submission, we tried different values of $\ell$ and $S_{j \le \ell}$ as it poses a <strong>computation-utility tradeoff</strong>.
- Importantly, we decided to do <strong>deterministic neighborhood sampling</strong> --- the same person always takes the same subset of neighbors --- which makes the neighborhood sampling really a <strong>graph pruning</strong> step. This drastically reduces computation as the graph/neighborhoods can now be preprocessed and cached. There are also important implications for privacy, as we'll see in the next section.
<!-- /wp:list -->

<!-- wp:image {"id":17809,"sizeSlug":"large","linkDestination":"none"} -->
<figure class="wp-block-image size-large"><img src="https://blog.ml.cmu.edu/wp-content/uploads/2023/04/features-1024x108.png" alt="" class="wp-image-17809"/><figcaption>Fig. 8. <strong>Illustration of the tabularized features</strong>. Red/pink blocks are individual (node) features and green blocks are edge features describing the contact.
Each blue block denotes the combined features of a single social contact (the neighboring node & the edge), and contacts of higher degrees are concatenated.</figcaption></figure>
<!-- /wp:image -->

<!-- wp:paragraph -->
Putting everything together, the figure above illustrates the final <strong>neighborhood feature vector</strong> that describes a person and her contacts, which can then be used as input to a binary classifier.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
Intriguingly, this way of constructing the training set makes the per-silo models *a simplified variant of a <strong>graph neural network</strong> (GNN):* there is a single step of <em>non-parameterized</em> neighborhood aggregation followed by feature transform and prediction (cf. <a href="https://arxiv.org/abs/1902.07153">SGC models</a>).
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<strong>For the label(s)</strong> $Y_k^{(i)}$, we use a <strong>random infection window</strong> approach:
<!-- /wp:paragraph -->

<!-- wp:list {"ordered":true} -->
1. Pick a window size $t_\text{window}$ (say 21 days);
2. Select a random day $t'$ within the valid range ($t_\text{window} \le t' \le t_\text{train} - t_\text{pred}$);
3. Encode the S/I/R states in the past window from $t'$ into the individual features for every node in the neighborhood; and
4. The label is then whether person $i$ is infected in any of the next $t_\text{pred}$ days from $t'$.
<!-- /wp:list -->

<!-- wp:image {"id":18051,"sizeSlug":"large","linkDestination":"none"} -->
<figure class="wp-block-image size-large"><img src="https://blog.ml.cmu.edu/wp-content/uploads/2023/04/temporal-windows-1024x210.png" alt="" class="wp-image-18051"/><figcaption>Fig. 9. During training, we take a <em>random</em> window of infection states to use as features (the "observation" window) and labels (1 iff the person <em>transitions</em> into infection during the "prediction" window) every time we sample a person, and his/her neighboring nodes will use the same window for building the neighborhood feature vector. During testing, we deterministically take the latest days of the infection history.</figcaption></figure>
<!-- /wp:image -->

<!-- wp:paragraph -->
This implicitly assumes that a person’s infection risk is <em>individual</em>: whether Bob gets infected depends only on his own activities and contacts in the past (say) 21-day window. This is definitely not perfect as it ignores <em>population-level</em> modeling (e.g. dense areas increase likelihood of infection), but it makes the ML problem very simple.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
Since the contact graph repeats every day, this strategy implies that for any person $i$ there can be up to $t_\text{train} - t_\text{pred} - t_\text{window} + 1$ examples with <strong>the same node and contact features, but different disease states and labels</strong>, and thus possibly multiple different rows in the silo's tabular dataset arising from the different random infection windows. In practice, we can pick one random window for every person, every epoch (so $n_k$ nodes imply $n_k$ examples), rather than actually generating all the possible windows at once.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<strong>At test time</strong>, we build the same neighborhood features, except that the infection window is now deterministically the window right before the prediction period.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
See our [open-source repo](https://github.com/kenziyuliu/pets-challenge) for more details on feature selection and training set construction!
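As a rough sketch of how a single training example could be assembled under this scheme (the field names, data structures, and simplifications below are illustrative rather than the exact pipeline in the repo):

```python
import numpy as np

def build_example(pid, node_feats, pruned_neighbors, edge_feats, health_states,
                  t_prime, t_window=21, t_pred=7):
    """Build one (feature vector, label) pair for person `pid` with an
    observation window ending at day `t_prime`.

    node_feats[p]       : static individual features of person p (age, household size, ...)
    pruned_neighbors[p] : deterministically pruned list of (at most S) 1-hop contacts of p
    edge_feats[(p, q)]  : contact features between p and q (duration, location, activity)
    health_states[p]    : per-day S/I/R states of p encoded as integers 0/1/2
    """
    def window_onehot(p):
        # One-hot encode p's S/I/R states over the observation window.
        states = np.asarray(health_states[p][t_prime - t_window:t_prime])
        return np.eye(3)[states].reshape(-1)

    # Individual block: static features plus own infection history.
    blocks = [node_feats[pid], window_onehot(pid)]
    # Contact blocks: each pruned neighbor's features plus the connecting edge's features.
    # (Padding for people with fewer than S contacts is omitted here.)
    for q in pruned_neighbors[pid]:
        blocks += [node_feats[q], window_onehot(q), edge_feats[(pid, q)]]
    x = np.concatenate(blocks)
    # Label (simplified): is `pid` ever in the I state during the prediction window?
    y = int((np.asarray(health_states[pid][t_prime:t_prime + t_pred]) == 1).any())
    return x, y
```

At training time, `t_prime` would be drawn at random per person and per epoch as described above; at test time it is fixed to the last day of the available history.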
### 2.4. <strong>Privacy accounting</strong>

<!-- wp:paragraph -->
Having defined an "example" above, we can now think about what silo-specific "example-level" DP guarantees actually mean for the individuals.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
Recall that each node (person) is now described by a feature vector of its (deterministically subsampled and thus unique) <strong>local enclosing neighborhood</strong>, and thus our framework---specifically DP-SGD for local updates---provides <strong>"neighborhood-level DP"</strong>.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
However, <a href="https://www.drivendata.org/competitions/98/nist-federated-learning-1/page/525/#scope-of-sensitive-data">our privacy desideratum</a> implies that we want "<strong>node-level DP</strong>" (i.e. person-level DP) on the siloed graphs, which intuitively says replacing any one node in the graph, <em>along with its incident edges</em>, won't change the model too much. More formally:
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<strong><em>Node-level DP.</em></strong> <em>A model $M: \mathcal G \to \mathcal Y$, where $\mathcal G$ is the space of undirected contact graphs and $\mathcal Y$ is the set of outputs, is $(\varepsilon, \delta)$ node-level DP if for any subset $Y \subseteq \mathcal Y$ and any neighboring graphs $G, G'$ <strong>differing in one node</strong> and <strong>all of its incident edges</strong>, we have</em> $$\Pr[M(G) \in Y] \le \exp(\varepsilon) \cdot \Pr[M(G') \in Y] + \delta.$$
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
In general, however, it is difficult to bound the influence of a single node on the computed gradients, as its influence can propagate through the graph. In the worst case, we can get very unlucky during neighborhood sampling: a node could appear in all of its neighbors' sampled neighborhoods during training. For the same reason, the usual <a href="https://arxiv.org/abs/1908.10530">subsampling amplification</a> for DP-SGD does not apply directly.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
Fortunately, <a href="https://arxiv.org/abs/2111.15521">Daigavane et al.</a> provided a nice theorem for node-level privacy accounting on generic neighborhood aggregation methods (e.g. GNNs) under <a href="https://arxiv.org/abs/1702.07476">Rényi DP</a> (a relaxation of DP with nice composition properties), which we adapt to our case as follows:
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<em><strong>Node-level Rényi DP for SGD</strong>. For a (sub)graph with $n$ nodes (hence $n$ neighborhoods and $n$ examples) and maximum neighborhood size $S$, training for $T$ steps with batch size $b$ and noise multiplier $\sigma$ yields $\left(\alpha, \gamma T\right)$-node-level RDP, where</em>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
$$ \gamma =\frac{1}{\alpha-1}\ln\mathbb{E}_{\rho}\left[\exp\left(\alpha(\alpha-1)\frac{2\rho^2}{\sigma^2}\right)\right], \quad \rho\sim\operatorname{Hypergeometric}(n, S, b). $$
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
The idea is that the number of times $\rho$ that a certain node $u$ appears in a batch of examples (neighborhoods) can be seen as following a <strong>Hypergeometric distribution</strong> --- if all $n$ neighborhoods within a graph are coins, and those $S$ of them that contain $u$ are heads, then $\rho$ is the number of heads we'll get if we take $b$ coins without replacement --- and we can account for this when performing privacy analysis (computing Rényi divergence).
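To give a sense of how numbers like those in Fig. 10 below can be produced, here is a small sketch of this accounting, taking the theorem as stated and converting from Rényi DP to $(\varepsilon, \delta)$-DP with the standard conversion; the helper name and the $\alpha$ grid are our own illustrative choices rather than the challenge code.

```python
import numpy as np
from scipy.stats import hypergeom
from scipy.special import logsumexp

def node_level_epsilon(n, S, b, sigma, T, delta, alphas=range(2, 64)):
    """Node-level epsilon from the RDP bound above, for a graph with n nodes,
    max neighborhood size S, batch size b, noise multiplier sigma, T steps."""
    # Support of rho ~ Hypergeometric(n, S, b): how many of the b sampled
    # neighborhoods in a batch contain a given node u.
    k = np.arange(max(0, b + S - n), min(S, b) + 1)
    log_pmf = hypergeom.logpmf(k, n, S, b)
    best_eps = np.inf
    for alpha in alphas:
        # gamma = (1 / (alpha - 1)) * log E_rho[exp(alpha * (alpha - 1) * 2 rho^2 / sigma^2)]
        log_mgf = logsumexp(log_pmf + alpha * (alpha - 1) * 2.0 * k**2 / sigma**2)
        gamma = log_mgf / (alpha - 1)
        # Convert (alpha, gamma * T)-RDP to (eps, delta)-DP and keep the best alpha.
        eps = gamma * T + np.log(1.0 / delta) / (alpha - 1)
        best_eps = min(best_eps, eps)
    return best_eps
```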
Note that since the accounting depends on the graph size, it's easier to think about <em>replacement</em> DP instead of addition/removal.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
The above theorem serves as an analogue of <em>subsampling amplification</em> on graphs, but it still implies that the DP guarantee depends (linearly) on the neighborhood size $S$ (when the batch size $b \gg S$), and this may be loose. The following figure can perhaps provide some grounding; note that $\varepsilon$ degrades drastically with larger maximum degree (neighborhood size if we take $\ell = 1$ hop only).
<!-- /wp:paragraph -->

<!-- wp:image {"align":"center","id":18034,"width":257,"height":243,"sizeSlug":"large","linkDestination":"none"} -->
<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img src="https://blog.ml.cmu.edu/wp-content/uploads/2023/04/privacy-table-1024x970.png" alt="" class="wp-image-18034" width="257" height="243"/><figcaption>Fig. 10. DP budgets ε for training 1 epoch with batch size <em>b</em> = 512 on a graph with <em>n</em> ≈ 2,600,000 nodes and δ = 1 / <em>n</em>.</figcaption></figure></div>
<!-- /wp:image -->

<!-- wp:paragraph -->
On the bright side, it is nevertheless good to know that MR-MTL with local DP-SGD updates can *provably satisfy silo-specific **node-level** DP*, which is our privacy goal.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
**Deterministic vs. randomized neighborhood selection.** Recall that as part of constructing training examples, we may want to deterministically prune the graph instead of randomly picking neighbors for a node every time we visit it. By inspecting the theorem above, we see that the former is important in ensuring that each node has up to $S - 1$ neighbors, or more precisely, is included in some other node’s neighborhood up to $S-1$ times, for the amplification to hold.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<strong>How does our temporal partitioning / label creation strategy affect the accounting?</strong> Because random infection windows and the corresponding labels are essentially different feature subsets of the same neighborhood features, the DP guarantees should not degrade and may even be boosted by the temporal randomness.
<!-- /wp:paragraph -->

<!-- wp:heading {"level":4} -->
### 2.5. <strong>Other ingredients</strong>
<!-- /wp:heading -->

<!-- wp:paragraph -->
We can now see our solution coming together:

- each silo builds a tabular dataset using neighborhood vectors for features and infection windows for labels, and
- each silo trains a personalized binary classifier under MR-MTL with silo-specific example-level privacy (convertible to node-level DP).

We now describe some additional design and implementation details below to complete our submission. Unless otherwise stated, we'll look at <em>average precision</em> (AP or AUPRC) as the metric; it summarizes the binary classifier's performance across all operating thresholds, and a random guess corresponds to the fraction of positive examples rather than 0.5.
<!-- /wp:paragraph -->

#### 2.5.1. Model architecture: simple is good

<!-- wp:paragraph -->
While the model design space is large for generic tabular datasets, we are specifically interested in models amenable to <em>gradient-based private optimization</em> (e.g. DP-SGD) and <em>weight-space averaging</em>.
Decision trees &amp; forests tend to work well for tabular data even with data imbalance, but despite <a href="https://arxiv.org/abs/2210.02910">recent progress</a>, we argue that they are not yet mature and battle-tested under private and federated settings due to many underlying assumptions.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
In the end, we compared a linear model (i.e. logistic regression) and a 3-layer MLP, and found that the data noise (either inherent to the data generation, or due to our training set construction) favors linear models, which also have benefits in privacy (in the sense of limited capacity for memorization) as well as explainability, efficiency, and robustness. Of course, linear models aren't *perfect* in terms of utility, and it'd be interesting to explore other modeling techniques (see the last section).
<!-- /wp:paragraph -->

| <strong>Model</strong> | <strong>AUPRC</strong> ($\times 10^{-2}$) |
| --- | --- |
| Random guess | 0.266 |
| 3-layer MLP | 0.329 |
| Logistic Regression | <strong>2.049</strong> |

*Average precision (AUPRC) for model architectures<br>on a 2.6M-node subgraph (using node features only).*

#### 2.5.2. Data imbalance and weighted loss

<!-- wp:paragraph -->
Because the data are highly imbalanced, training naively will suffer from low recall and AUPRC. While there are <a href="https://imbalanced-learn.org/stable/index.html">established over-/under-sampling methods</a> to deal with such imbalance, they may unfortunately break the privacy accounting (Part 2.4) in terms of the subsampling assumption or increased data queries, and their benefits may be offset by the worsened privacy-utility tradeoffs.[^4]
<!-- /wp:paragraph -->

| <strong>Model</strong> (values $\times 10^{-2}$) | <strong>AUPRC</strong> | <strong>Recall</strong> |
| --- | --- | --- |
| BCE Loss | 3.096 | 0.05 |
| Focal Loss $(\alpha=0.25)$ | 3.036 | 0.0 |
| Focal Loss $(\alpha=0.50)$ | 3.165 | 0.259 |
| Focal Loss $(\alpha=0.75)$ | <strong>3.222</strong> | <strong>1.940</strong> |

*Comparing loss functions on a 2.6M-node subgraph.*

<!-- wp:paragraph -->
<a href="https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#class_weights">Weighted losses</a> are also a common strategy to deal with imbalance. They are nice in this case because they are <em>directly compatible</em> with DP subsampling amplification, as we don't change the distribution from which we create/draw training examples. We leveraged the <strong><a href="https://arxiv.org/abs/1708.02002">focal loss</a></strong> from the computer vision literature, designed to emphasize hard examples (infected cases), and found that it did improve both the AUPRC and the recall.
<!-- /wp:paragraph -->

<!-- wp:image {"align":"left","id":18144,"width":277,"height":215,"sizeSlug":"large","linkDestination":"none"} -->
<div class="wp-block-image"><figure class="alignleft size-large is-resized"><img src="https://blog.ml.cmu.edu/wp-content/uploads/2023/04/grad_norm_batch-1024x798.png" alt="" class="wp-image-18144" width="277" height="215"/><figcaption>Fig. 11. L2 norm of gradients in a typical batch of 512 examples. The spikes are 3 positive cases.</figcaption></figure></div>
<!-- /wp:image -->

<!-- wp:paragraph -->
However, loss weights can be slightly problematic when it comes to the DP-SGD implementation because they can exacerbate <em>the mismatch of gradient magnitudes</em> between the positive and negative examples. Fig. 11 illustrates this problem by examining the norms of the gradients in a typical batch.
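For reference, the binary focal loss used above is just a reweighted cross-entropy and can be written in a few lines; this sketch follows the standard formulation (with $\alpha$ weighting the positive class, as varied in the table above, and $\gamma$ down-weighting easy examples) rather than our exact training code.

```python
import numpy as np

def binary_focal_loss(p, y, alpha=0.75, gamma=2.0, eps=1e-7):
    """Binary focal loss: mean of -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted probabilities of the positive class; y: binary labels in {0, 1}.
    alpha up-weights the (rare) positive class; gamma down-weights easy examples."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # probability assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class weight
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```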
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
Data imbalance could be a fundamental issue for DP, which inherently asks for <em>uniform</em> guarantees across examples.
<!-- /wp:paragraph -->

#### 2.5.3. Neighborhood sampling

<!-- wp:paragraph -->
We saw from Part 2.4 (Privacy Accounting) that the neighborhood size has a tradeoff with privacy. Here, we are also interested in its tradeoff with <em>utility</em> and <em>computation</em>. The following table gives us a sense: larger $S$ may have diminishing returns. Note also that using only 1-hop neighbors is generally easier to scale than using higher-hop neighbors, as there is less of the graph to load into memory. In our submission, we used only 1-hop neighbors with $S = 20$ to prioritize efficiency.
<!-- /wp:paragraph -->

| <strong># neighbors</strong> | <strong>AUPRC $\times 10^{-2}$</strong> | <strong>Peak RAM</strong> |
| --- | --- | --- |
| $S = 15$ | 3.013 | ≈ 18 GiB |
| $S = 20$ | 3.158 | ≈ 20 GiB |
| $S = 30$ | 3.244 | ≈ 25 GiB |
| $S = 50$ | 3.279 | OOM |

*Effects of neighborhood size $S$ on a 2.6M-node subgraph.*

#### 2.5.4. Noisy SGD as an *empirical* defense

While we have a complete framework for providing *silo-specific node-level DP* (Part 2.4), for this challenge specifically, we opted for *weak DP* ($\varepsilon > 500$) as an empirical protection, rather than a rigorous theoretical guarantee. While some readers may find this mildly disturbing at first glance (or take it out of context), it is essential to note that *providing a framework to achieve DP* is orthogonal to the *design decision for this specific challenge*, and the optimal $\varepsilon$ depends on the data, the models, the actual threats, the desired privacy-utility trade-off, and several crucial factors linking theory and practice:

1. Attacks such as [poisoning](https://arxiv.org/abs/2204.00032), [membership inference](https://arxiv.org/abs/2112.03570), and [backdoors](https://arxiv.org/abs/2006.07709) can be meaningfully mitigated in practice, as DP is inherently pessimistic;
2. Linear models have an inherently limited capacity for memorization and privacy leakage;
3. Our node-level DP parameters (clip bound and noise level) are calibrated w.r.t. the *positive* cases during training---which are the *outliers* giving huge gradient norms (recall the figure from earlier)---meaning that the vast majority of *inliers* enjoy much stronger protection;
4. Node-level DP is an active research area and current accounting isn't quite tight; and
5. A finite $\varepsilon$ means that the model enters a regime where we can quantify further necessary privacy improvements.

Our solution was in turn attacked by several [red teams](https://drivendata.co/blog/federated-learning-pets-prize-winners-phases-2-3#phase-3-red-teaming) for vulnerabilities. In general, we believe that there is still much room for improving the privacy-utility tradeoff of the current techniques/implementations of (node-level) DP on graphs with data imbalance.

#### 2.5.5. Limitations

<!-- wp:paragraph -->
The above concludes the main building blocks of our solution to the PETs challenge! There are indeed many areas for future work.
For example:
<!-- /wp:paragraph -->

<!-- wp:list -->
- <em>Limited hyperparameter tuning</em>: Due to the time limit, we only tried a few configurations (or simply set a reasonable default) for hyperparameters such as infection window size $t_\text{window}$, number of hops for neighborhood aggregation $\ell$, regularization strength $\lambda$, and number of local training iterations. We expect that proper tuning could yield utility improvements.
- <em>Sub-optimal implementation of example-level DP for logistic regression</em>: The convexity of the ML task actually offers alternatives for implementing (amplified) DP-SGD (e.g. <a href="https://proceedings.mlr.press/v89/chen19e.html">weight-space perturbations</a>, <a href="https://arxiv.org/abs/1808.06651">privacy amplification by iteration</a>, <a href="https://arxiv.org/abs/2103.00039">DP-FTRL</a>, <a href="https://arxiv.org/abs/2203.05363">novel accounting</a>) that may allow better privacy-utility tradeoffs or even data over-/under-sampling to deal with data imbalance.
- <em>High variance of the training pipeline</em>: Our experiments on model architectures indeed suggest that the variance inherent in the training pipeline (partly due to our dataset construction) can lead to models underperforming; this may also cause problems with reliability/explainability.
<!-- /wp:list -->

<!-- wp:separator --> <hr class="wp-block-separator"/> <!-- /wp:separator -->

<!-- wp:heading {"level":4} -->
## Takeaways and Open Challenges
<!-- /wp:heading -->

<!-- wp:paragraph -->
In Part 1, we reviewed our <a href="https://arxiv.org/abs/2206.07902">NeurIPS'22 paper</a> that studied the application of differential privacy in cross-silo federated learning scenarios, and in Part 2, we saw how the core ideas and methods from the paper helped us develop our submission to the PETs prize challenge and win 1st place in the pandemic forecasting track.
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
Despite the many subtleties involved in fully building out a working system, the main ideas were really simple: just train personalized models with DP and add some proximity constraint! For readers interested in more details---such as a theoretical characterization, hyperparameter tuning, further experiments, failure modes, and so on---be sure to check out the [full paper](https://arxiv.org/abs/2206.07902).
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
On the other hand, our work identified several important directions for future work in this context:
<!-- /wp:paragraph -->

<!-- wp:list -->
- <strong>DP under data imbalance</strong>. DP is inherently a <em>uniform</em> guarantee, but data imbalance implies that examples are <em>not</em> created equal---minority examples (e.g., disease infection, credit card fraud) are more informative and tend to give off (much) larger gradients during model training. Should we instead do <em>class-specific</em> (group-wise) DP or refine "<a href="http://dimacs.rutgers.edu/~graham/pubs/papers/pdp.pdf">heterogeneous</a> <a href="https://arxiv.org/abs/1504.06998">DP</a>" notions to better cater to the discrepancy between data points?
- <strong>Graphs and privacy.</strong> Another fundamental basis of DP is that we could delineate what is and isn't an <em>individual</em>. But as we've seen, the information boundaries are often nebulous when an individual is a node in a graph (think social networks and gossip propagation), particularly when the node is arbitrarily well connected.
Instead of having rigid constraints (e.g., imposing a max node degree and accounting for it), are there alternative privacy definitions that offer varying degrees of protection for varying node connectedness?
- <strong>Scalable private and federated trees for tabular data.</strong> Decision trees/forests tend to work extremely well for tabular data such as ours, <em>even with data imbalance</em>, but despite <a href="https://arxiv.org/abs/2210.02910">recent progress</a>, we argue that they are not yet mature under private and federated settings due to some underlying assumptions.
- <strong>Novel training frameworks</strong>. While MR-MTL is a simple and strong baseline under our privacy granularity, it has clear limitations in terms of modeling capacity. Are there other methods that can also provide similar properties to balance the emerging privacy-heterogeneity cost tradeoff?
- <strong>Honest privacy cost of hyperparameter search</strong>. When searching for better frameworks, the dependence on <em>hyperparameters</em> is particularly interesting: our full paper (<a href="https://arxiv.org/pdf/2206.07902.pdf">section 7</a>) made a surprising but somewhat depressing observation that the honest privacy cost of just tuning (on average) 10 configurations (values of $\lambda$ in this case) may <em>already outweigh</em> the utility advantage of the best-tuned MR-MTL($\lambda^\ast$). What does this mean if MR-MTL is already a strong baseline with just a single hyperparameter?
<!-- /wp:list -->

<!-- wp:paragraph -->
This (very long) post serves as the **solution documentation** of our entry to the PETs prize challenge, along with the <a href="https://github.com/kenziyuliu/pets-challenge">open-source release</a>. Please note that we may make updates to this post to provide more discussion. Additionally, keep an eye out for a shorter version of this post on the <a href="https://blog.ml.cmu.edu/">ML@CMU blog</a>!
<!-- /wp:paragraph -->

<!-- wp:separator --> <hr class="wp-block-separator"/> <!-- /wp:separator -->

<!-- wp:paragraph -->
<strong>Check out the following related links:</strong>
<!-- /wp:paragraph -->

<!-- wp:list -->
- Part 1: Our NeurIPS'22 <a href="https://arxiv.org/abs/2206.07902">paper</a> and <a href="https://kenziyuliu.github.io/files/private-silos-poster-prelim.pdf">poster</a>
- Part 2: <a href="https://petsprizechallenge.drivendata.org/">The US/UK Privacy-Enhancing Technologies (PETs) Prize Challenge</a>
  - <a href="https://drivendata.co/blog/federated-learning-pets-prize-winners-phases-2-3#puffle">Team profile</a>
  - <a href="https://github.com/kenziyuliu/pets-challenge">Open-source release</a>
  - Description: <a href="https://petsprizechallenges.com/">challenge main page</a>, <a href="https://iuk.ktn-uk.org/wp-content/uploads/2022/08/PETs-Prize-Challenges_-Public-Health-Technical-Brief-1.pdf">technical brief</a> of the federated pandemic forecasting problem
  - News coverage: <a href="https://drivendata.co/blog/federated-learning-pets-prize-winners-phases-2-3">DrivenData</a>, <a href="https://cmu.is/pets-prize">CMU</a>, <a href="https://www.whitehouse.gov/ostp/news-updates/2023/03/31/us-uk-annouce-winners-innovation-pets-democratic-values/">White House</a>, <a href="https://new.nsf.gov/news/us-uk-winners-prize-challenge-in-privacy-enhancing-tech">NSF</a>, <a href="https://www.gov.uk/government/news/at-summit-for-democracy-the-united-kingdom-and-the-united-states-announce-winners-of-challenge-to-drive-innovation-in-privacy-enhancing-technologies">UK Gov</a>
<!-- /wp:list -->

<!-- wp:separator --> <hr class="wp-block-separator"/> <!-- /wp:separator -->

<!-- wp:paragraph -->
<strong>DISCLAIMER:</strong> All opinions expressed in this post are those of the authors and do not represent the views of CMU.
<!-- /wp:paragraph -->

[^1]: What if one example $\not\Leftrightarrow$ one data subject? E.g. Alice has many bank accounts, both within a bank and across banks, and she may enjoy protection weaker than all of the $(\varepsilon_k, \delta_k)$ targets. Under DP, this may be a tricky issue --- is it always practical to identify all examples across silos that correspond to a single data subject? See Appendix B of the <a href="https://arxiv.org/pdf/2206.07902.pdf">full paper</a> for more discussions.

[^2]: Note that “personalization” refers to customizing models for each <em>client</em> (data silo) in FL rather than a specific person.

[^3]: As compared to local training or FedAvg for a fixed $\lambda$. However, <em>tuning $\lambda$ as a hyperparameter</em> can [incur privacy cost](https://arxiv.org/abs/2110.03620).

[^4]: Alternatively, we may aim for <em>class-specific</em> DP guarantees such that over-/under-sampling can be implemented and accounted for within each class, but this is more of an open area and future work.