Authors: Ken Liu, Virginia Smith
TL;DR: In this post, we study the application of differential privacy in personalized, cross-silo federated learning settings (full paper at NeurIPS'22), describe how these insights helped us develop a 1st place solution to the pandemic forecasting problem at the US/UK Privacy-Enhancing Technologies (PETs) Prize Challenge, and share some challenges and learnings along the way.
Federated learning (FL) is a technique to train models using decentralized data without directly communicating such data. Typically, a central server sends the current global model to participating clients, each client trains the model locally on its own data and sends back only the model update, the server aggregates the updates into a new global model, and the cycle repeats. Already, companies like Apple and Google have deployed FL to millions of devices to train models for predictive keyboards, text selection, and speaker verification.
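For readers who prefer code, here is a minimal sketch of one such round in the style of FedAvg; the function names (`fedavg_round`, `local_update`) and the plain numpy parameter vectors are illustrative simplifications, ignoring details like client sampling and secure aggregation.

```python
import numpy as np

def fedavg_round(global_model, clients, local_update):
    """One illustrative FedAvg-style round: broadcast, local training, weighted averaging.

    global_model: 1-D numpy array of model parameters
    clients: list of (client_dataset, num_examples) pairs
    local_update: function(model, dataset) -> locally trained model (numpy array)
    """
    updates, sizes = [], []
    for dataset, n in clients:
        updates.append(local_update(global_model.copy(), dataset))  # local training
        sizes.append(n)
    # Server aggregates the client models, weighted by local dataset size.
    return np.average(np.stack(updates), axis=0, weights=np.asarray(sizes, dtype=float))
```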
Despite its ability to mitigate systemic privacy risks, however, FL's data minimization principle alone isn't sufficient for data privacy, as the client model updates can still reveal sensitive information.
This is where differential privacy (DP) can come in handy. DP is nice because it is both a formal guarantee and an effective empirical mitigation against attacks like membership inference and data poisoning.
In a nutshell, DP is a statistical notion of privacy where we add randomness to a query on a "dataset" to create quantifiable uncertainty about whether any one "data point" has contributed to the query output.
DP is typically measured by two scalars $(\varepsilon, \delta)$: the smaller they are, the more private the mechanism.
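For concreteness, a randomized algorithm $M$ (e.g. a training procedure) is $(\varepsilon, \delta)$-DP if for all neighboring "datasets" $D, D'$ (differing in one "data point") and all sets of outcomes $S$:

$$
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S] + \delta.
$$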
In the above, "dataset" and "data point" are in quotes because privacy granularity (aka privacy unit) matters. For our usual CIFAR-10 classifier, they may respectively be a labeled image and the training set ("example-level DP"). In the context of FL, it is typical to apply "client-level DP", where the federated clients, such as mobile devices (aka “user-level”) or institutions (“silo-level”), are the "data points". In general, we want the granularity to correspond individual persons or data subjects that actually require privacy protection.
While client-level DP is great for cross-device FL (e.g. training on millions of iPhones) as each client naturally corresponds to a person, it may not be suitable for cross-silo FL, where there are fewer (2-100) clients but each holds many data subjects that require protection: think hospitals, banks, or schools with many patient, customer, or student records.
Under cross-silo FL, client-level DP protection is arguably a mismatched granularity: it protects each institution's dataset as a whole rather than the individual patients, customers, or students within it, and with only a handful of clients, the noise needed to hide an entire silo's contribution can badly hurt utility.
The above motivates us to consider the notion of "silo-specific example-level DP" as a natural alternative (see figure above).
In short, it says that the $k$-th silo may set its own $(\varepsilon_k, \delta_k)$ example-level DP target for any learning algorithm with respect to its own local dataset.
It is more aligned with real-world use cases where each data subject contributes a single example;[1] e.g. voting / student / vaccination records across regions / schools / clinics for a particular election / exam / disease. This privacy granularity is very easy to implement: each silo can simply use DP-SGD when performing local gradient steps with calibrated per-step noise to guarantee example-level privacy of the locally computed model updates.
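As a rough illustration of what such a local step looks like (not the exact implementation from our paper), here is a naive PyTorch sketch of DP-SGD with per-example gradient clipping and Gaussian noise; in practice one would use a vetted library such as Opacus, which also handles the privacy accounting.

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD step: clip each example's gradient, add Gaussian noise, update.

    Illustrative sketch only; the naive per-example loop is slow, and real
    accounting should come from a vetted DP library rather than this code.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(batch_x, batch_y):  # per-example gradients
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        # Clip the full per-example gradient to L2 norm <= clip_norm.
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total_norm + 1e-12)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)

    batch_size = len(batch_x)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * noise_multiplier * clip_norm
            p.add_(-(lr / batch_size) * (s + noise))
```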
This privacy notion is not new; prior work has studied aspects of it such as its statistical rates. Its interplay with model personalization under data heterogeneity, however, is much less well understood, and that is the focus of our work.
Model personalization is a key technique for improving model performance under data heterogeneity (i.e. data non-iidness) across data silos.[2] Indeed, existing benchmarks suggest that we may very well have highly heterogeneous settings where directly fitting local models on local data is the best we could do.
When it comes to privacy, we should also clarify the trust model we are assuming and its relationship with model personalization. For cross-device personalized FL, it is natural to assume that local training (i.e., clients train their fully personalized models without any communication) is completely 'private', as the clients are the individuals to protect; in fact, one could even leverage this assumption to improve the privacy-utility tradeoffs.
On the other hand, for cross-silo personalized FL settings, we may want application-specific trust models. Here, we focus on the natural case where only the data curator within each silo is trusted with full data access (e.g. university IT over student records), and the artefacts of either local or federated learning (trained models, computed statistics, …) must be obtained and curated privately, as they may be served to non-curator users within the silo through an API.
The notion of silo-specific example-level privacy fits the above trust model, and it has interesting implications on the dynamics of federated learning: since every silo adds its own noise to its local updates, federating lets that noise average out across silos (improving the privacy-utility tradeoff), but it also exposes each silo to the bias introduced by other silos' heterogeneous data.
This "privacy-heterogeneity cost tradeoff" is interesting because it suggests that model personalization can play a key and distinct role in cross-silo FL. Intuitively, local training (no FL participation) and FedAvg (full FL participation) can be viewed as two ends of a personalization spectrum with identical privacy costs, and various personalization algorithms (finetuning, clustering, …) essentially navigate this spectrum in different ways.
If local training minimizes the effect of data heterogeneity but enjoys no DP noise reduction, and FedAvg is the exact opposite, are there personalization methods that lie in between and achieve better utility? If so, what methods would work best?
Perhaps we could start by asking for the desirable properties of a good personalization algorithm under our privacy granularity. Among properties like scalability, fairness, and robustness, our earlier observations motivate the following: (1) the method should not require significantly more passes over (i.e. queries to) the local data, since every extra data access consumes privacy budget; and (2) it should allow a smooth interpolation along the personalization spectrum between local training and FedAvg, rather than only a few discrete operating points.
These properties are rather restrictive. For example, model mixture and local adaptation methods can incur linear overhead in dataset iterations, and so can multi-task learning methods that benefit from additional training. Clustering methods can also incur overhead with cluster selection, distillation, or training restarts, and they may discretize the personalization spectrum in a way that depends on the number of clients, clusters, or top-down partitions.
These considerations point to mean-regularized multi-task learning (MR-MTL) as one of the simplest yet particularly suitable forms of personalization. It simply asks each client $k$ to train its own local model $w_k$ while regularizing it to stay close to the mean model $\bar{w}$ averaged across clients.
To be a bit more explicit, each local update step on client $k$ takes a gradient step on the local loss and additionally pulls the model towards the current mean: $w_k \leftarrow w_k - \eta \big( \nabla F_k(w_k) + \lambda (w_k - \bar{w}) \big)$, where $\bar{w}$ is the average of the client models from the previous round.
The hyperparameter $\lambda$ controls the strength of the mean-regularization and thus the degree of personalization: $\lambda = 0$ recovers local training, while larger values pull the client models closer together, recovering FedAvg as $\lambda \to \infty$.
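For concreteness, here is a minimal, illustrative sketch of one such mean-regularized local step (again not the exact implementation from the paper); swapping the plain gradient for the clipped-and-noised gradient from the DP-SGD sketch above gives the private version.

```python
import torch

def mr_mtl_local_step(local_model, mean_model, loss_fn, batch_x, batch_y,
                      lr=0.1, lam=0.5):
    """One MR-MTL local update: gradient step on the local loss plus a pull
    towards the current mean model w_bar (kept fixed during local steps)."""
    loss = loss_fn(local_model(batch_x), batch_y)
    grads = torch.autograd.grad(loss, list(local_model.parameters()))
    with torch.no_grad():
        for p, g, p_bar in zip(local_model.parameters(), grads,
                               mean_model.parameters()):
            # w_k <- w_k - lr * ( grad F_k(w_k) + lam * (w_k - w_bar) )
            p.add_(-lr * (g + lam * (p - p_bar)))
```

After each round of local steps, the server simply averages the (noisy) client models to refresh `mean_model`.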
MR-MTL has some nice properties matching our desiderata above: (1) the noisy client models are softly pulled towards (and averaged into) the mean model, so clients still enjoy noise reduction throughout training; (2) the mean-regularization adds essentially no privacy overhead, since it requires no additional passes over the local data; and (3) $\lambda$ provides a smooth knob along the personalization spectrum between local training and FedAvg.
To see why the above is interesting, consider the following experiment where we try a range of $\lambda$ values, roughly interpolating between local training ($\lambda = 0$) and FedAvg (large $\lambda$), both with and without silo-specific example-level privacy.
Observe that: (1) an intermediate $\lambda$ can outperform both ends of the personalization spectrum; (2) under privacy, the optimal $\lambda^*$ shifts to a larger value, i.e. the added DP noise gives clients an incentive to federate more closely; and (3) the utility advantage of MR-MTL at $\lambda^*$ over the endpoints becomes more pronounced under privacy.
The above provides rough intuition on why MR-MTL may be a strong baseline for private cross-silo FL, as we revisit the privacy-utility tradeoff comparisons in Fig. 3. It also has additional benefits such as being extensible to cluster preconditioning to handle some structured heterogeneity known a priori among the clients; see the full paper and appendix for more details and additional analyses and results!
Let's now take a look at a federated pandemic forecasting problem, announced last year at the US/UK Privacy-Enhancing Technologies (PETs) prize challenge, and how we could apply our learnings from Part 1.
The pandemic forecasting problem asks the following: Given a person's demographic attributes (e.g. age, household size), locations, activities, infection history, and the contact network, what is the likelihood of infection in the next week (the prediction period)?
There's a lot to unpack in the above. First, the pandemic outbreak problem follows a discrete-time SIR model (Susceptible → Infectious → Recovered) and we begin with a subset of the population infected. Subsequently, on every day, each susceptible person may become infected by each of their infectious contacts with some probability, and infectious people eventually transition to the recovered state.
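For intuition, one day of such discrete-time SIR dynamics on a contact graph could look like the sketch below; the transmission probability and recovery time are made-up placeholders rather than the challenge's actual simulation parameters.

```python
import random

S, I, R = "S", "I", "R"  # susceptible / infectious / recovered

def sir_step(state, days_infected, contacts, p_transmit=0.02, recovery_days=7):
    """Advance the outbreak by one day (illustrative sketch).

    state: dict node -> S/I/R
    days_infected: dict node -> days spent infectious so far
    contacts: dict node -> list of neighbor nodes (the daily contact graph)
    """
    new_state = dict(state)
    for node, st in state.items():
        if st == S:
            # Each infectious contact independently transmits with some probability.
            n_infectious = sum(1 for nb in contacts.get(node, []) if state[nb] == I)
            if random.random() < 1 - (1 - p_transmit) ** n_infectious:
                new_state[node] = I
        elif st == I:
            days_infected[node] = days_infected.get(node, 0) + 1
            if days_infected[node] >= recovery_days:
                new_state[node] = R
    return new_state
```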
The challenge dataset models a synthetic pandemic outbreak in Virginia and contains roughly 7.7 million nodes (persons) and 186 million edges (contacts) with health states over 63 days (with the same daily contacts repeated every day, Groundhog Day style); so the actual contact graph is fairly large but also quite sparse.
There are a few extra factors that make this problem challenging:
The positive cases are very rare: fewer than 5% of people are ever in the I or R state, and only roughly 0.3% of people become infected in the final week.
The data are siloed: the actual contact graph is cut along administrative boundaries, e.g., by grouped FIPS codes/counties. Each silo only sees a local subgraph, but people may still travel and make contacts across multiple regions. There are up to 10 silos (clients), and in the official evaluation the population size varies by more than an order of magnitude across silos (up to 2.6M nodes for a single silo).
The temporal nature of the problem is subtle: we are given the health states over the first 56 days and asked to predict infections in the final week, so training labels must be created from within the observed period while the actual prediction window lies beyond it.
Graphs generally complicate DP: when applying DP to machine learning, we are used to settings where we can clearly define the privacy granularity and how it relates to an actual individual (e.g. tabular data or medical images of patients). This is tricky when it comes to graphs: people can make different numbers of contacts, each of a different nature, and their influence may propagate to their neighbors, neighbors' neighbors, and so on. At a high level (and as specified by the scope of sensitive data of the competition), what we care about is known as node-level DP: the model output is "roughly" the same if we add/remove/replace a node, along with its edges. See Part 2.4 below for more!
One simple and clean approach to the problem is to just operate on the individual level and view it as (federated) binary classification: if we could build a feature vector that summarizes what we need to know about an individual, then risk scores are simply the sigmoid probabilities of near term infection.
Of course, the problem lies in what that feature vector (and the corresponding label) is; we'll get to this in the following section. Regardless, we can already see that MR-MTL with silo-specific example-level privacy (from Part 1) is a nice framework for this type of cross-silo problem: each silo trains a model personalized to its own region (handling the heterogeneity across regions), the only things communicated are noisy model updates, and the degree of federation can be tuned via a single hyperparameter $\lambda$.
It is also helpful to look at the threat models we might be dealing with throughout the FL lifecycle and how the framework behaves under these settings.
We make worst-case assumptions that the adversary is computationally unbounded, has white-box access to the personalized/global models (we assume they will be released), and may compromise or eavesdrop on the clients, the server, and the communication channels.
Within a client (silo), only the trusted data curator has full access to the raw local data; any other party, such as the users who query the deployed models through an API, is a potential adversary.
The adversary's goal may be to extract information of any person (say Alice) from the models; this could be things like whether Alice participated in the training (i.e. membership inference), Alice's age and social contacts (i.e. attribute inference), or completely reconstructing Alice's data (i.e. model inversion). The strength of an attack should be correlated with its success probability, ease of execution, and the accuracy/completeness of the extracted information.
Now let's take a look at what could go wrong with the sources of vulnerabilities illustrated in the figure above, and how DP's post-processing property can be quite useful.
(Source 1) Clients: Malicious clients may try to poison the data or models (e.g. intentionally train with wrong S/I/R labels, or scale up their model updates) but they won't affect honest clients' example-level DP guarantees (by post-processing). They may also plant backdoors but the efficacy may be limited by DP and backdoors are not directly useful for training data extraction.
(Source 2) Communication channels: The post-processing property also means that silos' own DP guarantees will hold even if the channels are compromised. This is a useful advantage of statistical privacy over cryptographic approaches that may require key distribution over insecure channels (e.g. SecAgg).
(Source 3) Central server: Since the server only performs simple aggregation, a malicious server is no more dangerous than an adversary with full control over all client-server channels, as with above.
(Sources 1, 2, 3) Collusion: As a corollary of the above, the DP guarantees for honest clients are not compromised even if all of the above parties are colluding.
(Sources 4, 5) Non-interactive / inference-time attacks: Instead of tampering with training, the adversary may simply attack the final models with membership/attribute inference or model inversion. While the models are private artefacts derived from the data (so the post-processing property holds), the tricky part is whether DP is always effective against such attacks. In practice, there is related empirical evidence [1, 2, 3, 4] suggesting that DP can be an effective countermeasure even without strong theoretical guarantees (i.e. with a large $\varepsilon$).
Of course, there are important caveats to the privacy story. As with modeling, the main caveat is that the definition of a "training example" is intricate: it affects both how example-level DP translates to the protection of actual individuals and what the reported $\varepsilon$ values mean in practice; more on this in Part 2.4.
We now describe how to convert individual information and the contact network into a tabular dataset for every silo $k$.
Recall that our task is to predict if a person is likely to get infected within the one-week prediction period.
Each example thus corresponds to a single person (node) and consists of a feature vector along with a binary label indicating near-term infection. The feature vector has two components:
(1) Individual features: Basic (normalized) demographic features like age, gender, and household size; activity features like whether this person has ever been to work, school, church, shopping, etc.; and the individual's infection history as concatenated one-hot vectors (which depends on how we create labels; see below).
(2) Contact features: One of our key simplifying heuristics is that each node's contact neighborhood is deterministically pruned to at most a fixed number of neighbors, so that every person is always described by the same bounded set of contacts.
We may want this selection to be deterministic (rather than re-sampling neighbors every time we visit a node) both for simplicity and, as we will see in Part 2.4, for privacy accounting.
The contact features are built as follows: for each selected neighbor, we append that neighbor's own individual features together with features describing the contact itself, padding with zeros when a person has fewer neighbors than the cap.
Putting everything together, the figure above illustrates the final neighborhood feature vector that describes a person and her contacts, which can then be used as input to a binary classifier.
Intriguingly, this way of training set construction makes the per-silo models a simplified variant of a graph neural network (GNN): there is a single step of non-parameterized neighborhood aggregation followed by feature transform and prediction (cf. SGC models).
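To make the construction concrete, here is a hedged sketch of how one might assemble such a neighborhood feature vector; the dictionary layout, the `max_neighbors` cap, and the zero-padding are illustrative choices rather than our exact implementation (see the repo for that).

```python
import numpy as np

def build_neighborhood_features(person_id, node_feats, neighbors, edge_feats,
                                max_neighbors=10):
    """Concatenate a person's own features with those of (up to) max_neighbors
    deterministically chosen contacts, zero-padding if there are fewer.

    node_feats: dict node_id -> 1-D numpy array (individual features)
    neighbors:  dict node_id -> list of (neighbor_id, edge_id), pre-pruned/sorted
    edge_feats: dict edge_id -> 1-D numpy array (e.g. contact features)
    """
    pieces = [node_feats[person_id]]
    chosen = neighbors.get(person_id, [])[:max_neighbors]
    nb_dim = len(next(iter(node_feats.values())))
    e_dim = len(next(iter(edge_feats.values())))
    for nb_id, edge_id in chosen:
        pieces.append(node_feats[nb_id])
        pieces.append(edge_feats[edge_id])
    # Zero-pad the remaining neighbor slots so every example has the same length.
    for _ in range(max_neighbors - len(chosen)):
        pieces.append(np.zeros(nb_dim + e_dim))
    return np.concatenate(pieces)
```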
For the label(s), we take a window of a person's infection history as the observed features and label the example as positive if the person becomes infected in the period immediately following that window.
This implicitly assumes that a person’s infection risk is individual: whether Bob gets infected depends only on his own activities and contacts in the past (say) 21-day window. This is definitely not perfect as it ignores population-level modeling (e.g. dense areas increase likelihood of infection), but it makes the ML problem very simple.
Since the contact graph repeats every day, this strategy implies that for any person, the contact features stay the same and only the infection-history window (and hence the label) changes; we can therefore sample a fresh window each time the person is visited during training, effectively getting temporal data augmentation for free.
At test time, we build the same neighborhood features, except that the infection window is now deterministically the window right before the prediction period.
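A minimal sketch of this label-creation heuristic is below; the helper name and the exact window handling are illustrative, with the 21-day observation and 7-day prediction windows taken from the discussion above.

```python
import random

def make_training_example(infection_history, obs_window=21, pred_window=7):
    """Sample a random observation window from a person's infection history
    (a list of daily "S"/"I"/"R" states) and label whether the person becomes
    infected within the following pred_window days (illustrative sketch)."""
    last_start = len(infection_history) - obs_window - pred_window
    start = random.randint(0, last_start)
    observed = infection_history[start:start + obs_window]
    future = infection_history[start + obs_window:start + obs_window + pred_window]
    # Positive iff the person is still susceptible at the end of the window
    # and gets infected within the prediction period.
    label = int(observed[-1] == "S" and any(s == "I" for s in future))
    return observed, label
```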
See our open-source repo for more details on feature selection and training set construction!
Having defined an "example" above, we can now think about what silo-specific "example-level" DP guarantees actually mean for the actual individuals.
Recall that each node (person) is now described by a feature vector of its (deterministically subsampled and thus unique) local enclosing neighborhood, and thus our framework (specifically DP-SGD for local updates) provides "neighborhood-level DP".
However, our privacy desideratum implies that we want "node-level DP" (i.e. person-level DP) on the siloed graphs, which intuitively says replacing any one node in the graph, along with its incident edges, won't change the model too much. More formally:
Node-level DP. A (randomized) training algorithm $M$ satisfies $(\varepsilon, \delta)$ node-level DP if for every pair of graphs $G, G'$ differing in a single node and its incident edges, and every set of possible outputs $S$, we have $\Pr[M(G) \in S] \le e^{\varepsilon} \Pr[M(G') \in S] + \delta$.
In general, however, it is difficult to bound a single node's contribution to the gradients, as its influence propagates through the graph: in the worst case, we could get unlucky during neighborhood sampling and a node could appear in all of its neighbors' sampled neighborhoods during training. For the same reason, the usual subsampling amplification for DP-SGD does not apply directly.
Fortunately, Daigavane et al. provided a nice theorem for node-level privacy accounting on generic neighborhood aggregation methods (e.g. GNNs) under Rényi DP (a relaxation of DP with nice composition properties), which we adapt in our case to the following:
Node-level Rényi DP for SGD (informal). If every node appears in at most a bounded number of the constructed training examples, then running noisy SGD with per-example gradient clipping on those examples satisfies node-level Rényi DP, with a privacy parameter that scales with that occurrence bound on top of the usual DP-SGD accounting.
The idea is that the number of times a single node can influence the released noisy gradients is capped by how many (pruned) neighborhoods it belongs to, so its total contribution, and hence its privacy loss, can be bounded.
The above theorem serves as an analogue of subsampling amplification on graphs, but it still implies that the DP guarantee degrades (linearly) with the neighborhood size, so keeping the number of neighbors per node small is important for obtaining a meaningful guarantee.
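To give a feel for the accounting machinery (without reproducing the actual theorem), here is a deliberately loose back-of-the-envelope sketch: it composes plain Gaussian-mechanism Rényi DP over the number of noisy releases a single node can influence and then converts to $(\varepsilon, \delta)$-DP, ignoring subsampling amplification and all graph-specific refinements.

```python
import math

def gaussian_rdp(alpha, noise_multiplier):
    """Rényi DP of order alpha for one Gaussian-mechanism release with
    sensitivity 1 and noise std = noise_multiplier (Mironov, 2017)."""
    return alpha / (2.0 * noise_multiplier ** 2)

def rdp_to_dp(rdp_eps, alpha, delta):
    """Standard conversion from (alpha, rdp_eps)-RDP to (eps, delta)-DP."""
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)

def crude_epsilon(num_releases, noise_multiplier, delta,
                  alphas=tuple(range(2, 64))):
    """Compose `num_releases` Gaussian releases under RDP and convert to DP,
    minimizing over the RDP order. This assumes the node's examples affect at
    most `num_releases` steps (one example per batch) and ignores subsampling
    amplification, so it only illustrates how the guarantee degrades as a node
    influences more gradient releases."""
    return min(rdp_to_dp(num_releases * gaussian_rdp(a, noise_multiplier), a, delta)
               for a in alphas)

# Example: doubling a node's possible occurrences roughly doubles its RDP cost.
# print(crude_epsilon(100, 1.0, 1e-5), crude_epsilon(200, 1.0, 1e-5))
```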
On the bright side, it is nevertheless good to know that MR-MTL with local DP-SGD updates can provably satisfy silo-specific node-level DP, which is our privacy goal.
Deterministic vs randomized neighborhood selection. Recall that as part of constructing training examples, we may want to deterministically prune the graph instead of randomly picking neighbors for a node every time we visit it. By inspecting the theorem above, we see that the former is important for ensuring that each node appears in a fixed, bounded set of training examples, which is exactly what the occurrence bound in the accounting requires; re-sampling neighbors at every visit could let a node's influence spread over many more examples over the course of training.
How does our temporal partitioning / label creation strategy affect the accounting? Because random infection windows and the corresponding labels are essentially different feature subsets of the same neighborhood features, the DP guarantees should not degrade and may even be boosted by the temporal randomness.
We can now see our solution coming together: each silo builds a tabular dataset of neighborhood features and infection labels from its local contact subgraph, trains a personalized binary classifier via MR-MTL with the other silos, and performs its local updates with DP-SGD so that, by the accounting above, the released models satisfy silo-specific node-level DP.
We now describe some additional design and implementation details below to complete our submission. Unless otherwise stated, we'll look at average precision (AP or AUPRC) as the metric; it summarizes the binary classifier's performance across all operating thresholds, and the AP of a random guess equals the fraction of positive examples, rather than 0.5.
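As a quick sanity check of this property, the snippet below (using scikit-learn) scores a random classifier on a synthetic label vector with a roughly 0.3% positive rate; its AP indeed lands near the positive fraction rather than 0.5.

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.003).astype(int)   # ~0.3% positives
random_scores = rng.random(100_000)                  # a "random guess" scorer

# AP of a random scorer is approximately the positive fraction (~0.003).
print(average_precision_score(y_true, random_scores))
```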
While the model design space is large for generic tabular datasets, we are specifically interested in models amenable to gradient-based private optimization (e.g. DP-SGD) and weight-space averaging. Decision trees & forests tend to work well for tabular data even with data imbalance, but despite recent progress, we argue that they are not yet mature and battle-tested under private and federated settings due to many underlying assumptions.
In the end, we compared a linear model (i.e. logistic regression) and a 3-layer MLP, and found that the data noise (either inherent to the data generation or due to our training set construction) favors linear models, which also have benefits in privacy (in the sense of limited capacity for memorization) as well as explainability, efficiency, and robustness. Of course, linear models aren't perfect in terms of utility, and it'd be interesting to explore other modeling techniques (see the last section).
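For reference, the two architectures we compared are as simple as they sound; a sketch is below, with the MLP hidden width as an illustrative placeholder rather than our exact configuration.

```python
import torch.nn as nn

def make_model(input_dim, arch="linear", hidden_dim=128):
    """Binary classifiers used for comparison; both output a single logit."""
    if arch == "linear":          # logistic regression
        return nn.Linear(input_dim, 1)
    return nn.Sequential(         # 3-layer MLP (hidden width is a placeholder)
        nn.Linear(input_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, 1),
    )
```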
| Model | AUPRC (×10⁻²) |
|---|---|
| Random guess | 0.266 |
| 3-layer MLP | 0.329 |
| Logistic regression | 2.049 |

Average precision (AUPRC) for different model architectures on a 2.6M-node subgraph (using node features only).
Because the data are highly imbalanced, training naively will suffer from low recall and AUPRC. While there are established over-/under-sampling methods to deal with such imbalance, they may unfortunately break the privacy accounting (Part 2.4) in terms of the subsampling assumption or the increased number of data queries, and their benefits may be offset by the worsened privacy-utility tradeoffs.[4]
| Model (values ×10⁻²) | AUPRC | Recall |
|---|---|---|
| BCE loss | 3.096 | 0.05 |
| Focal loss (setting A) | 3.036 | 0.0 |
| Focal loss (setting B) | 3.165 | 0.259 |
| Focal loss (setting C) | 3.222 | 1.940 |

Comparing loss functions (BCE vs. focal loss under three hyperparameter settings) on a 2.6M-node subgraph.
Weighted losses are also a common strategy to deal with imbalance. They are nice in this case because they are directly compatible with DP subsampling amplification, as we don't change the distribution from which we create/draw training examples. We leveraged the focal loss from the computer vision literature designed to emphasize hard examples (infected cases) and found that it did improve both the AUPRC and the recall.
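For completeness, here is a minimal binary focal loss in the formulation of Lin et al.; the default $\gamma$ and $\alpha$ values shown are common illustrative choices, not the tuned settings behind the table above.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss (Lin et al., 2017): down-weights easy examples so the
    rare positive (infected) cases contribute more to the gradient.
    targets: float tensor of 0/1 labels, same shape as logits."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)            # prob. of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```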
However, loss weights can be slightly problematic when it comes to DP-SGD implementation because they can exacerbate the mismatch of gradient magnitudes between the positive and negative examples. Fig. 11 illustrates this problem by examining the norms of the gradients in a typical batch.
Data imbalance could be a fundamental issue for DP, which inherently asks for uniform guarantees across examples.
We saw from Part 2.4 (Privacy Accounting) that the neighborhood size has a tradeoff with privacy. Here, we are also interested in its tradeoff with utility and computation. The following table gives us a sense: larger neighborhoods yield better AUPRC, but the memory footprint of the features grows quickly and eventually becomes the bottleneck.
| # neighbors | AUPRC | Peak RAM |
|---|---|---|
| smallest | 3.013 | ≈ 18 GiB |
| ↓ | 3.158 | ≈ 20 GiB |
| ↓ | 3.244 | ≈ 25 GiB |
| largest | 3.279 | OOM |

Effects of the neighborhood size (increasing down the rows) on AUPRC and peak memory.
While we have a complete framework for providing silo-specific node-level DP (Part 2.4), for this challenge specifically we opted for weak DP (a large $\varepsilon$) as our operating point, so that the added noise does not overwhelm utility on this heavily imbalanced task.
While some readers may find this mildly disturbing at first glance (and it is easy to take it out of context), it is essential to note that providing a framework that achieves DP is orthogonal to this challenge-specific design decision, and the optimal $\varepsilon$ for an actual deployment should be determined by the stakeholders based on the threat model and the utility requirements of the application.
Our solution was also in turn attacked by several red teams for vulnerabilities. In general, we believe that there is still much room for improving the privacy-utility tradeoff of the current techniques/implementations of (node-level) DP on graphs with data imbalance.
The above concludes the main building blocks of our solution to the PETs challenge! There are indeed many areas for future work, such as better model architectures that remain compatible with private training, principled ways to handle extreme data imbalance under DP, and tighter node-level privacy accounting on large contact graphs.
In Part 1, we reviewed our NeurIPS'22 paper that studied the application of differential privacy in cross-silo federated learning scenarios, and in Part 2, we saw how the core ideas and methods from the paper helped us develop our submission to the PETs prize challenge and win 1st place in the pandemic forecasting track.
Despite the many subtleties in fully building out a working system, the main ideas were really simple: just train personalized models with DP and add some proximity constraint! For readers interested in more details, such as a theoretical characterization, hyperparameter tuning, further experiments, and failure modes, be sure to check out the full paper.
On the other hand, our work also identified several important directions for future work in this context, some of which we highlighted above.
This (very long) post serves as the solution documentation of our entry to the PETs prize challenge along with the open-source release. Please note that we may be making updates to this post to provide more discussions.
Additionally, keep an eye out for a shorter version of this post on the ML@CMU blog!
Check out the full paper, our open-source repo, and the PETs prize challenge website for related materials.
DISCLAIMER: All opinions expressed in this post are those of the authors and do not represent the views of CMU.
What if one example does not correspond to exactly one data subject (e.g. a person contributes multiple examples)? Then example-level DP no longer directly translates to person-level protection; one can reason about it via group privacy, at the cost of weaker guarantees. ↩︎
Note that “personalization” refers to customizing models for each client (data silo) in FL rather than a specific person. ↩︎
As compared to local training or FedAvg under an otherwise identical setup (e.g. the same privacy budget). ↩︎
Alternatively, we may target class-specific DP guarantees such that over-/under-sampling can be implemented and accounted for within each class, but this is more of an open area for future work. ↩︎