# Generalisation bounds for exchangeable data

Most generalisation bounds assume the data is sampled i.i.d. from some distribution. This prevents these bounds from saying anything meaningful about situations like domain adaptation or domain generalisation, where multiple related datasets are available from slightly different domains/contexts; transfer learning, where some information about a dataset captured by a learning algorithm is transferred over to a related task; or meta-learning, where we are learning to learn from data. Relaxing the i.i.d. assumption to exchangeability might allow a mathematically tractable way to extend generalisation bounds to a broader set of situations. For example, different domains may be modelled as i.i.d. copies of an underlying exchangeable process.

## Relaxing i.i.d. to exchangeable

The first question to investigate is whether it is possible to obtain generalisation bounds for exchangeable, rather than i.i.d., data generating processes, or even how to define such bounds. We assume there exists an infinitely exchangeable process $Z_1, Z_2, \ldots$, and our training dataset is a finite marginal $\mathcal{D} = [Z_1, \ldots, Z_N]$. Based on this, we can define the empirical risk in the usual fashion:

$$
\hat{\mathcal{R}}(\theta, \mathcal{D}) = \frac{1}{N} \sum_{n=1}^N \ell(Z_n, \theta)
$$

The population risk, however, would now have to be defined differently (I think). The population risk is a random variable (which presumably converges under some assumptions), defined as follows:

$$
\mathcal{R}(\theta) = \lim_{N\rightarrow\infty}\frac{1}{N}\sum_{n=1}^{N}\ell(Z_n, \theta)
$$

If the $Z_n$ were i.i.d., this $\mathcal{R}(\theta)$ would be a constant (or converge in probability to a constant). But since the $Z_n$ are only exchangeable, the risk converges to a random variable (it depends on the de Finetti parameter).

**RQ1:** Assume $\theta = \mathcal{A}(\mathcal{D})$ is obtained by running an algorithm $\mathcal{A}$ on the dataset $\mathcal{D}$. Can we bound $\mathcal{R}(\theta)$ by $\hat{\mathcal{R}}(\theta, \mathcal{D})$ plus some additional term related to the algorithm? Or are there high-probability PAC bounds for this situation? Perhaps the easiest first exercise is to investigate whether information-theoretic bounds (bounding the difference in expectation) carry over (see e.g. this [paper](https://arxiv.org/abs/2110.01584) and references therein).

## Multi-environment learning

Now we can turn to discussing what we can do if we observe data from multiple environments/data sources. A reasonable assumption is that these data sources are all conditionally i.i.d. given some source-dependent latent factors. Thus, we can model each data source as an i.i.d. copy of the same exchangeable process. So $Z_1, Z_2, \ldots$ is an exchangeable process, and we observe multiple datasets $\mathcal{D}^e, e = 1, \ldots, E$, where each dataset $\mathcal{D}^e = [Z^e_1, Z^e_2, \ldots, Z^e_{N_e}]$ is a finite marginal of an i.i.d. copy of this sequence (a toy simulation of this setup is sketched below, after RQ2). There are multiple research questions to ask here:

**RQ2:** If we have a single $\theta$ and know $\hat{\mathcal{R}}(\theta, \mathcal{D}^e)$ in each of the environments, can we bound $\mathbb{E}[\mathcal{R}(\theta)]$? Here $\mathbb{E}[\mathcal{R}(\theta)]$ is the expected population risk in a previously unseen data source. Would this bound tell us anything about how to obtain such a $\theta$, or about how to combine data from the different environments?
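Before turning to the remaining questions, here is a minimal simulation sketch of the setup above. Everything concrete in it is an assumption for illustration only: a Gaussian de Finetti construction (draw a latent mean per environment, then sample conditionally i.i.d.), squared-error loss, and a fixed constant hypothesis $\theta = 0$. It just shows that the per-environment empirical risk settles to a limit that differs across i.i.d. copies of the exchangeable process, i.e. that the population risk $\mathcal{R}(\theta)$ is a random variable rather than a constant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: each environment e is an i.i.d. copy of the same
# exchangeable process, generated de Finetti-style by first drawing a latent
# mean mu_e, then sampling observations i.i.d. given mu_e.
E, N = 5, 100_000          # number of environments, samples per environment
theta = 0.0                # a fixed hypothesis (the constant predictor 0)

def loss(z, theta):
    """Squared-error loss of the constant predictor theta."""
    return (z - theta) ** 2

for e in range(E):
    mu_e = rng.normal(0.0, 1.0)            # de Finetti latent for this copy
    z = rng.normal(mu_e, 1.0, size=N)      # conditionally i.i.d. given mu_e
    emp_risk = loss(z, theta).mean()       # \hat{R}(theta, D^e)
    # As N grows this converges to 1 + mu_e**2, a limit that differs across
    # environments: the population risk R(theta) is random, unlike the
    # i.i.d. case where it would be the same constant everywhere.
    print(f"env {e}: empirical risk = {emp_risk:.3f}, limit = {1 + mu_e**2:.3f}")
```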
**RQ3:** Suppose we know $\hat{\mathcal{R}}(\theta, \mathcal{D}^e)$ for all environments $e$; can we bound $\mathcal{R}^1(\theta)$, defined as

$$
\mathcal{R}^1(\theta) = \lim_{N\rightarrow\infty}\frac{1}{N}\sum_{n=1}^{N}\ell(Z^1_n, \theta)
$$

So, basically, the interesting question is whether there are any bounds that leverage knowledge obtained from other data sources to get a tighter bound on the risk in a specific data source.

**RQ4:** Suppose we have an algorithm $\mathcal{A}$ for obtaining a hypothesis $\theta^e = \mathcal{A}(\mathcal{D}^e)$ in each environment. Can we bound this algorithm's generalisation gap on an unseen environment $E+1$? The idea here is that data about generalisation performance in the observed data sources $1, \ldots, E$ should be informative about setting the hyperparameters of the algorithm $\mathcal{A}$. This would be a form of meta-learning generalisation bound. It would also be interesting if such bounds indicated anything about whether it is more useful to obtain more data from each source or less data from more sources, for example.
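To make the quantity in RQ4 concrete, the following sketch estimates the per-environment generalisation gap of a toy algorithm. The specifics are again assumptions for illustration, not part of the setup above: `algorithm_A` is empirical risk minimisation for squared loss (the sample mean), each environment is the same Gaussian de Finetti toy model as before, and a large held-out split stands in as a proxy for the within-environment population risk $\mathcal{R}^e(\theta^e)$.

```python
import numpy as np

rng = np.random.default_rng(1)

def algorithm_A(train):
    """Toy algorithm: ERM for squared loss, i.e. the sample mean."""
    return train.mean()

def risk(z, theta):
    """Average squared-error loss of the constant predictor theta on z."""
    return ((z - theta) ** 2).mean()

E, N_train, N_test = 20, 50, 100_000
gaps = []
for e in range(E):
    mu_e = rng.normal(0.0, 1.0)                 # latent for this i.i.d. copy
    train = rng.normal(mu_e, 1.0, size=N_train)
    test = rng.normal(mu_e, 1.0, size=N_test)   # proxy for R^e(theta^e)
    theta_e = algorithm_A(train)
    gaps.append(risk(test, theta_e) - risk(train, theta_e))

# The average of these per-environment gaps is the kind of quantity a
# meta-learning bound of the RQ4 type would need to control for an
# unseen environment E+1.
print(f"mean generalisation gap over {E} environments: {np.mean(gaps):.4f}")
```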