NeurIPS 2023
TODO: check papers [22], [24], [27]
Introduction
Contrastive learning (CL) has recently emerged as a promising approach for learning data representations that discover and disentangle the explanatory factors of the data. (advantage)
- Learning to disentangle the explanatory factors of the observed data is valuable for a variety of machine learning (ML) applications.
- Contrastive methods (approximately) invert the data generating process and thus recover the generative factors.
We aim to develop a unified framework of statistical priors on the data generating process to improve our understanding of CL for disentangled representations. (motivation and issue)
- To deal with nonuniform marginal distributions of the latent factors.
- To deal with situations when these factors are conditionally dependent to some degree.
Contributions:
- We extend and unify theoretical guarantees of disentanglement for a family of contrastive losses under relaxed assumptions about the data generating process.
- The theoretical findings are empirically validated on several benchmark datasets and we quantitatively compare the disentanglement performance of four contrastive losses.
- We analyze the impact of partially violated assumptions and investigate practical limitations of the proposed framework.
Disentangled Representation Learning
- Historical Roots: Blind source separation, factorial codes.
- Nonlinear Data Relationships: Shift in focus to nonlinear relationships.
- Identifiability Challenge: without additional assumptions, latent factors are provably non-identifiable from i.i.d. data.
- Mutual Independence: Recovery of factors under mutual independence and specific regularities.
- Dependent Variables and Auxiliary Information: Use of co-observed dependent variables, time-series data, interventions, and augmentations.
- Exponential Family and VAE: Modeling latent factor distributions with conditional independence given an auxiliary variable.
- Extensions to Distributions: Considering Laplace, von Mises–Fisher distributions, and distance-based conditional distributions.
- Shared Latent Factors: Recovery of latents with known shared factors or interventions, and proving identifiability with general causal dependencies.
Contrastive Learning
- It has been observed that contrastive objectives with tighter mutual information (MI) bounds do not necessarily enhance downstream performance.
- Zimmermann et al. showed that InfoNCE approximately inverts the data generating process, leading to the identification of the true latent factors.
Contrastive Learning for Disentangled Representations
Framework
Data Generating Process
- Z denotes the space of latent factors.
- X denotes the space of observations.
- Two scalar-valued functions, together with a distance function d on Z, define the conditional distribution of positive pairs in Eq. (1).
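As a concrete illustration, here is a minimal sketch of such a data generating process (all distributions and the mixing function are hypothetical choices for illustration, not the paper's): a nonuniform, statistically dependent latent prior, a positive-pair conditional concentrated around the anchor, and a nonlinear injective mixing g.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latents(n):
    """Nonuniform marginal with dependent factors: z1 ~ Beta(2, 5),
    and z2 depends on z1, so the factors are neither uniformly
    distributed nor mutually independent."""
    z1 = rng.beta(2.0, 5.0, size=n)
    z2 = rng.normal(loc=0.5 * z1, scale=0.1, size=n)  # depends on z1
    return np.stack([z1, z2], axis=1)

def sample_positive(z, scale=0.05):
    """Positive partner drawn from a conditional centered at the anchor z
    (a Gaussian here, corresponding to a squared distance in the exponent)."""
    return z + rng.normal(scale=scale, size=z.shape)

def g(z):
    """Nonlinear mixing standing in for the unknown generative map;
    its Jacobian determinant stays positive on the latent support."""
    return np.stack([z[:, 0] + np.tanh(z[:, 1]),
                     z[:, 1] + 0.1 * z[:, 0] ** 3], axis=1)

z = sample_latents(1000)
x, x_pos = g(z), g(sample_positive(z))
```

A contrastive learner only ever sees the observation pairs (x, x_pos); the framework's claim concerns when its encoder can nonetheless recover z.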
Contrastive Learning Approaches
- The similarity (critic) term is a fixed expression that describes the interaction between related examples.
- To fit this method into our theoretical framework, we use a slightly modified version in our analysis, substituting this similarity term with the negative distance between the learned representations.
- We use it here as a loss rather than as a lower bound on MI.
- To the best of our knowledge, neither the SCL nor the NWJ objective has been employed to learn disentangled representations or for Independent Component Analysis (ICA).
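For concreteness, a NumPy sketch of the InfoNCE objective with the similarity term taken to be a negative squared distance between representations (a common choice in this line of work; the encoder producing the representations is assumed given):

```python
import numpy as np

def info_nce(h, h_pos, tau=1.0):
    """InfoNCE used directly as a loss (not as an MI lower bound).
    h, h_pos: (n, d) representations of anchors and their positives;
    every non-matching positive in the batch acts as a negative."""
    # similarity = negative squared Euclidean distance, scaled by tau
    d2 = ((h[:, None, :] - h_pos[None, :, :]) ** 2).sum(-1)   # (n, n)
    logits = -d2 / tau
    logits = logits - logits.max(axis=1, keepdims=True)       # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # cross-entropy with the matching positive on the diagonal
    return -np.mean(np.diag(log_probs))
```

Minimizing this pulls each anchor toward its own positive and pushes it away from the rest of the batch; the distance used in the similarity term mirrors the distance d of the generative model.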
Identifiability
We derive precise conditions on the data generating process under which the relationship between the learned representation and true latent factors can be described by a simple function.
If the learned and true latents only match up to other simple transformations (e.g., invertible element-wise maps), the recovered relation extends by exactly those transformations.
Theorem 1 (Weak identifiability). Let the latent space be open and connected, and let the generative map g be invertible and differentiable. Let us further assume that the observed data satisfy the generative model given in Eq. (1). If the distance function d has one of the following properties:
- (i) there exists a strictly increasing function m such that m ∘ d is a norm-induced metric
- (ii) d decomposes coordinate-wise as d(z, z̃) = Σᵢ dᵢ(zᵢ, z̃ᵢ), where each dᵢ is continuous and strictly increasing in |zᵢ − z̃ᵢ|
then the optimal estimator f of any of the contrastive losses presented above identifies the true latent factors up to affine transformations, i.e., h = f ∘ g is an affine mapping.
- The learned representation relates to the true latents through an affine mapping (a linear map plus a translation).
- This allows recovery of the true latent factors up to affine transformations.
- No permutation structure is guaranteed: the linear part may still mix the individual factors, but no nonlinear distortion remains.
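Weak identifiability is typically checked empirically by regressing the true latents on the learned ones and reporting the coefficient of determination; a self-contained sketch (the function name is ours):

```python
import numpy as np

def affine_fit_r2(h, z):
    """Fit z ≈ A h + b by least squares and return R^2 per latent
    dimension. Scores near 1 mean the learned representation h matches
    the true latents z up to an affine transformation."""
    H = np.hstack([h, np.ones((len(h), 1))])   # append bias column
    coef, *_ = np.linalg.lstsq(H, z, rcond=None)
    resid = z - H @ coef
    ss_res = (resid ** 2).sum(axis=0)
    ss_tot = ((z - z.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot
```

A representation that is an exact affine image of the true latents scores 1 in every dimension, while an unrelated representation scores near 0.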
Theorem 2 (Strong identifiability). Assume that all conditions in Theorem 1 are satisfied, and let the function d from Eq. (1) have the coordinate-wise form
d(z, z̃) = Σᵢ αᵢ |zᵢ − z̃ᵢ|^βᵢ
with αᵢ > 0 and βᵢ ≠ 2 for all i. Then the linear part of h = f ∘ g is a generalized permutation matrix, i.e., a composition of a permutation with element-wise scalings and sign flips.
- The map between learned and true latents reduces to a generalized permutation matrix (a composition of a permutation, element-wise scalings, and sign flips).
- Each learned coordinate therefore corresponds to exactly one true factor, preserving the structure of the latents.
- This is a stricter guarantee than Theorem 1: the admissible transformations are a subset of the affine ones, not more complex nonlinear maps.
Summary:
- Weak identifiability: recovery up to affine mappings (factors may remain linearly mixed).
- Strong identifiability: recovery up to generalized permutations (factors are identified individually, up to scaling and sign).
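Strong identifiability is commonly quantified with the mean correlation coefficient (MCC), which matches each learned dimension to one true factor; since absolute correlation is invariant to per-dimension scalings and sign flips, an MCC near 1 indicates recovery up to a generalized permutation. A sketch, assuming SciPy is available:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mcc(h, z):
    """Mean correlation coefficient between learned latents h and true
    latents z (both (n, d)). Solves a one-to-one assignment on absolute
    correlations; |corr| ignores scalings and sign flips, so MCC ~= 1
    iff h recovers z up to a generalized permutation."""
    d = z.shape[1]
    corr = np.corrcoef(h.T, z.T)[:d, d:]             # (d, d) cross-block
    row, col = linear_sum_assignment(-np.abs(corr))  # maximize matching
    return float(np.abs(corr[row, col]).mean())
```

A representation that merely satisfies weak identifiability (a general affine map) mixes factors and scores below 1, which is exactly the gap between the two theorems.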
Experiments
- Table 4 assesses the identifiability of the latent factors under different conditions, covering both the weak and the strong identifiability scenarios.
- Identifiability here refers to the model's ability to recover and represent the underlying factors of the data in a disentangled manner.
- Difficulty with increasing latent factors: disentanglement becomes harder as the number of latent dimensions grows.
- Impact of conditional distribution concentration: results are sensitive to how concentrated the positive-pair conditional is around the anchor.
Conclusion
Our framework accounts for nonuniform marginal distributions of the factors of variation, allows for nonconvex latent spaces, and does not assume that these factors are statistically independent or conditionally independent given an auxiliary variable.
This study provides further evidence that contrastive methods learn to approximately invert the underlying generative process, which may explain their remarkable success in many applications.
Appendix
Nonconvex Latent Spaces:
In disentangled representation learning, a "nonconvex latent space" is a latent space whose support is not a convex set: there exist valid latent vectors whose connecting line segment leaves the space. Angles, rotations, and other factors of variation that live on spheres, tori, or annuli give rise to such spaces.
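For instance, latents constrained to an annulus form a nonconvex set, because two valid latent vectors can have a midpoint outside the space. A toy check (function name and radii are ours):

```python
import numpy as np

def in_annulus(z, r_min=0.8, r_max=1.2):
    """Membership test for an annulus-shaped (nonconvex) latent space."""
    r = float(np.linalg.norm(z))
    return r_min <= r <= r_max

z_a = np.array([1.0, 0.0])    # valid latent
z_b = np.array([-1.0, 0.0])   # valid latent
mid = 0.5 * (z_a + z_b)       # the origin: outside the annulus
```

Frameworks that require a convex latent space exclude supports like this one; the framework discussed here does not.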