What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization

1Boston University, 2Runway

Abstract

Domain Generalization aims to develop models that generalize to novel and unseen data distributions. In this work, we study how model architectures and pre-training objectives impact feature richness and propose a method to leverage them effectively for domain generalization. Specifically, given a pre-trained feature space, we first discover latent domain structures, referred to as pseudo-domains, that capture domain-specific variations in an unsupervised manner. Next, we augment existing classifiers with these complementary pseudo-domain representations, making them more amenable to diverse unseen test domains. We analyze how different pre-training feature spaces differ in the domain-specific variances they capture. Our empirical studies reveal that features from diffusion models excel at separating domains in the absence of explicit domain labels and capture nuanced domain-specific information. Across 5 datasets, our very simple framework improves generalization to unseen domains, with test-accuracy gains of up to 4% over the standard Empirical Risk Minimization (ERM) baseline. Crucially, our method outperforms most algorithms that access domain labels during training.

Teaser image

t-SNE visualization of the latent space from different pre-training objectives: CLIP, DiT, MAE, ResNet-50 on the VLCS dataset. Note how the diffusion features from DiT separate the 4 domains (Caltech101, LabelMe, SUN09, VOC2007) effectively, suggesting that latent domain structures can be captured without explicit supervision.

Method

Pseudo-Domain Discovery

Our method builds on the insight that domain-specific structure can be inferred from pre-trained features without requiring domain labels. We begin by identifying pseudo-domains via clustering in the latent space.
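As a concrete illustration, this discovery step can be sketched as k-means clustering over pre-trained features. The function name and the choice of k-means here are illustrative assumptions; the paper's exact clustering setup may differ:

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_pseudo_domains(features: np.ndarray, k: int, seed: int = 0):
    """Cluster pre-trained features (N, D) into k pseudo-domains.

    Returns per-sample pseudo-domain assignments and the cluster
    centroids, which serve as pseudo-domain representations.
    (Illustrative sketch; k-means is an assumed choice of clusterer.)
    """
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed)
    assignments = kmeans.fit_predict(features)  # shape (N,)
    centroids = kmeans.cluster_centers_         # shape (k, D)
    return assignments, centroids
```

No domain labels enter this step; k is a hyperparameter (set to the number of ground-truth domains only when evaluating separation, as described below).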

Pseudo-domain discovery
DiT feature space on the PACS dataset. Note how the feature space captures domain-related variations (e.g., light sketches are grouped together and separated from darker ones; cartoons are separated from sketches, photos, and paintings, etc.).
Clustering visualization
Pseudo-domains captured in the diffusion latent space of DiT on PACS. The clusters group images based on nuanced style-specific variances rather than class-specific variances.

Quantifying Domain Separation

We quantify domain separation using the Normalized Mutual Information (NMI) score between cluster assignments (with K = number of ground-truth domains) and the corresponding domain labels. For example, for the VLCS dataset shown below, we cluster with K = 4 and compute the NMI score between the cluster assignments and the domain labels. In addition to Domain NMI scores, we compute Class NMI scores in the same fashion, which quantifies how much class-specific information the clusters capture.
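Both scores reduce to a standard NMI computation, e.g. via scikit-learn (a minimal sketch; the helper name is ours):

```python
from sklearn.metrics import normalized_mutual_info_score

def domain_separation_scores(cluster_assignments, domain_labels, class_labels):
    """NMI of cluster assignments against domain and class labels.

    A feature space well suited for pseudo-domain discovery should
    score high on domain NMI and comparatively low on class NMI.
    """
    domain_nmi = normalized_mutual_info_score(domain_labels, cluster_assignments)
    class_nmi = normalized_mutual_info_score(class_labels, cluster_assignments)
    return domain_nmi, class_nmi
```

NMI is symmetric and lies in [0, 1]: 1 means the clusters reproduce the labels exactly, 0 means they are independent of them.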

Domain separation comparison

Domain vs Class NMI

To obtain domain-specific centroids from the clusters, we ideally want a feature space that captures domain-specific information while remaining invariant to class-specific information. The NMI scores described above quantify this: the ideal feature space has a high domain NMI and a relatively low class NMI.

Normalized Mutual Information (NMI) – Domain Labels

Domain NMI scores

Normalized Mutual Information (NMI) – Class Labels

Class NMI scores
VLCS domain vs class nmi
Comparison of domain and class NMI scores for different feature spaces on the VLCS dataset. DiT exhibits high domain NMI scores while having a low class NMI score.

GUIDE: Generalization using Inferred Domains from Latent Embeddings

In the standard classification pipeline, we append these pseudo-domain representations to the features from the ResNet-50 backbone before passing the concatenated vector to the classifier.
Training pipeline
Training Pipeline. The green-shaded region represents the clustering and transformation step. Green solid arrows indicate gradient flow, while red arrows represent non-gradient operations. The feature extractor $\mathbf{\Psi}$ first clusters samples to compute the pseudo-domain centroids. The transformation function $\mathcal{T}$ then transforms these centroids to the latent space of $\mathbf{\Phi}$, producing transformed pseudo-domain centroids, which are concatenated with the features from $\mathbf{\Phi}$ and sent to the classifier.
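The classification head can be sketched in PyTorch as follows. This is a hedged sketch, not the paper's exact implementation: modeling $\mathcal{T}$ as a single linear layer and stopping gradients on the centroids (the red non-gradient arrows) are our assumptions:

```python
import torch
import torch.nn as nn

class GuideHead(nn.Module):
    """Sketch of the GUIDE classification head.

    Backbone features from Phi are concatenated with a transformed
    pseudo-domain centroid before classification. T is assumed to be
    a single linear layer mapping centroid space into Phi's space.
    """

    def __init__(self, feat_dim: int, centroid_dim: int, num_classes: int):
        super().__init__()
        self.transform = nn.Linear(centroid_dim, feat_dim)       # T
        self.classifier = nn.Linear(feat_dim * 2, num_classes)

    def forward(self, phi_feats: torch.Tensor, centroids: torch.Tensor):
        # centroids: pseudo-domain centroid assigned to each sample.
        # detach() keeps this a non-gradient path, as in the diagram.
        t = self.transform(centroids.detach())
        return self.classifier(torch.cat([phi_feats, t], dim=-1))
```

At test time the same head is used; each sample is simply assigned the nearest pseudo-domain centroid in $\mathbf{\Psi}$'s space.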

DomainBed Results

GUIDE on different feature spaces

Results table: GUIDE with different feature spaces
  • Pseudo-domain features from diffusion models offered the best utility.
  • DiT excelled on high-level domain shifts (e.g., PACS, VLCS).
  • Stable Diffusion 2.1 performed best on environmental and spatial shifts (e.g., TerraIncognita).

GUIDE against other approaches

Results table: GUIDE against other approaches
  • GUIDE outperforms both the baseline and other methods that rely on explicit ground truth domain labels during training. Methods in cyan correspond to domain-adaptive classifiers (described in Sec. 3.3).

GUIDE + Enhanced Training Strategies

Results table: GUIDE combined with enhanced training strategies

BibTeX

@misc{thomas2025whatslatentleveragingdiffusion,
  title={What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization},
  author={Xavier Thomas and Deepti Ghadiyaram},
  year={2025},
  eprint={2503.06698},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2503.06698},
}