Semantic Richness or Geometric Reasoning?
The Fragility of VLM’s Visual Invariance

Jason Qiu1*,  Zachary Meurer1*,  Xavier Thomas1*†,  Deepti Ghadiyaram1

1Boston University  ·  *Equal contribution  ·  †Corresponding author


Motivation

Do VLMs reason about these images the same way?

We probe the same geometric tasks across typed fonts, handwritten characters, rare scripts, and different visual domains: semantic richness changes while the spatial question stays the same, yet performance shifts.

Example inputs: Times New Roman, Handwritten English, Omniglot, PACS · Photo, PACS · Art painting, PACS · Cartoon, PACS · Sketch.


Overview

We test whether state-of-the-art VLMs reason about rotation, scale, and identity consistently across symbolic sketches, natural photos, and art paintings. The figure below summarizes the drop in accuracy as semantic cues become sparse.

Figure 1: Failure of visual transformation reasoning across visual domains.

Models are probed on whether two images depict the same object under rotation, scale, or identity. Performance stays high on photos and art but falls on sketches and symbolic scripts, especially for rotation.
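A minimal sketch of how one such probe pair can be constructed, assuming PIL for the image transforms; the exact prompt wording, angles, and scales used in the paper may differ:

from PIL import Image

def make_probe_pair(image_path, transform, angle=90, scale=0.5):
    """Return (reference, probe) images for one same-object question."""
    img = Image.open(image_path).convert("RGB")
    if transform == "rotation":
        probe = img.rotate(angle, expand=True, fillcolor=(255, 255, 255))
    elif transform == "scale":
        w, h = img.size
        probe = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    else:  # identity
        probe = img.copy()
    return img, probe

# The question is held fixed across domains; only the images change.
QUESTION = ("Image 2 may be a transformed version of Image 1. "
            "Do both images depict the same character or object? Answer yes or no.")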


Abstract

This work investigates the fundamental fragility of state-of-the-art vision-language models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: they lack the robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.


Main findings

A short summary of the main takeaways.

01

Sparse semantics expose weak geometric reasoning

Accuracy stays high on natural images (e.g., PACS photos) but drops on sketches and symbolic scripts where visual cues are sparse.

02

Rotation is the hardest transform

Across rotation, scale, and identity tasks, rotation recognition is consistently the most challenging for the VLMs we evaluate.

03

Not fixed by architecture, scale, or prompting

Failures persist regardless of model architecture, scale, or prompting strategy, and are only partially mitigated by in-context learning or structured visual prompts.

04

Encoder similarity ≠ VLM performance

Visual encoders keep an image and its rotated copy close in embedding space, yet VLMs often fail to use this signal when reasoning through the language decoder (a sketch of such a probe follows below).
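A hypothetical version of the encoder probe behind finding 04, assuming a CLIP-style encoder from Hugging Face transformers (the paper's actual encoder, checkpoint, and similarity measure may differ): cosine similarity between an image and its rotated copy can be measured directly, independently of the language decoder.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rotational_similarity(image_path, angle=90):
    """Cosine similarity between embeddings of an image and its rotated copy."""
    img = Image.open(image_path).convert("RGB")
    rotated = img.rotate(angle, expand=True, fillcolor=(255, 255, 255))
    inputs = processor(images=[img, rotated], return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)   # shape (2, d)
    emb = torch.nn.functional.normalize(emb, dim=-1)
    return float(emb[0] @ emb[1])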


Datasets & models

An evaluation suite designed to vary semantic richness while holding the spatial question fixed.

Figure: grid of character examples across scripts and visual styles (Omniglot-style evaluation data).

Models evaluated

Open weights: Qwen2.5-VL-7B, Qwen2.5-VL-32B, Qwen3-VL-8B, Qwen3-VL-30B
Closed source (API): GPT-5.2, Gemini-2.5-Pro

Results

Rotation Performance

Rotation recognition on character datasets (aggregated over 10° to 90° angles).

Each cell reports Acc. / TNR / TPR.

Dataset | GPT-5.2 | Gemini-2.5-Pro | Qwen2.5-VL-7B | Qwen2.5-VL-32B | Qwen3-VL-8B | Qwen3-VL-30B
Times New Roman | 74.25 / 95.09 / 53.42 | 89.32 / 100.00 / 78.63 | 51.07 / 100.00 / 2.14 | 52.67 / 100.00 / 5.34 | 50.11 / 100.00 / 0.21 | 65.81 / 100.00 / 31.62
Handwritten English | 67.84 / 96.58 / 39.10 | 68.27 / 99.57 / 36.97 | 50.85 / 98.08 / 3.63 | 62.50 / 98.08 / 26.92 | 50.00 / 100.00 / 0.00 | 55.98 / 100.00 / 11.97
Omniglot | 75.55 / 80.91 / 70.19 | 76.90 / 98.69 / 55.10 | 50.72 / 99.46 / 1.98 | 54.17 / 94.71 / 13.62 | 51.01 / 99.96 / 2.06 | 56.64 / 99.75 / 13.53
Random guess | 50.00 / 50.00 / 50.00 | 50.00 / 50.00 / 50.00 | 50.00 / 50.00 / 50.00 | 50.00 / 50.00 / 50.00 | 50.00 / 50.00 / 50.00 | 50.00 / 50.00 / 50.00

Rotation recognition across character datasets, aggregated over rotation angles 10° to 90°. TNR stays near ceiling for many settings while TPR stays consistently low. Closed-source models tend to outperform open weights, but the pattern holds across the board.
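For reference, the three metrics in these tables can be computed as below, assuming "positive" means the pair truly depicts the same character under the stated transform (a sketch, not the paper's evaluation code):

def probe_metrics(preds, labels):
    """preds/labels: booleans, True = 'same object under the transform'."""
    tp = sum(p and l for p, l in zip(preds, labels))
    tn = sum((not p) and (not l) for p, l in zip(preds, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return {
        "Acc": 100.0 * (tp + tn) / len(labels),
        "TPR": 100.0 * tp / pos if pos else 0.0,  # recall on matching pairs
        "TNR": 100.0 * tn / neg if neg else 0.0,  # specificity on mismatched pairs
    }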

Rotation Performance for PACS Domains

Rotation recognition across PACS domains (aggregated over 90°, 180°, and 270°).

Each cell reports Acc. / TNR / TPR.

Domain | GPT-5.2 | Gemini-2.5-Pro | Qwen2.5-VL-7B | Qwen2.5-VL-32B | Qwen3-VL-8B | Qwen3-VL-30B
Photo | 99.50 / 100.00 / 99.00 | 92.67 / 100.00 / 85.33 | 59.42 / 100.00 / 18.83 | 77.67 / 100.00 / 55.33 | 50.25 / 100.00 / 0.50 | 64.58 / 100.00 / 29.17
Art Painting | 99.50 / 100.00 / 99.00 | 94.17 / 100.00 / 88.33 | 50.25 / 100.00 / 0.50 | 63.50 / 100.00 / 27.00 | 50.08 / 100.00 / 0.17 | 55.08 / 100.00 / 10.17
Cartoon | 98.17 / 100.00 / 96.33 | 90.33 / 100.00 / 80.67 | 51.42 / 100.00 / 2.83 | 70.08 / 100.00 / 40.17 | 50.25 / 100.00 / 0.50 | 60.67 / 100.00 / 21.33
Sketch | 92.25 / 99.67 / 84.83 | 86.50 / 99.83 / 73.17 | 52.25 / 100.00 / 4.50 | 57.25 / 100.00 / 14.50 | 50.42 / 100.00 / 0.83 | 52.83 / 100.00 / 5.67
Random guess | 50.00 / 50.00 / 50.00 | 50.00 / 50.00 / 50.00 | 50.00 / 50.00 / 50.00 | 50.00 / 50.00 / 50.00 | 50.00 / 50.00 / 50.00 | 50.00 / 50.00 / 50.00

Rotation recognition performance across PACS domains. Performance aggregated over rotation angles 90°, 180°, and 270°. While TNR remains near-perfect across models, TPR varies significantly across domains, with strong performance on photos and substantial degradation on sketches, indicating reliance on semantic cues rather than true geometric reasoning.

Scale-invariance performance

Scale-invariance task: metrics aggregated across all scales (0.1×, 0.3×, 0.5×, 0.9×).

Each cell reports Acc. / TNR / TPR.

Dataset | GPT-5.2 | Gemini-2.5-Pro | Qwen2.5-VL-7B | Qwen2.5-VL-32B | Qwen3-VL-8B | Qwen3-VL-30B
Times New Roman | 98.79 / 99.03 / 98.55 | 99.51 / 100.00 / 99.03 | 98.80 / 97.60 / 100.00 | 99.76 / 99.52 / 100.00 | 100.00 / 100.00 / 100.00 | 98.79 / 97.59 / 100.00
Handwritten English | 98.07 / 98.07 / 98.07 | 96.63 / 98.07 / 95.19 | 96.77 / 95.56 / 97.98 | 95.36 / 92.34 / 98.39 | 97.59 / 95.19 / 100.00 | 97.59 / 99.03 / 96.15
Omniglot | 79.72 / 93.20 / 66.23 | 82.56 / 90.11 / 75.01 | 77.40 / 92.99 / 61.81 | 74.21 / 67.87 / 80.55 | 76.05 / 97.11 / 54.99 | 77.04 / 95.44 / 58.64
Random guess | 50.00 / 50.00 / 50.00 | 50.00 / 50.00 / 50.00 | 50.00 / 50.00 / 50.00 | 50.00 / 50.00 / 50.00 | 50.00 / 50.00 / 50.00 | 50.00 / 50.00 / 50.00

Model performance on the scale-invariance task aggregated across all scales (0.1×, 0.3×, 0.5×, 0.9×). All models achieve near-perfect performance on Times New Roman and Handwritten English characters, indicating robustness to scale changes. In contrast, performance on Omniglot is substantially lower and exhibits greater variation in recall (TPR) and specificity (TNR) across models.
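A sketch of how a scaled probe can be generated at the scale factors reported above; re-centering the shrunken character on a fixed-size canvas is an assumption about the setup, not a detail confirmed by this page:

from PIL import Image

SCALES = (0.1, 0.3, 0.5, 0.9)

def scaled_probe(img, scale):
    """Shrink the character and paste it back onto a canvas of the original size."""
    w, h = img.size
    small = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    canvas = Image.new("RGB", (w, h), (255, 255, 255))
    canvas.paste(small, ((w - small.width) // 2, (h - small.height) // 2))
    return canvas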

Figure: recall and specificity by script at scale 0.3× for Qwen2.5-VL-7B and Qwen2.5-VL-32B.

Recall and specificity at scale 0.3× on representative scripts in the scale-invariance task. English characters rendered in Times New Roman, Handwritten English characters, and Omniglot scripts are shown in blue, purple, and orange respectively, and are selected to represent high-, medium-, and low-performing groups. Across both models, familiar scripts such as Greek and Latin consistently outperform less familiar scripts like Braille. While Qwen2.5-VL-32B achieves higher recall than Qwen2.5-VL-7B on low-performing Omniglot scripts, it exhibits lower specificity.

Figure: model accuracy on the scale-invariance task across scale factors for Times New Roman, Handwritten English, and Omniglot.

Model accuracy on the scale-invariance task across scale factors. Both Qwen2.5-VL-7B and Qwen2.5-VL-32B maintain near-perfect accuracy for both Times New Roman and Handwritten English characters across all scales, while performance on Omniglot scripts is substantially and consistently lower for both models.

Can transformation invariance be instilled?

Model performance with in-context learning and structured visual prompting across scripts, reported as TNR and TPR under each setting; changes should be read against the None row for each script. Few-shot and rotational-grid inputs often raise TPR but can lower TNR, and gains tend to be larger for higher-capacity models.

See the paper for details on few-shot and rotational-grid settings.
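As a rough illustration of the few-shot setting, the prompt interleaves labelled example pairs before the query pair; the number of shots and the backend-specific multimodal message format (OpenAI, Gemini, Qwen chat templates) are assumptions here, not the paper's exact configuration:

def build_few_shot_prompt(examples, query_pair, question):
    """examples: list of ((img_a, img_b), 'yes' or 'no'); query_pair: (img_a, img_b)."""
    parts = []
    for (img_a, img_b), answer in examples:
        parts += [img_a, img_b, f"{question} Answer: {answer}"]
    parts += [query_pair[0], query_pair[1], question]
    return parts  # convert to the backend's multimodal message format before sending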

Each cell reports TNR / TPR.

Script | ICL setting | GPT-5.2 | Gemini-2.5-Pro | Qwen2.5-VL-7B | Qwen2.5-VL-32B | Qwen3-VL-8B | Qwen3-VL-30B
Malayalam (top-tier) | None | 93.62 / 34.04 | 100.00 / 38.30 | 100.00 / 0.00 | 97.87 / 6.38 | 100.00 / 0.00 | 100.00 / 2.13
Malayalam (top-tier) | Few-shot | 91.49 / 85.11 | 97.87 / 34.04 | 100.00 / 0.00 | 65.96 / 51.06 | 100.00 / 34.04 | 78.72 / 72.34
Malayalam (top-tier) | Rotational grid | 72.34 / 97.87 | 93.62 / 48.94 | 100.00 / 0.00 | 91.49 / 12.77 | 65.96 / 65.96 | 23.40 / 87.23
Tengwar (medium-tier) | None | 100.00 / 20.00 | 96.00 / 32.00 | 100.00 / 0.00 | 88.00 / 16.00 | 100.00 / 0.00 | 100.00 / 0.00
Tengwar (medium-tier) | Few-shot | 80.00 / 68.00 | 92.00 / 32.00 | 100.00 / 0.00 | 68.00 / 76.00 | 100.00 / 16.00 | 68.00 / 72.00
Tengwar (medium-tier) | Rotational grid | 60.00 / 88.00 | 88.00 / 56.00 | 100.00 / 0.00 | 88.00 / 24.00 | 72.00 / 88.00 | 40.00 / 76.00
Braille (bottom-tier) | None | 100.00 / 7.69 | 100.00 / 50.00 | 100.00 / 0.00 | 65.38 / 15.38 | 100.00 / 0.00 | 100.00 / 3.85
Braille (bottom-tier) | Few-shot | 92.31 / 73.08 | 96.15 / 26.92 | 100.00 / 0.00 | 88.46 / 19.23 | 100.00 / 11.54 | 92.31 / 50.00
Braille (bottom-tier) | Rotational grid | 69.23 / 65.38 | 100.00 / 46.15 | 100.00 / 0.00 | 100.00 / 0.00 | 100.00 / 11.54 | 73.08 / 50.00

Few-shot ICL and rotational-grid prompting help, but neither instills robust rotation recognition. We had hypothesized that in-context examples and the rotational grid would supply the visual evidence needed to map the input across different angles; instead, they tend to make the high-capacity models “over-eager”, inducing a confirmation bias that spikes TPR at the expense of discriminative accuracy.
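A sketch of how a rotational-grid input could be assembled (illustrative; the grid layout and angle set are assumptions rather than the paper's exact configuration): the reference character is tiled at several rotations so the model sees the transform explicitly.

from PIL import Image

def rotational_grid(ref, angles=(0, 90, 180, 270)):
    """Tile the reference character at several rotations into one strip image."""
    tiles = [ref.rotate(a, fillcolor=(255, 255, 255)) for a in angles]
    w, h = ref.size
    grid = Image.new("RGB", (w * len(tiles), h), (255, 255, 255))
    for i, tile in enumerate(tiles):
        grid.paste(tile, (i * w, 0))
    return grid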


Citation

BibTeX
@article{qiu2026semantic,
  title={Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance},
  author={Qiu, Jason and Meurer, Zachary and Thomas, Xavier and Ghadiyaram, Deepti},
  journal={arXiv preprint arXiv:2604.01848},
  year={2026},
  url={https://arxiv.org/abs/2604.01848}
}