Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance
¹Boston University · *Equal contribution · †Corresponding author
Motivation
We probe the same geometric tasks across typed fonts, handwritten characters, rare scripts, and different visual domains: the semantic richness changes and the spatial question does not, yet performance varies widely.
[Figure: example stimuli from Times New Roman, Handwritten English, Omniglot, and the four PACS domains (Photo, Art painting, Cartoon, Sketch).]
We test whether state-of-the-art VLMs reason about rotation, scale, and identity consistently across symbolic sketches, natural photos, and art paintings. The figure below summarizes the drop in accuracy as semantic cues become sparse.
Models are probed on whether two images depict the same object under rotation, scale, or identity. Performance stays high on photos and art but falls on sketches and symbolic scripts, especially for rotation.
This work investigates the fundamental fragility of state-of-the-art vision language models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: they lack the robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scalings, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.
Short summary of the main takeaways.
Sparse semantics expose weak geometric reasoning
Accuracy stays high on natural images (e.g., PACS photos) but drops on sketches and symbolic scripts where visual cues are sparse.
Rotation is the hardest transform
Across rotation, scale, and identity tasks, rotation recognition is consistently the most challenging for the VLMs we evaluate.
Not fixed by architecture, scale, or prompting
Failures persist regardless of model architecture, scale, or prompting strategy, and are only partially mitigated by in-context learning or structured visual prompts.
Encoder similarity ≠ VLM performance
Visual encoders retain rotational similarity, yet VLMs often fail to use it when reasoning with a language decoder.
A suite designed to vary semantic richness and test model behavior.
Omniglot
Handwritten characters from 50 scripts, ranging from familiar alphabets to rare writing systems. (a) Spatial reasoning vs. script familiarity: mixing prevalent and rare scripts tests whether models rely on geometric reasoning or prior exposure. (b) Controlled stimuli: the black-on-white format removes backgrounds, textures, and scene cues, helping isolate spatial reasoning with fewer confounds.
Times New Roman
English letters rendered in a single canonical typeface. The familiarity of English characters and the structural precision of digital fonts make it a clean baseline for evaluating geometric robustness.
Handwritten English
English letters and digits with natural handwriting variation. It adds stroke-level variability while preserving the familiarity of English characters.
PACS
A multi-domain benchmark spanning photo, art painting, cartoon, and sketch. It introduces richer semantics and visual diversity, enabling comparison across levels of visual abstraction.
Models evaluated
Rotation recognition on character datasets (aggregated over 10° to 90° angles).
| Dataset | GPT-5.2 Acc. | TNR | TPR | Gemini-2.5-Pro Acc. | TNR | TPR | Qwen2.5-VL-7B Acc. | TNR | TPR | Qwen2.5-VL-32B Acc. | TNR | TPR | Qwen3-VL-8B Acc. | TNR | TPR | Qwen3-VL-30B Acc. | TNR | TPR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Times New Roman | 74.25 | 95.09 | 53.42 | 89.32 | 100.00 | 78.63 | 51.07 | 100.00 | 02.14 | 52.67 | 100.00 | 05.34 | 50.11 | 100.00 | 00.21 | 65.81 | 100.00 | 31.62 |
| Handwritten English | 67.84 | 96.58 | 39.10 | 68.27 | 99.57 | 36.97 | 50.85 | 98.08 | 03.63 | 62.50 | 98.08 | 26.92 | 50.00 | 100.00 | 00.00 | 55.98 | 100.00 | 11.97 |
| Omniglot | 75.55 | 80.91 | 70.19 | 76.90 | 98.69 | 55.10 | 50.72 | 99.46 | 01.98 | 54.17 | 94.71 | 13.62 | 51.01 | 99.96 | 02.06 | 56.64 | 99.75 | 13.53 |
| Random guess | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 |
Rotation recognition across character datasets. Best and worst accuracy per model across datasets are highlighted in teal and red, respectively. Values are aggregated over rotation angles 10° to 90°. TNR stays near ceiling in many settings while TPR stays consistently low. Closed-source models tend to outperform open-weight ones, but the pattern holds across the board.
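Concretely, the Acc. / TNR / TPR columns follow from binary judgments over positive and negative pairs. The helper below is a minimal illustrative sketch (the function name and the label convention of 1 = positive class are our own choices, not the paper's code):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, true-negative rate (specificity), and true-positive
    rate (recall) for binary judgments; labels are 1 = positive,
    0 = negative."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    acc = (tp + tn) / len(y_true)
    # guard against empty classes so the sketch never divides by zero
    tnr = tn / neg if neg else 0.0
    tpr = tp / pos if pos else 0.0
    return acc, tnr, tpr
```

A model that answers "not transformed" on every pair scores TNR = 100.00 and TPR = 00.00 under these definitions, which is exactly the near-chance pattern several open-weight rows in the tables exhibit.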
Rotation recognition across PACS domains (aggregated over 90°, 180°, and 270°).
| Domain | GPT-5.2 Acc. | TNR | TPR | Gemini-2.5-Pro Acc. | TNR | TPR | Qwen2.5-VL-7B Acc. | TNR | TPR | Qwen2.5-VL-32B Acc. | TNR | TPR | Qwen3-VL-8B Acc. | TNR | TPR | Qwen3-VL-30B Acc. | TNR | TPR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Photo | 99.50 | 100.00 | 99.00 | 92.67 | 100.00 | 85.33 | 59.42 | 100.00 | 18.83 | 77.67 | 100.00 | 55.33 | 50.25 | 100.00 | 00.50 | 64.58 | 100.00 | 29.17 |
| Art Painting | 99.50 | 100.00 | 99.00 | 94.17 | 100.00 | 88.33 | 50.25 | 100.00 | 00.50 | 63.50 | 100.00 | 27.00 | 50.08 | 100.00 | 00.17 | 55.08 | 100.00 | 10.17 |
| Cartoon | 98.17 | 100.00 | 96.33 | 90.33 | 100.00 | 80.67 | 51.42 | 100.00 | 02.83 | 70.08 | 100.00 | 40.17 | 50.25 | 100.00 | 00.50 | 60.67 | 100.00 | 21.33 |
| Sketch | 92.25 | 99.67 | 84.83 | 86.50 | 99.83 | 73.17 | 52.25 | 100.00 | 04.50 | 57.25 | 100.00 | 14.50 | 50.42 | 100.00 | 00.83 | 52.83 | 100.00 | 05.67 |
| Random guess | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 |
Rotation recognition performance across PACS domains. Performance aggregated over rotation angles 90°, 180°, and 270°. While TNR remains near-perfect across models, TPR varies significantly across domains, with strong performance on photos and substantial degradation on sketches, indicating reliance on semantic cues rather than true geometric reasoning.
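For the 90°/180°/270° settings above, a positive or negative probe pair can be constructed losslessly with quarter-turn array rotations. The sketch below is our own illustration of such a pairing protocol, not the paper's code; names and the exact pairing convention are assumptions:

```python
import numpy as np

def make_probe_pair(img: np.ndarray, angle: int, rotated: bool):
    """Build a (reference, probe, label) triple for the rotation task.
    If `rotated`, the probe is the reference turned by `angle` degrees
    (90/180/270 are lossless for axis-aligned arrays); otherwise the
    probe is an identical copy, giving a negative example."""
    assert angle in (90, 180, 270)
    probe = np.rot90(img, k=angle // 90) if rotated else img.copy()
    return img, probe, int(rotated)
```

Quarter-turn rotations avoid interpolation artifacts entirely, so any failure on these pairs cannot be blamed on resampling noise.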
Scale-invariance task: metrics aggregated across all scales (0.1×, 0.3×, 0.5×, 0.9×).
| Dataset | GPT-5.2 Acc. | TNR | TPR | Gemini-2.5-Pro Acc. | TNR | TPR | Qwen2.5-VL-7B Acc. | TNR | TPR | Qwen2.5-VL-32B Acc. | TNR | TPR | Qwen3-VL-8B Acc. | TNR | TPR | Qwen3-VL-30B Acc. | TNR | TPR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Times New Roman | 98.79 | 99.03 | 98.55 | 99.51 | 100.00 | 99.03 | 98.80 | 97.60 | 100.00 | 99.76 | 99.52 | 100.00 | 100.00 | 100.00 | 100.00 | 98.79 | 97.59 | 100.00 |
| Handwritten English | 98.07 | 98.07 | 98.07 | 96.63 | 98.07 | 95.19 | 96.77 | 95.56 | 97.98 | 95.36 | 92.34 | 98.39 | 97.59 | 95.19 | 100.00 | 97.59 | 99.03 | 96.15 |
| Omniglot | 79.72 | 93.20 | 66.23 | 82.56 | 90.11 | 75.01 | 77.40 | 92.99 | 61.81 | 74.21 | 67.87 | 80.55 | 76.05 | 97.11 | 54.99 | 77.04 | 95.44 | 58.64 |
| Random guess | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 |
Model performance on the scale-invariance task aggregated across all scales (0.1×, 0.3×, 0.5×, 0.9×). All models achieve near-perfect performance on Times New Roman and Handwritten English characters, indicating robustness to scale changes. In contrast, performance on Omniglot is substantially lower and exhibits greater variation in recall (TPR) and specificity (TNR) across models.
Recall and specificity at scale 0.3× for representative scripts for the scale-invariance task. English characters rendered in Times New Roman, Handwritten English characters, and Omniglot scripts are shown in blue, purple, and orange respectively, and are selected to represent high-, medium-, and low-performing groups. Across both models, familiar scripts such as Greek and Latin consistently outperform less familiar scripts like Braille. While Qwen2.5-VL-32B achieves higher recall than Qwen2.5-VL-7B on low-performing Omniglot scripts, it exhibits lower specificity.
Model accuracy on the scale-invariance task across scale factors. Both Qwen2.5-VL-7B and Qwen2.5-VL-32B maintain near-perfect accuracy for both Times New Roman and Handwritten English characters across all scales, while performance on Omniglot scripts is substantially and consistently lower for both models.
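As one way to picture the downscaled probes at factors like 0.3×, here is a nearest-neighbour downscaling sketch via regular-grid sampling. This is illustrative only; the resampling method actually used in the experiments is not specified here:

```python
import numpy as np

def downscale_nn(img: np.ndarray, factor: float) -> np.ndarray:
    """Nearest-neighbour downscaling by sampling a regular grid of
    source pixels; preserves the glyph's aspect ratio."""
    h, w = img.shape[:2]
    nh, nw = max(1, round(h * factor)), max(1, round(w * factor))
    rows = (np.arange(nh) * (h / nh)).astype(int)
    cols = (np.arange(nw) * (w / nw)).astype(int)
    return img[np.ix_(rows, cols)]
```

At 0.1× a 100-pixel glyph shrinks to 10 pixels, so the surviving stroke detail is minimal; that the models stay near-perfect on familiar characters at these scales is what the tables above report.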
Model performance with in-context learning and structured visual prompting across scripts. Arrows indicate change in TPR relative to the None setting: ↑ improvement, ↓ degradation. Best TPR for each script–model pair is highlighted in teal. Few-shot and rotational-grid inputs often raise TPR but can lower TNR; gains tend to be larger for higher-capacity models.
See the paper for details on few-shot and rotational-grid settings.
| Script | ICL setting | GPT-5.2 TNR | TPR | Gemini-2.5-Pro TNR | TPR | Qwen2.5-VL-7B TNR | TPR | Qwen2.5-VL-32B TNR | TPR | Qwen3-VL-8B TNR | TPR | Qwen3-VL-30B TNR | TPR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Malayalam (top-tier) | None | 93.62 | 34.04 | 100.00 | 38.30 | 100.00 | 00.00 | 97.87 | 06.38 | 100.00 | 00.00 | 100.00 | 02.13 |
| | Few-shot | 91.49 | 85.11↑ | 97.87 | 34.04↓ | 100.00 | 00.00 | 65.96 | 51.06↑ | 100.00 | 34.04↑ | 78.72 | 72.34↑ |
| | Rotational grid | 72.34 | 97.87↑ | 93.62 | 48.94↑ | 100.00 | 00.00 | 91.49 | 12.77↑ | 65.96 | 65.96↑ | 23.40 | 87.23↑ |
| Tengwar (medium-tier) | None | 100.00 | 20.00 | 96.00 | 32.00 | 100.00 | 00.00 | 88.00 | 16.00 | 100.00 | 00.00 | 100.00 | 00.00 |
| | Few-shot | 80.00 | 68.00↑ | 92.00 | 32.00 | 100.00 | 00.00 | 68.00 | 76.00↑ | 100.00 | 16.00↑ | 68.00 | 72.00↑ |
| | Rotational grid | 60.00 | 88.00↑ | 88.00 | 56.00↑ | 100.00 | 00.00 | 88.00 | 24.00↑ | 72.00 | 88.00↑ | 40.00 | 76.00↑ |
| Braille (bottom-tier) | None | 100.00 | 07.69 | 100.00 | 50.00 | 100.00 | 00.00 | 65.38 | 15.38 | 100.00 | 00.00 | 100.00 | 03.85 |
| | Few-shot | 92.31 | 73.08↑ | 96.15 | 26.92↓ | 100.00 | 00.00 | 88.46 | 19.23↑ | 100.00 | 11.54↑ | 92.31 | 50.00↑ |
| | Rotational grid | 69.23 | 65.38↑ | 100.00 | 46.15↓ | 100.00 | 00.00 | 100.00 | 00.00↓ | 100.00 | 11.54↑ | 73.08 | 50.00↑ |
While few-shot ICL and rotational-grid prompting help, they do not deliver robust rotation recognition: both approaches instill some, but not a robust, understanding of rotation. We hypothesized that ICL and the rotational grid would provide the visual evidence needed to map the input across different angles; instead, they appear to make the high-capacity models "over-eager", inducing a confirmation bias that spikes TPR at the expense of discriminative accuracy.
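A rotational-grid input of the kind evaluated above could be assembled by tiling the reference glyph at the four axis-aligned orientations. The sketch below assumes a square glyph and a 2×2 layout; both the layout and the angle set are our assumptions, not the paper's exact format:

```python
import numpy as np

def rotational_grid(img: np.ndarray) -> np.ndarray:
    """Tile a square glyph at 0/90/180/270 degrees into a 2x2 grid,
    giving the model explicit visual evidence of each orientation."""
    assert img.shape[0] == img.shape[1], "square glyphs keep tiles aligned"
    tiles = [np.rot90(img, k) for k in range(4)]
    return np.block([[tiles[0], tiles[1]], [tiles[2], tiles[3]]])
```

Supplying all four orientations in one image removes any need for the model to rotate mentally, which is why the residual TPR/TNR trade-off in the table above points to a reasoning gap rather than a perception gap.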
@article{qiu2026semantic,
title={Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance},
author={Qiu, Jason and Meurer, Zachary and Thomas, Xavier and Ghadiyaram, Deepti},
journal={arXiv preprint arXiv:2604.01848},
year={2026},
url={https://arxiv.org/abs/2604.01848}
}