VLMs have Tunnel Vision

Evaluating Nonlocal Visual Reasoning in Vision-Language Models

Princeton University
NeurIPS 2025
Spotlight Presentation

Abstract

Vision-Language Models (VLMs) excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple perceptual tests. We present an evaluation that tests vision-language models' capacity for nonlocal visual reasoning: reasoning that requires chaining evidence collected from multiple, possibly distant, regions of an image. We isolate three distinct forms of nonlocal visual reasoning: comparative perception, which demands holding two images in working memory and comparing them; saccadic search, which requires making discrete, evidence-driven jumps to locate successive targets; and smooth visual search, which involves tracing smoothly along a continuous contour. Flagship models (e.g., GPT-5, Gemini 2.5 Pro, Claude Sonnet 4), even those that perform well on prior primitive-vision benchmarks, fail these tests and barely exceed random accuracy on two variants of our tasks that are trivial for humans. Our structured evaluation suite lets us test whether VLMs can perform visual algorithms similar to those humans deploy. Our findings show that, despite gains in raw visual acuity, current models lack core visual reasoning capabilities.

What is Nonlocal Visual Reasoning?

Comparative Perception

Comparative perception is the ability to detect small differences between two images by shifting visual attention back and forth between them.

Saccadic Search

Saccadic search is the ability to make discrete jumps around an image based on its content. For instance, a traffic sign might redirect a driver's attention toward a road hazard.

Smooth Visual Search

Smooth visual search is the ability to follow a line or contour around an image. For example, humans use this skill to follow an object in flight or untangle a wire.

Comparative Perception

Comparative perception differs from general comparison in that it takes place chiefly in the visual domain: the two images must be similar enough that telling them apart requires repeated examination. Images of a penguin and of Angkor Wat, for instance, can be distinguished at a glance. We design "Object Re-Identification" to require this ability.

Object Re-Identification

Each instance of the task shows an object composed of multiple geometric shapes in Image 1. Image 2 shows that same object after a global transformation. The task is to determine whether any of the component shapes have been transformed separately from the object as a whole, making it a different object. We avoid imperceptible edits.

Presentation Variants

  • Standard: all component shapes in the object touch.
  • Unconnected: parts may float apart, probing the model's conception of an object.
  • Pixel-Perfect: positive examples reuse Image 1 exactly, with no global transform. This can be solved through pixel-matching.
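The construction of an instance can be made concrete with a short sketch. The following Python is a minimal illustration, not the paper's generation code: the shape primitives, the transform ranges, and the corruption applied to negative examples are assumptions chosen for clarity.

# Minimal sketch of generating an Object Re-Identification instance.
# Shape types, parameter ranges, and the negative-example corruption are
# illustrative assumptions, not the paper's actual pipeline.
import math
import random
from dataclasses import dataclass

@dataclass
class Shape:
    kind: str     # e.g. "circle", "square", "triangle"
    x: float      # position of the shape's centre
    y: float
    angle: float  # orientation in radians

def global_transform(shapes, dx, dy, theta):
    """Rotate and translate every component shape as one rigid object."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [
        Shape(s.kind,
              cos_t * s.x - sin_t * s.y + dx,
              sin_t * s.x + cos_t * s.y + dy,
              s.angle + theta)
        for s in shapes
    ]

def make_instance(variant="standard"):
    """Return (image1_shapes, image2_shapes, label); label=True means 'same object'."""
    # Image 1: a composite object made of a few primitive shapes.
    obj = [Shape(random.choice(["circle", "square", "triangle"]),
                 random.uniform(-1, 1), random.uniform(-1, 1),
                 random.uniform(0, 2 * math.pi))
           for _ in range(3)]

    same_object = random.random() < 0.5

    if variant == "pixel-perfect" and same_object:
        # Positive pixel-perfect examples reuse Image 1 exactly: no global transform.
        obj2 = list(obj)
    else:
        # Otherwise apply a single rigid transform to the whole object.
        obj2 = global_transform(obj,
                                dx=random.uniform(-2, 2),
                                dy=random.uniform(-2, 2),
                                theta=random.uniform(0, 2 * math.pi))

    if not same_object:
        # Corrupt exactly one component with its own, clearly visible transform.
        i = random.randrange(len(obj2))
        s = obj2[i]
        obj2[i] = Shape(s.kind, s.x + 0.8, s.y - 0.8, s.angle + math.pi / 3)

    return obj, obj2, same_object

In the standard variant the component shapes would additionally be placed so that they touch; the unconnected variant drops that constraint.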

Example Task

Prompt: Decide whether the composite object from Image 1 is still present somewhere in Image 2, even though the scene may include distractors.

Image 1
Object Re-ID Image 1

This image defines the object to look for in Image 2.

Image 2
Object Re-ID Image 2

The object (in the green circle) has been rotated, but the three component shapes that compose it remain identically positioned relative to one another. The distractor object (in the red circle) can be ignored.

Answer: Yes

A visual skim is not enough to solve this task. Both images must be compared carefully.
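As a rough illustration of how such an instance might be scored, the sketch below poses the yes/no question to a model and checks the reply against the ground-truth label. The query_vlm helper is a hypothetical stand-in for an actual VLM API client, which the page does not specify, and the appended "Answer Yes or No." instruction is added here only to simplify parsing.

# Sketch of scoring a single Object Re-Identification query.
# query_vlm is a hypothetical placeholder; replace it with a real API client.
from pathlib import Path

PROMPT = ("Decide whether the composite object from Image 1 is still present "
          "somewhere in Image 2, even though the scene may include distractors. "
          "Answer Yes or No.")

def query_vlm(prompt: str, image_paths: list[Path]) -> str:
    """Hypothetical VLM call: send the prompt plus both images and return the raw reply."""
    raise NotImplementedError

def score_instance(image1: Path, image2: Path, label: bool) -> bool:
    """Return True if the model's Yes/No answer matches the ground-truth label
    (label=True means the object from Image 1 is still present in Image 2)."""
    reply = query_vlm(PROMPT, [image1, image2]).strip().lower()
    predicted_yes = reply.startswith("yes")
    return predicted_yes == label

def accuracy(instances) -> float:
    """instances: iterable of (image1_path, image2_path, label) triples."""
    outcomes = [score_instance(i1, i2, lab) for i1, i2, lab in instances]
    return sum(outcomes) / len(outcomes)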


Results

Object Re-Identification accuracy across variants.

BibTeX

@inproceedings{berman2025vlms_tunnel_vision,
  title        = {VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs},
  author       = {Berman, Shmuel and Deng, Jia},
  booktitle    = {NeurIPS},
  year         = {2025},
}