VLMs have Tunnel Vision

Evaluating Nonlocal Visual Reasoning in Vision-Language Models

Princeton University
NeurIPS 2025
Spotlight Presentation

Abstract

Vision-Language Models (VLMs) excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple perceptual tests. We present an evaluation that tests vision-language models' capacity for nonlocal visual reasoning: reasoning that requires chaining evidence collected from multiple, possibly distant, regions of an image. We isolate three distinct forms of nonlocal visual reasoning: comparative perception, which demands holding two images in working memory and comparing them; saccadic search, which requires making discrete, evidence-driven jumps to locate successive targets; and smooth visual search, which involves tracing smoothly along a continuous contour. Flagship models (e.g., GPT-5, Gemini 2.5 Pro, Claude Sonnet 4), even those that perform well on prior primitive-vision benchmarks, fail these tests and barely exceed random accuracy on two variants of our tasks that are trivial for humans. Our structured evaluation suite lets us test whether VLMs can perform visual algorithms similar to those humans deploy. Our findings show that, despite gains in raw visual acuity, current models lack core visual reasoning capabilities.

What is Nonlocal Visual Reasoning?

Comparative Perception

Comparative perception is the ability to detect small differences between two images by shifting visual attention back and forth between them.

Saccadic Search

Saccadic search is the ability to make discrete jumps around an image based on its content. For instance, a traffic sign might redirect a driver's attention toward a road hazard.

Smooth Visual Search

Smooth visual search is the ability to follow a line or contour around an image. For example, humans use this skill to follow an object in flight or untangle a wire.

Comparative Perception

Comparative perception differs from general comparison in that it takes place chiefly in the visual domain: the two images must be similar enough that telling them apart requires repeated examination. Images of a penguin and of Angkor Wat, for instance, can be distinguished at a glance. We design "Object Re-Identification" to require this ability.

Object Re-Identification

Each instance of the task shows an object composed of multiple geometric shapes in Image 1. Image 2 shows that same object after a global transformation. The task is to determine whether any of the component shapes have been transformed separately from the object as a whole, making it a different object. We avoid imperceptible edits.

Presentation Variants

  • Standard: all component shapes in the object touch.
  • Unconnected: parts may float apart, probing the model's conception of an object.
  • Pixel-Perfect: positive examples reuse Image 1 exactly, with no global transform. This can be solved through pixel-matching.
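The construction of an instance can be made concrete with a short sketch. The following Python is a minimal illustration, not the paper's generation code: the shape primitives, the transform ranges, and the corruption applied to negative examples are assumptions chosen for clarity.

# Minimal sketch of generating an Object Re-Identification instance.
# Shape types, parameter ranges, and the negative-example corruption are
# illustrative assumptions, not the paper's actual pipeline.
import math
import random
from dataclasses import dataclass

@dataclass
class Shape:
    kind: str     # e.g. "circle", "square", "triangle"
    x: float      # position of the shape's centre
    y: float
    angle: float  # orientation in radians

def global_transform(shapes, dx, dy, theta):
    """Rotate and translate every component shape as one rigid object."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [
        Shape(s.kind,
              cos_t * s.x - sin_t * s.y + dx,
              sin_t * s.x + cos_t * s.y + dy,
              s.angle + theta)
        for s in shapes
    ]

def make_instance(variant="standard"):
    """Return (image1_shapes, image2_shapes, label); label=True means 'same object'."""
    # Image 1: a composite object made of a few primitive shapes.
    obj = [Shape(random.choice(["circle", "square", "triangle"]),
                 random.uniform(-1, 1), random.uniform(-1, 1),
                 random.uniform(0, 2 * math.pi))
           for _ in range(3)]

    same_object = random.random() < 0.5

    if variant == "pixel-perfect" and same_object:
        # Positive pixel-perfect examples reuse Image 1 exactly: no global transform.
        obj2 = list(obj)
    else:
        # Otherwise apply a single rigid transform to the whole object.
        obj2 = global_transform(obj,
                                dx=random.uniform(-2, 2),
                                dy=random.uniform(-2, 2),
                                theta=random.uniform(0, 2 * math.pi))

    if not same_object:
        # Corrupt exactly one component with its own, clearly visible transform.
        i = random.randrange(len(obj2))
        s = obj2[i]
        obj2[i] = Shape(s.kind, s.x + 0.8, s.y - 0.8, s.angle + math.pi / 3)

    return obj, obj2, same_object

In the standard variant the component shapes would additionally be placed so that they touch; the unconnected variant drops that constraint.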

Example Task

Prompt: Decide whether the composite object from Image 1 is still present somewhere in Image 2, even though the scene may include distractors.

Image 1
Object Re-ID Image 1

This image defines the object to look for in Image 2.

Image 2
Object Re-ID Image 2

The object (in the green circle) has been rotated, but the three component shapes that compose it remain identically positioned relative to one another. The distractor object (in the red circle) can be ignored.

Answer: Yes

A visual skim is not enough to solve this task. Both images must be compared carefully.
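As a rough illustration of how such an instance might be scored, the sketch below poses the yes/no question to a model and checks the reply against the ground-truth label. The query_vlm helper is a hypothetical stand-in for an actual VLM API client, which the page does not specify, and the appended "Answer Yes or No." instruction is added here only to simplify parsing.

# Sketch of scoring a single Object Re-Identification query.
# query_vlm is a hypothetical placeholder; replace it with a real API client.
from pathlib import Path

PROMPT = ("Decide whether the composite object from Image 1 is still present "
          "somewhere in Image 2, even though the scene may include distractors. "
          "Answer Yes or No.")

def query_vlm(prompt: str, image_paths: list[Path]) -> str:
    """Hypothetical VLM call: send the prompt plus both images and return the raw reply."""
    raise NotImplementedError

def score_instance(image1: Path, image2: Path, label: bool) -> bool:
    """Return True if the model's Yes/No answer matches the ground-truth label
    (label=True means the object from Image 1 is still present in Image 2)."""
    reply = query_vlm(PROMPT, [image1, image2]).strip().lower()
    predicted_yes = reply.startswith("yes")
    return predicted_yes == label

def accuracy(instances) -> float:
    """instances: iterable of (image1_path, image2_path, label) triples."""
    outcomes = [score_instance(i1, i2, lab) for i1, i2, lab in instances]
    return sum(outcomes) / len(outcomes)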


Results

Object Re-Identification accuracy across variants.

BibTeX

@inproceedings{berman2025vlms_tunnel_vision,
  title        = {VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs},
  author       = {Berman, Shmuel and Deng, Jia},
  booktitle    = {NeurIPS},
  year         = {2025},
}