Google Adds 'Agentic Vision' Capability to Gemini 3 Flash
Google has added a new capability called 'Agentic Vision' to its Gemini 3 Flash model, combining visual reasoning with code execution. The feature enables the AI to analyze images more accurately by examining them step by step rather than at a single glance.
Transforming Visual Understanding into an Active Investigation
With Agentic Vision, Google is fundamentally changing how Gemini 3 Flash processes visual data. While traditional advanced models typically take in the world through a single, static view, Agentic Vision turns that process into an active investigation.
According to the company's statement, when current systems miss fine details, such as a serial number on a microchip or a distant street sign, they are forced to guess. The new system evolves image understanding from a static action into an agent-like process.
Operates on a 'Think, Act, Observe' Cycle
The foundation of Agentic Vision is a cycle called 'Think, Act, Observe'. In the first step, the model analyzes the user query and the initial image to create a multi-step plan. In the second step, it generates and executes Python code to process or analyze the image. In the final step, the transformed image is added back to the model's context window, so the new data can be examined in context before the final response is generated.
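To make the 'Act' step concrete, the Python the model writes for itself could look roughly like the sketch below, which crops a region of interest and upscales it with Pillow. The file name and coordinates are invented for illustration; Google has not published the actual code its model generates.

```python
# Sketch of an "Act" step: crop a hard-to-read region and enlarge it so the
# transformed image can re-enter the model's context for closer inspection.
# File name and box coordinates are illustrative only.
from PIL import Image

image = Image.open("microchip.jpg")

# Crop the region suspected to contain the serial number
# (coordinates are left, top, right, bottom).
region = image.crop((420, 310, 620, 370))

# Upscale 4x so fine print becomes legible on re-inspection.
zoomed = region.resize(
    (region.width * 4, region.height * 4), Image.Resampling.LANCZOS
)
zoomed.save("microchip_zoomed.jpg")
```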
According to Google, this approach yields a consistent quality improvement of 5% to 10% across most vision benchmarks.
Practical Application Areas and Capabilities
Agentic Vision brings a range of new behaviors to Gemini 3 Flash:
- Zoom and Inspect: The model can automatically zoom in when it detects fine details that warrant closer inspection. The plan review platform PlanCheckSolver.com used this capability to iteratively examine high-resolution building plans, increasing its accuracy by 5%.
- Image Annotation: The model not only identifies what it sees but can also run code to draw directly on the canvas. When counting the fingers on a hand, for example, it can draw bounding boxes and numeric labels to mark each finger (see the annotation sketch after this list).
- Visual Math and Graph Plotting: Agentic Vision can parse dense data tables and run Python code to visualize its findings. Where standard language models tend to hallucinate in multi-step visual arithmetic, it offloads the calculations to a deterministic Python environment to produce verifiable results (see the plotting sketch after this list).
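As a rough illustration of the annotation behavior, code of the following shape could draw numbered bounding boxes on an image. The detections here are hypothetical placeholders, not output from the model:

```python
# Sketch: draw a numbered bounding box around each detected finger.
# The box coordinates are invented stand-ins for the model's detections.
from PIL import Image, ImageDraw

image = Image.open("hand.jpg")
draw = ImageDraw.Draw(image)

# Hypothetical detections: one (left, top, right, bottom) box per finger.
finger_boxes = [
    (40, 30, 90, 160),
    (100, 10, 150, 150),
    (160, 5, 210, 145),
    (220, 20, 270, 155),
    (280, 60, 330, 180),
]

for index, box in enumerate(finger_boxes, start=1):
    draw.rectangle(box, outline="red", width=3)
    draw.text((box[0], box[1] - 14), str(index), fill="red")

image.save("hand_annotated.jpg")
```

And for the visual math behavior, the deterministic-computation idea can be pictured as follows: values read off a chart or table are handed to ordinary Python arithmetic and plotting rather than computed "mentally" by the language model. The figures below are invented for the example:

```python
# Sketch: deterministic arithmetic and plotting over values extracted
# from an image. The numbers are illustrative, not real data.
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [12.4, 15.1, 14.8, 18.9]

# Exact, verifiable computation instead of multi-step mental math.
total = sum(revenue)
growth = (revenue[-1] - revenue[0]) / revenue[0] * 100

plt.bar(quarters, revenue)
plt.title(f"Revenue by quarter (total {total:.1f}, Q1-Q4 growth {growth:.0f}%)")
plt.ylabel("Revenue")
plt.savefig("revenue_plot.png")
```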
Opening Up for Developer Use
Agentic Vision is rolling out in the Gemini app with the 'Thinking' model. Developers can access the new feature through the Gemini API in Google AI Studio and Vertex AI, and a demo application in Google AI Studio offers an environment for experimenting with its various use cases.
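For orientation, a minimal request through the google-genai Python SDK might look like the sketch below. The model ID "gemini-3-flash" and the assumption that Agentic Vision rides on the SDK's existing code-execution tool are guesses for illustration, not details confirmed by Google:

```python
# Minimal sketch of an image request via the google-genai SDK.
# Assumptions: the model ID "gemini-3-flash" and that Agentic Vision is
# surfaced through the standard code-execution tool.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

with open("microchip.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-flash",  # hypothetical model ID
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Read the serial number printed on the chip in this photo.",
    ],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)
print(response.text)
```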
The company says Agentic Vision is still in its early stages, and it plans to make more code-based behaviors implicit over time, such as rotating images or performing visual math without being explicitly prompted to do so.