Claude 3.5 Computer Use: The AI That Sees and Controls Your Computer

Imagine an AI that can navigate your computer just like you do, using only its "eyes" to understand and interact with the screen. That's exactly what Claude 3.5 Computer Use aims to achieve. It can tackle various tasks, from browsing the web to playing video games, all without relying on traditional methods like HTML parsing or access to internal software APIs. Researchers from the National University of Singapore have studied how well Computer Use works across a variety of domains and software.

Claude 3.5 Computer Use Observation Method

Claude 3.5 Computer Use observes its environment exclusively through visual information obtained from real-time screenshots, without relying on any metadata or HTML information. This approach allows the model to function effectively even with closed-source software, where access to internal APIs or code is restricted.

This method, also known as the vision-only approach, highlights the model's ability to mimic human desktop interactions by relying solely on visual input. This is a significant advancement in GUI automation because it enables the model to adapt to the dynamic nature of GUI environments without needing to understand the underlying structure of the interface.

Screenshot Integration in Claude's Reasoning Process

Claude 3.5 employs a reasoning-acting paradigm, similar to the traditional ReAct approach. This means the model first observes the environment before deciding on an action, ensuring that its actions are appropriate for the current GUI state. The screenshots are captured during the task operation and are integrated into the model's reasoning process as follows:

Historical Context Maintenance: Claude 3.5 maintains a history of screenshots from previous steps, accumulating visual information as the task progresses.
Action Generation: At each time step, the model uses the current screenshot, combined with the historical screenshot context, to determine the next action.

This approach allows Claude 3.5 to make more informed decisions by considering the full visual context of the task as it unfolds.
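The observe, reason, act cycle with an accumulating screenshot history can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Anthropic's implementation: `Screenshot`, `Action`, `Trace`, `run_task`, `fake_capture`, and `fake_policy` are hypothetical names invented for this example, and the "model" is a stub that declares the task done after seeing three screenshots.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Screenshot:
    step: int
    pixels: bytes  # placeholder for raw image data


@dataclass
class Action:
    kind: str  # e.g. "click", "type", or "done"
    detail: str = ""


@dataclass
class Trace:
    screenshots: List[Screenshot] = field(default_factory=list)
    actions: List[Action] = field(default_factory=list)


def run_task(
    capture: Callable[[int], Screenshot],
    choose_action: Callable[[List[Screenshot]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> Trace:
    """Hypothetical loop: observe, reason over the full history, then act."""
    trace = Trace()
    for step in range(max_steps):
        # Observe: take a fresh screenshot and add it to the history.
        trace.screenshots.append(capture(step))
        # Reason: the policy sees the current screenshot plus all earlier ones.
        action = choose_action(trace.screenshots)
        trace.actions.append(action)
        if action.kind == "done":
            break
        # Act: apply the chosen action to the GUI.
        execute(action)
    return trace


# Stand-ins so the sketch runs without a real screen or model.
def fake_capture(step: int) -> Screenshot:
    return Screenshot(step=step, pixels=b"")


def fake_policy(history: List[Screenshot]) -> Action:
    # Pretend the task is finished once three screenshots have been seen.
    if len(history) >= 3:
        return Action("done")
    return Action("click", detail=f"target seen at step {history[-1].step}")


executed: List[Action] = []
trace = run_task(fake_capture, fake_policy, executed.append)
print([a.kind for a in trace.actions])
```

In a real agent, `capture` would grab the screen and `choose_action` would call the model with the images. The point of the sketch is only that the policy receives the whole history rather than just the latest frame.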

Selective Observation Strategy

Importantly, Claude 3.5 departs from the traditional ReAct paradigm by adopting a **selective observation strategy**. This means that the model does not observe the GUI state continuously at every step but only when necessary, as determined by its reasoning. This selective observation reduces the computational cost and accelerates the overall process by avoiding unnecessary screenshot capture and analysis.
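A rough sketch of selective observation, again under assumptions: here each step carries a hypothetical `needs_observation` flag standing in for the model's own judgement about whether it must look at the screen first. The names `PlannedStep` and `run_selectively`, and the example plan itself, are invented for illustration and do not come from the study.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PlannedStep:
    action: str
    needs_observation: bool  # stand-in for the model's own judgement


def run_selectively(
    steps: List[PlannedStep],
    capture: Callable[[], bytes],
) -> Tuple[List[str], int]:
    """Perform every action, but capture a screenshot only when requested."""
    performed: List[str] = []
    captures = 0
    for step in steps:
        if step.needs_observation:
            capture()  # the costly part: grabbing and analysing the screen
            captures += 1
        performed.append(step.action)
    return performed, captures


# Illustrative plan: typing and pressing Enter are assumed not to need a fresh look.
plan = [
    PlannedStep("open the browser", needs_observation=True),
    PlannedStep("type the search query", needs_observation=False),
    PlannedStep("press Enter", needs_observation=False),
    PlannedStep("click the first result", needs_observation=True),
]
performed, captures = run_selectively(plan, capture=lambda: b"")
print(f"{len(performed)} actions, {captures} screenshots")
```

With this plan, four actions require only two screenshots, which is where the saving in cost and time would come from.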

Evaluating the Performance of Claude 3.5 Computer Use

The study highlights that Claude 3.5 Computer Use exhibits strong performance in automating a diverse range of desktop tasks, but it also reveals areas for improvement. This evaluation considers planning, action execution, and critic feedback as key aspects of performance.

Strengths

Web Search: The model successfully navigates complex websites like Amazon and Apple's official site, efficiently finding information, adding items to carts, and even handling dynamic elements like pop-up windows.
Workflow Automation: Claude 3.5 demonstrates proficiency in coordinating actions across multiple applications. It can transfer data between Amazon and Excel, export and open online documents locally, install apps from the App Store, and even report storage usage.
Office Productivity: The model excels in automating various tasks in Microsoft Office applications, including Word, PowerPoint, and Excel. It successfully modifies document layouts, inserts formulas, manipulates presentations, and performs find-and-replace operations.
Video Games: Notably, Claude 3.5 demonstrates adaptability to gaming environments, interacting with game interfaces and executing multi-step actions in games like Hearthstone and Honkai: Star Rail. It creates and renames decks, uses hero powers effectively, automates warp sequences, and completes daily mission tasks.

Limitations

Planning Errors: The model sometimes misinterprets user instructions or the computer's current state, resulting in incorrect task execution. For example, it mistakenly navigated to the "Account" tab instead of scrolling for "Formula 1" in the Fox Sports navigation menu.
Action Errors: Claude 3.5 can struggle with precise control within the GUI environment, leading to inaccuracies in tasks requiring specific selections or interactions. This is evident in the resume template task, where the model only partially updated the name and phone number due to inaccurate text selection.
Critic Errors: The model may incorrectly assess its actions or the computer's state, prematurely declaring task completion or overlooking errors. For example, it reported successful completion of the resume template update despite incomplete changes and mistakenly applied bullets instead of numbering in PowerPoint.
Non-Human-like Interaction: Reliance on "Page Up/Down" shortcuts for scrolling limits the model's ability to browse and perceive information comprehensively, creating a discrepancy between its interaction style and human user behaviour.

Key Insights

Vision-Only Approach: Claude 3.5's reliance solely on visual information from screenshots for environment observation allows it to interact with diverse applications, even closed-source software, without requiring metadata or HTML parsing.
Reasoning-Acting Paradigm: The model employs a reasoning-acting paradigm, similar to ReAct, to ensure its actions are appropriate for the current GUI state. It uses both current and historical screenshots to generate actions dynamically.
Selective Observation Strategy: Claude 3.5 observes the GUI state selectively, only when necessary, to reduce computational cost and accelerate task execution.

Areas for Improvement

Critic Module Enhancement: Improving the model's self-assessment capabilities to better detect errors and accurately determine task completion is crucial for increasing its reliability.
Dynamic Benchmarking: Evaluating Claude 3.5 in more dynamic and interactive environments that simulate real-world application usage would provide a more comprehensive assessment of its performance and adaptability.
Human-like Interaction: Bridging the gap between the model's interaction style and that of human users, particularly in areas like scrolling and browsing, would enhance its effectiveness in real-world scenarios.

Conclusion

Claude 3.5 Computer Use demonstrates significant potential in GUI automation. Its performance across a variety of desktop tasks highlights its strengths in web search, workflow automation, office productivity, and even video games. However, limitations in planning, action execution, critic feedback, and its reliance on non-human-like interaction patterns underscore areas for future development. Addressing these limitations will be essential for creating truly sophisticated and reliable GUI automation models capable of effectively supporting and augmenting human computer use.
