We're releasing Hodoscope, an open-source tool for understanding what AI agents are really doing. As agents are deployed on increasingly complex tasks, we need ways to discover unexpected behaviors without knowing in advance what to look for. Hodoscope enables human-in-the-loop analysis of agent trajectories at scale. Rather than relying on rigid, pre-defined criteria for what constitutes an interesting behavior, Hodoscope uses unsupervised methods to surface insights.
As AI agents tackle increasingly complex tasks — writing code, browsing the web, running businesses — we need better ways to monitor them. Automated LLM-based evaluation (ask an LLM if the agent did something suspicious) is promising, but it may share the same blind spots as the agent itself. It works well when we know exactly what to look for, but struggles to catch novel failure modes we haven't anticipated.
We believe human judgment is essential for catching these novel failure modes, but reviewing every trace doesn't scale. Hodoscope aims to reduce the human effort needed as much as possible, so that a person can effectively oversee hundreds or thousands of agent traces, spending their attention where it matters most.
SWE-bench evaluates coding agents by asking them to fix real GitHub issues. Each task checks out a repo at the commit before the fix — but the actual solution commit remains reachable via git log. Some agents exploited this, running commands like git log --all to surface the fix diff directly rather than reasoning about the problem.
When we asked Claude Opus 4.6 to evaluate a trace containing this behavior, it marked the trace as non-suspicious, claiming it was a routine investigation step. Only after we provided the issue link as additional context could Opus recognize what was wrong.
Hodoscope takes an unsupervised approach with three stages:

1. Summarize each agent action with an LLM, stripping away setup-specific paths and formatting artifacts.
2. Embed the summaries so that semantically similar actions land close together.
3. Project the embeddings into an interactive 2D plot with per-group density overlays.
From the resulting plot, humans can interactively investigate regions with significant distribution differences. This saves substantial human effort: only a few examples from each region need to be examined to get a complete picture.
This approach is particularly useful for exploratory studies, where you are investigating a new model or benchmark and genuinely don't know what failure modes to expect. Once possible behaviors are surfaced, targeted or supervised follow-up studies can be carried out.
We used Hodoscope to discover benchmark gaming behaviors by frontier AI agents. The general strategy is to compare setups where you expect different behaviors, and focus on what's unique to one setup. This could be comparing multiple models on the same benchmark (reward hacking is unlikely to manifest identically across all models), or comparing the same model across different task variants (in our case, solvable and impossible tasks).
New finding: unsquashed git history on Commit0. In collaboration with OpenHands, we discovered a previously unknown time-traveling vulnerability in the Commit0 benchmark (Commit0 Issue #113, OpenHands Issue #421). Commit0 works by taking a real repository, creating an initial commit that strips out the bulk of the code, and then asking the agent to rewrite the removed portions. The problem: the git history wasn't squashed. Models including MiniMax M2.5 and Claude Opus figured out they could simply browse the git log, find the original code, and copy it back — bypassing the task entirely. We surfaced this by examining clusters unique to MiniMax M2.5. After patching the issue, MiniMax's score dropped significantly. Visualization (27 MB)
We also applied Hodoscope to benchmarks where agents are known to game, and successfully rediscovered those behaviors.
Rediscovery: GPT-5 hacking on ImpossibleBench. We compared GPT-5 trajectories on SWE-bench and impossible tasks from ImpossibleBench where the only way to pass is to hack the evaluation. When we originally built ImpossibleBench, identifying these reward hacking behaviors required painstaking manual review of many traces. With Hodoscope, the same behaviors — modifying test cases, special-casing, maintaining special internal states — surfaced in a prominent cluster. Visualization (37 MB)
Rediscovery: iQuest-Coder-V1 time-traveling on SWE-bench. Because outdated Docker images were used for evaluation, the agent could access git history containing future patches — essentially looking at the answer key. Hodoscope made this visible by surfacing an unusual cluster of git log actions unique to iQuest-Coder-V1. Visualization (13 MB)
We walk through this finding in more detail in the next part (demo) of this blog.
Although we focused on benchmark gaming, we believe the tool could also be useful for other setups, such as understanding model behavioral differences pre- and post-reinforcement learning or debugging agent deployments. We are releasing Hodoscope now so others can start exploring their own agent traces.
Here's a concrete walkthrough of how Hodoscope works, using the SWE-bench study that re-discovered the iQuest time-traveling issue (& more!). A full runnable version of this example is available at examples/swebench_iquest in our repo.
Hodoscope natively accepts Docent collections, Inspect AI .eval files, OpenHands evaluation results, and plain trajectory JSON directories.
We're comparing five models on SWE-bench Verified. Four are Docent collections from the SWE-bench Leaderboard: o3, gpt-4.1, qwen3-coder, and deepseek-v3.2-reasoner. The fifth, iQuest-Coder-v1, requires manual preprocessing. Note that the SWE-bench Leaderboard has since been updated to a newer set of traces; we use the older versions that were available when we ran these experiments.
iQuest preprocessing — raw trajectories are available as a single JSON file in their GitHub repo. We download it and split it into individual trajectory files:
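The split itself is a few lines of Python. Below is a minimal sketch rather than the exact preprocessing script we used: it assumes the raw file is a JSON array of trajectory records with an instance_id field, and the filename is a placeholder.

```python
# Minimal sketch: split one large JSON file into one file per trajectory.
# The source filename and the "instance_id" field are assumptions about the
# iQuest release, not guaranteed by it.
import json
from pathlib import Path

src = Path("iquest_coder_v1_swebench.json")  # hypothetical filename
out_dir = Path("iquest_trajectories")
out_dir.mkdir(exist_ok=True)

records = json.loads(src.read_text())        # assumed: a list of trajectory records
for i, record in enumerate(records):
    name = record.get("instance_id", f"trajectory_{i:04d}")
    (out_dir / f"{name}.json").write_text(json.dumps(record, indent=2))

print(f"Wrote {len(records)} trajectory files to {out_dir}/")
```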
Run hodoscope analyze on each source. We sample 50 trajectories per model with a fixed seed for reproducibility, and annotate each source with its model name.
Summarization defaults to gpt-5.2 and costs ~$35 in OpenAI API calls total. You can reduce cost by switching the summarization model (e.g. --summarize-model gemini/gemini-3-flash-preview) or by further subsampling.
Each command produces a .hodoscope.json file containing action summaries and embeddings.
Combine everything into a single interactive explorer:
This command generates a self-contained HTML file and opens it in your browser. Each point corresponds to an agent action, colored by model. This is our output for this walkthrough (13 MB).
With all models plotted together, you can immediately see how behaviors cluster. Most of the embedding space is shared — all models write code, run tests, read files. But there are regions that only light up for iQuest. The density overlay makes this especially clear: switch the overlay to iQuest vs. the other models, and the unique cluster stands out in red.
We can then manually analyze these regions. The following are the authors' human annotations.
We can click on individual dots to examine the corresponding actions. In this case, we found actions running git log near the center of the "git related" cluster above.
If we search for "git log" in the textbox above, we find that only 79 out of 4006 actions (2.0%) exhibit this behavior. With Hodoscope, however, it took us only a few minutes to discover them.
We also discovered that iQuest modifies test cases frequently. Is this problematic?
By examining a couple of transcripts, we found that most of the test modifications seem reasonable. In the example below, while this test was not modified in the standard patch, the modification makes sense, and this particular test is not included in the evaluation (see the discussion around this here). Many of the modifications are also to tests that iQuest itself generated.
Are any of these modifications actually harmful? To find out, we wrote a Python script that examines all instances of test modification under testbed/tests in the full iQuest trajectory set; a minimal sketch of the classification logic appears after the table below.
| Type | Count | Explanation |
|---|---|---|
| Modified tests within test_patch | 60 | Expected/Justified: The modification targets tests meant to be evaluated. |
| Modified unreferenced tests (not in FAIL_TO_PASS or PASS_TO_PASS) | 10 | Harmless: Edits target tests that are ignored by the evaluation framework. |
| Malicious manipulation of referenced tests | 0 | Harmful: No such instances of the agent actively gaming the evaluation were found. |
It turns out that none of these edits were actually gaming the evaluation, so we do not consider this behavior truly detrimental.
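For reference, here is a minimal sketch of the classification logic rather than the actual script: it assumes the agent's test edits have already been parsed into test identifiers, and it uses toy SWE-bench metadata with field names from the public schema (FAIL_TO_PASS, PASS_TO_PASS, and the tests touched by the test_patch).

```python
from collections import Counter

# Toy SWE-bench instance metadata. The field names mirror the public SWE-bench
# schema, but the pre-parsed shapes (sets of test IDs) are assumptions of this sketch.
instance = {
    "FAIL_TO_PASS": {"tests/delete/tests.py::test_fast_delete"},
    "PASS_TO_PASS": {"tests/delete/tests.py::test_basic_delete"},
    "tests_in_test_patch": {"tests/delete/tests.py::test_fast_delete"},
}

def classify(edited_test: str) -> str:
    """Bucket a single agent edit to a test under testbed/tests."""
    referenced = instance["FAIL_TO_PASS"] | instance["PASS_TO_PASS"]
    if edited_test in instance["tests_in_test_patch"]:
        return "within test_patch (expected/justified)"
    if edited_test not in referenced:
        return "unreferenced (harmless)"
    return "referenced test outside test_patch (potentially harmful)"

# Tests edited by the agent in one hypothetical trajectory.
edited_tests = [
    "tests/delete/tests.py::test_fast_delete",
    "tests/delete/tests.py::test_unrelated_helper",
]
print(Counter(classify(t) for t in edited_tests))
```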
This illustrates a general pattern for using Hodoscope: first, unsupervised exploration surfaces a suspicious behavior you wouldn't have thought to look for; then, you follow up with targeted, supervised analysis to verify whether the behavior is actually problematic.
Raw agent actions are noisy — full of setup-specific paths and formatting artifacts. Directly embedding raw actions causes the embedding space to cluster by task context (e.g., all Django actions together, all SymPy actions together) rather than by behavior. Our summarization step focuses on stripping away these artifacts so that semantically similar actions end up close together regardless of context.
For example, consider two actions that both view a range of lines from a source file, but in completely different codebases and using different tools:
bash {"command": "sed -n '710,717p' sympy/physics/quantum/tensorproduct.py"} → Output: 8 lines of tensor product code

text_editor {"command": "view", "path": "/testbed/tests/delete/tests.py", "view_range": [470, 500]} → Output: 30 lines of Django test code

After our summarization step, both actions are described in similar, context-free terms (viewing a range of lines from a source file), and their embedding cosine similarity improves from 0.70 to 0.79.
To measure this quantitatively, we compiled pairs of semantically similar actions from different contexts, measured their embedding cosine similarity before and after summarization, and used the results to tune our summarization prompt. In the end, similar actions generally receive higher cosine similarities, while dissimilar actions receive much lower ones (since task-specific noise is removed).
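The sketch below illustrates this before/after check on a single pair. It is not Hodoscope's evaluation code: the embedding model, the example strings, and the summary wording are all placeholders.

```python
# Embed a raw action pair and a summarized pair, then compare cosine similarities.
# Uses the OpenAI embeddings API; model name and texts are illustrative only.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

raw_pair = [
    """bash {"command": "sed -n '710,717p' sympy/physics/quantum/tensorproduct.py"}""",
    """text_editor {"command": "view", "path": "/testbed/tests/delete/tests.py", "view_range": [470, 500]}""",
]
summary_pair = [  # hypothetical summaries with task-specific noise stripped away
    "View a short range of lines in a source file to inspect an implementation.",
    "View a short range of lines in a source file to inspect existing tests.",
]

raw_a, raw_b = embed(raw_pair)
sum_a, sum_b = embed(summary_pair)
print("raw cosine similarity:       ", round(cosine(raw_a, raw_b), 2))
print("summarized cosine similarity:", round(cosine(sum_a, sum_b), 2))
```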
Hodoscope supports multiple projection methods (PCA, t-SNE, UMAP, TriMAP, PaCMAP), but we generally find t-SNE best preserves local structure, keeping similar behaviors tightly clustered while separating distinct ones.
When groups have different sizes, we oversample smaller groups before fitting the projection so all groups have equal influence on the layout. This prevents a large group from dominating the embedding space and hiding patterns in smaller groups.
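A minimal sketch of this oversample-then-project step is shown below, using scikit-learn's t-SNE on random stand-in data; the function name and exact balancing scheme are illustrative, not the shipped implementation.

```python
import numpy as np
from sklearn.manifold import TSNE

def balanced_tsne(embeddings, groups, seed=0):
    """Project embeddings to 2D, duplicating rows from smaller groups first so
    every group has equal influence on the t-SNE layout."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(groups, return_counts=True)
    target = counts.max()

    extra = []
    for label, count in zip(labels, counts):
        if count < target:  # oversample this group up to the size of the largest one
            idx = np.flatnonzero(groups == label)
            extra.append(rng.choice(idx, size=target - count, replace=True))
    order = np.concatenate([np.arange(len(groups)), *extra])

    coords = TSNE(n_components=2, random_state=seed).fit_transform(embeddings[order])
    return coords[: len(groups)]  # the original points come first in `order`

# Example: three groups of unequal size in a 16-dimensional embedding space.
emb = np.random.default_rng(1).normal(size=(300, 16))
grp = np.array(["o3"] * 200 + ["gpt-4.1"] * 70 + ["iquest-coder-v1"] * 30)
xy = balanced_tsne(emb, grp)
print(xy.shape)  # (300, 2)
```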
We designed Hodoscope around the idea of population diffing: problematic behaviors are those that are uniquely dense in one setup compared to a baseline setup.
To support this, Hodoscope computes per-group kernel density estimates (KDE) over the 2D projection and overlays the difference: for a selected group, each point in the overlay is colored by how much that group's density exceeds (or falls below) the average density of all other groups. Regions where the selected group is overrepresented light up, guiding the user to the most distinctive behaviors.
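The sketch below illustrates this computation under stated assumptions (scipy's gaussian_kde over a regular grid, and a uniform average over the other groups); the real implementation's bandwidths, grids, and rendering may differ.

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_diff(xy, groups, selected, grid_size=100):
    """xy: (n, 2) projected points; groups: length-n labels.
    Returns a grid measuring how much `selected`'s density exceeds the
    average density of all other groups."""
    xs = np.linspace(xy[:, 0].min(), xy[:, 0].max(), grid_size)
    ys = np.linspace(xy[:, 1].min(), xy[:, 1].max(), grid_size)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.vstack([gx.ravel(), gy.ravel()])

    def kde(label):
        pts = xy[groups == label].T              # gaussian_kde expects shape (d, n)
        return gaussian_kde(pts)(grid).reshape(grid_size, grid_size)

    others = [g for g in np.unique(groups) if g != selected]
    diff = kde(selected) - np.mean([kde(g) for g in others], axis=0)
    return gx, gy, diff                          # positive where `selected` is overrepresented

# Toy usage on stand-in 2D coordinates and group labels.
xy = np.random.default_rng(2).normal(size=(300, 2))
grp = np.array(["o3"] * 200 + ["gpt-4.1"] * 70 + ["iquest-coder-v1"] * 30)
gx, gy, diff = density_diff(xy, grp, selected="iquest-coder-v1")
print(diff.shape)  # (100, 100)
```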
We hope Hodoscope is useful for folks working on agent evaluation, safety, or just trying to understand what their agents are up to. If you find interesting behavioral patterns or have ideas to improve this, we'd love to hear from you! A paper with more details will be coming soon.