We're releasing Hodoscope, an open-source tool for understanding what AI agents are really doing. As agents are deployed on increasingly complex tasks, we need ways to discover unexpected behaviors without knowing in advance what to look for. Hodoscope enables human-in-the-loop analysis of agent trajectories at scale. Rather than relying on rigid, pre-defined criteria for what constitutes an interesting behavior, Hodoscope uses unsupervised methods to surface insights.
As AI agents tackle increasingly complex tasks — writing code, browsing the web, running businesses — we need better ways to monitor them. Automated LLM-based evaluation (ask an LLM if the agent did something suspicious) is promising, but it may share the same blind spots as the agent itself. It works well when we know exactly what to look for, but struggles to catch novel failure modes we haven't anticipated.
We believe human judgment is essential for catching these novel failure modes, but reviewing every trace doesn't scale. Hodoscope aims to reduce the human effort needed as much as possible, so that a person can effectively oversee hundreds or thousands of agent traces, spending their attention where it matters most.
SWE-bench evaluates coding agents by asking them to fix real GitHub issues. Each task checks out a repo at the commit before the fix — but the actual solution commit remains reachable via git log. Some agents exploited this, running commands like git log --all to surface the fix diff directly rather than reasoning about the problem.
When we asked Claude Opus 4.6 to evaluate a trace containing this behavior, it marked the trace as non-suspicious, claiming it was a routine investigation step. Only after we provided the issue link as additional context could Opus recognize what was wrong.
Hodoscope takes an unsupervised approach with three stages:

1. Summarize each agent action with an LLM, stripping away setup-specific paths and formatting artifacts.
2. Embed the summaries so that semantically similar actions land close together.
3. Project the embeddings into an interactive 2D plot with per-group density overlays.
From the resulting plot, humans can interactively investigate regions with significant distribution differences. This saves substantial human effort: only a few examples from each region need to be examined to get a complete picture.
This approach is particularly useful for exploratory studies, where you are investigating a new model or benchmark and genuinely don't know what failure modes to expect. Once possible behaviors are surfaced, targeted or supervised follow-up studies can be carried out.
We used Hodoscope to discover benchmark gaming behaviors by frontier AI agents. The general strategy is to compare setups where you expect different behaviors, and focus on what's unique to one setup. This could be comparing multiple models on the same benchmark (reward hacking is unlikely to manifest identically across all models), or comparing the same model across different task variants (in our case, solvable and impossible tasks).
New finding: unsquashed git history on Commit0. In collaboration with OpenHands, we discovered a previously unknown time-traveling vulnerability in the Commit0 benchmark (Commit0 Issue #113, OpenHands Issue #421). Commit0 works by taking a real repository, creating an initial commit that strips out the bulk of the code, and then asking the agent to rewrite the removed portions. The problem: the git history wasn't squashed. Models including MiniMax M2.5 and Claude Opus figured out they could simply browse the git log, find the original code, and copy it back — bypassing the task entirely. We surfaced this by examining clusters unique to MiniMax M2.5. After patching the issue, MiniMax's score dropped significantly. Visualization (27 MB)
We also applied Hodoscope to benchmarks where agents are known to game, and successfully rediscovered those behaviors.
Rediscovery: GPT-5 hacking on ImpossibleBench. We compared GPT-5 trajectories on SWE-bench and impossible tasks from ImpossibleBench where the only way to pass is to hack the evaluation. When we originally built ImpossibleBench, identifying these reward hacking behaviors required painstaking manual review of many traces. With Hodoscope, the same behaviors — modifying test cases, special-casing, maintaining special internal states — surfaced in a prominent cluster. Visualization (37 MB)
Rediscovery: iQuest-Coder-V1 time-traveling on SWE-bench. Because outdated Docker images were used for evaluation, the agent could access git history containing future patches — essentially looking at the answer key. Hodoscope made this visible by surfacing an unusual cluster of git log actions unique to iQuest-Coder-V1. Visualization (13 MB)
We walk through this finding in more detail in the next part (demo) of this blog.
Although we focused on benchmark gaming, we believe the tool could also be useful for other setups, such as understanding model behavioral differences pre- and post-reinforcement learning or debugging agent deployments. We are releasing Hodoscope now so others can start exploring their own agent traces.
Here's a concrete walkthrough of how Hodoscope works, using the SWE-bench study that re-discovered the iQuest time-traveling issue (& more!). A full runnable version of this example is available at examples/swebench_iquest in our repo.
Hodoscope natively accepts Docent collections, Inspect AI .eval files, OpenHands evaluation results, and plain trajectory JSON directories.
We're comparing five models on SWE-bench Verified. Four are Docent collections from the SWE-bench Leaderboard: o3, gpt-4.1, qwen3-coder, and deepseek-v3.2-reasoner. The fifth, iQuest-Coder-v1, requires manual preprocessing. Note that the SWE-bench Leaderboard has since been updated to a newer set of traces; we use the older versions that were available when we ran these experiments.
iQuest preprocessing — raw trajectories are available as a single JSON file in their GitHub repo. We download it and split it into individual trajectory files:
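The split itself is a few lines of Python. Below is a minimal sketch rather than the exact preprocessing script we used: it assumes the raw file is a JSON array of trajectory records with an instance_id field, and the filename is a placeholder.

```python
# Minimal sketch: split one large JSON file into one file per trajectory.
# The source filename and the "instance_id" field are assumptions about the
# iQuest release, not guaranteed by it.
import json
from pathlib import Path

src = Path("iquest_coder_v1_swebench.json")  # hypothetical filename
out_dir = Path("iquest_trajectories")
out_dir.mkdir(exist_ok=True)

records = json.loads(src.read_text())        # assumed: a list of trajectory records
for i, record in enumerate(records):
    name = record.get("instance_id", f"trajectory_{i:04d}")
    (out_dir / f"{name}.json").write_text(json.dumps(record, indent=2))

print(f"Wrote {len(records)} trajectory files to {out_dir}/")
```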
Run hodoscope analyze on each source. We sample 50 trajectories per model with a fixed seed for reproducibility, and annotate each source with its model name.
Summarization defaults to gpt-5.2 and costs ~$35 in OpenAI API calls total. You can reduce cost by switching the summarization model (e.g. --summarize-model gemini/gemini-3-flash-preview) or by further subsampling.
Each command produces a .hodoscope.json file containing action summaries and embeddings.
Combine everything into a single interactive explorer:
This command generates a self-contained HTML file and opens it in your browser. Each point corresponds to an agent action, colored by model. This is our output for this walkthrough (13 MB).
With all models plotted together, you can immediately see how behaviors cluster. Most of the embedding space is shared — all models write code, run tests, read files. But there are regions that only light up for iQuest. The density overlay makes this especially clear: switch the overlay to iQuest vs. the other models, and the unique cluster stands out in red.
We can then manually analyze these regions. The following are the authors' human annotations.
We can click on individual dots to examine the corresponding actions. In this case, we found actions running git log near the center of the "git related" cluster above.
If we search for "git log" in the textbox above, we find that only 79 out of 4006 actions (2.0%) exhibit this behavior. With Hodoscope, however, it took us only a few minutes to discover them.
We also discovered that iQuest modifies test cases frequently. Is this problematic?
By examining a couple of transcripts, we found that most of the test modifications seem reasonable. In the example below, while this test was not modified in the standard patch, the modification makes sense, and this particular test is not included in the evaluation (see the discussion around this here). Many of the modifications are also to tests that iQuest itself generated.
Are any of these modifications actually harmful? To find out, we wrote a Python script that examines all instances of test modification under testbed/tests in the full iQuest trajectory set; a minimal sketch of the classification logic appears after the table below.
| Type | Count | Explanation |
|---|---|---|
| Modified tests within test_patch | 60 | Expected/Justified: The modification targets tests meant to be evaluated. |
| Modified unreferenced tests (not in FAIL_TO_PASS or PASS_TO_PASS) | 10 | Harmless: Edits target tests that are ignored by the evaluation framework. |
| Malicious manipulation of referenced tests | 0 | Harmful: No such instances of the agent actively gaming the evaluation were found. |
It turns out that none of these edits were actually gaming the evaluation, so we do not consider this behavior truly detrimental.
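For reference, here is a minimal sketch of the classification logic rather than the actual script: it assumes the agent's test edits have already been parsed into test identifiers, and it uses toy SWE-bench metadata with field names from the public schema (FAIL_TO_PASS, PASS_TO_PASS, and the tests touched by the test_patch).

```python
from collections import Counter

# Toy SWE-bench instance metadata. The field names mirror the public SWE-bench
# schema, but the pre-parsed shapes (sets of test IDs) are assumptions of this sketch.
instance = {
    "FAIL_TO_PASS": {"tests/delete/tests.py::test_fast_delete"},
    "PASS_TO_PASS": {"tests/delete/tests.py::test_basic_delete"},
    "tests_in_test_patch": {"tests/delete/tests.py::test_fast_delete"},
}

def classify(edited_test: str) -> str:
    """Bucket a single agent edit to a test under testbed/tests."""
    referenced = instance["FAIL_TO_PASS"] | instance["PASS_TO_PASS"]
    if edited_test in instance["tests_in_test_patch"]:
        return "within test_patch (expected/justified)"
    if edited_test not in referenced:
        return "unreferenced (harmless)"
    return "referenced test outside test_patch (potentially harmful)"

# Tests edited by the agent in one hypothetical trajectory.
edited_tests = [
    "tests/delete/tests.py::test_fast_delete",
    "tests/delete/tests.py::test_unrelated_helper",
]
print(Counter(classify(t) for t in edited_tests))
```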
This illustrates a general pattern for using Hodoscope: first, unsupervised exploration surfaces a suspicious behavior you wouldn't have thought to look for; then, you follow up with targeted, supervised analysis to verify whether the behavior is actually problematic.
Raw agent actions are noisy — full of setup-specific paths and formatting artifacts. Directly embedding raw actions causes the embedding space to cluster by task context (e.g., all Django actions together, all SymPy actions together) rather than by behavior. Our summarization step focuses on stripping away these artifacts so that semantically similar actions end up close together regardless of context.
For example, consider two actions that both view a range of lines from a source file, but in completely different codebases and using different tools:
bash {"command": "sed -n '710,717p' sympy/physics/quantum/tensorproduct.py"} → Output: 8 lines of tensor product code

text_editor {"command": "view", "path": "/testbed/tests/delete/tests.py", "view_range": [470, 500]} → Output: 30 lines of Django test code

After our summarization step, both actions are described in similar, context-free terms (viewing a range of lines from a source file), and their embedding cosine similarity improves from 0.70 to 0.79.
To measure this quantitatively, we compiled pairs of semantically similar actions from different contexts, measured their embedding cosine similarity before and after summarization, and used the results to tune our summarization prompt. In the end, similar actions generally receive higher cosine similarities, while dissimilar actions receive much lower ones (since task-specific noise is removed).
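The sketch below illustrates this before/after check on a single pair. It is not Hodoscope's evaluation code: the embedding model, the example strings, and the summary wording are all placeholders.

```python
# Embed a raw action pair and a summarized pair, then compare cosine similarities.
# Uses the OpenAI embeddings API; model name and texts are illustrative only.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

raw_pair = [
    """bash {"command": "sed -n '710,717p' sympy/physics/quantum/tensorproduct.py"}""",
    """text_editor {"command": "view", "path": "/testbed/tests/delete/tests.py", "view_range": [470, 500]}""",
]
summary_pair = [  # hypothetical summaries with task-specific noise stripped away
    "View a short range of lines in a source file to inspect an implementation.",
    "View a short range of lines in a source file to inspect existing tests.",
]

raw_a, raw_b = embed(raw_pair)
sum_a, sum_b = embed(summary_pair)
print("raw cosine similarity:       ", round(cosine(raw_a, raw_b), 2))
print("summarized cosine similarity:", round(cosine(sum_a, sum_b), 2))
```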
Hodoscope supports multiple projection methods (PCA, t-SNE, UMAP, TriMAP, PaCMAP), but we generally find t-SNE best preserves local structure, keeping similar behaviors tightly clustered while separating distinct ones.
When groups have different sizes, we oversample smaller groups before fitting the projection so all groups have equal influence on the layout. This prevents a large group from dominating the embedding space and hiding patterns in smaller groups.
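A minimal sketch of this oversample-then-project step is shown below, using scikit-learn's t-SNE on random stand-in data; the function name and exact balancing scheme are illustrative, not the shipped implementation.

```python
import numpy as np
from sklearn.manifold import TSNE

def balanced_tsne(embeddings, groups, seed=0):
    """Project embeddings to 2D, duplicating rows from smaller groups first so
    every group has equal influence on the t-SNE layout."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(groups, return_counts=True)
    target = counts.max()

    extra = []
    for label, count in zip(labels, counts):
        if count < target:  # oversample this group up to the size of the largest one
            idx = np.flatnonzero(groups == label)
            extra.append(rng.choice(idx, size=target - count, replace=True))
    order = np.concatenate([np.arange(len(groups)), *extra])

    coords = TSNE(n_components=2, random_state=seed).fit_transform(embeddings[order])
    return coords[: len(groups)]  # the original points come first in `order`

# Example: three groups of unequal size in a 16-dimensional embedding space.
emb = np.random.default_rng(1).normal(size=(300, 16))
grp = np.array(["o3"] * 200 + ["gpt-4.1"] * 70 + ["iquest-coder-v1"] * 30)
xy = balanced_tsne(emb, grp)
print(xy.shape)  # (300, 2)
```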
We designed Hodoscope around the idea of population diffing: problematic behaviors are those that are uniquely dense in one setup compared to a baseline setup.
To support this, Hodoscope computes per-group kernel density estimates (KDE) over the 2D projection and overlays the difference: for a selected group, each point in the overlay is colored by how much that group's density exceeds (or falls below) the average density of all other groups. Regions where the selected group is overrepresented light up, guiding the user to the most distinctive behaviors.
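The sketch below illustrates this computation under stated assumptions (scipy's gaussian_kde over a regular grid, and a uniform average over the other groups); the real implementation's bandwidths, grids, and rendering may differ.

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_diff(xy, groups, selected, grid_size=100):
    """xy: (n, 2) projected points; groups: length-n labels.
    Returns a grid measuring how much `selected`'s density exceeds the
    average density of all other groups."""
    xs = np.linspace(xy[:, 0].min(), xy[:, 0].max(), grid_size)
    ys = np.linspace(xy[:, 1].min(), xy[:, 1].max(), grid_size)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.vstack([gx.ravel(), gy.ravel()])

    def kde(label):
        pts = xy[groups == label].T              # gaussian_kde expects shape (d, n)
        return gaussian_kde(pts)(grid).reshape(grid_size, grid_size)

    others = [g for g in np.unique(groups) if g != selected]
    diff = kde(selected) - np.mean([kde(g) for g in others], axis=0)
    return gx, gy, diff                          # positive where `selected` is overrepresented

# Toy usage on stand-in 2D coordinates and group labels.
xy = np.random.default_rng(2).normal(size=(300, 2))
grp = np.array(["o3"] * 200 + ["gpt-4.1"] * 70 + ["iquest-coder-v1"] * 30)
gx, gy, diff = density_diff(xy, grp, selected="iquest-coder-v1")
print(diff.shape)  # (100, 100)
```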
We hope Hodoscope is useful for folks working on agent evaluation, safety, or just trying to understand what their agents are up to. If you find interesting behavioral patterns or have ideas to improve this, we'd love to hear from you! A paper with more details will be coming soon.