Trajectory-Level Unit
Uses the full execution trace as the evaluation object, including requests, agent responses, tool calls, and environment feedback.
ATBench evaluates whether long-horizon, tool-using AI agents behave safely across complete execution traces, and whether unsafe traces can be diagnosed along risk source, failure mode, and real-world harm.
ATBench moves agent-safety evaluation from isolated prompts or final responses to complete trajectories, preserving user requests, agent actions, tool calls, and environment feedback.
Autonomous agents increasingly operate through long-horizon, tool-augmented trajectories. Existing safety evaluation often focuses on single-step moderation or final-output filtering, which can miss unsafe behavior emerging during intermediate planning, tool invocation, or environment feedback. ATBench addresses this gap with trajectory-level safety evaluation and diagnosis. Each sample is a complete execution trace, labeled as safe or unsafe; unsafe traces are further annotated with one primary Risk Source, Failure Mode, and Real-world Harm. The benchmark is constructed through taxonomy-guided data generation, rule-based and LLM-based filtering, and full human audit, producing a diverse and realistic testbed for evaluating modern tool-using agents.
ATBench is designed as both a benchmark release and a reusable diagnostic protocol for agent safety.
Uses the full execution trace as the evaluation object, including requests, agent responses, tool calls, and environment feedback.
Separates where risk enters, how agent behavior fails, and what harm follows, enabling actionable safety analysis beyond binary labels.
Preserves ATBench500 for historical comparison while making the new 1,000-trajectory ATBench release the default reference point.
The taxonomy decomposes unsafe trajectories into three complementary diagnostic dimensions.
| Dimension | Diagnostic question | Category count | Examples |
|---|---|---|---|
| Risk Source | Where does the risk enter the trajectory? | 8 | User input, environmental observation, external tools/APIs, internal agent failures |
| Failure Mode | How does the agent realize or amplify the risk? | 14 | Over-privileged action, flawed planning, tool misuse, unsafe execution, harmful output |
| Real-world Harm | What downstream harm could the unsafe trajectory cause? | 10 | Privacy, financial, security, physical, psychological, reputational, societal harm |
The same high-level taxonomy supports binary safety evaluation and fine-grained diagnosis for unsafe trajectories.
ATBench keeps the newest release and the original AgentDoG benchmark in one explicit release lineage.
| Release | Status | Cases | Safe | Unsafe | Available Tools | Used Tools | Avg. Turns | Avg. Tokens |
|---|---|---|---|---|---|---|---|---|
| ATBench | Latest | 1,000 | 503 | 497 | 2,084 | 1,954 | 9.01 | 3.95k |
| ATBench500 | Legacy | 500 | 250 | 250 | 1,575 | 1,357 | 8.97 | 1.52k |
Available Tools counts tools exposed through per-trajectory tool pools. Used Tools counts unique tools invoked in released trajectories.
ATBench exposes a gap between general safety capability and trajectory-level agent safety judgment.
Representative model performance shows that ATBench is difficult for both general-purpose models and specialized guard models.
| Group | Model | R-Judge Acc | R-Judge Prec. | R-Judge Rec. | R-Judge F1 | ATBench Acc | ATBench Prec. | ATBench Rec. | ATBench F1 |
|---|---|---|---|---|---|---|---|---|---|
| Closed-source models | |||||||||
| Closed-source | GPT-5.4 | 93.3 | 93.1 | 94.3 | 93.7 | 73.7 | 68.5 | 87.1 | 76.7 |
| Closed-source | GPT-5.2 | 90.8 | 86.8 | 97.5 | 91.8 | 69.0 | 65.6 | 79.3 | 71.8 |
| Closed-source | Gemini-3-Flash | 95.2 | 98.7 | 92.1 | 95.3 | 76.4 | 79.3 | 71.0 | 74.9 |
| Closed-source | Gemini-3.1-Pro | 97.3 | 99.1 | 95.7 | 97.4 | 75.5 | 76.1 | 73.8 | 75.0 |
| Open-source models | |||||||||
| Open-source | Qwen3.5-397B-A17B | 85.6 | 81.3 | 94.5 | 87.4 | 66.8 | 65.5 | 70.2 | 67.8 |
| Open-source | Qwen3.5-4B | 81.0 | 82.1 | 81.9 | 82.0 | 45.9 | 41.2 | 20.7 | 27.6 |
| Open-source | Qwen3.5-2B | 54.1 | 67.6 | 25.2 | 36.7 | 59.1 | 74.3 | 19.2 | 30.5 |
| Open-source | Qwen3.5-0.8B | 33.7 | 27.6 | 15.8 | 20.1 | 48.6 | 66.7 | 5.9 | 10.8 |
| Open-source | QwQ-32B | 89.5 | 94.9 | 84.7 | 89.5 | 57.7 | 81.9 | 19.1 | 31.0 |
| Open-source | Qwen3-235B-A22B-Instruct-2507 | 85.1 | 80.7 | 94.4 | 87.0 | 59.2 | 58.2 | 63.8 | 60.8 |
| Open-source | Qwen3-4B-Instruct-2507 | 68.4 | 73.8 | 62.4 | 67.6 | 55.7 | 77.6 | 15.3 | 25.5 |
| Open-source | Qwen2.5-7B-Instruct | 68.4 | 77.4 | 56.8 | 65.5 | 53.4 | 73.8 | 9.7 | 17.1 |
| Open-source | Llama-3.1-8B-Instruct | 53.7 | 53.3 | 99.8 | 69.5 | 45.3 | 47.3 | 89.5 | 61.9 |
| Guard models | |||||||||
| Guard | LlamaGuard3-8B | 61.2 | 69.1 | 48.1 | 56.7 | 53.1 | 85.7 | 3.8 | 7.3 |
| Guard | LlamaGuard4-12B | 63.8 | 68.3 | 58.8 | 63.2 | 58.1 | 63.8 | 30.9 | 41.7 |
| Guard | Qwen3-Guard | 40.6 | 23.6 | 5.6 | 9.0 | 51.5 | 40.0 | 0.4 | 0.8 |
| Guard | ShieldAgent | 81.0 | 74.0 | 98.8 | 84.6 | 62.5 | 58.0 | 81.4 | 67.7 |
| Guard | JoySafety | 52.5 | 57.2 | 40.2 | 47.2 | 56.9 | 61.7 | 35.0 | 44.7 |
| Guard | NemoGuard | 54.4 | 60.1 | 40.6 | 48.5 | 49.9 | 49.5 | 41.6 | 45.2 |
| AgentDoG models | |||||||||
| Ours | AgentDoG 1.0-4B | 91.8 | 87.5 | 98.5 | 92.7 | 64.0 | 59.2 | 88.9 | 71.1 |
| Ours | AgentDoG 1.5-0.8B | 75.7 | 83.3 | 67.5 | 74.6 | 60.3 | 58.6 | 68.6 | 63.2 |
| Ours | AgentDoG 1.5-2B | 71.5 | 78.0 | 64.1 | 70.4 | 69.0 | 70.1 | 65.7 | 67.8 |
| Ours | AgentDoG 1.5-8B | 75.5 | 68.6 | 98.8 | 81.0 | 70.9 | 67.1 | 81.2 | 73.5 |
| Ours | AgentDoG 1.5-4B | 92.2 | 91.7 | 93.7 | 92.7 | 72.4 | 69.2 | 80.3 | 74.3 |
| Ours | AgentDoG 1.5-4B-U | 90.4 | 93.9 | 87.6 | 90.6 | 78.4 | 79.8 | 75.7 | 77.7 |
The large table is placed after the performance figure to make the exact numerical comparison inspectable. ATBench remains challenging: several strong models have high recall or precision but much weaker balanced F1.
Unsafe trajectories are also evaluated on three diagnostic labels: Risk Source, Failure Mode, and Real-world Harm.
| Group | Model | Risk Source | Failure Mode | Real-world Harm | Avg. |
|---|---|---|---|---|---|
| Closed-source models | |||||
| Closed-source | GPT-5.4 | 33.6 | 13.5 | 30.2 | 25.8 |
| Closed-source | GPT-5.2 | 29.5 | 12.0 | 26.8 | 22.8 |
| Closed-source | Gemini-3-Flash | 18.4 | 8.3 | 15.0 | 13.9 |
| Closed-source | Gemini-3.1-Pro | 24.8 | 12.6 | 18.5 | 18.6 |
| Open-source models | |||||
| Open-source | Qwen3.5-397B | 7.7 | 3.6 | 6.8 | 6.0 |
| Open-source | Qwen3.5-0.8B | 1.3 | 2.9 | 4.7 | 3.0 |
| Open-source | Qwen3.5-2B | 7.7 | 6.6 | 11.1 | 8.5 |
| Open-source | Qwen3.5-4B | 6.6 | 3.0 | 8.2 | 5.9 |
| Open-source | QwQ-32B | 15.8 | 9.4 | 22.9 | 16.0 |
| Open-source | Qwen3-235B | 7.0 | 11.6 | 26.6 | 15.1 |
| Open-source | Qwen3-4B-Instruct | 1.0 | 9.6 | 21.2 | 10.6 |
| Open-source | Qwen2.5-7B-Instruct | 5.3 | 6.0 | 15.5 | 8.9 |
| Open-source | Llama3.1-8B-Instruct | 6.2 | 5.8 | 15.5 | 9.2 |
| AgentDoG models | |||||
| Ours | AgentDoG1.0-4B | 46.8 | 16.5 | 40.6 | 34.6 |
| Ours | AgentDoG 1.5-0.8B | 65.7 | 18.4 | 44.9 | 43.0 |
| Ours | AgentDoG 1.5-2B | 68.0 | 24.0 | 53.8 | 48.6 |
| Ours | AgentDoG 1.5-8B | 72.9 | 24.6 | 52.5 | 50.0 |
| Ours | AgentDoG 1.5-4B | 75.2 | 27.5 | 62.9 | 55.2 |
| Ours | AgentDoG 1.5-4B-U | 24.1 | 9.5 | 28.4 | 20.7 |
Guard models are excluded from this table because they only output binary labels. The average is the mean over the three diagnostic dimensions.
ATBench combines automatic validation and human audit to make fine-grained labels usable for diagnosis.
| Annotator | Risk Source | Failure Mode | Real-world Harm |
|---|---|---|---|
| Annotator 1 | 82.0% (41/50) | 70.0% (35/50) | 88.0% (44/50) |
| Annotator 2 | 86.0% (43/50) | 60.0% (30/50) | 80.0% (40/50) |
| Annotator 3 | 86.0% (43/50) | 72.0% (36/50) | 86.0% (43/50) |
| Avg. | 84.7% (127/150) | 67.3% (101/150) | 84.7% (127/150) |
Risk Source and Real-world Harm show higher agreement, while Failure Mode is intentionally finer-grained and more ambiguous.
The data engine uses taxonomy-guided planning, trajectory synthesis, validation, and audit before release.
Taxonomy-guided data construction starts from sampled risk configurations and tool pools, then validates complete trajectories before release.
Representative cases show that models may detect unsafe behavior while still missing the correct fine-grained cause.
ATBench500 remains available for backward compatibility and historical comparison with the original AgentDoG release.
from datasets import load_dataset
atbench = load_dataset("AI45Research/ATBench", "ATBench", split="test")
atbench500 = load_dataset("AI45Research/ATBench", "ATBench500", split="test")
@article{li2026atbench,
title={ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis},
author={Yu Li and Haoyu Luo and Yuejin Xie and Yuqian Fu and Zhonghao Yang and Shuai Shao and Qihan Ren and Wanying Qu and Yanwei Fu and Yujiu Yang and Jing Shao and Xia Hu and Dongrui Liu},
journal={arXiv preprint arXiv:2604.02022},
year={2026},
doi={10.48550/arXiv.2604.02022},
url={https://arxiv.org/abs/2604.02022}
}
@article{liu2026agentdog,
title={AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security},
author={Yu Li and Haoyu Luo and Yuejin Xie and Jiapeng Gu and Yuhan Wang and Yanwei Fu and Yujiu Yang and Jing Shao and Xia Hu and Dongrui Liu},
journal={arXiv preprint arXiv:2601.18491},
year={2026},
url={https://arxiv.org/abs/2601.18491}
}