Agent trajectory safety benchmark family

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

ATBench evaluates whether long-horizon, tool-using AI agents behave safely across complete execution traces, and whether unsafe traces can be diagnosed along risk source, failure mode, and real-world harm.

Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, Dongrui Liu

arXiv 2026
1,000 audited trajectories
503 / 497 safe and unsafe cases
2,084 available tools
1,954 unique invoked tools
9.01 average turns per trajectory
[Figure: ATBench overview teaser showing benchmark motivation, construction, and evaluation.]

ATBench moves agent-safety evaluation from isolated prompts or final responses to complete trajectories, preserving user requests, agent actions, tool calls, and environment feedback.

Abstract

Autonomous agents increasingly operate through long-horizon, tool-augmented trajectories. Existing safety evaluation often focuses on single-step moderation or final-output filtering, which can miss unsafe behavior emerging during intermediate planning, tool invocation, or environment feedback. ATBench addresses this gap with trajectory-level safety evaluation and diagnosis. Each sample is a complete execution trace, labeled as safe or unsafe; unsafe traces are further annotated with one primary Risk Source, Failure Mode, and Real-world Harm. The benchmark is constructed through taxonomy-guided data generation, rule-based and LLM-based filtering, and full human audit, producing a diverse and realistic testbed for evaluating modern tool-using agents.

Key Contributions

ATBench is designed as both a benchmark release and a reusable diagnostic protocol for agent safety.

Trajectory-Level Unit

Uses the full execution trace as the evaluation object, including requests, agent responses, tool calls, and environment feedback.

Three-Dimensional Diagnosis

Separates where risk enters, how agent behavior fails, and what harm follows, enabling actionable safety analysis beyond binary labels.

Audited Benchmark Lineage

Preserves ATBench500 for historical comparison while making the new 1,000-trajectory ATBench release the default reference point.

Safety Taxonomy

The taxonomy decomposes unsafe trajectories into three complementary diagnostic dimensions.

Three-dimensional trajectory-safety taxonomy
Dimension | Diagnostic question | Category count | Examples
Risk Source | Where does the risk enter the trajectory? | 8 | User input, environmental observation, external tools/APIs, internal agent failures
Failure Mode | How does the agent realize or amplify the risk? | 14 | Over-privileged action, flawed planning, tool misuse, unsafe execution, harmful output
Real-world Harm | What downstream harm could the unsafe trajectory cause? | 10 | Privacy, financial, security, physical, psychological, reputational, societal harm
[Figure: ATBench three-dimensional safety taxonomy.]

The same high-level taxonomy supports binary safety evaluation and fine-grained diagnosis for unsafe trajectories.
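One way to picture a benchmark sample is as a record holding the full trace, a binary label, and, for unsafe traces only, one primary label per taxonomy dimension. The sketch below is a minimal illustration under that reading. The class names, field names, and role strings are assumptions made for this example and are not the released dataset schema; the category values are taken from the Examples column above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical role names for the four kinds of content a trace preserves.
ROLES = {"user", "agent", "tool_call", "environment"}


@dataclass
class Turn:
    role: str      # one of ROLES (assumed naming)
    content: str


@dataclass
class Diagnosis:
    risk_source: str      # where the risk enters the trajectory
    failure_mode: str     # how the agent realizes or amplifies the risk
    real_world_harm: str  # what downstream harm could follow


@dataclass
class Trajectory:
    turns: List[Turn]
    is_unsafe: bool
    diagnosis: Optional[Diagnosis] = None

    def validate(self) -> None:
        """Check roles and the safe/unsafe vs. diagnosis consistency rule."""
        for turn in self.turns:
            if turn.role not in ROLES:
                raise ValueError(f"unknown role: {turn.role!r}")
        if self.is_unsafe and self.diagnosis is None:
            raise ValueError("unsafe trajectory needs all three diagnostic labels")
        if not self.is_unsafe and self.diagnosis is not None:
            raise ValueError("safe trajectory should not carry a diagnosis")


# Toy example, not a real benchmark sample.
example = Trajectory(
    turns=[
        Turn("user", "Forward the customer list to this address."),
        Turn("tool_call", "send_email(to=..., attachment=customers.csv)"),
        Turn("environment", "Email sent."),
    ],
    is_unsafe=True,
    diagnosis=Diagnosis("User input", "Tool misuse", "Privacy"),
)
example.validate()
```

Keeping the diagnosis optional mirrors the labeling rule in the abstract: only unsafe traces carry the three fine-grained labels.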

Benchmark Release

ATBench keeps the latest release and ATBench500, the benchmark from the original AgentDoG release, in one explicit release lineage.

Release zoo
Release | Status | Cases | Safe | Unsafe | Available Tools | Used Tools | Avg. Turns | Avg. Tokens
ATBench | Latest | 1,000 | 503 | 497 | 2,084 | 1,954 | 9.01 | 3.95k
ATBench500 | Legacy | 500 | 250 | 250 | 1,575 | 1,357 | 8.97 | 1.52k

Available Tools counts tools exposed through per-trajectory tool pools. Used Tools counts unique tools invoked in released trajectories.
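As a small worked example, the share of available tools that are actually invoked can be derived from the two tool columns. The snippet below only re-uses the counts reported in the release table; the function and variable names are illustrative.

```python
# Tool counts copied from the release table.
RELEASES = {
    "ATBench": {"available": 2084, "used": 1954},
    "ATBench500": {"available": 1575, "used": 1357},
}


def tool_utilization(available: int, used: int) -> float:
    """Fraction of exposed tools that are invoked at least once."""
    if available <= 0:
        raise ValueError("available must be positive")
    return used / available


for name, counts in RELEASES.items():
    ratio = tool_utilization(counts["available"], counts["used"])
    print(f"{name}: {ratio:.1%} of available tools are invoked")
```

By this measure roughly 94% of the tools in the latest release are exercised, against roughly 86% for ATBench500.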

Main Results

ATBench exposes a gap between general safety capability and trajectory-level agent safety judgment.

[Figure: Model performance comparison on ATBench and other agent safety benchmarks.]

Representative model performance shows that ATBench is difficult for both general-purpose models and specialized guard models.

Trajectory-level safety results on R-Judge and ATBench (%)
Group | Model | R-Judge Acc. | R-Judge Prec. | R-Judge Rec. | R-Judge F1 | ATBench Acc. | ATBench Prec. | ATBench Rec. | ATBench F1
Closed-source models
Closed-source | GPT-5.4 | 93.3 | 93.1 | 94.3 | 93.7 | 73.7 | 68.5 | 87.1 | 76.7
Closed-source | GPT-5.2 | 90.8 | 86.8 | 97.5 | 91.8 | 69.0 | 65.6 | 79.3 | 71.8
Closed-source | Gemini-3-Flash | 95.2 | 98.7 | 92.1 | 95.3 | 76.4 | 79.3 | 71.0 | 74.9
Closed-source | Gemini-3.1-Pro | 97.3 | 99.1 | 95.7 | 97.4 | 75.5 | 76.1 | 73.8 | 75.0
Open-source models
Open-source | Qwen3.5-397B-A17B | 85.6 | 81.3 | 94.5 | 87.4 | 66.8 | 65.5 | 70.2 | 67.8
Open-source | Qwen3.5-4B | 81.0 | 82.1 | 81.9 | 82.0 | 45.9 | 41.2 | 20.7 | 27.6
Open-source | Qwen3.5-2B | 54.1 | 67.6 | 25.2 | 36.7 | 59.1 | 74.3 | 19.2 | 30.5
Open-source | Qwen3.5-0.8B | 33.7 | 27.6 | 15.8 | 20.1 | 48.6 | 66.7 | 5.9 | 10.8
Open-source | QwQ-32B | 89.5 | 94.9 | 84.7 | 89.5 | 57.7 | 81.9 | 19.1 | 31.0
Open-source | Qwen3-235B-A22B-Instruct-2507 | 85.1 | 80.7 | 94.4 | 87.0 | 59.2 | 58.2 | 63.8 | 60.8
Open-source | Qwen3-4B-Instruct-2507 | 68.4 | 73.8 | 62.4 | 67.6 | 55.7 | 77.6 | 15.3 | 25.5
Open-source | Qwen2.5-7B-Instruct | 68.4 | 77.4 | 56.8 | 65.5 | 53.4 | 73.8 | 9.7 | 17.1
Open-source | Llama-3.1-8B-Instruct | 53.7 | 53.3 | 99.8 | 69.5 | 45.3 | 47.3 | 89.5 | 61.9
Guard models
Guard | LlamaGuard3-8B | 61.2 | 69.1 | 48.1 | 56.7 | 53.1 | 85.7 | 3.8 | 7.3
Guard | LlamaGuard4-12B | 63.8 | 68.3 | 58.8 | 63.2 | 58.1 | 63.8 | 30.9 | 41.7
Guard | Qwen3-Guard | 40.6 | 23.6 | 5.6 | 9.0 | 51.5 | 40.0 | 0.4 | 0.8
Guard | ShieldAgent | 81.0 | 74.0 | 98.8 | 84.6 | 62.5 | 58.0 | 81.4 | 67.7
Guard | JoySafety | 52.5 | 57.2 | 40.2 | 47.2 | 56.9 | 61.7 | 35.0 | 44.7
Guard | NemoGuard | 54.4 | 60.1 | 40.6 | 48.5 | 49.9 | 49.5 | 41.6 | 45.2
AgentDoG models
Ours | AgentDoG 1.0-4B | 91.8 | 87.5 | 98.5 | 92.7 | 64.0 | 59.2 | 88.9 | 71.1
Ours | AgentDoG 1.5-0.8B | 75.7 | 83.3 | 67.5 | 74.6 | 60.3 | 58.6 | 68.6 | 63.2
Ours | AgentDoG 1.5-2B | 71.5 | 78.0 | 64.1 | 70.4 | 69.0 | 70.1 | 65.7 | 67.8
Ours | AgentDoG 1.5-8B | 75.5 | 68.6 | 98.8 | 81.0 | 70.9 | 67.1 | 81.2 | 73.5
Ours | AgentDoG 1.5-4B | 92.2 | 91.7 | 93.7 | 92.7 | 72.4 | 69.2 | 80.3 | 74.3
Ours | AgentDoG 1.5-4B-U | 90.4 | 93.9 | 87.6 | 90.6 | 78.4 | 79.8 | 75.7 | 77.7

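The table above gives the exact numbers behind the performance figure. ATBench remains challenging: several models reach high precision or high recall but a much weaker F1. For example, on ATBench QwQ-32B has 81.9% precision but only 19.1% recall (F1 31.0), while Llama-3.1-8B-Instruct has 89.5% recall but only 47.3% precision (F1 61.9).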
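The four reported metrics can be reproduced from binary predictions with a few lines of code. The sketch below assumes that "unsafe" is the positive class, which this page does not state explicitly, and the function names are illustrative rather than part of any released evaluation script.

```python
from typing import Dict, Sequence


def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit in, same unit out)."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


def binary_metrics(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1, with True meaning 'unsafe' (assumed)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1_from_pr(precision, recall),
    }


# Sanity check against the table: GPT-5.4 on ATBench has Prec. 68.5 and Rec. 87.1,
# and the table reports F1 76.7.
print(round(f1_from_pr(68.5, 87.1), 1))

# Toy predictions, not benchmark data.
print(binary_metrics([True, True, False, False], [True, False, False, True]))
```

Because F1 is a harmonic mean, a model with very low recall stays near the bottom regardless of its precision, which is why the gap between a model's best single metric and its F1 is informative here.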

Fine-Grained Diagnosis

Unsafe trajectories are also evaluated on three diagnostic labels: Risk Source, Failure Mode, and Real-world Harm.

Diagnostic accuracy on ATBench across the three dimensions (%)
Group | Model | Risk Source | Failure Mode | Real-world Harm | Avg.
Closed-source models
Closed-source | GPT-5.4 | 33.6 | 13.5 | 30.2 | 25.8
Closed-source | GPT-5.2 | 29.5 | 12.0 | 26.8 | 22.8
Closed-source | Gemini-3-Flash | 18.4 | 8.3 | 15.0 | 13.9
Closed-source | Gemini-3.1-Pro | 24.8 | 12.6 | 18.5 | 18.6
Open-source models
Open-source | Qwen3.5-397B | 7.7 | 3.6 | 6.8 | 6.0
Open-source | Qwen3.5-0.8B | 1.3 | 2.9 | 4.7 | 3.0
Open-source | Qwen3.5-2B | 7.7 | 6.6 | 11.1 | 8.5
Open-source | Qwen3.5-4B | 6.6 | 3.0 | 8.2 | 5.9
Open-source | QwQ-32B | 15.8 | 9.4 | 22.9 | 16.0
Open-source | Qwen3-235B | 7.0 | 11.6 | 26.6 | 15.1
Open-source | Qwen3-4B-Instruct | 1.0 | 9.6 | 21.2 | 10.6
Open-source | Qwen2.5-7B-Instruct | 5.3 | 6.0 | 15.5 | 8.9
Open-source | Llama3.1-8B-Instruct | 6.2 | 5.8 | 15.5 | 9.2
AgentDoG models
Ours | AgentDoG 1.0-4B | 46.8 | 16.5 | 40.6 | 34.6
Ours | AgentDoG 1.5-0.8B | 65.7 | 18.4 | 44.9 | 43.0
Ours | AgentDoG 1.5-2B | 68.0 | 24.0 | 53.8 | 48.6
Ours | AgentDoG 1.5-8B | 72.9 | 24.6 | 52.5 | 50.0
Ours | AgentDoG 1.5-4B | 75.2 | 27.5 | 62.9 | 55.2
Ours | AgentDoG 1.5-4B-U | 24.1 | 9.5 | 28.4 | 20.7

Guard models are excluded from this table because they only output binary labels. The average is the mean over the three diagnostic dimensions.
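The per-dimension scores and their average can be sketched as follows. This assumes exact-match scoring of one predicted label against one gold label per dimension, computed over unsafe trajectories only; the dictionary keys are hypothetical names chosen for this example, and the page does not specify the scoring script.

```python
from typing import Dict, List

# Hypothetical keys for the three diagnostic dimensions.
DIMENSIONS = ("risk_source", "failure_mode", "real_world_harm")


def diagnostic_accuracy(gold: List[Dict[str, str]], pred: List[Dict[str, str]]) -> Dict[str, float]:
    """Exact-match accuracy (%) per dimension, plus the unweighted mean as 'avg'."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    scores: Dict[str, float] = {}
    for dim in DIMENSIONS:
        correct = sum(1 for g, p in zip(gold, pred) if g[dim] == p.get(dim))
        scores[dim] = 100.0 * correct / len(gold)
    scores["avg"] = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return scores


# Toy labels, not benchmark data.
gold = [
    {"risk_source": "User input", "failure_mode": "Tool misuse", "real_world_harm": "Privacy"},
    {"risk_source": "External tools/APIs", "failure_mode": "Flawed planning", "real_world_harm": "Financial"},
]
pred = [
    {"risk_source": "User input", "failure_mode": "Tool misuse", "real_world_harm": "Security"},
    {"risk_source": "External tools/APIs", "failure_mode": "Unsafe execution", "real_world_harm": "Security"},
]
print(diagnostic_accuracy(gold, pred))

# The Avg. column is the mean of the three reported scores,
# e.g. AgentDoG 1.5-4B: (75.2 + 27.5 + 62.9) / 3, reported as 55.2.
print(round((75.2 + 27.5 + 62.9) / 3, 1))
```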

Quality Control

ATBench combines automatic validation and human audit to make fine-grained labels usable for diagnosis.

Human verification of three diagnostic labels
Annotator | Risk Source | Failure Mode | Real-world Harm
Annotator 1 | 82.0% (41/50) | 70.0% (35/50) | 88.0% (44/50)
Annotator 2 | 86.0% (43/50) | 60.0% (30/50) | 80.0% (40/50)
Annotator 3 | 86.0% (43/50) | 72.0% (36/50) | 86.0% (43/50)
Avg. | 84.7% (127/150) | 67.3% (101/150) | 84.7% (127/150)
0.5% safe/unsafe labels corrected during audit
11.1% fine-grained labels corrected during audit
4 heterogeneous verifier families before adjudication
Full audit: human pass over released benchmark labels

Risk Source and Real-world Harm show higher agreement, while Failure Mode is intentionally finer-grained and more ambiguous.
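The Avg. row of the verification table is a pooled rate over the three annotators. The snippet below recomputes it from the reported counts, reading each n/50 as the number of sampled labels an annotator confirmed; that reading and the variable names are assumptions of this example.

```python
from typing import Dict

SAMPLES_PER_ANNOTATOR = 50

# Counts copied from the human verification table.
CONFIRMED = {
    "Annotator 1": {"Risk Source": 41, "Failure Mode": 35, "Real-world Harm": 44},
    "Annotator 2": {"Risk Source": 43, "Failure Mode": 30, "Real-world Harm": 40},
    "Annotator 3": {"Risk Source": 43, "Failure Mode": 36, "Real-world Harm": 43},
}


def pooled_rates(confirmed: Dict[str, Dict[str, int]], n: int) -> Dict[str, float]:
    """Pooled confirmation rate (%) per dimension across all annotators."""
    total = n * len(confirmed)
    dims = next(iter(confirmed.values())).keys()
    return {
        dim: 100.0 * sum(row[dim] for row in confirmed.values()) / total
        for dim in dims
    }


for dim, rate in pooled_rates(CONFIRMED, SAMPLES_PER_ANNOTATOR).items():
    print(f"{dim}: {rate:.1f}%")
```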

Generation Pipeline and Cases

The data engine uses taxonomy-guided planning, trajectory synthesis, validation, and audit before release.

[Figure: ATBench data generation pipeline.]

Taxonomy-guided data construction starts from sampled risk configurations and tool pools, then validates complete trajectories before release.

[Figure: Representative unsafe ATBench case studies.]

Representative cases show that models may detect unsafe behavior while still missing the correct fine-grained cause.

[Figure: ATBench500 benchmark taxonomy distribution.]

ATBench500 remains available for backward compatibility and historical comparison with the original AgentDoG release.

Quick Start

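# Requires the Hugging Face `datasets` package.
from datasets import load_dataset

# Latest 1,000-trajectory release.
atbench = load_dataset("AI45Research/ATBench", "ATBench", split="test")
# Legacy 500-trajectory release, kept for historical comparison.
atbench500 = load_dataset("AI45Research/ATBench", "ATBench500", split="test")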
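After loading, a quick look at the columns and the label balance is a useful first step. The helper below is a sketch that works on any iterable of dictionary rows, which is how a loaded `datasets` split yields its examples. The field name `label` and the toy rows are assumptions for illustration; check `atbench.column_names` for the real schema before relying on any field name.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def summarize(rows: Iterable[Dict[str, Any]], label_field: str = "label") -> Tuple[List[str], Counter]:
    """Return the sorted column names and a count of values in `label_field`."""
    rows = list(rows)
    columns = sorted({key for row in rows for key in row})
    counts = Counter(row.get(label_field) for row in rows)
    return columns, counts


# Toy rows standing in for a loaded split; the real field names may differ.
toy_rows = [
    {"label": "safe", "trajectory": []},
    {"label": "unsafe", "trajectory": []},
    {"label": "unsafe", "trajectory": []},
]
columns, counts = summarize(toy_rows)
print(columns)
print(counts)

# With the real data this would be called as, for example:
#   columns, counts = summarize(atbench, label_field="<actual label column>")
```

On the latest release, a correct label column should show 503 safe and 497 unsafe cases, matching the release table.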

Citation

@article{li2026atbench,
  title={ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis},
  author={Yu Li and Haoyu Luo and Yuejin Xie and Yuqian Fu and Zhonghao Yang and Shuai Shao and Qihan Ren and Wanying Qu and Yanwei Fu and Yujiu Yang and Jing Shao and Xia Hu and Dongrui Liu},
  journal={arXiv preprint arXiv:2604.02022},
  year={2026},
  doi={10.48550/arXiv.2604.02022},
  url={https://arxiv.org/abs/2604.02022}
}

@article{liu2026agentdog,
  title={AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security},
  author={Yu Li and Haoyu Luo and Yuejin Xie and Jiapeng Gu and Yuhan Wang and Yanwei Fu and Yujiu Yang and Jing Shao and Xia Hu and Dongrui Liu},
  journal={arXiv preprint arXiv:2601.18491},
  year={2026},
  url={https://arxiv.org/abs/2601.18491}
}