Agent trajectory safety benchmark family

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

ATBench evaluates whether long-horizon, tool-using AI agents behave safely across complete execution traces, and whether unsafe traces can be diagnosed along risk source, failure mode, and real-world harm.

Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, Dongrui Liu

arXiv 2026

Paper Dataset GitHub ATBench500 Paper

1,000audited trajectories

503 / 497safe and unsafe cases

2,084available tools

1,954unique invoked tools

9.01average turns per trajectory

ATBench overview teaser showing benchmark motivation, construction, and evaluation.

ATBench moves agent-safety evaluation from isolated prompts or final responses to complete trajectories, preserving user requests, agent actions, tool calls, and environment feedback.

Abstract

Autonomous agents increasingly operate through long-horizon, tool-augmented trajectories. Existing safety evaluation often focuses on single-step moderation or final-output filtering, which can miss unsafe behavior emerging during intermediate planning, tool invocation, or environment feedback. ATBench addresses this gap with trajectory-level safety evaluation and diagnosis. Each sample is a complete execution trace, labeled as safe or unsafe; unsafe traces are further annotated with one primary Risk Source, Failure Mode, and Real-world Harm. The benchmark is constructed through taxonomy-guided data generation, rule-based and LLM-based filtering, and full human audit, producing a diverse and realistic testbed for evaluating modern tool-using agents.

Key Contributions

ATBench is designed as both a benchmark release and a reusable diagnostic protocol for agent safety.

Trajectory-Level Unit

Uses the full execution trace as the evaluation object, including requests, agent responses, tool calls, and environment feedback.

Three-Dimensional Diagnosis

Separates where risk enters, how agent behavior fails, and what harm follows, enabling actionable safety analysis beyond binary labels.

Audited Benchmark Lineage

Preserves ATBench500 for historical comparison while making the new 1,000-trajectory ATBench release the default reference point.

Safety Taxonomy

The taxonomy decomposes unsafe trajectories into three complementary diagnostic dimensions.

Three-dimensional trajectory-safety taxonomy
Dimension	Diagnostic question	Category count	Examples
Risk Source	Where does the risk enter the trajectory?	8	User input, environmental observation, external tools/APIs, internal agent failures
Failure Mode	How does the agent realize or amplify the risk?	14	Over-privileged action, flawed planning, tool misuse, unsafe execution, harmful output
Real-world Harm	What downstream harm could the unsafe trajectory cause?	10	Privacy, financial, security, physical, psychological, reputational, societal harm

ATBench three-dimensional safety taxonomy.

The same high-level taxonomy supports binary safety evaluation and fine-grained diagnosis for unsafe trajectories.

Benchmark Release

ATBench keeps the newest release and the original AgentDoG benchmark in one explicit release lineage.

Release zoo
Release	Status	Cases	Safe	Unsafe	Available Tools	Used Tools	Avg. Turns	Avg. Tokens
ATBench	Latest	1,000	503	497	2,084	1,954	9.01	3.95k
ATBench500	Legacy	500	250	250	1,575	1,357	8.97	1.52k

Available Tools counts tools exposed through per-trajectory tool pools. Used Tools counts unique tools invoked in released trajectories.

Main Results

ATBench exposes a gap between general safety capability and trajectory-level agent safety judgment.

Model performance comparison on ATBench and other agent safety benchmarks.

Representative model performance shows that ATBench is difficult for both general-purpose models and specialized guard models.

Trajectory-level safety results on R-Judge and ATBench (%)
Group	Model	R-Judge Acc	R-Judge Prec.	R-Judge Rec.	R-Judge F1	ATBench Acc	ATBench Prec.	ATBench Rec.	ATBench F1
Closed-source models
Closed-source	GPT-5.4	93.3	93.1	94.3	93.7	73.7	68.5	87.1	76.7
Closed-source	GPT-5.2	90.8	86.8	97.5	91.8	69.0	65.6	79.3	71.8
Closed-source	Gemini-3-Flash	95.2	98.7	92.1	95.3	76.4	79.3	71.0	74.9
Closed-source	Gemini-3.1-Pro	97.3	99.1	95.7	97.4	75.5	76.1	73.8	75.0
Open-source models
Open-source	Qwen3.5-397B-A17B	85.6	81.3	94.5	87.4	66.8	65.5	70.2	67.8
Open-source	Qwen3.5-4B	81.0	82.1	81.9	82.0	45.9	41.2	20.7	27.6
Open-source	Qwen3.5-2B	54.1	67.6	25.2	36.7	59.1	74.3	19.2	30.5
Open-source	Qwen3.5-0.8B	33.7	27.6	15.8	20.1	48.6	66.7	5.9	10.8
Open-source	QwQ-32B	89.5	94.9	84.7	89.5	57.7	81.9	19.1	31.0
Open-source	Qwen3-235B-A22B-Instruct-2507	85.1	80.7	94.4	87.0	59.2	58.2	63.8	60.8
Open-source	Qwen3-4B-Instruct-2507	68.4	73.8	62.4	67.6	55.7	77.6	15.3	25.5
Open-source	Qwen2.5-7B-Instruct	68.4	77.4	56.8	65.5	53.4	73.8	9.7	17.1
Open-source	Llama-3.1-8B-Instruct	53.7	53.3	99.8	69.5	45.3	47.3	89.5	61.9
Guard models
Guard	LlamaGuard3-8B	61.2	69.1	48.1	56.7	53.1	85.7	3.8	7.3
Guard	LlamaGuard4-12B	63.8	68.3	58.8	63.2	58.1	63.8	30.9	41.7
Guard	Qwen3-Guard	40.6	23.6	5.6	9.0	51.5	40.0	0.4	0.8
Guard	ShieldAgent	81.0	74.0	98.8	84.6	62.5	58.0	81.4	67.7
Guard	JoySafety	52.5	57.2	40.2	47.2	56.9	61.7	35.0	44.7
Guard	NemoGuard	54.4	60.1	40.6	48.5	49.9	49.5	41.6	45.2
AgentDoG models
Ours	AgentDoG 1.0-4B	91.8	87.5	98.5	92.7	64.0	59.2	88.9	71.1
Ours	AgentDoG 1.5-0.8B	75.7	83.3	67.5	74.6	60.3	58.6	68.6	63.2
Ours	AgentDoG 1.5-2B	71.5	78.0	64.1	70.4	69.0	70.1	65.7	67.8
Ours	AgentDoG 1.5-8B	75.5	68.6	98.8	81.0	70.9	67.1	81.2	73.5
Ours	AgentDoG 1.5-4B	92.2	91.7	93.7	92.7	72.4	69.2	80.3	74.3
Ours	AgentDoG 1.5-4B-U	90.4	93.9	87.6	90.6	78.4	79.8	75.7	77.7

The large table is placed after the performance figure to make the exact numerical comparison inspectable. ATBench remains challenging: several strong models have high recall or precision but much weaker balanced F1.

Fine-Grained Diagnosis

Unsafe trajectories are also evaluated on three diagnostic labels: Risk Source, Failure Mode, and Real-world Harm.

Three-class diagnostic accuracy on ATBench (%)
Group	Model	Risk Source	Failure Mode	Real-world Harm	Avg.
Closed-source models
Closed-source	GPT-5.4	33.6	13.5	30.2	25.8
Closed-source	GPT-5.2	29.5	12.0	26.8	22.8
Closed-source	Gemini-3-Flash	18.4	8.3	15.0	13.9
Closed-source	Gemini-3.1-Pro	24.8	12.6	18.5	18.6
Open-source models
Open-source	Qwen3.5-397B	7.7	3.6	6.8	6.0
Open-source	Qwen3.5-0.8B	1.3	2.9	4.7	3.0
Open-source	Qwen3.5-2B	7.7	6.6	11.1	8.5
Open-source	Qwen3.5-4B	6.6	3.0	8.2	5.9
Open-source	QwQ-32B	15.8	9.4	22.9	16.0
Open-source	Qwen3-235B	7.0	11.6	26.6	15.1
Open-source	Qwen3-4B-Instruct	1.0	9.6	21.2	10.6
Open-source	Qwen2.5-7B-Instruct	5.3	6.0	15.5	8.9
Open-source	Llama3.1-8B-Instruct	6.2	5.8	15.5	9.2
AgentDoG models
Ours	AgentDoG1.0-4B	46.8	16.5	40.6	34.6
Ours	AgentDoG 1.5-0.8B	65.7	18.4	44.9	43.0
Ours	AgentDoG 1.5-2B	68.0	24.0	53.8	48.6
Ours	AgentDoG 1.5-8B	72.9	24.6	52.5	50.0
Ours	AgentDoG 1.5-4B	75.2	27.5	62.9	55.2
Ours	AgentDoG 1.5-4B-U	24.1	9.5	28.4	20.7

Guard models are excluded from this table because they only output binary labels. The average is the mean over the three diagnostic dimensions.

Quality Control

ATBench combines automatic validation and human audit to make fine-grained labels usable for diagnosis.

Human verification of three diagnostic labels
Annotator	Risk Source	Failure Mode	Real-world Harm
Annotator 1	82.0% (41/50)	70.0% (35/50)	88.0% (44/50)
Annotator 2	86.0% (43/50)	60.0% (30/50)	80.0% (40/50)
Annotator 3	86.0% (43/50)	72.0% (36/50)	86.0% (43/50)
Avg.	84.7% (127/150)	67.3% (101/150)	84.7% (127/150)

0.5%safe/unsafe labels corrected during audit

11.1%fine-grained labels corrected during audit

4heterogeneous verifier families before adjudication

Full audithuman pass over released benchmark labels

Risk Source and Real-world Harm show higher agreement, while Failure Mode is intentionally finer-grained and more ambiguous.

Generation Pipeline and Cases

The data engine uses taxonomy-guided planning, trajectory synthesis, validation, and audit before release.

Taxonomy-guided data construction starts from sampled risk configurations and tool pools, then validates complete trajectories before release.

Representative unsafe ATBench case studies.

Representative cases show that models may detect unsafe behavior while still missing the correct fine-grained cause.

ATBench500 benchmark taxonomy distribution.

ATBench500 remains available for backward compatibility and historical comparison with the original AgentDoG release.

Quick Start

from datasets import load_dataset

atbench = load_dataset("AI45Research/ATBench", "ATBench", split="test")
atbench500 = load_dataset("AI45Research/ATBench", "ATBench500", split="test")

Citation

@article{li2026atbench,
  title={ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis},
  author={Yu Li and Haoyu Luo and Yuejin Xie and Yuqian Fu and Zhonghao Yang and Shuai Shao and Qihan Ren and Wanying Qu and Yanwei Fu and Yujiu Yang and Jing Shao and Xia Hu and Dongrui Liu},
  journal={arXiv preprint arXiv:2604.02022},
  year={2026},
  doi={10.48550/arXiv.2604.02022},
  url={https://arxiv.org/abs/2604.02022}
}

@article{liu2026agentdog,
  title={AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security},
  author={Yu Li and Haoyu Luo and Yuejin Xie and Jiapeng Gu and Yuhan Wang and Yanwei Fu and Yujiu Yang and Jing Shao and Xia Hu and Dongrui Liu},
  journal={arXiv preprint arXiv:2601.18491},
  year={2026},
  url={https://arxiv.org/abs/2601.18491}
}