OpenEnv Benchmark

IT Helpdesk Operations Environment

A multi-step benchmark for operational AI agents. Each episode contains helpdesk or security tickets that require evidence gathering, policy checks, and a safe final action such as unlock, revoke, deny, or escalate.
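
The flow described above (gather evidence, check policy, then commit to a final action) can be illustrated with a minimal Python sketch. This is not the benchmark's actual API: `Episode`, `gather`, `check_policy`, and `finalize` are hypothetical names, and the rule that an unprepared agent falls back to escalation is an assumption chosen for illustration. Only the four action names come from the description.

```python
from dataclasses import dataclass, field
from enum import Enum


class FinalAction(Enum):
    # The four final actions named in the benchmark description.
    UNLOCK = "unlock"
    REVOKE = "revoke"
    DENY = "deny"
    ESCALATE = "escalate"


@dataclass
class Episode:
    # Hypothetical container; the real environment's state is not shown here.
    ticket_id: str
    evidence: list[str] = field(default_factory=list)
    policy_checked: bool = False

    def gather(self, item: str) -> None:
        # Record one piece of evidence collected for the ticket.
        self.evidence.append(item)

    def check_policy(self) -> None:
        # Mark that the relevant policy has been consulted.
        self.policy_checked = True

    def finalize(self, proposed: FinalAction) -> FinalAction:
        # Assumed safety rule: without evidence and a policy check,
        # fall back to escalation instead of acting directly.
        if not self.evidence or not self.policy_checked:
            return FinalAction.ESCALATE
        return proposed


episode = Episode(ticket_id="T-1")
premature = episode.finalize(FinalAction.UNLOCK)  # no evidence yet

episode.gather("identity verified via callback")
episode.check_policy()
prepared = episode.finalize(FinalAction.UNLOCK)
```

In this sketch, `premature` resolves to `FinalAction.ESCALATE` and `prepared` to `FinalAction.UNLOCK`, which mirrors the idea that a safe final action depends on the steps taken before it.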

Easy

Routine IT helpdesk cases focused on account recovery, VPN restoration, approved SaaS access, and license assignment.

Cases: 5
Baseline: 0.94

Medium

Moderately complex support cases involving policy checks, travel-risk interpretation, department transfers, and escalation boundaries.

Cases: 5
Baseline: 0.90

Hard

High-stakes operational cases covering offboarding failures, probable compromise, unmanaged devices, production data access, and audit-driven remediation.

Cases: 7
Baseline: 0.89

Security

Security-heavy cases involving compromise signals, leaked credentials, offboarding drift, phishing, and unsafe data-handling requests.

Cases: 6
Baseline: 0.90
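
For a single summary figure across the four sets, the per-set baselines can be combined into a case-weighted mean. The sketch below rests on two assumptions that this page does not state: that each baseline is a mean per-case score, and that the four sets do not share cases. If the Security set overlaps with the others, the total would double-count.

```python
# (cases, baseline) per set, copied from the figures listed above.
tiers = {
    "Easy": (5, 0.94),
    "Medium": (5, 0.90),
    "Hard": (7, 0.89),
    "Security": (6, 0.90),
}

# Total number of cases, assuming the sets are disjoint.
total_cases = sum(cases for cases, _ in tiers.values())

# Case-weighted mean: each set contributes in proportion to its size.
weighted_baseline = sum(cases * score for cases, score in tiers.values()) / total_cases

# Unweighted mean: each set counts equally regardless of size.
unweighted_baseline = sum(score for _, score in tiers.values()) / len(tiers)

print(f"cases={total_cases} weighted={weighted_baseline:.3f} unweighted={unweighted_baseline:.4f}")
```

Under those assumptions this gives 23 cases, a weighted baseline of about 0.906, and an unweighted baseline of 0.9075. The weighted figure sits slightly lower because the Hard set is the largest and has the lowest baseline.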

Agent Workflow

Why This Benchmark Is Useful