OpenEnv Benchmark

IT Helpdesk Operations Environment

A multi-step benchmark for operational AI agents. Each episode contains helpdesk or security tickets that require evidence gathering, policy checks, and a safe final action such as unlock, revoke, deny, or escalate.


Easy

Routine IT helpdesk cases focused on account recovery, VPN restoration, approved SaaS access, and license assignment.

Cases: 5

Baseline: 0.94

Medium

Moderately complex support cases involving policy checks, travel-risk interpretation, department transfers, and escalation boundaries.

Cases: 5

Baseline: 0.90

Hard

High-stakes operational cases covering offboarding failures, probable compromise, unmanaged devices, production data access, and audit-driven remediation.

Cases: 7

Baseline: 0.89

Security

Security-heavy cases involving compromise signals, leaked credentials, offboarding drift, phishing, and unsafe data-handling requests.

Cases: 6

Baseline: 0.90

Agent Workflow

Investigate: use identity, device, policy, knowledge-base, or login-risk lookups.
Interpret: combine the revealed facts with the user ticket and escalation boundaries.
Act safely: resolve, deny, revoke, or escalate with a clear customer-facing response.
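
The three steps above can be sketched as a small decision loop. This is only an illustration: the `CaseState` class, the lookup identifiers, and the decision rules are all hypothetical and are not taken from the benchmark's actual schema or grading policy.

```python
from dataclasses import dataclass, field

# Lookup kinds named in the workflow; the identifier spellings are assumed.
LOOKUPS = ("identity", "device", "policy", "knowledge_base", "login_risk")
# Final actions named in the workflow.
FINAL_ACTIONS = ("resolve", "deny", "revoke", "escalate")


@dataclass
class CaseState:
    """Hypothetical container for one ticket and the facts revealed so far."""
    ticket: str
    evidence: dict = field(default_factory=dict)


def record(state: CaseState, lookup: str, result: str) -> None:
    """Investigate: store the result of one lookup."""
    if lookup not in LOOKUPS:
        raise ValueError(f"unknown lookup: {lookup}")
    state.evidence[lookup] = result


def next_move(state: CaseState) -> tuple:
    """Interpret the evidence; return ("lookup", kind) or ("act", action)."""
    # Hypothetical rule: gather identity and login-risk facts before acting.
    for required in ("identity", "login_risk"):
        if required not in state.evidence:
            return ("lookup", required)
    # Hypothetical rule: a high login risk is handed off rather than resolved.
    if state.evidence["login_risk"] == "high":
        return ("act", "escalate")
    # Hypothetical rule: never unlock for an unverified requester.
    if state.evidence["identity"] != "verified":
        return ("act", "deny")
    return ("act", "resolve")
```

An agent built this way keeps requesting lookups until `next_move` returns an `"act"` tuple, which keeps the final action tied to evidence that was actually gathered.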

Why This Benchmark Is Useful

Multi-step operational reasoning instead of one-shot classification.
Deterministic grading with evidence quality, resolution quality, and safety quality.
Coverage across identity, endpoint, SaaS access, incident response, and data-handling cases.
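
To make the grading point concrete, here is a minimal sketch of how three component scores could be combined deterministically. The `Grade` class, the 0-to-1 score range, and the unweighted mean are assumptions made for illustration; the benchmark's real weighting is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grade:
    """Hypothetical per-case grade with three components in [0, 1]."""
    evidence: float
    resolution: float
    safety: float

    def total(self) -> float:
        """Combine the components; an unweighted mean is assumed here."""
        parts = {
            "evidence": self.evidence,
            "resolution": self.resolution,
            "safety": self.safety,
        }
        for name, value in parts.items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} score out of range: {value}")
        # No randomness or model call is involved, so the same inputs
        # always produce the same total.
        return sum(parts.values()) / len(parts)
```

Because the function is pure, regrading the same trajectory yields the same number, which is the property "deterministic grading" refers to.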

Start with POST /reset?task_name=easy, inspect the active ticket, then send one operation per request to POST /step until the case is resolved or escalated.
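
The reset-then-step loop can be sketched with the Python standard library. Only the two paths and the `task_name` query parameter come from the text above. The base URL, the JSON body of a step, the `done` field in the response, and the `choose_operation` callback are assumptions; check the API docs for the real schema.

```python
import json
from urllib import parse, request

BASE_URL = "http://localhost:8000"  # assumed host and port


def build_reset_request(task_name: str, base_url: str = BASE_URL) -> request.Request:
    """Build POST /reset?task_name=<task_name>."""
    query = parse.urlencode({"task_name": task_name})
    return request.Request(f"{base_url}/reset?{query}", method="POST")


def build_step_request(operation: dict, base_url: str = BASE_URL) -> request.Request:
    """Build POST /step carrying one operation as JSON (body shape assumed)."""
    body = json.dumps(operation).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return request.Request(f"{base_url}/step", data=body, headers=headers, method="POST")


def send(req: request.Request) -> dict:
    """Send a request and decode the JSON response."""
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(task_name: str, choose_operation, max_steps: int = 20) -> dict:
    """Reset, then send one operation per step until the case is closed.

    `choose_operation` is a hypothetical callback that maps the latest
    observation to the next operation. The `done` key is an assumed name
    for the flag that marks a resolved or escalated case.
    """
    observation = send(build_reset_request(task_name))
    for _ in range(max_steps):
        if observation.get("done"):
            break
        observation = send(build_step_request(choose_operation(observation)))
    return observation
```

The request builders are kept separate from `send` so the URLs and payloads can be inspected without a running server. The `max_steps` cap is a safety measure so a misbehaving agent cannot loop indefinitely.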