01 / the consensus
v2026.04

Behavioral infrastructure
for the supervised-agent era.
The field is building it. We’re naming it.

Seven AI labs, a dozen research groups, the EU, OWASP, and Colorado are each building a piece of the same layer. Nobody has named the whole thing. Calx is calling it behavioral infrastructure for the supervised-agent era and building the piece every cited source is missing: your corrections, compiled into runtime enforcement. Every citation below is independent. Every working model is public.

why now

The harness layer is consolidating.
The enforcement primitive isn’t.

OpenAI

Codex App Server + Codex 0.114.0 Guardian. Lifecycle hooks, skill governance. March 2026.

Anthropic

Claude Managed Agents public beta, April 8, 2026. Hosted runtime, session-metered at $0.08/hour.

Microsoft

Agent Governance Toolkit, MIT-licensed, April 2, 2026. All 10 OWASP agentic risks, inside the runtime.

Meta

Acquired Manus for $2B+ (Dec 2025). Model-agnostic runtime ownership. ~$125M ARR in 8 months.

AWS

AgentCore policy and evaluation governance. Defines which apps, APIs, and MCP servers agents can access. Q1 2026.

LangChain

Middleware chain: HumanInTheLoopMiddleware and 6 composable hooks. HITL as first-class primitive.

02 / cited work and working models

Cited work and working models

Independent research, academic papers, open-source releases, and standards bodies shaping the category Calx is naming. We cite each one below.

Tier 01 · Industry
OpenAI
Microsoft
Meta
HumanLayer
LangChain
Anthropic
Manus
Cursor
AWS AgentCore
Mozilla Thunderbolt
Tier 02 · Academic + Research
Stanford University
SambaNova Systems
UCL
Huawei Noah's Ark Lab
Tsinghua University
TU Eindhoven
Singapore Management University
Phil Schmid · Google DeepMind
Varun Pratap Bhardwaj · Independent researcher
Tier 03 · Standards + Regulatory
OWASP
European Union
State of Colorado
03 / the synthesis

What each one built. What Calx adds.

Eighteen entities, each building a piece of the same layer. No claim of endorsement. Every row is a public artifact: a shipped system, a published paper, or a regulatory instrument.

Tier 01 · Industry
What they built · What Calx adds
OpenAI

Harness engineering is a named architectural pattern. The Codex App Server decouples agent core logic from client surfaces (CLI, VS Code, web, desktop) through a bidirectional protocol.

Calx builds the behavioral governance layer inside the same harness pattern, cross-runtime and model-agnostic by construction. OpenAI is building their harness for Codex. Calx is building the one everyone else needs.

Microsoft

Enterprise agent governance shipped this month. Agent Governance Toolkit (MIT-licensed, April 2, 2026) enforces all 10 OWASP agentic risks deterministically, at sub-millisecond latency, inside the agent runtime. Microsoft has named the category by shipping it.

Microsoft's toolkit enforces policies admins pre-write. Calx compiles the policies nobody knew to write: the corrections your team makes every day, captured automatically and promoted to enforcement. Same runtime posture, inverse origination.

AWS

AgentCore policy and evaluation governance (Q1 2026). Defines which apps, APIs, and MCP servers agents can access. First hyperscaler to ship an agent governance surface as cloud primitive, not library.

AWS gates capability at the account and IAM boundary. Calx runs inside that boundary and compiles the behavioral rules the IAM layer cannot express: patterns of recurrence captured from how humans correct agents at runtime. Policy plus behavior.

LangChain

Middleware chain is the composition pattern for governance primitives. LangChain 1.0 shipped SummarizationMiddleware, PIIMiddleware, HumanInTheLoopMiddleware, and four more hooks as composable, single-responsibility interceptors.

Calx treats correction capture, recurrence detection, and rule enforcement as middleware in this pattern. The composition is portable across frameworks and testable in isolation.
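
As a rough illustration of the composition pattern described here, the sketch below chains two single-responsibility interceptors around a tool call. Every name in it (RuleStore, enforce, audit, compose, the promotion threshold of 2) is hypothetical. It is plain Python, not the LangChain middleware API and not Calx's implementation.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical types for illustration only.
Action = dict                                   # e.g. {"tool": "shell"}
Handler = Callable[[Action], str]               # runs an action, returns a result
Middleware = Callable[[Action, Handler], str]   # wraps the next handler


@dataclass
class RuleStore:
    threshold: int = 2                          # assumed: promote after 2 corrections
    counts: Counter = field(default_factory=Counter)
    blocked: set = field(default_factory=set)

    def record_correction(self, tool: str) -> None:
        """Capture a correction; promote it to a rule once it recurs."""
        self.counts[tool] += 1
        if self.counts[tool] >= self.threshold:
            self.blocked.add(tool)


def enforce(store: RuleStore) -> Middleware:
    """Stop any action that a promoted rule forbids."""
    def middleware(action: Action, next_handler: Handler) -> str:
        if action["tool"] in store.blocked:
            return f"blocked: {action['tool']}"  # the inner handler never runs
        return next_handler(action)
    return middleware


def audit(log: list) -> Middleware:
    """Record every attempted action, allowed or not."""
    def middleware(action: Action, next_handler: Handler) -> str:
        log.append(action["tool"])
        return next_handler(action)
    return middleware


def compose(middlewares: list, handler: Handler) -> Handler:
    """Wrap handler so the first middleware in the list runs outermost."""
    for mw in reversed(middlewares):
        handler = (lambda m, nxt: lambda a: m(a, nxt))(mw, handler)
    return handler


store, log = RuleStore(), []
run = compose([audit(log), enforce(store)], lambda a: f"ran: {a['tool']}")

first = run({"tool": "shell"})       # no rule yet, so the action runs
store.record_correction("shell")
store.record_correction("shell")     # second correction promotes the rule
second = run({"tool": "shell"})      # now stopped before the handler
```

Because each interceptor only sees an action and the next handler, each one can be tested in isolation with a stub handler.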

Anthropic

Claude Managed Agents public beta, April 8, 2026. Anthropic shipped a hosted agent runtime bundling agent loop, tool execution, sandbox, and state persistence. Session-metered at $0.08/hour. Anthropic is productizing the harness layer for Claude.

Calx is the cross-runtime version of what Anthropic built for Claude. Behavioral enforcement that survives across providers, not tied to a specific vendor’s hosted runtime. The rule you compile once enforces wherever your agents run.

Meta

Dual-channel feedback (pre-action + post-action) is mathematically necessary for drift-resistant memory. PAHF (Personalized Agents from Human Feedback, 2026) proves pre-action-only memory collapses under drift.

Calx implements both channels as first-class primitives. Rules compile from corrections; corrections update rules. Every Calx operator runs under the PAHF theoretical bound.

Manus

Harness-centric agent platforms are the layer the market pays for. Manus built an agent platform where the harness, not the model, owned session state, tool dispatch, and memory. Meta acquired them December 29, 2025.

Calx is building the behavioral governance piece of the harness layer, before the market names it and before the big companies build their own.

Cursor

The harness must preserve reasoning continuity. Cursor's internal Bench experiments showed a 30% performance drop when reasoning traces were removed from GPT-5-Codex.

Calx preserves reasoning trajectories through the compilation pipeline and treats them as first-class training signal, not ephemeral debugging output.

HumanLayer

Subagents are about context control, not role-playing. HumanLayer’s practitioner research ("Attempts at ‘frontend engineer,’ ‘backend engineer,’ ‘data analyst’ sub-agents don’t work.") established the pattern.

Calx dispatches subagents via LangGraph Send API with per-subagent identity in calx-serve. Context control by default. Non-inheritance enforced by the harness, not hoped for.

Tier 02 · Academic + Research
What they built · What Calx adds
Stanford + SambaNova · ACE: Agentic Context Engineering

ACE (Agentic Context Engineering, ICLR 2026, arXiv:2510.04618): evolving contexts via incremental delta updates, not monolithic rewrites. Solves context collapse.

Calx's correction lifecycle is an ACE-style delta system by design. ACE formalized the pattern. Calx is the production substrate that runs it.
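
To make "incremental delta updates, not monolithic rewrites" concrete, here is a minimal sketch of a rule context that changes one entry at a time. The Delta type, its three operations, and the sample rules are assumptions made for illustration. They are not taken from the ACE paper or from Calx.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    op: str            # hypothetical ops: "add", "revise", "retire"
    rule_id: str
    text: str = ""


def apply_delta(context: dict, delta: Delta) -> dict:
    """Return a new context with one entry changed; all other entries are kept."""
    updated = dict(context)
    if delta.op in ("add", "revise"):
        updated[delta.rule_id] = delta.text
    elif delta.op == "retire":
        updated.pop(delta.rule_id, None)
    else:
        raise ValueError(f"unknown op: {delta.op}")
    return updated


original = {"r1": "Never force-push to main."}
context = original
for d in (
    Delta("add", "r2", "Run tests before committing."),
    Delta("revise", "r1", "Never force-push to main or release branches."),
):
    context = apply_delta(context, d)
```

Each update touches a single key, so an unrelated rule cannot be lost or reworded as a side effect, which is the failure a full rewrite of the context risks.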

UCL + Huawei Noah's Ark Lab · Memento-Skills

Memento-Skills (arXiv:2603.18743): deployment-time learning in external memory converges to an optimal retrieval policy. Theorem 1.3 proves it.

Every Calx operator is a Memento-Skills-compatible system by construction. Identity, rules, and lessons persist. Skills evolve as corrections accumulate. Calx inherits the convergence properties the paper proves.

Tsinghua University · Via Negativa for AI Alignment

Via Negativa for AI Alignment (arXiv:2603.16417, Quan Cheng, 2026): negative constraints converge; positive preferences don’t. Structural, not empirical.

Calx is a via negativa system. Corrections are negative signals. Compiled rules are negative constraints. The feasible behavioral space contracts monotonically as rules accumulate.
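
The monotonic contraction claim can be checked on a toy example: if each rule only forbids actions, adding a rule can shrink the allowed set or leave it unchanged, never grow it. The action names and constraints below are invented for illustration.

```python
from typing import Callable, Iterable

Action = str
Constraint = Callable[[Action], bool]   # returns True when the action is forbidden


def feasible(actions: Iterable[Action], constraints: list) -> set:
    """Actions that no constraint forbids."""
    return {a for a in actions if not any(c(a) for c in constraints)}


universe = {"read_file", "write_file", "delete_file", "send_email"}
constraints: list = []
sizes = [len(feasible(universe, constraints))]

new_rules = (
    lambda a: a.startswith("delete"),   # forbid deletions
    lambda a: a == "send_email",        # forbid outbound email
    lambda a: a.startswith("delete"),   # a redundant rule changes nothing
)
for rule in new_rules:
    constraints.append(rule)
    sizes.append(len(feasible(universe, constraints)))

# The feasible set never grows as constraints accumulate.
never_grows = all(a >= b for a, b in zip(sizes, sizes[1:]))
```

The property holds by construction: the feasible set is an intersection of per-rule allowed sets, and intersecting with one more set cannot add elements.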

TU Eindhoven · Runtime Governance

Runtime Governance for AI Agents: Policies on Paths (arXiv:2603.16586, Kaptein et al., 2026): enforcement must be architecturally interposed, not advisory.

Calx enforcement runs at the harness level, before the agent executes. The agent cannot skip it. This is the "architecturally interposed" requirement the paper specifies.

Singapore Management University · AgentSpec

AgentSpec (arXiv:2503.18666, Wang/Poskitt/Sun, ICSE 2026): LLM-generated runtime enforcement rules achieve 95.56% precision. Machine-generated rules work.

Calx's compilation engine produces enforcement artifacts automatically from corrections. AgentSpec validates the principle; Calx applies it to behavioral governance at user scale.

Phil Schmid · Google DeepMind

The harness is the operating system of AI agents: Model = CPU, context window = RAM, harness = OS, agent = application. The competitive advantage has shifted from model choice to harness quality.

Calx builds the behavioral governance layer of the harness OS. We are not competing with OpenAI for the CPU. We are building the layer everyone using those CPUs needs.

Varun Pratap Bhardwaj · Independent

Agent Behavioral Contracts (arXiv:2602.22302, 2026): formal specification of preconditions, invariants, governance, and recovery for AI agents. AgentContract-Bench showed 88–100% hard constraint compliance across 7 models.

Calx provides the runtime substrate for contracts of this shape. Compiled rules are contract clauses. Violation tracking is automatic. The harness makes contracts enforceable in production, not just in benchmarks.

Tier 03 · Standards + Regulatory
What they built · What Calx adds
OWASP GenAI Security Project

Top 10 for Agentic Applications (December 9, 2025), the first formal threat taxonomy for autonomous AI agents.

Calx enforcement maps directly to the OWASP threat model. Compiled rules address specific threats in the top 10 at the harness level, before the agent can violate them.

European Union · EU AI Act

European Parliament voted March 26, 2026 to delay high-risk AI obligations to December 2, 2027. Watermarking rules still apply November 2026. Non-compliance penalties for deployers: up to €15M or 3% of global annual turnover, whichever is higher (Article 99). Eighteen-month window for procurement teams to stand up runtime governance before the hammer drops.

Calx provides the runtime enforcement substrate the Act assumes but does not specify. Audit trail, compliance exports, behavioral evidence on demand for any operator in any deployment. The tailwind is the delay, not the rule.

State of Colorado · Colorado AI Act

Colorado's AI Act, effective June 2026, is the first US state law requiring runtime AI governance for consequential decisions. The enforcement posture in the US is no longer theoretical. It is on the calendar.

Calx provides the enforcement layer that makes the compliance story defensible. Every action scoped to an operator, every rule auditable, every violation logged.

04 / supporting research

Research threads that underwrite the synthesis above. Each is cited inline in the Calx papers where it does load-bearing theoretical or empirical work.

Wang 2026. Multi-agent coordination failure modes at scale. Taxonomy of breakdowns across cooperating agent populations. arXiv:2601.15300
Liu et al. 2024. "Lost in the Middle": the U-shaped attention curve in long-context models. Instructions in the middle of the prompt are systematically under-weighted. Liu, Lin, Hewitt, Paranjape, Bevilacqua, Petroni, Liang
Hadeliya et al. 2025. "When Refusals Fail": empirical evidence for compliance degradation as context length and instruction density grow. arXiv:2512.02445
Argyris 1977. Double Loop Learning in Organizations. Patching symptoms is single-loop; changing the governing variables is double-loop. The same distinction applies to AI agents. Harvard Business Review
Peysakhovich & Lerer 2023. Attention sink effects and position-dependent instruction weighting in transformer language models. arXiv preprint
Montes 2026. "Claude Code follows about 80% of them, 60% of the time." Practitioner evidence on rule compliance drift. Cited as practitioner observation, never as empirical study. Medium · Christopher Montes
Potham 2025. Capability masking: agents can intentionally under-report their own capabilities to avoid scrutiny. Distinct from stated-then-violated compliance failures. arXiv:2506.02357
05 / calx research

Calx’s own published research

Three peer-reviewable papers on Zenodo, all CC-BY-4.0. The behavioral plane, the stickiness failure mode, and the compiler gap. Independent evidence for the category this page is naming.

Paper I

The Behavioral Plane

237 rules transferred from one agent to another. The receiving agent made 44 novel failures in categories the rules explicitly addressed. Behavioral knowledge does not transfer through text.

DOI: 10.5281/zenodo.19159223
Paper II

Stickiness Without Resistance

Without human friction in the correction loop, agents accept instructions but fail to modify behavior. Compliance is performed, not enacted.

DOI: 10.5281/zenodo.19382717
Paper III

The Compiler Gap

Nine formatting rules tested across three context lengths. Text instructions: 0/9 enforced. Structural enforcement: 9/9. The variable was the delivery mechanism.

DOI: 10.5281/zenodo.19384855
06 / the framework

Two planes. Calx builds one.

Behavioral infrastructure for the supervised-agent era runs on two planes. The information plane is what the agent knows. The behavioral plane is what the agent does. Calx builds the behavioral plane and integrates cleanly with the information plane.

08 / answers

Questions a procurement skeptic asks first.

Short answers to the questions that decide whether this conversation is worth your team’s time.

What is behavioral infrastructure for the supervised-agent era?

The system layer that captures corrections, compiles them into structural rules, and enforces those rules inside the harness, before the agent runs. It is distinct from prompt engineering (information plane) and from memory systems (also information plane). Behavioral infrastructure is the enforcement layer. Calx is naming the category because seven labs converged on the harness in a single quarter, and nobody has named the compiler piece that compounds human corrections into runtime behavior.

How is Calx different from prompt engineering or rules files?

Prompt engineering writes text and hopes the agent follows it. Calx compiles corrections into structural enforcement that runs before the agent. We measured the difference in a controlled study: 0 of 9 for text rules, 9 of 9 for compiled rules. Same rules, same runtime, three context lengths. The variable was the delivery mechanism.
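
One way to picture the difference in delivery mechanism: the same rule can sit in a prompt as a sentence, or be attached to a deterministic check that the harness runs. The sketch below shows only the second form. The CompiledRule type, the two sample formatting rules, and the gate function are hypothetical. They are not the rules from the study and not Calx's compiler output.

```python
from typing import Callable, NamedTuple


class CompiledRule(NamedTuple):
    name: str
    text: str                          # the wording a prompt would carry
    violates: Callable[[str], bool]    # deterministic check run by the harness


# Two illustrative formatting rules (assumed for this sketch).
RULES = [
    CompiledRule(
        "no-trailing-whitespace",
        "Do not leave trailing whitespace on any line.",
        lambda out: any(line != line.rstrip() for line in out.splitlines()),
    ),
    CompiledRule(
        "max-line-80",
        "Keep every line at 80 characters or fewer.",
        lambda out: any(len(line) > 80 for line in out.splitlines()),
    ),
]


def gate(output: str, rules=RULES) -> list:
    """Return the names of violated rules; an empty list means the output passes."""
    return [rule.name for rule in rules if rule.violates(output)]


clean = gate("short line\nanother line\n")
dirty = gate("trailing space \n" + "x" * 81 + "\n")
```

The check gives the same answer at any context length because it never depends on the model reading the rule text.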

Is Calx competing with OpenAI Codex, Anthropic Managed Agents, or Microsoft Agent Governance Toolkit?

No. Calx is built on LangChain and LangGraph and runs cross-runtime, including alongside OpenAI Codex, Anthropic Managed Agents, AWS AgentCore, and Microsoft AGT. OpenAI and Anthropic each ship a hosted harness for their own model. Microsoft enforces admin-written policies. AWS gates capability at the IAM boundary. Calx builds the behavioral governance layer inside the same harness pattern, compiling the policies nobody knew to write, cross-runtime. Same category, different scope.

How does Calx relate to Mem0, Letta, Zep, and other memory systems?

Calx is a behavioral plane system. Memory and retrieval (information plane) are a different category solving a different problem. Calx integrates cleanly with information plane systems. We do not build them and we do not compete with them. Paper III (The Compiler Gap) frames the distinction explicitly: storing rules is not the same thing as governing behavior.

What models and runtimes does Calx support?

Model-agnostic by construction, via LiteLLM. BYOK for any model: Claude, GPT, Gemini, Llama, open-source. The behavioral layer is portable and lives inside the harness, not the model. The native first-party experience is Bench plus the Calx harness (Tether). The same behavioral governance layer also plugs into Cursor, Claude Desktop, OpenAI Codex App Server, Anthropic Managed Agents, LangGraph, AWS AgentCore, and any other interaction surface or harness in the ecosystem.

What does Calx cite as evidence for the category?

Eighteen entities on the page above: every one of them has shipped a working system, published a peer-reviewable paper, or enacted a regulatory instrument that addresses a piece of behavioral infrastructure. Industry: OpenAI, Microsoft, LangChain, Anthropic, Meta, Manus, Cursor, HumanLayer. Academic: Stanford and SambaNova on ACE; UCL and Huawei Noah’s Ark Lab on Memento-Skills; Tsinghua on via negativa alignment; TU Eindhoven on runtime governance; Singapore Management University on AgentSpec; Phil Schmid (Google DeepMind) on the harness-as-OS analogy; Varun Pratap Bhardwaj on Agent Behavioral Contracts. Standards and regulatory: OWASP Top 10 for Agentic Applications, EU AI Act, Colorado AI Act. Calx publishes three of its own peer-reviewable papers on Zenodo (CC-BY-4.0).

Is Calx production-ready? When can I use it?

The Calx harness (Tether), the behavioral compilation engine (Serve), and the underlying runtime are running in production with the current design-partner cohort today. Bench (the first-party desktop experience) is in cohort and shipping to macOS first. Public release follows the design-partner cohort. If you are running agents across teams and the corrections are slipping, the fastest way in is a Correction Audit: book one at calx.sh/audit.

07 / the ratio

The product is the ratio.

Every primitive in behavioral infrastructure runs on one of two planes. The information plane is where rules live as text: prompts, CLAUDE.md files, system prompts, retrieval indices. Scaffolding.

The behavioral plane is where actions execute: gates, hooks, tests, compiled rules enforced before the agent runs. Enforcement.

Every AI system ships some ratio of the two. The product is the ratio. Calx is the company whose entire product optimizes for it.

OpenAI built the harness. Microsoft shipped the governance toolkit. LangChain shipped the middleware pattern. Stanford proved the delta-update lifecycle. UCL proved the convergence. Tsinghua proved the theoretical foundation. Meta proved the dual-channel requirement. Chris Argyris proved in 1977 that organizations that only patch symptoms never change the system. The same is true of AI agents. Every citation on this page is building a piece of this layer.

We are naming it behavioral infrastructure for the supervised-agent era. We are building the piece that turns your corrections into rules your agents cannot violate. We are not inventing this. We are finishing it.

OWASP logo used under CC BY-SA 4.0. All other marks are the property of their respective owners and appear here for editorial citation of publicly available research and open-source work. Nothing on this page implies endorsement.