Sohrab Hosseini on Governing Agents as Business Logic

In this interview, Sohrab Hosseini, Co-Founder of Orq.ai, argues that prompts are business logic, production failures stem from governance gaps, and AI agents must be managed as operational resources rather than experiments.

Orq.ai did not start as an AI company but as a configuration management platform, until a design partner began using it to manage LLM prompts. When you realized prompts were effectively business rules rather than “AI magic,” how did that shift your architectural thinking, and why did this origin give Orq.ai an advantage over tools that grew out of chat interfaces or playgrounds?

When Anthony and I started building what was then called Orquesta, our backgrounds pulled us naturally toward enablement platforms: cloud, DevOps, data infrastructure. We had spent years as technology consultants helping large organizations not just adopt new technology but actually govern it. Even before generative AI was the conversation, we were building tooling around configuration management: how do you dynamically control application behavior across environments without redeploying code? How do you give non-engineers visibility into system behavior without exposing them to raw infrastructure?

Then a design partner, a fintech team we were working closely with, started using our remote configuration layer to manage their LLM prompts. They weren’t doing anything exotic. They just wanted to change a prompt in production without a deployment cycle, roll back when something went wrong, and track which version was live at any given time. The same problems they had with feature flags and application configuration, they now had with prompts.

That was the moment the framing clicked. A prompt is not AI magic. It is a business rule expressed in natural language. It encodes decision logic, tone, constraints, brand policy, and compliance requirements. It belongs in version control, it needs audit history, and it should be deployable and rollback-able just like any other piece of business logic. Once you see it that way, the entire operational picture changes.

Tools that grew out of chat interfaces or model playgrounds started with the model at the center. Their mental model is: here is a model, how do you talk to it better? Everything else gets bolted on afterward. Our mental model from day one was: here is a piece of business logic living at the boundary between your application and an AI model, how do you manage its entire lifecycle? That inversion changes how you think about state, collaboration, cost attribution, and compliance. We did not have to retrofit governance into a chat tool. Governance was the original product.

You have spoken openly about the 2008 collapse of one of your early businesses, when you chose to spend eight years repaying debt instead of declaring bankruptcy. How did that personal experience with loss of control shape Orq.ai’s obsession with governance, auditability, and reliability, and do you see the platform as a response to “black-box risk” you once lived through yourself?

I came to the Netherlands as a child with my family, as a refugee. The thing that stays with you from that experience is the feeling of having no control over your own life. You are dependent on systems, on decisions made by others, on processes you cannot see or influence. I promised myself very early that I would never feel that way again.

When the business I founded in my twenties ran into serious trouble during the 2008 financial crisis, I had a choice. I could walk away through bankruptcy or I could face it. I chose to face it. It taught me something I could not have learned any other way: what it actually feels like when a system fails and you have no visibility into why, no way to intervene, and no audit trail to help you understand what went wrong.

That experience is directly baked into Orq.ai's DNA. When I look at how enterprises are deploying AI today, with agents making decisions, running workflows, calling APIs, and interacting with customers, and I see them doing it without proper observability or governance, I recognize that feeling immediately. For an enterprise, a black box is not just uncomfortable. It is a compliance risk, a reputational risk, and potentially a financial catastrophe.

So yes, Orq.ai is a response to black-box risk, and part of that response comes from having lived through what it means to lose control of something important. I want every team using our platform to have the ability to see what is happening, to intervene when something goes wrong, and to trace every decision an agent made. Not as a nice-to-have, but as a baseline expectation for operating in production.

Today almost every enterprise can build an impressive AI demo, yet most fail when moving to production. From your perspective, what actually breaks at the production layer, and why are cost volatility, unpredictable agent behavior, and organizational governance more dangerous than model hallucinations alone?

The demo-to-production gap is the central problem we exist to solve, so let me be precise about what actually breaks.

The first thing that breaks is cost predictability. In a demo, you run a hundred queries and token usage is trivial. In production, you may have agents looping, retrying on failures, calling multiple models in sequence, and pulling large documents into context. We have seen cases where a single poorly constrained AI feature could theoretically consume a significant share of a company’s revenue in token costs alone. Most teams have no way to see this coming because they lack per-feature cost attribution.

The second thing that breaks is behavioral consistency. A model that performs well in your test environment behaves differently when inputs are messier, context windows are fuller, or a new model version was silently deployed. In a demo, you pick the happy path. In production, the full input distribution hits you at once.

The third thing that breaks is organizational governance. Who owns a prompt in production? Who can change it? If your compliance team needs to audit every response an agent gave over the last 90 days, can you produce that? In most organizations today, the answer is either “we don’t know” or “we can’t.”

These failure modes are more dangerous than hallucinations in isolation because hallucinations are a known, watched-for problem. Cost overruns, behavioral drift, and governance failures are largely invisible until they become catastrophic. A hallucination surfaces quickly and triggers a fix. A cost structure that is silently bleeding money, or an agent operating outside policy for weeks without anyone noticing, those compound quietly and then explode.

You’ve compared custom-built AI stacks to Formula One cars, powerful but impractical for most companies, while arguing that enterprises really need something closer to a reliable Audi S8. Why do engineering teams still default to building bespoke LLMOps stacks, and at what point does “building in-house” start destroying value instead of creating it?

A Formula One car is an extraordinary machine optimized for one specific track, one set of conditions, and a team of fifty engineers on standby. The moment conditions change, it becomes very difficult to handle. Most enterprise AI teams are essentially trying to build Formula One cars and then wondering why their operations teams cannot maintain them.

Engineering teams default to bespoke stacks for understandable reasons. Culturally, engineers love to build, and the AI tooling landscape evolves fast enough that it feels like no vendor solution can keep up. Practically, teams often start with one specific pain point and reach for a tool that solves it well enough. The problem is that as the system grows, they add another tool, then another, and before long they have five to seven tools that do not talk to each other properly. Data gets fragmented, visibility disappears, and a significant portion of engineering capacity is now just maintaining integrations rather than shipping product.

The point where building in-house starts destroying value is earlier than most teams realize. It happens when engineers are spending more time maintaining AI infrastructure than building AI features. It happens when a compliance team asks for an audit trail and engineering has to build a custom reporting pipeline to produce it. It happens when you want to switch models for cost reasons and realize your abstraction layer is so tightly coupled to one provider that it would take weeks to make that change.

What enterprises actually need is the Audi S8. Something fast, capable, well-engineered, and reliable under real-world conditions. Something any qualified driver can operate, with safety systems built in, that your team can maintain without a racing pit crew.

Cost control is often underestimated in AI discussions. You’ve cited cases where poorly constrained LLM features could theoretically burn half a company’s revenue through token usage alone. How should teams think about the unit economics of production-grade AI, especially when deploying autonomous or looping agents, and how does Orq.ai make these risks visible before they become financial disasters?

This is one of the most underappreciated operational risks in enterprise AI right now. When you deploy an agent that loops, every iteration consumes tokens. If that agent enters an error state that causes it to loop, or pulls large documents into context on every iteration, costs compound extremely quickly. We have worked with customers who modeled their agent behavior in production and realized that at realistic traffic volumes, an unconstrained feature could theoretically consume a material fraction of their revenue in token costs.

The way teams should think about this is through unit economics at the feature level, not the infrastructure level. You should know the approximate cost per user action, per customer interaction, per document processed. You should have hard and soft budget limits that trigger alerts or throttling before costs go exponential. And when operating looping agents, you need iteration limits and timeout controls that prevent runaway execution.
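The soft and hard budget limits described above can be sketched as a simple per-feature guardrail. This is an illustrative pattern, not Orq.ai's actual API; the feature name, pricing figures, and thresholds are all made up for the example:

```python
from dataclasses import dataclass

@dataclass
class FeatureBudget:
    """Per-feature token spend with a soft limit (alert) and a hard limit (throttle)."""
    feature: str
    soft_limit_usd: float   # crossing this triggers an alert to the owning team
    hard_limit_usd: float   # crossing this blocks further calls pending review
    spent_usd: float = 0.0

    def record(self, prompt_tokens: int, completion_tokens: int,
               usd_per_1k_prompt: float, usd_per_1k_completion: float) -> str:
        # Attribute cost at the feature level, not the infrastructure level.
        self.spent_usd += (prompt_tokens / 1000) * usd_per_1k_prompt
        self.spent_usd += (completion_tokens / 1000) * usd_per_1k_completion
        if self.spent_usd >= self.hard_limit_usd:
            return "throttle"   # stop serving this feature until reviewed
        if self.spent_usd >= self.soft_limit_usd:
            return "alert"      # warn before costs go exponential
        return "ok"

# A looping agent pulling large documents into context on every iteration:
budget = FeatureBudget("doc-summarizer", soft_limit_usd=50.0, hard_limit_usd=100.0)
for _ in range(10):
    status = budget.record(200_000, 50_000,
                           usd_per_1k_prompt=0.03, usd_per_1k_completion=0.06)
print(status, round(budget.spent_usd, 2))  # → alert 90.0
```

The point of the sketch is that the check runs on every request, so the soft limit fires while there is still headroom to intervene, rather than after the invoice arrives.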

Orq.ai gives teams the granularity they need to make those decisions. Every request flowing through the gateway is tagged by model, feature, user, and environment. Cost dashboards aggregate this in real time so you can see exactly what is being spent and where. Budget limits and alerts are configurable per deployment. For agentic workflows, we support maximum iteration counts and execution time limits as guardrails. When an agent approaches a limit, you can stop it, surface it for human review, or escalate through an exception handler, rather than discovering the problem on your cloud bill three weeks later.

Cost governance is not just a finance problem. It is a product sustainability problem. AI features that are not cost-controlled cannot survive in production.

As the industry shifts from stateless chatbots to stateful, multi-step agents, the term “agentic AI” is increasingly vague. In your definition, what is the technical threshold that turns a workflow into a true agent, and how does Orq.ai’s Agent Runtime manage state, memory, retries, and tool execution at a systems level rather than a single-model level?

The threshold that turns a workflow into a true agent is autonomous decision-making under uncertainty with persistent state across multiple steps. The system is not executing a fixed sequence of instructions. It is choosing what to do next based on the output of a previous step, maintaining context across those steps, and interacting with external systems in ways that have real-world consequences. A simple RAG pipeline is not an agent. An agent is a system that can decide whether to call a tool, what to call it with, how to interpret the result, and whether to proceed, retry, escalate, or terminate, and does so across an extended interaction without constant human direction.

The implications for how you build the runtime are significant. At Orq.ai, the Agent Runtime operates at a systems level, not a model level. It manages the full execution lifecycle of an agent invocation. It handles persistent memory and knowledge bases so the agent has structured context without you re-injecting it manually on every call. It manages tool registration and execution, including MCP compatibility, so agents can call external APIs in a controlled, observable way. It handles retries and fault tolerance at the orchestration layer so that a transient provider outage does not fail a long-running task silently. It enforces iteration limits and timeout controls to prevent runaway loops. And it surfaces human-in-the-loop checkpoints where high-stakes actions require explicit approval before the agent proceeds.

When you operate at multi-agent scale, you cannot manage each agent as an individual model call. You need infrastructure that treats agents as managed processes with lifecycle, state, and governance. That is what the Agent Runtime is built to provide.

You often describe the future role of humans as supervisors rather than operators, using the metaphor of air-traffic control. In practice, what does a “human-in-the-loop” interface need to look like for non-technical stakeholders, and how do you design approval, intervention, and exception handling without forcing people to read raw logs or JSON traces?

The air traffic control metaphor is useful because the controller does not pilot every plane. They monitor the system, maintain situational awareness, and intervene when something is off course. They work with instrumentation designed for human cognition, not raw data streams. That is exactly what we need to design for non-technical stakeholders.

The failure mode I see most often is teams building observability for engineers: raw traces, token-level logs, JSON outputs. Then they hand that to a compliance officer or product manager and expect them to make sense of it. That is not human-in-the-loop. That is just log access with a different job title attached.

Real human-in-the-loop design means surfacing exceptions at the right level of abstraction. When an agent is about to take a high-stakes action, the approval interface should present the proposed action in plain language, the relevant context, and a clear choice. Not a JSON payload. A description of what is about to happen and why.
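A minimal sketch of that approval surface: a structured proposed action rendered into plain language with a clear choice. The agent name, fields, and thresholds here are hypothetical, invented for illustration:

```python
def render_approval(action: dict) -> str:
    """Turn a structured proposed action into a plain-language approval prompt.
    The reviewer sees what will happen and why, never a raw JSON payload."""
    return (
        f"The agent '{action['agent']}' wants to {action['description']}.\n"
        f"Why: {action['reason']}\n"
        f"Impact: {action['impact']}\n"
        "Approve? [approve / reject / escalate]"
    )

proposed = {
    "agent": "refund-assistant",
    "description": "issue a $240.00 refund to customer #8841",
    "reason": "order arrived damaged; photos verified in support ticket 5512",
    "impact": "irreversible payment action above the $200 auto-approve threshold",
}
text = render_approval(proposed)
print(text)
```

The same underlying event could still be logged in full structured form for engineers and auditors; the point is that the approval view is a separate, human-level rendering of it.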

It also means designing escalation paths that match organizational structure. An executive needs a dashboard showing agent performance, cost trends, and compliance status. A domain expert needs the ability to review and approve prompt changes for their area without going through an engineering deployment cycle. A compliance officer needs a complete, queryable audit trail of every agent decision. All of that is the same underlying data, presented through different lenses.

This is one of the reasons we emphasize collaboration tools alongside technical features. The platform is designed so that engineers and non-technical team members can work within the same environment with access to the views and controls appropriate for their role.

Multi-agent systems introduce new failure modes such as infinite loops, cascading errors, and compounded hallucinations. From what you’re seeing today, what is the most underappreciated orchestration risk in agentic systems, and where do you think the industry is still overly optimistic?

The failure mode I think is most underappreciated is what I would call silent semantic drift in multi-agent handoffs. When one agent passes context to another, a summary, a decision, an extracted piece of information, there is almost always some information loss or distortion. Each model makes probabilistic choices about how to represent what it received. Over a chain of three or four agents, small distortions compound. The final output can be coherent and confident while being substantially different from what the original data supported. And because each individual step looked reasonable, the error is extremely hard to trace back to its origin.

This is qualitatively different from single-model hallucination. With a single model, you can test and evaluate. With multi-agent chains, your evaluation has to operate at the systems level, across the full chain, and the state space is combinatorially large. The industry is not taking this seriously enough yet because most evaluation frameworks are still single-model-centric.

The other area where I think optimism is misplaced is third-party agent interoperability. The assumption in many agentic architecture discussions is that agents from different vendors will collaborate reliably. In practice, when you have heterogeneous agents with different memory models, different tool schemas, and different failure modes, making them work together reliably while maintaining governance and auditability is a genuinely hard problem. We are working on it actively with A2A protocol support and MCP integration, but I want to be honest: the industry is probably eighteen to twenty-four months away from having robust patterns for this. Anyone telling you it is solved today is overselling.

Orq.ai takes a bundled platform approach, combining gateway, orchestration, observability, and collaboration, while many engineers prefer assembling best-in-breed tools. Why do you believe bundling wins in the enterprise context, and where do you think the real friction cost of tool sprawl shows up inside large organizations?

The bundling-versus-best-of-breed debate often gets framed incorrectly. We are not arguing that a single monolithic platform always beats specialized tools. We are arguing that in the enterprise context, fragmented tooling has a hidden total cost that teams consistently underestimate.

The real friction cost of tool sprawl shows up in three places. The first is data fragmentation. When your observability data is in one tool, prompt versions in another, and cost data in a third, you lose the ability to ask cross-cutting questions. Why did costs spike on Tuesday? You need to correlate a prompt change with a traffic increase with a model provider switch, and if those data points live in different systems, that correlation either does not happen or requires manual detective work.

The second is integration maintenance overhead. Every tool you add is a dependency you have to maintain. API changes, deprecations, authentication updates: none of it is glamorous work, but it consumes real engineering hours. We frequently find that ten to fifteen percent of engineering capacity is absorbed by maintaining AI tooling integrations rather than building product.

The third is organizational friction. In an enterprise, different functions have different needs and different tooling tolerances. When you have five specialized tools each with their own learning curve, you create a gap between engineers who understand the tooling and the product managers, compliance officers, and domain experts who need to participate in AI governance. A unified platform with role-appropriate views makes that cross-functional collaboration actually work.

We launched the AI Router as a standalone product in early 2026 because not everyone is ready to adopt a full platform from day one. But the integrated platform remains the destination, because that is where the full value of unified data and governance is realized.

You’ve argued that companies shouldn’t have an “AI department,” just as they don’t have an “Excel department,” yet democratization often leads to shadow AI and compliance risk. How does Orq.ai balance empowering domain experts like product, legal, or marketing teams while still giving CISOs and compliance officers confidence that nothing is slipping out of control?

The Excel analogy reframes the ambition. The goal is not to have a specialist department that does AI things for everyone else. The goal is for AI to be a capability embedded in how every function works, the way spreadsheets are embedded in how organizations operate today. That means product managers can adjust how AI features work, legal teams can configure guardrails for their domains, and marketing teams can iterate on AI-powered workflows without every change requiring an engineering sprint.

But democratization without governance creates shadow AI, and shadow AI is happening right now. People are using consumer AI tools for work tasks that involve sensitive data in ways that are not visible to security or compliance teams, because the friction of officially sanctioned tooling is too high.

The way we address this is through what I call governed democratization. The platform is structured so that different roles have appropriate levels of access and autonomy within a defined governance framework. A domain expert can edit prompts, run experiments, and deploy changes, but they do so within a system that enforces role-based access controls, maintains a complete audit trail, applies the guardrails the organization has defined, and routes changes through whatever approval workflows are required. They have autonomy within boundaries, and those boundaries are fully visible to the CISO and compliance teams.
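"Autonomy within boundaries" can be sketched as role-based permission checks where every attempt, allowed or denied, lands in an audit trail. This is an illustrative policy model, not Orq.ai's actual role scheme; the roles and operations are made up:

```python
from datetime import datetime, timezone

# Role -> allowed operations (an illustrative policy, not Orq.ai's actual model).
PERMISSIONS = {
    "domain_expert": {"edit_prompt", "run_experiment", "deploy_to_staging"},
    "engineer":      {"edit_prompt", "run_experiment", "deploy_to_staging",
                      "deploy_to_prod"},
    "viewer":        set(),
}

audit_log: list[dict] = []   # every attempt is recorded, allowed or not

def attempt(user: str, role: str, operation: str) -> bool:
    """Check the role's boundary and log the attempt for compliance review."""
    allowed = operation in PERMISSIONS.get(role, set())
    audit_log.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user, "role": role, "operation": operation, "allowed": allowed,
    })
    return allowed

assert attempt("maria", "domain_expert", "edit_prompt")         # autonomy...
assert not attempt("maria", "domain_expert", "deploy_to_prod")  # ...within boundaries
print(len(audit_log))  # → 2: both attempts visible to the CISO and compliance
```

The denied attempt is as important to record as the allowed one: the audit trail is what makes the boundaries visible rather than merely enforced.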

The key insight is that control and empowerment are not opposites. The right architecture makes them reinforcing. When everyone operates in a platform with proper governance built in, shadow AI becomes less attractive because the official platform is actually usable by non-engineers.

As a Europe-based company operating under GDPR and the EU AI Act, you seem to treat regulation less as a constraint and more as a design input. Is data sovereignty becoming a decisive buying factor for enterprises, and how does Orq.ai technically allow customers to use best-in-class global models without violating residency or jurisdictional boundaries?

Regulation has been a design input for us from the beginning. When you treat compliance as a constraint, something you deal with after you have built the product, you end up with architecture that is fundamentally at odds with what enterprise customers need. When you treat it as a design input, compliance becomes a natural property of how the platform works rather than an add-on.

On data sovereignty specifically: yes, it has become a decisive buying factor, and the shift over the last twelve to eighteen months has been dramatic. It used to be a concern primarily in regulated industries like financial services and healthcare. Now it is a mainstream enterprise requirement driven by GDPR enforcement, the EU AI Act, and geopolitical dynamics that have made organizations think carefully about where their AI inference is running and whose infrastructure it depends on. We have seen a significant surge in demand from enterprises and public sector institutions who want to use best-in-class global models but need guarantees about where data is processed and stored.

The technical architecture that enables this sits at the routing layer. When a request comes in, the router applies configurable rules: which models are approved, which data residency requirements apply, which providers are permitted based on jurisdiction. A customer can configure routing so that all inference for their EU users runs through providers with EU data residency guarantees, while US customers may have access to a broader set of providers. The routing decision happens in milliseconds and is fully logged and auditable. Data never leaves the approved region.
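The routing decision described above can be sketched as a rule check over a provider table. The provider names, regions, and model classes here are invented for illustration and do not reflect Orq.ai's actual routing configuration:

```python
# Illustrative routing table: provider names and regions are made up.
PROVIDERS = [
    {"name": "eu-model-a", "region": "eu", "models": {"gpt-class", "small"}},
    {"name": "us-model-b", "region": "us", "models": {"gpt-class", "small", "frontier"}},
]

def route(requested_model: str, user_region: str, residency_required: bool) -> dict:
    """Pick a provider that satisfies the data-residency rule; the returned
    decision record is what would be logged for the audit trail."""
    for provider in PROVIDERS:
        if requested_model not in provider["models"]:
            continue
        if residency_required and provider["region"] != user_region:
            continue  # data must not leave the user's region
        return {"provider": provider["name"], "region": provider["region"],
                "model": requested_model, "residency_enforced": residency_required}
    raise LookupError("no provider satisfies model + residency constraints")

decision = route("gpt-class", user_region="eu", residency_required=True)
print(decision["provider"])  # → eu-model-a
```

An EU user with residency enforced is routed to the EU provider even though a US provider serves the same model class, and the decision record itself is the auditable artifact.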

For customers who need maximum control, we offer VPC and on-premises deployment options through both the AWS and Azure Marketplaces. We are SOC 2 Type II certified, fully GDPR compliant, and built with the EU AI Act’s traceability requirements in mind. Sovereignty is not a checkbox for us. It is an architectural property built in from the foundation.

Looking ahead to 2026, if agentic systems succeed and intelligence becomes cheap while trust and context become scarce, what is one non-obvious change you expect in how enterprises build or govern AI systems, and where do you think today’s assumptions about LLMOps will quietly break?

The non-obvious change I expect is that the unit of enterprise software management will shift from applications to agents. This sounds abstract, but the operational implications are profound.

Today, enterprises manage software applications with change management processes, release cycles, and access controls built around the concept of a deployed application. Agents are fundamentally different. They are not static deployments. They are dynamic actors that make decisions, take actions, and interact with other systems in ways that no single release cycle fully governs. They are more like employees than software. They need onboarding, monitoring, adjustment, and sometimes offboarding. They need a management layer that tracks not just what they are but what they are doing, whether their decisions are consistent with organizational policy, and how they interact with other agents in the system. We call this capability Agent Resource Management, and the product is called Agent Control Tower.

What I think will break quietly in today’s LLMOps assumptions is the single-model evaluation paradigm. The entire evaluation infrastructure built over the last two years, benchmarks, test sets, LLM-as-a-judge frameworks, is built around evaluating individual model calls. That paradigm does not scale to multi-agent systems where the thing you need to evaluate is emergent behavior across a chain of decisions, not a single input-output pair. Teams that have invested heavily in single-model eval will find their tooling gives them a false sense of confidence about multi-agent system quality.

The other assumption that will break is that context is free. Teams are currently stuffing everything into context windows. As agents run longer tasks, context management becomes a serious engineering discipline. Which information does an agent actually need at each step? How do you maintain relevant context without consuming the entire context budget on retrieval? Teams that treat context as a free resource will find their agents become both expensive and unreliable as task complexity grows.
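Treating context as a budgeted resource can be illustrated with a greedy packing step: most relevant snippets first, under a hard token budget. This is one simple strategy among many, with made-up relevance scores standing in for a retriever:

```python
def fit_context(candidates: list[dict], budget_tokens: int) -> list[dict]:
    """Greedy context packing: take the most relevant snippets first,
    never exceeding the token budget. `relevance` would normally come
    from a retriever score; here it is supplied directly."""
    chosen, used = [], 0
    for snippet in sorted(candidates, key=lambda s: s["relevance"], reverse=True):
        if used + snippet["tokens"] <= budget_tokens:
            chosen.append(snippet)
            used += snippet["tokens"]
    return chosen

docs = [
    {"id": "policy",  "tokens": 400, "relevance": 0.9},
    {"id": "history", "tokens": 900, "relevance": 0.7},
    {"id": "faq",     "tokens": 300, "relevance": 0.5},
]
selected = fit_context(docs, budget_tokens=800)
print([d["id"] for d in selected])  # → ['policy', 'faq']
```

The 900-token history snippet is skipped despite being more relevant than the FAQ, because it would blow the budget; making that trade-off explicit at each step is what "context management as an engineering discipline" looks like in miniature.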

What we are building toward with Agent Resource Management is a governance layer that treats the agent fleet as a managed resource, with the same rigor applied to cloud infrastructure or a software release pipeline. The organizations that get there first will have a structural advantage that compounds over time, because their AI systems will be trustworthy enough to take on progressively more consequential decisions. And trust, as you note, is the scarce resource that everything else depends on.

Editor’s Note

This interview examines the operational risks emerging in agentic systems, including cost volatility and governance gaps, and argues for treating AI workflows as managed infrastructure rather than isolated model calls.