AA Consulting
Technical · 9 min · 19 June 2026

Every AI platform converges on a control plane. Build it on purpose.

Multi-team LLM traffic creates nine predictable problems that all want to live in one place. Whether that place is a single gateway, several gateways feeding a shared plane, or a runtime layer is an open question. The decision to design it is not.


The first LLM call is three lines of code and a key in an environment variable. It demos beautifully before lunch. A few weeks later the bills arrive, a provider key turns up pasted into a chat channel, one agent burns another team’s quota overnight, and security asks whether any customer data left the network. Every team that reaches production runs into the same nine problems, listed below, and all of them pull toward one place to govern AI traffic. Most call that place a gateway. The only real choice is whether you design it on purpose or discover it piecemeal in production.

LLM traffic is not the REST traffic you already run

A few properties make it its own problem. Cost is token-based, not request-based: one call costs a fraction of a cent, the next costs a dollar. Responses stream, so anything that buffers traffic breaks the experience. Payloads are sensitive: prompts and responses are gold for debugging and a liability for compliance. The provider set changes every quarter. And caching genuinely works, because a surprising share of queries are repeats.
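To make the token-based cost point concrete, here is a minimal sketch of the arithmetic. The `Price` type, the `call_cost` function, and the per-million-token rates are illustrative assumptions, not any provider's real price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Hypothetical USD prices per million tokens (not real provider rates)."""
    input_per_million: float
    output_per_million: float


def call_cost(prompt_tokens: int, completion_tokens: int, price: Price) -> float:
    """Return the cost of one call in USD, pricing input and output separately."""
    total = (prompt_tokens * price.input_per_million
             + completion_tokens * price.output_per_million)
    return total / 1_000_000


# Assumed rates, chosen only to keep the arithmetic easy to follow.
PRICE = Price(input_per_million=2.5, output_per_million=10.0)

# A short chat turn: 200 tokens in, 100 tokens out.
small = call_cost(prompt_tokens=200, completion_tokens=100, price=PRICE)

# A long-context call: 300,000 tokens in, 25,000 tokens out.
large = call_cost(prompt_tokens=300_000, completion_tokens=25_000, price=PRICE)
```

Under these assumed rates the short call costs about $0.0015 and the long one about $1.00, although both count as a single request to a request-based limiter.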

The nine problems you end up solving

Once more than one team uses LLMs, the same set of platform problems shows up. They share a property: each belongs in a central chokepoint, not scattered across every application.

  1. Credential ownership

    Application code should never hold the provider's real key. Issue each tenant or app a platform-owned virtual key that the gateway swaps for provider auth, so one leak is contained and every tenant stays auditable.

  2. Routing and failover

    Real platforms run several deployments: a cheap provisioned-throughput tier, a pay-as-you-go fallback in another region, a different provider for experiment-heavy teams. The gateway decides per request where traffic goes, and fails over automatically when capacity throttles.

  3. Token-aware limits

    Requests-per-second is nearly meaningless when one call can spend fifty thousand tokens. You need tokens-per-minute limits, daily token budgets, and optional spend caps, or a runaway loop quietly drains the budget overnight.

  4. Cost attribution

    Tag every call with tenant, application, model, prompt and completion tokens, and latency, then join it to billing. Without that telemetry, ‘who spent forty thousand dollars last month’ is guesswork.

  5. Data protection

    PII redaction or hashing, content moderation, prompt-injection detection, and schema validation in sensitive workflows. The chokepoint is the natural place to enforce all of them.

  6. Centralized tracing

    The full or redacted prompt, model identity, response, token usage, latency, cache hit or miss, and tenant metadata, captured in one place. Without it, diagnosing a hallucination or a workflow failure is a guessing game.

  7. Caching

    Exact-match caching for repeated prompts, semantic caching for similar ones via embeddings. Implemented well, it cuts both cost and latency on the large share of queries that are duplicates.

  8. Egress control

    In regulated environments: private connectivity to providers, restricted outbound traffic, and clear network boundaries. Again, the gateway is where those controls live.

  9. Self-service onboarding

    A developer portal where teams register applications, mint credentials, read docs, and track usage, so platform engineers are not the bottleneck filing tickets for every new app.
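Three of the problems above (credential ownership, token-aware limits, and cost attribution) can be sketched together as one admission check. This is a toy in-memory model under stated assumptions: the `Gateway` and `Tenant` names, the sliding one-minute window, and the log fields are all hypothetical, and it charges the caller's token estimate up front where a real gateway would reconcile against actual usage.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Tenant:
    name: str
    tokens_per_minute: int
    daily_token_budget: int


class Gateway:
    """Toy in-memory chokepoint. Every name here is hypothetical."""

    def __init__(self, provider_key, clock=time.monotonic):
        self._provider_key = provider_key   # the real key never leaves the gateway
        self._clock = clock                 # injectable so the window can be tested
        self._tenants = {}                  # virtual key -> Tenant
        self._window = defaultdict(deque)   # tenant name -> (timestamp, tokens) pairs
        self._daily = defaultdict(int)      # tenant name -> tokens admitted (never reset here)
        self.usage_log = []                 # rows to join against billing

    def register(self, virtual_key, tenant):
        self._tenants[virtual_key] = tenant

    def admit(self, virtual_key, estimated_tokens):
        """Return the provider key if the call fits the tenant's limits, else raise."""
        tenant = self._tenants.get(virtual_key)
        if tenant is None:
            raise PermissionError("unknown virtual key")
        now = self._clock()
        window = self._window[tenant.name]
        while window and now - window[0][0] >= 60:   # drop entries older than a minute
            window.popleft()
        used_last_minute = sum(tokens for _, tokens in window)
        if used_last_minute + estimated_tokens > tenant.tokens_per_minute:
            raise RuntimeError("tokens-per-minute limit exceeded")
        if self._daily[tenant.name] + estimated_tokens > tenant.daily_token_budget:
            raise RuntimeError("daily token budget exhausted")
        window.append((now, estimated_tokens))
        self._daily[tenant.name] += estimated_tokens
        return self._provider_key

    def record(self, virtual_key, app, model, prompt_tokens, completion_tokens, latency_ms):
        """Append one attribution row per completed call."""
        self.usage_log.append({
            "tenant": self._tenants[virtual_key].name,
            "app": app,
            "model": model,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "latency_ms": latency_ms,
        })
```

Applications only ever see the virtual key passed to `register`. Revoking a leaked one means deleting a single dictionary entry, and the provider key stays untouched.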

Pattern one: an open-source router plus a control plane

The open approach splits the job in two. A router gives every application a single OpenAI-compatible endpoint and forwards calls to Azure OpenAI, OpenAI, Anthropic, Bedrock, Vertex, Mistral, or self-hosted models. The payoff is provider independence: switching a backend model becomes a config change, not a rewrite. The router also carries failover, token budgets, per-tenant keys, and cost accounting.
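The routing half of that job can be sketched as an ordered list of deployments tried in turn. This is a minimal illustration, not any particular router's implementation: `Deployment`, `Throttled`, `route`, and the two stub backends are hypothetical names, and the stubs stand in for real provider calls.

```python
from dataclasses import dataclass
from typing import Callable


class Throttled(Exception):
    """Raised by a backend that has no capacity left (for example on an HTTP 429)."""


@dataclass
class Deployment:
    name: str
    send: Callable[[str], str]   # hypothetical: takes a prompt, returns the reply text


def route(prompt: str, deployments: list) -> tuple:
    """Try deployments in priority order and fail over when one is throttled."""
    for deployment in deployments:
        try:
            return deployment.name, deployment.send(prompt)
        except Throttled:
            continue   # capacity problem: move on to the next deployment
    raise RuntimeError("all deployments are throttled")


def provisioned(prompt: str) -> str:
    # Stub for a provisioned-throughput tier that is currently saturated.
    raise Throttled("provisioned tier is at capacity")


def pay_as_you_go(prompt: str) -> str:
    # Stub for a pay-as-you-go fallback that always has room.
    return f"echo: {prompt}"


# The routing table is plain data, so changing backends is a config edit.
ROUTES = [
    Deployment("provisioned-primary", provisioned),
    Deployment("paygo-fallback", pay_as_you_go),
]

served_by, reply = route("hello", ROUTES)
```

Here the first deployment throttles, so the call is served by `paygo-fallback`. Only `Throttled` triggers failover in this sketch; other errors propagate to the caller, which is a design choice rather than a requirement.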

A control plane adds what a router deliberately leaves out: virtual tenants and keys, per-tenant limits and budgets, guardrails for PII, injection, and schema enforcement, semantic caching, and analytics. The trade is operational: you run the services, rotate keys, handle backups, and scale the gateway yourself. For Azure workloads, open tools that score provisioned-throughput against pay-as-you-go from real usage data take some of the guesswork out of the cost model.
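As one example of a guardrail, PII redaction by hashing can be sketched in a few lines. The patterns below are deliberately simple assumptions that catch common email and phone shapes only, and the `redact` helper is hypothetical; a production guardrail would need broader detection and a keyed hash.

```python
import hashlib
import re

# Simplified patterns for illustration; real PII detection needs far more coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def _token(kind: str, value: str) -> str:
    """Replace a value with a short stable digest so repeats stay correlatable.

    An unsalted hash of a low-entropy value such as a phone number can be
    brute-forced, so treat this as a placeholder for a keyed hash (HMAC).
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    return f"<{kind}:{digest}>"


def redact(prompt: str) -> str:
    """Hash emails first, then phone numbers, before the prompt leaves the gateway."""
    prompt = EMAIL.sub(lambda m: _token("email", m.group()), prompt)
    prompt = PHONE.sub(lambda m: _token("phone", m.group()), prompt)
    return prompt


clean = redact("Contact jane.doe@example.com or +1 415 555 0100 today")
```

Because the same input always maps to the same token, traces can still group calls that mention the same customer without storing the raw value.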

Pattern two: Azure API Management as the gateway

Many enterprises already run Azure API Management, and when you map the nine problems onto it, most already have an equivalent. Its subscriptions function as virtual keys: a credential with primary and secondary keys, scoped to specific APIs, which is the tenant-identity abstraction many platforms reinvent from scratch. Microsoft has added GenAI-specific policies, and APIM authenticates to Azure OpenAI with Managed Identity, so the platform stores no provider keys at all. Add the built-in developer portal, analytics, policy orchestration, and an enterprise support path, and for an Azure-standardized organization it fits the existing governance model naturally.
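The value of a primary and secondary key pair is easiest to see in a rotation. The sketch below is a plain Python model of the idea, not API Management's API or SDK: the `Subscription` class and its methods are hypothetical names.

```python
import secrets
from dataclasses import dataclass, field


def _new_key() -> str:
    return secrets.token_hex(16)


@dataclass
class Subscription:
    """Hypothetical model of a two-key credential scoped to a set of APIs."""
    tenant: str
    allowed_apis: frozenset
    primary_key: str = field(default_factory=_new_key)
    secondary_key: str = field(default_factory=_new_key)

    def authorizes(self, key: str, api: str) -> bool:
        """Accept either key, but only for APIs inside the subscription's scope."""
        key_ok = any(
            secrets.compare_digest(key, candidate)   # constant-time comparison
            for candidate in (self.primary_key, self.secondary_key)
        )
        return key_ok and api in self.allowed_apis

    def regenerate_primary(self) -> str:
        """Rotate the primary key; the secondary keeps working in the meantime."""
        self.primary_key = _new_key()
        return self.primary_key
```

A typical rotation under this model moves callers to the secondary key, regenerates the primary, moves callers back, and then regenerates the secondary, so no request fails mid-rotation.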

Which one, and the hybrid

The router-plus-control-plane pattern fits teams that run multiple providers, prefer open infrastructure, value fast experimentation, and are comfortable operating platform services. API Management fits Azure-centric models, governance-first organizations, shops already standardized on APIM, and anyone who wants onboarding handled out of the box. Some enterprises run both: APIM for identity, quotas, onboarding, and network boundaries, with a router behind it for multi-provider routing. That hybrid pairs Azure governance with provider flexibility.

The shape is an open question

One caution, because it is the kind of thing people state too confidently. Today the common form is a single gateway. It may not stay that way, and we will not pretend to know that it will. The same nine problems could be solved by several gateways reporting into a shared control plane, or by a runtime layer that observes every AI call below the application and forwards it to one place, or by many small gateways feeding a single pane for the people who answer for cost and risk. The pattern that recurs is centralization (one place to set policy, see spend, and prove control), not any specific topology. What is well supported is the nine problems and the pull toward one place. The shape they settle into is genuinely uncertain, and anyone describing the future architecture as settled is selling the topology, not solving your problem.

The point is the decision, not the tool

The goal is never a specific product. The patterns and tools named above are examples of each shape, drawn from real engagements, not a ranking and not a push to reach for any one of them; the same nine problems can be met with tools we have not listed. What matters is to design the control plane your platform will need, deliberately, before your architecture assembles a worse one by accident. Which pattern is right is a function of your constraints: which clouds you run, your governance posture, your appetite for operating infrastructure, and your data residency rules. That decision, and the build-versus-buy behind it, is the work worth paying for.