Generative AI architecture for SaaS: the layered design and the tenant-isolation problem at its core

Generative AI architecture for a multi-tenant SaaS is a layered stack, model gateway, orchestration, RAG retrieval, tools, guardrails, and evals, placed on top of your existing product. The design decisions look routine until you reach the one that decides whether the feature ships safely: keeping one tenant from ever retrieving another tenant’s data from a shared vector store or LLM pipeline. This guide lays out the full reference stack and goes deep on that isolation layer.

By Kanika Mathur, Head of Service Delivery

Reviewed by Resourcifi engineeringPublished Feb 18, 2026Updated Feb 18, 202613 min read

SaaS

Key takeaways

The short version

SaaS AI architecture is a layered AI stack added on top of a multi-tenant product: a model gateway/router, RAG retrieval, tools over the product API, orchestration, guardrails, and evals/observability. Microsoft, AWS, and the LLM-gateway vendors converge on the same primitives under different names.
The central hard problem is multi-tenant data isolation. The query that grounds the model must return only the calling tenant’s authorized data, every time, because a similarity search has no concept of who is allowed to see what.
For per-tenant separation in a vector store, the practical hierarchy is separate index > namespace > metadata filter. Pinecone recommends one namespace per tenant as the default and explicitly advises against putting all tenants in one namespace with a metadata filter.
Tenant isolation is necessary but not sufficient. You also need permission-aware retrieval: carry the tenant in a signed token, enforce it at the query layer through a gatekeeper API, and inject the tenant filter automatically so application code can never query the stores directly.
The dominant security risk is prompt injection (OWASP LLM01:2025), especially the indirect kind that arrives through retrieved content. Pair least-privilege tool tokens and input/output guardrails with per-tenant tracing and evals so a regression is attributable to a tenant.

Generative AI architecture and the core isolation problem

The defining challenge in generative AI architecture for SaaS is multi-tenant data isolation: when one model and one retrieval pipeline serve many customers, the query that grounds the model must return only the calling tenant’s authorized data, every time. A similarity search has no concept of ownership or permission, so isolation has to be enforced in the architecture around it and can never be assumed inside the index. Get this layer right and the rest of the stack is conventional engineering; get it wrong and a single ordinary query can surface another customer’s data.

Start with the isolation model, because it sets the cost and the blast radius for everything above it. Three models recur across independent first-party vendors, which is the strongest signal they are the real design axis. Silo is a store per tenant: the best data and performance isolation and clean per-tenant cost attribution, at higher operational overhead, and Microsoft notes you can hit service limits, so it advises against silo when you have a large number of small tenants.¹ Pool is a shared store with a tenant discriminator in every query: cost-optimized and scales to far more tenants, but Microsoft calls data isolation the most important concern, and the query must include a tenant discriminator.¹ Bridge is the hybrid, with siloed premium or high-compliance tenants alongside a pooled tier for the rest, a naming AWS uses for its multi-tenant agent guidance.²

Per-tenant separation in a vector store

Inside the retrieval layer the decision narrows to how you separate tenants in the vector store, and this is the most technically load-bearing choice on the page. Pinecone’s multitenancy guidance is unambiguous: use one namespace per tenant as the default, because each namespace is stored separately and query cost scales with one tenant’s data rather than the whole index.³ It advises against putting all tenants in a single namespace and filtering by a tenant ID in metadata, because queries scan the entire namespace regardless of the filter, so you pay to scan every tenant’s data and latency grows with total size.³ The underlying reason is a property of approximate-nearest-neighbor indexes: the engine cannot prune by another field before searching, so it finds nearest neighbors first and filters by tenant after, which is both slower and more expensive.⁴ A separate index per tenant is the strongest option, reserved for high-compliance tenants or to split workloads.

One honesty note keeps the takeaway durable: the namespace-versus-metadata cost math is partly Pinecone-specific, because Pinecone meters by namespace. On other engines the exact economics differ, but the isolation hierarchy of separate index, then namespace, then metadata filter holds generally. That hierarchy is the durable rule; the cost figures belong to Pinecone.

Per-tenant isolation in a vector store, three approaches

The same data axis read three ways. Isolation strength runs separate index, then namespace, then metadata filter; Pinecone meters query cost by namespace, which is what makes the middle column the practical default.

Multi-tenant vector isolation: namespace vs metadata filter vs separate index
Approach	Isolation	Cost at scale	Latency	Offboarding	Best for
Separate index per tenant (silo)	Strongest, physical	Highest; per-tenant overhead, can hit service limits	Lowest, smallest search space	Drop the index	High-compliance and regulated tenants; a few large tenants
Namespace per tenant	Strong, physical separation	Low; query cost scales with one tenant’s data	Low	Delete the namespace, near-instant	The default for most B2B SaaS
Shared index + metadata filter (pool)	Weakest; relies on query correctness	Pay to scan all tenants’ data; filter capped at 10k values	Degrades as total data grows	Delete by filter, slow	Many tiny tenants (B2C); low-sensitivity data

Source: Microsoft Azure Architecture Center, Design a Secure Multitenant RAG Inferencing Solution (2026); Pinecone, Implement multitenancy and Multi-Tenancy in Vector Databases (2026).

Permission-aware retrieval inside a tenant

Tenant isolation keeps customers apart; permission-aware retrieval keeps users within a tenant to the data they are entitled to see. Within one tenant a viewer, an admin, and an auditor have different entitlements, so the authorization rules must be defined first and then used as the basis for retrieval filtering. Microsoft calls this filtering or security trimming, implemented through data-platform features like row-level security or custom access-control metadata on each chunk.¹

The mechanics that prevent cross-tenant leakage are concrete and converge across vendors. Carry the tenant in a signed token: AWS flows tenant context in a JWT and keys its data-layer access policies off the token’s claims, validating the principal against the requested path on every retrieval.² Inject the tenant filter automatically and sanitize results, which AWS states directly as a way to help prevent cross-tenant data leakage.² Route all data access through one gatekeeper API layer: Microsoft is explicit that code which needs tenant data should not be able to query the back-end stores directly, so a single API layer selects only the tenant’s rows, enforces trimming, and logs grounding access for audit.¹ Together these turn isolation from a convention any query could break into an invariant the architecture enforces.

The layered reference architecture

A SaaS AI architecture is a layered AI stack on top of the multi-tenant product: a model gateway and router, an orchestration layer, RAG retrieval, tools over the product API, guardrails on input and output, and evals plus observability, with caching and identity as cross-cutting rails. Present it as a reference architecture and treat it as one valid shape among several. Microsoft, AWS, and the LLM-gateway vendors describe the same primitives under different names, and that convergence is the credibility signal.

The canonical control flow ties the layers together. A client authenticates the user against an identity provider, then calls the orchestrator with the query plus the user’s authorization token; the orchestrator fetches tenant-authorized grounding data with security filtering applied, assembles the prompt, and calls the model’s inferencing API, with results returning to the app.¹ Each layer below maps to a real, citable concern, and the isolation work from the previous section lives inside the retrieval and tools rows. Building this layer for a product is what our AI application development team does, and the tenant-aware data side is where it meets our SaaS engineering work.

The layered reference architecture

Seven layers plus cross-cutting rails. Read top to bottom as the request path; the tenant-isolation boundary runs through the retrieval and tools layers.

SaaS AI reference architecture, layer by layer
Layer	What it does	Why it matters
Model gateway / router	One endpoint in front of every model provider: provider auth, request translation, routing by task, cost, and latency, failover, and spend tracking	Decouples the product from any single provider and centralizes cost and reliability control.
Orchestration	Planner plus workflow that sequences retrieve, synthesize, validate, and selects tools	Turns a single prompt into repeatable, inspectable steps.
RAG retrieval	Vector store and retriever with tenant-scoped, permission-aware queries	The isolation boundary; every retrieval is filtered to the authorized tenant and user.
Tools over the product API	The agent acts through scoped, least-privilege API tokens, never raw database access	Contains blast radius and keeps actions inside the product’s own authorization.
Guardrails	Input checks for jailbreak and injection; output checks for PII, leakage, and policy	AWS recommends pre-processing input and post-processing output guardrails that scan for cross-tenant leakage.
Evals + observability	End-to-end tracing of retrievals, prompts, tool calls, cost, and latency, plus offline and online evaluation	Makes quality, cost, and regressions attributable, ideally per tenant.
Caching (rail)	Prompt caching on the static prefix; semantic caching on near-duplicate queries	The highest-leverage cost and latency control, cutting across every layer.

Source: Microsoft Azure Architecture Center (2026); AWS, Building multi-tenant agents with Amazon Bedrock AgentCore (2026); OpenRouter, LLM Gateway (2026). Layer names vary by vendor; the primitives converge.

Model strategy: gateway, routing, and fallbacks

Most SaaS teams put a single gateway endpoint in front of every model provider and route per request, rather than hard-wiring one model into the product. The gateway fronts providers like OpenAI, Anthropic, Google, and self-hosted models, and provides failover, retries, conditional routing, load-balancing, caching, and budget controls as a standard function set.⁵ That one seam is where cost, reliability, and provider choice get managed instead of being scattered through the codebase.

Routing has three useful dimensions. Route by task complexity, sending easy turns to a small, cheap model and hard reasoning to a frontier model. Route by risk, sending sensitive or regulated prompts to a private or compliant deployment. Route by latency target so interactive paths stay fast. Fallbacks are a related gateway capability: on a provider error, timeout, or rate-limit, the gateway fails over from a primary to a secondary model so a single provider hiccup does not take the feature down.⁵

The hosted-versus-self-hosted choice is an engineering tradeoff with no single right answer. A managed gateway suits teams without spare DevOps capacity and gets a product to market faster; a self-hosted gateway gives infrastructure ownership, deeper routing control, and lower per-request cost at scale, at the price of real operational burden. Many enterprises reach for a private cloud deployment, such as AWS Bedrock or Azure OpenAI, when they need a hard data-residency or no-training guarantee, which is the architectural backstop behind the contractual one covered in the next section.

Security: prompt injection, PII, and training data

The dominant security risk in a SaaS AI feature is prompt injection, ranked LLM01:2025 by OWASP, and the variant that matters most for RAG is indirect injection, where malicious instructions arrive through retrieved or external content. OWASP’s mitigations map directly onto the stack: constrain model behavior in the system prompt, require source citations, filter inputs and outputs, enforce least privilege with scoped tool tokens, segregate untrusted content from instructions, gate high-risk actions behind human approval, and run adversarial testing.⁶ Each of these is a design decision the architecture has to make on day one, never a feature you bolt on at the end.

PII handling belongs at retrieval time as well as at display time: classify and scrub sensitive data before and after generation, which aligns with OWASP’s input and output filtering and with the AWS output guardrail that scans for sensitive-data leakage across tenant boundaries.² The data-not-for-training question has clear first-party answers worth stating plainly. Anthropic’s commercial terms state that Anthropic may not train models on customer content from its services.⁷ OpenAI does not use API inputs or outputs to train models by default, retains data up to thirty days for abuse monitoring, and offers zero-data-retention on eligible endpoints for qualifying organizations.⁸ Private cloud deployments keep data in the customer’s own tenancy and add the architectural backstop to those contractual terms.

Observability and evals in production

Treat the AI layer as a production system you can trace and evaluate, not a prompt you ship and forget. Trace requests, retrievals, prompts, tool calls, cost, latency, and the sources hit end to end, so quality correlates with cost and latency instead of living as anecdote. For a multi-tenant SaaS, scope those traces and eval dashboards per tenant, so a regression or a cost anomaly is attributable to a specific customer, which ties back to Microsoft’s recommendation to log grounding access for audit.¹

Evaluation comes in two modes. Offline evals run curated datasets of inputs and expected outputs as automated experiments before each release or prompt change, catching regressions ahead of users. Online evals sample a small share of live traffic and score it, often with an LLM as a judge plus explicit user feedback, to track qualitative trends in production. The two together give a pre-release gate and a live signal, and per-tenant scoping is what turns both into an audit trail a SaaS customer can trust. This page sits alongside our guides to agentic RAG and AI agents, which go deeper on the retrieval and agent layers named here.

Frequently asked

SaaS AI architecture questions

What is generative AI architecture?

Generative AI architecture is the layered system design that connects a large language model to your product: a model gateway and router, an orchestration layer, retrieval-augmented generation (RAG) for grounding, tools that act over the product API, guardrails on input and output, and evals plus observability. For a SaaS product, the architecture must also enforce multi-tenant data isolation, so the retrieval pipeline never returns one customer’s data to another. Microsoft, AWS, and the major LLM-gateway vendors converge on the same primitives under different names.

What is SaaS AI architecture?

SaaS AI architecture is a generative AI architecture built specifically for multi-tenant products: a model gateway and router, an orchestration layer, RAG retrieval, tools that act over the product API, guardrails on input and output, and evals plus observability, with caching and identity as cross-cutting rails. The defining requirement is that every retrieval is tenant-scoped and permission-aware, so one customer never reads another customer’s data. Microsoft, AWS, and the LLM-gateway vendors converge on the same primitives under different names.

How do you keep tenant data isolated in a multi-tenant AI or RAG application?

Choose an isolation model (silo, pool, or bridge), prefer one vector namespace per tenant or a separate index for high-compliance tenants, carry the tenant in a signed JWT, and enforce it at the query layer through a gatekeeper API plus automatic tenant-filter injection. Pinecone recommends a namespace per tenant as the default and advises against a shared namespace filtered by tenant metadata, because queries then scan every tenant’s data. Microsoft adds that code should never query the back-end stores directly; all access goes through one API layer that selects only the tenant’s rows and logs the access for audit.

Namespace, metadata filter, or separate index for multi-tenant vectors?

The isolation hierarchy is separate index, then namespace, then metadata filter. Pinecone recommends one namespace per tenant as the default because each namespace is stored separately and query cost scales with one tenant’s data rather than the whole index. It advises against putting all tenants in one namespace and filtering by tenant metadata, since the query scans the entire namespace regardless of the filter, so you pay to scan every tenant’s data and latency grows. A separate index per tenant is the strongest option, used for high-compliance tenants or to split workloads.

Does the LLM provider train on our customer data?

By default, no, for the major commercial APIs. Anthropic’s commercial terms state that Anthropic may not train models on customer content from its services, and OpenAI does not use API inputs or outputs to train models by default, retaining data up to thirty days for abuse monitoring with zero-data-retention available on eligible endpoints for qualifying organizations. A private cloud deployment, such as AWS Bedrock or Azure OpenAI, keeps data in the customer’s own tenancy as an architectural backstop to those contractual terms.

How do you prevent prompt injection in a SaaS AI feature?

Treat all retrieved and user content as untrusted and segregate it from instructions, enforce least-privilege scoped tokens for any tool the model can call, filter inputs and outputs, require human approval for high-risk actions, and run adversarial tests, per OWASP LLM01:2025. The variant that matters most for a RAG SaaS is indirect injection, where instructions arrive through retrieved content rather than direct user input. Constraining model behavior in the system prompt and requiring source citations are part of the same OWASP mitigation set.

Kanika Mathur

Head of Service Delivery, Resourcifi

Kanika Mathur is Head of Service Delivery at Resourcifi, where her engineering pods design the AI layer for multi-tenant SaaS products and run the tenant-isolation reviews before any retrieval path ships. She has sat in the design sessions where a namespace-versus-metadata decision quietly set a product’s per-query cost and its cross-tenant blast radius for years, which is the vantage point this guide is written from.

Resourcifi on LinkedIn →