How long from kickoff to a SaaS AI feature live in production?
The median is 90 days for a single well-scoped feature with clear deployment constraints (p95 latency, cost-per-call, accuracy floor); pilots can prove a feature in 6 to 8 weeks. The long pole is almost never the model; it is multi-tenant data plumbing, evals against representative production queries, and observability. We do not ship a SaaS AI feature without evals running in CI.
Will the AI feature be profitable at our seat price?
We model gross margin per AI feature per tier before code is written, and if expected usage would push contribution margin negative, we re-scope. Common levers: a tighter cost-per-call ceiling, a cheaper model for the common path with routing to a stronger model only when needed, prompt-prefix and response caching, and bounded retry policies. On inherited systems the largest savings usually come from these levers, not from swapping the model.
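As a rough illustration of that margin model, the sketch below computes contribution margin per seat from a blended cost per call. Every field name and figure is a hypothetical placeholder, not real pricing, and the sketch assumes a router picks one model up front and that cache hits cost roughly nothing.

```python
from dataclasses import dataclass


@dataclass
class FeatureCostModel:
    """Illustrative per-seat margin model; all fields are assumed inputs."""
    seat_price: float       # monthly price per seat (USD)
    other_cogs: float       # non-AI cost of serving one seat per month
    calls_per_seat: float   # expected AI calls per seat per month
    cheap_cost: float       # cost per call on the cheaper model
    strong_cost: float      # cost per call on the stronger model
    escalation_rate: float  # fraction of calls routed to the stronger model
    cache_hit_rate: float   # fraction of calls answered from cache

    def cost_per_call(self) -> float:
        # Blend the two model prices by routing share, then discount cache hits.
        blended = ((1 - self.escalation_rate) * self.cheap_cost
                   + self.escalation_rate * self.strong_cost)
        return (1 - self.cache_hit_rate) * blended

    def ai_cost_per_seat(self) -> float:
        return self.calls_per_seat * self.cost_per_call()

    def margin_per_seat(self) -> float:
        return self.seat_price - self.other_cogs - self.ai_cost_per_seat()


# Hypothetical tier: strong model on every call, no caching, heavy usage.
unrouted = FeatureCostModel(30.0, 6.0, 1000, 0.002, 0.03, 1.0, 0.0)
# Same tier with routing and caching applied.
routed = FeatureCostModel(30.0, 6.0, 1000, 0.002, 0.03, 0.1, 0.25)
```

With these made-up inputs, unrouted.margin_per_seat() is negative and routed.margin_per_seat() is positive, which is the kind of comparison that drives a re-scope decision.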
How do you handle multi-tenant data isolation in RAG?
Per-tenant retrieval indexes by default, not a shared vector store with metadata filters. Auth scoping is enforced at the retrieval layer and again at inference, and every retrieval is logged with tenant, document IDs, and timestamp for audit. We treat ingested customer content as untrusted and apply least-privilege retrieval scoped to the tenant, then prove isolation with adversarial cross-tenant eval cases that have to pass before release.
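A minimal sketch of that retrieval pattern follows, assuming an in-memory keyword index as a stand-in for a real vector index. The class names, the scoring, and the audit record shape are all hypothetical; the point is only that the tenant ID selects the index, so there is no filter to forget.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TenantIndex:
    """One index per tenant. Keyword overlap stands in for vector search."""
    docs: dict[str, str] = field(default_factory=dict)

    def search(self, query: str, k: int = 3) -> list[str]:
        terms = set(query.lower().split())
        scored = [(len(terms & set(text.lower().split())), doc_id)
                  for doc_id, text in self.docs.items()]
        # Highest overlap first; ties broken by document ID for determinism.
        hits = sorted((h for h in scored if h[0] > 0),
                      key=lambda h: (-h[0], h[1]))
        return [doc_id for _, doc_id in hits[:k]]


class TenantScopedRetriever:
    def __init__(self) -> None:
        self._indexes: dict[str, TenantIndex] = {}
        self.audit_log: list[dict] = []

    def ingest(self, tenant_id: str, doc_id: str, text: str) -> None:
        self._indexes.setdefault(tenant_id, TenantIndex()).docs[doc_id] = text

    def retrieve(self, auth_tenant_id: str, query: str, k: int = 3) -> list[str]:
        # The authenticated tenant picks the index; other tenants' documents
        # are never candidates, whatever the query says.
        index = self._indexes.get(auth_tenant_id)
        doc_ids = index.search(query, k) if index else []
        self.audit_log.append({
            "tenant": auth_tenant_id,
            "doc_ids": doc_ids,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return doc_ids
```

An adversarial cross-tenant eval case then reduces to an assertion: ingest a document for one tenant, query as another with wording lifted from that document, and require that its ID never appears in the results.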
Per-tenant fine-tuning, or a shared model?
The default is a shared base model with per-tenant retrieval. Move to per-tenant adapters (LoRA or prompt-suffix) when vocabularies meaningfully differ, as in a vertical SaaS serving multiple industries on one platform. Move to full per-tenant fine-tunes only when an enterprise account justifies the operational overhead and contractually requires it; most SaaS products never need them.
What about prompt injection from customer-uploaded content?
We treat ingested documents, tickets, and chats as untrusted and run them through a four-layer governance stack: model guardrails (Guardrails.ai validators), validation pipelines (schema validation on structured output), auto-retraining (incidents become regression evals), and real-time observability (LangSmith, Evidently AI, Weights & Biases, Prometheus and Grafana). Ingested content passes validation before it can influence an action.
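To make the validation-pipeline layer concrete, here is a hand-rolled sketch of schema validation on structured model output, using only the Python standard library. It does not use the Guardrails.ai API; the action names, field names, and length limit are hypothetical. The idea is that output influenced by an injected instruction is rejected unless it fits a narrow, allow-listed shape.

```python
import json

# Hypothetical allow-list of actions this feature may ever take.
ALLOWED_ACTIONS = {"reply", "escalate", "close_ticket"}
REQUIRED_FIELDS = {"action", "ticket_id", "message"}
MAX_MESSAGE_CHARS = 2000  # illustrative limit


def parse_action(raw_output: str) -> dict:
    """Return the parsed action, or raise ValueError if the output is off-schema."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")
    if set(data) != REQUIRED_FIELDS:
        raise ValueError("unexpected or missing fields")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError("action is not on the allow-list")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(data["ticket_id"], bool) or not isinstance(data["ticket_id"], int):
        raise ValueError("ticket_id must be an integer")
    if not isinstance(data["message"], str) or len(data["message"]) > MAX_MESSAGE_CHARS:
        raise ValueError("message must be a string within the length limit")
    return data
```

A caller would execute an action only after parse_action returns, and treat any ValueError as a reason to fall back and log the attempt.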
How do you evaluate AI quality for SaaS use cases?
A three-layer eval suite. A reference dataset of 100 to 500 representative queries from real product usage, scored on the metric that matters for the feature (accept rate for copilots, deflection rate for support, exact-match for SQL generation). An adversarial set covering known failure modes. And a regression set where every production incident becomes a permanent eval entry. The suite runs on every deploy and on a schedule behind feature flags.
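One way to wire such a suite into CI is sketched below, assuming exact-match scoring (as in the SQL generation case) and per-layer pass-rate floors. The threshold values and names are illustrative assumptions, not figures from our practice; a real suite would swap in the metric that matters for the feature.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    layer: str     # "reference", "adversarial", or "regression"
    query: str
    expected: str


def pass_rates(generate: Callable[[str], str],
               cases: list[EvalCase]) -> dict[str, float]:
    """Exact-match pass rate per layer for a given generation function."""
    totals: dict[str, int] = {}
    passed: dict[str, int] = {}
    for case in cases:
        totals[case.layer] = totals.get(case.layer, 0) + 1
        if generate(case.query).strip() == case.expected.strip():
            passed[case.layer] = passed.get(case.layer, 0) + 1
    return {layer: passed.get(layer, 0) / n for layer, n in totals.items()}


# Illustrative floors: some slack on the reference set, none on the others.
THRESHOLDS = {"reference": 0.9, "adversarial": 1.0, "regression": 1.0}


def gate(rates: dict[str, float],
         thresholds: dict[str, float] = THRESHOLDS) -> bool:
    """True only if every layer meets its floor; a missing layer fails."""
    return all(rates.get(layer, 0.0) >= floor
               for layer, floor in thresholds.items())
```

In a CI job, the script would call gate(pass_rates(model_fn, cases)) and exit non-zero when it returns False, so a failing regression case blocks the deploy.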
What happens to ownership of the AI feature after delivery?
We design for hand-off from week one. At the end of the engagement your in-house team owns the model selection, the eval suite, the observability dashboards, and the run-book, and we document the deployment constraint set, the eval methodology, the fallback strategy, and the cost model. A meaningful share of our AI work is recovery on systems built by other vendors where this hand-off was never engineered, and we do not ship work that repeats that pattern.