Enterprise AI Assistant Case Study: A RAG System at Scale

Krushang Mandani · Jun 06, 2026

Knowledge workers at large organizations lose close to two hours every working day searching for information they already own. Our client, a large multinational enterprise, employed more than 14,000 people across finance, legal, engineering, and customer support functions. Their institutional knowledge was real and valuable, yet it was scattered across Confluence, SharePoint, Slack, Jira, Salesforce, and dozens of disconnected internal portals. The cost of that fragmentation was measured in millions of dollars of lost productivity each year, not in minutes.

When KriraAI was engaged, the goal was not a chatbot demo. The brief was a production-grade enterprise AI assistant that every employee could trust as a single source of truth. This post covers what we built, how the architecture was designed layer by layer, the challenges we solved during delivery, and the measurable results the client achieved. We have written it for technical and executive readers who evaluate whether an AI partner can actually ship hardened systems at scale.

The Problem KriraAI Was Called In To Solve

The enterprise was not short on information. It was drowning in it. Every team had built its own knowledge silo over fifteen years of growth and acquisition. Finance policies lived in SharePoint, engineering runbooks lived in Confluence, deal context lived in Salesforce, and tribal knowledge lived in archived Slack threads nobody could find again. The data existed, but it was effectively invisible at the moment of need.

The workflows that depended on this knowledge were quietly breaking. A support agent answering a complex billing question had to open six systems and ping three colleagues. A new engineer took weeks to learn where critical documentation even lived. A sales representative preparing for a renewal could not quickly surface the history of a five-year account. Each of these moments was small on its own, yet they repeated thousands of times every single day.

The financial impact was significant and growing. Internal estimates put the cost of unproductive search at close to two hours per knowledge worker per day. Across the workforce, that represented an enormous recurring drain on output. Repeated questions also overloaded internal help desks and subject matter experts, pulling senior people away from high-value work to answer the same things again and again.

Keyword search had failed them completely. The legacy enterprise search tool matched literal strings, not meaning. A query phrased differently from the source document returned nothing useful. Results were not ranked by relevance, were not aware of who was allowed to see what, and never produced a direct answer. Employees had simply stopped trusting it.

Competitive pressure made the status quo unsustainable. Faster competitors were shipping decisions while this organization was still searching for the inputs to make them. Onboarding velocity, support resolution time, and sales responsiveness were all being dragged down by the same root cause. Leadership understood that the problem was no longer a convenience issue but a structural disadvantage. That is the situation that brought the engagement to KriraAI.

What KriraAI Built

KriraAI designed and delivered a production enterprise AI assistant built on a retrieval-augmented generation (RAG) architecture. The system unifies the entire knowledge estate behind a single conversational interface. An employee asks a question in natural language inside Slack, Microsoft Teams, or a web application. The assistant returns a grounded, cited answer drawn only from sources that employee is permitted to see. Every answer links back to its source documents for verification.

At its core, this is an enterprise RAG system, not a fine-tuned chatbot. We deliberately chose retrieval over parametric memory because enterprise knowledge changes daily. Baking facts into model weights would have produced a system that was stale on day one. Instead, the assistant retrieves fresh, governed context at query time and uses a large language model purely as a reasoning and synthesis engine over that retrieved evidence.

The end-to-end flow is precise and observable at every step. When a question arrives, the system first rewrites it for retrieval, expanding acronyms and resolving conversational references. It then runs a hybrid search across the vector database and a sparse keyword index in parallel. Candidate passages are passed through a cross-encoder reranker that scores true semantic relevance. Only the top-ranked, access-filtered passages are assembled into the final prompt.
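To make the rewriting step concrete, here is a minimal sketch of acronym expansion. The `GLOSSARY` contents and the `expand_acronyms` helper are hypothetical illustrations, not the production rewriter, which also resolved conversational references.

```python
import re

# Hypothetical glossary; a real deployment would load the organization's own terms.
GLOSSARY = {"PTO": "paid time off", "SLA": "service level agreement"}


def expand_acronyms(query: str, glossary: dict[str, str]) -> str:
    """Append the long form after each known acronym.

    Keeping the acronym and adding the expansion lets the sparse index match
    the literal token while the dense retriever sees the full meaning.
    """
    def replace(match: re.Match) -> str:
        token = match.group(0)
        long_form = glossary.get(token)
        return f"{token} ({long_form})" if long_form else token

    # Only all-caps tokens of two or more letters are treated as acronyms.
    return re.sub(r"\b[A-Z]{2,}\b", replace, query)


if __name__ == "__main__":
    print(expand_acronyms("What is the PTO policy?", GLOSSARY))
```

Unknown acronyms pass through unchanged, so the rewrite can only add information to the query and never removes it.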

The language model then generates an answer constrained tightly to that retrieved evidence. We enforced strict grounding so the model cannot answer from its own pretraining when the context does not support a claim. If the retrieved evidence is insufficient, the assistant says so and offers to escalate, rather than inventing a confident wrong answer. This single design decision was central to building employee trust in the tool.
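One way to express this refusal behavior is to gate prompt assembly on the evidence itself. The sketch below assumes reranker scores in the range 0 to 1 and a hypothetical `min_score` threshold. The `Passage` type and `build_grounded_prompt` function are illustrative names, not the system's actual interfaces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Passage:
    doc_id: str
    text: str
    score: float  # reranker relevance, assumed here to lie in [0, 1]


def build_grounded_prompt(
    question: str, passages: list[Passage], min_score: float = 0.5
) -> Optional[str]:
    """Return a citation-numbered prompt, or None when evidence is insufficient."""
    supported = [p for p in passages if p.score >= min_score]
    if not supported:
        # The caller should tell the user the evidence is insufficient
        # and offer escalation instead of calling the model.
        return None
    evidence = "\n".join(
        f"[{i}] ({p.doc_id}) {p.text}" for i, p in enumerate(supported, start=1)
    )
    return (
        "Answer using ONLY the numbered evidence below and cite it as [n]. "
        "If the evidence does not support an answer, say so.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}"
    )
```

Returning `None` before the model is ever called means an unsupported question cannot produce a fluent but ungrounded answer. The threshold value would need tuning against an evaluation set.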

Beyond pure question answering, KriraAI built an agentic layer on top of the retrieval core. For defined workflows, the assistant can take action, not just answer. It can draft a support response, create a Jira ticket, or summarize an account before a renewal call. These actions run through tool-calling functions with explicit permission checks, so the assistant operates within the same guardrails as the human it serves. The result replaced fragmented manual search and augmented the daily work of every knowledge worker in the organization.
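The permission check around tool calls can be sketched as follows. The `Tool` type, the permission string, and the `create_ticket` stub are assumptions made for illustration; the real tools called downstream systems.

```python
from dataclasses import dataclass
from typing import Callable


class PermissionDenied(Exception):
    """Raised when a user lacks the permission a tool requires."""


@dataclass(frozen=True)
class Tool:
    name: str
    required_permission: str
    run: Callable[[dict], str]


def call_tool(tool: Tool, user_permissions: set[str], args: dict) -> str:
    """Run a tool only if the requesting user already holds its permission."""
    if tool.required_permission not in user_permissions:
        raise PermissionDenied(f"{tool.name} requires {tool.required_permission}")
    return tool.run(args)


# Hypothetical tool: a stub standing in for a real ticketing integration.
create_ticket = Tool(
    name="create_ticket",
    required_permission="jira:write",
    run=lambda args: f"ticket created: {args['title']}",
)
```

Because the check lives in `call_tool` rather than in the model's prompt, a model that decides to call a tool still cannot exceed what the human user is allowed to do.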

The Solution Architecture Behind Our Enterprise AI Assistant

This is the technical centerpiece of the engagement. The enterprise AI assistant was built as six distinct architectural layers, each with deliberate engineering rationale. We designed for production scale from the first day, not for a proof of concept that would need rebuilding later. The full system runs on Kubernetes within the client's private cloud, provisioned entirely through Terraform.

Data Ingestion and Pipeline Layer

The ingestion layer was the hardest part of the system and the most important. We connected to source systems using three distinct patterns matched to each source. We used change data capture from operational databases, scheduled batch extraction from SharePoint and Confluence, and webhook-driven updates from Slack and Jira. This ensured the index reflected reality within minutes, not days.

Raw documents flowed into a processing pipeline orchestrated as DAGs in Dagster. We chose Dagster over Airflow specifically for its asset-aware model and strong data lineage tracking. Documents were parsed with the unstructured library, which handled the messy reality of PDFs, slide decks, and HTML. We applied schema normalization, entity resolution to deduplicate near-identical documents, and semantic chunking that respected document structure.
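As an illustration of chunking that respects document structure, the sketch below packs whole paragraphs into chunks and starts a new chunk at every heading. It assumes parsed text where blank lines separate blocks and headings begin with `#`. Both conventions, and the `max_chars` budget, are hypothetical simplifications of what a parsing library produces.

```python
def chunk_by_structure(text: str, max_chars: int = 500) -> list[str]:
    """Group paragraphs into chunks without splitting a paragraph.

    A heading always opens a new chunk, so a section title stays attached
    to the text it introduces. A paragraph longer than max_chars is kept
    whole as its own chunk rather than cut mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for block in (b.strip() for b in text.split("\n\n")):
        if not block:
            continue
        is_heading = block.startswith("#")
        candidate = f"{current}\n\n{block}" if current else block
        if current and (is_heading or len(candidate) > max_chars):
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Fixed-size windows would be simpler, but they can separate a policy statement from the heading that says which policy it belongs to.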

Embeddings were generated at ingestion time rather than at query time. Each chunk was converted into a dense vector and written to the index alongside its full access control metadata. The final indexed corpus held 4.2 million document chunks at go-live. This offline embedding strategy kept query latency low and made the entire pipeline reproducible and backfillable.

AI and Machine Learning Core

The retrieval core combined dense and sparse signals through hybrid search. Dense retrieval used a transformer-based bi-encoder embedding model, BAAI bge-large, which we fine-tuned on the client's own domain corpus using contrastive learning. That fine-tuning improved retrieval recall by 22 percentage points over the off-the-shelf baseline. Sparse retrieval ran in parallel using BM25 to catch exact identifiers and product codes.
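To show why the sparse side catches exact identifiers, here is a small, self-contained version of Okapi BM25 scoring. The tokenized documents and the product code are made up for the example, and `k1` and `b` use common default values rather than values from this engagement.

```python
import math
from collections import Counter


def bm25_scores(
    query_terms: list[str], docs: list[list[str]], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    if not docs:
        return []
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores
```

A rare token such as a product code appears in few documents, so its inverse document frequency is high and any document containing it scores well above the rest. A dense embedding model may not preserve that kind of literal match.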

Candidates from both retrievers were fused and then reranked by a cross-encoder, bge-reranker-v2. The reranker reads the query and passage together, producing far more accurate relevance scores than vector similarity alone. The top reranked passages became the grounding context. This two-stage retrieve-then-rerank design is the difference between a demo and a production enterprise RAG system.
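The fusion step can be illustrated with reciprocal rank fusion, a common way to merge ranked lists whose scores are not comparable. This is an assumed technique for the sketch; the text above does not name the fusion method used. The constant `k = 60` is a conventional default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one ranking.

    Each list contributes 1 / (k + rank) for every id it contains, so a
    document ranked well by both retrievers rises above one ranked well
    by only one of them.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    dense = ["a", "b", "c"]   # hypothetical dense retrieval order
    sparse = ["b", "d", "a"]  # hypothetical BM25 order
    print(reciprocal_rank_fusion([dense, sparse]))  # ['b', 'a', 'd', 'c']
```

Working from ranks rather than raw scores avoids having to calibrate cosine similarities against BM25 scores before the reranker sees the candidates.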

For generation, we used a hybrid serving strategy. Most traffic was handled by a self-hosted open model served through vLLM with quantization for cost-efficient throughput. A smaller share of complex reasoning queries routed to a frontier model through a governed API. LangGraph orchestrated the full pipeline as an explicit state machine, including query rewriting, retrieval, generation, and the agentic tool-calling branches.
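The idea of an explicit state machine can be shown without any framework. The sketch below is plain Python and deliberately does not use the LangGraph API. The node names, the `search` callable, and the placeholder answers are all hypothetical.

```python
from typing import Callable

State = dict
Node = Callable[[State], str]  # a node updates the state and names the next node


def rewrite(state: State) -> str:
    state["query"] = state["question"].strip()
    return "retrieve"


def retrieve(state: State) -> str:
    state["passages"] = state["search"](state["query"])
    # Branch explicitly: no evidence means escalation, not generation.
    return "generate" if state["passages"] else "escalate"


def generate(state: State) -> str:
    state["answer"] = f"Answer based on {len(state['passages'])} passage(s)."
    return "END"


def escalate(state: State) -> str:
    state["answer"] = "Not enough evidence found; offering escalation."
    return "END"


NODES: dict[str, Node] = {
    "rewrite": rewrite,
    "retrieve": retrieve,
    "generate": generate,
    "escalate": escalate,
}


def run(state: State, start: str = "rewrite", max_steps: int = 10) -> State:
    """Walk the graph from start until END, recording every node visited."""
    state["trace"] = []
    node, steps = start, 0
    while node != "END":
        if steps >= max_steps:
            raise RuntimeError("state machine did not terminate")
        state["trace"].append(node)
        node = NODES[node](state)
        steps += 1
    return state
```

The recorded `trace` is what makes this style debuggable: every request shows exactly which branches it took, which is harder to recover from nested function calls.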

Integration Layer

The integration layer connected the assistant to the wider enterprise. We exposed the assistant through versioned REST and GraphQL API contracts for application clients. Internal service-to-service communication used gRPC for low-latency calls between the retrieval, reranking, and generation services. This kept the internal request path fast even under heavy concurrent load.

Agentic actions reached downstream systems through an event-driven design. When the assistant created a ticket or triggered a workflow, it published an event to Apache Kafka. Downstream consumers acted on those events asynchronously, which decoupled the assistant from the systems it touched. Webhook-based triggers connected outbound AI outputs cleanly into existing business processes without brittle point-to-point coupling.

Monitoring and Observability Layer

A system employees must trust requires deep observability. We tracked retrieval and answer quality continuously against a curated held-out evaluation set using RAGAS metrics. We monitored data drift on incoming queries using population stability index and KL divergence to catch shifts in how employees were asking questions. Feature distribution alerts fired when query patterns moved outside expected bounds.
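Both drift measures are straightforward to compute over binned counts. The sketch below assumes queries have already been bucketed, for example by topic or intent, which is a hypothetical choice of bins. A small floor `eps` avoids taking the logarithm of zero for empty bins.

```python
import math


def _proportions(counts: list[float], eps: float = 1e-6) -> list[float]:
    """Convert raw bin counts to proportions, flooring empty bins at eps."""
    total = sum(counts)
    return [max(c / total, eps) for c in counts]


def kl_divergence(p_counts: list[float], q_counts: list[float]) -> float:
    """KL(P || Q) over matching bins."""
    p, q = _proportions(p_counts), _proportions(q_counts)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def population_stability_index(expected: list[float], actual: list[float]) -> float:
    """PSI between a baseline distribution and a current one."""
    e, a = _proportions(expected), _proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

PSI is the symmetric form: it equals KL(actual || expected) plus KL(expected || actual), so it does not depend on which distribution is treated as the reference. Alert thresholds on either value are a tuning decision and are not specified here.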

Operationally, we instrumented latency at p50, p95, and p99 across every service in the pipeline. LangSmith captured full request traces for debugging, while Prometheus and Grafana powered live dashboards and alerting. When evaluation scores crossed defined thresholds, automated retraining and reindexing triggers fired. This closed loop kept the enterprise AI assistant accurate as the underlying knowledge evolved.
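For readers less familiar with latency percentiles, this is a minimal nearest-rank calculation over raw samples. In practice a metrics system such as Prometheus estimates percentiles from histograms, so this sketch only illustrates what the numbers mean.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    the samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [120, 180, 200, 240, 310, 450, 900, 1500, 2100, 2600]  # made-up data
    for p in (50, 95, 99):
        print(f"p{p} = {percentile(latencies_ms, p)} ms")
```

Tracking p95 and p99 alongside p50 matters because the median hides the slow tail, which is what users notice most.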

Security and Compliance Layer

Security was non-negotiable in an enterprise handling sensitive finance and legal data. We implemented role-based access control with attribute-level data masking, enforced at retrieval time. Every chunk carried the access control list inherited from its source document. The assistant could only ever retrieve passages the requesting user was already permitted to read, which made data leakage architecturally impossible rather than merely discouraged.
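The access rule itself is simple to state in code. The sketch below shows the semantics only: a chunk is visible when the user belongs to at least one group on the chunk's inherited access list. The `Chunk` type and group names are hypothetical, and in the deployed system this condition would be applied as a filter inside the vector search rather than in application code after the fact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset[str]  # inherited from the source document's ACL


def visible_chunks(candidates: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks whose access list intersects the user's groups."""
    return [c for c in candidates if c.allowed_groups & user_groups]


if __name__ == "__main__":
    corpus = [
        Chunk("c1", "Q3 revenue forecast", frozenset({"finance"})),
        Chunk("c2", "Company holiday calendar", frozenset({"all-staff"})),
    ]
    engineer = {"all-staff", "engineering"}
    print([c.chunk_id for c in visible_chunks(corpus, engineer)])  # ['c2']
```

Filtering before passages reach the prompt means the language model never sees restricted text. This is a stronger guarantee than instructing the model not to reveal it.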

We applied end-to-end encryption of model inputs and outputs in transit and at rest. All interactions were written to immutable append-only audit logs for compliance review. The entire deployment ran inside a private VPC with no public endpoints. The architecture was designed to meet SOC 2 and GDPR requirements that governed the client's regulated data.

User Interface and Delivery Layer

We met employees where they already worked. The assistant was delivered as native Slack and Microsoft Teams applications, plus a dedicated web interface for richer interactions. Every answer streamed token by token for responsiveness and included inline citations to source documents. Users could expand any citation to verify the underlying evidence in one click. This transparency was the single biggest driver of adoption across the workforce.

Technology Stack

Every technology in this stack was chosen deliberately against the client's scale, environment, and constraints. Nothing was selected by default or by hype. The choices reflect how an experienced AI engineering team makes tradeoffs in production.

The full stack organized by layer was as follows:

Ingestion and orchestration used Dagster for its asset-aware lineage, the unstructured library for robust document parsing, and Kafka with Flink for stream processing of live updates.
Vector database search ran on Qdrant with HNSW indexing, chosen for its strong filtering performance, which was essential for enforcing access control at retrieval time.
Embeddings and reranking used a fine-tuned BAAI bge-large bi-encoder and a bge-reranker-v2 cross-encoder, selected for accuracy on domain text without per-query API cost.
Model serving used vLLM with quantized open models for high-throughput traffic, with a governed frontier model API reserved for complex reasoning queries.
Orchestration logic used LangGraph to model the pipeline as an explicit, debuggable state machine rather than opaque chained calls.
Observability combined LangSmith for tracing, RAGAS for quality evaluation, and Prometheus with Grafana for operational metrics and alerting.
Infrastructure ran on Kubernetes provisioned by Terraform inside a private VPC, giving the client full control and reproducibility.

We chose Qdrant over alternatives specifically because metadata filtering performance was a hard requirement. The assistant filters every search by user permissions, so a vector database that degrades under heavy filtering would have failed in production. We chose vLLM because continuous batching and paged attention let us serve high concurrency at a fraction of naive serving cost. Each decision traded raw simplicity for the scale and governance this enterprise actually needed.

How We Delivered It: The Implementation Journey

The enterprise AI assistant implementation ran across five phases over a focused delivery timeline. KriraAI runs every engagement as a structured journey, not an open-ended research project. We move from discovery to a hardened production handover with clear gates at each stage.

The phases of delivery were as follows:

Discovery and requirements mapped every source system, its data sensitivity, and its access model, while we profiled real employee queries to define success metrics.
Architecture design produced the layered system above, validated against scale, latency, and security targets before any production code was written.
Development built the ingestion pipeline, retrieval core, agentic layer, and delivery surfaces in parallel tracks behind a shared API contract.
Testing and validation ran the system against a curated evaluation set and a pilot user group to measure groundedness, precision, and latency under real load.
Deployment and handover rolled the assistant out progressively, then transferred runbooks, dashboards, and retraining procedures to the client's internal team.

The challenges during delivery were real and instructive. The first was data quality. The source corpus was full of duplicates, outdated documents, and conflicting versions of the same policy. We solved this with entity resolution during ingestion and a recency-aware scoring signal that down-weighted stale content.
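A recency-aware signal can be as simple as an exponential decay on the relevance score. The half-life form below is an assumed shape chosen for illustration, and the default of 365 days is a placeholder. Neither the decay function nor the value is taken from the production system.

```python
def recency_weighted(score: float, age_days: float, half_life_days: float = 365.0) -> float:
    """Down-weight a relevance score by document age.

    The weight halves every half_life_days, so a stale copy of a policy
    ranks below a fresh copy with the same relevance.
    """
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return score * 0.5 ** (age_days / half_life_days)
```

With this shape, two versions of the same policy that match a query equally well are ordered by freshness, while a highly relevant older document can still outrank a weakly relevant new one.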

The second challenge was access control complexity. Mapping fifteen years of inconsistent permission models into a single retrieval filter was genuinely hard. We resolved it by inheriting each document's native access control list directly at ingestion and enforcing it at query time, so the assistant never had to reinvent permissions.

The third challenge was retrieval quality. Early evaluation showed the baseline embedding model missed too many correct passages on domain-specific phrasing. We fixed this by fine-tuning the embedding model with contrastive learning on the client's own corpus and adding the cross-encoder reranker. The fourth challenge was hallucination, which we brought under control through strict grounding and citation enforcement, so the assistant declines to answer beyond retrieved evidence.
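For readers unfamiliar with contrastive fine-tuning of an embedding model, the sketch below computes one widely used objective, the InfoNCE loss, for a single query. The choice of InfoNCE, the temperature, and the toy vectors are assumptions for illustration; the text above says only that contrastive learning was used.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(
    query: list[float],
    positive: list[float],
    negatives: list[list[float]],
    temperature: float = 0.05,
) -> float:
    """Negative log-probability of the positive passage under a softmax
    over similarities to the positive and all negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[0]
```

Training lowers this loss by pulling a query's embedding toward the passage that answers it and pushing it away from passages that do not. This is how domain-specific phrasing becomes retrievable.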

Results the Client Achieved

The results were measured over the first four months after go-live, against a clear baseline captured during discovery. The outcomes were decisive and verifiable. This was a completed engagement with confirmed business impact, not a projection.

The headline metrics were as follows:

Time to answer fell from an average of 8.5 minutes of cross-system searching to under 14 seconds, a reduction of roughly 97 percent.
Support ticket deflection reached 43 percent, as employees self-served answers that previously required a help desk escalation.
Onboarding ramp time for new hires dropped by 38 percent, because institutional knowledge became instantly accessible.
Retrieval precision reached 0.91 at the top five results, with answer groundedness measured at 94 percent against the evaluation set.
System performance held p95 latency under 2.7 seconds even under peak concurrent load.
Adoption climbed to 81 percent monthly active users across the eligible workforce within four months of launch.
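Two of these figures can be checked or reproduced with a few lines of arithmetic. The sketch below verifies the time-to-answer reduction from the numbers above and shows how precision at the top five results is computed for a single query. The document ids in the usage example are invented.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100


def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved ids that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


if __name__ == "__main__":
    # 8.5 minutes is 510 seconds; 14 seconds is the upper bound quoted above.
    print(round(percent_reduction(8.5 * 60, 14), 1))  # 97.3
    print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "b", "c", "d"}))  # 0.8
```

A reported precision such as 0.91 is the average of this per-query value across the evaluation set.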

The before and after contrast was stark. Before the engagement, an employee faced six systems, fragmented results, and a search tool nobody trusted. After go-live, that same employee asked one question and received a cited, governed answer in seconds. The direct gain in knowledge worker productivity translated into thousands of recovered hours every week. Senior experts were freed from repetitive questions to focus on the work only they could do.

What This Architecture Makes Possible Next

The architecture was built to scale and to extend, which is exactly why these results are durable. As data volume grows, the ingestion pipeline scales horizontally, and the Qdrant index handles new chunks through incremental upserts without downtime. The vector database search layer was sized with significant headroom, so the system absorbs growth in both corpus and query traffic. Nothing about adding more knowledge requires a rebuild.

New use cases sit naturally on the same foundation. Because the retrieval core is decoupled from the delivery surfaces, the client can add new agentic workflows by defining new tools, not new systems. A finance reconciliation assistant or a compliance review helper can reuse the same governed retrieval and security layers. KriraAI deliberately designed the foundation so that each new capability compounds on the last rather than starting from zero.

The client's roadmap for the next two to three years builds directly on this base. They plan to extend the assistant into proactive workflows, multilingual support for global teams, and deeper integration into core business systems. Other organizations in the enterprise software space can apply the same core lessons from this enterprise AI assistant implementation. Retrieve fresh governed context rather than fine-tuning facts into weights, enforce permissions at retrieval time, and rerank before you generate. Those three principles transfer to almost any enterprise knowledge problem.

Conclusion

Three insights from this engagement matter most. The technical insight is that retrieval beats memorization for enterprise knowledge, and a retrieve-then-rerank pipeline with grounding is what separates a production enterprise RAG system from a fragile demo. The operational insight is that access control must be enforced at retrieval time, because trust and security are the real adoption gates, not model quality alone. The strategic insight is that a well-designed foundation compounds, letting new use cases reuse the same governed core instead of starting over.

KriraAI brings this same level of engineering rigor and delivery discipline to every client engagement. We design production systems, not pilots, and we make every technology choice deliberately against real scale, security, and cost constraints. The enterprise AI assistant we delivered turned an invisible knowledge estate into an instant, trusted, and governed answer engine. If your organization is wrestling with the same fragmentation, slow decisions, and untapped institutional knowledge, bring that challenge to KriraAI and let us architect the system that solves it.

FAQs

An enterprise AI assistant is a governed conversational system that lets employees retrieve trusted answers from across all their internal knowledge in natural language. It works using a retrieval-augmented generation architecture rather than a fine-tuned chatbot. When a question arrives, the system retrieves relevant, permission-filtered passages from a vector database, reranks them for true relevance, and passes only that evidence to a large language model for synthesis. The model answers strictly from the retrieved context and cites its sources. This design keeps answers current, accurate, and grounded in the organization's real documents rather than the model's pretraining.

A production enterprise AI assistant implementation typically runs across five structured phases covering discovery, architecture design, development, validation, and deployment. The realistic timeline depends heavily on the number of source systems, the complexity of the access control models, and the quality of the underlying data. In our engagement, the hardest and longest phase was ingestion, because legacy knowledge estates are full of duplicates, stale documents, and inconsistent permissions. A disciplined team that solves data quality and access control early can reach a hardened production launch efficiently, then measure real business results within the first few months after go-live.

An enterprise AI assistant keeps data secure by enforcing access control at retrieval time, not as an afterthought. In our system, every document chunk inherits the access control list from its source, so the assistant can only ever retrieve passages the requesting user is already permitted to read. We added role-based access control with attribute-level masking, end-to-end encryption of inputs and outputs, and immutable append-only audit logging. The entire system ran inside a private VPC with no public endpoints and was designed to meet SOC 2 and GDPR requirements. This makes data leakage architecturally impossible rather than merely discouraged by policy.

The difference between a basic chatbot and an enterprise AI assistant is depth, governance, and grounding. A basic chatbot answers from a fixed script or from a model's pretrained memory, which goes stale immediately and cannot respect who is allowed to see what. An enterprise AI assistant built as a RAG system retrieves fresh, permission-filtered evidence at query time and synthesizes a cited answer from it. It enforces security at retrieval, refuses to answer beyond its evidence, and can take governed actions through tool-calling. In our engagement, this is precisely why the assistant earned employee trust where the previous keyword search tool had failed completely.

A well-engineered enterprise AI assistant achieves high accuracy when it combines strong retrieval with strict grounding. In our system, retrieval precision reached 0.91 at the top five results and answer groundedness was measured at 94 percent against a held-out evaluation set. Hallucination was prevented by constraining the language model to answer only from retrieved evidence and enforcing inline citations for every claim. When the retrieved context was insufficient, the assistant explicitly said so and offered escalation rather than guessing. Continuous evaluation with RAGAS metrics and drift detection kept accuracy stable as the underlying knowledge changed over time.

Krushang Mandani is the CTO at KriraAI, driving innovation in AI-powered voice and automation solutions. He shares practical insights on conversational AI, business automation, and scalable tech strategies.
