Cohesity Redefines RAG Architecture by Eliminating Data Movement and Strengthening Security

As enterprises accelerate their adoption of generative AI, the challenge is no longer accessing powerful models; it is securely connecting those models to trusted enterprise data. Organizations are increasingly seeking ways to unlock the value of their vast data estates without creating new security, governance, or compliance risks. In an exclusive conversation with AI Reporter America, Greg Statton, Office of the CTO – Data & AI, Cohesity, discussed how the company's patented retrieval-augmented generation (RAG) approach leverages secondary data as a secure foundation for enterprise AI, enabling organizations to extract greater intelligence from their data while strengthening cyber resilience and maintaining strict governance controls.

What strategic need drove Cohesity to develop a RAG platform built directly on secondary data?

The need came from a gap we kept seeing on both sides of our business. On one side, enterprises were sitting on enormous volumes of secondary data: years of files, emails, databases, and virtual machines captured in their backup and recovery environments. That data is the most complete and trustworthy record an organization has of its own information and institutional knowledge, yet historically it did nothing but wait for a restore. It was treated as an insurance policy, not an asset.

On the other side, those same enterprises were racing to adopt generative AI and discovering that the hard part was never the model. The hard part was getting their own data safely into the model. Most retrieval-augmented generation projects required teams to extract and copy data into a separate AI stack: a new vector database, a new pipeline, a new environment to secure. That created fresh silos, multiplied copies of sensitive information, and pulled data outside the governance and access controls that already protected it.

We recognized that the corpus most valuable to enterprise AI already lived inside the backup estate, already aggregated and governed in one place. So the strategic question became: why move it at all? Why not bring the AI to the data, rather than the data to the AI?

That insight is what drove us to build a RAG semantic layer directly on secondary data and to invest in the foundational engineering behind it, work now reflected in this patent. The goal was to convert dormant recovery data into a governed, searchable knowledge source for GenAI without forcing customers to duplicate sensitive information or widen their exposure. It lets us solve an AI problem and a data-security problem with a single architecture, which is exactly the intersection where Cohesity operates.

How does this patented approach differ from traditional enterprise AI architectures that rely on data movement?

The defining difference is where the intelligence is built. A conventional enterprise RAG pipeline follows a copy-first pattern: identify your sources, extract the data, ship it into a separate AI environment, generate embeddings there, and query against that new store. Every one of those copies becomes another place that has to be secured, another set of permissions that can drift out of sync with the original, and another expansion of the attack surface. The architecture is optimized around the AI tooling, and the data's security posture is something you try to reconstruct afterward.

Our patented approach inverts that. The patent, "Data Retrieval Using Embeddings for Data in Backup Systems," covers generating embeddings for data that already resides in backup systems and retrieving against it in place. We build the RAG semantic layer directly on the secondary data platform, so the data never has to be moved into a separate AI infrastructure to be useful to a large language model.

Because nothing moves, the data continues to inherit the security, governance, compliance, and access controls that already protect the backup environment. There is no second copy to harden, no shadow data store running under looser rules, no governance gap opening up between the source and the AI layer. You reduce data sprawl rather than add to it, and you shrink the attack surface rather than expand it.

Put simply, data-movement architectures optimize for the convenience of the AI tool and treat security as a downstream concern. Ours treats the data's protection as the starting point and makes AI a capability that operates within that perimeter. That is why we describe it as a security-first architecture for enterprise AI, and it is the distinction the patent recognizes. Cohesity is the first data protection vendor to patent applying GenAI to secondary data this way.

What advantages does secondary data offer as a foundation for GenAI applications?

Several, and they compound. The first is completeness. Backup and recovery data captures the full estate over time — not only what is live in production today, but historical states, prior versions, and content that may have changed or been removed from primary systems. For an AI system trying to reason over an organization's knowledge, that breadth and time depth is uniquely rich. It is the most complete repository of enterprise information that most organizations have.

The second advantage is that this data is already consolidated and deduplicated on a single platform. A backup environment exists precisely to aggregate information from across many sources (files, email, databases, virtual machines) into a managed, indexed repository. That makes it a natural single substrate for retrieval, rather than having to stand up and maintain dozens of fragile connectors to individual production systems.

The third advantage is the existing governance. Secondary data typically carries retention rules, access controls, and immutability protections. When you build AI on top of that foundation, you inherit those controls instead of trying to recreate them in a new environment. Governance becomes a property you start with, not a feature you bolt on later.

The fourth is operational safety. Running AI workloads against the secondary copy means you are not querying or straining production systems. There is no performance penalty on the live applications the business depends on.

And finally, there is trust. Because this data is held as protected, often immutable, known-good copies, it is clean and reliable, which matters enormously when you are grounding AI answers and when that same data underpins cyber resilience. The result is a foundation that is comprehensive, centralized, governed, nondisruptive, and trustworthy all at once.

How does the platform balance AI accessibility with security, governance, and compliance requirements?

The principle we built around is that AI should inherit the existing controls, never bypass them. The same access permissions, retention policies, and data classifications that govern the secondary data continue to govern what the AI can retrieve and surface. If a user is not entitled to see a piece of information in the underlying environment, they do not see it through Cohesity Gaia either. Accessibility is expanded for the right people, not opened up indiscriminately.
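For readers who want a concrete picture of "inherit the controls, never bypass them," the following is a minimal sketch of permission-aware retrieval in general terms. It is not Cohesity's implementation: the `Chunk` type, the `retrieve` function, the group names, and the three-dimensional toy embeddings are all hypothetical and chosen only for illustration. The point it shows is the ordering: entitlements are checked before similarity ranking, so content a user cannot see is never a candidate.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical index entry: an embedding plus the ACL inherited from the source."""
    doc_id: str
    text: str
    embedding: tuple[float, ...]
    allowed_groups: frozenset[str]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(index, query_embedding, user_groups, top_k=3):
    """Filter by entitlement first, then rank what remains by similarity."""
    visible = [c for c in index if c.allowed_groups & user_groups]
    ranked = sorted(
        visible,
        key=lambda c: cosine(c.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data (assumed, not real): embeddings are hand-written, not model output.
INDEX = [
    Chunk("hr-001", "Salary bands", (0.9, 0.1, 0.0), frozenset({"hr"})),
    Chunk("eng-007", "Deployment runbook", (0.1, 0.9, 0.0), frozenset({"engineering", "hr"})),
    Chunk("fin-042", "Quarterly forecast", (0.8, 0.2, 0.1), frozenset({"finance"})),
]

query = (1.0, 0.0, 0.0)
print([c.doc_id for c in retrieve(INDEX, query, frozenset({"engineering"}))])
print([c.doc_id for c in retrieve(INDEX, query, frozenset({"hr"}))])
```

In this sketch an engineering user gets only the runbook, even though the salary document is a closer match to the query, because the filter runs before the ranking rather than after it.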

Keeping data in place is central to that balance. Because we do not replicate sensitive data into a separate AI infrastructure, there is no new copy operating under weaker rules and no parallel store for an attacker or an audit to worry about. The governance perimeter the organization already trusts is the perimeter the AI operates inside. That is what lets security and accessibility coexist rather than trade off against each other.

This matters most acutely in regulated and sovereignty-sensitive settings. One of our customers evaluated enterprise AI approaches with sovereign, on-premises control and preservation of their security posture as non-negotiable requirements, and found that using their existing backup data as the foundation was the only approach that made AI viable on those terms. That is a good illustration of the point. When the architecture respects existing controls by design, compliance stops being the obstacle to AI and becomes part of how it is delivered.

So the balance is not achieved by adding guardrails after the fact. It comes from the design itself: data that stays protected and in place, AI that operates within established access boundaries, and a clear, auditable line between what the model can reach and what an individual user is permitted to see.

How do you see RAG evolving as enterprises seek to unlock value from existing data assets?

The first wave of generative AI in the enterprise was about proving the technology worked at all. The current wave is about grounding it — making AI answer from an organization's own proprietary, trustworthy information rather than from general knowledge. RAG is the bridge that makes that possible, and it is rapidly moving from an experiment to core infrastructure.

The clearest trajectory we see is a shift from "copy the data to the AI" toward "bring the AI to the data where it already lives." As organizations confront the cost, risk, and governance burden of duplicating data into purpose-built AI stacks, the appeal of operating against an existing system of record grows. The retrieval substrate increasingly becomes the data platform the enterprise already manages, rather than yet another environment.

We also expect RAG to become more autonomous and agentic. Rather than a single question-and-answer step, AI systems will chain retrieval across multiple sources and data types, such as emails, documents, database records, and the contents of virtual machines, to complete multi-step workflows. That raises the bar on the underlying corpus: it has to be broad, current, and well-governed.

Alongside that, provenance and lineage will become table stakes. Enterprises will demand to know where an answer came from, whether the requesting user was permitted to see the source, and how to audit the whole chain. Questions of "who was allowed to see this" will be as important as "what is the answer."
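One way to picture that kind of audit trail is a record that captures, for each answer, who asked, which sources were cited, and which candidate sources were withheld on permission grounds. The sketch below is purely illustrative: `SourceRef`, `AuditRecord`, `build_audit_record`, and the field names are assumptions, not part of any product described in the interview.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceRef:
    """Hypothetical pointer to a retrieved source and the groups allowed to read it."""
    doc_id: str
    snapshot_id: str
    allowed_groups: frozenset


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical per-answer audit entry."""
    user: str
    question: str
    cited: tuple      # doc_ids the user was entitled to see
    withheld: tuple   # doc_ids excluded by the permission check
    timestamp: str    # UTC, ISO 8601


def build_audit_record(user, user_groups, question, candidates):
    """Split candidate sources by entitlement and record both sides."""
    cited = tuple(s.doc_id for s in candidates if s.allowed_groups & user_groups)
    withheld = tuple(s.doc_id for s in candidates if not (s.allowed_groups & user_groups))
    stamp = datetime.now(timezone.utc).isoformat()
    return AuditRecord(user, question, cited, withheld, stamp)


candidates = [
    SourceRef("eng-007", "snap-2024-01", frozenset({"engineering"})),
    SourceRef("hr-001", "snap-2024-01", frozenset({"hr"})),
]
record = build_audit_record("alice", frozenset({"engineering"}), "How do we deploy?", candidates)
print(record.cited, record.withheld)
```

Recording the withheld sources alongside the cited ones is a design choice in this sketch: it lets an auditor answer "who was allowed to see this" without re-running the query.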

Finally, cyber resilience and AI will converge. The protected, immutable data that defends an organization against ransomware is the same trustworthy corpus that should ground its AI. Over time, those will be understood as two uses of a single well-governed data foundation rather than as separate initiatives.

What challenges remain in making enterprise data AI-ready without increasing risk or complexity?

Several real challenges remain, and we would be the first to say the work is ongoing. The most pervasive is data quality. Enterprise data is messy: unstructured, duplicated, inconsistently formatted, and full of stale or contradictory material. Making it genuinely useful to AI requires consolidation, deduplication, and a way to favor current, relevant content over noise. A consolidated secondary data platform helps with this, but curation never fully goes away.

The hardest security challenge is permission fidelity at scale — what we call permission-aware retrieval. Ensuring that AI honors every entitlement, especially when the original permissions across source systems are complex or inconsistent, is genuinely difficult. Getting this wrong is how organizations accidentally expose information through AI that they would never expose directly, so it has to be engineered carefully rather than assumed.

Classification is closely related. Before sensitive or regulated data, such as PII, financial records, and health information, is made available to AI, an organization needs to know where it is and how it should be handled. That discovery and labeling work is substantial in most enterprises.

There are also the well-known model-side issues: grounding answers reliably, avoiding hallucination, and providing clear provenance for every response. And there is the practical matter of scale and freshness: generating and maintaining embeddings across very large corpora, and keeping them current as data changes, without runaway cost.
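A common general technique for the freshness-versus-cost problem is to re-embed only content that has actually changed, detected by comparing content hashes between snapshots. The sketch below illustrates that idea under stated assumptions: `refresh` and `fake_embed` are hypothetical names, `fake_embed` is a stand-in that derives numbers from a hash and is not a real embedding model, and nothing here is drawn from the patent or from Cohesity's product.

```python
import hashlib


def fake_embed(text):
    """Placeholder for a real embedding model (assumption: NOT semantically meaningful)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:4]]


def refresh(store, snapshot, embed=fake_embed):
    """Bring `store` in line with `snapshot`, embedding only new or changed documents.

    store:    dict mapping doc_id -> (content_hash, embedding); mutated in place
    snapshot: dict mapping doc_id -> current text
    Returns the list of doc_ids that were (re-)embedded.
    """
    reembedded = []
    for doc_id, text in snapshot.items():
        content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cached = store.get(doc_id)
        if cached is None or cached[0] != content_hash:
            store[doc_id] = (content_hash, embed(text))
            reembedded.append(doc_id)
    # Drop embeddings for documents that no longer exist in the snapshot.
    for doc_id in list(store):
        if doc_id not in snapshot:
            del store[doc_id]
    return reembedded


store = {}
print(refresh(store, {"a": "policy v1", "b": "runbook v1"}))  # both are new
print(refresh(store, {"a": "policy v1", "b": "runbook v2"}))  # only "b" changed
```

The embedding call is typically the expensive step, so skipping unchanged documents is where the cost saving comes from in this kind of design.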

Our architecture is designed to reduce rather than add to these risks, because it does not move data and it inherits existing controls. But classification, permission fidelity, freshness, and scale are industry-wide problems that demand continued engineering investment. The answer is not to slow AI adoption; it is to make the underlying data foundation governed and resilient enough that adoption does not come at the cost of new exposure.

How does this patent strengthen Cohesity's long-term vision for AI-powered data management and cyber resilience?

This patent validates and protects the core architectural bet at the center of our strategy: that enterprise AI should run directly on secondary data, in place, with security first. Being the first data protection vendor to patent this approach gives us defensible, differentiated ground at exactly the intersection where we believe the market is heading. And the fact that the inventors span our engineering and executive leadership, including our CEO, reflects how foundational we consider this to be: it is not a side project, it is central to where the company is going.

More importantly, it ties together the two halves of our mission. Cohesity exists to keep data secure and recoverable, and to help organizations extract value from it. For a long time, those felt like distinct goals. This innovation unifies them. The protected, immutable, governed recovery data that defends an organization against ransomware and lets it bounce back from an attack becomes the same trusted knowledge source that grounds its AI. One data estate delivers both cyber resilience and enterprise intelligence, without the customer having to maintain two separate worlds.

That has real strategic implications. It means every investment a customer makes in strengthening their data protection posture also makes their data more AI-ready, and every step toward AI-driven insight is built on a resilient, governed foundation rather than undermining it. The two reinforce each other.

Looking ahead, the patent anchors a platform we will continue to extend by deepening Cohesity Gaia, expanding the data types and workflows it supports, and moving toward more agentic capabilities — all while preserving the in-place, governed model that makes it safe. Our long-term vision is for the backup estate to be recognized as the secure foundation for enterprise AI, and this patent is a meaningful marker on that path.