
Data enrichment is the foundation for the AI era
How we built a data enrichment pipeline inside a high-assurance enterprise using Elasticsearch, Python, and Kestra. Clean, connected data is the real foundation for AI.
Inside a high-assurance environment, we're building an enterprise search platform designed to help employees find information faster and more reliably. To get there, we need a deep understanding of the organisation's data landscape: across teams, across systems, and everything in between.
Very quickly we realised something fundamental.
Most organisational data is messy. Outdated pages. Broken links. Duplicate content. Unmaintained systems. You name it, we encountered it.
So the question became:
How do we make this data usable, trustworthy, and intelligent?
One of our answers was systematic data enrichment, powered by Elasticsearch, Python, and Kestra.
From raw data to enriched knowledge
We ingest data from multiple internal sources, always respecting the source system's security boundaries (RBAC, ABAC, and all the policies that come with working in a highly regulated environment).
But ingestion is only the first step.
Once the data lands in Elasticsearch, we run a series of enrichment tasks orchestrated by Kestra, a workflow engine that lets us schedule and run isolated Python jobs inside containerised environments.
This is where things get interesting.
Discovering references across the organisation
Our first enrichment task scans documents for outbound links and writes every referenced URL into a dedicated Elasticsearch index, one document per URL with a reference back to its source.
This transforms our initial dataset into a much more interconnected, graph-like structure:
- Each source page becomes a node.
- Every outgoing link on a page becomes an edge.
- The full reference network becomes queryable and analysable.
By reusing the data we already collect, we create entirely new insights without touching the source systems again.
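To make the reference scanner concrete, here is a minimal sketch of how outbound links could be turned into one record per URL. It assumes the ingested documents carry raw HTML. The field names (`source_id`, `source_url`, `target_url`) and the function name are hypothetical, chosen for illustration rather than taken from the actual pipeline.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_reference_docs(source_id, source_url, html):
    """Return one reference record per distinct outbound HTTP(S) URL on a page."""
    parser = LinkCollector()
    parser.feed(html)
    seen = set()
    docs = []
    for href in parser.hrefs:
        # Resolve relative links and drop #fragments so duplicates collapse.
        absolute, _fragment = urldefrag(urljoin(source_url, href))
        if not absolute.startswith(("http://", "https://")) or absolute in seen:
            continue
        seen.add(absolute)
        # Each record is an edge in the graph: source page -> target URL.
        docs.append(
            {"source_id": source_id, "source_url": source_url, "target_url": absolute}
        )
    return docs
```

Each returned record is shaped so that it could be indexed as its own document, which is what keeps the reference network queryable from either end of a link.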
Detecting broken links at scale
A second Kestra task picks up the output of the reference scanner. For every URL, it checks whether the page still responds, or whether it has gone stale with a 404 or other error. The check is simple in isolation, but run across millions of URLs it gives a detailed picture of link health across the organisation.
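A check like this can be sketched with the standard library alone. The version below is an illustration under assumptions: it uses a HEAD request, and the health labels (`ok`, `gone`, `error`, `unreachable`) are made up for this example. The fetch function is injectable so the classification logic can be exercised without network access.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Issue a HEAD request; return the HTTP status code, or None on failure."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx responses are raised as exceptions
    except (OSError, ValueError):
        return None  # DNS failure, refused connection, timeout, malformed URL


def classify(status):
    """Map a status code to a coarse link-health label (labels are illustrative)."""
    if status is None:
        return "unreachable"
    if status < 400:
        return "ok"
    if status in (404, 410):
        return "gone"
    return "error"


def check_link(url, fetch=fetch_status):
    """Return a small record describing the current health of one URL."""
    status = fetch(url)
    return {"target_url": url, "status": status, "health": classify(status)}
```

Some servers reject HEAD requests, so a production check might fall back to GET before declaring a link broken; that refinement is left out here to keep the sketch short.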
Why does this matter? Because in a large enterprise, things move fast:
- Teams reorganise.
- Systems get decommissioned.
- Documentation gets forgotten.
- Ownership shifts.
Before you know it, thousands of documents are pointing to pages that no longer exist.
With our enrichment pipeline, we can automatically detect which pages contain large numbers of broken links and feed that signal directly back to system owners.
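As an illustration of that signal, the sketch below rolls per-link results up into a per-page report. It assumes each link record has a `source_url` and a `health` field where anything other than `"ok"` counts as broken; both names are hypothetical.

```python
from collections import Counter


def broken_link_report(link_docs, min_broken=1):
    """Summarise broken links per source page, worst pages first."""
    total = Counter()
    broken = Counter()
    for doc in link_docs:
        total[doc["source_url"]] += 1
        if doc["health"] != "ok":
            broken[doc["source_url"]] += 1

    report = [
        {
            "source_url": url,
            "broken": count,
            "total": total[url],
            "broken_ratio": count / total[url],
        }
        for url, count in broken.items()
        if count >= min_broken
    ]
    # Sort by broken count (descending), then by URL for a stable order.
    return sorted(report, key=lambda row: (-row["broken"], row["source_url"]))
```

At real scale the same roll-up would more likely be expressed as an aggregation inside Elasticsearch; the in-memory version simply shows the shape of the output a system owner would receive.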
This, on its own, already delivers huge value. System owners often lose oversight in such a large and constantly evolving landscape. By surfacing these issues proactively, we help them regain control, maintain healthier systems, and focus their time where it matters most.
Feeding better data back into the search engine
But it doesn't stop with reporting. We can use these enrichment signals to improve ranking inside our search engine.
For example, the following pages may not be the most reliable sources of truth:
- Pages with many broken references.
- Pages that consistently link to outdated content.
- Pages that rarely get updated.
By incorporating data quality indicators into our scoring logic, we move gradually towards a health-aware search engine. One that doesn't just return results, but understands the trustworthiness of those results.
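One way such indicators could enter the scoring logic is sketched below. It is not the scoring used in the platform: the weights, the `health_score` field, and the `content` field are all assumptions made for illustration. The first function folds two signals into a single multiplier, and the second builds an Elasticsearch `function_score` query body that multiplies text relevance by that stored value.

```python
def health_score(broken_ratio, days_since_update, stale_after_days=365):
    """Combine two quality signals into a multiplier between 0.25 and 1.0.

    The weights (0.5 and 0.25) are illustrative, not tuned values.
    """
    link_penalty = min(max(broken_ratio, 0.0), 1.0)
    age_penalty = min(max(days_since_update / stale_after_days, 0.0), 1.0)
    return 1.0 - 0.5 * link_penalty - 0.25 * age_penalty


def health_aware_query(text, field="content"):
    """Build a query body that scales relevance by each document's health_score."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {field: text}},
                # Documents without a stored score are treated as healthy.
                "field_value_factor": {"field": "health_score", "missing": 1.0},
                "boost_mode": "multiply",
            }
        }
    }
```

With these weights, a page whose links are all broken and that has not been updated for a year keeps a quarter of its relevance score, so it is demoted rather than hidden.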
Performance at enterprise scale
All of this runs in minutes. Why? Because each Python task:
- Retrieves millions of documents using the Elasticsearch scroll API.
- Processes data in parallel using Python thread pools.
- Writes enriched data back using the Elasticsearch bulk API.
- Runs inside isolated Kestra containers with predictable performance.
That combination lets us process millions of URLs in minutes.
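The pattern behind those steps can be sketched with the standard library. In the example below, the input is any iterable of hits shaped like Elasticsearch search results (`_id` plus `_source`), and the output is a stream of bulk-style index actions. The function names, worker count, and batch size are assumptions for illustration, not the pipeline's actual settings.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items so memory stays bounded on large inputs."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch


def enrich_in_parallel(docs, enrich, index, workers=16, batch_size=500):
    """Yield one bulk-style index action per input document.

    docs   : iterable of hits, each a dict with "_id" and "_source"
    enrich : function taking a hit and returning a dict of extra fields;
             typically I/O-bound, which is where threads help
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(docs, batch_size):
            # pool.map runs enrich concurrently and preserves input order.
            for doc, extra in zip(batch, pool.map(enrich, batch)):
                yield {
                    "_index": index,
                    "_id": doc["_id"],
                    "_source": {**doc["_source"], **extra},
                }
```

In a real task, `docs` could come from `elasticsearch.helpers.scan` and the resulting generator could be handed to `elasticsearch.helpers.bulk`, so documents stream from scroll to bulk write without ever being held in memory all at once.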
Data enrichment is the foundation for the AI era
As AI and agentic systems become standard in enterprise environments, the importance of clean, connected, contextualised data grows with them.
LLMs don't magically fix data quality. We have to engineer the foundation. Our enrichment pipelines are a step in that direction: continuously improving data, discovering structure, surfacing issues, and creating new signals that make downstream AI tools smarter.