
How AI unlocks subsurface productivity

Written by Editorial staff | Feb 18, 2026 7:45:12 AM

By Dr. Thibaud Freyd & Dr. Raphael Peltzer

A geoscientist’s morning

Picture a typical morning: a geoscientist needs reservoir pressure data from the Maastrichtian interval for the wells in the Johan Sverdrup field. Should be simple, right? She fires up her search tool. Types the query. Gets back PDFs with “Johan Sverdrup” in the filename. Some scanned documents from the 90s. Some daily drilling reports where “pressure” gets mentioned somewhere, about something else entirely. None give her what she needs.

So the manual hunt begins. Open PDF. Search for “Maastrichtian.” Scroll. Find a table. Copy numbers. Open next PDF. Repeat. Two hours later, she’s got… something. Maybe the right answer. She definitely missed a tiny footnote in a 2018 report mentioning anomalous pressure readings during a specific drilling phase. That footnote might have saved her team weeks of work.

This isn’t about bad technology. It’s a mismatch between how we store information and how we need to use it. In oil and gas, we’ve done a great job digitizing structured data: well logs, seismic traces, production numbers. Clean, organized, accessible with OSDU, for instance. But the real knowledge - the explanations, the context, the “why did that work?” - is still stuck in what we call “text soup”: piles of PDFs, scanned reports, legacy documents that no search engine really understands.

Why search tools get it wrong

Traditional keyword search fails here. It doesn’t understand that “reservoir quality” relates to “porosity” across hundreds of reports using different terms. It doesn’t get that section “3.1 Core Analysis” goes with the tables and figures below it. It can’t tell casual mentions from detailed analyses.
Most “chat with your PDF” demos make the same mistake: they treat documents as flat text. But subsurface documents aren’t flat! They have structure: headings, subheadings, tables that span pages, rotated headers, and figures that actually matter.

When Equinor’s CIO said employees spend 80% of their time searching for data instead of analyzing it, she wasn’t talking about a technology problem. She was describing cognitive overload. Engineers aren’t analyzing data; they’re playing archaeological detective with fragments of information scattered everywhere.

Reassembling the puzzle first

We tried something different: rebuild the document before searching it. Instead of pulling text out and hoping for the best, we rebuild the document’s logical and semantic structure first - treating headings, sections, tables, and figures as containers of meaning. Context matters: a porosity value in the “Core Analysis” section of a well report gives insight, not just data. But structure alone wasn’t enough; we needed search that understood both meaning and specific terms. So we used hybrid retrieval: meaning-based vector search (“reservoir quality” ≈ “porosity”) plus exact keyword matching with BM25 plus synonym expansion (“Christmas Tree” = “Flow Control Assembly”). We fused these results using Reciprocal Rank Fusion (RRF) to balance relevance, then a semantic reranker prioritized the most contextually accurate matches. Now “pressure trends in the Maastrichtian interval for Johan Sverdrup wells” finds what you need, even when different reports use different wording.
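The RRF fusion step can be sketched in a few lines. The document IDs and the k=60 constant below are illustrative (k=60 is the value from the original RRF paper), not Cegal’s production code:

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it
    appears in, so documents ranked well by BOTH vector search
    and BM25 rise to the top.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Vector search and BM25 disagree on order; RRF rewards agreement.
vector_hits = ["core_3_1", "rep_2018", "ddr_045"]   # semantic ranking
bm25_hits = ["core_3_1", "press_log", "rep_2018"]   # keyword ranking
fused = rrf_fuse([vector_hits, bm25_hits])
```

The fused list then goes to the semantic reranker; RRF needs no score calibration between the two retrievers, which is why it works well for fusing heterogeneous rankings.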

Smart chunking actually made all the difference

Most systems break documents into fixed-size text chunks with overlap, ignoring that a table might start on one page and finish on the next. Our header-aware chunking walks the reconstructed document tree and preserves entire sections as single chunks when they fit, so that “3.1 Core Analysis” stays with its data. When reconstruction fails due to poor Optical Character Recognition (OCR), we fall back to naive chunking with overlap so no text is lost. The chunks are then passed to an embedding model (e.g., OpenAI’s text-embedding-3-large) to generate semantic embeddings.
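A minimal sketch of that header-aware walk with the naive fallback; the tree node shape and the 2,000-character budget are assumptions for illustration, not the production pipeline:

```python
MAX_CHUNK_CHARS = 2000  # assumed per-chunk budget before embedding

def section_text(node):
    """Flatten a section node (heading, body, subsections) into text."""
    parts = [node["heading"], node["body"]]
    parts += [section_text(c) for c in node.get("children", [])]
    return "\n".join(parts)

def naive_chunks(text, size=1500, overlap=200):
    """Fallback when OCR left no usable structure to walk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def chunk_section(node, chunks=None):
    """Walk the document tree; keep whole sections together when they fit."""
    chunks = [] if chunks is None else chunks
    text = section_text(node)
    if len(text) <= MAX_CHUNK_CHARS:
        chunks.append(text)               # e.g. "3.1 Core Analysis" stays with its table
    elif node.get("children"):
        for child in node["children"]:
            chunk_section(child, chunks)  # too big: recurse into subsections
    else:
        chunks.extend(naive_chunks(text)) # oversized leaf: overlap fallback
    return chunks

tree = {
    "heading": "3 Results", "body": "",
    "children": [
        {"heading": "3.1 Core Analysis", "body": "porosity 0.21 measured on 12 plugs"},
        {"heading": "3.2 Logs", "body": "x" * 3000},  # oversized leaf section
    ],
}
chunks = chunk_section(tree)
```

Note how the section heading travels with its body into the chunk, so a hit on “porosity 0.21” comes back already labelled as core-analysis data.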

Security first, not later

In oil and gas, data breaches aren’t just about money. They’re about safety protocols and decades of competitive advantage.
Most “chat with PDF” systems take a risky shortcut: search first, filter after. They run broad semantic vector searches, then apply security filters to the top-K results. This creates “silent leakage”: sensitive data enters the retrieval pipeline before access checks, even if it is later discarded. Plus, if you request the top ten results and eight get filtered out, users see poor results and assume the whole solution is garbage.
We flipped the order: identity, then access, then search. Before any vector similarity search, we validate user identity via Entra ID, map effective group membership to Open Subsurface Data Universe (OSDU) Access Control Lists (ACLs), and apply OData filters to narrow the search universe to authorized content only. Unauthorized documents never even get considered.
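The pre-search step might look roughly like this. The group names, ACL map, and the OData-style filter string are invented for illustration; in the setup described above, identity comes from Entra ID and the ACLs from OSDU:

```python
def build_security_filter(user_groups, acl_index):
    """Map the user's effective groups to authorized ACL tags *before*
    any similarity search, so unauthorized chunks are never even scored."""
    allowed = sorted(tag for tag, groups in acl_index.items()
                     if groups & set(user_groups))
    # Illustrative OData-style pre-filter applied ahead of vector search
    return " or ".join(f"acl/viewers eq '{tag}'" for tag in allowed)

# Hypothetical ACL index: tag -> groups entitled to view it
acl_index = {
    "data.jsv.viewers": {"geo-team", "drilling"},
    "data.restricted.viewers": {"executives"},
}
flt = build_security_filter(["geo-team"], acl_index)
```

Because the filter narrows the index before similarity scoring, a request for the top ten results returns ten authorized results, not ten minus whatever a post-filter throws away.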

Latency stays low because we never rank documents that would only be discarded before the answer is generated. Every query leaves an audit trail showing who accessed what, when, and with what permissions. We keep a “shadow index” - just chunks, embeddings, and metadata with signed pointers back to the real OSDU records. We never store original documents, and a sync pipeline detects upstream changes and updates the index to prevent data drift. Every answer cites source documents with direct links for verification - no black-box responses.

From finding answers to taking action

This setup enables something more interesting than just better search. We’re working toward Reasoning + Acting (ReAct) systems.
Traditional Retrieval Augmented Generation (RAG) answers “What’s the reservoir pressure?” ReAct plans “How do we optimize production given current reservoir conditions, operational constraints, and past performance?”
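The ReAct loop itself is simple: alternate a reasoning step with a tool call until the model decides it has enough to answer. Below is a minimal sketch with a stubbed model and one hypothetical tool; the names and the step format are ours, not a real agent API:

```python
def react_loop(goal, llm, tools, max_steps=5):
    """Alternate reasoning (Thought) and tool calls (Action/Observation)
    until the model emits a 'finish' action or the step budget runs out."""
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        step = llm(transcript)  # e.g. {"thought": ..., "action": ..., "input": ...}
        transcript += f"Thought: {step['thought']}\n"
        if step["action"] == "finish":
            return step["input"]
        observation = tools[step["action"]](step["input"])
        transcript += (f"Action: {step['action']}({step['input']})\n"
                       f"Observation: {observation}\n")
    return "Stopped: step budget exhausted"

# Stubbed model and tool so the loop runs without an actual LLM.
calls = {"n": 0}
def fake_llm(transcript):
    calls["n"] += 1
    if calls["n"] == 1:
        return {"thought": "Need pressure data first", "action": "search_reports",
                "input": "Maastrichtian pressure, Johan Sverdrup"}
    return {"thought": "Enough context to answer", "action": "finish",
            "input": "Pressure declines ~5 bar/year in the interval"}

answer = react_loop("Optimize production for current reservoir conditions",
                    fake_llm, {"search_reports": lambda q: "3 matching reports"})
```

In a real deployment the tools would be the retrieval pipeline above (with its security filters intact), which is what makes the plan auditable end to end.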

This opens possibilities like well-planning agents that pull geological context from unstructured reports and generate plans, compliance agents monitoring regulations and flagging issues, and anomaly-detection agents combining sensor data with incident reports to predict problems before they escalate. This is what happens when a system actually understands content instead of just indexing it.

Does this actually work?

We tested with 250 subsurface questions curated by domain experts - real questions geoscientists actually ask. We measured precision, recall, and cosine similarity, and used an LLM-as-a-Judge framework to benchmark answers against references.
Compared to baseline methods (naive chunking), we saw up to 20% better accuracy in finding correct answers. Header-aware chunking significantly outperformed fixed-size splits. Hybrid retrieval cut false positives while maintaining recall. Conservative tuning prioritized accuracy over completeness to build actual trust.
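For concreteness, the retrieval side of those metrics reduces to a few lines; the chunk IDs below are made up, and this is not the authors’ evaluation harness:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of retrieved chunk IDs against the
    expert-labelled set of chunks that actually answer the question."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 2 of 3 retrieved chunks are relevant,
# but 2 relevant chunks were missed.
p, r = precision_recall(retrieved=["c1", "c2", "c3"],
                        relevant=["c2", "c3", "c7", "c9"])
```

Conservative tuning in this framing means accepting a lower recall to keep precision high - better to miss an answer than to surface a wrong one with confidence.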

When the query is ambiguous (e.g., “Well A” or “Well B”?), the system asks for clarification before searching. This human-in-the-loop approach constrains the vector space to specific assets via strict OData filters, keeping retrieval precise and focused on the documents of interest - helping stop expensive mistakes before they happen.

The real payoff

When people spend less time hunting through documents and more time analyzing them, organizations see productivity gains. Simple models (roles × tasks/month × time saved × fully loaded rate) suggest potentially significant operational savings at scale. The value isn’t just time saved - it’s better decisions made with complete context, spotting patterns hidden across documents, and asking “what if” questions that were too time-consuming before.

The architecture scales naturally thanks to its modular design. The reconstruction pipeline grows with document volumes. The retrieval engine handles increasing query traffic. Entitlement systems adapt to organizational complexity. Like building blocks, you add capacity where needed without rebuilding everything.

The AI landscape changes monthly: new Large Language Models, better embedding models, new frameworks, and new OCR capabilities are constantly emerging. We built the solution for flexibility. Modular components and continuous vendor roadmap monitoring mean we can improve any brick of the solution without tearing everything down.
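A back-of-envelope version of that savings model, with made-up numbers purely for illustration (every figure below is an assumption, not a measured result):

```python
def monthly_savings(headcount, tasks_per_month, hours_saved_per_task, loaded_rate):
    """roles x tasks/month x time saved per task x fully loaded hourly rate."""
    return headcount * tasks_per_month * hours_saved_per_task * loaded_rate

# Hypothetical team: 20 geoscientists, 8 document hunts a month each,
# 1.5 hours saved per hunt, at a fully loaded rate of 120/hour.
savings = monthly_savings(headcount=20, tasks_per_month=8,
                          hours_saved_per_task=1.5, loaded_rate=120)
# 20 * 8 * 1.5 * 120 = 28,800 per month
```

The point of a model this simple is not the number itself but that each factor is easy to measure for your own organization before committing to a rollout.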

Getting from "text soup" to trusted answers

Moving from “text soup” to trusted AI isn’t about replacing people - it’s about giving them tools that actually help. Tools that understand context, respect security boundaries, and show their work. We’ve tackled the big problems with unstructured data in oil and gas: semantic understanding, hybrid retrieval, and security-first design. Today it’s finding documents faster. Tomorrow it’s systems planning wells, checking compliance, spotting operational issues - all while keeping the audit trail and security that energy companies need.

Trusted intelligence that lets people innovate without compromising what matters.