Upload the file.
Use an LLM.
Get perfect structured data.
We built an extraction pipeline for a client processing thousands of financial research reports, and learned why that story doesn’t hold up in production.
LLMs are powerful, but non-deterministic. And that single fact changes how extraction must be designed, evaluated, and continuously improved if you want enterprise-grade results.
This project made one thing clear: human judgment is still the missing ingredient.
In this post, we share how we used LLMs intentionally (not everywhere), built system-level evaluation to surface failure modes, and iterated toward extraction that’s trustworthy at enterprise scale.
We recently completed a project for a financial services firm that aggregates equity research from partners around the world.
On a typical day, the firm receives hundreds of research reports delivered primarily as PDFs.
Each partner uses different layouts, terminology, formatting conventions, and disclosure structures. Some reports follow traditional cover-page layouts; others bury key information deep in narrative text or side panels.
These reports contain highly structured information, including company identifiers, analyst recommendations, pricing, market data, and authorship, yet all of it lives inside unstructured documents.
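To make that target concrete, here is a minimal sketch of the kind of record each report boils down to. The field names are illustrative, not the client's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative only: the structured fields a research report typically carries.
@dataclass
class ReportExtraction:
    company_name: str
    ticker: str                    # e.g. "ACME" or "ACME.L"; formats vary by partner
    recommendation: str            # e.g. "Buy", "Hold", "Sell"; labels vary by partner
    target_price: Optional[float]  # not every report publishes one
    currency: Optional[str]
    analysts: list[str]            # authorship
    report_date: Optional[str]     # ISO date string once normalized
```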
That creates a familiar set of challenges: the volume, the variability, and the business criticality all mean that any solution must work reliably at scale, and must surface its own uncertainty when it fails.
A field is slightly wrong.
A ticker is missing.
A recommendation is misclassified.
Those small errors compound quickly, and without a way to detect, measure, and correct them, extraction becomes untrustworthy… even if it “mostly works.”
To move from experimentation to enterprise-grade extraction, we had to be explicit about where LLMs help, where they hurt, and where humans need to stay in the loop.
Our biggest breakthrough wasn’t a better prompt.
It was an evaluation strategy.
Without a clear way to measure correctness, every change felt subjective, so we grounded the system in human supervision: the gold standard.
The client already had a manual process: humans reviewed reports and produced standardized cover sheets. We used those as our gold standard.
Instead of asking “does this look right?”, we asked: How closely does the extraction match what humans previously produced?
We defined two simple metrics: an exact match, where the extraction reproduced the cover sheet value verbatim, and a near match, where the value was correct but presented differently.
That distinction mattered. A ticker with or without an exchange code. A correct value with inconsistent formatting. These weren't failures, but depending on the context, they might not be production-safe either.
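As a rough illustration, field-level scoring against the cover sheets can look like the sketch below. It assumes extractions and gold cover sheets are flat dicts of field names to strings, and the normalization rules shown are hypothetical examples, not the client's.

```python
def normalize(field: str, value: str) -> str:
    """Illustrative normalization: collapse formatting differences a human
    reviewer would not count as errors."""
    v = value.strip().lower()
    if field == "ticker":
        v = v.split(".")[0]  # treat "ACME.L" and "ACME" as the same ticker
    if field == "recommendation":
        v = {"outperform": "buy", "underperform": "sell"}.get(v, v)  # hypothetical label map
    return v

def score(extracted: dict[str, str], gold: dict[str, str]) -> dict[str, tuple[bool, bool]]:
    """Per field: (exact match, near match after normalization)."""
    results = {}
    for field, gold_value in gold.items():
        value = extracted.get(field, "")
        exact = value == gold_value
        near = normalize(field, value) == normalize(field, gold_value)
        results[field] = (exact, near)
    return results
```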
This let us measure the system instead of eyeballing it. High-level metrics showed when things broke. Manual inspection showed why.
That combination gave us the intuition to fix the system: adjusting prompts, adding normalization, and re-measuring impact with every change.
No guessing.
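Building on the illustrative `score` helper above, the corpus-level loop can be as simple as the following sketch: aggregate per-field accuracy, and set aside the near-but-not-exact cases for manual review.

```python
from collections import defaultdict

def evaluate(pairs):
    """Aggregate field-level accuracy across (extracted, gold) pairs and
    collect near-but-not-exact cases for manual inspection."""
    exact_hits, near_hits, totals = defaultdict(int), defaultdict(int), defaultdict(int)
    to_review = defaultdict(list)
    for extracted, gold in pairs:
        for field, (exact, near) in score(extracted, gold).items():
            totals[field] += 1
            exact_hits[field] += exact
            near_hits[field] += near
            if near and not exact:
                to_review[field].append((extracted.get(field), gold[field]))
    for field in totals:
        print(f"{field:15s} exact={exact_hits[field]/totals[field]:.0%} "
              f"near={near_hits[field]/totals[field]:.0%}")
    return to_review
```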

Once we could see failure modes clearly, the highest-leverage change was a hybrid approach.
Most extraction systems fail because they treat the model like an oracle. We realized quickly that this wouldn’t work.
The split was simple: the LLM handled only what it was uniquely good at, reading messy, variable documents, while deterministic code handled everything that demanded precision and consistency.
For example, once the model extracted the ticker correctly, we stopped asking it to infer details like sector or industry from the surrounding text. Instead, we used a lookup table to map each ticker to standardized attributes. This approach was not only more reliable, it also enforced a consistent taxonomy and eliminated an entire class of avoidable errors.
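Here is a sketch of that lookup step, with a made-up reference table standing in for whatever security master the taxonomy actually comes from.

```python
# Hypothetical reference table; in practice this comes from a maintained
# security master, not from the LLM.
TICKER_ATTRIBUTES = {
    "ACME": {"sector": "Industrials", "industry": "Machinery"},
    "FOOD": {"sector": "Consumer Staples", "industry": "Packaged Foods"},
}

def enrich(extraction: dict) -> dict:
    """Deterministic post-processing: map the LLM-extracted ticker to a
    standardized taxonomy instead of asking the model to infer it."""
    attrs = TICKER_ATTRIBUTES.get(extraction.get("ticker", "").upper())
    if attrs is None:
        extraction["needs_review"] = True  # surface uncertainty instead of guessing
        return extraction
    return {**extraction, **attrs}
```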
The result was far less work for the model and far fewer ways for it to fail.
By shrinking the LLM’s responsibility, we reduced the overall error surface area. Accuracy improved, debugging became tractable, and behavior was predictable across partners and formats.
Our point of view is simple: no tool is going to magically solve enterprise data extraction.
LLMs are powerful and non-deterministic. That’s not a flaw, but it is a constraint. Reliable extraction means acknowledging that reality and designing around it: using models where they add unique value, relying on deterministic systems where precision matters, and building evaluation loops that surface uncertainty instead of hiding it.
What made this pipeline work wasn't a single prompt or model upgrade. It was a narrowly scoped LLM, deterministic code where precision mattered, a gold-standard evaluation loop, and human judgment applied at every iteration.
Enterprise-grade extraction isn’t plug-and-play (unfortunately).
It’s iterative. It’s system-level. And it’s built to improve over time.