Upload the file.
Use an LLM.
Get perfect structured data.
We built an extraction pipeline for a client processing thousands of financial research reports, and learned why that story doesn’t hold up in production.
LLMs are powerful, but non-deterministic. And that single fact changes how extraction must be designed, evaluated, and continuously improved if you want enterprise-grade results.
This project made one thing clear: human judgment is still the missing ingredient.
In this post, we share how we used LLMs intentionally (not everywhere), built system-level evaluation to surface failure modes, and iterated toward extraction that’s trustworthy at enterprise scale.
We recently completed a project for a financial services firm that aggregates equity research from partners around the world.
On a typical day, the firm receives hundreds of research reports delivered primarily as PDFs.
Each partner uses different layouts, terminology, formatting conventions, and disclosure structures. Some reports follow traditional cover-page layouts; others bury key information deep in narrative text or side panels.
These reports contain highly structured information, including company identifiers, analyst recommendations, pricing, market data, and authorship, yet all of it lives inside unstructured documents.
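To make that target concrete, here is a minimal sketch of the kind of record each report boils down to. The field names are illustrative, not the client's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative only: the structured fields a research report typically carries.
@dataclass
class ReportExtraction:
    company_name: str
    ticker: str                    # e.g. "ACME" or "ACME.L"; formats vary by partner
    recommendation: str            # e.g. "Buy", "Hold", "Sell"; labels vary by partner
    target_price: Optional[float]  # not every report publishes one
    currency: Optional[str]
    analysts: list[str]            # authorship
    report_date: Optional[str]     # ISO date string once normalized
```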
That creates a familiar set of challenges: the volume, the variability, and the business criticality all mean that any solution must work reliably at scale, and must surface its own uncertainty when it fails.
A field is slightly wrong.
A ticker is missing.
A recommendation is misclassified.
Those small errors compound quickly, and without a way to detect, measure, and correct them, extraction becomes untrustworthy… even if it “mostly works.”
To move from experimentation to enterprise-grade extraction, we had to be explicit about where LLMs help, where they hurt, and where humans need to stay in the loop.
Our biggest breakthrough wasn’t a better prompt.
It was an evaluation strategy.
Without a clear way to measure correctness, every change felt subjective, so we grounded the system in human supervision: the gold standard.
The client already had a manual process: humans reviewed reports and produced standardized cover sheets. We used those as our gold standard.
Instead of asking “does this look right?”, we asked: How closely does the extraction match what humans previously produced?
We defined two simple metrics: an exact match, where the extraction reproduced the cover sheet value verbatim, and a near match, where the value was correct but presented differently.
That distinction mattered. A ticker with or without an exchange code. A correct value with inconsistent formatting. These weren't failures, but depending on the context, they might not be production-safe either.
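As a rough illustration, field-level scoring against the cover sheets can look like the sketch below. It assumes extractions and gold cover sheets are flat dicts of field names to strings, and the normalization rules shown are hypothetical examples, not the client's.

```python
def normalize(field: str, value: str) -> str:
    """Illustrative normalization: collapse formatting differences a human
    reviewer would not count as errors."""
    v = value.strip().lower()
    if field == "ticker":
        v = v.split(".")[0]  # treat "ACME.L" and "ACME" as the same ticker
    if field == "recommendation":
        v = {"outperform": "buy", "underperform": "sell"}.get(v, v)  # hypothetical label map
    return v

def score(extracted: dict[str, str], gold: dict[str, str]) -> dict[str, tuple[bool, bool]]:
    """Per field: (exact match, near match after normalization)."""
    results = {}
    for field, gold_value in gold.items():
        value = extracted.get(field, "")
        exact = value == gold_value
        near = normalize(field, value) == normalize(field, gold_value)
        results[field] = (exact, near)
    return results
```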
This let us measure the system instead of eyeballing it. High-level metrics showed when things broke. Manual inspection showed why.
That combination gave us the intuition to fix the system: adjusting prompts, adding normalization, and re-measuring impact with every change.
No guessing.
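Building on the illustrative `score` helper above, the corpus-level loop can be as simple as the following sketch: aggregate per-field accuracy, and set aside the near-but-not-exact cases for manual review.

```python
from collections import defaultdict

def evaluate(pairs):
    """Aggregate field-level accuracy across (extracted, gold) pairs and
    collect near-but-not-exact cases for manual inspection."""
    exact_hits, near_hits, totals = defaultdict(int), defaultdict(int), defaultdict(int)
    to_review = defaultdict(list)
    for extracted, gold in pairs:
        for field, (exact, near) in score(extracted, gold).items():
            totals[field] += 1
            exact_hits[field] += exact
            near_hits[field] += near
            if near and not exact:
                to_review[field].append((extracted.get(field), gold[field]))
    for field in totals:
        print(f"{field:15s} exact={exact_hits[field]/totals[field]:.0%} "
              f"near={near_hits[field]/totals[field]:.0%}")
    return to_review
```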

Once we could see failure modes clearly, the highest-leverage change was a hybrid approach.
Most extraction systems fail because they treat the model like an oracle. We realized quickly that this wouldn’t work.
The split was simple: the LLM handled only what it was uniquely good at, reading messy, variable documents, while deterministic code handled everything that demanded precision and consistency.
For example, once the model extracted the ticker correctly, we stopped asking it to infer details like sector or industry from the surrounding text. Instead, we used a lookup table to map each ticker to standardized attributes. This approach was not only more reliable, it also enforced a consistent taxonomy and eliminated an entire class of avoidable errors.
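Here is a sketch of that lookup step, with a made-up reference table standing in for whatever security master the taxonomy actually comes from.

```python
# Hypothetical reference table; in practice this comes from a maintained
# security master, not from the LLM.
TICKER_ATTRIBUTES = {
    "ACME": {"sector": "Industrials", "industry": "Machinery"},
    "FOOD": {"sector": "Consumer Staples", "industry": "Packaged Foods"},
}

def enrich(extraction: dict) -> dict:
    """Deterministic post-processing: map the LLM-extracted ticker to a
    standardized taxonomy instead of asking the model to infer it."""
    attrs = TICKER_ATTRIBUTES.get(extraction.get("ticker", "").upper())
    if attrs is None:
        extraction["needs_review"] = True  # surface uncertainty instead of guessing
        return extraction
    return {**extraction, **attrs}
```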
The result was far less work for the model and far fewer ways for it to fail.
By shrinking the LLM’s responsibility, we reduced the overall error surface area. Accuracy improved, debugging became tractable, and behavior was predictable across partners and formats.
Our point of view is simple: no tool is going to magically solve enterprise data extraction.
LLMs are powerful and non-deterministic. That’s not a flaw, but it is a constraint. Reliable extraction means acknowledging that reality and designing around it: using models where they add unique value, relying on deterministic systems where precision matters, and building evaluation loops that surface uncertainty instead of hiding it.
What made this pipeline work wasn't a single prompt or model upgrade. It was a narrowly scoped LLM, deterministic code where precision mattered, a gold-standard evaluation loop, and human judgment applied at every iteration.
Enterprise-grade extraction isn’t plug-and-play (unfortunately).
It’s iterative. It’s system-level. And it’s built to improve over time.