Training an ML Model to Improve OCR

Eliminating pre-trial document-dump drudgery with an interface that trains an AI on manual corrections made by paralegals.
LegalTech
AI-ML
Web
OCR
Product Design
1 of 1 Product Designers
A web application interface for validating OCR text extractions from a scanned emergency room record is displayed. On the left, a redacted emergency document from “St. Mary’s of Michigan: Washington Ave. Emergency Record” includes patient data, triage notes, barcodes, and highlighted OCR regions in red. On the right, a “Text Extractions” sidebar lists detected items such as “Pg 4 of 34,” “Pg ID 872,” “smudge,” “DOB: 11/ Wt/Ht,” and others, each paired with a dropdown to classify error types like “Misread,” “Bad format,” and “Other.” One dropdown is currently expanded, showing selectable error types. A blue “Validate” button appears at the bottom.
Problem
Document “dumps” in pre-trial discovery inundate legal teams with thousands of pages. Existing OCR tools often misinterpret irregular scans, and extractions scoring below roughly 86% confidence must be checked manually. Each 1% gain in OCR confidence saves approximately $0.57 per document in review time. Without high-quality training data, machine-learning models can't improve, so the manual labor persists.
Solution
We built a human-centered ML training interface where low-confidence OCR regions were marked with red polygons over the source document. Each polygon linked to a correction field in the Text Extractions panel, where reviewers edited the text and selected an error type. A zoom slider enabled close inspection; the annotation layer could be toggled off for clarity. All corrections fed directly into the retraining pipeline.

Impact & Results

We estimated the value of these improvements using a conservative cost model.
7%
Improvement in OCR accuracy
Retraining the model using corrected annotations from human reviewers raised low-confidence zones from 86% to 93% confidence in downstream runs.
3x
Reduction in time spent typing
Paralegals spent less time on manual transcription thanks to red-box prioritization and confidence scores guiding attention to where it was needed most.
$4K
Estimated savings per case
By cutting just 20 minutes of paralegal time per case across three full-time staff, the model saves more than $4K per case in labor alone.

User Research

I shadowed three paralegals at an auxiliary office, observing them load scanners, configure OCR tools, and slap sticky notes on their equipment. Conversations revealed their biggest pain: manually reviewing endless documents.
These insights drove us to prioritize features that reduced search time and focused attention. Their existing process was mandatory and our solution was experimental, so we also learned to respect their existing workflow—designing a tool that required no installation or extensive training.
We observed that each document contained an average of 12 low-confidence fields, and that corrections took a minimum of 30 seconds per field at a $35/hr labor rate. That's almost 60 cents per document.


Improvements

1. Annotation & Focus


A labeled UI mockup demonstrates a machine learning training interface for correcting OCR errors on a scanned emergency medical record. On the left, the redacted document is overlaid with red-highlighted polygons indicating extracted text. On the right, the “Text Extractions” panel displays a scrollable list of extracted items paired with dropdowns for error classification (e.g., “Misread,” “Bad format,” “Other”). Labels annotate key UI features: “Zoom and navigation controls” for the slider, “Panel toggle,” “Polygons share focus with their corresponding input,” and “CTA Position fixed with scrolling form” near the blue “Validate” button. The dropdown for “smudge” is active, showing error type options.
Problem
Paralegals prepping documents for transcription had no guidance on which pages, or how many, would require manual work, forcing them to segregate anything smudged or dirty and retype entire pages.
Solution
Quick win: I took the OCR JSON output and used it to overlay red polygons on low-confidence regions of the scanned document. Clicking an entry in the side panel automatically highlighted its corresponding polygon.
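For illustration, here is a minimal TypeScript sketch of that overlay, assuming Google Cloud Vision-style OCR output (a boundingPoly of vertices plus a confidence score per extraction). The function names and the 0.86 threshold constant are illustrative stand-ins, not the production code.

```typescript
// Minimal sketch: draw low-confidence OCR regions as red polygons on a
// transparent canvas layered above the page image, and link each side-panel
// entry to its polygon. Names and structure are illustrative.

interface Vertex { x: number; y: number; }

interface OcrRegion {
  description: string;               // text the OCR engine read
  confidence: number;                // 0..1 confidence score
  boundingPoly: { vertices: Vertex[] };
}

const CONFIDENCE_THRESHOLD = 0.86;   // regions below this need human review

// Keep only the regions a reviewer has to look at.
function lowConfidenceRegions(regions: OcrRegion[]): OcrRegion[] {
  return regions.filter(r => r.confidence < CONFIDENCE_THRESHOLD);
}

// Draw each region as a red polygon; the focused one gets a heavier stroke
// so it shares focus with its input in the Text Extractions panel.
function renderOverlay(
  ctx: CanvasRenderingContext2D,
  regions: OcrRegion[],
  focusedIndex: number | null = null,
): void {
  regions.forEach((region, i) => {
    const [first, ...rest] = region.boundingPoly.vertices;
    ctx.beginPath();
    ctx.moveTo(first.x, first.y);
    rest.forEach(v => ctx.lineTo(v.x, v.y));
    ctx.closePath();
    ctx.lineWidth = i === focusedIndex ? 3 : 1;
    ctx.strokeStyle = "red";
    ctx.stroke();
  });
}

// Clicking a side-panel entry re-renders the overlay with that polygon emphasized.
function onEntryClick(
  ctx: CanvasRenderingContext2D,
  regions: OcrRegion[],
  index: number,
): void {
  ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height);
  renderOverlay(ctx, regions, index);
}
```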
300%
Faster corrections than manual transcription across an initial set of 250 documents
91%
Task completion rate without backtracking while correcting the validation set


2. Zoom & Detail Examination

A partially zoomed-in scanned image of an emergency medical record from “St. Mary’s of Michigan: Washington Ave. Emergency Record.” The top portion of the image shows a webpage interface with a zoom slider labeled “Slide to change zoom.” The document includes barcodes, the heading “Patient Data,” and fields such as “Complaint: abrasions,” “Triage Time: Sat May 30, 2009 00:36,” “DOB: 11/,” and “MedRec: 084…” A court case header at the top indicates “Filed 09/14/15” and “Pg 4 of 34.” The document appears to be part of a legal filing related to a medical incident.
Problem
Within the first few documents, our paralegals reported problems when using device- and browser-level zoom functions.
Solution
I built a zoom slider into the viewer that allowed smooth magnification while preserving polygon alignment.
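A rough sketch of the approach, assuming the slider simply drives a single scale factor applied to both the page image and every polygon vertex; applyZoom and scaleVertices are illustrative names, not the shipped code.

```typescript
// Sketch of zoom handling: one scale factor resizes the page image and
// every polygon vertex together, so the overlay stays aligned at any zoom.

interface Vertex { x: number; y: number; }

// Scale OCR page coordinates into canvas coordinates for the current zoom.
function scaleVertices(vertices: Vertex[], zoom: number): Vertex[] {
  return vertices.map(v => ({ x: v.x * zoom, y: v.y * zoom }));
}

// Re-render the document and overlay whenever the slider moves.
function applyZoom(
  ctx: CanvasRenderingContext2D,
  pageImage: HTMLImageElement,
  polygons: Vertex[][],
  zoom: number,               // e.g. 1.0 = 100%, 2.5 = 250%
): void {
  ctx.canvas.width = pageImage.naturalWidth * zoom;
  ctx.canvas.height = pageImage.naturalHeight * zoom;
  ctx.drawImage(pageImage, 0, 0, ctx.canvas.width, ctx.canvas.height);

  ctx.strokeStyle = "red";
  for (const polygon of polygons) {
    const [first, ...rest] = scaleVertices(polygon, zoom);
    ctx.beginPath();
    ctx.moveTo(first.x, first.y);
    rest.forEach(v => ctx.lineTo(v.x, v.y));
    ctx.closePath();
    ctx.stroke();
  }
}
```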
27%
Reduction in errors on fine-print fields (box height under 8px)
3.6x
Increase in character-level accuracy compared to original OCR


3. Error Categorization

Categorizing errors (e.g., 'Misread characters', 'Incomplete text') provided crucial data for refining the OCR model. The feature was grounded in the understanding that detailed error data is essential for iterative improvement.
Problem
After analyzing the first 250 documents' worth of corrections from our paralegals, we believed that treating all corrections equally limited the model's ability to learn from specific mistakes.
Solution
We introduced drop-down categories (“Misread,” “Incomplete,” “Bad format,” “Other”) for each extraction. This structured data helped the ML algorithm differentiate misinterpretations from formatting issues.
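For context, the structured record behind each dropdown selection might look like the sketch below. The error-type labels and the field names visible in the console screenshot (originalDescription, correctedDescription, errorType, index) come from the project; the record shape and example values are illustrative.

```typescript
// Sketch of a categorized correction record produced by the dropdown.

type ErrorType = "Misread" | "Incomplete" | "Bad format" | "Other";

interface CategorizedCorrection {
  originalDescription: string;   // what the OCR engine extracted
  correctedDescription: string;  // what the reviewer typed
  errorType: ErrorType;          // why the extraction was wrong
  index: number;                 // position of the extraction on the page
}

// Example: a smudged field misread by the engine and fixed by a reviewer.
const example: CategorizedCorrection = {
  originalDescription: "smudge",
  correctedDescription: "Complaint",
  errorType: "Misread",
  index: 7,
};
```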
6.2%
Improvement in model recall from a categorized set of 1,000 documents
4x
Increase in labeling speed observed during model training


4. Flexible JSON Data Structures

A browser window displays a web-based validation tool named “Text Extractions” alongside browser developer tools (Console tab). On the left panel, the tool lists OCR-extracted items with user-assigned error types such as “Bad format,” “Other,” “Misread,” and “Error type.” Entries include text like “Pg 4 of 34,” “Pg ID 872,” “barcode,” and “DOB: 11/ Wt/Ht.” The right panel shows the developer console log for scripts.js, detailing processed items with originalDescription, correctedDescription, errorType, and index values. Several logs show corrected text values like “Complaint” and “Attending an RN.” A blue “Validate” button appears at the bottom of the tool.


Problem
Our early corrections were stored as Microsoft Word files, without metadata on location, label, or field type. This limited their utility in retraining because the model couldn't generalize patterns beyond the raw content.
Solution
Google Cloud's OCR outputs JSON, so we designed our correction schema to align with it. I bundled each correction with its bounding box, original confidence score, and original description.
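A sketch of what one retraining record could look like under that schema: the boundingPoly and confidence fields mirror Google Cloud's OCR JSON, and any field not visible in the console screenshot (such as documentId) is an assumption for illustration.

```typescript
// Sketch of a retraining record that mirrors Google Cloud's OCR JSON
// (boundingPoly vertices plus a confidence score) and bundles it with the
// reviewer's correction. Field names beyond those visible in the console
// log are assumptions.

interface Vertex { x: number; y: number; }

interface TrainingCorrection {
  documentId: string;                     // which scanned page this came from (assumed field)
  index: number;                          // extraction's position on the page
  boundingPoly: { vertices: Vertex[] };   // region the engine read, as emitted in the OCR JSON
  originalConfidence: number;             // engine's confidence before correction
  originalDescription: string;            // engine's raw extraction
  correctedDescription: string;           // reviewer's corrected text
  errorType: "Misread" | "Incomplete" | "Bad format" | "Other";
}

// A validated batch is just an array of these records, handed to the
// retraining pipeline once the reviewer clicks "Validate".
type TrainingBatch = TrainingCorrection[];
```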
7%
Overall average increase in model confidence scores
$4
Each batch delivered labor savings estimated at $4 per document

Don't Count Your Chickens Before Your Credentials

Our human-in-the-loop approach underscored that technology must augment, not replace, expert work. By involving paralegals in both design and model training, we produced high-quality data that improved model performance. Clear hypotheses and measurable goals kept the project focused.
The biggest surprise came after success: no one had agreed on who owned it. The data was better, the model smarter, the workflow proven—but the moment it worked, the questions started. Who controls the corrections? Who gets to use the output in production? Who signs off on the retraining loop?
Without shared credentials—legal or technical—the project stalled. Not because of failure, but because of ambiguity. It turns out, validating the model is only half the battle. The rest is validating the relationships around it.