
Batch OCR in Practice

5 August 2015

This blog post shares some lessons learned about batch optical character recognition on PDF documents. There were 3 challenges: deciding whether OCR is necessary for a document, choosing an OCR package, and assessing OCR results.

Background

Recently I drew the assignment of extracting text from over 20,000,000 pages in about 200,000 PDF documents. Some were scans, so choosing an OCR package was a concern, but other issues were just as important. I only needed the words; I didn’t have to pick out structured data like tables. (If tables are your thing, check out the Scraperwiki and Tabula projects.) So the first step was to produce a version of every document that would allow text extraction.

I have to rant a little about PDF. Now that we are in an era of big data, you really have to dial down your expectations when dealing with the portable document format. PDF is a legacy presentation format that’s a thin layer on top of PostScript, largely opaque to software tools, and an extremely effective means of locking away text from machine processing. Every PDF text document is basically a bunch of statements that say “paint the following characters at this X,Y position on the page.” Repeat that a bunch of times, then you and I can come along and read “Hello, world!” on our favorite devices, or maybe a table of soccer match scores. You don’t have to paint spaces in PDF, just choose the right positions and hey presto, space between words, or perfect column alignment. PDF supports some structural mark-up (tagged PDF), so tables could in theory be nicely decorated with row and column indicators, but as far as I can tell nobody (i.e., no word processor vendor) uses that PDF feature.

Before I jump in, I’ll define three document types for clarity. For me a searchable document is one that was produced using a modern text-processing program that prints directly to PDF; i.e., it has characters and words painted on the page, not images. An image document consists solely of pictures, like the output from a scanner. And a mixed document is one with both searchable pages and image pages.

But in practice, documents often straddle the boundary of those three types. PDF allows multiple layers in a document. Every OCR package that I tried can create an output document with a layer of searchable text on top of the original image (sometimes called “PDF Searchable Image”). The new layer aligns word characters as closely as possible on the word images so that when you double-click on the document, the viewer allows you to select and copy the word under your cursor. Sometimes that selection aligns perfectly with the underlying text, sometimes not so much.

One last issue is protected documents. PDF supports a couple of digital-rights management features. One requires a password just to open the document (the user password; I never had to deal with that), and the other lets authors set an owner password to block viewer actions such as printing, copying or editing. It’s up to the viewer to honor this limited-permissions feature. In my experience every document allowed printing, and I found that printing to PDF format is a crude way of lifting this protection. Also the text-extraction software that I used ignored the protection, which was incredibly convenient for me.

Do I need to run OCR on this document?

Based on a per-page execution time of several seconds, multiplied by millions of pages, I was highly motivated to minimize the number of pages to process thru OCR. So answering the question “is OCR required?” was worth some investment. I tried categorizing files into two buckets: purely searchable content (OCR not necessary) or image content (definitely OCR needed).

I had no software tools that could evaluate the content of a raw PDF document, i.e., no machine-vision magic, so my recourse was to extract the available text. I used the Poppler package’s command-line “pdftotext” tool. Then I used scripts to evaluate crude heuristics on the extracted text, with the goal of deciding whether a document was searchable. First I checked the quantity of text that was extracted, then checked the text quality. See the end of this post for a Python method that counts character classes.
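For the extraction step, here is a minimal sketch of driving pdftotext from Python; the function name is mine, and the options shown (writing to stdout with “-”, forcing UTF-8 output) are just one reasonable way to call it.

import subprocess

def extract_text(pdf_path):
    '''
    Run Poppler's pdftotext on one PDF and return the extracted text.
    The trailing "-" sends output to stdout; by default pdftotext ends
    each page with a form-feed character, which the heuristics below rely on.
    '''
    raw = subprocess.check_output(['pdftotext', '-enc', 'UTF-8', pdf_path, '-'])
    return raw.decode('utf-8')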

Quantity of available text

This heuristic counts characters on each page in the original document as a first step towards deciding if the document is searchable.

  1. Is every page completely empty? An empty page in pdftotext output probably means the original document has an image there. So if no page has text, you have fairly high confidence that the document has only images, and no further classification is required.
  2. Are some pages completely empty? This probably means the document has mixed content; i.e., both searchable and image pages. But what’s the acceptable threshold, 5%? If your documents were produced by members of the legal profession then it’s likely the least amount of text on any page will be “This page intentionally left blank” and you will have very, very few truly empty pages. Alternately (and I’ve seen this), if someone scanned a one-sided document using two-sided mode, you could have 50% empty pages in a perfectly searchable document!
  3. Does every page have characters? This may mean that the document is searchable. However, some image pages yield a few (sometimes many) gibberish control characters, so passing this test isn’t any guarantee of text content.

A caveat: counting characters per page requires detection of page boundaries, and that is unfortunately not straightforward. The Poppler tool writes a form-feed character at the end of every page, as the first character on a line. But sometimes it writes form-feed characters in gibberish pages, and every once in a while a form-feed character amidst the gibberish appears as the first character on a line. So the results of this heuristic are not nearly as solid as I would have liked.
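With that caveat in mind, here is a rough sketch of the empty-page checks above. It naively splits the pdftotext output on form feeds, so it inherits exactly the page-boundary weakness just described; the function name and the three labels are mine.

def classify_by_quantity(text):
    '''
    Naive classifier over pdftotext output: split on form feeds and
    count pages that contain no visible characters. Because form feeds
    also show up inside gibberish pages, treat the answer as a hint only.
    '''
    pages = text.split(u'\f')
    if pages and pages[-1].strip() == u'':
        pages = pages[:-1]  # pdftotext ends the last page with a form feed
    nr_empty = sum(1 for p in pages if p.strip() == u'')
    if nr_empty == len(pages):
        return 'image'             # no text anywhere: OCR definitely needed
    elif nr_empty > 0:
        return 'mixed'             # some empty pages: probably mixed content
    else:
        return 'maybe-searchable'  # text on every page: still check quality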

Note to self: enhance pdftotext to allow use of a null or other seriously unusual character as the page-termination indicator.

Quality of available text

A document that yields text from every page just might be searchable. But maybe not, and here’s the first pitfall: I found quite a few PDF documents with a layer of impossibly garbled, low-quality text on every page, presumably from a previous OCR pass. On the screen your viewer will just show you the original image. You have to extract the text layer to be sure. In other words, the document appeared to be searchable, but I sure didn’t want the text that it had.

So I had to develop some quality checks to apply to extracted text.

  1. Count letters, numbers, spaces, and ordinary punctuation; separately count other characters. Then calculate the percentage of the total number of characters that fall in that “other” bucket. Even rather ordinary text documents have a handful of special characters such as fancy double quotes, long dashes, bullets and others, so a modest number in that bucket is perfectly acceptable. But if it’s more than a few percent, that very likely means pdftotext was not able to extract a satisfactory text from every page – and OCR will be required.
  2. Count recognized and unrecognized words using a spell checker. A spell checker is readily available on Linux systems and offers a fast quality check of words. (I used the Linux utility hunspell; see the sketch after this list.) Mine had an English-language dictionary that included many common first names (“Mary”) and well-known place names (“Wyoming”), but not too many last names nor names of small municipalities in the U.S. So at least a 1% rate of unrecognized words was to be expected. If the document exhibits 10% or more unrecognized words, then very likely the existing text is not satisfactory (possibly it has a layer of poor OCR results), and again OCR will be required.
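Here is a sketch of the spell-checker check, piping the extracted words through the hunspell command line; hunspell’s “-l” flag prints only the words it does not recognize and “-d” picks the dictionary. The function name and the crude word pattern are mine, and the rate it returns is approximate.

import re
import subprocess

def unrecognized_word_rate(text, dictionary='en_US'):
    '''
    Feed the words from extracted text to hunspell and return the
    fraction it does not recognize. "hunspell -l" prints only the
    misspelled words; "-d" selects the dictionary.
    '''
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 1.0
    proc = subprocess.Popen(['hunspell', '-l', '-d', dictionary],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = proc.communicate(u'\n'.join(words).encode('utf-8'))
    misses = [w for w in out.decode('utf-8').splitlines() if w.strip()]
    return float(len(misses)) / len(words)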

I did not calculate other-character counts or unrecognized-word rates on a per-page basis. This would be a useful refinement, because it would allow detecting one page among hundreds that has thousands of unusual characters, which an average over the whole document would hide.

Which OCR package should I use?

OK, you’ve classified your documents into buckets and are ready to start OCR. Which package? Here’s the conclusion right up front: I recommend you start with the Nuance products, Omnipage Professional and/or PowerPDF Advanced.

I was looking for a package to do hands-off, batch conversion. All the packages I found are desktop software for Windows. If you’re willing to spend a few hundred dollars you have many options; trial versions can get you thru if you only need to convert a few hundred pages. You can also spend tens of thousands if you choose.

Many software packages want to load all the input files, do OCR, query the user for corrections, and coalesce the mess into a single output. This may work for some people but I was looking for true batch: one output file for each input file, unattended operation, don’t stop for anything, give me a detailed report at the end. Spoiler alert: I didn’t find that.

When I’m doing OCR, should I rasterize? This basically means, should every page be treated as an image, or is it acceptable to use available text? This goes back to the quality assessment from before. Obviously it’s a lot faster to process a mixed-content document if the available text can simply be reused (instead of re-recognized). But if you sent a document off to OCR because of low quality in the available text, then you must tell the OCR to rasterize and recognize. Not all packages offer this feature!

Packages in alphabetic order follow. Prices shown below are list (and are dated) but discounts abound. Take my comments about accuracy with a grain of salt; your inputs will not be the same as my inputs so your mileage will certainly vary.

ABBYY Finereader 12 Corporate: $400

Batch feature is called the “Task Manager” and it’s on the Tools menu. It will process files from a folder, including subfolders; it will happily create a separate output file for each input file. It does not seem capable of preserving the input folder hierarchy; all output files went to the same output folder. The accuracy was high in my tests, yet still the lowest of the packages I’ve listed here.

Adobe Acrobat XI: $300

Batch feature is called “Text Recognition/In Multiple Files” which can be found by clicking on Tools (third toolbar, top right side of the main screen). Processes subfolders, one output for each input. Stops and puts up a prompt if it finds a password-protected file. Does not preserve input directory tree by default; can do so by writing output to same folder as input. Accuracy was quite good in my tests.

Nuance OmniPage Ultimate (aka v19): $500

Batch feature is called “DocuDirect” and it’s a separate program that comes with the package. It will process folders and subfolders; if you select the features just right, it will preserve the input directory tree in the output area. One output for each input. Stops and demands a password for a protected file. Seems to take excellent advantage of multi-core processors to run tasks in parallel. The accuracy was excellent. But stability of the batch processor is poor; a fuzzy document will stop it in its tracks, never to recover, derailing a batch with ease.

Nuance PowerPDF Advanced v1.1: $150

Batch feature is called “Batch Converter” and it’s reachable from the main program under the Advanced Processing tab. It will process folders and subfolders, preserving the input structure in the output. One output for each input. Will use multiple cores, but not aggressively; I could not get it to saturate a multi-core host. Accuracy is excellent, as good or better than OmniPage. Bad or fuzzy files did not cause it to hang. The batch processor writes (shock) a plain-text log file to the output directory. Not a replacement for Omnipage Professional; it has fewer features and is intended for simpler use cases, not to mention at a lower price point.

PDF Compressor: priced per page

This super high-end package is licensed on a per-page basis, with a base price in the thousands of dollars plus a fee per page. Altho the main feature is reducing the size of PDF documents, it also has an extremely capable OCR feature. This package includes a proper batch feature and offers incredibly low-level control over the text-recognition features. It also writes a detailed log file. Processing was as fast as any other package, and the quality of the output was every bit as good as the Nuance products. But the price point is 10 to 100 times that of the others.

ReadIris Corporate 14: $600

Batch feature is invoked by the “Batch OCR” item, which is revealed by clicking on the “From Files” button on the main screen. It will process folders and subfolders, one output for each input, and by default the output directory structure matches the input directory structure. Stops and demands user input on an invalid file; processes all protected documents without further complaint, apparently by OCR-ing the page images. The accuracy was very good, on par with Acrobat.

On my desktop machine (only dual core), with my chosen inputs, every package required at least 3 seconds to process a page; some took more. It might be possible to drive this down on a machine with more cores. Some packages took good advantage of multiple cores, especially the Nuance packages. Others (Adobe) resolutely stuck to just one core.

Gotchas abound, be sure to plan for them: invalid PDFs (some packages halt), password-protected PDFs (some packages halt, others convert anyhow!), and rotated pages (landscape instead of portrait). If you want the batch to run thru to completion, you have to prep the input area for these packages Very, Very Carefully.
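One way to reduce the pain is a pre-flight pass over the input area before queueing anything. Here is a rough sketch using Poppler’s pdfinfo, which generally exits non-zero on files it cannot parse and reports an “Encrypted” line for protected ones; the function name and status strings are mine, and it does not catch rotated pages.

import subprocess

def preflight(pdf_path):
    '''
    Rough pre-flight check before queueing a file for batch OCR.
    Returns 'invalid' if pdfinfo cannot parse the file, 'protected'
    if it reports encryption, and 'ok' otherwise.
    '''
    try:
        out = subprocess.check_output(['pdfinfo', pdf_path])
    except subprocess.CalledProcessError:
        return 'invalid'
    for line in out.decode('utf-8', 'replace').splitlines():
        if line.startswith('Encrypted:') and 'yes' in line:
            return 'protected'
    return 'ok'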

Running large batches can lead to memory-exhaustion and hanging problems, even tho it should not (argh - probably memory leaks). If you’re doing any kind of automation at all, a big problem is discovering after the fact what really happened - which documents could not be processed, which failed during processing, etc. It’s like the desktop software people never heard of something called a “log file”.

Finally, getting support, even as a paying customer, is pretty difficult for these mass-market packages. For example I complained to one esteemed customer support rep about a package (which shall remain nameless) hanging for some large inputs. I watched it stay hung for 36 hours before giving up :). They sweetly suggested limiting the batch size to 300 documents. That was just completely unacceptable to me, but hey it got that support ticket closed dang quick, right? And that’s all that matters, right? Sigh.

What’s the quality of the OCR result?

You’ve successfully pushed documents thru OCR and now you need to evaluate the result. Basically I reused all the heuristics discussed above for assessing the available text in the pre-OCR document.

  1. Start with the basics: how many pages are in the result? Maybe it’s obvious, but you have to check that there’s an output page for every input page. Some packages silently dropped pages on me; that was fatal. Also make sure the result doesn’t have more pages than the input! (See the sketch after this list.)
  2. Are all the pages empty? This is a basic sanity test; it should never happen, but it’s worth checking.
  3. Count characters and normalize by pages. A basic minimum (100 chars per page) and maximum (5,000 chars per page) will reveal serious problems.
  4. What’s the percentage of unrecognized words in the output? The spell-checker again is tremendously useful here. A truly fuzzy document will defeat even the best algorithm. Even a rather modest character non-recognition rate will lead to a significant word-miss rate, which means many mis-spelled words. Setting the threshold for an acceptable misspelling rate is very much dependent on the project and its goals.
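The page-count check in item 1 is easy to script; here is a minimal sketch using Poppler’s pdfinfo to compare input and output (the function names are mine).

import subprocess

def page_count(pdf_path):
    '''Read the page count of a PDF from Poppler's pdfinfo output.'''
    out = subprocess.check_output(['pdfinfo', pdf_path]).decode('utf-8', 'replace')
    for line in out.splitlines():
        if line.startswith('Pages:'):
            return int(line.split()[-1])
    return 0

def same_page_count(input_pdf, ocr_output_pdf):
    '''The OCR result must have exactly as many pages as the input.'''
    return page_count(input_pdf) == page_count(ocr_output_pdf)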

Summary

These lessons stemmed from my encounters with an enormous variety of PDF documents in a large corpus. Many of these issues were total surprises to me, especially the documents with garbage OCR results! Like my father used to say: too soon old, too late smart.

Code

This Python 2 method profiles pdftotext output, yielding some measures for assessing quality of OCR output.

def analyze_doc_content(content):
    '''
    Analyzes document content and returns the following counts in a tuple:
    1. number of form-feed characters at the start of a line (approximates page count)
    2. number of empty pages (no chars at all)
    3. number of whitespace characters
    4. number of lower alpha characters
    5. number of upper alpha characters
    6. number of digits 
    7. number of punctuation characters
    8. number of other characters
    
    Form-feed characters can occur anywhere (and do).
    A succession of FFs on a line indicate empty pages;
    a FF occurring randomly on a page doesn't mean a damn thing.
    This uses a trivial state machine to track the start of lines.
    HOWEVER, files exhibit the magic newline-formfeed sequence
    at places that are not page breaks, so the count is not a
    truly reliable indicator of pages.  Unfortunately.
    
    `content`: A string that is the output of "pdftotext" with
            form-feed characters at the end of each page.
    '''
    # initial state
    ff_state_begin_line = True
    is_page_empty = True
    nr_page = nr_empty = nr_ws = nr_lower = nr_upper = nr_digit = nr_punc = nr_other = 0
    for c in content:
        
        # run the form-feed state machine
        if c == u'\f':
            if ff_state_begin_line:
                # FF occurs at start of line
                nr_page += 1
                # were any chars found on the page?
                if is_page_empty:
                    nr_empty += 1
                # reset empty check; stay in begin_line state
                is_page_empty = True
            #else:
                # FF occurs mid line.
                # don't change state
            # if
            # count the FF as WS and skip the tests below
            nr_ws += 1
            continue
        elif c == u'\n':
            ff_state_begin_line = True
            is_page_empty = False
            # count the LF as WS and skip the tests below
            nr_ws += 1
            continue
        else:
            ff_state_begin_line = False
            is_page_empty = False
        # state machine
            
        # count character classes
        if c in u"abcedfghijklmnopqrstuvwxyz":
            nr_lower += 1
        elif c in u"\u0009\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF": 
            # never get here on \f or \n
            nr_ws += 1
        elif c in u"~!@#$%^&*()[]{}<>-_+=:;\"'`,.?/\\|":
            nr_punc += 1
        # fancy quotes and dashes
        elif c in u"\u0060\u00B4\u02B9\u02BB\u02BC\u02BD\u2018\u2019\u201B\u2032\u2035\u275B\u275C\u02BA\u2033\u3003\u201C\u201D\u00AD\u2010\u2011\u2012\u2013\u2014\u2015\uFE58\uFE63\uFF0D": 
            nr_punc += 1
        elif c in u"ABCDEFGHIJKLMNOPQRSTUVWXYZ":
            nr_upper += 1
        elif c in u"0123456789":
            nr_digit += 1
        else:
            # print("Other char: %s (%x)" % (c, ord(c)))
            nr_other += 1
            
    # for c

    return (nr_page, nr_empty, nr_ws, nr_lower, nr_upper, nr_digit, nr_punc, nr_other)
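
Here is a short usage sketch that ties the profiler back to the thresholds discussed above; the file name and cut-off values are only illustrative.

import io

# pdftotext output for one document (illustrative file name)
with io.open('extracted.txt', encoding='utf-8') as f:
    content = f.read()

(nr_page, nr_empty, nr_ws, nr_lower, nr_upper,
 nr_digit, nr_punc, nr_other) = analyze_doc_content(content)

total = nr_ws + nr_lower + nr_upper + nr_digit + nr_punc + nr_other
other_pct = 100.0 * nr_other / max(total, 1)
if nr_page > 0 and nr_empty == nr_page:
    print("image-only document, OCR required")
elif other_pct > 3.0:
    print("too many unusual characters, OCR probably required")
else:
    print("document appears searchable")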
