Acquiring the text

I've had extensive experience with tesseract-ocr, but it's not well suited to this sort of work. I think it's great for detecting snippets of text in images that aren't mostly text, and I've used it to automatically determine page numbers on other projects. But for text-heavy, document-based images, I haven't found anything that beats Finereader's high-quality OCR.
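For the curious, the page-number trick looks roughly like the sketch below. This is not the script I used on those projects, just a minimal illustration. It assumes the `tesseract` command-line tool (version 4 or later, where the flag is spelled `--psm`) is on the PATH, and that the image has already been cropped to the strip containing the page number. The function names and the file name in the usage note are made up for the example.

```python
import re
import subprocess
from typing import Optional


def parse_page_number(ocr_text: str) -> Optional[int]:
    """Return the first standalone run of 1-4 digits in the OCR text, or None."""
    match = re.search(r"\b(\d{1,4})\b", ocr_text)
    return int(match.group(1)) if match else None


def read_page_number(strip_path: str) -> Optional[int]:
    """OCR a pre-cropped header/footer strip and extract a page number.

    Assumption: strip_path points at an image containing one line of text.
    """
    # "stdout" sends the recognized text to standard output instead of a file;
    # "--psm 7" asks tesseract to treat the image as a single text line.
    result = subprocess.run(
        ["tesseract", strip_path, "stdout", "--psm", "7"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_page_number(result.stdout)
```

Calling `read_page_number("footer-strip.png")` returns an integer when tesseract finds a number in the strip and `None` otherwise, so pages where detection fails can be flagged for a manual check.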

Finereader is built as a complete document management system, but I just want it for its OCR, and especially for its layout recognition.

In other projects I had to give Finereader more extensive hints to distinguish the text areas from the graphics, but I didn't have to do that at all with Dull Physics. That meant the primary OCR task was almost completely automated.

In this project I had no need to change the text areas that Finereader detected on its own.

It got confused near the end of a section when the columns were short, but that was a minor inconvenience.

Figure captions and minor layout issues were easy to handle in the primary proofing stage.

I've tried proofing text in Finereader, but it just doesn't work for me. The storage format for the text will be asciidoc, so that's another reason to jump right into a plain-text format. I've also experimented with Finereader's other export formats, including exporting styles to OpenDocument files that I could use as a basis for layout, but that invariably drives me around the bend. As a matter of fact, that frustration is what led me to discover asciidoc (more on that in a subsequent post).

I save Finereader's results as plain text and proof from there.