Acquiring the text
I've had extensive experience with tesseract-ocr, but it isn't well suited to this sort of work. I think it's great for detecting snippets of text in images that aren't primarily text, and I've used it to automatically determine page numbers on other projects. But for text-heavy, document-based images, I just can't beat Finereader's high-quality OCR.
Its layout recognition is especially good. In other projects I had to give Finereader more extensive hints to distinguish the text areas from the graphics, but I didn't have to do that at all with Dull Physics. That meant the primary OCR task was almost completely automated.
Finereader did get confused near the end of a section, where the columns were short, but that was a minor inconvenience.
I've tried proofing text in Finereader, but it just doesn't work for me. The storage format for the text will be asciidoc, so that's another reason to jump right into a plain-text format. I've also experimented with Finereader's other export formats, including exporting styles to OpenDocument formats that I'd use as a basis for layout, but that invariably drives me around the bend. As a matter of fact, that's what led me to discover asciidoc (more on that in a subsequent post).
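
To give a feel for what jumping into plain text can involve, here is a minimal sketch of the sort of cleanup an OCR text export tends to need before proofing. It assumes the export is plain text with hard line breaks and end-of-line hyphenation, and it relies on asciidoc's convention that paragraphs are separated by blank lines. The function name `ocr_text_to_paragraphs` is made up for illustration; this is not necessarily the workflow used for Dull Physics.

```python
import re


def ocr_text_to_paragraphs(raw: str) -> str:
    """Unwrap hard-wrapped OCR text into blank-line-separated paragraphs."""
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words split across a line break: "recog-\nnition" -> "recognition".
    # Assumption: every end-of-line hyphen is a soft hyphen added by typesetting.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Treat any run of blank lines as a paragraph boundary.
    blocks = re.split(r"\n\s*\n", text)

    paragraphs = []
    for block in blocks:
        # Join the remaining hard-wrapped lines into one line per paragraph.
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))

    # Paragraphs separated by a single blank line, with a trailing newline.
    return "\n\n".join(paragraphs) + "\n"


if __name__ == "__main__":
    sample = "The layout recog-\nnition worked\nwell.\n\n\nSecond para-\ngraph here.\n"
    print(ocr_text_to_paragraphs(sample))
```

The hyphen rule is deliberately naive: a genuine compound that happens to break at the end of a line (say, "well-" followed by "known") would be fused into one word, so the result still needs a human proofing pass.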