Primary proofreading

I'm working with Unit 1 of the book (86 pages). That's about the same size as the knitting manual, so I have a rough idea of how much work it will involve. I can do the initial proofreading at a rate of ten to twenty pages a sitting; my goal is an average of ten pages a day.

3. Some Special Properties of Matter

17. What is tenacity? By laboratory test we find that a silk thread is stronger than one of cotton, if both have the same diameter, or the same cross-sectional area. A copper wire is more easily broken than one of steel. We say that steel is more tenacious than copper. When we ride in an elevator, our safety depends upon the tenacity of the cable. The tenacity of any material, or its tensile strength, is measured by the force needed to break a rod or wire of that material whose cross-sectional area is unity, one square inch for example. (See Table 7, Appendix B.) It takes a load of 300,000 lb. to break a bar of high-grade steel whose cross-sectional area is one square inch. (See Fig. 6.)

[Figure 006. With its approaches, the George Washington Bridge is 8700 feet in length. The steel cables, which are 36 inches in diameter, support the weight of the bridge. *Courtesy of the Port of New York Authority*]

The steel cables that sustain the weight of the George Washington Bridge are 36 inches in diameter. A single span of the Golden Gate Bridge of San Francisco stretches over 4200 feet of water. (See Fig. 7.)

Primary proofreading is how I refer to my first pass through the text. I'm there to find misrecognized text, rearrange text that was laid out incorrectly, and indicate in some primitive way the large-scale features of the document structure.

As you can see above, I'm already using some conventions, like "empty lines indicate paragraph breaks" and "figure captions go in square brackets." At this point, though, I'm not too concerned with structural markup and hardly at all with presentation markup. The idea is to eliminate most of the mistakes made during OCR and get the text into a form that will be easy to navigate and compare with the page images when it comes time to add structural markup and text styles.
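
Those two conventions are simple enough to check mechanically. Here is a minimal sketch, assuming the proofed text is held in a single string; the sub name split_blocks and the abbreviated sample text are made up for illustration and are not part of the workflow described here.

```perl
use strict;
use warnings;

# Hypothetical helper: split proofed text into blocks on empty lines and
# label each block as a figure caption (wrapped in square brackets) or an
# ordinary paragraph.
sub split_blocks {
    my ($text) = @_;
    my @blocks;
    for my $block (split /\n\s*\n/, $text) {
        $block =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next if $block eq '';
        my $type = $block =~ /^\[.*\]$/s ? 'caption' : 'paragraph';
        push @blocks, { type => $type, text => $block };
    }
    return @blocks;
}

# Abbreviated sample in the same shape as the proofed text above.
my $sample = "We say that steel is more tenacious than copper.\n\n"
           . "[Figure 006. The steel cables support the weight of the bridge.]\n\n"
           . "The steel cables are 36 inches in diameter.\n";

for my $blk (split_blocks($sample)) {
    printf "%-9s %s\n", $blk->{type}, $blk->{text};
}
```

A check like this would only confirm that the conventions are applied consistently; it says nothing about whether the words themselves were recognized correctly.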

Primary proofreading is about getting the large-scale features in shape; structural markup comes later.

I like to have the page images open while I'm doing primary proofreading, but I don't want to be comparing them line by line unless I get to a section that was badly mangled during OCR (like those short columns).

Acquiring the text

I've had extensive experience with tesseract-ocr but it's not well-suited for this sort of work. I think it's great for detecting snippets of text in non-text-based images and I've used it to automatically determine page numbers on other projects. But for text-heavy, document-based images, I just can't beat Finereader's high-quality OCR.

Finereader is built as a complete document management system, but I just want it for its OCR.

Finereader's layout recognition is especially good. In other projects I had to give it more extensive hints to distinguish the text areas from the graphics, but I didn't have to do that at all with Dull Physics. That meant the primary OCR task was almost completely automated.

In this project I had no need to change the text areas that Finereader detected on its own.

It got confused near the end of a section when the columns were short, but that was a minor inconvenience.

Figure captions and minor layout issues were easy to handle in the primary proofing stage.

I've tried proofing text in Finereader but it just doesn't work for me. The storage format for the text will be asciidoc, so that's another reason to jump right into a plain-text format. I've tried to play with Finereader's other export formats, including exporting styles to OpenDocument formats that I'd use as a basis for layout, but it invariably drives me around the bend. As a matter of fact, that's what led me to discover asciidoc (more on that in a subsequent post).

I save Finereader's results as plain text and proof from there.

Getting the images in order

Every project is different, so having some scripting skills comes in handy. This book is over 600 pages long, so it was definitely worthwhile to create a tool that would help automate the task of renaming the image files into something that approximated page order.

I worked on one signature at a time and renamed the files after I had completed scanning both sides of each leaf. That means I was left with all the odd pages in one file name range, followed by the corresponding even pages in the next range. Time for Perl!

#!/usr/bin/perl

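use strict;
use warnings;
use File::Copy;

# Arguments: the first and last file of the range, the page number to give
# the first file, and the step between page numbers (for example 2 when a
# range holds only odd or only even pages).
my ($firstfile, $lastfile, $startnum, $increment) = @ARGV;
die "Usage: $0 firstfile lastfile startnum increment\n" unless @ARGV == 4;

# Split the name into prefix, zero-padded number, and extension.
$firstfile =~ /(.*?)([0-9]+)(\..*)/
    or die "No number found in '$firstfile'\n";

my ($pre, $input_fnumpart, $suff)=($1, $2, $3);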

print "First file number part is '$input_fnumpart'\n";
print "And is " . length($input_fnumpart) . " long\n";

my $l = length($input_fnumpart);
my $sprin = "%0$l" . "d";
my @inputfilenames;
my $inputfilename=$firstfile;
my $number=0;
push(@inputfilenames, $inputfilename);

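# Build the list of input names by counting up from the first file's number
# until the last file's name is reached. The last name must share the first
# one's prefix, digit count, and extension, or this loop never ends.
while ($inputfilename ne $lastfile) {
	$number++;
	$inputfilename = $pre 
            . sprintf($sprin, $number + $input_fnumpart)
            . $suff;
	push(@inputfilenames, $inputfilename);
}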

print "@inputfilenames\n";
for my $inputfilename (@inputfilenames) {
	$inputfilename =~ /(.*?)([0-9]+)(\..*)/;
	my ($pre, $input_fnumpart, $suff)=($1, $2, $3);
	my $outputfilename = $inputfilename;
	my $output_fnumpart = sprintf($sprin, $startnum);
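	# Rebuild the name from its parts so that only the number field changes,
	# even if the same digits also appear in the prefix.
	$outputfilename = $pre . $output_fnumpart . $suff;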
	$startnum += $increment;
	print "mv $inputfilename renamed/$outputfilename\n";
	# The "renamed" directory must already exist.
	move($inputfilename, "renamed/$outputfilename")
	    or die "Could not move $inputfilename: $!\n";
}

Does the script detect the page numbers in the image? No, you have to do that yourself. Good question, though. Some images are well-suited to such a process (with tesseract-ocr and imagemagick), but I didn't try it here.

The script was written for this project but would be easy enough to extend for other situations.
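
One possible extension is a dry run that lists the planned renames before any file is touched. The following is only a sketch: the sub name plan_renames, the file names, and the counts are invented for the example, and it assumes the even pages were scanned in ascending order like the odd ones.

```perl
use strict;
use warnings;

# Hypothetical dry-run helper: return the (old name, new name) pairs that
# a rename would produce, without touching any files.
sub plan_renames {
    my ($firstfile, $count, $startnum, $increment) = @_;
    my ($pre, $num, $suff) = $firstfile =~ /^(.*?)([0-9]+)(\.[^.]*)$/
        or die "No number found in '$firstfile'\n";
    my $fmt = '%0' . length($num) . 'd';    # keep the zero padding
    my @plan;
    for my $i (0 .. $count - 1) {
        my $old = $pre . sprintf($fmt, $num + $i) . $suff;
        my $new = $pre . sprintf($fmt, $startnum + $i * $increment) . $suff;
        push @plan, [$old, $new];
    }
    return @plan;
}

# Four front sides become pages 1, 3, 5, 7; the four back sides that
# follow become pages 2, 4, 6, 8.
for my $pair (plan_renames('scan0001.tif', 4, 1, 2),
              plan_renames('scan0005.tif', 4, 2, 2)) {
    print "$pair->[0] -> $pair->[1]\n";
}
```

The plan shows why renaming in place would be risky: scan0002.tif becomes scan0003.tif, which is also the name of a file still waiting to be renamed. Moving the results into a separate directory, as the script above does, avoids that collision.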

Irfanview makes short work of scrolling through a directory of images to ensure they've been renamed correctly. Mistakes were made but were usually limited to no more than about 32 images at a time.
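
A quick numeric check can complement the visual pass. This is a minimal sketch, not something used on this project: the sub name missing_pages and the file names are hypothetical, and it only finds gaps in the numbering, not images that landed under the wrong number.

```perl
use strict;
use warnings;

# Hypothetical check: given renamed file names, return the page numbers
# missing between the lowest and highest number found.
sub missing_pages {
    my %seen;
    for my $name (@_) {
        # Take the digits just before the extension; "0007" counts as 7.
        $seen{ $1 + 0 } = 1 if $name =~ /([0-9]+)\.[^.]*$/;
    }
    my @pages = sort { $a <=> $b } keys %seen;
    return () unless @pages;
    return grep { !$seen{$_} } $pages[0] .. $pages[-1];
}

# Invented names with page 3 left out.
my @missing = missing_pages(map { "page$_.tif" } qw(0001 0002 0004 0005));
print "Missing pages: @missing\n";    # prints "Missing pages: 3"
```

In practice the list of names could come from something like glob on the output directory.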

Dull Physics

Modern Physics by Charles Dull, copyright 1943. It's over 600 pages long, chock-full of beautiful line drawings and other illustrations, includes three three-color plates, and is overall a real treasure. Acquiring the pages of this book in digital form was straightforward, though time-consuming: about 30 seconds per scan, prep work not included.

Dull Physics was sewn and stapled. After removing the boards, spine, and end papers, I removed and discarded the thread and staples.

Never underestimate the power of a microspatula! I decided to work with the individual signatures and process them one at a time through image acquisition.

I die a little every time I slice up a book. Processing the text block this way allowed me to acquire the images *nearly* in page order, however.

The pages tended to stick to one another, so I learned to feed the leaves through the ADF one at a time.
