[time-nuts] HP 103AR scans online; post-processing advice sought

Rex rexa at sonic.net
Sun Sep 18 07:26:07 EDT 2005


On Sat, 17 Sep 2005 11:25:59 -0700, David Forbes <dforbes at dakotacom.net>
wrote:

>Hi,
>
>I have placed raw 200DPI 8bit GIFs of the HP103AR manual on my website.
>
>There's no index file; the scans are at:
>
>http://www.nixiebunny.com/hp103ar/hp103ar01.gif
>
>through
>
>http://www.nixiebunny.com/hp103ar/hp103ar39.gif
>
>so you can wget them easily. The files average 2 megabytes each, so 
>there are many extra ones and zeroes in there.
>
>Leading up to the next question:
>
>What's a good post-processing program to shrink the scans of text 
>pages, possibly OCRing them, and make one big PDF file out of the 
>lot? I know that a few folks on this list have done this work, but I 
>don't know how they did it.
>
>If there's free/cheap software that works well, I'll get it and 
>proceed, otherwise would one of the folks with such software step up 
>to the plate and complete the job for us?
>
>If it's many hours of work, then some automated script that can 
>shrink the text-only pages would be sufficient for that work, and a 
>simple PDF maker would handle the rest.
>
>I await your suggestions.

I use Ulead Photoimpact. It's mainly targeted at editing photographs,
but has all the tools you need to clean the scans up. It has a lot of
the functionality of Photoshop but is a lot less expensive. I'm sure
there are other programs that could do the same, but I know and like
this one.

It does take a fair amount of work for something like the HP docs. I
cleaned up your scans a good bit, and it took me 2-3 hrs. It is also a
complicated program. I'm pretty efficient with it now, but it took a
long time to learn what's there and how to use it. I could have gotten
80% of the improvement with batch-mode commands, but took the time to
do more.

The first thing I did was use brightness/contrast tools to remove most
of the gray. I also took the time to manually edit out the punch holes
and some other noise. One of the keys to shrinking was to save them as
gif files again but using 'grayscale 128' mode.
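
For anyone who would rather script that step, here is a rough sketch of
the two operations described above: a brightness/contrast stretch that
pushes light gray to white, then a reduction to 128 gray levels. This
is not what PhotoImpact does internally. The function names and the
black/white thresholds (60 and 200) are assumptions for illustration
and would need tuning against the real scans.

```python
def stretch_levels(pixels, black=60, white=200):
    """Map values <= black to 0 and >= white to 255, linear in between.

    `black` and `white` are illustrative thresholds, not measured values.
    """
    span = white - black
    out = []
    for p in pixels:
        if p <= black:
            out.append(0)
        elif p >= white:
            out.append(255)
        else:
            out.append(round((p - black) * 255 / span))
    return out


def quantize(pixels, levels=128):
    """Snap 8-bit gray values onto `levels` evenly spaced shades (0..255)."""
    out = []
    for p in pixels:
        index = p * levels // 256          # palette slot, 0 .. levels-1
        out.append(round(index * 255 / (levels - 1)))
    return out


if __name__ == "__main__":
    row = [20, 95, 180, 215, 240]          # one made-up row of scan pixels
    print(quantize(stretch_levels(row)))
```

Fewer distinct shades and large areas of pure white are what let the GIF
compressor do its job, which is presumably why this step shrinks the
files so much.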

The pages with photographic pictures took extra work. There was a lot of
artifacting from scanning the halftone images. I used a combination of
despeckle and blur filters to smooth them out before increasing
contrast.
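
A sketch of the idea, for the curious. I am assuming here that a 3x3
median filter is a fair stand-in for a "despeckle" tool; the filter
PhotoImpact actually uses may well differ. The image is just a list of
rows of 8-bit gray values, and the function name is made up.

```python
from statistics import median


def median3x3(img):
    """Replace each interior pixel with the median of its 3x3 neighbourhood.

    Border pixels are copied unchanged. Isolated dots (halftone speckle)
    disappear because a single outlier can never be the median of nine.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx]
                      for dy in (-1, 0, 1)
                      for dx in (-1, 0, 1)]
            out[y][x] = int(median(window))
    return out


if __name__ == "__main__":
    page = [[200] * 5 for _ in range(5)]
    page[2][2] = 0                         # one dark speck on a light field
    print(median3x3(page)[2][2])
```

Smoothing before raising the contrast matters: if the contrast goes up
first, the speckle gets pushed to solid black and is much harder to
remove afterwards.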

I also stitched together the pages for scans of wide pages.

The end result is that the whole document is about 9.8 MB, as opposed
to 2-3 MB per page before. That could probably be cut about in half by
reducing resolution on the pages -- in most cases I think they could go
down about 50% and still be quite readable. I didn't do that, though.
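
To put rough numbers on that, here is a back-of-envelope calculation.
The 39 files and the 2 MB average are taken from David's message; the
8.5 x 11 inch page size is my assumption, and the page count ignores
the wide pages that were stitched together.

```python
PAGES = 39                 # hp103ar01.gif .. hp103ar39.gif
AVG_RAW_MB = 2.0           # David's stated average per raw scan
CLEANED_TOTAL_MB = 9.8     # size of the cleaned-up set

raw_total_mb = PAGES * AVG_RAW_MB
reduction = raw_total_mb / CLEANED_TOTAL_MB
cleaned_per_page_mb = CLEANED_TOTAL_MB / PAGES


def pixels_per_page(dpi, width_in=8.5, height_in=11.0):
    """Pixel count for one page; the page dimensions are an assumption."""
    return int(width_in * dpi) * int(height_in * dpi)


if __name__ == "__main__":
    print(f"raw total:        {raw_total_mb:.0f} MB")
    print(f"reduction factor: {reduction:.1f}x")
    print(f"cleaned per page: {cleaned_per_page_mb:.2f} MB")
    print(f"pixels at 200 dpi: {pixels_per_page(200)}")
    print(f"pixels at 100 dpi: {pixels_per_page(100)}")
```

Halving the resolution quarters the pixel count, but compressed GIF
size does not scale directly with pixel count, so "about half" is a
guess rather than something this arithmetic proves.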

The resulting gif files can be had in one zip file here:

http://www.xertech.net/data/hp103ar.zip

Feel free to copy it and share it anywhere.

A lot can theoretically be improved by adjusting settings at scan time
-- contrast, resolution, moire. My previous scanner let me control a lot
of stuff. My current HP seems to think I should not get involved. It
irritates me that I need to adjust contrast on every image after the
fact, rather than making a good setting of the scanner before I scan all
those pages.

As you suggest, OCRing the docs would shrink them the most, but every
OCR tool I have tried needs a HUGE amount of hand-holding and
correction to get anything close to the original text and format. I
have done it for a few things in the past, but it is very painful at
best.

Thanks for taking the time to do the scanning. Lots of good helpful
people here in this group.






