Re: Using Hooks To OCR Documents

Ryan Schmidt Mon, 06 Dec 2010 00:03:01 -0800

On Dec 3, 2010, at 09:44, Jim Jenkins wrote:

> I’m planning to use Hooks to add OCR scanning for select documents going into 
> a SVN repo.  I’m not really sure where to start so I’m hoping someone here 
> can tell me if it’s possible and even suggest how best to proceed.
>  
> Basically I’d like to have every commit to an SVN repo stop at the pre-commit 
> (or another more suitable) hook so the submitted files can be inspected and 
> if needed run through a command line OCR engine.  We are dealing with “image” 
> based PDF files so these would be sent off to the OCR engine and a 
> “text+image” PDF would be returned.  The new PDF would replace the original 
> before being sent on its way into the SVN repo.



Some of this is possible, assuming that you will automate everything, including 
the process of deciding whether or not to OCR the document. (Hook scripts run 
on the server and are not interactive.)

Here's an example pre-commit hook which checks the syntax of any committed Java 
files:

http://svn.haxx.se/users/archive-2006-06/0853.shtml

You could change the criterion from "extension .java" to whatever yours is 
("extension .pdf", maybe, plus some other check to see whether the PDF is 
image-based), and change the action from running checkstyle to running your OCR 
program.

What's not possible is changing the content of the incoming transaction, as you 
propose. You must either accept the transaction as-is (by returning 0 from your 
pre-commit hook script), or reject it (by returning any other number). So you 
could do that, and if an incoming PDF is image-based, reject the commit and 
inform the user they must run the OCR program on it first.
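
A minimal sketch of such a reject-and-inform pre-commit hook, in Python, might 
look like the following. It relies on the real "svnlook changed" and 
"svnlook cat" subcommands with the -t (transaction) option. The function names 
are made up for illustration, and looks_image_only is only a crude placeholder 
heuristic (an assumption, not a reliable test for image-based PDFs); you would 
replace it with whatever check suits your documents.

```python
#!/usr/bin/env python3
"""Sketch of a pre-commit hook that rejects PDFs which appear to be image-only."""
import subprocess
import sys


def changed_pdfs(changed_output):
    """Return paths of added or content-modified .pdf files.

    Expects the output of `svnlook changed`: a status field in the first
    four columns, followed by the path.
    """
    paths = []
    for line in changed_output.splitlines():
        status, path = line[:4].strip(), line[4:]
        # Keep only additions and content updates; this skips deletions ("D")
        # and property-only changes ("_U").
        if status[:1] not in ("A", "U"):
            continue
        if path.lower().endswith(".pdf"):
            paths.append(path)
    return paths


def looks_image_only(pdf_bytes):
    """Hypothetical placeholder: treat a PDF with no font resource as image-only.

    This is a crude assumption and can misjudge PDFs whose objects are stored
    in compressed streams. Substitute a proper check.
    """
    return b"/Font" not in pdf_bytes


def svnlook(subcommand, repos, txn, *extra):
    """Run an svnlook subcommand against the pending transaction."""
    return subprocess.run(
        ["svnlook", subcommand, "-t", txn, repos, *extra],
        check=True,
        stdout=subprocess.PIPE,
    ).stdout


def main(repos, txn):
    listing = svnlook("changed", repos, txn).decode("utf-8", "replace")
    rejected = [
        path
        for path in changed_pdfs(listing)
        if looks_image_only(svnlook("cat", repos, txn, path))
    ]
    if rejected:
        # Subversion relays stderr to the client when the hook exits non-zero.
        sys.stderr.write("Commit rejected. Run OCR on these PDFs first:\n")
        for path in rejected:
            sys.stderr.write("  " + path + "\n")
        return 1
    return 0


# Subversion invokes the hook as: pre-commit REPOS TXN-NAME
if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

The script never modifies the transaction; it only reads it and decides 
whether to accept (exit 0) or reject (exit 1).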

I have a pre-commit script on my repository doing something similar: I run 
pngcrush on committed PNGs, and if I find a PNG that would benefit from being 
crushed, I reject the commit and tell the user to pngcrush it and then try the 
commit again.

That would be the preferred way to do things. But if it will be too difficult 
for your users to run the OCR program themselves and you want to automate the 
process server-side, an alternative is to accept the commit -- not run any of 
these checks in the pre-commit -- and run your script at post-commit time 
instead. If you detect that a just-committed revision contains an image-based 
PDF that you can OCR, then OCR it, and replace it, in a second commit initiated 
by the post-commit script. This is trickier because the hook script might then 
have to manage a working copy (check out the directory, change the PDF to the 
OCR'd one, commit, delete the working copy).

This is fraught with problems such as: What happens if the post-commit script 
decides to act on the PDF that's being committed by the post-commit script? 
(Infinite loop?) What happens if someone manages to commit another revision to 
that PDF before the hook script is done committing its revision? Perhaps that's 
not likely. But commits can fail for many reasons, which the script would 
either have to anticipate and deal with, or log or email failures for someone 
to deal with manually.

There's also the problem that a user who committed an image-based PDF would 
then immediately have an out-of-date working copy, which is not expected in 
normal Subversion usage, though you could train your users to understand this 
and recommend they run "svn up" again shortly after committing. Or, if your 
script does replace a PDF, you could inform the user via out-of-band means 
(email, instant message, etc.) that they should run "svn up".
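
To make the post-commit variant concrete, here is a rough Python sketch under 
several assumptions: the repository is reachable through a file:// URL on a 
Unix-like server, "ocr-tool" stands in for a hypothetical OCR command that 
rewrites a PDF in place, and the hook recognizes its own commits by a 
hypothetical "[ocr-hook]" tag in the log message so it does not loop. The 
needs_ocr check is a crude placeholder heuristic, not a reliable test. Paths 
needing URL escaping and notifying the user are left out.

```python
#!/usr/bin/env python3
"""Sketch: post-commit hook that OCRs image-only PDFs in a follow-up commit."""
import os
import subprocess
import sys
import tempfile

MARKER = "[ocr-hook]"     # hypothetical tag the hook puts in its own log messages
OCR_COMMAND = "ocr-tool"  # hypothetical CLI, assumed to rewrite the PDF in place


def is_own_commit(log_message):
    """True if this revision was committed by the hook itself."""
    return MARKER in log_message


def needs_ocr(pdf_bytes):
    """Hypothetical placeholder: a PDF with no font resource is assumed image-only."""
    return b"/Font" not in pdf_bytes


def run(*cmd):
    """Run a command, raising CalledProcessError on a non-zero exit status."""
    return subprocess.run(cmd, check=True, stdout=subprocess.PIPE).stdout


def ocr_and_commit(repos, path):
    """Check out only the PDF, OCR it if needed, and commit the result."""
    parent_url = "file://" + os.path.abspath(repos) + "/" + os.path.dirname(path)
    # The temporary working copy is deleted when the with-block ends.
    with tempfile.TemporaryDirectory() as wc:
        target = os.path.join(wc, os.path.basename(path))
        run("svn", "checkout", "--quiet", "--depth", "empty", parent_url, wc)
        run("svn", "update", "--quiet", target)  # fetch just this one file
        with open(target, "rb") as handle:
            if not needs_ocr(handle.read()):
                return
        run(OCR_COMMAND, target)
        run("svn", "commit", "--quiet", "-m", MARKER + " OCR " + path, wc)


def main(repos, rev):
    log = run("svnlook", "log", "-r", rev, repos).decode("utf-8", "replace")
    if is_own_commit(log):
        return 0  # our own follow-up commit: stop here to avoid a loop
    listing = run("svnlook", "changed", "-r", rev, repos).decode("utf-8", "replace")
    failures = 0
    for line in listing.splitlines():
        status, path = line[:4].strip(), line[4:]
        if status[:1] in ("A", "U") and path.lower().endswith(".pdf"):
            try:
                ocr_and_commit(repos, path)
            except (OSError, subprocess.CalledProcessError) as err:
                # Commits can fail for many reasons; record it for a human.
                sys.stderr.write("OCR hook failed for %s: %s\n" % (path, err))
                failures += 1
    return 1 if failures else 0


# Subversion invokes the hook as: post-commit REPOS REV [TXN-NAME]
if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

The log-message tag is only one way to break the loop; checking the author of 
the revision with "svnlook author" would be another. Failures here go to 
stderr, which you would probably redirect to a log file or an email.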
