Re: Cleaning up dirty OCR

2010-03-11 Thread Robert Muir
> > I don't deal with a lot of multi-lingual stuff, but my understanding is > that this sort of thing gets a lot easier if you can partition your docs > by language -- and even if you can't, doing some language detection on the > (dirty) OCRed text to get a language guess (and then partition by lan
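
A minimal sketch of the language-guess step, assuming per-language character trigram profiles built offline from clean text (the profiles map, the class name, and the scoring are illustrative, not any particular library's API):

import java.util.HashMap;
import java.util.Map;

// Sketch: guess a document's language from character trigram overlap with
// per-language profiles. The profiles are assumed to be built offline from
// clean text in each language; nothing here is a real library API.
public class TrigramLanguageGuesser {

  private final Map<String, Map<String, Double>> profiles; // language -> trigram -> weight

  public TrigramLanguageGuesser(Map<String, Map<String, Double>> profiles) {
    this.profiles = profiles;
  }

  public String guess(String text) {
    Map<String, Integer> docTrigrams = trigrams(text.toLowerCase());
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Map.Entry<String, Map<String, Double>> lang : profiles.entrySet()) {
      double score = 0;
      for (Map.Entry<String, Integer> t : docTrigrams.entrySet()) {
        score += t.getValue() * lang.getValue().getOrDefault(t.getKey(), 0.0);
      }
      if (score > bestScore) {
        bestScore = score;
        best = lang.getKey();
      }
    }
    return best;
  }

  private static Map<String, Integer> trigrams(String s) {
    Map<String, Integer> counts = new HashMap<>();
    for (int i = 0; i + 3 <= s.length(); i++) {
      counts.merge(s.substring(i, i + 3), 1, Integer::sum);
    }
    return counts;
  }
}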

Re: Cleaning up dirty OCR

2010-03-11 Thread Chris Hostetter
: Interesting. I wonder though if we have 4 million English documents and 250 : in Urdu, if the Urdu words would score badly when compared to ngram : statistics for the entire corpus. Well it doesn't have to be a strict ratio cutoff .. you could look at the average frequency of all character
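
A rough sketch of the "average frequency of all character" bigrams idea: flag a token when the mean corpus frequency of its character bigrams falls below a cutoff. The bigramCounts map and the cutoff are assumptions; keeping per-script or per-language counts would be one way to stop a few hundred Urdu documents from being penalized by English-dominated statistics.

import java.util.Map;

// Sketch: flag a token as suspect if the average corpus frequency of its
// character bigrams is below a cutoff. bigramCounts is assumed to be
// precomputed over the (dirty) corpus; the cutoff is a tuning knob.
public class BigramFrequencyCheck {

  private final Map<String, Long> bigramCounts;
  private final double cutoff;

  public BigramFrequencyCheck(Map<String, Long> bigramCounts, double cutoff) {
    this.bigramCounts = bigramCounts;
    this.cutoff = cutoff;
  }

  public boolean looksLikeOcrJunk(String token) {
    if (token.length() < 2) {
      return false;                       // too short to judge
    }
    long sum = 0;
    int n = 0;
    for (int i = 0; i + 2 <= token.length(); i++) {
      sum += bigramCounts.getOrDefault(token.substring(i, i + 2), 0L);
      n++;
    }
    return ((double) sum / n) < cutoff;   // rare bigrams on average -> suspect
  }
}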

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
> > Hmm, how about a classifier? Common words are the "yes" training set, > hapax legomenons are the "no" set, and n-grams are the features. > > But why isn't the OCR program already doing this? > > wunder
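
One possible reading of the classifier suggestion, sketched as naive Bayes log-odds over character bigram features: common words are the "yes" set, hapax legomena the "no" set, and a token is kept if it scores above zero. The training sets, the smoothing, and the threshold are all assumptions here.

import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the classifier idea: naive Bayes over character bigram features,
// trained on common words ("yes") vs. hapax legomena ("no"). Training sets,
// smoothing, and the keep/drop threshold are assumptions, not a recipe.
public class WordShapeClassifier {

  private static final double SMOOTH = 1.0;
  private final Map<String, Double> logOdds = new HashMap<>();

  public WordShapeClassifier(Collection<String> commonWords, Collection<String> hapaxWords) {
    Map<String, Integer> yes = new HashMap<>();
    Map<String, Integer> no = new HashMap<>();
    long yesTotal = count(commonWords, yes);
    long noTotal = count(hapaxWords, no);
    Set<String> vocab = new HashSet<>(yes.keySet());
    vocab.addAll(no.keySet());
    for (String bigram : vocab) {
      double pYes = (yes.getOrDefault(bigram, 0) + SMOOTH) / (yesTotal + SMOOTH * vocab.size());
      double pNo = (no.getOrDefault(bigram, 0) + SMOOTH) / (noTotal + SMOOTH * vocab.size());
      logOdds.put(bigram, Math.log(pYes / pNo));
    }
  }

  public boolean keep(String token) {
    String t = token.toLowerCase();
    double score = 0;
    for (int i = 0; i + 2 <= t.length(); i++) {
      score += logOdds.getOrDefault(t.substring(i, i + 2), 0.0);
    }
    return score >= 0;   // positive log-odds: looks more like a real word
  }

  private static long count(Collection<String> words, Map<String, Integer> into) {
    long total = 0;
    for (String w : words) {
      String s = w.toLowerCase();
      for (int i = 0; i + 2 <= s.length(); i++) {
        into.merge(s.substring(i, i + 2), 1, Integer::sum);
        total++;
      }
    }
    return total;
  }
}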

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
and their contexts to decide if they are legitimate)? > > -Hoss

Re: Cleaning up dirty OCR

2010-03-11 Thread Walter Underwood
On Mar 11, 2010, at 1:34 PM, Chris Hostetter wrote: > I wonder if one way to try and generalize > the idea of "unlikely" letter combinations into a math problem (instead of > grammar/spelling problem) would be to score all the hapax legomenon > words in your index Hmm, how about a classifier?

Re: Cleaning up dirty OCR

2010-03-11 Thread Chris Hostetter
: We can probably implement your suggestion about runs of punctuation and : unlikely mixes of alpha/numeric/punctuation. I'm also thinking about : looking for unlikely mixes of unicode character blocks. For example some of : the CJK material ends up with Cyrillic characters. (except we would hav
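
The "unlikely mixes of unicode character blocks" check can be sketched with java.lang.Character.UnicodeScript: collect the scripts a token draws its letters from and flag combinations that should not co-occur, e.g. Cyrillic letters inside an otherwise Han token. Which mixes count as unlikely is an assumption to tune per collection.

import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

// Sketch: flag tokens that mix Unicode scripts in implausible ways, e.g.
// Cyrillic letters appearing inside otherwise-Han (CJK) tokens. The notion of
// which mixes are "unlikely" is an assumption to tune per collection.
public class ScriptMixCheck {

  public static boolean suspiciousMix(String token) {
    Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
    for (int i = 0; i < token.length(); ) {
      int cp = token.codePointAt(i);
      if (Character.isLetter(cp)) {
        scripts.add(UnicodeScript.of(cp));
      }
      i += Character.charCount(cp);
    }
    scripts.remove(UnicodeScript.COMMON);     // digits, punctuation, etc.
    scripts.remove(UnicodeScript.INHERITED);  // combining marks
    // Example rule: Han mixed with Cyrillic, or more than two scripts at all,
    // is treated as probable OCR noise.
    return (scripts.contains(UnicodeScript.HAN) && scripts.contains(UnicodeScript.CYRILLIC))
        || scripts.size() > 2;
  }
}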

Re: Cleaning up dirty OCR

2010-03-11 Thread Robert Muir
On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West wrote: > > Thanks Simon, > > We can probably implement your suggestion about runs of punctuation and > unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about > looking for unlikely mixes of unicode character blocks.  For example some
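
For the runs-of-punctuation and alpha/numeric/punctuation-mix part, a couple of cheap heuristics are probably enough as a first pass; the patterns and the length threshold below are guesses, not a tested configuration.

import java.util.regex.Pattern;

// Sketch: cheap heuristics for OCR junk -- long punctuation runs and longer
// tokens that mix letters, digits, and punctuation. The patterns and the
// length threshold are illustrative guesses.
public class JunkTokenHeuristics {

  private static final Pattern PUNCT_RUN = Pattern.compile("\\p{Punct}{3,}"); // e.g. "foo;;;;bar"
  private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");

  public static boolean looksLikeJunk(String token) {
    if (PUNCT_RUN.matcher(token).find()) {
      return true;
    }
    boolean hasLetter = token.codePoints().anyMatch(Character::isLetter);
    boolean hasDigit = token.codePoints().anyMatch(Character::isDigit);
    boolean hasPunct = PUNCT.matcher(token).find();
    // A longer token mixing all three character classes is usually OCR noise.
    return token.length() > 6 && hasLetter && hasDigit && hasPunct;
  }
}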

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
es tend to be longer). We also looked for runs of > punctuation, unlikely mixes of alpha/numeric/punctuation, and also > eliminated longer words which consisted of runs of not-occurring-in-English > bigrams. > > Hope this helps > > -Simon
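
The not-occurring-in-English-bigrams filter might look something like this: reject longer tokens in which more than one adjacent letter pair never appears in an English wordlist. The englishBigrams set, the length floor, and the one-bigram allowance are all assumptions.

import java.util.Set;

// Sketch: drop longer tokens built largely from letter bigrams that never
// occur in English. englishBigrams is assumed to be built offline from a
// wordlist; the length floor and the one-bigram allowance are guesses.
public class EnglishBigramFilterCheck {

  private final Set<String> englishBigrams;

  public EnglishBigramFilterCheck(Set<String> englishBigrams) {
    this.englishBigrams = englishBigrams;
  }

  public boolean shouldDrop(String token) {
    String t = token.toLowerCase();
    if (t.length() < 6) {
      return false;                 // short words: too little evidence
    }
    int unseen = 0;
    for (int i = 0; i + 2 <= t.length(); i++) {
      if (!englishBigrams.contains(t.substring(i, i + 2))) {
        unseen++;
      }
    }
    return unseen > 1;              // allow one odd pair, e.g. from a rare name
  }
}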

Re: Cleaning up dirty OCR

2010-03-11 Thread Robert Muir
On Thu, Mar 11, 2010 at 3:37 PM, Burton-West, Tom wrote: > Thanks Robert, > > I've been thinking about this since you suggested it on another thread.  One > problem is that it would also remove real words. Apparently 40-60% of the > words in large corpora occur only once > (http://en.wikipedia.
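
Before pruning anything it is cheap to measure how large the hapax population actually is in the index, so the 40-60% figure can be checked against the real collection. A sketch against a recent Lucene API (not the 3.x API that was current when this thread was written); the index path and field name are placeholders.

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

// Sketch: count how many distinct terms in a field occur exactly once in the
// whole index, to see what a "drop singletons" policy would really remove.
public class HapaxCounter {
  public static void main(String[] args) throws Exception {
    try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
      Terms terms = MultiTerms.getTerms(reader, "ocr_text");
      if (terms == null) {
        return;                              // field not present in this index
      }
      TermsEnum te = terms.iterator();
      long total = 0, hapax = 0;
      for (BytesRef term = te.next(); term != null; term = te.next()) {
        total++;
        if (te.totalTermFreq() == 1) {       // term occurs once across the corpus
          hapax++;
        }
      }
      System.out.printf("distinct terms: %d, hapax legomena: %d (%.1f%%)%n",
          total, hapax, 100.0 * hapax / total);
    }
  }
}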

RE: Cleaning up dirty OCR

2010-03-11 Thread Burton-West, Tom
> Can anyone suggest any practical solutions to removing some fraction of the > tokens containing OCR errors from our input stream? one approach would be to try http

Re: Cleaning up dirty OCR

2010-03-09 Thread simon
On Tue, Mar 9, 2010 at 2:35 PM, Robert Muir wrote: > > Can anyone suggest any practical solutions to removing some fraction of > the tokens containing OCR errors from our input stream? > > one approach would be to try > http://issues.apache.org/jira/browse/LUCENE-1812 > > and filter terms that on

Re: Cleaning up dirty OCR

2010-03-09 Thread Robert Muir
> Can anyone suggest any practical solutions to removing some fraction of the > tokens containing OCR errors from our input stream? one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812 and filter terms that only appear once in the document. -- Robert Muir
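
A sketch of the "appears only once in the document" idea as a plain Lucene TokenFilter (separate from whatever LUCENE-1812 itself provides): buffer the whole stream on the first call, count terms, then replay only tokens whose term repeats. The class name is made up, the buffering is only reasonable for page- or document-sized fields, and position gaps left by removed tokens are not adjusted.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeSource;

// Sketch: drop tokens whose term occurs only once in the current document.
// Buffers the whole token stream on the first incrementToken() call, so it is
// only suitable for page/document-sized fields, not unbounded streams.
public final class SingletonDropFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final List<AttributeSource.State> buffered = new ArrayList<>();
  private final List<String> terms = new ArrayList<>();
  private final Map<String, Integer> counts = new HashMap<>();
  private int pos = -1;

  public SingletonDropFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pos < 0) {                            // first call: read and count everything
      while (input.incrementToken()) {
        String term = termAtt.toString();
        terms.add(term);
        counts.merge(term, 1, Integer::sum);
        buffered.add(captureState());
      }
      pos = 0;
    }
    while (pos < buffered.size()) {
      int i = pos++;
      if (counts.get(terms.get(i)) > 1) {     // keep only terms that repeat
        restoreState(buffered.get(i));
        return true;
      }
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    buffered.clear();
    terms.clear();
    counts.clear();
    pos = -1;
  }
}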

Cleaning up dirty OCR

2010-03-09 Thread Burton-West, Tom
Hello all, We have been indexing a large collection of OCR'd text. About 5 million books in over 200 languages. With 1.5 billion OCR'd pages, even a small OCR error rate creates a relatively large number of meaningless unique terms. (See http://www.hathitrust.org/blogs/large-scale-search/too