Re: Cleaning up dirty OCR

2010-03-11 Thread Robert Muir
> I don't deal with a lot of multi-lingual stuff, but my understanding is that this sort of thing gets a lot easier if you can partition your docs by language -- and even if you can't, doing some language detection on the (dirty) OCRed text to get a language guess (and then partition by language) ...
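
A minimal sketch of the partitioning idea, using the dominant Unicode block as a crude stand-in for real language detection (class and method names here are invented; telling Latin-script languages apart would still need an n-gram detector):

import java.lang.Character.UnicodeBlock;
import java.util.HashMap;
import java.util.Map;

public class ScriptGuesser {

  // Returns the Unicode block that covers the most letters in the text,
  // or null if the text contains no letters. Enough to separate CJK,
  // Cyrillic, Arabic-script (Urdu), etc. from Latin-script material.
  public static UnicodeBlock dominantBlock(String text) {
    Map<UnicodeBlock, Integer> counts = new HashMap<>();
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      i += Character.charCount(cp);
      if (Character.isLetter(cp)) {  // ignore digits, punctuation, space
        counts.merge(UnicodeBlock.of(cp), 1, Integer::sum);
      }
    }
    UnicodeBlock best = null;
    int max = 0;
    for (Map.Entry<UnicodeBlock, Integer> e : counts.entrySet()) {
      if (e.getValue() > max) { best = e.getKey(); max = e.getValue(); }
    }
    return best;
  }
}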

Re: Cleaning up dirty OCR

2010-03-11 Thread Chris Hostetter
: Interesting. I wonder though if we have 4 million English documents and 250 in Urdu, if the Urdu words would score badly when compared to ngram statistics for the entire corpus.
Well, it doesn't have to be a strict ratio cutoff ... you could look at the average frequency of all character ...
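
A hedged sketch of the average-frequency idea (a hypothetical class, nothing that exists in Solr): train character-trigram counts per language or script partition, so a small Urdu partition is scored against Urdu statistics rather than drowned out by the English corpus:

import java.util.HashMap;
import java.util.Map;

public class NgramScorer {
  private final Map<String, Long> trigramCounts = new HashMap<>();
  private long total = 0;

  // Train on (presumed mostly clean) tokens from one partition.
  public void add(String token) {
    String padded = "_" + token + "_";  // mark word boundaries
    for (int i = 0; i + 3 <= padded.length(); i++) {
      trigramCounts.merge(padded.substring(i, i + 3), 1L, Long::sum);
      total++;
    }
  }

  // Mean log-probability of the token's trigrams, with add-one smoothing.
  // OCR garbage tends to score far below real words of the same language.
  public double score(String token) {
    String padded = "_" + token + "_";
    double sum = 0;
    int n = 0;
    for (int i = 0; i + 3 <= padded.length(); i++) {
      long c = trigramCounts.getOrDefault(padded.substring(i, i + 3), 0L);
      sum += Math.log((c + 1.0) / (total + trigramCounts.size()));
      n++;
    }
    return n == 0 ? Double.NEGATIVE_INFINITY : sum / n;
  }
}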

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
We've been thinking about running some kind of classifier against each book to select books with a high percentage of dirty OCR for some kind of special processing. We haven't quite figured out a multilingual feature set yet, other than the punctuation/alphanumeric and character block ideas mentioned ...
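
One possible language-independent feature set for such a classifier, as a sketch only (the names and thresholds are invented, and \p{Punct} matches ASCII punctuation only):

public class OcrFeatures {

  // Per-book ratios: { punctuation runs, letter/digit mixes, multi-block tokens }
  public static double[] extract(String[] tokens) {
    int punctRuns = 0, mixed = 0, multiBlock = 0;
    for (String t : tokens) {
      if (t.matches(".*\\p{Punct}{2,}.*")) punctRuns++;                // runs of punctuation
      if (t.matches(".*\\p{L}.*") && t.matches(".*\\p{N}.*")) mixed++; // letters and digits mixed
      if (countBlocks(t) > 1) multiBlock++;                            // e.g. Cyrillic inside CJK
    }
    double n = Math.max(1, tokens.length);
    return new double[] { punctRuns / n, mixed / n, multiBlock / n };
  }

  private static int countBlocks(String t) {
    java.util.Set<Character.UnicodeBlock> blocks = new java.util.HashSet<>();
    for (int i = 0; i < t.length(); ) {
      int cp = t.codePointAt(i);
      i += Character.charCount(cp);
      if (Character.isLetter(cp)) blocks.add(Character.UnicodeBlock.of(cp));
    }
    return blocks.size();
  }
}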

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
Interesting. I wonder though if we have 4 million English documents and 250 in Urdu, if the Urdu words would score badly when compared to ngram statistics for the entire corpus.
hossman wrote: > Since you are dealing with multiple languages, and multiple variant usages of language ...

Re: Cleaning up dirty OCR

2010-03-11 Thread Walter Underwood
On Mar 11, 2010, at 1:34 PM, Chris Hostetter wrote: > I wonder if one way to try and generalize the idea of "unlikely" letter combinations into a math problem (instead of a grammar/spelling problem) would be to score all the hapax legomenon words in your index.
Hmm, how about a classifier?
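
The two ideas compose naturally: collect the hapax legomena, then let a character n-gram model (or a trained classifier) decide which of them look like words. A rough sketch in plain Java, with invented names, rather than anything touching index internals:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HapaxCollector {

  // Terms occurring exactly once across the whole corpus: the pool of
  // candidates to score, since OCR garbage rarely repeats exactly.
  public static List<String> hapaxes(Iterable<String> corpusTokens) {
    Map<String, Integer> counts = new HashMap<>();
    for (String t : corpusTokens) counts.merge(t, 1, Integer::sum);
    List<String> out = new ArrayList<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() == 1) out.add(e.getKey());
    }
    return out;
  }
}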

Re: Cleaning up dirty OCR

2010-03-11 Thread Chris Hostetter
: We can probably implement your suggestion about runs of punctuation and unlikely mixes of alpha/numeric/punctuation. I'm also thinking about looking for unlikely mixes of Unicode character blocks. For example some of the CJK material ends up with Cyrillic characters. (except we would have ...

Re: Cleaning up dirty OCR

2010-03-11 Thread Robert Muir
On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West wrote: > Thanks Simon, We can probably implement your suggestion about runs of punctuation and unlikely mixes of alpha/numeric/punctuation. I'm also thinking about looking for unlikely mixes of Unicode character blocks. For example some of the CJK material ends up with Cyrillic characters. ...

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
Thanks Simon, We can probably implement your suggestion about runs of punctuation and unlikely mixes of alpha/numeric/punctuation. I'm also thinking about looking for unlikely mixes of Unicode character blocks. For example some of the CJK material ends up with Cyrillic characters. (except we would have ...
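
A sketch of what that could look like as an index-time token filter, written against Lucene's attribute-based TokenStream API from memory, so treat the details as approximate:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Drops tokens that look like OCR noise: long runs of punctuation, or
// letters drawn from more than one Unicode block within a single token
// (e.g. Cyrillic characters inside a CJK word). In practice the Latin
// blocks would need coalescing first, since accented letters live in
// Latin-1 Supplement / Latin Extended and would otherwise be flagged.
public final class OcrNoiseFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public OcrNoiseFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if (!looksLikeNoise(termAtt.toString())) {
        return true; // pass clean tokens through
      }
    }
    return false;
  }

  private static boolean looksLikeNoise(String t) {
    if (t.matches(".*\\p{Punct}{3,}.*")) return true; // e.g. "c,;'-tion"
    java.util.Set<Character.UnicodeBlock> blocks = new java.util.HashSet<>();
    for (int i = 0; i < t.length(); ) {
      int cp = t.codePointAt(i);
      i += Character.charCount(cp);
      if (Character.isLetter(cp)) blocks.add(Character.UnicodeBlock.of(cp));
    }
    return blocks.size() > 1;
  }
}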

Re: Cleaning up dirty OCR

2010-03-11 Thread Robert Muir
On Thu, Mar 11, 2010 at 3:37 PM, Burton-West, Tom wrote: > Thanks Robert, I've been thinking about this since you suggested it on another thread. One problem is that it would also remove real words. Apparently 40-60% of the words in large corpora occur only once (http://en.wikipedia. ...

RE: Cleaning up dirty OCR

2010-03-11 Thread Burton-West, Tom
> Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream?
One approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812 ...

Re: Cleaning up dirty OCR

2010-03-09 Thread simon
On Tue, Mar 9, 2010 at 2:35 PM, Robert Muir wrote: > > Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream?
> One approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812 and filter terms that only appear once in the document. ...

Re: Cleaning up dirty OCR

2010-03-09 Thread Robert Muir
> Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream?
One approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812 and filter terms that only appear once in the document.
-- Robert Muir rcm...@gmail
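
The gist of that suggestion, independent of whatever LUCENE-1812 actually implements (a sketch only; note Tom's caveat upthread that this also removes real words, since 40-60% of words in large corpora occur only once):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DocHapaxFilter {

  // Within one document, keep only tokens whose term frequency is > 1,
  // on the theory that OCR garbage rarely repeats exactly.
  public static List<String> keepRepeated(List<String> docTokens) {
    Map<String, Integer> tf = new HashMap<>();
    for (String t : docTokens) tf.merge(t, 1, Integer::sum);
    List<String> kept = new ArrayList<>();
    for (String t : docTokens) {
      if (tf.get(t) > 1) kept.add(t);
    }
    return kept;
  }
}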