>
> I don't deal with a lot of multi-lingual stuff, but my understanding is
> that this sort of thing gets a lot easier if you can partition your docs
> by language -- and even if you can't, doing some language detection on the
> (dirty) OCRed text to get a language guess (and then partition by language)
: Interesting. I wonder though if we have 4 million English documents and 250
: in Urdu, if the Urdu words would score badly when compared to ngram
: statistics for the entire corpus.
Well it doesn't have to be a strict ratio cutoff .. you could look at the
average frequency of all character n-grams in a word
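
A rough sketch of that n-gram scoring idea, in plain Java rather than anything Lucene-specific: build a table of character trigram counts from text you trust, then score each candidate token by the average smoothed log-frequency of its trigrams. The padding, add-one smoothing, and whatever cutoff you apply to the score are all assumptions here, and building one table per detected language (rather than one table for the whole corpus) is what would keep the 250 Urdu documents from being penalized by English statistics.

import java.util.HashMap;
import java.util.Map;

/**
 * Rough sketch: score a token by the average smoothed log-frequency of its
 * character trigrams. The padding, add-one smoothing, and any cutoff applied
 * to the score are assumptions, not something taken from the thread.
 */
public class NgramScorer {
    private final Map<String, Long> trigramCounts = new HashMap<>();
    private long total = 0;

    /** Accumulate trigram statistics; call once per token of reasonably clean text. */
    public void addWord(String word) {
        String s = "_" + word.toLowerCase() + "_";   // pad word boundaries
        for (int i = 0; i + 3 <= s.length(); i++) {
            trigramCounts.merge(s.substring(i, i + 3), 1L, Long::sum);
            total++;
        }
    }

    /** Average log-probability of the token's trigrams; higher = more word-like. */
    public double score(String token) {
        String s = "_" + token.toLowerCase() + "_";
        double sum = 0;
        int n = 0;
        for (int i = 0; i + 3 <= s.length(); i++) {
            long c = trigramCounts.getOrDefault(s.substring(i, i + 3), 0L);
            sum += Math.log((c + 1.0) / (total + trigramCounts.size()));  // add-one smoothing
            n++;
        }
        return n == 0 ? 0 : sum / n;
    }
}
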
>
> Hmm, how about a classifier? Common words are the "yes" training set,
> hapax legomenons are the "no" set, and n-grams are the features.
>
> But why isn't the OCR program already doing this?
>
> wunder
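
A minimal sketch of the classifier wunder describes, with common corpus words as the "yes" set, hapax legomena as the "no" set, and character trigrams as the features. This is naive Bayes with add-one smoothing and equal class priors, purely as an illustration; as noted elsewhere in the thread, many hapax legomena are real words, so the "no" set is noisy.

import java.util.HashMap;
import java.util.Map;

/**
 * Rough sketch of the classifier idea: common corpus words are the "yes"
 * examples, hapax legomena the "no" examples, character trigrams are the
 * features. Naive Bayes with add-one smoothing and equal class priors.
 */
public class GarbageTokenClassifier {
    private final Map<String, int[]> counts = new HashMap<>(); // trigram -> {yes, no}
    private final int[] totals = new int[2];

    public void train(String word, boolean isRealWord) {
        int cls = isRealWord ? 0 : 1;
        for (String t : trigrams(word)) {
            counts.computeIfAbsent(t, k -> new int[2])[cls]++;
            totals[cls]++;
        }
    }

    /** True if the token looks more like the hapax/garbage class. */
    public boolean looksLikeGarbage(String token) {
        double logYes = 0, logNo = 0;
        int vocab = counts.size() + 1;
        for (String t : trigrams(token)) {
            int[] c = counts.getOrDefault(t, new int[2]);
            logYes += Math.log((c[0] + 1.0) / (totals[0] + vocab));
            logNo  += Math.log((c[1] + 1.0) / (totals[1] + vocab));
        }
        return logNo > logYes;
    }

    private static String[] trigrams(String word) {
        String s = "_" + word.toLowerCase() + "_";
        String[] out = new String[Math.max(0, s.length() - 2)];
        for (int i = 0; i < out.length; i++) out[i] = s.substring(i, i + 3);
        return out;
    }
}
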
and their contexts to decide if they are legitimate)
>
> ?
>
>
> -Hoss
On Mar 11, 2010, at 1:34 PM, Chris Hostetter wrote:
> I wonder if one way to try and generalize
> the idea of "unlikely" letter combinations into a math problem (instead of a
> grammar/spelling problem) would be to score all the hapax legomenon
> words in your index
Hmm, how about a classifier?
: We can probably implement your suggestion about runs of punctuation and
: unlikely mixes of alpha/numeric/punctuation. I'm also thinking about
: looking for unlikely mixes of unicode character blocks. For example some of
: the CJK material ends up with Cyrillic characters. (except we would hav
On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West wrote:
>
> Thanks Simon,
>
> We can probably implement your suggestion about runs of punctuation and
> unlikely mixes of alpha/numeric/punctuation. I'm also thinking about
> looking for unlikely mixes of unicode character blocks. For example some of
> the CJK material ends up with Cyrillic characters.
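
One way to check for those unlikely block mixes is to collect the Unicode block of every letter in a token and flag combinations that should not co-occur. The sketch below tests only the CJK-plus-Cyrillic case mentioned above; a real filter would need a whitelist of the combinations that are legitimate for the collection's languages.

import java.lang.Character.UnicodeBlock;
import java.util.HashSet;
import java.util.Set;

/**
 * Rough sketch of the "unlikely mixes of Unicode blocks" check. Only the
 * CJK-plus-Cyrillic combination from the thread is tested here.
 */
public class MixedBlockCheck {
    public static boolean looksSuspicious(String token) {
        Set<UnicodeBlock> blocks = new HashSet<>();
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.isLetter(cp)) {
                blocks.add(UnicodeBlock.of(cp));    // record the block of each letter
            }
            i += Character.charCount(cp);
        }
        return blocks.contains(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS)
                && blocks.contains(UnicodeBlock.CYRILLIC);
    }
}
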
> es tend to be longer). We also looked for runs of
> punctuation, unlikely mixes of alpha/numeric/punctuation, and also
> eliminated longer words which consisted of runs of not-occurring-in-English
> bigrams.
>
> Hope this helps
>
> -Simon
>
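
The heuristics quoted above (runs of punctuation, unlikely alpha/numeric/punctuation mixes, and long words built from bigrams that never occur in English) could be sketched roughly as below. The regexes, the tiny bigram list, and the length and count thresholds are illustrative guesses, not the rules the posters actually used.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

/**
 * Rough sketch of the heuristics discussed in the thread. The patterns,
 * the tiny bigram list, and the thresholds are illustrative guesses only.
 */
public class OcrHeuristics {
    // three or more punctuation characters in a row
    private static final Pattern PUNCT_RUN = Pattern.compile("\\p{Punct}{3,}");
    // repeated alternation of letters and digits, e.g. "a1b2c3"
    private static final Pattern MIXED =
            Pattern.compile("(?:\\p{L}+\\p{N}+|\\p{N}+\\p{L}+){2,}");
    // stand-in for a real table of bigrams that never occur in English text
    private static final Set<String> BAD_BIGRAMS =
            new HashSet<>(Arrays.asList("qx", "jq", "zx", "vq"));

    public static boolean looksLikeOcrGarbage(String token) {
        if (PUNCT_RUN.matcher(token).find()) return true;
        if (MIXED.matcher(token).find()) return true;
        if (token.length() > 8) {                      // only bother with longer words
            String w = token.toLowerCase();
            int bad = 0;
            for (int i = 0; i + 2 <= w.length(); i++) {
                if (BAD_BIGRAMS.contains(w.substring(i, i + 2))) bad++;
            }
            if (bad >= 2) return true;                 // several impossible bigrams
        }
        return false;
    }
}
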
On Thu, Mar 11, 2010 at 3:37 PM, Burton-West, Tom wrote:
> Thanks Robert,
>
> I've been thinking about this since you suggested it on another thread. One
> problem is that it would also remove real words. Apparently 40-60% of the
> words in large corpora occur only once
> (http://en.wikipedia.
From: Robert Muir [mailto:...@gmail.com]
Sent: Tuesday, March 09, 2010 2:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Cleaning up dirty OCR
> Can anyone suggest any practical solutions to removing some fraction of the
> tokens containing OCR errors from our input stream?
one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812
On Tue, Mar 9, 2010 at 2:35 PM, Robert Muir wrote:
> > Can anyone suggest any practical solutions to removing some fraction of
> the tokens containing OCR errors from our input stream?
>
> one approach would be to try
> http://issues.apache.org/jira/browse/LUCENE-1812
>
> and filter terms that only appear once in the document.
> Can anyone suggest any practical solutions to removing some fraction of the
> tokens containing OCR errors from our input stream?
one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812
and filter terms that only appear once in the document.
--
Robert Muir
rcm...@gmail
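
As a rough illustration of the rule Robert suggests here (independent of the LUCENE-1812 patch itself), dropping within-document singletons can be done as a simple two-pass step before indexing. As Tom's reply above points out, the same rule discards real words too.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Rough illustration of "filter terms that only appear once in the document"
 * as a two-pass step before indexing. This is not the LUCENE-1812 patch,
 * just the rule it was suggested for.
 */
public class SingletonTermFilter {
    public static List<String> keepRepeatedTerms(List<String> docTokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : docTokens) {
            tf.merge(t, 1, Integer::sum);           // first pass: term frequencies
        }
        List<String> kept = new ArrayList<>();
        for (String t : docTokens) {
            if (tf.get(t) > 1) {
                kept.add(t);                        // second pass: drop singletons
            }
        }
        return kept;
    }
}
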
Hello all,
We have been indexing a large collection of OCR'd text: about 5 million books
in over 200 languages. With 1.5 billion OCR'd pages, even a small OCR error
rate creates a relatively large number of meaningless unique terms. (See
http://www.hathitrust.org/blogs/large-scale-search/too