>
> I don't deal with a lot of multi-lingual stuff, but my understanding is
> that this sort of thing gets a lot easier if you can partition your docs
> by language -- and even if you can't, doing some language detection on the
> (dirty) OCRed text to get a language guess (and then partition by language)
: Interesting. I wonder though if we have 4 million English documents and 250
: in Urdu, if the Urdu words would score badly when compared to ngram
: statistics for the entire corpus.
Well it doesn't have to be a strict ratio cutoff ... you could look at the
average frequency of all characters
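
A rough sketch of that average-character-frequency idea in plain Java, just for concreteness (the add-one smoothing and the -9.0 cutoff are invented and would need tuning against real OCR output):

import java.util.HashMap;
import java.util.Map;

// Sketch: score tokens by how common their letters are in the whole corpus.
public class CharFrequencyScorer {

    private final Map<Character, Long> charCounts = new HashMap<>();
    private long totalChars = 0;

    // Feed the (dirty) corpus text in to build the frequency table.
    public void addText(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toLowerCase(text.charAt(i));
            if (Character.isLetter(c)) {
                charCounts.merge(c, 1L, Long::sum);
                totalChars++;
            }
        }
    }

    // Mean log-probability of the token's letters under the corpus counts,
    // with add-one smoothing so unseen characters don't go to -infinity.
    public double score(String token) {
        double sum = 0;
        int letters = 0;
        for (int i = 0; i < token.length(); i++) {
            char c = Character.toLowerCase(token.charAt(i));
            if (!Character.isLetter(c)) continue;
            long count = charCounts.getOrDefault(c, 0L);
            sum += Math.log((count + 1.0) / (totalChars + charCounts.size() + 1.0));
            letters++;
        }
        return letters == 0 ? Double.NEGATIVE_INFINITY : sum / letters;
    }

    // Arbitrary cutoff; character bigrams or trigrams would catch "unlikely
    // letter combinations" better than single-character frequencies.
    public boolean looksLikeJunk(String token) {
        return score(token) < -9.0;
    }
}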
We've been thinking about running some kind of a classifier against each book
to select books with a high percentage of dirty OCR for some kind of special
processing. Haven't quite figured out a multilingual feature set yet other
than the punctuation/alphanumeric and character block ideas mentioned.
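
For what it's worth, a book-level feature vector along those lines can stay pretty language-neutral; this is only a sketch, and which classifier consumes it is left open:

import java.lang.Character.UnicodeBlock;
import java.util.HashSet;
import java.util.Set;

// Sketch: language-independent per-book features for a "dirty OCR" classifier.
public class BookFeatures {

    public static double[] extract(String bookText) {
        long letters = 0, digits = 0, other = 0;
        Set<UnicodeBlock> blocks = new HashSet<>();

        for (int i = 0; i < bookText.length(); i++) {
            char c = bookText.charAt(i);
            if (Character.isWhitespace(c)) continue;
            if (Character.isLetter(c)) letters++;
            else if (Character.isDigit(c)) digits++;
            else other++;                       // punctuation, symbols, stray bytes
            UnicodeBlock block = UnicodeBlock.of(c);
            if (block != null) blocks.add(block);
        }
        double total = Math.max(1, letters + digits + other);

        return new double[] {
            letters / total,    // alphabetic ratio
            digits / total,     // numeric ratio
            other / total,      // punctuation/symbol ratio
            blocks.size()       // how many distinct Unicode blocks appear at all
        };
    }
}

Those few numbers, computed over a hand-labelled sample of clean and dirty books, would be the training input.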
Interesting. I wonder though if we have 4 million English documents and 250
in Urdu, if the Urdu words would score badly when compared to ngram
statistics for the entire corpus.
hossman wrote:
>
> Since you are dealing with multiple languages, and multiple variant usages
> of language
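
One way around the 4-million-vs-250 skew would be to key the statistics by a (rough) language guess rather than pooling everything. The sketch below reuses the CharFrequencyScorer sketch from earlier in the thread and falls back to the pooled counts when there is no model for a language:

import java.util.HashMap;
import java.util.Map;

// Sketch: one character-frequency model per detected language, so the Urdu
// subset is scored against Urdu counts rather than the English-dominated corpus.
public class PerLanguageStats {

    private final Map<String, CharFrequencyScorer> byLanguage = new HashMap<>();
    private final CharFrequencyScorer pooled = new CharFrequencyScorer();

    // langCode would come from whatever language detector is run on the dirty OCR.
    public void addDocument(String langCode, String text) {
        byLanguage.computeIfAbsent(langCode, k -> new CharFrequencyScorer()).addText(text);
        pooled.addText(text);
    }

    public double score(String langCode, String token) {
        return byLanguage.getOrDefault(langCode, pooled).score(token);
    }
}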
On Mar 11, 2010, at 1:34 PM, Chris Hostetter wrote:
> I wonder if one way to try and generalize
> the idea of "unlikely" letter combinations into a math problem (instead of a
> grammar/spelling problem) would be to score all the hapax legomenon
> words in your index
Hmm, how about a classifier?
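
If it helps, pulling the hapax legomena out of the index is cheap with a TermsEnum. The sketch below assumes a fairly recent Lucene release (MultiTerms) and an indexed field called "ocr", both of which are assumptions:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

// Sketch: list every term in the "ocr" field that occurs in exactly one document.
public class HapaxScanner {

    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
            Terms terms = MultiTerms.getTerms(reader, "ocr");   // field name is an assumption
            if (terms == null) return;
            TermsEnum termsEnum = terms.iterator();
            for (BytesRef term = termsEnum.next(); term != null; term = termsEnum.next()) {
                if (termsEnum.docFreq() == 1) {
                    // here is where a character-ngram score would decide whether this
                    // is a legitimate rare word or OCR junk
                    System.out.println(term.utf8ToString());
                }
            }
        }
    }
}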
: We can probably implement your suggestion about runs of punctuation and
: unlikely mixes of alpha/numeric/punctuation. I'm also thinking about
: looking for unlikely mixes of unicode character blocks. For example some of
: the CJK material ends up with Cyrillic characters. (except we would hav
On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West wrote:
>
> Thanks Simon,
>
> We can probably implement your suggestion about runs of punctuation and
> unlikely mixes of alpha/numeric/punctuation. I'm also thinking about
> looking for unlikely mixes of unicode character blocks. For example some
Thanks Simon,
We can probably implement your suggestion about runs of punctuation and
unlikely mixes of alpha/numeric/punctuation. I'm also thinking about
looking for unlikely mixes of unicode character blocks. For example some of
the CJK material ends up with Cyrillic characters. (except we wo
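
In case a concrete version is useful, here is roughly what those three token-level checks could look like. The thresholds are invented, and it uses Character.UnicodeScript rather than raw Unicode blocks so that, say, an accented Latin letter doesn't count as a second "block":

import java.lang.Character.UnicodeScript;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Sketch: per-token heuristics for likely OCR garbage.
public class TokenJunkHeuristics {

    // three or more punctuation/symbol characters in a row
    private static final Pattern PUNCT_RUN = Pattern.compile("\\p{Punct}{3,}");

    public static boolean looksLikeJunk(String token) {
        return PUNCT_RUN.matcher(token).find()
                || mixesAlphaNumPunct(token)
                || mixesScripts(token);
    }

    // letters, digits and punctuation all jumbled inside one token, e.g. "w0r|d."
    private static boolean mixesAlphaNumPunct(String token) {
        boolean letter = false, digit = false, punct = false;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetter(c)) letter = true;
            else if (Character.isDigit(c)) digit = true;
            else if (!Character.isWhitespace(c)) punct = true;
        }
        return letter && digit && punct;
    }

    // letters from more than one script in one token, e.g. CJK mixed with Cyrillic;
    // this also flags some legitimate tokens, so treat it as a signal, not a verdict
    private static boolean mixesScripts(String token) {
        Set<UnicodeScript> scripts = new HashSet<>();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetter(c)) continue;
            UnicodeScript script = UnicodeScript.of(c);
            if (script != UnicodeScript.COMMON && script != UnicodeScript.INHERITED) {
                scripts.add(script);
            }
        }
        return scripts.size() > 1;
    }
}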
On Thu, Mar 11, 2010 at 3:37 PM, Burton-West, Tom wrote:
> Thanks Robert,
>
> I've been thinking about this since you suggested it on another thread. One
> problem is that it would also remove real words. Apparently 40-60% of the
> words in large corpora occur only once
> (http://en.wikipedia.
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Tuesday, March 09, 2010 2:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Cleaning up dirty OCR
> Can anyone suggest any practical solutions to removing some fraction of the
> tokens containing OCR errors from our input stream?
one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812
On Tue, Mar 9, 2010 at 2:35 PM, Robert Muir wrote:
> > Can anyone suggest any practical solutions to removing some fraction of
> the tokens containing OCR errors from our input stream?
>
> one approach would be to try
> http://issues.apache.org/jira/browse/LUCENE-1812
>
> and filter terms that only appear once in the document.
> Can anyone suggest any practical solutions to removing some fraction of the
> tokens containing OCR errors from our input stream?
one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812
and filter terms that only appear once in the document.
--
Robert Muir
rcm...@gmail.com
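
A crude pre-indexing version of that heuristic (not LUCENE-1812 itself, just the same "drop within-document singletons" idea applied to raw text before it reaches the analyzer) would be something like the following; the whitespace tokenization is a placeholder, and as noted above it throws away legitimate rare words along with the junk:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: drop every token that occurs only once within a single document.
public class SingletonDropper {

    public static List<String> keepRepeatedTokens(String documentText) {
        String[] tokens = documentText.toLowerCase().split("\\s+");

        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }

        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (counts.get(token) > 1) {    // within-document hapaxes are discarded
                kept.add(token);
            }
        }
        return kept;
    }
}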