OCR _without errors_ wouldn't break it. I thought that comment assumed the OCR
was dirty.

Honestly, I was once trying to index an OCR'd image of a "family tree" drawn as
a stylized tree: the most remote ancestor was labeled in vertical text on the
trunk, and the descendants were labeled at various angles as the trunk branched,
the branches branched, and on and on....

And as far as cleaning up the text is concerned, if it's dirty, anything you do
is wrong. For instance, again using the genealogy example, throwing out
unrecognized words removes exactly the data that's important when those words
are names.

But leaving nonsense characters in is wrong too....

And hand-correcting all of the data is almost always far too expensive.
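
To make the trade-off concrete, here is a rough sketch of the kind of filter
Rick describes below (drop any word with fewer than 3 alphanumerics or more
than 2 non-alphas). The function name and the sample tokens are made up for
illustration, and I'm assuming "non-alphas" means anything that isn't a letter,
digits included; Rick's actual code may well differ.

```python
def keep_token(token: str) -> bool:
    """Hypothetical filter: keep a token only if it has at least 3
    alphanumerics and at most 2 non-alphabetic characters.
    Assumption: digits count as non-alphabetic."""
    alphanumerics = sum(ch.isalnum() for ch in token)
    non_alphas = sum(not ch.isalpha() for ch in token)
    return alphanumerics >= 3 and non_alphas <= 2


# Made-up sample: two ordinary surnames, a short surname,
# an OCR garbage run, and a year.
tokens = ["Underwood", "O'Brien", "Ng", "'';,-d'.", "1842"]
kept = [t for t in tokens if keep_token(t)]
# kept == ["Underwood", "O'Brien"]
```

The garbage goes away, but so do the short surname and the year, which is
the sort of data a genealogy index can't afford to lose.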

If your OCR is indeed perfect, then I envy you ;)...

On a different note, I thought the captcha-image way of correcting OCR
text was brilliant.

Erick

On Thu, Oct 6, 2016 at 8:05 AM, Rick Leir <rl...@leirtech.com> wrote:
> I am curious to know where the square-root assumption is from, and why OCR
> (without errors) would break it. TIA
>
> cheers - - Rick
>
> On 2016-10-04 10:51 AM, Walter Underwood wrote:
>>
>> No, we don’t have OCR’ed text. But if you do, it breaks the assumption
>> that vocabulary size
>> is the square root of the text size.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>>> On Oct 4, 2016, at 7:14 AM, Rick Leir <rl...@leirtech.com> wrote:
>>>
>>> OCR’ed text can have large amounts of garbage such as '';,-d'."
>>> particularly when there is poor image quality or embedded graphics. Is that
>>> what is causing your huge vocabularies? I filtered the text, removing any
>>> word with fewer than 3 alphanumerics or more than 2 non-alphas.
>>>
>>>
>>> On 2016-10-03 09:30 PM, Walter Underwood wrote:
>>>>
>>>> That approach doesn’t work very well for estimates.
>>>>
>>>> Some parts of the index size and speed scale with the vocabulary instead
>>>> of the number of documents.
>>>> Vocabulary usually grows at about the square root of the total amount of
>>>> text in the index. OCR’ed text
>>>> breaks that estimate badly, with huge vocabularies.
>>>>
>>>>
>
