We index two copies of the same text: one with stemming and stop words,
the other with neither.  Phrase searches run against the second copy.
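
Roughly, a sketch of what that looks like in schema.xml -- the field and
type names here are illustrative, not our actual configuration.  The
fieldType entries go in the <types> section, the field and copyField
entries in <fields> and below:

  <!-- Copy 1: stop words + stemming, used for ranked keyword search -->
  <fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- Copy 2: no stop words, no stemming, used for phrase search -->
  <fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

  <field name="id"        type="string"       indexed="true" stored="true"/>
  <field name="ocr"       type="text_stemmed" indexed="true" stored="false"/>
  <field name="ocr_exact" type="text_exact"   indexed="true" stored="false"/>

  <!-- Index the same OCR text into both fields -->
  <copyField source="ocr" dest="ocr_exact"/>

Phrase queries then go against the exact field, e.g.
ocr_exact:"to be or not to be", while ordinary ranked keyword queries hit
the stemmed one.  You pay for indexing the text twice, but keep both
behaviors.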

You might use two different OCR implementations and cross-correlate the
output.  

Lance

-----Original Message-----
From: Phillip Farber [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, January 23, 2008 8:15 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr feasibility with terabyte-scale data

For sure this is a problem.  We have considered some strategies.  One might
be to use a dictionary to clean up the OCR, but that gets hard for proper
names and technical jargon.  Another is to use stop words (which has the
unfortunate side effect of making phrase searches like "to be or not to be"
impossible).  I've heard you can't make a silk purse out of a sow's ear ...
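
For the dictionary idea, one index-time sketch would be a keep-word filter,
assuming your Solr build includes solr.KeepWordFilterFactory (the type name
and keepwords.txt file below are just placeholders):

  <fieldType name="text_dictionary" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- Only tokens present in keepwords.txt survive; OCR garbage is
           dropped, but so are proper names and jargon not in the list -->
      <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>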

Phil



Erick Erickson wrote:
> Just to add another wrinkle, how clean is your OCR? I've seen it
> range from very nice (i.e. 99.9% of the words are actually words) to
> horrible (60%+ of the "words" are nonsense). I saw one attempt
> to OCR a family tree. As in a stylized tree with the data
> hand-written along the various branches in every orientation. Not a
> recognizable word in the bunch <G>....
> 
> Best
> Erick
> 
> On Jan 22, 2008 2:05 PM, Phillip Farber <[EMAIL PROTECTED]> wrote:
> 
>>
>> Ryan McKinley wrote:
>>>> We are considering Solr 1.2 to index and search a terabyte-scale
>>>> dataset of OCR.  Initially our requirements are simple: basic
>>>> tokenizing, score sorting only, no faceting.   The schema is simple
>>>> too.  A document consists of a numeric id, stored and indexed and a
>>>> large text field, indexed not stored, containing the OCR typically
>>>> ~1.4Mb.  Some limited faceting or additional metadata fields may be
>>>> added later.
>>> I have not done anything on this scale...  but with:
>>> https://issues.apache.org/jira/browse/SOLR-303 it will be possible to
>>> split a large index into many smaller indices and return the union of
>>> all results.  This may or may not be necessary depending on what the
>>> data actually looks like (if your text just uses 100 words, your index
>>> may not be that big)
>>>
>>> How many documents are you talking about?
>>>
>> Currently 1M docs @ ~1.4Mb/doc, scaling to 7M docs.  This is OCR, so we
>> are talking perhaps 50K unique words total to index, so as you point out
>> the index might not be too big.  It's the *data* that is big, not the
>> *index*, right?  So I don't think SOLR-303 (distributed search) is
>> required here.
>>
>> Obviously, as the number of documents increases, the index size must
>> increase to some degree -- linearly, I think?  But what index size will
>> result for 7M documents over a 50K-word vocabulary, with just 2 fields
>> per doc: one id field and one OCR field of ~1.4Mb?  Ballpark?
>>
>> Regarding single word queries, do you think, say, 0.5 sec/query to
>> return 7M score-ranked IDs is possible/reasonable in this scenario?
>>
>>
>>>> Should we expect Solr indexing time to slow significantly as we scale
>>>> up?  What kind of query performance could we expect?  Is it totally
>>>> naive even to consider Solr at this kind of scale?
>>>>
>>> You may want to check out the lucene benchmark stuff
>>> http://lucene.apache.org/java/docs/benchmarks.html
>>>
>>>
>>
>>> http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/benchmark/byTask/package-summary.html
>>>
>>>
>>> ryan
>>>
>>>
> 
