Hi Otis,

I recall the "pagination" issue; it is still unresolved (with the default
scoring implementation): even with small documents, retrieving results 1 to
10 can take close to 0 milliseconds, while retrieving results 100,000 to
100,010 can take a few minutes (I saw this with the trunk version about 6
months ago, with very small documents, about 100 million docs in total). It
is advisable to restrict search results to the top 1,000 in any case (as
Google does)...
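Roughly what I mean, as a rough sketch (the "page" helper and the
query/searcher setup are mine, not from any patch): with the usual
offset-based approach, showing hits 100,000 to 100,010 still forces Lucene
to collect and rank the top 100,010 documents:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

class DeepPagingSketch {
    // Return hits [offset, offset + pageSize); Lucene must still collect
    // and score the top (offset + pageSize) docs, so page 10,000 is far
    // more expensive than page 1.
    static ScoreDoc[] page(IndexSearcher searcher, Query query,
                           int offset, int pageSize) throws Exception {
        TopDocs top = searcher.search(query, offset + pageSize);
        int from = Math.min(offset, top.scoreDocs.length);
        int to = Math.min(offset + pageSize, top.scoreDocs.length);
        ScoreDoc[] pageHits = new ScoreDoc[to - from];
        System.arraycopy(top.scoreDocs, from, pageHits, 0, to - from);
        return pageHits;
    }
}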



I believe things can go wrong; yes, most plain text extracted from books is
about 2 KB per page, so 500 pages => roughly 1,000,000 bytes (or double
that for UTF-8).

Theoretically, it doesn't make any sense to index a BIG document containing
every term in the dictionary without any term-frequency calculations, but
even with them... I can't imagine we should index thousands of docs where
each one is just a (different) version of the whole Wikipedia; that would
be the wrong design...

OK, use case: indexing a single HUGE document. What would we do? Create an
index with _the_only_ document? Then every search would return that same
document (or nothing)? Paginate it; split it into pages (see the sketch
below). I am pragmatic...
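Something along these lines, as a rough sketch only (the field names and
the book/page layout are just for illustration, assuming a 3.x-style Field
API): index one Lucene Document per page and keep the book id plus page
number, so hits point back into the right place in the book:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

class PagePerDocumentSketch {
    // Index one Document per page instead of one huge Document; "pages"
    // would come from splitting the raw book text into ~2 KB chunks.
    static void indexBook(IndexWriter writer, String bookId, String[] pages)
            throws Exception {
        for (int pageNo = 0; pageNo < pages.length; pageNo++) {
            Document doc = new Document();
            doc.add(new Field("book", bookId,
                              Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("page", Integer.toString(pageNo + 1),
                              Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("text", pages[pageNo],
                              Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
    }
}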


Fuad



On 11-06-07 8:04 PM, "Otis Gospodnetic" <otis_gospodne...@yahoo.com> wrote:

>Hi,
>
>
>> I think the question is strange... May be you are wondering about
>>possible
>> OOM exceptions? 
>
>No, that's an easier one. I was more wondering whether with 400 MB Fields
>(indexed, not stored) it becomes incredibly slow to:
>* analyze
>* commit / write to disk
>* search
>
>> I think we can pass to Lucene single document  containing
>> comma separated list of "term, term, ..." (few billion times)...  Except
>> "stored" and "TermVectorComponent"...

