Re: Index size vs. number of documents

2008-08-15 Thread Otis Gospodnetic
apache.org > Sent: Friday, August 15, 2008 12:22:30 PM > Subject: Re: Index size vs. number of documents > > By "Index size almost never grows linearly with the number of documents" are you saying it increases more slowly than the number of documents, i.e. sub-line

Re: Index size vs. number of documents

2008-08-15 Thread Phillip Farber
By "Index size almost never grows linearly with the number of documents" are you saying it increases more slowly than the number of documents, i.e. sub-linearly, or more rapidly? With dirty OCR the number of unique terms is always increasing due to the garbage "words". -Phil Chris Hostetter w

Re: Index size vs. number of documents

2008-08-14 Thread Chris Hostetter
: > I'm surprised, as you are, by the non-linearity. Out of curiosity, what is

Unless the data in "stored" fields is significantly greater than the data in "indexed" fields, the index size almost never grows linearly with the number of documents -- it's the number of unique terms that tends to primarily in
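
To make the unique-terms point concrete, here is a minimal sketch in Python (not Lucene code) that tracks how many distinct terms have been seen after each document is added. The function name `vocabulary_growth`, the whitespace tokenization, and the sample documents are all assumptions made up for illustration; a real analyzer would tokenize differently.

```python
def vocabulary_growth(docs):
    """Return the cumulative number of unique terms after each document."""
    seen = set()
    growth = []
    for text in docs:
        # Naive tokenization: lowercase and split on whitespace (an assumption).
        seen.update(text.lower().split())
        growth.append(len(seen))
    return growth


# Hypothetical sample documents; the last one mimics dirty OCR output.
docs = [
    "index size grows with unique terms",
    "index size grows with stored fields",
    "unique terms drive index size",
    "dirty ocr adds garbage terms like qzx1 and t4e",
]

growth = vocabulary_growth(docs)
print(growth)
```

In this toy data the clean documents add few new terms once the common vocabulary has been seen, while the OCR-like document adds many, which is the effect Phil describes for garbage "words".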

Re: Index size vs. number of documents

2008-08-14 Thread Phillip Farber
Erick Erickson wrote: I'm surprised, as you are, by the non-linearity. Out of curiosity, what is your MaxFieldLength? By default only the first 10,000 tokens are added to a field per document. If you haven't set this higher, that could account for it. We set it to a very large number so we in

Re: Index size vs. number of documents

2008-08-13 Thread Erick Erickson
I'm surprised, as you are, by the non-linearity. Out of curiosity, what is your MaxFieldLength? By default only the first 10,000 tokens are added to a field per document. If you haven't set this higher, that could account for it. As far as I know, optimization shouldn't really affect the index siz
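
As a rough illustration of why a 10,000-token cap could flatten index growth, the sketch below models the truncation in plain Python. It is not the Lucene or Solr API; `truncate_tokens` and the generated token list are hypothetical, and only the 10,000 default comes from the message above.

```python
def truncate_tokens(tokens, max_field_length=10_000):
    """Keep only the first max_field_length tokens of a field, dropping the rest."""
    return tokens[:max_field_length]


# Hypothetical long document: 25,000 tokens in a single field.
tokens = [f"term{i}" for i in range(25_000)]

indexed = truncate_tokens(tokens)
dropped = len(tokens) - len(indexed)

print(len(indexed), "tokens indexed;", dropped, "tokens dropped")
```

Under this model, any document longer than the cap contributes the same number of tokens to the index no matter how long it really is, so index size would stop tracking the amount of text per document.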