Here's a very useful page for looking at what "index size" means:
http://lucene.apache.org/java/3_0_2/fileformats.html#file-names

Note that the files having to do with stored data (e.g. *.fdt) have
very little impact on searching; they don't consume very many valuable
resources.
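
One rough way to see where the bytes go is to total the file sizes in the index directory by extension. The sketch below is only an illustration: the extension list is taken from the Lucene 3.0 file-formats page linked above, the category labels are my own grouping, and size_by_category is a hypothetical helper, not part of Lucene or Solr. If the index uses the compound file format (*.cfs), everything lands in one file and shows up under "other".

```python
import os
from collections import defaultdict

# Extension -> category, following the Lucene 3.0 file-formats page.
# The category names are this sketch's own labels, not Lucene terminology.
CATEGORIES = {
    "fdt": "stored fields",
    "fdx": "stored fields",
    "tis": "term dictionary",
    "tii": "term dictionary",
    "frq": "postings (doc IDs, frequencies)",
    "prx": "positions",
    "nrm": "norms",
    "tvx": "term vectors",
    "tvd": "term vectors",
    "tvf": "term vectors",
}

def size_by_category(index_dir):
    """Return total bytes per category for the files directly in index_dir."""
    totals = defaultdict(int)
    for name in os.listdir(index_dir):
        path = os.path.join(index_dir, name)
        if not os.path.isfile(path):
            continue
        # Files without a known extension (segments_N, *.cfs, ...) count as "other".
        ext = name.rsplit(".", 1)[-1] if "." in name else ""
        totals[CATEGORIES.get(ext, "other")] += os.path.getsize(path)
    return dict(totals)
```

Calling size_by_category on your own index directory (wherever your Solr data dir puts it) should show how much of the total is stored-field data versus the searchable portion.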
The "stored=true"-related files *do* have an impact on replication, and
perhaps on assembling the results pages, though....

One bit of clarification about the indexed portion of the files. The
terms are stored once, but each term has the doc IDs associated with it,
so even though the term is only there once, having it appear in multiple
documents will increase the size because of having to store the document
associations....

Best
Erick

On Thu, Aug 11, 2011 at 4:30 PM, Kevin Osborn <osbo...@yahoo.com> wrote:
> That makes sense. There are actually stored fields. I was mostly just trying
> to figure out how much my index size might grow. These fields I am dealing
> with are large and repetitive (but mixed).
>
>
> ________________________________
> From: Erick Erickson <erickerick...@gmail.com>
> To: solr-user@lucene.apache.org; Kevin Osborn <osbo...@yahoo.com>
> Sent: Wednesday, August 10, 2011 7:08 AM
> Subject: Re: unique terms and multi-valued fields
>
> Well, it depends (tm).
>
> If you're talking about *indexed* terms, then the value is stored only
> once in both the cases you mentioned below. There's really very little
> difference between a non-multi-valued field and a multi-valued field
> in terms of how it's stored in the searchable portion of the index,
> except for some position information.
>
> So, having an XML doc with a single-valued field
>
> <field name="category">computers laptops</field>
>
> is almost identical (except for position info such as
> positionIncrementGap) to
>
> <field name="category">computers</field>
> <field name="category">laptops</field>
>
> multiValued refers to the *input*, not whether more than one word is
> allowed in that field.
>
>
> Now, about *stored* fields. If you store the data, verbatim copies are
> kept in the storage-specific files in each segment, and the values will
> be on disk for each document.
>
> But you probably don't care much because this data is only referenced
> when you assemble a document for return to the client; it's irrelevant
> for searching.
>
> Best
> Erick
>
> On Tue, Aug 9, 2011 at 8:02 PM, Kevin Osborn <osbo...@yahoo.com> wrote:
>> Please verify my understanding. I have a field called "category" and it has
>> a value "computers". If I use this same field and value for all of my
>> documents, it is really only stored on disk once because
>> "category:computers" is a unique term. Is this correct?
>>
>> But what about multi-valued fields? So, I have a field called "category".
>> For 100 documents, it has the values "computers" and "laptops". For 100
>> other documents, it has the values "computers" and "tablets". Is this stored
>> as "category:computers", "category:laptops", "category:tablets", meaning 3
>> unique terms? Or is it stored as "category:computers,laptops" and
>> "category:computers,tablets"? I believe it is the first case (hopefully),
>> but I am not sure.
>>
>> Thanks.
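
To make the indexed-terms point from this thread concrete, here is a toy inverted index. It is only a sketch of the idea, not Lucene's actual on-disk format: it assumes plain whitespace tokenization, ignores positions, frequencies and compression, and build_index is a made-up helper name. It uses the 100 + 100 documents from the original question.

```python
from collections import defaultdict

def build_index(docs, field):
    """Map each (field, term) pair to the sorted doc IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        values = doc.get(field, [])
        if isinstance(values, str):
            values = [values]            # treat single-valued input as a one-item list
        for value in values:
            for term in value.split():   # whitespace tokenization (an assumption)
                postings[(field, term)].add(doc_id)
    return {key: sorted(ids) for key, ids in postings.items()}

# 100 docs with computers + laptops, then 100 docs with computers + tablets.
docs = ([{"category": ["computers", "laptops"]}] * 100
        + [{"category": ["computers", "tablets"]}] * 100)
index = build_index(docs, "category")

# One single-valued field holding both words vs. two values of a multi-valued field.
single = build_index([{"category": "computers laptops"}], "category")
multi = build_index([{"category": ["computers", "laptops"]}], "category")
```

In this model, index has exactly three keys (computers, laptops, tablets), so each term appears once, but the doc ID list for "computers" has 200 entries while the other two have 100 each. That list is the part that grows as more documents share a term. The single and multi results come out identical here because the toy model drops position information, which is the one place the two inputs differ in a real index.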