: I have 2 fields which will sometimes contain the same data. When they do
: contain the same data, am I paying the same performance cost as when they
: contain unique data? I think the real question here is: does Lucene index
: values per field, or per document?

are we talking about the stored value or the indexed value?

for stored values, i'm fairly sure nothing in Lucene recognizes duplicate 
strings, so each copy is stored separately rather than only once.

for indexed values, everything is done by "term", which is a fieldname, 
fieldvalue pair ... so 100,000 docs that all contain the indexed term 
"topic:sports" only involves the string "sports" being on disk once ... 
but if those 100,000 docs also contain the indexed term "keywords:sports" 
then the string "sports" is on disk twice.

if those 100,000 docs all have distinct 10 character values indexed in the 
"id" field, and those distinct values are also indexed in the "sameId" 
field, then those 100,000 10 character strings will all be on disk twice.
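
as a rough illustration of the two cases above, here is a toy model in 
Python. it is only a sketch: it keys a dictionary by (fieldname, fieldvalue) 
and is not Lucene's actual on-disk format, and the names build_term_dict 
and value_copies are made up for this example.

```python
# Toy model only: a term dictionary keyed by (field name, field value).
# This is NOT Lucene's real index format; all names here are hypothetical.
from collections import defaultdict


def build_term_dict(docs):
    """Map each (field, value) term to the list of doc ids that contain it."""
    terms = defaultdict(list)
    for doc_id, fields in enumerate(docs):
        for field, value in fields.items():
            terms[(field, value)].append(doc_id)
    return terms


def value_copies(terms):
    """Count how many distinct terms carry each value string."""
    copies = defaultdict(int)
    for _field, value in terms:
        copies[value] += 1
    return copies


# Case 1: 100,000 docs that all share the same value in two fields.
shared_docs = [{"topic": "sports", "keywords": "sports"} for _ in range(100_000)]
shared_terms = build_term_dict(shared_docs)

# Case 2: 100,000 docs with a distinct 10 character id, copied into sameId.
distinct_docs = [{"id": f"{i:010d}", "sameId": f"{i:010d}"} for i in range(100_000)]
distinct_terms = build_term_dict(distinct_docs)

print(len(shared_terms))    # distinct terms in case 1
print(len(distinct_terms))  # distinct terms in case 2
```

in case 1 the model holds only two terms, ("topic", "sports") and 
("keywords", "sports"), so the string "sports" appears twice no matter how 
many docs use it. in case 2 every id value appears once per field, so the 
number of terms doubles.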



-Hoss
