Just for illustration:

This is your original data:

doc1: hello world
doc2: hello daniem
doc3: hello pal

Now, Lucene produces something like this from the input:
hello: id_doc1,id_doc2,id_doc3
world: id_doc1
daniem: id_doc2
pal: id_doc3

Well, it's more complex than that, but this is enough for illustration.
As you can see, the representation of a document is completely different
from the original text. A document costs only a few bytes per distinct
word: the Lucene-internal id stored in that word's list.
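
To make the idea concrete, here is a minimal Python sketch of such a
word-to-documents mapping. It is only an illustration and not how Lucene
stores its index internally; the names docs and build_inverted_index are
made up for this example, and the whitespace split stands in for a real
Lucene analyzer.

```python
from collections import defaultdict

# The three sample documents; the ids are stand-ins for Lucene's internal doc ids.
docs = {
    "id_doc1": "hello world",
    "id_doc2": "hello daniem",
    "id_doc3": "hello pal",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer: lowercase and split on whitespace (assumption, not Lucene's analysis).
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_inverted_index(docs)
for term, ids in index.items():
    print(f"{term}: {','.join(ids)}")
```

Note that "hello" is stored once, no matter how many documents contain it;
each additional document only adds one more id to its list.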

If a word occurs more than once in a document AND you do not store
termVectors, Lucene just adds the number of occurrences per word per doc to
its index:

hello: id_doc1[1],id_doc2[1],id_doc3[1]
world: id_doc1[1]
daniem: id_doc2[1]
pal: id_doc3[1]

Imagine what happens with longer texts, where stopwords and important words
in particular occur more than once: each repeated word still costs only one
entry with a counter per document.
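
The per-document counts can be sketched in Python as well. Again, this is
just an illustration under simplifying assumptions, not Lucene's real index
format: build_index_with_freqs is a made-up name, the longer sample text is
hypothetical, and tokenizing is a plain whitespace split.

```python
from collections import Counter, defaultdict

# Hypothetical longer document in which some words repeat.
docs = {
    "id_doc1": "the quick fox and the lazy dog and the cat",
    "id_doc2": "hello daniem",
}


def build_index_with_freqs(docs):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Count every term once per document instead of storing each occurrence.
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return dict(index)


index = build_index_with_freqs(docs)
for term, postings in index.items():
    entries = ",".join(f"{doc_id}[{count}]" for doc_id, count in postings.items())
    print(f"{term}: {entries}")
```

The ten words of the first document collapse into seven entries, because
"the" and "and" are stored once each with a count instead of being repeated.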

I would suggest starting with the Lucene wiki if you want to learn more
about Lucene.

Regards,
Em
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Taxonomy-in-SOLR-tp2317955p2319920.html
Sent from the Solr - User mailing list archive at Nabble.com.
