Just for illustration, this is your original data:
doc1: hello world
doc2: hello daniem
doc3: hello pal

Now, Lucene produces something like this from the input:

hello: id_doc1, id_doc2, id_doc3
world: id_doc1
daniem: id_doc2
pal: id_doc3

Well, it's more complex, but this is enough for illustration.
As you can see, the representation of a document is completely different. Each document costs only a few bytes per distinct word, for the Lucene-internal document id.
If words occur more than once per document AND you do not store termVectors, Lucene just adds the number of occurrences per word per doc to its index:

hello: id_doc1[1], id_doc2[1], id_doc3[1]
world: id_doc1[1]
daniem: id_doc2[1]
pal: id_doc3[1]

Imagine what this means for longer texts, where stopwords and important words in particular occur many times: a repeated word still needs only one entry with a count per document.

I would suggest starting with the Lucene wiki if you want to learn more about Lucene.

Regards,
Em

--
View this message in context: http://lucene.472066.n3.nabble.com/Taxonomy-in-SOLR-tp2317955p2319920.html
Sent from the Solr - User mailing list archive at Nabble.com.
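
P.S. If it helps, here is a toy sketch in Python of the term-to-document mapping described above. It is not Lucene's code or its on-disk format. The function name build_index is made up for this example, and lowercasing plus whitespace splitting is only an assumed stand-in for a real analyzer.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: number of occurrences in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split replaces a real analyzer.
        term_counts = Counter(text.lower().split())
        for term, freq in term_counts.items():
            # One entry per (term, doc), no matter how often the term repeats.
            index[term][doc_id] = freq
    return dict(index)


docs = {
    "doc1": "hello world",
    "doc2": "hello daniem",
    "doc3": "hello pal",
}
index = build_index(docs)

# Print the postings in the same notation as the illustration.
for term in sorted(index):
    postings = ", ".join(f"id_{doc}[{n}]" for doc, n in index[term].items())
    print(f"{term}: {postings}")
```

A document that repeats a word, say "hello hello world", would still add just one entry for hello, with a count of 2, which is why long texts do not blow up the index.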