Hi All,

I'm indexing a pretty large collection of documents (about 500K relatively
long documents taking up >1TB space, mostly in MS Office formats), and am
confused about the file sizes in the index.  I've gotten through about 180K
documents, and the *.pos files add up to 325GB, while all of the rest
combined are using less than 5GB--including some large stored fields and
term vectors.  It makes sense to me that the compression on stored fields
helps to keep that part down on large text fields, and that term vectors
wouldn't be too big since they don't need position information, but the
magnitude of the difference is alarming.  Is that to be expected?  Is there
any way to reduce the size of the positions index if phrase searching is a
requirement?
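
For anyone who wants to check the same breakdown on their own index, here is
a minimal sketch of how per-extension totals like the ones above can be
tallied. The function name size_by_extension and the index directory path are
placeholders for illustration, not part of Solr or Lucene; the script only
walks the directory and sums file sizes by extension.

```python
import os
from collections import defaultdict


def size_by_extension(index_dir):
    """Return a dict mapping file extension -> total size in bytes.

    index_dir is assumed to be the directory holding the Lucene
    segment files (a hypothetical path; adjust for your install).
    """
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            # Files without an extension (e.g. segments_N) are grouped together.
            ext = os.path.splitext(name)[1] or "(none)"
            totals[ext] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


def print_report(totals):
    """Print extensions from largest to smallest, in GB."""
    for ext, size in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{ext:>10}  {size / 1e9:10.2f} GB")
```

Calling print_report(size_by_extension("/path/to/index")) with the real index
directory lists the largest extensions first, which makes it easy to see how
much of the total the *.pos files account for.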

I am using Solr 4.2.1.  These documents have a number of small
metadata elements, along with the big content field.  Like the default
schema, I'm storing but not indexing the content field, and a lot of the
fields get put into a catchall that is indexed and uses term vectors, but
is not stored.

Thanks,
Mike
