Hi All,

I'm indexing a fairly large collection of documents (about 500K relatively long documents, mostly in MS Office formats, taking up more than 1TB of space), and I'm confused about the file sizes in the index. I've gotten through about 180K documents so far, and the *.pos files add up to 325GB, while all of the other index files combined use less than 5GB, including some large stored fields and term vectors.

It makes sense to me that compression on stored fields keeps that part small for large text fields, and that term vectors wouldn't be too big since they don't need position information, but the magnitude of the difference is alarming. Is that to be expected? Is there any way to reduce the size of the positions data if phrase searching is a requirement?
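To put the figures above in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the remaining documents look like the first 180K and that the *.pos files grow linearly with document count, which is only an approximation; the variable names are just for illustration.

```python
# Figures quoted in this post. The linear extrapolation is an assumption,
# not a measurement.
docs_indexed = 180_000
docs_total = 500_000
pos_gb_so_far = 325.0
other_gb_upper_bound = 5.0  # "less than 5GB", so this is an upper bound

# Average positions data per document, in MB (1 GB taken as 1000 MB).
pos_mb_per_doc = pos_gb_so_far * 1000 / docs_indexed

# Projected *.pos total if growth stays linear through all 500K documents.
projected_pos_gb = pos_gb_so_far * docs_total / docs_indexed

# Lower bound on how many times larger *.pos is than everything else.
pos_to_other_ratio = pos_gb_so_far / other_gb_upper_bound

print(f"positions per document: {pos_mb_per_doc:.2f} MB")
print(f"projected *.pos total:  {projected_pos_gb:.0f} GB")
print(f"*.pos vs. other files:  at least {pos_to_other_ratio:.0f}x")
```

If that assumption holds, the positions files alone would end up close to the size of the source collection.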
I am using Solr 4.2.1. These documents have a number of small metadata elements, along with the big content field. As in the default schema, I'm storing but not indexing the content field, and a lot of the fields get put into a catchall field that is indexed and uses term vectors, but is not stored.

Thanks,
Mike