It's my understanding that if my mergeFactor is 10, there shouldn't be more than 11 segments in my index directory (10 segments, plus one more if a merge is in progress). It would seem to follow that there shouldn't be more than 11 fdt files, 11 tis files, etc. However, I'm looking at one of my indexes now, and this doesn't seem to be the case. Here are the tis files for this index, for instance:
07/22/2008  07:49 PM    77,925,180 _1je.tis
07/23/2008  02:57 AM    65,988,651 _256.tis
07/23/2008  04:18 AM    13,159,578 _29t.tis
07/23/2008  05:08 AM    10,146,941 _2cw.tis
07/23/2008  05:39 AM     6,749,665 _2el.tis
07/23/2008  06:24 AM    12,274,012 _2he.tis
07/23/2008  07:01 AM    14,069,531 _2kh.tis
07/23/2008  07:53 AM    13,795,213 _2nu.tis
07/23/2008  08:20 AM     6,284,902 _2p0.tis
07/23/2008  08:27 AM     1,980,945 _2p9.tis
07/23/2008  08:36 AM     1,674,640 _2pk.tis
07/23/2008  08:37 AM       311,483 _2pl.tis
07/23/2008  08:38 AM       285,881 _2pm.tis
07/23/2008  08:39 AM       245,138 _2pn.tis
07/23/2008  08:40 AM       116,881 _2po.tis
07/17/2008  11:22 PM    69,635,905 _rp.tis
07/18/2008  12:59 AM    15,883,866 _xu.tis

There are 17 of these files. (File sizes are in bytes.)

When I open the index in Luke, it says all of them are "In Use" and doesn't list any of them as "Deletable". This seems to rule out the possibility that Solr/Lucene somehow "forgot" to clean up files that were no longer in use.

I notice that _2pk, _2pl, _2pm, _2pn and _2po are sequential file names, alphabetically speaking, and their last-modified times are very close to one another. Does this mean they're actually part of the same segment, even though they are in separate files? If those five files are indeed part of a single segment, then the number of segments represented by these files would really be 17-4=13. But that's still more than the expected 11 segments.

I just discovered that one of my other indexes has over 11,000 tis files. That's disturbing. I'm not sure whether it has the same underlying cause.

Any ideas?
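
In case it helps anyone check my counting, here is a minimal Python sketch of how I arrived at the numbers above. It rests on two assumptions of mine: that each segment has exactly one .tis file, and that the part of the name after the underscore is a base-36 counter. The helper names (segment_number, consecutive_runs) are my own and not part of Lucene. Note that a run of consecutive numbers only shows the files were created back to back; it doesn't prove they belong to one segment.

```python
import re

# .tis file names copied from the directory listing in this post.
tis_files = [
    "_1je.tis", "_256.tis", "_29t.tis", "_2cw.tis", "_2el.tis", "_2he.tis",
    "_2kh.tis", "_2nu.tis", "_2p0.tis", "_2p9.tis", "_2pk.tis", "_2pl.tis",
    "_2pm.tis", "_2pn.tis", "_2po.tis", "_rp.tis", "_xu.tis",
]


def segment_number(filename):
    """Decode the counter in a name like '_2pk.tis', assuming base 36."""
    match = re.fullmatch(r"_([0-9a-z]+)\.tis", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return int(match.group(1), 36)


def consecutive_runs(sorted_numbers):
    """Group sorted integers into runs where each value is the previous one plus 1."""
    runs = []
    for n in sorted_numbers:
        if runs and n == runs[-1][-1] + 1:
            runs[-1].append(n)
        else:
            runs.append([n])
    return runs


numbers = sorted(segment_number(name) for name in tis_files)
runs = consecutive_runs(numbers)

print(len(tis_files), "tis files")
print(len(runs), "groups if each consecutive run is counted once")
for run in runs:
    if len(run) > 1:
        print("consecutive run of", len(run), "files starting at counter", run[0])
```

Pointing tis_files at the names from a real directory listing (for example via os.listdir) would let me run the same check on the index with over 11,000 tis files.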