On 11/27/2018 5:29 AM, Markus Jelsma wrote:
> A background batch process compiles a data set. When finished, it sends a delete-all to its target collection, then everything gets sent by SolrJ, followed by a regular commit. When inspecting the core, I notice it has one segment with 9578 documents, of which exactly half are deleted.
I know you're not a newbie, so if I seem condescending here, that's not my intent.
If you want Solr to get rid of all deleted docs without an optimize, you need to ensure that the segments with deleted documents do not have ANY live documents. When that's the case, the commit alone should delete those segments for you. If you have segments that are MOSTLY (but not entirely) deleted documents, then an optimize is the only way to clear them all.
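If it ever comes to that, the optimize is a single SolrJ call. A minimal sketch -- the URL and class name below are placeholders, not anything from your setup:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class PurgeDeletes {
        public static void main(String[] args) throws Exception {
            try (SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {
                // Rewrites the index, discarding every deleted document.
                // Expensive on a large index, so treat it as a last resort.
                client.optimize();
            }
        }
    }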
If you're sending the delete-all first, then all segments that exist before you start indexing should be entirely marked deleted. Lucene will delete those segments entirely at commit time. That's one of its important performance-related capabilities.
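The whole cycle you described would look roughly like this in SolrJ -- a sketch only, with a placeholder URL, uniqueKey field, and document:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class FullRebuild {
        public static void main(String[] args) throws Exception {
            try (SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {
                // Mark every existing document deleted.  Nothing is
                // physically removed yet; that happens at commit time.
                client.deleteByQuery("*:*");

                // Re-send the compiled data set (one document shown).
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "example-1");
                client.add(doc);

                // At commit, Lucene drops every segment whose documents
                // are ALL deleted -- here, all the pre-existing segments.
                client.commit();
            }
        }
    }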
If that is the way you're deleting at the start, then it sounds like your indexing code might actually be indexing everything two or more times -- the original segments will be fully deleted at commit time, but each re-add of the same uniqueKey marks the previously added copy deleted, so you end up with at least one segment that has both deleted and live data. The fact that exactly half of your 9578 documents are deleted suggests everything was indexed exactly twice.
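You can reproduce that mechanism in miniature. This sketch assumes 'id' is your uniqueKey and uses a placeholder URL; the Luke request at the end just reads the doc counts so you can see the deletes:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.LukeRequest;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.common.util.NamedList;

    public class DuplicateAddDemo {
        public static void main(String[] args) throws Exception {
            try (SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {
                // Add the same uniqueKey twice.  The second add replaces
                // the first, leaving it behind as a deleted document in a
                // NEW segment -- one a plain commit will not clean up.
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "dup-1");
                client.add(doc);
                client.add(doc);
                client.commit();

                // numDocs counts live docs; maxDoc includes deleted ones.
                NamedList<Object> index =
                    new LukeRequest().process(client).getIndexInfo();
                System.out.println("numDocs=" + index.get("numDocs")
                    + " maxDoc=" + index.get("maxDoc")
                    + " deletedDocs=" + index.get("deletedDocs"));
            }
        }
    }

If everything really is indexed exactly twice, numDocs will come out at exactly half of maxDoc after the run, which matches what you're seeing.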
Thanks,
Shawn