Hello everyone,

I apologise beforehand if this question has been visited numerous times on this list, but after hours spent on Google and talking to Solr-savvy people on #solr @ Freenode, I'm still a bit at a loss about Solr and deleted documents.
I have quite a few indexes in both production and development environments where I see the number of deleted documents growing and growing, but they never seem to be purged from the index. From my understanding, this can be controlled through the merge policy set for the current core, but I've not been able to find any specifics on the topic.

The general consensus in most search hits I've found is to perform an optimize of the core. However, this is an expensive operation, both in terms of CPU cycles and disk I/O, and it also requires anywhere from 2 to 3 times the size of the index to be available on disk to be guaranteed to complete fully. Given these criteria, it's often not a viable option in certain environments, both due to it being a resource hog and because you often just don't have the disk space needed to perform the optimize.

After having spoken with a couple of people on IRC (thanks tokee and elyograg), I was made aware of an optional parameter for <commit> called 'expungeDeletes' that can explicitly make sure that deleted documents are removed from the index, i.e.:

    curl http://localhost:8983/solr/coreName/update -H "Content-Type: text/xml" --data-binary '<commit expungeDeletes="true"/>'

Now my questions are as follows:

1) How can I make sure that this is dealt with in my merge policy, if at all possible?

2) I've tried to find some disk space guidelines for 'expungeDeletes', but I've not been able to find any. What are the general guidelines here? Does it require as much space as an optimize, or is it less "aggressive" than an optimize?

3) Is 'expungeDeletes' the recommended method to make sure your deleted documents are actually removed from the index, or should you deal with this in your merge policy?

4) I have also heard in talks on #solr that deleted documents have an impact on the relevancy of performed searches. Is this correct, or just misinformation?
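To make the disk space concern above concrete, here is a rough sketch of the kind of pre-check one could run before attempting an optimize. It is only an illustration: the function name has_optimize_headroom and the INDEX_DIR variable are made up for this example, and it assumes that "2 to 3 times the size of the index" means free space, using 3 as the default factor to stay on the safe side.

```shell
#!/bin/sh
# Sketch only: decide whether there is enough free disk space to attempt
# an optimize. Assumption (not confirmed by any Solr documentation cited
# here): required free space = FACTOR x current index size, with FACTOR
# defaulting to 3, the upper end of the 2x-3x range mentioned above.

# has_optimize_headroom INDEX_KB FREE_KB [FACTOR]
# Returns 0 (success) if FREE_KB >= INDEX_KB * FACTOR, 1 otherwise.
has_optimize_headroom() {
    index_kb=$1
    free_kb=$2
    factor=${3:-3}
    [ "$free_kb" -ge $((index_kb * factor)) ]
}

# Example wiring. INDEX_DIR is a hypothetical path; point it at the
# core's index directory before uncommenting.
# index_kb=$(du -sk "$INDEX_DIR" | awk '{print $1}')
# free_kb=$(df -Pk "$INDEX_DIR" | awk 'NR==2 {print $4}')
# has_optimize_headroom "$index_kb" "$free_kb" \
#     || echo "not enough free space for an optimize"
```

With a 100 KB index, for instance, the check passes with 300 KB free and fails with 299 KB free; passing 2 as the third argument relaxes it to the lower end of the range.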
If you require any additional information, like snippets from my configuration (solrconfig.xml), I'm more than happy to provide this. Again, if this is an issue that's being revisited for the Nth time, I apologise; I'm just trying to get my head around this with my somewhat limited Solr knowledge.

--
Yours sincerely
Jostein Elvaker Haande

"A free society is a society where it is safe to be unpopular" - Adlai Stevenson

http://tolecnal.net -- tolecnal at tolecnal dot net