On 11/28/2011 3:26 AM, Jones, Graham wrote:
Hello

Brief question: How can I clean up excess files after performing an optimize 
without restarting the Tomcat service?

Detail follows:

I've been running several SOLR cores for approx 12 months and have recently 
noticed the disk usage of one of them is growing considerably faster than the 
rate at which documents are being added.

- 1,200,000 docs 12 months ago used a 45 GB index
- 1,700,000 docs today use an 87 GB index
- There may have been _some_ deletions, almost certainly < 100,000
- The documents are of a broadly uniform style, approx 1000 words

So, approximately 45% growth in documents has grown the disk usage by approx 
100%.

I took a server out of production (I've 1 master & 7 slaves) and did the 
following:

- I ran http://server/corename/update?stream.body=<optimize/> on this core, which 
  added 49.4 GB to the index folder
- No previously existing files were deleted
- I restarted the Tomcat service
- ONLY the files generated by the optimize remained. All older files were deleted.

This is the result I want, but not quite the method I'd prefer. How can I get 
to this position without restarting the service?

Based on this description, it seems likely that you are running Solr on Windows. On Windows, if you have a file open for any reason (even just reading) it's not possible to delete that file. Solr keeps the old index files open to serve queries until the new index is fully committed and ready to take over, which can often be quite a while in software terms.

On Unix/Linux, deleting a file just removes the link to that file in the filesystem directory. When the last link is gone, the space is reclaimed. When a program opens a file, the OS creates an internal link to that file. If you delete that file while it's still open, it is still there, but only accessible via the internal link. This is what happens during an optimize - the files are removed from the directory, but part of Solr still has them open, until the newly created index is completely online and all queries to the old one are complete. Once they are closed, the OS reclaims the space. I'm fairly sure that there is little communication between the processes that serve queries and the processes that update and merge the index.
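
The Unix/Linux behaviour described above can be seen with a few lines of Python. This is only a minimal sketch, not anything Solr itself does: the function name read_after_unlink is made up for illustration, and it assumes a POSIX system. On Windows the os.unlink call would be expected to fail with PermissionError, because the file is still open.

```python
import os
import tempfile


def read_after_unlink(payload):
    """Delete a file while it is still open, then read it through the open handle.

    Hypothetical helper for illustration only. Assumes POSIX semantics.
    """
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w+b") as handle:
        handle.write(payload)
        handle.flush()
        # Remove the directory entry. The data stays on disk because
        # this process still holds the file open.
        os.unlink(path)
        still_listed = os.path.exists(path)
        # The contents remain readable through the open handle.
        handle.seek(0)
        data = handle.read()
    # Once the handle is closed here, the OS can reclaim the space.
    return still_listed, data
```

Calling read_after_unlink(b"segment data") on Linux should report that the path is no longer listed in the directory while the data is still readable, which is the same situation the old index files are in until Solr closes them.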

I've checked previous messages on this. If you can arrange to run the optimize a second time before any documents are added or deleted, it will complete instantaneously and the extra files will be deleted. If the index is changed at all between the two optimizes, it won't really help, as you'll have a new set of old files that won't get deleted.
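
If you want to script the double optimize, something along these lines might do. It is a rough sketch under assumptions: the base URL and core name are the placeholders from the original message, the helper names optimize_url and optimize_twice are invented here, and whether the second pass really removes the old files on your setup is untested.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def optimize_url(base_url, core):
    """Build the optimize request URL in the form used in the original message."""
    query = urlencode({"stream.body": "<optimize/>"})
    return f"{base_url}/{core}/update?{query}"


def optimize_twice(base_url, core, send=None):
    """Issue the optimize request two times in a row.

    `send` is a hypothetical hook so the HTTP call can be replaced in tests.
    By default it performs a blocking GET with urllib.
    """
    if send is None:
        send = lambda url: urlopen(url).read()
    url = optimize_url(base_url, core)
    # No documents should be added or deleted between these two calls.
    return [send(url), send(url)]
```

With the values from the original message this would be optimize_twice("http://server", "corename"), run while the server is out of production so that nothing changes the index in between.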

I am not in a position to test it, but it's possible that issuing a RELOAD command to the CoreAdmin might also take care of deleting the old files. I'm pretty sure that such an action is potentially disruptive, but in my experience, the index is back online within a second or two, much much faster than a full restart.

http://wiki.apache.org/solr/CoreAdmin#RELOAD
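
For reference, a RELOAD request is a plain HTTP GET against the CoreAdmin handler. The sketch below only builds the URL. It assumes the handler is mounted at /admin/cores (the adminPath setting in solr.xml can change that) and reuses the placeholder host and core name from the original message; reload_url is a made-up helper name.

```python
from urllib.parse import urlencode


def reload_url(base_url, core, admin_path="/admin/cores"):
    """Build a CoreAdmin RELOAD URL.

    admin_path is an assumption: it must match adminPath in your solr.xml.
    """
    query = urlencode({"action": "RELOAD", "core": core})
    return f"{base_url}{admin_path}?{query}"
```

For the placeholder values this gives http://server/admin/cores?action=RELOAD&core=corename, which can be requested from a browser or with any HTTP client.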

This has been a known problem for quite a while, but I do not believe that it is a major priority for most Solr users. Most people I've seen posting to this list do not run on Windows. I found the following bug filed on Solr:

https://issues.apache.org/jira/browse/SOLR-1691

Thanks,
Shawn
