On 11/28/2011 3:26 AM, Jones, Graham wrote:
Hello
Brief question: How can I clean up excess files after performing an optimize
without restarting the Tomcat service?
Detail follows:
I've been running several SOLR cores for approx 12 months and have recently
noticed the disk usage of one of them is growing considerably faster than the
rate at which documents are being added.
- 1,200,000 docs 12 months ago used a 45 GB index
- 1,700,000 docs today use a 87 GB index
- There may have been _some_ deletions, almost certainly < 100,000
- The documents are of a broadly uniform style, approx 1000 words
So, approximately 45% growth in documents has grown the disk usage by approx
100%.
I took a server out of production (I've 1 master & 7 slaves) and did the
following.
- I ran http://server/corename/update?stream.body=<optimize/> on this core,
  which added 49.4 GB to the index folder
- No previously existing files were deleted
- I restarted the Tomcat service
- ONLY the files generated by the optimize remained. All older files were
  deleted.
This is the result I want, but not quite the method I'd prefer. How can I get
to this position without restarting the service?
Based on this description, it seems likely that you are running Solr on
Windows. On Windows, if you have a file open for any reason (even just
reading) it's not possible to delete that file. Solr keeps the old
index files open to serve queries until the new index is fully committed
and ready to take over, which can often be quite a while in software terms.
On Unix/Linux, deleting a file just removes the link to that file in the
filesystem directory. When the last link is gone, the space is
reclaimed. When a program opens a file, the OS creates an internal link
to that file. If you delete that file while it's still open, it is
still there, but only accessible via the internal link. This is what
happens during an optimize - the files are removed from the directory,
but part of Solr still has them open, until the newly created index is
completely online and all queries to the old one are complete. Once
they are closed, the OS reclaims the space. I'm fairly sure that there
is little communication between the parts of Solr that serve queries and
the parts that update and merge the index.
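The Unix/Linux behaviour described above can be seen outside Solr with a small sketch. This is only an illustration of the filesystem semantics, not of anything Solr does internally; the function name `read_after_unlink` is made up for the example. On Windows the `os.unlink` call would normally fail with a permission error while the handle is open, which matches the symptom in the original question.

```python
import os
import tempfile


def read_after_unlink(payload: bytes) -> bytes:
    """Write payload to a temp file, delete the file while it is still
    open, then read the data back through the open handle (POSIX only)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w+b") as handle:
        handle.write(payload)
        handle.flush()
        # Removes the directory entry only; the data stays on disk
        # because this process still holds the file open.
        os.unlink(path)
        if os.path.exists(path):
            raise RuntimeError("directory entry should be gone")
        handle.seek(0)
        data = handle.read()
    # Leaving the with-block closes the handle; the OS can now reclaim the space.
    return data


if __name__ == "__main__":
    print(read_after_unlink(b"old index segment"))
```

The space is only reclaimed once the handle is closed, which is why disk usage can stay high for as long as something still holds the old index files open.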
I've checked previous messages on this. If you can arrange to run the
optimize a second time before any documents are added or deleted, it
will complete instantaneously and the extra files will be deleted. If
the index is changed at all between the two optimizes, it won't really
help, as you'll have a new set of old files that won't get deleted.
I am not in a position to test it, but it's possible that issuing a
RELOAD command to the CoreAdmin might also take care of deleting the old
files. I'm pretty sure that such an action is potentially disruptive,
but in my experience, the index is back online within a second or two,
much much faster than a full restart.
http://wiki.apache.org/solr/CoreAdmin#RELOAD
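Either workaround can be scripted. The following is a rough sketch, not a tested procedure: the optimize URL follows the form used in the original question, and the RELOAD URL follows the form on the wiki page above. The base URL `http://server`, the core name `corename`, and the `admin/cores` path are assumptions that depend on how Solr is deployed and on the adminPath set in solr.xml. The helper names are made up for the example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def optimize_url(base: str, core: str) -> str:
    """URL that asks one core to optimize, as in the original question."""
    return f"{base}/{core}/update?" + urlencode({"stream.body": "<optimize/>"})


def reload_url(base: str, core: str, admin_path: str = "admin/cores") -> str:
    """CoreAdmin RELOAD URL; admin_path is an assumption about the deployment."""
    return f"{base}/{admin_path}?" + urlencode({"action": "RELOAD", "core": core})


def call(url: str, timeout: float = 3600.0) -> int:
    """Issue the request and return the HTTP status code."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    base, core = "http://server", "corename"  # placeholders, adjust to your setup
    # Workaround 1: optimize twice with no index changes in between.
    print(optimize_url(base, core))
    # Workaround 2 (untested, as noted above): reload the core after one optimize.
    print(reload_url(base, core))
    # To actually send a request: call(optimize_url(base, core))
```

The script only prints the URLs by default so they can be checked against the server before anything is sent. An optimize on a large index can run for a long time, hence the generous timeout in `call`.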
This has been a known problem for quite a while, but I do not believe
that it is a major priority for most Solr users. Most people I've seen
posting to this list do not run on Windows. I found the following bug
filed on Solr:
https://issues.apache.org/jira/browse/SOLR-1691
Thanks,
Shawn