If I gauge Otis' intent here, it is to create shards on the basis of intervals of time. A shard represents a single interval (let's say a year's worth of data), and when that data is no longer needed the shard is simply shut down and no longer included in queries.

So, for example, you could have three shards spanning the years 2011, 2012, and 2013 respectively. When you no longer need 2011, you simply remove that shard. My example is simple; compress the interval to whatever granularity fits your retention needs.
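As a rough sketch of the mechanics, assuming one Solr core per interval (the core names, URL, and instanceDir layout below are illustrative, not from this thread), the rotation could be scripted against the CoreAdmin API like so:

    # Sketch: rotate time-based Solr cores, dropping the oldest interval.
    # Core names (events_YYYY), the Solr URL, and the instanceDir layout
    # are assumptions for illustration only.
    import requests

    SOLR = "http://localhost:8983/solr"

    def create_core(year):
        # CoreAdmin CREATE: bring up a fresh core for the new interval.
        requests.get(SOLR + "/admin/cores", params={
            "action": "CREATE",
            "name": "events_%d" % year,
            "instanceDir": "events_%d" % year,
        }).raise_for_status()

    def drop_core(year):
        # CoreAdmin UNLOAD with deleteIndex=true drops the core and frees
        # its disk space immediately -- no merging or purging involved.
        requests.get(SOLR + "/admin/cores", params={
            "action": "UNLOAD",
            "core": "events_%d" % year,
            "deleteIndex": "true",
        }).raise_for_status()

    # Keep a rolling three-year window: add 2014, retire 2011.
    create_core(2014)
    drop_core(2011)

Queries then go out only to the live cores (e.g. via the shards parameter), so expired data never has to be deleted document by document.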
On Oct 29, 2013, at 8:42 AM, Gun Akkor <gun.ak...@carbonblack.com> wrote:

> Otis,
>
> Thank you for your response.
>
> Could you elaborate a bit more on what you have in mind when you say
> "time-based" indices?
>
> Gun
>
> ---
> Senior Software Engineer
> Carbon Black, Inc.
> gun.ak...@carbonblack.com
>
>
> On Thu, Oct 24, 2013 at 11:56 PM, Otis Gospodnetic <
> otis.gospodne...@gmail.com> wrote:
>
>> Only skimmed your email, but "purge every 4 hours" jumped out at me.
>> Would it make sense to have time-based indices that can be periodically
>> dropped instead of being purged?
>>
>> Otis
>> Solr & ElasticSearch Support
>> http://sematext.com/
>>
>> On Oct 23, 2013 10:33 AM, "Scott Lundgren" <scott.lundg...@carbonblack.com>
>> wrote:
>>
>>> *Background:*
>>>
>>> - Our use case is to use Solr as a massive FIFO queue.
>>>
>>> - Document additions and updates happen continuously.
>>>
>>> - Documents are being added at a sustained rate of 50-100 documents
>>> per second.
>>>
>>> - About 50% of these documents are updates to existing docs, indexed
>>> using atomic updates: the original doc is thus deleted and re-added.
>>>
>>> - A separate purge operation runs every four hours and deletes the
>>> oldest docs, if required, based on a number of unrelated configuration
>>> parameters.
>>>
>>> - At some time in the past, a manual force merge / optimize with
>>> maxSegments=2 was run to troubleshoot high disk I/O and remove "too
>>> many segments" as a potential variable. Currently, the two largest
>>> .fdt files are 74GB and 43GB. There are 47 segments in total; the
>>> next-largest are all around 2GB.
>>>
>>> - Merge policies are all at Solr 4 defaults. Index size is currently
>>> ~50M maxDocs, ~35M numDocs, 276GB.
>>>
>>> *Issue:*
>>>
>>> The background purge operation is deleting docs on schedule, but the
>>> disk space is not being recovered.
>>>
>>> *Presumptions:*
>>>
>>> I presume, but have not confirmed (how?), that the 15M deleted
>>> documents are predominantly in the two large segments. Because those
>>> large segments still hold (some/many) live documents, their backing
>>> files are not deleted.
>>>
>>> *Questions:*
>>>
>>> - When will those segments get merged and the disk space recovered?
>>> Does it happen only when _all_ the documents in a segment are deleted,
>>> or once some percentage of the segment is deleted documents?
>>>
>>> - Is there a way to trigger it right now vs. just waiting?
>>>
>>> - In some cases, the purge condition is _just_ free disk space: when
>>> index size > free space, delete the oldest docs. Those setups are now
>>> in scenarios where index size >> free space, and getting worse. How
>>> does low disk space affect the above two questions?
>>>
>>> - Is there a way for me to get stats on a per-segment basis? For
>>> example, how many deleted documents are in a particular segment?
>>>
>>> - On the flip side, can I determine in which segment a particular
>>> document is located?
>>>
>>> Thank you,
>>>
>>> Scott
>>>
>>> --
>>> Scott Lundgren
>>> Director of Engineering
>>> Carbon Black, Inc.
>>> (210) 204-0483 | scott.lundg...@carbonblack.com
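Regarding the per-segment stats question at the bottom of the quote: as far as I know, Solr 4 has no HTTP API that breaks deletions down per segment (Lucene's CheckIndex tool, run offline against the index directory, will print that detail), but the Luke handler does show how many deleted documents are still occupying space index-wide. A small sketch, assuming a core named collection1:

    # Sketch: index-wide deletion stats via Solr's Luke handler.
    # maxDoc counts all docs, including deleted ones that merges have not
    # yet reclaimed, so maxDoc - numDocs = deleted-but-still-on-disk docs.
    import requests

    resp = requests.get(
        "http://localhost:8983/solr/collection1/admin/luke",
        params={"numTerms": 0, "wt": "json"},  # numTerms=0 keeps it cheap
    )
    info = resp.json()["index"]
    print("segments:", info["segmentCount"])
    print("deleted docs awaiting merge:", info["maxDoc"] - info["numDocs"])

Watching that deleted count fall after a purge (or not) tells you whether merges are actually reclaiming space.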