On Sat, Dec 17, 2016 at 12:04 AM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
> : > lucene, something has to "mark" the segments as deleted in order for
> : > them
> ...
> : Note, it doesn't mark the "segment", it marks the "document".
>
> correct, typo on my part -- sorry.
>
> : > The dissatisfaction you expressed with this approach confuses me...
> : >
> : Really ?
> : If you have many expiring docs
>
> ...you didn't seem to finish that thought so i'm still not really sure
> what your suggestion is in terms of why an alternative would be more
> efficient.

Sorry about that. The reason why (I think/thought) it won't be as efficient is that in some cases, like mine, all docs expire, and rather fast (30 minutes in my case), so there will be a large number of "deletes", which I thought were expensive.

If rocksdb did it this way, it would have to keep one index on the ttl-timestamp and then issue two deletes (one for the index entry, one for the original row). In lucene, because the storage is different, this is ~just a deleted_bitmap[x] = 1, which, if you disable translog fsync (only for the ttl-delete), should be really fast and nonblocking (my issue).

The other way this could be made better, in my opinion (if the optimization is not already there): make the 'delete-query' on ttl-documents not force an fsync of the translog (so still written to the translog, just no fsync). When the next index/delete happens, its fsync also covers the translog entry of the previous 'delete ttl query'. If the server crashes, meaning we lost those deletes because the translog wasn't fsynced to disk, a thread can run on startup to recheck ttl-deletes. This makes the delete-query come ~free in terms of disk fsyncs on the translog. Makes sense?

> : "For example, with the configuration below the
> : DocExpirationUpdateProcessorFactory will create a timer thread that wakes
> : up every 30 seconds.
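[For reference, the "configuration below" that the quoted ref-guide passage refers to looks roughly like this -- a sketch based on the DocExpirationUpdateProcessorFactory documentation; the chain name and the neighboring processors are illustrative:]

```xml
<updateRequestProcessorChain name="add-expiration" default="true">
  <!-- wakes a timer thread every 30 seconds; each tick issues a
       deleteByQuery for docs whose expiration field is in the past -->
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">30</int>
    <str name="expirationFieldName">press_release_expiration_date</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```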
> : When the timer triggers, it will execute a
> : *deleteByQuery* command to *remove any documents* with a value in the
> : press_release_expiration_date field that is in the past"
>
> that document is describing a *logical* deletion as i mentioned before --
> the documents are "removed" in the sense that they are flagged "not alive"
> and won't be included in future searches, but the data still lives in the
> segments on disk until a future merge. (That is end user documentation,
> focusing on the effects as perceived by clients -- the concept of "delete"
> from a low level storage implementation is a much more involved concept
> that affects any discussion of "deleting" documents in solr, not just TTL
> based deletes)
>
> : > 1) nothing would ensure that docs *ever* get removed during periods
> : > when docs aren't being added (thus no new segments, thus no merging)
> : >
> : This can be done with a periodic/smart thread that wakes up every 'ttl'
> : and checks min-max (or histogram) of timestamps on segments. If there
> : are a lot, do merge (or just delete the whole dead segment). At least
> : that's how those systems do it.
>
> OK -- with lucene/solr today we have the ConcurrentMergeScheduler which
> will watch for segments that have many (logically deleted) documents
> flagged "not alive" and will proactively merge those segments when the
> number of docs is above some configured/default threshold -- but to
> automatically flag those documents as "deleted" you need something like
> what solr is doing today.

I knew it checks "should we be merging"; this would just be another clause.

> Again: i really feel like the only disconnect here is terminology.
>
> You're describing a background thread that wakes up periodically, scans
> the docs in each segment to see if they have an expire field > $now, and,
> based on the size of the set of matches, merges some segments and expunges
> the docs that were in that set.
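[In miniature, the single-thread design being described would be something like the following -- plain Python standing in for the engine internals, with all names illustrative:]

```python
class Segment:
    """A toy segment: an immutable list of (doc_id, expire_at) pairs."""
    def __init__(self, docs):
        self.docs = docs

def expire_pass(segments, now, merge_threshold=0.5):
    """One wake-up of the hypothetical TTL thread: count expired docs per
    segment and rewrite ("merge") any segment whose expired fraction
    exceeds the threshold, physically dropping the expired docs.
    Segments below the threshold are left untouched; their expired docs
    would have to be excluded by a filter at query time instead."""
    out = []
    for seg in segments:
        expired = sum(1 for _, exp in seg.docs if exp <= now)
        if expired / len(seg.docs) > merge_threshold:
            live = [(d, exp) for d, exp in seg.docs if exp > now]
            if live:  # a fully dead segment is simply dropped
                out.append(Segment(live))
        else:
            out.append(seg)
    return out
```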
> For segments that aren't merged, docs
> stay put and are excluded from queries only by filters specified at
> request time.
>
> What Solr/Lucene has are 2 background threads: one wakes up periodically,
> scans the docs in the index to see if the expire field > $now and if so
> flags them as being "not alive" so they don't match queries at request
> time. A second thread checks each segment to see how many docs are marked
> "not alive" -- either by the previous thread or by some other form of
> (logical) deletion -- and merges some of those segments, expunging the
> docs that were marked "not alive". For segments that aren't merged, the
> "not alive" docs are still in the segment, but the "not alive" flag
> automatically excludes them from queries.

Yes, I knew it functions that way. The ~whole~ misunderstanding was that the delete is more efficient than I thought. The whole reason the other storage engines do it "the other way" is the cost of a delete on those engines.

> -Hoss
> http://www.lucidworks.com/
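P.S. A minimal sketch of the deferred-fsync idea from earlier in the thread -- a toy translog, not the actual Elasticsearch/Solr transaction-log API, and all names are made up:

```python
import os

class TransLog:
    """Toy write-ahead translog: TTL-delete entries are appended without
    an immediate fsync; the next regular operation's fsync makes them
    durable as well.  After a crash, a startup recovery step would re-run
    the TTL delete-query to cover any entries that were lost."""

    def __init__(self, path):
        self.f = open(path, "ab")
        self.unsynced_ttl_deletes = 0

    def append(self, entry: bytes, ttl_delete: bool = False):
        self.f.write(entry + b"\n")
        self.f.flush()                      # to the OS cache, not to disk
        if ttl_delete:
            self.unsynced_ttl_deletes += 1  # durability deferred
        else:
            os.fsync(self.f.fileno())       # regular op: this one fsync also
            self.unsynced_ttl_deletes = 0   # covers the pending TTL deletes
```

The point of the sketch: the TTL delete-query pays no fsync of its own, so its disk cost rides "free" on the next ordinary index/delete operation.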