On Fri, Dec 16, 2016 at 10:53 PM, Chris Hostetter <[email protected]> wrote:
>
> : Yep, that's what came in my search. See how TTL works in hbase/cassandra/
> : rocksdb <https://github.com/facebook/rocksdb/wiki/Time-to-Live>. There
> : isn't a "delete old docs" query, but old docs are deleted by the storage
> : when merging. Looks like this needs to be a lucene-module which can then be
> : configured by solr ?
> ...
> : Just like in hbase, cassandra, rocksdb, when you "select" a row/document that
> : has expired, it exists on the storage, but isn't returned by the db,
>
> What you're describing is exactly how segment merges work in Lucene, it's
> just a question of terminology.
>
> In Lucene, "deleting" a document is a *logical* operation, the data still
> lives in the (existing) segments but the affected docs are recorded in a
> list of deletions (and automatically excluded from future searchers that
> are opened against them) ... once the segments are merged, the deleted
> documents are "expunged" rather than being copied over to the new
> segments.
>
> Where this diverges from what you describe is that, as things stand in
> lucene, something has to "mark" the segments as deleted in order for them
> to later be expunged -- in Solr right now the DocExpirationUpdateProcessorFactory
> is the code in question that does this via (internal) DBQ.
>
Note, it doesn't mark the "segment", it marks the "document".
>
> The dissatisfaction you expressed with this approach confuses me...
>
Really? If you have many expiring docs
>
> >> I did some search for TTL on solr, and found only a way to do it with a
> >> delete-query. But that ~sucks, because you have to do a lot of inserts
> >> (and queries).
>
> ...nothing about this approach does any "inserts" (or queries -- unless
> you mean the DBQ itself?) so w/o more elaboration on what exactly you find
> problematic about this approach, it's hard to make any sense of your
> objection or request for an alternative.
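The logical-delete-then-expunge-at-merge behavior described above can be modeled with a small sketch. This is plain Python with invented names, not Lucene's actual classes; it only illustrates that a "delete" records an id while the bytes are dropped at merge time:

```python
# Toy model of Lucene-style logical deletion (invented names, not
# Lucene's real API): a "delete" only records the doc id; the data is
# physically dropped when segments are merged.

class Segment:
    def __init__(self, docs):
        self.docs = list(docs)    # write-once doc store
        self.deleted = set()      # logical deletions, akin to Lucene's liveDocs bitset

    def live_docs(self):
        # Searchers automatically skip logically deleted docs.
        return [d for i, d in enumerate(self.docs) if i not in self.deleted]

def merge(segments):
    # Merging copies only live docs into the new segment; deleted docs
    # are "expunged" simply by never being copied over.
    return Segment(d for s in segments for d in s.live_docs())
```

For example, deleting doc 1 from `Segment(["a", "b", "c"])` leaves the stored list untouched, but `live_docs()` returns `["a", "c"]`, and a segment produced by `merge` contains only the live docs with an empty deletion set.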
> "For example, with the configuration below the
> DocExpirationUpdateProcessorFactory will create a timer thread that wakes
> up every 30 seconds. When the timer triggers, it will execute a
> *deleteByQuery* command to *remove any documents* with a value in the
> press_release_expiration_date field that is in the past"
>
> With all those caveats out of the way...
>
> What you're ultimately requesting -- new code that hooks into segment
> merging to exclude "expired" documents from being copied into the new
> merged segments -- should be theoretically possible with a custom
> MergePolicy, but I don't really see how it would be better than the
> current approach in typical use cases (ie: i want docs excluded from
> results after the expiration date is reached, with a min tolerance of
> X) ...
>
I mentioned that the client would also make a range-query, since expired
documents in this case would still be indexed.
>
> 1) nothing would ensure that docs *ever* get removed during periods when
> docs aren't being added (thus no new segments, thus no merging)
>
This can be done with a periodic/smart thread that wakes up every 'ttl' and
checks the min-max (or a histogram) of timestamps on each segment. If a lot
of docs have expired, do a merge (or just delete the whole dead segment).
At least, that's how those systems do it.
>
> 2) as you described, query clients would be required to specify date range
> filters on every query to identify the "logically live docs at this
> moment" on a per-request basis -- something that's far less efficient from
> a caching standpoint than letting the system do a DBQ on the backend to
> affect the *global* set of logically live docs at the index level.
>
This makes sense. Deleted doc-ids are cached better than the range-query I
suggested.
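The periodic sweep suggested above (check per-segment timestamp bounds, drop segments that are entirely expired) can be sketched as follows. `TimestampedSegment`, `fully_expired`, and `sweep` are invented names for illustration only -- this is not a Lucene MergePolicy or any Solr API:

```python
import time

# Sketch of the periodic TTL sweep described above: each segment tracks
# min/max timestamps once at write time, and a background thread can drop
# segments whose newest doc is already expired.

class TimestampedSegment:
    def __init__(self, timestamps):
        self.timestamps = list(timestamps)
        self.min_ts = min(self.timestamps)  # oldest doc in the segment
        self.max_ts = max(self.timestamps)  # newest doc in the segment

def fully_expired(seg, ttl, now=None):
    # If even the newest doc in the segment is past its TTL, the whole
    # segment is dead and can be deleted without rewriting any data.
    now = time.time() if now is None else now
    return seg.max_ts < now - ttl

def sweep(segments, ttl, now=None):
    # The "periodic/smart thread": wake up, inspect per-segment bounds,
    # keep only segments that still hold at least one unexpired doc.
    now = time.time() if now is None else now
    return [s for s in segments if not fully_expired(s, ttl, now)]
```

Note the limitation: a partially expired segment still needs a merge (or per-doc deletes) to reclaim space; the bounds check only enables the cheap whole-segment drop.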
>
> Frankly: It seems to me that you've looked at how other non-lucene based
> systems X & Y handle TTL-type logic and decided that's the best possible
> solution, therefore the solution used by Solr "sucks", w/o taking into
> account that what's efficient in the underlying Lucene storage
> implementation might just be different from what's efficient in the
> underlying storage implementation of X & Y.
>
Yes.
>
> If you'd like to tackle implementing TTL as a lower level primitive
> concept in Lucene, then by all means be my guest -- but personally i
> don't think you're going to find any real perf improvements in an
> approach like you describe compared to what we offer today. i look
> forward to being proved wrong.
>
Since the implementation is apparently more efficient than I thought, I'm
gonna leave it.
>
> -Hoss
> http://www.lucidworks.com/
