: Yep, that's what came up in my search. See how TTL works in hbase/cassandra/
: rocksdb <https://github.com/facebook/rocksdb/wiki/Time-to-Live>. There
: isn't a "delete old docs" query, but old docs are deleted by the storage
: when merging. Looks like this needs to be a lucene module which can then be
: configured by solr?
        ...
: Just like in hbase, cassandra, rocksdb, when you "select" a row/document that
: has expired, it exists on the storage, but isn't returned by the db,


What you're describing is exactly how segment merges work in Lucene; it's 
just a question of terminology.

In Lucene, "deleting" a document is a *logical* operation, the data still 
lives in the (existing) segments but the affected docs are recorded in a 
list of deletions (and automatically excluded from future searchers that 
are opened against them) ... once the segments are merged then the deleted 
documents are "expunged" rather then being copied over to the new 
segments.
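
To make the terminology concrete, here's a tiny self contained sketch of 
what that looks like at the raw Lucene API level -- this is just my own 
illustration (not code lifted from Solr), and the index path / field names 
are made up:

  import java.io.IOException;
  import java.nio.file.Paths;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.StringField;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.store.FSDirectory;

  public class LogicalDeleteDemo {
    public static void main(String[] args) throws IOException {
      try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/ttl-demo"));
           IndexWriter writer =
               new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

        // index two trivial docs
        for (String id : new String[] {"1", "2"}) {
          Document doc = new Document();
          doc.add(new StringField("id", id, Field.Store.YES));
          writer.addDocument(doc);
        }
        writer.commit();

        // "delete" is logical: doc 1 is only recorded in the deletions list,
        // its bytes stay in the existing segment
        writer.deleteDocuments(new Term("id", "1"));
        writer.commit();

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
          // numDocs()=1 (live docs), but maxDoc()=2 (doc 1 is still on disk)
          System.out.println("numDocs=" + reader.numDocs()
              + " maxDoc=" + reader.maxDoc());
        }

        // the physical removal happens when segments with deletions get
        // merged; forceMergeDeletes() just triggers that explicitly here
        writer.forceMergeDeletes();
        writer.commit();
      }
    }
  }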

Where this diverges from what you describe is that, as things stand in 
Lucene, something has to "mark" the documents as deleted in order for them 
to later be expunged -- in Solr right now, the code in question does this 
via an (internal) DBQ.
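
(For reference, the code in question is the doc expiration update 
processor -- DocExpirationUpdateProcessorFactory -- which, when configured 
with a delete period, periodically fires that internal DBQ for you.)  If 
it helps, the delete it boils down to is roughly the SolrJ sketch below; 
the collection URL and the "expire_at_dt" field are made up example names, 
and in practice you wouldn't run this yourself, the processor does the 
equivalent on the backend:

  import java.io.IOException;
  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;

  public class ManualTtlSweep {
    public static void main(String[] args) throws SolrServerException, IOException {
      try (SolrClient client = new HttpSolrClient.Builder(
               "http://localhost:8983/solr/mycollection").build()) {
        // mark every already-expired doc as deleted; the bytes get expunged
        // later, whenever the affected segments happen to be merged
        client.deleteByQuery("expire_at_dt:[* TO NOW]");
        client.commit();
      }
    }
  }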

The dissatisfaction you expressed with this approach confuses me...

>> I did some search for TTL on solr, and found only a way to do it with a
>> delete-query. But that ~sucks, because you have to do a lot of inserts 
>> (and queries).

...nothing about this approach does any "inserts" (or queries -- unless 
you mean the DBQ itself?), so w/o more elaboration on what exactly you find 
problematic about this approach, it's hard to make any sense of your 
objection or request for an alternative.


With all those caveats out of the way...

What you're ultimately requesting -- new code that hooks into segment 
merging to exclude "expired" documents from being copied into the new 
merged segments -- should be theoretically possible with a custom 
MergePolicy, but I don't really see how it would be better than the 
current approach in typical use cases (ie: I want docs excluded from 
results after the expiration date is reached, with a min tolerance of 
X) ...

1) nothing would ensure that docs *ever* get removed during periods when 
docs aren't being added (thus no new segments, thus no merging)

2) as you described, query clients would be required to specify date range 
filters on every query to identify the "logically live docs at this 
moment" on a per-request basis -- something that's far less efficient from 
a caching standpoint than letting the system do a DBQ on the backend to 
affect the *global* set of logically live docs at the index level.
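
To spell out 2) a bit more: w/o the backend delete, every client request 
has to re-assert "not expired yet" itself, something like the SolrJ sketch 
below (the field name is a made up example).  And because a raw NOW is 
different on every request, that filter defeats the filterCache unless 
you round it (e.g. NOW/MINUTE) -- which is just the "min tolerance of X" 
trade-off again, pushed onto every client:

  import java.io.IOException;
  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class FilteredSearch {
    public static void main(String[] args) throws SolrServerException, IOException {
      try (SolrClient client = new HttpSolrClient.Builder(
               "http://localhost:8983/solr/mycollection").build()) {
        SolrQuery q = new SolrQuery("text:whatever");
        // every single request has to restate "logically live right now":
        // docs that haven't expired, plus docs with no expiration at all
        q.addFilterQuery("expire_at_dt:[NOW TO *] OR (*:* -expire_at_dt:[* TO *])");
        QueryResponse rsp = client.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
      }
    }
  }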


Frankly: it seems to me that you've looked at how other, non-lucene-based 
systems X & Y handle TTL-type logic and decided that's the best possible 
solution, and therefore the solution used by Solr "sucks" -- w/o taking 
into account that what's efficient in the underlying Lucene storage 
implementation might just be different than what's efficient in the 
underlying storage implementations of X & Y.

If you'd like to tackle implementing TTL as a lower-level primitive 
concept in Lucene, then by all means be my guest -- but personally I 
don't think you're going to find any real perf improvements in an 
approach like the one you describe compared to what we offer today.  I 
look forward to being proved wrong.
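
For anyone who does want to experiment, here's a rough, *untested* sketch 
of where I'd expect that kind of merge-time hook to start, using 
OneMergeWrappingMergePolicy (the same wrapping hook that, as far as I 
know, the soft deletes retention policy builds on).  It assumes a 
hypothetical "expire_at" NumericDocValuesField holding epoch millis, and 
it does nothing about caveats 1) and 2) above:

  import java.io.IOException;
  import org.apache.lucene.index.CodecReader;
  import org.apache.lucene.index.FilterCodecReader;
  import org.apache.lucene.index.MergePolicy;
  import org.apache.lucene.index.NumericDocValues;
  import org.apache.lucene.index.OneMergeWrappingMergePolicy;
  import org.apache.lucene.util.Bits;
  import org.apache.lucene.util.FixedBitSet;

  public class ExpireOnMerge {

    /** Wraps any base MergePolicy so already-expired docs are dropped during merges. */
    public static MergePolicy wrap(MergePolicy base, String expireField) {
      return new OneMergeWrappingMergePolicy(base, toWrap ->
          new MergePolicy.OneMerge(toWrap.segments) {
            @Override
            public CodecReader wrapForMerge(CodecReader reader) throws IOException {
              NumericDocValues expires = reader.getNumericDocValues(expireField);
              if (expires == null) {
                return reader; // this segment has no expiration data, merge as-is
              }
              long now = System.currentTimeMillis();
              Bits alreadyLive = reader.getLiveDocs();
              FixedBitSet live = new FixedBitSet(reader.maxDoc());
              int liveCount = 0;
              for (int doc = 0; doc < reader.maxDoc(); doc++) {
                boolean alive = (alreadyLive == null || alreadyLive.get(doc));
                if (alive && expires.advanceExact(doc) && expires.longValue() <= now) {
                  alive = false; // expired: don't copy it into the merged segment
                }
                if (alive) {
                  live.set(doc);
                  liveCount++;
                }
              }
              final int newNumDocs = liveCount;
              // present expired docs as deleted while this merge reads the segment
              return new FilterCodecReader(reader) {
                @Override public Bits getLiveDocs() { return live; }
                @Override public int numDocs() { return newNumDocs; }
                @Override public CacheHelper getCoreCacheHelper() {
                  return reader.getCoreCacheHelper();
                }
                @Override public CacheHelper getReaderCacheHelper() {
                  return null; // live docs differ from the wrapped reader's
                }
              };
            }
          });
    }
  }

You'd wire it in with something like 
IndexWriterConfig.setMergePolicy(ExpireOnMerge.wrap(new TieredMergePolicy(), 
"expire_at")) -- and note that nothing here guarantees expired docs ever 
actually get merged away, and searchers still see expired-but-unmerged 
docs unless they filter them, which is caveats 1) and 2) all over again.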



-Hoss
http://www.lucidworks.com/
