Erick Erickson created SOLR-13913:
-------------------------------------

             Summary: CDCR should limit TLOG growth
                 Key: SOLR-13913
                 URL: https://issues.apache.org/jira/browse/SOLR-13913
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Erick Erickson


CDCR uses TLOGs as a queueing mechanism. If the connection between DCs goes 
down for any reason and the outage goes unnoticed, the tlogs will grow without 
bound, which can lead to disk-full situations and all that entails.

Aside from that problem, it's not clear that reprocessing a zillion updates is 
faster than a full replication anyway.

Now that full-index replication has been added, we can avoid runaway tlogs by 
somehow noticing that we haven't been connected to the remote DC for a long 
time, purging the tlogs (keeping just enough for peer sync, of course), and 
doing a full index replication the next time we do connect.

This is pretty vague. I don't have a good idea of whether tlog size is the 
right metric, or some sort of time since the last successful transmission, or 
the queue size, or some combination of these and others. The point is simply 
that once some threshold is crossed, we reset to a zero state and avoid the 
pitfalls of continuing to accumulate updates.

I'd suggest these be tunable parameters defined in solrconfig.xml, since I can 
imagine that terabyte-scale indexes should fall back to full-index replication 
more rarely than megabyte-scale indexes.
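
To make the threshold idea a bit more concrete, here is a minimal sketch in 
Java of what such a check might look like. Nothing in it exists in Solr today: 
the class name, the method name, and the three limits (tlog bytes, time since 
the last successful transmission, queued update count) are all hypothetical, 
and combining them with a simple "any limit exceeded" rule is only one of the 
possible combinations mentioned above.

```java
import java.time.Duration;

/**
 * Hypothetical policy object, NOT part of Solr. It only illustrates the
 * "once some threshold is crossed, fall back to full replication" idea.
 * The three limits stand in for tunables that might live in solrconfig.xml.
 */
final class CdcrTlogFallbackPolicy {
    private final long maxTlogBytes;         // assumed tunable: total tlog size on disk
    private final Duration maxDisconnected;  // assumed tunable: time since last successful send
    private final long maxQueuedUpdates;     // assumed tunable: updates waiting in the queue

    CdcrTlogFallbackPolicy(long maxTlogBytes, Duration maxDisconnected, long maxQueuedUpdates) {
        this.maxTlogBytes = maxTlogBytes;
        this.maxDisconnected = maxDisconnected;
        this.maxQueuedUpdates = maxQueuedUpdates;
    }

    /**
     * Returns true when any one limit is exceeded, meaning the caller should
     * purge the tlogs (keeping enough for peer sync) and schedule a full
     * index replication for the next successful connection.
     */
    boolean shouldFallBackToFullReplication(long tlogBytes,
                                            Duration sinceLastSuccess,
                                            long queuedUpdates) {
        return tlogBytes > maxTlogBytes
                || sinceLastSuccess.compareTo(maxDisconnected) > 0
                || queuedUpdates > maxQueuedUpdates;
    }
}
```

A terabyte-scale collection would presumably be configured with much larger 
limits than a megabyte-scale one, so that it falls back to full replication 
less often.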

This idea came up in discussions and I wanted to preserve it in case 
someone wants to pursue it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
