On 5/5/2015 7:29 AM, Rishi Easwaran wrote:
> Worried about data loss makes sense. If I get the way solr behaves, the new
> directory should only have missing/changed segments.
> I guess since our application is extremely write heavy, with lot of inserts
> and deletes, almost every segment is touched even during a short window, so
> it appears like for our deployment every segment is copied over when replicas
> get out of sync.
Once a segment is written, it is *NEVER* updated again.  This aspect of
Lucene indexes makes Solr replication more efficient.  The ids of deleted
documents are written to separate files specifically for tracking deletes;
those files are typically quite small compared to the index segments.  Any
new documents are inserted into new segments.  When older segments are
merged, the information in all of those segments is copied to a single new
segment (minus documents marked as deleted), and then the old segments are
erased.

Optimizing replaces the entire index, and each replica of the index would
be considered different, so an index recovery that happens after
optimization might copy the whole thing.

If you are seeing a lot of index recoveries during normal operation,
chances are that your Solr servers do not have enough resources, and the
resource with the most impact on performance is memory.  The amount of
memory required for good Solr performance is higher than most people
expect.  It's a normal expectation that programs require memory to run,
but Solr has an additional memory requirement that often surprises people
-- the need for a significant OS disk cache:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn
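
P.S. In case a concrete example helps: below is a minimal sketch against
the Lucene 5.x API (the version current as of this thread) showing the
segment lifecycle described above.  The index path, the id values, and
the StandardAnalyzer choice are placeholders for illustration, not
anything taken from your setup.

import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

public class SegmentDemo {
  public static void main(String[] args) throws Exception {
    Path dir = Paths.get("/tmp/segment-demo");  // placeholder index path
    try (FSDirectory fsDir = FSDirectory.open(dir);
         IndexWriter writer = new IndexWriter(fsDir,
             new IndexWriterConfig(new StandardAnalyzer()))) {

      for (String id : new String[] {"1", "2"}) {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        writer.addDocument(doc);
      }
      writer.commit();  // flushes segment _0; its files never change again

      // The delete leaves every _0.* file untouched -- it only records
      // the surviving documents in a small separate live-docs file.
      writer.deleteDocuments(new Term("id", "1"));
      writer.commit();

      // forceMerge(1) is what Solr's "optimize" calls under the hood:
      // live documents are copied into one brand-new segment and the old
      // files are removed, which is why every replica differs after an
      // optimize and a recovery can end up copying the whole index.
      writer.forceMerge(1);
      writer.commit();
    }
  }
}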