Comments inline:

On Wed, Jan 3, 2018 at 11:39 AM, S G <sg.online.em...@gmail.com> wrote:

> AFAIK, tlog file is truncated with a hard-commit.
> So if the TLOG replica is only pulling the tlog-file, it would become out
> of date if it does not pull the full index too.
> That means that the TLOG replica would do a full copy every time there is a
> commit on the leader.

A TLOG replica does not pull the tlog file. Instead, each update is pushed
from the leader to all TLOG replicas as a synchronous operation. The same
thing happens for NRT replicas as well. Such updates are appended to the
transaction logs on the TLOG replicas. Similar to the old master/slave
model, the leader (hard) commits its index frequently and the TLOG replica
polls the leader to check if a newer index version is available for
download. If there is a new version on the leader, the replica downloads
only the new segments from the leader that are not present locally.

> PULL replica, by definition copies index files only and so it would do full
> recoveries often too.

PULL replicas also download only the newest segments present on the leader
which haven't been copied over previously. A full recovery happens only when
the replica is new, i.e. it has an empty index, or when it has been out of
sync for so long that the leader has completely new segments due to merges.

One thing to note is that you should never mix NRT replicas with other
types. Either have a collection with only NRT replicas or have a mix of
TLOG and PULL replicas. This way you ensure that the leader is never
different enough for a full recovery to be required.

> How intelligent are the two replica types in determining that they need to
> do a full recovery vs partial recovery?
> Does full recovery happen every hard-commit on the leader?
> Or does it happen with segment merges on the leader? (because index files
> will look much different after a segment-merge)
>
>> NRT replicas will typically have very different files in their on-disk
>> indexes even though they contain the same documents.
>
> This is something which has caused full recoveries many times in my
> clusters and I wish there was a solution to this one.
> Do you think it would make sense for all replicas of a shard to agree upon
> the segment where a document should go to?
> Coupling this with an agreed cadence on segment merges, Solr would never do
> full recovery. (It's a very high level view of course and will need lot of
> refinements if implemented).
>
> Getting a cadence on segment merges could possibly be implemented by a
> time-based merging strategy where documents arriving within a particular
> time-range only will form a particular segment.
> So documents arriving between 1pm-2pm go to segment 1, those between
> 2pm-3pm go to segment 2 and so on.
> That ways replicas will only copy the last N segments (with N being 1
> generally) instead of doing a full recovery.
> Even if merging happens on the leader, the last N segments should not be
> cleared to avoid full recoveries on the replicas.
> (I know something like this happens today, but not very sure about the
> internal details and it's nowhere documented clearly).
>
> Currently, I see my replicas go into full-recovery even when I dynamically
> add a field to a collection or a replica missed updates for a few seconds.
> (I do have high values for catchup rather than the default 100)
>
> Thanks
> SG
>
> On Tue, Jan 2, 2018 at 8:58 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 1/2/2018 8:02 PM, S G wrote:
>>
>>> If the above is incorrect, can someone please point that out?
>>
>> Assuming I have a correct understanding of how the different replica types
>> work, I have some small clarifications. If my understanding is incorrect,
>> I hope somebody will point out my errors.
>>
>> TLOG is leader eligible because it keeps transaction logs from ongoing
>> indexing, although it does not perform that indexing on its own index
>> unless it becomes the leader. Transaction logs are necessary for operation
>> as leader.
>>
>> PULL does not keep transaction logs, which is why it is not leader
>> eligible. It only copies the index data.
>>
>> Either TLOG or PULL would do a full index copy if the local index is
>> suddenly very different from the leader.
>> This could happen in situations
>> where you have NRT replicas and the leader changes -- NRT replicas will
>> typically have very different files in their on-disk indexes even though
>> they contain the same documents. When the leader changes to a different
>> NRT replica, TLOG/PULL replicas will suddenly find that they have a very
>> different list of index files, so they will fetch the entire index.
>>
>> Thanks,
>> Shawn

--
Regards,
Shalin Shekhar Mangar.
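
For illustration only: the incremental fetch described in this thread (a TLOG or PULL replica downloads just the segment files it does not already have, and falls back to a full copy when it has nothing in common with the leader) can be sketched roughly as below. This is not Solr's actual replication code. The function name `plan_fetch`, the name-to-checksum dictionaries, and the sample file names are all hypothetical, and the real logic makes more checks than this.

```python
def plan_fetch(leader_files, local_files):
    """Decide which index files a replica should download.

    Both arguments map a file name to a checksum (a hypothetical format,
    chosen only for this sketch).
    Returns (files_to_download, full_copy_needed).
    """
    # Files the replica already holds with an identical checksum.
    shared = {name for name, checksum in leader_files.items()
              if local_files.get(name) == checksum}
    # Everything else on the leader has to be fetched.
    to_download = sorted(name for name in leader_files if name not in shared)
    # Nothing in common (empty local index, or every segment replaced by
    # merges or a leader change): the fetch is effectively a full copy.
    full_copy_needed = len(shared) == 0
    return to_download, full_copy_needed


leader = {"_1.cfs": "a1", "_2.cfs": "b2", "_3.cfs": "c3"}

# Replica is one segment behind: only the new segment is fetched.
print(plan_fetch(leader, {"_1.cfs": "a1", "_2.cfs": "b2"}))

# Brand-new replica with an empty index: everything is fetched.
print(plan_fetch(leader, {}))

# Leader's segments were all rewritten: nothing is shared, full copy.
print(plan_fetch({"_9.cfs": "z9"}, {"_1.cfs": "a1", "_2.cfs": "b2"}))
```

The third call mirrors the case Shawn describes: when the replica's file list no longer overlaps with the leader's, there is nothing to reuse and the whole index is copied.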