[
https://issues.apache.org/jira/browse/SOLR-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17252259#comment-17252259
]
Ilan Ginzburg commented on SOLR-15052:
--------------------------------------
I think this goes in the right direction, and thank you for doing it. I do
have comments, and a suggestion (the last two paragraphs) that is likely the
most interesting part of this long post.
I do not understand why {{ReplicaStatesProvider}} uses a thread local. Are
we sure each thread works on a single collection at a time? Or do we assume replica names
do not collide across collections?
Replica states should be available wherever a {{DocCollection}} is, so I'd
reference them there rather than in a thread local. The fact that they are now
persisted differently also makes it a bit weird that {{DocCollection}}
implements {{JSONWriter.Writable}}. In the process we've added ZK {{cversion}} tracking of
{{state.json}} to {{DocCollection}} to optimize re-reads, creating an unwelcome
additional ZK dependency. I think given that the persistence of
{{DocCollection}} gets a lot more ambitious with this change, its persistence
layer deserves its own abstraction, and {{DocCollection}} should continue to
represent everything we need to know about a collection.
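As a rough illustration of what such a persistence abstraction might look like (all names here are hypothetical, not from the PR): {{DocCollection}} would stay a pure model, and a store interface would hide whether replica states come from {{state.json}} or from child znodes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a persistence abstraction separating the collection
// model from how its state is stored. Names are illustrative only.
final class CollectionStateStoreSketch {

  // Minimal stand-in for the collection-level model.
  static final class CollectionState {
    final String name;
    final Map<String, String> replicaStates; // replicaName -> state
    CollectionState(String name, Map<String, String> replicaStates) {
      this.name = name;
      this.replicaStates = replicaStates;
    }
  }

  // The abstraction: the model would depend on this, not on ZK paths,
  // cversion tracking, or JSON serialization details.
  interface CollectionStateStore {
    CollectionState read(String collection);
    void writeReplicaState(String collection, String replica, String state);
  }

  // Trivial in-memory implementation, standing in for a ZK-backed one.
  static final class InMemoryStore implements CollectionStateStore {
    private final Map<String, Map<String, String>> data = new HashMap<>();
    public CollectionState read(String collection) {
      return new CollectionState(collection,
          new HashMap<>(data.getOrDefault(collection, Map.of())));
    }
    public void writeReplicaState(String collection, String replica, String state) {
      data.computeIfAbsent(collection, k -> new HashMap<>()).put(replica, state);
    }
  }
}
```

A ZK-backed implementation would then own the {{cversion}} tracking and re-read logic, keeping that dependency out of the model class.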
There is a race condition on updating the znode children of {{state.json}}:
{{ZkController.publish()}} re-reads the directory if needed and later does the
updates. Multiple operations on the same directory could be done by multiple
nodes (including the Overseer) at the same time. There is no conditional update
in ZooKeeper to save us here (we'd need a conditional update keyed on the parent
{{cversion}}, which I don't think exists). The version numbers used in the names
of the replica state znodes can get confused as a consequence. It is likely
safer to use the parent {{cversion}} as the replica state znode version in the
znode name rather than letting each replica creation start at 0 or increase the
previous value read. It does not fix the race but limits its consequences.
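A minimal sketch of that naming scheme (the encoding format and helper are hypothetical; with real ZooKeeper, the parent {{cversion}} would be read from the {{Stat}} returned by {{exists()}} on the {{state.json}} znode before creating the child):

```java
// Hypothetical helper showing the suggested naming scheme: embed the parent
// state.json cversion in the replica-state znode name instead of a
// per-replica counter. The "name:state:cversion" format is illustrative.
final class ReplicaStateName {
  final String replica;
  final String state;
  final int parentCversion;

  ReplicaStateName(String replica, String state, int parentCversion) {
    this.replica = replica;
    this.state = state;
    this.parentCversion = parentCversion;
  }

  // e.g. "core_node3:DOWN:17" -- with ZooKeeper, parentCversion would come
  // from zk.exists(statePath, false).getCversion() at write time.
  String encode() {
    return replica + ":" + state + ":" + parentCversion;
  }

  static ReplicaStateName decode(String znodeName) {
    String[] parts = znodeName.split(":");
    return new ReplicaStateName(parts[0], parts[1], Integer.parseInt(parts[2]));
  }
}
```

Because {{cversion}} increases on every child change, concurrent writers racing on the same directory would at worst produce names reflecting the directory state they each observed, rather than colliding counters.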
Another aspect that bothers me is that when a child znode for a given replica
is not found, we fall back to using the values that come from {{state.json}},
if any. The fact that the information is present in both places (and likely
stale most often in {{state.json}} unless that file is written back for other
reasons) is confusing. The fact that we might use it is dangerous (if a bug
leads to this happening, it will be complicated to understand). I suggest never
falling back to {{state.json}} for collections with per-replica state, and
removing that information from {{state.json}} altogether.
*Last but not least,* each time we create or delete a child znode for a replica
state, we first re-read all the children of {{state.json}} because its
{{cversion}} has changed. This does mean that all replica state updates are
still serialized for a collection. The observed performance improvements (the
pdf attached to the jira) are likely due to less data being read (directory
listing instead of the large {{state.json}}) and the update being much smaller
(znode delete + create).
I assume, though, that listing this directory is more expensive than reading a
single znode containing the equivalent text. Switching to one such file could
not only bring further performance improvements but would also allow fixing the
race condition (we can do Compare-And-Swap on a single file) and would simplify
the code. It would amount to splitting replica state out of {{state.json}} into
another, much smaller, per-collection {{something.json}} file.
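ZooKeeper's {{setData(path, data, expectedVersion)}} fails with a {{BadVersionException}} when the znode's data version has changed, which gives exactly this Compare-And-Swap. A sketch of the resulting read-modify-write loop, using an in-memory stand-in for the znode (the store and helper names are illustrative):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Illustrative compare-and-swap update loop over a versioned value, mimicking
// ZooKeeper's setData(path, data, expectedVersion) semantics. VersionedStore
// is an in-memory stand-in for a znode; with real ZK, a version mismatch
// surfaces as KeeperException.BadVersionException and the caller re-reads
// and retries in the same way.
final class CasUpdateSketch {
  static final class Versioned {
    final String data;
    final int version;
    Versioned(String data, int version) { this.data = data; this.version = version; }
  }

  static final class VersionedStore {
    private final AtomicReference<Versioned> node =
        new AtomicReference<>(new Versioned("", 0));
    Versioned read() { return node.get(); }
    // Succeeds only if the caller's expected version still matches.
    boolean setData(String data, int expectedVersion) {
      Versioned cur = node.get();
      if (cur.version != expectedVersion) return false;
      return node.compareAndSet(cur, new Versioned(data, cur.version + 1));
    }
  }

  // Read-modify-write until the conditional write succeeds.
  static void update(VersionedStore store, UnaryOperator<String> mutation) {
    while (true) {
      Versioned cur = store.read();
      if (store.setData(mutation.apply(cur.data), cur.version)) return;
    }
  }
}
```

Losing a race here just costs one extra read of the small per-collection file, instead of corrupting znode names.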
Slightly longer term, we could decide to have multiple such replica state
files, one per shard. A main {{state.json}} for Collection level info, and
{{state_<shard>.json}} for each shard. This would further increase the
concurrency of replica state changes and make them more efficient.
> Reducing overseer bottlenecks using per-replica states
> ------------------------------------------------------
>
> Key: SOLR-15052
> URL: https://issues.apache.org/jira/browse/SOLR-15052
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Ishan Chattopadhyaya
> Priority: Major
> Attachments: per-replica-states-gcp.pdf
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> This work has the same goal as SOLR-13951, that is to reduce overseer
> bottlenecks by avoiding replica state updates from going to the state.json
> via the overseer. However, the approach taken here is different from
> SOLR-13951, and hence this work supersedes it.
> The design proposed is here:
> https://docs.google.com/document/d/1xdxpzUNmTZbk0vTMZqfen9R3ArdHokLITdiISBxCFUg/edit
> Briefly,
> # Every replica's state will be in a separate znode nested under the
> state.json. Its name encodes the replica name, state, and leadership
> status.
> # An additional children watcher is set on state.json to observe state changes.
> # Upon a state change, a ZK multi-op to delete the previous znode and add a
> new znode with new state.
> Differences between this and SOLR-13951,
> # In SOLR-13951, we planned to leverage shard terms for per shard states.
> # As a consequence, the code changes required for SOLR-13951 were massive (we
> needed a shard state provider abstraction and introduce it everywhere in the
> codebase).
> # This approach is a drastically simpler change and design.
> Credit for this design and the PR is due to [~noble.paul].
> [[email protected]], [~noble.paul] and I have collaborated on this
> effort. The reference branch takes a conceptually similar (but not identical)
> approach.
> I shall attach a PR and performance benchmarks shortly.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)