[ https://issues.apache.org/jira/browse/SOLR-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17252259#comment-17252259 ]

Ilan Ginzburg commented on SOLR-15052:
--------------------------------------

I think this goes in the right direction, and thank you for doing it. I do 
have comments and a suggestion (the last two paragraphs), which is likely the 
most interesting part of this long post.

I do not understand why {{ReplicaStatesProvider}} uses a thread local. Are we 
sure threads process a single collection at a time? Or do we assume replica 
names do not collide across collections?
Replica states should be available wherever a {{DocCollection}} is, so I'd 
reference them there rather than in a thread local. The fact that they are now 
persisted differently also makes {{DocCollection}} implementing 
{{JSONWriter.Writable}} a bit weird. In the process we've added ZK 
{{cversion}} tracking of {{state.json}} to {{DocCollection}} to optimize 
re-reads, creating an unwelcome additional ZK dependency. Given that the 
persistence of {{DocCollection}} gets a lot more ambitious with this change, I 
think its persistence layer deserves its own abstraction, and 
{{DocCollection}} should continue to represent everything we need to know 
about a collection.
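
As a rough illustration of the abstraction I have in mind (a sketch only, all 
names hypothetical):

{code:java}
import java.io.IOException;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;

// Hypothetical sketch: keep DocCollection a pure in-memory model and hide
// persistence details (the state.json body, per-replica child znodes,
// cversion tracking) behind an interface rather than inside DocCollection.
public interface DocCollectionPersistence {

  /** Reads the full collection state, however it happens to be stored. */
  DocCollection read(String collectionName) throws IOException;

  /** Writes collection-level properties (the state.json body). */
  void writeCollectionProps(DocCollection coll) throws IOException;

  /** Writes a single replica's state (a per-replica child znode). */
  void writeReplicaState(String collectionName, String replicaName,
                         Replica.State state, boolean isLeader) throws IOException;
}
{code}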

There is a race condition on updating the znode children of {{state.json}}: 
{{ZkController.publish()}} re-reads the directory if needed and later does the 
updates. Multiple operations on the same directory could be performed by 
multiple nodes (including the Overseer) at the same time. There is no 
conditional update in Zookeeper to save us here (we'd need a conditional 
update based on the parent {{cversion}}, which I don't think exists). As a 
consequence, the version numbers used in the names of the replica state znodes 
can get confused. It is likely safer to use the parent {{cversion}} as the 
replica state znode version in the znode name, rather than letting each 
replica creation start at 0 or increment the previously read value, as 
sketched below. It does not fix the race but limits its consequences.
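
A minimal sketch of that naming scheme against the raw ZooKeeper API (the name 
encoding is illustrative, not necessarily the one the PR uses):

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ReplicaStateZnodes {
  /**
   * Creates a replica state child znode named after the parent's cversion.
   * ZooKeeper bumps cversion on every child create/delete, so racing writers
   * at least produce names tied to the directory state they actually saw.
   */
  static String createStateZnode(ZooKeeper zk, String stateJsonPath,
                                 String replicaName, String state)
      throws KeeperException, InterruptedException {
    Stat parentStat = zk.exists(stateJsonPath, false);
    int cversion = parentStat.getCversion();
    String child = stateJsonPath + "/" + replicaName + ":" + state + ":" + cversion;
    return zk.create(child, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
        CreateMode.PERSISTENT);
  }
}
{code}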

Another aspect that bothers me is that when a child znode for a given replica 
is not found, we fall back to using the values from {{state.json}}, if any. 
The fact that the information is present in both places (and likely stale most 
of the time in {{state.json}}, unless that file is written back for other 
reasons) is confusing. The fact that we might use it is dangerous (if a bug 
leads to this happening, it will be complicated to understand). I suggest 
never falling back to {{state.json}} for collections with per-replica state, 
and removing that information from {{state.json}} altogether; see the sketch 
below.
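
Roughly the behavior I have in mind (a hypothetical helper; 
{{isPerReplicaState()}} stands for whatever flag marks such collections):

{code:java}
import java.util.Map;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;

public class ReplicaStateResolution {
  /** Hypothetical: resolve a replica's state with no state.json fallback. */
  static Replica.State resolveState(DocCollection coll, String replicaName,
                                    Map<String, Replica.State> perReplicaStates) {
    if (coll.isPerReplicaState()) { // assumed flag on per-replica-state collections
      Replica.State s = perReplicaStates.get(replicaName);
      if (s == null) {
        // Fail loudly rather than silently reading a stale copy from state.json.
        throw new IllegalStateException("No state znode for replica " + replicaName);
      }
      return s;
    }
    // Legacy collections keep reading state.json as before.
    return coll.getReplica(replicaName).getState();
  }
}
{code}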

*Last but not least,* each time we create or delete a child znode for a 
replica state, we first re-read all the children of {{state.json}} because its 
{{cversion}} has changed. This means that all replica state updates are still 
serialized per collection. The observed performance improvements (the pdf 
attached to the jira) are likely due to less data being read (a directory 
listing instead of the large {{state.json}}) and to the update being much 
smaller (a znode delete + create).
I assume though that listing this directory is more expensive than reading a 
single znode containing the equivalent text. Switching to a single file could 
not only bring further performance improvements but would also allow fixing 
the race condition (we can do Compare And Swap on a file) and would simplify 
the code. It would amount to splitting replica state out of {{state.json}} 
into another, much smaller, per-collection {{something.json}} file.
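
The Compare And Swap part is straightforward with ZooKeeper's versioned 
{{setData()}}; a sketch ({{something.json}} being a placeholder name):

{code:java}
import java.util.function.UnaryOperator;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ReplicaStatesCas {
  /** Read-modify-write on the per-collection replica states file, CAS-style. */
  static void updateStates(ZooKeeper zk, String somethingJsonPath,
                           UnaryOperator<byte[]> update)
      throws KeeperException, InterruptedException {
    while (true) {
      Stat stat = new Stat();
      byte[] current = zk.getData(somethingJsonPath, false, stat);
      byte[] next = update.apply(current);
      try {
        // setData() with an expected version fails if another node wrote first.
        zk.setData(somethingJsonPath, next, stat.getVersion());
        return;
      } catch (KeeperException.BadVersionException e) {
        // Lost the race: loop to re-read the data and its new version.
      }
    }
  }
}
{code}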

Slightly longer term, we could decide to have multiple such replica state 
files, one per shard: a main {{state.json}} for collection-level info, and a 
{{state_<shard>.json}} for each shard. This would further increase the 
concurrency of replica state changes and make them more efficient.

> Reducing overseer bottlenecks using per-replica states
> ------------------------------------------------------
>
>                 Key: SOLR-15052
>                 URL: https://issues.apache.org/jira/browse/SOLR-15052
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Ishan Chattopadhyaya
>            Priority: Major
>         Attachments: per-replica-states-gcp.pdf
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> This work has the same goal as SOLR-13951, that is, to reduce overseer 
> bottlenecks by avoiding replica state updates going through the overseer to 
> state.json. However, the approach taken here is different from SOLR-13951, 
> and hence this work supersedes that work.
> The design proposed is here: 
> https://docs.google.com/document/d/1xdxpzUNmTZbk0vTMZqfen9R3ArdHokLITdiISBxCFUg/edit
> Briefly,
> # Every replica's state will be in a separate znode nested under 
> state.json. Its name encodes the replica name, state, and leadership status.
> # An additional children watcher is set on state.json to observe state changes.
> # Upon a state change, a ZK multi-op deletes the previous znode and adds a 
> new znode with the new state.
> Differences between this and SOLR-13951,
> # In SOLR-13951, we planned to leverage shard terms for per shard states.
> # As a consequence, the code changes required for SOLR-13951 were massive (we 
> needed a shard state provider abstraction and to introduce it everywhere in 
> the codebase).
> # This approach is a drastically simpler change and design.
> Credits for this design and the PR are due to [~noble.paul]. 
> [~markrmil...@gmail.com], [~noble.paul] and I have collaborated on this 
> effort. The reference branch takes a conceptually similar (but not identical) 
> approach.
> I shall attach a PR and performance benchmarks shortly.


