Thanks for sharing, Pierre! It's clear you put some thought into the writeup.
I wonder whether this shouldn't go into a JIRA ticket so there's a record of it that might be more visible to users? Would you be willing to summarize this in a JIRA ticket? I'm not sure if anyone will tackle the fix for 8x, but at least a ticket would document things more visibly, and it opens the door to someone else tackling the problem down the road.

Best,

Jason

On Thu, Jun 1, 2023 at 10:29 AM Pierre Salagnac <pierre.salag...@gmail.com> wrote:
>
> I know the autoscaling framework does not exist anymore with Solr 9+, but I
> wanted to share here a bug we found in it.
> There are probably still plenty of Solr 8 users relying on this
> framework.
>
> The triggers use timestamps returned by the JVM call System.nanoTime(), but
> according to the Javadoc, this is NOT an absolute timestamp. It is just a
> number relative to an arbitrary origin, and this origin can change each time
> the JVM is restarted.
>
> I figured out this impacts at least the following triggers (with basically
> the same pattern):
> - IndexSizeTrigger
> - MetricTrigger
> - SearchRateTrigger
>
> These triggers want to fire an event when a certain condition (depending on
> each trigger) is met for a certain period of time. They maintain a map with
> [what, timestamp] entries to track a short-term history, with the option to
> remove an entry if the condition is not met anymore, so we don't trigger
> any event.
> Timestamps come from System.nanoTime(). So far so good, as long as we
> compare these timestamps to each other in the same JVM. Now, this map is
> persisted in Zookeeper in case of an overseer change (written and read by
> TriggerBase.saveState() and restoreState()). With an overseer change, the
> nanoTime() origin moves to some other arbitrary value. Consequently, all
> the persisted timestamps from the previous overseer cannot be compared with
> the current JVM "clock".
> This ends in triggers never being fired, or being fired without waiting for
> the time configured.
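
To make the failure mode described above concrete, here is a minimal Java sketch. It is not Solr code: the class NanoOriginSketch, the helper waitedLongEnough(), and the literal readings are all hypothetical, and the two readings only stand in for System.nanoTime() values taken in two JVMs with unrelated origins.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names come from Solr.
final class NanoOriginSketch {

    /** The "condition has held for at least waitFor" check, reduced to arithmetic. */
    static boolean waitedLongEnough(long firstSeenNs, long nowNs, long waitForNs) {
        return nowNs - firstSeenNs >= waitForNs;
    }

    public static void main(String[] args) {
        long waitForNs = TimeUnit.SECONDS.toNanos(60);

        // Case 1: the old overseer's origin gave large readings, the new one gives small ones.
        long persistedByOldJvm = 5_000_000_000_000L; // stand-in for a persisted nanoTime() value
        long nowOnNewJvm = 1_000_000_000L;           // stand-in for nanoTime() after the overseer change
        // Elapsed time is negative, so the check fails no matter how long the condition really held.
        System.out.println(waitedLongEnough(persistedByOldJvm, nowOnNewJvm, waitForNs));

        // Case 2: the origins are the other way around.
        // Elapsed time looks like more than an hour, so the check passes immediately.
        System.out.println(waitedLongEnough(nowOnNewJvm, persistedByOldJvm, waitForNs));
    }
}
```

Depending on which JVM happens to have the larger readings, the computed elapsed time is either negative (the wait never appears to complete) or hugely inflated (the wait appears to be over at once), which matches the two symptoms described above.
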
>
> I found no Jira entry for this (but maybe there is one?), and I think this
> could be a major contributor to the instability of this framework in some
> environments.
> Also, I'm unsure whether it is still maintained in an 8x branch.
>
> A simple fix could be to always use TimeSource.getEpochTimeNs() instead
> of getTimeNs() in autoscaling code.
> But I'm not sure why we use nanoseconds anyway. Seconds would be
> sufficient...
>
> Thanks,
> Pierre
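
As a rough illustration of the fix proposed above, the sketch below uses an epoch-based timestamp so that a value persisted by one JVM remains meaningful in another. It relies only on java.time from the JDK, not on Solr's TimeSource; the class EpochWaitForSketch and its methods are hypothetical, and the fixed clocks merely simulate an old and a new overseer.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical illustration only; it does not use or reproduce Solr's TimeSource.
final class EpochWaitForSketch {

    /** Nanoseconds since the Unix epoch according to the given wall clock. */
    static long epochTimeNs(Clock clock) {
        Instant now = clock.instant();
        return Math.addExact(
                Math.multiplyExact(now.getEpochSecond(), 1_000_000_000L),
                now.getNano());
    }

    /** True once at least waitFor has elapsed since the persisted epoch timestamp. */
    static boolean waitedLongEnough(long firstSeenEpochNs, Clock clock, Duration waitFor) {
        return epochTimeNs(clock) - firstSeenEpochNs >= waitFor.toNanos();
    }

    public static void main(String[] args) {
        // "Old overseer": records when the condition was first seen, then persists it.
        Clock oldOverseer = Clock.fixed(Instant.parse("2023-06-01T10:00:00Z"), ZoneOffset.UTC);
        long persisted = epochTimeNs(oldOverseer);

        // "New overseer": a different JVM, 90 seconds later, reading the persisted value.
        Clock newOverseer = Clock.offset(oldOverseer, Duration.ofSeconds(90));
        System.out.println(waitedLongEnough(persisted, newOverseer, Duration.ofSeconds(60)));
        System.out.println(waitedLongEnough(persisted, newOverseer, Duration.ofSeconds(120)));
    }
}
```

The trade-off is that wall-clock time can jump when the system clock is adjusted and can differ slightly between hosts, so this only suits waits that are long compared with the expected clock skew. Storing seconds instead of nanoseconds, as suggested above, would work the same way.
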