Thanks for sharing, Pierre!  It's clear you put some thought into the
writeup.

I wonder whether this shouldn't go into a JIRA ticket so there's a
record of it that might be more visible to users?  Would you be
willing to summarize this in a JIRA ticket?  I'm not sure if anyone
will tackle the fix for 8x, but at least a ticket would document
things more visibly and it opens the door to someone else tackling the
problem down the road.

Best,

Jason

On Thu, Jun 1, 2023 at 10:29 AM Pierre Salagnac
<pierre.salag...@gmail.com> wrote:
>
> I know the autoscaling framework no longer exists in Solr 9+, but I
> wanted to share a bug we found in it.
> There are probably still plenty of Solr 8 users relying on this
> framework.
>
>
> The triggers use timestamps returned by the JVM call System.nanoTime(), but
> according to the Javadoc, this is NOT an absolute timestamp. It is only a
> number relative to an arbitrary origin, and that origin can differ between
> JVM instances, so it may change each time the JVM is restarted.
>
> As far as I can tell, this affects at least the following triggers (which
> all follow basically the same pattern):
> - IndexSizeTrigger
> - MetricTrigger
> - SearchRateTrigger
>
> These triggers want to fire an event when a certain condition (depending on
> each trigger) is met for a certain period of time. They maintain a map with
> [what, timestamp] entries to track a short term history, with the option to
> remove an entry if the condition is not met anymore, so we don't trigger
> any event.
> Timestamps come from System.nanoTime(). So far so good, as long as we
> compare these timestamps to each other in the same JVM. However, this map is
> persisted in Zookeeper in case of an overseer change (written and read by
> TriggerBase.saveState() and restoreState()). After an overseer change, the
> new overseer runs in a different JVM with an unrelated nanoTime() origin.
> Consequently, the timestamps persisted by the previous overseer cannot be
> compared with the current JVM "clock".
> The result is that triggers either never fire, or fire without waiting for
> the configured time.
>
> I found no Jira entry for this (but maybe there is one?), and I think this
> could be a major contributor to the instability of this framework for some
> environments.
> Also, I'm unsure whether it is still maintained in an 8x branch.
>
> A simple fix could be to always use TimeSource.getEpochTimeNs() instead
> of getTimeNs() in autoscaling code.
> But I'm not sure why we use nanoseconds anyway. Seconds would be
> sufficient...
>
> Thanks,
> Pierre
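
To make the failure mode in the quoted message concrete, here is a
minimal Java sketch. It is not the actual Solr code: monotonicNs(),
waitForElapsed(), epochNs() and the origin values are hypothetical names
and numbers chosen for illustration, and System.currentTimeMillis() is
used only as a stand-in for an epoch-based time source. It models each
JVM's nanoTime() as "real elapsed time plus an arbitrary per-JVM origin"
and shows how a timestamp stored by one JVM misbehaves when compared in
another.

```java
import java.util.concurrent.TimeUnit;

class NanoTimeOriginDemo {

    // Hypothetical model of System.nanoTime(): real elapsed time plus an
    // arbitrary origin that is fixed within one JVM but differs between JVMs.
    static long monotonicNs(long jvmOriginNs, long realElapsedNs) {
        return jvmOriginNs + realElapsedNs;
    }

    // Assumed shape of the "has the condition held for waitFor?" check.
    static boolean waitForElapsed(long storedNs, long nowNs, long waitForNs) {
        return nowNs - storedNs >= waitForNs;
    }

    // Epoch-based nanoseconds: all JVMs share the same origin (the Unix
    // epoch), up to wall-clock accuracy.
    static long epochNs() {
        return TimeUnit.MILLISECONDS.toNanos(System.currentTimeMillis());
    }

    public static void main(String[] args) {
        long waitForNs = TimeUnit.SECONDS.toNanos(60);
        long t10 = TimeUnit.SECONDS.toNanos(10);
        long t20 = TimeUnit.SECONDS.toNanos(20);
        long t120 = TimeUnit.SECONDS.toNanos(120);

        // Old overseer stores a timestamp 10 s after a common reference instant.
        long oldOrigin = 5_000_000_000_000L;
        long stored = monotonicNs(oldOrigin, t10);

        // New overseer with a larger origin: only 10 s have really elapsed.
        long largerOrigin = 9_000_000_000_000L;
        System.out.println("larger origin, 10 s elapsed, fires: "
            + waitForElapsed(stored, monotonicNs(largerOrigin, t20), waitForNs));

        // New overseer with a smaller origin: 110 s have really elapsed.
        long smallerOrigin = 1_000_000_000_000L;
        System.out.println("smaller origin, 110 s elapsed, fires: "
            + waitForElapsed(stored, monotonicNs(smallerOrigin, t120), waitForNs));

        // Epoch-based timestamps: both JVMs measure from the same origin.
        long epochOrigin = epochNs();
        long storedEpoch = epochOrigin + t10;
        System.out.println("epoch, 10 s elapsed, fires: "
            + waitForElapsed(storedEpoch, epochOrigin + t20, waitForNs));
        System.out.println("epoch, 110 s elapsed, fires: "
            + waitForElapsed(storedEpoch, epochOrigin + t120, waitForNs));
    }
}
```

With unrelated origins, the check depends on which JVM happens to have
the larger origin rather than on real elapsed time, which matches the
"never fired" and "fired without waiting" symptoms described above. With
a shared origin, the result depends only on elapsed time.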

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
For additional commands, e-mail: dev-h...@solr.apache.org
