I know the autoscaling framework was removed in Solr 9, but I wanted to
share a bug we found in it, since plenty of Solr 8 users probably still
rely on this framework.


The triggers use timestamps returned by System.nanoTime(), but according
to the Javadoc, this is NOT an absolute timestamp. It is only a number
relative to a fixed but arbitrary origin, and other JVM instances
(including the same node after a restart) are likely to use a different
origin.

As far as I can tell, this affects at least the following triggers, which
all follow basically the same pattern:
- IndexSizeTrigger
- MetricTrigger
- SearchRateTrigger

These triggers fire an event when a certain condition (depending on each
trigger) has been met for a configured period of time. They maintain a map
of [what, timestamp] entries to track a short-term history, and remove an
entry when its condition is no longer met, so that no event is fired for
it.
The timestamps come from System.nanoTime(). This is fine as long as we
only compare these timestamps to each other within the same JVM. However,
this map is persisted in ZooKeeper so that it survives an overseer change
(it is written and read by TriggerBase.saveState() and restoreState()).
After an overseer change, the triggers run in a different JVM whose
nanoTime() origin is unrelated to the previous one. Consequently, the
persisted timestamps from the previous overseer cannot be compared with
the current JVM "clock".
The result is triggers that never fire, or that fire without waiting for
the configured time.
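
The failure mode can be reproduced without Solr. The following is only a
minimal sketch, not the actual trigger code: the class and method names are
hypothetical, and the two JVM origins are simulated with fixed clock
readings instead of real restarts.

```java
import java.util.concurrent.TimeUnit;

public class NanoTimeOriginDemo {

    // Hypothetical stand-in for a trigger's "has the condition held long
    // enough?" check. Both arguments are nanoTime()-style readings.
    static boolean waitForElapsed(long firstSeenNs, long nowNs, long waitForSeconds) {
        long elapsedNs = nowNs - firstSeenNs;
        return TimeUnit.SECONDS.convert(elapsedNs, TimeUnit.NANOSECONDS) >= waitForSeconds;
    }

    public static void main(String[] args) {
        long waitFor = 60; // seconds the condition must hold before firing

        // Old overseer: its nanoTime() currently reads 5000 s (arbitrary origin).
        // The condition is first seen now, and this value is persisted.
        long persisted = TimeUnit.SECONDS.toNanos(5_000);

        // New overseer takes over shortly afterwards, with a different origin.
        // Case A: its clock reads far lower, so the elapsed time is negative
        // and the event is delayed until the new clock catches up.
        long newNowLow = TimeUnit.SECONDS.toNanos(100);
        // Case B: its clock reads far higher, so the wait looks already over.
        long newNowHigh = TimeUnit.SECONDS.toNanos(900_000);

        System.out.println(waitForElapsed(persisted, newNowLow, waitFor));  // false
        System.out.println(waitForElapsed(persisted, newNowHigh, waitFor)); // true
    }
}
```

In case A the event is held back for well over an hour instead of 60
seconds; in case B it fires immediately, even though almost no real time
has passed.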

I found no Jira entry for this (but maybe there is one?), and I think this
could be a major contributor to the instability of this framework in some
environments.
Also, I'm unsure whether it is still maintained in an 8.x branch.

A simple fix could be to always use TimeSource.getEpochTimeNs() instead
of getTimeNs() in the autoscaling code.
But I'm not sure why we use nanoseconds anyway. Seconds would be
sufficient...
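
To illustrate why an epoch-based timestamp survives an overseer change,
here is a minimal sketch using only the JDK. It does not use Solr's
TimeSource; the class and method names are hypothetical, and the instants
are fixed so the example is deterministic.

```java
import java.time.Instant;
import java.util.concurrent.TimeUnit;

public class EpochTimestampDemo {

    // Nanoseconds since the Unix epoch. Unlike System.nanoTime(), the origin
    // is the same in every JVM, so the value can be persisted and compared
    // later (within the accuracy of the nodes' wall clocks).
    static long epochTimeNs(Instant now) {
        return TimeUnit.SECONDS.toNanos(now.getEpochSecond()) + now.getNano();
    }

    // Hypothetical "has the condition held long enough?" check.
    static boolean waitForElapsed(long firstSeenEpochNs, long nowEpochNs, long waitForSeconds) {
        return TimeUnit.NANOSECONDS.toSeconds(nowEpochNs - firstSeenEpochNs) >= waitForSeconds;
    }

    public static void main(String[] args) {
        // Recorded by the old overseer and persisted.
        Instant firstSeen = Instant.parse("2024-01-01T00:00:00Z");
        long persisted = epochTimeNs(firstSeen);

        // Read back by a new overseer in a different JVM.
        long after10s = epochTimeNs(firstSeen.plusSeconds(10));
        long after61s = epochTimeNs(firstSeen.plusSeconds(61));

        System.out.println(waitForElapsed(persisted, after10s, 60)); // false
        System.out.println(waitForElapsed(persisted, after61s, 60)); // true
    }
}
```

The trade-off is that wall-clock time can jump (for example after an NTP
correction) and must be reasonably in sync across nodes, which should be
acceptable for wait periods expressed in seconds.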

Thanks,
Pierre
