On 24/12/18(Mon) 20:07, Scott Cheloha wrote:
> On Tue, Dec 18, 2018 at 03:39:43PM -0600, Ian Sutton wrote:
> > On Mon, Aug 14, 2017 at 3:07 PM Martin Pieuchot <m...@openbsd.org> wrote:
> > >
> > > I'd like to improve the fairness of the scheduler, with the goal of
> > > mitigating userland starvations.  For that the kernel needs to have
> > > a better understanding of the amount of executed time per task.
> > >
> > > The smallest interval currently usable on all our architectures for
> > > such accounting is a tick.  With the current HZ value of 100, this
> > > smallest interval is 10ms.  I'd like to bump this value to 1000.
> > >
> > > The diff below intentionally bumps other `hz' values to keep current
> > > ratios.  We certainly want to call schedclock(), or a similar time
> > > accounting function, at a higher frequency than 16 Hz.  However this
> > > will be part of a later diff.
> > >
> > > I'd be really interested in test reports.  mlarkin@ raised a good
> > > question: is your battery lifetime shorter with this diff?
> > >
> > [...] 
> > I'd like to see more folks test and other devs to share their
> > thoughts: What are the risks associated with bumping HZ globally?
> > Drawbacks? Reasons for hesitation?
> 
> In general I'd like to reduce wakeup latency as well.  Raising HZ is an
> obvious route to achieving that.  But I think there are a couple things
> that need to be addressed before it would be reasonable.  The things that
> come to mind for me are:
> 
>  - A tick is a 32-bit signed integer on all platforms.  If HZ=100, we
>    can represent at most ~248 days in ticks.  This is plenty.  If HZ=1000,
>    we now only have ~24.8 days.  Some may disagree, but I don't think this
>    is enough.

Why do you think it isn't enough?
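
For reference, the arithmetic behind those figures is easy to check
(a minimal userland sketch, nothing kernel-specific):

    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
            int hz[] = { 100, 1000 };

            for (int i = 0; i < 2; i++) {
                    /* seconds until a signed 32-bit tick counter overflows */
                    long secs = INT_MAX / hz[i];

                    printf("HZ=%4d: ~%.1f days\n", hz[i], secs / 86400.0);
            }
            return 0;
    }

This prints ~248.5 days for HZ=100 and ~24.8 days for HZ=1000,
matching the numbers above.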

>    One possible solution is to make ticks 64-bit.  This addresses the
>    timeout length issue at a cost to 32-bit platforms that I cannot
>    quantify without lots of testing: what is the overhead of using 64-bit
>    arithmetic on a 32-bit machine for all timeouts?
> 
>    A compromise is to make ticks a long.  kettenis mentioned this
>    possibility in a commit [1] some time back.  This would allow 64-bit
>    platforms to raise HZ without crippling timeout ranges.  But then you
>    have ticks of different sizes on different platforms, which could be a
>    headache, I imagine.

Note that we had, and certainly still have, tick-wrapping bugs in the
kernel :)  
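
The usual defence is to never compare two tick counts directly but to
test the sign of their difference.  A sketch of that classic idiom
(`tick_after' is a made-up name, not an existing kernel function):

    /*
     * Wrap-safe ordering of tick counts: correct as long as the two
     * values are less than INT_MAX/2 ticks apart.  The subtraction
     * is done on unsigned ints to avoid signed-overflow pitfalls.
     */
    static inline int
    tick_after(int a, int b)
    {
            return (int)((unsigned int)a - (unsigned int)b) > 0;
    }

So "if (ticks > deadline)" is buggy once the counter wraps, while
"if (tick_after(ticks, deadline))" keeps working.  At HZ=1000 the
wrap comes ten times sooner, so such bugs bite ten times faster.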

>    (maybe there are other solutions?)

Solution to what?

>  - How does an OpenBSD guest on vmd(8) behave when HZ=1000?  Multiple such
>    guests on vmd(8)?  Such guests on other hypervisors?
> 
>  - The replies in this thread don't indicate any effect on battery life or
>    power consumption, but I find it hard to believe that raising HZ has no
>    impact on such things.  Bumping HZ like this *must* increase CPU
>    utilization.  What is the cost in watt-hours?

It depends on the machine.  But that's one of the reasons I dropped the
bump.

>  - Can smaller machines even handle HZ=1000?  Linux experimented with this
>    over a decade ago and settled on a default HZ=250 for i386 [2].  I don't
>    know how it all shook out, but my guess is that they didn't revert from
>    1000 -> 250 for no reason at all.  Of course, FreeBSD went ahead with 1000
>    on i386, so opinions differ.

Indeed, we still support architectures that can't handle an HZ of 1000.

>  - How does this affect e.g. packet throughput on smaller machines?  I think
>    bigger boxes on amd64 would be fine, but I wonder if throughput would take
>    a noticeable hit on a smaller router.

Some measurements indicated a drop of 10% in packet forwarding on some
machines and no difference on others. 

> And then... can we reduce wakeup latency in general without raising HZ?  Other
> systems (e.g. DFly) have better wakeup latencies and still have HZ=100.  What
> are they doing?  Can we borrow it?

I haven't looked at other systems like DragonFly, but since you seem
interested in improving that area, here's my story.  I didn't look at
wakeup latencies, and I don't know why you're after that.  Instead I
focused on `schedhz' and schedclock().  I landed there after observing
that, with a high number of threads in "running" state (an active
browser while making a build), work was badly distributed amongst CPUs.
Some per-CPU queues kept growing while others stayed empty.

CPUs have runqueues that are selected based on the per-thread
`p_priority' field.  What this field represents today is confusing.
Many changes since the original scheduler design, including hardware
improvements, side effects and developer mistakes, have made it more
confusing.  However, bumping HZ improves the placement of "running"
threads in per-CPU runqueues.
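
Roughly, the mechanism looks like this (a simplified sketch, not the
exact kernel code; the real thing lives in kern/kern_sched.c):

    #include <sys/types.h>
    #include <sys/queue.h>

    #define SCHED_NQS       32      /* number of runqueues per CPU */

    struct cpu_queues {
            TAILQ_HEAD(, proc)       qs[SCHED_NQS];
            uint32_t                 whichqs; /* bit i set: qs[i] non-empty */
    };

    void
    setrunqueue_sketch(struct cpu_queues *cq, struct proc *p)
    {
            int queue = p->p_priority >> 2;  /* 0..SCHED_NQS-1 */

            TAILQ_INSERT_TAIL(&cq->qs[queue], p, p_runq);
            cq->whichqs |= 1U << queue;
    }

A stale `p_priority' thus puts a thread in the wrong queue, which is
exactly the bad placement described above.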

I spent a lot of time trying to observe and understand why.  I don't
remember the details, but I came to the conclusion that `p_priority'
was fresher.  In other words, the kernel had more up-to-date
information to make choices.

However it became clear to me that the current mis-design works well
enough by luck :)  Trying to theorise about & understand it today is
hard.  For example, the introduction of kernel threads and the switch
to the 1:1 rthread model changed the meaning of sleeping priorities.
This has led to multiple workarounds over the past years...

It is also hard to shrink the SCHED_LOCK() because it protects the
accounting fields used to compute priorities.

There's also a known problem with threads frequently moving between
CPUs.  This is particularly bad when the distance between two CPUs is
large (think multiple sockets).

Now I'm afraid that bumping HZ will lead to new mis-calculated values,
which will lead to new workarounds.  Instead I'd suggest spending time
moving to a scheduler that is understandable & understood, and not the
result of optimistic changes :o)

At the time of the diff I discussed moving to virtual deadlines instead
of priorities.  That should simplify math & locking because there would
be nothing to calculate.  I did an experiment using `hz' to calculate
virtual deadlines and that's why I needed a higher HZ.
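
Roughly, the experiment boiled down to something like this (a
hypothetical sketch, not the actual diff; `p_deadline' is a made-up
field):

    /*
     * A virtual deadline expressed in ticks.  Its granularity is
     * 1/hz seconds, so with hz=100 a 20ms slice is only 2 ticks;
     * hence the need for a higher HZ.
     */
    void
    set_deadline(struct proc *p)
    {
            int slice = 20 * hz / 1000;     /* ~20ms in ticks */

            p->p_deadline = ticks + slice;
    }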

Now, having a scheduler that depends on `hz' is, IMHO, a limitation.  And
that's one of the reasons why bumping HZ is complicated.  So I came to
the conclusion that bumping HZ wasn't the solution to *my* problem.
That's why I dropped the diff.

I think we should use high resolution timers to calculate deadlines.
There is plenty of prior work in that area, so it shouldn't be too hard
to get started.
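
A minimal sketch of the same idea on top of nanouptime(9), assuming a
hypothetical `p_deadline_ts' field; note that `hz' disappears
entirely:

    #include <sys/time.h>

    void
    set_deadline_hires(struct proc *p)
    {
            struct timespec now, slice = { 0, 20 * 1000 * 1000 };   /* 20ms */

            nanouptime(&now);
            timespecadd(&now, &slice, &p->p_deadline_ts);
    }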

Reducing usage of `hz' in the kernel would also be a very good step
in the tickless direction, for example by using timeout_add_msec(9)
instead of timeout_add(9) :o)
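
For example, both calls below arm the same 100ms timeout on a
hypothetical `my_to' (set up elsewhere with timeout_set(9)), but only
the second one keeps the caller independent of the tick rate:

    /* 100ms expressed in ticks: the caller must think about `hz' */
    timeout_add(&my_to, hz / 10);

    /* the same 100ms, with the conversion hidden behind the API */
    timeout_add_msec(&my_to, 100);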
