Bug#305556: dstat: SIGINT sometimes ignored

Dag Wieers Thu, 21 Apr 2005 08:03:58 -0700

On Thu, 21 Apr 2005, Marc Lehmann wrote:

> On Thu, Apr 21, 2005 at 02:43:37PM +0200, Dag Wieers <[EMAIL PROTECTED]> 
> wrote:
> > On Thu, 21 Apr 2005, Marc Lehmann wrote:
> > > Fast system with no activity :) On my dual-P3 1GHz dstat usually needs
> > > around 4 seconds to start up. On my dual 600MHz P3 machine it can take up
> > > to 10 seconds.
> > 
> > Ouch, blame python :) I'm wondering if your system might slow down dstat 
> > so badly that the job scheduling can't guarantee a near 1 second 
> > interval.
> 
> I guess it's more that dstat does 430 open() calls (and reads 82 files),
> while, say, vmstat only opens 5. Reading 82 separate files is always going
> to be slow. That's the only reason I still use vmstat sometimes, because,
> when a system starts to thrash it will usually not be able to start dstat
> in any reasonable timeframe :)
>
> (But that's ok for me, and won't deter me from using dstat, really :)


Hmmm, 430 open calls? 82 files being read? Every second, or only at 
start-up? Damn, I have to learn to profile python itself. I was 
expecting much less. I was also planning to cache file accesses to /proc, 
but the overhead of doing so might outweigh the gain for the 
few files that are read multiple times (most of these are small anyway).
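
To make the caching idea concrete, here is a minimal sketch. It assumes a 
cache that lives for one update only and is thrown away on every tick, so 
counters never go stale. TickCache, read and next_tick are made-up names for 
illustration, not anything that exists in dstat today.

```python
class TickCache:
    """Read each file at most once per update cycle (hypothetical helper)."""

    def __init__(self, opener=open):
        self._opener = opener   # injectable so the cache can be tested
        self._data = {}         # path -> contents for the current tick
        self.opens = 0          # how many real opens were performed

    def read(self, path):
        # Only touch the filesystem the first time a path is requested
        # within the current tick.
        if path not in self._data:
            with self._opener(path) as f:
                self._data[path] = f.read()
            self.opens += 1
        return self._data[path]

    def next_tick(self):
        # /proc contents change every second, so nothing may survive a tick.
        self._data.clear()
```

Several stats that all parse, say, /proc/stat would then share one open() 
per update. Whether that beats the dictionary overhead for small files is 
exactly what would need measuring.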


 
> > And then use dstat -t <options>. On my system it results in:
> 
> I get this:
> 
> 1114091418.016
> 1114091419.017
> 1114091420.017
> 1114091421.018
> 1114091422.019
> 1114091423.020
> 1114091424.020
> 1114091425.021
> 1114091426.022
> 1114091427.023
> 1114091428.025
> 1114091429.026
> 1114091430.026
> 1114091431.027
> 1114091432.028
> 
> So the deviation increases. This happens on my completely idle dual-opteron, 
> too:
> 
> 1114091500.907
> 1114091501.908
> 1114091502.908
> 1114091503.909
> 1114091504.910
> 1114091505.911
> 1114091506.912
> 1114091507.912
> 1114091508.913
> 1114091509.914
> 1114091510.915
> 1114091511.916
> 1114091512.916
> 
> Even on my rather busy and slow freenet node (less busy during daytime,
> but dstat still takes a few seconds to start), I get 1ms increasing
> deviation:
> 
> 1114091890.894
> 1114091891.895
> 1114091892.896
> 1114091893.897
> 1114091894.898
> 1114091895.899
> 1114091896.900
> 1114091897.901
> 1114091898.901
> 1114091899.902
> 1114091900.903
> 1114091901.904
> 1114091902.905
> 1114091903.906
> 1114091904.907
> 1114091905.908

That's ok. On a 2.4 kernel I didn't have this slight deviation (it would 
deviate 1ms only after 5 minutes or so), but on all 2.6 systems I tested 
it seems to be present. I feared the deviation was much worse on your 
slow system. Could you do the same test with actual counters enabled 
(like you are used to)? Just to see if that could impact job scheduling.
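
As a rough check on these numbers, the drift per tick can be computed from 
the first and last timestamp of a run. This is only a sketch: drift_per_tick 
and seconds_until_drift are hypothetical helper names, and SAMPLE is the 
first series quoted above.

```python
# First timestamp series quoted in this thread (seconds since the epoch).
SAMPLE = [
    1114091418.016, 1114091419.017, 1114091420.017, 1114091421.018,
    1114091422.019, 1114091423.020, 1114091424.020, 1114091425.021,
    1114091426.022, 1114091427.023, 1114091428.025, 1114091429.026,
    1114091430.026, 1114091431.027, 1114091432.028,
]


def drift_per_tick(stamps, interval=1.0):
    """Average extra time per update beyond the requested interval."""
    ticks = len(stamps) - 1
    if ticks < 1:
        raise ValueError("need at least two timestamps")
    return ((stamps[-1] - stamps[0]) - ticks * interval) / ticks


def seconds_until_drift(total, per_tick, interval=1.0):
    """Wall-clock seconds until the accumulated drift reaches `total`."""
    return total / per_tick * interval
```

For the sample this comes out a little under 1ms per tick. At exactly 1ms 
per tick it takes 1000 ticks, so roughly 17 minutes, to be a full second 
off.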


> > 1114086819.071|  1   3  96   0   0   0|   0     0 |   0     0 |   0     0 |1082   844 |0.11 0.06 0.08
> > 1114086820.072|  1   4  95   0   0   0|   0     0 |   0     0 |   0     0 |1061   885 |0.11 0.06 0.08
> > 1114086821.073|  1   3  96   0   0   0|   0     0 |   0     0 |   0     0 |1082   814 |0.11 0.06 0.08
> > 
> > So you see a 1ms deviation per second. (about 1sec deviation after 17mins)
> 
> Obviously a minor bug in dstat then. However, it should not affect
> calculations much.

Well, it's not a bug. I bet vmstat has the same problem, since it cannot 
control the timing either (unless I'm mistaken).


> One thing, though: the longer I think about it, the more I come to the
> conclusion that the intermediate updates are useless, as by looking at
> dstat output you cannot know what the numbers actually show, because they
> are averaged over an unknown number of seconds.

Not if you use -t. In fact, it might be worthwhile to make -t show the 
number of seconds since the last update.


> Personally, I would prefer either a real n-second average, with
> intermediate updates and one "scroll" every n seconds, or no averages at
> all for the intermediate reports. Given that I don't really rely on the
> intermediate updates (it's just one extra feature over vmstat), and this
> is my personal preference, you might simply ignore my thoughts :->

It definitely is something to consider making configurable.


> > Now try the same when enabling the following lines:
> > 
> >         ### Increase precision if we're root (does not seem to have effect)
> >     #   if os.geteuid() == 0:
> >     #       os.nice(-20)
> >     #   sys.setcheckinterval(op.delay / 10000)
> > 
> > And let me know if this makes a difference for you. It has never made a 
> > real difference for me. (ie. you may want to try this both as root as 
> > well as a user and maybe disable the if statement)
> 
> Your and my time dumps show no problem with precision, but with clock
> stability. I tried to find out how dstat does its time scheduling but
> could only find references to ALARM, which has no stability guarantees.
> 
> Not that deviations of a few ms/second matter to me, but if you want to
> make one update per second, on average, for continued time, then you'd
> need to wait "till the next update" and not "one second between updates",
> as the latter doesn't take into account the time of the update itself (or
> in this case, delays in ALARM handling out of your control).

Yes, there is a problem when the update itself takes longer than a second. 
In practice this has never happened, but the more expensive stats you might 
add can make this worse. My plan is to have the ALARM interrupt the 
processing of stats, so that if an update overruns the delay it gets 
interrupted and starts over.

I cannot add custom counters until I have fixed this, for sure.
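
For reference, waiting "till the next update" rather than "one second 
between updates" could look roughly like the sketch below. It is not what 
dstat does today: next_deadline and run are hypothetical names, and it uses 
time.monotonic and time.sleep instead of ALARM. Updates are pinned to a 
fixed grid (start + k * interval), so the time spent in an update does not 
accumulate, and an update that overruns simply skips to the next grid point.

```python
import time


def next_deadline(start, interval, now):
    """First point of the grid start + k*interval that lies after `now`."""
    ticks_done = int((now - start) // interval) + 1
    return start + ticks_done * interval


def run(update, interval=1.0, count=5, clock=time.monotonic, sleep=time.sleep):
    """Call update() `count` times, aligned to a fixed grid; return the
    clock reading taken right after each update."""
    start = clock()
    stamps = []
    for _ in range(count):
        update()
        now = clock()
        stamps.append(now)
        # Sleep until the next grid point instead of a fixed `interval`,
        # so the cost of update() is absorbed rather than added on top.
        sleep(max(0.0, next_deadline(start, interval, now) - now))
    return stamps
```

With an update that costs 0.25s, the naive "sleep one second" loop would 
fire every 1.25s, while this one still fires once per second.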


> > you're absolutely right. I'm going to look into speeding up dstat (or 
> > slowing down my system and profiling statements, I bet some modules take 
> > a long time and may not be necessary all the time).
> 
> I have similar problems with any of my perl programs. Perl simply has to
> read so many files that it is impossible to make it faster, except by
> using no modules, which is unacceptable.
> 
> I don't think it's dstat's fault at all. It's a mere artifact of dstat
> being written in a language that does all linking at runtime.

I know, but I bet it can be improved as I have not looked at it closely 
enough. I need to strace and see how many files (and what files) are being 
accessed, how much time is spent, and whether each access is worth what it 
costs. (Some things I might be able to implement myself)


> > > I guess the best thing is to start catching INT only just before initscr,
> > > as before that there is no reason even to catch the signal, because
> > > catching it only adds time before the user gets back to his/her prompt
> > > (there is nothing to clean up), although I suspect that it can't be done
> > > with python.
> > 
> > It can be done. But it still requires me to import the signal module :)
> 
> Hm.. you mean to say that python gives a backtrace by default within
> a "short" timeframe after starting up? Frankly, I'd report this as a
> bug immediately, after all, that precludes being able to control signal
> handling completely within python :->

I bet it does. By default python gives a traceback on SIGINT. You can only 
try to shrink the window of opportunity, or start with an exception 
handling block (so that the window becomes practically zero).
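
Both options can be sketched as follows. This is an illustration under 
assumptions, not dstat's code: main and use_default_sigint are hypothetical 
names. The first variant needs no extra module at all, the second one is the 
variant that requires importing signal.

```python
import signal


def main(run):
    """Run the program body; treat Ctrl-C as a quiet exit, not a traceback."""
    try:
        run()
    except KeyboardInterrupt:
        # 130 = 128 + SIGINT (2), the usual shell convention for "killed by ^C".
        return 130
    return 0


def use_default_sigint():
    """Alternative: restore the OS default action, so SIGINT ends the process
    without python ever raising KeyboardInterrupt. Must be called from the
    main thread. Returns the previously installed handler."""
    return signal.signal(signal.SIGINT, signal.SIG_DFL)
```

The try block only covers what runs inside it, so anything imported or 
executed before main() is entered is still inside the window and would 
still produce a traceback on Ctrl-C.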


> Thanks for investing time. If vmstat had bugged me enough I would have
> written my own vmstat replacement, but it would have been some small
> unpublished hack that would only work for me. Thanks for taking the time
> to do it properly and release it, I know that's quite an amount of extra
> work.

I actually enjoy it. It gives a unique opportunity to find out how to do 
things in python that you otherwise might not know about. Especially the 
signal handling, profiling and modular design are new things for me 
to learn :)

Although many users have indicated they like the self-contained 
characteristic of dstat, I need to find a way to make it pluggable or 
make it work on FreeBSD and others too.

--   dag wieers,  [EMAIL PROTECTED],  http://dag.wieers.com/   --
[all I want is a warm bed and a kind word and unlimited power]

