[ https://issues.apache.org/jira/browse/GEODE-9002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bill Burcham updated GEODE-9002:
--------------------------------
    Description:
Linux performance icon Brendan Gregg advocates the [USE|http://www.brendangregg.com/usemethod.html] method of performance analysis: Utilization, Saturation, and Errors.

When it comes to CPU, Geode captures a number of _utilization_ statistics. Some are direct, like LinuxSystemStats cpuIdle and cpuActive. Others are indirect, like:
 * DistributionStats
 ** heartbeatsSent: you may see a gap in the every-five-seconds heartbeats
 * StatSampler
 ** delayDuration: you may see a rise when CPU is scarce
 ** sampleCount: you may see an interruption in the regular once-per-second sampling
 * (G1GC collector)
 ** (various memory utilization statistics may indicate memory pressure, which in turn can give rise to long GC pauses)
 * LinuxSystemStats
 ** cpuSteal: indicating that the virtualization environment has not given the VM its share of CPU

But utilization statistics alone can't tell you when a resource (like CPU) is _saturated_, i.e. when demand is higher than the servicing ability. If you're just looking at utilization metrics, then a saturated system might look a lot like a system just below saturation. In order to tell the difference, saturation metrics are needed.

In the case of CPU, there is a conceptual queue in front of each processor. Tasks (operating system threads) that are ready to run enter a queue and, after some delay, are given a time slice by an actual physical CPU.

You might think that Geode's LinuxSystemStats loadAverage1, loadAverage5, and loadAverage15 might fit this bill. Those statistics do provide some saturation information. The problem is, they conflate CPU with I/O and other things (see [Linux Load Averages: Solving the Mystery|http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html]).

A better, more specific measure of CPU saturation is available through statistics exposed via the /proc/schedstat virtual file.
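As a rough illustration of what /proc/schedstat offers, the sketch below parses the per-CPU lines and derives a mean wait time between two samples. It assumes the version 15 layout of the file, in which each "cpuN" line carries nine counters and the last three are cumulative running time (ns), cumulative time spent waiting to run (ns), and the number of timeslices run. The class and method names (SchedstatSketch, CpuSample, parse, meanQueuedTimeNanos) are hypothetical and are not Geode APIs; this is not the implementation proposed for this ticket.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Geode): reads per-CPU scheduler counters. */
final class SchedstatSketch {

  /** Cumulative counters taken from one "cpuN" line. */
  static final class CpuSample {
    final long runningTimeNanos;
    final long queuedTimeNanos;
    final long tasksScheduledCount;

    CpuSample(long runningTimeNanos, long queuedTimeNanos, long tasksScheduledCount) {
      this.runningTimeNanos = runningTimeNanos;
      this.queuedTimeNanos = queuedTimeNanos;
      this.tasksScheduledCount = tasksScheduledCount;
    }
  }

  /** Parses the lines of a schedstat file into one sample per CPU, keyed by "cpu0", "cpu1", ... */
  static Map<String, CpuSample> parse(List<String> lines) {
    Map<String, CpuSample> samples = new LinkedHashMap<>();
    for (String line : lines) {
      String[] f = line.trim().split("\\s+");
      // Keep only "cpuN <nine counters>" lines; skip version, timestamp, and domain lines.
      if (!f[0].matches("cpu\\d+") || f.length < 10) {
        continue;
      }
      // Assumed layout: counters 7, 8, 9 are run time, wait time, and timeslice count.
      samples.put(f[0], new CpuSample(
          Long.parseLong(f[7]), Long.parseLong(f[8]), Long.parseLong(f[9])));
    }
    return samples;
  }

  /** Reads the live file; only meaningful on Linux kernels that expose /proc/schedstat. */
  static Map<String, CpuSample> read() throws IOException {
    return parse(Files.readAllLines(Paths.get("/proc/schedstat"), StandardCharsets.UTF_8));
  }

  /** Mean time a task waited for this CPU between two samples; 0 if nothing was scheduled. */
  static long meanQueuedTimeNanos(CpuSample previous, CpuSample current) {
    long tasks = current.tasksScheduledCount - previous.tasksScheduledCount;
    if (tasks <= 0) {
      return 0L;
    }
    return (current.queuedTimeNanos - previous.queuedTimeNanos) / tasks;
  }
}
```

Because the counters are cumulative, the mean is computed from deltas between consecutive samples rather than from the raw values. A sustained rise in that mean suggests that ready-to-run tasks are queuing for the CPU.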
When this ticket is complete, there will be a new statistic type called LinuxThreadScheduler, with four associated statistics gathered directly from /proc/schedstat or derived from data gathered from it:
 * runningTimeNanos: sum of all time spent running by tasks on this processor, in nanoseconds
 * queuedTimeNanos: sum of all time spent waiting to run by tasks on this processor, in nanoseconds
 * tasksScheduledCount: number of tasks (not necessarily unique) given to the processor
 * meanTaskQueuedTimeNanos: average time that a ready-to-run task waited for a CPU since the last sample, in nanoseconds

One "statistic" will be gathered for each CPU. So a Geode process running on a two-CPU system will capture two statistics, called "cpu0" and "cpu1", each of this new type.

By default Geode will not gather these new statistics. A TBD Java system property will be used to enable gathering the new LinuxThreadScheduler statistic.

> Add Statistic for /proc/schedstat
> ---------------------------------
>
>                 Key: GEODE-9002
>                 URL: https://issues.apache.org/jira/browse/GEODE-9002
>             Project: Geode
>          Issue Type: New Feature
>          Components: statistics
>            Reporter: Bill Burcham
>            Assignee: Bill Burcham
>            Priority: Major
>              Labels: pull-request-available

--
This message was sent by Atlassian Jira
(v8.3.4#803005)