[ 
https://issues.apache.org/jira/browse/GEODE-9002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9002:
--------------------------------
    Description: 
Linux performance icon Brendan Gregg advocates the
[USE|http://www.brendangregg.com/usemethod.html] method of performance
analysis: Utilization, Saturation, and Errors.

When it comes to CPU, Geode captures a number of _utilization_ statistics. Some
are direct, like LinuxSystemStats cpuIdle and cpuActive. Others are indirect,
like:
 * DistributionStats
 ** heartbeatsSent: you may see a gap in the every-five-seconds heartbeats
 * StatSampler
 ** delayDuration: you may see a rise when CPU is scarce
 ** sampleCount: you may see an interruption in the regular once-per-second 
sampling
 * (G1GC collector)
 ** (various memory utilization statistics may indicate memory pressure which 
in turn can give rise to long GC pauses)
 * LinuxSystemStats
 ** cpuSteal: indicates that the virtualization environment has not given the
VM its share of CPU

But utilization statistics alone can't tell you when a resource (like CPU) is
_saturated_, i.e. when demand is higher than the capacity to service it. If
you're just looking at utilization metrics, then a saturated system might look
a lot like a system just below saturation. In order to tell the difference,
saturation metrics are needed.

In the case of CPU, there is a conceptual queue in front of each processor.
Tasks (operating system threads) that are ready to run enter a queue and,
after some delay, are given a time slice on an actual physical CPU.

You might think that Geode's LinuxSystemStats loadAverage1, loadAverage5, and
loadAverage15 statistics fit the bill. They do provide some saturation
information. The problem is that they conflate CPU demand with I/O and other
things (see [Linux Load Averages: Solving the
Mystery|http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html]).

A better, more specific measure of CPU saturation is available through 
statistics exposed via the /proc/schedstat virtual file.

When this ticket is complete, there will be a new statistic type called
LinuxThreadScheduler, with four associated statistics, either gathered
directly from /proc/schedstat or derived from data gathered from it:
 * runningTimeNanos: sum of all time spent running by tasks on this processor 
in nanoseconds
 * queuedTimeNanos: sum of all time spent waiting to run by tasks on this 
processor in nanoseconds
 * tasksScheduledCount: # of tasks (not necessarily unique) given to the 
processor
 * meanTaskQueuedTimeNanos: average time that a ready-to-run task waited for a 
CPU, since the last sample, in nanoseconds

One instance of this statistic type will be gathered for each CPU. So a Geode
process running on a two-CPU system will capture two instances, named "cpu0"
and "cpu1", each of this new type.
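
As a rough illustration of how these four values could be derived (a sketch
only, not the implementation this ticket will deliver): in the kernel's
documented schedstat version 15 format, each cpuN line carries nine counters,
and the last three match the running time, queued time, and task count
described above. The class and method names below (SchedstatSketch, CpuSample,
parse, meanTaskQueuedTimeNanos) are hypothetical and are not part of Geode.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: derive per-CPU scheduler statistics from the text of /proc/schedstat. */
public class SchedstatSketch {

  /** One sample of the three cumulative counters for a single CPU. */
  public static final class CpuSample {
    public final long runningTimeNanos;
    public final long queuedTimeNanos;
    public final long tasksScheduledCount;

    public CpuSample(long runningTimeNanos, long queuedTimeNanos, long tasksScheduledCount) {
      this.runningTimeNanos = runningTimeNanos;
      this.queuedTimeNanos = queuedTimeNanos;
      this.tasksScheduledCount = tasksScheduledCount;
    }
  }

  /** Parses the "cpuN f1 ... f9" lines, keeping fields 7, 8, and 9 for each CPU. */
  public static Map<String, CpuSample> parse(String schedstatText) {
    Map<String, CpuSample> samples = new LinkedHashMap<>();
    for (String line : schedstatText.split("\n")) {
      String[] tokens = line.trim().split("\\s+");
      // Skip "version", "timestamp", and "domainN" lines, and any short line.
      if (!tokens[0].startsWith("cpu") || tokens.length < 10) {
        continue;
      }
      samples.put(tokens[0], new CpuSample(
          Long.parseLong(tokens[7]),   // time spent running
          Long.parseLong(tokens[8]),   // time spent waiting to run
          Long.parseLong(tokens[9]))); // tasks given to the processor
    }
    return samples;
  }

  /** Mean wait per scheduled task between two samples; 0 if nothing was scheduled. */
  public static long meanTaskQueuedTimeNanos(CpuSample previous, CpuSample current) {
    long tasks = current.tasksScheduledCount - previous.tasksScheduledCount;
    if (tasks <= 0) {
      return 0;
    }
    return (current.queuedTimeNanos - previous.queuedTimeNanos) / tasks;
  }

  private static String readSchedstat() throws Exception {
    return new String(Files.readAllBytes(Paths.get("/proc/schedstat")),
        StandardCharsets.US_ASCII);
  }

  /** Linux only: takes two samples one second apart and prints the derived mean per CPU. */
  public static void main(String[] args) throws Exception {
    Map<String, CpuSample> previous = parse(readSchedstat());
    Thread.sleep(1000);
    Map<String, CpuSample> current = parse(readSchedstat());
    for (Map.Entry<String, CpuSample> entry : current.entrySet()) {
      CpuSample before = previous.get(entry.getKey());
      if (before != null) {
        System.out.println(entry.getKey() + " meanTaskQueuedTimeNanos="
            + meanTaskQueuedTimeNanos(before, entry.getValue()));
      }
    }
  }
}
```

The first three statistics are cumulative counters read straight from the
file. Only meanTaskQueuedTimeNanos needs two consecutive samples, which is why
it is described as "since the last sample".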

By default, Geode will not gather these new statistics. A Java system property
(name TBD) will be used to enable gathering the new LinuxThreadScheduler
statistics.

  was:
Linux performance icon Brendan Gregg advocates the
[USE|http://www.brendangregg.com/usemethod.html] method of performance
analysis: Utilization, Saturation, and Errors.

When it comes to CPU, Geode captures a number of _utilization_ statistics. Some
are direct, like LinuxSystemStats cpuIdle and cpuActive. Others are indirect,
like:

 

But utilization statistics alone can't tell you when a resource (like CPU) is
_saturated_, i.e. when demand is higher than the capacity to service it. If
you're just looking at utilization metrics, then a saturated system might look
a lot like a system just below saturation. In order to tell the difference,
saturation metrics are needed.

In the case of CPU, there is a conceptual queue in front of each processor.
Tasks (operating system threads) that are ready to run enter a queue and,
after some delay, are given a time slice on an actual physical CPU.

You might think that Geode's LinuxSystemStats loadAverage1, loadAverage5, and
loadAverage15 statistics fit the bill. They do provide some saturation
information. The problem is that they conflate CPU demand with I/O and other
things (see [Linux Load Averages: Solving the
Mystery|http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html]).

A better, more specific measure of CPU saturation is available through 
statistics exposed via the /proc/schedstat virtual file.

When this ticket is complete, there will be a new statistic type called
LinuxThreadScheduler, with four associated statistics, either gathered
directly from /proc/schedstat or derived from data gathered from it:
 * runningTimeNanos: sum of all time spent running by tasks on this processor 
in nanoseconds
 * queuedTimeNanos: sum of all time spent waiting to run by tasks on this 
processor in nanoseconds
 * tasksScheduledCount: # of tasks (not necessarily unique) given to the 
processor
 * meanTaskQueuedTimeNanos: average time that a ready-to-run task waited for a 
CPU, since the last sample, in nanoseconds

One instance of this statistic type will be gathered for each CPU. So a Geode
process running on a two-CPU system will capture two instances, named "cpu0"
and "cpu1", each of this new type.

By default, Geode will not gather these new statistics. A Java system property
(name TBD) will be used to enable gathering the new LinuxThreadScheduler
statistics.


> Add Statistic for /proc/schedstat
> ---------------------------------
>
>                 Key: GEODE-9002
>                 URL: https://issues.apache.org/jira/browse/GEODE-9002
>             Project: Geode
>          Issue Type: New Feature
>          Components: statistics
>            Reporter: Bill Burcham
>            Assignee: Bill Burcham
>            Priority: Major
>              Labels: pull-request-available



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
