Hi everybody, I'm Arthur from the Prometheus-Operator team.

We've recently added support for running Prometheus in Agent mode with 
Prometheus-Operator, and we've started to brainstorm new deployment patterns 
that could be explored with the Agent, e.g. as DaemonSets or sidecars.

At this point, I'm drafting how things could look if the Prometheus Agent 
runs as a Pod sidecar, and I would love to hear the community's opinion on 
it. I'm particularly interested to know whether there is an appetite for 
such a deployment pattern and whether you can spot new failure modes with 
this approach.

Here is the proposal:

Agent Deployment Pattern: Sidecar Injection

<https://github.com/prometheus-operator/prometheus-operator/blob/803a331736a6b05274bf07862c6550d053735a19/Documentation/designs/agent-deployment-pattern-sidecar.md#summary>
Summary

With Prometheus-Operator finally supporting running Prometheus in Agent 
mode, we can start thinking about different deployment patterns that can be 
explored with this minimal container. This document aims to continue the 
work started by the Prometheus Agent design document 
<https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/designs/prometheus-agent.md>, 
focusing on how Prometheus-Operator can deploy PrometheusAgents as sidecars 
running alongside the Pods that a user wants to monitor.
Background

At the time this document was written, Prometheus-Operator could deploy 
Prometheus in Agent mode, but only using a pattern similar to the original 
implementation of the Prometheus server: StatefulSets. The original 
design document 
<https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/designs/prometheus-agent.md> 
for the Prometheus Agent already mentions that different deployment 
patterns are desired; however, for the sake of speeding up the initial 
implementation, it was decided to re-use the existing logic and start with 
the Agent running as StatefulSets.

Also for the sake of speeding up implementation, this document won't cover 
several new deployment patterns, but only one: Sidecar Injection.

Looking at the traditional deployment model, we have a single Prometheus 
(or an HA setup) per cluster or namespace, responsible for scraping all 
containers in its scope. Prometheus-Operator relies on the ServiceMonitor, 
PodMonitor, and Probe CRs to configure Prometheus, which ultimately uses 
Kubernetes service discovery to find the endpoints that need to be scraped.

Depending on the cluster's scale and how often Prometheus hits the 
Kubernetes API, Prometheus service discovery can increase the load on the 
API significantly and affect the overall functionality of said cluster.

Another problem is that one or more containers can be updated to a 
problematic version that causes a Cardinality Spike 
<https://grafana.com/blog/2022/02/15/what-are-cardinality-spikes-and-why-do-they-matter/>. 
Depending on the magnitude of the spike, a single container could 
single-handedly crash the monitoring system of the whole cluster.

[image: Traditional Deployment Pattern] 
<https://github.com/prometheus-operator/prometheus-operator/blob/803a331736a6b05274bf07862c6550d053735a19/Documentation/designs/assets/agent-deployment-pattern-sidecar/traditional-deployment-pattern.png>
Proposal

This document proposes a new deployment model where Prometheus-Operator 
injects a Prometheus Agent sidecar container (plus the Prometheus config 
reloader) into Pods that need to be scraped. With a sidecar, we tackle both 
problems mentioned above:


   - Service-discovery load on the Kubernetes API disappears, since 
   service discovery is no longer needed. Prometheus will scrape containers 
   in the same Pod through their shared network interface, and scrape 
   configuration can be declared via Pod annotations.
   - A sudden cardinality spike will not affect the whole monitoring 
   system. In a worst-case scenario, it will fail a single Pod.

A common pattern used with Prometheus's Kubernetes service discovery is the 
use of annotations to declaratively tell Prometheus which endpoints need 
to be scraped 
<https://www.acagroup.be/en/blog/auto-discovery-of-kubernetes-endpoint-services-prometheus/>. 
From a code search on GitHub 
<https://github.com/search?q=prometheus.io%2Fscrape%3A+%22true%22&type=code> 
for prometheus.io/scrape: "true", we can tell that this approach already 
has good adoption. To not conflict with the already commonly used 
annotations, we can start with our own, but with a very similar approach.
apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    prometheus.operator.io/scrape: "true"
    prometheus.operator.io/path: "/metrics"
    prometheus.operator.io/port: "8080"
    prometheus.operator.io/scrape-interval: "60s"
spec:
...

The existing PrometheusAgent CRD would be extended with a new field called 
mode, which (for now) can be one of two values, statefulset or sidecar, 
with statefulset as the default. If mode is set to sidecar, 
Prometheus-Operator won't deploy any Prometheus Agent initially. Instead, 
it will watch for Pod updates and inject the Prometheus Agent as a sidecar 
into Pods that carry the pre-determined annotations.
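In Go terms, the CRD extension could be as small as the sketch below (a heavily trimmed stand-in for the real spec; field and constant names are assumptions):

```go
package main

import "fmt"

// DeploymentMode selects how Prometheus-Operator deploys the agent.
type DeploymentMode string

const (
	ModeStatefulSet DeploymentMode = "statefulset"
	ModeSidecar     DeploymentMode = "sidecar"
)

// PrometheusAgentSpec is a heavily trimmed sketch of how the CRD spec
// could grow the new field; the real spec has many more fields.
type PrometheusAgentSpec struct {
	// Mode defaults to "statefulset" when left empty.
	Mode DeploymentMode `json:"mode,omitempty"`
}

// effectiveMode applies the statefulset default for unset specs.
func effectiveMode(s PrometheusAgentSpec) DeploymentMode {
	if s.Mode == "" {
		return ModeStatefulSet
	}
	return s.Mode
}

func main() {
	fmt.Println(effectiveMode(PrometheusAgentSpec{}))
}
```

Defaulting to statefulset keeps existing PrometheusAgent CRs working unchanged.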

In addition to selecting the deployment model, the Agent CR will be the 
source of truth for the remote-write configuration, such as URL and 
authentication. A change to the remote-write configuration would still 
require a hot reload of potentially millions of agent sidecar containers, 
but by keeping the remote-write configuration out of Pod annotations we at 
least avoid requiring the Pod manifests to be updated as well.

If different sets of pods require different remote-write configurations, 
then multiple PrometheusAgent CRs are needed. This means that the pod also 
needs to specify which Agent CR will inject the sidecar:
apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    prometheus.operator.io/scrape: "true"
    prometheus.operator.io/path: "/metrics"
    prometheus.operator.io/port: "8080"
    prometheus.operator.io/scrape-interval: "60s"
    prometheus.operator.io/agent-selector: "monitoring/agent-example"
spec:
...
---
apiVersion: monitoring.coreos.com/v1alpha1
kind: PrometheusAgent
metadata:
  name: agent-example
  namespace: monitoring
spec:
  mode: sidecar
  remoteWrite:
  - url: https://example.com
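For illustration, the mutated Pod could end up looking roughly like this. Container names, images, and flags here are assumptions for the sketch, not a finalized spec:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    prometheus.operator.io/scrape: "true"
spec:
  containers:
  - name: app            # the user's original container
    image: example/app
    ports:
    - containerPort: 8080
  # Injected by Prometheus-Operator (sketch):
  - name: prometheus-agent
    image: quay.io/prometheus/prometheus:latest
    args:
    - --enable-feature=agent
    - --config.file=/etc/prometheus/prometheus.yaml
  - name: config-reloader
    image: quay.io/prometheus-operator/prometheus-config-reloader:latest
```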

With a visualization:

[image: Sidecar Deployment Pattern] 
<https://github.com/prometheus-operator/prometheus-operator/blob/803a331736a6b05274bf07862c6550d053735a19/Documentation/designs/assets/agent-deployment-pattern-sidecar/sidecar-deployment-pattern.png>
What to do with ServiceMonitor, PodMonitor, and Probe selectors?

With the sidecar approach, our goal is to scale Prometheus horizontally 
while avoiding impact on the Kubernetes API. It wouldn't make sense for a 
sidecar to also scrape metrics from other Pods.

If mode is set to sidecar, a validating webhook would forbid 
PrometheusAgent CRs from being created or updated with the following fields:

   - serviceMonitorSelector
   - serviceMonitorNamespaceSelector
   - podMonitorSelector
   - podMonitorNamespaceSelector
   - probeSelector
   - probeNamespaceSelector
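A minimal sketch of that validation rule follows. The type below is a simplified stand-in; the real webhook would decode the PrometheusAgent object and derive which selector fields are non-nil:

```go
package main

import "fmt"

// sidecarSelectorCheck records whether selector fields of a PrometheusAgent
// CR are set; in the real webhook this would be derived from the decoded
// object rather than passed in as strings.
type sidecarSelectorCheck struct {
	Mode      string
	SetFields []string // names of selector fields that are non-nil
}

// validate rejects sidecar-mode CRs that set any monitor/probe selector,
// mirroring the webhook rule described above. Other modes pass unchanged.
func (c sidecarSelectorCheck) validate() error {
	if c.Mode != "sidecar" {
		return nil
	}
	if len(c.SetFields) > 0 {
		return fmt.Errorf("fields %v must not be set when mode is sidecar", c.SetFields)
	}
	return nil
}

func main() {
	err := sidecarSelectorCheck{
		Mode:      "sidecar",
		SetFields: []string{"podMonitorSelector"},
	}.validate()
	fmt.Println(err)
}
```

Rejecting these fields at admission time gives users an immediate error instead of a silently ignored selector.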

Caveats

Config Hot Reload

There will now be two ways to change the Prometheus configuration: 1) by 
changing annotations on the Pod and 2) by changing the remote-write fields 
in the PrometheusAgent CR. The first will only trigger a hot reload for the 
involved Pod, but the latter has the potential to trigger millions of hot 
reloads, depending on the scale of the cluster.

While the efficiency of the config-reloader hasn't been measured yet, this 
particular container might become problematic in huge-scale environments.
WAL not optimized for small environments

The Prometheus Write-Ahead Log (WAL) is stored as a sequence of numbered 
files of 128MiB each by default. This means that, by default, at least 
128MiB is needed to run a Prometheus Agent, even ignoring every other part 
of Prometheus. With sidecars we're optimizing for horizontal scale, and 
128MiB might be far more than necessary to store metrics from a single Pod.
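One possible mitigation is shrinking the WAL segment size: Prometheus exposes a --storage.tsdb.wal-segment-size flag for server mode, and agent mode appears to have an analogous --storage.agent.wal-segment-size (availability should be verified for the Prometheus version in use). A hypothetical args fragment for the injected sidecar:

```yaml
# Sketch only: image tag and flag placement are assumptions, and the
# agent-mode flag name should be verified against the Prometheus version.
containers:
- name: prometheus-agent
  image: quay.io/prometheus/prometheus:latest
  args:
  - --enable-feature=agent
  - --config.file=/etc/prometheus/prometheus.yaml
  - --storage.agent.wal-segment-size=32MB
```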
Lack of High-Availability setup

Given that Prometheus is not optimized for very small environments, 
injecting two sidecars per Pod sounds like a big waste of resources. With 
only one sidecar, however, an HA Prometheus setup won't be an option.

With that said, having an HA Prometheus seems more critical in the 
traditional deployment pattern than in the sidecar approach. That's because 
when Prometheus fails in the former we lose the monitoring stack for the 
whole cluster, while with the latter we just lose metrics from a single 
Pod.
References
   
   - [1] https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/designs/prometheus-agent.md
   - [2] https://opentelemetry.io/docs/collector/scaling/
   - [3] https://www.acagroup.be/en/blog/auto-discovery-of-kubernetes-endpoint-services-prometheus/
   - [4] https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Developers" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-developers/d3e4d7c7-d79e-494a-bdcc-32ce2d04a88dn%40googlegroups.com.
