*: migrate kube-prometheus to Prometheus Operator repo
This commit is contained in:
133
README.md
133
README.md
@@ -4,135 +4,6 @@ This repository collects Kubernetes manifests, dashboards, and alerting rules
|
|||||||
combined with documentation and scripts to provide single-command deployments
|
combined with documentation and scripts to provide single-command deployments
|
||||||
of end-to-end Kubernetes cluster monitoring.
|
of end-to-end Kubernetes cluster monitoring.
|
||||||
|
|
||||||
## Prerequisites
|
# This repository has moved
|
||||||
|
|
||||||
First, you need a running Kubernetes cluster. If you don't have one, follow the
|
This repository has been merged with the [Prometheus Operator](https://github.com/coreos/prometheus-operator). It can now be found under [`contrib/kube-prometheus`](https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus).
|
||||||
instructions of [bootkube](https://github.com/kubernetes-incubator/bootkube) or
|
|
||||||
[minikube](https://github.com/kubernetes/minikube). Some sample contents of this
|
|
||||||
repository are adapted to work with a [multi-node setup](https://github.com/kubernetes-incubator/bootkube/tree/master/hack/multi-node)
|
|
||||||
using [bootkube](https://github.com/kubernetes-incubator/bootkube).
|
|
||||||
|
|
||||||
## Monitoring Kubernetes
|
|
||||||
|
|
||||||
The manifests used here use the [Prometheus Operator](https://github.com/coreos/prometheus-operator),
|
|
||||||
which manages Prometheus servers and their configuration in a cluster. With a single command we can install
|
|
||||||
|
|
||||||
* The Operator itself
|
|
||||||
* The Prometheus [node_exporter](https://github.com/prometheus/node_exporter)
|
|
||||||
* [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics)
|
|
||||||
* The [Prometheus specification](https://github.com/coreos/prometheus-operator/blob/master/Documentation/prometheus.md) based on which the Operator deploys a Prometheus setup
|
|
||||||
* A Prometheus configuration covering monitoring of all Kubernetes core components and exporters
|
|
||||||
* A default set of alerting rules on the cluster component's health
|
|
||||||
* A Grafana instance serving dashboards on cluster metrics
|
|
||||||
* A three node highly available Alertmanager cluster
|
|
||||||
|
|
||||||
Simply run:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
export KUBECONFIG=<path> # defaults to "~/.kube/config"
|
|
||||||
hack/cluster-monitoring/deploy
|
|
||||||
```
|
|
||||||
|
|
||||||
After all pods are ready, you can reach:
|
|
||||||
|
|
||||||
* Prometheus UI on node port `30900`
|
|
||||||
* Alertmanager UI on node port `30903`
|
|
||||||
* Grafana on node port `30902`
|
|
||||||
|
|
||||||
To tear it all down again, run:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
hack/cluster-monitoring/teardown
|
|
||||||
```
|
|
||||||
|
|
||||||
## Monitoring custom services
|
|
||||||
|
|
||||||
The example manifests in [/manifests/examples/example-app](/manifests/examples/example-app)
|
|
||||||
deploy a fake service exposing Prometheus metrics. They additionally define a new Prometheus
|
|
||||||
server and a [`ServiceMonitor`](https://github.com/coreos/prometheus-operator/blob/master/Documentation/service-monitor.md),
|
|
||||||
which specifies how the example service should be monitored.
|
|
||||||
The Prometheus Operator will deploy and configure the desired Prometheus instance and continiously
|
|
||||||
manage its life cycle.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
hack/example-service-monitoring/deploy
|
|
||||||
```
|
|
||||||
|
|
||||||
After all pods are ready you can reach the Prometheus server on node port `30100` and observe
|
|
||||||
how it monitors the service as specified. Same as before, this Prometheus server automatically
|
|
||||||
discovers the Alertmanager cluster deployed in the [Monitoring Kubernetes](#Monitoring-Kubernetes)
|
|
||||||
section.
|
|
||||||
|
|
||||||
Teardown:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
hack/example-service-monitoring/teardown
|
|
||||||
```
|
|
||||||
|
|
||||||
## Dashboarding
|
|
||||||
|
|
||||||
The provided manifests deploy a Grafana instance serving dashboards provided via a ConfigMap.
|
|
||||||
To modify, delete, or add dashboards, the `grafana-dashboards` ConfigMap must be modified.
|
|
||||||
|
|
||||||
Currently, Grafana does not support serving dashboards from static files. Instead, the `grafana-watcher`
|
|
||||||
sidecar container aims to emulate the behavior, by keeping the Grafana database always in sync
|
|
||||||
with the provided ConfigMap. Hence, the Grafana pod is effectively stateless.
|
|
||||||
This allows managing dashboards via `git` etc. and easily deploying them via CD pipelines.
|
|
||||||
|
|
||||||
In the future, a separate Grafana operator will support gathering dashboards from multiple
|
|
||||||
ConfigMaps based on label selection.
|
|
||||||
|
|
||||||
## Roadmap
|
|
||||||
|
|
||||||
* Grafana Operator that dynamically discovers and deploys dashboards from ConfigMaps
|
|
||||||
* KPM/Helm packages to easily provide production-ready cluster-monitoring setup (essentially contents of `hack/cluster-monitoring`)
|
|
||||||
* Add meta-monitoring to default cluster monitoring setup
|
|
||||||
* Build out the provided dashboards and alerts for cluster monitoring to have full coverage of all system aspects
|
|
||||||
|
|
||||||
## Monitoring other Cluster Components
|
|
||||||
|
|
||||||
Discovery of API servers and kubelets works the same across all clusters.
|
|
||||||
Depending on a cluster's setup several other core components, such as etcd or the
|
|
||||||
scheduler, may be deployed in different ways.
|
|
||||||
The easiest integration point is for the cluster operator to provide headless services
|
|
||||||
of all those components to provide a common interface of discovering them. With that
|
|
||||||
setup they will automatically be discovered by the provided Prometheus configuration.
|
|
||||||
|
|
||||||
For the `kube-scheduler` and `kube-controller-manager` there are headless
|
|
||||||
services prepared, simply add them to your running cluster:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
kubectl -n kube-system create manifests/k8s/
|
|
||||||
```
|
|
||||||
|
|
||||||
> Hint: if you use this for a cluster not created with bootkube, make sure you
|
|
||||||
> populate an endpoints object with the address to your `kube-scheduler` and
|
|
||||||
> `kube-controller-manager`, or adapt the label selectors to match your setup.
|
|
||||||
|
|
||||||
Aside from Kubernetes specific components, etcd is an important part of a
|
|
||||||
working cluster, but is typically deployed outside of it. This monitoring
|
|
||||||
setup assumes that it is made visible from within the cluster through a headless
|
|
||||||
service as well.
|
|
||||||
|
|
||||||
> Note that minikube hides some components like etcd so to see the extend of
|
|
||||||
> this setup we recommend setting up a [local cluster using bootkube](https://github.com/kubernetes-incubator/bootkube/tree/master/hack/multi-node).
|
|
||||||
|
|
||||||
An example for bootkube's multi-node vagrant setup is [here](/manifests/etcd/etcd-bootkube-vagrant-multi.yaml).
|
|
||||||
|
|
||||||
> Hint: this is merely an example for a local setup. The addresses will have to
|
|
||||||
> be adapted for a setup, that is not a single etcd bootkube created cluster.
|
|
||||||
|
|
||||||
With that setup the headless services provide endpoint lists consumed by
|
|
||||||
Prometheus to discover the endpoints as targets:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
$ kubectl get endpoints --all-namespaces
|
|
||||||
NAMESPACE NAME ENDPOINTS AGE
|
|
||||||
default kubernetes 172.17.4.101:443 2h
|
|
||||||
kube-system kube-controller-manager-prometheus-discovery 10.2.30.2:10252 1h
|
|
||||||
kube-system kube-scheduler-prometheus-discovery 10.2.30.4:10251 1h
|
|
||||||
monitoring etcd-k8s 172.17.4.51:2379 1h
|
|
||||||
```
|
|
||||||
|
|
||||||
## Other Documentation
|
|
||||||
[Install Docs for a cluster created with KOPS on AWS](docs/KOPSonAWS.md)
|
|
||||||
|
@@ -1,860 +0,0 @@
|
|||||||
{
|
|
||||||
"dashboard":
|
|
||||||
{
|
|
||||||
"__inputs": [
|
|
||||||
{
|
|
||||||
"name": "DS_PROMETHEUS",
|
|
||||||
"label": "prometheus",
|
|
||||||
"description": "",
|
|
||||||
"type": "datasource",
|
|
||||||
"pluginId": "prometheus",
|
|
||||||
"pluginName": "Prometheus"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"__requires": [
|
|
||||||
{
|
|
||||||
"type": "grafana",
|
|
||||||
"id": "grafana",
|
|
||||||
"name": "Grafana",
|
|
||||||
"version": "4.1.1"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"type": "panel",
|
|
||||||
"id": "graph",
|
|
||||||
"name": "Graph",
|
|
||||||
"version": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"type": "datasource",
|
|
||||||
"id": "prometheus",
|
|
||||||
"name": "Prometheus",
|
|
||||||
"version": "1.0.0"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"type": "panel",
|
|
||||||
"id": "singlestat",
|
|
||||||
"name": "Singlestat",
|
|
||||||
"version": ""
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"annotations": {
|
|
||||||
"list": []
|
|
||||||
},
|
|
||||||
"description": "Dashboard to get an overview of one server",
|
|
||||||
"editable": true,
|
|
||||||
"gnetId": 22,
|
|
||||||
"graphTooltip": 0,
|
|
||||||
"hideControls": false,
|
|
||||||
"id": null,
|
|
||||||
"links": [],
|
|
||||||
"refresh": false,
|
|
||||||
"rows": [
|
|
||||||
{
|
|
||||||
"collapse": false,
|
|
||||||
"height": "250px",
|
|
||||||
"panels": [
|
|
||||||
{
|
|
||||||
"alerting": {},
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {},
|
|
||||||
"id": 3,
|
|
||||||
"legend": {
|
|
||||||
"avg": false,
|
|
||||||
"current": false,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": false
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [],
|
|
||||||
"span": 6,
|
|
||||||
"stack": false,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "sum(rate(node_cpu{mode=\"idle\"}[2m])) * 100",
|
|
||||||
"hide": false,
|
|
||||||
"intervalFactor": 10,
|
|
||||||
"legendFormat": "",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 50
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": [],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "Idle cpu",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": false,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "cumulative"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"mode": "time",
|
|
||||||
"name": null,
|
|
||||||
"show": true,
|
|
||||||
"values": []
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "percent",
|
|
||||||
"label": "cpu usage",
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": 0,
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "short",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"alerting": {},
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {},
|
|
||||||
"id": 9,
|
|
||||||
"legend": {
|
|
||||||
"avg": false,
|
|
||||||
"current": false,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": false
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [],
|
|
||||||
"span": 6,
|
|
||||||
"stack": false,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "sum(node_load1)",
|
|
||||||
"intervalFactor": 4,
|
|
||||||
"legendFormat": "load 1m",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 20,
|
|
||||||
"target": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "sum(node_load5)",
|
|
||||||
"intervalFactor": 4,
|
|
||||||
"legendFormat": "load 5m",
|
|
||||||
"refId": "B",
|
|
||||||
"step": 20,
|
|
||||||
"target": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "sum(node_load15)",
|
|
||||||
"intervalFactor": 4,
|
|
||||||
"legendFormat": "load 15m",
|
|
||||||
"refId": "C",
|
|
||||||
"step": 20,
|
|
||||||
"target": ""
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": [],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "System load",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": false,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "cumulative"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"mode": "time",
|
|
||||||
"name": null,
|
|
||||||
"show": true,
|
|
||||||
"values": []
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "percentunit",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "short",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"repeat": null,
|
|
||||||
"repeatIteration": null,
|
|
||||||
"repeatRowId": null,
|
|
||||||
"showTitle": false,
|
|
||||||
"title": "New row",
|
|
||||||
"titleSize": "h6"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"collapse": false,
|
|
||||||
"height": "250px",
|
|
||||||
"panels": [
|
|
||||||
{
|
|
||||||
"alerting": {},
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {},
|
|
||||||
"id": 4,
|
|
||||||
"legend": {
|
|
||||||
"avg": false,
|
|
||||||
"current": false,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": false
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [
|
|
||||||
{
|
|
||||||
"alias": "node_memory_SwapFree{instance=\"172.17.0.1:9100\",job=\"prometheus\"}",
|
|
||||||
"yaxis": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"span": 9,
|
|
||||||
"stack": true,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "sum(node_memory_MemTotal) - sum(node_memory_MemFree) - sum(node_memory_Buffers) - sum(node_memory_Cached)",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "memory usage",
|
|
||||||
"metric": "memo",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 4,
|
|
||||||
"target": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "sum(node_memory_Buffers)",
|
|
||||||
"interval": "",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "memory buffers",
|
|
||||||
"metric": "memo",
|
|
||||||
"refId": "B",
|
|
||||||
"step": 4,
|
|
||||||
"target": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "sum(node_memory_Cached)",
|
|
||||||
"interval": "",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "memory cached",
|
|
||||||
"metric": "memo",
|
|
||||||
"refId": "C",
|
|
||||||
"step": 4,
|
|
||||||
"target": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "sum(node_memory_MemFree)",
|
|
||||||
"interval": "",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "memory free",
|
|
||||||
"metric": "memo",
|
|
||||||
"refId": "D",
|
|
||||||
"step": 4,
|
|
||||||
"target": ""
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": [],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "Memory usage",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": false,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "individual"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"mode": "time",
|
|
||||||
"name": null,
|
|
||||||
"show": true,
|
|
||||||
"values": []
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "bytes",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": "0",
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "short",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cacheTimeout": null,
|
|
||||||
"colorBackground": false,
|
|
||||||
"colorValue": false,
|
|
||||||
"colors": [
|
|
||||||
"rgba(50, 172, 45, 0.97)",
|
|
||||||
"rgba(237, 129, 40, 0.89)",
|
|
||||||
"rgba(245, 54, 54, 0.9)"
|
|
||||||
],
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"format": "percent",
|
|
||||||
"gauge": {
|
|
||||||
"maxValue": 100,
|
|
||||||
"minValue": 0,
|
|
||||||
"show": true,
|
|
||||||
"thresholdLabels": false,
|
|
||||||
"thresholdMarkers": true
|
|
||||||
},
|
|
||||||
"id": 5,
|
|
||||||
"interval": null,
|
|
||||||
"links": [],
|
|
||||||
"mappingType": 1,
|
|
||||||
"mappingTypes": [
|
|
||||||
{
|
|
||||||
"name": "value to text",
|
|
||||||
"value": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"name": "range to text",
|
|
||||||
"value": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"maxDataPoints": 100,
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"nullText": null,
|
|
||||||
"postfix": "",
|
|
||||||
"postfixFontSize": "50%",
|
|
||||||
"prefix": "",
|
|
||||||
"prefixFontSize": "50%",
|
|
||||||
"rangeMaps": [
|
|
||||||
{
|
|
||||||
"from": "null",
|
|
||||||
"text": "N/A",
|
|
||||||
"to": "null"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"span": 3,
|
|
||||||
"sparkline": {
|
|
||||||
"fillColor": "rgba(31, 118, 189, 0.18)",
|
|
||||||
"full": false,
|
|
||||||
"lineColor": "rgb(31, 120, 193)",
|
|
||||||
"show": false
|
|
||||||
},
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "((sum(node_memory_MemTotal) - sum(node_memory_MemFree) - sum(node_memory_Buffers) - sum(node_memory_Cached)) / sum(node_memory_MemTotal)) * 100",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"metric": "",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 60,
|
|
||||||
"target": ""
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": "80, 90",
|
|
||||||
"title": "Memory usage",
|
|
||||||
"type": "singlestat",
|
|
||||||
"valueFontSize": "80%",
|
|
||||||
"valueMaps": [
|
|
||||||
{
|
|
||||||
"op": "=",
|
|
||||||
"text": "N/A",
|
|
||||||
"value": "null"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"valueName": "avg"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"repeat": null,
|
|
||||||
"repeatIteration": null,
|
|
||||||
"repeatRowId": null,
|
|
||||||
"showTitle": false,
|
|
||||||
"title": "New row",
|
|
||||||
"titleSize": "h6"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"collapse": false,
|
|
||||||
"height": "250px",
|
|
||||||
"panels": [
|
|
||||||
{
|
|
||||||
"alerting": {},
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {},
|
|
||||||
"id": 6,
|
|
||||||
"legend": {
|
|
||||||
"avg": false,
|
|
||||||
"current": false,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": false
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [
|
|
||||||
{
|
|
||||||
"alias": "read",
|
|
||||||
"yaxis": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"alias": "{instance=\"172.17.0.1:9100\"}",
|
|
||||||
"yaxis": 2
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"alias": "io time",
|
|
||||||
"yaxis": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"span": 9,
|
|
||||||
"stack": false,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "sum(rate(node_disk_bytes_read[5m]))",
|
|
||||||
"hide": false,
|
|
||||||
"intervalFactor": 4,
|
|
||||||
"legendFormat": "read",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 8,
|
|
||||||
"target": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "sum(rate(node_disk_bytes_written[5m]))",
|
|
||||||
"intervalFactor": 4,
|
|
||||||
"legendFormat": "written",
|
|
||||||
"refId": "B",
|
|
||||||
"step": 8
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "sum(rate(node_disk_io_time_ms[5m]))",
|
|
||||||
"intervalFactor": 4,
|
|
||||||
"legendFormat": "io time",
|
|
||||||
"refId": "C",
|
|
||||||
"step": 8
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": [],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "Disk I/O",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": false,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "cumulative"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"mode": "time",
|
|
||||||
"name": null,
|
|
||||||
"show": true,
|
|
||||||
"values": []
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "bytes",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "ms",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cacheTimeout": null,
|
|
||||||
"colorBackground": false,
|
|
||||||
"colorValue": false,
|
|
||||||
"colors": [
|
|
||||||
"rgba(50, 172, 45, 0.97)",
|
|
||||||
"rgba(237, 129, 40, 0.89)",
|
|
||||||
"rgba(245, 54, 54, 0.9)"
|
|
||||||
],
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"format": "percentunit",
|
|
||||||
"gauge": {
|
|
||||||
"maxValue": 1,
|
|
||||||
"minValue": 0,
|
|
||||||
"show": true,
|
|
||||||
"thresholdLabels": false,
|
|
||||||
"thresholdMarkers": true
|
|
||||||
},
|
|
||||||
"id": 7,
|
|
||||||
"interval": null,
|
|
||||||
"links": [],
|
|
||||||
"mappingType": 1,
|
|
||||||
"mappingTypes": [
|
|
||||||
{
|
|
||||||
"name": "value to text",
|
|
||||||
"value": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"name": "range to text",
|
|
||||||
"value": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"maxDataPoints": 100,
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"nullText": null,
|
|
||||||
"postfix": "",
|
|
||||||
"postfixFontSize": "50%",
|
|
||||||
"prefix": "",
|
|
||||||
"prefixFontSize": "50%",
|
|
||||||
"rangeMaps": [
|
|
||||||
{
|
|
||||||
"from": "null",
|
|
||||||
"text": "N/A",
|
|
||||||
"to": "null"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"span": 3,
|
|
||||||
"sparkline": {
|
|
||||||
"fillColor": "rgba(31, 118, 189, 0.18)",
|
|
||||||
"full": false,
|
|
||||||
"lineColor": "rgb(31, 120, 193)",
|
|
||||||
"show": false
|
|
||||||
},
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "(sum(node_filesystem_size{device!=\"rootfs\"}) - sum(node_filesystem_free{device!=\"rootfs\"})) / sum(node_filesystem_size{device!=\"rootfs\"})",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"refId": "A",
|
|
||||||
"step": 60,
|
|
||||||
"target": ""
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": "0.75, 0.9",
|
|
||||||
"title": "Disk space usage",
|
|
||||||
"type": "singlestat",
|
|
||||||
"valueFontSize": "80%",
|
|
||||||
"valueMaps": [
|
|
||||||
{
|
|
||||||
"op": "=",
|
|
||||||
"text": "N/A",
|
|
||||||
"value": "null"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"valueName": "current"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"repeat": null,
|
|
||||||
"repeatIteration": null,
|
|
||||||
"repeatRowId": null,
|
|
||||||
"showTitle": false,
|
|
||||||
"title": "New row",
|
|
||||||
"titleSize": "h6"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"collapse": false,
|
|
||||||
"height": "250px",
|
|
||||||
"panels": [
|
|
||||||
{
|
|
||||||
"alerting": {},
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {},
|
|
||||||
"id": 8,
|
|
||||||
"legend": {
|
|
||||||
"avg": false,
|
|
||||||
"current": false,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": false
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [
|
|
||||||
{
|
|
||||||
"alias": "transmitted ",
|
|
||||||
"yaxis": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"span": 6,
|
|
||||||
"stack": false,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "sum(rate(node_network_receive_bytes{device!~\"lo\"}[5m]))",
|
|
||||||
"hide": false,
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 10,
|
|
||||||
"target": ""
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": [],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "Network received",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": false,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "cumulative"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"mode": "time",
|
|
||||||
"name": null,
|
|
||||||
"show": true,
|
|
||||||
"values": []
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "bytes",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "bytes",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"alerting": {},
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {},
|
|
||||||
"id": 10,
|
|
||||||
"legend": {
|
|
||||||
"avg": false,
|
|
||||||
"current": false,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": false
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [
|
|
||||||
{
|
|
||||||
"alias": "transmitted ",
|
|
||||||
"yaxis": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"span": 6,
|
|
||||||
"stack": false,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "sum(rate(node_network_transmit_bytes{device!~\"lo\"}[5m]))",
|
|
||||||
"hide": false,
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "",
|
|
||||||
"refId": "B",
|
|
||||||
"step": 10,
|
|
||||||
"target": ""
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": [],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "Network transmitted",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": false,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "cumulative"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"mode": "time",
|
|
||||||
"name": null,
|
|
||||||
"show": true,
|
|
||||||
"values": []
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "bytes",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "bytes",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"repeat": null,
|
|
||||||
"repeatIteration": null,
|
|
||||||
"repeatRowId": null,
|
|
||||||
"showTitle": false,
|
|
||||||
"title": "New row",
|
|
||||||
"titleSize": "h6"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"schemaVersion": 14,
|
|
||||||
"style": "dark",
|
|
||||||
"tags": [
|
|
||||||
"prometheus"
|
|
||||||
],
|
|
||||||
"templating": {
|
|
||||||
"list": []
|
|
||||||
},
|
|
||||||
"time": {
|
|
||||||
"from": "now-1h",
|
|
||||||
"to": "now"
|
|
||||||
},
|
|
||||||
"timepicker": {
|
|
||||||
"refresh_intervals": [
|
|
||||||
"5s",
|
|
||||||
"10s",
|
|
||||||
"30s",
|
|
||||||
"1m",
|
|
||||||
"5m",
|
|
||||||
"15m",
|
|
||||||
"30m",
|
|
||||||
"1h",
|
|
||||||
"2h",
|
|
||||||
"1d"
|
|
||||||
],
|
|
||||||
"time_options": [
|
|
||||||
"5m",
|
|
||||||
"15m",
|
|
||||||
"1h",
|
|
||||||
"6h",
|
|
||||||
"12h",
|
|
||||||
"24h",
|
|
||||||
"2d",
|
|
||||||
"7d",
|
|
||||||
"30d"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"timezone": "browser",
|
|
||||||
"title": "All Nodes",
|
|
||||||
"version": 1
|
|
||||||
},
|
|
||||||
"inputs": [
|
|
||||||
{
|
|
||||||
"name": "DS_PROMETHEUS",
|
|
||||||
"pluginId": "prometheus",
|
|
||||||
"type": "datasource",
|
|
||||||
"value": "prometheus"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"overwrite": true
|
|
||||||
}
|
|
@@ -1,817 +0,0 @@
|
|||||||
{
|
|
||||||
"dashboard": {
|
|
||||||
"__inputs": [
|
|
||||||
{
|
|
||||||
"name": "DS_PROMETHEUS",
|
|
||||||
"label": "prometheus",
|
|
||||||
"description": "",
|
|
||||||
"type": "datasource",
|
|
||||||
"pluginId": "prometheus",
|
|
||||||
"pluginName": "Prometheus"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"__requires": [
|
|
||||||
{
|
|
||||||
"type": "panel",
|
|
||||||
"id": "singlestat",
|
|
||||||
"name": "Singlestat",
|
|
||||||
"version": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"type": "panel",
|
|
||||||
"id": "graph",
|
|
||||||
"name": "Graph",
|
|
||||||
"version": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"type": "grafana",
|
|
||||||
"id": "grafana",
|
|
||||||
"name": "Grafana",
|
|
||||||
"version": "3.1.1"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"type": "datasource",
|
|
||||||
"id": "prometheus",
|
|
||||||
"name": "Prometheus",
|
|
||||||
"version": "1.0.0"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"id": null,
|
|
||||||
"title": "Deployment",
|
|
||||||
"tags": [],
|
|
||||||
"style": "dark",
|
|
||||||
"timezone": "browser",
|
|
||||||
"editable": true,
|
|
||||||
"hideControls": false,
|
|
||||||
"sharedCrosshair": true,
|
|
||||||
"rows": [
|
|
||||||
{
|
|
||||||
"collapse": false,
|
|
||||||
"editable": true,
|
|
||||||
"height": "200px",
|
|
||||||
"panels": [
|
|
||||||
{
|
|
||||||
"title": "CPU",
|
|
||||||
"error": false,
|
|
||||||
"span": 4,
|
|
||||||
"editable": true,
|
|
||||||
"type": "singlestat",
|
|
||||||
"isNew": true,
|
|
||||||
"id": 8,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"refId": "A",
|
|
||||||
"expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}[3m])) ",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"step": 600
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"links": [],
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"maxDataPoints": 100,
|
|
||||||
"interval": null,
|
|
||||||
"cacheTimeout": null,
|
|
||||||
"format": "none",
|
|
||||||
"prefix": "",
|
|
||||||
"postfix": "cores",
|
|
||||||
"nullText": null,
|
|
||||||
"valueMaps": [
|
|
||||||
{
|
|
||||||
"value": "null",
|
|
||||||
"op": "=",
|
|
||||||
"text": "N/A"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"mappingTypes": [
|
|
||||||
{
|
|
||||||
"name": "value to text",
|
|
||||||
"value": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"name": "range to text",
|
|
||||||
"value": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"rangeMaps": [
|
|
||||||
{
|
|
||||||
"from": "null",
|
|
||||||
"to": "null",
|
|
||||||
"text": "N/A"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"mappingType": 1,
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"valueName": "avg",
|
|
||||||
"prefixFontSize": "50%",
|
|
||||||
"valueFontSize": "110%",
|
|
||||||
"postfixFontSize": "50%",
|
|
||||||
"thresholds": "",
|
|
||||||
"colorBackground": false,
|
|
||||||
"colorValue": false,
|
|
||||||
"colors": [
|
|
||||||
"rgba(245, 54, 54, 0.9)",
|
|
||||||
"rgba(237, 129, 40, 0.89)",
|
|
||||||
"rgba(50, 172, 45, 0.97)"
|
|
||||||
],
|
|
||||||
"sparkline": {
|
|
||||||
"show": true,
|
|
||||||
"full": false,
|
|
||||||
"lineColor": "rgb(31, 120, 193)",
|
|
||||||
"fillColor": "rgba(31, 118, 189, 0.18)"
|
|
||||||
},
|
|
||||||
"gauge": {
|
|
||||||
"show": false,
|
|
||||||
"minValue": 0,
|
|
||||||
"maxValue": 100,
|
|
||||||
"thresholdMarkers": true,
|
|
||||||
"thresholdLabels": false
|
|
||||||
}
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"title": "Memory",
|
|
||||||
"error": false,
|
|
||||||
"span": 4,
|
|
||||||
"editable": true,
|
|
||||||
"type": "singlestat",
|
|
||||||
"isNew": true,
|
|
||||||
"id": 9,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"refId": "A",
|
|
||||||
"expr": "sum(container_memory_usage_bytes{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}) / 1024^3",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"step": 600
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"links": [],
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"maxDataPoints": 100,
|
|
||||||
"interval": null,
|
|
||||||
"cacheTimeout": null,
|
|
||||||
"format": "none",
|
|
||||||
"prefix": "",
|
|
||||||
"postfix": "GB",
|
|
||||||
"nullText": null,
|
|
||||||
"valueMaps": [
|
|
||||||
{
|
|
||||||
"value": "null",
|
|
||||||
"op": "=",
|
|
||||||
"text": "N/A"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"mappingTypes": [
|
|
||||||
{
|
|
||||||
"name": "value to text",
|
|
||||||
"value": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"name": "range to text",
|
|
||||||
"value": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"rangeMaps": [
|
|
||||||
{
|
|
||||||
"from": "null",
|
|
||||||
"to": "null",
|
|
||||||
"text": "N/A"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"mappingType": 1,
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"valueName": "avg",
|
|
||||||
"prefixFontSize": "80%",
|
|
||||||
"valueFontSize": "110%",
|
|
||||||
"postfixFontSize": "50%",
|
|
||||||
"thresholds": "",
|
|
||||||
"colorBackground": false,
|
|
||||||
"colorValue": false,
|
|
||||||
"colors": [
|
|
||||||
"rgba(245, 54, 54, 0.9)",
|
|
||||||
"rgba(237, 129, 40, 0.89)",
|
|
||||||
"rgba(50, 172, 45, 0.97)"
|
|
||||||
],
|
|
||||||
"sparkline": {
|
|
||||||
"show": true,
|
|
||||||
"full": false,
|
|
||||||
"lineColor": "rgb(31, 120, 193)",
|
|
||||||
"fillColor": "rgba(31, 118, 189, 0.18)"
|
|
||||||
},
|
|
||||||
"gauge": {
|
|
||||||
"show": false,
|
|
||||||
"minValue": 0,
|
|
||||||
"maxValue": 100,
|
|
||||||
"thresholdMarkers": true,
|
|
||||||
"thresholdLabels": false
|
|
||||||
}
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"title": "Network",
|
|
||||||
"error": false,
|
|
||||||
"span": 4,
|
|
||||||
"editable": true,
|
|
||||||
"type": "singlestat",
|
|
||||||
"isNew": true,
|
|
||||||
"id": 7,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"refId": "A",
|
|
||||||
"expr": "sum(rate(container_network_transmit_bytes_total{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}[3m])) + sum(rate(container_network_receive_bytes_total{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}[3m])) ",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"step": 600
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"links": [],
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"maxDataPoints": 100,
|
|
||||||
"interval": null,
|
|
||||||
"cacheTimeout": null,
|
|
||||||
"format": "Bps",
|
|
||||||
"prefix": "",
|
|
||||||
"postfix": "",
|
|
||||||
"nullText": null,
|
|
||||||
"valueMaps": [
|
|
||||||
{
|
|
||||||
"value": "null",
|
|
||||||
"op": "=",
|
|
||||||
"text": "N/A"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"mappingTypes": [
|
|
||||||
{
|
|
||||||
"name": "value to text",
|
|
||||||
"value": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"name": "range to text",
|
|
||||||
"value": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"rangeMaps": [
|
|
||||||
{
|
|
||||||
"from": "null",
|
|
||||||
"to": "null",
|
|
||||||
"text": "N/A"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"mappingType": 1,
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"valueName": "avg",
|
|
||||||
"prefixFontSize": "50%",
|
|
||||||
"valueFontSize": "80%",
|
|
||||||
"postfixFontSize": "50%",
|
|
||||||
"thresholds": "",
|
|
||||||
"colorBackground": false,
|
|
||||||
"colorValue": false,
|
|
||||||
"colors": [
|
|
||||||
"rgba(245, 54, 54, 0.9)",
|
|
||||||
"rgba(237, 129, 40, 0.89)",
|
|
||||||
"rgba(50, 172, 45, 0.97)"
|
|
||||||
],
|
|
||||||
"sparkline": {
|
|
||||||
"show": true,
|
|
||||||
"full": false,
|
|
||||||
"lineColor": "rgb(31, 120, 193)",
|
|
||||||
"fillColor": "rgba(31, 118, 189, 0.18)"
|
|
||||||
},
|
|
||||||
"gauge": {
|
|
||||||
"show": false,
|
|
||||||
"minValue": 0,
|
|
||||||
"maxValue": 100,
|
|
||||||
"thresholdMarkers": false,
|
|
||||||
"thresholdLabels": false
|
|
||||||
}
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"title": "Row",
|
|
||||||
"showTitle": false
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"title": "New row",
|
|
||||||
"height": "100px",
|
|
||||||
"editable": true,
|
|
||||||
"collapse": false,
|
|
||||||
"panels": [
|
|
||||||
{
|
|
||||||
"title": "Desired Replicas",
|
|
||||||
"error": false,
|
|
||||||
"span": 3,
|
|
||||||
"editable": true,
|
|
||||||
"type": "singlestat",
|
|
||||||
"isNew": true,
|
|
||||||
"id": 5,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"refId": "A",
|
|
||||||
"expr": "kube_deployment_spec_replicas{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"step": 600,
|
|
||||||
"metric": "kube_deployment_spec_replicas"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"links": [],
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"maxDataPoints": 100,
|
|
||||||
"interval": null,
|
|
||||||
"cacheTimeout": null,
|
|
||||||
"format": "none",
|
|
||||||
"prefix": "",
|
|
||||||
"postfix": "",
|
|
||||||
"nullText": null,
|
|
||||||
"valueMaps": [
|
|
||||||
{
|
|
||||||
"value": "null",
|
|
||||||
"op": "=",
|
|
||||||
"text": "N/A"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"mappingTypes": [
|
|
||||||
{
|
|
||||||
"name": "value to text",
|
|
||||||
"value": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"name": "range to text",
|
|
||||||
"value": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"rangeMaps": [
|
|
||||||
{
|
|
||||||
"from": "null",
|
|
||||||
"to": "null",
|
|
||||||
"text": "N/A"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"mappingType": 1,
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"valueName": "avg",
|
|
||||||
"prefixFontSize": "50%",
|
|
||||||
"valueFontSize": "80%",
|
|
||||||
"postfixFontSize": "50%",
|
|
||||||
"thresholds": "",
|
|
||||||
"colorBackground": false,
|
|
||||||
"colorValue": false,
|
|
||||||
"colors": [
|
|
||||||
"rgba(245, 54, 54, 0.9)",
|
|
||||||
"rgba(237, 129, 40, 0.89)",
|
|
||||||
"rgba(50, 172, 45, 0.97)"
|
|
||||||
],
|
|
||||||
"sparkline": {
|
|
||||||
"show": false,
|
|
||||||
"full": false,
|
|
||||||
"lineColor": "rgb(31, 120, 193)",
|
|
||||||
"fillColor": "rgba(31, 118, 189, 0.18)"
|
|
||||||
},
|
|
||||||
"gauge": {
|
|
||||||
"show": false,
|
|
||||||
"minValue": 0,
|
|
||||||
"maxValue": 100,
|
|
||||||
"thresholdMarkers": false,
|
|
||||||
"thresholdLabels": false
|
|
||||||
},
|
|
||||||
"decimals": null
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"title": "Available Replicas",
|
|
||||||
"error": false,
|
|
||||||
"span": 3,
|
|
||||||
"editable": true,
|
|
||||||
"type": "singlestat",
|
|
||||||
"isNew": true,
|
|
||||||
"id": 6,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"refId": "A",
|
|
||||||
"expr": "kube_deployment_status_replicas_available{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"step": 600
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"links": [],
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"maxDataPoints": 100,
|
|
||||||
"interval": null,
|
|
||||||
"cacheTimeout": null,
|
|
||||||
"format": "none",
|
|
||||||
"prefix": "",
|
|
||||||
"postfix": "",
|
|
||||||
"nullText": null,
|
|
||||||
"valueMaps": [
|
|
||||||
{
|
|
||||||
"value": "null",
|
|
||||||
"op": "=",
|
|
||||||
"text": "N/A"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"mappingTypes": [
|
|
||||||
{
|
|
||||||
"name": "value to text",
|
|
||||||
"value": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"name": "range to text",
|
|
||||||
"value": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"rangeMaps": [
|
|
||||||
{
|
|
||||||
"from": "null",
|
|
||||||
"to": "null",
|
|
||||||
"text": "N/A"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"mappingType": 1,
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"valueName": "avg",
|
|
||||||
"prefixFontSize": "50%",
|
|
||||||
"valueFontSize": "80%",
|
|
||||||
"postfixFontSize": "50%",
|
|
||||||
"thresholds": "",
|
|
||||||
"colorBackground": false,
|
|
||||||
"colorValue": false,
|
|
||||||
"colors": [
|
|
||||||
"rgba(245, 54, 54, 0.9)",
|
|
||||||
"rgba(237, 129, 40, 0.89)",
|
|
||||||
"rgba(50, 172, 45, 0.97)"
|
|
||||||
],
|
|
||||||
"sparkline": {
|
|
||||||
"show": false,
|
|
||||||
"full": false,
|
|
||||||
"lineColor": "rgb(31, 120, 193)",
|
|
||||||
"fillColor": "rgba(31, 118, 189, 0.18)"
|
|
||||||
},
|
|
||||||
"gauge": {
|
|
||||||
"show": false,
|
|
||||||
"minValue": 0,
|
|
||||||
"maxValue": 100,
|
|
||||||
"thresholdMarkers": true,
|
|
||||||
"thresholdLabels": false
|
|
||||||
}
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cacheTimeout": null,
|
|
||||||
"colorBackground": false,
|
|
||||||
"colorValue": false,
|
|
||||||
"colors": [
|
|
||||||
"rgba(245, 54, 54, 0.9)",
|
|
||||||
"rgba(237, 129, 40, 0.89)",
|
|
||||||
"rgba(50, 172, 45, 0.97)"
|
|
||||||
],
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"format": "none",
|
|
||||||
"gauge": {
|
|
||||||
"maxValue": 100,
|
|
||||||
"minValue": 0,
|
|
||||||
"show": false,
|
|
||||||
"thresholdLabels": false,
|
|
||||||
"thresholdMarkers": true
|
|
||||||
},
|
|
||||||
"id": 3,
|
|
||||||
"interval": null,
|
|
||||||
"isNew": true,
|
|
||||||
"links": [],
|
|
||||||
"mappingType": 1,
|
|
||||||
"mappingTypes": [
|
|
||||||
{
|
|
||||||
"name": "value to text",
|
|
||||||
"value": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"name": "range to text",
|
|
||||||
"value": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"maxDataPoints": 100,
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"nullText": null,
|
|
||||||
"postfix": "",
|
|
||||||
"postfixFontSize": "50%",
|
|
||||||
"prefix": "",
|
|
||||||
"prefixFontSize": "50%",
|
|
||||||
"rangeMaps": [
|
|
||||||
{
|
|
||||||
"from": "null",
|
|
||||||
"text": "N/A",
|
|
||||||
"to": "null"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"span": 3,
|
|
||||||
"sparkline": {
|
|
||||||
"fillColor": "rgba(31, 118, 189, 0.18)",
|
|
||||||
"full": false,
|
|
||||||
"lineColor": "rgb(31, 120, 193)",
|
|
||||||
"show": false
|
|
||||||
},
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "kube_deployment_status_observed_generation{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 600
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": "",
|
|
||||||
"title": "Observed Generation",
|
|
||||||
"type": "singlestat",
|
|
||||||
"valueFontSize": "80%",
|
|
||||||
"valueMaps": [
|
|
||||||
{
|
|
||||||
"op": "=",
|
|
||||||
"text": "N/A",
|
|
||||||
"value": "null"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"valueName": "avg"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cacheTimeout": null,
|
|
||||||
"colorBackground": false,
|
|
||||||
"colorValue": false,
|
|
||||||
"colors": [
|
|
||||||
"rgba(245, 54, 54, 0.9)",
|
|
||||||
"rgba(237, 129, 40, 0.89)",
|
|
||||||
"rgba(50, 172, 45, 0.97)"
|
|
||||||
],
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"format": "none",
|
|
||||||
"gauge": {
|
|
||||||
"maxValue": 100,
|
|
||||||
"minValue": 0,
|
|
||||||
"show": false,
|
|
||||||
"thresholdLabels": false,
|
|
||||||
"thresholdMarkers": true
|
|
||||||
},
|
|
||||||
"id": 2,
|
|
||||||
"interval": null,
|
|
||||||
"isNew": true,
|
|
||||||
"links": [],
|
|
||||||
"mappingType": 1,
|
|
||||||
"mappingTypes": [
|
|
||||||
{
|
|
||||||
"name": "value to text",
|
|
||||||
"value": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"name": "range to text",
|
|
||||||
"value": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"maxDataPoints": 100,
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"nullText": null,
|
|
||||||
"postfix": "",
|
|
||||||
"postfixFontSize": "50%",
|
|
||||||
"prefix": "",
|
|
||||||
"prefixFontSize": "50%",
|
|
||||||
"rangeMaps": [
|
|
||||||
{
|
|
||||||
"from": "null",
|
|
||||||
"text": "N/A",
|
|
||||||
"to": "null"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"span": 3,
|
|
||||||
"sparkline": {
|
|
||||||
"fillColor": "rgba(31, 118, 189, 0.18)",
|
|
||||||
"full": false,
|
|
||||||
"lineColor": "rgb(31, 120, 193)",
|
|
||||||
"show": false
|
|
||||||
},
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "kube_deployment_metadata_generation{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 600
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": "",
|
|
||||||
"title": "Metadata Generation",
|
|
||||||
"type": "singlestat",
|
|
||||||
"valueFontSize": "80%",
|
|
||||||
"valueMaps": [
|
|
||||||
{
|
|
||||||
"op": "=",
|
|
||||||
"text": "N/A",
|
|
||||||
"value": "null"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"valueName": "avg"
|
|
||||||
}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"collapse": false,
|
|
||||||
"editable": true,
|
|
||||||
"height": "350px",
|
|
||||||
"panels": [
|
|
||||||
{
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {
|
|
||||||
"threshold1": null,
|
|
||||||
"threshold1Color": "rgba(216, 200, 27, 0.27)",
|
|
||||||
"threshold2": null,
|
|
||||||
"threshold2Color": "rgba(234, 112, 112, 0.22)"
|
|
||||||
},
|
|
||||||
"id": 1,
|
|
||||||
"isNew": true,
|
|
||||||
"legend": {
|
|
||||||
"avg": false,
|
|
||||||
"current": false,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": false,
|
|
||||||
"hideZero": false
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [],
|
|
||||||
"span": 12,
|
|
||||||
"stack": false,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "kube_deployment_status_replicas{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "current replicas",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 30
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "kube_deployment_status_replicas_available{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "available",
|
|
||||||
"refId": "B",
|
|
||||||
"step": 30
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "kube_deployment_status_replicas_unavailable{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "unavailable",
|
|
||||||
"refId": "C",
|
|
||||||
"step": 30
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "kube_deployment_status_replicas_updated{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "updated",
|
|
||||||
"refId": "D",
|
|
||||||
"step": 30
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "kube_deployment_spec_replicas{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "desired",
|
|
||||||
"refId": "E",
|
|
||||||
"step": 30
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": [],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "Replicas",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": true,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "cumulative"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"mode": "time",
|
|
||||||
"name": null,
|
|
||||||
"show": true,
|
|
||||||
"values": []
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "none",
|
|
||||||
"label": "",
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "short",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": false
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"transparent": false
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"title": "New row",
|
|
||||||
"showTitle": false
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"time": {
|
|
||||||
"from": "now-6h",
|
|
||||||
"to": "now"
|
|
||||||
},
|
|
||||||
"timepicker": {
|
|
||||||
"refresh_intervals": [
|
|
||||||
"5s",
|
|
||||||
"10s",
|
|
||||||
"30s",
|
|
||||||
"1m",
|
|
||||||
"5m",
|
|
||||||
"15m",
|
|
||||||
"30m",
|
|
||||||
"1h",
|
|
||||||
"2h",
|
|
||||||
"1d"
|
|
||||||
],
|
|
||||||
"time_options": [
|
|
||||||
"5m",
|
|
||||||
"15m",
|
|
||||||
"1h",
|
|
||||||
"6h",
|
|
||||||
"12h",
|
|
||||||
"24h",
|
|
||||||
"2d",
|
|
||||||
"7d",
|
|
||||||
"30d"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"templating": {
|
|
||||||
"list": [
|
|
||||||
{
|
|
||||||
"allValue": ".*",
|
|
||||||
"current": {},
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"hide": 0,
|
|
||||||
"includeAll": false,
|
|
||||||
"label": "Namespace",
|
|
||||||
"multi": false,
|
|
||||||
"name": "deployment_namespace",
|
|
||||||
"options": [],
|
|
||||||
"query": "label_values(kube_deployment_metadata_generation, namespace)",
|
|
||||||
"refresh": 1,
|
|
||||||
"regex": "",
|
|
||||||
"sort": 0,
|
|
||||||
"tagValuesQuery": null,
|
|
||||||
"tagsQuery": "",
|
|
||||||
"type": "query",
|
|
||||||
"useTags": false
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"allValue": null,
|
|
||||||
"current": {},
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"hide": 0,
|
|
||||||
"includeAll": false,
|
|
||||||
"label": "Deployment",
|
|
||||||
"multi": false,
|
|
||||||
"name": "deployment_name",
|
|
||||||
"options": [],
|
|
||||||
"query": "label_values(kube_deployment_metadata_generation{namespace=\"$deployment_namespace\"}, deployment)",
|
|
||||||
"refresh": 1,
|
|
||||||
"regex": "",
|
|
||||||
"sort": 0,
|
|
||||||
"tagValuesQuery": "",
|
|
||||||
"tagsQuery": "deployment",
|
|
||||||
"type": "query",
|
|
||||||
"useTags": false
|
|
||||||
}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"annotations": {
|
|
||||||
"list": []
|
|
||||||
},
|
|
||||||
"schemaVersion": 12,
|
|
||||||
"version": 2,
|
|
||||||
"links": [],
|
|
||||||
"gnetId": null
|
|
||||||
},
|
|
||||||
"inputs": [
|
|
||||||
{
|
|
||||||
"name": "DS_PROMETHEUS",
|
|
||||||
"pluginId": "prometheus",
|
|
||||||
"type": "datasource",
|
|
||||||
"value": "prometheus"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"overwrite": true
|
|
||||||
}
|
|
@@ -1,409 +0,0 @@
|
|||||||
{
|
|
||||||
"dashboard": {
|
|
||||||
"__inputs": [
|
|
||||||
{
|
|
||||||
"description": "",
|
|
||||||
"label": "prometheus",
|
|
||||||
"name": "DS_PROMETHEUS",
|
|
||||||
"pluginId": "prometheus",
|
|
||||||
"pluginName": "Prometheus",
|
|
||||||
"type": "datasource"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"__requires": [
|
|
||||||
{
|
|
||||||
"id": "graph",
|
|
||||||
"name": "Graph",
|
|
||||||
"type": "panel",
|
|
||||||
"version": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"id": "grafana",
|
|
||||||
"name": "Grafana",
|
|
||||||
"type": "grafana",
|
|
||||||
"version": "3.1.1"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"id": "prometheus",
|
|
||||||
"name": "Prometheus",
|
|
||||||
"type": "datasource",
|
|
||||||
"version": "1.0.0"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"annotations": {
|
|
||||||
"list": []
|
|
||||||
},
|
|
||||||
"editable": true,
|
|
||||||
"gnetId": null,
|
|
||||||
"hideControls": false,
|
|
||||||
"id": null,
|
|
||||||
"links": [],
|
|
||||||
"rows": [
|
|
||||||
{
|
|
||||||
"collapse": false,
|
|
||||||
"editable": true,
|
|
||||||
"height": "250px",
|
|
||||||
"panels": [
|
|
||||||
{
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {
|
|
||||||
"threshold1": null,
|
|
||||||
"threshold1Color": "rgba(216, 200, 27, 0.27)",
|
|
||||||
"threshold2": null,
|
|
||||||
"threshold2Color": "rgba(234, 112, 112, 0.22)"
|
|
||||||
},
|
|
||||||
"id": 1,
|
|
||||||
"isNew": true,
|
|
||||||
"legend": {
|
|
||||||
"alignAsTable": true,
|
|
||||||
"avg": true,
|
|
||||||
"current": true,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"rightSide": true,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": true
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [],
|
|
||||||
"span": 12,
|
|
||||||
"stack": false,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "sum by(container_name) (container_memory_usage_bytes{pod_name=\"$pod\", container_name=~\"$container\", container_name!=\"POD\"})",
|
|
||||||
"interval": "10s",
|
|
||||||
"intervalFactor": 1,
|
|
||||||
"legendFormat": "Current: {{ container_name }}",
|
|
||||||
"metric": "container_memory_usage_bytes",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 10
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "kube_pod_container_requested_memory_bytes{pod=\"$pod\", container=~\"$container\"}",
|
|
||||||
"interval": "10s",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "Requested: {{ container }}",
|
|
||||||
"metric": "kube_pod_container_requested_memory_bytes",
|
|
||||||
"refId": "B",
|
|
||||||
"step": 20
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "Memory Usage",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": true,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "cumulative"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "bytes",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "short",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"title": "Row"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"collapse": false,
|
|
||||||
"editable": true,
|
|
||||||
"height": "250px",
|
|
||||||
"panels": [
|
|
||||||
{
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {
|
|
||||||
"threshold1": null,
|
|
||||||
"threshold1Color": "rgba(216, 200, 27, 0.27)",
|
|
||||||
"threshold2": null,
|
|
||||||
"threshold2Color": "rgba(234, 112, 112, 0.22)"
|
|
||||||
},
|
|
||||||
"id": 2,
|
|
||||||
"isNew": true,
|
|
||||||
"legend": {
|
|
||||||
"alignAsTable": true,
|
|
||||||
"avg": true,
|
|
||||||
"current": true,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"rightSide": true,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": true
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [],
|
|
||||||
"span": 12,
|
|
||||||
"stack": false,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "sum by (container_name)( rate(container_cpu_usage_seconds_total{image!=\"\",container_name!=\"POD\",pod_name=\"$pod\"}[1m] ) )",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "{{ container_name }}",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 30
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "CPU Usage",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": true,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "cumulative"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "short",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "short",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"title": "New row"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"collapse": false,
|
|
||||||
"editable": true,
|
|
||||||
"height": "250px",
|
|
||||||
"panels": [
|
|
||||||
{
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {
|
|
||||||
"threshold1": null,
|
|
||||||
"threshold1Color": "rgba(216, 200, 27, 0.27)",
|
|
||||||
"threshold2": null,
|
|
||||||
"threshold2Color": "rgba(234, 112, 112, 0.22)"
|
|
||||||
},
|
|
||||||
"id": 3,
|
|
||||||
"isNew": true,
|
|
||||||
"legend": {
|
|
||||||
"alignAsTable": true,
|
|
||||||
"avg": true,
|
|
||||||
"current": true,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"rightSide": true,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": true
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [],
|
|
||||||
"span": 12,
|
|
||||||
"stack": false,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "sort_desc(sum by (pod_name) (rate (container_network_receive_bytes_total{pod_name=\"$pod\"}[1m]) ))",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "{{ pod_name }}",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 30
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "Network I/O",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": true,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "cumulative"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "bytes",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "short",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"title": "New row"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"schemaVersion": 12,
|
|
||||||
"sharedCrosshair": true,
|
|
||||||
"style": "dark",
|
|
||||||
"tags": [],
|
|
||||||
"templating": {
|
|
||||||
"list": [
|
|
||||||
{
|
|
||||||
"allValue": ".*",
|
|
||||||
"current": {},
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"hide": 0,
|
|
||||||
"includeAll": true,
|
|
||||||
"label": "Namespace",
|
|
||||||
"multi": false,
|
|
||||||
"name": "namespace",
|
|
||||||
"options": [],
|
|
||||||
"query": "label_values(kube_pod_info, namespace)",
|
|
||||||
"refresh": 1,
|
|
||||||
"regex": "",
|
|
||||||
"type": "query"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"current": {},
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"hide": 0,
|
|
||||||
"includeAll": false,
|
|
||||||
"label": "Pod",
|
|
||||||
"multi": false,
|
|
||||||
"name": "pod",
|
|
||||||
"options": [],
|
|
||||||
"query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, pod)",
|
|
||||||
"refresh": 1,
|
|
||||||
"regex": "",
|
|
||||||
"type": "query"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"allValue": ".*",
|
|
||||||
"current": {},
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"hide": 0,
|
|
||||||
"includeAll": true,
|
|
||||||
"label": "Container",
|
|
||||||
"multi": false,
|
|
||||||
"name": "container",
|
|
||||||
"options": [],
|
|
||||||
"query": "label_values(kube_pod_container_info{namespace=\"$namespace\", pod=\"$pod\"}, container)",
|
|
||||||
"refresh": 1,
|
|
||||||
"regex": "",
|
|
||||||
"type": "query"
|
|
||||||
}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"time": {
|
|
||||||
"from": "now-6h",
|
|
||||||
"to": "now"
|
|
||||||
},
|
|
||||||
"timepicker": {
|
|
||||||
"refresh_intervals": [
|
|
||||||
"5s",
|
|
||||||
"10s",
|
|
||||||
"30s",
|
|
||||||
"1m",
|
|
||||||
"5m",
|
|
||||||
"15m",
|
|
||||||
"30m",
|
|
||||||
"1h",
|
|
||||||
"2h",
|
|
||||||
"1d"
|
|
||||||
],
|
|
||||||
"time_options": [
|
|
||||||
"5m",
|
|
||||||
"15m",
|
|
||||||
"1h",
|
|
||||||
"6h",
|
|
||||||
"12h",
|
|
||||||
"24h",
|
|
||||||
"2d",
|
|
||||||
"7d",
|
|
||||||
"30d"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"timezone": "browser",
|
|
||||||
"title": "Pods",
|
|
||||||
"version": 26
|
|
||||||
},
|
|
||||||
"inputs": [
|
|
||||||
{
|
|
||||||
"name": "DS_PROMETHEUS",
|
|
||||||
"pluginId": "prometheus",
|
|
||||||
"type": "datasource",
|
|
||||||
"value": "prometheus"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"overwrite": true
|
|
||||||
}
|
|
@@ -1,880 +0,0 @@
|
|||||||
{
|
|
||||||
"dashboard":
|
|
||||||
{
|
|
||||||
"__inputs": [
|
|
||||||
{
|
|
||||||
"name": "DS_PROMETHEUS",
|
|
||||||
"label": "prometheus",
|
|
||||||
"description": "",
|
|
||||||
"type": "datasource",
|
|
||||||
"pluginId": "prometheus",
|
|
||||||
"pluginName": "Prometheus"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"__requires": [
|
|
||||||
{
|
|
||||||
"type": "grafana",
|
|
||||||
"id": "grafana",
|
|
||||||
"name": "Grafana",
|
|
||||||
"version": "4.1.1"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"type": "panel",
|
|
||||||
"id": "graph",
|
|
||||||
"name": "Graph",
|
|
||||||
"version": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"type": "datasource",
|
|
||||||
"id": "prometheus",
|
|
||||||
"name": "Prometheus",
|
|
||||||
"version": "1.0.0"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"type": "panel",
|
|
||||||
"id": "singlestat",
|
|
||||||
"name": "Singlestat",
|
|
||||||
"version": ""
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"annotations": {
|
|
||||||
"list": []
|
|
||||||
},
|
|
||||||
"description": "Dashboard to get an overview of one server",
|
|
||||||
"editable": true,
|
|
||||||
"gnetId": 22,
|
|
||||||
"graphTooltip": 0,
|
|
||||||
"hideControls": false,
|
|
||||||
"id": null,
|
|
||||||
"links": [],
|
|
||||||
"refresh": false,
|
|
||||||
"rows": [
|
|
||||||
{
|
|
||||||
"collapse": false,
|
|
||||||
"height": "250px",
|
|
||||||
"panels": [
|
|
||||||
{
|
|
||||||
"alerting": {},
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {},
|
|
||||||
"id": 3,
|
|
||||||
"legend": {
|
|
||||||
"avg": false,
|
|
||||||
"current": false,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": false
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [],
|
|
||||||
"span": 6,
|
|
||||||
"stack": false,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "100 - (avg by (cpu) (irate(node_cpu{mode=\"idle\", instance=\"$server\"}[5m])) * 100)",
|
|
||||||
"hide": false,
|
|
||||||
"intervalFactor": 10,
|
|
||||||
"legendFormat": "{{cpu}}",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 50
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": [],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "Idle cpu",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": false,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "cumulative"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"mode": "time",
|
|
||||||
"name": null,
|
|
||||||
"show": true,
|
|
||||||
"values": []
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "percent",
|
|
||||||
"label": "cpu usage",
|
|
||||||
"logBase": 1,
|
|
||||||
"max": 100,
|
|
||||||
"min": 0,
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "short",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"alerting": {},
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {},
|
|
||||||
"id": 9,
|
|
||||||
"legend": {
|
|
||||||
"avg": false,
|
|
||||||
"current": false,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": false
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [],
|
|
||||||
"span": 6,
|
|
||||||
"stack": false,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "node_load1{instance=\"$server\"}",
|
|
||||||
"intervalFactor": 4,
|
|
||||||
"legendFormat": "load 1m",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 20,
|
|
||||||
"target": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "node_load5{instance=\"$server\"}",
|
|
||||||
"intervalFactor": 4,
|
|
||||||
"legendFormat": "load 5m",
|
|
||||||
"refId": "B",
|
|
||||||
"step": 20,
|
|
||||||
"target": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "node_load15{instance=\"$server\"}",
|
|
||||||
"intervalFactor": 4,
|
|
||||||
"legendFormat": "load 15m",
|
|
||||||
"refId": "C",
|
|
||||||
"step": 20,
|
|
||||||
"target": ""
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": [],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "System load",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": false,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "cumulative"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"mode": "time",
|
|
||||||
"name": null,
|
|
||||||
"show": true,
|
|
||||||
"values": []
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "percentunit",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "short",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"repeat": null,
|
|
||||||
"repeatIteration": null,
|
|
||||||
"repeatRowId": null,
|
|
||||||
"showTitle": false,
|
|
||||||
"title": "New row",
|
|
||||||
"titleSize": "h6"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"collapse": false,
|
|
||||||
"height": "250px",
|
|
||||||
"panels": [
|
|
||||||
{
|
|
||||||
"alerting": {},
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {},
|
|
||||||
"id": 4,
|
|
||||||
"legend": {
|
|
||||||
"alignAsTable": false,
|
|
||||||
"avg": false,
|
|
||||||
"current": false,
|
|
||||||
"hideEmpty": false,
|
|
||||||
"hideZero": false,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"rightSide": false,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": false
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [
|
|
||||||
{
|
|
||||||
"alias": "node_memory_SwapFree{instance=\"172.17.0.1:9100\",job=\"prometheus\"}",
|
|
||||||
"yaxis": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"span": 9,
|
|
||||||
"stack": true,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "node_memory_MemTotal{instance=\"$server\"} - node_memory_MemFree{instance=\"$server\"} - node_memory_Buffers{instance=\"$server\"} - node_memory_Cached{instance=\"$server\"}",
|
|
||||||
"hide": false,
|
|
||||||
"interval": "",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "memory used",
|
|
||||||
"metric": "",
|
|
||||||
"refId": "C",
|
|
||||||
"step": 4
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "node_memory_Buffers{instance=\"$server\"}",
|
|
||||||
"interval": "",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "memory buffers",
|
|
||||||
"metric": "",
|
|
||||||
"refId": "E",
|
|
||||||
"step": 4
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "node_memory_Cached{instance=\"$server\"}",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "memory cached",
|
|
||||||
"metric": "",
|
|
||||||
"refId": "F",
|
|
||||||
"step": 4
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "node_memory_MemFree{instance=\"$server\"}",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "memory free",
|
|
||||||
"metric": "",
|
|
||||||
"refId": "D",
|
|
||||||
"step": 4
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": [],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "Memory usage",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": false,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "individual"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"mode": "time",
|
|
||||||
"name": null,
|
|
||||||
"show": true,
|
|
||||||
"values": []
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "bytes",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": "0",
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "short",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cacheTimeout": null,
|
|
||||||
"colorBackground": false,
|
|
||||||
"colorValue": false,
|
|
||||||
"colors": [
|
|
||||||
"rgba(50, 172, 45, 0.97)",
|
|
||||||
"rgba(237, 129, 40, 0.89)",
|
|
||||||
"rgba(245, 54, 54, 0.9)"
|
|
||||||
],
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"format": "percent",
|
|
||||||
"gauge": {
|
|
||||||
"maxValue": 100,
|
|
||||||
"minValue": 0,
|
|
||||||
"show": true,
|
|
||||||
"thresholdLabels": false,
|
|
||||||
"thresholdMarkers": true
|
|
||||||
},
|
|
||||||
"id": 5,
|
|
||||||
"interval": null,
|
|
||||||
"links": [],
|
|
||||||
"mappingType": 1,
|
|
||||||
"mappingTypes": [
|
|
||||||
{
|
|
||||||
"name": "value to text",
|
|
||||||
"value": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"name": "range to text",
|
|
||||||
"value": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"maxDataPoints": 100,
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"nullText": null,
|
|
||||||
"postfix": "",
|
|
||||||
"postfixFontSize": "50%",
|
|
||||||
"prefix": "",
|
|
||||||
"prefixFontSize": "50%",
|
|
||||||
"rangeMaps": [
|
|
||||||
{
|
|
||||||
"from": "null",
|
|
||||||
"text": "N/A",
|
|
||||||
"to": "null"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"span": 3,
|
|
||||||
"sparkline": {
|
|
||||||
"fillColor": "rgba(31, 118, 189, 0.18)",
|
|
||||||
"full": false,
|
|
||||||
"lineColor": "rgb(31, 120, 193)",
|
|
||||||
"show": false
|
|
||||||
},
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "((node_memory_MemTotal{instance=\"$server\"} - node_memory_MemFree{instance=\"$server\"} - node_memory_Buffers{instance=\"$server\"} - node_memory_Cached{instance=\"$server\"}) / node_memory_MemTotal{instance=\"$server\"}) * 100",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"refId": "A",
|
|
||||||
"step": 60,
|
|
||||||
"target": ""
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": "80, 90",
|
|
||||||
"title": "Memory usage",
|
|
||||||
"type": "singlestat",
|
|
||||||
"valueFontSize": "80%",
|
|
||||||
"valueMaps": [
|
|
||||||
{
|
|
||||||
"op": "=",
|
|
||||||
"text": "N/A",
|
|
||||||
"value": "null"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"valueName": "avg"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"repeat": null,
|
|
||||||
"repeatIteration": null,
|
|
||||||
"repeatRowId": null,
|
|
||||||
"showTitle": false,
|
|
||||||
"title": "New row",
|
|
||||||
"titleSize": "h6"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"collapse": false,
|
|
||||||
"height": "250px",
|
|
||||||
"panels": [
|
|
||||||
{
|
|
||||||
"alerting": {},
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {},
|
|
||||||
"id": 6,
|
|
||||||
"legend": {
|
|
||||||
"avg": false,
|
|
||||||
"current": false,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": false
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [
|
|
||||||
{
|
|
||||||
"alias": "read",
|
|
||||||
"yaxis": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"alias": "{instance=\"172.17.0.1:9100\"}",
|
|
||||||
"yaxis": 2
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"alias": "io time",
|
|
||||||
"yaxis": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"span": 9,
|
|
||||||
"stack": false,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "sum by (instance) (rate(node_disk_bytes_read{instance=\"$server\"}[2m]))",
|
|
||||||
"hide": false,
|
|
||||||
"intervalFactor": 4,
|
|
||||||
"legendFormat": "read",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 8,
|
|
||||||
"target": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "sum by (instance) (rate(node_disk_bytes_written{instance=\"$server\"}[2m]))",
|
|
||||||
"intervalFactor": 4,
|
|
||||||
"legendFormat": "written",
|
|
||||||
"refId": "B",
|
|
||||||
"step": 8
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"expr": "sum by (instance) (rate(node_disk_io_time_ms{instance=\"$server\"}[2m]))",
|
|
||||||
"intervalFactor": 4,
|
|
||||||
"legendFormat": "io time",
|
|
||||||
"refId": "C",
|
|
||||||
"step": 8
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": [],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "Disk I/O",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": false,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "cumulative"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"mode": "time",
|
|
||||||
"name": null,
|
|
||||||
"show": true,
|
|
||||||
"values": []
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "bytes",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "ms",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cacheTimeout": null,
|
|
||||||
"colorBackground": false,
|
|
||||||
"colorValue": false,
|
|
||||||
"colors": [
|
|
||||||
"rgba(50, 172, 45, 0.97)",
|
|
||||||
"rgba(237, 129, 40, 0.89)",
|
|
||||||
"rgba(245, 54, 54, 0.9)"
|
|
||||||
],
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"format": "percentunit",
|
|
||||||
"gauge": {
|
|
||||||
"maxValue": 1,
|
|
||||||
"minValue": 0,
|
|
||||||
"show": true,
|
|
||||||
"thresholdLabels": false,
|
|
||||||
"thresholdMarkers": true
|
|
||||||
},
|
|
||||||
"id": 7,
|
|
||||||
"interval": null,
|
|
||||||
"links": [],
|
|
||||||
"mappingType": 1,
|
|
||||||
"mappingTypes": [
|
|
||||||
{
|
|
||||||
"name": "value to text",
|
|
||||||
"value": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"name": "range to text",
|
|
||||||
"value": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"maxDataPoints": 100,
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"nullText": null,
|
|
||||||
"postfix": "",
|
|
||||||
"postfixFontSize": "50%",
|
|
||||||
"prefix": "",
|
|
||||||
"prefixFontSize": "50%",
|
|
||||||
"rangeMaps": [
|
|
||||||
{
|
|
||||||
"from": "null",
|
|
||||||
"text": "N/A",
|
|
||||||
"to": "null"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"span": 3,
|
|
||||||
"sparkline": {
|
|
||||||
"fillColor": "rgba(31, 118, 189, 0.18)",
|
|
||||||
"full": false,
|
|
||||||
"lineColor": "rgb(31, 120, 193)",
|
|
||||||
"show": false
|
|
||||||
},
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "(sum(node_filesystem_size{device!=\"rootfs\",instance=\"$server\"}) - sum(node_filesystem_free{device!=\"rootfs\",instance=\"$server\"})) / sum(node_filesystem_size{device!=\"rootfs\",instance=\"$server\"})",
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"refId": "A",
|
|
||||||
"step": 60,
|
|
||||||
"target": ""
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": "0.75, 0.9",
|
|
||||||
"title": "Disk space usage",
|
|
||||||
"type": "singlestat",
|
|
||||||
"valueFontSize": "80%",
|
|
||||||
"valueMaps": [
|
|
||||||
{
|
|
||||||
"op": "=",
|
|
||||||
"text": "N/A",
|
|
||||||
"value": "null"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"valueName": "current"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"repeat": null,
|
|
||||||
"repeatIteration": null,
|
|
||||||
"repeatRowId": null,
|
|
||||||
"showTitle": false,
|
|
||||||
"title": "New row",
|
|
||||||
"titleSize": "h6"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"collapse": false,
|
|
||||||
"height": "250px",
|
|
||||||
"panels": [
|
|
||||||
{
|
|
||||||
"alerting": {},
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {},
|
|
||||||
"id": 8,
|
|
||||||
"legend": {
|
|
||||||
"avg": false,
|
|
||||||
"current": false,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": false
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [
|
|
||||||
{
|
|
||||||
"alias": "transmitted ",
|
|
||||||
"yaxis": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"span": 6,
|
|
||||||
"stack": false,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "rate(node_network_receive_bytes{instance=\"$server\",device!~\"lo\"}[5m])",
|
|
||||||
"hide": false,
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "{{device}}",
|
|
||||||
"refId": "A",
|
|
||||||
"step": 10,
|
|
||||||
"target": ""
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": [],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "Network received",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": false,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "cumulative"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"mode": "time",
|
|
||||||
"name": null,
|
|
||||||
"show": true,
|
|
||||||
"values": []
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "bytes",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "bytes",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"alerting": {},
|
|
||||||
"aliasColors": {},
|
|
||||||
"bars": false,
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"editable": true,
|
|
||||||
"error": false,
|
|
||||||
"fill": 1,
|
|
||||||
"grid": {},
|
|
||||||
"id": 10,
|
|
||||||
"legend": {
|
|
||||||
"avg": false,
|
|
||||||
"current": false,
|
|
||||||
"max": false,
|
|
||||||
"min": false,
|
|
||||||
"show": true,
|
|
||||||
"total": false,
|
|
||||||
"values": false
|
|
||||||
},
|
|
||||||
"lines": true,
|
|
||||||
"linewidth": 2,
|
|
||||||
"links": [],
|
|
||||||
"nullPointMode": "connected",
|
|
||||||
"percentage": false,
|
|
||||||
"pointradius": 5,
|
|
||||||
"points": false,
|
|
||||||
"renderer": "flot",
|
|
||||||
"seriesOverrides": [
|
|
||||||
{
|
|
||||||
"alias": "transmitted ",
|
|
||||||
"yaxis": 2
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"span": 6,
|
|
||||||
"stack": false,
|
|
||||||
"steppedLine": false,
|
|
||||||
"targets": [
|
|
||||||
{
|
|
||||||
"expr": "rate(node_network_transmit_bytes{instance=\"$server\",device!~\"lo\"}[5m])",
|
|
||||||
"hide": false,
|
|
||||||
"intervalFactor": 2,
|
|
||||||
"legendFormat": "{{device}}",
|
|
||||||
"refId": "B",
|
|
||||||
"step": 10,
|
|
||||||
"target": ""
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"thresholds": [],
|
|
||||||
"timeFrom": null,
|
|
||||||
"timeShift": null,
|
|
||||||
"title": "Network transmitted",
|
|
||||||
"tooltip": {
|
|
||||||
"msResolution": false,
|
|
||||||
"shared": true,
|
|
||||||
"sort": 0,
|
|
||||||
"value_type": "cumulative"
|
|
||||||
},
|
|
||||||
"type": "graph",
|
|
||||||
"xaxis": {
|
|
||||||
"mode": "time",
|
|
||||||
"name": null,
|
|
||||||
"show": true,
|
|
||||||
"values": []
|
|
||||||
},
|
|
||||||
"yaxes": [
|
|
||||||
{
|
|
||||||
"format": "bytes",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"format": "bytes",
|
|
||||||
"label": null,
|
|
||||||
"logBase": 1,
|
|
||||||
"max": null,
|
|
||||||
"min": null,
|
|
||||||
"show": true
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"repeat": null,
|
|
||||||
"repeatIteration": null,
|
|
||||||
"repeatRowId": null,
|
|
||||||
"showTitle": false,
|
|
||||||
"title": "New row",
|
|
||||||
"titleSize": "h6"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"schemaVersion": 14,
|
|
||||||
"style": "dark",
|
|
||||||
"tags": [
|
|
||||||
"prometheus"
|
|
||||||
],
|
|
||||||
"templating": {
|
|
||||||
"list": [
|
|
||||||
{
|
|
||||||
"allValue": null,
|
|
||||||
"current": {},
|
|
||||||
"datasource": "${DS_PROMETHEUS}",
|
|
||||||
"hide": 0,
|
|
||||||
"includeAll": false,
|
|
||||||
"label": null,
|
|
||||||
"multi": false,
|
|
||||||
"name": "server",
|
|
||||||
"options": [],
|
|
||||||
"query": "label_values(node_boot_time, instance)",
|
|
||||||
"refresh": 1,
|
|
||||||
"regex": "",
|
|
||||||
"sort": 0,
|
|
||||||
"tagValuesQuery": "",
|
|
||||||
"tags": [],
|
|
||||||
"tagsQuery": "",
|
|
||||||
"type": "query",
|
|
||||||
"useTags": false
|
|
||||||
}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"time": {
|
|
||||||
"from": "now-1h",
|
|
||||||
"to": "now"
|
|
||||||
},
|
|
||||||
"timepicker": {
|
|
||||||
"refresh_intervals": [
|
|
||||||
"5s",
|
|
||||||
"10s",
|
|
||||||
"30s",
|
|
||||||
"1m",
|
|
||||||
"5m",
|
|
||||||
"15m",
|
|
||||||
"30m",
|
|
||||||
"1h",
|
|
||||||
"2h",
|
|
||||||
"1d"
|
|
||||||
],
|
|
||||||
"time_options": [
|
|
||||||
"5m",
|
|
||||||
"15m",
|
|
||||||
"1h",
|
|
||||||
"6h",
|
|
||||||
"12h",
|
|
||||||
"24h",
|
|
||||||
"2d",
|
|
||||||
"7d",
|
|
||||||
"30d"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"timezone": "browser",
|
|
||||||
"title": "Nodes",
|
|
||||||
"version": 1
|
|
||||||
},
|
|
||||||
"inputs": [
|
|
||||||
{
|
|
||||||
"name": "DS_PROMETHEUS",
|
|
||||||
"pluginId": "prometheus",
|
|
||||||
"type": "datasource",
|
|
||||||
"value": "prometheus"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"overwrite": true
|
|
||||||
}
|
|
@@ -1,7 +0,0 @@
|
|||||||
{
|
|
||||||
"access": "proxy",
|
|
||||||
"basicAuth": false,
|
|
||||||
"name": "prometheus",
|
|
||||||
"type": "prometheus",
|
|
||||||
"url": "http://prometheus-k8s.monitoring.svc:9090"
|
|
||||||
}
|
|
@@ -1,121 +0,0 @@
|
|||||||
### General cluster availability ###
|
|
||||||
|
|
||||||
# alert if another failed peer will result in an unavailable cluster
|
|
||||||
ALERT InsufficientPeers
|
|
||||||
IF count(up{job="etcd-k8s"} == 0) > (count(up{job="etcd-k8s"}) / 2 - 1)
|
|
||||||
FOR 3m
|
|
||||||
LABELS {
|
|
||||||
severity = "critical"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Etcd cluster small",
|
|
||||||
description = "If one more etcd peer goes down the cluster will be unavailable",
|
|
||||||
}
|
|
||||||
|
|
||||||
### HTTP requests alerts ###
|
|
||||||
|
|
||||||
# alert if more than 1% of requests to an HTTP endpoint have failed with a non 4xx response
|
|
||||||
ALERT HighNumberOfFailedHTTPRequests
|
|
||||||
IF sum by(method) (rate(etcd_http_failed_total{job="etcd-k8s", code!~"4[0-9]{2}"}[5m]))
|
|
||||||
/ sum by(method) (rate(etcd_http_received_total{job="etcd-k8s"}[5m])) > 0.01
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "a high number of HTTP requests are failing",
|
|
||||||
description = "{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}",
|
|
||||||
}
|
|
||||||
|
|
||||||
# alert if more than 5% of requests to an HTTP endpoint have failed with a non 4xx response
|
|
||||||
ALERT HighNumberOfFailedHTTPRequests
|
|
||||||
IF sum by(method) (rate(etcd_http_failed_total{job="etcd-k8s", code!~"4[0-9]{2}"}[5m]))
|
|
||||||
/ sum by(method) (rate(etcd_http_received_total{job="etcd-k8s"}[5m])) > 0.05
|
|
||||||
FOR 5m
|
|
||||||
LABELS {
|
|
||||||
severity = "critical"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "a high number of HTTP requests are failing",
|
|
||||||
description = "{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}",
|
|
||||||
}
|
|
||||||
|
|
||||||
# alert if 50% of requests get a 4xx response
|
|
||||||
ALERT HighNumberOfFailedHTTPRequests
|
|
||||||
IF sum by(method) (rate(etcd_http_failed_total{job="etcd-k8s", code=~"4[0-9]{2}"}[5m]))
|
|
||||||
/ sum by(method) (rate(etcd_http_received_total{job="etcd-k8s"}[5m])) > 0.5
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
severity = "critical"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "a high number of HTTP requests are failing",
|
|
||||||
description = "{{ $value }}% of requests for {{ $labels.method }} failed with 4xx responses on etcd instance {{ $labels.instance }}",
|
|
||||||
}
|
|
||||||
|
|
||||||
# alert if the 99th percentile of HTTP requests take more than 150ms
|
|
||||||
ALERT HTTPRequestsSlow
|
|
||||||
IF histogram_quantile(0.99, rate(etcd_http_successful_duration_second_bucket[5m])) > 0.15
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "slow HTTP requests",
|
|
||||||
description = "on ectd instance {{ $labels.instance }} HTTP requests to {{ $label.method }} are slow",
|
|
||||||
}
|
|
||||||
|
|
||||||
### File descriptor alerts ###
|
|
||||||
|
|
||||||
instance:fd_utilization = process_open_fds / process_max_fds
|
|
||||||
|
|
||||||
# alert if file descriptors are likely to exhaust within the next 4 hours
|
|
||||||
ALERT FdExhaustionClose
|
|
||||||
IF predict_linear(instance:fd_utilization[1h], 3600 * 4) > 1
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "file descriptors soon exhausted",
|
|
||||||
description = "{{ $labels.job }} instance {{ $labels.instance }} will exhaust in file descriptors soon",
|
|
||||||
}
|
|
||||||
|
|
||||||
# alert if file descriptors are likely to exhaust within the next hour
|
|
||||||
ALERT FdExhaustionClose
|
|
||||||
IF predict_linear(instance:fd_utilization[10m], 3600) > 1
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
severity = "critical"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "file descriptors soon exhausted",
|
|
||||||
description = "{{ $labels.job }} instance {{ $labels.instance }} will exhaust in file descriptors soon",
|
|
||||||
}
|
|
||||||
|
|
||||||
### etcd proposal alerts ###
|
|
||||||
|
|
||||||
# alert if there are several failed proposals within an hour
|
|
||||||
ALERT HighNumberOfFailedProposals
|
|
||||||
IF increase(etcd_server_proposal_failed_total{job="etcd"}[1h]) > 5
|
|
||||||
LABELS {
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "a high number of failed proposals within the etcd cluster are happening",
|
|
||||||
description = "etcd instance {{ $labels.instance }} has seen {{ $value }} proposal failures within the last hour",
|
|
||||||
}
|
|
||||||
|
|
||||||
### etcd disk io latency alerts ###
|
|
||||||
|
|
||||||
# alert if 99th percentile of fsync durations is higher than 500ms
|
|
||||||
ALERT HighFsyncDurations
|
|
||||||
IF histogram_quantile(0.99, rate(etcd_wal_fsync_durations_seconds_bucket[5m])) > 0.5
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "high fsync durations",
|
|
||||||
description = "ectd instance {{ $labels.instance }} fync durations are high",
|
|
||||||
}
|
|
@@ -1,388 +0,0 @@
|
|||||||
# NOTE: These rules were kindly contributed by the SoundCloud engineering team.
|
|
||||||
|
|
||||||
### Container resources ###
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:spec_memory_limit_bytes =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
container_spec_memory_limit_bytes{container_name!=""},
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:spec_cpu_shares =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
container_spec_cpu_shares{container_name!=""},
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:cpu_usage:rate =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
irate(
|
|
||||||
container_cpu_usage_seconds_total{container_name!=""}[5m]
|
|
||||||
),
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:memory_usage:bytes =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
container_memory_usage_bytes{container_name!=""},
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:memory_working_set:bytes =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
container_memory_working_set_bytes{container_name!=""},
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:memory_rss:bytes =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
container_memory_rss{container_name!=""},
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:memory_cache:bytes =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
container_memory_cache{container_name!=""},
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:disk_usage:bytes =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
container_disk_usage_bytes{container_name!=""},
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:memory_pagefaults:rate =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name,scope,type) (
|
|
||||||
label_replace(
|
|
||||||
irate(
|
|
||||||
container_memory_failures_total{container_name!=""}[5m]
|
|
||||||
),
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:memory_oom:rate =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name,scope,type) (
|
|
||||||
label_replace(
|
|
||||||
irate(
|
|
||||||
container_memory_failcnt{container_name!=""}[5m]
|
|
||||||
),
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
### Cluster resources ###
|
|
||||||
|
|
||||||
cluster:memory_allocation:percent =
|
|
||||||
100 * sum by (cluster) (
|
|
||||||
container_spec_memory_limit_bytes{pod_name!=""}
|
|
||||||
) / sum by (cluster) (
|
|
||||||
machine_memory_bytes
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster:memory_used:percent =
|
|
||||||
100 * sum by (cluster) (
|
|
||||||
container_memory_usage_bytes{pod_name!=""}
|
|
||||||
) / sum by (cluster) (
|
|
||||||
machine_memory_bytes
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster:cpu_allocation:percent =
|
|
||||||
100 * sum by (cluster) (
|
|
||||||
container_spec_cpu_shares{pod_name!=""}
|
|
||||||
) / sum by (cluster) (
|
|
||||||
container_spec_cpu_shares{id="/"} * on(cluster,instance) machine_cpu_cores
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster:node_cpu_use:percent =
|
|
||||||
100 * sum by (cluster) (
|
|
||||||
rate(node_cpu{mode!="idle"}[5m])
|
|
||||||
) / sum by (cluster) (
|
|
||||||
machine_cpu_cores
|
|
||||||
)
|
|
||||||
|
|
||||||
### API latency ###
|
|
||||||
|
|
||||||
# Raw metrics are in microseconds. Convert to seconds.
|
|
||||||
cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.99"} =
|
|
||||||
histogram_quantile(
|
|
||||||
0.99,
|
|
||||||
sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket)
|
|
||||||
) / 1e6
|
|
||||||
cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.9"} =
|
|
||||||
histogram_quantile(
|
|
||||||
0.9,
|
|
||||||
sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket)
|
|
||||||
) / 1e6
|
|
||||||
cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.5"} =
|
|
||||||
histogram_quantile(
|
|
||||||
0.5,
|
|
||||||
sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket)
|
|
||||||
) / 1e6
|
|
||||||
|
|
||||||
### Scheduling latency ###
|
|
||||||
|
|
||||||
cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.99"} =
|
|
||||||
histogram_quantile(0.99,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6
|
|
||||||
cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.9"} =
|
|
||||||
histogram_quantile(0.9,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6
|
|
||||||
cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.5"} =
|
|
||||||
histogram_quantile(0.5,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6
|
|
||||||
|
|
||||||
cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.99"} =
|
|
||||||
histogram_quantile(0.99,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6
|
|
||||||
cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.9"} =
|
|
||||||
histogram_quantile(0.9,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6
|
|
||||||
cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.5"} =
|
|
||||||
histogram_quantile(0.5,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6
|
|
||||||
|
|
||||||
cluster:scheduler_binding_latency:quantile_seconds{quantile="0.99"} =
|
|
||||||
histogram_quantile(0.99,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6
|
|
||||||
cluster:scheduler_binding_latency:quantile_seconds{quantile="0.9"} =
|
|
||||||
histogram_quantile(0.9,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6
|
|
||||||
cluster:scheduler_binding_latency:quantile_seconds{quantile="0.5"} =
|
|
||||||
histogram_quantile(0.5,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6
|
|
||||||
|
|
||||||
ALERT K8SNodeDown
|
|
||||||
IF up{job="kubelet"} == 0
|
|
||||||
FOR 1h
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Kubelet cannot be scraped",
|
|
||||||
description = "Prometheus could not scrape a {{ $labels.job }} for more than one hour",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SNodeNotReady
|
|
||||||
IF kube_node_status_ready{condition="true"} == 0
|
|
||||||
FOR 1h
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning",
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Node status is NotReady",
|
|
||||||
description = "The Kubelet on {{ $labels.node }} has not checked in with the API, or has set itself to NotReady, for more than an hour",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SManyNodesNotReady
|
|
||||||
IF
|
|
||||||
count by (cluster) (kube_node_status_ready{condition="true"} == 0) > 1
|
|
||||||
AND
|
|
||||||
(
|
|
||||||
count by (cluster) (kube_node_status_ready{condition="true"} == 0)
|
|
||||||
/
|
|
||||||
count by (cluster) (kube_node_status_ready{condition="true"})
|
|
||||||
) > 0.2
|
|
||||||
FOR 1m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "critical",
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Many K8s nodes are Not Ready",
|
|
||||||
description = "{{ $value }} K8s nodes (more than 10% of cluster {{ $labels.cluster }}) are in the NotReady state.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SKubeletNodeExporterDown
|
|
||||||
IF up{job="node-exporter"} == 0
|
|
||||||
FOR 15m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Kubelet node_exporter cannot be scraped",
|
|
||||||
description = "Prometheus could not scrape a {{ $labels.job }} for more than one hour.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SKubeletDown
|
|
||||||
IF absent(up{job="kubelet"}) or count by (cluster) (up{job="kubelet"} == 0) / count by (cluster) (up{job="kubelet"}) > 0.1
|
|
||||||
FOR 1h
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "critical"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Many Kubelets cannot be scraped",
|
|
||||||
description = "Prometheus failed to scrape more than 10% of kubelets, or all Kubelets have disappeared from service discovery.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SApiserverDown
|
|
||||||
IF up{job="kubernetes"} == 0
|
|
||||||
FOR 15m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "API server unreachable",
|
|
||||||
description = "An API server could not be scraped.",
|
|
||||||
}
|
|
||||||
|
|
||||||
# Disable for non HA kubernetes setups.
|
|
||||||
ALERT K8SApiserverDown
|
|
||||||
IF absent({job="kubernetes"}) or (count by(cluster) (up{job="kubernetes"} == 1) < count by(cluster) (up{job="kubernetes"}))
|
|
||||||
FOR 5m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "critical"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "API server unreachable",
|
|
||||||
description = "Prometheus failed to scrape multiple API servers, or all API servers have disappeared from service discovery.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SSchedulerDown
|
|
||||||
IF absent(up{job="kube-scheduler"}) or (count by(cluster) (up{job="kube-scheduler"} == 1) == 0)
|
|
||||||
FOR 5m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "critical",
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Scheduler is down",
|
|
||||||
description = "There is no running K8S scheduler. New pods are not being assigned to nodes.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SControllerManagerDown
|
|
||||||
IF absent(up{job="kube-controller-manager"}) or (count by(cluster) (up{job="kube-controller-manager"} == 1) == 0)
|
|
||||||
FOR 5m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "critical",
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Controller manager is down",
|
|
||||||
description = "There is no running K8S controller manager. Deployments and replication controllers are not making progress.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SConntrackTableFull
|
|
||||||
IF 100*node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 50
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Number of tracked connections is near the limit",
|
|
||||||
description = "The nf_conntrack table is {{ $value }}% full.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SConntrackTableFull
|
|
||||||
IF 100*node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 90
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "critical"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Number of tracked connections is near the limit",
|
|
||||||
description = "The nf_conntrack table is {{ $value }}% full.",
|
|
||||||
}
|
|
||||||
|
|
||||||
# To catch the conntrack sysctl de-tuning when it happens
|
|
||||||
ALERT K8SConntrackTuningMissing
|
|
||||||
IF node_nf_conntrack_udp_timeout > 10
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning",
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Node does not have the correct conntrack tunings",
|
|
||||||
description = "Nodes keep un-setting the correct tunings, investigate when it happens.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8STooManyOpenFiles
|
|
||||||
IF 100*process_open_fds{job=~"kubelet|kubernetes"} / process_max_fds > 50
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "{{ $labels.job }} has too many open file descriptors",
|
|
||||||
description = "{{ $labels.node }} is using {{ $value }}% of the available file/socket descriptors.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8STooManyOpenFiles
|
|
||||||
IF 100*process_open_fds{job=~"kubelet|kubernetes"} / process_max_fds > 80
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "critical"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "{{ $labels.job }} has too many open file descriptors",
|
|
||||||
description = "{{ $labels.node }} is using {{ $value }}% of the available file/socket descriptors.",
|
|
||||||
}
|
|
||||||
|
|
||||||
# Some verbs excluded because they are expected to be long-lasting:
|
|
||||||
# WATCHLIST is long-poll, CONNECT is `kubectl exec`.
|
|
||||||
ALERT K8SApiServerLatency
|
|
||||||
IF histogram_quantile(
|
|
||||||
0.99,
|
|
||||||
sum without (instance,node,resource) (apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH"})
|
|
||||||
) / 1e6 > 1.0
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Kubernetes apiserver latency is high",
|
|
||||||
description = "99th percentile Latency for {{ $labels.verb }} requests to the kube-apiserver is higher than 1s.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SApiServerEtcdAccessLatency
|
|
||||||
IF etcd_request_latencies_summary{quantile="0.99"} / 1e6 > 1.0
|
|
||||||
FOR 15m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Access to etcd is slow",
|
|
||||||
description = "99th percentile latency for apiserver to access etcd is higher than 1s.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SKubeletTooManyPods
|
|
||||||
IF kubelet_running_pod_count > 100
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning",
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Kubelet is close to pod limit",
|
|
||||||
description = "Kubelet {{$labels.instance}} is running {{$value}} pods, close to the limit of 110",
|
|
||||||
}
|
|
||||||
|
|
@@ -1,35 +0,0 @@
|
|||||||
# Adding kube-prometheus to [KOPS](https://github.com/kubernetes/kops) on AWS 1.5.x
|
|
||||||
|
|
||||||
|
|
||||||
## Prerequisites
|
|
||||||
|
|
||||||
A running Kubernetes cluster created with [KOPS](https://github.com/kubernetes/kops).
|
|
||||||
|
|
||||||
These instructions have currently been tested with **topology=public** on AWS with KOPS 1.5.1 and Kubernetes 1.5.x
|
|
||||||
|
|
||||||
## Open AWS Security Groups:
|
|
||||||
1. Open port 9100 on the masters security group to the nodes security group
|
|
||||||
1. Open ports 10250-10252 on the masters security group to the nodes security group.
|
|
||||||
|
|
||||||
Example script below requires $AWS\_DEFAULT_PROFILE and [$NAME](https://github.com/kubernetes/kops/blob/master/docs/aws.md#prepare-local-environment)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
MASTER_SG=$(aws --profile ${AWS_DEFAULT_PROFILE} ec2 describe-security-groups --filters "Name=tag:Name,Values=masters.$NAME" --query "SecurityGroups[*].GroupId[]" --output=text)
|
|
||||||
NODES_SG=$(aws --profile ${AWS_DEFAULT_PROFILE} ec2 describe-security-groups --filters "Name=tag:Name,Values=nodes.$NAME" --query "SecurityGroups[*].GroupId[]" --output=text)
|
|
||||||
aws --profile ${AWS_DEFAULT_PROFILE} ec2 authorize-security-group-ingress --group-id $MASTER_SG --protocol tcp --port 9100 --source-group $NODES_SG
|
|
||||||
aws --profile ${AWS_DEFAULT_PROFILE} ec2 authorize-security-group-ingress --group-id $MASTER_SG --protocol tcp --port 10250-10252 --source-group $NODES_SG
|
|
||||||
```
|
|
||||||
|
|
||||||
## Adding kube-prometheus
|
|
||||||
Following the instructions in the [README](https://github.com/coreos/kube-prometheus/blob/master/README.md):
|
|
||||||
|
|
||||||
Example:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git clone -b master https://github.com/coreos/kube-prometheus.git kube-prometheus-temp;
|
|
||||||
cd kube-prometheus-temp
|
|
||||||
./hack/cluster-monitoring/deploy
|
|
||||||
kubectl -n kube-system create -f manifests/k8s/self-hosted/
|
|
||||||
cd -
|
|
||||||
rm -rf kube-prometheus-temp
|
|
||||||
```
|
|
@@ -1,41 +0,0 @@
|
|||||||
#!/usr/bin/env bash
|
|
||||||
|
|
||||||
if [ -z "${KUBECONFIG}" ]; then
|
|
||||||
export KUBECONFIG=~/.kube/config
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [ -z "${NAMESPACE}" ]; then
|
|
||||||
NAMESPACE=monitoring
|
|
||||||
fi
|
|
||||||
|
|
||||||
kubectl create namespace "$NAMESPACE"
|
|
||||||
|
|
||||||
kctl() {
|
|
||||||
kubectl --namespace "$NAMESPACE" "$@"
|
|
||||||
}
|
|
||||||
|
|
||||||
kctl apply -f manifests/prometheus-operator.yaml
|
|
||||||
|
|
||||||
# Wait for TPRs to be ready.
|
|
||||||
printf "Waiting for Operator to register third party objects..."
|
|
||||||
until kctl get servicemonitor > /dev/null 2>&1; do sleep 1; printf "."; done
|
|
||||||
until kctl get prometheus > /dev/null 2>&1; do sleep 1; printf "."; done
|
|
||||||
until kctl get alertmanager > /dev/null 2>&1; do sleep 1; printf "."; done
|
|
||||||
echo "done!"
|
|
||||||
|
|
||||||
kctl apply -f manifests/exporters
|
|
||||||
kctl apply -f manifests/grafana
|
|
||||||
|
|
||||||
kctl apply -f manifests/prometheus/prometheus-k8s-rules.yaml
|
|
||||||
kctl apply -f manifests/prometheus/prometheus-k8s-service.yaml
|
|
||||||
|
|
||||||
kctl apply -f manifests/alertmanager/alertmanager-config.yaml
|
|
||||||
kctl apply -f manifests/alertmanager/alertmanager-service.yaml
|
|
||||||
|
|
||||||
# `kubectl apply` is currently not working for third party resources so we are
|
|
||||||
# using `kubectl create` here for the time being.
|
|
||||||
# (https://github.com/kubernetes/kubernetes/issues/29542)
|
|
||||||
kctl create -f manifests/prometheus/prometheus-k8s-servicemonitors.yaml
|
|
||||||
kctl create -f manifests/prometheus/prometheus-k8s.yaml
|
|
||||||
kctl create -f manifests/alertmanager/alertmanager.yaml
|
|
||||||
|
|
@@ -1,6 +0,0 @@
|
|||||||
#!/usr/bin/env bash
|
|
||||||
|
|
||||||
hack/cluster-monitoring/deploy
|
|
||||||
|
|
||||||
awk 'FNR==1{print "---"}1' manifests/k8s/minikube/*.yaml | sed s/MINIKUBE_IP/`minikube ip`/g | kubectl --namespace=kube-system apply -f -
|
|
||||||
|
|
@@ -1,6 +0,0 @@
|
|||||||
#!/usr/bin/env bash
|
|
||||||
|
|
||||||
hack/cluster-monitoring/teardown
|
|
||||||
|
|
||||||
kubectl --namespace=kube-system delete -f manifests/k8s/minikube
|
|
||||||
|
|
@@ -1,6 +0,0 @@
|
|||||||
#!/usr/bin/env bash
|
|
||||||
|
|
||||||
hack/cluster-monitoring/deploy
|
|
||||||
|
|
||||||
kubectl --namespace=kube-system apply -f manifests/k8s/self-hosted
|
|
||||||
|
|
@@ -1,6 +0,0 @@
|
|||||||
#!/usr/bin/env bash
|
|
||||||
|
|
||||||
hack/cluster-monitoring/teardown
|
|
||||||
|
|
||||||
kubectl --namespace=kube-system delete -f manifests/k8s/self-hosted
|
|
||||||
|
|
@@ -1,24 +0,0 @@
|
|||||||
#!/usr/bin/env bash
|
|
||||||
|
|
||||||
if [ -z "${KUBECONFIG}" ]; then
|
|
||||||
export KUBECONFIG=~/.kube/config
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [ -z "${NAMESPACE}" ]; then
|
|
||||||
NAMESPACE=monitoring
|
|
||||||
fi
|
|
||||||
|
|
||||||
kctl() {
|
|
||||||
kubectl --namespace "$NAMESPACE" "$@"
|
|
||||||
}
|
|
||||||
|
|
||||||
kctl delete -f manifests/exporters
|
|
||||||
kctl delete -f manifests/grafana
|
|
||||||
kctl delete -f manifests/prometheus
|
|
||||||
kctl delete -f manifests/alertmanager
|
|
||||||
|
|
||||||
# Hack: wait a bit to let the controller delete the deployed Prometheus server.
|
|
||||||
sleep 5
|
|
||||||
|
|
||||||
kctl delete -f manifests/prometheus-operator.yaml
|
|
||||||
|
|
@@ -1,19 +0,0 @@
|
|||||||
#!/usr/bin/env bash
|
|
||||||
|
|
||||||
if [ -z "${KUBECONFIG}" ]; then
|
|
||||||
KUBECONFIG=~/.kube/config
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [ -z "${NAMESPACE}" ]; then
|
|
||||||
NAMESPACE=default
|
|
||||||
fi
|
|
||||||
|
|
||||||
kubectl --namespace "$NAMESPACE" --kubeconfig="$KUBECONFIG" apply -f manifests/examples/example-app/prometheus-frontend-svc.yaml
|
|
||||||
kubectl --namespace "$NAMESPACE" --kubeconfig="$KUBECONFIG" apply -f manifests/examples/example-app/example-app.yaml
|
|
||||||
|
|
||||||
# `kubectl apply` is currently not working for third party resources so we are
|
|
||||||
# using `kubectl create` here for the time being.
|
|
||||||
# (https://github.com/kubernetes/kubernetes/issues/29542)
|
|
||||||
kubectl --namespace "$NAMESPACE" --kubeconfig="$KUBECONFIG" create -f manifests/examples/example-app/prometheus-frontend.yaml
|
|
||||||
kubectl --namespace "$NAMESPACE" --kubeconfig="$KUBECONFIG" create -f manifests/examples/example-app/servicemonitor-frontend.yaml
|
|
||||||
|
|
@@ -1,12 +0,0 @@
|
|||||||
#!/usr/bin/env bash
|
|
||||||
|
|
||||||
if [ -z "${KUBECONFIG}" ]; then
|
|
||||||
KUBECONFIG=~/.kube/config
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [ -z "${NAMESPACE}" ]; then
|
|
||||||
NAMESPACE=default
|
|
||||||
fi
|
|
||||||
|
|
||||||
kubectl --namespace "$NAMESPACE" --kubeconfig="$KUBECONFIG" delete -f manifests/examples/example-app
|
|
||||||
|
|
@@ -1,8 +0,0 @@
|
|||||||
#!/bin/bash
|
|
||||||
|
|
||||||
# Generate Alert Rules ConfigMap
|
|
||||||
kubectl create configmap --dry-run=true prometheus-k8s-rules --from-file=assets/prometheus/rules/ -oyaml > manifests/prometheus/prometheus-k8s-rules.yaml
|
|
||||||
|
|
||||||
# Generate Dashboard ConfigMap
|
|
||||||
kubectl create configmap --dry-run=true grafana-dashboards --from-file=assets/grafana/ -oyaml > manifests/grafana/grafana-dashboards.yaml
|
|
||||||
|
|
@@ -1,50 +0,0 @@
|
|||||||
#!/bin/bash -eu
|
|
||||||
|
|
||||||
# Intended usage:
|
|
||||||
# * Edit dashboard in Grafana (you need to login first with admin/admin
|
|
||||||
# login/password).
|
|
||||||
# * Save dashboard in Grafana to check is specification is correct.
|
|
||||||
# Looks like this is the only way to check is dashboard specification
|
|
||||||
# has error.
|
|
||||||
# * Download dashboard specification as JSON file in Grafana:
|
|
||||||
# Share -> Export -> Save to file.
|
|
||||||
# * Wrap dashboard specification to make it digestable by kube-prometheus:
|
|
||||||
# ./hack/scripts/wrap-dashboard.sh Nodes-1488465802729.json
|
|
||||||
# * Replace dashboard specification:
|
|
||||||
# mv Nodes-1488465802729.json assets/grafana/node-dashboard.json
|
|
||||||
# * Regenerate Grafana configmap:
|
|
||||||
# ./hack/scripts/generate-configmaps.sh
|
|
||||||
# * Apply new configmap:
|
|
||||||
# kubectl -n monitoring apply -f manifests/grafana/grafana-cm.yaml
|
|
||||||
|
|
||||||
if [ "$#" -ne 1 ]; then
|
|
||||||
echo "Usage: $0 path-to-dashboard.json"
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
|
|
||||||
json=$1
|
|
||||||
temp=$(tempfile -m 0644)
|
|
||||||
|
|
||||||
cat >> $temp <<EOF
|
|
||||||
{
|
|
||||||
"dashboard":
|
|
||||||
EOF
|
|
||||||
|
|
||||||
cat $json >> $temp
|
|
||||||
|
|
||||||
cat >> $temp <<EOF
|
|
||||||
,
|
|
||||||
"inputs": [
|
|
||||||
{
|
|
||||||
"name": "DS_PROMETHEUS",
|
|
||||||
"pluginId": "prometheus",
|
|
||||||
"type": "datasource",
|
|
||||||
"value": "prometheus"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"overwrite": true
|
|
||||||
}
|
|
||||||
EOF
|
|
||||||
|
|
||||||
mv $temp $json
|
|
||||||
|
|
@@ -1,18 +0,0 @@
|
|||||||
apiVersion: v1
|
|
||||||
kind: ConfigMap
|
|
||||||
metadata:
|
|
||||||
name: alertmanager-main
|
|
||||||
data:
|
|
||||||
alertmanager.yaml: |-
|
|
||||||
global:
|
|
||||||
resolve_timeout: 5m
|
|
||||||
route:
|
|
||||||
group_by: ['job']
|
|
||||||
group_wait: 30s
|
|
||||||
group_interval: 5m
|
|
||||||
repeat_interval: 12h
|
|
||||||
receiver: 'webhook'
|
|
||||||
receivers:
|
|
||||||
- name: 'webhook'
|
|
||||||
webhook_configs:
|
|
||||||
- url: 'http://alertmanagerwh:30500/'
|
|
@@ -1,14 +0,0 @@
|
|||||||
apiVersion: v1
|
|
||||||
kind: Service
|
|
||||||
metadata:
|
|
||||||
name: alertmanager-main
|
|
||||||
spec:
|
|
||||||
type: NodePort
|
|
||||||
ports:
|
|
||||||
- name: web
|
|
||||||
nodePort: 30903
|
|
||||||
port: 9093
|
|
||||||
protocol: TCP
|
|
||||||
targetPort: web
|
|
||||||
selector:
|
|
||||||
alertmanager: main
|
|
@@ -1,9 +0,0 @@
|
|||||||
apiVersion: "monitoring.coreos.com/v1alpha1"
|
|
||||||
kind: "Alertmanager"
|
|
||||||
metadata:
|
|
||||||
name: "main"
|
|
||||||
labels:
|
|
||||||
alertmanager: "main"
|
|
||||||
spec:
|
|
||||||
replicas: 3
|
|
||||||
version: v0.5.1
|
|
@@ -1,28 +0,0 @@
|
|||||||
apiVersion: v1
|
|
||||||
kind: Service
|
|
||||||
metadata:
|
|
||||||
name: etcd-k8s
|
|
||||||
labels:
|
|
||||||
k8s-app: etcd
|
|
||||||
spec:
|
|
||||||
type: ClusterIP
|
|
||||||
clusterIP: None
|
|
||||||
ports:
|
|
||||||
- name: api
|
|
||||||
port: 2379
|
|
||||||
protocol: TCP
|
|
||||||
---
|
|
||||||
apiVersion: v1
|
|
||||||
kind: Endpoints
|
|
||||||
metadata:
|
|
||||||
name: etcd-k8s
|
|
||||||
labels:
|
|
||||||
k8s-app: etcd
|
|
||||||
subsets:
|
|
||||||
- addresses:
|
|
||||||
- ip: 10.142.0.2
|
|
||||||
nodeName: 10.142.0.2
|
|
||||||
ports:
|
|
||||||
- name: api
|
|
||||||
port: 2379
|
|
||||||
protocol: TCP
|
|
@@ -1,28 +0,0 @@
|
|||||||
apiVersion: v1
|
|
||||||
kind: Service
|
|
||||||
metadata:
|
|
||||||
name: etcd-k8s
|
|
||||||
labels:
|
|
||||||
k8s-app: etcd
|
|
||||||
spec:
|
|
||||||
type: ClusterIP
|
|
||||||
clusterIP: None
|
|
||||||
ports:
|
|
||||||
- name: api
|
|
||||||
port: 2379
|
|
||||||
protocol: TCP
|
|
||||||
---
|
|
||||||
apiVersion: v1
|
|
||||||
kind: Endpoints
|
|
||||||
metadata:
|
|
||||||
name: etcd-k8s
|
|
||||||
labels:
|
|
||||||
k8s-app: etcd
|
|
||||||
subsets:
|
|
||||||
- addresses:
|
|
||||||
- ip: 172.17.4.51
|
|
||||||
nodeName: 172.17.4.51
|
|
||||||
ports:
|
|
||||||
- name: api
|
|
||||||
port: 2379
|
|
||||||
protocol: TCP
|
|
@@ -1,34 +0,0 @@
|
|||||||
kind: Service
|
|
||||||
apiVersion: v1
|
|
||||||
metadata:
|
|
||||||
name: example-app
|
|
||||||
labels:
|
|
||||||
tier: frontend
|
|
||||||
spec:
|
|
||||||
selector:
|
|
||||||
app: example-app
|
|
||||||
ports:
|
|
||||||
- name: web
|
|
||||||
protocol: TCP
|
|
||||||
port: 8080
|
|
||||||
targetPort: web
|
|
||||||
---
|
|
||||||
apiVersion: extensions/v1beta1
|
|
||||||
kind: Deployment
|
|
||||||
metadata:
|
|
||||||
name: example-app
|
|
||||||
spec:
|
|
||||||
replicas: 4
|
|
||||||
template:
|
|
||||||
metadata:
|
|
||||||
labels:
|
|
||||||
app: example-app
|
|
||||||
version: 1.1.3
|
|
||||||
spec:
|
|
||||||
containers:
|
|
||||||
- name: example-app
|
|
||||||
image: quay.io/fabxc/prometheus_demo_service
|
|
||||||
ports:
|
|
||||||
- name: web
|
|
||||||
containerPort: 8080
|
|
||||||
protocol: TCP
|
|
@@ -1,14 +0,0 @@
|
|||||||
apiVersion: v1
|
|
||||||
kind: Service
|
|
||||||
metadata:
|
|
||||||
name: prometheus-frontend
|
|
||||||
spec:
|
|
||||||
type: NodePort
|
|
||||||
ports:
|
|
||||||
- name: web
|
|
||||||
nodePort: 30100
|
|
||||||
port: 9090
|
|
||||||
protocol: TCP
|
|
||||||
targetPort: web
|
|
||||||
selector:
|
|
||||||
prometheus: prometheus-frontend
|
|
@@ -1,24 +0,0 @@
|
|||||||
apiVersion: monitoring.coreos.com/v1alpha1
|
|
||||||
kind: Prometheus
|
|
||||||
metadata:
|
|
||||||
name: prometheus-frontend
|
|
||||||
namespace: default
|
|
||||||
labels:
|
|
||||||
prometheus: frontend
|
|
||||||
spec:
|
|
||||||
version: v1.5.2
|
|
||||||
serviceMonitorSelector:
|
|
||||||
matchLabels:
|
|
||||||
tier: frontend
|
|
||||||
resources:
|
|
||||||
requests:
|
|
||||||
# 2Gi is default, but won't schedule if you don't have a node with >2Gi
|
|
||||||
# memory. Modify based on your target and time-series count for
|
|
||||||
# production use. This value is mainly meant for demonstration/testing
|
|
||||||
# purposes.
|
|
||||||
memory: 400Mi
|
|
||||||
alerting:
|
|
||||||
alertmanagers:
|
|
||||||
- namespace: monitoring
|
|
||||||
name: alertmanager-main
|
|
||||||
port: web
|
|
@@ -1,13 +0,0 @@
|
|||||||
apiVersion: monitoring.coreos.com/v1alpha1
|
|
||||||
kind: ServiceMonitor
|
|
||||||
metadata:
|
|
||||||
name: frontend
|
|
||||||
labels:
|
|
||||||
tier: frontend
|
|
||||||
spec:
|
|
||||||
selector:
|
|
||||||
matchLabels:
|
|
||||||
tier: frontend
|
|
||||||
endpoints:
|
|
||||||
- port: web
|
|
||||||
interval: 10s
|
|
@@ -1,25 +0,0 @@
|
|||||||
apiVersion: extensions/v1beta1
|
|
||||||
kind: Deployment
|
|
||||||
metadata:
|
|
||||||
name: kube-state-metrics
|
|
||||||
spec:
|
|
||||||
replicas: 1
|
|
||||||
template:
|
|
||||||
metadata:
|
|
||||||
labels:
|
|
||||||
app: kube-state-metrics
|
|
||||||
spec:
|
|
||||||
containers:
|
|
||||||
- name: kube-state-metrics
|
|
||||||
image: gcr.io/google_containers/kube-state-metrics:v0.4.1
|
|
||||||
ports:
|
|
||||||
- name: metrics
|
|
||||||
containerPort: 8080
|
|
||||||
resources:
|
|
||||||
requests:
|
|
||||||
memory: 30Mi
|
|
||||||
cpu: 100m
|
|
||||||
limits:
|
|
||||||
memory: 50Mi
|
|
||||||
cpu: 200m
|
|
||||||
|
|
@@ -1,18 +0,0 @@
|
|||||||
apiVersion: v1
|
|
||||||
kind: Service
|
|
||||||
metadata:
|
|
||||||
labels:
|
|
||||||
app: kube-state-metrics
|
|
||||||
k8s-app: kube-state-metrics
|
|
||||||
annotations:
|
|
||||||
alpha.monitoring.coreos.com/non-namespaced: "true"
|
|
||||||
name: kube-state-metrics
|
|
||||||
spec:
|
|
||||||
ports:
|
|
||||||
- name: http-metrics
|
|
||||||
port: 8080
|
|
||||||
targetPort: metrics
|
|
||||||
protocol: TCP
|
|
||||||
selector:
|
|
||||||
app: kube-state-metrics
|
|
||||||
|
|
@@ -1,45 +0,0 @@
|
|||||||
apiVersion: extensions/v1beta1
|
|
||||||
kind: DaemonSet
|
|
||||||
metadata:
|
|
||||||
name: node-exporter
|
|
||||||
spec:
|
|
||||||
template:
|
|
||||||
metadata:
|
|
||||||
labels:
|
|
||||||
app: node-exporter
|
|
||||||
name: node-exporter
|
|
||||||
spec:
|
|
||||||
hostNetwork: true
|
|
||||||
hostPID: true
|
|
||||||
containers:
|
|
||||||
- image: quay.io/prometheus/node-exporter:v0.13.0
|
|
||||||
args:
|
|
||||||
- "-collector.procfs=/host/proc"
|
|
||||||
- "-collector.sysfs=/host/sys"
|
|
||||||
name: node-exporter
|
|
||||||
ports:
|
|
||||||
- containerPort: 9100
|
|
||||||
hostPort: 9100
|
|
||||||
name: scrape
|
|
||||||
resources:
|
|
||||||
requests:
|
|
||||||
memory: 30Mi
|
|
||||||
cpu: 100m
|
|
||||||
limits:
|
|
||||||
memory: 50Mi
|
|
||||||
cpu: 200m
|
|
||||||
volumeMounts:
|
|
||||||
- name: proc
|
|
||||||
readOnly: true
|
|
||||||
mountPath: /host/proc
|
|
||||||
- name: sys
|
|
||||||
readOnly: true
|
|
||||||
mountPath: /host/sys
|
|
||||||
volumes:
|
|
||||||
- name: proc
|
|
||||||
hostPath:
|
|
||||||
path: /proc
|
|
||||||
- name: sys
|
|
||||||
hostPath:
|
|
||||||
path: /sys
|
|
||||||
|
|
@@ -1,17 +0,0 @@
|
|||||||
apiVersion: v1
|
|
||||||
kind: Service
|
|
||||||
metadata:
|
|
||||||
labels:
|
|
||||||
app: node-exporter
|
|
||||||
k8s-app: node-exporter
|
|
||||||
name: node-exporter
|
|
||||||
spec:
|
|
||||||
type: ClusterIP
|
|
||||||
clusterIP: None
|
|
||||||
ports:
|
|
||||||
- name: http-metrics
|
|
||||||
port: 9100
|
|
||||||
protocol: TCP
|
|
||||||
selector:
|
|
||||||
app: node-exporter
|
|
||||||
|
|
File diff suppressed because it is too large
Load Diff
@@ -1,56 +0,0 @@
|
|||||||
apiVersion: extensions/v1beta1
|
|
||||||
kind: Deployment
|
|
||||||
metadata:
|
|
||||||
name: grafana
|
|
||||||
spec:
|
|
||||||
replicas: 1
|
|
||||||
template:
|
|
||||||
metadata:
|
|
||||||
labels:
|
|
||||||
app: grafana
|
|
||||||
spec:
|
|
||||||
containers:
|
|
||||||
- name: grafana
|
|
||||||
image: grafana/grafana:4.1.1
|
|
||||||
env:
|
|
||||||
- name: GF_AUTH_BASIC_ENABLED
|
|
||||||
value: "true"
|
|
||||||
- name: GF_AUTH_ANONYMOUS_ENABLED
|
|
||||||
value: "true"
|
|
||||||
volumeMounts:
|
|
||||||
- name: grafana-storage
|
|
||||||
mountPath: /var/grafana-storage
|
|
||||||
ports:
|
|
||||||
- name: web
|
|
||||||
containerPort: 3000
|
|
||||||
resources:
|
|
||||||
requests:
|
|
||||||
memory: 100Mi
|
|
||||||
cpu: 100m
|
|
||||||
limits:
|
|
||||||
memory: 300Mi
|
|
||||||
cpu: 300m
|
|
||||||
- name: grafana-watcher
|
|
||||||
image: quay.io/coreos/grafana-watcher:latest
|
|
||||||
args:
|
|
||||||
- '--watch-dir=/var/grafana-dashboards'
|
|
||||||
- '--grafana-url=http://admin:admin@localhost:3000'
|
|
||||||
volumeMounts:
|
|
||||||
- name: grafana-dashboards
|
|
||||||
mountPath: /var/grafana-dashboards
|
|
||||||
resources:
|
|
||||||
requests:
|
|
||||||
memory: "16Mi"
|
|
||||||
cpu: "50m"
|
|
||||||
limits:
|
|
||||||
memory: "32Mi"
|
|
||||||
cpu: "100m"
|
|
||||||
volumeMounts:
|
|
||||||
- name: grafana-dashboards
|
|
||||||
mountPath: /var/grafana-dashboards
|
|
||||||
volumes:
|
|
||||||
- name: grafana-storage
|
|
||||||
emptyDir: {}
|
|
||||||
- name: grafana-dashboards
|
|
||||||
configMap:
|
|
||||||
name: grafana-dashboards
|
|
@@ -1,15 +0,0 @@
|
|||||||
apiVersion: v1
|
|
||||||
kind: Service
|
|
||||||
metadata:
|
|
||||||
name: grafana
|
|
||||||
labels:
|
|
||||||
app: grafana
|
|
||||||
spec:
|
|
||||||
type: NodePort
|
|
||||||
ports:
|
|
||||||
- name: web
|
|
||||||
port: 3000
|
|
||||||
protocol: TCP
|
|
||||||
nodePort: 30902
|
|
||||||
selector:
|
|
||||||
app: grafana
|
|
@@ -1,28 +0,0 @@
|
|||||||
apiVersion: v1
|
|
||||||
kind: Service
|
|
||||||
metadata:
|
|
||||||
name: kube-controller-manager-prometheus-discovery
|
|
||||||
labels:
|
|
||||||
k8s-app: kube-controller-manager
|
|
||||||
spec:
|
|
||||||
type: ClusterIP
|
|
||||||
clusterIP: None
|
|
||||||
ports:
|
|
||||||
- name: http-metrics
|
|
||||||
port: 10252
|
|
||||||
targetPort: 10252
|
|
||||||
protocol: TCP
|
|
||||||
---
|
|
||||||
apiVersion: v1
|
|
||||||
kind: Endpoints
|
|
||||||
metadata:
|
|
||||||
name: kube-controller-manager-prometheus-discovery
|
|
||||||
labels:
|
|
||||||
k8s-app: kube-controller-manager
|
|
||||||
subsets:
|
|
||||||
- addresses:
|
|
||||||
- ip: MINIKUBE_IP
|
|
||||||
ports:
|
|
||||||
- name: http-metrics
|
|
||||||
port: 10252
|
|
||||||
protocol: TCP
|
|
@@ -1,28 +0,0 @@
|
|||||||
apiVersion: v1
|
|
||||||
kind: Service
|
|
||||||
metadata:
|
|
||||||
name: kube-scheduler-prometheus-discovery
|
|
||||||
labels:
|
|
||||||
k8s-app: kube-scheduler
|
|
||||||
spec:
|
|
||||||
type: ClusterIP
|
|
||||||
clusterIP: None
|
|
||||||
ports:
|
|
||||||
- name: http-metrics
|
|
||||||
port: 10251
|
|
||||||
targetPort: 10251
|
|
||||||
protocol: TCP
|
|
||||||
---
|
|
||||||
apiVersion: v1
|
|
||||||
kind: Endpoints
|
|
||||||
metadata:
|
|
||||||
name: kube-scheduler-prometheus-discovery
|
|
||||||
labels:
|
|
||||||
k8s-app: kube-scheduler
|
|
||||||
subsets:
|
|
||||||
- addresses:
|
|
||||||
- ip: MINIKUBE_IP
|
|
||||||
ports:
|
|
||||||
- name: http-metrics
|
|
||||||
port: 10251
|
|
||||||
protocol: TCP
|
|
@@ -1,16 +0,0 @@
|
|||||||
apiVersion: v1
|
|
||||||
kind: Service
|
|
||||||
metadata:
|
|
||||||
name: kube-controller-manager-prometheus-discovery
|
|
||||||
labels:
|
|
||||||
k8s-app: kube-controller-manager
|
|
||||||
spec:
|
|
||||||
selector:
|
|
||||||
k8s-app: kube-controller-manager
|
|
||||||
type: ClusterIP
|
|
||||||
clusterIP: None
|
|
||||||
ports:
|
|
||||||
- name: http-metrics
|
|
||||||
port: 10252
|
|
||||||
targetPort: 10252
|
|
||||||
protocol: TCP
|
|
@@ -1,20 +0,0 @@
|
|||||||
apiVersion: v1
|
|
||||||
kind: Service
|
|
||||||
metadata:
|
|
||||||
name: kube-dns-prometheus-discovery
|
|
||||||
labels:
|
|
||||||
k8s-app: kube-dns
|
|
||||||
spec:
|
|
||||||
selector:
|
|
||||||
k8s-app: kube-dns
|
|
||||||
type: ClusterIP
|
|
||||||
clusterIP: None
|
|
||||||
ports:
|
|
||||||
- name: http-metrics-skydns
|
|
||||||
port: 10055
|
|
||||||
targetPort: 10055
|
|
||||||
protocol: TCP
|
|
||||||
- name: http-metrics-dnsmasq
|
|
||||||
port: 10054
|
|
||||||
targetPort: 10054
|
|
||||||
protocol: TCP
|
|
@@ -1,16 +0,0 @@
|
|||||||
apiVersion: v1
|
|
||||||
kind: Service
|
|
||||||
metadata:
|
|
||||||
name: kube-scheduler-prometheus-discovery
|
|
||||||
labels:
|
|
||||||
k8s-app: kube-scheduler
|
|
||||||
spec:
|
|
||||||
selector:
|
|
||||||
k8s-app: kube-scheduler
|
|
||||||
type: ClusterIP
|
|
||||||
clusterIP: None
|
|
||||||
ports:
|
|
||||||
- name: http-metrics
|
|
||||||
port: 10251
|
|
||||||
targetPort: 10251
|
|
||||||
protocol: TCP
|
|
@@ -1,26 +0,0 @@
|
|||||||
apiVersion: extensions/v1beta1
|
|
||||||
kind: Deployment
|
|
||||||
metadata:
|
|
||||||
name: prometheus-operator
|
|
||||||
labels:
|
|
||||||
operator: prometheus
|
|
||||||
spec:
|
|
||||||
replicas: 1
|
|
||||||
template:
|
|
||||||
metadata:
|
|
||||||
labels:
|
|
||||||
operator: prometheus
|
|
||||||
spec:
|
|
||||||
containers:
|
|
||||||
- name: prometheus-operator
|
|
||||||
image: quay.io/coreos/prometheus-operator:v0.6.0
|
|
||||||
args:
|
|
||||||
- "--kubelet-object=kube-system/kubelet"
|
|
||||||
- "--config-reloader-image=quay.io/coreos/configmap-reload:v0.0.1"
|
|
||||||
resources:
|
|
||||||
requests:
|
|
||||||
cpu: 100m
|
|
||||||
memory: 50Mi
|
|
||||||
limits:
|
|
||||||
cpu: 200m
|
|
||||||
memory: 300Mi
|
|
@@ -1,447 +0,0 @@
|
|||||||
apiVersion: v1
|
|
||||||
data:
|
|
||||||
etcd2.rules: "### General cluster availability ###\n\n# alert if another failed
|
|
||||||
peer will result in an unavailable cluster\nALERT InsufficientPeers\n IF count(up{job=\"etcd-k8s\"}
|
|
||||||
== 0) > (count(up{job=\"etcd-k8s\"}) / 2 - 1)\n FOR 3m\n LABELS {\n severity
|
|
||||||
= \"critical\"\n }\n ANNOTATIONS {\n summary = \"Etcd cluster small\",\n
|
|
||||||
\ description = \"If one more etcd peer goes down the cluster will be unavailable\",\n
|
|
||||||
\ }\n\n### HTTP requests alerts ###\n\n# alert if more than 1% of requests to
|
|
||||||
an HTTP endpoint have failed with a non 4xx response\nALERT HighNumberOfFailedHTTPRequests\n
|
|
||||||
\ IF sum by(method) (rate(etcd_http_failed_total{job=\"etcd-k8s\", code!~\"4[0-9]{2}\"}[5m]))\n
|
|
||||||
\ / sum by(method) (rate(etcd_http_received_total{job=\"etcd-k8s\"}[5m])) >
|
|
||||||
0.01\n FOR 10m\n LABELS {\n severity = \"warning\"\n }\n ANNOTATIONS {\n
|
|
||||||
\ summary = \"a high number of HTTP requests are failing\",\n description
|
|
||||||
= \"{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance
|
|
||||||
{{ $labels.instance }}\",\n }\n\n# alert if more than 5% of requests to an HTTP
|
|
||||||
endpoint have failed with a non 4xx response\nALERT HighNumberOfFailedHTTPRequests\n
|
|
||||||
\ IF sum by(method) (rate(etcd_http_failed_total{job=\"etcd-k8s\", code!~\"4[0-9]{2}\"}[5m]))
|
|
||||||
\n / sum by(method) (rate(etcd_http_received_total{job=\"etcd-k8s\"}[5m]))
|
|
||||||
> 0.05\n FOR 5m\n LABELS {\n severity = \"critical\"\n }\n ANNOTATIONS
|
|
||||||
{\n summary = \"a high number of HTTP requests are failing\",\n description
|
|
||||||
= \"{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance
|
|
||||||
{{ $labels.instance }}\",\n }\n\n# alert if 50% of requests get a 4xx response\nALERT
|
|
||||||
HighNumberOfFailedHTTPRequests\n IF sum by(method) (rate(etcd_http_failed_total{job=\"etcd-k8s\",
|
|
||||||
code=~\"4[0-9]{2}\"}[5m]))\n / sum by(method) (rate(etcd_http_received_total{job=\"etcd-k8s\"}[5m]))
|
|
||||||
> 0.5\n FOR 10m\n LABELS {\n severity = \"critical\"\n }\n ANNOTATIONS
|
|
||||||
{\n summary = \"a high number of HTTP requests are failing\",\n description
|
|
||||||
= \"{{ $value }}% of requests for {{ $labels.method }} failed with 4xx responses
|
|
||||||
on etcd instance {{ $labels.instance }}\",\n }\n\n# alert if the 99th percentile
|
|
||||||
of HTTP requests take more than 150ms\nALERT HTTPRequestsSlow\n IF histogram_quantile(0.99,
|
|
||||||
rate(etcd_http_successful_duration_second_bucket[5m])) > 0.15\n FOR 10m\n LABELS
|
|
||||||
{\n severity = \"warning\"\n }\n ANNOTATIONS {\n summary = \"slow HTTP
|
|
||||||
requests\",\n description = \"on ectd instance {{ $labels.instance }} HTTP
|
|
||||||
requests to {{ $label.method }} are slow\",\n }\n\n### File descriptor alerts
|
|
||||||
###\n\ninstance:fd_utilization = process_open_fds / process_max_fds\n\n# alert
|
|
||||||
if file descriptors are likely to exhaust within the next 4 hours\nALERT FdExhaustionClose\n
|
|
||||||
\ IF predict_linear(instance:fd_utilization[1h], 3600 * 4) > 1\n FOR 10m\n LABELS
|
|
||||||
{\n severity = \"warning\"\n }\n ANNOTATIONS {\n summary = \"file descriptors
|
|
||||||
soon exhausted\",\n description = \"{{ $labels.job }} instance {{ $labels.instance
|
|
||||||
}} will exhaust in file descriptors soon\",\n }\n\n# alert if file descriptors
|
|
||||||
are likely to exhaust within the next hour\nALERT FdExhaustionClose\n IF predict_linear(instance:fd_utilization[10m],
|
|
||||||
3600) > 1\n FOR 10m\n LABELS {\n severity = \"critical\"\n }\n ANNOTATIONS
|
|
||||||
{\n summary = \"file descriptors soon exhausted\",\n description = \"{{
|
|
||||||
$labels.job }} instance {{ $labels.instance }} will exhaust in file descriptors
|
|
||||||
soon\",\n }\n\n### etcd proposal alerts ###\n\n# alert if there are several failed
|
|
||||||
proposals within an hour\nALERT HighNumberOfFailedProposals\n IF increase(etcd_server_proposal_failed_total{job=\"etcd\"}[1h])
|
|
||||||
> 5\n LABELS {\n severity = \"warning\"\n }\n ANNOTATIONS {\n summary
|
|
||||||
= \"a high number of failed proposals within the etcd cluster are happening\",\n
|
|
||||||
\ description = \"etcd instance {{ $labels.instance }} has seen {{ $value }}
|
|
||||||
proposal failures within the last hour\",\n }\n\n### etcd disk io latency alerts
|
|
||||||
###\n\n# alert if 99th percentile of fsync durations is higher than 500ms\nALERT
|
|
||||||
HighFsyncDurations\n IF histogram_quantile(0.99, rate(etcd_wal_fsync_durations_seconds_bucket[5m]))
|
|
||||||
> 0.5\n FOR 10m\n LABELS {\n severity = \"warning\"\n }\n ANNOTATIONS {\n
|
|
||||||
\ summary = \"high fsync durations\",\n description = \"ectd instance {{
|
|
||||||
$labels.instance }} fync durations are high\",\n }\n"
|
|
||||||
kubernetes.rules: |+
|
|
||||||
# NOTE: These rules were kindly contributed by the SoundCloud engineering team.
|
|
||||||
|
|
||||||
### Container resources ###
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:spec_memory_limit_bytes =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
container_spec_memory_limit_bytes{container_name!=""},
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:spec_cpu_shares =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
container_spec_cpu_shares{container_name!=""},
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:cpu_usage:rate =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
irate(
|
|
||||||
container_cpu_usage_seconds_total{container_name!=""}[5m]
|
|
||||||
),
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:memory_usage:bytes =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
container_memory_usage_bytes{container_name!=""},
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:memory_working_set:bytes =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
container_memory_working_set_bytes{container_name!=""},
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:memory_rss:bytes =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
container_memory_rss{container_name!=""},
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:memory_cache:bytes =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
container_memory_cache{container_name!=""},
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:disk_usage:bytes =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
|
||||||
label_replace(
|
|
||||||
container_disk_usage_bytes{container_name!=""},
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:memory_pagefaults:rate =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name,scope,type) (
|
|
||||||
label_replace(
|
|
||||||
irate(
|
|
||||||
container_memory_failures_total{container_name!=""}[5m]
|
|
||||||
),
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_namespace_controller_pod_container:memory_oom:rate =
|
|
||||||
sum by (cluster,namespace,controller,pod_name,container_name,scope,type) (
|
|
||||||
label_replace(
|
|
||||||
irate(
|
|
||||||
container_memory_failcnt{container_name!=""}[5m]
|
|
||||||
),
|
|
||||||
"controller", "$1",
|
|
||||||
"pod_name", "^(.*)-[a-z0-9]+"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
### Cluster resources ###
|
|
||||||
|
|
||||||
cluster:memory_allocation:percent =
|
|
||||||
100 * sum by (cluster) (
|
|
||||||
container_spec_memory_limit_bytes{pod_name!=""}
|
|
||||||
) / sum by (cluster) (
|
|
||||||
machine_memory_bytes
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster:memory_used:percent =
|
|
||||||
100 * sum by (cluster) (
|
|
||||||
container_memory_usage_bytes{pod_name!=""}
|
|
||||||
) / sum by (cluster) (
|
|
||||||
machine_memory_bytes
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster:cpu_allocation:percent =
|
|
||||||
100 * sum by (cluster) (
|
|
||||||
container_spec_cpu_shares{pod_name!=""}
|
|
||||||
) / sum by (cluster) (
|
|
||||||
container_spec_cpu_shares{id="/"} * on(cluster,instance) machine_cpu_cores
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster:node_cpu_use:percent =
|
|
||||||
100 * sum by (cluster) (
|
|
||||||
rate(node_cpu{mode!="idle"}[5m])
|
|
||||||
) / sum by (cluster) (
|
|
||||||
machine_cpu_cores
|
|
||||||
)
|
|
||||||
|
|
||||||
### API latency ###
|
|
||||||
|
|
||||||
# Raw metrics are in microseconds. Convert to seconds.
|
|
||||||
cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.99"} =
|
|
||||||
histogram_quantile(
|
|
||||||
0.99,
|
|
||||||
sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket)
|
|
||||||
) / 1e6
|
|
||||||
cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.9"} =
|
|
||||||
histogram_quantile(
|
|
||||||
0.9,
|
|
||||||
sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket)
|
|
||||||
) / 1e6
|
|
||||||
cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.5"} =
|
|
||||||
histogram_quantile(
|
|
||||||
0.5,
|
|
||||||
sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket)
|
|
||||||
) / 1e6
|
|
||||||
|
|
||||||
### Scheduling latency ###
|
|
||||||
|
|
||||||
cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.99"} =
|
|
||||||
histogram_quantile(0.99,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6
|
|
||||||
cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.9"} =
|
|
||||||
histogram_quantile(0.9,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6
|
|
||||||
cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.5"} =
|
|
||||||
histogram_quantile(0.5,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6
|
|
||||||
|
|
||||||
cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.99"} =
|
|
||||||
histogram_quantile(0.99,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6
|
|
||||||
cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.9"} =
|
|
||||||
histogram_quantile(0.9,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6
|
|
||||||
cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.5"} =
|
|
||||||
histogram_quantile(0.5,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6
|
|
||||||
|
|
||||||
cluster:scheduler_binding_latency:quantile_seconds{quantile="0.99"} =
|
|
||||||
histogram_quantile(0.99,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6
|
|
||||||
cluster:scheduler_binding_latency:quantile_seconds{quantile="0.9"} =
|
|
||||||
histogram_quantile(0.9,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6
|
|
||||||
cluster:scheduler_binding_latency:quantile_seconds{quantile="0.5"} =
|
|
||||||
histogram_quantile(0.5,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6
|
|
||||||
|
|
||||||
ALERT K8SNodeDown
|
|
||||||
IF up{job="kubelet"} == 0
|
|
||||||
FOR 1h
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Kubelet cannot be scraped",
|
|
||||||
description = "Prometheus could not scrape a {{ $labels.job }} for more than one hour",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SNodeNotReady
|
|
||||||
IF kube_node_status_ready{condition="true"} == 0
|
|
||||||
FOR 1h
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning",
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Node status is NotReady",
|
|
||||||
description = "The Kubelet on {{ $labels.node }} has not checked in with the API, or has set itself to NotReady, for more than an hour",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SManyNodesNotReady
|
|
||||||
IF
|
|
||||||
count by (cluster) (kube_node_status_ready{condition="true"} == 0) > 1
|
|
||||||
AND
|
|
||||||
(
|
|
||||||
count by (cluster) (kube_node_status_ready{condition="true"} == 0)
|
|
||||||
/
|
|
||||||
count by (cluster) (kube_node_status_ready{condition="true"})
|
|
||||||
) > 0.2
|
|
||||||
FOR 1m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "critical",
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Many K8s nodes are Not Ready",
|
|
||||||
description = "{{ $value }} K8s nodes (more than 10% of cluster {{ $labels.cluster }}) are in the NotReady state.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SKubeletNodeExporterDown
|
|
||||||
IF up{job="node-exporter"} == 0
|
|
||||||
FOR 15m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Kubelet node_exporter cannot be scraped",
|
|
||||||
description = "Prometheus could not scrape a {{ $labels.job }} for more than one hour.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SKubeletDown
|
|
||||||
IF absent(up{job="kubelet"}) or count by (cluster) (up{job="kubelet"} == 0) / count by (cluster) (up{job="kubelet"}) > 0.1
|
|
||||||
FOR 1h
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "critical"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Many Kubelets cannot be scraped",
|
|
||||||
description = "Prometheus failed to scrape more than 10% of kubelets, or all Kubelets have disappeared from service discovery.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SApiserverDown
|
|
||||||
IF up{job="kubernetes"} == 0
|
|
||||||
FOR 15m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "API server unreachable",
|
|
||||||
description = "An API server could not be scraped.",
|
|
||||||
}
|
|
||||||
|
|
||||||
# Disable for non HA kubernetes setups.
|
|
||||||
ALERT K8SApiserverDown
|
|
||||||
IF absent({job="kubernetes"}) or (count by(cluster) (up{job="kubernetes"} == 1) < count by(cluster) (up{job="kubernetes"}))
|
|
||||||
FOR 5m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "critical"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "API server unreachable",
|
|
||||||
description = "Prometheus failed to scrape multiple API servers, or all API servers have disappeared from service discovery.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SSchedulerDown
|
|
||||||
IF absent(up{job="kube-scheduler"}) or (count by(cluster) (up{job="kube-scheduler"} == 1) == 0)
|
|
||||||
FOR 5m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "critical",
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Scheduler is down",
|
|
||||||
description = "There is no running K8S scheduler. New pods are not being assigned to nodes.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SControllerManagerDown
|
|
||||||
IF absent(up{job="kube-controller-manager"}) or (count by(cluster) (up{job="kube-controller-manager"} == 1) == 0)
|
|
||||||
FOR 5m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "critical",
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Controller manager is down",
|
|
||||||
description = "There is no running K8S controller manager. Deployments and replication controllers are not making progress.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SConntrackTableFull
|
|
||||||
IF 100*node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 50
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Number of tracked connections is near the limit",
|
|
||||||
description = "The nf_conntrack table is {{ $value }}% full.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SConntrackTableFull
|
|
||||||
IF 100*node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 90
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "critical"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Number of tracked connections is near the limit",
|
|
||||||
description = "The nf_conntrack table is {{ $value }}% full.",
|
|
||||||
}
|
|
||||||
|
|
||||||
# To catch the conntrack sysctl de-tuning when it happens
|
|
||||||
ALERT K8SConntrackTuningMissing
|
|
||||||
IF node_nf_conntrack_udp_timeout > 10
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning",
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Node does not have the correct conntrack tunings",
|
|
||||||
description = "Nodes keep un-setting the correct tunings, investigate when it happens.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8STooManyOpenFiles
|
|
||||||
IF 100*process_open_fds{job=~"kubelet|kubernetes"} / process_max_fds > 50
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "{{ $labels.job }} has too many open file descriptors",
|
|
||||||
description = "{{ $labels.node }} is using {{ $value }}% of the available file/socket descriptors.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8STooManyOpenFiles
|
|
||||||
IF 100*process_open_fds{job=~"kubelet|kubernetes"} / process_max_fds > 80
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "critical"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "{{ $labels.job }} has too many open file descriptors",
|
|
||||||
description = "{{ $labels.node }} is using {{ $value }}% of the available file/socket descriptors.",
|
|
||||||
}
|
|
||||||
|
|
||||||
# Some verbs excluded because they are expected to be long-lasting:
|
|
||||||
# WATCHLIST is long-poll, CONNECT is `kubectl exec`.
|
|
||||||
ALERT K8SApiServerLatency
|
|
||||||
IF histogram_quantile(
|
|
||||||
0.99,
|
|
||||||
sum without (instance,node,resource) (apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH"})
|
|
||||||
) / 1e6 > 1.0
|
|
||||||
FOR 10m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Kubernetes apiserver latency is high",
|
|
||||||
description = "99th percentile Latency for {{ $labels.verb }} requests to the kube-apiserver is higher than 1s.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SApiServerEtcdAccessLatency
|
|
||||||
IF etcd_request_latencies_summary{quantile="0.99"} / 1e6 > 1.0
|
|
||||||
FOR 15m
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning"
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Access to etcd is slow",
|
|
||||||
description = "99th percentile latency for apiserver to access etcd is higher than 1s.",
|
|
||||||
}
|
|
||||||
|
|
||||||
ALERT K8SKubeletTooManyPods
|
|
||||||
IF kubelet_running_pod_count > 100
|
|
||||||
LABELS {
|
|
||||||
service = "k8s",
|
|
||||||
severity = "warning",
|
|
||||||
}
|
|
||||||
ANNOTATIONS {
|
|
||||||
summary = "Kubelet is close to pod limit",
|
|
||||||
description = "Kubelet {{$labels.instance}} is running {{$value}} pods, close to the limit of 110",
|
|
||||||
}
|
|
||||||
|
|
||||||
kind: ConfigMap
|
|
||||||
metadata:
|
|
||||||
creationTimestamp: null
|
|
||||||
name: prometheus-k8s-rules
|
|
@@ -1,14 +0,0 @@
|
|||||||
apiVersion: v1
|
|
||||||
kind: Service
|
|
||||||
metadata:
|
|
||||||
name: prometheus-k8s
|
|
||||||
spec:
|
|
||||||
type: NodePort
|
|
||||||
ports:
|
|
||||||
- name: web
|
|
||||||
nodePort: 30900
|
|
||||||
port: 9090
|
|
||||||
protocol: TCP
|
|
||||||
targetPort: web
|
|
||||||
selector:
|
|
||||||
prometheus: k8s
|
|
@@ -1,69 +0,0 @@
|
|||||||
apiVersion: monitoring.coreos.com/v1alpha1
|
|
||||||
kind: ServiceMonitor
|
|
||||||
metadata:
|
|
||||||
name: kube-apiserver
|
|
||||||
labels:
|
|
||||||
k8s-apps: https
|
|
||||||
spec:
|
|
||||||
jobLabel: provider
|
|
||||||
selector:
|
|
||||||
matchLabels:
|
|
||||||
component: apiserver
|
|
||||||
provider: kubernetes
|
|
||||||
namespaceSelector:
|
|
||||||
matchNames:
|
|
||||||
- default
|
|
||||||
endpoints:
|
|
||||||
- port: https
|
|
||||||
interval: 15s
|
|
||||||
scheme: https
|
|
||||||
tlsConfig:
|
|
||||||
caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
|
|
||||||
serverName: kubernetes
|
|
||||||
bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
|
|
||||||
---
|
|
||||||
apiVersion: monitoring.coreos.com/v1alpha1
|
|
||||||
kind: ServiceMonitor
|
|
||||||
metadata:
|
|
||||||
name: k8s-apps-https
|
|
||||||
labels:
|
|
||||||
k8s-apps: https
|
|
||||||
spec:
|
|
||||||
jobLabel: k8s-app
|
|
||||||
selector:
|
|
||||||
matchExpressions:
|
|
||||||
- {key: k8s-app, operator: Exists}
|
|
||||||
namespaceSelector:
|
|
||||||
matchNames:
|
|
||||||
- kube-system
|
|
||||||
endpoints:
|
|
||||||
- port: https-metrics
|
|
||||||
interval: 15s
|
|
||||||
scheme: https
|
|
||||||
tlsConfig:
|
|
||||||
caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
|
|
||||||
insecureSkipVerify: true
|
|
||||||
bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
|
|
||||||
---
|
|
||||||
apiVersion: monitoring.coreos.com/v1alpha1
|
|
||||||
kind: ServiceMonitor
|
|
||||||
metadata:
|
|
||||||
name: k8s-apps-http
|
|
||||||
labels:
|
|
||||||
k8s-apps: http
|
|
||||||
spec:
|
|
||||||
jobLabel: k8s-app
|
|
||||||
selector:
|
|
||||||
matchExpressions:
|
|
||||||
- {key: k8s-app, operator: Exists}
|
|
||||||
namespaceSelector:
|
|
||||||
matchNames:
|
|
||||||
- kube-system
|
|
||||||
- monitoring
|
|
||||||
endpoints:
|
|
||||||
- port: http-metrics
|
|
||||||
interval: 15s
|
|
||||||
- port: http-metrics-dnsmasq
|
|
||||||
interval: 15s
|
|
||||||
- port: http-metrics-skydns
|
|
||||||
interval: 15s
|
|
@@ -1,24 +0,0 @@
|
|||||||
apiVersion: monitoring.coreos.com/v1alpha1
|
|
||||||
kind: Prometheus
|
|
||||||
metadata:
|
|
||||||
name: k8s
|
|
||||||
labels:
|
|
||||||
prometheus: k8s
|
|
||||||
spec:
|
|
||||||
replicas: 2
|
|
||||||
version: v1.5.2
|
|
||||||
serviceMonitorSelector:
|
|
||||||
matchExpression:
|
|
||||||
- {key: k8s-apps, operator: Exists}
|
|
||||||
resources:
|
|
||||||
requests:
|
|
||||||
# 2Gi is default, but won't schedule if you don't have a node with >2Gi
|
|
||||||
# memory. Modify based on your target and time-series count for
|
|
||||||
# production use. This value is mainly meant for demonstration/testing
|
|
||||||
# purposes.
|
|
||||||
memory: 400Mi
|
|
||||||
alerting:
|
|
||||||
alertmanagers:
|
|
||||||
- namespace: monitoring
|
|
||||||
name: alertmanager-main
|
|
||||||
port: web
|
|
Reference in New Issue
Block a user