*: migrate kube-prometheus to Prometheus Operator repo
This commit is contained in:
133
README.md
133
README.md
@@ -4,135 +4,6 @@ This repository collects Kubernetes manifests, dashboards, and alerting rules
|
||||
combined with documentation and scripts to provide single-command deployments
|
||||
of end-to-end Kubernetes cluster monitoring.
|
||||
|
||||
## Prerequisites
|
||||
# This repository has moved
|
||||
|
||||
First, you need a running Kubernetes cluster. If you don't have one, follow the
|
||||
instructions of [bootkube](https://github.com/kubernetes-incubator/bootkube) or
|
||||
[minikube](https://github.com/kubernetes/minikube). Some sample contents of this
|
||||
repository are adapted to work with a [multi-node setup](https://github.com/kubernetes-incubator/bootkube/tree/master/hack/multi-node)
|
||||
using [bootkube](https://github.com/kubernetes-incubator/bootkube).
|
||||
|
||||
## Monitoring Kubernetes
|
||||
|
||||
The manifests used here use the [Prometheus Operator](https://github.com/coreos/prometheus-operator),
|
||||
which manages Prometheus servers and their configuration in a cluster. With a single command we can install
|
||||
|
||||
* The Operator itself
|
||||
* The Prometheus [node_exporter](https://github.com/prometheus/node_exporter)
|
||||
* [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics)
|
||||
* The [Prometheus specification](https://github.com/coreos/prometheus-operator/blob/master/Documentation/prometheus.md) based on which the Operator deploys a Prometheus setup
|
||||
* A Prometheus configuration covering monitoring of all Kubernetes core components and exporters
|
||||
* A default set of alerting rules on the cluster component's health
|
||||
* A Grafana instance serving dashboards on cluster metrics
|
||||
* A three node highly available Alertmanager cluster
|
||||
|
||||
Simply run:
|
||||
|
||||
```bash
|
||||
export KUBECONFIG=<path> # defaults to "~/.kube/config"
|
||||
hack/cluster-monitoring/deploy
|
||||
```
|
||||
|
||||
After all pods are ready, you can reach:
|
||||
|
||||
* Prometheus UI on node port `30900`
|
||||
* Alertmanager UI on node port `30903`
|
||||
* Grafana on node port `30902`
|
||||
|
||||
To tear it all down again, run:
|
||||
|
||||
```bash
|
||||
hack/cluster-monitoring/teardown
|
||||
```
|
||||
|
||||
## Monitoring custom services
|
||||
|
||||
The example manifests in [/manifests/examples/example-app](/manifests/examples/example-app)
|
||||
deploy a fake service exposing Prometheus metrics. They additionally define a new Prometheus
|
||||
server and a [`ServiceMonitor`](https://github.com/coreos/prometheus-operator/blob/master/Documentation/service-monitor.md),
|
||||
which specifies how the example service should be monitored.
|
||||
The Prometheus Operator will deploy and configure the desired Prometheus instance and continiously
|
||||
manage its life cycle.
|
||||
|
||||
```bash
|
||||
hack/example-service-monitoring/deploy
|
||||
```
|
||||
|
||||
After all pods are ready you can reach the Prometheus server on node port `30100` and observe
|
||||
how it monitors the service as specified. Same as before, this Prometheus server automatically
|
||||
discovers the Alertmanager cluster deployed in the [Monitoring Kubernetes](#Monitoring-Kubernetes)
|
||||
section.
|
||||
|
||||
Teardown:
|
||||
|
||||
```bash
|
||||
hack/example-service-monitoring/teardown
|
||||
```
|
||||
|
||||
## Dashboarding
|
||||
|
||||
The provided manifests deploy a Grafana instance serving dashboards provided via a ConfigMap.
|
||||
To modify, delete, or add dashboards, the `grafana-dashboards` ConfigMap must be modified.
|
||||
|
||||
Currently, Grafana does not support serving dashboards from static files. Instead, the `grafana-watcher`
|
||||
sidecar container aims to emulate the behavior, by keeping the Grafana database always in sync
|
||||
with the provided ConfigMap. Hence, the Grafana pod is effectively stateless.
|
||||
This allows managing dashboards via `git` etc. and easily deploying them via CD pipelines.
|
||||
|
||||
In the future, a separate Grafana operator will support gathering dashboards from multiple
|
||||
ConfigMaps based on label selection.
|
||||
|
||||
## Roadmap
|
||||
|
||||
* Grafana Operator that dynamically discovers and deploys dashboards from ConfigMaps
|
||||
* KPM/Helm packages to easily provide production-ready cluster-monitoring setup (essentially contents of `hack/cluster-monitoring`)
|
||||
* Add meta-monitoring to default cluster monitoring setup
|
||||
* Build out the provided dashboards and alerts for cluster monitoring to have full coverage of all system aspects
|
||||
|
||||
## Monitoring other Cluster Components
|
||||
|
||||
Discovery of API servers and kubelets works the same across all clusters.
|
||||
Depending on a cluster's setup several other core components, such as etcd or the
|
||||
scheduler, may be deployed in different ways.
|
||||
The easiest integration point is for the cluster operator to provide headless services
|
||||
of all those components to provide a common interface of discovering them. With that
|
||||
setup they will automatically be discovered by the provided Prometheus configuration.
|
||||
|
||||
For the `kube-scheduler` and `kube-controller-manager` there are headless
|
||||
services prepared, simply add them to your running cluster:
|
||||
|
||||
```bash
|
||||
kubectl -n kube-system create manifests/k8s/
|
||||
```
|
||||
|
||||
> Hint: if you use this for a cluster not created with bootkube, make sure you
|
||||
> populate an endpoints object with the address to your `kube-scheduler` and
|
||||
> `kube-controller-manager`, or adapt the label selectors to match your setup.
|
||||
|
||||
Aside from Kubernetes specific components, etcd is an important part of a
|
||||
working cluster, but is typically deployed outside of it. This monitoring
|
||||
setup assumes that it is made visible from within the cluster through a headless
|
||||
service as well.
|
||||
|
||||
> Note that minikube hides some components like etcd so to see the extend of
|
||||
> this setup we recommend setting up a [local cluster using bootkube](https://github.com/kubernetes-incubator/bootkube/tree/master/hack/multi-node).
|
||||
|
||||
An example for bootkube's multi-node vagrant setup is [here](/manifests/etcd/etcd-bootkube-vagrant-multi.yaml).
|
||||
|
||||
> Hint: this is merely an example for a local setup. The addresses will have to
|
||||
> be adapted for a setup, that is not a single etcd bootkube created cluster.
|
||||
|
||||
With that setup the headless services provide endpoint lists consumed by
|
||||
Prometheus to discover the endpoints as targets:
|
||||
|
||||
```bash
|
||||
$ kubectl get endpoints --all-namespaces
|
||||
NAMESPACE NAME ENDPOINTS AGE
|
||||
default kubernetes 172.17.4.101:443 2h
|
||||
kube-system kube-controller-manager-prometheus-discovery 10.2.30.2:10252 1h
|
||||
kube-system kube-scheduler-prometheus-discovery 10.2.30.4:10251 1h
|
||||
monitoring etcd-k8s 172.17.4.51:2379 1h
|
||||
```
|
||||
|
||||
## Other Documentation
|
||||
[Install Docs for a cluster created with KOPS on AWS](docs/KOPSonAWS.md)
|
||||
This repository has been merged with the [Prometheus Operator](https://github.com/coreos/prometheus-operator). It can now be found under [`contrib/kube-prometheus`](https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus).
|
||||
|
@@ -1,860 +0,0 @@
|
||||
{
|
||||
"dashboard":
|
||||
{
|
||||
"__inputs": [
|
||||
{
|
||||
"name": "DS_PROMETHEUS",
|
||||
"label": "prometheus",
|
||||
"description": "",
|
||||
"type": "datasource",
|
||||
"pluginId": "prometheus",
|
||||
"pluginName": "Prometheus"
|
||||
}
|
||||
],
|
||||
"__requires": [
|
||||
{
|
||||
"type": "grafana",
|
||||
"id": "grafana",
|
||||
"name": "Grafana",
|
||||
"version": "4.1.1"
|
||||
},
|
||||
{
|
||||
"type": "panel",
|
||||
"id": "graph",
|
||||
"name": "Graph",
|
||||
"version": ""
|
||||
},
|
||||
{
|
||||
"type": "datasource",
|
||||
"id": "prometheus",
|
||||
"name": "Prometheus",
|
||||
"version": "1.0.0"
|
||||
},
|
||||
{
|
||||
"type": "panel",
|
||||
"id": "singlestat",
|
||||
"name": "Singlestat",
|
||||
"version": ""
|
||||
}
|
||||
],
|
||||
"annotations": {
|
||||
"list": []
|
||||
},
|
||||
"description": "Dashboard to get an overview of one server",
|
||||
"editable": true,
|
||||
"gnetId": 22,
|
||||
"graphTooltip": 0,
|
||||
"hideControls": false,
|
||||
"id": null,
|
||||
"links": [],
|
||||
"refresh": false,
|
||||
"rows": [
|
||||
{
|
||||
"collapse": false,
|
||||
"height": "250px",
|
||||
"panels": [
|
||||
{
|
||||
"alerting": {},
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {},
|
||||
"id": 3,
|
||||
"legend": {
|
||||
"avg": false,
|
||||
"current": false,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": false
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [],
|
||||
"span": 6,
|
||||
"stack": false,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum(rate(node_cpu{mode=\"idle\"}[2m])) * 100",
|
||||
"hide": false,
|
||||
"intervalFactor": 10,
|
||||
"legendFormat": "",
|
||||
"refId": "A",
|
||||
"step": 50
|
||||
}
|
||||
],
|
||||
"thresholds": [],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "Idle cpu",
|
||||
"tooltip": {
|
||||
"msResolution": false,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "cumulative"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"mode": "time",
|
||||
"name": null,
|
||||
"show": true,
|
||||
"values": []
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "percent",
|
||||
"label": "cpu usage",
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": 0,
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "short",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"alerting": {},
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {},
|
||||
"id": 9,
|
||||
"legend": {
|
||||
"avg": false,
|
||||
"current": false,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": false
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [],
|
||||
"span": 6,
|
||||
"stack": false,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum(node_load1)",
|
||||
"intervalFactor": 4,
|
||||
"legendFormat": "load 1m",
|
||||
"refId": "A",
|
||||
"step": 20,
|
||||
"target": ""
|
||||
},
|
||||
{
|
||||
"expr": "sum(node_load5)",
|
||||
"intervalFactor": 4,
|
||||
"legendFormat": "load 5m",
|
||||
"refId": "B",
|
||||
"step": 20,
|
||||
"target": ""
|
||||
},
|
||||
{
|
||||
"expr": "sum(node_load15)",
|
||||
"intervalFactor": 4,
|
||||
"legendFormat": "load 15m",
|
||||
"refId": "C",
|
||||
"step": 20,
|
||||
"target": ""
|
||||
}
|
||||
],
|
||||
"thresholds": [],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "System load",
|
||||
"tooltip": {
|
||||
"msResolution": false,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "cumulative"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"mode": "time",
|
||||
"name": null,
|
||||
"show": true,
|
||||
"values": []
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "percentunit",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "short",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"repeat": null,
|
||||
"repeatIteration": null,
|
||||
"repeatRowId": null,
|
||||
"showTitle": false,
|
||||
"title": "New row",
|
||||
"titleSize": "h6"
|
||||
},
|
||||
{
|
||||
"collapse": false,
|
||||
"height": "250px",
|
||||
"panels": [
|
||||
{
|
||||
"alerting": {},
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {},
|
||||
"id": 4,
|
||||
"legend": {
|
||||
"avg": false,
|
||||
"current": false,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": false
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [
|
||||
{
|
||||
"alias": "node_memory_SwapFree{instance=\"172.17.0.1:9100\",job=\"prometheus\"}",
|
||||
"yaxis": 2
|
||||
}
|
||||
],
|
||||
"span": 9,
|
||||
"stack": true,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum(node_memory_MemTotal) - sum(node_memory_MemFree) - sum(node_memory_Buffers) - sum(node_memory_Cached)",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "memory usage",
|
||||
"metric": "memo",
|
||||
"refId": "A",
|
||||
"step": 4,
|
||||
"target": ""
|
||||
},
|
||||
{
|
||||
"expr": "sum(node_memory_Buffers)",
|
||||
"interval": "",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "memory buffers",
|
||||
"metric": "memo",
|
||||
"refId": "B",
|
||||
"step": 4,
|
||||
"target": ""
|
||||
},
|
||||
{
|
||||
"expr": "sum(node_memory_Cached)",
|
||||
"interval": "",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "memory cached",
|
||||
"metric": "memo",
|
||||
"refId": "C",
|
||||
"step": 4,
|
||||
"target": ""
|
||||
},
|
||||
{
|
||||
"expr": "sum(node_memory_MemFree)",
|
||||
"interval": "",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "memory free",
|
||||
"metric": "memo",
|
||||
"refId": "D",
|
||||
"step": 4,
|
||||
"target": ""
|
||||
}
|
||||
],
|
||||
"thresholds": [],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "Memory usage",
|
||||
"tooltip": {
|
||||
"msResolution": false,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "individual"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"mode": "time",
|
||||
"name": null,
|
||||
"show": true,
|
||||
"values": []
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "bytes",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": "0",
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "short",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cacheTimeout": null,
|
||||
"colorBackground": false,
|
||||
"colorValue": false,
|
||||
"colors": [
|
||||
"rgba(50, 172, 45, 0.97)",
|
||||
"rgba(237, 129, 40, 0.89)",
|
||||
"rgba(245, 54, 54, 0.9)"
|
||||
],
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"format": "percent",
|
||||
"gauge": {
|
||||
"maxValue": 100,
|
||||
"minValue": 0,
|
||||
"show": true,
|
||||
"thresholdLabels": false,
|
||||
"thresholdMarkers": true
|
||||
},
|
||||
"id": 5,
|
||||
"interval": null,
|
||||
"links": [],
|
||||
"mappingType": 1,
|
||||
"mappingTypes": [
|
||||
{
|
||||
"name": "value to text",
|
||||
"value": 1
|
||||
},
|
||||
{
|
||||
"name": "range to text",
|
||||
"value": 2
|
||||
}
|
||||
],
|
||||
"maxDataPoints": 100,
|
||||
"nullPointMode": "connected",
|
||||
"nullText": null,
|
||||
"postfix": "",
|
||||
"postfixFontSize": "50%",
|
||||
"prefix": "",
|
||||
"prefixFontSize": "50%",
|
||||
"rangeMaps": [
|
||||
{
|
||||
"from": "null",
|
||||
"text": "N/A",
|
||||
"to": "null"
|
||||
}
|
||||
],
|
||||
"span": 3,
|
||||
"sparkline": {
|
||||
"fillColor": "rgba(31, 118, 189, 0.18)",
|
||||
"full": false,
|
||||
"lineColor": "rgb(31, 120, 193)",
|
||||
"show": false
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"expr": "((sum(node_memory_MemTotal) - sum(node_memory_MemFree) - sum(node_memory_Buffers) - sum(node_memory_Cached)) / sum(node_memory_MemTotal)) * 100",
|
||||
"intervalFactor": 2,
|
||||
"metric": "",
|
||||
"refId": "A",
|
||||
"step": 60,
|
||||
"target": ""
|
||||
}
|
||||
],
|
||||
"thresholds": "80, 90",
|
||||
"title": "Memory usage",
|
||||
"type": "singlestat",
|
||||
"valueFontSize": "80%",
|
||||
"valueMaps": [
|
||||
{
|
||||
"op": "=",
|
||||
"text": "N/A",
|
||||
"value": "null"
|
||||
}
|
||||
],
|
||||
"valueName": "avg"
|
||||
}
|
||||
],
|
||||
"repeat": null,
|
||||
"repeatIteration": null,
|
||||
"repeatRowId": null,
|
||||
"showTitle": false,
|
||||
"title": "New row",
|
||||
"titleSize": "h6"
|
||||
},
|
||||
{
|
||||
"collapse": false,
|
||||
"height": "250px",
|
||||
"panels": [
|
||||
{
|
||||
"alerting": {},
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {},
|
||||
"id": 6,
|
||||
"legend": {
|
||||
"avg": false,
|
||||
"current": false,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": false
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [
|
||||
{
|
||||
"alias": "read",
|
||||
"yaxis": 1
|
||||
},
|
||||
{
|
||||
"alias": "{instance=\"172.17.0.1:9100\"}",
|
||||
"yaxis": 2
|
||||
},
|
||||
{
|
||||
"alias": "io time",
|
||||
"yaxis": 2
|
||||
}
|
||||
],
|
||||
"span": 9,
|
||||
"stack": false,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum(rate(node_disk_bytes_read[5m]))",
|
||||
"hide": false,
|
||||
"intervalFactor": 4,
|
||||
"legendFormat": "read",
|
||||
"refId": "A",
|
||||
"step": 8,
|
||||
"target": ""
|
||||
},
|
||||
{
|
||||
"expr": "sum(rate(node_disk_bytes_written[5m]))",
|
||||
"intervalFactor": 4,
|
||||
"legendFormat": "written",
|
||||
"refId": "B",
|
||||
"step": 8
|
||||
},
|
||||
{
|
||||
"expr": "sum(rate(node_disk_io_time_ms[5m]))",
|
||||
"intervalFactor": 4,
|
||||
"legendFormat": "io time",
|
||||
"refId": "C",
|
||||
"step": 8
|
||||
}
|
||||
],
|
||||
"thresholds": [],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "Disk I/O",
|
||||
"tooltip": {
|
||||
"msResolution": false,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "cumulative"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"mode": "time",
|
||||
"name": null,
|
||||
"show": true,
|
||||
"values": []
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "bytes",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "ms",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cacheTimeout": null,
|
||||
"colorBackground": false,
|
||||
"colorValue": false,
|
||||
"colors": [
|
||||
"rgba(50, 172, 45, 0.97)",
|
||||
"rgba(237, 129, 40, 0.89)",
|
||||
"rgba(245, 54, 54, 0.9)"
|
||||
],
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"format": "percentunit",
|
||||
"gauge": {
|
||||
"maxValue": 1,
|
||||
"minValue": 0,
|
||||
"show": true,
|
||||
"thresholdLabels": false,
|
||||
"thresholdMarkers": true
|
||||
},
|
||||
"id": 7,
|
||||
"interval": null,
|
||||
"links": [],
|
||||
"mappingType": 1,
|
||||
"mappingTypes": [
|
||||
{
|
||||
"name": "value to text",
|
||||
"value": 1
|
||||
},
|
||||
{
|
||||
"name": "range to text",
|
||||
"value": 2
|
||||
}
|
||||
],
|
||||
"maxDataPoints": 100,
|
||||
"nullPointMode": "connected",
|
||||
"nullText": null,
|
||||
"postfix": "",
|
||||
"postfixFontSize": "50%",
|
||||
"prefix": "",
|
||||
"prefixFontSize": "50%",
|
||||
"rangeMaps": [
|
||||
{
|
||||
"from": "null",
|
||||
"text": "N/A",
|
||||
"to": "null"
|
||||
}
|
||||
],
|
||||
"span": 3,
|
||||
"sparkline": {
|
||||
"fillColor": "rgba(31, 118, 189, 0.18)",
|
||||
"full": false,
|
||||
"lineColor": "rgb(31, 120, 193)",
|
||||
"show": false
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"expr": "(sum(node_filesystem_size{device!=\"rootfs\"}) - sum(node_filesystem_free{device!=\"rootfs\"})) / sum(node_filesystem_size{device!=\"rootfs\"})",
|
||||
"intervalFactor": 2,
|
||||
"refId": "A",
|
||||
"step": 60,
|
||||
"target": ""
|
||||
}
|
||||
],
|
||||
"thresholds": "0.75, 0.9",
|
||||
"title": "Disk space usage",
|
||||
"type": "singlestat",
|
||||
"valueFontSize": "80%",
|
||||
"valueMaps": [
|
||||
{
|
||||
"op": "=",
|
||||
"text": "N/A",
|
||||
"value": "null"
|
||||
}
|
||||
],
|
||||
"valueName": "current"
|
||||
}
|
||||
],
|
||||
"repeat": null,
|
||||
"repeatIteration": null,
|
||||
"repeatRowId": null,
|
||||
"showTitle": false,
|
||||
"title": "New row",
|
||||
"titleSize": "h6"
|
||||
},
|
||||
{
|
||||
"collapse": false,
|
||||
"height": "250px",
|
||||
"panels": [
|
||||
{
|
||||
"alerting": {},
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {},
|
||||
"id": 8,
|
||||
"legend": {
|
||||
"avg": false,
|
||||
"current": false,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": false
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [
|
||||
{
|
||||
"alias": "transmitted ",
|
||||
"yaxis": 2
|
||||
}
|
||||
],
|
||||
"span": 6,
|
||||
"stack": false,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum(rate(node_network_receive_bytes{device!~\"lo\"}[5m]))",
|
||||
"hide": false,
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "",
|
||||
"refId": "A",
|
||||
"step": 10,
|
||||
"target": ""
|
||||
}
|
||||
],
|
||||
"thresholds": [],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "Network received",
|
||||
"tooltip": {
|
||||
"msResolution": false,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "cumulative"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"mode": "time",
|
||||
"name": null,
|
||||
"show": true,
|
||||
"values": []
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "bytes",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "bytes",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"alerting": {},
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {},
|
||||
"id": 10,
|
||||
"legend": {
|
||||
"avg": false,
|
||||
"current": false,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": false
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [
|
||||
{
|
||||
"alias": "transmitted ",
|
||||
"yaxis": 2
|
||||
}
|
||||
],
|
||||
"span": 6,
|
||||
"stack": false,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum(rate(node_network_transmit_bytes{device!~\"lo\"}[5m]))",
|
||||
"hide": false,
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "",
|
||||
"refId": "B",
|
||||
"step": 10,
|
||||
"target": ""
|
||||
}
|
||||
],
|
||||
"thresholds": [],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "Network transmitted",
|
||||
"tooltip": {
|
||||
"msResolution": false,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "cumulative"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"mode": "time",
|
||||
"name": null,
|
||||
"show": true,
|
||||
"values": []
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "bytes",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "bytes",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"repeat": null,
|
||||
"repeatIteration": null,
|
||||
"repeatRowId": null,
|
||||
"showTitle": false,
|
||||
"title": "New row",
|
||||
"titleSize": "h6"
|
||||
}
|
||||
],
|
||||
"schemaVersion": 14,
|
||||
"style": "dark",
|
||||
"tags": [
|
||||
"prometheus"
|
||||
],
|
||||
"templating": {
|
||||
"list": []
|
||||
},
|
||||
"time": {
|
||||
"from": "now-1h",
|
||||
"to": "now"
|
||||
},
|
||||
"timepicker": {
|
||||
"refresh_intervals": [
|
||||
"5s",
|
||||
"10s",
|
||||
"30s",
|
||||
"1m",
|
||||
"5m",
|
||||
"15m",
|
||||
"30m",
|
||||
"1h",
|
||||
"2h",
|
||||
"1d"
|
||||
],
|
||||
"time_options": [
|
||||
"5m",
|
||||
"15m",
|
||||
"1h",
|
||||
"6h",
|
||||
"12h",
|
||||
"24h",
|
||||
"2d",
|
||||
"7d",
|
||||
"30d"
|
||||
]
|
||||
},
|
||||
"timezone": "browser",
|
||||
"title": "All Nodes",
|
||||
"version": 1
|
||||
},
|
||||
"inputs": [
|
||||
{
|
||||
"name": "DS_PROMETHEUS",
|
||||
"pluginId": "prometheus",
|
||||
"type": "datasource",
|
||||
"value": "prometheus"
|
||||
}
|
||||
],
|
||||
"overwrite": true
|
||||
}
|
@@ -1,817 +0,0 @@
|
||||
{
|
||||
"dashboard": {
|
||||
"__inputs": [
|
||||
{
|
||||
"name": "DS_PROMETHEUS",
|
||||
"label": "prometheus",
|
||||
"description": "",
|
||||
"type": "datasource",
|
||||
"pluginId": "prometheus",
|
||||
"pluginName": "Prometheus"
|
||||
}
|
||||
],
|
||||
"__requires": [
|
||||
{
|
||||
"type": "panel",
|
||||
"id": "singlestat",
|
||||
"name": "Singlestat",
|
||||
"version": ""
|
||||
},
|
||||
{
|
||||
"type": "panel",
|
||||
"id": "graph",
|
||||
"name": "Graph",
|
||||
"version": ""
|
||||
},
|
||||
{
|
||||
"type": "grafana",
|
||||
"id": "grafana",
|
||||
"name": "Grafana",
|
||||
"version": "3.1.1"
|
||||
},
|
||||
{
|
||||
"type": "datasource",
|
||||
"id": "prometheus",
|
||||
"name": "Prometheus",
|
||||
"version": "1.0.0"
|
||||
}
|
||||
],
|
||||
"id": null,
|
||||
"title": "Deployment",
|
||||
"tags": [],
|
||||
"style": "dark",
|
||||
"timezone": "browser",
|
||||
"editable": true,
|
||||
"hideControls": false,
|
||||
"sharedCrosshair": true,
|
||||
"rows": [
|
||||
{
|
||||
"collapse": false,
|
||||
"editable": true,
|
||||
"height": "200px",
|
||||
"panels": [
|
||||
{
|
||||
"title": "CPU",
|
||||
"error": false,
|
||||
"span": 4,
|
||||
"editable": true,
|
||||
"type": "singlestat",
|
||||
"isNew": true,
|
||||
"id": 8,
|
||||
"targets": [
|
||||
{
|
||||
"refId": "A",
|
||||
"expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}[3m])) ",
|
||||
"intervalFactor": 2,
|
||||
"step": 600
|
||||
}
|
||||
],
|
||||
"links": [],
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"maxDataPoints": 100,
|
||||
"interval": null,
|
||||
"cacheTimeout": null,
|
||||
"format": "none",
|
||||
"prefix": "",
|
||||
"postfix": "cores",
|
||||
"nullText": null,
|
||||
"valueMaps": [
|
||||
{
|
||||
"value": "null",
|
||||
"op": "=",
|
||||
"text": "N/A"
|
||||
}
|
||||
],
|
||||
"mappingTypes": [
|
||||
{
|
||||
"name": "value to text",
|
||||
"value": 1
|
||||
},
|
||||
{
|
||||
"name": "range to text",
|
||||
"value": 2
|
||||
}
|
||||
],
|
||||
"rangeMaps": [
|
||||
{
|
||||
"from": "null",
|
||||
"to": "null",
|
||||
"text": "N/A"
|
||||
}
|
||||
],
|
||||
"mappingType": 1,
|
||||
"nullPointMode": "connected",
|
||||
"valueName": "avg",
|
||||
"prefixFontSize": "50%",
|
||||
"valueFontSize": "110%",
|
||||
"postfixFontSize": "50%",
|
||||
"thresholds": "",
|
||||
"colorBackground": false,
|
||||
"colorValue": false,
|
||||
"colors": [
|
||||
"rgba(245, 54, 54, 0.9)",
|
||||
"rgba(237, 129, 40, 0.89)",
|
||||
"rgba(50, 172, 45, 0.97)"
|
||||
],
|
||||
"sparkline": {
|
||||
"show": true,
|
||||
"full": false,
|
||||
"lineColor": "rgb(31, 120, 193)",
|
||||
"fillColor": "rgba(31, 118, 189, 0.18)"
|
||||
},
|
||||
"gauge": {
|
||||
"show": false,
|
||||
"minValue": 0,
|
||||
"maxValue": 100,
|
||||
"thresholdMarkers": true,
|
||||
"thresholdLabels": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Memory",
|
||||
"error": false,
|
||||
"span": 4,
|
||||
"editable": true,
|
||||
"type": "singlestat",
|
||||
"isNew": true,
|
||||
"id": 9,
|
||||
"targets": [
|
||||
{
|
||||
"refId": "A",
|
||||
"expr": "sum(container_memory_usage_bytes{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}) / 1024^3",
|
||||
"intervalFactor": 2,
|
||||
"step": 600
|
||||
}
|
||||
],
|
||||
"links": [],
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"maxDataPoints": 100,
|
||||
"interval": null,
|
||||
"cacheTimeout": null,
|
||||
"format": "none",
|
||||
"prefix": "",
|
||||
"postfix": "GB",
|
||||
"nullText": null,
|
||||
"valueMaps": [
|
||||
{
|
||||
"value": "null",
|
||||
"op": "=",
|
||||
"text": "N/A"
|
||||
}
|
||||
],
|
||||
"mappingTypes": [
|
||||
{
|
||||
"name": "value to text",
|
||||
"value": 1
|
||||
},
|
||||
{
|
||||
"name": "range to text",
|
||||
"value": 2
|
||||
}
|
||||
],
|
||||
"rangeMaps": [
|
||||
{
|
||||
"from": "null",
|
||||
"to": "null",
|
||||
"text": "N/A"
|
||||
}
|
||||
],
|
||||
"mappingType": 1,
|
||||
"nullPointMode": "connected",
|
||||
"valueName": "avg",
|
||||
"prefixFontSize": "80%",
|
||||
"valueFontSize": "110%",
|
||||
"postfixFontSize": "50%",
|
||||
"thresholds": "",
|
||||
"colorBackground": false,
|
||||
"colorValue": false,
|
||||
"colors": [
|
||||
"rgba(245, 54, 54, 0.9)",
|
||||
"rgba(237, 129, 40, 0.89)",
|
||||
"rgba(50, 172, 45, 0.97)"
|
||||
],
|
||||
"sparkline": {
|
||||
"show": true,
|
||||
"full": false,
|
||||
"lineColor": "rgb(31, 120, 193)",
|
||||
"fillColor": "rgba(31, 118, 189, 0.18)"
|
||||
},
|
||||
"gauge": {
|
||||
"show": false,
|
||||
"minValue": 0,
|
||||
"maxValue": 100,
|
||||
"thresholdMarkers": true,
|
||||
"thresholdLabels": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Network",
|
||||
"error": false,
|
||||
"span": 4,
|
||||
"editable": true,
|
||||
"type": "singlestat",
|
||||
"isNew": true,
|
||||
"id": 7,
|
||||
"targets": [
|
||||
{
|
||||
"refId": "A",
|
||||
"expr": "sum(rate(container_network_transmit_bytes_total{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}[3m])) + sum(rate(container_network_receive_bytes_total{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}[3m])) ",
|
||||
"intervalFactor": 2,
|
||||
"step": 600
|
||||
}
|
||||
],
|
||||
"links": [],
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"maxDataPoints": 100,
|
||||
"interval": null,
|
||||
"cacheTimeout": null,
|
||||
"format": "Bps",
|
||||
"prefix": "",
|
||||
"postfix": "",
|
||||
"nullText": null,
|
||||
"valueMaps": [
|
||||
{
|
||||
"value": "null",
|
||||
"op": "=",
|
||||
"text": "N/A"
|
||||
}
|
||||
],
|
||||
"mappingTypes": [
|
||||
{
|
||||
"name": "value to text",
|
||||
"value": 1
|
||||
},
|
||||
{
|
||||
"name": "range to text",
|
||||
"value": 2
|
||||
}
|
||||
],
|
||||
"rangeMaps": [
|
||||
{
|
||||
"from": "null",
|
||||
"to": "null",
|
||||
"text": "N/A"
|
||||
}
|
||||
],
|
||||
"mappingType": 1,
|
||||
"nullPointMode": "connected",
|
||||
"valueName": "avg",
|
||||
"prefixFontSize": "50%",
|
||||
"valueFontSize": "80%",
|
||||
"postfixFontSize": "50%",
|
||||
"thresholds": "",
|
||||
"colorBackground": false,
|
||||
"colorValue": false,
|
||||
"colors": [
|
||||
"rgba(245, 54, 54, 0.9)",
|
||||
"rgba(237, 129, 40, 0.89)",
|
||||
"rgba(50, 172, 45, 0.97)"
|
||||
],
|
||||
"sparkline": {
|
||||
"show": true,
|
||||
"full": false,
|
||||
"lineColor": "rgb(31, 120, 193)",
|
||||
"fillColor": "rgba(31, 118, 189, 0.18)"
|
||||
},
|
||||
"gauge": {
|
||||
"show": false,
|
||||
"minValue": 0,
|
||||
"maxValue": 100,
|
||||
"thresholdMarkers": false,
|
||||
"thresholdLabels": false
|
||||
}
|
||||
}
|
||||
],
|
||||
"title": "Row",
|
||||
"showTitle": false
|
||||
},
|
||||
{
|
||||
"title": "New row",
|
||||
"height": "100px",
|
||||
"editable": true,
|
||||
"collapse": false,
|
||||
"panels": [
|
||||
{
|
||||
"title": "Desired Replicas",
|
||||
"error": false,
|
||||
"span": 3,
|
||||
"editable": true,
|
||||
"type": "singlestat",
|
||||
"isNew": true,
|
||||
"id": 5,
|
||||
"targets": [
|
||||
{
|
||||
"refId": "A",
|
||||
"expr": "kube_deployment_spec_replicas{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
||||
"intervalFactor": 2,
|
||||
"step": 600,
|
||||
"metric": "kube_deployment_spec_replicas"
|
||||
}
|
||||
],
|
||||
"links": [],
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"maxDataPoints": 100,
|
||||
"interval": null,
|
||||
"cacheTimeout": null,
|
||||
"format": "none",
|
||||
"prefix": "",
|
||||
"postfix": "",
|
||||
"nullText": null,
|
||||
"valueMaps": [
|
||||
{
|
||||
"value": "null",
|
||||
"op": "=",
|
||||
"text": "N/A"
|
||||
}
|
||||
],
|
||||
"mappingTypes": [
|
||||
{
|
||||
"name": "value to text",
|
||||
"value": 1
|
||||
},
|
||||
{
|
||||
"name": "range to text",
|
||||
"value": 2
|
||||
}
|
||||
],
|
||||
"rangeMaps": [
|
||||
{
|
||||
"from": "null",
|
||||
"to": "null",
|
||||
"text": "N/A"
|
||||
}
|
||||
],
|
||||
"mappingType": 1,
|
||||
"nullPointMode": "connected",
|
||||
"valueName": "avg",
|
||||
"prefixFontSize": "50%",
|
||||
"valueFontSize": "80%",
|
||||
"postfixFontSize": "50%",
|
||||
"thresholds": "",
|
||||
"colorBackground": false,
|
||||
"colorValue": false,
|
||||
"colors": [
|
||||
"rgba(245, 54, 54, 0.9)",
|
||||
"rgba(237, 129, 40, 0.89)",
|
||||
"rgba(50, 172, 45, 0.97)"
|
||||
],
|
||||
"sparkline": {
|
||||
"show": false,
|
||||
"full": false,
|
||||
"lineColor": "rgb(31, 120, 193)",
|
||||
"fillColor": "rgba(31, 118, 189, 0.18)"
|
||||
},
|
||||
"gauge": {
|
||||
"show": false,
|
||||
"minValue": 0,
|
||||
"maxValue": 100,
|
||||
"thresholdMarkers": false,
|
||||
"thresholdLabels": false
|
||||
},
|
||||
"decimals": null
|
||||
},
|
||||
{
|
||||
"title": "Available Replicas",
|
||||
"error": false,
|
||||
"span": 3,
|
||||
"editable": true,
|
||||
"type": "singlestat",
|
||||
"isNew": true,
|
||||
"id": 6,
|
||||
"targets": [
|
||||
{
|
||||
"refId": "A",
|
||||
"expr": "kube_deployment_status_replicas_available{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
||||
"intervalFactor": 2,
|
||||
"step": 600
|
||||
}
|
||||
],
|
||||
"links": [],
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"maxDataPoints": 100,
|
||||
"interval": null,
|
||||
"cacheTimeout": null,
|
||||
"format": "none",
|
||||
"prefix": "",
|
||||
"postfix": "",
|
||||
"nullText": null,
|
||||
"valueMaps": [
|
||||
{
|
||||
"value": "null",
|
||||
"op": "=",
|
||||
"text": "N/A"
|
||||
}
|
||||
],
|
||||
"mappingTypes": [
|
||||
{
|
||||
"name": "value to text",
|
||||
"value": 1
|
||||
},
|
||||
{
|
||||
"name": "range to text",
|
||||
"value": 2
|
||||
}
|
||||
],
|
||||
"rangeMaps": [
|
||||
{
|
||||
"from": "null",
|
||||
"to": "null",
|
||||
"text": "N/A"
|
||||
}
|
||||
],
|
||||
"mappingType": 1,
|
||||
"nullPointMode": "connected",
|
||||
"valueName": "avg",
|
||||
"prefixFontSize": "50%",
|
||||
"valueFontSize": "80%",
|
||||
"postfixFontSize": "50%",
|
||||
"thresholds": "",
|
||||
"colorBackground": false,
|
||||
"colorValue": false,
|
||||
"colors": [
|
||||
"rgba(245, 54, 54, 0.9)",
|
||||
"rgba(237, 129, 40, 0.89)",
|
||||
"rgba(50, 172, 45, 0.97)"
|
||||
],
|
||||
"sparkline": {
|
||||
"show": false,
|
||||
"full": false,
|
||||
"lineColor": "rgb(31, 120, 193)",
|
||||
"fillColor": "rgba(31, 118, 189, 0.18)"
|
||||
},
|
||||
"gauge": {
|
||||
"show": false,
|
||||
"minValue": 0,
|
||||
"maxValue": 100,
|
||||
"thresholdMarkers": true,
|
||||
"thresholdLabels": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cacheTimeout": null,
|
||||
"colorBackground": false,
|
||||
"colorValue": false,
|
||||
"colors": [
|
||||
"rgba(245, 54, 54, 0.9)",
|
||||
"rgba(237, 129, 40, 0.89)",
|
||||
"rgba(50, 172, 45, 0.97)"
|
||||
],
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"format": "none",
|
||||
"gauge": {
|
||||
"maxValue": 100,
|
||||
"minValue": 0,
|
||||
"show": false,
|
||||
"thresholdLabels": false,
|
||||
"thresholdMarkers": true
|
||||
},
|
||||
"id": 3,
|
||||
"interval": null,
|
||||
"isNew": true,
|
||||
"links": [],
|
||||
"mappingType": 1,
|
||||
"mappingTypes": [
|
||||
{
|
||||
"name": "value to text",
|
||||
"value": 1
|
||||
},
|
||||
{
|
||||
"name": "range to text",
|
||||
"value": 2
|
||||
}
|
||||
],
|
||||
"maxDataPoints": 100,
|
||||
"nullPointMode": "connected",
|
||||
"nullText": null,
|
||||
"postfix": "",
|
||||
"postfixFontSize": "50%",
|
||||
"prefix": "",
|
||||
"prefixFontSize": "50%",
|
||||
"rangeMaps": [
|
||||
{
|
||||
"from": "null",
|
||||
"text": "N/A",
|
||||
"to": "null"
|
||||
}
|
||||
],
|
||||
"span": 3,
|
||||
"sparkline": {
|
||||
"fillColor": "rgba(31, 118, 189, 0.18)",
|
||||
"full": false,
|
||||
"lineColor": "rgb(31, 120, 193)",
|
||||
"show": false
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"expr": "kube_deployment_status_observed_generation{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "",
|
||||
"refId": "A",
|
||||
"step": 600
|
||||
}
|
||||
],
|
||||
"thresholds": "",
|
||||
"title": "Observed Generation",
|
||||
"type": "singlestat",
|
||||
"valueFontSize": "80%",
|
||||
"valueMaps": [
|
||||
{
|
||||
"op": "=",
|
||||
"text": "N/A",
|
||||
"value": "null"
|
||||
}
|
||||
],
|
||||
"valueName": "avg"
|
||||
},
|
||||
{
|
||||
"cacheTimeout": null,
|
||||
"colorBackground": false,
|
||||
"colorValue": false,
|
||||
"colors": [
|
||||
"rgba(245, 54, 54, 0.9)",
|
||||
"rgba(237, 129, 40, 0.89)",
|
||||
"rgba(50, 172, 45, 0.97)"
|
||||
],
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"format": "none",
|
||||
"gauge": {
|
||||
"maxValue": 100,
|
||||
"minValue": 0,
|
||||
"show": false,
|
||||
"thresholdLabels": false,
|
||||
"thresholdMarkers": true
|
||||
},
|
||||
"id": 2,
|
||||
"interval": null,
|
||||
"isNew": true,
|
||||
"links": [],
|
||||
"mappingType": 1,
|
||||
"mappingTypes": [
|
||||
{
|
||||
"name": "value to text",
|
||||
"value": 1
|
||||
},
|
||||
{
|
||||
"name": "range to text",
|
||||
"value": 2
|
||||
}
|
||||
],
|
||||
"maxDataPoints": 100,
|
||||
"nullPointMode": "connected",
|
||||
"nullText": null,
|
||||
"postfix": "",
|
||||
"postfixFontSize": "50%",
|
||||
"prefix": "",
|
||||
"prefixFontSize": "50%",
|
||||
"rangeMaps": [
|
||||
{
|
||||
"from": "null",
|
||||
"text": "N/A",
|
||||
"to": "null"
|
||||
}
|
||||
],
|
||||
"span": 3,
|
||||
"sparkline": {
|
||||
"fillColor": "rgba(31, 118, 189, 0.18)",
|
||||
"full": false,
|
||||
"lineColor": "rgb(31, 120, 193)",
|
||||
"show": false
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"expr": "kube_deployment_metadata_generation{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "",
|
||||
"refId": "A",
|
||||
"step": 600
|
||||
}
|
||||
],
|
||||
"thresholds": "",
|
||||
"title": "Metadata Generation",
|
||||
"type": "singlestat",
|
||||
"valueFontSize": "80%",
|
||||
"valueMaps": [
|
||||
{
|
||||
"op": "=",
|
||||
"text": "N/A",
|
||||
"value": "null"
|
||||
}
|
||||
],
|
||||
"valueName": "avg"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"collapse": false,
|
||||
"editable": true,
|
||||
"height": "350px",
|
||||
"panels": [
|
||||
{
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {
|
||||
"threshold1": null,
|
||||
"threshold1Color": "rgba(216, 200, 27, 0.27)",
|
||||
"threshold2": null,
|
||||
"threshold2Color": "rgba(234, 112, 112, 0.22)"
|
||||
},
|
||||
"id": 1,
|
||||
"isNew": true,
|
||||
"legend": {
|
||||
"avg": false,
|
||||
"current": false,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": false,
|
||||
"hideZero": false
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [],
|
||||
"span": 12,
|
||||
"stack": false,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "kube_deployment_status_replicas{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "current replicas",
|
||||
"refId": "A",
|
||||
"step": 30
|
||||
},
|
||||
{
|
||||
"expr": "kube_deployment_status_replicas_available{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "available",
|
||||
"refId": "B",
|
||||
"step": 30
|
||||
},
|
||||
{
|
||||
"expr": "kube_deployment_status_replicas_unavailable{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "unavailable",
|
||||
"refId": "C",
|
||||
"step": 30
|
||||
},
|
||||
{
|
||||
"expr": "kube_deployment_status_replicas_updated{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "updated",
|
||||
"refId": "D",
|
||||
"step": 30
|
||||
},
|
||||
{
|
||||
"expr": "kube_deployment_spec_replicas{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "desired",
|
||||
"refId": "E",
|
||||
"step": 30
|
||||
}
|
||||
],
|
||||
"thresholds": [],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "Replicas",
|
||||
"tooltip": {
|
||||
"msResolution": true,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "cumulative"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"mode": "time",
|
||||
"name": null,
|
||||
"show": true,
|
||||
"values": []
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "none",
|
||||
"label": "",
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "short",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": false
|
||||
}
|
||||
],
|
||||
"transparent": false
|
||||
}
|
||||
],
|
||||
"title": "New row",
|
||||
"showTitle": false
|
||||
}
|
||||
],
|
||||
"time": {
|
||||
"from": "now-6h",
|
||||
"to": "now"
|
||||
},
|
||||
"timepicker": {
|
||||
"refresh_intervals": [
|
||||
"5s",
|
||||
"10s",
|
||||
"30s",
|
||||
"1m",
|
||||
"5m",
|
||||
"15m",
|
||||
"30m",
|
||||
"1h",
|
||||
"2h",
|
||||
"1d"
|
||||
],
|
||||
"time_options": [
|
||||
"5m",
|
||||
"15m",
|
||||
"1h",
|
||||
"6h",
|
||||
"12h",
|
||||
"24h",
|
||||
"2d",
|
||||
"7d",
|
||||
"30d"
|
||||
]
|
||||
},
|
||||
"templating": {
|
||||
"list": [
|
||||
{
|
||||
"allValue": ".*",
|
||||
"current": {},
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"hide": 0,
|
||||
"includeAll": false,
|
||||
"label": "Namespace",
|
||||
"multi": false,
|
||||
"name": "deployment_namespace",
|
||||
"options": [],
|
||||
"query": "label_values(kube_deployment_metadata_generation, namespace)",
|
||||
"refresh": 1,
|
||||
"regex": "",
|
||||
"sort": 0,
|
||||
"tagValuesQuery": null,
|
||||
"tagsQuery": "",
|
||||
"type": "query",
|
||||
"useTags": false
|
||||
},
|
||||
{
|
||||
"allValue": null,
|
||||
"current": {},
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"hide": 0,
|
||||
"includeAll": false,
|
||||
"label": "Deployment",
|
||||
"multi": false,
|
||||
"name": "deployment_name",
|
||||
"options": [],
|
||||
"query": "label_values(kube_deployment_metadata_generation{namespace=\"$deployment_namespace\"}, deployment)",
|
||||
"refresh": 1,
|
||||
"regex": "",
|
||||
"sort": 0,
|
||||
"tagValuesQuery": "",
|
||||
"tagsQuery": "deployment",
|
||||
"type": "query",
|
||||
"useTags": false
|
||||
}
|
||||
]
|
||||
},
|
||||
"annotations": {
|
||||
"list": []
|
||||
},
|
||||
"schemaVersion": 12,
|
||||
"version": 2,
|
||||
"links": [],
|
||||
"gnetId": null
|
||||
},
|
||||
"inputs": [
|
||||
{
|
||||
"name": "DS_PROMETHEUS",
|
||||
"pluginId": "prometheus",
|
||||
"type": "datasource",
|
||||
"value": "prometheus"
|
||||
}
|
||||
],
|
||||
"overwrite": true
|
||||
}
|
@@ -1,409 +0,0 @@
|
||||
{
|
||||
"dashboard": {
|
||||
"__inputs": [
|
||||
{
|
||||
"description": "",
|
||||
"label": "prometheus",
|
||||
"name": "DS_PROMETHEUS",
|
||||
"pluginId": "prometheus",
|
||||
"pluginName": "Prometheus",
|
||||
"type": "datasource"
|
||||
}
|
||||
],
|
||||
"__requires": [
|
||||
{
|
||||
"id": "graph",
|
||||
"name": "Graph",
|
||||
"type": "panel",
|
||||
"version": ""
|
||||
},
|
||||
{
|
||||
"id": "grafana",
|
||||
"name": "Grafana",
|
||||
"type": "grafana",
|
||||
"version": "3.1.1"
|
||||
},
|
||||
{
|
||||
"id": "prometheus",
|
||||
"name": "Prometheus",
|
||||
"type": "datasource",
|
||||
"version": "1.0.0"
|
||||
}
|
||||
],
|
||||
"annotations": {
|
||||
"list": []
|
||||
},
|
||||
"editable": true,
|
||||
"gnetId": null,
|
||||
"hideControls": false,
|
||||
"id": null,
|
||||
"links": [],
|
||||
"rows": [
|
||||
{
|
||||
"collapse": false,
|
||||
"editable": true,
|
||||
"height": "250px",
|
||||
"panels": [
|
||||
{
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {
|
||||
"threshold1": null,
|
||||
"threshold1Color": "rgba(216, 200, 27, 0.27)",
|
||||
"threshold2": null,
|
||||
"threshold2Color": "rgba(234, 112, 112, 0.22)"
|
||||
},
|
||||
"id": 1,
|
||||
"isNew": true,
|
||||
"legend": {
|
||||
"alignAsTable": true,
|
||||
"avg": true,
|
||||
"current": true,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"rightSide": true,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": true
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [],
|
||||
"span": 12,
|
||||
"stack": false,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum by(container_name) (container_memory_usage_bytes{pod_name=\"$pod\", container_name=~\"$container\", container_name!=\"POD\"})",
|
||||
"interval": "10s",
|
||||
"intervalFactor": 1,
|
||||
"legendFormat": "Current: {{ container_name }}",
|
||||
"metric": "container_memory_usage_bytes",
|
||||
"refId": "A",
|
||||
"step": 10
|
||||
},
|
||||
{
|
||||
"expr": "kube_pod_container_requested_memory_bytes{pod=\"$pod\", container=~\"$container\"}",
|
||||
"interval": "10s",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "Requested: {{ container }}",
|
||||
"metric": "kube_pod_container_requested_memory_bytes",
|
||||
"refId": "B",
|
||||
"step": 20
|
||||
}
|
||||
],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "Memory Usage",
|
||||
"tooltip": {
|
||||
"msResolution": true,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "cumulative"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"show": true
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "bytes",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "short",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"title": "Row"
|
||||
},
|
||||
{
|
||||
"collapse": false,
|
||||
"editable": true,
|
||||
"height": "250px",
|
||||
"panels": [
|
||||
{
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {
|
||||
"threshold1": null,
|
||||
"threshold1Color": "rgba(216, 200, 27, 0.27)",
|
||||
"threshold2": null,
|
||||
"threshold2Color": "rgba(234, 112, 112, 0.22)"
|
||||
},
|
||||
"id": 2,
|
||||
"isNew": true,
|
||||
"legend": {
|
||||
"alignAsTable": true,
|
||||
"avg": true,
|
||||
"current": true,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"rightSide": true,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": true
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [],
|
||||
"span": 12,
|
||||
"stack": false,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum by (container_name)( rate(container_cpu_usage_seconds_total{image!=\"\",container_name!=\"POD\",pod_name=\"$pod\"}[1m] ) )",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "{{ container_name }}",
|
||||
"refId": "A",
|
||||
"step": 30
|
||||
}
|
||||
],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "CPU Usage",
|
||||
"tooltip": {
|
||||
"msResolution": true,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "cumulative"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"show": true
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "short",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "short",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"title": "New row"
|
||||
},
|
||||
{
|
||||
"collapse": false,
|
||||
"editable": true,
|
||||
"height": "250px",
|
||||
"panels": [
|
||||
{
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {
|
||||
"threshold1": null,
|
||||
"threshold1Color": "rgba(216, 200, 27, 0.27)",
|
||||
"threshold2": null,
|
||||
"threshold2Color": "rgba(234, 112, 112, 0.22)"
|
||||
},
|
||||
"id": 3,
|
||||
"isNew": true,
|
||||
"legend": {
|
||||
"alignAsTable": true,
|
||||
"avg": true,
|
||||
"current": true,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"rightSide": true,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": true
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [],
|
||||
"span": 12,
|
||||
"stack": false,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sort_desc(sum by (pod_name) (rate (container_network_receive_bytes_total{pod_name=\"$pod\"}[1m]) ))",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "{{ pod_name }}",
|
||||
"refId": "A",
|
||||
"step": 30
|
||||
}
|
||||
],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "Network I/O",
|
||||
"tooltip": {
|
||||
"msResolution": true,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "cumulative"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"show": true
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "bytes",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "short",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"title": "New row"
|
||||
}
|
||||
],
|
||||
"schemaVersion": 12,
|
||||
"sharedCrosshair": true,
|
||||
"style": "dark",
|
||||
"tags": [],
|
||||
"templating": {
|
||||
"list": [
|
||||
{
|
||||
"allValue": ".*",
|
||||
"current": {},
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"hide": 0,
|
||||
"includeAll": true,
|
||||
"label": "Namespace",
|
||||
"multi": false,
|
||||
"name": "namespace",
|
||||
"options": [],
|
||||
"query": "label_values(kube_pod_info, namespace)",
|
||||
"refresh": 1,
|
||||
"regex": "",
|
||||
"type": "query"
|
||||
},
|
||||
{
|
||||
"current": {},
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"hide": 0,
|
||||
"includeAll": false,
|
||||
"label": "Pod",
|
||||
"multi": false,
|
||||
"name": "pod",
|
||||
"options": [],
|
||||
"query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, pod)",
|
||||
"refresh": 1,
|
||||
"regex": "",
|
||||
"type": "query"
|
||||
},
|
||||
{
|
||||
"allValue": ".*",
|
||||
"current": {},
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"hide": 0,
|
||||
"includeAll": true,
|
||||
"label": "Container",
|
||||
"multi": false,
|
||||
"name": "container",
|
||||
"options": [],
|
||||
"query": "label_values(kube_pod_container_info{namespace=\"$namespace\", pod=\"$pod\"}, container)",
|
||||
"refresh": 1,
|
||||
"regex": "",
|
||||
"type": "query"
|
||||
}
|
||||
]
|
||||
},
|
||||
"time": {
|
||||
"from": "now-6h",
|
||||
"to": "now"
|
||||
},
|
||||
"timepicker": {
|
||||
"refresh_intervals": [
|
||||
"5s",
|
||||
"10s",
|
||||
"30s",
|
||||
"1m",
|
||||
"5m",
|
||||
"15m",
|
||||
"30m",
|
||||
"1h",
|
||||
"2h",
|
||||
"1d"
|
||||
],
|
||||
"time_options": [
|
||||
"5m",
|
||||
"15m",
|
||||
"1h",
|
||||
"6h",
|
||||
"12h",
|
||||
"24h",
|
||||
"2d",
|
||||
"7d",
|
||||
"30d"
|
||||
]
|
||||
},
|
||||
"timezone": "browser",
|
||||
"title": "Pods",
|
||||
"version": 26
|
||||
},
|
||||
"inputs": [
|
||||
{
|
||||
"name": "DS_PROMETHEUS",
|
||||
"pluginId": "prometheus",
|
||||
"type": "datasource",
|
||||
"value": "prometheus"
|
||||
}
|
||||
],
|
||||
"overwrite": true
|
||||
}
|
@@ -1,880 +0,0 @@
|
||||
{
|
||||
"dashboard":
|
||||
{
|
||||
"__inputs": [
|
||||
{
|
||||
"name": "DS_PROMETHEUS",
|
||||
"label": "prometheus",
|
||||
"description": "",
|
||||
"type": "datasource",
|
||||
"pluginId": "prometheus",
|
||||
"pluginName": "Prometheus"
|
||||
}
|
||||
],
|
||||
"__requires": [
|
||||
{
|
||||
"type": "grafana",
|
||||
"id": "grafana",
|
||||
"name": "Grafana",
|
||||
"version": "4.1.1"
|
||||
},
|
||||
{
|
||||
"type": "panel",
|
||||
"id": "graph",
|
||||
"name": "Graph",
|
||||
"version": ""
|
||||
},
|
||||
{
|
||||
"type": "datasource",
|
||||
"id": "prometheus",
|
||||
"name": "Prometheus",
|
||||
"version": "1.0.0"
|
||||
},
|
||||
{
|
||||
"type": "panel",
|
||||
"id": "singlestat",
|
||||
"name": "Singlestat",
|
||||
"version": ""
|
||||
}
|
||||
],
|
||||
"annotations": {
|
||||
"list": []
|
||||
},
|
||||
"description": "Dashboard to get an overview of one server",
|
||||
"editable": true,
|
||||
"gnetId": 22,
|
||||
"graphTooltip": 0,
|
||||
"hideControls": false,
|
||||
"id": null,
|
||||
"links": [],
|
||||
"refresh": false,
|
||||
"rows": [
|
||||
{
|
||||
"collapse": false,
|
||||
"height": "250px",
|
||||
"panels": [
|
||||
{
|
||||
"alerting": {},
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {},
|
||||
"id": 3,
|
||||
"legend": {
|
||||
"avg": false,
|
||||
"current": false,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": false
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [],
|
||||
"span": 6,
|
||||
"stack": false,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "100 - (avg by (cpu) (irate(node_cpu{mode=\"idle\", instance=\"$server\"}[5m])) * 100)",
|
||||
"hide": false,
|
||||
"intervalFactor": 10,
|
||||
"legendFormat": "{{cpu}}",
|
||||
"refId": "A",
|
||||
"step": 50
|
||||
}
|
||||
],
|
||||
"thresholds": [],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "Idle cpu",
|
||||
"tooltip": {
|
||||
"msResolution": false,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "cumulative"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"mode": "time",
|
||||
"name": null,
|
||||
"show": true,
|
||||
"values": []
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "percent",
|
||||
"label": "cpu usage",
|
||||
"logBase": 1,
|
||||
"max": 100,
|
||||
"min": 0,
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "short",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"alerting": {},
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {},
|
||||
"id": 9,
|
||||
"legend": {
|
||||
"avg": false,
|
||||
"current": false,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": false
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [],
|
||||
"span": 6,
|
||||
"stack": false,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "node_load1{instance=\"$server\"}",
|
||||
"intervalFactor": 4,
|
||||
"legendFormat": "load 1m",
|
||||
"refId": "A",
|
||||
"step": 20,
|
||||
"target": ""
|
||||
},
|
||||
{
|
||||
"expr": "node_load5{instance=\"$server\"}",
|
||||
"intervalFactor": 4,
|
||||
"legendFormat": "load 5m",
|
||||
"refId": "B",
|
||||
"step": 20,
|
||||
"target": ""
|
||||
},
|
||||
{
|
||||
"expr": "node_load15{instance=\"$server\"}",
|
||||
"intervalFactor": 4,
|
||||
"legendFormat": "load 15m",
|
||||
"refId": "C",
|
||||
"step": 20,
|
||||
"target": ""
|
||||
}
|
||||
],
|
||||
"thresholds": [],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "System load",
|
||||
"tooltip": {
|
||||
"msResolution": false,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "cumulative"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"mode": "time",
|
||||
"name": null,
|
||||
"show": true,
|
||||
"values": []
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "percentunit",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "short",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"repeat": null,
|
||||
"repeatIteration": null,
|
||||
"repeatRowId": null,
|
||||
"showTitle": false,
|
||||
"title": "New row",
|
||||
"titleSize": "h6"
|
||||
},
|
||||
{
|
||||
"collapse": false,
|
||||
"height": "250px",
|
||||
"panels": [
|
||||
{
|
||||
"alerting": {},
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {},
|
||||
"id": 4,
|
||||
"legend": {
|
||||
"alignAsTable": false,
|
||||
"avg": false,
|
||||
"current": false,
|
||||
"hideEmpty": false,
|
||||
"hideZero": false,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"rightSide": false,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": false
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [
|
||||
{
|
||||
"alias": "node_memory_SwapFree{instance=\"172.17.0.1:9100\",job=\"prometheus\"}",
|
||||
"yaxis": 2
|
||||
}
|
||||
],
|
||||
"span": 9,
|
||||
"stack": true,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "node_memory_MemTotal{instance=\"$server\"} - node_memory_MemFree{instance=\"$server\"} - node_memory_Buffers{instance=\"$server\"} - node_memory_Cached{instance=\"$server\"}",
|
||||
"hide": false,
|
||||
"interval": "",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "memory used",
|
||||
"metric": "",
|
||||
"refId": "C",
|
||||
"step": 4
|
||||
},
|
||||
{
|
||||
"expr": "node_memory_Buffers{instance=\"$server\"}",
|
||||
"interval": "",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "memory buffers",
|
||||
"metric": "",
|
||||
"refId": "E",
|
||||
"step": 4
|
||||
},
|
||||
{
|
||||
"expr": "node_memory_Cached{instance=\"$server\"}",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "memory cached",
|
||||
"metric": "",
|
||||
"refId": "F",
|
||||
"step": 4
|
||||
},
|
||||
{
|
||||
"expr": "node_memory_MemFree{instance=\"$server\"}",
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "memory free",
|
||||
"metric": "",
|
||||
"refId": "D",
|
||||
"step": 4
|
||||
}
|
||||
],
|
||||
"thresholds": [],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "Memory usage",
|
||||
"tooltip": {
|
||||
"msResolution": false,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "individual"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"mode": "time",
|
||||
"name": null,
|
||||
"show": true,
|
||||
"values": []
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "bytes",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": "0",
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "short",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cacheTimeout": null,
|
||||
"colorBackground": false,
|
||||
"colorValue": false,
|
||||
"colors": [
|
||||
"rgba(50, 172, 45, 0.97)",
|
||||
"rgba(237, 129, 40, 0.89)",
|
||||
"rgba(245, 54, 54, 0.9)"
|
||||
],
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"format": "percent",
|
||||
"gauge": {
|
||||
"maxValue": 100,
|
||||
"minValue": 0,
|
||||
"show": true,
|
||||
"thresholdLabels": false,
|
||||
"thresholdMarkers": true
|
||||
},
|
||||
"id": 5,
|
||||
"interval": null,
|
||||
"links": [],
|
||||
"mappingType": 1,
|
||||
"mappingTypes": [
|
||||
{
|
||||
"name": "value to text",
|
||||
"value": 1
|
||||
},
|
||||
{
|
||||
"name": "range to text",
|
||||
"value": 2
|
||||
}
|
||||
],
|
||||
"maxDataPoints": 100,
|
||||
"nullPointMode": "connected",
|
||||
"nullText": null,
|
||||
"postfix": "",
|
||||
"postfixFontSize": "50%",
|
||||
"prefix": "",
|
||||
"prefixFontSize": "50%",
|
||||
"rangeMaps": [
|
||||
{
|
||||
"from": "null",
|
||||
"text": "N/A",
|
||||
"to": "null"
|
||||
}
|
||||
],
|
||||
"span": 3,
|
||||
"sparkline": {
|
||||
"fillColor": "rgba(31, 118, 189, 0.18)",
|
||||
"full": false,
|
||||
"lineColor": "rgb(31, 120, 193)",
|
||||
"show": false
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"expr": "((node_memory_MemTotal{instance=\"$server\"} - node_memory_MemFree{instance=\"$server\"} - node_memory_Buffers{instance=\"$server\"} - node_memory_Cached{instance=\"$server\"}) / node_memory_MemTotal{instance=\"$server\"}) * 100",
|
||||
"intervalFactor": 2,
|
||||
"refId": "A",
|
||||
"step": 60,
|
||||
"target": ""
|
||||
}
|
||||
],
|
||||
"thresholds": "80, 90",
|
||||
"title": "Memory usage",
|
||||
"type": "singlestat",
|
||||
"valueFontSize": "80%",
|
||||
"valueMaps": [
|
||||
{
|
||||
"op": "=",
|
||||
"text": "N/A",
|
||||
"value": "null"
|
||||
}
|
||||
],
|
||||
"valueName": "avg"
|
||||
}
|
||||
],
|
||||
"repeat": null,
|
||||
"repeatIteration": null,
|
||||
"repeatRowId": null,
|
||||
"showTitle": false,
|
||||
"title": "New row",
|
||||
"titleSize": "h6"
|
||||
},
|
||||
{
|
||||
"collapse": false,
|
||||
"height": "250px",
|
||||
"panels": [
|
||||
{
|
||||
"alerting": {},
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {},
|
||||
"id": 6,
|
||||
"legend": {
|
||||
"avg": false,
|
||||
"current": false,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": false
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [
|
||||
{
|
||||
"alias": "read",
|
||||
"yaxis": 1
|
||||
},
|
||||
{
|
||||
"alias": "{instance=\"172.17.0.1:9100\"}",
|
||||
"yaxis": 2
|
||||
},
|
||||
{
|
||||
"alias": "io time",
|
||||
"yaxis": 2
|
||||
}
|
||||
],
|
||||
"span": 9,
|
||||
"stack": false,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum by (instance) (rate(node_disk_bytes_read{instance=\"$server\"}[2m]))",
|
||||
"hide": false,
|
||||
"intervalFactor": 4,
|
||||
"legendFormat": "read",
|
||||
"refId": "A",
|
||||
"step": 8,
|
||||
"target": ""
|
||||
},
|
||||
{
|
||||
"expr": "sum by (instance) (rate(node_disk_bytes_written{instance=\"$server\"}[2m]))",
|
||||
"intervalFactor": 4,
|
||||
"legendFormat": "written",
|
||||
"refId": "B",
|
||||
"step": 8
|
||||
},
|
||||
{
|
||||
"expr": "sum by (instance) (rate(node_disk_io_time_ms{instance=\"$server\"}[2m]))",
|
||||
"intervalFactor": 4,
|
||||
"legendFormat": "io time",
|
||||
"refId": "C",
|
||||
"step": 8
|
||||
}
|
||||
],
|
||||
"thresholds": [],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "Disk I/O",
|
||||
"tooltip": {
|
||||
"msResolution": false,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "cumulative"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"mode": "time",
|
||||
"name": null,
|
||||
"show": true,
|
||||
"values": []
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "bytes",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "ms",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cacheTimeout": null,
|
||||
"colorBackground": false,
|
||||
"colorValue": false,
|
||||
"colors": [
|
||||
"rgba(50, 172, 45, 0.97)",
|
||||
"rgba(237, 129, 40, 0.89)",
|
||||
"rgba(245, 54, 54, 0.9)"
|
||||
],
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"format": "percentunit",
|
||||
"gauge": {
|
||||
"maxValue": 1,
|
||||
"minValue": 0,
|
||||
"show": true,
|
||||
"thresholdLabels": false,
|
||||
"thresholdMarkers": true
|
||||
},
|
||||
"id": 7,
|
||||
"interval": null,
|
||||
"links": [],
|
||||
"mappingType": 1,
|
||||
"mappingTypes": [
|
||||
{
|
||||
"name": "value to text",
|
||||
"value": 1
|
||||
},
|
||||
{
|
||||
"name": "range to text",
|
||||
"value": 2
|
||||
}
|
||||
],
|
||||
"maxDataPoints": 100,
|
||||
"nullPointMode": "connected",
|
||||
"nullText": null,
|
||||
"postfix": "",
|
||||
"postfixFontSize": "50%",
|
||||
"prefix": "",
|
||||
"prefixFontSize": "50%",
|
||||
"rangeMaps": [
|
||||
{
|
||||
"from": "null",
|
||||
"text": "N/A",
|
||||
"to": "null"
|
||||
}
|
||||
],
|
||||
"span": 3,
|
||||
"sparkline": {
|
||||
"fillColor": "rgba(31, 118, 189, 0.18)",
|
||||
"full": false,
|
||||
"lineColor": "rgb(31, 120, 193)",
|
||||
"show": false
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"expr": "(sum(node_filesystem_size{device!=\"rootfs\",instance=\"$server\"}) - sum(node_filesystem_free{device!=\"rootfs\",instance=\"$server\"})) / sum(node_filesystem_size{device!=\"rootfs\",instance=\"$server\"})",
|
||||
"intervalFactor": 2,
|
||||
"refId": "A",
|
||||
"step": 60,
|
||||
"target": ""
|
||||
}
|
||||
],
|
||||
"thresholds": "0.75, 0.9",
|
||||
"title": "Disk space usage",
|
||||
"type": "singlestat",
|
||||
"valueFontSize": "80%",
|
||||
"valueMaps": [
|
||||
{
|
||||
"op": "=",
|
||||
"text": "N/A",
|
||||
"value": "null"
|
||||
}
|
||||
],
|
||||
"valueName": "current"
|
||||
}
|
||||
],
|
||||
"repeat": null,
|
||||
"repeatIteration": null,
|
||||
"repeatRowId": null,
|
||||
"showTitle": false,
|
||||
"title": "New row",
|
||||
"titleSize": "h6"
|
||||
},
|
||||
{
|
||||
"collapse": false,
|
||||
"height": "250px",
|
||||
"panels": [
|
||||
{
|
||||
"alerting": {},
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {},
|
||||
"id": 8,
|
||||
"legend": {
|
||||
"avg": false,
|
||||
"current": false,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": false
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [
|
||||
{
|
||||
"alias": "transmitted ",
|
||||
"yaxis": 2
|
||||
}
|
||||
],
|
||||
"span": 6,
|
||||
"stack": false,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "rate(node_network_receive_bytes{instance=\"$server\",device!~\"lo\"}[5m])",
|
||||
"hide": false,
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "{{device}}",
|
||||
"refId": "A",
|
||||
"step": 10,
|
||||
"target": ""
|
||||
}
|
||||
],
|
||||
"thresholds": [],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "Network received",
|
||||
"tooltip": {
|
||||
"msResolution": false,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "cumulative"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"mode": "time",
|
||||
"name": null,
|
||||
"show": true,
|
||||
"values": []
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "bytes",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "bytes",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"alerting": {},
|
||||
"aliasColors": {},
|
||||
"bars": false,
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"editable": true,
|
||||
"error": false,
|
||||
"fill": 1,
|
||||
"grid": {},
|
||||
"id": 10,
|
||||
"legend": {
|
||||
"avg": false,
|
||||
"current": false,
|
||||
"max": false,
|
||||
"min": false,
|
||||
"show": true,
|
||||
"total": false,
|
||||
"values": false
|
||||
},
|
||||
"lines": true,
|
||||
"linewidth": 2,
|
||||
"links": [],
|
||||
"nullPointMode": "connected",
|
||||
"percentage": false,
|
||||
"pointradius": 5,
|
||||
"points": false,
|
||||
"renderer": "flot",
|
||||
"seriesOverrides": [
|
||||
{
|
||||
"alias": "transmitted ",
|
||||
"yaxis": 2
|
||||
}
|
||||
],
|
||||
"span": 6,
|
||||
"stack": false,
|
||||
"steppedLine": false,
|
||||
"targets": [
|
||||
{
|
||||
"expr": "rate(node_network_transmit_bytes{instance=\"$server\",device!~\"lo\"}[5m])",
|
||||
"hide": false,
|
||||
"intervalFactor": 2,
|
||||
"legendFormat": "{{device}}",
|
||||
"refId": "B",
|
||||
"step": 10,
|
||||
"target": ""
|
||||
}
|
||||
],
|
||||
"thresholds": [],
|
||||
"timeFrom": null,
|
||||
"timeShift": null,
|
||||
"title": "Network transmitted",
|
||||
"tooltip": {
|
||||
"msResolution": false,
|
||||
"shared": true,
|
||||
"sort": 0,
|
||||
"value_type": "cumulative"
|
||||
},
|
||||
"type": "graph",
|
||||
"xaxis": {
|
||||
"mode": "time",
|
||||
"name": null,
|
||||
"show": true,
|
||||
"values": []
|
||||
},
|
||||
"yaxes": [
|
||||
{
|
||||
"format": "bytes",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
},
|
||||
{
|
||||
"format": "bytes",
|
||||
"label": null,
|
||||
"logBase": 1,
|
||||
"max": null,
|
||||
"min": null,
|
||||
"show": true
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"repeat": null,
|
||||
"repeatIteration": null,
|
||||
"repeatRowId": null,
|
||||
"showTitle": false,
|
||||
"title": "New row",
|
||||
"titleSize": "h6"
|
||||
}
|
||||
],
|
||||
"schemaVersion": 14,
|
||||
"style": "dark",
|
||||
"tags": [
|
||||
"prometheus"
|
||||
],
|
||||
"templating": {
|
||||
"list": [
|
||||
{
|
||||
"allValue": null,
|
||||
"current": {},
|
||||
"datasource": "${DS_PROMETHEUS}",
|
||||
"hide": 0,
|
||||
"includeAll": false,
|
||||
"label": null,
|
||||
"multi": false,
|
||||
"name": "server",
|
||||
"options": [],
|
||||
"query": "label_values(node_boot_time, instance)",
|
||||
"refresh": 1,
|
||||
"regex": "",
|
||||
"sort": 0,
|
||||
"tagValuesQuery": "",
|
||||
"tags": [],
|
||||
"tagsQuery": "",
|
||||
"type": "query",
|
||||
"useTags": false
|
||||
}
|
||||
]
|
||||
},
|
||||
"time": {
|
||||
"from": "now-1h",
|
||||
"to": "now"
|
||||
},
|
||||
"timepicker": {
|
||||
"refresh_intervals": [
|
||||
"5s",
|
||||
"10s",
|
||||
"30s",
|
||||
"1m",
|
||||
"5m",
|
||||
"15m",
|
||||
"30m",
|
||||
"1h",
|
||||
"2h",
|
||||
"1d"
|
||||
],
|
||||
"time_options": [
|
||||
"5m",
|
||||
"15m",
|
||||
"1h",
|
||||
"6h",
|
||||
"12h",
|
||||
"24h",
|
||||
"2d",
|
||||
"7d",
|
||||
"30d"
|
||||
]
|
||||
},
|
||||
"timezone": "browser",
|
||||
"title": "Nodes",
|
||||
"version": 1
|
||||
},
|
||||
"inputs": [
|
||||
{
|
||||
"name": "DS_PROMETHEUS",
|
||||
"pluginId": "prometheus",
|
||||
"type": "datasource",
|
||||
"value": "prometheus"
|
||||
}
|
||||
],
|
||||
"overwrite": true
|
||||
}
|
@@ -1,7 +0,0 @@
|
||||
{
|
||||
"access": "proxy",
|
||||
"basicAuth": false,
|
||||
"name": "prometheus",
|
||||
"type": "prometheus",
|
||||
"url": "http://prometheus-k8s.monitoring.svc:9090"
|
||||
}
|
@@ -1,121 +0,0 @@
|
||||
### General cluster availability ###
|
||||
|
||||
# alert if another failed peer will result in an unavailable cluster
|
||||
ALERT InsufficientPeers
|
||||
IF count(up{job="etcd-k8s"} == 0) > (count(up{job="etcd-k8s"}) / 2 - 1)
|
||||
FOR 3m
|
||||
LABELS {
|
||||
severity = "critical"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Etcd cluster small",
|
||||
description = "If one more etcd peer goes down the cluster will be unavailable",
|
||||
}
|
||||
|
||||
### HTTP requests alerts ###
|
||||
|
||||
# alert if more than 1% of requests to an HTTP endpoint have failed with a non 4xx response
|
||||
ALERT HighNumberOfFailedHTTPRequests
|
||||
IF sum by(method) (rate(etcd_http_failed_total{job="etcd-k8s", code!~"4[0-9]{2}"}[5m]))
|
||||
/ sum by(method) (rate(etcd_http_received_total{job="etcd-k8s"}[5m])) > 0.01
|
||||
FOR 10m
|
||||
LABELS {
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "a high number of HTTP requests are failing",
|
||||
description = "{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}",
|
||||
}
|
||||
|
||||
# alert if more than 5% of requests to an HTTP endpoint have failed with a non 4xx response
|
||||
ALERT HighNumberOfFailedHTTPRequests
|
||||
IF sum by(method) (rate(etcd_http_failed_total{job="etcd-k8s", code!~"4[0-9]{2}"}[5m]))
|
||||
/ sum by(method) (rate(etcd_http_received_total{job="etcd-k8s"}[5m])) > 0.05
|
||||
FOR 5m
|
||||
LABELS {
|
||||
severity = "critical"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "a high number of HTTP requests are failing",
|
||||
description = "{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}",
|
||||
}
|
||||
|
||||
# alert if 50% of requests get a 4xx response
|
||||
ALERT HighNumberOfFailedHTTPRequests
|
||||
IF sum by(method) (rate(etcd_http_failed_total{job="etcd-k8s", code=~"4[0-9]{2}"}[5m]))
|
||||
/ sum by(method) (rate(etcd_http_received_total{job="etcd-k8s"}[5m])) > 0.5
|
||||
FOR 10m
|
||||
LABELS {
|
||||
severity = "critical"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "a high number of HTTP requests are failing",
|
||||
description = "{{ $value }}% of requests for {{ $labels.method }} failed with 4xx responses on etcd instance {{ $labels.instance }}",
|
||||
}
|
||||
|
||||
# alert if the 99th percentile of HTTP requests take more than 150ms
|
||||
ALERT HTTPRequestsSlow
|
||||
IF histogram_quantile(0.99, rate(etcd_http_successful_duration_second_bucket[5m])) > 0.15
|
||||
FOR 10m
|
||||
LABELS {
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "slow HTTP requests",
|
||||
description = "on ectd instance {{ $labels.instance }} HTTP requests to {{ $label.method }} are slow",
|
||||
}
|
||||
|
||||
### File descriptor alerts ###
|
||||
|
||||
instance:fd_utilization = process_open_fds / process_max_fds
|
||||
|
||||
# alert if file descriptors are likely to exhaust within the next 4 hours
|
||||
ALERT FdExhaustionClose
|
||||
IF predict_linear(instance:fd_utilization[1h], 3600 * 4) > 1
|
||||
FOR 10m
|
||||
LABELS {
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "file descriptors soon exhausted",
|
||||
description = "{{ $labels.job }} instance {{ $labels.instance }} will exhaust in file descriptors soon",
|
||||
}
|
||||
|
||||
# alert if file descriptors are likely to exhaust within the next hour
|
||||
ALERT FdExhaustionClose
|
||||
IF predict_linear(instance:fd_utilization[10m], 3600) > 1
|
||||
FOR 10m
|
||||
LABELS {
|
||||
severity = "critical"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "file descriptors soon exhausted",
|
||||
description = "{{ $labels.job }} instance {{ $labels.instance }} will exhaust in file descriptors soon",
|
||||
}
|
||||
|
||||
### etcd proposal alerts ###
|
||||
|
||||
# alert if there are several failed proposals within an hour
|
||||
ALERT HighNumberOfFailedProposals
|
||||
IF increase(etcd_server_proposal_failed_total{job="etcd"}[1h]) > 5
|
||||
LABELS {
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "a high number of failed proposals within the etcd cluster are happening",
|
||||
description = "etcd instance {{ $labels.instance }} has seen {{ $value }} proposal failures within the last hour",
|
||||
}
|
||||
|
||||
### etcd disk io latency alerts ###
|
||||
|
||||
# alert if 99th percentile of fsync durations is higher than 500ms
|
||||
ALERT HighFsyncDurations
|
||||
IF histogram_quantile(0.99, rate(etcd_wal_fsync_durations_seconds_bucket[5m])) > 0.5
|
||||
FOR 10m
|
||||
LABELS {
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "high fsync durations",
|
||||
description = "ectd instance {{ $labels.instance }} fync durations are high",
|
||||
}
|
@@ -1,388 +0,0 @@
|
||||
# NOTE: These rules were kindly contributed by the SoundCloud engineering team.
|
||||
|
||||
### Container resources ###
|
||||
|
||||
cluster_namespace_controller_pod_container:spec_memory_limit_bytes =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
container_spec_memory_limit_bytes{container_name!=""},
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:spec_cpu_shares =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
container_spec_cpu_shares{container_name!=""},
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:cpu_usage:rate =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
irate(
|
||||
container_cpu_usage_seconds_total{container_name!=""}[5m]
|
||||
),
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:memory_usage:bytes =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
container_memory_usage_bytes{container_name!=""},
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:memory_working_set:bytes =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
container_memory_working_set_bytes{container_name!=""},
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:memory_rss:bytes =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
container_memory_rss{container_name!=""},
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:memory_cache:bytes =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
container_memory_cache{container_name!=""},
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:disk_usage:bytes =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
container_disk_usage_bytes{container_name!=""},
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:memory_pagefaults:rate =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name,scope,type) (
|
||||
label_replace(
|
||||
irate(
|
||||
container_memory_failures_total{container_name!=""}[5m]
|
||||
),
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:memory_oom:rate =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name,scope,type) (
|
||||
label_replace(
|
||||
irate(
|
||||
container_memory_failcnt{container_name!=""}[5m]
|
||||
),
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
### Cluster resources ###
|
||||
|
||||
cluster:memory_allocation:percent =
|
||||
100 * sum by (cluster) (
|
||||
container_spec_memory_limit_bytes{pod_name!=""}
|
||||
) / sum by (cluster) (
|
||||
machine_memory_bytes
|
||||
)
|
||||
|
||||
cluster:memory_used:percent =
|
||||
100 * sum by (cluster) (
|
||||
container_memory_usage_bytes{pod_name!=""}
|
||||
) / sum by (cluster) (
|
||||
machine_memory_bytes
|
||||
)
|
||||
|
||||
cluster:cpu_allocation:percent =
|
||||
100 * sum by (cluster) (
|
||||
container_spec_cpu_shares{pod_name!=""}
|
||||
) / sum by (cluster) (
|
||||
container_spec_cpu_shares{id="/"} * on(cluster,instance) machine_cpu_cores
|
||||
)
|
||||
|
||||
cluster:node_cpu_use:percent =
|
||||
100 * sum by (cluster) (
|
||||
rate(node_cpu{mode!="idle"}[5m])
|
||||
) / sum by (cluster) (
|
||||
machine_cpu_cores
|
||||
)
|
||||
|
||||
### API latency ###
|
||||
|
||||
# Raw metrics are in microseconds. Convert to seconds.
|
||||
cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.99"} =
|
||||
histogram_quantile(
|
||||
0.99,
|
||||
sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket)
|
||||
) / 1e6
|
||||
cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.9"} =
|
||||
histogram_quantile(
|
||||
0.9,
|
||||
sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket)
|
||||
) / 1e6
|
||||
cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.5"} =
|
||||
histogram_quantile(
|
||||
0.5,
|
||||
sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket)
|
||||
) / 1e6
|
||||
|
||||
### Scheduling latency ###
|
||||
|
||||
cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.99"} =
|
||||
histogram_quantile(0.99,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6
|
||||
cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.9"} =
|
||||
histogram_quantile(0.9,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6
|
||||
cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.5"} =
|
||||
histogram_quantile(0.5,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6
|
||||
|
||||
cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.99"} =
|
||||
histogram_quantile(0.99,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6
|
||||
cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.9"} =
|
||||
histogram_quantile(0.9,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6
|
||||
cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.5"} =
|
||||
histogram_quantile(0.5,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6
|
||||
|
||||
cluster:scheduler_binding_latency:quantile_seconds{quantile="0.99"} =
|
||||
histogram_quantile(0.99,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6
|
||||
cluster:scheduler_binding_latency:quantile_seconds{quantile="0.9"} =
|
||||
histogram_quantile(0.9,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6
|
||||
cluster:scheduler_binding_latency:quantile_seconds{quantile="0.5"} =
|
||||
histogram_quantile(0.5,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6
|
||||
|
||||
ALERT K8SNodeDown
|
||||
IF up{job="kubelet"} == 0
|
||||
FOR 1h
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Kubelet cannot be scraped",
|
||||
description = "Prometheus could not scrape a {{ $labels.job }} for more than one hour",
|
||||
}
|
||||
|
||||
ALERT K8SNodeNotReady
|
||||
IF kube_node_status_ready{condition="true"} == 0
|
||||
FOR 1h
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning",
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Node status is NotReady",
|
||||
description = "The Kubelet on {{ $labels.node }} has not checked in with the API, or has set itself to NotReady, for more than an hour",
|
||||
}
|
||||
|
||||
ALERT K8SManyNodesNotReady
|
||||
IF
|
||||
count by (cluster) (kube_node_status_ready{condition="true"} == 0) > 1
|
||||
AND
|
||||
(
|
||||
count by (cluster) (kube_node_status_ready{condition="true"} == 0)
|
||||
/
|
||||
count by (cluster) (kube_node_status_ready{condition="true"})
|
||||
) > 0.2
|
||||
FOR 1m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "critical",
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Many K8s nodes are Not Ready",
|
||||
description = "{{ $value }} K8s nodes (more than 10% of cluster {{ $labels.cluster }}) are in the NotReady state.",
|
||||
}
|
||||
|
||||
ALERT K8SKubeletNodeExporterDown
|
||||
IF up{job="node-exporter"} == 0
|
||||
FOR 15m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Kubelet node_exporter cannot be scraped",
|
||||
description = "Prometheus could not scrape a {{ $labels.job }} for more than one hour.",
|
||||
}
|
||||
|
||||
ALERT K8SKubeletDown
|
||||
IF absent(up{job="kubelet"}) or count by (cluster) (up{job="kubelet"} == 0) / count by (cluster) (up{job="kubelet"}) > 0.1
|
||||
FOR 1h
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "critical"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Many Kubelets cannot be scraped",
|
||||
description = "Prometheus failed to scrape more than 10% of kubelets, or all Kubelets have disappeared from service discovery.",
|
||||
}
|
||||
|
||||
ALERT K8SApiserverDown
|
||||
IF up{job="kubernetes"} == 0
|
||||
FOR 15m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "API server unreachable",
|
||||
description = "An API server could not be scraped.",
|
||||
}
|
||||
|
||||
# Disable for non HA kubernetes setups.
|
||||
ALERT K8SApiserverDown
|
||||
IF absent({job="kubernetes"}) or (count by(cluster) (up{job="kubernetes"} == 1) < count by(cluster) (up{job="kubernetes"}))
|
||||
FOR 5m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "critical"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "API server unreachable",
|
||||
description = "Prometheus failed to scrape multiple API servers, or all API servers have disappeared from service discovery.",
|
||||
}
|
||||
|
||||
ALERT K8SSchedulerDown
|
||||
IF absent(up{job="kube-scheduler"}) or (count by(cluster) (up{job="kube-scheduler"} == 1) == 0)
|
||||
FOR 5m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "critical",
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Scheduler is down",
|
||||
description = "There is no running K8S scheduler. New pods are not being assigned to nodes.",
|
||||
}
|
||||
|
||||
ALERT K8SControllerManagerDown
|
||||
IF absent(up{job="kube-controller-manager"}) or (count by(cluster) (up{job="kube-controller-manager"} == 1) == 0)
|
||||
FOR 5m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "critical",
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Controller manager is down",
|
||||
description = "There is no running K8S controller manager. Deployments and replication controllers are not making progress.",
|
||||
}
|
||||
|
||||
ALERT K8SConntrackTableFull
|
||||
IF 100*node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 50
|
||||
FOR 10m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Number of tracked connections is near the limit",
|
||||
description = "The nf_conntrack table is {{ $value }}% full.",
|
||||
}
|
||||
|
||||
ALERT K8SConntrackTableFull
|
||||
IF 100*node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 90
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "critical"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Number of tracked connections is near the limit",
|
||||
description = "The nf_conntrack table is {{ $value }}% full.",
|
||||
}
|
||||
|
||||
# To catch the conntrack sysctl de-tuning when it happens
|
||||
ALERT K8SConntrackTuningMissing
|
||||
IF node_nf_conntrack_udp_timeout > 10
|
||||
FOR 10m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning",
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Node does not have the correct conntrack tunings",
|
||||
description = "Nodes keep un-setting the correct tunings, investigate when it happens.",
|
||||
}
|
||||
|
||||
ALERT K8STooManyOpenFiles
|
||||
IF 100*process_open_fds{job=~"kubelet|kubernetes"} / process_max_fds > 50
|
||||
FOR 10m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "{{ $labels.job }} has too many open file descriptors",
|
||||
description = "{{ $labels.node }} is using {{ $value }}% of the available file/socket descriptors.",
|
||||
}
|
||||
|
||||
ALERT K8STooManyOpenFiles
|
||||
IF 100*process_open_fds{job=~"kubelet|kubernetes"} / process_max_fds > 80
|
||||
FOR 10m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "critical"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "{{ $labels.job }} has too many open file descriptors",
|
||||
description = "{{ $labels.node }} is using {{ $value }}% of the available file/socket descriptors.",
|
||||
}
|
||||
|
||||
# Some verbs excluded because they are expected to be long-lasting:
|
||||
# WATCHLIST is long-poll, CONNECT is `kubectl exec`.
|
||||
ALERT K8SApiServerLatency
|
||||
IF histogram_quantile(
|
||||
0.99,
|
||||
sum without (instance,node,resource) (apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH"})
|
||||
) / 1e6 > 1.0
|
||||
FOR 10m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Kubernetes apiserver latency is high",
|
||||
description = "99th percentile Latency for {{ $labels.verb }} requests to the kube-apiserver is higher than 1s.",
|
||||
}
|
||||
|
||||
ALERT K8SApiServerEtcdAccessLatency
|
||||
IF etcd_request_latencies_summary{quantile="0.99"} / 1e6 > 1.0
|
||||
FOR 15m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Access to etcd is slow",
|
||||
description = "99th percentile latency for apiserver to access etcd is higher than 1s.",
|
||||
}
|
||||
|
||||
ALERT K8SKubeletTooManyPods
|
||||
IF kubelet_running_pod_count > 100
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning",
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Kubelet is close to pod limit",
|
||||
description = "Kubelet {{$labels.instance}} is running {{$value}} pods, close to the limit of 110",
|
||||
}
|
||||
|
@@ -1,35 +0,0 @@
|
||||
# Adding kube-prometheus to [KOPS](https://github.com/kubernetes/kops) on AWS 1.5.x
|
||||
|
||||
|
||||
## Prerequisites
|
||||
|
||||
A running Kubernetes cluster created with [KOPS](https://github.com/kubernetes/kops).
|
||||
|
||||
These instructions have currently been tested with **topology=public** on AWS with KOPS 1.5.1 and Kubernetes 1.5.x
|
||||
|
||||
## Open AWS Security Groups:
|
||||
1. Open port 9100 on the masters security group to the nodes security group
|
||||
1. Open ports 10250-10252 on the masters security group to the nodes security group.
|
||||
|
||||
Example script below requires $AWS\_DEFAULT_PROFILE and [$NAME](https://github.com/kubernetes/kops/blob/master/docs/aws.md#prepare-local-environment)
|
||||
|
||||
```bash
|
||||
MASTER_SG=$(aws --profile ${AWS_DEFAULT_PROFILE} ec2 describe-security-groups --filters "Name=tag:Name,Values=masters.$NAME" --query "SecurityGroups[*].GroupId[]" --output=text)
|
||||
NODES_SG=$(aws --profile ${AWS_DEFAULT_PROFILE} ec2 describe-security-groups --filters "Name=tag:Name,Values=nodes.$NAME" --query "SecurityGroups[*].GroupId[]" --output=text)
|
||||
aws --profile ${AWS_DEFAULT_PROFILE} ec2 authorize-security-group-ingress --group-id $MASTER_SG --protocol tcp --port 9100 --source-group $NODES_SG
|
||||
aws --profile ${AWS_DEFAULT_PROFILE} ec2 authorize-security-group-ingress --group-id $MASTER_SG --protocol tcp --port 10250-10252 --source-group $NODES_SG
|
||||
```
|
||||
|
||||
## Adding kube-prometheus
|
||||
Following the instructions in the [README](https://github.com/coreos/kube-prometheus/blob/master/README.md):
|
||||
|
||||
Example:
|
||||
|
||||
```bash
|
||||
git clone -b master https://github.com/coreos/kube-prometheus.git kube-prometheus-temp;
|
||||
cd kube-prometheus-temp
|
||||
./hack/cluster-monitoring/deploy
|
||||
kubectl -n kube-system create -f manifests/k8s/self-hosted/
|
||||
cd -
|
||||
rm -rf kube-prometheus-temp
|
||||
```
|
@@ -1,41 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
|
||||
if [ -z "${KUBECONFIG}" ]; then
|
||||
export KUBECONFIG=~/.kube/config
|
||||
fi
|
||||
|
||||
if [ -z "${NAMESPACE}" ]; then
|
||||
NAMESPACE=monitoring
|
||||
fi
|
||||
|
||||
kubectl create namespace "$NAMESPACE"
|
||||
|
||||
kctl() {
|
||||
kubectl --namespace "$NAMESPACE" "$@"
|
||||
}
|
||||
|
||||
kctl apply -f manifests/prometheus-operator.yaml
|
||||
|
||||
# Wait for TPRs to be ready.
|
||||
printf "Waiting for Operator to register third party objects..."
|
||||
until kctl get servicemonitor > /dev/null 2>&1; do sleep 1; printf "."; done
|
||||
until kctl get prometheus > /dev/null 2>&1; do sleep 1; printf "."; done
|
||||
until kctl get alertmanager > /dev/null 2>&1; do sleep 1; printf "."; done
|
||||
echo "done!"
|
||||
|
||||
kctl apply -f manifests/exporters
|
||||
kctl apply -f manifests/grafana
|
||||
|
||||
kctl apply -f manifests/prometheus/prometheus-k8s-rules.yaml
|
||||
kctl apply -f manifests/prometheus/prometheus-k8s-service.yaml
|
||||
|
||||
kctl apply -f manifests/alertmanager/alertmanager-config.yaml
|
||||
kctl apply -f manifests/alertmanager/alertmanager-service.yaml
|
||||
|
||||
# `kubectl apply` is currently not working for third party resources so we are
|
||||
# using `kubectl create` here for the time being.
|
||||
# (https://github.com/kubernetes/kubernetes/issues/29542)
|
||||
kctl create -f manifests/prometheus/prometheus-k8s-servicemonitors.yaml
|
||||
kctl create -f manifests/prometheus/prometheus-k8s.yaml
|
||||
kctl create -f manifests/alertmanager/alertmanager.yaml
|
||||
|
@@ -1,6 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
|
||||
hack/cluster-monitoring/deploy
|
||||
|
||||
awk 'FNR==1{print "---"}1' manifests/k8s/minikube/*.yaml | sed s/MINIKUBE_IP/`minikube ip`/g | kubectl --namespace=kube-system apply -f -
|
||||
|
@@ -1,6 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
|
||||
hack/cluster-monitoring/teardown
|
||||
|
||||
kubectl --namespace=kube-system delete -f manifests/k8s/minikube
|
||||
|
@@ -1,6 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
|
||||
hack/cluster-monitoring/deploy
|
||||
|
||||
kubectl --namespace=kube-system apply -f manifests/k8s/self-hosted
|
||||
|
@@ -1,6 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
|
||||
hack/cluster-monitoring/teardown
|
||||
|
||||
kubectl --namespace=kube-system delete -f manifests/k8s/self-hosted
|
||||
|
@@ -1,24 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
|
||||
if [ -z "${KUBECONFIG}" ]; then
|
||||
export KUBECONFIG=~/.kube/config
|
||||
fi
|
||||
|
||||
if [ -z "${NAMESPACE}" ]; then
|
||||
NAMESPACE=monitoring
|
||||
fi
|
||||
|
||||
kctl() {
|
||||
kubectl --namespace "$NAMESPACE" "$@"
|
||||
}
|
||||
|
||||
kctl delete -f manifests/exporters
|
||||
kctl delete -f manifests/grafana
|
||||
kctl delete -f manifests/prometheus
|
||||
kctl delete -f manifests/alertmanager
|
||||
|
||||
# Hack: wait a bit to let the controller delete the deployed Prometheus server.
|
||||
sleep 5
|
||||
|
||||
kctl delete -f manifests/prometheus-operator.yaml
|
||||
|
@@ -1,19 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
|
||||
if [ -z "${KUBECONFIG}" ]; then
|
||||
KUBECONFIG=~/.kube/config
|
||||
fi
|
||||
|
||||
if [ -z "${NAMESPACE}" ]; then
|
||||
NAMESPACE=default
|
||||
fi
|
||||
|
||||
kubectl --namespace "$NAMESPACE" --kubeconfig="$KUBECONFIG" apply -f manifests/examples/example-app/prometheus-frontend-svc.yaml
|
||||
kubectl --namespace "$NAMESPACE" --kubeconfig="$KUBECONFIG" apply -f manifests/examples/example-app/example-app.yaml
|
||||
|
||||
# `kubectl apply` is currently not working for third party resources so we are
|
||||
# using `kubectl create` here for the time being.
|
||||
# (https://github.com/kubernetes/kubernetes/issues/29542)
|
||||
kubectl --namespace "$NAMESPACE" --kubeconfig="$KUBECONFIG" create -f manifests/examples/example-app/prometheus-frontend.yaml
|
||||
kubectl --namespace "$NAMESPACE" --kubeconfig="$KUBECONFIG" create -f manifests/examples/example-app/servicemonitor-frontend.yaml
|
||||
|
@@ -1,12 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
|
||||
if [ -z "${KUBECONFIG}" ]; then
|
||||
KUBECONFIG=~/.kube/config
|
||||
fi
|
||||
|
||||
if [ -z "${NAMESPACE}" ]; then
|
||||
NAMESPACE=default
|
||||
fi
|
||||
|
||||
kubectl --namespace "$NAMESPACE" --kubeconfig="$KUBECONFIG" delete -f manifests/examples/example-app
|
||||
|
@@ -1,8 +0,0 @@
|
||||
#!/bin/bash
|
||||
|
||||
# Generate Alert Rules ConfigMap
|
||||
kubectl create configmap --dry-run=true prometheus-k8s-rules --from-file=assets/prometheus/rules/ -oyaml > manifests/prometheus/prometheus-k8s-rules.yaml
|
||||
|
||||
# Generate Dashboard ConfigMap
|
||||
kubectl create configmap --dry-run=true grafana-dashboards --from-file=assets/grafana/ -oyaml > manifests/grafana/grafana-dashboards.yaml
|
||||
|
@@ -1,50 +0,0 @@
|
||||
#!/bin/bash -eu
|
||||
|
||||
# Intended usage:
|
||||
# * Edit dashboard in Grafana (you need to login first with admin/admin
|
||||
# login/password).
|
||||
# * Save dashboard in Grafana to check is specification is correct.
|
||||
# Looks like this is the only way to check is dashboard specification
|
||||
# has error.
|
||||
# * Download dashboard specification as JSON file in Grafana:
|
||||
# Share -> Export -> Save to file.
|
||||
# * Wrap dashboard specification to make it digestable by kube-prometheus:
|
||||
# ./hack/scripts/wrap-dashboard.sh Nodes-1488465802729.json
|
||||
# * Replace dashboard specification:
|
||||
# mv Nodes-1488465802729.json assets/grafana/node-dashboard.json
|
||||
# * Regenerate Grafana configmap:
|
||||
# ./hack/scripts/generate-configmaps.sh
|
||||
# * Apply new configmap:
|
||||
# kubectl -n monitoring apply -f manifests/grafana/grafana-cm.yaml
|
||||
|
||||
if [ "$#" -ne 1 ]; then
|
||||
echo "Usage: $0 path-to-dashboard.json"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
json=$1
|
||||
temp=$(tempfile -m 0644)
|
||||
|
||||
cat >> $temp <<EOF
|
||||
{
|
||||
"dashboard":
|
||||
EOF
|
||||
|
||||
cat $json >> $temp
|
||||
|
||||
cat >> $temp <<EOF
|
||||
,
|
||||
"inputs": [
|
||||
{
|
||||
"name": "DS_PROMETHEUS",
|
||||
"pluginId": "prometheus",
|
||||
"type": "datasource",
|
||||
"value": "prometheus"
|
||||
}
|
||||
],
|
||||
"overwrite": true
|
||||
}
|
||||
EOF
|
||||
|
||||
mv $temp $json
|
||||
|
@@ -1,18 +0,0 @@
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: alertmanager-main
|
||||
data:
|
||||
alertmanager.yaml: |-
|
||||
global:
|
||||
resolve_timeout: 5m
|
||||
route:
|
||||
group_by: ['job']
|
||||
group_wait: 30s
|
||||
group_interval: 5m
|
||||
repeat_interval: 12h
|
||||
receiver: 'webhook'
|
||||
receivers:
|
||||
- name: 'webhook'
|
||||
webhook_configs:
|
||||
- url: 'http://alertmanagerwh:30500/'
|
@@ -1,14 +0,0 @@
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: alertmanager-main
|
||||
spec:
|
||||
type: NodePort
|
||||
ports:
|
||||
- name: web
|
||||
nodePort: 30903
|
||||
port: 9093
|
||||
protocol: TCP
|
||||
targetPort: web
|
||||
selector:
|
||||
alertmanager: main
|
@@ -1,9 +0,0 @@
|
||||
apiVersion: "monitoring.coreos.com/v1alpha1"
|
||||
kind: "Alertmanager"
|
||||
metadata:
|
||||
name: "main"
|
||||
labels:
|
||||
alertmanager: "main"
|
||||
spec:
|
||||
replicas: 3
|
||||
version: v0.5.1
|
@@ -1,28 +0,0 @@
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: etcd-k8s
|
||||
labels:
|
||||
k8s-app: etcd
|
||||
spec:
|
||||
type: ClusterIP
|
||||
clusterIP: None
|
||||
ports:
|
||||
- name: api
|
||||
port: 2379
|
||||
protocol: TCP
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Endpoints
|
||||
metadata:
|
||||
name: etcd-k8s
|
||||
labels:
|
||||
k8s-app: etcd
|
||||
subsets:
|
||||
- addresses:
|
||||
- ip: 10.142.0.2
|
||||
nodeName: 10.142.0.2
|
||||
ports:
|
||||
- name: api
|
||||
port: 2379
|
||||
protocol: TCP
|
@@ -1,28 +0,0 @@
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: etcd-k8s
|
||||
labels:
|
||||
k8s-app: etcd
|
||||
spec:
|
||||
type: ClusterIP
|
||||
clusterIP: None
|
||||
ports:
|
||||
- name: api
|
||||
port: 2379
|
||||
protocol: TCP
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Endpoints
|
||||
metadata:
|
||||
name: etcd-k8s
|
||||
labels:
|
||||
k8s-app: etcd
|
||||
subsets:
|
||||
- addresses:
|
||||
- ip: 172.17.4.51
|
||||
nodeName: 172.17.4.51
|
||||
ports:
|
||||
- name: api
|
||||
port: 2379
|
||||
protocol: TCP
|
@@ -1,34 +0,0 @@
|
||||
kind: Service
|
||||
apiVersion: v1
|
||||
metadata:
|
||||
name: example-app
|
||||
labels:
|
||||
tier: frontend
|
||||
spec:
|
||||
selector:
|
||||
app: example-app
|
||||
ports:
|
||||
- name: web
|
||||
protocol: TCP
|
||||
port: 8080
|
||||
targetPort: web
|
||||
---
|
||||
apiVersion: extensions/v1beta1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: example-app
|
||||
spec:
|
||||
replicas: 4
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: example-app
|
||||
version: 1.1.3
|
||||
spec:
|
||||
containers:
|
||||
- name: example-app
|
||||
image: quay.io/fabxc/prometheus_demo_service
|
||||
ports:
|
||||
- name: web
|
||||
containerPort: 8080
|
||||
protocol: TCP
|
@@ -1,14 +0,0 @@
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: prometheus-frontend
|
||||
spec:
|
||||
type: NodePort
|
||||
ports:
|
||||
- name: web
|
||||
nodePort: 30100
|
||||
port: 9090
|
||||
protocol: TCP
|
||||
targetPort: web
|
||||
selector:
|
||||
prometheus: prometheus-frontend
|
@@ -1,24 +0,0 @@
|
||||
apiVersion: monitoring.coreos.com/v1alpha1
|
||||
kind: Prometheus
|
||||
metadata:
|
||||
name: prometheus-frontend
|
||||
namespace: default
|
||||
labels:
|
||||
prometheus: frontend
|
||||
spec:
|
||||
version: v1.5.2
|
||||
serviceMonitorSelector:
|
||||
matchLabels:
|
||||
tier: frontend
|
||||
resources:
|
||||
requests:
|
||||
# 2Gi is default, but won't schedule if you don't have a node with >2Gi
|
||||
# memory. Modify based on your target and time-series count for
|
||||
# production use. This value is mainly meant for demonstration/testing
|
||||
# purposes.
|
||||
memory: 400Mi
|
||||
alerting:
|
||||
alertmanagers:
|
||||
- namespace: monitoring
|
||||
name: alertmanager-main
|
||||
port: web
|
@@ -1,13 +0,0 @@
|
||||
apiVersion: monitoring.coreos.com/v1alpha1
|
||||
kind: ServiceMonitor
|
||||
metadata:
|
||||
name: frontend
|
||||
labels:
|
||||
tier: frontend
|
||||
spec:
|
||||
selector:
|
||||
matchLabels:
|
||||
tier: frontend
|
||||
endpoints:
|
||||
- port: web
|
||||
interval: 10s
|
@@ -1,25 +0,0 @@
|
||||
apiVersion: extensions/v1beta1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: kube-state-metrics
|
||||
spec:
|
||||
replicas: 1
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: kube-state-metrics
|
||||
spec:
|
||||
containers:
|
||||
- name: kube-state-metrics
|
||||
image: gcr.io/google_containers/kube-state-metrics:v0.4.1
|
||||
ports:
|
||||
- name: metrics
|
||||
containerPort: 8080
|
||||
resources:
|
||||
requests:
|
||||
memory: 30Mi
|
||||
cpu: 100m
|
||||
limits:
|
||||
memory: 50Mi
|
||||
cpu: 200m
|
||||
|
@@ -1,18 +0,0 @@
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
labels:
|
||||
app: kube-state-metrics
|
||||
k8s-app: kube-state-metrics
|
||||
annotations:
|
||||
alpha.monitoring.coreos.com/non-namespaced: "true"
|
||||
name: kube-state-metrics
|
||||
spec:
|
||||
ports:
|
||||
- name: http-metrics
|
||||
port: 8080
|
||||
targetPort: metrics
|
||||
protocol: TCP
|
||||
selector:
|
||||
app: kube-state-metrics
|
||||
|
@@ -1,45 +0,0 @@
|
||||
apiVersion: extensions/v1beta1
|
||||
kind: DaemonSet
|
||||
metadata:
|
||||
name: node-exporter
|
||||
spec:
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: node-exporter
|
||||
name: node-exporter
|
||||
spec:
|
||||
hostNetwork: true
|
||||
hostPID: true
|
||||
containers:
|
||||
- image: quay.io/prometheus/node-exporter:v0.13.0
|
||||
args:
|
||||
- "-collector.procfs=/host/proc"
|
||||
- "-collector.sysfs=/host/sys"
|
||||
name: node-exporter
|
||||
ports:
|
||||
- containerPort: 9100
|
||||
hostPort: 9100
|
||||
name: scrape
|
||||
resources:
|
||||
requests:
|
||||
memory: 30Mi
|
||||
cpu: 100m
|
||||
limits:
|
||||
memory: 50Mi
|
||||
cpu: 200m
|
||||
volumeMounts:
|
||||
- name: proc
|
||||
readOnly: true
|
||||
mountPath: /host/proc
|
||||
- name: sys
|
||||
readOnly: true
|
||||
mountPath: /host/sys
|
||||
volumes:
|
||||
- name: proc
|
||||
hostPath:
|
||||
path: /proc
|
||||
- name: sys
|
||||
hostPath:
|
||||
path: /sys
|
||||
|
@@ -1,17 +0,0 @@
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
labels:
|
||||
app: node-exporter
|
||||
k8s-app: node-exporter
|
||||
name: node-exporter
|
||||
spec:
|
||||
type: ClusterIP
|
||||
clusterIP: None
|
||||
ports:
|
||||
- name: http-metrics
|
||||
port: 9100
|
||||
protocol: TCP
|
||||
selector:
|
||||
app: node-exporter
|
||||
|
File diff suppressed because it is too large
Load Diff
@@ -1,56 +0,0 @@
|
||||
apiVersion: extensions/v1beta1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: grafana
|
||||
spec:
|
||||
replicas: 1
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: grafana
|
||||
spec:
|
||||
containers:
|
||||
- name: grafana
|
||||
image: grafana/grafana:4.1.1
|
||||
env:
|
||||
- name: GF_AUTH_BASIC_ENABLED
|
||||
value: "true"
|
||||
- name: GF_AUTH_ANONYMOUS_ENABLED
|
||||
value: "true"
|
||||
volumeMounts:
|
||||
- name: grafana-storage
|
||||
mountPath: /var/grafana-storage
|
||||
ports:
|
||||
- name: web
|
||||
containerPort: 3000
|
||||
resources:
|
||||
requests:
|
||||
memory: 100Mi
|
||||
cpu: 100m
|
||||
limits:
|
||||
memory: 300Mi
|
||||
cpu: 300m
|
||||
- name: grafana-watcher
|
||||
image: quay.io/coreos/grafana-watcher:latest
|
||||
args:
|
||||
- '--watch-dir=/var/grafana-dashboards'
|
||||
- '--grafana-url=http://admin:admin@localhost:3000'
|
||||
volumeMounts:
|
||||
- name: grafana-dashboards
|
||||
mountPath: /var/grafana-dashboards
|
||||
resources:
|
||||
requests:
|
||||
memory: "16Mi"
|
||||
cpu: "50m"
|
||||
limits:
|
||||
memory: "32Mi"
|
||||
cpu: "100m"
|
||||
volumeMounts:
|
||||
- name: grafana-dashboards
|
||||
mountPath: /var/grafana-dashboards
|
||||
volumes:
|
||||
- name: grafana-storage
|
||||
emptyDir: {}
|
||||
- name: grafana-dashboards
|
||||
configMap:
|
||||
name: grafana-dashboards
|
@@ -1,15 +0,0 @@
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: grafana
|
||||
labels:
|
||||
app: grafana
|
||||
spec:
|
||||
type: NodePort
|
||||
ports:
|
||||
- name: web
|
||||
port: 3000
|
||||
protocol: TCP
|
||||
nodePort: 30902
|
||||
selector:
|
||||
app: grafana
|
@@ -1,28 +0,0 @@
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: kube-controller-manager-prometheus-discovery
|
||||
labels:
|
||||
k8s-app: kube-controller-manager
|
||||
spec:
|
||||
type: ClusterIP
|
||||
clusterIP: None
|
||||
ports:
|
||||
- name: http-metrics
|
||||
port: 10252
|
||||
targetPort: 10252
|
||||
protocol: TCP
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Endpoints
|
||||
metadata:
|
||||
name: kube-controller-manager-prometheus-discovery
|
||||
labels:
|
||||
k8s-app: kube-controller-manager
|
||||
subsets:
|
||||
- addresses:
|
||||
- ip: MINIKUBE_IP
|
||||
ports:
|
||||
- name: http-metrics
|
||||
port: 10252
|
||||
protocol: TCP
|
@@ -1,28 +0,0 @@
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: kube-scheduler-prometheus-discovery
|
||||
labels:
|
||||
k8s-app: kube-scheduler
|
||||
spec:
|
||||
type: ClusterIP
|
||||
clusterIP: None
|
||||
ports:
|
||||
- name: http-metrics
|
||||
port: 10251
|
||||
targetPort: 10251
|
||||
protocol: TCP
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Endpoints
|
||||
metadata:
|
||||
name: kube-scheduler-prometheus-discovery
|
||||
labels:
|
||||
k8s-app: kube-scheduler
|
||||
subsets:
|
||||
- addresses:
|
||||
- ip: MINIKUBE_IP
|
||||
ports:
|
||||
- name: http-metrics
|
||||
port: 10251
|
||||
protocol: TCP
|
@@ -1,16 +0,0 @@
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: kube-controller-manager-prometheus-discovery
|
||||
labels:
|
||||
k8s-app: kube-controller-manager
|
||||
spec:
|
||||
selector:
|
||||
k8s-app: kube-controller-manager
|
||||
type: ClusterIP
|
||||
clusterIP: None
|
||||
ports:
|
||||
- name: http-metrics
|
||||
port: 10252
|
||||
targetPort: 10252
|
||||
protocol: TCP
|
@@ -1,20 +0,0 @@
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: kube-dns-prometheus-discovery
|
||||
labels:
|
||||
k8s-app: kube-dns
|
||||
spec:
|
||||
selector:
|
||||
k8s-app: kube-dns
|
||||
type: ClusterIP
|
||||
clusterIP: None
|
||||
ports:
|
||||
- name: http-metrics-skydns
|
||||
port: 10055
|
||||
targetPort: 10055
|
||||
protocol: TCP
|
||||
- name: http-metrics-dnsmasq
|
||||
port: 10054
|
||||
targetPort: 10054
|
||||
protocol: TCP
|
@@ -1,16 +0,0 @@
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: kube-scheduler-prometheus-discovery
|
||||
labels:
|
||||
k8s-app: kube-scheduler
|
||||
spec:
|
||||
selector:
|
||||
k8s-app: kube-scheduler
|
||||
type: ClusterIP
|
||||
clusterIP: None
|
||||
ports:
|
||||
- name: http-metrics
|
||||
port: 10251
|
||||
targetPort: 10251
|
||||
protocol: TCP
|
@@ -1,26 +0,0 @@
|
||||
apiVersion: extensions/v1beta1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: prometheus-operator
|
||||
labels:
|
||||
operator: prometheus
|
||||
spec:
|
||||
replicas: 1
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
operator: prometheus
|
||||
spec:
|
||||
containers:
|
||||
- name: prometheus-operator
|
||||
image: quay.io/coreos/prometheus-operator:v0.6.0
|
||||
args:
|
||||
- "--kubelet-object=kube-system/kubelet"
|
||||
- "--config-reloader-image=quay.io/coreos/configmap-reload:v0.0.1"
|
||||
resources:
|
||||
requests:
|
||||
cpu: 100m
|
||||
memory: 50Mi
|
||||
limits:
|
||||
cpu: 200m
|
||||
memory: 300Mi
|
@@ -1,447 +0,0 @@
|
||||
apiVersion: v1
|
||||
data:
|
||||
etcd2.rules: "### General cluster availability ###\n\n# alert if another failed
|
||||
peer will result in an unavailable cluster\nALERT InsufficientPeers\n IF count(up{job=\"etcd-k8s\"}
|
||||
== 0) > (count(up{job=\"etcd-k8s\"}) / 2 - 1)\n FOR 3m\n LABELS {\n severity
|
||||
= \"critical\"\n }\n ANNOTATIONS {\n summary = \"Etcd cluster small\",\n
|
||||
\ description = \"If one more etcd peer goes down the cluster will be unavailable\",\n
|
||||
\ }\n\n### HTTP requests alerts ###\n\n# alert if more than 1% of requests to
|
||||
an HTTP endpoint have failed with a non 4xx response\nALERT HighNumberOfFailedHTTPRequests\n
|
||||
\ IF sum by(method) (rate(etcd_http_failed_total{job=\"etcd-k8s\", code!~\"4[0-9]{2}\"}[5m]))\n
|
||||
\ / sum by(method) (rate(etcd_http_received_total{job=\"etcd-k8s\"}[5m])) >
|
||||
0.01\n FOR 10m\n LABELS {\n severity = \"warning\"\n }\n ANNOTATIONS {\n
|
||||
\ summary = \"a high number of HTTP requests are failing\",\n description
|
||||
= \"{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance
|
||||
{{ $labels.instance }}\",\n }\n\n# alert if more than 5% of requests to an HTTP
|
||||
endpoint have failed with a non 4xx response\nALERT HighNumberOfFailedHTTPRequests\n
|
||||
\ IF sum by(method) (rate(etcd_http_failed_total{job=\"etcd-k8s\", code!~\"4[0-9]{2}\"}[5m]))
|
||||
\n / sum by(method) (rate(etcd_http_received_total{job=\"etcd-k8s\"}[5m]))
|
||||
> 0.05\n FOR 5m\n LABELS {\n severity = \"critical\"\n }\n ANNOTATIONS
|
||||
{\n summary = \"a high number of HTTP requests are failing\",\n description
|
||||
= \"{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance
|
||||
{{ $labels.instance }}\",\n }\n\n# alert if 50% of requests get a 4xx response\nALERT
|
||||
HighNumberOfFailedHTTPRequests\n IF sum by(method) (rate(etcd_http_failed_total{job=\"etcd-k8s\",
|
||||
code=~\"4[0-9]{2}\"}[5m]))\n / sum by(method) (rate(etcd_http_received_total{job=\"etcd-k8s\"}[5m]))
|
||||
> 0.5\n FOR 10m\n LABELS {\n severity = \"critical\"\n }\n ANNOTATIONS
|
||||
{\n summary = \"a high number of HTTP requests are failing\",\n description
|
||||
= \"{{ $value }}% of requests for {{ $labels.method }} failed with 4xx responses
|
||||
on etcd instance {{ $labels.instance }}\",\n }\n\n# alert if the 99th percentile
|
||||
of HTTP requests take more than 150ms\nALERT HTTPRequestsSlow\n IF histogram_quantile(0.99,
|
||||
rate(etcd_http_successful_duration_second_bucket[5m])) > 0.15\n FOR 10m\n LABELS
|
||||
{\n severity = \"warning\"\n }\n ANNOTATIONS {\n summary = \"slow HTTP
|
||||
requests\",\n description = \"on ectd instance {{ $labels.instance }} HTTP
|
||||
requests to {{ $label.method }} are slow\",\n }\n\n### File descriptor alerts
|
||||
###\n\ninstance:fd_utilization = process_open_fds / process_max_fds\n\n# alert
|
||||
if file descriptors are likely to exhaust within the next 4 hours\nALERT FdExhaustionClose\n
|
||||
\ IF predict_linear(instance:fd_utilization[1h], 3600 * 4) > 1\n FOR 10m\n LABELS
|
||||
{\n severity = \"warning\"\n }\n ANNOTATIONS {\n summary = \"file descriptors
|
||||
soon exhausted\",\n description = \"{{ $labels.job }} instance {{ $labels.instance
|
||||
}} will exhaust in file descriptors soon\",\n }\n\n# alert if file descriptors
|
||||
are likely to exhaust within the next hour\nALERT FdExhaustionClose\n IF predict_linear(instance:fd_utilization[10m],
|
||||
3600) > 1\n FOR 10m\n LABELS {\n severity = \"critical\"\n }\n ANNOTATIONS
|
||||
{\n summary = \"file descriptors soon exhausted\",\n description = \"{{
|
||||
$labels.job }} instance {{ $labels.instance }} will exhaust in file descriptors
|
||||
soon\",\n }\n\n### etcd proposal alerts ###\n\n# alert if there are several failed
|
||||
proposals within an hour\nALERT HighNumberOfFailedProposals\n IF increase(etcd_server_proposal_failed_total{job=\"etcd\"}[1h])
|
||||
> 5\n LABELS {\n severity = \"warning\"\n }\n ANNOTATIONS {\n summary
|
||||
= \"a high number of failed proposals within the etcd cluster are happening\",\n
|
||||
\ description = \"etcd instance {{ $labels.instance }} has seen {{ $value }}
|
||||
proposal failures within the last hour\",\n }\n\n### etcd disk io latency alerts
|
||||
###\n\n# alert if 99th percentile of fsync durations is higher than 500ms\nALERT
|
||||
HighFsyncDurations\n IF histogram_quantile(0.99, rate(etcd_wal_fsync_durations_seconds_bucket[5m]))
|
||||
> 0.5\n FOR 10m\n LABELS {\n severity = \"warning\"\n }\n ANNOTATIONS {\n
|
||||
\ summary = \"high fsync durations\",\n description = \"ectd instance {{
|
||||
$labels.instance }} fync durations are high\",\n }\n"
|
||||
kubernetes.rules: |+
|
||||
# NOTE: These rules were kindly contributed by the SoundCloud engineering team.
|
||||
|
||||
### Container resources ###
|
||||
|
||||
cluster_namespace_controller_pod_container:spec_memory_limit_bytes =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
container_spec_memory_limit_bytes{container_name!=""},
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:spec_cpu_shares =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
container_spec_cpu_shares{container_name!=""},
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:cpu_usage:rate =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
irate(
|
||||
container_cpu_usage_seconds_total{container_name!=""}[5m]
|
||||
),
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:memory_usage:bytes =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
container_memory_usage_bytes{container_name!=""},
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:memory_working_set:bytes =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
container_memory_working_set_bytes{container_name!=""},
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:memory_rss:bytes =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
container_memory_rss{container_name!=""},
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:memory_cache:bytes =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
container_memory_cache{container_name!=""},
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:disk_usage:bytes =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name) (
|
||||
label_replace(
|
||||
container_disk_usage_bytes{container_name!=""},
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:memory_pagefaults:rate =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name,scope,type) (
|
||||
label_replace(
|
||||
irate(
|
||||
container_memory_failures_total{container_name!=""}[5m]
|
||||
),
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
cluster_namespace_controller_pod_container:memory_oom:rate =
|
||||
sum by (cluster,namespace,controller,pod_name,container_name,scope,type) (
|
||||
label_replace(
|
||||
irate(
|
||||
container_memory_failcnt{container_name!=""}[5m]
|
||||
),
|
||||
"controller", "$1",
|
||||
"pod_name", "^(.*)-[a-z0-9]+"
|
||||
)
|
||||
)
|
||||
|
||||
### Cluster resources ###
|
||||
|
||||
cluster:memory_allocation:percent =
|
||||
100 * sum by (cluster) (
|
||||
container_spec_memory_limit_bytes{pod_name!=""}
|
||||
) / sum by (cluster) (
|
||||
machine_memory_bytes
|
||||
)
|
||||
|
||||
cluster:memory_used:percent =
|
||||
100 * sum by (cluster) (
|
||||
container_memory_usage_bytes{pod_name!=""}
|
||||
) / sum by (cluster) (
|
||||
machine_memory_bytes
|
||||
)
|
||||
|
||||
cluster:cpu_allocation:percent =
|
||||
100 * sum by (cluster) (
|
||||
container_spec_cpu_shares{pod_name!=""}
|
||||
) / sum by (cluster) (
|
||||
container_spec_cpu_shares{id="/"} * on(cluster,instance) machine_cpu_cores
|
||||
)
|
||||
|
||||
cluster:node_cpu_use:percent =
|
||||
100 * sum by (cluster) (
|
||||
rate(node_cpu{mode!="idle"}[5m])
|
||||
) / sum by (cluster) (
|
||||
machine_cpu_cores
|
||||
)
|
||||
|
||||
### API latency ###
|
||||
|
||||
# Raw metrics are in microseconds. Convert to seconds.
|
||||
cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.99"} =
|
||||
histogram_quantile(
|
||||
0.99,
|
||||
sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket)
|
||||
) / 1e6
|
||||
cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.9"} =
|
||||
histogram_quantile(
|
||||
0.9,
|
||||
sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket)
|
||||
) / 1e6
|
||||
cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.5"} =
|
||||
histogram_quantile(
|
||||
0.5,
|
||||
sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket)
|
||||
) / 1e6
|
||||
|
||||
### Scheduling latency ###
|
||||
|
||||
cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.99"} =
|
||||
histogram_quantile(0.99,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6
|
||||
cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.9"} =
|
||||
histogram_quantile(0.9,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6
|
||||
cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.5"} =
|
||||
histogram_quantile(0.5,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6
|
||||
|
||||
cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.99"} =
|
||||
histogram_quantile(0.99,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6
|
||||
cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.9"} =
|
||||
histogram_quantile(0.9,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6
|
||||
cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.5"} =
|
||||
histogram_quantile(0.5,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6
|
||||
|
||||
cluster:scheduler_binding_latency:quantile_seconds{quantile="0.99"} =
|
||||
histogram_quantile(0.99,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6
|
||||
cluster:scheduler_binding_latency:quantile_seconds{quantile="0.9"} =
|
||||
histogram_quantile(0.9,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6
|
||||
cluster:scheduler_binding_latency:quantile_seconds{quantile="0.5"} =
|
||||
histogram_quantile(0.5,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6
|
||||
|
||||
ALERT K8SNodeDown
|
||||
IF up{job="kubelet"} == 0
|
||||
FOR 1h
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Kubelet cannot be scraped",
|
||||
description = "Prometheus could not scrape a {{ $labels.job }} for more than one hour",
|
||||
}
|
||||
|
||||
ALERT K8SNodeNotReady
|
||||
IF kube_node_status_ready{condition="true"} == 0
|
||||
FOR 1h
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning",
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Node status is NotReady",
|
||||
description = "The Kubelet on {{ $labels.node }} has not checked in with the API, or has set itself to NotReady, for more than an hour",
|
||||
}
|
||||
|
||||
ALERT K8SManyNodesNotReady
|
||||
IF
|
||||
count by (cluster) (kube_node_status_ready{condition="true"} == 0) > 1
|
||||
AND
|
||||
(
|
||||
count by (cluster) (kube_node_status_ready{condition="true"} == 0)
|
||||
/
|
||||
count by (cluster) (kube_node_status_ready{condition="true"})
|
||||
) > 0.2
|
||||
FOR 1m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "critical",
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Many K8s nodes are Not Ready",
|
||||
description = "{{ $value }} K8s nodes (more than 10% of cluster {{ $labels.cluster }}) are in the NotReady state.",
|
||||
}
|
||||
|
||||
ALERT K8SKubeletNodeExporterDown
|
||||
IF up{job="node-exporter"} == 0
|
||||
FOR 15m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Kubelet node_exporter cannot be scraped",
|
||||
description = "Prometheus could not scrape a {{ $labels.job }} for more than one hour.",
|
||||
}
|
||||
|
||||
ALERT K8SKubeletDown
|
||||
IF absent(up{job="kubelet"}) or count by (cluster) (up{job="kubelet"} == 0) / count by (cluster) (up{job="kubelet"}) > 0.1
|
||||
FOR 1h
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "critical"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Many Kubelets cannot be scraped",
|
||||
description = "Prometheus failed to scrape more than 10% of kubelets, or all Kubelets have disappeared from service discovery.",
|
||||
}
|
||||
|
||||
ALERT K8SApiserverDown
|
||||
IF up{job="kubernetes"} == 0
|
||||
FOR 15m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "API server unreachable",
|
||||
description = "An API server could not be scraped.",
|
||||
}
|
||||
|
||||
# Disable for non HA kubernetes setups.
|
||||
ALERT K8SApiserverDown
|
||||
IF absent({job="kubernetes"}) or (count by(cluster) (up{job="kubernetes"} == 1) < count by(cluster) (up{job="kubernetes"}))
|
||||
FOR 5m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "critical"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "API server unreachable",
|
||||
description = "Prometheus failed to scrape multiple API servers, or all API servers have disappeared from service discovery.",
|
||||
}
|
||||
|
||||
ALERT K8SSchedulerDown
|
||||
IF absent(up{job="kube-scheduler"}) or (count by(cluster) (up{job="kube-scheduler"} == 1) == 0)
|
||||
FOR 5m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "critical",
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Scheduler is down",
|
||||
description = "There is no running K8S scheduler. New pods are not being assigned to nodes.",
|
||||
}
|
||||
|
||||
ALERT K8SControllerManagerDown
|
||||
IF absent(up{job="kube-controller-manager"}) or (count by(cluster) (up{job="kube-controller-manager"} == 1) == 0)
|
||||
FOR 5m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "critical",
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Controller manager is down",
|
||||
description = "There is no running K8S controller manager. Deployments and replication controllers are not making progress.",
|
||||
}
|
||||
|
||||
ALERT K8SConntrackTableFull
|
||||
IF 100*node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 50
|
||||
FOR 10m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Number of tracked connections is near the limit",
|
||||
description = "The nf_conntrack table is {{ $value }}% full.",
|
||||
}
|
||||
|
||||
ALERT K8SConntrackTableFull
|
||||
IF 100*node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 90
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "critical"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Number of tracked connections is near the limit",
|
||||
description = "The nf_conntrack table is {{ $value }}% full.",
|
||||
}
|
||||
|
||||
# To catch the conntrack sysctl de-tuning when it happens
|
||||
ALERT K8SConntrackTuningMissing
|
||||
IF node_nf_conntrack_udp_timeout > 10
|
||||
FOR 10m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning",
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Node does not have the correct conntrack tunings",
|
||||
description = "Nodes keep un-setting the correct tunings, investigate when it happens.",
|
||||
}
|
||||
|
||||
ALERT K8STooManyOpenFiles
|
||||
IF 100*process_open_fds{job=~"kubelet|kubernetes"} / process_max_fds > 50
|
||||
FOR 10m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "{{ $labels.job }} has too many open file descriptors",
|
||||
description = "{{ $labels.node }} is using {{ $value }}% of the available file/socket descriptors.",
|
||||
}
|
||||
|
||||
ALERT K8STooManyOpenFiles
|
||||
IF 100*process_open_fds{job=~"kubelet|kubernetes"} / process_max_fds > 80
|
||||
FOR 10m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "critical"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "{{ $labels.job }} has too many open file descriptors",
|
||||
description = "{{ $labels.node }} is using {{ $value }}% of the available file/socket descriptors.",
|
||||
}
|
||||
|
||||
# Some verbs excluded because they are expected to be long-lasting:
|
||||
# WATCHLIST is long-poll, CONNECT is `kubectl exec`.
|
||||
ALERT K8SApiServerLatency
|
||||
IF histogram_quantile(
|
||||
0.99,
|
||||
sum without (instance,node,resource) (apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH"})
|
||||
) / 1e6 > 1.0
|
||||
FOR 10m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Kubernetes apiserver latency is high",
|
||||
description = "99th percentile Latency for {{ $labels.verb }} requests to the kube-apiserver is higher than 1s.",
|
||||
}
|
||||
|
||||
ALERT K8SApiServerEtcdAccessLatency
|
||||
IF etcd_request_latencies_summary{quantile="0.99"} / 1e6 > 1.0
|
||||
FOR 15m
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning"
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Access to etcd is slow",
|
||||
description = "99th percentile latency for apiserver to access etcd is higher than 1s.",
|
||||
}
|
||||
|
||||
ALERT K8SKubeletTooManyPods
|
||||
IF kubelet_running_pod_count > 100
|
||||
LABELS {
|
||||
service = "k8s",
|
||||
severity = "warning",
|
||||
}
|
||||
ANNOTATIONS {
|
||||
summary = "Kubelet is close to pod limit",
|
||||
description = "Kubelet {{$labels.instance}} is running {{$value}} pods, close to the limit of 110",
|
||||
}
|
||||
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
creationTimestamp: null
|
||||
name: prometheus-k8s-rules
|
@@ -1,14 +0,0 @@
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: prometheus-k8s
|
||||
spec:
|
||||
type: NodePort
|
||||
ports:
|
||||
- name: web
|
||||
nodePort: 30900
|
||||
port: 9090
|
||||
protocol: TCP
|
||||
targetPort: web
|
||||
selector:
|
||||
prometheus: k8s
|
@@ -1,69 +0,0 @@
|
||||
apiVersion: monitoring.coreos.com/v1alpha1
|
||||
kind: ServiceMonitor
|
||||
metadata:
|
||||
name: kube-apiserver
|
||||
labels:
|
||||
k8s-apps: https
|
||||
spec:
|
||||
jobLabel: provider
|
||||
selector:
|
||||
matchLabels:
|
||||
component: apiserver
|
||||
provider: kubernetes
|
||||
namespaceSelector:
|
||||
matchNames:
|
||||
- default
|
||||
endpoints:
|
||||
- port: https
|
||||
interval: 15s
|
||||
scheme: https
|
||||
tlsConfig:
|
||||
caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
|
||||
serverName: kubernetes
|
||||
bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
|
||||
---
|
||||
apiVersion: monitoring.coreos.com/v1alpha1
|
||||
kind: ServiceMonitor
|
||||
metadata:
|
||||
name: k8s-apps-https
|
||||
labels:
|
||||
k8s-apps: https
|
||||
spec:
|
||||
jobLabel: k8s-app
|
||||
selector:
|
||||
matchExpressions:
|
||||
- {key: k8s-app, operator: Exists}
|
||||
namespaceSelector:
|
||||
matchNames:
|
||||
- kube-system
|
||||
endpoints:
|
||||
- port: https-metrics
|
||||
interval: 15s
|
||||
scheme: https
|
||||
tlsConfig:
|
||||
caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
|
||||
insecureSkipVerify: true
|
||||
bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
|
||||
---
|
||||
apiVersion: monitoring.coreos.com/v1alpha1
|
||||
kind: ServiceMonitor
|
||||
metadata:
|
||||
name: k8s-apps-http
|
||||
labels:
|
||||
k8s-apps: http
|
||||
spec:
|
||||
jobLabel: k8s-app
|
||||
selector:
|
||||
matchExpressions:
|
||||
- {key: k8s-app, operator: Exists}
|
||||
namespaceSelector:
|
||||
matchNames:
|
||||
- kube-system
|
||||
- monitoring
|
||||
endpoints:
|
||||
- port: http-metrics
|
||||
interval: 15s
|
||||
- port: http-metrics-dnsmasq
|
||||
interval: 15s
|
||||
- port: http-metrics-skydns
|
||||
interval: 15s
|
@@ -1,24 +0,0 @@
|
||||
apiVersion: monitoring.coreos.com/v1alpha1
|
||||
kind: Prometheus
|
||||
metadata:
|
||||
name: k8s
|
||||
labels:
|
||||
prometheus: k8s
|
||||
spec:
|
||||
replicas: 2
|
||||
version: v1.5.2
|
||||
serviceMonitorSelector:
|
||||
matchExpression:
|
||||
- {key: k8s-apps, operator: Exists}
|
||||
resources:
|
||||
requests:
|
||||
# 2Gi is default, but won't schedule if you don't have a node with >2Gi
|
||||
# memory. Modify based on your target and time-series count for
|
||||
# production use. This value is mainly meant for demonstration/testing
|
||||
# purposes.
|
||||
memory: 400Mi
|
||||
alerting:
|
||||
alertmanagers:
|
||||
- namespace: monitoring
|
||||
name: alertmanager-main
|
||||
port: web
|
Reference in New Issue
Block a user