groups: - name: node.rules rules: - record: instance:node_cpu:rate:sum expr: sum(rate(node_cpu{mode!="idle",mode!="iowait",mode!~"^(?:guest.*)$"}[3m])) BY (instance) - record: instance:node_filesystem_usage:sum expr: sum((node_filesystem_size{mountpoint="/"} - node_filesystem_free{mountpoint="/"})) BY (instance) - record: instance:node_network_receive_bytes:rate:sum expr: sum(rate(node_network_receive_bytes[3m])) BY (instance) - record: instance:node_network_transmit_bytes:rate:sum expr: sum(rate(node_network_transmit_bytes[3m])) BY (instance) - record: instance:node_cpu:ratio expr: sum(rate(node_cpu{mode!="idle"}[5m])) WITHOUT (cpu, mode) / ON(instance) GROUP_LEFT() count(sum(node_cpu) BY (instance, cpu)) BY (instance) - record: cluster:node_cpu:sum_rate5m expr: sum(rate(node_cpu{mode!="idle"}[5m])) - record: cluster:node_cpu:ratio expr: cluster:node_cpu:rate5m / count(sum(node_cpu) BY (instance, cpu)) - alert: NodeExporterDown expr: absent(up{job="node-exporter"} == 1) for: 10m labels: severity: warning annotations: description: Prometheus could not scrape a node-exporter for more than 10m, or node-exporters have disappeared from discovery summary: Prometheus could not scrape a node-exporter - alert: NodeDiskRunningFull expr: predict_linear(node_filesystem_free[6h], 3600 * 24) < 0 for: 30m labels: severity: warning annotations: description: device {{$labels.device}} on node {{$labels.instance}} is running full within the next 24 hours (mounted at {{$labels.mountpoint}}) summary: Node disk is running full within 24 hours - alert: NodeDiskRunningFull expr: predict_linear(node_filesystem_free[30m], 3600 * 2) < 0 for: 10m labels: severity: critical annotations: description: device {{$labels.device}} on node {{$labels.instance}} is running full within the next 2 hours (mounted at {{$labels.mountpoint}}) summary: Node disk is running full within 2 hours