Previously we dropped pod-centric metrics without a (pod, namespace)
label set; however, these can be critical for debugging.
Keep 'container_fs_.*' metrics from cAdvisor
The following provides a description and a cardinality estimate for each metric, based on tests in a local cluster:
container_blkio_device_usage_total - useful for containers, but not for system services (nodes*disks*services*operations*2)
container_fs_.* - adds filesystem read/write data (nodes*disks*services*4)
container_file_descriptors - file descriptor limits and global numbers are exposed elsewhere (nodes*services)
container_threads_max - max number of threads in cgroup. Usually for system services it is not limited (nodes*services)
container_threads - used threads in cgroup. Usually not important for system services (nodes*services)
container_sockets - used sockets in cgroup. Usually not important for system services (nodes*services)
container_start_time_seconds - container start. Possibly not needed for system services (nodes*services)
container_last_seen - Not needed as system services are always running (nodes*services)
container_spec_.* - Everything related to cgroup specification and thus static data (nodes*services*5)
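A minimal sketch of how this could look as a metric_relabel_configs drop rule on the cAdvisor endpoint, assuming the affected series are identified by empty pod/namespace labels and that container_fs_.* is simply left out of the drop regex (the actual rule in the repo may differ):

metric_relabel_configs:
  # Drop the cAdvisor series listed above when both pod and namespace are empty;
  # container_fs_.* is intentionally absent from the regex, so it is kept.
  - source_labels: [__name__, pod, namespace]
    regex: 'container_(blkio_device_usage_total|file_descriptors|threads|threads_max|sockets|start_time_seconds|last_seen|spec_.*);;'
    action: drop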
etcd refactored their repo, moving and renaming etcd-mixin. The
jsonnetfile depended on "master" even though the lock file pinned an
older version. Checking out the last commit before the move works.
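As a sketch, the workaround amounts to pinning the dependency in jsonnetfile.json to that commit instead of "master"; the remote/subdir shown here are assumptions and the version is a placeholder for the actual commit SHA:

{
  "dependencies": [
    {
      "name": "etcd-mixin",
      "source": {
        "git": {
          "remote": "https://github.com/etcd-io/etcd",
          "subdir": "Documentation/etcd-mixin"
        }
      },
      "version": "<sha-of-last-commit-before-the-move>"
    }
  ]
}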
kube-apiserver has a histogram etcd_request_duration_seconds that
measures latency between the kube-apiserver and the etcd instance.
This metric is currently dropped by cluster-prometheus. Enable
this metric so we have visibility into etcd latency.
We ensured that this does not enable other unwanted metrics:
count by(name) ({name=~"etcd_request.+"})
etcd_request_duration_seconds_bucket
etcd_request_duration_seconds_count
etcd_request_duration_seconds_sum
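With the histogram kept, etcd latency as seen from the apiserver can be inspected with a query along these lines (the quantile, rate window, and the operation grouping label are illustrative choices):

histogram_quantile(0.99, sum by (le, operation) (rate(etcd_request_duration_seconds_bucket[5m])))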
Previously the alert would fire when the number of Alertmanager pods
didn't match the number of replicas defined in the Alertmanager spec
even though all the running pods had the same configuration hash. This
type of issue is already covered by KubeStatefulSetUpdateNotRolledOut
(and possibly KubePodNotReady); having AlertmanagerConfigInconsistent
also active in this situation creates unnecessary noise.
With this change, the alert expression only returns results when
Alertmanager pods have different configuration hash values,
irrespective of the number of pod replicas. The message annotation
has also been enhanced to report
the configuration hash for each pod.
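A sketch of what such an expression can look like, using count_values to group pods by their alertmanager_config_hash value and flagging services that see more than one distinct hash (the job selector is a placeholder):

count by (namespace, service) (
  count_values by (namespace, service) ("config_hash", alertmanager_config_hash{job="alertmanager-main"})
) != 1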
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Ignore kubelet pod filesystem mounts of the form:
/var/lib/kubelet/pods/1b260ce7-e75d-44d4-8409-922d2bd0851f/volumes...
Usage data for these volumes is already available via the
kubelet_volume_stats* metrics.
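A minimal sketch of how these mounts could be filtered, assuming the series expose the mount path in a mountpoint label (some collectors use device instead) and that the filtering happens via metric_relabel_configs rather than a collector flag:

metric_relabel_configs:
  # Drop filesystem series whose mount path is a per-pod kubelet volume;
  # kubelet_volume_stats* already covers these volumes.
  # 'mountpoint' is an assumed label name here.
  - source_labels: [mountpoint]
    regex: '/var/lib/kubelet/pods/.+'
    action: drop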