Rules

alertmanager.rules

Last evaluation: 13.118s ago
Evaluation time: 417.1us

Rule State Error Last Evaluation Evaluation Time
alert: AlertmanagerConfigInconsistent expr: count by(namespace, service) (count_values by(namespace, service) ("config_hash", alertmanager_config_hash{job="alertmanager-main",namespace="monitoring"})) != 1 for: 5m labels: severity: critical annotations: message: | The configuration of the instances of the Alertmanager cluster `{{ $labels.namespace }}/{{ $labels.service }}` are out of sync. {{ range printf "alertmanager_config_hash{namespace=\"%s\",service=\"%s\"}" $labels.namespace $labels.service | query }} Configuration hash for pod {{ .Labels.pod }} is "{{ printf "%.f" .Value }}" {{ end }} ok 13.118s ago 248.4us
alert: AlertmanagerFailedReload expr: alertmanager_config_last_reload_successful{job="alertmanager-main",namespace="monitoring"} == 0 for: 10m labels: severity: warning annotations: message: Reloading Alertmanager's configuration has failed for {{ $labels.namespace }}/{{ $labels.pod}}. ok 13.118s ago 64.75us
alert: AlertmanagerMembersInconsistent expr: alertmanager_cluster_members{job="alertmanager-main",namespace="monitoring"} != on(service) group_left() count by(service) (alertmanager_cluster_members{job="alertmanager-main",namespace="monitoring"}) for: 5m labels: severity: critical annotations: message: Alertmanager has not found all other members of the cluster. ok 13.118s ago 92.87us
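
The AlertmanagerConfigInconsistent expression above counts the distinct `alertmanager_config_hash` values per (namespace, service) and fires when that count is not 1. A minimal Python sketch of the same grouping logic follows; the sample data shape, pod names, and hash values are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def inconsistent_groups(samples):
    """Return {(namespace, service): distinct_hash_count} where the count != 1.

    `samples` is a list of (labels, value) pairs standing in for
    alertmanager_config_hash series. This shape is an assumption made
    for illustration, not a Prometheus API.
    """
    hashes = defaultdict(set)
    for labels, value in samples:
        key = (labels["namespace"], labels["service"])
        hashes[key].add(value)  # count_values: one entry per distinct hash
    # count by(namespace, service) (...) != 1
    return {key: len(vals) for key, vals in hashes.items() if len(vals) != 1}


# Hypothetical data: three pods, one of them running a different config.
samples = [
    ({"namespace": "monitoring", "service": "alertmanager-main", "pod": "am-0"}, 111.0),
    ({"namespace": "monitoring", "service": "alertmanager-main", "pod": "am-1"}, 111.0),
    ({"namespace": "monitoring", "service": "alertmanager-main", "pod": "am-2"}, 222.0),
]
print(inconsistent_groups(samples))
```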

general.rules

Last evaluation: 16.678s ago
Evaluation time: 807.1us

Rule State Error Last Evaluation Evaluation Time
alert: TargetDown expr: 100 * (count by(job, namespace, service) (up == 0) / count by(job, namespace, service) (up)) > 10 for: 10m labels: severity: warning annotations: message: '{{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service }} targets in {{ $labels.namespace }} namespace are down.' ok 16.697s ago 537.2us
alert: Watchdog expr: vector(1) labels: severity: none annotations: message: | This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing, therefore it should always be firing in Alertmanager and always fire against a receiver. There are integrations with various notification mechanisms that send a notification when this alert is not firing. For example the "DeadMansSnitch" integration in PagerDuty. ok 16.696s ago 257.9us
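
The TargetDown expression above computes, per (job, namespace, service), the percentage of targets whose `up` value is 0 and keeps groups above 10%. A rough Python sketch of that arithmetic, assuming a hypothetical list of (labels, value) pairs in place of real `up` series:

```python
from collections import Counter


def target_down_percent(up_samples, threshold=10.0):
    """Return {(job, namespace, service): percent_down} for groups above threshold.

    Groups with no down targets never appear in the result, mirroring PromQL,
    where `count(up == 0)` yields no series for such a group.
    """
    total = Counter()
    down = Counter()
    for labels, value in up_samples:
        key = (labels.get("job", ""), labels.get("namespace", ""), labels.get("service", ""))
        total[key] += 1
        if value == 0:
            down[key] += 1
    result = {}
    for key, n_down in down.items():
        pct = 100 * n_down / total[key]
        if pct > threshold:
            result[key] = pct
    return result


# Hypothetical data: four targets of one job, one of them down.
labels = {"job": "node-exporter", "namespace": "monitoring", "service": "node-exporter"}
up_samples = [(labels, 1), (labels, 1), (labels, 1), (labels, 0)]
print(target_down_percent(up_samples))
```

The `for: 10m` clause is not modelled here; in Prometheus the condition must hold continuously for that long before the alert fires.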

k8s.rules

Last evaluation: 11.506s ago
Evaluation time: 20.34ms

Rule State Error Last Evaluation Evaluation Time
record: namespace:container_cpu_usage_seconds_total:sum_rate expr: sum by(namespace) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[5m])) ok 11.523s ago 1.578ms
record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate expr: sum by(cluster, namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[5m])) * on(cluster, namespace, pod) group_left(node) topk by(cluster, namespace, pod) (1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})) ok 11.522s ago 3.034ms
record: node_namespace_pod_container:container_memory_working_set_bytes expr: container_memory_working_set_bytes{image!="",job="kubelet",metrics_path="/metrics/cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""})) ok 11.519s ago 3.785ms
record: node_namespace_pod_container:container_memory_rss expr: container_memory_rss{image!="",job="kubelet",metrics_path="/metrics/cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""})) ok 11.515s ago 2.685ms
record: node_namespace_pod_container:container_memory_cache expr: container_memory_cache{image!="",job="kubelet",metrics_path="/metrics/cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""})) ok 11.512s ago 2.285ms
record: node_namespace_pod_container:container_memory_swap expr: container_memory_swap{image!="",job="kubelet",metrics_path="/metrics/cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""})) ok 11.51s ago 2.558ms
record: namespace:container_memory_usage_bytes:sum expr: sum by(namespace) (container_memory_usage_bytes{container!="POD",image!="",job="kubelet",metrics_path="/metrics/cadvisor"}) ok 11.508s ago 755us
record: namespace:kube_pod_container_resource_requests_memory_bytes:sum expr: sum by(namespace) (sum by(namespace, pod) (max by(namespace, pod, container) (kube_pod_container_resource_requests_memory_bytes{job="kube-state-metrics"}) * on(namespace, pod) group_left() max by(namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1))) ok 11.507s ago 957.3us
record: namespace:kube_pod_container_resource_requests_cpu_cores:sum expr: sum by(namespace) (sum by(namespace, pod) (max by(namespace, pod, container) (kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}) * on(namespace, pod) group_left() max by(namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1))) ok 11.506s ago 963.5us
record: namespace_workload_pod:kube_pod_owner:relabel expr: max by(cluster, namespace, workload, pod) (label_replace(label_replace(kube_pod_owner{job="kube-state-metrics",owner_kind="ReplicaSet"}, "replicaset", "$1", "owner_name", "(.*)") * on(replicaset, namespace) group_left(owner_name) topk by(replicaset, namespace) (1, max by(replicaset, namespace, owner_name) (kube_replicaset_owner{job="kube-state-metrics"})), "workload", "$1", "owner_name", "(.*)")) labels: workload_type: deployment ok 11.505s ago 1.367ms
record: namespace_workload_pod:kube_pod_owner:relabel expr: max by(cluster, namespace, workload, pod) (label_replace(kube_pod_owner{job="kube-state-metrics",owner_kind="DaemonSet"}, "workload", "$1", "owner_name", "(.*)")) labels: workload_type: daemonset ok 11.504s ago 168.9us
record: namespace_workload_pod:kube_pod_owner:relabel expr: max by(cluster, namespace, workload, pod) (label_replace(kube_pod_owner{job="kube-state-metrics",owner_kind="StatefulSet"}, "workload", "$1", "owner_name", "(.*)")) labels: workload_type: statefulset ok 11.504s ago 161.9us
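
Several k8s.rules expressions above multiply a container metric by `kube_pod_info` with `on(namespace, pod) group_left(node)`. Because `kube_pod_info` has the value 1, the product keeps the metric value and only copies the `node` label onto it. A simplified Python sketch of that join, using hypothetical sample data; it keeps the first `kube_pod_info` entry per pod, whereas `topk(1, ...)` picks by value:

```python
def attach_node(metric_samples, pod_info_samples):
    """Join metric samples with pod info on (namespace, pod), copying `node`.

    Both arguments are lists of (labels, value) pairs; this shape is an
    assumption for illustration. Metric samples without a matching pod
    are dropped, as in a PromQL vector match.
    """
    node_for = {}
    for labels, value in pod_info_samples:
        if labels.get("node", "") == "":
            continue  # kube_pod_info{node!=""}
        key = (labels["namespace"], labels["pod"])
        node_for.setdefault(key, (labels["node"], value))  # one entry per pod
    joined = []
    for labels, value in metric_samples:
        key = (labels["namespace"], labels["pod"])
        if key not in node_for:
            continue
        node, info_value = node_for[key]
        joined.append(({**labels, "node": node}, value * info_value))
    return joined


# Hypothetical data: one pod scheduled on a node, one pod with no node info.
memory = [
    ({"namespace": "default", "pod": "web-0", "container": "app"}, 5.0e7),
    ({"namespace": "default", "pod": "orphan", "container": "app"}, 1.0e7),
]
pod_info = [({"namespace": "default", "pod": "web-0", "node": "node-a"}, 1)]
print(attach_node(memory, pod_info))
```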

kube-apiserver-availability.rules

Last evaluation: 2m47.709s ago
Evaluation time: 434.9ms

Rule State Error Last Evaluation Evaluation Time
record: apiserver_request:availability30d expr: 1 - ((sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d])) - sum(increase(apiserver_request_duration_seconds_bucket{le="1",verb=~"POST|PUT|PATCH|DELETE"}[30d]))) + (sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d])) - ((sum(increase(apiserver_request_duration_seconds_bucket{le="0.1",scope=~"resource|",verb=~"LIST|GET"}[30d])) or vector(0)) + sum(increase(apiserver_request_duration_seconds_bucket{le="0.5",scope="namespace",verb=~"LIST|GET"}[30d])) + sum(increase(apiserver_request_duration_seconds_bucket{le="5",scope="cluster",verb=~"LIST|GET"}[30d])))) + sum(code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))) / sum(code:apiserver_request_total:increase30d) labels: verb: all ok 2m47.709s ago 149ms
record: apiserver_request:availability30d expr: 1 - (sum(increase(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[30d])) - ((sum(increase(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[30d])) or vector(0)) + sum(increase(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[30d])) + sum(increase(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[30d]))) + sum(code:apiserver_request_total:increase30d{code=~"5..",verb="read"} or vector(0))) / sum(code:apiserver_request_total:increase30d{verb="read"}) labels: verb: read ok 2m47.56s ago 109.3ms
record: apiserver_request:availability30d expr: 1 - ((sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d])) - sum(increase(apiserver_request_duration_seconds_bucket{le="1",verb=~"POST|PUT|PATCH|DELETE"}[30d]))) + sum(code:apiserver_request_total:increase30d{code=~"5..",verb="write"} or vector(0))) / sum(code:apiserver_request_total:increase30d{verb="write"}) labels: verb: write ok 2m47.451s ago 39.02ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="LIST"}[30d])) ok 2m47.412s ago 72.99ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="GET"}[30d])) ok 2m47.339s ago 19.54ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="POST"}[30d])) ok 2m47.32s ago 8.977ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="PUT"}[30d])) ok 2m47.311s ago 8.231ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="PATCH"}[30d])) ok 2m47.303s ago 5.294ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="DELETE"}[30d])) ok 2m47.297s ago 3.856ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="LIST"}[30d])) ok 2m47.294s ago 458.7us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="GET"}[30d])) ok 2m47.293s ago 277.8us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="POST"}[30d])) ok 2m47.293s ago 240us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="PUT"}[30d])) ok 2m47.293s ago 230.2us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="PATCH"}[30d])) ok 2m47.293s ago 244.5us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="DELETE"}[30d])) ok 2m47.293s ago 219.3us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="LIST"}[30d])) ok 2m47.292s ago 839.5us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="GET"}[30d])) ok 2m47.292s ago 5.178ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="POST"}[30d])) ok 2m47.286s ago 1.406ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="PUT"}[30d])) ok 2m47.285s ago 1.748ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="PATCH"}[30d])) ok 2m47.283s ago 1.622ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="DELETE"}[30d])) ok 2m47.282s ago 1.552ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="LIST"}[30d])) ok 2m47.28s ago 388.3us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="GET"}[30d])) ok 2m47.28s ago 1.278ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="POST"}[30d])) ok 2m47.279s ago 1.616ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="PUT"}[30d])) ok 2m47.277s ago 336.3us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="PATCH"}[30d])) ok 2m47.277s ago 309.2us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="DELETE"}[30d])) ok 2m47.277s ago 308.4us
record: code:apiserver_request_total:increase30d expr: sum by(code) (code_verb:apiserver_request_total:increase30d{verb=~"LIST|GET"}) labels: verb: read ok 2m47.277s ago 184.6us
record: code:apiserver_request_total:increase30d expr: sum by(code) (code_verb:apiserver_request_total:increase30d{verb=~"POST|PUT|PATCH|DELETE"}) labels: verb: write ok 2m47.276s ago 184.8us
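
The `verb: write` availability rule above treats a write request as bad if it took longer than 1 second (not in the `le="1"` bucket) or returned a 5xx code, and reports 1 minus the bad fraction. A small Python sketch of that arithmetic; the function name and the input numbers are hypothetical:

```python
def write_availability(duration_count, fast_count, errors_5xx, total_requests):
    """Availability = 1 - (slow requests + 5xx responses) / all requests.

    duration_count: requests observed by the latency histogram (30d increase)
    fast_count:     requests in the le="1" bucket (30d increase)
    errors_5xx:     requests with a 5xx code (30d increase)
    total_requests: all requests (30d increase)
    """
    slow = duration_count - fast_count
    return 1 - (slow + errors_5xx) / total_requests


# Hypothetical 30-day totals: 5,000 slow requests and 1,000 server errors.
print(write_availability(1_000_000, 995_000, 1_000, 1_000_000))
```

The read variant follows the same pattern but sums three latency buckets with different thresholds (0.1s, 0.5s, 5s) depending on the request scope.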

kube-apiserver-slos

Last evaluation: 28.851s ago
Evaluation time: 993us

Rule State Error Last Evaluation Evaluation Time
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate1h) > (14.4 * 0.01) and sum(apiserver_request:burnrate5m) > (14.4 * 0.01) for: 2m labels: long: 1h severity: critical short: 5m annotations: message: The API server is burning too much error budget runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn ok 28.851s ago 448.2us
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate6h) > (6 * 0.01) and sum(apiserver_request:burnrate30m) > (6 * 0.01) for: 15m labels: long: 6h severity: critical short: 30m annotations: message: The API server is burning too much error budget runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn ok 28.85s ago 257.9us
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate1d) > (3 * 0.01) and sum(apiserver_request:burnrate2h) > (3 * 0.01) for: 1h labels: long: 1d severity: warning short: 2h annotations: message: The API server is burning too much error budget runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn ok 28.85s ago 145.4us
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate3d) > (1 * 0.01) and sum(apiserver_request:burnrate6h) > (1 * 0.01) for: 3h labels: long: 3d severity: warning short: 6h annotations: message: The API server is burning too much error budget runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn ok 28.85s ago 128.7us
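
Each KubeAPIErrorBudgetBurn alert above compares a burn rate against `factor * 0.01`, where 0.01 is the error budget of a 99% objective. The sketch below works out the resulting thresholds and, assuming a 30-day SLO period (an assumption suggested by the `availability30d` rules, not stated in the alerts themselves), the share of the budget spent if the burn persists for the whole long window.

```python
ERROR_BUDGET = 0.01          # the `* 0.01` factor in the alert expressions
SLO_PERIOD_HOURS = 30 * 24   # assumption: 30-day SLO period

# (burn-rate factor, long window in hours), taken from the four alerts above
WINDOWS = [(14.4, 1), (6, 6), (3, 24), (1, 72)]


def alert_threshold(factor):
    """Error ratio above which the alert condition holds."""
    return factor * ERROR_BUDGET


def budget_consumed(factor, long_window_hours):
    """Fraction of the period's error budget spent over the long window."""
    return factor * long_window_hours / SLO_PERIOD_HOURS


for factor, hours in WINDOWS:
    print(f"factor={factor} window={hours}h "
          f"threshold={alert_threshold(factor):.3f} "
          f"budget_consumed={budget_consumed(factor, hours):.0%}")
```

Under that assumption the two critical alerts correspond to 2% and 5% of the budget, and the two warning alerts to 10% each.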

kube-apiserver.rules

Last evaluation: 5.735s ago
Evaluation time: 720.3ms

Rule State Error Last Evaluation Evaluation Time
record: apiserver_request:burnrate1d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[1d])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[1d])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[1d])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[1d])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[1d]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[1d])) labels: verb: read ok 5.735s ago 195.4ms
record: apiserver_request:burnrate1h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[1h])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[1h])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[1h])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[1h])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[1h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[1h])) labels: verb: read ok 5.54s ago 10.54ms
record: apiserver_request:burnrate2h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[2h])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[2h])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[2h])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[2h])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[2h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[2h])) labels: verb: read ok 5.53s ago 19.81ms
record: apiserver_request:burnrate30m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[30m])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[30m])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[30m])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[30m])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[30m]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[30m])) labels: verb: read ok 5.51s ago 10.39ms
record: apiserver_request:burnrate3d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[3d])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[3d])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[3d])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[3d])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[3d]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[3d])) labels: verb: read ok 5.5s ago 198.2ms
record: apiserver_request:burnrate5m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[5m])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[5m])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[5m])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[5m])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[5m]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[5m])) labels: verb: read ok 5.302s ago 5.244ms
record: apiserver_request:burnrate6h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[6h])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[6h])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[6h])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[6h])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[6h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[6h])) labels: verb: read ok 5.297s ago 45.04ms
record: apiserver_request:burnrate1d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1d])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[1d]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1d]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1d])) labels: verb: write ok 5.252s ago 55.2ms
record: apiserver_request:burnrate1h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[1h]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h])) labels: verb: write ok 5.197s ago 3.752ms
record: apiserver_request:burnrate2h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[2h])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[2h]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[2h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[2h])) labels: verb: write ok 5.193s ago 5.163ms
record: apiserver_request:burnrate30m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30m])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[30m]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30m]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30m])) labels: verb: write ok 5.188s ago 3.435ms
record: apiserver_request:burnrate3d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[3d])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[3d]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[3d]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[3d])) labels: verb: write ok 5.185s ago 71.7ms
record: apiserver_request:burnrate5m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[5m]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m])) labels: verb: write ok 5.113s ago 2.051ms
record: apiserver_request:burnrate6h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[6h]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h])) labels: verb: write ok 5.111s ago 15.32ms
record: code_resource:apiserver_request_total:rate5m expr: sum by(code, resource) (rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[5m])) labels: verb: read ok 5.096s ago 2.846ms
record: code_resource:apiserver_request_total:rate5m expr: sum by(code, resource) (rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m])) labels: verb: write ok 5.093s ago 1.021ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum by(le, resource) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET"}[5m]))) > 0 labels: quantile: "0.99" verb: read ok 5.092s ago 17.26ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum by(le, resource) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m]))) > 0 labels: quantile: "0.99" verb: write ok 5.075s ago 5.801ms
record: cluster:apiserver_request_duration_seconds:mean5m expr: sum without(instance, pod) (rate(apiserver_request_duration_seconds_sum{subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) / sum without(instance, pod) (rate(apiserver_request_duration_seconds_count{subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) ok 5.069s ago 3.211ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum without(instance, pod) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m]))) labels: quantile: "0.99" ok 5.066s ago 18ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.9, sum without(instance, pod) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m]))) labels: quantile: "0.9" ok 5.048s ago 16.88ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.5, sum without(instance, pod) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m]))) labels: quantile: "0.5" ok 5.032s ago 13.99ms
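
The `histogram_quantile` rules above estimate a latency quantile from cumulative `le` buckets by linear interpolation inside the bucket that contains the target rank. The following is a simplified Python sketch of that idea, not the Prometheus implementation: it assumes positive bucket bounds, 0 < q <= 1, and a final `+Inf` bucket, and the bucket values are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets."""
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, prev_count = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Linear interpolation within the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical cumulative bucket counts for a latency histogram.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.5, buckets))
print(histogram_quantile(0.75, buckets))
```

In the recording rules the bucket values are per-second rates rather than raw counts; the interpolation works the same way because only the ratios between buckets matter.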

kube-prometheus-general.rules

Last evaluation: 10.74s ago
Evaluation time: 791.8us

Rule State Error Last Evaluation Evaluation Time
record: count:up1 expr: count without(instance, pod, node) (up == 1) ok 10.74s ago 484.3us
record: count:up0 expr: count without(instance, pod, node) (up == 0) ok 10.739s ago 293.5us

kube-prometheus-node-recording.rules

Last evaluation: 8.901s ago
Evaluation time: 3.102ms

Rule State Error Last Evaluation Evaluation Time
record: instance:node_cpu:rate:sum expr: sum by(instance) (rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[3m])) ok 8.901s ago 571.6us
record: instance:node_network_receive_bytes:rate:sum expr: sum by(instance) (rate(node_network_receive_bytes_total[3m])) ok 8.901s ago 479.6us
record: instance:node_network_transmit_bytes:rate:sum expr: sum by(instance) (rate(node_network_transmit_bytes_total[3m])) ok 8.9s ago 483.4us
record: instance:node_cpu:ratio expr: sum without(cpu, mode) (rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[5m])) / on(instance) group_left() count by(instance) (sum by(instance, cpu) (node_cpu_seconds_total)) ok 8.9s ago 782.7us
record: cluster:node_cpu:sum_rate5m expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[5m])) ok 8.899s ago 382.1us
record: cluster:node_cpu:ratio expr: cluster:node_cpu_seconds_total:rate5m / count(sum by(instance, cpu) (node_cpu_seconds_total)) ok 8.899s ago 386.1us

kube-scheduler.rules

Last evaluation: 14.164s ago
Evaluation time: 1.396ms

Rule State Error Last Evaluation Evaluation Time
record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum without(instance, pod) (rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.99" ok 14.164s ago 269.2us
record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum without(instance, pod) (rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.99" ok 14.164s ago 180.2us
record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum without(instance, pod) (rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.99" ok 14.164s ago 169.9us
record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile expr: histogram_quantile(0.9, sum without(instance, pod) (rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.9" ok 14.164s ago 127.1us
record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile expr: histogram_quantile(0.9, sum without(instance, pod) (rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.9" ok 14.164s ago 130.9us
record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile expr: histogram_quantile(0.9, sum without(instance, pod) (rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.9" ok 14.164s ago 147.9us
record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile expr: histogram_quantile(0.5, sum without(instance, pod) (rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.5" ok 14.164s ago 133.8us
record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile expr: histogram_quantile(0.5, sum without(instance, pod) (rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.5" ok 14.164s ago 102.5us
record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile expr: histogram_quantile(0.5, sum without(instance, pod) (rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.5" ok 14.164s ago 116.1us

kube-state-metrics

Last evaluation: 1.327s ago
Evaluation time: 1.987ms

Rule State Error Last Evaluation Evaluation Time
alert: KubeStateMetricsListErrors expr: (sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m]))) > 0.01 for: 15m labels: severity: critical annotations: message: kube-state-metrics is experiencing errors at an elevated rate in list operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatemetricslisterrors ok 1.327s ago 1.124ms
alert: KubeStateMetricsWatchErrors expr: (sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m]))) > 0.01 for: 15m labels: severity: critical annotations: message: kube-state-metrics is experiencing errors at an elevated rate in watch operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatemetricswatcherrors ok 1.327s ago 852.7us

kubelet.rules

Last evaluation: 22.22s ago
Evaluation time: 1.433ms

Rule State Error Last Evaluation Evaluation Time
record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) * on(instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"}) labels: quantile: "0.99" ok 22.22s ago 617.6us
record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile expr: histogram_quantile(0.9, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) * on(instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"}) labels: quantile: "0.9" ok 22.219s ago 395.2us
record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile expr: histogram_quantile(0.5, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) * on(instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"}) labels: quantile: "0.5" ok 22.219s ago 409.1us

kubernetes-apps

26.637s ago

7.067ms

Rule State Error Last Evaluation Evaluation Time
alert: KubePodCrashLooping expr: rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[5m]) * 60 * 5 > 0 for: 15m labels: severity: warning annotations: message: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping ok 26.637s ago 627.2us
alert: KubePodNotReady expr: sum by(namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"}))) > 0 for: 15m labels: severity: warning annotations: message: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodnotready ok 26.637s ago 1.274ms
alert: KubeDeploymentGenerationMismatch expr: kube_deployment_status_observed_generation{job="kube-state-metrics"} != kube_deployment_metadata_generation{job="kube-state-metrics"} for: 15m labels: severity: warning annotations: message: Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match; this indicates that the Deployment has failed but has not been rolled back. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentgenerationmismatch ok 26.636s ago 418.8us
alert: KubeDeploymentReplicasMismatch expr: (kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"}) and (changes(kube_deployment_status_replicas_updated{job="kube-state-metrics"}[5m]) == 0) for: 15m labels: severity: warning annotations: message: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentreplicasmismatch ok 26.635s ago 635.9us
alert: KubeStatefulSetReplicasMismatch expr: (kube_statefulset_status_replicas_ready{job="kube-state-metrics"} != kube_statefulset_status_replicas{job="kube-state-metrics"}) and (changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics"}[5m]) == 0) for: 15m labels: severity: warning annotations: message: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetreplicasmismatch ok 26.635s ago 365.7us
alert: KubeStatefulSetGenerationMismatch expr: kube_statefulset_status_observed_generation{job="kube-state-metrics"} != kube_statefulset_metadata_generation{job="kube-state-metrics"} for: 15m labels: severity: warning annotations: message: StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match; this indicates that the StatefulSet has failed but has not been rolled back. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetgenerationmismatch ok 26.635s ago 202.7us
alert: KubeStatefulSetUpdateNotRolledOut expr: (max without(revision) (kube_statefulset_status_current_revision{job="kube-state-metrics"} unless kube_statefulset_status_update_revision{job="kube-state-metrics"}) * (kube_statefulset_replicas{job="kube-state-metrics"} != kube_statefulset_status_replicas_updated{job="kube-state-metrics"})) and (changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics"}[5m]) == 0) for: 15m labels: severity: warning annotations: message: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetupdatenotrolledout ok 26.635s ago 449.8us
alert: KubeDaemonSetRolloutStuck expr: ((kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}) or (kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} != 0) or (kube_daemonset_updated_number_scheduled{job="kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}) or (kube_daemonset_status_number_available{job="kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"})) and (changes(kube_daemonset_updated_number_scheduled{job="kube-state-metrics"}[5m]) == 0) for: 15m labels: severity: warning annotations: message: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not finished or progressed for at least 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetrolloutstuck ok 26.635s ago 557.6us
alert: KubeContainerWaiting expr: sum by(namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"}) > 0 for: 1h labels: severity: warning annotations: message: Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container}} has been in waiting state for longer than 1 hour. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontainerwaiting ok 26.634s ago 1.705ms
alert: KubeDaemonSetNotScheduled expr: kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} > 0 for: 10m labels: severity: warning annotations: message: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetnotscheduled ok 26.633s ago 204.7us
alert: KubeDaemonSetMisScheduled expr: kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0 for: 15m labels: severity: warning annotations: message: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetmisscheduled ok 26.633s ago 114.8us
alert: KubeJobCompletion expr: kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} > 0 for: 12h labels: severity: warning annotations: message: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than 12 hours to complete. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobcompletion ok 26.633s ago 147.9us
alert: KubeJobFailed expr: kube_job_failed{job="kube-state-metrics"} > 0 for: 15m labels: severity: warning annotations: message: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobfailed ok 26.633s ago 69.87us
alert: KubeHpaReplicasMismatch expr: (kube_hpa_status_desired_replicas{job="kube-state-metrics"} != kube_hpa_status_current_replicas{job="kube-state-metrics"}) and changes(kube_hpa_status_current_replicas[15m]) == 0 for: 15m labels: severity: warning annotations: message: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has not matched the desired number of replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubehpareplicasmismatch ok 26.633s ago 175.9us
alert: KubeHpaMaxedOut expr: kube_hpa_status_current_replicas{job="kube-state-metrics"} == kube_hpa_spec_max_replicas{job="kube-state-metrics"} for: 15m labels: severity: warning annotations: message: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has been running at max replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubehpamaxedout ok 26.633s ago 95.81us
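
The KubePodCrashLooping expression in this group multiplies a per-second rate by 60 * 5 so that the value reads as restarts per five minutes, which is the unit its message reports. A minimal Python sketch of that unit conversion follows. It ignores the extrapolation that PromQL's rate() applies at the window edges, and the function and parameter names are illustrative only.

```python
def restarts_per_five_minutes(counter_start, counter_end, window_seconds=300.0):
    """Convert a restart-counter delta into restarts per five minutes."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    per_second = (counter_end - counter_start) / window_seconds
    # Same scaling as the rule: seconds -> minutes -> five minutes.
    return per_second * 60 * 5


def crash_looping(counter_start, counter_end):
    """The rule's condition: any restart in the window makes the value > 0."""
    return restarts_per_five_minutes(counter_start, counter_end) > 0
```

Because the threshold is `> 0`, a single restart inside the window is enough to satisfy the expression; the `for: 15m` clause is what keeps one-off restarts from firing the alert.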

kubernetes-resources

14.196s ago

2.577ms

Rule State Error Last Evaluation Evaluation Time
alert: KubeCPUOvercommit expr: sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum) / sum(kube_node_status_allocatable_cpu_cores) > (count(kube_node_status_allocatable_cpu_cores) - 1) / count(kube_node_status_allocatable_cpu_cores) for: 5m labels: severity: warning annotations: message: Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit ok 14.196s ago 644.2us
alert: KubeMemoryOvercommit expr: sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum) / sum(kube_node_status_allocatable_memory_bytes) > (count(kube_node_status_allocatable_memory_bytes) - 1) / count(kube_node_status_allocatable_memory_bytes) for: 5m labels: severity: warning annotations: message: Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryovercommit ok 14.195s ago 459.9us
alert: KubeCPUQuotaOvercommit expr: sum(kube_resourcequota{job="kube-state-metrics",resource="cpu",type="hard"}) / sum(kube_node_status_allocatable_cpu_cores) > 1.5 for: 5m labels: severity: warning annotations: message: Cluster has overcommitted CPU resource requests for Namespaces. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuquotaovercommit ok 14.195s ago 168.5us
alert: KubeMemoryQuotaOvercommit expr: sum(kube_resourcequota{job="kube-state-metrics",resource="memory",type="hard"}) / sum(kube_node_status_allocatable_memory_bytes{job="node-exporter"}) > 1.5 for: 5m labels: severity: warning annotations: message: Cluster has overcommitted memory resource requests for Namespaces. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryquotaovercommit ok 14.195s ago 131.3us
alert: KubeQuotaFullyUsed expr: kube_resourcequota{job="kube-state-metrics",type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) >= 1 for: 15m labels: severity: info annotations: message: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotafullyused ok 14.195s ago 109.2us
alert: CPUThrottlingHigh expr: sum by(container, pod, namespace) (increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) / sum by(container, pod, namespace) (increase(container_cpu_cfs_periods_total[5m])) > (25 / 100) for: 15m labels: severity: info annotations: message: '{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh ok 14.195s ago 1.047ms
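
KubeCPUOvercommit and KubeMemoryOvercommit compare the requested share of allocatable capacity against (n - 1) / n, where n is the node count. A small Python sketch of that comparison, assuming per-node allocatable values are available as a plain list; the function name and inputs are hypothetical.

```python
def overcommitted(total_requested, allocatable_per_node):
    """Return True when requests exceed the (n - 1) / n share of capacity."""
    nodes = len(allocatable_per_node)
    if nodes == 0:
        raise ValueError("at least one node is required")
    requested_ratio = total_requested / sum(allocatable_per_node)
    threshold = (nodes - 1) / nodes
    return requested_ratio > threshold


# Three nodes with 4 allocatable cores each: the threshold is 2/3 of 12 cores.
fits_after_node_loss = not overcommitted(7, [4, 4, 4])
```

The threshold treats all nodes as roughly equal in size, so with uneven nodes it is only an approximation of what the cluster can absorb when its largest node fails.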

kubernetes-storage

5.705s ago

980.1us

Rule State Error Last Evaluation Evaluation Time
alert: KubePersistentVolumeFillingUp expr: kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics"} < 0.03 for: 1m labels: severity: critical annotations: message: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefillingup ok 5.706s ago 331.4us
alert: KubePersistentVolumeFillingUp expr: (kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics"}) < 0.15 and predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics"}[6h], 4 * 24 * 3600) < 0 for: 1h labels: severity: warning annotations: message: Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ $value | humanizePercentage }} is available. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefillingup ok 5.705s ago 430.4us
alert: KubePersistentVolumeErrors expr: kube_persistentvolume_status_phase{job="kube-state-metrics",phase=~"Failed|Pending"} > 0 for: 5m labels: severity: critical annotations: message: The persistent volume {{ $labels.persistentvolume }} has status {{ $labels.phase }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumeerrors ok 5.705s ago 203.7us
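
The warning-level KubePersistentVolumeFillingUp rule combines a free-space ratio below 0.15 with a predict_linear extrapolation four days ahead. The Python sketch below approximates that logic with an ordinary least-squares fit over `(timestamp, value)` samples. It is an illustration under assumptions, not Prometheus code: the function names, argument layout and sample data are made up for the example.

```python
def predict_linear(samples, horizon_seconds, now):
    """Fit a least-squares line to (timestamp, value) samples and
    extrapolate it to now + horizon_seconds."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (now + horizon_seconds)


def volume_filling_up(samples, capacity_bytes, now, horizon_seconds=4 * 24 * 3600):
    """Both conditions of the warning rule: low free ratio and a
    predicted available-bytes value below zero at the horizon."""
    available_now = samples[-1][1]
    low_space = available_now / capacity_bytes < 0.15
    return low_space and predict_linear(samples, horizon_seconds, now) < 0


# Hypothetical volume losing 100 bytes per hour over a two-hour window.
history = [(0, 1000.0), (3600, 900.0), (7200, 800.0)]
predicted = predict_linear(history, 4 * 24 * 3600, now=7200)
```

Requiring both conditions avoids alerting on volumes that are shrinking quickly but still have plenty of headroom.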

kubernetes-system

14.161s ago

811.9us

Rule State Error Last Evaluation Evaluation Time
alert: KubeVersionMismatch expr: count(count by(gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"}, "gitVersion", "$1", "gitVersion", "(v[0-9]*.[0-9]*).*"))) > 1 for: 15m labels: severity: warning annotations: message: There are {{ $value }} different semantic versions of Kubernetes components running. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeversionmismatch ok 14.161s ago 344.4us
alert: KubeClientErrors expr: (sum by(instance, job) (rate(rest_client_requests_total{code=~"5.."}[5m])) / sum by(instance, job) (rate(rest_client_requests_total[5m]))) > 0.01 for: 15m labels: severity: warning annotations: message: Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclienterrors ok 14.161s ago 458.3us

kubernetes-system-apiserver

26.975s ago

1.777ms

Rule State Error Last Evaluation Evaluation Time
alert: KubeClientCertificateExpiration expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by(job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800 labels: severity: warning annotations: message: A client certificate used to authenticate to the apiserver is expiring in less than 7.0 days. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration ok 26.975s ago 453us
alert: KubeClientCertificateExpiration expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by(job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400 labels: severity: critical annotations: message: A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration ok 26.975s ago 270.1us
alert: AggregatedAPIErrors expr: sum by(name, namespace) (increase(aggregator_unavailable_apiservice_count[5m])) > 2 labels: severity: warning annotations: message: An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has reported errors. The number of errors has increased for it in the past five minutes. High values indicate that the availability of the service changes too often. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-aggregatedapierrors ok 26.975s ago 69.05us
alert: AggregatedAPIDown expr: (1 - max by(name, namespace) (avg_over_time(aggregator_unavailable_apiservice[5m]))) * 100 < 90 for: 5m labels: severity: warning annotations: message: An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has been only {{ $value | humanize }}% available over the last 5m. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-aggregatedapidown ok 26.975s ago 820.8us
alert: KubeAPIDown expr: absent(up{job="apiserver"} == 1) for: 15m labels: severity: critical annotations: message: KubeAPI has disappeared from Prometheus target discovery. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapidown ok 26.974s ago 150.3us

kubernetes-system-controller-manager

17.338s ago

573.5us

Rule State Error Last Evaluation Evaluation Time
alert: KubeControllerManagerDown expr: absent(up{job="kube-controller-manager"} == 1) for: 15m labels: severity: critical annotations: message: KubeControllerManager has disappeared from Prometheus target discovery. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontrollermanagerdown ok 17.338s ago 563.8us

kubernetes-system-kubelet

24.81s ago

2.364ms

Rule State Error Last Evaluation Evaluation Time
alert: KubeNodeNotReady expr: kube_node_status_condition{condition="Ready",job="kube-state-metrics",status="true"} == 0 for: 15m labels: severity: warning annotations: message: '{{ $labels.node }} has been unready for more than 15 minutes.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodenotready ok 24.81s ago 216.6us
alert: KubeNodeUnreachable expr: (kube_node_spec_taint{effect="NoSchedule",job="kube-state-metrics",key="node.kubernetes.io/unreachable"} unless ignoring(key, value) kube_node_spec_taint{job="kube-state-metrics",key=~"ToBeDeletedByClusterAutoscaler|cloud.google.com/impending-node-termination|aws-node-termination-handler/spot-itn"}) == 1 labels: severity: warning annotations: message: '{{ $labels.node }} is unreachable and some workloads may be rescheduled.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodeunreachable ok 24.81s ago 247.5us
alert: KubeletTooManyPods expr: count by(node) ((kube_pod_status_phase{job="kube-state-metrics",phase="Running"} == 1) * on(instance, pod, namespace, cluster) group_left(node) topk by(instance, pod, namespace, cluster) (1, kube_pod_info{job="kube-state-metrics"})) / max by(node) (kube_node_status_capacity_pods{job="kube-state-metrics"} != 1) > 0.95 for: 15m labels: severity: warning annotations: message: Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubelettoomanypods ok 24.81s ago 967.9us
alert: KubeNodeReadinessFlapping expr: sum by(node) (changes(kube_node_status_condition{condition="Ready",status="true"}[15m])) > 2 for: 15m labels: severity: warning annotations: message: The readiness status of node {{ $labels.node }} has changed {{ $value }} times in the last 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodereadinessflapping ok 24.809s ago 147.1us
alert: KubeletPlegDurationHigh expr: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"} >= 10 for: 5m labels: severity: warning annotations: message: The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletplegdurationhigh ok 24.809s ago 78.85us
alert: KubeletPodStartUpLatencyHigh expr: histogram_quantile(0.99, sum by(instance, le) (rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet",metrics_path="/metrics"}[5m]))) * on(instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"} > 60 for: 15m labels: severity: warning annotations: message: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on node {{ $labels.node }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletpodstartuplatencyhigh ok 24.809s ago 569.1us
alert: KubeletDown expr: absent(up{job="kubelet",metrics_path="/metrics"} == 1) for: 15m labels: severity: critical annotations: message: Kubelet has disappeared from Prometheus target discovery. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletdown ok 24.809s ago 121.3us

kubernetes-system-scheduler

28.814s ago

434.8us

Rule State Error Last Evaluation Evaluation Time
alert: KubeSchedulerDown expr: absent(up{job="kube-scheduler"} == 1) for: 15m labels: severity: critical annotations: message: KubeScheduler has disappeared from Prometheus target discovery. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeschedulerdown ok 28.814s ago 425.5us

node-exporter

15.273s ago

35.37ms

Rule State Error Last Evaluation Evaluation Time
alert: NodeFilesystemSpaceFillingUp expr: (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 40 and predict_linear(node_filesystem_avail_bytes{fstype!="",job="node-exporter"}[6h], 24 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: warning annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemspacefillingup summary: Filesystem is predicted to run out of space within the next 24 hours. ok 15.273s ago 7.641ms
alert: NodeFilesystemSpaceFillingUp expr: (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 15 and predict_linear(node_filesystem_avail_bytes{fstype!="",job="node-exporter"}[6h], 4 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: critical annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up fast. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemspacefillingup summary: Filesystem is predicted to run out of space within the next 4 hours. ok 15.265s ago 7.094ms
alert: NodeFilesystemAlmostOutOfSpace expr: (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 5 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: warning annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemalmostoutofspace summary: Filesystem has less than 5% space left. ok 15.259s ago 1.31ms
alert: NodeFilesystemAlmostOutOfSpace expr: (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 3 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: critical annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemalmostoutofspace summary: Filesystem has less than 3% space left. ok 15.257s ago 1.23ms
alert: NodeFilesystemFilesFillingUp expr: (node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 40 and predict_linear(node_filesystem_files_free{fstype!="",job="node-exporter"}[6h], 24 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: warning annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemfilesfillingup summary: Filesystem is predicted to run out of inodes within the next 24 hours. ok 15.256s ago 6.888ms
alert: NodeFilesystemFilesFillingUp expr: (node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 20 and predict_linear(node_filesystem_files_free{fstype!="",job="node-exporter"}[6h], 4 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: critical annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up fast. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemfilesfillingup summary: Filesystem is predicted to run out of inodes within the next 4 hours. ok 15.25s ago 6.903ms
alert: NodeFilesystemAlmostOutOfFiles expr: (node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 5 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: warning annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemalmostoutoffiles summary: Filesystem has less than 5% inodes left. ok 15.243s ago 1.278ms
alert: NodeFilesystemAlmostOutOfFiles expr: (node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 3 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: critical annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemalmostoutoffiles summary: Filesystem has less than 3% inodes left. ok 15.242s ago 1.5ms
alert: NodeNetworkReceiveErrs expr: increase(node_network_receive_errs_total[2m]) > 10 for: 1h labels: severity: warning annotations: description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last two minutes.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodenetworkreceiveerrs summary: Network interface is reporting many receive errors. ok 15.24s ago 368.6us
alert: NodeNetworkTransmitErrs expr: increase(node_network_transmit_errs_total[2m]) > 10 for: 1h labels: severity: warning annotations: description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last two minutes.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodenetworktransmiterrs summary: Network interface is reporting many transmit errors. ok 15.24s ago 359.6us
alert: NodeHighNumberConntrackEntriesUsed expr: (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75 labels: severity: warning annotations: description: '{{ $value | humanizePercentage }} of conntrack entries are used.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodehighnumberconntrackentriesused summary: Number of conntrack entries is getting close to the limit. ok 15.24s ago 111.3us
alert: NodeTextFileCollectorScrapeError expr: node_textfile_scrape_error{job="node-exporter"} == 1 labels: severity: warning annotations: description: Node Exporter text file collector failed to scrape. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodetextfilecollectorscrapeerror summary: Node Exporter text file collector failed to scrape. ok 15.24s ago 82.52us
alert: NodeClockSkewDetected expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0) for: 10m labels: severity: warning annotations: message: Clock on {{ $labels.instance }} is out of sync by more than 0.05s. Ensure NTP is configured correctly on this host. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodeclockskewdetected summary: Clock skew detected. ok 15.24s ago 196.1us
alert: NodeClockNotSynchronising expr: min_over_time(node_timex_sync_status[5m]) == 0 for: 10m labels: severity: warning annotations: message: Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodeclocknotsynchronising summary: Clock not synchronising. ok 15.24s ago 380.8us
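
The NodeClockSkewDetected expression in this group pairs an offset threshold of 0.05 seconds with the sign of the offset's derivative, so it matches only when the clock is off and not converging back. A minimal Python sketch of that condition, with illustrative function and parameter names:

```python
def clock_skew_detected(offset_seconds, offset_derivative, threshold=0.05):
    """Mirror the rule's two branches: skewed ahead and not shrinking,
    or skewed behind and not recovering."""
    ahead_and_not_recovering = offset_seconds > threshold and offset_derivative >= 0
    behind_and_not_recovering = offset_seconds < -threshold and offset_derivative <= 0
    return ahead_and_not_recovering or behind_and_not_recovering
```

A host whose offset is large but moving toward zero is therefore skipped, which avoids alerting while NTP is still correcting the clock.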

node-exporter.rules

27.857s ago

2.988ms

Rule State Error Last Evaluation Evaluation Time
record: instance:node_num_cpu:sum expr: count without(cpu) (count without(mode) (node_cpu_seconds_total{job="node-exporter"})) ok 27.857s ago 574.5us
record: instance:node_cpu_utilisation:rate1m expr: 1 - avg without(cpu, mode) (rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m])) ok 27.856s ago 206.3us
record: instance:node_load1_per_cpu:ratio expr: (node_load1{job="node-exporter"} / instance:node_num_cpu:sum{job="node-exporter"}) ok 27.856s ago 113.8us
record: instance:node_memory_utilisation:ratio expr: 1 - (node_memory_MemAvailable_bytes{job="node-exporter"} / node_memory_MemTotal_bytes{job="node-exporter"}) ok 27.856s ago 109.9us
record: instance:node_vmstat_pgmajfault:rate1m expr: rate(node_vmstat_pgmajfault{job="node-exporter"}[1m]) ok 27.856s ago 81.39us
record: instance_device:node_disk_io_time_seconds:rate1m expr: rate(node_disk_io_time_seconds_total{device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+",job="node-exporter"}[1m]) ok 27.856s ago 156.9us
record: instance_device:node_disk_io_time_weighted_seconds:rate1m expr: rate(node_disk_io_time_weighted_seconds_total{device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+",job="node-exporter"}[1m]) ok 27.856s ago 133.8us
record: instance:node_network_receive_bytes_excluding_lo:rate1m expr: sum without(device) (rate(node_network_receive_bytes_total{device!="lo",job="node-exporter"}[1m])) ok 27.856s ago 431.4us
record: instance:node_network_transmit_bytes_excluding_lo:rate1m expr: sum without(device) (rate(node_network_transmit_bytes_total{device!="lo",job="node-exporter"}[1m])) ok 27.856s ago 445.8us
record: instance:node_network_receive_drop_excluding_lo:rate1m expr: sum without(device) (rate(node_network_receive_drop_total{device!="lo",job="node-exporter"}[1m])) ok 27.855s ago 362.2us
record: instance:node_network_transmit_drop_excluding_lo:rate1m expr: sum without(device) (rate(node_network_transmit_drop_total{device!="lo",job="node-exporter"}[1m])) ok 27.855s ago 353.2us
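
Each row above is one entry of a Prometheus rule file, flattened onto a single line. As a minimal sketch, the CPU utilisation recording rule might look like this in YAML, assuming the group name matches the heading shown on this page. The group's evaluation interval is not shown here, so it is left out.

```yaml
groups:
  - name: node-exporter.rules
    rules:
      # Fraction of CPU time not spent idle, averaged across the CPUs of each instance.
      - record: instance:node_cpu_utilisation:rate1m
        expr: |
          1 - avg without(cpu, mode) (
            rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m])
          )
```

The trailing columns of each row (state, last evaluation, evaluation time) are runtime status reported by the server and are not part of the rule definition.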

node-network

18.161s ago

626us

Rule State Error Last Evaluation Evaluation Time
alert: NodeNetworkInterfaceFlapping expr: changes(node_network_up{device!~"veth.+",job="node-exporter"}[2m]) > 2 for: 2m labels: severity: warning annotations: message: Network interface "{{ $labels.device }}" changing its up status often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }} ok 18.161s ago 616.9us

node.rules

5.692s ago

3.539ms

Rule State Error Last Evaluation Evaluation Time
record: :kube_pod_info_node_count: expr: sum(min by(cluster, node) (kube_pod_info{node!=""})) ok 5.692s ago 585.2us
record: node_namespace_pod:kube_pod_info: expr: topk by(namespace, pod) (1, max by(node, namespace, pod) (label_replace(kube_pod_info{job="kube-state-metrics",node!=""}, "pod", "$1", "pod", "(.*)"))) ok 5.692s ago 1.129ms
record: node:node_num_cpu:sum expr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job="node-exporter"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:)) ok 5.691s ago 1.405ms
record: :node_memory_MemAvailable_bytes:sum expr: sum by(cluster) (node_memory_MemAvailable_bytes{job="node-exporter"} or (node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"} + node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Slab_bytes{job="node-exporter"})) ok 5.69s ago 404.7us
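
The last rule above relies on the PromQL `or` operator as a fallback: it keeps every series from the left-hand side and adds right-hand series only for label sets that have no match on the left. The sketch below shows the same expression laid out over several lines with comments. The reading that the fallback covers hosts which do not export `node_memory_MemAvailable_bytes` is an interpretation, not something stated on this page.

```yaml
groups:
  - name: node.rules
    rules:
      - record: ':node_memory_MemAvailable_bytes:sum'
        expr: |
          sum by(cluster) (
            # Preferred: the available-memory figure exported directly.
            node_memory_MemAvailable_bytes{job="node-exporter"}
            or
            # Fallback, used only for label sets with no MemAvailable series:
            (
              node_memory_Buffers_bytes{job="node-exporter"}
              + node_memory_Cached_bytes{job="node-exporter"}
              + node_memory_MemFree_bytes{job="node-exporter"}
              + node_memory_Slab_bytes{job="node-exporter"}
            )
          )
```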

prometheus

18.143s ago

2.391ms

Rule State Error Last Evaluation Evaluation Time
alert: PrometheusBadConfig expr: max_over_time(prometheus_config_last_reload_successful{job="prometheus-k8s",namespace="monitoring"}[5m]) == 0 for: 10m labels: severity: critical annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to reload its configuration. summary: Failed Prometheus configuration reload. ok 18.143s ago 282us
alert: PrometheusNotificationQueueRunningFull expr: (predict_linear(prometheus_notifications_queue_length{job="prometheus-k8s",namespace="monitoring"}[5m], 60 * 30) > min_over_time(prometheus_notifications_queue_capacity{job="prometheus-k8s",namespace="monitoring"}[5m])) for: 15m labels: severity: warning annotations: description: Alert notification queue of Prometheus {{$labels.namespace}}/{{$labels.pod}} is running full. summary: Prometheus alert notification queue predicted to run full in less than 30m. ok 18.143s ago 293.8us
alert: PrometheusErrorSendingAlertsToSomeAlertmanagers expr: (rate(prometheus_notifications_errors_total{job="prometheus-k8s",namespace="monitoring"}[5m]) / rate(prometheus_notifications_sent_total{job="prometheus-k8s",namespace="monitoring"}[5m])) * 100 > 1 for: 15m labels: severity: warning annotations: description: '{{ printf "%.1f" $value }}% errors while sending alerts from Prometheus {{$labels.namespace}}/{{$labels.pod}} to Alertmanager {{$labels.alertmanager}}.' summary: Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager. ok 18.143s ago 271.3us
alert: PrometheusErrorSendingAlertsToAnyAlertmanager expr: min without(alertmanager) (rate(prometheus_notifications_errors_total{job="prometheus-k8s",namespace="monitoring"}[5m]) / rate(prometheus_notifications_sent_total{job="prometheus-k8s",namespace="monitoring"}[5m])) * 100 > 3 for: 15m labels: severity: critical annotations: description: '{{ printf "%.1f" $value }}% minimum errors while sending alerts from Prometheus {{$labels.namespace}}/{{$labels.pod}} to any Alertmanager.' summary: Prometheus encounters more than 3% errors sending alerts to any Alertmanager. ok 18.143s ago 184.1us
alert: PrometheusNotConnectedToAlertmanagers expr: max_over_time(prometheus_notifications_alertmanagers_discovered{job="prometheus-k8s",namespace="monitoring"}[5m]) < 1 for: 10m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not connected to any Alertmanagers. summary: Prometheus is not connected to any Alertmanagers. ok 18.142s ago 72.42us
alert: PrometheusTSDBReloadsFailing expr: increase(prometheus_tsdb_reloads_failures_total{job="prometheus-k8s",namespace="monitoring"}[3h]) > 0 for: 4h labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected {{$value | humanize}} reload failures over the last 3h. summary: Prometheus has issues reloading blocks from disk. ok 18.143s ago 179.4us
alert: PrometheusTSDBCompactionsFailing expr: increase(prometheus_tsdb_compactions_failed_total{job="prometheus-k8s",namespace="monitoring"}[3h]) > 0 for: 4h labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected {{$value | humanize}} compaction failures over the last 3h. summary: Prometheus has issues compacting blocks. ok 18.142s ago 130.4us
alert: PrometheusNotIngestingSamples expr: rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-k8s",namespace="monitoring"}[5m]) <= 0 for: 10m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not ingesting samples. summary: Prometheus is not ingesting samples. ok 18.142s ago 68.17us
alert: PrometheusDuplicateTimestamps expr: rate(prometheus_target_scrapes_sample_duplicate_timestamp_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0 for: 10m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping {{ printf "%.4g" $value }} samples/s with different values but duplicated timestamp. summary: Prometheus is dropping samples with duplicate timestamps. ok 18.142s ago 74.23us
alert: PrometheusOutOfOrderTimestamps expr: rate(prometheus_target_scrapes_sample_out_of_order_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0 for: 10m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping {{ printf "%.4g" $value }} samples/s with timestamps arriving out of order. summary: Prometheus drops samples with out-of-order timestamps. ok 18.143s ago 60.19us
alert: PrometheusRemoteStorageFailures expr: (rate(prometheus_remote_storage_failed_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m]) / (rate(prometheus_remote_storage_failed_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m]) + rate(prometheus_remote_storage_succeeded_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m]))) * 100 > 1 for: 15m labels: severity: critical annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} failed to send {{ printf "%.1f" $value }}% of the samples to {{ $labels.remote_name}}:{{ $labels.url }} summary: Prometheus fails to send samples to remote storage. ok 18.143s ago 142.3us
alert: PrometheusRemoteWriteBehind expr: (max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{job="prometheus-k8s",namespace="monitoring"}[5m]) - on(job, instance) group_right() max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job="prometheus-k8s",namespace="monitoring"}[5m])) > 120 for: 15m labels: severity: critical annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write is {{ printf "%.1f" $value }}s behind for {{ $labels.remote_name}}:{{ $labels.url }}. summary: Prometheus remote write is behind. ok 18.143s ago 116.2us
alert: PrometheusRemoteWriteDesiredShards expr: (max_over_time(prometheus_remote_storage_shards_desired{job="prometheus-k8s",namespace="monitoring"}[5m]) > max_over_time(prometheus_remote_storage_shards_max{job="prometheus-k8s",namespace="monitoring"}[5m])) for: 15m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write desired shards calculation wants to run {{ $value }} shards for queue {{ $labels.remote_name}}:{{ $labels.url }}, which is more than the max of {{ printf `prometheus_remote_storage_shards_max{instance="%s",job="prometheus-k8s",namespace="monitoring"}` $labels.instance | query | first | value }}. summary: Prometheus remote write desired shards calculation wants to run more than configured max shards. ok 18.143s ago 105.9us
alert: PrometheusRuleFailures expr: increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0 for: 15m labels: severity: critical annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to evaluate {{ printf "%.0f" $value }} rules in the last 5m. summary: Prometheus is failing rule evaluations. ok 18.143s ago 206.1us
alert: PrometheusMissingRuleEvaluations expr: increase(prometheus_rule_group_iterations_missed_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0 for: 15m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has missed {{ printf "%.0f" $value }} rule group evaluations in the last 5m. summary: Prometheus is missing rule evaluations due to slow rule group evaluation. ok 18.142s ago 185.9us

prometheus-operator

13.434s ago

942.1us

Rule State Error Last Evaluation Evaluation Time
alert: PrometheusOperatorListErrors expr: (sum by(controller, namespace) (rate(prometheus_operator_list_operations_failed_total{job="prometheus-operator",namespace="monitoring"}[1h])) / sum by(controller, namespace) (rate(prometheus_operator_list_operations_total{job="prometheus-operator",namespace="monitoring"}[1h]))) > 0.4 for: 15m labels: severity: warning annotations: message: Errors while performing List operations in controller {{$labels.controller}} in {{$labels.namespace}} namespace. ok 13.434s ago 496.9us
alert: PrometheusOperatorWatchErrors expr: (sum by(controller, namespace) (rate(prometheus_operator_watch_operations_failed_total{job="prometheus-operator",namespace="monitoring"}[1h])) / sum by(controller, namespace) (rate(prometheus_operator_watch_operations_total{job="prometheus-operator",namespace="monitoring"}[1h]))) > 0.4 for: 15m labels: severity: warning annotations: message: Errors while performing Watch operations in controller {{$labels.controller}} in {{$labels.namespace}} namespace. ok 13.434s ago 266us
alert: PrometheusOperatorReconcileErrors expr: rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1 for: 10m labels: severity: warning annotations: message: Errors while reconciling {{ $labels.controller }} in {{ $labels.namespace }} Namespace. ok 13.433s ago 100.8us
alert: PrometheusOperatorNodeLookupErrors expr: rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1 for: 10m labels: severity: warning annotations: message: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace. ok 13.433s ago 66.98us
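
Alerting rows follow the same pattern as recording rows, with `for`, `labels` and `annotations` added. A sketch of how the PrometheusOperatorReconcileErrors row might be written in a rule file, assuming the group name matches the heading shown on this page:

```yaml
groups:
  - name: prometheus-operator
    rules:
      - alert: PrometheusOperatorReconcileErrors
        # Condition: per-second reconcile error rate, averaged over 5m, above 0.1.
        expr: |
          rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
        # The condition must hold continuously for 10 minutes before the alert fires.
        for: 10m
        labels:
          severity: warning
        annotations:
          message: Errors while reconciling {{ $labels.controller }} in {{ $labels.namespace }} Namespace.
```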