Reliable RKE2 Certificate Expiry Alerts for Clusters Managed by SUSE Rancher
This shows how to combine Kubernetes Events and Rancher Metadata for precise, actionable alerting.
Alert ManagedClusterRKE2ExpiringCertificates
SUSE Rancher provides an RKE2 cluster management platform for application teams. As an infrastructure team operating a SUSE Rancher instance, we want to notify teams about expiring RKE2 certificates, as their rotation requires manual intervention. Unfortunately, the information about expiring certificates is only available in the Kubernetes Events of the Rancher-managed RKE2 clusters. Hence, we need to automatically install Kubernetes Event log collection via Fleet on the managed clusters.
For notification purposes, we need additional information about the cluster owner. This cannot be easily propagated down to managed clusters, so the event logs contain no reference to the Rancher project they belong to or to the owning application. Therefore, we need to enrich the alert with the application label that we added to the Rancher Project object. This enables routing of the alert to the correct team via Alertmanager.
Setting the Stage: Rancher, Projects and Managed Clusters
- Rancher deployment – A single RKE2 cluster hosts the Rancher application.
- Rancher Project management – The infrastructure team creates Rancher Projects (e.g., proj1, proj2), grants permissions to the application team, and adds a custom label application=<app> that denotes the owning product.
- Managed Cluster – Application teams operate their own RKE2 clusters inside those projects. These clusters are managed by Rancher.
The monitoring stack (Loki + Prometheus) runs in the Rancher host cluster, while the managed clusters need to forward their Kubernetes events as logs to the central Loki instance. These logs must include the cluster name.
Deploy Loki in the Rancher Cluster
Loki can be easily deployed via its Helm chart. For the recording rules, we need additional configuration:
- Ruler configuration – Load the rules from ConfigMaps.
- Remote write – Configure Loki to forward the recorded metrics to the Prometheus that is installed via the rancher-monitoring Helm chart.
This is achieved by the following values.
# values.yaml snippet for Loki Helm
loki:
  # requirement:
  # storage.bucketNames.rules is not set!
  # Remote write configuration
  config:
    ...
    # Remote write to Prometheus
    remote_write:
      - url: http://prometheus-operated.cattle-monitoring-system.svc:9090/api/v1/write
        timeout: 30s
  # configuration of rules taken from configmaps:
  rulerConfig:
    storage:
      type: local
      local:
        directory: /rules
    rule_path: /var/loki/rules-temp
# configuration of rules taken from configmaps supporting different tenants:
sidecar:
  rules:
    folderAnnotation: tenant
Deploy the Loki recording rule
The following ConfigMap specifies the recording rules for the tenant “managed-cluster”, which is the tenant the Kubernetes events are written to.
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-recording-rules
  namespace: monitoring
  annotations:
    tenant: managed-cluster
  labels:
    # marks config map to be used by the ruler:
    loki_rule: "true"
data:
  rke2-recording-rules.yaml: |
    groups:
      - name: rke2-recording-rules
        interval: 10m
        rules:
          - record: loki:recorded:rke2_cert_expiration_timestamp
            expr: |
              max by (rancher_cluster,managed_cluster,cert) (max_over_time(
                {source="kubernetes_events"} | json | reason = "CertificateExpirationWarning"
                | line_format "{{ .msg }}"
                | regexp "(?P<pair>[^,:]+\\.crt:.*?(expire within .+ days|expired) at [^,]+)"
                | label_format cert=`{{ regexReplaceAll "(?P<cert>[^:]+\\.crt):.*" .pair "${cert}" }}`
                | label_format expire=`{{ regexReplaceAll ".*(expire within .+ days|expired) at (?P<date>[^,]+)" .pair "${date}" }}`
                | label_format managed_cluster=cluster
                | label_format expires_ts=`{{ unixEpoch (toDate "2006-01-02T15:04:05Z07:00" .expire) }}s`
                | unwrap duration_seconds(expires_ts)
                [30m]))
This log query does the heavy lifting by creating one metric point per mentioned certificate from a single log line, such as
{
  "count": 638,
  "eventRV": "496410095",
  "kind": "Node",
  "msg": "Node certificates require attention - restart rke2 on this node to trigger automatic rotation: controller-manager/kube-controller-manager.crt: certificate CN=kube-controller-manager will expire within 120 days at 2026-06-08T07:38:41Z, scheduler/kube-scheduler.crt: certificate CN=kube-scheduler will expire within 120 days at 2026-06-08T07:38:41Z",
  "name": "rke2-eu-cluster1-controlplane-976f6b69-lt7lj",
  "reason": "CertificateExpirationWarning",
  "reportingcontroller": "rke2-cert-monitor",
  "reportinginstance": "rke2-eu-cluster1-controlplane-976f6b69-lt7lj",
  "sourcecomponent": "rke2-cert-monitor",
  "sourcehost": "rke2-eu-cluster1-controlplane-976f6b69-lt7lj",
  "type": "Warning"
}
with the following labels:
cluster=rke2-eu-cluster1
rancher_cluster=rancher-eu-prod
service_name=kubernetes_events
source=kubernetes_events
Let’s examine the steps of the query:
- {source="kubernetes_events"} | json | reason = "CertificateExpirationWarning": selects the log lines of Kubernetes events, parses each JSON line into labels, and keeps only the events with the reason “CertificateExpirationWarning”.
- | line_format "{{ .msg }}": keeps only the “msg” value from the JSON structure, i.e., the event message.
- | regexp "(?P<pair>[^,:]+\\.crt:.*?(expire within .+ days|expired) at [^,]+)": (?P<pair> … ) captures the part “certname … expire … at DATE” until the last date into a temporary label called pair.
- | label_format cert= ... | label_format expire= ...: regexReplaceAll extracts the certificate name and the expiry date from pair and stores them in the labels cert and expire.
- | label_format managed_cluster=cluster: simply renames the label cluster to managed_cluster.
- | label_format expires_ts=`{{ unixEpoch (toDate "2006-01-02T15:04:05Z07:00" .expire) }}s`:
  - toDate parses the ISO string using Go’s layout format.
  - unixEpoch converts the time.Time into seconds since epoch.
  - The trailing s turns the number into a Go duration string, which is the input format that duration_seconds expects in the next step; the suffix is stripped again there.
  Now we have the label expires_ts = 1780904321s (for the example above).
- | unwrap duration_seconds(expires_ts): creates a sample with:
  - value – the result of duration_seconds(expires_ts), i.e., the timestamp expressed as a float number of seconds.
  - labels – everything that exists at this point (rancher_cluster, managed_cluster, cert, etc.).
- Outer max by (…) (max_over_time(… [30m])): The inner max_over_time still returns a series per unique set of labels that existed for each inner log line. The outer max by collapses any residual duplication (e.g., if the same cert appears on two nodes) into a single series per certificate. The result is a canonical, monotonic timestamp series that can be safely joined with other metrics. The longer range of 30m is used to avoid missing logs due to transmission latencies from the managed cluster.
- interval: 10m: To limit processing effort and handle delayed log ingestion, the rule is executed only every 10 minutes, analyzing logs from the last 30 minutes.
The resulting metric points are:
loki:recorded:rke2_cert_expiration_timestamp{cert="controller-manager/kube-controller-manager.crt", managed_cluster="rke2-eu-cluster1", rancher_cluster="rancher-eu-prod"} 1780904321
loki:recorded:rke2_cert_expiration_timestamp{cert="scheduler/kube-scheduler.crt", managed_cluster="rke2-eu-cluster1", rancher_cluster="rancher-eu-prod"} 1780904236
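To make the intended result of the extraction more tangible, here is a minimal Python sketch that turns an event message into one expiry timestamp per certificate. It is only an illustration: the pattern is a simplified one chosen for this example (not the LogQL regexp from the rule), and the function name cert_expirations is hypothetical. It also assumes that the dates always end in a literal Z, as in the sample event.

```python
import re
from datetime import datetime, timezone

# Simplified pattern (an assumption for this sketch, not the LogQL regexp):
# one match per "<name>.crt: ... (expire within N days|expired) at <date>".
PAIR = re.compile(
    r"(?P<cert>[^\s,:]+\.crt):.*?"
    r"(?:expire within \d+ days|expired) at (?P<date>[^,\s]+)"
)


def cert_expirations(msg: str) -> dict:
    """Map each certificate named in an event message to its expiry epoch."""
    result = {}
    for match in PAIR.finditer(msg):
        # Parse the UTC timestamp and convert it to seconds since epoch,
        # which is what unixEpoch/toDate do in the recording rule.
        expires = datetime.strptime(match["date"], "%Y-%m-%dT%H:%M:%SZ")
        expires = expires.replace(tzinfo=timezone.utc)
        result[match["cert"]] = int(expires.timestamp())
    return result


msg = (
    "Node certificates require attention - restart rke2 on this node to "
    "trigger automatic rotation: "
    "controller-manager/kube-controller-manager.crt: certificate "
    "CN=kube-controller-manager will expire within 120 days at "
    "2026-06-08T07:38:41Z, scheduler/kube-scheduler.crt: certificate "
    "CN=kube-scheduler will expire within 120 days at 2026-06-08T07:38:41Z"
)

for cert, epoch in cert_expirations(msg).items():
    print(cert, epoch)
```

For the sample message, both certificates map to 1780904321, the epoch value of 2026-06-08T07:38:41Z.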
Collect Kubernetes Events from Managed Clusters
For collecting and forwarding Kubernetes Events from the managed cluster to the central Loki, we can use Grafana Alloy deployed via Helm chart. We need to add the managed cluster name and Rancher cluster as external labels, e.g., via environment variables:
# extract from alloy values.yaml
alloy:
  configMap:
    content: |-
      loki.source.kubernetes_events "cluster_events" {
        log_format = "json"
        forward_to = [ loki.process.cluster_events.receiver ]
      }
      ...
      loki.write "default" {
        endpoint {
          url = <ingress to loki write endpoint in Rancher RKE2 cluster>
        }
        external_labels = {
          cluster = sys.env("CLUSTER_NAME"),
          rancher_cluster = sys.env("RANCHER_CLUSTER"),
        }
      }
To ensure we get Kubernetes events automatically from all managed clusters, we can use a Fleet GitRepo within the Rancher cluster. The managed cluster name can then be taken from Fleet variables, while the Rancher cluster name is set statically in the repository files.
Export Rancher-derived Labels as Metrics
For alert routing, we need metrics that show the associated project and its application label for each managed cluster. However, the connection between managed clusters and projects is not trivial, as it involves several intermediate objects (such as cloud credentials and namespaces). Since SUSE Rancher does not provide these relationships as metrics, we had to write a customer-specific Prometheus metrics exporter, which is too complex to include here.
So let’s assume for now we have the combined metrics available as a single series per managed cluster:
rancher_managed_cluster_owner_info{managed_cluster="rke2-eu-cluster1",project_id="proj1",application="app1"} 1
rancher_managed_cluster_owner_info{managed_cluster="rke2-eu-cluster2",project_id="proj2",application="app1"} 1
rancher_managed_cluster_owner_info{managed_cluster="rke2-eu-cluster3",project_id="proj3",application="app2"} 1
This is an info metric – the value is always 1. This design allows us to use group_left to attach the application label to any rule that joins on managed_cluster.
This exporter is deployed in the Rancher RKE2 cluster and scraped by the operator-managed Prometheus via a ServiceMonitor. For example:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: custom-rancher-exporter
  namespace: monitoring
spec:
  endpoints:
    - port: http
  jobLabel: app
  selector:
    matchLabels:
      app.kubernetes.io/instance: custom-rancher-exporter
      app.kubernetes.io/name: custom-rancher-exporter
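The exporter itself is out of scope, but the shape of its output is simple. The following Python sketch renders the info metric in the Prometheus text exposition format from an already resolved mapping. The function name render_owner_info and the owners dictionary are hypothetical; how the mapping is derived from Rancher objects is exactly the part not covered here, and label value escaping is left out for brevity.

```python
METRIC = "rancher_managed_cluster_owner_info"


def render_owner_info(owners: dict) -> str:
    """Render one info series per managed cluster in Prometheus text format.

    `owners` maps a managed cluster name to a (project_id, application)
    tuple. Label values are assumed to need no escaping in this sketch.
    """
    lines = [
        f"# HELP {METRIC} Owning project and application of a managed cluster.",
        f"# TYPE {METRIC} gauge",
    ]
    for cluster, (project_id, application) in sorted(owners.items()):
        labels = (
            f'managed_cluster="{cluster}",'
            f'project_id="{project_id}",'
            f'application="{application}"'
        )
        # Info metric: the value is always 1, the labels carry the data.
        lines.append(f"{METRIC}{{{labels}}} 1")
    return "\n".join(lines) + "\n"


# Hypothetical, hard-coded mapping matching the sample series.
owners = {
    "rke2-eu-cluster1": ("proj1", "app1"),
    "rke2-eu-cluster2": ("proj2", "app1"),
    "rke2-eu-cluster3": ("proj3", "app2"),
}
print(render_owner_info(owners), end="")
```

Serving this text on an HTTP endpoint named http in the Service is what the ServiceMonitor above would scrape.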
Defining The Alert Rule
Now we can combine the log-derived timestamp with the Rancher label metric in the following alert rule. It is deployed as a PrometheusRule custom resource in the Rancher cluster, so that it is added to the Prometheus from the SUSE rancher-monitoring Helm chart.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: customer-alerts
  namespace: customer-monitoring
spec:
  groups:
    - name: rke2-alerts
      rules:
        - alert: ManagedClusterRKE2ExpiringCertificates
          annotations:
            action: |
              The certificate rotation needs to be triggered manually. Follow the
              instructions in https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/manage-clusters/rotate-certificates
            description: |
              The certificate {{ $labels.cert }} on cluster {{ $labels.managed_cluster }}
              expires on {{ $value | humanizeTimestamp }}
            impact: |
              When the certificate is not renewed, some Kubernetes services will
              no longer be accessible.
          expr: |
            (max_over_time(loki:recorded:rke2_cert_expiration_timestamp[30m])
              * on(managed_cluster) group_left(application) rancher_managed_cluster_owner_info
            ) < (time() + 5 * 60 * 60 * 24)
          for: 5m
          labels:
            severity: critical
Explanation
- max_over_time(...[30m]): Pull the most recent expiry timestamp for each certificate. A 30-minute look-back is needed since we evaluate logs only every 10 minutes.
- * on(managed_cluster) group_left(application) rancher_managed_cluster_owner_info: Join the timestamp series with the Rancher label metric, copying the application label onto the result.
- < (time() + 5*60*60*24): Trigger when the stored expiry is earlier than “now + 5 days”.
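The join and the threshold can be emulated in a few lines of Python. This is only a sketch of the semantics under simplifying assumptions: series are plain dictionaries and tuples, the function name expiring is hypothetical, and the max_over_time step is assumed to have been applied already.

```python
FIVE_DAYS = 5 * 60 * 60 * 24  # 432000 seconds, as in the alert expression


def expiring(expiry_series, owner_info, now):
    """Emulate: expiry * on(managed_cluster) group_left(application) info
    < now + 5 days.

    expiry_series: list of (labels, expiry_epoch) tuples
    owner_info:    list of label dicts of the info metric (value is always 1)
    """
    # Index the info metric by the join label.
    application_by_cluster = {
        labels["managed_cluster"]: labels["application"] for labels in owner_info
    }
    alerts = []
    for labels, expires_at in expiry_series:
        application = application_by_cluster.get(labels["managed_cluster"])
        if application is None:
            # No matching info series: the join drops the sample entirely.
            continue
        # Multiplying by the info value 1 leaves the timestamp unchanged.
        if expires_at * 1 < now + FIVE_DAYS:
            alerts.append(({**labels, "application": application}, expires_at))
    return alerts


expiry_series = [
    ({"managed_cluster": "rke2-eu-cluster1",
      "cert": "scheduler/kube-scheduler.crt"}, 1780904321),
    ({"managed_cluster": "unknown-cluster",
      "cert": "scheduler/kube-scheduler.crt"}, 1780904321),
]
owner_info = [
    {"managed_cluster": "rke2-eu-cluster1", "project_id": "proj1",
     "application": "app1"},
]

four_days_before = 1780904321 - 4 * 24 * 60 * 60
for labels, expires_at in expiring(expiry_series, owner_info, four_days_before):
    print(labels["application"], labels["managed_cluster"], labels["cert"], expires_at)
```

One consequence worth noting: a managed cluster without a matching rancher_managed_cluster_owner_info series produces no alert at all, because the join has nothing to match against.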
With the application label present, Alertmanager can now route the alert to the appropriate receiver (e.g., Slack channel #app1-ops).
Conclusion
We have shown a rather complex setup needed to reliably alert customers about expiring RKE2 certificates in their managed clusters. This required specifically:
- Gathering Events from the managed clusters (Alloy Kubernetes Event collection deployed via Fleet from the Rancher cluster),
- Parsing logs into a recorded time series (Loki recording rules),
- Exporting business metadata as informational metrics (custom Rancher exporter), and
- Joining both metrics in a Prometheus alert rule.
This yields alerts that are both precise (exact expiry timestamps) and actionable (routed to the correct owners). The approach scales across any number of managed clusters.
Happy monitoring!