cluster-health

Description

The cluster-health command checks the health of a vSphere cluster, evaluating node status against user-defined thresholds. By default it treats nodes that are disconnected or in maintenance as faulty; you can customize what counts as a failure via --faulty.

Options

Besides the general options this command supports the following options:

optiondescription
--cluster-name CLUSTER_NAMEName of the cluster to check
--cluster-threshold CLUSTER_THRESHOLDCluster threshold: [max_members:]warn:crit. Numbers or percentages; max_members optional.
--nostandbyStandby nodes are not considered part of the cluster
--faulty FAULTYFault conditions to treat as failures (e.g., *inMaintenance, *notconnected, inStandby, inQuarantine, overallStatusRed, overallStatusYellow, overallStatusGrey). * marks default entries

–cluster-threshold details

--cluster-threshold CLUSTER_THRESHOLD

The syntax is [max_members:]warn_threshold:crit_threshold.

  • max_members (optional) - apply the rule to clusters with up to this many members (<=). If omitted, the rule is the fallback for any larger size.
  • warn_threshold - number or percent of faulty nodes that triggers a WARNING.
  • crit_threshold - number or percent that triggers a CRITICAL.

Thresholds may be specified as absolute numbers (e.g., 1) or percentages (e.g., 10%). Mixed forms are allowed (e.g., 4:1:50%).

You can supply several --cluster-threshold flags for different cluster sizes. Rules apply to clusters up to their max_members; if multiple rules match, the smallest max_members wins. Exactly one rule must omit max_members; that one is the fallback.

Quick examples:

  • 3:1:1 - For clusters up to 3 nodes: a single fault triggers a critical state (warning and critical equal).
  • 5:1:3 - For clusters up to 5 nodes: warn at >=1 faulty node, critical at >=3.
  • 10:2:5 - For clusters up to 10 nodes: warn at 2 faulty nodes, critical at 5.
  • 50:5:15 - For clusters up to 50 nodes: warn at 5 faulty nodes, critical at 15.
  • 10%:20% - Fallback for larger clusters: warning at 10% failures, critical at 20%.

Examples

check_vsphere cluster-health \
  --host vcenter.example.com \
  -u naemon@vsphere.local \
  --cluster-threshold 3:1:1 \
  --cluster-threshold 5:1:3 \
  --cluster-threshold 10:2:5 \
  --cluster-threshold 50:5:15 \
  --cluster-threshold '10%:20%' \
  --cluster-name MyCluster