Heartbeats
Heartbeats serve the following purposes:
- Exchange data between cluster nodes.
- Detect stale nodes.
- Execute the quorum race when a peer becomes stale.
OpenSVC supports multiple parallel running heartbeats. Exercising different code paths and infrastructure data paths (network and storage switches and site interconnects) helps limit split-brain situations.
Configuration
Heartbeats are declared in /etc/opensvc/cluster.conf
, each in a dedicated section named [hb#<n>]
. A heartbeat definition should work on all nodes, using scoped keywords if necessary, as the definitions are served by the joined node to the joining nodes.
Reconfiguration
Any command that changes the timestamp of the following configuration files triggers a reconfiguration of heartbeats:
/etc/opensvc/node.conf
/etc/opensvc/cluster.conf
Actions Taken During Reconfiguration:
- Any updated parameters are applied to the heartbeats.
- Heartbeats removed from the configuration are stopped.
- Heartbeats newly defined in the configuration are started.
Set a Heartbeat Timeout
To set a timeout for the hb#1
heartbeat, use this command:
om cluster config update --set hb#1.timeout=20
Drop a Heartbeat
To delete the hb#1
heartbeat from the configuration:
om cluster config update --delete hb#1
Monitoring
Each heartbeat runs two threads: tx
and rx
.
The om mon
command display the heartbeats status, statistics, and each peer state.
Threads n1 n2 n3
...
hb |
hb#1.rx running unicast | / O O
hb#1.tx running unicast | / O O
hb#2.rx running relay | / O O
hb#2.tx running relay | / O O
...
The agent daemon automatically restarts heartbeat threads if they exit unexpectedly.
Heartbeat Thread Pair
Tx (Transmit)
The Tx thread handles the transmission of the node data:
- Regularly transmit data or send it as soon as changes occur.
- Data is encrypted.
Rx (Receive)
The Rx thread manages data reception and integration into cluster data:
- Regularly read data from disk or receive it in response to transmissions (unicast/multicast).
- Update peer data in the cluster.
- Timeout if no heartbeat is received within the configured
<hb#n>.timeout
. The default timeout is 15 seconds.
Actions Performed by Rx:
- On receive data:
- Merge updated peer data to maintain accurate cluster data.
- Publish the received events on the local event bus.
- On receive timeout:
- Publish a
HbStale
event - Purge stale peer data if:
- No Maintenance Advertised: Immediately purge stale peer data.
- Maintenance Advertised: Wait for the
node.maintenance grace_period
before purging.
- Publish a
See Also: