Loadbalancer Healthcheck Design for Flaky Services

The Problem

This is some design work I did a while back as a result of an edge case that we had not considered in the original design of the loadbalancer architecture for OpenShift on-prem networking. Our (mistaken) assumption was that apiservers would either be up or down and our healthchecks were written with that in mind. As it turns out, it is possible for a cluster to be in an unhealthy state but not completely down. This results in intermittent failures of API calls, which causes flapping of the healthchecks. One could argue that the healthchecks are correctly representing the state of the cluster, but the problem is that VIP failovers break all connections to the API which can exacerbate the instability of a flaky cluster. Each time the VIP fails over it forces every client to reconnect, and if the apiservers are already struggling to handle the load then having a huge number of connections come in at once just makes it worse.

Subscribe to RSS - Loadbalancer