Loadbalancer Healthcheck Design for Flaky Services

The Problem

This is some design work I did a while back as a result of an edge case that we had not considered in the original design of the loadbalancer architecture for OpenShift on-prem networking. Our (mistaken) assumption was that apiservers would either be up or down, and our healthchecks were written with that in mind. As it turns out, it is possible for a cluster to be in an unhealthy state but not completely down. This results in intermittent failures of API calls, which cause the healthchecks to flap. One could argue that the healthchecks are correctly representing the state of the cluster, but the problem is that VIP failovers break all connections to the API, which can exacerbate the instability of a flaky cluster. Each time the VIP fails over it forces every client to reconnect, and if the apiservers are already struggling to handle the load then having a huge number of connections come in at once just makes it worse.

What We Did

There were two schools of thought on how to address this, and since I lost that argument I'm writing about it here rather than implementing it. Maybe it will be useful to someone in the future, possibly even me if the current solution ends up being problematic. The two philosophies on how this should work went approximately as follows:

  • Keepalived should only healthcheck haproxy and as long as haproxy is up it should not fail over the VIP.
  • Keepalived should ensure that the node holding the API VIP is able to contact the API.

I advocated for the latter approach (which is how this was designed originally) because in the former case there is no guarantee that haproxy is actually able to reach the API. A healthcheck that only verifies haproxy is up will happily report success even if every one of its backends is down. It is possible to configure haproxy to report failure when it has no healthy backends, but for our purposes here all that would have done is move the flaky healthcheck from keepalived to haproxy. It would not have fixed the problem.
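For reference, the haproxy-side option would look something like the sketch below: a monitor frontend that starts failing once the apiserver backend has no healthy servers. The port, frontend name, and the "masters" backend name are assumptions rather than our actual configuration, and as noted above this only relocates the flakiness rather than fixing it.

    frontend api-monitor
        bind :9443
        mode http
        # expose a monitor URI that fails when the apiserver backend is empty
        monitor-uri /healthz
        acl no_healthy_masters nbsrv(masters) lt 1
        monitor fail if no_healthy_masters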

The short version of the solution we went with is that we stopped validating connectivity to the API in our loadbalancer, which eliminates any possible flakiness at the cost of potentially leaving the API VIP on a node that can't actually talk to the API.

There was a lot of argument about this and I'm not going to go all the way down that rabbit hole, but suffice it to say I never found any kind of authoritative discussion of loadbalancers that convinced me 100% in either direction. The discussions I found about keepalived and haproxy had absurdly simplistic healthchecks, to the point that I would argue they were objectively wrong. Most were just checking that an haproxy process existed somewhere on the system, but did not have any verification that it was functional or that it was even the haproxy associated with the VIP. Hopefully it's obvious why this is not at all a robust healthcheck.
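To illustrate what I mean, the checks I kept running into looked roughly like this (a hypothetical reconstruction, not a quote from any particular tutorial):

    vrrp_script chk_haproxy {
        # typical tutorial-style check: passes if any haproxy process exists,
        # regardless of whether it serves the VIP or can reach a single backend
        script "/usr/bin/killall -0 haproxy"
        interval 2
        weight 50
    }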

As a result of the lack of conclusive evidence, I decided to accept the other proposed solution (under protest) in the interest of making some sort of progress on the issue.

What We Should Have Done (IMHO)

That's a whole lot of background that may or may not be interesting, but it's sort of necessary to understand the solution I proposed and will be writing up here. First, I will tell you up front that my solution is not perfect. The possibility for failovers still exists in a flaky cluster, but it is drastically reduced. As we all know, perfection is the enemy of good, and I think my solution is good. Read on to see if you agree.

My proposal actually had two parts. The first was a change that could be implemented quickly to mitigate the problem with minimal risk because it involved no major changes to how the loadbalancer works. I'll only discuss that briefly because it was never intended to be a final solution in and of itself, but it's sort of a simplified version of the second part, my more complete solution.

Part One

The first change was simple: Increase the rise value of the keepalived healthchecks significantly so they wouldn't flap. Flap prevention is largely the point of the rise and fall values for the healthchecks, so this is not particularly controversial. I chose to increase only the rise value (which specifies how many healthchecks must pass before the node is considered "healthy") because increasing the fall value (which specifies how many healthchecks must fail before the node is considered "unhealthy") would also increase the amount of time it takes to detect an outage. Fortunately, increasing the rise value accomplishes what we need because it means that if the apiserver is flaky all of the nodes will have a failure before they reach the rise count and will stay in an unhealthy state. As long as all nodes maintain the same priority the VIP will not fail over.
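Concretely, the change boils down to raising the rise value in keepalived's vrrp_script block. The check command, endpoint, and numbers below are illustrative rather than our exact production configuration:

    vrrp_script chk_api {
        # illustrative check: can this node reach the apiserver readiness endpoint?
        script "/usr/bin/curl -o /dev/null -kLsf https://localhost:6443/readyz"
        interval 2
        # require a long run of consecutive passes before reporting healthy
        rise 30
        # but still fail fast when the API becomes unreachable
        fall 2
        # passing nodes get a priority boost; while every node keeps failing,
        # priorities stay equal and the VIP stays put
        weight 50
    }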

This does have an important limitation, however. It works great as long as the cluster is in an unhealthy state, but when the cluster becomes healthy again the nodes will race to reach the rise value. It's possible that a node not holding the VIP will rise faster than the one holding the VIP and it will briefly have a higher priority until the other nodes catch up. As a result, the VIP might move to that newly-healthy node, and depending on how full the recovery is that might destabilize the cluster again.

Because this whole use case is a big gray area of instability (are 20% of API calls failing or 80%? It significantly changes the behavior of the healthchecks...), it's impossible to say how much of a problem this limitation actually is. My proposal was that we make this change and see how it worked for the affected customer. Unfortunately, since we never did that I can't conclusively say that it would or would not have worked. For the moment this remains a theoretical exercise.

Part Two

Fortunately, we can fix even the recovery edge case by giving the node with the VIP a head start in the aforementioned race. While I don't believe this can be done with just keepalived configuration primitives, we aren't limited to that because we also have a monitor sidecar running alongside keepalived. We already use this for part of the keepalived healthchecks because there are things we can't do in the keepalived container but can in the sidecar (specifically checking for firewall rules). The results of the monitor healthchecks are communicated to keepalived via a shared folder.
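To illustrate the mechanism, the keepalived side of that handoff can be as simple as a script that checks whether the monitor left a "healthy" marker in the shared folder. The filename and path here are made up for the example, not what we actually write:

    vrrp_script chk_monitor {
        # the monitor sidecar creates this file in the shared folder while its
        # checks pass and removes it when they fail; keepalived just looks for it
        script "/usr/bin/test -f /etc/keepalived/monitor.d/api-healthy"
        interval 2
        weight 50
    }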

My proposal was that we move the entire healthcheck to the monitor sidecar and make the actual keepalived healthcheck just a simple check for whether the monitor reports success or failure. The advantage of this is we can implement as much custom logic in the healthchecking as we want. For this specific case, we would set the rise value for the node holding the VIP to be much lower than for the other nodes. This way even if there is a stability problem with the API and healthchecks are intermittently failing, the node holding the VIP will recover faster than the others and the VIP will not move. Since we're still only modifying the rise value, if there is an actual problem on the node holding the VIP and the other nodes are healthy the VIP will still fail over quickly.
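Here's a rough sketch of what that monitor-side logic could look like. This is not the actual PoC code; the probe URL, VIP address, marker path, thresholds, and the way VIP ownership is detected are all illustrative assumptions. The point is the shape of the logic: pick the rise threshold each cycle based on whether this node currently holds the VIP, then publish the verdict to the shared folder for keepalived to read via a check like the one sketched earlier.

    package main

    import (
        "crypto/tls"
        "net"
        "net/http"
        "os"
        "time"
    )

    const (
        apiURL         = "https://localhost:6443/readyz"         // illustrative probe target
        markerFile     = "/etc/keepalived/monitor.d/api-healthy" // shared with keepalived
        riseWithVIP    = 3                                       // VIP holder recovers quickly
        riseWithoutVIP = 30                                      // other nodes need a long run of passes
        fallThreshold  = 2                                       // fail fast regardless of VIP ownership
    )

    // probeAPI reports whether a single healthcheck attempt succeeded.
    func probeAPI(client *http.Client) bool {
        resp, err := client.Get(apiURL)
        if err != nil {
            return false
        }
        defer resp.Body.Close()
        return resp.StatusCode == http.StatusOK
    }

    // holdsVIP reports whether this node currently owns the API VIP, here by
    // checking whether the VIP is configured on any local interface.
    func holdsVIP(vip string) bool {
        addrs, err := net.InterfaceAddrs()
        if err != nil {
            return false
        }
        for _, a := range addrs {
            if ipNet, ok := a.(*net.IPNet); ok && ipNet.IP.String() == vip {
                return true
            }
        }
        return false
    }

    func main() {
        client := &http.Client{
            Timeout:   3 * time.Second,
            Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
        }
        successes, failures := 0, 0
        healthy := false

        for range time.Tick(2 * time.Second) {
            // The head start: the node already holding the VIP needs far fewer
            // consecutive passes to be considered healthy again.
            rise := riseWithoutVIP
            if holdsVIP("192.0.2.10") { // example VIP address
                rise = riseWithVIP
            }

            if probeAPI(client) {
                successes++
                failures = 0
            } else {
                failures++
                successes = 0
            }

            // Apply the rise/fall hysteresis and publish the verdict for keepalived.
            if !healthy && successes >= rise {
                healthy = true
                os.WriteFile(markerFile, []byte("ok\n"), 0o644)
            } else if healthy && failures >= fallThreshold {
                healthy = false
                os.Remove(markerFile)
            }
        }
    }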

There was valid criticism of this approach because it was a much more complex change than either of the other two I've discussed so far. However, my counter-argument would be that in reality it isn't a huge change in behavior, just in implementation. The same basic rise and fall behavior would be implemented in the monitor code, with the addition of a dynamic component that would prefer the node already holding the VIP. Subjectively, I also liked that the healthcheck results were separated from the other keepalived logging. As someone who reads a lot of keepalived logs, I appreciated not having to pick the healthcheck-related messages out of all the other noise that shows up there.

Again, this was not a perfect solution either. It is still possible for the VIP to fail over inadvisedly if one node happens to get lucky with its healthchecks. However, in my testing this drastically reduced failovers to the point where I no longer think they would have negatively impacted cluster recovery.

You can find some PoC code for this change in the machine-config-operator and baremetal-runtimecfg (where the sidecar monitor code lives) repos. Note that these changes are incomplete and just intended to show the direction I wanted to go.

Testing and Validation

Speaking of testing, I did a bunch of it. It's tough to accurately simulate flakiness like this, but I did my best. In essence, I added a service to each node that would randomly inject a firewall rule to block outgoing API traffic to simulate API outages. This isn't fully representative of the real world situation, but from the perspective of keepalived it should be close enough.
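Conceptually the injection service amounted to something like the sketch below. This isn't the actual test code; the apiserver port, the timings, and the bare iptables DROP rule are just illustrative:

    package main

    import (
        "math/rand"
        "os/exec"
        "time"
    )

    // toggleBlock adds (-A) or removes (-D) a rule dropping outgoing apiserver traffic.
    func toggleBlock(action string) error {
        return exec.Command("iptables", action, "OUTPUT",
            "-p", "tcp", "--dport", "6443", "-j", "DROP").Run()
    }

    func main() {
        for {
            // Leave the API reachable for a random period...
            time.Sleep(time.Duration(1+rand.Intn(60)) * time.Second)
            // ...then simulate an outage for a shorter random period.
            toggleBlock("-A")
            time.Sleep(time.Duration(1+rand.Intn(20)) * time.Second)
            toggleBlock("-D")
        }
    }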

What I found was that even the first, simpler solution massively improved the failover behavior. The VIP would still move, but instead of happening multiple times in a minute, it would often be hours between failovers (depending on what error rate I was simulating). If cluster stability can't be restored in the two hours between failovers I no longer feel like that's our fault, but I guess that's open to interpretation.

Conclusion

In retrospect I probably should have written this back when we were having the design discussions in the first place. I think there were a lot of misunderstandings and talking past each other, and having a complete explanation of my proposal would likely have helped a lot. To that end, I see this post as serving three purposes:

  1. Documenting my design proposal for anyone else who might be facing a similar problem
  2. Documenting my design proposal for myself in case we need to revisit this
  3. A reminder to myself to do better with this sort of discussion in the future

I hope if you read this far you found it helpful or interesting, and if you have any thoughts about how this scenario could best be handled I'd love to hear from you. Our loadbalancer design is an ever-evolving piece of technology and any improvements are welcome.