Submitted by bnemec on Fri, 01/05/2024 - 22:31
Just a quick announcement of a tool I wrote recently to help with debugging Keepalived behavior in an OpenShift On-Prem IPI cluster. This is specifically intended to handle the logs from the keepalived pods running in the openshift-[platform]-infra namespace, although with a little work it could probably be generalized to work with almost any Keepalived configuration.
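The tool itself isn't reproduced in this teaser. To give a rough idea of what this kind of log processing involves, here is a minimal Python sketch (not the tool described above) that pulls VRRP state transitions out of Keepalived log lines and counts how often each instance took the VIP. It assumes transitions are logged as text like `(instance_name) Entering MASTER STATE`. The regex, the function names, and the sample instance name `ostest_API` are hypothetical and may need adjusting for your Keepalived version's output.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed log format: "... (<instance>) Entering <STATE> STATE",
# e.g. "Keepalived_vrrp[12]: (ostest_API) Entering MASTER STATE".
TRANSITION_RE = re.compile(r"\((?P<instance>[^)]+)\) Entering (?P<state>[A-Z]+) STATE")


def find_transitions(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line_number, instance, new_state) for each state change found."""
    transitions = []
    for lineno, line in enumerate(lines, start=1):
        match = TRANSITION_RE.search(line)
        if match:
            transitions.append((lineno, match.group("instance"), match.group("state")))
    return transitions


def count_flaps(transitions: List[Tuple[int, str, str]]) -> Dict[str, int]:
    """Count how many times each VRRP instance entered MASTER state."""
    counts: Dict[str, int] = {}
    for _, instance, state in transitions:
        if state == "MASTER":
            counts[instance] = counts.get(instance, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; real logs would come from the keepalived pods.
    sample = [
        "Keepalived_vrrp[12]: (ostest_API) Entering BACKUP STATE",
        "Keepalived_vrrp[12]: (ostest_API) Entering MASTER STATE",
        "Keepalived_vrrp[12]: some unrelated message",
        "Keepalived_vrrp[12]: (ostest_API) Entering BACKUP STATE",
    ]
    for lineno, instance, state in find_transitions(sample):
        print(f"line {lineno}: {instance} -> {state}")
    print(count_flaps(find_transitions(sample)))
```

Run against the sample, this reports the three transitions on lines 1, 2, and 4 and a count of one MASTER transition for `ostest_API`. A real tool would also want timestamps and per-node grouping, which this sketch leaves out.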
Submitted by bnemec on Mon, 11/21/2022 - 21:13
The Problem
This is some design work I did a while back as a result of an edge case that we had not considered in the original design of the load balancer architecture for OpenShift on-prem networking. Our (mistaken) assumption was that apiservers would be either up or down, and our healthchecks were written with that in mind. As it turns out, it is possible for a cluster to be in an unhealthy state but not completely down. This results in intermittent failures of API calls, which cause the healthchecks to flap. One could argue that the healthchecks are correctly representing the state of the cluster, but the problem is that a VIP failover breaks all connections to the API, which can exacerbate the instability of a flaky cluster. Each time the VIP fails over it forces every client to reconnect, and if the apiservers are already struggling to handle the load, then having a huge number of connections come in at once just makes it worse.
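The full post covers the actual design. Purely to illustrate the general idea of keeping a flapping check from triggering repeated failovers, here is a small Python sketch of hysteresis: the reported status only changes after a run of consecutive raw results that disagree with it. This is a generic technique and not necessarily the design from the post. Keepalived's own `vrrp_script` block exposes similar `rise` and `fall` settings. The class name and default thresholds below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DebouncedHealth:
    """Hypothetical hysteresis wrapper around a raw pass/fail healthcheck."""
    fall: int = 3          # consecutive failures needed to report unhealthy
    rise: int = 2          # consecutive successes needed to report healthy
    healthy: bool = True   # currently reported state
    _streak: int = 0       # consecutive raw results that disagree with `healthy`

    def update(self, check_passed: bool) -> bool:
        """Feed one raw check result; return the (possibly unchanged) reported state."""
        if check_passed == self.healthy:
            # Raw result agrees with the reported state, so reset the counter.
            self._streak = 0
            return self.healthy
        self._streak += 1
        threshold = self.rise if check_passed else self.fall
        if self._streak >= threshold:
            self.healthy = check_passed
            self._streak = 0
        return self.healthy


if __name__ == "__main__":
    # Intermittent failures are ignored until three arrive in a row.
    raw = [True, False, True, False, False, True, False, False, False]
    checker = DebouncedHealth()
    print([checker.update(r) for r in raw])
```

With the sample input, the reported state stays healthy through the scattered failures and only flips on the final result, after the third consecutive failure. The trade-off is slower reaction to a node that really is down, so the thresholds have to balance stability against failover time.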
Submitted by bnemec on Fri, 09/02/2022 - 21:24
Fair warning: This is gonna be looooong. Proceed at your own risk. ;-)
Introduction
Since I started working with OpenShift on baremetal, one of the things I've wanted to do is deploy OpenShift using OpenStack Virtual Baremetal to provide the host VMs. The usual developer setup is dev-scripts, which uses libvirt to stand up a virtual baremetal environment. This works fine, but it has a few drawbacks:
Submitted by bnemec on Mon, 07/11/2022 - 21:06
If you open a bug with NetworkManager, there is a high probability that the first thing they will ask for is trace logs from around the time the bad behavior you're reporting occurs. This isn't terribly complicated to do, but most people are not familiar with the NetworkManager logging configuration, so when asked for trace logs their first response is: How? I'm writing this up so I can just provide a link here when I get that question.
Submitted by bnemec on Thu, 08/29/2019 - 15:36
This is a problem I ran into recently and found only one other discussion of, which turned out to be unrelated to my situation. In short, running podman image prune on my system was failing with the following error message:
Error: failed to prune image: Image used by bbe74e76b2e3850ea27f2498ca4e504d271ac230a00c496a37291f3ee8d8b49c: image is in use by a container
Thing is, I had no containers on the system when this error happened. It seems that the actual problem was that I had built an image using buildah, and podman couldn't handle that.
Submitted by bnemec on Fri, 10/13/2017 - 16:34
If you've been following my blog very closely, you might notice it's been up and down occasionally over the past couple of weeks. You may also notice that I mentioned Drupal "fun" in my previous post. This was maybe a bit misleading as it wasn't really Drupal's fault, but Drupal did make it much easier to deal with.
Submitted by bnemec on Wed, 05/11/2016 - 16:07
One of the first things a new user of TripleO has to do is write a configuration file for their undercloud. While this is not as complex as, say, writing a nova.conf, there is still some level of difficulty due to the need for consistency between different options, and some less-than-ideal overlap/inconsistency between the options themselves. In the interest of easing new users' introduction to TripleO as much as possible, I've written a tool to help create undercloud.conf.
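As a small illustration of the consistency problem (not the tool itself), here is a Python sketch that derives several interdependent undercloud.conf values from a single provisioning CIDR, so that the local IP, DHCP range, and introspection range can't disagree with the network. The option names (`local_ip`, `network_cidr`, `network_gateway`, `dhcp_start`, `dhcp_end`, `inspection_iprange`) are assumed to match the `[DEFAULT]` section of undercloud.conf from TripleO releases of that era. The way the addresses are carved up, the range sizes, and the function name are arbitrary, hypothetical choices.

```python
import configparser
import io
import ipaddress


def derive_undercloud_conf(cidr: str, dhcp_size: int = 20, inspect_size: int = 20) -> str:
    """Build a hypothetical [DEFAULT] section whose addresses all fit inside `cidr`."""
    net = ipaddress.ip_network(cidr)
    hosts = list(net.hosts())
    if len(hosts) < 1 + dhcp_size + inspect_size:
        raise ValueError(f"{cidr} is too small for the requested ranges")

    local = hosts[0]                                   # undercloud's own provisioning address
    dhcp = hosts[1:1 + dhcp_size]                      # addresses handed out to deployed nodes
    inspect = hosts[1 + dhcp_size:1 + dhcp_size + inspect_size]  # introspection range

    config = configparser.ConfigParser()
    config["DEFAULT"] = {
        "local_ip": f"{local}/{net.prefixlen}",
        "network_cidr": str(net),
        "network_gateway": str(local),  # assumes the undercloud is the gateway
        "dhcp_start": str(dhcp[0]),
        "dhcp_end": str(dhcp[-1]),
        "inspection_iprange": f"{inspect[0]},{inspect[-1]}",
    }
    out = io.StringIO()
    config.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(derive_undercloud_conf("192.0.2.0/24"))
```

For `192.0.2.0/24` this yields a local IP of `192.0.2.1/24`, a DHCP range of `192.0.2.2` to `192.0.2.21`, and an introspection range of `192.0.2.22,192.0.2.41`. Deriving everything from one input is what keeps the ranges from overlapping or falling outside the network; a real tool would also need to handle user-supplied overrides and validate them against the same constraints.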
Submitted by bnemec on Mon, 05/19/2014 - 18:32
The first security update for Drupal since I started this blog was released a while back, and I negligently dragged my feet in actually applying it here because I was unsure how to do so in OpenShift. It turns out to be quite simple, but it did take me a while to figure out all the steps so I figured I would write them down here to help anyone else doing the same thing.