Improving TripleO CI Throughput

If you spend any significant amount of time working on TripleO, you have probably run into the dreaded CI queue. In this case that typically refers to the check-tripleo queue that runs OVB. Why does that queue back up more than the regular check queue, and what can we do (and what have we already done) about it? That's what we're going to talk about here.

The Problems

Before we discuss solutions, it's important to understand the problems we face in scaling TripleO's OVB CI. It's a very different beast from the rest of OpenStack CI. Here's a (probably incomplete) list of the ways it differs:

  • Our test environments are considerably larger than normal. An average TripleO OVB job makes use of 5 VMs. As of this writing, the most that any regular infra job uses is 3, and that's an experimental TripleO job. In general they max out at 2, and most use a single node. A TripleO test environment averages around 35 GB of memory (generally our limiting factor), as well as a lot of vcpus and disk.
  • Our test environments are also considerably more complex. Those 5 VMs are attached to some combination of 6 different neutron subnets, and one of the VMs is configured as an IPMI server that controls the others (there's a rough sketch of the topology after this list). We use Heat to deploy them in order to keep the whole thing manageable. This adds yet another layer of complexity because regular infra doesn't know how to deploy the Heat stacks, so we have to run private infrastructure to handle that.
  • Related to the previous point, our test environments have some unusual requirements on the host cloud. While some work has been done to reduce the number of ways our CI cloud is a snowflake, there are currently just 2 available clouds on which our jobs can run, and one of those is being used by TripleO developers and not available for CI. So we have exactly one cloud of ~30 compute nodes with 128 GB of memory each for TripleO CI.
  • In the interest of maximizing compute capacity, our CI cloud has a single, non-HA controller. When a large number of test environments are being created and/or deleted at once, it can overload that controller and cause all kinds of strange (read: broken) behavior.
  • TripleO CI jobs are long. At this time, I estimate we average about 1 hour 50 minutes per job. And I'm happy with that number, if you can believe it. It used to be well above 2 hours. If that sounds bad, keep in mind that we're deploying not one, but two OpenStack clouds in that time. We're also simulating the baremetal deployment part of the process, which no other OpenStack CI jobs do.
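
To make that concrete, here's a rough sketch of the shape of a single test environment. The VM and subnet names are made up for illustration; they are not the actual resource names from the CI Heat templates.

    # Illustrative only: the names below are invented, but the shape matches
    # the description above -- 5 VMs, 6 subnets, and one VM acting as an
    # IPMI/BMC server that controls the others.
    TEST_ENVIRONMENT = {
        "vms": {
            "bmc": {"role": "IPMI server controlling the other VMs"},
            "node-0": {"controlled_by": "bmc"},
            "node-1": {"controlled_by": "bmc"},
            "node-2": {"controlled_by": "bmc"},
            "node-3": {"controlled_by": "bmc"},
        },
        # Each VM attaches to some combination of these subnets.
        "subnets": ["provision", "public", "internal_api",
                    "storage", "storage_mgmt", "tenant"],
    }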

Just for some context, at our peak utilization of TripleO CI I've seen as many as 750 jobs being run in a 24 hour period. You can do the math on the number of VMs, memory, and networks that involves. It also means even small regressions in performance can have a huge impact on our daily throughput. A 5 minute regression per job adds up to 62.5 extra hours per day spent running test jobs. The good news is that a 5 minute improvement has the same impact in the positive.
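
To spell out that math, here's the back-of-the-envelope version using the figures above:

    # Peak figures quoted in this post.
    jobs_per_day = 750
    vms_per_job = 5
    memory_per_env_gb = 35

    print(jobs_per_day * vms_per_job)        # 3750 VMs booted per day
    print(jobs_per_day * memory_per_env_gb)  # 26250 GB of memory churned through per day
    print(jobs_per_day * 5 / 60)             # a 5 minute regression = 62.5 extra job-hours per day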

The Solutions

Best strap in tight for this section, because we've been busy.

One useful tool that I want to point out is a simple webapp I wrote to keep an eye on the check-tripleo queue: check-tripleo queue status. It can show other queues as well, but it was specifically designed for the tripleo queues so some things may not make sense elsewhere. It's also designed to be as compact as possible, and it may not be obvious what some of the numbers mean. If there's interest, I can write a more complete post about the tool itself.

There are two main categories of changes that helped our CI throughput: bug fixes and optimizations. I'll start with the bugs that were hurting performance.

Bugs

  • Five minute delay DHCP'ing isolated nics: This has actually bitten us twice. It's a fairly long-standing bug that goes back to at least Mitaka and causes deployments to spend 5 minutes attempting to DHCP nics that will never get a response. Fixing it saved time in every single job we run.
  • IPA image build not skipped even if image already exists: This bug crept in when we moved to a YAML-based image build system. There was an issue with the check for existing images that meant even when we could use cached images in CI, we were spending 10 minutes rebuilding the IPA image. This didn't affect every job (some can't use cached images), but it was a big time suck for the ones it did. The sketch after this list shows the kind of check involved.
  • overcloud-full ramdisk being rebuilt twice: Thanks to a recent change, we ended up with two different image elements doing forced ramdisk rebuilds during our image builds. This was a less serious performance hit, but fixing it still saves 1.5-2 minutes per job when we have to build images.
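
For the IPA caching bug, the fix essentially comes down to getting a check like the following right. This is only a sketch: the file names and the build step are illustrative, not the actual CI code.

    import os

    def ipa_image_cached(image_dir):
        """Return True when a cached IPA image can be reused (illustrative file names)."""
        wanted = ["ironic-python-agent.kernel", "ironic-python-agent.initramfs"]
        return all(os.path.exists(os.path.join(image_dir, f)) for f in wanted)

    def build_images(image_dir):
        if ipa_image_cached(image_dir):
            print("Using cached IPA image, skipping the ~10 minute rebuild")
        else:
            print("Rebuilding the IPA image")  # placeholder for the real YAML-driven build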

Optimizations

  • Run with actual node count: Due to a scheduler race between Nova and Ironic, we had previously added an extra node to each test environment so the scheduler could retry when it failed. This no longer seems to be necessary, and removing the extra node freed up around 20% of the resources from each test environment. It also makes environment creation and deletion faster because there is less to do.
  • Disable optional undercloud features in longer jobs: The undercloud has grown a lot of new services over the past few cycles, and this has caused it to take an increasingly long time to install. Since we aren't exercising many of these features in CI anyway, there's no point deploying them in all jobs. This is saving around 10 minutes in the ha and updates jobs.
  • Deploy network envs appropriate for the job: Not all of our jobs require the full 6 networks I discussed earlier. Since neutron-server is one of the biggest CPU users on the CI cloud controller, reducing the number of ports attached to the VMs was a big win in terms of controller capacity. It also reduces the time to create a test environment by a minute or more for some jobs. And in case that's not enough, this change will also allow us to test with bonded nics in CI.
  • Always use cached images in updates job: The updates job is especially painful from a runtime perspective. Not only does it deploy two full clouds, but it also has to update one of them, which takes a significant amount of time as well. Since the updates job is never run in isolation and image builds for it are not job-specific, there's no reason we can't always use cached images. If an image build is broken by a patch it will be caught by one of the other jobs. This can save as much as 30+ minutes in updates jobs.
  • Parallelization wherever possible: There were a few patches related to this, but essentially there are some processes in CI (such as log collection) that were being run in serial. Since our VMs are typically going to be running on different compute nodes, there's really no benefit to that, and running those processes in parallel can save significant amounts of time (see the sketch after this list).
  • Use http delorean urls instead of https: At some point, our default delorean repos (which is where we get OpenStack and friends) switched to using https by default. While this is good from a security perspective, it's bad from a CI perspective because it means we can't cache those packages. This is both slower and wastes more bandwidth on both ends. Note: as of this writing, the problem is only half fixed. Some of our repos force redirect http to https, so there's nothing we can do on our end.
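
To illustrate the parallelization point, per-node work like log collection can be fanned out with something as simple as a thread pool. The node names and the ssh command here are hypothetical; only the pattern matters.

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical node names standing in for the VMs in a test environment.
    NODES = ["undercloud", "overcloud-controller-0", "overcloud-novacompute-0"]

    def collect_logs(node):
        # Placeholder per-node step (e.g. ssh in and tar up /var/log).
        return subprocess.run(
            ["ssh", "-o", "BatchMode=yes", node, "sudo tar czf /tmp/logs.tgz /var/log"],
            check=False,
        ).returncode

    # The VMs usually live on different compute nodes, so running these steps
    # serially just wastes wall-clock time -- fan them out instead.
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        results = list(pool.map(collect_logs, NODES))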

And I think this one deserves special notice: Clean up testenv if Jenkins instance goes away. Previously we had an issue where test environments were being left around for some time after the job they were attached to had been killed. This can happen, for example, when a new patch set is pushed to a change that has jobs actively running on it: Zuul kills the active jobs on the old patch set and starts new ones on the new patch set. However, before this change we did not immediately clean up the test environments from the killed jobs. This was very problematic and caused us to exceed our capacity in the CI cloud on several occasions. It also meant we couldn't make full use of the capacity at other times, because the more jobs we ran the more likely it was that this situation would occur. In the two weeks since the patch merged, I have never seen us exceed our configured capacity for jobs, even though the problem scenario has occurred 1300 times. That's a lot of resources not wasted.
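
The idea behind that cleanup change can be sketched in a few lines. The two callables are hypothetical stand-ins for however the test environment broker actually tracks the consuming job and tears down the Heat stack:

    import time

    CHECK_INTERVAL = 60  # seconds; illustrative value

    def hold_testenv(job_is_alive, delete_testenv):
        """Release the environment as soon as the job that owns it disappears."""
        while job_is_alive():
            time.sleep(CHECK_INTERVAL)
        # The job was killed (e.g. superseded by a new patch set), so don't
        # wait for the normal end-of-job cleanup -- free the resources now.
        delete_testenv()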

All of these optimizations combined have both reduced our job runtimes and allowed us to run more jobs at once. We've increased our concurrent job limit from 60 to 70, and the CI cloud is still under less load than it was before. We could probably go even higher, but since things are generally under control right now there's no need to push the limit. There's also diminishing returns (more jobs running at once means more load on the compute nodes, which leads to lower performance) and some existing limits in the cloud that would require downtime to change if we go much higher. It could be done if necessary, but so far it hasn't been.
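
For a rough sense of where that limit sits, here's the same kind of back-of-the-envelope math using the figures from earlier in the post. It ignores host overhead and how evenly environments pack onto compute nodes, so treat it as optimistic.

    # ~30 compute nodes at 128 GB each, ~35 GB per test environment,
    # and a concurrent job limit of 70.
    cloud_memory_gb = 30 * 128    # ~3840 GB total
    env_demand_gb = 70 * 35       # ~2450 GB in test environments

    print(f"{env_demand_gb / cloud_memory_gb:.0%} of memory nominally in use")  # ~64%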

It's also worth noting that the effort to keep the CI queues reasonable is ongoing. Even while we merged the changes discussed above, other changes happened that regressed CI performance, some because they added new things to deploy that take more time, and some for unanticipated reasons. Unfortunately, performance regressions tend to get ignored until they become so painful that jobs time out. This is a bad approach because CI performance affects every developer working on TripleO, and I'm hoping we can do a better job of keeping things in good shape going forward.

And just to drive the previous point home, in the time since I started writing this post and publishing it, we've regressed the ha job performance enough to start causing job failures.