Upstream OpenStack Performance and Release-Shaming

These topics may seem like strange bedfellows, but trust me: there's a method to my madness. Originally this was going to be part of my Berlin summit post, but as I was writing, it got rather long and I started to feel it was important enough to deserve a standalone post. Since there are two separate but related topics here, I've split the post into two sections. If you're interested in my technical thoughts on upstream performance testing, read on. If you're only interested in the click-baity release-shaming part, feel free to skip to that section. It mostly stands on its own.

Performance

This came out of a session that was essentially about OpenStack performance, specifically how to quantify and test it. Unfortunately, in many cases the answer to the former was "it depends". Obviously it depends on your hardware. Bigger hardware generally means better performance. It also depends on the drivers you are using. Not all virt/network/storage drivers are created equally. Then there's your architecture. How is your network laid out? Are there only fat pipes between some node types or does everything have pretty equal connectivity?

This is of some interest to me because my first job out of college was on a performance team. That also means I have a pretty good understanding of what it takes to extensively performance test a major piece of software. Spoiler alert: It's a lot!

We had a team of 4 or 5 people dedicated primarily to performance testing and improvement. And that was just for my department's specific aspect of the product. There was a whole separate "core" performance team that was responsible for the overall product.

Besides people, we also needed racks and racks of hardware, though somewhat less once I got the load testing client software running on Linux instead of Windows, which cut the hardware required to drive the tests by 5 or 10x. Still, we had a sizable chunk of a data center set aside exclusively for our use.

You also need software that can run tests and collect the results in a consumable format. Fortunately, this seems to be a solved problem for OpenStack: Browbeat has been around for a few years now, and Rally can also drive workloads against an OpenStack cloud.
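
To give a concrete sense of what "run a workload and collect the results" means, here's a minimal, hand-rolled sketch using the openstacksdk Python library. The cloud name, image, and flavor are purely illustrative assumptions, and this is nowhere near what Rally or Browbeat actually do; it just shows the basic measure-a-workload loop.

```python
# Toy boot-time measurement against an OpenStack cloud via openstacksdk.
# Assumes a clouds.yaml entry named "perf" plus an image and flavor that
# exist on the target cloud (the names here are illustrative only).
import time

import openstack

conn = openstack.connect(cloud="perf")

samples = []
for i in range(10):
    start = time.monotonic()
    server = conn.create_server(
        name=f"perf-boot-{i}",
        image="cirros",      # illustrative image name
        flavor="m1.tiny",    # illustrative flavor name
        wait=True,           # block until the server goes ACTIVE
    )
    samples.append(time.monotonic() - start)
    conn.delete_server(server.id, wait=True)

print("boot time (s): min=%.1f avg=%.1f max=%.1f"
      % (min(samples), sum(samples) / len(samples), max(samples)))
```

Rally packages this idea up properly: you declare scenarios (boot-and-delete a server, create-and-delete a volume, and so on) along with iteration counts and concurrency, and it runs them and generates reports, while Browbeat adds orchestration and analysis on top of tools like Rally.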

That's three major pieces - people, hardware, and software - that you need to do performance testing well. Of those three, OpenStack currently has only the software available upstream. People and hardware? Not so much. I suspect part of the problem is that OpenStack vendors see their performance testing and tuning as a value-add and thus don't tend to publish their results. Certainly Red Hat is doing performance testing downstream, but to my knowledge the results aren't publicly available.

Is this the ideal situation? That depends on how you look at it. From a technical standpoint it would be far better if we had a dedicated upstream team doing regular performance testing against bleeding edge versions of OpenStack. That would allow us to catch performance regressions much faster than we generally do now. Downstream testing is likely happening against the last stable release, and thus is ~6 months behind at all times. On the business side, though, it makes more sense. To a large extent, what Red Hat (for example) is selling is expertise. That includes the expertise of our performance team. If you want their help tuning your cloud, then you pay us for a subscription and you get access to that knowledge. The software is free, but anything beyond that is not. So, ideal? No, but the reality of corporate-sponsored open source probably necessitates it.

The hardware side is also tricky. Upstream OpenStack/OpenDev infra is exclusively populated by public cloud resources. These are inherently unsuited to performance testing because they all run on shared hardware whose performance can vary significantly based on the amount of load on the cloud. To do proper performance testing you need dedicated hardware with as few variables from run to run as possible. Even when upstream infra has had bare-metal hardware donated, in many cases the donation didn't last. Apparently it's more common for companies to take back hardware donations than cloud resources. Seems odd, I know, but that was the experience related in this session.

So, what will it take to improve the state of upstream performance testing? Probably someone with moderately deep pockets to pay for the time and hardware needed, and who has a vested interest in improving performance upstream. Not a terribly promising answer, I realize, but that's the nature of the beast. This isn't a problem someone can solve by going heads down on it for a week. It takes an ongoing investment, and unless someone's revenue stream is dependent on it I'm not sure I see how it will happen.

Release-Shaming

However, as you can see from the title, there was one other thing in this session that I wanted to touch on. Specifically, a rather extended digression where participants in the discussion browbeat (pun entirely intended) the session leaders for being on an older release. I should note that I came into this session late, so it's possible I missed some context for where this came from, but even if that's the case I find it concerning.

Don't get me wrong, to some extent it's a valid point. If you come to upstream and say "we've got a performance problem on Mitaka", the only answer upstream can give is "sorry, our Mitaka branches have been gone for a while". But the session wasn't about fixing Mitaka performance; it was (as I mentioned 800 words earlier) about testing and quantifying the performance of upstream in general.

Further, this touches on some feedback we've gotten from operators in the past: specifically, that they don't always feel comfortable discussing things with developers if they're not on the latest release. That's why they like the ops meetup - it's a safe space, if you will, for them to discuss their experience with OpenStack and not have to worry about someone jumping on them for not CD'ing master. I exaggerate, but you see my point. This discussion with developers, and apparently previous ones, was unnecessarily hostile toward the operators. You know, the people we're writing our software for.

The funny thing is that the discussion kind of shined a light on the exact problem the session was trying to solve. At one point there was basically a list of changes that have been made to improve performance since Mitaka. Which is fine, but did those changes actually improve performance, or do you just think they did? In my experience OpenStack performance does not necessarily improve every cycle. Granted, I no longer operate a cloud at any sort of scale, but as a user of OpenStack I actually find that things seem more sluggish on recent releases than they did 3 years ago. As a result, I feel the burden of proof is on the developers to show that their changes actually improved performance. How do you prove that? Performance testing! Which is exactly what the session was about in the first place!

In the interest of fairness, I do agree that it would be better on a number of levels if everyone stayed up to date on the most recent OpenStack releases. Heck, OVB on public cloud was blocked for years by the fact that public clouds were running versions of OpenStack too old to support some features I needed. I do get that argument. But I also understand that not every company is in a position to jump on the perpetual upgrade treadmill. The fast-forward upgrades work that has been going on for a few cycles is a recognition of this, and it's something we need to keep in mind in other areas as well.

Overall I would say that the OpenStack developer community is a very civil, even downright friendly place most of the time. I hope we can extend the same to our colleagues in the operator community.