Oslo in Shanghai

Despite my trepidation about the trip (some of it well-founded!), I made it to Shanghai and back for the Open Infrastructure Summit and Project Teams Gathering. I even managed to get some work done while I was there. :-)

First, I recommend reading the opening of Colleen Murphy's blog post about the event (and the rest of it too, if you have any interest in what Keystone is up to). It does an excellent job of describing the week at a high level. To summarize in my own words, the energy of this event was a little off. Many regular contributors were not present because of the travel situation and there was less engagement from local contributors than I would have hoped for. However, that doesn't mean nothing good came out of it!

In fact, it was a surprisingly active week for Oslo, especially given that only two other cores and I were there and we had limited discussion within the team. It turns out Oslo was a popular topic of conversation in various Forum sessions, particularly oslo.messaging. This led to some good conversation at the PTG and a proposal for a new Oslo library. Not only were both Oslo summit sessions well attended, but good questions were asked in both, so people weren't just there waiting for the next talk. ;-) I even went 10 minutes over time on the project update (oops!), in part because I hadn't really planned time for questions since I've never gotten any in the past. Not complaining though.

Read on for more detail about all of this.

oslo.messaging drivers

It should come as no surprise to anyone that one of the major pain points for OpenStack operators is RabbitMQ administration. Rabbit is a frequent bottleneck that limits the scale of deployed clouds. While it should be noted that this is not always Rabbit's fault, scaling the message queue is a problem almost everyone runs into at some point when deploying large clouds. If you don't believe me, ask someone how many people attended the "How we used RabbitMQ in wrong way at a scale" presentation during the summit (which I will talk more about in a bit). The room was packed. This is definitely a topic of interest to the OpenStack community.

A few different solutions to this problem have been suggested. First, I'll talk about a couple of new drivers that have been proposed.

NATS

This was actually submitted to oslo.messaging even before the summit started. It's a new driver that uses the NATS messaging system. NATS makes some very impressive performance claims on its site, notably that it has around an order of magnitude higher throughput than RabbitMQ. Anybody interested in being able to scale their cloud 10x just by switching their messaging driver? I thought so. :-)

Now, this is still in the early discussion phase and there are some outstanding questions surrounding it. For one, the primary Python client is not compatible with Eventlet (sigh...), which makes it unusable for oslo.messaging. There is a client that would work, but it doesn't seem to be actively maintained, so if we proceed with this we would likely be taking on not just a new oslo.messaging driver but also a new NATS client library. Given the issues we've had in the past with drivers becoming unmaintained and bitrotting, this is a non-trivial concern. We're hoping to work with the driver proposers to make sure that there will be sufficient staffing to maintain this driver in the long run. If you are interested in helping out with this work, please contact us ASAP. Currently it is being driven by a single contributor, which is likely not sustainable.

We will also need to ensure that NATS can handle all of the messaging patterns that OpenStack uses. One of the issues with previous high performance drivers such as ZeroMQ or Kafka was that while they were great at some things, they were missing important functionality for oslo.messaging. As a result, that functionality either had to be bolted on (which reduces the performance benefits and increases the maintenance burden) or the driver had to be defined as notification-only, in which case operators end up having to deploy multiple messaging systems to provide both RPC and notifications. Even if the benefits are worth it, it's a hard sell to convince operators to deploy yet another messaging service when they're already struggling with the one they have. Fortunately, according to the spec the NATS driver is intended to be used for both so hopefully this won't be an issue.
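
For anyone who hasn't dug into oslo.messaging before, here's a minimal sketch of the two patterns a driver has to cover: RPC and notifications. The topics, methods, and payloads below are all made up for illustration; this isn't code from any particular project.

```python
# Minimal sketch of the two oslo.messaging usage patterns a driver must
# support. Topic/method/payload names are illustrative only.
from oslo_config import cfg
import oslo_messaging

conf = cfg.CONF

# RPC: request/response (call) and fire-and-forget (cast) between services.
rpc_transport = oslo_messaging.get_rpc_transport(conf)
target = oslo_messaging.Target(topic='compute', server='host1')
client = oslo_messaging.RPCClient(rpc_transport, target)
# call() blocks until a corresponding RPCServer replies; cast() does not.
result = client.call({}, 'reserve_resources', instance_id='abc123')

# Notifications: one-way events consumed by telemetry and other listeners.
notify_transport = oslo_messaging.get_notification_transport(conf)
notifier = oslo_messaging.Notifier(notify_transport,
                                   driver='messagingv2',
                                   publisher_id='compute.host1',
                                   topics=['notifications'])
notifier.info({}, 'instance.create.end', {'instance_id': 'abc123'})
```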

gRPC

In one of the sessions, I believe "Bring your crazy idea", a suggestion was made to add a gRPC driver to oslo.messaging as well. Unfortunately, I think this is problematic because gRPC is also not compatible with Eventlet, and I'm not sure there's any way to make it work. It's also not clear to me that we need multiple alternatives to RabbitMQ. As I mentioned above, we've had problems in the past with alternative drivers not being maintained, and the more drivers we add the more maintenance burden we take on. Given that the oslo.messaging team is likely shrinking over the next cycle, I don't know that we have the bandwidth to take on yet another driver.

Obviously if someone can do a PoC of a gRPC driver and show that it has significant benefits over the other available drivers then we could revisit this, but until that happens I consider this a non-starter.

Out-of-tree Drivers

One interesting suggestion that someone made was to implement some of these proposed drivers outside of oslo.messaging. I believe this should be possible with no changes to oslo.messaging because it already makes use of generic entry points for defining drivers. This could be a good option for incubating new drivers or even as a longer term solution for drivers that don't have enough maintainers to be included in oslo.messaging itself. We'll need to keep this option in mind as we discuss the new driver proposals.
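
As a rough illustration, an out-of-tree driver package could register itself through its setup.py along these lines. The 'oslo.messaging.drivers' entry point group is the one I believe oslo.messaging uses to look up transport drivers by URL scheme, but verify that against the oslo.messaging source; the package and class names here are entirely hypothetical.

```python
# Sketch of a hypothetical out-of-tree driver package's setup.py.
# Everything named here is made up for illustration.
import setuptools

setuptools.setup(
    name='oslo-messaging-nats',          # hypothetical package name
    version='0.1.0',
    packages=['oslo_messaging_nats'],
    entry_points={
        'oslo.messaging.drivers': [
            # A transport_url of nats://... would then select this driver.
            'nats = oslo_messaging_nats.driver:NATSDriver',
        ],
    },
)
```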

Reduce the amount of RPC in OpenStack

This also came out of the crazy idea session, but I don't recall that there was much in the way of specifics (I was distracted chatting with tech support in a failed attempt to get my cell phone working during this session). In general, reducing the load on the messaging layer would be a good thing though. If anyone has suggestions on ways to do this please propose them on the openstack-discuss mailing list.

LINE

Now we get to some very concrete solutions to messaging scaling that have already been implemented. LINE gave the RabbitMQ talk I mentioned earlier and had some novel approaches to the scaling problems they encountered. I suggest watching the recording of their session when it is available because there was a lot of interesting stuff in it. For this post, I'm going to focus on some of the changes they made to oslo.messaging in their deployment that we're hoping to get integrated into upstream.

Separate Notification Targets

One important architecture decision that LINE made was to use a separate RabbitMQ cluster for each service. This significantly reduces the load on any individual cluster, but it isn't the design oslo.messaging assumes. oslo.messaging has only one configuration section for notifications, whereas in a split architecture like LINE's you may want each service's notifications to go to that service's Rabbit cluster. The spec linked in the title for this section was proposed to provide that functionality. Please leave feedback on it if this is of interest to you.
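
To make the limitation concrete, here's roughly what the notification setup looks like today: a single shared section no matter how many Rabbit clusters you run. Option values and service names are illustrative.

```python
# Sketch of the current limitation: however many Rabbit clusters a
# deployment runs, a service builds its notifier from the single
# [oslo_messaging_notifications] section (or falls back to the default
# transport_url). Names here are illustrative.
from oslo_config import cfg
import oslo_messaging

conf = cfg.CONF
# In the config file you get exactly one of these today:
# [oslo_messaging_notifications]
# transport_url = rabbit://user:pass@shared-notification-cluster:5672/

transport = oslo_messaging.get_notification_transport(conf)
notifier = oslo_messaging.Notifier(transport,
                                   driver='messagingv2',
                                   publisher_id='nova-compute')
# The spec would allow notifications like this one to target a
# service-specific cluster instead of the single shared section above.
notifier.info({}, 'compute.instance.create.end', {'instance_id': 'abc123'})
```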

oslo.messaging instrumentation and oslo.metrics

One of the ways LINE determined where their messaging bottlenecks were was instrumentation they added to oslo.messaging to provide message-level metrics. This allowed them to get very granular data about which messages were causing the most congestion on the messaging bus. To collect these metrics, they created a new library that they call oslo.metrics. In essence, the oslo.messaging instrumentation calls oslo.metrics when it wants to output a metric; oslo.metrics then takes that data, converts it to a format Prometheus can understand, and serves it on an HTTP endpoint that the library creates. This allowed them to connect the oslo.messaging instrumentation to their existing telemetry infrastructure.
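
I haven't seen the oslo.metrics code itself, so don't take this as their implementation, but conceptually it's the familiar Prometheus pattern: push counters and timings from the instrumented code paths and expose them on a scrape endpoint. Here's a rough sketch of that pattern using the plain prometheus_client library, with invented metric names and everything collapsed into one process for brevity.

```python
# Conceptual sketch of the pattern oslo.metrics implements, using the
# plain prometheus_client library. Metric and label names are invented
# for illustration; they are not the actual oslo.metrics names.
import time

from prometheus_client import Counter, Histogram, start_http_server

RPC_MESSAGES = Counter(
    'rpc_messages_total',
    'Messages sent on the RPC bus',
    ['exchange', 'topic', 'method'])
RPC_LATENCY = Histogram(
    'rpc_call_duration_seconds',
    'Round-trip time of RPC calls',
    ['exchange', 'topic', 'method'])

def record_call(exchange, topic, method, duration):
    """What the messaging-layer hooks would call when a message is sent."""
    RPC_MESSAGES.labels(exchange, topic, method).inc()
    RPC_LATENCY.labels(exchange, topic, method).observe(duration)

if __name__ == '__main__':
    # Serve the metrics on an HTTP endpoint for Prometheus to scrape.
    start_http_server(9000)
    while True:
        record_call('nova', 'compute', 'build_instance', 0.25)
        time.sleep(10)
```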

Interestingly, this concept came up in other discussions throughout the week as well, so we're hoping that we can get oslo.metrics upstreamed (currently it is something they implemented downstream that is specific to their deployment) and used in more places. Another interesting related possibility was to add a new middleware to oslo.middleware that could do a similar thing for the API services and potentially provide useful performance metrics from them.

We had an extended discussion with the LINE team about this at the Oslo PTG table, and the next steps will be for them to fill out a spec for the new library and hopefully make their code changes available for review. Once that is done, we had commitments from a number of TC members to review and help shepherd this work along. All in all, this seems to be an area of great interest to the community and it will be exciting to see where it goes!

Policy Improvements

I'm going to once again refer you to Colleen's post, specifically the "Next Steps for Policy in OpenStack" section since this is being driven more by Keystone than Oslo. However, one interesting thing that was discussed with the Nova team that may affect Oslo was how to manage these changes if they end up taking more than one cycle. Because the oslo.policy deprecation mechanism is used to migrate services to the new-style policy rules, operators will start seeing quite a few deprecation messages in their logs once this work starts. If it takes more than one cycle then that means they may be seeing deprecations for multiple cycles, which is not ideal.
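
For context on where those deprecation messages come from, here's roughly what migrating a single rule looks like with oslo.policy's deprecation support. The rule names, check strings, and release are illustrative, not taken from any particular project.

```python
# Rough sketch of migrating one policy rule to a new-style default using
# oslo.policy's deprecation mechanism. Names here are illustrative.
from oslo_policy import policy

deprecated_get_widget = policy.DeprecatedRule(
    name='get_widget',
    check_str='rule:admin_or_owner')

rules = [
    policy.DocumentedRuleDefault(
        name='get_widget',
        # New-style default check string.
        check_str='role:reader and project_id:%(project_id)s',
        description='Show a widget.',
        operations=[{'method': 'GET', 'path': '/widgets/{widget_id}'}],
        # The Enforcer logs a deprecation warning for deployments still
        # relying on the old rule, which is what operators will see in
        # their logs while a migration like this is in flight.
        deprecated_rule=deprecated_get_widget,
        deprecated_reason='Widget policies now use default roles.',
        deprecated_since='Ussuri'),
]

def list_rules():
    return rules
```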

Currently Nova's plan is to queue up all of their policy changes in one big patch series of doom and merge the whole thing at once when they are all done. It remains to be seen how manageable a patch series that touches code across the whole project will be, though. If it proves untenable, we may need to implement some sort of switch in oslo.policy that would allow deprecations to be temporarily disabled while this work is ongoing; when all of the policy changes have been made, the switch could be flipped so all of the deprecations take effect at once. As of now I have no plans to implement such a feature, but it's something to keep in mind as the other service projects get serious about doing their policy migrations.

oslo.limit

The news is somewhat mixed on this front. Unfortunately, the people (including me) who have been most involved in this work from the Keystone and Oslo sides are unlikely to be able to drive it to completion due to changing priorities. However, there is still interest from the Nova side, and I heard rumors at the PTG that there may be enough operator interest in the common quota work to have someone help out there too. It would be great to see this completed, as it would be a shame to waste all of the design work and implementation of unified limits that has already been done. The majority of the initial API is available for review and just needs some massaging to be ready to merge. Once that happens, projects can start consuming it and provide feedback on whether it meets their needs.
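
For the curious, the API under review is shaped more or less like the sketch below, at least as I remember it. Since it hasn't merged yet, treat every name here as provisional and check the actual review before consuming anything.

```python
# Provisional sketch of how a service might consume the proposed
# oslo.limit API. Assumes unified limits are registered in Keystone and
# the [oslo_limit] auth options are configured; names may change.
from oslo_limit import limit

def count_widgets(project_id, resource_names):
    """Callback the service provides: return current usage per resource."""
    # A real service would count usage from its own database here.
    return {name: 3 for name in resource_names}

enforcer = limit.Enforcer(count_widgets)

# Before creating two more widgets, check the requested delta against the
# project's unified limits. A violation raises an exception the caller
# can turn into an over-quota error.
enforcer.enforce('project-uuid', {'widgets': 2})
```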

Demo of Oslo Tools That Make Life Easier for Operators

A bit of shameless self-promotion, but this is a presentation I did in Shanghai. The recording isn't available yet, but I'll link it once it is. In essence, this was my attempt to evangelize some Oslo tools that have been added somewhat recently but people may not have been aware of. It covers what the tools are good for and how to actually use them.

Conclusion

As I tweeted on the last day of the PTG, this was a hard event for me to leave. Changes in my job responsibilities mean this was likely my last summit and my last opportunity to meet with the OpenStack family face-to-face. Overall it was a great week, albeit with some rough edges, which is a double-edged sword. If the week had gone terribly maybe I wouldn't have been so sad to leave, but on the other hand it was nice to go out on a high note.

If you made it this far, thanks! Please don't hesitate to contact me with any comments or questions.