The Oslo team held its second virtual PTG this week. We had a number of good discussions and even ran slightly over the 2 hours we scheduled, so I think it was a successful event. The first hour was mostly topics relating to Oslo itself, while the second hour was set aside for some cross-project discussions with the Nova team. Read on for details of both hours.
Thierry gave us a quick update on the status of oslo.metrics. Currently we have all the necessary infrastructure in place to add oslo.metrics, so we are just waiting on a code drop. Once we have code, we'll get it imported and review the API with the Oslo team to ensure that it fits with our general design standards.
However, just creating the library is not the end of the story. Once the library is released, we'll need to add the integration code to consumers such as oslo.messaging. Once that is done, it should be possible for deployers to start benefiting from the functionality oslo.metrics provides.
Update! The above represents what was discussed at the PTG, but since then the oslo.metrics code has been proposed, so we are in fact ready to start making progress. If this project is of interest to you, please review the changes and propose patches.
We discussed a couple of things about oslo.cache, since we've been doing a fair amount of work in that library to improve security and migrate to better-supported client libraries. The first topic was functional testing for the drivers we have. Some functional tests have already been added, but a number of drivers still lack functional coverage. It would be nice to add similar tests for those, so if anyone is interested in working on that please let us know.
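To give a flavor of what those tests look like, here's a minimal sketch of a functional test for the existing memcached driver. The oslo.cache setup calls are the real API; the test case itself is illustrative and assumes a memcached server is listening on the default localhost:11211, as the functional jobs provide.

```python
from oslo_cache import core as cache
from oslo_config import cfg
from oslotest import base


class TestMemcacheDriver(base.BaseTestCase):
    def setUp(self):
        super().setUp()
        self.conf = cfg.ConfigOpts()
        # Register the [cache] options and point them at the driver
        # under test; memcache_servers defaults to localhost:11211.
        cache.configure(self.conf)
        self.conf.set_override('backend', 'dogpile.cache.memcached',
                               group='cache')
        self.conf.set_override('enabled', True, group='cache')
        self.region = cache.create_region()
        cache.configure_cache_region(self.conf, self.region)

    def test_set_and_get(self):
        # Round-trip a value through the live memcached server.
        self.region.set('key', 'value')
        self.assertEqual('value', self.region.get('key'))
```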
We've also been looking to change memcached client libraries for a couple of cycles now. Unfortunately this is not trivial, so the current plan is to add the new library as a completely different driver and then deprecate the old driver. That way we just need a migration path, not complete compatibility between the old and new libraries. There was a question about what to do with the dead host behavior from the existing memcache_pool driver, but the outcome of the discussion for now was to continue with the non-pooled driver and leave the pool version for later.
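If that plan works out, the eventual migration for deployers should be a one-line configuration change rather than a compatibility exercise. Something like the following, where the new backend name is purely illustrative until the driver actually exists:

```python
from oslo_cache import core as cache
from oslo_config import cfg

CONF = cfg.ConfigOpts()
cache.configure(CONF)

# Today: the existing python-memcached-based driver.
CONF.set_override('backend', 'dogpile.cache.memcached', group='cache')
# After the new driver lands and the old one is deprecated (the
# backend name here is hypothetical):
# CONF.set_override('backend', 'dogpile.cache.pymemcache', group='cache')
```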
The Good
First, we added a new core, Sean McGinnis. Big thanks to Sean for all his work on Oslo over the past couple of cycles!
The oslo-coresec team was updated. Prior to this cycle it had gotten quite out of date, to the point where I was the only active Oslo contributor still on it. That's not ideal since this is the group that handles private security bugs, so keeping the list current is important. We now have a solid group of Oslo cores to look at any such bugs that come in.
We also held our first virtual PTG after the Shanghai event, and that was counted in the success column. With so many Oslo contributors being part-time on the project, it's likely we'll want to continue these.
A couple of our cores were able to meet in person at FOSDEM. Remember meeting in person? Me neither. ;-)
The Bad
We missed completing the project contributor documentation community goal. This goal was more difficult for Oslo than for many other projects because we have so many repositories under our governance. By the end of the cycle we did come up with a plan to complete the goal and have made good progress on it.
Proposed Changes
We discussed assigning a driver for community goals in the future. One of the problems with the contributor docs goal was that everyone assumed someone else would take care of it. Having a person specifically assigned should help with that.
In addition, at least one of the community goals proposed for the Victoria cycle would not require explicit completion by Oslo. It involves users of oslo.rootwrap migrating to oslo.privsep, which would only require the Oslo team to assist other projects, and hopefully to improve the oslo.privsep docs based on people's migration experiences. Otherwise, Oslo isn't a consumer of either library, so there is no migration needed on our side.
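For teams facing that migration, the basic oslo.privsep pattern is small. Here's a rough sketch of what replacing a rootwrap call might look like; the package name, config section, and function are invented for illustration:

```python
from oslo_privsep import capabilities
from oslo_privsep import priv_context

# A privileged context granting only the capabilities this code needs,
# rather than rootwrap's "run this whole command as root".
default_pctxt = priv_context.PrivContext(
    'mypackage',                        # illustrative package name
    cfg_section='mypackage_privsep',
    pypath=__name__ + '.default_pctxt',
    capabilities=[capabilities.CAP_DAC_OVERRIDE],
)


@default_pctxt.entrypoint
def read_protected_file(path):
    # Runs inside the privileged daemon; the unprivileged caller just
    # sees an ordinary Python function call instead of shelling out
    # through a rootwrap filter.
    with open(path) as f:
        return f.read()
```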
Another proposed community goal is to make CI jobs Zuul v3 native, and I believe that Oslo is already done with that for the most part. We've migrated a few of our one-off jobs over the past couple of years since Zuul v3 came out, so we should be in good shape there too.
After the Nova Ussuri release, some deployers reported problems with the new policy rules, despite the use of the oslo.policy deprecation mechanism that is designed to prevent breakage on upgrade. It turned out that they were using the sample policy generator tool to create JSON policy files. Because JSON doesn't support comments, the generator can't comment out the defaults the way it does in YAML, so a JSON sample file ends up overriding all of the policy-in-code defaults. When that happens, the deprecation mechanism breaks because we don't mess with custom policies specified by the deployer, and every rule in a JSON sample file looks like a custom policy. This is one of the reasons we don't recommend populating the policy file with default policies.
However, even though we've recommended YAML for policy files since policy-in-code happened, we never changed the default filename in oslo.policy from policy.json. This naturally can lead to deployers using JSON-formatted files, even though all of the other oslo.policy tools now default to YAML. One of the main reasons we've never changed the default is that it is tricky to do without opening potential security holes. Policy has a huge impact on the security of a cloud, and there isn't a great option for migrating the default filename.
The solution we came up with is documented in an Oslo spec. You can read the full details there, but the TL;DR is that we are going to coordinate with all of the consumers of oslo.policy to add an upgrade check that warns deployers if a JSON-formatted policy file is found. In addition to release notes, this should give deployers ample warning about the coming change. oslo.policy itself will also log a warning if it detects that JSON is in use after JSON support has been deprecated. As part of this deprecation work, oslo.policy will need to provide a tool to migrate existing JSON policies to YAML, preferably with the ability to detect default policy rules and comment them out in the YAML version.
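The real tool will live in oslo.policy, but conceptually it only needs to do something like this minimal sketch, which assumes the registered policy-in-code defaults are available as a simple name-to-check-string mapping:

```python
import json

import yaml


def convert_policy_json_to_yaml(json_path, yaml_path, registered_defaults):
    """Write a YAML policy file, commenting out rules that match defaults."""
    with open(json_path) as f:
        file_rules = json.load(f)

    lines = []
    for name, check_str in sorted(file_rules.items()):
        entry = yaml.safe_dump({name: check_str}, default_flow_style=False)
        if registered_defaults.get(name) == check_str:
            # The rule matches the policy-in-code default: keep it for
            # reference, but comment it out so it no longer counts as a
            # deployer override.
            lines.extend('#' + line for line in entry.splitlines(True))
        else:
            lines.extend(entry.splitlines(True))

    with open(yaml_path, 'w') as f:
        f.writelines(lines)
```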
Deprecating and eventually removing JSON policy file support should allow us to deprecate policies in the future without worrying about the situation we ran into this cycle. YAML sample files won't override any rules by default so we'll be able to sanely detect when default rules are in use. There was some talk of proposing this as a community goal given the broad cross-project nature of the work, but we'll probably wait and see how the initial effort goes.
Another longstanding topic that has recently come up is a standard healthcheck endpoint for OpenStack services. In the process of enabling the existing healthcheck middleware, there was some question of how the healthchecks should work. Currently it's a very simple check: if the API process is running, it returns success. There is also an option to suppress the healthcheck based on the existence of a file, which allows a deployer to signal a load balancer that the API will be going down for maintenance.
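For reference, wiring up the existing middleware is already straightforward; a minimal WSGI example with the file-based disable option looks something like this (the file path is arbitrary):

```python
from oslo_middleware import healthcheck


def application(environ, start_response):
    # Stand-in for a real service's WSGI application.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']


# GET /healthcheck now returns success while the service is up, and
# 503 once the named file exists, letting a deployer drain a node
# from the load balancer before maintenance.
application = healthcheck.Healthcheck(application, {
    'backends': 'disable_by_file',
    'disable_by_file_path': '/var/run/myservice/healthcheck_disable',
})
```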
However, there is obviously a lot more that goes into a given service's health. We've been discussing how to make the healthcheck more comprehensive since at least the Dublin PTG, but so far no one has been able to commit the time to make any of those plans happen. At the Denver PTG about a year ago we agreed that the first step was to enable the healthcheck middleware by default in all services. Some progress has been made on that front, but when the change was proposed to Nova, they asked a number of questions about the planned future improvements.
We revisited some of those questions at this PTG and came up with a plan to move forward that everyone seemed happy with. One concern was that we don't want to trigger resource-intensive healthchecks on unauthenticated calls to an API. In the original discussions the plan was to have healthchecks running in the background, and then the API call would just return the latest results of the async checks. A small modification to that was made in this discussion. Instead of having explicit async processes to gather this data, it will be collected on regular authenticated API calls. In this way, regularly used functionality will be healthchecked more frequently, whereas less used areas of the service will not. In addition, only authenticated users will be able to trigger potentially resource intensive healthchecks.
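To make that concrete, here's a very rough sketch of the pattern we discussed. All of the names are invented for illustration, and this is in no way a settled design:

```python
import time

# Latest health observations, updated as a side effect of normal
# authenticated API calls rather than by a background task.
_health_state = {}


def record_check(name, healthy):
    # Called from regular request handling, e.g. after a database or
    # messaging operation succeeds or fails.
    _health_state[name] = {'healthy': healthy, 'observed_at': time.time()}


def healthcheck_view():
    # The healthcheck endpoint only reports cached results, so an
    # unauthenticated poll never triggers an expensive check itself.
    overall = all(s['healthy'] for s in _health_state.values())
    return {'healthy': overall, 'checks': _health_state}
```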
Each project will be responsible for implementing these checks. Since each project has a different architecture only they can say what constitutes "healthy" for their service. It's possible we could provide some common code for things like messaging and database that are used in many services, but it's likely that many projects will also need some custom checks.
I think that covers the major outcomes of this discussion, but we have no notes from this session so if I forgot something let me know. ;-)
There was quite a bit of discussion during the Keystone PTG sessions about oslo.limit and unified limits in general. There are a number of pieces of work underway for this already. Hierarchical quota support is proposed to oslo.limit and a POC for Nova to consume it is also available. The Glance team has expressed interest in using oslo.limit to add quotas to that service, and their team has already started to contribute patches to oslo.limit (such as supporting configuration by service name and region). This is terrific news! That work also prompted some discussion of how to handle the separate configuration needed for keystoneauth and oslo.limit itself.
There was quite a bit of other discussion, some of which doesn't involve oslo.limit, some of which does. We need to define a way to export limits from one project and import them into Keystone. This will probably be done in the [project]-manage commands and won't involve Oslo.
Some refinement of the usage callback may be in order too. I don't know that we came to any definite conclusions, but encouraging more projects to use Placement was discussed, although some projects are hesitant because of its complexity. In addition, there was discussion of passing a context object to the usage callback, but it wasn't entirely clear whether that would work for all projects or whether it was even necessary.
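For context, the consumption model in oslo.limit as it stood at the time pairs an enforcer with a usage callback supplied by the service, roughly like this (the counting helper is hypothetical, and the enforcer also needs keystone connection settings in the [oslo_limit] config section):

```python
from oslo_limit import limit


def usage_callback(project_id, resource_names):
    # Supplied by the consuming service: report current usage for each
    # requested resource. This is where a context object, if we add
    # one, would be passed in.
    return {name: count_resources(project_id, name)  # hypothetical helper
            for name in resource_names}


enforcer = limit.Enforcer(usage_callback)
# Raises if creating two more servers would exceed the project's limit.
enforcer.enforce('project-uuid', {'servers': 2})
```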
Finally, the topic of caching came up. Since there can be quite a few quota calls in a busy cloud, caching may be needed to avoid significant performance hits. It's something we've deferred making any decisions on in the past because it wasn't clear how badly it would be needed or exactly how caching should work for limits. We continued to push this decision off until we have unified limits implemented and can gather performance information.
That should cover my recollection of the limits discussion. For the raw notes from the PTG, see the Keystone PTG Etherpad, under Unified Limits.
The past few months have been quite...interesting. Everyone is doing the best they can with a tough situation, and this all-virtual PTG is yet another example of that. Huge thanks to all of the organizers and attendees for making it a productive event in spite of the challenges.
I hope this has been useful. If you have any comments or questions feel free to contact me in the usual ways.