Debugging a Segfault in oslo.privsep

I recently helped track down a bug exposed by a new oslo.privsep release that added threading to allow parallel privileged calls. The bug was a segfault in the privsep daemon, caused by a C call in a privileged Neutron module. As you might expect, this was a little tricky to debug, so I thought I'd document the process for posterity.

There were a couple of reasons this was tough. First, it was a segfault, which meant something went wrong in the underlying C code. Python debuggers need not apply. Second, there's a bunch of forking that happens to start the privsep daemon, which meant I couldn't just run Python in gdb. Well, maybe I could have, but my gdb skills are not strong enough to navigate through a bunch of different forks.

To get gdb attached to the correct process, I followed the debugging with gdb instructions from Python, specifically the ones to attach to an existing process. To make sure I had time to get it attached, I added a sleep to the startup of the privsep daemon installed in my Neutron tox venv. Essentially I would run the test:

tox -e dsvm-functional -- neutron.tests.functional.agent.linux.test_netlink_lib.NetlinkLibTestCase.test_list_entries

I would then find the privsep-helper process that was eventually started and attach gdb to it with:

gdb python [pid]
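
For reference, the sleep I mentioned can be very simple. The following is only a sketch of the idea, not the exact change I made: the function name wait_for_debugger and the 30 second default are made up for illustration, and where you call it in the daemon startup depends on the code you are debugging.

```python
import os
import sys
import time


def wait_for_debugger(seconds=30):
    """Print this process's PID, then pause so gdb can be attached.

    Hypothetical helper for illustration; it is not part of oslo.privsep.
    """
    pid = os.getpid()
    sys.stderr.write(
        "PID %d sleeping for %s seconds; attach with: gdb python %d\n"
        % (pid, seconds, pid)
    )
    sys.stderr.flush()
    time.sleep(seconds)
    return pid
```

Calling wait_for_debugger() as early as possible in the daemon's startup gives you a window to attach, and printing the PID saves a trip through the process list.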

I also needed to install some debuginfo packages on my system to get useful tracebacks from the libraries involved. Handily, gdb printed the install command I needed. I believe the important part here was dnf debuginfo-install libnetfilter_conntrack, but that will vary depending on what you're debugging.

Once gdb was attached, I typed c to tell it to continue (gdb interrupts the process when you attach), then once the segfault happened I used commands like bt, list, and print to examine the code and state where the crash happened. This allowed me to determine that we were passing in a bad pointer as one of the parameters for the C call. It turned out we were truncating pointers because we hadn't specified the proper parameter and return types, so large memory addresses were being squeezed into ints that were too small to hold them. Why the oslo.privsep threading change exposed this I don't know, but my guess is that it has something to do with the address space changing when the calls were made from a thread instead of the main process.
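
To illustrate the truncation, here is a small sketch that assumes the C call is made through Python's ctypes module, where a foreign function returns a C int unless told otherwise. It does not use the Neutron code: the address is an arbitrary made-up value, and libc's malloc and free stand in for the real library calls. It also assumes a Linux system where ctypes.CDLL(None) exposes libc and a C int is 32 bits.

```python
import ctypes


def truncate_to_c_int(address):
    """Return what survives when an address is forced into a C int."""
    # ctypes integer types do no overflow checking, so the high bits are dropped.
    return ctypes.c_int(address).value


# Made-up 64-bit address, used only to show the effect.
address = 0x7F3A12345678
print(hex(address), "->", hex(truncate_to_c_int(address)))

# Declaring parameter and return types keeps pointers at full width.
# malloc/free are stand-ins here, not the functions from the actual bug.
libc = ctypes.CDLL(None)
libc.malloc.argtypes = [ctypes.c_size_t]
libc.malloc.restype = ctypes.c_void_p
libc.free.argtypes = [ctypes.c_void_p]
libc.free.restype = None

ptr = libc.malloc(64)  # full-width address as a Python int
libc.free(ptr)
```

Without the argtypes and restype declarations, any pointer that happens to sit above the 32-bit range loses its upper half on the way through, and the C code ends up dereferencing garbage.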

In any case, after quite a bit of cooperative debugging in the OpenStack community and a fair amount of rust removal from my gdb skills, we were able to resolve this bug and unblock the use of threaded oslo.privsep. This should allow us to significantly reduce the attack surface for OpenStack services, resulting in much better security.

I hope this was useful, and as always if you have any questions or comments don't hesitate to contact me.