OpenShift on OpenStack Virtual Baremetal

Fair warning: This is gonna be looooong. Proceed at your own risk. ;-)

Introduction

Since I started working with OpenShift on baremetal, one of the things I've wanted to do is deploy OpenShift using OpenStack Virtual Baremetal (OVB) to provide the host VMs. The usual developer setup is dev-scripts, which uses libvirt to stand up a virtual baremetal environment. This works fine, but it has a few drawbacks:

  • All the VMs are on a single host (by default), so you need a fairly large machine to run it.
  • Because of the previous point, hardware use is pretty inefficient: each machine can basically serve one user at a time (you could theoretically share machines, but to my knowledge it's enough of a pain that it just isn't done). When I shut down my dev cluster, those resources aren't freed up for anyone else.
  • It does a lot of magic that insulates developers from the actual product interface. If you've read some of my previous posts, you'll know this is a huge pet peeve of mine. I feel dirty every time I push the developer easy button in dev-scripts. ;-)
  • In some cases, libvirt is overly helpful in setting up the environment, which makes it a less faithful simulation of real baremetal. For example, libvirt is going to create DNS records for all of your nodes. If, like me, you work on internal DNS components, that can sometimes mask issues.

Unfortunately, OpenShift on OVB has some problems too. Most notably, baremetal OpenShift uses a libvirt VM as the bootstrap node. This means you either need to use nested virt (no thank you) or you have to find some way to hook an actual baremetal node into your OpenStack environment.

We recently had a week set aside for developers to work on little pet projects like this, and I was able to get a proof-of-concept of OpenShift on OVB working. How useful is it? I'm not sure, but that's not really the point of a PoC. As you'll see, the environment setup is rather complicated, but it does have the advantage of significantly lowering the requirements for hardware assigned to individual developers. Because the majority of the compute power lives in the OpenStack cloud, a developer might only need a 32 GB machine, possibly even their own laptop, to do dev-scripts development.

Yes, I'm still using dev-scripts for this. Just getting that to work was sufficiently complex that I didn't have time to try running the OpenShift installer standalone. Baby steps. :-)

Environment Overview

What I ended up doing was using the OVB "undercloud" node as a sort of proxy into the virtual networks of the OVB environment. I also created 3 masters and 2 workers (the latter using OVB's role functionality so the workers could use a smaller flavor to save some resources), and OVB provided the necessary IPMI and PXE control over them. Note that I did this on my personal cloud with the Nova PXE boot patch applied. I expect this could be done on a public cloud using the ipxe boot image, but I haven't tried it.

There are three networks that are relevant for an OpenShift Baremetal IPI deployment: provisioning, baremetal, and BMC. Conveniently, these all map nicely to some of the networks used for TripleO, so I didn't have to make any changes to the OVB network templates. The stock undercloud (proxy in my case) already has all of those attached.

Here are the OVB configs I used:
ovb-deploy --quintupleo --name ocp --id openshift --poll -e env-ocp.yaml -e environments/all-networks.yaml --role role-worker.yaml

env-ocp.yaml:

parameter_defaults:
  baremetal_flavor: master
  baremetal_image: centos-stream
  baremetal_prefix: master
  bmc_flavor: bmc
  bmc_image: centos7
  bmc_prefix: bmc
  external_net: external
  key_name: default
  node_count: 3
  private_net: private
  provision_net: provision
  provision_net_shared: False
  public_net: public
  public_net_shared: False
  role: ''
  undercloud_flavor: m1.small
  undercloud_image: centos-stream
  undercloud_name: proxy

role-worker.yaml:

parameter_defaults:
  baremetal_flavor: worker
  baremetal_image: centos-stream
  key_name: default
  node_count: 2
  role: worker
  baremetal_name_template: worker-%index%
resource_registry:
  OS::OVB::BaremetalPorts: templates/baremetal-ports-all.yaml
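
Once the stack finishes, you can sanity-check the result with the standard OpenStack CLI. The names reflect the --name and --id options above, so adjust for your environment:

# The proxy, the BMC, three masters, and two workers should all show ACTIVE
openstack server list -f value -c Name -c Status
# The provisioning, baremetal, and BMC networks discussed above
openstack network list -f value -c Name | grep -E 'provision|public|private'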

To give my baremetal node that hosted the bootstrap VM access to the environment, I used two different methods: socat and OpenVPN. In retrospect I probably could have used OpenVPN exclusively and just given the baremetal node a VPN to the BMC network, but because I tackled IPMI access first, I ended up with the simpler socat-based method there, and since it worked I didn't bother changing it. Two OpenVPN tunnels were also needed, for the provisioning and baremetal networks.

socat for IPMI access

The BMC instances in OVB are on the private network, which has a port on the external router and can thus be assigned floating IPs. However, you can only assign one floating IP to a VM at a time, and since there are multiple BMCs running, each with its own IP address, I would need some way to redirect traffic anyway. I ended up running socat on the proxy VM, listening on a unique port for each node's BMC and forwarding that traffic to the appropriate private IP. For simplicity, socat listens on port 6 followed by the final octet of the BMC IP. So if the BMC is 12.1.1.188, socat listens on 6188.

This worked fine, although it wasn't perfect (see Known Issues below).

Aside: I initially found a method of doing this with netcat, but it was only able to handle a single IPMI call before the tunnel broke. I'm not sure what was wrong, but I found other posts complaining about the same thing with netcat. Since socat was a bit simpler anyway, I just went with that.

This is the script I used to start socat on the proxy:

#!/bin/bash

set -ex

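# Final octets of each node's BMC address on the 12.1.1.0/24 private network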
for i in 93 229 235 138 188
do
  socat udp4-listen:6$i,reuseaddr,fork udp4:12.1.1.$i:623 &
done
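
If you'd rather not hard-code the octets, here's an untested sketch of a more general variant that derives the listen port from the full BMC address instead:

#!/bin/bash

set -ex

for bmc_ip in 12.1.1.93 12.1.1.229 12.1.1.235 12.1.1.138 12.1.1.188
do
  # Strip everything up to the last dot to get the final octet
  octet="${bmc_ip##*.}"
  socat "udp4-listen:6${octet},reuseaddr,fork" "udp4:${bmc_ip}:623" &
done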

To test IPMI functionality, you can use ipmitool on the dev-scripts host. The -H is the floating IP of the proxy and -p is the port assigned to the node.

ipmitool -I lanplus -H 11.3.3.5 -p 6188 -U admin -P password power status
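
To check all five nodes at once, you can wrap the same command in a loop over the ports from the socat script:

for port in 693 6229 6235 6138 6188
do
  ipmitool -I lanplus -H 11.3.3.5 -p "$port" -U admin -P password power status
done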

OpenVPN

While I've used OpenVPN before, it was always in "tun" mode. In this case I needed to be able to DHCP and PXE over the tunnel, which meant I needed to use "tap" mode. This didn't turn out to be drastically more complicated, but there was one weird behavior (bug?) that did cause me a lot of angst. We'll get to that in a moment.

First, I followed one of the many OpenVPN setup guides out there and generated all the necessary keys and certificates. I'm not going to go into detail here, but all of the files referenced in my configs do exist with the appropriate content. For the provisioning network, the server config looks like:

port 1194
proto udp
# Note: tap instead of tun
dev tap
# These two are needed to allow some scripting when the connection is brought up
script-security 2
up up.sh
# I'm not sure these are actually accomplishing anything, but I think there
# are MTU issues with my environment and these were my attempt to fix that.
tun-mtu 1600
fragment 1500
mssfix
# The rest of this is all pretty standard stuff, I believe.
ca ca.crt
cert server.crt
key server.key  # This file should be kept secret
dh dh.pem
topology subnet
ifconfig-pool-persist ipp.txt
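# server-bridge <gateway> <netmask> <pool-start> <pool-end>: clients get an
# address from this range on the bridged provisioning subnet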
server-bridge 192.168.24.2 255.255.255.0 192.168.24.50 192.168.24.100
push "route 192.168.24.0 255.255.255.0"
keepalive 10 120
tls-auth ta.key 0 # This file is secret
cipher AES-256-CBC
persist-key
persist-tun
status openvpn-status.log
verb 3

And the corresponding client config:

client
# This all needs to match the server
dev tap
proto udp
tun-mtu 1600
fragment 1500
mssfix
# Standard bits
remote 11.3.3.5 1194
resolv-retry infinite
nobind
user nobody
group nobody
persist-key
persist-tun
ca ca.crt
cert wonderland.crt
key wonderland.key
remote-cert-tls server
tls-auth ta.key 1
cipher AES-256-CBC
verb 3

In order for tap mode in OpenVPN to work, you need to bridge the appropriate interface(s). Since I also work with NMState quite a bit, I used that to create my bridge:

interfaces:
- name: br0
  type: linux-bridge
  state: up
  # I later changed this manually
  mtu: 1450
  ipv4:
    enabled: true
    address:
    # This is the IP OpenStack assigned to the provisioning nic on my proxy.
    # I don't think it matters if it matches, but I figured it wouldn't hurt.
    - ip: "192.168.24.184"
      prefix-length: 24
  bridge:
    port:
    - name: eth1
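
Applying the YAML is a single command, assuming you saved it as br0.yaml:

sudo nmstatectl apply br0.yaml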

Finally, I needed to run a couple of commands when the server brings up the tap interface. Here are the contents of the up.sh script:

#!/bin/bash

/sbin/brctl addif "br0" "$1"
/sbin/ip l set dev "$1" mtu 1600
/sbin/ip l set dev "$1" up

Note that last line. Remember that weird behavior I mentioned? Yeah, turns out the tap device is not brought up by default. Maybe I'm doing something wrong here, but that was surprising to me. Once I added the link up command my tap-based VPN started working.
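
With the bridge and up script in place, bringing up the tunnel looks something like this. The config file names are placeholders for wherever you saved the configs above:

# On the proxy
sudo openvpn --config server.conf --daemon
# On the dev-scripts host
sudo openvpn --config client.conf --daemon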

I also had to start a second OpenVPN instance to handle the "baremetal" interface (as it's known in dev-scripts). That followed the same process as above, but with ports and addresses changed as necessary. For completeness, here are the configs I used:

server:

port 1195
proto udp
dev tap
script-security 2
up up1.sh
tun-mtu 1600
fragment 1500
mssfix
ca ca.crt
cert server.crt
key server.key  # This file should be kept secret
dh dh.pem
topology subnet
ifconfig-pool-persist ipp.txt
server-bridge 192.168.111.222 255.255.255.0 192.168.111.100 192.168.111.150
push "route 192.168.111.0 255.255.255.0"
keepalive 10 120
tls-auth ta.key 0 # This file is secret
cipher AES-256-CBC
persist-key
persist-tun
status openvpn-status.log
verb 3

client:

client
dev tap
proto udp
tun-mtu 1600
fragment 1500
mssfix
remote 11.3.3.5 1195
resolv-retry infinite
nobind
user nobody
group nobody
persist-key
persist-tun
ca ca.crt
cert wonderland.crt
key wonderland.key
remote-cert-tls server
tls-auth ta.key 1
cipher AES-256-CBC
verb 3

bridge config:

interfaces:
- name: br1
  type: linux-bridge
  state: up
  mtu: 1450
  ipv4:
    enabled: true
    address:
    # This one does not match what OpenStack assigned, confirming that it
    # didn't matter for the provisioning network either.
    - ip: "192.168.111.222"
      prefix-length: 24
  bridge:
    port:
    - name: eth2

up1.sh:

#!/bin/bash

/sbin/brctl addif "br1" "$1"
/sbin/ip l set dev "$1" mtu 1600
/sbin/ip l set dev "$1" up

There was one more thing I had to do on the client side to make this setup work with dev-scripts. I let dev-scripts manage my bridges for me, and if the tap devices already had an IP assigned when it ran, that broke dev-scripts. So I removed the IPs:

ip a flush dev tap0
ip a flush dev tap1

dev-scripts Config

Speaking of dev-scripts, here are the configuration variables I had to set in order to make dev-scripts work in this environment:

export NODES_PLATFORM="baremetal"
# Note: My nodes file got deleted every time I ran "make clean". Be sure to have a backup.
export NODES_FILE="/home/bnemec/dev-scripts/hosts.json"
# Corresponding to the OpenVPN interfaces
export INT_IF="tap1"
export PRO_IF="tap0"
# In libvirt VMs this would be enp1s0; OpenStack is different even though it's also libvirt-based
export CLUSTER_PRO_IF="ens3"
export IP_STACK="v4"
export BMC_DRIVER="ipmi"
export PROVISIONING_NETWORK="192.168.24.0/24"
# Again, different from vanilla libvirt VMs
export ROOT_DISK_NAME="/dev/vda"
# I deployed with static IP addresses to avoid needing to add DHCP to the
# baremetal network. By default, OVB disables DHCP because TripleO didn't use it.
export NETWORK_CONFIG_FOLDER=/home/bnemec/dev-scripts/network-config-static/

And here are the contents of my nodes file:

{
  "nodes": [
    {
      "name": "ostest-master-0",
      "driver": "ipmi",
      "resource_class": "baremetal",
      "driver_info": {
        "username": "admin",
        "password": "password",
        "address": "ipmi://11.3.3.5:6235",
        "deploy_kernel": "http://192.168.24.2/images/ironic-python-agent.kernel",
        "deploy_ramdisk": "http://192.168.24.2/images/ironic-python-agent.initramfs",
        "disable_certificate_verification": false
      },
      "ports": [{
        "address": "fa:16:3e:16:a5:9c",
        "pxe_enabled": true
      }],
      "properties": {
        "local_gb": "20",
        "cpu_arch": "x86_64"
      }
    },
    {
      "name": "ostest-master-1",
      "driver": "ipmi",
      "resource_class": "baremetal",
      "driver_info": {
        "username": "admin",
        "password": "password",
        "address": "ipmi://11.3.3.5:6138",
        "deploy_kernel": "http://192.168.24.2/images/ironic-python-agent.kernel",
        "deploy_ramdisk": "http://192.168.24.2/images/ironic-python-agent.initramfs"
      },
      "ports": [{
        "address": "fa:16:3e:34:31:af",
        "pxe_enabled": true
      }],
      "properties": {
        "local_gb": "20",
        "cpu_arch": "x86_64"
      }
    },
    {
      "name": "ostest-master-2",
      "driver": "ipmi",
      "resource_class": "baremetal",
      "driver_info": {
        "username": "admin",
        "password": "password",
        "address": "ipmi://11.3.3.5:6188",
        "deploy_kernel": "http://192.168.24.2/images/ironic-python-agent.kernel",
        "deploy_ramdisk": "http://192.168.24.2/images/ironic-python-agent.initramfs",
        "disable_certificate_verification": true
      },
      "ports": [{
        "address": "fa:16:3e:28:9a:a5",
        "pxe_enabled": true
      }],
      "properties": {
        "local_gb": "20",
        "cpu_arch": "x86_64"
      }
    },
    {
      "name": "ostest-worker-0",
      "driver": "ipmi",
      "resource_class": "baremetal",
      "driver_info": {
        "username": "admin",
        "password": "password",
        "address": "ipmi://11.3.3.5:693",
        "deploy_kernel": "http://192.168.24.2/images/ironic-python-agent.kernel",
        "deploy_ramdisk": "http://192.168.24.2/images/ironic-python-agent.initramfs"
      },
      "ports": [{
        "address": "fa:16:3e:b7:45:d9",
        "pxe_enabled": true
      }],
      "properties": {
        "local_gb": "20",
        "cpu_arch": "x86_64"
      }
    },
    {
      "name": "ostest-worker-1",
      "driver": "ipmi",
      "resource_class": "baremetal",
      "driver_info": {
        "username": "admin",
        "password": "password",
        "address": "ipmi://11.3.3.5:6229",
        "deploy_kernel": "http://192.168.24.2/images/ironic-python-agent.kernel",
        "deploy_ramdisk": "http://192.168.24.2/images/ironic-python-agent.initramfs"
      },
      "ports": [{
        "address": "fa:16:3e:2c:a0:fe",
        "pxe_enabled": true
      }],
      "properties": {
        "local_gb": "20",
        "cpu_arch": "x86_64"
      }
    }
  ]
}
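
It's worth validating that the nodes file is well-formed JSON before pointing NODES_FILE at it. jq is one easy way to do that:

jq empty /home/bnemec/dev-scripts/hosts.json && echo "nodes file is valid JSON"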

Run make and you should end up with a cluster deployed on your OVB instances. Well, almost. See below for a manual workaround I had to do during the deployment to get all the nodes to deploy correctly.

Known Issues

I ran into a few problems and didn't solve them all. Here they are, in case you're interested:

  • My cloud was too small. Although you can do dev-scripts deployments on a 72 GB host, the overhead of running a cloud in that same 72 GB caused my VMs to get OOM-killed. I worked around this by adding 16 GB of swap file to my OpenStack host, but of course that was not ideal from a performance perspective. I would suggest at least 96 GB for the cloud.
  • socat eventually ran my proxy out of memory. It seems socat creates a new process every time an IPMI call is made (presumably because of the "fork" option), and eventually I had so many of those processes that the proxy quit working. Killing all the socats and restarting just the initial ones worked around this (see the sketch after this list), but obviously that's not ideal. Probably an argument to just use OpenVPN instead.
  • My masters all needed manual intervention in order to successfully pull the IPA rootfs. I'm guessing this is an MTU issue, although I tried everything I could think of to rule that out and it still kept happening. As long as I rebooted each master one by one so they pulled the image one at a time, everything worked. If two masters tried to pull the image at the same time, they both hung.
  • I'm still relying on a bunch of libvirt/dev-scripts stuff to make this work. Ideally I should stand up a DNS server to provide the API and Ingress records and not let libvirt do that, but that can be a future exercise.
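
For the socat leak, the workaround boiled down to something like this (a rough sketch; start-socat.sh is a stand-in for whatever you named the script from the socat section):

# Kill all the forked socat processes on the proxy...
sudo pkill socat
# ...and restart just the five listeners
./start-socat.sh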