Day One Networking in OpenShift

Over the past few years there have been quite a few changes in how day one networking works (or perhaps more accurately, deployment-time networking, since the same mechanisms also apply to scale-out operations on day two). This particular phase of networking is especially tricky because cluster resources are, for the most part, not available yet. This means you can't use any of the normal operators that handle network configuration later in the deployment process.

It's also a very important phase of network configuration because modern network architectures are increasingly complex, and in many cases nodes cannot connect to the rest of the cluster with the default network configuration (in the case of RHEL CoreOS, the default configuration attempts DHCP on every interface on the node). Common examples of the configuration required are static IPs, bonds, and VLANs. Many OpenShift deployers need the ability to provide network configuration that is present from a very early point in the boot process.

To this end, there are now multiple ways to provide "day one" network configuration in OpenShift. Some of these are platform-specific, while one in particular is not (mostly, more on that later).

Platform-Specific Configuration

Some of the on-prem platforms (e.g. baremetal, vSphere, OpenStack) provide a mechanism to do network configuration very early in the boot process. In the case of baremetal, this configuration is baked into the deployment images and thus is present from the very first moment the node boots. Since I primarily work with baremetal I'll focus on that, but be aware that there are similar implementations for other platforms. For example, on vSphere there is an interface to pass kernel args that perform network configuration at initial boot. The basic tenets of the deployment flow are the same regardless of the specific configuration mechanism.
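
For context, kernel-arg-based configuration generally uses the standard dracut ip= syntax. As a rough illustration only (the address, gateway, hostname, and interface name here are simply reused from the baremetal example below), a static configuration for a single NIC might look something like:

ip=192.168.111.110::192.168.111.1:255.255.255.0:ostest-master-0:enp2s0:none nameserver=192.168.111.1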

The baremetal implementation lives in the baremetal operator, in particular the image-customization-controller. This controller is responsible for taking the NMState configuration attached to a given BareMetalHost record and embedding it in the image to be used for deployment of that host. This has a couple of important implications:

  1. The network configuration is embedded in both the ramdisk and the root image. As a result, any configuration provided by this mechanism must function in the more limited ramdisk environment. Notably, this means no Open vSwitch configuration may be done as OVS is not yet running in the ramdisk.
  2. The NMState configuration provided must be processable by the nmstatectl gc command (which you can run yourself to check a config, as shown below). NMState configurations that rely on runtime information, such as capturing the state of a NIC, will not work in this phase of configuration. This is because the baremetal operator processes the NMState config into raw nmconnection files that are then placed in /etc/NetworkManager/system-connections.
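
A quick way to check whether a given NMState file will make it through this step is to run it through nmstatectl gc yourself, which performs the same kind of offline rendering the operator does. A minimal sketch, assuming the config is in a file named master-0-network-config.yml (the filename is just an example):

# Offline-render the NMState YAML into NetworkManager keyfiles without
# touching the running system; this mirrors the processing the operator performs.
nmstatectl gc master-0-network-config.yml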

While these are unfortunate drawbacks, the low-level nature of the configuration does ensure that network configuration can be provided as soon as it may be required. This means network configuration can be present before Ignition is retrieved, which is crucial since functional networking is required in order for Ignition to complete successfully.

For reference, here is a snippet from install-config showing the sections that go into this configuration (the networkConfig blocks are the network configuration parts):

apiVersion: v1
baseDomain: test.metalkube.org
networking:
  networkType: OVNKubernetes
  machineNetwork:
  - cidr: 192.168.111.0/24
[...snip...]
platform:
  baremetal:
[...snip...]
    hosts:
      - name: ostest-master-0
        role: master
        bmc:
          address: redfish-virtualmedia+http://192.168.111.1:8000/redfish/v1/Systems/fd0e728a-4c10-40e4-807a-fab5b055de78
          username: admin
          password: password
          disableCertificateVerification: null
        bootMACAddress: 00:d7:a3:95:42:1f
        bootMode: UEFI
        networkConfig:
          interfaces:
          - name: enp2s0
            type: ethernet
            state: up
            ipv4:
              address:
              - ip: "192.168.111.110"
                prefix-length: 24
              enabled: true
          dns-resolver:
            config:
              server:
              - 192.168.111.1
          routes:
            config:
            - destination: 0.0.0.0/0
              next-hop-address: 192.168.111.1
              next-hop-interface: enp2s0
        rootDeviceHints:
          deviceName: "/dev/sda"
        hardwareProfile: default
      - name: ostest-master-1
        role: master
        bmc:
          address: redfish-virtualmedia+http://192.168.111.1:8000/redfish/v1/Systems/1979833a-4089-41d0-936a-b8bfebd6785f
          username: admin
          password: password
          disableCertificateVerification: null
        bootMACAddress: 00:d7:a3:95:42:23
        bootMode: UEFI
        networkConfig:
          interfaces:
          - name: enp2s0
            type: ethernet
            state: up
            ipv4:
              address:
              - ip: "192.168.111.111"
                prefix-length: 24
              enabled: true
          dns-resolver:
            config:
              server:
              - 192.168.111.1
          routes:
            config:
            - destination: 0.0.0.0/0
              next-hop-address: 192.168.111.1
              next-hop-interface: enp2s0
        rootDeviceHints:
          deviceName: "/dev/sda"
        hardwareProfile: default
[...these sections repeat for each host in the cluster...]

There is also a mechanism to provide this configuration to nodes that are scaled out after initial deployment when install-config cannot be used. It's a bit more complex because it requires encoding the NMState configuration into a Secret and then attaching that to the BareMetalHost object, but it can be done.
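
As a rough sketch of what that looks like (the names here are made up for illustration, and the exact field and key names may differ between releases, so treat this as the general shape rather than a copy-paste recipe), the NMState YAML goes into a Secret which the BareMetalHost then references:

apiVersion: v1
kind: Secret
metadata:
  name: ostest-worker-0-network-config
  namespace: openshift-machine-api
type: Opaque
stringData:
  nmstate: |
    interfaces:
    - name: enp2s0
      type: ethernet
      state: up
      ipv4:
        enabled: true
        address:
        - ip: "192.168.111.112"
          prefix-length: 24
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: ostest-worker-0
  namespace: openshift-machine-api
spec:
  online: true
  preprovisioningNetworkDataName: ostest-worker-0-network-config
[...bmc, bootMACAddress, etc. as usual...]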

This feature allows us to configure basic networking on hosts very early on. But what if you want to do something more complex? Perhaps OVS bridges and bonds? That's where the second step of our day one network configuration comes in.

NMState br-ex Creation

This feature was primarily driven by a longstanding desire to eliminate the configure-ovs.sh script that is still used in most deployments to create the br-ex bridge needed for OVNKubernetes, though it has proven to have a few other benefits as well. It does require a bit more work up front, but it offers a number of advantages:

  • Reliability. NMState is tested and supported by the NetworkManager team, whereas configure-ovs.sh is a large shell script that is difficult to test and has proven troublesome over the years.
  • Full control over the configuration of br-ex. The assumptions and guesses made by configure-ovs.sh no longer apply.
  • The ability to modify br-ex on day 2 using Kubernetes-NMState, which was previously not allowed (see the sketch after this list).
  • Support for more network architectures. Certain configurations could not be used on day one (or at all) before this feature.
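
To illustrate the day 2 point above, a Kubernetes-NMState NodeNetworkConfigurationPolicy that touches br-ex might look roughly like the following (a sketch only; the policy name, node selector, and MTU change are invented for the example):

apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: br-ex-mtu
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
    - name: br-ex
      type: ovs-interface
      state: up
      mtu: 9000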

Why can't the baremetal feature discussed above be used for this? There are two main reasons: OVS is not available in the ramdisk and we don't want this to be baremetal-specific.

The first point is a big reason for the two-step process. There are two conflicting requirements that are difficult to satisfy in a single step. Initially, we need something that works very early in boot so we can pull Ignition; this cannot depend on OVS or other tools that are not present that early. On the other hand, we also need to be able to deploy complex network setups that use things like OVS. In essence, the early configuration must stay simple, while the later one must support very complex setups. While it may be possible to come up with a compromise between the two, keeping them separate allows both requirements to be met in the best possible way.

Additionally, because the long-term goal is to replace configure-ovs.sh completely, we can't use a baremetal-only feature to do it. Prior to this, all of our host networking configuration tools were specific to a given platform. While we have a long way to go in terms of usability before this can become the default option, we also don't want to choose a design that prevents us from doing so.

One confession though: as of this writing the feature is only enabled for baremetal IPI. However, there is a patch proposed to enable it everywhere, and we've retrofitted it into some non-baremetal IPI clusters where it worked just fine.

With all that background out of the way, let's talk about how this feature works. Currently, the only interface is through machine-configs. This is not great and we intend to add a better interface in the near future, but we went with the crude interface in order to expedite delivery of the functionality.

At a basic level, the NMState configuration for this step is provided in machine-config manifests passed to the installer. The machine-configs write files to /etc/nmstate/openshift, which are then processed by a service deployed on each node. The filenames determine which configurations apply to which nodes. By default, a file named /etc/nmstate/openshift/cluster.yml will be applied to every node in a given role (each role must have its own machine-config). This file must be common to every node on which it will be applied. However, it is also possible to apply a node-specific configuration based on hostname. For example, /etc/nmstate/openshift/master-0.yml will be applied only to a node named master-0. Note that this replaces cluster.yml entirely; the configurations are not merged.

Those familiar with the Machine Config Operator may note that it does not currently support node-specific configuration. This feature gets around that limitation by writing all of the configs for all of the nodes to every node in a role. So master-0 will have configs for itself, master-1, and master-2, but will only apply its own configuration; each of the other master nodes will likewise have all three configs present. Is this an abuse of the machine-config interface? Absolutely. However, this feature was deemed important enough to justify the ickiness of the technical solution.
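
Putting those two paragraphs together, the contents of /etc/nmstate/openshift on the master-0 node end up looking roughly like this (a sketch of the idea, not literal output):

/etc/nmstate/openshift/
  cluster.yml     # fallback, used on any node in the role that has no host-specific file
  master-0.yml    # the only file applied on master-0 (replaces cluster.yml entirely)
  master-1.yml    # present on master-0 but ignored there
  master-2.yml    # present on master-0 but ignored there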

Below is an example machine-config that might be used with the static IP configurations above:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 10-br-ex-master
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,aW50ZXJmYWNlczo[snipped to avoid breaking page formatting]NlOiBici1leA==
        mode: 0644
        overwrite: true
        path: /etc/nmstate/openshift/master-0.yml
[...master-1 and master-2 have similar configuration...]

In this case the base64-encoded content looks like this:

interfaces:
- name: enp2s0
  type: ethernet
  state: up
  ipv4:
    enabled: false
  ipv6:
    enabled: false
- name: br-ex
  type: ovs-bridge
  state: up
  copy-mac-from: enp2s0
  ipv4:
    enabled: false
    dhcp: false
  ipv6:
    enabled: false
    dhcp: false
  bridge:
    port:
    - name: enp2s0
    - name: br-ex
- name: br-ex
  type: ovs-interface
  state: up
  ipv4:
    enabled: true
    address:
    - ip: "192.168.111.110"
      prefix-length: 24
  ipv6:
    enabled: false
    dhcp: false
dns-resolver:
  config:
    server:
    - 192.168.111.1
routes:
  config:
  - destination: 0.0.0.0/0
    next-hop-address: 192.168.111.1
    next-hop-interface: br-ex

As you can see, this configuration builds on the first static IP configuration by adding an ovs-bridge for br-ex.
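
If you're building one of these machine-configs by hand, the content for the data URL in the source field can be generated from the plain NMState YAML with something along these lines (the filename is just an example):

# Encode the NMState YAML for embedding in the machine-config data URL
base64 -w0 master-0.yml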

I should note that at this time it's not possible to use this mechanism without providing your own configuration for br-ex. Deploying with these configurations disables configure-ovs.sh. In most cases this won't be a problem since advanced users will likely want that level of control, but it is something to be aware of.

Timing-wise, these configurations are written to disk at Ignition time. The service that applies them is configured to run before any other OpenShift components, so assuming everything works as expected the configuration will be applied to the host before kubelet or CRI-O starts running. Importantly, because these configurations are not applied until after the pivot to the real root disk, they are able to take advantage of any services and tools present on the system.
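
If you want to verify the result on a running node, one option (just a sketch, assuming a node named master-0) is to inspect the applied network state from a debug shell, which should show br-ex configured as requested:

# Dump the current network state as seen by NMState on the node
oc debug node/master-0 -- chroot /host nmstatectl show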

Conclusion

Hopefully this discussion has clarified our current method of doing day one network configuration. While the two-step process adds some complexity to the deployment workflow, it enables some important new network architectures, and we will be working to simplify the interface over the next few releases.