Over the past few years there have been quite a few changes in how day one (or perhaps more accurately, deployment-time, since the same mechanisms also apply to scaleout operations on day two) networking works. This phase of networking is especially tricky because most cluster resources are not available yet, which means you can't use any of the normal operators that handle network configuration later in the deployment process.
It's also a very important phase of network configuration because modern network architectures are increasingly complex, and in many cases nodes cannot connect to the rest of the cluster with the default network configuration (in the case of RHEL CoreOS, the default configuration attempts DHCP on every interface on the node). Common examples include static IPs, bonds, and VLANs. Many OpenShift deployers need the ability to provide network configuration that is present from a very early point in the boot process.
To this end, there are now multiple ways to provide "day one" network configuration in OpenShift. Some of these are platform-specific, while one in particular is not (mostly, more on that later).
Some of the on-prem platforms (e.g. baremetal, vSphere, OpenStack) provide a mechanism to do network configuration very early in the boot process. In the case of baremetal, this configuration is baked into the deployment images and is thus present from the very first moment the node boots. Since I primarily work with baremetal I'll focus on that, but be aware that there are similar implementations for other platforms. For example, on vSphere there is an interface to pass kernel args that do network configuration at initial boot. The basic tenets of the deployment flow are the same regardless of the specific configuration mechanism.
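As a rough illustration, that kernel-arg style of configuration uses the dracut ip= syntax (on vSphere this is typically passed through the guestinfo.afterburn.initrd.network-kargs property read by Afterburn; the addresses here are just examples consistent with the configs below):

    ip=192.168.111.110::192.168.111.1:255.255.255.0:master-0:enp2s0:none nameserver=192.168.111.1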
The baremetal implementation lives in the baremetal operator, in particular the image-customization-controller. This controller is responsible for taking the NMState configuration attached to a given BareMetalHost record and embedding it in the image to be used for deployment of that host. This has a couple of important implications:

- The configuration is processed ahead of time by the nmstatectl gc function rather than by a live NMState instance. NMState configurations that rely on runtime information, such as capturing the state of a NIC, will not work in this phase of configuration.
- The baremetal operator renders the NMState config into raw nmconnection files, which are then placed in /etc/NetworkManager/system-connections in the deployment image.
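For illustration, nmstatectl gc can be run by hand to see the kind of output this produces (static-ip.yml is a hypothetical file containing an NMState document like the networkConfig sections shown below):

    # Render NMState YAML to NetworkManager keyfiles offline, with no running
    # NetworkManager or NMState daemon involved.
    nmstatectl gc static-ip.yml

The resulting keyfile content is what ultimately lands in /etc/NetworkManager/system-connections on the deployed host.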
While these are unfortunate drawbacks, the low-level nature of the configuration does ensure that network configuration can be provided as soon as it may be required. This means network configuration can be present before Ignition is retrieved, which is crucial since functional networking is required in order for Ignition to complete successfully.
For reference, here is a snippet from install-config showing the sections that go into this configuration (the network configuration lives under the networkConfig key of each host):
apiVersion: v1
baseDomain: test.metalkube.org
networking:
  networkType: OVNKubernetes
  machineNetwork:
  - cidr: 192.168.111.0/24
[...snip...]
platform:
  baremetal:
    [...snip...]
    hosts:
    - name: ostest-master-0
      role: master
      bmc:
        address: redfish-virtualmedia+http://192.168.111.1:8000/redfish/v1/Systems/fd0e728a-4c10-40e4-807a-fab5b055de78
        username: admin
        password: password
        disableCertificateVerification: null
      bootMACAddress: 00:d7:a3:95:42:1f
      bootMode: UEFI
      networkConfig:
        interfaces:
        - name: enp2s0
          type: ethernet
          state: up
          ipv4:
            address:
            - ip: "192.168.111.110"
              prefix-length: 24
            enabled: true
        dns-resolver:
          config:
            server:
            - 192.168.111.1
        routes:
          config:
          - destination: 0.0.0.0/0
            next-hop-address: 192.168.111.1
            next-hop-interface: enp2s0
      rootDeviceHints:
        deviceName: "/dev/sda"
      hardwareProfile: default
    - name: ostest-master-1
      role: master
      bmc:
        address: redfish-virtualmedia+http://192.168.111.1:8000/redfish/v1/Systems/1979833a-4089-41d0-936a-b8bfebd6785f
        username: admin
        password: password
        disableCertificateVerification: null
      bootMACAddress: 00:d7:a3:95:42:23
      bootMode: UEFI
      networkConfig:
        interfaces:
        - name: enp2s0
          type: ethernet
          state: up
          ipv4:
            address:
            - ip: "192.168.111.111"
              prefix-length: 24
            enabled: true
        dns-resolver:
          config:
            server:
            - 192.168.111.1
        routes:
          config:
          - destination: 0.0.0.0/0
            next-hop-address: 192.168.111.1
            next-hop-interface: enp2s0
      rootDeviceHints:
        deviceName: "/dev/sda"
      hardwareProfile: default
[...these sections repeat for each host in the cluster...]
There is also a mechanism to provide this configuration to nodes that are scaled out after initial deployment when install-config cannot be used. It's a bit more complex because it requires encoding the NMState configuration into a Secret and then attaching that to the BareMetalHost object, but it can be done.
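As a sketch of that flow (the resource names here are illustrative, not prescriptive), the NMState document goes into a Secret, and the BareMetalHost references it through its preprovisioningNetworkDataName field:

    apiVersion: v1
    kind: Secret
    metadata:
      name: ostest-worker-0-network-config   # hypothetical name
      namespace: openshift-machine-api
    type: Opaque
    stringData:
      nmstate: |
        interfaces:
        - name: enp2s0
          type: ethernet
          state: up
          ipv4:
            address:
            - ip: "192.168.111.112"
              prefix-length: 24
            enabled: true
    ---
    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHost
    metadata:
      name: ostest-worker-0
      namespace: openshift-machine-api
    spec:
      online: true
      bootMACAddress: 00:d7:a3:95:42:30   # hypothetical
      bmc:
        address: redfish-virtualmedia+http://192.168.111.1:8000/redfish/v1/Systems/example
        credentialsName: ostest-worker-0-bmc-secret
      preprovisioningNetworkDataName: ostest-worker-0-network-config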
This feature allows us to configure basic networking on hosts very early on. But what if you want to do something more complex? Perhaps OVS bridges and bonds? That's where the second step of our day one network configuration comes in.
This feature was primarily driven by a longstanding desire to eliminate the configure-ovs.sh script that is still used in most deployments to create the br-ex bridge needed for OVNKubernetes, although it has also proven to have a few other benefits. It does require a bit more work up front, but it brings a number of advantages over the script-based approach.
Why can't the baremetal feature discussed above be used for this? There are two main reasons: OVS is not available in the ramdisk and we don't want this to be baremetal-specific.
The first point is a big reason for the two-step process. Two conflicting requirements make this difficult to solve in a single step. Initially, we need something that works very early in boot so we can pull Ignition, and it cannot depend on OVS or other tools that are not present that early. On the other hand, we also need to be able to deploy complex network setups that use things like OVS. In essence, one configuration must be simple, while the other must support arbitrary complexity. While it may be possible to come up with a compromise between the two, keeping them separate allows both requirements to be met in the best possible way.
Additionally, because the long-term goal is to replace configure-ovs.sh completely, we can't use a baremetal-only feature to do it. Prior to this, all of our host networking configuration tools were specific to a given platform. While we have a long way to go in terms of usability before this can become the default option, we also don't want to choose a design that prevents us from doing so.
One confession though: as of this writing the feature is only enabled for baremetal IPI. However, there is a patch proposed to enable it everywhere, and we've retrofitted it onto some non-baremetal IPI clusters where it worked just fine.
With all that background out of the way, let's talk about how this feature works. Currently, the only interface is through machine-configs. This is not great and we intend to add a better interface in the near future, but we went with the crude interface in order to expedite delivery of the functionality.
At a basic level, the NMState configuration for this step is provided in machine-config manifests passed to the installer. The machine-configs write files to /etc/nmstate/openshift, which are then processed by a service deployed on each node. The filenames determine which configurations apply to which nodes. By default, a file named /etc/nmstate/openshift/cluster.yml will be applied to every node in a given role (each role must have its own machine-config), so this file must be common to every node on which it will be applied. However, it is also possible to apply a node-specific configuration based on hostname. For example, /etc/nmstate/openshift/master-0.yml will be applied only to a node named master-0. Note that a node-specific file replaces cluster.yml entirely; the configurations are not merged.
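To illustrate the matching behavior, a master node might end up with a layout like this (filenames hypothetical), where only the file matching the node's own hostname, or cluster.yml in its absence, is applied:

    /etc/nmstate/openshift/
    ├── cluster.yml    # fallback applied to any master with no host-specific file
    ├── master-0.yml   # applied only on master-0, replacing cluster.yml entirely
    ├── master-1.yml
    └── master-2.yml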
Those familiar with the Machine Config Operator may note that it doesn't currently support node-specific configuration. This feature gets around that limitation by writing the configs for all of the nodes in a role to every node in that role. So master-0 will have configs for itself, master-1, and master-2, but will only apply its own configuration; each of the other master nodes will likewise have all three configs present. Is this an abuse of the machine-config interface? Absolutely. However, this feature was deemed important enough to justify the ickiness of the technical solution.
Below is an example machine-config that might be used with the static IP configurations above:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 10-br-ex-master
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,aW50ZXJmYWNlczo[snipped to avoid breaking page formatting]NlOiBici1leA==
        mode: 0644
        overwrite: true
        path: /etc/nmstate/openshift/master-0.yml
[...master-1 and master-2 have similar configuration...]
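The contents source is just a base64-encoded NMState file. It can be generated with something like the following, where master-0.yml is the plain-text NMState document for that host:

    # Encode the NMState file for embedding in the machine-config data URL.
    # -w0 disables line wrapping so the output is a single line.
    base64 -w0 master-0.yml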
In this case the base64-encoded content decodes to the following:
interfaces:
- name: enp2s0
  type: ethernet
  state: up
  ipv4:
    enabled: false
  ipv6:
    enabled: false
- name: br-ex
  type: ovs-bridge
  state: up
  copy-mac-from: enp2s0
  ipv4:
    enabled: false
    dhcp: false
  ipv6:
    enabled: false
    dhcp: false
  bridge:
    port:
    - name: enp2s0
    - name: br-ex
- name: br-ex
  type: ovs-interface
  state: up
  ipv4:
    enabled: true
    address:
    - ip: "192.168.111.110"
      prefix-length: 24
  ipv6:
    enabled: false
    dhcp: false
dns-resolver:
  config:
    server:
    - 192.168.111.1
routes:
  config:
  - destination: 0.0.0.0/0
    next-hop-address: 192.168.111.1
    next-hop-interface: br-ex
As you can see, this configuration builds on the first static IP configuration by adding an ovs-bridge for br-ex.
I should note that at this time it's not possible to use this mechanism without providing your own configuration for br-ex. Deploying with these configurations disables configure-ovs.sh. In most cases this won't be a problem since advanced users will likely want that level of control, but it is something to be aware of.
Timing-wise, these configurations are written to disk at Ignition time. The service that applies them is configured to run before any other OpenShift components, so assuming everything works as expected the configuration will be applied to the host before kubelet or CRI-O start running. Importantly, because the configurations are not applied until after the pivot to the real root disk, they are able to take advantage of any services and tools present on the system.
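Once a node is up, it's easy to sanity-check the result. For example (assuming cluster access and a node named master-0):

    # Illustrative check that br-ex exists with the expected address.
    oc debug node/master-0 -- chroot /host nmstatectl show br-ex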
Hopefully this discussion has clarified our current method of doing day one configuration. While the two-step process adds some complexity to the deployment workflow, it enables some important new network architectures, and we will be working to simplify the interface over the next few releases.