Recently, I decided I needed to restructure my infrastructure to make more efficient use of my local hardware and reduce my cloud hosting costs.
The majority of the stuff I host in ‘the cloud’ is there because I want maximum uptime, and hosting services behind a busy consumer router is not the best way to provide that.
In my new arrangement, I will create three clusters across my three main hardware nodes: an old Dell PowerEdge and a couple of old home PCs the kids grew out of, each with a little extra RAM and storage bolted in. Nothing fast or fancy, at least not until the services they host are making some kind of return!
The three clusters are going to be labelled ‘infra’, ‘staging’ and ‘production’. The ‘infra’ cluster will contain and isolate all the common support services required for the services hosted on ‘staging’ and ‘production’ to operate. These include things like SSO (Keycloak), PKI/TLS and secrets management (Vault), DNS (CoreDNS/PiHole), Logging (Loki) and Metrics (Prometheus), CI/CD (ArgoWorkflows/ArgoCD).
Pre-requisites
In this write-up, I will be leveraging a few services that have already been configured independently of the new clusters.
The main dependency is HashiCorp Vault as our PKI. If you’re just playing around, you can run up a local instance for the purposes of creating the PKIs required for this exercise. Otherwise, you should really consider setting Vault up on a secure, independent host, with a good helping of access control applied.
Prepping the VM hosts
I assume the VM hosts have already been prepped with Ubuntu 22.04 LTS (Jammy), with a reasonably sized root partition and the rest of the storage assigned to an LVM volume group named `data`, from which the hosts will draw persistent storage for normal operation.
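For illustration, carving a volume for a node VM out of that `data` group looks something like the following. Names and sizes here are hypothetical, and the script only prints the commands, which would be run as root on the host:

```shell
#!/bin/sh
# Dry-run sketch: print the LVM commands that would carve a root volume
# for a node VM out of the 'data' volume group. Names are illustrative.
VG=data
NODE=infra-node-1   # hypothetical VM name
SIZE=100G

echo "lvcreate -L ${SIZE} -n ${NODE}-root ${VG}"
echo "mkfs.ext4 /dev/${VG}/${NODE}-root"
```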
For reference, the hosts I will be configuring:
| Host | Host IP | Virtual IP |
|---|---|---|
| sentinel-1 | 10.20.0.200 | 10.x.0.1 |
| ollie-desktop | 10.20.0.50 | 10.x.0.2 |
| sam-desktop | 10.20.0.10 | 10.x.0.3 |
The hosts are configured to use an OPNsense router at 10.20.0.1 as their default gateway.
TODO: Consistent (re-)prepping of a large number of physical hosts can be achieved using a PXE-based approach, which I will write about in another article. I use this approach even though I only have a small number of hosts that I could manage manually, as it’s good practice, and because when (not if!) I win the lottery it would be nice to be able to replace and add physical host nodes quickly.
The plan is to run three VMs on each host, one for each cluster. This way, a physical node going down would take out one of the three nodes of each cluster, but each cluster would still have two nodes running, so should be able to recover most services and continue to provide service.
For brevity (!) I’ll skip over this part; suffice it to say that the hosts have (automatically) been updated and CIS-hardened, the main admin users, groups and permissions configured, logging and metrics agents installed, and the main `libvirt` packages installed such that the `virsh` CLI works.
I/O concerns
My concern is mainly regarding I/O. There will be a lot more of it than these hosts have previously been used to, and they are running a mix of spinning drives and SSDs. This is where StorageClasses will come in useful while I work on phasing out the spinning drives.
Also, I am going to have to be careful to reduce the amount of unnecessary logging the services produce. Bursty logs can consume the majority share of the I/O bandwidth and cause other processes to stall and time out.
Poor `etcd` performance can also lead to unstable clusters. Moving `etcd` and the other control-plane components onto their own dedicated resources is desirable, but not currently possible.
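One cheap mitigation on the logging side is to rate-limit `journald` so a single bursty service cannot monopolise disk I/O. A sketch, with numbers to be tuned for your own hosts:

```ini
# /etc/systemd/journald.conf (excerpt)
[Journal]
# Throttle any single service to at most 1000 messages per 30s window
RateLimitIntervalSec=30s
RateLimitBurst=1000
# Cap total journal disk usage
SystemMaxUse=500M
```

Apply with `sudo systemctl restart systemd-journald`.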
Host bridges and VLANs
Each host will have a different ‘default gateway’ interface. The Dell has two physical NICs, so we combine them to form a ‘bonded’ interface (`bond0`). One of the home PCs has ended up with `eth0` and the other with `enp1s0`. Whatever they are, they need to be added to a `br0` bridge, which will represent the ‘forwarding’ interface(s) and carry the main forward-facing IP address.
Then, we need a bridge for each of the virtual cluster networks we plan to host, and they need to be connected to a physical VLAN that will allow them to share traffic between hosts. The default gateway route should be the virtual IP of the host for that VLAN.
This can be summarised by the following netplan:
```yaml
network:
  version: 2
  ethernets:
    eno1:
      dhcp4: no
    eno2:
      dhcp4: no
  vlans:
    eno1vlan77:
      id: 77
      link: eno1
    eno1vlan88:
      id: 88
      link: eno1
    eno1vlan89:
      id: 89
      link: eno1
    eno2vlan77:
      id: 77
      link: eno2
    eno2vlan88:
      id: 88
      link: eno2
    eno2vlan89:
      id: 89
      link: eno2
  bridges:
    br0:
      mtu: 1420
      interfaces:
        - eno1
        - eno2
      addresses:
        - 10.20.0.200/24
      routes:
        - to: 0.0.0.0/0
          via: 10.20.0.1
      nameservers:
        addresses:
          - 10.20.0.1
    br1:
      interfaces:
        - eno1vlan88
        - eno2vlan88
      addresses:
        - 10.88.0.1/16
    br2:
      interfaces:
        - eno1vlan89
        - eno2vlan89
      addresses:
        - 10.89.0.1/16
    br3:
      interfaces:
        - eno1vlan77
        - eno2vlan77
      addresses:
        - 10.77.0.1/16
```
At this point, we can assume that the hosts can ping each other on each of the VLANs.
Static routing
As my physical hosts are on their own subnet connected to the main network router, the rest of the internal network cannot yet reach the hosts via their virtual IP addresses, as there are no routes to those ranges in the main network router’s routing tables.
We can, if desired, add the nine specific static routes to the router, forwarding traffic destined for the new virtual IPs to their respective host IPs. Or, we can just operate via the physical host IPs for now, and wait until later when we configure BGP to advertise the host, pod and service addresses to the other hosts and the network router.
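For illustration, the nine routes follow mechanically from the host and VLAN tables: each host’s virtual IPs (10.&lt;vlan&gt;.0.&lt;n&gt;) are reached via its physical IP. This sketch just prints them in Linux `ip route` syntax; on OPNsense you would add the equivalents as static routes via the UI:

```shell
#!/bin/sh
# Dry-run: print the nine /32 static routes mapping each host's virtual
# IPs (10.<vlan>.0.<n>) back to its physical IP, per the tables above.
i=1
for host_ip in 10.20.0.200 10.20.0.50 10.20.0.10; do
  for net in 10.88 10.77 10.89; do
    echo "ip route add ${net}.0.${i}/32 via ${host_ip}"
  done
  i=$((i + 1))
done
```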
Prepping the network
In our network, we allocate a whole class B (/16) network range to each cluster.
I know it’s not ideal, but it’s a legacy scheme. At some point it may become necessary to review our use of the RFC 1918 address space and look into ‘defragging’ it for efficiency and better organisation, once we better understand our organisational structure and plans at a larger scale.
| Range | Purpose |
|---|---|
| 10.x.0.0/24 | For the physical hosts and LAN network segment. |
| 10.x.4.0/22 | For the ClusterIP service address range. |
| 10.x.64.0/22 | For assignment of a /24 podCIDR to each physical host. |
This makes it relatively easy to determine which cluster a given IP relates to, and whether it’s a host IP, pod IP or service IP. From the podCIDR portion you can also deduce the node it is assigned to.
In this example, I have decided on the following addressing scheme. Also, I will be using VLANs to isolate traffic between the clusters.
| Cluster | Range | VLAN |
|---|---|---|
| Infra | 10.88.0.0/16 | 88 |
| Staging | 10.77.0.0/16 | 77 |
| Production | 10.89.0.0/16 | 89 |
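As a sketch of the ‘deduce it from the address’ idea, a small helper can classify any 10.x address under this scheme. The podCIDR-to-node numbering here is illustrative, assuming the /24s are handed out to nodes in order:

```shell
#!/bin/sh
# Classify an address under the 10.<cluster>.0.0/16 scheme above,
# using the /24 host, /22 service and /22 pod splits from the ranges table.
classify() {
  ip=$1
  o2=$(echo "$ip" | cut -d. -f2)   # cluster octet (88/77/89)
  o3=$(echo "$ip" | cut -d. -f3)
  case $o2 in
    88) cluster=infra ;;
    77) cluster=staging ;;
    89) cluster=production ;;
    *)  cluster=unknown ;;
  esac
  if [ "$o3" -eq 0 ]; then kind=host
  elif [ "$o3" -ge 4 ] && [ "$o3" -le 7 ]; then kind=service
  elif [ "$o3" -ge 64 ] && [ "$o3" -le 67 ]; then kind="pod (node $((o3 - 63)))"
  else kind=other
  fi
  echo "$ip -> $cluster $kind"
}

classify 10.88.0.1     # prints: 10.88.0.1 -> infra host
classify 10.77.5.10    # prints: 10.77.5.10 -> staging service
classify 10.89.64.12   # prints: 10.89.64.12 -> production pod (node 1)
```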
I have configured my managed switch to bridge these VLANs only across the ports the hosts are connected to.
IP address management
I have also added these ranges to our IPAM (IP address management) system, Netbox. I added them manually via the UI, but ideally they should be populated automatically via the Netbox API by the scripts we’re using as the resources are deployed, using the latest details at the time.
Prepping the node VMs
The main parts are the base O/S image file (a customised Ubuntu image), the `cloud-init` node configuration ISO file, and the `libvirt` XML node definition file.
Base O/S image
The base image used to fire up the nodes is built with Packer.
TODO: Sanitise and put a copy of the Packer file and supporting parts in a repo somewhere. Discuss.
Typically, I build the image on my laptop and `scp` the image file manually to the VM hosts. Going forward, I will put this into a CD pipeline so that a recent image is always available for when I fire up new nodes or replace host nodes.
Node configuration ISOs
There is a `configs` folder that contains a folder for each of the host nodes. Each host node folder holds a `Makefile` and a set of files used to create a simple cloud-init ISO image.
| File | Purpose |
|---|---|
| `Makefile` | Used to (re)generate the instance’s config ISO |
| `instance.rc` | A shell script resource file used by the `create_vm.sh` script |
| `meta-data` | A cloud-init metadata definition; sets instance ID, hostname etc. |
| `network-config` | A netplan config for the host, defining interfaces, bridges, VLANs |
| `parts` | A folder containing parts of the ‘user-data’ to be assembled |
| `parts/script.sh` | A shell script run on instance initialisation |
| `parts/userdata.yaml` | A YAML cloud-init definition of the node’s desired attributes |
There is also a `common.yaml` file in a lower folder. This is prepended to `parts/userdata.yaml`, and serves to define common attributes such as admin users, SSH keys etc.
The resulting ISO file will be mounted into the node VM via the CD-ROM device.
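The assembly step amounts to concatenating the common and per-node YAML and wrapping the result in a cloud-init seed ISO. A rough, self-contained sketch — paths and file contents are stand-ins, and the ISO step is shown as a comment since it needs `genisoimage`:

```shell
#!/bin/sh
# Sketch of the user-data assembly step (paths and content are stand-ins;
# in the real tree these files live under configs/<host>/).
set -eu
workdir=$(mktemp -d)

printf '#cloud-config\nusers:\n  - name: admin\n' > "$workdir/common.yaml"
printf 'hostname: infra-node-1\n' > "$workdir/userdata.yaml"

# common.yaml is prepended to the per-node userdata.yaml
cat "$workdir/common.yaml" "$workdir/userdata.yaml" > "$workdir/user-data"

# The seed ISO itself would then be built with something like:
#   genisoimage -output seed.iso -volid cidata -joliet -rock \
#     user-data meta-data network-config

head -1 "$workdir/user-data"   # prints: #cloud-config
```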
Libvirt VM definitions
The `create_vm.sh` script takes the libvirt node definition template and replaces the variable placeholders with values obtained from the `instance.rc` file.
This should ensure that the node has a unique node ID, ethernet MAC addresses, and other parameters appropriate to the instance.
It then defines the instance using the XML it has generated, and starts it. We can watch it start with `virsh console`, and once booted we should also be able to SSH into it (from a host with appropriate routing!).
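The templating itself can be as simple as sourcing `instance.rc` and substituting placeholders into the XML template. A minimal sketch, with hypothetical variable names and a cut-down template:

```shell
#!/bin/sh
# Sketch of the create_vm.sh templating step (variable and placeholder
# names are illustrative; the real files live per-host under configs/).
set -eu
workdir=$(mktemp -d)

# Stand-in instance.rc: per-instance values, including a unique MAC.
cat > "$workdir/instance.rc" <<'EOF'
NODE_NAME=infra-node-1
NODE_MAC=52:54:00:88:00:01
EOF

# Stand-in libvirt XML template with @PLACEHOLDER@ markers.
cat > "$workdir/node.xml.tpl" <<'EOF'
<domain type="kvm">
  <name>@NODE_NAME@</name>
  <devices>
    <interface type="bridge">
      <mac address="@NODE_MAC@"/>
      <source bridge="br1"/>
    </interface>
  </devices>
</domain>
EOF

. "$workdir/instance.rc"
sed -e "s/@NODE_NAME@/$NODE_NAME/" -e "s/@NODE_MAC@/$NODE_MAC/" \
  "$workdir/node.xml.tpl" > "$workdir/node.xml"

grep '<name>' "$workdir/node.xml"   # shows the substituted <name> element
```

The generated XML would then be registered and started with `virsh define` and `virsh start`.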
Node initialisation
As each node instance starts, it will run the bootstrap `script.sh` from the cloud-init ISO.
Kubernetes master nodes
On the K8S master node VMs, the bootstrap script will retrieve a set of pre-signed CA certificates from our Vault KV secrets store (TODO: Link to other article about that), and initialise the K8S cluster using the other custom parameters (e.g. network addresses, TLS hostnames) defined in the `kubeadm-config.yaml` file. Once initialised, the script will deploy the Cilium CNI driver, and write a fresh cluster ‘join’ token to a Vault KV store entry for the worker nodes to consume.
Essentially, it creates a cluster joining token. You could do the same manually with the following:
```shell
kubeadm token create --print-join-command --config=/etc/kubernetes/kubeadm-config.yaml
```
NOTE: The above command issues a join command with an IP address. For a multi-master cluster, this needs to be a hostname. You can use DNS for this, or (hack) just deploy a static hosts entry on each node.
Kubernetes worker nodes
On the K8S worker node VMs, the bootstrap script will just prep the necessary TLS CA certs, retrieve the (latest?) join token from the Vault KV store and use it to run up the worker node services.
```shell
kubeadm join k8s-api.infra.golder.lan:6443 --token blahbl.t7cs1123458im63w --discovery-token-ca-cert-hash sha256:9de4b8d3e4504e997753f5a29cf4e50e8a288b9041df429927388ea279e30998
```
Network configuration
The main issue now will be that while workloads can be fired up, there will be no ‘return route’ from the network router to the cluster’s new host, pod and service IP addresses.
In order for the hosts and routers involved in routing traffic to/from these node VMs to know how to get there, we need to advertise routes. This is the job of BGP, so we need to ensure that we have a BGP daemon running on the hosts and routers involved.
Network router configuration
I installed the OPNsense BGP service and configured it.
TODO: Screenshots
Host routing configuration
For the VM hosts, I configured the `bird` BGP daemon.
TODO: Describe bird config
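In the meantime, a minimal BIRD 2 sketch of the idea; AS numbers and the export filter here are hypothetical and would need adapting:

```
# /etc/bird/bird.conf (sketch; AS numbers are hypothetical)
router id 10.20.0.200;

protocol device { }

# Learn routes already installed in the kernel (by the host/CNI),
# so they can be exported to BGP peers.
protocol kernel {
  learn;
  ipv4 { import all; export none; };
}

# Peer with the OPNsense router and advertise only the cluster ranges.
protocol bgp opnsense {
  local as 64513;
  neighbor 10.20.0.1 as 64512;
  ipv4 {
    import none;
    export where net ~ [ 10.77.0.0/16+, 10.88.0.0/16+, 10.89.0.0/16+ ];
  };
}
```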
Service initialisation
At this point, all the host, pod and service CIDR addresses should be generally routable internally, and have egress access to the Internet to fetch images and connect to external services as required.
Now, we want to deploy `flux` and configure it to pull its configuration from the cluster’s git repo. Flux will read its configuration and deploy and configure the rest of the supporting services the cluster will need to operate, such as ‘coredns’, ‘kube-router’, ‘fluent-bit’ etc.
TODO: Link to flux repo and describe more
Application deployment
Flux should have deployed most of the services that run ‘under the hood’. One of those services will be ArgoCD.
ArgoCD is used to manage the main (tenanted) application workloads, such as our Hugo- and Odoo-based websites, as it has a nicer GUI than Flux that makes it more accessible to a wider audience; thus the state of the applications is more visible and easier to manage by specific tenants or groups.
Persistent storage
TODO: Discuss how LVM is used as the base PV layer, and how higher-level tools such as Longhorn and Postgres Operator ensure availability of data in the event that one of three nodes is down, and that recovery from two or three nodes being down is not a problem.
Backups and recovery
TODO: Discuss how workloads are backed up, checked etc.
Monitoring and logging
TODO: Explain how workloads log metrics and logs
Continuous compliance
TODO: Explain how we use tools like `kube-bench` and `kubescape` to try to objectively measure our compliance against standards like PCI-DSS, CIS etc.