Christmas is when I get some time away from work to tinker on the projects I've put off for so long. One of them has been to build something cool with the hardware I have acquired over the years. Let's have a look at the current state of my laboratory!
A couple of friends and I rent rack space in a datacenter in Stockholm, Sweden. We have about 4U of rack space each, with some common units for things like switches and power management. Right now my part consists of:
- chequita
- Name origin: The first picture of the server I was sent had a Chiquita banana for scale placed on it. The misspelling is intentional :-).
- Supermicro X10SL7-F
- 1x Intel Xeon E3-1240
- 32 GiB RAM
- 1x SSD, 3x 6TB HDD
- fohn
- Name origin: The fans would cause even the mighty Swiss Föhn winds to give up.
- HP DL360 Gen8
- 2x Intel Xeon E5-2630
- 378 GiB RAM
- 1x SSD, 4x 900 GB + 3x 300 GB HDD
- wind
- Name origin: Same thing as fohn, but I couldn't think of any good wind metaphors.
- HP DL360 Gen7
- 2x Intel Xeon E5620
- 32 GiB RAM
- 1x SSD, 3x 900 GB + 4x 300 GB HDD
The common part consists of:
- kek-core
- Edge-Core AS5712-54X
- 48x 10G SFP+ + 6x 40G QSFP+
- astrid
- Arista 7050S-52
- 52x 10G SFP+
- tor
- Cisco WS-C2960S-24TD-L
- 24x 1 Gbps + 2x 10G SFP+
- 2x pdu
- Digipower SWH-1023J-08N1
- Remotely manageable PDU
All servers have IPMI/iLO connected, can be power cycled remotely through the PDUs, and have 1x 1 Gbit and 1x 10 Gbit connected (soon to be 2x 10 Gbit when my twinax DACs arrive).
It looks something like this:
The Past Software
For the past year I've been messing around with VMware's excellent VMUG program. For $180 a year it gives you access to their vSphere suite for up to 6 CPUs - which was super neat for me. I have been using it to play with VSAN and other things they have been cooking up.
Sadly, I am not impressed by the latest VMware vCenter releases. The slightest non-standard configuration would be a sure way to end up with things failing, and the debuggability of VMware's systems is really lacking. I like to break things, so I try to run quite extreme configurations to see how the systems break, and how I can restore them. The whole idea of my lab is to learn, and doing everything by the book doesn't teach me as much as when I manage to break things and then have to fix them.
So what I used to run was VSAN on consumer-grade disks, on an IPv6-only network. And things really broke. I would get the weirdest errors, and during the past year I had multiple total collapses of my storage network - something my friends who run VSAN in production tell me is extremely odd. I never lost any data, but I did have quite bad and unpredictable performance. Worst of all: I didn't feel like I actually owned it - it owned me.
On the upgrade to ESXi 6.5 the VSAN fell over once again and I said "Enough is enough". I restored it to minimum functionality like so many times before, ran the backup jobs one extra time, and wiped the systems.
The Current Software
I've always liked the idea of CoreOS: the OS itself is released as a bundle of kernel and userspace. Releases are frequent, and there are no real codenames. I thought it would be an excellent OS to run on my machines. The goal was to use the auto-updating features of CoreOS to let the nodes reboot as they wished while still keeping the cluster up. With a quorum-capable cluster of 3 nodes, I figured it should be possible - and it was.
Node: CoreOS
The current CoreOS setup is that fohn and chequita are running CoreOS beta, while wind is running the alpha train. They are all part of the etcd and fleet cluster, even though I don't use fleet currently. Locksmithd is used to synchronize the automatic reboots so that only a single machine reboots at a time, maintaining quorum. Overall this works very nicely, and I'm super happy with the setup. The most painful part was, unsurprisingly, setting up the certificate infrastructure and minting machine certificates for all the services. As I'm a firm believer in not having firewalls on my lab systems (to force me to configure them properly), running etcd with TLS was mandatory.
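For reference, minting a machine certificate with cfssl looks roughly like this (a sketch - the "server" profile name is just whatever is defined in ca-config.json, and the CA key is of course the decrypted one):

# Create the CA once (ca-csr.json describes the CA subject).
cfssl gencert -initca ca-csr.json | cfssljson -bare ca

# Mint a machine certificate, e.g. for wind, signed by that CA.
cfssl gencert \
  -ca=ca.pem -ca-key=ca-key.pem \
  -config=ca-config.json -profile=server \
  wind.json | cfssljson -bare wind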
CoreOS uses something called cloud-config to store the configuration of the whole node. While I have installed the machine certificates manually, the rest is populated on boot from a file called /var/lib/coreos-install/user_data - a file that I keep under source control and that a cron job fetches automatically. This means I can update the system configuration from the comfort of my own laptop, commit it, wait for the change to be downloaded, and reboot the machines one by one. It also ensures I always have the latest configuration saved for when the OS disk (a USB memory stick) dies.
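To give an idea of what goes into user_data, here is a heavily trimmed sketch (hostnames, paths, and the exact set of keys are illustrative - this is not a copy of my real file):

#cloud-config
coreos:
  etcd2:
    # TLS everywhere, since nothing is hiding behind a firewall.
    advertise-client-urls: https://fohn.example.net:2379
    initial-cluster: chequita=https://chequita.example.net:2380,fohn=https://fohn.example.net:2380,wind=https://wind.example.net:2380
    cert-file: /etc/ssl/etcd/fohn.pem
    key-file: /etc/ssl/etcd/fohn-key.pem
    trusted-ca-file: /etc/ssl/etcd/ca.pem
    peer-cert-file: /etc/ssl/etcd/fohn.pem
    peer-key-file: /etc/ssl/etcd/fohn-key.pem
    peer-trusted-ca-file: /etc/ssl/etcd/ca.pem
  update:
    # Locksmith takes a lock in etcd so only one node reboots at a time.
    reboot-strategy: etcd-lock
  units:
    - name: kubelet.service
      enable: true
      # The unit body (omitted here) starts the Kubelet and points it at
      # the node's manifest directory from the configuration repository.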
I mentioned I do not use fleet, and there is a good reason why: I'm a Kubernetes user. fleet doesn't offer anything I couldn't do with Kubernetes, and Kubernetes has more features - so I opted to use Kubernetes for as much as possible. This means the configuration for the nodes themselves is mostly what is required to bring the Kubernetes base component (called the Kubelet) up, at which point it proceeds with setting up the rest of the cluster components.
Compute: Kubernetes
The base components of Kubernetes are the Kubelet and a handful of services that watch the cluster state for inconsistencies (detecting downed containers and such) and attempt to fix them.
These services are started by means of Kubernetes manifests. These are locally scheduled containers (or "pods", as they are called in the Kubernetes world) that I store in the same repository as the cloud-config for CoreOS. This means that I can update, for example, the version of my Kubernetes installation by committing a new manifest to my configuration repository and just waiting. The Kubelet will see the changed manifest once the cron job has fetched the latest modifications and will proceed to restart the pods in their new configuration. It's all very hands-off, no-touch. And again, I have full revision control for speedy rollbacks.
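A node-local manifest is just a plain pod definition dropped into the Kubelet's manifest directory. A trimmed sketch of what one can look like (the image tag and flags here are illustrative, not lifted from my repository):

apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
    - name: kube-scheduler
      # Hypothetical image/version - use whatever hyperkube release you run.
      image: quay.io/coreos/hyperkube:v1.5.1_coreos.0
      command:
        - /hyperkube
        - scheduler
        - --master=https://127.0.0.1:443
        - --leader-elect=true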
When the base services are up, Kubernetes reads the desired state from the etcd cluster. As etcd is a clustered, replicated key-value store, no single machine failure should be able to take it down. This means I don't need to be as meticulous about keeping track of what is stored in there, but I do keep a repository with the configuration files/manifests for everything that is not just me messing about.
Networking: Calico and Tunnels
Kubernetes made a clever choice in not telling operators how to build their networks. They realized that no two production networks look the same, so dictating the network would be a recipe for failure. Instead there is a simple interface called the Container Network Interface (CNI) that allows anyone to plug in a network that fits their requirements.
The standard network plugins are flannel and weavenet. They are simple and they work; weavenet is literally just one command to get started with. However, I dislike VPNs and non-physical Layer 2, which is what both of these solutions are based on. Extended broadcast domains have bitten me before. Also, I would like a bit less magic and a bit more transparency. I want pure, classic L3 IP connectivity. With the network gear out there today, there is really no reason not to.
Calico is a project that does a bunch of things, but here we're interested in its CNI plugin for Kubernetes. It works by setting up iBGP sessions between all nodes and exchanging routing information on how to reach the pods. It's dynamic and easy to set up, but also transparent. Standard tools like traceroute, tcpdump, and iptables work just fine - which is really neat.
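Because everything is plain L3 routing, you can poke at it with ordinary tools. A few commands I find myself reaching for (the pod address is just the one from the tunnel example further down):

# Routes to pods on other nodes are installed by BIRD (Calico's BGP daemon)
ip route show proto bird

# Routes to local pods point at their cali* veth interfaces
ip route show | grep cali

# Watch traffic to a specific pod IP on whatever interface it takes
tcpdump -ni any host 10.188.29.77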
When set up, it looks something like this - this is how fohn sees the rest of the cluster:
fohn ~ # ./calicoctl node status
Calico process is running.
IPv4 BGP status
+---------------+-------------------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+---------------+-------------------+-------+------------+-------------+
| 31.31.164.221 | node-to-node mesh | up | 2016-12-24 | Established |
| 31.31.164.222 | node-to-node mesh | up | 2016-12-21 | Established |
+---------------+-------------------+-------+------------+-------------+
IPv6 BGP status
No IPv6 peers found.
Another issue that is quite complex in Kubernetes is external access. How do you get a user from the world-wide web into the pod that is able to serve their request? There are a couple of ways to achieve this - but they are all pretty lousy when not running on GCE or AWS. They require either a load balancer (doing connection proxying, which removes the possibility of seeing the client's source IP, for example), a high-numbered port on all nodes (called a NodePort), or trickery with the Cluster IP (which is meant to be used only internally in the cluster).

What I did to solve this was write something inspired by how I'm used to routing ingress traffic: tunnels. With a GRE (or even MPLS) tunnel router it is possible to do arbitrarily complex traffic engineering to select the target endpoint for a given external client. It works by setting up GRE interfaces on an ingress router and egress GRE interfaces inside the pods that serve the traffic, all with the same public IP.
It looks like this for the public IP 31.31.164.211:
fohn ~ # ip ro get 31.31.164.211 mark 2
31.31.164.211 dev ts0abc1d4d src 31.31.164.220 mark 2
cache
The "mark 2" here refers to the tuple hash the connection has been given. This is calculated by iptables normally, but given explicitly here to show how it works.
The interface ts0abc1d4d is a GRE interface with a pod associated with it:
120: ts0abc1d4d@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1476 qdisc noqueue state UNKNOWN group default qlen 1
link/gre 0.0.0.0 peer 10.188.29.77
The pod itself has the other end of this tunnel. In this case it is for a service named "flood".
6: flood@NONE: <NOARP,UP,LOWER_UP> mtu 1476 qdisc noqueue state UNKNOWN qlen 1
link/gre 10.188.29.77 brd 0.0.0.0
inet 31.31.164.211/32 scope global flood
valid_lft forever preferred_lft forever
inet6 fe80::5efe:abc:1d4d/64 scope link
valid_lft forever preferred_lft forever
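Creating such a pair of interfaces by hand looks roughly like this - a sketch that assumes the pod's network namespace is reachable as $PODNS (kube-service-tunnel automates all of this):

# On the ingress router: a GRE tunnel towards the pod's Calico IP.
ip tunnel add ts0abc1d4d mode gre remote 10.188.29.77 ttl 64
ip link set ts0abc1d4d up

# Inside the pod's network namespace: the other end of the tunnel,
# carrying the public service IP.
ip netns exec $PODNS ip tunnel add flood mode gre local 10.188.29.77 ttl 64
ip netns exec $PODNS ip addr add 31.31.164.211/32 dev flood
ip netns exec $PODNS ip link set flood up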
Does it work?
bluecmd@fohn ~ $ ping 31.31.164.221
PING 31.31.164.221 (31.31.164.221) 56(84) bytes of data.
64 bytes from 31.31.164.221: icmp_seq=1 ttl=64 time=0.308 ms
64 bytes from 31.31.164.221: icmp_seq=2 ttl=64 time=0.282 ms
Yep!
The only reason you can't fire up your terminal and ping these IPs right now is that they are not routed on the public Internet yet. I need to poke my ISP to change the routing to have the IPs routed via my nodes, which in turn only makes sense if I use BGP with something like svc-bgp. So we're not fully there yet! Or I might just say "Screw it", route all those IPs to fohn, and fix the dynamic routing later.
If you want to know more about this tunnel setup, check out the project on GitHub: kube-service-tunnel.
Storage: Ceph
I could write a whole blog post about Ceph itself - it's a complex system, as any good storage solution is. The number of knobs to turn is insane, which is a good thing. Compared to VSAN, I at least have the feeling I'm in control. I managed to screw up my Ceph cluster a few times while setting it up, but nothing more serious than it becoming a bit slow until I could correct my mistake.
The cluster uptime is only a few days, so it's too early to talk about stability - but I think I'm going to have a good time with Ceph. The design of the system as a whole feels well thought out, and there are good tools and a vibrant community around it.
This is the overview status on my Ceph cluster as I'm writing this.
[bluecmd-laptop]$ kubectl exec -ti --namespace=ceph ceph-mon-1 -- ceph -s
cluster 1b3f7e94-fae6-4bce-ba6f-0b7c08759200
health HEALTH_OK
monmap e7: 3 mons at {ceph-mon-0=10.188.21.39:6789/0,ceph-mon-1=10.188.29.80:6789/0,ceph-mon-2=10.188.22.163:6789/0}
election epoch 654, quorum 0,1,2 ceph-mon-0,ceph-mon-2,ceph-mon-1
fsmap e95: 1/1/1 up {0=mds-ceph-mds-0=up:active}, 1 up:standby
osdmap e377: 16 osds: 16 up, 16 in
flags sortbitwise
pgmap v392484: 1088 pgs, 3 pools, 644 GB data, 214 kobjects
1312 GB used, 22415 GB / 23752 GB avail
1088 active+clean
client io 157 kB/s wr, 0 op/s rd, 52 op/s wr
I'm using CephFS for my main storage, as I have many pods that read/write the same storage space. For things like Grafana that need their own persistence, I use Ceph RBD via Kubernetes' Persistent Volume Claims.
[bluecmd-laptop]$ kubectl get pvc --all-namespaces
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESSMODES AGE
media flood-db Bound pvc-ec87bfbf-c91b-11e6-8d1f-ac162d706004 1Gi RWO 3d
mon grafana-data Bound pvc-b8e5ed1a-c91c-11e6-8d1f-ac162d706004 1Gi RWO 3d
mon prometheus-data Bound pvc-19fa34f2-c91d-11e6-8d1f-ac162d706004 200Gi RWO 3d
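A claim like grafana-data above is just a small manifest handed to Kubernetes - roughly like this (the storage class name "rbd" is my assumption of what storage-class.yaml defines):

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: grafana-data
  namespace: mon
  annotations:
    # At this point in time the storage class is still selected via the beta annotation.
    volume.beta.kubernetes.io/storage-class: rbd
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi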
Monitoring: Prometheus
As an early adopter of Prometheus, it should come as no surprise to anyone who has ever talked to me about monitoring that I'm using Prometheus to monitor my whole infrastructure. The monitoring setup is quite boring, as it should be for something as critical as monitoring. It's simply a Prometheus server running with a Ceph RBD disk, Grafana for visualization, and I'm going to set up the Alertmanager with OpsGenie integration for alerting.
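The one part worth showing is Kubernetes service discovery; a minimal sketch of what a node scrape job can look like (not my actual prometheus.yml, and field names vary a bit between Prometheus versions, so treat this as illustrative):

scrape_configs:
  - job_name: kubernetes-nodes
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      # Discover every node registered in the Kubernetes API and scrape its Kubelet.
      - role: node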
It looks like this:
Summary
I'm very happy with how it all came together and that I threw out VMware to get deeper into the details with Kubernetes. Hopefully I'm able to give back to the community with posts like these, the code I write, and the bugs I file. The community around Kubernetes is helpful, friendly, and incredibly fast - I've never seen anything like it.
Appendix: Configuration Repository
I have mentioned this repository a few times, so I figured I might as well provide a dump of what the structure looks like.
# These are the cloud-config system configuration files I talked about earlier
./chequita
./chequita/user_data
./fohn
./fohn/user_data
./wind
./wind/user_data
# These are my own Docker images that I use for my cluster services.
./docker
./docker/flexget
./docker/flexget/Dockerfile
./docker/flood
./docker/flood/config.js
./docker/flood/Dockerfile
./docker/rtorrent
./docker/rtorrent/attach.sh
./docker/rtorrent/Dockerfile
./docker/rtorrent/rtorrent-start.sh
./docker/rtorrent/start.sh
./docker/rtorrent/vpn.route
./kubernetes
./kubernetes/configs
# These are my Kubernetes manifests/deployments/configs used to set up the cluster services I run. These are high-level and don't care about individual nodes; they operate on the cluster as a whole.
./kubernetes/configs/ceph
# I use Ceph as a distributed filesystem for persistent storage. All nodes have their disks exported via Ceph's OSDs, all running on Kubernetes. This means that CoreOS doesn't have to care about Ceph at all.
./kubernetes/configs/ceph/ceph.conf
./kubernetes/configs/ceph/cephfs-test.yaml
./kubernetes/configs/ceph/create-pvc-secret.sh
./kubernetes/configs/ceph/exporter.yaml
./kubernetes/configs/ceph/mds-svc.yaml
./kubernetes/configs/ceph/mds.yaml
./kubernetes/configs/ceph/mon-svc.yaml
./kubernetes/configs/ceph/mon.yaml
./kubernetes/configs/ceph/osd.yaml
./kubernetes/configs/ceph/rbd-test.yaml
./kubernetes/configs/ceph/storage-class.yaml
./kubernetes/configs/kube-system
./kubernetes/configs/kube-system/calico.yaml
./kubernetes/configs/kube-system/dns.yaml
./kubernetes/configs/kube-system/kubernetes-dashboard.yaml
[ .. ]
./kubernetes/configs/mon
./kubernetes/configs/mon/grafana-pvc.yaml
./kubernetes/configs/mon/grafana-svc.yaml
./kubernetes/configs/mon/grafana.yaml
./kubernetes/configs/mon/kube-state-metrics.yaml
./kubernetes/configs/mon/prometheus
./kubernetes/configs/mon/prometheus/k8s.rules
./kubernetes/configs/mon/prometheus/prometheus.yml
./kubernetes/configs/mon/prometheus-pvc.yaml
./kubernetes/configs/mon/prometheus-svc.yaml
./kubernetes/configs/mon/prometheus.yaml
# These are the locally scheduled Kubernetes components needed to bring up the cluster.
# The idea is to run Kubernetes in HA mode (same components on all 3 machines) but I haven't gotten that far yet.
./kubernetes/manifests
./kubernetes/manifests/chequita
./kubernetes/manifests/chequita/kube-proxy.yaml
./kubernetes/manifests/fohn
./kubernetes/manifests/fohn/kube-apiserver.yaml
./kubernetes/manifests/fohn/kube-controller-manager.yaml
./kubernetes/manifests/fohn/kube-proxy.yaml
./kubernetes/manifests/fohn/kube-scheduler.yaml
./kubernetes/manifests/wind
./kubernetes/manifests/wind/kube-proxy.yaml
# This is my SSL certificate infrastructure, based on cfssl.
# I commit the certificates (but not the keys) for easy access. The keys are copied to their place when created.
# If a key is lost I just create a new one, no big deal.
./ssl
./ssl/apiserver.json
./ssl/apiserver.pem
./ssl/bluecmd.json
./ssl/bluecmd.pem
./ssl/ca-config.json
./ssl/ca.csr
./ssl/ca-csr.json
./ssl/ca-key.pem.encrypted
./ssl/ca.pem
./ssl/kube-admin.json
./ssl/kube-admin.pem
./ssl/wind.json
./ssl/wind.pem
# This is a list of my old apiserver certificates. Without it I would have to restart all pods that use service accounts in Kubernetes whenever I change the apiserver certificate, as that is the default certificate used to validate service account tokens.
./ssl/service-account.trust.pem