My Kubernetes Lab feat. Calico, Ceph, CoreOS, and Tunnels

Christmas is the time I get some time away from work to tinker on those projects I've put off for so long. One of them has been to build something cool with the hardware I have acquired over the years. Let's have a look at what the current state of my laboratory looks like!

The Hardware

A couple of friends and I rent rack space in a datacenter in Stockholm, Sweden. We have about 4U of rack space each, plus some common units for things like switches and power management. Right now my part consists of:
  • chequita
    • Name origin: The first picture of the server I was sent had a Chiquita banana for scale placed on it. The misspelling is intentional :-).
    • Supermicro X10SL7-F
    • 1x Intel Xeon E3-1240
    • 32 GiB RAM
    • 1x SSD, 3x 6TB HDD
  • fohn
    • Name origin: The fans would cause even the mighty Swiss Föhn winds to give up.
    • HP DL360 Gen8
    • 2x Intel Xeon E5-2630
    • 378 GiB RAM
    • 1x SSD, 4x 900 GB + 3x 300 GB HDD
  • wind
    • Name origin: Same thing as fohn, but I couldn't think of any good wind metaphors.
    • HP DL360 Gen7
    • 2x Intel Xeon E5620
    • 32 GiB RAM
    • 1x SSD, 3x 900 GB + 4x 300 GB HDD
The common part consists of:
  • kek-core
    • Edge-Core AS5712-54X
    • 48x 10G SFP+ + 6x 40G QSFP+
  • astrid
    • Arista 7050S-52
    • 52x 10G SFP+
  • tor
    • Cisco WS-C2960S-24TD-L
    • 24x 1 Gbps + 2x 10G SFP+
  • 2x pdu
    • Digipower SWH-1023J-08N1
    • Remotely manageable PDU
All servers have IPMI/iLO connected, can be power cycled remotely through the PDU, and have 1x Gbit and 1x 10Gbit connected (soon to be 2x 10Gbit when my twinax DACs arrive).

It looks something like this:

The Past Software

For the past year I've been messing around with VMware's excellent VMUG program. For $180 a year it gives you access to their vSphere suite for up to 6 CPUs - which was super-neat for me. I have been using it to play with VSAN and other things they have been cooking up.

Sadly, I am not impressed by the latest VMware vCenter releases. The slightest non-standard configuration is a sure way to end up with things failing, and the debuggability of VMware's systems is really lacking. I like to break things, so I try to run quite extreme configurations to see how the systems break, and how I can restore them. The whole idea of my lab is to learn, and doing everything by the book doesn't teach me as much as when I manage to break things and then have to fix them.

So what I used to run was VSAN on consumer grade disks, on an IPv6-only network. And things really broke. I would get the weirdest errors, and during the past year I had multiple total collapses of my storage network - something that my friends who run VSAN in production told me was extremely odd. I never lost any data, but I did have quite bad and unpredictable performance. Worst of all: I didn't feel like I actually owned it - it owned me.

On the upgrade to ESXi 6.5 the VSAN fell over once again and I said "Enough is enough". I restored it to minimum functionality like so many times before, ran the backup jobs one extra time, and wiped the systems.

The Current Software

I've always liked the idea of CoreOS: the OS itself is released as a bundle of kernel and userspace. Releases are frequent, and there are no real codenames. I thought it would be an excellent OS to run on my machines. The goal was to use the auto-updating features of CoreOS to have the nodes reboot as they wished but still have the cluster up. As I have a quorum-capable cluster of 3 nodes, it should be possible I thought - and it was.

Node: CoreOS

The current CoreOS setup is that fohn and chequita are running CoreOS beta, while wind is running the alpha train. They are all part of the etcd and fleet cluster, even though I don't use fleet currently. Locksmithd is used to synchronize the software reboots so that only a single machine reboots at a time, which maintains quorum. Overall this works very nicely, and I'm super-happy with the setup. The most painful part was, unsurprisingly, setting up the certificate infrastructure with minting machine certificates for all the services and such. As I'm a firm believer in not having firewalls on my lab systems (to force me to configure them properly) - having etcd running with TLS was mandatory.
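The quorum reasoning can be made concrete with a small sketch. It uses the usual majority rule for consensus stores such as etcd; the function names are mine and purely illustrative.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be down while a majority is still reachable."""
    return members - quorum_size(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} members: quorum={quorum_size(n)}, "
              f"can lose {tolerated_failures(n)}")
```

With three nodes the majority is two, so exactly one machine may be down at any time. That is why the reboots have to be serialized.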

CoreOS uses something called cloud-config to store the configuration of the whole node. While I have manually installed the machine certificates, the rest is populated on boot from a file called /var/lib/coreos-install/user_data - a file that I keep under source control, with a cron job fetching the latest version automatically. This means I can update the system configuration in the comfort of my own laptop, commit it, wait for the change to be downloaded, and reboot the machines one by one. This ensures I always have the latest configuration saved for when the OS disk (a USB memory stick) dies.

I mentioned I do not use fleet and there is a good reason why: I'm a Kubernetes user. fleet doesn't offer anything that I couldn't do with Kubernetes, and Kubernetes has more features - so I opted to use Kubernetes for as much as possible. This means that the configuration for the nodes themselves is mostly things that are required to bring the Kubernetes base component (called Kubelet) up, at which point it will proceed with setting up the rest of the cluster components.

Compute: Kubernetes

The base components of Kubernetes are the Kubelet and a handful of services that watch the cluster state for inconsistencies (detecting downed containers and such) and attempt to fix them.

These services are started by means of Kubernetes manifests. These are locally scheduled containers (or "pods" as they are called in the Kubernetes world) that I store in the same repository as the cloud-config for CoreOS. This means that I can update, for example, the version of my Kubernetes installation by committing a new manifest to my configuration repository and just waiting. The Kubelet will see the changed manifest once the cron job has fetched the latest modifications and will proceed to restart the pods in their new configuration. It's all very hands-off, no-touch. And again, I have full revision control for speedy rollbacks.

When the base services are up, Kubernetes will read in the desired state from the etcd cluster. As etcd is a clustered replicated key-value store, no single machine failure should be able to take it down. This means I don't need to be as meticulous in keeping track of what is stored in there, but I do keep a repository for all the configuration files/manifests for things that are not just me messing about.

Networking: Calico and Tunnels

Kubernetes made a clever choice in not telling operators how they should build their networks. They realized that no two production networks look the same, so it would be a recipe for failure to dictate the network. Instead there is a simple interface called Container Network Interface (CNI) that allows anyone to plug in a network that fits their requirements.

The standard network plugins are flannel and weavenet. They are simple and they work; weavenet is literally just one command to get started with. However, I dislike VPNs and non-physical Layer-2, which is what both of these solutions are based on. Having extended broadcast domains has bitten me before. Also, I would like a bit less magic and a bit more transparency. I want pure, classic L3 IP connectivity. With the network gear out there today, there is really no reason not to have it.

Calico is a project that does a bunch of things, but here we're interested in their CNI plugin for Kubernetes. It works by setting up iBGP between all nodes and exchanging routing information on how to reach pods. It's dynamic and easy to set up, but also transparent. Standard tools like traceroute, tcpdump, and iptables work just fine - which is really neat.

When set up it looks something like this:

This is how fohn looks out on the rest of the cluster.

fohn ~ # ./calicoctl node status
Calico process is running.

IPv4 BGP status
| PEER ADDRESS  |     PEER TYPE     | STATE |   SINCE    |    INFO     |
| | node-to-node mesh | up    | 2016-12-24 | Established |
| | node-to-node mesh | up    | 2016-12-21 | Established |

IPv6 BGP status
No IPv6 peers found.
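
The two peers listed above are what a node-to-node mesh on three nodes should produce: every node peers with every other node. A rough sketch of that arithmetic, using the node names from this post (the function names are hypothetical):

```python
from itertools import combinations


def mesh_sessions(nodes):
    """One BGP session per unordered pair of nodes in a full mesh."""
    return list(combinations(sorted(nodes), 2))


def peers_of(node, nodes):
    """The peers a single node sees in a full mesh: everyone but itself."""
    return sorted(n for n in nodes if n != node)


if __name__ == "__main__":
    nodes = ["chequita", "fohn", "wind"]
    print("fohn peers with:", peers_of("fohn", nodes))
    print("sessions in total:", len(mesh_sessions(nodes)))
```

The number of sessions grows as n(n-1)/2, which is no concern for three nodes but is the reason larger setups tend to move away from a full mesh.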

Another issue that is quite complex in Kubernetes is external access. How do you get a user from the world-wide web into the pod that is able to serve their request? There are a couple of ways to achieve this - but they are all pretty lousy when not running on GCE or AWS. They require either a load-balancer (which proxies connections, removing the possibility of seeing the client's source IP, for example), a high-numbered port on all nodes (called NodePort), or trickery with the Cluster IP (which is meant to be used only inside the cluster).

What I did to solve this was to write something inspired by how I'm used to routing ingress traffic: using tunnels. By using a GRE (or even MPLS) tunnel router it is possible to do arbitrarily complex traffic engineering to select the target endpoint for a given external client. It works by setting up GRE interfaces on an ingress router and egress GRE interfaces inside the pods that serve the traffic, all with the same public IP.

It looks like this for the public IP

fohn ~ # ip ro get mark 2 dev ts0abc1d4d  src  mark 2

The "mark 2" here refers to the tuple hash the connection has been given. This is calculated by iptables normally, but given explicitly here to show how it works.
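
To illustrate the idea of a tuple hash, here is a minimal sketch of mapping a connection to a mark. It is not how iptables or kube-service-tunnel compute it: the CRC32 hash, the choice of fields, the 1-based marks, and the example addresses (taken from the documentation ranges) are all assumptions made for illustration.

```python
import zlib


def connection_mark(src_ip, src_port, dst_ip, dst_port, proto, buckets):
    """Deterministically map a connection tuple to a mark in 1..buckets."""
    if buckets < 1:
        raise ValueError("need at least one bucket")
    key = f"{proto}|{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    # CRC32 is an arbitrary stand-in for whatever hash the kernel applies.
    return zlib.crc32(key) % buckets + 1


if __name__ == "__main__":
    # Hypothetical client and service addresses (RFC 5737 documentation ranges).
    mark = connection_mark("198.51.100.7", 40000, "203.0.113.10", 80,
                           "tcp", buckets=3)
    print("mark:", mark)
```

Because the result depends only on the tuple, every packet of a connection gets the same mark, so the route lookup keeps selecting the same tunnel interface for it.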

The interface ts0abc1d4d is a GRE interface with a pod associated with it:

120: ts0abc1d4d@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1476 qdisc noqueue state UNKNOWN group default qlen 1
    link/gre peer

The pod itself has the other end of this tunnel. In this case it is for a service named "flood".

6: flood@NONE: <NOARP,UP,LOWER_UP> mtu 1476 qdisc noqueue state UNKNOWN qlen 1
    link/gre brd
    inet scope global flood
       valid_lft forever preferred_lft forever
    inet6 fe80::5efe:abc:1d4d/64 scope link 
       valid_lft forever preferred_lft forever
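
The MTU of 1476 on both tunnel ends is what GRE over IPv4 leaves on a 1500-byte link: a 20-byte outer IPv4 header plus a 4-byte basic GRE header. A small sketch of that calculation; the 1500-byte underlying MTU is an assumption (it is not shown in the output), and optional GRE fields such as a key would take additional bytes.

```python
IPV4_HEADER = 20  # bytes, outer IPv4 header without options
GRE_HEADER = 4    # bytes, basic GRE header without key/checksum/sequence


def gre_tunnel_mtu(link_mtu: int = 1500, gre_options: int = 0) -> int:
    """MTU available to packets carried inside a GRE-over-IPv4 tunnel."""
    return link_mtu - IPV4_HEADER - GRE_HEADER - gre_options


if __name__ == "__main__":
    print(gre_tunnel_mtu())  # 1476, matching the interfaces above
```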

Does it work?

bluecmd@fohn ~ $ ping
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=0.308 ms
64 bytes from icmp_seq=2 ttl=64 time=0.282 ms


The only reason why you couldn't fire up your terminal and ping these IPs right now is that they are not routed on the public Internet yet. I need to poke my ISP to change the routing to have the IPs routed via my nodes, which in turn only makes sense if I use BGP with something like svc-bgp. So we're not fully there yet! Or I might just say "Screw it" and route all those IPs to fohn and fix the dynamic routing later.

If you want to know more about this tunnel setup, check out the project on GitHub: kube-service-tunnel.

Storage: Ceph

I could write a blog post about Ceph itself - it's a complex system, as any good storage solution is. The number of knobs to turn is insane, which is a good thing. Compared to VSAN, I at least have the feeling I'm in control. I managed to screw up my Ceph cluster a few times when setting it up, but nothing more serious than it becoming a bit slow until I could correct my mistake.

The cluster uptime is only a few days, so it's too early to talk about stability - but I think I'm going to have a good time with Ceph. The design of the system as a whole feels well thought-out, and there are good tools and a vibrant community around it.

This is the overview status on my Ceph cluster as I'm writing this.

[bluecmd-laptop]$ kubectl exec -ti --namespace=ceph ceph-mon-1 -- ceph -s
    cluster 1b3f7e94-fae6-4bce-ba6f-0b7c08759200
     health HEALTH_OK
     monmap e7: 3 mons at {ceph-mon-0=,ceph-mon-1=,ceph-mon-2=}
            election epoch 654, quorum 0,1,2 ceph-mon-0,ceph-mon-2,ceph-mon-1
      fsmap e95: 1/1/1 up {0=mds-ceph-mds-0=up:active}, 1 up:standby
     osdmap e377: 16 osds: 16 up, 16 in
            flags sortbitwise
      pgmap v392484: 1088 pgs, 3 pools, 644 GB data, 214 kobjects
            1312 GB used, 22415 GB / 23752 GB avail
                1088 active+clean
  client io 157 kB/s wr, 0 op/s rd, 52 op/s wr
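
A few ratios can be read off the status output. This is only back-of-the-envelope arithmetic on the numbers printed above (the function name is made up); the ratio of raw usage to stored data includes replication and other overhead, and on its own it does not reveal the pools' replication settings.

```python
def ceph_ratios(data_gb, used_gb, total_gb, pgs, osds):
    """Simple ratios derived from the figures shown by `ceph -s`."""
    return {
        # Raw bytes consumed per byte of stored data (replication + overhead).
        "raw_per_data": used_gb / data_gb,
        # Fraction of the raw capacity that is in use.
        "used_fraction": used_gb / total_gb,
        # Placement groups per OSD, counting each PG once (not per replica).
        "pgs_per_osd": pgs / osds,
    }


if __name__ == "__main__":
    ratios = ceph_ratios(data_gb=644, used_gb=1312, total_gb=23752,
                         pgs=1088, osds=16)
    for name, value in ratios.items():
        print(f"{name}: {value:.3f}")
```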

I'm using CephFS for my main storage as I have many pods that read/write the same storage space. For things like Grafana that need their own persistent disk I use Ceph RBD via Kubernetes' Persistent Volume Claims.

[bluecmd-laptop]$ kubectl get pvc --all-namespaces 
NAMESPACE   NAME              STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
media       flood-db          Bound     pvc-ec87bfbf-c91b-11e6-8d1f-ac162d706004   1Gi        RWO           3d
mon         grafana-data      Bound     pvc-b8e5ed1a-c91c-11e6-8d1f-ac162d706004   1Gi        RWO           3d
mon         prometheus-data   Bound     pvc-19fa34f2-c91d-11e6-8d1f-ac162d706004   200Gi      RWO           3d

Monitoring: Prometheus

As an early adopter of Prometheus, it should come as no surprise to anyone that has ever talked to me about monitoring that I'm using Prometheus to monitor my whole infrastructure. The monitoring setup is quite boring, as it should be for things as critical as monitoring. It's simply a Prometheus server running with a Ceph RBD disk, Grafana for visualization, and I'm going to set up an alert manager with OpsGenie integration for alerting.

It looks like this:


I'm very happy with how it all came together and that I threw out VMware to get deeper into the details with Kubernetes. Hopefully I'm able to give back to the community with posts like these, the code I write, and the bugs I file. The community around Kubernetes is so helpful, friendly, and incredibly fast - I've never seen anything like it.

Appendix: Configuration Repository

I have mentioned this repository a few times. I figured I might just as well provide a dump of what the structure looks like.

# These are the cloud-config system configuration files I talked about earlier

# These are my own Docker images that I use for my cluster services.

# These are my Kubernetes manifests/deployments/configs that are used to set up the cluster services I run. These are high-level and don't care about individual nodes, but operate on the cluster as a whole.
# I use Ceph as a distributed filesystem for persistent storage. All nodes have their disks exported via Ceph's OSDs, all running on Kubernetes. This means that CoreOS doesn't have to care about Ceph at all.
[ .. ]
# These are the locally scheduled Kubernetes components needed to bring up the cluster.
# The idea is to run Kubernetes in HA mode (same components on all 3 machines) but I haven't gotten so far yet.
# This is my SSL certificate infrastructure, based on cfssl.
# I commit the certificates (but not the keys) for easy access. The keys are copied to their place when created.
# If a key is lost I just create a new one, no big deal.
# This is a list of my old apiserver certificates. Without this list I would have to restart all pods that are using service accounts in Kubernetes when changing the apiserver certificate as it is the default certificate that is used to validate service account tokens.

