The curious case of the missing return packet

In this post I want to dig deeper into a phenomenon that I spent a few days debugging with friends. It is the story of a behavior that seems illogical at first glance, broken even, but turns out to be reasonable in the end. Let's begin!

Things you will need: The scenario requires a minimum of one host and one switch*. In our example we will use two Linux hosts, an Arista DCS-7050S switch, and an Arista DCS-7050QX switch. The second switch and host are not strictly needed, but they make the scenario more realistic and are how we found this issue.

The story begins with you and a couple of friends building an internet exchange (IX) for education and for fun. The simplest IX is just a switch - yes, even an unmanaged one you can pick up for $10 would work in theory. The switch has one purpose: to connect different actors together so they can route traffic between each other. See diagram 1.

Diagram 1: Your typical IX network. It's just a switch.

You have your IX switch and you have your friends that are ready to communicate over the switch - but you need to get all of them to the IX switch somehow. Normally this is done via physical cabling, everybody gets their own cable, but let's assume you want to avoid changing your existing cables for now and just want to test communicating over the IX switch. You proceed to build the network shown in diagram 2.

Diagram 2: The network we will focus on

This setup is a lot to think about, so let's discuss what we are trying to accomplish here. The goal is to simulate an internet exchange where the three routers R1, R2, and R3 are all on the same Layer-2 segment - and are isolated from each other except in the IX switch. Due to real-world constraints, R2 and R3 are currently only reachable through R1. "No big deal", you might think, and you would be forgiven for thinking that, "we can just pass VLAN 20 and 30 right through R1 and out towards the IX switch". Indeed, that's what we did. Let's look at the configuration for R1.

interface Ethernet1/1
  description To IX
  switchport access vlan 10

interface Ethernet1/2
  description To IX
  switchport access vlan 20

interface Ethernet1/3
  description To IX
  switchport access vlan 30

interface Ethernet2
  description To R2
  switchport access vlan 20

interface Ethernet3
  description To R3
  switchport access vlan 30

interface Vlan10
  ip address 10.0.0.10/24

See any issues with this? We sure didn't. Does it work? Mostly.

Connectivity between R2, R3, and the IX address (.1) works perfectly
The IX switch can reach all routers, R1, R2, R3
R1 can only reach the IX switch, not R2 or R3
R2 and R3 cannot reach R1

In summary, if we did not have the Vlan10 interface on R1 we would have full mesh connectivity and everything would be great.

Time to break out tcpdump. I wish I had some tcpdump sessions to show you but I don't, so I will have to summarize what we observed.

Pinging from 10.0.0.10 to 10.0.0.20

[-> Ethernet1/1] ARP: Who has 10.0.0.20? Tell 10.0.0.10 [Broadcast]
[Ethernet1/2<-] ARP: Who has 10.0.0.20? Tell 10.0.0.10 [Broadcast]
[-> Ethernet2] ARP: Who has 10.0.0.20? Tell 10.0.0.10 [Broadcast]
[Ethernet2<-] ARP: 10.0.0.20 is at x:y:z [Unicast to a:b:c]
... nothing more

In other words: as long as we are broadcasting, the packet is delivered fine.

Pinging from 10.0.0.20 to 10.0.0.30

This works fine. The full flow is too long to write up, but the ARP looks like this:

[Ethernet2<-] ARP: Who has 10.0.0.30? Tell 10.0.0.20 [Broadcast]
[-> Ethernet1/2] ARP: Who has 10.0.0.30? Tell 10.0.0.20 [Broadcast]
[Ethernet1/3<-] ARP: Who has 10.0.0.30? Tell 10.0.0.20 [Broadcast]
[-> Ethernet3] ARP: Who has 10.0.0.30? Tell 10.0.0.20 [Broadcast]
[Ethernet3<-] ARP: 10.0.0.30 is at x:y:z [Unicast to e:f:g]
[-> Ethernet1/3] ARP: 10.0.0.30 is at x:y:z [Unicast to e:f:g]
[Ethernet1/2<-] ARP: 10.0.0.30 is at x:y:z [Unicast to e:f:g]
[-> Ethernet2] ARP: 10.0.0.30 is at x:y:z [Unicast to e:f:g]

The notable differences between the two are:

1. Different interfaces
2. Different VLAN numbers
3. Different IP numbers
4. Different MAC addresses

We quickly ruled out 1-3 by shuffling things around and seeing that the problem stayed physically in the same place (R1). Then we started debugging the Layer-2 forwarding logic in the switch ASIC to rule out 4. Our question was "Why would a given unicast MAC simply disappear and not be forwarded like any other MAC?". Let's look at what our dear Vlan10 interface looks like in the Arista switch.

84: vlan10@fabric: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
    link/ether 00:1c:73:3f:9b:8b brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.10/24 brd 255.255.255.255 scope global vlan10
    inet6 fe80::21c:73ff:fe3f:9b8b/64 scope link 
       valid_lft forever preferred_lft forever

Notice anything strange? Most likely not from this interface alone. The interesting thing shows up if you look at the MAC address on all interfaces: they are all the same, 00:1c:73:3f:9b:8b. At this point you might think exactly what we were thinking: "OK, so what - when the ARP reply comes in on VLAN 20 or VLAN 30, it should just be forwarded like any other unicast frame - there are no interfaces on that VLAN". And here is where we get to the illogical behavior - it is not. In fact, even if we enter the Broadcom debug shell and look at the L2 tables, the switch will never learn or flood its own MAC address. If the switch ASIC (Broadcom Trident family in our case) sees a unicast frame destined to its own MAC address, it will drop it or push it to the CPU - anything but forward it (diagram 3).
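If you want to check the shared MAC address on your own switch, a small sketch like the one below groups interfaces by MAC from the text output of `ip link show`. The function name and the sample are my own for illustration: the vlan10 entry is taken from the output above, while the `et2` entry (its index, name, and MTU) is a made-up placeholder and not something we captured.

```python
import re
from collections import defaultdict


def group_interfaces_by_mac(ip_link_text):
    """Group interface names by MAC address from `ip link show` style text."""
    groups = defaultdict(list)
    current = None
    for line in ip_link_text.splitlines():
        # Header lines look like "84: vlan10@fabric: <FLAGS> ..."
        header = re.match(r"^\d+:\s+([^:\s]+):", line)
        if header:
            # Drop the "@parent" suffix that ip adds for stacked interfaces.
            current = header.group(1).split("@")[0]
            continue
        ether = re.search(r"link/ether\s+([0-9a-fA-F:]{17})", line)
        if ether and current is not None:
            groups[ether.group(1).lower()].append(current)
    return dict(groups)


# Sample input: vlan10 is from the post, et2 is a hypothetical placeholder.
SAMPLE = """\
5: et2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
    link/ether 00:1c:73:3f:9b:8b brd ff:ff:ff:ff:ff:ff
84: vlan10@fabric: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 00:1c:73:3f:9b:8b brd ff:ff:ff:ff:ff:ff
"""

for mac, names in group_interfaces_by_mac(SAMPLE).items():
    print(mac, names)
```

Any MAC that maps to more than one interface is shared. On a real box you would feed the function the actual output of `ip link show` instead of the sample.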

Diagram 3: "I don't care about your VLAN tag, I will eat the packet"

We have verified this same behavior on both BCM56840 and BCM56870 in both Arista EOS and SONiC. It is very possible that this behavior is only present in the Trident chipset family, but given that it is such a common chip to use in switches it is likely this quirk exists in a lot of switches out there.
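To make the quirk concrete, here is a toy model of R1 as I understand the behavior. It is a sketch under assumptions, not the Broadcom pipeline: the class and the MAC addresses for R2 and R3 are hypothetical, and the only thing it tries to capture is that the "is this my own MAC?" check happens without looking at the VLAN.

```python
SWITCH_MAC = "00:1c:73:3f:9b:8b"   # R1's MAC, as shown on vlan10 in the post
BROADCAST = "ff:ff:ff:ff:ff:ff"
R2_MAC = "02:00:00:00:00:20"       # hypothetical MAC for R2
R3_MAC = "02:00:00:00:00:30"       # hypothetical MAC for R3

# Access ports on R1 and their VLANs, taken from the configuration in the post.
PORT_VLAN = {
    "Ethernet1/1": 10,
    "Ethernet1/2": 20,
    "Ethernet1/3": 30,
    "Ethernet2": 20,
    "Ethernet3": 30,
}


class ToySwitch:
    def __init__(self, own_mac, port_vlan):
        self.own_mac = own_mac
        self.port_vlan = port_vlan
        self.fdb = {}  # (vlan, mac) -> port

    def receive(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is forwarded to."""
        vlan = self.port_vlan[in_port]
        # The switch never learns its own MAC address.
        if src_mac != self.own_mac:
            self.fdb[(vlan, src_mac)] = in_port
        # The quirk: frames to the switch's own MAC are consumed (dropped or
        # sent to the CPU) no matter which VLAN they arrived on.
        if dst_mac == self.own_mac:
            return []
        others = [p for p, v in self.port_vlan.items()
                  if v == vlan and p != in_port]
        if dst_mac == BROADCAST:
            return others
        known = self.fdb.get((vlan, dst_mac))
        if known is None:
            return others  # unknown unicast is flooded within the VLAN
        return [known] if known != in_port else []


r1 = ToySwitch(SWITCH_MAC, PORT_VLAN)
# A broadcast from R2 is flooded within VLAN 20 as expected.
print(r1.receive("Ethernet2", R2_MAC, BROADCAST))    # ['Ethernet1/2']
# R2's unicast ARP reply to R1's MAC arrives on VLAN 20 and goes nowhere.
print(r1.receive("Ethernet2", R2_MAC, SWITCH_MAC))   # []
```

In this model the ARP reply from R2 never reaches Ethernet1/2, so it never makes it to the IX switch and back to Vlan10, which matches what we saw in tcpdump. Traffic between R2 and R3 is unaffected because neither uses R1's MAC as a destination.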

In the end I can see from a technical standpoint why this limitation would exist. Is it possible to tell the switch ASIC not to do this? Most likely, since these chips are quite programmable, but as I am neither an Arista EOS developer nor privy to Broadcom SDK documentation, there are few to no alternative ways forward here. Except to build a less horrifying network topology :-).

A huge thanks to my friends Tisteagle, fyx, bl0m1, and Cynthia who helped debug this issue.

P.S. If you want to read more about building an IX, I recommend reading about the creation of the Fremont Cabal IX.

*) You can replace the IX switch with a cable instead if you remove R3 from the example.

mainframe.dev
