In this post I want to dig into a phenomenon that I spent a few days debugging with friends. It is a story of behavior that seems illogical at first glance, broken even, but will seem reasonable in the end. Let's begin!
Things you will need: The scenario requires at least one host and one switch*. In our example we will use two Linux hosts, an Arista DCS-7050S switch, and an Arista DCS-7050QX switch. The second switch and host are not strictly needed, but they make the scenario more realistic and are how we found this issue.
The story begins with you and a couple of friends building an internet exchange (IX) for education and for fun. The simplest IX is just a switch - yes, even an unmanaged one you can pick up for $10 would work in theory. The switch has one purpose: to connect different actors together so they can route traffic between each other. See diagram 1.
You have your IX switch and you have your friends who are ready to communicate over the switch - but you need to get all of them to the IX switch somehow. Normally this is done via physical cabling, where everybody gets their own cable, but let's assume you want to avoid changing your existing cables for now and just want to test communicating over the IX switch. You proceed to build the network shown in diagram 2.
This setup is a lot to think about, so let's discuss what we are trying to accomplish here. The goal is to simulate an internet exchange where the three routers R1, R2, and R3 are all on the same Layer-2 segment and are isolated from each other except in the IX switch. Due to real-world constraints, R2 and R3 are currently only reachable through R1. "No big deal", you might think, and you would be forgiven for thinking that, "we can just pass VLAN 20 and 30 right through R1 and out towards the IX switch". Indeed, that's what we did. Let's look at the configuration for R1.
interface Ethernet1/1
   description To IX
   switchport access vlan 10
interface Ethernet1/2
   description To IX
   switchport access vlan 20
interface Ethernet1/3
   description To IX
   switchport access vlan 30
interface Ethernet2
   description To R2
   switchport access vlan 20
interface Ethernet3
   description To R3
   switchport access vlan 30
interface Vlan10
   ip address 10.0.0.10/24
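To see why this configuration puts all three routers on one Layer-2 segment, here is a minimal Python sketch that joins the broadcast domains connected by each cable. The R1 port-to-VLAN mapping comes from the configuration above. The IX switch port names (`p1` to `p3`), its single VLAN, and the host port name `eth0` are assumptions made for illustration, since the IX switch configuration is not shown.

```python
from collections import defaultdict

# Access VLAN per port. R1 is taken from the configuration above; the Vlan10
# interface needs no entry because it simply lives inside R1's VLAN 10.
# Assumption: the IX switch has three ports (hypothetical names) in one VLAN.
PORT_VLANS = {
    "R1": {"Et1/1": 10, "Et1/2": 20, "Et1/3": 30, "Et2": 20, "Et3": 30},
    "IX": {"p1": 1, "p2": 1, "p3": 1},
}

# Cables as ((device, port), (device, port)). R2 and R3 are plain hosts here.
CABLES = [
    (("R1", "Et1/1"), ("IX", "p1")),
    (("R1", "Et1/2"), ("IX", "p2")),
    (("R1", "Et1/3"), ("IX", "p3")),
    (("R1", "Et2"), ("R2", "eth0")),
    (("R1", "Et3"), ("R3", "eth0")),
]


def segments(port_vlans, cables):
    """Group (device, VLAN) broadcast domains that are joined by cables."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def domain(end):
        dev, port = end
        # A switch port belongs to its (switch, VLAN) domain; a host is its own.
        return (dev, port_vlans[dev][port]) if dev in port_vlans else (dev, None)

    for a, b in cables:
        parent[find(domain(a))] = find(domain(b))

    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())


if __name__ == "__main__":
    for seg in segments(PORT_VLANS, CABLES):
        print(sorted(seg, key=str))
```

Under these assumptions, VLANs 10, 20, and 30 on R1 all end up in the same segment as the IX switch, R2, and R3, which is exactly the intent of the setup.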
See any issues with this? We sure didn't. Does it work? Mostly.
- Connectivity between R2, R3, and the IX address (.1) works perfectly
- The IX switch can reach all routers: R1, R2, and R3
- R1 can only reach the IX switch, not R2 or R3
- R2 and R3 cannot reach R1
In summary, if we did not have the Vlan10 interface on R1 we would have full mesh connectivity and everything would be great.
Time to break out tcpdump. I wish I had some tcpdump sessions to show you but I don't, so I will have to summarize what we observed.
Pinging from 10.0.0.10 to 10.0.0.20
- [-> Ethernet1/1] ARP: Who has 10.0.0.20? Tell 10.0.0.10 [Broadcast]
- [Ethernet1/2<-] ARP: Who has 10.0.0.20? Tell 10.0.0.10 [Broadcast]
- [-> Ethernet2] ARP: Who has 10.0.0.20? Tell 10.0.0.10 [Broadcast]
- [Ethernet2<-] ARP: 10.0.0.20 is at x:y:z [Unicast to a:b:c]
- ... nothing more
Pinging from 10.0.0.20 to 10.0.0.30
- [Ethernet2<-] ARP: Who has 10.0.0.30? Tell 10.0.0.20 [Broadcast]
- [-> Ethernet1/2] ARP: Who has 10.0.0.30? Tell 10.0.0.20 [Broadcast]
- [Ethernet1/3<-] ARP: Who has 10.0.0.30? Tell 10.0.0.20 [Broadcast]
- [-> Ethernet3] ARP: Who has 10.0.0.30? Tell 10.0.0.20 [Broadcast]
- [Ethernet3<-] ARP: 10.0.0.30 is at x:y:z [Unicast to e:f:g]
- [-> Ethernet1/3] ARP: 10.0.0.30 is at x:y:z [Unicast to e:f:g]
- [Ethernet1/2<-] ARP: 10.0.0.30 is at x:y:z [Unicast to e:f:g]
- [-> Ethernet2] ARP: 10.0.0.30 is at x:y:z [Unicast to e:f:g]
That left four differences between the failing and the working case:
1. Different interfaces
2. Different VLAN numbers
3. Different IP numbers
4. Different MAC addresses
We quickly ruled out 1-3 by shuffling things around and seeing that the problem remained physically in the same place (R1). Then we started debugging the Layer-2 forwarding logic in the switch ASIC to rule out 4. Our question was "Why would a given unicast MAC simply disappear and not be forwarded like any other MAC?". Let's look at what our dear Vlan10 interface looks like in the Arista switch.
84: [email protected]: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 00:1c:73:3f:9b:8b brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.10/24 brd 255.255.255.255 scope global vlan10
    inet6 fe80::21c:73ff:fe3f:9b8b/64 scope link
       valid_lft forever preferred_lft forever
Notice anything strange? Most likely not from this interface alone. The interesting part shows up when you look at the MAC address on all interfaces: they are all the same, 00:1c:73:3f:9b:8b. At this point you might think exactly what we were thinking: "OK, so what? When the ARP reply comes in on VLAN 20 or VLAN 30, it should just be forwarded like any other unicast frame, since the switch has no interface of its own on those VLANs." And here is where we get to the illogical behavior: it is not. In fact, even if we enter the Broadcom debug shell and look at the L2 tables, we see that the switch never learns its own MAC address, nor does it flood frames addressed to it. If the switch ASIC (Broadcom Trident family in our case) sees a unicast frame addressed to its own MAC address, it will drop it or push it to the CPU - anything but forward it (diagram 3).
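The behavior can be reproduced with a toy model. The following is a minimal Python sketch of a flood-and-learn switch with an "own MAC" filter, wired up like our lab. It is a simplification built from the observations in this post, not a description of how the ASIC is implemented. The R2 and R3 MAC addresses, the IX port names, the pseudo-port `cpu`, and the `own_mac_filter` switch are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

BROADCAST = "ff:ff:ff:ff:ff:ff"
R1_MAC = "00:1c:73:3f:9b:8b"   # from the interface output above
R2_MAC = "02:00:00:00:00:20"   # hypothetical
R3_MAC = "02:00:00:00:00:30"   # hypothetical


@dataclass
class Switch:
    port_vlan: dict                 # port -> access VLAN ("cpu" models the SVI)
    station_mac: str = ""           # the switch's own MAC, empty if it has none
    own_mac_filter: bool = True     # hypothetical knob to turn the quirk off
    fdb: dict = field(default_factory=dict)  # (vlan, mac) -> port

    def egress(self, in_port, src, dst):
        """Return the ports a frame received on in_port is sent out of."""
        vlan = self.port_vlan[in_port]
        filtering = self.own_mac_filter and self.station_mac
        if not (filtering and src == self.station_mac):
            self.fdb[(vlan, src)] = in_port      # normal MAC learning
        if filtering and dst == self.station_mac:
            # The quirk: to the CPU if there is an SVI in this VLAN, else dropped.
            return ["cpu"] if self.port_vlan.get("cpu") == vlan else []
        known = self.fdb.get((vlan, dst))
        if dst != BROADCAST and known is not None:
            return [] if known == in_port else [known]
        # Broadcast or unknown unicast: flood within the VLAN.
        return [p for p, v in self.port_vlan.items() if v == vlan and p != in_port]


def build(own_mac_filter=True):
    switches = {
        "R1": Switch({"Et1/1": 10, "Et1/2": 20, "Et1/3": 30,
                      "Et2": 20, "Et3": 30, "cpu": 10},
                     station_mac=R1_MAC, own_mac_filter=own_mac_filter),
        "IX": Switch({"p1": 1, "p2": 1, "p3": 1}),  # assumed plain L2 switch
    }
    cables = [(("R1", "Et1/1"), ("IX", "p1")), (("R1", "Et1/2"), ("IX", "p2")),
              (("R1", "Et1/3"), ("IX", "p3")), (("R1", "Et2"), ("R2", "eth0")),
              (("R1", "Et3"), ("R3", "eth0"))]
    links = {}
    for a, b in cables:
        links[a], links[b] = b, a
    return switches, links


def deliver(switches, links, node, in_port, src, dst):
    """Inject a frame into a switch port; return the set of endpoints it reaches."""
    reached = set()
    for out_port in switches[node].egress(in_port, src, dst):
        if out_port == "cpu":
            reached.add(node)
            continue
        peer, peer_port = links[(node, out_port)]
        if peer in switches:
            reached |= deliver(switches, links, peer, peer_port, src, dst)
        else:
            reached.add(peer)
    return reached


def demo(own_mac_filter=True):
    sw, links = build(own_mac_filter)
    return {
        "R1 request": deliver(sw, links, "R1", "cpu", R1_MAC, BROADCAST),
        "R2 reply": deliver(sw, links, "R1", "Et2", R2_MAC, R1_MAC),
        "R2 request": deliver(sw, links, "R1", "Et2", R2_MAC, BROADCAST),
        "R3 reply": deliver(sw, links, "R1", "Et3", R3_MAC, R2_MAC),
    }


if __name__ == "__main__":
    print(demo(own_mac_filter=True))
    print(demo(own_mac_filter=False))
```

With the filter on, the model matches what we saw in tcpdump: R1's ARP request reaches R2 and R3, R2's unicast reply to R1's MAC goes nowhere, and the R2 to R3 exchange completes. With the filter turned off (a hypothetical switch without the quirk), R2's reply travels out through the IX switch and back to R1's Vlan10 interface.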
We have verified this same behavior on both BCM56840 and BCM56870 in both Arista EOS and SONiC. It is very possible that this behavior is only present in the Trident chipset family, but given that it is such a common chip to use in switches it is likely this quirk exists in a lot of switches out there.
In the end I can see from a technical standpoint why this limitation would exist. Is it possible to tell the switch ASIC not to do this? Most likely, since these chips are quite programmable, but since I am neither an Arista EOS developer nor privy to Broadcom SDK documentation, there are few if any ways forward here. Except to build a less horrifying network topology :-).
A huge thanks to my friends Tisteagle, fyx, bl0m1, and Cynthia who helped debug this issue.
P.S. If you want to read more about building an IX, I recommend reading about the creation of the Fremont Cabal IX.
*) You can replace the IX switch with a cable instead if you remove R3 from the example.