
The curious case of the missing return packet

 In this post I want to dig deeper into a phenomenon that I spent a few days debugging with friends. It is a story of a behavior that seems illogical at first glance, broken even, but turns out to be plausible in the end. Let's begin!

Things you will need: The scenario requires a minimum of one host and one switch*. In our example we will use two Linux hosts, an Arista DCS-7050S switch, and an Arista DCS-7050QX switch. The second switch and host are not strictly needed, but they make the scenario more realistic and are how we found this issue.

The story begins with you and a couple of friends building an internet exchange (IX) for education and for fun. The simplest IX is just a switch - yes, even an unmanaged thing you can pick up for $10 would work in theory. The switch has one purpose: connect different actors together so they can route traffic between each other. See diagram 1.




Diagram 1: Your typical IX network. It's just a switch.

You have your IX switch and you have your friends who are ready to communicate over the switch - but you need to get all of them to the IX switch somehow. Normally this is done via physical cabling, where everybody gets their own cable, but let's assume you want to avoid changing your existing cables for now and just want to test communicating over the IX switch. You proceed to build the network shown in diagram 2.

Diagram 2: The network we will focus on

This setup is a lot to think about, so let's discuss what we are trying to accomplish here. The goal is to simulate an internet exchange where the three routers R1, R2, and R3 are all on the same Layer-2 segment, yet isolated from each other everywhere except in the IX switch. Due to real-world constraints, R2 and R3 are currently only reachable through R1. "No big deal", you might think, and you would be forgiven for thinking that: "we can just pass VLAN 20 and 30 right through R1 and out towards the IX switch". Indeed, that's what we did. Let's look at the configuration for R1.

interface Ethernet1/1
  description To IX
  switchport access vlan 10

interface Ethernet1/2
  description To IX
  switchport access vlan 20

interface Ethernet1/3
  description To IX
  switchport access vlan 30

interface Ethernet2
  description To R2
  switchport access vlan 20

interface Ethernet3
  description To R3
  switchport access vlan 30

interface Vlan10
  ip address 10.0.0.10/24

See any issues with this? We sure didn't. Does it work? Mostly.

  • Connectivity between R2, R3, and the IX address (.1) works perfectly
  • The IX switch can reach all routers, R1, R2, R3
  • R1 can only reach the IX switch, not R2 or R3
  • R2 and R3 cannot reach R1

In summary, if we did not have the Vlan10 interface on R1 we would have full mesh connectivity and everything would be great.

Time to break out tcpdump. I wish I had some tcpdump sessions to show you but I don't, so I will have to summarize what we observed.

Pinging from 10.0.0.10 to 10.0.0.20

  • [-> Ethernet1/1] ARP: Who has 10.0.0.20? Tell 10.0.0.10 [Broadcast]
  • [Ethernet1/2<-] ARP: Who has 10.0.0.20? Tell 10.0.0.10 [Broadcast]
  • [-> Ethernet2] ARP: Who has 10.0.0.20? Tell 10.0.0.10 [Broadcast]
  • [Ethernet2<-] ARP: 10.0.0.20 is at x:y:z [Unicast to a:b:c]
  • ... nothing more

In other words: as long as we are broadcasting, the frame is delivered fine. The unicast ARP reply from R2 arrives on Ethernet2 and is never seen again.

Pinging from 10.0.0.20 to 10.0.0.30

This works fine. The full flow is too long to write up, but the ARP looks like this:
  • [Ethernet2<-] ARP: Who has 10.0.0.30? Tell 10.0.0.20 [Broadcast]
  • [-> Ethernet1/2] ARP: Who has 10.0.0.30? Tell 10.0.0.20 [Broadcast]
  • [Ethernet1/3<-] ARP: Who has 10.0.0.30? Tell 10.0.0.20 [Broadcast]
  • [-> Ethernet3] ARP: Who has 10.0.0.30? Tell 10.0.0.20 [Broadcast]
  • [Ethernet3<-] ARP: 10.0.0.30 is at x:y:z [Unicast to e:f:g]
  • [-> Ethernet1/3] ARP: 10.0.0.30 is at x:y:z [Unicast to e:f:g]
  • [Ethernet1/2<-] ARP: 10.0.0.30 is at x:y:z [Unicast to e:f:g]
  • [-> Ethernet2] ARP: 10.0.0.30 is at x:y:z [Unicast to e:f:g]

The notable differences between the two are:
  1. Different interfaces
  2. Different VLAN numbers
  3. Different IP numbers
  4. Different MAC addresses

We quickly ruled out 1-3 by shuffling things around and seeing that the problem stayed physically in the same place (R1). Then we started debugging the Layer-2 forwarding logic in the switch ASIC to rule out 4. Our question was: "Why would a given unicast MAC simply disappear and not be forwarded like any other MAC?" Let's look at what our dear Vlan10 interface looks like in the Arista switch.

84: vlan10@fabric: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
    link/ether 00:1c:73:3f:9b:8b brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.10/24 brd 255.255.255.255 scope global vlan10
    inet6 fe80::21c:73ff:fe3f:9b8b/64 scope link 
       valid_lft forever preferred_lft forever

Notice anything strange? Most likely not from this interface alone. The interesting thing shows up if you look at the MAC address on all interfaces: they are all the same, 00:1c:73:3f:9b:8b. At this point you might think exactly what we were thinking: "OK, so what - when the ARP reply comes in on VLAN 20 or VLAN 30, it should just be forwarded like any other unicast frame - there are no interfaces on that VLAN". And here is where we get to the illogical behavior - it is not. In fact, if we enter the Broadcom debug shell and look at the L2 tables, we see that the switch never learns its own MAC address, nor does it flood frames addressed to it. If the switch ASIC (Broadcom Trident family in our case) sees a unicast frame with its own MAC address as the destination, it will drop it or push it to the CPU - anything but forward it (diagram 3).
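If you want to check the "same MAC everywhere" observation on your own box without eyeballing it, a small script can group interfaces by their link/ether address. This is only a sketch: it assumes the plain text layout of `ip link` shown above, the function name `macs_by_interface` is made up for this post, and the second interface in the sample (`example0`) is a hypothetical stand-in, not output from our switch.

```python
import re
from collections import defaultdict
from typing import Dict, List


def macs_by_interface(ip_link_output: str) -> Dict[str, List[str]]:
    """Group interface names by MAC, given `ip link`-style text."""
    groups: Dict[str, List[str]] = defaultdict(list)
    current = None
    for line in ip_link_output.splitlines():
        # Header line, e.g. "84: vlan10@fabric: <BROADCAST,...>"
        header = re.match(r"^\d+:\s+([^:@\s]+)", line)
        if header:
            current = header.group(1)
            continue
        # Indented detail line, e.g. "    link/ether 00:1c:... brd ..."
        ether = re.search(r"link/ether\s+([0-9a-fA-F:]{17})", line)
        if ether and current:
            groups[ether.group(1).lower()].append(current)
    return dict(groups)


# vlan10 is taken from the output above; example0 is a hypothetical
# second interface added only to show the grouping.
SAMPLE = """\
84: vlan10@fabric: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 00:1c:73:3f:9b:8b brd ff:ff:ff:ff:ff:ff
85: example0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 00:1c:73:3f:9b:8b brd ff:ff:ff:ff:ff:ff
"""

groups = macs_by_interface(SAMPLE)
for mac, names in groups.items():
    print(mac, "->", ", ".join(names))
```

Any MAC that maps to more than one interface name is shared, which is the condition that mattered for us.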

Diagram 3: "I don't care about your VLAN tag, I will eat the packet"
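To make the behavior concrete, here is a toy model of R1 in Python. It is a sketch of what we observed, not of how the Broadcom pipeline is actually implemented: the class `ToySwitch`, the MAC addresses for R2 and R3, and the choice to return an empty port list for "dropped or sent to the CPU" are all assumptions made for illustration. The one rule that matters is that the own-MAC check happens before, and independently of, the per-VLAN lookup.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

BROADCAST = "ff:ff:ff:ff:ff:ff"


@dataclass
class ToySwitch:
    """Toy model of the observed behavior; not Broadcom or Arista code."""
    router_mac: str                      # MAC shared by the switch's own interfaces
    port_vlan: Dict[str, int]            # access port -> VLAN
    l2_table: Dict[Tuple[int, str], str] = field(default_factory=dict)

    def receive(self, in_port: str, src: str, dst: str) -> List[str]:
        """Return the ports the frame leaves on (empty list = not forwarded)."""
        vlan = self.port_vlan[in_port]
        # The switch never learns its own MAC as a source address.
        if src != self.router_mac:
            self.l2_table[(vlan, src)] = in_port
        # The quirk: this check ignores the VLAN the frame arrived on.
        if dst == self.router_mac:
            return []
        # Known unicast: forward to the learned port in this VLAN.
        if dst != BROADCAST and (vlan, dst) in self.l2_table:
            out = self.l2_table[(vlan, dst)]
            return [out] if out != in_port else []
        # Broadcast or unknown unicast: flood within the VLAN.
        return [p for p, v in self.port_vlan.items() if v == vlan and p != in_port]


r1 = ToySwitch(
    router_mac="00:1c:73:3f:9b:8b",
    port_vlan={"Ethernet1/1": 10, "Ethernet1/2": 20, "Ethernet1/3": 30,
               "Ethernet2": 20, "Ethernet3": 30},
)
R2_MAC = "02:00:00:00:00:20"  # hypothetical
R3_MAC = "02:00:00:00:00:30"  # hypothetical

# R1 pings R2: the request comes back from the IX switch in VLAN 20...
request_from_r1 = r1.receive("Ethernet1/2", r1.router_mac, BROADCAST)
# ...but R2's unicast reply to R1's MAC is eaten on the way back.
reply_to_r1 = r1.receive("Ethernet2", R2_MAC, r1.router_mac)

# R2 pings R3: the request comes back from the IX switch in VLAN 30...
request_from_r2 = r1.receive("Ethernet1/3", R2_MAC, BROADCAST)
# ...and R3's reply is forwarded like any other unicast frame.
reply_to_r2 = r1.receive("Ethernet3", R3_MAC, R2_MAC)

print(request_from_r1, reply_to_r1, request_from_r2, reply_to_r2)
```

In this model the broadcast from R1 still reaches Ethernet2, the reply to R1 goes nowhere, and the R2/R3 exchange passes straight through, which matches the symptoms listed earlier.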

We have verified this same behavior on both BCM56840 and BCM56870 in both Arista EOS and SONiC. It is very possible that this behavior is only present in the Trident chipset family, but given that it is such a common chip to use in switches it is likely this quirk exists in a lot of switches out there.

In the end I can see from a technical standpoint why this limitation would exist. Is it possible to tell the switch ASIC not to do this? Most likely, as these chips are quite programmable, but since I am neither an Arista EOS developer nor privy to Broadcom SDK documentation, there are few to no alternative ways forward here. Except to build a less horrifying network topology :-).

A huge thanks to my friends Tisteagle, fyx, bl0m1, and Cynthia who helped debug this issue.

P.S. If you want to read more about building an IX, I recommend reading about the creation of the Fremont Cabal IX.

*) You can replace the IX switch with a cable instead if you remove R3 from the example.
