
The fake FICON board - Fejkon

The latest project I've been working on is a custom card that will allow me to interface with any mainframe using the FICON protocol. I have a lot of ideas on how this could help hobbyists out there, and possibly folks doing development for mainframes as well. For my own purposes, it would mean I no longer have to rely on my (still broken) DS6800 array.

DE5-Net card running fejkon

"But why would you build your own Fibre Channel (FC) card? There are many cards out there!" you might ask. And yes, while it's true that there are a lot of FC cards out there, they only officially support SCSI, called FCP in FC lingo. FICON is at the same "layer" as FCP, so if the card filters traffic based on the traffic type, or offloads some functionality used for FCP, it usually ends up bound to that type of traffic.

I asked one of the biggest FC card manufacturers about support for FICON, as some datasheets mention it if you look for the standard name FC-SB-x. They reluctantly admitted that yes, they do support FICON, but you may only enable it if you are one of certain companies. My bet is that if you take a FICON card, remove the heatsinks, and look at what's underneath, you'll find conventional FC HBA ASICs.

The FPGA card

So conventional FC cards are a no-go. As you might recall from my previous post I spent some time looking at other ways to get access to something that can transmit and receive raw FC, namely the Brocade 7800. I'm still interested in finishing that project, but there is only so much debuggability you can get from a closed unsupported system. And when I found discounted DE5-Net cards on eBay I didn't have to think about it for very long. I wanted to build my own FICON HBA for everyone to enjoy. I submitted an offer to buy the $600 card for $300 if I bought 3, and it was accepted.

Full disclosure: I have made something like this in the past. My master's thesis was on inline storage encryption for Fibre Channel, so I am pretty confident I have the background and skill needed to make something like this work, and that I know where to make the trade-offs to get something that works yet doesn't consume a year's worth of engineering time. Ideally it should also be useful for things other than FICON.

Fejkon and its roadmap features

The fejkon project is named after a Swedish pun suggested by a friend of mine, I think it was "gix", meaning essentially "fake FICON". The primary goal should be clear from that name, but let's break it down.

The high-level goals are:

Hercules disks, tapes, etc. to be exposed to FICON ports
Fibre Channel analyzer functionality
(Stretch, v2) Coupling Facilities to be made to Hercules and/or over the internet

Hopefully this makes the card useful for mainframe hobbyists, as well as potentially developers and lab shops.

A quick note on FC layers. FC-x is a reference to where something is happening in the FC stack. FC-0 is the signal transmission, FC-1 is the coding of data for transmission, and FC-2 is the framing and signaling layer, which is where what I'll loosely call the application logic lives.

In order to make the development for this card as easy as possible, the trade-off will be performance. The card will do all of FC-1 and as little as possible in terms of FC-2 logic and offloading, and leave that for the driver. The driver exposes the fejkon card as network interfaces in Linux, allowing any trusted binary running in userspace to send and receive FC-2 frames. That binary is meant to be fikonfarm when the card is ready.
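To give a feel for what such a userspace binary has to deal with, here is a minimal sketch of decoding the 24-byte FC-2 frame header. It assumes the buffer starts directly at the header, which is an assumption about how frames would be delivered and not something fejkon's driver is confirmed to do. `parse_fc_header` and `FC_TYPES` are made-up names, and the TYPE values are recalled from the FC-FS TYPE table, so verify them before relying on them.

```python
import struct

# TYPE values recalled from the FC-FS TYPE table (assumption, verify before use).
FC_TYPES = {0x01: "ELS", 0x08: "FCP", 0x1B: "FC-SB", 0x1C: "FC-SB"}

def parse_fc_header(frame: bytes) -> dict:
    """Decode the 24-byte FC-2 frame header at the start of `frame`."""
    if len(frame) < 24:
        raise ValueError("an FC-2 frame header is 24 bytes")
    # Six big-endian 32-bit words; the first three each pack 8 + 24 bits.
    w0, w1, w2, seq_id, df_ctl, seq_cnt, ox_id, rx_id, param = struct.unpack(
        ">IIIBBHHHI", frame[:24])
    fc_type = w2 >> 24
    return {
        "r_ctl": w0 >> 24, "d_id": w0 & 0xFFFFFF,
        "cs_ctl": w1 >> 24, "s_id": w1 & 0xFFFFFF,
        "type": fc_type, "f_ctl": w2 & 0xFFFFFF,
        "seq_id": seq_id, "df_ctl": df_ctl, "seq_cnt": seq_cnt,
        "ox_id": ox_id, "rx_id": rx_id, "parameter": param,
        "protocol": FC_TYPES.get(fc_type, "other"),
    }

# A hand-made example header, not captured traffic.
hdr = parse_fc_header(
    bytes.fromhex("06010203000a0b0c1b290000010000021234ffff00000000"))
print(hdr["protocol"], hex(hdr["s_id"]), "->", hex(hdr["d_id"]))
```

The TYPE field is the byte an FCP-only card would filter or offload on, which is why a card that passes frames up untouched can carry FICON just as well as SCSI.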

The big benefit of doing so much in userspace is that recompiling an FPGA bitstream can take hours, while recompiling a Go program takes a few seconds. Debugging is also much easier when you can attach a debugger, and you can even run Wireshark on the interfaces to get the FC frames decoded straight away.

Performance-wise the card will push all 4x 8G ports over a PCIe 3.0 x8 link, which should be more than enough bandwidth. Processing those FC frames will be done in userland and will thus rely quite a bit on the CPU. Given how fast today's CPUs are I am pretty confident in this trade-off, and I hope to be able to push 4x 8G FC backed by NVMe in the end. However, I doubt this will scale to 4x 32GFC for example.
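As a rough sanity check of the bandwidth claim, the arithmetic can be sketched like this. The figures are nominal line rates (8.5 Gbaud with 8b/10b coding for 8GFC, 8 GT/s per lane with 128b/130b coding for PCIe 3.0) and ignore FC framing, PCIe packet headers and DMA descriptors, so treat the result as an upper bound and not a measurement.

```python
# Back-of-the-envelope bandwidth budget, per direction.
FC_PORTS = 4
FC_BAUD = 8.5e9            # 8GFC line rate in bits per second on the wire
FC_CODING = 8 / 10         # 8b/10b: 8 payload bits per 10 line bits

PCIE_LANES = 8
PCIE_RATE = 8e9            # PCIe 3.0: 8 GT/s per lane
PCIE_CODING = 128 / 130    # 128b/130b line coding

fc_total = FC_PORTS * FC_BAUD * FC_CODING
pcie_total = PCIE_LANES * PCIE_RATE * PCIE_CODING

print(f"4x 8GFC : {fc_total / 1e9:.1f} Gbit/s")
print(f"PCIe x8 : {pcie_total / 1e9:.1f} Gbit/s")
print(f"headroom: {pcie_total / fc_total:.1f}x")
```

Even with protocol overhead eating into the PCIe side, a bit more than 2x headroom leaves a comfortable margin for four 8GFC ports.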

Fibre Channel FC-1 basics

A big deal in hardware and FPGA design is designing the high-speed parts. They take a long time to figure out a good model for, write, simulate, synthesize to hardware, and test. Mostly you rely on so-called "IP cores" (think libraries) that the FPGA manufacturer provides you - but you will have to implement a fair bit yourself. If this were an Ethernet card I could use much more specialized IP cores, but given that this is Fibre Channel, which is not as well supported, I had to use what is called a "custom transceiver". This means I had to read up on a lot of basics around the Fibre Channel protocol and how it is encoded in order to implement that in the FPGA. Since a good way to make the world a better place is to teach what you learn, I thought I would write it up.

In any digital system you have a bunch of 1's and 0's. In theory you know exactly what every 1 and 0 represents - it could be whether a fan should be running or not, for example. However, when you deal with something like signaling over a fiber cable you have to make all those 1's and 0's travel in some orderly fashion to the remote end, and you have to be able to decode what that remote end is trying to tell you. At low speeds this is simple, and in the past interfaces such as IDE were just parallel - you had as many pins as data and address bits. It was really simple to decode, but as interfaces grew faster the differences in signal propagation delay between the wires in the physical cable put a limit on how fast you could practically go. This is why every high-speed protocol today is serial instead of parallel - SATA, FC, Ethernet, PCI-Express.

When you have a serial stream you first have to determine what is a 1 and what is a 0 from some sensory signal. How that's done is beyond me, and I will not attempt to explain it - thankfully the FPGA and SFP modules abstract this away in the layer known as FC-0 and provide an endless stream of 1's and 0's as output. Let's say that you're receiving a stream of alternating 1's and 0's like this: ...01010101010101... - what is that? Is that 0xA or 0x5 repeated over and over again? You have no way of knowing! That's why we need something called bit synchronization.
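The ambiguity is easy to reproduce. This small sketch slices the same bit string into 4-bit groups at two different starting offsets; `nibbles` is just a helper made up for the illustration.

```python
def nibbles(bits, offset):
    """Slice a bit string into 4-bit groups, starting at `offset`."""
    chunk = bits[offset:]
    return [hex(int(chunk[i:i + 4], 2)) for i in range(0, len(chunk) - 3, 4)]

stream = "01" * 8              # ...0101010101010101...
print(nibbles(stream, 0))      # reading starts on a 0
print(nibbles(stream, 1))      # same stream, reading starts one bit later
```

Shifting the starting point by a single bit turns every 0x5 into 0xA, and nothing in the stream itself says which reading is the intended one.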

Bit synchronization

In order for us to find out what an endless stream of bits is trying to tell us, we need a way to find where our data words (fancy way of saying a "byte" more or less) start and stop. One popular way to do this is to invent a bit-pattern that only occurs in a specific place. That way we can scan the stream of bits until we see this pattern, and then "lock on" to it from there on out. That's the general concept of bit alignment and synchronization.

"But wait!" you might say, "what if I want to transmit that particular bit pattern?". It's a very good point, and one way of solving that is to apply a word encoding scheme. Think of this as a reverse-compression - e.g. you feed in 8 bits and you get 10 bits back. "Why would you want to do that though? Isn't that wasting bandwidth?" you might ask, observantly. Indeed it does, which is why we have to choose this carefully. If we do this carefully we can optimize the result to have the properties we want as well as decent overhead.

The properties we are after are:

A bit-pattern that only occurs in a specific place
Bit transitions to be as often as possible (clock recovery)
Redundant way of coding data with parity (DC balancing)

I'm going to cover the last two really quickly.

Clock recovery stems from a physical constraint: both sender and receiver have a clock running at the data rate (8.5 GHz for 8GFC) that tells them when to look at the data signal. In practice though these clocks are never exactly 8.5 GHz and they are never exactly the same - which means that you need to align your clocks to avoid bit errors. Imagine if you were transmitting only zeros for 1 hour - it would be impossible to detect if the clocks had skewed in any direction since the data stream is only zeros. Instead we want the data stream to change at a guaranteed minimum rate so that we can keep track of the clock skew and compensate our clock continuously. That's clock recovery.
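To get a feel for how quickly two free-running clocks drift apart, here is a small worked calculation. The ±100 ppm tolerance per end is an assumption made for the example, so check the physical interface standard for the real figure.

```python
PPM_TOLERANCE = 100          # assumed oscillator tolerance per end, in ppm
BAUD = 8.5e9                 # 8GFC line rate in bits per second

# Worst case: one clock is 100 ppm fast while the other is 100 ppm slow.
worst_offset = 2 * PPM_TOLERANCE * 1e-6

# Number of bits until the sampling point has slid by half a bit period.
bits_until_half_bit_slip = 0.5 / worst_offset
time_ns = bits_until_half_bit_slip / BAUD * 1e9

print(f"half a bit of drift after ~{bits_until_half_bit_slip:.0f} bits "
      f"(~{time_ns:.0f} ns)")
```

Under that assumption, a receiver that saw no transitions would start sampling the wrong bit after only a few thousand bits, which is why the encoding has to guarantee transitions.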

DC balancing is a technique to ensure that the data signal contains as many 1s as 0s over the long run. This means that it is possible to design a physical interface so that on average no current flows over the data line. That seems like a good thing, but to be honest I haven't thought about it too much. It is however a requirement in FC, and thus we have to understand it. Conceptually, if you transmit something with more 1s than 0s, you then have to follow it with something that has more 0s than 1s to even things out. The running tally of that imbalance is called Running Disparity (RD). Since of course we cannot force users to have files that have an equal number of 1's and 0's, we have to do that for them, and thus we have a property requirement of DC balancing.
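Here is a simplified sketch of how running disparity can be tracked, using the 10-bit symbols that the 8B/10B encoding discussed later in this article produces. It treats each symbol as a whole, whereas the real scheme tracks disparity per 6-bit and 4-bit sub-block, so take it as an illustration of the idea and not a compliant checker. `disparity` and `track_rd` are made-up names.

```python
K28_5_RD_NEG = "0011111010"   # K28.5 as sent when running disparity is negative
K28_5_RD_POS = "1100000101"   # K28.5 as sent when running disparity is positive
D21_5 = "1010101010"          # a neutral symbol: five 1s and five 0s

def disparity(symbol):
    """Number of 1s minus number of 0s in a 10-bit symbol."""
    return symbol.count("1") - symbol.count("0")

def track_rd(symbols, rd=-1):
    """Return the running disparity (-1 or +1) after each symbol."""
    history = []
    for sym in symbols:
        d = disparity(sym)
        if d not in (-2, 0, 2):
            raise ValueError(f"{sym} is not a valid 8b/10b symbol")
        if d != 0:
            # An unbalanced symbol must push RD the other way.
            if (d > 0) == (rd > 0):
                raise ValueError(f"disparity error at {sym}")
            rd = -rd
        history.append(rd)
    return history

print(track_rd([K28_5_RD_NEG, D21_5, K28_5_RD_POS]))
```

Sending the same unbalanced version twice in a row is rejected, which is also how a receiver can spot some bit errors for free.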

All these properties are fulfilled by the encodings 8B/10B and 64B/66B. For FC up to and including 8GFC 8B/10B is used, for faster speeds 64B/66B is used. The same holds true for Ethernet at speeds of 1G and 10G respectively.
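The cost of the two encodings is easy to put into numbers. The sketch below only looks at line coding and ignores frame headers and fill words; `coding_efficiency` is a made-up helper.

```python
def coding_efficiency(payload_bits, line_bits):
    """Fraction of the line rate that carries payload bits."""
    return payload_bits / line_bits

for name, payload, line in (("8b/10b", 8, 10), ("64b/66b", 64, 66)):
    eff = coding_efficiency(payload, line)
    print(f"{name}: {eff:.1%} payload, {1 - eff:.1%} spent on coding")

# 8GFC runs at 8.5 Gbaud on the wire with 8b/10b.
payload_bytes_per_s = 8.5e9 * coding_efficiency(8, 10) / 8
print(f"8GFC: {payload_bytes_per_s / 1e6:.0f} MB/s before framing overhead")
```

This is why 8GFC is commonly quoted at roughly 800 MB/s per direction even though the wire runs at 8.5 Gbaud.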

In 8B/10B, as the name implies, we encode 1 byte (8 bits) into 10 bits. These are called symbols and use a quite special naming convention, Kxx.y or Dxx.y. The K symbols are control symbols and the D symbols are data symbols. How they translate to and from bytes is not that important, but it is explained in the Wikipedia articles on these encodings.
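The naming part of the convention is simple enough to sketch: xx is the value of the low five bits of the byte and y is the value of the high three bits. `symbol_name` and `symbol_byte` are helper names made up for this example, and they only deal with naming, not with the actual 10-bit encoding.

```python
def symbol_name(byte, control=False):
    """Name a byte using the 8b/10b Kxx.y / Dxx.y convention."""
    xx = byte & 0x1F           # low five bits
    y = (byte >> 5) & 0x07     # high three bits
    return f"{'K' if control else 'D'}{xx}.{y}"

def symbol_byte(name):
    """Inverse of symbol_name: 'K28.5' -> 0xBC."""
    xx, y = name[1:].split(".")
    return (int(y) << 5) | int(xx)

print(symbol_name(0xBC, control=True))
print(symbol_name(0xB5))
print(hex(symbol_byte("K28.5")))
```

So K28.5 is simply the control symbol for byte 0xBC, and D21.5 is the data byte 0xB5.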

What is important to know however is that a symbol that occurs only in a specific known position is known as a comma symbol. This is what we want for our first property to be able to align our stream of bits. In FC and most other 8B/10B implementations this is K28.5.

Within the control symbols, K.28.1, K.28.5, and K.28.7 are "comma symbols". Comma symbols are used for synchronization (finding the alignment of the 8b/10b codes within a bit-stream). If K.28.7 is not used, the unique comma sequences 0011111 or 1100000 cannot be found at any bit position within any combination of normal codes. - Wikipedia

Out of these symbols only K28.5 is used in FC, which means that we have our bit-pattern. As you can see there are two versions of it, one for each state of the Running Disparity discussed earlier.
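A software sketch of that alignment step could look like the following. It scans a bit string for either comma sequence and then cuts 10-bit symbols from that point on. Real hardware works on a continuous stream and keeps verifying its lock, so this is only the core idea; `find_alignment` and `slice_symbols` are made-up names.

```python
COMMAS = ("0011111", "1100000")   # first seven bits of K28.5, one per disparity

def find_alignment(bits):
    """Return the index of the first comma in a bit string, or -1 if none."""
    hits = [i for i in (bits.find(c) for c in COMMAS) if i >= 0]
    return min(hits) if hits else -1

def slice_symbols(bits):
    """Cut the stream into 10-bit symbols, starting at the first comma."""
    start = find_alignment(bits)
    if start < 0:
        return []
    return [bits[i:i + 10] for i in range(start, len(bits) - 9, 10)]

# Three stray bits, then K28.5 (RD-), D21.5, K28.5 (RD+).
stream = "101" + "0011111010" + "1010101010" + "1100000101"
print(find_alignment(stream))
print(slice_symbols(stream))
```

Once the offset is known, every tenth bit from there on is a symbol boundary until the link loses lock.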

Now we know how to slice our stream of bits into 8b/10b symbols. There is another concept called Byte Synchronization that is so trivial in FC that I will skip over it. Just know that in FC everything is sent as words of 4x 8b/10b symbols, and that in the control words (which include the "primitives") K28.5 always occurs as the first symbol. That's all you need to know.
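As a sketch of how trivial this is, the snippet below groups already-decoded symbol names into 4-symbol words starting at the first K28.5 and looks up two primitives. The symbol sequences for IDLE and ARBff are given as I understand them from FC-FS and should be double-checked against the standard; `group_words` and `PRIMITIVES` are made-up names.

```python
# Primitive encodings as understood from FC-FS (assumption, verify before use).
PRIMITIVES = {
    ("K28.5", "D21.4", "D21.5", "D21.5"): "IDLE",
    ("K28.5", "D20.4", "D31.7", "D31.7"): "ARBff",
}

def group_words(symbols):
    """Group symbol names into 4-symbol words, starting at the first K28.5."""
    try:
        start = symbols.index("K28.5")
    except ValueError:
        return []
    return [tuple(symbols[i:i + 4]) for i in range(start, len(symbols) - 3, 4)]

symbols = ["D21.5", "D21.5",
           "K28.5", "D21.4", "D21.5", "D21.5",
           "K28.5", "D20.4", "D31.7", "D31.7"]
for word in group_words(symbols):
    print(PRIMITIVES.get(word, "unknown"), word)
```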

How K28.5 looks in the simulator

Above you can see how this all looks in a simulator. Between the two yellow vertical lines you have 10 bits (actually I accidentally marked 11 so ignore the last one) - each clocked when the yellow clock changes (double data rate for the hardware people reading this) - 0011111010. Also note that in the picture above the symbol decoding lags one symbol behind, that's why it says K28.5 after the final bit has passed by.

Link State

In order to support things like link speed negotiation (e.g. ports supporting 2/4/8GFC) and to ensure a stable link before data is transmitted, protocols usually have some form of handshake and link state machine. This is the case for FC as well. The document FC-FS-5 describes this in great detail, and it can be summarized as follows: a link most likely starts out in the LFx ("Link Failure") state and is established when the state reaches AC ("Active").

From there the two ends dance through the states, each state with its own "default" primitive being blasted out on the link. A link can never be empty, you always have to transmit something - so these primitives are sent at full link speed when there is nothing else to send. The state machines are quite simple, and I've included the ones from Fejkon below.

This is the receive state machine:

  // This is Table 22 "FC_Port states" from FC-FS-5 INCITS 545-2019
  always @* begin
    state_next = state_r;
    case (fc::map_primitive(data))
      fc::PRIM_OLS: state_next = fc::STATE_OL2;
      fc::PRIM_NOS: state_next = fc::STATE_LF1;
      fc::PRIM_LR: begin
        if (state == fc::STATE_OL3 || state == fc::STATE_LF2)
          state_next = fc::STATE_LF2;
        else
          state_next = fc::STATE_LR2;
      end
      fc::PRIM_LRR: begin
        case (state)
          fc::STATE_LF1: state_next = fc::STATE_LF1;
          fc::STATE_LF2: state_next = fc::STATE_LF2;
          fc::STATE_OL1: state_next = fc::STATE_OL1;
          fc::STATE_OL3: state_next = fc::STATE_LF2;
          default: state_next = fc::STATE_LR3;
        endcase
      end
      fc::PRIM_IDLE, fc::PRIM_ARBFF: begin
        case (state)
          fc::STATE_AC: state_next = fc::STATE_AC;
          fc::STATE_LR1: state_next = fc::STATE_LR1;
          fc::STATE_LR2: state_next = fc::STATE_AC;
          fc::STATE_LR3: state_next = fc::STATE_AC;
          fc::STATE_LF1: state_next = fc::STATE_LF1;
          fc::STATE_LF2: state_next = fc::STATE_LF2;
          fc::STATE_OL1: state_next = fc::STATE_OL1;
          fc::STATE_OL2: state_next = fc::STATE_OL2;
          fc::STATE_OL3: state_next = fc::STATE_OL2;
        endcase
      end
    endcase
  end

And this is the transmit state machine:

  // Each state continuously transmits its own default primitive
  always @* begin
    case (state)
      fc::STATE_AC:  data = fc::ARBFF;
      fc::STATE_LR1: data = fc::LR;
      fc::STATE_LR2: data = fc::LRR;
      fc::STATE_LR3: data = fc::IDLE;
      fc::STATE_LF1: data = fc::OLS;
      fc::STATE_LF2: data = fc::NOS;
      fc::STATE_OL1: data = fc::OLS;
      fc::STATE_OL2: data = fc::LR;
      fc::STATE_OL3: data = fc::NOS;
    endcase
  end

A normal link-up dance is something like LF1 -> OL2 -> LR2 -> LR3 -> AC.
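To sanity-check that sequence, the two tables above can be transcribed into a small Python model, with two identical ports run against each other in lockstep. This is only a sketch: `rx_next`, `TX` and `link_up` are names made up here, both ports are assumed to start in LF1, and one step stands for "each side has seen the other side's current primitive", which ignores all the timing that the real hardware has to deal with.

```python
# Default primitive transmitted in each state (from the transmit state machine).
TX = {"AC": "ARBFF", "LR1": "LR", "LR2": "LRR", "LR3": "IDLE", "LF1": "OLS",
      "LF2": "NOS", "OL1": "OLS", "OL2": "LR", "OL3": "NOS"}

def rx_next(state, prim):
    """Next state after receiving `prim` (from the receive state machine)."""
    if prim == "OLS":
        return "OL2"
    if prim == "NOS":
        return "LF1"
    if prim == "LR":
        return "LF2" if state in ("OL3", "LF2") else "LR2"
    if prim == "LRR":
        stay = {"LF1": "LF1", "LF2": "LF2", "OL1": "OL1", "OL3": "LF2"}
        return stay.get(state, "LR3")
    if prim in ("IDLE", "ARBFF"):
        moves = {"LR2": "AC", "LR3": "AC", "OL3": "OL2"}
        return moves.get(state, state)
    return state

def link_up(a="LF1", b="LF1", max_steps=16):
    """Step two ports against each other until both are Active."""
    trace = [(a, b)]
    for _ in range(max_steps):
        a, b = rx_next(a, TX[b]), rx_next(b, TX[a])
        trace.append((a, b))
        if a == b == "AC":
            break
    return trace

print(" -> ".join(a for a, _ in link_up()))
```

With both ends starting in LF1 the model walks through the same states as the dance described above.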

State transitions of FC-1 in simulator

IDLE and Emission Lowering Protocol

When a link is Active and there is nothing to send the sender is supposed to use the correct Fill Word. For FC at 4GFC or below this is the primitive called IDLE. However, on 8GFC this changed in what is called Emission Lowering Protocol (ELP). The resulting primitive to be used is ARBff - don't ask why it's named that, I have no clue. I haven't looked closely at the differences between IDLE and ARBff but I assume it has nicer disparity implications, and hence the emissions are lowered. Or something.

Looking at a Brocade SAN switch you can see the confusion this caused in the industry when this was launched.

br5100:admin# portcfgfillword
Usage: portCfgFillWord PortNumber Mode [Passive]

Mode: 0/-idle-idle   - IDLE in Link Init, IDLE as fill word (default)
      1/-arbff-arbff - ARBFF in Link Init, ARBFF as fill word
      2/-idle-arbff  - IDLE in Link Init, ARBFF as fill word (SW)
      3/-aa-then-ia  - If ARBFF/ARBFF failed, then do IDLE/ARBFF

It's at the point where the Brocade SAN switch, which is often said to be one of the best SAN switches in the world, does not do the correct thing and enable ELP by default. It defaults to using IDLE, as stories on the internet claim this is the more widely supported mode. If you were to read the standard, however, mode 2 above is the one it wants - IDLE while the link is not active and ARBff after. It's interesting to note that this mode is implemented in software and that Brocade chose to call it out. Why that is one can only speculate - but I hope the emissions are lowered to the point where all this complexity was worth it.

Fejkon always uses ELP but does not care whether it receives IDLE or ARBff, in order to be as compatible as possible with your gear.

Skip Word

Something that confused me quite a bit is that FC does not seem to have a "skip word". This is supposed to be some 8b/10b symbol or symbols that you can inject into a stream as a clock synchronization pattern. It seems some protocols have that, I seem to recall Ethernet does, but FC does not. This means that having clocks running at as accurate frequencies as possible is very important in FC as far as I can tell.

Summary

Right now the Fejkon card can link up to an FC switch, look at and forward traffic, and be detected by its Linux driver as network interfaces. It also has some gadgets like a temperature sensor, a frequency counter for the FC frequency, and a QEMU simulation model for driver development.

What is next is writing the PCIe DMA engine. This blog article represents a milestone in fejkon where I feel I have finished the FC parts more or less, and the next biggest fish to fry is doing the high-speed PCIe data transfers. Expect there to be a blog article just like this one but for PCIe in the future :-).

In the meantime, please reach out to me if you have questions or feedback on this project, either in the comments here, on Twitter, or on GitHub.

As always, thanks for reading.

mainframe.dev