Who needs VRAM anyway? Or, calculating Mandelbrot on an FPGA

About a month ago Terasic had a 50% off on their DE2 board if you could prove that you're a student (check) and attended a course using FPGAs (check). Sadly I was involved in too many other things at the moment to actually use it.

Fast forward to a week ago. I finally took the time to install Quartus and begin playing around with the hardware. All my previous encounters with FPGAs and CPLDs where with VHDL and cumbersome programs like WebPACK (slow) and HDL Designer (complex). Since my first to programs had been a disappointment I had almost given up on finding a good environment (what's wrong with a CLI compiler and Makefiles!?) but after spending a day or so in Quartus I can honestly say I enjoy it.

During the course mentioned, one laboratory exercise was to create a functional VGA generator using only a VGA DAC located on the board. I had since then become very curious about Verilog, which seemed to fit my preference of C over Ada quite well. This made chosing my first project quite easy; porting the VGA generator to Verilog.

First VGA success!

Success! After some minor amount of bugs from translating the code it worked! I was very happy to have had such easy victory over something that potentially could have been so much hassle. To my knowledge the fitter didn't even mess things up - which I have had my share of issues with in other tools.

Now then, the VGA output generation works but it is not that thrilling to stare at a green box. Some time ago I discussed with a friend about the possibility of implemented a pipelined Mandelbrot generator using a fixed-point decimal system. We thought that the easy operations of Mandelbrot wouldn't pose a problem really and though the idea was plausible. This provided the seed for the next part: generating Mandelbrot in hardware.

A quick walkthrough on Mandelbrot. The Mandelbrot set consists of the points where the equation z_n+1 = z_n^2 + c is bounded (i.e does not diverge to infinity). A rule of thumb can be applied saying that if |z_n| > 2 then it will diverge. Coloring all points belonging to the set white and using 100 iterations before giving up and assume that it is bounded results in the following image.

Mandelbrot set colored white.

That's nice and all, but the real beauty of Mandelbrot sets become apparent if we instead color each pixel with a intensity proportional to the amount of iterations it takes before we know that it is not bounded - if it is bounded (i.e we gave up) we color it black.

Mandelbrot set colored green w/ varied intensity.

Back to the hardware! A normal Mandelbrot set is calculated from -2.0 to 1 (real) and -1.5 and 1.5 (imaginary) or something close to that. This poses a problem: floating point calculations. I do not want to do floating point calculations in the processor because (1) it lacks the blocks for it and (2) it would be a waste of resources I imagine (I would love to be proven wrong here). I decided to go for the solution previously discussed, namely a fixed-point decimal system.

A fixed-point decimal system is basically a scale applied to the initial number. After some experimentation I found that 25 bit of decimal, 7 bit of integer and 1 sign bit probably would be good - total 33 bits.

Since addition and subtraction works as usual with fixed-point as they do with integers, the only operator I needed to implement was multiplication. This is implemented as ab = a*b*1/scale where 1/scale is a power of two and thus implemented as a right shift.

Simple, isn't it? Moving on the Mandelbrot calculation.

The calculations for out_re and out_im are very straight forward to derive and I will not do so in this post, but I will mention the magic value 134217728. 134217728 is 4 in this fixed point system, remember that the rule of thumb was that |Z| > 2? We implement that as Re(z)^2 + Im(z)^2 > 4. I am very satisfied in how clean the implementation of this one iteration ended up being.

Final step, pushing the limits of the hardware and adding as many of these as possible. Testing various combinations proved that a 12 stage pipeline with one calculation per stage is the best. It was possible to have multiple calculations per stage up until some limit at which the fitter would scream that I violated setup and hold times. Since the results are quite funky, I have chosen to include them at the end with a quick description of my diagnosis why they look like they do.

Anyway, final image and the interesting parts of the code follows.