Skip to main content

Hardest bug I ever solved

A lot of people probably have that hard bug they spent a lot of hours trying to solve. Almost 1.5 years after solving my hardest bug I got nostalgic and thought "Hey, this was pretty cool! I should write about it to not forget it". Since it's so long ago there might be things I misremember, but the gist of it should still apply.

Let's jump to it! Presenting: "The curious case of the crashing compiler".
Compiling the following code works in my standard GCC but causes another GCC to crash:
#include <float.h>
#define PyLong_BASE (1 << 15)
int main(int argc, char* argv[]) {

       int base = 10;
       static double data[37] = {0.0e0,};
       data[base] = (log((double)base) / log((double)(1<<15)));
       return 0;

The weird thing? It's the same Binutils/GCC built with the same flags - two seemingly identical GCCs.

This is when porting tools to a new platform so there is no GDB and so on. After debugging this by putting trace printouts everywhere, it becomes apparent that the linker crashes in libMPFR.

int mpfr_const_catalan (mpfr_ptr x, mpfr_rnd_t rnd_mode) {
 return mpfr_cache (x, __gmpfr_cache_const_catalan, rnd_mode);

Compiling an unoptimized libMPFR with logging makes everything work - which was the case for the first GCC and why it worked. What would cause this?

The hint is that __gmpfr_cache_const_catalan is a so called thread-local variable when logging is not enabled:

/* Declare the cache */
MPFR_DECL_INIT_CACHE(__gmpfr_cache_const_pi, mpfr_const_pi_internal);
MPFR_DECL_INIT_CACHE(__gmpfr_normal_pi, mpfr_const_pi_internal);
MPFR_DECL_INIT_CACHE(__gmpfr_logging_pi, mpfr_const_pi_internal);
mpfr_cache_ptr MPFR_THREAD_ATTR __gmpfr_cache_const_pi = __gmpfr_normal_pi;

Debugging this I know from my IRC logs that I found and fixed a bug with regards to the so called TLS relocations in our binutils port, but I cannot access the commit anymore since all patches were squashed together in preparation for upstreaming. What I have is this line:

< blueCmd> stekern: the human readable version: I think it was a race depending on which relocation was read first and that TLS and non-TLS locations would change the sreloc pointer in the section back and forth
Alas, now the problem is that linker works, but GCC crashes instead.

After some more printouts I arrive at the conclusion that the call to mpfr_cache would get called with a totally bogus value for x and the pointer __gmpfr_cache_const_catalan would be something very obviously broken (like 0x1). Inside mpfr_const_catalan it would however be fine.

A thread-local variable can be resolved in a bunch of different ways depending on how "close" the compiler can assume the real value is. The furthest one requires a call to our C library for a function called __tls_get_addr that will resolve the pointer to something that is valid in the current running thread. Since this only happens when TLS is used we'r now pretty confident that the bug is somewhere in the TLS code.

Let's have a look on how mpfr_const_catalan is generated. On OpenRISC 1200 the calling convention is to place the first argument in r3. Remember that we want to pass the first argument straight through to mpfr_cache, so we shouldn't touch it. This is how the assembly was generated (I removed a lot of unrelated stuff):

Note: OpenRISC 1200 has a delay-slot (indented) after jumps, so you need to think out-of-order.
       l.movhi         r3,tlsgdhi(__gmpfr_cache_const_catalan)
# ^^-- r3 clobbered!
       l.ori           r3,r3,tlsgdlo(__gmpfr_cache_const_catalan)
       l.jal           8
        l.movhi         r16,gotpchi(_GLOBAL_OFFSET_TABLE_-4)
       l.ori           r16,r16,gotpclo(_GLOBAL_OFFSET_TABLE_+0)
       l.add           r16,r16,r9
       l.add           r3,r3,r16
       l.jal           plt(__tls_get_addr)
        l.add           r3,r4,r16
       l.jal           plt(mpfr_cache)
        l.ori           r4,r11,0
# ^^-- Real address for __gmpfr_cache_const_catalan as 2nd arg
And at this point I had a clear case what was wrong, but as I expressed it a few hours after finding this:

< blueCmd> I honestly have no idea how to solve this
This feeling of helplessness had never struck me before. I had none I could ask, I had nobody that knew the system better than me. This was unique for me and I was very excited and frustrated at the same time. Fast forward a few days of ball planking and looking at old bug reports of seemingly maybe-related-but-probably-not bug reports in GCC me and stekern arrived that the code is probably emitting the library call to __tls_get_addr at the wrong state in the code expansion (remember! this is not a normal function call, it's inferred by accessing a specific pointer) and therefore messing up the register scheduling.

Moving the so called "address legitimization" to an early stage (sounds easy but is a *pain*) and removing some parts where I tried to be clever fixed the issues and made the generated code use sensible registers. The clever part was that I thought I was helping GCC by telling it that "you can use the same register multiple times at this point" causing GCC to make some weird optimizations that were very broken. Simply letting GCC schedule that by itself solved that problem.

Special thanks to stekern for all the help and idea bouncing through all of this! GCC code is not easy to reason about, or as stekern and I put it:
< blueCmd> gah, gcc's damn .md files - they are confusing as f*ck
< stekern> you don't say
< stekern> I can't count the number of times I've went "ah, now I get it!, I don't" =)
< blueCmd> "error: unrecognizable insn:" *sigh*

When asked "What is the hardest bug you ever solved?" I will probably think of this for a long time.


Popular posts from this blog

Buying an IBM Mainframe

I bought an IBM mainframe for personal use. I am doing this for learning and figuring out how it works. If you are curious about what goes into this process, I hope this post will interest you.

I am not the first one by far to do something like this. There are some people on the internet that I know have their own personal mainframes, and I have drawn inspiration from each and every one of them. You should follow them if you are interested in these things:
@connorkrukosky@sebastian_wind@faultywarrior@kevinbowling1 This post is about buying an IBM z114 mainframe (picture 1) but should translate well to any of the IBM mainframes from z9 to z14.

What to expect of the process Buying a mainframe takes time. I never spent so much time on a purchase before. In fact - I purchased my first apartment with probably less planning and research. Compared to buying an apartment you have no guard rails. You are left to your own devices to ensure the state of whatever thing you are buying as it likely…

Powering a mainframe

The last few days have been eventful. I was contacted by the datacenter that the mainframe's cage is now ready for moving in, and the power has been made available. Very exciting! I grabbed my home-made power cables (more on that later) and my best screwdrivers and set off to the datacenter.

The datacenter staff, not needing a forklift in their day-to-day, had managed to solicit the services of a forklift, the associated operator, and some very handy folks to help navigate the mainframe from the storage space to its final location.

After some intense period of fighting the inclination of the road between the storage facility and the cage (and a door that was a bit too small) it was finally in place. Incidentally we were forced to trust the wheels on this pretty rough floor. I did not expect it to roll that well on raw concrete, I was pleasantly surprised. This thing is a tank!

Now, everybody wanted to see if it was working. My machine did not come with a power cable so I had to so…

Open Datacenter Hardware - Leopard Server

Introduction The Leopard is an OpenRack v1 compliant 12V server commissioned by Facebook to offer compute power. It consists of 2x Intel Xeon E5-2678 v3 and is available with either DDR3 or DDR4 memory. The model is manufactured by two vendors primarily: Quanta and Wiwynn.

Leopard features a 24x PCIe slot which can fit either a PCIe card with low profile, or a riser card with 1x 16x and 1x 8x slots. The server also supports a 3.5" SATA drive as well as either an mSATA or an M.2 drive mounted on the motherboard.

Connectivity wise Leopard has a mezzanine card slot allowing for example 10Gb/s or 25Gb/s Ethernet.

Figure 1 and figure 2 shows the server layout. The server is made to fit inside an OpenRack v1 enclosure, at which point it looks something like figure 3. Due to power constraints an OpenRack v1 can fit 30 of these servers before power starts to become an issue. The Leopard servers that the organization Serverfarm Upstate provides are all fitted with 256GiB DDR3 RAM and 2x …