Skip to main content

Hardest bug I ever solved


A lot of people probably have that hard bug they spent a lot of hours trying to solve. Almost 1.5 years after solving my hardest bug I got nostalgic and thought "Hey, this was pretty cool! I should write about it to not forget it". Since it's so long ago there might be things I misremember, but the gist of it should still apply.

Let's jump to it! Presenting: "The curious case of the crashing compiler".
Compiling the following code works in my standard GCC but causes another GCC to crash:
#include <float.h>
#define PyLong_BASE (1 << 15)
int main(int argc, char* argv[]) {

       int base = 10;
       static double data[37] = {0.0e0,};
       data[base] = (log((double)base) / log((double)(1<<15)));
       return 0;
}

The weird thing? It's the same Binutils/GCC built with the same flags - two seemingly identical GCCs.

This is when porting tools to a new platform so there is no GDB and so on. After debugging this by putting trace printouts everywhere, it becomes apparent that the linker crashes in libMPFR.

int mpfr_const_catalan (mpfr_ptr x, mpfr_rnd_t rnd_mode) {
 return mpfr_cache (x, __gmpfr_cache_const_catalan, rnd_mode);
}

Compiling an unoptimized libMPFR with logging makes everything work - which was the case for the first GCC and why it worked. What would cause this?

The hint is that __gmpfr_cache_const_catalan is a so called thread-local variable when logging is not enabled:

/* Declare the cache */
#ifndef MPFR_USE_LOGGING
MPFR_DECL_INIT_CACHE(__gmpfr_cache_const_pi, mpfr_const_pi_internal);
#else
MPFR_DECL_INIT_CACHE(__gmpfr_normal_pi, mpfr_const_pi_internal);
MPFR_DECL_INIT_CACHE(__gmpfr_logging_pi, mpfr_const_pi_internal);
mpfr_cache_ptr MPFR_THREAD_ATTR __gmpfr_cache_const_pi = __gmpfr_normal_pi;
#endif

Debugging this I know from my IRC logs that I found and fixed a bug with regards to the so called TLS relocations in our binutils port, but I cannot access the commit anymore since all patches were squashed together in preparation for upstreaming. What I have is this line:

< blueCmd> stekern: the human readable version: I think it was a race depending on which relocation was read first and that TLS and non-TLS locations would change the sreloc pointer in the section back and forth
Alas, now the problem is that linker works, but GCC crashes instead.

After some more printouts I arrive at the conclusion that the call to mpfr_cache would get called with a totally bogus value for x and the pointer __gmpfr_cache_const_catalan would be something very obviously broken (like 0x1). Inside mpfr_const_catalan it would however be fine.

A thread-local variable can be resolved in a bunch of different ways depending on how "close" the compiler can assume the real value is. The furthest one requires a call to our C library for a function called __tls_get_addr that will resolve the pointer to something that is valid in the current running thread. Since this only happens when TLS is used we'r now pretty confident that the bug is somewhere in the TLS code.

Let's have a look on how mpfr_const_catalan is generated. On OpenRISC 1200 the calling convention is to place the first argument in r3. Remember that we want to pass the first argument straight through to mpfr_cache, so we shouldn't touch it. This is how the assembly was generated (I removed a lot of unrelated stuff):

Note: OpenRISC 1200 has a delay-slot (indented) after jumps, so you need to think out-of-order.
       l.movhi         r3,tlsgdhi(__gmpfr_cache_const_catalan)
# ^^-- r3 clobbered!
       l.ori           r3,r3,tlsgdlo(__gmpfr_cache_const_catalan)
       l.jal           8
        l.movhi         r16,gotpchi(_GLOBAL_OFFSET_TABLE_-4)
       l.ori           r16,r16,gotpclo(_GLOBAL_OFFSET_TABLE_+0)
       l.add           r16,r16,r9
       l.add           r3,r3,r16
       l.jal           plt(__tls_get_addr)
        l.add           r3,r4,r16
       ...
       l.jal           plt(mpfr_cache)
        l.ori           r4,r11,0
# ^^-- Real address for __gmpfr_cache_const_catalan as 2nd arg
And at this point I had a clear case what was wrong, but as I expressed it a few hours after finding this:

< blueCmd> I honestly have no idea how to solve this
This feeling of helplessness had never struck me before. I had none I could ask, I had nobody that knew the system better than me. This was unique for me and I was very excited and frustrated at the same time. Fast forward a few days of ball planking and looking at old bug reports of seemingly maybe-related-but-probably-not bug reports in GCC me and stekern arrived that the code is probably emitting the library call to __tls_get_addr at the wrong state in the code expansion (remember! this is not a normal function call, it's inferred by accessing a specific pointer) and therefore messing up the register scheduling.

Moving the so called "address legitimization" to an early stage (sounds easy but is a *pain*) and removing some parts where I tried to be clever fixed the issues and made the generated code use sensible registers. The clever part was that I thought I was helping GCC by telling it that "you can use the same register multiple times at this point" causing GCC to make some weird optimizations that were very broken. Simply letting GCC schedule that by itself solved that problem.

Special thanks to stekern for all the help and idea bouncing through all of this! GCC code is not easy to reason about, or as stekern and I put it:
< blueCmd> gah, gcc's damn .md files - they are confusing as f*ck
< stekern> you don't say
< stekern> I can't count the number of times I've went "ah, now I get it! ...no, I don't" =)
< blueCmd> "error: unrecognizable insn:" *sigh*

When asked "What is the hardest bug you ever solved?" I will probably think of this for a long time.

Comments

Popular posts from this blog

Buying an IBM Mainframe

I bought an IBM mainframe for personal use. I am doing this for learning and figuring out how it works. If you are curious about what goes into this process, I hope this post will interest you. I am not the first one by far to do something like this. There are some people on the internet that I know have their own personal mainframes, and I have drawn inspiration from each and every one of them. You should follow them if you are interested in these things: @connorkrukosky @sebastian_wind @faultywarrior @kevinbowling1 This post is about buying an IBM z114 mainframe (picture 1) but should translate well to any of the IBM mainframes from z9 to z14. Picture 1: An IBM z114 mainframe in all its glory Source: IBM What to expect of the process Buying a mainframe takes time. I never spent so much time on a purchase before. In fact - I purchased my first apartment with probably less planning and research. Compared to buying an apartment you have no guard rails. You are left

System z on contemporary zLinux

IBM System z supports a handful of operating systems; z/VM, z/VSE, z/OS, z/TPF, and finally zLinux. All the earlier mentioned OSes are proprietary except for zLinux which is simply Linux with a fancy z in the name. zLinux is the term used to describe a Linux distribution compiled for S390 (31 bit) or S390X (64 bit). As we are talking about modern mainframes I will not be discussing S390, only S390X. There is a comfortable amount of distributions that support S390X - more or less all of the popular distributions do. In this  list  we find distributions like Debian, Ubuntu, Gentoo, Fedora, and RHEL. Noticeably Arch is missing but then again they only have an official port for x86-64. This is great - this means that we could download the latest Ubuntu, boot the DVD, and be up and running in no time, right? Well, sadly no. The devil is, as always, in the details. When compiling high level code like C/C++/Go the compiler needs to select an instruction set to use for the compiled binar

Brocade Fabric OS downloads

Fabric OS is what runs on the SAN switches I will be using for the mainframe. It has a bit of annoying upgrade path as the guldmyr blog can attest to. TL;DR is that you need to do minor upgrades (6.3 -> 6.4 -> 7.0 -> ... > 7.4) which requires you to get all  Fabric OS images for those versions. Not always easy. So, let's make it a bit easier. Hopefully this will not end up with the links being taken down, but at least it helped somebody I hope. These downloads worked for me and are hash-verified when I could find a hash to verify against. Use at your own risk etc. The URLs are: ftp://ftp.hp.com/pub/softlib/software13/COL59674/co-168954-1/v7.3.2a.zip ftp://ftp.hp.com/pub/softlib/software13/COL59674/co-157071-1/v7.2.1g.zip ftp://ftp.hp.com/pub/softlib/software13/COL59674/co-150357-1/v7.1.2b.zip ftp://ftp.hp.com/pub/softlib/software12/COL38684/co-133135-1/v7.0.2e.zip ftp://ftp.hp.com/pub/softlib/software13/COL22074/co-155018-1/v6.4.3h.zip ftp://ftp.hp.c