My favorite bug
— Competitive Programming, Stories — 15 min read
It was early November 2013, and me and a friend of mine, Helgi, were preparing to compete in our annual ACM-ICPC regional contest. We had been adding various algorithms and data structures to our code library, which we were allowed to bring with us to the contest. He had recently taken an Operating Systems course, where he had implemented a Skip list for use in a memory allocator. To my knowledge, Skip lists have not been used much in programming contests, but since it is a relatively simple (to code) data structure that supports the same operations as balanced binary search trees with the same (expected) time complexity, we thought it might be a good idea to add it to our library. (Note that C++, which was our programming language of choice, already provides a balanced binary search tree through its standard library, but it doesn’t support augmentation, which is something that is frequently needed in programming contests.)
So one evening we sat down and did just that. He found his old implementation, which was written in low-level optimized C, and we started translating it to a more usable high-level C++ implementation. When we were done, we started testing it to make sure the implementation was correct, and that we didn’t mess anything up in the translation. At first we made some small manual tests, and things looked good. This, of course, wasn’t very exhaustive, so after that we started generating large random test cases, and then compared the Skip list’s results to C++’s set. Looking at some old Github commits, I found that the program looked basically like this:
#include <algorithm>#include <cassert>#include <cmath>#include <cstdio>#include <cstdlib>#include <cstring>#include <ctime>#include <iomanip>#include <iostream>#include <map>#include <queue>#include <set>#include <sstream>#include <stack>#include <string>#include <utility>#include <vector>using namespace std;
#define all(o) (o).begin(), (o).end()#define allr(o) (o).rbegin(), (o).rend()#define pb push_backconst int INF = 2147483647;const double EPS = 1e-9;const double pi = acos(-1);typedef long long ll;typedef unsigned long long ull;typedef pair<int, int> ii;typedef vector<int> vi;typedef vector<ii> vii;typedef vector<vi> vvi;typedef vector<vii> vvii;template <class T> T mod(T a, T b) { return (a % b + b) % b; }template <class T> int size(T x) { return x.size(); }
#define BP 0.20#define MAX_LEVEL 10unsigned int bernoulli(unsigned int MAX) {    unsigned int cnt = 0;    while(((float) rand() / RAND_MAX) < BP && cnt < MAX) cnt++;    return cnt;}template<class T>class skiplist {public:    struct node {        T item;        int lens[MAX_LEVEL + 1];        node **next;        node(int level, T i) : item(i), next((node**)calloc(level + 1, sizeof(node*))) { memset(lens, 0, sizeof(lens)); };        ~node() { free(next); next = NULL; };    };    int current_level, _size;    node *head;    skiplist() : current_level(0), _size(0), head(new node(MAX_LEVEL, 0)) { };    ~skiplist() { clear(); delete head; head = NULL; }    #define FIND_UPDATE(cmp, target) \        int pos[MAX_LEVEL + 2]; \        memset(pos, 0, sizeof(pos)); \        node *x = head; \        node *update[MAX_LEVEL + 1]; \        memset(update, 0, MAX_LEVEL + 1); \        for(int i = MAX_LEVEL; i >= 0; i--) { \            pos[i] = pos[i + 1]; \            while(x->next[i] != NULL && cmp < target) { pos[i] += x->lens[i]; x = x->next[i]; } \            update[i] = x; \        } x = x->next[0];    int size() { return _size; }    void clear() { while(head->next && head->next[0]) erase(head->next[0]->item); }    node *find(T target) {        FIND_UPDATE(x->next[i]->item, target);        return x && x->item == target ? x : NULL;    }    node* insert(T target) {        FIND_UPDATE(x->next[i]->item, target);        // SET        if(x && x->item == target) return x;        int lvl = bernoulli(MAX_LEVEL);        if(lvl > current_level) current_level = lvl;        x = new node(lvl, target);        for(int i = 0; i <= lvl; i++) {            x->next[i] = update[i]->next[i];            x->lens[i] = pos[i] + update[i]->lens[i] - pos[0];            update[i]->next[i] = x;            update[i]->lens[i] = pos[0] + 1 - pos[i];        }        for(int i = lvl + 1; i <= MAX_LEVEL; i++) update[i]->lens[i]++;        _size++;        return x;    }    void erase(T target) {        FIND_UPDATE(x->next[i]->item, target);        if(x && x->item == target) {            for(int i = 0; i <= current_level; i++) {                if(update[i]->next[i] == x) {                    update[i]->next[i] = x->next[i];                    update[i]->lens[i] = update[i]->lens[i] + x->lens[i] - 1;                } else update[i]->lens[i] = update[i]->lens[i] - 1;            }            delete x; _size--;            while(current_level > 0 && head->next[current_level] == NULL) current_level--;        }    }};
int main() {    int cnt = 10000,        range = 1000;
    skiplist<int> t1;    set<int> t2;
    assert(t1.size() == 0);
    for (int i = 0; i < cnt; i++) {        int n = rand() % range;        skiplist<int>::node *p = t1.insert(n);        assert(p->item == n);        t2.insert(n);        assert(size(t1) == size(t2));
        int n1 = rand() % range;        skiplist<int>::node *b = t1.find(n1);        if (b) assert(b->item == n1);        assert((t2.find(n1) == t2.end()) == (b == NULL));
        int n2 = rand() % range;        t1.erase(n2);        t2.erase(n2);        assert(size(t1) == size(t2));    }
    t1.clear();    t2.clear();
    assert(t1.size() == 0);
    for (int i = 0; i < cnt; i++) {        int n = rand() % range;        skiplist<int>::node *p = t1.insert(n);        assert(p->item == n);        t2.insert(n);        assert(size(t1) == size(t2));
        int n1 = rand() % range;        skiplist<int>::node *b = t1.find(n1);        if (b) assert(b->item == n1);        assert((t2.find(n1) == t2.end()) == (b == NULL));
        int n2 = rand() % range;        t1.erase(n2);        t2.erase(n2);        assert(size(t1) == size(t2));    }
    for (int i = 0; i < range; i++) {        t1.erase(i);        t2.erase(i);        assert(size(t1) == size(t2));    }
    assert(t1.size() == 0);
    return 0;}However, the program crashed every time we ran it, outputting this nasty error message:
*** Error in `./a.out': double free or corruption (fasttop): 0x00000000012fbc60 ***======= Backtrace: =========libc.so.6(+0x70aa6)[0x7f9fbf3d9aa6]libc.so.6(+0x76d96)[0x7f9fbf3dfd96]libc.so.6(+0x7759e)[0x7f9fbf3e059e]./a.out[0x4015d1]libc.so.6(__libc_start_main+0xf0)[0x7f9fbf3892e0]./a.out[0x402baa]======= Memory map: ========00400000-00404000 r-xp 00000000 08:01 19922970              ./a.out00603000-00604000 r--p 00003000 08:01 19922970              ./a.out00604000-00605000 rw-p 00004000 08:01 19922970              ./a.out012ea000-0131c000 rw-p 00000000 00:00 0                     [heap]7f9fb8000000-7f9fb8021000 rw-p 00000000 00:00 07f9fb8021000-7f9fbc000000 ---p 00000000 00:00 07f9fbf369000-7f9fbf4fe000 r-xp 00000000 08:01 16136728      libc-2.24.so7f9fbf4fe000-7f9fbf6fd000 ---p 00195000 08:01 16136728      libc-2.24.so7f9fbf6fd000-7f9fbf701000 r--p 00194000 08:01 16136728      libc-2.24.so7f9fbf701000-7f9fbf703000 rw-p 00198000 08:01 16136728      libc-2.24.so7f9fbf703000-7f9fbf707000 rw-p 00000000 00:00 07f9fbf707000-7f9fbf71d000 r-xp 00000000 08:01 16136742      libgcc_s.so.17f9fbf71d000-7f9fbf91c000 ---p 00016000 08:01 16136742      libgcc_s.so.17f9fbf91c000-7f9fbf91d000 rw-p 00015000 08:01 16136742      libgcc_s.so.17f9fbf91d000-7f9fbfa21000 r-xp 00000000 08:01 16136743      libm-2.24.so7f9fbfa21000-7f9fbfc20000 ---p 00104000 08:01 16136743      libm-2.24.so7f9fbfc20000-7f9fbfc21000 r--p 00103000 08:01 16136743      libm-2.24.so7f9fbfc21000-7f9fbfc22000 rw-p 00104000 08:01 16136743      libm-2.24.so7f9fbfc22000-7f9fbfd8b000 r-xp 00000000 08:01 16137577      libstdc++.so.6.0.217f9fbfd8b000-7f9fbff8a000 ---p 00169000 08:01 16137577      libstdc++.so.6.0.217f9fbff8a000-7f9fbff96000 r--p 00168000 08:01 16137577      libstdc++.so.6.0.217f9fbff96000-7f9fbff97000 rw-p 00174000 08:01 16137577      libstdc++.so.6.0.217f9fbff97000-7f9fbff9a000 rw-p 00000000 00:00 07f9fbff9a000-7f9fbffbd000 r-xp 00000000 08:01 16136719      ld-2.24.so7f9fc01b4000-7f9fc01bc000 rw-p 00000000 00:00 07f9fc01bc000-7f9fc01bd000 r--p 00022000 08:01 16136719      ld-2.24.so7f9fc01bd000-7f9fc01be000 rw-p 00023000 08:01 16136719      ld-2.24.so7f9fc01be000-7f9fc01bf000 rw-p 00000000 00:00 07ffe87ac4000-7ffe87ae6000 rw-p 00000000 00:00 0             [stack]7ffe87b0d000-7ffe87b0f000 r--p 00000000 00:00 0             [vvar]7ffe87b0f000-7ffe87b11000 r-xp 00000000 00:00 0             [vdso]ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0     [vsyscall][1]    3278 abort      ./a.outSo somewhere in our code was a bug that was causing this memory corruption. We spent a good while staring at the code looking for a possible explanation, but turned up nothing. Actually, if you want, this would be a good time to look at the code yourself, but I have a feeling you’ll reach the same conclusion.
It came to mind that maybe we made an error during the translation. So we went back to the original implementation, modified our test generator to use its low-level interface, and then ran the program again. It worked like a charm, with no errors! So now we knew that we must have made some error during the translation. So we went back to the translated code, and probably stared at it a little longer. But since that was getting us nowhere, we decided to bring out the big guns, and went on to run the program with GDB.
My friend led the way, as he had more experience with GDB. My memory is a bit fuzzy here, but this is roughly what happened. The first thing we observed was that the error was occurring in a node’s destructor, where we were freeing some memory that didn’t belong to us. Unfortunately this didn’t give us much information about why this was happening.
After spending more time in the debugger, we got our first big break. It seemed that the Skip list’s destructor was being called more than once. This would explain the odd behavior we were seeing with the node’s destructor, as well as the memory errors. And indeed, after modifying the code to output some debugging information, we verified that this was the case. But that was very strange. We weren’t calling the destructor explicitly, so the only place it should have been called was (implicitly) at the end of the program.
So we went back to staring at the code, knowing that the destructor was unexpectedly being called multiple times. All of a sudden we realize what was happening. The way I remember it, me and my friend noticed this at exactly the same time, and we just looked at each other, both speechless. But this only lasted a second before we burst into laughter, realizing how hilarious this was.
The perpetrator was the size() function. Not the Skip list’s size() function, but the size() function I had defined at the top of the code as follows:
template <class T> int size(T x) { return x.size(); }It’s a convenience function I had defined to get rid of some warnings about comparisons between signed and unsigned integers, and we had used it a few times in the testing code. For example, when making sure that the number of elements in the Skip list (t1) was equal to the number of elements in the set (t2), we did the following:
assert(size(t1) == size(t2));So what exactly is wrong with the function? The issue is that the parameter x is being passed by value (which is the default mode in C++) as opposed to being passed by reference. This means that every time we call the function, passing it some object (in this case, the Skip list) the following happens:
- The object is copied,
- the copy is passed on to the function,
- the function executes using this copy of the object, and when the function terminates,
- the copy is destroyed (by calling the destructor).
So every time the test code called size(t1), it was making a copy of the Skip list using the copy constructor, and then calling the destructor of the copy.
When the copy constructor or the destructor functions are not implemented, C++ will provide a sane default implementation. And actually, if we had just used the default implementations, we wouldn’t have gotten this error. However, when the object is allocating memory (as was the case with the Skip list), one usually wants to make a custom implementation of the copy constructor to copy the actual value of the allocated memory (not just the pointer as the default implementation does), and the destructor to actually free the memory (which the default implementation does not). But we had only implemented the destructor, not the copy constructor. So when a copy of the Skip list was being made (for the size() function), the default copy constructor simply copied the pointers over. Then, when the copy’s destructor was called, it freed the memory of the pointers. And since this was the memory that the original copy of the Skip list was using, it was now unusable. But the test code continued anyway, running into lots of memory errors, and eventually crashing.
But this is not the only bad effect that the size() function could cause. Consider the following piece of code:
vector arr(1000000);for (int i = 0; i < size(arr); i++) {    arr[i] = 42;}It creates a vector of a million integers, then iterates through the vector, setting every value to 42. Usually this would run in a few milliseconds on your average computer (4ms on my machine, actually). However, since size()‘s parameter is passed by value, the vector of a million elements is copied on each iteration of the loop. And since there are a million iterations, this will take a long time… In fact, this ran for just over 13 minutes and 38 seconds on my machine.
Anyway. Let’s shift back to the exact moment when me and my friend realize what was causing the memory error. Not only did I realize the possible effects of the size() function as it was currently written, it also dawned on me that, since this function actually came from my default C++ template that I had configured my editor to import automatically every time I created a new C++ file, this error was present in basically all C++ code I had written for a long time. A few months at least!
This was of course trivial to fix, as can be seen by this commit, and our test program ran flawlessly after fixing this.
Aftermath
Man, was I happy to have found this. And, boy, was it a great learning experience. I have been very paranoid about how functions are declared—actually any unexpected thing that could be happening underneath our implementation—since then. I hope that’s a good thing…
But I’ve thought about this experience multiple times since it happened.
Why did it happen in the first place? As I mentioned before, the reason I started using this size() function was to get rid of a compiler warning. In C++’s standard library most containers’ size() function returns an integer of the type size_t, which is an unsigned integer. So if you compile something like the following piece of code:
for (int i = 0; i < v.size(); i++) {
}the compiler will give you a warning about a comparison between a signed and an unsigned integer. Seeing a ton of these warnings will get tiring pretty quickly. Another related issue is the following. Say you want to iterate through all but the last element of a container. You might implement this as follows:
for (int i = 0; i < v.size() - 1; i++) {
}But this code will not work correctly when the container is empty. Since v.size() returns an unsigned integer (of value zero in this case), subtracting one will underflow, giving the largest representable value of size_t instead of the expected -1. This, in turn, will result in an infinite loop.
An obvious simple fix to resolve both of the above issues is to cast the result to an integer, as follows:
for (int i = 0; i < (int)v.size() - 1; i++) {
}But since one writes such for loops so frequently, it would be pretty annoying to write this every time. Another way, which I think is quite common in the competitive programming community, is to define something like the following macro:
#define SZ(c) (int)(c).size()and then use it as follows:
for (int i = 0; i < SZ(v) - 1; i++) {
}I had probably seen this macro a few times when I was starting out doing competitive programming, and at some point I decided to make my own version. I thought SZ looked ugly, and I was more used to typing size, so I wanted to keep on doing that. But creating a macro with the name size would have been a bit dangerous as size is a common variable name, which would cause trouble. So I went the other way and created the following function instead of a macro:
template <class T> int size(T x) { return x.size(); }I’m not sure why I didn’t make the parameter pass-by-reference, but I’m sure I knew what the difference was before I started doing competitive programming (and hence, before I wrote this function). But it’s an easy mistake to make, even by an experienced coder who isn’t using his full brain power when implementing such a seemingly trivial function.
I’ve also wondered why I didn’t run into issues earlier. One contributing factor was probably that I was inexperienced, and didn’t know how fast I should expect such loops to run. Another contributing factor might also have been the fact that competitive programmers usually don’t have to be concerned with writing copy constructors and destructors. Usually we just let the C++ runtime free all memory when the process terminates. At least I’m sure that none of the data structures I had in my library by the time we found the bug implemented these constructors.
At some point I tried pinpointing when I started using this function. I looked at my submission history on TopCoder and Codeforces. On TopCoder, I found that I was not using this function in the contest on October 26th, 2011 (Unfortunately you have to log in to TopCoder to access the link). However, I was using the function on November 12th, 2011. Finding this left me devastated. This bug was present in all my programming contest solutions from November 12th, 2011 till November 3rd, 2013… That’s the first two years of my competitive programming career! And there’s no telling how much trouble this caused me.
As an example, I found this submission I made to the first problem of the second contest I participated in on Codeforces. I tried to solve the first two problems, but both of my solutions turned out to be too slow. However, for my submission to the first problem, by simply adding that single letter (&) to make the size() function pass-by-reference, the submission is accepted. I can’t help but feel bad for my old self, but at the same time I’m laughing so hard.
Recently I was taking a second look at a problem I couldn’t quite solve some years ago, although I had tried implementing a solution. I read the problem and came up with a solution. As far as I could remember, it was the same solution I thought of the first time. Being a little bit lazy, I found my old code instead of implementing everything from scratch. I looked over the code, and it seemed fine. So I submitted it, but it turned out to be too slow. I spent a good amount of time looking over the code, but didn’t understand why it was so slow. But all of a sudden, I froze. There it was! That old, broken, size() function… I added the &, and resubmitted. Accepted!
Even though I thought I had fixed the bug many years ago, it still hunts me to this day!