Debugging stripped binaries

In many space-restricted environments it is necessary to avoid shipping symbols in the released binaries, which makes debugging more complex. Even though many problems can be reproduced with a separate debug build, not all of them can, forcing the use of map files and other cumbersome procedures. In this post we will see how to generate and use separate debug info when building with GCC.

Generating a test binary

To avoid using third-party source code, we can produce our own big source file programmatically:

import sys
print 'int main(void)\n{\n    unsigned x0 = 1234;'
for i in range(int(sys.argv[1])):
    j = i + 1
    print '    unsigned x%(j)d = x%(i)d * %(j)d * %(j)d;' % locals()
print '    return (int)x%(j)d;\n}' % locals()

Generating, compiling and stripping the binary (the compilation takes quite a long time on this netbook),

$ python bigcgen.py 50000 >a.c
$ time gcc -g3 -std=c99 -Wall -pedantic a.c -o a-with-dbg

real	1m3.559s
user	1m2.032s
sys	0m0.912s
$ cp a-with-dbg a-stripped && strip --strip-all a-stripped
$ ls -l
total 5252
-rw-rw-r-- 1 mchouza mchouza 2255639 Jun  4 20:37 a.c
-rwxrwxr-x 1 mchouza mchouza  907392 Jun  4 20:44 a-stripped
-rwxrwxr-x 1 mchouza mchouza 2204641 Jun  4 20:38 a-with-dbg
-rw-rw-r-- 1 mchouza mchouza     225 Jun  4 20:33 bigcgen.py

we can see there are substantial space savings by removing extra information from the binary, at the cost of being unable to debug it:

mchouza@nbmchouza:~/Desktop/release-debug-exp$ gdb ./a-with-dbg 
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
[...]
Reading symbols from ./a-with-dbg...done.
(gdb) q
$ gdb ./a-stripped 
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
[...]
Reading symbols from ./a-stripped...(no debugging symbols found)...done.
(gdb) q

Separating the debug info and linking it

We can use objcopy to extract the debug information into a separate file and then link it back to the stripped binary.

$ cp a-with-dbg a-stripped-dbg
$ objcopy --only-keep-debug a-stripped-dbg a-stripped-dbg.dbg
$ strip --strip-all a-stripped-dbg
$ objcopy --add-gnu-debuglink=a-stripped-dbg.dbg a-stripped-dbg
$ ls -l
total 7412
-rw-rw-r-- 1 mchouza mchouza 2255639 Jun  4 20:37 a.c
-rwxrwxr-x 1 mchouza mchouza  907392 Jun  4 20:44 a-stripped
-rwxrwxr-x 1 mchouza mchouza  907496 Jun  4 20:46 a-stripped-dbg
-rwxrwxr-x 1 mchouza mchouza 1300033 Jun  4 20:46 a-stripped-dbg.dbg
-rwxrwxr-x 1 mchouza mchouza 2204641 Jun  4 20:38 a-with-dbg
-rw-rw-r-- 1 mchouza mchouza     225 Jun  4 20:33 bigcgen.py

The stripped file is slightly bigger because of the added debug link, but now we can debug normally:

$ gdb ./a-stripped-dbg
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
[...]
Reading symbols from ./a-stripped-dbg...Reading symbols from /home/mchouza/Desktop/mchouza/stripped-bins-dbg/a-stripped-dbg.dbg...done.
done.
(gdb) b 4567
Breakpoint 1 at 0x4145e2: file a.c, line 4567.
(gdb) r
Starting program: /home/mchouza/Desktop/mchouza/stripped-bins-dbg/a-stripped-dbg 

Breakpoint 1, main () at a.c:4567
4567	    unsigned x4564 = x4563 * 4564 * 4564;
(gdb) c
Continuing.
[Inferior 1 (process 19304) exited normally]

Limitations and a question

Of course, when this technique is applied to optimized code, debugging can still be hard because of the restructuring done as part of the optimization process… but that is a limitation intrinsic to debugging optimized code.

Finally, a question: why does this program exit with code 0 when executed?

[EDITED THE PYTHON CODE TO MAKE THE QUESTION MORE INTERESTING]
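For those who want to check the claim without waiting for the build, the relevant arithmetic can be reproduced with a few lines of Python (this assumes, as on the machine above, that unsigned is 32 bits wide, so the multiplications wrap modulo 2^32, and that the file was generated with the argument 50000):

x = 1234
for j in range(1, 50001):            # same multiplications the generated a.c performs
    x = (x * j * j) & 0xffffffff     # emulate 32-bit unsigned wraparound
print(x)                             # the value main() ends up returning (the exit status is its low byte)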


Finding matches with bitwise expressions

When building a parser, it’s often useful to be able to quickly detect occurrences of some character in a string. The fastest methods on x86 processors involve the use of SSE/AVX, but I wanted to start by building a portable version using only bitwise operations.

Problem definition

Given a constant byte value d, we want to create a function get_matches() that takes an unsigned integer b as a parameter and returns another unsigned integer of the same size. Each byte of the return value should be 0x80 if the byte in the same position of b is equal to d, or 0x00 otherwise.

For example, for d equal to 0x20 and b being a 64-bit integer, we should have:

get_matches(0x1312202000200212) -> 0x0000808000800000
get_matches(0x0001020304050607) -> 0x0000000000000000
get_matches(0x0010203040506070) -> 0x0000800000000000

Simple implementation

It’s often useful to start by writing a simple implementation of the desired function, to test it and clarify the problem:

template <uint8_t d, typename int_type>
int_type get_matches_naive(int_type b)
{
    int_type ret = 0;
    for (size_t i = 0; i < sizeof(int_type); i++)
        if (((b >> i * CHAR_BIT) & 0xff) == d)
            ret |= (int_type)0x80 << i * CHAR_BIT;
    return ret;
}

But this implementation is of course not very efficient. How can we use bitwise instructions to our advantage?

Bitwise implementation

We can start by getting an integer with zero bytes exactly in those positions where the bytes of b are equal to d. It’s easy to do that by XORing b against an integer mask containing repeated instances of d. Applying it to the previous examples:

0x1312202000200212 ^ 0x2020202020202020 -> 0x3332000020002232
0x0001020304050607 ^ 0x2020202020202020 -> 0x2021222324252627
0x0010203040506070 ^ 0x2020202020202020 -> 0x2030001060704050

After that, we need to translate the 0x00 bytes to 0x80 and every other value to 0x00. We can subtract from a mask composed of repeated instances of 0x80 and then do a bitwise AND with the same mask to isolate the high bit of each byte:

(0x8080808080808080 - 0x3332000020002232) & 0x8080808080808080 -> 0x0000808000800000
(0x8080808080808080 - 0x2021222324252627) & 0x8080808080808080 -> 0x0000000000000000
(0x8080808080808080 - 0x2030001060704050) & 0x8080808080808080 -> 0x0000800000000000
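These values are easy to reproduce with a quick Python model of the candidate expression (just an experiment; the explicit mask emulates the wraparound of 64-bit unsigned arithmetic):

MASK64 = (1 << 64) - 1
mask_d = 0x20 * 0x0101010101010101     # 0x2020202020202020
mask_0x80 = 0x80 * 0x0101010101010101  # 0x8080808080808080

for b in (0x1312202000200212, 0x0001020304050607, 0x0010203040506070):
    xored = b ^ mask_d                                  # 0x00 bytes mark the matches
    result = (mask_0x80 - xored) & MASK64 & mask_0x80   # candidate expression so far
    print('%016x -> %016x' % (xored, result))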

It would seem we have solved our problem… but that’s not true. The expression works for all these examples because the subtracted bytes are all smaller than 0x80. If we try an example with higher-valued bytes,

(0x8080808080808080 - 0x20300010607040aa) & 0x8080808080808080 -> 0x0000800000000080

we get false positives and also borrows that can affect other subtractions. The problem of the higher-valued bytes can be solved by clearing the high bit of every input byte (doing an AND against a 0x7f mask), but that would make us unable to distinguish between 0x80 and 0x00 bytes.

It would seem we are just moving the problem from one place to another without solving it. But, in fact, we can easily remove those spurious 0x80 matches by ANDing with the bitwise NOT of the original XORed value, the one that used 0x00 to indicate the matches.

The resulting function would be something like the following:

template <uint8_t d, typename int_type>
int_type get_matches_bitwise(int_type b)
{
    // based on
    // https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
    constexpr int_type mask_0x01 = (int_type)~0 / (int_type)0xff;
    constexpr int_type mask_0x7f = (int_type)0x7f * mask_0x01;
    constexpr int_type mask_0x80 = (int_type)0x80 * mask_0x01;
    constexpr int_type mask_d = d * mask_0x01;
    int_type matches_as_0x00 = mask_d ^ b;
    int_type matches_as_0x80 = (mask_0x80 - (matches_as_0x00 & mask_0x7f)) &
                               ~matches_as_0x00 & mask_0x80;
    return matches_as_0x80;
}

Testing the bitwise implementation

We have seen that complex bitwise expressions can hide subtle mistakes. How can we make sure this expression doesn’t have one?

For 16-bit integers it’s quite fast to verify by just trying all possible input values, though it takes a bit of template machinery to cover every value of the d template parameter:

#include <cassert>
#include <climits>
#include <cstdint>
#include <cstdio>

template <uint8_t d, typename int_type>
int_type get_matches_naive(int_type b)
{
    int_type ret = 0;
    for (size_t i = 0; i < sizeof(int_type); i++)
        if (((b >> i * CHAR_BIT) & 0xff) == d)
            ret |= (int_type)0x80 << i * CHAR_BIT;
    return ret;
}

template <uint8_t d, typename int_type>
int_type get_matches_bitwise(int_type b)
{
    // based on
    // https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
    constexpr int_type mask_0x01 = (int_type)~0 / (int_type)0xff;
    constexpr int_type mask_0x7f = (int_type)0x7f * mask_0x01;
    constexpr int_type mask_0x80 = (int_type)0x80 * mask_0x01;
    constexpr int_type mask_d = d * mask_0x01;
    int_type matches_as_0x00 = mask_d ^ b;
    int_type matches_as_0x80 = (mask_0x80 - (matches_as_0x00 & mask_0x7f)) &
                               ~matches_as_0x00 & mask_0x80;
    return matches_as_0x80;
}

template <uint8_t d, typename int_type>
void test()
{
	int_type b = 0;
	do
	{
		assert((get_matches_naive<d, int_type>(b) ==
		        get_matches_bitwise<d, int_type>(b)));
	} while(--b != 0);
}

template <uint8_t d, typename int_type>
struct aux
{
	static void do_test()
	{
		test<d, int_type>();
		aux<d - 1, int_type>::do_test();
	}
};

template <typename int_type>
struct aux<0, int_type>
{
	static void do_test()
	{
		test<0, int_type>();
	}
};

int main()
{
	aux<0xff, uint16_t>::do_test();
	// TAKES TOO LONG FOR IDEONE
	//aux<0xff, uint32_t>::do_test();
	return 0;
}

32-bit integers take a bit more time to test, but it’s still feasible to go over all of them in about an hour. With 64-bit values, though, it would take many years to go over all the possible inputs (there are 2^64 ≈ 1.8·10^19 of them; even at a billion tests per second that is over five centuries).

In the next post we will see how we can use Z3 to do that verification in less than a second.
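To give an idea of what that looks like, here is a minimal sketch using Z3’s Python bindings (this assumes the z3-solver package; d is hard-coded to 0x20 and the width to 64 bits):

from z3 import BitVec, BitVecVal, Extract, If, Concat, prove

d = 0x20
b = BitVec('b', 64)

mask_0x01 = BitVecVal(0x0101010101010101, 64)
mask_0x7f = BitVecVal(0x7f, 64) * mask_0x01
mask_0x80 = BitVecVal(0x80, 64) * mask_0x01
mask_d = BitVecVal(d, 64) * mask_0x01

matches_as_0x00 = mask_d ^ b
bitwise = (mask_0x80 - (matches_as_0x00 & mask_0x7f)) & ~matches_as_0x00 & mask_0x80

# Per-byte reference definition, equivalent to get_matches_naive().
ref_bytes = [If(Extract(8 * i + 7, 8 * i, b) == BitVecVal(d, 8),
                BitVecVal(0x80, 8), BitVecVal(0x00, 8))
             for i in reversed(range(8))]  # most significant byte first
reference = Concat(*ref_bytes)

prove(bitwise == reference)  # prints "proved" if they agree for every 64-bit b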

Universality with only two NOT gates

In a previous post we asked how many NOT gates are needed to compute an arbitrary Boolean function. In this post we will see that only two are enough.

Building 3 NOT gates starting from 2

If we call the inputs X, Y and Z, we can make a function detecting when no more than one input is active using a single NOT gate:

\displaystyle f(X, Y, Z) = \overline{XY + YZ + XZ}.

Detects when no more than one input is active.

By selecting only the cases where at least one input is present, adding a term to detect when all the inputs are active and using an additional NOT gate, we can detect when exactly zero or two inputs are active:

\displaystyle g(X, Y, Z) = \overline{f(X, Y, Z)(X + Y + Z) + XYZ}

\displaystyle = \overline{\overline{XY + YZ + XZ}(X + Y + Z) + XYZ}.

Detects when zero or two inputs are active.

Now we know that if X is not active, one of the following must hold:

  • 0 inputs active: we can check this by simultaneously ensuring that no more than one input is active and that either zero or two inputs are active, f(X, Y, Z)\cdot g(X, Y, Z).
  • 1 input active: no more than one input should be active and Y or Z should be active, f(X, Y, Z)\cdot(Y + Z).
  • 2 inputs active: we can check this by simultaneously ensuring that either zero or two inputs are active and that Y and Z are active, g(X, Y, Z)\cdot YZ.

Putting it all together and adding the symmetrical cases:

\displaystyle \overline{X} = f(X, Y, Z) \cdot (Y + Z) + (f(X, Y, Z) + YZ)\cdot g(X, Y, Z)

\displaystyle \overline{Y} = f(X, Y, Z) \cdot (X + Z) + (f(X, Y, Z) + XZ)\cdot g(X, Y, Z)

\displaystyle \overline{Z} = f(X, Y, Z) \cdot (X + Y) + (f(X, Y, Z) + XY)\cdot g(X, Y, Z).

Computing NOT X (the other cases are symmetrical).

Checking the solution

To get independent confirmation, let’s check the truth tables with a simple Python script:

from itertools import product

for x, y, z in product((False, True), repeat=3):
	f_xyz = not (x and y or x and z or y and z)
	g_xyz = not (f_xyz and (x or y or z) or x and y and z)
	not_x = f_xyz and (y or z) or (f_xyz or y and z) and g_xyz
	not_y = f_xyz and (x or z) or (f_xyz or x and z) and g_xyz
	not_z = f_xyz and (x or y) or (f_xyz or x and y) and g_xyz
	assert (not x == not_x) and (not y == not_y) and (not z == not_z)

Conclusion

As this technique allows us to expand two NOT gates to three and it can be applied repeatedly, we can compute an arbitrary Boolean function with a circuit containing only two NOT gates.

In a following post we will see how I arrived at the solution by using brute force.

How many NOTs are needed?

It’s relatively easy to see that we cannot compute an arbitrary Boolean function using only AND and OR gates. For example, even the NOT function cannot be computed using only those gates (why?).

Can we build a circuit to compute an arbitrary Boolean function using a constant number of NOT gates?


Solution to the “42 code golf” problem

This was my best result:

n=1e6,m,c,d;main(){while(c+=d==42,d=0,m=--n)while(d+=m%10,m/=10);printf("%d\n",c);}

It would have been nice to find a solution under 80 bytes but, after one hour of trying, that was the best I could manage…

42 code golf

A nice and easy interview problem (link not posted to avoid giving good answers) is the following:

Print the number of integers below one million whose decimal digits sum to 42.

It can be solved with some simple Python code like the following:

print sum(1 if sum(int(c) for c in '%d' % n) == 42 else 0
          for n in range(1000000))

A more interesting problem is to try to write the smallest C program that solves it, where a C program is defined as something that can be compiled & executed by Ideone in “C” mode. I know it can be done in 83 bytes, but can it be done in fewer?

Do we need to live with the uncertainty?

Introduction

In the previous post we saw that the halting problem cannot be solved. But maybe the problem lies in our programming languages, so let’s see if we can improve them to disallow non-halting programs.

Bounded C

A program can fail to halt for two reasons (we are ignoring I/O, limited memory sizes and all those earthly issues 😀): it can get trapped in an infinite loop or it can enter an infinite recursion. So let’s take a language like C and remove these “troublesome” features from it by

  • only allowing iteration via the for (b = 0; b < a; b++) construct and
  • removing the possibility of doing either direct or indirect recursion.

At first sight we haven’t lost much. For example, we can easily raise numbers to a power

int my_pow(int base, int exp)
{
    int p = 1, i;
    for (i = 0; i < exp; i++)
        p *= base;
    return p;
}

or compute the elements of the Fibonacci series.

int fib(int n)
{
    int a = 0, b = 1, i, t;
    for (i = 0; i < n; i++)
    {
        t = a + b;
        a = b;
        b = t;
    }
    return a;
}

But maybe we have lost something subtler…

Challenge

Can we make a function in “normal C” that cannot be implemented in our “bounded C”? The answer will be given in the following post… or in the comments 😀

A really impossible problem

The problem

Previously we solved a puzzle that claimed to be impossible but was, in fact, just a bit hard 😀 In this post we will see a problem that is really impossible to solve, the halting problem:

Given a description of an arbitrary computer program, decide whether the program finishes running or continues to run forever.

For some programs it’s easy to see if they halt or not:

def program_1():
    return 2 + 2

def program_2():
    while True: pass

def program_3():
    from itertools import count
    for i in count():
        if str(2**i)[0] == '7':
            break

def program_4():
    from itertools import count
    for i in count():
        if str(2**i)[-1] == '7':
            break

For others it’s not so simple:

def program_5():
    def z(m, n):
        if m == 0:
            return n + 1
        elif n == 0:
            return z(m - 1, 1)
        else:
            return z(m - 1, z(m, n - 1))
    z(4, 4)

And there are simple ones where it’s an open problem to know if they halt or not:

def program_6():
    from itertools import count
    for i in count(1):
        p = 2 ** i
        r = int(''.join(reversed(str(p))))
        for j in count(1):
            q = 5 ** j
            if q > r:
                break
            elif q == r:
                return

The proof

Let’s assume we have a procedure, represented by the function does_halt(), to determine if any given program will halt:

def does_halt(program):
    #
    # magic implementation
    #
    pass

We don’t care about the details of the does_halt() function. The only requirements are:

  • Its implementation should be finite.
  • It must accept any Python function that doesn’t take arguments as a program.
  • It must give an answer in an easily bounded time. For example, we should be able to say “if the input program has 1000 bytes, the function should return a boolean value indicating if it halts in no more than NNN seconds”.

The clever point of the proof, though it’s not a new kind of cleverness, is to build a program that “halts only if it doesn’t”:

def paradox():
    if does_halt(paradox):
        while True:
            pass

As the paradox() program is finite (this small amount of code plus the code of does_halt()), does_halt() should return an answer for it in a finite amount of time. If does_halt() returns False, indicating that paradox() does not halt, the function returns immediately, finishing its execution in finite time. On the other hand, if does_halt() returns True, indicating that paradox() does halt, it enters an infinite loop. Either way, does_halt() gives the wrong answer.

This proves that at least one of the following propositions must be true:

  • does_halt() must not be finitely implementable.
  • does_halt() must reject some valid programs.
  • does_halt()’s execution time must be unbounded.

Quines [remix]

As the quines in the previous post were criticized as boring and ordinary :-P, I did a remix:

#include <stdio.h>
#define s char s[]
s="#if 0\nimport json;r=json.dumps\nprint'#include <stdio.h>\\n#define s char s[]\\ns=%s;\\n%s'%(r(s),s)\n\"\"\" \"\n#elif 1\n#undef s\nint main(void)\n{\n  char *t = s;\n  printf(\"#include <stdio.h>\\n#define s char s[]\\ns=\\\"\");\n  while (*t)\n  {\n    if (*t == '\\n')\n      printf(\"\\\\n\");\n    else if (*t == '\"')\n      printf(\"\\\\\\\"\");\n    else if (*t == '\\\\')\n      printf(\"\\\\\\\\\");\n    else\n      printf(\"%c\", *t);\n    t++;\n  }\n  printf(\"\\\";\\n%s\\n\", s);\n  return 0;\n}\n#elif 0\n\" \"\"\"\n#endif";
#if 0
import json;r=json.dumps
print'#include <stdio.h>\n#define s char s[]\ns=%s;\n%s'%(r(s),s)
""" "
#elif 1
#undef s
int main(void)
{
  char *t = s;
  printf("#include <stdio.h>\n#define s char s[]\ns=\"");
  while (*t)
  {
    if (*t == '\n')
      printf("\\n");
    else if (*t == '"')
      printf("\\\"");
    else if (*t == '\\')
      printf("\\\\");
    else
      printf("%c", *t);
    t++;
  }
  printf("\";\n%s\n", s);
  return 0;
}
#elif 0
" """
#endif

You can check that it works on the web or by downloading the file and testing it:

$ python polyquine.c | diff polyquine.c -
$ gcc -ansi -pedantic -Wall polyquine.c -o polyquine && ./polyquine | diff polyquine.c -

I’m aware that there are some impressive examples out there, but I haven’t analyzed them to avoid spoiling the fun.

What other language should I add? Reader contributions are welcome!

Quines

Introduction

One interesting programming exercise is to write a quine, a program that outputs its own source code (excluding empty programs!). If we naively try to write it just by using the print statement we will get into an infinite regression:

print 'print \' print \\\'print \\\\\\\'...

The basic problem is that the instructions required to write other instructions lead to an infinite recursion. So how can we avoid this problem?

The trick

One general way to do it is the one used by von Neumann to build his abstract self-replicating machines, and by nature itself. The trick is to handle the same instructions in two different ways: as instructions to be followed and as data to be copied.

Writing a C quine

As we will need to access the same data in two different ways, we must define a variable to avoid duplicating it:

char s[] = "<data will go here>";

It’s easy to write the code that prints everything before the data,

int main(void)
{
  printf("char s[] = \"");

but now we need to print the data as it is represented inside the string. Fortunately, of the characters we use, only the newline, the quote and the backslash require escape sequences:

  char *t = s;
  while (*t)
  {
    if (*t == '\n')
      printf("\\n");
    else if (*t == '"')
      printf("\\\"");
    else if (*t == '\\')
      printf("\\\\");
    else
      printf("%c", *t);
    t++;
  }

Finally, we print the data as a normal string and exit main():

  printf("%s", s);
  return 0;
}

This program is finished, with the exception of filling in the data in the variable s. As doing that by hand is quite tedious, we will do it using a program.
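A small Python helper along these lines can do the job (just a sketch: rest.c is a placeholder name for a file containing everything that comes after the s declaration, and the replacements mirror the escaping done by the C code):

# Sketch of a generator for the s data; 'rest.c' is a hypothetical file
# holding everything that follows the s declaration in the final program.
rest = open('rest.c').read()
data = '";\n' + rest  # s must first close its own declaration, then continue
escaped = data.replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')
print('char s[] = "%s";' % escaped)

Its output is the declaration we need: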

char s[] = "\";\nint main(void)\n{\n  printf(\"char s[] = \\\"\");\n  char *t = s;\n  while (*t)\n  {\n    if (*t == '\\n')\n      printf(\"\\\\n\");\n    else if (*t == '\"')\n      printf(\"\\\\\\\"\");\n    else if (*t == '\\\\')\n      printf(\"\\\\\\\\\");\n    else\n      printf(\"%c\", *t);\n    t++;\n  }\n  printf(\"%s\", s);\n  return 0;\n}\n";

Putting it all together, we get a program that prints its own source:

char s[] = "\";\nint main(void)\n{\n  printf(\"char s[] = \\\"\");\n  char *t = s;\n  while (*t)\n  {\n    if (*t == '\\n')\n      printf(\"\\\\n\");\n    else if (*t == '\"')\n      printf(\"\\\\\\\"\");\n    else if (*t == '\\\\')\n      printf(\"\\\\\\\\\");\n    else\n      printf(\"%c\", *t);\n    t++;\n  }\n  printf(\"%s\", s);\n  return 0;\n}\n";
int main(void)
{
  printf("char s[] = \"");
  char *t = s;
  while (*t)
  {
    if (*t == '\n')
      printf("\\n");
    else if (*t == '"')
      printf("\\\"");
    else if (*t == '\\')
      printf("\\\\");
    else
      printf("%c", *t);
    t++;
  }
  printf("%s", s);
  return 0;
}

But, though it works in GCC, it’s not a strict C99 file:

cc1: warnings being treated as errors
prog.c: In function ‘main’:
prog.c:4: error: implicit declaration of function ‘printf’
prog.c:4: error: incompatible implicit declaration of built-in function ‘printf’

This can be fixed by including the stdio.h header:

#include <stdio.h>
char s[] = "\";\nint main(void)\n{\n  printf(\"#include <stdio.h>\\nchar s[] = \\\"\");\n  char *t = s;\n  while (*t)\n  {\n    if (*t == '\\n')\n      printf(\"\\\\n\");\n    else if (*t == '\"')\n      printf(\"\\\\\\\"\");\n    else if (*t == '\\\\')\n      printf(\"\\\\\\\\\");\n    else\n      printf(\"%c\", *t);\n    t++;\n  }\n  printf(\"%s\", s);\n  return 0;\n}\n";
int main(void)
{
  printf("#include <stdio.h>\nchar s[] = \"");
  char *t = s;
  while (*t)
  {
    if (*t == '\n')
      printf("\\n");
    else if (*t == '"')
      printf("\\\"");
    else if (*t == '\\')
      printf("\\\\");
    else
      printf("%c", *t);
    t++;
  }
  printf("%s", s);
  return 0;
}

Doing a Python quine

Doing a Python quine is much easier, because the built-in function repr() gives a string representation of an object, allowing us to skip most of the code of the C quine. Using it we can get this natural three-line quine:

s = "print 's = %s' % repr(s)\nprint s"
print 's = %s' % repr(s)
print s

If we want to do code golfing, we can:

  • Remove unnecessary spaces and line breaks.
  • Print the representation of the data and the data itself using a single print statement.

This gives us a much shorter quine:

s="print's=%s;%s'%(repr(s),s)";print's=%s;%s'%(repr(s),s)

though still longer than the record one.

Top k – Solution

[Answer to the problem in Top k.]

This was the problem:

Given a list of n numbers, get the top k ones in O(n) time. (Partial credit for o(n log n) solutions.)

n log k solution

The easiest way to solve this problem in o(n log n) time is to scan the n elements, keeping track of the greatest k elements seen so far. Using a min-heap to store the current top k, we get the following algorithm:

#define SWAP(type, a, b) {type swap_temp = a; a = b; b = swap_temp;}

static void minheap_pushdown(int *h, size_t hs, size_t i)
{
    size_t j = 0;
    if (2 * i + 2 < hs)
        j = (h[2 * i + 1] < h[2 * i + 2]) ? 2 * i + 1 : 2 * i + 2;
    else if (2 * i + 1 < hs)
        j = 2 * i + 1;
    if (j != 0 && h[j] < h[i])
    {
        SWAP(int, h[i], h[j]);
        minheap_pushdown(h, hs, j);
    }
}

static void minheap_raise(int *h, size_t i)
{
    if (i == 0)
        return;
    if (h[i] < h[(i - 1) / 2])
    {
        SWAP(int, h[i], h[(i - 1) / 2]);
        minheap_raise(h, (i - 1) / 2);
    }
}

void top_k_nlogk_cp(int *top_k, size_t k, const int *l, size_t n)
{
    size_t i;

    for (i = 0; i < k; i++)
    {
        top_k[i] = l[i];
        minheap_raise(top_k, i);
    }

    for (i = k; i < n; i++)
    {
        if (l[i] > top_k[0])
        {
            top_k[0] = l[i];
            minheap_pushdown(top_k, k, 0);
        }
    }
}

As minheap_pushdown() and minheap_raise() have O(log k) cost, the total cost of this solution will be O(n log k). This is better than O(n log n) if k grows slower than any power of n (if k is constant or O(log n), for example).
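The same scan-with-a-min-heap idea is very compact in Python, using the standard heapq module (shown only as a readable reference; it assumes 1 ≤ k ≤ n):

import heapq

def top_k_nlogk(l, k):
    # Keep a min-heap with the k greatest elements seen so far;
    # each update costs O(log k), for O(n log k) in total.
    heap = list(l[:k])
    heapq.heapify(heap)
    for x in l[k:]:
        if x > heap[0]:
            heapq.heapreplace(heap, x)  # drop the current minimum, insert x
    return heap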

The previous solution is enough to get partial credit, but we can do better 😀

Linear time solution

Quicksort is based on a linear-time partition operation. This operation starts with an array and an element of that array, called the pivot, and ends with a partially sorted array, having the pivot in its sorted position, smaller elements before it and bigger ones after it. Then a single partition operation would suffice… if we could guarantee that the chosen pivot is the (n − k + 1)-th element of the sorted array 😀

We cannot do that, but we can do something that is asymptotically equivalent: a quicksort-style recursion that only descends into the partition containing the desired position. The result is this (converting the tail recursion to iteration):

int *top_k_n_ip(int *l, size_t n, size_t k)
{
    size_t lo, hi, pos, i, j;
    int pivot;
   
    lo = 0;
    hi = n;
    pos = n - k;

    while (hi - lo > 1)
    {
        i = lo + 1;
        j = hi - 1;
        pivot = l[lo];

        while (1)
        {
            while (i < hi && l[i] <= pivot)
                i++;
            while (j > lo && l[j] > pivot)
                j--;
            if (i > j)
                break;
            SWAP(int, l[i], l[j]);
        }

        SWAP(int, l[lo], l[j]);

        if (j < pos)
            lo = j + 1;
        else if (j > pos)
            hi = j;
        else
            break;
    }

    return l + n - k;
}

Now we need to check that this gives us a linear-time algorithm. The execution time of each iteration of the outer loop is bounded from above by C · (hi − lo), where C is a constant. If we assume that the chosen pivot divides the l[lo..hi) array section into two roughly equal parts (this assumption can be made rigorous by choosing the pivot carefully), we get

T(n) \leq A + (B + C \cdot n) + (B + C \cdot n/2) + \hdots + (B + C \cdot 1)

where A, B and C are constants.

Summing all the terms containing B and taking out all the C factors we get:

T(n) \leq A + \lceil\log_2 n\rceil B + C (1 + \hdots + n/2 + n)

T(n) \leq A + \lceil\log_2 n\rceil B + 3 C n

T(n) = O(n)

As we know that any correct algorithm must employ \Omega(n) time (it needs to inspect all elements), we also know that this algorithm has \Theta(n) time complexity. Now let’s check that against reality 😀
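As a side note, the same select-then-filter idea is easy to prototype in Python (a sketch, not the in-place version above; with a random pivot it is linear on average rather than in the worst case):

import random

def top_k_linear(l, k):
    # Quickselect-style: find the value that would sit at index n - k in
    # sorted order, recursing only into the side that contains that index.
    def select(xs, pos):
        pivot = random.choice(xs)
        lo = [x for x in xs if x < pivot]
        eq = [x for x in xs if x == pivot]
        hi = [x for x in xs if x > pivot]
        if pos < len(lo):
            return select(lo, pos)
        if pos < len(lo) + len(eq):
            return pivot
        return select(hi, pos - len(lo) - len(eq))

    threshold = select(l, len(l) - k)
    top = [x for x in l if x > threshold]
    top += [threshold] * (k - len(top))  # pad with ties at the threshold value
    return top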

Benchmarks

We benchmarked implementations of the algorithms described above, including a C++ version based on the standard library.

[Plot: execution times for all the algorithms when k is fixed at 30.]

As k has a fixed value in the previous plot, the O(n log k) algorithm is effectively an O(n) algorithm. In fact, it performs better than the truly linear algorithms, probably due to cache effects.

Another thing that can be seen in these plots is the difficulty of seeing the difference between n and n log n in a log-log plot, but that is something to be expected.

[Plot: execution times for all the algorithms when k is 30% of n.]

Here the qualitative behavior of the O(n log k) solution is the expected one. The C implementation of the O(n) algorithm looks surprisingly good, with performance competitive with the C++ implementation based on the standard library.