# Top k – Solution

[Answer to the problem in Top k.]

This was the problem:

Given a list of n numbers, get the top k ones in O(n) time. (Partial credit for o(n log n) solutions.)

##### n log k solution

The easiest way to solve this problem in o(n log n) time is to scan the n elements, keeping track of the k greatest elements seen so far. Using a min-heap to store the current top k elements, we get the following algorithm:

```c
#define SWAP(type, a, b) {type swap_temp = a; a = b; b = swap_temp;}

/* Restores the min-heap property by pushing the element at index i
   down towards the leaves. */
static void minheap_pushdown(int *h, size_t hs, size_t i)
{
    size_t j = 0;
    if (2 * i + 2 < hs)
        j = (h[2 * i + 1] < h[2 * i + 2]) ? 2 * i + 1 : 2 * i + 2;
    else if (2 * i + 1 < hs)
        j = 2 * i + 1;
    if (j != 0 && h[j] < h[i])
    {
        SWAP(int, h[i], h[j]);
        minheap_pushdown(h, hs, j);
    }
}

/* Restores the min-heap property by raising the element at index i
   towards the root. */
static void minheap_raise(int *h, size_t i)
{
    if (i == 0)
        return;
    if (h[i] < h[(i - 1) / 2])
    {
        SWAP(int, h[i], h[(i - 1) / 2]);
        minheap_raise(h, (i - 1) / 2);
    }
}

void top_k_nlogk_cp(int *top_k, size_t k, const int *l, size_t n)
{
    size_t i;

    /* Build a min-heap with the first k elements. */
    for (i = 0; i < k; i++)
    {
        top_k[i] = l[i];
        minheap_raise(top_k, i);
    }

    /* Scan the rest, replacing the heap minimum whenever a
       bigger element is found. */
    for (i = k; i < n; i++)
    {
        if (l[i] > top_k[0])
        {
            top_k[0] = l[i];
            minheap_pushdown(top_k, k, 0);
        }
    }
}
```


As minheap_pushdown() and minheap_raise() have O(log k) cost, the total cost of this solution will be O(n log k). This is better than O(n log n) if k grows slower than any power of n (if k is constant or O(log n), for example).
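To make the comparison concrete (picking $k = \log_2 n$ purely as an illustration):

$O(n \log k) = O(n \log \log n) = o(n \log n)$

while for constant $k$ the cost collapses to $O(n)$.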

The previous solution is enough to get partial credit, but we can do better 😀

##### Linear time solution

Quicksort is based on a linear time partition operation. This operation starts with an array and an element of this array, called the pivot, and ends with a partially sorted array, having the pivot in its sorted position, with smaller elements before it and bigger ones after it. A single partition operation would then suffice… if we could guarantee that the chosen pivot is the (n − k + 1)-th element of the sorted array 😀

We cannot do that, but we can do something that is asymptotically equivalent: doing a quicksort-style recursion, though only in the partition that contains the desired pivot. The result is this (converting the tail recursion to iteration):

```c
int *top_k_n_ip(int *l, size_t n, size_t k)
{
    size_t lo, hi, pos, i, j;
    int pivot;

    lo = 0;
    hi = n;
    pos = n - k;              /* index of the first top-k element */

    while (hi - lo > 1)
    {
        /* Hoare-style partition of l[lo..hi) with l[lo] as pivot. */
        i = lo + 1;
        j = hi - 1;
        pivot = l[lo];

        while (1)
        {
            while (i < hi && l[i] <= pivot)
                i++;
            while (j > lo && l[j] > pivot)
                j--;
            if (i > j)
                break;
            SWAP(int, l[i], l[j]);
        }

        SWAP(int, l[lo], l[j]);   /* place the pivot at index j */

        if (j < pos)
            lo = j + 1;           /* continue in the right part */
        else if (j > pos)
            hi = j;               /* continue in the left part */
        else
            break;
    }

    return l + n - k;
}
```


Now we need to check that this gives us a linear-time algorithm. The execution time of the body of the while (1) loop is bounded from above by C · (hi − lo), where C is a constant. If we assume that the chosen pivot divides the l[lo..hi) array section in two roughly equal parts (this assumption can be made rigorous by choosing the pivot carefully), we get

$T(n) \leq A + (B + C \cdot n) + (B + C \cdot n/2) + \cdots + (B + C \cdot 1)$

where A, B and C are constants.

Summing all the terms containing B and taking out all the C factors we get:

$T(n) \leq A + \lceil\log_2 n\rceil B + C (1 + \cdots + n/2 + n)$

As the geometric sum $1 + \cdots + n/2 + n$ is bounded by $3n$ (even accounting for the roundings in the repeated halvings), this gives

$T(n) \leq A + \lceil\log_2 n\rceil B + 3 C n$

$T(n) = O(n)$

As we know that any correct algorithm must employ $\Omega(n)$ time (it needs to inspect all elements), we also know that this algorithm has $\Theta(n)$ time complexity. Now let’s check that against reality 😀

##### Benchmarks

We will benchmark the following algorithm implementations:

[List of benchmarked implementations and plot omitted: running time as a function of n, with k fixed.]

As k has a fixed value in the previous plot, the O(n log k) algorithm is effectively an O(n) algorithm. In fact, it performs better than the truly linear algorithms, probably due to cache effects.

Another thing these plots show is how hard it is to tell n apart from n log n in a log–log plot, but that is to be expected.
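One way to see why (a small side computation, not from the original post): in a log–log plot the curve $y = C n \log_2 n$ becomes

$\log y = \log C + \log n + \log \log_2 n$

so it differs from the straight line of slope 1 given by $y = C n$ only by the very slowly growing term $\log \log_2 n$.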

[Second benchmark plot omitted.]

Here the qualitative behavior of the O(n log k) solution is the expected one. The C implementation of the O(n) algorithm looks surprisingly good, with performance competitive with the C++ implementation included as part of the standard library.

## 4 thoughts on “Top k – Solution”

1. Steven Pigeon says:

Just like Quicksort, and exactly in the same way, the linear time of the select algorithm is only the expected time. It may still be $O(n^2)$ on evil inputs. Judicious choice of pivot merely minimizes the probability of occurrence, but it does not eliminate the possibility of $O(n^2)$ behavior.

(and it turns out that it’s rather difficult to have a completely evil input for Quicksort…)

• mchouza says:

I think you can get worst case linear time if you use the “median of medians” algorithm to get the pivot, but it’s usually not worth the effort when implementing quicksort (you can always choose heapsort for guaranteed O(n log n) time).

• Steven Pigeon says:

For every method you choose, you can build a sequence to foil it… basically designing a “denial of sorting” attack :p But building such a sequence is complicated in general; that’s what I meant when I said that it’s rather difficult to have a completely evil input.

I wrote a thing on this a while ago: http://hbfs.wordpress.com/2010/03/16/foiled/

2. mchouza says:

> For every method you choose, you can build a sequence to foil it… basically designing a “denial of sorting” attack :p

That is certainly true if you choose a fixed position for your pivot and even if you choose the pivot position pseudorandomly. But the “median of medians” algorithm guarantees that the chosen pivot is bigger than 30% of the elements and smaller than 30% of them. Check this slide by Luca Trevisan:

http://www.cs.berkeley.edu/~luca/w4231/fall99/slides/l3.pdf