Top k – Solution

[Answer to the problem in Top k.]

This was the problem:

Given a list of n numbers, get the top k ones in O(n) time. (Partial credit for o(n log n) solutions.)

O(n log k) solution

The easiest way to solve this problem in o(n log n) time is to scan the n elements, keeping track of the k greatest elements seen so far. Storing those k elements in a min-heap (so the smallest of them is always at the root), we get the following algorithm:

#define SWAP(type, a, b) do { type swap_temp = (a); (a) = (b); (b) = swap_temp; } while (0)

/* Restore the min-heap property by sinking element i of a heap of hs elements. */
static void minheap_pushdown(int *h, size_t hs, size_t i)
{
    size_t j = 0;
    if (2 * i + 2 < hs)
        j = (h[2 * i + 1] < h[2 * i + 2]) ? 2 * i + 1 : 2 * i + 2;
    else if (2 * i + 1 < hs)
        j = 2 * i + 1;
    if (j != 0 && h[j] < h[i])
    {
        SWAP(int, h[i], h[j]);
        minheap_pushdown(h, hs, j);
    }
}

/* Restore the min-heap property by raising element i towards the root. */
static void minheap_raise(int *h, size_t i)
{
    if (i == 0)
        return;
    if (h[i] < h[(i - 1) / 2])
    {
        SWAP(int, h[i], h[(i - 1) / 2]);
        minheap_raise(h, (i - 1) / 2);
    }
}

/* Copy the k greatest of the n elements of l into top_k, kept as a min-heap.
   Assumes n >= k. */
void top_k_nlogk_cp(int *top_k, size_t k, const int *l, size_t n)
{
    size_t i;

    for (i = 0; i < k; i++)
    {
        top_k[i] = l[i];
        minheap_raise(top_k, i);
    }

    for (i = k; i < n; i++)
    {
        if (l[i] > top_k[0])
        {
            top_k[0] = l[i];
            minheap_pushdown(top_k, k, 0);
        }
    }
}

As minheap_pushdown() and minheap_raise() have O(log k) cost, the total cost of this solution will be O(n log k). This is better than O(n log n) if k grows slower than any power of n (if k is constant or O(log n), for example).

The previous solution is enough to get partial credit, but we can do better 😀

Linear time solution

Quicksort is based on a linear-time partition operation. This operation starts with an array and one of its elements, called the pivot, and ends with a partially sorted array: the pivot is in its sorted position, with smaller elements before it and bigger ones after it. A single partition operation would then suffice… if we could guarantee that the chosen pivot is the (n − k + 1)-th element of the sorted array 😀

We cannot guarantee that, but we can do something asymptotically equivalent: a quicksort-style recursion that descends only into the partition containing the desired position. The result is the following (with the tail recursion converted to iteration):

/* Partially sort l in place so that its k greatest elements occupy the last
   k positions; returns a pointer to them.  Assumes 1 <= k <= n. */
int *top_k_n_ip(int *l, size_t n, size_t k)
{
    size_t lo, hi, pos, i, j;
    int pivot;
   
    lo = 0;
    hi = n;
    pos = n - k;

    while (hi - lo > 1)
    {
        i = lo + 1;
        j = hi - 1;
        pivot = l[lo];

        while (1)
        {
            while (i < hi && l[i] <= pivot)
                i++;
            while (j > lo && l[j] > pivot)
                j--;
            if (i > j)
                break;
            SWAP(int, l[i], l[j]);
        }

        SWAP(int, l[lo], l[j]);

        if (j < pos)
            lo = j + 1;
        else if (j > pos)
            hi = j;
        else
            break;
    }

    return l + n - k;
}

Now we need to check that this gives us a linear-time algorithm. The execution time of the inner while (1) partition loop is bounded from above by C · (hi − lo), where C is a constant. If we assume that the chosen pivot divides the l[lo..hi) array section into two roughly equal parts (an assumption that can be made rigorous by choosing the pivot carefully), we get

T(n) \leq A + (B + C \cdot n) + (B + C \cdot n/2) + \hdots + (B + C \cdot 1)

where A, B and C are constants.

Summing all the terms containing B and taking out all the C factors we get:

T(n) \leq A + \lceil\log_2 n\rceil B + C (1 + \hdots + n/2 + n)

T(n) \leq A + \lceil\log_2 n\rceil B + 3 C n

T(n) = O(n)

As we know that any correct algorithm must employ \Omega(n) time (it needs to inspect all elements), we also know that this algorithm has \Theta(n) time complexity. Now let’s check that against reality 😀

Benchmarks

We will benchmark the following algorithm implementations:

- top_k_nlogk_cp, the O(n log k) heap-based scan in C;
- top_k_n_ip, the O(n) in-place quickselect in C;
- the equivalent O(n) selection routine included in the C++ standard library.

[Figure: Execution times for all the algorithms when k is fixed at 30.]

As k has a fixed value in the previous plot, the O(n log k) algorithm is effectively an O(n) algorithm. In fact, it performs better than the truly linear algorithms, probably due to cache effects.

These plots also show how hard it is to tell n from n log n growth apart in a log-log plot, but that is to be expected.

[Figure: Execution times for all the algorithms when k is 30% of n.]

Here the qualitative behavior of the O(n log k) solution is the expected one. The C implementation of the O(n) algorithm looks surprisingly good, with performance competitive with the C++ implementation included as part of the standard library.
