**[Answer to the problem in Top k.]**

This was the problem:

Given a list of *n* numbers, get the top *k* ones in O(*n*) time. (Partial credit for o(*n* log *n*) solutions.)

##### *n* log *k* solution

The easiest way to solve this problem in o(*n* log *n*) time is to scan the *n* elements, keeping track of the greatest *k* elements seen up to the current element. Using a heap to store the top *k* elements, we get the following algorithm:

```c
#define SWAP(type, a, b) {type swap_temp = a; a = b; b = swap_temp;}

static void minheap_pushdown(int *h, size_t hs, size_t i)
{
    size_t j = 0;
    if (2 * i + 2 < hs)
        j = (h[2 * i + 1] < h[2 * i + 2]) ? 2 * i + 1 : 2 * i + 2;
    else if (2 * i + 1 < hs)
        j = 2 * i + 1;
    if (j != 0 && h[j] < h[i])
    {
        SWAP(int, h[i], h[j]);
        minheap_pushdown(h, hs, j);
    }
}

static void minheap_raise(int *h, size_t i)
{
    if (i == 0)
        return;
    if (h[i] < h[(i - 1) / 2])
    {
        SWAP(int, h[i], h[(i - 1) / 2]);
        minheap_raise(h, (i - 1) / 2);
    }
}

void top_k_nlogk_cp(int *top_k, size_t k, const int *l, size_t n)
{
    size_t i;
    for (i = 0; i < k; i++)
    {
        top_k[i] = l[i];
        minheap_raise(top_k, i);
    }
    for (i = k; i < n; i++)
        if (l[i] > top_k[0])
        {
            top_k[0] = l[i];
            minheap_pushdown(top_k, k, 0);
        }
}
```

As *minheap_pushdown()* and *minheap_raise()* have O(log *k*) cost, the total cost of this solution will be O(*n* log *k*). This is better than O(*n* log *n*) if *k* grows slower than any power of *n* (if *k* is constant or O(log *n*), for example).

The previous solution is enough to get partial credit, but we can do better 😀

##### Linear time solution

Quicksort is based on a linear-time partition operation. This operation starts with an array and an element of this array, called the pivot, and ends with a partially sorted array, having the pivot in its sorted position, smaller elements before it and bigger ones after it. A single partition operation would then suffice… if we could guarantee that the chosen pivot is the (*n*–*k*+1)-th element of the sorted array 😀

We cannot do that, but we can do something that is asymptotically equivalent: doing a quicksort-style recursion, though only into the partition that contains the desired position. The result is this (converting the tail recursion to iteration):

```c
int *top_k_n_ip(int *l, size_t n, size_t k)
{
    size_t lo, hi, pos, i, j;
    int pivot;
    lo = 0;
    hi = n;
    pos = n - k;
    while (hi - lo > 1)
    {
        i = lo + 1;
        j = hi - 1;
        pivot = l[lo];
        while (1)
        {
            while (i < hi && l[i] <= pivot)
                i++;
            while (j > lo && l[j] > pivot)
                j--;
            if (i > j)
                break;
            SWAP(int, l[i], l[j]);
        }
        SWAP(int, l[lo], l[j]);
        if (j < pos)
            lo = j + 1;
        else if (j > pos)
            hi = j;
        else
            break;
    }
    return l + n - k;
}
```

Now we need to check that this gives us a linear-time algorithm. The execution time of one pass of the `while (1)` partition loop is bounded from above by *C* · (*hi* – *lo*), where *C* is a constant. If we assume that the chosen pivot divides the *l*[*lo*..*hi*] array section in two roughly equal parts (this assumption can be made rigorous by choosing the pivot carefully), we get the recurrence

$$T(n) \le C\,n + B + T\!\left(\frac{n}{2}\right), \qquad T(1) \le A,$$

where *A*, *B* and *C* are constants.

Summing all the terms containing *B* and taking out all the *C* factors we get:

$$T(n) \le A + B\,\log_2 n + C\left(n + \frac{n}{2} + \frac{n}{4} + \cdots\right) \le A + B\,\log_2 n + 2\,C\,n = O(n).$$

As we know that any correct algorithm must employ Ω(*n*) time (it needs to inspect all the elements), we also know that this algorithm has Θ(*n*) time complexity. Now let’s check that against reality 😀

##### Benchmarks

We will benchmark the following algorithm implementations:

- *n* log *n* (copying): just sorts a copy of the input array and copies the top *k* elements.
- *n* log *k* (copying): scans the input array, keeping the top *k* elements in a heap.
- *n* (copying): applies the O(*n*) in-place algorithm over a copy of the input array.
- *n* log *n* (in-place): sorts the input array in-place and returns a pointer to the top *k* elements.
- *n* (in-place): applies the O(*n*) quicksort-derived algorithm that was described in the previous section.
- *n* (in-place, C++): Miles Macklin‘s solution, just calling *std::nth_element()* 🙂

As *k* has a fixed value in the previous plot, the O(*n* log *k*) algorithm is effectively an O(*n*) algorithm. In fact, it performs better than the truly linear algorithms, probably due to cache effects.

Another thing that can be seen in these plots is how difficult it is to tell *n* apart from *n* log *n* in a log-log plot, but that is to be expected.

Here the qualitative behavior of the O(*n* log *k*) solution is the expected one. The C implementation of the O(*n*) algorithm looks surprisingly good, with performance competitive with the C++ implementation included as part of the standard library.

Just like Quicksort, and exactly in the same way, the linear time of the select algorithm is only the *expected* time. It may still be O(*n*²) on evil inputs. Judicious choice of pivot merely minimizes the probability of occurrence, but it does not eliminate the possibility of O(*n*²) behavior. (And it turns out that it’s rather difficult to have a completely evil input for Quicksort…)

I think you can get *worst case* linear time if you use the “median of medians” algorithm to get the pivot, but it’s usually not worth the effort when implementing quicksort (you can always choose heapsort for guaranteed O(*n* log *n*) time).

For every method you choose, you can build a sequence to foil it… basically designing a “denial of sorting” attack :p But building such a sequence is complicated in general; that’s what I meant when I said that it’s rather difficult to have a completely evil input.

I wrote a thing on this a while ago: http://hbfs.wordpress.com/2010/03/16/foiled/

That is certainly true if you choose a fixed position for your pivot, and even if you choose the pivot position pseudorandomly. But the “median of medians” algorithm guarantees that the chosen pivot is bigger than at least 30% of the elements and smaller than at least another 30%. Check this slide by Luca Trevisan: