# The one who doesn’t match

A very common C interview problem is the following:

In an array with n positive integers, all values appear twice with the exception of one, that appears only once. Can we find it using constant space?

The usual trick is to use bitwise XOR in the following way,

unsigned answer1(const unsigned *a, size_t n)
{
unsigned ret = 0;
for (size_t i = 0; i < n; i++)
ret ^= a[i];
return ret;
}


and it works due to the properties a^b == b^a, a^(b^c) == (a^b)^c, a^0 == a and a^a == 0 (following the usual conventions, we assume that constant space refers to a constant number of computer words).

It is fairly straightforward to extend that solution to

In an array with n positive integers, all values appear k times with the exception of one, that appears only once. How much space and time do we need to find it?

(hint: XOR can be seen as addition modulo 2)

but what extensions are possible? If we generalize it to

In an array with n positive integers, all values appear k times with the exception of m values, that appear p times. How much space and time doe we need to find them?

a trivial upper bound is O(n) space and O(n log n) time, achievable by just sorting and counting. We know we can do better in some cases, but how can we characterize them?

Some answers in the next post (hopefully it won’t take one year to arrive!).

# An average problem

This is a simple problem to keep the posting rhythm.

Given an array a of N elements, we keep fixed the first element at 0.0 and the last element at 1.0, while all the other elements are initialized to 0.5. In each time step, all the element in the middle of the array are set to the average of their adjacent elements.

For example, if we have N = 4,

[0.0, 0.5, 0.5, 1.0] # Initialization
[0.0, 0.25, 0.75, 1.0] # After first step
[0.0, 0.375, 0.675, 1.0] # After second step
...


We can easily express it in numpy:

import numpy as np

N = 4
S = 10

a = 0.5 * np.ones(N)
a = 0.0
a[-1] = 1.0

print a
for i in xrange(S):
a[1:-1] = 0.5 * (a[:-2] + a[2:])
print a


The three questions are:

• Does the array converge to a limit?
• If it does, does it depend on its initial values?
• Assuming it converges and given a tolerance $\epsilon$, how many steps does it take to get to a distance $\epsilon$ of the limit?

# Paraxial approximation for spherical Green functions

Note: This post is just some notes for a discussion.

We can start by writing it simply as $\displaystyle f(\mathbf{r}) = \frac{e^{ikr}}{r}$.

Now we use $\mathbf{r} = [x\,y\,z]^T$ and assume $|z| \gg |x|, |y|$. After expanding the expression using the coordinates, $\displaystyle f(x, y, z) = \frac{\exp(ik\sqrt{z^2 + x^2 + y^2})}{\sqrt{z^2 + x^2 + y^2}}$,

we have to find now a way to get rid of the square roots. Taking $z^2$ as a common factor, $\displaystyle \sqrt{z^2 + x^2 + y^2} = \sqrt{z^2\left(1 + \frac{x^2 + y^2}{z^2}\right)}$ $\displaystyle = |z|\sqrt{1 + \frac{x^2 + y^2}{z^2}}$.

As we know the argument of the square root is very close to 1, we can use a first order approximation, $\sqrt{1 + x} \approx 1 + \frac{x}{2}$: $\approx |z|\left(1 + \frac{x^2 + y^2}{2z^2}\right)$ $\approx |z|\left(1 + \frac{x^2 + y^2}{2z^2}\right)$ $\approx |z| + \frac{x^2 + y^2}{2|z|}$.

Replacing in the original expression, $\displaystyle f(x, y, z) \approx \frac{1}{|z|}\exp\left(ik|z| + ik\frac{x^2+y^2}{2|z|}\right)$

we get expressions matching the ones in the slides apart from some amplitude and phase conventions.

# Superpolynomial

This post is just to help a friend that wanted to see why $\displaystyle \sum_{i=1}^{\lfloor\sqrt{n}\rfloor} \binom{n}{i}$

was superpolynomial.

### Getting a simple lower bound

We can start by keeping only the last term. As all the terms are positive, that will give us a lower bound: $\displaystyle \sum_{i=1}^{\lfloor\sqrt{n}\rfloor} \binom{n}{i} \ge \binom{n}{\lfloor\sqrt{n}\rfloor}$.

We can expand the binomial coefficient in terms of factorials $\displaystyle \binom{n}{\lfloor\sqrt{n}\rfloor} = \frac{n!}{(n - \lfloor\sqrt{n}\rfloor)!\lfloor\sqrt{n}\rfloor!}$,

and some of the factors inside $n!$ can then be cancelled with the ones in $(n - \lfloor\sqrt{n}\rfloor)!$, leaving the following expression: $\displaystyle \binom{n}{\lfloor\sqrt{n}\rfloor} = \frac{n(n-1)\hdots(n - \lfloor\sqrt{n}\rfloor + 1)}{\lfloor\sqrt{n}\rfloor!}$.

As we want a lower bound, we can bound the numerator by below and the denominator by above, leaving us a bound without factorials: $\displaystyle \frac{n(n-1)\hdots(n - \lfloor\sqrt{n}\rfloor + 1)}{\lfloor\sqrt{n}\rfloor!} \ge \frac{(n - \sqrt{n})^{\sqrt{n} - 1}}{\sqrt{n}^{\sqrt{n}}}$.

For any reasonably big value of $n$ we will have $n - \lfloor\sqrt{n}\rfloor \ge n/2$. If we combine that with factoring out the -1 in the exponent we get $\displaystyle \frac{(n - \sqrt{n})^{\sqrt{n} - 1}}{\sqrt{n}^{\sqrt{n}}} \ge \frac{1}{n - \sqrt{n}}\left(\frac{n}{\sqrt{n}}\right)^{\sqrt{n}} \ge \frac{1}{n}\sqrt{n}^{\sqrt{n}}$,

removing the problematic floors and arriving to a much simpler expression.

Combining all these bounds we get $\displaystyle \sum_{i=1}^{\lfloor\sqrt{n}\rfloor} \binom{n}{i} \ge \frac{1}{n}\sqrt{n}^{\sqrt{n}}$,

a much simpler lower bound when compared to the original expression.

### Proving it is superpolynomial

If w take an arbitrary polynomial $p(n)$ of degree $k$ we can say for $n \ge 1$ that $\displaystyle p(n) = \sum_{i=1}^k p_i\,n^i \le \left(\sum_{i=1}^k |p_i|\right)n^k$,

meaning we can bound by above any polynomial of degree $k$ by a polynomial of the same degree with a single term.

Taking the limit of the quotient of logarithms, $\displaystyle \lim_{x \to +\infty} \frac{\log \frac{1}{n}\sqrt{n}^{\sqrt{n}}}{\log a\,n^k} = \lim_{x \to +\infty} \frac{\frac{\sqrt{n}}{2}\log n - \log n}{\log a + k\,\log n}$ $\displaystyle = \lim_{x \to +\infty} \frac{\log n\left(\frac{\sqrt{n}}{2} - 1\right)}{\log n\left(\frac{\log a}{\log n} + k\right)}$ $\displaystyle = \lim_{x \to +\infty} \frac{\frac{\sqrt{n}}{2} - 1}{\frac{\log a}{\log n} + k}$ $\displaystyle = +\infty$

we can see the lower bound of the original expression is superpolynomial, making the original expression superpolynomial too.

# Debugging stripped binaries

In many space-restricted environments it is necessary to avoid including symbols in the released binaries, making debugging more complex. Even though many problems can be replicated by using a separate debug build, not all of them can, forcing the use of map files and cumbersome procedures. In this post we will see how to use separate debug info when building with GCC.

### Generating a test binary

To avoid using third-party source code, we can produce our own big source file programatically:

import sys
print 'int main(void)\n{\n    unsigned x0 = 1234;'
for i in range(int(sys.argv)):
j = i + 1
print '    unsigned x%(j)d = x%(i)d * %(j)d * %(j)d;' % locals()
print '    return (int)x%(j)d;\n}' % locals()


Generating, compiling and stripping the binary (the compilation takes quite long in this netbook),

$python bigcgen.py 50000 >a.c$ time gcc -g3 -std=c99 -Wall -pedantic a.c -o a-with-dbg

real	1m3.559s
user	1m2.032s
sys	0m0.912s
$cp a-with-dbg a-stripped &amp;&amp; strip --strip-all a-stripped bash: syntax error near unexpected token ;&'$ cp a-with-dbg a-stripped && strip --strip-all a-stripped
$ls -l total 5252 -rw-rw-r-- 1 mchouza mchouza 2255639 Jun 4 20:37 a.c -rwxrwxr-x 1 mchouza mchouza 907392 Jun 4 20:44 a-stripped -rwxrwxr-x 1 mchouza mchouza 2204641 Jun 4 20:38 a-with-dbg -rw-rw-r-- 1 mchouza mchouza 225 Jun 4 20:33 bigcgen.py  we can see there are substantial space savings by removing extra information from the binary, at the cost of being unable to debug it: mchouza@nbmchouza:~/Desktop/release-debug-exp$ gdb ./a-with-dbg
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
[...]
(gdb) q
$gdb ./a-stripped GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1 Copyright (C) 2014 Free Software Foundation, Inc. [...] Reading symbols from ./a-stripped...(no debugging symbols found)...done. (gdb) q  ### Separating the debug info and linking it We can use objcopy to get a copy of the debug information and then to link it back to the original file. $ cp a-with-dbg a-stripped-dbg
$objcopy --only-keep-debug a-stripped-dbg a-stripped-dbg.dbg$ strip --strip-all a-stripped-dbg
$objcopy --add-gnu-debuglink=a-stripped-dbg.dbg a-stripped-dbg$ ls -l
total 7412
-rw-rw-r-- 1 mchouza mchouza 2255639 Jun  4 20:37 a.c
-rwxrwxr-x 1 mchouza mchouza  907392 Jun  4 20:44 a-stripped
-rwxrwxr-x 1 mchouza mchouza  907496 Jun  4 20:46 a-stripped-dbg
-rwxrwxr-x 1 mchouza mchouza 1300033 Jun  4 20:46 a-stripped-dbg.dbg
-rwxrwxr-x 1 mchouza mchouza 2204641 Jun  4 20:38 a-with-dbg
-rw-rw-r-- 1 mchouza mchouza     225 Jun  4 20:33 bigcgen.py


The file size is slightly bigger, but now we can debug normally:

$gdb ./a-stripped-dbg GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1 [...] Reading symbols from ./a-stripped-dbg...Reading symbols from /home/mchouza/Desktop/mchouza/stripped-bins-dbg/a-stripped-dbg.dbg...done. done. (gdb) b 4567 Breakpoint 1 at 0x4145e2: file a.c, line 4567. (gdb) r Starting program: /home/mchouza/Desktop/mchouza/stripped-bins-dbg/a-stripped-dbg Breakpoint 1, main () at a.c:4567 4567 unsigned x4564 = x4563 * 4564 * 4564; (gdb) c Continuing. [Inferior 1 (process 19304) exited normally]  ### Limitations and a question Of course, when applied to optimized code it can be hard to debug it anyway because of the restructuring done as part of the optimization process… but that is a limitation intrinsic in debugging optimized code. Finally, a question: why does this program exits with code 0 when executed? [EDITED THE PYTHON CODE TO MAKE THE QUESTION MORE INTERESTING] # Compile-time filename processing To break another blogging dry spell after having moved to London, a fun (meta-)programming challenge: Write a C++ 11 program that only compiles successfully when the filename has a numeric prefix that is a prime number. For example, if the file is named 997something.cc it should compile successfully, but it should fail if the filename is 57another.cc or 130.cc. As compile-time processing is limited, it is acceptable to only handle relatively small numbers. But all numeric prefixes below 1000 should be handled (and not via enumeration :D). To prove I have a valid solution, I am attaching some tests: mchouza@nbmchouza:~/Desktop/mchouza-priv/prime-naming$ ll
total 16
drwxrwxr-x  2 mchouza mchouza 4096 May 14 21:29 ./
drwxrwxr-x 12 mchouza mchouza 4096 May 14 21:13 ../
-rw-rw-r--  1 mchouza mchouza  995 May 14 21:16 base.cc
-rwxrwxr-x  1 mchouza mchouza  258 May 14 21:29 test.sh*
mchouza@nbmchouza:~/Desktop/mchouza-priv/prime-naming$sha256sum base.cc 5c56fd3b0e337badcb34b9098488d28645c957c9d8f864594136320f84e0d15b base.cc mchouza@nbmchouza:~/Desktop/mchouza-priv/prime-naming$ cat test.sh
#!/bin/bash
echo "Printing the exit codes when compiling with names from $1.cc to$2.cc..."
for n in seq $1$2; do
ln -s base.cc $n.cc g++ -std=c++11 -pedantic -Wall$n.cc -o delete-me 2>/dev/null
echo "  $n.cc ->$?"
rm $n.cc done rm -f delete-me mchouza@nbmchouza:~/Desktop/mchouza-priv/prime-naming$ ./test.sh 10 20
Printing the exit codes when compiling with names from 10.cc to 20.cc...
10.cc -> 1
11.cc -> 0
12.cc -> 1
13.cc -> 0
14.cc -> 1
15.cc -> 1
16.cc -> 1
17.cc -> 0
18.cc -> 1
19.cc -> 0
20.cc -> 1
mchouza@nbmchouza:~/Desktop/mchouza-priv/prime-naming\$ ./test.sh 990 1000
Printing the exit codes when compiling with names from 990.cc to 1000.cc...
990.cc -> 1
991.cc -> 0
992.cc -> 1
993.cc -> 1
994.cc -> 1
995.cc -> 1
996.cc -> 1
997.cc -> 0
998.cc -> 1
999.cc -> 1
1000.cc -> 1


# Finding matches with bitwise expressions

When building a parser, it’s often useful to be able to quickly detect occurrences of some character in a string. The fastest methods in x86 processors involve the usage of SSE/AVX, but I was interested in starting by building a portable version using only bitwise operations.

### Problem definition

Given a constant byte value d, we want to create a function get_matches() that takes an unsigned integer b as a parameter and returns another of the same size. Each byte of the return value should be either 0x80 if the byte of the same position in b is equal to d or 0x00 otherwise.

For example, for d equal to 0x20 and b being a 64 bit integer, we should have:

find_matches(0x1312202000200212) -> 0x0000808000800000
find_matches(0x0001020304050607) -> 0x0000000000000000
find_matches(0x0010203040506070) -> 0x0000800000000000


### Simple implementation

It’s often useful to start by writing a simple implementation of the desired function, to test it and clarify the problem:

template <uint8_t d, typename int_type>
int_type get_matches_naive(int_type b)
{
int_type ret = 0;
for (size_t i = 0; i < sizeof(int_type); i++)
if (((b >> i * CHAR_BIT) & 0xff) == d)
ret |= (int_type)0x80 << i * CHAR_BIT;
return ret;
}


But this implementation is of course not very efficient. How can we use bitwise instructions to our advantage?

### Bitwise implementation

We can start by getting an integer with zeroes only in those bytes where the bytes of b are equal to d. It’s easy to do that by using XOR against an integer mask containing repeated instances of d. Applying it to the previous examples:

0x1312202000200212 ^ 0x2020202020202020 -> 0x3332000020002232
0x0001020304050607 ^ 0x2020202020202020 -> 0x2021222324252627
0x0010203040506070 ^ 0x2020202020202020 -> 0x2030001060704050


After that, we should translate the 0x00 bytes to 0x80 and the other values to 0x00. We can use subtraction from a mask composed by repeating instances of 0x80 and then do a bitwise AND with the same mask to isolate the high bits of each byte:

(0x8080808080808080 - 0x3332000020002232) & 0x8080808080808080 -> 0x0000808000800000
(0x8080808080808080 - 0x2021222324252627) & 0x8080808080808080 -> 0x0000000000000000
(0x8080808080808080 - 0x2030001060704050) & 0x8080808080808080 -> 0x0000800000000000


It would seem we have solved our problem… but that’s not true. The solution works for all these examples because the subtracted bytes are smaller than 0x80. If we try an example with higher valued bytes,

(0x8080808080808080 - 0x20300010607040aa) & 0x8080808080808080 -> 0x0000800000000080


we get false positives and also borrows, that can affect other subtractions. The problem of the higher valued bytes can be solved by clearing the high bits of every input byte, doing an AND against a 0x7f mask, but that will make us unable to distinguish between 0x80 and 0x00 bytes.

It would seem we are just moving the problem from one place to the other, without solving it. But, in fact, we can easily remove the 0x80 bytes by doing an AND with the result of doing a bitwise NOT of the original value, that used 0x00 to indicate the matches.

The resulting function would be something like the following:

template <uint8_t d, typename int_type>
int_type get_matches_bitwise(int_type b)
{
// based on
// https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
constexpr int_type mask_0x01 = (int_type)~0 / (int_type)0xff;
int_type matches_as_0x00 = mask_d ^ b;
return matches_as_0x80;
}


### Testing the bitwise implementation

We have seen that complex bitwise expressions can have some hidden mistakes. How can we make sure this expression doesn’t have one?

For 16 bit integers it’s quite fast to prove it by just trying all possible input values, though it can be a bit tricky to arrange covering the input template parameter:

#include <cassert>
#include <climits>
#include <cstdint>
#include <cstdio>

template <uint8_t d, typename int_type>
int_type get_matches_naive(int_type b)
{
int_type ret = 0;
for (size_t i = 0; i < sizeof(int_type); i++)
if (((b >> i * CHAR_BIT) & 0xff) == d)
ret |= (int_type)0x80 << i * CHAR_BIT;
return ret;
}

template <uint8_t d, typename int_type>
int_type get_matches_bitwise(int_type b)
{
// based on
// https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
constexpr int_type mask_0x01 = (int_type)~0 / (int_type)0xff;
int_type matches_as_0x00 = mask_d ^ b;
return matches_as_0x80;
}

template <uint8_t d, typename int_type>
void test()
{
int_type b = 0;
do
{
assert((get_matches_naive<d, int_type>(b) ==
get_matches_bitwise<d, int_type>(b)));
} while(--b != 0);
}

template <uint8_t d, typename int_type>
struct aux
{
static void do_test()
{
test<d, int_type>();
aux<d - 1, int_type>::do_test();
}
};

template <typename int_type>
struct aux<0, int_type>
{
static void do_test()
{
test<0, int_type>();
}
};

int main()
{
aux<0xff, uint16_t>::do_test();
// TAKES TOO LONG FOR IDEONE
//aux<0xff, uint32_t>::do_test();
return 0;
}
`

32 bit integers take a bit more of time to test, but it’s still feasible to go over all of them in about an hour. But with 64 bit values it would take many years to go over all the possible inputs.

In the next post we will see how we can use Z3 to do that verification in less than a second.