Debugging stripped binaries

In many space-restricted environments it is necessary to avoid including symbols in the released binaries, making debugging more complex. Even though many problems can be replicated by using a separate debug build, not all of them can, forcing the use of map files and cumbersome procedures. In this post we will see how to use separate debug info when building with GCC.

Generating a test binary

To avoid using third-party source code, we can produce our own big source file programmatically:

import sys
print 'int main(void)\n{\n    unsigned x0 = 1234;'
for i in range(int(sys.argv[1])):
    j = i + 1
    print '    unsigned x%(j)d = x%(i)d * %(j)d * %(j)d;' % locals()
print '    return (int)x%(j)d;\n}' % locals()
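
For example, python bigcgen.py 3 produces:

int main(void)
{
    unsigned x0 = 1234;
    unsigned x1 = x0 * 1 * 1;
    unsigned x2 = x1 * 2 * 2;
    unsigned x3 = x2 * 3 * 3;
    return (int)x3;
}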

Generating, compiling and stripping the binary (the compilation takes quite a while on this netbook),

$ python bigcgen.py 50000 >a.c
$ time gcc -g3 -std=c99 -Wall -pedantic a.c -o a-with-dbg

real	1m3.559s
user	1m2.032s
sys	0m0.912s
$ cp a-with-dbg a-stripped && strip --strip-all a-stripped
$ ls -l
total 5252
-rw-rw-r-- 1 mchouza mchouza 2255639 Jun  4 20:37 a.c
-rwxrwxr-x 1 mchouza mchouza  907392 Jun  4 20:44 a-stripped
-rwxrwxr-x 1 mchouza mchouza 2204641 Jun  4 20:38 a-with-dbg
-rw-rw-r-- 1 mchouza mchouza     225 Jun  4 20:33 bigcgen.py

we can see there are substantial space savings by removing extra information from the binary, at the cost of being unable to debug it:

mchouza@nbmchouza:~/Desktop/release-debug-exp$ gdb ./a-with-dbg 
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
[...]
Reading symbols from ./a-with-dbg...done.
(gdb) q
$ gdb ./a-stripped 
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
[...]
Reading symbols from ./a-stripped...(no debugging symbols found)...done.
(gdb) q

Separating the debug info and linking it

We can use objcopy to get a copy of the debug information and then to link it back to the stripped binary: the --add-gnu-debuglink option stores the name of the debug file, plus a checksum, in a .gnu_debuglink section that GDB uses to locate and validate the separate debug info.

$ cp a-with-dbg a-stripped-dbg
$ objcopy --only-keep-debug a-stripped-dbg a-stripped-dbg.dbg
$ strip --strip-all a-stripped-dbg
$ objcopy --add-gnu-debuglink=a-stripped-dbg.dbg a-stripped-dbg
$ ls -l
total 7412
-rw-rw-r-- 1 mchouza mchouza 2255639 Jun  4 20:37 a.c
-rwxrwxr-x 1 mchouza mchouza  907392 Jun  4 20:44 a-stripped
-rwxrwxr-x 1 mchouza mchouza  907496 Jun  4 20:46 a-stripped-dbg
-rwxrwxr-x 1 mchouza mchouza 1300033 Jun  4 20:46 a-stripped-dbg.dbg
-rwxrwxr-x 1 mchouza mchouza 2204641 Jun  4 20:38 a-with-dbg
-rw-rw-r-- 1 mchouza mchouza     225 Jun  4 20:33 bigcgen.py

The stripped file size is slightly bigger (the .gnu_debuglink section adds about a hundred bytes), but now we can debug normally:

$ gdb ./a-stripped-dbg
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
[...]
Reading symbols from ./a-stripped-dbg...Reading symbols from /home/mchouza/Desktop/mchouza/stripped-bins-dbg/a-stripped-dbg.dbg...done.
done.
(gdb) b 4567
Breakpoint 1 at 0x4145e2: file a.c, line 4567.
(gdb) r
Starting program: /home/mchouza/Desktop/mchouza/stripped-bins-dbg/a-stripped-dbg 

Breakpoint 1, main () at a.c:4567
4567	    unsigned x4564 = x4563 * 4564 * 4564;
(gdb) c
Continuing.
[Inferior 1 (process 19304) exited normally]

Limitations and a question

Of course, optimized code can be hard to debug anyway, because of the restructuring done as part of the optimization process… but that is a limitation intrinsic to debugging optimized code.

Finally, a question: why does this program exit with code 0 when executed?

[EDITED THE PYTHON CODE TO MAKE THE QUESTION MORE INTERESTING]

Compile-time filename processing

To break another blogging dry spell after having moved to London, a fun (meta-)programming challenge:

Write a C++11 program that only compiles successfully when the filename has a numeric prefix that is a prime number. For example, if the file is named 997something.cc it should compile successfully, but it should fail if the filename is 57another.cc or 130.cc.

As compile-time processing is limited, it is acceptable to only handle relatively small numbers. But all numeric prefixes below 1000 should be handled (and not via enumeration :D).
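
To give an idea of the machinery involved without spoiling the actual solution (whose hash appears below), here is a minimal sketch of one possible approach: parse the numeric prefix of __FILE__ with C++11 constexpr functions and static_assert on a recursive primality test. It assumes the compiler is invoked with the bare filename, as test.sh does, so that __FILE__ carries no directory prefix.

// illustrative sketch, not the actual base.cc
constexpr int prefix(const char *s, int acc = 0)
{
    // accumulates leading decimal digits, stopping at the first non-digit
    return (*s >= '0' && *s <= '9') ? prefix(s + 1, 10 * acc + (*s - '0')) : acc;
}

constexpr bool no_factor_geq(int n, int d)
{
    // true iff n has no divisor in [d, sqrt(n)]
    return d * d > n || (n % d != 0 && no_factor_geq(n, d + 1));
}

constexpr bool is_prime(int n)
{
    return n >= 2 && no_factor_geq(n, 2);
}

static_assert(is_prime(prefix(__FILE__)), "filename prefix is not prime");

int main() {}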

To prove I have a valid solution, I am attaching some tests:

mchouza@nbmchouza:~/Desktop/mchouza-priv/prime-naming$ ll
total 16
drwxrwxr-x  2 mchouza mchouza 4096 May 14 21:29 ./
drwxrwxr-x 12 mchouza mchouza 4096 May 14 21:13 ../
-rw-rw-r--  1 mchouza mchouza  995 May 14 21:16 base.cc
-rwxrwxr-x  1 mchouza mchouza  258 May 14 21:29 test.sh*
mchouza@nbmchouza:~/Desktop/mchouza-priv/prime-naming$ sha256sum base.cc 
5c56fd3b0e337badcb34b9098488d28645c957c9d8f864594136320f84e0d15b  base.cc
mchouza@nbmchouza:~/Desktop/mchouza-priv/prime-naming$ cat test.sh 
#!/bin/bash
echo "Printing the exit codes when compiling with names from $1.cc to $2.cc..."
for n in `seq $1 $2`; do
  ln -s base.cc $n.cc
  g++ -std=c++11 -pedantic -Wall $n.cc -o delete-me 2>/dev/null
  echo "  $n.cc -> $?"
  rm $n.cc
done
rm -f delete-me
mchouza@nbmchouza:~/Desktop/mchouza-priv/prime-naming$ ./test.sh 10 20
Printing the exit codes when compiling with names from 10.cc to 20.cc...
  10.cc -> 1
  11.cc -> 0
  12.cc -> 1
  13.cc -> 0
  14.cc -> 1
  15.cc -> 1
  16.cc -> 1
  17.cc -> 0
  18.cc -> 1
  19.cc -> 0
  20.cc -> 1
mchouza@nbmchouza:~/Desktop/mchouza-priv/prime-naming$ ./test.sh 990 1000
Printing the exit codes when compiling with names from 990.cc to 1000.cc...
  990.cc -> 1
  991.cc -> 0
  992.cc -> 1
  993.cc -> 1
  994.cc -> 1
  995.cc -> 1
  996.cc -> 1
  997.cc -> 0
  998.cc -> 1
  999.cc -> 1
  1000.cc -> 1

Finding matches with bitwise expressions

When building a parser, it’s often useful to be able to quickly detect occurrences of some character in a string. The fastest methods on x86 processors involve using SSE/AVX, but I was interested in starting by building a portable version using only bitwise operations.

Problem definition

Given a constant byte value d, we want to create a function get_matches() that takes an unsigned integer b as a parameter and returns another integer of the same size. Each byte of the return value should be 0x80 if the byte in the same position of b is equal to d, or 0x00 otherwise.

For example, for d equal to 0x20 and b being a 64 bit integer, we should have:

get_matches(0x1312202000200212) -> 0x0000808000800000
get_matches(0x0001020304050607) -> 0x0000000000000000
get_matches(0x0010203040506070) -> 0x0000800000000000

Simple implementation

It’s often useful to start by writing a simple implementation of the desired function, to test it and clarify the problem:

template <uint8_t d, typename int_type>
int_type get_matches_naive(int_type b)
{
    int_type ret = 0;
    for (size_t i = 0; i < sizeof(int_type); i++)
        if (((b >> i * CHAR_BIT) & 0xff) == d)
            ret |= (int_type)0x80 << i * CHAR_BIT;
    return ret;
}

But this implementation is of course not very efficient. How can we use bitwise instructions to our advantage?

Bitwise implementation

We can start by getting an integer with zeroes only in those bytes where the bytes of b are equal to d. It’s easy to do that by using XOR against an integer mask containing repeated instances of d. Applying it to the previous examples:

0x1312202000200212 ^ 0x2020202020202020 -> 0x3332000020002232
0x0001020304050607 ^ 0x2020202020202020 -> 0x2021222324252627
0x0010203040506070 ^ 0x2020202020202020 -> 0x2030001060704050

After that, we should translate the 0x00 bytes to 0x80 and the other values to 0x00. We can subtract from a mask composed of repeated instances of 0x80 and then do a bitwise AND with the same mask to isolate the high bit of each byte:

(0x8080808080808080 - 0x3332000020002232) & 0x8080808080808080 -> 0x0000808000800000
(0x8080808080808080 - 0x2021222324252627) & 0x8080808080808080 -> 0x0000000000000000
(0x8080808080808080 - 0x2030001060704050) & 0x8080808080808080 -> 0x0000800000000000

It would seem we have solved our problem… but that’s not true. The solution works for all these examples because the subtracted bytes are smaller than 0x80. If we try an example with higher valued bytes,

(0x8080808080808080 - 0x20300010607040aa) & 0x8080808080808080 -> 0x0000800000000080

we get false positives and also borrows that can affect other subtractions. The problem of the higher-valued bytes can be solved by clearing the high bit of every input byte (doing an AND against a 0x7f mask), but that will make us unable to distinguish between 0x80 and 0x00 bytes.

It would seem we are just moving the problem from one place to another, without solving it. But, in fact, we can easily remove the false matches coming from 0x80 bytes by ANDing with the bitwise NOT of the XORed value: that value is 0x00 exactly at the true matches, so its NOT has the high bit set there, while a 0x80 byte (a potential false positive) has the high bit cleared in the NOT.
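
Applying the corrected expression to the problematic value, the false positive disappears and only the true match survives:

(0x8080808080808080 - (0x20300010607040aa & 0x7f7f7f7f7f7f7f7f)) & ~0x20300010607040aa & 0x8080808080808080 -> 0x0000800000000000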

The resulting function would be something like the following:

template <uint8_t d, typename int_type>
int_type get_matches_bitwise(int_type b)
{
    // based on
    // https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
    constexpr int_type mask_0x01 = (int_type)~0 / (int_type)0xff;
    constexpr int_type mask_0x7f = (int_type)0x7f * mask_0x01;
    constexpr int_type mask_0x80 = (int_type)0x80 * mask_0x01;
    constexpr int_type mask_d = d * mask_0x01;
    int_type matches_as_0x00 = mask_d ^ b;
    int_type matches_as_0x80 = (mask_0x80 - (matches_as_0x00 & mask_0x7f)) &
                               ~matches_as_0x00 & mask_0x80;
    return matches_as_0x80;
}
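
As a quick sanity check, we can reproduce the first worked example (a minimal sketch that assumes the template above is in scope):

#include <cinttypes>
#include <cstdio>

// get_matches_bitwise as defined above...

int main()
{
    // should print 0x0000808000800000
    std::printf("0x%016" PRIx64 "\n",
                get_matches_bitwise<0x20, uint64_t>(0x1312202000200212ULL));
    return 0;
}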

Testing the bitwise implementation

We have seen that complex bitwise expressions can have some hidden mistakes. How can we make sure this expression doesn’t have one?

For 16 bit integers it’s quite fast to prove it by just trying all possible input values, though it can be a bit tricky to arrange covering all values of the d template parameter:

#include <cassert>
#include <climits>
#include <cstdint>
#include <cstdio>

template <uint8_t d, typename int_type>
int_type get_matches_naive(int_type b)
{
    int_type ret = 0;
    for (size_t i = 0; i < sizeof(int_type); i++)
        if (((b >> i * CHAR_BIT) & 0xff) == d)
            ret |= (int_type)0x80 << i * CHAR_BIT;
    return ret;
}

template <uint8_t d, typename int_type>
int_type get_matches_bitwise(int_type b)
{
    // based on
    // https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
    constexpr int_type mask_0x01 = (int_type)~0 / (int_type)0xff;
    constexpr int_type mask_0x7f = (int_type)0x7f * mask_0x01;
    constexpr int_type mask_0x80 = (int_type)0x80 * mask_0x01;
    constexpr int_type mask_d = d * mask_0x01;
    int_type matches_as_0x00 = mask_d ^ b;
    int_type matches_as_0x80 = (mask_0x80 - (matches_as_0x00 & mask_0x7f)) &
                               ~matches_as_0x00 & mask_0x80;
    return matches_as_0x80;
}

template <uint8_t d, typename int_type>
void test()
{
	int_type b = 0;
	do
	{
		assert((get_matches_naive<d, int_type>(b) ==
		        get_matches_bitwise<d, int_type>(b)));
	} while(--b != 0);
}

template <uint8_t d, typename int_type>
struct aux
{
	static void do_test()
	{
		test<d, int_type>();
		aux<d - 1, int_type>::do_test();
	}
};

template <typename int_type>
struct aux<0, int_type>
{
	static void do_test()
	{
		test<0, int_type>();
	}
};

int main()
{
	aux<0xff, uint16_t>::do_test();
	// TAKES TOO LONG FOR IDEONE
	//aux<0xff, uint32_t>::do_test();
	return 0;
}

32 bit integers take a bit more time to test, but it’s still feasible to go over all of them in about an hour. But with 64 bit values it would take many years to go over all the possible inputs.

In the next post we will see how we can use Z3 to do that verification in less than a second.

Beating the optimizer

It’s generally believed that it’s far too difficult to beat an optimizing C compiler when producing code, at least for a simple problem. In this post we will see that this is not always true, and that simple “hand-optimized” code can get large improvements in some cases.

Obviously that doesn’t mean that we should program using intrinsics or assembly most of the time: it still took me far more time to write this code than the time required to compile a C file with optimizations. But it means that, if we are working on a piece of time-critical code and profiling shows us a “hot spot”, it’s reasonable to consider doing such an optimization.

The problem

The problem description is very simple: given a big buffer (many megabytes), count the number of bytes equal to a given value. The prototype of such a function could be the following one:

size_t memcount(const void *s, int c, size_t n);

The baseline solution

Before doing any optimization, it’s important to get a baseline solution and consider how optimal it is. In this case, it’s very easy to write one:

size_t memcount_naive(const void *s, int c, size_t n)
{
    const uint8_t *b = (const uint8_t *)s;
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (b[i] == c)
            count++;
    return count;
}

Evaluating the baseline

We are going to do the evaluation with a relatively slow mobile Ivy Bridge core:

$ grep 'processor\|\(model name\)' /proc/cpuinfo 
processor	: 0
model name	: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
processor	: 1
model name	: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
processor	: 2
model name	: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
processor	: 3
model name	: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz

Timing with clock_gettime(), we get the following results:

$ gcc --version
gcc (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ gcc -march=native -std=c11 -DTEST_NAIVE -O3 memcount.c -o memcount && ./memcount 
Generated random arrays.
Count of character 45 in 'a1' using 'memcount_naive' is: 408978
Elapsed time: 103946325 ns
Count of character 45 in 'a1' using 'memcount_naive' is: 408978
Elapsed time: 104022735 ns
Count of character 45 in 'a2' using 'memcount_naive' is: 409445
Elapsed time: 104278289 ns
Count of character 45 in 'a3' using 'memcount_naive' is: 410593
Elapsed time: 104127038 ns
Count of character 45 in 'a4' using 'memcount_naive' is: 104447433
Elapsed time: 104110155 ns
Count of character 45 in 'a5' using 'memcount_naive' is: 104857613
Elapsed time: 104122294 ns

The measured time to count bytes with a given value in a ~100 MB array is about 100 ms, giving us a processing speed of approximately 1 GB/s. There are no substantial caching effects, as expected given that we are processing a 100 MB array.
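
For reference, the measurements were taken with a harness along the following lines (a reconstruction, not the exact code, which is linked at the end of the post):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

size_t memcount_naive(const void *s, int c, size_t n); /* as defined above */

int main(void)
{
    const size_t n = 100u * 1000u * 1000u; /* ~100 MB, as in the tests */
    uint8_t *buf = (uint8_t *)malloc(n);
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint8_t)(rand() & 0xff);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t count = memcount_naive(buf, 45, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    long long ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL +
                   (t1.tv_nsec - t0.tv_nsec);
    printf("Count of character 45: %zu\nElapsed time: %lld ns\n", count, ns);
    free(buf);
    return 0;
}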

Can we do better?

Memory bandwidth

Let’s start by calculating the memory bandwidth. We can get the memory parameters by using the dmidecode command:

$ sudo dmidecode --type 17 | grep '\(Locator\)\|\(Clock\)\|\(Size\)\|\(Type\)'
	Size: 4096 MB
	Locator: ChannelA-DIMM0
	Bank Locator: BANK 0
	Type: DDR3
	Type Detail: Synchronous
	Configured Clock Speed: 1333 MHz
	Size: No Module Installed
	Locator: ChannelA-DIMM1
	Bank Locator: BANK 1
	Type: Unknown
	Type Detail: None
	Configured Clock Speed: Unknown
	Size: 4096 MB
	Locator: ChannelB-DIMM0
	Bank Locator: BANK 2
	Type: DDR3
	Type Detail: Synchronous
	Configured Clock Speed: 1333 MHz
	Size: No Module Installed
	Locator: ChannelB-DIMM1
	Bank Locator: BANK 3
	Type: Unknown
	Type Detail: None
	Configured Clock Speed: Unknown

We have two DDR3 DIMMs in separate channels, running at “1333 MHz”. The expected bandwidth would be:

\displaystyle 1333\,\mathrm{MT}/\mathrm{s} \cdot 128\,\mathrm{bit}/\mathrm{T} \cdot 0.125\,\mathrm{byte}/\mathrm{bit} \approx 20\,\mathrm{GB}/\mathrm{s},

so we should have plenty of room for improvement. We can get a more realistic upper limit by comparing the performance to that of a highly optimized standard function, such as memchr():

$ gcc -march=native -std=c11 -DTEST_MEMCHR -O3 memcount.c -o memcount && ./memcount 
Generated random arrays.
Timing 'memchr()' to get character 13' in 'a5'. Result: (nil)
Elapsed time: 6475968 ns
Timing 'memchr()' to get character 13' in 'a5'. Result: (nil)
Elapsed time: 6919187 ns

Now we know that bandwidths of ~15 GB/s are achievable in this system, so there should be plenty of room for improving performance.

Using SSE intrinsics

My solution (which is probably far from optimal) is the following one:

size_t memcount_sse(const void *s, int c, size_t n)
{
    size_t nb = n / 16;
    size_t count = 0;
    __m128i ct = _mm_set1_epi32(c * ((~0UL) / 0xff));
    __m128i z = _mm_set1_epi32(0);
    __m128i acr = _mm_set1_epi32(0);
    for (size_t i = 0; i < nb; i++)
    {
        __m128i b = _mm_lddqu_si128((const __m128i *)s + i);
        __m128i cr = _mm_cmpeq_epi8 (ct, b);
        acr = _mm_add_epi8(acr, cr);
        if (i % 0xff == 0xfe)
        {
            acr = _mm_sub_epi8(z, acr);
            __m128i sacr = _mm_sad_epu8(acr, z);
            count += _mm_extract_epi64(sacr, 0);
            count += _mm_extract_epi64(sacr, 1);
            acr = _mm_set1_epi32(0);
        }
    }
    acr = _mm_sub_epi8(z, acr);
    __m128i sacr = _mm_sad_epu8(acr, z);
    count += _mm_extract_epi64(sacr, 0);
    count += _mm_extract_epi64(sacr, 1);
    for (size_t i = nb * 16; i < n; i++)
        if (((const uint8_t *)s)[i] == c)
            count++;
    return count;
}

Testing it we get substantially better times:

$ gcc -march=native -std=c11 -DTEST_SSE -O3 memcount.c -o memcount && ./memcount 
Generated random arrays.
Count of character 45 in 'a1' using 'memcount_sse' is: 408978
Elapsed time: 10908773 ns
Count of character 45 in 'a2' using 'memcount_sse' is: 409445
Elapsed time: 11195531 ns
Count of character 45 in 'a3' using 'memcount_sse' is: 410593
Elapsed time: 12147576 ns
Count of character 45 in 'a4' using 'memcount_sse' is: 104447433
Elapsed time: 10930570 ns
Count of character 45 in 'a5' using 'memcount_sse' is: 104857613
Elapsed time: 10856319 ns

The new speeds are between 8 and 9 GB/s, almost an order of magnitude faster than the “naive” code.

Explanation

As this is my first program using intrinsics, it’s probably far from optimal. But, as it can look pretty impenetrable, let’s explain it step by step.

Initialization

size_t memcount_sse(const void *s, int c, size_t n)
{
    size_t nb = n / 16;
    size_t count = 0;
    __m128i ct = _mm_set1_epi32(c * ((~0UL) / 0xff));
    __m128i z = _mm_set1_epi32(0);
    __m128i acr = _mm_set1_epi32(0);

This section just initializes some variables:

  • nb: we will process the input in blocks of 16 bytes (128 bits), so we use this variable to track the number of blocks to be processed.
  • count: the return value, to be filled with the byte count to be computed.
  • ct: the comparison target, a 128 bit value filled with copies of c, the byte value we are counting.
  • z: a zero 128 bit value.
  • acr: the accumulated comparison result, a 128 bit value that accumulates the comparison results in a special format, to be described later.

Only one intrinsic is used here, _mm_set1_epi32(). It loads repeated copies of a 32 bit value into a 128 bit value.

Comparing and accumulating

    for (size_t i = 0; i < nb; i++)
    {
        __m128i b = _mm_lddqu_si128((const __m128i *)s + i);
        __m128i cr = _mm_cmpeq_epi8 (ct, b);
        acr = _mm_add_epi8(acr, cr);

The loop is self-explanatory: we are just iterating over 128 bit blocks. The next intrinsic, _mm_lddqu_si128(), just loads a potentially unaligned 128 bit value into the 128 bit variable b.

Now comes the key section: we compare b byte by byte against the comparison target ct using the intrinsic _mm_cmpeq_epi8(), storing 0xff in the bytes of the comparison result cr where they match and 0x00 where they don’t. That result is then added byte by byte into the accumulated comparison result acr with the intrinsic _mm_add_epi8().

As 0xff is -1 modulo 256, the accumulation works normally, but we could have an overflow after 256 iterations. To avoid that, we need to move these counts to our count variable periodically; that is why the loop flushes them every 255 iterations.

Definitive accumulation

        if (i % 0xff == 0xfe)
        {
            acr = _mm_sub_epi8(z, acr);
            __m128i sacr = _mm_sad_epu8(acr, z);
            count += _mm_extract_epi64(sacr, 0);
            count += _mm_extract_epi64(sacr, 1);
            acr = _mm_set1_epi32(0);
        }

Every 255 iterations (to avoid overflow), we do the following operations:

  • Change the sign of acr by subtracting it from zero to get the positive values of the counts (using the _mm_sub_epi8() intrinsic).
  • Get the absolute value of each byte and sum all these values to get the total count (using the _mm_sad_epu8() intrinsic, which leaves the sum split between the low and high 64 bit halves of the 128 bit variable).
  • Add the low and high halves of the 128 bit variable to count, to get the total count, using the _mm_extract_epi64() intrinsic to extract those parts.
  • Set the accumulated comparison result acr variable to zero.

After loop accumulation

    acr = _mm_sub_epi8(z, acr);
    __m128i sacr = _mm_sad_epu8(acr, z);
    count += _mm_extract_epi64(sacr, 0);
    count += _mm_extract_epi64(sacr, 1);

Accumulates the value that remains in acr after exiting the main loop.

Bytes after last block

    for (size_t i = nb * 16; i < n; i++)
        if (((const uint8_t *)s)[i] == c)
            count++;
    return count;
}

Handles the bytes that come after the last 16 byte block and returns the total count.

Conclusion

We have seen that a substantial improvement of almost 10x can be obtained over the optimizing compiler results, even in cases where the code to be optimized is quite simple. It would be interesting to see how close to the 20 GB/s we can get with more carefully optimized code…

The code used for these tests is available on GitHub.

Escaping variables

(Just a short post to break the “dry spell”.)

One interesting consequence of C having non-pure functions is that every function you call could be accessing your local variables, if pointers to them “escape”.

If we compile the following C code with -O3

extern void f(int *a);
extern void g(void);

int main(void)
{
  int a = 0;
  f(0);
  a++;
  g();
  a++;
  g();
  a++;
  return a;
}

we get (ARM) machine code where the additions are collapsed and a is not even given a memory location:

main:
        push    {r3, lr}
        movs    r0, #0
        bl      f(int*)
        bl      g()
        bl      g()
        movs    r0, #3
        pop     {r3, pc}

But if we pass a pointer to the local variable to a single function,

extern void f(int *a);
extern void g(void);

int main(void)
{
  int a = 0;
  f(&a);
  a++;
  g();
  a++;
  g();
  a++;
  return a;
}

we get a load/store after every function call:

main:
        push    {lr}
        sub     sp, sp, #12
        add     r0, sp, #8
        movs    r3, #0
        str     r3, [r0, #-4]!
        bl      f(int*)
        ldr     r3, [sp, #4]
        adds    r3, r3, #1
        str     r3, [sp, #4]
        bl      g()
        ldr     r3, [sp, #4]
        adds    r3, r3, #1
        str     r3, [sp, #4]
        bl      g()
        ldr     r0, [sp, #4]
        adds    r0, r0, #1
        add     sp, sp, #12
        ldr     pc, [sp], #4

Once a pointer to a variable has escaped, the optimizer has to assume that it’s part of the global state and every function call could be modifying it.

Abusing macros for typechecking

One commonly used macro in C programming is ASIZE(), generally defined as something like this

#define ASIZE(a) (sizeof(a)/sizeof(a[0]))

and used to calculate the number of elements in an array.

The main problem with this macro, as written, is that it doesn’t distinguish between arrays and pointers. If passed a pointer, it will silently produce wrong results:

Code

#include <stdio.h>

#define ASIZE(a) (sizeof (a) / sizeof((a)[0]))

int main(void)
{
	short a[3];
	short *b;
	int c[2];
	int *d;
	long long e[5][4];
	char *f[4];
	char (*g)[4];
	(void)a; (void)b; (void)c; (void)d; (void)e; (void)f; (void)g;
	printf("ASIZE() accepts pointers, producing invalid results.\n");
	printf("%zu\n", ASIZE( a ));
	printf("%zu\n", ASIZE( b ));
	printf("%zu\n", ASIZE( c ));
	printf("%zu\n", ASIZE( d ));
	printf("%zu\n", ASIZE( e ));
	printf("%zu\n", ASIZE( f ));
	printf("%zu\n", ASIZE( g ));
	return 0;
}

Output

3
2
2
1
5
4
1

By adding a new macro that checks if the parameter is an array, we can define a safer ASIZE():

#define CHECK_ARRAY(a) ((void)(0&&((int (*)(__typeof__(a[0])(*)[ASIZE(a)]))NULL)(&(a))))
#define ASIZE_SAFE(a) (CHECK_ARRAY(a), ASIZE(a))

Checking this new version, we see it gets the correct results when passed arrays, but now the compilation fails when applied to pointers:

Code

#include <stdio.h>

#define ASIZE(a) (sizeof (a) / sizeof((a)[0]))

#define CHECK_ARRAY(a) ((void)(0&&((int (*)(__typeof__(a[0])(*)[ASIZE(a)]))NULL)(&(a))))

#define ASIZE_SAFE(a) (CHECK_ARRAY(a), ASIZE(a))

int main(void)
{
	short a[3];
	short *b;
	int c[2];
	int *d;
	long long e[5][4];
	char *f[4];
	char (*g)[4];
	(void)a; (void)b; (void)c; (void)d; (void)e; (void)f; (void)g;
	printf("ASIZE() accepts pointers, producing invalid results.\n");
	printf("%zu\n", ASIZE( a ));
	printf("%zu\n", ASIZE( b ));
	printf("%zu\n", ASIZE( c ));
	printf("%zu\n", ASIZE( d ));
	printf("%zu\n", ASIZE( e ));
	printf("%zu\n", ASIZE( f ));
	printf("%zu\n", ASIZE( g ));
	printf("ASIZE_SAFE() only accepts arrays (try uncommenting).\n");
	printf("%zu\n", ASIZE_SAFE( a ));
	//printf("%zu\n", ASIZE_SAFE( b ));
	printf("%zu\n", ASIZE_SAFE( c ));
	//printf("%zu\n", ASIZE_SAFE( d ));
	printf("%zu\n", ASIZE_SAFE( e ));
	//printf("%zu\n", ASIZE_SAFE( f ));
	//printf("%zu\n", ASIZE_SAFE( g ));
	return 0;
}

Output

ASIZE() accepts pointers, producing invalid results.
3
2
2
1
5
4
1
ASIZE_SAFE() only accepts arrays (try uncommenting).
3
2
5

It works in a relatively straightforward way, though I have put the details in a gist to avoid spoiling them.

Universality with only two NOT gates

In a previous post we asked how many NOT gates are needed to compute an arbitrary Boolean function. In this post we will see that only two NOT gates are enough.

Building 3 NOT gates starting from 2

If we call the inputs X, Y and Z, we can make a function detecting when no more than one input is active using a single NOT gate:

\displaystyle f(X, Y, Z) = \overline{XY + YZ + XZ}.

Detects when no more than one input is active.

By selecting only the cases where at least one input is present, adding a term to detect when all the inputs are active and using an additional NOT gate, we can detect when exactly zero or two inputs are active:

\displaystyle g(X, Y, Z) = \overline{f(X, Y, Z)(X + Y + Z) + XYZ}

\displaystyle = \overline{\overline{XY + YZ + XZ}(X + Y + Z) + XYZ}.

Detects when zero or two inputs are active.

Now we know that if X is not present, we either have:

  • 0 inputs present: we can check that by simultaneously ensuring that we don’t have more than one input present and that we have either zero or two inputs present, f(X, Y, Z)\cdot g(X, Y, Z).
  • 1 input present: we should have no more than one input present and Y or Z should be present, f(X, Y, Z)\cdot(Y + Z).
  • 2 inputs present: we can check that by simultaneously ensuring that either zero or two inputs are present and that Y and Z are present, g(X, Y, Z)\cdot YZ.

Putting it all together and adding the symmetrical cases:

\displaystyle \overline{X} = f(X, Y, Z) \cdot (Y + Z) + (f(X, Y, Z) + YZ)\cdot g(X, Y, Z)

\displaystyle \overline{Y} = f(X, Y, Z) \cdot (X + Z) + (f(X, Y, Z) + XZ)\cdot g(X, Y, Z)

\displaystyle \overline{Z} = f(X, Y, Z) \cdot (X + Y) + (f(X, Y, Z) + XY)\cdot g(X, Y, Z).

Computing NOT X (the other cases are symmetrical).

Checking the solution

To get independent confirmation, let’s check the truth tables with a simple Python script:

from itertools import product

for x, y, z in product((False, True), repeat=3):
	f_xyz = not (x and y or x and z or y and z)
	g_xyz = not (f_xyz and (x or y or z) or x and y and z)
	not_x = f_xyz and (y or z) or (f_xyz or y and z) and g_xyz
	not_y = f_xyz and (x or z) or (f_xyz or x and z) and g_xyz
	not_z = f_xyz and (x or y) or (f_xyz or x and y) and g_xyz
	assert (not x) == not_x and (not y) == not_y and (not z) == not_z
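
The script finishes silently: all eight input combinations pass the assertions, confirming the three truth tables.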

Conclusion

As this technique allows us to expand two NOT gates to three and it can be applied repeatedly, we can compute an arbitrary Boolean function with a circuit containing only two NOT gates.

In a following post we will see how I arrived at the solution by using brute force.