Debugging stripped binaries

In many space-restricted environments it is necessary to avoid including symbols in the released binaries, making debugging more complex. Even though many problems can be replicated by using a separate debug build, not all of them can, forcing the use of map files and cumbersome procedures. In this post we will see how to use separate debug info when building with GCC.

Generating a test binary

To avoid using third-party source code, we can produce our own big source file programatically:

import sys
print 'int main(void)\n{\n    unsigned x0 = 1234;'
for i in range(int(sys.argv[1])):
    j = i + 1
    print '    unsigned x%(j)d = x%(i)d * %(j)d * %(j)d;' % locals()
print '    return (int)x%(j)d;\n}' % locals()

Generating, compiling and stripping the binary (the compilation takes quite long in this netbook),

$ python bigcgen.py 50000 >a.c
$ time gcc -g3 -std=c99 -Wall -pedantic a.c -o a-with-dbg

real	1m3.559s
user	1m2.032s
sys	0m0.912s
$ cp a-with-dbg a-stripped && strip --strip-all a-stripped
bash: syntax error near unexpected token `;&'
$ cp a-with-dbg a-stripped && strip --strip-all a-stripped
$ ls -l
total 5252
-rw-rw-r-- 1 mchouza mchouza 2255639 Jun  4 20:37 a.c
-rwxrwxr-x 1 mchouza mchouza  907392 Jun  4 20:44 a-stripped
-rwxrwxr-x 1 mchouza mchouza 2204641 Jun  4 20:38 a-with-dbg
-rw-rw-r-- 1 mchouza mchouza     225 Jun  4 20:33 bigcgen.py

we can see there are substantial space savings by removing extra information from the binary, at the cost of being unable to debug it:

mchouza@nbmchouza:~/Desktop/release-debug-exp$ gdb ./a-with-dbg 
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
[...]
Reading symbols from ./a-with-dbg...done.
(gdb) q
$ gdb ./a-stripped 
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
[...]
Reading symbols from ./a-stripped...(no debugging symbols found)...done.
(gdb) q

Separating the debug info and linking it

We can use objcopy to get a copy of the debug information and then to link it back to the original file.

$ cp a-with-dbg a-stripped-dbg
$ objcopy --only-keep-debug a-stripped-dbg a-stripped-dbg.dbg
$ strip --strip-all a-stripped-dbg
$ objcopy --add-gnu-debuglink=a-stripped-dbg.dbg a-stripped-dbg
$ ls -l
total 7412
-rw-rw-r-- 1 mchouza mchouza 2255639 Jun  4 20:37 a.c
-rwxrwxr-x 1 mchouza mchouza  907392 Jun  4 20:44 a-stripped
-rwxrwxr-x 1 mchouza mchouza  907496 Jun  4 20:46 a-stripped-dbg
-rwxrwxr-x 1 mchouza mchouza 1300033 Jun  4 20:46 a-stripped-dbg.dbg
-rwxrwxr-x 1 mchouza mchouza 2204641 Jun  4 20:38 a-with-dbg
-rw-rw-r-- 1 mchouza mchouza     225 Jun  4 20:33 bigcgen.py

The file size is slightly bigger, but now we can debug normally:

$ gdb ./a-stripped-dbg
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
[...]
Reading symbols from ./a-stripped-dbg...Reading symbols from /home/mchouza/Desktop/mchouza/stripped-bins-dbg/a-stripped-dbg.dbg...done.
done.
(gdb) b 4567
Breakpoint 1 at 0x4145e2: file a.c, line 4567.
(gdb) r
Starting program: /home/mchouza/Desktop/mchouza/stripped-bins-dbg/a-stripped-dbg 

Breakpoint 1, main () at a.c:4567
4567	    unsigned x4564 = x4563 * 4564 * 4564;
(gdb) c
Continuing.
[Inferior 1 (process 19304) exited normally]

Limitations and a question

Of course, when applied to optimized code it can be hard to debug it anyway because of the restructuring done as part of the optimization process… but that is a limitation intrinsic in debugging optimized code.

Finally, a question: why does this program exits with code 0 when executed?

[EDITED THE PYTHON CODE TO MAKE THE QUESTION MORE INTERESTING]

Compact string tables in C

Introduction

The simplest way to store a string table in C is to follow this “pattern”:

/* in a header file */
typedef enum
{
  str_first,
  str_second,
  /* add more string ids here */
} str_id_t;

/* in a source file */
static const char str_table[STR_MAX_LEN + 1][] = {
  "This is the first message.",
  "This is the second message.",
  /* add more strings here */
}

This method is unarguably very simple and makes accessing strings very time efficient, as getting the string pointer only requires a multiplication and an addition. But the disadvantages are substantial:

  • the ids for each string are not positioned near the strings, making synchronization errors more likely;
  • the same storage space is used for all strings, leading to massive memory waste in most cases.

The rest of this post will be dedicated to explore an alternative method that solves these problems at the expense of requiring an initialization step and making string accesses a bit slower.

The method

See this update.

The main idea is to exploit the fact that a macro can be undefined and redefined to make a single macro expression take different values. In that way, an expression such as

MSG( str_id_first, "This is the first message." )

can be evaluated first as defining an enumeration value, and later as defining a string constant. This allows us to avoid the first problem, making the synchronization between the string ids an their associated string constants much easier.

To solve the second problem we can concatenate all the strings and make the accesses via an offset table. We can fill the sizes of each element using the sizeof operator over the string constants, but computing the offsets will require a runtime initialization step to do the necessary additions.

Putting together these two ideas we get:

Header file – str_table.h
#include <stdlib.h>

#if !defined(STR_TABLE_NORMAL) && !defined(STR_TABLE_STR_CONSTS) &&\
    !defined(STR_TABLE_STR_OFFSETS)
#define STR_TABLE_NORMAL
#endif

#if defined(STR_TABLE_NORMAL)

/* defines ids */
#define MSG(id,str ) id,
typedef enum
{

#elif defined(STR_TABLE_STR_CONSTS)

/* defines string constants */
#define MSG(id, str) str
static const char _table[] =

#elif defined(STR_TABLE_STR_OFFSETS)

/* defines string offsets */
#define MSG(id, str) sizeof(str) - 1,
static size_t _offsets[] = {
  0,

#endif

MSG(str_first, "This is the first message.")
MSG(str_second, "This is the second message.")

#if defined(STR_TABLE_NORMAL)

  str_id_last
} str_id_t;

void str_table_init(void);
void str_table_get(char* buffer, size_t buffer_size, str_id_t str_id);

#elif defined(STR_TABLE_STR_CONSTS)

;

#elif defined(STR_TABLE_STR_OFFSETS)

};

#endif

#undef MSG
Source file – str_table.c
#define STR_TABLE_NORMAL
#include "str_table.h"
#undef STR_TABLE_NORMAL

#define STR_TABLE_STR_CONSTS
#include "str_table.h"
#undef STR_TABLE_STR_CONSTS

#define STR_TABLE_STR_OFFSETS
#include "str_table.h"
#undef STR_TABLE_STR_OFFSETS

#include <string.h>

void str_table_init(void)
{
  size_t i;
  for (i = 1; i < sizeof(_offsets) / sizeof(_offsets[0]); i++)
    _offsets[i] += _offsets[i-1];
}

void str_table_get(char* buffer, size_t buffer_size, str_id_t str_id)
{
  if (_offsets[str_id+1] - _offsets[str_id] >= buffer_size)
    return;
  memcpy(buffer, &_table[_offsets[str_id]],
         _offsets[str_id+1] - _offsets[str_id]);
  buffer[_offsets[str_id+1] - _offsets[str_id]] = '\0';
}
Testing

We cannot run a multiple file C program online (AFAIK), but we can simulate the inclusions manually and run it at Codepad, where it gives this output:

This is the first message.
This is the second message.

Edit 2/14 21:30

Reading this comment by Arseny Kapoulkine I realized that my previous solution is, in fact, badly overengineered and wasteful. 😀 In fact, for most purposes, we can avoid using enums at all and just use this well known solution (though it’s not really a string table…):

Header file – str_table.h
#ifdef STR_TABLE_C
#define MSG( id, str ) extern char id[sizeof(str)];
#else
#define MSG( id, str ) char id[sizeof(str)] = str;
#endif

MSG(str_first, "This is the first message.")
MSG(str_second, "This is the second message.")
Source file – str_table.c
#define STR_TABLE_C
#include "str_table.h"

But if we want to be able to iterate through the string table, Arseny’s functional solution is a good solution (though it’s hard to follow for people like me that doesn’t know functional programming very well :-D).

Detectando dígitos en un 8051

Uno de los problemas que aparecieron en un examen de Labo de Micros reciente (que le tomaron a mi hermano) indicaba hacer una “función” en assembly 8051 tal que detectara si el valor que se le pasaba era un dígito. Más especificamente, debían volver con C en 1 si y solo si el valor no era un dígito.

Yo siempre había pensado el problema de la forma obvia, algo así como:

no_es_digito:
clr C
subb A, #’0′
jc no_es_dig_end
subb A, #10
cpl C
no_es_dig_end:
ret

Pero a mi hermano le dijeron que podía hacerse con cinco instrucciones, lo que me hizo pensar en más detalle… hasta que vi que la resta no era la única solución. Eso me llevó al siguiente código:

no_es_digito:
add A, #(256 – ‘0’)
add A, #(256 – 10)
ret

El primer add lleva, en forma modular, el valor desde el rango ASCII ‘0’‘9’ al rango 0x000x09. En base a eso es simple ver que el segundo add dará como resultado C = 1 solo si el valor es 0x0a o mayor. Por lo tanto, solo dará C = 1 si el caracter pasado originalmente no cae en el rango ‘0’‘9’.

No creo que pueda hacerse con dos instrucciones, ya que las instrucciones que alteran C en base al contenido del acumulador son aritméticas (según recuerdo!) y solo pueden “detectar” valores mayores o menores a uno especificado… pero el desafío queda abierto 😀

Proyecto para Labo de Micros

Acá está todo el software del proyecto que hicimos (junto con Mariano Beiró) para la materia Laboratorio de Microcomputadoras: un juego de Ta-Te-Ti capaz de conectarse a cualquier televisor (con audio incluido!). Una de las cosas más particulares que tenía el software era el uso de un tabla comprimida con Huffman para determinar que jugada realizar en base a la posición del tablero. Algunas fotos:


La imagen proyectada sobre una PC cuando hicimos la presentación (como la salida era video compuesto la tomaba sin problemas).


Una vista de la plaqueta principal.