Quines [remix]

As the quines in the previous post were criticized as boring and ordinary :-P , I did a remix:

#include <stdio.h>
#define s char s[]
s="#if 0\nimport json;r=json.dumps\nprint'#include <stdio.h>\\n#define s char s[]\\ns=%s;\\n%s'%(r(s),s)\n\"\"\" \"\n#elif 1\n#undef s\nint main(void)\n{\n  char *t = s;\n  printf(\"#include <stdio.h>\\n#define s char s[]\\ns=\\\"\");\n  while (*t)\n  {\n    if (*t == '\\n')\n      printf(\"\\\\n\");\n    else if (*t == '\"')\n      printf(\"\\\\\\\"\");\n    else if (*t == '\\\\')\n      printf(\"\\\\\\\\\");\n    else\n      printf(\"%c\", *t);\n    t++;\n  }\n  printf(\"\\\";\\n%s\\n\", s);\n  return 0;\n}\n#elif 0\n\" \"\"\"\n#endif";
#if 0
import json;r=json.dumps
print'#include <stdio.h>\n#define s char s[]\ns=%s;\n%s'%(r(s),s)
""" "
#elif 1
#undef s
int main(void)
{
  char *t = s;
  printf("#include <stdio.h>\n#define s char s[]\ns=\"");
  while (*t)
  {
    if (*t == '\n')
      printf("\\n");
    else if (*t == '"')
      printf("\\\"");
    else if (*t == '\\')
      printf("\\\\");
    else
      printf("%c", *t);
    t++;
  }
  printf("\";\n%s\n", s);
  return 0;
}
#elif 0
" """
#endif

You can check that it works on the web or by downloading the file and testing it:

$ python polyquine.c | diff polyquine.c -
$ gcc -ansi -pedantic -Wall polyquine.c -o polyquine && ./polyquine | diff polyquine.c -

I’m aware that there are some impressive examples out there, but I haven’t analyzed them to avoid spoiling the fun.

What other language should I add? Reader contributions are welcome!

Quines

Introduction

One interesting programming exercise is to write a quine, a program that outputs its own source code (excluding empty programs!). If we naively try to write it just by using the print statement we will get into an infinite regression:

print 'print \' print \\\'print \\\\\\\'...

The basic problem is that the instructions required to write other instructions lead to an infinite recursion. So how can we avoid this problem?

The trick

One general way to do it is the one used by Von Neumann to make his abstract self-replicating machines and by nature itself. The trick is to handle the same instructions in two different ways: as instructions to be followed and as data to be copied.

Writing a C quine

As we will need to access the same data in two different ways, we must define a variable to avoid duplicating it:

char s[] = "<data will go here>";

It’s easy to write the code that prints everything before the data,

int main(void)
{
  printf("char s[] = \"");

but now we need to print the data as represented inside the string. Fortunately, of the characters we use, only the newline, the quote and the backlash require escape sequences:

  char *t = s;
  while (*t)
  {
    if (*t == '\n')
      printf("\\n");
    else if (*t == '"')
      printf("\\\"");
    else if (*t == '\\')
      printf("\\\\");
    else
      printf("%c", *t);
    t++;
  }

Finally, we print the data as a normal string and exit main():

  printf("%s", s);
  return 0;
}

This program is finished with exception of filling in the data in variable s. As this is quite tedious to do by hand, we will do it using a program:

char s[] = "\";\nint main(void)\n{\n  printf(\"char s[] = \\\"\");\n  char *t = s;\n  while (*t)\n  {\n    if (*t == '\\n')\n      printf(\"\\\\n\");\n    else if (*t == '\"')\n      printf(\"\\\\\\\"\");\n    else if (*t == '\\\\')\n      printf(\"\\\\\\\\\");\n    else\n      printf(\"%c\", *t);\n    t++;\n  }\n  printf(\"%s\", s);\n  return 0;\n}\n";

Putting it all together, we get a program that prints its own source:

char s[] = "\";\nint main(void)\n{\n  printf(\"char s[] = \\\"\");\n  char *t = s;\n  while (*t)\n  {\n    if (*t == '\\n')\n      printf(\"\\\\n\");\n    else if (*t == '\"')\n      printf(\"\\\\\\\"\");\n    else if (*t == '\\\\')\n      printf(\"\\\\\\\\\");\n    else\n      printf(\"%c\", *t);\n    t++;\n  }\n  printf(\"%s\", s);\n  return 0;\n}\n";
int main(void)
{
  printf("char s[] = \"");
  char *t = s;
  while (*t)
  {
    if (*t == '\n')
      printf("\\n");
    else if (*t == '"')
      printf("\\\"");
    else if (*t == '\\')
      printf("\\\\");
    else
      printf("%c", *t);
    t++;
  }
  printf("%s", s);
  return 0;
}

But, though it works in GCC, it’s not a strict C99 file:

cc1: warnings being treated as errors
prog.c: In function ‘main’:
prog.c:4: error: implicit declaration of function ‘printf’
prog.c:4: error: incompatible implicit declaration of built-in function ‘printf’

This can be fixed by including the stdio.h header:

#include <stdio.h>
char s[] = "\";\nint main(void)\n{\n  printf(\"#include <stdio.h>\\nchar s[] = \\\"\");\n  char *t = s;\n  while (*t)\n  {\n    if (*t == '\\n')\n      printf(\"\\\\n\");\n    else if (*t == '\"')\n      printf(\"\\\\\\\"\");\n    else if (*t == '\\\\')\n      printf(\"\\\\\\\\\");\n    else\n      printf(\"%c\", *t);\n    t++;\n  }\n  printf(\"%s\", s);\n  return 0;\n}\n";
int main(void)
{
  printf("#include <stdio.h>\nchar s[] = \"");
  char *t = s;
  while (*t)
  {
    if (*t == '\n')
      printf("\\n");
    else if (*t == '"')
      printf("\\\"");
    else if (*t == '\\')
      printf("\\\\");
    else
      printf("%c", *t);
    t++;
  }
  printf("%s", s);
  return 0;
}

Doing a Python quine

Doing a Python quine is much easier, because the built-in function repr() gives a string representation of an object, allowing us to skip most of the code of the C quine. Using it we can get this natural three line quine:

s = "print 's = %s' % repr(s)\nprint s"
print 's = %s' % repr(s)
print s

If we want to do code golfing, we can:

  • Remove unnecesary spaces and line breaks.
  • Printing the representation of the data and the data using only one print statement.

This give us a much shorter quine:

s="print's=%s;%s'%(repr(s),s)";print's=%s;%s'%(repr(s),s)

though still longer than the record one.

The fall of the slinky I

The video

The video shows how a spring suspended from one of its ends reacts when its dropped. It can be observed that the lower end “doesn’t know what happened” until a wave propagates to it. In this post we will make a computer simulation of its behavior, to see if we can reproduce the phenomenon, and in the next we will apply a more analytic approach to the same problem.

Discretization

As we are studying the actions of gravity and inertia, we need to model the slinky as a massive spring. In (macroscopic) reality the mass and elasticity are continuously distributed throughout the slinky (macroscopically speaking) but, for the purpose of simulating its behavior with a computer model, we will represent the object as a series of masses connected with massless ideal springs:

Slinky discretization, showing how the discretized element properties relate to the original ones.

If we divide a slinky of total mass M in N small masses, each of them will have the value m = M/N. As the spring constants of series springs add as the resistances of parallel resistors, if we have a slinky with overall spring constant K divided into N-1 smaller springs, each of them will have a bigger spring constant, k = K(N-1) and (obviously) a smaller unstressed length, l_0 = L/(N-1).

Writing the simulation

Now it’s just a question of applying Newton’s laws of motion and the ideal spring equation to get a system of ordinary differential equations:

\displaystyle \frac{d^2x_i}{dt^2} = g + \frac{k}{m}\left[(x_{i+1}-x_i-l_0)H[i]-(x_i-x_{i-1}-l_0)H[N-1-i]\right]

where x_i are the coordinates of the masses (with i going from 0 to N-1), g is the acceleration of gravity, k is the spring constant of each massless spring, m is the value of each small mass, l_0 is the unstressed length of each massless spring and H[n] is the discrete Heaviside step function (used to avoid depending on undefined values).

This second order system can be reduced to system of first order ODEs in the usual way. Integrating it using RK4, we get the following Python code:

def get_xdot(x):
    sk = K * (NX - 1)
    sl = L / (NX - 1)
    mm = M / NX
    xdot = [x[NX + i] if i < NX else G
            for i in range(2 * NX)]
    for i in range(NX - 1):
        a = sk * (x[i + 1] - x[i] - sl) / mm
        xdot[NX + i] += a
        xdot[NX + i + 1] -= a
    return xdot

def rk4_step(x, dt):
    k1 = get_xdot(x)
    k2 = get_xdot([x[i] + dt * k1[i] / 2.0 for i in range(len(x))])
    k3 = get_xdot([x[i] + dt * k2[i] / 2.0 for i in range(len(x))])
    k4 = get_xdot([x[i] + dt * k3[i] for i in range(len(x))])
    return [x[i] + dt * (k1[i] + 2.0 * (k2[i] + k3[i]) + k4[i]) / 6.0
            for i in range(len(x))]

Now we need to define the initial conditions. If we just start with the masses separated by a distance l_0 and at rest, we won’t match the initial conditions of the slinky, because it was being stretched by the action of gravity. It’s not very difficult to compute initial conditions that leave the system at rest if it’s being held:

def initial_x():
    sl0 = L / (NX - 1)
    mm = M / NX
    sk = K * (NX - 1)
    w = M - mm
    x = [0.0 for i in range(2 * NX)]
    for i in range(1, NX):
        x[i] = x[i - 1] + sl0 + w / sk
        w -= mm
    return x

The remaining code is just plumbing and matplotlib presentation code. The whole program can be seen at the repository.

Running the simulation

If we run the simulation with the parameters

NT = 1000 # number of timesteps
NX = 40   # number of masses
T = 1.0   # simulation duration

L = 0.5   # slinky length
K = 1.0   # slinky overall spring constant
M = 1.0   # slinky mass
G = 1.0   # gravitational acceleration

we get the following results:

A simulation where the springs are too soft, giving some negative spring lengths (and consequent overlapping) near t = 1.

In this plot the gray lines represent the trajectory of the small masses and the black lines the trajectory of the slinky’s ends.

Clearly the springs are too soft and we are getting unphysical results, as we spring lengths go negative when t nears 1. To fix that, let’s run the simulations with a greater spring constant, K = 5:

A simulation where the springs have a physically reasonable constant, giving an intriguing behavior to its ends.

Now we get a more reasonable result, showing a phenomenon that is more similar to the one observed in the video: the bottom of the slinky remains in place while the top begins to fall. Now we can check if the slinky remains stationary when held by m_0:

def get_xdot(x):
    sk = K * (NX - 1)
    sl = L / (NX - 1)
    mm = M / NX
    xdot = [x[NX + i] if i < NX else G
            for i in range(2 * NX)]
    for i in range(NX - 1):
        a = sk * (x[i + 1] - x[i] - sl) / mm
        xdot[NX + i] += a
        xdot[NX + i + 1] -= a
    xdot[NX] = 0.0 # holding the slinky
    return xdot

Simulation to check if the slinky remains stationary when held from its upper end.

Conclusions

This simulation validates the main point of the original video: the lower end “doesn’t know” the upper end was released until a compression wave reaches it, at t \approx 0.45 in our simulation. But the detailed behavior differs, as the slinky only shows a compression wave once it reaches the nonlinear regime (when is no more space between the spires).

In the next post we will show an analysis of this nonlinear behavior and the analytical solution to the idealized slinky drop that was numerically simulated in this post.

Top k – Solution

[Answer to the problem in Top k.]

This was the problem:

Given a list of n numbers, get the top k ones in O(n) time. (Partial credit for o(n log n) solutions.)

n log k solution

The easiest way to solve this problem in o(n log n) time is to scan the n elements, keeping track of the greatest k elements seen up to the current element. Using a heap to store the top k elements, we get the following algorithm:

#define SWAP(type, a, b) {type swap_temp = a; a = b; b = swap_temp;}

static void minheap_pushdown(int *h, size_t hs, size_t i)
{
    size_t j = 0;
    if (2 * i + 2 < hs)
        j = (h[2 * i + 1] < h[2 * i + 2]) ? 2 * i + 1 : 2 * i + 2;
    else if (2 * i + 1 < hs)
        j = 2 * i + 1;
    if (j != 0 && h[j] < h[i])
    {
        SWAP(int, h[i], h[j]);
        minheap_pushdown(h, hs, j);
    }
}

static void minheap_raise(int *h, size_t i)
{
    if (i == 0)
        return;
    if (h[i] < h[(i - 1) / 2])
    {
        SWAP(int, h[i], h[(i - 1) / 2]);
        minheap_raise(h, (i - 1) / 2);
    }
}

void top_k_nlogk_cp(int *top_k, size_t k, const int *l, size_t n)
{
    size_t i;

    for (i = 0; i < k; i++)
    {
        top_k[i] = l[i];
        minheap_raise(top_k, i);
    }

    for (i = k; i < n; i++)
    {
        if (l[i] > top_k[0])
        {
            top_k[0] = l[i];
            minheap_pushdown(top_k, k, 0);
        }
    }
}

As minheap_pushdown() and minheap_raise() have O(log k) cost, the total cost of this solution will be O(n log k). This is better than O(n log n) if k grows slower than any power of n (if k is constant or O(log n), for example).

The previous solution is enough to get partial credit, but we can do better :-D

Linear time solution

Quicksort is based on a linear time partition operation. This operation starts with an array and an element of this array, called pivot, and ends with a partially sorted array, having the pivot in his sorted position, smaller elements before it and bigger ones after it. Then doing a single partition operation would suffice… if we could guarantee that the chosen pivot is the (n-k+1)-th element of the sorted array :-D

We cannot do that, but we can do something that is asymptotically equivalent: doing a quicksort-style recursion, though only in the partition that contains the desired pivot. The result is this (converting the tail recursion to iteration):

int *top_k_n_ip(int *l, size_t n, size_t k)
{
    size_t lo, hi, pos, i, j;
    int pivot;
   
    lo = 0;
    hi = n;
    pos = n - k;

    while (hi - lo > 1)
    {
        i = lo + 1;
        j = hi - 1;
        pivot = l[lo];

        while (1)
        {
            while (i < hi && l[i] <= pivot)
                i++;
            while (j > lo && l[j] > pivot)
                j--;
            if (i > j)
                break;
            SWAP(int, l[i], l[j]);
        }

        SWAP(int, l[lo], l[j]);

        if (j < pos)
            lo = j + 1;
        else if (j > pos)
            hi = j;
        else
            break;
    }

    return l + n - k;
}

Now we need to check that this gives us a linear-time algorithm. The execution time of the body of the while (1) loop is bounded from above by C times hilo, where C is a constant. If we assume that the chosen pivot divides the l [ lo .. hi ] array section in two roughly equal parts (this assumption can be made rigorous by choosing the pivot carefully), we get

T(n) \leq A + (B + C \cdot n) + (B + C \cdot n/2) + \hdots + (B + C \cdot 1)

where A, B and C are constants.

Summing all the terms containing B and taking out all the C factors we get:

T(n) \leq A + \lceil\log_2 n\rceil B + C (1 + \hdots + n/2 + n)

T(n) \leq A + \lceil\log_2 n\rceil B + 3 C n

T(n) = O(n)

As we know that any correct algorithm must employ \Omega(n) time (it needs to inspect all elements), we also know that this algorithm has \Theta(n) time complexity. Now let’s check that against reality :-D

Benchmarks

We will benchmark the following algorithm implementations:

Execution times for all the algorithms when k is fixed at 30.

As k has a fixed value in the previous plot, the O(n log k) algorithm is effectively an O(n) algorithm. In fact, it performs better than the truly linear algorithms, probably due to cache effects.

Another thing that can be seen in these plots is the difficulty of seeing the difference between n and n log n in a log-log plot, but that is something to be expected.

Execution times for all the algorithms when k is 30% of n.

Here the qualitative behavior of the O(n log k) solution is the expected one. The C implementation of the O(n) algorithm looks surprisingly good, with a performance competitive to the C++ implementation included as part of the standard library.

Top k

Given a list of n numbers, get the top k ones in O(n) time.

This problem is too easy to solve in O(n log n) time

def top_k_n_log_n(l, k):
    sl = sorted(l)
    return sl[-k:]

but o(n log n) solutions can be presented for partial credit. :-P

Solving the “product, sum & difference riddle”

A post in the scala-user group (via Mauro Ciancio) asked the following problem:

there are 3 people. let’s name them peter, simon and daniel. the three
are supposed to figure out a pair of numbes. the possible pairs are all
combinations of numbers from 0 to 1000, meaning:
(0,0), (1,0), (2,0)….(1000,0),(1,1),(1,2) up to (1000,1000)
peter knows the product of the pair, simon knows the sum, and daniel
knows the difference.

the following conversation takes places:

peter: i don’t know the solution
simon: i already knew that
peter: know i know the solution
simon: know i know it, too
daniel: wtf? i can only suspect one of the numbers but can’t be sure
peter: the number you are suspecting is wrong
daniel: k, now i now the numbers, too

As this riddle is a variation on the product & sum riddle, giving us an additional protagonist and a difference sequence of questions and answers, we can try to adapt our previous solution to this new problem.

the possible pairs are all
combinations of numbers from 0 to 1000, meaning:
(0,0), (1,0), (2,0)….(1000,0),(1,1),(1,2) up to (1000,1000)

This section of the riddle isn’t very clear, but we can tentatively translate it to Python as:

all_pairs = [(i, j) for i in range(1000+1) for j in range(i, 1000+1)]

peter knows the product of the pair, simon knows the sum, and daniel
knows the difference.

peter: i don’t know the solution
simon: i already knew that
peter: know i know the solution
simon: know i know it, too

The following four sections of the riddle are identical to those of the “product & sum” riddle, so they can be solved using essentially the same code (read the previous solution for the detailed explanation!):

def make_freq_table(it, initial_freqs=None):
    freq_table = {} if initial_freqs is None else dict(initial_freqs)
    for e in it:
        if e not in freq_table:
            freq_table[e] = 0
        freq_table[e] += 1
    return freq_table

# peter: i don't know the solution
num_pairs_by_prod = make_freq_table(x * y for x, y in all_pairs)
pairs_1 = [(x, y) for x, y in all_pairs if num_pairs_by_prod[x * y] > 1]

# simon: i already knew that
identif_by_prod_pairs_sums = set(x + y for x, y in all_pairs
                                 if num_pairs_by_prod[x * y] == 1)
pairs_2 = [(x, y) for x, y in pairs_1
           if x + y not in identif_by_prod_pairs_sums]

# peter: know i know the solution
num_pairs_by_prod_2 = make_freq_table(x * y for x, y in pairs_2)
pairs_3 = [(x, y) for x, y in pairs_2 if num_pairs_by_prod_2[x * y] == 1]

# simon: know i know it, too
num_pairs_by_sum_3 = make_freq_table(x + y for x, y in pairs_3)
pairs_4 = [(x, y) for x, y in pairs_3 if num_pairs_by_sum_3[x + y] == 1]

daniel: wtf? i can only suspect one of the numbers but can’t be sure

This indicates us that, based on the information he has available, Daniel is unable to choose one solution. Furthermore, there is one number that appears more frequently than the others in the pairs he is considering. As Daniel is aware of the information provided by Peter and Simon, we can start by classifying the pairs that are still candidates according to that information by their difference:

def get_pairs_by_diff(pairs):
    pairs_by_diff = {}
    for x, y in pairs:
        if x - y not in pairs_by_diff:
            pairs_by_diff[x - y] = []
        pairs_by_diff[x - y].append((x, y))
    return pairs_by_diff
pairs_by_diff_4 = get_pairs_by_diff(pairs_4)

The fact that Daniel is still unsure indicates that, whatever the value of the difference might be, there must be more than one pair associated with it. And the fact he has one “suspect number” indicates that one number appears more often than the others (in other words, there is only one modal value). Translating this reasoning to code:

def get_modal_elems(pairs):
    ft = make_freq_table((p[1] for p in pairs),
                         make_freq_table(p[0] for p in pairs))
    max_freq = max(ft.values())
    return [e for e in ft if ft[e] == max_freq]
pairs_5 = [(x, y) for x, y in pairs_4
           if len(pairs_by_diff_4[x - y]) > 1 and\
           len(get_modal_elems(pairs_by_diff_4[x - y])) == 1]

peter: the number you are suspecting is wrong

This means that, of the previously selected pairs, only those that don’t have the “suspect number” need to be considered. Getting the “suspect number” for each difference (we know that there is one, as we have done a selection by this criteria):

pairs_by_diff_5 = get_pairs_by_diff(pairs_5)
susp_num_by_diff_5 = dict((d, get_modal_elems(p)[0])
                          for d, p in pairs_by_diff_5.items())

Now, based on Peter’s information, we know we can remove the pairs containing the “suspect number” associated to their difference. Doing that:

pairs_6 = [(x, y) for x, y in pairs_5
           if susp_num_by_diff_5[x - y] not in (x, y)]

daniel: k, now i now the numbers, too

As Daniel now knows the answer, we know that there is only one pair associated to the difference value known to him. Selecting these pairs:

num_pairs_by_diff_6 = make_freq_table(x - y for x, y in pairs_6)
pairs_7 = [(x, y) for x, y in pairs_6 if num_pairs_by_diff_6[x - y] == 1]

Finally, if the riddle is uniquely answerable, we can get the pair items now:

assert len(pairs_7) == 1
print(pairs_7[0])

Putting all the code together and running it, we effectively obtain a single value for the pair: (64, 73).

Searching for bits in C

Some years ago, a friend asked me how to (efficiently) get the position of some bit with value 1 inside an (unsigned) integer. For example, taking 32 bit integers and taking the least significant bit (LSB) as the bit 0, these are valid answers:

0xa9e7da24 --> 5
0x1d56b8b0 --> 28
0x9459ffbb --> 0
0x9f0c2a38 --> 24

If we reduce the problem to a more specific one, finding the lowest bit with value 1 inside an integer, it’s easy to get a solution just by doing a bitwise AND against a moving mask:

size_t bitscan(uint32_t a)
{
  size_t i;
  uint32_t mask = 1;
  for (i = 0; i < 32 && !(a & mask); i++)
    mask <<= 1;
  return i;
}

but it’s relatively slow. Can we do it faster?

The answer is clearly yes if we are using x86 assembly, as this instruction set has a specific instruction for this purpose: Bit Scan Forward (BSF). But, even in C, we can do better by a three step process:

  • First we isolate the least significant bit that is set by using: a & (-a).
  • Then we take the residue of dividing this power of two by 37.
  • Finally, we look up the associated bit position, using the residue obtained in the previous step as an index.

Why the first step works? We can start with the zero: doing a bitwise AND against -0 gives us just zero. As the zero doesn’t have any 1 bits to start with, we can say we’ve “isolated” the least significant one, though in a somewhat vacuous sense.

To check the nonzero integers, let’s define the bit expansion of an unsigned integer a:

\displaystyle a = \sum_{i=0}^N a_i\,2^i

Then the result of doing -a (remember we are dealing with unsigned integers) will be:

\displaystyle -a = \sum_{i=0}^N (1-a_i)\,2^i + 1

If the number is nonzero, there must a least significant nonzero bit. Let’s call its position M. Writing a & (-a) now:

\displaystyle a\,\&\, (-a) = \left(\sum_{i=0}^N a_i\,2^i\right)\&\left(\sum_{i=0}^N (1-a_i)\,2^i + 1\right)

\displaystyle = \left(\sum_{i=M+1}^N a_i\,2^i + 1\cdot2^M + \sum_{i=0}^{M-1}0\cdot2^i\right)\&

\displaystyle \left(\sum_{i=M+1}^N (1-a_i)\,2^i + (1-1)\cdot2^M + \sum_{i=0}^{M-1}(1-0)\cdot2^i + 1\right) (definition of M)

\displaystyle = \left(\sum_{i=M+1}^N a_i\,2^i + 2^M\right)\&\left(\sum_{i=M+1}^N (1-a_i)\,2^i + \sum_{i=0}^{M-1}2^i + 1\right)

\displaystyle = \left(\sum_{i=M+1}^N a_i\,2^i + 2^M\right)\&\left(\sum_{i=M+1}^N (1-a_i)\,2^i + 2^M\right) (geometric series sum)

\displaystyle = \sum_{i=M+1}^N a_i(1-a_i)\,2^i + 2^M (AND bit by bit)

\displaystyle = \sum_{i=M+1}^N 0\cdot 2^i + 2^M = 2^M

we get the isolated Mth bit, as desired.

We can look for the smallest divisor giving different residues for 20, 21, …, 231 via a small piece of code:

def get_modulo(n):
    m = 1
    while True:
        if len(set(2**i % m for i in range(n))) == n:
            return m
        m += 1

if __name__ == '__main__':
    print(get_modulo(32))

Now we just have to put the bit numbers in their associated positions inside the lookup table. We can calculate the table using the following Python code:

def get_lut(max_exp, modulo):
    lut = [-1] * modulo
    for i in range(max_exp + 1):
        assert lut[2 ** i % modulo] == -1
        lut[2 ** i % modulo] = i
    return lut

def print_table(table):
    print('{%s}' % ', '.join('%d' % t for t in table))

if __name__ == '__main__':
    print_table(get_lut(32, 37))

Then, putting these two pieces together, we get the following C function:

int lut_bit_pos(uint32_t a)
{
  const int bit_pos_lut[37] =
  {
    -1, 0, 1, 26, 2, 23, 27, 32, 3, 16,
    24, 30, 28, 11, -1, 13, 4, 7, 17,
    -1, 25, 22, 31, 15, 29, 10, 12, 6, 
    -1, 21, 14, 9, 5, 20, 8, 19, 18
  };
  return bit_pos_lut[(a & (-a)) % 37];
}

As we are working with 32 bit integers, we can do a full test… though not using Ideone (it exceeds the runtime limits).

Edited 5/30/11 – Bit sets

The original problem was to handle a set of task slots, each slot having two possible states: occupied or empty. This set required an efficient support of the following operations:

  • Checking if a slot is occupied.
  • Occupying/freeing a slot.
  • Getting a free slot if one is available.

As this problem doesn’t ask us for the slot number, we can use an opaque type and internally codify the slot as its associated power of two:

typedef uint32_t set_t;
typedef int32_t set_aux_t;
typedef uint32_t slot_t;

#define NUM_SLOTS 32

#define EMPTY_SET ((set_t)-1)
#define FULL_SET ((set_t)0)

#define IS_VALID_SLOT(slot) ((slot)&&!((slot)&((slot)-1)))
#define SLOT_FROM_INDEX(slot_idx) (1<<(slot_idx))

#define GET_EMPTY_SLOT(set) ((set)&(-(set_aux_t)(set)))

#define SET_SLOT(set, slot) ((void)((set)&=~(slot)))
#define CLEAR_SLOT(set, slot) ((void)((set)|=(slot)))

#define IS_SLOT_SET(set, slot) (!((set)&(slot)))

We can check the implementation using Ideone.

Intermezzo: permutations

As I haven’t progressed very much with my WebGL apps due to a variety of reasons/excuses (lack of willpower, a crash course on nuclear reactors engineering since the Fukushima nuclear crisis started, etc.), I decided to make a quick post showing a solution to a simple problem. In other words, the main purpose of this post is just to avoid leaving a month without posts :-D

The problem

Given an arithmetic expression without parentheses, insert them to get a specified result. For example, if the expression is 2+2/2 and the desired result is 2, there is a single solution: (2+2)/2. On the other hand, if the desired result is 0 there is no solution.

It’s easy to see the answers in simple cases like the last one. But it’s not so easy to see if we can get a result of 18 by adding parentheses to 18-6/2+4-1*2… so let’s write code to do the hard work for us.

A solution

The parentheses only change the order in which the different operators apply and the effects of any operation order can be obtained with some parenthesization. So, if we search for a permutation of the order in which the operators are applied that gives us the desired result, we are getting a solution for the original problem.

This new problem is not exactly the same as the original one by two reasons that can be exemplified as follows:

  • Given the expression 2+2*2+2, the orders of operation 1-3-2 and 3-1-2 are associated to the same parenthesized expression: (2+2)*(2+2).
  • The parenthesizations (2+2)+2, 2+(2+2) and 2+2+2 are equivalent due to the associativity of addition.

But, as this only gives us some redundant solutions that can be easily eliminated with some post-processing, it’s not a problem from the correctness point of view. Being able to remove this redundant solutions without needing to enumerate them would be a great thing performance wise, but it doesn’t seem easy and it’s not required for small expressions (10! < 107).

Permutations in lexicographic order

To avoid accidentally skipping some permutations, it’s useful to follow a specific order when enumerating them. In this section we will see how we can go through all the permutations of an array [0, 1, 2, ..., n - 1] in lexicographic order.

Given some permutation of the array, that we will call a, the next one will probably share some prefix and will differ in the order of the last elements. As we cannot get a lexicographically greater permutation by permuting a suffix that is in reverse order (starting from the greater elements), we need to permute a suffix that is not in reverse order. Then we will make our first step to search for this suffix:

Find the largest index k such that a[k] < a[k + 1].

The specification of the largest index is equivalent to search for the minimal suffix to permute, something required to get a permutation that is lexicographically immediate. If this suffix cannot be found, we are at the last permutation.

Now we need to decide what to permute. It’s clear that we cannot get a greater permutation just by permuting the suffix a[k+1 .. n-1], as they are in reverse order. So we need to replace a[k] by some element and the obvious choice is to search for the immediately greater element in the suffix or, as the suffix a[k+1 .. n-1] is in reverse order:

Find the largest index l such that a[k] < a[l].

The lexicographically smallest permutation starting with the original a[l] can be obtained by appending all the other elements of the original suffix in ascending order. But, as the suffix is in descending order and l is the largest index such that a[k] < a[l], we can get the same result in an easier way:

Swap a[k] with a[l].
Reverse the sequence from a[k + 1] up to and including the final element a[n].

The final code in JS can be seen here.

Evaluating an arithmetic expression with different orders of operations

I started using an array whose members were alternately numbers and operators. The main problem is that, while it’s easy to go from an operator to the adjacent numbers in an array, this is not what we want to do. The desired behavior is to go from an operator to the operands, and they can be the result of a sequence of previous operations.

To solve this problem in a moderately efficient way, I adapted the parent node concept from the disjoint-set data structure. Each operation adds the result to the operator node and puts this node as the parent node of both operands, allowing efficient access to the value of the whole operand from any node that is included in it. The code that does this operations can be seen at the SVN repository of this blog.

Testing the solution

The whole solution can be tested at the same SVN repository, but let’s see some specific examples:

Input expression: 3+2*2-1*0.5
Input result 1: 3
Output: (3+2*2-1)*0.5
Input result 2: 6
Output: 3+2*(2-1*0.5)

Input expression: 18-6/2+4-1*2
Input result: 18
Output 1: ((18-6)/2+4-1)*2
Output 2: 18-(6/(2+4)-1)*2

GPGPU with WebGL: solving Laplace’s equation

This is the first post in what will hopefully be a series of posts exploring how to use WebGL to do GPGPU (General-purpose computing on graphics processing units). In this installment we will solve a partial differential equation using WebGL, the Laplace’s equation more specifically.

Discretizing the Laplace’s equation

The Laplace’s equation, \nabla^2 \phi = 0, is one of the most ubiquitous partial differential equations in physics. It appears in lot of areas, including electrostatics, heat conduction and fluid flow.

To get a numerical solution of a differential equation, the first step is to replace the continuous domain by a lattice and the differential operators with their discrete versions. In our case, we just have to replace the Laplacian by its discrete version:

\displaystyle \nabla^2 \phi(x) = 0 \rightarrow \frac{1}{h^2}\left(\phi_{i-1\,j} + \phi_{i+1\,j} + \phi_{i\,j-1} + \phi_{i\,j+1} - 4\phi_{i\,j}\right) = 0,

where h is the grid size.

If we apply this equation at all internal points of the lattice (the external points must retain fixed values if we use Dirichlet boundary conditions) we get a big system of linear equations whose solution will give a numerical approximation to a solution of the Laplace’s equation. Of the various methods to solve big linear systems, the Jacobi relaxation method seems the best fit to shaders, because it applies the same expression at every lattice point and doesn’t have dependencies between computations. Applying this method to our linear system, we get the following expression for the iteration:

\displaystyle \phi_{i\,j}^{(k+1)} = \frac{1}{4}\left(\phi_{i-1\,j}^{(k)} + \phi_{i+1\,j}^{(k)} + \phi_{i\,j-1}^{(k)} + \phi_{i\,j+1}^{(k)}\right),

where k is a step index.

Solving the discretized problem using WebGL shaders

If we use a texture to represent the domain and a fragment shader to do the Jacobi relaxation steps, the shader will follow this general pseudocode:

  1. Check if this fragment is a boundary point. If it’s one, return the previous value of this point.
  2. Get the four nearest neighbors’ values.
  3. Return the average of their values.

To flesh out this pseudocode, we need to define a specific representation for the discretized domain. Taking into account that the currently available WebGL versions don’t support floating point textures, we can use 32 bits RGBA fragments and do the following mapping:

R: Higher byte of \phi.
G: Lower byte of \phi.
B: Unused.
A: 1 if it’s a boundary value, 0 otherwise.

Most of the code is straightforward, but doing the multiprecision arithmetic is tricky, as the quantities we are working with behave as floating point numbers in the shaders but are stored as integers. More specifically, the color numbers in the normal range, [0.0, 1.0], are multiplied by 255 and rounded to the nearest byte value when stored at the target texture.

My first idea was to start by reconstructing the floating point numbers for each input value, do the required operations with the floating numbers and convert the floating point numbers to color components that can be reliably stored (without losing precision). This gives us the following pseudocode for the iteration shader:

// wc is the color to the "west", ec is the color to the "east", ...
float w_val = wc.r + wc.g / 255.0;
float e_val = ec.r + ec.g / 255.0;
// ...
float val = (w_val + e_val + n_val + s_val) / 4.0;
float hi = val - mod(val, 1.0 / 255.0);
float lo = (val - hi) * 255.0;
fragmentColor = vec4(hi, lo, 0.0, 0.0);

The reason why we multiply by 255 in place of 256 is that we need val_lo to keep track of the part of val that will be lost when we store it as a color component. As each byte value of a discrete color component will be associated with a range of size 1/255 in its continuous counterpart, we need to use the “low byte” to store the position of the continuous component within that range.

Simplifying the code to avoid redundant operations, we get:

float val = (wc.r + ec.r + nc.r + sc.r) / 4.0 +
	(wc.g + ec.g + nc.g + sc.g) / (4.0 * 255.0);
float hi = val - mod(val, 1.0 / 255.0);
float lo = (val - hi) * 255.0;
fragmentColor = vec4(hi, lo, 0.0, 0.0);

The result of running the full code, implemented in GLSL, is:

Solving the Laplace's equation using a 32x32 grid. Click the picture to see the live solving process (if your browser supports WebGL).

As can be seen, it has quite low resolution but converges fast. But if we just crank up the number of points, the convergence gets slower:

Incompletely converged solution in a 512x512 grid. Click the picture to see a live version.

How can we reconcile these approaches?

Multigrid

The basic idea behind multigrid methods is to apply the relaxation method on a hierarchy of increasingly finer discretizations of the problem, using in each step the coarse solution obtained in the previous grid as the “starting guess”. In this mode, the long wavelength parts of the solution (those that converge slowly in the finer grids) are obtained in the first coarse iterations, and the last iterations just add the finer parts of the solution (those that converge relatively easily in the finer grids).

The implementation is quite straightforward, giving us fast convergence and high resolution at the same time:

Multigrid solution using grids from 8x8 to 512x512. Click the picture to see the live version.

Conclusions

It’s quite viable to use WebGL to do at least basic GPGPU tasks, though it is, in a certain sense, a step backward in time, as there is no CUDA, floating point textures or any feature that helps when working with non-graphic problems: you are on your own. But with the growing presence of WebGL support in modern browsers, it’s an interesting way of partially accessing the enormous computational power present in modern video cards from any JS application, without requiring the installation of a native application.

In the next posts we will explore other kinds of problem-solving where WebGL can provide a great performance boost.