(Follow-up to Sumando subconjuntos – La respuesta.)

#### Introduction

In a previous post (and in a comment from Guille [ɡiˈʒe] 😀 ) we have seen how the pigeonhole principle implies that a set of 70 numbers in the range [10^{18}, 10^{19}) must have two subsets with equal sum. But this is a non-constructive proof, as it doesn’t give us the two subsets with the same sum. To rectify this omission, in this post we will see how this “sum collision” can be obtained.

#### Finding collisions: simple cases

We can start by defining the problem in a more general way: given a sequence of elements *x_i* and a function *f*(), find two elements of the sequence, *x_j* and *x_k*, such that *f*(*x_j*) = *f*(*x_k*). In this way the sum collision problem can be reduced to finding a duplicate in the associated sequence of sums, *f*(*x_i*).

Two common ways to get duplicates in a sequence are the following:

```python
def find_duplicate_sort(g):
    sl = sorted(g)
    prev = None
    for e in sl:
        if e == prev:
            return e
        prev = e
    return None
```

```python
def find_duplicate_set(g):
    es = set()
    for e in g:
        if e in es:
            return e
        es.add(e)
    return None
```

The first one has O(n log n) complexity and the second one has O(n) expected complexity if we use a hash-based set. As the set-based approach also has a lower constant factor, we will use it in the rest of this post.
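As a quick sanity check, the set-based function can be exercised on a toy sequence of sums (the numbers below are made up for illustration; the helper is repeated here so the snippet is self-contained):

```python
def find_duplicate_set(g):
    es = set()
    for e in g:
        if e in es:
            return e
        es.add(e)
    return None

# Toy example: sums of a few small subsets; 6 appears twice
# (1 + 5 and 2 + 4), so it is reported as the duplicate.
sums = [1 + 2, 1 + 5, 2 + 4, 4 + 5]
print(find_duplicate_set(sums))  # → 6
```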

This algorithm works well if the sequence to be inspected for duplicates fits entirely in RAM, but in this case we have seen that tens of billions of elements must be inspected to have a high probability of finding a collision. In the next section we will analyse how this restriction can be circumvented.

#### Finding collisions in big sets

Each of the subsets to be evaluated in this problem can be encoded using 70 bits and, to allow a simple and fast sum-calculation algorithm to be used, this was rounded up to 10 bytes. Then, if we want to inspect 20 billion subsets of 35 elements to get a very high probability of not wasting the calculation time, we will need 200 GB to store the whole sequence. 200 GB of data cannot fit in the RAM of a typical machine, but nowadays it’s quite easy to store that amount of data on a hard disk.
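The 10-byte encoding can be sketched in a few lines of Python. The bit order (bit *i* of the set stored least-significant-bit first inside byte *i*/8) matches the byte strings shown in the results section below, but the helper name is my own:

```python
def encode_subset(indices, n_bytes=10):
    """Pack a set of element indices (0..69) into n_bytes bytes,
    storing bit i LSB-first inside byte i // 8."""
    buf = bytearray(n_bytes)
    for i in indices:
        buf[i // 8] |= 1 << (i % 8)
    return bytes(buf)

# Indices {0, 2, 3, 5, 6, 7} set in the first byte give 0xed.
print(encode_subset([0, 2, 3, 5, 6, 7]).hex())  # → ed000000000000000000
```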

To allow a fast hash-set based RAM collision search while keeping the bulk of the data in disk, we can take the following approach:

1. Generate in RAM a big number of random subsets and sort them by their sums.
2. Find a vector of numbers (to be called “pivots”) splitting the sorted subset vector by their sums into approximately equal-sized segments. (Segment *n* will be composed of all the subsets whose sum is between pivot *n-1* and pivot *n*.)
3. Generate in RAM a big number of random subsets and sort them by their sums.
4. Split the sorted subset vector into segments *using the previously generated pivots* and append each segment to an associated segment file (for example, append segment 3 to *0003.bin*).
5. Repeat steps 3 and 4 until getting enough subsets.
6. Check each segment file, one at a time, for collisions.

If we choose enough pivots, each segment file will be small enough to allow step 6 to be done easily with a hash-based set. (The segment files won’t all have exactly the same size, as the generated subsets are random; but the law of large numbers ensures that their sizes won’t differ much.)
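The pivot selection and segment lookup of steps 2 and 4 can be sketched as follows (in Python rather than the C used for the real run, with toy numbers instead of 35-element subset sums; `bisect` does the segment lookup):

```python
import bisect
import random

def choose_pivots(sample_sums, n_segments):
    """Pick pivots from a sample so each segment receives roughly
    the same share of future sums (empirical quantiles)."""
    s = sorted(sample_sums)
    return [s[len(s) * k // n_segments] for k in range(1, n_segments)]

def segment_of(total, pivots):
    """Index of the segment (file) a given sum belongs to."""
    return bisect.bisect_left(pivots, total)

# Toy run: small random "sums" instead of real subset sums.
random.seed(1)
sample = [random.randrange(1000) for _ in range(10000)]
pivots = choose_pivots(sample, 10)

counts = [0] * 10
for total in (random.randrange(1000) for _ in range(10000)):
    counts[segment_of(total, pivots)] += 1
print(counts)  # roughly 1000 sums per segment, as the law of large numbers suggests
```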

#### Source code & parameters

The (ugly) C code that checked for collisions can be found in the repository associated with this blog. The chosen parameters were:

- **Number of subsets:** 2·10^{10}.
- **Number of subsets in RAM:** 10^{7}.
- **Number of elements in each subset:** 35 (constant).
- **Number of segments:** 1000.
- **Number of slots in the hash set:** 2^{27}.

#### Results

The first stage (segment file generation) took approximately 41 hours of elapsed time, somewhat over my original estimate of 36 hours, and the segment file sizes ranged from 194827630 bytes to 206242980 bytes. The second stage (collision detection inside each segment file) lasted for 12-18 hours.

The output of the second stage (discarding files where no collisions were found) was:

```
Processing file 218...
Collision between identical elements.
DONE in 40.754850s.
Processing file 363...
Collision between different elements!!!
ed940f4f5710c6351a00 a35377a5a74a03961c00
DONE in 38.585990s.
Processing file 394...
Collision between different elements!!!
9ab5dfa6312e090e2a00 6558c9c8eab667233400
DONE in 35.570039s.
Processing file 409...
Collision between different elements!!!
2429d70f6ff500ab2c00 423cf8c758740ac73b00
DONE in 34.499926s.
Processing file 434...
Collision between different elements!!!
5a8bf5a532915f213100 496ebc281e9958713e00
DONE in 32.610608s.
Processing file 475...
Collision between different elements!!!
c35d3f427a7c488c0e00 4e37132b717a287a1900
DONE in 21.971667s.
Processing file 655...
Collision between different elements!!!
4653023676ea8e5f1900 6abb914e641ba45a3900
DONE in 21.514123s.
Processing file 792...
Collision between different elements!!!
8a4a63d85de377130600 853d3d45b15e0c471e00
DONE in 21.506716s.
```

Each set is represented as a byte string, with bit number increasing when advancing through the byte string. For example, `ed940f4f5710c6351a00` represents the bit string `10110111001010011111000011110010111010100000100001100011101011000101100000000000` and, consequently, the subset with indices 0, 2, 3, 5, 6, 7, 10, 12, 15, 16, 17, 18, 19, 24, 25, 26, 27, 30, 32, 33, 34, 36, 38, 44, 49, 50, 54, 55, 56, 58, 60, 61, 65, 67, 68. Its elements are

5213588008354709077 9115813602317594993 1796745334760975384 3579709154995762145 2312952310307873817 3627590721354606439 5763055694478716846 2730952213106576953 4868653895375087301 9737387704190733086 9262565481673300485 5968266991171521006 6752113930489992229 3772194655144630358 9029836747496103755 3318990262990089104 9205546877037990475 9849598364470384044 1376783969573792128 1108556560089425769 7820574110347009988 6951628222196921724 4776591302180789869 7999940522596325715 2290598705560799669 7835010686462271800 8998470433081591390 9131522681668251869 9096632376298092495 5295758362772474604 5953431042043343946 3151838989804308537 8643312627450063997 3624820335477016277 3782849199070331143

and its sum is 203743882076389458417.
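The decoding described above can be reproduced with a few lines of Python (the helper name is mine; bit *i* of the set is stored LSB-first in byte *i*/8):

```python
def decode_subset(hex_repr):
    """Return the element indices encoded in a subset byte string."""
    data = bytes.fromhex(hex_repr)
    return [i for i in range(8 * len(data))
            if (data[i // 8] >> (i % 8)) & 1]

indices = decode_subset("ed940f4f5710c6351a00")
print(len(indices), indices[:6])  # → 35 [0, 2, 3, 5, 6, 7]
```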

In the same way, the set `a35377a5a74a03961c00` has elements

5213588008354709077 9011219417469017946 3579709154995762145 3627590721354606439 5941472576423725122 4317696052329054505 2730952213106576953 5014371167157923471 9737387704190733086 9262565481673300485 5968266991171521006 5917882775011418152 5866436908055159779 9233099989955257688 3772194655144630358 3318990262990089104 9990105869964955299 2664344232060524242 1376783969573792128 1108556560089425769 7820574110347009988 9889945707884382295 7652184907611215542 8082643935949870308 4271233363607031147 6415171202616583365 6393762352694839041 2290598705560799669 7481066850797885510 5295758362772474604 5953431042043343946 9929081270845451034 7207546018039041794 3624820335477016277 3782849199070331143

and its sum is also 203743882076389458417, the same value as the previous, different subset. 😀

#### JS puzzle

The previous post asked what happened when the JS code

```
(function(f){f(f)})(function(g){g(g)})
```

was executed. First, we can observe that the two anonymous functions are really the same, as they only differ in the name of a dummy parameter. Let’s call this function `u()` and observe what happens when we apply `u()` over a function `f()`:

```
u(f) -> f(f)
```

It’s clear that `u()` applies its argument over itself. As in this case we are applying `u` over itself, the result will be

```
u(u) -> u(u)
```

giving us an infinite recursion.
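The same construction can be reproduced in Python (a small side experiment of mine, not part of the original puzzle), where the interpreter’s recursion limit turns the infinite recursion into a catchable error:

```python
import sys

sys.setrecursionlimit(100)  # keep the experiment short

u = lambda f: f(f)  # the same function as the JS one, modulo syntax

try:
    u(u)  # u(u) -> u(u) -> ... never terminates on its own
except RecursionError:
    print("infinite recursion, as predicted")
```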