Posted December 21 2008

apple art bits c code evil games interaction iphone java js mathematica micro optimization php quine reveng sound updates useless video whine xterm zip
Feeds: RSS / Twitter

Remove ad (sets cookie)
PHP has finally taken the plunge into the 1960s by implementing GOTO. Here's an implementation for Java.
A patch for a serious Java bug. No longer needed as of June 16.
Max & Michael have written a Max/MSP driver based on the multitouch code.
This is a little test applet to shows the touch points as reported by the MacBook multitouch pad. The drivers seems to be able to track an impressive 11 fingers at once.

My CSS-fu is weak; please use a recent browser.

Some rights reserved.

Random, semi-related image by Eva the Weaver.

Bit Tricks, Part II: Enumerating Bits

How to enumerate bits quickly using DeBruijn sequences.

This is part 2 of a series of posts on bit-twiddling tricks and micro-optimizations. This one looks at ways to try to optimize the following function:

"Plain"
void bitscan(long bitset) { for (int i=0; i < 64; i++) if ((bitset >> i)&1) process(i); }

It loops across a word, and performs some action for each 1-bit found.

Updated 20090128: Fixed a typo in two of the code snippets. Thanks, Martin!

Finding the next bit

The original code tests one bit at a time. This is, of course, a bit slow. Luckily, just about every CPU ever made has specialized hardware for scanning bits: It is used for calculating the carry when adding numbers. This is the classic x &- x operator.

This illustrates why it works. Depending on your fonts, ~ (bitwise not) and - (negation) may look very similar.

00100001010001000   x
11011110101110111   ~x  
11011110101111000   ~x + 1         a.k.a. -x       note the carry!
00000000000001000   x & (~x + 1)   a.k.a. x & -x

Note that this returns the bit as 2i, not i, so the next step is to find i.

Finding the index

Once we have extracted a single one bit, the next operation is to find its position.

Java comes with a very clever library function for doing this, taken from Hacker's Delight. It's basically a binary search, but with branches merged.

    public static int numberOfTrailingZeros(long i) {
        // HD, Figure 5-14
	int x, y;
	if (i == 0) return 64;
	int n = 63;
	y = (int)i; if (y != 0) { n = n -32; x = y; } else x = (int)(i>>>32);
	y = x <<16; if (y != 0) { n = n -16; x = y; }
	y = x << 8; if (y != 0) { n = n - 8; x = y; }
	y = x << 4; if (y != 0) { n = n - 4; x = y; }
	y = x << 2; if (y != 0) { n = n - 2; x = y; }
	return n - ((x << 1) >>> 31);
    }
"Hacker's Delight"
int table[64]; void bitscan(long bitset) { while (bitset) { int t = bitset & -bitset; process(Long.numberOfTrailingZeros(t)); bitset ^= t; } }

This is a little bit faster for most inputs, but still not very good.

Finding the index, slightly faster

The numberOfTrailingZeros function has 6 conditional branches; it's basically doing a binary search. It's possible to write it in branch-free style, but there's a simpler way to get most of the benefit: We can exploit the fact that

2n - 1 = 20 + 21 + ... + 2n-1

and thus

bitCount(2n-1) = n

Also, the bitCount() function (also from Hacker's Delight) is branch free:

     public static int bitCount(long i) {
        // HD, Figure 5-14
	i = i - ((i >>> 1) & 0x5555555555555555L);
	i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L);
	i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
	i = i + (i >>> 8);
	i = i + (i >>> 16);
	i = i + (i >>> 32);
	return (int)i & 0x7f;
     }
"Hacker's Delight 2"
int table[64]; void bitscan(long bitset) { while (bitset) { int t = bitset & -bitset; process(Long.bitCount(t-1)); bitset ^= t; } }

Finding the index, fast

By multiplying the single bit (which will have a value of 2i) with a constant, you are essentially shifting the constant left by i bits. Consider the three middle bits here:

001101 *      1 = 000001101
001101 *     10 = 000011010
001101 *    100 = 000110100
001101 *   1000 = 001101000
001101 *  10000 = 011010000
001101 * 100000 = 110100000

Notice how the values marked in red are unique. Using this trick, the indexing operation can be performed using a simple multiplication, bit-shift and table lookup, given that we can find a De Bruijn sequence(W).

"DeBruijn"
int table[64]; void bitscan(long bitset) { while (bitset) { int t = bitset & -bitset; process(table[(t * 0x0218a392cd3d5dbfL) >>> 58]); bitset ^= t; } }

If you can control the input data (which you probably do), and you don't care about the order of the calls to process(), you can pre-shuffle the input to avoid the table lookup altogether.

"DeBruijn 2"
void bitscan(long bitset) { while (bitset) { int t = bitset & -bitset; process((t * 0x0218a392cd3d5dbfL) >>> 58); bitset ^= t; } }

Benchmarks

This graph shows the runtime with various densities (the X axis measures the percentage of 1-bits (randomly distributed) in the input array). Note the runtime for the original version: In the worst case, the "is this bit on" branch will be taken 50% of the time, independently at random, rendering the branch prediction hardware completely useless.

Let's go nuts

Later, the case where the function process simply increments a counter:

void process(int i) {
  table[i]++;
}

Also, how to avoid all of the above if inline assembly is available.

Comments

Have you considered the ffs() function? :-) (On x86, it maps down to bsf, which is fast -- much faster than a multiply.)
— Steinar H. Gunderson 2009-03-01
[Indeed. If you can run arbitrary instructions, that's definitely the way to go (this was for a project which had to be Pure Java®). Adding ffs() to the benchmark is now on my List®.]