# A case study in micro-optimization

Last week I saw a wrap up from
cedric’s coding challenge on his blog, the problem looked simple enough, * “write a counter function that counts from 1 to max but only returns numbers whose digits don’t repeat”*.

My first stab at it consisted of a brute force solution of looking at every natural number up to a given limit and determining if the digits were unique, the crux of which was this function that given a number determined if the digits repeated or not

private static bool DoDigitsRepeat(long num) { ····//Used to track which digits have already been encountered ····int used = 0; ····while (num > 0) { ········int digit = (int)num % 10; ········num = num / 10; ········int index = 1 << digit; ········if ((used & index) == index) { ············return true; ········} ········else { ············used |= index; ········} ····} ····return false; }

The good thing about this solution was that it met the secondary goal for the problem which was to determine the biggest gap in the sequence of generated numbers, but in terms of performance it sucked! It worked well for the smaller limits defined in the problem but when you tried to push and generate all matching numbers till the max possible i.e. 9876543210 it just took too long (I think it was under a minute, but that was still too long)

I went back to the problem page and saw that the scripting weenies who had tried to use string functions to prune numbers had the worst performance of all solutions, that made me feel a bit better until I saw CrazyBob’s Java solution(the fast version), the comments indicated that the solution found the total count in under half a second. This solution was not able to determine the biggest gap though because the numbers were generated out of order but nevertheless this was an excellent solution.

So began my quest to come up with a faster solution. I felt sure that I could use the bit twiddling hacks to come up with a faster solution, it was just a question of hitting the right spot.

The first non-brute force solution that I tried was to enumerate all possible subsets of {0,1,2,3,4,5,6,7,8,9} and then to generate permutations from that set making sure that zero was never in the first place. Subset generation is pretty easy to do, all you need to do is count from 0 to 2^n where n is the number of elements in the set, the bit patterns of all the numbers in this range can be used to generate all the subsets. For the first cut I tried to be Object-Oriented and used the below class to implement a BitTable.

public class BitTable { ····private uint _storage = 0; ····public BitTable(uint value) { ········_storage = value; ····} ····public void Set(int index) { ········Debug.Assert(index >= 0 && index < 32); ········_storage |= (uint)(1 << index); ····} ····public void Reset(int index) { ········Debug.Assert(index >= 0 && index < 32); ········_storage &= (uint)~(1 << index); ····} ····public bool IsSet(int index) { ········Debug.Assert(index >= 0 && index < 32); ········uint val = (uint)(1 << index); ········return (_storage & val) == val; ····} }

Of course as I soon found out, all the method calls to BitTable were really slowing things down (and by slow I mean a 1-2 seconds slower than CrazyBob’s solution), so I dropped the class and moved all the operations inline, also I realized that since I was using value types to hold the state in the search/call tree, I didn’t need to set and unset the state after each recursive call. Here’s the final version of this line of thinking.

class Beust4 { private static int _total = 0; public static void Run() { ····_total = 0; ····//Generate all possible subsets of a set of 10 elements ····//2^10 = 1024 ····for (int i = 1; i <= 1024; ++i) { ········Permute(i, 0L); ····} ····Console.WriteLine("nTotal: {0}", _total); } private static void Permute(int digits, long current) { ····for (int index = (current > 0) ? 0 : 1; index <= 9; ++index) { ········if ((digits & (1 << index)) == (1 << index)) { ············if ((digits & ~(1 << index)) == 0) { ················//Console.Write("{0}, ", (current * 10) + index); ················++_total; ················return; ············} ············else { ················Permute(digits & ~(1 << index), (current * 10) + index); ············} ········} ····} }

I felt really good about this attempt but to my surprise when I ran it, I took 1.2 seconds on average which was still more than 2 times slower than CrazyBob’s Java solution. I got really stuck at this point and I had to make sure that there was not something obvious that I was missing. The first thing I did was to port CrazyBob’s solution to .Net so that I could compare both solutions, I’ve uploaded the C# version of CrazyBob’s solution here in case anyone wants to take a look.

On the surface it looks like the bit twiddling based solution should run faster because it does not make all those method calls, the other thing that I suspected was causing problems was the number of recursive function calls that were being made, so I put in some code to check the number of recursive function calls that were being made and to my surprise I found that CrazyBob’s solution was making 8877691 recursive calls as compared to my solution which was making almost 10 times that number. Also the actual soultion count is 8877690 which meant that the number of calls in CrazyBob’s solution was near optimal. So it was clear that it was the number of calls that were costing me the half second. Btw CrazyBob’s C# version still ran in **700ms** on average on my laptop, which was still **500ms** faster than my C# version.

I then started to think about alternate ways to attack the issue, one track I went down was to consider all the digits as a complete graph and then coming up with a way to enumerate all paths in the graph, traversing a n edge in the graph would remove other edges from the graph and make them not available. This reminded me on Knuth’s Dancing Links Algorithm and I read up on that a bit, this paper from Knuth on the subject was an excellent read. It looked to me that CrazyBob had used an approach similar to the DLX algorithm, but after reading the entire paper from Knuth it still didn’t strike me as to why using a DLX approach would provide such excellent performance as compared to my version.

So I went back a bit to comparing both solutions and I think the two key observations that I saw that

- CrazyBob’s solution went down the search tree to one level above the last and then just generated all the solution from there instead of recursing down to the last level, so for e.g. supposing it was generating 5 digit numbers and it had already generated 4321, then at that level it didn’t make additional recursive calls to add the final digit, it was able to add the last digit at the same level pruning the search tree quite a bit. In contrast my solution was basically doing a method call for every digit of every number in the solution set.
- The above optimization was made possible by using the length of the final solution as the key, so first all 1 digit solutions were generated, followed by two digit ones and so on.

Cool, so now I ported my bit twiddling version to generate based on the length of the numbers and the optimization to prune the search tree one level above the last came naturally. I ran my solution and guess what, it was still 200 ms slower than CrazyBob 🙁 Aaaarghhhh!!!!!

for (int len = 1; len <= 10; ++len) { ····Generate(len, 0, 0xFFF >> 2, 0); } private static void Generate(int maxLen, int currentLen, int availableDigits, long currentValue) { ····bool last = (currentLen == maxLen - 1); ····for (int digit = (currentValue == 0) ? 1 : 0; digit <= 9; ++digit) { ········if ((availableDigits & (1 << digit)) != (1 << digit)) ············continue; ········if (last) { ············++_total; ············//Console.Write("{0}, ", (currentValue * 10) + i); ········} ········else { ············Generate(maxLen, currentLen + 1, availableDigits & ~(1 << digit), (currentValue * 10) + digit); ········} ····} }

But I knew I was getting close, at this point I knew what was killing me basically to determine which bits were set I was iterating from 0 to 9 and then testing if that bit was set in the number or not, this test was killing me because most of the times the bit was not set and I was doing a huge huge number of unnecessary tests. So I needed a way to iterate through only the set bits. The first solution I tried used a hashtable but that caused a even bigger degradation in performance. Finally for a lack of a better way to express this in C# I had to waste 4KB of memory and use an array to allow me to iterate through the indexes of the set bits in a number.

The final solution ran in 350ms on average, almost a 50% improvement over CrazyBob’s solution, woot!!!!! A further optimzation to move the if statement from inside the loop to outside which makes the code more readable but incredibly smelly because of the near duplicate code in both the if and else blocks shaved off another 50ms. Here’s the final version without the optimzation for moving the if statement outside which makes the code a bit shorter.

class Beust5 { ····private static int _total = 0; ····private static int[] _pre = null; ····public static void Run() { ········_total = 0; ········_pre = new int[(1 << 10) + 1]; ········for (int i = 0; i <= 10; ++i) { ············_pre[1 << i] = i; ········} ········for (int len = 1; len <= 10; ++len) { ············Generate2(len, 0, 0xFFF >> 2, 0); ········} ········Console.WriteLine("nTotal: {0}", _total); ····} ····private static void Generate2(int maxLen, int currentLen, int availableDigits, long currentValue) { ········bool last = (currentLen == maxLen - 1); ········int x = availableDigits; ········while (x != 0) { ············//digit will contain the lowest set bit ············int digit = _pre[x ^ (x & (x - 1))]; ············x &= (x - 1); ············//Avoid starting with zero ············if (digit == 0 && currentValue == 0) ················continue; ············if (last) { ················++_total; ················//Console.Write("{0}, ", (currentValue * 10) + i); ············} ············else { ················Generate2(maxLen, currentLen + 1, availableDigits & ~(1 << digit), (currentValue * 10) + digit); ············} ········} ····} }