In the Sieve of Eratosthenes post, I mentioned that at work (EA) we have an optional programming challenge every once in a while. It was started last year by Jim Hejl and it's called Hacker's Delight - named after the excellent book of programming tricks by Henry S. Warren.
I love this stuff. Why? Who knows? Anyhow, this time I wrote the challenge - Hacker's Delight #6. I chose the { 64 choose 4 } problem from Let Me Count The Ways. My fast solution was initially ~3.9ms. Since I had not yet aggressively optimized it, I did that and got it down to ~2.5ms. That solution was written completely in C++. I figured that was about as fast as it got because it turned out that ~2.5ms is (slightly) faster than memset(data, 0, 635376 * sizeof(unsigned __int64)) !!
In short order, solutions came in that were almost exactly as fast as mine with quite different algorithms. Hmm. I thought, the bottleneck is memory-processor bandwidth - ~2.5ms is simply how long it takes to write 4.8mb of data. All the other processing is swamped by that.
Even so, Jim was experimenting with SIMD approaches and had an SSE version of 'memset' that executed in about half memset's time. I started playing with that but couldn't figure out how to use that to get better performance out of my algorithm.
Then Jim tells me that he received an entry from EA United Kingdom that executes in ~1.25ms - all SIMD (MMX)! The gauntlet is down. I asked him not to forward it to me or let me see it.
So today I finally got my own MMX version to about ~1.25ms. It seems to execute slightly faster than Jim's SSE memset. While minor speed improvements may be possible (on the order of tenths of milliseconds), I'm convinced that there's no way to get significantly faster performance.
I could be wrong.
When it's all over, I will post a more detailed description of the optimization steps for those of you who are interested. For those who are not, z z z z z z z z z z...
[The related, and more detailed document Optimizing 64 Choose 4 (.pdf)]