How do you reliably assess the cache behavior of your software? Thinking in terms of cache behavior is certainly very useful for making software that runs fast, but if you have no reliable way to check what is actually happening, you're pretty much walking in the dark.
Most CPUs have performance counters that let you determine when performance-impacting events occur (e.g. pipeline flushes, reduced dispatch width, cache misses, etc.). In their simplest form they just count events, but there is usually some way to trigger an interrupt every N events rather than simply incrementing a counter, so that you can store the offending program counter somewhere for later aggregation. This functionality isn't always exposed very well by the OS, though. If you're using Linux then perf and oprofile are the standard ways to access performance counters.
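For a concrete picture of the raw Linux interface, here's a minimal sketch using the perf_event_open(2) syscall to count last-level cache read misses around a region of code. The event encoding follows the man page; error handling is trimmed and the measured workload is left as a placeholder.

```c
/* Minimal sketch: counting LLC read misses with perf_event_open(2).
   Build: cc -O2 perf_llc.c */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* glibc provides no wrapper, so call the syscall directly. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    /* Cache events are encoded as (cache id) | (op << 8) | (result << 16). */
    attr.config = PERF_COUNT_HW_CACHE_LL
                | (PERF_COUNT_HW_CACHE_OP_READ << 8)
                | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;        /* start stopped, enable explicitly below */
    attr.exclude_kernel = 1;  /* user-space misses only */

    int fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the code you want to measure goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) != sizeof(misses)) return 1;
    printf("LLC read misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```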
Alternatively you can execute your code in a simulator with a cache model. Of course, this model won't actually match your hardware (unless you're using a serious cycle-accurate simulator), but it might be more insightful than performance counters alone. I don't know if any of the standard x86 emulators include a good cache model.
kcachegrind on Linux is great for that. It shows your code annotated with the number of cache misses per line, and it can also show you a call graph with aggregated running times, etc. Internally it uses callgrind.
If you've got C/C++ (or Fortran!) code it's pretty straightforward - I work on a profiling tool that tracks things like CPU utilization and memory contention for HPC codes (http://www.allinea.com/products/map/). For single-machine programs you could just hack something together with the perf counters fairly easily.
A clustered signal-processing application I previously worked on used custom middleware with perfmon2[1] built into it for collecting performance-counter data across all the MPI nodes the application was running on. Building the framework out to multiple nodes is not all that difficult.
There are a bunch of existing tools that use the underlying architecture's performance-measuring capabilities to give you things like cache hits/misses and branch-prediction performance; I really like oprofile (but it only runs on Linux).
If you're using multiple threads, then you might have issues with false sharing, which is a bitch. Valgrind can help you there.
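For anyone who hasn't run into it, here's a minimal sketch of false sharing with pthreads. The 64-byte line size is an assumption (check your hardware); with the padding in place each counter gets its own cache line, and removing it puts both counters on one line that ping-pongs between cores.

```c
/* Sketch of false sharing: two threads hammer adjacent counters.
   Build: cc -O2 -pthread false_sharing.c */
#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64  /* assumed line size */

struct padded {
    _Alignas(CACHE_LINE) volatile long value;
    char pad[CACHE_LINE - sizeof(long)];  /* keep neighbors off this line */
};

/* With padding, each counter sits alone on its line. Replace this with
   a plain long[2] to see the false-sharing (slow) version. */
static struct padded counters[2];

static void *worker(void *arg) {
    struct padded *c = arg;
    for (long i = 0; i < 100000000L; i++)
        c->value++;
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &counters[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}
```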
I have been planning to release an article with the exact same title, so naturally I hate the title. The article gives a quick insight into cache-aware algorithms, but fails to state that it's really about the data structures. Taking the example to the extreme, assuming you had huge, sparse matrices/vectors, how would you encode them? Or, more generally, when does asymptotic complexity become a secondary criterion for the choice of a data structure? Otherwise it's short but good; there can't be enough articles on this topic.
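To make the sparse-matrix question concrete: one standard encoding is compressed sparse row (CSR), which stores only the nonzeros and keeps each row's entries contiguous, so a matrix-vector product becomes a linear, cache-friendly scan. A minimal sketch (the struct layout here is just illustrative):

```c
/* Compressed sparse row (CSR) and a matrix-vector product over it. */
#include <stdio.h>

struct csr {
    int rows;
    const int *row_ptr;   /* rows+1 entries; row i spans [row_ptr[i], row_ptr[i+1]) */
    const int *col_idx;   /* column index of each stored nonzero */
    const double *values; /* the nonzeros themselves, row by row */
};

/* y = A * x: touches each stored element exactly once, in memory order. */
static void csr_matvec(const struct csr *a, const double *x, double *y) {
    for (int i = 0; i < a->rows; i++) {
        double sum = 0.0;
        for (int k = a->row_ptr[i]; k < a->row_ptr[i + 1]; k++)
            sum += a->values[k] * x[a->col_idx[k]];
        y[i] = sum;
    }
}

int main(void) {
    /* The 3x3 matrix [[5,0,0],[0,8,3],[0,0,6]] in CSR form. */
    int row_ptr[] = {0, 1, 3, 4};
    int col_idx[] = {0, 1, 2, 2};
    double values[] = {5, 8, 3, 6};
    struct csr a = {3, row_ptr, col_idx, values};
    double x[] = {1, 2, 3}, y[3];
    csr_matvec(&a, x, y);
    printf("%g %g %g\n", y[0], y[1], y[2]); /* 5 25 18 */
    return 0;
}
```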
If you're interested in how this could be automated, check out Polly[1]. There is also a similar framework in GCC[2], but as can be seen from the blog post, there is room for improvement.
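For a sense of what these polyhedral frameworks automate, here is the classic loop-interchange transformation written out by hand; tools like Polly aim to apply this kind of reordering (plus tiling and more) without the programmer touching the source. Both functions compute the same sum, but only the second walks C's row-major layout contiguously.

```c
/* Loop interchange by hand: same result, very different cache behavior. */
#include <stdio.h>

#define N 1024
static double a[N][N];

/* Column-major walk: each inner step jumps N*sizeof(double) bytes,
   so for large N nearly every access touches a fresh cache line. */
static double sum_colmajor(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Interchanged loops: the inner loop is a sequential scan, so every
   cache line fetched is fully used before it is evicted. */
static double sum_rowmajor(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("%g %g\n", sum_colmajor(), sum_rowmajor());
    return 0;
}
```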
Cache does rule everything, and the proof lies in the laws of physics.
How fast can you access a bit of memory? The lower bound is set by the speed of light. How do you organize your memory so that access is fastest? In a sphere with the processor at the center. No packing is more efficient overall, assuming that each bit takes up some finite "bit" of space (pun intended).
This means that random access to memory can never be O(1) at the limit; it is really Ω(n^(1/3)), the cube root. A sphere of radius r holds a volume of (4/3)πr^3, so storing n bits at fixed density requires a radius proportional to n^(1/3), and reaching the farthest bit takes at least r/c time. Yes, even for hash tables. So random-access memory always gets slower the more you have of it.
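In symbols (a back-of-the-envelope version, where ρ is the assumed fixed bit density and c the speed of light):

```latex
\[
  n = \rho \cdot \tfrac{4}{3}\pi r^{3}
  \;\Longrightarrow\;
  r = \left(\tfrac{3n}{4\pi\rho}\right)^{1/3}
  \;\Longrightarrow\;
  t_{\text{access}} \;\ge\; \frac{r}{c} \;=\; \Omega\!\left(n^{1/3}\right).
\]
```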
Conversely, the less memory you have to access, the faster it can be. By putting more frequently accessed bits closer to the center of the sphere, you have created a cache.