
I'm actually surprised virtualized page-zeroing isn't a memory controller feature at this point. With all of the things modern CPUs do for you, it seems crazy that everyone is running CPU threads that waste time and voltage to pump a zero-page queue.


The Mill, if it ever gets built, does something like that.

http://millcomputing.com/wiki/Memory#Implicit_Zero_and_Virtu...

"The big gain for this is that the OS doesn't have to explicitly zero out new pages, which would be a lot of bandwidth and time, and accesses to uninitialized memory only take the time of the cache and TLB lookups instead of having to do memory round trips."


The Mill [1] does/will do that: When memory is allocated, it is served directly by the cache, which implicitly zeroes the respective cache line (without involving main memory).

A similar mechanism zeroes the stack frame on a function call (and on function return, I think), which eliminates most attacks that exploit buffer overflows.

[1] http://millcomputing.com/docs/memory/


So a Mill CPU can never do shared-memory process parallelization? Two or more processes accessing the same memory could be a hazard. Seems like it would suffer the same issue IA-64 has with parallelism. However, the conveyor belt analogy simplifies register spilling & related issues.


> So a Mill CPU can never do shared-memory process parallelization?

Why not? The caches will still be coherent.


Only if you are crazy enough to share stacks, in which case you deserve what you get.


It's a little bit silly, but zeroing a page is an extremely cheap operation (far cheaper than just about anything you are reasonably going to do with the page once it's zeroed -- on the order of a hundred cycles is pretty typical these days). That said, yes, it is a cost, and it's not crazy to want to address it with HW.
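
If you want to sanity-check that ballpark on your own machine, something like this should do it (a minimal sketch, assuming Linux/POSIX clock_gettime and a C11 compiler; the page is kept cache-resident so you're timing the stores, not DRAM):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define PAGE 4096
    #define ITERS (1 << 20)

    int main(void) {
        char *buf = aligned_alloc(PAGE, PAGE);   /* one page, page-aligned */
        if (!buf) return 1;
        memset(buf, 1, PAGE);                    /* warm it into cache first */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
            memset(buf, 0, PAGE);
            /* empty asm with a memory clobber so the zeroing isn't optimized away */
            __asm__ volatile("" : : "r"(buf) : "memory");
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per 4 KB zero (cache-resident)\n", ns / ITERS);
        return 0;
    }

At 3 GHz, a hundred cycles is about 33 ns, so a result in that neighborhood is consistent with the claim above; exact numbers will vary by core.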

FWIW, both powerpc and arm64 have a "just zero the damn cacheline" instruction. That's not quite what you want, but it is quite useful.
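
For the curious, here's roughly what using the arm64 version (DC ZVA; the PowerPC equivalent is dcbz) looks like from C with GCC/Clang inline assembly. The zero-block size is advertised in DCZID_EL0 and is typically 64 bytes. This is just a sketch: a robust version would also check the DZP bit of DCZID_EL0, which can mark DC ZVA as prohibited.

    #include <stddef.h>
    #include <stdint.h>

    /* DCZID_EL0 bits [3:0] give log2(block size in 4-byte words),
       so the block size in bytes is 4 << those bits. */
    static inline size_t zva_block_size(void) {
        uint64_t dczid;
        __asm__("mrs %0, dczid_el0" : "=r"(dczid));
        return (size_t)4 << (dczid & 0xF);
    }

    /* Zero a block-aligned buffer one cache block at a time with DC ZVA,
       which installs zeroed lines in the cache without reading the old
       contents back from DRAM. */
    static void zero_with_zva(void *buf, size_t len) {
        size_t step = zva_block_size();
        for (char *p = buf; p < (char *)buf + len; p += step)
            __asm__ volatile("dc zva, %0" : : "r"(p) : "memory");
    }

If memory serves, glibc's AArch64 memset already uses DC ZVA for large zero fills, so you often get this for free without writing any assembly yourself.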


A hundred cycles to zero a page? If 4 KB can be written in a hundred cycles then, assuming a 3 GHz processor, that's 30 million pages per second or ~120 GB/s. That's pretty fast memory. On x86/x64 processors, which lack a zero-the-cacheline instruction, the memory will also end up being read, so you need 240 GB/s to clear pages that quickly. This ignores the cost of TLB misses, which require further memory accesses.
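
Spelling out that arithmetic (assuming a 4 KB page, a 3 GHz clock, and 100 cycles per page):

    #include <stdio.h>

    int main(void) {
        const double hz = 3e9;               /* 3 GHz core */
        const double cycles_per_page = 100;
        const double page_bytes = 4096;

        double pages_per_sec = hz / cycles_per_page;    /* 30 million pages/s */
        double write_bw = pages_per_sec * page_bytes;   /* ~120 GB/s of stores */
        double with_rfo = 2 * write_bw;                 /* reads-for-ownership double the traffic */

        printf("pages/s: %.0fM, stores: %.0f GB/s, with RFO: %.0f GB/s\n",
               pages_per_sec / 1e6, write_bw / 1e9, with_rfo / 1e9);
        return 0;
    }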


The zeros don't need to get pushed to memory immediately. They go to cache, where they will typically be overwritten with your real data long before they are pushed out to memory. That push of your real data would have needed to happen anyway, so there is (usually) minimal extra cost associated with the zeroing.

There are, of course, pathological cases where you touch one byte on a new page, and then don't write anything else before the whole thing gets pushed out, but they are relatively rare in performance-critical contexts.


> but zeroing a page is an extremely cheap operation

Uh? Anything which touches main memory is NOT 'an extremely cheap operation': that's why we have registers, L1, L2 (even L3) caches!


The whole point of cache is that you don't need to go to main memory every time you do an operation. If you don't write anything else to the page, those zeros will be written out to main memory eventually, but if you weren't going to write anything, there was no need to allocate the page to start with.

So the most common scenario is that the zeros get written to L1, which is fast, and then your real data overwrites them in L1 before they're pushed out to lower levels of the cache or to memory. The initial write may require a read-for-ownership from memory, depending on the exact cache semantics, but if that's required, it would be required for you to write your data anyway.



