
Apple is using LPDDR5 for M3. The bandwidth doesn't come from unified memory - it comes from using many channels. You could get the same bandwidth or more with normal DDR5 modules if you could use 8 or more channels, but in the PC space you don't usually see more than 2 or 4 channels (only common for servers).
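To put rough numbers on the channel math (the speeds and channel counts below are illustrative, not claims about any specific SKU): peak bandwidth is just total bus width × transfer rate.

```python
# Peak DRAM bandwidth = total bus width (in bytes) * transfer rate (MT/s).
def peak_bw_gb_s(total_bus_bits: int, mt_per_s: int) -> float:
    return total_bus_bits / 8 * mt_per_s / 1000  # GB/s

# Typical dual-channel DDR5-5600 desktop: 2 x 64-bit = 128-bit bus.
print(peak_bw_gb_s(128, 5600))   # 89.6 GB/s

# A 512-bit LPDDR5X-8533 bus (M4 Max class):
print(peak_bw_gb_s(512, 8533))   # ~546.1 GB/s

# Matching that with 64-bit DDR5-5600 channels would need roughly:
print(546.1 / peak_bw_gb_s(64, 5600))  # ~12 channels
```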

Unrelated, but unified memory is a strange buzzword for Apple to use. Their memory is no different from any other computer's. In fact, every computer without a discrete GPU uses a unified memory model these days!



> (only common for servers).

On PC desktops I always recommend getting a mid-range tower server precisely for that reason. My oldest one is about 8 years old and only now it's showing signs of age (as in not being faster than the average laptop).


> In fact, every computer without a discrete GPU uses a unified memory model these days!

On PCs some other hardware (notably the SSD) comes with its own memory. But here it's shared with the main DRAM too.

This is not necessarily a performance improvement, it can avoid copies but also means less is available to the CPU.


DRAM-less NVMe (utilizing HMB) is also common on PCs, but it's seen as a slower budget alternative rather than a good thing.


I read all that marketing stuff and my brain just sees APU. I guess at some level, that’s just marketing stuff too, but it’s not a new idea.


The new idea is having a 512-bit-wide memory bus instead of the PC limitation of 128 bits. Normal CPU cores running normal code are not particularly bandwidth limited. However, APUs/iGPUs are severely bandwidth limited, thus the huge number of slow iGPUs that are fine for browsing but terrible for anything more intensive.

So Apple manages decent GPU performance, a tiny package, and great battery life. It's much harder on the PC side because every laptop/desktop chip from Intel and AMD uses a 128-bit memory bus. You have to take a huge step up in price, power, and size with something like a Threadripper, Xeon, or EPYC to get more than a 128-bit-wide memory bus, and none of those are available in a laptop or Mac-mini-sized SFF.


> The new idea is having 512 bit wide memory instead of PC limitation of 128 bit wide.

It's not really a new idea, just unusual in computers. The custom SoCs that AMD makes for PlayStation and Xbox have wide (up to 384-bit) unified memory buses, very similar to what Apple is doing, with the main distinction being Apple's use of low-power LPDDR instead of the faster but more power-hungry GDDR used in the consoles.


Yeah, a lot of it is just market forces. I guess going to four channels is costly for the desktop PC space and that's why it didn't happen, and laptops just kind of followed suit. But now that Apple is putting pressure on the market, perhaps we'll finally see quad channel becoming the norm in desktop PCs? Would be nice...


> instead of PC limitation of 128 bit wide

Memory interface width of modern CPUs is 64-bit (DDR4) and 32+32-bit (DDR5).

No CPU uses a 128-bit memory bus, as it would result in overfetch of data, i.e., 128 B per access, or two cache lines.

AFAIK Apple uses 128 B cache lines, so they can do a much better design and customization of the memory subsystem, as they don't have to use DIMMs -- they simply solder DRAM to the motherboard, hence the memory interface is whatever they want.
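The overfetch point follows from DRAM burst length: each access transfers channel width × burst length bytes, and you want that to equal one cache line. A quick sketch (BL8 for DDR4 and BL16 for DDR5 per the JEDEC specs; the 128-bit channel is hypothetical):

```python
def bytes_per_access(channel_bits: int, burst_length: int) -> int:
    # One DRAM access moves a full burst over the channel.
    return channel_bits // 8 * burst_length

print(bytes_per_access(64, 8))    # DDR4 channel, BL8: 64 B -> one 64 B cache line
print(bytes_per_access(32, 16))   # DDR5 sub-channel, BL16: also 64 B
print(bytes_per_access(128, 8))   # hypothetical 128-bit channel: 128 B, two lines
```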


> Memory interface width of modern CPUs is 64-bit (DDR4) and 32+32 (DDR5).

Sure, per channel. PCs have 2x64 bit or 4x32 bit memory channels.

Not sure I get your point; yes, PCs have 64 bit cache lines and Apple uses 128. I wouldn't expect any noticeable difference because of this. Generally a cache miss is sent to a single memory channel and results in a wait of 50-100 ns, then you get 4 or 8 bytes per cycle at whatever memory clock speed you have. So Apple gets twice the bytes per cache-line miss, but the value of those extra bytes is low in most cases.

The other, bigger differences are that Apple has a larger page size (16KB vs 4KB) and that ARM supports a looser memory model, which makes it easier to reach a large fraction of peak memory bandwidth.

However, I don't see any relationship between Apple and PCs as far as DIMMs. Both Apple and PCs can (and do) solder DRAM chips directly to the motherboard, normally on thin/light laptops. The big difference between Apple and PC is that Apple supports 128-, 256-, and 512-bit-wide memory on laptops and 1024-bit on the Studio (a bit bigger than most SFFs). To get more than 128 bits with a PC means no laptops, no SFFs -- generally large workstations with Xeons, Threadrippers, or EPYCs with substantial airflow and power requirements.
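The "twice the bytes per miss" point can be sketched with a simple latency-bound bandwidth model: a single core pointer-chasing through memory sees roughly line size × misses in flight ÷ miss latency. The miss count and latency below are illustrative, not measured figures for any chip.

```python
# Latency-bound bandwidth for demand misses: with `mlp` misses in flight
# (memory-level parallelism) and a miss latency in nanoseconds, one core
# sees roughly line_bytes * mlp / latency_ns.  Bytes/ns happens to equal GB/s.
def miss_bw_gb_s(line_bytes: int, mlp: int, latency_ns: float) -> float:
    return line_bytes * mlp / latency_ns

print(miss_bw_gb_s(64, 10, 80))    # 64 B lines, 10 misses, 80 ns: 8 GB/s
print(miss_bw_gb_s(128, 10, 80))   # 128 B lines, same everything else: 16 GB/s
```

Doubling the line size doubles this ceiling, but only helps if the extra 64 bytes were actually going to be used.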


FYI cache lines are 64 bytes, not bits. So Apple is using 128 bytes.

Also important to consider that the RTX 4090 has a relatively tiny 384-bit memory bus. Smaller than the M1 Max's 512-bit bus. But the RTX 4090 has 1 TB/s bandwidth and significantly more compute power available to make use of that bandwidth.


Ugh, should have caught the bit vs byte, thanks.

The M4 Max is definitely not a 4090 killer; it does not match it in any way. It can, however, work on larger models than the 4090 and has a battery that can last all day.

My memory is a bit fuzzy, but I believe the M3 Max did decently in some games compared to the laptop Nvidia 4070 (which is not the same as the desktop 4070). But it highly depended on whether the game was x86-64 (requiring emulation) and whether it was DX11 or Apple native. I believe Apple claims improvements in Metal (Apple's GPU API) and that the M4 GPUs have better FP for ray tracing, but no significant changes in rasterized performance.

I look forward to the 3rd party benchmarks for LLM and gaming on the m4 max.


What I was trying to say is that there is no 128b limitation for PCs.


Eh… not quite. Maybe on an Instinct. Unified memory means the CPU and GPU can do zero-copy access to the same memory buffer.

Many integrated graphics segregate the memory into CPU owned and GPU owned, so that even if data is on the same DIMM, a copy still needs to be performed for one side to use what the other side already has.

This means that the drivers, etc., all have to understand the unified memory model. It's not just hardware sharing DIMMs.


I was under the impression PS4’s APU implemented unified memory, and it was even referred to by that name[1].

APUs with shared everything are not a new concept, they are actually older than programmable graphics coprocessors…

https://www.heise.de/news/Gamescom-Playstation-4-bietet-Unif...


I believe that at least on Linux you get zero-copy these days. https://www.phoronix.com/news/AMD-AOMP-19.0-2-Compiler


Yes, it's just easier to call it that without having to sprinkle asterisks at each mention of it :)

And yes, the impressive part is that this kind of bandwidth is hard to get on laptops. I suppose I should have been a bit more specific in my remark.


High end servers now have 12 ddr5 channels.


Yes, you could buy a brand new (announced weeks ago) AMD Turin. 12 channels of DDR5-6000, $11,048 and 320 watts (for the CPU) and get 576GB/sec peak.

Or you could buy an M3 Max laptop for $4k, get 10+ hour battery life, have it fit in a thin/light laptop, and still get 546GB/sec. However, those are peak numbers. Apple uses longer cache lines (double), larger page sizes (quadruple), and a looser memory model. Generally I'd expect nearly every memory bandwidth measure to favor Apple over AMD's Turin.
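Both peak figures fall straight out of channels × width × rate (taking the quoted 12-channel DDR5-6000 and a 512-bit LPDDR5X-8533 bus):

```python
def peak_gb_s(channels: int, bits_per_channel: int, mt_s: int) -> float:
    # channels * bytes per channel per transfer * transfers per microsecond
    return channels * bits_per_channel / 8 * mt_s / 1000

print(peak_gb_s(12, 64, 6000))  # Turin: 576.0 GB/s
print(peak_gb_s(8, 64, 8533))   # 512-bit LPDDR5X-8533: ~546.1 GB/s
```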


AnandTech did bandwidth benchmarks for the M1 Max and was only able to utilize about half of it from the CPU, and the GPU used even less in 3D workloads because it wasn't bandwidth limited. It's not all about bandwidth. https://www.anandtech.com/show/17024/apple-m1-max-performanc...


Indeed. RIP AnandTech. I've seen bandwidth tests since then that showed similar results for newer generations, but not the M4. Not sure if the common LLM tools on Mac can use the CPU (vector instructions), AMX, and Neural Engine in parallel to make use of the full bandwidth.


You lose out on things like expandability (more storage, more PCIe lanes) and repairability though. You are also (on M4 for probably a few years) compelled to use macOS, for better or worse.

There are, in my experience, professionals who want to use the best tools someone else builds for them, and professionals who want to keep iterating on their tools to make them the best they can be. It's the difference between, say, a violin and a Eurorack. Neither's better or worse, they're just different kinds of tools.


Agreed.

I was sorely tempted by the Mac Studio, but ended up with a 96GB RAM Ryzen 7900 (12 core) + Radeon 7800 XT (16GB VRAM). It was a fraction of the price and easy to add storage to. The Mac M2 Studio was tempting, but wasn't refreshed for the M3 generation. It really bothered me that the storage was A) expensive, B) proprietary, C) tightly controlled, and D) you can't boot without internal storage.

Even moving storage between Apple studios can be iffy. Would I be able to replace the storage if it died in 5 years? Or expand it?

As tempting as the size, efficiency, and bandwidth were I just couldn't justify top $ without knowing how long it would be useful. Sad they just didn't add two NVMe ports or make some kind of raw storage (NVMe flash, but without the smarts).


> Even moving storage between Apple studios can be iffy.

This was really driven home to me by my recent purchase of an Optane 905p, a drive that is both very fast and has an MTBF measured in the hundreds of years. Short of a power surge or (in California) an earthquake, it's not going to die in my lifetime -- why should I not keep using it for a long time?

Many kinds of professionals are completely fine with having their Optanes and what not only be plugged in externally, though, even though it may mean their boot drive will likely die at some point. That's completely okay I think.


I doubt you'll get 10+ hours on battery if you utilize it at max. I don't even know if it can really sustain the maximum load for more than a couple of minutes because of thermal or some other limits.


The 14" MBP has a 72 watt-hour battery and the 16" has a 100 watt-hour battery.

At full tilt an M3 Max will consume 50 to 75 watts, meaning you get 1 to 2 hours of runtime at best, if you use the thing full tilt.
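The runtime arithmetic, for reference (battery capacities as quoted above; the power draws are the rough full-tilt figures from this thread, not measurements):

```python
def runtime_hours(battery_wh: float, draw_w: float) -> float:
    # Ignores efficiency losses and the display/SoC idle floor.
    return battery_wh / draw_w

print(runtime_hours(72, 75))   # 14" MBP at ~75 W: just under 1 hour
print(runtime_hours(100, 50))  # 16" MBP at ~50 W: 2.0 hours
```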

That's the thing I find funny about the Apple Silicon MBP craze: sure, they are efficient, but if you use the thing as a workstation, battery life is still not good enough to really use unplugged.

Most people claiming insane battery life are using the thing effectively as an information appliance or a media machine. At that game, PC laptops might not be as efficient, but the runtime is not THAT different given the same battery capacity.


FWIW I ran a quick test of gemma.cpp on M3 Pro with 8 threads. Similar PaliGemma inference speed to an older AMD (Rome or Milan) with 8 threads. But the AMD has more cores than that, and more headroom :)


CXL memory is also a thing.


Isn't unified memory* a crucial part in avoiding signal integrity problems?

Servers do have many channels but they run relatively slower memory

* Specifically, it being on-die


"Unified memory" doesn't really imply anything about the memory being located on-package, just that it's a shared pool that the CPU, GPU, etc. all have fast access to.

Also, DRAM is never on-die. On-package, yes, for Apple's SoCs and various other products throughout the industry, but DRAM manufacturing happens in entirely different fabs than those used for logic chips.


System memory DRAM never is, but sometimes DRAM is technically included on CPU dies as a cache

https://en.wikipedia.org/wiki/EDRAM


It's mostly an IBM thing. In the consumer space, it's been in game consoles with IBM-fabbed chips. Intel's use of eDRAM was on a separate die (there was a lot that was odd about those parts).


Yeah memory bandwidth is one of the really unfortunate things about the consumer stuff. Even the 9950x/7950x, which are comfortably workstation-level in terms of compute, are bound by their 2 channel limits. The other day I was pricing out a basic Threadripper setup with a 7960x (not just for this reason but also for more PCIe lanes), and it would cost around $3000 -- somewhat out of my budget.

This is one of the reasons the "3D vcache" stuff with the giant L3 cache is so effective.


For comparison, a Threadripper Pro 5000 workstation with 8x DDR4 3200 has 204.8GB/s of memory bandwidth. The Threadripper Pro 7000 with DDR5-5200 can achieve 325GB/s.

And no, manaskarekar, the M4 Max does 546 GB/s not GBps (which would be 8x less!).


> And no, manaskarekar, the M4 Max does 546 GB/s not GBps (which would be 8x less!).

GB/s and GBps mean the same thing, though GB/s is the more common way to express it. Gb/s and Gbps are the units that are 8x less: bits vs Bytes.


B = Bytes, b = bits.

GB/s is the same thing as GBps

The "ps" means "per second"
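In other words, the only factor of 8 anywhere in these units is bits → bytes; the slash vs "ps" spelling changes nothing:

```python
def gb_per_s_from_gbps(gbit_per_s: float) -> float:
    # Gb/s (gigabits) -> GB/s (gigabytes). "GB/s" and "GBps" are identical.
    return gbit_per_s / 8

print(gb_per_s_from_gbps(4368))  # 4368 Gb/s == 546 GB/s
```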


Thanks for the numbers. Someone here on hackernews got me convinced that a Threadripper would be a better investment for inference than a MacBook Pro with a M3 Max.



