ZFS needs very little memory to run. Performance is definitely better with more ...

kstrauser · on June 13, 2016

The FreeBSD wiki (https://wiki.freebsd.org/ZFSTuningGuide) still claims:

> To use ZFS, at least 1 GB of memory is recommended (for all architectures) but more is helpful as ZFS needs lots of memory.

Is that inaccurate?

mioelnir · on June 13, 2016

I've ran the initial FreeBSD patchsets for ZFS support on a dual-p3 with 1.5g ram, so 1g for a recent version should be more than doable. ZFS on FreeBSD has become a lot better with low-memory situations.

There are two additional things to consider.

ZFS uses RAM mostly for aggressive caching to cover over both spinning disks and the iops tradeoff vdevs make over traditional raid arrays. Thus low memory is not such a big deal if you have a pool with a single SSD or NVMe device.

The other point to consider is that on at least any non-Solaris derived platforms, the VFS layer does not speak ARC. So data is copied from an ARC object into a VFS object, taking up space in both. If you are able to adopt your platform to use the ARC as direct VFS cache, you can save RAM that way as well.

ryao · on June 14, 2016

The wiki page should be corrected. Saying "lots of memory" is somewhat ambiguous. If this were the 90s, then it would be right.

As for the recommended amount of system memory, recommended amounts are not the minimum amount which code requires to run. It in no way contradicts my point that the code itself does not need so much RAM to operate. However, it will perform better with more until your entire working set is in cache. At that point, more RAM offers no benefit. It is the same with any filesystem.

KaiserPro · on June 13, 2016

because it has its own FS cache.

watersb · on June 13, 2016

ZFS de-duplication will eat all of the RAM you can throw at it.

Otherwise, it is basically the SLUB memory block allocator that was used in the Linux kernel for a while. So yes, it can run on watchOS-level amount of RAM.

ryao · on June 13, 2016

ZFS data deduplication does not require much more ram than the non-deduplicated case. Performance will depend heavily on IOPS when the DDT entries are not in cache, but the system will run slowly even with miniscule RAM.

danudey · on June 14, 2016

In my experience, ZFS will choke out the rest of the system RAM, and swap like crazy, killing the system.

ryao · on June 14, 2016

Kernel memory on the platforms where ZFS runs is not subject to swap, so something else happened on that system. The code itself is currently somewhat bad at freeing memory efficiently due to the use of SLAB allocation. A single long lived object in each slab will keep it from being freed. That will change later this year with the ABD work that will switch ZFS from slab-based buffers to lists of pages.

mkup · on June 14, 2016

If dedup is off, and max ARC size is limited, it will use a little memory (e.g. 512 Mb of RAM for 2x2TB RAID1 pool). I can say that from my own experience, I tried both approaches.

ryao · on June 14, 2016

I probably should clarify that the system could definitely run unacceptably slow when deduplication is used and memory is not sufficient for the DDT to be cached. My point is that saying ram is needed is saying that the software will not run at all, which is not true here.

raattgift · on June 16, 2016

Or when the cache is cold. It REALLY hurts to reboot while a deferred destroy on a big deduplicated snapshot is in progress. No import today for you!

Well, unless your medium has no seek penalty, which is what hurts with deduplication. Dedup on SSDs is pretty much OK, as long as your checksum performs reasonably (skein is reasonable; sha256 is not).

DDTs that fit inside no-seek-penalty L2s don't hurt that much either, and big DDTs on spinny-disk pools are acceptable with persistent l2arc, although it's risky because if the l2 fails, especially at import, you can have a big highly deduplicated pool that isn't technically broken but is fundamentally useless if not outright harmful to the system it's imported (or ESPECIALLY attempting to be imported) by. "No returns from zpool(1) or zfs(1) commands for you today!"

When eventually openzfs can pin datasets and DDTs to specific vdevs (notably ones made out of no-seek-penalty devices), heavy deduplication on big spinny disk pools should be usable and reliable.

Until then, "well technically even if you have only ARC and it's very small, it will work, just slowly" while correct in the normal case, is unfortunately hiding some of the most frustrating downsides when things go wrong.

rhizome · on June 14, 2016

Interesting, because on Reddit's /r/DataHoarder they recommend a "1GB RAM per terabyte of storage" rule of thumb. [standard Reddit disclaimer]

masklinn · on June 14, 2016

That's if you're running deduplication (which is generally considered pointless for general purposes, it works very well for some data-loads but you really need to bench it beforehand considering its cost)

ryao · on June 15, 2016

> Interesting, because on Reddit's /r/DataHoarder they recommend a "1GB RAM per terabyte of storage" rule of thumb.

The author meant deduplication, but that recommendation is wrong. A rule of the form "X amount of RAM per Y amount of storage" that applies to ZFS data deduplication is a mathematical impossibility.

You could could need as little as 40MB of RAM per TB of unique data stored (16MB records) or as much as 160GB of RAM per TB of unique data stored (4KB records), both assuming default arc settings. Notice that I say unique data and discuss records rather than simply say data. There is a difference between the two. If you want to deduplicate data and want to maintain a certain level of performance, you will want to make sure RAM is sufficient to have a relatively high hit rate on the DDT. You can read about how to do that in my other post:

https://news.ycombinator.com/item?id=8437921

It is not stragihtforward and it depends on knowing things about your data that you probably do not. There is no magic bullet that will make data deduplication work well in every workload or make deduplication easy to calculate. However, if the data is already on ZFS, the zdb tool has a function that can figure out what the deduplication ratio is, provided sufficient RAM for the DDT, which makes it impractical to run it on a large pool relative to system memory.

ZFS' data deduplication is a very strict implementation that attempts to deduplicate everything subject to it against everything else subject to it and do so under the protection of a merkle tree. If you want it to do better, you will have to either give up strong data integrity or implement a probabilistic deduplication algorithm that misses cases. Neither of which are likely to become options in ZFS.

Anyway, deduplicating writes in ZFS is IOPS intensive, which is the origin of poor performance. There are 3 random seeks that must be done per deduplicated write IO. If the DDT is accessed often, it will find its way into cache and if all of those seeks are in cache, then your write performance will be good. If they are not in cache, you often end up hitting hardware IOPS limits on mechanical storage and even solid state storage. That is when performance drops.

If you are writing 128KB records on a deduplicated dataset on hardware limited to 150 IOPS, you are only going to manage 6.4MB/sec when you have all cache misses. If your records are 4KB in size, you will only manage 200KB/sec when you have all cache misses. However, ZFS will continue to operate even if every DDT lookup is a cache miss and you are hitting the hardware IOPS limit.

t0liman · on June 14, 2016

The 1Gb per Tb rule of thumb is to handle the larger requirements of ARC caching, not the FS itself.

It's a performance guide, not a requirement.

The Gb/TB rule of thumb exists so that on-disk files can be moved around or stored in RAM before a write operation, and that open or recent files can be precached in RAM while being streamed, more than it is relative to the pools total capacity.

For things like compression, hashing, encryption, block defragmenting and other operations, ZFS uses a lot of caching and indexing to avoid bottlenecks.

If Apple decides to implement APFS on RAID at a software level ie to combat bitrot or to sell consumer /business NAS / SAN ie upscaled Time Machine services for VMs, there's going to be questionable setups and comparisons to FreeNAS, synology, unRAID and other software storage options with a mixture of technology and hardware, where apple won't be flexible or adaptive.

To argue for ZFS, requires understanding more about ZFS usage and performance scenarios.

It is very possible to run ZFS RAID Z1 on 2gb or less for even a 32tb pool, ie anyone is able to run 5x seagate 8tb SMR archive drives in RAID Z1, on 2gb RAM.

It is usable. FreeNAS regularly hosts builds on less than optimal hardware setups,

It's also usable with 4gb, 8gb, 16gb, or 32gb RAM, with varying % performance benefits as features are enabled and cache is expanded to handle storage of ARC or LRU (recently used) files/pages/blocks.

Usually, ZFS metric is measured in throughput when empty, to 90% full, and performance changes drastically under these conditions when cache is limited.

On a system like this with SMR "archive" drives the problem often is having a reliable cache of write data, and ideally, less fragments to store asynchronously, ie writing large files or modifying a large block is disk IO limited. If being used to store archives, up to and including for media files as a consumer device would, an optimal RAM size would be hard to guess, given that people might store bluray or UHD ISO files of ~40gb versus DVD's of 4-9gb, and streaming read/write of linear files would not use significant random iops.

With DB or VM storage, and consistent file blocks being written, the use case and performance requirements are just going to be different again, and this is where the 1gb per Tb rule is both useful and unhelpful for diagnosis of requirements.

ZFS has a lot of bottlenecks, usually CPU, RAM and IOPS, but people focus on RAM, since it is so much harder to expand or scale. And, it is not linear scale performance.

Regardless, it's just impossible to guess optimal use in a practical way since there's almost no caching at all under 2gb, the ARC is very limited and kernel panics are possible when memory is not tuned or limited to avoid expansion, which then usually relies on CPU performance rather than disk performance.

At the high end of usage, performance can be managed by different methods such as L2ARC, ZIL, more RAM, more CPU, different pools, etc. Each with caveats and usually, non linear benefits.

Many NAS units that come with 2gb of RAM are capable of running ZFS, the problem is performance.

It's even possible to run ZFS on less than 1gb RAM, but it's not going to be reliable or predictable unless you restrict the conditions of usage, ie limiting max filesizes, restrictions on vdevs or iops, etc. It would require heavy tuning for optimal task usage.

Especially if you start to hit the maximum storage limits of the pool, performance can be brutal without caching features, lower than 100kb/s when the ARC is busy or unoptimised. Usually whatever the CPU can deliver from the drive IO without IO or file cache will be veeery slow on NAS level hardware, because traditional NAS isn't CPU bound.

Essentially, at the point where you can't start or run performance features, there's no benefit from ZFS or CoW on smaller embed devices unless it is needed.

From memory, and experience, you can use half a gb per Tb of storage on Z1 storage with some caveats and have a usable performance, as long as you keep filesize and IO in mind.

With 4tb or larger drives, Z2 is recommended due to the outcome of a drive failure on the pool integrity, and just the rebuild /resilver times and error probability could allow data to be changed or corrupted during the resilver process.

This is just to combat entropy when reading Tb of data and creating new checksums due to the probabilities involved with magnetic storage. Current and future drive density almost guarantees that errors will occur with entropy and decay of magnetic storage.

With deduplication, it needs to store files with multiple hashes, caches per device, and pool, which conflates sizes (sic). About 5gb per Tb is a good start. in most cases, you would never require dedup as it has an extreme cost and usage case.