
Where is there a performance benefit to increasing storage density rather than using multiple drives to increase storage? A few comments here mention the time it'll take to rebuild an array and that feels like a blocking issue, at least for me.

I suppose as things like 8k video editing come up and file sizes explode there will be use cases for this kind of density, but without read/write speeds and throughput increasing, it seems like it won't be super useful for a little while.



Data Density is the biggest single driving need for storage when you get towards datacentre / cloud environments. You want as many TB per rack as you can possibly get, because your dominant cost over time is not the initial upfront capital + depreciation, it's the per-rack running costs.

S3, Backblaze, etc. all focus on cramming as many hard disks into a single machine as they can, without running into other bottlenecks at the machine level (CPU, memory, NIC bandwidth, controller, etc.).

You very much want to get out of the RAID business in those environments too. Backblaze mention their use of Reed-Solomon erasure coding, which is fairly common in large-scale storage and moves you much closer to resiliency on an individual-object basis, rather than thinking in terms of the entire drive.


Throughput tends to increase with storage density, because more data is stored in the same length of physical track.

Consumer-grade HDDs barely managed 100MB/s 10 years ago. Now they can often do 200MB/s, and enterprise disks are even faster. With these much larger Seagate drives I guess SATA will be the bottleneck, not the drive's sequential read/write speed.


SATA3 is 6Gbps with 8b/10b encoding: 600MB/s.

SAS is 12Gbps, or 1200MB/s (I don't know if it uses 8b/10b encoding, but I'll assume that for simplicity)

------------

Hard drives are nowhere close to breaking the SATA3 barrier, let alone the enterprise SAS 12Gbit barrier.


You're probably still correct, but what I saw from the immediate next generation was something like 9 platters and two or more independent read/write heads.

So we might be headed for what used to be a single write head (and its single throughput stream) to double, triple, or more.

Especially since the additional read heads enable datacenters to scale shared object storage more effectively with denser drives, which seems to be the main customer/application for HDDs at this point.


I remember a long time ago when I upgraded my motherboard and was pleasantly surprised that my SSD suddenly became twice as fast. Didn't even consider that I was switching from SATA2 to SATA3, was more interested in my CPU upgrade.

On an unrelated note, are you the same Dragontamer that I've met and played with at PDXLAN? Or do you just happen to use the same alias?


> On an unrelated note, are you the same Dragontamer that I've met and played with at PDXLAN? Or do you just happen to use the same alias?

Unlikely me. We must have the same online alias (it is a very common alias in my experience...)


I'd like to see a graph of density versus iops over time. It definitely feels like the gap has been widening for quite some time just based on how long my ZFS arrays take to do a scrub.

To answer the OP's question, it seems to me that after around 12TB or so it makes more sense to move away from implementations that require full-array rebuilds, and toward RAID 1, no RAID, or JBOD solutions.


Random IOPS is and always will be stuck at 240 IOPS for a 7200 RPM drive.

7200 RPM / 60 == 120 rotations per second. On average it takes a half-rotation to move the data you want under the head (half the data is within the first half-rotation, the other half within the second half-rotation), so the best you can do is 2 x 120 = 240 random operations per second.

If you want to reach the data faster, you need to physically rotate the disk faster: such as a 10,000 RPM drive, 15k, or 20k drive. To allow for faster rotations, you shrink the drive to 2.5" or even 1.8". Alas, SSDs have taken over this niche entirely, so we only really have 3.5" and 7200 RPM drives anymore.
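The back-of-envelope math above, as a sketch (rotational latency only; real drives also pay seek time, so this is an upper bound):

```python
# Random IOPS from rotational latency alone: on average the data
# you want is half a revolution away. Seek time is ignored, so
# treat the result as a ceiling, not a measurement.

def rotational_iops(rpm):
    revs_per_second = rpm / 60             # 7200 RPM -> 120 rev/s
    avg_latency_s = 0.5 / revs_per_second  # half a rotation on average
    return 1 / avg_latency_s

print(rotational_iops(7200))   # ~240 IOPS
print(rotational_iops(15000))  # ~500 IOPS
```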


And if you have a sufficient number of fairly sequential operations in parallel they start to look very random[1][2] to the storage system.

[1]: https://www.youtube.com/watch?v=yHgSU6iqrlE (presentation)

[2]: https://www.snia.org/sites/default/files/SDC/2019/presentati... (slides, page 41)


Having dual actuators (Seagate's Mach.2 branding) can increase IOPS by having 2 heads process the queue in parallel. That should bring a noticeable improvement, but it's true that it only helps when there is a queue of pending requests, not a single sequential stream (just like NCQ only helped by reordering the queue -- you need a queue).

Not sure if there will be consumer drives with this eventually or if the cost is too prohibitive.


Except we can already achieve that kind of IOPS increase: by simply using two hard drives in parallel (be it RAID0, or even RAID1 if your driver is willing to split the reads between hard drives).

A multi-actuator drive isn't really "one hard drive" anymore; it's really just two hard drives ganged together. While more physically convenient, it doesn't seem to offer the true 2x increase we're looking for.

Actuator #1 cannot give you more IOPS over the data that Actuator #1 alone is assigned. You only get more IOPS if you can split the work between the two actuators. Same problem as RAID0 or RAID1 multi-read hard drives (you've got to figure out a way to "split the work" before RAID0 truly gives 2x the IOPS).


RAID0 can't give you a true 2x increase, because reads and writes are constrained to a particular device, and big reads tend to require both drives working together.

RAID1 can give you a 2x increase in reads, but suffers even more than RAID0 when it comes to writes.

Dual actuators, implemented in a straightforward way, can both access the entire drive surface which means they can give you a true 2x increase. Sometimes even better than 2x, because each arm can focus on one side of the disk. For read/write workloads it completely outclasses RAID.


> because reads and writes are constrained to a particular device

That constraint means nothing here. You can issue two parallel reads to two drives in RAID-0 just as easily as in RAID-1. The only case where this doesn't work is when you're reading more than 2x the interleave size and you're issuing separate requests for each interleaved chunk. With command queuing, a smart storage system should even recognize the pattern and buffer to reduce the damage, but you'll still pay a cost in extra interrupts and request handling, which is why scatter/gather lists exist.

> they can give you a true 2x increase

I already explained why this isn't actually the case, and have observed it not to be the case with multiple generations of dual-actuator drives. Stop presenting theories based on misconceptions of how disks and storage stacks work as though they were fact.


> You can issue two parallel reads to two drives in RAID-0 just as easily in RAID-1.

Under RAID 0, the odds are 50% that two independent reads are on the same drive. It's impossible to get a speed advantage in that case.
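That 50% figure is easy to sanity-check with a toy simulation (purely illustrative: two independent reads, each equally likely to land on either of two drives):

```python
import random

# Two independent reads, each landing uniformly on one of two
# drives: they collide on the same drive about half the time.

def same_drive_fraction(trials=200_000, drives=2):
    hits = sum(random.randrange(drives) == random.randrange(drives)
               for _ in range(trials))
    return hits / trials

print(same_drive_fraction())  # ~0.5
```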

> I already explained why this isn't actually the case

You said they "improve parallelism, not media transfer rate or latency", and I'm arguing about parallelism. Plus large transfers can be rearranged into parallelism (fact, not theory).

And you said that they can face internal contention "elsewhere" but implied that could be fixed.

So that doesn't sound like what you said disagrees with what I said.


> Under RAID 0, the odds are 50% that two independent reads are on the same drive.

If you have a single sequential stream, then no. You'll either have parallel reads across the two drives, or you'll have alternating reads that the aforementioned semi-smart storage system can turn into parallel reads with buffering. If you have multiple sequential streams, then it's practically going to be like random access, which you already put out of scope. So there's no relevant case where RAID-0 is worse than RAID-1 for reads.

But you know what will be worse? Dual actuator drives. Why? Because of what dragontamer (who was right) mentioned, which you overlooked: the two actuators serve disjoint sets of blocks. They even present as separate SAS LUNs[1] just like separate disks would, so you would literally still need RAID on top to make them look like one device to most of the OS and above. But here's the kicker: they still share some resources that are subject to contention - most notably the external interface. Truly separate drives duplicate those resources, enabling both better performance and better fault isolation. Doubled performance is an absolute best case which is never achieved in practice, and I say that because I've seen it. If Seagate could cite something more realistic than IOMeter they would have, but they can't because the results weren't that good.

The only way dual actuators can really compete with separate drives is to duplicate all of the resources that change behavior based on the request stream - interfaces, controllers, etc. Basically everything but the spindle motor and some environmentals, as I already suggested now two days ago. You'd give up fault isolation, but at least you'd get the same performance. That's not what Seagate is offering, though.

[1] https://www.seagate.com/files/www-content/solutions/mach-2-m...


Since you added a huge amount since I replied, I'll make a separate reply.

> But you know what will be worse? Dual actuator drives. Why? Because of what dragontamer (who was right) mentioned, which you overlooked: the two actuators serve disjoint sets of blocks.

They don't have to do that.

I was talking about what you can do with dual actuators, not product lines that already exist.

I didn't realize how mach.2 was designed, though. That's a shame.

> But here's the kicker: they still share some resources that are subject to contention - most notably the external interface.

Each head, even at peak transfer rate, uses less than half the bandwidth of the external interface.

So even if both of them are hitting peak rates at the same time, and the drive alternates transfers between them, things are fine. For example, let's say 128KB chunks, alternating back and forth. Those take .2 milliseconds to transfer. That makes basically no difference on a hard drive.
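The 0.2 millisecond figure checks out with simple arithmetic (assuming SATA3's ~600MB/s payload rate from earlier in the thread):

```python
# Time one chunk occupies the host link at a given payload rate.

def transfer_ms(chunk_bytes, link_mb_per_s):
    return chunk_bytes / (link_mb_per_s * 1e6) * 1000

# 128 KB at SATA3's ~600 MB/s payload rate:
print(transfer_ms(128 * 1024, 600))  # ~0.22 ms
```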

> Doubled performance is an absolute best case which is never achieved in practice, and I say that because I've seen it.

I completely believe you, about drives where each arm can only access half the data.

> The only way dual actuators can really compete with separate drives is to duplicate all of the resources that change behavior based on the request stream - interfaces, controllers, etc.

Or upgrade them to 1200MB/s, which isn't a very hard thing to do.


> I was talking about what you can do with dual actuators, not product lines that already exist.

Since you didn't know they're different until a moment ago, you were talking about both. Don't gaslight.

> Each head, even at peak transfer rate, uses less than half the bandwidth of the external interface.

So two will come damn close ... today. With an expectation that internal transfer rates will increase faster than standards-bound external rates. And the fact that no interface ever meets its nominal bps for a million reasons. Requests have overhead, interface chips have their own limits, signal-quality issues cause losses and retries (or step down to lower rates), etc. Lastly, request streams are never perfectly balanced except for trivial (mostly synthetic-benchmark) cases, and the drive can't do better than the request stream allows. There are so many potential bottlenecks here that any given use case is sure to hit one ... as actually seems to be the case empirically. Your theory remains theory, but facts remain facts.


That sentence is specifically not about sequential reads.


SSDs achieve their speed in part by combining multiple independent NAND channels under a single controller - each channel is more or less equivalent to an actuator. Their speeds vary greatly based on workload parallelism, yet it's still very much one drive.


Using multiple drives is costly; it is much cheaper to consolidate if possible.

Rebuild time per TB will actually improve slightly, because higher density and a higher platter count per drive bring better throughput, so the recovery time for small arrays will actually get better.

True, the rebuild time for a whole drive will get very long, which is not great, but if the array is designed with good enough redundancy this won't be a problem, let alone a blocking issue. The very point of RAID is that the system remains functional even while rebuilding. If enough drives are used, it does not matter that the rebuild takes 1 month.
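As a rough illustration of how rebuild time scales (a naive model: capacity divided by sustained sequential throughput; real rebuilds run slower under foreground load, so treat these as lower bounds):

```python
# Naive whole-drive rebuild time: capacity / sustained throughput.
# The array is usually serving foreground I/O at the same time,
# so real rebuilds take longer than this.

def rebuild_days(capacity_tb, throughput_mb_s):
    seconds = capacity_tb * 1e12 / (throughput_mb_s * 1e6)
    return seconds / 86400

print(round(rebuild_days(12, 200), 2))   # ~0.69 days
print(round(rebuild_days(100, 250), 2))  # ~4.63 days
```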


In a large enough system, over a long enough time, even rare failure modes become inevitable. I was hearing about RAID-6 insufficiency at national labs ten years ago. Rebuild times were already long enough that, sooner or later, a second and then third failure would hit the same RAID group during the first rebuild. Data go poof. Since then, I've worked on even larger storage systems and seen overlapping failures cause data loss with even higher levels of redundancy. Throughout, I've seen the performance degradation from overlapping long rebuilds cause system-wide performance to drop below acceptable levels.

Higher areal density won't improve rebuild times unless internal transfer time is the bottleneck (it's not), and it very much does matter if rebuilds take a month. If that additional capacity isn't accompanied by proportional amounts of external-interface bandwidth and CPU/memory somewhere, then bigger disks will mean more risk of data loss. The math is unforgiving.


> In a large enough system, over a long enough time, even rare failure modes become inevitable.

Of course rare failures and loss of data do happen. There is no storage strategy that prevents these with certainty.

Data loss and performance degradation should be expected and designed for. Maybe RAID6 isn't cutting it for petabyte projects, but it is fine for the vast majority of RAID users (small businesses, <12TB arrays).

I've noticed that the special hardware and design requirements of the few largest operators are somehow proselytized as a standard that everybody should adopt. People just like to talk about how they understand the biggest deployments in the world and how that is the best practice for everybody. But for most users of RAID, these big-boy strategies are irrelevant. Arrays below 12TB are very common and work acceptably well with RAID5/RAID6, and an occasional stripe failure very often isn't a big deal for home users or small businesses.

> Higher areal density won't improve rebuild times unless internal transfer time is the bottleneck (it's not), and it very much does matter if rebuilds take a month.

Why? It matters only if running in a degraded state poses performance/reliability problems for users, which means the array wasn't designed with proper redundancy and performance in the first place. That is the problem, whether the rebuild takes a day or a month. Large 100TB drives will be fine if enough of them are used in the array that it works well in a degraded state. Also, the URE rate will most probably go down due to better ECC measures in 100TB drives.


> Large drives 100TB will be fine if enough of them is used in the array

So on one hand you say that "big boy stuff" doesn't matter to anyone else, but on the other you say that "proper redundancy" requires higher scale. Seems a bit Goldilocks-ish to me, or perhaps even a bit slippery. There's a pretty well established trend, especially in storage, of things that happen in large systems becoming very relevant to smaller ones over time. RAID itself was considered a super-high-end niche once. And don't assume that my knowing about the high end means I don't know the low end as well, or make appeals to authority on that basis. Rebuild times have always been an issue worth addressing, from 1994-95 when I was working on the then-highest-density disk array (IBM 7135/110) to now, from high-end HPC to SOHO. Don't act like you occupy some magical space where what's true everywhere else is not true as well.


Regarding "big-boy stuff", it is really a simple argument; let me repeat it in simpler words. Extreme data reliability beyond RAID6 is important for some specific deployments where loss of data is unacceptable, say for a unique experiment at CERN or a long supercomputer job that can't be repeated. But such a strategy is also needlessly costly for other, less critical RAID users. The latter group of operators is many times bigger, and this is often not reflected in these "RAID5/RAID6 is obsolete" discussions.

I agree with you that in time, the high-end tech becomes the standard tech. But that takes some time. There is quite a non-magical space of small providers who do not care for super reliable storage or super fast rebuilds, and this will be the case for a long time. Yes, the faster the rebuild the better, and "it is a concern" is fine. A one-week or one-month rebuild can be lived with. There is nothing magical about one day, one week, or one month; they are all very short compared to a typical drive lifespan.

At the same time, yes, I believe 100TB drives, if they come, will be used in those extremely reliable big deployments, simply because of better TCO and the expansion of data. Even if rebuild times are longer than today, I believe it can be made to work reliably.



