Leaky Abstractions (textslashplain.com)
480 points by jakub_g on June 2, 2021 | 111 comments


These are the kinds of articles I love to see on HN!

I think it's so interesting to see how different OSes have approached these long-lived operations over the years. Arguably, *nix OSes have been able to move past these problems since package management is much more of a first-party concern: "Oh, you want to upgrade this utility? Well, I rely on these other things, so you should update them too."

I'm a webdev, so a lot of how this actually works is foreign to me, but it's cool to read about these problems and speculate about how/why they came to be and why they may only exist on certain platforms.


This is the kind of stuff I expected to study in my CS degree


The problem is that computer science ≠ software engineering, especially in academic contexts.


I've seen this sentiment a lot on here, combined with CS degrees being mentioned frequently and software engineering degrees rarely.

Are people widely choosing computer science over software engineering and then being surprised they aren't studying software engineering or is it rare for institutions to offer software engineering as an option?


In the end, a good engineer rarely becomes a professor, both because industry pays better and because universities don't value engineering expertise. The best SE courses in my university were offered by "guest" teachers from big software companies. (Which were very rare.)


I am not aware of anyone with a software engineering degree. Is that really a thing?


At least in the UK, "Software Engineering" degrees are just Computer Science degrees with a different name and a restricted subset of course options. From what I can tell, they're a marketing gimmick, but even respectable universities will offer them. The more questionable universities will also offer actual specialised courses under that moniker (in the same vein as "games design" degrees), but they aren't great.

Some Unis also offer different tracks for Masters degrees - an MA (Master of Arts) in Software Engineering, or an MSc (Master of Science) in Computer Science. Those differences are a bit more meaningful. The MA will be softer, with less research focus, and is (ostensibly) explicitly designed as preparation for industry.


My uni offers both a one-year MSc SE, and a two-year MSc CS. For organisational reasons, the 2 yr program is basically SE+MSc Information Science.

Students can pick their final subject; there is no difference in subjects between the two programmes, though we obviously expect some difference in result. After all, you should show some proficiency in both masters if you're doing the 2yr programme.

I'm currently supervising a student (SE) on file recovery, to whom I'll definitely forward the link. It's right up his project's alley.

Side remark: had he been enrolled in the 2yr programme, the subject would have been the same. I'd just expect him to extend his current work with a research question towards information science.


Yep, I'm a software engineer with a diploma. I have a master's degree in Computer Science (Software Engineering Department). I'm from Ukraine.



The Computer Science college at Oregon State University offered both CS and Software Engineering degrees. The CS degree focused more on theory and technology, SWE more on management strategies (Agile, etc.) and project architecture. I was really frustrated hitting the workforce and not knowing much about the "flow" of large-team software projects.

It was interesting seeing what the EEs and MEs were learning (standardized practices) vs. what CS was taught (the "figure out a system" method).


At the Oregon Institute of Technology we start with three terms of C++ for Software Engineering and code all four years in several languages. We might not go as deep into the theory, but we can safely use a pointer and understand what the NullPointerException actually is in managed languages. By the time we graduate we have worked on a lot of code and two large projects, one of which is with a team of other students.


Unfortunately, at my University the SE program was mostly building UML and ER diagrams and memorizing agile terms. Software design was a minor part of it.

I was CS, but had a lot of SE friends, and we all agree that the best SE course was the one that was mandatory for all computing majors. It covered version control and branching, testing, design process, abstraction design, etc. That class, which I took in my third semester, was probably one of the most daily-applicable things I learned. If my university offered more courses of that caliber that didn't just rehash and force memorization of buzzwords, that program would've been incredible.


I don't think that's the problem. The problem is that Universities aren't punished for teaching inadequately. And when challenged, people justify it by saying, "ah, but this course that 90% of you took so that you could be prepared for a career as a developer wasn't really about being a developer."


> The problem is that Universities aren't punished for teaching inadequately.

They are punished in a way. It hurts a school to get a bad reputation among local companies for poorly-prepared grads. R1 research schools, particularly if they are public, may feel it less. But even then, the ones I've dealt with have always chased all the fads in trying to improve learning and metrics. A tenured professor who's just there for the research may choose to ignore it all of course.


This hasn't been my experience in practice. I went to a top-5 CS university in the UK and some of my classmates graduated almost unable to program (and they still, somehow, got jobs). Seeing that first-hand was extremely jarring. It didn't and still hasn't damaged the reputation of the university.


Not if every university is like that. Punishment will only happen when there is that one university with adequate courses that stands out above the rest.


Interesting, I didn't even know you could cut files from a zip in that Windows zip file viewer. I would have thought it was some read-only filesystem, like viewing a mounted CD or so.

In that context I wonder if you could cut from rewritable CD-RWs as well back in the day (can't remember) - that seems like another abstraction that's similarly slow in reality.


I never tried cutting from CD-RW, but AFAIR each burn would append a non-trivial header (like 20 MB or so), so that would be a pretty expensive thing to do :)

AFAIR when you "copied" onto a CD-RW the files would show up semi-transparent (pending) and you'd have to click a button to proceed with burning. Probably the same for cutting, I guess.


Been years, but the ISO file system says the file system index starts 16 sectors past the start of the last track (the first 16 sectors are reserved for booting). Then however many sectors you need for the index - which depends on how many files you have and how long the filenames are. Note that neither CD-R nor CD-RW allows you to write sectors in the middle of a track, so you need to know where every file will be on disc before you can write anything - you can't build the index on the fly. (I remember this part because I was writing backups, so I didn't know how big the file would be until I was done and thus couldn't write the index first - I eventually got around the rule by creating a new track with just the index.)


Speaking of old Windows and CDs, Windows had a crazy trick to turn CD-R (not CD-RW) into a rewritable medium. I'm guessing they simulated a regular file system on top of an append-only representation. I never dug into the details back in the day, because I never used this feature. Unfortunately, IIRC, this trickery was enabled by default when copying files to CDs - which was a problem, because nothing else could read it. This caused me an unending stream of calls from friends and relatives, who all tried to burn some family/vacation photos onto a CD-R in order to view them on their TVs, but the CD readers for TVs couldn't parse the format.


Oh, the trick of placing the table of contents at the outermost part of the written area, instead of at the beginning of the CD.

I think you were supposed to write a pointer at the original place, so the driver would know to look again further out. Video CD players often didn't support that pointer.

But actually, Windows was quite a latecomer to that feature. It's just that its UX was horrible, so people would never know which option they chose; on other OSes (and in other Windows software) people had to actually decide to use it.


I also remember another quirk — you could burn CDs that various appliances would read, and you could do it such that you were able to add more files later, but you had to do that through Windows Media Player.



I think so. IIRC it was this:

https://en.wikipedia.org/wiki/Live_File_System

which is apparently the same thing as UDF.


Having written a shell namespace extension, interestingly enough I now use the concept as a great example of making the wrong abstraction. Shell namespaces are an odd Win32 concept rather than a filesystem concept and cause a lot of headaches (read: support calls) when people see that they work in Explorer, but not from arbitrary applications, depending on how those applications use the underlying APIs. If you're trying to abstract the filesystem with pluggable components... abstract _the filesystem_ with pluggable components.

It'd be interesting to know what led to the architectural decision of where to cut that abstraction. It feels like a Microsoft team balkanization issue (the Explorer team and POs being different from the NT kernel ones, I assume?), but it's near impossible to know without inside knowledge.


I suspect one reason is that it doesn't make sense for everything in a namespace to actually be a file. For example, making something like Control Panel or My Network namespaces is a cute way to make them browsable in Explorer and similar tools (to find a network host, etc) but making those entries files would be a kind of pointless thing to do - almost every assumption you might make about files wouldn't hold for any of those entries.

Of course, as you point out the right abstraction in many cases is an abstract filesystem, and for whatever reason pluggable filesystems aren't really a Windows Thing (though I think they may have been added recently for features like OneDrive and lazy git checkouts). The existence of shell namespace extensions probably stopped that valuable feature from getting put in...


The traditional POSIX filesystem is yet another leaky abstraction. Inflexible ACLs and metadata, a horrible and confusing consistency story, an inability to reliably / cleanly rename files, interfaces that are painful to implement performantly without surprises, and don't even get me started on fsync...

I just want to create, read, update and delete some resources :)


But at least there I can expect programs run as the same user to see the same files.

"Why can I see the files in explorer and not this program I run to use those files?" isn't an answer you have to burn support time on because it's not ultimately implementing an FS layer in special GUI components.

It's not that I don't like their FS abstraction (there's a lot of abstractions I don't like; as an engineer that's not an interesting topic for discussion but instead basically expected). It's that they broke their own abstraction a decade later by implementing the multiplexing in a layer totally contrary to what the user expects. Like if FUSE was implemented entirely in Gnome components and anything that used open(2) broke by not seeing the veneer files, but you could see them in every system application.


> Like if FUSE was implemented entirely in Gnome components and anything that used open(2) broke by not seeing the veneer files, but you could see them in every system application.

Oh, you mean exactly like Gnome GVfs?


They at least stick a FUSE module in too, so it exists in the normal filesystem and non-Gnome apps can still work with normal file paths. But yeah, agreed that it's also a case of this being implemented at entirely the wrong layer, probably blindly following in the Windows shell extension's footsteps.


This kind of thing doesn't surprise me at all - a surprising number of developers miss these types of issues, and one thing a lot of people never seem to check for is situations where you end up with an excessive number of calls each doing very little work.

E.g. at one point (long time ago), MySQL's C client library would call read() for 4 bytes to read a length indicator, and then read exactly the number of bytes indicated by the protocol, which led to a lot of time spent context-switching vs. reading into a larger user-space buffer and copying out of that.
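
To make that concrete, here's a rough C sketch of the buffered approach - not the actual MySQL client code; the length-prefixed framing and the printf stand in for the real protocol handling:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Naive pattern: two read() syscalls per packet (4-byte length, then payload).
       Buffered pattern below: one large read() serves many packets, parsed in user space. */
    static void drain_packets(int fd)
    {
        static char buf[64 * 1024];
        ssize_t n = read(fd, buf, sizeof buf);    /* one syscall for many packets */
        if (n <= 0)
            return;

        const char *p = buf, *end = buf + n;
        while (p + 4 <= end) {
            uint32_t len;
            memcpy(&len, p, 4);                   /* length prefix; endianness ignored here */
            if (p + 4 + len > end)
                break;                            /* partial packet: real code would refill */
            printf("packet of %u bytes\n", len);  /* stand-in for the real packet handler   */
            p += 4 + len;
        }
    }

    int main(void)
    {
        drain_packets(0);                         /* e.g. feed length-prefixed data on stdin */
        return 0;
    }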

On Linux, simple, plain strace remains one of my favourite first steps for finding performance problems, in part because these things are near-immediately apparent if you strace a process: the really pathological cases often end up dominating the output so much that you'll spot them right away.

Another "favourite" issue that shows up often when you use strace like this is e.g. excessive include paths - running strace on MRI Ruby with rubygems enabled and lots of gems pulled in is a good way of seeing that in action. Like this zip problem, it's an example that seems totally reasonable when the number of gems is small, and that only becomes apparent when you test with lots of gems and look at what's actually happening under the hood.

A similar example of testing with too-small datasets and/or on a machine large enough not to spot what becomes immediately obvious if you trace execution: a Type 1 font reference library made available by Adobe (no idea if they wrote it or if it came from another source originally) exhibited another typical pathological behaviour and called malloc() hundreds of times when loading a font, allocating 4 bytes at a time instead of allocating larger buffers. (strace won't catch malloc(), but ltrace does.) To avoid messing too much with the code, we replaced the calls to malloc() with calls to a simple arena allocator, and load speed shot through the roof while memory usage dropped massively.
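
For anyone who hasn't seen one, a minimal arena ("bump") allocator in C looks roughly like this - not the code we used, just a sketch of the idea: thousands of tiny allocations become one pointer bump each, and teardown is a single free():

    #include <stdlib.h>
    #include <stddef.h>

    /* Grab one big block up front, hand out aligned slices of it,
       and free everything at once when the font is unloaded. */
    typedef struct {
        char   *base;
        size_t  used;
        size_t  cap;
    } Arena;

    int arena_init(Arena *a, size_t cap)
    {
        a->base = malloc(cap);
        a->used = 0;
        a->cap  = a->base ? cap : 0;
        return a->base != NULL;
    }

    void *arena_alloc(Arena *a, size_t size)
    {
        size_t aligned = (size + 15) & ~(size_t)15;   /* keep 16-byte alignment        */
        if (aligned < size || a->used + aligned > a->cap)
            return NULL;                              /* real code would chain blocks  */
        void *p = a->base + a->used;
        a->used += aligned;
        return p;
    }

    void arena_free_all(Arena *a)
    {
        free(a->base);
        a->base = NULL;
        a->used = a->cap = 0;
    }

Pointing the library's tiny malloc() calls at something like arena_alloc(), with one arena per font freed when the font goes away, is the whole trick.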

Too few people trace execution of their code. I know that because if most developers did, issues like the above would get caught much sooner, and would be rarer in published code.


> Another "favourite" issue that shows up often when you use strace like this is e.g. excessive include paths - running strace on MRI Ruby with rubygems enabled and lots of gems pulled in is a good way of seeing that in action. Like this zip problem, it's an example that seems totally reasonable when the number of gems is small, and that only becomes apparent when you test with lots of gems and look at what's actually happening under the hood.

When you add e.g. Python's standard library as a ZIP to the Python search path, the thing will open/read/read/read/close that file approximately a gazillion times on startup. That's OK on Linux/Unix, where that is fairly cheap. Guess which OS doesn't like that pattern at all?



It feels to me that this blog post is pointing out the obvious. The examples are a bit forced in my opinion. I would expect it to be about badly designed or architected code, but it isn't.

Iterating works. SQL works. Windshield wipers work. The iteration abstraction isn't leaking - the hardware and memory architecture isn't part of the abstraction. The SQL abstraction isn't leaking - the performance isn't part of the abstraction. The windshield wipers abstraction isn't leaking - the wipers were designed for rain, not a hurricane. Abstractions ignore details - that's the point. Reality doesn't cease to exist.

If your car isn't fast enough for the race, the car, steering wheel, accelerator pedal are not a leaky abstraction. You just don't have a fast enough car! If you have to worry about stuff like shifting gears, then an automatic transmission isn't the abstraction for you. The abstraction isn't leaking, you just chose the wrong one.

Abstractions are great.

I've never had to worry about CPU instructions or assembly code in the code I've written - the programming language abstraction is perfect as far as I'm concerned. Somebody handles it. I've never had to worry about the electricity powering my machines in the Cloud etc where I deploy. Perfect abstraction. Somebody handles it. My web browser chugs through hundreds of megabytes of JavaScript daily. Perfect abstraction. I've never needed to concern myself with the implementation of say V8. I've had my share of SQL performance investigations, but most of the SQL I've written didn't need any performance tweaking (there's a different abstraction for performance that works great - indexes).

Leaking is a matter of perspective.


Abstractions are great but I think that's sort of the problem in some cases. You see a pattern like this in SQL from time to time - someone creates a view called dbo.GlobalReport which has all the joins and query logic they need for their reports. Next person comes along, sees the dbo.GlobalReport object but it's missing something so they wrap that in a view dbo.NewGlobalReport, joining a bunch of tables to the original dbo.GlobalReport. And this works. And then the next person comes along...and so on. Eventually someone has problems so they look at the query plan and their jaw drops. In theory, you shouldn't have to worry whether dbo.GlobalReport is a table or a view (if you're just querying it) but in practice you do (sometimes).

Likewise, in an enterprise setting a (Windows) user shouldn't have to worry when saving a file to X:\ what kind of drive/storage X is. And 99.9% of the time, that abstraction works. But then someone tries to save a 500GB video and suddenly it does matter what kind of drive X is, where it is, how it's connected, the filesystem it's using etc.

I think the issue in both cases is that the distinction is fairly obvious if you're in the know but not necessarily for anyone else (until they try it).


Another common leaky abstraction: floating-point numbers as an abstraction of the real numbers. It works nearly all the time, until you have to really know about numerical precision, or where a NaN came from.

https://www.johndcook.com/blog/2009/04/06/numbers-are-a-leak...


Relevant: the list of integers you can convert to float, invert twice, and don't get the same number back.

https://oeis.org/A275419
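
If you want to reproduce the start of that list yourself, a brute-force scan in C will do it (assumes IEEE-754 doubles; compile without -ffast-math):

    #include <stdio.h>

    /* Print integers n where converting to double and taking the
       reciprocal twice does not give n back exactly. */
    int main(void)
    {
        for (int n = 1; n <= 1000; n++) {
            double d = (double)n;
            if (1.0 / (1.0 / d) != d)
                printf("%d\n", n);
        }
        return 0;
    }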


This is a treasure; thank you!


I wouldn't call floating point numbers an abstraction of real numbers. It's a representation - like char[] is a representation of strings, which is itself an abstraction used to manipulate human-readable text!


Read the link above about FP as a leaky abstraction of the reals; it explains it better than I can. FP is not really a representation of the reals, because you can't represent the entire number line with a finite number of bits. It is a representation of a cleverly chosen subset of the reals, so well designed that many people forget they're not dealing with real numbers. That intelligent people can write code with FP values, operating on them as if they were real numbers without knowing how FP works, and get programs that often do nearly the right thing, speaks to the success of FP as an abstraction. The ways things can go wrong (like non-associativity) are the leaks. I don't think people manipulating char[] representations of strings are ever at risk of forgetting that it's just an array of chars (they program with full awareness of the implementation), so it doesn't feel like there's much abstraction happening there. Other higher-level data structures for text manipulation probably do implement some useful abstractions; I haven't worked with those.
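
The non-associativity leak is easy to demonstrate, e.g.:

    #include <stdio.h>

    /* The same mathematical sum, grouped differently, gives two different doubles. */
    int main(void)
    {
        double a = 0.1, b = 0.2, c = 0.3;
        printf("%.17g\n", (a + b) + c);   /* typically 0.60000000000000009 */
        printf("%.17g\n", a + (b + c));   /* typically 0.59999999999999998 */
        return 0;
    }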


> I don't think people manipulating char[] representations of strings are ever at risk of forgetting that it's just an array of chars

And yet, when people start putting Unicode text into this representation, the program fails to work properly. Or when they count characters and assume the glyph count will match the char array length.

or any number of other string issues.


yea good point. Unicode does overwhelm simplistic ideas about string representation. Love those grapheme clusters.


IMO that article is the most damaging thing to happen to programming culture in 20 years. It's not hard to write a valid abstraction, and none of Joel's examples actually hold up (yet he generalises it to a supposedly universal law). Of course abstractions are garbage-in, garbage-out - yes, TCP won't give you a reliable connection if you unplug the network cable, but an application written to use UDP won't fare any better in that case.


TCP won’t give you a reliable connection if you tunnel TCP over TCP, paradoxically, but you can tunnel TCP over UDP, UDP over TCP, or UDP over UDP. Leaky!


Not "leaky": the conditions that a TCP connection needs its underlying transport to satisfy are explicitly documented, and TCP doesn't conform to them.


If you're going into the details of what conditions a TCP connection needs, you're no longer talking about an abstraction.


An abstraction doesn't mean you can close your eyes and everything will work by magic. They're extremely useful tools, but like I said: garbage in, garbage out.


> An abstraction doesn't mean you can close your eyes and everything will work by magic.

Right... that's why they're called "leaky abstractions". Because when you don't pay attention to the implementation details and their constraints, things break.

If it weren't leaky, you'd call it a "specification" instead of an "abstraction".


Requirements are not the same thing as implementation details. 2 + 2 = 4 is a sound abstraction, the fact that 1 + 2 doesn't = 4 doesn't make it "leaky".


2 + 2 = 4 isn't an abstraction. I think we just have a different idea of what an "abstraction" is, which is fine.


It sounds like you completely missed the point he’s trying to make...


Maybe, but if so, so did many of my colleagues over the last ten years. I've seen so much bad code written in the name of avoiding abstractions, with that article as direct inspiration.


This article is indeed cool and almost all the examples are good, but this one is, in my opinion, heavily forced or even wrong:

>And you can’t drive as fast when it’s raining, even though your car has windshield wipers and headlights and a roof and a heater, all of which protect you from caring about the fact that it’s raining (they abstract away the weather), but lo, you have to worry about hydroplaning (or aquaplaning in England) and sometimes the rain is so strong you can’t see very far ahead so you go slower in the rain, because the weather can never be completely abstracted away, because of the law of leaky abstractions.


I think it’s contrived simply for a pun:

> leaky abstractions
>
> hydroplaning

hydroplaning requires:

- water (“leaky”)

- sliding (“...traction”)

- lack of _brakes_ (“abs...”)

Well, and speed too.


I wonder why it's implemented as a per-file copy+delete instead of a "copy all files" then "delete all files".

I also have a gut feeling that doing similar operations to a connected android phone (e.g., moving photos from your phone to your PC over USB) is also slow, probably for similar reasons.


It is easier to abstract (at least naively). First you abstract moving a single file, then you create an abstraction for "all files", basically by repeating the same operation for all files. You could do this for a subset of files, and so on.

As to slow operations... that is more likely because of synchronous implementation.

The popular, naive implementation is, as above, to repeat the same simple operation over and over again: read from source, write to destination, read from source, write to destination.

A better implementation (what I would do) would be to pipeline the operations. A basic pipeline would have three components, each streaming data to and/or from a buffer:

1: Read from source to pipeline buffer

2: Read from pipeline buffer to write to destination, write information about files to delete to another pipeline buffer

3: Read files to delete from pipeline buffer and execute deletions.

Using a *nix shell you could do something like that in a single line:

1: tar -c files to output

2: pipe output to tar -xv, write files to destination producing list of written files, pipe written files to output

3: read piped list of written files and remove them from input dir

Now, this is not perfect because we are wasting performance on creating a tar stream that we immediately discard, but you get the picture.


> It is easier to abstract (at least naively). First you abstract moving a single file, then you create an abstraction for "all files", basically by repeating the same operation for all files. You could do this for a subset of files, and so on.

Little bit of a rant, but I see this SO MUCH in database layers in applications. Implement an operation for one row, slap a for loop around it, it works kinda quickly on the small test data set... and then prod has an intimate conversation with a brick wall. It has been 0 days at work since that happened.

> As to slow operations... that is more likely because of synchronous implementation.

Interestingly, I think the answer is a solid maybe, and it depends on the storage and how you issue your I/O operations. Flash storage will increase performance if you increase parallel operations, up to a point. However - and this code apparently was written 20 years ago - on spinning drives, parallel I/O operations slow you down if the OS does not merge those operations. So it's not at all obvious.


On a single drive you'd run a variation where you read X MB of data to a buffer, then write X MB of data out, then execute the deletes, and so on. This avoids some of the problems with small files. Not all, because small files are unlikely to be consecutive, the head will have to jump a lot, and you still need to do a lot of small writes to the filesystem (to remove the files).

There are obvious improvements you could make. On some filesystems you can just remove entire folders rather than removing the files individually before removing the parent folder.
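
A simplified C sketch of that idea - it keeps the per-file copy but batches the deletes instead of issuing one per file; error handling and batch bookkeeping are pared down for illustration:

    #include <fcntl.h>
    #include <unistd.h>

    #define BATCH_BYTES (64UL * 1024 * 1024)   /* flush deletes every ~64 MB copied */
    #define BATCH_FILES 1024

    /* Copy src to dst through a user-space buffer; return bytes copied or -1. */
    static long copy_file(const char *src, const char *dst)
    {
        char buf[1 << 16];
        long total = 0;
        ssize_t n;
        int in = open(src, O_RDONLY);
        if (in < 0) return -1;
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (out < 0) { close(in); return -1; }
        while ((n = read(in, buf, sizeof buf)) > 0) {
            if (write(out, buf, (size_t)n) != n) { n = -1; break; }
            total += n;
        }
        close(in);
        close(out);
        return n < 0 ? -1 : total;
    }

    /* Move a list of files, deleting sources in batches rather than one by one. */
    void move_batched(const char *src[], const char *dst[], int count)
    {
        int pending[BATCH_FILES];       /* indices of copied files awaiting deletion */
        int npending = 0;
        unsigned long copied = 0;

        for (int i = 0; i < count; i++) {
            long n = copy_file(src[i], dst[i]);
            if (n < 0) continue;        /* real code: report the error, keep the source */
            pending[npending++] = i;
            copied += (unsigned long)n;

            if (copied >= BATCH_BYTES || npending == BATCH_FILES) {
                for (int j = 0; j < npending; j++)
                    unlink(src[pending[j]]);    /* deletes issued in one burst */
                npending = 0;
                copied = 0;
            }
        }
        for (int j = 0; j < npending; j++)      /* final partial batch */
            unlink(src[pending[j]]);
    }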


> The popular, naive implementation is, as above, to repeat the same simple operation over and over again: read from source, write to destination, read from source, write to destination.

Reminds me of the kind of patterns functional programming languages introduce, where you process data by describing operations on individual items and assembling them into "a stream". I'm always wary of those - without a good implementation and some heavy magic at the language level, they tend to become the kind of context-switching performance disaster you describe.


Yes. The functional world is not impervious to leaky abstractions.

I am personally of the opinion that, to be a good developer, you have to have a mental model of what happens beneath. If you are programming in a high-level language it is easy to forget that your program runs on real hardware.

I know, because I work mostly on Java projects and trying to talk to Java developers about real hardware is useless.

I have an interview question where I ask "what prevents one process from dereferencing a pointer written out by another process on the same machine", and I get all sorts of funny answers; only 5-10% of candidates even have the beginnings of an understanding of what is going on. Most don't know what virtual memory is or are surprised that two processes can resolve different values under the same pointer.
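
For anyone curious, the effect is easy to see with a few lines of C on Linux/macOS - after fork() both processes hold the same virtual address, but copy-on-write gives them different contents behind it:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int *p = malloc(sizeof *p);
        if (!p) return 1;
        *p = 0;

        pid_t pid = fork();
        if (pid < 0) { perror("fork"); return 1; }

        if (pid == 0) {              /* child: write through the pointer */
            *p = 111;
            printf("child : p=%p  *p=%d\n", (void *)p, *p);
            return 0;
        }

        wait(NULL);                  /* parent: the child has already written 111... */
        printf("parent: p=%p  *p=%d\n", (void *)p, *p);   /* ...but we still see 0 */
        free(p);
        return 0;
    }

Both lines print the same pointer value, with different data behind it - which is exactly the answer I'm fishing for.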


> Most don't know what virtual memory is or are surprised that two processes can resolve different values under the same pointer.

Yikes. How many of them have a degree in computer science?


A lot of them do, although I must admit that CS graduates fare noticeably better.

What happens is they know what virtual memory is but can't connect the concepts. It is knowledge without understanding.


Probably not many of them, because it doesn't usually take a graduate to write Java. Especially if your experience with Java is also limited to tools like Spring. I think the question is kind of pointless unless you're specifically hiring people to work on an application that needs that sort of throughput. Most apps don't need it.


The funny thing is that in the old days, you were required to know how the machine worked - you coded assembly against physical memory.

But that was incredibly hard and error-prone (and came with tonnes of limitations). The fact that today it's very easily possible to write working programs without knowing any of the underlying details is a marvel.

If I were to hire a truck driver, I wouldn't expect to have to ask him to understand how the truck itself worked (e.g., fuel injection when he presses the accelerator). He only needs to know how to operate the truck from the interface (steering wheel!). Why isn't this the same for a Java programmer?


> But that was incredibly hard, and error prone (and comes with tonnes of limitations).

No, it wasn't. If it was incredibly anything, it was tedious. But nobody really expected anything super complex from you. Just look at the kinds of programs that were produced in the 80s or 90s.

It would be incredibly hard today. Machines got very complex, operating systems got very complex.

But it wasn't so complex in 90s. I think I stopped writing assembly when I started using WINAPI because that was the point when assembly stopped being practical.

Sometimes I wonder how I would fare if I was 18yo again today and had to start in development and learn everything from scratch. I feel being able to learn everything as the technologies were evolving is a huge advantage I enjoy.


There’s a much vaster gulf between a truck driver and a truck engineer, vs a Java programmer and a programmer one layer below. Per your analogy, the truck driver is the program user here, not the program maintainer.


Rust for example does indeed apply lots of black compiler magic to really cut the cost of those abstractions (I've seen the output of complex iterator chains compiled down to exactly the same machine code you'd write without the abstractions). However man is it slow to compile. Pick your poison.


From what I understood, the problems with Rust compilation time have more to do with LLVM and the monomorphization of generics. I know that OCaml has a generics system that's kinda like Rust's but doesn't monomorphize functions and is really, really fast at compiling. OCaml also has its own backend.


> Pick your poison.

I pick Rust every time.

I have this concept of easy problems and hard problems. Every decision to choose technology is a compromise and comes with its own problems. It is your job to know whether these are easy or hard problems.

Compilation time is an easy problem. Just put more hardware to it or modularize your application or schedule your coffee breaks correctly.

Building reliable abstractions to prevent hard-to-debug problems is a hard problem. Building a large, complex, reliable application in ANSI C is a hard problem.

Just imagine: what would you rather spend your time on, coffee breaks or debugging complex bugs?


> just put more hardware to it or modularize your application

This is not as easy as it seems.

Rust compilation is not embarrassingly parallel, as a lot of what's going on seems to be deferred to the link stage.

Modules inside a crate can't be compiled in parallel for some reason I don't truly understand - please educate me.

Crates can truly be compiled in parallel, but creating projects with hundreds of crates is quite a pain for other reasons.

The result is that changing a single line of code in the project I'm working on can take more than 2 minutes to rebuild on my last-gen MacBook Pro.

I have an M1 Mac mini where the compilation is twice as fast, but I don't have enough RAM there, and I'll have a hard time convincing my manager to make an exception to my company's laptop refresh policy "because of Rust".

If I have to take a coffee break every time I need to wait for an incremental compilation my heart would explode.

My current solution is:

1. Multitask. Do something else while I'm waiting.

2. Split the code that I'm iterating on in a new project. With minimal dependencies and copy the code back in the main project when I'm done (or just import it as a crate if it makes sense)

Both options suck and I wish they weren't necessary. Please let me know what else I could do, what hardware I should buy, etc.

(I already use zld on Mac and use minimal debug symbols, iirc debug=1)


> doing similar operations to a connected android phone

MTP is terrible[1].

[1]: https://en.wikipedia.org/wiki/Media_Transfer_Protocol#Perfor...


MTP is an absolute dog's breakfast.

I've had multiple experiences of issues on the device side causing the entire explorer.exe process to crash!

Explorer's handling of the MTP protocol is not resilient to badly behaved devices; I would not be at all surprised if there are security implications where a badly behaved MTP device can get RCE in Explorer.


Oh boy. I was thinking USB 2.0 was the main reason copying photos between Android and Windows sucks so much, but the rabbit hole is much deeper.

It's sad that, with the cloud being the solution for everything these days, this will probably never be improved within the next decade.


It's ridiculously wasteful and slow to have to upload to a cloud server who-knows-where, then download it again to the computer several feet away. Yet with Android having removed USB mass storage, that's often the easiest way. It's against the interests of those who profit off selling cloud storage to make local transfers easy.


If I need to move a lot of data I just use adb and USB debugging to access the files. That is actually fast, but it's ridiculous that I have to do it.


Ha, I like that! I should try that as well.


I'm using Resilio Sync (previously known as BitTorrent Sync) for this. I guess that technically it's a kind of cloud, except I'm running it on my intranet?


copy + delete one at a time makes a lot of sense if you're working on a filesystem without a way to move without copying (I don't think you can actually move a file in fat32), because copy all could require more space than is available.

The same could be true here, where you're moving from a zip file to probably the same filesystem the zip file is in, if removing a file from the zip file is actually an in-place move of data followed by a truncate. The problem, of course, is that removing a file from the zip file is tremendously expensive. Reading the file with one syscall per byte doesn't help (especially with post-Spectre workarounds that make syscalls more expensive).
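
For comparison, the usual POSIX-side shape of that decision, as a sketch: rename() is a cheap metadata-only move when source and destination are on the same filesystem, and you only fall back to copy-then-delete when the kernel refuses with EXDEV:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Move a file: cheap rename() when possible, copy + unlink across filesystems. */
    int move_file(const char *src, const char *dst)
    {
        if (rename(src, dst) == 0)
            return 0;                       /* same filesystem: no data copied */
        if (errno != EXDEV)
            return -1;                      /* some other failure              */

        /* Different filesystem: copy the bytes, then delete the source. */
        int in = open(src, O_RDONLY);
        if (in < 0) return -1;
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (out < 0) { close(in); return -1; }

        char buf[1 << 16];
        ssize_t n;
        while ((n = read(in, buf, sizeof buf)) > 0) {
            if (write(out, buf, (size_t)n) != n) { n = -1; break; }
        }
        close(in);
        close(out);
        if (n < 0) { unlink(dst); return -1; }   /* don't leave a partial copy       */

        return unlink(src);                      /* delete source only after success */
    }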


This is the real answer IMO. For example, if you have a 2TB drive with 50GB of available space left and are trying to move 1TB of data, and the move requires copying, but each individual file is less than 50GB in size, then I'd be pretty upset if my computer was unable to move the files just because it refused to delete anything until the end.

But ideally I’d want the system to delete at the end if possible, and to otherwise delete as needed, instead of either doing only all at end or only after every single file.


> copy + delete [...] I don't think you can actually move a file in fat32

Wait, does that mean that a file which is larger than half of the partition in fat32 cannot be moved? Or not even be renamed?


Fat32 supports rename in the same directory as a simple operation. Moving a file though, I don't think so (and nobody corrected me, so I might be right).


OK, well, I just tried this on a full FAT32 partition and I'm able to move a large file with no problems reported, so I was wrong.


IIRC explorer/shell namespaces are built on top of IStorage / IStream, and those don't know about bulk operations.


Probably because someone worked at a high enough abstraction not to spot it, then tested it on small enough files not to see enough of a performance issue to dig into what actually happened.


Perhaps the common approach to performance testing is wrong (forget for a moment that most shops don't think about it at all; let's just consider quality developers/companies). Instead of just monitoring whether the product is fast enough for minimal/typical/peak expected usage, maybe it would be good to focus on determining a boundary. E.g. how much data does it take for the program to run 60 seconds? Or, in general, 10x longer than the maximum you'd consider acceptable? How much data does it take for it to run out of memory?

These determinations can be made with a few tests each. Start with some reasonable amount of data, keep doubling it until the test fails, then continue binary-searching between the last success and the last failure.

The results may come out surprising. Performance does not scale linearly - just because your program takes 1 second for 1 unit of data, and 2 seconds for 2 units of data, doesn't mean it'll take 20 seconds for 20 units of data. It might well take an hour. Picking a threshold far above what's acceptable, and continuously monitoring how much load is required to reach it, will quickly identify when something is working much slower than it should.
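
A sketch of that probing loop in C - the workload() here is a made-up stand-in for whatever you're actually measuring, and the budget and precision are arbitrary:

    #include <stdio.h>
    #include <time.h>

    /* Stand-in workload; replace with the real system under test. */
    static void workload(long n)
    {
        volatile long sink = 0;
        for (long i = 0; i < n; i++)
            for (long j = 0; j < i; j++)     /* deliberately super-linear */
                sink += j;
    }

    static double seconds_for(long n)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        workload(n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        const double limit = 2.0;            /* pretend 2s is "10x longer than acceptable" */
        long lo = 1000, hi = lo;

        /* Phase 1: keep doubling until we blow the budget. */
        while (seconds_for(hi) < limit) { lo = hi; hi *= 2; }

        /* Phase 2: binary-search the boundary between lo (ok) and hi (too slow). */
        while (hi - lo > lo / 20) {          /* ~5% precision is plenty */
            long mid = lo + (hi - lo) / 2;
            if (seconds_for(mid) < limit) lo = mid; else hi = mid;
        }
        printf("the %.0fs budget is exceeded somewhere around n=%ld\n", limit, hi);
        return 0;
    }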


I find it's kinda hell moving files from both my Android AND my iPhone... it feels like a 20-year-old process in terms of working in fits and starts with random failures.


It's so if the move fails midway you won't need to start from scratch.


In theory, all software would be coded using a "many by default" approach, so every time batching matters, we'd take those batching opportunities automatically due to the way the software is coded.

In practice we only batch when it starts hurting. It doesn't hurt to delete files one by one on a normal file system. It's made for that. So the API wasn't "many by default" and that's how it works for zip files as well.


Unless I'm mistaken, this seems to be the original programmer on youtube:

Dave's Garage - Secret History of Windows ZIPFolders

https://youtu.be/aQUtUQ_L8Yk


As noted in the blog post, he's the author of the Shell Namespace code, but not the ZIP code beneath.


Oh, I see now that it's mentioned; I must've skipped it on first reading.


It is. I have seen a few of his videos before. They are generally interesting and worth checking out.


I don't know why but "your CAB archive is a folder" really took me back. It's nice to read something that makes me nostalgic for a time when Windows was a big part of my life. I don't know that I ever loved Windows, but we were intimate.


Stockholm syndrome ♡


Totally, I still have ‘fond’ memories of Hungarian notation variables starting with ‘h’ and ‘w’.

What I can’t remember is what they stood for - ‘w’ was a word, and I think ‘h’ a half-word. But was a word 16 bits when I was writing that code?

Thanks for the PTSD.


25 years without major bugs and still being extremely useful is a complete win in my book! This is definitely a bug, but I wonder if Microsoft will think it's even worth updating; it doesn't seem to happen often enough to cause widespread issues.


This was my exact thinking!


This article is about the details of zip file handling in the shell extension... Reading the first few paras I thought they would talk about the struggle it was to implement a shell namespace extension. I did a somewhat crappy one 20-something years ago and I still remember how confusing it was. At the time I spent all my days writing various COM things. I was very comfortable doing things like hand-editing IDL headers to avoid changing GUIDs. NSEs were still the most baffling, brittle and confusing interfaces I've ever seen.


It's an interesting article but the term 'Leaky abstractions' is not suitable here. This abstraction is not leaky at all. It sounds like a great abstraction in fact. It's just the implementation that's slow and it's not because of the abstraction. The author himself admits "it’s relatively easy to think of ways to dramatically improve the performance of this scenario".

The fact that the issue can be easily resolved without changing the abstraction (the plugin system) is proof that the abstraction is not leaky.

A leaky abstraction would leak implementation details from the plugin to the caller and thus make it impossible to replace or modify the plugin without modifying the caller's logic. That doesn't appear to be the case here. It seems like the issue can be resolved just by changing the plugin's logic; the caller's logic would stay the same. This sounds like an excellent abstraction and could be the reason why the core logic hasn't had to be changed since 1998... It should be hailed as an achievement, not shunned as 'leaky'.


The abstraction leaks because the performance characteristics are inherently impossible to mask from the higher layers.

There are other places where the abstraction leaks as well (try to add a file with a Unicode name to a Compressed Folder).

It's true that you could add more and more code to try to make the abstraction leak less.


It's not impossible, because the author of the article themselves admits that there are many ways to fix this. I can already think of one possible way they could speed up the copy-paste from the zip file: waiting until all files have been copied to the file system, then deleting them from the zip in one go. The fact that solutions do not require a change to the interface is a sign that this is not a problem with the abstraction.

Is it the abstraction that is forcing the files to be copied 1 byte at a time? It doesn't seem like it. It's an implementation issue. The interface allowed the plugin to be implemented in many other ways; the implementer of the plugin just happened to go about it the wrong way.


A bug has been filed about this on the WinDev repo, might help resolve it faster if it gets some upvotes: https://github.com/microsoft/Windows-Dev-Performance/issues/...


I'd like to see APIs like this implement deferred operations, with help from the kernel.

For example, the zip folder could realise that deletion is an expensive task, and defer the job till later. It would tell the kernel that the raw bytes of the zip file are only available to other applications after this operation is complete (so that emailing someone the zip file can't represent the pre-deferred-operations state).

The tree of deferred operations can then be optimized - for example, deleting multiple files in a zip file could be combined into one. Deleting stuff in a zip file, and then deleting the whole file can likewise be combined.

Kinda like peephole optimisations for stuff your OS does.


On a quick scan, I'm not seeing any mention of OS/2, but I feel like this extensible behavior of the shell first showed up there. Not sure if that was an IBM thing or a Microsoft thing (or both), but it was WAY WAY WAY different from what came before.


Leaky abstraction is a... polite way of describing the Windows filesystem. :) Unix, particularly Linux, shows how the filesystem abstraction can be sufficient for a huge range of problems; it's such a shame Poettering et al. fail to grok this.

System services are a virus.


Anyone interested in working on a library, libportfs, that would allow userland apps (in a cross-platform manner) to create filesystems like /mnt/apache/connections and /mnt/apache/enabled?


First job would be to expose the entire Windows registry as a filesystem.


Hussein Nasser just recently created a video on the topic of leaky abstractions in our everyday tech stack.

https://youtube.com/watch?v=4a3bI7AYsy4


To this very day I wonder why people torture themselves with Windows Explorer instead of just using Total Commander and pressing Alt+F9.


"Unfortunately, the code hasn’t really been updated in a while. A long while. The timestamp in the module claims it was last updated on Valentine’s Day 1998"

When I saw this timestamp some time ago on my PC I thought it was a joke?! C'mon, 1998, like WTF!



