Kind of tangentially, the defragmenter stuff reminds me of another interesting thing you can do given low-level information about layout and access. I'm not sure how much of the following still applies to modern filesystems and hardware, but it was pretty effective in the late 1990s and early 2000s.

Consider a program like Photoshop or a web browser. If you watch the I/O it does while starting, there are a lot of cases where it opens some file, reads a few KB, then goes and reads from a bunch of other files, and eventually comes back and reads more from that first file.

It often happens that the data it reads from that first file is actually consecutive on the disk, but because it read it in two separate reads separated by many reads from other files, it has to do a seek when it comes back for the second part.

This typically happens for many different files during a launch: font files, dynamic libraries, and databases, for example.

If you record the I/O sequences during many launches of a given program, you also find that they are mostly the same. There might be a few differences due to it making temp files, or due to differences in the documents you are opening on each launch, but there is also a lot of commonality.

At this point you can get clever. Make something that can tweak the I/O requests made during application launch. When a program starts launching, your tweak thingy can check to see if you have a log of a previous launch. If you do, it can load that, and then for each I/O during the launch it can predict whether data beyond the extent of that particular read is also going to be needed. If so, it can add another read to grab that data, reading it into a temp buffer somewhere.

That might seem pointless, because the launching program is still going to come back and try to do a read from that same part of the file later. Yes, it will...but now the data from your early read might still be in the system's file cache, saving a seek.

I'm simplifying a bit. What you would do in practice is analyze the logs of the prior launches, identify which requests caused seeks, and then taking into account the size of the system's file cache and whatever you know about how the system cache works, figure out which reads should be extended and which should not be (because they don't incur extra seeks, or because their data won't hang around in the cache long enough).
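As a rough sketch of the first step of that analysis (an illustration only, not what the original tool did): assume a launch log has been boiled down to a text file with one line per read, containing only the path that was read, in launch order. A few lines of awk will then list the files that get read, abandoned for other files, and read again, which are the candidates for extended reads. The function name and the log format are made up for this example, and it ignores the cache-size and on-disk-layout questions entirely.

```shell
#!/bin/sh
# Sketch only: the input format (one path per line, one line per read, in
# launch order) is an assumption made for this example.
interleaved_files() {
    awk '
        {
            # Seen before, but the previous read was of a different file:
            # the program left this file and came back to it.
            if (($0 in seen) && prev != $0 && !($0 in reported)) {
                print $0
                reported[$0] = 1
            }
            seen[$0] = 1
            prev = $0
        }
    ' "$1"
}
```

Each file is printed at most once, in the order in which the program first returns to it.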

Basically, you are preloading the cache based on your knowledge of what I/Os will be upcoming.

On Windows 98 doing this could knock something like 30% off the launch time for Microsoft Office programs, Netscape Navigator, and Photoshop.

I was curious once if this would work on Linux, probably around 2000 or so. I made some logs of Netscape launching by simply using strace to record all the opens and reads that occurred during a few launches.
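For anyone wanting to repeat the experiment, something along these lines could turn an strace log into an ordered list of which file each read hit. This is a sketch, not the original script: it assumes the log was made without -f (so lines carry no PID prefix), it only follows open/openat, read, and close, and the function name is hypothetical. The syscall list in the comment is what you would typically want on x86 Linux and may need adjusting elsewhere.

```shell
#!/bin/sh
# Record a launch first, e.g. (program path is a placeholder):
#   strace -e trace=open,openat,read,close -o launch.log /path/to/netscape
#
# Then print, for every read in the log, the path of the file it read from.
# Assumes no -f (lines are not prefixed with a PID).
reads_by_path() {
    awk '
        # Successful open: remember which path the returned fd refers to.
        /^open(at)?\(/ && / = [0-9]+$/ {
            split($0, q, "\"")   # q[2] is the text inside the first quotes
            fd_path[$NF] = q[2]
            next
        }
        # read(fd, ...): report the path if we saw the matching open.
        /^read\(/ {
            fd = $1
            sub(/^read\(/, "", fd)
            sub(/,.*/, "", fd)
            if (fd in fd_path) print fd_path[fd]
            next
        }
        # close(fd): the fd number may be reused for another file later.
        /^close\(/ {
            fd = $1
            sub(/^close\(/, "", fd)
            sub(/\).*/, "", fd)
            delete fd_path[fd]
        }
    ' "$1"
}
```

Reads on descriptors that were never opened in the log (stdin, sockets, inherited fds) are simply skipped.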

I then identified several small files that had multiple reads during launch separated by reads of other files, and then made a shell script that just did something like this:

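  # read each file once so its contents land in the file cache
  cat file_1 > /dev/null
  cat file_2 > /dev/null
  ...
  cat file_N > /dev/null
  # then replace this shell with the real program, passing arguments through
  exec /path/to/netscape "$@"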

where file_1, ..., file_N were some of the files that had multiple interleaved reads during launch. I made no attempt to just read the parts that were needed (which could have been done with dd) as I just wanted a quick test to see if there was a hint that things could be sped up.
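The dd variant mentioned above might look roughly like this. The block size, the offsets in the usage comments, and the function name are illustrative assumptions, not values from the original test.

```shell
#!/bin/sh
# Read only part of a file so that range ends up in the file cache.
#   $1 = file, $2 = first 4 KiB block to read, $3 = number of blocks
# The data itself is thrown away; only the caching side effect matters.
preload_range() {
    dd if="$1" of=/dev/null bs=4096 skip="$2" count="$3" 2>/dev/null
}

# Hypothetical usage inside a launch wrapper:
#   preload_range file_1 0 4     # first 16 KiB of file_1
#   preload_range file_2 8 2     # 8 KiB starting at offset 32 KiB
```

Whether this beats reading whole files depends on how big the files are relative to the parts the program actually touches; for small files the difference is probably negligible.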

Launching Netscape via my shell script turned out to be something like 10-15% faster than normal, if I recall correctly. I was surprised at how well it worked, considering that doing it this way (all the cache preloading up front) should result in fewer cache hits than the "preload the cache for each file the first time that file is accessed during launch" method.