This confuses me because the NTSC vertical rate is not 60 Hz; it's 3579545 Hz / (525/2 * 455/2) = 59.94 Hz. In other words, it's odd that they would have chosen to be compatible with black and white instead of color: instead of 44.1 kHz, the rate would have been 44.056 kHz.
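A quick sketch of the arithmetic above. The 245-usable-lines-times-3-samples-per-line layout is my assumption (it's the commonly cited derivation for PCM adaptors), not something stated in this thread:

```python
# NTSC colour field rate, derived from the 3.579545 MHz colour subcarrier
subcarrier = 3_579_545                      # Hz
field_rate = subcarrier / (525 / 2 * 455 / 2)
print(round(field_rate, 4))                 # ~59.94 Hz

# 44.1 kHz assumes an exact 60 Hz (black-and-white) field rate:
# 60 fields/s * 245 usable lines * 3 samples per line
print(60 * 245 * 3)                         # 44100

# The same layout at the actual colour field rate gives ~44.056 kHz
print(round(field_rate * 245 * 3))          # 44056
```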
Edit:
Well, it turns out that 44.056 kHz was used for the "EIAJ digital-audio-on-videotape standard".
Sony was originally proposing 44.056 kHz (NTSC, popular in Japan) with 16 bits, while Philips was pushing for 44.1 kHz (PAL, popular in Europe) with 14 bits. The two reconciled their differences at the 4th Red Book meeting in 1980[1]. Sony was further ahead in developing the CD players, but Philips supposedly was in the lead when it came to making the CDs[2]. As a compromise, they may have gone with the 44.1 kHz Philips was proposing and the 16 bits Sony was proposing because it would be easier to remember. Posts [1] and [2] are in direct conflict with each other on this point. There was further tension over what size of disc to use[2].
The CD was one meeting away from launching another format war in the spirit of VHS vs Betamax or Blu-ray vs HD-DVD
Of course 44.056 kHz products did make it into the field for professional audio engineers. Anecdotally this made for some trouble: http://www.realhd-audio.com/?p=2197
❝Of course, lots of CDs were released with the original 44.056 kHz rate simple reclocked at 44.1 kHz. This resulted in a very slight speed increase AND a pitch shift of less than a quartertone.❞
While technically true (less than a quartertone), it's much, much, much less than a quartertone. It's 1.7 cents (1.7 percent of a semitone, or about 1/30th of a quartertone). If you accidentally mix up 48 kHz and 44.1 kHz, that's a much more noticeable ~1.5 semitones. I doubt that this slight detuning is so blatantly obvious that it "freaks out" even a very well trained and very hot-tempered classical violinist.
If you want to check the math: there are 12 semitones in an octave. One octave doubles the frequency, so the frequency ratio between two adjacent semitones (e.g. from any key on your piano to the adjacent white or black key) is the twelfth root of two: ~1.0595. When tuning your guitar, your tuner might display the deviation from the true tone in cents; a cent is 1% of the interval between two semitones, i.e. (python) math.pow(2, 1/1200) -> 1.0005777895065548.
The frequency ratio between 44.1 and 44.056 kHz is 1.0009987.
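The two pitch-shift figures claimed above are easy to verify; a minimal sketch using only numbers from these comments:

```python
import math

def cents(ratio):
    # 1200 cents per octave, 100 cents per semitone
    return 1200 * math.log2(ratio)

print(cents(44100 / 44056))   # ~1.73 cents: the 44.056 kHz reclocking error
print(cents(48000 / 44100))   # ~146.7 cents: mixing up 48 and 44.1 kHz,
                              # i.e. roughly 1.5 semitones
```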
Quite a lot of early CD players were actually 14-bit. Many early CDs were probably mastered with 14 bits in mind too, as they have a noticeably low average mix level.
The original American TV standard and TV recorders were 60 Hz. When color was introduced, the frequency was shifted to 59.94 Hz to avoid interference between the color signal and the sound signal.
(Color was encoded as a high-frequency sine wave on top of the black-and-white signal, which is mostly invisible on a black-and-white set and so allows for backwards compatibility. The phase of the fuzz indicates hue, and the amplitude indicates color intensity. This is why, in the old system, if someone wore a shirt with vertical stripes on TV, viewers would see a rainbow of color over the shirt.)
I believe all NTSC equipment is required to support Black & White System M signals, which are exactly 60 Hz[1]. It probably made their equipment much simpler to forget about colour encoding entirely. (And it made the 44,100 Hz rate fit too.)
Also, the fact that 44,100 can be factored as 2^2 * 3^2 * 5^2 * 7^2 makes it very efficient to do Fast Fourier Transforms on moving windows, which we do a lot in speech recognition and DSP in general.
What you say actually makes no sense whatsoever. What is forcing you to consider 1-second windows?
Also, 48 kHz is 2^7 * 3 * 5^3, so it actually avoids the larger prime factor of 7. Not that anyone cares: just use an actual power of two as your window size to begin with!
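Both factorizations quoted in this exchange check out; a minimal sketch, assuming nothing beyond the numbers themselves:

```python
def factorize(n):
    """Prime factorization of n as {prime: exponent}, by trial division."""
    factors, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

print(factorize(44100))   # {2: 2, 3: 2, 5: 2, 7: 2}
print(factorize(48000))   # {2: 7, 3: 1, 5: 3}
```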
What you say actually makes no sense whatsoever. Why are you talking about 1-second windows? If 44100 decomposes easily into small prime factors, it means that you can split 44100 samples into windows that are themselves products of small primes.
The length of a second is arbitrary and irrelevant to audio processing. The fact that there are 44100 samples in an arbitrary length of time is meaningless. You would only care about the prime factorization of 44100 if the second was some special amount of time, which it is not.
Why are you splitting 44100 samples into windows? That's the number of samples in one second. Where did you get one second from? You could have started from any duration.
I see what you mean, but since the 1970s speech researchers have been splitting audio into 10 ms or 25 ms intervals; that's not my decision. They could have started from any duration, but they haven't. And you could start to express time with the Aztec calendar if you wanted. But you won't.
It's not (entirely) arbitrary. Speech is variable, our vocal tract shape changes all the time, but not arbitrarily fast. From a source-filter theory perspective it does make sense to consider speech production over short time frames as linear time-invariant systems.
What this has to do with 44.1 kHz is beyond me, however...
Actually 44.1 kHz is unhelpful for that... a lot of infrastructure exists that expects things to work on 10 and 20 ms intervals (and multiples thereof), e.g. the normal Windows audio APIs work in 10 ms chunks.
10 ms of 44.1 kHz is 441 samples, which is an odd number and thus not a great size for easily implemented FFT factorizations.
Of course, if you can pick your own analysis interval you can pick some nice power of two or another very smooth number. But that's true regardless of the sampling rate.
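To make the 10 ms point concrete, a small sketch comparing frame sizes at a few common sample rates (the power-of-two check is the standard bit trick, my addition):

```python
# 10 ms frame sizes at common sample rates, and whether each is a power
# of two (the FFT-friendly case). 441 is odd, but still "smooth": 3^2 * 7^2.
for rate in (16000, 44100, 48000):
    frame = rate // 100                  # samples in 10 ms
    is_pow2 = frame & (frame - 1) == 0
    print(rate, frame, is_pow2)
# none of 160, 441, 480 is a power of two
```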
I'm only guessing, but it looks like 3 stereo samples per line plus some extra data (ECC?) on the right. It's quite interesting to see how the pattern changes when the music fades out (around 3:50, 9:20, and 13:25).
I wonder what kind of sound you could extract given only the low-quality YouTube video as a source.
It seems strange to me that they would be re-purposing video tape for digital audio, given that digital tape technology had been used in computers since UNIVAC in the 1950s [1].
This came along for video applications. Up until then the idea of tapes for audio or data was to put as many linear tracks on them as possible. With video some new thinking was needed, hence the helical scan. As it turned out helical scan was the future for data and audio storage too.
Quad recording [1] preceded helical. In Quad, the mag stripes are across the tape almost perpendicular to the tape's movement. The head spun at an insane speed -- 14,400 RPM in open air. Helical was originally invented as a lower-cost recorder. Once helical gained the ability to do slow-mo and still-frame, Quad died.
I used to use VHS set to the 6-hour EP recording speed along with dbx noise reduction (compression/expansion, really). I used dbx because the bandwidth and sound quality were so bad at the slower speeds -- until the HiFi audio technology was introduced later on.
I'd get the play time of a reel-to-reel in a device that I already owned, and so no additional space was needed on my shelf.
Video tape was vastly less expensive than the tape drives used for nine-track data storage and the like. You could actually buy a decent Beta or VHS machine for 1) less than $2k for a while and 2) less than a kilobuck later (before DVD drove prices down).
Tape is tape, and VHS was a popular platform. I do remember people wanting "higher grade" Super VHS for some digital audio recording platforms, but for the most part you could even use regular VHS in a pinch.
Wikipedia states that the CD's capacity target was to hold Beethoven's Ninth Symphony on one disc. There was a tale that this, and the desire to fit a CD into a standard Japanese car-stereo form factor, determined the sampling rate. Submitted for your amusement.
Related: Digital Cinema uses a 48 kHz sample rate, which at 24 fps gives exactly 2000 samples per frame, per channel. That makes it very easy to sync the audio to the film.
At 25 fps and 48 kHz you rather neatly have 1920 samples per frame, which is coincidentally the width of an HD picture. At least I think it's kind of a coincidence... I believe the number 1920 was derived from the 720 horizontal pixels of Rec. 601, which doubled gives a 1440-pixel 4:3 picture and hence a 1920-pixel 16:9 picture.
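The samples-per-frame arithmetic is worth spelling out, since it also shows where NTSC-derived rates spoil the neatness (the 30000/1001 figure for 29.97 fps is standard NTSC timing, not stated in these comments):

```python
# Samples per frame at 48 kHz for the frame rates mentioned above
print(48000 / 24)             # 2000.0 (Digital Cinema)
print(48000 / 25)             # 1920.0 (PAL-rate video)
print(48000 / 30)             # 1600.0
# NTSC's 29.97 fps (30000/1001 fps) does NOT divide evenly:
print(48000 * 1001 / 30000)   # 1601.6 samples per frame
```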
It's also convenient that the highest frequency a human ear can detect is about 20 kHz, so the sampling rate needs to be at least 40 kHz (see "Nyquist rate").
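A sketch of what goes wrong below the Nyquist rate: a tone above fs/2 folds back into the audible band. The alias formula here is the usual folding identity, my assumption rather than anything from the thread:

```python
# A tone above the Nyquist frequency (fs/2) aliases to a lower frequency.
# Alias formula assumed: |f - fs * round(f / fs)|.
fs = 44100
f = 25000                        # above Nyquist = 22050 Hz
alias = abs(f - fs * round(f / fs))
print(alias)                     # 19100: the 25 kHz tone shows up at 19.1 kHz
```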
I can't speak for anyone else, but I was an AV nut in the old days. I remember pulling apart an early Sony CD player in '89 or '90, screwing each part down on a piece of wood so I could reach everything, and looking at every single trace with my "new" old Philips oscilloscope, trying to understand the magic.
These stories are like finally getting to see the cards from a game long over. Pure gold to me. Upvoted.
This is exactly what I want to read on Hacker News. Some people come here for the posts about startups/founder culture (I don't, but I fully accept that it's part of the site), some because of the highly technical posts. It's probably beneficial in some way that Hacker News is a mixture of both.