Failed intercept at Dhahran caused by a software error in handling of timestamps

avar · on Feb 28, 2018

Even better, the timeline:

    - February 11th: Vendor informed of the issue
    - February 25th: 28 people die because of the issue
    - February 26th: The vendor ships a fix

I'd have loved to be a fly on the wall for that phonecall on the 25th (or early on the 26th).

phonon · on Feb 28, 2018

You missed this date--

Feb 21--notice goes out to users to avoid "very long run times". Users do not know what that means, and ignore warning.

https://www.gao.gov/assets/220/215614.pdf (page 9)

"On February 21, 1991, the Patriot Project Office sent a message to Patriot users stating that very long run times could cause a shift in the range gate, resulting in the target being offset. The message also said a software change was being sent that would improve the system’s targeting. However, the message did not specify what constitutes very long run times. According to Army officials, they presumed that the users would not continuously run the batteries for such extended periods of time that the Patriot would fail to track targets. Therefore, they did not think that more detailed guidance was required."

macintux · on Feb 28, 2018

That's terrible. Competent technical writing is criminally undervalued.

zeeZ · on Feb 28, 2018

But there's also "presumed" and "did not think" in there. When there's a problem with your killing device you probably shouldn't use it until you've clarified what the problem is and don't just assume your end users will use it correctly.

That's like saying "It's fine, the critical vulnerability patch will be applied on reboot", while in reality all your users just suspend to disk and move that annoying reboot nag window behind the task bar where it's out of sight.

mikeash · on Feb 28, 2018

“You probably shouldn’t use it until you’ve clarified...” doesn’t work so well for a defensive system.

lostlogin · on Feb 28, 2018

How long was the off/on cycle? If it was short it would be reasonable to do it periodically. I don't think one can pin the blame on the vendor alone.

mikeash · on Feb 28, 2018

Clearly the vendor thought it would be reasonable. They failed at communicating it, though. Putting a number on it would have made things clear: “The system must be rebooted after at most 12 hours [or whatever the appropriate value would be] of operation.”

dragonwriter · on Feb 28, 2018

> When there's a problem with your killing device

Patriot, in the role deployed in that Gulf War, is a “not being killed” device, but not really a killing device.

kevinconaway · on Feb 28, 2018

Per the GAO report[0]

> According to Army officials, the delay in distributing the software from the United States to all Patriot locations was due to the time it took to arrange for air and ground transportation in a wartime environment.

I'm not knowledgeable at all on how software for missile batteries was distributed in 1991 from the US to the Persian Gulf but 11 days doesn't seem unreasonable to me.

[0] https://www.gao.gov/assets/220/215614.pdf

nathan_long · on Feb 28, 2018

Oh, gosh. Really?

It makes me kind of sick imagining that call.

sli · on Feb 28, 2018

You're not even curious to see how it was dealt with and how the issue was expressed to the vendor? I'd never be in a meeting regarding deaths of users of my software, because I just make internal webapps, so I just cannot help but be curious as to how one of those meetings would go.

hx2a · on Feb 28, 2018

> I'd never be in a meeting regarding deaths of users of my software

I know what that's like.

About 20 years ago I was at a consulting firm supporting an electric and gas utility company. Among other things they had to do something called "markouts" which means they paint the ground at a location in a way that indicates exactly what infrastructure they have in the ground and precisely where it is. Markouts are a government organized thing. Before digging somewhere you can call a number and anybody that might possibly have infrastructure in the ground anywhere near your dig site is required to paint their markouts within a short time period. There are stiff fines if you "miss a markout."

Anyhow there was a data problem with a markout. The field worker was sent to paint a markout at the corner of two streets that actually ran parallel to each other and didn't meet. Instead of calling it in and questioning the task he did nothing. Shortly after a construction worker put a backhoe through an electrical conduit with 15K volts. There was an explosion that was heard for many miles. The worker died the next day. He died painfully.

> so I just cannot help but be curious as to how one of those meetings would go.

Finger pointing, of course. Data was being fed back and forth between systems and eventually somebody else took the blame. The field worker who ignored the markout also was blamed. We did add something to our system so that that kind of data error would raise an exception.

I learned a lot about care and diligence about data from this experience. Data errors are no joke.

brookside · on Feb 28, 2018

Sure but

> I'd have loved to be a fly ...

loving anything about that sad scenario seems impossible.

kaishiro · on Feb 28, 2018

It's a figure of speech, not a show of approval.

Nomentatus · on Feb 28, 2018

Even loving learning enough to avoid the next one? 'Cause that's what responder wants, to learn. "What the hell were they thinking?" is often the most pertinent knowledge of all.

_lhlo · on March 1, 2018

I'm not, because it would bore me. I see shit like that for breakfast when studying transportation. But if you don't do it, I say it's because of your own primal instincts, so you stuff them down...

godelmachine · on March 1, 2018

Who was the vendor?

Aren't defense contractors required to be on their toes all the time?

EDIT → Found it. The PATRIOT Project Office.

tofof · on Feb 28, 2018

This particular bug is often taught in university compsci classes as "bug that killed people" is a good attention grabber -- the CS/EE analysis is sound; its truthfulness is only suspect because of the DoD's claimed successes.

A more truthful "computer bugs that killed people" example would be the Therac-25 - a machine intended to treat cancer with tightly-focused radiation therapy. Six patients died as a result of massive overdoses of radiation, on the order of 20,000 rads. It was possible for the machine to end up in a state where it delivered full-power radiation without a hardware shield in place to protect the rest of the patient's body. No hardware interlocks were used to ensure that the full power mode was only usable with the shield in place - all safety features relied on software. In addition, the bug was only possible when an operator made a mistake in mode selection and then rapidly (proficiently) corrected it - the rapidity required prevented the bug from being discovered during slow, methodic, careful testing.

See Hackaday's article Killed by a Machine (and associated HN discussion) or for the especially curious, a 49-page post-mortem for more detail:

https://hackaday.com/2015/10/26/killed-by-a-machine-the-ther...

https://news.ycombinator.com/item?id=12201147

http://sunnyday.mit.edu/papers/therac.pdf

otoburb · on Feb 28, 2018

This was a tragic and preventable loss. It's incredible that a software bug might have been the root cause.

At the time, this incident really stuck out because it broke the illusion of our fabled Patriot missile shield protecting us. Civilian expats really believed the inflated Patriot interception rates parroted to us by mainstream media and our American military expat buddies.

A large number of remaining expats who had stuck out the Gulf War to that point decided to pack it in and leave when word got out that the Dhahran barracks were hit. Although history shows that Iraq surrendered days after this incident, at the time there was heightened fear and confusion amongst the remaining expats, especially the non-Americans.

We left on the last Lufthansa flight (crewed by military personnel) after hearing about this.

Nostalgic edit:

During the Gulf War embassies issued equipment and rations to expat citizens who chose to stay behind. Americans were issued full body suits (for adults and youths) due to the biological and chemical weapon payloads that Saddam boasted his SCUDs were carrying, along with MREs that tasted fabulous! In stark contrast, Commonwealth citizens were issued a bare gas mask (adult size only) and mono-flavour MREs that tasted like cardboard.

The British embassy sticks out in my mind: with stern stone-faced expressions they admonished us all for not evacuating and thus endangering children in a war zone. In addition to the terrible rations and gas masks, they wordlessly gave us a stack of translucent stickers. When asked what they were for, embassy staff explained that in the event of the air siren going off, we should get under our sturdiest tables and don our gas masks (standard procedure), and then slap the stickers on. If the stickers changed colour, it meant we were in the presence of a biochemical agent and would have approximately 10 seconds before we died a horrific death.

You kind of had to be there to appreciate the grim humour.

celticninja · on Feb 28, 2018

I mean I kind of understand the attitude of the British Embassy, it wasnt like trouble flared up overnight, the option to leave was there for a long time prior to the war beginning. Obviously it isnt the fault of the children who were kept there by their parents, but some responsibility needs to be borne by the expats that decided they were getting paid well enough to stay.

otoburb · on Feb 28, 2018

We all understood their stance. The notable point is the stark contrast between the Americans (embassy and expats alike) and everybody else.

While most of us were cowering under our desks and tables during SCUD attacks, some of our American civilian friends were out with their families in the desert trying to film the Patriots "intercepting" the SCUDs and driving out to try and pick up pieces of debris.

I look back upon those days with fondness and gratitude, especially for the American forces that served.

girvo · on March 1, 2018

That description of the stickers is so British it makes me think of Monty Python. Brilliant!

sharemywin · on Feb 28, 2018

I remember hearing about this in my numerical analysis class.

1. I remember hearing the system was only designed for XX operational hours but was being run over the operational spec.

2. The time was stored in base 10 so the calculation errors added up over time or something like that so if they had used some base 2 timing scheme it would haven't have had issues with rounding errors.

My class was in the mid nineties so the details of my 25 year old memory is pretty hazy...at best.

clw8 · on Feb 28, 2018

My recollection matches with yours, except I learned about it in the first week of Embedded Systems 101. If it isn't a standard part of the curriculum at every college embedded systems class, it should be! It really drove home the point that bad code can kill.

pilom · on Feb 28, 2018

I learned about it in a Decision Analysis course and had a completely different point driven home. This wasn't bad code. It was code that was correctly written to a very well defined requirement ("System shall be operational for at most X hours before a reboot"). The code was written to a spec that was approved by the customer (the military). Unfortunately though, that requirement wasn't communicated to the end users.

michaelmrose · on Feb 28, 2018

From the article

"However, the timestamps of the two radar pulses being compared were converted to floating point differently: one correctly, the other introducing an error proportionate to the operation time so far"

The code had a defect that effects its aim from turning it on but because it took 100 hours to drift by 1/3 of a second the problem wasn't apparent when rebooted regularly. If software can't continue to do basic math without manual intervention its defective.

In fact everyone including the company that made it admits it's defective.

Its possible your teacher picked a great example to illustrate a communication failure.

sharemywin · on Feb 28, 2018

The Patriot system was originally designed to operate in Europe against Soviet medium- to high-altitude aircraft and cruise missiles traveling at speeds up to about MACH 2 (1500 mph). To avoid detection it was designed to be mobile and operate for only a few hours at one location.

http://archive.gao.gov/t2pbat6/145960.pdf

Page 2

dug into reference 48 from Wikipedia which referenced this article which I did a search on google.

michaelmrose · on Feb 28, 2018

The fact that the bug manifests after a longer than normal period of operation doesn't ex post facto make it not a bug. If you add 2 and 2 and get 42 you failed.

It is however a good explanation why it remained undetected.

mikekchar · on March 1, 2018

Conversations like this are surprisingly common in our industry ;-) To help ease communication there are 2 terms in common usage: software error and bug. A software error is code that is incorrect. A bug is a software error that manifests a user visible problem. In this case the incorrect code is a software error, but it does not manifest a user visible problem unless it is used outside some assumed parameters. The bug doesn't exist when the product is used as intended. One can argue that the behaviour is undefined when used outside of the intended use and therefore there is no bug. There is no arguing about the software error, though. It exists.

Arguing about whether or not something is a bug is pointless precisely because someone will just pull the "behaviour outside of expected use is undefined" thing out of the bag. Regardless of whether or not you should have expected something to work, if your product unintentionally kills people due to a software error, you have a gigantic problem. It's really that lesson we have to keep in mind.

I get this all the time from project managers: it doesn't matter if X fails because we aren't designing the software for X. But you can't just dismiss X -- you need to understand the consequences of X just in case somebody tries to do it. For example: It corrupts the DB if 2 people edit the same record at the same time. The project manager says, "Not a problem. I got sign off from the groups using the app and they promise never to have 2 people working on the same thing. Problem solved, and no need to modify the code!" Of course a week later the DB is corrupted and it's not a bug (it's a feature ;-) ).

It does make software development more costly, and you need to draw the line somewhere. This requires balancing risk. But I will argue that if you are writing software for a missile, there is no hiding behind the "we didn't design it for that" argument.

Dehstil · on Feb 28, 2018

If by "defective" you mean has rounding errors, then sure. Everything that rounds numbers is defective. To be fair, round errors can sometimes be mitigated by carefully changing the order of operations, but never fully eliminated in those cases.

michaelmrose · on March 1, 2018

You can avoid rounding errors 100% of the time for as long as you like. For example you can use integers.

Its entirely possible to have any reasonable degree of precision reasonably required to the limits of our tools to measure.

This isn't about an inherent limit of computation its just programmer error.

Jtsummers · on Feb 28, 2018

I'm failing to find anything that says the requirement was "System shall be operational for at most X hours before a reboot". It's more likely that there was a key performance paramater (KPP) saying that it should be functional for at least some period of time. And that was what was tested.

Generally KPPs (which aren't requirements themselves, but influence the requirements for systems) are set at lower bounds, not upper bounds, for somethnig like this. You wouldn't set a KPP: Should only work for 4 hours. You'd use: Should work for at least 3 hours, 4 hours desirable (or some similar language). If it works for longer, that's great. But longer won't be tested since it's not a requirement or goal for the system, which also means failure modes for longer runtimes won't be encountered because they're outside the bounds of the system requirements and specs.

dragonwriter · on Feb 28, 2018

As I gather a the Patriot was a mobile anti-aircraft / anti-cruise missile platform that was meant to move, be activated when needed, and then be turned off and move again because the original location was expected to become a target. It was pressed, on short notice (with some software upgrades, but not the normal cycle of specs, development, and validation that would go into that kind of repurposing) into stationary, continuous coverage, anti-ballistic-missile (critically, dealing with much faster targets than originally envisioned, which means short warning times where deactivations have a lot more risk) use.

So, while it's horrible in results, it can be very easy to understand why basic functions would have specs not at all adapted to the use to which it was being put.

Jtsummers · on Feb 28, 2018

There's a distinction to be made, though. There was no requirement that it be rebooted after some period of time, though there was an expectation that this would happen by the original developers. Consequently it was not evaluated for 20 hour or 100 hour performance. That's a critical distinction in developing, testing, and fielding systems. And the way we term it in our requirements documents reflects this. We rarely say: System SHALL fail after some period. Rather we say: System SHALL perform for some period. We leave the result of longer durations undefined. The system may work, or it may not, we aren't required to test it and so we don't. If the customer wants it to run longer, we can evaluate it but they have to communicate that back to us (or to the testing facilities, which may not be the developers).

Similarly, with regards to the speed of the missiles, the requirement would not be: System SHALL fail to detect missiles above some threshold speed. But rather: System SHALL detect missiles below some threshold speed. This leaves open the possibility that it may be more or less accurate outside that range. It should be documented for the operators as a potential for failure: System may be ineffective against missiles operating above X m/s. But the requirements wouldn't include that detail.

This pushes the problem into the documentation and training. Since it was originally designed as a mobile platform with short run-times, there was no explicit operating procedure requiring reboots. It was just assumed. At the same time, the failure itself (after 20 hours) was unknown because testing hadn't been done to see what would happen.

michaelmrose · on March 1, 2018

Getting a slightly wrong answer ought to have been detectable even after short period of time even if the difference was microscopic.

Jtsummers · on March 1, 2018

Not the way we test these things. You set the KPPs and analyze system performance. Especially back then, there wouldn’t have been much in the way of unit testing or anything for these sorts of systems.

You set your performance parameters (have some success rate while operating continuously for up to 4 hours). Then you launch missiles at it (simulated and real). If you stop enough of then you’re good.

michaelmrose · on March 1, 2018

Article discussing testing software back in 1976

https://dl.acm.org/citation.cfm?id=807721

No real good excuse for not actually testing systems that can take or save lives.

sharemywin · on Feb 28, 2018

I just regurgitation about some kind of article the professor brought in.

Wikipedia didn't exist when I was taking the course. It's probably in one of the 100 odd source articles since it wasn't just my professor that pointed it out. One of the other commenters mentioned a similar discussion from one of their professors.

Jtsummers · on Feb 28, 2018

Fair. I wasn't replying to you, your #1 sounds a lot like what I'm saying, though.

  1. I remember hearing the system was only designed for
  XX operational hours but was being run over the
  operational spec.

This is very similar to my "at least" which is very different than "at most". In requirements we wouldn't bound ourselves like that. We wouldn't say our system should run for at most 8 hours. We'd say it should run for at least 8 hours. However, we won't say what happens after 8 hours because we don't test it (it's not a requirement). We may communicate to the operators that the system should be rebooted after some period of time if there's a known or anticipated issue, or we may include a soft boot to reset things. For many of our systems, their operating time is usually under 12 hours (they go on aircraft that don't fly for days at a time, mostly), so we never test anything past about 48 hours anyways. If there's an issue that arises around 96 hours, we'd never know from our testing and only know about if an operator pushed it to that limit and recorded the circumstances properly.

sharemywin · on Feb 28, 2018

The Patriot system was originally designed to operate in Europe against Soviet medium- to high-altitude aircraft and cruise missiles traveling at speeds up to about MACH 2 (1500 mph). To avoid detection it was designed to be mobile and operate for only a few hours at one location.

http://archive.gao.gov/t2pbat6/145960.pdf

Page 2

I never did embedded programming or government programming, so what your saying make sense from a spec perspective.

Jtsummers · on Feb 28, 2018

Right, but it didn't say that it was a cap on how long it should work correctly, rather it's a lower bound and also sets the maximum they tested it to. This is a design feature, but not a requirements feature, then, that it required a reboot after less than 8 hours of operations.

VLM · on Feb 28, 2018

Its interesting that FM 44-85 "Patriot Battalion and Battery Operations" is publicly available and pretty easy to find. We discussed this in a systems analysis class back in '04 using a copy of FM 44-85 released in '97. In summary the class blamed TRADOC and the tech writers for publishing a manual that did not accurately reflect real world use cases, with the software bug being a secondary concern.

I googled up a copy of FM 44-85 to refresh my memory and write this post, its pretty much as I remember it.

The doctrine in chapter 3, planning, is extreme mobility and rapid hour to hour activation and deactivation of individual missile batteries, kinda like infantry bounding overwatch but glacially slower on an hourly basis, for example see Table 3-2 where the four batteries are rotating on and off and moving/maintaining on detailed hour by hour basis, so the doctrine seems to be uptimes should typically be on the order of 3 or 4 hours maybe. Not a zillion days in a row as actually deployed when the software bug hit.

The doctrine in chapter 5, operations, goes into a big discussion of defense design strategies. The weapon system is inherently sectorized this naturally leads to overlapping areas of fire being very important. You have to ask why the unit that had a ridiculous uptime never shut down to perform daily maintenance which would inherently involve rebooting stuff, its no big deal to down a system because sectorization and overlap is inherently built into the technology. Its reasonably well understood that technically you can tell an individual infantry soldier to guard a post for 100 hours or 1000 hours continuously, but someone screwed up if they issued an order like that because its simply impractical. That leadership failure will be discussed later. So... aside from the question of why the software failed under ridiculous conditions, you have to ask WHO more or less knowingly misapplied the resource without backup or planned maintenance intervals? Possibly this section of the FM was rewritten between the tragedy and the the release of the copy I have access to, but its still poorly written. Or what section of the FM would have ever given the officers the idea that the weapons system can be deployed the way they did it? The idea that the weapon system could do what they told it to do came from somewhere and it apparently was not the documentation?

The doctrine in chapter 6, support, has a little blurb about battalion level staff officers. What did the EMMO think about keeping a patriot booted up and running for 100 hours without a maintenance interval? Missile maintenance is literally his only job. And if that slot was unfilled, its the job of S4 and the XO to cover or reassign someone or otherwise work around. Around page 6-16 there's a discussion about operators being responsible for maintenance... I had a humvee assigned to me, I hated it, it leaked oil all the time, but the point is even my junky humvee had daily maintenance tasks for PMCS. The patriot missile PMCS checklist is probably classified, but if a lowly humvee has daily maint, how can a missile not have a much longer and more complicated daily maint? And this implies someone is pencil whipping maint (I mean, everyone kinda does that, but..)

Its hard to summarize a class discussion but from the point of view of a systems analysis class, mostly non-military other than myself, the end users were being innovative and adapting and overcoming which unfortunately means the doctrine and specifications of the weapon system have little to do with how its being used. The class considered this the biggest systems analysis mistake of the tragedy. Why even write docs and specs if the users won't read them and they have no relation to what the users want to do? I guess a good HN analogy would be you could creatively deploy binary executables using the "cat" command and hand typing unicode and that would be a nifty hack to work around a problem but would be a pretty stupid way to operate normally. Specifically the Army's own docs used to train and plan operations imply shorter operations terms interspersed with maintenance intervals and deep redundancy, none of which seems to have anything to do with the failed deployment.

There was a big argument in class that it had nothing to do with systems analysis and was merely a leadership failure, using the example above of technically you can order a soldier to stand guard for a hundred hours, when guard shifts are normally a couple hours, and when he passes out asleep around 48 hours into his shift, you can try to blame the soldier or declare there's a bug in our brain preventing 100 hour deployments, or you can even blame the manual and the technical writers for not putting a warning in the manual not to do dumb things, but fundamentally thats just passing the buck that it was a failure of leadership to assign a unit to a task its not designed to handle, then cover it up by pretending its merely a software bug or something. I don't know enough history of the tragedy; its possible the Army correctly relieved some officers of command and its only the media and press who blame the software bug.

You can imagine the look on the face of the software developers when they got the bug report; like dude, did you ever read FM 44-85, or if you aren't reading it, what are you reading, so we can read it?

Jtsummers · on Feb 28, 2018

The software guys likely never read the field manuals. I know I never did. I read the specs and requirements documents, which are different things than what operators receive. The program office is responsible for maintaining synchronicity between the two with regards to performance parameters (reqs and specs) and performance expectations (manuals). The test office should’ve been familiar with both sides as well.

OedipusRex · on Feb 28, 2018

That was a temporary fix, then a software patch was released. I also wouldn't call that a "software" fix.

dredmorbius · on Feb 28, 2018

The inimitable comp.risks discussed this in 1992:

http://catless.ncl.ac.uk/Risks/13/35#subj1.1

http://catless.ncl.ac.uk/Risks/13/76#subj8.1

And in 1997:

http://catless.ncl.ac.uk/Risks/18/79#subj9.1

tntn · on Feb 28, 2018

Despite other comments below, I think that the equivalence drawn between "failed to save" and "killed" reflects an interesting philosophical choice. I don't think that this equivalence is universally accepted, even by those who call thinking otherwise fallacious.

If an EMT fails to save a victim of a car crash, did he/she kill the victim? If the dispatcher misspoke and gave the wrong cross street, delaying aid, did the dispatcher kill them?

rxhernandez · on Feb 28, 2018

In the medical device industry the company who made the device can be found at fault if a clinician makes a poor decision that leads to death based on a fault in the device. If the soldiers would have sought better cover or be otherwise saved in the case that there was no missile defense system was there then yes, some, if not most, of the blame lies on the software error.

logfromblammo · on Feb 28, 2018

For doing a ballistic propagation, you apply a gravitational map in Earth-centered, Earth-fixed (ECEF) geodetic coordinates, then convert to Earth-centered rotating (ECR) geodetic coordinates, because that way you don't have to correct for the Coriolis effect. That ECEF-ECR conversion requires a time-of-day parameter.

You can use a gravitational map that only accounts for latitude, but it isn't as precise.

So using an accurate clock is really important if your intent is to hit a missile with a missile.

sjburt · on Feb 28, 2018

This is a completely misleading headline. The Patriot missile was not effective at destroying the Scud [0]. The DoD initially claimed successful intercepts when the missile detonated near the Scud, but it rarely, if ever, actually destroyed the warhead. The only reason there was an illusion of success was that the Scud was also spectacularly unreliable and often broke up on re-entry or failed to detonate. It is a complete falsehood to claim that the Patriot would have prevented this loss of life.

[0] http://www.slate.com/articles/news_and_politics/war_stories/...

seorphates · on Feb 28, 2018

Reboot. Around the same time-frame we gathered the flag for a deployment (fleet admiral) and I was responsible for UNIX systems on the ship. Not long after coming aboard the command came down to reboot all of the systems at midnight, nightly (yes, only the UNIX systems). Being that "But Mister.." never really gets you too far in the military I just rode it iterating through any possible reason for the madness, nightly. I could never come up with a good one. Until now. (ok, perhaps not a "good" reason but crazy enough to count.)

It now makes much more sense to me that a (terrible) mishap had occurred and possible prevention was only a reboot away. I can see how being exposed to that context at upper levels could easily cause one to latch onto any perceived preventative measures.

I also once saw a short ntp time step across multiple clusters (yeh, simultaneously) shut down half of a wafer factory.

Time is important.. but rebooting all your systems at midnight probably will not help you to control it. This especially if there are large, hot, fast objects flying around in the night sky and definitely, really, don't do ALL of them at the same time every day .. especially during, you know, battle. /pro-tip

lostlogin · on Feb 28, 2018

That's still not great logic. Think of all the crazy shit you have seen fix machines. If all the was implemented you would have users doing some truly bizarre things.

seorphates · on March 3, 2018

Mm. That's on point. It is as illogical as having the means and knowledge for prevention and not applying it. The crazy shit (booting theater active operational assets) was implemented by authority. Not patching theater active assets leads to death.

bertjk · on Feb 28, 2018

I've often wondered, considering the supposed low accuracy of Scud missiles, (wiki gives it a CEP of 450m) how much of the casualties from that incident were more due to the bad luck of the missile actually hitting its target.

nerpderp83 · on Feb 28, 2018

If the Scud had been brought down earlier in it's trajectory it would have not been near people regardless of any randomness in it's landing.

dnautics · on Feb 28, 2018

many of the scuds "broke up in flight" or otherwise malfunctioned, too, so the actual effectiveness of patriot has been called into question.

criley2 · on Feb 28, 2018

This is bad, editorialized title that is not the title of the article.

Mods should change this. The "software fix" was a software patch which corrected the clocking bug.

The "software workaround" to use pre-fix was reboot.

I hate editorialized, lying titles :(

codazoda · on Feb 28, 2018

Came here to mention that. The title needs a re-write but the story is interesting still.

leggomylibro · on Feb 28, 2018

I could be reading this wrong, but 1/3 of a second within 100 hours seems really good, like something you'd get from a temperature-controlled crystal oven.

I don't mean to second-guess them in an area I know so little about, but if that was enough to cause a serious issue in the span of only a few days, shouldn't the devices be designed with a separate synchronization system, at least as a backup? Maybe GPS?

Which brings up a sort of interesting question...would a Patriot missile system even have receivers for a weak public signal like GPS, or is it all self-contained?

GCU-Empiricist · on Feb 28, 2018

As a former submariner who has had used clock for inertial navigation or for similar weapons systems 1/3 of a second over 100 hours is terrible.

leggomylibro · on Feb 28, 2018

I mean, it's not an atomic clock, but I'm comparing it to the 32.768KHz RTC crystals I use with consumer microchips. If super-precise isolated accuracy were actually important, I assume they would use a rubidium or cesium oscillator.

grkvlt · on Feb 28, 2018

1/3 of a second in 100 hours is basically 1ppm, or TXCO levels of accuracy, so pretty good i'd have thought, even for a submarine INS?

cocoablazing · on March 1, 2018

USN ballistic missile submarines deploy the most accurate INS in the world (ESGN), and that system is used in conjunction with another advanced gyro.

grkvlt · on March 2, 2018

Not sure about that. ESGN is old technology, since submarines don't need that much accuracy. For example, space probes, ballistic missiles, smart artillery shells/rockets/missiles and so on would all appear to have multiple orders of magnitude better accuracy than submarines, in the fractional ppb ranges, rather than tens of ppm. [0][1]

0. https://www.sto.nato.int/publications/STO%20Meeting%20Procee...

1. http://users.cecs.anu.edu.au/~Jonghyuk.Kim/teaching/Inertial...

GCU-Empiricist · on March 1, 2018

Here is a 2 RU atomic clock on ebay for context: https://www.ebay.com/itm/Antelope-Audio-Isochrone-10M-Rubidi...

simonh · on Feb 28, 2018

It wasn’t clock drift, it was an error in calculation leading to separate parts of the system, that were calibrated to the same common clock, to drift out of synchronization. Using a different clock, like GPS wouldn’t help with this.

But the rest of your point boils down to ’if you know your system has a flaw why not mitigate it’? But of course at design time they didn’t know it had this flaw.

leggomylibro · on Feb 28, 2018

Good point - hindsight is 20/20...

ajross · on Feb 28, 2018

MIL-SPEC was indeed famous for overspecified components. So it's not terribly shocking that the oscillator on that board would operate really well as an isolated system. You probably don't need temperature control per se, a temperature compensation circuit could probably do that.

grkvlt · on Feb 28, 2018

> weak public signal like GPS

i'd assume it could, since GPS is military, and a mobile missile system is exactly the sort of thing that wants to know where it is, so would have the keys to the (higher resolution) encrypted GPS signals as well.

brohoolio · on Feb 28, 2018

This is depressing. One of my middle school classmates had a brother killed in a SCUD strike.

jimjimjim · on Feb 28, 2018

regarding the comments about bug killed people versus weapon killed people.

There is no 1 answer, this argument is a result of black-white/yes-no/us-them single point of blame thinking. and it's terrible.

the bug contributed to the loss of life.

blurbleblurble · on Feb 28, 2018

Little things do add up.

mlazos · on Feb 28, 2018

The title of this post is misleading, they eventually supplied a software patch that fixed the clock drift. The Israelis proposed rebooting as a stopgap until the bug could be fixed.

sctb · on Feb 28, 2018

We've updated the submitted title from “Clock error lead to death of 28 Soldiers. Software fix: Reboot system regularly” to a representative phrase (edited for length) from the article. Submitters: please follow the guidelines by not editorializing titles.

https://news.ycombinator.com/newsguidelines.html

nathan_long · on Feb 28, 2018

> The Patriot missile battery at Dhahran had been in operation for 100 hours, by which time the system's internal clock had drifted by one-third of a second. Due to the missile's speed this was equivalent to a miss distance of 600 meters.

jasonmaydie · on Feb 28, 2018

The scud missile lead to their deaths, not the software. There's no absolute guarantee it would have intercepted it, plus rebooting a deployed machine regularly is an acceptable fix when it's live in the field

rosser · on Feb 28, 2018

That's a reductio fallacy. If you want to play that game, it was being deployed to that specific place that caused their deaths. Or was it enlisting in the first place? Maybe merely having been born?

This is a strictly technical examination of the proximate cause of their deaths; it makes no claims about their ultimate cause. Whether or not a missile system with an accurate clock might have hit the target, it is unambiguous that this one missed specifically because of clock drift.

adamredwoods · on Feb 28, 2018

So much could go right or wrong, especially when in a war, or even when depending on technology in our households (fire alarms, replace your batteries!).

I feel "preventable deaths" is a preferable focus over "cause of death".

jasonmaydie · on Feb 28, 2018

How so? The implication you and the article are asserting is that the clock error caused their deaths.. rather than the more accurate description "could have prevented death".

Vivtek · on Feb 28, 2018

Well, it wasn't the missile that caused their deaths. Strictly speaking, it was the explosion of the missile.

Well, wait. It wasn't the explosion - technically, it was the impact of the pressure wave on their bodies that caused ... well, no. Really, it was the fact that their organs stopped working after impact of the ... well. If you really want to be accurate, it was the fact that metabolism ceased to be practicable after their organs stopped working.

Well, no, actually, the fact that their mental processes depended on their metabolism - that was really the cause of their... Well, no...

berns · on Feb 28, 2018

At this point we are already used to these trashy titles. There is no logic reasoning game to play; simply the bad faith in the title must be mentioned. And considering that the death of 28 persons is involved, it's in poor taste.

KindOne · on Feb 28, 2018

More technically information on the math behind the failure. http://www-users.math.umn.edu/~arnold/disasters/patriot.html

mannykannot · on Feb 28, 2018

If you are going to be pedantic, you might want to be very careful about what you write... How do you figure it would be more accurate to say that the clock error "could have prevented death?"

euyyn · on Feb 28, 2018

It would depend on whether they were relying on it to work or not.

michaelmrose · on Feb 28, 2018

People have been conditioned by using badly designed software that systems naturally drift into broken states in the course of normal operations and that having to start the system from scratch regularly is therefore very reasonable.

This just isn't so. Also the degree of acceptable reliability that is reasonable is different in a missile defense system vs the toy your grandma uses to browse facebook.

It had to be rebooted because a bug caused it to be increasingly inaccurate the longer it was booted up. This was always broken. It wasn't an acceptable fix because you manifestly can't trust users to do so as shown by the 28 corpses. It was however probably the best that could be done on short notice.

dragonwriter · on Feb 28, 2018

>

Taking 60-90s completely out of protection to reboot a critical defensive system when someone might, at any moment, toss a Mach 5 projectile at you from a couple hundred miles away is a far-from-ideal fix, even if it had been communicated properly to the end users (which it wasn't.)

https://embeddedgurus.com/barr-code/2014/03/lethal-software-...

KindOne · on Feb 28, 2018

I've seen clock drift first hand. I have an old Windows XP machine that would drift ahead about 3-30 seconds an hour while watching YouTube with Adobe flash. Playing video games, compiling gcc and other things in cygwin (hell) would not drift.