There are quite a few harsh comments below. You can't plan for every possible failure point; who knows which part of a system or piece of infrastructure, out of everything they run, went down and triggered this behaviour. Some things you just can't catch or predict, especially in systems as huge as theirs. I would expect people here to understand that and not call others names over something like this. We all know things look simple and clear from the outside, but debugging and fixing something like this takes real effort.
This is a company with one of the largest digital infrastructures in the world. An outage is understandable; the inability to tell that they're having an outage and inform users appropriately is not. Stop making excuses for people who are literally awash in resources.
> Stop making excuses for people who are literally awash in resources.
This is a pretty weird outlook to have. Look at any group awash with resources, whether governments or other companies, and you can clearly see that even with those resources, failures still happen.
You can jump up and down and pretend that this is solvable, or you can look at reality, look at all the evidence of this happening over and over to almost everyone, and conclude with some humility that these things just happen to everyone.
(Looking this reality in the face is one of the things motivating my beliefs around e.g. AI safety, climate change, etc.)
It is always better for the company's reputation if the issue appears to have been on your end. Admitting fault comes with potential liability. It's gaslighting written as an SLA.
You can't plan for every contingency, but you can reserve potentially scary messages for situations where you know they are correct. An unexpected error state should NOT result in an "invalid credentials" error.
Pushing people to unnecessarily reset credentials increases risk. Not only does it increase acute risk, but it also decreases the value of the signal by crying wolf.
The argument here is the kind of nonsense cargo cult security that pervades the industry.
- in general, if the system is broken enough to be giving false negatives on valid credentials, it's broken enough that there isn't much planning to be done here, because the system's not supposed to break. So if they give me "Sorry, backend offline" instead of "invalid credentials," they've now turned their system into an oracle I can scan for queries-of-death. That's useful for an attacker.
- in the specifics of this situation, (a) credential reset was offline too so nobody could immediately rotate them anyway and (b) as a cohort, Facebook users could stand to rotate their credentials more often than the "never" that they tend to rotate them, so if this outage shook their faith enough that they changed their passwords after system health was restored... Good? I think "accidentally making everyone wonder if their Facebook password is secure enough" was a net-positive side-effect of this outage.
So your approach to security is to never admit that an application had an error to a user, but to instead gaslight that user with incorrect error messages that blame them?
This is security by obscurity of the worst kind, the kind that actively harms users and makes software worse.
No. My approach to security is to never admit that an application had an error to an unauthenticated user.
That information is accessible to two cohorts:
- authenticated users (sometimes; not even authenticated users get access to errors as low-level as "The app's BigTable quota was exceeded because the developers fucked up" if it's closed source cloud software)
- admins, who have an audit log somewhere of actual system errors, monitoring on system health, etc.
Unfortunately, I can't tell if the third cohort (unauthenticated users) is my customers or actively-hostile parties trying to make the operation of my system worse for my customers, so my best course of action is to refrain from providing them information they can use to hurt my customers. That means, among other things:
- I 403 their requests to missing resources instead of 404ing them,
- I intentionally obfuscate the amount of time it takes to process their credentials, so they can't use timing attacks to guess whether they're on the right track,
- I never tell them if I couldn't auth them because I don't recognize their email address (because now I've given them an oracle to find the email addresses of customers), and
- if my auth engine flounders, I give them the same answer as if their credentials were bad (and I fix it fast, because that's impacting my real users too).
To be clear: I say all this as a UX guy who hates all this. UX on auth systems is the worst and a constant foil to system usability. But I understand why.
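For illustration, here is a minimal sketch of that stance, assuming a toy in-memory user store and hypothetical helper names (none of this is Facebook's or any real library's auth API): an unknown email, a wrong password, and a backend failure all produce the same response, and every attempt is padded to roughly the same duration.

```python
# Minimal sketch of "one opaque answer for unauthenticated callers".
# _FAKE_DB, _verify and login are hypothetical stand-ins, not a real API.
import hmac
import secrets
import time

MIN_RESPONSE_SECONDS = 0.5  # pad every attempt to roughly the same wall-clock time

_FAKE_DB = {"alice@example.com": "hash-of-alices-password"}  # toy credential store

def _verify(email: str, password: str) -> bool:
    stored = _FAKE_DB.get(email)
    if stored is None:
        # Burn comparable work so "unknown email" isn't faster than "wrong password".
        hmac.compare_digest(("hash-of-" + password).encode(), b"decoy-value")
        return False
    return hmac.compare_digest(("hash-of-" + password).encode(), stored.encode())

def login(email: str, password: str) -> dict:
    start = time.monotonic()
    try:
        ok = _verify(email, password)
    except Exception:
        # Backend hiccup: to an unauthenticated caller it looks exactly like a
        # bad login; log and page internally instead of exposing the failure mode.
        ok = False
    # Pad to a minimum duration so timing doesn't hint at which branch was taken.
    remaining = MIN_RESPONSE_SECONDS - (time.monotonic() - start)
    if remaining > 0:
        time.sleep(remaining)
    if ok:
        return {"status": 200, "token": secrets.token_urlsafe(16)}
    return {"status": 401, "error": "Invalid credentials"}
```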
You are absolutely correct. That would be a much better experience.
That said, getting there strikes me as pretty challenging. Automatically detecting a down state is difficult and any detection is inevitably both error-prone and only works for things people have thought of to check for. The more complex the systems in question, the greater the odds of things going haywire. At Meta's scale, that is likely to be nearly a daily event.
The obvious way to avoid those issues is a manual process. Problem there tends to be that the same service disruptions also tend to disrupt manual processes.
So you're right, but also I strongly suspect it's a much more difficult problem than it sounds like on the surface.
> That said, getting there strikes me as pretty challenging. Automatically detecting a down state is difficult and any detection is inevitably both error-prone and only works for things people have thought of to check for. The more complex the systems in question, the greater the odds of things going haywire. At Meta's scale, that is likely to be nearly a daily event.
Well, in principle, the frontend just has to distinguish between HTTP status 500 (something broken in the backend, not the fault of the user) and some HTTP status code 4xx (the user did something wrong).
The "your username/password is wrong" message came in a timely manner. So someone transformed "some unforeseen error" into a clear but wrong error message.
And this caused a lot of extra trouble on top of the incident.
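As a rough sketch of that distinction from the client's side (the /api/login URL and payload are made up for illustration): a 5xx gets surfaced as "our problem", never as "your password is wrong".

```python
# Sketch of the 5xx-vs-4xx distinction at the client; endpoint and payload are hypothetical.
import requests

def login_message(email: str, password: str) -> str:
    try:
        resp = requests.post(
            "https://example.com/api/login",
            json={"email": email, "password": password},
            timeout=10,
        )
    except (requests.Timeout, requests.ConnectionError):
        return "We couldn't reach the server. Please try again later."
    if resp.status_code >= 500:
        # Backend broke: not the user's fault, so don't blame their credentials.
        return "Something went wrong on our side. Please try again later."
    if resp.status_code in (400, 401, 403):
        return "Invalid username or password."
    if resp.ok:
        return "Logged in."
    return f"Unexpected response ({resp.status_code}). Please try again."
```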
But there's something off here. I wouldn't expect to be shown as logged out when the services are down. I'd expect calls to fail with something like a 500 and an error saying "something happened on our side", not all the apps going haywire.
At the scale of Meta, "down" is a nuanced concept. You are very unlikely to get every piece of functionality seizing up at once. What you are likely to get is some services ceasing to function and other services doing error-handling.
For example, if the service that authenticates a user stops working but the service that shows the login form works, then you get a complex interaction. The resulting messaging - and thus user experience - depend entirely on how the login page service was coded to handle whatever failure the authentication service offered up. If that happens to be indistinguishable from a failure to authenticate due to incorrect credentials from the perspective of the login form service, well, here we are.
At Meta's scale, there are likely quite a few underlying services. Which means what we see could be a dozen or more complex interactions away from wherever the failures are happening.
Isn't this just the standard problem of reporting useful error messages? Like, yes, there are academic situations where you can't distinguish between two possible error sources, but the vast majority of insufficiently informative error messages in the real world arise because low effort was applied to doing so.
Yes, with sheer scale, a vast number of services, multiple layers, and the difficulty of defining "down" added in. I think the difficulty of reporting useful error messages is proportional to the number of places an error can reasonably happen and the number of connections it can happen over, and by any metric Meta's got a lot of those.
No, in that detecting when you should be reporting a useful error message is itself a complex problem. If a service you call gives you a nonsense response, what do you surface to the user? If a service times out, what do you report? How do you do all this without confusing, intimidating, and terrifying users to whom the phrase "service timeout" is technobabble?
> If a service you call gives you a nonsense response, what do you surface to the user?
If this occurred during the authentication process, I think I would tell the user "Sorry, the authentication process isn't working. Try again later." rather than "Invalid credentials". And you could include a "[technical details]" button that the user could click if they were curious or were in the process of troubleshooting.
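A sketch of that approach, with the upstream auth-service URL and response shape as assumptions (not any real Meta service): a timeout or an unparseable reply from the auth service becomes "the process is broken", with the raw detail tucked behind an optional technical-details field.

```python
# Sketch: upstream failure -> "the process is broken", not "your credentials are wrong".
# AUTH_SERVICE and the response shape are hypothetical.
import requests

AUTH_SERVICE = "https://auth.internal.example/check"

def check_credentials(email: str, password: str) -> dict:
    try:
        resp = requests.post(AUTH_SERVICE, json={"email": email, "password": password}, timeout=5)
        body = resp.json()  # a nonsense (non-JSON) reply raises ValueError
        if resp.status_code == 200 and isinstance(body, dict) and isinstance(body.get("valid"), bool):
            return {"message": "Welcome back." if body["valid"] else "Invalid credentials."}
        # The service answered, but not in a shape we understand: don't blame the user.
        raise ValueError(f"unexpected auth-service response: {resp.status_code}")
    except (requests.Timeout, requests.ConnectionError, ValueError) as exc:
        return {
            "message": "Sorry, the authentication process isn't working. Try again later.",
            "technical_details": str(exc),  # surfaced only behind a [technical details] button
        }
```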
> If that happens to be indistinguishable from a failure to authenticate due to incorrect credentials from the perspective of the login form service, well, here we are.
If you can't distinguish those, then that is bad software design.
Come on, use a little imagination. The DNS lookup for the DB holding the shard with the user credentials disappears. The code isn't expecting this and throws a generic 4xx because security instead of a generic 5xx (plenty of people writing auth code will take the stance all failures are presented the same as a bad password or non-existing username); the caller interprets this as a login failure.
The same auth system is used to validate logins to the bastions that have access to DNS. Voilà.
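A toy sketch of that failure mode (the hostname and helpers are made up): a vanished DNS record for the credential store gets caught by a blanket handler and re-labelled as a bad login.

```python
# Sketch of an infra failure (DNS for the credential shard) being swallowed
# into "invalid credentials". Hostname and helpers are hypothetical.
import hashlib
import socket

def fetch_credential_record(username: str) -> dict:
    # Stand-in for "look up the shard holding this user's credentials";
    # if the DNS record for the shard disappears, this raises.
    socket.getaddrinfo("user-shard-042.db.internal.example", 5432)
    return {"username": username, "hash": hashlib.sha256(b"hunter2").hexdigest()}

def login_status(username: str, password: str) -> tuple:
    try:
        record = fetch_credential_record(username)
        ok = hashlib.sha256(password.encode()).hexdigest() == record["hash"]
        return (200, "ok") if ok else (401, "Invalid credentials")
    except Exception:
        # The "every failure looks like a bad password" stance: the DNS error
        # above becomes a 4xx, and the caller dutifully tells the user their
        # credentials are wrong. Returning 503 here would keep infra failures
        # distinguishable from genuinely bad credentials.
        return (401, "Invalid credentials")
```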
> plenty of people writing auth code will take the stance all failures are presented the same as a bad password or non-existing username
Those people would be wrong. You can take all unexpected errors and stick them behind a generic error message like "something went wrong" but you should not lie to your users with your error message.
If you have different messages for invalid username vs invalid password, you can exploit that to determine if a user has an account at a particular service.
"Invalid credentials" for either case solves this problem.
But sure, let's report infra failures differently, as "unexpected error".
Now, what happens if the unexpected error is only when checking passwords, but not usernames?
Do you report "invalid credentials" when given an invalid username, but "unexpected error" when given a valid name but invalid password?
If so, you're leaking information again and I can determine valid usernames.
So the safe approach is to report "invalid credentials" for either invalid data or partial unexpected errors.
The only time you could safely report "unexpected error" is if both the username check and the password check are failing, which is so rare that it's almost not worth handling, especially given the risk of doing it wrong and leaking info again.
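A minimal sketch of that policy, with a toy credential store and two checks that are imagined to be backed by services that can fail independently: a bad username, a bad password, and a partial backend error all collapse into the same message, and "unexpected error" only appears when both checks are broken.

```python
# Sketch of the policy above; _USERS and the two "checks" are toy stand-ins
# imagined to be backed by services that can fail independently.
INVALID = "Invalid credentials"
UNEXPECTED = "Unexpected error, please try again later"

_USERS = {"alice": "correct horse battery staple"}  # toy credential store

def login_message(username: str, password: str) -> str:
    username_error = password_error = False
    username_ok = password_ok = False

    try:
        username_ok = username in _USERS            # "username check"
    except Exception:
        username_error = True

    try:                                            # "password check"
        password_ok = username_ok and _USERS[username] == password
    except Exception:
        password_error = True

    if username_error and password_error:
        # Only when the whole auth path is down is there nothing user-specific to leak.
        return UNEXPECTED
    if username_ok and password_ok:
        return "Welcome back"
    # Bad username, bad password, or a partial failure: all indistinguishable.
    return INVALID
```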
If you really want to hide whether a username is in use, then you also have to obscure the actual duration of the authentication process, among other things. The number of hoops you need to jump through to properly hide username usage is large enough that you need to actually consider whether this is a requirement or not. Otherwise, it is just a cargo cult security practice like password character requirements or mandated password reset periods.
In this case, Facebook does not treat hiding username usage as a requirement. Their password reset mechanism not only exposes username / phone number usage, but ties it to a name and picture. So yes, Facebook returning an error that says credentials are incorrect when it has infrastructure problems is absolutely a defect.
What if, when one service doesn't respond at all or responds with something that doesn't fit the format it would return if working correctly, the whole thing just says "sorry, we had an error, try again later"? If it has to check both at the same time and can't check them independently, wouldn't that solve the vulnerability? Or am I missing something? Totally understandable if I am, I just want to learn. /gen
This would prevent people from panicking that they've been hacked and/or unnecessarily resetting their password.