For any techies out there, this is deliberately simplified :)
Most people have an idea about the Domain Name System (DNS) and how it converts domain names to IP addresses. Well, Internet routing is what happens once your machine knows the destination IP address.
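(If you're curious where the split between the two steps sits, here's a tiny Python sketch of just the DNS half; the hostname is purely illustrative and has nothing to do with the outage.)

```python
import socket

# The DNS step: turn a name into an IP address.
# "example.com" is just an example name.
infos = socket.getaddrinfo("example.com", 80, proto=socket.IPPROTO_TCP)
print("DNS gave us:", infos[0][4][0])

# Everything after this point, i.e. actually getting packets to that IP,
# is routing, and between different providers that means BGP.
```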
Now, if the site you are going to is on the same network provider as you, then your traffic routes to it using standard internal routing protocols. However, this is usually not the case, so your internet provider needs to know where to send the traffic, and this is where BGP (Border Gateway Protocol) steps in.
Large blocks of IP addresses are assigned to each geographical region in the world, and those IPs are then allocated to Internet providers (ISPs etc.).
Each ISP is also assigned something called an Autonomous System number (AS number).
All of the IP ranges that an ISP assigns to its customers, both corporate and consumer, are summarised as one large set of routes that are then associated with the AS number.
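(Rough idea of what that summarisation looks like, as a toy Python sketch using the standard ipaddress module; the prefixes are made-up documentation ranges, not anyone's real allocations.)

```python
import ipaddress

# Hypothetical customer allocations handed out by one ISP.
customer_blocks = [
    ipaddress.ip_network("203.0.113.0/26"),
    ipaddress.ip_network("203.0.113.64/26"),
    ipaddress.ip_network("203.0.113.128/26"),
    ipaddress.ip_network("203.0.113.192/26"),
]

# Collapse them into the smallest covering set of prefixes.
# This aggregated set is what gets advertised under the ISP's AS number.
print(list(ipaddress.collapse_addresses(customer_blocks)))
# [IPv4Network('203.0.113.0/24')]
```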
These large blocks of routes are then 'swapped' with other ISPs at what we call 'peering points'. These are datacenters that host border routers for all the major ISPs.
Here is an example. A consumer has IP 1.1.1.1 and is with ISP#1 (who has AS number 1001).
The website has IP 3.3.3.3 and is with ISP#3 (who has AS number 3003).
Now, these two providers might not connect to the same peering point, but they both have connections to ISP#2, who has AS number 2002.
So, rather than carry millions of routes for all the various networks, ISP#1 knows it can get to the IP address 3.3.3.3 by sending traffic to AS 2002, who will then pass it on to AS 3003 for the final delivery.
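(If it helps, here's a toy Python model of ISP#1's view using the example numbers above; the 3.3.3.0/24 prefix and the table layout are my own illustration, not how a real router stores things.)

```python
import ipaddress

# ISP#1 (AS 1001) learned this route from its peer: the first AS in the
# path is the next network to hand traffic to, the last is the origin.
bgp_table = {
    ipaddress.ip_network("3.3.3.0/24"): [2002, 3003],
}

def lookup(ip):
    """Return the AS path for the longest matching prefix, if any."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in bgp_table if addr in net]
    if not matches:
        return None  # no route: the address is unreachable from here
    best = max(matches, key=lambda net: net.prefixlen)  # longest prefix wins
    return bgp_table[best]

print(lookup("3.3.3.3"))  # [2002, 3003]: hand it to AS 2002, which passes it on to AS 3003
```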
Now, each AS number is associated with huge numbers of routes, which are often summarised into larger blocks, but there are always exceptions and it isn't always neat, so each ISP has filters in place on its BGP peerings with other ISPs.
These filters perform two functions:
- They determine which networks are advertised as part of the ISP to all of its peering partners, and
- They filter incoming advertisements, both to prevent small networks clogging things up and to prevent hijacking of routes (think redirecting blocks of IPs to China so the traffic can be sniffed). There's a rough sketch of this below.
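(A very rough Python sketch of that second kind of filter; the prefix-length cutoff, the 'expected origin' register and the AS numbers are all made up for illustration, real ISPs do this in their router configs.)

```python
import ipaddress

# Made-up register of which AS is expected to originate which block.
expected_origin = {
    ipaddress.ip_network("3.3.0.0/16"): 3003,
}

def accept_advertisement(prefix, as_path):
    """Toy inbound filter: drop overly specific prefixes and apparent hijacks."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > 24:        # too small, would clog up the global table
        return False
    origin = as_path[-1]          # the last AS in the path originated the route
    for block, owner in expected_origin.items():
        if net.subnet_of(block) and origin != owner:
            return False          # someone else claiming this space: likely a hijack
    return True

print(accept_advertisement("3.3.3.0/24", [2002, 3003]))  # True: legitimate
print(accept_advertisement("3.3.3.0/25", [2002, 3003]))  # False: too specific
print(accept_advertisement("3.3.3.0/24", [2002, 6666]))  # False: wrong origin AS
```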
Now, in order for Facebook's address ranges to be blackholed, they would either have to be removed from the BGP advertisements for their AS number (FB is big enough to have its own AS, so this would happen on their machines), OR all the other providers would have to be configured to filter out the ranges that are associated with FB.
This sort of thing can happen by accident, but an accident would never be so focussed or last for such a lengthy period of time.
The internet works via BGP; without it you couldn't connect to people who are with other providers, so it gets a lot of security focus and a lot of resources thrown at it should something hit the fan.
This outage is no accident, and it is not easy to do. You either need control of FB's border routers (all of them, there's never just one or two) OR you need to control everyone else's border routers.
Neither of which is a simple task.
All of these border routers will have something called OOB (Out of Band) access, which is access to the device that is not dependent on the network itself. Think a modem connected to a telephone line, which is linked to a console server (yes, really; the bandwidth requirements for ssh/telnet/remote access are not great).
If FB could access their routers via the OOB modems, then this outage would have been over an hour after it started (assuming the hack was on their devices).
Let me know if there's anything that doesn't make sense and I'll try to answer your Q's, but it might be in the morning now.
Thanks for the explanation. Seems like an impossible event. It makes the most sense that the issue was with FB's routers, some huge screw-up; could be a rookie left in charge of some fix/upgrade while someone more senior is on vacation. Lots of change control protocols in place, but still things can happen. Pushed to all routers? My observation in IT these days is that although there are a lot of employees, everyone is very siloed and each person handles a very specific area with not a lot of cross-training or shared responsibilities.
I've worked in places where that kind of thing happens, even with change control.
However, they didn't have the resources FB has. I'm still not buying that this was an accident.
Cascade errors from unforeseen consequences of a change do happen, but not at every peering point FB has at the same time. They don't peer with the world with a couple of Cisco 12000s, after all.
Does seem unprofessional for such a large organization. Just as curious was the amount of time to fix it. But sometimes (more often than not) the person who caused the screw up is the one in charge of undoing it.
Well, since it did affect DNS: if the primary resolvers were out for more than 4 hours, then all the downstream servers would lose their cached records and take time to reload them, which is probably what is happening now.
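(Toy Python sketch of that record expiry, purely to show the timing effect; the name, address and TTL are made up.)

```python
import time

# A downstream resolver's cache: name -> (address, expiry time).
cache = {"www.example.com": ("3.3.3.3", time.time() + 300)}  # 5 minute TTL

def resolve(name):
    """Answer from cache while the TTL holds; after that we must re-query upstream."""
    entry = cache.get(name)
    if entry and entry[1] > time.time():
        return entry[0]
    # TTL expired: if the authoritative servers are still unreachable,
    # lookups just fail until they return and the caches refill.
    return None
```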
However, the BGP peerings were definitely screwed for several hours, and then the admins were unable to access the router console via out of band, so it's definitely fishy.
Is it possible that something went wrong with trying to use out of band? Would that not be for emergency use, so very rarely used? It could be something as stupid as unknown login credentials. Well, what we know for sure, given the millions of dollars lost, is that the pressure to fix it must have been intense. And then you usually have a battalion of managers breathing down some poor guy's neck and asking for an update every five minutes! Hehehe