For any techies out there, this is deliberately simplified :)
Most people have an idea of what the Domain Name System (DNS) is and how it converts domain names to IP addresses. Well, Internet routing is what happens once your machine knows the destination IP address.
Now, if the site you are going to is on the same network provider as you, then your traffic would route to it using standard routing protocols. However, this is usually not the case, so your internet provider needs to know where to send the traffic, and this is where BGP (Border Gateway Protocol) steps in.
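If you want to see that hand-off for yourself, a couple of lines of Python will do it (needs network access, and the exact address you get back will vary):

```python
# DNS gives you an address; from then on it's the routing system's job
# (BGP-learned routes included) to actually get packets there.
import socket

addr = socket.getaddrinfo("facebook.com", 443, proto=socket.IPPROTO_TCP)[0][4][0]
print("DNS says talk to", addr)
```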
Large blocks of IP addresses are assigned to each geographical region in the world, and those IPs are then assigned to Internet providers (ISPs etc.).
Each ISP is also assigned something called an Autonomous System number (AS number).
All of the IP ranges that an ISP assigns to its customers, both corporate and consumer, are summarised as one large set of routes that is then associated with the AS number.
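If you're curious what 'summarised' means in practice, here's a rough illustration in Python (the ranges are invented):

```python
# A pile of customer ranges collapses into one larger block that can be
# advertised as a single route under the ISP's AS number.
import ipaddress

customer_ranges = [
    ipaddress.ip_network("1.1.0.0/24"),
    ipaddress.ip_network("1.1.1.0/24"),
    ipaddress.ip_network("1.1.2.0/23"),
]
print(list(ipaddress.collapse_addresses(customer_ranges)))
# -> [IPv4Network('1.1.0.0/22')]
```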
These large blocks of routes are then 'swapped' with other ISPs at what we call 'peering points'. These are datacenters that host border routers for all the major ISPs.
Here is an example. A consumer has IP 1.1.1.1 and is with ISP#1 (who has an AS number of 1001).
A website has IP 3.3.3.3 and is with ISP#3 (who has an AS number of 3003).
Now, these two providers might not connect to the same peering point, but they both have connections to ISP#2 who has an AS number of 2002.
So, rather than carry millions of routes for all the various networks, ISP#1 knows it can get to the IP address 3.3.3.3 by sending traffic to AS 2002, who will then pass it on to AS 3003 for the final delivery.
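Here's a toy model of that hand-off in Python. It's nothing like what a real router does internally, just the idea, and all the prefixes and AS numbers are the made-up ones from the example:

```python
import ipaddress

# prefix -> AS path learned from peers
bgp_table = {
    ipaddress.ip_network("3.3.0.0/16"): [2002, 3003],  # the website's range, via ISP#2 then ISP#3
    ipaddress.ip_network("1.1.0.0/16"): [1001],        # our own customers
}

def next_hop_as(dest_ip):
    """Pick the longest-prefix match and return the first AS to hand traffic to."""
    dest = ipaddress.ip_address(dest_ip)
    matches = [(net, path) for net, path in bgp_table.items() if dest in net]
    if not matches:
        return None                                    # no route at all: traffic is blackholed
    _, path = max(matches, key=lambda m: m[0].prefixlen)
    return path[0]

print(next_hop_as("3.3.3.3"))   # 2002 -> hand it to ISP#2
print(next_hop_as("9.9.9.9"))   # None -> nowhere to send it
```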
Now, each AS number is associated with a huge number of routes, which are often summarised into larger blocks, but there are always exceptions and it isn't always neat, so each ISP has filters in place on its BGP peerings with other ISPs.
These filters perform two functions:
- They determine which networks are advertised from the ISP to all the other peering partners, and
- They filter incoming advertisements to prevent small networks clogging things up and to prevent hijacking of routes (think redirecting blocks of IPs to China so they can sniff the traffic); there's a rough sketch of this just below.
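To give a feel for that second kind of filter, here's a very rough sketch (the length cut-off and the 'who owns what' table are invented for the example; real filters are built from registry data and are far more involved):

```python
# Reject advertisements that are too specific (they'd clog the table) and
# reject prefixes announced from an AS that shouldn't be announcing them.
import ipaddress

MAX_PREFIX_LEN = 24                          # common IPv4 cut-off
expected_origin = {                          # prefix -> AS that should announce it
    ipaddress.ip_network("3.3.0.0/16"): 3003,
}

def accept_advertisement(prefix, origin_as):
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > MAX_PREFIX_LEN:       # too small a block, reject
        return False
    for known, owner in expected_origin.items():
        if net.subnet_of(known) and origin_as != owner:
            return False                     # looks like a hijack, reject
    return True

print(accept_advertisement("3.3.3.0/24", 3003))    # True: legitimate
print(accept_advertisement("3.3.3.0/24", 6666))    # False: wrong origin AS
print(accept_advertisement("3.3.3.128/25", 3003))  # False: too specific
```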
Now, in order for Facebook's address range to be blackholed, the routes would either have to be removed from the BGP advertisement for their AS number (FB is big enough to have its own AS, so this would be on their machines), OR all the other providers would have to be configured to filter out the ranges that are associated with FB.
This sort of thing can be done by accident, but an accident would never look this focussed or last this long.
The internet works via BGP; without it you couldn't connect to people who are with other providers, so it gets a lot of security focus and a lot of resources thrown at it should something hit the fan.
This outage is no accident, and it is not easy to do. You either need control of FB's border routers (all of them, there's never just one or two) OR you need to control everyone else's border routers.
Neither of which is a simple task.
All of these border routers will have something called OOB (Out of Band) access, which is access to the device that is not dependent on the network itself. Think a modem connected to a telephone line which is linked to a console server (yes, really; the bandwidth requirements for ssh/telnet/remote access are not great).
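For flavour, talking to a console server over a serial line can be as simple as this. It's only a sketch: it assumes the pyserial library and that the console line shows up as /dev/ttyUSB0, both of which are just for illustration.

```python
import serial  # pyserial: pip install pyserial

def grab_console_banner(port="/dev/ttyUSB0", baud=9600, timeout=5):
    """Poke the console line and read back whatever prompt/banner the router prints."""
    with serial.Serial(port, baudrate=baud, timeout=timeout) as line:
        line.write(b"\r\n")       # wake the console
        return line.read(256)     # grab the first few hundred bytes of output

if __name__ == "__main__":
    print(grab_console_banner())
```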
If FB could access their routers via the OOB modems, then this outage would have been over an hour after it started (assuming the hack was on their devices).
Let me know if there's anything that doesn't make sense and I'll try to answer your Q's, but it might be in the morning now.
Thank you for this excellent explanation...I'm pretty inexperienced in such things but I followed you...
God's router is a direct line so I never get outages kek!
Blessings fren....let's hope this is what we have been waiting for....🙏
Thanks, it's hard to know if I struck the right balance; it can get a bit gnarly when you get into the weeds :)
Thank you, fren. This was very helpful.
Since the one employee is talking about this being due to an update:
Could FB even run an update unless they controlled all the border routers?
If not, does this hint at someone else having backdoor to all the routers?
Back when I was a backbone engineer in the 90's I drew up a plan on how to break the internet.
The first part would be to mess up the BGP, the next bit would be to mess up the machines so no-one could fix it quickly.
It is possible to fuck up this badly with an update, but not on all of the border routers at the same time.
Change control procedures would not allow you to mess up this badly on all the devices at the same time.
The biggest fuckup you could make would be to update the IOS on the routers and brick them, but like I said, these are all redundant devices; it just wouldn't happen.
That's what I'm getting at. Seems the person claiming it was an update is either lying or telling a broader truth they don't want people to know.
Someone has the keys to the entire kingdom.
It's impossible to tell. My attack plan would have worked in the late 90's because there wasn't as much security and there were fewer players and routers.
To do the same thing now would require a team of state level actors.
Thanks for the explanation. Seems like an impossible event. It makes the most sense that the issue was with FB's routers - some huge screw-up - could be a rookie left in charge of some fix/upgrade while someone more senior is on vacation. Lots of change control protocols in place, but still things can happen - pushed to all routers? My observation in IT these days is that although there are a lot of employees, everyone is very siloed and each person handles a very specific area with not a lot of cross-training / responsibilities.
I've worked in places where that kind of thing happens, even with change control.
However, they didn't have the resources FB has. I'm still not buying that this was an accident.
Cascade errors from unforeseen consequences of a change do happen, but not at every peering point FB has at the same time. They don't peer with the world with a couple of Cisco 12000s, after all.
Does seem unprofessional for such a large organization. Just as curious was the amount of time to fix it. But sometimes (more often than not) the person who caused the screw up is the one in charge of undoing it.
Well, since it did affect DNS: if the primary name servers were out for more than 4 hours, then all the downstream resolvers would lose their cached records and take time to reload them, which is probably what is happening now.
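A toy cache shows why the recovery drags on: once a record's TTL runs out, the resolver simply forgets it and has to ask upstream again. The name, address and TTL here are invented for the demo.

```python
import time

class ToyDnsCache:
    def __init__(self):
        self._store = {}                       # name -> (ip, expiry timestamp)

    def put(self, name, ip, ttl):
        self._store[name] = (ip, time.time() + ttl)

    def get(self, name):
        entry = self._store.get(name)
        if entry is None or time.time() > entry[1]:
            self._store.pop(name, None)        # expired: must go back upstream
            return None
        return entry[0]

cache = ToyDnsCache()
cache.put("www.facebook.com", "203.0.113.10", ttl=1)   # 1-second TTL for the demo
print(cache.get("www.facebook.com"))   # hit while the record is fresh
time.sleep(2)
print(cache.get("www.facebook.com"))   # None: expired, must re-resolve
```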
However, the BGP peerings were definitely screwed for several hours, and then the admins were unable to access the router console via out of band, so it's definitely fishy.
Is it possible that something went wrong when they tried to use out of band? Would that not be for emergency use and so very rarely used? It could be something as stupid as the login credentials being unknown. Well, what we know for sure, given the millions of dollars lost, is that the pressure to fix it must have been intense. And then you usually have a battalion of managers breathing down some poor guy's neck and asking for an update every five minutes! Hehehe
Remote console passwords are hard-coded; no authentication against remote devices should be required. There's no way they don't test OOB regularly, especially on border routers.
So, would the root cause have been a configuration text file that was corrupted? What would be your guess?
Since they've now recovered, I suspect it was the prefix filters that were messed with.