For any techies out there, this is deliberately simplified :)
Most people have an idea about the Domain Name System (DNS) and how it converts domain names to IP addresses. Well, Internet routing is what happens once your machine knows the destination IP address.
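(If you're curious where the split between the two steps sits, here's a tiny Python sketch of just the DNS half; the hostname is purely illustrative and has nothing to do with the outage.)

```python
import socket

# The DNS step: turn a name into an IP address.
# "example.com" is just an example name.
infos = socket.getaddrinfo("example.com", 80, proto=socket.IPPROTO_TCP)
print("DNS gave us:", infos[0][4][0])

# Everything after this point, i.e. actually getting packets to that IP,
# is routing, and between different providers that means BGP.
```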
Now, if the site you are going to is on the same network provider as you, then your traffic routes to it using standard internal routing protocols. However, this is usually not the case, so your internet provider needs to know where to send the traffic, and this is where BGP (Border Gateway Protocol) steps in.
Large blocks of IP addresses are assigned to each geographical region in the world, and those IPs are then allocated to Internet providers (ISPs etc.).
Each ISP is also assigned something called an Autonomous System number (AS number).
All of the IP ranges that an ISP assigns to its customers, both corporate and consumer, are summarised as one large set of routes that are then associated with the AS number.
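(Rough idea of what that summarisation looks like, as a toy Python sketch using the standard ipaddress module; the prefixes are made-up documentation ranges, not anyone's real allocations.)

```python
import ipaddress

# Hypothetical customer allocations handed out by one ISP.
customer_blocks = [
    ipaddress.ip_network("203.0.113.0/26"),
    ipaddress.ip_network("203.0.113.64/26"),
    ipaddress.ip_network("203.0.113.128/26"),
    ipaddress.ip_network("203.0.113.192/26"),
]

# Collapse them into the smallest covering set of prefixes.
# This aggregated set is what gets advertised under the ISP's AS number.
print(list(ipaddress.collapse_addresses(customer_blocks)))
# [IPv4Network('203.0.113.0/24')]
```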
These large blocks of routes are then 'swapped' with other ISPs at what we call 'peering points'. These are datacenters that host border routers for all the major ISPs.
Here is an example. A consumer has IP 1.1.1.1 and is with ISP#1 (who has AS number 1001).
The website has IP 3.3.3.3 and is with ISP#3 (who has AS number 3003).
Now, these two providers might not connect to the same peering point, but they both have connections to ISP#2, who has AS number 2002.
So, rather than carry millions of routes for all the various networks, ISP#1 knows it can get to the IP address 3.3.3.3 by sending traffic to AS 2002, who will then pass it on to AS 3003 for the final delivery.
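(If it helps, here's a toy Python model of ISP#1's view using the example numbers above; the 3.3.3.0/24 prefix and the table layout are my own illustration, not how a real router stores things.)

```python
import ipaddress

# ISP#1 (AS 1001) learned this route from its peer: the first AS in the
# path is the next network to hand traffic to, the last is the origin.
bgp_table = {
    ipaddress.ip_network("3.3.3.0/24"): [2002, 3003],
}

def lookup(ip):
    """Return the AS path for the longest matching prefix, if any."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in bgp_table if addr in net]
    if not matches:
        return None  # no route: the address is unreachable from here
    best = max(matches, key=lambda net: net.prefixlen)  # longest prefix wins
    return bgp_table[best]

print(lookup("3.3.3.3"))  # [2002, 3003]: hand it to AS 2002, which passes it on to AS 3003
```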
Now, each AS number is associated with huge numbers of routes, which are often summarised into larger blocks, but there are always exceptions and it isn't always neat, so each ISP has filters in place on its BGP peerings with other ISPs.
These filters perform two functions:
- They determine which networks are advertised as part of the ISP to all of its peering partners, and
- They filter incoming advertisements, both to prevent small networks clogging things up and to prevent hijacking of routes (think redirecting blocks of IPs to China so the traffic can be sniffed). There's a rough sketch of this below.
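(A very rough Python sketch of that second kind of filter; the prefix-length cutoff, the 'expected origin' register and the AS numbers are all made up for illustration, real ISPs do this in their router configs.)

```python
import ipaddress

# Made-up register of which AS is expected to originate which block.
expected_origin = {
    ipaddress.ip_network("3.3.0.0/16"): 3003,
}

def accept_advertisement(prefix, as_path):
    """Toy inbound filter: drop overly specific prefixes and apparent hijacks."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > 24:        # too small, would clog up the global table
        return False
    origin = as_path[-1]          # the last AS in the path originated the route
    for block, owner in expected_origin.items():
        if net.subnet_of(block) and origin != owner:
            return False          # someone else claiming this space: likely a hijack
    return True

print(accept_advertisement("3.3.3.0/24", [2002, 3003]))  # True: legitimate
print(accept_advertisement("3.3.3.0/25", [2002, 3003]))  # False: too specific
print(accept_advertisement("3.3.3.0/24", [2002, 6666]))  # False: wrong origin AS
```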
Now, in order for Facebook's address ranges to be blackholed, they would either have to be removed from the BGP advertisements for their AS number (FB is big enough to have its own AS, so this would happen on their machines), OR all the other providers would have to be configured to filter out the ranges that are associated with FB.
This sort of thing can happen by accident, but an accident would never be so focussed or last for such a lengthy period of time.
The internet works via BGP; without it you couldn't connect to people who are with other providers, so it gets a lot of security focus and a lot of resources thrown at it should something hit the fan.
This outage is no accident, and it is not easy to do. You either need control of FB's border routers (all of them, there's never just one or two) OR you need to control everyone else's border routers.
Neither of which is a simple task.
All of these border routers will have something called OOB (Out of Band) access, which is access to the device that is not dependent on the network itself. Think a modem connected to a telephone line, which is linked to a console server (yes, really; the bandwidth requirements for ssh/telnet/remote access are not great).
If FB could access their routers via the OOB modems, then this outage would have been over an hour after it started (assuming the hack was on their devices).
Let me know if there's anything that doesn't make sense and I'll try to answer your Q's, but it might be in the morning now.
Thanks for the explanation. Seems like an impossible event. It makes the most sense that the issue was with FB's routers, some huge screw-up; could be a rookie left in charge of some fix/upgrade while someone more senior is on vacation. Lots of change control protocols in place, but still things can happen. Pushed to all routers? My observation in IT these days is that although there are a lot of employees, everyone is very siloed and each person handles a very specific area with not a lot of cross-training or shared responsibilities.
I've worked in places where that kind of thing happens, even with change control.
However, they didn't have the resources FB has. I'm still not buying that this was an accident.
Cascade errors from unforeseen consequences of a change do happen, but not at every peering point FB has at the same time. They don't peer with the world with a couple of Cisco 12000s, after all.
Does seem unprofessional for such a large organization. Just as curious was the amount of time to fix it. But sometimes (more often than not) the person who caused the screw up is the one in charge of undoing it.
Well, since it did affect DNS: if the primary resolvers were out for more than 4 hours, then all the downstream servers would lose their cached records and take time to reload them, which is probably what is happening now.
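(Toy Python sketch of that record expiry, purely to show the timing effect; the name, address and TTL are made up.)

```python
import time

# A downstream resolver's cache: name -> (address, expiry time).
cache = {"www.example.com": ("3.3.3.3", time.time() + 300)}  # 5 minute TTL

def resolve(name):
    """Answer from cache while the TTL holds; after that we must re-query upstream."""
    entry = cache.get(name)
    if entry and entry[1] > time.time():
        return entry[0]
    # TTL expired: if the authoritative servers are still unreachable,
    # lookups just fail until they return and the caches refill.
    return None
```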
However, the BGP peerings were definitely screwed for several hours, and then the admins were unable to access the router console via out of band, so it's definitely fishy.
Is it possible that something went wrong with trying to use out of band? Would that not be for emergency use, so very rarely used? It could be something as stupid as unknown login credentials. Well, what we know for sure, given the millions of dollars lost, is that the pressure to fix it must have been intense. And then you usually have a battalion of managers breathing down some poor guy's neck and asking for an update every five minutes! Hehehe