Undiscovered Software Bug Triggered Massive Internet Outages Worldwide


A massive outage hit large parts of the internet on Tuesday. Several major websites were affected by the outage, which originated at Fastly, a key content delivery network (CDN) that supports many major companies and sites online.

Websites affected included huge internet cornerstones like Reddit, Twitch, Amazon, PayPal, Spotify and CNN amongst many others. Many websites weren’t taken off the internet completely but were affected in other ways.

PayPal users complained that payments weren’t being processed, while Twitter users were unable to post certain emojis.

The outage lasted about an hour. Users navigating to affected websites would be met with a familiar error message: “Error 503 Service Unavailable.” Fastly quickly identified the issue and immediately set to work on a fix.

What Caused the Outage

Fastly has summarised the service interruption in a post on its official blog. According to the company, a customer pushed a valid configuration change, probably within a website or internet service. This change triggered a previously undiscovered bug, which caused 85% of Fastly’s network to return errors.

Fastly is a high-level website and application hosting service that many large international corporations use to deliver their content to millions of users worldwide.

The company says that once the initial effects were mitigated, its developers turned to fixing the underlying bug. A permanent fix has been implemented, and Fastly hopes the problem has been eliminated.

“We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change. We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95% of our network was operating as normal,” says Nick Rowell, Senior VP of Engineering and Infrastructure at Fastly.

“This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them.”

A timeline of events (all times in UTC) shows how quickly Fastly managed to handle the devastating outage:

09:47 Initial onset of global disruption
09:48 Global disruption identified by Fastly monitoring
09:58 Status post is published
10:27 Fastly Engineering identified the customer configuration
10:36 Impacted services began to recover
11:00 Majority of services recovered
12:35 Incident mitigated
12:44 Status post resolved
17:25 Bug fix deployment began
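
The elapsed times behind Fastly's "within one minute" and "within 49 minutes" claims can be checked directly against the timeline. The short Python sketch below does that arithmetic; the names `TIMELINE` and `minutes_since_onset` are made up for illustration, and it assumes all entries fall on the same day.

```python
from datetime import datetime

# Timeline entries copied from the article (times as published).
TIMELINE = [
    ("09:47", "Initial onset of global disruption"),
    ("09:48", "Global disruption identified by Fastly monitoring"),
    ("09:58", "Status post is published"),
    ("10:27", "Fastly Engineering identified the customer configuration"),
    ("10:36", "Impacted services began to recover"),
    ("11:00", "Majority of services recovered"),
    ("12:35", "Incident mitigated"),
    ("12:44", "Status post resolved"),
    ("17:25", "Bug fix deployment began"),
]


def minutes_since_onset(timeline):
    """Return (minutes elapsed since the first entry, label) for each entry.

    Assumes every time is on the same calendar day (illustrative only).
    """
    def parse(hhmm):
        return datetime.strptime(hhmm, "%H:%M")

    start = parse(timeline[0][0])
    return [
        (int((parse(hhmm) - start).total_seconds() // 60), label)
        for hhmm, label in timeline
    ]


for minutes, label in minutes_since_onset(TIMELINE):
    print(f"+{minutes:>3} min  {label}")
```

On these figures, monitoring flagged the disruption 1 minute after onset, and services began to recover 49 minutes after onset, which lines up with the numbers in Fastly's statement.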

Fastly is currently conducting a post-mortem of the incident and of the processes it followed to resolve it.

“We have been — and will continue to — innovate and invest in fundamental changes to the safety of our underlying platforms,” Rowell concludes.

By Luis Monzon
Follow IT News Africa on Twitter