On the morning of 12 February 2020, a network failure caused a major disruption that affected all of our services and servers, including our telephone exchange.
First of all, we would like to take this opportunity to express our regret about the problems and how they affected you as a customer. We know you rely on us to keep your website and services online.
Yesterday we did not live up to your expectations of us. We apologise for that.
Is this report too technical? We’ve also collected general information and FAQs for those who aren’t technically inclined.
Table of contents
- Overview
- About the network problem that affected all customers
- About the consequential issue affecting Managed Server customers
- Actions and lessons learned
1. Overview
At approximately 10:30 on 12 February 2020, a major network failure occurred that affected all of our services and servers, including our own website, support and telephone exchange.
At approximately 14:00, we got the network back up and running, and the majority of our services and servers returned to normal operation.
Just after services started to come back online, we discovered a consequential error with the Managed Server storage clusters. Some servers ended up with corrupt databases that required manual repair, server by server. By 17:45 the database issues were fixed and the outage was fully resolved.
Below you can read more about the various problems and what we are doing to prevent this and improve in the future.
2. About the network problem that affected all customers
The network problem was caused by a software bug in one of the switches in our network.
In our network we use the Spanning Tree Protocol (STP). It is used, among other things, to prevent loops in the network and to build redundancy in the form of multiple uplinks between switches.
The bug caused one switch to suddenly declare itself the STP root, even though it was not supposed to be. With two switches claiming to be root in the same network, a loop arises, because the rest of the switches keep forwarding towards the “real” root switch as intended.
The loop, in turn, causes traffic to circulate between the switches internally instead of reaching the servers or the internet. The intense traffic generated during the loop quickly overloads the switches, and the network goes down.
In more technical terms, this caused our core routers, which are uplinked to the rest of the data center, to believe there was a network loop somewhere and to start blocking VLAN traffic to the rest of the environment. At the same time, the VLAN was constantly being blocked and unblocked on the uplink, creating extreme packet loss.
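To make the root election a little more concrete, here is a heavily simplified Python sketch (illustrative only, and not our actual switch software): in normal operation all switches agree that the switch with the best (lowest) bridge ID is the root, but a switch that wrongly insists on being root breaks that agreement, and the topology never converges.

```python
# Heavily simplified sketch of STP root election (illustrative only;
# real STP exchanges BPDUs and uses priorities, timers and port states).

from dataclasses import dataclass

@dataclass
class Switch:
    name: str
    bridge_id: int        # lower bridge ID wins the root election
    buggy: bool = False   # the bug: the switch insists it is the root

    def elected_root(self, all_switches):
        if self.buggy:
            return self.name                       # wrongly claims root
        return min(all_switches, key=lambda s: s.bridge_id).name

switches = [
    Switch("core-1", bridge_id=100),               # intended root (lowest ID)
    Switch("access-7", bridge_id=500, buggy=True),
    Switch("access-9", bridge_id=600),
]

for sw in switches:
    print(f"{sw.name} believes the root is {sw.elected_root(switches)}")

# core-1 and access-9 agree on core-1, but access-7 claims itself.
# With two "roots" in the same network the port states never settle,
# and frames can start looping between the switches.
```

The switch names and bridge IDs above are made up for the example; the point is only to show how a single switch that ignores the election can put the whole network into a looping state.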
This initially made us suspect that the problem was in our core routers, as they were constantly switching between active and passive. We therefore tried turning off spanning tree on the core routers. This did not solve the problem, but it allowed us to rule out the core routers as the source of the fault.
This took a relatively long time because, due to the network error, we could not reach our syslog to determine exactly where the fault was (see the actions below).
At this point, the technicians on site in the data hall were able to confirm the loop and began going through the rest of the switches, starting with those we had recently had minor problems with or recently worked on.
We were then able to locate the offending switch and reboot it (see further actions below). After the reboot, the network was up and running normally at around 12:30, and the first of our services started to come back online.
Once the network problem was solved and outbound communication was restored, our monitoring showed the hosting servers, and the customer websites on them, becoming reachable again as expected.
At the same time, it became increasingly clear that not all services were recovering as they should.
3. About the consequential issue affecting Managed Server customers
Our storage clusters for Managed Server, hosting and Do-It-Yourself servers are separated to provide optimal storage volume, speed and fault tolerance.
Problem 1: Lost connection to storage
During the network issues, some of the Managed Server hypervisor nodes periodically lost connectivity to the underlying storage. The nodes are responsible for performing database lookups and displaying customer websites as quickly as possible. This meant that files could not be saved properly.
When some of our Managed Servers attempted to write data, they experienced problems with the operating system and in some cases with database writes.
Unlike our web hosting servers, which did not have these problems, a very large number of Managed Servers therefore needed manual attention at the same time.
Many of the servers where the operating system hung due to the storage problem were quickly brought back up with a reboot. However, the servers that experienced problems with database writes required additional manual intervention.
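To illustrate the kind of symptom the nodes ran into, here is a minimal sketch of a storage write-health check (the mount path is a placeholder and this is not our actual monitoring): when the underlying storage disappears or goes read-only, even a trivial write fails, and if the storage hangs, the check hangs with it, much like the operating systems that needed a reboot.

```python
# Minimal sketch of a storage write-health check (illustrative; the mount
# path below is a placeholder, not our actual configuration).

import os
import tempfile

def storage_writable(mount_point: str) -> bool:
    """Try to create, sync and remove a small file on the given mount."""
    try:
        fd, path = tempfile.mkstemp(dir=mount_point, prefix=".healthcheck-")
        try:
            os.write(fd, b"ok")
            os.fsync(fd)           # force the write down to the storage layer
        finally:
            os.close(fd)
            os.remove(path)
        return True
    except OSError:                # e.g. read-only filesystem or I/O error
        return False

if __name__ == "__main__":
    mount = "/var/lib/guest-volumes"   # hypothetical hypervisor storage mount
    print("writable" if storage_writable(mount) else "NOT writable")
```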
Problem 2: A few corrupt databases
The databases, which in an e-commerce context often contain order information, customer details and wish lists, are stored on disk in the same way as normal files.
When the server tries to write a change to disk but is unsuccessful, the database can become corrupt in some cases.
You can compare it to saving something to a USB flash drive and unplugging the drive before the save has finished, or tearing the paper out of the printer before it has finished printing. It is much the same with a database.
Unluckily, some servers did end up with corrupt databases, although this was limited to only a few of our Managed Servers.
The database problems we saw were solvable in two ways:
- Either we could restore the latest backup. On Managed Server, a restore point is saved every day (usually at night). This would have meant that the day’s changes, including today’s order information for e-merchants, would have been lost.
- Alternatively, we could salvage as much data as possible from the corrupted databases, reload it into a new database instance and then repair any broken tables. This is a manual process that, depending on the size of the databases and how much can be retrieved, takes anywhere from 15 to 45 minutes per server.
We quickly decided that the first option was not reasonable: better a little more downtime than the loss of business-critical data. So we immediately set about repairing the corrupt databases as quickly as we could, prioritising customers with SLA Pro.
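For those curious what the second option looks like in practice, here is a rough sketch of the salvage-and-reload steps, assuming a MySQL/MariaDB setup; the database names, the dump path and the exact commands are placeholders rather than our precise procedure, and how much can be repaired depends on the storage engine involved.

```python
# Rough sketch of the "salvage and reload" approach, assuming MySQL/MariaDB.
# Database names and the dump path are placeholders, not our real setup.

import subprocess

OLD_DB = "shop_db"            # corrupted database (placeholder name)
NEW_DB = "shop_db_restored"   # fresh database to reload into (placeholder)
DUMP   = "/tmp/shop_db_salvage.sql"

def run(cmd, **kwargs):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True, **kwargs)

# 1. Get as much data as possible out of the damaged database.
#    --force keeps dumping even if individual tables return errors.
with open(DUMP, "w") as out:
    run(["mysqldump", "--force", "--single-transaction", OLD_DB], stdout=out)

# 2. Reload the salvaged data into a new database.
run(["mysqladmin", "create", NEW_DB])
with open(DUMP) as dump:
    run(["mysql", NEW_DB], stdin=dump)

# 3. Check and repair any tables that are still flagged as broken.
run(["mysqlcheck", "--repair", NEW_DB])
```

Multiply the 15 to 45 minutes this takes by the number of affected servers and it becomes clear why this part of the outage lasted into the late afternoon.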
4. Actions and lessons learned
Ideally, of course, we would have liked to avoid downtime altogether. We have high ambitions for uptime across all our services. In 2019, our uptime was 99.998%, which corresponds to roughly ten minutes of downtime over the whole year, or less than a minute per month. Our goal is that you as a customer should never have to experience major disruptions to your service (our last major one was on 2018-12-04).
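For reference, this is the back-of-the-envelope calculation behind those figures:

```python
# Back-of-the-envelope: how much downtime does a given uptime figure allow?

uptime = 0.99998                      # 99.998% uptime in 2019

minutes_per_year = 365 * 24 * 60      # 525,600 minutes in a year
downtime_year = (1 - uptime) * minutes_per_year
downtime_month = downtime_year / 12

print(f"Downtime per year:  {downtime_year:.1f} minutes")   # ~10.5 minutes
print(f"Downtime per month: {downtime_month:.1f} minutes")  # ~0.9 minutes
```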
That’s why we work systematically to improve and upgrade our services. We invest heavily in our network, storage and virtualization nodes to deliver the best possible service.
A major outage like this is another opportunity to review procedures and analyse what we can do better going forward. We would also like to thank you for the comments and suggestions we received from you, our customers, during the day, including via Facebook.
Decommissioning of the faulty switch
We have already removed the faulty switch to prevent the same bug from occurring again.
Closer communication
On our status page (more about it below), which is hosted on an external network, you can always find information about current outages. That was the case this time as well.
During the day, we posted about one update every half hour and were available with personal responses via social media such as on our Facebook page.
It is a challenge to balance waiting for more information and confirmed troubleshooting results against reporting progress and updating more often. In the future, we want to try to report even more frequently, even when we have no news to share.
At 14:10, we published an overly positive forecast for Managed Server, as a few servers were still affected by the database problem. The follow-up update on the database problem then came too late (15:30). Here we should have communicated more often and more accurately.
Fallback site
It is very rare that all our servers and services go down at the same time. When this happens, our own website also goes down. We understand that this causes extra concern and raises questions, especially if you are not familiar with our status page.
To prevent our site from being unreachable, we are looking at an external “fallback” solution that we can activate if something like this happens again. This way you will always be able to access our site and get up-to-date information.
External syslog
In the coming days we will set up a syslog that is not affected by network problems in either of our two data halls. In this incident, it would have helped us find the problem faster when we could not reach the network.
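In practice the network equipment itself will be pointed at the external syslog host, but as a minimal illustration of the idea, here is a Python sketch that forwards events to a remote syslog collector; the hostname and port are placeholders, not our actual setup.

```python
# Minimal sketch: sending log events to an external syslog collector.
# The hostname and port below are placeholders, not our actual setup.

import logging
import logging.handlers

logger = logging.getLogger("network")
logger.setLevel(logging.INFO)

# Standard syslog over UDP to a collector outside the affected network.
handler = logging.handlers.SysLogHandler(address=("syslog.example.net", 514))
handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.warning("spanning-tree topology change detected on access-7")
```

The point is that the log destination lives outside the network that can fail, so the trail of events remains readable while the fault is ongoing.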
Better monitoring of the network
With better and clearer visual monitoring of the network, we would have been able to identify exactly where the problem was and act more quickly. This is work that we had already started and are continuing.
Better status page
During the outage, our status page still showed “Operational” and green for all services, even though we had posted an outage message. In addition, a text incorrectly stated that only “some” services had problems, when in fact all of them did.
It is of course not acceptable that the status page does not show fully correct information. We will therefore be looking at a new and improved solution for status information in the coming month.
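One way to make the status page harder to get wrong is to derive the overall status from real health checks instead of setting it by hand. Below is a minimal sketch of that idea; the service names and URLs are placeholders, not our real endpoints.

```python
# Minimal sketch: derive the overall status from real health checks instead
# of setting it manually. The URLs below are placeholders, not real endpoints.

import urllib.error
import urllib.request

SERVICES = {
    "Web hosting":    "https://health.example.net/webhosting",
    "Managed Server": "https://health.example.net/managed-server",
    "Email":          "https://health.example.net/email",
}

def is_up(url: str, timeout: float = 5.0) -> bool:
    """A service counts as up if its health endpoint answers 200 in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

results = {name: is_up(url) for name, url in SERVICES.items()}
down = [name for name, ok in results.items() if not ok]

if not down:
    print("Operational")
elif len(down) < len(results):
    print("Partial outage: " + ", ".join(down))
else:
    print("Major outage: all services affected")
```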