An Apology To Our Customers

By: Folke Lemaitre

December 29, 2014

Last week we experienced an extended outage that significantly impacted your business and your customers. We’re sorry that we caused this downtime and for the impact it had on you. We strive to maintain a solid, fast and real-time social engagement platform, and we let you down when we did not deliver. Please know that we have worked, and will continue to work, to make our systems better and more resilient so that events like this become less and less likely in the future.

Here is some additional detail about what happened and steps we are taking to mitigate future outages of a similar nature.

What happened?

We originally scheduled a maintenance window on December 23rd from 3am CET to 5am CET to upgrade one of the core components of Engagor. Due to unforeseen changes in the new release, we had to put the Engagor application in maintenance mode for an extended period of time. During part of this time, the Engagor web interface was completely offline. In addition, once the web interface was back online, none of our customers were able to receive or respond to new incoming messages.

Why did it happen?

The platform component responsible for storing, searching and analysing social data runs on a big cluster of beefy servers. Our Sysops team continuously monitors this cluster and plans capacity upgrades well in advance when needed. In early November we decided to upgrade this cluster on December 23rd to provide the extra capacity needed for new customers. Unfortunately, due to some recent changes in the way we calculate trending topics (currently available in beta), we ran into capacity issues in early December, much sooner than expected. On December 4th we performed emergency maintenance to resolve these issues quickly. During that maintenance, we also carried out part of the changes we had initially planned for December 23rd. What remained for December 23rd was an upgrade of that component that would improve some specific insights widgets and the overall performance of the Engagor application.

Upgrading that component requires a full cluster restart that normally takes just under 1 hour. Unfortunately, we were hit pretty badly by two issues with the new release that were still unknown at the time. It took us some time to figure out what was happening. Before bringing the real-time aspect of Engagor back online, we wanted to fully understand the problem to make sure we would not cause any data loss (and no data was lost).

On December 24th at 4am CET, Engagor was fully operational again, albeit in degraded mode, but with little impact on most of our customers. On December 25th we also implemented some workarounds for the issues we encountered. All is now working 100% again.

How did we respond and recover?

Initially we attempted to downgrade the cluster to the previous release. This failed because the new release also used a different data format. As a result, part of the existing data was no longer recognised, which would have led to the loss of some of our historical data.

Since that did not work, we quickly shut down the whole cluster to prevent any real data loss. We eventually found the reason the cluster would not start with the new release and were able to bring the whole cluster back up. Everything worked, but much slower than normal, whereas we had expected the opposite. Part of the cluster management module is responsible for sharing the state of all Engagor topics with all servers in the cluster. Normally that state only changes when we add new fields or change any fields of a social message. This happens very infrequently, once a month at most. One of the issues we encountered was that the new release was detecting spurious changes to social message fields all the time, which resulted in a lot of overhead from syncing the cluster state with all servers. This prevented the cluster from performing the real work it should have been focusing on.

Once we fully understood this issue, we were able to work towards a final solution that required us to rewrite the state for all topics across the cluster. This ran in the background for 14 hours but did not cause any overhead for our customers. On December 25th, those upgrades finished successfully and we can now confidently say that the cluster is running at full speed again.

How do we prevent similar issues from occurring again?

Engagor is a very complex application that requires a very powerful, but also complex, data backend to run on. We are very happy that advances in big data computing have led to powerful data warehouses that we can use at Engagor. The downside of allowing complex queries and filtering anywhere in Engagor is that it is not always easy to fully assess the impact of launching a new feature or upgrading core components.

  • First of all, the software bugs uncovered during the incident were fixed and automated tests were included to prevent regression.
  • Our Sysops team is continuously looking into ways to optimize existing functionality and doing regression tests to gauge the performance needs of new features.
  • We’re making some additional changes to our regression tests that should give us a better indication of the additional raw power needed.
  • Upgrading the Engagor core data component has led to issues in the past. We are reviewing our internal processes for upgrading that component to minimize the impact on our customers.
  • We are researching alternative measures we can take to have at least recent and real-time data still available in the event of maintenance or a major outage.