We owe you an apology

By: Folke Lemaitre

April 11, 2013

Tags:
downtime

Over the last week we’ve had some of the longest downtime since we launched Engagor back in February 2011. First of all, we’d like to say we’re sorry. Real sorry. We take these issues very seriously, and we appreciate your patience and support as we work to resolve them. We still owe you an explanation of what happened and of how we plan to prevent something like this from happening again. I’ll try to explain this as simply as possible, but some parts will probably only be interesting for the more tech-savvy among you.

The technology behind Engagor

Engagor is a complex application consisting of multiple moving parts that work together in a distributed way. One of the core pieces is ElasticSearch, which is responsible for storing all your data and powering Engagor’s advanced analytics. ElasticSearch is the heart of Engagor, if you will. To accommodate our growth, we frequently add new servers to our ElasticSearch cluster, which can scale easily to thousands of servers.

Engagor is built from the ground up with performance and scalability in mind. We also try to make sure that if a server goes down, there is little to no impact for our customers. In the case of ElasticSearch, every piece of data is stored at least twice, so that if one of the nodes goes down, everything keeps working. If multiple nodes go down, only a portion of our customers is impacted.
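
For the technically curious, here is a simplified sketch of what that redundancy means in practice, written in Python against the ElasticSearch REST API. The node address, index name and settings values are made up for the example; the point is that with one replica configured, every shard of data lives on two different nodes.

```python
import json
import requests

# Hypothetical node address and index name, for illustration only.
ES_NODE = "http://localhost:9200"
index = "topic-12345-2013-04"

settings = {
    "settings": {
        "number_of_shards": 5,    # data is split into shards spread over the cluster
        "number_of_replicas": 1,  # every shard also gets one copy on another node
    }
}

# With one replica, each piece of data is stored on two different nodes,
# so a single node failure does not make the index unavailable.
resp = requests.put(
    "%s/%s" % (ES_NODE, index),
    data=json.dumps(settings),
    headers={"Content-Type": "application/json"},
)
print(resp.json())
```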

Upgrading and extending our server cluster

It all started on Friday evening (GMT) on the 28th of March. We decided to add another eight nodes to the ElasticSearch cluster, bringing the total number of servers in that pool to 20. The new servers had already been set up and properly configured to be brought into production. We also decided that before adding them, it was a good time to upgrade all the other servers to Ubuntu 12.04 LTS, the latest long-term supported version of the operating system we use. This was important since we were still running a two-year-old version that was no longer supported by the Ubuntu community.

So on that particular Friday, we scheduled a maintenance window of 4 hours to perform the upgrades. Based on initial tests on other servers, we had estimated that the whole procedure would take roughly 2 hours. The upgrades themselves went well, finishing in under 2 hours, but we ran into difficulty with two components we use:

  • The language detector failed to run on the upgraded machine. It took us some time, but we finally found that the issue was caused by some peculiar changes in the C++ compiler (g++).
  • At that time we still used Gearman for scheduling the older crawlers that fetch some of your data. The new version of Gearman returned a different response than we expected, which meant that instead of being scheduled exactly once, each job was scheduled thousands of times. Once we spotted this, it was fixed right away, and in the meantime we have stopped using Gearman altogether and moved everything over to RabbitMQ (see the sketch after this list).
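
To illustrate the direction we took, here is a small, simplified Python sketch of scheduling a crawl job through RabbitMQ. The queue name and job payload are made up and this is not our production code; the idea is simply that each job is published as exactly one persistent message, rather than depending on a scheduler’s response format.

```python
import json
import pika

# Hypothetical queue name and payload, for illustration only.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Durable queue: pending jobs survive a broker restart.
channel.queue_declare(queue="crawler_jobs", durable=True)

job = {"topic_id": 12345, "source": "twitter"}

# One call, one message: each crawl job is published exactly once,
# independent of how the scheduler replies (the Gearman pitfall).
channel.basic_publish(
    exchange="",
    routing_key="crawler_jobs",
    body=json.dumps(job),
    properties=pika.BasicProperties(delivery_mode=2),  # persistent message
)
connection.close()
```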

Six hours after the maintenance window started, we had finally upgraded everything successfully and restarted our cluster. Starting the whole cluster involves a number of steps:

  1. Make sure all databases, memcaches, etc. are running. This takes a couple of minutes.
  2. Starting all ElasticSearch nodes (a sketch of how these phases can be monitored follows this list):
    • In the first phase, every node in the cluster tries to detect every other node. While this process should only take a couple of minutes, it took up to an hour in our case. (More on this later)
    • In the second phase, all primary indices are loaded (an index is roughly what contains data from a certain timeframe for a certain topic on Engagor). This should take up to 15 minutes, but in reality took an hour.
    • In the third phase, all replicas are loaded. Replicas are copies of all the primary indices. This should be done in 30 minutes, but took 2 hours.
    • In the fourth and last phase, indices are redistributed over all nodes to make sure every server holds approximately the same amount of data. This last phase can take a while, but it quietly runs in the background and should have no impact on the performance of Engagor. Again, though, it did have an impact. (More on this later.)
  3. Starting all processors. Processors are responsible for processing data received from Twitter, Facebook, News, …. The processors match this data to topics you have set up, fire automation rules and send email reports when needed.
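
For the techie people: the ElasticSearch phases above can be followed from the outside through the cluster health API. Below is a simplified Python sketch of that kind of check; the node address is a placeholder and this is not our actual tooling, just an illustration of how the phases map to the reported cluster status.

```python
import time
import requests

# Hypothetical node address, for illustration only.
ES_NODE = "http://localhost:9200"

while True:
    health = requests.get(
        "%s/_cluster/health" % ES_NODE,
        params={"wait_for_status": "green", "timeout": "30s"},
    ).json()
    # "red"    -> primary indices still loading (second phase)
    # "yellow" -> primaries up, replicas still loading (third phase)
    # "green"  -> everything assigned; rebalancing may still run (fourth phase)
    print("status=%s, unassigned_shards=%s"
          % (health["status"], health["unassigned_shards"]))
    if health["status"] == "green":
        break
    time.sleep(10)
```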

The whole process, which should have been finished in under 4 hours, eventually took 12 hours to complete. At that time, the biggest issue we had was getting ElasticSearch back up and running. In the meantime, we have made changes that greatly reduce the time it takes to do a full ElasticSearch cluster restart. We are also now working closely with the ElasticSearch team to further optimize our setup.
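
To give a rough idea of the kind of knobs involved (these are illustrative values, not necessarily the exact settings we changed), ElasticSearch exposes dynamic cluster settings that control how aggressively shards are recovered after a restart. A minimal Python sketch:

```python
import json
import requests

# Hypothetical node address and values, shown only as an example of the kind
# of recovery tuning that shortens a full cluster restart.
ES_NODE = "http://localhost:9200"

settings = {
    "transient": {
        # Allow more shards to recover on a node at the same time.
        "cluster.routing.allocation.node_concurrent_recoveries": 4,
        # Raise the per-node recovery bandwidth limit.
        "indices.recovery.max_bytes_per_sec": "100mb",
    }
}

resp = requests.put(
    "%s/_cluster/settings" % ES_NODE,
    data=json.dumps(settings),
    headers={"Content-Type": "application/json"},
)
print(resp.json())
```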

After all of this, we wrongly assumed our problems were over and that we could get back to our day-to-day business of making Engagor more awesome than ever. Shortly after the upgrades, however, we started to experience very strange hiccups on our servers. During the daytime (GMT), we would occasionally get hit by something that resembled a large DDoS attack against our infrastructure. What we noticed:

  • Within a couple of minutes, network usage (bandwidth) on our servers increased 100-fold
  • A huge amount of data was being read from all our servers
  • CPU usage skyrocketed (for the techie people: within minutes we saw load averages go from less than 1 to over 1,000 on a quad-core machine; a small sketch putting that number in perspective follows this list)
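
To put that load average in perspective: on a healthy machine the load average should stay close to the number of CPU cores. The tiny Python check below uses a made-up alert threshold and is not our monitoring setup; it just shows why a value of 1,000 on four cores is alarming.

```python
import os

# Hypothetical check: compare the 1-minute load average to the core count.
cores = os.cpu_count() or 1
load_1min, load_5min, load_15min = os.getloadavg()

if load_1min > cores * 10:  # made-up alert threshold for the example
    print("load spike: %.1f on %d cores" % (load_1min, cores))
else:
    print("load ok: %.1f on %d cores" % (load_1min, cores))
```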

In most cases, everything went back to normal as quickly as it started. But during the last week, and especially on April 9th, it took a lot longer, and we even needed to physically restart some servers, resulting in a full cluster restart and the associated downtime.

Nailing it down

Since this problem impacted us only very infrequently (though when it did, it hit us hard) and only for a short time, it was very difficult to find the root cause. But on April 9th we finally found out what was happening and were able to create a fix.

The issue was caused by a single topic belonging to one of our customers. It was a newly created topic, but no searches or profiles had been set up yet, so it contained no data at all. This also meant that there was not yet an index for that topic in ElasticSearch. During the upgrades two weeks ago, we had also upgraded to the latest stable release of ElasticSearch.

We discovered a big issue with this new release when performing a search on a non-existing index. While such a search effectively returned no results, it did not do so immediately. Instead, it searched through and fetched all data ever stored on Engagor, meaning data from every topic since 2011. Only after fetching all of that did ElasticSearch notice that no results should be sent back.

Every time this particular customer logged in to Engagor, all data we had ever processed and stored was requested at once. We were finally able to fix this and recover completely.
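
Without going into the details of our actual fix, the general idea of guarding against this behaviour looks roughly like the Python sketch below (against the ElasticSearch REST API, with a made-up node address and index name): check whether the topic’s index exists before searching it, and return an empty result immediately when it does not.

```python
import requests

# Hypothetical node address and index name, for illustration only.
ES_NODE = "http://localhost:9200"
index = "topic-12345"

# A HEAD request on an index returns 200 if it exists and 404 if it does not.
exists = requests.head("%s/%s" % (ES_NODE, index)).status_code == 200

if exists:
    results = requests.get("%s/%s/_search" % (ES_NODE, index),
                           params={"q": "*"}).json()
else:
    # No index yet means no data yet: answer with an empty result right away
    # instead of letting the cluster scan everything it has ever stored.
    results = {"hits": {"total": 0, "hits": []}}
```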

What we learned from all of this

The biggest thing we’ve learned is that we tried to do too many things at once, wrongly assuming that the upgrades would not affect us. To make sure we have fewer issues during maintenance, from now on we will only make changes like these in smaller batches.

Again, I would like to personally apologize for all of this, and to thank all our customers for their support and for being so patient.

Folke, Engagor CEO