(Archived) Downtime yesterday

A question for Dave, I guess: EN syncing was down for me yesterday for a few hours (I'm in the UK). The very helpful EN status page tells me that at 8:10am PDT the services on one server were restarted; however, by then the EN sync service had already been down for a number of hours.

So a couple of questions...

1. Did all the fallbacks that were implemented after the last downtime ensure that EN staff knew about this as soon as it occurred? The length of the downtime, the fairly simple resolution (a restart), and the time of the resolution imply to me that it was discovered when someone got to the office.

2. Is there any way for us Europeans (8 hours ahead of you) to notify you of a possible problem? I guess that's me volunteering....


In this case, the automated monitoring systems didn't page us to wake us up because the servers never actually stopped working ... the database was running, and the application server was running. The problem was that the database was mysteriously backlogged and unable to process new requests, so the application server wasn't getting anything out of the database when it asked for your notes.

Our monitoring system checks around 30-40 things on each of these servers, but it didn't have a check for this exact scenario, since we haven't seen this before on any of our 39 servers. The server that choked up (shard 1) was our very first server, which has been running for 2.5 years without exhibiting this behavior before.

As soon as some of our people were awake and reading Twitter and the forums, they noticed the reports about this problem and paged the IT crew to take care of it.

We do have some folks 8+ hours ahead in Moscow with permission to page the IT folks if they see a problem, but they didn't happen to catch this one before the US people noticed it.

So we're adding another set of checks on each server to watch for cases where the servers are all up and running, but the database isn't responding to non-trivial requests within 10 seconds. This will send an email to the IT folks the first time it happens on a server, and send pages if it happens for two checks in a row.
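For anyone curious what a check like that might look like, here is a rough Python sketch. It is not Evernote's actual monitoring code. The names `database_responds` and `ShardAlertState` are made up for illustration, `query` stands in for whatever non-trivial request the real check would run, and the sketch assumes that a successful check resets the counter and that every failure after the first in a row results in a page.

```python
import concurrent.futures

TIMEOUT_SECONDS = 10  # the threshold mentioned in the post


def database_responds(query, timeout=TIMEOUT_SECONDS):
    """Return True if query() finishes within `timeout` seconds.

    `query` is a hypothetical zero-argument callable that runs a
    non-trivial request against the database.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(query)
    try:
        future.result(timeout=timeout)
        return True
    except Exception:
        # A timeout or an error from the query both count as a failed check.
        return False
    finally:
        # Do not block here waiting for a stuck query to finish.
        pool.shutdown(wait=False)


class ShardAlertState:
    """Tracks consecutive failed checks for one server (hypothetical)."""

    def __init__(self):
        self.consecutive_failures = 0

    def record(self, ok):
        """Record one check result and return "none", "email", or "page"."""
        if ok:
            # Assumption: a healthy check clears the failure streak.
            self.consecutive_failures = 0
            return "none"
        self.consecutive_failures += 1
        # First failure in a row: email. Second or later: page.
        return "email" if self.consecutive_failures == 1 else "page"
```

A scheduler would call `state.record(database_responds(run_query))` once per check interval for each server and act on the returned string. Keeping the escalation state separate from the probe means a server that is "up" but backlogged still produces failed checks, which is the case the original monitoring missed.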

(We actually had a different server [shard 36a] crash at 4:30am on the same night due to dual hard drive failures, but the redundancy and monitoring systems worked perfectly on that "shard" so no one noticed that problem.)


This topic is now archived and is closed to further replies.