(Archived) Service Outage & General Data Security ?s

basit.mustafa · July 20, 2010

Why hasn't Evernote Tweeted or FB'ed about what is going on? It has been going on for several hours now.

I do not find it acceptable that such an outage occurs without any word from the team.

That being said, I know desktop clients work, but a major portion of my use cases for Evernote involve the sync functions.

Also, I know problems happen, but why can't there be communication about it?

Finally, what kind of data security measures are taken to ensure a ma.gnolia-style disaster does not occur? I back up my Evernote DB to Time Machine and export .enex to my own redundant data stores, but cloud computing (properly executed) should make those kinds of practices on my part totally obsolete/unnecessary and make me seem like a paranoid freak, no (although still a good idea for the uber-paranoid, for sure*)?

*For example, I could envision a case where the "live" copy of server data for some reason decided I had deleted all my notes (either because of a malicious attack or some kind of corruption) and directed all my clients to delete all local copies of data...an offline backup is a good idea for this even if EN can assure 100% reliability.

engberg · July 20, 2010

At a little after 2am this morning, one of the networking switches in our data center went haywire and started telling other switches to do the wrong thing with their traffic. This led to a cascade that affected about half of the switches in the data center, and basically prevented data from reaching servers connected to those switches. (If the original switch had just shut down completely, there wouldn't have been a problem, but the cascade caused the network redundancy to fail.)

The other half of our servers were on unaffected networking equipment, which meant that approximately half of users could sync their clients, but the other half could not. Both web servers for the corporate web "site" were also affected, as were the blog and forum servers, which meant that those services were also unavailable.

Our internal monitoring infrastructure checks several thousand different things every minute or so, and it detected the problem immediately. Unfortunately, when it tried to send the emails and SMS messages to wake up the IT group, it failed due to an unexpected networking dependency. (I.e. we thought all of the monitoring networking was redundant, but the DNS lookups required to send emails unexpectedly failed due to the multi-switch problems.)
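
To illustrate the failure mode described above (alerts that could not be sent because one delivery path silently depended on DNS), here is a minimal sketch of a notifier that tries several independent channels in order. The channel names and the simulated senders are hypothetical and only stand in for real email/SMS gateways; this is not a description of Evernote's actual monitoring code.

```python
from typing import Callable, List, Sequence, Tuple


def notify_with_fallback(
    message: str,
    channels: Sequence[Tuple[str, Callable[[str], None]]],
) -> str:
    """Try each channel in order and return the name of the first that succeeds."""
    failures: List[str] = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:  # any failure means: move on to the next path
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all notification channels failed: " + "; ".join(failures))


# Simulated channels (assumptions for illustration only).
delivered: List[str] = []


def email_by_hostname(message: str) -> None:
    # Stands in for a mail path that needs a DNS lookup, which fails here.
    raise OSError("DNS lookup failed")


def sms_by_direct_modem(message: str) -> None:
    # Stands in for a path with no dependency on the affected network.
    delivered.append(message)


used = notify_with_fallback(
    "switch cascade detected",
    [("email", email_by_hostname), ("sms-modem", sms_by_direct_modem)],
)
print(used)
```

The point of the sketch is that the fallback only helps if the second channel really shares no dependency (DNS, switches, uplinks) with the first.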

We are paying for an external company (AlertSite) to also provide a more basic external sanity check of Evernote's servers. They hit a particular URL at Evernote every 5 minutes and then send pages to the IT folks if that fails. Unfortunately, since half of Evernote was running fine, they didn't detect any problems, so that redundant check didn't help.
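
The blind spot described here, where a single-URL probe keeps passing while half the servers are unreachable, can be sketched as follows. The shard URLs and the simulated probe are hypothetical; the post does not say which URL AlertSite checked or how the servers were addressed.

```python
from typing import Callable, Iterable, List


def failing_endpoints(urls: Iterable[str], probe: Callable[[str], bool]) -> List[str]:
    """Return the URLs whose probe returned False or raised an exception."""
    failed = []
    for url in urls:
        try:
            healthy = probe(url)
        except Exception:
            healthy = False  # an unreachable server counts as a failure
        if not healthy:
            failed.append(url)
    return failed


# Hypothetical per-shard health URLs (not taken from the post).
SHARD_URLS = [
    "https://shard1.example.com/health",
    "https://shard2.example.com/health",
]

# Simulated outage: shard1 sits behind a healthy switch, shard2 does not.
REACHABLE = {SHARD_URLS[0]: True, SHARD_URLS[1]: False}


def simulated_probe(url: str) -> bool:
    return REACHABLE[url]


single_url_check = failing_endpoints(SHARD_URLS[:1], simulated_probe)
all_shards_check = failing_endpoints(SHARD_URLS, simulated_probe)
print(single_url_check)  # the single-URL check sees nothing wrong
print(all_shards_check)  # the per-shard check reports the unreachable shard
```

In a real deployment the `probe` argument would wrap an HTTP request with a timeout; it is injected here so the logic can be exercised without a network.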

That combination of failures occurred shortly after our last IT person went to bed, and the lack of SMS pages meant that we didn't know about the issues until the first of us woke up and checked email at 7am. Then we rushed over to the data center and fixed the misbehaving network switch. Once we understood the problem, we put something on Twitter: http://twitter.com/evernote/status/19009058746

I.e. the reason we didn't say anything wasn't that we were being secretive, it was that we were asleep and unaware of the problem (unfortunately).

We sincerely apologize for the outage, and for any inconvenience it may have caused you.

We are making a number of changes to prevent this type of error in the future, and to make sure that we are woken up more reliably if a problem happens to occur. We've added another layer of external monitoring that should detect whether our internal monitoring service is unavailable, and we're restructuring the internal monitoring system's notification paths to make sure that it can tell us about problems. And we're making sure that a few of our overseas folks have a way to page us if they're sure there's an emergency while we're all sleeping.
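
The "monitor the monitor" idea above is often built as a heartbeat check: the internal monitoring system checks in periodically, and an external watcher raises an alarm when the check-ins stop. A minimal sketch follows, assuming a five-minute silence threshold; the threshold and the function name are illustrative assumptions, not details from the post.

```python
from datetime import datetime, timedelta
from typing import Optional


def monitor_is_silent(
    last_heartbeat: Optional[datetime],
    now: datetime,
    max_silence: timedelta = timedelta(minutes=5),  # assumed threshold
) -> bool:
    """True if the internal monitor has not checked in within max_silence."""
    if last_heartbeat is None:
        return True  # never heard from it at all
    return now - last_heartbeat > max_silence


# Example: the last heartbeat arrived 25 minutes before the external check ran.
last_seen = datetime(2010, 7, 20, 2, 5)
checked_at = datetime(2010, 7, 20, 2, 30)
silent = monitor_is_silent(last_seen, checked_at)
recent = monitor_is_silent(checked_at - timedelta(minutes=1), checked_at)
print(silent, recent)
```

The design choice is that silence itself is the alarm condition, so the external watcher still fires when the internal system cannot send anything at all.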

Regarding data security:

This error was strictly a networking problem, so was unrelated to your notes, data, etc. Your data is stored on at least 4 different online hard drives (one pair in a "primary" server and another pair in a "secondary" server), and we perform nightly snapshots and backups onto another set of storage, with weekly offsite backup rotation. This means that we have 5x data redundancy for daily data and 6x for weekly backups. All drives are "enterprise" rated Seagate disks.
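
One way to read the 5x and 6x figures is as a simple count of copies. The breakdown below is an interpretation of the paragraph above (one copy per drive, one nightly backup copy, one weekly offsite copy), not a statement of how Evernote itself counts them.

```python
# Online copies: a pair of drives in each of the primary and secondary servers.
online_copies = 2 * 2

# Assumed: the nightly snapshot/backup adds one more copy on separate storage.
nightly_backup_copies = 1

# Assumed: the weekly offsite rotation adds one further copy.
weekly_offsite_copies = 1

daily_redundancy = online_copies + nightly_backup_copies
weekly_redundancy = daily_redundancy + weekly_offsite_copies
print(daily_redundancy, weekly_redundancy)
```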

We save a fair amount of money on our infrastructure by designing for scalability (200,000 users per pair of servers), but we are intentionally over-paying for data reliability and redundancy to make sure that we can be a permanent place for your memories.

C.Noize · July 20, 2010

Maybe I should mention this. I don't know if you know this, but a few weeks ago the service and your website also went down for 1-2 hours. At least here from Germany I couldn't access your servers. I can't remember the date and time exactly, but it must have been around two or maybe three weeks ago between 08:00 and 13:00 GMT.

TheGurkha · July 20, 2010

Ouch.

JustDave · July 20, 2010

Yikes. That explains why I was still able to synch but couldn't get to the forum or blog. Thanks for the details and good luck redefining all the failure modes. :mrgreen:

basit.mustafa · July 21, 2010

Dave,

Thanks for the openness and debrief, I love the product and service and am sorry to hear you guys have had a stressful day - when it rains.

It sounds like the right factors aligned to hit the right vulnerable spots of the infrastructure and processes...thanks again for being open and straight on what happened and what EN does to try and avoid such problems!

Basit

ShellBryson · July 21, 2010

It's good to see an in-depth response here, but it still feels 'somewhat secretive'. Given how many 100s of complaints on twitter yesterday, and the only public response from you guys is 'it's fixed'. No link to this post, which actually explains what happened.

Given how important social media is becoming, companies can easily live and die within realms like Twitter - a bad rep goes a LONG way when you have like-minded users feeding off of each other's complaints.

This post should have been linked to twitter and facebook right away: People are far more forgiving when you hold your hands up and admit "hey, something went wrong, we did all this to fix it..." than "it's fixed" after 6 hours of outage.

It was quite embarrassing for me: I'd just pushed this product (which I love) with a number of colleagues, and then I start getting direct messages over Twitter asking why the service was down.

The lack of transparency speaks volumes, and makes me wonder how seriously you *really* take your reputation with paying customers (like me).

It's a real shame I had to dig up this post in your forum to find out what really happened.

BurgersNFries · July 21, 2010

Given how many 100s of complaints on twitter yesterday, and the only public response from you guys is 'it's fixed'. No link to this post, which actually explains what happened.
(snip)
The lack of transparency speaks volumes, and makes me wonder how seriously you *really* take your reputation with paying customers (like me).
(snip)
It's a real shame I had to dig up this post in your forum to find out what really happened.

Dave gave a detailed explanation of what happened. If everyone was angrily tweeting about the service being down yesterday & the lack of response, where were all the tweets linking to this post? As the saying goes, "The phone rings both ways."

jbenson2 · July 21, 2010

ShellBryson said
It's good to see an in-depth response here, but it still feels 'somewhat secretive'. Given how many 100s of complaints on twitter yesterday, and the only public response from you guys is 'it's fixed'. No link to this post, which actually explains what happened.
Given how important social media is becoming, companies can easily live and die within realms like Twitter - a bad rep goes a LONG way when you have like-minded users feeding off of each other's complaints.
This post should have been linked to twitter and facebook right away: People are far more forgiving when you hold your hands up and admit "hey, something went wrong, we did all this to fix it..." than "it's fixed" after 6 hours of outage.
It was quite embarrassing for me: I'd just pushed this product (which I love) with a number of colleagues, and then I start getting direct messages over Twitter asking why the service was down.
The lack of transparency speaks volumes, and makes me wonder how seriously you *really* take your reputation with paying customers (like me).

I am in major disagreement with your analysis.

The response from Evernote was impressive in detail and their candor was refreshing. Compare the Evernote same-day response and immediate corrective steps to other website shut downs.

* Twitter is a perfect counter example - how many times have you been shut down and seen that stupid floating whale logo with no explanation?

* How about Flickr and their total lack of "transparency" on their shutdowns. It was so bad a user created a website called DownorNot.com/flickr.

* And how about the Google GMail shutdown earlier this year - they were far more secretive about their problem than Evernote.

Evernote addressed the problem as soon as they found out about it. Their solutions look solid and will protect the users in the future.

ShellBryson · July 21, 2010

The response in HERE was great. No disagreement there at all. But it's hardly easy to find. There's not even a link to the forum on their homepage. Twitter is accessible instantly to almost anyone. What impression do you think casual folks doing a Twitter search for Evernote will have? They will see 100s of posts complaining about the outage, the lack of comms, and ONE message saying "it's fixed" without explanation, or a link to explanation. Poor.

I don't pay for Twitter or GoogleMail, I do pay for my Evernote account, but regardless I don't expect any service to be faultless!

You say "Compare the Evernote same-day response and immediate corrective steps to other website shut downs." I should hope so! Would you expect anything less?

I guess this sounds horribly negative, and I don't mean to be. I *love* Evernote. That's why I paid for an account. I use it on my phone, laptop, work computers, you name it.

engberg · July 21, 2010

Thanks for the feedback. We're looking into setting up a simple "status update" blog/feed that we could point to from a few places (e.g. the normal blog), and that interested parties could follow via RSS.

The type of post that I made (above) would go there, as would more routine stuff like "Restarting servers for weekly service update. Expected outage: 5-10 minutes."

In theory, the type of Twitter reply we made about this incident (once we woke up) could include a short link to the more detailed entry in this hypothetical status update feed.

Thanks

rutherfordgenealogy · July 21, 2010

Is there an outage going on right now? I am not able to sync.

engberg · July 21, 2010

Unfortunately, yes.

The service was unavailable for approximately 45 minutes today while we were installing some new equipment to help us monitor and detect problems. This was something we were deploying to help with problems like yesterday's. (Yes, ironic.)

This should not have caused more than a 10 second disruption, but there was a configuration mistake that took too long to resolve. This was a pure administrator error (a typo in the IP address of the new box that collided with our primary user database server).
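
A pre-deployment check of the kind that would catch this sort of address collision can be sketched with Python's standard `ipaddress` module. The host names, addresses, and inventory below are hypothetical; the post does not give the real ones.

```python
import ipaddress
from typing import Dict, Optional


def find_collision(new_ip: str, inventory: Dict[str, str]) -> Optional[str]:
    """Return the name of the host already using new_ip, or None if it is free."""
    candidate = ipaddress.ip_address(new_ip)  # also rejects malformed addresses
    for host, addr in inventory.items():
        if ipaddress.ip_address(addr) == candidate:
            return host
    return None


# Hypothetical inventory of addresses already in use.
INVENTORY = {
    "userdb-primary": "10.0.0.12",
    "web1": "10.0.0.21",
}

# The intended address was 10.0.0.112; the typo dropped a digit.
typo_result = find_collision("10.0.0.12", INVENTORY)
intended_result = find_collision("10.0.0.112", INVENTORY)
print(typo_result, intended_result)
```

A check like this only helps if the inventory is kept accurate, so it complements rather than replaces review of the configuration change itself.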

For completeness, I'll note that the service will also be unavailable for 15 minutes tonight around 5pm (California time) for the regular weekly web service update and system security patches.

We do apologize again for the inconvenience.

rutherfordgenealogy · July 22, 2010

and so naturally... at exactly the time you mentioned there's to be a planned maintenance downtime - I tried again to Sync then refreshed the Forum to see that it was planned. So I waited a couple more minutes and all's well.

Yes, it would be nice for a quick note to go out to all users when things are down for an extended period of time, but for these little glitchy, typo-created things -- we all can just take a breath and work on something more important that we should probably be doing anyway.

Thanks for the reply/update...

Carolyn

JustDave · July 22, 2010

This was a pure administrator error (a typo in the IP address of the new box that collided with our primary user database server).

Being moderately dyslexic and having wrestled many "typo" monsters in my time I'm probably enjoying your posts more than I should.

Thanks again for the honesty.

engberg · July 22, 2010

Ok, we've added a placeholder status page at:

http://www.evernote.com/about/status/

(I.e. no RSS feed for that or anything yet, but it will give information about planned and unplanned outages from now on.)

We'll redirect that to something more permanent later.

rutherfordgenealogy · July 22, 2010

Looks fine to me ... thank you! ..... but can you link to it from say, this page? http://www.evernote.com/about/support/ or add it as an item in the list of our HELP in the top menu, so we can find it?

~ Carolyn

engberg · July 22, 2010

Yes, we're finding good places to link to that page, including a blog post in progress.

Thanks

nova47 · July 22, 2010

Ok, we've added a placeholder status page at:
http://www.evernote.com/about/status/

NICE!

ShellBryson · July 22, 2010

Nice to see the status page!

bpm32 · July 22, 2010

This is great! Thanks for adding this to your site.

What if simple notices of planned outages were added to a status bar in the desktop (and maybe even mobile) clients?

Again, thanks for making all of this more visible!

Brian

Owyn · July 31, 2010

Very nice work on the status page.

I have added the RSS Feed to GReader.
