Wednesday, September 20, 2017

There is no such thing as 100% uptime

Over the years we have had customers reporting outages on their websites. Most clients would never even notice these, as they usually occur in the middle of the night. However, lots of IT departments use 'pinging' software to continually monitor their sites and take great pleasure in reporting an outage to anyone who cares (usually the MD or the Marketing Department/Director).

So the question of uptime is one we get a lot. And the notion is that if we work hard enough, test out the hosting companies enough, talk to enough people, and generally get smarter, we can find and recommend the right hosting partner who will deliver 100% uptime.

This is not true. And it has nothing to do with our effort, time, research, or overall intelligence.

It’s related to the nature of the internet. And no one can promise you 100% uptime (for a price you’re willing to pay). Want to test what I’m talking about?

The test is simple. Here are the steps you can take:
  1. Look up any hosting provider you like (or want to use)
  2. See if they have a 100% uptime guarantee.
  3. Then see if they have any statements after the 100% uptime guarantee.
If they do, they don’t really guarantee 100% uptime. Right? Because why would you need to say more after a 100% uptime guarantee?

If I tell you that I won’t stab you, there are no further statements, right? Not like, “If I do stab you, I will be sure to put a bandaid on the wound,” or “If I end up stabbing you for a reason that isn’t your fault, I promise to drive you to the hospital.” Nope. If the guarantee is 100%, there are no “if” statements afterwards.

An uptime guarantee is – no matter which host you look at – simply a promise of what refund the host offers customers if there’s a network outage. The reality is that many companies simply won’t offer you a 100% uptime guarantee. And those that do will likely spell out exceptions:
  • Failure of systems, internet, infrastructure, network, power, facilities or connections delivered by third parties
  • Applications, software, or operating system failures because of denial of service attacks, hacker activity, or other malicious events
  • Acts of God (weather, etc)
Oh, and maintenance is also an exception. So trust me when I tell you that the tests are clear – you won’t get 100% anywhere.
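
To put some numbers on what an uptime percentage actually means, here is a minimal sketch that converts a percentage into the minutes of downtime it permits. The function name and the 30-day month are my own assumptions for illustration; real hosts define their measurement periods (and their exceptions) in their own terms.

```python
# Rough arithmetic only: converts an uptime percentage into permitted downtime.
# The 30-day default period is an assumption, not any host's actual SLA term.
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (1 - uptime_percent / 100) * days * MINUTES_PER_DAY


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 100.0):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% uptime allows {minutes:.1f} minutes down per 30 days")
```

Only the 100% row leaves no room at all, which is exactly why the exceptions get written in.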

You could get 100% uptime…but it will cost you

Let’s talk, for just a second, about how you might go about getting 100% uptime, if you really wanted it. To do that, we need to understand what’s happening behind the simple things we do with a browser and the web. When you make a request for a website, imagine that you’re actually asking someone to send you a book, in chapters, via snail mail. (I know, who would do that??)

So let’s say I ask for "Harry Potter and the Goblet of Stone". Here’s what happens in a normal network.
  • I send you a note asking you for the book.
  • You collect all the chapters (117 of them).
  • You put each chapter in a separate manila envelope addressed to me.
  • You take them all to the mailbox and drop them off.
  • The postman sends them through the system and they arrive at my post office.
  • My mailman delivers them to me.
  • Only, I don’t get chapters 10, 43, 86, and 92.
  • I send you a note asking for those again.
  • You grab copies of those and you resend them to me.
  • The same thing happens, but I am still missing chapter 92.
  • I ask again and this time your delivery reaches me.
  • Then I open all the envelopes and arrange them in order and I start reading.
This is what happens every time we ask for a web page. All those chapters are data packets. Those post offices (and mailmen) are like routers. And that chatter back and forth about getting all of what I want or need: those are the internet protocols (such as TCP) that carry communication back and forth between my browser and your server.
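
The chapter exchange above can be sketched in a few lines of code. This is a toy model of the idea, not how TCP is actually implemented: the function name and the `lost_per_round` list (which chapters go missing on each delivery attempt) are assumptions made up for the illustration.

```python
from typing import Dict, List, Set, Tuple


def fetch_with_retransmission(
    chapters: Dict[int, str], lost_per_round: List[Set[int]]
) -> Tuple[List[str], int]:
    """Request every chapter, re-requesting missing ones until all have arrived.

    lost_per_round[i] holds the chapter numbers dropped in delivery round i.
    Rounds beyond the end of the list lose nothing, so the loop always ends.
    Returns the chapters in reading order and the number of rounds needed.
    """
    received: Dict[int, str] = {}
    missing = set(chapters)
    rounds = 0
    while missing:
        lost = lost_per_round[rounds] if rounds < len(lost_per_round) else set()
        # Everything requested this round that was not lost gets delivered.
        for number in missing - lost:
            received[number] = chapters[number]
        # Whatever was requested and lost has to be asked for again.
        missing = missing & lost
        rounds += 1
    # Envelopes arrive in any order; sort them before reading.
    book = [received[number] for number in sorted(received)]
    return book, rounds


if __name__ == "__main__":
    all_chapters = {n: f"chapter {n}" for n in range(1, 118)}
    book, rounds = fetch_with_retransmission(all_chapters, [{10, 43, 86, 92}, {92}])
    print(len(book), "chapters received in", rounds, "rounds")
```

With the losses from the story (chapters 10, 43, 86 and 92 the first time, then 92 again), it takes three rounds before the whole book is in hand.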

It’s kind of crazy!

Where it gets even more complicated is when you imagine that those packages aren’t getting delivered using the same route. Ever order 10 things from Amazon and get them in different shipments? If you track them via UPS or FedEx, they don’t necessarily go through the same locations or hubs across the country.

That’s the same thing with routers and packet traffic.

Now why is this all important? Because in a high availability setup, you have to mitigate issues on several fronts. This isn’t just a case of “use Cloudflare and everything will be OK.” That’s just not true.

One way to mitigate this is to use a different kind of routing: anycast instead of unicast. This means that instead of me asking you specifically for the book, I can make a request for the book and you and all your friends, spread out everywhere, could answer it, with the reply coming from whichever of you is closest to me.

Technically, this confuses people, because we told everyone that every domain has a unique IP address (like the address of your house). The reality of using DNS on an anycast network is that the same IP address can be announced from several servers in several locations. Crazy, I know.

That would help with the speed and performance of requests (like the need for some chapters sent again) because they could go to many different locations. Since I’m in Derby, I could hit a server in London. If you’re in New Jersey, you might make a request to a server in New York.
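
Here is a toy model of that idea. In a real anycast network the choice is made by internet routing, not by application code, so treat this purely as an illustration. The route costs and the `route_request` function are invented for the example, and the address is a documentation-only placeholder.

```python
from typing import Set

# Documentation-only address (RFC 5737), standing in for one shared anycast IP.
ANYCAST_IP = "192.0.2.1"

# Hypothetical routing costs from each client location to each server site.
ROUTE_COST = {
    "Derby": {"London": 2, "New York": 9},
    "New Jersey": {"London": 9, "New York": 1},
}


def route_request(client: str, sites_up: Set[str]) -> str:
    """Return the site that would answer a request to ANYCAST_IP from `client`.

    Every site announces the same address; the request lands at the
    lowest-cost site that is currently up.
    """
    candidates = {
        site: cost for site, cost in ROUTE_COST[client].items() if site in sites_up
    }
    if not candidates:
        raise RuntimeError("no site is announcing the address")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    print(route_request("Derby", {"London", "New York"}))
    print(route_request("New Jersey", {"London", "New York"}))
    # If London drops out, Derby's requests still get answered, just from further away.
    print(route_request("Derby", {"New York"}))
```

The last call is the useful part for uptime: when one location stops answering, the same address is still served from somewhere else.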

But that’s not the only place where you need redundancy.
You would also need to mitigate issues with the servers themselves. That means that the place that stores books (a bookstore, if you will) needs its own support if something happens there.
So you’re going to need more than DNS redundancy; you’ll need server clusters. And while you might want to pay for the cheaper cold or warm failover, true high availability (100% uptime) will likely require hot failover with a heartbeat monitor.

Think of that heartbeat monitor as one of those young interns at the bookstore who has to keep running to the back to see if a book is there. Only this time they need to run all over town because your cluster might not be located all in the same spot.

And the moment he comes back to tell you that in warehouse A the book isn’t there, you need to update your infrastructure to route all requests away from that warehouse and to another that has it. But you also need him to go run an order to get that book back in stock.
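
As a rough sketch of what that intern is doing, here is a tiny heartbeat monitor with failover. The class, the warehouse names and the timeout are all assumptions for illustration; production clusters use dedicated tooling for this, and times are passed in explicitly here just to keep the example predictable.

```python
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Tracks when each node last reported in and which nodes still count as alive."""

    def __init__(self, nodes: List[str], timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {node: 0.0 for node in nodes}

    def beat(self, node: str, now: float) -> None:
        """Record a heartbeat from `node` at time `now` (in seconds)."""
        self.last_seen[node] = now

    def healthy(self, now: float) -> List[str]:
        """Nodes whose most recent heartbeat is within the timeout."""
        return [
            node
            for node, seen in self.last_seen.items()
            if now - seen <= self.timeout
        ]


def pick_target(
    monitor: HeartbeatMonitor, preferred_order: List[str], now: float
) -> Optional[str]:
    """Route to the first preferred node that is still healthy, or None if all are down."""
    alive = set(monitor.healthy(now))
    for node in preferred_order:
        if node in alive:
            return node
    return None


if __name__ == "__main__":
    monitor = HeartbeatMonitor(["warehouse-a", "warehouse-b"], timeout=5.0)
    order = ["warehouse-a", "warehouse-b"]
    print(pick_target(monitor, order, now=3.0))   # both recently seen
    monitor.beat("warehouse-b", now=8.0)          # warehouse-a has gone quiet
    print(pick_target(monitor, order, now=10.0))  # traffic moves to warehouse-b
```

Note that this only covers the routing-away half of the job. Getting the failed warehouse restocked and back in rotation is a separate piece of work.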

Are you starting to see why this is expensive and likely more than you want to pay?
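
For a feel of what the extra servers buy you, here is a back-of-the-envelope sketch. It assumes each server fails independently and that failover is instant and flawless, which is generous; the function name is mine, and the 99% figure is just an example input.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` servers when any one of them can serve requests.

    Assumes failures are independent and failover is instant (a simplification).
    `single` is one server's availability as a fraction, e.g. 0.99 for 99%.
    """
    if not 0 <= single <= 1:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at the same time.
    return 1 - (1 - single) ** replicas


if __name__ == "__main__":
    for count in (1, 2, 3):
        print(f"{count} server(s) at 99% each: {combined_availability(0.99, count):.6%}")
```

Each extra server adds more nines (and more cost), but even under these friendly assumptions the figure approaches 100% without ever reaching it.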

You can’t get this for £10-100/month

I love that all sorts of hosting companies offer tremendous deals. Many are doing great things.
But none of them will reserve for you twice the servers you need, located in different places, with a heartbeat monitor, and synchronization, along with anycast DNS services all for £10 a month.
Some hosts will help you with this, but you won’t be paying a few bucks.
But there is good news.
If you really want this, or need this, you can create it yourself on Amazon Web Services. They have everything you would need, assuming you want to get into that configuration game.

In this way, all hosts are equal

What I’m telling you is this. Storms, earthquakes, flooding, DDoS attacks, hackers, and more don’t distinguish between hosts. They don’t care about you and your specific site.
In the end, these things happen.
And unlike SimCity (the original), there is no setting to turn on that protects you from it all. And just like when you’re in the slow lane on the motorway, changing lanes will likely just mean you’re about to get into the slowest lane now. It’s just the rule.

Swapping hosts won’t change much if your host is good to begin with.

So today, you either pay a lot of money for high availability systems, or you recognize that there’s no such thing as 100% uptime...

(With thanks to Chris Lema for most of this information)
