A Grunt’s Guide to High Availability Designs in High Load Shared Environments
(and how I suckered management into buying me $100k+ worth of hardware so I could test something that sounded cool… and somehow it magically worked!)
Last month we stirred up some excitement with our announcement that we’ll be migrating our entire shared SQL environment to a new High Availability Cluster (HAC). We invited one of the chief architects behind the new environment, Sol Posenjak, to share some insight on the project’s development.
Before I get into the design details, let me start with the three things that got us rolling on this whole HAC / Cloud thing:
1. I love Microsoft.
And the bosses here like to give me Microsoft toys to break in the hopes something useable emerges from the rubble.
2. Our understanding of this time-tested fact: Shared SQL standalone servers will always be risky business.
Traditional hosters will typically use standalone machines for SQL. They pack them with 1200+ databases per server, then avoid touching them unless absolutely necessary. This is all well and good while things are running fine, but when there is a problem with a single box, you end up screwing up 1200 or more sites. That’s bad juju! This, of course, leads to tighter restrictions on SQL to avoid critical failures (you might think 1-5% sustained CPU from your database isn’t a big deal, but when you get 1200 databases that think the same thing, you can have a serious resource problem).
Also, during peak hours this can be detrimental to shared performance as a whole. Think ColdFusion—while your SQL server is pegged, you’re queuing up requests in that CF instance, eventually tanking it. This can end up spiraling out of control since now your web servers are impacted. In just a few minutes of downtime, your support staff can become seriously overwhelmed with complaints about performance. Not only that, other resources like disk space end up being seriously limited because you have to push 1200 sites into 1-2 data arrays that are usually directly attached (DAS).
In short—one SQL server having a performance issue usually means a large number of people end up having a REALLY BAD day.
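To put rough numbers on that "1-5% isn't a big deal" thinking, here is a back-of-the-napkin sketch in Python. It assumes each percentage is a share of the whole server's CPU, which is my reading and not a measured figure, and the function name is made up for illustration.

```python
def total_cpu_percent(databases: int, percent_per_db: int) -> int:
    """Sum the sustained CPU share of every database on one server.

    Assumption: percent_per_db is a share of the whole server's CPU.
    """
    return databases * percent_per_db


if __name__ == "__main__":
    for pct in (1, 5):
        total = total_cpu_percent(1200, pct)
        # Anything over 100% is demand the box simply cannot serve.
        print(f"{pct}% each across 1200 databases = {total}% of one server")
```

Obviously not every database is busy at the same moment, but it shows why a "small" per-database allowance stops being small at this density.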
3. People like to say that virtualizing SQL servers is a bad idea.
If you ask me, those are fighting words. With enough money, you can make anything work, right?
But therein lies the problem—money. Management is usually not going to be ok with spending a ton of money on something that could utterly fail.
There is a solution, however, so have no fear!
Use the word cloud.
You see, management loves that word. Don’t just say you’re building a high availability cluster with redundant gizmos and fancy thingamabobs with colo available expansion—no, you’re building a cloud! And clouds cost money, so try saying cloud as much as possible. And if anyone in management has a question, make sure you say cloud at least once in your response… even if it makes absolutely NO SENSE.
Ok, so now that that’s out of the way, let’s get into some details.
Pick your hardware:
You need to know every current hardware implementation in your environment that you plan to virtualize, how many there are, and average and peak usage of the key resources (RAM, CPU, Disk). With older hardware configurations, you’re in luck. Especially if you’re using older Xeons. The jump to a Nehalem processor can make all the difference with this sort of thing. Look into a reporting tool like System Center Operations Manager. It pulls really great stats and can build some really good reports with its information.
Another big one is to pick a real vendor, not an unknown. I hate unknowns. They’re a little bit less money, but when you’re going with a decent sized hardware vendor (HP, Dell, etc.) you’ve got a really good buffer and great swap options for future proofing.
After comparing all your stats, make a best guess with what you think you’ll need, and then go shopping through your vendors.
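If you end up exporting the numbers and crunching them yourself, the comparison might look something like the sketch below. The server names, metric names, and every sample value are invented for illustration; a real run would use whatever your reporting tool exports.

```python
from statistics import mean

# Hypothetical utilization samples per server (all values invented).
samples = {
    "sql01": {"cpu_pct": [12, 35, 80], "ram_gb": [10, 12, 14], "disk_iops": [300, 900, 1500]},
    "sql02": {"cpu_pct": [5, 20, 45], "ram_gb": [8, 12, 16], "disk_iops": [200, 400, 1200]},
}


def summarize(samples):
    """Return the average and peak of each metric for each server."""
    return {
        server: {
            name: {"avg": mean(values), "peak": max(values)}
            for name, values in metrics.items()
        }
        for server, metrics in samples.items()
    }


def fleet_peak(summary, metric):
    """Add up per-server peaks: a pessimistic, everything-peaks-at-once total."""
    return sum(stats[metric]["peak"] for stats in summary.values())


if __name__ == "__main__":
    summary = summarize(samples)
    for metric in ("cpu_pct", "ram_gb", "disk_iops"):
        print(metric, "worst-case total:", fleet_peak(summary, metric))
```

Summing the peaks overshoots, since servers rarely peak together, but it gives you an upper bound to set next to the averages when you make that best guess.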
RAM – This is probably the least of your concerns. You’re allocating what you need, and are probably going to standardize any oddball boxes. So just keep the RAM counts up and stick with speeds around 1066 or 1333 MHz and you’re good. No real special treatment needed. The only important detail is to keep some RAM free on each host for failover. Generally a quarter of the box works as a baseline when you have eight physical nodes + one standby; at least that’s the ratio I like to keep.
CPU – Yay for CPU, second biggest risk. This one is sort of a big pain. You’ll have a few ideas of how virtual CPUs are going to perform based on your cores, but until there are a few fully loaded SQLs running, you don’t really know how your processors are going to measure up to a bunch of databases inundating them. If you’re going from something like dual E5410s to dual L5520s or L5630s, you’ll be impressed with the difference. At least I was.
DISK – The worst one of them all. Gauging the disk usage is one thing; doing it on over 50 servers and trying to get all those numbers together at the same time without making the graphs and reports look like meaningless gibberish is a monster of a task. In the end, only a few things matter – reliability, scalability, and cost per gig. Since we’re looking at exposed “virtual hard drives” (i.e. LUNs) instead of mount points, sticking with a storage area network (SAN) architecture is your best bet.
For us, Compellent worked out best for any scenario we ran into. It gave us a highly available model with a large amount of scalability. Data progression, which automatically shifts older data into less expensive disks, allowed us to build a configuration that was cost effective. If you’re not sure about storage, and are looking to work in a design based off a SAN model, check into them – they’re one of the best around that I’ve seen.
NETWORK – People will say network matters. Well, it sort of does. Use virtual LANs (VLANs) and multiple links with quick/immediate failover (or set up some teaming). Four NICs will generally keep you up and running safely – two configured with redundant networks for the host and cluster and two dedicated to your guests. If you need to share your two host NICs with other services (backups to the guests or whatever), everything should be fine as long as they’re split on VLANs so the hosts and guests never talk. Keeping them isolated is a good security buffer, but if all your hosts do is Hyper-V, there’s no real reason they’d need to talk to anything except monitoring / management software.
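Going back to the RAM rule of thumb for a second, here is a rough sketch of what keeping a quarter of each box free does to your allocatable memory. The 96 GB host size is a made-up example, the function names are hypothetical, and I'm assuming the standby node carries no guests until something fails over to it.

```python
def usable_ram_gb(host_ram_gb: float, reserve_fraction: float = 0.25) -> float:
    """RAM left for guests after holding back the failover reserve."""
    return host_ram_gb * (1 - reserve_fraction)


def cluster_allocatable_gb(active_nodes: int, host_ram_gb: float,
                           reserve_fraction: float = 0.25) -> float:
    """Guest RAM across the active nodes only (standby assumed empty)."""
    return active_nodes * usable_ram_gb(host_ram_gb, reserve_fraction)


if __name__ == "__main__":
    host_ram_gb = 96  # hypothetical host size, not our actual spec
    print("Per host for guests:", usable_ram_gb(host_ram_gb), "GB")
    print("Eight active nodes:", cluster_allocatable_gb(8, host_ram_gb), "GB")
```

Treat the output as a ceiling for planning and not a target; the whole point of the reserve is that you don't spend it.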
In the end, our hardware was set with Dell blades and a Compellent SAN. For us this was the best choice, not just for the hardware but the support from those two companies. Whenever we had a question or a concern they were able to answer it quickly and efficiently.
So, yeah, those are the basics of the hardware design behind our new private cloud. In my world, the difference between physical and virtual is next to nothing these days, and shared SQL was just a stepping-stone to prove the technology to our customers and ourselves. With any luck, we’ll be able to continue expanding the environment very, very shortly.
Now, if this actually gets thrown up on the blog, maybe next week I can be even more vague and talk about the software behind it all! Bah, they’ll probably edit it and take out all the fun stuff.