I have a web application based on $Platform. It needs to support $NumUsers concurrent connections. How much server do I need?$Platform can be anything from 'php' to 'tomcat' to the incredibly unhelpful 'linux'. $NumUsers can be anything from a reasonable number to completely unreasonable numbers representing the anticipated worst-case (or maybe that's 'best case') scenario (50,000 users! Two meeeeelion Users!).
The answer they're looking for is:
Two AWS large instances.
The answer they'll get is:
It depends on the application code.They're laboring under the misconception that $Platform and $NumUsers are the only variables in the Grand Equation of Scaling. HAHahahahaahaha. There actually is a GEoS, but I'm getting ahead of myself.
And yet, there are there are four measurements that determine how well a pair of pants fit:
- Waist
- Hip
- Rise (crotch to waist)
- Inseam
This? This is a problem. There is no way to sell off-the-shelf
pants with four measurements on the tag. Well, you could but
retailers would hate it since they'd have to shelve umpty
different permutations, and customers would hate it since finding
the right one would take far too long. Clearly a non-starter. What
to do?
The question here is, "How related are these four measurements, and is there a single one that would provide the best predictive power for the rest?"
What they did was measure a lot of people. They've done this several times over the last century, but the most recent dataset is far more multi-cultural than the ones used back in the 1950's. It actually has non-white people in it!
What they found was:
- For men, Waist is a strong predictor of Hip
- For women, Waist is a poor overall predictor of Hip
- For women, there are clusters around certain ethnicities where Waist is actually a pretty good predictor of Hip
Add this into several decades of marketing-habits they've learned
over the years:
- Men can handle two numbers on the label, so can handle a
separate Inseam measurement
- Women expect only one number on the label
- Rise is dictated by overall fashion trends instead of
individual bodies (compare 1980's pants vs today's)
- Inseam for women is dictated equally by body measurements and fashion (you wear different pants when wearing heels)
Men's pants are easy: Waist, Inseam, done.
Women's pants.... trickier.
Consider what an enterprising new clothing manufacturer faces
when designing pants. How do they get the best fit for their picky
customers? Like the question at the top of this article, all they
know is that women's pants have a single size number on them, and
come in petite/regular/tall. But what does that mean? How should they size
their pants?
I'm a new clothing designer for women's ready-to-wear. What's the hip measurement for a 28" waist?The answer they're looking for is "40 inches".
They answer they'll get is, "Who are you selling to?"
Does this sound familiar at all? It should.
The AppDev looking for server-sizing is expecting their problem to be a men's-pants kind of problem; one sizing-style fits everything. When in fact, it's far more complex.
So, what is this Grand Equation of Scaling I talked about above?
Take a look at how the fashion industry solved their sizing
problem. They took a lot
of measurements, ran quite a bit of analysis over it, watched
things over the course of years, and adjusted to fit.
- They measured things relevant to the problem (body measurements)
- They researched customer preferences (shopping data and market research)
- They analyzed trends in the data
With these three things done, they were able to construct a
sizing regime that works pretty well for them, their retailers,
and their customers.
The same bullet-points apply for figuring how much hardware
(physical or virtual) is needed for a web or mobile application.
The easiest to quantify metrics are $NumUsers and $NumServers, but they're just parts of the overall
dataset needed to be analyzed to appropriately answer the
question.
Measuring things relevant to the
problem
When figuring out how many resources are required to support a
given application, any of the following variables will need to be
tracked (and I'm sure I'm missing some for edge-cases):
- Number of Users
- Number of Web Servers
- Size of WebServer
- Number of Databases
- Size of DatabaseServers
- Caching tier efficiency
- Number of concurrent accesses
- Number of concurrent sessions
And much, much more.
These variables can be discovered in pre-deployment testing, and
by monitoring production performance. Since things like user
concurrency is not a static value, a range needs to be assessed.
Research customer preferences
This will tell you how your customers expect your application to
perform, when they start getting peeved at slow-downs, how they
work through the application, and a bunch of other things. The
item easiest to measure is perceived performance thresholds.
People using it on mobile networks will be more forgiving of
stuttering than those on traditional networks. It is through this
research that you find the performance envelope you have to stay
within.
Analyzing trends in the data
A pile of data means nothing unless you have someone who can make
sense of it. It is through this process that the Grand Equation of
Scaling is derived. Now that you have a pile of data about how
your application works, and you know what performance goals you
have to hit, you can start constraining your equation. This
is where you get to use the higher maths you got in college, and
is one of the reasons why Google likes to hire Ph.D's in Mathematics.
Because this equation? For complex environments it can involve
Calculus and Discrete Mathematics. This is why they sometimes call
us Systems Engineers,
not Administrators.
With these three steps you can actually answer the "how do I build an infrastructure that can support 2 meeeeelion users" question.
The same steps can be used to figure out whether or not your new web-app can be run on shared-hosting or needs more expensive dedicated-hosting. It won't even involve calculus for something this small!
Either way, you do have to know what to measure. For pants, see above. For IT infrastructure, find a Systems Administrator/Engineer. We'll even wear pants if you ask.
FYI, I'm getting 404s for some post pages. Example:
http://sysadmin1138.net/mt/blog/2011/02/the-heartbreak-of-software-licensing.shtml
from
http://sysadmin1138.net/mt/blog/2011/02/
That would explain the drop in spam I'm getting after the latest blog software install.