Threaded Mode | Linear Mode

Bill Duncan · 02-05-2018, 02:07 AM

Apologies in advance; this really doesn't have much to do with HP calculators,
except that I'd love to develop an approximation model that would run on one..
Wink

I'm hoping for some expertise from someone smarter than me..

I'm trying to model a system that involves about 2400 hosts in a sharded
database of sorts. Users enter queries, they get distributed to the 2400
systems simultaneously and when all have finished, the user gets the results.

The overall system has an SLA (Service Level Agreement) that the queries
finish in a certain length of time, 90% of the time.

Obviously, the weakest link in the 2400 systems determines the length of time
that a search completes in.

A naive approach that assumes all queries on the backend are independent
would mean we are looking for the probability that "at least one of" the
backend systems is over SLA. To find that of course, we'd use the complement
of probabilities; the intersection set of probabilities where all queries on
the back end are within SLA. If they were all equal and independent, it might
look something like:

Probability of Search being within SLA == 0.99995^2400

Unfortunately, the queries on the backend aren't totally independent.
Sometimes they are, such as a query hitting a system that is busy doing
something else. However, sometimes the query performance rests with the
complexity, number of terms etc, and will impact a number of backend systems.

Nevertheless, there is a long tail of systems that are always within SLA.

I'm wondering if there is a relatively simple way to model the behaviour
such that we can "play with the numbers" and get approximate results.

Eg. What percentage of the systems would we have to "fix" and by how
much to maintain the user experience etc.

As a sample, the worst systems are within SLA 95% of the time, while the
best are within 100%. The user experience however is closer to 90% which
is natural of course, as the probabilities would accumulate and be worse than
the worst one..

Thoughts? Suggestions? Pointers? I'm stuck..

Thanks!

pier4r · (This post was last modified: 02-07-2018 10:51 AM by pier4r.)

Interesting problem (how many interesting problems are posted here, and I miss the majority of them).

I learned that some can proceed by large improvements at once, I have to go very slowly, step by step.
At first I would check the real world reports, to get an idea about the success rate (being within the SLA) of each system.
Even without real world stats one could work with assumptions. I mean in the Engineering course I learned that unless we need extra precision, it is ok to know a range of values where things works (or, conversely, don't).

So you said that there are normal queries, that could be considered executed on each system independently from the others and then queries that depends on other systems.
For the first type of queries (the independ ones), you already have the solution. The probability of going over the SLA is a Bernoulli trial.
If the probability to succeed is P (assumed or computed), then to fail is (1-P). Then the SLA fails when one system fails, when two system fails, when three, etc.. then all.

This is:

\[ \sum_{n=1}^{2400} P^{2400-n} \cdot (1-P)^n \]

so the probability of succeeding is:

\[ 1 - \sum_{n=1}^{2400} P^{2400-n} \cdot (1-P)^n \]

And this is for the simple queries. For the more complicated queries, either you go through the rabbit hole of dependent variables (I myself have little ideas about those) or again: simplify unless you need that high precision (or you want to know).
So for depended queries again we need to set assumptions.
How often do they happen? Often enough that we need to care for them?
If they happen, can we model them like a chain of dependencies? Can be that one system may ask many others but each independently so the chain is at most of length 1 (longer lengths may be not so frequent)?

Bill Duncan · 02-08-2018, 12:28 PM

Thanks pier4r, you've given me some ideas..
Appreciate the help.
Cheers.

Bill Duncan · 05-30-2020, 01:01 AM

Hey @pier4r
Thanks so much for pointing me in the right direction!

I've added a program for playing with the numbers while developing SLOs.

Service Level Objectives

Cheers.