
Bill Duncan · (This post was last modified: 06-06-2020 11:02 PM by Bill Duncan.)

Background

This will mostly be of interest to systems operations folk:
system administrators (sysadmins), often called SREs
(Site Reliability Engineers) or DevOps these days.

Things fail. Systems fail. Large scale systems often depend
on hundreds or even thousands of "Backend" systems these days;
usually Virtual Machines (VMs) or more recently "containers".

The more backend systems which are used (usually to improve
response times), the more likely there will be failures to
deal with.

The terminology that has grown around this includes:

SLI - Service Level Indicators -- things like availability, latency, errors

SLO - Service Level Objectives -- objectives based on the
indicators that are used to gauge reliability of a
service.

SLA - Service Level Agreements -- sometimes objectives are
communicated with customers in the form of agreements,
often with penalties to the provider if the objectives
are not met.

The reliability (e.g. availability, latency, errors) that users
experience can deteriorate dramatically when the number of
backend systems is increased. This program is about exploring
some of the variables involved: how the number and reliability
of backend systems impact the user experience.
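To get a feel for that deterioration, here is a minimal simulation sketch in Python. It assumes every backend meets its time budget independently with the same probability, and that a request only counts as good when every backend it touches is good. The function name, trial count and seed are illustrative choices and not part of the calculator program.

```python
import random

def simulate_front_end(backend_level_pct, n_backends, trials=5_000, seed=1):
    """Estimate the percentage of requests where every backend met its budget.

    Assumption: backends succeed independently, all with the same probability.
    """
    rng = random.Random(seed)
    p = backend_level_pct / 100.0
    good = 0
    for _ in range(trials):
        # A request is good only if all n_backends calls are good.
        if all(rng.random() < p for _ in range(n_backends)):
            good += 1
    return 100.0 * good / trials

if __name__ == "__main__":
    # Same backend level (99.9%), growing fan-out.
    for n in (1, 10, 100, 500):
        print(n, round(simulate_front_end(99.9, n), 1))
```

The estimates are noisy, but they should track the closed form (backend level raised to the number of backends) that the program computes.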

A more detailed description and background can be found in these two articles:

The Tail at Scale
The Tail at Scale Revisited

Operation:

This program enables you to play with the numbers a bit, possibly
while developing SLOs (for the backend and/or users) and SLAs.

Code:

+----+----+----+----+----+
| SL | FR | A  | BE | N  |  Mnemonics
+----+----+----+----+----+
| A  | B  | C  | D  | E  |  User Keys
+----+----+----+----+----+
| 01 | 02 | 03 | 04 | 05 |  Registers
+----+----+----+----+----+
  ^    ^    ^    ^    ^
  |    |    |    |    |
  |    |    |    |    |
  |    |    |    |    +---> Number of Back End Systems
  |    |    |    +---> Back End Service Level  (SLO)
  |    |    +---> Agreement (SLA/SLO) Customer, Front End
  |    +---> Failure Rate (reciprocal)
  +---> Service Level

The "A" and "B" keys (and corresponding registers) translate
between "service level" (probability of meeting objective) and
"failure rate" (reciprocal). Two ways of describing the same thing.

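As a cross-check away from the calculator, the "A"/"B" translation can be sketched in Python. The function names are made up for illustration; the arithmetic follows what routines 01 and 02 in the listing do (service level in percent, failure rate as the N in "1 in N").

```python
def service_level_from_failure_rate(one_in):
    """Failure rate given as "1 in one_in" -> service level in percent."""
    return (1.0 - 1.0 / one_in) * 100.0

def failure_rate_from_service_level(level_pct):
    """Service level in percent -> failure rate as the N in "1 in N"."""
    return 1.0 / (1.0 - level_pct / 100.0)

print(round(service_level_from_failure_rate(1000), 4))   # 99.9
print(round(failure_rate_from_service_level(99.9)))      # 1000
```

A service level of exactly 100 has no finite reciprocal, so the second function divides by zero there.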
The "C", "D" and "E" keys (and registers) are used to look at the
relationship between the front end SLO or SLA and backend SLO for
supporting it. The "E" key specifies the number of backend services.
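The three-way relationship behind "C", "D" and "E" can be written out as a small Python sketch. It assumes, as the program does, that the front end level is the back end level raised to the number of backends; the function names are hypothetical. The calculator uses LOG (base 10) where this uses the natural log, which gives the same ratio.

```python
import math

def front_end_level(backend_pct, n):
    """Key C: front end level (%) from back end level (%) and N backends."""
    return (backend_pct / 100.0) ** n * 100.0

def backend_level(front_end_pct, n):
    """Key D: back end level (%) needed to hit a front end level with N backends."""
    return (front_end_pct / 100.0) ** (1.0 / n) * 100.0

def max_backends(front_end_pct, backend_pct):
    """Key E: how many backends a query can touch and still meet the front end level."""
    return math.log(front_end_pct / 100.0) / math.log(backend_pct / 100.0)

print(round(front_end_level(99.9, 500), 2))   # 60.64
print(round(max_backends(95, 99.9), 2))       # 51.27
print(round(backend_level(95, 500), 4))
```

Given any two of the three values, the third follows, which is why each key can either store a number or calculate one.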

"F" key translates back end service level to a level which
involves two replicas (each query goes to both, so the failure
probability is squared). Also updates Register 04.
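What "F" does can be sketched in Python as follows. The listing squares the failure probability, which corresponds to two replicas; the `replicas` parameter is my generalisation and assumes replica failures are independent and that one good answer is enough.

```python
def with_replicas(backend_pct, replicas=2):
    """Back end level (%) when each query is sent to `replicas` copies.

    Assumption: copies fail independently and any one good answer suffices.
    """
    fail = 1.0 - backend_pct / 100.0
    return (1.0 - fail ** replicas) * 100.0

print(round(with_replicas(99.9), 6))   # 99.9999
```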

Pressing "R/S" after any calculations or storing is finished
will bring up the mnemonics again. Pressing "R/S" one more time
will turn the calculator off in a way that will display the
mnemonics (and remind you what program you're in) when you turn
it on again.

Using the user keys usually works fine. You can also RCL the register
directly or prefix with "XEQ" to force the calculation. (User keys
will fail to detect "number entry" if you use an existing number in
the X register for example. Just STO the number. Also, if you had entered
a number that you hadn't intended to store, pressing a user key will
store it. Pressing it again will do the calculation, or use "XEQ" directly.)
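The store-or-calculate behaviour is easier to see in a toy model. This Python sketch covers only the "A" key and stands in a boolean for the HP-41's numeric entry flag (flag 22, tested with FC?C 22 in the listing); the class and method names are invented for illustration.

```python
class KeySketch:
    """Toy model of the "A" user key and registers 01/02."""

    def __init__(self):
        self.x = 0.0
        self.number_entered = False   # stands in for flag 22
        self.reg01 = 0.0              # service level
        self.reg02 = 0.0              # failure rate (reciprocal)

    def key_in(self, value):
        """Typing digits sets the entry flag."""
        self.x = value
        self.number_entered = True

    def recall_to_x(self, value):
        """A number that is already in X (RCL, earlier result): flag untouched."""
        self.x = value

    def press_a(self):
        if self.number_entered:
            self.number_entered = False   # FC?C clears the flag as it tests it
            self.reg01 = self.x
            return "stored"
        # No fresh entry: calculate from the failure rate instead.
        self.reg01 = self.x = (1.0 - 1.0 / self.reg02) * 100.0
        return "calculated"

k = KeySketch()
k.reg02 = 1000.0
k.recall_to_x(95.0)
print(k.press_a())   # calculated: the 95 in X is not seen as an entry
k.key_in(95.0)
print(k.press_a())   # stored
```

This is why a number that is already sitting in X has to be stored with STO, and why a stray keyed-in number gets stored on the first press.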

Example:

Some customers are complaining that our services are not meeting target
objectives (or agreements). We find that the backend services are
failing to meet their time budgets at a rate of about one in a thousand,
which is two orders of magnitude better than the front end (99.9% vs.
90% in the front end).

Most of our customers are small and the queries hit a few dozen backend
systems while the few larger customers who are complaining can sometimes
hit 500+ systems.

What service level objectives should we be aiming for in the backend
to meet the objectives for all clients? How can we best do that?

Code:

  1000     # backend failure rate reciprocal, roughly 1/1000
  B        # stores it
  A        # Calculate Service Level, see "99.9"
  STO D    # backend SLO
  95 C     # frontend SLO (Objective, more conservative than SLA)
  E        # calculate number of systems we're good to.  51.27
  500 E    # number of systems req'd for large customers
  C        # Calculate frontend service level probability, 60.64%
           # Fret! Barely better than 50/50
  RCL D    # backend service level
  F        # calculate service level while querying replicas
  C        # Calculate the new improved front end service level, 99.95%
           # Celebrate!!
  STO A
  XEQ B    # 1 fail in 2000 for Front end!
  50 E
  C        # 99.995% for N==50  !!  Bonus!
  STO A
  XEQ B    # 1 in 20,000 fail for N==50 querying both replicas
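For anyone without the calculator to hand, the same walkthrough can be replayed in Python. This is only a sketch of the arithmetic behind each keystroke, under the same assumptions as the program (independent backends, failure probability squared for two replicas); the function name and dictionary labels are invented.

```python
import math

def replay_example():
    """Replay the keystrokes above and collect the values they display."""
    seen = {}
    fr = 1000.0                                    # 1000 B
    be = (1 - 1 / fr) * 100                        # A, then STO D
    seen["backend level"] = be
    a = 95.0                                       # 95 C
    seen["systems ok at 95%"] = math.log(a / 100) / math.log(be / 100)  # E
    n = 500                                        # 500 E
    seen["front end, N=500"] = (be / 100) ** n * 100                    # C
    be = (1 - (1 - be / 100) ** 2) * 100           # RCL D, F
    for n in (500, 50):                            # C, then 50 E and C
        level = (be / 100) ** n * 100
        seen[f"front end with replicas, N={n}"] = level
        seen[f"1-in-N failures, N={n}"] = 1 / (1 - level / 100)         # STO A, XEQ B
    return seen

for name, value in replay_example().items():
    print(f"{name}: {round(value, 3)}")
```

The printed values should line up with the figures in the comments: 51.27 systems, 60.64%, 99.95% (roughly 1 in 2000) and 99.995% (roughly 1 in 20,000).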

The Code:

Code:

LBL SLO
  LBL 00
  "SL FR A BE N"
  AVIEW
  CF 22
  SF 27
RTN
  SF 11
  OFF
GTO 00

LBL A
  FC?C 22
  GTO 01
  STO 01
RTN
GTO 00

LBL B
  FC?C 22
  GTO 02
  STO 02
RTN
GTO 00

LBL 01
  RCL 02
  1/X
  XEQ 09
  STO 01
RTN
GTO 00

LBL 02
  RCL 01
  XEQ 08
  1/X
  STO 02
RTN
GTO 00

LBL C
  FC?C 22
  GTO 03
  STO 03
RTN
GTO 00

LBL 03
  RCL 04
  XEQ 11
  RCL 05
  Y^X
  XEQ 12
  STO 03
RTN
GTO 00

LBL D
  FC?C 22
  GTO 04
  STO 04
RTN
GTO 00

LBL 04
  RCL 03
  XEQ 11
  RCL 05
  1/X
  Y^X
  XEQ 12
  STO 04
RTN
GTO 00

LBL E
  FC?C 22
  GTO 05
  STO 05
RTN
GTO 00

LBL 05
  RCL 03
  XEQ 11
  LOG
  RCL 04
  XEQ 11
  LOG
  /
  STO 05
RTN
GTO 00

LBL F
  XEQ 08
  X^2
  XEQ 09
  STO 04
RTN
GTO 00

LBL 11
  1 E2
  /
RTN

LBL 12
  1 E2
  *
RTN

LBL 09
  1
  X<>Y
  -
  XEQ 12
RTN

LBL 08
  XEQ 11
  1
  X<>Y
  -
RTN

Bill Duncan · 07-19-2020, 01:29 AM

I've added another post with a "close enough" approximation.

The approximation is close enough in the range of customer happiness that matters and so simple a calculator isn't really required.. lol..

https://billduncan.org/the-tail-at-scale-approximation/