SRE concepts

07.11.2021

Reading time: 2 min.

Update: I recently added several key things after I started implementing SRE concepts at Billie.

Site Reliability Engineering makes sense only if you care about reliability. It doesn’t bring much value if the most important thing at your current stage is delivering new features. A recently founded startup, for example, is probably not at a good point to start with SRE.

SRE is a way to balance the product’s stability (reliability) against the changes you make to the product, since changes are the most frequent root cause of bad events. The core concept: when your changes break the product too often, you probably need to stop shipping them to production and focus on stability. To switch focus in time, you need to establish and track stability metrics. You also need to define the steps you will take when your stability promise to users is about to be broken.
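In SRE literature this trade-off is usually tracked as an "error budget": the share of bad events the stability target still allows. Below is a minimal sketch of that idea. The function names, the 99% target, and the "freeze when the budget is spent" rule are my own assumptions for illustration, not a prescribed policy.

```python
def error_budget_remaining(slo_target: float, good_events: int, valid_events: int) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means nothing is spent, 0.0 means fully spent,
    and a negative value means the budget is overspent.
    """
    if valid_events == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - slo_target) * valid_events
    actual_bad = valid_events - good_events
    if allowed_bad == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


def should_freeze_releases(slo_target: float, good_events: int, valid_events: int) -> bool:
    """Hypothetical policy: stop shipping changes once the budget is used up."""
    return error_budget_remaining(slo_target, good_events, valid_events) <= 0.0


# Assumed numbers: 99% target, 10,000 valid requests in the window.
print(error_budget_remaining(0.99, 9_950, 10_000))  # about half of the budget is left
print(should_freeze_releases(0.99, 9_890, 10_000))  # 110 bad events against roughly 100 allowed
```

A real policy would likely have more than one step (for example, slowing down releases before freezing them), but the comparison stays the same.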

Let me share my thoughts after completing this super useful SRE course.

You need to take several steps to start down the SRE path.

Think about what makes users unhappy when using your services.
Decide on metrics that reflect user happiness and start gathering them.
Create plans for maintaining the service level target, and policies describing what you will do when the situation starts to endanger your availability targets.
Create plans to improve these metrics.
Act, measure, reflect, improve.

A bit of clarity on the abbreviations used by the Google folks.

SLA - service level agreement. This is the boundary of service perception you shouldn’t cross. When users consider your service bad, you didn’t meet their expectations: either you didn’t set proper expectations, or you breached your promise on service quality.

SLO - service level objective. It is like an SLA, but it is only an internal promise and a compass for meeting user expectations. It is a bit tighter than the SLA because we don’t want to disappoint users by breaching the SLA.

SLI - service level indicator. It shows how well you meet user expectations at some point in time. Normally it is the ratio of good events to all valid events over some period of time.
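To make the "good events to valid events" ratio concrete, here is a small sketch for an HTTP service. What counts as "valid" and "good" is a choice each team makes for itself; the rules below (client errors are not valid events, a good request is a non-5xx response within 300 ms) and all the names are assumptions of mine.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time to serve the request


def is_valid(r: Request) -> bool:
    # Assumption: 4xx responses are the client's fault, so they don't count.
    return not (400 <= r.status < 500)


def is_good(r: Request) -> bool:
    # Assumption: good means no server error and answered within 300 ms.
    return r.status < 500 and r.latency_ms <= 300


def sli(requests: Iterable[Request]) -> Optional[float]:
    """Ratio of good events to valid events, or None when there are no valid events."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None
    good = sum(1 for r in valid if is_good(r))
    return good / len(valid)


window = [
    Request(200, 120),  # valid, good
    Request(200, 450),  # valid, too slow
    Request(404, 10),   # not valid, ignored
    Request(500, 30),   # valid, server error
    Request(200, 90),   # valid, good
]
print(sli(window))  # 2 good out of 4 valid events
```

Returning None for an empty window avoids pretending that a period without traffic was either perfect or broken.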

How do these relate to each other? Let me sum it up in a little mantra.

We measure SLIs, which shouldn’t breach SLOs, so that we don’t disappoint users by breaking SLAs.
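The mantra can be written down as a simple check. This is only a sketch: the function name, the status strings, and the example thresholds (an internal target of 99.9% and an external promise of 99.5%) are made up for illustration.

```python
def service_status(sli: float, slo: float, sla: float) -> str:
    """Compare a measured SLI against the internal SLO and the external SLA."""
    if not (0.0 <= sla <= slo <= 1.0):
        # The SLO is expected to be at least as tight as the SLA.
        raise ValueError("expected 0 <= sla <= slo <= 1")
    if sli < sla:
        return "SLA breached"  # users are already disappointed
    if sli < slo:
        return "SLO breached"  # internal alarm, the SLA still holds
    return "ok"


# Assumed thresholds: SLO 99.9%, SLA 99.5%.
for measured in (0.9995, 0.997, 0.99):
    print(measured, service_status(measured, slo=0.999, sla=0.995))
```

The gap between the two thresholds is the room you have to react before users notice.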