Retries

Introduction

Distributed systems are full of temporary issues: network failures, sudden latency increases, bandwidth exhaustion, node evictions and partial pod rescheduling, temporary overload of some microservices, and so on. All of that can create situations where your requests are delayed, queued, or fail.

What can you do in that case? The most natural answer is to retry your request. Hence, retry is the most fundamental, intuitive and commonly used component in our resilience toolkit.

However, retries only look deceptively simple and straightforward. Their real-world usage is more nuanced, as you will see throughout this page.

Use cases

  • Retries hide temporary short-lived errors
  • Jitters are useful to reduce congestion on resources

Usage

Hyx provides a decorator that brings retry functionality to any function:

import asyncio

import httpx

from hyx.retry import retry


@retry(on=httpx.NetworkError, attempts=4)
async def get_gh_events() -> list:
    async with httpx.AsyncClient() as client:
        response = await client.get("https://api.github.com/events")

        return response.json()


asyncio.run(get_gh_events())
hyx.retry.retry(*, on=, attempts=3, backoff=0.5, name=None, listeners=None, event_manager=None)

The @retry() decorator retries the function on the given exceptions for the given number of attempts. The delay before each retry is defined by the backoff strategy.

Parameters:

  • on - Exception or tuple of Exceptions we need to retry on (see the example after this list).
  • attempts - How many times we need to retry. If None, it will retry infinitely until success.
  • backoff - Backoff strategy that defines the delay before each retry. Takes a float (delay in secs), a list[float] or tuple[float, ...] (delays for each retry attempt), or an Iterator[float].
  • name (None | str) - A component name or ID (will be passed to listeners and mentioned in metrics).
  • listeners (None | Sequence[TimeoutListener]) - List of listeners observing this concrete component's state.
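
For example, on can take a tuple of exceptions when a call should be retried on several kinds of transient errors. A minimal sketch (the chosen httpx exception types and the endpoint are only illustrative):

import asyncio

import httpx

from hyx.retry import retry


# retry on both connection-level and timeout errors, up to 5 attempts
@retry(on=(httpx.NetworkError, httpx.TimeoutException), attempts=5)
async def get_gh_zen() -> str:
    async with httpx.AsyncClient() as client:
        response = await client.get("https://api.github.com/zen")

        return response.text


asyncio.run(get_gh_zen())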

Backoffs

The backoff strategy is a crucial parameter to consider. Depending on the backoff, the retry component can help your system or be a source of problems.

Warning

For the sake of simplicity, Hyx assumes that you are following AsyncIO best practices and are not running CPU-intensive operations in the main thread. Otherwise, the backoff delays may fire later than expected, once the event loop is unblocked.

Constant Backoff

The most basic backoff strategy is to wait a constant amount of time on each retry.

import asyncio

import httpx

from hyx.retry import retry


@retry(on=httpx.NetworkError, backoff=0.5)  # delay 500ms on each retry
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("pikachu"))

The float backoffs are just aliases for the const backoff.

class hyx.retry.backoffs.const(delay_secs, *, jitter=None)

Constant Delay(s) Backoff

Parameters:

  • delay_secs (float, int) - How much time do we wait on each retry.
  • jitter (optional) - Decorrelate delays with the jitter. No jitter by default
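
For example, backoff=0.5 from the snippet above is equivalent to constructing const(0.5) explicitly. The explicit form is handy when you also want to attach a jitter. A sketch assuming the jitters.full helper shown later on this page:

import asyncio

import httpx

from hyx.retry import jitters, retry
from hyx.retry.backoffs import const


# equivalent to backoff=0.5, but with a full jitter applied to every delay
@retry(on=httpx.NetworkError, backoff=const(0.5, jitter=jitters.full))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("pikachu"))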

Interval Backoff

You can also provide a list or a tuple of floats to pull delays from it in a sequential and cyclical manner.

import asyncio

import httpx

from hyx.retry import retry


@retry(on=httpx.NetworkError, attempts=4, backoff=(0.5, 1.0, 1.5, 2.0))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("slowpoke"))

The list[float] and tuple[float, ...] backoffs are just aliases for the interval backoff.

class hyx.retry.backoffs.interval(delay_secs, *, jitter=None)

Interval Delay(s) Backoff

Parameters:

  • delay_secs (Sequence[float]) - How much time we wait on each retry. The next delay is taken from the list on each retry, repeating from the beginning if the list is shorter than the number of attempts (see the sketch after this list).
  • jitter (optional) - Decorrelate delays with the jitter. No jitter by default
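
Since backoffs are plain iterators (see Custom Backoffs below), the cyclical behaviour is easy to observe directly. A small sketch; the printed values assume no jitter and simply follow the documented cycling rule:

from itertools import islice

from hyx.retry.backoffs import interval

# with three delays and five retries, the sequence starts over after the third attempt
delays = interval((0.5, 1.0, 1.5))
print(list(islice(delays, 5)))  # expected: [0.5, 1.0, 1.5, 0.5, 1.0]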

Exponential Backoff

Exponential backoff is one of the most popular backoff strategies. Its delays grow rapidly, which gives the faulty functionality more and more time to recover on each retry.

Hyx implements the Capped Exponential Backoff that lets you specify the max_delay_secs bound:

import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import expo


@retry(on=httpx.NetworkError, backoff=expo(min_delay_secs=10, base=2, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("psyduck"))
class hyx.retry.backoffs.expo(*, min_delay_secs=1, base=2, max_delay_secs=None, jitter=None)

Exponential Backoff (delay = min_delay_secs * base ** attempt)

Parameters:

  • min_delay_secs - The minimal initial delay
  • base - The base of the exponential function
  • max_delay_secs (optional) - Limit the longest possible delay
  • jitter (optional) - Decorrelate delays with the jitter. No jitter by default
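
As a concrete illustration of the formula above, you can iterate the backoff directly and watch the delays grow until max_delay_secs caps them. A sketch; the printed values are derived from delay = min_delay_secs * base ** attempt, not taken from the library's output:

from itertools import islice

from hyx.retry.backoffs import expo

# min_delay_secs=1, base=2, capped at 60 seconds
delays = expo(min_delay_secs=1, base=2, max_delay_secs=60)
print(list(islice(delays, 8)))  # expected: [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]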

Linear Backoff

Linear backoff grows linearly by adding additive_secs on each retry:

import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import linear


@retry(on=httpx.NetworkError, backoff=linear(min_delay_secs=10, additive_secs=10, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("fomantis"))
class hyx.retry.backoffs.linear(*, min_delay_secs=1, additive_secs=1.0, max_delay_secs=None, jitter=None)

Linear Backoff

Parameters:

  • min_delay_secs - The minimal initial delay
  • additive_secs - How many seconds to add on each retry
  • max_delay_secs (optional) - Limit the longest possible delay
  • jitter (optional) - Decorrelate delays with the jitter. No jitter by default

Fibonacci Backoff

Another rapidly growing backoff is based on the Fibonacci sequence:

import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import fibo


@retry(on=httpx.NetworkError, backoff=fibo(min_delay_secs=10, factor_secs=5))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("kartana"))
class hyx.retry.backoffs.fibo(*, min_delay_secs=1, factor_secs=1, max_delay_secs=None, jitter=None)

Fibonacci Backoff

Parameters:

  • min_delay_secs - The minimal initial delay
  • factor_secs - Defines the second element in the initial Fibonacci sequence
  • max_delay_secs (optional) - Limit the longest possible delay
  • jitter (optional) - Decorrelate delays with the jitter. No jitter by default

Decorrelated Exponential Backoff

This is a complex backoff strategy proposed by AWS Research. It's based on the exponential backoff and includes the full jitter. On every retry, it exponentially widens the range of possible delays.
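
Roughly, the idea (as described in the AWS blog post) is that each delay is drawn uniformly between the minimal delay and the previous delay multiplied by the base, capped by the maximum. A conceptual sketch, not Hyx's exact implementation:

import random


def decorrexp_sketch(min_delay: float, max_delay: float, base: float, previous_delay: float) -> float:
    # the upper bound grows with the previous delay, so the range widens on every retry
    return min(max_delay, random.uniform(min_delay, previous_delay * base))

In Hyx, decorrexp is used like any other backoff: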

import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import decorrexp


@retry(on=httpx.NetworkError, backoff=decorrexp(min_delay_secs=10, max_delay_secs=60, base=20))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("arrokuda"))
class hyx.retry.backoffs.decorrexp(min_delay_secs, max_delay_secs, base=3)

Decorrelated Exponential Backoff with Built-in Jitter

Parameters:

  • min_delay_secs - The minimal initial delay
  • base - The base of the exponential function
  • max_delay_secs (optional) - Limit the longest possible delay

Soft Exponential Backoff (Beta)

Soft Exponential Backoff is another variation of complex exponential backoffs with built-in jitter. It was authored by the Polly community as a less spiky alternative to Decorrelated Exponential Backoff.

import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import softexp


@retry(on=httpx.NetworkError, backoff=softexp(median_delay_secs=35, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("arrokuda"))
class hyx.retry.backoffs.softexp(*, median_delay_secs, max_delay_secs=None, pfactor=4.0, rp_scaling_factor=0.7142857142857143)

Soft Exponential Backoff with Built-in Jitter

Parameters:

  • median_delay_secs - The median of the first retry delay
  • max_delay_secs (optional) - Limit the longest possible delay
  • pfactor -
  • rp_scaling_factor -

Custom Backoffs

In the Hyx design, backoffs are just iterators that return float numbers and can go on infinitely.

Here is how the factorial backoff could be implemented:

import asyncio
from typing import Iterator

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import MS_TO_SECS, SECS_TO_MS


class factorial(Iterator[float]):
    """
    Custom Factorial Backoff
    """

    def __init__(
        self,
        *,
        min_delay_secs: float = 1,
    ) -> None:
        self._min_delay_ms = min_delay_secs * SECS_TO_MS

        self._attempt = 1
        self._current_delay_ms = self._min_delay_ms

    def __iter__(self) -> "factorial":
        self._attempt = 1
        self._current_delay_ms = self._min_delay_ms

        return self

    def __next__(self) -> float:
        current_delay_ms = self._current_delay_ms

        # multiply the next delay by the attempt number, so delays follow min_delay_secs * n!
        self._attempt += 1
        self._current_delay_ms *= self._attempt

        return current_delay_ms * MS_TO_SECS


@retry(on=httpx.NetworkError, backoff=factorial(min_delay_secs=20))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("fomantis"))

Note

The built-in backoffs accept delay params in seconds but work with milliseconds under the hood. That improves the granularity of the generated delays. The generated delays are then returned in seconds again.

Jitters

In high-load setups, when several clients are pulling the same API, or when a set of background tasks runs on a schedule, they may happen to perform their actions simultaneously. That triggers traffic spikes or an unusually high load on the backend system. When several clients use retries, they may trigger load spikes in the same way.

That may push your system to autoscale without a real need, which is not very efficient.

In that case, we say that the requests were correlated.

In order to mitigate this problem, we can use jitters, which are essentially a way to decorrelate your requests by adding some randomness. That helps to distribute the load more evenly and process the same amount of requests with less capacity.

In the Hyx design, jitters are part of the backoff strategy.

Note

The constant, interval, exponential, linear, and Fibonacci backoffs support the jitters listed below via an optional argument.

Full Jitter

Full Jitter is a decorrelation strategy proposed by AWS Research.

It draws a delay uniformly from the range between zero and your upper bound:

import asyncio

import httpx

from hyx.retry import jitters, retry
from hyx.retry.backoffs import expo


@retry(on=httpx.NetworkError, backoff=expo(min_delay_secs=10, max_delay_secs=60, jitter=jitters.full))
async def get_poke_data(pokemon: str) -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("psyduck"))

Note

Full jitter may decide to perform the action right away, without any delay.

hyx.retry.jitters.full(delay)

Full Interval Jitter

Draw a jitter value from [0, upper_bound] interval uniformly

Parameters:

  • delay - The delay to jitter
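
Conceptually, full jitter replaces the computed delay with a uniformly random value between zero and that delay. A rough sketch of the idea, not the library's actual implementation:

import random


def full_jitter_sketch(delay: float) -> float:
    # draw the final delay uniformly from [0, delay];
    # zero is possible, which is why the retry may fire immediately
    return random.uniform(0, delay)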

Equal Jitter

Another jitter algorithm proposed by AWS Research.

It takes the middle of the given delay and adds an additional delay drawn uniformly at random from the remaining half of the interval.

Note

Equal Jitter guarantees that you will wait at least half of the given delay.

hyx.retry.jitters.equal(delay)

Equal Jitter

Parameters:

  • delay - The delay to jitter
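
Conceptually, equal jitter keeps half of the computed delay and randomizes only the remaining half, which is why at least half of the delay is always waited. A rough sketch of the idea, not the library's actual implementation:

import random


def equal_jitter_sketch(delay: float) -> float:
    # always wait at least half of the delay, randomize the other half uniformly
    half = delay / 2
    return half + random.uniform(0, half)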

Jittered Backoffs

Decorrelated Exponential and Soft Exponential backoffs provide built-in decorrelation as a part of their algorithm.

Custom Jitters

Hyx uses jitters as a part of backoff strategies. Jitters are callables that take a delay in milliseconds generated by the backoff and return the final delay in milliseconds.

Note

Jitters can modify the final delay returned by the backoff algorithm.

import asyncio
import random
from functools import partial

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import expo


def randomixin(delay: float, *, max_mixing: float = 20) -> float:
    """
    Custom Random Mixin Jitter
    """

    return delay + random.uniform(0, max_mixing)


@retry(on=httpx.NetworkError, backoff=expo(min_delay_secs=20, jitter=partial(randomixin, max_mixing=50)))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("cryogonal"))

Backoffs Outside Retries

Backoffs and jitters can be useful even outside of retries.

Worker Pools

In the following example, we create a pool of in-process workers. If we had no delays between their scheduling, they would all be started almost instantaneously. They would then compete with each other pulling tasks from the database.

In order to avoid that, we are introducing a little jitter that decorrelates their startup time:

import asyncio

from hyx.retry import jitters


async def run_worker(delay_between_tasks_secs: float = 10) -> None:
    """
    The Worker's Logic
    """
    while True:
        # pull tasks from the database and process it
        await asyncio.sleep(jitters.full(delay_between_tasks_secs))


async def run_worker_pool(workers: int = 5, schedule_delay_secs: float = 5) -> None:
    """
    Worker Manager
    Schedules a set of workers with jittering their startup times
    """
    tasks: list[asyncio.Task] = []

    for _ in range(workers):
        tasks.append(asyncio.create_task(run_worker()))
        await asyncio.sleep(jitters.full(schedule_delay_secs))

    await asyncio.gather(*tasks)


asyncio.run(run_worker_pool())

Besides that, we are jittering the workers' rest time, increasing the chances that the workers' lifecycles end up being different.

Best Practices

Limit Retry Attempts

Hyx supports an option to retry infinitely, but that should generally be considered an antipattern.

import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import expo

# Don't do this


@retry(on=httpx.NetworkError, attempts=None, backoff=expo(min_delay_secs=10, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("skiddo"))

Always prefer a limited number of retries over infinite attempts.
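
A bounded configuration of the same call could look like the sketch below (the choice of 3 attempts is just an illustration):

import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import expo

# Do this instead: a limited number of attempts with a capped exponential backoff


@retry(on=httpx.NetworkError, attempts=3, backoff=expo(min_delay_secs=10, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("skiddo"))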

Specify Delays

You can disable delays between retries, but that's another antipattern you should not follow:

import asyncio

import httpx

from hyx.retry import retry

# Don't do this


@retry(on=httpx.NetworkError, backoff=0.0)
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("noibat"))

Without delays, retries can easily overload your system and create a situation known as a retry storm.

When and What to Retry

It's important to realize that not every action should be retried. When you are dealing with non-idempotent APIs, retrying can introduce duplicated entries into the system.

When it comes to HTTP requests, you should retry based on the server's response and only consider error codes that are temporary in nature (e.g. 5xx errors).
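
One way to express that with Hyx is to turn temporary server-side status codes into an exception and retry only on it. A minimal sketch; the ServerError class and the 5xx check are illustrative and not part of Hyx:

import asyncio

import httpx

from hyx.retry import retry


class ServerError(Exception):
    """Raised for responses that are likely to be temporary (5xx)."""


@retry(on=ServerError, attempts=3, backoff=1.0)
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        # retry only on server-side errors; 4xx responses are not retried
        if response.status_code >= 500:
            raise ServerError(f"temporary server error: {response.status_code}")

        return response.json()


asyncio.run(get_poke_data("noibat"))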

Avoid Retry Storms

A retry storm is a well-known issue that arises when retries are badly configured or placed in the wrong part of the system.

Excessive retries can overload some parts of your system and bring them down. The two antipatterns above are common ways to misconfigure retries. That's why you should always limit the number of retry attempts and give the downstream system some time between retries to recover.

The place where retries are added is equally important to avoid retry storms. Consider the following case:

Figure: A system with retries configured on multiple levels

The system above has retries configured on two levels: the gateway (lvl1) and the orders microservice (lvl2). If the inventory microservice fails, the orders service will first exhaust all of its retries, and only then will the failure get back to the gateway. The gateway will then retry two more times.

The total number of requests to the inventory microservice will be 3 * 3 = 9 (three gateway attempts, each triggering three attempts from the orders service). If we had a deeper request chain with more retries along the way, all of them would multiply and create an even worse load on the system.

The general rule of thumb here is to retry in the component that is directly above the failed one. In this case, it would be okay to retry on the orders level only.