Retries

Introduction

Distributed systems are full of temporary issues: network failures, sudden latency increases, bandwidth exhaustion, node evictions, partial pod rescheduling, temporary overloading of microservices, etc. All of these can create situations where your requests may be delayed, queued, or failed.

What can you do in that case? The most natural answer is to retry your request. Hence, retry is the most fundamental, intuitive, and commonly used component in our resilience toolkit.

Retries look deceptively simple and straightforward, but their real-world usage is more nuanced, as you will discover throughout this page.

Use Cases

  • Retries hide temporary, short-lived errors
  • Jitters are useful to reduce congestion on resources

Usage

Hyx provides a decorator that brings retry functionality to any function:

import asyncio

import httpx

from hyx.retry import retry


@retry(on=httpx.NetworkError, attempts=4)
async def get_gh_events() -> list:
    async with httpx.AsyncClient() as client:
        response = await client.get("https://api.github.com/events")

        return response.json()


asyncio.run(get_gh_events())

hyx.retry.retry(*, on=, attempts=3, backoff=0.5, name=None, listeners=None, event_manager=None)

The @retry() decorator retries the function on exceptions for the given number of attempts. The delay after each retry is defined by the backoff strategy.

Parameters:

  • on - Exception or tuple of exceptions to retry on.
  • attempts - How many times to retry. If None, it retries indefinitely until the call succeeds.
  • backoff - Backoff strategy that defines the delay on each retry. Takes a float (delay in seconds), a list[float] (delays for each retry attempt), or an Iterator[float].
  • name (None | str) - A component name or ID (passed to listeners and mentioned in metrics)
  • listeners (None | Sequence[TimeoutListener]) - List of listeners for this concrete component's state
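To make the interplay of these parameters concrete, here is a minimal sketch of a retry loop. It is not Hyx's implementation: `call_with_retries` is a hypothetical helper, it treats `attempts` as the total number of calls (an assumption), and it simply re-raises the last error once attempts run out.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, Optional, Tuple, Type, TypeVar, Union

T = TypeVar("T")
ExceptionsT = Union[Type[Exception], Tuple[Type[Exception], ...]]


async def call_with_retries(
    func: Callable[[], Awaitable[T]],
    *,
    on: ExceptionsT,
    attempts: Optional[int],
    delays: Iterator[float],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Hypothetical helper: call func, retrying on the given exceptions."""
    attempt = 0

    while True:
        attempt += 1

        try:
            return await func()
        except on:
            # Assumption: `attempts` counts total calls; None means retry forever.
            if attempts is not None and attempt >= attempts:
                raise

            # The backoff is just an iterator of delays in seconds.
            await sleep(next(delays))
```

Exceptions that do not match `on` propagate immediately, which is why choosing a narrow exception type matters.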

Backoffs

The backoff strategy is a crucial parameter to consider. Depending on the backoff, the retry component can either help your system or become a source of problems.

Warning

For the sake of simplicity, Hyx assumes that you are following AsyncIO best practices and not running CPU-intensive operations in the main thread. Otherwise, backoff delays may fire later than configured, once the thread is unblocked.

Constant Backoff

The most basic backoff strategy is to wait a constant amount of time on each retry.

import asyncio

import httpx

from hyx.retry import retry


@retry(on=httpx.NetworkError, backoff=0.5)  # delay 500ms on each retry
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("pikachu"))

The float backoffs are just aliases for the const backoff.

class hyx.retry.backoffs.const(delay_secs, *, jitter=None)

Constant Delay(s) Backoff

Parameters:

  • delay_secs (float, int) - How much time do we wait on each retry.
  • jitter (optional) - Decorrelate delays with the jitter. No jitter by default

Interval Backoff

You can also provide a list or tuple of floats to pull delays from in a sequential and cyclical manner.

import asyncio

import httpx

from hyx.retry import retry


@retry(on=httpx.NetworkError, attempts=4, backoff=(0.5, 1.0, 1.5, 2.0))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("slowpoke"))

The list[float] and tuple[float, ...] backoffs are just aliases for the interval backoff.

class hyx.retry.backoffs.interval(delay_secs, *, jitter=None)

Interval Delay(s) Backoff

Parameters:

  • delay_secs (Sequence[float]) - How long to wait on each retry. The next delay is taken from the sequence on each retry, starting again from the beginning if the sequence is shorter than the number of attempts
  • jitter (optional) - Decorrelate delays with the jitter. No jitter by default
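The cyclical behaviour can be sketched with the standard library alone. This is only an illustration of the wrap-around, not Hyx's code, and `interval_delays` is a hypothetical name.

```python
from itertools import cycle, islice
from typing import Iterator, Sequence


def interval_delays(delay_secs: Sequence[float]) -> Iterator[float]:
    """Yield delays from the sequence in order, wrapping around at the end."""
    return cycle(delay_secs)


# Six retries with only four configured delays: the sequence wraps around.
print(list(islice(interval_delays((0.5, 1.0, 1.5, 2.0)), 6)))
# [0.5, 1.0, 1.5, 2.0, 0.5, 1.0]
```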

Exponential Backoff

Exponential backoff is one of the most popular backoff strategies. Its delays grow rapidly, giving the faulty functionality more and more time to recover on each retry.

Hyx implements Capped Exponential Backoff, which allows you to specify a max_delay_secs bound:

import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import expo


@retry(on=httpx.NetworkError, backoff=expo(min_delay_secs=10, base=2, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("psyduck"))

class hyx.retry.backoffs.expo(*, min_delay_secs=1, base=2, max_delay_secs=None, jitter=None)

Exponential Backoff (delay = min_delay_secs * base ** attempt)

Parameters:

  • min_delay_secs - The minimal initial delay
  • base - The base of the exponential function
  • max_delay_secs (optional) - Limit the longest possible delay
  • jitter (optional) - Decorrelate delays with the jitter. No jitter by default
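As a worked example of the formula above, the following sketch computes the capped delays for the configuration used in the example. `expo_delays` is a hypothetical helper, and counting `attempt` from zero is an assumption; Hyx's own indexing may differ.

```python
from typing import List, Optional


def expo_delays(
    retries: int,
    *,
    min_delay_secs: float = 1,
    base: float = 2,
    max_delay_secs: Optional[float] = None,
) -> List[float]:
    """Compute capped exponential delays for the given number of retries."""
    delays = []

    for attempt in range(retries):  # assumption: attempts are counted from 0
        delay = min_delay_secs * base**attempt

        if max_delay_secs is not None:
            delay = min(delay, max_delay_secs)

        delays.append(delay)

    return delays


print(expo_delays(5, min_delay_secs=10, base=2, max_delay_secs=60))
# [10, 20, 40, 60, 60]
```

Without the cap, the fourth and fifth delays would already be 80 and 160 seconds, which is why an upper bound is usually worth setting.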

Linear Backoff

Linear Backoff grows linearly by adding additive_secs on each retry:

import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import linear


@retry(on=httpx.NetworkError, backoff=linear(min_delay_secs=10, additive_secs=10, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("fomantis"))

class hyx.retry.backoffs.linear(*, min_delay_secs=1, additive_secs=1.0, max_delay_secs=None, jitter=None)

Linear Backoff

Parameters:

  • min_delay_secs - The minimal initial delay
  • additive_secs - How many seconds to add on each retry
  • max_delay_secs (optional) - Limit the longest possible delay
  • jitter (optional) - Decorrelate delays with the jitter. No jitter by default
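For comparison with exponential growth, here is a small sketch of how a capped linear sequence unfolds. `linear_delays` is a hypothetical helper written for illustration, assuming the first delay equals `min_delay_secs`.

```python
from typing import List, Optional


def linear_delays(
    retries: int,
    *,
    min_delay_secs: float = 1,
    additive_secs: float = 1,
    max_delay_secs: Optional[float] = None,
) -> List[float]:
    """Compute capped linear delays for the given number of retries."""
    delays = []
    delay = min_delay_secs  # assumption: the first delay is the minimal one

    for _ in range(retries):
        capped = delay if max_delay_secs is None else min(delay, max_delay_secs)
        delays.append(capped)
        delay += additive_secs

    return delays


print(linear_delays(7, min_delay_secs=10, additive_secs=10, max_delay_secs=60))
# [10, 20, 30, 40, 50, 60, 60]
```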

Fibonacci Backoff

Another rapidly growing backoff is based on the Fibonacci sequence:

import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import fibo


@retry(on=httpx.NetworkError, backoff=fibo(min_delay_secs=10, factor_secs=5))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("kartana"))

class hyx.retry.backoffs.fibo(*, min_delay_secs=1, factor_secs=1, max_delay_secs=None, jitter=None)

Fibonacci Backoff

Parameters:

  • min_delay_secs - The minimal initial delay
  • factor_secs - Defines the second element in the initial Fibonacci sequence
  • max_delay_secs (optional) - Limit the longest possible delay
  • jitter (optional) - Decorrelate delays with the jitter. No jitter by default

Decorrelated Exponential Backoff

This is a complex backoff strategy proposed by AWS Research. It's based on exponential backoff and includes full jitter. On every retry, it exponentially widens the range of possible delays.

import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import decorrexp


@retry(on=httpx.NetworkError, backoff=decorrexp(min_delay_secs=10, max_delay_secs=60, base=20))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("arrokuda"))

class hyx.retry.backoffs.decorrexp(min_delay_secs, max_delay_secs, base=3)

Decorrelated Exponential Backoff with Built-in Jitter

Parameters:

  • min_delay_secs - The minimal initial delay
  • base - The base of the exponential function
  • max_delay_secs - Limit the longest possible delay
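The idea of widening the range on every retry can be sketched as follows. This follows the decorrelated jitter recipe as commonly described (each delay is drawn uniformly between the minimal delay and a multiple of the previous delay, then capped); it is an approximation for illustration, not Hyx's exact code, and `decorrelated_delays` and its `rng` argument are hypothetical.

```python
import random
from typing import Iterator, Optional


def decorrelated_delays(
    min_delay_secs: float,
    max_delay_secs: float,
    base: float = 3,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield delays where each one is sampled relative to the previous one."""
    rng = rng or random.Random()
    delay = min_delay_secs

    while True:
        # The upper bound grows with the previous delay, then the cap applies.
        delay = min(max_delay_secs, rng.uniform(min_delay_secs, delay * base))

        yield delay
```

Because every delay depends on a random previous value, two clients that failed at the same moment quickly drift apart.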

Soft Exponential Backoff (Beta)

Soft Exponential Backoff is another variation of complex exponential backoffs with built-in jitter. It was authored by the Polly community as a less spiky alternative to Decorrelated Exponential Backoff.

import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import softexp


@retry(on=httpx.NetworkError, backoff=softexp(median_delay_secs=35, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("arrokuda"))

class hyx.retry.backoffs.softexp(*, median_delay_secs, max_delay_secs=None, pfactor=4.0, rp_scaling_factor=0.7142857142857143)

Soft Exponential Backoff with Built-in Jitter

Parameters:

  • median_delay_secs - The median initial delay
  • max_delay_secs (optional) - Limit the longest possible delay
  • pfactor -
  • rp_scaling_factor -

Custom Backoffs

In Hyx's design, backoffs are simply iterators that return float numbers and can continue indefinitely.

Here is how a factorial backoff could be implemented:

import asyncio
from typing import Iterator

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import MS_TO_SECS, SECS_TO_MS


class factorial(Iterator[float]):
    """
    Custom Factorial Backoff (delay = min_delay_secs * attempt!)
    """

    def __init__(
        self,
        *,
        min_delay_secs: float = 1,
    ) -> None:
        self._min_delay_ms = min_delay_secs * SECS_TO_MS

        self._attempt = 1
        self._current_delay_ms = self._min_delay_ms

    def __iter__(self) -> "factorial":
        # Reset the state so the backoff can be reused
        self._attempt = 1
        self._current_delay_ms = self._min_delay_ms

        return self

    def __next__(self) -> float:
        current_delay_ms = self._current_delay_ms

        # Multiply by the next integer: min_delay * 1!, 2!, 3!, ...
        self._attempt += 1
        self._current_delay_ms *= self._attempt

        return current_delay_ms * MS_TO_SECS


@retry(on=httpx.NetworkError, backoff=factorial(min_delay_secs=20))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("fomantis"))

Note

The built-in backoffs accept delay parameters in seconds but work with milliseconds internally. This improves the granularity of generated delays. The delays are then returned in seconds.

Jitters

In high-load setups, or when multiple requesters are trying to access the same API, or with a set of background tasks running on a schedule, situations may arise where they happen to perform actions simultaneously. This triggers traffic spikes or unusually high load on the backend system. When you use retries across multiple clients, they can trigger load spikes in the same way.

This may push your system to autoscale unnecessarily.

In such cases, we say the requests are correlated.

To mitigate this problem, we can use jitters, which essentially decorrelate your requests by adding randomness. This helps distribute load more evenly and process the same volume of requests with less capacity.
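A tiny simulation illustrates the difference. It is a sketch under simple assumptions: all clients fail at the same instant, and `first_retry_times` is a hypothetical helper that returns when each client would retry.

```python
import random
from typing import Callable, List, Optional


def first_retry_times(
    clients: int,
    delay_secs: float,
    jitter: Optional[Callable[[float], float]] = None,
) -> List[float]:
    """Seconds after a shared failure at which each client retries."""
    return [jitter(delay_secs) if jitter else delay_secs for _ in range(clients)]


rng = random.Random(42)

# Without jitter, every client retries at exactly the same moment.
correlated = first_retry_times(100, 10.0)

# With a uniform jitter, retries are spread over the whole 10-second window.
decorrelated = first_retry_times(100, 10.0, jitter=lambda d: rng.uniform(0, d))

print(len(set(correlated)))  # 1
print(min(decorrelated), max(decorrelated))
```

In the first case the backend sees one burst of 100 requests; in the second, the same 100 requests arrive at different moments within the window.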

In Hyx's design, jitters are part of the backoff strategy.

Note

Constant, interval, exponential, linear, and Fibonacci backoffs support the jitters listed below as an optional argument.

Full Jitter

Full Jitter is a decorrelation strategy proposed by AWS Research.

It uniformly selects a delay from the range between zero and your upper bound:

import asyncio

import httpx

from hyx.retry import jitters, retry
from hyx.retry.backoffs import expo


@retry(on=httpx.NetworkError, backoff=expo(min_delay_secs=10, max_delay_secs=60, jitter=jitters.full))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("psyduck"))

Note

Full jitter may choose to perform the action immediately without any delay.

hyx.retry.jitters.full(delay)

Full Interval Jitter

Draw a jitter value from [0, upper_bound] interval uniformly

Parameters:

  • delay - The delay to jitter
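In essence, the strategy amounts to a single uniform draw. The sketch below is a stand-in for illustration; `full_jitter` is a hypothetical name and not the library function itself.

```python
import random


def full_jitter(delay: float) -> float:
    """Pick a delay uniformly from the [0, delay] interval."""
    return random.uniform(0, delay)


# Every sample lands somewhere between zero and the original delay.
samples = [full_jitter(100) for _ in range(5)]
print(samples)
```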

Equal Jitter

Another jitter algorithm proposed by AWS Research.

It takes the middle of the given interval and adds some additional delay, drawn uniformly at random from the halved interval.

Note

Equal Jitter guarantees that you will wait at least half of the given delay interval.

hyx.retry.jitters.equal(delay)

Equal Jitter

Parameters:

  • delay - The delay to jitter
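The description above translates into a few lines of code. This is a sketch for illustration only; `equal_jitter` is a hypothetical name and not the library function itself.

```python
import random


def equal_jitter(delay: float) -> float:
    """Keep half of the delay fixed and randomize the other half."""
    half = delay / 2

    return half + random.uniform(0, half)


# Every sample lands between half of the delay and the full delay.
samples = [equal_jitter(100) for _ in range(5)]
print(samples)
```

Compared to a draw from the whole interval, this trades some decorrelation for a guaranteed minimal wait.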

Jittered Backoffs

Decorrelated Exponential and Soft Exponential backoffs provide built-in decorrelation as part of their algorithm.

Custom Jitters

Hyx uses jitters as part of backoff strategies. Jitters are callables that take a delay in milliseconds generated by the backoff and return the final delay in milliseconds.

Note

Jitters can modify the final delay returned by the backoff algorithm.

import asyncio
import random
from functools import partial

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import expo


def randomixin(delay: float, *, max_mixing: float = 20) -> float:
    """
    Custom Random Mixin Jitter
    """

    return delay + random.uniform(0, max_mixing)


@retry(on=httpx.NetworkError, backoff=expo(min_delay_secs=20, jitter=partial(randomixin, max_mixing=50)))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("cryogonal"))

Backoffs Outside Retries

Backoffs and jitters can be useful even outside of retries.

Worker Pools

In the following example, we create a pool of in-process workers. If there were no delays between their scheduling, they would start almost instantaneously and compete with each other when pulling tasks from the database.

To avoid this, we introduce a small jitter that decorrelates their startup times:

import asyncio

from hyx.retry import jitters


async def run_worker(delay_between_tasks_secs: float = 10) -> None:
    """
    The Worker's Logic
    """
    while True:
        # pull tasks from the database and process them
        await asyncio.sleep(jitters.full(delay_between_tasks_secs))


async def run_worker_pool(workers: int = 5, schedule_delay_secs: float = 5) -> None:
    """
    Worker Manager
    Schedules a set of workers, jittering their startup times
    """
    tasks: list[asyncio.Task] = []

    for _ in range(workers):
        tasks.append(asyncio.create_task(run_worker()))
        await asyncio.sleep(jitters.full(schedule_delay_secs))

    await asyncio.gather(*tasks)


asyncio.run(run_worker_pool())

Additionally, we jitter each worker's rest time, increasing the chances that their lifecycles end up being different.

Best Practices

Limit Retry Attempts

Hyx supports an option to retry infinitely, but this should generally be considered an antipattern.

import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import expo

# Don't do this


@retry(on=httpx.NetworkError, attempts=None, backoff=expo(min_delay_secs=10, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("skiddo"))

Always prefer limiting the number of retries over infinite attempts.

Specify Delays

You can disable delays between retries, but that's another antipattern you should avoid:

import asyncio

import httpx

from hyx.retry import retry

# Don't do this


@retry(on=httpx.NetworkError, backoff=0.0)
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")

        return response.json()


asyncio.run(get_poke_data("noibat"))

Without delays, retries can easily overwhelm your system and create a situation known as a retry storm.

When and What to Retry

It's important to realize that not every action should be retried. When dealing with non-idempotent APIs, retrying can introduce duplicate entries in the system.

When it comes to HTTP requests, you should retry based on server response errors and consider error codes that are temporary in nature (e.g., 5xx errors).
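One way to apply this is to turn temporary server responses into a dedicated exception and retry only on that exception. The sketch below is an assumption-laden illustration: `TransientServerError`, `is_retryable_status`, and `raise_if_transient` are hypothetical names, and treating the whole 5xx range as retryable is a simplification you may want to narrow.

```python
class TransientServerError(Exception):
    """Hypothetical exception signalling a retryable HTTP response."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"transient server error: HTTP {status_code}")
        self.status_code = status_code


def is_retryable_status(status_code: int) -> bool:
    """Assumption: only 5xx responses are treated as temporary failures."""
    return 500 <= status_code <= 599


def raise_if_transient(status_code: int) -> None:
    """Raise the retryable exception for temporary server errors."""
    if is_retryable_status(status_code):
        raise TransientServerError(status_code)
```

A function decorated with `@retry(on=TransientServerError, ...)` could call `raise_if_transient(response.status_code)` after each request, so client errors such as 404 are returned immediately instead of being retried.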

Avoid Retry Storms

The retry storm is a well-known issue that occurs when retries are poorly configured or placed in the wrong part of the system.

Excessive retries can overload parts of your system and bring it down. The two antipatterns above are common ways to misconfigure retries. That's why you should always limit the number of retry attempts and allow time between retries for the downstream system to recover.

The placement of retries is equally important for avoiding retry storms. Consider the following case:

A system with retries configured on multiple levels

This system has retries configured at two levels: the gateway (level 1) and the orders microservice (level 2). If the inventory microservice fails, the orders microservice will first exhaust all of its retries and then return the error to the gateway. The gateway will then retry two more times, and each of those retries triggers a full round of retries at the orders level.

The total number of requests to the inventory microservice will be 3 * 3 = 9. If there were a deeper request chain with more retries along the way, they would all multiply and create even worse load on the system.
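The multiplication can be checked with a few lines of arithmetic. `worst_case_requests` is a hypothetical helper, and it assumes that every level exhausts all of its attempts.

```python
from math import prod
from typing import Sequence


def worst_case_requests(attempts_per_level: Sequence[int]) -> int:
    """Requests reaching the failing service when every level retries fully."""
    return prod(attempts_per_level)


print(worst_case_requests([3, 3]))  # 9: gateway and orders, 3 attempts each
print(worst_case_requests([3, 3, 3]))  # 27: one more retrying level in the chain
```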

The general rule of thumb is to retry only in the component directly above the failed one. In this case, it would be appropriate to retry only at the orders level.