Retries¶
Introduction¶
Distributed systems are full of temporary issues: network failures, sudden latency increases, bandwidth exhaustion, node evictions, partial pod rescheduling, temporary overloading of microservices, etc. All of these can create situations where your requests may be delayed, queued, or failed.
What can you do in that case? The most natural answer is to retry your request. Hence, retry is the most fundamental, intuitive, and commonly used component in our resilience toolkit.
However, retries may look deceptively simple and straightforward. The real usage of retries is more nuanced, as you will discover throughout this page.
Use Cases¶
- Retries hide temporary, short-lived errors
- Jitters are useful to reduce congestion on resources
Usage¶
Hyx provides a decorator that brings retry functionality to any function:
import asyncio

import httpx

from hyx.retry import retry


@retry(on=httpx.NetworkError, attempts=4)
async def get_gh_events() -> list:
    async with httpx.AsyncClient() as client:
        response = await client.get("https://api.github.com/events")
        return response.json()


asyncio.run(get_gh_events())
hyx.retry.retry(*, on=…)
The @retry() decorator retries the function on exceptions for the given number of attempts.
Delays between retries are defined by the backoff strategy.
Parameters:
- on - Exception or tuple of Exceptions we need to retry on.
- attempts - How many times we need to retry. If None, it will retry infinitely until success.
- backoff - Backoff strategy that defines delays on each retry. Takes float numbers (delay in secs), list[float] (delays on each retry attempt), or Iterator[float]
- name (None | str) - A component name or ID (will be passed to listeners and mentioned in metrics)
- listeners (None | Sequence[TimeoutListener]) - List of listeners of this concrete component state
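To make the parameters above more concrete, here is a minimal sketch of what a retry decorator does conceptually. It is not Hyx's implementation: `simple_retry`, its `delays` parameter, and the `flaky` function are hypothetical names used only for illustration, and the sketch assumes that `attempts` counts the total number of calls.

```python
import asyncio
import functools
import itertools
from typing import Any, Awaitable, Callable, Iterator, Tuple, Type, Union

ExceptionTypes = Union[Type[BaseException], Tuple[Type[BaseException], ...]]


def simple_retry(*, on: ExceptionTypes, attempts: int, delays: Iterator[float]) -> Callable:
    """Hypothetical, simplified retry decorator (an illustration, not Hyx's code)."""

    def decorator(func: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
        @functools.wraps(func)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            for attempt in range(1, attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except on:
                    if attempt == attempts:
                        raise  # attempts exhausted: surface the last error
                    # wait for the next delay produced by the backoff iterator
                    await asyncio.sleep(next(delays))

        return wrapper

    return decorator


calls = 0


@simple_retry(on=ConnectionError, attempts=4, delays=itertools.repeat(0.01))
async def flaky() -> str:
    """Fails twice with a temporary error, then succeeds."""
    global calls
    calls += 1
    if calls < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = asyncio.run(flaky())
```

In this sketch the caller never sees the two temporary failures: the third call succeeds and its result is returned.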
Backoffs¶
The backoff strategy is a crucial parameter to consider. Depending on the backoff, the retry component can either help your system or become a source of problems.
Warning
For the sake of simplicity, Hyx assumes that you are following AsyncIO best practices and not running CPU-intensive operations in the main thread. Otherwise, the backoff delays may fire later than scheduled, only after the thread is unblocked.
Constant Backoff¶
The most basic backoff strategy is to wait a constant amount of time on each retry.
import asyncio

import httpx

from hyx.retry import retry


@retry(on=httpx.NetworkError, backoff=0.5)  # delay 500ms on each retry
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("pikachu"))
The float backoffs are just aliases for the const backoff.
hyx.retry.backoffs.const(delay_secs, *, jitter=None)
Constant Delay(s) Backoff
Parameters:
- delay_secs (float, int) - How much time do we wait on each retry.
- jitter (optional) - Decorrelate delays with the jitter. No jitter by default
Interval Backoff¶
You can also provide a list or tuple of floats to pull delays from in a sequential and cyclical manner.
import asyncio

import httpx

from hyx.retry import retry


@retry(on=httpx.NetworkError, attempts=4, backoff=(0.5, 1.0, 1.5, 2.0))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("slowpoke"))
The list[float] and tuple[float, ...] backoffs are just aliases for the interval backoff.
hyx.retry.backoffs.interval(delay_secs, *, jitter=None)
Interval Delay(s) Backoff
Parameters:
- delay_secs (Sequence[float]) - How long to wait on each retry. The next delay is taken from this sequence on every retry, starting over from the beginning if the sequence is shorter than the number of attempts
- jitter (optional) - Decorrelate delays with the jitter. No jitter by default
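The sequential and cyclical behavior described above can be sketched with the standard library alone. This is an illustration of the idea rather than Hyx's code, and `cyclic_delays` is a hypothetical helper name.

```python
import itertools
from typing import Iterator, Sequence


def cyclic_delays(delay_secs: Sequence[float]) -> Iterator[float]:
    """Yield delays in order, starting over once the sequence is exhausted."""
    return itertools.cycle(delay_secs)


# Three configured delays, five retries: the fourth retry wraps around.
first_five = list(itertools.islice(cyclic_delays((0.5, 1.0, 1.5)), 5))
# first_five == [0.5, 1.0, 1.5, 0.5, 1.0]
```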
Exponential Backoff¶
Exponential backoff is one of the most popular backoff strategies. Its delays grow rapidly, giving the faulty functionality more and more time to recover on each retry.
Hyx implements Capped Exponential Backoff, which allows you to specify a max_delay_secs bound:
import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import expo


@retry(on=httpx.NetworkError, backoff=expo(min_delay_secs=10, base=2, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("psyduck"))
hyx.retry.backoffs.expo(*, min_delay_secs=1, base=2, max_delay_secs=None, jitter=None)
Exponential Backoff (delay = min_delay_secs * base ** attempt)
Parameters:
- min_delay_secs - The minimal initial delay
- base - The base of the exponential function
- max_delay_secs (optional) - Limit the longest possible delay
- jitter (optional) - Decorrelate delays with the jitter. No jitter by default
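As a worked example of the formula delay = min_delay_secs * base ** attempt with a cap, the sketch below computes the first few delays for the configuration used above. `capped_expo_delay` is a hypothetical helper, and the sketch assumes that the attempt counter starts at 0 and that no jitter is applied.

```python
from typing import Optional


def capped_expo_delay(
    attempt: int,
    *,
    min_delay_secs: float = 1,
    base: float = 2,
    max_delay_secs: Optional[float] = None,
) -> float:
    """Exponential delay for a zero-based attempt number, optionally capped."""
    delay = min_delay_secs * base**attempt
    if max_delay_secs is not None:
        delay = min(delay, max_delay_secs)
    return delay


delays = [
    capped_expo_delay(attempt, min_delay_secs=10, base=2, max_delay_secs=60)
    for attempt in range(5)
]
# delays == [10, 20, 40, 60, 60]
```

The uncapped fourth delay would be 80 seconds, so the cap keeps every later delay at 60 seconds.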
Linear Backoff¶
Linear Backoff grows linearly by adding additive_secs on each retry:
import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import linear


@retry(on=httpx.NetworkError, backoff=linear(min_delay_secs=10, additive_secs=10, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("fomantis"))
hyx.retry.backoffs.linear(*, min_delay_secs=1, additive_secs=1.0, max_delay_secs=None, jitter=None)
Linear Backoff
Parameters:
- min_delay_secs - The minimal initial delay
- additive_secs - How many seconds to add on each retry
- max_delay_secs (optional) - Limit the longest possible delay
- jitter (optional) - Decorrelate delays with the jitter. No jitter by default
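For comparison with the exponential case, here is a rough sketch of how linearly growing, capped delays could be generated. `linear_delays` is a hypothetical generator written for illustration; it assumes the first delay equals min_delay_secs and ignores jitter.

```python
import itertools
from typing import Iterator, Optional


def linear_delays(
    min_delay_secs: float,
    additive_secs: float,
    max_delay_secs: Optional[float] = None,
) -> Iterator[float]:
    """Yield delays that grow by a fixed step, optionally capped."""
    delay = min_delay_secs
    while True:
        yield delay if max_delay_secs is None else min(delay, max_delay_secs)
        delay += additive_secs


first_seven = list(itertools.islice(linear_delays(10, 10, 60), 7))
# first_seven == [10, 20, 30, 40, 50, 60, 60]
```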
Fibonacci Backoff¶
Another rapidly growing backoff is based on the Fibonacci sequence:
import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import fibo


@retry(on=httpx.NetworkError, backoff=fibo(min_delay_secs=10, factor_secs=5))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("kartana"))
hyx.retry.backoffs.fibo(*, min_delay_secs=1, factor_secs=1, max_delay_secs=None, jitter=None)
Fibonacci Backoff
Parameters:
- min_delay_secs - The minimal initial delay
- factor_secs - Defines the second element in the initial Fibonacci sequence
- max_delay_secs (optional) - Limit the longest possible delay
- jitter (optional) - Decorrelate delays with the jitter. No jitter by default
Decorrelated Exponential Backoff¶
This is a complex backoff strategy proposed by AWS Research. It's based on exponential backoff and includes full jitter. On every retry, it exponentially widens the range of possible delays.
import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import decorrexp


@retry(on=httpx.NetworkError, backoff=decorrexp(min_delay_secs=10, max_delay_secs=60, base=20))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("arrokuda"))
hyx.retry.backoffs.decorrexp(min_delay_secs, max_delay_secs, base=3)
Decorrelated Exponential Backoff with Built-in Jitter
Parameters:
- min_delay_secs - The minimal initial delay
- max_delay_secs - Limit the longest possible delay
- base - The base of the exponential function
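The decorrelated jitter formula published by AWS is sleep = min(cap, random_between(base, sleep * 3)). The sketch below implements that formula directly. It is not Hyx's code: `decorrelated_delays` and its `multiplier` and `rng` parameters are hypothetical names, and mapping `multiplier` to Hyx's base parameter is an assumption.

```python
import itertools
import random
from typing import Iterator, Optional


def decorrelated_delays(
    min_delay_secs: float,
    max_delay_secs: float,
    multiplier: float = 3.0,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Each delay is drawn between the minimum and a multiple of the previous delay."""
    rng = rng or random.Random()
    delay = min_delay_secs
    while True:
        # the upper bound widens with the previous delay, but never passes the cap
        delay = min(max_delay_secs, rng.uniform(min_delay_secs, delay * multiplier))
        yield delay


sample = list(itertools.islice(decorrelated_delays(10, 60, rng=random.Random(42)), 6))
```

Every value in `sample` stays between 10 and 60 seconds, while the exact values differ from client to client, which is what decorrelates them.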
Soft Exponential Backoff (Beta)¶
Soft Exponential Backoff is another variation of complex exponential backoffs with built-in jitter. It was authored by the Polly community as a less spiky alternative to Decorrelated Exponential Backoff.
import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import softexp


@retry(on=httpx.NetworkError, backoff=softexp(median_delay_secs=35, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("arrokuda"))
hyx.retry.backoffs.softexp(*, median_delay_secs, max_delay_secs=None, pfactor=4.0, rp_scaling_factor=0.7142857142857143)
Soft Exponential Backoff with Built-in Jitter
Parameters:
- median_delay_secs - The median initial delay
- max_delay_secs (optional) - Limit the longest possible delay
- pfactor -
- rp_scaling_factor -
Custom Backoffs¶
In Hyx's design, backoffs are simply iterators that return float numbers and can continue indefinitely.
Here is how a factorial backoff could be implemented:
import asyncio
from typing import Iterator

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import MS_TO_SECS, SECS_TO_MS


class factorial(Iterator[float]):
    """
    Custom Factorial Backoff
    """

    def __init__(
        self,
        *,
        min_delay_secs: float = 1,
    ) -> None:
        self._min_delay_ms = min_delay_secs * SECS_TO_MS
        self._current_delay_ms = self._min_delay_ms
        self._attempt = 1

    def __iter__(self) -> "factorial":
        # reset the state so every retry cycle starts from the minimal delay
        self._current_delay_ms = self._min_delay_ms
        self._attempt = 1
        return self

    def __next__(self) -> float:
        current_delay_ms = self._current_delay_ms
        # the n-th delay is min_delay * n!: multiply by the next attempt number
        self._attempt += 1
        self._current_delay_ms *= self._attempt
        return current_delay_ms * MS_TO_SECS


@retry(on=httpx.NetworkError, backoff=factorial(min_delay_secs=20))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("fomantis"))
Note
The built-in backoffs accept delay parameters in seconds but work with milliseconds internally. This improves the granularity of generated delays. The delays are then returned in seconds.
Jitters¶
In high-load setups, or when multiple requesters are trying to access the same API, or with a set of background tasks running on a schedule, situations may arise where they happen to perform actions simultaneously. This triggers traffic spikes or unusually high load on the backend system. When you use retries across multiple clients, they can trigger load spikes in the same way.
This may push your system to autoscale unnecessarily.
In such cases, we say the requests are correlated.
To mitigate this problem, we can use jitters, which essentially decorrelate your requests by adding randomness. This helps distribute load more evenly and process the same volume of requests with less capacity.
In Hyx's design, jitters are part of the backoff strategy.
Note
Constant, interval, exponential, linear, and Fibonacci backoffs support the jitters listed below as an optional argument.
Full Jitter¶
Full Jitter is a decorrelation strategy proposed by AWS Research.
It uniformly selects a delay from the range between zero and your upper bound:
import asyncio

import httpx

from hyx.retry import jitters, retry
from hyx.retry.backoffs import expo


@retry(on=httpx.NetworkError, backoff=expo(min_delay_secs=10, max_delay_secs=60, jitter=jitters.full))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("psyduck"))
Note
Full jitter may choose to perform the action immediately without any delay.
hyx.retry.jitters.full(delay)
Full Interval Jitter
Draw a jitter value from [0, upper_bound] interval uniformly
Parameters:
- delay - The delay to jitter
Equal Jitter¶
Another jitter algorithm proposed by AWS Research.
It takes the middle of the given interval and adds some additional delay, drawn uniformly at random from the halved interval.
Note
Equal Jitter guarantees that you will wait at least half of the given delay interval.
hyx.retry.jitters.equal(delay)
Equal Jitter
Parameters:
- delay - The delay to jitter
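The difference between the two jitters is easiest to see side by side. The sketch below follows the descriptions above using only the standard library. `full_jitter` and `equal_jitter` are hypothetical stand-ins written for illustration, not the Hyx functions themselves.

```python
import random


def full_jitter(delay: float) -> float:
    """Pick a delay uniformly from [0, delay]; it may be close to zero."""
    return random.uniform(0, delay)


def equal_jitter(delay: float) -> float:
    """Keep half of the delay and add a uniform share of the other half."""
    half = delay / 2
    return half + random.uniform(0, half)


full_samples = [full_jitter(10) for _ in range(1000)]
equal_samples = [equal_jitter(10) for _ in range(1000)]
```

With a delay of 10, every value in `full_samples` falls between 0 and 10, while every value in `equal_samples` falls between 5 and 10.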
Jittered Backoffs¶
Decorrelated Exponential and Soft Exponential backoffs provide built-in decorrelation as part of their algorithm.
Custom Jitters¶
Hyx uses jitters as part of backoff strategies. Jitters are callables that take a delay in milliseconds generated by the backoff and return the final delay in milliseconds.
Note
Jitters can modify the final delay returned by the backoff algorithm.
import asyncio
import random
from functools import partial

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import expo


def randomixin(delay: float, *, max_mixing: float = 20) -> float:
    """
    Custom Random Mixin Jitter
    """
    return delay + random.uniform(0, max_mixing)


@retry(on=httpx.NetworkError, backoff=expo(min_delay_secs=20, jitter=partial(randomixin, max_mixing=50)))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("cryogonal"))
Backoffs Outside Retries¶
Backoffs and jitters can be useful even outside of retries.
Worker Pools¶
In the following example, we create a pool of in-process workers. If there were no delays between their scheduling, they would start almost instantaneously and compete with each other when pulling tasks from the database.
To avoid this, we introduce a small jitter that decorrelates their startup times:
import asyncio

from hyx.retry import jitters


async def run_worker(delay_between_tasks_secs: float = 10) -> None:
    """
    The Worker's Logic
    """
    while True:
        # pull tasks from the database and process them
        await asyncio.sleep(jitters.full(delay_between_tasks_secs))


async def run_worker_pool(workers: int = 5, schedule_delay_secs: float = 5) -> None:
    """
    Worker Manager
    Schedules a set of workers, jittering their startup times
    """
    tasks: list[asyncio.Task] = []

    for _ in range(workers):
        tasks.append(asyncio.create_task(run_worker()))
        await asyncio.sleep(jitters.full(schedule_delay_secs))

    await asyncio.gather(*tasks)


asyncio.run(run_worker_pool())
Additionally, we jitter each worker's rest time, increasing the chances that their lifecycles end up being different.
Best Practices¶
Limit Retry Attempts¶
Hyx supports an option to retry infinitely, but this should generally be considered an antipattern.
import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import expo


# Don't do this
@retry(on=httpx.NetworkError, attempts=None, backoff=expo(min_delay_secs=10, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("skiddo"))
Always prefer limiting the number of retries over infinite attempts.
Specify Delays¶
You can disable delays between retries, but that's another antipattern you should avoid:
import asyncio

import httpx

from hyx.retry import retry


# Don't do this
@retry(on=httpx.NetworkError, backoff=0.0)
async def get_poke_data(pokemon: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("noibat"))
Without delays, retries can easily overwhelm your system and create a situation known as the retry storm.
When and What to Retry¶
It's important to realize that not every action should be retried. When dealing with non-idempotent APIs, retrying can introduce duplicate entries in the system.
When it comes to HTTP requests, you should retry based on server response errors and consider error codes that are temporary in nature (e.g., 5xx errors).
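One way to act on this is to turn temporary server errors into a dedicated exception and retry only on that exception. The sketch below is a minimal illustration: `TransientServerError` and `raise_if_transient` are hypothetical names, and treating the whole 5xx range as temporary is a simplifying assumption.

```python
class TransientServerError(Exception):
    """Hypothetical exception marking a response that is worth retrying."""


def raise_if_transient(status_code: int) -> None:
    """Raise for 5xx responses; let every other status code pass through."""
    if 500 <= status_code <= 599:
        raise TransientServerError(f"server returned {status_code}")


raise_if_transient(404)  # client error: not retried, nothing is raised

try:
    raise_if_transient(503)
    was_raised = False
except TransientServerError:
    was_raised = True  # 503 is treated as temporary
```

A decorated function could call such a check on the response status and list the exception class in the on parameter, so that client errors such as 404 are never retried.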
Avoid Retry Storms¶
The retry storm is a well-known issue that occurs when retries are poorly configured or placed in the wrong part of the system.
Excessive retries can overload parts of your system and bring it down. The two antipatterns above are common ways to misconfigure retries. That's why you should always limit the number of retry attempts and allow time between retries for the downstream system to recover.
The placement of retries is equally important for avoiding retry storms. Consider the following case:
This system has retries configured at two levels: the gateway (level 1) and the orders microservice (level 2), which calls the inventory microservice.
If the inventory microservice fails, the orders microservice will first exhaust all of its retries and then return the error to the gateway. The gateway will then retry two more times, and each of those attempts triggers the full set of retries at the orders level again. The total number of requests to the inventory microservice will be 3 * 3 = 9.
If there were a deeper request chain with more retries along the way, they would all multiply and put even more load on the system.
The general rule of thumb is to retry only in the component directly above the failed one.
In this case, it would be appropriate to retry only at the orders level.
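The multiplication effect can be checked with a few lines of arithmetic. In the sketch below, `worst_case_requests` is a hypothetical helper that takes the number of attempts configured at each level of the call chain and returns the worst-case number of requests reaching the failing service.

```python
import math
from typing import Sequence


def worst_case_requests(attempts_per_level: Sequence[int]) -> int:
    """Attempts multiply across levels when every level retries."""
    return math.prod(attempts_per_level)


two_levels = worst_case_requests([3, 3])       # gateway and orders: 9
three_levels = worst_case_requests([3, 3, 3])  # one more retrying layer: 27
orders_only = worst_case_requests([1, 3])      # retry only at the orders level: 3
```

Retrying only at the orders level keeps the worst case at 3 requests instead of 9.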