Retries¶
Introduction¶
Distributed systems are full of temporary issues: network failures, sudden latency increases, bandwidth exhaustion, node evictions and partial pod rescheduling, temporary overloading of some microservices, etc. Any of these can leave your requests delayed, queued or failed.
What can you do in that case? The most natural answer is to retry your request. Hence, retry is the most fundamental, intuitive and commonly used component in our resilience toolkit.
However, retries only look deceptively simple and straightforward. Their real usage is more nuanced, as you will read throughout this page.
Use cases¶
- Retries hide temporary short-lived errors
- Jitters are useful to reduce congestion on resources
Usage¶
Hyx provides a decorator that brings retry functionality to any function:
import asyncio

import httpx

from hyx.retry import retry


@retry(on=httpx.NetworkError, attempts=4)
async def get_gh_events() -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get("https://api.github.com/events")
        return response.json()


asyncio.run(get_gh_events())
hyx.retry.retry
(*, on=
The @retry() decorator retries the function on the given exceptions for the given number of attempts. Delays after each retry are defined by the backoff strategy.
Parameters:
- on - Exception or tuple of exceptions to retry on.
- attempts - How many times to retry. If None, it will retry infinitely until success.
- backoff - Backoff strategy that defines delays on each retry. Takes float numbers (delay in secs), list[float] (delays on each retry attempt), or Iterator[float].
- name (None | str) - A component name or ID (will be passed to listeners and mentioned in metrics).
- listeners (None | Sequence[TimeoutListener]) - List of listeners of this concrete component's state.
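To make the roles of on, attempts and backoff more concrete, here is a minimal, stdlib-only sketch of what a retry decorator does in principle. It is not Hyx's implementation: `simple_retry`, its `delays` argument and the `flaky` function are hypothetical names used for illustration, and the sketch assumes that `attempts` counts retries made after the initial call.

```python
import asyncio
import functools
import itertools
from typing import Any, Awaitable, Callable, Iterable, Optional, Tuple, Type, Union

ExceptionsT = Union[Type[Exception], Tuple[Type[Exception], ...]]
AsyncFunc = Callable[..., Awaitable[Any]]


def simple_retry(
    *,
    on: ExceptionsT,
    attempts: Optional[int] = 3,
    delays: Iterable[float] = (0.0,),
) -> Callable[[AsyncFunc], AsyncFunc]:
    """Hypothetical stand-in for a retry decorator (not Hyx's code)."""

    def decorator(func: AsyncFunc) -> AsyncFunc:
        @functools.wraps(func)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            backoff = itertools.cycle(delays)  # reuse delays if we run out of them
            retries_left = attempts  # None means "retry until success"
            while True:
                try:
                    return await func(*args, **kwargs)
                except on:
                    if retries_left is not None:
                        if retries_left <= 0:
                            raise  # out of retries: surface the last error
                        retries_left -= 1
                    await asyncio.sleep(next(backoff))

        return wrapper

    return decorator


calls = 0


@simple_retry(on=ConnectionError, attempts=3, delays=(0.0,))
async def flaky() -> str:
    """Fail twice with a transient error, then succeed."""
    global calls
    calls += 1
    if calls < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = asyncio.run(flaky())
print(result, calls)  # ok 3
```

Exceptions that are not listed in `on` propagate immediately, which is why the choice of exception types matters as much as the number of attempts.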
Backoffs¶
The backoff strategy is a crucial parameter to consider. Depending on the backoff, the retry component can help your system or be a source of problems.
Warning
For the sake of simplicity, Hyx assumes that you are following AsyncIO best practices and not running CPU-intensive operations in the main thread. Otherwise, the backoff delays may fire later after the thread is unblocked.
Constant Backoff¶
The most basic backoff strategy is to wait a constant amount of time on each retry.
import asyncio

import httpx

from hyx.retry import retry


@retry(on=httpx.NetworkError, backoff=0.5)  # delay 500ms on each retry
async def get_poke_data(pokemon: str) -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("pikachu"))
The float backoffs are just aliases for the const backoff.
hyx.retry.backoffs.const
(delay_secs, *, jitter=None)
Constant Delay(s) Backoff
Parameters:
- delay_secs (float, int) - How much time do we wait on each retry.
- jitter (optional) - Decorrelate delays with the jitter. No jitter by default
Interval Backoff¶
You can also provide a list or a tuple of floats to pull delays from it in a sequential and cyclical manner.
import asyncio

import httpx

from hyx.retry import retry


@retry(on=httpx.NetworkError, attempts=4, backoff=(0.5, 1.0, 1.5, 2.0))
async def get_poke_data(pokemon: str) -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("slowpoke"))
The list[float] and tuple[float, ...] backoffs are just aliases for the interval backoff.
hyx.retry.backoffs.interval
(delay_secs, *, jitter=None)
Interval Delay(s) Backoff
Parameters:
- delay_secs (Sequence[float]) - How long to wait on each retry. The next delay is taken from the sequence on each retry, starting over from the beginning if the sequence is shorter than the number of attempts.
- jitter (optional) - Decorrelate delays with the jitter. No jitter by default
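The sequential and cyclical behavior described above can be illustrated with a small stdlib sketch. The helper name `cyclic_delays` is hypothetical and jitter is left out; this only shows the order in which delays would be pulled.

```python
import itertools
from typing import Iterator, Sequence


def cyclic_delays(delay_secs: Sequence[float]) -> Iterator[float]:
    """Yield delays in order and start over when the sequence is exhausted."""
    return itertools.cycle(delay_secs)


backoff = cyclic_delays((0.5, 1.0, 1.5, 2.0))
first_six = list(itertools.islice(backoff, 6))
print(first_six)  # [0.5, 1.0, 1.5, 2.0, 0.5, 1.0]
```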
Exponential Backoff¶
Exponential backoff is one of the most popular backoff strategies. It produces delays that grow rapidly, which gives the faulty functionality more and more time to recover on each retry.
Hyx implements the Capped Exponential Backoff, which lets you specify the max_delay_secs bound:
import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import expo


@retry(on=httpx.NetworkError, backoff=expo(min_delay_secs=10, base=2, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("psyduck"))
hyx.retry.backoffs.expo
(*, min_delay_secs=1, base=2, max_delay_secs=None, jitter=None)
Exponential Backoff (delay = min_delay_secs * base ** attempt)
Parameters:
- min_delay_secs - The minimal initial delay
- base - The base of the exponential function
- max_delay_secs (optional) - Limit the longest possible delay
- jitter (optional) - Decorrelate delays with the jitter. No jitter by default
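As a worked example of the formula delay = min_delay_secs * base ** attempt with a cap, the sketch below computes the first few delays for the parameters used in the example above. `expo_delay` is a hypothetical helper, and counting attempts from zero is an assumption; Hyx's own numbering may differ.

```python
from typing import Optional


def expo_delay(
    attempt: int,
    *,
    min_delay_secs: float = 1,
    base: float = 2,
    max_delay_secs: Optional[float] = None,
) -> float:
    """Capped exponential delay; `attempt` is assumed to start at 0."""
    delay = min_delay_secs * base**attempt
    if max_delay_secs is not None:
        delay = min(delay, max_delay_secs)  # never exceed the cap
    return delay


expo_delays = [expo_delay(a, min_delay_secs=10, base=2, max_delay_secs=60) for a in range(5)]
print(expo_delays)  # [10, 20, 40, 60, 60]
```

The uncapped fourth delay would be 80 seconds, so from that point on every delay is clamped to max_delay_secs.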
Linear Backoff¶
Linear Backoff grows linearly by adding additive_secs on each retry:
import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import linear


@retry(on=httpx.NetworkError, backoff=linear(min_delay_secs=10, additive_secs=10, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("fomantis"))
hyx.retry.backoffs.linear
(*, min_delay_secs=1, additive_secs=1.0, max_delay_secs=None, jitter=None)
Linear Backoff
Parameters:
- min_delay_secs - The minimal initial delay
- additive_secs - How many seconds to add on each retry
- max_delay_secs (optional) - Limit the longest possible delay
- jitter (optional) - Decorrelate delays with the jitter. No jitter by default
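The linear growth can be sketched the same way. `linear_delay` is a hypothetical helper, and it assumes the first delay equals min_delay_secs with additive_secs added on every later retry.

```python
from typing import Optional


def linear_delay(
    attempt: int,
    *,
    min_delay_secs: float = 1,
    additive_secs: float = 1.0,
    max_delay_secs: Optional[float] = None,
) -> float:
    """Capped linear delay; `attempt` is assumed to start at 0."""
    delay = min_delay_secs + additive_secs * attempt
    if max_delay_secs is not None:
        delay = min(delay, max_delay_secs)  # never exceed the cap
    return delay


linear_delays = [
    linear_delay(a, min_delay_secs=10, additive_secs=10, max_delay_secs=60) for a in range(7)
]
print(linear_delays)  # [10, 20, 30, 40, 50, 60, 60]
```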
Fibonacci Backoff¶
Another rapidly growing backoff is based on the Fibonacci sequence:
import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import fibo


@retry(on=httpx.NetworkError, backoff=fibo(min_delay_secs=10, factor_secs=5))
async def get_poke_data(pokemon: str) -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("kartana"))
hyx.retry.backoffs.fibo
(*, min_delay_secs=1, factor_secs=1, max_delay_secs=None, jitter=None)
Fibonacci Backoff
Parameters:
- min_delay_secs - The minimal initial delay
- factor_secs - Defines the second element in the initial Fibonacci sequence
- max_delay_secs (optional) - Limit the longest possible delay
- jitter (optional) - Decorrelate delays with the jitter. No jitter by default
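One possible reading of these parameters is that min_delay_secs and factor_secs seed the first two elements of the sequence, and every later delay is the sum of the previous two. The generator below is a sketch under that assumption only; `fibo_delays` is a hypothetical name and Hyx's actual sequence may be built differently.

```python
import itertools
from typing import Iterator, Optional


def fibo_delays(
    *,
    min_delay_secs: float = 1,
    factor_secs: float = 1,
    max_delay_secs: Optional[float] = None,
) -> Iterator[float]:
    """Fibonacci-like delays seeded with (min_delay_secs, factor_secs)."""
    current, upcoming = min_delay_secs, factor_secs
    while True:
        # clamp the emitted value, but keep growing the underlying sequence
        yield current if max_delay_secs is None else min(current, max_delay_secs)
        current, upcoming = upcoming, current + upcoming


fibo_sample = list(itertools.islice(fibo_delays(max_delay_secs=10), 7))
print(fibo_sample)  # [1, 1, 2, 3, 5, 8, 10]
```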
Decorrelated Exponential Backoff¶
This is a complex backoff strategy proposed by AWS Research. It's based on the exponential backoff and includes the full jitter. On every retry, it exponentially widens the range of possible delays.
import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import decorrexp


@retry(on=httpx.NetworkError, backoff=decorrexp(min_delay_secs=10, max_delay_secs=60, base=20))
async def get_poke_data(pokemon: str) -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("arrokuda"))
hyx.retry.backoffs.decorrexp
(min_delay_secs, max_delay_secs, base=3)
Decorrelated Exponential Backoff with Built-in Jitter
Parameters:
- min_delay_secs - The minimal initial delay
- base - The base of the exponential function
- max_delay_secs (optional) - Limit the longest possible delay
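The widening range of delays can be sketched with the commonly cited "decorrelated jitter" recurrence, where each delay is drawn uniformly between the minimal delay and a multiple of the previous delay, then capped. This is an illustration of that general idea, not Hyx's code: `decorrelated_delays` is a hypothetical name, and mapping the multiplier to the base parameter is an assumption.

```python
import itertools
import random
from typing import Iterator


def decorrelated_delays(
    *,
    min_delay_secs: float,
    max_delay_secs: float,
    base: float = 3,
    seed: int = 0,
) -> Iterator[float]:
    """Each delay is drawn from [min_delay_secs, previous_delay * base], then capped."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    delay = min_delay_secs
    while True:
        delay = min(max_delay_secs, rng.uniform(min_delay_secs, delay * base))
        yield delay


decorr_sample = list(
    itertools.islice(decorrelated_delays(min_delay_secs=10, max_delay_secs=60), 20)
)
print([round(d, 1) for d in decorr_sample])
```

Because the upper end of the range depends on the previous draw, two clients that start at the same moment quickly drift apart in their retry schedules.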
Soft Exponential Backoff (Beta)¶
Soft Exponential Backoff is another variation of complex exponential backoffs with built-in jitter. It was authored by the Polly community as a less spiky alternative to Decorrelated Exponential Backoff.
import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import softexp


@retry(on=httpx.NetworkError, backoff=softexp(median_delay_secs=35, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("arrokuda"))
hyx.retry.backoffs.softexp
(*, median_delay_secs, max_delay_secs=None, pfactor=4.0, rp_scaling_factor=0.7142857142857143)
Soft Exponential Backoff with Built-in Jitter
Parameters:
- median_delay_secs - The median of the initial delay
- max_delay_secs (optional) - Limit the longest possible delay
- pfactor -
- rp_scaling_factor -
Custom Backoffs¶
In the Hyx design, backoffs are just iterators that return float numbers and can go on infinitely.
Here is how the factorial backoff could be implemented:
import asyncio
from typing import Iterator

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import MS_TO_SECS, SECS_TO_MS


class factorial(Iterator[float]):
    """
    Custom Factorial Backoff (delay = min_delay_secs * attempt!)
    """

    def __init__(
        self,
        *,
        min_delay_secs: float = 1,
    ) -> None:
        self._min_delay_ms = min_delay_secs * SECS_TO_MS
        self._current_delay_ms = self._min_delay_ms
        self._attempt = 1

    def __iter__(self) -> "factorial":
        # reset the state so that every retry cycle starts from the minimal delay
        self._current_delay_ms = self._min_delay_ms
        self._attempt = 1
        return self

    def __next__(self) -> float:
        current_delay_ms = self._current_delay_ms
        # multiply by the next attempt number: 1x, 2x, 6x, 24x, ... of the minimal delay
        self._attempt += 1
        self._current_delay_ms *= self._attempt
        return current_delay_ms * MS_TO_SECS


@retry(on=httpx.NetworkError, backoff=factorial(min_delay_secs=20))
async def get_poke_data(pokemon: str) -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("fomantis"))
Note
The built-in backoffs accept delay params in seconds, but work with milliseconds under the hood. That improves the granularity of the generated delays. The generated delays are then returned in seconds again.
Jitters¶
In high-load setups, when a few requesters pull the same API, or when a set of background tasks runs on a schedule, they may end up performing their actions simultaneously. That triggers traffic spikes or unusually high load on the backend system. When you use retries in a few clients, they may trigger load spikes in the same way.
Such spikes may push your system to autoscale without a good reason, which is inefficient.
In that case, we say that the requests were correlated.
To mitigate this problem, we can use jitters, which are essentially a way to decorrelate your requests by adding some randomness. That helps to distribute load more evenly and process the same number of requests with less capacity.
In the Hyx design, jitters are a part of the backoff strategy.
Note
Constant, interval, exponential, linear and Fibonacci backoffs support the jitters listed below as an optional argument.
Full Jitter¶
Full Jitter is a decorrelation strategy proposed by AWS Research.
It takes a delay from the range between zero and your upper bound uniformly:
import asyncio

import httpx

from hyx.retry import jitters, retry
from hyx.retry.backoffs import expo


@retry(on=httpx.NetworkError, backoff=expo(min_delay_secs=10, max_delay_secs=60, jitter=jitters.full))
async def get_poke_data(pokemon: str) -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("psyduck"))
Note
Full jitter may decide to do the action right away without a delay
hyx.retry.jitters.full
(delay)
Full Interval Jitter
Draws a jitter value uniformly from the [0, upper_bound] interval
Parameters:
- delay - The delay to jitter
Reference:
Equal Jitter¶
Another jitter algorithm proposed by AWS Research.
It takes half of the given delay as a fixed part and adds an extra delay drawn uniformly at random from the remaining half of the interval.
Note
Equal Jitter guarantees that you will wait at least half of the given delay interval.
hyx.retry.jitters.equal
(delay)
Equal Jitter
Parameters:
- delay - The delay to jitter
Reference:
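The two jitters described above come down to very little arithmetic. The functions below are a stdlib sketch of that arithmetic with hypothetical names (`full_jitter`, `equal_jitter`); they are not the Hyx functions themselves.

```python
import random


def full_jitter(delay: float) -> float:
    """Draw the final delay uniformly from [0, delay]."""
    return random.uniform(0, delay)


def equal_jitter(delay: float) -> float:
    """Keep half of the delay and draw the rest uniformly from [0, delay / 2]."""
    half = delay / 2
    return half + random.uniform(0, half)


full_samples = [full_jitter(10.0) for _ in range(1000)]
equal_samples = [equal_jitter(10.0) for _ in range(1000)]

print(min(full_samples) >= 0.0, max(full_samples) <= 10.0)
print(min(equal_samples) >= 5.0, max(equal_samples) <= 10.0)
```

Full jitter spreads requests over the widest range but may produce near-zero delays, while equal jitter trades some of that spread for a guaranteed minimal wait.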
Jittered Backoffs¶
Decorrelated Exponential and Soft Exponential backoffs provide built-in decorrelation as a part of their algorithm.
Custom Jitters¶
Hyx uses jitters as a part of backoff strategies. Jitters are callables that take a delay in milliseconds generated by backoff and return the final delay in milliseconds.
Note
Jitters can modify the final delay returned by the backoff algorithm.
import asyncio
import random
from functools import partial

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import expo


def randomixin(delay: float, *, max_mixing: float = 20) -> float:
    """
    Custom Random Mixin Jitter
    """
    return delay + random.uniform(0, max_mixing)


@retry(on=httpx.NetworkError, backoff=expo(min_delay_secs=20, jitter=partial(randomixin, max_mixing=50)))
async def get_poke_data(pokemon: str) -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("cryogonal"))
Backoffs Outside Retries¶
Backoffs and jitters can be useful even outside of retries.
Worker Pools¶
In the following example, we create a pool of in-process workers. If we had no delays between their scheduling, they would be started almost instantaneously and would then compete with each other while pulling tasks from the database.
In order to avoid that, we are introducing a little jitter that decorrelates their startup time:
import asyncio

from hyx.retry import jitters


async def run_worker(delay_between_tasks_secs: float = 10) -> None:
    """
    The Worker's Logic
    """
    while True:
        # pull tasks from the database and process it
        await asyncio.sleep(jitters.full(delay_between_tasks_secs))


async def run_worker_pool(workers: int = 5, schedule_delay_secs: float = 5) -> None:
    """
    Worker Manager
    Schedules a set of workers with jittering their startup times
    """
    tasks: list[asyncio.Task] = []

    for _ in range(workers):
        tasks.append(asyncio.create_task(run_worker()))
        await asyncio.sleep(jitters.full(schedule_delay_secs))

    await asyncio.gather(*tasks)


asyncio.run(run_worker_pool())
Besides that, we jitter the workers' rest time, which increases the chances that the workers' lifecycles end up being different.
Best Practices¶
Limit Retry Attempts¶
Hyx supports an option to retry infinitely, but that should generally be considered an antipattern.
import asyncio

import httpx

from hyx.retry import retry
from hyx.retry.backoffs import expo


# Don't do this
@retry(on=httpx.NetworkError, attempts=None, backoff=expo(min_delay_secs=10, max_delay_secs=60))
async def get_poke_data(pokemon: str) -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("skiddo"))
Always prefer to limit the number of retries over the infinite attempts.
Specify Delays¶
You can disable delays between retries, but that's another antipattern you should not follow:
import asyncio

import httpx

from hyx.retry import retry


# Don't do this
@retry(on=httpx.NetworkError, backoff=0.0)
async def get_poke_data(pokemon: str) -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon}")
        return response.json()


asyncio.run(get_poke_data("noibat"))
Without delays, retries can easily overload your system and create a situation known as the retry storm.
When and What to Retry¶
It's important to realize that not every action should be retried. When you are dealing with non-idempotent APIs, retrying a request can introduce duplicated entries in the system.
When it comes to HTTP requests, you should retry based on the server response errors and consider only error codes that are temporary in nature (e.g. 50x errors).
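A sketch of such a decision is shown below. The helper `should_retry` and both sets are assumptions made for illustration: the method list follows the HTTP definition of idempotent methods, and the status list is one reasonable selection of 50x codes that usually signal a temporary condition. Adjust both to the API you are calling.

```python
# Methods defined as idempotent by the HTTP specification
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"})

# 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout
TRANSIENT_STATUSES = frozenset({500, 502, 503, 504})


def should_retry(method: str, status_code: int) -> bool:
    """Retry only idempotent requests that failed with a likely temporary error."""
    return method.upper() in IDEMPOTENT_METHODS and status_code in TRANSIENT_STATUSES


print(should_retry("GET", 503))   # True
print(should_retry("POST", 503))  # False: retrying may create duplicated entries
print(should_retry("GET", 404))   # False: the error is not temporary
```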
Avoid Retry Storms¶
The retry storm is a well-known issue that occurs when retries are badly configured or put in the wrong place in the system.
Excessive retries can overload some parts of your system and bring it down. The two antipatterns above are common ways to misconfigure retries. That's why you should always limit the number of retry attempts and leave some time between the retries for the downstream system to recover.
The place where retries are added is equally important to avoid retry storms. Consider the following case:
The given system has retries configured on two levels: gateway (lvl1) and orders microservice (lvl2).
If the inventory microservice fails, it will first exhaust all retries on the orders side, and then the failure will get back to the gateway.
The gateway will retry two more times.
The total number of requests to the inventory microservice will be 3 * 3, which is 9.
If we had a deeper request chain with more retries on the way, all of them would multiply and create an even worse load on the system.
The general rule of thumb here is to retry in the component that is directly above the failed one.
In this case, it would be okay to retry on the orders level only.
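The multiplication effect is easy to check with a few lines of arithmetic. The helper `worst_case_requests` is hypothetical; it assumes that every level makes a fixed number of attempts and that each attempt on an upper level triggers a full round of attempts on the level below.

```python
import math
from typing import Sequence


def worst_case_requests(attempts_per_level: Sequence[int]) -> int:
    """Worst-case number of requests reaching the failing service."""
    return math.prod(attempts_per_level)


print(worst_case_requests([3, 3]))     # 9: gateway and orders both make 3 attempts
print(worst_case_requests([3, 3, 3]))  # 27: one more retrying level in the chain
print(worst_case_requests([1, 3]))     # 3: only orders retries
```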