
My team built and productionized a custom rate limiter in Scala#


A year ago, when I was leading a "Digital Platform" team at a large bank, we built and productionized a custom rate limiter in Scala. This is the story of why and how we did it.

Acknowledgements

This was a team effort. Special mention to Rohit, who worked on this from inception to production deployment and also helped resolve production issues.

My team was responsible for building and maintaining the platform that powered the bank's lending products. The platform was built using Kotlin, Spring Boot, and Kafka. The work we usually signed up for was not directly related to the lending products themselves, but rather to the infrastructure that powered them: for example, the platform's observability, security, performance, orchestration engine, and contract testing. There was a lot of work to be done and we were a small team, so we always had to prioritize our work based on the impact it would have.

Why we needed a custom rate limiter#

The platform had some common services which were used by all the lending products. Each lending product had different stakeholders, and they all wanted to ensure that their product was always available and responsive. Each product was accessed through a different set of channels, so the traffic patterns differed. Traffic could sometimes be bursty due to marketing campaigns or external events. We had to ensure that the platform could handle these traffic spikes without any issues, yet we sometimes observed that it could not, and services would crash.

flowchart LR
    KYC[KYC Service]
    CS[Credit Scoring]
    IS[Income Assessment]

    H[Home Loan]
    P[Personal Loan]
    C[Car Loan]
    CC[Credit Card]

    H --> KYC
    CC --> KYC
    P --> KYC
    C --> KYC

    H --> CS
    CC --> CS
    P --> CS
    C --> CS

    H --> IS
    CC --> IS
    P --> IS
    C --> IS

This was a bad customer experience. An additional challenge for the platform team was that a common service could be overwhelmed by traffic from a single product, which would then affect the other products. We needed some kind of virtual "isolation" among clients. Even if we explored automatic horizontal scaling, it would not solve the problem completely because:

  1. Automatic scaling would take time to kick in
  2. It's not unlimited; there will always be a scaling limit
  3. Upon reaching the limit, it won't solve the problem of one product affecting the others

We did need horizontal auto-scaling, but in addition to that we also needed a way to stop one client from affecting the others. And so we decided to introduce a rate limiter. The next question was how do we decide the rate limit for each client?

This question was not easy to answer. We could not just give a fixed limit to all the clients, because different products had different demands and those ratios changed from time to time. If we hard-coded ratios such as 10% to Home Loan and 40% to Personal Loan, we would be at a loss when Home Loan started to demand more than 10% while Personal Loan was using less than 40%. This would result in unnecessarily rejecting requests or auto-scaling when it was not needed.

The requirements for the rate limiter were now clear:

  1. It should start with a default fair and equal rate limit for all clients
  2. The rate limiter should be able to handle high traffic without itself becoming a bottleneck
  3. Demanding clients should be able to "borrow" capacity if it's available

If a client is not using 100% of its capacity, the unused portion should be available for others to borrow when they need additional capacity to handle increased demand, until the owner of the capacity needs it back. This would ensure zero waste of capacity and scaling up only when needed. If scaling up takes time, or if scale-up has reached its limit, this solution would make sure that:

  • Only the products which have traffic spikes would face rejected requests
  • Common services would not crash because they would never see traffic beyond their capacity
  • Other products within their limits would not feel the impact of spikes from other products

We explored off-the-shelf rate limiters but could not find one that would fit our needs. We decided to build our own rate limiter.

How we built the rate limiter#

Technology choices#

We had a few options for implementing the new custom rate limiter. We could either build a Java/Kotlin library to be used as a filter/middleware in Spring Boot, or build a separate service that would run as a sidecar proxy. We went with the sidecar approach because it would allow us to use the rate limiter with any service, regardless of the language it was written in. It also allowed us to manage the rate limiter's resources separately from the services it was protecting.

Going with the sidecar approach also gave us the freedom to choose the language and libraries that we felt were appropriate for the use case. We decided to use Scala along with ZIO and ZIO-HTTP. Scala's expressive syntax, type safety, and immutability by default made it a good choice for our use case.

flowchart LR
    KLB[Kubernetes Load Balancer]

    RL1[Rate Limiter Sidecar 1]
    RL2[Rate Limiter Sidecar 2]
    RL3[Rate Limiter Sidecar 3]

    S1[KYC Service Replica 1]
    S2[KYC Service Replica 2]
    S3[KYC Service Replica 3]

    RL1 --> S1
    RL2 --> S2
    RL3 --> S3

    KLB --> RL1
    KLB --> RL2
    KLB --> RL3

We chose ZIO because it offers many concurrency primitives. ZIO's Ref, for example, is a very easy-to-use concurrency construct: a mutable reference that can be read and written in a non-blocking, purely functional way. Actors were also an option because they offer more control, but in our case we did not need that much control, so we went with Ref.

Note

Using Ref is easy because you don't have to change much of your code. You have a case class for your state and some helper update methods on that case class which return a modified copy of the instance. You can then call these methods through the Ref's updateAndGet method. Actors, on the other hand, require you to think upfront about the messages you want to send and receive; they are not an easy drop-in replacement for non-concurrent code. They are, however, more powerful and can be used to build more complex systems.

enum AttemptResult:
  case Allowed, ClientRateLimitExceeded, AnonymousClient

case class RateLimiter(stateRef: Ref[ApplicationState]):
  def attempt(client: Option[ClientName]): UIO[AttemptResult] = ???
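
For illustration only, a minimal version of attempt in this style might look like the sketch below. The two-map ApplicationState and the ClientName alias are assumptions made for this example (the real state tracked more, such as cycle timing), and Ref.modify is used instead of updateAndGet because we also want to return a result from the same atomic update.

import zio.*

type ClientName = String

// Simplified, assumed state: capacity granted and requests seen per client in the current cycle.
case class ApplicationState(capacities: Map[ClientName, Long], demands: Map[ClientName, Long]):
  // Helper on the state class that returns the result together with an updated copy
  def recordAttempt(client: ClientName): (AttemptResult, ApplicationState) =
    val demand   = demands.getOrElse(client, 0L) + 1L
    val capacity = capacities.getOrElse(client, 0L)
    val result   = if demand <= capacity then AttemptResult.Allowed else AttemptResult.ClientRateLimitExceeded
    (result, copy(demands = demands.updated(client, demand)))

case class RateLimiter(stateRef: Ref[ApplicationState]):
  def attempt(client: Option[ClientName]): UIO[AttemptResult] =
    client match
      case None       => ZIO.succeed(AttemptResult.AnonymousClient)
      case Some(name) => stateRef.modify(_.recordAttempt(name))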

ZIO-HTTP was a natural choice because it is built on top of ZIO and is a high-performance HTTP server. It also has very good support for building HTTP proxies: you can easily take the incoming request, dispatch it to another server, and use the response from there as the response to the original request. Because both the request and the response bodies are streamed, you are essentially piping the data from one server to another without having to buffer it in memory.

Forwarding an HTTP request using ZIO-HTTP
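// Note: `client` below is assumed to be a zio-http Client available in scope (e.g. provided via the ZIO environment).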
def forward(req: Request, to: Location.Absolute): ZIO[Scope, Throwable, Response] =
  val newURL     = URL(kind = to, path = req.path, queryParams = req.url.queryParams, fragment = None)
  val newRequest = req.copy(url = newURL)
  client.request(newRequest)

The Algorithm#

This is a variant of the Fixed Window algorithm, with the deviation that here we keep track of request attempts (demand) by each client in each window (cycle). At the start of each cycle, we calculate a new rate limit (capacity) for each client based on all the demands from the last cycle. This allows demanding clients to borrow capacity from non-demanding clients temporarily, until the non-demanding clients need their capacity back.

\[
\begin{align*}
\text{Max Capacity of Service} &= \text{Config} \\
\text{Reservation percent} &= \text{Config} \\
\text{Count of clients} &= \text{Runtime state} \\
\text{Demand of client} &= \text{Number of requests by the client in the previous cycle} \\
\text{Default client capacity} &= \frac{\text{Max Capacity of Service}}{\text{Count of clients}} \\
\text{Demand with reservation} &= \max\big(\text{Demand of client},\ \text{Default client capacity} \times \text{Reservation percent}\big) \\
\text{Gap} &= \text{Default client capacity} - \text{Demand with reservation} \\
\text{Total Deficit} &= \sum_{\text{gap}_i < 0} \lvert \text{gap}_i \rvert \\
\text{Total Surplus} &= \sum_{\text{gap}_i > 0} \text{gap}_i \\
\text{Total Remaining Capacity} &= \max(0,\ \text{Total Surplus} - \text{Total Deficit}) \\
\text{Deficit ratio} &= \max\left(0,\ \frac{-\text{Gap}}{\text{Total Deficit}}\right) \\
\text{Surplus ratio} &= \max\left(0,\ \frac{\text{Gap}}{\text{Total Surplus}}\right) \\
\text{Loan amount} &= \min\big(\lvert \text{Gap} \rvert,\ \text{Deficit ratio} \times \text{Total Surplus}\big) \\
\text{Remaining capacity} &= \text{Surplus ratio} \times \text{Total Remaining Capacity} \\
\text{Effective client capacity} &= \begin{cases} \text{Default client capacity} + \text{Loan amount} & \text{if } \text{Gap} \leq 0 \\ \text{Demand with reservation} + \text{Remaining capacity} & \text{otherwise} \end{cases}
\end{align*}
\]

Info

This algorithm is only used in subsequent cycles. In the first cycle, all clients are given equal capacity because there are no previous-cycle demands available.

Take the following example: there are 4 clients, A, B, C & D. The max service capacity is 40 and the cycle duration is 10 seconds. That means each client can make 10 requests per 10 seconds in the first cycle. After that, new capacities are calculated per client for subsequent cycles.

| Client | Cycle 1 Size | Cycle 1 Demand | Cycle 2 Size | Cycle 2 Demand | Cycle 3 Size | Cycle 3 Demand | Cycle 4 Size |
|--------|--------------|----------------|--------------|----------------|--------------|----------------|--------------|
| A      | 10           | 2              | 5            | 3              | 3            | 0              | 1            |
| B      | 10           | 15             | 15           | 15             | 11           | 15             | 12           |
| C      | 10           | 10             | 10           | 50             | 16           | 50             | 22           |
| D      | 10           | 10             | 10           | 10             | 10           | 5              | 5            |
| Total  | 40           | 37             | 40           | 78             | 40           | 70             | 40           |

Simple deficit and surplus distribution example: Cycle 1

In cycle 1, Client A makes only 2 requests while Client B makes 15. Clients C and D use their full capacity. In the next cycle (2), Client B's capacity is increased to 15 to match its demand in the last cycle, because Client A can lend its unused capacity. (Client A's own capacity for cycle 2 becomes 5: its demand of 2 plus its share of the remaining spare capacity.) Client C's and D's capacities do not change, and the total capacity always stays at 40.

Proportional deficit distribution example: Cycle 2

In cycle 2, the total demand is 78, which is more than the total capacity of 40. But there is spare capacity of 7 from Client A, so we distribute that surplus among the demanding clients in proportion to their deficits.

Total Deficit = 45
Total Surplus = 7

B Gap = 5
B Deficit ratio = 5/45 = 11%
B Loan amount = 11% * 7 = 1
B Capacity = 10 + 1 = 11

C Gap = 40
C Deficit ratio = 40/45 = 89%
C Loan amount = 89% * 7 = 6
C Capacity = 10 + 6 = 16

In cycle 3, we can see that Client B's capacity has increased from 10 (the default) to 11 and Client C's capacity has increased from 10 (the default) to 16. A's and D's capacities are equal to their demand in the last cycle.

Reserved capacity example: Cycle 3

Cycle 3 demands are similar to cycle 2, except that Client A now has 0 demand (and Client D drops to 5). In this case, we reserve 10% (configurable) of Client A's capacity and do not lend it out. This is done to ensure that a client always has some capacity available to it even if it makes calls very infrequently.

Total Deficit = 45
Total Surplus = 14

B Gap = 5
B Deficit ratio = 5/45 = 11%
B Loan amount = 11% * 14 = 2
B Capacity = 10 + 2 = 12

C Gap = 40
C Deficit ratio = 40/45 = 89%
C Loan amount = 89% * 14 = 12
C Capacity = 10 + 12 = 22

In cycle 4, Client A has 1 (10% of 10) capacity reserved for it. There is a surplus of 5 from Client D and 9 from Client A. This surplus is distributed to Clients B and C in an 11% and 89% ratio respectively for cycle 4, based on their demand in the last cycle.
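
To make the recalculation concrete, here is a small, illustrative Scala sketch of the capacity computation for a new cycle. The function name, the Double arithmetic, and the Map-shaped inputs are assumptions made for this post rather than the production code, but it reproduces the numbers in the table above.

// ClientName is the String alias introduced in the earlier sketch.
// Recalculates every client's capacity at the start of a new cycle, given the
// previous cycle's demand per client (one entry per registered client, possibly 0).
def nextCapacities(
    maxCapacity: Double,
    reservationPercent: Double,
    demands: Map[ClientName, Double]
): Map[ClientName, Double] =
  val defaultCapacity = maxCapacity / demands.size
  val reserved        = defaultCapacity * reservationPercent

  // Positive gap = unused capacity available to lend, negative gap = extra demand.
  val gaps = demands.view.mapValues(d => defaultCapacity - math.max(d, reserved)).toMap

  val totalDeficit   = gaps.values.filter(_ < 0).map(math.abs).sum
  val totalSurplus   = gaps.values.filter(_ > 0).sum
  val totalRemaining = math.max(0.0, totalSurplus - totalDeficit)

  gaps.map { case (client, gap) =>
    val capacity =
      if gap <= 0 then
        // Demanding client: borrow a proportional share of the lenders' surplus.
        val deficitRatio = if totalDeficit == 0 then 0.0 else -gap / totalDeficit
        defaultCapacity + math.min(math.abs(gap), deficitRatio * totalSurplus)
      else
        // Lending client: keep max(demand, reservation) plus a share of whatever is left over.
        val surplusRatio = if totalSurplus == 0 then 0.0 else gap / totalSurplus
        math.max(demands(client), reserved) + surplusRatio * totalRemaining
    client -> capacity
  }

For instance, nextCapacities(40, 0.10, Map("A" -> 2.0, "B" -> 15.0, "C" -> 10.0, "D" -> 10.0)), i.e. the cycle 1 demands, yields A = 5, B = 15, C = 10 and D = 10, matching the cycle 2 sizes in the table.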

State management#

The rate limiter did not persist state anywhere; it was all in-memory in the JVM. The rate limiter was a sidecar, meaning a rate limiter instance was attached to each replica of the service. So it only needed to know the max capacity of one service replica and track requests to that instance; it did not need to know about requests to other replicas. We relied on the Kubernetes load balancer to distribute requests among replicas evenly (or approximately evenly).

Client identification#

We needed to identify the client making each request, so we decided to introduce a new HTTP header carrying a client ID. This was a trusted setup, so there was no worry about clients spoofing the header. We made a change in the common library to pick up the client name from the application YAML and add it as a header to each outgoing request. This was a simple change and did not require any changes in the services themselves. If a request did not have a client-id header, we treated it as an anonymous client and rejected the request with an HTTP 429 status code.
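
Putting these pieces together, the sidecar's per-request decision could be sketched roughly as below. How the header value is read off the incoming request is elided here; the upstream parameter and the use of zio-http's Response.status and Status.TooManyRequests constructors are assumptions for this illustration, and forward is the helper shown earlier.

// Illustrative only: allow the request through to the upstream service, or answer 429.
def handle(
    req: Request,
    client: Option[ClientName],
    upstream: Location.Absolute,
    limiter: RateLimiter
): ZIO[Scope, Throwable, Response] =
  limiter.attempt(client).flatMap {
    case AttemptResult.Allowed => forward(req, upstream)
    // Both rate-limit rejections and anonymous clients are answered with HTTP 429.
    case _                     => ZIO.succeed(Response.status(Status.TooManyRequests))
  }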

Note

It is possible to extend this to be more secure by adding an authentication layer in front of the rate limiter.

Client registry#

The rate limiter does not start with any clients pre-registered. When a new client-id is encountered, the rate limiter ends the current cycle before its due time and starts a new cycle with the new client registered. This ensures that new clients do not have to wait for the next cycle to get their capacity, and it also avoids the need to maintain a configuration file of clients. New clients get the "Default client capacity" in their first cycle; the remaining clients use the algorithm to calculate their capacity for the next cycle.

Resetting the cycle before its due time does introduce a small overhead, but it was acceptable for us because new clients would only be added at startup.

Passthrough mode#

To gain confidence in the rate limiter, we introduced a passthrough mode. In this mode, the rate limiter would not reject any requests; it would simply publish the stats. Running the rate limiter in this mode in production for some time gave us good confidence in it: we were able to see the traffic patterns, and the rate limiter handled the traffic without any issues. This mode was controlled via a Helm configuration and was eventually turned off after a month.
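
A minimal sketch of how such a flag might gate the final decision; the config class and flag name are assumptions for illustration (in our setup the value came in through Helm-provided configuration):

// Assumed config shape; in our deployment the flag was supplied through Helm values.
case class RateLimiterConfig(passthroughMode: Boolean)

// In passthrough mode the attempt is still evaluated and counted (so the published stats stay useful),
// but the request is never actually rejected.
def effectiveResult(config: RateLimiterConfig, computed: AttemptResult): AttemptResult =
  if config.passthroughMode then AttemptResult.Allowed else computed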

Monitoring#

ZIO has support for many monitoring protocols. We used ZIO's support for the Prometheus format to expose the rate limiter's custom metrics and viewed the stats in Grafana. We also had alerts set up for when the rate limiter was close to its capacity and when it was rejecting requests.
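
As an illustration, custom counters can be declared with ZIO's built-in Metric API roughly as below; the metric names and the per-client tag are assumptions made for this example, not the exact metrics we shipped.

import zio.*
import zio.metrics.Metric

// Assumed metric names; the real dashboards used the platform's own naming conventions.
val allowedRequests  = Metric.counter("rate_limiter_allowed_total")
val rejectedRequests = Metric.counter("rate_limiter_rejected_total")

// Record the outcome of an attempt, tagged by client so Grafana can break it down per product.
def record(client: ClientName, result: AttemptResult): UIO[Unit] =
  result match
    case AttemptResult.Allowed => allowedRequests.tagged("client", client).update(1L)
    case _                     => rejectedRequests.tagged("client", client).update(1L)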

Challenges#

While this rate limiter does solve the problem, it comes with its own set of challenges. The biggest one is: how do you find out the max capacity of a service per cycle duration? We would have to benchmark the service against production-like loads. This involves:

  • Effort to write load tests
  • Evolving the load tests as the service evolves
  • Running load tests periodically to stay up to date with the service's capacity
  • Running load tests whenever we change the service's memory and CPU configuration

These challenges make adoption of this approach expensive, but the cost can be justified if the service is very critical and the impact of downtime on clients is very high.