DEX handling 100,000 RPS: Autoscaling engine for PancakeSwap

Daniel Yavorovych
February 29, 2024

This is the story of how PancakeSwap became one of the biggest DEXes in the world thanks to its high availability, improved UX, and autoscaling heart.

About PancakeSwap

PancakeSwap is a decentralized exchange (DEX) launched on the Binance Smart Chain (BSC) in September 2020. It became one of the leading DEXes on the BSC and gained significant traction due to the high Ethereum gas fees experienced around its launch.

This platform quickly became famous thanks to low fees, swift transaction execution, diverse token offerings, a user-friendly interface, bold marketing, and customer-centric support. As the user base of this DEX and its liquidity pools grew, the whole platform came to be seen as more reliable, secure, and stable. Here’s the positive feedback loop of any DeFi project’s dreams!

But what was going on from the inside of PancakeSwap?

Positive attention from Binance, crypto events like IFOs, and even ordinary everyday user activity began to push traffic past the platform’s load limits. At the same time, simply adding more servers—overprovisioning—didn’t change the situation much: the enormous resources bought to cover traffic spikes sat unused during regular periods.

PancakeSwap had to solve this problem to stay reliable for its customers.

Crypto traders won’t wait while your server gets back on its feet after going down. They open X or Reddit and start a thread about 1,000 reasons not to use this or that platform for trading. That’s what every DEX wants to avoid. People worry about their money, and it’s natural that they hold trading platforms like PancakeSwap to the highest availability expectations.

To overcome this challenge, the Dysnix team applied all their experience and wits and implemented a predictive Kubernetes autoscaler called PredictKube to make PancakeSwap more stable, available at any time, and capable of handling its traffic. With this implementation, the engineers helped PancakeSwap get the maximum value out of its GCP resources. PredictKube solved a plethora of problems common to trading platforms and other Web3 projects.

Problems of growth and being flexible

Inefficient scaling

PancakeSwap used popular solutions before meeting Dysnix, such as public endpoints, but they didn’t perform well enough. First, the DEX couldn’t fully control or secure this type of connection because of the nature of public endpoints. Second, they didn’t solve the traffic load problem. Traditional scaling faced two main challenges:

  • By the time traffic surges, it might be too late to scale effectively;
  • During periods of low traffic, resources might be unnecessarily abundant and underutilized.

It's preferable to allocate more resources in anticipation of increased traffic rather than react once the surge has already occurred. The anticipation window should account for the fact that each new node takes a few hours to deploy under normal conditions.
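The idea above can be sketched in a few lines of Python. This is a hypothetical illustration, not PredictKube’s actual logic: the per-node capacity, the 20% headroom, and the lead time expressed in forecast steps are all assumed numbers. The point is that the scaler sizes the cluster for the peak expected *within* the node boot lead time, so capacity is ready before the surge lands.

```python
# Hypothetical sketch of proactive capacity planning with node boot lead time.
# NODE_CAPACITY_RPS, the 20% headroom, and BOOT_LEAD_STEPS are illustrative.

NODE_CAPACITY_RPS = 5_000   # assumed requests/sec one node can serve
BOOT_LEAD_STEPS = 3         # a node takes ~3 forecast steps to become ready

def nodes_needed(rps: float) -> int:
    """Minimum node count to serve the given load, with 20% headroom."""
    return max(1, -(-int(rps * 1.2) // NODE_CAPACITY_RPS))  # ceiling division

def proactive_target(forecast: list[float], now: int) -> int:
    """Scale for the peak load expected within the boot lead time,
    so new nodes are serving traffic *before* the surge arrives."""
    horizon = forecast[now : now + BOOT_LEAD_STEPS + 1]
    return nodes_needed(max(horizon))

forecast = [4_000, 6_000, 30_000, 90_000, 20_000]
print(proactive_target(forecast, now=0))  # 22 — sized for the 90k spike ahead of time
```

A reactive scaler looking only at the current 4,000 RPS would keep one node and meet the 90,000 RPS spike unprepared; the proactive target orders the full fleet while there is still time to boot it.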

Why manual scaling or simple autoscaling methods won’t do the trick

Manual and automated scaling are still applicable in many cases, but not for a DEX platform like PancakeSwap. Here’s why:

  • Reactive nature: you can set up rules that trigger scaling, but they fire only when some threshold, such as CPU utilization, is reached. There’s no way to scale in advance.
  • Inefficiency: Thresholds can’t solve over-provisioning and under-provisioning. Even the most masterful set of limits and up/down-scaling rules can still cause lost traffic or wasted spend.
  • Manual tuning: Classic autoscaling requires frequent adjustments, and your specialist has to watch extremely carefully how the platform grows and what market changes are coming.
  • Out-of-pattern surprises: For any unexpected event causing a critical load that doesn’t fit the described scaling scenarios, traditional scaling is helpless.
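The “reactive nature” problem in the list above is easy to demonstrate with a toy simulation. Everything here is made up for illustration (per-node capacity, the 80% threshold, the two-step boot delay): the threshold fires only after utilization is already high, and because the new node needs time to boot, the spike is served under-provisioned and traffic is lost.

```python
# Toy simulation of purely reactive, threshold-based scaling (numbers are
# illustrative). A node is requested when utilization crosses 80%, but it
# only becomes ready BOOT_DELAY steps later, so the spike arrives first.

NODE_RPS = 10_000      # assumed per-node capacity
BOOT_DELAY = 2         # steps until a requested node starts serving traffic

def dropped_requests(traffic: list[float]) -> float:
    """Return total requests lost while capacity lagged behind the load."""
    ready, pending, dropped = 1, [], 0.0
    for t, rps in enumerate(traffic):
        ready += sum(1 for at in pending if at == t)    # requested nodes come online
        pending = [at for at in pending if at > t]
        dropped += max(0.0, rps - ready * NODE_RPS)     # over-capacity traffic is lost
        if rps > 0.8 * ready * NODE_RPS:                # threshold fires too late
            pending.append(t + BOOT_DELAY)
    return dropped

print(dropped_requests([5_000, 9_000, 25_000, 25_000, 5_000]))  # 20000.0
```

Even though the scaler reacts correctly at every step, 20,000 requests are dropped during the spike purely because of the boot lag — the gap that scaling in advance closes.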


This is a typical state of many cloud-based projects. “How much server capacity will I need next month? If I buy too much—there’s no money-back guarantee to save me from the mistake. If I buy too little—I’ll lose my traffic…”—these thoughts may be familiar to you.

An overprovisioning strategy forces you to buy far more than you need to cover any traffic scenario, even if you see one spike a month. As in the picture below, you always pay for the Actual nodes (red lines), while most of the time you need five times fewer.

This picture is also a sneak peek at PredictKube’s traffic prediction feature, so you can compare actual vs. predicted RPC load.

Latency spikes

When RPS grows rapidly and the available resources can’t cope, latency spikes follow.

During such episodes, the system becomes congested and unavailable, transactions may get lost, and the user experience degrades. To solve this problem, you need enough resources, and you need to distribute traffic flows so that no sign of delay reaches the interface or the system overall. If the congestion becomes severe enough, downtime follows and everything has to be relaunched.

Effects on business: Loss of traffic and indefinite service level

These challenges stood in the way of PancakeSwap’s development. Poor user experience, periodic downtime, and high latency can’t be the foundation of a reliable DEX for thousands of traders. Loss of traffic means unfinished, delayed, or canceled transactions. A blurred service level is a vague promise that never attracts people’s money.

And all these challenges were solvable with PredictKube.

PredictKube implementation

PredictKube, the proactive autoscaler for Kubernetes built by Dysnix, runs on an AI model and is compatible with GCP, Azure, and AWS. It helps avoid overprovisioning by predicting the traffic trend and balancing resources in advance. It’s easy to use, customizable, fast-learning, and adaptable.

PredictKube prediction in action

At its core, it extends GKE’s capabilities, incorporating Kubeflow, TensorFlow, and Elasticsearch functionality.

Components of the PredictKube solution

  • Load prediction for proactive scale

This load prediction is based on two types of data. 

First, historical data allows us to predict the usual traffic trend. Second, business metrics data (from multiple sources) lets us anticipate otherwise unpredictable events by considering non-traffic indicators.

Thus, predictive scaling and event-based scaling together cover all the cases that need tracking for a proactive reaction to load changes.
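The two-signal idea described above can be sketched very roughly. This is not PredictKube’s actual model — the real system uses a trained AI model — but a minimal stand-in: a seasonal-naive forecast supplies the historical trend, and a multiplier (a hypothetical `event_boost` parameter) stands in for a business-metric signal such as an announced IFO.

```python
# Illustrative two-signal forecast: historical seasonality plus an
# event-based multiplier. All names, periods, and factors are hypothetical.

def seasonal_naive(history: list[float], period: int, steps: int) -> list[float]:
    """Predict each future step as the value one full period ago."""
    return [history[-period + (i % period)] for i in range(steps)]

def predict(history: list[float], period: int, steps: int,
            event_boost: float = 1.0) -> list[float]:
    """event_boost > 1.0 encodes a business-metric signal (a planned event)."""
    return [v * event_boost for v in seasonal_naive(history, period, steps)]

day = [1_000, 3_000, 8_000, 4_000]   # one "day" of traffic samples
history = day * 7                     # a week of history
print(predict(history, period=4, steps=4, event_boost=2.0))
# [2000, 6000, 16000, 8000] — the usual daily shape, doubled for the event
```

The historical signal alone would repeat last week’s pattern; the event signal lifts the whole forecast when non-traffic indicators say a surge is coming, which is exactly the case a purely historical model misses.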

An example of business metrics that drive traffic load
  • Continuous size optimization of the cluster for the current load

This process is based on constant balancing of resources for the current and upcoming load. A full node needs time to launch, so, depending on the prediction, setup starts well in advance. Shutting down takes less time, but it must happen right on time. The system also tracks health indicators for each node. If a node slows down because it is overloaded or cluttered, traffic automatically switches to a healthy one, and the slow node gets a clean-up.
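The health-rotation step described above can be sketched as follows. The `Node` type, the p95 latency metric, and the 200 ms limit are all assumptions for the example — not PredictKube’s real settings — but the shape is the same: nodes that slow down or fail health checks are taken out of the serving set and recycled.

```python
# Hypothetical sketch of node rotation: traffic is shifted off slow or
# unhealthy nodes, which are then drained and recycled. Thresholds are
# illustrative, not real PredictKube settings.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    p95_latency_ms: float
    healthy: bool = True

def rotate_unhealthy(nodes: list[Node], latency_limit_ms: float = 200.0):
    """Split nodes into (serving, recycling) by health and latency."""
    serving, recycling = [], []
    for node in nodes:
        if node.healthy and node.p95_latency_ms <= latency_limit_ms:
            serving.append(node)
        else:
            recycling.append(node)   # drained, cleaned up, then relaunched
    return serving, recycling

nodes = [Node("a", 80), Node("b", 450), Node("c", 95, healthy=False)]
serving, recycling = rotate_unhealthy(nodes)
print([n.name for n in serving])     # ['a']
```

Node "b" is rotated out for latency and "c" for failing its health check, while traffic keeps flowing through "a" — no user-visible downtime during the clean-up.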

  • Selection of optimal type of instances

The optimal instance type influences the speed of traffic processing and overall throughput. Depending on the traffic situation, the nature of the load, and the predicted pattern, instances can be switched from one type to another to stay optimal (the best-fitting and cheapest at the same time). This is also an automated process.
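A minimal sketch of that best-fit choice, with made-up capacities and prices (real instance catalogs and the actual selection logic are more involved): for a predicted load, pick the instance type whose node count covers it at the lowest total hourly cost.

```python
# Illustrative instance-type selection: cheapest machine type that still
# covers the predicted load. Capacities and prices are made up.

INSTANCE_TYPES = {                    # type: (capacity_rps, price_per_hour)
    "small":  (2_000, 0.10),
    "medium": (8_000, 0.35),
    "large":  (32_000, 1.20),
}

def best_fit(predicted_rps: float) -> tuple[str, int]:
    """Return (type, count) minimizing hourly cost for the predicted load."""
    best = None
    for name, (cap, price) in INSTANCE_TYPES.items():
        count = -(-int(predicted_rps) // cap)          # ceiling division
        cost = count * price
        if best is None or cost < best[2]:
            best = (name, count, cost)
    return best[0], best[1]

print(best_fit(100_000))   # ('medium', 13)
```

Note that the biggest machine isn’t automatically the cheapest: for 100,000 RPS, thirteen mediums ($4.55/h) beat four larges ($4.80/h) in this toy catalog, which is why the selection has to be recomputed as the predicted pattern changes.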

Problem fixed: Happy crypto bunny with the lowest latency possible

Results of the PredictKube implementation for PancakeSwap in numbers:

  • Latency decreased from ~400–500 ms to ~80 ms. It stays at that level under any load, during any IFO or other event.
  • We made the infrastructure always available, coping with up to 100,000 RPS.
  • The cloud bill decreased by more than 50%.
PredictKube was implemented in February

Thanks to solving the problems of over- and under-provisioning, right-on-time scaling, node autorotation, and automatic selection of best-fit nodes, the cloud bill melted down to its optimum.

Another nice bonus is that PredictKube is not a third-party service. It’s more of an implant that widens all the scaling possibilities. Once set up, the AI model inside keeps learning from the data it’s allowed to see and becomes even more precise. As the market, the platform, and technologies evolve, PredictKube receives further updates as well.

Bonus effects of implemented predictive autoscaler

Implementing a predictive autoscaling tool is the reward for cautious and meticulous work with data streams of all kinds inside and outside your infrastructure. Only by possessing enough data about your traffic—attentively collecting each hit and request—can you give the model enough “food” to become efficient and truly helpful. Without your data, it will remain adapted only to the primary data we trained it with.

So, keeping your data crystal clear and organized is a prerequisite for reinforcing your IT infrastructure with an autoscaling tool.

The technical team behind the solution: CTO’s direct speech 

The idea for such a tool appeared while Dysnix worked on projects like zkSync. Sometimes simple algorithms wouldn’t scale things correctly, so we had to invent something else. Sure, there are projects where you don’t need predictive scaling at all—it’s enough to add a few scaling rules and tell the system what to do when a traffic spike appears.

We started to gather a team of experts to create a solution that would give a reliable result under conditions of uncertainty.

It wasn’t a big team, though: one Architect, two Lead DevOps Engineers, one Senior DevOps Engineer, one Lead AI Developer, and two of his assistants. Together, we managed to create and train a core AI model that is easily adjustable and omnivorous when it comes to traffic-volume and business-metrics data. We planned to make PredictKube a standalone solution, so a project could install and set it up once and for all, staying independent from us as its creators.

Aside from our clients, we worked closely with the KEDA community, collecting their feedback and adjusting PredictKube accordingly. Products like this are never the personal achievement of one company; they are a mutual creation of a caring community.

Long-term and delayed impacts of autoscaling for Web3 projects

Now, autoscaling is an unavoidable part of any viable Web3 project’s future. From decreasing energy consumption and carbon footprint to simple cost-cutting, every reason motivates setting up scaling. Some effects of scaling such projects reach beyond the near future but should still be mentioned.

Potential long-term impacts of implemented autoscaling

  • Increased adoption
    An efficient scaling solution will help to handle more users and transactions without extensive technical growth.
  • Ecosystem awareness
    Scalable platforms are easier to integrate and intertwine with other services. The flexibility and resource-wise core of such projects can be the most attractive features for the partners while building a mutual technical landscape.
  • Better security
    Proper scaling solutions, especially layer-2 solutions, require robust security mechanisms. With autoscaling, the opportunities for human error decrease, so overall security improves.
  • Competitive edge
    As the projects can scale well, they outperform competitors who struggle with scalability issues.
  • Cost efficiency
    Optimized projects don’t spend a penny on inefficiently distributed resources.

Delayed effects you may expect after implementing autoscaling

Decreased decentralization: Full-fledged resource optimization might affect the distribution of roles in the blockchain. For example, projects may rely on a smaller set of validators or nodes for efficiency, which can centralize control. However, as blockchains demonstrate self-regulatory features, this effect can be avoided.

Technical debt from quick and careless implementation: Quick fixes for scalability can lead to technical debt. Out-of-the-box solutions rarely work for a project without mending and customization. Increased technical debt always leads to additional expenses and a loss of focus on the main growth direction.

Interoperability challenges for projects with multiple outer connections: As projects implement their own customized scaling solutions, interoperability between platforms or blockchains might become challenging.

Dependency on third-party solutions: Each time you delegate a vital part of your architecture to a service provider, you get a headache in the long run. You can’t be sure of the stability and 100% availability of any solution that is not totally under your control.

That’s why it’s more secure to keep all autoscaling solutions implemented in the core of your project.

With autoscaling becoming a must-have feature, any Web3 venture will be more viable, stable, and predictable. Trying solutions like PredictKube will help you find out what suits your case best.

Daniel Yavorovych
CTO and Co-founder at Dysnix
Brainpower and problem-solver, meditating and mountain hiking.