Fulfilling the Cloud Native Promise: Autoscaling for Apache Pulsar Is Here!

by DataStax, July 13th, 2023

Too Long; Didn't Read

We're pleased to introduce Kubernetes Autoscaling for Apache Pulsar, or KAAP. KAAP is a Kubernetes operator that is available on OperatorHub.

Apache Pulsar has always seen itself as cloud native. It’s right there on the home page:


“Apache® Pulsar™ is an open-source, distributed messaging and streaming platform built for the cloud”


Pulsar certainly embodies many cloud-native principles. It separates compute from storage: the Pulsar broker serves messages and Apache BookKeeper stores them. It also relies on stateless components, like the Pulsar broker, that can be scaled up and down at will.


However, it has never really delivered on the ultimate promise of the cloud: autoscaling. At least, not until today.


I am excited to introduce Kubernetes Autoscaling for Apache Pulsar, or KAAP. KAAP is a Kubernetes operator that is available on OperatorHub. It includes all the niceties that you would expect from a Kubernetes operator. Once you have the operator installed in your Kubernetes cluster, you can get a fully functional Pulsar cluster by applying a single custom resource, an instance of the PulsarCluster custom resource definition (CRD), like this:


apiVersion: kaap.oss.datastax.com/v1alpha1
kind: PulsarCluster
metadata:
  name: pulsar
spec:
  global:
    name: pulsar
    image: 'apachepulsar/pulsar:3.0.0'


Pulsar has multiple components: ZooKeeper, BookKeeper, brokers, and proxies. The operator handles configuring all of them with the PulsarCluster CRD. Because the operator works at the cluster level, you don’t have to worry about how the components work together. The operator takes care of that for you.

Automatic staged upgrades

Having operated a production cloud service for Pulsar longer than anyone I know of (first at Kesque and now with DataStax Astra Streaming), I know firsthand how long it takes to do a proper upgrade of a Pulsar cluster. While you could just YOLO upgrade all the components at once, the recommended and safest method is to upgrade each component one at a time, ensuring each component has successfully upgraded before moving on to the next.


This is the process that we follow when upgrading the Pulsar clusters in our Astra Streaming service because the availability of our customer service is paramount, but it is very time-consuming for our production engineering team. In fact, this is the number one request they have: can you make the upgrades smarter and more automatic? Well, with KAAP, we have done just that.


With a single line change in the PulsarCluster CRD, you can trigger a staged upgrade of your Pulsar cluster. The KAAP operator will orchestrate a careful, staged upgrade of your cluster. Of course, you should monitor the metrics during the upgrade, but you can let the KAAP operator drive (no sleeping at the wheel).
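To make the idea of a staged upgrade concrete, here is a minimal Python sketch of the control flow. It is not KAAP's implementation. The component order (taken from the component list earlier in this post), the `staged_upgrade` function, and the `is_healthy` callback are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Assumed order, following the component list earlier in this post.
# The operator's real ordering may differ.
UPGRADE_ORDER = ["zookeeper", "bookkeeper", "broker", "proxy"]


def staged_upgrade(
    versions: Dict[str, str],
    target: str,
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade one component at a time and stop at the first unhealthy one.

    `versions` maps component name to its current version and is updated
    in place. `is_healthy` is a hypothetical readiness check.
    Returns the components that were upgraded and verified.
    """
    completed: List[str] = []
    for component in UPGRADE_ORDER:
        if versions.get(component) == target:
            continue  # already on the target version
        versions[component] = target  # stand-in for rolling out the new image
        if not is_healthy(component):
            break  # do not touch later components until this one recovers
        completed.append(component)
    return completed
```

If the hypothetical health check fails for the brokers, the proxies are left on the old version, which is the point of upgrading in stages.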


Open source autoscaling

Automatic staged upgrades are nice, but that’s not what I am most excited about with KAAP. The concept of elastic resources is a key promise of the cloud. It is why many of the AWS services have “elastic” in their name. I (actually ChatGPT) did a quick search and there are at least 10 of them.


The cloud enables you to acquire and release resources on demand, so you can use autoscaling techniques to match the number of resources your application is using to the load on it. As the load increases, you can automatically add more resources. As the load decreases, you can free up resources. Since cloud providers charge for the usage of resources, autoscaling can help you make the most efficient use of your cloud spend.


I am sure we can all think of examples of cloud services that feature autoscaling behind the scenes. Most of those “elastic” AWS services work like that. However, the implementation of the autoscaling is proprietary, locked in a vendor’s private repo. Sometimes they will talk about how they have implemented autoscaling, like in this paper about AWS DynamoDB, or this blog post about how Confluent Cloud makes Kafka elastic. But no one, to my knowledge, has made their autoscaling implementation open source, freely available to all users under a truly open source license. That’s exactly what we are doing with KAAP.


KAAP adds sophisticated autoscaling capabilities to your Pulsar cluster (or Luna Streaming, our Pulsar distribution) under an Apache 2.0 license. You can check it out in our GitHub repository. If you like it, remember to give it a star.


Pulsar + autoscaling: A perfect match

KAAP gives you a way to add autoscaling to Pulsar when running in Kubernetes. Not a simple Horizontal Pod Autoscaler (HPA) tacked on to a Pulsar broker deployment, but a sophisticated autoscaler that works with the existing cloud-native components of Pulsar to provide a truly elastic streaming and messaging system. In my opinion, this is a combination that can’t be beat.


Stateless broker scaling

Let’s take a look at the two main autoscaling dimensions of KAAP. First, we’ll look at Pulsar brokers. As you might know, Pulsar brokers are stateless, which means you can scale them up and down easily, as required. But what you might not know is that Pulsar has a built-in broker load-balancing mechanism that continuously monitors CPU, memory, and network bandwidth on each broker. Using that information and one of the configurable load-balancing algorithms, Pulsar will move topics from broker to broker to prevent a broker from being overloaded.


A naive autoscaling solution would be to configure an HPA on the brokers and, when some metric like CPU gets high, add another broker pod. But this might not actually be necessary, because the Pulsar broker load balancer could decide to shift topics to even out the load at the same time. Now you’ve added a broker pod that isn’t needed because the Pulsar load balancer has balanced things out. So the HPA decides to scale down the new broker pod, which causes any new topics that had been created on it to get moved to an existing broker. As you can imagine, the Pulsar load balancer and an HPA can create a thrashing mess of brokers going up and down and topics being shifted from broker to broker.


KAAP avoids this problem by integrating directly with the Pulsar load balancer. KAAP scales up the brokers when the cluster-wide metrics from the Pulsar load balancer suggest that the cluster is nearing capacity, not when a single broker pod is busy. And it only scales down a broker if the usage of the whole cluster falls below a configured threshold. The KAAP operator works with the Pulsar broker load balancer, not against it.
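The difference between reacting to one busy pod and reacting to the whole cluster can be sketched in a few lines of Python. This is an illustration only: using the mean load across brokers as the cluster-wide signal, the threshold values, and the function name are my assumptions, not KAAP's actual algorithm or defaults.

```python
from typing import List


def broker_scaling_decision(
    broker_loads: List[float],
    scale_up_threshold: float = 0.8,    # assumed value, for illustration
    scale_down_threshold: float = 0.3,  # assumed value, for illustration
    min_brokers: int = 2,               # assumed floor on cluster size
) -> int:
    """Return +1 to add a broker, -1 to remove one, 0 to hold.

    `broker_loads` holds one load figure (0.0 to 1.0) per broker, such as
    the usage a load balancer reports. The decision looks at the cluster
    as a whole, never at a single broker.
    """
    if not broker_loads:
        return 0
    cluster_load = sum(broker_loads) / len(broker_loads)
    if cluster_load > scale_up_threshold:
        return 1
    if cluster_load < scale_down_threshold and len(broker_loads) > min_brokers:
        return -1
    return 0
```

With loads of `[0.95, 0.2, 0.2]`, one broker is hot but the cluster is not, so the sketch holds steady and leaves the load balancer to move topics. A per-pod CPU trigger would have added a broker here.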


Scaling storage up and down: Remarkable!

Scaling the compute (or serving) layer of Pulsar is great, but it’s not enough for a true autoscaling implementation. Sure, the number of messages (or events) that need to be processed can vary over time, but what about the number of messages that need to be stored? We’ve probably all had to deal with a backlog of messages building up because of a failure of a downstream system. As the outage drags on, the available storage on the streaming system starts to run low.


This is a scenario where Pulsar and its reliance on BookKeeper shine. To add storage capacity to a Pulsar cluster, you just need to add a new BookKeeper node, affectionately called a “bookie.” Because BookKeeper storage is based on segments of topics and not whole topics, the new bookie is immediately available to relieve the pressure on storage, without any painful rebalancing of topics or any other operational intervention.


KAAP can, of course, handle this case for you. It constantly monitors the disk usage of the bookie nodes, and if available storage is getting low, it will scale up a new bookie node. This is not all that remarkable (at least for Pulsar). It’s pretty straightforward to add a new replica in Kubernetes, even if it is stateful and backed by persistent volumes. But how about when the outage is over and the backlog has cleared? Are you now stuck with an extra bookie node consuming resources that aren’t really needed anymore?


Not with KAAP, you’re not. Once the BookKeeper storage drops below a configured threshold, the KAAP operator will carefully orchestrate the removal of the unneeded bookie node. It does this in a very safe way, ensuring that no messages are lost and that the required replication factor is maintained at all times. For example, if you have configured Pulsar to keep three copies of each message (write and ack quorum both equal to three), KAAP interacts with BookKeeper to copy messages from the bookie that is being scaled down to other bookies, ensuring that at least three copies of each message remain available. Once that has been completed successfully, it will proceed with removing the unneeded bookie.
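Here is a rough Python sketch of the invariant behind that safe removal: before a bookie goes away, every ledger it holds is copied elsewhere so the copy count never drops below the replication factor. The data model (a map from ledger ID to the set of bookies holding it) and the function name are hypothetical simplifications. Real BookKeeper tracks ledger fragments and ensembles, and KAAP drives BookKeeper's own tooling rather than code like this.

```python
from typing import Dict, Set


def plan_bookie_removal(
    ledgers: Dict[str, Set[str]],
    bookies: Set[str],
    leaving: str,
    replicas: int = 3,
) -> Dict[str, Set[str]]:
    """Return a new ledger placement that no longer uses `leaving`.

    Raises ValueError when the remaining bookies could not hold
    `replicas` copies, in which case the bookie must not be removed.
    """
    remaining = sorted(bookies - {leaving})
    if len(remaining) < replicas:
        raise ValueError("not enough bookies left to keep the replication factor")

    placement: Dict[str, Set[str]] = {}
    for ledger_id, holders in ledgers.items():
        kept = set(holders) - {leaving}
        for candidate in remaining:
            if len(kept) >= replicas:
                break
            kept.add(candidate)  # stand-in for copying the data to this bookie
        placement[ledger_id] = kept
    return placement
```

Only once such a plan has been carried out would the bookie actually be removed. If the plan cannot be made, the scale-down is refused.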


With KAAP you get automatic scaling of the storage in your Pulsar cluster both up and down so that you can optimize the storage usage in your cluster and not get stuck with idle capacity after an unfortunate production incident. I don’t know about you, but I think that’s pretty remarkable.
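The up-and-down decision for storage can be sketched with two watermarks. This is a simplified illustration under my own assumptions: the watermark values, the `min_writable` floor, and the rule that a bookie above the high watermark no longer counts as writable are placeholders, not KAAP's documented behavior or defaults.

```python
from typing import List


def bookie_scaling_decision(
    disk_usage: List[float],
    high_watermark: float = 0.92,  # assumed value, for illustration
    low_watermark: float = 0.75,   # assumed value, for illustration
    min_writable: int = 3,         # assumed floor, e.g. the write quorum
) -> int:
    """Return +1 to add a bookie, -1 to remove one, 0 to hold.

    `disk_usage` holds the used fraction (0.0 to 1.0) of each bookie's disk.
    """
    writable = sum(1 for usage in disk_usage if usage < high_watermark)
    if writable < min_writable:
        return 1  # too few bookies can still accept writes
    all_low = all(usage < low_watermark for usage in disk_usage)
    if all_low and len(disk_usage) > min_writable:
        return -1  # backlog has cleared and there is spare capacity
    return 0
```

The gap between the two watermarks is what keeps the cluster from flapping: a bookie added during an incident is only removed once usage has dropped well below the level that triggered the scale-up.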


Zone awareness and migration tools

KAAP can do staged upgrades and smart scaling of your Pulsar cluster. But there’s more. To operate a highly available cluster in a cloud provider, it’s important to take availability zones (AZs) into consideration. If you don’t spread your components, especially BookKeeper, across availability zones, you won’t be able to survive an AZ failure and provide multiple nines of availability.


Luckily, Pulsar has great built-in capabilities like rack awareness to support high-availability deployments. The tricky part is that to set it up properly you need to configure both Kubernetes, with zone awareness, and Pulsar. The KAAP operator has you covered by introducing the concept of resource sets, which enable you to group components and give them rack awareness. The KAAP operator will automatically apply both the Kubernetes and Pulsar configurations based on your declarative configuration of resource sets. Resource sets are flexible, allowing you to support a variety of Pulsar deployment options.
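To show why rack awareness matters, here is a small Python sketch that picks the bookies for a set of message copies so they span as many zones as possible. It is a toy round-robin policy written for this post, not BookKeeper's actual placement policy and not how resource sets are implemented. The bookie and zone names are made up.

```python
from collections import defaultdict
from typing import Dict, List


def pick_zone_aware_ensemble(bookie_zones: Dict[str, str], size: int) -> List[str]:
    """Choose `size` bookies, taking one per zone in turn.

    `bookie_zones` maps a bookie name to its availability zone.
    Raises ValueError when there are fewer than `size` bookies.
    """
    by_zone: Dict[str, List[str]] = defaultdict(list)
    for bookie, zone in sorted(bookie_zones.items()):
        by_zone[zone].append(bookie)

    queues = [by_zone[zone] for zone in sorted(by_zone)]
    ensemble: List[str] = []
    # Round-robin across zones so copies are spread before any zone repeats.
    while len(ensemble) < size and any(queues):
        for queue in queues:
            if queue and len(ensemble) < size:
                ensemble.append(queue.pop(0))
    if len(ensemble) < size:
        raise ValueError("not enough bookies for the requested ensemble size")
    return ensemble
```

With three copies spread over three zones, losing one zone still leaves two copies of every message, which is what makes it possible to survive an AZ failure.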


And what if you are already running your Pulsar using a Helm chart or maybe just some Kubernetes manifest magic? KAAP has a migration tool to help you out. You can point the migration tool at your existing Kubernetes Pulsar deployment and it will automatically generate a matching CRD configuration that you can use to have the KAAP operator take over operating your cluster for you.


The KAAP Stack

The KAAP operator has lots of great features, turbocharging your regular Pulsar cluster into a well-oiled, highly available, autoscaling machine. But as someone who has operated production Pulsar clusters for a long time, I know that there are lots of other considerations for creating a production Pulsar cluster, such as TLS certificate management, authentication, and monitoring.


That’s why we have included what we call the KAAP stack with the operator. It’s an umbrella Helm chart that installs the KAAP operator along with critical production tools, including:

  • Cert Manager
  • Keycloak
  • Prometheus Stack (Grafana)
  • Pulsar Grafana dashboards


These are must-have tools when running our production Pulsar clusters and we didn’t want to leave you high and dry, so we pulled them all together and integrated them into one convenient package.


Why use KAAP?

So you’ve heard about all the great features of Kubernetes Autoscaling for Apache Pulsar. You can create an entire Pulsar cluster with a single CRD. You can put your upgrades on autopilot and let the KAAP operator perform safe, staged upgrades. You can automatically scale Pulsar brokers up and down based on the Pulsar broker load balancer saying the brokers are getting overloaded, and you can scale BookKeeper nodes up and down (!) safely based on the storage requirements of your clusters. You can easily configure your cluster for availability zone awareness for high availability. And it even includes a migration tool so you can easily move from your old Helm-based deployment to a turbo-charged, KAAP operator-based one.


So—lots of great features, but what are the benefits of KAAP? I can think of several:


  • Easily configure and operate highly available Pulsar clusters running in Kubernetes, leading to less effort for you and your production teams
  • Dramatically simplify scaling Pulsar by autoscaling the cluster resources to match changing demand
  • Reduce the total cost of ownership of a Pulsar cluster by eliminating provisioning for peak loads and overprovisioning as a result of production incidents
  • Avoid vendor lock-in by using all open source technologies


In my opinion, releasing KAAP is a truly innovative moment in the streaming and messaging space. No other open-source project combines the streaming and messaging power of Apache Pulsar with the ultimate promise of cloud computing: fully elastic autoscaling. I’m looking forward to you trying it out. Hop into GitHub Discussions in our repo and let us know what you think!


Find out more

If you want to take a technical deep dive into KAAP, take a look at this blog post. You can find the complete documentation for KAAP here. And here is the GitHub repository.


By Chris Bartholomew, DataStax


This story was distributed by Datastax under HackerNoon’s Brand As An Author Program. Learn more about the program here: https://business.hackernoon.com/brand-as-author