Why we decided to build a K8ssandra operator - Part 2 - K8ssandra, Apache Cassandra® on Kubernetes

In the first post in this series, John Sanda talked about how the K8ssandra team leveraged the Helm package manager to quickly deliver the first few releases of the project. We reviewed some of the limitations with Helm that we ran into. In this post, Jeff DiNoto joins the conversation as we discuss how we decided it was time to start building a K8ssandra operator and the ongoing role we see for Helm within the K8ssandra project.

The feature that made us decide to build an operator

Jeff Carpenter: It sounds like you were able to work around most of these challenges, maybe through duplicating some code, more so than you would have wanted. You could still make it work, but was the desire to implement multi-cluster K8ssandra deployments the final straw for using Helm?

John Sanda: Yes, not just multi-cluster, but even multi-datacenter within the same Kubernetes cluster. There are things that we want to do with multi-cluster deployments, without even getting into the intricacies around networking. It would really be an uphill battle trying to do that with Helm. It really comes down to using the right tool for the job.

Jeff Carpenter: Were you trying to use Helm beyond the ways in which it was intended? Is Helm only designed for deployment to a single Kubernetes cluster, or does it have capabilities for deploying to multiple Kubernetes clusters?

John Sanda: I don’t think we’ve stretched it beyond Helm’s intended use. However, it’s easy to get into a situation where you learn how to use the hammer and everything looks like a nail, but what you really need is a screwdriver. In terms of developing a K8ssandra operator, I don’t see Helm and the K8ssandra operator as mutually exclusive, where we would completely drop Helm. It’s a complementary tool. You need Helm to install the operator, create a service account, and set up the RBAC for it. That’s beyond the scope of the operator, and that’s where package installation tools like Helm come into play.

Jeff DiNoto: The Operator Framework defines a capability model for operators. And Helm sits in the first two stages of that, while a full-fledged programming language like Go covers the full gamut and gives us a base to do more in the future. There are generally known limitations with Helm, and at the end of the day, it’s not a programming language.

What is the proper usage of Helm?

Jeff Carpenter: It sounds like these are known issues within the community. However, if a Helm community member were listening in on our conversation, would they be thinking that we don’t know all the things that Helm can do, or asking why we haven’t filed issues? Or is it more that Helm is suitable for certain things, and we need to build an operator to do the things that operators do best?

Jeff DiNoto: I think that’s mostly true, although there were spots where we discussed enhancing Helm to do CRD management – Helm installs CRDs but doesn’t manage them afterward. Right now we have to create hooks within Helm to perform the management tasks.

John Sanda: We just walked into a hornet’s nest, because in the Helm community, there are a lot of strong opinions on what Helm should do with respect to CRDs. In Helm 2, CRDs can be treated as templates. You could add Helm template code to CRDs, and Helm treats them just like any other resource. With Helm 3, you create a crds directory in the chart and put any CRDs you want to install there. You cannot use Helm template code in them.

John Sanda: And like Jeff said, Helm will install it. But after that, it doesn’t do anything else. It’s not an omission on the part of the Helm developers; this was based on their prior experience with Helm 2, as well as the experience of many Helm users. They felt that was the best decision, albeit a very controversial one. We’ve had to go through a lot of work writing a hook to upgrade our own CRDs, when it would be nice if Helm would just do that for us.

Jeff Carpenter: What does it mean to have a hook to manage your own CRDs?

John Sanda: It’s similar to the idea of a GitHub webhook, where you can inject a step into your CI/CD lifecycle. In Helm, you can provide a pre-upgrade hook for CRDs. We define a Kubernetes job and add an annotation to it that says this is a pre-upgrade hook. Then Helm will run that job before it does the upgrade work. That job runs a custom image with some Go code that we wrote, which looks at the CRDs to see if any changes are needed and applies them.
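To make the idea concrete, here is a minimal Go sketch of the kind of decision such a hook might make: compare the CRDs shipped with the new chart against those already in the cluster and work out what to create or update. This is not the K8ssandra hook code. The `CRD` and `Action` types, the `planCRDChanges` function, the fingerprint field, and the example CRD names are all hypothetical, and the sketch works on in-memory data instead of calling the Kubernetes API.

```go
package main

import (
	"fmt"
	"sort"
)

// CRD is a hypothetical, simplified stand-in for a CustomResourceDefinition:
// a name plus a fingerprint of its spec (for example, a hash of the YAML).
type CRD struct {
	Name        string
	Fingerprint string
}

// Action describes what the hook would do for one CRD.
type Action struct {
	Name string
	Op   string // "create", "update", or "none"
}

// planCRDChanges compares the CRDs bundled with the new chart version
// (desired) against the CRDs already in the cluster (installed) and returns
// one action per desired CRD, sorted by name for stable output.
func planCRDChanges(installed, desired []CRD) []Action {
	current := make(map[string]string, len(installed))
	for _, c := range installed {
		current[c.Name] = c.Fingerprint
	}

	actions := make([]Action, 0, len(desired))
	for _, d := range desired {
		fp, found := current[d.Name]
		switch {
		case !found:
			actions = append(actions, Action{Name: d.Name, Op: "create"})
		case fp != d.Fingerprint:
			actions = append(actions, Action{Name: d.Name, Op: "update"})
		default:
			actions = append(actions, Action{Name: d.Name, Op: "none"})
		}
	}
	sort.Slice(actions, func(i, j int) bool { return actions[i].Name < actions[j].Name })
	return actions
}

func main() {
	// Hypothetical data: one CRD changed, one is new.
	installed := []CRD{{Name: "widgets.example.com", Fingerprint: "abc"}}
	desired := []CRD{
		{Name: "widgets.example.com", Fingerprint: "def"},
		{Name: "gadgets.example.com", Fingerprint: "123"},
	}
	for _, a := range planCRDChanges(installed, desired) {
		fmt.Printf("%s: %s\n", a.Name, a.Op)
	}
}
```

A real hook would read the installed definitions from the API server and apply the resulting changes. The sketch only covers the comparison step that decides whether any change is needed.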

John Sanda: Looking at this from the flip side, why would Helm want to avoid doing this from a tooling standpoint? Well, a CRD is a cluster-wide resource. If you have multiple Helm installs working off of a particular CRD and somebody installs another version of the CRD, that’s going to impact every instance of that CRD in the cluster. On that basis, you can understand the rationale for why they punt on handling CRDs. Not that I agree with it, but I understand it.

How to know it’s time to build an operator

Jeff Carpenter: Do you have any advice for other communities on how they will know when it’s time to build an operator?

John Sanda: I think there are different criteria. But from a pure engineering standpoint, if you find yourself dealing with multiple situations where the tooling is working against you and not for you, then maybe it’s time to consider a different solution.

John Sanda: Let me use another example to illustrate the point. We don’t want to deploy Stargate until the Cassandra cluster is up, to avoid schema disagreement. With Helm out of the box, there’s no way to do that. We have to add an init container in the Stargate pod that performs a rudimentary check that the cluster is up and running. This works fine, but this is a problem that is solved more easily and in a much more robust way in an operator. The Stargate operator gets triggered to run its reconciliation. It queries the state of the Cassandra cluster and checks its status to find out whether it is ready. Once the cluster is ready, the operator creates the Stargate deployment.
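The reconciliation flow described here can be sketched in a few lines of Go. This is an illustration under stated assumptions, not the K8ssandra operator's code: the `Cluster`, `CassandraStatus`, and `Result` types and the `reconcileStargate` function are hypothetical, and an in-memory struct stands in for the Kubernetes API. A real operator would typically be triggered again by a watch on the Cassandra resource; the fixed requeue delay below is a simplification.

```go
package main

import (
	"fmt"
	"time"
)

// CassandraStatus is a hypothetical, simplified view of the status an
// operator might read from the Cassandra cluster's custom resource.
type CassandraStatus struct {
	Ready bool
}

// Cluster is an in-memory stand-in for the state held by the Kubernetes API.
type Cluster struct {
	Cassandra        CassandraStatus
	StargateDeployed bool
}

// Result tells the caller whether reconciliation should run again later.
type Result struct {
	Requeue      bool
	RequeueAfter time.Duration
}

// reconcileStargate moves the cluster one step toward the desired state:
// Stargate is deployed, but only once Cassandra reports that it is ready.
// It is idempotent, so calling it repeatedly is safe.
func reconcileStargate(c *Cluster) Result {
	if c.StargateDeployed {
		return Result{} // already in the desired state, nothing to do
	}
	if !c.Cassandra.Ready {
		// Dependency not ready yet: make no changes and try again later.
		return Result{Requeue: true, RequeueAfter: 10 * time.Second}
	}
	// Cassandra is ready, so "create" the Stargate deployment.
	c.StargateDeployed = true
	return Result{}
}

func main() {
	c := &Cluster{}

	// First pass: Cassandra is not ready, so the operator waits.
	fmt.Println(reconcileStargate(c).Requeue, c.StargateDeployed)

	// Cassandra's status changes, which triggers another pass.
	c.Cassandra.Ready = true
	fmt.Println(reconcileStargate(c).Requeue, c.StargateDeployed)
}
```

The ordering decision lives in one function that reads declared status, so there is no need for an init container that probes the cluster from inside the Stargate pod.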

Jeff Carpenter: This is absolutely the right approach – do things the Kubernetes way. Let your dependency notify Kubernetes of its status, and let your operator get triggered on that status change. Over the years I’ve seen multiple creative hacks for determining whether a Cassandra cluster is up – everything from parsing log files to trying to hit the CQL port with a query.

John Sanda: Well, depending on what you’re trying to do, there may be multiple right answers. If I want to just make sure that Cassandra’s JMX port is available so that I can run some maintenance commands, that’s going to be different than accepting client requests. Fortunately, the nice thing is that people don’t have to reinvent these checks all the time. The operator can do it for you.

Downsides to creating an operator

Jeff Carpenter: We’ve talked a lot about challenges with Helm. Let’s make this comparison in the opposite direction. Are there any downsides to creating an operator versus using Helm charts?

John Sanda: Yes, somebody could easily listen to this and think that we just think the grass is greener on the other side all the time. That’s not the case; it’s about trade-offs. I mentioned before that working with Helm templates is really nice for iterating quickly. Let’s walk through the development cycle to make a change, like creating a Kubernetes Service object, or a Deployment object. Whatever it is, I create or modify the Helm template, and then execute Helm install with the path to my local chart directory. Now I’m up and running, and I can manually test the changes and see if they work.

John Sanda: Now, let’s say I want to do the same thing in my operator code. I modify the code that creates the Deployment object, I rebuild the operator image, I deploy that image. Then I have to deploy the custom resource that the operator manages so that it will then generate the underlying Deployment object. Then I can verify the Deployment. This process involves more steps, so it’s a lot more cumbersome.

Jeff Carpenter: Yes, but can’t you automate this? Isn’t this solved with a good CI/CD pipeline? Or not?

John Sanda: Yes, there is automation involved, but those steps are still there. In terms of local development, the other tool that’s considered a counterpart to Helm is Kustomize. This is more of a declarative approach. It’s bundled as part of Kubebuilder and the Operator SDK. You’re going to see Kustomize being used with the K8ssandra operator, and we already use it for testing scenarios. Applying this to the scenario I described earlier, there’s a two-step process: first I run the build command to rebuild my image, then I’ll run another command that will use Kustomize to redeploy things. So while we can automate those steps, it’s still not as fast a turnaround in terms of “wall clock” time, because you’re still having to rebuild an image.

Jeff Carpenter: Sure, that’s a key difference between any case where you have a compiled language versus a scripted language.

Can Helm charts and operators co-exist?

Jeff Carpenter: Are the operator and Helm charts mutually exclusive? Or can they coexist in some capacity?

John Sanda: They are definitely not mutually exclusive. We will still have Helm charts for installing and configuring the operator. That’s where Helm really shines. We won’t use Helm to create a K8ssandra cluster object. People could do that if they choose, but we’re not going to write the Helm chart to do that, and in fact, I would advise against it.

Summary

As you can tell from our conversation, there was a lot of thought that went into both the initial decision to use Helm for early versions of K8ssandra, and the more recent decision to create a K8ssandra operator for the 2.0 release and beyond. You’ll still be able to deploy K8ssandra via Helm, but the operator will be a key part of managing and upgrading K8ssandra clusters. The operator will also enable some great new features like K8ssandra deployments that span multiple Kubernetes clusters and data centers.

In the next post in this series, I’ll share the next part of the conversation, where we talked about implementing and testing an operator, and how writing the operator in Go will enable more developers to contribute to the project.
