Jon Scheele (00:00)
Automation is one of the key tenets of DevOps. We use tools like Kubernetes to help to orchestrate the provisioning of infrastructure. But when it doesn't work, we still need to debug manually. So we're always looking for ways to automate even the debugging process. I'm very pleased to welcome Nilesh Gule to discuss how generative AI can help to automate even that. So welcome, Nilesh.
Nilesh GULE (00:37)
Hi Jon, it's always a pleasure.
Jon Scheele (00:38)
I know you've been on this program before, let's take a moment for you to just introduce yourself and what you do.
Nilesh GULE (00:49)
Yeah, sure.
So, I am Nilesh Gule. I've been in the industry for more than two decades now. Over the last 21 years I've performed different roles. Currently I've moved to Melbourne, and I've been an enterprise architect in my most recent role.
I've also done other roles like solution architect and big data architect, and this comes with a lot of industry experience, primarily in banking and insurance, along with retail. I'm also a Microsoft Most Valuable Professional and a Docker Captain as well. So as Jon said, I think in today's world automation is a very key part, and Kubernetes has become the de facto orchestration platform.
Jon Scheele (01:43)
So Nilesh, you've worked with Kubernetes and other tools for quite some time. Kubernetes is great when it works, but what are the things that prevent it from working and how do you go about debugging it when it doesn't work?
Nilesh GULE (02:04)
Yeah, that's a very interesting question. Let me start with this representation: it's an iceberg which shows different aspects of Kubernetes. I was one of the early adopters of Kubernetes, starting back in 2017, so I've been working with it for the last seven or eight years. I think what people find very easy is the top two layers of this iceberg, where it's very easy to get started.
You containerize your application using something like Docker or any container runtime, and then you start deploying it onto a Kubernetes cluster. You might start with a single-node cluster like Minikube, or even Docker Desktop, which has an embedded Kubernetes cluster. That's where you come up with a pod. Then if you're using higher-order objects like a deployment, it's quite easy to get started.
Once you start with deployments, then you would probably use things like secrets, config maps, and services to expose those deployments outside of the Kubernetes cluster. You would probably use an ingress, and then you start going down the stack, down the different layers of the iceberg. And that is where the complexity starts increasing. So let's say you want to run a stateful workload; you want to run a database inside your Kubernetes cluster.
You need to have a persistence mechanism to store that data, so you start using persistent volume claims. When you want things like high availability, then you need replicas and so on. As you go down, each of these objects that is added to your application or your Kubernetes cluster increases the complexity.
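As a rough illustration of the stateful case described above, a database pod would typically claim storage through a PersistentVolumeClaim; this is a minimal sketch, and the claim name, storage size, and access mode are all hypothetical choices, not something from the discussion:

```yaml
# Hypothetical claim a database pod could mount; size and access mode are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data            # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce        # single-node read/write, typical for a database volume
  resources:
    requests:
      storage: 10Gi        # illustrative size only
```

A real setup would also pick a storage class appropriate to the cluster's provisioner.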
And let's say there is a problem. This is a very simple diagram depicting the different debugging steps that a person debugging an issue on a Kubernetes cluster would go through. Now, if you look at this flowchart, it's quite lengthy, so let's split it into two halves. The scenario here is that you have an application deployed on the cluster and there is a problem. So how do you start?
You probably start by looking at the number of pods that are running. You run commands like kubectl get pods. If the pods are running fine, then you go and describe the individual pods to see if there is any issue with things like the resource limits. If that is okay, then you go to the next level. And this is an iterative process: you go through each step to see, okay, the pod is running, but your service is not accessible.
So that is where you go into the next step, where you look at things like: are my ports mapped correctly? Is the service targeting the same port that the container is exposing? If you have an ingress which is exposing this service outside of your Kubernetes cluster, then you might have to go and look at the ingress details. So you see that the steps required in debugging all of this are quite involved, just to find the root cause of a problem where a service or a pod might not be reachable from outside the Kubernetes cluster. And this is a very simple case, because as I said earlier, creating a deployment and then exposing it as a service behind an ingress is one of the simplest ways to get your application up and running in Kubernetes.
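The manual flow just described can be sketched as a sequence of kubectl commands. This assumes a live cluster and a configured kube-context; the namespace and resource names are hypothetical placeholders:

```shell
# Step 1: are the pods running at all?
kubectl get pods -n my-namespace

# Step 2: describe a pod to check events, image pulls, and resource limits
kubectl describe pod my-app-pod -n my-namespace

# Step 3: pod looks fine but the service is unreachable -- check the port mapping
kubectl get svc my-app-svc -n my-namespace -o wide
kubectl get endpoints my-app-svc -n my-namespace   # empty endpoints suggest a selector or port mismatch

# Step 4: if the service is exposed via an ingress, inspect the ingress rules
kubectl describe ingress my-app-ingress -n my-namespace

# Step 5: finally, check the container logs themselves
kubectl logs my-app-pod -n my-namespace
```

Each step only tells you whether to descend to the next layer, which is exactly why the flowchart is so long.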
But if you look at the debugging steps, they are quite involved. So when it comes to managing a Kubernetes cluster, it gets quite complex as the complexity of your use cases increases and as you try to use more and more features of Kubernetes, like automated deployment using Helm, or CI/CD pipelines to deploy.
There could be issues where your image tagging is not happening correctly. Maybe you are defining a tag which is not correctly published to your repository, and then the Helm chart is trying to pull that into your Kubernetes cluster. How do you debug these issues? So I think this is where I would like to switch.
Jon Scheele (06:41)
Yeah. So Nilesh, there's a very simple rule: the more moving parts there are, the more that can go wrong. And what strikes me with the flowchart is, it's not just that there are lots of steps to go through, but there are different types of components that you need to debug. And knowing one component doesn't mean that you actually know very much about the other components. So you really need to be an expert across the whole stack in order to do this effectively. Would that be a fair comment?
Nilesh GULE (07:17)
Yeah.
Yes. And also, I think in one of our earlier discussions we were talking about how teams are structured. Even that could play a role here where, let's say, there is a change made by the development team, but it impacts operations. And then how do you handle that? The development team might not have the right access to make some changes, and it could result in a scenario where a change needs to be made, but only the operations team can do it.
So it could be quite complex. Exactly.
Jon Scheele (07:54)
Or has the knowledge to do it. Because even if the developer has access, they may not have worked with that component before and may need someone else's assistance anyway. So yeah, it looks like a lot of steps to me, but also a lot of different types of steps.
Nilesh GULE (08:03)
Yes.
Yes. So this is where I feel that automation is important. And there is an interesting project, a tool called K8sGPT. Nowadays, as you know, AI is everywhere, and people are trying to integrate AI into almost every product and service. This particular project is trying to do that, but with an SRE mindset built into the automation.
So think of a site reliability engineer doing, let's say, live debugging on a Kubernetes cluster: what are the different things they would be doing? That is the aim, the objective, of this particular project. It scans the whole Kubernetes cluster, it diagnoses the problems, and it helps to triage the issues. But it does this in plain English, using generative AI as the backend.
It can integrate with different generative AI backends, and it can help us identify the issues, explain what the issue is, and also propose a solution to fix it.
Jon Scheele (09:33)
So what is the body of knowledge that this has been trained on? Because there are lots of different types of Kubernetes implementations and other types of components. Has this been trained on the widely available ones that we're likely to find?
Nilesh GULE (09:58)
Yes, so this works with any Kubernetes cluster. All it needs is access to your Kubernetes APIs, or your kube-context, as we say in CLI terms. As long as it can access the Kubernetes cluster, this tool will be able to run its analysis and explain those issues, and not just in English: it actually supports more than 18 different languages.
Imagine a person who is debugging the Kubernetes cluster manually and is not a native English speaker. For them, it might be a bit cumbersome to first take the English issue reported by the kubectl CLI, do a translation of it, and then work out how to fix it. This tool can help even non-native English speakers use a different language to analyze and debug the Kubernetes cluster. So I think that's another advantage of using a tool like this.
Jon Scheele (11:04)
Mm-hmm.
Okay, so how does it run? This has to be granted access to your infrastructure, or at least to the Kubernetes APIs, in order to function.
Nilesh GULE (11:21)
Yes. So there are two ways in which this can run. One is running it in client mode, where it runs side by side with your kubectl command-line tool. You can install it as a CLI and then run it as we see on the screen here. There are commands like k8sgpt analyze and explain, and other features supported on the command line.
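As a sketch of that client-mode workflow: this assumes the k8sgpt CLI is installed and a kube-context is configured, and the flags follow the project's documented CLI as I understand it, so it's worth checking `k8sgpt --help` against the version you install:

```shell
# Register a generative AI backend (an OpenAI API key in this example; model choice is illustrative)
k8sgpt auth add --backend openai --model gpt-4o

# Scan the cluster and list detected problems
k8sgpt analyze

# Same scan, but ask the AI backend to explain each problem and suggest a fix
k8sgpt analyze --explain

# Output the explanations in another language, e.g. French
k8sgpt analyze --explain --language french
```

The language flag is what enables the non-English debugging workflow mentioned earlier.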
The other approach, which is the recommended way, is to run it as an operator within the Kubernetes cluster. If you know how operators work in Kubernetes, they encapsulate specialized knowledge about a particular tool or product: they know how to install, upgrade, and operate that product within the Kubernetes cluster. So it uses whatever is available as part of the Kubernetes API and extends it using custom resource definitions, and it runs continuously within the Kubernetes cluster. The advantage there is that nobody has to run the commands periodically; we can define what the frequency should be, and the tool will run them automatically for us. It can help us analyze issues and generate metrics which can be stored in Prometheus, and then we can build visualizations to report on the different aspects of the Kubernetes cluster over a period of time. And this can be helpful for
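When run as an operator, the scan configuration lives in a custom resource. The following is a minimal sketch; the field names follow the K8sGPT operator's `K8sGPT` CRD as I understand it, and the model, secret names, and version pin are assumptions to verify against the release you install:

```yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-sample
  namespace: k8sgpt-operator-system
spec:
  ai:
    backend: openai            # any supported backend
    model: gpt-4o              # hypothetical model choice
    secret:
      name: k8sgpt-secret      # API key stored in a Kubernetes secret
      key: openai-api-key      # hypothetical secret key name
  repository: ghcr.io/k8sgpt-ai/k8sgpt
  version: v0.3.41             # pin a released version; this one is an assumption
```

Once applied, the operator runs the scans on its own schedule and writes results back into the cluster, which is what feeds the Prometheus metrics and Grafana visualizations described here.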
Jon Scheele (12:56)
So is it a tool for simply letting you know about problems or is there a self-healing nature to this also?
Nilesh GULE (13:08)
I would love to have the self-healing nature, but if we go back to the complexity of Kubernetes, at this point in time I would prefer a suggestion-based approach, because if somebody automatically goes and makes changes to the Kubernetes configuration without understanding the whole context, it could cause a lot of problems. So based on my experience, I would say it is better that it just tells us, at this moment in time, what the issue is and what the potential resolution could be, rather than going ahead and doing it itself.
Jon Scheele (13:49)
So there's still a need for a human in the loop, but it'll be a much better informed human, better able to make decisions about what steps to take next.
Nilesh GULE (13:53)
Yes.
Yes.
Yeah, so because this tool uses generative AI, it can summarize the issues very quickly for us. So instead of me as an operator or a developer trying to figure out all those different steps and performing them to get to the root cause, this tool, because it has that body of knowledge, can use that intelligence
to easily identify the issue based on your Kubernetes logs or the different types of events coming out of the Kubernetes cluster. It can identify the root cause of the problem and recommend a solution. But the final fix still lies with the human at this point in time.
Jon Scheele (14:52)
Okay. So having highlighted this, and notifying or providing a visualization of it through Grafana, the human operator is then going to go and take corrective action on issues that have been identified. And that flowchart you showed earlier, is the whole range of
Nilesh GULE (15:12)
Yes.
Jon Scheele (15:22)
that debugging flow covered within the tool, or are there things that the human is going to have to just know?
Nilesh GULE (15:32)
There are two things here. So let me switch back to the slides. The first one, which maybe we can touch on first, is the different backends it supports. It supports the standard OpenAI API, so all the other models and platforms which adhere to the OpenAI specification can easily integrate with K8sGPT.
This means that a hosted service like Azure OpenAI can be used as a backend. We can use Ollama, which runs models locally on the computer, so if we want to save cost, we could host the models internally in our organization and then expose them via an endpoint to which K8sGPT can connect. There are others: Amazon Bedrock and Google Gemini models are supported, Oracle's generative AI models are supported, and the ones available on Hugging Face and IBM watsonx as well. As for what is covered in terms of the different objects: by default, it supports around 10 or 12 different types of Kubernetes objects, including replica sets, nodes, pods, jobs, services, ingresses, and others. It also has the capability to diagnose and propose solutions for Gateway API HTTP routes, horizontal pod autoscalers, pod disruption budgets, and those kinds of things. So we can enable these additional filters, and the tool can analyze any issue related to whatever we see on the screen here. And since this is an open-source project,
We can also add new objects to this.
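The default and optional analyzers mentioned above are managed as "filters" in the CLI. A hedged sketch, assuming the k8sgpt CLI is installed; the filter names follow the project's conventions but should be checked against `k8sgpt filters list` for your version:

```shell
# Show which analyzers are active and which are available to enable
k8sgpt filters list

# Enable an optional analyzer, e.g. the horizontal pod autoscaler one
k8sgpt filters add HorizontalPodAutoScaler

# Restrict a scan to specific object types
k8sgpt analyze --filter Service --filter Ingress
```

This is also where a newly contributed analyzer would surface once it's added to the project.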
Jon Scheele (17:31)
So how can this model be extended?
Nilesh GULE (17:38)
So it's an open-source project and the source code is available on GitHub. So for developers who, let's say, want to add a new object here, there is a specific interface with which you can write your own code for how to analyze that particular type of Kubernetes object and how to respond to the issues reported for those objects.
Jon Scheele (18:03)
Okay, so it looks like a big help, but you probably still need a lot of skill to understand it; the human still needs to be in control anyway, in setting it up and monitoring it. So it's an informational aid, but for the people who have already gone through the pain of gaining the Kubernetes certifications and developing the skill, that skill isn't wasted. It's still very much valuable. It's just that you can be so much more productive, because K8sGPT is going to speed up your debugging work when things aren't right.
Nilesh GULE (18:54)
Yeah, I see it as an enabler, especially when someone is new. So let's say you have a new developer onboarding to your application or your organization, and they are new to Kubernetes. They might not know how vast Kubernetes is, or which features of Kubernetes the organization is using. This tool could help them investigate issues faster.
And it can help them learn faster how to fix those issues. But they themselves have to put in the effort to actually fix the issue.
Jon Scheele (19:32)
So this is quite a comprehensive tool and it connects to the Kubernetes through the Kubernetes API. What other tools does this tool interact with?
Nilesh GULE (19:45)
Yeah, that's a very interesting question, because if the tool itself were to do everything, it would be quite a huge task. So what K8sGPT does is allow external tools or vendors to integrate with its AI capabilities, and as of this recording, these are the five or six tools it supports. AWS is supported, in the sense that a few AWS services can be debugged; there is an approach where you can manage AWS services using Kubernetes, and those kinds of services can be part of K8sGPT integrations. There is KEDA, the Kubernetes-based event-driven autoscaling project, which works with external data sources and helps automatically scale services on Kubernetes clusters up and down. Then we have Kyverno, which is a policy engine for Kubernetes. Prometheus is very well known in the DevOps and SRE community as one of the tools used for collecting metrics. And there is also Trivy, which is an open-source vulnerability scanner. All these tools extend, or are supported by, K8sGPT's capabilities. Now, to take Trivy as an example: if you integrate the Trivy scanner with K8sGPT, it can automatically scan how many vulnerabilities there are in your Kubernetes setup or cluster. And as we all know, these vulnerabilities keep changing over time.
So you might have deployed an application or a set of applications, and even the default Kubernetes objects might be compliant as of today. But let's assume that two or three months down the line, new vulnerabilities come up. Unless somebody scans for them as part of a continuous scanning process, we would not know about them. So the K8sGPT operator can do it for us, and it can definitely help us report on this over a period of time. So I think that's a classic example where continuous scanning and continuous improvement can help improve the overall system.
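The Trivy example can be sketched with the CLI's integration commands. The command names follow the project's documentation as I understand it; `k8sgpt integration list` shows what your version actually supports:

```shell
# Activate the Trivy integration (this deploys the Trivy operator into the cluster)
k8sgpt integration activate trivy

# The integration contributes vulnerability-report analyzers; enable and use them
k8sgpt filters add VulnerabilityReport
k8sgpt analyze --filter VulnerabilityReport --explain
```

Running this on the operator's schedule rather than by hand is what turns a one-off scan into the continuous scanning process described here.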
Jon Scheele (22:17)
Mm-hmm. Okay. So where do you see this heading in the future? I mean, what I see with most systems is that they just get more complex. And as much as we try to abstract away complexity by putting another layer on, the complexity is still there. It may be hidden from some people, but when things aren't right, you have to dig into some of that complexity. What do you see as the next step that tools like this will take?
Nilesh GULE (22:55)
I think visualizing the issue could be an area that would be helpful. In terms of Kubernetes objects, there are different hierarchies, right? Let's just take a service as an example: behind a service there is typically a deployment object, behind the deployment there are pods, and the pods might have individual containers, which might have things like secrets and configurations. Now, if there is an issue in one of these and you could visualize it... that's an area which is not very well covered in many of the tools right now. You might have to pay for it, but a tool like this could be quite useful if it could visualize the issue and show us, in a way that's easy to understand, how the issue flows and how it relates to the different objects. I think that's one area.

The other area where I see a tool like this helping is partial remediation. For example, if you know it's a very minor issue, like a deployment object with a wrongly specified tag, a typo kind of thing, you could give it the option to cross-verify that the tag exists and allow it to fix that automatically. So those kinds of low-priority issues which don't cause a huge disruption in the whole Kubernetes environment could be automatically remediated. Or port mapping, for example; that's a quite common issue, where the ports are not mapped correctly and that's why services are not accessible. I see some of those things being implemented in tools like this: they could do partial remediation for some of the most commonly occurring issues, but still leave anything that could be quite disruptive in the hands of the operators.
Jon Scheele (24:54)
Okay, I think there are a lot of possibilities, but we're not going to be replaced by the machine just yet. The machines, and this machine in particular, are definitely here to help us do our work better. So thanks very much for sharing that, Nilesh.
Nilesh GULE (25:06)
Yeah.
Yeah, thank you. Thanks for the opportunity to share this.