Introduction
This project was created as part of the Hackster x AMD Pervasive AI Developer Contest. My submission was a Local Kubernetes Cluster Troubleshooting Assistant. This project outlines the steps to deploy a locally hosted Large Language Model (LLM) and integrate it with K8sGPT, a tool that scans your Kubernetes clusters and diagnoses issues in plain English.
Judging
For the judges, as per the rules and judging criteria, I've ensured all of the submission requirements are fulfilled and laid out on the main page of the project GitHub repository.
For this project, I've kept journals of the work I've done and the number of hours I've put into coding, troubleshooting, and supporting the related projects.
Components
This project uses the following hardware and software components:
- AMD Instinct™ MI210 Accelerators - As part of this contest, I was provided access to the AMD Accelerator Cloud, which allowed launching Docker containers that leverage AMD EPYC™ CPUs and AMD Instinct GPUs. This was mainly used for testing larger AI models, as the cloud had more resources than my local machine. It would also be useful for training existing models.
- AMD Radeon RX 7900XT - For locally hosting services in the scope of this project, I used my own desktop PC GPU, which is a PowerColor Hellhound Radeon RX 7900 XT 20GB.
- LocalAI was used as the backend to act as a drop-in replacement REST API that’s compatible with OpenAI API specifications for local inferencing.
- Open WebUI was used to provide a user-friendly web interface for testing the local LLMs being loaded.
- K8sGPT was used as a command line tool for scanning and diagnosing issues with Kubernetes clusters, powered by the LocalAI service.
- TrueNAS SCALE was used as a test environment for launching test Kubernetes deployments to use with K8sGPT. This can be replaced with any local or cloud-based Kubernetes cluster.
Resources
All of the resources, code, and instructions used for this project are available in the main project GitHub repository: https://github.com/linuxtek-canada/hackster-amd-contest
AMD Accelerator Cloud
Using this cloud resource was an interesting exercise, and I was able to create a script to install the LocalAI service inside a ROCm Ubuntu 22.04 Docker container. If you have access to the service, I've outlined all of the details on how to do this in this readme.
AMD Accelerator Cloud is currently testing the ability to launch custom Docker images, and I was able to get access to evaluate it. I will add more details to this readme once this feature is ready.
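The general shape of that setup can be sketched as below. This is a hedged outline, not the tested script from the readme; the image tag, install command, and port are assumptions that may differ from what the readme uses:

```shell
# Launch a ROCm Ubuntu 22.04 development container with GPU device access
docker run -it \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  rocm/dev-ubuntu-22.04 bash

# Inside the container: install LocalAI via its upstream install script
curl -sSL https://localai.io/install.sh | sh

# Start the LocalAI API server (default OpenAI-compatible API on port 8080)
local-ai run --address 0.0.0.0:8080
```

The `/dev/kfd` and `/dev/dri` device mappings plus membership in the `video` group are what give the container access to the AMD GPUs.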
Locally Hosted AI/LLM Setup
To set up LocalAI and Open WebUI to run on your local machine, I've outlined the steps, configuration instructions, and usage in this readme. I've provided a Docker Compose YAML file which can be adjusted to your hardware specifications, and used to launch containers for both services.
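To give a sense of its shape, a minimal Compose sketch might look like the following. This is illustrative only; the image tags, volume paths, and environment variable are assumptions, and the repository's YAML file is the tested version:

```yaml
services:
  localai:
    # HIP/ROCm build of LocalAI; the exact tag varies by release
    image: localai/localai:latest-gpu-hipblas
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
    ports:
      - "8080:8080"
    volumes:
      - ./models:/build/models   # local model storage

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      # Point Open WebUI at LocalAI's OpenAI-compatible API
      - OPENAI_API_BASE_URL=http://localai:8080/v1
    depends_on:
      - localai
```

With this layout, Open WebUI is reachable at http://localhost:3000 and talks to LocalAI over the internal Compose network.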
Setting Up K8sGPT
K8sGPT is a command-line utility that can be installed on your local machine regardless of the operating system, or run as a Docker container. It integrates with your Kubernetes configuration and with LocalAI to scan your Kubernetes clusters, then diagnoses and triages issues in plain English. SRE experience is codified into its analyzers, which pull out the most relevant information and enrich it with AI.
I've included steps to install K8sGPT and integrate it with Kubernetes and LocalAI in this readme.
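The core of that integration is registering LocalAI as a backend. A sketch of the commands, assuming LocalAI is listening on localhost port 8080 (the model name here is a placeholder for whatever model LocalAI has loaded):

```shell
# Register the LocalAI backend against its OpenAI-compatible endpoint
k8sgpt auth add --backend localai --model <model-name> --baseurl http://localhost:8080/v1

# Confirm the backend was registered
k8sgpt auth list

# Scan the cluster and have the backend explain any findings
k8sgpt analyze --explain --backend localai
```

Without `--explain`, `k8sgpt analyze` only lists detected problems; the flag is what sends the findings to the LLM for a plain-English diagnosis.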
Kubernetes Troubleshooting
To test K8sGPT, I used some example scenarios from Abhishek Veeramalla's Kubernetes From Zero to Hero YouTube series, specifically on Kubernetes Troubleshooting. You can watch the first video at this link. I included the GitHub repository as a submodule of my project, so it can be used for testing.
Details on my troubleshooting examples and steps can be found in this readme.
As an example, I purposefully deployed a Kubernetes Deployment that had an incorrect image name. This sent the pods into an ImagePullBackOff state. K8sGPT was able to detect the error, understand the problem, and suggest a corrected image name.
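A minimal manifest reproducing this scenario might look like the following. The names and the deliberately broken tag are illustrative, not the exact manifest from the series:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: broken-image-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: broken-image-demo
  template:
    metadata:
      labels:
        app: broken-image-demo
    spec:
      containers:
        - name: web
          # Deliberate typo: "lates" instead of "latest" -> ImagePullBackOff
          image: nginx:lates
```

After applying this, running `k8sgpt analyze --explain` flags the failing pod and, in my testing, suggested the corrected image tag.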
Accomplishments
This project is a great proof of concept for getting local AI-powered tools working to make troubleshooting Kubernetes easier. K8sGPT also supports an in-cluster operator, making it easier to scale troubleshooting and eliminate single points of failure.
As part of this project, I was able to help the LocalAI development team test their latest revision, which added support for AMD ROCm 6.1. The team does not have access to AMD hardware that supports the HipBLAS library, which is needed for building AMD-specific Docker containers for LocalAI. I was able to work with the developers to test and help approve this pull request.
I also worked with the LocalAI developers to recommend an improvement to the ROCm packages for Debian/Ubuntu, as they currently do not follow the standard policy for installing shared libraries. I submitted this issue to the ROCm GitHub page, which outlines the problem and the solution. It is currently under investigation.
Future Improvements
As part of this project, I wanted to build a fully code-defined Kubernetes cluster on top of Proxmox, to run in my homelab. This proved very challenging, and there are a number of software and network limitations I have to resolve before I will be able to do this. Instead, I used my existing TrueNAS SCALE system to test the implementation, which was sufficient to complete this project. For anyone else using this setup, the Kubernetes cluster(s) should be interchangeable.
I would also like to explore using Retrieval Augmented Generation (RAG), and introduce more training data for my local LLM to improve inference capabilities.
Conclusion
I'm very thankful to have been able to participate in this contest and finish this project. I want to thank all the AMD employees, such as Prithvi Mattur and Javier, who made this possible and provided resources and support in the Discord. It was a great experience collaborating with other contestants and joining the office hours to ask questions.
I also want to thank Jinger Zeng from Hackster for all of her work on this contest. It definitely could not have been easy coordinating resources for all of us.
I hope you will find this project useful, and please let me know if you have any recommendations or improvements I should explore.