An autonomous debugging and troubleshooting agent for production-grade systems, built from a DevOps, SRE, and Operations perspective.
This application supports two separate architecture and execution options:
- OPTION A - Claude Code Mode for System Admins: Project-specific Claude subagents and Agent Skills, along with the default Claude tools those skills use. These are organized under the `.claude` folder (the default location for project-specific agents).
- OPTION B - LangGraph-based Application: Custom-developed agents built on the LangGraph framework, following a multi-agent architecture pattern in which an Orchestrator Agent coordinates these specialized sub-agents:
  - Observability Agent: Queries Grafana for metrics, dashboards, and alerts
  - Infrastructure Agent: Queries Kubernetes for cluster state, pods, services, and logs
  - Knowledge Search Agent: Queries Stackoverflow for troubleshooting knowledge
  - Incident Management: ServiceNow integration for incident tracking
  - Code Management: GitHub integration for issues and PRs
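The orchestrator pattern described above can be illustrated with a minimal, framework-free sketch. It is not the project's actual LangGraph code: the keyword sets, the stub handlers, and the fallback to the knowledge search agent are all assumptions made for illustration. A real orchestrator would more likely let the LLM decide which sub-agent to call.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class SubAgent:
    name: str
    keywords: Set[str]             # hypothetical trigger words, not from the project
    handle: Callable[[str], str]   # stand-in for a real agent invocation


def make_stub(name: str) -> Callable[[str], str]:
    """Return a placeholder handler that only reports what it would do."""
    def handle(query: str) -> str:
        return f"[{name}] would handle: {query}"
    return handle


# One entry per sub-agent listed above; keyword sets are illustrative only.
SUB_AGENTS: List[SubAgent] = [
    SubAgent("observability", {"metric", "metrics", "dashboard", "alert", "alerts"},
             make_stub("observability")),
    SubAgent("infrastructure", {"pod", "pods", "cluster", "service", "logs"},
             make_stub("infrastructure")),
    SubAgent("knowledge_search", {"troubleshoot", "troubleshooting"},
             make_stub("knowledge_search")),
    SubAgent("incident_management", {"incident", "incidents", "ticket"},
             make_stub("incident_management")),
    SubAgent("code_management", {"issue", "issues", "pr", "prs"},
             make_stub("code_management")),
]

FALLBACK = "knowledge_search"  # assumption: unmatched queries go to knowledge search


def route(query: str) -> List[str]:
    """Pick sub-agents whose keywords appear as whole words in the query."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    matched = [agent.name for agent in SUB_AGENTS if agent.keywords & tokens]
    return matched or [FALLBACK]


def orchestrate(query: str) -> Dict[str, str]:
    """Dispatch the query to every selected sub-agent and collect the results."""
    by_name = {agent.name: agent for agent in SUB_AGENTS}
    return {name: by_name[name].handle(query) for name in route(query)}
```

Calling `orchestrate("Why are pods restarting and which alerts fired?")` would dispatch to both the observability and infrastructure stubs. Matching on whole words rather than substrings avoids accidental hits such as "pr" inside "production".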
- LangGraph DeepAgent: Multi-agent orchestration. Deep Agent is an agent harness built using LangChain as the framework (for tools and model access) and LangGraph as the runtime (for checkpoints and memory)
- LiteLLM: Unified LLM provider access
- LangSmith: Observability and tracing
- MCP Servers (via Docker Hub):
  - Kubernetes MCP Server
  - Grafana MCP Server
  - Stackoverflow MCP Server
  - ServiceNow MCP Server
  - GitHub MCP Server
- Install dependencies: `uv sync`
- Configure environment variables (see `.env.example`)
- Deploy MCP servers separately
- Run agent: `python main.py`
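Since the agent depends on environment variables being configured before it runs, a startup check along these lines may help catch a missing setting early. This is a sketch only: the variable names below are hypothetical placeholders, and the real names should be taken from `.env.example`.

```python
from typing import List, Mapping, Sequence

# Hypothetical variable names for illustration; use the ones in .env.example.
REQUIRED_VARS: Sequence[str] = (
    "LLM_API_KEY",
    "GRAFANA_MCP_URL",
    "KUBERNETES_MCP_URL",
)


def missing_vars(env: Mapping[str, str],
                 required: Sequence[str] = REQUIRED_VARS) -> List[str]:
    """Return the required variables that are unset or blank, in order."""
    return [name for name in required if not env.get(name, "").strip()]


def check_environment(env: Mapping[str, str]) -> None:
    """Raise a descriptive error if any required variable is missing."""
    missing = missing_vars(env)
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
```

One possible use is to call `check_environment(os.environ)` at the top of `main.py`, so a misconfigured setup fails immediately with a readable message rather than partway through an agent run.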
See IMPLEMENTATION_PLAN.md for complete implementation details.