Show HN: Agentic Reliability Framework – Multi-agent AI self-heals failures https://ift.tt/lS5rYAO

Hey HN! I'm Juan, a former reliability engineer at NetApp, where I handled 60+ critical incidents per month for Fortune 500 clients.

I built ARF after seeing the same pattern again and again: production AI systems fail silently, humans wake up at 3 AM and take 30-60 minutes to recover, and companies lose $50K-$250K per incident.

ARF uses 3 specialized AI agents:

- Detective: anomaly detection via FAISS vector memory
- Diagnostician: root cause analysis with causal reasoning
- Predictive: forecasts failures before they happen

Result: 2-minute MTTR (vs. 45 minutes manually), 15-30% revenue recovery.

Tech stack: Python 3.12, FAISS, SentenceTransformers, Gradio
Tests: 157/158 passing (99.4% pass rate)
Docs: 42,000 words across 8 comprehensive files
Live demo: https://ift.tt/9WjFzJo...

The interesting technical challenge was making the agents coordinate without tight coupling. Each agent is independently testable, but all of them are orchestrated for holistic analysis.

Happy to answer questions about multi-agent systems, production reliability patterns, or FAISS for incident recall!

GitHub: https://ift.tt/SO0xXND
(Also available for consulting if you need this deployed in your infrastructure: https://lgcylabs.vercel.app/ )

December 9, 2025 at 10:25PM
