Abstract
Modern cloud-native microservice architectures, despite their flexibility and scalability, often face complex challenges such as latent failures, cascading errors, and unpredictable system behavior. This article explores the application of chaos engineering, a proactive DevOps practice, using the example of Google’s “Online Boutique” application deployed on Google Kubernetes Engine (GKE). Experiments conducted with Chaos Mesh (CPU Hog, Memory Hog, Network Latency, Packet Loss) revealed critical vulnerabilities, including “silent failures,” memory exhaustion (OOMKilled pod terminations), and inadequate timeout/retry mechanisms. Key performance indicators (response time, error rate, CPU and memory usage) were measured with Prometheus and Grafana under user traffic simulated with Locust, and the system was then optimized: the Horizontal Pod Autoscaler (HPA) was introduced, resource limits were adjusted, and timeout/retry mechanisms were strengthened. Repeated experiments demonstrated statistically significant improvements in system resilience, underscoring the value of chaos engineering as an integral part of DevOps. The article concludes with practical recommendations for enhancing system reliability and integrating chaos engineering into the software development life cycle (SDLC).
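As a concrete illustration of the load-generation step mentioned above, the following is a minimal Locust sketch, not taken from the article itself; the request paths, task weights, and wait times are assumptions chosen for the Online Boutique frontend.

```python
# Minimal Locust sketch for simulating user traffic against the Online Boutique
# frontend during chaos experiments; paths and weights are illustrative assumptions.
from locust import HttpUser, task, between


class BoutiqueUser(HttpUser):
    # Each simulated user waits 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task(3)
    def browse_home(self):
        # Load the storefront landing page.
        self.client.get("/")

    @task(1)
    def view_cart(self):
        # Open the shopping cart page (path assumed from the demo frontend).
        self.client.get("/cart")
```

Run, for example, with `locust -f locustfile.py --host http://<frontend-external-ip>`, where the host is the externally exposed address of the frontend service on GKE; response times and error rates observed here can then be correlated with the Prometheus/Grafana KPIs during each fault injection.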