Gremlin

The world’s most comprehensive fault injection platform

Improving reliability means knowing how systems behave under non-ideal conditions. With Gremlin’s enterprise fault injection platform, you can simulate these failure conditions and improve your systems without impacting users or slowing development. Gremlin lets you inject faults into systems in a safe, secure, and controlled way.

What is fault injection?

Fault injection is a technique for creating controlled failure in a computing component, such as a host, container, or service. By observing how their components respond to failure, engineering teams can build them to be more resilient.
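To make the idea concrete, here is a minimal sketch in Python of fault injection at the level of a single function call. It is an illustration only, not Gremlin's API or implementation: the names `inject_fault`, `InjectedFault`, `fetch_price`, and `fetch_price_with_fallback` are all hypothetical, and a platform like Gremlin works at the host, container, or service level rather than inside application code.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real failure when a fault is injected."""


def inject_fault(func, failure_rate=0.0, latency_s=0.0, rng=None):
    """Wrap func so that each call may be delayed or made to fail.

    failure_rate: probability in [0, 1] that a call raises InjectedFault.
    latency_s: extra delay in seconds added before every call.
    rng: optional random.Random instance, useful for reproducible runs.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            time.sleep(latency_s)  # simulate a slow dependency
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(item):
    """Hypothetical stand-in for a call to a real dependency."""
    return {"widget": 9.99}.get(item, 0.0)


def fetch_price_with_fallback(fetch, item, default=0.0):
    """The resilience behavior under test: fall back when the call fails."""
    try:
        return fetch(item)
    except InjectedFault:
        return default
```

Wrapping `fetch_price` with `failure_rate=1.0` forces every call to fail, which lets you observe whether the fallback path behaves as intended before a real outage exercises it.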

Reveal hidden reliability risks

Modern systems are large and complex, with countless moving parts. The potential for failure is significant, and it is only growing as more teams move to distributed and cloud-based platforms. Engineers need to know how their systems will respond under different failure conditions so they can predict, mitigate, and quickly respond to incidents.

Gremlin lets you test the reliability of your systems by safely and proactively injecting failures into hosts, containers, services, and serverless workloads. Our comprehensive library of faults lets you test any kind of incident across all of our supported platforms. Find hidden and unexpected reliability risks, both in the cloud and on-prem.

Build confidence in your systems’ resiliency

Engineering teams need to know that their systems can withstand any type of fault at any time. Gremlin helps you understand how your systems behave under any condition, not just ideal conditions.

Environments change over time, especially as systems scale and engineers push new code. Gremlin helps you stay ahead of changing systems and configuration drift with automated, repeated experiments. Confidently push to production knowing that your changes won’t introduce new reliability risks.

Safely test your systems with automatic halt and rollback

Gremlin is built with safety and control in mind. All experiments can be immediately stopped and rolled back at any time. Gremlin also natively integrates with your observability tools, including Amazon CloudWatch, Datadog, New Relic, and Prometheus, to monitor your systems during an experiment. If your metrics cross the thresholds you have set for your SLIs or SLOs, Gremlin instantly stops the active experiment and returns your systems to normal.
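The halt-and-rollback pattern can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Gremlin's implementation: the `Experiment` class, `run_with_halt`, and the `read_metric` callable are hypothetical names, and in practice the metric would come from a monitoring tool, with a pause between checks.

```python
class Experiment:
    """Hypothetical experiment with explicit start and rollback steps."""

    def __init__(self):
        self.active = False

    def start(self):
        self.active = True  # begin injecting the fault

    def rollback(self):
        self.active = False  # remove the fault and restore normal state


def run_with_halt(experiment, read_metric, threshold, max_checks):
    """Run the experiment, halting as soon as the metric crosses threshold.

    read_metric: callable returning the current value of the watched metric
                 (assumed here; a real setup would query an observability tool).
    Returns True if all checks passed, False if the experiment was halted early.
    """
    experiment.start()
    try:
        for _ in range(max_checks):
            if read_metric() > threshold:
                return False  # halt condition met: stop checking immediately
        return True
    finally:
        # Rollback always runs, whether the experiment finished, was halted,
        # or reading the metric raised an error.
        experiment.rollback()
```

Placing the rollback in a `finally` clause is the key design choice in this sketch: the system is restored on every exit path, including unexpected errors while reading metrics.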

Related Resources

What is fault injection?

Reliability best practices: how Gremlin uses Gremlin