Chaos engineering

Calculating how much confidence we have in the interconnected complex systems that are put into production environments requires operational readiness metrics.

Solutions for increasing the resilience and operational readiness of a platform include strengthening the backup, restore, network file transfer, failover capabilities and overall security of the environment.

It was promptly put to use for debugging by generating errors for programmers to fix, because automated testing was not possible; the first Macintosh had too little free memory space for anything more sophisticated.

Instead, we discovered that we could align our teams around the notion of infrastructure resilience by isolating the problems created by server neutralization and pushing them to the extreme.

Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users.

"[15]By regularly "killing" random instances of a software service, it was possible to test a redundant architecture to verify that a server failure did not noticeably impact customers.

Chaos Monkey is now part of a larger suite of tools called the Simian Army designed to simulate and test responses to various system failures and edge cases.

The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.The Simian Army[18] is a suite of tools developed by Netflix to test the reliability, security, or resilience of its Amazon Web Services infrastructure and includes the following tools:[20] Voyages-sncf.com's 2017 "Day of Chaos"[23] gamified simulating pre-production failures[24] to present at the 2017 DevOps REX conference.