
I was reading it trying to tease out its core purpose as well. I reached a similar conclusion. The only piece that seems incremental is the way it remembers past performance and fails fast if the service is down or failing, rather than continuing to send every request to the downstream service. Like client-side back-off logic or throttling.


Right. This behavior, along with a few other things, is nice, but I'd be hesitant to say it belongs in its own library. If this kind of baroque architecture (and the stilted writing in the documentation) is de rigueur at Netflix, I don't think I'd ever want to work there.


Care to elaborate? I think Amazon does it similarly, so what's so baroque about it?


I agree that the behavior Hystrix provides is a Good Thing, but I'm not convinced that packaging it up into a separate library and open-source project is architecturally elegant. Instead, the base-level web services client library should provide this functionality. I'm wary of adding too many modules and too many dependencies to a system.


At a basic level, yes, it is to allow "giving up and moving on," and that is the point. There is more to it than that, but the core idea is to fail fast and, where possible, degrade gracefully via fallbacks, instead of queueing requests on a latent backend dependency and saturating the resources of every machine in a fleet - which then causes everything to fail instead of just the functionality related to the single dependency.
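The fail-fast-with-fallback idea above can be sketched in a few lines. This is a hypothetical, minimal circuit breaker for illustration only - it is not the Hystrix API (Hystrix wraps calls in `HystrixCommand` objects with thread-pool isolation, timeouts, and rolling metrics on top of this basic idea):

```java
import java.util.function.Supplier;

/**
 * Minimal circuit-breaker sketch (hypothetical, not the Hystrix API).
 * After `threshold` consecutive failures the breaker "opens" and every
 * subsequent call fails fast to the fallback without touching the
 * dependency; one success closes it again.
 */
class CircuitBreaker<T> {
    private final int threshold;
    private int consecutiveFailures = 0;

    CircuitBreaker(int threshold) {
        this.threshold = threshold;
    }

    boolean isOpen() {
        return consecutiveFailures >= threshold;
    }

    T execute(Supplier<T> primary, Supplier<T> fallback) {
        if (isOpen()) {
            // Fail fast: the latent dependency is not called at all,
            // so no threads or connections queue up behind it.
            return fallback.get();
        }
        try {
            T result = primary.get();
            consecutiveFailures = 0; // success closes the breaker
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            return fallback.get();   // degrade gracefully
        }
    }
}
```

The point of the sketch is the `isOpen()` short-circuit: once the breaker trips, requests return the fallback immediately instead of blocking on a dependency that is already failing.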

The concepts behind Hystrix are well known and written and spoken about by folks far better at communicating these topics than I such as Michael Nygard (http://pragprog.com/book/mnee/release-it) and John Allspaw (http://www.infoq.com/presentations/Anomaly-Detection-Fault-T...). Hystrix is a Java implementation of several different concepts and patterns in a manner that has worked well and been battle-tested at the scale Netflix operates.

Netflix has a large service oriented architecture and applications can communicate with dozens of different services (40+ isolation groups for dependent systems and 100+ unique commands are used by the Netflix API). Each incoming API call will on average touch 6-7 backend services.

Having a standardized implementation of fault and latency tolerance that can be configured, monitored, and relied upon for all dependencies has proven very valuable, instead of each team and system reinventing the wheel with different approaches to configuration, monitoring, alerting, etc. - which is how we were before.

As for incremental backoff - I agree that this would be an interesting thing to pursue (which is why a plan is in place to allow different strategies to be applied for circuit breaker logic => https://github.com/Netflix/Hystrix/issues/9) but thus far the simple strategy taken by tripping a circuit completely for a short period of time has worked well.

This may be an artifact of the size of Netflix clusters and a smaller installation could possibly benefit more from incremental backoff - or perhaps this is functionality that could be a great win for us as well that we just haven't spent enough time on.

The following link shows a screen capture from the dashboard monitoring a particular backend service that was latent:

https://github.com/Netflix/Hystrix/wiki/images/ops-social-64... (from this page https://github.com/Netflix/Hystrix/wiki/Operations)

Note how it shows 76 circuits open (tripped) and 158 closed across the cluster of 234 servers.

When we are running clusters of 200-1200 instances the "incremental backoff" naturally occurs as circuits are tripping and closing independently on different servers across the fleet. In essence the incremental backoff is done at a cluster level rather than within a single instance and naturally reduces the throughput to what the failing dependency can handle.
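A rough back-of-the-envelope illustration of that cluster-level effect, using the numbers from the dashboard screenshot above (the per-server request rate is a made-up placeholder):

```java
public class ClusterBackoff {
    public static void main(String[] args) {
        // Numbers from the dashboard capture: 76 circuits open (tripped),
        // 158 closed, across a cluster of 234 servers.
        int servers = 234;
        int open = 76;
        int closed = servers - open; // 158

        // Hypothetical per-server rate to the dependency; any value works,
        // the point is the proportional reduction.
        double perServerRps = 10.0;

        double fractionClosed = (double) closed / servers;
        double dependencyRps = closed * perServerRps;

        System.out.printf(
            "%d of %d circuits closed: dependency sees ~%.1f%% of normal load (%.0f rps)%n",
            closed, servers, fractionClosed * 100, dependencyRps);
    }
}
```

Because each server trips and closes its circuit independently, the load reaching the failing dependency scales with the fraction of circuits still closed (here roughly two-thirds), which is the cluster-level throttling described above.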

Feel free to send questions or requests at https://github.com/Netflix/Hystrix/issues or to me @benjchristensen on Twitter.



