Automatic fail fast based on monitoring sounds like an awesome alternative to having latency cascade through the system and then scrambling to "fix" the problem by reducing client timeouts.
I agree, though my first thought is that many things work better if you add a little randomness to the equation. Something like: abort X% of the time, where X is based on how far latency is over an acceptable threshold. That should find a happy medium where a resource runs at max capacity but is not overloaded.
Of course this is all assuming you have some sort of useful redundancy.
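That randomized-shedding idea can be sketched roughly as follows (a hypothetical class with illustrative names and parameters, not part of Hystrix — the abort probability scales with how far observed latency is over the acceptable threshold):

```java
import java.util.concurrent.ThreadLocalRandom;

/** Sheds load probabilistically: the further observed latency is over
 *  the acceptable threshold, the more likely a request is aborted. */
public class LatencyShedder {
    private final double thresholdMs;   // acceptable latency
    private final double maxOverageMs;  // overage at which we shed 100%

    public LatencyShedder(double thresholdMs, double maxOverageMs) {
        this.thresholdMs = thresholdMs;
        this.maxOverageMs = maxOverageMs;
    }

    /** Probability of aborting, in [0, 1]. */
    public double abortProbability(double observedLatencyMs) {
        double overage = observedLatencyMs - thresholdMs;
        if (overage <= 0) return 0.0;
        return Math.min(1.0, overage / maxOverageMs);
    }

    /** Randomized decision: true means fail fast instead of calling the backend. */
    public boolean shouldAbort(double observedLatencyMs) {
        return ThreadLocalRandom.current().nextDouble() < abortProbability(observedLatencyMs);
    }
}
```

With a 100 ms threshold and a 400 ms overage scale, a backend observed at 300 ms would see roughly half its requests aborted, which is the "happy medium" behavior described above.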
That's an interesting thought and probably something that would make a good addition to the library.
It could be part of the default library or perhaps a custom strategy for the circuit breaker (once I finish abstracting it so it can be customized via a plugin: https://github.com/Netflix/Hystrix/issues/9).
At the scale Netflix clusters operate at, they basically get this randomness already, because circuits open and close independently on each server (there is no cluster state or decision making).
Thus the cluster naturally levels out the amount of traffic hitting the degraded backend as circuits flip open/closed in a rolling manner across the instances.
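A self-contained sketch of that per-instance behavior (illustrative names and thresholds, not Hystrix's actual implementation): trip when the rolling error rate crosses a threshold, reject for a sleep window, then let a single trial request through.

```java
/** Minimal per-instance circuit breaker. Each server runs its own
 *  instance, so circuits flip open/closed independently across a fleet. */
public class SimpleCircuitBreaker {
    private final int windowSize;        // requests per rolling window
    private final double errorThreshold; // e.g. 0.5 = trip at 50% errors
    private final long sleepWindowMs;    // how long to stay open

    private int requests, failures;
    private boolean open;
    private long openedAtMs;

    public SimpleCircuitBreaker(int windowSize, double errorThreshold, long sleepWindowMs) {
        this.windowSize = windowSize;
        this.errorThreshold = errorThreshold;
        this.sleepWindowMs = sleepWindowMs;
    }

    /** True if the request may proceed; false means fail fast. */
    public synchronized boolean allowRequest(long nowMs) {
        if (!open) return true;
        // After the sleep window, allow a trial ("half-open") request.
        return nowMs - openedAtMs >= sleepWindowMs;
    }

    public synchronized void recordResult(boolean success, long nowMs) {
        if (open) {
            if (success) { open = false; requests = failures = 0; } // trial ok: close
            else { openedAtMs = nowMs; }                            // trial failed: stay open
            return;
        }
        requests++;
        if (!success) failures++;
        if (requests >= windowSize && (double) failures / requests >= errorThreshold) {
            open = true;
            openedAtMs = nowMs;
        }
    }
}
```

Because each server trips and recovers on its own local stats and clock, a fleet of these naturally admits a reduced, rolling fraction of traffic to a degraded backend.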
Also, doing this makes sense even when a dependency doesn't have a useful redundancy and must fail fast and return an error.
It is far better to fail fast and let the end client (such as a browser, iPad, PS3, Xbox, etc.) retry and hopefully reach the two-thirds of instances that can still respond, rather than let the system queue up (or go into server-side retry loops and DDoS the backend), fail, and let nothing through.
We obviously prefer to have valid fallbacks, but many dependencies don't, and in those cases that is what we do: fail fast (timeout, reject, short-circuit) on instances that can't serve the request and let the clients retry. In a large cluster of hundreds of instances, a retry almost always gets a different route through the instances of the API and backend dependencies.
taligent seems to be hellbanned. Here's his response:
https://github.com/Netflix/Hystrix/wiki/How-it-Works
It's more than just aborting a request. It also handles caching with different fallback modes, collapsing multiple requests, monitoring and most importantly it provides a consistent model for handling service integration.
Probably not the most innovative thing in the world but definitely useful if you do have a large SOA system.
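The core of that consistent model is a command with an isolated primary call and a fallback. Here is a minimal self-contained sketch of the shape (Hystrix's real HystrixCommand layers thread isolation, timeouts, circuit breaking, request caching, and collapsing on top of this core idea):

```java
/** Sketch of the command-with-fallback shape: execute the primary
 *  dependency call, and on any failure return the fallback instead
 *  of propagating the error to the caller. */
public abstract class FallbackCommand<T> {
    /** The actual dependency call (may throw). */
    protected abstract T run() throws Exception;

    /** Degraded-but-valid response when run() fails. */
    protected abstract T getFallback();

    public T execute() {
        try {
            return run();
        } catch (Exception e) {
            return getFallback();
        }
    }
}
```

A caller might, for example, return stale cached data from getFallback() when the live service call fails, so the page degrades instead of erroring.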
Weird, his post is showing up as not dead for me. Looking at his comment history, all of the comments since three days ago are dead except this one (which is the most recent at the moment).
I was reading it trying to tease out its core purpose as well. I reached a similar conclusion. The only piece that seems incremental is the way it remembers past performance and will fail fast if the service is down or failing, rather than keep sending every request to the downstream service. Like client-side back-off logic or throttling.
Right. This behavior, along with a few other things, is nice, but I'd be hesitant to say it belongs in its own library. If this kind of baroque architecture (and the stilted writing demonstrated by the documentation) is de rigueur at Netflix, I don't think I'd ever want to work there.
I agree that the behavior Hystrix provides is a Good Thing, but I'm not convinced that packaging this behavior into a separate library and open source project is architecturally elegant. Instead, the base-level web services client library should provide this functionality. I'm wary of adding too many modules and too many dependencies to a system.
At a basic level, yes, it is to allow "giving up and moving on," and that is the point. There is more to it than that, but basically it is to allow failing fast and, where possible, degrading gracefully via fallbacks, instead of queueing requests on a latent backend dependency and saturating the resources of every machine in a fleet, which then causes everything to fail instead of just the functionality related to the single dependency.
The concepts behind Hystrix are well known and have been written and spoken about by folks far better at communicating these topics than I am, such as Michael Nygard (http://pragprog.com/book/mnee/release-it) and John Allspaw (http://www.infoq.com/presentations/Anomaly-Detection-Fault-T...). Hystrix is a Java implementation of several of these concepts and patterns in a manner that has worked well and been battle-tested at the scale Netflix operates.
Netflix has a large service oriented architecture and applications can communicate with dozens of different services (40+ isolation groups for dependent systems and 100+ unique commands are used by the Netflix API). Each incoming API call will on average touch 6-7 backend services.
Having a standardized implementation of fault and latency tolerance that can be configured, monitored, and relied upon for all dependencies has proven very valuable, instead of each team and system reinventing the wheel with its own approach to configuration, monitoring, alerting, etc., which is how things were before.
As for incremental backoff: I agree this would be an interesting thing to pursue (which is why a plan is in place to allow different circuit-breaker strategies to be plugged in => https://github.com/Netflix/Hystrix/issues/9), but thus far the simple strategy of tripping a circuit completely for a short period of time has worked well.
This may be an artifact of the size of Netflix clusters, and a smaller installation could possibly benefit more from incremental backoff. Or perhaps this is functionality that could be a great win for us as well and that we just haven't spent enough time on.
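One hypothetical way such an incremental strategy could look as a pluggable policy (purely illustrative; Hystrix itself uses a fixed sleep window): each consecutive trip doubles how long the circuit stays open before the next trial request, and a successful trial resets the backoff.

```java
/** Hypothetical incremental-backoff policy for a circuit breaker's
 *  sleep window: consecutive trips double the wait before the next
 *  trial request, capped at a maximum; recovery resets it. */
public class BackoffPolicy {
    private final long baseMs;
    private final long maxMs;
    private int consecutiveTrips;

    public BackoffPolicy(long baseMs, long maxMs) {
        this.baseMs = baseMs;
        this.maxMs = maxMs;
    }

    /** Called when the circuit trips; returns how long to stay open. */
    public long onTrip() {
        long window = Math.min(maxMs, baseMs << Math.min(consecutiveTrips, 30));
        consecutiveTrips++;
        return window;
    }

    /** Called when a trial request succeeds and the circuit closes. */
    public void onRecovery() {
        consecutiveTrips = 0;
    }
}
```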
The following link shows a screen capture from the dashboard monitoring a particular backend service that was latent:
Note how it shows 76 circuits open (tripped) and 158 closed across the cluster of 234 servers.
When we are running clusters of 200-1200 instances, the "incremental backoff" naturally occurs as circuits trip and close independently on different servers across the fleet. In essence, the incremental backoff happens at the cluster level rather than within a single instance, and it naturally reduces throughput to what the failing dependency can handle.
No, this one was dead on arrival; I'm just curious why. Lots of karma, a nice posting history. Nothing even sort of objectionable in his history, just a line of comments and then suddenly they're all dead.