So it's a wrapper that aborts an entire request if an API call made by the code ...

Fluxx · on Dec 1, 2012

It's not revolutionary, but it does a lot of things right and has more features that what you mentioned above:

* maintains a semaphore of currently processing requests, and will immediately fail a request if the semaphore is full.

* tracks service latency and other statistics for you

* maintains a circuit breaker to immediately fail requests if the breaker is open (based on statistics)

* watches for health to be restored of the 3rd party service and allows requests to it again.

* built in request isolation via threadpools

* built in request collapsing and caching

Overall I like this cause it's a great demonstration of how to think about, and engineer for, failure in a distributed system.

dkarl · on Dec 2, 2012

Automatic fail fast based on monitoring sounds like an awesome alternative to having latency cascade through the system and then scrambling to "fix" the problem by reducing client timeouts.

Retric · on Dec 2, 2012

I agree, though my first thought is many things work better if you add a little randomness to the equation. Something like abort X% of the time where X is based on latency over an acceptable threshold which should find a happy medium where a resource is at max capacity, but not overloaded.

Of course this is all assuming you have some sort of useful redundancy.

benjchristensen · on Dec 2, 2012

That's an interesting thought and probably something that would make a good addition to the library.

It could be part of the default library or perhaps a custom strategy for the circuit breaker (once I finish abstracting it so it can be customized via a plugin: https://github.com/Netflix/Hystrix/issues/9).

At the scale Netflix clusters operate they basically get this randomness already because circuits open/close independently on each server (no cluster state or decision making).

In this screenshot https://github.com/Netflix/Hystrix/wiki/images/ops-social-64... you'll see how in a cluster of 234 servers that about 1/3 of them are tripped and the rest are still letting traffic through.

Thus the cluster naturally levels out to how much traffic can be hitting the degraded backend as circuits flip open/closed in a rolling manner across the instances.

Also, doing this makes sense even when a dependency doesn't have a useful redundancy and must fail fast and return an error.

It is far better to fail fast and let the end client (such as a browser, iPad, PS3, XBox etc) retry and hopefully get to the 2/3s that are still able to respond rather than let the system queue (or going into server-side retry loops and DDOS the backend) and fail and not let anything through.

We prefer obviously to have valid fallbacks but many don't and in those cases that is what we do - fail fast (timeout, reject, short-circuit) on instances where it can't serve the request and let the clients retry which in a large cluster of hundreds of instances almost always get a different route through the instances of the API and backend dependencies.

@benjchristensen

mh- · on Dec 2, 2012

thanks for posting, those insights/experiences with this architecture helped me understand some of the design decisions.

also, huge thanks to you and your team (and your employer!) for releasing an amazing volume of production-quality open source projects this year.

tshtf · on Dec 1, 2012

taligent seems to be hellbanned. Here's his response:

https://github.com/Netflix/Hystrix/wiki/How-it-Works It's more than just aborting a request. It also handles caching with different fallback modes, collapsing multiple requests, monitoring and most importantly it provides a consistent model for handling service integration. Probably not the most innovative thing in the world but definitely useful if you do have a large SOA system.

gizmo686 · on Dec 2, 2012

Weird, his post is showing up as not dead for me. Looking at his comment history, all of the comments since three days ago are dead except this one (which is the most recent at the moment).

JOnAgain · on Dec 1, 2012

I was reading it trying to tease out it's core purpose as well. I reached a similar conclusion. The only piece that seems incremental is the way it remembers past performance and will fail-fast if the service is down or failing rather than keep sending every request to the downstream service. Like client-side back-off logic or throttling.

quotemstr · on Dec 1, 2012

Right. This behavior, and a few other things, is nice, but I'd be hesitant to say it belongs in its own library. If this kind of baroque architecture (and the stilted writing demonstrated by the documentation) is de rigueur at Netflix, I don't think I'd ever want to work there.

buddycasino · on Dec 2, 2012

Care to elaborate? I think Amazon does it similarly, so what's so baroque about it?

quotemstr · on Dec 3, 2012

I agree that the behavior Hysterix provides is a Good Thing, but I'm not convinced that packaging up this behavior into a separate library and open source project is architecturally elegant. Instead, the base-level web services client library should provide this functionality. I'm wary of adding too many modules and too many dependencies to a system.

benjchristensen · on Dec 2, 2012

At a basic level yes, it is to allow "giving up and moving on" and that is the point. There is more to it than that but basically it is to allow failing fast and if possible degrade gracefully via fallbacks instead of queueing requests on a latent backend dependency and saturating resources of all machines in a fleet which then causes everything to fail instead of just functionality related to the single dependency.

The concepts behind Hystrix are well known and written and spoken about by folks far better at communicating these topics than I such as Michael Nygard (http://pragprog.com/book/mnee/release-it) and John Allspaw (http://www.infoq.com/presentations/Anomaly-Detection-Fault-T...). Hystrix is a Java implementation of several different concepts and patterns in a manner that has worked well and been battle-tested at the scale Netflix operates.

Netflix has a large service oriented architecture and applications can communicate with dozens of different services (40+ isolation groups for dependent systems and 100+ unique commands are used by the Netflix API). Each incoming API call will on average touch 6-7 backend services.

Having a standardized implementation of fault and latency tolerance functionality that can be configured, monitored and relied upon for all dependencies has proven very valuable instead of each team and system reinventing the wheel and having different approaches to configuration, monitoring, alerting, etc - which is how we were before.

As for incremental backoff - I agree that this would be an interesting thing to pursue (which is why a plan is in place to allow different strategies to be applied for circuit breaker logic => https://github.com/Netflix/Hystrix/issues/9) but thus far the simple strategy taken by tripping a circuit completely for a short period of time has worked well.

This may be an artifact of the size of Netflix clusters and a smaller installation could possibly benefit more from incremental backoff - or perhaps this is functionality that could be a great win for us as well that we just haven't spent enough time on.

The following link shows a screen capture from the dashboard monitoring a particular backend service that was latent:

https://github.com/Netflix/Hystrix/wiki/images/ops-social-64... (from this page https://github.com/Netflix/Hystrix/wiki/Operations)

Note how it shows 76 circuits open (tripped) and 158 closed across the cluster of 234 servers.

When we are running clusters of 200-1200 instances the "incremental backoff" naturally occurs as circuits are tripping and closing independently on different servers across the fleet. In essence the incremental backoff is done at a cluster level rather than within a single instance and naturally reduces the throughput to what the failing dependency can handle.

Feel free to send questions or requests at https://github.com/Netflix/Hystrix/issues or to me @benjchristensen on Twitter.

kami8845 · on Dec 1, 2012

taligent your comment is [dead]

Karunamon · on Dec 2, 2012

This has to be the most head-scratching hellban I've ever seen.

zheng · on Dec 2, 2012

I've been seeing quite a few of those recently. I could only speculate why, but the trend is concerning.

jorts · on Dec 2, 2012

What did I miss? taligent was banned from HN for his comment?

Karunamon · on Dec 2, 2012

No, this one was dead on arrival, I'm just curious why. Lots of karma, nice posting history.. Nothing even sort of objectionable in his history, just a line of comments and then suddenly they're all dead.

Karunamon · on Dec 2, 2012

..and now undead. The unwritten rules of this place confuse and concern me.

taligent · on Dec 1, 2012

https://github.com/Netflix/Hystrix/wiki/How-it-Works

It's more than just aborting a request. It also handles caching with different fallback modes, collapsing multiple requests, monitoring and most importantly it provides a consistent model for handling service integration.

Probably not the most innovative thing in the world but definitely useful if you do have a large SOA system.