Consul issues have bitten me at two companies, and I've heard of it being the culprit in some serious outages elsewhere. One possible takeaway here is to remove it.
Totally agree. It can really shit the bed hard it can. I had a 0.8 cluster that crashed with a never-ending leadership election. Only option was to burn the whole cluster to the ground and reinitialise it from scratch. That issue seems to have gone away since 1.0 but I’m not sure I can sleep after running it for a couple of years. Every time I bounce a cluster node for patching I clench.
I could write a book on the problems I’ve had with Vagrant as well. Lost at least a day a week to that in the last month.
Consul seems to be more prone to issues than one would hope, though. IMO the feature set is not worth the increased complexity and operational burden. There are simpler ways of handling service discovery and configuration without running your own consensus-based cluster.
Central authorities are typically simpler than gossip and consensus systems. They have failure modes too, of course, but those failure modes are better understood and potentially easier to manage.
Sometimes you can't avoid the need for distributed consensus, but you can box it inside a well-defined abstraction like leader election, and then do everything else in a traditional client-server way.
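A rough sketch of what that boxing-in could look like, in Python. All the names here (LeaderElector, StaticElector, Registry) are made up for illustration, and StaticElector is only a stand-in for whatever consensus-backed election you would really use. The point is just that the registry never touches consensus directly; it only asks "who is the leader?".

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class LeaderElector(ABC):
    """Hypothetical interface: the only place consensus is allowed to live."""

    @abstractmethod
    def current_leader(self) -> Optional[str]:
        """Return the ID of the current leader, or None if there is none."""


class StaticElector(LeaderElector):
    """Illustration-only stand-in for a real consensus-backed elector."""

    def __init__(self, leader: Optional[str] = None) -> None:
        self._leader = leader

    def set_leader(self, leader: Optional[str]) -> None:
        self._leader = leader

    def current_leader(self) -> Optional[str]:
        return self._leader


class Registry:
    """Plain client-server service registry; only the leader accepts writes."""

    def __init__(self, node_id: str, elector: LeaderElector) -> None:
        self.node_id = node_id
        self.elector = elector
        self.services: Dict[str, str] = {}

    def register(self, name: str, address: str) -> str:
        leader = self.elector.current_leader()
        if leader is None:
            # No leader right now: fail fast instead of guessing.
            return "unavailable: no leader"
        if leader != self.node_id:
            # Not the leader: tell the client where to go.
            return f"redirect: {leader}"
        self.services[name] = address
        return "ok"


elector = StaticElector("node-a")
node_a = Registry("node-a", elector)
node_b = Registry("node-b", elector)

print(node_a.register("web", "10.0.0.5:80"))  # ok
print(node_b.register("web", "10.0.0.6:80"))  # redirect: node-a
```

Everything outside LeaderElector is ordinary request/response code, so the failure modes you have to reason about there are the familiar ones: no leader, or wrong node.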