
I've looked at these before, and I remember a few years ago when Grafana was really starting to get big, but I guess I have a bona fide question: who really needs this?

I manage a small homelab infra, but also an enterprise infra at work with >1,000 endpoints to monitor, and I/we use simple shell scripts, text files, and rsync/ssh. We monitor CPU load, network load, disk/IO load, all the good stuff basically. The monitoring server is just a DO droplet, and our collectors add essentially zero overhead.

The spec requirements and the setup cost in time and complexity seem steep for a Grafana stack - is there any value beyond the visuals? I know it supports all manner of custom plugins, dashboards, etc., but if you only care about the good stuff (uptime + performance), what does Grafana give you that rsync'ing sar data can't?

PS: We have a graphical parser for the data, written in Python with matplotlib. Very lightweight, and we also get pretty graphs to print and hand upstairs.
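For readers curious what such a parser involves, here's a minimal sketch of the sar-parsing half (the sample text and column layout assume the common sysstat `sar -u` output; the parsed points can then be fed straight to matplotlib):

```python
SAMPLE = """\
12:00:01 AM     all      2.32      0.00      0.54      0.11      0.00     97.03
12:10:01 AM     all     11.20      0.00      1.75      0.30      0.00     86.75
"""

def parse_sar_cpu(text):
    """Parse `sar -u` body lines into (timestamp, %busy) pairs.

    Assumes the usual sysstat layout: time, AM/PM, CPU, %user, %nice,
    %system, %iowait, %steal, %idle. %busy is taken as 100 - %idle.
    Header and "Average:" lines are skipped.
    """
    points = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 9 or fields[2] == "CPU":
            continue  # skip headers and short/average lines
        timestamp = f"{fields[0]} {fields[1]}"
        idle = float(fields[-1])
        points.append((timestamp, round(100.0 - idle, 2)))
    return points

points = parse_sar_cpu(SAMPLE)
# plotting is then roughly: plt.plot([t for t, _ in points],
#                                    [v for _, v in points])
```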



What sar+rsync doesn't provide:

- app-specific metrics

- quick and easy way to build a number of graphs searching for correlation (how to slice the data to get results that explain issues)

- log/metrics correlation

- unified way to build alerts

- ad-hoc changes - while you're in the middle of an incident and want to get information that's just slightly different from existing, or filter out some values, or overlay a trend - how long would it take in your custom solution vs grafana?

And finally - Grafana exists. Why would I write a custom graph generator on top of a custom data store when I can set up a collector + InfluxDB + Grafana in a fraction of that time and get more back?
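To make the "fraction of that time" claim concrete, a stack like that can be sketched in one docker-compose file (a hypothetical sketch: service names, the Telegraf collector choice, and the mounted telegraf.conf are all illustrative):

```yaml
version: "3"
services:
  influxdb:
    image: influxdb:1.8
    ports: ["8086:8086"]
    volumes: ["influx-data:/var/lib/influxdb"]
  telegraf:                # the collector: ships host metrics to InfluxDB
    image: telegraf
    volumes: ["./telegraf.conf:/etc/telegraf/telegraf.conf:ro"]
    depends_on: [influxdb]
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]   # dashboards at http://localhost:3000
    depends_on: [influxdb]
volumes:
  influx-data:
```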


I'm not experienced with the CollectD stack, but I use Prometheus + Grafana to monitor probes. My two cents:

- Fairly lightweight. Prometheus deals with quite a lot of series without much memory or CPU usage.

- Integration with a lot of applications. Prometheus lets me monitor not only the system, but other applications such as Elastic, Nginx, PostgreSQL, network drivers... Sometimes I need an extra exporter, but they tend to be very light on resources. Also, with mtail (which is again super lightweight) I can convert logs to metrics with simple regexes.

- Number of metrics. For instance, several times I've needed to diagnose an outage and wanted a metric I hadn't thought about, and it turned out that the exporter I was using did actually collect it; I just hadn't included it in the dashboard. As an example, the default node exporter has very detailed I/O metrics, systemd collectors, network metrics... They're quite useful.

- Metric correctness. Prometheus appears to be at least decent at dealing with rate calculations and counter resets. Other monitoring systems are worse, and it wasn't unusual to find a 20000% CPU usage alert due to a counter reset.
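To make the counter-reset point concrete, here is a Python sketch of the reset-compensating increase calculation that rate()-style functions perform (a simplification of what Prometheus actually does, but it shows why naive last-minus-first math produces those bogus alerts):

```python
def counter_increase(samples):
    """Total increase of a monotonic counter over a series of samples,
    compensating for resets (e.g. a process restart zeroing the counter).

    Naive last-minus-first goes negative (or wildly wrong) across a reset;
    treating any drop as a restart from zero recovers the true increase.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev   # normal monotonic growth
        else:
            total += cur          # reset: counter restarted from zero
    return total

# Counter climbs to 90, resets, climbs again to 30: true increase is 120,
# while naive last-minus-first would report 30.
```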

- Alerts. Prometheus can generate alerts with quite a lot of flexibility, and the AlertManager is a pretty nice router for those alerts (e.g., I can receive all alerts in a separate mail, but critical alerts are also sent in a Slack channel).
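The mail-plus-Slack routing described above can be sketched as an Alertmanager route tree (a hedged sketch: receiver names, the address, and the channel are made up, and `match` on a `severity` label assumes your alert rules set one):

```yaml
route:
  receiver: team-mail              # required default
  routes:
    - match:
        severity: critical
      receiver: team-slack
      continue: true               # keep matching so mail still fires
    - receiver: team-mail          # catch-all: every alert goes to mail
receivers:
  - name: team-mail
    email_configs:
      - to: oncall@example.org
  - name: team-slack
    slack_configs:
      - channel: "#alerts"
```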

- Community support. It seems the community is adopting the Prometheus format for exposing metrics, and there are packages for Python, Go and probably more languages. Also, the people who make the exporters tend to also make dashboards, so you almost always have a starting point that you can fine-tune later.
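The format the community has converged on is just plain text, which is part of why exporters are so easy to write. A hand-rolled sketch of it below (in practice the official `prometheus_client` package renders this, and more, for you; the metric name and labels here are made up):

```python
def render_exposition(name, help_text, mtype, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Output is the
    familiar "# HELP / # TYPE / name{labels} value" block that exporters
    serve on /metrics.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

page = render_exposition(
    "myapp_requests_total", "Total HTTP requests handled.", "counter",
    [({"method": "get"}, 1027), ({"method": "post"}, 3)],
)
```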

- Ease of setup. It's just YAML files, I have an Ansible role for automation but you can go with just installing one or two packages in clients and adding a line to a configuration file in the Prometheus master node.
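For scale, the "adding a line" part looks roughly like this in prometheus.yml (hostnames are illustrative; node_exporter listens on 9100 by default, and each extra client is one more line under targets):

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - "web-1.example.org:9100"
          - "db-1.example.org:9100"
```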

- Ease of use. It's incredibly easy to make new graphs and dashboards with Prometheus and Grafana, no matter if they're simple or complex.

For me, the main points in favor of Prometheus (or any other monitoring setup above simple scripts) are alerting and the sheer number of metrics. If you just need to monitor CPU load and simple stats, maybe Prometheus is too much, but it's not that hard to set up anyway.


Author here. I'll probably write another tutorial focusing on Prometheus instead of CollectD.

Thanks for the suggestion.

SerHack


It would be wonderful if you included limitations as well, to help people make the right decisions for their tech stack. I've been playing around with Prometheus lately for environmental monitoring, and long-term retention is particularly important to me.

During proof-of-concept testing, some historical data on disk perhaps wasn't lost per se, but it definitely failed to load on restart. I haven't worked hard to replicate this, but there are some similar unsolved tickets out there.

Additional traps for new players include customizing --storage.tsdb.retention.time and related parameters.
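One such trap: retention is controlled by startup flags rather than the scrape config (the default is 15 days). A hypothetical invocation (paths and values are illustrative, and --storage.tsdb.retention.size only exists in newer releases):

```
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=90d \
  --storage.tsdb.retention.size=50GB
```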


Thank you!


The biggest advantage is the near-real-time aspect of this. During an outage, having live metrics is essential. Does your custom system allow you to see live metrics as they happen, or do you need to re-run your aggregation scripts every 5 minutes?


At work and in my personal projects I use a Prometheus + Grafana setup, and it's very easy to set up. Not sure what you mean by the complexity and setup costs of a Grafana stack?

Alerting with the Prometheus AlertManager is also pretty straightforward, and I'm looking at dashboards every day to see if everything is running smoothly, or to track down what's not working when there are issues. Grafana dashboards are always the second thing I look at after an alert fires somewhere, and they have been invaluable.


It sounds like you home-rolled a version of Grafana, CollectD, et al., which probably took more time and effort than just installing Grafana, CollectD, et al.


I think you are overestimating the time and complexity of installing and setting up Prometheus + Grafana on a box, putting node exporter on your hosts, and copy/pasting a Grafana dashboard for node exporter (which covers your use case).

It gets complex only when you start monitoring your apps (i.e. using a prometheus client library to generate and export custom app metrics) and create custom grafana dashboards for these metrics. Or if you need to monitor some niche technology without its own existing prometheus exporter. Then yes, you need to read the docs, think about what you need to monitor and how, write code...


I love this attitude (the simplest thing that can possibly work) and am trying to write a book about how to run a whole syseng department this way.

In my view Grafana shines at app data collection, and that's what we want a lot of (my simplest thing that can possibly work is just a web server accepting metrics à la carbon/graphite).

So you're likely to have Grafana lying around already by the time you do infra monitoring. I guess your choice then is how to leverage that setup to do app monitoring?



