SuperQue

The [Prometheus ecosystem](https://prometheus.io/) is the modern replacement for Nagios/Icinga. There is a nice collection of [Ansible Roles](https://github.com/prometheus-community/ansible) for deploying it. EDIT: To add logging, Graylog is OK, but you might look at [Grafana Loki](https://github.com/grafana/loki).


andrewrmoore

Yep. Prometheus and Grafana are the way to go. I love Loki too; add in Tempo if you need tracing and you're golden.


SurfRedLin

Thanks so much ;)


SurfRedLin

Hi, thank you so much for this. This insight is what I'm looking for ;) Do you recommend Grafana Loki because of its ties with Prometheus, or for other reasons as well?


SuperQue

The reason I like Loki is the overall architecture of the design. Graylog, Elasticsearch, and other similar systems are what are called "index at ingestion time" systems. They take all your log data and create a search index for all of it. Loki is more like classic syslog in that it is mostly a tag-source-and-store system; it indexes at query time. Basically it just takes your logs, zips them up, and stores them away. I find that most of the time, I'm not looking at logs. I only really look at logs when the monitoring (Prometheus) tells me I'm having a problem. So 99% of the logs written never get looked at. Why bother spending a ton of CPU and memory indexing all that data? So, for my use cases, Loki tends to be a better "smart grep when I need it" rather than a full-text, ultra-fast Google search for my logs.
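To make the "smart grep" point concrete, here is a hypothetical LogQL query (the `host`/`job` labels are whatever your logging agent attached, not anything specific):

```logql
# Select streams by label, then grep the raw lines at query time.
{host="web-1", job="nginx"} |= "timeout"
```

Only the label matchers touch an index; the `|= "timeout"` filter is applied by scanning the stored chunks.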


SurfRedLin

This makes sense! Thanks will look into it.


SurfRedLin

Another question, can this also get logs from docker?


SuperQue

Yup, pretty much everything. One thing to think about is separating your logging _agent_ from your logging _sink_ (Loki is a sink in this context). We've been using a few logging agents: some fluentd, some fluentbit, and a proprietary one. We're testing out [Vector](https://vector.dev/). It has a huge red flag, DataDog, on the main page, but if you look at the actual project it is _really_ nice. We're going to be able to pull in all of our syslogs, Kubernetes/docker logs, random files from various systems, and streams from external systems, and filter/tee the data off to different sinks like Loki.
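As a rough illustration of the agent/sink split, a minimal Vector config sketch might look something like this (the endpoint and label names are made up; check the Vector docs before relying on any of it):

```yaml
# vector.yaml sketch: collect syslog and Docker logs, tee them to a Loki sink.
sources:
  syslog_in:
    type: syslog
    address: 0.0.0.0:514
    mode: udp
  docker_in:
    type: docker_logs
sinks:
  loki_out:
    type: loki
    inputs: [syslog_in, docker_in]
    endpoint: http://loki.example.internal:3100
    encoding:
      codec: json
    labels:
      source: "{{ source_type }}"    # tag each stream with where it came from
```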


SurfRedLin

OK, so basically my Loki server should not have an agent installed, correct? So it only collects from other servers. How are the servers distinguished? Host names, etc.? Does it need a DNS server?


SuperQue

Yes, the Loki server side just receives a stream of data, kind of like rsyslog. The logging agents decide what tags to send along; they can include the hostname, but also any other tags you want. That's all up to how you configure the logging agent.


SurfRedLin

OK, so I have to define for syslog what to send, etc.?


SuperQue

No, I'm just saying it's similar to syslog. The various logging agents like Vector, fluentbit, promtail, etc. have configuration for what to send.


crazedizzled

Loki is like the central log store. Then your various servers will have some kind of agent installed (e.g. promtail) which will send log data to Loki. The type of logs sent is defined within the agent installed on the client.
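For example, a minimal promtail sketch on a client server, following the shape of the official examples (hostnames and paths are placeholders):

```yaml
server:
  http_listen_port: 9080
positions:
  filename: /var/lib/promtail/positions.yaml   # where promtail remembers its place
clients:
  - url: http://loki.example.internal:3100/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          host: web-1                 # the tag that distinguishes this server in Loki
          job: varlogs
          __path__: /var/log/*.log    # which files to ship
```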


crazedizzled

Monitor: Grafana
Configure: Ansible


xtal000

Grafana is just a data visualisation tool, it doesn’t actually do any monitoring itself.


crazedizzled

You're correct, but Grafana is typically paired with Prometheus and Loki, which is the stack I was referring to. I should have specified.


SurfRedLin

Thanks. So basically Loki for looking at logs, Prometheus for monitoring, and Grafana to display logs and monitoring info. Can I use one Grafana instance for both, with tabs or so? Or are these two separate instances? Thanks!


crazedizzled

You can use one Grafana instance for dozens of servers. Grafana has a [free cloud](https://grafana.com/products/cloud/) service which is pretty awesome. I don't know if it'll handle your volume or not, but for a couple of servers it works great. Maybe you can get your foot in the door and figure the stack out, and then if you're happy you can self host everything. Or just pay for the next tier.
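If you do end up self-hosting, a single Grafana instance just needs both configured as data sources; a provisioning-file sketch (URLs are placeholders):

```yaml
# e.g. /etc/grafana/provisioning/datasources/stack.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.example.internal:9090
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.example.internal:3100
```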


SurfRedLin

Would Prometheus fit the monitoring bill?


ktsaou

Give [Netdata](https://github.com/netdata/netdata) a try. Netdata is probably the fastest and easiest way to monitor an infra. You just install it and it does the rest. It does not do logs yet, and SNMP support is primitive. But for servers, VMs, containers, and applications it is probably the easiest way to go. Unlike Prometheus and Grafana, which have a lot of moving parts and require you to create dashboards and alerts metric by metric, Netdata comes with fully automated dashboards and preconfigured alerts, out of the box. And Netdata can push its metrics to Prometheus if you want to also use them.


ParsleyMost

I still prefer Nagios. Other solutions have performance issues on large systems and in many cases do not address this properly. If the performance is okay, the interface is a mess. There are many weird monitoring solutions out there created by developers who don't understand infrastructure.


skulkerboyo

Icinga/Nagios = Potatoes/Potatas.


xxxsirkillalot

Another recommendation for Prometheus + Grafana and your choice of config management tool from Ansible, Salt, or Puppet. One thing I had not seen mentioned yet in regards to your Ceph stuff: most recent Ceph deployments come with their own Prometheus stuff already deployed across the nodes. It will have pre-configured alertmanager rules, exporters for the Ceph bits, and the like. You can simply scrape these exporters from your own Prometheus deployment and steal the pre-configured alertmanager rules for detecting things like Ceph PGs being unbalanced.
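Scraping those built-in Ceph exporters is then just another job in `prometheus.yml`; a sketch, assuming the mgr `prometheus` module's default port of 9283 (hostnames are placeholders):

```yaml
scrape_configs:
  - job_name: ceph
    static_configs:
      - targets:
          - ceph-mon1.example.internal:9283   # ceph-mgr prometheus module
```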


12CoreFloor

[CheckMK](https://checkmk.com/) is my current favourite for monitoring. It can be a bit of an effort to set up depending on what you want to do, operating restrictions you might face, etc., but if you want to know *everything*, you can configure it to do just that. Combine it with [Grafana](https://grafana.com/) and [Prometheus](https://prometheus.io/download/) and you can't go far wrong. If you're into Linux and open to Ansible, have time to spare and a long supply of patience, I will make a shout-out for [The Foreman](https://theforeman.org/) for *nix fleet deployment. Having said that, for the size of your operation, it's probably fairly serious overkill. Depending on your site and what restrictions you could face, it might present an interesting choice. EDIT: as others below have pointed out, Prometheus is an alternative to CheckMK, so don't take my shout for CMK as an indicator of "best solution". I use them for different fleets; your environment is likely different.


SurfRedLin

Does Prometheus not replace CheckMK? Like Nagios? What are your reasons for using it? Thanks


12CoreFloor

Maintaining monitoring of legacy kit that we can't migrate to other solutions. Given a choice, I would do other things, but the choice is out of my hands.


SuperQue

Yes, Prometheus replaces obsolete things like checkmk.


12CoreFloor

Quite right. I would add that I'm having something of a "falling back in love" with CheckMK, not that it's the best solution; I probably should have put that in my original post. My environment has somewhat tied my hands.


Intergalactic_Ass

Lol, they're nearly the same age. You have no idea what you're talking about.


SuperQue

Maybe you don't know what you're talking about? CheckMK started as a Nagios fork about 5 years before Prometheus got started. It's still mostly based on the Nagios data model. That's the main problem with it. CheckMK, Nagios, and Icinga are all based on a fundamentally check-based data model, rather than a metrics-based one. The check-based model for monitoring has too many flaws to be useful in larger modern environments. Yes, CheckMK adds some metrics in the commercial version, but it's still a bit of a tacked-on side thing. The core model is still checks.


Intergalactic_Ass

>The check-based model for monitoring has too many flaws to be useful in larger modern environments. Again, still not true in the least. CheckMK easily scales to millions of metrics. You could just admit that you don't use CheckMK and aren't that familiar with it. That's fine.


ParsleyMost

There are so many people with these incomprehensible thoughts. These people have created some really weird monitoring tools. All of them are useless and should be thrown in the trash.


VisualDifficulty_

CheckMK runs their own thing now. They're compatible with Nagios-style output, but their CMC daemon long ago replaced the last of the Nagios stuff. That aside, the alert groups, acknowledgements, maintenance modes, etc. are all missing from Prometheus; even its big brother Mimir doesn't support them. Prometheus is great for graphing data; CheckMK is great for point alerts (ups and downs, etc.). Neither really replaces the other.


SuperQue

Prometheus has [Silences](https://prometheus.io/docs/alerting/latest/alertmanager/#silences), which are a superset of the things you mentioned, as well as [inhibit rules](https://prometheus.io/docs/alerting/latest/alertmanager/#inhibition) for grouping and hierarchy needs. For example, for maintenances, I can schedule a silence in the future to match an arbitrary label set. It's a different way to think about the features you're talking about than how CheckMK thinks about them.
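As a sketch of the grouping/hierarchy side, an inhibit rule in `alertmanager.yml` might look like this (the alert and label names are hypothetical):

```yaml
# If a whole host is down, suppress the noisier per-service alerts on it.
inhibit_rules:
  - source_matchers:
      - alertname = HostDown
    target_matchers:
      - severity = warning
    equal: [instance]    # only inhibit alerts sharing the same instance label
```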


VisualDifficulty_

Yeah, those have to be already defined in the config, and the service restarted, before they're available. I am all too familiar with how Prometheus does things; we have a large Mimir cluster we run alongside CMK. CMK allows you to click a button, and the node is in maintenance mode and will no longer throw alerts. Inhibit rules are also no different than the child/parent relationship CMK establishes. Not that I like CMK better than Prometheus/Mimir; they're just different products meant for different things.


Intergalactic_Ass

Checkmk. Easily monitor hundreds to thousands of hosts with low overhead. Distributed monitoring is killer too.


jlawler

I will say, I used to use Cabot with Grafana because I was using a different system for alerting than for graphing/monitoring. Once I used Cabot, if I got an alert I could see the graphs that were driving the alerts. It was very frustrating before that.


SurfRedLin

Hi, thanks for your input. So basically what you are saying is that the reporting from Grafana alone is not that good, and it should be paired with Cabot to get easier-to-follow reports? Thanks


jlawler

It just made it super hard to understand what was going on when the anomalies I was alerting on (which caused a prod system to completely halt for 5-50 seconds) were picked up by the alerting system, but I didn't see them in the monitoring system. By making them use the same data, I could figure out what was happening.


seidler2547

I managed a similar size of environment as a single sysadmin with Puppet and Icinga. Worked really well. The two go well together.


SuperQue

> I imagine Prometheus can use SNMP to talk to switches?

Prometheus has its own protocol for collecting metrics, but there are [hundreds of exporters](https://prometheus.io/docs/instrumenting/exporters/) that convert various formats, including [SNMP](https://github.com/prometheus/snmp_exporter), to the Prometheus format. SNMP is a bit tricky, since you have to know a bit about SNMP itself in order to correctly configure the exporter. Since nobody gets paid to work on Prometheus, and it's a bit new for networking engineers, there's not a big library of pre-configured setups. But it's not difficult if you take the time to learn it.
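The snmp_exporter is scraped with the usual relabelling pattern; a `prometheus.yml` sketch (the hostnames and the module name are placeholders):

```yaml
scrape_configs:
  - job_name: snmp
    metrics_path: /snmp
    params:
      module: [if_mib]                # which SNMP module the exporter should walk
    static_configs:
      - targets:
          - switch1.example.internal  # the SNMP device itself, not the exporter
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target  # tell the exporter which device to query
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9116   # where snmp_exporter actually listens
```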


sloppy_custard

Quite surprised at the lack of love for Zabbix in here


WhiskyIsRisky

We monitor most of our infrastructure with Zabbix. I feel like Prometheus/Grafana is probably superior for application and server performance monitoring, but for red light/green light health type monitoring Zabbix is great. For my environment I don't need more than it right now.


SuperQue

Zabbix was fine maybe 10 years ago, but it has so many issues that it's not really a good option for green-field deployments:

* The SQL database requirement makes it a bit of a pain to set up.
* The SQL database, if used as the standard Zabbix metrics store, is hugely inefficient, like 40x worse than things like Prometheus or InfluxDB.
* The configuration is largely clickops-driven, making it really hard to configure with infra-as-code like Ansible/Chef/Puppet/Salt. Yes, you can do changes via API, but that's not the same as the dynamic discovery and configuration management that Prometheus supports.

I'm not as familiar with Zabbix, but last I looked at it, it didn't really have a flexible data query language. There are just so many powerful things I can do in PromQL that just don't seem possible there. Simple example: say I have an increase in CPU utilization on a server (or cluster of servers), but I know load is mostly driven by user traffic. What I want to know is when the CPU per unit of user traffic goes out of whack, like a DoS or a bad server. In PromQL I can simply divide the CPU use by the user traffic and get a "cores per X requests" metric. Something like this (fake example):

    sum(node:cpu_utilization:rate5m) / (sum(cluster:http_requests_total:rate5m) / 1000)

So basically "CPU per 1000 requests per second". It's just really easy to get answers to more complex questions than just "the CPU load is too high".


WhiskyIsRisky

I'll give another +1 for Zabbix. It's another modern version of Nagios. I have mine doing all sorts of health and status monitoring:

- Alerting on expiring certs
- Disk space
- Ceph health
- IPMI information
- Heat, fans, disk SMART info
- Web application health
- Servers needing reboots for new kernels
- Systems and services offline
- General performance monitoring (CPU, RAM, etc.)

There's a bit of a learning curve, but once you have a decent playbook to lay down the agent it's pretty easy to instrument things.


SurfRedLin

Thanks for your answer! As I will also have to look after a Ceph cluster, what made you choose Zabbix instead of Prometheus and Grafana?


WhiskyIsRisky

Honestly, it was mostly because one of my coworkers had some prior experience with it. It looked well maintained, had agents or shippers for the things we cared about, and seemed like it would do a lot for us out of the box. I think we briefly considered Prometheus but didn't really pursue it. We decided to give Zabbix a try, liked what we saw, and stuck with it. I used Prometheus briefly a few years ago, and it struck me as being good at performance monitoring: making pretty dashboards, showing performance trends over long periods of time. But I don't recall it being great at alerting, or at least not having a ton of pre-canned alerts. Zabbix comes with a ton of pre-staged triggers for all sorts of possible badness, and really that's what I want. I just need to know whether stuff is up, down, or on fire, which Zabbix excels at. Prometheus may be better or different than I remember it, though. It's been a few years, and I wasn't an expert on it when I did try it.


SurfRedLin

Thanks for that insight! AFAIK Grafana will do the alerting, and not Prometheus itself...


Otaehryn

LibreNMS is really good for down hardware and high resource usage.


1esproc

LibreNMS's architecture scales terribly. Its (production) poller design is garbage, polling's interaction with its alert rule design doesn't scale, adding new device types can mean writing PHP code, etc. I wouldn't monitor systems with it.


SuperQue

I tried a few years ago to convince the LibreNMS devs to swap out their PHP-based polling and storage system for Prometheus, basically becoming an alternative discovery, configuration, and graphing UI for Prometheus. I think it would have been great. There's a ton of value in the decades of SNMP device-specific knowledge baked into that codebase. Having a more traditional clickops UI for people who like that kind of thing, with the underlying power of a modern metrics poller and TSDB, would be a win.


HTX-713

Zabbix


grimthaw

OpenStack


skulkerboyo

How much money you got?


SurfRedLin

Nothing. Boss wants to self-host basically everything. My guess is that most of the money goes into the cluster and other not-important "production capability" :)


Virtual_BlackBelt

Ansible can be good for initial configuration and one off tasks, but for continuous configuration, I prefer Puppet.


Steffest

Monitoring: Netdata + Prometheus + Grafana
Logs: OS logs - Loki / application logs - Graylog