mono-infrastructure/architecture/monitoring
2020-12-28 16:40:49 +00:00
..
prometheus_mvp.py Add a section to the repo for architecture (#14944) 2020-12-18 16:03:49 -05:00
README.md [pre-commit.ci] auto fixes from pre-commit.com hooks 2020-12-28 16:40:49 +00:00

Monitoring

For the now the the planned monitoring platform is Prometheus.

Initially, we should keep it simple. Prometheus can scale a long way and allows a lot of clever stuff involving data archival and service discovery. This can all come in the medium-term.

For now we want to solve the basics:

  • collect infrastucture metrics
  • visualise those over a reasonable time-frame
  • be alerted if one of those metrics does somehthing funky

For now we do not need HA, multi-year retention or automatic service discovery, so I propose something like the following:

  • A single prometheus host in AWS
    • Non-AWS Exporters added via Ansible using file_sd
    • AWS hosted exporters added via ec2_sd
  • Grafana on that host
  • Alertmanager on that same host
    • Non-critical alerts in a dedicated channel
    • Critical alerts to a small group via a service like Pushover/Pagerduty.

Pretty pictures via Python

Use python-diagrams to build construct the diagram.

pip install --user diagrams
python ./prometheus-mvp.py

We'll automate putting the outputed file somewhere ASAP

What this is NOT addressing

I am purposely not covering Logging and web service uptime here. We can check web services with Prometheus, but an external service (UptimeRobot?) is, in my opion, better suited to that problem.

Likewise, I do not see Logging as directly related. A separate stack is necessary for that. Loki would perhaps be a good solution that could use the same Grafana instance. ELK and Graylog are also worth considering.

Responsiblities

The monitoring team cannot realistically be responsible for how every single is monitored. Prometheus has a huge library of exporters for almost everything.

The monitoring team can be responsible for ensuring that the infrastructure is available to the application/infrastructure teams. Also that knowledge of how to be added to that infrastucture is suitably shared.

It falls on the application teams themselves to find a suitable exporter, add it to the Prometheus server and write the necessary alerts, queries and dashboards. Obviously, we will help as much as we can, but please don't ask me to learn the internals of FreeIPA for example.