Add a section to the repo for architecture (#14944)

* Proposal for monitoring responsibilities * added an architecture diagram for Prometheus * install graphviz * Only run the diagrams action when someone commits a diagram * Filled out the architecture README * Install node Prometheus Node Exporter on all hosts Co-authored-by: Chris Cowley <chris.cowley@fr.clara.net>
2024-11-23 21:51:28 +00:00 · 2020-12-18 22:03:49 +01:00 · 2020-12-18 22:03:49 +01:00 · bae96c0431
commit bae96c0431
parent c0c8ea1ec6
4 changed files with 180 additions and 0 deletions
--- a/.github/workflows/diagrams.yaml
+++ b/.github/workflows/diagrams.yaml
@ -0,0 +1,37 @@
+---
+name: Arcitecture Diagrams
+on:
+  push:
+    paths:
+      - 'architecture/**.py'
+
+jobs:
+  build-archi-diagrams:
+    runs-on: ubuntu-latest
+    timeout-minutes: 5
+
+    steps:
+      - name: Git Checkout
+        uses: actions/checkout@v2
+      - name: Set up Python
+        uses: actions/setup-python@v2
+        with:
+          python-version: 3.8
+      - name: Install dependencies
+        run: |
+          sudo apt update
+          sudo apt -y install graphviz
+          python -m pip install --upgrade pip
+          pip install diagrams
+      - name: Build diagrams
+        run: |
+          for file in $(find architecture/ -type f -name "*.py")
+          do
+            python $file
+          done
+          mkdir artifacts; mv *.png artifacts/
+      - name: Upload diagram images
+        uses: actions/upload-artifact@v2
+        with:
+          name: diagrams-png
+          path: artifacts/*.png
--- a/ansible/playbooks/role-rocky-monitoring.yml
+++ b/ansible/playbooks/role-rocky-monitoring.yml
@ -21,6 +21,7 @@

  role:
    - role: cloudalchemy.prometheus
+    - role: cloudalchemy.alertmanager

  post_tasks:
    - name: Touching run file that ansible has ran here
@ -30,3 +31,24 @@
        mode: '0644'
        owner: root
        group: root
+
+- name: Install Prometheus Node Exporter
+  hosts: all
+  become: true
+
+  pre_tasks:
+    - name: Install SELinux packages
+      package:
+        name: python3-policycoreutils.noarch
+        state: present
+
+  roles:
+    - role: cloudalchemy.node-exporter
+      state: present
+
+  post_tasks:
+    - name: Open firewall for node-exporter
+      ansible.posix.firewalld:
+        port: 9100/tcp
+        permanent: yes
+        state: enabled
--- a/architecture/monitoring/README.md
+++ b/architecture/monitoring/README.md
@ -0,0 +1,59 @@
+# Monitoring
+
+For the now the the planned monitoring platform is [Prometheus](https://prometheus.io/).
+
+Initially, we should keep it simple. Prometheus can scale a long way and
+allows a lot of clever stuff involving data archival and service discovery.
+This can all come in the medium-term.
+
+For now we want to solve the basics:
+
+- collect infrastucture metrics
+- visualise those over a reasonable time-frame
+- be alerted if one of those metrics does somehthing funky
+
+For now we do not need HA, multi-year retention or automatic service discovery,
+so I propose something like the following:
+
+- A single prometheus host in AWS
+  - Non-AWS Exporters added via Ansible using file_sd
+  - AWS hosted exporters added via ec2_sd
+- Grafana on that host
+- Alertmanager on that same host
+  - Non-critical alerts in a dedicated channel
+  - Critical alerts to a small group via a service like Pushover/Pagerduty.
+
+## Pretty pictures via Python
+
+Use [python-diagrams](https://diagrams.mingrammer.com) to build construct the diagram.
+
+```
+pip install --user diagrams
+python ./prometheus-mvp.py
+```
+
+We'll automate putting the outputed file somewhere ASAP
+
+## What this is NOT addressing
+
+I am purposely not covering Logging and web service uptime here. We can check
+web services with Prometheus, but an external service (UptimeRobot?) is, in my
+opion, better suited to that problem.
+
+Likewise, I do not see Logging as directly related. A separate stack is
+necessary for that. Loki would perhaps be a good solution that could
+use the same Grafana instance. ELK and Graylog are also worth considering.
+
+## Responsiblities
+
+The monitoring team cannot realistically be responsible for how every single 
+is monitored. Prometheus has a huge library of exporters for almost everything.
+
+The monitoring team can be responsible for ensuring that the infrastructure is
+available to the application/infrastructure teams. Also that knowledge of how
+to be added to that infrastucture is suitably shared.
+
+It falls on the application teams themselves to find a suitable exporter, add
+it to the Prometheus server and write the necessary alerts, queries and
+dashboards. Obviously, we will help as much as we can, but please don't ask
+me to learn the internals of FreeIPA for example.
--- a/architecture/monitoring/prometheus_mvp.py
+++ b/architecture/monitoring/prometheus_mvp.py
@ -0,0 +1,62 @@
+#!/usr/bin/env python
+
+from diagrams import Diagram, Cluster, Edge
+from diagrams.aws.compute import EC2
+from diagrams.aws.general import General
+from diagrams.aws.network import ELB
+from diagrams.onprem.compute import Server
+from diagrams.onprem.iac import Ansible
+from diagrams.onprem.monitoring import Grafana, Prometheus
+from diagrams.saas.alerting import Pushover
+from diagrams.saas.chat import Slack
+
+graph_attr = {
+        }
+
+node_attr = {
+        }
+
+with Diagram("Prometheus MVP",
+             show=False,
+             direction="TB",
+             outformat="png",
+             graph_attr=graph_attr,
+             node_attr=node_attr):
+
+    with Cluster("Rocky VPC"):
+        with Cluster("AWS services"):
+            aws_group = [
+                    EC2("service01"),
+                    EC2("service02"),
+                    ]
+        with Cluster("metrics host"):
+            metrics = Prometheus("metrics")
+            alertmanager = Prometheus("alertmanager")
+            dashboard = Grafana("monitoring")
+            metrics << dashboard
+            metrics >> alertmanager
+
+    Ansible("ansible") >> metrics
+    metrics >> Edge(style="dashed",
+                    label="ec2 read permissions") >> General("AWS API")
+
+    alertmanager >> Edge(style="dashed",
+                         label="non-critical") >> Slack("rocky-alerts")
+    alertmanager >> Edge(style="dashed",
+                         label="critical") >> Pushover("tbd")
+    ELB("metrics.rockylinux.org") >> Edge(label="TCP3000") >> dashboard
+    with Cluster("Cloudvider"):
+        cloudvider_group = [
+                Server("server01"),
+                Server("server02"),
+                ]
+
+    with Cluster("Spry Servers"):
+        spry_group = [
+                Server("server01"),
+                Server("server02"),
+                ]
+
+    metrics >> aws_group
+    metrics >> spry_group
+    metrics >> cloudvider_group