mirror of
https://github.com/rocky-linux/infrastructure
synced 2024-12-03 18:26:26 +00:00
Add a section to the repo for architecture (#14944)
* Proposal for monitoring responsibilities * added an architecture diagram for Prometheus * install graphviz * Only run the diagrams action when someone commits a diagram * Filled out the architecture README * Install node Prometheus Node Exporter on all hosts Co-authored-by: Chris Cowley <chris.cowley@fr.clara.net>
This commit is contained in:
parent
c0c8ea1ec6
commit
bae96c0431
37
.github/workflows/diagrams.yaml
vendored
Normal file
37
.github/workflows/diagrams.yaml
vendored
Normal file
@ -0,0 +1,37 @@
|
|||||||
|
---
|
||||||
|
name: Arcitecture Diagrams
|
||||||
|
on:
|
||||||
|
push:
|
||||||
|
paths:
|
||||||
|
- 'architecture/**.py'
|
||||||
|
|
||||||
|
jobs:
|
||||||
|
build-archi-diagrams:
|
||||||
|
runs-on: ubuntu-latest
|
||||||
|
timeout-minutes: 5
|
||||||
|
|
||||||
|
steps:
|
||||||
|
- name: Git Checkout
|
||||||
|
uses: actions/checkout@v2
|
||||||
|
- name: Set up Python
|
||||||
|
uses: actions/setup-python@v2
|
||||||
|
with:
|
||||||
|
python-version: 3.8
|
||||||
|
- name: Install dependencies
|
||||||
|
run: |
|
||||||
|
sudo apt update
|
||||||
|
sudo apt -y install graphviz
|
||||||
|
python -m pip install --upgrade pip
|
||||||
|
pip install diagrams
|
||||||
|
- name: Build diagrams
|
||||||
|
run: |
|
||||||
|
for file in $(find architecture/ -type f -name "*.py")
|
||||||
|
do
|
||||||
|
python $file
|
||||||
|
done
|
||||||
|
mkdir artifacts; mv *.png artifacts/
|
||||||
|
- name: Upload diagram images
|
||||||
|
uses: actions/upload-artifact@v2
|
||||||
|
with:
|
||||||
|
name: diagrams-png
|
||||||
|
path: artifacts/*.png
|
@ -21,6 +21,7 @@
|
|||||||
|
|
||||||
role:
|
role:
|
||||||
- role: cloudalchemy.prometheus
|
- role: cloudalchemy.prometheus
|
||||||
|
- role: cloudalchemy.alertmanager
|
||||||
|
|
||||||
post_tasks:
|
post_tasks:
|
||||||
- name: Touching run file that ansible has ran here
|
- name: Touching run file that ansible has ran here
|
||||||
@ -30,3 +31,24 @@
|
|||||||
mode: '0644'
|
mode: '0644'
|
||||||
owner: root
|
owner: root
|
||||||
group: root
|
group: root
|
||||||
|
|
||||||
|
- name: Install Prometheus Node Exporter
|
||||||
|
hosts: all
|
||||||
|
become: true
|
||||||
|
|
||||||
|
pre_tasks:
|
||||||
|
- name: Install SELinux packages
|
||||||
|
package:
|
||||||
|
name: python3-policycoreutils.noarch
|
||||||
|
state: present
|
||||||
|
|
||||||
|
roles:
|
||||||
|
- role: cloudalchemy.node-exporter
|
||||||
|
state: present
|
||||||
|
|
||||||
|
post_tasks:
|
||||||
|
- name: Open firewall for node-exporter
|
||||||
|
ansible.posix.firewalld:
|
||||||
|
port: 9100/tcp
|
||||||
|
permanent: yes
|
||||||
|
state: enabled
|
||||||
|
59
architecture/monitoring/README.md
Normal file
59
architecture/monitoring/README.md
Normal file
@ -0,0 +1,59 @@
|
|||||||
|
# Monitoring
|
||||||
|
|
||||||
|
For the now the the planned monitoring platform is [Prometheus](https://prometheus.io/).
|
||||||
|
|
||||||
|
Initially, we should keep it simple. Prometheus can scale a long way and
|
||||||
|
allows a lot of clever stuff involving data archival and service discovery.
|
||||||
|
This can all come in the medium-term.
|
||||||
|
|
||||||
|
For now we want to solve the basics:
|
||||||
|
|
||||||
|
- collect infrastucture metrics
|
||||||
|
- visualise those over a reasonable time-frame
|
||||||
|
- be alerted if one of those metrics does somehthing funky
|
||||||
|
|
||||||
|
For now we do not need HA, multi-year retention or automatic service discovery,
|
||||||
|
so I propose something like the following:
|
||||||
|
|
||||||
|
- A single prometheus host in AWS
|
||||||
|
- Non-AWS Exporters added via Ansible using file_sd
|
||||||
|
- AWS hosted exporters added via ec2_sd
|
||||||
|
- Grafana on that host
|
||||||
|
- Alertmanager on that same host
|
||||||
|
- Non-critical alerts in a dedicated channel
|
||||||
|
- Critical alerts to a small group via a service like Pushover/Pagerduty.
|
||||||
|
|
||||||
|
## Pretty pictures via Python
|
||||||
|
|
||||||
|
Use [python-diagrams](https://diagrams.mingrammer.com) to build construct the diagram.
|
||||||
|
|
||||||
|
```
|
||||||
|
pip install --user diagrams
|
||||||
|
python ./prometheus-mvp.py
|
||||||
|
```
|
||||||
|
|
||||||
|
We'll automate putting the outputed file somewhere ASAP
|
||||||
|
|
||||||
|
## What this is NOT addressing
|
||||||
|
|
||||||
|
I am purposely not covering Logging and web service uptime here. We can check
|
||||||
|
web services with Prometheus, but an external service (UptimeRobot?) is, in my
|
||||||
|
opion, better suited to that problem.
|
||||||
|
|
||||||
|
Likewise, I do not see Logging as directly related. A separate stack is
|
||||||
|
necessary for that. Loki would perhaps be a good solution that could
|
||||||
|
use the same Grafana instance. ELK and Graylog are also worth considering.
|
||||||
|
|
||||||
|
## Responsiblities
|
||||||
|
|
||||||
|
The monitoring team cannot realistically be responsible for how every single
|
||||||
|
is monitored. Prometheus has a huge library of exporters for almost everything.
|
||||||
|
|
||||||
|
The monitoring team can be responsible for ensuring that the infrastructure is
|
||||||
|
available to the application/infrastructure teams. Also that knowledge of how
|
||||||
|
to be added to that infrastucture is suitably shared.
|
||||||
|
|
||||||
|
It falls on the application teams themselves to find a suitable exporter, add
|
||||||
|
it to the Prometheus server and write the necessary alerts, queries and
|
||||||
|
dashboards. Obviously, we will help as much as we can, but please don't ask
|
||||||
|
me to learn the internals of FreeIPA for example.
|
62
architecture/monitoring/prometheus_mvp.py
Executable file
62
architecture/monitoring/prometheus_mvp.py
Executable file
@ -0,0 +1,62 @@
|
|||||||
|
#!/usr/bin/env python
|
||||||
|
|
||||||
|
from diagrams import Diagram, Cluster, Edge
|
||||||
|
from diagrams.aws.compute import EC2
|
||||||
|
from diagrams.aws.general import General
|
||||||
|
from diagrams.aws.network import ELB
|
||||||
|
from diagrams.onprem.compute import Server
|
||||||
|
from diagrams.onprem.iac import Ansible
|
||||||
|
from diagrams.onprem.monitoring import Grafana, Prometheus
|
||||||
|
from diagrams.saas.alerting import Pushover
|
||||||
|
from diagrams.saas.chat import Slack
|
||||||
|
|
||||||
|
graph_attr = {
|
||||||
|
}
|
||||||
|
|
||||||
|
node_attr = {
|
||||||
|
}
|
||||||
|
|
||||||
|
with Diagram("Prometheus MVP",
|
||||||
|
show=False,
|
||||||
|
direction="TB",
|
||||||
|
outformat="png",
|
||||||
|
graph_attr=graph_attr,
|
||||||
|
node_attr=node_attr):
|
||||||
|
|
||||||
|
with Cluster("Rocky VPC"):
|
||||||
|
with Cluster("AWS services"):
|
||||||
|
aws_group = [
|
||||||
|
EC2("service01"),
|
||||||
|
EC2("service02"),
|
||||||
|
]
|
||||||
|
with Cluster("metrics host"):
|
||||||
|
metrics = Prometheus("metrics")
|
||||||
|
alertmanager = Prometheus("alertmanager")
|
||||||
|
dashboard = Grafana("monitoring")
|
||||||
|
metrics << dashboard
|
||||||
|
metrics >> alertmanager
|
||||||
|
|
||||||
|
Ansible("ansible") >> metrics
|
||||||
|
metrics >> Edge(style="dashed",
|
||||||
|
label="ec2 read permissions") >> General("AWS API")
|
||||||
|
|
||||||
|
alertmanager >> Edge(style="dashed",
|
||||||
|
label="non-critical") >> Slack("rocky-alerts")
|
||||||
|
alertmanager >> Edge(style="dashed",
|
||||||
|
label="critical") >> Pushover("tbd")
|
||||||
|
ELB("metrics.rockylinux.org") >> Edge(label="TCP3000") >> dashboard
|
||||||
|
with Cluster("Cloudvider"):
|
||||||
|
cloudvider_group = [
|
||||||
|
Server("server01"),
|
||||||
|
Server("server02"),
|
||||||
|
]
|
||||||
|
|
||||||
|
with Cluster("Spry Servers"):
|
||||||
|
spry_group = [
|
||||||
|
Server("server01"),
|
||||||
|
Server("server02"),
|
||||||
|
]
|
||||||
|
|
||||||
|
metrics >> aws_group
|
||||||
|
metrics >> spry_group
|
||||||
|
metrics >> cloudvider_group
|
Loading…
Reference in New Issue
Block a user