As today’s homes tend to house fully fledged networks with all kinds of network and smart home devices, most of them connected to the internet, proper monitoring makes sense. Luckily, the necessary tools are freely available these days, at a very professional level.
Except for the lack of redundancy, the setup described below is at enterprise level and can be considered excessive for the usual home network. However, the well-known phrase “because it’s there” applies here as well.
Setup overview #
After several evolutionary steps over the last decade, I ended up with a monitoring setup mainly based on the Grafana tool stack. While mainly known for its top-notch, name-giving visualization tool, Grafana Labs provides a tool stack covering all aspects of observability. Namely:
- Grafana Mimir as the Time Series Database (TSDB) for collecting and querying our metrics. Alternative choices for a TSDB are Prometheus and InfluxDB. As the latter two lacked either proper scalability or a clear development path, I decided to go for Mimir.
- Grafana Loki as a log aggregation system. Via Loki I am able to collect and query the syslog messages of the different hosts in my network.
- Grafana Agent as the general data collector, fetching the input data from various Prometheus style metric sources as well as log sources and pushing it into Mimir/Loki.
- Grafana as the metric visualization and alerting tool. Grafana connects to Mimir and Loki to query the collected metrics and log data.

The only tool still missing is a data agent used to gather all kinds of data (CPU load, network traffic, …) and provide it as metrics fetchable by Grafana Agent.

- Telegraf is my choice of tool for this task. It provides a vast amount of input plugins out of the box and is easily extendable.
In a picture:
I am running this setup on Gentoo Linux. The necessary ebuilds to install the tools are available via the main portage tree or via my custom overlay.
Grafana Mimir setup #
Configuration file: /etc/mimir/mimir-local-config.yaml
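The original file is not reproduced here, so the following is a minimal sketch of what a local Mimir configuration along these lines could look like. The ports, paths and MinIO credentials are placeholders of mine, not the values from the original file; any line numbers referenced in the text refer to the original configs.

```yaml
# Minimal sketch of /etc/mimir/mimir-local-config.yaml -- ports, paths and
# MinIO credentials are placeholders, not the author's actual values.
multitenancy_enabled: true        # the default; forwarded data needs a tenant

server:
  http_listen_address: 0.0.0.0
  http_listen_port: 9009

common:
  storage:
    backend: s3                   # MinIO speaks the S3 protocol
    s3:
      endpoint: localhost:9000
      access_key_id: mimir
      secret_access_key: changeme
      insecure: true

blocks_storage:
  s3:
    bucket_name: mimir-blocks
```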
Grafana Loki setup #
Configuration file: /etc/loki/loki-local-config.yaml
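Again the original file is not shown; a minimal single-node Loki configuration in this spirit could look like the sketch below. Ports and paths are placeholders; note that `auth_enabled: true` is what puts Loki into multi-tenancy mode.

```yaml
# Minimal sketch of /etc/loki/loki-local-config.yaml -- paths and ports are
# placeholders, not the author's actual values.
auth_enabled: true                # multi-tenancy mode

server:
  http_listen_address: 0.0.0.0
  http_listen_port: 3100

common:
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
```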
Grafana Agent setup #
Configuration file: /etc/agent/agent-local-config.yaml
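Since the original file is not reproduced here, the sketch below shows the shape of a static-mode Grafana Agent configuration matching the description that follows. Ports, paths and label names are assumptions; the line numbers cited in the text refer to the original file, not this sketch.

```yaml
# Sketch of /etc/agent/agent-local-config.yaml (static mode) -- ports, paths
# and label names are placeholders, not the author's actual values.
metrics:
  global:
    scrape_interval: 60s
  configs:
    - name: default
      scrape_configs:
        - job_name: telegraf
          static_configs:
            - targets: ['localhost:9273']   # must match Telegraf's listen port
      remote_write:
        - url: http://localhost:9009/api/v1/push
          headers:
            X-Scope-OrgID: home             # tenant for Mimir

logs:
  configs:
    - name: default
      positions:
        filename: /var/lib/agent/positions.yaml
      scrape_configs:
        - job_name: syslog
          syslog:
            listen_address: 0.0.0.0:1514
            labels:
              job: syslog
          relabel_configs:
            - source_labels: ['__syslog_message_hostname']
              target_label: host
      clients:
        - url: http://localhost:3100/loki/api/v1/push
          tenant_id: home                   # tenant for Loki
```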
The metrics section defines which Prometheus style metric sources should be scraped (10: scrape_configs) and forwarded to Mimir (26: remote_write). For the config above only the Telegraf source is active. If you have other sources running, add them like the examples seen in the comments.
The logs section defines the log sources to be processed (36: scrape_config) and forwarded to Loki (50: clients).
The configuration above configures Grafana Agent to listen for Syslog messages on port 1514. To receive anything via this port, the Syslog daemon must be configured to forward its messages to this port. For syslog-ng this can be achieved by adding the following lines to syslog-ng.conf.
```
...
destination d_agent { syslog("localhost" port(1514) transport(tcp)); };
...
log { source(src); destination(d_agent); };
...
```
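Once the whole chain is up, the log path can be exercised end to end with the logger utility from util-linux; the message should then be findable in Loki via Grafana’s Explore view:

```shell
# travels: logger -> syslog-ng -> d_agent (port 1514) -> Grafana Agent -> Loki
logger "monitoring pipeline test"
```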
Note: Unless multitenancy_enabled: false is explicitly set in the Mimir configuration, Mimir runs in multi-tenancy mode and forwarded data must be assigned to a tenant. This is done via the lines 29: X-Scope-OrgID: home and 52: tenant_id: home.
The tenant id chosen here must also be set when connecting Grafana to Mimir (see Grafana configuration).
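Once Mimir is up and receiving data, the tenant setup can be sanity-checked from the command line. The port below is an assumption; adjust it to your http_listen_port:

```shell
# Lists all metric names stored for tenant "home"; in multi-tenancy mode
# requests without the X-Scope-OrgID header are rejected.
curl -H 'X-Scope-OrgID: home' \
  'http://localhost:9009/prometheus/api/v1/label/__name__/values'
```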
Telegraf setup #
Configuration file: /etc/telegraf/telegraf.conf
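The original file is not shown here, but a Telegraf configuration along the described lines could look like the following sketch. The listen port must match the targets entry in the Grafana Agent scrape config; the input plugins are examples only.

```toml
# Sketch of /etc/telegraf/telegraf.conf -- port and inputs are placeholders.
[agent]
  interval = "30s"

[[outputs.prometheus_client]]
  listen = ":9273"

[[inputs.system]]    # provides system_uptime, load averages, ...
[[inputs.cpu]]
  percpu = true
  totalcpu = true
[[inputs.mem]]
[[inputs.net]]
```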
The noteworthy sections are:
- 15-19: Output plugin is set to outputs.prometheus_client and the listen parameter must match the targets parameter in the Grafana Agent configuration.
- 20-49: Simply add the input plugins of your choice.
Starting the daemons #
Now it’s time to start the daemons and check the logs to see whether they are up and running. Any additional dependencies (e.g. MinIO for storage) should already be up and running. The starting sequence must be as follows:
- Mimir: Should initialize and then wait for connections.
- Loki: Should initialize and then wait for connections.
- Telegraf: Should start collecting and then wait for connections.
- Agent: Should start scraping data from Telegraf and the log sources and forward it to Mimir and Loki.
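On Gentoo with OpenRC this boils down to something like the commands below; the exact service names are assumptions depending on the ebuilds used:

```shell
rc-service mimir start
rc-service loki start
rc-service telegraf start
rc-service agent start

# start the daemons automatically on boot
rc-update add mimir default
rc-update add loki default
rc-update add telegraf default
rc-update add agent default
```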
Grafana configuration #
If the daemons are running and the logs show normal operation, metric collection should be up and data should flow in. Now is the time to start up Grafana and access this data. Adapt grafana.ini according to your needs and setup. The next step is to log in to Grafana using an admin account to configure the Mimir and Loki data sources:
Mimir data source #
Go to Connections / Data Sources and add a Prometheus data source using the following settings:
Enter a data source name of your choice and enter the connection URL according to your Mimir configuration (parameters http_listen_address and http_listen_port).
Mimir is running without authentication and in multi-tenancy mode. Therefore the connection must be set to No Authentication as well, and the HTTP header X-Scope-OrgID must be set to the same value used in the Agent configuration.
In the Performance section the Prometheus type must be set to Mimir and the Mimir version to a suitable value.
Loki data source #
Go to Connections / Data Sources and add a Loki data source using the following settings (quite similar as for Mimir):
Enter a data source name of your choice and enter the connection URL according to your Loki configuration (parameters http_listen_address and http_listen_port).
Loki is running without authentication and in multi-tenancy mode. Therefore the connection must be set to No Authentication as well, and the HTTP header X-Scope-OrgID must be set to the same value used in the Agent configuration.
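With the data source in place, Grafana’s Explore view can be used to verify that logs are flowing in. A simple LogQL query could look like the following; the label names are assumptions and depend on the labels set in the Grafana Agent configuration:

```logql
{job="syslog"} |= "error"
```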
Example dashboard #
The following steps demonstrate how to set up a new dashboard based on the just created Mimir data source. The dashboard shows the system uptime based on the corresponding metric gathered by Telegraf. The dashboard also defines a host variable to display/filter the uptime for hosts in the network.
- Creating a new dashboard
To create a new dashboard navigate to the Dashboards section in Grafana and select New dashboard.
- Defining a variable for host selection
Dashboard variables are a powerful feature to make dashboards dynamic. The variable values are derived from the collected data.
To add a new variable select the Dashboard settings in the newly created dashboard, navigate to the Variables section and select New variable.
The host variable is of type Query as it is dynamic. The value entered in Name is later used to reference the variable in queries. The value entered in Label is used by Grafana to display the variable on the dashboard.
The Query options define the actual variable values. The Data source is the newly created Mimir data source. By setting Query type to Label values, Label to host and Metric to system_uptime, the variable will be backed by all the host values collected as part of the system_uptime metric. The remaining options define how the variable is displayed and handled within the dashboard.
- Visualizing the system uptime
Now the actual visualization can be added to the dashboard.
Like for the variable, the Data source is set to our newly created Mimir data source. The visualization type is set to Stat to display a single value. By setting the Repeat by variable option to host we instruct Grafana to display the visualization for all known hosts.
The Metric is set to system_uptime and by setting Label filters to host = $host the visualization displays the uptime of a single host.
Additional visualization settings (not shown on the picture) are:
- Value options / Calculation is set to Last* to show the most current uptime.
- Standard options / Unit is set to seconds (s) as the uptime is reported in seconds.
- Thresholds is set to Yellow as default and Green for values above 3600 seconds (1 hour) to indicate recently restarted hosts.
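For reference, the variable and panel setup described above correspond to the following queries (assuming Telegraf’s default metric and label names):

```promql
# Variable query (Grafana "Label values" query type):
label_values(system_uptime, host)

# Stat panel query, repeated once per value of $host:
system_uptime{host="$host"}
```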
Note: When the All option is selected as the host filter, the preview shown while editing the visualization may stay empty. To work around this issue, the host filter can be set to the first value in the list while editing and set back to All once the visualization is finalised.