
Grafana and friends for monitoring


As today’s homes tend to house fully fledged networks with all kinds of network and smart home devices, most of them connected to the internet, proper monitoring makes sense. Luckily, the necessary tools are freely available these days, at a thoroughly professional level.

Except for the lack of redundancy, the setup described below is at enterprise level and can be considered excessive for the usual home network. However, the well-known phrase “because it’s there” applies here as well.

Setup overview
#

After several evolutionary steps over the last decade, I ended up with a monitoring setup mainly based on the Grafana tool stack. While mainly known for its top-notch, eponymous visualization tool, Grafana Labs provides a tool stack covering all aspects of observability:

  • Grafana Mimir as the Time Series Database (TSDB) for collecting and querying our metrics. Alternative choices for a TSDB are Prometheus and InfluxDB. As the latter two lacked either proper scalability or a clear development path, I decided to go for Mimir.

  • Grafana Loki as a log aggregation system. Via Loki I am able to collect and query the syslog messages of the different hosts in my network.

  • Grafana Agent as the general data collector, fetching the input data from various Prometheus-style metric sources as well as log sources and pushing it into Mimir/Loki.

  • Grafana as the metric visualization and alerting tool. Grafana connects to Mimir and Loki to query the collected metrics and log data.

The only tool still missing is a data agent that gathers all kinds of data (CPU load, network traffic, …) and provides it as metrics fetchable by Grafana Agent.

  • Telegraf is my choice of tool for this task. It provides a vast number of input plugins out of the box and is easily extensible.

In a picture:

flowchart LR
    a[Telegraf] --> b[Agent] --> c[Mimir] --> d[Grafana]
    e[Syslog] --> b --> f[Loki] --> d
    subgraph Inputs
        a
        e
    end
    subgraph Collect
        b
    end
    subgraph Store
        c
        f
    end
    subgraph Visualize
        d
    end

I am running this setup on Gentoo Linux. The necessary ebuilds to install the tools are available via the main portage tree or via my custom overlay.

Grafana Mimir setup
#

Configuration file: /etc/mimir/mimir-local-config.yaml

server:
  log_level: warn
  http_listen_address: localhost
  http_listen_port: 9009

common:
  storage:
    backend: s3
    s3:
      endpoint: <MinIO/S3 server>
      access_key_id: "<MinIO/S3 access key id>"
      secret_access_key: "<MinIO/S3 access key>"

blocks_storage:
  tsdb:
    dir: /var/lib/mimir/tsdb
  bucket_store:
    sync_dir: /var/lib/mimir/tsdb-sync
  backend: s3
  s3:
    bucket_name: mimir-blocks
#  backend: filesystem
#  filesystem:
#    dir: /var/lib/mimir/data/tsdb

ruler_storage:
  backend: s3
  s3:
    bucket_name: mimir-ruler
#  backend: local
#  local:
#    directory: /var/lib/mimir/rules

distributor:
  ring:
    kvstore:
      store: inmemory
  pool:
    health_check_ingesters: true

ingester:
  ring:
    min_ready_duration: 0s
    final_sleep: 0s
    num_tokens: 512
    kvstore:
      store: inmemory
    replication_factor: 1

compactor:
  data_dir: /var/lib/mimir/compactor

limits:
  compactor_blocks_retention_period: 1y

usage_stats:
  enabled: false
Depending on your storage choice, either fill in the MinIO/S3 endpoint and credentials under common.storage.s3, or comment out the S3-related parts and use the filesystem and local backends shown in the commented alternatives under blocks_storage and ruler_storage.
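Once Mimir is up, a quick sanity check can be done from the shell. The snippet below is a sketch assuming the listen address and port configured above and the tenant id home used later in the Agent configuration.

# Readiness probe (HTTP 200 once Mimir is ready).
curl -s http://localhost:9009/ready

# Mimir serves its Prometheus-compatible query API under /prometheus;
# this only lists metric names once Grafana Agent is pushing data.
curl -s -H 'X-Scope-OrgID: home' \
  'http://localhost:9009/prometheus/api/v1/label/__name__/values'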

Grafana Loki setup
#

Configuration file: /etc/loki/loki-local-config.yaml

server:
  log_level: warn
  http_listen_address: localhost
  http_listen_port: 3100
  grpc_listen_port: 0

common:
  path_prefix: /var/lib/loki
  storage:
    s3:
      endpoint: <MinIO/S3 server>
      access_key_id: <MinIO/S3 access key id>
      secret_access_key: <MinIO/S3 access key>
      bucketnames: loki-data
      s3forcepathstyle: true
#    filesystem:
#      chunks_directory: /var/lib/loki/chunks
#      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: s3
#      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

ruler:
  storage:
    s3:
      bucketnames: loki-ruler

frontend:
  encoding: protobuf

analytics:
  reporting_enabled: false
As for Mimir, either fill in the MinIO/S3 endpoint and credentials under common.storage.s3, or comment out the S3-related parts and use the filesystem backend shown in the commented alternative.
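Analogous to Mimir, Loki can be probed once it is running. A minimal sketch, again assuming the listen address and port above and the tenant id home:

# Readiness probe (HTTP 200 once Loki is ready).
curl -s http://localhost:3100/ready

# List the known log labels for the tenant; empty until Grafana Agent ships logs.
curl -s -H 'X-Scope-OrgID: home' http://localhost:3100/loki/api/v1/labels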

Grafana Agent setup
#

Configuration file: /etc/agent/agent-local-config.yaml

server:
  log_level: warn

metrics:
  global:
    scrape_interval: 1m
  wal_directory: /var/lib/agent/wal
  configs:
    - name: metrics
      scrape_configs:
        - job_name: telegraf
          static_configs:
            - targets: ['localhost:9273']
#        - job_name: traefik
#          static_configs:
#            - targets: ['localhost:8080']
#        - job_name: podman
#          static_configs:
#            - targets: ['localhost:9882']
#        - job_name: minio
#          bearer_token: <MinIO bearer token>
#          metrics_path: /minio/v2/metrics/cluster
#          scheme: https
#          static_configs:
#            - targets: ['<MinIO server>']
      remote_write:
        - url: http://localhost:9009/api/v1/push
          headers:
            X-Scope-OrgID: home

logs:
  configs:
  - name: logs
    positions:
      filename: /var/lib/agent/positions.yaml
    scrape_configs:
    - job_name: syslog
      syslog:
        listen_address: localhost:1514
        idle_timeout: 60s
        labels:
          job: "syslog"
      relabel_configs:
      - source_labels: ['__syslog_message_hostname']
        target_label: 'host'
      - source_labels: ['__syslog_message_severity']
        target_label: 'severity'
      - source_labels: ['__syslog_message_app_name']
        target_label: 'app'
    clients:
      - url: http://localhost:3100/loki/api/v1/push
        tenant_id: home
Grafana Agent works in a scrape-and-forward manner.

The metrics section defines which Prometheus-style metric sources should be scraped (scrape_configs) and where they are forwarded to (remote_write, pointing at Mimir). In the config above only the Telegraf source is active. If you have other sources running, add them like the examples shown in the comments.

The logs section defines the log sources to be processed (scrape_configs) and where they are forwarded to (clients, pointing at Loki). The configuration above makes Grafana Agent listen for syslog messages on port 1514. To receive anything on this port, the syslog daemon must be configured to forward its messages there. For syslog-ng this can be achieved by adding the following lines to syslog-ng.conf.

...
destination d_agent { syslog("localhost" port(1514) transport(tcp)); };
...
log { source(src); destination(d_agent); };
...
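After reloading syslog-ng, a quick way to test the forwarding chain is to emit a message via the local syslog daemon and later look for it in Loki (e.g. with the query {job="syslog"} in Grafana). The invocation below is a sketch; the tag is arbitrary.

# Send a test message through the local syslog daemon; it should be forwarded
# to Grafana Agent on port 1514 and end up in Loki.
logger -t monitoring-test "hello from $(hostname)"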

Note: Unless multitenancy_enabled: false is explicitly set in the Mimir configuration, Mimir runs in multi-tenancy mode and forwarded data must be assigned to a tenant. This is done via the X-Scope-OrgID: home header in the metrics section and the tenant_id: home setting in the logs section. The tenant id chosen here must also be set when connecting Grafana to Mimir (see Grafana configuration).

Telegraf setup
#

Configuration file: /etc/telegraf/telegraf.conf

[global_tags]
  # Using defaults
[agent]
  # Using defaults
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "0s"
  hostname = ""
  omit_hostname = false
[[outputs.prometheus_client]]
  # Enable Prometheus endpoint
  # Using defaults
  listen = ":9273"
  export_timestamp = true
[[inputs.cpu]]
  # Collect CPU load
  # Using defaults
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
  core_tags = false
[[inputs.disk]]
  # Collect Disk utilization
  # Restrict to a well-known static set
  mount_points = ["/", "/boot"]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
[[inputs.kernel]]
  # Collect Kernel metrics
[[inputs.mem]]
  # Collect Memory metrics
[[inputs.processes]]
  # Collect Process metrics
[[inputs.swap]]
  # Collect Swap metrics
[[inputs.system]]
  # Collect System metrics
[[inputs.net]]
  # Collect Network metrics
  # Restrict to well-known interfaces
  interfaces = ["eth0"]
  ignore_protocol_stats = true
[[inputs.sensors]]
  # Collect HW sensor metrics
The Telegraf configuration is initially overwhelming due to its large number of options, plugins and comments. Filtering out everything not needed leads to a surprisingly focused configuration.

The noteworthy sections are:

  • The output plugin is set to outputs.prometheus_client; its listen parameter must match the targets parameter in the Grafana Agent configuration (a quick check of this endpoint is shown after the list).
  • The remaining sections simply add the input plugins of your choice.
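To verify that Telegraf actually exposes its metrics on the endpoint scraped by Grafana Agent, the Prometheus endpoint can be fetched directly. A small sketch assuming the listen address above:

# The prometheus_client output serves the collected metrics under /metrics.
curl -s http://localhost:9273/metrics | head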

Starting the daemons
#

Now it’s time to start the daemons and check the logs to verify they are up and running. Any additional dependencies (e.g. MinIO for storage) should already be up and running. The starting sequence must be as follows (a sketch of the corresponding service commands follows the list):

  • Mimir: Should initialize and then wait for connections.
  • Loki: Should initialize and then wait for connections.
  • Telegraf: Should start collecting and then wait for connections.
  • Agent: Should start scraping data from Telegraf and the syslog source and forwarding it to Mimir and Loki.
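On my Gentoo systems the daemons are started via their init scripts. The commands below are a sketch; the service names are assumptions and may differ depending on how the tools were installed.

# Start the daemons in the order listed above (service names are assumptions).
rc-service mimir start
rc-service loki start
rc-service telegraf start
rc-service grafana-agent start

# Add them to the default runlevel so they come back after a reboot.
rc-update add mimir default
rc-update add loki default
rc-update add telegraf default
rc-update add grafana-agent default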

Grafana configuration
#

If the daemons are running and the logs show normal operation, metric collection should be running and data should flow in. Now is the time to start up Grafana and access this data. Adapt grafana.ini according to your needs and setup. The next step is to log into Grafana with an admin account and configure the Mimir and Loki data sources:

Mimir data source
#

Go to Connections / Data Sources and add a Prometheus data source using the following settings:

Data source name and connection URL
Enter a data source name of your choice and set the connection URL according to your Mimir configuration (parameters http_listen_address and http_listen_port).
Authentication and HTTP headers
Mimir is running without authentication and in multi-tenancy mode. Therefore the connection must be set to No Authentication and the HTTP header X-Scope-OrgID must be set to the same value used in the Agent configuration.
Mimir version
In the Performance section the Prometheus type must be set to Mimir and the Mimir version to a suitable value.
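Instead of clicking through the UI, the data source can also be created via Grafana’s HTTP API. The following call is a sketch: the data source name, the admin credentials and the Grafana URL are assumptions to be adapted to your setup; the tenant header is passed via the custom header fields in jsonData/secureJsonData.

# Create the Mimir data source via the Grafana HTTP API (adapt credentials/URLs).
# The URL includes Mimir's /prometheus API prefix.
curl -s -u admin:admin -H 'Content-Type: application/json' \
  http://localhost:3000/api/datasources -d '{
    "name": "Mimir",
    "type": "prometheus",
    "access": "proxy",
    "url": "http://localhost:9009/prometheus",
    "jsonData": {
      "httpHeaderName1": "X-Scope-OrgID",
      "prometheusType": "Mimir"
    },
    "secureJsonData": {
      "httpHeaderValue1": "home"
    }
  }'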

Loki data source
#

Go to Connections / Data Sources and add a Loki data source using the following settings (quite similar to Mimir):

Data source name and connection URL
Enter a data source name of your choice and set the connection URL according to your Loki configuration (parameters http_listen_address and http_listen_port).
Authentication and HTTP headers
Loki is running without authentication and in multi-tenancy mode. Therefore the connection must be set to No Authentication and the HTTP header X-Scope-OrgID must be set to the same value used in the Agent configuration.
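The Loki data source can be created the same way via the Grafana HTTP API; only the type and the URL differ (again a sketch with assumed names and credentials).

# Create the Loki data source via the Grafana HTTP API.
curl -s -u admin:admin -H 'Content-Type: application/json' \
  http://localhost:3000/api/datasources -d '{
    "name": "Loki",
    "type": "loki",
    "access": "proxy",
    "url": "http://localhost:3100",
    "jsonData": { "httpHeaderName1": "X-Scope-OrgID" },
    "secureJsonData": { "httpHeaderValue1": "home" }
  }'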

Example dashboard
#

The following steps demonstrate how to set up a new dashboard based on the just-created Mimir data source. The dashboard shows the system uptime based on the corresponding metric gathered by Telegraf. The dashboard also defines a host variable to display and filter the uptime for the hosts in the network.

Example dashboard

  1. Creating a new dashboard

To create a new dashboard navigate to the Dashboards section in Grafana and select New dashboard.

New Dashboard

  2. Defining a variable for host selection

Dashboard variables are a powerful feature to make dashboards dynamic. The variable values are derived from the collected data.

Dashboard settings
To add a new variable select the Dashboard settings in the newly created dashboard, navigate to the Variables section and select New variable.
Dashboard variable type and name
The host variable is of type Query as it is dynamic. The value entered in Name is later used to reference the variable in queries. The value entered in Label is used by Grafana to display the variable on the dashboard.
Dashboard variable values
The Query options define the actual variable values. The Data source is the newly created Mimir data source. By setting Query type to Label values, Label to host and Metric to system_uptime, the variable is backed by all host values collected as part of the system_uptime metric. The remaining options define how the variable is displayed and handled within the dashboard.
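The values the variable will offer can also be checked outside of Grafana by asking Mimir directly for the host label values of the system_uptime metric. A small sketch, assuming the Mimir endpoint and tenant id from the earlier sections:

# List all values of the "host" label seen on the system_uptime metric.
curl -s -G -H 'X-Scope-OrgID: home' \
  --data-urlencode 'match[]=system_uptime' \
  http://localhost:9009/prometheus/api/v1/label/host/values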

  3. Visualizing the system uptime

Now the actual visualization can be added to the dashboard.

Dashboard variable values
Like for the variable, the Data source is set to our newly created Mimir data source. The visualization type is set to Stat to display a single value. By setting the Repeat by variable option to host we instruct Grafana to display the visualization for all known hosts. The Metric is set to system_uptime and by setting Label filters to host = $host the visualization displays the uptime of a single host. Additional visualization settings (not shown in the picture) are listed below; a quick query check against Mimir follows the list.

  • Value options / Calculation is set to Last* to show the most current uptime.
  • Standard options / Unit is set to seconds (s) as the uptime is reported in seconds.
  • Thresholds is set to Yellow as default and Green for values above 3600 seconds (1 hour) to make recently restarted hosts stand out.
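The query behind the Stat panel can be sanity-checked against Mimir as well. In the sketch below, myhost is a placeholder for one of the hosts reported by Telegraf:

# Instant query for the uptime of a single host (replace "myhost").
curl -s -G -H 'X-Scope-OrgID: home' \
  --data-urlencode 'query=system_uptime{host="myhost"}' \
  http://localhost:9009/prometheus/api/v1/query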

Note: When the All option is selected as the host filter, the preview shown while editing the visualization may stay empty. To work around this issue, the host filter can be set to the first value in the list while editing and switched back to All once the visualization is finalized.