Monitor systems and Docker deployments
Even when locked down, Mihalis Tsoukalos can still keep a close eye on all of his Linux systems and Docker images with Netdata.
Welcome to Netdata, software for distributed real-time performance and health monitoring of UNIX machines. Don’t you dare turn that page! A key advantage of Netdata is that it collects all of its metrics without introducing too much load onto the Linux machine that it runs on. In fact, most of the time you’ll forget that Netdata is running on a Linux machine – it’s only after you look at its impressive visualisations that you’ll remember the software is collecting, processing and visualising all these metrics! Another Netdata advantage is that it carries out real-time monitoring, so you can see what’s happening on the Linux machine at that moment.
Install Netdata using your package manager. On a Debian or Ubuntu Linux system you can install Netdata by executing apt install netdata . The configuration directory of Netdata is /etc/netdata, but you’ll also find plenty of useful files inside /usr/lib/netdata. By default, Netdata listens on http://localhost:19999, which is the first thing that you should try, to ensure that the Netdata installation was successful.
The screenshot (right) shows the initial default screen, which is full of information. If you’re running Netdata on your own Linux machine, take the time to look at the Netdata visualisations. If you want to make Netdata available over your local network or the Internet, then you should change the value of “bind socket to IP” in the /etc/netdata/netdata.conf configuration file to the external IP of the machine and restart Netdata by executing systemctl restart netdata for the change to take effect. You can identify the version of Netdata you’re using by looking at the lower right-hand corner of the Netdata screen. We’re using Netdata version 1.12.0 in this tutorial.
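As a sketch, the relevant part of /etc/netdata/netdata.conf might look like the following. The IP address shown is a made-up example – substitute your machine’s external IP, and note that the section name and exact option spelling can differ between Netdata releases, so check the comments inside the file itself:

```ini
# /etc/netdata/netdata.conf (illustrative excerpt)
[web]
    # replace 192.168.1.10 with the external IP of your machine;
    # older releases may keep this option in a different section
    bind socket to IP = 192.168.1.10
```

After saving the change, run systemctl restart netdata as described above so it takes effect.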
Measure the metrics
Netdata can collect and display a plethora of metrics – more than 1,000. The list of metrics includes those on the entire system, the CPUs, the memory, the disks, TCP/IP networking, Systemd, specific applications, users, the firewall and running containers, as well as the operation of Netdata itself.
The good news is that if you want to monitor something that’s not directly supported by Netdata, you can create your own metric collector using the Netdata plugin API. You can find out more about that capability at https://github.com/netdata/netdata/tree/master/collectors/plugins.d. The plugin API won’t be discussed in much detail in this tutorial.
Finally, bear in mind that not every metric will help you solve a specific performance issue you might have. You’ll need to understand the metrics and select the ones that are related to your situation and observe them, before trying to troubleshoot your machines.
What you already know is that Netdata reads metrics in real time from the machine that it’s running on and automatically creates visualisations with that data. Logically speaking, Netdata can be divided into four components. The first component is the metrics collector, whereas the second one is an in-memory time series database that stores the metrics.
Note that the metrics aren’t written to disk, which means they’ll be lost if you reboot your machine or restart Netdata. However, using computer memory speeds things up. The third component is the metrics visualiser and the final component is the alarms notification engine. All these components, when combined, make up Netdata.
Netdata keeps its metrics in memory using a time series database. However, it also stores data related to the health of the system at /var/lib/netdata/health/health-log.db. The operation log files of Netdata are kept in /var/log/netdata. The kind of entry that you’ll find in /var/log/netdata/access.log is the following:
2020-01-11 19:47:19: 42: 1556 '[2.86.21.11]:49813' 'DATA' (sent/all = 744/1501 bytes -50%, prep/sent/total = 0.14/0.27/0.42 ms) 200 '/api/v1/alarms?active&_=1578764050928'
Netdata keeps its web and visualisation files inside /usr/share/netdata/web. This is what you see when you connect to the Netdata web interface.
Let’s delve into the Netdata UI and its various options. The screenshot (right) shows another output from the Netdata web interface. The Netdata screen is divided into three parts: the top menu bar, the left column and the right column. The top menu bar contains options related to Alarms, Settings and the export functionalities of Netdata.
The left column, which is the main area of the Netdata UI, contains the visualisations. You can zoom in and out of each visualisation to obtain more detailed output, using your mouse wheel while pressing Shift. Additionally, if you hover your mouse over a graph point, you’ll be shown the actual value of the metric that’s being visualised. Moreover, you can select an area in a visualisation by clicking with your mouse while pressing Shift. Finally, the right column shows the available sets of metrics: the elements of the active set of metrics are expanded so that you can select what you want.
Health monitoring
Health monitoring is implemented using the alarm and notification systems of Netdata. The first thing you should do is execute /etc/netdata/edit-config health_alarm_notify.conf as root, which will set up the notification system. This command will create or make changes to the /etc/netdata/health_alarm_notify.conf file. Note that you can test the notification system by executing /usr/lib/netdata/plugins.d/alarm-notify.sh test with root privileges.
Let’s create a new notification related to the CPU usage of the current Linux machine. The value of CPU utilisation will be set pretty low, to make testing easier. On a production system, that value might vary. For reasons of simplicity, we’re going to use one of the Netdata preconfigured alarms. Go to the Netdata UI, press the Alarms button in the top menu and select the All tab. The source row for the “system.cpu” alarm shows the value /usr/lib/netdata/conf.d/health.d/cpu.conf.
Now execute /etc/netdata/edit-config health.d/cpu.conf with root privileges to edit the file that defines that alarm. In the 10min_cpu_usage template, change the warn and crit lines as follows:
warn: $this > (($status >= $WARNING) ? (25) : (35))
crit: $this > (($status == $CRITICAL) ? (45) : (45))
Save the configuration file, and restart Netdata by executing systemctl restart netdata . The changes you made should be visible in the Alarms section of the Netdata UI.
To intentionally increase the CPU utilisation of the Linux machine you can compile your Linux kernel, run some heavy Docker images or write a C program that creates a large number of threads. You’re free to try other things – just make sure you don’t experiment on a production system.
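If you want something quicker than a kernel compile, a minimal shell sketch like the following will briefly saturate every core. The duration and the busy-loop approach are illustrative choices, not from the article:

```shell
# Spawn one busy loop per CPU core for a couple of seconds,
# then wait for all of them to be killed by timeout.
DURATION=2
for _ in $(seq "$(nproc)"); do
  timeout "$DURATION" sh -c 'while :; do :; done' &
done
wait
echo "load burst finished"
```

Run it on a test machine and watch the system.cpu chart in the Netdata UI spike while the loops are running.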
After you increase the CPU load on your Linux machine, you’ll see a new alarm in the Alarms Active tab. You’ll also be informed about alarms by Netdata badges. Note that the Netdata Health system, when configured, can send notifications to Slack channels or via email, which is a handy feature. Finally, bear in mind that setting the right values in an alert might require some experimentation.
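As an illustrative sketch, enabling email and Slack notifications in /etc/netdata/health_alarm_notify.conf involves variables along the following lines. The webhook URL, email address and channel name are placeholders – check the comments inside the file for the exact variable names your Netdata version uses:

```ini
# /etc/netdata/health_alarm_notify.conf (illustrative excerpt)
SEND_EMAIL="YES"
DEFAULT_RECIPIENT_EMAIL="admin@example.com"

SEND_SLACK="YES"
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/XXXX/YYYY/ZZZZ"
DEFAULT_RECIPIENT_SLACK="alarms"
```

After editing, the /usr/lib/netdata/plugins.d/alarm-notify.sh test command mentioned earlier is the quickest way to confirm the notifications actually arrive.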
IPv4 traffic
Let’s discuss the IPv4 networking-related metrics captured by Netdata. The available subsections for IPv4 networking are sockets, packets, errors, icmp, tcp and udp. The screenshot (page 72, left-hand side) shows part of the Netdata visualisations related to network traffic, displaying the sockets, packets and errors visualisations. The good thing is that there are no errors.
You can also find sections on the networking stack, IPv6 networking, network interfaces and the firewall.
Apache metrics
Netdata can help you to monitor an Apache web server. The main sections of the Apache-related visualisations are requests, connections, bandwidth, workers and statistics. The screenshot (this page, bottom right) shows the Netdata data related to the Apache web server. Note that during the monitoring period, the ab utility was used for generating traffic by sending requests to the Apache web server of the local machine. Netdata can also monitor the Nginx web server.
Docker images
Finally let’s learn how to monitor Docker images using Netdata. Although you might think that this is going to be a difficult task, the use of Docker images simplifies things. The technique will be illustrated in a docker-compose.yml file. The part of the docker-compose.yml file related to Netdata is the following:
netdata:
  container_name: netdata
  image: netdata/netdata
  hostname: a_host_name.com
  ports:
    - 19999:19999
  networks:
    - linuxformat
  cap_add:
    - SYS_PTRACE
  security_opt:
    - apparmor:unconfined
  volumes:
    - /etc/passwd:/host/etc/passwd:ro
    - /etc/group:/host/etc/group:ro
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /var/run/docker.sock:/var/run/docker.sock:ro
  environment:
    - PGID=998
elasticsearch:
  ...
kafka:
  ...
Let’s look at the contents of docker-compose.yml.
There are three images in there: kafka, elasticsearch and netdata. The first two are the images that we want to monitor using Netdata, whereas the third image is Netdata itself – the presented setup also enables you to monitor the Netdata Docker image. The docker-compose.yml file wraps all Docker images and enables the Netdata container to communicate with and obtain metrics from the other two. The reason for using the PGID variable, which is the group ID of the UNIX group assigned to the Docker image, is for Netdata to be able to resolve the container names and display them in its web interface – you need this when monitoring multiple Docker images. You’ll most likely need to find the value of PGID on your own, and the easiest way to do that is by executing the grep docker /etc/group | cut -d ':' -f 3 command on the Linux machine where you’re about to run the presented docker-compose.yml configuration file.
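For illustration, here’s how that pipeline behaves against a made-up /etc/group excerpt – the sample file path, the user name and the GID 998 are all hypothetical:

```shell
# Create a sample group file and extract the docker group's GID from it,
# exactly as the grep | cut pipeline does against the real /etc/group.
printf 'root:x:0:\ndocker:x:998:mihalis\n' > /tmp/group.sample
grep docker /tmp/group.sample | cut -d ':' -f 3   # prints 998
```

On your own machine you would run the same pipeline against /etc/group and put the resulting number into the PGID line of docker-compose.yml.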
We’ll now concentrate on the Netdata block. You can find a working docker-compose.yml file in the Linux Format archives (www.linuxformat.com/archives).
Apart from the expected parts in the Netdata block, which are the Docker image that’s going to be used, the port that’s going to be exposed to the outside world and the name of the container, there are some important definitions. The most important block is the volumes block, where you make system information and data available to the Netdata Docker image. This part of docker-compose.yml gives the Netdata container read-only access to host OS information through the /proc and /sys folders as well as the /etc/passwd and /etc/group system files. The SYS_PTRACE option on cap_add starts the container with ptrace capabilities. Finally, the apparmor:unconfined option starts the container without an AppArmor profile. AppArmor is a Linux security module that protects the operating system. You can learn more about AppArmor at https://en.wikipedia.org/wiki/AppArmor.
Note that if you’re already running Netdata on your Linux machine as a regular server process or in another container, you’ll need to replace the 19999:19999 line with something like 20000:19999, because port number 19999 will already be in use on your local Linux machine, which will cause the docker-compose up command to fail. Note that the first port number is the external one, which should be unique on the entire Linux machine, whereas the second port number is the internal one, which only needs to be unique inside the running Docker image. Because Netdata always listens on port number 19999 inside the container, if you change the second port number you won’t be able to communicate with the Netdata process.
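So if port 19999 is already taken on the host, the ports entry of the Netdata block would become something like the following – 20000 is just an example of a free host port:

```yaml
    ports:
      - 20000:19999   # host port 20000 -> the container's fixed Netdata port 19999
```

You would then browse to http://localhost:20000 instead of the default address to reach the containerised Netdata.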
You can learn more about using Netdata for monitoring Docker images at https://docs.netdata.cloud/packaging/docker.
Netdata output
You’ll need to visit http://localhost:19999 to see the data collected by Netdata. The important thing is that you can select the Docker image that interests you and find out more about its performance. The screenshot (right) shows more on this. All this data and visualisations will help you understand the performance level and the bottlenecks of your running container, which will enable you to change its running parameters to improve its performance, especially when the container is used in a production environment.
Next, we use a different docker-compose.yml file with Netdata. That new docker-compose.yml file uses Elasticsearch, Kibana and Logstash as well as a Kafka server. The output reveals how each running container performs in relation to the overall system performance. The Kafka container requires more CPU, whereas the Netdata container doesn’t require many system resources. Finally, the Elasticsearch container performs a large amount of disk writing, which makes sense because Elasticsearch stores its data on disk.
Onboard containers
Finally let’s see how to manually install Netdata on a running Docker image that’s using Debian Linux. You might need to do that in case you have a running container that you can’t restart, but you want to check its performance. First, you should download the Debian Docker image by executing docker pull debian:latest . Then you’ll need to execute the following commands:
# docker run -it --name=debian debian:latest bash
# apt update
# apt install curl
# bash <(curl -Ss https://my-netdata.io/kickstart.sh)
The first command executes the Debian Linux Docker image and gives you a bash shell with root privileges in the container. The second and third commands update the package lists and install curl, respectively. The fourth command downloads a bash script offered by Netdata that automates the entire installation – this also includes the installation of quite a few Debian packages in multiple stages. Note that the script compiles Netdata from source code and installs the latest Netdata release. After a successful installation, you’ll get an update script located at /usr/libexec/netdata/netdata-updater.sh and an uninstall script located at /usr/libexec/netdata/netdata-uninstaller.sh. You’ll also get some handy instructions that will help you run Netdata on this Debian machine.
Performance monitoring, metrics and visualisation are difficult subjects because there isn’t a single and deterministic technique that can help you solve every bottleneck or problem. The most important thing is understanding the meaning and the importance of the metrics you’re using. Netdata is here to help you but you’ll need to spend some time with it and get used to the data and the visualisations that it offers, in order to be able to use it productively on your test machines or on production machines.