Anyone constructing a system that operates a money making - or more to the point, potentially losing - business needs to accept that it can go wrong. Line-by-line while writing the code, the programmer must logically account for what occurs in every branch. That’s where decisions are made. Decisions need to be recorded.
It Turns Out Zero Is Not Absolute
Is there an assumption that a result is positive? What if it’s negative? This happened in the past several years in much banking code when interest rates went below zero which had been assumed impossible. Code crashed or produced misleading results.
The moral of the story is that every possible outcome should be explicitly acted upon, though this can be tedious. At the least, there should be a log. What’s a log?
The term comes from shipping in the 15th century where to estimate speed an actual log of wood would be thrown over the side attached to a rope. It was knotted at regular intervals (this is where the term “knots” for speed at sea comes from). It was necessary to note the times at which those knots passed behind the ship. The notes were themselves termed “logs”.
Such a log would, at a minimum, record the time. This is true of today’s computer system logs. They typically consist of a time, a status and a message:
2023-02-27T12:01:16.217 CRITICAL markets Couldn't create container: not enough memory
Earlier in computer system history, employees would take turns watching log-files being printed out live. The role was sometimes colloquially termed the “watchdog”.
These days there are too many systems and associated logs to physically watch, but logs remain vital. When there is a system failure they are searched for indications of the cause.
However, logs can also be watched programmatically and used for alerting. The subject of this blog is setting up a simple system for doing that.
Swatchdog
There are many log alerting systems on the market. The best known is probably Datadog. There’s also Logtail, Papertrail, Splunk, Logstash and others. These are well put together products with a host of great features, such as excellent UIs, sophisticated live searching via web interfaces and sometimes query languages and alerting. They require various levels of installation and they have costs, either through volume-based tiered systems or monthly payments. For a bootstrapped business, this can be problematic, for instance when a surge of logs - indicating a possible important problem that needs to be solved - pushes volume on to another tier. Should the “log ransom” be paid?
Instead, I recalled from earlier times surely the simplest log watcher: Swatchdog. It is rather venerable software. Its file history from its source download shows dates in 2015, but it was written much earlier - 1993 according to the author Todd Atkins.
To install it, do
sudo apt install swatch
(It used to be called “Swatch” however a watch manufacturer objected!)
It is difficult to find documentation online. Instead - it’s right there in the implementation! You simply use the Linux pod2text
command:
pod2text `which swatchdog`|less
Alternatively, generate an HTML file with
pod2html `which swatchdog` > swatchdog.html
Its language is Perl, which has fallen out of favour with the rise of Python. Python is much easier to write - and read. Further, the bash shell
language encompasses most of Perl’s functionality now. However, in the days swatchdog
was written, Perl was the default language for system administrators. It had a large active and enthusiastic user and developer base. They produced many packages that were distributed via CPAN. CPAN has “…has 213,554 Perl modules in 44,051 distributions, written by 14,345 authors”. In fact some large systems were written in Perl.
Perl has significant features that make it well suited for some tasks, particularly in the system administration space. It is well-known for its succinctness (though this is one cause of the difficulty to read its code). It has a very expressive regular expression language. These two aspects make it work well for automated log-watching.
How it works
swatchdog
runs a process for each file it watches. In its simplest form it uses a regular expression that each log-line matches against and a command that runs in the case of a positive match.
For our system ProfitView, most code is in Python in which we use the logging
package. Therefore we use very simple Perl regular expressions that match these types of logs. In practice, in production, we look for instances of the term CRITICAL
. In development and pre-prod we use the system selectively in more extensive ways. We are likely to use it in production more extensively as the system evolves.
Such “critical” lines will be logged in circumstances in which our programmers determine that a code path being hit indicates a serious system problem that needs to rectified or at least examined in the short term. For example:
logging.critical(f"Couldn't create {resource}: not enough memory")
might create the log-line cited above.
Our implementation
We needed an operationally simple system that had no other dependence on business critical functions. Further, since we wanted it to be used across our estate, it needed to be light-weight in terms of system resources. swatchdog
itself certainly fits this bill - it is a small Perl program. We wanted to keep to this model for its integration into our own operation. Therefore we followed a “minimum necessary” rule in decisions on implementation.
Python Wrappers
swatchdog
is designed to be run from the command line, controlled by extensive options. Our usage for the medium term at least will be simple. We’re unlikely to need to be alerted to difficult to find identifiers. Therefore we won’t need to consider many of these options.
However, we do need to have our swatchdog
usage automated. We use systemd
. Therefore we needed to introduce wrapper scripting for both the start and stop of the installation.
We initially used bash
scripts for this purpose, however, mostly due to the need for multiple levels of special-character escaping, this proved to be confusing and thus error-prone. A simple Python script solved the problem. A single script was used with an option to choose to either start it or shut it down.
[Unit]
Description=Swatchdog Service
After=network.target
[Service]
Type=forking
ExecEnvironmentFile=/etc/swatchdog/environment
ExecStart=/usr/bin/env ${PYTHON_EXECUTABLE} /etc/swatchdog/swatchdog.py
ExecStop=/usr/bin/env ${PYTHON_EXECUTABLE} /etc/swatchdog/swatchdog.py -t
[Install]
WantedBy=multi-user.target
The script itself is also straightforward as can be seen here.
Configuration
Tailoring our swatchdog
implementation to your needs is - yes, simple. It relies on two parts: a controlling JSON file, specifying which log-files to watch and which “watchfor
” files to use for them.
There is also an environment file housed in /etc/swatchdog/env
. It’s formatted as required by systemd
’s EnvironmentFile
mechanism, but actually used by this implementation in three different ways! This coupling is not ideal, but expedient. The purpose of the file is to store local or (to some level) secret data. In our immediate case it stores our Slack API key and monitoring channel name and the path to the Python installation we use.
It’s worth clarifying how this file’s usage is implemented. It is a straightforward “tag=value” list - nothing more. However systemd
Unit files have some rules that slightly complicate the file’s use - see below.
Constrained by systemd
’s Environment system to use an EnvironmentFile
file, we nevertheless wanted to have only one place for this type of local data. Since the file’s format is genuinely about as simple as it is possible to be, this worked - though not quite trivially.
The data from this file is needed in two other places: the first is in the watchfor
files where the Python interpreter is used to send alerts to Slack. The second is in the script that does the Slack message send - it needs the Slack API key. So, Python needs to read the file and so does Perl! This is a little complicating, but in the end, not excessively so - and it works.
systemd
There’s endless criticism of systemd
in the Linux community and I expect it will be properly superseded in the next few years. I haven’t needed to learn much about it until now. It’s a curiously un-robust system. In particular in Exec
lines, it turns out that the first “word” is special (the “command”) and these can’t be substituted using Environment
mechanisms - but subsequent words in these lines can be. Using /usr/bin/env
is allowed, however, for the command.
ExecStart=/usr/bin/env ${PYTHON_EXECUTABLE} /etc/swatchdog/swatchdog.py
swatchdog
“watchfor
” files
The “watchfor
” swatchdog
configuration files are snippets of Perl that are pulled into the full script for the purpose of matching. They allow the use of perlcode
lines. These are lines of arbitrary perl that are executed “around” the matching line(s). They can be configured to run at various levels. We just need lines that read the env file and extract the text of the Python interpreter path. This is actually a small perl
script in itself - I’m mildly surprised to see that the Python version is shorter - one line (a big part of perl
’s appeal was always its terseness). It’s messy to have the whole script sitting in each watchfor
.cfg
file, so we place this in a perl
“module” .pm
file and import it with “use
”.
package Env;
use strict;
use warnings;
sub parameterName {
my ($file, $parameter) = @_;
my $value; open my $fh, '<', $file or die "Could not open file '$file': $!";
while (my $line = <$fh>) {
chomp $line;
if ($line =~ /^$parameter=(.*)$/) {
$value = $1;
last; # we found what we were looking for, no need to continue reading the file
}
}
close $fh;
return $value;
}
1;
and
perlcode use lib '/etc/swatchdog/watchfor';
perlcode use Env;
perlcode my $python_executable = Env::parameterName('/etc/swatchdog/env', 'PYTHON_EXECUTABLE');
watchfor /RegularExpressionToFind/
exec $python_executable /etc/swatchdog/swatchdog_slack.py test_server 'Message to send' "$0"
Hello My AI Friend
There’s a little story with this script. I spent about half a frustrating day endlessly searching Stackoverflow etc tweaking the script and retrying to get it working - to no avail.
This type of thing was always the problem with Perl. It is a prime reason why Python was such a revolution (the other was that it was object-oriented, which was a new concept). Back then, “Perl Monks” would get respect by being able to internalise its arcana (a little like “KDB/Q Gods” more recently). My personal experience was that, in contrast to Perl, Python just did what you asked it to, first time.
So I thought, who can I ask to help me? Perl is so old that what’s on the internet is dispersed and in confusingly threaded bug-reports. Who do I know who can still recall all this stuff?
So I asked ChatGPT
write a perl script that searches a file every line of which is of the form
<tag>=<value>
for a tag PYTHON_EXECUTABLE and writes the value of it into a variable $python_executable that can then be used in the rest of the script
I had a fully functioning result 30 seconds later…
I pasted into the watchfor
file (each line prefixed with perlcode
) - it immediatedly worked flawlessly. That really gave me pause. I asked a couple more questions to help put the code in a module and call it.
It’s a great example of the use of ChatGPT as a practical coding tool - to help with aspects of work where you can acknowledge your lack of expertise.
The Slack message script
As mentioned above, the Python to read the env file is a one-liner:
with open(sys.argv[Slack.ENVS]) as envs:
env = dict(v.strip().split('=') for v in envs.readlines())
Operation
swatchdog
use is initiated in the usual way with
sudo systemctl enable swatchdog
which adds a symlink at the appropriate location in the /etc/systemd/system
hierarchy. Therefore, on startup, the Python script is executed choosing the run_watchfors()
function (since there’s no options chosen).
The script works through the JSON file, finally accumulating the correct log file and “watchfor
” file to run. The watchfor
files are assumed to be in /etc/swatchdog/watchfor
, however this can be changed in the argument handling of swatchdog.py
.
{
"logfiles": [
{
"directory": "/home/directory/path",
"files": [
{
"name": "first_logfile.log",
"watchfor": "first_logfile_config.cfg"
},
...
]
},
...
]
}
It then executes swatchdog
as a script in the normal way, which is by use of subprocess.run()
. Thus, for each log file monitored there is one swatchdog executable instance. Within the swatchdog
Perl script there is also some process management resulting in two processes for each. Therefore in a system using an arrangement like ours that is of reasonable complexity there will be many swatchdog
initiated processes.
Normal Function
swatchdog
detects the addition of text to the file it monitors. It calls a script to get the Python executable to execute as described above. It then looks for lines in its associated regular expression “watchfor
” starting watchfor
. It then applies the Perl Regular Expression text on that line to the text added to the watched log file. If there’s a match, it executes the commands on the next lines
perlcode Use ....
watchfor / \[CRITICAL/
exec $PYTHON_EXECUTABLE /etc/swatchdog/swatchdog_slack.py server_name 'server [A CRITICAL error has occourred]' "$0"
In this example it looks for the string “CRITICAL” occurring after a space and left square-bracket character ([
). The regular expression is placed between two terminating characters, in this case slashes (/
) - though other characters could be used. The left square-bracket is significant to Perl regular expressions, therefore it must be escaped with a leading backslash.
In the case that there is a match, the “exec
” directive specifies that an external command is to be used. Alternatively, internal swatchdog
commands - like “mail
” to send emails - could be used. In the current case, the Python file to send Slack messages is called as described below.
The data is sent in four arguments as required in the Python file. The second-to-last of these uses a Perl variable $0
which contains the whole text that was added to the log file and detected by swatchdog
. If you wish to send an alternative location of the EnvironmentFile
that can be passed as the last parameter. It defaults to /etc/swatchdog/env
.
Shutdown
It is possible to put together an efficient and elegant process management system that will work for a swatchdog
installation like this one, probably using swatchdog
’s --pid-file
option which writes the process id to a file. However, that was not thought necessary in practice. Instead, for shutdown, the Python script is called again, this time with the -t
option. The Linux convenience script pgrep
is then used to gather all the process-ids, which are then simply terminated with os.kill()
and the SIGKILL
signal. This is crude but simple and effective. Because of the choice to use swatchdog
as an entirely separate system to other functional processes, there is minimal risk of problems with this “brute force” approach.
Notification
We prefer notifications via Slack. For this, we initially used a simple REST interface. However, due to a later difficulty with special character escaping, we decided to use the official Slack API in the hope of avoiding those problems. In fact, this didn’t help with the special character escaping. Nevertheless the official API is cleaner so we continued using it. It is more appropriate to use that API for Slack notifications and we haven’t needed to think of generalising to other notification systems at this juncture.
To execute these notifications we introduced another simple Python script. It simply formats the notification text minimally and passes it to the Slack API.
Issues
As noted, a better process management scheme could be utilised. There is a small risk that another system whose process names are ‘swatch’ may be installed. If this occurred it’s probable they would be killed during a shutdown of the Swatchdog service and this might cause problems. At this stage this is not considered a risk.
Slack’s API sends messages within JSON wrappers. To do so it must first escape those elements of the messages that are special characters in JSON. This escaping is not reversed when output in Slack, resulting in text that is not as easily readable as it would normally be. In particular, we use JSON internally so that error messages often contain JSON - many characters of which must naturally be escaped. This problem has not yet been solved. It is considered an annoyance, but non-critical.
Conclusion
The use of swatchdog
solves a significant problem for us - notification of system warnings - in a simple way. The system is free and open source. It uses few resources.
It is simple enough to be easily understood from scratch in less than an hour, so that additional log-files can be added by competent non-experts.
The system can be extended without significant risk by a junior developer.
Overall, it is a good solution for at least up to mid-sized software installations.
The code is available with the MIT license. Please submit Issues and Pull Requests.