Linux Format

ELK Stack

Stop SSHing into your systems and start getting control of your logs with an ELK (Elasticsearch, Logstash, Kibana) cluster.


When it comes to a project that involves creating new infrastructure, I like to consider the 3am scenario. That’s when the person on-call (which might be me) gets woken up by an operations team or automated alert and is informed that an issue needs their attention. Quite often these conversations are brief and the person receiving the call most likely didn’t take it all in the first time anyway. The question is: how quickly can that person understand what the call relates to, diagnose whatever issue is going on and then either look for a fix or make a call on the next action to take? Or to put it another way, how soon can I get back into bed?

The first part of the question can be mitigated by putting some thought into monitoring and limiting what can actually cause a call out. In my opinion, a system should only have the audacity to wake me from my beauty sleep – and believe me, I need it – for something really urgent and actionable. If I get woken for something that could be put off until the morning I’ll be a) unnecessarily tired the next day and b) very grumpy. I did on-call for many years (and still do, albeit to a much lesser extent) and the heroics you might imagine you’re capable of as a singleton in your early twenties or thereabouts don’t seem quite as appealing when you’ve still got to get up and get the kids to school in your thirties or older.

As an aside, if you’re working in a culture where your on-call rota guarantees the person on point a terrible night/week/month with a lot of interrupted sleep – stop it. Stop it now. This is unsustainable and suggests you are either monitoring things that really don’t need to result in calls or that your infrastructure/application is so bad it needs to be put into intensive care. Get the whole team to stop and examine the list of call outs (if you’re not maintaining a list, start one). Identify what causes the most headaches, examine the underlying issue and deal with it. Rinse and repeat with the second on the list and so on. Burnout from consistently missing sleep is no laughing matter.

To go back to the scenario where a sysadmin is sat in their pyjamas cursing the development team, hosting company or ISP (essentially whoever is ultimately responsible for them having to crawl out of bed), it may well be obvious from the alert what needs to be done: a process death might have brought down a service; a filesystem might be about to fill up, etc. But for anything non-trivial – and you shouldn’t be getting called for trivial stuff; you need to automate recovery and build redundancy into your service – it’s likely that some logs are going to have to be looked at. The information I need to interrogate might be operating system logs, or something generated by an application; but for anything more than the most basic service these logs are going to be generated in different places.

Now, the last thing I want to do at 3am is manually SSH into a bunch of different Linux instances and start running less and grep commands. Depending on the type of infrastructure involved, I might be trying to track down errors across several web servers. It might not be clear which one or group is having issues; I might need to cross-reference logs here with logs from a middle-tier application service. Even worse, with the trend towards microservices architectures, I might be contending with dozens of systems or potentially hundreds of containers!

Back in the days of monolithic and n-tier architectures, it was common (and still is), as well as being good security practice, to have a central ‘syslog’ server acting as a target for client systems to dump logs onto (probably using rsyslog and UDP). These days, implementing this kind of setup is the bare minimum I would do, if only to secure copies of live logs for audit purposes. There are a number of options for dumping logs to ‘write once’ destinations, ranging from cheap and cheerful to enterprise-class (read: expensive) log aggregators.
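To give a rough sketch of the cheap and cheerful end (the hostname and file path below are placeholders, not taken from any particular setup), rsyslog can forward everything from a client to a central host over UDP with a single rule, while the server only needs its UDP input module enabled:

    # On each client, e.g. /etc/rsyslog.d/50-forward.conf
    # A single @ means forward over UDP; @@ would use TCP instead
    *.*    @logserver.example.com:514

    # On the central server, in /etc/rsyslog.conf
    module(load="imudp")              # load the UDP listener module
    input(type="imudp" port="514")    # accept messages on the standard syslog port

Restart rsyslog on both ends and client messages should start appearing in the server’s logs. Bear in mind that UDP offers no delivery guarantees, which is part of that cheap and cheerful trade-off.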

At least with this kind of arrangement I can just look for issues in one place – but this still means having to manually work my way through logs. On one system I worked on a few years back that followed this model, the team gradually built up sets of commands and scripts to try and quickly pull information out of the amalgamated files, but all too often there was nothing for it but to trawl through the output of several egrep and awk commands piped together.
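To give a flavour of what that trawling looks like (the log path and pattern here are invented for illustration), a typical ad hoc pipeline might count server errors per client across the aggregated web logs:

    # Hypothetical one-liner: tally 5xx responses by client IP across the
    # centralised web logs, busiest offenders first
    # (-h stops grep prefixing each match with a filename when given many files)
    egrep -h ' 50[0-9] ' /var/log/central/web-*.log \
        | awk '{print $1}' \
        | sort | uniq -c | sort -rn | head -20

It works, but every new question means composing another pipeline from scratch; that is exactly the kind of 3am friction a searchable, indexed store such as Elasticsearch is meant to remove.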
