Linux Format

ELK Stack

Stop SSHing into your systems and start getting control of your logs with an ELK (Elasticsearch, Logstash, Kibana) cluster.


When it comes to a project that involves creating new infrastructure, I like to consider the 3am scenario. That’s when the person on-call (which might be me) gets woken up by an operations team or automated alert and is informed that an issue needs their attention. Quite often these conversations are brief and the person receiving the call most likely didn’t take it all in the first time anyway. The question is: how quickly can that person understand what the call relates to, diagnose whatever issue is going on and then either look for a fix or make a call on the next action to take? Or to put it another way, how soon can I get back into bed?

The first part of the question can be mitigated by putting some thought into monitoring and limiting what can actually cause a call out. In my opinion, a system should only have the audacity to wake me from my beauty sleep – and believe me, I need it – for something really urgent and actionable. If I get woken for something that could be put off until the morning I’ll be a) unnecessarily tired the next day and b) very grumpy. I did on-call for many years (and still do, albeit to a much lesser extent) and the heroics you might imagine you’re capable of as a singleton in your early twenties or thereabouts don’t seem quite as appealing when you’ve still got to get up and get the kids to school in your thirties or older.

As an aside, if you’re working in a culture where your on-call rota guarantees the person on point a terrible night/week/month with a lot of interrupted sleep – stop it. Stop it now. This is unsustainable and suggests you are either monitoring things that really don’t need to result in calls or that your infrastructure/application is so bad it needs to be put into intensive care. Get the whole team to stop and examine the list of call outs (if you’re not maintaining a list, start one). Identify what causes the most headaches, examine the underlying issue and deal with it. Rinse and repeat with the second on the list and so on. Burnout from consistently missing sleep is no laughing matter.

To go back to the scenario where a sysadmin is sat in their pyjamas cursing the development team, hosting company or ISP (essentially whoever is ultimately responsible for them having to crawl out of bed), it may well be obvious from the alert what needs to be done: a process death might have brought down a service; a filesystem might be about to fill up, etc. But for anything non-trivial – and you shouldn’t be getting called for trivial stuff; you need to automate recovery and build redundancy into your service – it’s likely that some logs are going to have to be looked at. The information I need to interrogate might be operating system logs, or something generated by an application; but for anything more than the most basic service these logs are going to be generated in different places.

Now, the last thing I want to do at 3am is manually SSH into a bunch of different Linux instances and start running less and grep commands. Depending on the type of infrastructure involved, I might be trying to track down errors across several web servers. It might not be clear which one or group is having issues; I might need to cross-reference logs here with logs from a middle-tier application service. Even worse, with the trend towards microservices architectures, I might be contending with dozens of systems or potentially hundreds of containers!

Back in the days of monolithic and n-tier architectures, it was common (and still is), as well as being good security practice, to have a central ‘syslog’ server acting as a target for client systems to dump logs onto (probably using rsyslog and UDP). These days, implementing this kind of setup is the bare minimum I would do, if only to secure copies of live logs for audit purposes. There are a number of options for dumping logs to ‘write once’ destinations, ranging from cheap and cheerful to enterprise-class (read: expensive) log aggregators.
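To give a rough sketch of the cheap and cheerful end (the hostname and file path below are placeholders, not taken from any particular setup), rsyslog can forward everything from a client to a central host over UDP with a single rule, while the server only needs its UDP input module enabled:

    # On each client, e.g. /etc/rsyslog.d/50-forward.conf
    # A single @ means forward over UDP; @@ would use TCP instead
    *.*    @logserver.example.com:514

    # On the central server, in /etc/rsyslog.conf
    module(load="imudp")              # load the UDP listener module
    input(type="imudp" port="514")    # accept messages on the standard syslog port

Restart rsyslog on both ends and client messages should start appearing in the server’s logs. Bear in mind that UDP offers no delivery guarantees, which is part of that cheap and cheerful trade-off.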

At least with this kind of arrangement I can just look for issues in one place – but this still means having to manually work my way through logs. On one system I worked on a few years back that followed this model, the team gradually built up sets of commands and scripts to try and quickly pull information out of the amalgamated files, but all too often there was nothing for it but to trawl through the output of several egrep and awk commands piped together.
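To give a flavour of what that trawling looks like (the log path and pattern here are invented for illustration), a typical ad hoc pipeline might count server errors per client across the aggregated web logs:

    # Hypothetical one-liner: tally 5xx responses by client IP across the
    # centralised web logs, busiest offenders first
    # (-h stops grep prefixing each match with a filename when given many files)
    egrep -h ' 50[0-9] ' /var/log/central/web-*.log \
        | awk '{print $1}' \
        | sort | uniq -c | sort -rn | head -20

It works, but every new question means composing another pipeline from scratch; that is exactly the kind of 3am friction a searchable, indexed store such as Elasticsearch is meant to remove.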
