Tag Archives: logging

Stream Data Platform or Global Logging with Apache Kafka

A logging platform has been something I’ve been looking for some time. In Logging for the masses‌ I explained how I built an ELK platform for accessing/searching our web logs. Elasticsearch and Kibana are great but Logstash is the weak link, it’s not well designed for parallel processing (cloud/multiples nodes). I had to split the logstash service in two adding a redis server just to get some HA and don’t lose logs.

Also logging is a deficit or a requisite needed by any dockerized app. Most of the issues I talked about in Docker: production usefulness are still valid some  have been tackled with kubernetes, openshiftv3,…  (those relative to managing docker images, and fleet/project management) but with monitoring and logging the jury is still out.

Apache Kafka is a solution to both. Actually is a solution for a lot of things:

  • Messaging: Kafka works well as a replacement for a more traditional message broker.In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

  • Website Activity Tracking: The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds.This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.

  • Metrics: Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.

  • Log Aggregation: Many people use Kafka as a replacement for a log aggregation solution. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption.

  • Stream Processing: Many users end up doing stage-wise processing of data where data is consumed from topics of raw data and then aggregated, enriched, or otherwise transformed into new Kafka topics for further consumption. Storm and Samza are popular frameworks for implementing these kinds of transformations.

  • Event Sourcing: Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka’s support for very large stored log data makes it an excellent backend for an application built in this style.

  • Commit Log: Kafka can serve as a kind of external commit-log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data.

What is Kafka? Where does the name come from?
It’s explained in http://blog.confluent.io/2015/02/25/stream-data-platform-1/ great blog entry by the way, a must to read for understanding Kafka.

We built Apache Kafka at LinkedIn with a specific purpose in mind: to serve as a central repository of data streams.

For a long time we didn’t really have a name for what we were doing (we just called it “Kafka stuff” or “the global commit log thingy”) but over time we came to call this kind of data “stream data”, and the concept of managing this centrally a “stream data platform”


LinkedIn platform before and after developing and implementing Kafka.


In this blog entry from “engineering.linkedin.com”, there is another technical explanation:


The Log: What every software engineer should know about real-time data’s unifying abstraction



I learnt about it thanks to Javi Roman (@javiromanrh) a RedHat Engineer who talks about BigData and for several weeks his tweets always had some Kafka in them. So appealing that I had to research it myself to verify that it really needs to enter in my priority list.

Some links tweeted by Javi Roman to get a glimpse of Apache Kafka:

Logging for the masses

(I really need to update the blog template 🙁 )

Problem: there are several sources of logs you want to consult, search in a centralized way. Also those logs should be correlated for events and raise alerts.

At first glance there are 2 alternatives: Splunk, maybe the leader for logging systems and ArcSight Logger already installed in Poland RDC.

The former is ridiculously expensive (at least for my miserable budget) and the later is a bureaucratic hell.

Both are expensive solutions, proprietary and closed, so sometimes pays itself to look for inexpensive and free (as in speech) source

The free solution involves using Logstash, Elasticsearch and Kibana for logging, storing and presentation.

Web Server Logging

We have about 80 log feeds from 15 web applications and 30 servers, the goal is log everything and be able to search by app, server, date, IP,…

The good news are that all those logs follows the same pattern 

The architecture follows the next scheme (the configuration files are sanitized):


Logstash-forwarder: formerly known as lumberjack. It’s an application that tails logs and sends them in a secure channel to logstash over a tcp port, maintains an offset of each log.

Logstash (as shipper): Receives all the logs streams and stores them in a Redis data store.

Redis: It works here like a message queue between shipper and indexer. A thousand times easier to setup than ActiveMQ.

Logstash (as indexer): Extracts from the redis queue and process the data: parse, map and store in a elasticsearch db.

ElasticSearch: the database where logs are stored, indexed and be searchable.

Kibana: a php frontend for ES, allows the creation and customization of dashboards, queries and filters.


Logstash works as shipper and indexer why split those functions in two different process?

  • Because we don’t want to lose data.
  • Because the indexer can do some serious, CPU intensive tasks per entry.
  • Because the shipper and indexer throughput are different and not synchronized.
  • Because the logs can be unstructured and the match model could have errors reporting null pointers and finally out of memory killing or making it a zombie process (as when i tried to add some JBoss log4j logs).

For those reasons there is a queue between shipper and indexer, so the infrastructure is resilient to downtimes and the indexer isn’t saturated by the shipper throughput.

Logstash-forwarder configuration

A JSON config file, declaring the shipper host, a certificate (shared with the shipper) and which paths are being forwarded.

One instance per server


  "network": {

        "servers": [ "<SHIPPER>:5000" ],

    "ssl certificate": "/opt/logstash-forwarder/logstash.pub",

    "ssl key": "/opt/logstash-forwarder/logstash.key",

    "ssl ca": "/opt/logstash-forwarder/logstash.pub",

    "timeout": 15


"files": [


      "paths": [

        "/opt/httpd/logs/App1/access.log" , "/opt/httpd-sites/logs/App2/access_ssl.log"


      "fields": { "type": "apache" },

      "fields": { "app": "App1" }



      "paths": [

        "/opt/httpd/logs/App2/access.log" , "/opt/httpd/logs/App2/access_ssl.log"


      "fields": { "type": "apache" },

      "fields": { "app": "App2" }




Logstash as shipper

Another JSON config file, accepts logs streams and stores them in a redis datastore.

input {

lumberjack {

port => 5000

ssl_certificate => "/etc/ssl/logstash.pub"

ssl_key => "/etc/ssl/logstash.key"

codec => json



output {

stdout { codec => rubydebug }

redis { host => "localhost" data_type => "list" key => "logstash" }




I think is out of the scope of this blog entry, it’s really dead easy, a default config was enough. It would need scaling depending on the throughput.

Logstash as indexer

Here the input it’s the output of the shipper, the output it’s the ES database, between there is the matching section where we filter the entries (we map them, dropping the health-checks from the F5 balancers and tagging entries with 503 errors). Yes, the output can be multiple too here not only we store those matches but those 503 are sended to a zabbix output which in turn sends them to our zabbix server.

input {

redis {

host => "<REDIS_HOST>"

type => "redis"

data_type => "list"

key => "logstash"



filter {

grok {

match => [ "message", "%{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] \"(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})\" %{NUMBER:response:int} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent} %{QS:jsessionid} %{QS:bigippool} %{NUMBER:reqtimes:int}/%{NUMBER:reqtimems:int}" ]



filter {

if [request] == "/f5.txt" {

drop { }



filter {

if [response] == "503" {

alter {

add_tag => [ "zabbix-sender" ]




output {

stdout { }

elasticsearch {

cluster => "ES_WEB_DMZ"


zabbix {

# only process events with this tag

tags => "zabbix-sender"

# specify the hostname or ip of your zabbix server

# (defaults to localhost)

host => "<ZABBIX_SERVER>"

# specify the port to connect to (default 10051)

port => "10051"

# specify the path to zabbix_sender

# (defaults to "/usr/local/bin/zabbix_sender")

zabbix_sender => "/usr/bin/zabbix_sender"





The configuration file for a basic service is easy. Depending on the needs, throughput, how many searches per second it gets complicate (shards, masters, nodes,…) but for a very occasional use with this line is enough:

cluster.name: ES_WEB_DMZ


Another easy configuration, it only needs to know the ES address: “http://<ES_HOST>:9200” and that’s all. Dashboards and queries are saved in the ES database. The php files and directories can be read only.

This post was originally posted in my company intranet and was showing 2 dashboards/screenshots that I can’t reproduce here:

  1. A simple dashboard showing how the logs are distributed per application, server, how many entries and their response times. Each facet can be inspected and go deeper.
  2. A dashboard showing the application errors (error codes 5XX)