
Stream Data Platform or Global Logging with Apache Kafka

A logging platform is something I’ve been looking for for some time. In Logging for the masses I explained how I built an ELK platform for accessing/searching our web logs. Elasticsearch and Kibana are great, but Logstash is the weak link: it isn’t well designed for parallel processing (cloud/multiple nodes). I had to split the Logstash service in two and add a Redis server just to get some HA and not lose logs.

Logging is also a gap (and a requirement) for any dockerized app. Most of the issues I talked about in Docker: production usefulness are still valid; some have been tackled by Kubernetes, OpenShift v3, … (those related to managing Docker images and fleet/project management), but for monitoring and logging the jury is still out.

Apache Kafka is a solution to both. Actually, it’s a solution for a lot of things:

  • Messaging: Kafka works well as a replacement for a more traditional message broker. In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

  • Website Activity Tracking: The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.

  • Metrics: Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.

  • Log Aggregation: Many people use Kafka as a replacement for a log aggregation solution. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption.

  • Stream Processing: Many users end up doing stage-wise processing of data where data is consumed from topics of raw data and then aggregated, enriched, or otherwise transformed into new Kafka topics for further consumption. Storm and Samza are popular frameworks for implementing these kinds of transformations.

  • Event Sourcing: Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka’s support for very large stored log data makes it an excellent backend for an application built in this style.

  • Commit Log: Kafka can serve as a kind of external commit-log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data.
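
To get a feel for the log aggregation case, here is a minimal sketch using the stock Kafka CLI tools (the topic name “weblogs” and the log path are illustrative, and the flags assume a 0.8.x-era install where topics are still managed through ZooKeeper):

# create a topic for the web logs (3 partitions so several consumers can share the work)
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic weblogs

# producer: ship an access log, line by line, into the topic
$ tail -F /var/log/httpd/access_log | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic weblogs

# consumer: anything (Logstash, a custom indexer, ...) can replay the stream from the beginning
$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic weblogs --from-beginning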

What is Kafka? Where does the name come from?
It’s explained in http://blog.confluent.io/2015/02/25/stream-data-platform-1/ (a great blog entry, by the way, and a must-read for understanding Kafka).

We built Apache Kafka at LinkedIn with a specific purpose in mind: to serve as a central repository of data streams.

For a long time we didn’t really have a name for what we were doing (we just called it “Kafka stuff” or “the global commit log thingy”) but over time we came to call this kind of data “stream data”, and the concept of managing this centrally a “stream data platform”

 

LinkedIn platform before and after developing and implementing Kafka.

[Image: data-flow-ugly]

In this blog entry from “engineering.linkedin.com”, there is another technical explanation:

[Image: stream-centric-architecture1]

The Log: What every software engineer should know about real-time data’s unifying abstraction

 

 

I learned about it thanks to Javi Roman (@javiromanrh), a Red Hat engineer who tweets about Big Data; for several weeks his tweets always had some Kafka in them. It looked appealing enough that I had to research it myself to verify that it really deserves a spot on my priority list.

Some links tweeted by Javi Roman to get a glimpse of Apache Kafka:

Kubernetes pre-101 (brief introduction)

Before a more detailed blog entry with a real, production-ready example of dockerized micro-services (using free/open tools; most of the work and design would be reusable if we moved to external cloud providers like GCP or AWS ECS), I’ll introduce another key technology in the ecosystem: the container management engine.

From a previous post: Docker: production usefulness

To run Docker in a safe robust way for a typical multi-host production environment requires very careful management of many variables:

  • secured private image repository (index)
  • orchestrating container deploys with zero downtime
  • orchestrating container deploy roll-backs
  • networking between containers on multiple hosts
  • managing container logs
  • managing container data (db, etc)
  • creating images that properly handle init, logs, etc
  • much much more…

Time to eat my words (or my quotes); let’s present Kubernetes:

A brief summary of what is: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/DESIGN.md

Kubernetes is a system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications

Kubernetes uses Docker to package, instantiate, and run containerized applications.

Kubernetes enables users to ask a cluster to run a set of containers. The system automatically chooses hosts to run those containers on.

The scheduler needs to take into account individual and collective resource requirements, quality of service requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, deadlines, and so on

The atomic element for Kubernetes is the pod.

Pods simplify application deployment and management by providing a higher-level abstraction than the raw, low-level container interface. Pods serve as units of deployment and horizontal scaling/replication. Co-location, fate sharing, coordinated replication, resource sharing, and dependency management are handled automatically.

A pod corresponds to a co-located group of Docker containers with shared volumes.

Pods facilitate data sharing and communication among their constituents.

Their use:

Pods can be used to host vertically integrated application stacks, but their primary motivation is to support co-located, co-managed helper programs, such as:

  • content management systems, file and data loaders, local cache managers, etc.
  • log and checkpoint backup, compression, rotation, snapshotting, etc.
  • data change watchers, log tailers, logging and monitoring adapters, event publishers, etc.
  • proxies, bridges, and adapters
  • controllers, managers, configurators, and updaters

Docker: Attack on Wildfly

Somehow, in my previous blog entry Docker: production usefulness, I gave a negative impression (I don’t know why; it isn’t as if I were using NOs in red, bold and supersized to answer the main questions). But Docker really is a disruptive technology that works, so let’s use it for a near-future platform change: migrating our app servers to Red Hat JBoss 7 (now WildFly) from a classic infrastructure to a cloud/dockerized one.

Docker Images

The goal is an image that can be instantiated as a domain controller (dc) or as a node, each execution seamlessly adding its service/container to the JBoss domain. The domain and its services must meet production quality.

The problem with the existing Docker images.

JBoss (Red Hat) has an official Docker image (jboss/wildfly) and Dockerfile, with documentation on how to extend it.

The first problem comes from its origin: “jboss/wildfly” depends on “jboss/base-jdk:7”, which depends on “jboss/base:latest”, which depends on “fedora:20”. Fedora isn’t production software (at least not here in Spain), so those images are tainted for us: we must run our services on supported systems. For our research we’ll use CentOS, so we can swap it for RHEL later.

In any case, this means we have to recreate a new Dockerfile that doesn’t share images with the official repo but imports the inner workings of all those Dockerfiles. There’s another option: fork all those repos changing only their FROM, which would let us keep pulling and merging updates. For simplicity, and to avoid creating/uploading anything to GitHub, I chose the former.

FROM centos:centos7

# ENV http_proxy <PROXY_HOST>:<PROXY_PORT>
# ENV https_proxy <PROXY_HOST>:<PROXY_PORT>

# imported from jboss/base
# Execute system update
RUN yum -y update

# Install packages necessary to run EAP (from jboss/base)
RUN yum -y install xmlstarlet saxon augeas unzip tar bzip2 xz

# Imported from jboss/jdk7
# Install necessary packages (from JDK7 Base)
RUN yum -y install java-1.7.0-openjdk-devel
RUN yum clean all

# Imported from jboss/base
# Create a user and group used to launch processes
# The user ID 1000 is the default for the first "regular" user on Fedora/RHEL,
# so there is a high chance that this ID will be equal to the current user
# making it easier to use volumes (no permission issues)
RUN groupadd -r jboss -g 1000 && useradd -u 1000 -r -g jboss -m -d /opt/jboss -s /sbin/nologin -c "JBoss user" jboss

# Set the working directory to jboss' user home directory
WORKDIR /opt/jboss

# Specify the user which should be used to execute all commands below
USER jboss

# Imported from jboss/jdk7
# Set the JAVA_HOME variable to make it clear where Java is located
ENV JAVA_HOME /usr/lib/jvm/java

# Imported from jboss/wildfly
# Set the WILDFLY_VERSION env variable
ENV WILDFLY_VERSION 8.1.0.Final

# Add the WildFly distribution to /opt, and make wildfly the owner of the extracted tar content
# Make sure the distribution is available from a well-known place
RUN cd $HOME && curl http://download.jboss.org/wildfly/$WILDFLY_VERSION/wildfly-$WILDFLY_VERSION.tar.gz \
    | tar zx && mv $HOME/wildfly-$WILDFLY_VERSION $HOME/wildfly

# Set the JBOSS_HOME env variable
ENV JBOSS_HOME /opt/jboss/wildfly

# Expose the ports we're interested in
EXPOSE 8080 9990

# Set the default command to run on boot
# This will boot WildFly in standalone mode and bind to all interfaces
CMD ["/opt/jboss/wildfly/bin/standalone.sh", "-b", "0.0.0.0"]

 

This Dockerfile builds an image similar to the official one, but using CentOS instead of Fedora.

$ sudo docker build --tag ackward/wildfly .
$ sudo docker images
REPOSITORY           TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
ackward/wildfly       latest              1c4572ac308d        4 days ago          680.8 MB
<none>               <none>              102ff2048023        4 days ago          467.7 MB
centos               centos7             ae0c2d0bdc10        2 weeks ago         224 MB

Running containers from it is as easy as (following the documentation):

$ sudo docker run -it ackward/wildfly /opt/jboss/wildfly/bin/domain.sh -b 0.0.0.0 -bmanagement 0.0.0.0

or for a standalone instance:

$ sudo docker run -it ackward/wildfly

They run and work, sort of: nothing is configured yet, but that’s expected. So we continue to the next step: configuring the JBoss domain service.

Configuring JBoss Domain Service.

For the neophytes: JBoss App Server has two operation modes, standalone, where each server manages itself, and domain mode, where a domain controller manages several nodes (similar to how WebSphere Deployment Manager or WebLogic work).

The first thing to consider is how we are going to manage the configuration files: how they will be accessed, tracked and versioned, how the logs can be accessed, how the apps will be deployed, etc.

This matters because the Docker paradigm tries to keep containers independent of their data: each run of an image is fresh, and containers should be as cheap as possible to create and destroy on the fly (tying them to data isn’t the best way to achieve that).

The solution is already implemented in our environments (prod included). When I designed the migration (or runaway) from WebSphere to JBoss, I separated the configuration files from the application files, for accountability and because the configuration files are all text files: they can be kept in git and synced with a remote/central repo, an upgrade doesn’t touch them, and they can be diffed and merged. This works for us here because we can pass the config dir (jboss.<>.base.dir) as an external volume: all changes are persistent, we have direct access to the log files and the deployment directory, everything can be versioned with git and synced with a master/central repo, and creating new nodes is dead easy (let’s see it).
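
As an illustrative sketch of that git workflow, using the dc’s base dir that we’ll lay out below (/opt/jboss/dc1; the remote URL is made up):

$ cd /opt/jboss/dc1
$ sudo -u jboss git init
$ sudo -u jboss git config user.name "jboss-admin"
$ sudo -u jboss git config user.email "jboss-admin@example.com"
$ sudo -u jboss git add configuration
$ sudo -u jboss git commit -m "initial domain controller configuration"
# central repo for syncing config across hosts (URL is hypothetical)
$ sudo -u jboss git remote add origin git@gitserver:jboss-config/dc1.git
$ sudo -u jboss git push -u origin master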

Instead of modifying our wildfly Dockerfile we extend it; this way we customize the image while maintaining backward compatibility.

FROM ackward/wildfly
RUN mkdir -p /opt/jboss/LOCAL_JB/node
VOLUME ["/opt/jboss/LOCAL_JB/node"]
EXPOSE 8080 9990 9999
CMD ["/opt/jboss/wildfly/bin/domain.sh", "-b", "0.0.0.0", "-bmanagement", "0.0.0.0", \
     "-Djboss.domain.base.dir=/opt/jboss/LOCAL_JB/node"]

To build it:

$ sudo docker build --tag ackward/dc-wildfly .
$ sudo docker images
REPOSITORY           TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
ackward/dc-wildfly   latest              1a9de7c0734f        3 days ago          680.8 MB
ackward/wildfly      latest              1c4572ac308d        5 days ago          680.8 MB
<none>               <none>              102ff2048023        5 days ago          467.7 MB
centos               centos7             ae0c2d0bdc10        2 weeks ago         224 MB

There are some issues with SELinux; we need to change the context of the host directories we’re exposing to the containers:

$ sudo touch /opt/jboss/LOCAL_JB/node/tt
touch: cannot touch '/opt/jboss/LOCAL_JB/node/tt': Permission denied
$ sudo chcon -Rt svirt_sandbox_file_t /opt/jboss/

We aren’t going to work with data containers for the moment, just the plain old local fs (I’ve also created the jboss user on the host with the same uid/gid as in the images; I don’t know if it’s needed, it’s just a precaution):

$ ls -l /opt/jboss/
total 0
drwxr-xr-x. 7 jboss jboss 71 nov 21 08:17 dc1
drwxr-xr-x. 7 jboss jboss 71 nov 24 11:06 node-tplt
drwxr-xr-x. 7 jboss jboss 71 nov 21 08:40 node1
drwxr-xr-x. 7 jboss jboss 71 nov 21 11:51 node2
$ ls -l /opt/jboss/dc1/
total 4
drwxr-xr-x. 4 jboss jboss 4096 nov 21 10:59 configuration
drwxr-xr-x. 3 jboss jboss   20 nov 21 08:17 data
drwxr-xr-x. 2 jboss jboss   61 nov 21 09:23 log
drwxr-xr-x. 4 jboss jboss   40 nov 21 08:17 servers
drwxr-xr-x. 3 jboss jboss   17 nov 21 08:17 tmp
$ ls -l /opt/jboss/node1
total 4
drwxr-xr-x. 3 jboss jboss 4096 nov 21 11:16 configuration
drwxr-xr-x. 3 jboss jboss   20 nov 21 08:22 data
drwxr-xr-x. 2 jboss jboss   61 nov 21 10:50 log
drwxr-xr-x. 4 jboss jboss   40 nov 21 11:12 servers
drwxr-xr-x. 3 jboss jboss   17 nov 21 08:22 tmp

How to configure WildFly as a domain controller and its nodes is out of the scope of this entry; it’s well documented by RH. Running the dc is as easy as:

$ sudo docker run -d -P -p 9990:9990 -v /opt/jboss/dc1:/opt/jboss/LOCAL_JB/node -h dc1 --name=wf-dc1 ackward/dc-wildfly

If we want to create a new node, we have an empty template in “node-tplt” (no servers, no groups, just the slave service account needed to join the domain); a new node is just a cp away:

$ sudo cp -r /opt/jboss/node-tplt /opt/jboss/node2
$ sudo chown -R jboss:jboss /opt/jboss/node2
$ sudo docker run -P -d --link=wf-dc1:ldc1 -v /opt/jboss/node2:/opt/jboss/LOCAL_JB/node --name=node2 \
    -h node2 ackward/dc-wildfly /opt/jboss/wildfly/bin/domain.sh -b 0.0.0.0 -Djboss.domain.master.address=ldc1 \
    -Djboss.domain.base.dir=/opt/jboss/LOCAL_JB/node
$ sudo docker ps
CONTAINER ID        IMAGE                       COMMAND                CREATED             STATUS
PORTS                                                                       NAMES
665e986ad585        ackward/dc-wildfly:latest   "/opt/jboss/wildfly/   2 days ago          Up 2 days
0.0.0.0:49187->8080/tcp, 0.0.0.0:49188->9990/tcp, 0.0.0.0:49189->9999/tcp   node2
1229904dd9d2        ackward/dc-wildfly:latest   "/opt/jboss/wildfly/   2 days ago          Up 2 days
0.0.0.0:49184->8080/tcp, 0.0.0.0:49185->9990/tcp, 0.0.0.0:49186->9999/tcp   node1
d0cdba99520b        ackward/dc-wildfly:latest   "/opt/jboss/wildfly/   3 days ago          Up 3 days
0.0.0.0:9990->9990/tcp, 0.0.0.0:49170->9999/tcp, 0.0.0.0:49171->8080/tcp    node1/ldc1,node2/ldc1,wf-dc1
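
To verify that the nodes actually joined the domain, a quick (illustrative) check is to look at the domain controller’s output, either through Docker or through the log directory exposed in the mounted base dir (the exact log file name may differ):

$ sudo docker logs wf-dc1 | grep -i registered
$ tail -f /opt/jboss/dc1/log/host-controller.log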

Where are we?

There is a long road ahead, but the foundations look solid, and this post is already long enough that most people will just TL;DR it.

We have:

  • A way to create nodes and to access/modify their config and logs easily.
  • Images and containers that can be easily upgraded or customized.
  • Images and containers with zero configuration. (GREAT!)
  • Git repos for accountability of our configuration files and for creating new nodes.
  • A private Docker hub/registry so we could share the images across all the Docker hosts.
  • A way to execute containers on remote servers using Ansible; we could create a role with some docker tasks.
    A simple Ansible task for running a container from an image locally:
- hosts: localhost
  tasks:
    - name: run wildfly dc
      docker: >
        image=ackward/wildfly
        name=wf-dc1
        command="/opt/jboss/wildfly/bin/domain.sh -b 0.0.0.0 -bmanagement 0.0.0.0"
        hostname=dc1
        publish_all_ports=yes
        volumes=/opt/jboss/dc1:/opt/jboss/LOCAL_JB/node
        ports=9990
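
Saved as, say, wf-dc.yml (the file name is illustrative), it can be applied with ansible-playbook; pointing it at remote Docker hosts is just a matter of changing hosts: and the inventory.

$ ansible-playbook wf-dc.yml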

We don’t yet have:

  • An infrastructure for provisioning Docker hosts (physical/virtual nodes), or at least a full Ansible configuration for managing the underlying servers and their services.
    • Including dependencies and orchestration of those micro-services.
  • A map of external dependencies and how they must be managed in a cloud paradigm (with floating services).
    • External resources like accessing a mainframe, SAS, fw rules, access control permissions, etc.
  • The whole deployment (continuous delivery) platform: how we are going to connect, build and deploy apps on this dockerized JBoss domain.
  • The whole end-to-end experience!!!:
    • How the services are accessed. This one is big: right now each container exposes dynamically mapped ports, and the only well-known port is 9990 (the dc admin console), but these are public services with known URLs and ports.
    • How developers interact with these environments; how they publish, debug, etc… the applications that run on them.
  • How the platform is monitored (nodes, services, apps,…)
  • Backups
  • Contingency and HA plans.
  • A business case! What would happen in 1, 3 or 5 years, the associated costs, etc. Do we really need to implement it? Are there advantages or benefits to doing so rather than not?