No Huddle Offense

"Individual commitment to a group effort-that is what makes a team work, a company work, a society work, a civilization work."

Running a distributed native-cloud python app on a CoreOS cluster

September 21st, 2014 • 1 Comment

Suricate is an open source Data Science platform. As it is architected to be a native-cloud app, it is composed into multiple parts:

Up till now each part was running in a SmartOS zone in my test setup or run with Openhift Gears. But I wanted to give CoreOS a shot and slowly get into using things like Kubernetes. This tutorial hence will guide through creating: the Docker image needed, the deployment of RabbitMQ & MongoDB as well as deployment of the services of Suricate itself on top of a CoreOS cluster. We’ll use suricate as an example case here – it is also the general instructions to running distributed python apps on CoreOS.

Step 0) Get a CoreOS cluster up & running

Best done using VagrantUp and this repository.

Step 1) Creating a docker image with the python app embedded

Initially we need to create a docker image which embeds the Python application itself. Therefore we will create a image based on Ubuntu and install the necessary requirements. To get started create a new directory – within initialize a git repository. Once done we’ll embed the python code we want to run using a git submodule.

$ git init
$ git submodule add https://github.com/engjoy/suricate.git

Now we’ll create a little directory called misc and dump the python scripts in it which execute the frontend and execution node of suricate. The requirements.txt file is a pip requirements file.

 
$ ls -ltr misc/
total 12
-rw-r--r-- 1 core core 20 Sep 21 11:53 requirements.txt
-rw-r--r-- 1 core core 737 Sep 21 12:21 frontend.py
-rw-r--r-- 1 core core 764 Sep 21 12:29 execnode.py 

Now it is down to creating a Dockerfile which will install the requirements and make sure the suricate application is deployed:

 
$ cat Dockerfile
FROM ubuntu
MAINTAINER engjoy UG (haftungsbeschraenkt)

# apt-get stuff
RUN echo "deb http://archive.ubuntu.com/ubuntu/ trusty main universe" >> /etc/apt/sources.list
RUN apt-get update
RUN apt-get install -y tar build-essential
RUN apt-get install -y python python-dev python-distribute python-pip

# deploy suricate
ADD /misc /misc
ADD /suricate /suricate

RUN pip install -r /misc/requirements.txt

RUN cd suricate && python setup.py install && cd ..

Now all there is left to do is to build the image:

 
$ docker build -t docker/suricate .

Now we have a docker image we can use for both the frontend and execution nodes of suricate. When starting the docker container we will just make sure to start the right executable.

Note.: Once done publish all on DockerHub – that’ll make live easy for you in future.

Step 2) Getting RabbitMQ and MongoDB up & running as units

Before getting suricate up and running we need a RabbitMq broker and a Mongo database. These are just dependencies for our app – your app might need a different set of services. Download the docker images first:

 
$ docker pull tutum/rabbitmq
$ docker pull dockerfile/mongodb

Now we will need to define the RabbitMQ service as a CoreOS unit in a file call rabbitmq.service:

 
$ cat rabbitmq.service
[Unit]
Description=RabbitMQ server
After=docker.service
Requires=docker.service
After=etcd.service
Requires=etcd.service

[Service]
ExecStartPre=/bin/sh -c "/usr/bin/docker rm -f rabbitmq > /dev/null ; true"
ExecStart=/usr/bin/docker run -p 5672:5672 -p 15672:15672 -e RABBITMQ_PASS=secret --name rabbitmq tutum/rabbitmq
ExecStop=/usr/bin/docker stop rabbitmq
ExecStopPost=/usr/bin/docker rm -f rabbitmq

Now in CoreOS we can use fleet to start the rabbitmq service:

 
$ fleetctl start rabbitmq.service
$ fleetctl list-units
UNIT                    MACHINE                         ACTIVE  SUB
rabbitmq.service        b9239746.../172.17.8.101        active  running

The CoreOS cluster will make sure the docker container is launched and RabbitMQ is up & running. More on fleet & scheduling can be found here.

This steps needs to be repeated for the MongoDB service. But afterall it is just a change of the Exec* scripts above (Mind the port setups!). Once done MongoDB and RabbitMQ will happily run:

 
$ fleetctl list-units
UNIT                    MACHINE                         ACTIVE  SUB
mongo.service           b9239746.../172.17.8.101        active  running
rabbitmq.service        b9239746.../172.17.8.101        active  running

Step 3) Run frontend and execution nodes of suricate.

Now it is time to bring up the python application. As we have defined a docker image called engjoy/suricate in step 1 we just need to define the units for CoreOS fleet again. For the frontend we create:

 
$ cat frontend.service
[Unit]
Description=Exec node server
After=docker.service
Requires=docker.service
After=etcd.service
Requires=etcd.service

[Service]
ExecStartPre=/bin/sh -c "/usr/bin/docker rm -f suricate > /dev/null ; true"
ExecStart=/usr/bin/docker run -p 8888:8888 --name suricate -e MONGO_URI=<change uri> -e RABBITMQ_URI=<change uri> engjoy/suricate python /misc/frontend.py
ExecStop=/usr/bin/docker stop suricate
ExecStopPost=/usr/bin/docker rm -f suricate

As you can see it will use the engjoy/suricate image from above and just run the python command. The frontend is now up & running. The same steps need to be repeated for the execution node. As we run at least one execution node per tenant we’ll get multiple units for now. After bringing up multiple execution nodes and the frontend the list of units looks like:

 
$ fleetctl list-units
UNIT                    MACHINE                         ACTIVE  SUB
exec_node_user1.service b9239746.../172.17.8.101        active  running
exec_node_user2.service b9239746.../172.17.8.101        active  running
frontend.service        b9239746.../172.17.8.101        active  running
mongo.service           b9239746.../172.17.8.101        active  running
rabbitmq.service        b9239746.../172.17.8.101        active  running
[...]

Now your distributed Python app is happily running on a CoreOS cluster.

Some notes

Live football game analysis

December 23rd, 2013 • Comments Off on Live football game analysis

Note: I’m talking about American Football here 🙂

In previous posts I already showed how game statistics can be used to automatically which Wide receiver is the Qb’s favorite on which play, down and field position. Now let’s take this one step further and create a little system (using Suricate) which will make suggestions to a Defense Coordinator.

The following diagram will guide through the steps needed to create such a system:

Live game breakdown (Click to enlarge)

Live game breakdown (Click to enlarge)

Let’s start at the top. (Step 1) The User of Suricate will start with performing some simple steps. First a bunch of game statistics are uploaded (same as using in this post). Next also a stream is defined. In this case a URI for an AMQP broker (using CloudAMQP – RabbitMQ as a Service) is defined in the service. With this used defined data is provide to the service.

(Step 2) Now we start creating an analytics notebook. Suricate provides an interactive python console via your web browser which can easily be used to explore the data previously uploaded. Python Pandas and scikit-learn are both available within the Suricate service and can be used right away to accomplish this task:

Exploring game statistics (Click to enlarge)

Exploring game statistics (Click to enlarge)

Based on the data we can create a model which describes on which down, on which fieldposition a run or a pass play is performed. We can also store who is the favorite Wide receiver/Running back for those plays (see also). All this information is stored in a JSON data structure and saved using the SDK of Suricate (Step 3).

(Step 4) Now a little external python script needs to be written which grabs relatively ‘live’ game data from e.g. here. This script now simple continuously sends messages to the previously defined RabbitMQ broker. The messages contain the current play, fieldposition and distance togo information.

(Step 5) Now a processing python notebook needs to be written. This is a rather simple python script. It takes the new incoming messages and compares them to the model learned in step 2 & 3. Based on that suggestions can be displayed (Step 6a) – “e.g. watch out for Wes Welker on 3rd and long at own 20y line” or just some percentages for pass or run plays:

Processing notebook (Click to enlarge)

Processing notebook (Click to enlarge)

Next the information about the new play can be added to the game statics data file (Step 6b). Once this is done a new model can be created (Step 7) to get the most up to date models all the time.

With this overall system new incoming data is streamed in (continuous analytics), models updated and suggestions for a Defense Coordinator outputted. Disclaimer: some steps describe here are not yet in the github repository of suricate – most namely the continuously running of scripts.

Peyton’s favorite WO (Analytics-as-a-Service)

November 17th, 2013 • 1 Comment

Again a little excerpt on stuff you can do with the help of Suricate. Again we’ll look at play-by-play statistics. So no information on which play was performed, but just the outcomes of the plays. Still there is some information you can retrieve from that. Most importantly because it can be done automated without user interaction needed. Just upload the file, press a button and get the results:

Peyton's favorites (Click to enlarge)

Peyton’s favorites (Click to enlarge)

This time you are looking are a cluster analysis of players Peyton passed too in the game of week 1. First cluster represent players passed to on 1,2 & 3 down with up to 6 yards. The second cluster the goto-guys which did go for medium yardage and finally the WOs able to get a bunch of yards on the board.

So with simple scripts (few lines of code – which can be reused) it is possible to abstract information from just play-to-play statistics. I guess mostly important to Defense Coordinators who would love to get some information on the fly with the press of a button 🙂

Peyton’s favorite WR in week 1 (Analytics-as-a-Service)

November 9th, 2013 • 2 Comments

A few line lines of Python code, a big CSV file, some help of Python PandasSuricate and about 20 minutes is all it takes to answer the following Question:

Who was Peyton Manning’s favorite receiver during the match-up between Denver and Baltimore in week 1 of the 2013 NFL season?

The answer is simple: Wes Welker 🙂 See the complete graph below…

Peyton's favorite

Peyton’s favorite (Click to enlarge)

Now it would be cool to find a source to stream the data into Suricate (which supports this) and get live up to date charts…but that is for another day.

Analytics as a Service

October 15th, 2013 • 2 Comments

With this blog post I try to summarize some key concepts/definitions of something called an Analytics-as-a-Service (AaaS).  The AaaS – named Suricate – described in this post , was developed as part of my Masterthesis. The Masterthesis was about learning application/service behavior and then be able to abstract models from that, to create agile systems/strategies. This could reach from simple fault-detection up to reconfigurations (e.g. auto scaling) of the application/service instance of the platform (Cloud/whatever) it runs upon. Originally it was used to take data aggregated from DTrace using Python, parse models from it using scikit-learn, continuously compare (and update) models with new incoming data, and finally process/take actions. This process requires sometimes user interactions (from Data Scientist or whatever you wanna call them) as can be seen in the following diagram:

Learning Process (Click to enlarge)

Learning Process (Click to enlarge)

Eventually, when my Masterthesis was done (with much more in it then just the development of Suricate in it – talking algorithms/concepts to create models etc), I thought it could serve some more general purpose. In short (tl;dr 🙂) Suricate:

Concepts of an Analytics as a Service

The following diagram shows a conceptual overview of Suricate, which will be used to explain some key concepts of an AaaS.

Overview of Suricate - Analytics as a Service

Overview of Suricate – Analytics as a Service (Click to enlarge)

Overall the four main concepts for an AaaS therefore are:

  1. Aggregate – Supplying user defined data: Stream data into the service or upload the data in an internal or external Storage (Object, Relational, etc). The data can then be aggregated, pre-processed and cached.
  2. Interact – Supplying user defined logic (this is where the IP is in – create marketplace for this if you want :-)): Use Python interactive scripting capabilities to perform analytics and visualize (visualizations are sometimes key to understanding :-)) the results. The Data Scientist can interact with the Service via a Web UI or REST API.
  3. Compute – Uses other services (potentially compute or things like Hadoop) to perform the computational aspect of the analytics.
  4. Act – Process the learned models and trigger actions on insights gained, to create agile Systems & Strategies.

Other example of AaaS are Quantopian or DataHero btw. There are plenty of other around, but most of them are focused on Business Analytics. But sometimes the streaming data & ‘Acting’ part is missing from all of them. When running Suricate one is first greeted with the overview page. Those ’tiles’ reflecting the main concepts of an AaaS:

Suricate entry (Click to enlarge)

Suricate entry (Click to enlarge)

In the data page, files can be uploaded as objects and AMQP streams can be defined:

Suricate data sources (Click to enlarge)

Suricate data sources (Click to enlarge)

In the Analytics part notebooks – just like with IPython – can be written. Within these notebooks toolkits such as matplotlib, scikit-learn or Python Pandas can easily be used. Using these toolkits models can be created within this part and stored again as objects (or even use them to retrieve/download data from somewhere). Also packages to enable faster computation of models can be integrated, as a full Python interpreter is at hands. Python is an ideal language for data processing/analytics and learning btw [1], [2] – although R could be integrated too.

Suricate Analytics - model creation (Click to enlarge)

Suricate Analytics – model creation (Click to enlarge)

The Processing part looks similar to the Analytics part. Again notebooks can be created. But these notebooks now use the models created earlier, to compare them with new incoming data (from AMQP) streams and take actions accordingly.

Suricate processing (Click to enlarge)

Suricate processing (Click to enlarge)

It is noted here that both the processing and the Analytics notebooks can be triggered externally (through an API), and therefore create a true continuously Analytics framework.

Suricate’s source code is available (without warranty – Open Source – in early stages etc. as it is mainly a Demonstrator/PoC for my thesis) on GitHub. Feel free to extent it etc. It happily runs on PaaS like OpenShift as it is an simple WSGI application.