Tuesday, 19 May 2015

Monitoring Docker with New Relic on CoreOS and ECS

This post shows how to use New Relic server monitoring on ECS (EC2 Container Service) with CoreOS.

Update - 9 November 2015

This post is out of date as it was based on an early version of New Relic's monitoring for Docker. You can find an updated systemd unit file on GitHub (lorieri/coreos-newrelic). This worked for me with CoreOS stable 766.4.


Last Friday we launched our Force12 demo of container autoscaling. This is the first of a series of posts on how we're using ECS and what we've learnt from building the demo.

Docker Metrics in New Relic server monitoring


We're running our 3 EC2 instances in an Auto Scaling Group. The Launch Configuration for each EC2 instance installs 2 Docker containers as systemd services. This reflects the CoreOS architecture, which keeps the OS as minimal as possible and installs extra components as containers.
  • ECS Agent - controls Docker on the EC2 instance and communicates with the ECS API
  • New Relic System Monitor - the New Relic sysmond service deployed as a Docker container
At the end of the post is our full cloud-config configuration. It's worth noting that the syntax has to be exact; issues like trailing whitespace will prevent the services from being installed.

CoreOS with Quay.io Private Repository

The starting point for our CoreOS setup was their ECS example configuration. We're using private repositories from Quay.io, so we configure authentication by adding the environment variables ECS_ENGINE_AUTH_TYPE and ECS_ENGINE_AUTH_DATA. To get the auth data, run the docker login command on your local machine and use the value written to the .dockercfg file.

$ docker login quay.io

# .dockercfg
{"quay.io":{"auth":"***YOUR_AUTH_DATA***","email":"email@example.com"}}
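If you have jq installed (an optional convenience, not something ECS requires), one way to pull just the quay.io entry out of your local .dockercfg is:

$ jq -c '{"quay.io": .["quay.io"]}' ~/.dockercfg
# prints the JSON blob to paste into ECS_ENGINE_AUTH_DATA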

New Relic

The extra Docker metrics for New Relic server monitoring are currently in beta. When we first installed the server monitoring it ran, but no Docker metrics appeared. This was fixed by adding the parameter -v /var/run/docker.sock:/var/run/docker.sock to the docker run command.

This mounts the host's Docker socket into the New Relic container so the agent can monitor Docker. This forum post was very useful in getting it working, and the issue is being worked on at New Relic.
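A quick sanity check we'd suggest (assuming the New Relic image, named nrsysmond in the cloud-config below, includes standard utilities like ls) is to confirm the socket is visible inside the running container:

$ docker exec nrsysmond ls -l /var/run/docker.sock
# you should see a socket file here; if not, the -v mount is missing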

cloud-config

Here is the full cloud-config file that runs when each EC2 instance is launched. To recap, make sure you set the following:
  • YOUR_ECS_CLUSTER - must match your ECS cluster name.
  • YOUR_AUTH_DATA - set this if you're using private repositories.
  • YOUR_NEWRELIC_LICENSE_KEY - your New Relic license key, without quotes.
#cloud-config

coreos:
 units:
   -
     name: amazon-ecs-agent.service
     command: start
     runtime: true
     content: |
       [Unit]
       Description=Amazon ECS Agent
       After=docker.service
       Requires=docker.service

       [Service]
       Environment=ECS_CLUSTER=YOUR_ECS_CLUSTER
       Environment=ECS_LOGLEVEL=info
       Environment=ECS_ENGINE_AUTH_TYPE=dockercfg
       Environment=ECS_ENGINE_AUTH_DATA=YOUR_AUTH_DATA
       ExecStartPre=-/usr/bin/docker kill ecs-agent
       ExecStartPre=-/usr/bin/docker rm ecs-agent
       ExecStartPre=/usr/bin/docker pull amazon/amazon-ecs-agent
       ExecStart=/usr/bin/docker run --name ecs-agent --env=ECS_CLUSTER=${ECS_CLUSTER} --env=ECS_LOGLEVEL=${ECS_LOGLEVEL} --env=ECS_ENGINE_AUTH_TYPE --env=ECS_ENGINE_AUTH_DATA --publish=127.0.0.1:51678:51678 --volume=/var/run/docker.sock:/var/run/docker.sock amazon/amazon-ecs-agent

       ExecStop=/usr/bin/docker stop ecs-agent
   -
     name: newrelic-system-monitor.service
     command: start
     runtime: true
     content: |
       [Unit]
       Description=New Relic System Monitor (nrsysmond)
       After=amazon-ecs-agent.service
       Requires=docker.service

       [Service]
       TimeoutStartSec=10m
       ExecStartPre=-/usr/bin/docker kill nrsysmond
       ExecStartPre=-/usr/bin/docker rm nrsysmond
       ExecStartPre=/usr/bin/docker pull newrelic/nrsysmond:latest
       ExecStart=/usr/bin/docker run --name nrsysmond --rm \
         -v /proc:/proc -v /sys:/sys -v /dev:/dev -v /var/run/docker.sock:/var/run/docker.sock --privileged=true --net=host \
         -e NRSYSMOND_license_key=YOUR_NEWRELIC_LICENSE_KEY \
         -e NRSYSMOND_loglevel=info \
         -e NRSYSMOND_hostname=%H \
         newrelic/nrsysmond:latest
       ExecStop=/usr/bin/docker stop -t 30 nrsysmond
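Once an instance has booted, a couple of checks are worth running on the instance itself (these commands are just a suggested way to verify the setup, not part of the cloud-config): confirm both units are active, and query the ECS agent's introspection endpoint, which the unit above publishes on 127.0.0.1:51678.

$ systemctl status amazon-ecs-agent.service newrelic-system-monitor.service
$ curl -s http://127.0.0.1:51678/v1/metadata
# the metadata response should include your cluster name and a container instance ARN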

Saturday, 9 May 2015

About the Force12.io demo

The Force12 demo is a live, real-time view of Linux containers being automatically created and destroyed to handle unpredictable demand on our AWS cluster.

Container Demand = red, Containers = dark blue

It’s deliberately simple.

AWS Setup

  • There's a fixed pool of resources (3 EC2 VMs). 
  • 2 generic container types: Priority 1 (dark blue) and Priority 2 (lilac). 
  • 1 Demand_RNG container, which randomly generates demand for Priority1 containers (demand in red). 
  • 1 Force12 scheduler, which monitors demand and starts and stops Priority1 and Priority2 containers. 
The first bar chart shows what’s happening right now (red demand vs blue Priority1 containers).

The second chart shows historical snapshots of the last few seconds (red demand, dark blue P1 containers and lilac P2 containers).

Below the charts you can see what’s happening now on each container instance.

Goal

Force12’s job is to

  • Meet demand by starting or stopping Priority1 containers. 
  • Use any leftover resources for Priority2 containers. 

Success is when the running Priority1 containers meet the demand AND the total number of running Priority1 + Priority2 containers = 9 (maximum utilisation of fixed resources).

So if there are 4 Priority1 containers running there should be 5 Priority2 containers running. If the demand for Priority1 services increases by 1, Force12 will stop a Priority2 container and start a new Priority1 container ASAP.
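As a rough sketch of that rule (the variable names here are ours, purely for illustration, not from the Force12 code):

TOTAL_SLOTS=9                            # fixed capacity across the 3 EC2 instances
P1_DEMAND=4                              # e.g. current demand reported by Demand_RNG
P1_TARGET=$P1_DEMAND                     # meet Priority1 demand exactly
P2_TARGET=$((TOTAL_SLOTS - P1_TARGET))   # fill whatever is left with Priority2
echo "target: $P1_TARGET x P1, $P2_TARGET x P2"   # here: 4 x P1, 5 x P2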

The Results

As you can see from the demo, on a fairly untuned environment you can get container instantiation speeds of around 3 seconds (with some spread around that figure). Shutdown is much faster. This untuned instantiation time is far higher than the sub-second startup times we know are possible in a more tuned environment.

To improve stability and speed we’ll evolve this basic set-up and we’ll blog about what we do and what effect it has.

Known Issues with the Demo

  • Demo container start time is 3-4 seconds. We want to achieve sub-second speeds without heroic measures, i.e. with a standard cloud service setup.
  • The container instances can become unresponsive to starts and stops and they can take 30 seconds to recover (but they do recover). This seems to be related to the container networking. We're looking into this.

What are Containers?

Where Did Containers Come From?

In the beginning there were monolithic physical servers. They each ran a single operating system like Linux or Windows.

Then we devised virtual machines and we could run multiple guest operating systems on a single host server. This gave us huge flexibility - the ability to use physical servers more effectively (server density and multi tenancy) and change their use comparatively rapidly (in hours or even minutes).
Physical Server with 2 VMs

Finally, products like Vagrant, Chef and Puppet gave us the ability to script the creation of VMs. That made it much easier to get consistency across development, test and production environments.

When combined with IaaS, VMs became an amazingly effective way to get more from physical infrastructure and cut hosting costs.

Containers Are a Powerful New Take on VM Concepts

How are containers different to VMs?

VMs are great, but when you’re running several guest OSs on a host OS you’re duplicating a lot of functionality - multiple full network stacks for instance. That’s a waste.

Physical Server with 2 Containers

Containers are not VMs - but they kind of act like them. Containers are processes that run on your host OS, but behave conceptually much like a very lightweight VM. They focus on providing the separation and configurability of a VM with minimal duplication between container and host OS. This means you can fit more containers than VMs on a physical server (lower costs) and you get much faster launch speeds (a container could potentially be instantiated fast enough to handle a single network packet).

Each container could run several applications (a "fat" container) but more often runs just a single application ("thin").

Containers are managed on your host OS using a container engine application (Docker for example). Like a hypervisor, a container engine routes network traffic to individual containers and divides up and controls access to shared host resources like memory and disk. Docker also cleverly provisions a container’s contents via preconfigured images and scripts (in much the same way Vagrant allows you to script VM creation).

Each container then emulates a cut down “guest” that supports a restricted set of applications. For example, a web server or a database.
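For instance, assuming the stock nginx and postgres images from Docker Hub (illustrative choices, not what our demo runs), a web server and a database each start from a preconfigured image:

$ docker run -d --name web nginx
$ docker run -d --name db -e POSTGRES_PASSWORD=example postgres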

In order to make containers even faster, they can be hosted on a stripped-down open source Linux variant like CoreOS or Snappy Ubuntu, but this isn’t necessary; many ordinary Linux variants support containers out of the box.

Are there Windows Containers?

Containers were originally developed for Linux, but Microsoft are in the process of developing similar functionality for Windows.

Why are Containers Better than VMs or Bare Metal?

Containers have most of the advantages of VMs: flexibility and scriptability. However, they can achieve higher server densities and faster instantiation than VMs because of the reduction in duplication between the host OS and guest OSes.

It is the extreme speed of instantiation and destruction of containers that we’re exploiting in Force12.

How are Containers Worse than VMs?

You can’t mix different OSes on the same host with containers (for example, you couldn’t have a Windows container on a Linux host). That potentially reduces the flexibility, although this often isn’t much of an issue in real world scenarios outside of dev and test.

You can’t give a container its own externally routable IP address. Containers are reached via the IP address of the host and you can only distinguish individual containers using port numbers (although there are some open source projects, like Calico, that do allow full IPv4 or IPv6 addressing of containers).
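For example, two copies of the same web server on one host end up being told apart only by the host port they publish (nginx is just an illustrative image here):

$ docker run -d --name web1 -p 8080:80 nginx
$ docker run -d --name web2 -p 8081:80 nginx
# both answer on the host's IP address, on ports 8080 and 8081 respectively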

Containers running on the host are just processes. They are not as sandboxed in terms of disk, memory, CPU, etc. as a VM would be. This currently makes them less secure in a multi-tenant environment. Again, that’s being worked on.

Friday, 8 May 2015

What is Force12?

Force12 Dynamic Container Autoscaling

Force12 monitors demand on a cluster and then starts and stops containers in real time to repurpose your cluster to handle that demand.

Force12 is designed to optimize the use of an existing cluster in realtime without manual intervention.

VMs cannot be scaled in real time, and neither can physical machines, but containers can be started or stopped at sub-second speeds. This potentially allows a cluster to adapt itself in real time, producing the optimal configuration to meet current demand.

For example, in response to a traffic peak, worker services performing low-urgency tasks can be stopped and web services started. When the traffic peak ends, the cluster can reconfigure itself to kill off web services and create more worker instances again.
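Conceptually each reconfiguration step is nothing more than stopping one container and starting another (the names below are made up for illustration):

$ docker stop worker-3
$ docker run -d --name web-4 example/web-service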




Force12 optimises the use of the cluster resources available right now - existing VMs or physical servers.

Router Analogy

The Force12 approach is analogous to the way that a network router dynamically optimises the use of a physical network. A router is limited by the capacity of the lines physically connected to it - adding additional capacity is a physical process and takes a long time.

Network routers therefore make decisions in real time about how best to use their current local capacity. They do this by deciding which packets will be prioritised on a particular line based on the packet's priority (SLA). For example, at times of high bandwidth usage a router might automatically prioritise VoIP traffic over web browsing or file transfer.

Force12 can make similar instant judgements on service prioritisation within your cluster because, using containers, it can start and stop services in near real-time.

Network routers can only make very simplistic prioritisation judgements because they have limited time and CPU and they act at a per-packet level. Force12 has the capacity to make far more sophisticated judgements, but to start with it won't - simple judgements are proven to work in a network, so let's start there and worry about greater sophistication later.

The Force12 demo is a bare-bones implementation that recognises only 1 demand type: randomised demand for a priority 1 service. When this fluctuating priority 1 (P1) demand has been met, a priority 2 (P2) service utilises whatever cluster resource remains.

The demo demand type example has been chosen purely for simplicity.

Force12

Force12 can be configured to actively monitor real time system analytics (queue lengths, load balancer requests etc) and then instantly reconfigure your systems to respond to current conditions.

Force12 will allow your cluster to adapt in an organic, real time fashion to handle whatever unpredictable events the outside world throws at it, without you having to anticipate those events.

Only the incredible speed of container startup and shutdown makes this possible.