Thursday, 4 June 2015

Force12 demo architecture

Force12.io is a demo of microscaling containers using ECS (EC2 Container Service) from AWS. It shows containers being rapidly stopped and started based on a randomized demand metric. To use a networking analogy, Force12 provides QoS (Quality of Service) for containers.

In a router, QoS will prioritise voice data over downloads because VoIP traffic is more demand sensitive. With containers, a public API used by a mobile app would be more demand sensitive, and so higher priority, than a worker process performing a background task.

Our previous post described the demo in more detail. This post covers the design of the ECS cluster. A later post will cover the wider architecture, which includes a REST API hosted on Heroku and a front end built as an Angular app.

Generally, building the demo has gone well considering how much new technology we’re using. However, there have been some problems and changes of direction, which are described in this post.

Force12 ECS cluster

EC2 Auto Scaling Group

The cluster consists of 3 m3.medium instances running in an Auto Scaling Group. We use m3.medium because it’s the smallest instance type that isn’t CPU throttled like the burstable t2 series. We use spot instances to keep costs down.

CoreOS

In ECS terminology, a Container Instance is a VM that is a node in the ECS cluster. For these VMs there are currently 2 choices of operating system: Amazon Linux or CoreOS.

We originally started with Amazon Linux but then switched to CoreOS, mainly because I was interested in trying an operating system optimized for containers. I’ve been impressed with CoreOS and especially their documentation. Their pages on ECS, EC2 and Vagrant have all been essential. At the moment we’re running CoreOS stable. Now that Docker 1.6 is in the beta channel, we’ll be switching to that channel.

Quay.io

For the demo we decided early on that we weren’t going to open source the project when we launched. Instead, we wanted to launch a demo as early as possible to see what interest there was in the idea. For that reason, we’re using a private repository on Quay.io to host our container images.

This did cause some problems, as we initially set up our ECS cluster in AWS’s EU (Ireland) region. We mainly do projects for European clients, so we prefer to keep our data in the EU and close to their customers.

Since we don’t have any customer data on this project, we moved everything to the US East region. Quay also seems to be hosted in US East, and we got a noticeable increase in performance after the move. So we think the choice and location of your Docker repository is an important one. Since container launch speed is important for us, we’re thinking about running our own repository to bring it even closer to our ECS cluster.

System Containers - ECS Agent & New Relic

The demo shows the containers running on our ECS cluster but it doesn’t show the 2 extra system containers installed on each node.

The ECS Agent is written in Go and it calls Docker to start and stop containers on behalf of the ECS Scheduler.

The New Relic container provides their server monitoring plus some extra metrics they’ve developed for Docker. The Docker socket on the CoreOS VM is mounted into the New Relic container so that Docker can be monitored.

I’ve written a post on how to install these containers as services when booting a container instance running CoreOS into an ECS cluster.

Force12 scheduler

The Force12 scheduler is written in Go. It polls a DynamoDB table to get the random demand metric. When the demand changes, it first stops containers to free capacity and then starts containers to match the desired quantity.

To stop and start these containers the scheduler calls the ECS API. In ECS terminology these are actually called tasks. Each task can have multiple containers but in our case each task has a single container.
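The stop-then-start ordering can be sketched in a few lines. This is only an illustration in Python, not the Force12 scheduler itself (which is written in Go), and it plans the work without calling the ECS API. The names `Action` and `plan_actions`, and the task counts, are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One step the scheduler would ask the cluster to perform."""
    kind: str   # "stop" or "start"
    task: str   # hypothetical task name, e.g. "priority1"
    count: int  # how many tasks to stop or start


def plan_actions(running: dict, desired: dict) -> list:
    """Return all stop actions first, then all start actions.

    Stopping first frees capacity on the cluster before any new
    tasks are started.
    """
    stops, starts = [], []
    for task in sorted(set(running) | set(desired)):
        delta = desired.get(task, 0) - running.get(task, 0)
        if delta < 0:
            stops.append(Action("stop", task, -delta))
        elif delta > 0:
            starts.append(Action("start", task, delta))
    return stops + starts


# Hypothetical counts: demand shifts from priority2 towards priority1.
plan = plan_actions(
    running={"priority1": 3, "priority2": 6},
    desired={"priority1": 7, "priority2": 2},
)
for action in plan:
    print(action.kind, action.task, action.count)
```

With these counts the plan stops 4 priority2 tasks before starting 4 priority1 tasks, so the total never exceeds what the cluster is already running.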

The scheduler is written in Go rather than Ruby because we felt we’ll need the additional speed once we release it. The other reason is that my business partner Anne does the scheduler development. She has a strong C background and so is much happier working with Go than Ruby.

Demand randomizer

The demand-RNG container is developed in Ruby. Its only responsibility is setting the random demand metric and updating the DynamoDB table. Both the force12 and demand-RNG containers run as ECS services. This means if a container dies it is replaced automatically.
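As a rough sketch of what a demand randomizer does, the Python below picks a random demand and writes it to a store. It is not the demand-RNG code (which is Ruby). A plain dict stands in for the DynamoDB table, and both the capacity of 9 and the rule that priority2 takes whatever priority1 leaves are assumptions made for the example.

```python
import random


def random_demand(capacity: int, rng: random.Random) -> dict:
    """Pick a random demand for priority1; priority2 takes the remainder.

    The split rule is an assumption for this sketch, not taken from the post.
    """
    priority1 = rng.randint(0, capacity)  # randint includes both endpoints
    return {"priority1": priority1, "priority2": capacity - priority1}


def update_table(table: dict, demand: dict) -> None:
    """Write the demand into a dict standing in for the DynamoDB table."""
    table.update(demand)


table = {}  # stand-in for the real table
update_table(table, random_demand(capacity=9, rng=random.Random(42)))
print(table)
```

Passing in the `random.Random` instance keeps the function easy to test with a fixed seed.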

Demo containers – priority1 & priority2

Our original idea for the demo was much more complex. We wanted the demo containers to be constantly generating a series of random visualizations. As we got into building the demo, we realised this was overly complex.

What we really wanted to show is that one property of containers makes them ideal for autoscaling: they can be stopped and started in close to real time, whereas with virtual machines this takes minutes. So the demo containers aren’t actually doing anything. They are based on the minimal busybox image and run sleeps of 1 second in an infinite loop.

However, we did have a problem with the demo containers. We saw a big performance drop when we upgraded the ECS Agent from v1.0 to v1.1. The newer agent stops containers in a more correct way, but this was causing timeouts when the agent called the docker stop command.

The problem was that we weren’t trapping the SIGTERM signal. This meant our containers were being force killed (Docker status 137, which is 128 plus 9, the signal number of SIGKILL) instead of stopping cleanly (Docker status 0). We got some great support from the ECS development team on GitHub, who helped us find this.
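To illustrate trapping SIGTERM, here is a minimal Python sketch of a sleep loop that exits when asked to stop. It is not the code in our demo containers, which run a busybox loop. The names `handle_sigterm` and `run_until_stopped` are invented for the example, and it assumes a POSIX system.

```python
import os
import signal
import time

stop_requested = False


def handle_sigterm(signum, frame):
    """Record that a stop was requested so the loop can finish cleanly."""
    global stop_requested
    stop_requested = True


# Install the handler before anything can send SIGTERM to the process.
signal.signal(signal.SIGTERM, handle_sigterm)


def run_until_stopped(interval=1.0, max_iterations=None):
    """Sleep in a loop until SIGTERM arrives; return the number of sleeps.

    max_iterations exists only so the sketch cannot loop forever.
    """
    iterations = 0
    while not stop_requested:
        if max_iterations is not None and iterations >= max_iterations:
            break
        time.sleep(interval)
        iterations += 1
    return iterations


# Demo: send SIGTERM to this process, then show that the loop exits.
os.kill(os.getpid(), signal.SIGTERM)
iterations = run_until_stopped(interval=0.01, max_iterations=100)
print("stop requested:", stop_requested)
```

In a real container the process would call `run_until_stopped()` without a cap and then exit with status 0, so that docker stop returns as soon as the signal is handled rather than waiting for its grace period to expire.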

Current Status

We’re still working on some issues with the current demo. The cluster mostly tracks the demand metric, but we’re still seeing 30-second periods where the cluster stops responding.

A possible cause is that the ECS Agent stops containers but waits 3 hours before removing them. This is useful for debugging, but in our case it means up to 700 stopped containers build up on each instance after 3 hours. So it could be a “garbage collection” problem that occurs when the stopped containers are being removed. The ECS team are working on an enhancement for the stopped containers issue, and we’ve given our feedback on what works for us.
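For a sense of scale, the figures above imply the following stop rate per instance. This is a back-of-the-envelope calculation in Python that assumes containers are stopped at an even rate over the 3 hours, which is a simplification.

```python
stopped_containers = 700           # upper bound per instance, from the figures above
retention_seconds = 3 * 60 * 60    # 3 hours before stopped containers are removed

# Assumes an even stop rate across the whole retention window.
seconds_per_stop = retention_seconds / stopped_containers
stops_per_minute = 60 / seconds_per_stop

print(round(seconds_per_stop, 1), round(stops_per_minute, 1))  # prints: 15.4 3.9
```

So each instance stops a container roughly every 15 seconds, and in steady state the agent has to remove them at the same rate.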

There are also several fixes and enhancements that we want to make to the front end. The front-end developers we usually work with are busy on other projects, and we didn’t want to delay the launch. That meant I did most of the front-end development, although I’m much happier working on the back-end and infrastructure parts of the stack.

What’s Next?

We’ve had some great reactions to the demo, so we want to keep showing that autoscaling is a great use case for containers. We’re also trying to do blog-driven development as we make changes. For us, getting people talking about microscaling with containers is just as important as developing Force12.

To turn Force12 from a demo into an open source product, the next step is to start using real demand metrics. At the moment our REST API is a Sinatra Ruby app hosted on Heroku and autoscaled using AdeptScale. We’re going to move that in-house, host the API on the ECS cluster and autoscale it using CloudWatch metrics.

Other areas we’re looking at are scaling up the demo and moving it to other platforms such as Kubernetes or Mesos. We chose to develop the demo on ECS and the AWS platform for 2 reasons: we’re very familiar with AWS, and we thought using AWS was the quickest way to launch the demo. However, the “batteries included, but removable” design approach is something we support, and in the long term we don’t see Force12 being tied to a specific platform.
