Monday, 8 June 2015

CoreOS networking problems on ECS

This post covers some problems we’ve been having with our microscaling demo on ECS (EC2 Container Service): why we switched from Amazon Linux to CoreOS, and why today we’ve switched back to Amazon Linux.

The problems we’ve been having with CoreOS are around networking. We still really like CoreOS and we’d like to resolve the problems. So any feedback or fix suggestions would be greatly appreciated. Please add them in the comments.

Move from Amazon Linux to CoreOS

At the moment Amazon Linux and CoreOS are the two operating system choices for running container instances in an ECS cluster. We started out using Amazon Linux and were happy with it until a new AMI was released with v1.1 of the ECS Agent and Docker 1.6.

This was just before we launched, and we saw a massive slowdown in the responsiveness of the demo. At this point we switched to CoreOS stable, as it gave us more control and allowed us to continue using v1.0 of the ECS Agent and Docker 1.5.

Fixing issue with v1.1 Agent

After we’d launched we worked with AWS to resolve the performance problems with the v1.1 agent. It turned out the problem was at our end: we weren’t trapping the SIGTERM signal that docker stop sends (the SIGKILL that follows it can’t be trapped).

The v1.1 agent stops containers in a more correct manner, but because we weren’t handling the stop signal our demo containers were hitting the 30-second timeout of the docker stop command and being force-killed (exit status 137). We fixed our demo containers and upgraded to the v1.1 agent on CoreOS stable.
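The fix boils down to handling SIGTERM in the container’s entrypoint. Here’s a minimal sketch (our actual entrypoint isn’t shown in this post; `sleep 60` stands in for the demo service, and the self-signalling subshell simulates `docker stop` so the example finishes quickly):

```shell
#!/bin/sh
# docker stop sends SIGTERM, waits (30s by default), then sends SIGKILL.
# Trapping SIGTERM lets the container shut down cleanly inside that window.
graceful=0

term_handler() {
  kill -TERM "$child" 2>/dev/null   # forward the signal to the service
  graceful=1
}
trap term_handler TERM

sleep 60 &                          # stand-in for the real demo service
child=$!

( sleep 1; kill -TERM $$ ) &        # simulate `docker stop` after 1 second

wait "$child"                       # returns early when SIGTERM arrives
echo "graceful shutdown: $graceful"
```

Without the trap, the process ignores the SIGTERM, docker stop waits out its full timeout, and the container dies with status 137 instead.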

Trying CoreOS beta for Docker 1.6

Another benefit of the CoreOS switch was that we got to try an operating system optimised for containers. Overall I like the CoreOS approach, but there is a steep learning curve due to systemd, and we’ve hit several problems with it being a new distro.
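As an illustration of the systemd style: on CoreOS the ECS agent itself runs as a Docker container managed by a systemd unit, roughly like this sketch (the cluster name, agent version and flags here are assumptions; AWS’s documentation has the exact unit):

```ini
# /etc/systemd/system/amazon-ecs-agent.service (sketch, not our exact unit)
[Unit]
Description=Amazon ECS Agent
After=docker.service
Requires=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker rm -f ecs-agent
ExecStart=/usr/bin/docker run --name ecs-agent \
  --env=ECS_CLUSTER=default \
  --volume=/var/run/docker.sock:/var/run/docker.sock \
  --publish=127.0.0.1:51678:51678 \
  amazon/ecs-agent:v1.0.0

[Install]
WantedBy=multi-user.target
```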

One thing we’ve been keen to try is the upgrade from Docker 1.5 to 1.6. This moved from the alpha to the beta channel last week. At that point we switched our staging rig to CoreOS beta to see the effect on performance and stability.

DHCP with multiple interfaces

On CoreOS beta we continued to see networking problems. We would see the kernel message below, along with network connectivity problems from our containers: the ECS Agent becoming disconnected, the Force12 scheduler being unable to call the ECS API, or our demand randomizer being unable to call our REST API on Heroku.

kernel: IPv6: ADDRCONF(NETDEV_UP): veth01426c4: link is not ready

I looked for networking issues in the CoreOS bugs repository on GitHub. There wasn’t a direct match, but this issue about DHCP with multiple interfaces seemed related.
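CoreOS manages interfaces with systemd-networkd, so one mitigation suggested for that class of issue is to match DHCP explicitly to the primary interface rather than relying on the defaults. A sketch of such a networkd unit (the file name and interface name are assumptions):

```ini
# /etc/systemd/network/10-dhcp-eth0.network (sketch)
[Match]
Name=eth0

[Network]
DHCP=yes
```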

Since our problems were with networking, I decided to simplify our configuration by not assigning public IPs to our container instances. This turned out to be quite involved and meant using a lot of VPC features I hadn’t used before. Our new VPC setup has public and private subnets. All the ECS container instances run in the private subnet. The public subnet has a NAT instance, which provides outbound connectivity to the ECS instances for applying updates. This guide was very useful in setting it up.
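The key pieces are the two route tables. Sketched with the AWS CLI (all IDs here are placeholders, and the subnets, internet gateway and NAT instance are assumed to already exist):

```shell
# Public subnet: default route out via the internet gateway (placeholder IDs).
aws ec2 create-route --route-table-id rtb-public111 \
    --destination-cidr-block 0.0.0.0/0 --gateway-id igw-11111111

# Private subnet (ECS container instances): default route via the NAT instance.
aws ec2 create-route --route-table-id rtb-private22 \
    --destination-cidr-block 0.0.0.0/0 --instance-id i-33333333

# The NAT instance must have source/destination checking disabled,
# otherwise it drops traffic that isn't addressed to itself.
aws ec2 modify-instance-attribute --instance-id i-33333333 --no-source-dest-check
```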

Back to Amazon Linux and turning off checkpointing

Removing the public interfaces seemed to help, but we continued to have connectivity problems. So we decided to move back to Amazon Linux, since our problems with the v1.1 agent are fixed and we’re seeing good performance with Docker 1.6. This worked well on staging, but we spotted an issue when a container instance had zero containers running: there was a pause of around 30 seconds, and the message “Saving state!” appeared in the logs.

This is a feature called checkpointing, which saves the state of the ECS agent so it can be restored if an instance crashes. For our use case we don’t need it, so we turned it off by setting the ECS_CHECKPOINT environment variable to false.
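On Amazon Linux the agent picks this up from /etc/ecs/ecs.config, so the change is one line (the cluster name here is an assumption):

```shell
# /etc/ecs/ecs.config
ECS_CLUSTER=default
ECS_CHECKPOINT=false
```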

Current Status

Now that we’re back on Amazon Linux we’re seeing good performance. We’ve also been able to turn up the demand randomizer to make it a bit spikier. This is part of our continued plan to scale up the demo as we learn more about ECS and improve the Force12 scheduler. However, we’re still seeing periods of 30 to 60 seconds where the ECS Agent becomes disconnected. We’re continuing to investigate this, as it’s the biggest stability problem with the demo at the moment.
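When the agent drops off, a quick check we can run on the instance itself is the agent’s local introspection API, which reports the cluster it has joined and its connection details:

```shell
# The ECS agent serves an introspection API on localhost:51678.
curl -s http://localhost:51678/v1/metadata
```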
