Run CoreOS etcd clusters on spot instances


As part of our continuous effort to bring economical computing, without compromising availability, to containerized workloads such as Kubernetes, Amazon Elastic Container Service, and Docker Swarm, we are excited to share our Elastigroup and etcd2 integration guide.

 

CoreOS etcd

etcd is an open-source distributed key-value store that provides shared configuration and service discovery for Container Linux clusters. etcd runs on each machine in a cluster and gracefully handles leader election during network partitions and the loss of the current leader.

Application containers running on your cluster can read and write data into etcd. Common examples are storing database connection details, cache settings, feature flags, and more.
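
For illustration, here is a minimal sketch of reading one such key through etcd's v2 HTTP API. The key name and response body below are hypothetical, and the small sed-based parser stands in for a real JSON tool such as jq:

```shell
# A client would normally query a live cluster, e.g.:
#   curl -s http://127.0.0.1:2379/v2/keys/myapp/db/host
# extract_value is a minimal sed-based stand-in for a JSON parser.
extract_value() {
  sed -n 's/.*"value":"\([^"]*\)".*/\1/p'
}

# Hypothetical v2 API response for GET /v2/keys/myapp/db/host:
response='{"action":"get","node":{"key":"/myapp/db/host","value":"db1.internal","modifiedIndex":7,"createdIndex":7}}'
printf '%s\n' "$response" | extract_value   # prints "db1.internal"
```

The same key could be written with `etcdctl set /myapp/db/host db1.internal` from any machine in the cluster.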

 

Clustering etcd on spot instances

As you may already know, the etcd2 discovery service does not handle members joining and leaving the cluster after the initial bootstrap. In this tutorial we use a different method that reduces dependencies on external systems.

Automate

Our script starts by querying the Spotinst API to fetch the Elastigroup ID (for member discovery) and using the AWS instance metadata service to get the instance ID and private IP. This information is written to a file that instructs etcd2 which peers to load on startup.
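
The peers file is essentially an etcd2 environment file. As a hedged sketch (the function name and file layout are ours, not part of the published discovery script), the initial-cluster string could be assembled from the discovered members like this:

```shell
# Build an ETCD_INITIAL_CLUSTER value from (name, ip) pairs, e.g. the
# instance IDs and private IPs fetched from the Spotinst API and EC2
# instance metadata. Names and addresses below are hypothetical.
build_initial_cluster() {
  local out=""
  while [ "$#" -ge 2 ]; do
    out="${out}${out:+,}$1=http://$2:2380"
    shift 2
  done
  echo "$out"
}

build_initial_cluster i-0abc 10.0.1.10 i-0def 10.0.1.11
# prints: i-0abc=http://10.0.1.10:2380,i-0def=http://10.0.1.11:2380
```

A line such as `ETCD_INITIAL_CLUSTER=...` written to the peers file is then loaded by the `30-etcd_peers.conf` drop-in via `EnvironmentFile` before etcd2 starts.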

Cluster Membership

etcd expects instances to remove themselves from the cluster prior to termination. Since we are working with Spot instances, we added additional logic to make the solution more robust, so that Spot instances can be replaced without a problem.

Cleanup of “bad” members

Once added to the instance user data, our script handles cleanup in two ways to make sure no instance is left behind:

  • The instances query the Spotinst API for their status every 30 seconds and deregister themselves from the etcd cluster when the “TERMINATING” status appears.
  • When a new instance comes up, it compares the list of members reported by etcd to the list of running machines in the Elastigroup. Once the bad host(s) are found, it sends a REST call to a healthy cluster member to clean up the cluster, removing members that have been replaced.
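
As a sketch of that comparison step (the function name and the assumption that `etcdctl member list` prints v2-style lines such as `deadbeef: name=i-0abc peerURLs=http://10.0.1.10:2380 ...` are ours), stale member IDs could be derived like this:

```shell
# Print the IDs of etcd members whose peer IP is not in the list of
# currently running Elastigroup instances. Reads `etcdctl member list`
# style lines on stdin; live IPs are passed space-separated as $1.
find_stale_members() {
  local live_ips=" $1 "
  local line id ip
  while IFS= read -r line; do
    id="${line%%:*}"
    ip=$(printf '%s\n' "$line" | grep -oE 'peerURLs=https?://[0-9.]+' | cut -d/ -f3)
    case "$live_ips" in
      *" $ip "*) ;;          # instance still running: keep the member
      *) echo "$id" ;;       # instance replaced: report ID for removal
    esac
  done
}
```

Each reported ID can then be removed through a healthy member, either with `etcdctl member remove <id>` or with a REST call against the v2 members API, e.g. `curl -X DELETE http://<healthy-member>:2379/v2/members/<id>`.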

 

#cloud-config

coreos: 
  etcd2: 
    advertise-client-urls: "http://$private_ipv4:2379"
    initial-advertise-peer-urls: "http://$private_ipv4:2380"
    listen-client-urls: "http://0.0.0.0:2379,http://0.0.0.0:4001"
    listen-peer-urls: "http://$private_ipv4:2380"
  units: 
    - name: etcd2.service
      command: stop
    - name: spotinst-etcd-discovery.service
      command: start
      content: |
        [Unit]
        Description=Spotinst Elastigroup discovery  
        [Service]
        ExecStartPre=/bin/bash -c '/home/core/spotinst_etcd/discovery.sh'
        ExecStart=/usr/bin/systemctl start etcd2
    - name: fleet.service
      command: start
    - name: spotinst-etcd-termination.service
      content: |
        [Unit]
        Description=Validate spot server status
        [Service]
        EnvironmentFile=/etc/environment
        Type=oneshot
        ExecStart=/bin/bash -c '/home/core/spotinst_etcd/termination.sh'
    - name: spotinst-etcd-termination.timer
      command: start
      content: |
        [Unit]
        Description=Check spot instance status
        [Timer]
        OnCalendar=*:*:0/30
        Persistent=true

write_files: 
    - content: |
          [Service]
          EnvironmentFile=/home/core/spotinst_etcd/peers
          
      path: /run/systemd/system/etcd2.service.d/30-etcd_peers.conf
      permissions: "0644"
      
    - content: |
          #!/bin/bash
          pkg="spotinst_etcd_termination"
          version="0.0.1"
          spotinst_token="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
          
          # Create directory to write output to
          mkdir -p /home/core/spotinst_etcd/
          
          ec2_instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
          if [[ ! $ec2_instance_id ]]; then
              echo "$pkg: failed to get instance id from instance metadata"
              exit 2
          fi
          
          if ! etcdctl member list | grep -q "$COREOS_PRIVATE_IPV4"; then
            echo "machine not found in etcd; restarting etcd2"
            rm -rf /var/lib/etcd2/*
            systemctl restart etcd2
          fi
          
          ec2_instance_status=$(curl -s -X GET -H "Content-Type: application/json" -H "Authorization: Bearer ${spotinst_token}" "https://api.spotinst.io/aws/ec2/instance/${ec2_instance_id}" | jq -r '.response.items[0].lifeCycleState')
          echo "ec2_instance_status=$ec2_instance_status"
          
          if [[ $ec2_instance_status = *"TERMINATING"* ]]; then
            etcd_member_id=$(etcdctl member list | grep "$COREOS_PRIVATE_IPV4" | awk '{print $1}' | awk -F':' '{print $1}')
            echo "removing etcd member from cluster: $etcd_member_id"
            etcdctl member remove $etcd_member_id
          fi

      path: /home/core/spotinst_etcd/termination.sh
      permissions: "0777"

    - content: |
          #!/usr/bin/env bash
          curl -fsSL https://s3.amazonaws.com/spotinst-labs/etcd-cluster/elastigroup-discovery.sh | \
          SPOTINST_TOKEN="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
          bash
          
      path: /home/core/spotinst_etcd/discovery.sh
      permissions: "0777"

 

Note: This script cannot handle scale-up at this time and will only maintain the initial cluster size.

Conclusion

In this write-up we created a functional etcd cluster running on Spot instances with automatic recovery. If you are interested in testing your own Spot-based etcd cluster, give it a try and share your results in the comments below!

 

-The Spotinst Team