
Stateful Applications with Spot Instances

Brings the enterprise experience to Spot Instances

Shiri Ivtsan
Customer Success


Data integrity and consistency are crucial concerns when managing workloads. They are trivial to address with On-Demand instances, but not with EC2 Spot Instances, which are ephemeral by design and can be revoked at any moment. At Spotinst, we have given deep thought to how you can leverage Spot while still handling data concerns easily and with confidence.

Many of our customers have faced this challenge across a wide range of use cases, which pushed us to come up with creative solutions. In this post, we take a deep dive into our most recent and most commonly used feature, which enables Spot Instances to handle stateful workloads as well.

If you are wondering which option to choose when working with a stateful configuration in Spotinst, this is the place for you.

There are three main factors to consider for a Spotinst stateful configuration:

  • Data location – where is your data located?
    • Root volume
    • Data volume
    • Both data volume and root volume
  • How often does your data change?
    • Periodically: apps/scripts (every 5 minutes or more)
    • Constantly: database engines
  • Does the exact same data exist on all the instances? (Relevant only for multi-instance Elastigroups)
    • Yes, the exact same data (such as Auto Scaling Groups)
    • No, different data is stored on each node (such as Cassandra, Elasticsearch, or any other NoSQL database)

TL;DR

To make things easier, we created the following flowchart to help you determine the right configuration for your use case:

[Flowchart: determining the required stateful configuration for your use case]

A few things to note:

  • The persist data volume and persist root volume options simulate a restart of the node as part of an instance replacement.
  • If the data is saved both on the root volume and on the data volume, the two options can be used in parallel.
  • Size of the data – if the data on the volume (root/EBS) is large, it is better to use Hot EBS Migration, as the exact same volume is migrated between instances during the restart. When using the 'keep data volume' option, the maintenance restart (see more details below) might take additional time, depending on the disk size.
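
To make the flowchart's branching concrete, here is a minimal sketch of the same decision logic in Python. The option names and branch order are an illustrative reading of the three factors above, not an exact transcription of the chart or of any Spotinst API:

    # Illustrative sketch only: maps the three factors above to one of the
    # stateful options described in this post. Decision logic, not an API.
    def choose_stateful_option(data_location, change_rate, same_data_everywhere):
        # data_location:        "root", "data", or "both"
        # change_rate:          "periodic" or "constant"
        # same_data_everywhere: True for ASG-style clusters, False for sharded stores
        if change_rate == "periodic" and same_data_everywhere:
            return "AMI Auto Backup"    # scheduled images cover slow-changing data
        if data_location == "data":
            return "Hot EBS Migration"  # reattach the exact same data volumes
        return "Stateful recovery"      # persist root & data volumes
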
Root and Data volumes

Usually, your data is stored on EBS volumes, and you will most likely want to keep that data available even if the EC2 Spot Instance is interrupted. Every EC2 Spot Instance has a root volume attached, and potentially additional data volumes (containing application files and data) attached to the instance as external drives, such as /dev/xvd{b…z}.

Maintaining data persistence means that the EBS relationships (volume IDs and device mappings) stay the same. To achieve this, you can either back up the entire AMI or migrate the external and root volumes.

These options map to the following solutions we offer:

Stateful recovery

If you have a stateful application, or an application designed to withstand node failure (such as a database cluster with a sharding configuration, e.g. Elastic.co or Cassandra), our stateful configuration allows you to utilize Spot Instances while maintaining 100% data integrity, recovering the full state of the instance including its private IP and network configuration. When a recovery occurs, the system automatically creates a clone of the previous instance, and it appears as if the instance simply restarted.

Please note:

  • The stateful configuration works best with shard-based clusters with a replication factor greater than 1.
  • Verify that your cluster can tolerate an instance being removed for maintenance; during a Spot interruption, the instance effectively goes through a 'restart'.
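
As a rough illustration, a stateful Elastigroup along these lines could be created with a persistence block like the sketch below. The field names reflect our reading of the Spotinst Elastigroup API; treat them as assumptions and verify them against the current API documentation:

    import requests

    # Hedged sketch: a stateful Elastigroup that persists the root volume,
    # the data volumes, and the private IP. Field names are assumptions;
    # confirm them in the Spotinst API docs before relying on this.
    group = {
        "group": {
            "name": "stateful-db-nodes",  # illustrative name
            "strategy": {
                "risk": 100,  # run 100% on Spot
                "persistence": {
                    "shouldPersistPrivateIp": True,     # recover the same private IP
                    "shouldPersistRootDevice": True,    # keep the root volume
                    "shouldPersistBlockDevices": True,  # keep the data volumes
                },
            },
        }
    }

    # "TOKEN" is a placeholder for your Spotinst API token.
    resp = requests.post(
        "https://api.spotinst.io/aws/ec2/group",
        json=group,
        headers={"Authorization": "Bearer TOKEN"},
    )
    resp.raise_for_status()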

Hot EBS Migration

If you have a stateful application, or an application that is sensitive to node interruption, you can use Hot EBS Migration to make sure your nodes are always accessible while using the same set of external volumes and configuration. When a recovery occurs, the system automatically attaches the external volumes of the previous instance, and the node continues serving from the same state.


Please note:  

  • In multi-AZ environments, a snapshotting mechanism makes the volume available to all AZs configured in the Elastigroup (as described here: Hot EBS Migration).
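
In configuration terms, Hot EBS Migration amounts to persisting the block devices in a "reattach" mode, so the very same volumes follow the replacement instance. The sketch below assumes a blockDevicesMode field with "reattach" and "onLaunch" values; check the Hot EBS Migration documentation linked above for the exact schema:

    # Hedged sketch: persistence settings for Hot EBS Migration.
    # Field names and values are assumptions; verify against the docs.
    persistence = {
        "shouldPersistBlockDevices": True,
        # "reattach" = move the same volumes to the new instance,
        # as opposed to recreating them from snapshots ("onLaunch").
        "blockDevicesMode": "reattach",
    }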

AMI Auto backup

Elastigroup allows you to create automatic, scheduled snapshots of your AMI and attached EBS volumes. With the Auto Backup feature, you can maintain data persistence within your cluster: in case of an instance replacement, Elastigroup uses the last snapshot recorded at the defined interval.

If you customized your instance with EBS volumes in addition to the root device volume, the new AMI contains block device mapping information for those volumes. When the instance is launched from this new AMI, it will automatically launch with those additional volumes.

If you have an application with periodic changes or updates to the AMI and root volume, this is the most complete solution, as it simply creates new images at your desired frequency. It is a great option for application server clusters and for clusters running behind a load balancer.

Please note:  

  • The AMI backup will be taken from a single instance in the group.
  • This is a great solution for Auto Scaling groups.
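
A scheduled backup can be expressed as an Elastigroup scheduling task. The taskType and frequency values below are assumptions about how such a task is defined; confirm the supported values in the Spotinst documentation:

    # Hedged sketch: hourly AMI backup via an Elastigroup scheduled task.
    # taskType/frequency values are assumptions; check the Spotinst docs.
    scheduling = {
        "tasks": [
            {
                "taskType": "backup_ami",  # snapshot the AMI and attached EBS volumes
                "frequency": "hourly",     # record a new image every hour
                "isEnabled": True,
            }
        ]
    }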

Use Cases

Elastic.co – recovering an Elasticsearch node takes a fraction of the time required to provision a brand-new instance. From the standpoint of your Elasticsearch cluster, the instance was only down for a maintenance restart (its duration depends on the size of the attached data volumes). No changes to your cluster are necessary, as long as you have enough instances to maintain quorum. Solution: Keep root & Data volumes

Cassandra – if your Cassandra node is replaced, we clone the instance and bring it back. Your Cassandra cluster behaves as if the instance was simply down for some time, and bringing up a clone of the previous instance ensures that cluster IOPS are not wasted on bootstrapping a new node. Solution: Keep root & Data volumes

Single Server Database – suitable for non-production environments where you do not require 100% uptime for your database instances. For production, we recommend a master/slave configuration: run the master on On-Demand instances and the slave on a stateful Spot Instance. Solution: Keep root & Data volumes
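
One way to express that split, using the same assumed field names as in the earlier sketches: pin the master's group to On-Demand (risk 0) and run the slave in a stateful Spot group (risk 100):

    # Hedged sketch: master on On-Demand, slave on a stateful Spot Instance.
    master_strategy = {"risk": 0}  # 0% Spot, i.e. On-Demand only
    slave_strategy = {
        "risk": 100,               # 100% Spot
        "persistence": {
            "shouldPersistRootDevice": True,
            "shouldPersistBlockDevices": True,
            "shouldPersistPrivateIp": True,
        },
    }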

Hadoop cluster – support for stateful Spot Instances in Spotinst Elastigroups allows you to provision Spot Instances and automatically recover the full state of the instance, including the private IP. When a recovery occurs, we automatically create a clone of the previous instance, and it appears as if the instance was brought down for a restart. For instructions, please see: Hadoop use case. Solution: Keep root & Data volumes

Kafka – Kafka's architecture is built from several components, each with its own unique role, and all of them can run on Spot Instances: brokers, ZooKeeper clusters, and consumers all run seamlessly on Spot. Solution: Keep root & Data volumes

Development / QA – you can run non-production nodes on Spot Instances and tolerate occasional downtime. If your instance is interrupted, it will be brought back automatically within a few minutes. Solution: AMI Auto-backup

 

Sincerely yours,
Customer Success team at Spotinst – Always here for you!

