
Running Hadoop and Spark using Amazon EMR with Spotinst MapReduce Group

Finish your jobs on-time

Aharon Twizer
Spotinst, Co-Founder and CTO

Amazon Elastic MapReduce

Amazon Elastic MapReduce (EMR) is a web service that simplifies big data processing. It provides a managed Hadoop framework that makes it easy, fast, and cost-effective to distribute and process vast amounts of data across dynamically scalable Amazon EC2 instances.

You can also run other popular distributed frameworks such as Spark and Presto in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3.

Spotinst MapReduce Group

Spotinst MapReduce Group is an extension of Spotinst Group that makes wise and reliable use of Amazon EC2 Spot Instances in your existing Amazon EMR cluster or your custom Hadoop environment. Spotinst EMR has a built-in auto-scaling feature that scales task nodes up or down using a cost-effective approach.

Running Spotinst MapReduce

– as part of an existing Amazon EMR Cluster

1_emr-task-scaler

NOTE: Task Nodes Scaling using Spotinst.

If you want to connect Spotinst MapReduce to an existing EMR cluster, simply provide the following details:

  • Your EMR clusterId
  • Desired distribution of Instance Types

In this example we have selected all 2xlarge and 4xlarge instances. Spotinst’s Optimizer will choose the most cost-effective ones based on their price history and availability over the past days, weeks, and months.

  • Minimum, maximum, and desired number of CPU cores.
  • Scaling policies: define alarms for adding or removing task nodes from the group.

By default, Spotinst scales according to the CPU threshold of the task nodes. You can also specify your own CloudWatch metric, for example one that reflects your pending jobs queue.
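To make the scaling policy idea concrete, here is a minimal sketch of a threshold-based decision bounded by the minimum and maximum number of CPU cores. It is only an illustration, not Spotinst's actual implementation: the `ScalingPolicy` class, the `next_core_target` function, and all threshold values are hypothetical names and numbers chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # All fields are hypothetical; they mirror the settings listed above.
    min_cores: int               # lower bound on task-node CPU cores
    max_cores: int               # upper bound on task-node CPU cores
    scale_up_threshold: float    # average CPU % above which cores are added
    scale_down_threshold: float  # average CPU % below which cores are removed
    step_cores: int              # cores added or removed per adjustment


def next_core_target(policy: ScalingPolicy, current_cores: int,
                     avg_cpu_percent: float) -> int:
    """Return the core count to aim for after one evaluation of the policy."""
    if avg_cpu_percent > policy.scale_up_threshold:
        target = current_cores + policy.step_cores
    elif avg_cpu_percent < policy.scale_down_threshold:
        target = current_cores - policy.step_cores
    else:
        target = current_cores
    # Never leave the configured [min_cores, max_cores] range.
    return max(policy.min_cores, min(policy.max_cores, target))


policy = ScalingPolicy(min_cores=16, max_cores=128,
                       scale_up_threshold=80.0, scale_down_threshold=20.0,
                       step_cores=16)
print(next_core_target(policy, current_cores=64, avg_cpu_percent=90.0))
```

The same function would work with any other metric (such as a pending-jobs count) in place of the CPU percentage, as long as the thresholds are expressed in that metric's units.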

With a single API call, you can add dozens or hundreds of task nodes to your cluster and sleep soundly: Spotinst’s Optimizer will utilize the resources in the most cost-effective manner, scale up or down according to cluster activity, and handle any failure in the Spot market.

– as a complete Amazon EMR cluster

2_emr-full-spot

NOTE: Complete EMR cluster on Spotinst.

If you wish to run jobs in a cost-oriented manner, such as short MapReduce jobs or recoverable compute jobs that use Amazon S3 as their data source, you will want to spin up a Spotinst “ClusterWatcher”.

To create a ClusterWatcher, simply provide the following:

  • A list of instance types that your workload can run with.

You can specify these separately for the master, core, and task nodes.

  • A list of availability zones in your VPC.

Spotinst’s Optimizer will choose the best availability zone to run your hardware in, and will spin up the most profitable instance types that match your workload.

  • A cluster ID to clone from.

This clones all your applications, roles, steps, and bootstrap actions.
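The choice of availability zone and instance type described above can be pictured as a ranking problem. The sketch below is one simple way to do it, assuming the only criterion is the average historical price per CPU core. The real Optimizer also weighs availability and is not documented here, so the function `cheapest_per_core` and the prices are made up for illustration.

```python
from statistics import mean


def cheapest_per_core(price_history, cores_per_type):
    """Pick the (instance_type, availability_zone) pair with the lowest
    average historical price per CPU core.

    price_history:  {(instance_type, az): [hourly prices in USD]}
    cores_per_type: {instance_type: number of vCPUs}
    """
    def cost_per_core(key):
        instance_type, _az = key
        return mean(price_history[key]) / cores_per_type[instance_type]

    return min(price_history, key=cost_per_core)


# Illustrative numbers only, not real Spot prices.
history = {
    ("m4.2xlarge", "us-east-1a"): [0.12, 0.16],
    ("m4.4xlarge", "us-east-1b"): [0.20, 0.24],
    ("m4.4xlarge", "us-east-1a"): [0.30, 0.34],
}
cores = {"m4.2xlarge": 8, "m4.4xlarge": 16}

print(cheapest_per_core(history, cores))
```

Comparing price per core rather than price per instance is what lets a larger instance win when it is proportionally cheaper.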

Finish your jobs on-time

If you wish to finish some of your jobs within a fixed time window, you can issue an API call to Spotinst with your JobId (job_XXXXXXXXXXXX_XXXX) and the time frame within which you want the job to finish. Spotinst’s Optimizer will then scale up resources according to the RemainingMapTasks, RemainingMapTasksPerSlot, and RemainingReduceTasks metrics of that specific JobId.
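As a rough illustration of the arithmetic behind deadline-driven scaling, the sketch below estimates how many task slots are needed to finish the remaining tasks in time. It assumes every task takes about the same time and that slots run tasks back to back. The function name `slots_needed` and its parameters are hypothetical, and this is not the formula Spotinst uses.

```python
import math


def slots_needed(remaining_tasks: int, avg_task_seconds: float,
                 seconds_to_deadline: float) -> int:
    """Estimate the task slots required to finish before the deadline."""
    if remaining_tasks == 0:
        return 0
    # How many tasks a single slot can run one after another in the window.
    tasks_per_slot = math.floor(seconds_to_deadline / avg_task_seconds)
    if tasks_per_slot < 1:
        raise ValueError("deadline is shorter than a single task")
    # Round up: a partially used slot is still a whole slot.
    return math.ceil(remaining_tasks / tasks_per_slot)


# 1200 remaining map tasks, about 60 s each, 30 minutes left.
print(slots_needed(1200, 60, 1800))
```

Dividing the result by the current number of slots gives a value comparable to a remaining-tasks-per-slot metric, which is one way to see why that metric is a useful scaling signal.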

See it in action

3_emr-in-action

NOTE: Spotinst’s Dashboard.

Spotinst gets wiser

As you run your clusters using Spotinst EMR, Spotinst’s Optimizer learns what resources your tasks need, how long they usually take, what your workload pattern is, and which compute resources would best serve your needs. The key is finding the best combination of workload awareness and price reduction. Spotinst will choose the instances that are the most cost-effective while still being available for your job.

