Hadoop and memcached: Performance and power characterization and analysis
© Issa and Figueira; licensee Springer. 2012
Received: 5 April 2012
Accepted: 7 June 2012
Published: 12 July 2012
Given the rapid expansion of cloud computing in the past few years, there is a pressing need to analyze and characterize the performance and power consumption of cloud workloads running on backend servers. In this research, we focus on the Hadoop framework and Memcached, which are distributed frameworks for processing large-scale, data-intensive applications for different purposes. Hadoop is used for short jobs requiring low response time; it is a popular open-source implementation of MapReduce for the analysis of large datasets, while Memcached is a high-performance distributed memory object caching system that can speed up the throughput of web applications by reducing the effect of bottlenecks on database load. In this paper, we characterize different workloads running on the Hadoop framework and Memcached for different processor configurations and microarchitecture parameters. We implement an analytical estimation model for performance and power using different server processor microarchitecture parameters. The proposed model uses an analytical method to scale processor microarchitecture parameters such as CPI with respect to processor core frequency. We also propose an analytical model to estimate how power consumption scales with processor core frequency. Combining the performance and power models enables the estimation of performance per watt for different cloud benchmarks. The proposed estimation models are verified to estimate power and performance with less than 10% error deviation.
Keywords: Performance estimation; Performance analysis; Power analysis; Power estimation; Cloud computing; Hadoop; Memcached
With the continuing growth of web services, more servers, also known as backend servers, are being added to data centers to keep up with the demand for cloud computing. These systems are scalable, manageable, and reliable in performing data-intensive requests. In this paper, we present performance and power characterizations and predictions for different cloud computing frameworks and workloads. We experiment with Memcached and Hadoop, which are used mainly by Google, Amazon, and Yahoo, among others. The main performance metric for these workloads is the latency to complete a computation operation over a cloud network. Given the infrastructure of cloud networks, many factors contribute to the latency between clients and servers. The latency is related to the processor latency, Internet access latency, disk IO latency, and the latency associated with moving data within a given cluster. In this paper, we characterize all of these latency factors and their contribution to the overall latency, and we compare different architectures, such as ATOM, Nehalem (NHM), and Westmere (WSM) Xeon processors.
In this paper, we also propose an analytical projection model for performance and performance-per-watt. The model is verified to project performance and performance-per-watt with <10% error deviation between the measured and the projected data. The projection model is based on previous work published in [2, 3], to which we added the power factor in the regression model for the performance-per-watt projection. The latency associated with executing these workloads can be divided into three categories. The first category is related to workload characteristics, such as the data block size to be processed and the number of threads requested by the client from the server. The second category is related to the processor microarchitecture, such as Cycles-per-Instruction (CPI), number of cores, number of threads, memory latency due to Last Level Cache (LLC) misses, core frequency, and processor efficiency.
The performance and power projection models are based on the overall latency related to the backend server’s processors. Performance-per-watt is defined as the rate of computation, such as a performance score, delivered for every watt consumed. The power consumed by a computer is converted into heat, so the higher the wattage, the more cooling is required, which increases the cost of maintaining a given operating temperature. The objective is to achieve a higher performance-per-watt for a given workload.
The remaining sections of this paper are organized as follows. Section 2 covers related work, in which we review other published papers on cloud performance and power characterization and evaluation. Section 3 is an overview and characterization of the Hadoop framework and of different workloads running on the Hadoop MapReduce architecture, in which we characterize performance-per-watt and performance-per-$. In Section 4, we similarly characterize the Memcached workload on different server processor architectures. In Section 5, we present a performance-per-watt characterization and analysis for disk IO. In Section 6, we present a detailed analytical projection model for performance-per-watt, and we conclude in Section 7.
Several papers on cloud workload characterization and optimization have been published, but only a few address cloud computing performance prediction models, which are most closely related to the research presented in this paper. For instance, Vianna proposed an analytical model to predict performance for a Hadoop online prototype using intra-job pipeline parallelism, with no reference to power consumption. In contrast, our analytical model projects performance and performance-per-watt for Hadoop and Memcached from a measured baseline while changing one microarchitecture variable (e.g., core frequency or Cycles per Instruction (CPI)). Our model predicts with <10% error deviation from measured numbers in all tested cases. It can be implemented simply, without the need for a simulator or traces.
Xie focuses on the optimization of MapReduce performance in heterogeneous Hadoop clusters. The paper shows performance improvements from placing data across multiple nodes so that each node has a balanced data processing load. The paper does not provide a prediction model to verify and estimate performance variations for different disks and processor architectures. It also does not analyze disk IO latency variation for different patterns, nor does it show any improvement in the power consumption associated with the proposed optimized data placement method.
Other work related to Hadoop performance includes Dejun and Chi, who characterize response time and I/O performance. Ibrahim et al. analyze Hadoop execution on virtual machines. Stewart compares the performance of several data query languages for Hadoop. All of these works focus on different aspects of and approaches to performance analysis. Our work complements them, as we also present a power analysis as well as a prediction method for performance and performance-per-watt, which is the focus of the research presented in this paper.
Leverich and Kozyrakis presented a power model estimate for a Hadoop cluster based on a linear interpolation of CPU utilization. Our power model is based on a regression prediction method. In addition, we present performance-per-watt to understand the ratio of performance relative to power for a given processor architecture.
Wiktor presented a comprehensive study of the Hadoop configuration parameters affecting query performance, focusing on data size, number of nodes, number of reducers, and other configuration variables. This study complements our characterization of various cloud workloads running on the Hadoop framework. Our focus in this paper is the characterization of performance, performance-per-watt, and performance-per-$ for different backend server processors. We also propose a prediction method to project performance and performance-per-watt for different processor microarchitecture variables.
Jiang conducted an in-depth performance analysis of MapReduce. The research presented optimization methods to improve performance, but it does not present the impact of these improvements on performance-per-watt and performance-per-$.
Hadoop overview and characterization
Hadoop is a framework used to process large data sets in a distributed computing environment. Its underlying storage architecture is HDFS (Hadoop Distributed File System), which provides fault tolerance by replicating data blocks. In the Hadoop architecture, the NameNode stores information on data blocks, the DataNodes store data blocks and host MapReduce computation, and the JobTracker tracks jobs and detects failures. Hadoop is based on Google’s MapReduce, in which an application is broken into small parts or blocks that can run on any node, so that applications can run on systems with thousands of nodes. The Hadoop framework includes several benchmarks such as Sort, Word Count, Terasort, Kmeans iterations, and NutchIndexing. These benchmarks are based on distributed computing and storage. Apache Hadoop has an architecture similar to the MapReduce runtime used by Google, and it runs on the Linux operating system. Hadoop accesses data via HDFS, which maps all the local disks of the computing nodes to a single file-system hierarchy, allowing the data to be dispersed across all the data/computing nodes. HDFS also replicates the data on multiple nodes so that failures of nodes containing a portion of the data will not affect the computations that use that data.
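The MapReduce flow described above can be illustrated with a minimal, single-process sketch of a word count. This is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and the blocks that Hadoop would distribute across DataNodes are simply processed in sequence here.

```python
from collections import defaultdict
from itertools import chain


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one input block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(blocks):
    # Each block could be mapped on a different node; here they run in sequence.
    mapped = chain.from_iterable(map_phase(block) for block in blocks)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    blocks = ["Hadoop stores data blocks", "Hadoop replicates data"]
    print(word_count(blocks))
```

Because the map step works on each block independently, the blocks can be processed on whichever node holds them, which is the property that lets such jobs scale out.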
Workloads based on Hadoop framework: System Resource Utilization
System resource utilization
Sort Phase: IO-bound, Reduce Phase: Communication-bound.
Map stage: CPU-bound, Reduce stage: IO-bound.
IO-bound with high CPU utilizations in map stage. This workload is mainly used for web searching.
CPU-bound in iteration, IO-bound in clustering. It is used for machine learning and data mining.
The disk I/O bandwidth limits the performance of the IO-bound benchmarks, so adding more disks may improve performance. Memory can also be a performance-limiting factor for computation-bound workloads such as Terasort, so adding more memory increases the memory buffers and reduces the amount of data being moved back to disk. Memory is likewise a limiting factor for Memcached, which we discuss in a later section. There is a sharp split between CPU-bound and memory-bound workloads: the most important characteristic affecting the performance of any workload on any system is the number of main-memory transactions it performs.
For CPU-bound workloads, performance is gated by activity on the processor. The important performance parameters are the core frequency and the latency and bandwidth of the processor caches. Therefore, systems are cheaper to build for CPU-bound workloads. Memory-bound workloads are the opposite: performance is determined mainly by off-chip events, namely how many main-memory transactions can be completed per unit time, i.e., by the bandwidth actually achieved to and from main memory.
In a Hadoop cluster, a master node controls a group of slave nodes by assigning tasks to them based on their availability. In this section, we characterize the performance and power of the Hadoop framework based benchmarks on ATOM and Xeon-based systems:
· ATOM: Core Frequency = 1.66 GHz, # of cores = 2, Threads/core = 2, L2 cache size = 1 MB, DDR2-667/800, Memory BW = 6.4 GB/s.
· Xeon (NHM): Core Frequency = 2.93 GHz, # of cores = 4, Threads/core = 2, Memory BW = 32 GB/s.
Each server in a Hadoop cluster can be configured to handle a specific capacity. The JobTracker performs task assignment, while the NameNode maintains the HDFS, which requires high RAM capacity. The TaskTracker performs the map-reduce tasks, and the DataNode stores data and handles read/write operations for HDFS. All Hadoop applications can be categorized as I/O-bound, compute-bound, or in between. This makes it critical to have configurations with an optimal memory size and number of processor sockets, and a large number of hard drives. Data locality in Hadoop/MapReduce determines its performance, as Hadoop usually distributes data blocks to multiple nodes based on disk space availability. This is a fair distribution in a homogeneous cluster environment. In a heterogeneous computing environment, we have a combination of fast and slow nodes: the faster nodes complete the processing of their data sooner than the slower nodes, and the slower nodes have to transfer part of their data to the faster nodes for processing.
Hadoop characterization and measurements results
Table: Hadoop framework based workloads, NHM vs. ATOM latency and throughput. Columns: NHM time (sec); ATOM time (sec); speedup ratio (NHM vs. ATOM); NHM throughput (tasks completed/min); ATOM throughput (tasks completed/min); speedup ratio (NHM vs. ATOM).
Table: Hadoop framework based workloads, NHM vs. ATOM performance-per-watt and performance-per-$. Columns: NHM average power (W); ATOM average power (W); performance-per-watt (NHM vs. ATOM); NHM server cost ($); ATOM server cost ($); performance-per-$ (NHM vs. ATOM).
Table: Performance, price, and power efficiency for WSM vs. ATOM. Columns: job running time (sec); speedup (WSM vs. ATOM); server cost ($); cost ratio (WSM vs. ATOM); performance-per-$ ratio (WSM vs. ATOM); average power (W); power ratio (WSM vs. ATOM); performance-per-watt ratio (WSM vs. ATOM).
Table: Microarchitecture parameters for NHM vs. ATOM. Columns: memory read latency (ns); LLC cache misses/byte.
The microarchitecture parameters show that NHM has a lower CPI than ATOM for all benchmarks, as well as higher memory bandwidth and fewer Last Level Cache (LLC) misses. This shows the clear advantage of NHM over ATOM, which is consistent with the conclusion drawn from the performance and power numbers.
Note that Terasort is implemented as a MapReduce sort job with a custom partitioner. It uses a sorted list of n-1 sampled keys that define the key range for each reduce. Terasort is tested on the ATOM D525 processor at a 1.8 GHz core frequency with a two-core/four-thread configuration. We used 10 GB and 100 GB data sizes in compressed and uncompressed modes with combinations of different map and reduce factors.
Hadoop Wordcount does not work very efficiently for small inputs (less than 10 GB for Xeon and 7 GB for ATOM). Beyond 11 GB for Xeon, the execution time increases almost linearly with the input size, which makes the processing rate (MB/s) almost constant. This observation is used to operate Hadoop within certain input sizes for optimized performance, as shown in Figure 3 and Figure 4.
In summary, for the Hadoop characterization, the performance of the NHM server is 3.7× to 12.5× that of the ATOM server, depending on the characteristics of the workload. The NHM server is better than the ATOM server for most workloads in terms of performance-per-watt (except Wordsort), but it is no better than ATOM in terms of performance-per-$. Microarchitecture metrics also show an advantage for the NHM server over the ATOM server.
Memcached overview and characterization
Memcached is a free, open-source, high-performance, distributed-memory object caching system. Its architecture is based on a distributed caching layer, which enables the aggregation of spare memory from multiple nodes. It is typically used to cache the results of database queries, which are very network intensive, and it is intended to speed up dynamic web applications by alleviating database load. It is used at large sites such as Facebook, Twitter, and YouTube, and it is well suited to websites with high database loads. Memcached is an in-memory key-object store for small blocks of data from database calls, API calls, or page rendering. It uses a simple text protocol with simple operations such as get, insert, replace, delete, and append. One issue with Memcached is that, to keep connections fast, it has no built-in security features such as authentication. This issue can be mitigated by deploying a firewall and restricting access.
In Memcached clusters, there is no cross-communication among servers; only clients communicate with the servers. Client libraries may be PHP, C, Java, or Python programs and carry the server list. Clients use consistent hashing to select a unique server per key.
The two main functions in Memcached are storing and getting data. The “Store” operation is usually transmitted over TCP to ensure that the data is copied correctly with no errors, which requires more network bandwidth for large data sizes, while the “Get” operation can be done over UDP, which requires less network bandwidth but offers no delivery guarantee.
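The client-side server selection mentioned above can be sketched as a small consistent-hash ring. This is an illustrative sketch rather than the code of any particular Memcached client library: the class name ConsistentHashRing, the replicas parameter (virtual points per server), the use of SHA-256, and the server addresses are all assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps each key to one server; hypothetical helper, not a real client API."""

    def __init__(self, servers, replicas=100):
        # Place several virtual points per server on the ring to even out load.
        ring = []
        for server in servers:
            for i in range(replicas):
                ring.append((self._hash(f"{server}#{i}"), server))
        ring.sort()
        self._ring = ring
        self._hashes = [h for h, _ in ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def server_for(self, key):
        # Walk clockwise to the first virtual point at or after the key's hash.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = ConsistentHashRing(["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"])
    for key in ("user:1", "user:2", "page:home"):
        print(key, "->", ring.server_for(key))
```

The point of the ring is that removing or adding one server only remaps the keys that belonged to that server's points, so most cached objects stay where the clients expect them.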
Memcached characterization and measurements
In this section, we characterize Memcached throughput with respect to power to determine the performance-per-watt. This characterization is used as a baseline for the projection model we derive in a later section. For this experiment, we describe the System-Under-Test (SUT) and client setup used for the characterization of Memcached. The SUT components are configured as follows:
1 ATOM (D525/1P*2c/1.8 GHz/4 GB/1.80 GHz/82574 L–e1000e/i386)
1 WSM-EP (L5640 S5500WB/2P*6c/2.3 GHz/24 GB/2.27 GHz/82576-igb/X86-64)
OS: RHEL5.4 (updated to 22.214.171.124 for RPS/RFS patch)
Memcached: 1.3.3 (partition patch)
7 WSM + 4 extra SNB
Memcached: default 80 client threads
Binary Protocol + Modular/Default Hashing
Preload 100 K 64B* objects default
Pure Get/Multi-Get Operations
Persistent TCP connections
Note that the item size used in Facebook is 64B.
ATOM: Threads and Partitions
ATOM: Object Size
Memcached can be scaled out simply by adding nodes, but adding nodes is not recommended because it increases power consumption. Another issue is that each client needs to set up TCP connections to all nodes, which leads to incast issues for multi-get operations, where latency increases as the number of client requesting threads increases.
To confirm the latency observations we have seen in cloud clusters, we set up Memcached on one WSM machine and ran memslap (a traffic generator) from a different WSM system over a dedicated network. On the server side, Memcached defaults to four threads, while on the client side, we varied the number of requesting threads and the data size.
Table: Memcached data results comparison (performance, power, and microarchitecture parameters). Systems compared: ATOM D525 (1 socket * 2 cores) and WSM (2 sockets * 6 cores).
There is a significant difference in throughput (~14.79×) between the two systems at almost the same latency. Several factors make the performance of WSM better than that of the ATOM D525. The first factor is that WSM has 24 threads (2 sockets * 6 cores * 2 threads/core), while ATOM has four threads. The second factor is that CPU utilization for ATOM is higher than for the WSM processor. The difference in core frequency and memory size is also a contributing factor. However, the increased performance of WSM comes at the expense of power consumption: the power for ATOM is ~5.9× lower than the power consumption for WSM, and the price difference (performance-per-$) is an advantage for ATOM.
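As a worked example, the relative performance-per-watt of the two systems follows directly from the two ratios quoted above. The helper name perf_per_watt_ratio is hypothetical, and the calculation assumes that the ~14.79× throughput ratio and the ~5.9× power ratio refer to the same measurement conditions.

```python
def perf_per_watt_ratio(throughput_ratio, power_ratio):
    """Relative performance-per-watt of system A versus system B.

    throughput_ratio: throughput(A) / throughput(B)
    power_ratio:      power(A) / power(B)
    """
    if power_ratio <= 0:
        raise ValueError("power_ratio must be positive")
    return throughput_ratio / power_ratio


if __name__ == "__main__":
    # Ratios taken from the text: WSM vs. ATOM D525 for Memcached.
    wsm_vs_atom = perf_per_watt_ratio(14.79, 5.9)
    print(f"WSM vs. ATOM performance-per-watt: {wsm_vs_atom:.2f}x")  # about 2.51x
```

Under that assumption, WSM delivers roughly 2.5 times the throughput per watt of the ATOM D525, even though its absolute power draw is much higher.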
Recommendations to reduce performance bottlenecks
From the characterization results for Memcached, we identified three different kinds of bottlenecks. The first bottleneck is the CPU, for wimpy core-based servers when the objects are small. The second bottleneck is the network bandwidth, when the object size is large enough. The third bottleneck is user-level contention on the cache lock. This can be resolved or minimized by partitioning the hash table so that each partition uses its own cache lock. Running multiple instances on a single node is another way to partition the big hash table. From the measured power data, we can also conclude that the WSM system is more power efficient than the ATOM D525 and provides higher power proportionality in a wimpy-core based server.
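The lock-partitioning idea can be sketched as follows. This is a minimal illustration in Python, not the Memcached partition patch itself: the class name PartitionedCache and the num_partitions parameter are hypothetical, and a real server would do this in C around its hash table and LRU structures.

```python
import threading


class PartitionedCache:
    """Hash table split into partitions, each guarded by its own lock."""

    def __init__(self, num_partitions=8):
        self._partitions = [{} for _ in range(num_partitions)]
        self._locks = [threading.Lock() for _ in range(num_partitions)]

    def _index(self, key):
        # The key's hash decides which partition (and which lock) is used.
        return hash(key) % len(self._partitions)

    def set(self, key, value):
        i = self._index(key)
        with self._locks[i]:
            self._partitions[i][key] = value

    def get(self, key, default=None):
        i = self._index(key)
        with self._locks[i]:
            return self._partitions[i].get(key, default)


if __name__ == "__main__":
    cache = PartitionedCache(num_partitions=4)
    cache.set("user:1", b"payload")
    print(cache.get("user:1"))
```

Threads that touch keys in different partitions no longer wait on one global lock, so contention falls roughly with the number of partitions as long as the keys are spread evenly.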
Disk IO performance and power evaluation
Several cloud workloads using the Hadoop framework are IO-bound, which means that disk IO performance becomes a bottleneck for achieving higher performance. Memcached is a memory-bound workload, so the disk IO performance-per-watt characterization may not be applicable to Memcached.
For example, in Hadoop, the data for some operations may not fit entirely in main memory, so disk IO operations are needed to complete the operation on servers with little RAM. For such workloads, the disk latency is an important factor that affects performance.
In this section, we evaluate disk IO performance and power on the ATOM D525 and WSM EP X5660 systems used in our experiments. For this evaluation, we used a disk traffic generator tool to drive IO load with different IO parameters and collect IO performance. We used a performance-profiling tool to look at CPU utilization at different IO parameters. A power meter (Yokogawa) is used to collect power at idle and under different load cases.
Disk IO evaluation methodology
In this section, we discuss the method used to evaluate disk IO on both processors. Several parameters affect disk I/O behavior; only the following combinations are selected in our experiment. The read/write ratio is either read-only (100/0) or write-only (0/100), for both the sequential and random patterns. The block size is 4 KB for the random pattern and 32 KB for the sequential pattern. The Queue Depth (QD) values used to vary the load are 1, 2, 4, 8, 16, and 32. The performance indicator is IOPS for the random pattern and IO bandwidth (IOBW, in KB/sec) for the sequential pattern.
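The parameter combinations listed above can be enumerated programmatically. The sketch below only builds the list of test cases; the dictionary keys and the function name build_test_matrix are hypothetical, and driving an actual traffic generator is left out.

```python
from itertools import product

# Block size in KB for each access pattern, as stated in the text.
PATTERNS = {"random": 4, "sequential": 32}
# (read %, write %): read-only and write-only.
RW_RATIOS = [(100, 0), (0, 100)]
QUEUE_DEPTHS = [1, 2, 4, 8, 16, 32]


def build_test_matrix():
    """Return one dict per (pattern, read/write ratio, queue depth) combination."""
    cases = []
    for (pattern, block_kb), (read_pct, write_pct), qd in product(
        PATTERNS.items(), RW_RATIOS, QUEUE_DEPTHS
    ):
        cases.append(
            {
                "pattern": pattern,
                "block_kb": block_kb,
                "read_pct": read_pct,
                "write_pct": write_pct,
                "queue_depth": qd,
                # IOPS for random access, bandwidth for sequential access.
                "metric": "IOPS" if pattern == "random" else "KB/s",
            }
        )
    return cases


if __name__ == "__main__":
    matrix = build_test_matrix()
    print(len(matrix), "test cases")  # 2 patterns x 2 ratios x 6 queue depths = 24
    print(matrix[0])
```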
The IO performance and latency are well matched on the ATOM D525 and WSM X5660 platforms, with 3% better performance and 3% lower latency at QD = 1 in the random write pattern, and 9% worse performance and 2% higher latency at QD = 1 in the sequential write pattern. Latencies are proportional to QD for both patterns.
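One way to read the proportionality between latency and QD is through Little's law (outstanding requests = throughput × latency). The sketch below assumes that the disk is saturated, so that IOPS stays roughly constant as QD grows; the IOPS value used is a made-up illustration, not a measurement from this study.

```python
def mean_latency_ms(queue_depth, iops):
    """Little's law: mean latency = outstanding requests / throughput."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return queue_depth / iops * 1000.0


if __name__ == "__main__":
    assumed_iops = 8000.0  # hypothetical saturated throughput
    for qd in (1, 2, 4, 8, 16, 32):
        print(f"QD={qd:2d}  latency ~ {mean_latency_ms(qd, assumed_iops):.3f} ms")
```

With throughput held constant, doubling the queue depth doubles the mean latency, which matches the linear trend reported for both patterns.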
Disk IO power efficiency
Disk IO performance-per-watt for ATOM D525 Disk
Disk IO performance-per-watt for WSM Disk
At peak performance (QD = 32), IOPS is the unified performance indicator for both random and sequential patterns. The ATOM D525 shows much better performance-per-watt than the WSM X5660 for all patterns. Sequential patterns show better performance-per-watt than random patterns on both platforms. In summary, the ATOM D525 and WSM-EP X5660 showed similar disk I/O performance, but the ATOM D525 showed much better performance-per-watt.
Performance-per-watt estimation model
The path length (PL) increases with the number of processes because of increased inter-process communication. Our measurements suggest that the number of instructions increases logarithmically with the number of processes, assuming PL is independent of the platform. The a and b variables for the path length are determined by curve fitting the total number of instructions retired against the number of cores. The a and b variables are constant for a given benchmark and change from one benchmark to another. We used the Amdahl’s law regression method to analyze the CPI scaling with respect to higher core frequencies [13, 14]. The projection model requires at least two measured data points to establish a measured baseline. This baseline is measured on a processor of similar architecture to the one for which we are projecting. For example, if our measured baseline is for the ATOM with two performance data points at two different core frequencies, we can feed this baseline into the model to project performance for the same ATOM architecture at higher core frequencies. If we have to project for a different processor architecture family, a new measured baseline is required. We also derive the maximum performance a processor can achieve as the core frequency increases to higher values.
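The two-point baseline and the frequency limit described above can be illustrated with a small regression sketch. The model form is an assumption made for illustration, not a reproduction of the regression used in this work: execution time is split into a part that does not scale with core frequency (for example, memory stalls) and a part that scales as 1/f, in the spirit of Amdahl's law. The function names and the data points are hypothetical.

```python
def fit_time_vs_frequency(points):
    """Least-squares fit of T(f) = fixed + scaled / f.

    points: list of (frequency_ghz, time_s) with at least two distinct frequencies.
    Returns (fixed, scaled).
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two measured points")
    xs = [1.0 / f for f, _ in points]  # regress time against 1/f
    ys = [t for _, t in points]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("frequencies must be distinct")
    scaled = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    fixed = mean_y - scaled * mean_x
    return fixed, scaled


def project_time(fixed, scaled, frequency_ghz):
    """Projected execution time at a core frequency that was not measured."""
    return fixed + scaled / frequency_ghz


if __name__ == "__main__":
    # Hypothetical baseline: two measurements on the same architecture.
    baseline = [(1.5, 110.0), (2.0, 90.0)]
    fixed, scaled = fit_time_vs_frequency(baseline)
    print("projected time at 3.0 GHz:", project_time(fixed, scaled, 3.0))
    # As frequency grows without bound, time approaches the fixed part,
    # which bounds the best achievable performance at 1 / fixed.
    print("time limit at very high frequency:", fixed)
```

Under this assumed form, the fitted fixed term is what caps performance at very high core frequencies, and a separate fit is needed for each architecture family because both terms change with the microarchitecture.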
Table: Wordcount time (sec), projected vs. measured. Processors: ATOM-D510 @ 1.66 GHz; ATOM-D525 @ 1.88 GHz; NHM @ 2.93 GHz.
Table: Wordsort time (sec), projected vs. measured. Processors: ATOM-D510 @ 1.66 GHz; ATOM-D525 @ 1.88 GHz; NHM @ 2.93 GHz.
The error deviation between projected and measured times is <10% for all three processors. In summary, the performance model consists of two parts. The first part is the Amdahl’s law regression method, which is used to analyze the sensitivity curve for CPI at different core frequencies for a given processor architecture. The limitation of the Amdahl’s law regression method is that each processor architecture family (e.g., ATOM or Xeon) needs its own measured baseline to be able to project for different core frequencies.
Once we obtain the CPI for a given frequency on a given processor architecture, we use that CPI in Eq. (2), in which we also have to include the number of cores and the data size to predict the execution time. Each workload requires its own path length equation. For example, in Hadoop, the Wordcount path length equation cannot be used for Wordsort. However, the same path length equation is common to different processor architectures, because the path length is derived for a specific workload, not for a processor architecture. As indicated in Table 9, we used the same path length for Wordcount for both ATOM and NHM, which have different architectures, and the projection results showed <10% error deviation.
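A rough sketch of how the path length and CPI combine into an execution time is given below. Eq. (2) itself is not reproduced in this text, so the sketch relies on the textbook relation time = instructions × CPI / frequency, and it assumes a path length of the form PL = a·ln(N) + b in instructions per byte and an ideal division of work across cores. The function names, the units, and the values of a and b are all hypothetical.

```python
import math


def path_length(num_processes, a, b):
    """Assumed logarithmic model: PL = a * ln(N) + b (instructions per byte)."""
    return a * math.log(num_processes) + b


def execution_time_s(data_bytes, num_processes, cpi, freq_hz, cores, a, b):
    """Textbook estimate: instructions * CPI / (frequency * cores)."""
    instructions = data_bytes * path_length(num_processes, a, b)
    cycles = instructions * cpi
    return cycles / (freq_hz * cores)


if __name__ == "__main__":
    # All inputs below are illustrative placeholders, not measured values.
    t = execution_time_s(
        data_bytes=10e9, num_processes=4, cpi=1.5,
        freq_hz=2.93e9, cores=4, a=5.0, b=40.0,
    )
    print(f"estimated time: {t:.1f} s")
```

Because a and b describe the workload rather than the processor, the same pair can be reused across architectures while the CPI and frequency inputs change, which is the reuse described in the paragraph above.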
Power projection model
In summary, the performance-per-watt model presented in this paper combines the performance prediction method and the power prediction method described earlier. By taking the ratio of projected performance to projected power, the method shows how a processor will behave at higher core frequencies. This enables the analysis of performance-per-watt for core frequencies we cannot measure, or for processors of similar architecture (e.g., with a higher core count or core frequency) that are not yet available on the market for measurement.
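The ratio described above can be sketched in a few lines. The sketch takes projected execution times and projected average power as inputs, however they were obtained, and treats performance as the inverse of execution time; the function names and the sample projections are hypothetical.

```python
def performance_per_watt(time_s, power_w):
    """Performance taken as 1/time (jobs per second), divided by average power."""
    if time_s <= 0 or power_w <= 0:
        raise ValueError("time and power must be positive")
    return (1.0 / time_s) / power_w


def best_operating_point(projections):
    """projections: dict mapping frequency_ghz -> (projected_time_s, projected_power_w).

    Returns the frequency with the highest projected performance-per-watt.
    """
    return max(projections, key=lambda f: performance_per_watt(*projections[f]))


if __name__ == "__main__":
    # Illustrative projections only; not data from this study.
    projections = {1.0: (100.0, 10.0), 2.0: (60.0, 20.0), 3.0: (50.0, 40.0)}
    for freq, (t, p) in projections.items():
        print(f"{freq} GHz: {performance_per_watt(t, p):.6f} jobs/s/W")
    print("best frequency:", best_operating_point(projections))
```

In this made-up example the lowest frequency wins because time improves more slowly than power grows, which is the kind of trade-off the ratio is meant to expose.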
We presented a detailed performance and power analysis and characterization for Hadoop and Memcached workloads, which led to the identification of several bottlenecks that can be avoided to improve performance. The performance, cost, and power analyses were carried out on different processor architectures, namely WSM, NHM, and ATOM processors running in a backend server cluster. We identified several performance and power bottlenecks, along with the optimum operating points. In addition, we provided a comparison of performance-per-$ between different processor architectures. Both performance-per-watt and performance-per-$ need to be maximized for an optimum solution in a cloud cluster. Furthermore, we proposed an analytical projection model that projects performance and performance-per-watt with an error deviation of <10% between projected and measured data. More importantly, the projection model is flexible: it can be applied by establishing a measured CPI baseline for a given processor architecture and projecting from that baseline to a different core frequency of the same processor architecture. The method does not require traces or simulations; it only requires code to implement the model.
Joseph A. Issa received his B.E. in computer engineering from Georgia Institute of Technology in 1996. He obtained his master’s degree in computer engineering at San Jose State University in 2000. He is currently a PhD candidate in computer engineering at Santa Clara University. His research interests are in the areas of performance and power prediction, analysis, and characterization.
Silvia M. Figueira received the B.S. and M.S. degrees in Computer Science from the Federal University of Rio de Janeiro (UFRJ), Brazil, and the Ph.D. degree in Computer Science from the University of California, San Diego. Currently, she is an Associate Professor of Computer Engineering at Santa Clara University. Her research interests are in the areas of performance and energy consumption modeling and prediction.
1. Apache Software Foundation: Official apache hadoop website. 2011. http://hadoop.apache.org
2. Issa J, Figueira S: Graphics Performance Analysis Using Amdahl's Law: IEEE/SCS SPECTS. International Symposium on Performance Evaluation of Computer and Telecommunication System, Ottawa, Canada; 2010.
3. Issa J, Figueira S: Performance and power-consumption analysis of mobile internet devices. IEEE IPCC–International Performance Computing and Communications Conference, Orlando, Florida; 2011.
4. Vianna E: Modeling performance of the hadoop online prototype. International Symposium on Computer Architecture; Vitoria, Espirito Santo; 2011.
5. Xie J: Improving Map Reduce performance through data placement in heterogenous Hadoop clusters. IEEE International Symposium on Parallel & Distributed Processing, Atlanta, Georgia, USA; 2010.
6. Dejun J, Chi GPC: EC2 performance analysis for resource provisioning of service-oriented applications. International Conference on Service-Oriented Computing, Stockholm, Sweden; 2009.
7. Ibrahim S, Jin H, Lu L, Qi L, Wu S, Shi X: Evaluating MapReduce on virtual machines: The hadoop case. International Conference on Cloud Computing, Beijing, China; 2009.
8. Stewart R: Performance and Programmability of High Level Data Parallel Processing Languages. 2010. http://www.macs.hw.ac.uk/~rs46/files/publications/MapReduce-Languages/Complete_Results_Chapter.pdf.old
9. Leverich J, Kozyrakis C: On the energy (in)efficiency of Hadoop clusters. ACM SIGOPS Operating Systems Review, New York; 2010:61–65.
10. Wiktor T, et al.: Performance analysis of hadoop for query processing. IEEE International Conference on Advanced Information Networking and Applications, Biopolis; 2011.
11. Jiang DR, Ooi BB, Shi L, Wu S: The performance of mapreduce: an in-depth study. Proceedings of the Very Large Database Endowment 2010, 3(1–2):472–483.
12. Berezecki M, Frachtenberg E, Paleczny M, Steele K: Many-Core Key-Value Store. International Green Computing Conference and Workshops (IGCC), Orlando, Florida; 2011.
13. Hennessy JL, Patterson DA: Computer architecture: A quantitative approach. 4th edition. Elsevier, Morgan Kaufmann; 2007.
14. Hennessy JL, Patterson DA: Computer organization & design: The hardware/software interface. 4th edition. Elsevier, Morgan Kaufmann; 2009.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.