Performance characterization and analysis for Hadoop K-means iteration
© Issa. 2016
Received: 10 August 2015
Accepted: 6 March 2016
Published: 17 March 2016
The rapid growth in demand for cloud computing and cloud data presents a performance challenge for both software and hardware architects. It is important to analyze and characterize data processing performance for a given cloud cluster and to evaluate the bottlenecks that contribute to higher or lower processing time. In this paper, we implement a detailed performance analysis and characterization of Hadoop K-means iterations by scaling different processor micro-architecture parameters and by comparing performance on Intel and AMD processors. This analysis of the underlying hardware in cloud cluster servers enables software and hardware optimization toward the maximum achievable performance. We also propose a performance estimation model that estimates performance for Hadoop K-means iterations by modeling different processor micro-architecture parameters. The model is verified to predict performance with less than a 5 % error margin relative to a measured baseline.
Keywords: Performance prediction; Performance analysis; Hadoop; K-means; Iterations
Given the rapid growth in the demand for cloud computing [1, 2] and cloud data, there is an increasing need to store, process and retrieve large amounts of data in a cloud cluster. The data can either be stored on a cloud network, as with scientific data (e.g., climate modeling, fusion, bioinformatics), or the cloud network can be used for data-intensive tasks such as collecting experimental data, dumping data onto parallel storage systems, and running large-scale simulations. Cloud computing is an emerging technology used to deliver different types of resources, known as services, over the internet. Cluster computing [3–7] connects a set of stand-alone computers to form a single computing resource [8, 9]. This improves the performance and availability of a cloud cluster as compared to a single computer.
Hadoop was introduced as a solution for processing, storing and retrieving Big Data in a cloud environment, and it usually runs on a cluster of commodity machines. This cluster is composed of master and slave nodes that process and compute data in parallel. It is important for processor architects to understand which processor micro-architecture parameters contribute to higher or lower performance. It is also important for benchmark developers to optimize benchmark software for a given hardware platform to achieve the maximum performance possible. Hadoop is an open-source framework with two main components: MapReduce and the Hadoop Distributed File System (HDFS). HDFS is the primary storage for Hadoop; it is highly reliable, uses sockets for communication and is used for distributed storage [11, 12]. One important feature of HDFS is the partitioning of data and computation across thousands of hosts, and the execution of application computations in parallel, close to their data [13–16]. A Hadoop cluster scales its computation and storage capacity by adding more servers. For example, Yahoo's Hadoop cluster uses 40,000 servers and stores 40 PetaBytes of application data. HDFS provides data protection and reliability by replicating file content across multiple DataNodes. This replication also increases the probability of locating computation near the needed data.
The MapReduce [17, 10] framework is used for parallel processing. MapReduce and HDFS are co-designed, co-developed and co-deployed: a single set of servers runs both, so there is no separate set of servers for HDFS to store data and another set to process it. One important aspect of MapReduce is its capability of moving compute to the data (the DataNode on which the data is located) rather than the other way around. MapReduce knows where the data is placed in a cluster by working closely with HDFS. MapReduce consists of two main components: the JobTracker and the TaskTracker. The JobTracker is the master; it is responsible for resource management, such as tracking which nodes are up or down and dealing with node failures. The TaskTracker is the slave; it takes direction from the JobTracker to run tasks, schedules them and reports any failures.
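The map/shuffle/reduce flow described above can be illustrated with a minimal, in-process sketch (plain Python standing in for the Hadoop Java API; the function names here are ours, not Hadoop's):

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit (key, value) pairs; here, one (word, 1) pair per word.
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Reduce: combine all values emitted for one key.
    return key, sum(values)

def run_job(records):
    # Shuffle: group intermediate pairs by key, as Hadoop does
    # between the map and reduce stages.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = run_job(["big data", "big cluster"])
print(counts)  # {'big': 2, 'data': 1, 'cluster': 1}
```

In a real Hadoop deployment each map and reduce task runs on the DataNode holding its input split, which is the "moving compute to the data" behavior described above.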
Workloads based on the Hadoop framework and their system resource utilization:
- Sort: IO-bound in the sort phase; communication-bound in the reduce phase.
- CPU-bound in the map stage; IO-bound in the reduce stage.
- IO-bound with high CPU utilization in the map stage; mainly used for web searching.
- K-means: CPU-bound in the iteration phase; IO-bound in the clustering phase. Used for machine learning and data mining.
In this paper, we present a detailed performance characterization of Hadoop K-means iterations using different processor configurations. We also propose a performance projection model that projects performance as different processor architecture parameters change, such as the number of cores/threads, memory bandwidth, memory size, cycles-per-instruction (CPI) and memory latency [18, 19]. The remainder of this paper is organized as follows: In "Hadoop K-means Overview" section, we start with an overview of the Hadoop K-means and Mahout K-means implementations. In "Related Work" section, we compare our work to other published work on the same topic. In "Performance Characterization using Intel Xeon Based Platform" section, we present a detailed performance characterization of Hadoop K-means for key processor architecture parameters using an Intel Xeon processor. In "Performance Characterization using AMD Interlagos Platform" section, we present a detailed performance analysis and characterization of Hadoop K-means using an AMD Interlagos processor by analyzing the performance sensitivity to key processor architecture parameters. In "Performance Projection Model" section, we propose a performance projection model that projects processor performance and total runtime. Finally, we conclude and discuss future work.
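A first-order analytical model over the parameters listed above can be sketched as follows. This is an illustrative, hypothetical form under our own assumptions, not the paper's exact equations: compute time scales with CPI and inversely with cores times frequency, and an additive term charges memory-latency stalls.

```python
def projected_runtime_s(instructions, cpi_core, cores, freq_hz,
                        mem_accesses, mem_latency_s):
    # Hypothetical first-order projection (illustrative only):
    # compute time = instructions * CPI / (cores * frequency),
    # plus an additive memory-stall term.
    compute_s = instructions * cpi_core / (cores * freq_hz)
    memory_s = mem_accesses * mem_latency_s
    return compute_s + memory_s

# Illustrative inputs (not measured data): 1e12 instructions at
# CPI 1.2 on 8 cores at 2.7 GHz, with 1e9 memory accesses of 80 ns.
t = projected_runtime_s(1e12, 1.2, 8, 2.7e9, 1e9, 80e-9)
print(round(t, 1))  # 135.6
```

A model of this shape makes the scaling trends in the later sections easy to reason about: doubling `cores` halves the compute term but leaves the memory term untouched.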
Hadoop K-means overview
Hadoop is designed as a framework for processing (storing and appending) multi-petabyte data sets on a distributed computing cluster. There are several components in the Hadoop architecture. The first component is the NameNode, which is responsible for storing the file system namespace. The second component is the DataNodes, which are responsible for storing blocks and hosting MapReduce computation. The JobTracker component is responsible for tracking jobs and for detecting any failures. All applications in Hadoop are based on MapReduce, which was introduced by Google. MapReduce means that a given application can be broken down into smaller blocks that can run on any node, so the application can run on systems with thousands of nodes to achieve better performance. Hadoop also ships with several micro-benchmarks, including Sort, Word Count, TeraSort, K-means, and NutchIndexing. The file system in Hadoop maps all the local disks in a cluster into a single file system hierarchy known as HDFS. Hadoop K-means is used mainly for machine learning and data mining. It is divided into two main phases: the iteration phase and the clustering phase. In the iteration phase, the workload is CPU-bound, which means performance increases with processing power, for example with the number of cores. In the clustering phase, the workload is IO-bound, which means performance is limited by IO communication within the cluster. Clustering is a technique used to identify groups (clusters) within the input observations such that objects within each group have high similarity to each other and low similarity to objects in other groups or clusters.
The similarity metric in the clustering algorithm uses distance measures only; similarity by correlation is not used. K-means clustering generates a specific number of disjoint (non-hierarchical) clusters. The K-means method is numerical, unsupervised and iterative. K stands for the number of clusters and must be supplied manually by the user based on the input data.
Hadoop K-means version 2.7.0 is a clustering algorithm whose input is a set of data points X1, X2, ..., Xn and a parameter K, which specifies how many clusters to find. The algorithm starts by placing K centroids C1, C2, ..., Ck at random locations. It then repeats the following steps until convergence: assign each point to its nearest centroid, then move each centroid to the mean of the points assigned to it.
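The repeated assignment and centroid-update steps can be sketched as a minimal single-machine implementation (2-D points, squared Euclidean distance; the Hadoop version distributes exactly these steps across mappers and reducers):

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    # Step 0: place K centroids at randomly chosen input points.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Step 1: assign each point to its nearest centroid
        # (squared Euclidean distance as the similarity metric).
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k),
                    key=lambda c: (x - centroids[c][0]) ** 2 +
                                  (y - centroids[c][1]) ** 2)
            clusters[i].append((x, y))
        # Step 2: move each centroid to the mean of its cluster
        # (keep the old centroid if a cluster is empty).
        new_centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Convergence: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(centroids))  # [(0.0, 0.5), (10.0, 10.5)]
```

Each pass of the loop corresponds to one Hadoop K-means iteration; this is the CPU-bound phase characterized in this paper.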
The number of iterations is determined by how many times the steps run until the algorithm converges. Another implementation of K-means is Apache Mahout, a machine learning library that allows applications to analyze large data sets. Mahout uses Apache Hadoop to solve complex problems by breaking them into multiple parallel tasks. Mahout offers three machine learning techniques: Recommendation, Classification, and Clustering. Recommendation combines a user's information with community information to determine the likelihood of the user's preferences; for example, Netflix uses the Mahout Recommendation engine to suggest movies. The Classification engine uses known data to determine how new data should be classified into a set of existing categories, for example in classifying spam emails: every time a user marks an email as 'spam', it directly influences how the engine classifies future emails. The last Mahout engine is Clustering, which is used, for example, to group different news articles on similar topics; this is mainly used by Google and other search engines. Clustering forms groups of similar data based on common characteristics. Unlike classification, clustering does not group data into an existing set of known categories, which is particularly useful when the user is not sure how to organize the data in the first place.
Analyzing cloud computing performance is an important research topic that has led to several published papers. MapReduce clusters are becoming popular for a set of applications [21–23] in which large data sets are processed, stored and shared as files in a distributed file system.
Emanuel V presents an analytical model that estimates performance for a Hadoop online prototype using a job pipeline parallelism method. In comparison, the projection model proposed in this paper projects performance and runtime using different processor micro-architecture parameters, which are the parameters processor architects need to model performance. Furthermore, our model is verified to predict both performance and runtime with a <5 % error margin for all tested cases. The performance projection model we present in this paper is flexible and can be implemented without the need for a simulator or sampled traces.
Dejun et al. propose an approach to evaluate response time and I/O performance. Ibrahim et al. analyze Hadoop execution time on virtual machines. Stewart compares the performance of several data query languages. All of this work focuses on different aspects of analyzing Hadoop performance, and our work complements it. We also present an analytical model for performance prediction, which is the main focus of the research presented in this paper. There are several performance monitoring tools for Hadoop K-means. Salsa, for example, is a DataNode/TaskTracker log analyzer that provides the data and control flow execution on each node. Mochi extracts a job execution view from DataNode/TaskTracker logs.
Therdphapiyanak et al. proposed a Mahout/Hadoop implementation for large data sets that pre-determines the appropriate number of K-means clusters, by characterizing the appropriate number of clusters and the proper number of entries in log files.
Jiang conducted an in-depth performance analysis of MapReduce and presented optimization methods to improve performance. However, that research does not translate the presented optimizations into a performance prediction model.
Esteves et al. presented a detailed performance analysis of Mahout K-means using large data sets running on Amazon EC2 instances, comparing the runtime gains when running on a multi-node cluster.
Performance characterization using Intel Xeon based platform
CPU: 2x Intel Core i7 at 2.7GHz
Memory: 32GB (8x 4GB DIMMs) at 1066 MHz
NIC: onboard 1GbE network controller
System disk: Seagate 1TB 7200RPM
HDFS: 5x Intel 200GB SSDs on each system
Power management: disabled (including C-states)
Operating System: Red Hat Enterprise Linux
Hadoop distribution: Apache Hadoop 1.2.1
Map Slots: 1:1
Reduce Slots: 1:1
Heap Size: 2GB
Problem Size requirements
Core and Socket Scaling
For core scaling, using four and eight cores, the performance scaling is linear: close to 2x going from four cores to eight cores. This means that Hadoop K-means performance scales linearly with the number of cores.
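The linearity claim can be checked by dividing the measured speedup by the core-count ratio; values near 1.0 indicate linear scaling. A small sketch with illustrative runtimes (our own example numbers, not the paper's measurements):

```python
def speedup(t_base, t_scaled):
    # Measured speedup from two runtimes of the same job.
    return t_base / t_scaled

def scaling_efficiency(t_base, t_scaled, core_ratio):
    # 1.0 means perfectly linear scaling with core count.
    return speedup(t_base, t_scaled) / core_ratio

# Illustrative: going from 4 to 8 cores, runtime drops
# from 1000 s to 520 s -> ~1.92x speedup on a 2x core ratio.
eff = scaling_efficiency(1000, 520, 8 / 4)
print(round(eff, 2))  # 0.96
```

An efficiency this close to 1.0 is consistent with the CPU-bound character of the iteration phase: there is little serial or IO work to cap the speedup.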
Core Frequency Scaling
Hyper-Threading/Simultaneous Multi-Threading Scaling
Enabling the processor's Hyper-Threading (HT) feature allows an active core to execute two threads instead of one (Single Thread, ST). In our performance characterization, we found that enabling Simultaneous Multi-Threading (SMT) and scaling the workload with the thread count shows an average 20 % performance increase with all cores active.
Last Level Cache Scaling
Data Input Size Scaling
Memory and Heap Size Scaling
Performance characterization using AMD interlagos platform
AMD Platform Setup
4x AMD Interlagos platforms (Bulldozer core)
Two different chassis: 2x HP Proliant and 2x Supermicro
CPU: 2x 2.60GHz ITL
Memory: 32GB (4x 8GB DIMMs) for 1-socket and 64GB (8x 8GB DIMMs) for 2-socket; fixed at 1066 MHz for all configurations
NIC: Onboard 1GbE (only 1 port in use)
System disk: Seagate 1TB 7200RPM (holds no HDFS data)
HDFS: 4x Intel 200GB SSDs on each system; all disks attached via a SAS controller
PowerNow!: enabled, but frequency fixed via the On Demand governor; Core Performance Boost disabled
Operating System: Red Hat Enterprise Linux 6.1 with Kernel version 2.6.32
Java: Sun 1.6.0_25
Hadoop distribution: 1.0.2 snapshot (based on Apache distribution)
Map Slots: 1:1 with active logical threads
Reduce Slots: 1:1 with active logical threads
Heap Size: 2GB
Problem Size Requirements
Hadoop K-means performance scaling with respect to thread count
Performance scaling with respect to the number of sockets was measured on the AMD Interlagos system for one and two sockets.
Hadoop K-means Performance Scaling for AMD with respect to core count using 2 sockets
The same measurement was performed on the Intel Xeon processor, and a similar conclusion can be drawn: performance doubles when the number of cores doubles, with a slight increase in CPI.
Hadoop K-means Performance Scaling for AMD with respect to core frequency
AMD Interlagos core and cluster scaling using one socket
There is a performance increase of about 4.4x going from one cluster to four clusters, and performance increases by ~2.2x going from four clusters to eight clusters, as shown in Fig. 18.
In Fig. 19, the measured performance is compared to the performance projected by the model. We verified different processor configurations, varying the number of sockets, the number of cores, the core frequency, and the input size, with Hyper-Threading off. All these variables are included in the performance model. The error variation is within the expected range of <5 %. Among all tested configurations in Fig. 19, the peak performance is achieved with 2 sockets, 8 cores, a 2.7GHz core frequency and a 28GB input size.
For modeling runtime, the highest runtime is expected for the configuration with the lowest core frequency, in this case 2.1GHz, as shown in Fig. 20. All tested runtime cases (measured vs. projected) show an error margin of <5 %.
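The error-margin check used throughout this verification is the relative error of the projection against the measured baseline. A sketch with illustrative values (our own example numbers, not the paper's data):

```python
def error_margin(measured, projected):
    # Relative error of the model's projection against the
    # measured baseline, as a fraction (0.05 == 5 %).
    return abs(projected - measured) / measured

# Illustrative: measured 312 s vs. projected 324 s
# -> ~3.8 % error, inside the 5 % bound.
err = error_margin(312, 324)
print(f"{err:.1%}")  # 3.8%
```

A configuration passes verification when this value stays below 0.05 for both the performance and runtime projections.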
In this paper, we presented a detailed performance characterization and analysis of Hadoop K-means on Intel and AMD based processors. We also proposed a projection model for the Hadoop K-means workload. The projection model is verified to project performance and runtime within a 5 % error margin for all tested cases. The model is flexible enough to accept any change in processor micro-architecture parameters and estimate performance or runtime, and it does not require simulation, which in turn would require trace-based sampling of the workload. In future work, we can apply the same approach to other Hadoop framework workloads, such as word count, and implement a fully detailed performance characterization. The model can also be expanded to include IO latency, such as disk and network latency. The focus of this paper is on processor performance excluding any IO latency, which is why the selected input size was 28GB, less than the 32GB system memory size. Comparing AMD Interlagos to Intel Xeon, we conclude that Intel Xeon performs about 38 % better than AMD Interlagos. Socket and core scaling is almost linear in most measured cases, and the same conclusion applies to the Intel Xeon processor. For cluster-per-core scaling, there is about a 60 % performance increase for the AMD Interlagos processor.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Mak VW and Lundstrom SF (1990) Predicting performance of parallel computations. IEEE Trans. Parallel Distributed Systems, Online Journal
- Mell P, Grance T (2011) The NIST definition of cloud computing. NIST Special Publication, online (800-145)
- Dean J and Ghemawat S (2004) MapReduce: Simplified data processing on large clusters. 6th Symposium on Operating Systems Design & Implementation, Seattle, WA, USA
- Fitzpatrick B (2004) Distributed caching with Memcached. Linux Journal 2004(124):5
- Vianna E (2011) Modeling performance of the Hadoop online prototype. International Symposium on Computer Architecture, Vitoria
- Xie J (2010) Improving MapReduce performance through data placement in heterogeneous Hadoop clusters. IEEE International Symposium on Parallel & Distributed Processing, Atlanta
- Ishii M, Han J, Makino H (2013) Design and performance evaluation for hadoop clusters on virtualized environment. In: 2013 International Conference on Information Networking (ICOIN), pp. 244–249
- Chao T, Zhou H, He Y, and Zha L (2009) A dynamic MapReduce scheduler for heterogeneous workloads. Technical paper online, IEEE Computer Society
- Ranger C, Raghuraman R, Penmetsa A, Bradski G, and Kozyrakis C (2007) Evaluating MapReduce for multi-core and multiprocessor systems. High-Performance Computer Architecture, Proc. IEEE 13th Int’l Symp. High Performance Computer Architecture, Scottsdale, AZ, USA
- Dean J and Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In Op. Systems Design & Implementation
- Hendricks J, Sambasivan RR, and Sinnamohideenand S, Ganger GR (2006) Improving small file performance in object-based storage. Technical report, Carnegie Mellon University Parallel Data Lab, Online technical report
- Berezecki M, Frachtenberg E, Paleczny M, Steele K (2011) Many-Core Key-Value Store. International Green Computing Conference and Workshops, Orlando, FL, USA
- Mandal A et al (2011) Provisioning and Evaluating Multi-domain Networked Clouds for Hadoop-based Applications. Third International Conference on Cloud Computing Technology and Science, Athens, Greece
- Shafer J, Rixner S, Cox AL (2010) The Hadoop distributed filesystem: Balancing portability and performance. IEEE International Symposium on Performance Analysis of Systems & Software, White Plains, NY, USA
- Leverich J, Kozyrakis C (2010) On the energy (in)efficiency of Hadoop clusters. ACM SIGOPS Operating Systems Review, Indianapolis, IN, USA
- Chun B (2010) An Energy Case for Hybrid Datacenters. ACM SIGOPS Operating System Review, Indianapolis, IN, USA
- Wang G, Butt A, Pandey P, and Gupta K (2009) Using realistic simulation for performance analysis of MapReduce setups. LSAP. ACM, Munich, Germany
- Issa J, Figueira S (2010) Graphics Performance Analysis Using Amdahl’s Law. IEEE/SCS SPECTS, International Symposium on Performance Evaluation of Computer and Telecommunication System, Ottawa, ON, Canada
- Issa J, Figueira S (2011) Performance and power-Consumption Analysis of Mobile Internet Devices. IEEE IPCC–International Performance Computing and Communications Conference, Austin, TX, USA
- Apache Software Foundation: Official apache hadoop website: http://hadoop.apache.org. (2015)
- Wiktor T et al (2011) Performance Analysis of Hadoop for Query Processing. International Conference on Advanced Information Networking and Applications, Fukuoka, Japan
- Ekanayake J, Pallickara S, and Fox G (2008) MapReduce for data intensive scientific analysis. In: Fourth IEEE Intl. Conf. on eScience, pp. 277–284.
- Chu C-T, Kim SK, Lin Y-A, Yu Y, Bradski G, Ng AY, and Olukotun K (2007) Map-Reduce for machine learning on multicore. NIPS, Vancouver, B.C., Canada, pp. 281–288.
- Emanuel V (2011) Modeling Performance of the Hadoop online Prototype. ISCA, San Jose, CA, USA
- Dejun J and Chi GPC (2009) EC2 Performance Analysis for Resource Provisioning of Service-Oriented Applications. International Conference on Service-Oriented Computing, Stockholm, Sweden
- Ibrahim S, Jin H, Lu L, Qi L, Wu S, and Shi X (2009) Evaluating MapReduce on Virtual Machines: The Hadoop Case. International Conference on Cloud Computing, Bangalore, India
- Stewart R (2010) Performance and Programmability of High Level Data Parallel Processing Languages. http://www.macs.hw.ac.uk/~rs46/papers/appt2011/RobertStewart_APPT2011.pdf
- Tan J, Pan X, Kavulya S, Gandhi R, and Narasimhan P (2008) Salsa: Analyzing Logs as State Machines. In: Workshop on Analysis of System Logs.
- Tan J, Pan X, Kavulya S, Gandhi R, and Narasimhan P (2009) Mochi: Visual Log-analysis Based Tools for Debugging Hadoop. In: HotCloud.
- Therdphapiyanak J, Piromsopa K (2013) An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework. In: 10th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology
- Jiang D, Ooi BC, Shi L, and Wu S (2010) The performance of MapReduce: an in-depth study. Proceedings of the VLDB Endowment, Online Journal
- Esteves RM, Pais R, Rong C (2011) K-means clustering in the cloud -- a Mahout test. In: IEEE Workshops of International Conference on Advanced Information Networking and Applications (WAINA), pp. 514–519