Performance analysis model for big data applications in cloud computing

Bautista Villalpando, Luis Eduardo; April, Alain; Abran, Alain

doi:10.1186/s13677-014-0019-z

Research
Open access
Published: 06 December 2014

Luis Eduardo Bautista Villalpando^1,2,
Alain April² &
Alain Abran²

Journal of Cloud Computing volume 3, Article number: 19 (2014)

Abstract

The foundation of Cloud Computing is the sharing of computing resources that are dynamically allocated and released on demand with minimal management effort. Most of the time, computing resources such as processors, memory and storage are allocated through commodity hardware virtualization, which distinguishes cloud computing from other technologies. One of the objectives of this technology is processing and storing very large amounts of data, also referred to as Big Data. Sometimes, anomalies and defects found in Cloud platforms affect the performance of Big Data Applications, resulting in degradation of Cloud performance. One of the challenges in Big Data is how to analyze the performance of Big Data Applications in order to determine the main factors that affect their quality. The performance analysis results are very important because they help to detect the source of the degradation of the applications as well as of the Cloud. Furthermore, such results can be used in future resource planning stages, in the design of Service Level Agreements, or simply to improve the applications. This paper proposes a performance analysis model for Big Data Applications, which integrates software quality concepts from ISO 25010. The main goal of this work is to fill the gap between the quantitative (numerical) representation of software engineering quality concepts and the measurement of the performance of Big Data Applications. To this end, the use of statistical methods is proposed to establish relationships between performance measures extracted from Big Data Applications and Cloud Computing platforms, and the software engineering quality concepts.

Introduction

According to ISO subcommittee 38, the CC study group, Cloud Computing (CC) is a paradigm for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable cloud resources accessed through services which can be rapidly provisioned and released with minimal management effort or service provider interaction [1].

One of the challenges in CC is how to process and store large amounts of data (also known as Big Data, or BD) in an efficient and reliable way. ISO subcommittee 32, Next Generation Analytics and Big Data study group, refers to Big Data as the transition from structured data and traditional analytics to the analysis of complex information of many types. Moreover, the group mentions that Big Data exploits cloud resources to manage large data volumes extracted from multiple sources [2]. In December 2012, the International Data Corporation (IDC) stated that, by the end of 2012, the total data generated was 2.8 Zettabytes (ZB) (2.8 trillion Gigabytes). Furthermore, the IDC predicts that the total data generated by 2020 will be 40 ZB. This is roughly equivalent to 5.2 terabytes (TB) of data generated by every human being alive in that year [3].
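The unit conversions behind these figures can be checked with a short calculation. The sketch below assumes decimal (SI) prefixes and a world population of about 7.7 billion in 2020; the population figure is an assumption introduced here for illustration and is not taken from the IDC report.

```python
# Byte counts for decimal (SI) prefixes.
GIGABYTE = 10**9
TERABYTE = 10**12
ZETTABYTE = 10**21

# 2.8 ZB expressed in gigabytes; integer arithmetic keeps the result exact.
data_2012_gb = 28 * ZETTABYTE // 10 // GIGABYTE

# Assumption (not from the source): about 7.7 billion people alive in 2020.
ASSUMED_POPULATION_2020 = 7_700_000_000

# 40 ZB shared evenly across the assumed population, in terabytes.
per_person_tb = 40 * ZETTABYTE / ASSUMED_POPULATION_2020 / TERABYTE

print(f"2.8 ZB = {data_2012_gb:,} GB")
print(f"40 ZB per person = {per_person_tb:.2f} TB")
```

Under that population assumption, the per-person figure lands close to the 5.2 TB quoted above; a different population estimate shifts it slightly.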

Big Data Applications (BDA) are a way to process part of such large amounts of data by means of platforms, tools and mechanisms for parallel and distributed processing. ISO subcommittee 32 mentions that BD Analytics has become a major driving application for data warehousing, with the use of MapReduce outside and inside of database management systems, and the use of self-service data marts [2]. MapReduce, a programming model developed by Google for processing and generating large datasets, is one of the programming models used to develop BDA.

Sometimes, anomalies and defects found in platforms of Cloud Computing Systems (CCS) affect the performance of BDA, resulting in degradation of the whole system. Performance analysis models (PAM) for BDA in CC should propose a means to identify and quantify “normal application behaviour”, which can serve as a baseline for detecting and predicting possible anomalies in the software (i.e. applications in a Big Data platform) that may impact the BDA itself. To be able to design such PAM for BDA, methods are needed to collect the necessary base measures specific to performance, and a performance framework must be used to determine the relationships that exist among these measures.

One of the challenges in designing PAM for BDA is how to determine what type of relationship exists between the various base measures and the performance quality concepts defined in international standards such as ISO 25010 [4]. For example, what is the extent of the relationship between the amount of physical memory used by a BDA and software engineering performance quality concepts such as resource utilization or capacity? Thus, this work proposes the use of statistical methods to determine how closely performance parameters (base measures) are related to the performance concepts of software engineering.
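As an illustration of what "how closely related" can mean in practice, the following minimal sketch computes a Pearson correlation coefficient between a base measure and a performance indicator. Pearson correlation is used here only as a familiar example of a statistical method and is not claimed to be the method applied in this work. The function name `pearson` and both data series are hypothetical; the numbers are synthetic and are not measurements.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Synthetic, illustrative values only (hypothetical base measure and indicator).
memory_used_mb = [512, 640, 768, 1024, 1280]
job_duration_s = [40, 46, 55, 70, 88]

r = pearson(memory_used_mb, job_duration_s)
print(f"r = {r:.3f}")
```

A coefficient near +1 or -1 suggests a strong linear relationship between the two series, while a value near 0 suggests little linear relationship.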

This paper is structured as follows. The related work and background sections present the concepts related to the performance measurement of BDA and introduce the MapReduce programming model. In addition, the background section presents the Performance Measurement Framework for Cloud Computing (PMFCC), which describes the key performance concepts and sub concepts that best represent the performance of CCS. The analysis model section presents the method for examining the relationships among the performance concepts identified in the PMFCC. An experimental methodology based on the Taguchi method of experimental design is used, which offers a means for improving the quality of product performance. The experiment section presents the results of an experiment that analyzes the relationship between the performance factors of BDA, Cloud Computing Platforms (CCP) and the performance concepts identified in the PMFCC. Finally, the conclusion section presents a synthesis of the results of this research and suggests future work.

Related work

Researchers have analyzed the performance of BDA from various viewpoints. For example, Alexandru [5] analyzes the performance of Cloud Computing Services for Many-Task Computing (MTC) systems. According to Alexandru, scientific workloads often require High-Performance Computing capabilities, and the scientific computing community has started to focus on MTC, that is, the high-performance execution of loosely coupled applications comprising many tasks. With this approach, systems can be required to operate at high utilization, like current production grids. Alexandru analyzes performance by asking whether current clouds can execute MTC-based scientific workloads with similar performance and at lower cost than current scientific processing systems. For this, the author focuses on Infrastructure as a Service (IaaS), that is, providers of public clouds that are not restricted to a single enterprise. In this research, Alexandru selected four public cloud providers (Amazon EC2, GoGrid, ElasticHosts and Mosso), on which traditional system benchmarking is performed in order to provide a first-order estimate of system performance. Alexandru mainly uses metrics related to disk, memory, network and CPU to determine performance through the analysis of MTC workloads, which comprise tens of thousands to hundreds of thousands of tasks. The main finding of this research is that the compute performance of the tested clouds is low compared to traditional high-performance computing systems. In addition, Alexandru found that while current cloud computing services are insufficient for scientific computing at large, they are a good solution for scientists who need resources instantly and temporarily.

Similar research is performed by Jackson [6], who analyzes high performance computing applications on the Amazon Web Services cloud. The purpose of this work is to examine the performance of existing CC infrastructures and create a mechanism to quantitatively evaluate them. The work is focused on the performance of Amazon EC2, as representative of the current mainstream of commercial CC services, and its applicability to Cloud-based environments for scientific computing. To do so, Jackson quantitatively examines the performance of a set of benchmarks designed to represent a typical High Performance Computing (HPC) workload running on the Amazon EC2 platform. Timing results from different application benchmarks are used to compute the Sustained System Performance (SSP) metric, which measures the performance delivered by a computing system for a given workload. According to the National Energy Research Scientific Computing Center (NERSC) [7], SSP provides a process for evaluating system performance across any time frame, and can be applied to any set of systems, any workload and/or benchmark suite, and any time period. The SSP measures time to solution across different application areas and can be used to evaluate absolute performance and performance relative to cost (in dollars, energy or other value propositions). The results show a strong correlation between the percentage of time an application spends communicating and its overall performance on EC2: the more communication there is, the worse the performance becomes. Jackson also concludes that the communication pattern of an application can have a significant impact on performance.

Other researchers focus their work on the performance analysis of MapReduce applications. For example, Jin [8] proposes a stochastic model to predict the performance of MapReduce applications under failures. His work is used to quantify the robustness of MapReduce applications under different system parameters, such as the number of processes, the mean time between failures (MTBF) of each process, and the failure recovery cost. Jiang [9] performs an in-depth study of the factors that affect the performance of MapReduce applications. In particular, he identifies five such factors: I/O mode, indexing, data parsing, grouping schemes and block-level scheduling. Moreover, Jiang concludes that by carefully tuning each factor, it is possible to eliminate its negative impact and improve the performance of MapReduce applications. Other authors, such as Guo [10] and Cheng [11], focus on improving the performance of MapReduce applications. Guo exploits the freedom to control concurrency in MapReduce in order to improve resource utilization. For this, he proposes “resource stealing”, which dynamically expands and shrinks the resource usage of running tasks by means of benefit-aware speculative execution (BASE). BASE improves the fault-tolerance mechanisms that speculatively launch duplicate tasks for tasks deemed to be stragglers. Furthermore, Cheng [11] focuses his work on improving the performance of MapReduce applications through a strategy called maximum cost performance (MCP). MCP improves the effectiveness of speculative execution by accurately and promptly identifying stragglers. For this he provides the following methods: 1) use both the progress rate and the process bandwidth within a phase to select slow tasks, 2) use an exponentially weighted moving average (EWMA) to predict process speed and calculate a task’s remaining time, and 3) determine which task to back up based on the load of the cluster, using a cost-benefit model.
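The second method can be illustrated with a minimal sketch of an EWMA-based remaining-time estimate. This is not Cheng's implementation: the function names, the smoothing factor `alpha = 0.5`, and the choice to express speed as the fraction of a task completed per second are all assumptions made for illustration.

```python
def ewma(samples, alpha=0.5):
    """Exponentially weighted moving average; recent samples weigh more."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    estimate = samples[0]
    for sample in samples[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate


def remaining_time(progress, speed_samples, alpha=0.5):
    """Estimated seconds left for a task.

    progress: fraction of the task already done, in [0, 1].
    speed_samples: observed speeds, in fraction of task per second
    (a hypothetical unit chosen for this sketch).
    """
    speed = ewma(speed_samples, alpha)
    if speed <= 0:
        return float("inf")  # a stalled task never finishes at this rate
    return (1.0 - progress) / speed


# A task that is 40% done and has recently sped up.
print(remaining_time(0.4, [0.02, 0.04]))
```

Tasks whose estimated remaining time is much larger than that of their peers would be natural straggler candidates; how the threshold is chosen is outside the scope of this sketch.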

Although these works present interesting methods for the performance analysis of CCS and the improvement of BD applications (MapReduce), their approach is from an infrastructure standpoint and does not consider performance from a software engineering perspective. This work focuses on the performance analysis of BDA developed by means of the Hadoop MapReduce model, integrating software quality concepts from ISO 25010.

Background

Hadoop MapReduce

Hadoop is a top-level project of the Apache Software Foundation and encompasses the various Hadoop sub projects. The Hadoop project provides and supports the development of open source software that supplies a framework for the development of highly scalable distributed computing applications. The framework is designed to handle processing details, leaving developers free to focus on application logic [12]. Hadoop is divided into several sub projects that fall under the umbrella of infrastructures for distributed computing. One of these sub projects is MapReduce, an implementation of the programming model that Google developed for processing and generating large datasets.

According to Dean [13], programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. Authors like Lin [14] point out that today, the issue of tackling large amounts of data is addressed by a divide-and-conquer approach, the basic idea being to partition a large problem into smaller sub problems. Those sub problems can be handled in parallel by different workers; for example, threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster. In this way, the intermediate results of each individual worker are then combined to yield the final output.
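The divide-and-conquer idea can be sketched on a single machine, with threads standing in for the workers. The helper names `partition` and `parallel_sum` are hypothetical, and summing numbers is chosen only because partial results are trivial to combine.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, parts):
    """Split a non-empty list into at most `parts` contiguous chunks."""
    size = -(-len(items) // parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_sum(items, workers=4):
    """Sum a list by handing one chunk to each worker, then combining."""
    if not items:
        return 0
    chunks = partition(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))  # one sub problem per chunk
    return sum(partial_sums)  # combine the intermediate results


print(parallel_sum(list(range(101))))
```

In a cluster setting the workers would be separate machines rather than threads, but the three steps (partition, process in parallel, combine) stay the same.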

In the Hadoop MapReduce model, results are obtained in two main stages: 1) the Map stage, and 2) the Reduce stage. In the Map stage, also called the mapping phase, data elements from a list are fed, one at a time, to a function called the Mapper, which transforms each element individually into an output data element. Figure 1 presents the components of the Map stage process.

The Reduce stage (also called the reducing phase) aggregates values. In this stage, a reducer function receives input values iteratively from an input list. This function combines these values, returning a single output value. The Reduce stage is often used to produce “summary” data, turning a large volume of data into a smaller summary of itself. Figure 2 presents the components of the Reduce stage.
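The two stages can be mimicked on a plain in-memory list. The sketch below is not the Hadoop API; `map_stage` and `reduce_stage` are hypothetical helpers that only mirror the behaviour described above: the mapper sees one element at a time, and the reducer folds a list of values into a single output.

```python
def map_stage(mapper, elements):
    """Apply the mapper to each input element individually."""
    return [mapper(element) for element in elements]


def reduce_stage(reducer, values, initial):
    """Feed values to the reducer one by one and return a single result."""
    result = initial
    for value in values:
        result = reducer(result, value)
    return result


words = ["cloud", "data", "hadoop"]
lengths = map_stage(len, words)                                # one output per input
total = reduce_stage(lambda acc, v: acc + v, lengths, 0)       # summary value
print(lengths, total)
```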

According to Yahoo! [15], when a mapping phase begins, any mapper (node) can process any input file or part of an input file. In this way, each mapper loads a set of local files to be able to process them. When a mapping phase has been completed, an intermediate pair of values (consisting of a key and a value) must be exchanged between machines, so that all values with the same key are sent to a single reducer. Like Map tasks, Reduce tasks are spread across the same nodes in the cluster and do not exchange information with one another, nor are they aware of one another’s existence. Thus, all data transfer is handled by the Hadoop MapReduce platform itself, guided implicitly by the various keys associated with the values.
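The key-driven exchange between the two phases can be illustrated with the familiar word-count example. This single-process sketch does not use Hadoop; the `shuffle` function merely stands in for the transfer step that the platform performs between machines, and all function names are hypothetical.

```python
from collections import defaultdict


def mapper(line):
    """Emit an intermediate (key, value) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key so that each key reaches exactly one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reducer(key, values):
    """Combine all values that share a key into one output pair."""
    return key, sum(values)


def word_count(lines):
    intermediate = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(intermediate).items())


print(word_count(["big data", "Big cloud"]))
```

Note that neither `mapper` nor `reducer` refers to any other task: the grouping by key is the only coordination between them, which matches the description above.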

Performance measurement framework for cloud computing

The Performance Measurement Framework for Cloud Computing (PMFCC) [16] is based on the scheme for performance analysis shown in Figure 3. This scheme establishes a set of performance criteria (or characteristics) that support the analysis of system performance. In this scheme, if the system performs a service correctly, its performance is typically analyzed using three sub concepts: 1) responsiveness, 2) productivity, and 3) utilization, and the scheme proposes a measurement process for each. There are several possible outcomes for each service request made to a system, which can be classified into three categories. The system may: 1) perform the service correctly, 2) perform the service incorrectly, or 3) refuse to perform the service altogether. Moreover, the scheme defines three sub concepts associated with each of these possible outcomes, which affect system performance: 1) speed, 2) reliability, and 3) availability. Figure 3 presents this scheme, which shows the possible outcomes of a service request to a system and the sub concepts associated with them.
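The scheme can be written down as a small lookup structure. In the sketch below, the pairing of outcomes with sub concepts assumes that the two enumerations above are listed in matching order (correct with speed, incorrect with reliability, refused with availability); the names `Outcome` and `analysis_criteria` are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "service performed correctly"
    INCORRECT = "service performed incorrectly"
    REFUSED = "service refused"


# Assumed pairing: the two lists in the text are read in the same order.
SUB_CONCEPT = {
    Outcome.CORRECT: "speed",
    Outcome.INCORRECT: "reliability",
    Outcome.REFUSED: "availability",
}

# Sub concepts used when the service is performed correctly.
CORRECT_SERVICE_CRITERIA = ("responsiveness", "productivity", "utilization")


def analysis_criteria(outcome):
    """Return the sub concept for an outcome and any finer-grained criteria."""
    criteria = CORRECT_SERVICE_CRITERIA if outcome is Outcome.CORRECT else ()
    return SUB_CONCEPT[outcome], criteria


print(analysis_criteria(Outcome.CORRECT))
```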

Based on the above scheme, the PMFCC [16] maps the possible outcomes of a service request onto quality concepts extracted from the ISO 25010 standard. The ISO 25010 [4] standard defines software product and computer system quality from two distinct perspectives: 1) a quality in use model, and 2) a product quality model. The product quality model is applicable to both systems and software. According to ISO 25010, the properties of both determine the quality of the product in a particular context, based on user requirements. For example, performance efficiency and reliability can be specific concerns of users who specialize in areas of content delivery, management, or maintenance. The performance efficiency concept proposed in ISO 25010 has three sub concepts: 1) time behavior, 2) resource utilization, and 3) capacity, while the reliability concept has four sub concepts: 1) maturity, 2) availability, 3) fault tolerance, and 4) recoverability. The PMFCC selects performance efficiency and reliability as concepts for determining the performance of CCS. In addition, the PMFCC proposes the following definition of CCS performance analysis:

“The performance of a Cloud Computing system is determined by analysis of the characteristics involved in performing an efficient and reliable service that meets requirements under stated conditions and within the maximum limits of the system parameters”.
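The two ISO 25010 concepts selected by the PMFCC, together with their sub concepts as listed above, form a shallow hierarchy. The sketch below encodes it as a dictionary with a reverse lookup; the names `PMFCC_CONCEPTS` and `parent_concept` are hypothetical and the structure is only one possible representation.

```python
# Concepts and sub concepts as enumerated in the text (from ISO 25010).
PMFCC_CONCEPTS = {
    "performance efficiency": ("time behavior", "resource utilization", "capacity"),
    "reliability": ("maturity", "availability", "fault tolerance", "recoverability"),
}


def parent_concept(sub_concept):
    """Return the concept a sub concept belongs to, or None if it is unknown."""
    for concept, sub_concepts in PMFCC_CONCEPTS.items():
        if sub_concept in sub_concepts:
            return concept
    return None


print(parent_concept("capacity"))
print(parent_concept("availability"))
```

Such a lookup makes it straightforward to roll up a measure attached to a sub concept (for example, a memory measure attached to resource utilization) to its parent concept.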

Once the performance analysis concepts and sub concepts are mapped onto the ISO 25010 quality concepts, the framework presents a relationship model (Figure 4) that shows the logical sequence in which the concepts and sub concepts appear when a performance issue arises in a CCS.

In Figure 4, system performance is determined by two main sub concepts: 1) performance efficiency, and 2) reliability. We have seen that when a CCS receives a service request, there are three possible outcomes (the service is performed correctly, the service is performed incorrectly, or the service cannot be performed). The outcome will determine the sub concepts that will be applied for performance analysis. For example, suppose that the CCS performs a service correctly, but, during its execution, the service failed and was later reinstated. Although the service was ultimately performed successfully, it is clear that the system availability (part of the reliability sub concept) was compromised, and this affected CCS performance.