Improving the performance of Hadoop Hive by sharing scan and computation tasks

Dokeroglu, Tansel; Ozal, Serkan; Bayir, Murat Ali; Cinar, Muhammet Serkan; Cosar, Ahmet

doi:10.1186/s13677-014-0012-6

Research
Open access
Published: 29 July 2014

Journal of Cloud Computing volume 3, Article number: 12 (2014)

Abstract

MapReduce is a popular programming model for executing time-consuming analytical queries as a batch of tasks on large scale data clusters. In environments where multiple queries with similar selection predicates, common tables, and join tasks arrive simultaneously, many opportunities can arise for sharing scan and/or join computation tasks. Executing common tasks only once can remarkably reduce the total execution time of a batch of queries. In this study, we propose a Multiple Query Optimization framework, SharedHive, to improve the overall performance of Hadoop Hive, an open source SQL-based data warehouse using MapReduce. SharedHive transforms a set of correlated HiveQL queries into a new set of insert queries that will produce all of the required outputs within a shorter execution time. It is experimentally shown that SharedHive achieves significant reductions in total execution times of TPC-H queries.

Introduction

Hadoop is a popular open source software framework that allows the distributed processing of large scale data sets [1]. It employs the MapReduce paradigm to divide computation tasks into parts that can be distributed to a commodity cluster and therefore provides horizontal scalability [2]–[9]. The MapReduce functions of Hadoop use (key,value) pairs as their data format. The input is retrieved in chunks from the Hadoop Distributed File System (HDFS), the underlying file system of Hadoop, and each chunk is assigned to one of the mappers, which process the data in parallel and produce (k₁,v₁) pairs for the reduce step. Each (k₁,v₁) pair then goes through the shuffle phase, which assigns pairs with the same k₁ to the same reducer. The reducers gather the pairs with the same k₁ values into groups and perform aggregation operations (see Figure 1). Due to its simplicity, scalability, fault-tolerance and efficiency, Hadoop has gained significant support from both industry and academia; however, there are some limitations in terms of its interfaces and performance [10]. Querying data with Hadoop as one would in a traditional RDBMS infrastructure is one of the most common problems that Hadoop users face. This affects the majority of users, who are not familiar with the internal details of MapReduce jobs but need to extract information from their data warehouses.
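The map, shuffle and reduce steps described above can be imitated in a few lines of single-process Python. This is only a minimal sketch of the data flow, not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`), the sample records and the choice of a sum as the aggregation are all assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(chunk):
    """Mapper: emit one (k1, v1) pair per input record of a chunk."""
    for key, value in chunk:
        yield key, value

def shuffle_phase(pairs, num_reducers):
    """Shuffle: route every pair with the same key to the same reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_phase(partition):
    """Reducer: aggregate (here: sum) the grouped values of each key."""
    return {key: sum(values) for key, values in partition.items()}

def run_job(chunks, num_reducers=2):
    """Run map over every chunk, shuffle the pairs, then reduce each partition."""
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    result = {}
    for partition in shuffle_phase(pairs, num_reducers):
        result.update(reduce_phase(partition))
    return result

# Two hypothetical input chunks, as if read from two file blocks.
chunks = [[("A", 10), ("N", 5)], [("A", 7), ("R", 3), ("N", 1)]]
totals = run_job(chunks)
print(totals)
```

Because all pairs with the same key land in the same partition, each key is aggregated by exactly one reducer, which is the property the shuffle phase guarantees.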

Hadoop Hive is an open source SQL-based distributed warehouse system which is proposed to solve the problems mentioned above by providing an SQL-like abstraction on top of Hadoop framework. Hive is an SQL-to-MapReduce translator with an SQL dialect, HiveQL, for querying data stored in a cluster [11]–[13]. When users want to benefit from both MapReduce and SQL, mapping SQL statements to MapReduce tasks can become a very difficult job [14]. Hive does this work by translating queries to MapReduce jobs, thereby exploiting the scalability of Hadoop while presenting a familiar SQL abstraction [15]. These attributes of Hive make it a suitable tool for data warehouse applications where large scale data is analyzed, fast response times are not required, and there is no need to update data frequently [4].

Since most data warehouse applications are implemented using SQL-based RDBMSs, Hive lowers the barrier to moving these applications to Hadoop: people who already know SQL can easily use Hive, and developers can port SQL-based applications to Hadoop more easily. However, Hive is based on a query-at-a-time model and processes each query independently, so issuing multiple queries within a short time interval degrades its performance. From this perspective, it is important to note that there has been no study, to date, that incorporates the multiple-query optimization (MQO) technique into Hive to reduce the total execution time of a batch of queries [16]–[18].

Studies concerning MQO for traditional warehouses have shown that it is an efficient technique that dramatically increases the performance of time-consuming decision support queries [2],[19]–[21]. In order to improve the performance of Hadoop Hive in environments where many queries are issued together, we propose SharedHive, which processes HiveQL queries as a batch and improves the total execution time by merging correlated queries before passing them to the Hive query optimizer [6],[15],[22]. By analyzing the common tasks of correlated HiveQL queries, we merge them into a new set of insert queries using an optimization algorithm and execute them as a batch. The developed model is introduced as a novel component for the Hadoop Hive architecture.

In Related work Section, brief information is presented concerning the related work on MQO, SQL-to-MapReduce translators that are similar to Hive, and recent query optimization studies on MapReduce framework. SharedHive system architecture Section explains the traditional architecture of Hive and introduces our novel MQO component. The next Section (Sharing scan and computation tasks of HiveQL queries) explains the process of generating a set of merged insert queries from correlated queries. Experimental setup and results Section discusses the experiments conducted to evaluate the SharedHive framework for HiveQL queries that have different correlation levels. The Section before the conclusion presents the comparison of SharedHive with the other MapReduce-based MQO methods. Our concluding remarks are given in Conclusions and future work Section.

Related work

The MQO problem was introduced in the 1980s and finding an optimal global query plan using MQO was shown to be an NP-Hard problem [16],[23]. Since then, a considerable amount of work has been undertaken on RDBMSs and data analysis applications [24]–[26]. Mehta and DeWitt considered CPU utilization, memory usage, and I/O load variables in a study during planning multiple queries to determine the degree of intra-operator parallelism in parallel databases to minimize the total execution time of declustered join methods [27]. A proxy-based infrastructure for handling data intensive applications has been proposed by Beynon [28]; however, this infrastructure was not as scalable as a collection of distributed cache servers available at multiple back-ends. A data integration system that reduces the communication costs by a multiple query reconstruction algorithm is proposed by [29]. IGNITE [30] and QPipe [31] are important studies that use the micro machine concept for query operators to reduce the total execution time of a set of queries. A novel MQO framework is proposed for the existing SPARQL query engines [32]. A cascade-style optimizer for Scope, Microsoft’s system for massive data analysis, is designed in [33]. CoScan [34],[35] shows how sharing scan operations can benefit multiple MapReduce queries.

In recent years, a significant amount of research and commercial activity has focused on integrating MapReduce and structured database technologies [36]. Mainly there are two approaches, either adding MapReduce features to a parallel database or adding database technologies to MapReduce. The second approach is more attractive because no widely available open source parallel database system exists, whereas MapReduce is available as an open source project. Furthermore, MapReduce is accompanied by a plethora of free tools as well as having cluster availability and support. Hive [11], Pig [37], Scope [20], and HadoopDB [10],[38] are projects that provide SQL abstractions on top of MapReduce platform to familiarize the programmers with complex queries. SQL/MapReduce [39] and Greenplum [21] are recent projects that use MapReduce to process user-defined functions (UDF).

Recently, there have been interesting studies that apply MQO to MapReduce frameworks for unstructured data; for example MRShare [40] processes a batch of input queries as a single query. The optimal grouping of queries for execution is defined as an optimization problem based on MapReduce cost model. The experimental results reported for MRShare demonstrate its effectiveness. In spite of some initial MQO studies to reduce the execution time of MapReduce-based single queries [41], to our knowledge there is no study similar to ours that is related to the MQO of Hadoop Hive by using insert query statements.

SharedHive system architecture

In this section, we briefly present the architecture of SharedHive, which is a modified version of Hadoop Hive with a new MQO component inserted on top of the Driver component of Hive (see Figure 2). Inputs to the driver, which contains the compiler, optimizer and executor, are pre-processed by the added Multiple Query Optimizer component, which analyzes incoming queries and produces a set of merged HiveQL insert queries. Finally, the remaining queries that do not have any correlation with others are appended at the end of the correlated query sets. The system catalog and relational database structure (relations, attributes, partitions, etc.) are stored and maintained by the Metastore. Once a HiveQL statement is submitted, it is handled by the Driver, which controls the execution of tasks in order to answer the query. The compiler parses the query string and transforms the parse tree to a logical plan. The optimizer performs several passes over the logical plan and rewrites it. The physical plan generator creates a physical plan from the logical plan.

HiveQL statements are submitted via the Command Line Interface (CLI), the Web User Interface or the thrift interface. In the conventional Hive architecture, the query is directed to the driver component. In SharedHive, the MQO component (located after the client interface) receives the incoming queries before the driver component. The incoming queries are inspected, their common tables and intermediate common joins are detected, and the queries are merged to obtain a new set of HiveQL queries that answer all the incoming queries. The details of this process are explained in the next Section.

The new MQO component passes the new set of merged queries to the compiler component of Hive driver that produces a logical plan using information from the Metastore and optimizes this plan using a single rule-based optimizer. The execution engine receives a directed acyclic graph (DAG) of MapReduce and associated HDFS tasks, then executes them in accordance with the dependencies of the tasks. The new MQO component does not require any major changes in the system architecture of Hadoop Hive and can be easily integrated into Hive.
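To illustrate what executing a DAG of tasks "in accordance with the dependencies" means, the sketch below orders a small task graph with Python's standard `graphlib.TopologicalSorter`. The task names and the graph itself are hypothetical and do not come from Hive's execution engine; the sketch only shows that a task is scheduled after everything it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dag = {
    "scan_lineitem": set(),
    "scan_part": set(),
    "join_lineitem_part": {"scan_lineitem", "scan_part"},
    "aggregate": {"join_lineitem_part"},
    "write_hdfs_output": {"aggregate"},
}

def execution_order(graph):
    """Return one valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())

order = execution_order(dag)
print(order)
```

The two scan tasks have no dependencies, so a real engine could run them in parallel; the sorter only fixes the relative order that the dependencies require.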

Sharing scan and computation tasks of HiveQL queries

In order to benefit from the common scan/join tasks of the input queries and reduce the number (i.e. total amount) of redundant tasks, SharedHive merges input queries into a new set of HiveQL insert queries and produces answers to each query as a separate HDFS file.

The problem of merging a set of queries can be formally described as:

Input: A set of HiveQL queries $Q = \{q_1, \ldots, q_n\}$.

Output: A set of merged HiveQL queries $Q' = \{q'_1, \ldots, q'_m\}$, where $m \leq n$.

Rewrite/combine the given input queries in such a way that the total execution time of query set $Q'$ is less than the total execution time of query set $Q$. If the execution time of query $q_i$ is represented with $t_i$ then

$$\sum_{i=1}^{m} t'_i \leq \sum_{i=1}^{n} t_i \qquad (1)$$

Given that $q'_i$ is the merged insert query corresponding to queries $q_j$ and $q_k$, all of the output tuples and columns required by both queries must be produced by query $q'_i$, preserving the predicate attributes of $q_j$ and $q_k$.

The existing architecture of Hive produces several jobs that run in parallel to answer a query. The insert queries merged by SharedHive can combine the scan and/or intermediate join operations of the input queries in a new set of insert queries and gain performance increases by reducing the number of MapReduce tasks and the sizes of read/written HDFS files.

Unlike the traditional SQL statements, HiveQL join query statements are written in the FROM part of the query [15] such as

The example below shows how a merged HiveQL insert query for TPC-H queries Q1 and Q6 is constructed.

Merging TPC-H Queries Q1 and Q6 :
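As a rough illustration of the multi-insert form, the following Python sketch assembles a HiveQL statement in which two queries share a single scan of `lineitem`. The query bodies are simplified, hypothetical stand-ins for TPC-H Q1 and Q6, and the helper `build_multi_insert` and the target tables `q1_result` and `q6_result` are assumptions made for this example; it is not the authors' merged query.

```python
def build_multi_insert(source, branches):
    """Compose one HiveQL multi-insert statement from a shared FROM clause.

    `branches` is a list of (target_table, select_body) pairs. Every branch
    reads the same `source`, so the source appears only once in the statement.
    """
    lines = [f"FROM {source}"]
    for target, select_body in branches:
        lines.append(f"INSERT OVERWRITE TABLE {target}")
        lines.append(f"  {select_body}")
    return "\n".join(lines) + ";"

# Simplified, hypothetical stand-ins for Q1 and Q6 (assumed for illustration).
q1_like = ("SELECT l_returnflag, l_linestatus, SUM(l_quantity) "
           "WHERE l_shipdate <= '1998-09-02' "
           "GROUP BY l_returnflag, l_linestatus")
q6_like = ("SELECT SUM(l_extendedprice * l_discount) "
           "WHERE l_shipdate >= '1994-01-01' AND l_shipdate < '1995-01-01' "
           "AND l_discount BETWEEN 0.05 AND 0.07 AND l_quantity < 24")

merged = build_multi_insert(
    "lineitem",
    [("q1_result", q1_like), ("q6_result", q6_like)],
)
print(merged)
```

Each `INSERT` branch keeps its own selection predicate and grouping, while the `FROM` clause is stated once, which is what allows the scan to be shared.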

The underlying SQL-to-MapReduce translator of Hive uses a one-operation-to-one-job model [22] and opens a new job for each operation (table scan, join, group by, etc.) in an SQL statement. Significant performance increases can be obtained by reducing the number of MapReduce tasks of these jobs. Figure 3 presents the MapReduce tasks of the merged insert query (Q1+Q6), which reduces the number of scan operations.

In the Appendix, three merged queries are presented to explain the merging process of HiveQL queries that share common input and output parts. The first one merges two Q1 queries that have different selection predicates, the second one merges two fully-correlated queries, Q14 and Q19, that share a common join operation, and the third one merges two partially-correlated queries, Q1 and Q18 [42].

HiveQL statements have a preprocessing overhead for MapReduce tasks that will be executed to complete a query and this causes high latencies that could cause short running queries to take longer time on Hive [43]. In addition to the emerging opportunities of using common table scan and join operations, SharedHive intends to decrease the preprocessing period of uncorrelated query MapReduce tasks.

In the merging process of SharedHive, each query is classified according to the shared tables and/or join operations in the FROM clause of HiveQL statements. The input queries are inserted into a data structure that maintains the groups of similar queries according to the largest sharing opportunity they have with other queries.

While grouping the queries, the highest precedence is given to (a) queries with fully-correlated FROM expressions, followed by (b) queries with partially-correlated FROM expressions and finally (c) queries that have no correlation with the other queries (which are appended to the end of the set of merged queries), as specified by the merging algorithm. With this approach, the common scan/join tasks in merged insert queries are not executed repeatedly [15]. After the merging process, the optimized set of insert queries is passed to the query execution layer of Hive.
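The precedence order (a), (b), (c) can be sketched as a simple greedy grouping over the tables and joins found in each FROM clause. This is a simplified assumption about how such a classification might look, not the authors' algorithm: the `Query` record, the `group_queries` function and the way partial groups are seeded are all hypothetical, and the table and join lists are reduced versions of the TPC-H queries named.

```python
from collections import namedtuple

# Hypothetical summary of a query's FROM clause: table names and join pairs.
Query = namedtuple("Query", ["name", "tables", "joins"])

def signature(query):
    """Order-independent fingerprint of the tables and joins of a query."""
    return (frozenset(query.tables), frozenset(query.joins))

def group_queries(queries):
    """Group queries as fully correlated, partially correlated, or uncorrelated."""
    groups = []

    # (a) Fully correlated: exactly the same tables and joins.
    by_signature = {}
    for query in queries:
        by_signature.setdefault(signature(query), []).append(query)
    for members in by_signature.values():
        if len(members) > 1:
            groups.append(("full", [q.name for q in members]))
    grouped = {name for _, names in groups for name in names}
    remaining = [q for q in queries if q.name not in grouped]

    # (b) Partially correlated: share at least one table with the seed query.
    leftovers = []
    while remaining:
        seed = remaining.pop(0)
        partners = [q for q in remaining if set(q.tables) & set(seed.tables)]
        if partners:
            groups.append(("partial", [seed.name] + [q.name for q in partners]))
            remaining = [q for q in remaining if q not in partners]
        else:
            leftovers.append(seed.name)

    # (c) Uncorrelated queries are appended at the end, one per group.
    groups.extend(("none", [name]) for name in leftovers)
    return groups

queries = [
    Query("Q1", ("lineitem",), ()),
    Query("Q14", ("lineitem", "part"), (("lineitem", "part"),)),
    Query("Q18", ("customer", "orders", "lineitem"),
          (("customer", "orders"), ("orders", "lineitem"))),
    Query("Q19", ("lineitem", "part"), (("lineitem", "part"),)),
    Query("Q11", ("partsupp", "supplier", "nation"),
          (("partsupp", "supplier"), ("supplier", "nation"))),
]
groups = group_queries(queries)
print(groups)
```

With these inputs the sketch pairs Q14 with Q19 as fully correlated and Q1 with Q18 as partially correlated, and leaves Q11 on its own at the end, mirroring the categories used in the text.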

Experimental setup and results

In this section, the experimental setup and the performance evaluation of the merged HiveQL insert queries are presented. TPC-H and its decision support queries are chosen as our benchmark because they process large amounts of data [44]. We believe this is a good match for our experiments since Hadoop is also designed to process large amounts of data. Eleven query sets are prepared from standard TPC-H queries to experimentally analyze the performance of SharedHive under different workload scenarios. These query sets define three correlation categories for merged queries (uncorrelated, partially correlated, and fully correlated). Uncorrelated queries have nothing in common, whereas partially-correlated queries share at least one table and zero or more join operations. Fully-correlated queries have exactly the same list of tables/joins (their WHERE conditions have different selection predicates). Table 1 gives the selected set of queries and their correlation levels. Query sets 8 and 9 use single queries that are submitted several times with different selection predicates. Query set 8 executes no join operation, so it presents the performance gains of SharedHive with intensive scan sharing, whereas query set 9 includes common join operations that require communication between datanodes. Query sets 10 and 11 include queries that produce several merged insert queries.

Table 1 Sets of selected TPC-H queries and their correlation levels

Three different TPC-H decision support databases with sizes 1GB, 100GB and 1TB are used. Similar experimental settings are used in previous studies [22],[40].

The experiments are performed on a private Cloud server, a 4U DELL PowerEdge R910 having 32 (64 with Hyper Threading) cores. Each core is an Intel Xeon E7-4820 core with 2.0GHz processing power. The server has 128GB DDR3 1600MHz virtualized memory and Broadcom NetXtreme II 5709 1Gbps NICs. The operating system of the physical server is Windows Server 2012 Standard Edition. Twenty Linux CentOS 6.4 virtual machines are installed on this server as guest operating systems. Each virtual machine has two 2.0GHz processors, 2GB RAM and 250GB disk storage. An additional master node is used as NameNode/JobTracker (4 processors, 8GB RAM and 500GB disk storage). The latest stable versions of Hadoop (release 1.2.1) and Hive (version 0.12.0) are used [1],[11]. The split size of the files (HDFS block size) is 64MB, the replication number is 2, the maximum number of map tasks is 2, the maximum number of reduce tasks is 2, and map output compression is disabled during the experiments.

In order to remove noise in performance measurements, the Cloud server is dedicated solely to our experiments during the performance evaluation. Therefore, we believe that performance interference from external factors such as network congestion or OS-level contention on shared resources is minimized as much as possible. We observed only negligible changes in the response times of the queries when we repeated our experiments three times.

Table 2 presents the response times of TPC-H queries (Q1, Q3, Q6, Q11, Q12, Q14, Q17, Q18, Q19, Q22) with 1GB, 100GB and 1TB database sizes. These results constitute baselines to compare the results of the merged HiveQL queries with single execution performance of Hive.

Table 2 Execution times (sec.) of single TPC-H queries

Tables 3 and 4 show the performance increases for the selected HiveQL query sets given in Table 1 that are merged and run to observe the effect of SharedHive on total response times. The percentage values show the reduction of the response time. Significant performance increases can be seen easily.

Table 3 Execution times of sequential and merged queries in seconds

Table 4 Execution times of sequential and merged query sets (8 and 9) in seconds

Although uncorrelated queries have nothing in common, their total execution times are observed to reduce by 0.2%-6.9% due to the improvement in Hive query preprocessing overheads. Merging uncorrelated queries does not increase the performance when the database size reaches terabyte scale. The reductions in total execution times of partially correlated query sets are higher than those of uncorrelated query sets (between 1.5%-20.8%). The highest benefits are observed in the fully correlated query sets (between 9.9%-39.9%). For query set 8 (single Q1 query submitted 8 times) the total query execution time is reduced from 26,716 to 3,985 seconds (85.1% reduction). The performance of mixed query sets depends on the correlation level of the queries they contain. Mixed query sets 10 and 11 execute their queries with 9.9% and 15.5% shorter execution times, respectively. During these experiments, the size of the intermediate tables that are written to the disks is considered carefully by SharedHive. If the predicted overhead of writing intermediate results is larger than the expected improvement in response time, then the queries are not merged. SharedHive is observed to reduce the number of MapReduce tasks and the sizes of read/written HDFS files as well. The results given in Table 5 present the effect of SharedHive on the number of MapReduce tasks and the sizes of read/written files of the given insert queries (having different correlation levels). As the correlation level of queries increases, the number of MapReduce tasks and the sizes of read/written data decrease substantially.
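The reported reduction for query set 8 and the stated merge rule can be checked with a short calculation. In the sketch below, `reduction_percent` and `should_merge` are hypothetical helper names; the first reproduces the percentage from the two execution times quoted above, and the second encodes only the rule as stated (do not merge when the predicted write overhead exceeds the expected improvement), with made-up numbers in the usage lines.

```python
def reduction_percent(sequential_sec, merged_sec):
    """Relative reduction of the total execution time, in percent."""
    return 100.0 * (sequential_sec - merged_sec) / sequential_sec

def should_merge(expected_saving_sec, predicted_write_overhead_sec):
    """Merge only if writing intermediate results costs no more than it saves."""
    return predicted_write_overhead_sec <= expected_saving_sec

# Query set 8: 26,716 seconds sequentially vs. 3,985 seconds merged.
set8 = reduction_percent(26716, 3985)
print(f"{set8:.1f}%")

# Made-up values, only to exercise the decision rule.
print(should_merge(expected_saving_sec=120.0, predicted_write_overhead_sec=45.0))
print(should_merge(expected_saving_sec=30.0, predicted_write_overhead_sec=45.0))
```

How SharedHive predicts the overhead and the improvement is not detailed in this section, so both quantities are treated here as given inputs.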

Table 5 Comparing the number of MapReduce tasks and the sizes of read/written HDFS files by Hive and SharedHive for different correlation level 100GB TPC-H data warehouse queries

The optimization time that SharedHive spends on analyzing and merging the queries is observed to be small. This is because the number of input queries is small, and executing the merging algorithm on them requires only examining their FROM clauses, which are parsed to identify similar expressions, and rewriting the merged HiveQL query. This optimization does not take more than a few milliseconds.

In the last phase of our experiments, SharedHive is run on five different cluster sizes to observe its scalability with increasing number of datanodes. First, the merged insert query (Q14+Q19) is executed on the cluster using three different database sizes (1GB, 10GB and 100GB). It is observed that increasing the number of datanodes in the cluster improves the performance of the merged query reducing execution times by 29%, 81% and 88% in the database instances when the number of datanodes is increased from 1 to 20 (see Figure 4).

MQO component of SharedHive is an extension to Hive and welcomes any performance increase that is achieved on the HDFS layer either due to increase in the number of datanodes or balanced distribution of data files.

Comparison with other MapReduce-based MQO systems

SharedHive can execute the selected/correlated queries in shorter times than Hive by reducing the number of MapReduce tasks and the sizes of the files read/written by the tasks. The correlation detection mechanism of SharedHive is simple and does not find the number of common rows and/or columns of queries with complex algorithms as in [19],[29]. The execution time performance gains are observed to be within the range of 1.5%-85.1%, in accordance with the correlation level of the queries. For repeatedly issued similar queries that have different predicates, SharedHive performs well. SharedHive benefits from the underlying HDFS architecture; therefore, its scalability is preserved and better performance is obtained when additional datanodes are introduced to Hadoop. The query results obtained by SharedHive have been compared with those of Hive and verified to be the same.

MRShare [40] is a recent MQO system developed for benefitting from multiple queries containing similar MapReduce tasks. It transforms a batch of queries into a new batch that will be executed more efficiently by merging jobs into groups and evaluating each group as a single query. MRShare optimizes queries that work on the same input table and does not consider sharing of join operations. However, SharedHive can merge queries containing joins into a new set of insert queries. MRShare shares scan tasks by creating a single job for multiple jobs and does not use temporary files (as it is done by SharedHive).

YSmart [22] is a correlation-aware MQO system similar to SharedHive. It detects and removes redundant MapReduce tasks of single complex queries but does not optimize multiple queries. The developers of YSmart present experimental results that significantly outperform conventional Hive for single queries. SharedHive does not provide any performance increase for single queries unless they are submitted several times (with different predicates). SharedHive works in the application layer of Hive by merging the query level operations, whereas MRShare and YSmart explore and eliminate redundant tasks in the MapReduce layer.

Apache Pig is the most mature MapReduce-based platform that supports a large number of sharing mechanisms among multiple queries [37]. Complex tasks consisting of multiple interrelated data transformations are explicitly encoded as data flow sequences; however, unlike SharedHive, its query language, Pig Latin, is not compatible with standard SQL statements.

Conclusions and future work

In this study, we propose a multiple query optimization (MQO) framework, SharedHive, for improving the performance of MapReduce-based data warehouse Hadoop Hive queries. To our knowledge, this is the first work that aims at improving the performance of Hive with MQO techniques. In SharedHive, we detect common tasks of correlated TPC-H HiveQL queries and merge them into a new set of global Hive insert queries. With this approach, it has been experimentally shown that significant performance improvements can be achieved by reducing the number of MapReduce tasks and the total sizes of read/written files.

As future work, we plan to incorporate MQO functionality at the MapReduce layer, similar to YSmart, into SharedHive. In this way, it will be possible to eliminate even more redundant MapReduce tasks in queries and to improve the overall performance of the naïve rule-based Hive query optimizer even further.

Appendix

A. Merging two Q1 queries that have different select predicates

B. Merging queries Q14 and Q19 (Fully correlated FROM clauses)

C. Merging queries Q1 and Q18 (Partially correlated FROM clauses)

Abbreviations

HDFS: Hadoop distributed file system
RDBMS: Relational database management system
SQL: Structured query language
QL: Query language
MQO: Multiple-query optimization
UDF: User-defined functions
CLI: Command line interface
DAG: Directed acyclic graph

References

Apache Hadoop. Last Accessed 1 February 2014., [http://hadoop.apache.org/]
Dean J, Ghemawat S: MapReduce: simplified data processing on large clusters. Commun ACM 2008, 51(1):107–113. 10.1145/1327452.1327492
Article Google Scholar
Condie T, Conway N, Alvaro P, Hellerstein JM, Elmeleegy K, Sears R (2010) MapReduce online In: Proceedings of the 7th, USENIX conference on Networked systems design and implementation, April 28–30, 21–21, San Jose, California.
Google Scholar
Stonebraker M, Abadi D, DeWitt DJ, Madden S, Paulson E, Pavlo A, Rasin A: MapReduce and parallel DBMSs: friends or foes? Commun ACM 2010, 53(1):64–71. 10.1145/1629175.1629197
Article Google Scholar
DeWitt D, Stonebraker M (2008) MapReduce: A major step backwards. The Database Column,1.
Google Scholar
He Y, Lee R, Huai Y, Shao Z, Jain N, Zhang X, Xu Z (2011) Rcfile: A fast and space-efficient data placement structure in mapreduce-based warehouse systems In: Proceedings of the 2011 IEEE 27th International, Conference on Data Engineering, April 11–16, 1199–1208.
Chapter Google Scholar
Kang U, Tsourakakis CE, Faloutsos C: Pegasus: mining peta-scale graphs. Knowl Inf Syst 2011, 27(2):303–325. 10.1007/s10115-010-0305-0
Article Google Scholar
Grolinger K, Higashino WA, Tiwari A, Capretz MA: Data management in cloud environments: NoSQL and NewSQL data stores. J Cloud Comput: Adv Syst Appl 2013, 2(1):22. 10.1186/2192-113X-2-22
Article Google Scholar
Bayir MA, Toroslu IH, Cosar A, Fidan G (2009) Smart miner: a new framework for mining large scale web usage data In: Proceedings of the 18th international conference on World wide web, 161–170. ACM.
Google Scholar
Abouzeid A, Bajda-Pawlikowski K, Abadi D, Silberschatz A, Rasin A: HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc VLDB 2009, 2(1):922–933. 10.14778/1687627.1687731
Article Google Scholar
Hadoop Hive project. Last Accessed 4 January 2014., [http://hadoop.apache.org/hive/] Hadoop Hive project. Last Accessed 4 January 2014.
Dai W, Bassiouni M: An improved task assignment scheme for Hadoop running in the clouds. J Cloud Comput: Adv Syst Appl 2013, 2(1):1–16. 10.1186/2192-113X-2-23
Article Google Scholar
Issa J, Figueira S: Hadoop and memcached: performance and power characterization and analysis. J Cloud Comput: Adv Sys Appl 2012, 1(1):1–20. 10.1186/2192-113X-1-10
Article Google Scholar
Ordonez C, Song IY, Garcia-Alvarado C (2010) Relational versus non-relational database systems for data warehousing In: Proceedings of the ACM 13th international workshop on Data warehousing and OLAP, October 30–30, Toronto, ON, Canada.
Google Scholar
Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Zhang N, Murthy R (2010) Hive-a petabyte scale data warehouse using hadoop In: Proceedings of ICDE, 996–1005.
Google Scholar
Sellis TK: Multiple-query optimization. ACM Trans Database Syst (TODS) 1988, 13(1):23–52. 10.1145/42201.42203
Article Google Scholar
Bayir MA, Toroslu IH, Cosar A: Genetic algorithm for the multiple-query optimization problem. IEEE Trans Syst Man Cybernet Part C Appl Rev 2007, 37(1):147–153. 10.1109/TSMCC.2006.876060
Article Google Scholar
Cosar A, Lim EP, Srivastava J (1993) Multiple query optimization with depth-first branch-and-bound and dynamic query ordering In: Proceedings of the second international conference on, Information and knowledge management, 433–438. ACM.
Google Scholar
Zhou J, Larson PA, Freytag JC, Lehner W (2007) Efficient exploitation of similar subexpressions for query processing In: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, 533–544. ACM.
Chapter Google Scholar
Chaiken R, Larson PÃ, Ramsey B, Shakibn D, Weaver S, Zhou J: SCOPE: easy and efficient parallel processing of massive data sets. Proc VLDB 2008, 1(2):1265–1276. 10.14778/1454159.1454166
Article Google Scholar
Cohen J, Dolan B, Dunlap M, Hellerstein JM, Welton C: MAD skills: new analysis practices for big data. Proc VLDB 2009, 2(2):1481–1492. 10.14778/1687553.1687576
Article Google Scholar
Lee R, Luo T, Huai Y, Wang F, He Y, Zhang X (2011) Ysmart: Yet another sql-to-mapreduce translator In: Distributed Computing Systems (ICDCS), 2011 31st International, Conference on, 25–36. IEEE.
Google Scholar
Finkelstein S (1982) Common expression analysis in database applications In: Proceedings of the 1982 ACM SIGMOD international conference on, Management of data, 235–245. ACM.
Google Scholar
Roy P, Seshadri S, Sudarshan S, Bhobe S: Efficient and extensible algorithms for multi query optimization. ACM SIGMOD Rec 2000, 29(2):249–260. 10.1145/335191.335419
Article Google Scholar
Giannikis G, Alonso G, Kossmann D: SharedDB: killing one thousand queries with one stone. Prof VLDB 2012, 5(6):526–537. 10.14778/2168651.2168654
Article Google Scholar
Chen F, Dunham MH: Common subexpression processing in multiple-query processing. IEEE Trans Knowl Data Eng 1998, 10(3):493–499. 10.1109/69.687980
Article Google Scholar
Mehta M, DeWitt DJ (1995) Managing intra-operator parallelism in parallel database systems In: VLDB vol 95, 382–394.
Google Scholar
Beynon M, Chang C, Catalyurek U, Kurc T, Sussman A, Andrade H, Ferreira R, Saltz J: Processing large-scale multi-dimensional data in parallel and distributed environments. Parallel Comput 2002, 28(5):827–859. 10.1016/S0167-8191(02)00097-2
Article Google Scholar
Chen G, Wu Y, Liu J, Yang G, Zheng W: Optimization of sub-query processing in distributed data integration systems. J Netw Comput Appl 2011, 34(4):1035–1042. 10.1016/j.jnca.2010.06.007
Article Google Scholar
Lee R, Zhou M, Liao H (2007) Request Window: an approach to improve throughput of RDBMS-based data integration system by utilizing data sharing across concurrent distributed queries In: Proceedings of the 33rd international conference on, Very large data bases, 1219–1230. VLDB Endowment.
Google Scholar
Harizopoulos S, Shkapenyuk V, Ailamaki A (2005) QPipe: a simultaneously pipelined relational query engine In: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, 383–394. ACM.
Google Scholar
Le W, Kementsietsidis A, Duan S, Li F (2012) Scalable multi-query optimization for SPARQL In: Data Engineering (ICDE), 2012 IEEE 28th International Conference on, 666–677. IEEE.
Google Scholar
Silva YN, Larson PA, Zhou J (2012) Exploiting common subexpressions for cloud query processing In: Data Engineering (ICDE), 2012 IEEE 28th International Conference on, 1337–1348. IEEE.
Google Scholar
Wang X, Olston C, Sarma AD, Burns R (2011) CoScan: cooperative scan sharing in the cloud In: Proceedings of the 2nd ACM Symposium on Cloud Computing, 11. ACM.
Wolf J, Balmin A, Rajan D, Hildrum K, Khandekar R, Parekh S, Wu K-L, Vernica R: On the optimization of schedules for MapReduce workloads in the presence of shared scans. VLDB J 2012, 21(5):589–609. 10.1007/s00778-012-0279-5
Ferrera P, De Prado I, Palacios E, Fernandez-Marquez JL, Serugendo GDM (2013) Tuple MapReduce and Pangool: an associated implementation. Knowl Inf Syst: 1–27. doi:10.1007/s10115-013-0705-z.
Apache Pig. Last accessed 1 February 2014. http://pig.apache.org/
Bajda-Pawlikowski K, Abadi DJ, Silberschatz A, Paulson E (2011) Efficient processing of data warehousing queries in a split execution environment In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, 1165–1176. ACM.
Friedman E, Pawlowski P, Cieslewicz J: SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. Proc VLDB 2009, 2(2):1402–1413. 10.14778/1687553.1687567
Nykiel T, Potamias M, Mishra C, Kollios G, Koudas N: MRShare: sharing across multiple queries in mapreduce. Proc VLDB 2010, 3(1–2):494–505. 10.14778/1920841.1920906
Gruenheid A, Omiecinski E, Mark L (2011) Query optimization using column statistics in hive In: Proceedings of the 15th Symposium on International Database Engineering & Applications, 97–105. ACM.
Chaudhuri S, Shim K (1994) Including group-by in query optimization In: VLDB vol. 94, 354–366.
Jiang D, Tung AK, Chen G: Map-join-reduce: toward scalable and efficient data analysis on large clusters. Knowl Data Eng IEEE Trans 2011, 23(9):1299–1311. 10.1109/TKDE.2010.248
Running TPC-H queries on Hive. Last accessed 1 January 2014. http://issues.apache.org/jira/browse/HIVE-600

Acknowledgements

We sincerely thank all the researchers in our references section for the inspiration they provide.

Author information

Authors and Affiliations

Middle East Technical University Computer Engineering Department, Cankaya, Ankara, Turkey
Tansel Dokeroglu, Serkan Ozal & Ahmet Cosar
Microsoft Research Redmond, One Microsoft Way, Redmond, Washington, 98052, USA
Murat Ali Bayir
Hacettepe University Computer Engineering Department, Cankaya, Ankara, Turkey
Muhammet Serkan Cinar

Authors

Tansel Dokeroglu
Serkan Ozal
Murat Ali Bayir
Muhammet Serkan Cinar
Ahmet Cosar

Corresponding author

Correspondence to Tansel Dokeroglu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

TD designed the algorithm for merging the fully and partially correlated queries of Hive and executed the experiments. SO implemented the new Multiple Query Optimization component and added it to the Hive framework. MAB prepared the mathematical formulation of the Multiple Query Optimization process for Hive and drafted the manuscript. MSC prepared the Hadoop/Hive experimental setup. AC coordinated the whole study and prepared the related work section. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Dokeroglu, T., Ozal, S., Bayir, M.A. et al. Improving the performance of Hadoop Hive by sharing scan and computation tasks. J Cloud Comp 3, 12 (2014). https://doi.org/10.1186/s13677-014-0012-6

Received: 09 May 2014
Accepted: 20 June 2014
Published: 29 July 2014
DOI: https://doi.org/10.1186/s13677-014-0012-6
