Data management in cloud environments: NoSQL and NewSQL data stores

Grolinger, Katarina; Higashino, Wilson A; Tiwari, Abhinav; Capretz, Miriam AM

doi:10.1186/2192-113X-2-22

Research
Open access
Published: 18 December 2013

Data management in cloud environments: NoSQL and NewSQL data stores

Katarina Grolinger¹,
Wilson A Higashino^1,2,
Abhinav Tiwari¹ &
…
Miriam AM Capretz¹

Journal of Cloud Computing: Advances, Systems and Applications volume 2, Article number: 22 (2013) Cite this article

43k Accesses
145 Citations
1 Altmetric
Metrics details

Abstract

Advances in Web technology and the proliferation of mobile devices and sensors connected to the Internet have resulted in immense processing and storage requirements. Cloud computing has emerged as a paradigm that promises to meet these requirements. This work focuses on the storage aspect of cloud computing, specifically on data management in cloud environments. Traditional relational databases were designed in a different hardware and software era and are facing challenges in meeting the performance and scale requirements of Big Data. NoSQL and NewSQL data stores present themselves as alternatives that can handle huge volume of data. Because of the large number and diversity of existing NoSQL and NewSQL solutions, it is difficult to comprehend the domain and even more challenging to choose an appropriate solution for a specific task. Therefore, this paper reviews NoSQL and NewSQL solutions with the objective of: (1) providing a perspective in the field, (2) providing guidance to practitioners and researchers to choose the appropriate data store, and (3) identifying challenges and opportunities in the field. Specifically, the most prominent solutions are compared focusing on data models, querying, scaling, and security related capabilities. Features driving the ability to scale read requests and write requests, or scaling data storage are investigated, in particular partitioning, replication, consistency, and concurrency control. Furthermore, use cases and scenarios in which NoSQL and NewSQL data stores have been used are discussed and the suitability of various solutions for different sets of applications is examined. Consequently, this study has identified challenges in the field, including the immense diversity and inconsistency of terminologies, limited documentation, sparse comparison and benchmarking criteria, and nonexistence of standardized query languages.

Introduction

In recent years, advances in Web technology and the proliferation of sensors and mobile devices connected to the Internet have resulted in the generation of immense data sets that need to be processed and stored. Just on Facebook, 2.4 billion content items are shared among friends every day [1]. Today, businesses generate massive volume of data which has grown too big to be managed and analyzed by traditional data processing tools [2]. Indeed, traditional relational database management systems (RDBMS) were designed in an era when the available hardware, as well as the storage and processing requirements, were very different than they are today [3]. Therefore, these solutions have been encountering many challenges in meeting the performance and scaling requirements of this “Big Data” reality.

Big Data is a term used to refer to massive and complex datasets made up of a variety of data structures, including structured, semi-structured, and unstructured data. According to the Gartner group, Big Data can be defined by 3Vs: volume, velocity, and variety [4]. Today, businesses are aware that this huge volume of data can be used to generate new opportunities and process improvements through their processing and analysis [5, 6].

At about the same time, cloud computing has also emerged as a computational paradigm for on-demand network access to a shared pool of computing resources (e.g., network, servers, storage, applications, and services) that can be rapidly provisioned with minimal management effort [7]. Cloud computing is associated with service provisioning, in which service providers offer computer-based services to consumers over the network. Often these services are based on a pay-per-use model where the consumer pays only for the resources used. Overall, a cloud computing model aims to provide benefits in terms of lesser up-front investment, lower operating costs, higher scalability, elasticity, easy access through the Web, and reduced business risks and maintenance expenses [8].

Due to such characteristics of cloud computing, many applications have been created in or migrated to cloud environments over the last few years [9]. In fact, it is interesting to notice the extent of synergy between the processing requirements of Big Data applications, and the availability and scalability of computational resources offered by cloud services. Nevertheless, the effective leveraging of cloud infrastructure requires careful design and implementation of applications and data management systems. Cloud environments impose new requirements to data management; specifically, a cloud data management system needs to have:

Scalability and high performance, because today’s applications are experiencing continuous growth in terms of the data they need to store, the users they must serve, and the throughput they should provide;
Elasticity, as cloud applications can be subjected to enormous fluctuations in their access patterns;
Ability to run on commodity heterogeneous servers, as most cloud environments are based on them;
Fault tolerance, given that commodity machines are much more prone to fail than high-end servers;
Security and privacy features, because the data may now be stored on third-party premises on resources shared among different tenants;
Availability, as critical applications have also been moving to the cloud and cannot afford extended periods of downtime.

Faced with the challenges that traditional RDBMSs encounter in handling Big Data and in satisfying the cloud requirements described above, a number of specialized solutions have emerged in the last few years in an attempt to address these concerns. The so-called NoSQL and NewSQL data stores present themselves as data processing alternatives that can handle this huge volume of data and provide the required scalability.

Despite the appropriateness of NoSQL and NewSQL data stores as cloud data management systems, the immense number of existing solutions (over 120 [10]) and the discrepancies among them make it difficult to formulate a perspective on the domain and even more challenging to select the appropriate solution for a problem at hand. This survey reviews NoSQL and NewSQL data stores with the intent of filling this gap. More specifically, this survey has the following objectives:

To provide a perspective on the domain by summarizing, organizing, and categorizing NoSQL and NewSQL solutions.
To compare the characteristics of the leading solutions in order to provide guidance to practitioners and researchers to choose the appropriate data store for specific applications.
To identify research challenges and opportunities in the field of large-scale distributed data management.

NoSQL data models and categorization of NoSQL data stores have been addressed in other surveys [10–14]. In addition, aspects associated with NoSQL, such as MapReduce, the CAP theorem, and eventual consistency have also been discussed in the literature [15, 16]. This paper presents a short overview of NoSQL concepts and data models; nevertheless, the main contributions of this paper include:

A discussion of NewSQL data stores. The category of NewSQL solutions is recent; the first use of the term was in 2011 [17]. NewSQL solutions aim to bring the relational data model into the world of NoSQL. Therefore, a comparison among NewSQL and NoSQL solutions is essential to understand this new class of data stores.
A detailed comparison among various NoSQL and NewSQL solutions over a large number of dimensions. By presenting this comparison in a table form, this paper helps practitioners to choose the appropriate data store for the task at hand. Previous surveys have included comparisons of NoSQL solutions [11]; nonetheless, the number of compared attributes was limited, and the analysis performed was not as comprehensive.
A review of a number of security features is also included in the data store comparison. According to the surveyed literature [10–14], security has been overlooked, even though it is an important aspect of the adoption of NoSQL solutions in practice.
A discussion of the suitability of various NoSQL and NewSQL solutions for different sets of applications. NoSQL and NewSQL solutions differ greatly in their characteristics; moreover, changes in this area are rapid, with frequent releases of new features and options. Therefore, this work discusses the suitability of NoSQL and NewSQL data stores for different use cases from the perspective of core design decisions.

The rest of this paper is organized as follows: the “Background and Related Work” section describes background concepts and studies related to this survey. The methodology used in this survey is presented in the “Methodology” section. The “Data Models” section presents the NoSQL and NewSQL data models and categorizes the surveyed data stores accordingly. Querying capabilities are discussed in the “Querying” section, while the “Scaling” section describes the solutions’ scaling properties and the “Security” section their security features. The suitability of NoSQL and NewSQL data stores for different use cases is discussed in the “Use Cases” section. The challenges and opportunities identified in this study are described in the “Opportunities” section, and the “Conclusions” section concludes the paper.

Background and related work

This section introduces relevant concepts and positions this paper with respect to other surveys in the NoSQL domain.

Cloud computing

Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., network, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction [7]. It denotes a model in which a computing infrastructure is viewed as a “cloud”, from which businesses and individuals can access applications on demand from anywhere in the world [18]. Essential characteristics of the cloud-computing model, according to the U.S. National Institute of Standards and Technology (NIST), include [7]:

On-demand self-service, enabling a user to access cloud provider services without human interaction;
Broad network access that enables heterogeneous thick and thin client applications to access the services;
Pooling of service provider computing resources to serve multiple consumers;
Automatic, rapid, and elastic provisioning of resources;
Measured service in which resource usage is monitored and controlled.

Overall, a cloud computing model aims to provide benefits in terms of lesser up-front investment in infrastructure during deployment, lower operating costs, higher scalability, ease of access through the Web, and reduced business risks and maintenance expenses [8].

The CAP theorem

In order to store and process massive datasets, a common employed strategy is to partition the data and store the partitions across different server nodes. Additionally, these partitions can also be replicated in multiple servers so that the data is still available even in case of servers’ failures. Many modern data stores, such as Cassandra [19] and BigTable [20], use these and others strategies to implement high-available and scalable solutions that can be leveraged in cloud environments. Nevertheless, these solutions and others replicated networked data stores have an important restriction, which was formalized by the CAP theorem [21]: only two of three CAP properties (consistency, availability, and partition tolerance) can be satisfied by networked shared-data systems at the same time [21, 22].

Consistency, as interpreted in CAP, is equivalent to having a single up-to-date instance of the data [22]. Therefore, consistency in CAP has a somewhat dissimilar meaning to and represents only a subset of consistency as defined in ACID (Atomicity, Consistency, Isolation and Durability) transactions of RDBMSs [22], which usually refers to the capability of maintaining the database in a consistent state at all times. The Availability property means that the data should be available to serve a request at the moment it is needed. Finally, the Partition Tolerance property refers to the capacity of the networked shared-data system to tolerate network partitions. The simplest interpretation of the CAP theorem is to consider a distributed data store partitioned into two sets of participant nodes; if the data store denies all write requests in both partitions, it will remain consistent, but it is not available. On the other hand, if one (or both) of the partitions accepts write requests, the data store is available, but potentially inconsistent.

Despite the relative simplicity of its result, the CAP theorem has had important implications and has originated a great variety of distributed data stores aiming to explore the trade-offs between the three properties. More specifically, the challenges of RDBMS in handling Big Data and the use of distributed systems techniques in the context of the CAP theorem led to the development of new classes of data stores called NoSQL and NewSQL.

NoSQL and NewSQL

The origin of the NoSQL term is attributed to Johan Oskarsson, who used it in 2009 to name a conference about “open-source, distributed, non-relational databases” [23]. Today, the term is used as an acronym for “Not only SQL”, which emphasizes that SQL-style querying is not the crucial objective of these data stores. Therefore, the term is used as an umbrella classification that includes a large number of immensely diverse data stores that are not based on the relational model, including some solutions designed for very specific applications such as graph storage. Even though there is no agreement on what exactly constitutes a NoSQL solution, the following set of characteristics is often attributed to them [11, 15]:

Simple and flexible non-relational data models. NoSQL data stores offer flexible schemas or are sometimes completely schema-free and are designed to handle a wide variety of data structures [11, 12, 24]. Current solution data models can be divided into four categories: key-value stores, document stores, column-family stores, and graph databases.
Ability to scale horizontally over many commodity servers. Some data stores provide data scaling, while others are more concerned with read and/or write scaling.
Provide high availability. Many NoSQL data stores are aimed towards highly distributed scenarios, and consider partition tolerance as unavoidable. Therefore, in order to provide high availability, these solutions choose to compromise consistency in favour of availability, resulting in AP (Available/ Partition-tolerant) data stores, while most RDBMs are CA (Consistent/Available).
Typically, they do not support ACID transactions as provided by RDBMS. NoSQL data stores are sometimes referred as BASE systems (Basically Available, Soft state, Eventually consistent) [25]. In this acronym, Basically Available means that the data store is available all the time whenever it is accessed, even if parts of it are unavailable; Soft-state highlights that it does not need to be consistent always and can tolerate inconsistency for a certain time period; and Eventually consistent emphasizes that after a certain time period, the data store comes to a consistent state. However, some NoSQL data stores, such as CouchDB [26] provide ACID compliance.

These characteristics make NoSQL data stores especially suitable for use as cloud data management systems. Indeed, many of the Database as a Service offerings available today, such as Amazon’s SimpleDB [27] and DynamoDB [28], are considered to be NoSQL data stores. However, the lack of full ACID transaction support can be a major impediment to their adoption in many mission-critical systems. For instance, Corbert et al.[29] argue that it is better to deal with performance problems caused by the overuse of transactions rather than trying to work around the lack of transaction support. Furthermore, the use of low-level query languages, the lack of standardized interfaces, and the huge investments already made in SQL by enterprises are other barriers to the adoption of NoSQL data stores.

The category of NewSQL data stores, on the other hand, is being used to classify a set of solutions aimed at bringing to the relational model the benefits of horizontal scalability and fault tolerance provided by NoSQL solutions. The first use of the term is attributed to a report of the 451 group in 2011 [17]. The Google Spanner [29] solution is considered to be one of the most prominent representatives of this category, as is also VoltDB [30], which is based on the H-Store [31] research project. Clustrix [32] and NuoDB [33] are two commercial projects that are also classified as NewSQL. All these data stores support the relational model and use SQL as their query language, even though they are based on different assumptions and architectures than traditional RDBMSs. Generally speaking, NewSQL data stores meet many of the requirements for data management in cloud environments and also offer the benefits of the well-known SQL standard.

Related surveys

Several surveys have addressed the NoSQL domain [10–14]; nevertheless, this survey is different because it focuses on the comparison of available NoSQL and NewSQL solutions over a number of dimensions. Hecht and Jablonski [11] presented a use case-oriented survey, which, like this one, compares features of several NoSQL solutions, including the data models, querying capabilities, partitioning, replication, and consistency. However, for a large number of features, they use a “black and white” (+/−) approach to indicate that the solution either does or does not have the feature. This survey adopts a different approach by expressing degrees, aspects, and details of each solution’s features. Moreover, this survey includes security features and NewSQL solutions, which are not addressed in their work.

Pokorny [13], Cattell [12], and Sakr et al.[14] have also reviewed NoSQL data stores. They portrayed a number of NoSQL data stores, describing their data models and their main underlying principles and features. However, in contrast to this work, they did not perform direct feature comparison among data stores. Sadalage and Fowler [15] described the principles on which NoSQL stores are based and why they may be superior to traditional databases. They introduced several solutions, but they did not compare features as is done in this work.

In addition, existing surveys have not described the rationale or method for choosing the specific data stores to include in their studies [11–14]. For example, Sakr et al. stated, “…we give a brief introduction about some of those projects” [14], or Hecht and Jablonski “The most prominent stores are …” [11]; however, the method for choosing the data stores included in their studies were not presented. In contrast, this work uses a systematic approach to choose which data stores to include in the study. Additionally, this survey includes different data stores than the existing surveys [11–14].

Methodology

Due to the large number of NoSQL and NewSQL solutions, it was not feasible to include all of them in this survey. While other NoSQL surveys did not specify the methodology for choosing the data stores to be included in their studies [11–14], this survey makes use of a systematic approach to select the solutions.

DB-Engine Ranking [34] ranks database systems according to their popularity by using parameters such as the number of mentions on Web sites, general interest according to Google Trends, frequency of technical discussions on the Web, number of job offers, and number of professional profiles in which the solutions are mentioned. As can be seen, the DB-Engine Ranking estimates overall popularity of a data store on the Web. Nevertheless, this work is also interested in popularity within the research community; therefore, it also considers how often each system has been mentioned in research publications. Even though various research repositories could have been used, this study focuses on the IEEE as it is one of the most prominent publishers of research papers in computer science and software engineering. Hence, the initial list of NoSQL solutions was obtained from DB-Engine Ranking [34] and includes all NoSQL solutions listed by DB-Engine Ranking. Next, the IEEE Xplore database was searched to determine how many times each data store was mentioned in the indexed publications. For each NoSQL category, the most often cited data stores were chosen to be included in this survey. The key-value category was further divided into in-memory and disk-persistent key-value stores, and the most prominent solutions within each subcategory were chosen.

The prevalent data stores found in IEEE publications are similar to the data stores ranked high by DB-Engine Ranking. In the document category, the same three data stores, MongoDB [35], CouchDB [26], and Couchbase [36] are the most popular according to DB-Engine Ranking and IEEE publications. Both popularity estimation approaches rank Cassandra [19] and HBase [37] as the most prominent in the column-family category. SimpleDB [27] and DynamoDB [28] are ranked high by both approaches. While DB-Engine Ranking considers them key-value stores, this work categorizes them as column-family stores because of their table-like data model. In the remaining two categories, key-value data stores and graph databases, a large number of solutions rank high in popularity according to both approaches, including Redis [38], Memcached [39], Riak [40], BerkeleyDB [41], and Neo4J [42].

The selection of NewSQL data stores followed a similar approach. Nevertheless, because most of these solutions are very recent, only VoltDB and Spanner had a significant number of hits in the IEEE Xplore database. Therefore, in order to include a larger number of solutions in this survey, Clustrix and NuoDB were also selected because of their unique architectural and technical approaches.

The selected NoSQL and NewSQL solutions were compared with a focus on the data model, querying, scaling, and security-related capabilities. The categorization according to data model was used because the data model is the main factor driving other capabilities, including querying and scaling. In the querying context, support for MapReduce, SQL-like querying, REST (representational state transfer) and other APIs was considered. With regard to scaling, the study considered scaling read and write requests, or scaling data storage and analyzed four concepts closely related: partitioning, replication, consistency, and concurrency control. Finally, the following security related features were analyzed: authentication, authorization, encryption, and auditing.

Data models

The family of data stores belonging to the NoSQL category can be further sub-classified based on their data models. Many authors have proposed distinct interpretations for NoSQL categories, which has led to different sub-classifications [10, 12]. In this paper, the classification provided by Hecht and Jablonski [11] has been used, which divides the various NoSQL data stores into four major categories: key-value stores, column-family stores, document stores, and graph databases. Figure 1 shows representations of these models. This study also reviews NewSQL as a hybrid between NoSQL stores and relational databases.

Key-value stores

Key-value stores have a simple data model based on key-value pairs, which resembles an associative map or a dictionary [11]. The key uniquely identifies the value and is used to store and retrieve the value into and out of the data store. The value is opaque to the data store and can be used to store any arbitrary data, including an integer, a string, an array, or an object, providing a schema-free data model. Along with being schema-free, key-value stores are very efficient in storing distributed data, but are not suitable for scenarios requiring relations or structures. Any functionality requiring relations, structures, or both must be implemented in the client application interacting with the key-value store. Furthermore, because the values are opaque to them, these data stores cannot handle data-level querying and indexing and can perform queries only through keys. Key-value stores can be further classified as in-memory key-value stores which keep the data in memory, like Memcached [39] and Redis [38], and persistent key-value stores which maintain the data on disk, such as BerkeleyDB [41], Voldemort [43], and Riak [40].

Column-family stores

Most column-family stores are derived from Google Bigtable [20], in which the data are stored in a column-oriented way. In Bigtable, the dataset consists of several rows, each of which is addressed by a unique row key, also known as a primary key. Each row is composed of a set of column families, and different rows can have different column families. Similarly to key-value stores, the row key resembles the key, and the set of column families resembles the value represented by the row key. However, each column family further acts as a key for the one or more columns that it holds, where each column consists of a name-value pair. Hadoop HBase [37] directly implements the Google Bigtable concepts, whereas Amazon SimpleDB [27] and DynamoDB [28] have a different data model than Bigtable. SimpleDB and DymanoDB contain only a set of column name-value pairs in each row, without having column families. Cassandra [19], on the other hand, provides the additional functionality of super-columns, which are formed by grouping various columns together.

In column-family stores, a column family in different rows can contain different columns. Occasionally, SimpleDB and DynamoDB are classified as key-value stores [34]; however, this paper considers them as column-family stores due to their table-like data model in which each row can have different columns. Typically, the data belonging to a row is stored together on the same server node. However, Cassandra offers to store a single row across multiple server nodes by using composite partition keys. In column-family stores, the configuration of column families is typically performed during start-up. However, a prior definition of columns is not required, which offers huge flexibility in storing any data type.

In general, column-family stores provide more powerful indexing and querying than key-value stores because they are based on column families and columns in addition to row keys. Similarly to key-value stores, any logic requiring relations must be implemented in the client application.

Document stores

Document stores provide another derivative of the key-value store data model by using keys to locate documents inside the data store. Most document stores represent documents using JSON (JavaScript Object Notation) or some format derived from it. For example, CouchDB [26] and the Couchbase server [36] use the JSON format for data storage, whereas MongoDB [35] stores data in BSON (Binary JSON). Document stores are suitable for applications in which the input data can be represented in a document format. A document can contain complex data structures such as nested objects and does not require adherence to a fixed schema. MongoDB provides the additional functionality of grouping the documents together into collections. Therefore, inside each collection, a document should have a unique key.

Unlike an RDBMS, where every row in a table follows the same schema, each document inside these document stores can have a different structure. Document stores provide the capability of indexing documents based on the primary key as well as on the contents of the documents. This indexing and querying capability based on document contents differentiates this data model from the key-value stores model, in which the values are opaque to the data store. On the other hand, document stores can store only data that can be represented as a document. Like key-value stores, they are inefficient in multiple-key transactions involving cross-document operations.

Graph databases

Graph databases originated from graph theory and use graphs as their data model. A graph is a mathematical concept used to represent a set of objects, known as vertices or nodes, and the links (or edges) that interconnect these vertices. By using a completely different data model than key-value, column-family, and document stores, graph databases can efficiently store the relationships between different data nodes. In graph databases, the nodes and edges also have individual properties consisting of key-value pairs. Graph databases are specialized in handling highly interconnected data and therefore are very efficient in traversing relationships between different entities. They are suitable in scenarios such as social networking applications, pattern recognition, dependency analysis, recommendation systems and solving path finding problems raised in navigation systems [11, 44].

Some graph databases such as Neo4J [42] are fully ACID-compliant. However, they are not as efficient as other NoSQL data stores in scenarios other than handling graphs and relationships. Moreover, existing graph databases are not efficient at horizontal scaling because when related nodes are stored on different servers, traversing multiple servers is not performance-efficient.

NewSQL

These solutions are by definition based on the relational model. VoltDB [30], Clustrix [32], and NuoDB [33] offer their clients a pure relational view of data. On the other hand, Google Spanner [29] is based on a semi-relational model in which tables are seen as mappings from the primary-key columns to the other columns. In its model, hierarchies of tables are created so that users can specify locality relationships between tables [29].

Even though clients interact with these data stores in terms of tables and relations, it is interesting to note that NewSQL solutions might use different data representations internally. For example, NuoDB can store its data into any compatible key-value store.

Querying

Similar to the selection of a data model, the querying capabilities of data stores play an important role when choosing among them for a particular scenario. Different data stores offer different APIs and interfaces to interact with them. This is directly dependent upon the data model that a particular data store possesses. For example, a key-value store cannot provide querying based on the contents of the values, because these values are opaque to the data store. On the other hand, a document store can do so because its data model provides the capability to index and query the document contents.

Another important query-related feature of NoSQL and NewSQL data stores is their level of support for MapReduce. MapReduce, which was first developed by Google, is a programming model and an associated implementation for processing large datasets [45]. It has now become a widely accepted approach for performing distributed data processing on a cluster of computers. Because one of the primary goals of NoSQL data stores is to scale over a large number of computers, MapReduce has been adopted by most of them. Similarly, SQL-like querying has been a preferred choice because of its widespread use over the past few decades, and it has now also been adopted in the NoSQL world. Therefore, some of the prominent NoSQL data stores like MongoDB [35] offer a SQL-like query language or similar variants such as CQL [46] offered by Cassandra and SparQL [47] by Neo4j and Allegro Graph [48].

As for the NewSQL category, the use of SQL as a query language is one of its defining characteristics, but the level of SQL support varies considerably. Clustrix [32] and NuoDB [33] are the most SQL-compliant of the solutions analyzed, having only minor incompatibilities with the standard. On the other hand, Corbett et al. state that the Google Spanner query language “looks like SQL with some extensions to support protocol-buffer-value fields” [29], but they do not provide details about the language. Finally, VoltDB [30] has a larger number of restrictions in place: it is not possible to use the having clause, tables cannot join with themselves, and all joined tables must be partitioned over the same value. It is also worth mentioning that the recommended way of interacting with VoltDB is through Stored Procedures. These procedures are written in Java, where programming logic and SQL statements are interspersed.

On the other hand, a command-line interface (CLI) is usually the simplest and most common interface that a data store can provide for interaction with itself and is therefore offered by almost all NoSQL and NewSQL products. In addition, most of these products offer API support for multiple languages. Moreover, a REST-based API has been very popular in the world of Web-based applications because of its simplicity [49]. Consequently, in the NoSQL world, a REST-based interface is provided by most solutions, either directly or indirectly through third-party APIs. Table 1 provides a detailed view of the different APIs support provided by the most prominent NoSQL and NewSQL solutions along with other querying capabilities offered.

Table 1 Querying capabilities

Full size table

Scaling

One of the main characteristics of the NoSQL and NewSQL data stores is their ability to scale horizontally and effectively by adding more servers into the resource pool. Even though there have been attempts to scale relational databases horizontally, on the contrary, RDBs are designed to scale vertically by means of adding more power to a single existing server [3].

With regard to what is being scaled, three scaling dimensions are considered: scaling read requests, scaling write requests, or scaling data storage. The partitioning, replication, consistency, and concurrency control strategies used by the NoSQL and NewSQL data stores have significant impact on their scalability. For example, partitioning determines the distribution of data among multiple servers and is therefore a means of achieving all three scaling dimensions.

Another important factor in scaling read and write requests is replication: storing the same data on multiple servers so that read and write operations can be distributed over them. Replication also has an important role in providing fault tolerance because data availability can withstand the failure of one or more servers. Furthermore, the choice of replication model is also strongly related to the consistency level provided by the data store. For example, the master–slave asynchronous replication model cannot provide consistent read requests from slaves.

Finally, another influential factor in scaling read and write requests is concurrency control. Simple read/write lock techniques may not provide sufficient concurrency control for the read and write throughput required by NoSQL and NewSQL solutions. Therefore, most solutions use more advanced techniques, such as optimistic locking with multi-version concurrency control (MVCC).

In the following subsections, partitioning, replication, consistency, and concurrency control strategies of NoSQL and NewSQL data stores will be compared; an overview is presented in Table 2.

Table 2 Partitioning, replication, consistency, and concurrency control capabilities

Full size table

Partitioning

Most NoSQL and NewSQL data stores implement some sort of horizontal partitioning or sharding, which involves storing sets or rows/records into different segments (or shards) which may be located on different servers. In contrast, vertical partitioning involves storing sets of columns into different segments and distributing them accordingly. The data model is a significant factor in defining strategies for data store partitioning. For example, vertical partitioning segments contain predefined groups of columns; therefore, data stores from the column-family category can provide vertical partitioning in addition to horizontal partitioning.

The two most common horizontal-partitioning strategies are range partitioning and consistent hashing. Range partitioning assigns data to partitions residing in different servers based on ranges of a partition key. A server is responsible for the storage and read/write handling of a specific range of keys. The advantage of this approach is the effective processing of range queries, because adjacent keys often reside in the same node. However, this approach can result in hot spots and load-balancing issues. For example, if the data are processed in the order of their key values, the processing load will always be concentrated on a single server or a few servers. Another disadvantage is that the mapping of ranges to partitions and nodes must be maintained, usually by a routing server, so that the client can be directed to the correct server. BerkeleyDB, Cassandra, HBase, and MongoDB implement range partitioning as depicted in Table 2.

In consistent hashing, the dataset is represented as a circle or ring. The ring is divided into a number of ranges equal to the number of available nodes, and each node is mapped to a point on the ring. Figure 2 illustrates consistent hashing on an example with four nodes N1 to N4. To determine the node where an object should be placed, the system hashes the object’s key and finds its location on the ring. In the example from Figure 2, object a is located between nodes N4 and N1. Next, the ring is walked clockwise until the first node is encountered, and the object gets assigned to that node. Accordingly, object a from Figure 2 gets assigned to node N1. Consequently, each node is responsible for the ring region between itself and its predecessor; for example, node N1 is responsible for data range 1, node N2 for data range 2, and so on. With consistent hashing, the location of an object can be calculated very fast, and there is no need for a mapping service as in range partitioning. This approach is also efficient in dynamic resizing: if nodes are added to or removed from the ring, only neighbouring regions are reassigned to different nodes, and the majority of records remain unaffected [16]. However, consistent hashing negatively impacts range queries because neighbouring keys are distributed across a number of different nodes. Voldemort, Riak, Cassandra, DynamoDB, CouchDB, VoltDB, and Clustrix implement consistent hashing.

The in-memory stores analyzed, Redis and Memcache, do not implement any partitioning strategy and leave it to the client to devise one. Amazon SimpleDB, the NoSQL solution which is provided as a service, offers its clients simple, manual mechanisms for partitioning data, as described in Table 2. However, the service provider might implement additional partitioning to achieve the throughput capacity specified in the service level agreement.

Partitioning graph databases is significantly more challenging than partitioning other NoSQL stores [50]. The key-value, column-family, and document data stores partition data according to a key, which is known and relatively stable. In addition, data are accessed using a lookup mechanism. In contrast, graphs are highly mutable structures, which do not have stable keys. Graph data are not accessed by performing lookups, but by exploiting relations among entities. Consequently, graph partitioning attempts to achieve a trade-off between two conflicting requirements: related graph nodes must be located on the same server to achieve good traversal performance, but, at the same time, too many graph nodes should not be on the same server because this may result in heavy and concentrated load. A number of graph-partitioning algorithms have been proposed [50], but their adoption in practice has been limited. One of the reasons is the rapid pace of changes in graphs, which may trigger intensive rebalancing operations. For this reason, the graph databases investigated, Neo4J, HypergraphDB, and AllegroGraph, do not offer partitioning in the traditional sense. However, Neo4J offers cache sharding, while HypergraphDB relies on autonomous agents to provide communication among graphs residing in different peer nodes, as summarized in Table 2.

The NewSQL data stores investigated also use diverse partitioning strategies. VoltDB uses a traditional approach in which each table is partitioned using a single key and rows are distributed among servers using a consistent hashing algorithm. Stored procedures can be executed on a single partition or on all of them; however, the drawback is that the user is responsible for selecting between these options. The Clustrix data store also partitions the data using a consistent hashing algorithm over a user-defined primary key. In addition, Clustrix also partitions the table indices using the indexed columns as the keys. Theoretically, this strategy enables parallel searches over these indices, leading to faster query resolution.

Google’s Spanner uses a different partitioning model. A Spanner deployment contains a set of servers known as spanservers, which are the nodes responsible for serving data to clients. A spanserver manages hundreds to thousands of tablets, each of which contains a set of directories. A directory is basically a set of rows that shares a common key prefix, as specified by the user-defined table hierarchy mentioned in section “Data Models”. A directory is also considered to be the basic unit of placement configuration, which is used to define constraints for data partitioning and replication among the available tablets. Some of the criteria that can be defined are the datacentres where replicas should reside, the number of replicas, the distance of the data to their clients, and the distance among replicas. The data store automatically moves the directories among the spanservers to respect these criteria and to improve general data access performance.

NuoDB is another NewSQL solution that uses a completely different approach for data partitioning. A NuoDB deployment is made up of a number of Storage Managers (SM) and Transaction Managers (TM). The SMs are the nodes responsible for maintaining the data, while the TMs are the nodes that process the queries. Each SM has a complete copy of the entire data, which basically means that no partitioning takes place within the SM. Nevertheless, the underlying key-value store used by the SMs can partition the data by itself, although this is neither controllable nor viewable by the user.

Replication

In addition to increasing read/write scalability, replication also improves system reliability, fault tolerance, and durability. Two main approaches to replication can be distinguished: master–slave and multi-master replication.

In master–slave replication, shown in Figure 3.a, a single node is designated as a master and is the only node that processes write requests. Changes are propagated from the master to the slave nodes. Examples of data stores with master–slave replication are Redis, BerkeleyDB, and HBase. In multi-master replication, illustrated in Figure 3b, multiple nodes can process write requests, which are then propagated to the remaining nodes. Whereas in master–slave replication the propagation direction is always from master to slaves, in multi-master replication, propagation happens in different directions. CouchDB and Couchbase Server are examples of multi-master data stores. Three other data stores, Voldemort, Riak, and Cassandra, support masterless replication, which is similar to multi-master replication as multiple nodes accept write requests, but as highlighted by the term masterless, all nodes play the same role in the replication system. Note that all three of the data stores with masterless replication use consistent hashing as a partitioning strategy. The strategy for placing replicas is closely related to node position on the partitioning ring, as shown in Table 2.

NewSQL replication schemes can be considered as multi-master or masterless schemes because any node can receive update statements. In VoltDB and Clustrix, a transaction/session manager receives the updates, which are forwarded to all replicas and executed in parallel. On the other hand, Google Spanner uses the Paxos state-machine algorithm [29] to guarantee that a sequence of commands will be executed in the same order in all the replica nodes. Note that Paxos is a distributed algorithm without central arbitration, which differs significantly from the other solutions. Finally, in NuoDB, the table rows are represented as in-memory distributed objects which communicate asynchronously to replicate their state changes.

The choice of replication model impacts the ability of the data store to scale read and write requests. Master–slave replication is generally useful for scaling read requests because it allows the many slaves to accept read requests – examples are BerkeleyDB and MongoDB. However, some data stores such as HBase do not permit read requests on the slave nodes. In this case, replication is used solely for failover and disaster recovery. In addition, master–slave data stores do not scale write requests because the master is the only node that processes write requests. An interesting exception is the Neo4J database, which is able to handle write requests on the slave nodes also. In this case, write requests are synchronously propagated from slaves to master and therefore are slower than write requests to master. Finally, multi-master and masterless replication systems are usually capable of scaling read and write requests because all nodes can handle both requests.

Another replication characteristic with a great impact on data stores throughput is how write operations are propagated among nodes. Synchronization of replicas can be synchronous or asynchronous. In synchronous or eager replication, changes are propagated to replicas before the success of the write operation is acknowledged to the client. This means that synchronous replication introduces latencies because the write operation is completed only after change propagation. This approach is rarely used in NoSQL because it can result in large delays in the case of temporary loss or degradation of the connection. In asynchronous or lazy replication, the success of a write operation is acknowledged before the change has been propagated to replica nodes. This enables replication over large distances, but it may result in nodes containing inconsistent copies of data. However, performance can be greatly improved over synchronous replication. As illustrated in Table 2, the majority of the data stores studied use asynchronous replication. Typically, NoSQL solutions use this approach to achieve the desired performance, yet CouchDB uses it to achieve off-line operation. In CouchDB, multiple replicas can have their own copies of the same data, modify them, and then synchronize these changes at a later time.

Consistency

Consistency, as one of the ACID properties, ensures that a transaction brings the database from one valid state to another. However, this section is concerned with consistency as used in the CAP theorem, which relates to how data are seen among the server nodes after update operations. Basically, two consistency models can be distinguished: strong and eventual consistency. Strong or immediate consistency ensures that when write requests are confirmed, the same (updated) data are visible to all subsequent read requests. Synchronous replication usually ensures strong consistency, but its use can be unacceptable in NoSQL data stores because of the latency it introduces. Among the observed NoSQL data stores with replication, HBase is the only one exclusively supporting strong consistency. In the eventual consistency model, changes eventually propagate through the system given sufficient time. Therefore, some server nodes may contain inconsistent (outdated) data for a period of time. Asynchronous replication, if there are no other consistency-ensuring mechanisms, will lead to eventual consistency because there is a lag between write confirmation and propagation. Because NoSQL data stores typically replicate asynchronously, and eventual consistency is often associated with them, it was expected that the reviewed NoSQL solutions provide eventual consistency. Nevertheless, as illustrated in Table 2, the majority of these data stores allow configuration of the consistency model using alternate consistency-ensuring mechanisms; however, choosing strong consistency may have a performance impact.

The data stores with consistent hashing and masterless replication, specifically Voldemort, Riak, and Cassandra, use a quorum approach in their consistency models. In this approach, a read or write quorum is defined as the minimum number of replicas that must respond to a read or write request for it to be considered successful and confirmed to the requestor. Even though these data stores are designed for eventual consistency, they can achieve strong consistency by choosing (read quorum + write quorum) larger than the number of replicas.

MongoDB can achieve strong consistency using two different techniques. First, a connection can be set to read-only from the master, which removes the data-stores’ ability to scale read requests. The second option is to set the write concern parameter to “Replica Acknowledged”, which ensures that a write succeeds on all replicas before being confirmed. This makes the data store into a synchronous replication system and degrades its performance.

Finally, it is important to note that the NewSQL solutions analyzed, with the exception of NuoDB, are strongly consistent, fully transactional data stores.

Concurrency control

Concurrency control is of special interest in NoSQL and NewSQL data stores because they generally need to accommodate a large number of concurrent users and very high read and/or write rates. All the solutions studied facilitate concurrency by implementing partitioning and replication. However, this section focuses on concurrency control as a means of achieving simultaneous access to the same entity, row, or record on a single server node.

The main concurrency-control schemes can be categorized as pessimistic or optimistic. Pessimistic concurrency control, or pessimistic locking, assumes that two or more concurrent users will try to update the same record or object at the same time. To prevent this situation, a lock is placed onto the accessed entity so that exclusive access is guaranteed to a single operation; other clients trying to access the same data must wait until the first one finishes its work. The entity that is locked depends on the underlying data model. For example, key-value stores lock records consisting of key-value pairs, column-family stores lock rows, and document stores enforce locking at document level. In graph databases, specifically in Neo4J, locks are acquired on nodes and their relationships. BerkeleyDB and MongoDB implement readers-writer locks which allow either multiple readers to access data or a single writer to modify them. Pessimistic locking techniques can lead to performance degradation, especially in write-intensive scenarios.

Optimistic concurrency control or optimistic locking assumes that conflicts are possible, but rare. Therefore, instead of locking the record, the data store checks at the end of the operation to determine whether concurrent users have attempted to modify the same record. If a conflict is identified, different conflict-resolution strategies can be used, such as failing the operation immediately or retrying one of the operations. Several of the data stores investigated, including Voldemort, Riak, HBase, CouchDB, Clustrix, and NuoDB, implement optimistic concurrency control with multi-version concurrency control (MVCC). In MVCC, when the data store needs to update a record, it does not overwrite the old data, but instead adds a new version and marks the old version as obsolete. Multiple versions are stored, but only one is marked as current. With the MVCC approach, a read operation sees the data the way they were when it began reading, even if the data were modified or deleted by other operations in the meantime.

A number of NoSQL solutions allow applications to implement optimistic concurrency control by providing primitives such as check and set (CAS) in Memcached and Couchbase Server. The CAS method ensures that a write will be performed only if no other client has changed the record since it was last read. In Redis, the WATCH primitive performs a similar function. Optimistic concurrency-control implementations use various approaches to determine whether a record has been changed. For example, Memcached uses version stamps and AmazonDB incrementing version numbers. Often it is hard to tell which approach a data store uses internally to achieve check and set functionality based solely on the system documentation.

Cassandra has been recognized for its ability to handle large numbers of write requests [19], and therefore architecture characteristics contributing to Cassandra’s write scalability are highlighted. Although the storage structure in typical relational databases and a number of NoSQL data stores including MongoDB and CouchDB relies on a B-Tree, Cassandra takes advantage of a log-structured merge tree. When a write occurs, Cassandra stores the changes in two places: in the memory structure called memtable, and in the commit log on disk by appending to the existing data. When the memtable reaches a threshold, the memtable data are flashed to SSTables (sorted string tables) on disk, and data in the commit log corresponding to the flushed memtable are purged. When flashing the memtable, Cassandra writes entire sectors to disk using sequential I/O instead of modifying rows in place. This approach eliminates locking of data on disk for concurrency control because write operations only append data and do not modify existing data on disk. Consequently, Cassandra is especially suitable for applications with high write volume or those that require very fast writes.

Some of the NewSQL solutions analyzed also implement innovative approaches to concurrency control. For example, Google’s Spanner uses a hybrid approach in which read-write transactions are implemented through read-write locks, but read-only transactions are lock-free. This is possible because Spanner stores multiple versions of data, and a read transaction is basically a read at a “safe” timestamp. On the contrary, VoltDB implements an interesting alternative to concurrency control. This data store assumes that the total available memory is large enough to store the entire data store. Moreover, it also assumes that all user transactions are short-lived and can be very efficiently executed over in-memory data. Based on these assumptions, all transactions are then executed sequentially in a single-threaded, lock-free environment.

Security

Security is an important aspect of data stores that is overlooked by many NoSQL implementations. In this section, the data stores surveyed are compared with regard to the following features:

Authentication: mechanisms that enable verification of the identity of users who are accessing the data. This is usually achieved through a password associated with a user’s login, but more sophisticated mechanisms are also possible, such as user certificates. For many enterprises, an important requirement for authentication is the capacity of integration with enterprise user-directory systems such as Lightweight Directory Access Control (LDAP)/Active Directory and Kerberos servers.
Authorization: this refers to the capability to ensure access control to the data-store resources. Authorization is usually performed through association of each user with a set of permissions. For example, some data stores might require specific permissions for read and write requests on tables, creation of users, and execution of administrative functions. Authorization information might also be included in directory systems.
Encryption: this refers to mechanisms that encrypt data so that they cannot be read by attackers and others unauthorized parties. A complete encryption solution should be present in at least three different levels:

○ Data at rest: data stored on disks can be read if an attacker has access to the servers’ file systems. A data-at-rest encryption mechanism guarantees that the users’ data are automatically encrypted when written to these files and unencrypted when retrieved.

○ Client-to-server communication: Most data stores allow remote connections of users and applications so that stored data can be obtained. This data flow must also be encrypted to guarantee private and secure communication.

○ Server-to-server connections: because many NoSQL and NewSQL data stores include some sort of replication and distributed processing functionalities, communications among the server nodes can also be eavesdropped to obtain unauthorized access to data. A server-to-server encryption mechanism guarantees that these flows cannot be read.

Auditing: auditing functionalities are usually related to the creation of an audit trail that logs records of events that occurred in a data stores. This is especially important in forensic analysis of security events. Many security standards, such as PCI-DSS [51] and HIPAA [52], require the existence of audit trails.

Table 3 shows a summary of the security features found in the solutions surveyed. It is worth mentioning that very often the system documentation mentions nothing about some of the criteria analyzed, especially server-to-server communication and data-at-rest encryption. In these cases, the corresponding cells in the table contain “NA”.

Table 3 Security features

Full size table

Generally speaking, it is possible to affirm that the security features of NoSQL solutions are not as mature as those included in traditional RDBMSs. Many solutions, such as Redis, Memcached, Voldemort, and Riak, are designed to be used in secure networked environments only. Therefore, they assume that it is the network administrator’s responsibility to ensure that only authorized applications have access to the data store, using mechanisms such as firewalls, operating system configurations, or the adoption of virtual private networks (VPN). In these cases, there is no fine-grained access control to the data store. Furthermore, audit features are not present in most cases, and when present, they are very simple and not customizable. For example, VoltDB can log all the queries executed on its data, but it cannot constrain this logging to only a subset of the tables.

Another interesting observation is that MongoDB and Cassandra offer additional security functionalities in their enterprise editions, acknowledging the fact that security is a particularly relevant concern for large companies. For instance, data-at-rest encryption and auditing functionalities are available only in Cassandra Enterprise Edition.

Among the NewSQL solutions, Clustrix and NuoDB use the authorization and authentication schemes of traditional RDBMS by supporting the GRANT/REVOKE statements. In its turn, VoltDB implements access control to execution of stored procedures, and no information regarding Google Spanner security could be found.

Cloud data management systems may also need to handle other security related concerns, such as legal issues associated to the data location, and the complete disposal of sensitive data [53], but they are out of scope of this survey.

Use cases

Due to the diversity of NoSQL and NewSQL solutions, making the choice of the most appropriate data store for a given use case scenario is a challenging task. This section discusses some general guidelines that can be used in this task and shows examples of applications that use different data stores. The following discussion is mostly focussed on selecting a specific data model over others, but when relevant, we also examine the appropriateness of specific data stores.