In the following, we survey the most popular CPIs, map them to deployment scenarios in which they are applicable and provide an overview of their performance. Using the proposed methodology, we build a CPI catalog that summarizes our analysis by showing how the existing CPIs map to the deployment scenario requirements of our proposed taxonomy. The CPI catalog lists the requirements that a CPI is able to satisfy, and allows a quick determination of which CPIs match the requirements of a specific deployment scenario in terms of the assumed attacker model, the required CPI functionality, and the protection level.
CPI catalog
The CPI catalog that contains different classes of CPI approaches can be found in Table 3. Depending on the functionality that is utilized, some approaches protect against multiple attacker models. In the following, we provide a brief description and explain the categorization of each CPI approach.
Deterministic indexes
In order for the SP to evaluate equality selections on attribute values without revealing the values, each plaintext value can be mapped to a deterministic substitute. For instance, these substitutes can be keyed hash values of the plaintext value or ciphertexts that are produced by deterministic encryption schemes. Both the keyed hash function and the deterministic encryption schemes ensure that mapping single deterministic substitutes back to the plaintext value without knowing the key is infeasible. In order to search for a certain attribute value a, the attribute value a is replaced by its deterministic substitute within the query.
Example: To evaluate the query SELECT … WHERE Gender=m on Table 2, the attribute value “m” is first mapped to its deterministic substitute η by the mediator using the secret key. The query SELECT … WHERE Gender=η can then be passed to the SP, which returns records 1, 2, 4 and 6 as the result.
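The substitute generation in this example can be sketched with a keyed hash; a minimal illustration (the key, table and column names are hypothetical):

```python
import hmac, hashlib

SECRET_KEY = b"mediator-secret"  # hypothetical key held only by the mediator

def det_substitute(value: str) -> str:
    """Keyed hash of a plaintext value; equal inputs yield equal
    substitutes, but inversion without SECRET_KEY is infeasible."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

# The mediator rewrites the predicate before forwarding the query.
eta = det_substitute("m")
query = f"SELECT * FROM persons WHERE gender_index = '{eta}'"
# The SP matches eta against the stored substitutes without learning "m".
```

Because the mapping is deterministic, the SP can evaluate the rewritten equality predicate exactly as it would a plaintext one.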
Since deterministic ciphertexts are distinguishable and only equality selections are evaluated, the approach protects against Q&BKS-attackers, i.e., attackers with background knowledge of the data’s schema (cf. Figure 3).
In order to also use deterministic indexes against attackers with background knowledge of the data, the flattened hash indexes approach was proposed [10]. The basic idea of flattened hash indexes is to map different plaintext values to the same deterministic substitute in such a way that each deterministic substitute occurs the same number of times. By doing this, the deterministic ciphertexts become indistinguishable in the sense that background knowledge on the frequency distribution of an attribute can no longer be applied. However, flattening the distribution of the deterministic substitutes does not entirely prevent background knowledge from being applied on single plaintext records because equal plaintext values still map to equal deterministic substitutes [10].
Example: Even if Table 2 contained as many female as male persons, an attacker who knows that Adam has gender “m” could infer that Carol cannot have gender “m” because Carol’s record contains θ, which does not match Adam’s η.
Thus, flattened hash indexes can be used against D&BKQ-attackers, but only provide probabilistic database protection because an attacker with background knowledge may gain information about some of the plaintext values based on the outsourced data.
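The flattening idea can be sketched by assigning distinct plaintext values to a fixed number of index labels so that each label covers roughly the same number of records; this toy greedy assignment only merges rare values (real flattened hash indexes [10] may additionally split frequent values over several substitutes):

```python
from collections import Counter

def flatten_index(values, num_labels):
    """Greedily assign each distinct value to the currently lightest
    label so that all labels index a similar number of records."""
    freq = Counter(values)
    load = [0] * num_labels
    label_of = {}
    # Place frequent values first to balance the label loads.
    for v, n in freq.most_common():
        lbl = load.index(min(load))
        label_of[v] = lbl
        load[lbl] += n
    return label_of, load

values = ["a", "a", "a", "b", "b", "c"]
label_of, load = flatten_index(values, 2)
# A query for "b" is rewritten to label_of["b"]; the mediator later
# discards false positives, i.e., records of other values sharing the label.
```

Since several plaintext values share one label, the SP returns a superset of the matching records and the mediator filters the result.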
Bucketization
Range selections cannot be evaluated based on deterministic indexes because deterministic ciphertexts do not maintain the order of the plaintext values. The idea of value bucketization [11,12] addresses this by sorting the plaintext values into buckets, i.e., continuous value ranges. For each plaintext value, only the corresponding bucket ID that does not contain information on the content of the bucket is outsourced. In order to query for a range of values, the ID of each contained and intersecting bucket is queried.
Since the outsourced values are distinguishable ciphertexts, bucketization can be considered secure against a D&BKS-attacker that is only able to access the data and has background knowledge of the data’s schema. However, once an attacker can observe queries or modifications that update/delete records selected via a bucketization index, it can deduce the ordering of the bucket IDs and thus turn the distinguishable ciphertexts into order-preserving ones. Therefore, if queries can be monitored by the attacker, bucketization is only secure against Q&NBK-attackers that have no background knowledge.
Like deterministic indexes, buckets can be flattened so that each bucket contains the same number of records. For the same reasons as flattened hash indexes, flattened bucketization can be used against D&BKQ-attackers, however it only provides probabilistic database protection.
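A minimal sketch of bucketized range evaluation (the bucket boundaries and IDs are hypothetical; note that the IDs themselves carry no order information):

```python
# Buckets partition the value domain; only opaque bucket IDs are outsourced.
BUCKETS = {  # hypothetical bucketization of an "age" attribute
    "B7": range(0, 20),
    "B2": range(20, 40),
    "B9": range(40, 60),
}

def bucket_id(value):
    """ID stored at the SP in place of the plaintext value."""
    return next(b for b, r in BUCKETS.items() if value in r)

def buckets_for_range(lo, hi):
    """IDs of all buckets contained in or intersecting [lo, hi]."""
    return {b for b, r in BUCKETS.items() if r.start <= hi and lo < r.stop}

# A range query age BETWEEN 25 AND 45 touches buckets B2 and B9; the
# mediator filters out false positives (e.g., ages 20-24 and 46-59).
```

The mediator keeps the bucket boundaries secret and post-filters the returned records, since intersecting buckets contribute false positives.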
Order-preserving encryption
To allow the SP to evaluate range selections on encrypted attribute values, order-preserving encryption schemes (OPES) can be used to encrypt the attribute values [13,14]. OPES schemes maintain order when mapping plaintext values to ciphertext values, i.e., the ciphertext values have the same order as the corresponding plaintext values. Thus, it is possible for the SP to evaluate < and > operators without decrypting the ciphertexts and revealing the plaintext values.
Since order-preserving ciphertexts are outsourced, the approach only protects against Q&NBK-attackers without background knowledge (cf. Figure 3).
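The core property of OPES can be illustrated with a toy keyed mapping that assigns every plaintext value the running sum of pseudorandom positive gaps; real schemes [13,14] are considerably more sophisticated, but the monotonicity argument is the same:

```python
import hashlib

def ope_encrypt(value: int, key: bytes = b"opes-key") -> int:
    """Toy order-preserving mapping: the ciphertext of v is the sum of
    keyed pseudorandom positive gaps for all domain values up to v,
    so v1 < v2 implies ope_encrypt(v1) < ope_encrypt(v2)."""
    total = 0
    for i in range(value + 1):
        digest = hashlib.sha256(key + i.to_bytes(4, "big")).digest()
        total += 1 + digest[0]  # strictly positive, key-dependent gap
    return total

# The SP can evaluate e.g. ope_encrypt(lo) <= c <= ope_encrypt(hi)
# on stored ciphertexts c without learning the plaintext values.
```

Because the gaps are strictly positive, order is preserved and the SP can evaluate < and > directly on the ciphertexts.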
Searchable encryption
Searchable encryption schemes [15-20] can be used to encrypt data values in such a way that the SP can check whether a ciphertext contains a value that matches a given predicate or not. The outsourced ciphertexts are indistinguishable. Searchable encryption schemes enable the mediator to generate a token based on the predicate that has to be checked and the secret key that was used to encrypt the data. This token is passed to the SP. While it is not possible to determine the predicate to be evaluated from the token without the secret key, the token can be used by the SP to check which ciphertexts match the (unknown) predicate. Predicates can be utilized to encode equality, range and like selections.
Since the outsourced ciphertexts are indistinguishable, searchable encryption can be used to protect the data from D&BKQ-attackers that have access to the outsourced data as well as background knowledge on the data and the queries. However, once an attacker can monitor queries or modifications, this protection no longer extends to Q&BKD-attackers: indistinguishable ciphertexts alone do not suffice, and searchable encryption approaches do not provide access confidentiality (cf. Figure 3) [21]. Furthermore, if range selections can be monitored, the attacker does not even need background knowledge, because they can establish an order between the ciphertext values based on the monitored range queries, as was the case with bucketization.
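A minimal sketch of token-based equality search (the key and function names are hypothetical): each ciphertext pairs a fresh random nonce with a tag derived from a per-value token, so ciphertexts of equal values are indistinguishable, yet the SP can match them given the token.

```python
import hmac, hashlib, os

KEY = b"se-key"  # hypothetical mediator secret

def make_token(value: str) -> bytes:
    """Search token handed to the SP; it reveals nothing without KEY."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).digest()

def encrypt_value(value: str):
    """Indistinguishable ciphertext: a fresh random nonce plus a tag
    binding the nonce to the per-value token."""
    nonce = os.urandom(16)
    tag = hmac.new(make_token(value), nonce, hashlib.sha256).digest()
    return nonce, tag

def sp_matches(ciphertext, token) -> bool:
    """SP-side check: re-derive the tag from token and stored nonce."""
    nonce, tag = ciphertext
    expected = hmac.new(token, nonce, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)
```

Two encryptions of the same value differ (fresh nonces), so the stored data leaks nothing; only a submitted token lets the SP identify matches.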
Encrypted B-Trees
In traditional database management systems, B-Trees are used to create indexes and speed up query execution. To enable efficient query execution based on indistinguishable ciphertexts stored by the SP, encrypted B-Trees containing only indistinguishably encrypted nodes were proposed [10,22]. Since the encrypted nodes of the encrypted B-Tree are indistinguishable for the SP, the trustworthy mediator has to maintain the encrypted B-Tree and participate in the execution of queries. To retrieve a leaf that references a record with a certain attribute value, the mediator first retrieves the root node of the B-Tree, decrypts it and selects the node for descending into the tree as in the unencrypted case. This node is then retrieved once again and the process is repeated until the target leaf node is reached. Thus, log(n) communication rounds are required to retrieve a record.
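The round-by-round traversal can be sketched as follows; the toy XOR cipher stands in for a proper symmetric scheme (a real deployment would use authenticated encryption such as AES-GCM), and the node contents are hypothetical:

```python
import hashlib, json

KEY = b"btree-key"  # hypothetical mediator key

def xor_cipher(data: bytes) -> bytes:
    """Toy symmetric cipher (keyed keystream XOR); encryption and
    decryption are the same operation. Illustration only."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(KEY + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

# SP-side storage: node id -> encrypted node (keys, child ids, record refs).
sp_store = {}

def put_node(node_id, node):
    sp_store[node_id] = xor_cipher(json.dumps(node).encode())

put_node("root", {"keys": [30], "children": ["n1", "n2"]})
put_node("n1", {"keys": [23, 25], "records": ["r2", "r3"]})
put_node("n2", {"keys": [30, 41], "records": ["r1", "r4"]})

def lookup(value):
    """Mediator descends the tree; one round trip to the SP per level."""
    rounds, node_id = 0, "root"
    while True:
        node = json.loads(xor_cipher(sp_store[node_id]))  # fetch + decrypt
        rounds += 1
        if "records" in node:
            return node, rounds
        # choose the child covering `value`, as in a plaintext B-Tree
        idx = sum(1 for k in node["keys"] if value >= k)
        node_id = node["children"][idx]
```

The SP only ever sees opaque blobs being fetched; all comparisons happen at the mediator, which is exactly why each tree level costs a round trip.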
Since the outsourced ciphertexts are indistinguishable, encrypted B-Trees can be used to protect attributes against D&BKQ-attackers that have access to the outsourced data as well as background knowledge on the data and the queries (cf. Figure 3). However, since encrypted B-Trees do not provide access confidentiality, they are not suited to provide protection against a Q&BKD-attacker that has background knowledge of the data and can monitor queries/modifications. Once an attacker can observe queries or modifications, they are not only able to distinguish ciphertexts, but can also infer the order of the ciphertexts due to the B-Tree’s ordered structure [23]. Thus, if queries or modifications that target specific records can be observed, encrypted B-Trees can only provide security guarantees against Q&NBK-attackers that have no background knowledge.
In order to make encrypted B-Trees applicable to stronger attacker models, shuffling the B-Tree after every query was proposed to achieve access and pattern confidentiality [24]. For each value looked up in the B-Tree, e different values are looked up so that e nodes need to be retrieved at each stage of the B-Tree. Once the leaf nodes for all the cover searches are received, the nodes of each stage are shuffled by the client and written back to the B-Tree at the SP. Thus, the access patterns of two identical queries look different to the SP. However, since only e nodes are shuffled, the SP is still able to link queries with a certain probability, which decays with the number of queries executed in between the queries of interest. Hence, the attacker may still be able to apply background knowledge concerning the access pattern with a certain probability, and shuffled B-Trees only guarantee probabilistic database protection against Q&BKQ-attackers that have background knowledge of query patterns.
Fragmentation
In some cases, it is not attribute values, but attribute value combinations that are considered confidential. Fragmentation approaches split relations up into multiple fragments to protect the attribute combinations. For instance, even though an attacker is allowed to see plaintext attribute values of attribute age and name, it must be impossible to link any age to a name. In order to achieve that, the relation can be split up into two unlinkable fragments, where each one contains either name or age as shown in Table 2. Since no encryption is applied, the SP can still evaluate equality, range and like selections as well as aggregations.
To protect a record, it suffices to protect one of the attributes that should not be linked by not storing the attribute in the same fragment. As the SP does not learn the values of attributes that are not part of a fragment, the values of such attributes can be considered indistinguishable ciphertexts.
The fragments can all be stored on a single SP or distributed on multiple non-colluding SPs. If the fragments are stored on a single SP [25], the confidentiality of attribute combinations can only be guaranteed against D&BKQ-attackers that are not able to observe any modifications of the data. For instance, in order to insert a new record, partial records are inserted into each fragment. An attacker that observed the newly inserted partial records can infer that they belong to the same record and link them. As shown in Figure 3, to protect against stronger attackers, access confidentiality is necessary to obfuscate which records were inserted. Notice that storing fragments on a single SP can only provide probabilistic record protection: For instance, in Table 2 the attacker learns that Bob’s age is either 23, 25 or 30 based on the outsourced data. Thus, values of protected attributes are only somewhat indistinguishable.
Storing the fragments on multiple non-colluding SPs [26-28] can be considered secure against Q&BKD-attackers that are able to observe queries and modifications. Since the fragments are stored on different SPs, an attacker that compromised a single SP is not thereby enabled to observe which partial records are inserted into each fragment or to join them to reveal the confidential attribute combination. Furthermore, computational record protection can be achieved because the value range of the protected attributes cannot be narrowed down by the SPs as only a single fragment can be observed by each SP. Thus, values of protected attributes are truly indistinguishable. For instance, an attacker that has access to the fragment that contains age in Table 2 can infer that the age of a person in the dataset is either 23, 25 or 30. However, since the attacker has no access to the other fragment, they do not learn any names that are contained in the database and can be possibly linked to the age.
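Fragmentation itself involves no cryptography and can be sketched directly; the schemas and attribute names are hypothetical, and the random per-fragment row ids plus shuffling stand in for whatever unlinkable storage layout a concrete scheme [25-28] uses:

```python
import random, secrets

def fragment(records, frag_schemas):
    """Split each record across unlinkable fragments: every fragment row
    gets an independent random id, and each fragment is shuffled so row
    positions carry no linking information either."""
    fragments = {name: [] for name in frag_schemas}
    for rec in records:
        for name, attrs in frag_schemas.items():
            row = {"rid": secrets.token_hex(8)}  # per-fragment random id
            row.update({a: rec[a] for a in attrs})
            fragments[name].append(row)
        # a real mediator would retain the rid correspondence privately
        # in order to reassemble full records from the fragments
    for rows in fragments.values():
        random.shuffle(rows)
    return fragments

records = [{"name": "Adam", "age": 23}, {"name": "Carol", "age": 30}]
frags = fragment(records, {"F1": ["name"], "F2": ["age"]})
# F1 and F2 can be stored on non-colluding SPs; neither SP alone can
# link a name to an age, yet each fragment remains queryable in plaintext.
```

Since every fragment stores its attributes in plaintext, the SPs can still evaluate equality, range and like selections as well as aggregations on them.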
Homomorphic encryption
Homomorphic encryption schemes [29-31] can be used to produce ciphertexts that can be aggregated without knowing the secret key used for encryption [32,33]. Executing certain operations in the ciphertext domain has the same effect as performing an operation such as addition or multiplication in the plaintext domain. Thus, the mediator can outsource homomorphically encrypted ciphertexts that can be aggregated by the SP (e.g., summed, depending on the scheme utilized).
The ciphertexts produced by homomorphic encryption schemes can be considered indistinguishable for an attacker. Aggregation queries target all records rather than specific ones and therefore naturally ensure access and pattern confidentiality. If only a subset of the records is to be aggregated, the selection of these records is based on other indexing approaches such as deterministic indexes or B-Trees. Thus, homomorphic encryption can be considered secure against strong Q&BKQ-attackers with background knowledge of the data and queries and the ability to monitor queries.
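The additive case can be illustrated with a toy Paillier instance; the primes are deliberately tiny and the parameter choices (g = n + 1) follow the textbook construction, so this is a sketch rather than a usable implementation:

```python
import math, random

# Toy Paillier key with tiny primes; illustration only.
p, q = 17, 19
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)
g = n + 1
mu = pow(lam, -1, n)  # valid since L(g^lam mod n^2) = lam (mod n) for g = n+1

def encrypt(m: int) -> int:
    """Probabilistic encryption: fresh r makes ciphertexts indistinguishable."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    L = (pow(c, lam, n2) - 1) // n
    return (L * mu) % n

# The SP multiplies ciphertexts; the plaintexts are added under the hood.
c_sum = (encrypt(5) * encrypt(7)) % n2
assert decrypt(c_sum) == 12
```

Note that the SP performs only modular multiplications and never decrypts; the mediator holds lam and mu and recovers the aggregate.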
Oblivious RAM and private information retrieval
The goal of Oblivious RAM (ORAM) [34-36] approaches is to ensure that queries evaluated on the data are indistinguishable and that the SP does not know which records are returned or whether the same query has been executed before. ORAM approaches shuffle the outsourced encrypted data structure with each data access to ensure that the executions of multiple identical/similar queries look entirely different and cannot be distinguished by the SP. The property of indistinguishable queries can only be guaranteed if there is access and pattern confidentiality. Thus, ORAM approaches ensure access and pattern confidentiality and can be considered secure against Q&BKD-attackers.
Private Information Retrieval (PIR) [37,38] approaches obfuscate data access patterns and can be considered secure against Q&BKD-attackers. Unlike ORAM, PIR can only be applied for data retrieval, not for writing data, which is a common requirement in DaaS scenarios. In contrast to ORAM, PIR approaches can obfuscate query access patterns in a single round of communication. Computational PIR approaches [39-42] achieve this at the expense of increased computational cost for the SPs, while information-theoretical PIR approaches [37,43,44] obfuscate access patterns based on non-colluding SPs. To evaluate queries (e.g., range selections) based on probabilistic ciphertexts, PIR approaches can be combined with methods such as encrypted B+ trees. This implies a logarithmic number of communication rounds, which would cancel out the benefits of PIR over ORAM in this regard.
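The information-theoretic two-server variant admits a particularly compact sketch over a bit database (the database contents are hypothetical): the client sends a random index subset to one SP and the same subset with the target index toggled to the other, so neither SP learns which bit is retrieved, yet the XOR of their answers is exactly that bit.

```python
import secrets

# Both non-colluding SPs hold an identical copy of the bit database.
DB = [0, 1, 1, 0, 1, 0, 0, 1]

def server_answer(db, subset):
    """Each SP XORs the bits at the requested positions."""
    ans = 0
    for i in subset:
        ans ^= db[i]
    return ans

def pir_read(index):
    """Client query: random subset to SP1, same subset with the target
    index toggled to SP2; XOR of the answers yields DB[index]."""
    s1 = {i for i in range(len(DB)) if secrets.randbelow(2)}
    s2 = s1 ^ {index}  # symmetric difference toggles exactly `index`
    return server_answer(DB, s1) ^ server_answer(DB, s2)

assert pir_read(4) == DB[4]
```

All positions except the target cancel in the XOR, and each individual subset is uniformly random, which is why a single SP learns nothing about the queried index.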
CPI efficiency
In order to facilitate assessment of whether a CPI is suited for a given deployment scenario, we also provide a high level overview of the query execution performance achieved by each CPI. While a fine-grained performance evaluation is beyond the scope of this article, our aim is to give a rough estimate of the most important performance metrics. We distinguish between transmission overhead, which we define as the amount of data transmitted during a query, the number of sequential communication rounds between the mediator and the SP to answer a query and the number of entries that need to be touched by the SP in order to evaluate a query. Lastly, we categorize each CPI with respect to the computational overhead induced for the mediator and the SP according to four levels. We assign the level none if no or almost no calculations need to be performed. Overhead is assumed to be low if only lightweight cryptographic operations such as hashing are performed. If only symmetric encryption schemes are used, we assume the overhead to be moderate. If more resource-consuming asymmetric encryption schemes are utilized, we regard the computational overhead as high.
In the following, some of the most important findings are highlighted:
Compared to other CPI approaches, deterministic indexes and fragmentation induce the lowest overhead. With fragmentation, however, the SP is not able to fully execute a query based on a single fragment that does not contain all the attributes that are relevant for the query. Thus, a large number of false-positive records may be unnecessarily transmitted, which has an impact on query execution performance. In the non-colluding-SP model, this is less of a problem because attributes can be stored redundantly in multiple fragments.
The performance of flattened deterministic indexes and bucketization is highly dependent on the distribution of the outsourced attribute values. In unequally distributed datasets, many different plaintext values need to be mapped to the same deterministic substitute in order to ensure a flat index. Querying for one of these plaintext values results in the unnecessary transmission of many false-positive records.
B-Trees require log(n) consecutive communication rounds between the mediator and the SP to retrieve the tree nodes. This leads to an accumulation of network latency between the mediator and the SP during query execution.
Using searchable encryption, the SP has to scan all outsourced ciphertexts contained in the index to find the records that match a query.
Homomorphic encryption induces a comparatively high computational overhead for the SP to aggregate ciphertexts and for the mediator to encrypt/decrypt the ciphertexts. We refer to partially homomorphic encryption here; the overhead for fully homomorphic encryption is much higher.
ORAM and PIR approaches each have their own trade-offs between transmission overhead, the necessary communication rounds and computational overhead. Overall they are considered expensive compared to other approaches that protect against weaker attacker models.