Secure semantic search using deep learning in a blockchain-assisted multi-user setting
Journal of Cloud Computing volume 13, Article number: 29 (2024)
Abstract
Deep learning-based semantic search (DLSS) aims to bridge the gap between experts and non-experts in search. Experts can create precise queries due to their prior knowledge, while non-experts struggle with specific terms and concepts, making their queries less precise. Cloud infrastructure offers a practical and scalable platform for data owners to upload their data, making it accessible to intended data users. However, the contemporary single-owner/single-user (S/S) approach to DLSS schemes falls short of effectively leveraging the inherent multi-user capabilities of cloud infrastructure. Furthermore, most of these schemes delegate the dissemination of secret keys to a single trust point within the mutual distrust scenario in cloud infrastructure. This paper proposes a Secure Semantic Search using Deep Learning in a Blockchain-Assisted Multi-User Setting \((S^3DBMS)\). Specifically, the seamless integration of attribute-based encryption with transfer learning allows the construction of DLSS in multi-owner/multi-user (M/M) settings. Further, blockchain’s smart contract mechanism allows a multi-attribute authority consensus-based generation of user private keys and system-wide global parameters in a mutual distrust M/M scenario. Finally, our scheme achieves privacy requirements and offers improved security and accuracy.
Introduction
To reduce costs linked with managing large amounts of data, more individuals and organizations are choosing to entrust the management of this data to the public cloud. To maintain the privacy and security of the data being entrusted, it is a usual practice to encrypt the data before sending it to the cloud service. When dealing with this outsourced encrypted data in a cloud infrastructure, the primary and essential action involves initiating a search process. This search serves as the principal entry point for reaching and interacting with the encrypted data in the cloud environment before any subsequent tasks or analyses can be conducted.
Deep learning natural language processing (NLP) models have revolutionized information retrieval, making semantic-aware searches accessible even to those without specialized knowledge in a particular field. These models operate through complex neural networks, which enable them to understand the nuances of human language beyond the confines of traditional keyword-based approaches. Unlike conventional search algorithms that rely solely on exact keywords, deep learning NLP models go a step further by considering the context and intent behind user queries. This nuanced approach allows these models to provide more sophisticated and accurate search results, enhancing the overall user experience. For individuals lacking expertise in a specific domain, the advent of deep learning NLP models is particularly beneficial. Users can now formulate queries in a natural and conversational manner, eliminating the need for meticulously crafted keyword-heavy requests. The models, in response, excel at understanding user intent, leading to more precise and relevant search outcomes. However, contemporary schemes integrating deep learning-based models into searchable encryption (SE) predominantly operate within single-owner/single-user settings. This prevalent approach significantly restricts the adaptability and versatility of cloud storage solutions.
Therefore, in this paper, our objective is to introduce deep learning-driven semantic-aware searchable encryption (SE) in a multiple-owner/multiple-user (M/M) setting. This is realized through the integration of attribute-based encryption (ABE) with secure transfer learning. However, employing this insight in the M/M setting is a non-trivial task. In particular, we need to properly handle the following two key challenges. First, Attribute-Based Encryption (ABE) is a cryptographic primitive that provides a fine-grained access control mechanism for encrypted data. It enables data owners to encrypt their data in a way that only users possessing specific attributes can decrypt and access it. This is particularly valuable in scenarios where access control requirements are intricate and dynamic. However, the traditional ABE scheme inherently relies on a trusted attribute authority (TAA), as shown in Fig. 1, which contradicts the basic premise of a multi-user setting in the mutually distrustful scenario of cloud infrastructure. This situation leads to scalability, availability, privacy, and trust issues. Therefore, a secure distributed mechanism needs to be placed on top of the traditional ABE primitives while still maintaining access at a fine granularity. Second, in a scenario where there are multiple data owners and multiple data users, all operating in an environment of mutual distrust (such as cloud infrastructure), a revocation mechanism becomes crucial for maintaining the security and integrity of the system. A revocation mechanism is a way to handle situations where access to certain data must be revoked or invalidated for specific users or groups. It becomes even more important in the context of ABE, which provides fine-grained access control based on attributes. In a scenario where a single point of trust is absent, direct revocation remains the primary option. Here, the responsibility for specifying the revocation list during the encryption process falls directly upon the data owner.
In the M/M setting, an extra demand arises: the data owners (DOs) need to be aware of who accessed their outsourced data and when, while the data users (DUs) must have confidence that the accessed data remains unaltered. We construct a blockchain-assisted multi-attribute authority scheme to ensure the privacy of deep-learning-based semantic-aware searches and address the demands of a multi-data owner and multi-data user (M/M) setting within a scenario of mutual distrust.
This study’s main contributions can be summarized as follows:

1.
We leverage the capabilities of blockchain’s smart contract mechanism to establish a multi-attribute authority SE scheme. This integration of smart contracts avoids the dependence on a singular trusted entity within an ABE infrastructure. Instead, it facilitates the consensus-based generation of user private keys and system-wide global parameters in a mutual-distrust M/M setting.

2.
We integrate user revocation into the ciphertext managed by data owners. This integration naturally accommodates the absence of a singular trust point within the ABE mechanism and leaves non-revoked users unaffected.

3.
We combine deep learning-based transfer learning with Semantic Term Matching Constraints (STMC) to achieve search results with high accuracy and ranking. This ensures that the index and query feature vectors have an identical feature space and underlying distribution.

4.
Our approach employs a smart contract to record public system parameters and user actions, covering data upload, search, and download processes. Blockchain’s inherent features ensure that operation records are tamper-resistant, allowing data owners to easily monitor data access following the upload process.
Related work
Secure semantic search schemes
Semantic search aims to bridge the gap between experts and non-experts in search. Experts can create precise queries due to their deep knowledge, while non-experts struggle with specific terms and concepts, making their queries less precise. Various strategies have been developed to achieve secure semantic searching, including methods like query expansion and word embeddings. Expansion-based approaches involve extending query terms to include more words that are semantically relevant. This increases the likelihood of matching predefined keywords effectively. The techniques in this category are broadly categorized into three groups: mutual information, synonyms, and concept hierarchies. As demonstrated by [1, 2], mutual information-driven techniques utilize the probabilities of hashed keyword co-occurrences. They create a mutual information model to expand hashed query terms within the encrypted data. The majority of synonym-oriented methods [3, 4] and strategies based on concept hierarchies [5, 6] revolve around extending query terms in the plaintext. These techniques subsequently construct secure indexes utilizing the SecKnn algorithm [7]. Word embedding is a methodology for representing words in a vector format to retain their semantic context. Yang and Zhu [8] utilize word embeddings to introduce a secure semantic search approach employing linear optimal matching, leading to accurate search outcomes. Additionally, [9] presents an embedding-centered scheme wherein documents and queries are transformed into condensed vectors. This scheme then employs the SecKnn algorithm to encrypt these compact vectors. These strategies tackle the lack of semantic awareness, yet they remain confined to single-data owner (DO) or single-data user (DU) setups. This limitation significantly curtails their applicability across a wider spectrum of real-world scenarios.
Verifiable blockchain-based schemes
Ensuring verifiably secure searching entails the integration of verification mechanisms within schemes. These mechanisms serve to validate the accuracy of search outcomes without compromising the confidentiality of sensitive information. The majority of verifiable searching techniques [10,11,12] utilize methodologies such as the Merkle Hash Tree or other variants to establish an anticipated checklist, which is then used to cross-validate the search results. Yang and Zhu [8] present a verification approach that uses intermediate search data to validate the accuracy of search results. Nevertheless, these strategies presuppose the involvement of a trusted entity to oversee the verification process. For instance, certain approaches rely on methods centered around a trusted third party. However, this approach raises noteworthy concerns and challenges pertaining to both security and operational efficiency [13]. In the context of data management, certain schemes [13,14,15] adopt blockchain-based public audits. These audits serve to mitigate the reliance on single points of trust. Furthermore, certain research studies employ blockchain’s immutable and consensus-driven nature to create reliable methods for detecting accurate information and establishing resilient information tracking systems [16, 17]. The multi-party transaction scheme introduced in [18] is built upon blockchain technology. In this approach, all pertinent details concerning a transaction are consolidated within a single block. This design enhances the efficiency of ledger traceability and audit processes. The approach in [19] integrates the advantages of attribute-based cryptography and the chameleon hash function, successfully attaining key security features while maintaining a high level of efficiency. Alternative blockchain-driven approaches [20,21,22,23] create smart contracts for their individual retrieval processes on the blockchain. For instance, [20] suggests an encrypted index-based, privacy-focused decentralized storage system where blockchain nodes perform on-chain retrieval to agree on search outcomes. Additionally, [24] introduces a two-layer verification mechanism. Users conduct the initial verification using a checklist to ensure accurate search results, followed by blockchain nodes performing a second verification by re-executing on-chain retrieval to establish consensus on the outcomes. The mentioned studies share two main drawbacks. Firstly, they are limited to a single-secret-key (S/S) setup, allowing only the key holder to create search queries, which restricts the search scope. Secondly, none of these approaches incorporate revocation in the dynamic and scalable cloud infrastructure architecture.
Notations and background knowledge
Notations

\(\mathbb {G}_1, \mathbb {G}_T \) Bilinear groups of order p.

\(\mathcal {K}_{local}, \mathcal{D}\mathcal{K} \) Data user’s local key and delegated key respectively.

\(\mathbb {Z}_p\), \(\mathbb {Z}^*_p \) Finite fields, whose integer elements are \(\lbrace 0, 1, \ldots , p-1\rbrace\) and \(\mathbb {Z}_p \setminus \lbrace 0 \rbrace\) respectively.

\(U \) Universal set of attributes.

\(\gamma \) DO chosen set of attributes.

\(S \) DU set of attributes.

\(D\) The plaintext set of n documents, namely \(D=\{D_1,D_2,\ldots ,D_n\}\).

\(\tilde{D} \) The encrypted set of n documents, denoted by \(\tilde{D}=\{\tilde{D_1},\tilde{D_2},\ldots ,\tilde{D_n} \}\).

\(D_v \) The index document vector generated from D by the Doc2Vec model, denoted by \(D_v=\{D_{v_1},D_{v_2},\ldots ,D_{v_n}\}\) .

\(I \) The encrypted index vector based on \(D_v\).

\(m\) Number of features in the Doc2Vec generated vector.

\(Q \) The search query keywords set, which is denoted as \(Q=\{q_1,q_2,\ldots ,q_n \}\).

\(Q_v \) The query feature vector generated from search query Q by Doc2Vec model.

\(\tilde{Q}_v \) The trapdoor vector based on query feature vector.

\(M_w \) Extracted Doc2Vec hidden layer weights matrix.

\(\tilde{M}_w \) Secure weights matrix based on \(M_w\).

\(M_1,M_2 \) Invertible matrices for inner product operation.

\(\tilde{M}_1,\tilde{M}_2 \) Secure invertible matrices based on \(M_1\) and \(M_2\) respectively.

\(RoTR(x) \) Circular right shift notation of an argument x.

\(LoTR(x) \) Circular left shift notation of an argument x.

\(A_c \) Access control.

\(V \) Configuration vector.
Preliminaries
In this section, an overview of cryptographic primitives and feature-extraction techniques is presented.
Bilinear pairing
Let \(\mathbb {G}\), \(\mathbb {G}_T\) be two cyclic groups of prime order p, g be a generator of \(\mathbb {G}\), and \(e: \mathbb {G} \times \mathbb {G} \longrightarrow \mathbb {G}_T\) be the bilinear map which has several properties: (1) \(\forall u, v \in \mathbb {G},a,b \in \mathbb {Z}^*_p, e\left(u^a,v^b\right)= e(u,v)^{ab}\), (2) \(e(g,g) \ne 1\), (3) e can be efficiently computed.
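The properties of a bilinear map can be checked numerically on a deliberately insecure toy instance. The sketch below assumes \(\mathbb {G}\) is the additive group \(\mathbb {Z}_{11}\) (so "\(u^a\)" is written as \(a \cdot u \bmod 11\)) and \(\mathbb {G}_T\) is the order-11 subgroup of \(\mathbb {Z}^*_{23}\) generated by 2. These parameters and the names `power` and `pairing` are illustrative assumptions, not the groups or pairing used by the scheme.

```python
# Toy (insecure) bilinear map, for illustration only.
P = 11        # prime order of both groups
Q = 23        # modulus of the target group; 2 has order 11 modulo 23
GT_GEN = 2    # generator of the order-11 subgroup of Z_23^*
G_GEN = 1     # generator of the additive group Z_11


def power(u, a):
    """'Exponentiation' u^a in G, written additively as a*u mod P."""
    return (a * u) % P


def pairing(u, v):
    """Toy pairing e(u, v) = GT_GEN^(u*v) in the target group."""
    return pow(GT_GEN, (u * v) % P, Q)


u, v, a, b = 3, 7, 4, 9
lhs = pairing(power(u, a), power(v, b))   # e(u^a, v^b)
rhs = pow(pairing(u, v), a * b, Q)        # e(u, v)^(ab)
```

Here `lhs` and `rhs` agree (property 1), and `pairing(G_GEN, G_GEN)` differs from 1 (property 2). A real deployment would rely on a pairing-friendly elliptic curve rather than this toy group.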
Non-monotonic access structure
We recall the definition of a non-monotonic access structure proposed by Ostrovsky et al. [25]. For a set of parties P, we consider a collection A of monotonic access structures with the following property: the name of each party is either normal (x) or primed \((x^\prime )\), and if \(x \in P\) then \(x^\prime \in P\) and vice versa. We write \(\overset{\smile }{x}\) to denote a party in P that may be primed or unprimed. Conceptually, primed attributes represent the negation of unprimed attributes. For an access structure \(\mathbb {A} \in A\) over a set of parties P, the corresponding non-monotonic access structure is \(NM(\mathbb {A})\) over a set of parties \(\tilde{P}\), where \(\tilde{P}\) is the set of all unprimed parties in P. Then, we can define the following family \(\tilde{\mathbb {A}}\): For every set \(\tilde{S} \subset \tilde{P}\) we define \(N(\tilde{S})\) as \(N(\tilde{S})= \tilde{S} \cup \{x^\prime \mid x\in \tilde{P} \setminus \tilde{S}\}\). Then, we define \(NM(\mathbb {A})\) by saying that \(\tilde{S}\) is authorized in \(NM(\mathbb {A})\) if and only if \(N(\tilde{S})\) is authorized in \(\mathbb {A}\).
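The definition of \(N(\tilde{S})\) can be illustrated with a few lines of Python. This is a minimal sketch that takes \(\tilde{P}\) to be the unprimed names, as in the construction of Ostrovsky et al.; the attribute names, the example policy, and the helper names `expand` and `authorized_nm` are hypothetical.

```python
def primed(x):
    """Return the primed (negated) name of an attribute."""
    return x + "'"


def expand(s, unprimed_parties):
    """N(S): S together with the primed name of every attribute missing from S."""
    s = set(s)
    return s | {primed(x) for x in set(unprimed_parties) - s}


def authorized_nm(s, unprimed_parties, monotone_policy):
    """S is authorized in NM(A) iff N(S) is authorized in the monotonic A."""
    return monotone_policy(expand(s, unprimed_parties))


# Hypothetical policy "doctor AND NOT intern", written monotonically
# over normal and primed names.
def policy(attrs):
    return "doctor" in attrs and "intern'" in attrs


parties = {"doctor", "intern", "nurse"}
```

With these definitions, `authorized_nm({"doctor"}, parties, policy)` holds, while adding `"intern"` to the set makes the same check fail, which is the negation behaviour the primed attributes encode.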
Semantic term matching constraints
The works [26,27,28] formulate constraints that a retrieval function should satisfy to improve retrieval performance. We define \(D, D_1,\) and \(D_2\) to represent the documents and \(Q, Q_1,\) and \(Q_2\) to represent the queries. Let M and N denote subsets of the keywords in a document and of the words in a query, respectively. Let s(k, g) be any given semantic similarity between document keyword g and query word k. We assume that the word k is semantically more similar to word g than to word h if and only if \(s(k,g) > s(k,h)\). If we denote the weight of keyword g in document D by \(\tau (g,D)\), then we can define \(\mathcal {H}(Q,D)\) as the relevance score between query Q and document D. To simplify the explanation, we use \(\chi (N,g)\) to represent the semantic similarity between words N and the keyword g and \(\mathcal {P}(Q,M)\) to represent the semantic similarity between query Q and keywords M.
STMC1: For all \(Q,D_1\), and \(D_2\), if Q is composed of k and N, \(D_1\) is composed of g and M, \(D_2\) is composed of h and M, \(\mathcal {P}(Q,M) = 0, \chi (N,g)= \chi (N,h)=0, \tau (g,D_1)= \tau (h, D_2),\) and \(s(k,g)>s(k,h),\) then \(\mathcal {H}(Q,D_1)>\mathcal {H}(Q,D_2).\)
STMC1 requires the retrieval function to assign a higher score to a document whose keyword is semantically more related to the query word.
STMC2: For all \(Q,D_1\), and \(D_2\), if Q is composed of k and N, \(D_1\) is composed of g and M, \(D_2\) is composed of h and M, \(\mathcal {P}(Q,M)=0\), \(\chi (N,g)= \chi (N,h)=0\), and k equals g, then \(\mathcal {H}(Q,D_1) > \mathcal {H}(Q,D_2)\) even if \(\tau (g,D_1)\) is much smaller than \(\tau (h,D_2)\).
STMC2 requires that an exact match of an original query term always contributes a relevance score that is not lower than that of a match with a semantically related term, irrespective of how frequently the related term occurs in the document.
STMC3: For all \(D, Q_1\) and \(Q_2\), if \(Q_1\) is composed of \(\{k,h\}\) and N, \(Q_2\) is composed of k and D is composed of g and M, \(\mathcal {P}(Q_1,M)=\mathcal {P}(Q_2,M)=0\), \(\chi (N,g)=0\) and \(s(h,g)=s(k,g)\), then \(\mathcal {H}(Q_1,D) > \mathcal {H}(Q_2,D)\).
STMC3 requires the retrieval function to reward the involvement of a greater variety of query words during the search process.
Prediction-based embedding
Word embedding represents a language’s text as vectors that a machine can process easily, which is the basic idea behind Natural Language Processing. To preserve the latent semantic relationship between words, Mikolov et al. [29] proposed the Word2Vec vectorization model. Doc2Vec is an extension of Word2Vec that vectorizes an entire paragraph, an article, or a document instead of individual words. Similar to Word2Vec, it can operate in either of two training methods: the Distributed Memory version of Paragraph Vector (PV-DM) or the Distributed Bag of Words version of Paragraph Vector (PV-DBOW), as depicted in Fig. 2(a) and (b) respectively.
Shamir secret sharing scheme
The Shamir Secret Sharing (SSS) scheme enables secure distributed information sharing. Its core concept is to break a secret value s into n shares such that reconstructing the secret requires at least k shares. Specifically, let s be an element of \(\mathbb {Z}_p\), k the threshold, and n the number of participants. The procedure unfolds as follows. First, a random polynomial of degree \(k-1\), \(f(x) = a_0 + a_1x + a_2x^2 + \ldots + a_{k-1}x^{k-1}\), is generated, where \(a_0\) is set to s and \(a_1, a_2, \ldots , a_{k-1}\) are chosen at random from \(\mathbb {Z}_p\). The algorithm then computes n distinct points on the curve defined by this polynomial and gives each participant one of these points. The secret s can be restored by Lagrange interpolation from any k of the n points, written \((x_0,y_0), \ldots , (x_{k-1},y_{k-1})\), according to the equation \(s = \sum _{j=0}^{k-1} y_j \prod _{\begin{array}{c} m=0 \\ m \ne j \end{array}}^{k-1} \frac{x_m}{x_m - x_j}\).
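The sharing and reconstruction steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the modulus \(2^{61}-1\) is a stand-in prime, the evaluation points are simply \(x = 1, \ldots , n\), and the function names `make_shares` and `reconstruct` are hypothetical rather than taken from the scheme.

```python
import secrets

PRIME = 2**61 - 1  # stand-in prime modulus p (a Mersenne prime)


def make_shares(secret, k, n):
    """Split `secret` into n points of a random degree-(k-1) polynomial."""
    coeffs = [secret % PRIME] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def f(x):
        acc = 0
        for c in reversed(coeffs):      # Horner evaluation modulo PRIME
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, f(x)) for x in range(1, n + 1)]


def reconstruct(shares):
    """Recover f(0) from k points by Lagrange interpolation at x = 0."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = (num * xm) % PRIME
                den = (den * (xm - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+)
        secret = (secret + yj * num * pow(den, -1, PRIME)) % PRIME
    return secret


shares = make_shares(123456789, k=3, n=5)
recovered = reconstruct(shares[:3])
```

Any three of the five shares reconstruct the same secret, for example `reconstruct([shares[0], shares[2], shares[4]])`, whereas fewer than k shares determine nothing about \(a_0\).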
System model
Architecture
Our system model architecture is shown in Fig. 3, which consists of five entities. The detailed characteristics and functions of each entity are introduced as follows.
MAA: MAAs (Multiple attribute authorities) are responsible for providing attributes to data users using blockchain. Within MAAs, various attribute authorities contribute to the pool of attributes. This is crucial for ABE schemes, as they’re used in distributed access control environments where a single authority might not know all attributes. A distinct authority manages each attribute.
Blockchain: In the situation we are examining, the shared data contains personal information. This is why we’re implementing a Consortium Blockchain in our scheme. Considering the underlying scenario, respected and well-established institutions will serve as consensus nodes. Their role is to manage the blockchain and create blocks of information. In parallel, smart contracts facilitate attribute authorities in configuring overall settings and generating secret keys for users.
DO: DO encrypts documents for outsourcing to CS, allowing authorized Data Users (DU) to perform semantic searches. DO creates feature vectors by training a Doc2Vec model on the document set, securing model weights. Encrypted documents and index are outsourced to the CS.
DU: Data Users (DU) interact with the Cloud Storage (CS) to perform searches using encrypted indexes. The DU’s identity is verified through attribute-based access control. If the DU meets the access criteria set by the DO, the CS provides encrypted components \(\tilde{CT}\). The DU employs the Doc2Vec model, loading it with query keywords from set Q to generate the query feature vector \(Q_v\). The encrypted vector \(\tilde{Q}_v\) is then submitted to the CS to retrieve the top-k relevant documents.
CS: It manages storage and search queries for DO and DU, respectively. The Cloud Storage (CS) communicates twice with DU: first to confirm access permissions based on attributes and then to retrieve the top-k relevant encrypted documents from secure indexes in response to search queries.
Workflow of \(S^3DBMS\)
The diagram in Fig. 4 illustrates the sequence of steps in our proposed scheme. To begin, the security parameter is input to the attribute authorities and the blockchain. They then employ Shamir’s secret sharing scheme for N attribute authorities to create the global system parameters. After that, data users enroll themselves with the blockchain by utilizing the registration public key \(RPK_{uid}\), which is generated using the system parameters. Next, the multi-attribute authorities utilize the DU set of attributes and system parameters along with their registration key to construct the partial delegated key \(PDK_{uid}\). Subsequently, the blockchain’s smart contract calculates the complete secret key \(SK_{uid}\) for the DU through Lagrange polynomial interpolation.
Furthermore, the DO encrypts their documents using system parameters, creating CT. This encrypted content is stored with the CS. CT includes encrypted documents, secure indices, a revocation list (RL), and access control elements (\(A_c\)). The CS maintains the encrypted documents and logs the encryption process through a blockchain-based smart contract. Subsequently, the DU submits the delegated key \(DK_{uid}\) to the CS. After confirming the authorization for access and verifying the status of any revocation requests, the CS computes the intermediate ciphertext (ICT) and transmits it to the DU. Following the recovery of secret parameters from the ICT, the DU generates a semantic-aware search trapdoor and submits it to the CS. Afterwards, given the trapdoor, the CS performs semantic searching and outputs the encrypted top-k related documents.
Smart contract
In the proposed scheme, three categories of smart contracts are introduced: SystemContract, KeyGenContract, and RecordContract. During the Setup phase, SystemContract compiles partial attribute sets from various attribute authorities and employs Lagrange polynomial interpolation to construct global public parameters. KeyGenContract uses the same approach to generate the delegated key of each DU by gathering the partial delegated keys produced by the attribute authorities. RecordContract captures the user’s identity and the operations executed on the CS over stored data for documentation.
Proposed schemes construction
Concrete construction
1) Setup(\(\lambda\)): This algorithm is run by the attribute authorities and the blockchain to provide a working environment for the proposed scheme. It initiates the SystemContract with the security parameter \(\lambda\) as input and defines a bilinear group \(\mathbb {G}\) of prime order p with generators g and h. It also selects a universal set of attributes \(U=\{x_1,x_2,\ldots , x_n \}\) and two random secret values \(\{\alpha , \beta \} \in \mathbb {Z}^*_p\), and generates a secret share for each attribute authority using Shamir’s secret sharing scheme. Each attribute authority \(AA_i \in AA_k\) computes the partial public parameters \(\left\{ e(g,g)^{\alpha _i \beta _i}, g^{\alpha _i} \right\}\) using its secret share and sends them to the SystemContract. Upon receiving them, the SystemContract uses Lagrange polynomial interpolation to compute the global public parameters.
Further, it randomly defines two polynomials p(x) and q(x) of degree n, with the condition that \(q(0)= \beta\). It then creates two functions, U(x) and V(x), which can be computed publicly and map to \(g_2^{x^n}g^{p(x)}\) and \(g^{q(x)}\), respectively. Notably, the use of the Lagrange coefficients allows for the evaluation of \(g^{p(x)}\) and \(g^{q(x)}\) using the public key components [30]. Finally, it sets the MSK and the global public key components GPK.
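The way a contract can combine partial parameters of the form \(g^{\alpha _i}\) into \(g^{\alpha }\) without ever learning \(\alpha\) can be sketched as Lagrange interpolation in the exponent. The example below is only an illustration under stated assumptions: it works in the order-1019 subgroup of \(\mathbb {Z}^*_{2039}\) instead of a pairing group, covers only the \(g^{\alpha _i}\) component, and uses hypothetical names (`lagrange_at_zero`, `combine_in_exponent`) and a hand-picked sharing polynomial.

```python
P = 1019   # prime order of the subgroup (toy size)
Q = 2039   # Q = 2*P + 1 is prime
G = 4      # generates the order-P subgroup of Z_Q^*


def lagrange_at_zero(i, xs):
    """Lagrange coefficient of point xs[i], evaluated at x = 0, modulo P."""
    num, den = 1, 1
    for m, xm in enumerate(xs):
        if m != i:
            num = (num * xm) % P
            den = (den * (xm - xs[i])) % P
    return (num * pow(den, -1, P)) % P


def combine_in_exponent(partials):
    """partials: list of (x_i, g^{alpha_i}); returns g^{alpha}."""
    xs = [x for x, _ in partials]
    result = 1
    for i, (_, g_alpha_i) in enumerate(partials):
        result = (result * pow(g_alpha_i, lagrange_at_zero(i, xs), Q)) % Q
    return result


# Hypothetical sharing of alpha = 77 on f(x) = 77 + 5x + 3x^2 (threshold 3).
alpha = 77


def f(x):
    return (alpha + 5 * x + 3 * x * x) % P


# Each authority publishes only g^{f(x_i)}, never its share f(x_i).
partials = [(x, pow(G, f(x), Q)) for x in (1, 2, 4)]
combined = combine_in_exponent(partials)
```

The combined value equals `pow(G, alpha, Q)` because the Lagrange coefficients recombine the shares to \(f(0)=\alpha\) modulo the group order.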
2) \(\mathbf {Registration:}\) The DU selects a random value \(q_{uid} \in \mathbb {Z}_p\) and computes \(RPK_{uid}= g^{q_{uid}}\). The DU keeps \(q_{uid}\) confidential and transmits \(RPK_{uid}\) to the blockchain. Subsequently, the consensus nodes activate the RecordContract to record the DU who accesses the data.
3) \(\mathbf {KeyGen(MSK,GPK):}\) This phase, run by the attribute authorities and KeyGenContract, binds access control components from the system attribute set U to the DU secret key components. The DU is then eligible to decrypt the access control components if the attached attributes match the secret key. For each attribute \(x \in S_{uid}\), \(AA_i\) randomly selects \(r_{x_i} \in \mathbb {Z}_p\), computes the partial delegated key (PDK) components \(k_i,k^{\prime }_i\), and sends them to KeyGenContract. To do so, it first obtains the share \(\lambda _i\) for the system parameter \(\alpha\) by applying the linear secret sharing mechanism \(\Pi\) and randomly selects a value \(t_{uid} \in \mathbb {Z}_p\). Further, using the Lagrange interpolation method, KeyGenContract combines the partial delegated key components into an SK and sends it to the DU.
4) \(\mathbf {EncDoc(PK,\mathcal {K},D):}\) The DO generates two keys. The first, V, is a randomly generated m-bit configuration vector, while the second is \(\mathcal {K} \in \mathbb {Z}_p\). The configuration vector V helps the DO scramble the Doc2Vec hidden layer weight matrix \(M_w\) and the secure inner product matrices, while the key \(\mathcal {K}\) encrypts the document set D. The DO also randomly generates two \(m \times m\) invertible matrices \(M_1\) and \(M_2\) and a random value \(s\in \mathbb {Z}_p\). Then, the DO trains the Doc2Vec model on its own document set D and obtains the m-dimensional feature vector \(D_{v_i}\) for each document \(D_i\) in D. An own-trained neural network is used instead of a pre-trained one to avoid the large dictionary that would produce a huge word vector for the dataset’s vocabulary. Additionally, the dimension of the feature vector in the Doc2Vec model is much smaller than the number of words in the vocabulary of the document set. After normalization, these feature vectors are treated as a semantic-aware plaintext index for their source documents. The DO then freezes the trained model and extracts its hidden layer weights matrix \(M_w\). Next, the DO encrypts each plaintext index \(D_{v_i}\) in \(D_{v}\) into its secure equivalent \(I_i\) in I using the secure inner product operation, as shown in steps 6-10 of Algorithm 6. It also encrypts each \(D_i\) in D with the symmetric key \(\mathcal {K}\) to obtain \(\tilde{D}_i\) in \(\tilde{D}\). The rows of the weights matrix \(M_w\) are the word vector representations of the words in the dictionary of the document set D. These normalized rows expose the semantic relationship for a given word. Hence, the DO uses the configuration vector V to obfuscate this semantic relationship, as depicted in steps 14-18 of Algorithm 6. Similar operations are performed on the transformation matrices \(M_1\) and \(M_2\).
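The core idea behind a secure inner product operation is that an index encrypted with \(M^{T}\) and a query encrypted with \(M^{-1}\) still produce the plaintext inner product. The sketch below shows only this property, under strong simplifications: a single hypothetical \(2 \times 2\) integer matrix stands in for \(M_1\) and \(M_2\), and the vector-splitting step of the full secure kNN construction is omitted. It is not a reproduction of Algorithm 6.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical secret matrix with determinant 1, so its inverse is integral.
M = [[2, 1], [5, 3]]
M_INV = [[3, -1], [-5, 2]]


def encrypt_index(d):
    """Secure index: I = M^T d."""
    return matvec(transpose(M), d)


def encrypt_query(q):
    """Trapdoor: M^{-1} q."""
    return matvec(M_INV, q)


doc_vec, query_vec = [3, 4], [1, 2]
score = dot(encrypt_index(doc_vec), encrypt_query(query_vec))
```

The encrypted score equals `dot(doc_vec, query_vec)` because \((M^{T}d)^{T}(M^{-1}q) = d^{T}MM^{-1}q = d^{T}q\), so the server can rank documents without seeing either plaintext vector.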
Let \(RL=\{RPK_{{uid}_1}, RPK_{{uid}_2},\ldots , RPK_{{uid}_r}\}\) be the revocation list of r users. The algorithm splits s into r random shares \(s_1,s_2, \ldots , s_r\) such that \(\sum _{i=1}^{r} s_i=s\) and computes \(C_0\) and \(C_1\) accordingly.
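Splitting s into r random shares that sum to s is plain additive secret sharing. A minimal sketch follows, assuming a stand-in prime modulus and a hypothetical helper name `split_additive`; how the shares enter \(C_0\) and \(C_1\) is not shown.

```python
import secrets

P = 2**61 - 1  # stand-in prime modulus


def split_additive(s, r):
    """Return r shares in Z_P whose sum is s modulo P."""
    shares = [secrets.randbelow(P) for _ in range(r - 1)]
    shares.append((s - sum(shares)) % P)   # last share fixes the total
    return shares


revocation_shares = split_additive(42, 5)  # one share per revoked user
```

The first r-1 shares are uniformly random, so any proper subset of the shares reveals nothing about s.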
To dictate the access control through the set of negative and non-negative attributes \(\gamma \in \mathbb {Z}_p^*\), the DO computes the access control components \(A_c\). It then converts the configuration vector V into an integer \(I_v\) and encrypts it with the symmetric key \(\mathcal {K}\) for eligible DUs. Further, the whole set of ciphertext components is denoted CT. Finally, the DO uses \(SHA_{256}\) to compute the hash value \(v_i=SHA_{256}(CT)\) and sends \((CT,v_i,address_{DO})\) to the CS. Subsequently, the CS submits \((v_i, CS_{id}, ``upload'')\) to the address of the DO’s RecordContract.
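The hashing and logging step can be illustrated with Python's standard `hashlib`. This sketch assumes the ciphertext is available as a byte string and models the record as a plain tuple; the function names and the `"cs-1"` identifier are hypothetical, and no actual smart contract call is made.

```python
import hashlib


def ciphertext_digest(ct_bytes: bytes) -> str:
    """v_i = SHA-256(CT), hex-encoded."""
    return hashlib.sha256(ct_bytes).hexdigest()


def upload_record(ct_bytes: bytes, cs_id: str):
    """Tuple the CS would submit to the DO's RecordContract."""
    return (ciphertext_digest(ct_bytes), cs_id, "upload")


def matches_record(ct_bytes: bytes, recorded_digest: str) -> bool:
    """Integrity check a DU can run on received ciphertext."""
    return ciphertext_digest(ct_bytes) == recorded_digest


record = upload_record(b"example ciphertext", "cs-1")
```

Because the digest is stored on a tamper-resistant ledger, any later change to the stored ciphertext makes `matches_record` return `False`.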
4) \(\mathbf {Search_A(\mathcal{D}\mathcal{K},CT)}\): The DU submits the \(\mathcal{D}\mathcal{K}\). This phase of the system model, as depicted in Algorithm 7, is run by the CS and is divided into the following phases:

Access Verification: CS first checks whether the attributes embedded in the \(\mathcal{D}\mathcal{K}\) satisfy the set of attributes attached to the access control components. If they do not, CS terminates the search phase; otherwise, it proceeds as follows: For each non-negated attribute x, CS computes \(F_i = \frac{e\left(K_i^{\prime (1)}, A^{(2)}\right)}{e\left(K_i^{(2)}, A_i^{(3)}\right)}\). Similarly, for negated attributes \(x^\prime\), CS computes \(F_i = \frac{e\left(K_i^{\prime (3)}, A^{(2)}\right)}{e\left(K_i^{(5)}, \prod _{x \in S}(A_x)^{(4)^{\sigma _x}}\right) \cdot e\left(K_i^{(4)}, A^{(2)}\right)^{\sigma _{x_i}}}\). Finally, the verification is confirmed if \(\prod _{i \in I} F^{w_i}_i = e(g,g)^{\alpha \beta }\), else \(F_i = \bot\).

Revocation Identity: Similarly, the CS checks whether the DU’s \(RPK_{uid}\) is in the revocation list RL by computing
$$\begin{aligned} F_i= \frac{e(C_0,D_0)}{A \cdot B} \end{aligned}$$(1)
where \(A=e\left(D_1, \prod _{i=1}^r C_{i,1}^{1/(RPK_{uid}-RPK_{uid_i})}\right)\) and \(B=e\left(D_2, \prod _{i=1}^r C_{i,2}^{1/(RPK_{uid}-RPK_{uid_i})}\right)\). If \(RPK_{uid} = RPK_{uid_i}\) for some i, the CS fails to obtain two linearly independent equations and hence cannot solve the above equation for \(e(g,g)^{s\alpha \beta }\).

Ciphertext Precomputation: If CT is accessible, this phase proceeds as above, except that the two components \(K_i^{\prime (1)}\) and \(K_i^{\prime (3) }\) are replaced by \(K_i^{(1)}\) and \(K_i^{(3)}\): For each non-negated attribute x, the CS computes \(IC_i = \frac{e\left(K_i^{ (1)}, A^{(2)}\right)}{e\left(K_i^{(2)}, A_i^{(3)}\right)}\). Similarly, for negated attributes \(x^\prime\), CS computes \(IC_i = \frac{e\left(K_i^{ (3)}, A^{(2)}\right)}{e\left(K_i^{(5)}, \prod _{x \in S}(A_x)^{(4)^{\sigma _x}}\right) \cdot e\left(K_i^{(4)}, A^{(2)}\right)^{\sigma _{x_i}}}\). Further, it computes \(\prod _{i \in I} IC^{w_i}_i = e(g_2,g)^{ts\alpha }\) and sets it to \(\lbrace I_R \rbrace\). Finally, the CS returns the intermediate ciphertext \(\tilde{CT}\) to the DU.
5) \(\mathbf {TDGen(}\mathcal {K}_{\textbf{local}}\textbf{, Q,} \tilde{\textbf{CT}}\mathbf {)}\): This phase is implemented in Algorithm 8 and run by the DU in our system model. Let Q be the query keyword set in the search trapdoor. The DU first computes \(e(g_2,g)^{\alpha s}\) to recover the integer \(I_v\) and subsequently converts it into the configuration vector V. The DU obtains the unpermuted transformation matrices \(M_1\), \(M_2\) and weight matrix \(M_w\) by applying the inverse shift row transformation using the configuration vector V. It then loads the Doc2Vec model with \(M_w\) and, by inputting the query keyword set Q, obtains the m-dimensional query feature vector \(Q_v\). Finally, it uses the secure inner product encryption operation to get the trapdoor vector \(\tilde{Q}_v\) and sends it to the CS.
6) \(\textbf{Search}_\textbf{B}\mathbf {(}\tilde{\textbf{Q}}_{\textbf{v}}\mathbf {, I)}\): For each document \(D_i\), the CS performs the inner product operation between the secure index \(I_i\) in I and the trapdoor \(\tilde{Q}_v\) to get the semantic-aware ranked scores as shown in the following equation. The CS invokes the RecordContract for the corresponding DO using their addresses. Then, it uploads the hash value \(V_i\) of the ciphertext and the data user’s id to the smart contract as \((V_i, uid,``search'')\). The CS returns the top-k documents, along with \(A^{(0)}\), as the search result \(R_{score}\) to the DU.
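The ranking and logging steps of the search phase can be sketched in Python as follows. This is a minimal illustration, not the scheme's implementation: the vectors are plaintext stand-ins for the encrypted index and trapdoor, the function names `rank_top_k` and `search_record` are hypothetical, and SHA-256 is an assumed choice of hash function.

```python
import hashlib
import heapq


def inner_product(u, v):
    """Plain inner product of two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def rank_top_k(secure_index, trapdoor, k):
    """Return the k (doc_id, score) pairs with the highest scores, best first."""
    scored = ((doc_id, inner_product(vec, trapdoor))
              for doc_id, vec in secure_index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


def search_record(ciphertext, uid):
    """Build the (V_i, uid, "search") tuple that is sent to the smart contract.

    SHA-256 is an assumption made for this sketch.
    """
    v_i = hashlib.sha256(ciphertext).hexdigest()
    return (v_i, uid, "search")


# Toy index with three documents and a toy trapdoor (illustrative values only).
index = {"D1": [1, 0, 2], "D2": [0, 3, 1], "D3": [2, 2, 2]}
trapdoor = [1, 1, 1]
top = rank_top_k(index, trapdoor, k=2)   # [("D3", 6), ("D2", 4)]
record = search_record(b"ciphertext-of-D3", "uid-42")
```

Using a heap keeps the selection at roughly O(n log k) for n documents, which matters when k is much smaller than n.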
7) \(\mathbf {Decryption(\mathcal {K}_{local}, A^{(0)})}\): The DU computes the hash value of the ciphertext CT of each received top-k ranked document. If it does not match \(V_i\), the algorithm halts. Otherwise, the DU sends \((V_i, uid, ``decryption'')\) to the RecordContract and recovers the symmetric key \(\mathcal {K}\) by computing
Enhanced \(S^3DBMS\) scheme
We propose an Enhanced \(S^3DBMS (ES^3DBMS)\) scheme to further strengthen the security of our basic scheme. It is described in detail in the following subsections.
Security improvement
\(ES^3DBMS\) attains enhanced security by utilizing the learning with errors (LWE)-based secure kNN algorithm to encrypt the feature indices [31]. This approach guarantees strong privacy protection for the underlying feature vectors. The required changes in \(ES^3DBMS\) are as follows:
\(\mathbf {EncDoc(PK}, \mathcal {K},\mathcal {M})\mathbf {:}\) The DO generates a set of encryption keys for feature vectors, represented as \(\gamma , M, M^{-1}\). Here, \(\gamma\) is a publicly known integer chosen at random from the set \(\mathbb {Z}_{p_1}\), M is a randomly generated invertible matrix with dimensions \(2m \times 2m\), and \(M^{-1}\) is the inverse of M. Additionally, the integers \(p_1\) and \(p_2\) define the range of numbers, with \(p_1\) significantly greater than \(p_2\). Next, to encrypt the feature vectors, the DO extends the m-dimensional vectors to 2m dimensions as
where \(\alpha \in \mathbb {Z}^{d-1}_{p_2}\) are random numbers selected by the DO for each feature vector \(v_{i,j}\). Further, the DO encrypts each extended feature vector \(v_{i,j}\) as
Here, \(\epsilon _i\) represents a random integer noise vector and \(\gamma \gg 2\max (|\epsilon _i|)\), where \(|\epsilon _i|\) denotes the absolute values of the elements in \(\epsilon _i\).
\(\textbf{TDGen}(\tilde{\textbf{CT}}, \textbf{Q}, \mathcal {K}_{\textbf{local}} )\): Similarly, the DU extends the query vector \(Q_v\) to \(Q_v=(\delta _jq_1,\delta _jq_2,\ldots ,\delta _jq_m,\delta _j,\beta _j )\),
where \(\delta _j \in \mathbb {Z}_{p_2}\) and \(\beta _j \in \mathbb {Z}^{m-1}_{p_2}\) are random numbers. Further, the DU encrypts this extended query vector \(Q_v\) as
where \(\epsilon _j \in Z^{2m}_{p_1}\) is a random integer noise vector.
\(\textbf{Search}_\textbf{B}(\tilde{\textbf{I}}, \tilde{\textbf{Q}}_{\textbf{v}})\): With the extended dimension, the relevance score between the trapdoor \(\tilde{Q}_v\) and every document index in \(\tilde{I}\) is computed as
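The idea behind scoring with noisy, matrix-masked vectors can be sketched in Python. This is a simplified illustration under stated assumptions, not the scheme's exact construction: it omits the dimension extension with \(\alpha\), \(\delta _j\), and \(\beta _j\) as well as any modular reduction, and it assumes the encryption forms \((\gamma v + \epsilon _i) M^{-1}\) for an index and \(M(\gamma q + \epsilon _j)\) for a query. All function names are hypothetical. The sketch shows why \(\gamma\) must be much larger than the noise: after dividing the encrypted inner product by \(\gamma ^2\), the noise terms round away.

```python
import random
from fractions import Fraction


def invert(matrix):
    """Exact Gauss-Jordan inverse over the rationals; raises ValueError if singular."""
    n = len(matrix)
    aug = [
        [Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def keygen(dim, rng):
    """Sample a random invertible integer matrix M and return (M, M^{-1})."""
    while True:
        m = [[rng.randint(-5, 5) for _ in range(dim)] for _ in range(dim)]
        try:
            return m, invert(m)
        except ValueError:
            continue  # singular sample, try again


def noise(dim, bound, rng):
    """Integer noise vector with entries in [-bound, bound]."""
    return [rng.randint(-bound, bound) for _ in range(dim)]


def enc_index(v, gamma, m_inv, eps):
    """Row vector (gamma * v + eps) * M^{-1} (assumed form)."""
    masked = [gamma * x + e for x, e in zip(v, eps)]
    dim = len(v)
    return [sum(masked[i] * m_inv[i][j] for i in range(dim)) for j in range(dim)]


def enc_query(q, gamma, m, eps):
    """Column vector M * (gamma * q + eps) (assumed form)."""
    masked = [gamma * x + e for x, e in zip(q, eps)]
    return [sum(row[j] * masked[j] for j in range(len(q))) for row in m]


def score(enc_i, enc_q, gamma):
    """Divide the encrypted inner product by gamma^2 and round the noise away."""
    raw = sum(a * b for a, b in zip(enc_i, enc_q))
    return round(Fraction(raw) / (gamma * gamma))


rng = random.Random(7)
dim, gamma, bound = 6, 1000, 2          # gamma is far larger than the noise bound
v = [rng.randint(0, 9) for _ in range(dim)]
q = [rng.randint(0, 9) for _ in range(dim)]
M, M_inv = keygen(dim, rng)
index_ct = enc_index(v, gamma, M_inv, noise(dim, bound, rng))
query_ct = enc_query(q, gamma, M, noise(dim, bound, rng))
plain = sum(a * b for a, b in zip(v, q))
recovered = score(index_ct, query_ct, gamma)  # equals the plaintext inner product
```

With these toy parameters the combined noise contribution is at most 216,024, which is below \(\gamma ^2/2 = 500{,}000\), so rounding always recovers the exact inner product. Exact rational arithmetic is used here only to avoid floating-point error in the demonstration.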
Correctness of access control
Algorithm \(Search_A\) tells us that the CS can confirm the access authorization by checking whether the following equation holds:
The set of authorized entities is denoted as \(\tilde{\gamma }=N(\gamma )\in \mathbb {A}\). Let I be the set of indices i such that \(\overset{\smile }{x}_i \in \tilde{\gamma }\) and let \(\{w_i \in Z_p \mid i \in I \}\) be a collection of constants. A valid share \(\lambda _i\) of the secret \(\alpha\) based on the \(\prod\) protocol satisfies the equation \(\Sigma _{i \in I}w_i \lambda _i=\alpha\). The CS calculates this equation for each non-negated attribute \(x_i \in \gamma\) (i.e., \(\overset{\smile }{x}_i \in \gamma ^\prime\)).
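The reconstruction identity \(\Sigma _{i \in I}w_i \lambda _i=\alpha\) can be illustrated with Shamir threshold sharing, a well-known special case of linear secret sharing in which the constants \(w_i\) are Lagrange coefficients evaluated at zero. The Python sketch below is an illustration only: the prime, the function names, and the threshold structure are assumptions and do not reproduce the scheme's access structure.

```python
import random

P = 2**61 - 1  # a Mersenne prime, chosen here only for illustration


def make_shares(secret, degree, xs, rng):
    """Evaluate a random polynomial q of the given degree, with q(0) = secret, at each x."""
    coeffs = [secret % P] + [rng.randrange(P) for _ in range(degree)]

    def q(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % P
        return acc

    return {x: q(x) for x in xs}


def lagrange_at_zero(xs):
    """Coefficients w_x such that sum(w_x * q(x)) = q(0) whenever deg(q) < len(xs)."""
    coeffs = {}
    for xi in xs:
        num, den = 1, 1
        for xj in xs:
            if xj != xi:
                num = (num * (-xj)) % P
                den = (den * (xi - xj)) % P
        coeffs[xi] = num * pow(den, -1, P) % P  # modular inverse (Python 3.8+)
    return coeffs


def reconstruct(shares):
    """Recombine shares {x: q(x)} into the secret q(0)."""
    w = lagrange_at_zero(list(shares))
    return sum(w[x] * y for x, y in shares.items()) % P


rng = random.Random(1)
alpha = 123456789
shares = make_shares(alpha, degree=3, xs=[1, 2, 3, 4], rng=rng)
recovered = reconstruct(shares)  # equals alpha, since 4 points fix a degree-3 polynomial
```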
Similarly, for negated attributes \(x_i \notin \gamma\) (so \(\overset{\smile }{x}_i \in \gamma ^\prime\)), we let \(\gamma _i= \gamma \cup \{x_i \}\) and apply Lagrange interpolation over the points in the set \(\gamma _i\) to obtain the coefficients \(\{\sigma _x \}_{x \in \gamma _i}\) such that \(\sum _{x \in \gamma _i}\sigma _xq(x)= q(0)=\beta\). The CS computation then proceeds as follows:
Finally,
Security analysis
For the access control threat, the security proof is modeled in the form of a security game between an attacker \(\mathcal {A}\) and challenger \(\mathcal {C}\).
Theorem 1
If a probabilistic polynomial-time (PPT) adversary can break the access control components in Algorithm 2 with advantage \(\epsilon\) in the selective-set model, then we can construct a simulator \(\mathcal {B}\) that wins the decisional BDH game with advantage \(\frac{\epsilon }{2}\).
Proof
\(\mathcal {B}\) plays the role of the challenger towards \(\mathcal {A}\). The challenger \(\mathcal {C}\) randomly selects \(a,b,c,z \in \mathbb {Z}^*_p\) and flips a binary coin \(\mu \in \lbrace 0,1 \rbrace\) outside of \(\mathcal {B}\)'s view. If \(\mu = 0\), \(\mathcal {C}\) sets \(Z_\mu = e(g,g)^{abc}\); otherwise, it sets \(Z_\mu = e(g,g)^{z}\).
Init. The simulator \(\mathcal {B}\) runs \(\mathcal {A}\). \(\mathcal {A}\) declares a revocation list \(RL=\{RPK_{{uid}_1}, RPK_{{uid}_2},\ldots RPK_{{uid}_n}\}\) along with the corrupted attribute authorities \(AA_c \subseteq AA\) and chooses the challenge access structure \(\gamma\) of d attributes.
Setup. \(\mathcal {B}\) now creates the public key components by setting \(g^\alpha = A\) and \(g^\beta =B\), thereby implicitly assigning \(\alpha = a\) and \(\beta = b\). For the revocation key RK, however, it sets \(\beta\) as \(b_1+b_2+ \ldots + b_r\). It then randomly chooses \(y \in \mathbb {Z}_p\) and a polynomial f(x) of degree d, and fixes a degree-d polynomial u(x) as per the following procedure:
For all \(x \in \gamma\), it sets \(u(x)=-x^d\); otherwise, \(u(x) \ne -x^d\). \(\mathcal {B}\) implicitly sets the polynomials h and q as follows. First, it sets \(h(x) = \beta u (x)+f(x)\). It then randomly selects points \(\theta _{x_1}, \theta _{x_2}, \ldots , \theta _{x_d} \in \mathbb {Z}_p\) for the set \(\gamma = \lbrace x_1, x_2, \ldots , x_d \rbrace\) and sets \(q(x_i)= \theta _{x_i}\) such that \(q(0)= \beta\). Finally, it sends the public key components \(\{\lbrace g^{q_i} \mid _{i \in [1,d]} \rbrace\), \(g^{h_i}= g_2^{u_i} g^{f_i} \mid _{i \in [1,d]}\), \(g^{\beta }= \prod _{i \in RL} g^{b_i}\), \(g^{\beta ^2}= \prod _{i,j \in RL} g^{b_i.b_j}\), \(h=\prod _{i \in RL}({g^{b_i}}^{t_{uid}}g^y)\}\) to \(\mathcal {A}\).
Phase 1. \(\mathcal {A}\) repeatedly issues secret key queries for access structures that \(\gamma\) does not satisfy. Suppose \(\mathcal {A}\) requests secret key components such that \(\tilde{\mathbb {A}}(\gamma )= 0\), where \(\tilde{\mathbb {A}}\) is defined earlier as \(NM(\mathbb {A})\) for some monotonic access structure \(\mathbb {A}\) over a set of attributes S, associated with a linear secret sharing scheme \(\Pi\). To map the secret shares to negated or non-negated attributes, let M be the sharing matrix of \(\Pi\); the sharing for this simulation is then as follows:
First, we set the secret \(\alpha = a\) for this distribution and randomly select a vector \(v= (v_1, v_2, \ldots , v_{n+1}) \in \mathbb {Z}_p^{n+1}\). Since \((1,0, \ldots , 0)\) is independent of \(M_{\gamma ^\prime }\), we can efficiently compute [32] a vector \(\mathcal {W}= (\mathcal {W}_1, \ldots , \mathcal {W}_{n+1})\) such that \((1,0, \ldots , 0) \cdot \mathcal {W}= \mathcal {W}_1=1\) and \(M_{\gamma ^\prime } \cdot \mathcal {W}= \overrightarrow{0}\), where \(M_{\gamma ^\prime }\) is the submatrix of M associated with the set of attributes in \(\gamma ^\prime\). We now define a uniformly distributed vector \(\upsilon =v+(a - v_1)\mathcal {W}\), which satisfies the constraint \(\upsilon _1= a\). For the shares \(\lambda = M\upsilon\), we have \(\lambda _i= M_{i} \upsilon = M_i v\) for all \(x_i \in \gamma ^\prime\), so these shares have no dependence on a. First, for negated attributes \(\overset{\smile }{x}_i= x_i^\prime\), we show how to compute the secret key components. Note that \(\overset{\smile }{x}_i \in \gamma ^\prime\) if and only if \(x_i \notin \gamma\).

The secret share \(\lambda _i\) may depend linearly on a if and only if \(x_i \in \gamma\). \(\mathcal {B}\) selects \(r_i^\prime\) and \(t \in \mathbb {Z}_p\) at random and sets \(\mathcal {K}_{local}= (\frac{1}{t})\). Then, letting \(q(x_i)= \theta _{x_i}\) and \(r_i = t\lambda _i+ r_i^\prime\), \(\mathcal {B}\) outputs the following valid delegated key components: \(K_i^\prime = \left(K_i^{(3)}= g_2^{r_i^\prime }, K_i^{(4)}= g^{\theta _{x_i}(t\lambda _i+r^\prime _i)}, K_i^{(5)} = g^{t\lambda _i+r^\prime _i}\right)\).

The secret share \(\lambda _i\) is independent of a if and only if \(x_i \notin \gamma\) and is hence known to \(\mathcal {B}\). In this case, the simulator picks \(r_i \in \mathbb {Z}_p\) at random and outputs the following delegated key components: \(K_i= \left( K_i^{(3)}= g_2^{t\lambda _i+r_i}, K_i^{(4)}= V(x_i)^{r_i}, K_i^{(5)} = g^{r_i}\right)\).
Now, for a non-negated attribute \(\overset{\smile }{x}_i=x_i\), we describe how to compute the secret key components.

The secret share \(\lambda _i\) is independent of any secret if and only if \(x_i\in \gamma\). In this case, \(\mathcal {B}\) picks \(r_i \in \mathbb {Z}_p\) at random and outputs \(K_i= \left(K_i^{(1)}= g^{t\lambda _i} \cdot T(x_i)^{r_i}, K_i^{(2)}= g^{r_i}\right)\).

The secret share \(\lambda _i\) depends on a if and only if \(x_i \notin \gamma\). Then \(\mathcal {B}\) lets \(g_3= g^{\lambda _i}\), selects \(r_i^\prime \in \mathbb {Z}_p\) at random, and outputs the components \(K_i^{(1)}\) and \(K_i^{(2)}\) of the delegated key \(K_i\) as
$$\begin{aligned} K_i^{(1)} &= g_3^{\frac{-f(x_i)}{x_i^d+u(x_i)}} \left(g_2^{x_i^d+u(x_i)} \cdot g^{f(x_i)}\right)^{r^\prime _i} \\ K_i^{(2)} &= g_3^{\frac{-1}{x_i^d+u(x_i)}} g^{r_i^\prime } \end{aligned}$$
Now \(\mathcal {B}\) constructs the secret keys for the identities in RL. For each \(RPK_{uid} \in RL\), \(\mathcal {B}\) selects a random \(Z_i \in Z_p\) and sets \(t_{uid}\) implicitly as \(t_{uid}= ai^2 + 2i\). The secret key components for \(RPK_{uid}\) are computed as:
Challenge. The adversary \(\mathcal {A}\) selects two equal-length messages \(M_1\) and \(M_2\) and submits them to \(\mathcal {B}\) for encryption. \(\mathcal {B}\) randomly chooses \(s^\prime , s_1^\prime , \ldots s_r^\prime \in Z_p\) such that \(s^\prime = \sum _i s_i^\prime\). Now \(\mathcal {B}\) flips a fair coin \(v \in \lbrace 0,1 \rbrace\) and returns the encryption of \(M_v\) as:
Now the following cases arise:

If \(\mu = 0\), then \(Z= e(g,g)^{abc}\). By inspection, the ciphertext is then a valid encryption of \(M_v\) under the set \(\gamma\).

If \(\mu = 1\), then \(Z= e(g,g)^z\). Since z is random, the ciphertext component \(C^{(1)}= M_{v} \times e(g,g)^z\) is a random element of \(\mathbb {G}_T\) from \(\mathcal {A}\)'s point of view and hence contains no information about \(M_v\).
Phase 2. \(\mathcal {B}\) acts the same way as it did in Phase 1.
Guess. \(\mathcal {A}\) submits a guess \(v^\prime\) of v. \(\mathcal {B}\) then outputs one of two guesses: either \(\mu ^\prime = 0\), which indicates that it was given a valid BDH tuple, or \(\mu ^\prime = 1\), which indicates a random 4-tuple. The probability analysis is as follows:
In the case where \(\mu =1\), \(\mathcal {A}\) gains no information about v. Therefore, we have
Since the simulator guesses \(\mu ^\prime =1\) when \(v \ne v^\prime\), we also have
If \(\mu = 0\), then \(\mathcal {A}\) sees the encryption of \(M_v\). The advantage of \(\mathcal {A}\) is \(\epsilon\) by assumption. Therefore, we have
Since the simulator guesses \(\mu ^\prime =0\) when \(v=v^\prime\), we have
Using Eqs. (4) and (6), the overall advantage of the algorithm \(\mathcal {B}\) in the decisional BDH game is
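This analysis follows the usual template for selective-set reductions to the decisional BDH assumption. As a sketch under that assumption, and without reproducing the original equation numbering, the calculation can be written out as:

$$\begin{aligned} \Pr [\mu ^\prime = \mu \mid \mu = 1] &= \Pr [v^\prime \ne v \mid \mu = 1] = \tfrac{1}{2}, \\ \Pr [\mu ^\prime = \mu \mid \mu = 0] &= \Pr [v^\prime = v \mid \mu = 0] = \tfrac{1}{2} + \epsilon , \\ \mathrm {Adv}_{\mathcal {B}} &= \tfrac{1}{2}\Pr [\mu ^\prime = \mu \mid \mu = 0] + \tfrac{1}{2}\Pr [\mu ^\prime = \mu \mid \mu = 1] - \tfrac{1}{2} \\ &= \tfrac{1}{2}\left( \tfrac{1}{2} + \epsilon \right) + \tfrac{1}{2} \cdot \tfrac{1}{2} - \tfrac{1}{2} = \tfrac{\epsilon }{2}. \end{aligned}$$

The factor of one half arises because the coin \(\mu\) is uniform, so the adversary's advantage only helps in the half of the cases where the tuple is a valid BDH tuple.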
Performance analysis
This section evaluates our proposed approach in different scenarios, focusing on functionality, precision, and time cost. The experiments were carried out on a computer running the Windows 7 operating system with 32 GB RAM and two 64-bit 3.4 GHz Intel(R) Core(TM) i7-3770 CPUs, without GPUs. The model was implemented in Python with the Gensim framework and trained on the 20 newsgroups dataset with 11,315 articles. The use of GPUs would be expected to significantly speed up both the training and testing processes. In our experiments, 'm', 'h', 'n', and 'k' represent the number of features, keywords, total documents, and required documents, respectively. Default values are given in Table 1.
Functionality analysis
This section compares the functionality of our proposed scheme with secure semantic searching schemes and blockchain-based searchable encryption schemes. As demonstrated in Table 2, our scheme supports semantic term matching constraints (STMC) in M/M settings and is more comprehensive than existing schemes. Our scheme and the schemes proposed in [20, 24, 33, 34] utilize blockchain technology to ensure reliable search outcomes. The previous schemes lack support for semantic-aware search, which means they only allow exact-keyword matching over encrypted indices. As a result, they are not suitable for practical application, as they fail to learn the latent semantic intention of the user's search query. Our scheme produces a probabilistic query for document retrieval: a unique query feature vector is generated every time for the same set of keywords to hide the search pattern. Compared with these schemes, our scheme uses a deep learning doc2vec model that yields more powerful predictive word embeddings, capturing not only the semantic features of words but also their surrounding context [29, 35]. Unlike the mentioned schemes, the proposed scheme enforces access control on the outsourced documents and also supports user revocation. Hence, it is suitable for a broader range of applications. Similar to the scheme in [33], our scheme employs semantic term-matching constraints to enhance search accuracy. The remaining blockchain-based schemes rely on single- or multi-keyword matching to identify documents that contain the exact keywords in the indices. Consequently, these schemes focus solely on determining whether the retrieved documents contain a specific keyword, disregarding any semantic matching between index and query keywords. As a result, these schemes fail to meet the fundamental retrieval heuristic STMC1.
According to STMC2, the retrieval algorithm should assign a document that exactly matches the query words a higher relevance score than documents that only contain words related in meaning. While semantic similarity is valuable for document retrieval, relying too much on it is not recommended. However, the previous semantic-based searching schemes fail to acknowledge this constraint. Our proposed scheme introduces the truncated cosine similarity metric \(\widehat{cos}(\tilde{I}_{v_i},\tilde{Q}_v)\) to balance semantic matching and exact matching effectively. STMC3 requires a damping effect when accumulating the relevance scores of specific query words, enabling more distinct query words to contribute to the search process. However, the mentioned semantic searching schemes do not fulfill this requirement. The word2vec model used in the schemes of [9, 36] operates at word-level granularity, disregarding the context or order in which words appear in a query. This limitation prevents the fine-grained control necessary for introducing a damping effect and prioritizing specific query words. Compared to these schemes, the doc2vec distributed document embedding reduces the impact of specific terms by assigning different dimensions to different aspects of the document, ensuring a balanced representation. Additionally, the weights of the model trained by the DO are securely transferred to the DU side to facilitate the formulation of relevant query feature vectors.
Our scheme, along with previous approaches [8, 9, 36, 37], enables semantic search on encrypted documents. Compared to these schemes, our proposed scheme stands alone in the domain of semantic searching by integrating the essential features of multi-attribute authority, revocation, and access control simultaneously, whereas the scheme in [8] supports revocation only. In terms of the weight model, schemes [9, 36] employ Word2Vec, which is primarily designed to capture semantic relationships among individual words by representing them as dense vector embeddings. However, relying solely on Word2Vec may not adequately capture the holistic meaning of an entire document. In contrast, our scheme utilizes Doc2Vec, which represents the entire document as a fixed-length vector. This characteristic makes Doc2Vec a more appropriate choice for document-level semantic similarity tasks. By considering the document as a whole, Doc2Vec enables a more comprehensive understanding of the document’s semantic meaning.
Semantic precision evaluation
We leverage the categorizations inherent in the 20 newsgroups dataset, employing them as 20 discrete semantic classes that encompass a range of subjects, including but not limited to baseball, politics, and hardware. Furthermore, we utilize the articles within these classes to assess the semantic precision of our proposed scheme quantitatively. To obtain a reasonably valid measure of semantic precision, we assume that documents belonging to different classes hold no significant semantic relevance to each other. The data user’s query is expected to contain keywords from the same class. For instance, if the query includes terms like “pitch,” “helmet,” and “National League,” the documents retrieved from the class associated with baseball would be considered accurate results. We use the semantic precision calculation method used in [38] to evaluate the search precision of our scheme as follows:
$$\begin{aligned} Precision = \frac{TP}{TP+FP} \end{aligned}$$
where TP and FP denote the numbers of true positive and false positive returned documents, respectively.
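As a small worked example, precision in the standard sense of TP / (TP + FP) can be computed over the class labels of the returned documents. The sketch below is illustrative: the function name is hypothetical, and the labels are 20 newsgroups category names used only as sample data.

```python
def semantic_precision(retrieved_classes, query_class):
    """Precision = TP / (TP + FP) over the class labels of the returned documents."""
    if not retrieved_classes:
        raise ValueError("no documents were returned")
    tp = sum(1 for c in retrieved_classes if c == query_class)  # correct class
    fp = len(retrieved_classes) - tp                            # any other class
    return tp / (tp + fp)


# Five returned documents, four of which belong to the query's class.
top_k_classes = ["rec.sport.baseball"] * 4 + ["sci.electronics"]
precision = semantic_precision(top_k_classes, "rec.sport.baseball")  # 0.8
```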
In this evaluation of semantic precision, we omitted expansion-based schemes. Despite their ability to retrieve the top-k documents for a given search query, these schemes rely on shallow semantic relationships and fall short of capturing the latent semantic nuances of the search query. Schemes [9, 36, 37], similar to our approach, build indices and query vectors using word or document embeddings to retain the semantic relationships among words, which is why they were chosen for the precision comparison. Among the word-level embedding schemes, Scheme [9] exhibits the lowest performance. This can be attributed to its approach of accumulating word-level embeddings in a bitwise manner, which creates compact word vectors for representing documents. Consequently, this method falls short of capturing complete semantic information within the word vectors. Much like Scheme [9], Scheme [37] employs word-level embeddings for constructing feature vectors. However, this scheme takes an additional step in which the k-means clustering algorithm is used to categorize each document prior to generating the feature vectors. This approach results in indices based on category-document vectors, contributing to its enhanced precision compared to Scheme [9]. Both our proposed approach and the method outlined in [36] show noticeable improvements in precision, but our scheme outperforms the approach in [36]. This advantage can be attributed to several aspects of our scheme. We address the problem of excessive matching by incorporating a truncated cosine similarity measure to assess the semantic similarity between embeddings. Additionally, our approach utilizes doc2vec, a technique more suitable than word2vec for creating embeddings at the document level. Doc2vec captures both the context in which words appear within a document and the distinct identity of the document itself, contributing to its enhanced performance.
We further illustrate the effectiveness of precision by varying the parameters n, k, and h for these schemes. As depicted in Fig. 5(a) and (b), our scheme exhibits a higher level of semantic search precision than the remaining word2vec-based schemes. The average search precision for our scheme is approximately 88.3% and 87.2% when the parameter k is assigned values of 50 and 100, respectively. In Fig. 5(c) and (d), when n is set at 4000 and 8000, our approach achieves average search precision rates of 86.7% and 86.4%, respectively. Furthermore, as k increases, both our scheme and the word2vec-based schemes experience a decline in precision. This stems from the limited number of relevant documents within each semantic class in the dataset: a larger k introduces irrelevant documents into the result list, deviating from the user’s query intention. The different h settings maintain the same order of precision as observed with n. Both Fig. 5(e) and (f) show that our scheme achieves an average semantic precision of about 86.9% and 87.6% for different n settings. The precision of the word2vec and doc2vec models, which accurately capture the underlying meaning of documents, is only slightly affected by an increase in the number of query keywords h.
Experimental analysis of \(S^3DBMS\)
Scheme [39], like our proposed framework, employs attribute-based access control, blockchain technology, and revocation mechanisms, thereby eliminating the need for a single trusted entity. Consequently, a comprehensive simulation is executed to evaluate the operational efficacy of these schemes in relation to their encryption and decryption algorithms. Both schemes are executed using the Java Pairing-Based Cryptography library, employing an 80-bit elliptic curve group generated from the equation \(y^2 = x^3 + x\), operating over a 256-bit finite field. Figure 6(a) displays the time taken by the encryption function in both schemes, which shows how a changing number of attributes affects the schemes. In our system, we see a sudden increase in time at the start of creating the secure index, as shown in Fig. 6(a). This increase results from the splitting process and matrix multiplications involved in secure kNN inner product operations. Beyond that, the only factor that appears to influence the duration of the encryption process is the number of attributes present in the access control structure. Regarding decryption, as illustrated in Fig. 6(b), the time taken by the DU is less than in the scheme of [39] because most of the computationally intensive operations at the DU end, i.e., bilinear pairings, are outsourced to the computationally rich cloud server with the help of the delegated key component (DK) generated by the attribute authorities. As a result, the DU performs only a constant number of operations to decrypt the ciphertext.
Conclusion and future work
This paper introduces a semantic search scheme based on deep learning and multi-attribute authority within a multi-user setting in cloud storage infrastructure. We present an innovative perspective regarding attribute-based encryption and transfer learning within the context of the SE framework. We use attribute-based encryption to securely transfer trained parameters from the DO to the DU, enabling the creation of a query feature vector within the model’s training feature space to obtain highly accurate ranked results. Concurrently, blockchain’s smart contracts enable multiple attribute authorities to generate user private keys and system-wide parameters through consensus in the context of mutual distrust within an M/M setting. Moreover, the scheme’s flexibility is improved by incorporating non-monotonic access structures and direct revocation. Users’ activities are transparently and reliably recorded on the blockchain through smart contracts.
Although we use a pre-trained neural network, fine-tuning this model with the data owner proves to be resource-intensive, especially for devices with limited resources at the user’s end. In our future work, we aim to explore privacy-preserving outsourced machine learning techniques. Our focus will be on outsourcing the extraction of feature vectors for the neural network model to a powerful cloud server. This exploration aims to tackle the challenges posed by resource constraints at the user end and improve the privacy aspects of machine learning processes.
Availability of data and materials
In this paper, we utilized the real-world 20 newsgroups dataset, which is publicly available and comprises 20 different categories.
References
Sun X, Zhu Y, Xia Z, Chen L (2014) Privacy-preserving keyword-based semantic search over encrypted cloud data. Int J Secur Appl 8(3):9–20
Xia Z, Zhu Y, Sun X, Chen L (2014) Secure semantic expansion based search over encrypted cloud data supporting similarity ranking. J Cloud Comput 3:1–11
Fu Z, Sun X, Linge N, Zhou L (2014) Achieving effective cloud search services: multi-keyword ranked search over encrypted cloud data supporting synonym query. IEEE Trans Consum Electron 60(1):164–172
Moh TS, Ho KH (2014) Efficient semantic search over encrypted data in cloud computing. In: 2014 International Conference on High Performance Computing & Simulation (HPCS), IEEE, pp 382–390
Fu Z, Wu X, Wang Q, Ren K (2017) Enabling central keyword-based semantic extension search over encrypted outsourced data. IEEE Trans Inf Forensic Secur 12(12):2986–2997
Fu Z, Xia L, Sun X, Liu AX, Xie G (2018) Semantic-aware searching over encrypted data for cloud computing. IEEE Trans Inf Forensic Secur 13(9):2359–2371
Wong WK, Cheung DWl, Kao B, Mamoulis N (2009) Secure knn computation on encrypted databases. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp 139–152
Yang W, Zhu Y (2020) A verifiable semantic searching scheme by optimal matching over encrypted data in public cloud. IEEE Trans Inf Forensic Secur 16:100–115
Liu Y, Fu Z (2019) Secure search service based on word2vec in the public cloud. Int J Comput Sci Eng 18(3):305–313
Chai Q, Gong G (2012) Verifiable symmetric searchable encryption for semi-honest-but-curious cloud servers. In: 2012 IEEE international conference on communications (ICC), IEEE, pp 917–922
Stefanov E, Papamanthou C, Shi E (2013) Practical dynamic searchable encryption with small leakage. Cryptol ePrint Arch
Zhu J, Li Q, Wang C, Yuan X, Wang Q, Ren K (2018) Enabling generic, verifiable, and secure data search in cloud services. IEEE Trans Parallel Distrib Syst 29(8):1721–1735
Li J, Wu J, Jiang G, Srikanthan T (2020) Blockchain-based public auditing for big data in cloud storage. Inf Process Manag 57(6):102382
Jing N, Liu Q, Sugumaran V (2021) A blockchain-based code copyright management system. Inf Process Manag 58(3):102518
Zhao Q, Chen S, Liu Z, Baker T, Zhang Y (2020) Blockchain-based privacy-preserving remote data integrity checking scheme for iot information systems. Inf Process Manag 57(6):102355
Campanile L, Iacono M, Marulli F, Mastroianni M (2021) Designing a gdpr compliant blockchain-based iov distributed information tracking system. Inf Process Manag 58(3):102511
Chen Q, Srivastava G, Parizi RM, Aloqaily M, Al Ridhawi I (2020) An incentive-aware blockchain-based solution for internet of fake media things. Inf Process Manag 57(6):102370
Hong H, Sun Z (2021) A secure peer to peer multiparty transaction scheme based on blockchain. Peer Peer Netw Appl 14:1106–1117
Hong H, Hu B, Sun Z (2021) An efficient and secure attribute-based online/offline signature scheme for mobile crowdsensing. Hum-Cent Comput Inf Sci 11:26
Hu S, Cai C, Wang Q, Wang C, Wang Z, Ye D (2019) Augmenting encrypted search: a decentralized service realization with enforced execution. IEEE Trans Dependable Secure Comput 18(6):2569–2581
Jiang S, Cao J, McCann JA, Yang Y, Liu Y, Wang X, Deng Y (2019) Privacy-preserving and efficient multi-keyword search over encrypted data on blockchain. In: 2019 IEEE international conference on Blockchain (Blockchain), IEEE, pp 405–410
Li H, Gu C, Chen Y, Li W (2019) An efficient, secure and reliable search scheme for dynamic updates with blockchain. In: Proceedings of the 2019 the 9th International Conference on Communication and Network Security, pp 51–57
Tahir S, Rajarajan M (2018) Privacy-preserving searchable encryption framework for permissioned blockchain networks. In: 2018 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), IEEE, pp 1628–1633
Cai C, Weng J, Yuan X, Wang C (2018) Enabling reliable keyword search in encrypted decentralized storage with fairness. IEEE Trans Dependable Secure Comput 18(1):131–144
Ostrovsky R, Sahai A, Waters B (2007) Attribute-based encryption with non-monotonic access structures. In: Proceedings of the 14th ACM conference on Computer and communications security, pp 195–203
Fang H, Tao T, Zhai C (2004) A formal study of information retrieval heuristics. In: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pp 49–56
Fang H, Zhai C (2005) An exploration of axiomatic approaches to information retrieval. In: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pp 480–487
Guo J, Fan Y, Ai Q, Croft WB (2016) Semantic matching by nonlinear word transportation for information retrieval. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp 701–710
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
Sahai A, Waters B (2005) Fuzzy identity-based encryption. In: Advances in Cryptology–EUROCRYPT 2005: 24th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Aarhus, Denmark, May 22–26, 2005. Proceedings 24, Springer, pp 457–473
Chen C, Zhu X, Shen P, Hu J, Guo S, Tari Z, Zomaya AY (2015) An efficient privacy-preserving ranked keyword search method. IEEE Trans Parallel Distrib Syst 27(4):951–963
Anton H, Rorres C (2013) Elementary linear algebra: applications version. John Wiley & Sons
Yang W, Sun B, Zhu Y, Wu D (2021) A secure heuristic semantic searching scheme with blockchain-based verification. Inf Process Manag 58(4):102548
Li J, Li D, Zhang X (2023) A secure blockchain-assisted access control scheme for smart healthcare system in fog computing. IEEE Internet Things J
Pennington J, Socher R, Manning CD (2014) Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
Chen L, Xue Y, Mu Y, Zeng L, Rezaeibagha F, Deng RH (2022) Case-sse: Context-aware semantically extensible searchable symmetric encryption for encrypted cloud data. IEEE Trans Serv Comput 16(2):1011–1022
Hu Z, Dai H, Liu Y, Yang G, Zhou Q, Chen Y (2022) Csmrs: An efficient and effective semantic-aware ranked search scheme over encrypted cloud data. In: 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD), IEEE, pp 699–704
Gabryel M, Damaševičius R, Przybyszewski K (2018) Application of the bag-of-words algorithm in classification the quality of sales leads. In: Artificial Intelligence and Soft Computing: 17th International Conference, ICAISC 2018, Zakopane, Poland, June 3–7, 2018, Proceedings, Part I 17, Springer, pp 615–622
Wang X, Zhou Z, Luo X, Xu Y, Bai Y, Luo F (2021) A blockchain-based fine-grained access data control scheme with attribute change function. In: 2021 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/IOP/SCI), IEEE, pp 348–356
Funding
This research work was funded by Institutional Fund Projects under grant no. (IFPIP: 2086111443). The authors gratefully acknowledge technical and financial support provided by the Ministry of Education and King Abdulaziz University, DSR, Jeddah, Saudi Arabia.
Author information
Contributions
Shahzad Khan formulated the research concept, participated in its design, and made substantial contributions to the manuscript’s writing. Haider Abbas shared valuable knowledge about the theory, carefully improved the research methods, and oversaw the paper while providing important feedback at every stage. Muhammad Binsawad oversaw the implementation, conducted the simulations, and interpreted the results.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Khan, S., Abbas, H. & Binsawad, M. Secure semantic search using deep learning in a blockchain-assisted multi-user setting. J Cloud Comp 13, 29 (2024). https://doi.org/10.1186/s13677023005785
DOI: https://doi.org/10.1186/s13677023005785