Efficient phrase search with reliable verification over encrypted cloud-IoT data

Wanshan Xu; Ze Zhu; Muhammad Irfan Khalid

doi:10.7717/peerj-cs.2235

Wanshan Xu¹, Ze Zhu¹, Muhammad Irfan Khalid²

¹ School of Computer and Cyberspace Security, Communication University of China, Beijing, China

² Faculty of Computing and Information Technology, Department of Information Technology, University of Sialkot, Sialkot, Pakistan

Published: 2024-11-22
Accepted: 2024-07-13
Received: 2024-03-15

Academic Editor: Yue Zhang

Subject Areas: Algorithms and Analysis of Algorithms, Security and Privacy, Internet of Things, Blockchain
Keywords: Phrase search, Blockchain, Verification, Efficient

Copyright: © 2024 Xu et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

Cite this article: Xu W, Zhu Z, Khalid MI. 2024. Efficient phrase search with reliable verification over encrypted cloud-IoT data. PeerJ Computer Science 10:e2235 https://doi.org/10.7717/peerj-cs.2235

The authors have chosen to make the review history of this article public.

Abstract

Phrase search encryption enables users to retrieve encrypted data containing a sequence of consecutive keywords without decrypting, which plays an important role in cloud Internet of Things (IoT) systems. However, due to the sequential relationship between keywords in the phrase, phrase search and verification are more difficult than multi-keyword search. Furthermore, verification evidence is generated by the server in existing schemes, and cloud servers are generally considered untrustworthy, so the verification is unreliable. To address this, we propose an efficient phrase search scheme that supports reliable verification of search results, where blockchain is introduced to generate verification evidence and perform verification of the results. The immutable nature of blockchain guarantees the credibility of evidence and verification. During the verification, we use a multiset hash function to generate aggregated evidence, reducing storage and blockchain transaction costs. In addition, we design a novel composite index and discrimination algorithm based on homomorphic encryption, with which we can quickly identify phrases and improve search efficiency. Finally, we conducted security analysis and detailed experiments on our scheme, which proved that the scheme is secure and efficient.

Introduction

The Internet of Things (IoT) has developed rapidly and is now widely used in agriculture, industry, medicine, and other fields, where it helps improve crop production and manufacturing efficiency and protect patients’ health. Every day, hundreds of millions of IoT devices around the world generate massive amounts of data, which are stored either locally or in the cloud. Compared with local storage, cloud storage not only reduces local storage and management costs and enables efficient data processing and analysis, but also facilitates data sharing between different users, so it has been widely researched and applied.

Although cloud storage brings many conveniences to users, it also poses security and privacy risks. Cloud servers are generally considered untrustworthy: unauthorized insiders may attempt to access sensitive information (e.g., a patient’s disease name or blood pressure), and hackers may also access data illegally, leading to data destruction and privacy leaks. For this reason, IoT devices generally encrypt data first and then outsource the ciphertext to the cloud to protect the integrity and privacy of the data.

When users need to access data outsourced to the cloud, they must perform retrieval over ciphertext. To achieve keyword retrieval on encrypted data while balancing search efficiency and security, Song, Wagner & Perrig (2000) proposed the concept of searchable encryption (SE). According to the number of keywords queried, SE is divided into two categories: single-keyword search and multi-keyword search. Phrase search is an important searchable encryption technique that searches for a sequence of consecutive keywords in sentences or documents (Tang et al., 2012; Anand et al., 2014). Designing an efficient phrase search solution is very challenging: existing single-keyword (Curtmola et al., 2006; Stefanov, Papamanthou & Shi, 2014) and multi-keyword encrypted search schemes (Cash et al., 2013; Poon & Miri, 2015) cannot be directly applied to phrase search because they cannot determine the location of keywords. For example, in an electronic medical system, certain diseases are expressed by phrases, such as “myocardial infarction”. When this phrase is searched with a multi-keyword encrypted search scheme, the cloud server may return files that contain both “myocardial” and “infarction” even though the two words do not appear together as a phrase. The search results therefore contain many invalid files.

Another challenge for phrase search is the verification of search results. Since data is outsourced to the cloud, external or internal attacks on the cloud server may compromise the integrity or confidentiality of the data. In addition, data may be lost or damaged during transmission. Therefore, it is necessary to verify the results of phrase search.

Although some studies (Kissel & Wang, 2013; Ge et al., 2021) address the verification of phrase search results, these verification schemes lack reliability. The reason is that, in existing solutions, the server both calculates the search results and generates the verification evidence, using methods such as RSA accumulators. Both may be forged by the cloud server (for example, the server may store only part of the files and search index for financial gain, in which case the search results and verification evidence are incomplete). In addition, data users may forge verification results to save costs, which also makes the verification results unreliable. In recent studies, some researchers have adopted blockchain technology. These schemes guarantee the reliability of verification based on the immutability of the blockchain and have obtained good experimental results. However, they mainly focus on encrypted search for a single keyword and cannot be applied to phrase search.

To address these problems, we design a blockchain-based phrase search scheme supporting reliable verification over encrypted cloud-IoT data. Our main contributions are as follows:

1) We propose an efficient phrase search scheme over encrypted cloud-IoT data. In our scheme, a composite index containing keyword position and a distance discrimination algorithm based on homomorphic encryption are proposed, which can not only reduce the complexity of phrase recognition, but also achieve efficient phrase search and result verification.

2) We propose a method that enables reliable verification of phrase search results. In our scheme, the verification evidence calculation and verification process of phrase search are both executed by the blockchain, breaking the pattern of the server generating both search results and verification evidence, so the reliability of phrase search is ensured. Furthermore, we use a multiset hash function to calculate cumulative evidence, which significantly reduces the overhead of the blockchain.

3) We conducted a security analysis of the scheme and conducted detailed experiments. The results demonstrate that our construction is secure and enjoys good search efficiency.

The article is structured as follows: “Related Work” introduces the current research progress related to phrase search and verification; “Problem Formulation” describes the system model, threat model, algorithm definitions, and security definitions; “Methods” provides a detailed description of the phrase search and verification algorithms used in our scheme; and finally, “Security Analysis” and “Results” respectively analyze the security and experimental results of the proposed solution.

Related work

Searchable symmetric encryption (SSE) was first proposed by Song, Wagner & Perrig (2000) and provides users with a way to perform retrieval on encrypted data. However, this scheme uses full-text matching, and the search time is linear. To improve search efficiency, Anand et al. (2014) proposed an efficient searchable encryption scheme with an inverted index, achieving sublinear search. Following this direction, many schemes have been proposed to support dynamic updates (Kamara, Papamanthou & Roeder, 2012; Stefanov, Papamanthou & Shi, 2014; Liu et al., 2021), multi-client queries (Sun, Zuo & Liu, 2022; Du et al., 2020) and privacy protection (Liu et al., 2014; Song et al., 2021). However, these schemes mainly focus on a single keyword, and the cloud returns some irrelevant files. To further improve search efficiency and accuracy, SSE schemes supporting multi-keyword search have been proposed, such as boolean queries (Cash et al., 2013) and conjunctive queries (Lai et al., 2018). Compared with single-keyword queries, multi-keyword search improves search accuracy and reduces communication and storage overhead.

Phrase search is a special case of multi-keyword search that requires a sequential relationship between multiple keywords. Anand et al. (2014) first defined the model of phrase search and its security definition, but their scheme is impractical in real scenarios since the client and the server require two rounds of interaction to complete a phrase query. Poon & Miri (2015) proposed a low-storage phrase search scheme using Bloom filters and symmetric encryption, and later proposed a fast phrase search scheme based on n-gram filters (Poon & Miri, 2019). Li et al. (2015) implemented phrase search based on relative position and realized lightweight transactions and storage during the retrieval process. Ge et al. (2021) proposed an intelligent fuzzy phrase search scheme over encrypted network data for IoT, which identifies phrases through binary matrices and look-up tables and uses a fuzzy keyword set to handle spelling errors in phrase searches. Shen et al. (2019) proposed a phrase search scheme that protects user privacy, which uses homomorphic encryption and bilinear mapping to achieve phrase identification.

Verifiable search: Servers in SSE are not completely trusted and may return incorrect search results due to external or internal attacks, so verifiable search is necessary. The concept of verifiable searchable symmetric encryption (VSSE) was first proposed by Qi & Gong (2012), and a series of VSSE schemes have since been proposed (Liu et al., 2016; Miao et al., 2021; Chen et al., 2021; Wu et al., 2023). Unfortunately, these schemes are valid for a single keyword but do not support multiple keywords. Wan & Deng (2018) used a homomorphic MAC to design a scheme that can verify the search results of multiple keywords. Li et al. (2021) used RSA accumulators to verify multi-keyword search results and bitmaps to improve search efficiency. There are similar multi-keyword verifiable ciphertext retrieval schemes (Liu et al., 2021; Liang et al., 2020, 2021). Kissel & Wang (2013) utilized a validation tag to build a verifiable phrase search scheme over encrypted data, but it fails to verify the integrity of the files. For more complex phrase searches, Ge et al. (2021) used a MAC function and look-up tables to implement phrase search result verification. Although this construction can verify the phrase search, it adopts a two-phase query strategy, which means that the user needs to interact with the server twice per phrase search and generate a large number of trapdoors.

Verifiable search based on blockchain: In the above verifiable schemes, the server sends the search results and verification evidence to the user, and the user calculates the search results and compares them with the received evidence to complete the verification. This approach has two disadvantages. First, the results and evidence are unreliable because the server is untrusted. Second, this approach cannot solve the problem of fair verification between server and user. To address these problems, blockchain has been introduced into verifiable search. Some blockchain-based verifiable search solutions have been proposed (Hu et al., 2018; Li et al., 2019; Guo, Zhang & Jia, 2020), but they mainly focus on single-keyword search, and there are almost no reliable and fair verification solutions for multi-keyword search scenarios. The same is true for phrase searches, which are more complex than multi-keyword searches.

Problem formulation

In this section, we formally define the efficient and reliable phrase search scheme over encrypted cloud-IoT data. We present the system model, threat model and security definition. We denote a composite index as a secure index, a searched phrase as a query and an encrypted query as a trapdoor. The notations and symbols used in our system are shown in Table 1.

Table 1: Notations and symbols.

Notation	Definition
$id_{i}$	The identifier of the file $F_{i}$
M	The number of files
N	The number of keywords
$L(\cdot)$	The bit length of $\cdot$
$|\cdot|$	The number of elements in set $\cdot$
$SL_{w_{1}}^{i} |_{|A|}$	The first $|A|$ bits of $SL_{w_{1}}^{i}$
$SL_{w_{1}}^{i} |_{-(|A| - |B|)}$	The last $(|A| - |B|)$ bits of $SL_{w_{1}}^{i}$
$\tilde{w}$	The query phrase
R	The set of ciphertexts satisfying the phrase search
$proof$	Verification result; $1$: valid, $0$: invalid
$||$	Concatenation symbol; $a || b$ denotes the concatenation of messages $a$ and $b$
$r_{i,j}$	The number of positions of keyword $w_{i}$ in file $F_{j}$

DOI: 10.7717/peerj-cs.2235/table-1

System model

Four entities are included in our system: IoT device, data user, cloud server, and blockchain. The system model is shown in Fig. 1. The IoT device, as the data owner, collects data and stores them in the form of files $F = \{F_{1}, F_{2}, \dots, F_{M}\}$. The IoT device extracts all the keywords $W = \{w_{1}, w_{2}, \dots, w_{N}\}$ in $F$ and uses bitmaps to build the composite index $I$. It encrypts all the files in $F$ into ciphertexts $C = \{C_{1}, C_{2}, \dots, C_{M}\}$ with symmetric encryption and calculates the hash value $hash_{i}$ of each ciphertext in $C$ with SHA-256; these hash values are added to the checklist L. Finally, ($I$, $C$) and ($I$, L) are sent to the cloud server and the blockchain, respectively.

The data user obtains the system public parameters $Ω$ through authorization by the IoT device and uses these parameters, together with the phrase to be queried $\tilde{w}$, to generate a search trapdoor. The trapdoor is sent to the server and to the blockchain for encrypted search and result verification, respectively. The data user receives the search results R and the verification result $proof$ from the blockchain, and accepts R if $proof = 1$; otherwise, the user rejects R.

The cloud server stores the index $I$ and ciphertexts $C$ sent by the IoT device, and performs a search over encrypted data using the trapdoor sent by the data user to generate the search result R and aggregate evidence $ψ$ . At last, the cloud server sends $ψ$ to the blockchain for verification, and sends R to the user.

The blockchain verifies aggregate evidence $ψ$ returned by the server and generates $p r o o f$ . To achieve reliable verification, the blockchain performs a phrase search in parallel with the cloud server to generate the verification standard value $ψ^{'}$ . The blockchain compares $ψ^{'}$ with the aggregated evidence $ψ$ returned by server, calculates the verification evidence $p r o o f$ , and sends it to the user. In particular, during the verification, multi-set hash functions are used to verify the aggregate hash results of the ciphertext, while the ciphertext is off-chain, thereby reducing blockchain storage and computing overhead.

Threat model

In our system, IoT devices and the blockchain are fully trusted: IoT devices collect data honestly and generate the secure indexes and checklists, and the blockchain performs fair verification of search results, so the verification result is reliable and unforgeable.

In contrast, cloud servers and users are considered untrustworthy. The cloud server may store only part of the index and ciphertext to save storage resources, and it may perform searches dishonestly to save computation costs. In addition, there may be other software or hardware malfunctions in the system. Any of these can make the files and the verification evidence returned by the server incomplete or incorrect. Data users may falsify verification results for financial gain and are therefore not trustworthy either.

Algorithm definitions

Our scheme consists of six polynomial-time algorithms $\prod = \{KeyGen, IndexBuild, TokenGen, Search, Verify, Dec\}$:

1) $K \leftarrow KeyGen(1^{λ})$: this algorithm takes a security parameter $λ$ as input and outputs the key set $K = (K_{1}, K_{2}, K_{3}, K_{4}, K_{I}, K_{Z}, K_{X}, pk, sk)$.

2) $(I, T, B) \leftarrow IndexBuild(F, W, K)$: this algorithm takes the set of files $F$, the set of keywords $W$ and the key set K as input, and outputs the secure index $I$, the encrypted database T and the checklist B.

3) $TK_{i,Q} \leftarrow TokenGen(\tilde{w}, K_{3}, pk)$: this algorithm takes a query phrase $\tilde{w}$, a secret key $K_{3}$ and a public key $pk$ as input, and outputs the search trapdoor $TK_{i,Q}$.

4) $(ψ, R) \leftarrow Search(I, T, TK_{i,Q})$: this algorithm takes the secure index $I$, the encrypted database T and the search trapdoor $TK_{i,Q}$ as input, and outputs the aggregate evidence $ψ$ and the search results R.

5) $proof \leftarrow Verify(I, ψ, B, TK_{i,Q})$: this algorithm takes the secure index $I$, the aggregate evidence $ψ$, the checklist B and the search trapdoor $TK_{i,Q}$ as input, and outputs the verification evidence $proof$.

6) $F \leftarrow Dec(K_{2}, C)$: this algorithm takes the symmetric key $K_{2}$ and the encrypted file C as input, and outputs the plaintext F.

Leakage function

The goal of searchable encryption is to leak as little information as possible about the keywords and files during ciphertext retrieval. Similar to Wu et al. (2023), the leakage function is defined as $L = \{L_{IndexBuild}, L_{Search}, L_{Verify}\}$. Following the common definition, the query history $Hist = \{(DB_{i}, q_{i})\}_{i=0}^{n}$ stores a series of query requests and the corresponding database snapshots. The search pattern $sp(w) = \{i \mid \text{for each } q_{i} \text{ that contains } w \text{ in } Hist\}$ records each query request that contains $w$. The proof history is $ph(w) = \{(i, proof_{i}) \mid \text{for each } (i, ind_{i}, w) \text{ in } Hist\}$. We can then define the leakage functions $L_{IndexBuild} = (ph(w))$, $L_{Search} = (sp(w), ph(w))$ and $L_{Verify} = (ph(w))$.

Security definitions

Definition 1 (Verifiability). An efficient and verifiable phrase search scheme satisfies verifiability if the probability that a forged result generated by any probabilistic polynomial time (PPT) adversary passes the Verify algorithm is negligible.

Definition 2 (CKA2-security). For the verifiable phrase search scheme $\prod = \{KeyGen, IndexBuild, TokenGen, Search, Verify, Dec\}$, consider a leakage function $L = \{L_{IndexBuild}, L_{Search}, L_{Verify}\}$, an adversary $A$, an idealized simulator $S$, and two games $Real_{A}(λ)$ and $Ideal_{A,S}(λ)$ defined as follows:

$Real_{A}(λ)$: The challenger generates the system key $K = \{K_{1}, K_{2}, K_{3}\}$ and the index ($I$, T, B) by executing $KeyGen(1^{λ})$ and $IndexBuild(F, W, K)$; ($I$, T, B) is then transmitted to the adversary $A$. $A$ formulates a sequence of adaptive queries $Q = \{q_{1}, q_{2}, \dots, q_{t}\}$, for each of which the challenger generates a search token, and receives the results of executing $Search$ and $Verify$. Finally, $A$ produces a bit $b$ as the output of this experiment.

$Ideal_{A,S}(λ)$: The simulator $S$ takes (F, W) generated by the adversary $A$ as input and outputs the index ($I$, T, B) by executing $L_{IndexBuild}$. Then, for a series of adaptive queries $Q = \{q_{1}, q_{2}, \dots, q_{t}\}$ generated by the adversary $A$, $S$ generates search results by executing $L_{Search}$ and $L_{Verify}$. $A$ receives those results and produces a bit $b$ as the output of this experiment.

If there is a simulator $S$ such that for any PPT adversary $A$:

$|\Pr[Real_{A}(λ) = 1] - \Pr[Ideal_{A,S}(λ) = 1]| \leq negl(λ),$ then $\prod$ is $L$-secure against adaptive chosen-keyword attack (CKA2), where $negl$ is a negligible function and $λ$ is the security parameter.

Preliminaries

Bitmaps employ binary strings to represent information sets and are commonly used to store file identifiers in encrypted search, which efficiently reduces storage requirements. In our model, each keyword $w_{i}$ corresponds to a bitmap, which is a string of 0s and 1s in which each bit denotes a file. If the $j$-th document contains $w_{i}$, the bit at position $j$ is set to 1; otherwise it is set to 0. For instance, with four files ($F_{1}$, $F_{2}$, $F_{3}$, $F_{4}$) and two keywords ($w_{1}$, $w_{2}$) in the system, as depicted in Fig. 2, $w_{1}$ is found in $F_{1}$ and $F_{3}$, while $w_{2}$ exists in $F_{2}$ and $F_{3}$. The bitmaps for $w_{1}$ and $w_{2}$ are 1010 and 0110, respectively. To search for files containing both $w_{1}$ and $w_{2}$, an “AND” operation on these two bitmaps is performed, yielding $1010 \land 0110 = 0010$, which indicates that $F_{3}$ contains both $w_{1}$ and $w_{2}$.
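The four-file example can be reproduced with ordinary integer bit operations. The following is a minimal sketch, assuming the leftmost bit stands for $F_{1}$; the helper names `make_bitmap` and `files_in` are illustrative and do not come from the article.

```python
# Bitmaps for the four-file example; the leftmost bit stands for F1.
NUM_FILES = 4

def make_bitmap(file_numbers, num_files=NUM_FILES):
    """Return an int with a 1 bit for each listed file number (1-based, left to right)."""
    bitmap = 0
    for k in file_numbers:
        bitmap |= 1 << (num_files - k)
    return bitmap

def files_in(bitmap, num_files=NUM_FILES):
    """List the 1-based file numbers whose bit is set in the bitmap."""
    return [k for k in range(1, num_files + 1) if bitmap & (1 << (num_files - k))]

b_w1 = make_bitmap([1, 3])   # w1 occurs in F1 and F3
b_w2 = make_bitmap([2, 3])   # w2 occurs in F2 and F3
both = b_w1 & b_w2           # bitwise AND keeps files containing both keywords

print(format(b_w1, "04b"), format(b_w2, "04b"), format(both, "04b"))  # 1010 0110 0010
print(files_in(both))                                                  # [3]
```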

Homomorphic encryption allows certain computations to be carried out directly on ciphertexts, so that the result decrypts to the outcome of the corresponding operation on the plaintexts, without access to the decryption key. In this study, we employ the widely used Paillier additive homomorphic encryption to compute the distance between keywords within phrases. Its functionality can be outlined as follows:

1) Key Generation: Let $p$ and $q$ denote two large primes such that $\gcd(pq, (p-1)(q-1)) = 1$. Define $n = pq$ and $λ = \operatorname{lcm}(p-1, q-1)$. Choose a random integer $g$ from $Z_{n^{2}}^{*}$ satisfying $\gcd(L(g^{λ} \bmod n^{2}), n) = 1$, where $L(x) = (x-1)/n$ and $Z_{n^{2}}^{*}$ is the set of integers in $\{1, 2, \dots, n^{2}-1\}$ that are coprime to $n$. The public key is $(n, g)$ and the private key is $λ$.

2) Encryption: Given a message $m$, it can be encrypted into its ciphertext $c$ as follows:

(1) $c = E(m, r) = g^{m} r^{n} \bmod n^{2}$

where $r$ is randomly selected from $Z_{n}^{*}$.

3) Decryption: The ciphertext $c$ can be decrypted into its plaintext $m$ as follows:

(2) $m = D(c, λ) = \frac{L(c^{λ} \bmod n^{2})}{L(g^{λ} \bmod n^{2})} \bmod n$

This algorithm is additively homomorphic. Given two messages $a$ and $b$ with corresponding ciphertexts $E(a)$ and $E(b)$, the product $E(a) \cdot E(b) \bmod n^{2}$ is a valid ciphertext of $(a + b)$, i.e., $E(a + b) = E(a) \cdot E(b)$ up to the encryption randomness. This property can be leveraged to compute the distance between keywords in a phrase, which helps determine their positional relationship.
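The key generation, encryption and decryption steps above can be exercised with a toy implementation. This is a sketch only: the primes 1009 and 1013 are illustrative and far too small for real use, the choice $g = n + 1$ is an assumption made for simplicity, and the function names are not taken from the article.

```python
import math
import secrets

def L(x, n):
    """L(x) = (x - 1) / n, as in the key generation step."""
    return (x - 1) // n

def keygen(p, q):
    """Toy key generation from two given primes (no primality checks)."""
    n = p * q
    assert math.gcd(n, (p - 1) * (q - 1)) == 1
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    g = n + 1                                            # assumed choice of g
    assert math.gcd(L(pow(g, lam, n * n), n), n) == 1
    return (n, g), lam

def encrypt(pk, m):
    """c = g^m * r^n mod n^2 with fresh randomness r coprime to n."""
    n, g = pk
    r = secrets.randbelow(n - 1) + 1          # r in [1, n-1]
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pk, lam, c):
    """m = L(c^lam mod n^2) / L(g^lam mod n^2) mod n."""
    n, g = pk
    numerator = L(pow(c, lam, n * n), n)
    denominator = L(pow(g, lam, n * n), n)
    return (numerator * pow(denominator, -1, n)) % n

pk, sk = keygen(1009, 1013)      # illustrative primes only
n = pk[0]
c_a, c_b = encrypt(pk, 1), encrypt(pk, 1)
c_sum = (c_a * c_b) % (n * n)    # multiplying ciphertexts adds plaintexts
print(decrypt(pk, sk, c_sum))    # 2
```

Because `encrypt` draws fresh randomness each time, two encryptions of the same value are in general different ciphertexts; the homomorphic identity is checked here after decryption.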

Multiset Hash Function (Li et al., 2023): A multiset hash is a cryptographic tool that maps multisets of arbitrary finite size to hash values of fixed length. A multiset hash is also updatable: when the elements of the multiset change, the hash value can be updated from its current value without recomputing it over all elements.

Our scheme uses the most efficient multiset hash function, MSet-XOR-Hash, which consists of three polynomial-time algorithms ($H$, $+_{H}$, $\equiv_{H}$). Given a multiset M, MSet-XOR-Hash can be expressed as follows:

$\begin{aligned} H(r, M) &= H_{k}(0, r) \oplus \bigoplus_{m \in M} H_{k}(1, m); \\ H(r, M \cup \{x\}) &\equiv_{H} H(r, M) +_{H} H(r, \{x\}) \equiv_{H} H(r, M) \oplus H_{k}(1, x); \\ H(r, M \setminus \{x\}) &\equiv_{H} H(r, M) -_{H} H(r, \{x\}) \equiv_{H} H(r, M) \oplus H_{k}(1, x) \end{aligned}$
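The update property can be checked with a small sketch that follows the three formulas above as written. The keyed hash $H_{k}$ is instantiated with HMAC-SHA256 here, which is an assumption of this example; the function names and the demo values are likewise illustrative.

```python
import hashlib
import hmac

def H_k(key, prefix, value):
    """Keyed hash H_k(prefix, value); HMAC-SHA256 is an assumed instantiation."""
    digest = hmac.new(key, bytes([prefix]) + value, hashlib.sha256).digest()
    return int.from_bytes(digest, "big")

def mset_hash(key, r, multiset):
    """H(r, M) = H_k(0, r) XOR the XOR of H_k(1, m) over all m in M."""
    h = H_k(key, 0, r)
    for m in multiset:
        h ^= H_k(key, 1, m)
    return h

def add_element(key, h, x):
    """Update H(r, M) to H(r, M with x added) without rehashing M."""
    return h ^ H_k(key, 1, x)

def remove_element(key, h, x):
    """Update H(r, M) to H(r, M with x removed); XOR is its own inverse."""
    return h ^ H_k(key, 1, x)

key, r = b"demo-key", b"demo-seed"
c1, c2, c3 = b"ciphertext-1", b"ciphertext-2", b"ciphertext-3"

full = mset_hash(key, r, [c1, c2, c3])
step = add_element(key, mset_hash(key, r, [c1, c2]), c3)

print(full == step)                                                   # incremental update agrees
print(full == mset_hash(key, r, [c3, c1, c2]))                        # order does not matter
print(remove_element(key, full, c3) == mset_hash(key, r, [c1, c2]))   # removal agrees
```

Since XOR is self-inverse, adding the same element twice cancels out in this sketch, which follows directly from the formulas as stated.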

Methods

We present the construction of the efficient and reliable phrase search scheme over encrypted cloud-IoT data in this section.

Composite index containing files and locations

In phrase search, a phrase is composed of multiple keywords according to a certain positional relationship, which is also the difference between phrase search and multi-keyword search. To perform a phrase search, the cloud server not only needs to search for all keywords contained in the phrase, but also needs to determine whether the order between keywords is correct.

To identify the positional relationship between keywords in phrases, we designed a composite index containing files and locations. The structure of the composite index is shown in Fig. 3.

Figure 3: Example of the composite index structure.

DOI: 10.7717/peerj-cs.2235/fig-3

The composite index adopts an inverted index structure to ensure high search efficiency. It differs from a general inverted index in that each keyword not only corresponds to the IDs of a series of files but also carries all the locations where the keyword appears in each file. For example, in Fig. 3, suppose there are three keywords (“heart”, “attack”, “medic”) and five corresponding files ($F_{1}$, $F_{2}$, $F_{3}$, $F_{4}$, $F_{5}$); for simplicity, encryption is not shown. The positions of keyword “heart” in files $F_{1}$, $F_{2}$, $F_{3}$ are (1, 8, 3), (1, 2, 4) and (2, 3, 5), respectively, and the positions of keyword “attack” in files $F_{1}$, $F_{2}$, $F_{4}$ are (2, 5, 7), (3, 7, 9) and (1, 2, 5). When the cloud server searches for the phrase “heart attack”, it finds through the composite index that the locations of “heart” in $F_{1}$ are (1, 8, 3) and that the locations of “attack” in $F_{1}$ are (2, 5, 7). Using the encrypted distance discrimination algorithm, the cloud server determines that a position of “attack” in $F_{1}$ is 1 larger than a position of “heart” by checking $E(2) = E(1) \cdot E(1)$. Similarly, it determines that “attack” directly follows “heart” in $F_{2}$ through $E(3) = E(2) \cdot E(1)$. After searching the locations of all keywords in the composite index, the server concludes that $F_{1}$ and $F_{2}$ contain the phrase “heart attack”.

Encrypted distance discrimination algorithm-EDDA

The sequence of keywords in a phrase can be expressed by a sentinel and the distance between each remaining keyword and the sentinel. For example, in a phrase containing three keywords ($w_{1}$, $w_{2}$, $w_{3}$) at positions 1, 2 and 3, we choose $w_{1}$ as the sentinel. The distance between $w_{2}$ and $w_{1}$ is 1, and the distance between $w_{3}$ and $w_{1}$ is 2. Suppose that the positions of ($w_{1}$, $w_{2}$, $w_{3}$) are ($pos_{1}$, $pos_{2}$, $pos_{3}$). If we can establish that $pos_{2} = pos_{1} + 1$ and $pos_{3} = pos_{1} + 2$, we can recognize that ($w_{1}$, $w_{2}$, $w_{3}$) is a phrase.

In our scheme, the positions of keywords stored in the composite index are encrypted, and the server should be able to recognize phrases without the decryption key. Therefore, we use Paillier homomorphic encryption to construct the distance discrimination algorithm, which determines the positional relationship between keywords. The details are as follows:

$E(d)$ is the distance encrypted with Paillier homomorphic encryption, and $SL_{w_{j}}^{i}$ denotes the encrypted location of the keyword $w_{j}$ in file $id_{i}$, defined as follows:

(3) $SL_{w_{j}}^{i} = π(id_{i}) + E(pos_{w_{j}}^{i})$

Here, E denotes the Paillier homomorphic encryption function, $π$ denotes a hash function, and $pos_{w_{j}}^{i}$ is the original location of the keyword $w_{j}$ in file $id_{i}$. In addition, keyword $w_{j}$ may appear at multiple locations in the same file; in this case, $E(pos_{w_{j}}^{i})$ represents a series of encrypted positions.

In the distance discrimination algorithm, $val_{i} = SL_{w_{1}}^{i} |_{|π|} \oplus SL_{w_{2}}^{i} |_{|π|}$ is used to determine whether $w_{1}$ and $w_{2}$ belong to the same file; if so, $val_{i} == 0$. $E(pos_{w_{j}}^{i}) \leftarrow SL_{w_{j}}^{i} |_{-(|SL_{w_{j}}^{i}| - |π|)}$ extracts the encrypted location of keyword $w_{j}$, and $E(pos_{w_{2}}^{i}) == E(pos_{w_{1}}^{i}) + E(d)$ is used to determine whether the keyword $w_{2}$ is located $d$ positions after $w_{1}$. When the user issues a phrase search request, they designate the first word (i.e., $w_{1}$) of the phrase as the sentinel, calculate the distance $d$ between each remaining word and the sentinel one by one, encrypt $d$, and finally generate a search token and send it to the server for search.

Details of our construction

Like most searchable encryption schemes, we adopt an inverted index structure to construct the secure index. In the inverted index, we use a bitmap to store the identifiers of the files. Let $H : \{0, 1\}^{*} \to \{0, 1\}^{m}$ and $F : \{0, 1\}^{*} \to \{0, 1\}^{n}$ be secure pseudo-random functions (PRFs).

Algorithm 1: Distance discrimination algorithm.

Input: $SL_{w_{1}}^{i}$, $SL_{w_{2}}^{i}$, $E(d)$
Output: Flag
1: $Flag \leftarrow False$
2: $val_{i} = SL_{w_{1}}^{i} |_{|π|} \oplus SL_{w_{2}}^{i} |_{|π|}$
3: if $val_{i} == 0$ then
4:     $E(pos_{w_{1}}^{i}) \leftarrow SL_{w_{1}}^{i} |_{-(|SL_{w_{1}}^{i}| - |π|)}$
5:     $E(pos_{w_{2}}^{i}) \leftarrow SL_{w_{2}}^{i} |_{-(|SL_{w_{2}}^{i}| - |π|)}$
6:     if $E(pos_{w_{2}}^{i}) == E(pos_{w_{1}}^{i}) + E(d)$ then
7:         $Flag \leftarrow True$
8:     end if
9: end if
10: return Flag

DOI: 10.7717/peerj-cs.2235/table-3
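The control flow of Algorithm 1 can be sketched as follows, under several loudly stated assumptions that are not taken from the article. First, $SL$ is modelled as a pair (file tag, encrypted position) rather than a bit string, so taking the first $|π|$ bits becomes taking the first component. Second, $π$ is instantiated with SHA-256. Third, the homomorphic addition in line 6 is realised as ciphertext multiplication modulo $n^{2}$. Fourth, Paillier encryption is normally randomized, so equal plaintexts do not usually give equal ciphertexts; to make the equality test in line 6 well defined, this sketch fixes the randomness to $r = 1$ and uses $g = n + 1$, which is not secure and serves only to illustrate the logic.

```python
import hashlib

# Toy Paillier parameters (illustrative primes, far too small for real use).
P, Q = 1009, 1013
N = P * Q
N2 = N * N
G = N + 1                      # assumed choice of g

def enc(m):
    """Paillier encryption with randomness fixed to r = 1 (sketch assumption, insecure)."""
    return pow(G, m, N2)       # g^m * 1^n mod n^2

def pi(file_id):
    """Hash of the file identifier; SHA-256 is an assumed instantiation of pi."""
    return hashlib.sha256(file_id.encode()).digest()

def make_sl(file_id, pos):
    """SL = (pi(id), E(pos)), modelled as a pair instead of a concatenated bit string."""
    return (pi(file_id), enc(pos))

def edda(sl_w1, sl_w2, enc_d):
    """Return True if w2 lies d positions after w1 in the same file."""
    flag = False
    val = int.from_bytes(sl_w1[0], "big") ^ int.from_bytes(sl_w2[0], "big")
    if val == 0:                                   # same file tag
        e_pos1 = sl_w1[1]
        e_pos2 = sl_w2[1]
        if e_pos2 == (e_pos1 * enc_d) % N2:        # E(pos1) "+" E(d) under Paillier
            flag = True
    return flag

heart_f1 = make_sl("F1", 1)
print(edda(heart_f1, make_sl("F1", 2), enc(1)))    # attack right after heart in F1
print(edda(heart_f1, make_sl("F1", 5), enc(1)))    # wrong distance
print(edda(heart_f1, make_sl("F2", 2), enc(1)))    # different file
```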

$KeyGen(1^{λ})$. The IoT device uses the security parameter $λ$ to generate the key set $K = \{K_{1}, K_{2}, K_{3}, K_{4}, K_{I}, K_{Z}, K_{X}, pk, sk\}$, where $K_{1}, K_{2}, K_{3}, K_{4}, K_{I}, K_{Z}, K_{X} \overset{\$}{\leftarrow} \{0, 1\}^{λ}$ and $(pk, sk) \leftarrow Paillier.KeyGen(1^{λ})$. $K_{1}$ is used to encrypt the identifiers of files, $K_{2}$ is the secret key of the symmetric encryption, $K_{3}$ is the key for PRF F, and $(pk, sk)$ are the public key and the private key of Paillier encryption.

$IndexBuild(F, W, K)$. Given a set of files $F$, a set of keywords $W$, and the key set K, the IoT device generates the secure index $I$, the encrypted database T and the checklist B. The details are shown in Algorithm 2.

Algorithm 2: IndexBuild.

Input: $DB, W, K$
Output: $I, T, B$
1: for $F_{i} \in F$
2:     $ℓ_{i} \leftarrow H(id_{i} || K_{1})$; $C_{i} \leftarrow Enc(K_{2}, F_{i})$
3:     $hash_{i} \leftarrow H(C_{i})$; $B[ℓ_{i}] \leftarrow hash_{i}$; $T[ℓ_{i}] \leftarrow C_{i}$
4: end for
5: for $w_{j} \in W$
6:     for $id \in DB$
7:         $u_{w_{j}} \leftarrow F_{1}(K_{3}, w_{j})$; $st_{j} \leftarrow F_{2}(K_{4}, id)$; $t_{w_{j}} \leftarrow H_{2}(u_{w_{j}} || st_{j})$
8:         Extract positions $(pos_{1}, pos_{2}, \dots, pos_{m})$ of keyword $w_{j}$ in file $F_{i}$
9:         $SL_{w_{j}}^{i} = π(id_{i}) + E(pos_{1}) + E(pos_{2}) + \dots + E(pos_{m})$
10:         $v_{B} \leftarrow (B_{w_{j}} || SL_{w_{j}}^{i}) \oplus H_{2}(u_{w_{j}} || st_{j})$; $I[t_{w_{j}}] \leftarrow v_{B}$; $\sum[w_{j}] = st_{j}$
11:     end for
12: end for
13: send $(I, B)$ to blockchain, send $(I, T)$ to cloud server

DOI: 10.7717/peerj-cs.2235/table-4

For each file $F_{i} \in F$ , IoT device encrypts it to the ciphertext $C_{i}$ with symmetric encryption $E n c (K_{2}, F_{i})$ . The ciphertext $C_{i}$ is stored in encrypted database T, and the hash value $h a s h_{i}$ of $C_{i}$ is stored in checklist B for verification.
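The construction of the encrypted database and the checklist can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes $H$ is instantiated with SHA-256, treats the ciphertexts as opaque byte strings produced elsewhere (for example by AES), and uses the hypothetical names `file_label` and `build_store`.

```python
import hashlib


def file_label(file_id: str, k1: bytes) -> bytes:
    """Label l_i = H(id_i || K_1), with H instantiated as SHA-256 (ASSUMPTION)."""
    return hashlib.sha256(file_id.encode("utf-8") + k1).digest()


def build_store(ciphertexts: dict, k1: bytes):
    """Build the encrypted database T and the checklist B.

    `ciphertexts` maps a file identifier to its ciphertext bytes C_i; the
    symmetric encryption itself is assumed to have happened already.
    """
    T, B = {}, {}
    for file_id, c_i in ciphertexts.items():
        label = file_label(file_id, k1)
        T[label] = c_i                          # T[l_i] <- C_i
        B[label] = hashlib.sha256(c_i).digest()  # B[l_i] <- hash_i = H(C_i)
    return T, B
```

Because both tables are keyed by the label rather than by the real identifier, a party holding `T` or `B` sees only hashed identifiers.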

The IoT device generates a bitmap $B_{w_{j}}$ for each keyword $w_{j}$. $B_{w_{j}}$ is encrypted as $v_{B} \leftarrow B_{w_{j}} \oplus H_{2}(u_{w_{j}} || st_{j})$, and $v_{B}$ is stored in the secure index $I$. In particular, to protect the privacy of files, the IoT device uses $ℓ_{i} \leftarrow H_{2}(id_{i} || K_{1})$ to encrypt the $id$ of each file, and then uses $ℓ_{i}$ to generate $B_{w_{j}}$. Since the $id$ stored on the server is encrypted, the server cannot obtain the real $id$ from $B_{w_{j}}$, which ensures the privacy of the search pattern.
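A minimal sketch of the bitmap masking step is given below. It is an illustration under stated assumptions: the bitmap is held as a Python integer in which bit $i$ marks the file assigned to slot $i$ (the slot assignment is hypothetical), the PRF is HMAC-SHA-256, and $H_{2}$ is stretched to the bitmap length with SHAKE-256, a choice made here only so that the mask covers bitmaps of any size. The helper names are not from the paper.

```python
import hashlib
import hmac


def prf(key: bytes, msg: bytes) -> bytes:
    """PRF instantiated as HMAC-SHA-256."""
    return hmac.new(key, msg, hashlib.sha256).digest()


def mask(u: bytes, st: bytes, nbytes: int) -> int:
    """H_2(u || st) stretched to nbytes with SHAKE-256 (ASSUMPTION)."""
    return int.from_bytes(hashlib.shake_256(u + st).digest(nbytes), "big")


def make_bitmap(slots) -> int:
    """Set bit i for every file slot i that contains the keyword."""
    bitmap = 0
    for s in slots:
        bitmap |= 1 << s
    return bitmap


def seal(bitmap: int, u: bytes, st: bytes, nbytes: int) -> int:
    """v_B = B_w XOR H_2(u || st)."""
    if bitmap >> (8 * nbytes):
        raise ValueError("bitmap does not fit in nbytes")
    return bitmap ^ mask(u, st, nbytes)


def unseal(v_b: int, u: bytes, st: bytes, nbytes: int) -> int:
    """Recover B_w from v_B with the same mask."""
    return v_b ^ mask(u, st, nbytes)
```

XOR with the same mask is its own inverse, which is why a party that can recompute the mask from the trapdoor recovers the bitmap without any key for the files themselves.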

To identify phrases, the IoT device extracts positions $(p o s_{1}, p o s_{2}, \dots, p o s_{m})$ of keyword $w_{j}$ in file $F_{i}$ , and encrypts them using $P a i l l i e r . E n c (p o s_{m})$ :

$S L_{w_{j}}^{i} = π (i d_{i}) + E (p o s_{1}) + E (p o s_{2}) + \dots + E (p o s_{m})$

$TokenGen(\tilde{w}, K_{3}, pk)$. Authorized users get the shared parameters $Ω = \{K_{3}, pk\}$ from the IoT device. For the phrase $\tilde{w} = \{w_{1}, w_{2}, \dots, w_{t}\}$ to be queried, the trapdoor $TK_{i, Q}$ is generated as shown in Algorithm 3.

Algorithm 3: TokenGen.

Input: The query phrase $\tilde{w}$, the key set $K$

Output: The search trapdoor $TK_{i, Q}$

1: Suppose that query phrase $\tilde{w} = \{w_{1}, w_{2}, \dots, w_{t}\}$

2: for $j = 1 \to t$ do

3: $st_{j} \leftarrow \Sigma[w_{j}], u_{w_{j}} \leftarrow F_{1}(K_{3}, w_{j})$

4: $t_{w_{j}} \leftarrow H_{2}(u_{w_{j}} || st_{j})$

5: if $j > 1$ then

6: $d = j - 1; E_{d} \leftarrow Paillier.Enc(d)$

7: end if

8: end for

9: send $TK_{i, Q} = \{t_{w_{1}}, t_{w_{2}}, \dots, t_{w_{t}}, E_{1}, E_{2}, \dots, E_{t - 1}\}$ to blockchain and cloud server

DOI: 10.7717/peerj-cs.2235/table-5

Assume that the keywords $\{w_{1}, w_{2}, \dots, w_{t}\}$ in the phrase are arranged in order. For each keyword $w_{j}$ $(j > 1)$, the data user calculates its distance $d = j - 1$ from the first keyword $w_{1}$ and encrypts it with $Paillier.Enc(d)$. Finally, $t_{w_{j}}$ and $E_{d}$ are added to the trapdoor $TK_{i, Q}$, which is sent to the cloud server and the blockchain.
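The trapdoor generation can be sketched as follows. This is an illustration, not the authors' code: $F_{1}$ is assumed to be HMAC-SHA-256 and $H_{2}$ SHA-256, the state table $\Sigma$ is passed in as a plain dictionary, and the Paillier encryption is injected as a callable `enc` so that the sketch does not depend on a particular Paillier library. The name `token_gen` is hypothetical.

```python
import hashlib
import hmac


def token_gen(phrase, k3: bytes, state: dict, enc):
    """Return (tokens, encrypted distances) for an ordered query phrase.

    phrase : list of keywords w_1..w_t in phrase order
    state  : mapping w_j -> st_j (the table called Sigma in Algorithm 3)
    enc    : callable standing in for Paillier.Enc(pk, .) (ASSUMPTION)
    """
    tokens, enc_dists = [], []
    for j, w in enumerate(phrase, start=1):
        u = hmac.new(k3, w.encode("utf-8"), hashlib.sha256).digest()  # u_w = F_1(K_3, w)
        st = state[w]                                                 # st_j = Sigma[w]
        tokens.append(hashlib.sha256(u + st).digest())                # t_w = H_2(u_w || st_j)
        if j > 1:
            enc_dists.append(enc(j - 1))                              # E_d with d = j - 1
    return tokens, enc_dists
```

A phrase of $t$ keywords therefore yields $t$ tokens and $t - 1$ encrypted distances, which matches the trapdoor layout in step 9 of Algorithm 3.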

$S e a r c h (I, T, T K_{i, Q})$ . As shown in Algorithm 4, the cloud server performs an encrypted search with the secure index $I$ , the encrypted database T and the trapdoor $T K_{i, Q}$ .

Algorithm 4: Search.

Input: $I, T, TK_{i, Q}$

Output: Search results R

1: Parse $\{t_{w_{1}}, t_{w_{2}}, \dots, t_{w_{t}}, E_{1}, E_{2}, \dots, E_{t - 1}\} \leftarrow TK_{i, Q}$

2: for $t_{w_{j}} \in TK_{i, Q}$ do

3: $v_{B} \leftarrow I[t_{w_{j}}]; B_{w_{j}} || SL_{w_{j}}^{i} \leftarrow v_{B} \oplus H_{2}(t_{w_{j}})$

4: end for

5: $B = B_{1} \land B_{2} \land \dots \land B_{t}, r = H(B, \{⊥\})$

6: Get $ID_{B} = \{ℓ_{1}, ℓ_{2}, \dots, ℓ_{p}\}$ from $B$

7: for $ℓ_{i} \in ID_{B}$ do

8: $flag = \{000 \dots 000\}^{t - 1}, hash_{i} = H_{1}(T[ℓ_{i}]), ψ = ψ +_{H} H(r, hash_{i})$

9: $(E(pos_{1}^{1}), E(pos_{2}^{1}), \dots, E(pos_{m}^{1})) \leftarrow SL_{w_{1}}^{1}$

10: for $d = 2 \to t$ do

11: $(E(pos_{1}^{i}), E(pos_{2}^{i}), \dots, E(pos_{n}^{i})) \leftarrow SL_{w_{j}}^{i}$

12: for $k = 1 \to m; k^{'} = 1 \to n$ do

13: if $E(pos_{1}^{k^{'}}) = E(pos_{1}^{k}) \times E(d - 1)$ then

14: Set the position $(i - 1)$ of flag to 1

15: break

16: end if

17: end for

18: end for

19: If all positions of flag are 1

20: get $C_{i} \leftarrow T[ℓ_{i}]$; $R \leftarrow R \cup C_{i}$

21: end for

22: Server sends $\{ψ\}$ to the blockchain for verification, and sends R to the data user

DOI: 10.7717/peerj-cs.2235/table-6

After receiving the query request, the server parses the trapdoor ${t_{w_{1}}, t_{w_{2}}, \dots, t_{w_{t}}, E_{1}, E_{2}, . . E_{t - 1}} \leftarrow T K_{i, Q}$ . The server gets the bitmap $B_{w_{i}}$ of $w_{i}$ from the secure index $I$ through $v_{B} \leftarrow I [t_{w_{i}}], B_{w_{i}} \leftarrow v_{B} \oplus H (t_{w_{i}})$ . To get the file that contains all the keywords in the phrase $\tilde{w}$ , the server performs the operation “AND” on the bitmap of all keywords as follows:

$B = B_{1} \land B_{2} \land \dots \land B_{t} .$

The file corresponding to the element with a value of “1” in $B$ contains all the keywords in the phrase. The server gets the set of identifiers of these files as $I D_{B} = {ℓ_{1}, ℓ_{2}, \dots, ℓ_{p}}$ according to $B$ .
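The bitmap intersection can be illustrated with a short Python sketch. It assumes, as a simplification, that each bitmap is held as an integer whose bit $i$ corresponds to file slot $i$; the function names are hypothetical.

```python
from functools import reduce


def intersect(bitmaps) -> int:
    """B = B_1 AND B_2 AND ... AND B_t over integer bitmaps (at least one required)."""
    return reduce(lambda a, b: a & b, bitmaps)


def set_positions(bitmap: int) -> list:
    """Return the slot indices whose bit is 1, i.e. the candidate files ID_B."""
    out, i = [], 0
    while bitmap:
        if bitmap & 1:
            out.append(i)
        bitmap >>= 1
        i += 1
    return out
```

For three keywords with bitmaps `0b1110`, `0b0111` and `0b0110`, only slots 1 and 2 survive the intersection, so only those two files are examined for keyword order.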

Next, the server determines whether the sequence of the keywords in the file $ℓ_{i}$ is consistent with the order of the keywords in the phrase, as described in lines 7–20 of Algorithm 4. The server chooses a binary string $flag$ of length $(t - 1)$ and sets all values to “0”. The server gets all positions $E(pos_{1}^{i}), E(pos_{2}^{i}), \dots, E(pos_{n}^{i})$ of the keyword $w_{j}$ in the file $ℓ_{i}$. For the position $E(pos_{1}^{k^{'}})$ of the keyword $w_{j} (j > 1)$ in file $ℓ_{i}$, the server utilizes the distance discrimination algorithm EDDA to determine the distance between keyword $w_{j} (j > 1)$ and the first keyword $w_{1}$ in the phrase by checking whether

$E(pos_{1}^{k^{'}}) = E(pos_{1}^{k}) \times E(d - 1) \quad (4)$

where $E$ represents $Paillier.Enc$. If Formula (4) holds, the distance between keywords $w_{1}$ and $w_{j}$ is $(d - 1)$, which is the same as that in the phrase, and the server sets the position $(i - 1)$ of $flag$ to “1”. If all positions of $flag$ are “1”, then the file $ℓ_{i}$ contains the phrase $\tilde{w}$, and it is added to the search result R. Finally, the aggregation proof $ψ$ is sent to the blockchain for reliable verification, and the ciphertext collection of search results R is sent to the user.
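The flag logic can be illustrated on plaintext positions with the Python sketch below; the real scheme performs the same comparison on Paillier ciphertexts through Formula (4). The function names are hypothetical. `flags_for_file` follows the description literally: the bit for the $j$-th keyword is set when some occurrence of it lies exactly $j - 1$ positions after some occurrence of the first keyword. `phrase_starts` is a stricter variant, added here as an assumption and not taken from the paper, which requires all keywords to line up behind the same occurrence of the first keyword.

```python
def flags_for_file(positions) -> list:
    """positions[j] lists the positions of the (j+1)-th phrase keyword in one file.

    Returns one boolean per keyword 2..t: True if some occurrence of that
    keyword is at distance j from some occurrence of the first keyword.
    """
    first = set(positions[0])
    flags = []
    for j in range(1, len(positions)):
        flags.append(any((p - j) in first for p in positions[j]))
    return flags


def phrase_starts(positions) -> list:
    """Stricter variant (ASSUMPTION): starts s with keyword j+1 at s + j for all j."""
    later = [set(ps) for ps in positions]
    return [s for s in sorted(later[0])
            if all((s + j) in later[j] for j in range(1, len(later)))]
```

With positions `[[1, 10], [2], [12]]`, every flag is set, yet no single start position carries all three keywords consecutively, so `phrase_starts` returns an empty list. Whether the scheme ties the comparisons to a common occurrence of $w_{1}$ is not stated in this section.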

$V e r i f y (I, B, T K_{i, Q}, ψ)$ . The blockchain utilizes $T K_{i, Q}$ for phrase searches to verify the aggregated evidence $ψ$ returned by the cloud server, as shown in Algorithm 5. To ensure the reliable verification of search results, the verification algorithm $V e r i f y$ not only verifies the integrity of the files, but also verifies whether the server has returned all files that meet the search requirements.

Algorithm 5: Verify.

Input: $I, B, TK_{i, Q}, ψ$

Output: proof

1: $ψ^{'} \leftarrow ϕ, proof = 0$

2: Using search trapdoors $TK_{i, Q}$ to perform phrase searches same as line 1–line 6 of Algorithm 4.

3: for $ℓ_{i} \in ID_{B}$ do

4: $hash_{i}^{'} \leftarrow B[ℓ_{i}], ψ^{'} \leftarrow ψ^{'} +_{H} H(r, hash_{i}^{'})$

5: end for

6: if $ψ = ψ^{'}$ then

7: $proof = 1$

8: end if

9: The blockchain sends proof to the data user.

DOI: 10.7717/peerj-cs.2235/table-7

The blockchain performs the same operations as the server (line 1–line 6), retrieving the composite index stored on itself with the trapdoor $T K_{i, Q}$ , and calculates the search result $ψ^{'}$ . Due to the immutability of data on the blockchain, the composite index stored and search results calculated by the blockchain are reliable. Blockchain achieves reliable verification of phrase search results by comparing $ψ^{'}$ with the search result $ψ$ returned by the server.

For the file $ℓ_{i}$ , the blockchain obtains the corresponding hash value $h a s h_{i}$ by searching the checklist B and compresses it into the benchmark value $ψ^{'}$ . By comparing the aggregate evidence $ψ$ sent by the server with $ψ^{'}$ , the blockchain sets the value of $p r o o f$ as follows:

$proof = \begin{cases} 1, & \text{if } ψ = ψ^{'}, \\ 0, & \text{otherwise}. \end{cases}$

By comparing $ψ$ with $ψ^{'}$, the blockchain can determine: 1) whether the server has returned all files that meet the search requirements; 2) whether the content of the files has been modified.

Then the verification evidence $proof$ is sent to the data user. The data user checks the received $proof$ and accepts the search result R if $proof = 1$; otherwise, the user rejects R. For an accepted search result R, the data user uses the symmetric key to decrypt the files in it and obtain their plaintext, which completes the phrase search process.
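The aggregation and comparison of evidence can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' construction: the operator $+_{H}$ is instantiated here as addition modulo $2^{256}$ of SHA-256 digests, which makes the aggregate independent of the order in which files are processed, and $r$ is treated as a fixed-length byte string. The names `aggregate` and `verify` are hypothetical.

```python
import hashlib

MOD = 1 << 256  # size of the additive group used for aggregation (ASSUMPTION)


def h(r: bytes, item: bytes) -> int:
    """H(r, hash_i) as an integer; r is assumed to have a fixed length."""
    return int.from_bytes(hashlib.sha256(r + item).digest(), "big")


def aggregate(r: bytes, hashes) -> int:
    """psi = H(r, hash_1) +_H ... +_H H(r, hash_p), with +_H as addition mod 2^256."""
    psi = 0
    for x in hashes:
        psi = (psi + h(r, x)) % MOD
    return psi


def verify(psi_server: int, r: bytes, checklist: dict, labels) -> int:
    """Recompute psi' from the on-chain checklist B and compare; returns proof in {0, 1}."""
    psi_chain = aggregate(r, (checklist[label] for label in labels))
    return 1 if psi_server == psi_chain else 0
```

A single fixed-size value is exchanged regardless of how many files match, and a server that omits a matching file or returns a modified ciphertext produces a different aggregate except with negligible probability.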

Discussion

Ensuring the reliability of verification is an important goal of our scheme. In existing phrase search schemes, the secure index is stored on the server, and the verification evidence is generated by the server. An untrusted server may store only partial indexes and ciphertexts, resulting in untrustworthy search results and verification evidence. In our scheme, by contrast, the blockchain uses the search trapdoor to calculate the verification evidence. Because the data stored on the blockchain is unforgeable, the search results computed on the blockchain are reliable. At the same time, the verification of the results returned by the server is also performed by the blockchain, which prevents dishonest data users from falsifying the verification results and ensures the reliability of the verification results.

Security analysis

Theorem 1: The proposed efficient and reliable phrase search scheme satisfies verifiability.

Proof. Let $A$ be a PPT adversary who attempts to produce a forgery $R_{S}$ that passes the verification algorithm $Verify$. Assuming the correct search result is R, we will prove that no such adversary $A$ can give a forgery $R_{S}$ satisfying $R \neq R_{S}$ that passes verification.

Suppose the compressed hash values corresponding to R and $R_{S}$ are $ψ$ and $ψ_{S}$ , respectively, and we will discuss the following two cases:

Case 1: $R = R_{S}$ and $ψ \neq ψ_{S}$ . For each ciphertext $C_{j}$ in R, we have $h a s h_{j} \leftarrow H (C_{j}), ψ = ψ +_{H} H (r, h a s h_{j})$ , similarly, we have $h a s h_{j}^{S} \leftarrow H_{S} (C_{j}^{S}), ψ_{S} = ψ_{S} +_{H} H (r, h a s h_{j}^{S})$ for each ciphertext $C_{j}^{S}$ in $R_{S}$ . Since the data on the blockchain is unforgeable and $R = R_{S}$ , we have $ψ = ψ_{S}$ , which is contradictory to $ψ \neq ψ_{S}$ . Therefore this case does not hold.

Case 2: $R \neq R_{S}$ and $ψ = ψ_{S}$ . This implies that $A$ can discover a collision for H, which contradicts the collision resistance property of the hash function. Therefore, this case also does not hold.

In summary, the unforgeability of the blockchain and the collision resistance of the hash function ensure that no PPT adversary $A$ can forge search results. Therefore, our scheme satisfies verifiability.

Theorem 2: If PRF F is pseudo-random, $E n c$ algorithm is secure against chosen plaintext attack (CPA-secure) and $P a i l l i e r . E n c$ is secure against chosen ciphertext attack (CCA-secure), then our proposed scheme is ( $L_{I n d e x B u i l d}, L_{S e a r c h}, L_{V e r i f y}$ )-secure against the adaptive chosen-keyword attack.

Proof. We establish the CKA2 security of our scheme by demonstrating the indistinguishability of $Real_{A}(λ)$ and $Ideal_{A, S}(λ)$. The proof starts with $Real_{A}(λ)$ and goes through a series of indistinguishable games to reach $Ideal_{A, S}(λ)$, thus proving that $Real_{A}(λ)$ and $Ideal_{A, S}(λ)$ are indistinguishable.

Game $G_{1}$: $G_{1}$ is the same as $Real_{A}(λ)$:

$\Pr[Real_{A}(λ) = 1] = \Pr[G_{1} = 1]$

Game $G_{2}$ : We replace the output of the pseudorandom function F( $F_{1}$ and $F_{2}$ ) with a sequence of binary random numbers $\hat{F}$ , the length of $\hat{F}$ is equal to |F|, and store the binary sequence in buckets $B_{1}$ and $B_{2}$ . If the adversary $A$ can distinguish between F and the random number sequence, then they can distinguish between $G_{1}$ and $G_{2}$ . Then,

$| \Pr[G_{2} = 1] - \Pr[G_{1} = 1] | \leq Adv_{F_{1}, F_{2}, A}^{PRF}(λ)$

Game $G_{3}$ : In $G_{3}$ , the output of the hash function H( $H_{1}$ and $H_{2}$ ) is replaced by a series of randomly generated binary strings $\hat{H}$ , $| \hat{H} | = | H |$ . $G_{3}$ stores $\hat{H}$ in buckets $H B_{1}$ and $H B_{2}$ . If the adversary $A$ can distinguish between H and $\hat{H}$ , then they can distinguish between $G_{2}$ and $G_{3}$ . Then,

$| \Pr[G_{3} = 1] - \Pr[G_{2} = 1] | \leq negl(λ)$

Game $G_{4}$: In $G_{3}$, the output of the multi-set hash function is computed based on $(r, hash_{i})$, while in $G_{4}$, the output of the multi-set hash function is a random binary string. This binary string is recorded in a bucket $\hat{X}$. From the previous analysis, we can conclude that

$| \Pr[G_{4} = 1] - \Pr[G_{3} = 1] | \leq negl(λ)$

$I d e a l_{A, S} (λ)$ : $I d e a l_{A, S} (λ)$ and $G_{4}$ are the same, except that $I d e a l_{A, S} (λ)$ introduces simulator $S$ , $S$ executes algorithm $L_{I n d e x B u i l d}, L_{S e a r c h}, L_{V e r i f y}$ with the help of ( $s p (w), p h (w)$ ) and the adversary $A$ can sniff the algorithm output. The algorithm details are shown in Algorithms 6–8. The adversary $A$ cannot distinguish between the output of the random oracle in this game and the actual data, hence

$| \Pr[Ideal_{A, S}(λ) = 1] - \Pr[G_{4} = 1] | \leq negl(λ)$

Algorithm 6: Simulator 6.

Input: $DB, W, K$

Output: $I, T, B$

1: for $F_{i} \in F$ do

2: $ℓ_{i} \leftarrow HB_{2}; C_{i} \leftarrow Enc(K_{2}, F_{i})$

3: $hash_{i} \leftarrow HB_{1}; B[ℓ_{i}] \leftarrow hash_{i}; T[ℓ_{i}] \leftarrow C_{i}$

4: end for

5: for $\bar{w_{j}} \in W$ do

6: for $id \in DB$ do

7: $u_{w_{j}} \leftarrow B_{1}; st_{j} \leftarrow B_{2}; t_{w_{j}} \leftarrow HB_{2}$

8: Extract positions $(pos_{1}, pos_{2}, \dots, pos_{m})$ of keyword $\bar{w_{j}}$ in file $F_{i}$

9: $SL_{w_{j}}^{i} = π(id_{i}) + E(pos_{1}) + E(pos_{2}) + \dots + E(pos_{m})$

10: $v_{B} \overset{R}{\leftarrow} \{0, 1\}^{(l + λ)}; I[t_{w_{j}}] \leftarrow v_{B}; \Sigma[w_{j}] = st_{j}$

11: end for

12: end for

13: send $(I, B)$ to blockchain, send $(I, T)$ to cloud server

DOI: 10.7717/peerj-cs.2235/table-8

Algorithm 7: Simulator 7.

Input: $I, T, TK_{i, Q}$

Output: Search results R

1: Parse $ph(w)$ $[(t_{1}, pf_{1}), (t_{2}, pf_{2}), \dots, (t_{c}, pf_{c})]$

2: Parse $\{\bar{t_{w_{1}}}, \bar{t_{w_{2}}}, \dots, \bar{t_{w_{t}}}, E_{1}, E_{2}, \dots, E_{t - 1}\} \leftarrow minsp(\bar{TK_{i, Q}})$

3: for $\bar{t_{w_{j}}} \in minsp(\bar{TK_{i, Q}})$ do

4: $v_{B} \overset{R}{\leftarrow} \{0, 1\}^{(l + λ)}; (B_{w_{j}} || SL_{w_{j}}^{i}) \overset{R}{\leftarrow} \{0, 1\}^{(l + λ)}$

5: end for

6: $B = B_{1} \land B_{2} \land \dots \land B_{t}$

7: Get $ID_{B} = \{ℓ_{1}, ℓ_{2}, \dots, ℓ_{p}\}$ from $B$

8: for $ℓ_{i} \in ID_{B}$ do

9: $flag = \{000 \dots 000\}^{t - 1}$

10: if $\hat{X}[i] = ⊥$ then

11: $\hat{X}[i] \overset{R}{\leftarrow} \{0, 1\}^{n}$

12: else

13: $\hat{X}[i] \leftarrow pf_{i}$

14: end if

15: $(E(pos_{1}^{1}), E(pos_{2}^{1}), \dots, E(pos_{m}^{1})) \leftarrow SL_{w_{1}}^{1}$

16: for $d = 2 \to t$ do

17: $(E(pos_{1}^{i}), E(pos_{2}^{i}), \dots, E(pos_{n}^{i})) \leftarrow SL_{w_{j}}^{i}$

18: for $k = 1 \to m; k^{'} = 1 \to n$ do

19: if $E(pos_{1}^{k^{'}}) = E(pos_{1}^{k}) \times E(d - 1)$ then

20: Set the position $(i - 1)$ of flag to 1

21: break

22: end if

23: end for

24: end for

25: If all positions of flag are 1

26: get $c_{i} \leftarrow T[ℓ_{i}]$; $R \leftarrow R \cup c_{i}$

27: end for

28: Server sends $\{\hat{X}\}$ to the blockchain for verification, and sends R to the data user

DOI: 10.7717/peerj-cs.2235/table-9

Algorithm 8: Simulator 8.

Input: $I, B, TK_{i, Q}, ψ$

Output: proof

1: Parse $ph(w)$ $[(t_{1}, pf_{1}), (t_{2}, pf_{2}), \dots, (t_{c}, pf_{c})]$

2: $ψ^{'} \leftarrow ϕ, proof = 0$

3: Using search trapdoors $\bar{TK_{i, Q}}$ to perform phrase searches same as line 1–line 6 of Simulator 7.

4: for $ℓ_{i} \in ID_{B}$ do

5: if ${\hat{X}}^{'}[i] = ⊥$ then

6: ${\hat{X}}^{'}[i] \overset{R}{\leftarrow} \{0, 1\}^{n}$

7: else

8: $\hat{X}[i] \leftarrow pf_{i}$

9: end if

10: $hash_{i}^{'} \leftarrow B[ℓ_{i}], ψ^{'} \leftarrow ψ^{'} +_{H} H(r, hash_{i}^{'})$

11: end for

12: if $ψ = ψ^{'}$ then

13: $proof = 1$

14: end if

15: The blockchain sends proof to the data user.

DOI: 10.7717/peerj-cs.2235/table-10

From what we have discussed above, the adversary cannot distinguish the result in the experiment Real and the result in the experiment Ideal. That is:

$| \Pr[Real_{A}(λ) = 1] - \Pr[Ideal_{A, S}(λ) = 1] | \leq negl(λ)$

Therefore, our proposed scheme satisfies CKA2 security.

Results

In order to objectively evaluate the performance of our scheme, we design a series of scientific experiments in this section. We conducted a comprehensive analysis of the experimental results and compared them with the existing phrase search scheme (Kissel & Wang, 2013) and scheme (Ge et al., 2021). Our experiments are deployed on a local laptop with a Linux operating system, Intel Core i7-8550 CPU, and 8 GB RAM. Experimental programs are developed using Python. As for the pseudo-random functions F and the hash function H in the algorithm, we use HMAC-SHA-256 and SHA-256 respectively to implement them. Additionally, we symmetrically encrypt files using AES-128, and the security parameter is set to 128 bits. To evaluate our scheme in practice, we employ the Enron email dataset (Cukierski, 2015), a real-world dataset comprising over 517 thousand documents. Using the Porter Stemmer, we extract more than 1.67 million keywords and eliminate irrelevant terms such as “of” and “the”.

Evaluation of IndexBuild

In this phase, the IoT device mainly completes the following work: (1) encrypt the files in the system into ciphertext; (2) generate the secure index for all the keywords; (3) calculate checklist of the ciphertext for verification.

The performance of the scheme can be evaluated through the execution time of algorithm IndexBuild, and we evaluate this execution time for different numbers of files and keywords. Figure 4A shows how the execution time of IndexBuild varies with the number of keywords in a single file when the number of files is 1,000, 2,000 and 3,000; in contrast, Fig. 4B shows how the execution time varies with the number of files in the system when the number of keywords in a single file is 10, 30 and 50. The execution time of IndexBuild is affected by both the number of files and the number of keywords: the more files there are and the more keywords each file contains, the longer IndexBuild takes.

Figure 4: (A and B) Time of IndexBuild.

DOI: 10.7717/peerj-cs.2235/fig-4

Evaluation of TokenGen

Search trapdoors are generated by users and contain the permutation value of each keyword in the query phrase and the encrypted distance for every keyword except the first one. Figure 5 shows the time it takes to calculate a search trapdoor for different sizes of search phrases; the time increases with the size of the query phrase. This is expected: the more keywords in the phrase, the more distances between keywords need to be encrypted, which increases the trapdoor calculation time.

Figure 5: Time of trapdoor generation.

DOI: 10.7717/peerj-cs.2235/fig-5

Evaluation of search

Similarly, we use execution time to evaluate the search efficiency of our scheme. Figure 6A shows how search time changes with query phrase size when the number of keywords contained in each file is 10, 30 and 50. Figure 6B shows how search time changes with query phrase size when the number of files is 1,000, 3,000 and 5,000.

Figure 6: (A and B) Time of search.

DOI: 10.7717/peerj-cs.2235/fig-6

The results of Fig. 6 demonstrate that as the size of the query phrase grows or the number of documents grows, the search time will increase accordingly, which indicates that the more keywords in the phrase and the more files in the system, the more time it takes for the server to perform a phrase search. Furthermore, since we can use trapdoors to directly locate keywords in the inverted index, the search time is independent of the number of keywords contained in each file.

Evaluation of verification

The blockchain first performs operations similar to those of the server to find the files that match the search phrase, then gets the corresponding hash values from the checklist B, and finally calculates the benchmark $ψ^{'}$ based on the multi-set hash function. During the verification, the blockchain draws the verification conclusion by comparing $ψ^{'}$ with the aggregated search evidence $ψ$ returned by the server. Therefore, the verification time is related to the number of files that match the query phrase, and the experimental results are shown in Fig. 7. The verification time grows sub-linearly with the number of files and the size of the phrase.

Figure 7: Time of verification.

DOI: 10.7717/peerj-cs.2235/fig-7

The gas consumption during the verification process is shown in Fig. 8. During the verification process, the blockchain performs multiset hash calculations on the hash values in the checklist that meet the requirements of the phrase search, so the gas consumption increases with the number of search results. When the number of resulting files is 5, the gas consumption is $1.6 \times 10^{5}$, and when the number of files is 25, the gas consumption is $6.8 \times 10^{5}$; that is, gas consumption grows sublinearly.
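As a quick check of the sublinear trend, using only the two data points quoted above, a fivefold increase in the number of result files raises the gas consumption by a factor of

$\frac{6.8 \times 10^{5}}{1.6 \times 10^{5}} = 4.25 < \frac{25}{5} = 5,$

so the average gas per result file falls from $1.6 \times 10^{5} / 5 = 3.2 \times 10^{4}$ to $6.8 \times 10^{5} / 25 = 2.72 \times 10^{4}$.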

Figure 8: Gas consumption for verification.

DOI: 10.7717/peerj-cs.2235/fig-8

Comparison with existing schemes

We choose scheme (Kissel & Wang, 2013) and scheme (Ge et al., 2021), which offer similar functions, for comparison. The results are shown in Table 2, in which VPS and VPS-IoT denote scheme (Kissel & Wang, 2013) and scheme (Ge et al., 2021), respectively. Both VPS and VPS-IoT adopt a two-stage search strategy, so it takes two rounds of interaction between the user and the cloud server to complete a phrase search. Additionally, in both schemes the verification evidence is calculated by the server. In this case, if the server is not trustworthy, the evidence may also be incorrect, which poses a serious threat to the reliability of verification.

Table 2: Comparison with existing schemes.

Scheme	Index building	Trapdoor generation	Query	Verification	Rounds
VPS	$O(MN)$	$O(|q| * |F_{w}|)$	$O(|F_{w}| * |F_{j}|^{|q|})$	$O(|q| * |F_{w}|)$	2
VPS-IoT	$O(MN) + O(\sum_{j = 1}^{M} N |F_{j}|)$	$O(2|q| + |F_{w}| + |q| * |F_{w}|)$	$O(|F_{w}| * |F_{j}|^{|q|} + |q|)$	$O(|q| * |F_{q}|)$	2
Our Scheme	$O(MN)$	$O(|q| * |F_{w}|)$	$O(|F_{w}| * |F_{j}|^{|q|})$	$O(|F_{w}|)$	1

DOI: 10.7717/peerj-cs.2235/table-2

Note:

$| F_{j} |$ is the number of keywords contained in the file $F_{j}$ ; $w_{j}$ is a collection of different keywords in the file $F_{j}$ ; $| q |$ is the number of query keywords; $| F_{w} |$ is the number of files containing query keywords; $| F_{q} |$ is the size of the returned result file.

The experimental results are shown in Figs. 9–12. The results in Fig. 9 show that VPS takes the least time in building the index, while VPS-IoT takes the most. The reason is that VPS lacks verification of file integrity, so its calculation cost is low. Both our scheme and scheme VPS-IoT can verify the integrity of the file, but the structure of the lookup table in VPS-IoT is complex and requires a large number of encryption and MAC operations on keyword positions, ciphertexts, etc., so it needs more time than our scheme. Figure 10 shows that our scheme achieves the highest efficiency in trapdoor generation. Both VPS and VPS-IoT adopt a two-phase query strategy in which the data user generates two trapdoors for a query, while our scheme only needs to generate one trapdoor, so our scheme is more efficient. Figure 11 compares the query efficiency of the three schemes; the query is performed over 1,000 files, each containing 20 keywords. The complexity of the three schemes is almost the same, and the search time grows sub-linearly with the number of keywords in the phrase. As for verification efficiency, we deploy the three schemes on 50 files, and the experimental results are shown in Fig. 12. Scheme VPS-IoT performs best among the three schemes, but it cannot verify the integrity of the file. Our scheme takes less time than scheme VPS when the size of the query phrase becomes larger, which demonstrates the efficiency of our scheme. Furthermore, the verification is performed on the blockchain in our scheme, ensuring the reliability of verification.

Figure 9: (A and B) IndexBuild.

DOI: 10.7717/peerj-cs.2235/fig-9

Figure 10: Trapdoor generation.

DOI: 10.7717/peerj-cs.2235/fig-10

Figure 11: Phrase search.

DOI: 10.7717/peerj-cs.2235/fig-11

From what we have discussed above, our scheme has obvious advantages in index construction, trapdoor generation, and result verification compared with existing schemes, and the search efficiency is comparable to existing schemes. Furthermore, our scheme enables reliable and complete verification of search results with the help of blockchain, preventing the server from generating unreliable verification evidence due to only storing partial indexes and ciphertexts. At the same time, our scheme can prevent the unfair verification problem caused by malicious users forging verification results.

Discussion

In this article, we presented an efficient phrase search scheme with reliable verification over encrypted cloud-IoT data, which tackles the challenges of efficient phrase identification and reliable result verification. The scheme introduces the blockchain into the verification, which ensures the reliability of the verification evidence and the verification process. During the verification process, we use a multiset hash function to aggregate the on-chain evidence into a single hash value, which significantly reduces the blockchain transaction cost. In addition, the scheme designs a novel composite index and a distance discrimination algorithm that can quickly determine the order of keywords and achieve efficient identification of phrases, which reduces the computational and communication overhead.

Supplemental Information

The experimental code for this scheme.

This code includes algorithms such as index construction, trapdoor generation, search, and result validation.

DOI: 10.7717/peerj-cs.2235/supp-1

data.

Link to the dataset used in the paper

DOI: 10.7717/peerj-cs.2235/supp-2

[1] Anand A, Mele I, Bedathur S, Berberich K. 2014. Phrase query optimization on inverted indexes.

[2] Cash D, Jarecki S, Jutla CS, Krawczyk H, Rosu M, Steiner M. 2013. Highly-scalable searchable symmetric encryption with support for boolean queries.

[3] Chen CM, Tie Z, Wang EK, Khan MK, Kumar S, Kumari S. 2021. Verifiable dynamic ranked search with forward privacy over encrypted cloud data. Peer-to-Peer Networking and Applications 14(5):2977-2991

[4] Cukierski W. 2015. Enron email dataset.

[5] Curtmola R, Garay J, Kamara S, Ostrovsky R. 2006. Searchable symmetric encryption: improved definitions and efficient constructions.

[6] Du L, Li K, Liu Q, Wu Z, Zhang S. 2020. Dynamic multi-client searchable symmetric encryption with support for boolean queries. Information Sciences 506(6):234-257

[7] Ge X, Wei G, Yu J, Singh A, Xiong N. 2021. An intelligent fuzzy phrase search scheme over encrypted network data for IoT. IEEE Transactions on Network Science and Engineering 9(2):377-388

[8] Ge X, Yu J, Chen F, Kong F, Wang H. 2021. Towards verifiable phrase search over encrypted cloud-iot data. IEEE Internet of Things Journal 8(2):484-494

[9] Guo Y, Zhang C, Jia X. 2020. Verifiable and forward-secure encrypted search using blockchain techniques.

[10] Hu S, Cai C, Wang Q, Wang C, Luo X, Ren K. 2018. Searching an encrypted cloud meets blockchain: a decentralized, reliable and fair realization.

[11] Kamara S, Papamanthou C, Roeder T. 2012. Dynamic searchable symmetric encryption.

[12] Kissel ZA, Wang J. 2013. Verifiable phrase search over encrypted data secure against a semi-honest-but-curious adversary.

[13] Lai S, Patranabis S, Sakzad A, Liu J, Mukhopadhyay D, Steinfeld R, Sun S, Liu D, Zuo C. 2018. Result pattern hiding searchable encryption for conjunctive queries.

[14] Li M, Jia W, Guo C, Sun W, Tan X. 2015. LPSSE: lightweight phrase search with symmetric searchable encryption in cloud storage.

[15] Li F, Ma J, Miao Y, Jiang Q, Liu X, Choo KKR. 2021. Verifiable and dynamic multi-keyword search over encrypted cloud data using bitmap. IEEE Transactions on Cloud Computing 11:336

[16] Li F, Ma J, Miao M, Liu Z, Choo KKR, Liu X, Deng RH. 2023. Towards efficient verifiable boolean search over encrypted cloud data. IEEE Transactions on Cloud Computing 11(1):839-853

[17] Li H, Tian H, Zhang F, He J. 2019. Blockchain-based searchable symmetric encryption scheme. Computers & Electrical Engineering 73(3):32-45

[18] Liang YR, Li YP, Cao Q, Ren F. 2020. Vpams: verifiable and practical attribute-based multi-keyword search over encrypted cloud data. Journal of Systems Architecture 108(4):101741

[19] Liang Y, Li Y, Zhang K, Ma L. 2021. Dmse: dynamic multi-keyword search encryption based on inverted index. Journal of Systems Architecture 119(3):102255

[20] Liu J, Huang YY, Wang Y, Lv SY, Liu ZL, Dong CY, Lou W. 2021. Searchable symmetric encryption with forward search privacy. IEEE Transactions on Dependable and Secure Computing 18(1):460-474

[21] Liu Q, Nie X, Liu X, Peng T, Wu J. 2016. Verifiable ranked search over dynamic encrypted data in cloud computing.

[22] Liu X, Yang X, Luo Y, Zhang Q. 2021. Verifiable multi-keyword search encryption scheme with anonymous key generation for medical internet of things. IEEE Internet of Things Journal 9(22):22315-22326

[23] Liu C, Zhu L, Wang M, Tan YA. 2014. Search pattern leakage in searchable encryption: attacks and new construction. Information Sciences 265(7):176-188

[24] Miao Y, Deng RH, Choo KKR, Liu X, Ning J, Li H. 2021. Optimized verifiable fine-grained keyword search in dynamic multi-owner settings. IEEE Transactions on Dependable and Secure Computing 18(4):1804-1820

[25] Poon HT, Miri A. 2019. Fast phrase search for encrypted cloud storage. IEEE Transactions on Cloud Computing 7(4):1002-1012

[26] Poon HT, Miri A. 2015. A low storage phrase search scheme based on bloom filters for encrypted cloud services.

[27] Qi C, Gong G. 2012. Verifiable symmetric searchable encryption for semi-honest-but-curious cloud servers.

[28] Shen M, Ma B, Zhu L, Du X, Xu K. 2019. Secure phrase search for intelligent processing of encrypted data in cloud-IoT. IEEE Internet of Things Journal 6(2):1998-2008

[29] Song QY, Liu ZT, Cao JH, Sun K, Li Q, Wang C. 2021. Sap-sse: protecting search patterns and access patterns in searchable symmetric encryption. IEEE Transactions on Information Forensics and Security 16:1795-1809

[30] Song D, Wagner D, Perrig A. 2000. Practical techniques for searching on encrypted data.