Efficient phrase search with reliable verification over encrypted cloud-IoT data

PeerJ Computer Science

Introduction

The Internet of Things (IoT) has developed rapidly and is now widely used in agriculture, industry, medicine and other fields, where it helps to improve crop production, raise manufacturing efficiency, and protect patients’ health. Every day, hundreds of millions of IoT devices around the world generate massive amounts of data, which are stored either locally or in the cloud. Compared with local storage, cloud storage reduces local storage and management costs, enables efficient data processing and analysis, and makes it easier to share data between users, so it has been widely researched and applied.

Although cloud storage brings many conveniences to users, it also poses security and privacy risks. Cloud servers are generally considered untrustworthy: unauthorized insiders may attempt to access sensitive information (e.g., a patient’s disease name or blood pressure), and external attackers may access data illegally, leading to data destruction and privacy leaks. For this reason, IoT devices generally encrypt data first and then outsource the ciphertext to the cloud to protect the integrity and privacy of the data.

Users who need data that has been outsourced to the cloud must retrieve it from ciphertext. To support keyword retrieval over encrypted data while balancing search efficiency and security, Song, Wagner & Perrig (2000) proposed the concept of searchable encryption (SE). According to the number of keywords queried, SE schemes fall into two categories: single-keyword search and multi-keyword search. Phrase search is an important searchable encryption technique that searches for a sequence of consecutive keywords in sentences or documents (Tang et al., 2012; Anand et al., 2014). Designing an efficient phrase search solution is very challenging. Existing single-keyword (Curtmola et al., 2006; Stefanov, Papamanthou & Shi, 2014) and multi-keyword encrypted search schemes (Cash et al., 2013; Poon & Miri, 2015) cannot be applied directly to phrase search because they cannot determine the location of keywords. For example, in an electronic medical system, certain diseases are expressed by phrases, such as “myocardial infarction”. When this phrase is searched with a multi-keyword encrypted search scheme, the cloud server may return files that contain both “myocardial” and “infarction” even though the two words do not appear together as a phrase, so the search results contain many invalid files.

Another challenge for phrase search is the verification of search results. Since data is outsourced to the cloud, external or internal attacks on the cloud server may compromise the integrity or confidentiality of the data. In addition, data may be lost or damaged during transmission. Therefore, it is necessary to verify the results of a phrase search.

Although some studies (Kissel & Wang, 2013; Ge et al., 2021) address the problem of phrase search result verification, these verification schemes lack reliability. In the existing solutions, the server both computes the search results and generates the verification evidence, using methods such as RSA accumulators. Both the search results and the verification evidence may therefore be forged by the cloud server (for example, the server may store only part of the files and search index for financial gain, in which case the search results and verification evidence are incomplete). In addition, data users may forge verification results to save costs, which also makes the verification results unreliable. In recent studies, some researchers have adopted blockchain technology. These schemes guarantee the reliability of verification through the immutability of the blockchain and have obtained good experimental results. However, they mainly focus on encrypted search for a single keyword and cannot be applied to phrase search.

To address these problems, we design a blockchain-based phrase search scheme supporting reliable verification over encrypted cloud-IoT data. Our main contributions are as follows:

1) We propose an efficient phrase search scheme over encrypted cloud-IoT data. In our scheme, a composite index containing keyword position and a distance discrimination algorithm based on homomorphic encryption are proposed, which can not only reduce the complexity of phrase recognition, but also achieve efficient phrase search and result verification.

2) We propose a method that enables reliable verification of phrase search results. In our scheme, the verification evidence calculation and verification process of phrase search are both executed by the blockchain, breaking the pattern of the server generating both search results and verification evidence, so the reliability of phrase search is ensured. Furthermore, we use a multiset hash function to calculate cumulative evidence, which significantly reduces the overhead of the blockchain.

3) We analyze the security of the scheme and evaluate it with detailed experiments. The results demonstrate that our construction is secure and achieves good search efficiency.

The article is structured as follows: “Related Work” introduces the current research progress related to phrase search and verification; “Problem Formulation” describes the system model, threat model, algorithm definitions, and security definitions; “Methods” provides a detailed description of the phrase search and verification algorithms used in our scheme; and finally, “Security Analysis” and “Results” respectively analyze the security and experimental results of the proposed solution.

Related work

Searchable symmetric encryption (SSE) was first proposed by Song, Wagner & Perrig (2000), providing users with a new way to perform retrieval on encrypted data. However, this scheme uses full-text matching, so the search time is linear. To improve search efficiency, Anand et al. (2014) proposed an efficient searchable encryption scheme based on an inverted index, achieving sublinear search. Following this direction, many schemes have been proposed to support dynamic update (Kamara, Papamanthou & Roeder, 2012; Stefanov, Papamanthou & Shi, 2014; Liu et al., 2021), multi-client query (Sun, Zuo & Liu, 2022; Du et al., 2020) and privacy protection (Liu et al., 2014; Song et al., 2021). However, these schemes mainly focus on a single keyword, and the cloud returns some irrelevant files. To further improve search efficiency and accuracy, SSE schemes supporting multi-keyword search have been proposed, such as boolean queries (Cash et al., 2013) and conjunctive queries (Lai et al., 2018). Compared with single-keyword queries, multi-keyword search improves search accuracy and reduces communication and storage overhead.

Phrase search is a special case of multi-keyword search that requires a sequential relationship between the keywords. Anand et al. (2014) first defined the model of phrase search and its security definition, but the scheme is impractical in real scenarios because the client and the server need two rounds of interaction to complete a phrase query. Poon & Miri (2015) proposed a low-storage phrase search scheme using Bloom filters and symmetric encryption, and later proposed a fast phrase search scheme based on n-gram filters (Poon & Miri, 2019). Li et al. (2015) implemented phrase search based on relative position and achieved lightweight transactions and storage during the retrieval process. Ge et al. (2021) proposed an intelligent fuzzy phrase search scheme over encrypted network data for IoT, which identifies phrases through binary matrices and look-up tables and uses a fuzzy keyword set to resolve spelling errors in phrase searches. Shen et al. (2019) proposed a phrase search scheme that protects user privacy, using homomorphic encryption and bilinear mapping to achieve phrase identification.

Verifiable search: Servers in SSE are not completely trusted and may return incorrect search results due to external or internal attacks, so verifiable search is necessary. The concept of verifiable searchable symmetric encryption (VSSE) was first proposed by Qi & Gong (2012), and a series of VSSE schemes have been proposed since then (Liu et al., 2016; Miao et al., 2021; Chen et al., 2021; Wu et al., 2023). Unfortunately, these schemes work for a single keyword but do not support multiple keywords. Wan & Deng (2018) used a homomorphic MAC to design a scheme that can verify the search results of multiple keywords. Li et al. (2021) used RSA accumulators to verify multi-keyword search results and bitmaps to improve search efficiency. There are similar multi-keyword verifiable ciphertext retrieval schemes (Liu et al., 2021; Liang et al., 2020, 2021). Kissel & Wang (2013) utilized a validation tag to build a verifiable phrase search scheme over encrypted data, but the scheme does not verify the integrity of the files. For more complex phrase searches, Ge et al. (2021) used the MAC function and look-up tables to implement phrase search result verification. Although this construction can verify the phrase search, it adopts a two-phase query strategy, which means that the user must interact with the server twice per phrase search and generate a large number of trapdoors.

Verifiable search based on blockchain: In the above verifiable schemes, the server sends the search results and verification evidence to the user, and the user recomputes the evidence from the search results and compares it with the received evidence to complete the verification. This approach has two disadvantages. First, the results and evidence are unreliable because the server is untrusted. Second, this approach cannot solve the problem of fair verification between server and user. To address these problems, blockchain has been introduced into verifiable search. Some blockchain-based verifiable search solutions have been proposed (Hu et al., 2018; Li et al., 2019; Guo, Zhang & Jia, 2020), but they mainly focus on single-keyword search, and there are almost no reliable and fair verification solutions for multi-keyword search scenarios. The same is true for phrase searches, which are more complex than multi-keyword searches.

Problem formulation

In this section, we formally define the efficient and reliable phrase search scheme over encrypted cloud-IoT data. We present the system model, threat model and security definition. We denote a composite index as a secure index, a searched phrase as a query and an encrypted query as a trapdoor. The notations and symbols used in our system are shown in Table 1.

Table 1: Notations and symbols.
Notation: Definition
$id_i$: The identifier of the file $F_i$
$M$: The number of files
$N$: The number of keywords
$L(\cdot)$: Bit length of the argument
$|\cdot|$: Number of elements in a set
$SL_{w_1}^{i}|_{|A|}$: The first $|A|$ bits of $SL_{w_1}^{i}$
$SL_{w_1}^{i}|^{(|A|-|B|)}$: The last $(|A|-|B|)$ bits of $SL_{w_1}^{i}$
$\tilde{w}$: The query phrase
$R$: A set of ciphertexts satisfying the phrase search
$proof$: Verification result, 1: valid, 0: invalid
$\|$: Concatenation symbol; $a\|b$ denotes the concatenation of messages $a$ and $b$
$r_{i,j}$: Number of positions of keyword $w_i$ in file $F_j$
DOI: 10.7717/peerj-cs.2235/table-1

System model

Four entities are included in our system: IoT device, data user, cloud server, and blockchain. The system model is shown in Fig. 1. The IoT device, as the data owner, collects data and stores them in the form of files $F=\{F_1,F_2,\ldots,F_M\}$. The IoT device extracts all the keywords $W=\{w_1,w_2,\ldots,w_N\}$ in $F$ and uses bitmaps to build the composite index $I$. The IoT device encrypts all the files in $F$ into ciphertexts $C=\{C_1,C_2,\ldots,C_M\}$ by symmetric encryption, and calculates the hash value $hash_i$ of each ciphertext in $C$ with SHA-256, which is added to the checklist $L$. Finally, $(I, C)$ and $(I, L)$ are sent to the cloud server and the blockchain, respectively.


Figure 1: System model.

Image credit: component source from https://www.iconfont.cn/.

The data user obtains the system public parameters $\Omega$ through the authorization of the IoT device and generates a search trapdoor from these public parameters and the phrase to be queried, $\tilde{w}$. The trapdoor is sent to the server and the blockchain for encrypted search and result verification, respectively. The data user receives the search results $R$ from the cloud server and the verification result $proof$ from the blockchain, and accepts $R$ if $proof=1$; otherwise the user rejects $R$.

The cloud server stores the index I and ciphertexts C sent by the IoT device, and performs a search over encrypted data using the trapdoor sent by the data user to generate the search result R and aggregate evidence ψ. At last, the cloud server sends ψ to the blockchain for verification, and sends R to the user.

The blockchain verifies the aggregate evidence $\psi$ returned by the server and generates $proof$. To achieve reliable verification, the blockchain performs the phrase search in parallel with the cloud server to generate the verification reference value $\psi'$. The blockchain then compares $\psi'$ with the aggregate evidence $\psi$ returned by the server, computes the verification result $proof$, and sends it to the user. In particular, during verification, multiset hash functions are used to verify the aggregate hash of the ciphertexts while the ciphertexts themselves stay off-chain, which reduces blockchain storage and computing overhead.

Threat model

In our system, IoT devices and the blockchain are completely trusted. IoT devices collect data honestly and generate secure indexes and checklists. The blockchain performs fair verification of search results, and the verification result is reliable and unforgeable.

In contrast, cloud servers and users are considered untrustworthy. The cloud server may store only part of the index and ciphertexts to save storage resources, and it may perform searches dishonestly to save computation costs. In addition, there may be other software or hardware malfunctions in the system. Any of these can make the files and the verification evidence returned by the server incomplete or incorrect. Data users may falsify verification results for financial gain and are therefore not trustworthy either.

Algorithm definitions

Our scheme consists of six polynomial-time algorithms $\Pi=\{\mathrm{KeyGen},\mathrm{IndexBuild},\mathrm{TokenGen},\mathrm{Search},\mathrm{Verify},\mathrm{Dec}\}$:

1) $K\leftarrow \mathrm{KeyGen}(1^{\lambda})$: this algorithm takes a security parameter $\lambda$ as input and outputs the key set $K=(K_1,K_2,K_3,K_4,K_I,K_Z,K_X,pk,sk)$.

2) $(I,T,B)\leftarrow \mathrm{IndexBuild}(F,W,K)$: this algorithm takes the set of files $F$, the set of keywords $W$ and the key set $K$ as input, and outputs the secure index $I$, the encrypted database $T$ and the checklist $B$.

3) $TK_{i,Q}\leftarrow \mathrm{TokenGen}(\tilde{w},K_3,pk)$: this algorithm takes a query phrase $\tilde{w}$, a secret key $K_3$ and a public key $pk$ as input, and outputs the search trapdoor $TK_{i,Q}$.

4) $(\psi,R)\leftarrow \mathrm{Search}(I,T,TK_{i,Q})$: this algorithm takes the secure index $I$, the encrypted database $T$ and the search trapdoor $TK_{i,Q}$ as input, and outputs the aggregate evidence $\psi$ and the search results $R$.

5) $proof\leftarrow \mathrm{Verify}(I,\psi,B,TK_{i,Q})$: this algorithm takes the secure index $I$, the aggregate evidence $\psi$, the checklist $B$ and the search trapdoor $TK_{i,Q}$ as input, and outputs the verification result $proof$.

6) $F\leftarrow \mathrm{Dec}(K_2,C)$: this algorithm takes the symmetric key $K_2$ and the encrypted file $C$ as input, and outputs the plaintext $F$.

Leakage function

The goal of searchable encryption is to leak as little information as possible about the keywords and files during ciphertext retrieval. Similar to Wu et al. (2023), the leakage function is defined as $\mathcal{L}=\{\mathcal{L}_{IndexBuild},\mathcal{L}_{Search},\mathcal{L}_{Verify}\}$. Following the common definition, the query history $Hist=\{(DB_i,q_i)\}_{i=0}^{n}$ stores a series of query requests and the corresponding database snapshots. The search pattern $sp(w)=\{i \mid \text{for each } q_i \text{ that contains } w \text{ in } Hist\}$ records each query request. The proof history is $ph(w)=\{(i,proof_i) \mid \text{for each } (i,ind_i,w) \text{ in } Hist\}$. We can then define the leakage functions $\mathcal{L}_{IndexBuild}=(ph(w))$, $\mathcal{L}_{Search}=(sp(w),ph(w))$ and $\mathcal{L}_{Verify}=(ph(w))$.

Security definitions

Definition 1 (Verifiability). An efficient and verifiable phrase search scheme satisfies verifiability if the probability that a forged result generated by any probabilistic polynomial time (PPT) adversary passes the Verify algorithm is negligible.

Definition 2 (CKA2-security). For the verifiable phrase search scheme $\Pi=\{\mathrm{KeyGen},\mathrm{IndexBuild},\mathrm{TokenGen},\mathrm{Search},\mathrm{Verify},\mathrm{Dec}\}$, consider a leakage function $\mathcal{L}=\{\mathcal{L}_{IndexBuild},\mathcal{L}_{Search},\mathcal{L}_{Verify}\}$, an adversary $A$ and an idealized simulator $S$, together with two games $\mathrm{Real}_{A}(\lambda)$ and $\mathrm{Ideal}_{A,S}(\lambda)$ defined as follows:

$\mathrm{Real}_{A}(\lambda)$: The challenger generates the system key $K=\{K_1,K_2,K_3\}$ and the index $(I,T,B)$ by executing $\mathrm{KeyGen}(1^{\lambda})$ and $\mathrm{IndexBuild}(F,W,K)$, and $(I,T,B)$ is transmitted to the adversary $A$. $A$ then formulates a sequence of adaptive queries $Q=\{q_1,q_2,\ldots,q_t\}$; the challenger generates a search token for each query, and $A$ receives the results of executing Search and Verify. Finally, $A$ produces a bit $b$ as the output of this experiment.

$\mathrm{Ideal}_{A,S}(\lambda)$: The simulator $S$ takes $(F,W)$ generated by the adversary $A$ as input and outputs the index $(I,T,B)$ using $\mathcal{L}_{IndexBuild}$. Then, for a series of adaptive queries $Q=\{q_1,q_2,\ldots,q_t\}$ generated by the adversary $A$, $S$ generates search results using $\mathcal{L}_{Search}$ and $\mathcal{L}_{Verify}$. $A$ receives those results and produces a bit $b$ as the output of this experiment.

If there is a simulator $S$ such that for any PPT adversary $A$:

$$\left|\Pr[\mathrm{Real}_{A}(\lambda)=1]-\Pr[\mathrm{Ideal}_{A,S}(\lambda)=1]\right|\leq \mathrm{negl}(\lambda),$$

then $\Pi$ is $\mathcal{L}$-secure against adaptive chosen-keyword attack (CKA2), where $\mathrm{negl}$ is a negligible function and $\lambda$ is the security parameter.

Preliminaries

Bitmaps use binary strings to represent sets and are commonly used to store file identifiers in encrypted search, which reduces storage requirements. In our model, each keyword $w_i$ corresponds to a bitmap, a string of 0s and 1s in which each bit denotes a file. If the $k$-th file contains $w_i$, the bit at position $k$ is set to 1; otherwise it is set to 0. For instance, with four files $(F_1,F_2,F_3,F_4)$ and two keywords $(w_1,w_2)$, as depicted in Fig. 2, $w_1$ is found in $F_1$ and $F_3$, while $w_2$ appears in $F_2$ and $F_3$. The bitmaps for $w_1$ and $w_2$ are 1010 and 0110, respectively. To search for files containing both $w_1$ and $w_2$, an “AND” operation is performed on the two bitmaps, yielding $1010\wedge 0110=0010$, which indicates that $F_3$ contains both $w_1$ and $w_2$.


Figure 2: Bitmap.
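The bitmap example can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper names (`bitmap_from_files`, `bitmap_and`, `files_from_bitmap`) are hypothetical, and bitmaps are kept as plain strings with no encryption applied.

```python
def bitmap_from_files(file_ids, containing):
    """Build a bitmap string: position k is '1' if file_ids[k] contains the keyword."""
    return "".join("1" if f in containing else "0" for f in file_ids)


def bitmap_and(a, b):
    """Bitwise AND of two equal-length bitmap strings."""
    return "".join("1" if x == "1" and y == "1" else "0" for x, y in zip(a, b))


def files_from_bitmap(file_ids, bitmap):
    """Return the files whose bit is set in the bitmap."""
    return [f for f, bit in zip(file_ids, bitmap) if bit == "1"]


files = ["F1", "F2", "F3", "F4"]
b_w1 = bitmap_from_files(files, {"F1", "F3"})   # w1 occurs in F1 and F3
b_w2 = bitmap_from_files(files, {"F2", "F3"})   # w2 occurs in F2 and F3
both = bitmap_and(b_w1, b_w2)                   # files containing w1 AND w2
matches = files_from_bitmap(files, both)
```

Here `b_w1` and `b_w2` are the bitmaps 1010 and 0110 from the text, and `matches` contains only `F3`.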

Homomorphic encryption is an encryption technique that allows computation directly on ciphertexts: the result of an operation on ciphertexts decrypts, under the same key, to the result of the corresponding operation on the plaintexts. In this study, we employ the widely used Paillier additively homomorphic encryption to compute the distance between keywords within phrases. Its functionality can be outlined as follows:

1) Key Generation: Let $p$ and $q$ denote two large primes such that $\gcd(pq,(p-1)\times(q-1))=1$. Define $n=pq$ and $\lambda=\mathrm{lcm}(p-1,q-1)$. Choose a random integer $g$ from $Z_{n^2}$ satisfying $\gcd(L(g^{\lambda} \bmod n^2),n)=1$, where $L(x)=(x-1)/n$ and $Z_{n^2}=\{1,2,\ldots,n^2-1\}$. Then, the public key $(n,g)$ and the private key $\lambda$ are obtained.

2) Encryption: Given a message $m$, it can be encrypted into its ciphertext $c$ as follows:

$$c=E(m,r)=g^{m}r^{n} \bmod n^2$$

where $r$ is randomly selected from $Z_n$.

3) Decryption: For the ciphertext $c$, it can be decrypted into its plaintext $m$ as follows:

$$m=D(c,\lambda)=\frac{L(c^{\lambda} \bmod n^2)}{L(g^{\lambda} \bmod n^2)} \bmod n$$

This algorithm is additively homomorphic. Given two messages $a$ and $b$ with corresponding ciphertexts $E(a)$ and $E(b)$, the product $E(a)\cdot E(b) \bmod n^2$ is a valid ciphertext of $(a+b)$, i.e., $E(a+b)=E(a)\cdot E(b)$. This property can be used to compute the distance between keywords in a phrase and thus to determine their positional relationship.
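The three steps and the additive property can be checked with a toy Python sketch. It is not the implementation used in the scheme: the primes are deliberately tiny, $g=n+1$ is one common valid choice that is assumed here, and the function names are hypothetical.

```python
import math
import random


def L(x, n):
    """L(x) = (x - 1) / n, as defined for Paillier decryption."""
    return (x - 1) // n


def keygen(p, q):
    n = p * q
    assert math.gcd(n, (p - 1) * (q - 1)) == 1
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    g = n + 1                                            # assumed choice of g
    mu = pow(L(pow(g, lam, n * n), n), -1, n)            # inverse of L(g^lam mod n^2)
    return (n, g), (lam, mu)


def encrypt(pk, m):
    n, g = pk
    n2 = n * n
    while True:                                          # pick r coprime to n
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (L(pow(c, lam, n * n), n) * mu) % n


pk, sk = keygen(1009, 1013)          # toy primes, far too small to be secure
n2 = pk[0] ** 2
ca, cb = encrypt(pk, 2), encrypt(pk, 1)
c_sum = (ca * cb) % n2               # homomorphic addition = ciphertext product
total = decrypt(pk, sk, c_sum)
```

Because each encryption uses fresh randomness $r$, `c_sum` is generally not bit-identical to a fresh encryption of 3; what the sketch checks is that it decrypts to $2+1$.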

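Multiset Hash Function (Li et al., 2023): A multiset hash is a cryptographic tool that maps multisets of any finite size to a hash value of fixed length. A multiset hash is also updatable: when the elements in the set change, the hash value can be updated from its current value without recomputing it over all elements.

Our scheme uses the most efficient multiset hash function, MSet-XOR-Hash, which consists of three polynomial-time algorithms $(\mathcal{H},+_{\mathcal{H}},-_{\mathcal{H}})$. Given a multiset $M$, MSet-XOR-Hash can be expressed as follows:

$$\begin{cases}
\mathcal{H}(r,M)=H_k(0,r)\oplus\bigoplus_{m\in M}H_k(1,m);\\
\mathcal{H}(r,M\cup\{x\})\equiv_{\mathcal{H}}\mathcal{H}(r,M)+_{\mathcal{H}}\mathcal{H}(r,\{x\})\equiv_{\mathcal{H}}\mathcal{H}(r,M)\oplus H_k(1,x);\\
\mathcal{H}(r,M\setminus\{x\})\equiv_{\mathcal{H}}\mathcal{H}(r,M)-_{\mathcal{H}}\mathcal{H}(r,\{x\})\equiv_{\mathcal{H}}\mathcal{H}(r,M)\oplus H_k(1,x)
\end{cases}$$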
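A minimal Python sketch of these three operations follows. It assumes HMAC-SHA256 as the keyed function $H_k$ and a fixed demo key; both are illustrative choices rather than details stated in the text, and the function names are hypothetical.

```python
import hashlib
import hmac

KEY = b"demo-key"  # hypothetical key k for H_k


def h_k(tag, value):
    """H_k(tag, value), modelled as HMAC-SHA256 and returned as an integer."""
    digest = hmac.new(KEY, bytes([tag]) + value, hashlib.sha256).digest()
    return int.from_bytes(digest, "big")


def mset_hash(r, items):
    """H(r, M) = H_k(0, r) XOR (XOR over all m in M of H_k(1, m))."""
    acc = h_k(0, r)
    for m in items:
        acc ^= h_k(1, m)
    return acc


def mset_add(acc, x):
    """Update for M ∪ {x}: XOR in H_k(1, x) without rehashing M."""
    return acc ^ h_k(1, x)


def mset_remove(acc, x):
    """Update for M \\ {x}: with XOR, removal is the same operation as insertion."""
    return acc ^ h_k(1, x)


r = b"seed"
full = mset_hash(r, [b"hash1", b"hash2", b"hash3"])
incremental = mset_add(mset_add(mset_hash(r, [b"hash1"]), b"hash3"), b"hash2")
after_removal = mset_remove(full, b"hash2")
```

The incremental value equals the hash computed over the whole set, regardless of insertion order, which is the property that lets evidence be accumulated one ciphertext hash at a time.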

Methods

We present the construction of the efficient and reliable phrase search scheme over encrypted cloud-IoT data in this section.

Composite index containing files and locations

In phrase search, a phrase is composed of multiple keywords according to a certain positional relationship, which is also the difference between phrase search and multi-keyword search. To perform a phrase search, the cloud server not only needs to search for all keywords contained in the phrase, but also needs to determine whether the order between keywords is correct.

To identify the positional relationship between keywords in phrases, we designed a composite index containing files and locations. The structure of the composite index is shown in Fig. 3.


Figure 3: Example of the composite index structure.

The composite index adopts an inverted index structure to keep search efficient. It differs from a general inverted index in that each keyword corresponds not only to the IDs of a series of files but also to all the locations where the keyword appears in each file. For example, in Fig. 3, suppose there are three keywords (“heart”, “attack”, “medic”) and five corresponding files $(F_1,F_2,F_3,F_4,F_5)$; for simplicity, encryption is not shown. The positions of keyword “heart” in files $F_1$, $F_2$, $F_3$ are (1, 8, 3), (1, 2, 4) and (2, 3, 5), respectively, and the positions of keyword “attack” in files $F_1$, $F_2$, $F_4$ are (2, 5, 7), (3, 7, 9) and (1, 2, 5). When the cloud server searches for the phrase “heart attack”, it finds through the composite index that the locations of “heart” in $F_1$ are (1, 8, 3) and that the locations of “attack” in $F_1$ are (2, 5, 7). Using the encrypted distance discrimination algorithm, the cloud server determines that a position of “attack” in $F_1$ is 1 larger than a position of “heart” via $E(2)=E(1)\cdot E(1)$. Similarly, the cloud server determines that “attack” follows “heart” in $F_2$ via $E(3)=E(2)\cdot E(1)$. After checking the locations of all keywords in the composite index, the server concludes that $F_1$ and $F_2$ contain the phrase “heart attack”.

Encrypted distance discrimination algorithm-EDDA

The sequence of keywords in a phrase can be expressed by a sentinel and the distance between each remaining keyword and the sentinel. For example, in a phrase containing three keywords $(w_1,w_2,w_3)$ whose positions are 1, 2 and 3, we choose $w_1$ as the sentinel. The distance between $w_2$ and $w_1$ is 1, and the distance between $w_3$ and $w_1$ is 2. Suppose that the positions of $(w_1,w_2,w_3)$ are $(pos_1,pos_2,pos_3)$. If we can establish that $pos_2=pos_1+1$ and $pos_3=pos_1+2$, we can recognize that $(w_1,w_2,w_3)$ is a phrase.

In our scheme, the positions of keywords stored in the composite index are encrypted, and the server must be able to recognize phrases without the decryption key. We therefore use Paillier homomorphic encryption to construct the distance discrimination algorithm that determines the positional relationship between keywords. The details are as follows.

$E(d)$ is the distance encrypted with Paillier homomorphic encryption, and $SL_{w_j}^{i}$ denotes the encrypted location of keyword $w_j$ in file $id_i$, defined as:

$$SL_{w_j}^{i}=\pi(id_i)+E(pos_{w_j}^{i})$$

Here $E$ is the Paillier homomorphic encryption function, $\pi$ is a hash function, and $pos_{w_j}^{i}$ is the original location of keyword $w_j$ in file $id_i$. A keyword $w_j$ may appear at multiple locations in the same file, in which case $E(pos_{w_j}^{i})$ represents a series of encrypted positions.

In the distance discrimination algorithm, $val_i=SL_{w_1}^{i}|_{|\pi|}\oplus SL_{w_2}^{i}|_{|\pi|}$ determines whether $w_1$ and $w_2$ belong to the same file; if so, $val_i=0$. $E(pos_{w_j}^{i})\leftarrow SL_{w_j}^{i}|^{(|SL_{w_j}^{i}|-|\pi|)}$ extracts the encrypted location of keyword $w_j$, and $E(pos_{w_2}^{i})==E(pos_{w_1}^{i})+E(d)$ checks whether keyword $w_2$ is located $d$ positions after $w_1$; on Paillier ciphertexts this homomorphic addition is computed as the product $E(pos_{w_1}^{i})\cdot E(d)$. When a user issues a phrase search request, the user designates the first word (i.e., $w_1$) as the sentinel, computes the distance $d$ between each remaining word and the sentinel, encrypts $d$, and finally generates a search token that is sent to the server.

Details of our construction

Like most searchable encryption schemes, we adopt an inverted index structure to construct the secure index. In the inverted index, we use a bitmap to store the identifiers of the files. Let $H:\{0,1\}^{*}\rightarrow\{0,1\}^{m}$ and $F:\{0,1\}^{*}\rightarrow\{0,1\}^{n}$ be secure pseudo-random functions (PRFs).

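Algorithm 1: Distance discrimination algorithm.
Input: $SL_{w_1}^{i}, SL_{w_2}^{i}, E(d)$
Output: $Flag$
$Flag\leftarrow False$
$val_i=SL_{w_1}^{i}|_{|\pi|}\oplus SL_{w_2}^{i}|_{|\pi|}$
if $val_i==0$ then
    $E(pos_{w_1}^{i})\leftarrow SL_{w_1}^{i}|^{(|SL_{w_1}^{i}|-|\pi|)}$
    $E(pos_{w_2}^{i})\leftarrow SL_{w_2}^{i}|^{(|SL_{w_2}^{i}|-|\pi|)}$
    if $E(pos_{w_2}^{i})==E(pos_{w_1}^{i})+E(d)$ then
        $Flag\leftarrow True$
    end if
end if
return $Flag$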
DOI: 10.7717/peerj-cs.2235/table-3
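The following Python sketch mirrors Algorithm 1 under two simplifying assumptions that are not taken from the paper. First, $SL$ is modelled as a pair $(\pi(id), E(pos))$ rather than a bit-string concatenation. Second, Paillier encryption is made deterministic by fixing $g=n+1$ and $r=1$, because comparing ciphertexts for equality only works when both sides use the same randomness; this variant is insecure and serves only to make the comparison visible.

```python
import hashlib

N = 1009 * 1013      # toy Paillier modulus (assumption), far too small for real use
N2 = N * N


def enc(m):
    """Paillier encryption with g = N + 1 and fixed r = 1 (deterministic, insecure)."""
    return pow(N + 1, m, N2)


def pi(file_id):
    """Stand-in for the hash function pi applied to a file identifier."""
    return hashlib.sha256(file_id.encode()).digest()


def make_sl(file_id, pos):
    """SL modelled as the pair (pi(id), E(pos)) instead of a concatenated bit string."""
    return (pi(file_id), enc(pos))


def edda(sl_w1, sl_w2, e_d):
    """Return True if both entries belong to the same file and pos2 = pos1 + d."""
    tag1, e_pos1 = sl_w1
    tag2, e_pos2 = sl_w2
    if tag1 != tag2:                 # equivalent to val_i != 0 (XOR of the tags)
        return False
    # Homomorphic addition of E(pos1) and E(d) is a ciphertext product.
    return e_pos2 == (e_pos1 * e_d) % N2


flag_adjacent = edda(make_sl("F1", 1), make_sl("F1", 2), enc(1))
flag_wrong_gap = edda(make_sl("F1", 1), make_sl("F1", 5), enc(1))
flag_other_file = edda(make_sl("F1", 1), make_sl("F2", 2), enc(1))
```

Only the first call returns `True`: the second fails the distance check and the third fails the same-file check.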

KeyGen($1^{\lambda}$). The IoT device uses the security parameter $\lambda$ to generate the key set $K=\{K_1,K_2,K_3,K_4,K_I,K_Z,K_X,pk,sk\}$, where $K_1,K_2,K_3,K_4,K_I,K_Z,K_X\xleftarrow{\$}\{0,1\}^{\lambda}$ and $(pk,sk)\leftarrow \mathrm{Paillier.KeyGen}(1^{\lambda})$. $K_1$ is used to encrypt the identifiers of files, $K_2$ is the secret key for symmetric encryption, $K_3$ is the key for the PRF $F$, and $(pk,sk)$ are the public and private keys for Paillier encryption.

IndexBuild($F,W,K$). Given a set of files $F$, a set of keywords $W$, and the key set $K$, the IoT device generates the secure index $I$, the encrypted database $T$ and the checklist $B$. The details are shown in Algorithm 2.

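Algorithm 2: IndexBuild.
Input: $DB, W, K$
Output: $I, T, B$
for $F_i\in F$ do
    $i\leftarrow H(id_i\|K_1)$; $C_i\leftarrow \mathrm{Enc}(K_2,F_i)$
    $hash_i\leftarrow H(C_i)$; $B[i]\leftarrow hash_i$; $T[i]\leftarrow C_i$
end for
for $w_j\in W$ do
    for $id\in DB$ do
        $u_{w_j}\leftarrow F_1(K_3,w_j)$; $st_j\leftarrow F_2(K_4,id)$; $t_{w_j}\leftarrow H_2(u_{w_j}\|st_j)$
        Extract positions $(pos_1,pos_2,\ldots,pos_m)$ of keyword $w_j$ in file $F_i$
        $SL_{w_j}^{i}=\pi(id_i)+E(pos_1)+E(pos_2)+\cdots+E(pos_m)$
        $v_B\leftarrow(B_{w_j}\|SL_{w_j}^{i})\oplus H_2(u_{w_j}\|st_j)$; $I[t_{w_j}]\leftarrow v_B$; store $st_j$ as the state of $w_j$
    end for
end for
send $(I,B)$ to the blockchain, send $(I,T)$ to the cloud server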
DOI: 10.7717/peerj-cs.2235/table-4

For each file $F_i\in F$, the IoT device encrypts it into the ciphertext $C_i$ with symmetric encryption $\mathrm{Enc}(K_2,F_i)$. The ciphertext $C_i$ is stored in the encrypted database $T$, and the hash value $hash_i$ of $C_i$ is stored in the checklist $B$ for verification.

The IoT device generates a bitmap $B_{w_j}$ for each keyword $w_j$; $B_{w_j}$ is encrypted as $v_B\leftarrow B_{w_j}\oplus H_2(u_{w_j}\|st_j)$, and $v_B$ is stored in the secure index $I$. In particular, to protect the privacy of files, the IoT device computes $i\leftarrow H_2(id_i\|K_1)$ to encrypt the identifier of each file and then uses $i$ to generate $B_{w_j}$. Since the identifiers stored on the server are encrypted, the server cannot obtain the real identifiers from $B_{w_j}$, which ensures the privacy of the search pattern.

To identify phrases, the IoT device extracts the positions $(pos_1,pos_2,\ldots,pos_m)$ of keyword $w_j$ in file $F_i$ and encrypts each of them with Paillier.Enc:

$$SL_{w_j}^{i}=\pi(id_i)+E(pos_1)+E(pos_2)+\cdots+E(pos_m)$$
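The masking of an index entry can be sketched as follows. This is a simplified illustration, not the construction itself: HMAC-SHA256 stands in for the PRFs $F_1$ and $F_2$, SHA-256 stands in for $H_2$, the mask is derived from the label $t_{w_j}$ (as the server does when it unmasks an entry during search), and all keys and helper names are hypothetical.

```python
import hashlib
import hmac

K3, K4 = b"k3-demo", b"k4-demo"   # hypothetical keys for the two PRFs


def prf(key, msg):
    """PRF stand-in (HMAC-SHA256)."""
    return hmac.new(key, msg, hashlib.sha256).digest()


def pad(seed, length):
    """Expand a seed to `length` bytes with SHA-256 in counter mode."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def build_entry(keyword, state_input, payload):
    """Return (t_w, v_B): the index label and the masked payload B_w || SL_w."""
    u = prf(K3, keyword)
    st = prf(K4, state_input)
    label = hashlib.sha256(u + st).digest()              # t_w = H2(u || st)
    value = xor_bytes(payload, pad(label, len(payload)))
    return label, value


def open_entry(label, value):
    """Unmask v_B with the pad derived from the label."""
    return xor_bytes(value, pad(label, len(value)))


secure_index = {}
label, value = build_entry(b"heart", b"F1", b"1010")
secure_index[label] = value
recovered = open_entry(label, secure_index[label])
```

Without the label, the stored value reveals nothing readable about the bitmap; with it, the XOR mask is removed and the payload `b"1010"` is recovered.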

TokenGen($\tilde{w},K_3,pk$). Authorized users get the shared parameters $\Omega=\{K_3,pk\}$ from the IoT device. For the phrase $\tilde{w}=\{w_1,w_2,\ldots,w_t\}$ to be queried, the trapdoor $TK_{i,Q}$ is generated as shown in Algorithm 3.

Algorithm 3: TokenGen.
Input: The query phrase $\tilde{w}$, the key set $K$
Output: The search trapdoor $TK_{i,Q}$
Suppose that the query phrase is $\tilde{w}=\{w_1,w_2,\ldots,w_t\}$
for $j=1$ to $t$ do
    $st_j\leftarrow$ stored state of $w_j$, $u_{w_j}\leftarrow F_1(K_3,w_j)$
    $t_{w_j}\leftarrow H_2(u_{w_j}\|st_j)$
    if $j>1$ then
        $d=j-1$; $E_d\leftarrow \mathrm{Paillier.Enc}(d)$
    end if
end for
send $TK_{i,Q}=\{t_{w_1},t_{w_2},\ldots,t_{w_t},E_1,E_2,\ldots,E_{t-1}\}$ to the blockchain and the cloud server
DOI: 10.7717/peerj-cs.2235/table-5

Assume that the keywords $\{w_1,w_2,\ldots,w_t\}$ in the phrase are arranged in order. For each keyword $w_j$ with $j>1$, the data user calculates the distance $d=j-1$ between $w_j$ and the first keyword $w_1$ and encrypts it with Paillier.Enc($d$). Finally, the labels $t_{w_j}$ and the encrypted distances $E_d$ are added to the trapdoor $TK_{i,Q}$, which is sent to the cloud server and the blockchain.
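The structure of the trapdoor can be illustrated with a short Python sketch. It assumes a deterministic toy Paillier variant ($g=n+1$, $r=1$) and an HMAC-based label; the key, the fixed state value and the function names are hypothetical stand-ins, not details from the scheme.

```python
import hashlib
import hmac

N = 1009 * 1013      # toy Paillier modulus (assumption, insecure)
N2 = N * N
K3 = b"k3-demo"      # hypothetical PRF key shared with authorized users


def enc(m):
    """Toy deterministic Paillier encryption with g = N + 1 and r = 1."""
    return pow(N + 1, m, N2)


def label_of(word, state=b"st"):
    """t_w = H2(F1(K3, w) || st), with a fixed placeholder state."""
    u = hmac.new(K3, word.encode(), hashlib.sha256).digest()
    return hashlib.sha256(u + state).hexdigest()


def token_gen(words):
    """Return one label per keyword and one encrypted distance per non-sentinel keyword."""
    labels = [label_of(w) for w in words]
    enc_distances = [enc(j - 1) for j in range(2, len(words) + 1)]   # E_1 .. E_{t-1}
    return labels, enc_distances


labels, enc_distances = token_gen(["acute", "myocardial", "infarction"])
```

For a phrase of $t=3$ keywords the trapdoor holds three labels and the two encrypted distances $E_1$ and $E_2$.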

Search($I,T,TK_{i,Q}$). As shown in Algorithm 4, the cloud server performs an encrypted search with the secure index $I$, the encrypted database $T$ and the trapdoor $TK_{i,Q}$.

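Algorithm 4: Search.
Input: $I, T, TK_{i,Q}$
Output: Search results $R$
 1: Parse $\{t_{w_1},t_{w_2},\ldots,t_{w_t},E_1,E_2,\ldots,E_{t-1}\}\leftarrow TK_{i,Q}$
 2: for $t_{w_j}\in TK_{i,Q}$ do
 3:     $v_B\leftarrow I[t_{w_j}]$; $B_{w_j}\|SL_{w_j}^{i}\leftarrow v_B\oplus H_2(t_{w_j})$
 4: end for
 5: $B=B_1\wedge B_2\wedge\ldots\wedge B_t$, $r=\mathcal{H}(B,\{\})$
 6: Get $ID_B=\{1,2,\ldots,p\}$ from $B$
 7: for $i\in ID_B$ do
 8:     $flag=\{000\ldots000\}_{t-1}$, $hash_i=H_1(T[i])$, $\psi=\psi+_{\mathcal{H}}\mathcal{H}(r,hash_i)$
 9:     $(E(pos_1^{1}),E(pos_2^{1}),\ldots,E(pos_m^{1}))\leftarrow SL_{w_1}^{i}$
10:     for $d=2$ to $t$ do
11:         $(E(pos_1^{d}),E(pos_2^{d}),\ldots,E(pos_n^{d}))\leftarrow SL_{w_d}^{i}$
12:         for $k=1$ to $m$; $k'=1$ to $n$ do
13:             if $E(pos_{k'}^{d})=E(pos_{k}^{1})\times E(d-1)$ then
14:                 Set position $(d-1)$ of $flag$ to 1
15:                 break
16:             end if
17:         end for
18:     end for
19:     if all positions of $flag$ are 1 then
20:         get $C_i\leftarrow T[i]$, $R\leftarrow R\cup C_i$
21: end for
22: Server sends $\{\psi\}$ to the blockchain for verification, and sends $R$ to the data user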
DOI: 10.7717/peerj-cs.2235/table-6

After receiving the query request, the server parses the trapdoor $\{t_{w_1},t_{w_2},\ldots,t_{w_t},E_1,E_2,\ldots,E_{t-1}\}\leftarrow TK_{i,Q}$. The server gets the bitmap $B_{w_i}$ of $w_i$ from the secure index $I$ through $v_B\leftarrow I[t_{w_i}]$, $B_{w_i}\leftarrow v_B\oplus H_2(t_{w_i})$. To get the files that contain all the keywords in the phrase $\tilde{w}$, the server performs an “AND” operation on the bitmaps of all keywords as follows:

$$B=B_1\wedge B_2\wedge\cdots\wedge B_t.$$

A file corresponding to an element with value “1” in $B$ contains all the keywords in the phrase. The server gets the set of identifiers of these files as $ID_B=\{1,2,\ldots,p\}$ according to $B$.

Next, the server determines whether the order of the keywords in file $i$ is consistent with their order in the phrase, as described in line 7–line 20 of Algorithm 4. The server initializes a binary string $flag$ of length $(t-1)$ with all bits set to “0”. For each keyword $w_d$ $(d>1)$, the server obtains all of its encrypted positions $E(pos_1^{d}),E(pos_2^{d}),\ldots,E(pos_n^{d})$ in file $i$. For each position $E(pos_{k'}^{d})$, the server uses the distance discrimination algorithm EDDA to check the distance between $w_d$ and the first keyword $w_1$ of the phrase:

$$E(pos_{k'}^{d})=E(pos_{k}^{1})\times E(d-1)$$

where $E$ denotes Paillier.Enc and $E(pos_{k}^{1})$ is an encrypted position of $w_1$ in file $i$. If this equation holds, the distance between keywords $w_1$ and $w_d$ is $(d-1)$, the same as in the phrase, and the server sets position $(d-1)$ of $flag$ to “1”. If all positions of $flag$ are “1”, file $i$ contains the phrase $\tilde{w}$, and it is added to the search result $R$. Finally, the aggregate evidence $\psi$ is sent to the blockchain for reliable verification, and the ciphertext collection of search results $R$ is sent to the user.
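The per-file check with the $flag$ string can be sketched in Python as below. As an assumption made only for this sketch, encryption is the deterministic toy Paillier variant ($g=n+1$, $r=1$), since ciphertext equality is what the check compares; the function name `file_matches` and the input layout are hypothetical.

```python
N = 1009 * 1013      # toy Paillier modulus (assumption, insecure)
N2 = N * N


def enc(m):
    """Toy deterministic Paillier encryption with g = N + 1 and r = 1."""
    return pow(N + 1, m, N2)


def file_matches(enc_positions, enc_distances):
    """Phrase check for one file.

    enc_positions[j] holds the encrypted positions of the (j+1)-th keyword,
    enc_distances[j-1] holds E(j), the encrypted distance to the sentinel.
    """
    sentinel = enc_positions[0]
    flag = [0] * (len(enc_positions) - 1)
    for j in range(1, len(enc_positions)):
        e_d = enc_distances[j - 1]
        # E(pos of sentinel) x E(d) for every sentinel position
        targets = {(c * e_d) % N2 for c in sentinel}
        if any(c in targets for c in enc_positions[j]):
            flag[j - 1] = 1
    return all(flag)


# "heart attack" in F1: heart at (1, 8, 3), attack at (2, 5, 7)
f1 = [[enc(1), enc(8), enc(3)], [enc(2), enc(5), enc(7)]]
# a hypothetical file where the two words never appear next to each other
f_other = [[enc(2), enc(3), enc(5)], [enc(1), enc(9)]]

hit = file_matches(f1, [enc(1)])
miss = file_matches(f_other, [enc(1)])
```

Like the pseudocode, the sketch tests each later keyword independently against any position of the sentinel and sets one flag bit per keyword.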

Verify($I,B,TK_{i,Q},\psi$). The blockchain uses $TK_{i,Q}$ to perform the phrase search and verify the aggregate evidence $\psi$ returned by the cloud server, as shown in Algorithm 5. To ensure reliable verification of search results, the Verify algorithm verifies not only the integrity of the files but also whether the server has returned all files that meet the search requirements.

Algorithm 5: Verify.
Input: $I, B, TK_{i,Q}, \psi$
Output: $proof$
$\psi'\leftarrow\emptyset$, $proof=0$
Use the search trapdoor $TK_{i,Q}$ to perform the phrase search in the same way as line 1–line 6 of Algorithm 4
for $i\in ID_B$ do
    $hash_i\leftarrow B[i]$, $\psi'\leftarrow\psi'+_{\mathcal{H}}\mathcal{H}(r,hash_i)$
end for
if $\psi'=\psi$ then
    $proof=1$
end if
The blockchain sends $proof$ to the data user
DOI: 10.7717/peerj-cs.2235/table-7

The blockchain performs the same operations as the server (line 1–line 6): it retrieves the composite index stored on-chain with the trapdoor $TK_{i,Q}$ and calculates its own search result. Due to the immutability of data on the blockchain, the composite index stored and the search result calculated by the blockchain are reliable. The blockchain achieves reliable verification of phrase search results by comparing its own benchmark value $\psi'$ with the aggregated evidence $\psi$ returned by the server.

For each matching file i, the blockchain obtains the corresponding hash value $hash_i$ from the checklist B and compresses it into the benchmark value $\psi'$. By comparing the aggregate evidence $\psi$ sent by the server with $\psi'$, the blockchain sets the value of proof as follows:

$$proof = \begin{cases} 1, & \text{if } \psi' = \psi, \\ 0, & \text{otherwise.} \end{cases}$$

By checking whether $\psi' = \psi$, the blockchain can determine: 1) whether the server has returned all files that meet the search requirements; 2) whether the content of the files has been modified.
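The comparison between the server's evidence and the blockchain's benchmark can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the multi-set hash is modeled as an additive construction (HMAC-SHA-256 outputs summed modulo $2^{256}$), and the names `mset_term`, `aggregate` and `verify` are hypothetical. The point it shows is that the aggregate is order-independent and changes if a file is omitted or altered.

```python
import hashlib
import hmac

MOD = 2 ** 256  # assumed group size for the additive multi-set hash


def mset_term(r: bytes, file_hash: bytes) -> int:
    """One multi-set hash term for (r, hash_i), modeled with HMAC-SHA-256."""
    return int.from_bytes(hmac.new(r, file_hash, hashlib.sha256).digest(), "big")


def aggregate(r: bytes, hashes) -> int:
    """Fold file hashes into a single value; addition makes the order irrelevant."""
    psi = 0
    for h in hashes:
        psi = (psi + mset_term(r, h)) % MOD
    return psi


def verify(r: bytes, checklist: dict, matching_ids, psi_server: int) -> int:
    """Blockchain side: rebuild the benchmark from the checklist and compare."""
    psi_benchmark = aggregate(r, (checklist[i] for i in matching_ids))
    return 1 if psi_benchmark == psi_server else 0


r = b"r" * 16
checklist = {i: hashlib.sha256(b"cipher%d" % i).digest() for i in range(1, 5)}
matching = [1, 3, 4]

honest = aggregate(r, [checklist[4], checklist[1], checklist[3]])
partial = aggregate(r, [checklist[1], checklist[3]])  # one matching file withheld
print(verify(r, checklist, matching, honest), verify(r, checklist, matching, partial))
```

A single aggregated value keeps the on-chain comparison to one equality check regardless of how many files match.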

Then the verification evidence proof is sent to the data user. The data user checks the received proof and accepts the search result R if proof = 1; otherwise the user rejects R. For an accepted search result R, the data user decrypts the files in it with the symmetric key to obtain the plaintext, which completes the phrase search process.

Discussion

Ensuring the reliability of verification is an important goal of our scheme. In existing phrase search schemes, the secure index is stored on the server, and the verification evidence is generated by the server. An untrusted server may store only partial indexes and ciphertexts, resulting in untrustworthy search results and verification evidence. In our scheme, by contrast, the blockchain uses the search trapdoor to calculate the verification evidence. Since the data stored on the blockchain is unforgeable, the search results computed on the blockchain are reliable. At the same time, the verification of the results returned by the server is also performed by the blockchain, which prevents dishonest data users from falsifying the verification results and ensures the reliability of the verification results.

Security analysis

Theorem 1: The proposed efficient and reliable phrase search scheme satisfies verifiability.

Proof. Let A be a PPT adversary who produces a forged result $R^S$ that passes the verification algorithm Verify. Assuming the correct search result is R, we prove that no such adversary A can produce an accepted forgery $R^S$ with $R \neq R^S$.

Suppose the compressed hash values corresponding to R and $R^S$ are $\psi$ and $\psi^S$, respectively. We discuss the following two cases:

Case 1: $R = R^S$ and $\psi \neq \psi^S$. For each ciphertext $C_j$ in R, we have $hash_j \leftarrow H(C_j)$, $\psi = \psi +_{\mathcal{H}} \mathcal{H}(r, hash_j)$; similarly, we have $hash_j^S \leftarrow H(C_j^S)$, $\psi^S = \psi^S +_{\mathcal{H}} \mathcal{H}(r, hash_j^S)$ for each ciphertext $C_j^S$ in $R^S$. Since the data on the blockchain is unforgeable and $R = R^S$, we have $\psi = \psi^S$, which contradicts $\psi \neq \psi^S$. Therefore, this case does not hold.

Case 2: $R \neq R^S$ and $\psi = \psi^S$. This implies that A can find a collision for H, which contradicts the collision resistance of the hash function. Therefore, this case also does not hold.

In summary, the unforgeability of the blockchain and the collision resistance of the hash function ensure that no PPT adversary A can forge search results. So, our scheme satisfies verifiability.

Theorem 2: If the PRF F is pseudo-random, the Enc algorithm is secure against chosen plaintext attack (CPA-secure) and Paillier.Enc is secure against chosen ciphertext attack (CCA-secure), then our proposed scheme is $(\mathcal{L}_{IndexBuild}, \mathcal{L}_{Search}, \mathcal{L}_{Verify})$-secure against the adaptive chosen-keyword attack.

Proof. We establish the CKA2 security of our scheme by demonstrating the indistinguishability of $Real_A(\lambda)$ and $Ideal_{A,S}(\lambda)$. The proof starts with $Real_A(\lambda)$ and goes through a series of indistinguishable games to reach $Ideal_{A,S}(\lambda)$, thus proving that $Real_A(\lambda)$ and $Ideal_{A,S}(\lambda)$ are indistinguishable.

Game $G_1$: $G_1$ is the same as $Real_A(\lambda)$:

$$\Pr[Real_A(\lambda)=1] = \Pr[G_1=1]$$

Game $G_2$: We replace the output of the pseudorandom function F ($F_1$ and $F_2$) with a sequence of random binary strings $\hat{F}$, where the length of $\hat{F}$ is equal to $|F|$, and store the binary sequence in buckets $B_1$ and $B_2$. If the adversary A can distinguish between $G_1$ and $G_2$, then A can distinguish between F and the random sequence. Then,

$$|\Pr[G_2=1]-\Pr[G_1=1]| \le Adv^{PRF}_{F_1,F_2,A}(\lambda)$$
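The replacement made in this game hop can be pictured as swapping a keyed function for a lazily filled table of random strings. The sketch below is only an illustration of that idea: `RandomBucket` and `prf` are hypothetical names, and HMAC-SHA-256 is used for the PRF because the experimental section states that choice.

```python
import hashlib
import hmac
import secrets


def prf(key: bytes, x: bytes) -> bytes:
    """Real game: PRF output computed from the secret key."""
    return hmac.new(key, x, hashlib.sha256).digest()


class RandomBucket:
    """Hybrid game: a fresh random string per new input, replayed on repeats."""

    def __init__(self, out_len: int = 32):
        self.table = {}
        self.out_len = out_len

    def __call__(self, x: bytes) -> bytes:
        if x not in self.table:
            self.table[x] = secrets.token_bytes(self.out_len)
        return self.table[x]


bucket = RandomBucket()
first = bucket(b"keyword-1")
# Same input gives the same answer, and the output length matches the PRF's.
print(bucket(b"keyword-1") == first, len(first) == len(prf(b"k" * 16, b"keyword-1")))
```

An adversary who sees only outputs can tell the two apart only by breaking the pseudorandomness of F, which is what the advantage term bounds.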

Game $G_3$: In $G_3$, the output of the hash function H ($H_1$ and $H_2$) is replaced by a series of randomly generated binary strings $\hat{H}$ with $|\hat{H}| = |H|$. $G_3$ stores $\hat{H}$ in buckets $HB_1$ and $HB_2$. If the adversary A can distinguish between $G_2$ and $G_3$, then A can distinguish between H and $\hat{H}$. Then,

$$|\Pr[G_3=1]-\Pr[G_2=1]| \le negl(\lambda)$$

Game $G_4$: In $G_3$, the output of the multi-set hash function is computed from $(r, hash_i)$, while in $G_4$ the output of the multi-set hash function is a random binary string, which is recorded in a bucket $\hat{X}$. From the previous analysis, we can conclude that

$$|\Pr[G_4=1]-\Pr[G_3=1]| \le negl(\lambda)$$

$Ideal_{A,S}(\lambda)$: $Ideal_{A,S}(\lambda)$ and $G_4$ are the same, except that $Ideal_{A,S}(\lambda)$ introduces the simulator S. S executes $\mathcal{L}_{IndexBuild}, \mathcal{L}_{Search}, \mathcal{L}_{Verify}$ with the help of $(sp(w), ph(w))$, and the adversary A can observe the algorithm output. The algorithm details are shown in Algorithms 6–8. The adversary A cannot distinguish between the output of the random oracle in this game and the actual data, hence

$$|\Pr[Ideal_{A,S}(\lambda)=1]-\Pr[G_4=1]| \le negl(\lambda)$$

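Algorithm 6: Simulator 6.
Input: $DB$, $W$, $K$
Output: $I$, $T$, $B$
 1:  for $F_i \in F$ do
 2:     $i \leftarrow HB_2$; $C_i \leftarrow Enc(K_2, F_i)$
 3:     $hash_i \leftarrow HB_1$; $B[i] \leftarrow hash_i$; $T[i] \leftarrow C_i$
 4:  end for
 5:  for $\bar{w_j} \in W$ do
 6:     for $id \in DB$ do
 7:        $u_{w_j} \leftarrow B_1$; $st_j \leftarrow B_2$; $t_{w_j} \leftarrow HB_2$
 8:        Extract positions $(pos_1, pos_2, \ldots, pos_m)$ of keyword $\bar{w_j}$ in file $F_i$
 9:        $SL_{w_j}^{i} = \pi(id_i) + E(pos_1) + E(pos_2) + \ldots + E(pos_m)$
10:        $v_B \xleftarrow{R} \{0,1\}^{(l+\lambda)}$; $I[t_{w_j}] \leftarrow v_B$; $[w_j] = st_j$
11:     end for
12:  end for
13:  send $(I, B)$ to blockchain, send $(I, T)$ to cloud server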
DOI: 10.7717/peerj-cs.2235/table-8
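Algorithm 7: Simulator 7.
Input: $I$, $T$, $TK_{i,Q}$
Output: Search results $R$
 1:  Parse $ph(w)$ as $[(t_1, pf_1), (t_2, pf_2), \ldots, (t_c, pf_c)]$
 2:  Parse $\{t_{\bar{w_1}}, t_{\bar{w_2}}, \ldots, t_{\bar{w_t}}, E_1, E_2, \ldots, E_{t-1}\} \leftarrow \min sp(\overline{TK_{i,Q}})$
 3:  for $t_{\bar{w_j}} \in \min sp(\overline{TK_{i,Q}})$ do
 4:     $v_B \xleftarrow{R} \{0,1\}^{(l+\lambda)}$; $(B_{w_j} \| SL_{w_j}^{i}) \xleftarrow{R} \{0,1\}^{(l+\lambda)}$
 5:  end for
 6:  $B = B_1 \wedge B_2 \wedge \ldots \wedge B_t$
 7:  Get $ID_B = \{1, 2, \ldots, p\}$ from $B$
 8:  for $i \in ID_B$ do
 9:     $flag = \{000\ldots000\}_{t-1}$
10:     if $\hat{X}[i] = \perp$ then
11:        $\hat{X}[i] \xleftarrow{R} \{0,1\}^{n}$
12:     else
13:        $\hat{X}[i] \leftarrow pf_i$
14:     end if
15:     $(E(pos_1^1), E(pos_2^1), \ldots, E(pos_m^1)) \leftarrow SL_{w_1}^{1}$
16:     for $d = 2 \rightarrow t$ do
17:        $(E(pos_1^i), E(pos_2^i), \ldots, E(pos_n^i)) \leftarrow SL_{w_j}^{i}$
18:        for $k = 1 \rightarrow m$; $k' = 1 \rightarrow n$ do
19:           if $E(pos_{k'}^{i}) = E(pos_{k}^{1}) \times E(d-1)$ then
20:              Set the position $(i-1)$ of flag to 1
21:              break
22:           end if
23:        end for
24:     end for
25:     if all positions of flag are 1 then
26:        get $c_i \leftarrow T[i]$, $R \leftarrow R \cup c_i$
27:  end for
28:  Server sends $\{\hat{X}\}$ to the blockchain for verification, and sends $R$ to the data user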
DOI: 10.7717/peerj-cs.2235/table-9
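Algorithm 8: Simulator 8.
Input: $I$, $B$, $TK_{i,Q}$, $\psi$
Output: proof
 1:  Parse $ph(w)$ as $[(t_1, pf_1), (t_2, pf_2), \ldots, (t_c, pf_c)]$
 2:  $\psi' \leftarrow \emptyset$, $proof = 0$.
 3:  Use the search trapdoor $\overline{TK_{i,Q}}$ to perform the phrase search, as in line 1–line 6 of Simulator 7.
 4:  for $i \in ID_B$ do
 5:     if $\hat{X}[i] = \perp$ then
 6:        $\hat{X}[i] \xleftarrow{R} \{0,1\}^{n}$
 7:     else
 8:        $\hat{X}[i] \leftarrow pf_i$
 9:     end if
10:     $hash_i \leftarrow B[i]$, $\psi' \leftarrow \psi' +_{\mathcal{H}} \mathcal{H}(r, hash_i)$
11:  end for
12:  if $\psi' = \psi$ then
13:     $proof = 1$
14:  end if
15:  The blockchain sends proof to the data user.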
DOI: 10.7717/peerj-cs.2235/table-10

From what we have discussed above, the adversary cannot distinguish the result in the experiment Real and the result in the experiment Ideal. That is:

$$|\Pr[Real_A(\lambda)=1]-\Pr[Ideal_{A,S}(\lambda)=1]| \le negl(\lambda)$$

Therefore, our proposed scheme satisfies CKA2 security.
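The final bound can be recovered by chaining the individual game hops with the triangle inequality. The derivation below is a sketch of that step, using the per-game bounds stated above; the factor 3 simply counts the three hops bounded by $negl(\lambda)$.

```latex
\begin{align*}
\bigl|\Pr[Real_A(\lambda)=1]-\Pr[Ideal_{A,S}(\lambda)=1]\bigr|
 &\le \bigl|\Pr[Real_A(\lambda)=1]-\Pr[G_1=1]\bigr|
   + \sum_{k=2}^{4}\bigl|\Pr[G_k=1]-\Pr[G_{k-1}=1]\bigr| \\
 &\quad + \bigl|\Pr[Ideal_{A,S}(\lambda)=1]-\Pr[G_4=1]\bigr| \\
 &\le 0 + Adv^{PRF}_{F_1,F_2,A}(\lambda) + 3\,negl(\lambda).
\end{align*}
```

Since F is assumed pseudo-random, the advantage term is itself negligible, and a finite sum of negligible functions is negligible.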

Results

To evaluate the performance of our scheme objectively, we design a series of experiments in this section. We analyze the experimental results and compare them with the existing phrase search schemes of Kissel & Wang (2013) and Ge et al. (2021). Our experiments are deployed on a local laptop with a Linux operating system, an Intel Core i7-8550 CPU, and 8 GB RAM. The experimental programs are developed in Python. For the pseudo-random function F and the hash function H in the algorithms, we use HMAC-SHA-256 and SHA-256, respectively. Additionally, we symmetrically encrypt files using AES-128, and the security parameter is set to 128 bits. To evaluate our scheme in practice, we employ the Enron email dataset (Cukierski, 2015), a real-world dataset comprising over 517 thousand documents. Using the Porter Stemmer, we extract more than 1.67 million keywords and eliminate irrelevant terms such as “of” and “the”.
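The two primitives named above are available in the Python standard library. The snippet below is a minimal sketch of how they might be instantiated; the function names `F` and `H` mirror the notation of the scheme, while the 16-byte key length is an assumption that matches the stated 128-bit security parameter.

```python
import hashlib
import hmac
import secrets


def F(key: bytes, data: bytes) -> bytes:
    """Pseudo-random function instantiated with HMAC-SHA-256 (32-byte output)."""
    return hmac.new(key, data, hashlib.sha256).digest()


def H(data: bytes) -> bytes:
    """Hash function instantiated with SHA-256 (32-byte output)."""
    return hashlib.sha256(data).digest()


key = secrets.token_bytes(16)  # assumed 128-bit key
token = F(key, b"infarction")
digest = H(b"ciphertext bytes")
print(len(token), len(digest))
```

AES-128 is not part of the standard library, so file encryption is left out of this sketch.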

Evaluation of IndexBuild

In this phase, the IoT device mainly completes the following work: (1) encrypt the files in the system into ciphertext; (2) generate the secure index for all the keywords; (3) calculate checklist of the ciphertext for verification.
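The three tasks can be sketched in simplified form as follows. This is not the paper's IndexBuild: ciphertexts are treated as given opaque bytes, keyword positions are kept in plaintext rather than encrypted with Paillier, no stemming is applied, and the names `keyword_token`, `build_index` and `STOP_WORDS` are hypothetical. It only shows the shape of the two outputs, a token-addressed positional index and a checklist of ciphertext hashes.

```python
import hashlib
import hmac
from collections import defaultdict

STOP_WORDS = {"of", "the"}  # the two examples given in the text


def keyword_token(key: bytes, word: str) -> str:
    """Index address for a keyword, derived with HMAC-SHA-256."""
    return hmac.new(key, word.encode(), hashlib.sha256).hexdigest()


def build_index(key: bytes, files: dict, ciphertexts: dict):
    """Return (index, checklist).

    index:     token -> {file_id: [positions]}  (positions unencrypted in this sketch)
    checklist: file_id -> SHA-256 of the file's ciphertext
    """
    index = defaultdict(dict)
    for file_id, text in files.items():
        for pos, word in enumerate(text.lower().split()):
            if word in STOP_WORDS:
                continue
            index[keyword_token(key, word)].setdefault(file_id, []).append(pos)
    checklist = {fid: hashlib.sha256(c).digest() for fid, c in ciphertexts.items()}
    return dict(index), checklist


key = b"k" * 16
files = {1: "acute myocardial infarction of the heart", 2: "infarction risk"}
ciphertexts = {1: b"c1", 2: b"c2"}
index, checklist = build_index(key, files, ciphertexts)
print(index[keyword_token(key, "infarction")])
```

The work is proportional to the total number of keyword occurrences across all files, which is consistent with the build time depending on both the number of files and the keywords per file.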

The performance of the scheme can be evaluated through the execution time of the algorithm IndexBuild, which we measure for different numbers of files and keywords. Figure 4A shows how the execution time of IndexBuild varies with the number of keywords in a single file when the number of files is 1,000, 2,000 and 3,000; Fig. 4B shows how it varies with the number of files in the system when the number of keywords in a single file is 10, 30 and 50. The execution time of IndexBuild is affected by both the number of files and the number of keywords: the more files there are and the more keywords each file contains, the longer IndexBuild takes.

Figure 4: (A and B) Time of IndexBuild.

Evaluation of TokenGen

Search trapdoors are generated by users. A trapdoor contains the permutation value of each keyword in the query phrase and the encrypted distance for every keyword except the first one. Figure 5 shows the time it takes to calculate a search trapdoor for different sizes of search phrases; the time increases with the size of the query phrase. This is expected, because the more keywords the phrase contains, the more inter-keyword distances need to be encrypted, which increases the trapdoor calculation time.

Figure 5: Time of trapdoor generation.

Evaluation of search

Similarly, we use execution time to evaluate the search efficiency of our scheme. Figure 6A shows how search time changes with query phrase size when the number of keywords contained in each file is 10, 30 and 50. Figure 6B shows how search time changes with query phrase size when the number of files is 1,000, 3,000 and 5,000.

Figure 6: (A and B) Time of search.

The results of Fig. 6 demonstrate that as the size of the query phrase grows or the number of documents grows, the search time will increase accordingly, which indicates that the more keywords in the phrase and the more files in the system, the more time it takes for the server to perform a phrase search. Furthermore, since we can use trapdoors to directly locate keywords in the inverted index, the search time is independent of the number of keywords contained in each file.
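The dependence on phrase size and file count, and the independence from the number of keywords per file, can be seen in a plaintext analogue of the search. This is only an illustrative sketch: it works on unencrypted positions, ties all keywords to one starting position, and the name `phrase_search` and the index layout (keyword to file to positions) are assumptions made for the example.

```python
def phrase_search(index: dict, phrase: str) -> list:
    """Return sorted file ids in which the words of `phrase` occur consecutively.

    index maps keyword -> {file_id: [positions]}.
    """
    words = phrase.split()
    if not words:
        return []
    # Direct lookup per query keyword: cost does not depend on other keywords in a file.
    postings = [index.get(w, {}) for w in words]
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    results = []
    for file_id in sorted(candidates):
        for start in postings[0][file_id]:
            # One flag bit per later keyword: is it exactly d positions after the first?
            flag = [(start + d) in postings[d][file_id] for d in range(1, len(words))]
            if all(flag):
                results.append(file_id)
                break
    return results


index = {
    "myocardial": {1: [3], 2: [0]},
    "infarction": {1: [4], 2: [5]},
}
print(phrase_search(index, "myocardial infarction"))
```

File 2 contains both words but not adjacently, so only a positional check separates it from a true phrase match.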

Evaluation of verification

The blockchain first performs the same operations as the server to find the files that match the search phrase, then obtains the corresponding hash values from the checklist B, and finally calculates the benchmark $\psi'$ with the multi-set hash function. During verification, the blockchain reaches its conclusion by comparing $\psi'$ with the aggregated search evidence $\psi$ returned by the server. Therefore, the verification time is related to the number of files that match the query phrase, and the experimental results are shown in Fig. 7. The verification time grows sub-linearly with the number of files and the size of the phrase.

Figure 7: Time of verification.

The gas consumption during the verification process is shown in Fig. 8. During verification, the blockchain performs multiset hash calculations on the hash values in the checklist that meet the requirements of the phrase search, so the gas consumption increases with the number of search results. When the number of resulting files is 5, the gas consumption is $1.6 \times 10^5$; when the number of files is 25, it is $6.8 \times 10^5$. A fivefold increase in files thus raises gas consumption by a factor of about 4.25, so gas consumption grows sublinearly.

Figure 8: Gas consumption for verification.

Comparison with existing schemes

We choose the schemes of Kissel & Wang (2013) and Ge et al. (2021), which offer similar functions, for comparison. The results are shown in Table 2, in which VPS and VPS-IoT denote the scheme of Kissel & Wang (2013) and the scheme of Ge et al. (2021), respectively. Both VPS and VPS-IoT adopt a two-stage search strategy, so it takes two rounds of interaction between the user and the cloud server to complete a phrase search. Additionally, in both schemes the verification evidence is calculated by the server. In this case, if the server is not trustworthy, the evidence may also be incorrect, which poses a serious threat to the reliability of verification.

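Table 2: Comparison with existing schemes.

| Scheme | Index building | Trapdoor generation | Query | Verification | Round |
|---|---|---|---|---|---|
| VPS | $O(MN)$ | $O(\lvert q\rvert\lvert F_w\rvert)$ | $O(\lvert F_w\rvert\lvert F_j\rvert\lvert q\rvert)$ | $O(\lvert q\rvert\lvert F_w\rvert)$ | 2 |
| VPS-IoT | $O(MN)+O(\sum_{j=1}^{MN}\lvert F_j\rvert)$ | $O(2\lvert q\rvert+\lvert F_w\rvert+\lvert q\rvert\lvert F_w\rvert)$ | $O(\lvert F_w\rvert\lvert F_j\rvert\lvert q\rvert+\lvert q\rvert)$ | $O(\lvert q\rvert\lvert F_q\rvert)$ | 2 |
| Our Scheme | $O(MN)$ | $O(\lvert q\rvert\lvert F_w\rvert)$ | $O(\lvert F_w\rvert\lvert F_j\rvert\lvert q\rvert)$ | $O(\lvert F_w\rvert)$ | 1 |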
DOI: 10.7717/peerj-cs.2235/table-2

Note:

|Fj| is the number of keywords contained in the file Fj; wj is a collection of different keywords in the file Fj; |q| is the number of query keywords; |Fw| is the number of files containing query keywords; |Fq| is the size of the returned result file.

The experimental results are shown in Figs. 9–12. The results in Fig. 9 show that VPS takes the least time to build the index, while VPS-IoT takes the most. The reason is that VPS lacks verification of file integrity, so its calculation cost is low. Both our scheme and VPS-IoT can verify the integrity of the file, but the structure of the lookup table in VPS-IoT is complex, requiring a large number of encryption and MAC operations on keyword positions, ciphertexts, etc., so it needs more time than our scheme. Figure 10 shows that our scheme achieves the highest efficiency in trapdoor generation. Both VPS and VPS-IoT adopt a two-phase query strategy in which the data user generates two trapdoors for a query, while our scheme needs only one trapdoor and is therefore more efficient. Figure 11 compares the query efficiency of the three schemes; the query is performed over 1,000 files, each containing 20 keywords. The complexity of the three schemes is almost the same, and the search time grows sub-linearly with the number of keywords in the phrase. As for verification efficiency, we deploy the three schemes on 50 files, and the experimental results are shown in Fig. 12. Scheme VPS-IoT performs best among the three schemes, but it cannot verify the integrity of the file. Our scheme takes less time than scheme VPS when the size of the query phrase becomes larger, which demonstrates the efficiency of our scheme. Furthermore, the verification is performed on the blockchain in our scheme, ensuring the reliability of verification.

Figure 9: (A and B) IndexBuild.

Figure 10: Trapdoor generation.

Figure 11: Phrase search.

Figure 12: Verification.

From what we have discussed above, our scheme has obvious advantages in index construction, trapdoor generation, and result verification compared with existing schemes, and the search efficiency is comparable to existing schemes. Furthermore, our scheme enables reliable and complete verification of search results with the help of blockchain, preventing the server from generating unreliable verification evidence due to only storing partial indexes and ciphertexts. At the same time, our scheme can prevent the unfair verification problem caused by malicious users forging verification results.

Discussion

In this article, we presented an efficient phrase search scheme with reliable verification over encrypted cloud-IoT data, which tackles the challenges of efficient phrase identification and reliable result verification. The scheme introduces the blockchain into verification, which ensures the reliability of both the verification evidence and the verification process. During the verification process, we use a multiset hash function to aggregate the on-chain evidence into a single hash value, which significantly reduces the blockchain transaction cost. In addition, the scheme designs a novel composite index and a distance discrimination algorithm that can quickly determine the order of keywords and achieve efficient identification of phrases, which reduces the computational and communication overhead.

Supplemental Information

The experimental code for this scheme.

This code includes algorithms such as index construction, trapdoor generation, search, and result validation.

DOI: 10.7717/peerj-cs.2235/supp-1

data.

Link to the dataset used in the paper

DOI: 10.7717/peerj-cs.2235/supp-2