Privacy-preserving k-NN interpolation over two encrypted databases

Murat Osmanoglu; Salih Demir; Bulent Tugrul

doi:10.7717/peerj-cs.965

Department of Computer Engineering, Ankara University, Ankara, Turkey

Published: 2022-05-31
Accepted: 2022-04-07
Received: 2021-09-24

Academic Editor: Shadi Aljawarneh

Subject Areas: Cryptography, Security and Privacy
Keywords: Big data, Cloud computing, Interpolation, k-nearest neighbour

Copyright: © 2022 Osmanoglu et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

Cite this article: Osmanoglu M, Demir S, Tugrul B. 2022. Privacy-preserving k-NN interpolation over two encrypted databases. PeerJ Computer Science 8:e965 https://doi.org/10.7717/peerj-cs.965

Abstract

Cloud computing enables users to outsource their databases and computing functionalities to a cloud service provider, avoiding the cost of maintaining private storage and computational resources. It also provides universal access to data, applications, and services without location dependency. While cloud computing provides many benefits, it also raises a number of security and privacy concerns. Outsourcing data to a cloud service provider in encrypted form may help to overcome these concerns. However, encrypted data makes it difficult for cloud service providers to perform some of the operations that query processing tasks require. Among the techniques employed in query processing, the k-nearest neighbor method draws attention due to its simplicity and efficiency, particularly on massive data sets. A number of k-nearest neighbor algorithms for query processing on a single encrypted database have been proposed. However, running k-nearest neighbor algorithms on a single database may create accuracy and reliability problems, since collaboration among different cloud service providers yields more accurate and more reliable results in query processing. We therefore focus on the k-nearest neighbor (k-NN) problem over two encrypted databases. We introduce a secure two-party k-NN interpolation protocol that enables a query owner to extract the interpolation of the k-nearest neighbors of a query point from two different databases outsourced to two different cloud service providers. We also show that our protocol protects the confidentiality of the data and the query point, and hides data access patterns. Furthermore, we conducted a number of experiments to demonstrate the efficiency of our protocol. The results show that the running time of our protocol is linearly dependent on both the number of nearest neighbours and data size.

Introduction

Due to its low cost, scalability and reliability, cloud computing has increased its reputation in both the business and scientific communities. In addition to the benefits, it introduces new concerns that need to be addressed carefully (Krutz & Vines, 2010). One of the emerging issues in cloud computing is extracting knowledge from sensitive data while protecting the privacy of data owners, which is called privacy-preserving data mining (Agrawal & Srikant, 2000; Vaidya & Clifton, 2004). A privacy-preserving data mining method aims to provide data privacy using either data perturbation or cryptographic methods. Data perturbation-based models struggle with data quality issues, i.e. valuable statistical information might be lost. This may yield less accurate and less reliable results. On the other hand, cryptography-based models achieve the privacy of data owners through the encryption of data before outsourcing it to the cloud. However, this makes it challenging to perform the required operations over the encrypted data.

In addition to these facts, collaboration among different cloud service providers may also help them to create more accurate and reliable results in a privacy-preserving data mining method, i.e. more clouds can discover more knowledge than they can uncover on their own when they combine their data (Demir & Tugrul, 2018). There are some studies that propose privacy-preserving solutions for horizontally-partitioned databases to increase the total number of data samples with the goal of creating more accurate data mining models (Inan et al., 2007). In some cases, vertically-partitioned database solutions can be preferred to increase the number of attributes for the same instances (Skillicorn & McConnell, 2008). Institutions such as hospitals operating in different parts of a country may prefer the first choice. On the other hand, institutions such as banks and insurance companies may aggregate their data using the second choice.

In this study, we will examine the k-NN interpolation method that preserves the confidentiality of two different databases stored by two different cloud service providers. k-NN, categorized as a lazy learner, is a non-parametric method used for classification, clustering and interpolation which utilizes the idea that neighboring objects possess or display similar characteristics. Complex interpolation methods such as Kriging involve advanced operations and thus pose a great challenge to cloud computing. In addition, the high time requirements of such methods make them unsuitable in some scenarios such as healthcare applications. In contrast, the simplicity and interpretability of the k-NN method make it an efficient tool for query processing tasks.

Our contribution

In this article, we introduce an efficient secure two-party k-NN (STPkNN) interpolation protocol that enables two different data owners to outsource their databases together with the query processing service to the cloud, and allows a query owner to extract the interpolation of the k-nearest neighbors of a query point from the encrypted databases. Our protocol preserves the confidentiality of data, assures the privacy of the user’s query point, and hides data access patterns.

The STPkNN protocol can be considered an extension to two-cloud settings of the SkNN_m protocol proposed in Elmehdwi, Samanthula & Jiang (2014), which enables a query owner to retrieve the k-nearest neighbors of a query point from a single encrypted database. Briefly, the SkNN_m protocol calculates the k-nearest neighbors iteratively by performing the following steps k times: (i) it finds the minimum of the Euclidean distances between the data records and the query point, (ii) it extracts the nearest neighbor that corresponds to the index of the minimum distance, and excludes the corresponding distance from the remaining Euclidean distances. In two-cloud settings, on the other hand, the clouds have to share their local minimums of the Euclidean distances to decide on the global minimum, which corresponds to the index of the current nearest neighbor across the two databases, and remove that record from further iterations. However, it is not trivial to achieve this without revealing to any cloud which data record corresponds to the global minimum.

To this end, we first propose two new security primitives, the Secure Transformation (ST) protocol and the Secure Bit-AND-OR (SBAOR) protocol, that enable the clouds to decide on the global minimum and exclude it from further calculations without revealing the data access pattern to any cloud. We show that both protocols protect the confidentiality of the input values, which are in encrypted form, i.e. no information about the input values is leaked to any party during the protocols, and the output is only revealed to one of the parties. Briefly, the ST protocol allows the servers to securely transform the encryption of a record under one public key into an encryption of the same record under another public key. The SBAOR protocol, given the encryptions of two bit vectors x and y, enables the servers to securely compute the negation of the logical disjunction of all bitwise multiplications x_i · y_i in encrypted form without revealing the bit vectors to any party.

By employing the ST and SBAOR protocols together with the other existing security protocols, we build our main protocol STPkNN that enables a query owner (QO) to extract the interpolation of the k-nearest neighbors of a query point chosen by QO from two different databases outsourced to two different cloud service providers. In the protocol, data owners encrypt their data before outsourcing them to the cloud service providers, and they do not participate in the STPkNN protocol. Thus, no information about the data is leaked to the cloud service providers during the protocol. Besides, our protocol guarantees that any record from both databases or any intermediate result generated in the protocol is not leaked to the cloud service providers. Also, it hides the data access pattern from both data owners and cloud service providers, i.e. the protocol does not reveal the information of which data records were used to produce the interpolation of k-nearest neighbors to any cloud service provider. On the other hand, the STPkNN protocol outputs the interpolation of k-nearest neighbors only to the query owner, and the query owner gets no information other than the interpolation.

We also conduct various experiments on two real-world datasets from the UCI machine learning repository, the cervical cancer (risk factors) dataset and the default of credit card clients dataset, to show the practicability of our protocol in real-world scenarios. The experimental evaluation shows that our protocol scales well to large datasets.

Related works

Due to its usefulness in many application scenarios such as classification, similarity search, and collaborative filtering, the problem of computing the k-nearest neighbors of a query point has gained a lot of attention in recent years. The early studies mostly focused on how to implement a secure k-NN method between data owner and clients without using cloud systems. Shaneck, Kim & Kumar (2009) proposed a privacy-preserving protocol that employs secure multiparty computation to compute k-NN in horizontally partitioned databases. They also showed how their protocol can be efficiently used in different applications such as outlier detection, classification, and clustering. Moreover, Qi & Atallah (2008) proposed a provably secure protocol for the single-step k-NN search problem that enjoys linear computation and communication complexity. Vaidya & Clifton (2005) introduced a privacy-preserving algorithm that performs top-k queries on vertically partitioned data. Additionally, Kantarcoğlu & Clifton (2004) proposed a method that privately calculates the k-NN classification over horizontally partitioned data in the distributed database model. Note that all of the above methods require the data owners to perform the necessary calculations to generate the result, and to return it directly to the query users. In our model, however, the data is outsourced to the cloud in encrypted form instead of being kept by the data owners, and all of the computation required to process k-NN queries is performed by the cloud.

Recent studies have mostly focused on solutions in cloud computing settings. Wong et al. (2009) proposed an asymmetric scalar-product-preserving encryption (ASPE) scheme that can be employed to construct a secure k-NN protocol. The protocol proposed in Wong et al. (2009) uses a distance comparison function instead of an exact distance calculation. However, the secret key in the protocol must be disclosed to the query users. Zhu, Huang & Takagi (2016) introduced a secure protocol that achieves k-NN query processing on encrypted data without totally revealing the data owner’s secret key to the query user. However, their scheme requires data owners to be involved in the encryption of query points. Hu et al. (2011) proposed a secure traversal framework that can be used, together with privacy homomorphism, to achieve a secure k-NN query processing protocol. Cheng et al. (2015) proposed a privacy-preserving protocol that employs an encrypted hierarchical index tree to perform k-NN queries over spatial data outsourced to the cloud in encrypted form. All three protocols (Hu et al., 2011; Zhu, Huang & Takagi, 2016; Cheng et al., 2015) leak the data access pattern to the cloud. On the other hand, Kesarwani et al. (2018) proposed a secure k-NN query processing protocol over encrypted data by utilizing a leveled fully homomorphic encryption scheme. Wu et al. (2019) introduced a privacy-preserving k-NN classification scheme over the encrypted cloud database that is secure against known-plaintext attacks. Besides, Lei et al. (2020) shed light on the connection between a secure k-NN query processing scheme and a secure range query scheme. Based on this connection, they utilize a secure range query scheme together with a data structure named random Bloom filter to build a secure k-NN query processing scheme. All three protocols (Kesarwani et al., 2018; Wu et al., 2019; Lei et al., 2020) hide the data access pattern while preserving data privacy and query privacy. However, they require the decryption keys to be given to the query users. In our model, by contrast, the decryption keys are not shared with the query users.

On the other hand, Elmehdwi, Samanthula & Jiang (2014) tackled the same problem using a homomorphic encryption method. In addition to ensuring the confidentiality of data owners and clients, the protocol proposed in Elmehdwi, Samanthula & Jiang (2014) also hides data access patterns from the clouds. Moreover, Xu et al. (2017) proposed an efficient secure k-NN protocol which achieves sublinear computational complexity. Similar to Elmehdwi, Samanthula & Jiang (2014), their protocol also hides data access patterns, using garbled circuits to simulate Oblivious RAM. Furthermore, Guo & Sun (2020) adopted the R-tree data structure to build an efficient k-NN scheme that requires only two rounds of interaction between the client and cloud servers to generate the result. They also utilized Merkle hash tree techniques to obtain a k-NN scheme that is secure even against malicious cloud servers. There are also some studies that address location-based query processing over encrypted geospatial data (Lei et al., 2019; Lian et al., 2020). Lian et al. (2020) proposed an efficient k-NN scheme that employs Moore curves together with the AES encryption scheme to ensure spatial data and location privacy.

The aforementioned studies use k-NN methods for either classification or query search applications. Unlike previous solutions, Kalideen, Osmanoglu & Tugrul (2019) proposed an efficient solution for the problem of computing the interpolation of the k-NN of a given point in cloud computing settings. However, their solution reveals to the cloud servers which data records were used to produce the interpolation, and leaking such information might not be acceptable in applications that require data security. Unlike the protocol presented in Kalideen, Osmanoglu & Tugrul (2019), our protocol assures the desired security features, i.e. it hides the data access pattern.

Problem formulation

In this section, we give a more precise definition of the problem and its security requirements.

Secure two-party k-NN interpolation problem

In our system, there are two data owners DO₁ and DO₂ holding two different spatial databases D₁ and D₂, respectively. Each database D_u consists of n records $d_{1}^{(u)}, \dots, d_{n}^{(u)}$ such that each record $d_{i}^{(u)}$ is an m-dimensional spatial vector, i.e. $d_{i}^{(u)} = ⟨ d_{i, 1}^{(u)}, \dots, d_{i, m}^{(u)} ⟩$ where u = 1,2. There are also two cloud pairs (CSP₁^(u), CSP₂^(u)) so that each one is associated with a public key-secret key pair (pk_u, sk_u) of a public key encryption scheme that is semantically secure (Goldwasser & Micali, 1982). As in most studies in this field, we consider each pair of cloud service providers (CSP₁^(u), CSP₂^(u)) as two non-colluding cloud servers, i.e. CSP₁^(u) stores the database and performs most of the homomorphic operations; on the other hand, CSP₂^(u) keeps the secret key and helps CSP₁^(u) to perform the complex operations over the ciphertexts.

In our problem, we assume that each data owner DO_u initially encrypts his database D_u as $E_{p k_{u}} (D_{u})$ where $E_{p k_{u}} (D_{u})$ consists of the attribute-wise encryptions $E_{p k_{u}} (d_{i, j}^{(u)})$ for 1 ≤ i ≤ n and 1 ≤ j ≤ m. Each DO_u then outsources $E_{p k_{u}} (D_{u})$ together with the query processing service to CSP₁^(u). Note that the underlying public key encryption scheme should enable cloud servers to perform homomorphic operations over ciphertexts.

There is also an authorized query owner QO who wants to retrieve the interpolation of k-nearest neighbors of a query point Q from both databases D₁ and D₂ stored in CSP₁⁽¹⁾ and CSP₁⁽²⁾, respectively. After QO requests the interpolation, the cloud service providers generate the result by performing the required operations over the encrypted databases. This process should output the interpolation of k-nearest neighbors only to the query owner. The query owner should not learn any information other than the interpolation during this process. We call such a process a secure two-party k-nearest neighbors (STPkNN) protocol. We remark that the STPkNN protocol should preserve the confidentiality of the records in the databases D₁ and D₂, and protect the privacy of the query point. Moreover, the protocol should hide data access patterns, i.e. it should not reveal to any data owner or any cloud service provider which data records were used to produce the interpolation of k-nearest neighbors.
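
As a plaintext point of reference, the functionality that an STPkNN protocol is meant to realize can be sketched as below. This is only an illustration under stated assumptions: nothing is encrypted, so none of the privacy guarantees apply; the function names are hypothetical; the interpolation is assumed to be the unweighted attribute-wise mean of the selected neighbors; and both databases are assumed to be non-empty with k not exceeding their combined size. The loop follows the iterative idea described earlier: each side proposes its local minimum, the smaller one is taken as the global minimum, and that record is excluded from later rounds.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_party_knn_interpolation(db1, db2, query, k):
    """Plaintext reference: mean of the k records closest to `query` across both databases."""
    databases = (db1, db2)
    # Squared distances of every record to the query, kept per database.
    dists = [[sq_dist(rec, query) for rec in db] for db in databases]
    neighbours = []
    for _ in range(k):
        # Each side determines the index of its local minimum ...
        local = [min(range(len(d)), key=d.__getitem__) for d in dists]
        # ... and the smaller of the two local minima is the global minimum.
        u = 0 if dists[0][local[0]] <= dists[1][local[1]] else 1
        neighbours.append(databases[u][local[u]])
        dists[u][local[u]] = float("inf")  # exclude it from later rounds
    m = len(query)
    return tuple(sum(rec[j] for rec in neighbours) / k for j in range(m))


# Hypothetical toy data: two small two-dimensional databases.
db1 = [(0, 0), (10, 10)]
db2 = [(2, 2), (1, 3)]
result = two_party_knn_interpolation(db1, db2, query=(1, 1), k=3)
print(result)
```

In the actual protocol, every quantity in this loop would be handled in encrypted form, and which record wins each round must stay hidden from all servers.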

Example

In 2016, the European Union adopted a new regulation on the protection of personal data, Regulation (EU) 2016/679 of the European Parliament and of the Council (European-Parliament, 2016). The regulation states that ‘the protection of natural persons in relation to the processing of personal data is a fundamental right’. All personal health records that reveal information relating to the past, current or future physical or mental health status of the data subject are considered sensitive personal data in the regulation. Therefore, personal health records should be protected against unauthorized parties, i.e. only those approved by the owner should be able to access the data.

On the other hand, the processing of health data may be significant to advance research or healthcare practices. Consider a doctor who tries to determine whether a person has a particular heart disease by analyzing the medical records of the person. In addition, the doctor may wish to compare the patient’s medical records with those of other patients presenting similar properties in order to improve diagnostic accuracy. In fact, this comparison enables the doctor to evaluate the validity of some tests, especially when the scores do not match the expected values. Consequently, the doctor can make an accurate diagnosis if he is allowed to reach the data of other patients in the same region or across the country. Moreover, if the personal health records are stored in the cloud in encrypted form so as not to violate the fundamental right of their owners, it will be possible to perform reliable analysis on large datasets.

Let us clarify it with an example. Consider the subset of the heart disease data set from the UCI Machine Learning Repository depicted in Table 1. There are 10 different instances shown in the table, and each instance is associated with five attributes: ID (patient’s identity), trestbps (resting blood pressure in mm Hg), chol (serum cholesterol in mg/dl), thalach (maximum heart rate achieved), and oldpeak (ST depression induced by exercise relative to rest). Assume the data owner, which can be viewed as a hospital in this context, encrypts these attributes, and outsources the encrypted database E_pk(D) together with the future query processing to the cloud. Also, assume there is a doctor who wants to determine whether a specific patient carries a risk of a particular heart disease. Let the medical record of the patient be $Q = ⟨ 150, 250, 145, 3 ⟩$ . The doctor, who is the query owner in our context, requests the interpolation of the k-nearest neighbors of Q from the cloud by providing the encryption E_pk(Q) to the cloud. Then, the cloud determines the interpolation of the k-nearest neighbors by searching the encrypted database E_pk(D). For simplicity, let k be 3 for this example. The instances with IDs 1, 7, and 9 are the 3 nearest neighbors to Q. So, the cloud returns the interpolation $T = ⟨ 138.3, 251.7, 152.3, 2.4 ⟩$ to the doctor, who can use T to make an accurate diagnosis. Consequently, the necessary analysis can be carried out without revealing any sensitive information about either this patient or the other patients.

Table 1:

A subset of the heart disease data set.

ID	trestbps	chol	thalach	oldpeak
1	145	233	150	2.3
2	160	286	108	1.5
3	120	229	129	2.6
4	130	250	187	3.5
5	130	204	172	1.4
6	120	236	178	0.8
7	140	268	160	3.6
8	120	354	163	0.6
9	130	254	147	1.4
10	140	203	155	3.1

DOI: 10.7717/peerj-cs.965/table-1
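
The worked example can be checked in plaintext with a few lines of code. The sketch below assumes that the interpolation is the unweighted attribute-wise mean of the k nearest records under squared Euclidean distance; the function name is hypothetical, and the data are the rows of Table 1 as listed. With these values, the three nearest records to Q are IDs 1, 9 and 7, and the means come out as roughly ⟨138.3, 251.7, 152.3, 2.4⟩.

```python
# Rows of Table 1: ID -> (trestbps, chol, thalach, oldpeak).
TABLE_1 = {
    1: (145, 233, 150, 2.3),
    2: (160, 286, 108, 1.5),
    3: (120, 229, 129, 2.6),
    4: (130, 250, 187, 3.5),
    5: (130, 204, 172, 1.4),
    6: (120, 236, 178, 0.8),
    7: (140, 268, 160, 3.6),
    8: (120, 354, 163, 0.6),
    9: (130, 254, 147, 1.4),
    10: (140, 203, 155, 3.1),
}
Q = (150, 250, 145, 3)


def knn_interpolate(records, query, k):
    """Return the IDs of the k nearest records and their attribute-wise mean."""
    def sq_dist(rec):
        return sum((a - b) ** 2 for a, b in zip(rec, query))

    nearest = sorted(records, key=lambda rid: sq_dist(records[rid]))[:k]
    mean = tuple(
        sum(records[rid][j] for rid in nearest) / k for j in range(len(query))
    )
    return nearest, mean


nearest_ids, T = knn_interpolate(TABLE_1, Q, k=3)
print(nearest_ids, [round(v, 1) for v in T])
```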

Effect of collaboration on interpolation accuracy

Aguilar et al. (2005) stated that if interpolation models are developed with an insufficient amount of data, they will be less accurate and reliable. In other words, collaboration between participants affects the accuracy of interpolation models. Here we conduct a series of experiments to assess the impact of collaboration between participants on the prediction accuracy of interpolation methods. In our experiments, we employ two publicly available datasets from the U.S. National Geochemical Survey Database that present the sodium (Na) content of the soil in two states: Colorado and Wisconsin. Summary statistics of both data sets are presented in Table 2.

Table 2:

Summary statistics of data sets.

	Colorado	Wisconsin
Mean	0.999	0.773
Median	0.912	0.771
Minimum	0.063	0.007
Maximum	3.230	2.043
Standard Deviation	0.460	0.299
Skewness	0.912	−0.177
Kurtosis	4.034	3.131

DOI: 10.7717/peerj-cs.965/table-2

There are various performance evaluation metrics for interpolation methods. Here we employ Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), which are often chosen as evaluation metrics for numerical prediction. Smaller values of both MAE and RMSE indicate more accurate predictions. MAE and RMSE values are calculated as follows:

(1) $M A E = [\frac{1}{n} \sum_{i = 1}^{n} | p (x_{i}, y_{i}) - a (x_{i}, y_{i}) |]$

(2) $R M S E = {[\frac{1}{n} \sum_{i = 1}^{n} {(p (x_{i}, y_{i}) - a (x_{i}, y_{i}))}^{2}]}^{1 / 2}$

where n is the total number of data points in the dataset, and p(x_i, y_i) and a(x_i, y_i) are the predicted and actual values at location (x_i, y_i), respectively. The effects of varying k values on MAE and RMSE values are shown in Fig. 1.
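
Equations (1) and (2) translate directly into code. The following is a minimal sketch with hypothetical function names and made-up sample values; the predicted and actual values are passed as two equal-length lists, one entry per location.

```python
import math


def mae(predicted, actual):
    """Mean Absolute Error, Eq. (1)."""
    n = len(predicted)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / n


def rmse(predicted, actual):
    """Root Mean Squared Error, Eq. (2)."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


# Made-up values purely for illustration.
predicted = [1.0, 2.0, 4.0]
actual = [1.0, 3.0, 2.0]
print(mae(predicted, actual), rmse(predicted, actual))
```

Because RMSE squares the residuals before averaging, it penalizes large individual errors more heavily than MAE, which is why the two metrics are usually reported together.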

Figure 1: Effects of varying k values on MAE and RMSE for Colorado and Wisconsin data sets.

DOI: 10.7717/peerj-cs.965/fig-1

We assume that two data holders together hold all data points in the data set. Both data sets are randomly divided into two parts by sampling without replacement, with each party holding one of the pieces. In some situations, data holders may not have data in equal proportions. We therefore consider different sharing ratios to cover the cases where the distribution is unequal. We specify the β value as the distribution ratio, which means that if one party holds a β portion of the data, the other party holds the remaining portion (1 − β). After several trials, the MAE and RMSE values obtained for various numbers of nearest neighbors are shown in Tables 3 and 4, respectively.
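
The random division described above can be sketched as follows. The function name, the `seed` parameter and the rounding of the cut point are assumptions made for this illustration; the source only states that the split uses sampling without replacement with ratio β.

```python
import random


def split_by_ratio(points, beta, seed=None):
    """Randomly split `points` into a beta share and a (1 - beta) share."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so this is a random permutation.
    shuffled = rng.sample(points, len(points))
    cut = round(beta * len(points))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical usage: 100 dummy points, one party holding 25% of them.
party_a, party_b = split_by_ratio(list(range(100)), beta=0.25, seed=1)
print(len(party_a), len(party_b))
```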

Table 3:

Effects of collaboration for varying k and β (split ratio between parties) values on MAE.

	Wisconsin, β (%)				Colorado, β (%)
k	25	50	75	100	25	50	75	100
1	0.183	0.169	0.163	0.155	0.319	0.298	0.290	0.283
5	0.156	0.147	0.146	0.145	0.258	0.241	0.235	0.232
10	0.158	0.145	0.143	0.139	0.258	0.242	0.234	0.227
15	0.161	0.147	0.142	0.140	0.259	0.245	0.236	0.232
30	0.167	0.154	0.149	0.143	0.269	0.252	0.244	0.238
50	0.181	0.161	0.154	0.150	0.286	0.262	0.252	0.246

DOI: 10.7717/peerj-cs.965/table-3

Table 4:

Effects of collaboration for varying k and β (split ratio between parties) values on RMSE.

	Wisconsin, β (%)				Colorado, β (%)
k	25	50	75	100	25	50	75	100
1	0.254	0.243	0.241	0.233	0.443	0.421	0.408	0.402
5	0.208	0.199	0.199	0.198	0.361	0.334	0.327	0.321
10	0.212	0.196	0.193	0.188	0.358	0.337	0.325	0.317
15	0.215	0.198	0.193	0.189	0.358	0.341	0.329	0.323
30	0.224	0.206	0.200	0.195	0.368	0.348	0.338	0.331
50	0.238	0.214	0.207	0.202	0.385	0.360	0.347	0.340

DOI: 10.7717/peerj-cs.965/table-4

As seen from the results, the smallest MAE and RMSE values are observed when the 10 nearest neighbors are used and all points in each data set are available (k = 10, β = 100 in Tables 3 and 4). As seen from Table 3, if only half of the data is available for creating a prediction model (β = 50), the MAE at k = 10 deteriorates by 4.31% for the Wisconsin data set and 6.60% for the Colorado data set. Similar behavior can be observed in Table 4 for each split ratio. Across the results, the data holder with less data produces less accurate predictions. Conversely, if there is a sufficient amount of data, the predictions generated by the model are more accurate and reliable.
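
The deterioration percentages quoted for Table 3 can be reproduced from the k = 10 row, assuming they are the relative increase in MAE when moving from the full data set (β = 100) to half of it (β = 50). The helper name below is hypothetical.

```python
def relative_deterioration(partial, full):
    """Percentage increase of an error metric relative to the full-data value."""
    return 100.0 * (partial - full) / full


# MAE values from the k = 10 row of Table 3 (beta = 50 versus beta = 100).
wisconsin = relative_deterioration(partial=0.145, full=0.139)
colorado = relative_deterioration(partial=0.242, full=0.227)
print(round(wisconsin, 1), round(colorado, 1))
```

These evaluate to roughly 4.3% and 6.6%, in line with the figures reported in the text.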

Preliminaries

In this section, we will present the notations and the definitions of some primitives that will be used in our proposed protocols.

Notation

We here give the notations used in this paper.

n, the number of records in each database
m, the number of the attributes in each record
$ℓ$ , the domain size (in bits) of the squared Euclidean distance
DO_u, the u^th data owner
D_u, the u^th database
$d_{i}^{(u)}$ , the i^th record of the database D_u
QO, the query owner
Q, the query point
[x], the encryption of the individual bits of x
$d_{m i n_{p}}$ , the p^th closest record to Q
(pk_u, sk_u), the public key-secret key pair assigned to the u^th cloud pair
(CSP₁^(u), CSP₂^(u)), the u^th cloud pair, i.e. the former holds the encryption of the database $E_{p k_{u}} (D_{u})$ and the latter holds the corresponding secret key sk_u.

Homomorphic encryption

Homomorphic encryption is an encryption scheme that allows users to perform some mathematical operations, such as addition and multiplication, on ciphertexts. This property protects the confidentiality of the data and makes the encryption scheme a very practical and useful tool in cloud computing, especially for sensitive data. For that reason, homomorphic encryption schemes have been gaining a lot of attention in recent years, and many such schemes have been proposed (Goldwasser & Micali, 1982; Elgamal, 1984; Boneh, Goh & Nissim, 2005). In this study we use a well-known homomorphic encryption system, the Paillier scheme, to construct our protocols.

Let E_pk(·) be the encryption function with the public key pk and D_sk(·) be the decryption function with the secret key sk. For any given two plaintexts a and b, the Paillier scheme satisfies the following properties:

Addition: D_sk(E_pk(a + b)) = D_sk(E_pk(a) * E_pk(b) mod N²)
Multiplication: D_sk(E_pk(a * b)) = D_sk(E_pk(a)^b mod N²)

Note that the Paillier encryption scheme is semantically secure (Paillier, 1999).
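
The two homomorphic properties can be checked on a toy instance of the Paillier scheme. The sketch below is for illustration only: the primes are tiny and hard-coded (so it offers no security), the generator is fixed to g = N + 1 as a common simplification, and all names are hypothetical. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random

# Toy parameters: two small primes, for illustration only (NOT secure).
P, Q = 1789, 1861
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)  # modular inverse of LAM; valid for generator g = N + 1


def encrypt(m):
    """E_pk(m) = (1 + N)^m * r^N mod N^2 for a random r coprime to N."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    # (1 + N)^m mod N^2 equals 1 + m*N mod N^2 by the binomial theorem.
    return ((1 + m * N) % N2) * pow(r, N, N2) % N2


def decrypt(c):
    """D_sk(c) = L(c^LAM mod N^2) * MU mod N, with L(u) = (u - 1) // N."""
    return ((pow(c, LAM, N2) - 1) // N) * MU % N


a, b = 17, 25
sum_ct = encrypt(a) * encrypt(b) % N2   # multiplying ciphertexts adds plaintexts
prod_ct = pow(encrypt(a), b, N2)        # exponentiation by b multiplies the plaintext by b
print(decrypt(sum_ct), decrypt(prod_ct))
```

Note that the multiplication property needs one factor in the clear (the exponent b). Multiplying two encrypted values requires interaction with the key holder, which is what the secure multiplication protocol below provides.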

Basic security primitives

Here, we briefly explain a set of basic security protocols. In these protocols, it is assumed that two semi-honest parties P₁ and P₂ participate, and that the Paillier secret key is known to only one of them. We will also introduce two new security primitives in “Construction” that will be employed together with the basic primitives given here as building blocks in forming our construction.

Secure multiplication (SM) protocol

Consider two parties P₁ and P₂ such that the former holds (E_pk(x), E_pk(y)) and the latter holds the secret key sk, where x and y are not known to either party. The protocol outputs E_pk(x * y) to P₁. Note that the output E_pk(x * y) is only known to P₁, and no information about x and y is revealed to any party during the protocol.

Secure squared Euclidean distance (SSED) protocol

The protocol considers two parties P₁ and P₂ with the inputs (E_pk(X), E_pk(Y)) and the secret key sk, respectively, and outputs E_pk(|X − Y|²) to P₁, where X and Y are m dimensional vectors. In the protocol, the encryption of squared Euclidean distance E_pk(|X − Y|²) is only known to P₁.
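
The SM and SSED descriptions above only state inputs and outputs. One possible way to realize them is sketched below, under several assumptions: a toy Paillier instance with tiny hard-coded primes (not secure), additive blinding of the inputs before they are sent to P₂, and hypothetical function names. Both roles run in one process here, so this shows the arithmetic rather than the communication.

```python
import math
import random

# Toy Paillier instance (tiny primes, NOT secure), generator g = N + 1.
P, Q = 1789, 1861
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)


def enc(m):
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return ((1 + m * N) % N2) * pow(r, N, N2) % N2


def dec(c):
    return ((pow(c, LAM, N2) - 1) // N) * MU % N


def p2_multiply(c1, c2):
    """P2's step: decrypt two blinded values, multiply them, re-encrypt."""
    return enc(dec(c1) * dec(c2) % N)


def secure_multiply(cx, cy):
    """P1's steps: obtain E(x*y) from E(x), E(y) while P2 only sees blinded values."""
    rx, ry = random.randrange(N), random.randrange(N)
    blinded_x = cx * enc(rx) % N2                    # E(x + rx)
    blinded_y = cy * enc(ry) % N2                    # E(y + ry)
    ch = p2_multiply(blinded_x, blinded_y)           # E((x + rx)(y + ry))
    ch = ch * pow(cx, N - ry, N2) % N2               # remove x*ry
    ch = ch * pow(cy, N - rx, N2) % N2               # remove y*rx
    return ch * pow(enc(rx * ry % N), N - 1, N2) % N2  # remove rx*ry


def secure_sq_distance(cX, cY):
    """E(|X - Y|^2) from attribute-wise encryptions of X and Y."""
    acc = enc(0)
    for cx, cy in zip(cX, cY):
        diff = cx * pow(cy, N - 1, N2) % N2          # E(x - y)
        acc = acc * secure_multiply(diff, diff) % N2  # add (x - y)^2
    return acc


X, Y = [150, 250, 145, 3], [145, 233, 150, 2]
c_dist = secure_sq_distance([enc(v) for v in X], [enc(v) for v in Y])
print(dec(c_dist))
```

The final `dec` call is only there to check the result; in the protocols, P₁ never holds the secret key and keeps working with the ciphertext.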

Secure bit-decomposition (SBD) protocol

The protocol considers P₁ with the input E_pk(x) and P₂ with the secret key sk, and outputs the encryptions of the bit-decomposition of x as $[x] = ⟨ E_{p k} (x_{1}), \dots, E_{p k} (x_{ℓ}) ⟩$ , where $0 \leq x < 2^{ℓ}$ . Note that the encrypted bit-decomposition [x] is known only to P₁.

Secure minimum (SMIN) protocol

In the protocol, P₁ with the inputs ([x], [y]) and P₂ with the secret key sk securely compute the encryption of individual bits of minimum between x and y as [min(x,y)]. Note that the output [min(x,y)] is only known to P₁, and no information about x and y is revealed to any party during the protocol.

Secure minimum out of n numbers (SMIN_n) protocol

In the protocol, P₁ with the inputs ([x₁],…,[x_n]) and P₂ with the secret key sk securely compute [min(x₁,…,x_n)], where [min(x₁,…,x_n)] is the encryption of the individual bits of min(x₁,…,x_n). Note that the output [min(x₁,…,x_n)] is only known to P₁, and no information about x_i for any i is revealed to any party during the protocol.

Secure Bit-OR (SBOR) protocol

Consider two parties P₁ and P₂ such that the former holds (E_pk(a), E_pk(b)) and the latter holds the secret key sk, where a and b are two bits. The protocol outputs $E_{p k} (a \lor b)$ to P₁. The output $E_{p k} (a \lor b)$ is only known to P₁, and no information about a and b is revealed to any party during the protocol.

Since we do not aim to study the existing protocols given above, we simply adopt their most efficient implementations, which were presented in Elmehdwi, Samanthula & Jiang (2014) and Samanthula, Hu & Jiang (2013). However, the implementation of the SMIN_n protocol given in Elmehdwi, Samanthula & Jiang (2014) fails for some inputs, i.e. it generates an incorrect output if the size of the input is n = 8k + 1 for some k ∈ Z. To illustrate, assume the protocol takes nine inputs ([x₁],…,[x₉]). At the last step, the protocol applies the SMIN protocol to the intermediate values [x′₁] and [x′₇] (the encryptions of the local minimums), and outputs the encryption of 0 since [x′₇] was set to the encryption of zero at a previous step. Therefore, independent of the inputs, the protocol always outputs the encryption of zero as the final output when the size of the input is n = 8k + 1 for some k ∈ Z. Thus, we develop a new implementation of the SMIN_n protocol that works as follows:

The server P₁ initially executes the SMIN protocol together with P₂ on [x₁] and [x₂] to get [R₁] = [min(x₁,x₂)] as the encryption of the individual bits of min(x₁,x₂),
it then iteratively runs the SMIN protocol together with P₂ on [R_{i−1}] and [x_{i+1}] to get [R_i] = [min(R_{i−1},x_{i+1})] as the encryption of the individual bits of min(x₁,x₂,…,x_{i+1}) for i = 2,…,(n − 1).

Note that the final output of the iterative steps will be [R_n−1] = [min(x₁,…,x_n)], which is the encryption of the individual bits of min(x₁,…,x_n).
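The iterative construction amounts to a left fold of pairwise minima. The following is a minimal plaintext sketch of that control flow only: `smin` is a hypothetical stand-in for the two-party SMIN protocol (which in reality operates on bit-wise encryptions), and the input values are made up. It shows that n − 1 SMIN calls suffice and that an input size of the form n = 8k + 1 is handled like any other.

```python
def smin(a, b):
    """Plaintext stand-in for the two-party SMIN protocol (assumption of this sketch)."""
    return a if a <= b else b


def smin_n(values):
    """Chain SMIN over the inputs: R_1 = min(x_1, x_2), R_i = min(R_{i-1}, x_{i+1})."""
    if len(values) < 2:
        raise ValueError("need at least two inputs")
    result = smin(values[0], values[1])
    calls = 1
    for x in values[2:]:
        result = smin(result, x)
        calls += 1
    return result, calls


# Nine inputs, i.e. n = 8k + 1 with k = 1.
xs = [12, 7, 30, 5, 19, 8, 5, 41, 9]
minimum, calls = smin_n(xs)
```

The price of this simplicity is that the n − 1 SMIN calls are sequential, which is consistent with the $O(ℓ \cdot n)$ cost stated for SMIN_n in the complexity analysis.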

Construction

In this section, we first introduce two new security primitives: the Secure bit-AND-OR (SBAOR) protocol and the Secure Transformation (ST) protocol. We then give the security analysis of these protocols. By utilizing the SBAOR and ST protocols together with the basic security primitives given in “Preliminaries”, we construct our main protocol. Furthermore, we give the security analysis of the main protocol and discuss its computation complexity at the end of this section.

Secure bit-AND-OR (SBAOR) protocol

The SBAOR protocol allows the servers to securely compute the negation of the logical disjunction of all bitwise multiplications x_i · y_i in encrypted form without revealing the bit vectors to any party. In the main protocol, it will help the servers to separate the index of the current closest record to the query point from all other records of both databases by assigning the encryption of 1 to that particular index and the encryption of 0 to all other indices. In this way, the servers will be able to calculate the current closest record, and remove the corresponding index from further calculations.

The protocol considers two parties P₁ and P₂ such that the former holds ([x], [y]) and the latter holds the secret key sk, where [x] and [y] are the encryption of individual bits of x and y. The protocol enables the parties P₁ and P₂ to securely compute the encryption $E_{p k} (\bar{A})$ where $\bar{A} = 1 - A$ and $A = (x_{1} \cdot y_{1}) \lor (x_{2} \cdot y_{2}) \lor \dots \lor (x_{ℓ} \cdot y_{ℓ})$ . The output $E_{p k} (\bar{A})$ is only known to P₁, and no information about x and y is revealed to any party during the protocol.

In the protocol, P₁ and P₂ first run the SM protocol on the inputs E_pk(x_i) and E_pk(y_i) to calculate E_pk(x_i * y_i) for $i \in [ℓ]$ where x_i and y_i are the i-th bits of x and y, respectively. Note that each E_pk(x_i * y_i) is only revealed to P₁. The server P₁ then calculates $E_{p k} (x_{1} \cdot y_{1} \lor \dots \lor x_{ℓ} \cdot y_{ℓ})$ as follows:

it initially executes the SBOR protocol together with P₂ on E_pk(x₁ * y₁) and E_pk(x₂ * y₂) to get $E_{p k} (R_{1}) = E_{p k} (x_{1} * y_{1} \lor x_{2} * y_{2})$ ,
it then iteratively runs the SBOR protocol together with P₂ on $E_{p k} (R_{i - 1})$ and $E_{p k} (x_{i + 1} * y_{i + 1})$ to get $E_{p k} (R_{i}) = E_{p k} (R_{i - 1} \lor x_{i + 1} * y_{i + 1})$ for $i = 2, \dots, ℓ - 1$ .

Note that the final output of the iterative steps will be $E_{p k} (R_{ℓ - 1}) = E_{p k} (x_{1} \cdot y_{1} \lor \dots \lor x_{ℓ} \cdot y_{ℓ})$ . Finally, P₁ applies the equation $E_{p k} (\bar{R_{ℓ - 1}}) = E_{p k} (1) * E_{p k} (R_{ℓ - 1})^{N - 1}$ to compute the final output.

Algorithm 1: SBAOR.

Input: ([x], [y]) from P₁ and sk from P₂
Output: $E_{pk} (\overline{x \cdot y})$ to P₁

1. P₁ and P₂:
   for i = 1 to ℓ do
      $E_{pk} (x_{i} * y_{i}) \leftarrow SM (E_{pk} (x_{i}), E_{pk} (y_{i}))$;
2. P₁ and P₂:
   $E_{pk} (R_{1}) \leftarrow SBOR (E_{pk} (x_{1} * y_{1}), E_{pk} (x_{2} * y_{2}))$;
   for i = 2 to ℓ − 1 do
      $E_{pk} (R_{i}) \leftarrow SBOR (E_{pk} (R_{i - 1}), E_{pk} (x_{i + 1} * y_{i + 1}))$;
3. P₁:
   $E_{pk} (\overline{x \cdot y}) \leftarrow E_{pk} (1 - R_{ℓ - 1}) \leftarrow E_{pk} (1) \times E_{pk} (R_{ℓ - 1})^{N - 1}$;

DOI: 10.7717/peerj-cs.965/table-5
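The three steps of Algorithm 1 can be traced with a toy example. The sketch below is illustrative only and rests on several assumptions that are not part of the paper: it uses a textbook Paillier instance with tiny, insecure primes and generator g = N + 1, and the helpers `sm` and `sbor` are hypothetical stand-ins that simply decrypt and re-encrypt, whereas the real SM and SBOR protocols are interactive and blinded. Only the final line of `sbaor` is the genuine homomorphic step $E_{pk}(1) \times E_{pk}(R_{ℓ-1})^{N-1}$.

```python
import math
import random


# Toy Paillier with tiny primes: for illustration only, not secure.
def keygen(p, q):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))  # generator g = n + 1


def enc(n, m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(n + 1, m, n * n) * pow(r, n, n * n) % (n * n)


def dec(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def sm(n, sk, ca, cb):
    """Stand-in for Secure Multiplication: decrypts directly, unlike the real protocol."""
    return enc(n, dec(n, sk, ca) * dec(n, sk, cb) % n)


def sbor(n, sk, ca, cb):
    """Stand-in for Secure Bit-OR on two encrypted bits."""
    return enc(n, dec(n, sk, ca) | dec(n, sk, cb))


def sbaor(n, sk, cx, cy):
    """Return E(1 - A), where A is the OR of all bit products x_i * y_i."""
    prods = [sm(n, sk, a, b) for a, b in zip(cx, cy)]  # step 1
    r = prods[0]
    for c in prods[1:]:                                # step 2
        r = sbor(n, sk, r, c)
    return enc(n, 1) * pow(r, n - 1, n * n) % (n * n)  # step 3: E(1) * E(R)^(N-1)


n, sk = keygen(293, 433)
ones = [enc(n, 1) for _ in range(4)]
zero_vec = [enc(n, b) for b in (0, 0, 0, 0)]
nonzero_vec = [enc(n, b) for b in (0, 1, 0, 1)]
out_zero = dec(n, sk, sbaor(n, sk, zero_vec, ones))
out_nonzero = dec(n, sk, sbaor(n, sk, nonzero_vec, ones))
```

With y set to the all-ones vector, as in the main protocol, the output bit is 1 exactly when every bit of x is 0, which is how SBAOR acts as a zero test.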

Security Analysis of SBAOR: At the beginning of the protocol, the servers P₁ and P₂ execute the Secure Multiplication protocol. As emphasized in “Preliminaries”, the output of the protocol is only revealed to the server P₁, and no information about the plaintexts x_i and y_i is revealed to any party during this protocol. Later, the servers run the Secure Bit-OR (SBOR) protocol on the inputs $E_{p k} (R_{i - 1})$ and $E_{p k} (x_{i + 1} * y_{i + 1})$. The SBOR protocol outputs the new $E_{p k} (R_{i})$ only to the server P₁, and no information about the plaintexts is revealed to any party during the protocol. Finally, the server P₁ only applies homomorphic operations to the encryption $E_{p k} (R_{ℓ - 1})$ computed at the previous step. Therefore, the SBAOR protocol protects the confidentiality of the data, i.e. no information about the contents of the encryptions is revealed to any party during the protocol.

Secure Transformation (ST) protocol

The ST protocol enables the servers to transform the encryption of a record under one public key into the encryption of the same record under another public key. In the main protocol, the servers employ the ST protocol to collect the encryptions of the local minimums, which indicate the indexes of the local closest records of both databases, under the same public key so that they can decide the minimum among them.

The protocol considers three parties $P_{1}^{(u_{1})}$ with the input $E_{p k_{u_{1}}} (t)$ , $P_{2}^{(u_{1})}$ with the secret key $s k_{u_{1}}$ , and $P_{1}^{(u_{2})}$ . The protocol simply aims to transform the encryption of a record t under the public key $p k_{u_{1}}$ to the encryption of t under the public key $p k_{u_{2}}$ . Note that no information about t is revealed to any party during the protocol and the output $E_{p k_{u_{2}}} (t)$ is only known to the party $P_{1}^{(u_{2})}$ .

Briefly, $P_{1}^{(u_{1})}$ first masks $E_{p k_{u_{1}}} (t)$ with the randomly chosen vector $r^{'} \in Z_{N}^{m}$ as $μ = E_{p k_{u_{1}}} (t) * E_{p k_{u_{1}}} (r^{'})$ , and sends μ to $P_{2}^{(u_{1})}$ and $E_{p k_{u_{2}}} (r^{'})$ to $P_{1}^{(u_{2})}$ . After getting μ, $P_{2}^{(u_{1})}$ first decrypts it as $μ^{'} = D_{s k_{u_{1}}} (μ)$ , then encrypts μ′ with the public key $p k_{u_{2}}$ as $E_{p k_{u_{2}}} (μ^{'})$ , and finally sends the encryption to $P_{1}^{(u_{2})}$ . After receiving the encryption, the party $P_{1}^{(u_{2})}$ first removes the randomness r′ from the encryption $E_{p k_{u_{2}}} (μ^{'})$ and gets the encryption $E_{p k_{u_{2}}} (t)$ as $E_{p k_{u_{2}}} (t) = E_{p k_{u_{2}}} (μ^{'} - r^{'})$ . From the homomorphic property of the underlying encryption scheme, $E_{p k_{u_{2}}} (μ^{'} - r^{'})$ can easily be calculated as $E_{p k_{u_{2}}} (μ^{'}) * E_{p k_{u_{2}}} (r^{'})^{N - 1}$ .

Algorithm 2: ST.

Input: $E_{pk_{u_{1}}} (t)$ from $P_{1}^{(u_{1})}$ and $sk_{u_{1}}$ from $P_{2}^{(u_{1})}$
Output: $E_{pk_{u_{2}}} (t)$ to $P_{1}^{(u_{2})}$

1. $P_{1}^{(u_{1})}$:
   $μ \leftarrow E_{pk_{u_{1}}} (t) \times E_{pk_{u_{1}}} (r^{'})$, where $r^{'} \in_{R} Z_{N}^{m}$;
   send μ to $P_{2}^{(u_{1})}$ and $E_{pk_{u_{2}}} (r^{'})$ to $P_{1}^{(u_{2})}$;
2. $P_{2}^{(u_{1})}$:
   $μ^{'} \leftarrow D_{sk_{u_{1}}} (μ)$;
   send $E_{pk_{u_{2}}} (μ^{'})$ to $P_{1}^{(u_{2})}$;
3. $P_{1}^{(u_{2})}$:
   $E_{pk_{u_{2}}} (t) \leftarrow E_{pk_{u_{2}}} (μ^{'} - r^{'}) \leftarrow E_{pk_{u_{2}}} (μ^{'}) \times E_{pk_{u_{2}}} (r^{'})^{N - 1}$;

DOI: 10.7717/peerj-cs.965/table-6
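The message flow of Algorithm 2 can be sketched for a single scalar component of t. The code below is a minimal illustration under stated assumptions: the Paillier instance is a toy one with tiny, insecure primes; the three parties are collapsed into one function, `secure_transform`, which is a hypothetical name; and `bound` is an assumed public upper bound on t. Because the two toy keys have different moduli, the sketch draws the mask r below min(N₁, N₂) − bound so that t + r wraps around neither modulus. That restriction is a simplification of this sketch, not something stated in the text, which draws r′ from Z_N.

```python
import math
import random


# Toy Paillier with tiny primes: for illustration only, not secure.
def keygen(p, q):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))  # generator g = n + 1


def enc(n, m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(n + 1, m, n * n) * pow(r, n, n * n) % (n * n)


def dec(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def secure_transform(n_a, sk_a, n_b, c_t, bound):
    """Turn E_a(t) into E_b(t) for 0 <= t < bound (all three roles in one function)."""
    # P1^(u1): mask additively; r is small enough that t + r never wraps.
    r = random.randrange(min(n_a, n_b) - bound)
    mu = c_t * enc(n_a, r) % (n_a * n_a)
    c_r_b = enc(n_b, r)
    # P2^(u1): decrypt the masked value and re-encrypt it under key b.
    c_mu_b = enc(n_b, dec(n_a, sk_a, mu))
    # P1^(u2): strip the mask homomorphically, E_b(mu') * E_b(r)^(N_b - 1).
    return c_mu_b * pow(c_r_b, n_b - 1, n_b * n_b) % (n_b * n_b)


n_a, sk_a = keygen(293, 433)
n_b, sk_b = keygen(311, 419)
t = 777
c_b = secure_transform(n_a, sk_a, n_b, enc(n_a, t), bound=2 ** 10)
recovered = dec(n_b, sk_b, c_b)
```

For an m-dimensional record the same steps would be applied to each component with an independent mask.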

Security Analysis of ST: At the beginning of the protocol, the server $P_{1}^{(u_{1})}$ randomizes the encryption $E_{p k_{u_{1}}} (t)$ with $r^{'} \in Z_{N}^{m}$ before sending it to the server $P_{2}^{(u_{1})}$ . So, the decryption computed by $P_{2}^{(u_{1})}$ will be uniformly random in $Z_{N}^{m}$ . Besides, $P_{1}^{(u_{2})}$ locally subtracts the randomness r′, encrypted under the public key $p k_{u_{2}}$ , from the encryption sent by $P_{2}^{(u_{1})}$ by performing homomorphic operations only. Thus, the protocol does not reveal any information about the record t to any party.

Main protocol

In this section, we give the construction of our main protocol, which enables a query owner to extract the interpolation of the k-nearest neighbors for a query point of his choice, as shown in Fig. 2. As we stated in the “Introduction”, our construction can be viewed as an extension of the protocol presented in Elmehdwi, Samanthula & Jiang (2014), which proposes an efficient solution to the k-nearest neighbor query problem over an encrypted database outsourced to a single cloud.

Figure 2: The STPkNN protocol: for u = 1, 2; (1) each DO_u uploads its data to the server CSP₁^(u); (2) each DO_u gives its secret key to CSP₂^(u); (3) QO sends its query point Q in encrypted form to the servers CSP₁^(u); (4) CSP₁^(u) and CSP₂^(u) find the local nearest neighbours; (5) CSP₁⁽¹⁾ and CSP₁⁽²⁾ decide on the global nearest neighbour among the local nearest neighbors (4 and 5 are repeated k times in the protocol); (6) the final prediction value is forwarded to QO.


DOI: 10.7717/peerj-cs.965/fig-2

We assume that each data owner DO_u has a database D_u that consists of n records $d_{1}^{(u)}, \dots, d_{n}^{(u)}$ where each $d_{i}^{(u)}$ is an m-dimensional vector that lies in $[0, 2^{ℓ}]$ . We also assume that there exist two non-colluding semi-honest cloud service providers, ${C S P}_{1}^{(u)}$ and ${C S P}_{2}^{(u)}$ , for each database D_u where ${C S P}_{1}^{(u)}$ is given the encryption of the database D_u and ${C S P}_{2}^{(u)}$ is given the corresponding secret key sk_u.

Initially, each DO_u encrypts his database D_u as $E_{p k_{u}} (d_{i, j}^{(u)})$ where 1≤ i ≤ n and 1≤ j ≤ m. Each DO_u then outsources the encryptions of the database, together with the future query service to the clouds, i.e. DO_u gives $E_{p k_{u}} (d_{i, j}^{(u)})$ to ${C S P}_{1}^{(u)}$ and his secret key sk_u to ${C S P}_{2}^{(u)}$ . When the query owner (QO) wants to retrieve the interpolation of the k-nearest neighbors for a query point Q, he produces two encryptions of his query point Q as $E_{p k_{1}} (Q) = ⟨ E_{p k_{1}} (q_{1}), \dots, E_{p k_{1}} (q_{m}) ⟩$ and $E_{p k_{2}} (Q) = ⟨ E_{p k_{2}} (q_{1}), \dots, E_{p k_{2}} (q_{m}) ⟩$ using the public keys of the data owners DO₁ and DO₂, respectively; and gives each encryption $E_{p k_{u}} (Q)$ to the corresponding cloud service provider ${C S P}_{1}^{(u)}$ .

After receiving the encryption $E_{p k_{u}} (Q)$ , each ${C S P}_{1}^{(u)}$ runs the SSED protocol together with the corresponding server ${C S P}_{2}^{(u)}$ on the input $(E_{p k_{u}} (Q), E_{p k_{u}} (d_{i}^{(u)}))$ where $E_{p k_{u}} (d_{i}^{(u)}) = ⟨ E_{p k_{u}} (d_{i, 1}^{(u)}), \dots, E_{p k_{u}} (d_{i, m}^{(u)}) ⟩$ for 1 ≤ i ≤ n, and obtains the encryption of the squared Euclidean distance between Q and $d_{i}^{(u)}$ as $E_{p k_{u}} (e_{i}^{(u)})$ where $e_{i}^{(u)} = | Q - d_{i}^{(u)} |^{2}$ . From the SSED protocol, $E_{p k_{u}} (e_{i}^{(u)})$ is revealed only to ${C S P}_{1}^{(u)}$ .

In contrast to the protocol proposed in Kalideen, Osmanoglu & Tugrul (2019), the encryptions $E_{p k_{u}} (e_{i}^{(u)})$ for 1 ≤ i ≤ n are not sent to the server ${C S P}_{2}^{(u)}$ , since doing so would reveal to ${C S P}_{2}^{(u)}$ which indexes are used to compute the interpolation. Instead, each ${C S P}_{1}^{(u)}$ securely runs the SBD protocol with the server ${C S P}_{2}^{(u)}$ on the inputs $E_{p k_{u}} (e_{i}^{(u)})$ to compute $[e_{i}^{(u)}] = ⟨ E_{p k_{u}} (e_{i, 1}^{(u)}), \dots, E_{p k_{u}} (e_{i, ℓ}^{(u)}) ⟩$ , the encryptions of the individual bits of $e_{i}^{(u)}$ . Note that $[e_{i}^{(u)}]$ is only revealed to ${C S P}_{1}^{(u)}$ .
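As a plaintext reference for what SSED and SBD compute on encrypted data, the following sketch evaluates a squared Euclidean distance and its ℓ-bit decomposition, most significant bit first. The function names and the sample vectors are assumptions made for illustration; the actual protocols never expose these values in the clear.

```python
def squared_distance(q, d):
    """Squared Euclidean distance |Q - d|^2, the value SSED returns in encrypted form."""
    return sum((qj - dj) ** 2 for qj, dj in zip(q, d))


def bits_msb_first(value, ell):
    """The ell bits of `value`, most significant first, as SBD returns them encrypted."""
    if not 0 <= value < 2 ** ell:
        raise ValueError("value does not fit into ell bits")
    return [(value >> (ell - 1 - i)) & 1 for i in range(ell)]


q = [3, 1, 4]
d = [1, 5, 9]
e = squared_distance(q, d)    # 4 + 16 + 25
bits = bits_msb_first(e, 8)   # 45 written with eight bits
```

Note that ℓ has to be large enough to hold the largest possible squared distance, otherwise the decomposition is not well defined.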

After this stage, the servers produce the interpolation of the k-nearest neighbours of the query point Q in an iterative way. In each iteration:

Each pair of servers ${C S P}_{1}^{(u)}$ and ${C S P}_{2}^{(u)}$ securely calculate the encryptions of the individual bits of the minimum value $[e_{m i n}^{(u)}]$ among $[e_{1}^{(u)}], \dots, [e_{n}^{(u)}]$ by running the protocol SMIN_n. Note that $[e_{m i n}^{(u)}]$ is only revealed to ${C S P}_{1}^{(u)}$ .
Each ${C S P}_{1}^{(u)}$ then locally calculates the encryption of $e_{m i n}^{(u)}$ from $[e_{m i n}^{(u)}]$ as

$E_{p k_{u}} (e_{m i n}^{(u)}) = \prod_{i = 1}^{ℓ} E_{p k_{u}} {(e_{m i n, i}^{(u)})}^{2^{ℓ - i}} = E_{p k_{u}} (e_{m i n, 1}^{(u)} \times 2^{ℓ - 1} + \dots + e_{m i n, ℓ}^{(u)}) .$

At this stage of the protocol, the servers ${C S P}_{1}^{(1)}$ and ${C S P}_{1}^{(2)}$ have $E_{p k_{1}} (e_{m i n}^{(1)})$ and $E_{p k_{2}} (e_{m i n}^{(2)})$ as the encryption of the minimum distances. Now, the servers apply the following steps to decide the minimum among $e_{m i n}^{(1)}$ and $e_{m i n}^{(2)}$ :

– the servers ${C S P}_{1}^{(2)}$ with the input $E_{p k_{2}} (e_{m i n}^{(2)})$ , ${C S P}_{2}^{(2)}$ with the secret key sk₂, and ${C S P}_{1}^{(1)}$ securely run the ST protocol to compute the encryption of $e_{m i n}^{(2)}$ under the public key pk₁. Note that $E_{p k_{1}} (e_{m i n}^{(2)})$ is only known to ${C S P}_{1}^{(1)}$ .
– the server ${C S P}_{1}^{(1)}$ now executes the SBD protocol with the server ${C S P}_{2}^{(1)}$ on the inputs $E_{p k_{1}} (e_{m i n}^{(1)})$ and $E_{p k_{1}} (e_{m i n}^{(2)})$ to compute $[e_{m i n}^{(1)}]$ and $[e_{m i n}^{(2)}]$ where $[e_{m i n}^{(u)}] = ⟨ E_{p k_{1}} (e_{m i n, 1}^{(u)}), \dots, E_{p k_{1}} (e_{m i n, ℓ}^{(u)}) ⟩$ . From the SBD protocol, $[e_{m i n}^{(1)}]$ and $[e_{m i n}^{(2)}]$ are only revealed to ${C S P}_{1}^{(1)}$ .
– ${C S P}_{1}^{(1)}$ then runs the SMIN protocol with ${C S P}_{2}^{(1)}$ on the inputs $[e_{m i n}^{(1)}]$ and $[e_{m i n}^{(2)}]$ , and gets $[e_{m i n}] = [\min (e_{m i n}^{(1)}, e_{m i n}^{(2)})]$ .
– after getting [e_min], ${C S P}_{1}^{(1)}$ locally calculates the encryption of e_min as

$E_{p k_{1}} (e_{m i n}) = \prod_{i = 1}^{ℓ} E_{p k_{1}} (e_{m i n, i})^{2^{ℓ - i}} .$

– the servers ${C S P}_{1}^{(1)}$ with the input $E_{p k_{1}} (e_{m i n})$ , ${C S P}_{2}^{(1)}$ with the secret key sk₁, and ${C S P}_{1}^{(2)}$ securely run the ST protocol to compute the encryption of e_min under the public key pk₂. Note that $E_{p k_{2}} (e_{m i n})$ is only known to ${C S P}_{1}^{(2)}$ .

After identifying the minimum e_min among $e_{m i n}^{(1)}$ and $e_{m i n}^{(2)}$ , each ${C S P}_{1}^{(u)}$ locally computes the encryption of the difference $(e_{m i n} - e_{i}^{(u)})$ for each i as $E_{p k_{u}} (λ_{i}^{(u)}) = E_{p k_{u}} (e_{m i n} - e_{i}^{(u)}) = E_{p k_{u}} (e_{m i n}) * E_{p k_{u}} {(e_{i}^{(u)})}^{N - 1}$ .
Each ${C S P}_{1}^{(u)}$ then randomizes $E_{p k_{u}} (λ_{i}^{(u)})$ as $E_{p k_{u}} (α_{i}^{(u)}) = E_{p k_{u}} {(λ_{i}^{(u)})}^{r_{i}^{(u)}} = E_{p k_{u}} (λ_{i}^{(u)} * r_{i}^{(u)})$ where $r_{i}^{(u)}$ is a random number in Z_N. Among all 2n encryptions $E_{p k_{u}} (α_{i}^{(u)})$ , where i = 1 … n and u = 1, 2, exactly one (the one belonging to the record at distance e_min) is the encryption of zero, and all others are encryptions of random numbers.
Each ${C S P}_{1}^{(u)}$ securely runs the SBD protocol with the server ${C S P}_{2}^{(u)}$ on the inputs $E_{p k_{u}} (α_{i}^{(u)})$ to compute $[α_{i}^{(u)}] = ⟨ E_{p k_{u}} (α_{i, 1}^{(u)}), \dots, E_{p k_{u}} (α_{i, ℓ}^{(u)}) ⟩$ , the encryptions of the individual bits of $α_{i}^{(u)}$ . Note that $[α_{i}^{(u)}]$ is only revealed to ${C S P}_{1}^{(u)}$ .
After getting the encryptions $[α_{i}^{(u)}]$ , each ${C S P}_{1}^{(u)}$ runs the SBAOR protocol with the server ${C S P}_{2}^{(u)}$ on $[α_{i}^{(u)}]$ and [1] for each i where $[1] = ⟨ E_{p k_{u}} (1), \dots, E_{p k_{u}} (1) ⟩$ , and gets $E_{p k_{u}} (β_{i}^{(u)}) = E_{p k_{u}} (\bar{α_{i}^{(u)} \cdot 1})$ . Note that one of the encryptions among all $E_{p k_{u}} (β_{i}^{(u)})$ is $E_{p k_{u}} (1)$ and the remaining encryptions are $E_{p k_{u}} (0)$ where i ∈ [n] and u = 1,2. Furthermore, if $β_{j}^{(v)} = 1$ , then $d_{j}^{(v)}$ is the closest record to Q from both databases.
${C S P}_{1}^{(u)}$ securely runs the SM protocol with ${C S P}_{2}^{(u)}$ on the inputs $E_{p k_{u}} (β_{i}^{(u)})$ and $E_{p k_{u}} (d_{i, j}^{(u)})$ to compute $β_{i, j}^{' (u)} = E_{p k_{u}} (β_{i}^{(u)} * d_{i, j}^{(u)})$ , for 1 ≤ i ≤ n and 1 ≤ j ≤ m. Each ${C S P}_{1}^{(u)}$ can then calculate the encryption of its candidate for the first closest record $d_{1}^{' (u)}$ as $E_{p k_{u}} (d_{1}^{' (u)}) = ⟨ E_{p k_{u}} (d_{1, 1}^{' (u)}), \dots, E_{p k_{u}} (d_{1, m}^{' (u)}) ⟩$ where $E_{p k_{u}} (d_{1, j}^{' (u)}) = \prod_{i = 1}^{n} β_{i, j}^{' (u)}$ . As stated before, since only one of the encryptions among all $E_{p k_{u}} (β_{i}^{(u)})$ is $E_{p k_{u}} (1)$ and the remaining are $E_{p k_{u}} (0)$ , one of the two encryptions $E_{p k_{u}} (d_{1}^{' (u)})$ , u = 1, 2, will be the encryption of the zero vector and the other will be the encryption of the first closest record.
${C S P}_{1}^{(2)}$ with the input $E_{p k_{2}} (d_{1}^{' (2)})$ , ${C S P}_{2}^{(2)}$ with the secret key sk₂, and ${C S P}_{1}^{(1)}$ securely run the ST protocol to compute the encryption of $d_{1}^{' (2)}$ under the public key pk₁. Note that $E_{p k_{1}} (d_{1}^{' (2)})$ is only known to ${C S P}_{1}^{(1)}$ .
${C S P}_{1}^{(1)}$ can now calculate the encryption of the first closest record as $E_{p k_{1}} (d_{m i n_{1}}) = E_{p k_{1}} (d_{1}^{' (1)}) * E_{p k_{1}} (d_{1}^{' (2)})$ . From the homomorphic property of the underlying encryption scheme, $E_{p k_{1}} (d_{1}^{' (1)}) * E_{p k_{1}} (d_{1}^{' (2)}) = E_{p k_{1}} (d_{1}^{' (1)} + d_{1}^{' (2)})$ , and since one of them is zero, $E_{p k_{1}} (d_{1}^{' (1)} + d_{1}^{' (2)})$ will be the encryption of the first closest record from both databases.
As the final step of the first iteration, the first closest record $d_{m i n_{1}}$ should be excluded from further iterations. To this end, each ${C S P}_{1}^{(u)}$ securely executes the SBOR protocol with ${C S P}_{2}^{(u)}$ on the inputs $E_{p k_{u}} (β_{i}^{(u)})$ and $E_{p k_{u}} (e_{i, h}^{(u)})$ where 1 ≤ i ≤ n and $1 \leq h \leq ℓ$ . As the output of the protocol, ${C S P}_{1}^{(u)}$ gets the encryptions of the renewed distances as $E_{p k_{u}} (e_{i, h}^{(u)}) = E_{p k_{u}} (β_{i}^{(u)} \lor e_{i, h}^{(u)})$ . Observe that if $β_{j}^{(u)} = 1$ for a particular j, the corresponding distance $e_{j}^{(u)}$ will take the maximum value, i.e. $[e_{j}^{(u)}] = ⟨ E_{p k_{u}} (1), \dots, E_{p k_{u}} (1) ⟩$ . On the other hand, if $β_{i}^{(u)} = 0$ , the SBOR protocol will have no effect on $e_{i}^{(u)}$ .
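Two purely local steps of an iteration, the recomposition of a value from its encrypted bits and the randomized difference that singles out the minimum, can be checked with a toy example. The sketch below assumes a textbook Paillier instance with tiny, insecure primes, bits indexed 1 to ℓ with the most significant bit first, and made-up distances. The helper names are hypothetical, and Python's `min` stands in for the SMIN_n and SMIN protocols. The decryptions at the end exist only to inspect the result; in the protocol no server decrypts these values, and the zero test is carried out on encrypted bits by SBD followed by SBAOR.

```python
import math
import random


# Toy Paillier with tiny primes: for illustration only, not secure.
def keygen(p, q):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))  # generator g = n + 1


def enc(n, m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(n + 1, m, n * n) * pow(r, n, n * n) % (n * n)


def dec(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def recompose(n, bit_cts):
    """E(value) from MSB-first encrypted bits: product of E(b_i)^(2^(ell - i))."""
    ell = len(bit_cts)
    acc = 1  # the trivial encryption of 0
    for i, c in enumerate(bit_cts, start=1):
        acc = acc * pow(c, 2 ** (ell - i), n * n) % (n * n)
    return acc


def randomized_difference(n, c_min, c_e):
    """E(r * (e_min - e_i)) for a random nonzero r; encrypts 0 when e_i = e_min."""
    diff = c_min * pow(c_e, n - 1, n * n) % (n * n)
    return pow(diff, random.randrange(1, n), n * n)


n, sk = keygen(293, 433)
ell = 8
distances = [45, 17, 88]
e_min = min(distances)  # stand-in for SMIN_n / SMIN
min_bits = [(e_min >> (ell - i)) & 1 for i in range(1, ell + 1)]
c_min = recompose(n, [enc(n, b) for b in min_bits])
recovered = dec(n, sk, c_min)
alphas = [dec(n, sk, randomized_difference(n, c_min, enc(n, e))) for e in distances]
is_closest = [a == 0 for a in alphas]
```

The indicator list plays the role of the bits β_i: it is 1 only at the index of the closest record, which is then selected by multiplication and excluded by OR-ing β_i into every bit of its distance.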

Because our protocol outputs the interpolation of the k-nearest neighbors of the query point Q, the server ${C S P}_{1}^{(1)}$ does not need to keep all the nearest records separately. Instead, it gradually builds the interpolation, i.e. after each iteration, ${C S P}_{1}^{(1)}$ adds the current closest record $E_{p k_{1}} (d_{m i n_{p}})$ to the previous sum $E_{p k_{1}} (S_{p - 1}) = E_{p k_{1}} (d_{m i n_{1}} + \dots + d_{m i n_{p - 1}})$ as $E_{p k_{1}} (S_{p - 1}) * E_{p k_{1}} (d_{m i n_{p}})$ , and gets the current sum $E_{p k_{1}} (S_{p}) = E_{p k_{1}} (d_{m i n_{1}} + \dots + d_{m i n_{p}})$ .
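For testing an implementation, it can help to have a plaintext reference of the value that the encrypted running sum should contain. The following sketch is such a reference under stated assumptions: the function name and the sample records are hypothetical, ties are broken by list order, and removing a record from the list stands in for the step that sets its distance bits to all ones.

```python
def knn_sum(db1, db2, query, k):
    """Plaintext reference for S_k: the sum of the k records closest to `query`."""
    def dist(d):
        return sum((a - b) ** 2 for a, b in zip(query, d))

    remaining = list(db1) + list(db2)
    total = [0] * len(query)
    for _ in range(k):
        nearest = min(remaining, key=dist)  # global minimum over both databases
        remaining.remove(nearest)           # exclude it from later iterations
        total = [s + v for s, v in zip(total, nearest)]  # S_p = S_{p-1} + d_min_p
    return total


db1 = [[1, 1], [10, 10]]
db2 = [[2, 2], [8, 9]]
s_2 = knn_sum(db1, db2, [0, 0], 2)
s_3 = knn_sum(db1, db2, [0, 0], 3)
```

Only the running sum is kept, which mirrors the observation above that the individual nearest records never need to be stored.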

After k iterations, ${C S P}_{1}^{(1)}$ will have the sum $E_{p k_{1}} (S_{k}) = E_{p k_{1}} (d_{m i n_{1}} + \dots + d_{m i n_{k}})$ as the encryption of the sum of the k-nearest neighbors of the query point Q. ${C S P}_{1}^{(1)}$ then randomizes each component of this encryption as $γ_{j} = E_{p k_{1}} (S_{k, j}) * E_{p k_{1}} (r_{j})$ where r_j are random numbers in Z_N and 1 ≤ j ≤ m. ${C S P}_{1}^{(1)}$ then sends γ_j to ${C S P}_{2}^{(1)}$ and r_j to the query owner. Upon receiving γ_j, ${C S P}_{2}^{(1)}$ decrypts them as $γ_{j}^{'} = D_{s k_{1}} (γ_{j})$ and sends the decryptions to the query owner. The query owner QO then computes the sum of the k nearest records as $S_{k, j}^{'} = γ_{j}^{'} - r_{j}$ where 1 ≤ j ≤ m. As the final step, QO computes the interpolation of the k-nearest neighbors of Q as $⟨ S_{k, 1}^{'} / k, \dots, S_{k, m}^{'} / k ⟩$ .
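The final masking and unmasking can be illustrated as follows. This is a minimal sketch assuming a textbook Paillier instance with tiny, insecure primes and hypothetical plaintext sums; the three roles are written one after another in a single script. Since the decrypted value is S + r reduced modulo N, the sketch subtracts the mask modulo N.

```python
import math
import random


# Toy Paillier with tiny primes: for illustration only, not secure.
def keygen(p, q):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))  # generator g = n + 1


def enc(n, m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(n + 1, m, n * n) * pow(r, n, n * n) % (n * n)


def dec(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


n, sk = keygen(293, 433)
k = 3
s_k = [36, 54]                   # hypothetical plaintext sums S_{k,j}
cts = [enc(n, s) for s in s_k]   # what CSP1^(1) holds after k iterations

# CSP1^(1): mask each component, gamma_j = E(S_{k,j}) * E(r_j).
masks = [random.randrange(n) for _ in cts]
gammas = [c * enc(n, r) % (n * n) for c, r in zip(cts, masks)]

# CSP2^(1): decrypt the masked values; they look random without the masks.
gamma_primes = [dec(n, sk, g) for g in gammas]

# QO: remove the masks modulo N, then divide by k.
sums = [(g - r) % n for g, r in zip(gamma_primes, masks)]
interpolation = [s / k for s in sums]
```

In this flow the key holder only sees masked values and the server holding the ciphertexts never sees a decryption.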

Security analysis

In this section, we give the security analysis of the protocol shown in Algorithm 3. As we emphasized above, the data owners encrypt their data before outsourcing them to the cloud. Since they use the Paillier encryption scheme, which is semantically secure, the data is not leaked to any cloud service provider. On the other hand, at the first step of Algorithm 3, the query point Q is encrypted before being given to the corresponding cloud service providers. Similarly, since the underlying encryption scheme (the Paillier cryptosystem) is semantically secure, the query point Q is not revealed to any data owner or any cloud service provider.

Algorithm 3: STPkNN.

Input: $E_{pk_{u}} (D_{u})$ from ${CSP}_{1}^{(u)}$; sk_u from ${CSP}_{2}^{(u)}$; Q from QO
Output: T, the interpolation of k-nearest neighbors of Q

1. QO:
   a) compute $E_{pk_{u}} (Q) = ⟨ E_{pk_{u}} (q_{1}), \dots, E_{pk_{u}} (q_{m}) ⟩$;
   b) send each $E_{pk_{u}} (Q)$ to the corresponding server ${CSP}_{1}^{(u)}$;
2. ${CSP}_{1}^{(u)}$ and ${CSP}_{2}^{(u)}$:
   for i = 1 to n do
      $E_{pk_{u}} (e_{i}^{(u)}) \leftarrow SSED (E_{pk_{u}} (Q), E_{pk_{u}} (d_{i}^{(u)}))$;
      $[e_{i}^{(u)}] \leftarrow SBD (E_{pk_{u}} (e_{i}^{(u)}))$;
3. for p = 1 to k do
   a) ${CSP}_{1}^{(u)}$ and ${CSP}_{2}^{(u)}$:
      – $[e_{min}^{(u)}] \leftarrow {SMIN}_{n} ([e_{1}^{(u)}], \dots, [e_{n}^{(u)}])$;
      – ${CSP}_{1}^{(u)}$ computes $E_{pk_{u}} (e_{min}^{(u)}) \leftarrow \prod_{i = 1}^{ℓ} E_{pk_{u}} {(e_{min, i}^{(u)})}^{2^{ℓ - i}}$;
   b) ${CSP}_{1}^{(1)}$, ${CSP}_{2}^{(1)}$, ${CSP}_{1}^{(2)}$, and ${CSP}_{2}^{(2)}$:
      – ${CSP}_{1}^{(2)}$, ${CSP}_{2}^{(2)}$, and ${CSP}_{1}^{(1)}$ execute $E_{pk_{1}} (e_{min}^{(2)}) \leftarrow ST (E_{pk_{2}} (e_{min}^{(2)}))$;
      – ${CSP}_{1}^{(1)}$ and ${CSP}_{2}^{(1)}$ compute $[e_{min}^{(u)}] \leftarrow SBD (E_{pk_{1}} (e_{min}^{(u)}))$ for u = 1, 2;
      – ${CSP}_{1}^{(1)}$ and ${CSP}_{2}^{(1)}$ compute $[e_{min}] \leftarrow SMIN ([e_{min}^{(1)}], [e_{min}^{(2)}])$;
      – ${CSP}_{1}^{(1)}$ computes $E_{pk_{1}} (e_{min}) \leftarrow \prod_{i = 1}^{ℓ} E_{pk_{1}} (e_{min, i})^{2^{ℓ - i}}$;
      – ${CSP}_{1}^{(1)}$, ${CSP}_{2}^{(1)}$, and ${CSP}_{1}^{(2)}$ execute $E_{pk_{2}} (e_{min}) \leftarrow ST (E_{pk_{1}} (e_{min}))$;
   c) ${CSP}_{1}^{(u)}$ and ${CSP}_{2}^{(u)}$:
      for i = 1 to n do
         $E_{pk_{u}} (λ_{i}^{(u)}) \leftarrow E_{pk_{u}} (e_{min} - e_{i}^{(u)})$;
         $E_{pk_{u}} (α_{i}^{(u)}) \leftarrow E_{pk_{u}} {(λ_{i}^{(u)})}^{r_{i}^{(u)}}$, where $r_{i}^{(u)} \in_{R} Z_{N}$;
         $[α_{i}^{(u)}] \leftarrow SBD (E_{pk_{u}} (α_{i}^{(u)}))$;
         $E_{pk_{u}} (β_{i}^{(u)}) \leftarrow SBAOR ([α_{i}^{(u)}], [1])$;
   d) ${CSP}_{1}^{(u)}$ and ${CSP}_{2}^{(u)}$:
      for i = 1 to n and j = 1 to m do
         $β_{i, j}^{' (u)} \leftarrow SM (E_{pk_{u}} (β_{i}^{(u)}), E_{pk_{u}} (d_{i, j}^{(u)}))$;
      $E_{pk_{u}} (d_{p, j}^{' (u)}) \leftarrow \prod_{i = 1}^{n} β_{i, j}^{' (u)}$ for j = 1 to m;
   e) ${CSP}_{1}^{(2)}$, ${CSP}_{2}^{(2)}$, and ${CSP}_{1}^{(1)}$:
      – $E_{pk_{1}} (d_{p}^{' (2)}) \leftarrow ST (E_{pk_{2}} (d_{p}^{' (2)}))$;
   f) ${CSP}_{1}^{(1)}$:
      – $E_{pk_{1}} (d_{min_{p}}) \leftarrow E_{pk_{1}} (d_{p}^{' (1)} + d_{p}^{' (2)}) \leftarrow E_{pk_{1}} (d_{p}^{' (1)}) * E_{pk_{1}} (d_{p}^{' (2)})$;
      – $E_{pk_{1}} (S_{p}) \leftarrow E_{pk_{1}} (S_{p - 1}) * E_{pk_{1}} (d_{min_{p}})$;
   g) ${CSP}_{1}^{(u)}$ and ${CSP}_{2}^{(u)}$:
      for i = 1 to n and h = 1 to ℓ do
         $E_{pk_{u}} (e_{i, h}^{(u)}) \leftarrow SBOR (E_{pk_{u}} (β_{i}^{(u)}), E_{pk_{u}} (e_{i, h}^{(u)}))$;
4. ${CSP}_{1}^{(1)}$:
   for j = 1 to m do
      – $γ_{j} \leftarrow E_{pk_{1}} (S_{k, j}) \times E_{pk_{1}} (r_{j})$, where $r_{j} \in_{R} Z_{N}$;
      – sends γ_j to ${CSP}_{2}^{(1)}$ and r_j to QO;
5. ${CSP}_{2}^{(1)}$:
   for j = 1 to m do
      – $γ_{j}^{'} \leftarrow D_{sk_{1}} (γ_{j})$;
      – sends $γ_{j}^{'}$ to QO;
6. QO:
   a) for j = 1 to m do $S_{k, j}^{'} \leftarrow γ_{j}^{'} - r_{j}$;
   b) computes the interpolation as $T = ⟨ S_{k, 1}^{'} / k, \dots, S_{k, m}^{'} / k ⟩$;

DOI: 10.7717/peerj-cs.965/table-7

At the second step of Algorithm 3, the servers ${C S P}_{1}^{(u)}$ and ${C S P}_{2}^{(u)}$ execute the protocols SSED and SBD. As stated in Elmehdwi, Samanthula & Jiang (2014), the outputs of the protocols will be in encrypted form, and will only be revealed to the servers ${C S P}_{1}^{(u)}$ . Besides, no information about the plaintexts is revealed to any party during these protocols. At step 3(a) of each iteration in Algorithm 3, the output of the protocol SMIN_n is only revealed to the servers ${C S P}_{1}^{(u)}$ . Besides, the SMIN_n protocol guarantees that the servers involved in the protocol do not learn which records of the databases correspond to the current minimum distances. Similarly, the output of the SMIN protocol executed at step 3(b) of Algorithm 3 is only revealed to the server ${C S P}_{1}^{(1)}$ . Also, the protocol does not reveal which record corresponds to the current global minimum.

The servers also run the ST protocol at steps 3(b) and 3(e) of Algorithm 3 to transform an encryption under the public key $p k_{u_{1}}$ into an encryption under the public key $p k_{u_{2}}$ : the current minimum distance at step 3(b) and the candidate record at step 3(e). As we explained earlier in this section, the ST protocol protects the content of the encryption from all parties involved in the protocol. Furthermore, at step 3(c), each server ${C S P}_{1}^{(u)}$ runs the SBAOR protocol with ${C S P}_{2}^{(u)}$ , which outputs the encryption of 1 for the index corresponding to the current global minimum and the encryption of 0 for all other indexes. Note that the SBAOR protocol uses the protocols SM and SBOR as sub-procedures, and it does not leak the index that corresponds to the current global minimum. Thus, data access patterns are protected from all the involved servers throughout the protocol, i.e. the servers do not know which data records are used in producing the interpolation of the k-nearest neighbors.

In conclusion, the STPkNN protocol preserves the confidentiality of the data, secures the privacy of user’s query point, and hides data access patterns.

Complexity analysis

In this section, we will discuss the computation complexity of our protocol. The servers perform n instantiations of SSED and SBD protocols at the second step of the protocol. Since the computation complexity of the SSED protocol proposed in Elmehdwi, Samanthula & Jiang (2014) is bounded by O(m) multiplications and O(m) exponentiations, and the computation complexity of the SBD protocol proposed in Samanthula, Hu & Jiang (2013) is bounded by $O (ℓ)$ multiplications and $O (ℓ)$ exponentiations, the computation complexity of this step is bounded by $O (n \cdot (m + ℓ))$ multiplications and $O (n \cdot (m + ℓ))$ exponentiations.

On the other hand, at the third step of our protocol, the servers perform the following operations O(k) times: a single instantiation of the SMIN_n protocol, a single instantiation of SMIN, 2 instantiations of the ST protocol, n instantiations of the SBD and SBAOR protocols, n · m instantiations of the SM protocol, and $n \cdot ℓ$ instantiations of the SBOR protocol. The computation complexity of the SMIN_n protocol presented in this paper is bounded by $O (ℓ \cdot n)$ multiplications and $O (ℓ \cdot n)$ exponentiations, and the computation complexity of the SMIN protocol presented in Elmehdwi, Samanthula & Jiang (2014) is bounded by $O (ℓ)$ multiplications and $O (ℓ)$ exponentiations. Besides, the ST protocol proposed in this paper, the SM protocol presented in Elmehdwi, Samanthula & Jiang (2014), and the SBOR protocol presented in Elmehdwi, Samanthula & Jiang (2014) only contain a constant number of multiplications and a constant number of exponentiations. Also, as we emphasized above, the computation complexity of the SBD protocol is bounded by $O (ℓ)$ multiplications and $O (ℓ)$ exponentiations (Samanthula, Hu & Jiang, 2013). Moreover, since the SBAOR protocol proposed in this paper deploys $ℓ$ instantiations of the SM protocol and $ℓ - 1$ instantiations of the SBOR protocol as sub-procedures, the computation complexity of the SBAOR protocol is bounded by $O (ℓ)$ multiplications and $O (ℓ)$ exponentiations. Thus, the computation complexity of the third step is bounded by $O (k \cdot n \cdot (m + ℓ))$ multiplications and exponentiations in total.

In addition, the servers perform only O(m) operations at the remaining steps of the protocol. Thus, the total computation complexity of our protocol is bounded by $O (k \cdot n \cdot (m + ℓ))$ multiplications and exponentiations.
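The bound can be retraced by adding up the individual costs quoted above. The following derivation is only a summary of those figures; the symbols $T_{2}$, $T_{3}^{\text{iter}}$ and $T_{\text{total}}$ are introduced here for illustration and do not appear in the original analysis, and each term counts both multiplications and exponentiations.

```latex
\begin{align*}
T_{2} &= n \cdot \bigl(O(m) + O(\ell)\bigr) = O\bigl(n (m + \ell)\bigr), \\
T_{3}^{\text{iter}} &= \underbrace{O(\ell n)}_{\mathrm{SMIN}_n}
  + \underbrace{O(\ell)}_{\mathrm{SMIN}}
  + \underbrace{O(1)}_{\mathrm{ST}}
  + \underbrace{n \cdot O(\ell)}_{\mathrm{SBD}}
  + \underbrace{n \cdot O(\ell)}_{\mathrm{SBAOR}}
  + \underbrace{n m \cdot O(1)}_{\mathrm{SM}}
  + \underbrace{n \ell \cdot O(1)}_{\mathrm{SBOR}}
  = O\bigl(n (m + \ell)\bigr), \\
T_{\text{total}} &= T_{2} + k \cdot T_{3}^{\text{iter}} + O(m)
  = O\bigl(k \cdot n \cdot (m + \ell)\bigr).
\end{align*}
```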

Performance evaluation

In this section, we evaluate the performance of the proposed protocol STPkNN by carrying out a number of experiments under different parameter settings. We deployed the Paillier cryptosystem (Paillier, 1999) for the encryption, and implemented the proposed protocols in Java. All the experiments were performed on a virtual Linux machine with a two-core Intel Xeon 2.20 GHz CPU and 4 GB RAM running Ubuntu 16.04 LTS. For the experiments, we utilized two real data sets from the UCI machine learning repository (Dua & Graff, 2017): Heart Disease, which consists of 600 data records, each containing 14 attributes concerning heart disease diagnosis, and Bank Marketing, which contains 800 data records, each including 15 attributes that help to predict whether a new client will subscribe to a term deposit. We first processed these data sets so that they contain only non-negative integer values. We then split each data set into two equal parts so that each part is operated by a single cloud pair. Note that, for all the measurements, the experiment was repeated for multiple query points and the average time taken to execute a query is reported.

We first evaluated the computation cost of STPkNN on the finance (Bank Marketing) data set in minutes for varying numbers of nearest neighbors (k) and attributes (m). As shown in Fig. 3A, if we fix the number of attributes as m = 6, the running time of our protocol varies from 74.08 to 226.16 min when k is changed from 5 to 15. Besides, for m = 12, the running time of our protocol varies from 78.85 to 239.21 min when k is changed from 5 to 15. So the running time of our protocol grows linearly with k. Also, we observe that the computation cost of our protocol increases by nearly a factor of 1.06 when m is doubled.

Figure 3: Running time of STPkNN for varying k values on the (A) finance and the (B) health data set.


DOI: 10.7717/peerj-cs.965/fig-3

Similarly, we evaluated the computation cost of STPkNN on the Heart Disease data set in minutes for varying numbers of nearest neighbors (k) and attributes (m). As shown in Fig. 3B, if we set the number of attributes as m = 6, the running time of our protocol varies from 55.89 to 168.26 min when k is changed from 5 to 15. Besides, for m = 12, the running time of our protocol varies from 59.56 to 178.63 min when k is changed from 5 to 15. Thus, our protocol scales linearly with k. On the other hand, comparing the two data sets, the running time of our protocol increases by almost a factor of 1.34 when the number of data records per cloud pair (n) is changed from 300 to 400. Thus, the running time of our protocol grows linearly with n.
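The scaling claims can be checked directly from the reported running times. The short script below only redoes that arithmetic; the dictionary layout and variable names are assumptions of this sketch, and the numbers are the ones quoted in the text for k = 5 and k = 15.

```python
# Running times in minutes, copied from the text: (time at k = 5, time at k = 15).
times = {
    ("finance", 6): (74.08, 226.16),
    ("finance", 12): (78.85, 239.21),
    ("heart", 6): (55.89, 168.26),
    ("heart", 12): (59.56, 178.63),
}

# Tripling k should roughly triple the running time if the cost is linear in k.
k_ratios = {key: t15 / t5 for key, (t5, t15) in times.items()}

# Doubling m (6 -> 12) on the finance data at k = 5.
m_ratio = times[("finance", 12)][0] / times[("finance", 6)][0]

# n = 300 (heart) versus n = 400 (finance) records per cloud pair, m = 6, k = 15.
n_ratio = times[("finance", 6)][1] / times[("heart", 6)][1]

for key, ratio in k_ratios.items():
    print(key, round(ratio, 2))
print("m doubled:", round(m_ratio, 2), "n 300 -> 400:", round(n_ratio, 2))
```

All four k ratios lie close to 3, and the n ratio is close to 400/300 ≈ 1.33, which is what linear growth in k and n would predict.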

Conclusions

In this study, we proposed a secure k-NN method that produces an interpolation of the k-nearest neighbors of a query point over encrypted databases. We claimed that employing two different databases in the protocol, instead of one, yields a more accurate and reliable interpolation value. We validated this claim by conducting experiments on publicly available real data sets. We also showed that our protocol preserves the confidentiality of data, assures the privacy of the user’s query point, and hides data access patterns. We finally analyzed the performance of the proposed protocol through a number of experiments under different parameter settings. As future work, we will examine how to apply other interpolation methods to encrypted data in a distributed architecture. We will also extend our protocol, which considers two encrypted databases stored in two different clouds, to multi-cloud settings.

Supplemental Information

ID	trestbps	chol	thalach	oldpeak
1	145	233	150	2.3
2	160	286	108	1.5
3	120	229	129	2.6
4	130	250	187	3.5
5	130	204	172	1.4
6	120	236	178	0.8
7	140	268	160	3.6
8	120	354	163	0.6
9	130	254	147	1.4
10	140	203	155	3.1
