Embedding edit distance to enable private keyword search

Bringer, Julien; Chabanne, Hervé

doi:10.1186/2192-1962-2-2

Research
Open access
Published: 23 February 2012


Human-centric Computing and Information Sciences volume 2, Article number: 2 (2012)


Abstract

Background

Our work is focused on fuzzy keyword search over encrypted data in Cloud Computing.

Methods

We adapt results on private identification schemes by Bringer et al. to this new context. We here exploit a classical embedding of the edit distance into the Hamming distance.

Results

Our approach allows some flexibility on the tolerated edit distance when looking for close keywords, while preserving the confidentiality of the queries.

Conclusion

Our proposal is proved secure in a security model taking into account privacy.

Introduction

Cloud Computing gives users access to shared resources located somewhere on the Internet; at the very least, storage capacity is readily available. As a result, much sensitive information ends up in the Cloud, where it should remain encrypted to preserve its confidentiality. Specific procedures have been developed to inspect such content remotely without decrypting it. Searchable encryption [1] builds an index for each keyword of interest, so that a user can search his encrypted data for such a keyword and retrieve the files containing it. For privacy reasons, this search must be carried out with great care, so that the Cloud cannot find out the underlying keyword. Symmetric Searchable Encryption (SSE), as introduced in [2], relies on symmetric encryption primitives for efficiency reasons. In [3], Li et al. build on SSE to obtain a solution for fuzzy keyword search over encrypted data in Cloud Computing. Fuzziness should be understood here as minor typos introduced by users when typing a request. In this context, the edit distance (Levenshtein distance) is a relevant measure of string similarity.

Related works

Li et al. [3] consider two techniques, wildcard-based and gram-based, for achieving fuzzy keyword search over encrypted data. Both build a set consisting of the searched keyword and the nearby words according to the technique used. For instance, for the keyword CASTLE and an edit distance of 1, the fuzzy keyword set for the wildcard-based technique is {CASTLE, *CASTLE, *ASTLE, C*ASTLE, C*STLE, ..., CASTL*E, CASTL*, CASTLE*}, and the one for the gram-based technique is {CASTLE, CSTLE, CATLE, CASLE, CASTE, CASTL, ASTLE}. The idea behind these fuzzy keyword sets is to index, before the search phase, not only the exact keywords but also those that differ slightly from them, according to a fixed bound on the tolerated edit distance.
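As an illustration only, the two fuzzy keyword sets quoted above for an edit distance of 1 can be enumerated with a few lines of Python. The function names are hypothetical, and the enumeration simply follows the CASTLE example rather than the full construction of [3].

```python
def wildcard_fuzzy_set(word):
    """Word, plus every variant with one '*' inserted or replacing one character."""
    variants = {word}
    for i in range(len(word) + 1):
        # '*' inserted before position i (e.g. *CASTLE, C*ASTLE, ..., CASTLE*).
        variants.add(word[:i] + "*" + word[i:])
    for i in range(len(word)):
        # '*' in place of character i (e.g. *ASTLE, C*STLE, ..., CASTL*).
        variants.add(word[:i] + "*" + word[i + 1:])
    return variants


def gram_fuzzy_set(word):
    """Word, plus every variant obtained by deleting exactly one character."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}


print(sorted(wildcard_fuzzy_set("CASTLE")))
print(sorted(gram_fuzzy_set("CASTLE")))
```

Both sets have to be materialized and indexed before any search takes place, which is the cost the present approach avoids.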

Our approach is somewhat different. For iriscode biometric data, two iriscodes are compared by computing a Hamming distance [4]. There is a current trend to generalize this way of performing biometric matching to other modalities [5, 6], as it is easier to embed into cryptographic protocols. In their works on private identification, Bringer et al. [7–9] (see also Section Private identification schemes) show how to carry out fuzzy keyword search for the Hamming distance. Following this trend, our idea is to combine this with a classical embedding of the edit distance into the Hamming distance [10, 11] (see Section Edit distance approximation) to obtain a fuzzy keyword search for the edit distance. This approach has at least two advantages. Firstly, contrary to [3], it does not require defining a priori the set of words considered acceptable for the search. Moreover, we inherit the security properties of [7] in their security model. Our proposal thus relies on an asymmetric security model, which can be seen as an asset for Cloud Computing applications, where public-key encryption seems relevant. To the best of our knowledge, this is the first scheme enabling fuzzy search with respect to the edit distance over data encrypted with a public-key scheme.

Contribution and organization

The main contribution of this work is a proposal for fuzzy keyword search over encrypted data, where fuzzy means that we tolerate some edit distance deviation. A natural application of our results is Cloud Computing. We give proofs for the security properties of our scheme, and we briefly discuss its performance.

In the next Section, we briefly describe classical cryptographic primitives that we use. In Section Model presentation, we present our security model. In Section Useful technical tools, we recall some already published works on private identification schemes and the embedding of edit distances into the Hamming distance. In Section Our construction, we introduce our work and explain its properties.

Cryptographic primitives

Private information retrieval protocol

A Private Information Retrieval protocol (PIR, [12]) is a scheme that makes it possible to retrieve a specific piece of information from a remote server in such a way that the server does not learn anything about the query.

Suppose a database consists of M bits x = x₁,...,x_M. To be secure, the protocol should satisfy the following properties [13]:

Soundness: When the user and the database follow the protocol, the result of the request is exactly the requested bit.
User Privacy: For all x ∈ {0,1}^M and for 1 ≤ i, j ≤ M, no algorithm used by the database can distinguish, with non-negligible probability, between the requests for index i and index j.
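To make the two properties concrete, here is a minimal sketch of a toy two-server PIR based on XOR, in the spirit of the multi-server setting of [12]. It is not one of the single-database computational schemes used later in this paper, and it assumes two servers that hold the same database and do not collude. All function names are hypothetical and indices are 0-based.

```python
import secrets


def server_answer(database, subset):
    """Each server returns the XOR of the database bits at the requested positions."""
    answer = 0
    for position in subset:
        answer ^= database[position]
    return answer


def pir_query(index, size):
    """Build the two requests: a uniformly random subset, and the same subset with `index` toggled."""
    first = {j for j in range(size) if secrets.randbits(1)}
    second = first ^ {index}  # symmetric difference toggles membership of `index`
    return first, second


def pir_retrieve(database, index):
    """Run the toy protocol end to end and return the requested bit."""
    first, second = pir_query(index, len(database))
    # Every position except `index` appears in both answers or in neither, so it cancels out.
    return server_answer(database, first) ^ server_answer(database, second)


database = [1, 0, 1, 1, 0, 0, 1, 0]
print([pir_retrieve(database, i) for i in range(len(database))])
```

Soundness holds because the two answers differ exactly by the requested bit. Taken alone, each request is a uniformly random subset whatever the index, which is the intuition behind User Privacy in this toy setting.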

Among the known constructions of computationally secure PIR, block-based PIR (i.e. working on blocks of bits) efficiently reduces the cost. The best performance is achieved by Gentry and Ramzan [14] and Lipmaa [15], with a communication complexity polynomial in the logarithm of M. Surveys of the subject are available in [16, 17].

Some PIR protocols are called Symmetric Private Information Retrieval, when they comply with the Data Privacy requirement [13]. This condition states that the querier cannot distinguish between a database that possesses only the information he requested, and a regular one; in other words, that the querier does not get more information than he asked for.

Private information storage protocol

PIR protocols make it possible to retrieve information from a database. A Private Information Storage (PIS) protocol [17] makes it possible to write information into a database with properties similar to those of PIR. The goal is to prevent the database from learning the content of the information being stored; for a detailed description of such protocols, see [1, 18].

To be secure, the protocol must also satisfy the Soundness and User Privacy properties, meaning that (1) following the protocol results in the database being updated with the appropriate value, and (2) no algorithm run by the database can distinguish between two writing requests.

Model presentation

In this section, we introduce the model of security for an Error-Tolerant Searchable Encryption scheme for edit distance by adapting the model from [7].

Entities for the protocol

The context is Cloud Computing where users can either store or retrieve data from the Cloud. This leads to three different entities:

The Cloud $CL$ which represents a single point of access to remote shared resources (i.e. a remote storage system). The Cloud is assumed to be untrusted, so we consider the content as publicly accessible to a third party and that communications in the Cloud and with users can be eavesdropped.
The sender $X$ sends data to be stored on the Cloud $CL$ .
The receiver $Y$ generates queries to the Cloud $CL$ to obtain the results of his searches.

Note that the sender and the receiver are not necessarily the same user and it is even possible that several senders and several receivers exist and interact. This corresponds well to the Cloud Computing model.

Definition of the primitives

In the sequel, messages are binary strings of length N, and ed(m₁, m₂) denotes the edit distance between m₁, m₂ ∈ {0,1}^N, i.e. the minimum number of character insertions, deletions and substitutions needed to transform one string into the other. Note that the edit distance is also well defined on larger alphabets and on variable-length strings; the scheme can be extended to these cases.
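For reference, the edit distance defined above can be computed with the standard dynamic programming recurrence. The sketch below is a plain, unencrypted computation, and the function name is ours; it is given only to fix ideas and plays no role in the protocol itself.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, keeping one row of the DP table at a time."""
    previous = list(range(len(b) + 1))  # distance from the empty prefix of a to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(edit_distance("0101", "1010"))
print(edit_distance("CASTLE", "CASLE"))
```

The strings 0101 and 1010 are at edit distance 2 (delete the leading 0, append a 0) although they differ in all four positions, which is why a dedicated embedding is needed before Hamming distance techniques can be applied.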

To enable error-tolerant searchable encryption, we need three main primitives: the key materials generation, the send request and the receive request.

Definition 1. A (ϵ, λ_min,λ_max)-Public Key Error-Tolerant Searchable Encryption for the edit distance is obtained with the following probabilistic polynomial-time methods:

KeyGen(1^ℓ) initializes the system, and generates public and private keys (pk,sk) for a security parameter ℓ. The public key pk is used to store data in the Cloud, and the secret key sk is used to retrieve information.
${Send}_{X, CL} (m, pk)$ is a protocol in which $X$ sends to $CL$ the data m ∈ {0,1}^N to be stored in the Cloud. At the end of the protocol, $CL$ has stored the message m at a virtual address noted φ(m).
${Retrieve}_{Y, CL} (m', sk)$ is a protocol in which, given a fresh message m' ∈ {0,1}^N, $Y$ asks for the virtual addresses of all data that are stored on $CL$ and are close to m', with respect to the Completeness(λ_min) and Soundness(λ_max) criteria (cf. Section Security requirements). This outputs a set of virtual addresses, noted Φ(m'), where $Y$ can reach the corresponding messages.

The Completeness and Soundness criteria for the parameters λ_min, λ_max express the fact that a stored message will actually be retrieved if it is at an edit distance at most λ_min from m', and that no message at a distance greater than λ_max from m' will be returned (with a given non-negligible probability). We emphasize that the definition above focuses on the search problem (which is the difficult task here): the algorithms' outputs are the virtual addresses where the retriever $Y$ can retrieve the messages. The messages themselves are possibly stored encrypted via a second encryption scheme.

An important difference compared to [3] is that we do not rely on fuzzy keyword sets; instead, we want to ensure a given tolerance (materialized by λ_min, λ_max). By avoiding wildcards and grams, we do not make any prior assumption on the location of the errors.

Security requirements

We first recall the completeness and soundness criteria that formalize the condition for the scheme and the Cloud to actually return the correct answer.

Condition 1. Completeness(λ_min), Soundness(λ_max) Let m₁, ..., m_p ∈ {0,1}^N be p different binary strings, and let m' ∈ {0,1}^N be another string. Assume that, after initialization of the system, all the messages m_i have been stored in the Cloud $CL$ with virtual addresses φ(m_i), and that a user $Y$ retrieved the set of virtual addresses Φ(m') associated to m'.

1. The scheme is said to be complete, up to a probability 1 - ϵ₁, if
$\underset{m'}{\Pr} [\exists i, ed(m', m_{i}) \leq λ_{min} \wedge φ(m_{i}) \notin Φ(m')] \leq ϵ_{1}$
(i.e. except with a small probability, all close messages are retrieved during the search through a Retrieve query).

2. The scheme is said to be sound, up to a probability 1 - ϵ₂, if
$\underset{m'}{\Pr} [\exists i, ed(m', m_{i}) > λ_{max} \wedge φ(m_{i}) \in Φ(m')]$
is bounded by ϵ₂ (i.e. a false positive happens only with a small probability).

We now give the definition of the security properties that the scheme needs to fulfill to ensure that the data stored in the Cloud are kept confidential and that privacy of queries is ensured.

Condition 2. Sender Privacy The scheme is said to respect Sender Privacy if the advantage of any server is negligible in the $Exp_{A}^{\text{Sender Privacy}}$ experiment, described below. Here, $A$ is a malicious opponent taking the place of $CL$, and $C$ is a challenger at the user side.

\begin{gathered}
Exp_{A}^{\text{Sender Privacy}} \\
\left| \begin{array}{lllll}
1. & (pk, sk) & \leftarrow & KeyGen(1^{ℓ}) & (C) \\
2. & \{m_{2}, \dots, m_{Ω}\} & \leftarrow & A & (A) \\
3. & φ(m_{i}) & \leftarrow & Send_{C, CL}(m_{i}, pk) & (C) \\
4. & \{m_{0}, m_{1}\} & \leftarrow & A & (A) \\
5. & φ(m_{e}) & \leftarrow & \begin{matrix} Send_{C, CL}(m_{e}, pk) \\ e \in_{R} \{0, 1\} \end{matrix} & (C) \\
6. & \text{Repeat steps } (2, 3) & & & \\
7. & e' \in \{0, 1\} & \leftarrow & A & (A)
\end{array} \right.
\end{gathered}

The advantage of the adversary is $|\Pr[e' = e] - \frac{1}{2}|$.

This experiment corresponds to a first phase where the adversary receives Send requests that he chose himself. Then $A$ selects a pair (m₀,m₁) of messages and the challenger $C$ chooses randomly one of the two messages to be stored in the Cloud. At the end, after a polynomial number of other Send requests, the adversary tries to guess which one of m₀ or m₁ has been sent. When the advantage of the adversary is negligible, we can assume that the data stored in the Cloud remains private.

The next condition focuses on retrieve queries. We want to ensure that the Cloud does not learn information on the retrieve queries, i.e. neither on the input message m', nor on the close retrieved messages.

Condition 3. Receiver Privacy The scheme is said to respect Receiver Privacy if the advantage of the Cloud is negligible in the experiment $Exp_{A}^{\text{Receiver Privacy}}$ described below. $A$ denotes the malicious opponent taking the place of $CL$, and $C$ the challenger at the user side.

\begin{gathered}
Exp_{A}^{\text{Receiver Privacy}} \\
\left| \begin{array}{lllll}
1. & (pk, sk) & \leftarrow & KeyGen(1^{ℓ}) & (C) \\
2. & \{m_{1}, \dots, m_{Ω}\} & \leftarrow & A & (A) \\
3. & φ(m_{i}),\ (i \in \{1, \dots, Ω\}) & \leftarrow & Send_{C, CL}(m_{i}, pk) & (C) \\
4. & \{m'_{2}, \dots, m'_{p}\} & \leftarrow & A & (A) \\
5. & Φ(m'_{j}),\ (j \in \{2, \dots, p\}) & \leftarrow & Retrieve_{C, CL}(m'_{j}, sk) & (C) \\
6. & (m'_{0}, m'_{1}) & \leftarrow & A & (A) \\
7. & Φ(m'_{e}) & \leftarrow & \begin{matrix} Retrieve_{C, CL}(m'_{e}, sk) \\ e \in_{R} \{0, 1\} \end{matrix} & (C) \\
8. & \text{Repeat steps } (4, 5) & & & \\
9. & e' \in \{0, 1\} & \leftarrow & A & (A)
\end{array} \right.
\end{gathered}

The advantage of the adversary is $|\Pr[e' = e] - \frac{1}{2}|$.

This experiment begins with the adversary's choice of messages to be stored in the Cloud. Then $A$ chooses a number of retrieve queries to be made by the challenger. Following this, $A$ selects a pair of challenges $(m_{0}^{'}, m_{1}^{'})$ and one of them is randomly selected by $C$ as input to a Retrieve query. Note that $A$ should not see the result of the Retrieve queries. At the end of the experiment, $A$ tries to guess which one it was.

This condition captures the privacy of the receiver $Y$ when generating Retrieve queries: $CL$ does not learn information on their content.

Useful technical tools

Private identification schemes

The principle of a private identification scheme is to manage nearest neighbor search in the encrypted domain. The two main sub-problems are the Approximate Nearest Neighbor (ANN) problem and Searchable Encryption.

The Approximate Nearest Neighbor (ANN) problem is defined as follows: Let $P$ be a set of points in a metric space (E,d_E). For an input x ∈ E and ϵ ≥ 0, find a point p_x ∈ $P$ such that

d_{E}(x, p_{x}) \leq (1 + ϵ) \min_{p \in P} d_{E}(x, p).
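The inequality above can be checked by brute force on a small example. The sketch below assumes the Hamming distance on equal-length bit strings as d_E; the helper names are hypothetical, and the exhaustive scan is for illustration only, since avoiding it is precisely the point of ANN algorithms.

```python
def hamming(x, y):
    """Number of positions where two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))


def is_approx_nearest(x, candidate, points, eps):
    """True if `candidate` is within a factor (1 + eps) of the true nearest neighbor of x."""
    best = min(hamming(x, p) for p in points)
    return hamming(x, candidate) <= (1 + eps) * best


points = ["0000", "0111", "1111"]
query = "0001"
for eps in (0, 0.5, 1):
    print(eps, [p for p in points if is_approx_nearest(query, p, points, eps)])
```

With ϵ = 0 only the exact nearest neighbor qualifies; a larger ϵ admits more candidates, which is what leaves room for faster, sketch-based search.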

This is an approximation of the Nearest Neighbor problem, as the exact problem is hard to solve in high-dimensional spaces. Several algorithms for the ANN problem have been proposed [19]. The basic principle is to rely on sketching methods, which output shorter vectors with increased stability and thereby simplify the search: $P$ is preprocessed with such sketching to obtain a look-up table of short vectors, on which the search can be carried out quickly by counting the number of exact or almost exact matches. Sketching must therefore guarantee that two close inputs yield the same short vector with good probability. Examples of sketching methods are numerous for vector spaces (with Hamming distance or Euclidean distance) [20–23], for instance random projections onto a small subspace. In the private identification schemes [7–9], the authors suggest using a construction exploited in [24] for iris biometry, which is adapted to binary vectors with Hamming distance comparison. The sketching functions are restrictions of n-bit vectors to r ≪ n of their coordinates, yielding r-bit vectors:

Definition 2. Let $F = (f_{1}, \dots, f_{μ})$ be a family of functions from {0,1}ⁿ to {0,1}^r such that for x ∈ {0,1}ⁿ, we have for all i ∈ {1,...,μ}, $f_{i}(x) = (x_{i_{1}}, \dots, x_{i_{r}})$. We say that $F$ is a sketching family for the Hamming distance from dimension n to dimension r.

With a sketching family where all functions are independent and if we assume that the inputs are uniformly distributed, the probability to obtain the same output with two distinct inputs can be estimated as follows.

\forall x, x' \in \{0, 1\}^{n}, \quad \begin{cases} \Pr_{f \in F}[f(x) = f(x') \mid d(x, x') < λ_{1}] > (1 - \frac{λ_{1}}{n})^{r} \\ \Pr_{f \in F}[f(x) = f(x') \mid d(x, x') > λ_{2}] < (1 - \frac{λ_{2}}{n})^{r} \end{cases}
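These bounds can be observed empirically. The following sketch assumes that each function of the family picks its r coordinates independently and uniformly at random (with repetition allowed), in which case two vectors at Hamming distance d collide with probability exactly (1 - d/n)^r. The function names and the parameter values are ours.

```python
import random


def make_sketching_family(n, r, mu, seed=0):
    """mu coordinate selections, each made of r indices drawn uniformly from 0..n-1."""
    rng = random.Random(seed)
    return [tuple(rng.randrange(n) for _ in range(r)) for _ in range(mu)]


def sketch(coords, x):
    """f_i(x): restriction of the bit vector x to the selected coordinates."""
    return tuple(x[c] for c in coords)


def collision_rate(family, x, y):
    """Fraction of functions in the family for which x and y have the same sketch."""
    return sum(sketch(c, x) == sketch(c, y) for c in family) / len(family)


n, r, mu = 64, 8, 2000
family = make_sketching_family(n, r, mu)
x = [0] * n
for d in (2, 4, 16, 32):
    y = [1] * d + [0] * (n - d)  # y differs from x in exactly d positions
    print(d, round(collision_rate(family, x, y), 3), round((1 - d / n) ** r, 3))
```

Close inputs collide on most sketches while distant inputs rarely do, and the gap between the two regimes widens as r grows.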

In our construction, we rely on this idea for Hamming distance approximation combined with the embedding method from [10, 11] of edit distance into the Hamming space.

As far as privacy and security are concerned, private identification schemes are based on the searchable encryption principle. The main goal of searchable encryption [2, 25] is to store messages in an encrypted database while still making it possible to search for the messages related to some keywords. For instance, this could correspond to a remote mailing service where the user wants to retrieve his messages containing a given keyword, without letting the server learn information on the content of his mails. [3] also uses such a technique, but only in a symmetric context. Following the idea of [7], we adapt an asymmetric searchable encryption scheme for our construction (cf. Section Our construction).

A general solution to design a searchable encryption scheme is to associate a message to a set of keywords and to consider each keyword as a virtual address where the receiver can recover a link toward the associated messages. To manage all these relations efficiently, we follow [1, 26, 27] by using Bloom filters. A Bloom filter [28] is a data structure used in membership checking applications to reduce the memory cost of data storage. We use an extension of this notion called Bloom filters with storage, which makes it possible to store identifiers of elements in each cell of the array.

Definition 3. Bloom Filter with Storage, [1] Let $S$ be a finite subset of a space E together with a set of identifiers associated to $S$. For a family of v (independent and random) hash functions $H = \{h_{1}, \dots, h_{v}\}$, with each h_i: E → {1,...,k}, a (v,k)-Bloom Filter with Storage for indexation of $S$ is $H$, together with the array (t₁,...,t_k), defined recursively as:

1. ∀i ∈ {1,...,k}, t_i ← ∅,
2. $\forall x \in S, \forall j \in \{1, \dots, v\}, t_{h_{j}(x)} \leftarrow t_{h_{j}(x)} \cup \{Id(x)\}$, where Id(x) is the identifier of x.

In other words, the array is empty at the beginning and for each element $x \in S$ , we add the identifier Id(x) of x at the cells indexed by h₁(x),...,h_v(x). To recover the identifiers associated to an element y, we compute $T (y) = ⋂_{j = 1}^{v} t_{h_{j} (y)}$ . The following lemma describes the accuracy of this storage method.

Lemma 1. [28] Let $(H, t_{1}, \dots, t_{k})$ be a (v,k)-Bloom filter with storage indexing $S$ . For $x \in S$ , the following properties hold:

$Id(x) \in T(x) = ⋂_{j = 1}^{v} t_{h_{j}(x)}$, i.e. the identifier of $x \in S$ is always retrieved,
the probability Pr[t ∈ T(y) and t ≠ Id(y)] to obtain a false positive is $(1 - (1 - \frac{v}{k})^{|S|})^{v}$.
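A Bloom filter with storage is straightforward to prototype in the clear. The sketch below is an assumption-laden illustration: the v hash functions are derived from SHA-256 with the function index as a prefix, cells are numbered from 0 to k-1 instead of 1 to k, and the class and method names are hypothetical.

```python
import hashlib


class BloomFilterWithStorage:
    """(v, k)-Bloom filter whose cells hold sets of identifiers instead of single bits."""

    def __init__(self, v, k):
        self.v = v
        self.k = k
        self.cells = [set() for _ in range(k)]  # t_1, ..., t_k, all empty at first

    def _index(self, j, element):
        # h_j(element): hash the function index together with the element, reduce modulo k.
        digest = hashlib.sha256(f"{j}|{element}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.k

    def add(self, element, identifier):
        """Add Id(x) to the v cells indexed by h_1(x), ..., h_v(x)."""
        for j in range(self.v):
            self.cells[self._index(j, element)].add(identifier)

    def lookup(self, element):
        """T(y): intersection of the v cells indexed by h_1(y), ..., h_v(y)."""
        result = set(self.cells[self._index(0, element)])
        for j in range(1, self.v):
            result &= self.cells[self._index(j, element)]
        return result


bloom = BloomFilterWithStorage(v=3, k=64)
bloom.add("CASTLE", "id-1")
bloom.add("TOWER", "id-2")
print(bloom.lookup("CASTLE"))
```

A lookup always returns the identifier of a stored element, as the first property of the lemma states; it may also return identifiers of other elements when all v cells happen to collide, which is the false positive case.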

Edit distance approximation

Our construction is based on the embedding of edit distance into Hamming distance designed in [10]. To solve problems such as those described in Section Private identification schemes, data are embedded into Hamming space and then we can apply techniques dedicated to Hamming distance.

Definition 4. Let $(E_{1}, d_{E_{1}})$ and $(E_{2}, d_{E_{2}})$ be two metric spaces. An embedding $ψ : (E_{1}, d_{E_{1}}) \to (E_{2}, d_{E_{2}})$ has distortion c if for all x, y ∈ E₁,

c^{- 1} \times d_{E_{1}} (x, y) \leq d_{E_{2}} (ψ (x), ψ (y)) \leq c \times d_{E_{1}} (x, y)

[10] proves that {0,1}^N with edit distance can be embedded into ℓ₁ with small distortion $2^{O(\sqrt{\log_{2} N \log_{2} \log_{2} N})}$ and then shows, building on a previous work [20], how to end up efficiently in the Hamming space. More precisely:

Lemma 2. [10] There exists a probabilistic polynomial time algorithm π and constants c₁,c₂ > 0 such that, for every N ∈ ℕ, for every 4^-N ≫ δ > 0, and for all x ∈ {0,1}^N, π computes $π(x) \in ℓ_{1}^{c_{2} (N^{2} \log_{2} (N / δ))}$, and such that for all x, y ∈ {0,1}^N, with probability at least 1 - δ,

2^{-c_{1} \sqrt{\log_{2} N \log_{2} \log_{2} N}} \, ed(x, y) \leq L_{1}(π(x), π(y)) \leq 2^{c_{1} \sqrt{\log_{2} N \log_{2} \log_{2} N}} \, ed(x, y)

where L₁ denotes the ℓ₁ distance.

The principle of the algorithm is to partition a string x into about $2^{\sqrt{\log_{2} N \log_{2} \log_{2} N}}$ substrings. From each substring xⁱ, the set of all substrings (shingles) obtained with a sliding window of fixed size t is considered (i.e. all possible substrings of xⁱ formed by t consecutive coordinates). By considering the metric defined by the minimum cost perfect matching between sets, [10] then explains how such sets are embedded into ℓ₁. Note that this technique introduces a lot of redundancy in the embedded substrings, which increases the dimension by a factor of at least N²; this remains interesting for our construction, as the distortion is very low and the algorithm remains polynomial in N.
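Only the shingling step lends itself to a short illustration; the partition into blocks and the matching-based embedding into ℓ₁ of [10] are not reproduced here. A minimal sketch, with a hypothetical function name:

```python
def shingles(block, t):
    """All substrings of `block` made of t consecutive characters, in order of position."""
    return [block[i:i + t] for i in range(len(block) - t + 1)]


print(shingles("011010", 3))
```

Consecutive shingles overlap in t - 1 characters, which is the source of the redundancy mentioned above: a single insertion or deletion shifts the shingles but leaves most of them unchanged as a set.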

Based on [20], the authors then show that there exist 0 < α < β < c₂ and an embedding Ψ from {0,1}^N with edit distance ed to $\{0, 1\}^{c_{2} \log_{2} (1 / δ)}$ with Hamming distance HD that computes Ψ(x) = ψ(x; t) for every t ∈ ℕ, such that with probability at least 1 - δ:

If ed(x,y) ≤ t, then HD(ψ(x), ψ(y)) ≤ α log₂(1/δ).
If $e d (x, y) \geq 2^{c_{1} (\sqrt{{log}_{2} N {log}_{2} {log}_{2} N})} t$ then HD(ψ(x), ψ(y)) ≥ β log₂(1/δ).

Our construction

Technical description

Setup

Let {0,1}^N be equipped with the edit distance. Let Ψ be the embedding of ({0,1}^N, ed) into $(\{0, 1\}^{c_{2} \log_{2} (1 / δ)}, HD)$ (cf. previous section). Let $F = (f_{1}, \dots, f_{μ})$ be a sketching family for the Hamming distance from dimension c₂ log₂(1/δ) to a dimension r. Let $(H, t_{1}, \dots, t_{k})$, with $H = \{h_{1}, \dots, h_{v}\}$ and h_i: {1,...,μ} × {0,1}^r → {1,...,k}, be a (v,k)-Bloom Filter with Storage.

Let (Gen, Enc, Dec) be a semantically secure (IND-CPA, [29]) public key cryptosystem, let ${Query}_{DB}^{PIR}$ be the retrieve query from a database DB of a Private Information Retrieval protocol, and let ${Update}_{DB}^{PIS} (val, i)$ be the write query into a database DB (that adds val to the i-th field) of a Private Information Storage protocol.

A Private Information Retrieval (PIR) [16] protocol makes it possible to retrieve a specific block from a database without letting the database learn anything about the query and the answer (i.e. neither the index of the block nor the value of the block). This is done through a method ${Query}_{DB}^{PIR} (i)$, which allows a user to recover the element stored at index i in DB by running the PIR protocol. A Private Information Storage (PIS) protocol [17] makes it possible to write information into a database while preventing the database from learning information on what is being stored (neither the value of the data, nor the index of the location where the data is being stored). Such a protocol provides a method ${Update}_{DB}^{PIS} (val, index)$, which takes as input an element and a database index, and puts the value val into the database entry index. See Section Cryptographic primitives for more details on these notions.

KeyGen(1^ℓ)

The function takes a security parameter ℓ as input and uses Gen to generate a public and private key pair (pk,sk). It also initializes the Bloom filter array, (t₁,...,t_k) ← (∅,...,∅), and provides it to the Cloud.

${Send}_{X, CL} (m, pk)$

To send a message to the Cloud, a user $X$ executes the following algorithm.

1. $X$ sends Enc(m,pk) to $CL$, which gives him back a virtual address φ(m).
2. $X$ computes the embedding ψ(m) and, for all i ∈ {1,...,μ}, the sketch f_i∘ψ(m). Then, for all i ∈ {1,...,μ} and all j ∈ {1,...,v}, $X$ asks $CL$ to update the Bloom filter array through the queries
${Update}_{CL}^{PIS} (Enc(φ(m), pk), h_{j}(i \| f_{i} \circ ψ(m)))$
in order to add the identifier to the cell $t_{h_{j}(i \| f_{i} \circ ψ(m))}$.

For privacy reasons, $X$ also completes the Bloom filter array with random data so that all cells t₁,...,t_k contain the same number l of elements.

At the end of the algorithm, $CL$ has stored the message m at a virtual address noted φ(m), and the Bloom filter structure has been filled with encrypted identifiers, indexed by several sketches that make it possible to search with approximate data.

${Retrieve}_{Y, CL} (m', sk)$

To retrieve a message in the Cloud, a user $Y$ proceeds as follows.

1. For all i ∈ {1,...,μ} and for all j ∈ {1,...,v}, $Y$ computes α_{i,j} = h_j(i||f_i∘ψ(m')).
2. $Y$ executes ${Query}_{CL}^{PIR} (α_{i, j})$ to retrieve the content of the cells $t_{α_{i, j}}$ from the Bloom filters stored in $CL$.
3. $Y$ decrypts the content of the cells with Dec(.,sk) and, for i ∈ {1,...,μ},
- $Y$ computes the intersection of the decrypted versions of the cells $t_{α_{i, 1}}, \dots, t_{α_{i, v}}$.
- If φ(m) is in this intersection, this means that $Y$ most probably found a match f_i∘ψ(m) = f_i∘ψ(m').
4. $Y$ counts the number of times an identifier is retrieved in such intersections $\cap_{j = 1}^{v} t_{α_{i, j}}$ (for i ∈ {1,...,μ}).
5. $Y$ selects all the identifiers that are retrieved at least a threshold τ number of times. This leads to the result $Φ(m') = \{φ(m_{i_{1}}), \dots, φ(m_{i_{γ}})\}$ of the execution of Retrieve.

Note that, as the queries are made through a PIR protocol, the Cloud cannot learn any information about them. The advantage of using Bloom filters here is to permit an efficient look-up into the structure, as for a classical Bloom filter (i.e. without any encryption), compared to other hash table techniques.
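The data flow of Send and Retrieve can be mimicked in the clear. The sketch below is a toy stand-in under strong simplifying assumptions: inputs are already bit vectors in the Hamming space (the embedding ψ is skipped), nothing is encrypted, cells are read and written directly instead of through PIR and PIS, and cells are not padded. Parameter values, SHA-256 based hashing and all names are hypothetical.

```python
import hashlib
import random
from collections import Counter

N_BITS, R, MU, V, K, TAU = 64, 8, 64, 3, 4096, 8

_rng = random.Random(1)
# Sketching family: each f_i keeps R distinct coordinates of an N_BITS-bit vector.
FAMILY = [tuple(_rng.sample(range(N_BITS), R)) for _ in range(MU)]


def cell_index(j, i, sketch_bits):
    """h_j(i || f_i(x)): hash the function index and the sketch to a cell in 0..K-1."""
    data = f"{j}|{i}|{''.join(map(str, sketch_bits))}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % K


def cell_indices(bits):
    """Map every pair (i, j) to the cell position used for the vector `bits`."""
    return {
        (i, j): cell_index(j, i, tuple(bits[c] for c in coords))
        for i, coords in enumerate(FAMILY)
        for j in range(V)
    }


def send(cells, bits, address):
    """Plaintext stand-in for Send: store `address` in the MU * V selected cells."""
    for position in cell_indices(bits).values():
        cells[position].add(address)


def retrieve(cells, bits, tau=TAU):
    """Plaintext stand-in for Retrieve: count sketch matches, keep addresses seen >= tau times."""
    positions = cell_indices(bits)
    counts = Counter()
    for i in range(MU):
        # Intersect the V cells selected by sketch i; an address in all of them is a match.
        hits = set.intersection(*(cells[positions[(i, j)]] for j in range(V)))
        counts.update(hits)
    return {address for address, count in counts.items() if count >= tau}
```

A usage example: after creating `cells = [set() for _ in range(K)]` and calling `send(cells, stored_bits, "addr-1")`, calling `retrieve` on a vector that differs from `stored_bits` in a couple of positions returns `addr-1` with high probability, because most of the 64 sketches still coincide, whereas a distant vector does not reach the threshold.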

Security properties

In this section, we explain why this construction achieves the security requirements of Section Security requirements.

Lemma 3. Completeness The scheme is complete up to a probability 1 - ϵ₁ with

ϵ_{1} \leq 1 - (1 - \frac{α}{c_{2}})^{r τ}

Proof. (sketch of) For m,m' such that ed(m,m') ≤ λ_min, Section Edit distance approximation implies that HD(ψ(m;λ_min), ψ(m';λ_min)) ≤ α log₂(1/δ) with probability 1 - δ. Hence

\Pr[f_{i}(ψ(m)) = f_{i}(ψ(m'))] > (1 - \frac{α}{c_{2}})^{r}.

This leads to a probability lower than $1 - (1 - \frac{α}{c_{2}})^{r τ}$ of finding the identifier of a close message fewer than τ times; this probability can thus be made small, cf. the example in Section Discussion.

More precisely, $ϵ_{1} \approx \sum_{i = 0}^{τ - 1} \binom{μ}{i} (1 - (1 - \frac{α}{c_{2}})^{r})^{μ - i} (1 - \frac{α}{c_{2}})^{r i}$.

Lemma 4. Soundness With $λ_{max} = 2^{c_{1} \sqrt{\log_{2} N \log_{2} \log_{2} N}} λ_{min}$ and provided that the Bloom filter functions from $H$ behave like pseudo-random functions from {1,...,μ} × {0,1}^r to {1,...,k}, the scheme is sound up to a probability 1 - ϵ₂, with:

ϵ_{2} \approx ((1 - \frac{β}{c_{2}})^{r} (1 - \frac{1}{k^{v}}) + \frac{1}{k^{v}})^{τ}

Proof. (sketch of) For m,m' such that ed(m,m') > λ_max, Section Edit distance approximation implies that HD(ψ(m;λ_min), ψ(m';λ_min)) ≥ β log₂(1/δ). Hence

\Pr[f_{i}(ψ(m)) = f_{i}(ψ(m'))] < (1 - \frac{β}{c_{2}})^{r}.

The other cause for an error could come from v collisions in the Bloom filter hashes.

Lemma 5. Sender Privacy Assume that the PIS protocol achieves PIS User Privacy; then the scheme ensures Sender Privacy.

Proof. (sketch of) $CL$ receives only encrypted messages and Update^PIS queries, which do not make it possible to distinguish between the output of Send(m₀, pk) and the output of Send(m₁, pk) after the execution of Send(m_i, pk), i ∈ {2,...,Ω}, as we assume that the underlying encryption scheme is semantically secure and that the PIS protocol achieves PIS User Privacy.

Lemma 6. Receiver Privacy Assume that the PIR protocol ensures PIR User Privacy, then the scheme ensures Receiver Privacy.

Proof. (sketch of) The Cloud $CL$ receives and answers only Query^PIR requests, which by assumption leak information neither on their content nor on their outputs.

Discussion

To illustrate the error rates that one can expect, we give an example of choice of parameters. For instance, we choose a Bloom filter array of size k = 128 with v = 64 hash functions. Then we can approximate ϵ₂ as $(1 - \frac{β}{c_{2}})^{r τ}$. We have $ϵ_{1} \approx \sum_{i = 0}^{τ - 1} \binom{μ}{i} (1 - (1 - \frac{α}{c_{2}})^{r})^{μ - i} (1 - \frac{α}{c_{2}})^{r i}$ where α < β. Assume that α = c₂/4 and β = c₂/2; then with μ = 128 functions in the sketching family for the Hamming distance, r = 10 and τ = 3, we obtain ϵ₂ negligible and ϵ₁ ≈ 0.023. With these parameters, we have μ × v = 2¹³ for the number of queries during the Send and Retrieve phases. Concerning the cost of PIR and PIS queries, the size of the Bloom filter array should remain moderate, like k = 128 here, to be efficient.
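The figures of this example can be reproduced numerically from the expressions of ϵ₁ and ϵ₂ given in Lemma 3 and Lemma 4. In the sketch below, the function names are ours, and `alpha_ratio` and `beta_ratio` stand for α/c₂ and β/c₂.

```python
from math import comb


def completeness_error(mu, r, tau, alpha_ratio):
    """eps_1: probability that fewer than tau of the mu sketches match for a close message."""
    p = (1 - alpha_ratio) ** r  # lower bound on the probability that one sketch matches
    return sum(comb(mu, i) * (1 - p) ** (mu - i) * p ** i for i in range(tau))


def soundness_error(r, tau, beta_ratio, k, v):
    """eps_2: approximation from Lemma 4, including the Bloom filter collision term."""
    q = (1 - beta_ratio) ** r  # upper bound on the probability that one sketch matches
    collision = 1 / k ** v     # all v hash values collide
    return (q * (1 - collision) + collision) ** tau


mu, r, tau, k, v = 128, 10, 3, 128, 64
print("eps_1 =", completeness_error(mu, r, tau, alpha_ratio=0.25))
print("eps_2 =", soundness_error(r, tau, beta_ratio=0.5, k=k, v=v))
print("queries =", mu * v)
```

With these inputs, ϵ₁ evaluates to about 0.0225, in line with the value quoted above, and ϵ₂ to about 2⁻³⁰, since the collision term 1/k^v is far smaller than the sketch term.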

Note that in practice, the choice of λ_min depends on the number of errors between two words that one wants to tolerate for fuzzy search. Our embedding is made such that λ_max is made close to λ_min. The other parameters have then to be tuned to obtain small or negligible error rates ϵ₁ and ϵ₂ (cf. Lemma 3 and Lemma 4). The purpose of this paper is to introduce a new encrypted search with edit distance. At this point, our contribution is mainly theoretical. To go further, one should consider a practical use case over the cloud to be able to devise an efficient implementation.

References

Boneh D, Kushilevitz E, Ostrovsky R, Skeith WE III: Public Key Encryption That Allows PIR Queries. In CRYPTO, Volume 4622 of Lecture Notes in Computer Science. Edited by: Menezes A. Springer; 2007:50–67.
Google Scholar
Curtmola R, Garay JA, Kamara S, Ostrovsky R: Searchable symmetric encryption: improved definitions and efficient constructions. In CCS'06: Proceedings of the 13th ACM conference on Computer and communications security. ACM; 2006:79–88.
Chapter Google Scholar
Li J, Wang Q, Wang C, Cao N, Ren K, Lou W: Enabling Efficient Fuzzy Keyword Search over Encrypted Data in Cloud Computing. Cryptology ePrint Archive, Report 2009/593 2009, 16.
Google Scholar
Daugman J: The importance of being random: statistical principles of iris recognition. Pattern Recognit 2003,36(2):279–291. 10.1016/S0031-3203(02)00030-4
Article Google Scholar
Bringer J, Despiegel V: Binary feature vector fingerprint representation from minutiae vicinities. Biometrics: Theory, Applications, and Systems, 2010. BTAS'10. IEEE 4th International Conference on 2010.
Google Scholar
Bringer J, Despiegel V, Favre M: Adding localization information in a fingerprint binary feature vector representation. SPIE Defense, Security, Sensing 2011.
Google Scholar
Bringer J, Chabanne H, Kindarji B: Error-tolerant searchable encryption. IEEE ICC 2009 CISS 2009.
Bringer J, Chabanne H, Kindarji B: Identification with encrypted biometric data. Security Comm Networks 2011,4(5):548–562. 10.1002/sec.206
Adjedj M, Bringer J, Chabanne H, Kindarji B: Biometric Identification over Encrypted Data Made Feasible. In ICISS, Volume 5905 of Lecture Notes in Computer Science. Edited by: Prakash A, Gupta I. Springer; 2009:86–100.
Ostrovsky R, Rabani Y: Low distortion embeddings for edit distance. In STOC. Edited by: Gabow HN, Fagin R. ACM; 2005:218–224.
Ostrovsky R, Rabani Y: Low distortion embeddings for edit distance. J ACM 2007, 54(5).
Chor B, Kushilevitz E, Goldreich O, Sudan M: Private Information Retrieval. J ACM 1998,45(6):965–981. 10.1145/293347.293350
Gertner Y, Ishai Y, Kushilevitz E, Malkin T: Protecting data privacy in private information retrieval schemes. STOC 1998, 151–160.
Gentry C, Ramzan Z: Single-database private information retrieval with constant communication rate. In ICALP, Volume 3580 of Lecture Notes in Computer Science. Edited by: Caires L, Italiano GF, Monteiro L, Palamidessi C, Yung M. Springer; 2005:803–815.
Lipmaa H: An oblivious transfer protocol with log-squared communication. In ISC, Volume 3650 of Lecture Notes in Computer Science. Edited by: Zhou J, Lopez J, Deng RH, Bao F. Springer; 2005:314–328.
Gasarch WI: A Survey on Private Information Retrieval. [http://www.cs.umd.edu/~gasarch/pir/pir.html]
Ostrovsky R, Shoup V: Private information storage (extended abstract). STOC 1997, 294–303.
Ostrovsky R, Skeith WE III: Algebraic Lower Bounds for Computing on Encrypted Data. Cryptology ePrint Archive, Report 2007/064 2007.
Indyk P: Nearest neighbors in high-dimensional spaces. In Handbook of Discrete and Computational Geometry, Chapter 39. 2nd edition. Edited by: Goodman JE, O'Rourke J. CRC Press; 2004.
Kushilevitz E, Ostrovsky R, Rabani Y: Efficient Search for approximate nearest neighbor in high dimensional spaces. Symposium on the Theory Of Computing 1998, 614–623.
Kirsch A, Mitzenmacher M: Distance-sensitive bloom filters. Algorithm Engineering & Experiments 2006.
Indyk P, Motwani R: Approximate nearest neighbors: towards removing the curse of dimensionality. Symposium on the Theory Of Computing 1998, 604–613.
Andoni A, Indyk P: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun ACM 2008, 51: 117–122.
Hao F, Daugman J, Zielinski P: A Fast Search Algorithm for a Large Fuzzy Database. Inf Forensics Security, IEEE Trans 2008,3(2):203–212.
Boneh D, Di Crescenzo G, Ostrovsky R, Persiano G: Public Key Encryption with Keyword Search. In EUROCRYPT, Volume 3027 of Lecture Notes in Computer Science. Edited by: Cachin C, Camenisch J. Springer; 2004:506–522.
Goh EJ: Secure indexes. Cryptology ePrint Archive, Report 2003/216 2003.
Bethencourt J, Song DX, Waters B: New constructions and practical applications for private stream searching (extended abstract). In IEEE Symposium on Security and Privacy. IEEE Computer Society; 2006:132–139.
Bloom BH: Space/time trade-offs in hash coding with allowable errors. Commun ACM 1970,13(7):422–426. 10.1145/362686.362692
Goldwasser S, Micali S: Probabilistic Encryption. J Comput Syst Sci 1984,28(2):270–299. 10.1016/0022-0000(84)90070-9

Acknowledgements

The authors thank Céline Chevalier for her support.

Author information

Authors and Affiliations

Morpho, Issy-les-Moulineaux, France
Julien Bringer
Morpho & Télécom ParisTech, Paris, France
Hervé Chabanne

Corresponding author

Correspondence to Hervé Chabanne.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JB and HC built on their previous work on biometric identification, extending it to the new application area of cloud computing. Both authors read and approved the final manuscript.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Bringer, J., Chabanne, H. Embedding edit distance to enable private keyword search. Hum. Cent. Comput. Inf. Sci. 2, 2 (2012). https://doi.org/10.1186/2192-1962-2-2

Received: 24 August 2011
Accepted: 23 February 2012
Published: 23 February 2012
DOI: https://doi.org/10.1186/2192-1962-2-2
