Cooperative privacy game: a novel strategy for preserving privacy in data publishing

Achieving data privacy before publishing has become a major concern of researchers, individuals and service providers. A novel methodology, Cooperative Privacy Game (CoPG), is proposed to achieve data privacy; it uses Cooperative Game Theory to achieve privacy, and the resulting notion is named Cooperative Privacy (CoP). The core idea of CoP is that each player plays the best strategy to preserve his own privacy, which in turn contributes to preserving other players' privacy. CoP considers each tuple as a player, and tuples form coalitions as described in the procedure. The main objective of CoP is to obtain each individual's (player's) privacy as a goal, where each player is rationally interested in other individuals' (players') privacy. CoP is formally defined in terms of Nash equilibria, i.e., all the players are in their best coalition, to achieve k-anonymity. The cooperative value of each tuple is measured using the characteristic function of the CoPG to identify the coalitions. As the underlying game is convex, the algorithm is efficient and yields high-quality coalition formation with respect to intensity and disperse. The efficiency of the anonymization process is calculated using an information loss metric. The variations of the information loss with the parameters α (weight factor of nearness) and β (multiplicity) are analyzed and the obtained results are discussed.

then it may pave the way toward theft of his private information as well as his friend's information. It is not enough to preserve our own personal privacy; the people around us should also take action. Though many social network sites provide different levels of privacy control, rational cooperation among people is also necessary.
Domingo-Ferrer initiated the idea of cooperation in privacy and termed it Co-Privacy [4,5]. In this work, however, a cooperative value (CoV) is modeled that estimates the cooperation between the tuples using Cooperative Game Theory, and the approach is titled cooperative privacy. The following are the prime motivations for cooperative privacy (CoP) [5]:
• To keep the information society growing over time, preservation of privacy is necessary. It is just like trying to solve global issues (e.g. international terrorism, global warming) to sustain the physical world. Users of the information society now understand the significance of privacy preservation but are scared of using these services. People are pushed toward privacy preservation in the information society, just like the importance given to Go-Green and No Plastic by environmentalists in society.
• As far as possible, privacy should be maintained by the rational cooperation of others, in the absence of which the entire information system may become inconsistent. It is similar to traffic rules: if a person does not follow the traffic rules, it causes trouble to others and may sometimes lead to deadlock. Even though governments have framed the privacy of users as a human right, such rights still remain quite unrealistic. The setting of rules by the government alone is not enough to achieve privacy preservation; technologists should make an effort to ensure that users maintain privacy. At the same time there should be rational cooperation among the users for societal usefulness.
This paper proposes a game named Cooperative Privacy Game (CoPG), using Coalitional Game Theory [6], to find the CoP of a data set which is to be published. In CoPG, each tuple is considered as a player and is assigned a real value called the cooperative value (CoV), which is formally defined through the characteristic function. The CoV of each player in the data table is defined by the Shapley value [7], which captures the compactness around the player. CoPG considers the cooperation between the tuples (players), which is estimated based on the CoV. CoV is used to divide the given data table into groups, each called a coalition. Later, by applying anonymization techniques over these coalitions, CoP is achieved in terms of Nash equilibria [6] for k-anonymity [1]. Since the underlying game CoPG is convex [8], the algorithm used in the formation of coalitions is efficient and yields high quality with respect to intensity and disperse. Here, intensity is the average distance from a point to the center and disperse is the average point-to-point distance. The Shapley value of the characteristic function considered in this paper coincides with other solution concepts, namely the Nucleolus, the Gately point and the τ-value, as proved by Swapnil et al. [9]. This supports the adoption of the characteristic function, defined in a later section, for this game. Anonymization efficiency is calculated using an information loss metric, and the advantages of the proposed algorithms are discussed.

Related work
The k-anonymity principle, which protects privacy before the data is published, was proposed in [1]. Aggarwal [10], Bayardo et al. [11], LeFevre et al. [12] and Samarati et al. [13] employed and discussed suppression/generalization frameworks to achieve k-anonymity.
To support k-anonymity, new notions like l-diversity [2], t-closeness [14] and (α, k)-anonymity [15] were proposed, which improve the privacy protection mechanism. Giving these protected data sets to other parties for data mining does not by itself raise privacy issues, but none of the existing methods is able to completely eliminate the privacy risk.
Garg et al. [8] attained pattern clustering, an important methodology in data mining, by using game theory and proposed the use of the Shapley value to give a good start to K-means. For clustering, Gupta and Ranganathan [16,17] used a microeconomic game theoretic approach, which simultaneously optimizes two objectives, viz. compaction and equi-partitioning. Bulo and Pelillo [18] describe hypergraph clustering using evolutionary games. Chun and Hokari [19] proved the coincidence of the Nucleolus and the Shapley value for queueing problems.
Wang et al. [20] proposed efficient privacy-preserving two-factor authentication schemes for wireless sensor networks. The work in [21] presented a methodology using two-factor authentication to overcome the threat of de-synchronization attacks while preserving anonymity. The works in [22,23] initiated an evaluation metric for anonymous two-factor authentication in distribution systems. A recent study on crime data publishing [24] achieved k-anonymity with constrained resources.
Generally, game theory is a good methodology for estimating trade-offs. In Privacy Preserving Data Mining (PPDM), game theory is used to estimate the trade-off between the utility measure and the privacy level. Anderson [25] explains how game theory is applied to analyze privacy in legal issues. From an economic perspective, Bhome et al. [26], Kleinberg [27], Calzolari et al. [28] and Preibusch [29] present many privacy issues. Calzolari [28] uses game theory techniques to explore the flow of a customer's private information between two interested firms. Dwork [30] proposed differential privacy using the mechanism design methodology of game theory. In the context of recommender systems, Machanavajjhala [31] defines an accuracy metric for differential privacy which analyzes the trade-off between privacy and accuracy.
Kleinberg et al. [27] described three scenarios modelled as Coalitional Games (introduced in Osborne [32]), in which the reward allocation for the exchange of private information is done according to the core and Shapley values. Chakravarthy et al. [33][34][35] described a coalitional game theory mechanism to achieve k-anonymization for a data set.

Preliminaries
This section outlines the information available in the literature on k-anonymity and gives concise information about coalitional game theory concepts, viz. convex games, the Shapley value, the core [32] and related notions.

k-anonymity
Burnett et al. [36] presented the classification of the attributes of a data table D into Explicit Identifiers (EID), Quasi Identifiers (QID), Sensitive Attributes (SA) and Non-Sensitive Attributes (NSA). EID is the set of attributes which explicitly identify a person and his possible sensitive information, whereas QID is the set of attributes which can potentially identify the sensitive information of a person by association with other external sources. SA is the set of attributes, like Disease, Salary etc., which hold sensitive information of a person, and the remaining attributes that do not fall into the above three are categorized as NSA.
If every data tuple in a data table D is indiscernible, under the QID set of attributes, from at least k − 1 other tuples, then the table is said to be k-anonymized. For example, Table 1 is a 3-anonymized version of Table 2.
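The definition above can be checked mechanically. The following is a minimal sketch, not the paper's procedure: the function name `anonymity_level`, the column names and the small generalized table are all hypothetical (they are not the paper's Table 1). It groups rows by their QID values and reports the size of the smallest group, which is the largest k for which the table is k-anonymized.

```python
from collections import Counter

def anonymity_level(rows, qid_columns):
    """Return the largest k such that `rows` is k-anonymized under `qid_columns`."""
    # Rows with identical QID values are indiscernible from each other.
    groups = Counter(tuple(row[c] for c in qid_columns) for row in rows)
    return min(groups.values()) if groups else 0

# Hypothetical generalized table: two groups of three indiscernible rows.
table = [
    {"age": "[20-30]", "zip": "130**", "disease": "flu"},
    {"age": "[20-30]", "zip": "130**", "disease": "cancer"},
    {"age": "[20-30]", "zip": "130**", "disease": "flu"},
    {"age": "[40-50]", "zip": "148**", "disease": "asthma"},
    {"age": "[40-50]", "zip": "148**", "disease": "flu"},
    {"age": "[40-50]", "zip": "148**", "disease": "cancer"},
]

k = anonymity_level(table, ["age", "zip"])
print("table is", k, "-anonymized")
```

Only the QID columns take part in the grouping; the sensitive attribute is carried along unchanged.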

Cooperative game
A Cooperative game G with transferable utility (TU) [37] consists of two parameters, N and ν. N is a set of n players, i.e., N = {1, 2, ..., n}, and ν is a real-valued function defined over the power set of N, P(N), i.e., ν : P(N) → R with ν(φ) = 0; it is called the characteristic function or value function. For any subset S of N, ν(S) is called the value or worth of the coalition S. This is explained with a simple example [38].
Example There are three players, i.e. N = {1, 2, 3}. Player 1 is a seller; players 2 and 3 are buyers. Player 1 has a single unit to sell and its cost is $4. Each buyer is interested in buying the unit. The 'willingness-to-pay' of players 2 and 3 is $9 and $11 respectively. Now the game is characterized as follows. The characteristic function ν (in dollars) is defined as
$$\nu(\{1\}) = 0,\quad \nu(\{2\}) = 0,\quad \nu(\{3\}) = 0,\quad \nu(\{1,2\}) = 5,\quad \nu(\{1,3\}) = 7,\quad \nu(\{2,3\}) = 0,\quad \nu(\{1,2,3\}) = 7$$
The intuition behind ν is simple. If no coalition is formed to transact then the pay-off is zero, which gives the first three definitions. If Players 1 and 2 come together and transact, then the total gain of this coalition is the difference between the buyer's willingness-to-pay and the seller's cost price, and hence it is $5. Similarly, the worth of the coalition of Players 1 and 3 is $7. These two are represented by the 4th and 5th relations. Players 2 and 3 gain nothing by coming together, as each is looking for a seller and not a buyer, and therefore the worth is $0. Finally, ν({1, 2, 3}) = $7 and not $5 + $7 = $12, because Player 1 has only one unit to sell and so he can transact with only one buyer, either Player 2 or Player 3. Obviously, Player 1 transacts with the buyer with the higher willingness-to-pay to maximize his worth; hence ν({1, 2, 3}) = $7 rather than $5.
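The seller and buyer example can be written out as a small program. This is only an illustrative sketch: the names `worth`, `SELLER_COST` and `WILLINGNESS_TO_PAY` are chosen here and do not come from the source. It derives the worth of every coalition from the cost and the willingness-to-pay figures given in the example.

```python
from itertools import combinations

PLAYERS = (1, 2, 3)
SELLER = 1
SELLER_COST = 4                      # cost of the single unit, in dollars
WILLINGNESS_TO_PAY = {2: 9, 3: 11}   # buyers and what they would pay

def worth(coalition):
    """Characteristic function of the one-seller, two-buyer example."""
    members = set(coalition)
    if SELLER not in members:
        return 0                     # no seller, so no transaction
    offers = [WILLINGNESS_TO_PAY[p] for p in members if p in WILLINGNESS_TO_PAY]
    if not offers:
        return 0                     # seller alone cannot transact
    # Only one unit exists, so the seller trades with the best buyer present.
    return max(0, max(offers) - SELLER_COST)

for size in range(len(PLAYERS) + 1):
    for coalition in combinations(PLAYERS, size):
        print(coalition, worth(coalition))
```

The grand coalition is worth 7 and not 12 because `max` picks a single buyer, which mirrors the argument in the example.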

Convex cooperative game
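A cooperative game G = (N, ν) is said to be convex if the marginal contribution of a player t_i is larger for larger coalitions, i.e. for S ⊇ T. Formally, for every player t_i ∈ N and all coalitions T ⊆ S ⊆ N \ {t_i}:
$$\nu(S \cup \{t_i\}) - \nu(S) \ge \nu(T \cup \{t_i\}) - \nu(T) \quad (1)$$
Any coalitional game can be analyzed by using solution concepts, which describe the distribution patterns of the total value of the game among individual players. The following are some of the solution concepts.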

The core
Let x = (x_1, x_2, ..., x_n) be a payoff allocation vector, where x_i is the payoff of the ith player. The core is the set of all payoff allocation vectors which satisfy the following properties.
• Individual rationality: ∀i ∈ N, x_i ≥ ν({i})
• Collective rationality: Σ_{i ∈ N} x_i = ν(N)
• Coalitional rationality: ∀S ⊆ N, Σ_{i ∈ S} x_i ≥ ν(S)
Every payoff allocation in the core of the game is 'stable': intuitively, no player will benefit by unilaterally deviating from a given payoff allocation of the core. A payoff allocation which satisfies individual rationality and collective rationality is called an imputation.
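Core membership can be tested by brute force for small games. The sketch below assumes the standard core conditions (the total payoff equals the worth of the grand coalition, and no coalition receives less than its own worth) and reuses the seller and buyer example from the Cooperative game section; the names `in_core` and `WORTH` are illustrative.

```python
from itertools import combinations

PLAYERS = (1, 2, 3)
# Worth of every coalition in the one-seller, two-buyer example.
WORTH = {
    frozenset(): 0,
    frozenset({1}): 0, frozenset({2}): 0, frozenset({3}): 0,
    frozenset({1, 2}): 5, frozenset({1, 3}): 7, frozenset({2, 3}): 0,
    frozenset({1, 2, 3}): 7,
}

def in_core(payoff, players=PLAYERS, worth=WORTH, tol=1e-9):
    """Check collective rationality and every coalition's rationality."""
    if abs(sum(payoff[p] for p in players) - worth[frozenset(players)]) > tol:
        return False                              # total payoff must equal v(N)
    for size in range(1, len(players)):
        for coalition in combinations(players, size):
            share = sum(payoff[p] for p in coalition)
            if share < worth[frozenset(coalition)] - tol:
                return False                      # this coalition would deviate
    return True

print(in_core({1: 6, 2: 0, 3: 1}))   # seller and buyer 3 split the gain of 7
print(in_core({1: 4, 2: 1, 3: 2}))   # coalition {1, 3} gets 6 < 7
```

In this particular game the coalition {1, 3} is already worth as much as the grand coalition, so any core allocation has to give Player 2 nothing.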

Shapley value
The Shapley value of a coalitional game is a solution concept. It gives the expected payoff allocation for the Cooperative Privacy Game G. It formalizes a fair distribution of the total payoff among the players in coalition formation. The payoff allocation based on this solution concept is fair because it includes the information of each player's contribution to the total value, i.e., it accounts for the relative importance of each player in coalition formation [39].
Let Π be the set of all permutations over N and x_i^π be the contribution of player t_i to permutation π of CoPG G. Any imputation cov = (cov_1, cov_2, ..., cov_n) is a Shapley value fair distribution if it follows the axioms of Lloyd Shapley [7]. The Shapley value of each player i in the game G is formally given by
$$cov_i = \frac{1}{n!} \sum_{\pi \in \Pi} x_i^{\pi} \quad (2)$$
To overcome the rigidity of computing Eq. 2, [8] provided an equivalent equation stated as follows:
$$cov_i = \sum_{S \subseteq N \setminus \{t_i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \left[ \nu(S \cup \{t_i\}) - \nu(S) \right] \quad (3)$$
In the evaluation of the CoV of each tuple, the Shapley value is the only mapping for distributing the payoffs of the players in a coalitional game which satisfies properties like linearity, symmetry and the carrier property [8]. This is one of the reasons why the Shapley value is adopted for computing the cooperative value (CoV) used in the proposed method.
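The permutation form of the Shapley value can be evaluated directly on the seller and buyer example from the Cooperative game section. The sketch below is illustrative only: `shapley_values` and `WORTH` are names introduced here, and exact fractions are used so that the averages over all n! orderings are not rounded.

```python
from fractions import Fraction
from itertools import permutations

PLAYERS = (1, 2, 3)
WORTH = {
    frozenset(): 0,
    frozenset({1}): 0, frozenset({2}): 0, frozenset({3}): 0,
    frozenset({1, 2}): 5, frozenset({1, 3}): 7, frozenset({2, 3}): 0,
    frozenset({1, 2, 3}): 7,
}

def shapley_values(players, worth):
    """Average each player's marginal contribution over all orderings."""
    totals = {p: Fraction(0) for p in players}
    n_orderings = 0
    for ordering in permutations(players):
        seen = frozenset()
        for p in ordering:
            # Marginal contribution of p to the players placed before it.
            totals[p] += worth[seen | {p}] - worth[seen]
            seen = seen | {p}
        n_orderings += 1
    return {p: totals[p] / n_orderings for p in players}

values = shapley_values(PLAYERS, WORTH)
for player, value in values.items():
    print(player, value)
```

The three values add up to the worth of the grand coalition, which is the efficiency property of the Shapley value. The loop visits n! orderings, which is why a cheaper formula is needed for real data tables.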

Cooperative Privacy Game Model
This Game Model provides a mechanism to find out the privacy level, k-anonymity [1], of the given data set by using the cooperation between the tuples. The underlying cooperation between every pair of tuples is estimated and termed as CoV. CoV takes advantage of Shapley value of each tuple. The data is segregated into groups based on the CoV.
Assume a data set D having an attribute set A, of which A_QID is the collection of quasi-identifier attributes. A distance function d(t_i, t_j) gives the distance between the tuples t_i and t_j. To set up a CoPG among the players (tuples), the cooperation between a pair of tuples is defined through a similarity function f of this distance (Eq. 4), where d_max is the maximum of the distances between all pairs of points in the data set and is used to normalize the distances. Insightfully, if two data tuples t_i and t_j are very similar, then f(d(t_i, t_j)) approaches 1.
The following assumptions are made to establish the CoPG G = (N, ν):
• Each tuple is a player and N = D_QID, so |N| = n.
• Every player interacts with other players and tries to maximize their CoV, as it depends on the 'average increase in their worth' across all valid subsets.
• The characteristic function ν is defined as follows for all coalitions S ⊆ N:
$$\nu(S) = \frac{1}{2} \sum_{t_i, t_j \in S,\ i \neq j} f(d(t_i, t_j)) \quad (5)$$
Equation 5 computes the total worth of the coalition S, and it has quadratic computational complexity, as shown in a later section. The worth of the coalition is calculated as the sum of the pairwise cooperation values between the players; consequently, this formulation forms groups, each called a coalition, which fulfil the property that the points having more CoV will be in the same coalition. These are formed based on the similarities between the players, leading to seclusion of the data set into groups. These groups further undergo the anonymization process, which is discussed in the following sections.
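As an illustration of a worth computed as a sum of pairwise cooperation values, the sketch below evaluates a coalition from a precomputed symmetric similarity matrix. The matrix entries and the name `coalition_worth` are hypothetical, and each unordered pair is counted once, which is one reading of "sum of pairwise" values.

```python
from itertools import combinations

def coalition_worth(coalition, sim):
    """Sum sim[i][j] once for every unordered pair {i, j} in the coalition."""
    return sum(sim[i][j] for i, j in combinations(sorted(coalition), 2))

# Hypothetical pairwise similarities in [0, 1] for three tuples.
sim = [
    [1.0, 0.75, 0.25],
    [0.75, 1.0, 0.5],
    [0.25, 0.5, 1.0],
]

print(coalition_worth({0, 1}, sim))      # one pair
print(coalition_worth({0, 1, 2}, sim))   # three pairs
```

A coalition of size m has m(m − 1)/2 pairs, which is where the quadratic cost of evaluating a single coalition comes from.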

Convexity of CoPG
To prove that CoPG is convex, the following propositions are stated and proved.

Proposition 1 The Cooperative game G = (N, ν) is convex, where ν is defined as
$$\nu(S) = \frac{1}{2} \sum_{t_i, t_j \in S,\ i \neq j} f(d(t_i, t_j))$$
Proof According to the definition of a convex game (Eq. 1), for any player t_k ∈ N, if we consider two coalitions S and T such that T ⊆ S ⊆ N \ {t_k}, then
$$\nu(S \cup \{t_k\}) - \nu(S) = \sum_{t_j \in S} f(d(t_k, t_j)) \ge \sum_{t_j \in T} f(d(t_k, t_j)) = \nu(T \cup \{t_k\}) - \nu(T)$$
because f takes non-negative values and T ⊆ S. Every convex cooperative game has a non-empty core [6], and its Shapley value belongs to the core. From Proposition 1, our CoPG with the characteristic function stated in Eq. 5 is a convex game and hence it has a solution.
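Convexity of a small game can also be verified exhaustively. The following sketch assumes a worth given by the sum of pairwise similarities, with hypothetical matrix entries and illustrative function names; it checks that every player's marginal contribution to a coalition S is at least its contribution to any subset T of S.

```python
from itertools import combinations

def worth(coalition, sim):
    """Sum of pairwise similarities inside the coalition."""
    return sum(sim[i][j] for i, j in combinations(sorted(coalition), 2))

def subsets(items):
    items = list(items)
    for size in range(len(items) + 1):
        for chosen in combinations(items, size):
            yield frozenset(chosen)

def is_convex(sim, tol=1e-12):
    """Brute-force check: gains never shrink as the coalition grows."""
    players = range(len(sim))
    for k in players:
        others = [p for p in players if p != k]
        for s in subsets(others):
            gain_s = worth(s | {k}, sim) - worth(s, sim)
            for t in subsets(s):
                gain_t = worth(t | {k}, sim) - worth(t, sim)
                if gain_s < gain_t - tol:
                    return False
    return True

nonnegative = [[1.0, 0.75, 0.25], [0.75, 1.0, 0.5], [0.25, 0.5, 1.0]]
with_negative = [[1.0, 0.75, -0.5], [0.75, 1.0, 0.5], [-0.5, 0.5, 1.0]]
print(is_convex(nonnegative), is_convex(with_negative))
```

The second matrix shows why non-negativity matters: a negative pairwise value lets a larger coalition yield a smaller gain. The check is exponential in n and is meant only for tiny examples.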

Complexity of calculating cooperative value
This section presents the calculation process of CoV. The CoV for each tuple in the data table is computed using Eq. 3, but the computation is hard because it includes n! as a factor. The following proposition overcomes the computational infeasibility and provides a relation to compute CoV in polynomial time.
Proposition 2 The CoV of each tuple t_i in the CoPG G = (N, ν) is given by
$$cov_i = \frac{1}{2} \sum_{t_j \in N,\ j \neq i} f(d(t_i, t_j)) \quad (6)$$
Proof From Eq. 2, CoV is the sum of the contributions of the player t_i over all possible permutations, divided by n!. For a specific t_i, x_i^π is equal to the sum of all similarities with the other players whose position precedes that of t_i in the permutation π:
$$cov_i = \frac{1}{n!} \sum_{\pi \in \Pi} \ \sum_{t_j \text{ before } t_i \text{ in } \pi} f(d(t_i, t_j))$$
Now, for specific t_i and t_j in fixed positions, the number of possible permutations of the remaining players is (n − 2)!. Also, if t_i takes the first position in a permutation then there is no possible position for t_j before it; if t_i takes the second position then there is one possibility for t_j, and so on. Hence t_j precedes t_i in (n − 2)! (0 + 1 + ... + (n − 1)) = n!/2 permutations, so that
$$cov_i = \frac{1}{n!} \cdot \frac{n!}{2} \sum_{j \neq i} f(d(t_i, t_j)) = \frac{1}{2} \sum_{j \neq i} f(d(t_i, t_j))$$
and hence the result.
If we adopt the above argument, the CoV of a tuple t_i can be found with O(n) complexity. So, the CoV of all tuples of the given data table can be estimated in quadratic time.
In experiments it is observed that the evaluation of CoVs takes about linear time, as the actual participation of t_i is far less than the n! possible permutations.
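The saving can be checked numerically on a tiny example. The sketch below assumes a worth equal to the sum of pairwise similarities and compares the Shapley value obtained by enumerating all n! orderings with half the sum of each tuple's similarities to the others; the matrix entries and function names are hypothetical.

```python
from itertools import combinations, permutations
from math import factorial

def worth(coalition, sim):
    return sum(sim[i][j] for i, j in combinations(sorted(coalition), 2))

def cov_by_permutations(sim):
    """Exact Shapley value by enumerating all n! orderings (tiny n only)."""
    n = len(sim)
    totals = [0.0] * n
    for ordering in permutations(range(n)):
        seen = []
        for p in ordering:
            totals[p] += worth(seen + [p], sim) - worth(seen, sim)
            seen.append(p)
    return [total / factorial(n) for total in totals]

def cov_closed_form(sim):
    """Half the sum of similarities to all other tuples: O(n) per tuple."""
    n = len(sim)
    return [0.5 * sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]

sim = [
    [1.0, 0.75, 0.25, 0.5],
    [0.75, 1.0, 0.5, 0.25],
    [0.25, 0.5, 1.0, 0.125],
    [0.5, 0.25, 0.125, 1.0],
]
exact = cov_by_permutations(sim)
fast = cov_closed_form(sim)
print(all(abs(a - b) < 1e-9 for a, b in zip(exact, fast)))
```

Enumeration is already impractical beyond roughly ten tuples, whereas the closed form needs one pass over each row of the matrix.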

Proposition 3
In the convex CoPG setting, for given ε > 0, if d(t_i, t_j) ≤ ε → 0 for t_i, t_j ∈ N, then the cooperative values of t_i and t_j are almost the same.
Hence the hypothesis follows with the argument that ε → 0 implies cov_i → cov_j (since d(·) is a metric on its domain).
Fig. 1 Anonymization process architecture: cooperation between each pair of records is estimated as the first step, and as the second step the CoV of each record is calculated. In the third and fourth steps, respectively, seclusion and anonymization are performed
Insightfully, the above proposition states that the cooperative values of tuples which are very similar, i.e., the distance between them is almost zero, are nearly equal. As a result, tuples having almost equal CoV will be in the same coalition.

Achieving cooperative privacy
This section describes the mechanism adopted by the data protector, who is responsible for the privacy of sensitive information in his data releases. Figure 1 shows the steps involved in the process of anonymization for a given data set D to achieve cooperative privacy.
The methodology of the process is explained in the following steps:
1. Calculate cooperation value between each pair The similarity between every pair of tuples (players) in the given data set D_QID is estimated as the cooperation value of the pair using Eq. 4.
2. Evaluating CoV Each tuple is assigned a CoV using Eq. 6 and Proposition 2.
3. Process of seclusion The tuples are secluded into groups based on CoV; these groups then undergo the anonymization process.
4. Anonymization Each secluded group of the given data table is anonymized, and the k-anonymized data, along with the information loss and the k value of the data table D, is published.

Calculating values of cooperation
In step 1, a data set D is considered with a set of attributes A. By choosing the QID attributes, we have the set of quasi identifiers A_QID. The projection of D under A_QID is D_QID. By using Eq. 4, the cooperation value between every pair of tuples of D_QID is found. A symmetric matrix of order n (as D_QID has n tuples), called CoMatrix, can be constructed using these cooperation values. According to Proposition 2, this CoMatrix can be constructed in quadratic time. For simplicity, the Manhattan distance is chosen as the distance function in Eq. 4, and A_QID contains only numerical attributes. d_max, which is the maximum of all pairwise distances, is used for normalization in the formulation itself. The algorithm presented in Table 3 explains the calculation of CoMatrix in O(n^2) time, and its output is given as input to step 2.
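A CoMatrix of this kind could be built as in the following sketch. The similarity form f(d) = 1 − d/d_max is an assumption made here for illustration: it is normalized by d_max and reaches 1 for identical tuples, but the exact form of Eq. 4 is not reproduced in this text. The function names and the sample tuples are hypothetical as well.

```python
def manhattan(a, b):
    """Manhattan distance between two numeric QID tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def co_matrix(tuples):
    """Symmetric n x n matrix of pairwise cooperation values, built in O(n^2)."""
    n = len(tuples)
    dist = [[manhattan(tuples[i], tuples[j]) for j in range(n)] for i in range(n)]
    d_max = max(max(row) for row in dist)
    if d_max == 0:
        return [[1.0] * n for _ in range(n)]   # all tuples identical
    # ASSUMED similarity: 1 - d / d_max (not necessarily the paper's Eq. 4).
    return [[1.0 - dist[i][j] / d_max for j in range(n)] for i in range(n)]

# Hypothetical (age, hours-per-week) tuples.
data = [(25, 40), (27, 38), (60, 20)]
matrix = co_matrix(data)
for row in matrix:
    print([round(value, 3) for value in row])
```

In practice, attributes on very different scales would probably need to be rescaled before distances are taken; this sketch leaves the raw values untouched.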

Evaluation of CoV
In this section, the evaluation process of the CoV of each tuple of D_QID is discussed. It is hard to compute the CoV of each tuple using Eq. 3, as it involves n! permutation orderings. Nevertheless, since the game setting G is convex, the underlying CoV, being the Shapley value, gives the center of gravity of the extreme points of the non-empty core [8]. The characteristic function of this game model is given in Eq. 5. As laid down by Proposition 2, the CoV of each tuple can also be estimated in quadratic time using the following relation:
$$cov_i = \frac{1}{2} \sum_{t_j \in N,\ j \neq i} f(d(t_i, t_j))$$
The algorithm presented in Table 4 describes how each tuple is assigned its CoV. It takes the CoMatrix evaluated in the previous step as input and returns an array of CoVs of size n, corresponding to D_QID. This algorithm has O(n^2) complexity.

Process of seclusion
This process describes how to seclude the tuples of the data set D_QID into groups based on their CoVs; the underlying idea is that the density of tuples around a tuple forms a group. The basic idea is to start with the tuple whose CoV is maximum as the initial core point, collect all the tuples whose CoVs are 'very near' to the core point, and put them into one group, named a coalition group. The parameter α, called the cooperative parameter, governs this 'very near' in the process. The CoVs of tuples gradually decrease as they move away from the center of the coalition, and hence α decreases accordingly. So, in order to degrade α in terms of CoVs, a non-linear decreasing function has been considered. For this, α = β * h(l_max) is taken into account, where h is defined over the set of all CoVs and β ∈ [0, 1] is a weight factor. In practice, α = β * l_max / (g_max + 1) is considered. Here, g_max is the global maximum of CoV, used for normalization of the CoVs, and l_max is the local maximum of the coalition group. However, any degradation function α can be chosen over these CoVs based on the domain values of the given data set, and likewise β.
The Growth Control Queue (GCQ) is an array introduced in the algorithm (see Table 5). The queue is used to hold tuple indexes whose Shapley value is at least a γ-multiple of that of the center of the coalition. Here, γ is the multiplicity of CoV. This means that GCQ contains all unallocated points which have a very low CoV value compared to the density around the coalition group. These points do not take part in the further growth of the coalition group; this provides a uniform distribution of density throughout the coalition, and the density does not vary beyond the threshold [9]. The GCQ holds all this information, and an empty queue signals the end of the growth of the coalition.
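Since the algorithm of Table 5 is not reproduced in this text, the following is only one possible reading of the seclusion step, written as a hedged sketch. It assumes that 'very near' means a pairwise cooperation value of at least α = β * l_max / (g_max + 1), and that a newly added tuple is placed in the growth control queue only if its CoV is at least γ times the CoV of the coalition center. The function name `seclude` and the sample matrix are hypothetical.

```python
from collections import deque

def seclude(sim, cov, beta, gamma):
    """Greedy coalition growth; an interpretation, not the paper's Table 5."""
    g_max = max(cov)                        # global maximum CoV
    unallocated = set(range(len(cov)))
    coalitions = []
    while unallocated:
        # Start from the unallocated tuple with the largest CoV.
        center = max(sorted(unallocated), key=lambda i: cov[i])
        l_max = cov[center]                 # local maximum of this coalition
        alpha = beta * l_max / (g_max + 1)  # ASSUMED to threshold similarity
        unallocated.remove(center)
        group = [center]
        queue = deque([center])             # growth control queue
        while queue:
            p = queue.popleft()
            near = [q for q in sorted(unallocated) if sim[p][q] >= alpha]
            for q in near:
                unallocated.remove(q)
                group.append(q)
                if cov[q] >= gamma * l_max:
                    queue.append(q)         # dense enough to grow the group
        coalitions.append(group)
    return coalitions

sim = [
    [1.0, 0.875, 0.125, 0.125],
    [0.875, 1.0, 0.125, 0.125],
    [0.125, 0.125, 1.0, 0.875],
    [0.125, 0.125, 0.875, 1.0],
]
cov = [0.5 * (sum(row) - 1.0) for row in sim]   # half the off-diagonal row sum
print(seclude(sim, cov, beta=1.0, gamma=1.0))
print(seclude(sim, cov, beta=0.0, gamma=1.0))
```

Under this reading, a larger β raises the threshold and a larger γ lets fewer tuples extend a coalition, so both push toward more and smaller coalitions. Coalitions of size one would correspond to outliers.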

Anonymization
This phase takes as input the set of cooperative groups (CoG) obtained from the third phase and returns the anonymized data for publishing by using anonymization algorithms [3]. Hierarchy-free generalization of numerical attributes [12] is used to attain k-anonymization, and the information loss of the anonymized data is also measured.

Table 5 Algorithm
In the process, for every coalition and for every QID attribute, the max and min of the attribute's values in the coalition are found, and all values of that QID in that particular group are replaced with [min, max]. Finally, k is calculated as the minimum of the sizes of all partitions after the process.
The data user, who collects the data from the data collector, typically wants to get as much information from it as possible. When an anonymized data set is published, some information is lost due to the algorithm applied over the data. The user needs higher-quality data for purposes like data mining. The quality of k-anonymization of a given data set is typically assessed by how much quality has been lost in the process of anonymization. The utility of the data set after anonymization is measured using information loss (IL). There are different measures to estimate IL [3]; here, Eq. 7 is adopted to calculate the IL of the numerical attributes after anonymization. It is based on the spread of the domain of QID_j in the specific coalition group CoG_i. So, the IL can be considered as the sum of all ratios of the spreads, weighted by the ratio of group size to data set size.
The algorithm described in Table 6 covers the process of anonymization of the cooperative groups. It also explains the computation of the IL as well as finding the k value for k-anonymization. It takes the output of Algorithm 5 as input and returns the IL of the anonymized data, the k value of k-anonymization and the published data D′.
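The generalization step and the two reported quantities can be sketched as below. The [min, max] replacement and k as the smallest group size follow the description above. The information-loss form is an assumption made for illustration: each group's spread is divided by the attribute's spread over the whole data set and weighted by the share of records in the group; the exact Eq. 7 is not reproduced in this text. All names and the sample data are hypothetical.

```python
def anonymize(groups, data):
    """Generalize each group to [min, max] per QID; return (published, k, il)."""
    n_attrs = len(data[0])
    # Spread of each attribute over the whole data set (ASSUMED denominator).
    global_spread = [
        max(row[j] for row in data) - min(row[j] for row in data)
        for j in range(n_attrs)
    ]
    published = [None] * len(data)
    info_loss = 0.0
    for group in groups:
        lows = [min(data[i][j] for i in group) for j in range(n_attrs)]
        highs = [max(data[i][j] for i in group) for j in range(n_attrs)]
        interval = tuple(zip(lows, highs))
        for i in group:
            published[i] = interval           # every member gets [min, max]
        ratios = sum(
            (hi - lo) / spread
            for lo, hi, spread in zip(lows, highs, global_spread)
            if spread > 0
        )
        info_loss += len(group) / len(data) * ratios
    k = min(len(group) for group in groups)   # anonymity level achieved
    return published, k, info_loss

# Hypothetical (age, hours-per-week) tuples and two coalition groups.
data = [(20, 40), (30, 40), (50, 10), (60, 20)]
groups = [[0, 1], [2, 3]]
published, k, info_loss = anonymize(groups, data)
print(published[0], k, round(info_loss, 4))
```

A group whose members are far apart gets wide intervals and therefore contributes more to the loss, which is the effect discussed in the experiments.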

Experimentation and empirical analysis
Experiments have been performed on an Intel Core @ 2.93 GHz machine with 4 GB RAM, of which 2 GB has been exclusively allocated to the NetBeans platform. Experiments are conducted on the Adult Data set available at the UCI Machine Learning Repository [40]. 1000 records are selected randomly from the preprocessed Adult Data set, which has 36,282 data records. The numerical attributes Age, Fnlwgt and Hours-per-week are chosen as Quasi Identifiers for our experimentation, and the number of coalitions, anonymity level, number of outliers and IL (using Eq. 7) are calculated over different values of the similarity weight factor (β) and the multiplicity factor (γ). As a state-of-the-art comparison, the CoV algorithm is compared with the Mondrian Multidimensional methodology [12] and K-member clustering for k-anonymity [41]. See Fig. 2.

Number of coalitions vs β and γ
The variation of the number of coalitions over different γ values is given in Fig. 3. As the multiplicity factor (γ) is increased, the number of coalitions increases: when the multiplicity factor is relaxed, more tuples are included in each coalition, which leads to fewer coalitions, whereas if γ is increased there is tighter segregation, which causes more coalitions. The number of coalitions is constant until some value of γ that depends on the weight factor β. Another observation is that there is a sudden climb after a certain value (the sum of β and γ is around 1.75 for our sample data set), and the growth rate of the number of coalitions decreases as the weight factor β decreases (see Fig. 3).
Fig. 2 Comparison of CoV with K-Member clustering and Mondrian Multi-dimensional methods. Information loss is estimated using the three methods and the variations are presented. Our method shows consistency with the size of the data records, whereas the other two gradually decrease. When the data size is large, the methods give approximately equal results
Figure 4 depicts the relation between the number of coalitions and β. It is observed that the kind of variation is almost the same as above, but the growth rate in the number of coalitions is higher than in the former case. So, it can be said that the influence of β is greater than that of γ in the process of seclusion.

Number of outliers vs β and γ
The coalitions having a single record are marked as outliers in the algorithm (see Table 5), and the number of outliers for different values of β and γ is established. Figure 5 depicts the variation of the number of outliers with γ. It shows that there are no outliers for smaller values of γ. Relaxing γ includes in coalitions the tuples which are marked as outliers at larger values of γ.
Intuitively, the records which are far away, in terms of distance, from the coalitions are also included when γ is reduced. It can be observed that the number of outliers decreases as β decreases. A similar observation can be made in Fig. 6, where graphs are drawn for the number of outliers against β. There are more outliers than in the previous case, as β has more influence than γ.

Information loss vs β and γ
This section presents how the IL varies over the parameters weight factor β and multiplicity factor γ. Figure 7 describes the changes in the IL with different γ values. The IL is calculated using Eq. 7. As γ increases, fewer records are allowed into the groups. So the IL calculated using Eq. 7 implies that coalitions having more similar data records have less information loss. Insightfully, when γ is relaxed, far-away tuples are also included in the groups.
In the present work, hierarchy-free construction is used for the anonymization process over these groups. In this methodology the values of an attribute are generalised in a group by [min, max]. While implementing this process, if a far-away tuple is included in the group then unnecessarily more generalization is required, which in turn increases the IL. This implies that the IL increases as γ is relaxed (decreased).
The behaviour of the graphs plotted for IL under different values of β is almost the same. The IL is constant as γ increases until some point, then there is a sudden decline where the sum of β and γ assumes some fixed value (around 1.75 for our sample data set). Figure 8 shows the relation between IL and β. Patterns similar to those above are seen, but the rate of decrease is less than in the previous cases, and hence it can be concluded that the algorithm is more influenced by β.

Information loss vs size of data set
In this section the variation of the IL value with the size of the data set is explained. Figure 9 depicts the IL values corresponding to different sizes of data for different β and γ. In all cases, it shows that there are fluctuations up to a certain size, depending upon the data set. After that, IL increases, but its rate of growth is less than the rate of growth of the size of the data set. When β and γ are equal to 1, the interpolated curve for IL is shown in Fig. 10. Figure 11 shows the change in IL over the variation in β and γ. The graph shows that the IL is minimum when β and γ are equal to 1. The IL value increases as β or γ decreases, but simultaneously the number of outliers decreases, as shown in Figs. 5 and 6.
Fig. 10 Interpolated curve between information loss and size of the data set: estimated information loss according to the number of data records for β and γ equal to one. More fluctuations can be seen for smaller numbers of records because of the occurrence of more outliers. In contrast, as the data size increases, the information loss does not vary much
Fig. 11 Relation between information loss, β and γ: variation of information loss is shown for given β and γ. The best choice for least information loss is for both β and γ equal to 1