Reidentification risk estimation using Gaussian-copula-based generative models
Abstract
Data publishers typically apply anonymization techniques to safeguard privacy when sharing data for secondary analysis. However, current models for estimating reidentification risk do not adequately capture the privacy risks of disclosed data. In this paper, we show that existing risk estimation models fail to precisely estimate population uniqueness because they overlook sensitive and other attributes. To address this problem, we present a generative, copula-based model that precisely estimates the probability of reidentifying a specific individual based on population uniqueness, taking into account both quasi-identifying and sensitive attributes. We further show that risk estimation models should report the percentage of high-risk records in a dataset to provide additional insight into reidentification risk. We applied the model to 12 real-world datasets; the results show a true positive recognition rate above 91\% and an error rate below 19\% for high-risk records, notably lower than the current best error rate of 40\%. We also found that 43\% of Brazilians can be uniquely identified using a combination of five attributes, and that 79\% of the United States population can be uniquely identified using the nine attributes considered.
% such as Age, Work-Class, Educational Level, Marital Status, Occupation, Relationship status, Race, Sex, and Country of origin.
The proposed model achieves more than 81\% accuracy in estimating the reidentification risk of high-risk records.
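To make the two central notions of the abstract concrete, the following is a minimal Python sketch, not the model proposed in this paper. It assumes two categorical attributes with known marginal distributions and a single correlation parameter. It draws a synthetic population through a Gaussian copula and measures the fraction of records whose attribute combination is unique. The function names, the marginals, and the value of `rho` are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter
from math import sqrt
from statistics import NormalDist


def uniqueness_fraction(rows):
    """Fraction of rows whose attribute combination occurs exactly once."""
    if not rows:
        return 0.0
    counts = Counter(rows)
    return sum(1 for r in rows if counts[r] == 1) / len(rows)


def inverse_cdf(marginal, u):
    """Map u in [0, 1] to a category via cumulative marginal probabilities.

    `marginal` is a list of (value, probability) pairs summing to 1.
    """
    cumulative = 0.0
    for value, prob in marginal:
        cumulative += prob
        if u < cumulative:
            return value
    return marginal[-1][0]  # guard against floating-point round-off


def sample_population(n, marginal_a, marginal_b, rho, seed=0):
    """Draw n two-attribute records coupled by a Gaussian copula.

    Assumption: one correlation parameter `rho` in (-1, 1) links the
    two latent standard normal variables.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    rows = []
    for _ in range(n):
        # Correlated standard normals (2x2 Cholesky factor written out).
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Normal CDF gives uniforms; inverse marginal CDFs give categories.
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        rows.append((inverse_cdf(marginal_a, u1), inverse_cdf(marginal_b, u2)))
    return rows


# Hypothetical marginals, for illustration only.
age_band = [("18-29", 0.25), ("30-44", 0.30), ("45-64", 0.30), ("65+", 0.15)]
education = [("primary", 0.2), ("secondary", 0.5), ("tertiary", 0.3)]

population = sample_population(1000, age_band, education, rho=0.4)
print(uniqueness_fraction(population))

# Deterministic toy case: two of the four records are unique.
toy = [("34", "F"), ("34", "F"), ("51", "M"), ("29", "F")]
print(uniqueness_fraction(toy))  # 0.5
```

With only two low-cardinality attributes, almost no synthetic record is unique. Uniqueness rises quickly as more attributes are combined, which is the effect behind the five-attribute and nine-attribute figures reported above.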