Department of Operations and Information Systems, University of Massachusetts Lowell, Lowell, Massachusetts 01854
xiaobai_li{at}uml.edu
School of Management, University of Texas at Dallas, Richardson, Texas 75080
sumit{at}utdallas.edu
Data-mining techniques can be used not only to study the collective behavior of customers, but also to discover private information about individuals. In this study, we demonstrate that decision trees, a popular classification technique for data mining, can be used to effectively reveal individuals' confidential data, even when the identities of the individuals are not present in the data. We propose a novel approach for organizations to protect confidential data from such a classification attack. The key components of this approach include a set of entropy-based measures to evaluate disclosure risks of individual records, an optimal pruning algorithm to identify high-risk records, and a pair of data-swapping procedures to reduce the disclosure risks. The proposed method provides the best trade-off between data utility and privacy protection against classification attacks. It can be applied to data with both numeric and categorical attributes. An experimental study on six real-world data sets shows that the proposed method is very effective in protecting privacy while enabling legitimate data mining and analysis.
Subject classifications: computers; databases/artificial intelligence; data mining; decision trees; pruning; public sector; society; privacy; probability; entropy; relative entropy.
History: Received August 2007; revision received June 2008; accepted November 2008.