Detecting privacy leaks using corpus-based association rules
Detecting inferences in documents is critical for ensuring privacy when sharing information. In this paper, we propose a refined and practical model of inference detection using a reference corpus. Our model is inspired by association rule mining: inferences are based on word co-occurrences. Using the model and taking the Web as the reference corpus, we can find inferences and measure their strength through Web-mining algorithms that leverage search engines such as Google or Yahoo.
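As an illustration of how co-occurrence can measure the strength of an inference, the standard association-rule notion of confidence can be written over document counts. This is a sketch under assumptions, not necessarily the exact definition used in the paper: $K$ denotes a set of keywords, $T$ a sensitive topic, and $n_C(\cdot)$ the number of documents in the reference corpus $C$ that contain all the given terms (for the Web, a search-engine hit count would stand in for $n_C$).

```latex
\[
  \operatorname{conf}(K \Rightarrow T) \;=\; \frac{n_C(K \cup \{T\})}{n_C(K)},
  \qquad n_C(K) > 0 .
\]
```

Under this reading, a value close to 1 means that documents mentioning the keywords $K$ almost always mention $T$ as well, so $K$ would be treated as allowing inference of $T$.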
Our model also includes the important case of private corpora, to model inference detection in enterprise settings in which there is a large private document repository. We find inferences in private corpora by using analogues of our Web-mining algorithms, relying on an index for the corpus rather than a Web search engine.
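The private-corpus case can be sketched with a small in-memory inverted index in place of a search engine. This is a minimal illustration, not the authors' implementation: the function names `build_index`, `confidence`, and `inferring_keywords`, the whitespace tokenization, and the restriction to single-word keywords are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for raw in text.lower().split():
            word = raw.strip('.,;:!?"()')  # crude punctuation stripping (assumption)
            if word:
                index[word].add(doc_id)
    return index


def confidence(index, keyword, topic):
    """Fraction of documents containing `keyword` that also contain `topic`."""
    with_keyword = index.get(keyword, set())
    if not with_keyword:
        return 0.0
    return len(with_keyword & index.get(topic, set())) / len(with_keyword)


def inferring_keywords(index, topic, threshold):
    """Sorted words whose co-occurrence confidence for `topic` meets `threshold`."""
    return sorted(
        word
        for word in index
        if word != topic and confidence(index, word, topic) >= threshold
    )
```

Calling `inferring_keywords(build_index(docs), "hiv", 0.9)` on a list of document strings would return the words that almost never appear without the topic word. A real deployment would presumably rely on the repository's existing index rather than rebuilding one in memory.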
We present results from two experiments. The first experiment demonstrates the performance of our techniques in identifying all the keywords that allow for inference of a particular topic (e.g. "HIV") with confidence above a certain threshold. The second experiment uses the public Enron e-mail dataset. We postulate a sensitive topic and use the Enron corpus and the Web together to find inferences for the topic.
These experiments demonstrate that our techniques are practical, and that our model of inference based on word co-occurrence is well suited to efficient inference detection.
Chow, R.; Golle, P.; Staddon, J. Detecting privacy leaks using corpus-based association rules. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008); 2008 August 24-27; Las Vegas, NV. NY: ACM; 2008; 893-901.
Copyright © ACM, 2008. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in KDD 2008 http://doi.acm.org/10.1145/1401890.1401997