Detecting privacy leaks using corpus-based association rules


Event KDD 2008


Chow, Richard
Golle, Philippe
Staddon, Jessica
Technical Publications
August 25th 2008
Detecting inferences in documents is critical for ensuring privacy when sharing information. In this paper, we propose a refined and practical model of inference detection using a reference corpus. Our model is inspired by association rule mining: inferences are based on word co-occurrences. Using the model and taking the Web as the reference corpus, we can find inferences and measure their strength through Web-mining algorithms that leverage search engines such as Google or Yahoo! Our model also covers the important case of private corpora, which models inference detection in enterprise settings where there is a large private document repository. We find inferences in private corpora by using analogues of our Web-mining algorithms, relying on an index for the corpus rather than a Web search engine. We present results from two experiments. The first experiment demonstrates the performance of our techniques in identifying all the keywords that allow for inference of a particular topic (e.g., "HIV") with confidence above a certain threshold. The second experiment uses the public Enron e-mail dataset. We postulate a sensitive topic and use the Enron corpus and the Web together to find inferences for the topic. These experiments demonstrate that our techniques are practical, and that our model of inference based on word co-occurrence is well suited to efficient inference detection.
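
The abstract does not give the exact formulas or algorithms, so the following is only a minimal sketch of the general idea, assuming the standard association-rule definition of confidence: the confidence of the rule keyword -> topic is the number of documents containing both terms divided by the number containing the keyword. The function names (`build_index`, `confidence`, `find_inferences`), the in-memory index, and the toy documents are all hypothetical and are not taken from the paper; with the Web as the reference corpus, the document counts would presumably come from search-engine hit counts instead.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of ids of documents containing it.

    Hypothetical stand-in for the corpus index mentioned in the abstract.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index


def confidence(index, keyword, topic):
    """Confidence of the rule keyword -> topic.

    Assumed definition (standard association-rule confidence):
    docs containing both terms / docs containing the keyword.
    """
    keyword_docs = index.get(keyword.lower(), set())
    if not keyword_docs:
        return 0.0
    topic_docs = index.get(topic.lower(), set())
    return len(keyword_docs & topic_docs) / len(keyword_docs)


def find_inferences(index, topic, threshold):
    """Return every indexed word whose confidence for the topic meets the threshold."""
    topic = topic.lower()
    results = {}
    for word in index:
        if word == topic:
            continue
        score = confidence(index, word, topic)
        if score >= threshold:
            results[word] = score
    return results


# Toy reference corpus (invented for illustration only).
docs = [
    "hiv treatment uses antiretroviral drugs",
    "antiretroviral therapy for hiv patients",
    "patients visit the clinic",
    "the clinic offers flu treatment",
]
index = build_index(docs)
inferences = find_inferences(index, "HIV", threshold=0.8)
```

On a corpus this small, incidental words that happen to appear only alongside the topic also clear the threshold, so a realistic version would likely need a much larger corpus and some minimum document count per keyword; that filter is an assumption here, not something the abstract describes.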


Chow, R.; Golle, P.; Staddon, J. Detecting privacy leaks using corpus-based association rules. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2008); 2008 August 24-27; Las Vegas, NV.
