Private data are collected as
a normal part of our interactions with healthcare
providers, insurers, retail stores, and the government.
Although these data can be used to learn private
information about us, many beneficial applications
do not require it. For example, research
on medical outcomes, social issues or purchase patterns
doesn't necessarily require the identification of
the individuals in the respective medical, census
and marketing databases.
PARC has begun work on a "privacy
appliance" that protects privacy while allowing
the data to be put to beneficial use. Operating
as a "privacy firewall," a privacy appliance
sits between data consumers and data sources to
filter queries into those data sources and return
only data that do not violate privacy. The appliances
are owned and operated by the data owners. Little
or no change is required of the data sources.
Technical Challenges
For the privacy appliance to
be effective, it must limit the possibility of direct
and indirect disclosure of individual identities.
A key challenge is to protect the privacy of the
individuals represented in the data while retaining
the usefulness of the data. Several techniques,
including inference and access controls, are used
to address this problem. In addition, searchable
and immutable access logs are created to reduce
the threat of abuses.
Inference Control. The
first means of preventing direct disclosure is
simple: data such as names, social security numbers,
credit card numbers, addresses, phone numbers and
other identifying attributes are withheld from query
results.
Controls must also include
methods of preventing the inference of identity
based on the combination of data. It has been shown
that even seemingly innocuous attributes can, when
taken together, be used to compromise an individual's
privacy. In the example table below, social security
number is clearly identifying and, more surprisingly,
individuals can be identified when their sex, zip
code and year of birth are all known. Hence, those
three attributes are said to form an inference channel.
Indeed, 87% of the US population is uniquely identifiable
by sex, zip code and full date of birth (month, day and
year), according to an analysis of 1990 US Census data.
SSN          Sex     Zip code  Year of Birth
123-45-6789  Male    94305     1976
234-56-7891  Male    94305     1977
345-67-8912  Female  93165     1977
456-78-9123  Female  93165     1976
567-89-1234  Male    93165     1976
678-91-2345  Female  94305     1977
789-12-3456  Male    93165     1977
891-23-4567  Female  94305     1976
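
The inference channel in the example table can be checked mechanically. The Python sketch below groups the eight rows by a chosen set of attributes and reports what fraction of records are the only one in their group. The names RECORDS, group_sizes and unique_fraction are illustrative assumptions and are not part of the privacy appliance.

```python
from collections import Counter

# Rows of the example table: (SSN, sex, zip code, year of birth).
RECORDS = [
    ("123-45-6789", "Male", "94305", 1976),
    ("234-56-7891", "Male", "94305", 1977),
    ("345-67-8912", "Female", "93165", 1977),
    ("456-78-9123", "Female", "93165", 1976),
    ("567-89-1234", "Male", "93165", 1976),
    ("678-91-2345", "Female", "94305", 1977),
    ("789-12-3456", "Male", "93165", 1977),
    ("891-23-4567", "Female", "94305", 1976),
]
SEX, ZIP, YEAR = 1, 2, 3  # column positions within each row


def group_sizes(records, columns):
    """Count how many records share each combination of the given columns."""
    return Counter(tuple(rec[c] for c in columns) for rec in records)


def unique_fraction(records, columns):
    """Fraction of records whose combination of values no other record shares."""
    sizes = group_sizes(records, columns)
    unique = sum(1 for rec in records if sizes[tuple(rec[c] for c in columns)] == 1)
    return unique / len(records)


if __name__ == "__main__":
    print(unique_fraction(RECORDS, [SEX, ZIP]))        # two of the three attributes
    print(unique_fraction(RECORDS, [SEX, ZIP, YEAR]))  # the full combination
```

In this table every pair of the three attributes is shared by exactly two people, so no pair singles anyone out, while the full combination is unique for every row.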
Inference controls will also
include statistical analysis of data. A statistic
is considered sensitive if it reveals information
about an individual or if sensitive information
can be inferred from statistical summaries. Statistical
inference control has been widely used to protect
databases such as census data, and the standards
of operation are well defined. If queries
are computed over too few records, the privacy appliance
can label the data as sensitive and manage access
accordingly.
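
One simple way to express the "too few records" rule is a minimum query-set size. The Python sketch below is a minimal illustration under stated assumptions: the threshold MIN_QUERY_SET_SIZE, the exception SensitiveStatistic and the function guarded_mean are hypothetical names, and refusing the query is only one of the ways an appliance might manage access to a sensitive statistic.

```python
from statistics import mean

MIN_QUERY_SET_SIZE = 3  # assumed threshold; a deployment would choose its own


class SensitiveStatistic(Exception):
    """Raised when a statistic would be computed over too few records."""


def guarded_mean(rows, predicate, value_of, k=MIN_QUERY_SET_SIZE):
    """Return the mean of value_of(row) over matching rows, or refuse if fewer than k match."""
    matching = [value_of(row) for row in rows if predicate(row)]
    if len(matching) < k:
        raise SensitiveStatistic(
            f"only {len(matching)} matching records; at least {k} required"
        )
    return mean(matching)


# (sex, zip code, year of birth) taken from the example table above.
ROWS = [
    ("Male", "94305", 1976), ("Male", "94305", 1977),
    ("Female", "93165", 1977), ("Female", "93165", 1976),
    ("Male", "93165", 1976), ("Female", "94305", 1977),
    ("Male", "93165", 1977), ("Female", "94305", 1976),
]

if __name__ == "__main__":
    # Four records match, so the statistic is released.
    print(guarded_mean(ROWS, lambda r: r[1] == "94305", lambda r: r[2]))
    try:
        # Only two records match, so the statistic is treated as sensitive.
        guarded_mean(ROWS, lambda r: r[0] == "Male" and r[1] == "94305", lambda r: r[2])
    except SensitiveStatistic as err:
        print("refused:", err)
```

A size threshold on its own is a simplification: it says nothing about what can be inferred by combining several permitted statistics.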
Access Control.
The privacy appliance's access control will block
queries that request identifying information and
will block or modify queries that include any of
the undesired inferences identified by the inference
control tool.
The access controls also prevent
queries that request combinations of data that have
been identified as sensitive by the inference controls.
This mechanism inhibits the disclosure of information,
both within a single query and over time, from which
an individual identity could be inferred.
In the example used earlier, an individual who has
seen the sex and zip code fields would be prevented
from viewing the final piece of information, year
of birth, in any subsequent queries. We are designing
protocols that allow flexible information access
and fast query responses, while ensuring that no
inference channels are disclosed.
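
The over-time behavior described above can be sketched as a per-user record of what has already been released. This Python example is a hedged illustration only: DisclosureTracker, INFERENCE_CHANNELS and the attribute names are assumptions made for the sketch, and it tracks attributes per user rather than per record, which is coarser than a real protocol would need to be.

```python
# Assumed inference channel from the example: together these attributes identify a person.
INFERENCE_CHANNELS = [frozenset({"sex", "zip_code", "year_of_birth"})]


class DisclosureTracker:
    """Remembers which attributes each user has seen and blocks channel-completing queries."""

    def __init__(self, channels=INFERENCE_CHANNELS):
        self.channels = list(channels)
        self.seen = {}  # user -> set of attributes already released to that user

    def allow(self, user, requested):
        """Return True and record the release only if no inference channel is completed."""
        after = self.seen.get(user, set()) | set(requested)
        if any(channel <= after for channel in self.channels):
            return False  # this query would complete a channel, so it is blocked
        self.seen[user] = after
        return True


if __name__ == "__main__":
    tracker = DisclosureTracker()
    print(tracker.allow("analyst", {"sex", "zip_code"}))  # first two attributes
    print(tracker.allow("analyst", {"year_of_birth"}))    # would complete the channel
    print(tracker.allow("auditor", {"year_of_birth"}))    # a different user's history
```

The sketch does not address users who pool their results; handling that case would need additional policy beyond what is shown here.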
Searchable Audit Logs.
Audit logs ensure that all access to the
data is recorded immediately and permanently, with
no possibility of alteration. This capability is
important to protect individuals against potential
abuse of personal data. Information that we believe
is safe to release today may turn out to be privacy-compromising
in the future. Logs reveal who has accessed this
information. In addition, agents who have used the
database may be compromised, and in such an event
logs reveal what information the agents have accessed.
No one would be able to misuse data without a
strong probability of detection.
However, the logs themselves
are sensitive and must be protected. We are designing
tamper-resistant logging mechanisms that protect
the logs through encryption, while enabling controlled
search through the use of identity-based encryption.
With our mechanisms an escrow agent can issue a
search capability to identify which queries pertain
to a certain keyword, while releasing no unnecessary
additional information.
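
To make the idea of alteration-evident logging concrete, the Python sketch below uses a hash chain, in which each log record's hash covers the previous record's hash. This is a deliberately simpler, substituted technique and not the mechanism described above: it makes tampering detectable rather than impossible, and it provides neither the encryption of log contents nor the identity-based searchable encryption. All function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash, entry):
    """Hash the previous record's hash together with this entry's canonical JSON."""
    payload = json.dumps(entry, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append(log, user, query):
    """Append an access record whose hash depends on every earlier record."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"user": user, "query": query}
    log.append({"entry": entry, "hash": _digest(prev_hash, entry)})


def verify(log):
    """Recompute the chain and report whether every stored hash still matches."""
    prev_hash = GENESIS
    for record in log:
        if record["hash"] != _digest(prev_hash, record["entry"]):
            return False
        prev_hash = record["hash"]
    return True


if __name__ == "__main__":
    log = []
    append(log, "analyst", "count by zip_code")
    append(log, "analyst", "mean year_of_birth where zip_code = 94305")
    print(verify(log))                      # chain is intact
    log[0]["entry"]["query"] = "harmless"   # simulate after-the-fact editing
    print(verify(log))                      # edit is detected
```

Someone able to rewrite the entire log could recompute the whole chain, so in practice the most recent hash would also need to be held somewhere the log keeper cannot change.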
Applications
Government Databases.
Government agents mine intelligence data to build
models capable of predicting future terrorist attacks.
Our inference control and logging technologies would
allow authorized agents to search for indications
of terrorist-related activity while limiting the
potential to compromise the privacy of individuals.
Undesired inferences may also occur across data sources,
which is the reason for a cross-data-source privacy
appliance. Its inference control component works
the same as that of the individual privacy appliances,
except that instead of analyzing a single data source,
its inference control tool analyzes the collection
of data sources. While all of the analysis may be
done at the cross-data-source appliance, it is safer
for the privacy appliances closest to the data sources
to do as much as possible, to keep the privacy
mechanisms under the control of the data owners.
Consumer Settings. Individuals are often asked directly
to release personal information. For example, in
retail settings, individuals may be asked for demographic
information in return for coupons or other discounts.
It is difficult for the individual to evaluate the
privacy risks of releasing the information because
they do not know the attributes of the other respondents.
We are designing a personal privacy appliance
that allows individual users to evaluate the privacy
risks of releasing information. The personal privacy
appliance will store an individual's personal information
(e.g., shopping and entertainment preferences) and
inform the user of the risk of identification
posed by releasing any of this information.
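
One plausible way to quantify such a risk is the size of the group a person would blend into after a release. The Python sketch below is an assumption-laden illustration, not the appliance's method: identification_risk is a hypothetical function, and it presumes the appliance holds some estimate of the other respondents' attributes, which is exactly the knowledge an individual normally lacks. The rows of the earlier example table stand in for that estimate.

```python
def identification_risk(me, others, released):
    """Return 1 / (number of people, including me, who share my released values)."""
    matches = 1 + sum(
        all(other.get(attr) == me[attr] for attr in released) for other in others
    )
    return 1 / matches


# The first row of the example table plays the user; the other seven rows
# stand in for an assumed estimate of the other respondents.
ME = {"sex": "Male", "zip_code": "94305", "year_of_birth": 1976}
OTHERS = [
    {"sex": "Male", "zip_code": "94305", "year_of_birth": 1977},
    {"sex": "Female", "zip_code": "93165", "year_of_birth": 1977},
    {"sex": "Female", "zip_code": "93165", "year_of_birth": 1976},
    {"sex": "Male", "zip_code": "93165", "year_of_birth": 1976},
    {"sex": "Female", "zip_code": "94305", "year_of_birth": 1977},
    {"sex": "Male", "zip_code": "93165", "year_of_birth": 1977},
    {"sex": "Female", "zip_code": "94305", "year_of_birth": 1976},
]

if __name__ == "__main__":
    for released in (["sex"], ["sex", "zip_code"], ["sex", "zip_code", "year_of_birth"]):
        print(released, identification_risk(ME, OTHERS, released))
```

A risk of 1.0 means the released attributes match nobody else in the assumed population, so the release would single the user out.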
This work was partially funded by DARPA contract F30602-03-C-0037.
BUSINESS CONTACT
Mark Grandcolas
Director of Business Development, Computing Science Laboratory
650-812-4429