Knowledge Extraction from Document Collections
Research in the Knowledge Extraction from Document Collections (KXDC) project is focused on developing computational tools and methodologies for accessing information contained in domain-focused document collections. While current tools recognize what documents are about, KXDC researchers are working toward the development of new generations of knowledge-based tools that can interpret the meaning of the documents' content.
KXDC researchers are exploring new techniques for analyzing natural language texts and producing conceptual representations of their content. Their work combines deep linguistic analysis with general and domain-specific knowledge representation and inference.

There are trade-offs between the functionality of a knowledge quality-management system and the rate of change and volume of the documents in a given collection.
KXDC scientists are engaged in basic research aimed at solving fundamental problems of symbolic natural language understanding. They are challenging the conventional wisdom that it is not possible to automatically produce useful representations of document content using symbolic natural language processing (NLP).
This work draws on recent technological advances that are making the use of NLP techniques more commercially feasible. It also draws on the Xerox Linguistic Environment, a proprietary set of technologies based on PARC's world-class competency in NLP.
The KXDC research team includes computer scientists and linguists with expertise in computational linguistics, symbolic processing of language, mathematical logic, artificial intelligence, knowledge representation, and automated reasoning. They are exploring three key research areas:
- natural language understanding
- knowledge representation and ontologies
- logical foundations for common-sense reasoning
Document Collections
Researchers are using collections of thematically focused natural language documents as the basis for their work because such collections present a number of problems that can only be solved by knowledge-intensive methods. The content of domain-focused document collections is typically extremely valuable to their owners and users. KXDC technologies will enable users to leverage that value by providing automated tools that assist in the time- and labor-intensive process of maintaining the collections.
KXDC tools and techniques compare the information contained in one document with that contained in another. They identify anomalies, redundancies, contradictions, or similarities based on document content.
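
As a rough illustration of content-based comparison, the Python sketch below treats each document as a set of simple assertions and reports which assertions are repeated and which conflict. It is only a toy under stated assumptions: the `Assertion` record, its fields, the `compare_documents` function, and the sample tips are all hypothetical and are not drawn from KXDC's actual representations, which rely on deep linguistic analysis rather than hand-written tuples.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Assertion:
    """Hypothetical unit of document content: subject, property, value, polarity."""
    subject: str
    prop: str
    value: str
    positive: bool = True  # False means the document denies the statement


def compare_documents(doc_a: Set[Assertion], doc_b: Set[Assertion]) -> Dict[str, set]:
    """Classify the overlap between two documents' assertion sets."""
    # Redundant: the identical assertion appears in both documents.
    redundant = doc_a & doc_b
    # Contradictory: same statement, opposite polarity.
    contradictory = {
        (a, b)
        for a in doc_a
        for b in doc_b
        if (a.subject, a.prop, a.value) == (b.subject, b.prop, b.value)
        and a.positive != b.positive
    }
    # Similar: the documents discuss at least some of the same subjects.
    shared_subjects = {a.subject for a in doc_a} & {b.subject for b in doc_b}
    return {
        "redundant": redundant,
        "contradictory": contradictory,
        "shared_subjects": shared_subjects,
    }


# Invented sample content, for illustration only.
tip_1 = {
    Assertion("fuser", "causes", "paper jam"),
    Assertion("fuser", "fixed_by", "replace roller"),
}
tip_2 = {
    Assertion("fuser", "causes", "paper jam"),
    Assertion("fuser", "fixed_by", "replace roller", positive=False),
}
result = compare_documents(tip_1, tip_2)
```

Here the two invented tips agree on the cause of the problem but disagree on the fix, so the comparison reports one redundant assertion and one contradictory pair. The hard part, which this sketch assumes away, is deriving such assertions from free text in the first place.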
KXDC research is currently focused on two application areas:
- knowledge quality management, aimed at ensuring and maintaining the quality of document collections
- knowledge-based content tracking, aimed at helping users of document collections identify how incoming documents relate to existing ones
Goals
The long-term goal of KXDC research is to create a comprehensive set of general computational techniques for interpreting the meaning of natural language documents based on the flow of argument, rather than on key words, and for using that understanding in computational tools. The team's shorter-term objective is to develop tools that can be used to solve specific problems for focused applications such as content retrieval and tracking, question answering, and knowledge quality management.
KXDC tools could be customized for specific corporate document collections such as intellectual property, design plans, or product documentation. They could also be customized for applications in intelligence, health care, law, journalism, or any field that relies heavily on document-based information.
Novel Intelligence from Massive Data (NIMD)
KXDC researchers are contributing to a larger PARC team that is working on an Advanced Research and Development Activity (ARDA) project called Novel Intelligence from Massive Data (NIMD). The project's goal is to develop knowledge-based tools to assist in intelligence analysis.
Intelligence agencies depend on natural language-based intelligence reports for ongoing information on a variety of topics. Researchers are developing computational tools that will interpret the text in those documents and, based on the content, flag any that contain information that contradicts existing reports or that is inconsistent with a particular set of beliefs.
The same techniques could eventually lead to smart tools that could monitor documents and act on their information. For example, computational agents could be programmed to recognize that an incoming e-mail is confirming a meeting to be held in another city, and to book the necessary flights based on that information.
Eureka Project
In this project, the current research corpus is a collection of documents written by Xerox copier repair technicians as part of the Eureka system. Invented at PARC in 1990, Eureka promotes knowledge sharing via "tips," submitted by service technicians, which identify machine problems and propose solutions. There are currently nearly 50,000 tips in the collection.
KXDC researchers' short-term goal is to detect redundancy, contradiction, and obsolescence in the tips by looking at their meaning rather than just at the words they contain or the topics they are about. Scientists are also investigating new search techniques based on content. A longer-term goal is to develop tools that automate a variety of tasks that deal with document collections.
Another long-term goal is "knowledge fusion," which enables the creation of composite documents from the content of existing documents. Computational agents would sift through a number of documents with information that is potentially relevant to a given task or user, and compose a single document that synthesizes the relevant information.
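
The fusion step described above can be pictured with a small Python sketch: gather the statements relevant to a topic from several documents, drop repeats, and emit one composite text that records where each statement came from. Everything here is an assumption made for illustration. The `fuse` function, the `(topic label, sentence)` pairs, and the sample tips are hypothetical, the topic labels are presumed to come from some earlier analysis, and exact string matching stands in for the meaning-based comparison that the research actually targets.

```python
from typing import Dict, List, Tuple

# Hypothetical input shape: each statement is a (topic label, sentence) pair.
Statement = Tuple[str, str]


def fuse(documents: Dict[str, List[Statement]], topic: str) -> str:
    """Compose one text from all statements on a topic, keeping first occurrences."""
    seen = set()
    lines = []
    for source, statements in documents.items():
        for label, sentence in statements:
            # Exact-match deduplication is a stand-in for semantic redundancy detection.
            if label == topic and sentence not in seen:
                seen.add(sentence)
                lines.append(f"{sentence} (source: {source})")
    return "\n".join(lines)


# Invented sample content, for illustration only.
docs = {
    "tip-A": [
        ("fuser", "Replace the roller when jams recur."),
        ("toner", "Shake the cartridge."),
    ],
    "tip-B": [
        ("fuser", "Replace the roller when jams recur."),
        ("fuser", "Check the thermistor."),
    ],
}
composite = fuse(docs, "fuser")
```

Keeping the source name beside each statement is a deliberate choice in this sketch: a composite document is easier to trust and to correct when every line can be traced back to the document it came from.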