
Knowledge Extraction from Document Collections

Research in the Knowledge Extraction from Document Collections (KXDC) project is focused on developing computational tools and methodologies for accessing information contained in domain-focused document collections. While current tools recognize what documents are about, KXDC researchers are working toward the development of new generations of knowledge-based tools that can interpret the meaning of the documents' content.

KXDC researchers are exploring new techniques for analyzing natural language texts and producing conceptual representations of their content. Their work combines deep linguistic analysis with general and domain-specific knowledge representation and inference.

There are trade-offs between the functionality of a knowledge quality-management system and the rate of change and volume of the documents in a given collection.

KXDC scientists are engaged in basic research aimed at solving fundamental problems of symbolic natural language understanding. They are challenging the conventional wisdom that it is not possible to automatically produce useful representations of document content using symbolic natural language processing (NLP).

This work draws on recent technological advances that are making the use of NLP techniques more commercially feasible. It also draws on the Xerox Linguistic Environment, a proprietary set of technologies based on PARC's world-class competency in NLP.

The KXDC research team includes computer scientists and linguists with expertise in computational linguistics, symbolic processing of language, mathematical logic, artificial intelligence, knowledge representation, and automated reasoning. They are exploring three key research areas:

  • natural language understanding
  • knowledge representation and ontologies
  • logical foundations for common-sense reasoning

Document Collections
 

Researchers are using collections of thematically focused natural language documents as the basis for their work because such collections present a number of problems that can only be solved by knowledge-intensive methods. The content of domain-focused document collections is typically extremely valuable to their owners and users. KXDC technologies will enable users to leverage that value by providing automated tools that assist in the time- and labor-intensive process of maintaining the collections.

KXDC tools and techniques compare the information contained in one document with that in another. They identify anomalies, redundancies, contradictions, or similarities based on document content. KXDC research is currently focused on two application areas:

  • knowledge quality management, aimed at ensuring and maintaining the quality of document collections
  • knowledge-based content tracking, aimed at helping users of document collections identify how incoming documents relate to existing ones
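As a rough illustration of content-based comparison, the sketch below classifies the relationship between two documents once each has been reduced to a set of simple facts. It is a toy under stated assumptions, not the KXDC method: the `Fact` structure, the `compare` function, and the sample repair tips are all hypothetical, and the hard part of the research (deriving such representations from text through deep linguistic analysis) is assumed to have already happened.

```python
from typing import Iterable, NamedTuple


class Fact(NamedTuple):
    """Hypothetical, highly simplified representation of one statement."""
    subject: str
    relation: str
    value: str
    negated: bool = False


def compare(doc_a: Iterable[Fact], doc_b: Iterable[Fact]) -> str:
    """Classify how two documents relate, based only on their facts."""
    # Index each fact by its content, keeping the polarity separately so
    # that "X causes Y" and "X does not cause Y" line up under one key.
    polarity_a = {(f.subject, f.relation, f.value): f.negated for f in doc_a}
    polarity_b = {(f.subject, f.relation, f.value): f.negated for f in doc_b}
    shared = polarity_a.keys() & polarity_b.keys()

    # Same statement asserted in one document and denied in the other.
    if any(polarity_a[key] != polarity_b[key] for key in shared):
        return "contradiction"
    # Both documents make exactly the same statements.
    if shared and len(shared) == len(polarity_a) == len(polarity_b):
        return "redundant"
    # Some overlap, but each document may say more.
    if shared:
        return "similar"
    return "unrelated"


# Hypothetical sample data, invented for illustration only.
tip_a = [Fact("fuser", "causes", "paper jam")]
tip_b = [Fact("fuser", "causes", "paper jam", negated=True)]
print(compare(tip_a, tip_b))
```

Matching on exact strings is the weakest point of this sketch. Two documents that express the same idea in different words would be reported as unrelated, which is precisely the gap that meaning-based analysis is intended to close.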

Goals
 

The long-term goal of KXDC research is to create a comprehensive set of general computational techniques for interpreting the meaning of natural language documents based on the flow of argument, rather than on keywords, and for using that understanding in computational tools. The team's shorter-term objective is to develop tools that can be used to solve specific problems for focused applications such as content retrieval and tracking, question answering, and knowledge quality management.

KXDC tools could be customized for specific corporate document collections such as intellectual property, design plans, or product documentation. They could also be customized for applications in intelligence, health care, law, journalism, or any field that relies heavily on document-based information.

Novel Intelligence from Massive Data (NIMD)
 
KXDC researchers are contributing to a larger PARC team that is working on an Advanced Research and Development Activity (ARDA) project called Novel Intelligence from Massive Data (NIMD). The project's goal is to develop knowledge-based tools to assist in intelligence analysis.

Intelligence agencies depend on natural language-based intelligence reports for ongoing information on a variety of topics. Researchers are developing computational tools that will interpret the text in those documents and, based on the content, flag any that contain information that contradicts existing reports or that is inconsistent with a particular set of beliefs.

The same techniques could eventually lead to smart tools that could monitor documents and act on their information. For example, computational agents could be programmed to recognize that an incoming e-mail is confirming a meeting to be held in another city, and to book the necessary flights based on that information.

Eureka Project
 
In this project, the current research corpus is a collection of documents written by Xerox copier repair technicians as part of the Eureka system. Invented at PARC in 1990, Eureka promotes knowledge sharing via "tips," submitted by service technicians, which identify machine problems and propose solutions. There are currently nearly 50,000 tips in the collection.

KXDC researchers' short-term goal is to detect redundancy, contradiction, and obsolescence in the tips by looking at their meaning rather than just at the words they contain or the topics they are about. Scientists are also investigating new search techniques based on content. A longer-term goal is to develop tools that automate a variety of tasks that deal with document collections.

Another long-term goal is "knowledge fusion," which enables the creation of composite documents from the content of existing documents. Computational agents would sift through a number of documents with information that is potentially relevant to a given task or user, and compose a single document that synthesizes the relevant information.
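The idea of knowledge fusion can be sketched in miniature. The example below assumes each document has already been broken into (topic, statement) pairs, then gathers the statements relevant to one topic into a single composite list, merging duplicates and recording which documents each statement came from. The `fuse` function, the data layout, and the sample tips are hypothetical; a real system would select and merge content by meaning rather than by exact string match.

```python
from typing import Dict, List, Tuple


def fuse(documents: Dict[str, List[Tuple[str, str]]], topic: str) -> List[str]:
    """Compose one list of statements on `topic` drawn from many documents."""
    # Map each distinct statement to the ids of the documents containing it.
    # Dicts preserve insertion order, so statements stay in first-seen order.
    sources: Dict[str, List[str]] = {}
    for doc_id, statements in documents.items():
        for stmt_topic, text in statements:
            if stmt_topic != topic:
                continue  # not relevant to the requested topic
            ids = sources.setdefault(text, [])
            if doc_id not in ids:
                ids.append(doc_id)
    # Render each statement once, followed by its supporting documents.
    return [f"{text} [{', '.join(ids)}]" for text, ids in sources.items()]


# Hypothetical sample data, invented for illustration only.
tips = {
    "tip-1": [("fuser", "Replace the fuser roller."),
              ("tray", "Check tray alignment.")],
    "tip-2": [("fuser", "Replace the fuser roller."),
              ("fuser", "Clean the fuser lamp.")],
}
for line in fuse(tips, "fuser"):
    print(line)
```

Keeping the source ids alongside each statement is a deliberate choice in this sketch: a composite document is easier to trust and to maintain when every statement can be traced back to the documents it was drawn from.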

BUSINESS CONTACT
Lawrence Lee
Director of Business Development, Intelligent Systems Laboratory
650-812-4756
RELATED WEBPAGES

Eureka Knowledge-Sharing System

Sensemaking

ADDITIONAL INFORMATION
Project Team's Site on Knowledge Extraction from Document Collections
   


Copyright © 2002-2007 Palo Alto Research Center Incorporated. All Rights Reserved.
PARC, the PARC Logo, AspectJ, DataGlyph, Obje, Silx, StressedMetal, and ClawConnect
are trademarks or registered trademarks of Palo Alto Research Center Incorporated.