Powerset: Deep natural language processing for consumer search
Google, Yahoo, and other conventional search engines have been remarkably successful at making vast amounts of information available to ordinary users. They achieve robustness and scale by creating efficient bag-of-words indexes of the terms they extract from unstructured text and by encouraging users to specify their information needs with keywords that are well-suited to bag-of-words retrieval. These methods suffer from errors of both precision and recall. Undesired results are returned because the systems do not index and cannot filter according to the semantic relations that the user has in mind, and desired results are missed because keyword matches cannot identify passages that use different terms and different syntactic constructions to express semantically equivalent concepts.
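The recall failure described above can be sketched with a toy bag-of-words inverted index (a minimal illustration, not Powerset's implementation; the documents and query are invented for the example):

```python
# Toy illustration: a bag-of-words inverted index and the recall
# failure described above. Not Powerset's implementation.
from collections import defaultdict

docs = {
    1: "The senator criticized the bill",
    2: "The legislation was attacked by the lawmaker",  # paraphrase of doc 1
}

# Build an inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def keyword_search(query):
    """Return documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index[terms[0]].copy()
    for term in terms[1:]:
        results &= index[term]
    return results

# The keyword match finds document 1 but misses document 2, which
# expresses the same concept with different words and syntax.
print(keyword_search("senator criticized"))  # {1}
```

A semantic index of the kind discussed in the talk would instead store normalized relations (e.g. criticize(lawmaker, legislation)) extracted from both passages, letting the paraphrase match.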
It is not a novel idea that these precision and recall problems can be addressed, in principle, by using deep natural language processing to extract underlying semantic concepts and relations both from text and from queries. Powerset is a start-up company that is attempting to address these problems in practice. In a continuing collaboration with PARC researchers, we are extending PARC's fairly mature natural language technologies and combining them with carefully tuned indexing and retrieval components to build a large semantic index for a natural language search engine.
In this talk I’ll point out why search is a particularly good application for natural language processing, outline some of the factors that justify this effort, and describe some of the technologies that make it possible. I’ll also show examples from Powerset’s recently launched Wikipedia search system to illustrate not only how semantic indexing solves some of the keyword recall and precision problems but also how we are using semantic information to provide new tools for exploring and understanding the information that comes back from a search.
Ronald M. Kaplan is Chief Scientific Officer at Powerset, Inc. Prior to joining Powerset, he was a Research Fellow at the (Xerox) Palo Alto Research Center where he created and directed the Natural Language Theory and Technology research group. He is also a Consulting Professor in the Linguistics Department at Stanford University.
He received his Ph.D. in Social Psychology from Harvard University, where he investigated how explicit computational models of grammar could be embedded in models of human language performance. He has made many contributions to computational linguistics and linguistic theory. These include the notions of consumer-producer and active-chart parsing, the design of the formal theory of Lexical Functional Grammar and its initial computational implementation, and the mathematical, linguistic, and computational concepts that underlie the use of finite-state phonological and morphological descriptions.
Kaplan is a past President of the Association for Computational Linguistics, a co-recipient of the 1992 Software System Award of the Association for Computing Machinery, and a Fellow of the ACM. He has also been a Fellow-in-Residence at the Netherlands Institute for Advanced Study in the Humanities and Social Sciences. He holds over 30 patents in computational linguistics and related areas.