Assuring high-accuracy document understanding: retargeting, scaling up, and adapting


No existing document-image understanding technology, whether experimental or commercially available, can guarantee high accuracy across the full range of documents of interest to government-agency users. Research at PARC has focused for more than ten years on relieving this critical bottleneck to automatic analysis of the contents of paper-based documents, FAXes, etc. PARC has made significant progress towards this goal, documented in dozens of publications and patents and embodied in experimental software tools: we possess *document image decoding* (DID) technology that achieves high accuracy on images of documents printed in a potentially wide variety of writing systems and typefaces, with unusual page layout styles, and with severely degraded image quality. Our principal method of attack has been *retargeting*: that is, our technology is designed to be trainable, i.e., customizable to the characteristics of individual documents or sets of similar documents. In recent years we have significantly reduced the effort of manual DID training. In this paper we propose *scaling up* the DID methodology by massively parallel recognition using ensembles of automatically pre-trained DID decoders: this promises to reduce further the need for document-specific training. We have also made recent progress towards *adapting*, in which recognizers, without any manual training, adjust their models to fit the document at hand: this offers hope that manual training can someday be reduced to zero. Finally, we have extended DID methods to handle *gray-level images*. All of these R&D projects are ripe for accelerated extension, experimentation, and application to government-agency needs.


Baird, H. S.; Breuel, T. M.; Popat, K.; Sarkar, P.; Lopresti, D. P. Assuring high-accuracy document understanding: retargeting, scaling up, and adapting. Symposium on Document Image Understanding Technology (SDIUT '03); 2003 April 9-11; Greenbelt, MD, USA.