TIPSTER Text Program: A multi-agency, multi-contractor program
Date created: Monday, 31-Jul-00
Summarization Concepts

Traditionally, the preparation of document abstracts has been a human function. A person knowledgeable in the subject matter reads the document and then writes a short summary of it, typically one paragraph long. The abstractor tries to include in the summary all of the important ideas and concepts presented in the original document. What counts as important, however, depends on the abstractor's opinion. Tests have shown that human abstractors do not always agree completely on the content of an abstract. Some people consider 85% agreement between abstractors to be about the best that can be obtained.

The tremendous increase in the number of documents, and their easier availability through electronic means, has put an insurmountable burden upon organizations that rely on humans to do abstracting. In only a few cases, such as technical and scholarly papers, can the preparation of the abstract be the responsibility of the author. Traditionally, news articles put the most 'important' information in the first paragraph; however, because news articles have short paragraphs, it is usually necessary to use additional paragraphs to get a good summary of the article.

With exponential increases in the number of documents available, high quality abstracts become more important, because the demand for finding 'the right document' is also increasing. Thus, it is not surprising that efforts are underway to develop machine-aided methods that improve the quality of information available to the user and reduce the time needed to get it. One of these efforts is summarization.

Summarization can be more than just abstracting. An abstract is usually thought of as being associated with a single document, but there may also be a need to cluster or categorize large groups of documents with similar subject matter and describe each group with a single summary.
Summarization may be applied at different points in the normal text processing sequence to improve the relevance of the information delivered to the user, including:
Summarization will use natural language processing methods and/or statistical techniques to achieve a significant reduction in the quantity of text presented to a user with minimal reduction in information content. Some of the techniques that may be used, independently or combined, in building summaries include:
It should be noted that effective summarization is not an easy task; it frequently involves semantic analysis and the application of world knowledge to produce the clearest presentation. While some research has been done and a few trial systems have been developed, a great deal of work remains before really good summarization software is available. And of course, multilingual considerations only increase the difficulty.

Another issue under the summarization umbrella is how well a particular approach works. Some initial work is underway as part of the TIPSTER program to examine the feasibility of performing summarization evaluation in a manner similar to MUC (the Message Understanding Conference) and TREC (the Text Retrieval Conference).
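To make the idea of a statistical technique concrete, the following is a minimal sketch of one simple approach: extractive summarization by word frequency. It is an illustration only and is not taken from any TIPSTER system. The function names, the tiny stop-word list, and the scoring rule (average frequency of a sentence's content words) are all assumptions chosen for brevity. It performs no semantic analysis and uses no world knowledge, which is exactly the limitation noted above.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "be", "by", "for", "in", "is",
    "it", "of", "on", "or", "that", "the", "to", "was", "with",
})


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def content_words(sentence):
    """Lowercased words of a sentence with stop words removed."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def summarize(text, max_sentences=2):
    """Return up to max_sentences sentences, kept in document order.

    Each sentence is scored by the average document-wide frequency of
    its content words; the highest-scoring sentences are extracted.
    """
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score; ties keep their original order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in chosen]
```

For example, `summarize(article_text, max_sentences=3)` would return the three sentences whose content words are most frequent in `article_text`, where `article_text` is a placeholder for any plain-text document. A sketch like this reduces the quantity of text, but it offers no assurance that information content is preserved, which is why evaluation of the kind described above matters.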