TIPSTER Text Program A multi-agency, multi-contractor program



Date created: Monday, 31-Jul-00

Multilingual Entity Task Conference (MET)

The government group prepared Chinese development data for the MET-2 Named Entity task, as well as a new collection of Japanese data. Over 300 articles (including revised versions of MET-1 data) were available in each language for distribution as training data to MET-2 participants. The dry run and formal test data were distributed at approximately one-month intervals.

In 1998, because of staff limitations, support for collection and preparation of Spanish-language answer-key data was on hold. MET was organized as a "collateral duty" project, lacking both regular staff members and a centralized collection of online resources (data, software, etc.). Reliance on generous contributions from participants, to help improve the existing capabilities of tools that support the labor-intensive process of data collection and markup, was occasionally proposed for future development.

In 1998, the training collection included data from three Chinese and two Japanese sources. The MET-2 data therefore represented a somewhat richer variety of language patterns than the MET-1 data, which was collected from only a single newswire source in each language. Whereas MET-1 training, dry run, and formal test data sets were retrieved using a single set of keywords, in 1998 different keywords were used to select each data set. Consequently, participant systems would have been challenged to demonstrate greater portability in moving from one media source and topic domain to another.

Although the multilingual task would have been confined, as in MET-1, to Named Entity extraction, texts would have been selected according to their suitability for future Template Element and Scenario Template applications. As time permitted, the government group began preparing the data, tools, and skills needed to support these more advanced tasks in MET-3.

Meanwhile, TIPSTER prepared to offer participants newly available resources to support multilingual Information Extraction (IE) tasks. In particular, since MET-1 the government group had acquired two online part-of-speech tagged Chinese lexicons, the larger of which differentiates 39 morpho-syntactic categories in glosses of over 100,000 terms. A revised version of a Chinese segmenter (word-boundary finder) also would have been available.

Because word-boundary finding proved to be a bottleneck for IE tasks in various non-Roman languages, the government group developed a second segmentation tool to help identify proper names, technical terms, newly coined words, and other items that may be missing from the lexicon. The tool utilized a core lexicon of only 5,000 terms, selected for their high-frequency occurrence in newspaper text. The lexicon did not gloss the majority of 1-syllable function words. The small number of glossed terms actually reduces the chances that a rare or newly coined polysyllabic term will be incorrectly "chopped" into constituents that are mapped to 1- or 2-syllable glossed terms (e.g., that a spurious boundary is inserted after the "will" in unglossed "Williamsburg", making the word unrecognizable to downstream Information Extraction tasks).
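The effect of a small core lexicon can be illustrated with a minimal sketch. It assumes a greedy longest-match lookup, which is a common segmentation technique but is not confirmed here as the tool's actual method. The function name `segment`, the `max_len` parameter, and the toy lexicon are all hypothetical, and each Chinese character is treated as one syllable.

```python
def segment(text, lexicon, max_len=4):
    """Split text into (token, kind) pairs, kind being 'word' or 'orphan'.

    Assumption: greedy longest-match against the lexicon. Any run of
    characters that matches no lexicon entry is kept whole as an orphan.
    """
    result = []
    orphan = ""
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, down to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in lexicon:
                match = text[i:i + n]
                break
        if match is None:
            orphan += text[i]          # extend the current orphan string
            i += 1
        else:
            if orphan:                 # close the orphan before the word
                result.append((orphan, "orphan"))
                orphan = ""
            result.append((match, "word"))
            i += len(match)
    if orphan:
        result.append((orphan, "orphan"))
    return result


# Toy core lexicon (hypothetical): "we", "yesterday", "visit".
core = {"我们", "昨天", "访问"}
text = "我们昨天访问威廉斯堡"   # "We visited Williamsburg yesterday"

print(segment(text, core))
# [('我们', 'word'), ('昨天', 'word'), ('访问', 'word'), ('威廉斯堡', 'orphan')]

# A larger lexicon that happens to gloss the 2-syllable "威廉" (William)
# chops the place name, which is the failure the small lexicon avoids.
print(segment(text, core | {"威廉"}))
# [('我们', 'word'), ('昨天', 'word'), ('访问', 'word'), ('威廉', 'word'), ('斯堡', 'orphan')]
```

With the small lexicon the unknown place name survives as a single orphan string; with the extra short entry it is split at a spurious boundary, mirroring the "will" / "Williamsburg" case described above.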

This approach results in a relatively large number of unsegmented "orphan" strings. Very often, these orphans fortuitously happen to be 1-word strings, bounded on either side by high-frequency glossed terms. Multiword orphans are further segmented by a routine comparing all orphans in a text for identical groups of 2 or more adjacent syllables, which are presumed to be "words". Preliminary review suggests this would have been an effective approach.
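The orphan-comparison routine can be sketched as follows. This is an illustrative reading of the description above, not the tool's actual code: the function names `repeated_groups` and `split_orphan`, the threshold of two occurrences, and the longest-group-first splitting order are assumptions. Each character again stands for one syllable.

```python
from collections import Counter


def repeated_groups(orphans, min_len=2):
    """Collect groups of min_len or more adjacent syllables that occur
    at least twice across all orphan strings of a text."""
    counts = Counter()
    for orphan in orphans:
        for n in range(min_len, len(orphan) + 1):
            for i in range(len(orphan) - n + 1):
                counts[orphan[i:i + n]] += 1
    return {group for group, count in counts.items() if count >= 2}


def split_orphan(orphan, groups, min_len=2):
    """Split one orphan at the boundaries of repeated groups, preferring
    the longest group at each position (an assumed tie-breaking rule)."""
    pieces = []
    rest = ""
    i = 0
    while i < len(orphan):
        match = None
        for n in range(len(orphan) - i, min_len - 1, -1):
            if orphan[i:i + n] in groups:
                match = orphan[i:i + n]
                break
        if match is None:
            rest += orphan[i]          # no repeated group starts here
            i += 1
        else:
            if rest:
                pieces.append(rest)
                rest = ""
            pieces.append(match)       # presumed "word"
            i += len(match)
    if rest:
        pieces.append(rest)
    return pieces


# Two orphans from the same (hypothetical) text share four syllables.
orphans = ["威廉斯堡", "参观威廉斯堡"]
groups = repeated_groups(orphans)

print(split_orphan("参观威廉斯堡", groups))   # ['参观', '威廉斯堡']
print(split_orphan("威廉斯堡", groups))       # ['威廉斯堡']
```

Shorter sub-groups such as "威廉" also count as repeated, so preferring the longest group keeps the shared four-syllable name intact instead of re-chopping it.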

Because the tool depended on very few language-specific rules, it would have been easily portable to Thai and other languages requiring word segmentation. Performance can be enhanced by augmenting the core lexicon with domain-targeted lists of terms. Users also could have found ways to resolve ambiguities in unsegmented "orphan" strings by exploiting contextual part-of-speech information that the tool attached using SGML tags to adjacent, identified words.
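A rough sketch of the lexicon augmentation and SGML tagging described above is given below. The tag names `W` and `ORPHAN`, the `POS` attribute, the part-of-speech labels, and the function name `to_sgml` are all hypothetical; the tool's actual tag set is not documented here.

```python
from html import escape


def to_sgml(tokens, pos_lexicon):
    """Wrap each token in an SGML-style tag.

    Tokens found in the lexicon carry their part of speech as an
    attribute; all others are marked as unresolved orphans.
    (Tag and attribute names are assumptions for illustration.)
    """
    parts = []
    for token in tokens:
        pos = pos_lexicon.get(token)
        if pos is None:
            parts.append(f"<ORPHAN>{escape(token)}</ORPHAN>")
        else:
            parts.append(f'<W POS="{escape(pos)}">{escape(token)}</W>')
    return "".join(parts)


# Hypothetical core lexicon and a domain-targeted term list.
core = {"我们": "PRON", "访问": "V"}
domain_terms = {"威廉斯堡": "NP"}
tokens = ["我们", "访问", "威廉斯堡"]

print(to_sgml(tokens, core))
# <W POS="PRON">我们</W><W POS="V">访问</W><ORPHAN>威廉斯堡</ORPHAN>

# Augmenting the core lexicon with the domain list resolves the orphan.
print(to_sgml(tokens, {**core, **domain_terms}))
# <W POS="PRON">我们</W><W POS="V">访问</W><W POS="NP">威廉斯堡</W>
```

In the first output, the verb tag immediately before the orphan is the kind of contextual cue a user could exploit, for example to guess that the orphan is a noun phrase serving as the verb's object.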

The government group hoped to play a role in advancing participants' technical capabilities, by serving as a clearinghouse for basic multilingual text processing resources such as word-boundary finders, dictionaries, and tagging tools. Participants would have been encouraged to share basic techniques, tools, and data to support this effort.

Direct any comments, questions, or suggestions concerning MET or related issues to SECM for forwarding to the appropriate organization.
