Skip to content

Latest commit

 

History

History
51 lines (30 loc) · 2.44 KB

0306.md

File metadata and controls

51 lines (30 loc) · 2.44 KB

Introduction to the CoNLL-2003 shared task: language-independent named entity recognition, Sang et al., 2003

Paper, Tags: #nlp, #datasets, #ner

This task is language-independent NER. We release two datasets, English and German and the evaluation method, and also present a general overview of the systems that have taken part in the task and discuss their performance.

Name entities are phrases that contain the names of persons, organizations and locations.

We concentrate four types of named entities: persons, locations, organizations and names of miscellaneous entities.

Data

The data consists of 8 files covering two languages, English and German. For each language, the following files are present:

  • training file
  • development file
    • Can be used for tuning the parameters of the learning method
  • test file
  • large file with unannotated data

The challenge of this year is to incorporate the unannotated data in the learning process.

The English data was taken from the Reuters corpus, and it consists of Reuters news stories between August 1996 and August 1997. The text for the German data was taken from the ECI Multilingual Text Corpus, which is a corpus with text in many languages, we extracted the data from the German newspaper Frankfurter Rundshau. All the data was taken from articles written in one week at the end of August 1992.

Data preprocessing

For all data, a tokenizer, part-of-speech tagger and a chunker were applied to the raw data, and we created two basic language-specific tokenizers for this shared task.

  • The English data was tagged and chunked by the memory-based MBT tagger
  • The German data was lemmatized, tagged and chunked by Treetagger

Data format

Word tagged with O are outside of named entities. Otherwise, we extract four entities:

  • Persons, PER
  • Organizations, ORG
  • Locations, LOC
  • Miscellaneous names, MISC

We assume named entities are non-recursive and non-overlapping, when a named entity is embedded in another named entity, usually only the top level entity is annotated.

Evaluation is measured with F1 score.

Participating system

Most participants used information other than the available training data, including gazetteers and unannotated data.

The most frequently applied technique is the Maximum Entropy Model, also HMMs were employed, Conditional MMs, AdaBoost, memory-based learning and SVMs weren't as common.

Best performance was achieved by MEMs.