Semi-supervised sequence tagging with bidirectional language models, Peters et al., 2017

Paper, Tags: #nlp, #embeddings

This paper demonstrates a general semi-supervised method that augments token representations in sequence tagging models with pre-trained context embeddings from bidirectional language models.

Sequence tagging: assign a categorical label to each member of a sequence of observed values. Current (2017) state-of-the-art sequence tagging models include a bidirectional RNN that encodes token sequences into a context-sensitive representation; the parameters of this biRNN are learned only on labeled data.

The approach requires no additional labeled data: a neural language model (LM), pre-trained on a large unlabeled corpus, computes an encoding of the context at each position in the sequence (the LM embedding), which is then used as an extra input to the supervised sequence tagging model.

Contributions

  1. The context sensitive representation captured in the LM embeddings is useful in the supervised sequence tagging setting.
  2. Using both forward and backward LM embeddings boosts performance over a forward-only LM.
  3. Domain-specific pre-training is not necessary: an LM trained on the news domain still helps when applied to scientific papers.

Language model augmented sequence taggers (TagLM)

  1. Pretrain word embeddings and a language model on a large, unlabeled corpus
  2. Prepare a word embedding and an LM embedding for each token in the input sequence
  3. Use both the word embeddings and the LM embeddings in the sequence tagging model (the model receives 2 inputs; summarized below)
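
Schematically (the notation below is mine, not copied from the paper): with $\mathbf{c}_k$ the character-level representation, $\mathbf{w}_k$ the token embedding, and $\mathbf{h}^{LM}_k$ the pre-trained LM embedding at position $k$, the tagger consumes both per-token inputs:

$$
\mathbf{x}_k = [\mathbf{c}_k ; \mathbf{w}_k], \qquad
y_{1:N} = \operatorname{Tagger}\big(\mathbf{x}_{1:N}, \mathbf{h}^{LM}_{1:N}\big)
$$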

Inputs

  1. Word embedding model (sketched after this list)
    • A character-level representation, computed by either a CNN or an RNN over characters, captures morphological information
    • Token embeddings, obtained by table lookup and initialized with pre-trained word embeddings
  2. LM embedding / recurrent encoding of context: produced by a pre-trained language model with multiple recurrent layers in each direction (see BiLM below)
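
A hedged sketch of the word-embedding model (item 1), assuming a character CNN plus a token lookup table initialized from pre-trained vectors; the class name, layer sizes, and kernel width are illustrative choices, not the paper's exact configuration:

```python
# Sketch of the word-embedding input: token embeddings looked up from a table
# (initialized with pre-trained vectors) concatenated with a character CNN
# that captures morphology. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    def __init__(self, vocab_size, char_vocab_size, token_dim=100,
                 char_dim=25, char_filters=50, pretrained=None):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, token_dim)
        if pretrained is not None:
            # initialize the lookup table with pre-trained word embeddings
            self.token_emb.weight.data.copy_(pretrained)
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)

    def forward(self, tokens, chars):
        # tokens: (batch, seq_len) token ids; chars: (batch, seq_len, word_len) character ids
        b, t, w = chars.shape
        c = self.char_emb(chars).view(b * t, w, -1).transpose(1, 2)  # (B*T, char_dim, W)
        c = torch.relu(self.char_cnn(c)).max(dim=-1).values          # max-pool over characters
        c = c.view(b, t, -1)                                         # (B, T, char_filters)
        return torch.cat([self.token_emb(tokens), c], dim=-1)        # per-token word representation
```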

BiLM

Computes the probability of a token sequence. A token representation (from a CNN over characters or from token embeddings) is passed through multiple LSTM layers that embed the history into a fixed-dimensional vector. This forward LM embedding is used to predict the probability of the next token, $t_{k+1}$. A backward LM embedding predicts the previous token given the future context.

The forward and backward LMs are pre-trained separately, and their embeddings are then shallowly concatenated to form the bidirectional LM embedding.
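
A minimal sketch of the biLM under the description above: two separately trained multi-layer LSTM language models (their softmax prediction heads are omitted) whose top-layer states are shallowly concatenated; all names and sizes are illustrative assumptions:

```python
# Sketch of the biLM embedder: a forward LM (trained to predict the next token)
# and a backward LM (trained to predict the previous token), run over the same
# token representation and shallowly concatenated. Prediction heads omitted.
import torch
import torch.nn as nn

class BiLMEmbedder(nn.Module):
    def __init__(self, token_dim=512, hidden=512, layers=2):
        super().__init__()
        self.forward_lm = nn.LSTM(token_dim, hidden, num_layers=layers, batch_first=True)
        self.backward_lm = nn.LSTM(token_dim, hidden, num_layers=layers, batch_first=True)

    @torch.no_grad()  # the biLM is pre-trained and kept fixed while training the tagger
    def forward(self, token_rep):
        # token_rep: (batch, seq_len, token_dim) from a char CNN or token embedding lookup
        h_fwd, _ = self.forward_lm(token_rep)       # at position k, encodes the history t_1..t_k
        rev = torch.flip(token_rep, dims=[1])       # reverse the sequence in time
        h_bwd, _ = self.backward_lm(rev)
        h_bwd = torch.flip(h_bwd, dims=[1])         # re-align with the original positions
        return torch.cat([h_fwd, h_bwd], dim=-1)    # shallow concatenation -> biLM embedding
```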

Combining biLM with sequence model

TagLM uses the LM embeddings as additional inputs to the sequence tagging model.
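
A minimal sketch of this combination, assuming pre-computed word and biLM embeddings. The paper's tagger is a multi-layer biRNN with a CRF output, and it concatenates the biLM embedding with the output of the first biRNN layer; the sketch follows that injection point but replaces the CRF with a plain per-token classifier, and all sizes are illustrative:

```python
# Sketch of TagLM's combination step under the assumptions stated above:
# the biLM embedding enters between the tagger's two biRNN layers, and a
# linear per-token classifier stands in for the paper's CRF output layer.
import torch
import torch.nn as nn

class TagLMSketch(nn.Module):
    def __init__(self, word_dim=150, lm_dim=1024, hidden=200, num_tags=17):
        super().__init__()
        self.layer1 = nn.LSTM(word_dim, hidden, bidirectional=True, batch_first=True)
        # the second layer receives [first-layer output ; biLM embedding]
        self.layer2 = nn.LSTM(2 * hidden + lm_dim, hidden, bidirectional=True, batch_first=True)
        self.scorer = nn.Linear(2 * hidden, num_tags)

    def forward(self, word_emb, lm_emb):
        # word_emb: (batch, seq_len, word_dim) from the word embedding model
        # lm_emb:   (batch, seq_len, lm_dim) from the frozen, pre-trained biLM
        h1, _ = self.layer1(word_emb)
        h1 = torch.cat([h1, lm_emb], dim=-1)   # LM embeddings as additional inputs
        h2, _ = self.layer2(h1)
        return self.scorer(h2)                 # per-token tag scores
```

At tagging time, `word_emb` would come from a word-embedding model like the one sketched under Inputs, and `lm_emb` from the fixed, pre-trained biLM.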

Results

  • The LM embeddings give an average absolute improvement of 1.06 F1 (CoNLL 2003 NER) and 1.37 F1 (CoNLL 2000 chunking)
  • The improvements from adding LM embeddings are larger than those previously obtained from other forms of transfer or joint learning
  • Using both forward and backward LM embeddings consistently outperforms forward-only LM embeddings
  • LM size matters: replacing the LM with a larger one improves F1 by about as much as adding the backward LM
  • LM embeddings can improve the performance of a sequence tagger even when the LM is trained on data from a different domain