AG-Evaluation

AG is a method for reliable evaluation of distributional semantic models.
It was introduced in the paper Improving Reliability of Word Similarity Evaluation by Redesigning Annotation Task and Performance Measure.

Here we provide:

A Python implementation of the method
A suite of matching datasets in Hebrew (some of them were developed as part of the paper The Interplay of Semantics and Morphology in Word Embedding)
An example script, which evaluates a sample model on one of the datasets.

Requirements

Python 2.7
gensim (only for the example script)

Example

Run the following command in a shell:

$ python sample.py

The code in sample.py loads a gensim word2vec model and runs evaluation on the 'nn' dataset.
Note that the model it uses (model.vec) covers only part of the vocabulary, so some of the comparisons in the datasets will not be used. To get warnings for out-of-vocabulary (OOV) words, change the print_oov parameter to True.

Can I use a model created by another library (not gensim)?

Sure, the model does not have to be a gensim model.
It just needs to be encapsulated in a class with a method "similarity" which takes two words and returns a score.
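
As a minimal sketch of such a class, the wrapper below exposes a "similarity" method over a plain dictionary of vectors. The class name DictModel, the toy vectors, and the choice of cosine similarity are assumptions made for illustration and are not part of this repository. How the evaluator expects missing words to be signalled is not described here; this sketch simply raises KeyError.

```python
import math


class DictModel(object):
    """Hypothetical wrapper exposing similarity(w1, w2) over a dict of vectors."""

    def __init__(self, vectors):
        # vectors: dict mapping each word to a list of floats
        self.vectors = vectors

    def similarity(self, w1, w2):
        # Cosine similarity of the two word vectors.
        # Raises KeyError if either word is not in the dictionary.
        v1, v2 = self.vectors[w1], self.vectors[w2]
        dot = sum(a * b for a, b in zip(v1, v2))
        norm1 = math.sqrt(sum(a * a for a in v1))
        norm2 = math.sqrt(sum(b * b for b in v2))
        return dot / (norm1 * norm2)


# Toy vectors, made up for this example.
model = DictModel({"cat": [1.0, 0.0], "dog": [1.0, 1.0], "car": [0.0, 1.0]})
print(model.similarity("cat", "dog"))
```

An object like this could then be passed wherever sample.py passes the gensim model, assuming the evaluator calls nothing other than "similarity".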

Can I use the AG method to evaluate models of other languages?

Of course, you just need to provide matching datasets which follow the structure described in the paper.

Can I perform more fine-grained analysis?

Yes, you can filter comparisons by different properties of the Comparison class (declared in evaluator.py).
For example, by changing the lambda in the last line of sample.py from comp: comp.set_name == 'nn' to comp: comp.set_name == 'nn' and comp.compare_type == 'randoms', you include only "positive-random" comparisons in the evaluation.
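
To illustrate how the two filters differ, the sketch below applies both lambdas to a few stand-in objects. The namedtuple here is a hypothetical substitute for the real Comparison class in evaluator.py and models only the two attributes mentioned above; the 'vv' set name and the 'other' compare type are made-up values.

```python
from collections import namedtuple

# Hypothetical stand-in for the Comparison class declared in evaluator.py.
Comparison = namedtuple("Comparison", ["set_name", "compare_type"])

comparisons = [
    Comparison("nn", "randoms"),
    Comparison("nn", "other"),    # "other" is a made-up compare type
    Comparison("vv", "randoms"),  # "vv" is a made-up set name
]

# The filter used in sample.py: every comparison in the 'nn' dataset.
all_nn = lambda comp: comp.set_name == 'nn'

# The narrower filter: only "positive-random" comparisons in 'nn'.
nn_randoms = lambda comp: comp.set_name == 'nn' and comp.compare_type == 'randoms'

kept_all = [c for c in comparisons if all_nn(c)]
kept_randoms = [c for c in comparisons if nn_randoms(c)]
print(len(kept_all), len(kept_randoms))
```

Any other attribute of the real Comparison class can be combined into the predicate in the same way.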

The provided datasets

The 'datasets' directory is divided into several sub-directories:

"basic" - in these datasets, all the words are base forms
"inflected" - these datasets contain the same words as 'basic', but inflected to other forms (to evaluate the effect of rich morphology)
"rare" - in these datasets, all the target words are rare (occur fewer than 100 times in Hebrew Wikipedia)
"ambiguous" - in these datasets, the target words are morphologically ambiguous (to evaluate the ambiguity effect)
"cohyponyms" - datasets in which the preferred-relation is defined as "cohyponyms" (in contrast to "hyponym-hypernym" in the other datasets)

References

If you use this software for research purposes, we would appreciate it if you cite the following:

@InProceedings{avraham-goldberg:2016:RepEval,
  author    = {Avraham, Oded  and  Goldberg, Yoav},
  title     = {Improving Reliability of Word Similarity Evaluation by Redesigning Annotation Task and Performance Measure},
  booktitle = {Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {106--110},
  url       = {http://anthology.aclweb.org/W16-2519}
}

Contact

For any questions, please contact oavraham1@gmail.com
