Title: Benchmarking and Evaluation Campaigns: the good, the bad, and the metrics
Presenter: Julio Gonzalo, UNED.
Competitive evaluation exercises have become mainstream in the Natural Language Processing and Information Retrieval research communities. Drawing on experiences at WePS (the Web People Search evaluation initiative) and CLEF (the Cross-Language Evaluation Forum), we will discuss the benefits and risks of organizing research around evaluation campaigns and highlight some unresolved dilemmas. The presentation will place special emphasis on metric design, a key aspect that is often overlooked in mainstream experimental designs. We will conclude by presenting a new metric, the Unanimous Improvement Ratio (UIR), which supports the analysis of experimental results when more than one metric is involved in the evaluation, as is the case with Precision and Recall in many evaluation settings.
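The abstract only hints at how UIR works, but one plausible reading is a sketch like the following: given per-test-case scores for two systems on several metrics (e.g. Precision and Recall), count the cases where one system improves the other on *all* metrics simultaneously. The function name, the tie-breaking rule, and the normalisation below are illustrative assumptions, not the official definition from the talk.

```python
def uir(scores_a, scores_b):
    """Sketch of a Unanimous Improvement Ratio between systems A and B.

    scores_a, scores_b: equal-length lists of per-test-case metric tuples,
    e.g. (Precision, Recall) for each test case.
    Returns (A's unanimous wins - B's unanimous wins) / number of cases,
    where a 'unanimous win' means at least as good on every metric and
    strictly better on at least one (an assumed tie-breaking rule).
    """
    def unanimous(x, y):
        return (all(xi >= yi for xi, yi in zip(x, y))
                and any(xi > yi for xi, yi in zip(x, y)))

    n = len(scores_a)
    a_wins = sum(unanimous(a, b) for a, b in zip(scores_a, scores_b))
    b_wins = sum(unanimous(b, a) for a, b in zip(scores_a, scores_b))
    return (a_wins - b_wins) / n

# Example: three test cases with (Precision, Recall) per system.
a = [(0.8, 0.6), (0.7, 0.7), (0.5, 0.9)]
b = [(0.7, 0.5), (0.8, 0.6), (0.5, 0.9)]
print(uir(a, b))  # A improves B on both metrics in 1 of 3 cases
```

Unlike a weighted combination such as F-measure, a count of unanimous improvements makes no assumption about the relative importance of the metrics, which is the motivation the abstract attributes to UIR.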
About the author:
Julio Gonzalo is a member of the nlp.uned.es research group, where he conducts research on the application of Language Engineering to Multilingual Information Access problems, in particular the development of evaluation metrics and methodologies. He has been involved in the organization of CLEF (the international evaluation campaign for Multilingual Information Access applications) since 2001, and he is a co-organizer of WePS (the Web People Search evaluation campaign). More information at http://nlp.uned.es/~julio