Prof. Felix Naumann (Chair of Information Systems at the Hasso Plattner Institute, Potsdam, Germany).
Title: Dr. Crowdsource: or How I Learned to Stop Worrying and Love Web Data |
Short bio: Felix Naumann studied mathematics, economics, and computer science at the University of Technology in Berlin. After receiving his diploma in 1997, he joined the graduate school at Humboldt University of Berlin. He completed his PhD thesis on data quality in 2000. Before moving to the University of Potsdam, he worked at the IBM Almaden Research Center and served as an assistant professor for information integration at Humboldt University of Berlin. Since 2006 he has held the chair of Information Systems at the Hasso Plattner Institute. |
Abstract: The wealth of freely available, structured information on the Web is constantly growing. Driving domains are public data from and about governments and administrations, scientific data, and data about media, such as articles, books, and albums. In addition, general-purpose datasets, such as DBpedia and Freebase from the linked open data community, serve as a focal point for many data sets. Thus, it is possible to query or integrate data from multiple sources and create new, integrated data sets with added value. Yet integration is far from simple: it happens at the technical level by ingesting data in various formats, at the structural level by providing a common ontology and mapping the data sources' structures to it, and at the semantic level by linking multiple records about the same real-world entities and fusing these representations into a clean and consistent record. The talk highlights the extreme heterogeneity of web data and points to methods for overcoming it. A multitude of tasks must be completed: source selection to identify appropriate, high-quality sources; data extraction to create structured data; scrubbing to standardize and clean data; entity matching to associate different occurrences of the same entity; and finally data transformation and data fusion to combine all data about an entity into a single, consistent representation. |
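
To make the last three tasks named in the abstract concrete, the following is a minimal Python sketch of scrubbing, entity matching, and data fusion on a handful of album records. It is not taken from the talk: the function names (`scrub`, `match_entities`, `fuse`, `integrate`), the sample records, the `title` attribute used for matching, the 0.85 similarity threshold, and the majority-vote fusion rule are all illustrative assumptions. Real systems typically use richer similarity measures and conflict-resolution strategies.

```python
from difflib import SequenceMatcher


def scrub(record):
    """Scrubbing: collapse whitespace, lowercase strings, drop empty values."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = " ".join(value.split()).lower()
        if value not in ("", None):
            cleaned[key] = value
    return cleaned


def similar(a, b, threshold):
    """Assumed matching rule: compare only the (hypothetical) 'title' attribute."""
    ratio = SequenceMatcher(None, a.get("title", ""), b.get("title", "")).ratio()
    return ratio >= threshold


def match_entities(records, threshold=0.85):
    """Entity matching: greedily add each record to the first cluster
    whose first member is similar enough, else start a new cluster."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if similar(cluster[0], rec, threshold):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


def fuse(cluster):
    """Data fusion: per attribute, keep the most frequent value
    (ties resolved in favor of the value seen first)."""
    fused = {}
    for rec in cluster:
        for key in rec:
            if key not in fused:
                values = [r[key] for r in cluster if key in r]
                fused[key] = max(values, key=values.count)
    return fused


def integrate(records):
    """Run scrubbing, entity matching, and fusion in sequence."""
    scrubbed = [scrub(r) for r in records]
    return [fuse(c) for c in match_entities(scrubbed)]


# Hypothetical records describing albums, as if drawn from three sources.
records = [
    {"title": "Abbey Road ", "artist": "The Beatles", "year": 1969},
    {"title": "abbey  road", "artist": "Beatles", "year": 1969},
    {"title": "Abbey Road", "artist": "The Beatles"},
    {"title": "Kind of Blue", "artist": "Miles Davis", "year": 1959},
]

for entity in integrate(records):
    print(entity)
```

In this sketch the three "Abbey Road" records collapse into one entity, and the conflicting artist values are resolved by majority vote. Source selection, data extraction, and schema mapping are deliberately left out, since they depend heavily on the sources involved.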