Cimple: Monitoring People and Events For Online Communities

Participants: AnHai Doan

The Web is teeming with communities, such as those of law enforcement agencies, database researchers, and bioinformaticians. Community members often want to aggregate community data, and then query, monitor, and discover interesting information and events. For example, database researchers might be interested in questions such as, "Is there any interesting connection between researchers X and Y?" As Web communities proliferate, developing effective ways to support their information needs at the community level is becoming increasingly important. The Cimple Project aims to develop a software platform that can be rapidly deployed and customized to manage data-rich online communities. To drive and validate Cimple, we are building DBLife, a prototype system that manages information for the database research community. This system is viewable online at http://dblife.cs.wisc.edu.

Acquiring Community Data
Building a community portal first requires collecting an initial set of data sources that are relevant to the community. Often, a domain expert already knows the most prominent community sources. Still, it can be difficult even for a domain expert to find relevant sources beyond the most prominent ones. Thus, we have developed RankSource, a tool that ranks sources and helps developers select data that is highly relevant to the community.

Extracting Community Information
Since community data is often unstructured, a key component of Cimple is information extraction (IE): extracting interesting entities, relationships, and events from community data. However, in current development frameworks, IE programs are often difficult to write, understand, optimize, and debug. Thus, we have been building a declarative framework for IE, providing tools to help developers quickly write IE programs.
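To make the idea of declarative IE concrete, here is a minimal Python sketch. It is not Cimple's framework; it only illustrates the general approach of stating what to extract as data (rules) and leaving the how to a generic engine. The `Rule` type, the two relations, and the regular expressions are all hypothetical and chosen for illustration.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    """A hypothetical declarative rule: a relation name plus a pattern
    whose named groups become the fields of each extracted fact."""
    relation: str
    pattern: str


# Illustrative rules only; a real system would need far more robust patterns.
PERSON = r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+)"
VENUE = r"(?P<venue>[A-Z][A-Za-z]+)"
RULES = [
    Rule("gave_talk", PERSON + r" gave a talk at " + VENUE),
    Rule("serves_on", PERSON + r" serves on the " + VENUE + r" program committee"),
]


def extract(text, rules=RULES):
    """Apply every rule to the text and collect (relation, fields) facts."""
    facts = []
    for rule in rules:
        for match in re.finditer(rule.pattern, text):
            facts.append((rule.relation, match.groupdict()))
    return facts


if __name__ == "__main__":
    page = ("Jane Roe gave a talk at SIGMOD. "
            "John Poe serves on the VLDB program committee.")
    for fact in extract(page):
        print(fact)
```

Because the rules are plain data, a developer can add, inspect, or reorder them without touching the engine, which is the property that makes such programs easier to understand and debug.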

Integrating Many Data Sources
To aggregate community information, we must perform data integration across multiple, heterogeneous data sources. This involves numerous challenges, such as inferring whether the names "D. Smith" and "David Smith" refer to the same person. We have built compositional frameworks that can cope with data heterogeneity by integrating multiple solutions in a principled way.
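As an illustration of the name-matching challenge, the following Python sketch checks whether two name strings could refer to the same person. It is a simple heuristic written for this example, not the method used in Cimple: it assumes names are written given-name-first, requires last names to match exactly, and treats a single-letter initial as compatible with any given name that starts with that letter. All function names are hypothetical.

```python
import re


def name_tokens(name):
    """Lowercase a name and split it into tokens, ignoring periods and commas."""
    return re.sub(r"[.,]", " ", name.lower()).split()


def tokens_compatible(a, b):
    """Tokens agree if they are equal, or if one is a one-letter initial
    that the other starts with."""
    if a == b:
        return True
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    return len(short) == 1 and long_.startswith(short)


def may_be_same_person(name1, name2):
    """Heuristic test: same number of tokens, identical last name,
    and pairwise-compatible given names or initials."""
    t1, t2 = name_tokens(name1), name_tokens(name2)
    if not t1 or not t2 or len(t1) != len(t2):
        return False
    if t1[-1] != t2[-1]:
        return False
    return all(tokens_compatible(a, b) for a, b in zip(t1[:-1], t2[:-1]))


if __name__ == "__main__":
    print(may_be_same_person("D. Smith", "David Smith"))
    print(may_be_same_person("D. Smith", "John Smith"))
```

A match from this check is only a candidate: "D. Smith" is also compatible with "Daniel Smith", so a practical system has to combine such string evidence with other signals, such as co-authors or affiliations, before merging records.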

Harnessing Mass Collaboration
Mass collaboration aims to improve systems by exploiting user feedback. It has been applied in many other settings (e.g., bug detection, Wikipedia), but little has been done in extraction and integration contexts. In Cimple, we have built novel mass collaboration systems, including a "community wikipedia" that allows a partnership between human contributions and automatic extraction methods.

Exploiting Extracted Data
Given extracted community data, there are many useful services we can build, such as keyword search, structured queries, and news alerts. We have built prototypes of many of these services in the DBLife system. Also, we are building a best-effort framework that allows developers to immediately build simple but useful services, and add increasingly sophisticated services incrementally over time.

Maintaining Portals Over Time
Since community data changes, Cimple systems must be maintained over time. This involves several challenges, such as keeping data sources up to date and adding new relevant data sources. In DBLife, we have found techniques that allow us to maintain the system over several years with minimal work from the developer.