
Past Talks


2008 Speaker Series


7/11: Time TBA, 3405 Siebel

Shaul Markovitch

The Knowledgeable Computer: Using Wikipedia-based Semantics for Text Processing


When humans perform text-processing tasks, such as text categorization, information retrieval and finding related documents, they interpret the specific wording of the document in the much larger context of their background knowledge and experience. On the other hand, state-of-the-art text processing programs are quite brittle - they mostly rely on the frequency of word occurrences without using common-sense knowledge.

We propose to enrich document representation through automatic use of a vast compendium of human knowledge - an encyclopedia.  We define a new type of Wikipedia-based semantics that uses the collection of Wikipedia articles as an ontology.  Every Wikipedia article represents a concept.  Every word or text fragment is represented as a point in this multi-dimensional concept space.

When performing text-processing tasks, such as text categorization, we enrich the processed documents with Wikipedia concepts, thus allowing a much more knowledgeable inference.  Empirical evaluation of our method in the context of text categorization, information retrieval and computing semantic relatedness shows that such knowledge-intensive representation can indeed enhance performance in these domains significantly.
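The concept-space representation described in this abstract can be illustrated with a minimal sketch. It is not the speaker's implementation: the three-article "encyclopedia", the TF-IDF weighting, and all function names below are assumptions chosen for illustration. Each word is mapped to a weighted vector over concepts (articles), and two texts are compared by the cosine of their concept vectors, so texts with no words in common can still be judged related.

```python
import math
from collections import Counter

# Hypothetical toy "encyclopedia": concept title -> article text.
CONCEPTS = {
    "Computer": "computer program software hardware keyboard mouse",
    "Animal": "animal mouse cat dog mammal",
    "Finance": "bank money loan interest account",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def build_index(concepts):
    """Map each word to {concept: weight}, using an assumed TF-IDF weighting."""
    n = len(concepts)
    tf = {c: Counter(tokenize(t)) for c, t in concepts.items()}
    df = Counter(w for counts in tf.values() for w in counts)
    index = {}
    for concept, counts in tf.items():
        for word, k in counts.items():
            idf = math.log(n / df[word]) + 1.0  # +1 keeps shared words non-zero
            index.setdefault(word, {})[concept] = k * idf
    return index


def concept_vector(text, index):
    """Represent a text as the sum of its words' concept vectors."""
    vec = Counter()
    for word in tokenize(text):
        for concept, weight in index.get(word, {}).items():
            vec[concept] += weight
    return dict(vec)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


INDEX = build_index(CONCEPTS)
```

In this toy setting, `concept_vector("keyboard software", INDEX)` and `concept_vector("hardware program", INDEX)` share no words but both point only at the "Computer" concept, so their cosine is maximal, while an ambiguous word such as "mouse" receives weight on both "Computer" and "Animal".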

This work is done in collaboration with Ofer Egozi and Evgeniy Gabrilovich.


7/11: Time TBA, 3405 Siebel

Raymond J. Mooney

Learning Language from its Perceptual Context


Current systems that learn to process natural language require laboriously constructed human-annotated training data. Ideally, a computer would be able to acquire language like a child by being exposed to linguistic input in the context of a relevant but ambiguous perceptual environment. As a step in this direction, we present a system that learns to sportscast simulated soccer games by example. The training data consists of textual human commentaries on Robocup simulation games. A set of possible meanings for each comment is automatically constructed from game event traces. Our previously developed systems for learning to parse and generate natural language (KRISP and WASP) were augmented to learn from this data and then commentate novel games. The system is evaluated based on its ability to parse sentences into correct meanings and generate accurate descriptions of game events. Human evaluation was also conducted on the overall quality of the generated sportscasts and compared to human-generated commentaries.


7/8: 10am, 3405 Siebel

Ed Hovy
Ontologies: An Introduction  


Research in natural language processing (NLP) over the past fifteen years has produced impressive practical results using statistical methods. But increasingly there are signs that continued quality improvement in language processing applications (including QA, summarization, information extraction, and machine translation) requires deeper and richer representations, possibly even (shallow) semantics of text meaning. Although theories of semantics (formal and informal) abound, no-one has yet built a resource of semantic symbols that effectively supports NLP, that is empirically based, and that has been validated through human agreement scores. Can this be done? This talk describes the construction of the Omega ontology to support various NLP applications, in the context of the OntoNotes project in DARPA’s GALE program. Omega contains an Upper Model of about a hundred manually constructed and organized terms and a Middle Model of several thousand ‘sense pools’, where each sense pool is a collection of word senses from English, Arabic, and Chinese nouns and verbs, and includes one or more associated atomic features to support reasoning, as well as pointers to hundreds of individual sentences containing a word with the appropriate sense. The creation of senses, their pooling, and their integration into Omega is carried out by teams of annotators, and is subjected to cross-annotator agreement tests and other semi-automated validation procedures. To our knowledge, this is by far the most extensive ontology building effort that involves such validation.

This work is a collaboration of researchers at USC/ISI and the University of Colorado at Boulder.


7/8: 1:30pm, 3405 Siebel
The Promise and Problems of Annotation  

In order to apply automated language processing technology to assist humans with analysis and other text-oriented tasks such as retrieval, summarization, question answering, and translation, the technology has to be ‘trained’ to the particulars of the domain and the analysis task(s). Different fields of study, different tasks, different text genres, and different domains of interest all present different, and sometimes unique, challenges.

The procedure of ‘training’ the technology involves preparing a selection of the representative texts to create what is called the training suite. Typically, domain experts view the texts with suitable interfaces and in various ways and formats enter information they find useful for their task(s), in a process called coding or annotation. Usually, annotation includes the steps of delimiting some fragment of text, selecting one or more interpretive labels to attach to that portion, and perhaps adding additional information. Once two or more annotators have performed coding on the same texts, and have achieved a high enough degree of agreement between them, the language processing technology can be trained on a portion of the training suite, and its performance measured on the remainder. If that is satisfactory, the technology can be applied to additional, unannotated, material of the same type, thereby assisting analysts in future tasks.

Annotation is not an exact science. To help ensure clean and trustworthy annotations suitable for machine learning, the language processing community is beginning to address a set of seven issues. Using examples from several of the author’s projects, this talk describes each issue, lists some relevant work for each, and points to what needs to be resolved. The seven issues are:

  1. How does one obtain a balanced corpus to annotate, and when is a corpus balanced (and representative)?
  2. How does one decide what specifically to annotate? How does one adequately capture the theory behind the phenomena and express it in simple annotation instructions?
  3. When hiring annotators, what characteristics are important? How does one ensure that they are adequately (and not over- or under-) trained?
  4. How does one establish a simple, fast, and trustworthy annotation procedure? What interfaces does one build? How does one ensure that the interfaces do not influence the annotation results?
  5. How does one evaluate the results? What are the appropriate agreement measures? At which cutoff points should one redesign or re-do the annotations?
  6. How should one formulate and store the results? How does one ensure compatibility with other existing resources? How does one make results available for best impact?
  7. How does one report the annotation effort and results? How does one actually publish papers on this work? What should the papers contain?
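The abstract asks which agreement measures are appropriate without naming one. As an illustration only, the sketch below computes Cohen's kappa, one commonly used measure for two annotators assigning one label per item; the choice of measure, the function name, and the example labels are assumptions, not something taken from the talk.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa).

    labels_a and labels_b hold the labels each annotator assigned to the
    same items, in the same order.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random according to
    # their own label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of six text fragments by two annotators.
annotator_1 = ["PER", "PER", "ORG", "ORG", "LOC", "LOC"]
annotator_2 = ["PER", "PER", "ORG", "LOC", "LOC", "LOC"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on five of six items (observed agreement 5/6) and chance agreement is 1/3, giving a kappa of 0.75. Where to set the cutoff for redoing the annotation is exactly the open question the abstract raises.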

7/1: 10am, 3405 Siebel

Bill Hsu
Constructive Induction in Link Mining with Applications to "Social" Networks  


This talk will focus on the problem of learning to predict and reason about the structure of graphs whose links represent relations of various types. I will describe some framing problems in link mining, starting with classification-based prediction of link existence in social networks and extending this towards statistical relational learning (especially using relational graphical models). We will look at two methodologies: first, computing graph features and using them in classical feature construction; second, a more general constructive induction approach that aims at synthesizing features in a pure "discovery informatics" framework. In both cases I will first discuss classification, then mostly generative and some discriminative techniques. I will present some early results from graph feature construction in the social network link mining domain and discuss new research using the more general approach. I will conclude with a brief survey of successful applications of this link mining approach, including one in bioinformatics (protein-protein interaction prediction).
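The first methodology, computing graph features for classification-based link prediction, can be sketched as follows. The abstract does not list the features used, so the three below (common neighbors, Jaccard coefficient, preferential attachment) are assumed, commonly used examples, and the graph and names are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def pair_features(adjacency, u, v):
    """Structural features for a candidate link (u, v)."""
    neighbors_u = adjacency.get(u, set())
    neighbors_v = adjacency.get(v, set())
    common = neighbors_u & neighbors_v
    union = neighbors_u | neighbors_v
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "preferential_attachment": len(neighbors_u) * len(neighbors_v),
    }


# Hypothetical friendship graph.
EDGES = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
         ("cat", "dan"), ("dan", "eve")]
ADJACENCY = build_adjacency(EDGES)
features_ann_dan = pair_features(ADJACENCY, "ann", "dan")
features_ann_eve = pair_features(ADJACENCY, "ann", "eve")
```

Feature dictionaries like these, computed for linked and unlinked node pairs, would then serve as input rows for any standard classifier that predicts link existence.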

William Hsu is an associate professor of computer science at Kansas State University. He received his Ph.D. in computer science in 1998, was a research scientist in the Automated Learning Group at NCSA from 1998 to 1999, and has been a member of K-State's Computing and Information Sciences faculty since 1999. His research and teaching interests include machine learning, probabilistic reasoning, time series analysis, and data mining using graphical models.


6/27: 10am, 3405 Siebel

James Clarke
Integer Linear Programming for NLP  


Many natural language processing tasks, such as machine translation, parsing and generation, require a decoding algorithm to find the best solution for a given input and model. The decoding problem is also referred to as the inference or search problem. Ideally during decoding we should find the optimal solution in an efficient manner. However, many decoding algorithms find sub-optimal solutions or force us to make strong assumptions of conditional independence between variables. Formulating decoding as an integer linear program allows us to infer globally optimal solutions and enforce global constraints.

In this talk we give an introduction to the concepts, formulation and solving of integer linear programs. We demonstrate how integer linear programming can be used for decoding in two applications: sentence compression and dependency parsing. Our approach can yield state-of-the-art or better performance by introducing linguistically motivated constraints that allow us to model the global properties observed in language.
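To make the formulation concrete, here is a toy sketch of sentence compression written as a 0-1 integer linear program: one binary variable per word (keep or drop), a linear objective over word scores, and linear constraints. The sentence, scores, and constraints are invented for illustration and are not taken from the talk. For self-containment the program is solved by exhaustive enumeration, which is only feasible for very short inputs; a real system would pass the same objective and constraints to an ILP solver.

```python
from itertools import product


def solve_binary_program(scores, constraints):
    """Maximize sum(scores[i] * x[i]) over x in {0,1}^n subject to constraints.

    Each constraint is a function taking the tuple x and returning True when
    x is feasible. Brute force stands in for an ILP solver here.
    """
    best_x, best_value = None, None
    for x in product((0, 1), repeat=len(scores)):
        if all(check(x) for check in constraints):
            value = sum(s * xi for s, xi in zip(scores, x))
            if best_value is None or value > best_value:
                best_x, best_value = x, value
    return best_x, best_value


WORDS = ["the", "old", "committee", "quickly", "approved", "the", "budget"]
SCORES = [0.1, 0.3, 0.9, 0.2, 1.0, 0.1, 0.8]  # hypothetical word importance

CONSTRAINTS = [
    lambda x: sum(x) <= 4,    # global length limit
    lambda x: x[4] == 1,      # the main verb must be kept
    lambda x: x[0] <= x[2],   # "the" only if its head "committee" is kept
    lambda x: x[1] <= x[2],   # "old" only if "committee" is kept
    lambda x: x[3] <= x[4],   # "quickly" only if "approved" is kept
    lambda x: x[5] <= x[6],   # second "the" only if "budget" is kept
]

best_x, best_value = solve_binary_program(SCORES, CONSTRAINTS)
compression = " ".join(w for w, keep in zip(WORDS, best_x) if keep)
```

With these assumed scores the optimum keeps "old committee approved budget". The modifier-implies-head constraints are the kind of linguistically motivated global constraint that is awkward to enforce in a purely local decoder but is a single linear inequality in this formulation.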


6/20: 3pm, 3405 Siebel

Hadar Shem Tov


6/13: 4:30pm, Location TBA

Hwee Tou Ng
Recent Advances in Word Sense Disambiguation: Scaling Up, Sense Prior Estimation, and Integration into Statistical Machine Translation  


In this talk, I will introduce several research issues associated with word sense disambiguation (WSD), which is the task of determining the correct meaning, or sense, of a word in context. I will present recent work completed in my research group to address these issues. The first issue concerns the scaling up of WSD. Although supervised WSD gives good accuracy, the lack of sense-tagged training data has hampered the progress of WSD. I will present our approach of using parallel texts to scale up WSD. Using this approach, our WSD system participated in SemEval-2007, where our system achieved the highest and second highest accuracy in the coarse-grained and fine-grained English all-words task, among 16 and 14 participating systems respectively. The second issue concerns the accuracy drop of a WSD system, when it is applied to texts drawn from a different domain with different sense priors. I will present results showing improved WSD accuracy after applying class prior estimation algorithms and using well calibrated probabilities. The third issue concerns the perceived lack of applications utilizing WSD. We integrated our state-of-the-art WSD system into Hiero, a state-of-the-art hierarchical phrase-based statistical machine translation system. We found that the use of WSD improves translation quality, and the improvement is statistically significant.


6/10: 10am, 3405 Siebel

Roxana Girju
Semantic Parsing: Understanding Noun-Noun pairs in Context  


The Computational Linguistics community has shown a renewed interest in deeper semantic analysis, including the automatic recognition of semantic relations between nouns. This talk will provide a short introduction to major concepts and recent developments in the area of semantic relations. The material is relevant both to computer science and linguistics. Students will be introduced to various representation schemes required for semantic analysis of noun pairs in context and will become familiar with the basic techniques and tools needed to develop semantic parsers.


2007 Speaker Series


5/23

Chris Olston, Yahoo! Research  



6/7-6/8

Ed Hovy, ISI-USC  



Schedule
6/7, 1-4pm: Automated Text Summarization as a Variant of Info Extraction (tutorial) (Siebel 2405)
6/8, 9am: The 3 Futures of NLP (research talk) (Siebel 2405)
6/8, 10:30am: New Developments in Information Extraction (research talk) (Siebel 2405)

6/14

Josef Ruppenhofer, Pitt  


Dr. Ruppenhofer will give a talk and a related tutorial, both of which are open to the public.

Title: Manual and Automatic Subjectivity and Sentiment Analysis

Subjectivity analysis focuses on the expression of emotions, evaluations, and sentiments in language. This tutorial will cover:


  • problem definitions (e.g., what is subjectivity?) and manual annotations;
  • methods for identifying opinion-bearing words and phrases (lexicon development);
  • methods for identifying polarity/orientation (positive, negative, or neutral) of expressions in context;
  • applications of subjectivity analysis, with an emphasis on project review mining.

The tutorial session will be a working session involving manual annotations. Note that we will only touch on subjectivity and sentiment classification at the document level, and will focus on fine-grained analysis at the sentence, phrase, word, and word-sense levels.


Schedule
6/14, 10a-12p: Siebel 2405, Research Talk
6/14, 1-2p: Siebel 2405, Tutorial

6/25

Deva Ramanan, TTI  


If you would like to schedule a meeting with Deva on Tuesday, 6/26, please send email to: heeren@cs.uiuc.edu.


Schedule
6/25, 4:15-5:45p: Training a Computer to See People (Siebel 2405)

7/5

Dan Roth, MIAS-UIUC  


Dr. Roth will give a talk entitled: Global Learning with Constraints

Abstract: The maturity of machine learning techniques allows us today to learn many low level predicates and generate an appropriate vocabulary over which reasoning methods can be used to make significant progress in higher level domain decisions.

I will describe research on a framework that combines learning and inference and exhibit its use in the natural language processing domain. Key in this framework is the ability to incorporate declarative and expressive global information into the learning and decision stage. I will discuss the use of this framework as (1) a way to allow the output of local classifiers for different problem components to be assembled into a whole that reflects global preferences and constraints; (2) a way to improve probabilistic models by enforcing additional expressive constraints and (3) a way to significantly improve semi-supervised training of structured models.

Examples will be drawn from 'wh' attribution in natural language processing (determining who did what to whom when and where) and from information extraction problems.

Bio: Dan Roth is a Professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign and the Beckman Institute of Advanced Science and Technology (UIUC) and a Willett Faculty Scholar of the College of Engineering. He is also the Director of MIAS, a DHS Institute of Discrete Science Center for Multimodal Information Access & Synthesis.

Roth has published broadly in machine learning, natural language processing, knowledge representation and reasoning and has developed advanced machine learning based tools for natural language applications that are being used widely by the research community. Among his paper awards are the best paper award in IJCAI-99 and the 2001 AAAI Innovative Applications of AI Award. Roth was the program chair of CoNLL'02 and of ACL'03 and is an associate editor for JAIR and the Machine Learning Journal. Roth got his Ph.D. in Computer Science from Harvard University in 1995.


Schedule
7/5, 1:00-2:30p: Global Learning with Constraints (Siebel 3405)

7/6

Tina Eliassi-Rad, LLNL 


Dr. Eliassi-Rad will give a talk entitled: Leveraging Network Structure to Infer Missing Values in Relational Data

Abstract: Inference techniques for relational data improve classification performance by exploiting dependencies between attributes of related instances. In particular, a great deal of recent attention has been paid to collective inference procedures, which make simultaneous inferences over attributes of related instances. Collective inference has been shown to be particularly effective for overcoming substantial amounts of missing attribute information. We propose a novel approach for inference in relational data, which leverages information about the relational network structure. We show that when structural characteristics are informative, our approach leads to consistent, and sometimes dramatic, improvement in classification performance regardless of the amount of attribute information available. We demonstrate the utility of our method on several real-world classification tasks. Interestingly, for many of these tasks, collective inference does not perform well, apparently due to low amounts of relational autocorrelation. Understanding data characteristics that influence collective inference is a largely unexplored area for further study. This is joint work with Brian Gallagher (Lawrence Livermore National Laboratory) and Lise Getoor (University of Maryland).

Bio: Tina Eliassi-Rad is a computer scientist and a technical lead at the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. She earned her Ph.D. at UW-Madison in 2001, her M.S. at UIUC in 1995, and her B.S. at UW-Madison in 1993. All of her degrees are in computer science. Her research interests include artificial intelligence, machine learning, knowledge discovery and data mining. Her work has been applied to the World-Wide Web, scientific simulation data, and complex networks. For more details, visit http://www.cs.wisc.edu/~eliassi/.


Schedule
7/6, 1:00-2:30p: Leveraging Network Structure to Infer Missing Values in Relational Data (Siebel 3405)


7/9

Patrick Pantel, ISI-USC  


Dr. Pantel will give a 3-hour talk and tutorial, which is open to the public:

Title: Lexical semantics and large-scale similarity modeling

Abstract: In this tutorial, we will explore recent developments in computational lexical semantics, using corpus-based and web-based techniques, and unsupervised and semi-supervised learning methods. With a focus on similarity modeling, we will learn the art of mapping problem statements to feature representations, information theoretic feature weighting, comparison measures, and clustering algorithms. We will apply this framework to automatically learn the concepts in a textual corpus, the senses of words, the topics in a collection of documents, paraphrases, and even detecting aliases and groups of related individuals in a homeland security setting. We will also explore Google's famed MapReduce infrastructure for seamless very large-scale data processing, introducing open-source efforts under way for making this technology available to the public.
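The pipeline named in this abstract (feature representation, information-theoretic feature weighting, comparison measure) can be sketched in a few lines. The specific choices below are assumptions for illustration: pointwise mutual information (PMI) as the weighting, cosine as the comparison measure, and a made-up set of word/context co-occurrences. The tutorial itself may have used different weightings and measures.

```python
import math
from collections import Counter


def pmi_vectors(pairs):
    """Build word vectors from (word, feature) co-occurrence observations.

    Each feature is weighted by pointwise mutual information; only
    positive PMI values are kept.
    """
    pair_counts = Counter(pairs)
    word_counts, feature_counts = Counter(), Counter()
    for (word, feature), count in pair_counts.items():
        word_counts[word] += count
        feature_counts[feature] += count
    total = sum(pair_counts.values())
    vectors = {}
    for (word, feature), count in pair_counts.items():
        pmi = math.log(count * total / (word_counts[word] * feature_counts[feature]))
        if pmi > 0:
            vectors.setdefault(word, {})[feature] = pmi
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical (word, context verb) observations from a corpus.
OBSERVATIONS = [
    ("wine", "drink"), ("wine", "pour"), ("wine", "buy"),
    ("beer", "drink"), ("beer", "pour"),
    ("car", "drive"), ("car", "park"), ("car", "buy"),
    ("truck", "drive"), ("truck", "park"),
]
VECTORS = pmi_vectors(OBSERVATIONS)
sim_wine_beer = cosine(VECTORS["wine"], VECTORS["beer"])
sim_wine_car = cosine(VECTORS["wine"], VECTORS["car"])
```

On this toy data "wine" comes out closer to "beer" than to "car", because the only context shared with "car" is "buy". The resulting similarity scores are what a clustering algorithm would consume to group words into concepts.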

Bio: Patrick Pantel is currently a Research Assistant Professor and Research Scientist in the Natural Language Group at the USC Information Sciences Institute where he does research in large-scale natural language processing, ontology learning, text mining, knowledge acquisition, and predictive systems. In 2003, he received a Ph.D. in Computing Science from the University of Alberta in Edmonton, Canada.


Schedule
7/9, 1:00-4:00p: Lexical semantics and large-scale similarity modeling (Siebel 3405)


7/10

AnHai Doan, University of Wisconsin  


Dr. Doan's talk is open to the public.

Title: The Cimple Project on Community Information Management

Abstract: In this talk I will give an overview of Cimple, a joint project between the University of Wisconsin-Madison and Yahoo! Research. Cimple develops a generic solution that crawls, extracts, and integrates data, to build structured "portals" for online communities. I will first describe the envisioned working of Cimple and our prototype, DBlife, which is a structured portal being developed for the database research community. Next, I describe the technical challenges underlying Cimple and our solution approaches. Finally, I discuss the connections between Cimple and research in data integration, information extraction, human computation, and Web data management. More information about Cimple can be found at http://www.cs.wisc.edu/~anhai/projects/cimple

Bio: AnHai Doan has been an assistant professor in Computer Science at the University of Wisconsin-Madison since July 2006. His interests cover databases, AI, and the Web. His current research focuses on data integration, Web community management, mass collaboration, text management, information extraction, and schema and ontology matching. Selected recent honors include the ACM Doctoral Dissertation Award in 2003, CAREER Award in 2004, and Alfred P. Sloan Research Fellowship in 2007. Selected recent professional activities include co-chairing WebDB at SIGMOD-05 and the AI Nectar track at AAAI-06.


Schedule
7/10, 1:00-2:30p: The Cimple Project on Community Information Management (Siebel 3405)
