When humans perform text-processing tasks, such as text categorization, information retrieval and finding related documents, they interpret the specific wording of the document in the much larger context of their background knowledge and experience. On the other hand, state-of-the-art text processing programs are quite brittle - they mostly rely on the frequency of word occurrences without using common-sense knowledge.
We propose to enrich document representation through automatic use of a vast compendium of human knowledge: an encyclopedia. We define a new type of Wikipedia-based semantics that uses the collection of Wikipedia articles as an ontology. Every Wikipedia article represents a concept, and every word or text fragment is represented as a point in the multi-dimensional space spanned by these concepts.
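As a rough illustration of the concept-space idea, the sketch below maps text fragments to weighted vectors over a handful of concepts and compares them by cosine similarity. The three miniature "articles", the TF-IDF weighting, and all function names are assumptions made for this example; the abstract does not specify how the weights are computed.

```python
import math
from collections import Counter

# Hypothetical stand-ins for Wikipedia articles: each title is one concept.
CONCEPTS = {
    "Bank (finance)": "bank money loan deposit interest account money",
    "River": "river water bank flow stream water",
    "Football": "ball goal team player match goal",
}


def build_index(concepts):
    """Map each word to its TF-IDF weight in every concept article (assumed weighting)."""
    n = len(concepts)
    counts = {title: Counter(text.split()) for title, text in concepts.items()}
    doc_freq = Counter(word for c in counts.values() for word in c)
    index = {}
    for title, c in counts.items():
        for word, tf in c.items():
            idf = math.log(n / doc_freq[word]) + 1.0  # +1 keeps shared words nonzero
            index.setdefault(word, {})[title] = tf * idf
    return index


def concept_vector(text, index):
    """Represent a text fragment as the sum of its words' concept vectors."""
    vec = Counter()
    for word in text.lower().split():
        for title, weight in index.get(word, {}).items():
            vec[title] += weight
    return dict(vec)


def relatedness(a, b, index):
    """Cosine similarity between two fragments in concept space."""
    va, vb = concept_vector(a, index), concept_vector(b, index)
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


INDEX = build_index(CONCEPTS)
print(concept_vector("bank", INDEX))          # weights on two concepts
print(relatedness("bank", "loan", INDEX))     # related through one shared concept
print(relatedness("loan", "goal", INDEX))     # no shared concept
```

In this toy setting the ambiguous word "bank" receives weight on both the finance and the river concept, which is the kind of background knowledge a pure word-frequency representation cannot express.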
When performing text-processing tasks, such as text categorization, we enrich the processed documents with Wikipedia concepts, thus allowing a much more knowledgeable inference. Empirical evaluation of our method in the context of text categorization, information retrieval and computing semantic relatedness shows that such knowledge-intensive representation can indeed enhance performance in these domains significantly.
This work was done in collaboration with Ofer Egozi and Evgeniy Gabrilovich.
Current systems that learn to process natural language require laboriously constructed human-annotated training data. Ideally, a computer would be able to acquire language like a child by being exposed to linguistic input in the context of a relevant but ambiguous perceptual environment. As a step in this direction, we present a system that learns to sportscast simulated soccer games by example. The training data consists of textual human commentaries on Robocup simulation games. A set of possible meanings for each comment is automatically constructed from game event traces. Our previously developed systems for learning to parse and generate natural language (KRISP and WASP) were augmented to learn from this data and then commentate novel games. The system is evaluated based on its ability to parse sentences into correct meanings and generate accurate descriptions of game events. Human evaluation was also conducted on the overall quality of the generated sportscasts and compared to human-generated commentaries.
Research in natural language processing (NLP) over the past fifteen years has produced impressive practical results using statistical methods. But increasingly there are signs that continued quality improvement in language processing applications (including QA, summarization, information extraction, and machine translation) requires deeper and richer representations, possibly even (shallow) semantics of text meaning. Although theories of semantics (formal and informal) abound, no one has yet built a resource of semantic symbols that effectively supports NLP, that is empirically based, and that has been validated through human agreement scores. Can this be done? This talk describes the construction of the Omega ontology to support various NLP applications, in the context of the OntoNotes project in DARPA’s GALE program. Omega contains an Upper Model of about a hundred manually constructed and organized terms and a Middle Model of several thousand ‘sense pools’, where each sense pool is a collection of word senses from English, Arabic, and Chinese nouns and verbs, and includes one or more associated atomic features to support reasoning, as well as pointers to hundreds of individual sentences containing a word with the appropriate sense. The creation of senses, their pooling, and their integration into Omega are carried out by teams of annotators, and are subjected to cross-annotator agreement tests and other semi-automated validation procedures. To our knowledge, this is by far the most extensive ontology-building effort that involves such validation.
This work is a collaboration of researchers at USC/ISI and the University of Colorado at Boulder.
In order to apply automated language processing technology to assist humans with analysis and other text-oriented tasks such as retrieval, summarization, question answering, and translation, the technology has to be ‘trained’ to the particulars of the domain and the analysis task(s). Different fields of study, different tasks, different text genres, and different domains of interest all present different, and sometimes unique, challenges.
The procedure of ‘training’ the technology involves preparing a selection of the representative texts to create what is called the training suite. Typically, domain experts view the texts with suitable interfaces and in various ways and formats enter information they find useful for their task(s), in a process called coding or annotation. Usually, annotation includes the steps of delimiting some fragment of text, selecting one or more interpretive labels to attach to that portion, and perhaps adding additional information. Once two or more annotators have performed coding on the same texts, and have achieved a high enough degree of agreement between them, the language processing technology can be trained on a portion of the training suite, and its performance measured on the remainder. If that is satisfactory, the technology can be applied to additional, unannotated, material of the same type, thereby assisting analysts in future tasks.
Annotation is not an exact science. To help ensure clean and trustable annotations suitable for machine learning, the language processing community is beginning to address a set of seven issues. Using examples from several of the author’s projects, this talk describes each issue, lists some relevant work for each, and points to what needs to be resolved. The seven issues are:
1. How does one obtain a balanced corpus to annotate, and when is a corpus balanced (and representative)?
2. How does one decide what specifically to annotate? How does one adequately capture the theory behind the phenomena and express it in simple annotation instructions?
3. When hiring annotators, what characteristics are important? How does one ensure that they are adequately (and not over- or under-) trained?
4. How does one establish a simple, fast, and trustworthy annotation procedure? What interfaces does one build? How does one ensure that the interfaces do not influence the annotation results?
5. How does one evaluate the results? What are the appropriate agreement measures? At which cutoff points should one redesign or re-do the annotations?
6. How should one formulate and store the results? How does one ensure compatibility with other existing resources? How does one make results available for best impact?
7. How does one report the annotation effort and results? How does one actually publish papers on this work? What should the papers contain?
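To make the question of agreement measures concrete, here is a minimal sketch of Cohen's kappa, one commonly used chance-corrected measure for two annotators. The abstract does not say which measures the talk recommends, so the choice of kappa, the function name, and the example labels are assumptions for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical part-of-speech labels from two annotators on four tokens.
annotator_1 = ["N", "N", "V", "V"]
annotator_2 = ["N", "N", "V", "N"]
print(cohen_kappa(annotator_1, annotator_2))
```

Raw agreement here is 3 out of 4, but half of that would be expected by chance given the two label distributions, which is why a chance-corrected score is usually reported alongside or instead of raw agreement.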
This talk will focus on the problem of learning to predict and reason
about the structure of graphs whose links represent relations of various
types. I will describe some framing problems in link mining, starting
with classification-based prediction of link existence in social
networks and extending this towards statistical relational learning
(especially using relational graphical models). We will look at two
methodologies: first, computing graph features and using them in
classical feature construction; second, a more general constructive
induction approach that aims at synthesizing features in a pure
"discovery informatics" framework. In both cases I will first discuss
classification, then mostly generative and some discriminative
techniques. I will present some early results from graph feature
construction in the social network link mining domain and discuss new
research using the more general approach. I will conclude with a brief
survey of successful applications of this link mining approach,
including one in bioinformatics (protein-protein interaction
prediction).
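The first methodology, computing graph features for link prediction, can be sketched as follows. The particular features (common neighbors, Jaccard coefficient, preferential attachment), the toy friendship graph, and the function names are assumptions chosen for illustration; the abstract does not list the features used in the talk. In a classification setting these feature values would be fed to a classifier, whereas the sketch simply ranks candidate links by a single feature.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected graph as a dict mapping each node to its neighbor set."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def link_features(graph, u, v):
    """Compute simple neighborhood-based features for the candidate link (u, v)."""
    nu, nv = graph[u], graph[v]
    shared = nu & nv
    union = nu | nv
    return {
        "common_neighbors": len(shared),
        "jaccard": len(shared) / len(union) if union else 0.0,
        "preferential_attachment": len(nu) * len(nv),
    }


def rank_missing_links(graph, key="jaccard"):
    """Rank all currently unlinked node pairs by one feature, highest first."""
    candidates = [
        (u, v) for u, v in combinations(sorted(graph), 2) if v not in graph[u]
    ]
    return sorted(
        candidates, key=lambda pair: link_features(graph, *pair)[key], reverse=True
    )


# Hypothetical social network.
EDGES = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
    ("bob", "dan"), ("cat", "dan"), ("dan", "eve"),
]
GRAPH = build_graph(EDGES)
print(link_features(GRAPH, "ann", "dan"))
print(rank_missing_links(GRAPH))
```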
William Hsu is an associate professor of computer science at Kansas
State University. He received his Ph.D. in computer science in 1998,
was a research scientist in the Automated Learning Group at NCSA from
1998 to 1999, and has been a member of K-State's Computing and Information
Sciences faculty since 1999. His research and teaching interests
include machine learning, probabilistic reasoning, time series analysis,
and data mining using graphical models.
Many natural language processing tasks, such as machine translation, parsing, and generation, require a decoding algorithm to find the best solution for a given input and model. The decoding problem is also referred to as the inference or search problem. Ideally during decoding we should find the optimal solution in an efficient manner. However, many decoding algorithms find sub-optimal solutions or force us to make strong assumptions of conditional independence between variables. Formulating decoding as an integer linear program allows us to infer globally optimal solutions and enforce global constraints.
In this talk we give an introduction to the concepts, formulation and solving of integer linear programs. We demonstrate how integer linear programming can be used for decoding in two applications: sentence compression and dependency parsing. Our approach can yield state-of-the-art or better performance by introducing linguistically motivated constraints that allow us to model the global properties observed in language.
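A toy version of decoding as a 0-1 integer program for sentence compression might look like the sketch below. Each word gets a binary variable (keep or drop), the objective is a linear sum of word scores, and the constraints are linear: a length limit, a "modifier requires its head" constraint, and a set of words that must be kept. The scores, the dependency heads, and the constraints are invented for this example and are not the model presented in the talk. The program is solved here by exhaustive enumeration, which is exact but exponential; a practical system would pass the same objective and constraints to an ILP solver.

```python
from itertools import product


def compress(words, scores, heads, required, max_len):
    """Maximize sum(score_i * x_i) over x in {0,1}^n subject to linear constraints.

    heads maps a word index to the index of its head: x_i <= x_head.
    required lists indices that must be kept: x_i = 1.
    max_len bounds the compression length: sum(x) <= max_len.
    """
    best_value, best_x = float("-inf"), None
    for x in product((0, 1), repeat=len(words)):
        if sum(x) > max_len:
            continue
        if any(x[i] > x[h] for i, h in heads.items()):
            continue  # a kept modifier whose head was dropped
        if any(x[i] == 0 for i in required):
            continue
        value = sum(s * xi for s, xi in zip(scores, x))
        if value > best_value:
            best_value, best_x = value, x
    if best_x is None:
        raise ValueError("no assignment satisfies the constraints")
    return [w for w, xi in zip(words, best_x) if xi]


# Hypothetical sentence, word scores, and dependency heads.
WORDS = ["the", "old", "dog", "barked", "loudly", "yesterday"]
SCORES = [0.2, 3.0, 2.0, 2.5, 0.8, 0.9]
HEADS = {0: 2, 1: 2, 2: 3, 4: 3, 5: 3}  # e.g. "old" modifies "dog"
REQUIRED = [3]                          # always keep the main verb

print(compress(WORDS, SCORES, HEADS, REQUIRED, max_len=2))
print(compress(WORDS, SCORES, HEADS, REQUIRED, max_len=3))
```

With a limit of two words, a purely local decoder would pick the two highest-scoring words, "old" and "barked", and produce an ungrammatical fragment. The head constraint rules that out globally, which is the kind of linguistically motivated constraint the abstract refers to.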
In this talk, I will introduce several research issues associated with word sense disambiguation (WSD), which is the task of determining the correct meaning, or sense, of a word in context. I will present recent work completed in my research group to address these issues. The first issue concerns the scaling up of WSD. Although supervised WSD gives good accuracy, the lack of sense-tagged training data has hampered the progress of WSD. I will present our approach of using parallel texts to scale up WSD. Using this approach, our WSD system participated in SemEval-2007, where it achieved the highest and second highest accuracy in the coarse-grained and fine-grained English all-words tasks, among 16 and 14 participating systems respectively. The second issue concerns the accuracy drop of a WSD system when it is applied to texts drawn from a different domain with different sense priors. I will present results showing improved WSD accuracy after applying class prior estimation algorithms and using well calibrated probabilities. The third issue concerns the perceived lack of applications utilizing WSD. We integrated our state-of-the-art WSD system into Hiero, a state-of-the-art hierarchical phrase-based statistical machine translation system. We found that the use of WSD improves translation quality, and the improvement is statistically significant.
The Computational Linguistics community has shown a renewed interest in deeper semantic analysis, including the automatic recognition of semantic relations between nouns. This talk will provide a short introduction to major concepts and recent developments in the area of semantic relations. The material is relevant both to computer science and linguistics. Students will be introduced to various representation schemes required for semantic analysis of noun pairs in context and will become familiar with the basic techniques and tools needed to develop semantic parsers.
| 6/7, 1-4pm | Automated Text Summarization as a Variant of Info Extraction (tutorial) (Siebel 2405) |
| 6/8, 9am | The 3 Futures of NLP (research talk) (Siebel 2405) |
| 6/8, 10:30am | New Developments in Information Extraction (research talk) (Siebel 2405) |
| 6/14, 10a-12p | Research Talk (Siebel 2405) |
| 6/14, 1-2p | Tutorial (Siebel 2405) |
| 6/25, 4:15-5:45p | Training a Computer to See People (Siebel 2405) |
| 7/5, 1:00-2:30p | Global Learning with Constraints (Siebel 3405) |
| 7/6, 1:00-2:30p | Leveraging Network Structure to Infer Missing Values in Relational Data (Siebel 3405) |
| 7/9, 1:00-4:00p | Lexical semantics and large-scale similarity modeling (Siebel 3405) |
| 7/10, 1:00-2:30p | The Cimple Project on Community Information Management (Siebel 3405) |