
EuroLAN'99

Themes

The theme reflects the growing international interest in lexical semantics, in line with the modern information society's drive to build the technological basis that would give smaller languages equal access to information. The school addresses problems related to the deep understanding of texts, WordNet, discourse interpretation, and intelligent web browsers based on language interpretation. It is also our desire to stimulate West/East and East/East collaboration in this domain.

It has become a tradition that, within the main topic of the Eurolan schools, related directions are also covered, and that these host workshops aimed at stimulating discussion and further collaboration. Three thematic tracks are planned for Eurolan '99:

  1. Lexical Semantics
  2. Discourse and Summarization
  3. Web Applications of WordNet and Discourse

I. Lexical Semantics

- the structure of WordNet, its building principles, how much of the experience gained for English can be carried over to other languages, and how WordNet can be used for word sense disambiguation, anaphora resolution, text summarization, and machine translation.

1. Semantics in Lexicons and Corpora: a few European initiatives

by Nicoletta Calzolari - Institute of Computational Linguistics, Pisa
 
 

Semantic encoding in large computational lexicons and semantic annotation of corpora, in relation to a few European projects and initiatives:

- EAGLES lexicon/semantics group results

- SIMPLE semantic lexicons, with semantic subcategorisation for 12 European languages

- EuroWordNet lexicons, with semantic relations of different types

- SPARKLE, for the (semi-)automatic acquisition of semantic information from large corpora

- SENSEVAL/ROMANSEVAL initiative, for evaluation of word-sense disambiguation systems and related corpus semantic tagging.
 
 

The lectures will discuss how these initiatives have begun to address the requirements that language engineering applications place on semantics and content, and will attempt a discussion of which further steps in lexical and corpus semantics are needed and feasible.
 
 

2. Motivation, History, and Design of WordNet

by Christiane Fellbaum - Princeton University

 
(1) WordNet principles

This talk traces the development of WordNet from a psycholinguistic model of human semantic memory to a lexical database that has found applications in a wide variety of domains. The lexicographic treatment of the different parts of speech will be discussed in some detail.

(2) Applications of WordNet

This talk will examine various NLP applications of the lexical database. We will cover such topics as word sense disambiguation, lexical cohesion, and the creation of semantic concordances, based on the work of different researchers as reported in Fellbaum (1998).
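Most of these applications start from simple programmatic access to synsets and the hypernym hierarchy. As a hedged illustration only, the sketch below uses NLTK's WordNet interface, a later tool that is not part of this course but exposes the same database:

```python
# A minimal sketch of WordNet access, via NLTK's interface (an assumption:
# NLTK postdates this course but serves the same lexical database).
# Setup: pip install nltk; then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

# Each sense of a word is a synset: a set of synonymous lemmas plus a gloss.
for synset in wn.synsets('bank'):
    print(synset.name(), '-', synset.definition())

# The hypernym (IS-A) chain is what applications such as word sense
# disambiguation and lexical-cohesion tracking actually traverse.
money_bank = wn.synset('depository_financial_institution.n.01')
print([s.name() for s in money_bank.hypernym_paths()[0]])
```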
 
 

3. Lexical Acquisition for a Romanian WordNet

by Sanda Harabagiu - Southern Methodist University
 

This talk will focus on acquisition techniques that can be applied to build a Romanian version of the WordNet lexical database. A presentation of lexico-semantic relations will be provided, with examples from the Romanian lexicon.

Methodologies for the fast construction of multilingual WordNets, as highlighted by the EuroWordNet project, will be summarized, and comparisons with Cyc and FrameNet will be discussed. The talk will include a demo of a core Romanian WordNet.

 

4. Cross-lingual sense determination

by Nancy Ide - Vassar College
 
 

One of the most pressing problems for natural language processing is the determination of appropriate sense distinctions for use in lexicons, especially multilingual lexicons. Recently, it has been suggested that sense distinctions that are lexicalized across several languages could be used as a means to identify relevant word senses. This session will consider this possibility and outline work that attempts to determine to what degree cross-lingual sense determination is possible, which languages and which types of languages could be used, and so on.
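To make the core intuition concrete, here is a toy sketch (all words and alignments invented for illustration): distinct translations of a word across word-aligned parallel text are evidence for distinct senses.

```python
# A toy sketch of cross-lingual sense determination: occurrences of an
# English word are grouped by their French translation in word-aligned
# text. All data here is invented for illustration.
from collections import defaultdict

aligned = [
    ("the bank raised its interest rates", "banque"),
    ("we walked along the river bank", "rive"),
    ("she deposited the check at the bank", "banque"),
]

senses = defaultdict(list)
for context, translation in aligned:
    senses[translation].append(context)

# Two translation groups -> evidence for (at least) two distinct senses.
for translation, contexts in senses.items():
    print(translation, "->", contexts)
```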
 
 
 
5. Probabilistic methods in the context of a taxonomy and methods for obtaining and using parallel text

by Philip Resnik - University of Maryland
 

(1) The use of probabilistic methods in the context of a taxonomy

(2) Methods for obtaining and using parallel text.
 
 
 

6. Printed dictionaries: from lexical databases to lexical ontologies

by Dan Tufis, RACAI - Romanian Academy, Bucharest
 

The talk will present the main issues in the development of a common multilingual methodology, covering five CEE languages, for encoding monolingual explanatory dictionaries. It will then investigate several problems raised by ontological information extraction and by the development of WordNet-like ontologies, such as data sampling, consistency, adequacy, and degree of automation.
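One small ingredient of such ontological extraction can be pictured as taking the head of a definition as a hypernym candidate. The sketch below is only a toy heuristic with an invented stop list, not the project's actual method:

```python
# A toy sketch of genus-term extraction: the head noun of a dictionary
# definition is a candidate hypernym for a WordNet-like ontology.
# The stop list and data are invented; real systems use a parser.
import re

STOP = {"a", "an", "the", "any", "some", "large", "small", "wild"}

def genus_candidate(definition: str) -> str:
    for word in re.findall(r"[a-z]+", definition.lower()):
        if word not in STOP:
            return word        # first non-stop word approximates the genus
    return ""

print(genus_candidate("a large wild animal of the cat family"))  # -> animal
```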
 

7. Developing Multi-purpose LKBs for NLP

by Evelyne Viegas - New Mexico State University
 

One prerequisite for the deep understanding of texts in NLP is a semantics-based lexicon. Developing large-scale computational semantic lexicons is expensive, which is why lexicon builders should aim to make multi-purpose (multilingual, maintainable, reusable) lexicons. The lectures will focus on the building and use of multi-purpose lexicons in NLP. More specifically, we will:

(1) show how to build an ontology-based multilingual lexicon;

(2) describe methodologies to cost-effectively expand a core lexicon (about 7,000 entries) into a large-scale one using morpho-semantic and lexical rules (a toy sketch follows this list);

(3) study how lexicon entries contribute to the deep understanding of texts; and

(4) invite students to build entries for a short text in their language and produce a Text Meaning Representation for it using the semantic analyzer built at the Computing Research Laboratory.
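As a rough illustration of item (2), the sketch below applies one invented morpho-semantic rule to a two-entry core lexicon; the entry format and rule are hypothetical, not the Computing Research Laboratory's actual formalism.

```python
# A toy sketch of lexicon expansion by a morpho-semantic rule:
# derive agentive nouns ('teach' -> 'teacher') from verb entries.
# Entry structure and semantic notation are invented for illustration.
core_lexicon = {
    "teach": {"pos": "V", "sem": "instruct(agent, patient)"},
    "paint": {"pos": "V", "sem": "apply-paint(agent, theme)"},
}

def agentive_rule(verb: str, entry: dict) -> tuple[str, dict]:
    noun = verb + ("r" if verb.endswith("e") else "er")
    return noun, {"pos": "N", "sem": f"agent-of({entry['sem']})"}

expanded = dict(core_lexicon)
for verb, entry in core_lexicon.items():
    noun, noun_entry = agentive_rule(verb, entry)
    expanded[noun] = noun_entry

print(expanded["teacher"])  # {'pos': 'N', 'sem': 'agent-of(instruct(agent, patient))'}
```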
 
 
 

8. Wordnets and ontologies in NLP

by Piek Vossen - University of Amsterdam
 

(1) Overview of types of ontologies and their differences
A survey of different paradigms and traditions of building and using ontologies, covering a range of ontologies and their major properties: LDOCE, EDR, Cyc, MikroKosmos, domain ontologies, thesauri, experimental lexicons for NLP (Acquilex, CoreLex), and wordnets (WordNet1.5 and EuroWordNet).
 
(2) The structure of WordNet1.5 and EuroWordNet
This part focuses on the structure of EuroWordNet and on methodologies for building the resources: the multilingual design, the use and adaptation of an interlingual index, the language-internal relations, and the method of developing the wordnets, which starts from a shared set of Base Concepts and a top-ontology, represents the concepts in each language, and extends the wordnets top-down (a toy illustration of the interlingual index follows item (3)).
 
(3) The use of wordnets in NLP
A brief overview of the core techniques in applications such as information retrieval, information extraction, machine translation, and summarization, as well as in component tasks such as parsing, term recognition, word sense disambiguation, and co-reference resolution.
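The interlingual index mentioned in (2) can be pictured with a toy data structure: synsets of each language point into shared ILI records instead of at each other. All identifiers below are invented.

```python
# A toy sketch of EuroWordNet's multilingual design: language-internal
# synsets are linked through interlingual-index (ILI) records rather than
# pairwise across languages. All identifiers are invented.
ili = {"ILI-00001": "a financial institution"}

dutch_wordnet = {"bank#n#2": {"eq_synonym": "ILI-00001"}}
spanish_wordnet = {"banco#n#1": {"eq_synonym": "ILI-00001"}}

def equivalents(synset_id, source, target):
    """Target-language synsets sharing an ILI record with synset_id."""
    record = source[synset_id]["eq_synonym"]
    return [sid for sid, e in target.items() if e["eq_synonym"] == record]

print(equivalents("bank#n#2", dutch_wordnet, spanish_wordnet))  # ['banco#n#1']
```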
 

II. Discourse and Summarization

- recent advances in discourse theory and text summarization: discourse structure and its relation to anaphora, how discourse structure can influence summarization, and whether discourse interpretation and summarization techniques are language-dependent.
 

1. Introduction to Computational Semantics

by Johan Bos - University of the Saarland, Saarbrücken

 
The course introduces a number of fundamental techniques for computational semantics. Both the underlying theory and its implementation in Prolog are discussed. More precisely, we want to tackle the following two questions:

(1) How can we automate the process of associating semantic representations with expressions of natural language?

(2) How can we use logical representations of natural language expressions to automate the process of drawing inferences?

One of the lectures will focus on using WordNet in discourse analysis. Prerequisites for this course: some basic familiarity with first-order logic and/or knowledge of Prolog would be advantageous, but is not a must. More information can be found at: http://www.coli.uni-sb.de/~bos/comsem/
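To give a flavor of question (1), here is a toy composition of semantic representations, written in Python rather than the course's Prolog; the rendering of formulas as strings is purely illustrative.

```python
# A toy sketch of compositional semantics: meanings are functions, and the
# meaning of a sentence is obtained by function application, in the spirit
# of lambda-calculus composition. Formulas are rendered as plain strings.

def every(noun):
    # ||every|| = lambda P. lambda Q. forall x. P(x) -> Q(x)
    return lambda vp: f"all x.({noun('x')} -> {vp('x')})"

def man(x):
    return f"man({x})"

def walks(x):
    return f"walk({x})"

# "every man walks" = ((every man) walks)
print(every(man)(walks))   # all x.(man(x) -> walk(x))
```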
 
 

2. Linguistic Foundations for Summary Generation from Corpora

by Rodolfo Delmonte - University of Venice
 

Text summarization can be regarded as the most interesting and promising natural language understanding task computational linguists currently face. The text to be summarized is first reduced by sentence extraction, based on topic identification through traditional frequency ranking of nouns, enhanced with morphologically derived linguistic items. Extraction typically keeps about one fifth (20%) of the original text, a first reduction onto which intensive NLP techniques can then be applied. A second important step, needed to produce a linguistically grounded synthesis, goes through the usual text analysis processes, which aim to recover the predicate-argument structure of each propositional nucleus in the extracted sentences and to evaluate the possible adjuncts of each predicate.
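The first reduction step can be pictured with a small sketch: score sentences by the frequency of their longer (roughly, content) words and keep the top 20%. This is a generic frequency-ranking illustration, not GETARUNS itself.

```python
# A minimal sketch of sentence extraction by frequency ranking: keep the
# top ~20% of sentences, scored by the document frequency of their words.
# Tokenization and the content-word filter are deliberately crude.
import re
from collections import Counter

def extract(text: str, ratio: float = 0.2) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3)

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    keep = max(1, round(len(sentences) * ratio))
    top = set(sorted(sentences, key=score, reverse=True)[:keep])
    return [s for s in sentences if s in top]    # keep document order
```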

We shall discuss all the processes involved in producing a synthesis from unrestricted text, by showing and explaining an example elaborated by GETARUNS, the Italian text summarization system under development at the University of Venice.

In a second lecture we shall present the contents of our computational lexicon of Italian, called LIFUV, in its fully explicit format, which is made up of 5,000 verb entries, 2,000 adjective entries, and 6,000 noun entries. We shall also present MIDUV, the main lexicon, which contains a compact classification of nouns, verbs, and adjectives for over 120,000 entries.

We shall then concentrate on the Upper Level Module of GETA_RUN, the ancestor of GETARUNS and a complete text understanding system, which can be run on the web at http://byron.cgm.unive.it/Risorse/GETARUN.html.

We shall discuss in three separate lectures the linguistic representations produced at discourse level for a number of English and Italian texts, by commenting on the following topics:

A. Anaphora Resolution and Inferential Processes to produce a Discourse Model

B. From Discourse Model to Summary Generation with Semantic Relations and Discourse Structure

C. From Discourse Model to Knowledge Database for Queries under BACK-V
 

3. Topics in Discourse Interpretation

by Jerry Hobbs - SRI International
 

(1) Interpretation as Abduction

Abduction is inference to the best explanation. An approach to abductive inference called "weighted abduction" has been developed that results in a significant simplification of how the problem of interpreting texts is conceptualized. The interpretation of a text is the minimal explanation of why the text would be true. More precisely, to interpret a text, one must prove the logical form of the text from what is already mutually known, merging redundancies where possible and making assumptions where necessary. It is shown how such "local pragmatics" problems as reference resolution, the interpretation of compound nominals, the resolution of syntactic ambiguity and metonymy, and schema recognition can be solved in this manner. Moreover, this approach of "interpretation as abduction" can be combined with the older view of "parsing as deduction" to produce an elegant and thorough integration of syntax, semantics, and pragmatics, one that spans the range of linguistic phenomena from phonology to discourse structure. Finally, I will discuss means for making the abduction process efficient, and the semantics of the weights and costs in the abduction scheme.
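A propositional caricature of the scheme (ignoring variables, unification, and the merging of redundancies, and with invented rules and costs) can make the cost comparison concrete:

```python
# A toy sketch of weighted abduction: establish a goal either from known
# facts (free), by assumption (at a stated cost), or by back-chaining on a
# rule; the cheapest combination wins. Factoring of redundancies and
# first-order unification are omitted; all rules and costs are invented.
def cheapest_proof(goal, facts, rules, assume_cost, depth=8):
    if goal in facts:
        return 0.0, frozenset()
    best = (assume_cost.get(goal, float("inf")), frozenset({goal}))
    if depth > 0:
        for consequent, antecedents in rules:
            if consequent == goal:
                cost, assumed = 0.0, frozenset()
                for a in antecedents:
                    c, s = cheapest_proof(a, facts, rules, assume_cost, depth - 1)
                    cost, assumed = cost + c, assumed | s
                if cost < best[0]:
                    best = (cost, assumed)
    return best

rules = [("sound(meow)", ["cat(x)"])]            # if something is a cat, it meows
costs = {"cat(x)": 1.0, "sound(meow)": 1.5}
# Explaining the observation: assuming a cat (cost 1.0) beats assuming the
# observation outright (cost 1.5), so the cat is abduced.
print(cheapest_proof("sound(meow)", set(), rules, costs))
```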

(2) Encoding Commonsense Knowledge for Discourse Interpretation

We understand discourse so well because we know so much. Therefore, a system for discourse interpretation must have a large knowledge base of commonsense knowledge. But this knowledge base must be built up in a principled fashion. Certain abstract concepts lie at the foundation of the knowledge required in discourse interpretation, in the sense that most prepositions, the most common verbs, and many other words crucially involve these concepts in their meaning. Among these concepts are granularity, systems of entities, scales, the figure-ground relation, change of state, causality, and goal-directed behavior. In this talk I will sketch some of the key features needed in a commonsense theory of each of these concepts. I will then outline a methodology for using occurrences of words in discourse to select the facts that need to be encoded in a knowledge base for discourse interpretation.

(3) The Structure of Discourse

Adjacent segments of discourse, by their very adjacency, convey the information that the situations they describe are related. In this talk, I describe the most common of these relations, including relations based on figure-ground, change of state, causality, and the affirmation or denial of similarity. For each of these relations I give a formal characterization of the inferences that must be drawn in order to recognize them. This will be done in the "Interpretation as Abduction" framework. When two segments of discourse are related by a coherence relation, they compose into a larger segment of discourse. In this way, a tree-like structure for an entire discourse can be built up. I will discuss several features of discourse structure in these terms. Then a method for analyzing discourse will be presented, and several diverse texts will be examined by this method.

(4) Information and Intention in Discourse

In discourse interpretation there are two questions that need to be answered: What situation is being described in the text? And why is the speaker describing that situation? The first of these questions leads to the informational perspective on discourse, and the second leads to the intentional perspective. I will discuss the advantages and shortcomings of each of these perspectives, and then propose an integrated view. In this framework I will discuss several phenomena involving the interrelation of the perspectives, including the case of pragmatic conditionals, a case of answering a question by responding to a higher goal, a case where the informational account is a central part of the intentional account, and a case where the intentional account overrides the informational account.
 
 

4. Automated Text Summarization

by Daniel Marcu - Information Sciences Institute and Department of Computer Science, University of Southern California
 

After lying dormant for over two decades, automated text summarization has experienced a tremendous resurgence of interest in the past few years. Research is being conducted in China, Japan, Europe, and North America, and industry has brought to market more than 30 summarization systems; most recently, two specialized workshops were devoted to the topic (the ACL/EACL-97 Workshop on Intelligent Scalable Text Summarization and the AAAI-98 Spring Symposium on Intelligent Text Summarization).

(1) Overview and Topic Identification

We will first outline the major types of summary: indicative vs. informative; abstract vs. extract; generic vs. query-oriented; background vs. just-the-news; single-document vs. multi-document; and so on. We will describe the typical decomposition of summarization into three stages, and explain in detail the major approaches to each stage. For topic identification, we will outline techniques based on stereotypical text structure, cue words, high-frequency indicator phrases, intratext connectivity, and discourse structure centrality. For topic fusion, we will outline some ideas that have been proposed, including concept generalization and semantic association. For summary generation, we will describe the problems of sentence planning to achieve information compaction.
 

(2) Topic Interpretation, Generation, Future

We will highlight the strengths and weaknesses of statistical and symbolic/linguistic techniques in implementing efficient summarization systems. We will discuss ways in which summarization systems can interact with and/or complement natural language generation, discourse parsing, information extraction, and information retrieval systems. Finally, we will present a set of open problems that we perceive as being crucial for immediate progress in automatic summarization.
 
 

5. Discourse structure and reference

by Nancy Ide, Vassar College and Dan Cristea, University "Alexandru I. Cuza", Iasi
 

The relation between discourse structure and referring expressions is not well understood at present. Recent work explores this relation, and this session will survey part of it. Special emphasis will be given to the approaches presented at the ACL'99 Workshop on the Relationship Between Discourse/Dialogue Structure and Reference, held in conjunction with the ACL Conference, June 21-22, 1999, at the University of Maryland.
 
 
6. Discourse: Theories, Parsing, Generation.

by Daniel Marcu - Information Sciences Institute and Department of Computer Science, University of Southern California
 
 

Researchers of natural language have repeatedly acknowledged that coherent texts are not just simple sequences of words, but rather complex artifacts whose semantic units are connected by rhetorical, logical, argumentative, and other cohesive relations. In this course, we review some of the major current theories of discourse and discuss their impact on the parsing and generation of multisentence texts.
 
 
(1) Background and theoretical issues in computational discourse

We illustrate two major perspectives on discourse structure -- functional and structural -- by describing how several influential theories deal with the pertinent questions. These theories include the story grammars of Van Dijk and Kintsch; the schema-based model of McKeown; the inference-based model of Hobbs; the logic-based models of Kamp, Asher, and Polanyi; the tripartite model of Grosz and Sidner; the rhetorical relation model (RST) of Mann and Thompson; and subsequent developments. We discuss some ongoing problems with all these models, focusing on the relations that ensure coherence and on the challenge of building large corpora of discourse trees by hand.

(2) Discourse structure determination based on text planning and rhetorical parsing of unrestricted texts

Text planning has been studied for about 15 years. We describe the development and operation of RST-based text structurers and subsequent intention-based text planners. In contrast with text planning research, the automated rhetorical parsing of multisentence texts is a much younger enterprise. We present a range of lexico-grammatical phenomena that can be used to identify discourse segments and discourse relations, and present parsing models and mechanisms used to derive the discourse structure of unrestricted texts.
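One of the shallowest of those lexico-grammatical devices, cue phrases, can be caricatured in a few lines; the marker-to-relation table below is invented and far cruder than an actual rhetorical parser.

```python
# A toy sketch of cue-phrase-based discourse segmentation: split at
# discourse markers and hypothesize one relation per marker. The table
# below is invented; real parsers combine many knowledge sources.
import re

CUES = {"because": "EVIDENCE", "although": "CONCESSION", "but": "CONTRAST"}

def segment(text: str):
    parts = re.split(r"\b(" + "|".join(CUES) + r")\b", text)
    segments, relations = [parts[0].strip()], []
    for cue, clause in zip(parts[1::2], parts[2::2]):
        relations.append(CUES[cue])
        segments.append(clause.strip())
    return segments, relations

segs, rels = segment("The park was closed because it snowed, but we stayed.")
print(segs)   # ['The park was closed', 'it snowed,', 'we stayed.']
print(rels)   # ['EVIDENCE', 'CONTRAST']
```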
       
       

7. Knowledge Processing on an Extended WordNet

by Dan Moldovan - Southern Methodist University
 
 

This talk presents a way in which a large knowledge base may be implemented using WordNet and extensions of it. The goal is to create an environment that supports text inference, which in turn is the key to exploiting the expressive power of natural language. WordNet glosses, viewed as dictionary definitions, provide rich information that makes many text inferences possible. We will show how to construct inference rules using the relations from WordNet and how to extract plausible inferences from text. The method can be applied to identifying intentions, establishing coherence, and reaching a deep understanding of text.
 
 
 

III. Applications of WordNet and Discourse

- the state of the art in information extraction and retrieval and in intelligent Internet browsing, with help from natural language processing, and with particular emphasis on applications of WordNet and discourse.
 

1. Concepts in Multilingual Information Retrieval

by Paul Buitelaar - DFKI Saarbrücken
 

Making information retrieval systems more intelligent requires incorporating more knowledge into them, in order to make them more responsive to their environment (user, domain, application). This includes handling data and document sets that are multilingual, which requires translation of queries and/or results between languages. Part of this, particularly in handling semi-structured or unstructured text data, involves knowledge about the meaning of individual words, including their relation to other words, their wider linguistic context (phrase, sentence, paragraph, discourse, document), and their domain-specific context.
 

The course will address the interface between lexical semantics, multilinguality and information retrieval, showing that much of the knowledge needed for the tasks described above can be acquired by organizing and analyzing the underlying data and document sets. Within the framework of the MIETTA system I will discuss examples of this in classification, query expansion and query translation.
 
 
 

2. Possible Applications of a Romanian WordNet

by Sanda Harabagiu - Southern Methodist University

 

In this talk I will elaborate on several applications of a Romanian WordNet. The areas will include the fields of Machine Translation, Information Retrieval and Information Extraction. We shall also address the problem of building conceptual indexes for a Question/Answering agent operating in Romanian.
 
 

3. Word Sense Disambiguation and an Application to Information Retrieval from the Internet

by Dan Moldovan - Southern Methodist University
 
 

The problem of mapping words onto word meanings, known as word sense disambiguation (WSD), is an important problem in natural language understanding, as its solution impacts other tasks such as discourse processing, reference resolution, and coherence inference.
 

A method is presented for the simultaneous disambiguation of multiple words. The method combines two sources of information: (1) statistics gathered from the Internet about the occurrences of groups of words, and (2) the conceptual density between pairs of words, measured on a machine-readable dictionary such as WordNet. The Internet is used as a source of large corpora, while WordNet is used to quantify the semantic distance between words. The method provides a metric that ranks the senses of words.
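As a stand-in for the conceptual-density metric (whose details the talk presents), a hedged sketch with NLTK's later WordNet interface shows the shape of the computation: pick the sense pair that minimizes semantic distance.

```python
# A sketch of choosing joint senses by semantic distance on WordNet.
# NLTK's path_similarity is used here as a stand-in for the talk's
# conceptual-density measure (an assumption, not the actual metric).
from itertools import product
from nltk.corpus import wordnet as wn

def best_joint_senses(word1: str, word2: str):
    pairs = product(wn.synsets(word1, pos=wn.NOUN),
                    wn.synsets(word2, pos=wn.NOUN))
    scored = [(s1.path_similarity(s2) or 0.0, s1, s2) for s1, s2 in pairs]
    return max(scored, key=lambda t: t[0])

score, s1, s2 = best_joint_senses("bank", "money")
print(s1.name(), s2.name(), round(score, 3))
```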
 

In the second part of this talk we apply the disambiguation method to improve information retrieval from the Internet. By using similarity lists for the content words of a query, the search is extended to a much larger number of documents, which increases recall. Then, with the help of some newly defined search operators, most of the irrelevant information is filtered out, which increases precision.
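The similarity lists can be pictured as WordNet synonym expansions; a hedged sketch (again via NLTK, and without the talk's search operators) follows.

```python
# A sketch of query expansion with WordNet similarity lists: each content
# word of the query is extended with lemmas of its synsets, raising recall
# at the expense of precision. The filtering operators are not shown.
from nltk.corpus import wordnet as wn

def similarity_list(word: str, limit: int = 5) -> list[str]:
    names = []
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name != word and name not in names:
                names.append(name)
    return names[:limit]

query = ["car", "insurance"]
expanded = {w: [w] + similarity_list(w) for w in query}
print(expanded["car"])   # e.g. ['car', 'auto', 'automobile', ...]
```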
 
 

4. Lexical Information and Decisions in Parsing

by Gábor Prószéky - MorphoLogic Budapest

Lexical items usually contain descriptions using numerous morpho-syntactic features. These features are the results of abstraction: they have been extracted from previous occurrences of the word in various contexts. Problems may occur when, for example:

- we meet words whose actual roles seem to contradict their lexical descriptions,
- we would like to understand metaphors,
- we would like to process new lexical entries.

There are two rather general ways to solve these problems in HLT: overriding lexical information, or using underspecified lexical descriptions. In order to choose a good solution, we show:

(1) the possible relations between lexical information and actual syntactic position,
(2) a method for classifying lexical entries without using traditional categories,
(3) a method for morpho-syntactic parsing based on open and underspecified lexicons,
(4) the role of corpus processing in feature selection.

Terms like 'finite but open lexicons', 'run-time change of dictionaries', and 'on-line learning from corpora' might lead us to define a rather lexicalized grammar formalism. The most interesting features of this idea, called 'finite syntax', are also discussed.
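The second strategy, underspecified lexical descriptions, can be caricatured as feature unification in which unset features are filled in by context and only genuine clashes fail; the feature names below are invented.

```python
# A toy sketch of underspecified lexical descriptions as unification:
# None marks an unspecified feature that context may fill in; a concrete
# clash rejects the analysis. Feature names are invented for illustration.
def unify(lexical: dict, contextual: dict):
    result = dict(lexical)
    for feature, value in contextual.items():
        if feature not in result or result[feature] in (None, value):
            result[feature] = value
        else:
            return None          # genuine contradiction with the lexicon
    return result

entry = {"cat": "noun", "number": None}        # number left underspecified
print(unify(entry, {"number": "plural"}))      # {'cat': 'noun', 'number': 'plural'}
print(unify(entry, {"cat": "verb"}))           # None: the context clashes
```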
 

5. Designing an editorial platform for a multilingual terminology: the experience of the DHYDRO project
 
by Laurent Romary - Loria/CNRS Nancy

The talk will provide an in-depth presentation of both the concepts and the implementation choices under development within the European MLIS/DHYDRO project. The aim is to define an environment allowing the editing and viewing of multilingual terminologies on the basis of the currently available standards (SGML, TEI, MARTIF). The following topics will be addressed in particular:

- the editorial aspects, with a view on the opposition between dictionary and terminological database;
- the encoding of a terminological database using XML and the TEI (a toy sketch follows this list);
- some words about the possible use of parallel texts;
- methodological and technical aspects of developing such an environment.
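The second bullet can be pictured with a toy entry; the element names below are simplified illustrations in the general spirit of TEI/MARTIF terminological markup, not the normative tagset.

```python
# A toy sketch of one multilingual term entry encoded in XML, loosely in
# the spirit of TEI/MARTIF terminology markup. Element and attribute names
# are simplified illustrations, not the standards' actual tagsets.
import xml.etree.ElementTree as ET

entry = ET.Element("termEntry", id="DHYDRO-0001")
for lang, term in [("en", "tide gauge"), ("fr", "marégraphe")]:
    lang_set = ET.SubElement(entry, "langSet", lang=lang)
    ET.SubElement(lang_set, "term").text = term

print(ET.tostring(entry, encoding="unicode"))
# <termEntry id="DHYDRO-0001"><langSet lang="en"><term>tide gauge</term>...
```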

6. Querying the Web

by François Rousselot - LIIA-ENSAIS, Strasbourg
 

(1) Querying web documents using Description Logics

Description Logics (DL) are a good tool for maintaining a document base. On the one hand, they provide a flexible way to represent the content of texts in a frame-like formalism that allows mixing information of different types (indices, time, cause, etc.); on the other hand, they provide a sound, logic-based language, useful for expressing queries and, finally, for helping select documents. If natural language queries are to be translated into the DL language, some interesting problems arise from the restrictions of DL. It is necessary to carefully establish the links between the natural language used to express semantic relations and the roles (the symbols used to express relations in DL). As a result, it is possible either to design a robust interface for unrestricted language or to give rules constraining the language available to the user.

(2) Refining answers of a search engine with linguistic engineering tools

End-users of common search engines are often overwhelmed with irrelevant texts, so it is useful to have a way to refine the results in order to reduce both noise and silence. The idea is first to use a linguistic tool to compile a set of terms characterizing the domain, and then to use link typologies to reach further documents and examine them: by comparison with the term set, new documents are selected or rejected as relevant, and the process goes on. The first step needs interaction with the users, since the results of the program that furnishes term candidates must be edited; afterwards, the program can run automatically and can watch the web.
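The comparison step can be sketched as a simple overlap score between a document and the validated term set; the threshold and data below are invented.

```python
# A toy sketch of relevance filtering against a validated set of domain
# terms: keep a newly reached document only if enough of its tokens hit
# the term set. Threshold and data are invented for illustration.
def relevant(document: str, domain_terms: set[str], threshold: float = 0.1):
    tokens = document.lower().split()
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in domain_terms)
    return hits / len(tokens) >= threshold

terms = {"tide", "gauge", "bathymetry", "sounding"}
print(relevant("the tide gauge recorded unusually high water", terms))  # True
print(relevant("celebrity gossip and football scores", terms))          # False
```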
 

7. Knowledge Based Support for Technical Translators

by Walther von Hahn - University of Hamburg
 

The course will explain the rationale and the techniques applied in the DBR-MAT system. This (Prolog-based) system supports translators by allowing multilingual, domain-specific queries, which are answered from an abstract knowledge base instead of from monolingual canned-text explanations or isolated term definitions. Additionally, syntactic analyses and term checking are available to the translator. The interaction of the lexicon, the knowledge base, the graphical objects, and the generator is explained in more detail. The project was carried out by the University of Hamburg, the Academy of Sciences in Sofia, and the University of Bucharest.
 
