The theme reflects the growing international interest in lexical semantics, in line with the modern information society's drive to foster the technological basis that would give minority languages equal access to information. It aims to address problems related to the deep understanding of texts, WordNet, discourse interpretation, and intelligent web browsers based on language interpretation. It is also our desire to stimulate West/East and East/East collaboration in this domain.
It has become a tradition that, within the main topic of the Eurolan Schools, related directions are explored that can also host workshops aimed at stimulating discussion and further collaboration. Three thematic tracks are planned for Eurolan '99:
by Nicoletta Calzolari - Institute of Computational Linguistics, Pisa
- EAGLES lexicon/semantics group results
- SIMPLE semantic lexicons, with semantic subcategorisation for 12 European languages
- EuroWordNet lexicons, with semantic relations of different types
- SPARKLE (semi-)automatic acquisition of semantic information from large corpora
- SENSEVAL/ROMANSEVAL initiative, for evaluation of word-sense disambiguation systems and related corpus semantic tagging.
The lectures will examine how these initiatives are beginning to address the requirements of language engineering applications with respect to semantics and content, and will attempt a discussion of which further steps in lexical and corpus semantics are needed and feasible.
by Christiane Fellbaum - Princeton University
This talk will trace the development of WordNet from a psycholinguistic model of human semantic memory to a lexical database that has found many applications in a variety of domains. The lexicographic treatment of the different parts of speech will be discussed in some detail.
(2) Applications of WordNet
This talk will examine various NLP applications of the lexical database. We will cover such topics as word sense disambiguation, lexical cohesion, and the creation of semantic concordances, based on the work of different researchers as reported in Fellbaum (1998).
by Sanda Harabagiu - Southern Methodist University
Methodologies for the fast construction of multilingual WordNets, as highlighted by the EuroWordNet project, will be summarized. Comparisons with CYC and FrameNet will be discussed. The talk will include a demo of a core Romanian WordNet.
4. Cross-lingual sense determination
by Nancy Ide - Vassar College
by Philip Resnik - University of Maryland
(2) Methods for obtaining and using parallel text.
by Dan Tufis, RACAI - Romanian Academy, Bucharest
7. Developing Multi-purpose LKBs for NLP
by Evelyne Viegas - New Mexico State University
The lectures will:
(1) show how to build an ontology-based multilingual lexicon;
(2) describe methodologies to cost-effectively develop a core lexicon (about 7,000 entries) into a large-scale one using morpho-semantic and lexical rules (a small sketch of such a rule follows this list);
(3) study how lexicon entries contribute to the deep understanding of texts; and
(4) invite students to build entries for a short text in their language and produce a Text Meaning Representation for it using the semantic analyzer built at the Computing Research Laboratory.
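As a loose illustration of point (2), the sketch below applies one invented morpho-semantic rule to a toy core lexicon; the entry format, the rule, and all names are assumptions for the demo, not the formats used at the Computing Research Laboratory.

```python
# A productive agentive "-er" rule derives noun entries from verb entries,
# multiplying the coverage of a small core lexicon. All data is invented.
CORE = {"bake": {"cat": "verb", "sem": "cook-event"},
        "teach": {"cat": "verb", "sem": "instruct-event"}}

def agentive_rule(lexicon):
    derived = {}
    for lemma, entry in lexicon.items():
        if entry["cat"] == "verb":
            derived[lemma + "er"] = {"cat": "noun",
                                     "sem": f"agent-of({entry['sem']})"}
    return derived

print(agentive_rule(CORE))  # {'baker': ..., 'teacher': ...}
```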
by Piek Vossen - University of Amsterdam
1. Introduction to Computational Semantics
by Johan Bos - University of the Saarland, Saarbrücken
(1) How can we automate the process of associating semantic representations with expressions of natural language? (A small illustrative sketch follows this course description.)
(2) How can we use logical representations of natural language expressions to automate the process of drawing inferences?
(3) One of the lectures will focus on using WordNet in discourse analysis. Prerequisites for this course: some basic familiarity with first-order logic and/or knowledge of Prolog would be advantageous for the student, but is not a must. More information is available at: http://www.coli.uni-sb.de/~bos/comsem/
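As a loose illustration of question (1) above (and not of the course's own Prolog-based software), the following minimal Python sketch treats word meanings as functions and composes them, in the order dictated by the syntax, into first-order formulas; all representations and names are invented.

```python
# Determiners map a noun property and a VP property to a formula.
every = lambda noun: lambda vp: f"all x.({noun('x')} -> {vp('x')})"
some  = lambda noun: lambda vp: f"exists x.({noun('x')} & {vp('x')})"

# Nouns and intransitive verbs are one-place properties.
linguist = lambda x: f"linguist({x})"
snores   = lambda x: f"snore({x})"

print(every(linguist)(snores))  # all x.(linguist(x) -> snore(x))
print(some(linguist)(snores))   # exists x.(linguist(x) & snore(x))
```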
by Rodolfo Delmonte - University of Venice
Text summarization can be regarded as the most interesting and promising Natural Language Understanding task computational linguists currently face. A first reduction of the text to be summarized is achieved by sentence extraction based on topic identification, using traditional frequency ranking of nouns enhanced with morphologically derived linguistic items. The extract usually amounts to about one fifth (20%) of the original text, and constitutes the input to which NLP-intensive techniques are then applied. A second important step in text reduction, aimed at producing a linguistically grounded synthesis, goes through the usual text analysis processes, which recover the Predicate-Argument Structure of each propositional nucleus contained in the extracted sentences and evaluate the possible adjuncts of each predicate.
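A bare-bones sketch of the first reduction step just described, with raw token frequency standing in for the noun ranking (no POS tagging or morphological derivation, which the actual system uses):

```python
# Ranks sentences by the corpus frequency of their longer tokens and keeps
# the top 20%, returned in original order.
import re
from collections import Counter

def extract(text, ratio=0.2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    def score(sent):
        toks = re.findall(r"[a-z]+", sent.lower())
        return sum(freq[t] for t in toks if len(t) > 3) / (len(toks) or 1)
    keep = max(1, round(len(sentences) * ratio))
    best = set(sorted(sentences, key=score, reverse=True)[:keep])
    return [s for s in sentences if s in best]
```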
We shall discuss all the processes involved in producing a synthesis from unrestricted text, showing and explaining an example elaborated by GETARUNS, the system for Italian text summarization under development at the University of Venice.
In a second lecture we shall present the contents of our computational lexicon of Italian, called LIFUV, in its fully explicit format, which comprises 5,000 verb entries, 2,000 adjective entries and 6,000 noun entries. We shall also present MIDUV, the main lexicon, which contains a compact classification of nouns, verbs and adjectives for over 120,000 entries.
We shall then concentrate on the Upper Level Module of GETA_RUN, the ancestor of GETARUNS, a complete system for text understanding which runs on the web at our website, http://byron.cgm.unive.it/Risorse/GETARUN.html.
We shall discuss in three separate lectures the linguistic representations produced at discourse level for a number of English and Italian texts, commenting on the following topics:
A. Anaphora Resolution and Inferential Processes to produce a Discourse Model
B. From Discourse Model to Summary Generation with Semantic Relations and Discourse Structure
C. From Discourse Model to Knowledge Database for Queries under BACK-V
by Jerry Hobbs - SRI International
Abduction is inference to the best explanation. An approach to abductive inference called "weighted abduction" has been developed that has resulted in a significant simplification of how the problem of interpreting texts is conceptualized. The interpretation of a text is the minimal explanation of why the text would be true. More precisely, to interpret a text, one must prove the logical form of the text from what is already mutually known, merging redundancies where possible and making assumptions where necessary. It is shown how such "local pragmatics" problems as reference resolution, the interpretation of compound nominals, the resolution of syntactic ambiguity and metonymy, and schema recognition can be solved in this manner. Moreover, this approach of "interpretation as abduction" can be combined with the older view of "parsing as deduction" to produce an elegant and thorough integration of syntax, semantics, and pragmatics, one that spans the range of linguistic phenomena from phonology to discourse structure. Finally, I will discuss means for making the abduction process efficient and the semantics of the weights and costs in the abduction scheme.
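As a toy illustration only, the following propositional Python sketch captures the cost-minimization idea behind weighted abduction; the real framework is first-order and handles unification, factoring and much subtler cost schemes, and all facts, rules and weights below are invented.

```python
# A goal is explained either by a known fact (free), by backward chaining
# through a rule whose antecedent weights scale the costs, or by assuming
# it outright at a fixed cost; the interpretation is the cheapest option.
FACTS = {"car(c1)"}
# consequent -> list of (antecedent, weight) for a single Horn rule
RULES = {
    "vehicle(c1)": [("car(c1)", 0.9)],
    "moves(c1)":   [("vehicle(c1)", 0.6), ("has_fuel(c1)", 0.6)],
}
ASSUME = 1.0  # cost of assuming any literal outright

def cost(goal):
    """Minimal cost of explaining `goal` against FACTS and RULES."""
    if goal in FACTS:
        return 0.0
    if goal in RULES:
        chained = sum(w * cost(ante) for ante, w in RULES[goal])
        return min(ASSUME, chained)
    return ASSUME

# Cheapest explanation of "moves(c1)": chain through the rule, derive
# vehicle(c1) from the fact car(c1) for free, and assume has_fuel(c1).
print(cost("moves(c1)"))  # 0.6
```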
(2) Encoding Commonsense Knowledge for Discourse Interpretation
We understand discourse so well because we know so much. Therefore, a system for discourse interpretation must have a large knowledge base of commonsense knowledge. But this knowledge base must be built up in a principled fashion. Certain abstract concepts lie at the foundation of the knowledge required in discourse interpretation, in the sense that most prepositions, the most common verbs, and many other words crucially involve these concepts in their meaning. Among these concepts are granularity, systems of entities, scales, the figure-ground relation, change of state, causality, and goal-directed behavior. In this talk I will sketch some of the key features needed in a commonsense theory of each of these concepts. I will then outline a methodology for using occurrences of words in discourse to select the facts that need to be encoded in a knowledge base for discourse interpretation.
(3) The Structure of Discourse
Adjacent segments of discourse, by their very adjacency, convey the information that the situations they describe are related. In this talk, I describe the most common of these relations, including relations based on figure-ground, change of state, causality, and the affirmation or denial of similarity. For each of these relations I give a formal characterization of the inferences that must be drawn in order to recognize them. This will be done in the "Interpretation as Abduction" framework. When two segments of discourse are related by a coherence relation, they compose into a larger segment of discourse. In this way, a tree-like structure for an entire discourse can be built up. I will discuss several features of discourse structure in these terms. Then a method for analyzing discourse will be presented, and several diverse texts will be examined by this method.
(4) Information and Intention in Discourse
In discourse interpretation there are two questions that need to be answered: What situation is being described in the text? And why is the speaker describing that situation? The first of these questions leads to the informational perspective on discourse, and the second leads to the intentional perspective. I will discuss the advantages and shortcomings of each of these perspectives, and then propose an integrated view. In this framework I will discuss several phenomena involving the interrelation of the perspectives, including the case of pragmatic conditionals, a case of answering a question by responding to a higher goal, a case where the informational account is a central part of the intentional account, and a case where the intentional account overrides the informational account.
by Daniel Marcu - Information Sciences Institute and Department of Computer Science, University of Southern California
(1) Overview and Topic Identification
We will first outline the major types of summary: indicative vs. informative; abstract vs. extract; generic vs. query-oriented; background vs. just-the-news; single-document vs. multi-document; and so on. We will describe the typical decomposition of summarization into three stages, and explain in detail the major approaches to each stage. For topic identification, we will outline techniques based on stereotypical text structure, cue words, high-frequency indicator phrases, intratext connectivity, and discourse structure centrality. For topic fusion, we will outline some ideas that have been proposed, including concept generalization and semantic association. For summary generation, we will describe the problems of sentence planning to achieve information compaction.
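As a loose illustration of two of the topic-identification techniques mentioned (stereotypical text position and cue words), here is a minimal Python sketch; the cue list and weights are invented, not taken from any of the systems discussed.

```python
# Scores sentences by position (earlier is better) plus bonus/stigma cues.
CUES = {"in summary": 2.0, "in conclusion": 2.0, "significantly": 1.0,
        "for example": -1.0, "incidentally": -0.5}

def topic_scores(sentences):
    scores = []
    for i, sent in enumerate(sentences):
        score = 1.0 / (i + 1)  # position-based prior
        low = sent.lower()
        score += sum(v for cue, v in CUES.items() if cue in low)
        scores.append((score, sent))
    return sorted(scores, reverse=True)

ranked = topic_scores(["In summary, recall rose.", "For example, one query."])
print(ranked[0][1])  # the sentence judged most topic-bearing
```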
(2) Topic Interpretation, Generation, Future
We will highlight the strengths and weaknesses of statistical and symbolic/linguistic techniques in implementing efficient summarization systems. We will discuss ways in which summarization systems can interact with and/or complement natural language generation, discourse parsing, information extraction, and information retrieval systems. Finally, we will present a set of open problems that we perceive as being crucial for immediate progress in automatic summarization.
by Nancy Ide, Vassar College and Dan Cristea, University "Alexandru I. Cuza", Iasi
by Daniel Marcu - Information Sciences Institute and Department of Computer Science, University of Southern California
We illustrate two major perspectives on discourse structure -- functional and structural -- by describing how several influential theories deal with the pertinent questions. These theories include the story grammars of Van Dijk and Kintsch; the schema-based model of McKeown; the inference-based model of Hobbs; the logic-based models of Kamp, Asher, and Polanyi; the tripartite model of Grosz and Sidner; the rhetorical relation model (RST) of Mann and Thompson; and subsequent developments. We discuss some ongoing problems with all these models, focusing on the relations that ensure coherence and on the challenge of building large corpora of discourse trees by hand.
The topic has been studied for about 15 years. We describe the development and operation of RST-based text structurers and subsequent intention-based text planners. In contrast with text planning research, the automated rhetorical parsing of multisentence texts is a much younger enterprise. We present a range of lexico-grammatical phenomena that can be used to identify discourse segments and discourse relations, and present parsing models and mechanisms that are used to derive the discourse structure of unrestricted texts.
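As a rough illustration of the segmentation step, the sketch below splits a text at a few discourse markers and guesses a relation from each marker; the marker-to-relation table is an invented sample, and real rhetorical parsers use much richer lexico-grammatical evidence.

```python
# Splits a text at discourse markers and labels the segment each marker
# introduces with a guessed rhetorical relation.
import re

MARKERS = {"because": "CAUSE", "although": "CONCESSION",
           "for example": "ELABORATION", "but": "CONTRAST"}

def segment(text):
    pattern = r"\b(" + "|".join(MARKERS) + r")\b"
    parts = re.split(pattern, text.lower())
    segments, relation = [], None
    for part in parts:
        if part in MARKERS:
            relation = MARKERS[part]   # relation for the next segment
        elif part.strip():
            segments.append((relation, part.strip()))
            relation = None
    return segments

print(segment("The match was cancelled because it rained, although fans waited."))
# [(None, 'the match was cancelled'), ('CAUSE', 'it rained,'),
#  ('CONCESSION', 'fans waited.')]
```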
by Dan Moldovan - Southern Methodist University
1. Concepts in Multilingual Information Retrieval
by Paul Buitelaar - DFKI Saarbrücken
The course will address the interface between lexical semantics, multilinguality and information retrieval, showing that much of the knowledge needed for the tasks described above can be acquired by organizing and analyzing the underlying data and document sets. Within the framework of the MIETTA system I will discuss examples of this in classification, query expansion and query translation.
by Sanda Harabagiu - Southern Methodist University
In this talk I will elaborate on several applications of a Romanian WordNet. The areas will include the fields of Machine Translation, Information Retrieval and Information Extraction. We shall also address the problem of building conceptual indexes for a Question/Answering agent operating in Romanian.
by Dan Moldovan - Southern Methodist University
A method is presented for the simultaneous disambiguation of multiple words. The method proposed here combines two sources of information: (1) statistics from the Internet about the occurrences of groups of words, and (2) the conceptual density between a pair of words measured on a machine-readable dictionary such as WordNet. The Internet is used as a source of large corpora, while WordNet is used to quantify the semantic distance between words. The method provides a metric that ranks the senses of words.
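In the spirit of the method (though without the Internet statistics, and with NLTK's path similarity standing in for the conceptual-density measure), a minimal sketch that ranks the sense pairs of two co-occurring nouns over WordNet:

```python
# Ranks all (sense1, sense2) pairs of two words by WordNet proximity;
# the top pair suggests the intended senses of both words at once.
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def rank_sense_pairs(word1, word2, pos=wn.NOUN):
    pairs = []
    for s1 in wn.synsets(word1, pos):
        for s2 in wn.synsets(word2, pos):
            sim = s1.path_similarity(s2) or 0.0
            pairs.append((sim, s1.name(), s2.name()))
    return sorted(pairs, reverse=True)

print(rank_sense_pairs("bank", "money")[:3])
```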
In the second part of this talk we apply the disambiguation method to improve information retrieval from the Internet. By using similarity lists for the content words of a query, the search is extended to a much larger number of documents, thus increasing recall. Then, with the help of some newly defined search operators, most of the irrelevant information is filtered out, which increases retrieval precision.
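A minimal sketch of the recall-oriented step, using WordNet synonym lists as a stand-in for the talk's similarity lists; the precision-oriented search operators are not modelled here.

```python
# Widens each query term to the lemma names of its WordNet synsets.
from nltk.corpus import wordnet as wn

def expand(term, limit=5):
    lemmas = {l.name().replace("_", " ")
              for s in wn.synsets(term) for l in s.lemmas()}
    lemmas.discard(term)
    return [term] + sorted(lemmas)[:limit]

print(expand("car"))  # e.g. ['car', 'auto', 'automobile', ...]
```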
by Gábor Prószéky - MorphoLogic Budapest
Lexical items usually contain descriptions using numerous morpho-syntactic features. These features are the results of abstraction; that is, they have been extracted from previous occurrences of the word in various contexts. Problems may occur when, e.g.,
- we meet words whose actual roles seem to contradict their lexical descriptions,
- we would like to understand metaphors,
- we would like to process new lexical entries.
There are two rather general ways to solve the above problems in HLT: overriding lexical information or using underspecified lexical descriptions (a small sketch of the latter follows the list below). In order to choose a good solution, we show:
(1) the possible relations between lexical information and actual syntactic position,
(2) a method for classification of lexical entries without using traditional categories,
(3) a method for morpho-syntactic parsing based on open and underspecified lexicons,
(4) the role of corpus processing in feature selection.
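The promised sketch of the underspecification option: lexical entries leave some features unset, and unification with the features demanded by the syntactic position fills them in, failing only on an outright clash. All feature names and the entry format are invented for the demo.

```python
def unify(lexical, contextual):
    """Merge two feature dicts; None is underspecified and never clashes."""
    result = dict(lexical)
    for feat, val in contextual.items():
        if result.get(feat) is None:
            result[feat] = val            # fill in the underspecified slot
        elif result[feat] != val:
            return None                   # clash: the reading is rejected
    return result

entry = {"cat": "noun", "number": None}    # number left underspecified
print(unify(entry, {"number": "plural"}))  # {'cat': 'noun', 'number': 'plural'}
print(unify(entry, {"cat": "verb"}))       # None (contradiction)
```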
Terms like 'finite but open lexicons', 'run-time change of dictionaries' and 'on-line learning from corpora' might lead us to define a rather lexicalized grammar formalism. The most interesting features of this idea, called 'finite syntax', are also discussed.
5. Designing an editorial platform for a multilingual terminology: the experience provided by the DHYDRO project
by Laurent Romary - Loria/CNRS Nancy
The talk will provide an in-depth presentation of both the concepts and the implementation choices under development within the European MLIS/DHYDRO project. The aim is to define an environment allowing the editing and viewing of multilingual terminologies on the basis of the currently available standards (SGML, TEI, MARTIF). The following topics will be addressed in particular:
* the editorial aspects, with a view on the opposition between dictionary and terminological database;
* the encoding of a terminological database using XML and the TEI (a small sketch follows this list);
* some words about the possible use of parallel texts;
* methodological and technical aspects for developing such an environment.
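As a purely illustrative sketch (the actual DHYDRO schema may differ), a MARTIF/TEI-flavoured multilingual term entry built with Python's standard library; the element names, attributes and term values are assumptions.

```python
# Builds one termEntry with a langSet/term pair per language.
import xml.etree.ElementTree as ET

entry = ET.Element("termEntry", id="ID-0001")
for lang, term in [("en", "tide gauge"), ("fr", "marégraphe")]:
    lang_set = ET.SubElement(entry, "langSet", lang=lang)
    ET.SubElement(lang_set, "term").text = term

print(ET.tostring(entry, encoding="unicode"))
# <termEntry id="ID-0001"><langSet lang="en"><term>tide gauge</term>...
```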
6. Querying through the Web
by François Rousselot - LIIA-ENSAIS, Strasbourg
Description logics are a good tool for maintaining a document base. On the one hand, they provide a flexible way to represent the content of texts in a frame-like formalism which allows information of different types to be mixed: indices, time, cause, etc.; on the other hand, they provide a sound language based on logic, useful for expressing queries and, finally, for helping to select documents. If natural language queries are to be translated into the DL language, some interesting problems arise from the restrictions of DL. It is necessary to carefully establish the links between the natural language used to express semantic relations and the roles (the symbols used to express relations in DL). As a result it is possible either to conceive a robust interface for unrestricted language or to give rules constraining the language available to the user.
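A toy sketch of the document-selection idea: if concepts are modelled as attribute sets, subsumption reduces to set inclusion, and a query retrieves the documents whose content description it subsumes. Real description logics add roles, negation and a proper reasoner; all data below is invented.

```python
# Each document's content is a frame-like set of typed attributes.
DOCS = {
    "doc1": {"topic:flood", "time:1998", "cause:rain"},
    "doc2": {"topic:flood", "time:1997"},
    "doc3": {"topic:drought", "time:1998"},
}

def retrieve(query):
    """Return documents whose description entails every query attribute."""
    return [d for d, desc in DOCS.items() if query <= desc]

print(retrieve({"topic:flood"}))               # ['doc1', 'doc2']
print(retrieve({"topic:flood", "time:1998"}))  # ['doc1']
```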
(2) Refining answers of a search engine with linguistic engineering tools
Human end-users of commonly used search engines are often overwhelmed with irrelevant texts. It is interesting to have a way to refine the results in order to reduce both the noise and the silence. The idea is first to use a linguistic tool to compile a set of terms characterizing the domain, and secondly to use link typologies to reach further documents and examine them. By comparison with the set of terms, new documents are selected or rejected as relevant, and the process goes on. The first step needs interaction with the users: it is necessary to edit the results of the program which furnishes term candidates, but afterwards the program can run automatically and watch the web.
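A bare-bones sketch of the filtering step, with an invented domain term set and a crude overlap threshold standing in for the compiled terminology:

```python
# Keeps a fetched document only if it shares enough vocabulary with the
# domain term set compiled (interactively) in the first step.
DOMAIN_TERMS = {"tide", "gauge", "hydrography", "sounding", "chart"}

def relevant(document, threshold=2):
    words = set(document.lower().split())
    return len(words & DOMAIN_TERMS) >= threshold

print(relevant("Tide gauge readings feed the hydrography chart archive."))
```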
7. Knowledge Based Support for Technical Translators
by Walther von Hahn - University of Hamburg
The course will explain the rationale and the techniques applied in the system DBR-MAT. This (Prolog-based) system supports translators by allowing multilingual domain-specific queries, which are answered from an abstract knowledge base instead of giving monolingual canned-text explanations or isolated term definitions. Additionally, syntactic analyses and term checking are available to the translator. The interaction of the lexicon, the knowledge base, the graphical objects and the generator is explained in more detail. The project was performed by the University of Hamburg, the Academy of Sciences in Sofia and the University of Bucharest.
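As a schematic sketch only (not the DBR-MAT architecture or its data), the idea of answering a translator's query from a language-independent knowledge base can be pictured like this:

```python
# Terms in several languages point to one concept; the answer is generated
# from the concept's relations rather than retrieved as canned text.
TERMS = {("en", "valve"): "C1", ("de", "Ventil"): "C1"}
KB = {"C1": [("is-a", "mechanical component"),
             ("function", "controls the flow of a fluid")]}

def explain(lang, term):
    concept = TERMS.get((lang, term))
    if concept is None:
        return "unknown term"
    return "; ".join(f"{rel}: {val}" for rel, val in KB[concept])

print(explain("de", "Ventil"))  # same answer as for the English term
```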