Annotation formalisms and standards for NLP (XML, XCES)
Nancy Ide - Vassar College, Poughkeepsie
Laurent Romary - LORIA Laboratoires, Nancy
Corpus annotation requires the choice of a format for representing the text and its annotations in electronic form. The format should enable maximum usability and reusability of the annotated corpus by software available at different research sites. The corpus and annotation documents should also be encoded to enable easy and flexible access to the data. The Extensible Markup Language (XML) provides a standard encoding framework for annotation that answers these needs. Using XML as a base, the XML Corpus Encoding Standard (XCES) has been developed within the EAGLES project to provide a framework for encoding and organizing corpora and their annotations in a standard, flexible, and reusable format.
This section of the EUROLAN Summer School introduces the student to XML, as well as related and supporting standards developed within the XML framework, including XSLT (Extensible Stylesheet Language Transformations) and RDF (Resource Description Framework). Together, these tools provide the means to encode corpora and their annotations using the XCES document
architecture, as well as easy and efficient manipulation and access of these data. In addition, we will outline the issues and concerns for representing annotated data and provide an overview of an abstract data model for corpora and their annotations. In the hands-on session,
students will instantiate documents in XML representing a short text and its syntactic annotation, and will learn the basics of developing XSLT scripts for manipulating and accessing these data. We will also demonstrate use of a tool for creating and defining annotation categories using RDF, which can be stored in a central registry for use and/or reference by other annotations.
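To make the hands-on exercise concrete, the sketch below builds a short text with a toy syntactic (chunk and part-of-speech) annotation in XML and extracts the tagged tokens with Python's standard library. The element and attribute names here are invented for illustration; a real XCES document follows the XCES schema, and the extraction step stands in for what an XSLT script would do.

```python
# Illustrative sketch only: element names (cesAna, chunk, tok, orth, pos)
# are invented for demonstration, not the actual XCES vocabulary.
import xml.etree.ElementTree as ET

doc = """<cesAna>
  <chunkList>
    <chunk id="c1" type="NP">
      <tok id="t1"><orth>The</orth><pos>DT</pos></tok>
      <tok id="t2"><orth>cat</orth><pos>NN</pos></tok>
    </chunk>
    <chunk id="c2" type="VP">
      <tok id="t3"><orth>sleeps</orth><pos>VBZ</pos></tok>
    </chunk>
  </chunkList>
</cesAna>"""

root = ET.fromstring(doc)
# Pull out (word, part-of-speech) pairs in document order -- the kind of
# access an XSLT transformation over the annotated corpus would provide.
tagged = [(t.findtext("orth"), t.findtext("pos")) for t in root.iter("tok")]
print(tagged)  # [('The', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')]
```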
Measuring and Comparing Corpora
Adam Kilgarriff -
University of Brighton
Anyone who has worked with corpora will be all too aware of differences between them. Depending on the differences, it may, or may not, be reasonable to expect results based on one corpus to also be valid for another. It may, or may not, be appropriate for a grammar, or parser, based on one to perform well on another. It may, or may not, be straightforward to port an application from a domain of the first text type to a domain of the second. Currently, characterisations of corpora are mostly textual and informal. A corpus is described as "Wall Street Journal" or "transcripts of business meetings" or "foreign learners' essays (intermediate grade)". It would be desirable to be able to place a new corpus in relation to existing ones, and to be able to quantify similarities and differences. Allied to corpus similarity is corpus homogeneity. An understanding of homogeneity is a prerequisite to a measure of similarity -- it makes little sense to compare a corpus sampled across many genres, like the Brown, with a corpus of weather forecasts, without first accounting for the one being broad, the other narrow. There are of course many ways in which two corpora will differ, and different kinds of difference will be relevant for different kinds of purposes. Thus, the kind of similarity that lets a part-of-speech tagger developed on one corpus perform well on another may differ from the kind of similarity that matters for machine translation.
The course will explore strategies for quantifying corpus similarity and corpus homogeneity.
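One family of strategies compares the word-frequency lists of the two corpora with a test statistic. The sketch below computes a chi-square statistic over the combined vocabulary; identical corpora score zero, and larger values indicate greater difference. This is a minimal illustration of the frequency-list approach, not the specific measures the course will present.

```python
from collections import Counter

def chi2_distance(words_a, words_b):
    """Chi-square statistic over word frequencies in two corpora (given
    as token lists); lower means more similar.  A sketch of the kind of
    frequency-list comparison used to quantify corpus similarity."""
    fa, fb = Counter(words_a), Counter(words_b)
    na, nb = sum(fa.values()), sum(fb.values())
    chi2 = 0.0
    for w in set(fa) | set(fb):
        o_a, o_b = fa[w], fb[w]
        total = o_a + o_b
        e_a = total * na / (na + nb)   # expected count in corpus A
        e_b = total * nb / (na + nb)   # expected count in corpus B
        chi2 += (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b
    return chi2

same = chi2_distance("a b c a b".split(), "a b c a b".split())
diff = chi2_distance("a b c a b".split(), "x y z x y".split())
print(same, diff)  # identical corpora score 0.0; disjoint ones score higher
```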
Background reading:
Kilgarriff A. - "Comparing Corpora". To appear in International Journal of Corpus Linguistics (2001). Currently available at http://www.itri.bton.ac.uk/~Adam.Kilgarriff/ijcl.ps.gz
Corpus-Based Lexical Knowledge Acquisition
Dan Tufis -
Romanian Academy, Bucharest
This talk will address the issue of automatically constructing bi- and
multilingual translation lexicons, as well as simple chunking grammars, from
parallel corpora. Issues in sentence alignment, text tagging and
lemmatisation will be presented. 1:1 versus n:m mapping models of word
alignment (translation lexicon extraction) will be discussed, along with
some basic statistics for collocation analysis (pointwise mutual
information, the Dice coefficient, log-likelihood, chi-square). We will show
that standard monolingual collocation-recognition techniques and a 1:1
mapping approach can be used to implement n:m word-alignment models.
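The collocation statistics mentioned above can be sketched from the counts in a 2x2 contingency table. The snippet below computes pointwise mutual information, the Dice coefficient, and the log-likelihood ratio for one candidate word pair; the counts are hypothetical, chosen only to illustrate the formulas.

```python
import math

def association_scores(f_xy, f_x, f_y, n):
    """Association measures for a candidate collocation, given its joint
    count f_xy, the marginal counts f_x and f_y, and corpus size n."""
    pmi = math.log2((f_xy * n) / (f_x * f_y))   # pointwise mutual information
    dice = 2 * f_xy / (f_x + f_y)               # Dice coefficient
    # Log-likelihood ratio over the 2x2 contingency table.
    def ll_term(obs, exp):
        return obs * math.log(obs / exp) if obs > 0 else 0.0
    a, b, c = f_xy, f_x - f_xy, f_y - f_xy
    d = n - f_x - f_y + f_xy
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    ll = 2 * (ll_term(a, row1 * col1 / n) + ll_term(b, row1 * col2 / n)
              + ll_term(c, row2 * col1 / n) + ll_term(d, row2 * col2 / n))
    return pmi, dice, ll

# Hypothetical counts: the pair seen together 30 times, the words seen
# 300 and 100 times respectively, in a 100,000-token corpus.
pmi, dice, ll = association_scores(30, 300, 100, 100_000)
print(round(pmi, 2), dice, ll > 0)
```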
The students will be shown experiments on a multilingual corpus, with
several bilingual lexicons extracted from a multilingual parallel text.
Students will be asked to validate extracted dictionaries and compute
various information retrieval scores (precision, recall, F-measure).
Experiments with developing simple chunking grammars and chunking texts will
exemplify the ideas of grammar induction from corpora.
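The evaluation scores the students will compute can be sketched as follows: an extracted translation lexicon is compared against a validated (gold) one, yielding precision, recall, and the F-measure. The lexicon entries below are hypothetical, for illustration only.

```python
def precision_recall_f(extracted, gold):
    """Standard IR-style scores for an extracted translation lexicon
    evaluated against a validated one (entries are word pairs)."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)          # correctly extracted entries
    precision = tp / len(extracted)     # share of extracted entries that are right
    recall = tp / len(gold)             # share of gold entries that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical French-English entries for illustration.
extracted = {("maison", "house"), ("chien", "dog"), ("chat", "cow")}
gold = {("maison", "house"), ("chien", "dog"),
        ("chat", "cat"), ("eau", "water")}
p, r, f = precision_recall_f(extracted, gold)
print(p, r, round(f, 3))  # precision 2/3, recall 1/2, F1 ~0.571
```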
Bibliography (available on my web-page):
Dan Tufis: Using a large set of EAGLES-compliant morpho-syntactic
descriptors as a tagset for probabilistic tagging. Proc. LREC 2000,
Athens, 2000, pp. 1105-1112
Dan Tufis, Ana-Maria Barbu: Automatic Extraction of Translation Equivalents
from Parallel Corpora. Proc. of TELRI 2000, Ljubljana (to appear)
Sub-syntactic and syntactic annotation (shallow parsing, treebanks)
Domain Specific Semantic Annotation
Paul Buitelaar -
DFKI, Saarbrücken
Lecture Component:
We discuss issues in the semantic annotation of textual documents from a
domain specific point of view. These include:
Practical Component:
Students are invited to semantically annotate sample instances of terms and
relations in (English, German) medical text. Tools are available for
semi-automatic support. Also, on-line medical resources can be consulted.
The word sense: theory, annotation, disambiguation
Adam Kilgarriff -
University of Brighton
The lecture component will give an overview of the concept of polysemy and
explore its ramifications for NLP, including the following topics:
what does it mean for a word to have more than one meaning?
ambiguity tests and their limitations
how dictionaries present polysemy
a lexicographic perspective: lumpers versus splitters
corpus lexicography
polysemy in the dictionary and polysemy in the corpus
lexical creativity (including metaphor and metonymy) and polysemy
Word Sense Disambiguation (WSD) - brief history
WSD evaluation
sense-tagging in SENSEVAL
The practical component will be organised around an exercise in
sense-tagging. Each student will be required to sense-tag around 200
word-instances of a language for which they are a native speaker or have
near-native proficiency, according to a sense inventory from a
published dictionary. This will then provide a dataset for further
small-group discussion and analysis.
Reference:
Adam Kilgarriff: "I don't believe in word senses". Computers and the
Humanities 31 (2), 1998, pp. 91-113. Available online at:
ftp://ftp.itri.bton.ac.uk/reports/ITRI-97-12.ps.gz
Annotation of semantics, meaning relationships, linguistic chains, semantic roles of verbs
Graeme Hirst - University of Toronto
Charles Fillmore - University of California, Berkeley
My morning presentation will begin with a description of the goals and
achievements of the FrameNet research project, and will include comparisons
between the FrameNet database and WordNet, familiar dictionaries, and
familiar thesauri.
This will be followed by a survey of the kinds of relations among lexical
entities that are recognized in FrameNet but have not been incorporated
into other net-like lexical resources. These will include:
1. Word-sets that are best described in terms of the special grammatical
constructions in which they participate. The main illustration will be
words used in typical time-specifying phrases in English and the category
of titles. Comparison will be made with the syntactic behavior of
equivalent expressions in other languages.
2. Words with "anaphoric zero" - that is, words which evoke semantic
structures including particular arguments, but which can omit expressing
those arguments under certain discourse conditions. Some of these follow
semantic patterns (aspect verbs); some make use of semantic and
grammatical conditions (definite relational nouns); and others seem to be
lexically specific. The typological point by which English can be
compared with languages that allow more or less unrestricted pragmatic
zeroes (Chinese, Japanese) will be emphasized; comparisons will be made
with other European languages.
3. "Transparent" nouns - nouns which are the syntactic heads of phrases
but which are "transparent" to collocational or selectional relations
between their governors and their dependents. The labeling of
transparent structures can serve as an aid to recognizing selectional
relationships (by ignoring the intervening transparent structures). A
project for detecting other transparent structures for similar purposes
will be described. Transparent nouns include names of types, aggregates,
quantities, and units.
4. Support verbs (light verbs) and some related "Mel'cukian" lexical
functions. English makes strong use of support verbs with nouns that
designate events. Support verbs will be seen as figuring in (1) word
sense disambiguation (compare "have an argument" with "make the argument
(that)"); (2) selecting among the participants in an event (compare "give
an examination", "take an examination", or "perform an operation",
"undergo an operation"); (3) highlighting particular phases of a temporally
complex event ("make a promise", "keep a promise"); (4) selecting the
register of a passage ("make a complaint", "register a complaint"); etc.
In some cases there will be clear generalizations in the selection of
support verbs ("make" with official monologic communications - statement,
announcement, proclamation; "have" with dialogic or reciprocal events -
argument, discussion, fight, quarrel; "give" with behavior-influencing acts
- advice, instructions, warning; etc.); but in many cases they will be
idiosyncratic ("say a prayer", "wage a war").
5. Homologues/Analogues. Many words belong in different frames, or in
different subtrees in a hierarchy, but have relevant structural,
functional or configurational similarity to each other. Thus "toes" and
"fingers", or "knee" and "elbow", are analogous parts of the upper and
lower limbs of the human body. The words "paw", "foot", "claw", and "hoof" are
analogous body parts across animal types, as are "nose", "beak", "bill",
"snout" and "trunk". Words designating recipients of a professional
service include "customer", "client", "guest", and "passenger". Money
payments across various activity types have frame-differentiating names
like "fee", "bribe", "tip", "wages", "payment", etc., differing from each
other in selection of support verbs, etc. One reason for recognizing
homologous word-groups is that languages differ in the degree to which
they separate them lexically.
6. Words that have similar meanings in different domains; a variety of
polysemy. ("Give" as expressing ordinary gift-giving as opposed to
making a contribution shows different omissibility properties. "Explain"
as a communicating verb as opposed to a verb of cognition shows different
aspectual behavior. And so on.)
Annotation of discourse (structure, co-reference)
Evaluation of Anaphora Resolution
Catalina Barbu -
Universities of Wolverhampton & Iasi
Ruslan Mitkov - University of Wolverhampton
The tutorial will provide a theoretical background for evaluation in
anaphora resolution and will address practical issues in evaluation. The
discussion will cover the following issues:
the importance of fair and consistent evaluation
difficulties in developing annotated corpora to be used in evaluation
problems with currently reported results, arising from differences in
evaluation measures, evaluation data, working mode, pre-processing tools
as a way forward, an evaluation workbench that alleviates the previously
mentioned problems is proposed and described. A demo of the evaluation
workbench will be presented.
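One of the evaluation measures commonly reported in this area is a success rate: the proportion of anaphors for which the system proposes the annotated antecedent. The sketch below illustrates the idea on hypothetical data; a real workbench, of the kind discussed above, additionally controls for annotated data, working mode, and pre-processing errors.

```python
def success_rate(system, gold):
    """Share of anaphors whose proposed antecedent matches the annotated
    one.  A sketch of one common anaphora-resolution evaluation measure;
    the mappings below are invented for illustration."""
    correct = sum(1 for anaphor, antecedent in system.items()
                  if gold.get(anaphor) == antecedent)
    return correct / len(gold)

# Hypothetical anaphor -> antecedent mappings.
gold = {"it_1": "the report", "she_2": "Mary", "they_3": "the ministers"}
system = {"it_1": "the report", "she_2": "Ann", "they_3": "the ministers"}
print(success_rate(system, gold))  # 2 of 3 anaphors resolved correctly
```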
The practical session will consist of group activities based on the
evaluation issues raised in the theoretical part. The results produced
during the training session will provide a basis for discussion.
Information access on the Web: Retrieval, Extraction and
Organization
Atsushi Fujii -
University of Library and Information Science - Tokyo
This lecture will give an overview of various computer processing methods to
access textual information on the World Wide Web, focusing mainly on the
following topics:
cross-language information retrieval, where the user presents queries in
one language to retrieve Web pages in other languages,
corpus generation, in which textual fragments are extracted/organized
based on Web pages, for specific language applications (e.g., machine
translation and question answering).
In the practice part, each student will be asked to perform
mono-lingual/cross-language retrieval by way of a Web interface (combined
with machine translation systems). For this purpose, an English-Japanese
comparable document collection and English test queries will be used: while
one group performs E-E monolingual retrieval, the other group performs E-J
cross-language retrieval (no Japanese proficiency will be required, because
a J-E MT system will be available via the Web). Then, we will compare and
discuss retrieval results (i.e., accuracy and time efficiency) obtained with
both groups.
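The cross-language condition can be sketched as query translation followed by ordinary monolingual retrieval. The toy example below uses a two-entry dictionary as a stand-in for the MT step and ranks documents by simple term overlap; both the collection and the "translations" are invented, and a real system would use an MT engine and a proper ranking model.

```python
def retrieve(query_terms, documents):
    """Rank document ids by term overlap with the query -- a minimal
    stand-in for a real retrieval engine."""
    scores = {doc_id: len(set(text.lower().split()) & set(query_terms))
              for doc_id, text in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy collection; the dictionary below stands in for the
# machine-translation step that bridges the language gap.
docs = {"d1": "weather forecast for Tokyo", "d2": "stock market report"}
translate = {"tenki": "weather", "yohou": "forecast"}
source_query = ["tenki", "yohou"]
translated_query = [translate[t] for t in source_query]
print(retrieve(translated_query, docs))  # d1 ranked first
```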
Some of my publications related to the above topics are available at
http://www.ulis.ac.jp/~fujii/publication.html.
Domain Specific Semantic Annotation in Cross-Lingual Information
Retrieval
Paul Buitelaar -
DFKI, Saarbrücken
Lecture Component:
We discuss issues in exploiting semantic annotation for cross-lingual
information retrieval. These include:
- Statistical vs. Knowledge-based Approaches
- Query Expansion/Refinement
Practical Component:
Experiments with a cross-lingual information retrieval prototype.
Annotation of discourse structure
Daniel Marcu -
ISI, University of Southern California
Most linguists agree that well-written texts have internal structure and
that this structure is conveniently characterized by discourse/rhetorical
relations, i.e., relations that reflect semantic
and functional judgments about the text spans they connect. Yet, if one
attempts to uncover the internal structure of texts, one will soon run into
many difficulties. How many relations should one use? On which grounds
should one define these relations? What is the granularity of the textual
spans that one should consider in studying discourse-specific phenomena?
Where does syntactic annotation become discourse annotation and
vice-versa? In this lecture, we will discuss a few large-scale efforts aimed
at building discourse-level annotations of naturally occurring texts.
During the training session, students will use a discourse annotation tool
in order to manually build the discourse structure of two small texts. We
will then discuss the difficulties and problems that are inherent to this
task.
Exploitation for summarization and discourse interpretation
Daniel Marcu -
ISI, University of Southern California
Empirically based, statistical approaches to part-of-speech tagging, syntactic
parsing, and machine translation have yielded systems that perform at higher
accuracy levels than traditional rule-based systems. Yet, much less
empirically grounded work has been carried out in the context of
summarization and discourse interpretation. In this lecture, we will review
some new trends in the fields of empirically based summarization and discourse
interpretation.
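A common starting point in this empirical line of work is a frequency-based extractive baseline: score each sentence by how frequent its words are across the whole text, and keep the top-scoring ones. The sketch below illustrates the idea on a made-up three-sentence text; it is a baseline illustration, not the systems discussed in the lecture.

```python
from collections import Counter

def summarize(sentences, k=1):
    """Select the k sentences whose words are most frequent over the
    whole text: a minimal frequency-based extractive summarizer."""
    def words(s):
        return s.lower().strip(".").split()
    freq = Counter(w for s in sentences for w in words(s))
    def score(s):
        ws = words(s)
        return sum(freq[w] for w in ws) / len(ws)  # mean word frequency
    return sorted(sentences, key=score, reverse=True)[:k]

# Made-up text for illustration.
text = ["Discourse structure helps summarization.",
        "Summarization benefits from discourse structure.",
        "The weather was pleasant."]
print(summarize(text))  # the first sentence scores highest
```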
During the training session, students will use existing annotated data and
existing machine learning tools to build an empirically grounded component for
a discourse interpretation system.
Exploitation for machine translation
Ulf Hermjakob - University of Southern California
Sergei Nirenburg - New Mexico State University
The lectures will give an overview of various approaches to machine
translation, with an emphasis on empirical methods and the language
resources they need.
The practical component will focus on how language resources and learning
can be used to resolve ambiguity in translation and to overcome structural
mismatches. The principal language pair for translation exercises will be
English and French.
Software Architecture for Language Engineering
Hamish Cunningham - University of Sheffield
Valentin Tablan - Universities of Sheffield & Iasi
The tutorial will cover an introduction to Software Architecture for Language Engineering (SALE) and to Information Extraction. SALE is an area formed by the intersection of human language computation and software engineering, and it covers all areas of the provision of
infrastructural systems to support research and development of language processing software.
The practical session will demonstrate the GATE (General Architecture for Text Engineering) system and some IE tools integrated in it.