Tutorials
(abstracts)

Summer Institute on "Creation and Exploitation of Annotated Language Resources" | 30 July - 11 August 2001 | Romania

  • Annotation formalisms and standards for NLP (XML, XCES)

      Nancy Ide - Vassar College, Poughkeepsie

      Laurent Romary - LORIA Laboratoires, Nancy

      Corpus annotation requires the choice of a format for representing the text and its annotations in electronic form. The format should enable maximum usability and reusability of the annotated corpus by software available at different research sites. The corpus and annotation documents should also be encoded to enable easy and flexible access to the data. The Extensible Markup Language (XML) provides a standard encoding framework for annotation that answers these needs. Using XML as a base, the XML Corpus Encoding Standard (XCES) has been developed within the EAGLES project to provide a framework for encoding and organizing corpora and their annotations in a standard, flexible, and reusable format.

      This section of the EUROLAN Summer School introduces the student to XML, as well as related and supporting standards developed within the XML framework, including XSLT (Extensible Stylesheet Language Transformations) and RDF (Resource Description Framework). Together, these tools provide means to encode corpora and their annotations using the XCES document architecture, as well as easy and efficient manipulation of and access to these data. In addition, we will outline the issues and concerns for representing annotated data and provide an overview of an abstract data model for corpora and their annotations. In the hands-on session, students will instantiate documents in XML representing a short text and its syntactic annotation, and will learn the basics of developing XSLT scripts for manipulating and accessing these data. We will also demonstrate use of a tool for creating and defining annotation categories using RDF, which can be stored in a central registry for use and/or reference by other annotations.
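
      As a purely illustrative sketch (the element and attribute names below are invented for the example and are not the official XCES vocabulary), a document of the kind students will instantiate in the hands-on session — a sentence with token-level part-of-speech annotation — can be generated with Python's standard xml.etree module:

```python
import xml.etree.ElementTree as ET

# Build a tiny annotated-corpus fragment: one sentence ("s") with
# token ("tok") and part-of-speech mark-up. The element and attribute
# names are illustrative, not the official XCES schema.
s = ET.Element("s", id="s1")
tokens = [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
for i, (word, pos) in enumerate(tokens, start=1):
    tok = ET.SubElement(s, "tok", id=f"t{i}", pos=pos)
    tok.text = word

xml_string = ET.tostring(s, encoding="unicode")
print(xml_string)
```

      An XSLT script of the kind developed in the session could then, for instance, extract from such a document all tokens whose pos attribute is NOUN.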


  • Measuring and Comparing Corpora

      Adam Kilgarriff - University of Brighton

      Anyone who has worked with corpora will be all too aware of differences between them. Depending on the differences, it may, or may not, be reasonable to expect results based on one corpus to also be valid for another. It may, or may not, be appropriate for a grammar, or parser, based on one to perform well on another. It may, or may not, be straightforward to port an application from a domain of the first text type to a domain of the second. Currently, characterisations of corpora are mostly textual and informal. A corpus is described as "Wall Street Journal" or "transcripts of business meetings" or "foreign learners' essays (intermediate grade)". It would be desirable to be able to place a new corpus in relation to existing ones, and to be able to quantify similarities and differences.

      Allied to corpus similarity is corpus homogeneity. An understanding of homogeneity is a prerequisite for a measure of similarity -- it makes little sense to compare a corpus sampled across many genres, like the Brown, with a corpus of weather forecasts, without first accounting for the one being broad, the other narrow. There are of course many ways in which two corpora can differ, and different kinds of difference will be relevant for different purposes. Thus, similarity such that a part-of-speech tagger developed for one corpus works well on the other may differ from similarity for machine translation.
      The course will explore strategies for quantifying corpus similarity and corpus homogeneity.
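
      To make the idea of quantifying corpus similarity concrete, here is a minimal sketch (an illustration in this spirit, not the course's actual method) of one such strategy: comparing the frequency profiles of the most common words in two corpora with a chi-square statistic, where a higher score means the corpora are more dissimilar. The toy corpora are invented:

```python
from collections import Counter

def chi_square_distance(corpus_a, corpus_b, n_words=500):
    """Compare two tokenised corpora via a chi-square statistic over
    their most frequent pooled words: higher = more dissimilar.
    A simplified sketch of frequency-profile comparison."""
    fa, fb = Counter(corpus_a), Counter(corpus_b)
    na, nb = len(corpus_a), len(corpus_b)
    # Pool the counts and keep the most frequent words overall.
    common = [w for w, _ in (fa + fb).most_common(n_words)]
    score = 0.0
    for w in common:
        oa, ob = fa[w], fb[w]
        # Expected counts if the word were equally likely in both corpora.
        ea = (oa + ob) * na / (na + nb)
        eb = (oa + ob) * nb / (na + nb)
        score += (oa - ea) ** 2 / ea + (ob - eb) ** 2 / eb
    return score

a = "the cat sat on the mat the dog ran".split()
b = "the stock market fell the shares dropped".split()
print(chi_square_distance(a, b, n_words=5))
```

      A homogeneity score can be obtained in the same spirit by splitting a single corpus in half and comparing the two halves with the same statistic.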

      Background reading: Kilgarriff, A.: "Comparing Corpora", to appear in International Journal of Corpus Linguistics (2001). Currently available at http://www.itri.bton.ac.uk/~Adam.Kilgarriff/ijcl.ps.gz


  • Corpus-Based Lexical Knowledge Acquisition

      Dan Tufis - Romanian Academy, Bucharest

      This talk will address the issue of automatically constructing bi- and multilingual translation lexicons, as well as simple chunking grammars, from parallel corpora. Issues in sentence alignment, text tagging, and lemmatisation will be presented. 1:1 versus n:m mapping models of word alignment (translation lexicon extraction) will be discussed, along with some basic statistics for collocation analysis (pointwise mutual information, Dice, log likelihood, chi-square). We will show that standard monolingual collocation recognition techniques and a 1:1 mapping approach can be used to implement n:m word-alignment models.
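
      As a toy illustration of two of the collocation statistics named above (the counts are invented, and this is a sketch rather than the lecture's implementation), pointwise mutual information and the Dice coefficient can be computed directly from co-occurrence counts:

```python
import math

def pmi(pair_count, x_count, y_count, n):
    """Pointwise mutual information of a word pair:
    log2( P(x,y) / (P(x) * P(y)) ), with probabilities
    estimated from counts over n observed pairs."""
    return math.log2((pair_count / n) / ((x_count / n) * (y_count / n)))

def dice(pair_count, x_count, y_count):
    """Dice coefficient: 2 * f(x,y) / (f(x) + f(y))."""
    return 2 * pair_count / (x_count + y_count)

# Invented counts: the pair "strong tea" seen 30 times; "strong"
# 200 times and "tea" 100 times, out of 100,000 observed pairs.
print(pmi(30, 200, 100, 100_000))   # high PMI suggests a collocation
print(dice(30, 200, 100))           # 0.2
```

      Candidate pairs scoring above a threshold on such measures are treated as collocations (or, bilingually, as translation-equivalent pairs) in the techniques discussed in the lecture.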

      The students will be shown experiments on a multilingual corpus, with several bilingual lexicons extracted from a multilingual parallel text. Students will be asked to validate extracted dictionaries and compute various information retrieval scores (precision, recall, F-measure).
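
      The information retrieval scores mentioned above are standard. As a reminder, with an invented extracted lexicon and gold-standard dictionary (the word pairs below are hypothetical examples, not data from the course):

```python
def evaluate(extracted, gold):
    """Precision, recall and balanced F-measure of an extracted
    translation lexicon against a gold-standard dictionary."""
    tp = len(extracted & gold)                       # correct entries
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

extracted = {("maison", "house"), ("chien", "dog"), ("chat", "bird")}
gold = {("maison", "house"), ("chien", "dog"),
        ("chat", "cat"), ("oiseau", "bird")}
p, r, f = evaluate(extracted, gold)
print(p, r, f)  # 2 of 3 extracted entries are correct; 2 of 4 gold found
```

      Precision rewards extracting only correct entries, recall rewards finding all gold entries, and the F-measure is their harmonic mean.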

      Experiments with developing simple chunking grammars and chunking texts will exemplify the ideas of grammar induction from corpora. Bibliography (available on my web page):

      • Dan Tufis: Using a large set of EAGLES-compliant morpho-syntactic descriptors as a tagset for probabilistic tagging. Proc. LREC2000, Athens, 2000, pp. 1105-1112

      • Dan Tufis, Ana-Maria Barbu: Automatic Extraction of Translation Equivalents from Parallel Corpora. Proc. of TELRI2000, Ljubljana (to appear)


  • Sub-syntactic and syntactic annotation (shallow-parsing, tree banks)

      Hans Uszkoreit - Saarland University, Saarbrücken


  • Domain Specific Semantic Annotation

      Paul Buitelaar - DFKI, Saarbrücken

      Lecture Component: We discuss issues in the semantic annotation of textual documents from a domain specific point of view. These include:

      • domain specific senses

      • terms and relations

      • available resources and tools (medical domain)

      Practical Component: Students are invited to semantically annotate sample instances of terms and relations in (English, German) medical text. Tools are available for semi-automatic support. Also, on-line medical resources can be consulted.


  • The word sense: theory, annotation, disambiguation

      Adam Kilgarriff - University of Brighton

      The lecture component will give an overview of the concept of polysemy and explore its ramifications for NLP, including the following topics:

      • what does it mean for a word to have more than one meaning?

      • ambiguity tests and their limitations

      • how dictionaries present polysemy

      • a lexicographic perspective: lumpers versus splitters

      • corpus lexicography

      • polysemy in the dictionary and polysemy in the corpus

      • lexical creativity (incl metaphor and metonymy) and polysemy

      • Word Sense Disambiguation (WSD) - brief history

      • WSD evaluation

      • sense-tagging in SENSEVAL

      The practical component will be organised around an exercise in sense-tagging. Each student will be required to sense-tag around 200 word-instances of a language for which they are a native speaker or have near-native proficiency, according to a sense inventory from a published dictionary. This will then provide a dataset for further small-group discussion and analysis.

      Reference: Adam Kilgarriff: "I don't believe in word senses". Computers and the Humanities 31 (2), 1998, pp. 91-113. Available online at: ftp://ftp.itri.bton.ac.uk/reports/ITRI-97-12.ps.gz


  • Annotation of semantics, meaning relationships, linguistic chains, semantic roles of verbs

      Graeme Hirst - University of Toronto
      Charles Fillmore - University of California, Berkeley

      My morning presentation will begin with a description of the goals and achievements of the FrameNet research project, and will include comparisons between the FrameNet database and WordNet, familiar dictionaries, and familiar thesauri.

      This will be followed by a survey of the kinds of relations among lexical entities that are recognized in FrameNet but have not been incorporated into other net-like lexical resources. These will include:

      1. Word-sets that are best described in terms of the special grammatical constructions in which they participate. The main illustration will be words used in typical time-specifying phrases in English and the category of titles. Comparison will be made with the syntactic behavior of equivalent expressions in other languages.

      2. Words with "anaphoric zero" - that is, words which evoke semantic structures including particular arguments, but which can omit expressing those arguments under certain discourse conditions. Some of these follow semantic patterns (aspect verbs); some make use of semantic and grammatical conditions (definite relational nouns); and others seem to be lexically specific. The typological contrast between English and languages that allow more or less unrestricted pragmatic zeroes (Chinese, Japanese) will be emphasized, and comparisons will be made with other European languages.

      3. "Transparent" nouns - nouns which are the syntactic heads of phrases but which are "transparent" to collocational or selectional relations between their governors and their dependents. The labeling of transparent structures can serve as an aid to recognizing selectional relationships (by ignoring the intervening transparent structures). A project for detecting other transparent structures for similar purposes will be described. Transparent nouns include names of types, aggregates, quantities, and units.

      4. Support verbs (light verbs) and some related "Mel'cukian" lexical functions. English makes strong use of support verbs with nouns that designate events. Support verbs will be seen as figuring in (1) word sense disambiguation (compare "have an argument" with "make the argument (that)"); (2) selecting among the participants in an event (compare "give an examination" with "take an examination", or "perform an operation" with "undergo an operation"); (3) highlighting particular phases of a temporally complex event ("make a promise", "keep a promise"); (4) selecting the register of a passage ("make a complaint", "register a complaint"); etc. In some cases there are clear generalizations in the selection of support verbs ("make" with official monologic communications - statement, announcement, proclamation; "have" with dialogic or reciprocal events - argument, discussion, fight, quarrel; "give" with behavior-influencing acts - advice, instructions, warning; etc.); but in many cases the choices are idiosyncratic ("say a prayer", "wage a war").

      5. Homologues/Analogues. Many words belong in different frames, or in different subtrees in a hierarchy, but have relevant structural, functional or configurational similarity to each other. Thus "toes" and "fingers", or "knee" and "elbow", are analogous parts of the lower and upper body, respectively. The words "paw", "foot", "claw", and "hoof" are analogous body parts across animal types, as are "nose", "beak", "bill", "snout" and "trunk". Words designating recipients of a professional service include "customer", "client", "guest", and "passenger". Money payments across various activity types have frame-differentiating names like "fee", "bribe", "tip", "wages", "payment", etc., differing from each other in, among other things, their selection of support verbs. One reason for recognizing homologous word-groups is that languages differ in the degree to which they separate them lexically.

      6. Words that have similar meanings in different domains; a variety of polysemy. ("Give" as expressing ordinary gift-giving as opposed to making a contribution shows different omissibility properties. "Explain" as a communicating verb as opposed to a verb of cognition shows different aspectual behavior. And so on.)


  • Annotation of discourse (structure, co-reference)

      Dan Cristea - University of Iasi


  • Evaluation of Anaphora Resolution

      Catalina Barbu - Universities of Wolverhampton & Iasi
      Ruslan Mitkov - University of Wolverhampton

      The tutorial will provide a theoretical background for evaluation in anaphora resolution and will address practical issues in evaluation. The discussion will cover the following issues:

      • the importance of fair and consistent evaluation

      • difficulties in developing annotated corpora to be used in evaluation

      • problems with currently reported results, arising from differences in evaluation measures, evaluation data, working mode, pre-processing tools

      • as a way forward, an evaluation workbench that alleviates the problems mentioned above will be proposed and described; a demo of the evaluation workbench will be presented.

      The practical session will consist of group activities based on the evaluation issues raised in the theoretical part. The results produced during the training session will provide a basis for discussion.


  • Information access on the Web: Retrieval, Extraction and Organization

      Atsushi Fujii - University of Library and Information Science - Tokyo

      This lecture will give an overview of various computer processing methods to access textual information on the World Wide Web, focusing mainly on the following topics:

      • cross-language information retrieval, where the user presents queries in one language to retrieve Web pages in other languages,

      • corpus generation, in which textual fragments are extracted/organized based on Web pages, for specific language applications (e.g., machine translation and question answering).

      In the practical part, each student will be asked to perform monolingual/cross-language retrieval by way of a Web interface (combined with machine translation systems). For this purpose, an English-Japanese comparable document collection and English test queries will be used: while one group performs E-E monolingual retrieval, the other group performs E-J cross-language retrieval (no Japanese proficiency will be required, because a J-E MT system will be available via the Web). We will then compare and discuss the retrieval results (i.e., accuracy and time efficiency) obtained by the two groups.

      Some of my publications related to the above topics are available at http://www.ulis.ac.jp/~fujii/publication.html.


  • Domain Specific Semantic Annotation in Cross-Lingual Information Retrieval

      Paul Buitelaar - DFKI, Saarbrücken

      Lecture Component: We discuss issues in exploiting semantic annotation for cross-lingual information retrieval. These include:

      • Statistical vs. Knowledge-based Approaches
      • Query Expansion/Refinement

      Practical Component: Experiments with a cross-lingual information retrieval prototype.


  • Annotation of discourse structure

      Daniel Marcu - ISI, University of Southern California

      Most linguists agree that well-written texts have internal structure and that this structure is conveniently characterized by discourse/rhetorical relations, i.e., relations that reflect semantic and functional judgments about the text spans they connect. Yet, if one attempts to uncover the internal structure of texts, one will soon run into many difficulties. How many relations should one use? On what grounds should one define these relations? What is the granularity of the textual spans that one should consider in studying discourse-specific phenomena? Where does syntactic annotation become discourse annotation, and vice versa? In this lecture, we will discuss a few large-scale efforts aimed at building discourse-level annotations of naturally occurring texts.

      During the training session, students will use a discourse annotation tool in order to manually build the discourse structure of two small texts. We will then discuss the difficulties and problems that are inherent to this task.


  • Exploitation for summarization and discourse interpretation

      Daniel Marcu - ISI, University of Southern California

      Empirically based, statistical approaches to part-of-speech tagging, syntactic parsing, and machine translation have yielded systems that perform at higher accuracy levels than traditional rule-based systems. Yet much less empirically grounded work has been carried out in the context of summarization and discourse interpretation. In this lecture, we will review some new trends in the fields of empirically based summarization and discourse interpretation.

      During the training session, students will use existing annotated data and existing machine learning tools to build an empirically grounded component for a discourse interpretation system.


  • Exploitation for machine translation

      Ulf Hermjakob - University of Southern California
      Sergei Nirenburg - New Mexico State University

      The lectures will give an overview of various approaches to machine translation, with an emphasis on empirical methods and the language resources they need.

      • Why is machine translation (MT) hard?

      • Brief history of MT

      • Modes of use

      • MT approaches, including

        • Example-based MT

        • Interlingua-based MT

        • Symbolic machine-learning based MT

        • Statistical MT

      • Multilingual and interlingua resources, including

        • Lexical acquisition for MT, incl. multi-word expressions

        • Parallel and comparable corpora

        • Sentence and word alignment

        • Ontologies

      • MT evaluation

      The practical component will focus on how language resources and learning can be used to resolve ambiguity in translation and to overcome structural mismatches. The principal language pair for the translation exercises will be English and French.


  • Software Architecture for Language Engineering

      Hamish Cunningham - University of Sheffield
      Valentin Tablan - Universities of Sheffield & Iasi

      The tutorial will provide an introduction to Software Architecture for Language Engineering (SALE) and to Information Extraction. SALE is an area formed by the intersection of human language computation and software engineering, and it covers all aspects of providing infrastructural systems to support research and development of language processing software.

      The practical session will demonstrate GATE (the General Architecture for Text Engineering) and some IE tools integrated in GATE.


Last update: 30 July 2001 | Designed by Sabin-Corneliu Buraga