URML

An Underspecified Markup Syntax for Rhetorical Structure Annotations

Introduction

While quite a few linguistic corpora with syntactic annotations are available today, resources are scarce on the level of discourse annotation. A flexible, extensible annotation format speeds up the development of such resources. We therefore propose an XML format for annotating rhetorical structure trees. In both human and automatic analysis, rhetorical structure is often difficult to determine and is assigned incrementally. The format therefore allows for underspecification. The paper discusses the design decisions involved, illustrates the format with an example, and sketches some applications.

URML -- the document type definition (DTD)

We provide an XML grammar for URML for download: urml-base.dtd. (You may need to right-click or Ctrl-click the link to save the file to disk.)

Other DTDs may be derived from this URML DTD to fit your specific annotation needs. Derived DTDs should nevertheless implement (that is, allow for) the whole original URML language. We demonstrate this with another DTD, urml-pos.dtd, and an example XML file, urml-pos-sample.xml.

Publication

Reitter, D. & Stede, M., Step by step: underspecified markup in incremental rhetorical analysis, LINC-03, 2003.

Tools

These tools are available for free for non-commercial research purposes. They may be freely improved -- please send patches back to us.

isi2urml.perl

This tool converts the ISI corpus of rhetorical texts (Carlson et al. 2001) from its LISP-based format to URML.

Download: isi2urml.perl

tag-urml.perl

This tool tokenizes a URML corpus and tags it with part-of-speech information, using a <sign> tag for each token. It interfaces with the TnT Tagger (Brants 2000), which is available for free and delivers state-of-the-art performance. We used language models acquired from the German NEGRA corpus and the English SUSANNE corpus.

Available upon request. You'll need the TnT tagger and a language model.

extract-doc.perl

Extracts a document with a given ID from a URML corpus.

Download: extractdoc.perl

simplifyclasses.perl

Simple search/replace script to replace rhetorical relations with their subsuming categories. Must be adapted to fit your needs.

Please ask us if you need it.
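
To illustrate the idea of replacing relations with their subsuming categories, here is a minimal Python sketch. It is not the simplifyclasses.perl script itself. The relation names, the category names, the function name simplify_relations, and the attribute names in the usage lines are all assumptions chosen for illustration; substitute your own relation inventory.

```python
import re

# Hypothetical mapping from fine-grained relation labels to broader categories.
# These names are illustrative only; adapt them to your own relation inventory.
RELATION_CLASSES = {
    "elaboration-additional": "elaboration",
    "elaboration-object-attribute": "elaboration",
    "consequence": "cause",
    "result": "cause",
}


def simplify_relations(text, mapping=RELATION_CLASSES):
    """Replace each fine-grained relation label in text with its category."""
    # Try longer labels first so a label is never shadowed by a shorter one.
    labels = sorted(mapping, key=len, reverse=True)
    alternatives = "|".join(re.escape(label) for label in labels)
    # Only match whole labels: no word character or hyphen on either side.
    pattern = re.compile(r"(?<![\w-])(" + alternatives + r")(?![\w-])")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


# The attribute names below are made up for the example.
print(simplify_relations('relation="elaboration-additional"'))
print(simplify_relations('type="result" note="results"'))
```

Matching whole labels only keeps ordinary words such as "results" from being rewritten by accident, which is the main risk of a plain search/replace over a corpus file.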

urml2latex.perl

Creates RST diagrams for LaTeX, to be used with the rst package.

Usage: urml2latex.perl [-i] urml-file.xml [[document-id] analysis-id]

If no document-id is given, the program prints a list of document-ids contained in the file. Parameter -i instructs urml2latex to include the minimal discourse units directly in the tree.

Download: urml2latex.perl

separate-urml.perl

Splits a URML corpus into a training set and a test set according to a given ratio. As parameters, give the ratio, the source file, the first target file, and the second target file. Example:

./separate-urml.perl 0.8 pdm-corpus.xml pdm-training.xml pdm-test.xml

Download: separate-urml.perl
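
The ratio-based split can be sketched in Python as follows. This is an illustration under stated assumptions, not the logic of separate-urml.perl: whether the script shuffles documents or keeps corpus order is not specified here, so the sketch keeps the original order unless a seed is given. The function name split_by_ratio and the document IDs are hypothetical.

```python
import random


def split_by_ratio(items, ratio, seed=None):
    """Split items into (first, second); first gets round(ratio * len(items)) items."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    items = list(items)
    if seed is not None:
        # Optional reproducible shuffle; without a seed the order is preserved.
        random.Random(seed).shuffle(items)
    cut = round(len(items) * ratio)
    return items[:cut], items[cut:]


# Hypothetical document IDs standing in for the documents of a corpus.
doc_ids = [f"doc{i}" for i in range(10)]
training, test = split_by_ratio(doc_ids, 0.8)
print(len(training), len(test))
```

Splitting on whole documents rather than on individual discourse units keeps each rhetorical structure tree intact in exactly one of the two sets.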

The rst visualization package for LaTeX

URML data can be visualized in LaTeX.
An example diagram and corpus excerpt, drawn with rst

Click here for more information.

The Potsdam Corpus

We collected a corpus of newspaper texts and performed manual RST annotation. Two annotators worked through 173 texts. Data was converted from the annotation application format to URML.

Status: The corpus is complete and was subject to a non-blind cross-validation. It should be considered "beta" until a blind cross-validation has been performed and inter-annotator agreement measures have been calculated. Volunteers are welcome!

Availability: Please contact Manfred Stede.

Questions? Ideas? Who did all this?

David Reitter and Manfred Stede