The main purpose of C-ORAL-ROM is to support the validation of human language technologies (HLT) based on spoken Romance languages. The resource adds value at several levels: corpus design, dialogue representation, prosodic annotation, PoS tagging, multimedia storage and speech analysis. Also worth mentioning is its usefulness in creating a representative multilingual resource designed for the validation of HLT. The C-ORAL-ROM book and DVD provide a unique set of comparable corpora of spontaneous speech for the main Romance languages: French, Italian, Portuguese and Spanish.
The corpora are accompanied by comparative linguistic studies, models and standard linguistic measures of spoken language variability.

1 Introduction: problems in tagging a spoken corpus

The multi-word tagging

The Spanish C-ORAL-ROM corpus consists of 312,597 tagged tokens. Every tag marks a lexical unit, regardless of the number of graphical words it is made of.

C-ORAL-ROM - Integrated Reference Corpora for Spoken Romance Languages (2001-2004), researcher; Recursos Linguísticos para o Português: um corpus e ferramentas para a sua consulta e análise (2002-2004), researcher; ENABLER: European National Activities for Basic Language Resources (2001-2003), researcher.

The transcripts comprise a total of around 80,000 words.
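
The multi-word tagging principle stated above (one tag per lexical unit, however many graphical words it spans) can be illustrated with a minimal sketch. It is not the C-ORAL-ROM tagger: the lexicon entries, the tag labels and the function name `group_lexical_units` are hypothetical, and greedy longest-match lookup is only one possible way to group words.

```python
# Hypothetical multi-word lexicon. Entries and tag labels are illustrative
# assumptions, not the actual C-ORAL-ROM tagset or lexicon.
MULTIWORD_LEXICON = {
    ("sin", "embargo"): "ADV",
    ("a", "lo", "mejor"): "ADV",
}
MAX_LEN = max(len(key) for key in MULTIWORD_LEXICON)


def group_lexical_units(words):
    """Group graphical words into lexical units by greedy longest match.

    Returns a list of (unit, tag) pairs. Single words get tag None,
    since tagging them would be the job of a real PoS tagger.
    """
    units = []
    i = 0
    while i < len(words):
        # Try the longest candidate span first, down to two words.
        for n in range(min(MAX_LEN, len(words) - i), 1, -1):
            candidate = tuple(w.lower() for w in words[i:i + n])
            if candidate in MULTIWORD_LEXICON:
                units.append((" ".join(words[i:i + n]), MULTIWORD_LEXICON[candidate]))
                i += n
                break
        else:
            # No multi-word entry starts here: keep the single word.
            units.append((words[i], None))
            i += 1
    return units


units = group_lexical_units("a lo mejor viene sin embargo".split())
for unit, tag in units:
    print(unit, tag)
```

In this toy input, six graphical words yield three lexical units, which is why a token count such as the one given for the Spanish corpus refers to tagged units rather than to orthographic words.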