11th International Conference on Text, Speech and Dialogue
TSD 2008, Brno, Czech Republic, September 8–12, 2008
 
TSD 2008 Paper Abstracts

#3: Error Prediction-Based Semi-Automatic Segmentation of Speech

Marcin Szymanski (Adam Mickiewicz University Foundation & Poznan University of Technology, Poland), Stefan Grocholewski (Poznan University of Technology, Poland)

The manual segmentation of speech databases still outperforms automatic segmentation algorithms and, at the same time, the quality of the resulting synthetic voice depends on the accuracy of the phonetic segmentation. In this paper we describe a semi-automatic speech segmentation procedure, in which a human expert manually allocates selected boundaries prior to the automatic segmentation of the rest of the corpus. A segmentation error predictor is designed, estimated, and then used to determine the sequence of boundaries to be manually annotated by the expert. The obtained error response curves are significantly better than those of random segmentation strategies. The results are presented for two different Polish corpora.


#6: Web People Search with Domain Ranking

Zornitsa Kozareva (University of Alicante, Spain), Rumen Moraliyski (University of Beira Interior, Portugal), Gael Dias (University of Beira Interior, Portugal)

The World Wide Web is the biggest information source that people consult daily for facts and events. Studies demonstrate that 30% of searches relate to proper names such as organizations, actors, singers, books or movie titles. However, a serious problem is posed by the high level of ambiguity, where one and the same name can be shared by different individuals, or even across different proper-name categories. In order to provide faster and more relevant access to the requested information, current research focuses on clustering web pages related to the same individual. In this paper, we focus on the resolution of the web people search problem through the integration of domain information.


#9: Normalization of Temporal Information in Estonian

Margus Treumuth (University of Tartu, Estonia)

I present a model for processing temporal information in Estonian natural language. I have built a tool to convert natural-language calendar expressions to semi-formalized semantic representation. I have further used this representation to construct formal SQL constraints, which can be enforced on any relational database. The model supports compound expressions (conjunctive and disjunctive), negation and exception expressions. The system is built to work on Oracle databases and is currently running on Oracle 10g with a PHP web interface.


#12: An Extension to the Sammon Mapping

Andreas Maier, Julian Exner, Stefan Steidl, Anton Batliner, Tino Haderlein, Elmar Nöth (University of Erlangen-Nuremberg, Germany)

We present a novel method for the visualization of speakers that is microphone independent. To address the lack of microphone independence, we present two methods to reduce the influence of the recording conditions on the visualization. The first is a registration of maps created from identical speakers recorded under different conditions, i.e., different microphones and distances, in two steps: dimension reduction followed by linear registration of the maps. The second method is an extension of the Sammon mapping method, which performs a non-linear registration during the dimension reduction procedure. The proposed method surpasses the two-step registration approach, with a mapping error ranging from 17% to 24% and a grouping error close to zero.


#14: Age Determination of Children

Tobias Bocklet, Andreas Maier, Elmar Nöth (University of Erlangen-Nuremberg, Germany)

This paper focuses on the automatic determination of the age of children of preschool and primary school age. For each child a Gaussian Mixture Model (GMM) is trained, using Maximum A Posteriori (MAP) adaptation as the training method. MAP derives the speaker models from a Universal Background Model (UBM) and does not perform an independent parameter estimation. The means of each GMM are extracted and concatenated, which results in a so-called GMM supervector. These supervectors are then used as meta features for classification with Support Vector Machines (SVM) or for Support Vector Regression (SVR). The classification system achieved a precision of 83% and a recall of 66%. When the regression system was used to determine the age in years, a mean error of 0.8 years and a maximal error of 3 years was obtained. A regression with monthly accuracy yielded similar results.
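The supervector construction described in the abstract (MAP-adapt the UBM means to one speaker's data, then concatenate them) can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the MAP step is simplified to hard frame-to-component assignment instead of full posterior responsibilities, and the function names and relevance factor are assumptions.

```python
import numpy as np

def map_adapt_means(ubm_means, features, relevance=16.0):
    """MAP-adapt only the means of a diagonal-covariance UBM (simplified).

    ubm_means: (K, D) array of UBM component means
    features:  (T, D) acoustic frames from one speaker
    """
    # Hard-assign each frame to its nearest component for simplicity
    # (a real system would use posterior responsibilities instead).
    d2 = ((features[:, None, :] - ubm_means[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    adapted = ubm_means.copy()
    for k in range(ubm_means.shape[0]):
        fk = features[assign == k]
        n_k = len(fk)
        if n_k == 0:
            continue  # unseen component keeps the UBM mean
        alpha = n_k / (n_k + relevance)  # data-dependent adaptation factor
        adapted[k] = alpha * fk.mean(axis=0) + (1 - alpha) * ubm_means[k]
    return adapted

def gmm_supervector(adapted_means):
    # Concatenate the K mean vectors into one K*D "supervector",
    # which then serves as a fixed-length meta feature for SVM/SVR.
    return adapted_means.reshape(-1)
```

The resulting fixed-length vector is what makes kernel classifiers applicable to variable-length utterances.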


#15: Analysis of Hypernasal Speech in Children with Cleft Lip and Palate

Andreas Maier, Alexander Reuß, Christian Hacker, Maria Schuster, Elmar Nöth (University of Erlangen-Nuremberg, Germany)

Speech disorders often appear in children with cleft lip and palate. One major disorder among them is hypernasality. This is the first study to show that it is possible to automatically detect hypernasality in connected speech without any invasive means. To this end, we investigated MFCCs and pronunciation features. The pronunciation features are computed from phoneme confusion probabilities. Furthermore, we examine frame-level features based on the Teager Energy operator. The classification of hypernasal speech reaches up to 66.6% (CL) and 86.9% (RR) at the word level. At the frame level, rates of 62.3% (CL) and 90.3% (RR) are reached.


#17: Developing a Dialogue System: How to Grant a Customer's Directive?

Mare Koit, Olga Gerassimenko, Riina Kasterpalu, Andriela Rääbis, Krista Strandson (University of Tartu, Estonia)

Estonian phone calls to travel agencies are analyzed with the further aim of developing a dialogue system. The analysis is based on the Estonian Dialogue Corpus. Customers' initial requests introduce a topic or check the competencies of the official, and they have to be adjusted in a subsequent information-sharing sub-dialogue. Information is given briefly, using short sentences or phrases. A collaborative travel agent offers substitute information or action if s/he is unable to fulfill the customer's request. A dialogue system is being developed which gives travel information in Estonian. Ready-made sentence patterns are used in the current version for granting the users' requests. Our study will help to make the grants more natural.


#18: Performance Evaluation for Voice Conversion Systems

Todor Ganchev, Alexandros Lazaridis, Iosif Mporas, Nikos Fakotakis (University of Patras, Rion-Patras, Greece)

In the present work, we introduce a new performance evaluation measure for assessing the capacity of voice conversion systems to modify the speech of one speaker (source) so that it sounds as if it were uttered by another speaker (target). This measure relies on a GMM-UBM-based likelihood estimator that estimates the degree of proximity between an utterance of the converted voice and the predefined models of the source and target voices. The proposed approach allows the formulation of an objective criterion which is applicable both for evaluating the performance of a single system and for direct comparison (benchmarking) among different voice conversion systems. To illustrate the functionality and the practical usefulness of the proposed measure, we contrast it with four well-known objective evaluation criteria.


#21: Statistical Properties in Chinese Word Segmentation...

Wei Qiao, Maosong Sun (Tsinghua University, Beijing, China), Wolfgang Menzel (Hamburg University, Germany)

Overlapping ambiguity is a major ambiguity type in Chinese word segmentation. In this paper, the statistical properties of overlapping ambiguities are intensively studied based on observations from a very large balanced general-purpose Chinese corpus. The relevant statistics are given from different perspectives. The stability of highly frequent maximal overlapping ambiguities is tested based on statistical observations from both the general-purpose corpus and domain-specific corpora. A disambiguation strategy for overlapping ambiguities, with a predefined solution for each of the 5,507 pseudo overlapping ambiguities, is proposed accordingly, suggesting that over 42% of overlapping ambiguities in Chinese running text could be resolved without making any error. Several state-of-the-art word segmenters are used for comparison on these overlapping ambiguities. Preliminary experiments show that about 2% of the 5,507 pseudo ambiguities that are mistakenly segmented by these segmenters can be properly treated by the proposed strategy.


#23: Multilingual Weighted Codebooks for Non-native Speech Recognition

Martin Raab, Rainer Gruhn (Harman Becker Automotive Systems, Ulm, Germany), Elmar Nöth (University of Erlangen, Germany)

In many embedded systems commands and other words in the user's main language must be recognized with maximum accuracy, but it should also be possible to use foreign names, as they frequently occur in music titles or city names. Example systems with constrained resources are navigation systems, mobile phones and MP3 players. Speech recognizers on embedded systems are typically semi-continuous speech recognizers based on vector quantization. Recently we introduced Multilingual Weighted Codebooks (MWCs) for such systems. Our previous work showed significant improvements for the recognition of multiple native languages. However, open questions remained regarding performance on non-native speech. We evaluate on four different non-native accents of English, and our MWCs always produce significantly better results than a native English codebook. Our best result is a 4.4% absolute word accuracy improvement. Further experiments with non-native accented speech give interesting insights into the attributes of non-native speech in general.


#25: Spoken Document Retrieval...

Pere R. Comas, Jordi Turmo (Technical University of Catalonia, Barcelona, Spain)

This paper presents a new approach to spoken document information retrieval for spontaneous speech corpora. The classical approach to this problem is the use of an automatic speech recognizer (ASR) combined with standard information retrieval techniques. However, ASRs tend to produce transcripts of spontaneous speech with significant word error rates, which is a drawback for standard retrieval techniques. To overcome this limitation, our method is based on an approximate sequence alignment algorithm that searches for "sounds like" sequences. Our approach does not depend on extra information from the ASR and, in our experiments, outperforms the precision of state-of-the-art techniques by up to 7 points.
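The "sounds like" idea can be illustrated with a small dynamic-programming sketch that finds the cheapest match of a query phone string against any substring of a (possibly misrecognized) transcript. This is a generic approximate local alignment with uniform costs, not the paper's algorithm; all names and costs are illustrative.

```python
def sounds_like_score(query, doc, sub_cost=1.0, gap_cost=1.0):
    """Minimum edit cost of aligning `query` against any substring of `doc`.

    Phone sequences are given as strings of phone symbols. A low score
    means the query "sounds like" some region of the transcript even if
    the ASR substituted or dropped a few phones.
    """
    m, n = len(query), len(doc)
    # prev[j]: cost of aligning query[:i] ending at doc position j.
    prev = [0.0] * (n + 1)  # free start: the match may begin anywhere
    for i in range(1, m + 1):
        cur = [i * gap_cost] + [0.0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (0.0 if query[i - 1] == doc[j - 1] else sub_cost)
            cur[j] = min(sub,                    # substitute / match
                         prev[j] + gap_cost,     # skip a query phone
                         cur[j - 1] + gap_cost)  # skip a doc phone
        prev = cur
    return min(prev)  # free end: the match may stop anywhere
```

A retrieval system would rank documents by this score instead of requiring exact keyword hits in the error-prone transcript.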


#30: Energy Normalization in Automatic Speech Recognition

Nikša Jakovljević, Marko Janev, Darko Pekar, Dragiša Mišković (University of Novi Sad, Serbia)

In this paper a novel method for energy normalization is presented. The objective of this method is to remove unwanted energy variations caused by different microphone gains, varying loudness levels across speakers, and changes of a single speaker's loudness level over time. The solution presented here is based on principles used in automatic gain control. The use of this method results in a relative improvement of the performance of an automatic speech recognition system by 26%.
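The automatic-gain-control principle the abstract refers to can be sketched as follows: track a slowly varying estimate of the loudness level and rescale each frame energy by it, so that gain and loudness drift are removed while short-term dynamics survive. This is a minimal generic AGC sketch under assumed parameters, not the paper's method.

```python
import numpy as np

def agc_normalize(frame_energies, target=1.0, alpha=0.95):
    """AGC-style energy normalization of a sequence of frame energies.

    `alpha` controls how slowly the loudness tracker moves: values near
    1.0 make it follow only long-term trends (microphone gain, speaker
    loudness drift), leaving phoneme-level energy contrasts intact.
    """
    est = float(frame_energies[0])          # initial loudness estimate
    out = np.empty(len(frame_energies), dtype=float)
    for t, e in enumerate(frame_energies):
        est = alpha * est + (1 - alpha) * e  # slow tracker of loudness level
        out[t] = e * (target / max(est, 1e-10))
    return out
```

Note the key invariance: multiplying the whole input by a constant gain leaves the output unchanged, which is exactly the microphone-gain effect the method aims to cancel.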


#31: Combining Multiple Resources to Build Reliable Wordnets

Darja Fišer (University of Ljubljana, Slovenia), Benoît Sagot (INRIA, Paris, France)

This paper compares automatically generated sets of synonyms in French and Slovene wordnets with respect to the resources used in the construction process. Polysemous words were disambiguated via a five-language word alignment of the SEERA.NET parallel corpus, a subcorpus of the JRC Acquis. The extracted multilingual lexicon was disambiguated with the existing wordnets for these languages. On the other hand, a bilingual approach sufficed to acquire equivalents for monosemous words. Bilingual lexicons were extracted from different resources, including Wikipedia, Wiktionary and the EUROVOC thesaurus. A representative sample of the generated synsets was evaluated against the gold standards.


#32: Thinking in Objects: Towards an Infrastructure for Semantic...

Milena Slavcheva (Bulgarian Academy of Sciences, Sofia, Bulgaria)

This paper describes a component-driven population of a language resource consisting of semantically interpreted verb-centred structures in a cross-lingual setting. The overall infrastructure is provided by the Unified Eventity Representation (UER) - a cognitive theoretical approach to verbal semantics and a graphical formalism, based on the Unified Modeling Language (UML). The verb predicates are modeled by eventity frames which contain the eventities' components: static modeling elements representing the characteristics of participants and the relations between them; dynamic modeling elements describing the behaviour of participants and their interactions.


#38: Talking Head as Life Blog

Ladislav Kunc (Czech Technical University in Prague, Czech Republic), Jan Kleindienst (IBM, Prague, Czech Republic), Pavel Slavík (Czech Technical University in Prague, Czech Republic)

The paper describes an experimental presentation system that can automatically generate dynamic ECA-based presentations from structured data including text content, images, music and sounds, videos, etc. Thus the Embodied Conversational Agent acts as a moderator in the chosen presentation context, typically personal diaries. Since an ECA represents a rich channel for conveying both verbal and non-verbal messages, we are researching ECAs as facilitators that transpose "dry" data such as diaries and blogs into more lively and dynamic presentations based on ontologies. We built our framework on an existing toolkit, ECAF, which supports runtime generation of ECA agents. We describe the extensions of the toolkit and give an overview of the current system architecture. We describe a particular Grandma TV scenario, where a family uses the ECA automatic presentation engine to deliver weekly family news to distant grandparents. Recently conducted usability studies suggest the pros and cons of the presented approach.


#39: Variants and Homographs: Eternal Problem of Dictionary Makers

Jaroslava Hlaváčová, Markéta Lopatková (Charles University in Prague, Czech Republic)

We discuss two types of asymmetry between wordforms and their (morphological) characteristics, namely (morphological) variants and homographs. We introduce a concept of multiple lemma that allows for unique identification of wordform variants as well as 'morphologically based' identification of homographic lexemes. The deeper insight into these concepts allows further refinement of morphological dictionaries and subsequently better performance in any NLP task. We demonstrate our approach on the morphological dictionary of Czech.


#41: Affisix: Tool for Prefix Recognition

Jaroslava Hlaváčová, Michal Hrušecký (Charles University in Prague, Czech Republic)

In the paper, we present Affisix, a software tool for the automatic recognition of prefixes. On the basis of an extensive list of words in a language, it determines the segments that are candidates for prefixes. Two methods are implemented for the recognition: the entropy method and the squares method. We briefly describe the methods, propose their improvements and present the results of experiments with Czech.


#43: UU Database: A Spoken Dialogue Corpus...

Hiroki Mori, Tomoyuki Satake, Makoto Nakamura (Utsunomiya University, Japan), Hideki Kasuya (International University of Health and Welfare, Japan)

The Utsunomiya University (UU) Spoken Dialogue Database for Paralinguistic Information Studies, now available to the public, is introduced. The UU database is intended mainly for use in understanding the usage, structure and effect of paralinguistic information in expressive Japanese conversational speech. This paper describes the outline, design, building, and key properties of the UU database, to show how the corpus meets the demands of speech scientists and developers who are interested in the nature of expressive dialogue speech.


#47: Acoustic Modeling for Speech Recognition Using Limited Resources

Rok Gajšek, Janez Žibert, France Mihelič (University of Ljubljana, Slovenia)

In this article we evaluate different techniques of acoustic modeling for speech recognition in the case of limited audio resources. The objective was to build different sets of acoustic models: the first was trained on a small set of telephone speech recordings, and the other was trained on a larger database of broadband speech recordings and later adapted to a different audio environment. Different adaptation methods (MLLR, MAP) were examined in combination with different parameterization features (MFCC, PLP, RPLP). We show that using adaptation methods, which are mainly used for speaker adaptation purposes, can increase the robustness of speech recognition in cases of mismatched training and operating acoustic environment conditions.


#48: Evaluation of the Slovak Spoken Dialogue System Based on ITU-T

Stanislav Ondáš, Jozef Juhár, Anton Čižmár (Technical University of Košice, Slovakia)

The development of the Slovak spoken dialogue system started in 2006. The developed system is publicly available as a trial system and provides weather forecast and timetable information services. One very important question is how to evaluate the quality of such a system. A new method for quality assessment of the spoken dialogue system is proposed. The designed method is based on the ITU-T P.851 recommendation. Three types of questionnaires were prepared - A, B and C. The questionnaires serve for obtaining information about the user's background, completed interactions with the system, and the overall impression of the system and its services. Scenarios and a methodology for coding, classifying and rating the collected data were also developed. Six classes of quality represent the system's features. The introduced method was used for evaluation of the dialogue system and the timetable information service. This paper also summarizes the results of the performed experiment.


#49: Influence of Reading Errors on the Text-Based Evaluation of Voices

Tino Haderlein, Elmar Nöth, Andreas Maier, Maria Schuster, Frank Rosanowski (University of Erlangen-Nuremberg, Germany)

In speech therapy and rehabilitation, a patient's voice has to be evaluated by the therapist. Established methods for objective, automatic evaluation analyze only recordings of sustained vowels. However, an isolated vowel does not reflect a real communication situation. In this paper, a speech recognition system and a prosody module are used to analyze a text that was read out by the patients. The correlation between the perceptive evaluation of speech intelligibility by five medical experts and measures like word accuracy (WA), word recognition rate (WR), and prosodic features was examined. The focus was on the influence of reading errors on this correlation. The test speakers were 85 persons suffering from cancer of the larynx; 65 of them had undergone partial laryngectomy, i.e. partial removal of the larynx. The correlation between the human intelligibility ratings on a five-point scale and the machine measures was r = −0.61 for WA, r ≈ 0.55 for WR, and r ≈ 0.60 for prosodic features based on word duration and energy. The reading errors did not have a significant influence on the results. Hence, no special preprocessing of the audio files is necessary.


#52: Semantic Classes in Czech Valency Lexicon

Václava Kettnerová, Markéta Lopatková and Klára Hrstková (Charles University in Prague, Czech Republic)

We introduce a project aimed at enhancing a valency lexicon of Czech verbs with coherent semantic classes. For this purpose, we make use of FrameNet, a semantically oriented lexical resource. At the present stage, semantic frames from FrameNet have been mapped to two groups of verbs with divergent semantic and morphosyntactic properties, verbs of communication and verbs of exchange. The feasibility of this task has been proven by the achieved inter-annotator agreement - 85.9% for the verbs of communication and 78.5% for the verbs of exchange. As a result of our experiment, the verbs of communication have been classified into nine semantic classes and the verbs of exchange into ten classes, based on upper level semantic frames from FrameNet.


#55: Dialogue-Based Processing of Graphics and Graphical Ontologies

Ivan Kopeček, Radek Ošlejšek (Masaryk University, Brno, Czech Republic)

Dialogue-based processing of graphics forms a framework for enabling visually impaired people to access computer graphics. The presented approach integrates current technologies of dialogue systems, computer graphics and ontologies for the purpose of overcoming the obstacles and problems that the development of the system used for the dialogue processing of graphics entails. In this paper we focus on problems related to creating graphical ontologies and exploiting them in the annotation of graphical objects as well as in the development of dialogue strategies. We also present an illustrative example.


#57: Deep Syntactic Analysis and Rule Based Accentuation in TTS...

Antti Suni, Martti Vainio (University of Helsinki, Finland)

With the emergence of the HMM-synthesis paradigm, producing natural, expressive prosody has become viable in speech synthesis. This paper describes the development of a rule-based prominence prediction model for a Finnish text-to-speech system, based on deep syntactic analysis and discourse structure.


#59: Study on Speaker Adaptation Methods in the BNT Task

Petr Červa, Jindřich Ždánský, Jan Silovský, Jan Nouza (Technical University of Liberec, Czech Republic)

This paper deals with the use of speaker adaptation methods in the broadcast news transcription task, which is very difficult from the speaker adaptation point of view: in typical broadcast news, speakers change frequently, and their identity is not known at the time the given program is being transcribed. Due to this fact, it is often necessary to use unconventional speaker adaptation methods that can operate without knowledge of the speaker's identity and/or in an unsupervised mode. In this paper, we propose and compare several such methods that can operate in both on-line and off-line modes, and we show their performance in a real broadcast news transcription system.


#61: Dealing with Small, Noisy and Imbalanced Data

Adam Przepiórkowski (Polish Academy of Sciences & Warsaw University, Warsaw, Poland), Michał Marcinczuk (Wrocław University of Technology, Poland), Łukasz Degórski (Polish Academy of Sciences, Warsaw, Poland)

This paper deals with the task of definition extraction with a training corpus suffering from the problems of small size, high noise and heavy imbalance. A previous approach, based on manually constructed shallow grammars, turns out to be hard to improve upon even with such robust classifiers as SVMs, AdaBoost and simple ensembles of classifiers. However, a linear combination of various such classifiers and manual grammars significantly improves the results of the latter.


#62: Spoken Requests for Tourist Information: Speech Acts Annotation

Laura Hasler (University of Wolverhampton, United Kingdom)

This paper presents an ongoing corpus annotation of speech acts in the domain of tourism, which falls within a wider project on multimodal question answering. An annotation scheme and a set of guidelines are developed to mark information about the parts of spoken utterances which require a response, distinguishing them from parts of utterances which do not. The corpus used for annotation consists of transcriptions of single-speaker utterances aimed at obtaining tourist information. Inter-annotator agreement is computed between two annotators to assess the reliability of the guidelines used to facilitate their task.


#64: On the Use of MLP Features for Broadcast News Transcription

Petr Fousek, Lori Lamel, Jean-Luc Gauvain (LIMSI-CNRS, France)

Multi-Layer Perceptron (MLP) features have recently been attracting growing interest for automatic speech recognition due to their complementarity with cepstral features. In this paper the use of MLP features is evaluated in a large vocabulary continuous speech recognition task, exploring different types of MLP features and their combination. Cepstral features and three types of Bottle-Neck MLP features were first evaluated without and with unsupervised model adaptation, using models with the same number of parameters. When used with MLLR adaptation on a broadcast news Arabic transcription task, Bottle-Neck MLP features perform as well as or even slightly better than a standard 39-dimensional PLP-based front-end. This paper also explores different combination schemes (feature concatenation, cross adaptation, and hypothesis combination). Extending the feature vector by combining various feature sets led to a 9% relative word error rate reduction over the PLP baseline. Significant gains are also reported with both ROVER hypothesis combination and cross-model adaptation. Feature concatenation appears to be the most efficient combination method, providing the best gain at the lowest decoding cost.


#67: Two-Level Fusion for Emotion Classification

Ramón López-Cózar, Zoraida Callejas (University of Granada, Spain), Martin Kroul, Jan Nouza, Jan Silovský (Technical University of Liberec, Czech Republic)

This paper proposes a technique to enhance emotion classification in spoken dialogue systems by means of two fusion modules. The first combines emotion predictions generated by a set of classifiers that deal with different kinds of information about each sentence uttered by the user. To do this, the module employs several fusion methods that produce further predictions about the emotional state of the user. These predictions are the input to the second fusion module, where they are combined to deduce the user's emotional state. Experiments have been carried out considering two emotion categories ('Non-negative' and 'Negative') and classifiers that deal with prosodic, acoustic, lexical and dialogue-act information. The results show that the first fusion module significantly increases the classification rates of a baseline and of the classifiers working separately, as has been observed previously in the literature. The novelty of the technique is the inclusion of the second fusion module, which enhances the classification rate by a further 2.25% absolute.
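The two-level structure (base classifiers feed several fusion methods, whose outputs are then fused once more) can be sketched generically. Majority voting here merely stands in for the paper's fusion methods; all function names are illustrative, not the authors' API.

```python
from collections import Counter

def majority_vote(predictions):
    """One possible first-level fusion method: pick the most frequent
    emotion label among the base classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]

def two_level_fusion(classifier_outputs, fusion_methods):
    """Level 1: each fusion method combines the base classifiers'
    predictions into one label. Level 2: the methods' outputs are
    themselves fused (again by majority vote in this sketch)."""
    first_level = [method(classifier_outputs) for method in fusion_methods]
    return majority_vote(first_level)
```

With prosodic, acoustic, lexical and dialogue-act classifiers each emitting 'Negative' or 'Non-negative', the second level adjudicates among the first-level combiners rather than among the raw classifiers.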


#70: Making Speech Technologies Available in (Serviko) Romani Language

Milan Rusko, Sakhia Darjaa, Marián Trnka (Institute of Informatics of the Slovak Academy of Sciences, Bratislava, Slovakia), Viliam Zeman (Office of the Plenipotentiary of the Government of the Slovak Republic for Romanies, Bratislava, Slovakia), Juraj Glovňa (Constantine the Philosopher University in Nitra, Slovakia)

The language of the Romanies seems not to be commercially interesting for big companies. Not only are the majority of people from the Roma community very poor, but the language itself is also very difficult to work with, because it is extremely rich in local dialects and a standardized form that would be accepted by the majority of Romanies does not exist. The Romani language thus belongs to the "digitally endangered languages". The paper gives a short description of the Romani language in Slovakia. An effort to design the basic tools needed to start using Romani speech and language in computer technologies is presented. As the authors are familiar with speech synthesis, they have chosen the building of several types of speech synthesizers in Romani as a pilot project. The paper briefly summarizes some facts on Romani orthography, phonetics, and prosody. The design of the text corpus, diphone set, and speech database is described. The application part of the paper presents Romani synthesizers - both diphone and unit-selection - some of which are bilingual (Romani-Slovak). A demo of the synthesis can be tried on the authors' web page.


#73: Dialogue Based Text Editing

Jaromír Plhák (Masaryk University, Brno, Czech Republic)

This paper presents the basic principles for text editing by means of dialogue. First, the usage of the text division algorithm is discussed as well as its enhancement. Then, the dialogue text processing interface which co-operates with the voice synthesizer is described. We propose basic functions, formulate the most notable problems and suggest and discuss their possible solutions.


#75: Architecture Model and Tools for Perceptual Dialog Systems

Jan Cuřín, Jan Kleindienst (IBM, Prague, Czech Republic)

In this paper, we present an architecture model for context-aware dialog-based services. We consider the term "context" in a broad meaning, including presence and location of humans and objects, human behavior, human-to-human or human-to-computer interaction, activities of daily living, etc. We expect that the surrounding environment from which context is gathered is a "smart environment", i.e. a space (such as an office, house, or public area) equipped with different sets of sensors, including audio and visual perception. Designing the underlying perceptual systems is a non-trivial task which involves an interdisciplinary effort dealing with the integration of voice and image recognition technologies, situation modeling middleware, and context-aware interfaces into a robust and self-manageable software framework. To support fast development and tuning of dialog-based services, we introduce a simulation framework compliant with the proposed architecture. The framework is capable of gathering information from a broad set of sensors, of event-based abstraction of such information, of interaction with an integrated dialog system, and of a virtual representation of a smart environment in schematic 2D or realistic 3D projections. We elaborate on use cases of the architecture model, referring to two projects in which our system was successfully deployed.


#78: First Steps in Building a Verb Valency Lexicon for Romanian

Ana-Maria Barbu (Romanian Academy, Bucharest, Romania)

This paper presents the steps of manually building a verb valency lexicon for Romanian. We refer to some major previous works, focusing on their information representation. We select the information relevant for the different stages of our project and show the conceptual problems encountered during the first phase. Finally, we present the gradual procedure for building the lexicon and exemplify the manner in which information is represented in a lexicon entry.


#79: Hyponymy Patterns: Semi-Automatic Extraction, Evaluation...

Verginica Barbu Mititelu (Romanian Academy, Bucharest, Romania)

We present an experiment in which we identified hyponymy patterns in corpora for English and for Romanian in two different ways. Such an experiment is interesting both from a computational linguistic perspective and from a theoretical linguistic one. On the one hand, a set of hyponymy patterns is useful for the automatic creation or enrichment of an ontology, and for tasks such as document indexing, information retrieval, and question answering. On the other hand, these patterns can be used in work concerned with this semantic relation (hyponymy), as they are more numerous and have been evaluated, as opposed to those "discovered" through observation of text or, rather, introspection. One can see how hyponymy is realized in text according to the stylistic register to which the text belongs, and a comparison between such patterns in two different languages is also made possible.
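Hyponymy patterns of the kind discussed above can be illustrated with a classic Hearst-style "X such as A, B and C" matcher. This is a deliberately naive regex sketch, not the paper's induced pattern set; real patterns need part-of-speech context to avoid over-matching.

```python
import re

def extract_hyponyms(text):
    """Extract (hyponym, hypernym) pairs from 'X such as A, B and C'
    constructions. Naive: the enumeration is assumed to end at the
    first character outside [word, space, comma]."""
    pairs = []
    for m in re.finditer(r"(\w+) such as ([\w ,]+)", text):
        hypernym = m.group(1)
        # Split the enumeration on commas and a final 'and'.
        for hyponym in re.split(r"\s*,\s*|\s+and\s+", m.group(2)):
            if hyponym:
                pairs.append((hyponym.strip(), hypernym))
    return pairs
```

Pairs harvested this way are exactly the raw material an ontology-enrichment step would filter and rank.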


#80: Shedding Light on a Troublesome Issue in NLIDBS

Rodolfo Pazos (CENIDET & ITCM, Mexico), René Santaolalaya S., Juan C. Rojas P., Joaquín Pérez O. (CENIDET, Mexico)

A natural language interface to databases (NLIDB) without help mechanisms that permit clarifying queries is prone to incorrect query translation. In this paper we draw attention to a problem in NLIDBs that has been overlooked and has not been dealt with systematically: word economy, i.e., the omission of words when expressing a query in natural language (NL). In order to get an idea of the magnitude of this problem, we conducted experiments on EnglishQuery applied to a corpus of economized-wording queries. The results show that the percentage of correctly answered queries is 18%, which is substantially lower than the percentages obtained with corpora of regular queries (53%-83%). In this paper we describe a typification of the problems found in economized-wording queries, which has been used to implement domain-independent dialog processes for an NLIDB in Spanish. The incorporation of dialog processes in an NLIDB permits users to clarify queries in NL, thus improving the percentage of correctly answered queries. This paper presents tests of a dialog manager that deals with four types of query problems and improves the percentage of correctly answered queries from 60% to 91%. Due to the generality of our approach, we claim that it can be applied to other domain-dependent or domain-independent NLIDBs, as well as to other NLs such as English, French, Italian, etc.


#81: Repeats and Self-Repairs Detection in French Transcribed Speech

Rémi Bove (University of Provence, Aix-en-Provence, France)

We present in this paper the results of a tagged-corpus-based study of two kinds of disfluencies (repeats and self-repairs) in a corpus of spontaneous spoken French. This work first investigates the linguistic features of both phenomena, and then shows how to handle repeats and self-repairs, starting from corpus output tagged with TreeTagger, using a word n-gram model and rule-based pattern matching. Some results on a test corpus are finally presented.


#82: Where Do Parsing Errors Come From: the Case of Spoken Estonian

Kaili Müürisep, Helen Nigol (University of Tartu, Estonia)

This paper discusses some issues in developing a parser for spoken Estonian which is based on an existing parser for written language and employs the Constraint Grammar framework. Using a corpus of face-to-face everyday conversations as the training and testing material, the parser achieved a recall of 97.6% and a precision of 91.8%. Parsing institutional phone calls turned out to be a more complicated task, with recall dropping by 3%. In this paper, we focus on parsing non-fluent speech using a rule-based parser. We give an overview of parsing errors and ways to overcome them.


#84: Emulating TRFs of Higher Level Auditory Neurons for ASR

Garimella S.V.S. Sivaram, Hynek Hermansky (IDIAP Research Institute, Martigny, Swiss Federal Institute of Technology, Lausanne, Switzerland)

This paper proposes modifications to the Multi-resolution RASTA (MRASTA) feature extraction technique for automatic speech recognition (ASR). By emulating asymmetries of the temporal receptive field (TRF) profiles of higher-level auditory neurons, we obtain more than 11.4% relative improvement in word error rate on the OGI-Digits database. Experiments on the TIMIT database confirm that the proposed modifications are indeed useful.


#85: Acquisition of Telephone Data from Radio Broadcasts

Oldřich Plchot, Valiantsina Hubeika, Lukáš Burget, Petr Schwarz, Pavel Matějka (Brno University of Technology, Czech Republic)

This paper presents a procedure for acquiring linguistic data from broadcast media and its use in language recognition. The goal of this work is to answer the question whether data automatically obtained from broadcasts can replace or augment continuous telephone speech. The main challenges are channel compensation issues and the large proportion of non-spontaneous speech in broadcasts. The experimental results are obtained with an NIST LRE 2007 evaluation system, using both NIST-provided training data and data obtained from broadcasts.


#86: Perceptually Motivated Sub-Band Decomposition

Petr Motlíček (IDIAP Research Institute, Martigny, Switzerland, Brno University of Technology, Czech Republic), Sriram Ganapathy (IDIAP Research Institute, Martigny, Switzerland, EPFL, Switzerland), Hynek Hermansky (IDIAP Research Institute, Martigny, Switzerland, Brno University of Technology, Czech Republic, EPFL, Switzerland), Harinath Garudadri (Qualcomm Inc., San Diego, United States) Marios Athineos (International Computer Science Institute, Berkeley, United States)

This paper describes the use of non-uniform QMF decomposition to increase the efficiency of a generic wide-band audio coding system based on Frequency Domain Linear Prediction (FDLP). The baseline FDLP codec, operating at high bit-rates (~136 kbps), exploits a uniform QMF decomposition into 64 sub-bands followed by sub-band processing based on FDLP. Here, we propose a non-uniform QMF decomposition into 32 frequency sub-bands obtained by merging 64 uniform QMF bands. The merging operation is performed in such a way that the bandwidths of the resulting critically sampled sub-bands emulate the characteristics of the critical band filters in the human auditory system. This frequency decomposition, when employed in the FDLP audio codec, results in a bit-rate reduction of 40% over the baseline. We also describe the complete audio codec, which provides high-fidelity audio compression at ~66 kbps. In subjective listening tests, the FDLP codec outperforms MPEG-1 Layer 3 (MP3) and achieves quality similar to the MPEG-4 HE-AAC codec.


#87: The Generation of Emotional Expressions for a Dialogue Agent

Siska Fitrianie (Delft University of Technology, Netherlands), Leon J.M. Rothkrantz (Delft University of Technology & Netherlands Defence Academy, Netherlands)

Emotion influences the choice of facial expression. In a dialogue, the emotional state is co-determined by the events that happen during the dialogue. To enable rich, human-like expressivity of a dialogue agent, the facial displays should correctly express the state of the agent in the dialogue. This paper reports on our study of how to appropriately express emotions in face-to-face communication. We analyzed the appearance of facial expressions and the corresponding dialogue text (in balloons) of characters in selected cartoon illustrations. From the facial expressions and dialogue text, we independently extracted the emotional state and the communicative function. We also collected emotion words from the dialogue text. The emotional state labels and the emotion words are represented along the two dimensions "arousal" and "valence". On this basis, the relationship between facial expressions and text was explored. The final goal of this research is to develop emotional-display rules for a text-based dialogue agent.


#88: A Framework for Analysis and Prosodic Feature Annotation

Dimitris Spiliotopoulos, Georgios Petasis, Georgios Kouroupetroglou (National and Kapodistrian University of Athens, Greece)

Concept-to-Speech systems include Natural Language Generators that produce linguistically enriched text descriptions, which can lead to significantly improved quality of speech synthesis. There are cases, however, where either the generator modules produce pieces of non-analyzed, non-annotated plain text, or such modules are not available at all. Moreover, the language analysis is restricted by the usually limited domain coverage of the generator due to its embedded grammar. This work reports on a language-independent framework, linguistic resources and language analysis procedures (word/sentence identification, part-of-speech tagging, prosodic feature annotation) for the annotation and processing of plain or enriched text corpora. It aims to produce automated XML-annotated enriched prosodic markup for English and Greek texts, for improved synthetic speech. The markup includes information both for training the synthesizer and as actual input for synthesis. Depending on the domain and target, different methods may be used for the automatic classification of entities (words, phrases, sentences) into one or more preset categories such as "emphatic event", "new/old information", "second argument to verb", "proper noun phrase", etc. The prosodic features are classified, according to an analysis of the speech-specific characteristics, by their role in prosody modelling, and passed on to the synthesizer via an extended SOLE-ML description. Evaluation results show that high accuracy is achieved using selectable hybrid methods for part-of-speech tagging. Annotation of a large generated text corpus containing 50% enriched text and 50% canned plain text produces a fully annotated uniform SOLE-ML output containing all prosodic features found in the initial enriched source. Furthermore, additional automatically derived prosodic feature annotations and speech-synthesis-related values are assigned, such as word placement in sentences and phrases, previous- and next-word entity relations, emphatic phrases containing proper nouns, and more.


#92: Prosodic Phrases and Semantic Accents in Speech Corpus

Jan Romportl (University of West Bohemia, Czech Republic)

We describe a statistical method for assignment of prosodic phrases and semantic accents in read speech data. The method is based on statistical evaluation of listening test data by a maximum-likelihood approach with parameters estimated by an EM algorithm. We also present linguistically relevant quantitative results about the prosodic phrase and semantic accent distribution in 250 Czech sentences.


#95: Word Sense Disambiguation with Semantic Networks

George Tsatsaronis, Iraklis Varlamis, Michalis Vazirgiannis (Athens University of Economics and Business, Athens, Greece)

Word sense disambiguation (WSD) methods are evolving towards exploiting all of the available semantic information that word thesauri provide. In this scope, the use of semantic graphs and new measures of semantic relatedness may offer better WSD solutions. In this paper we propose a new measure of semantic relatedness between any pair of terms for the English language, using WordNet as our knowledge base. Furthermore, we introduce a new WSD method based on the proposed measure. Experimental evaluation of the proposed method on benchmark data shows that it matches or surpasses state-of-the-art results. Moreover, we evaluate the proposed measure of semantic relatedness on pairs of terms ranked by human subjects. The results reveal that our measure produces a ranking more similar to the human-generated one than the rankings generated by other related measures of semantic relatedness proposed in the past.
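The abstract does not give the definition of the proposed relatedness measure. As a generic illustration of how a graph-based relatedness score over a semantic network can be computed, here is a minimal inverse-path-length sketch over a hand-built toy network; all words and links below are illustrative assumptions, not actual WordNet data:

```python
from collections import deque

# Toy semantic network; edges are undirected semantic links.
# These words and links are illustrative only.
EDGES = {
    ("cat", "feline"), ("feline", "animal"),
    ("dog", "canine"), ("canine", "animal"),
    ("car", "vehicle"),
}

def neighbours(node):
    """Yield all nodes directly linked to `node` in the toy network."""
    for a, b in EDGES:
        if a == node:
            yield b
        if b == node:
            yield a

def relatedness(w1, w2):
    """Inverse shortest-path relatedness: 1 / (1 + path length in edges),
    0.0 when the two words are not connected at all."""
    if w1 == w2:
        return 1.0
    seen, frontier = {w1}, deque([(w1, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for nxt in neighbours(node):
            if nxt == w2:
                # nxt is (dist + 1) edges away from w1.
                return 1.0 / (2 + dist)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return 0.0
```

With this toy graph, directly linked words score 0.5, words joined through the shared "animal" node score 0.2, and unconnected words score 0.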


#97: Information Search Modelling: Analysis of a Human Dialog Corpus

Alain Loisel, Jean-Philippe Kotowicz, Nathalie Chaignaud (INSA of Rouen, Mont-Saint-Aignan, France)

We aim at improving the health information search engine CISMeF by including a conversational agent that interacts with the user in natural language. To study the cognitive processes involved during information search, a bottom-up methodology was adopted. An experiment was set up to obtain human dialogs related to such searches. In this article, the emphasis lies on the analysis of these human dialogs.


#99: Accent and Channel Adaptation for Use in a Telephone-Based System

Kinfe Tadesse Mengistu, Andreas Wendemuth (Otto-von-Guericke University, Magdeburg, Germany)

An utterance conveys not only the intended message but also information about the speaker's gender, accent, age group, etc. In a spoken dialog system, this information can be used to improve speech recognition for a target group of users who share common vocal characteristics. In this paper, we describe various approaches to adapting acoustic models trained on native English data to the vocal characteristics of German-accented English speakers. We show that a significant performance boost can be achieved by using speaker adaptation techniques such as Maximum Likelihood Linear Regression (MLLR), Maximum a Posteriori (MAP) adaptation, and a combination of the two for the purpose of accent adaptation. We also show that a promising performance gain can be obtained through cross-language accent adaptation, where native German speech from a different application domain is used as enrollment data. Moreover, we show the use of MLLR for telephone channel adaptation.


#100: Sentiment Detection Using Lexically-Based Classifiers

Ben Allison (University of Sheffield, United Kingdom)

This paper addresses the problem of supervised sentiment detection using classifiers which are derived from word features. We argue that, while the literature has suggested the use of lexical features is inappropriate for sentiment detection, a careful and thorough evaluation reveals a less clear-cut state of affairs. We present results from five classifiers using word-based features on three tasks, and show that the variation between classifiers can often be as great as has been reported between different feature sets with a fixed classifier. We are thus led to conclude that classifier choice plays at least as important a role as feature choice, and that in many cases word-based classifiers perform well on the sentiment detection task.
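The abstract compares several classifiers over word-based features without naming them. As one common representative of such lexically-based classifiers, here is a minimal add-one-smoothed Naive Bayes sketch over bag-of-words features (the training data below is a made-up toy example, not from the paper):

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (label, list-of-words) pairs.
    Returns the counts needed for add-one-smoothed Naive Bayes."""
    word_counts = {}          # label -> Counter of word occurrences
    doc_counts = Counter()    # label -> number of training documents
    vocab = set()
    for label, words in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(model, words):
    """Return the label with the highest posterior log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_lp = None, float("-inf")
    for label in doc_counts:
        lp = math.log(doc_counts[label] / total_docs)   # log prior
        n = sum(word_counts[label].values())
        for w in words:
            if w in vocab:  # ignore words never seen in training
                lp += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

A classifier this simple already uses only lexical evidence, which is exactly the feature regime the paper evaluates across five classifier families.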


#101: An Empirical Bayesian Method for Detecting Out of Context Words

Sanaz Jabbari, Ben Allison, Louise Guthrie (University of Sheffield, United Kingdom)

In this paper, we propose an empirical Bayesian method for determining whether a word is used out of context. We suggest we can treat a word's context as a multinomially distributed random variable, and this leads us to a simple and direct Bayesian hypothesis test for the problem in question. We demonstrate this method to be superior to a method based upon common practice in the literature. We also demonstrate how an empirical Bayes method, whereby we use the behaviour of other words to specify a prior distribution on model parameters, improves performance by an appreciable amount where training data is sparse.
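The exact test statistic and priors are not given in the abstract; as a simplified sketch of the underlying idea (treating the observed context as multinomial and comparing how well the word's own context model explains it versus a background model), consider this log-likelihood-ratio toy, where both models and the threshold are illustrative assumptions:

```python
import math

def multinomial_loglik(counts, probs):
    """Log-likelihood of observed context word counts under a multinomial.
    The multinomial coefficient is omitted: it cancels in the ratio below."""
    return sum(c * math.log(probs[w]) for w, c in counts.items())

def out_of_context(counts, word_model, background, threshold=0.0):
    """Flag the target word as out of context when the background model
    explains the observed context better than the word's own model."""
    llr = (multinomial_loglik(counts, word_model)
           - multinomial_loglik(counts, background))
    return llr < threshold
```

In the full empirical Bayes setting described in the abstract, the model parameters would additionally carry a prior estimated from the behaviour of other words, which matters most when training counts are sparse.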


#102: Identification of Speakers from Their Hum

Hemant A. Patil, Robin Jain, Prakhar Jain (Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, India)

Automatic Speaker Recognition (ASR) is an economical biometric method because of the availability of low-cost and powerful processors. An ASR system will be efficient if the proper speaker-specific features are extracted. Most state-of-the-art ASR systems use the natural speech signal (either read or spontaneous speech) from the subjects. In this paper, an attempt is made to identify speakers from their hum. Experiments are shown for Linear Prediction Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), and Mel Frequency Cepstral Coefficients (MFCC) as input feature vectors to a polynomial classifier of second-order approximation. Results are found to be better for MFCC than for LP-based features.


#103: HMM-Based Speech Synthesis for the Greek Language

Sotiris Karabetsos, Pirros Tsiakoulis, Aimilios Chalamandaris, Spyros Raptis (Institute for Language and Speech Processing, Athens, Greece)

The success and dominance of Hidden Markov Models (HMM) in the field of speech recognition is now extending to the area of speech synthesis, since HMMs provide a generalized statistical framework for efficient parametric speech modeling and generation. In this work, we describe the adaptation, implementation and evaluation of the HMM speech synthesis framework for the Greek language. Specifically, we detail both the development of the training speech databases and the implementation issues related to the particular characteristics of the Greek language. Experimental evaluation shows that the developed text-to-speech system is capable of producing adequately natural speech in terms of intelligibility and intonation.


#104: Web-Based Lemmatisation of Named Entities

Richárd Farkas (MTA-SZTE, Szeged, Hungary), Veronika Vincze, István Nagy (University of Szeged, Hungary), Róbert Ormándi, György Szarvas (MTA-SZTE, Szeged, Hungary), Attila Almási (University of Szeged, Hungary)

Identifying the lemma of a Named Entity is important for many Natural Language Processing applications like Information Retrieval. Here we introduce a novel approach for Named Entity lemmatisation which utilises the occurrence frequencies of each possible lemma. We constructed four corpora in English and Hungarian and trained machine learning methods using them to obtain simple decision rules based on the web frequencies of the lemmas. In experiments our web-based heuristic achieved an average accuracy of nearly 91%.
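The decision rules learned by the paper are not reproduced in the abstract. As a toy sketch of the core idea (generate candidate lemmas and keep the one with the highest web occurrence frequency), the suffix list and frequency table below are purely hypothetical stand-ins for real Hungarian morphology and real web hit counts:

```python
# Hypothetical suffix list; actual Hungarian case suffixes are far more numerous.
SUFFIXES = ["ban", "ben", "nak", "nek", "t"]

def candidate_lemmas(form):
    """The surface form itself plus every plausible suffix-stripped variant."""
    cands = {form}
    for s in SUFFIXES:
        if form.endswith(s) and len(form) > len(s) + 2:
            cands.add(form[: -len(s)])
    return cands

def pick_lemma(form, freq):
    """Decision rule: the candidate with the highest (here simulated)
    web frequency wins."""
    return max(candidate_lemmas(form), key=lambda c: freq.get(c, 0))
```

For a Named Entity like "Londonban" ("in London"), the bare lemma typically occurs far more often on the web than any inflected form, which is what makes the frequency heuristic work.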


#105: Syllable Based Language Model for Continuous Speech Recognition

Piotr Majewski (University of Łódz, Poland)

Most state-of-the-art large vocabulary continuous speech recognition systems use word-based n-gram language models. Such models are not an optimal solution for inflectional or agglutinative languages. Polish is highly inflectional and requires very large corpora to create a sufficient language model with a small out-of-vocabulary ratio. We propose a syllable-based language model, which is better suited to a highly inflectional language like Polish. When resources are scarce (i.e. with small corpora), the syllable-based model outperforms word-based models in terms of the number of out-of-vocabulary units (syllables in our model). Such a model is an approximation of a morpheme-based model for Polish. In our paper, we show the results of an evaluation of the syllable-based model and its usefulness in speech recognition tasks.
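The out-of-vocabulary advantage of syllable units can be made concrete with a small sketch. The function below computes the OOV rate of a test sequence against a training vocabulary of arbitrary units (words or syllables); the tiny "corpus" in the usage note is a made-up illustration, not the paper's data:

```python
def oov_rate(train_units, test_units):
    """Fraction of test tokens whose unit type never occurred in training.
    Units may be words, syllables, or any other sub-word token."""
    vocab = set(train_units)
    unseen = sum(1 for u in test_units if u not in vocab)
    return unseen / len(test_units)
```

With training words ["mama", "tata"], an unseen test word like "mata" is 100% OOV at the word level, yet 0% OOV at the syllable level (["ma", "ta"]), since both syllables were observed in training; this is the effect that favours syllable models for inflectional languages with limited corpora.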


#106: Exploiting Contextual Information for Speech/Non-Speech Detection

Sree Hari Krishnan Parthasarathi, Petr Motlíček, Hynek Hermansky (IDIAP Research Institute, Martigny, Switzerland)

In this paper, we investigate the effect of temporal context on speech/non-speech detection (SND). It is shown that even a simple feature such as full-band energy, when employed with a large enough context, shows promise for further investigation. Experimental evaluations on the test data set, using the F-measure, show absolute performance gains of 4.4% and 5.4% over a state-of-the-art multi-layer perceptron (MLP) based SND system and a simple energy-threshold-based SND method, respectively. The optimal context length was found to be 1000 ms. Further numerical optimization yields an additional improvement (3.37% absolute), resulting in absolute gains of 7.77% and 8.77% over the MLP-based and energy-based methods, respectively. ROC-based performance evaluation also reveals promising performance for the proposed method, particularly in low-SNR conditions.
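As a toy illustration of energy-based SND with temporal context (this does not reproduce the paper's system; the frame length, threshold, and context size below are arbitrary assumptions), each frame's full-band energy is averaged over a symmetric window of neighbouring frames before thresholding:

```python
def frame_energies(samples, frame_len):
    """Full-band energy of consecutive non-overlapping frames."""
    return [sum(x * x for x in samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def detect_speech(energies, context, threshold):
    """Label each frame as speech (True) or non-speech (False) using
    energy averaged over +/- `context` frames; the abstract reports an
    optimal context of roughly 1000 ms."""
    labels = []
    for i in range(len(energies)):
        lo, hi = max(0, i - context), min(len(energies), i + context + 1)
        avg = sum(energies[lo:hi]) / (hi - lo)
        labels.append(avg > threshold)
    return labels
```

Widening the context smooths out brief energy dips inside speech and brief bursts inside silence, which is why even this bare feature benefits from a long window.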


#108: Active Tags for Semantic Analysis

Ivan Habernal, Miloslav Konopík (University of West Bohemia, Pilsen, Czech Republic)

We propose a new method for semantic analysis. The method is based on handwritten context-free grammars enriched with semantic tags. Associating the rules of a context-free grammar with semantic tags is beneficial; however, after parsing, the tags are spread across the parse tree and it is usually hard to extract the complete semantic information from it. Thus, we developed an easy-to-use and yet very powerful mechanism for tag propagation, which allows the semantic information to be easily extracted from the parse tree. The propagation mechanism is based on the idea of adding propagation instructions to the semantic tags; tags with such instructions are called active tags in this article. Using the proposed method we developed a useful tool for semantic parsing that we offer for free on our web pages.


#111: Advances in Acoustic Modeling for the Recognition of Czech

Jiří Kopecký, Ondřej Glembek, Martin Karafiát (Brno University of Technology, Czech Republic)

This paper presents recent advances in Automatic Speech Recognition for the Czech language. Improvements were achieved in both acoustic and language modeling; we mainly focus on the acoustic part. The results are presented in two contexts: lecture recognition and the SpeeCon+Temic test set. The paper shows the impact of using advanced modeling techniques such as HLDA, VTLN and CMLLR. On the lecture test set, we show that training acoustic models using word networks together with the pronunciation dictionary gives about 4--5% absolute performance improvement over using direct phonetic transcriptions. Incorporating the "schwa" phoneme in the training phase yields a slight further improvement.


#117: Temporal Issues and WER on the Capitalization of Spoken Texts

Fernando Batista (L2F & ISCTE, Lisboa, Portugal), Nuno Mamede (L2F & IST, Lisboa, Portugal), Isabel Trancoso (L2F & ISCTE, Lisboa, Portugal)

This paper investigates the capitalization task for Broadcast News speech transcriptions. Most of the capitalization information is provided by two large newspaper corpora, and the spoken language model is produced by retraining the newspaper language models with spoken data. Three corpus subsets from different time periods are used for evaluation, revealing the importance of training data from nearby time periods. Results are provided for both manual and automatic transcriptions, also showing the impact of recognition errors on the capitalization task. Our approach is based on maximum entropy models and uses an unlimited vocabulary. The language model produced with this approach can be sorted and then pruned in order to reduce computational resources, without much impact on the final results.


#121: Statistical WSD for Russian Nouns Denoting Physical Objects

Olga Mitrofanova (St. Petersburg State University, Russia), Olga Lashevskaya (Institute of the Russian Language, Moscow, Russia), Polina Panicheva (St. Petersburg State University, Russia)

The paper presents experimental results on automatic word sense disambiguation (WSD). Contexts for polysemous and/or homonymic Russian nouns denoting physical objects serve as the empirical basis of the study. Sets of contexts were extracted from the Russian National Corpus (RNC). Machine learning software for WSD was developed within the framework of the project; the WSD tool used in the experiments is aimed at the statistical processing and classification of noun contexts. The WSD procedure was performed taking into account lexical markers of word meanings in contexts and the semantic annotation of contexts. The experiments allowed us to define optimal conditions for WSD in Russian texts.


#122: Automatic Semantic Annotation of Polish Dialogue Corpus

Agnieszka Mykowiecka, Małgorzata Marciniak, Katarzyna Głowińska (Polish Academy of Sciences, Warsaw, Poland)

In this paper we present a method for the automatic annotation of transliterated spontaneous human-human dialogues at the level of domain attributes. It has been used for the preparation of an annotated corpus of dialogues within the LUNA project. We describe the domain ontology, the process of manual rule creation, the annotation schema and the evaluation.


#123: The Verb Argument Browser

Bálint Sass (Pázmány Péter Catholic University, Budapest)

We present a special corpus query tool, the Verb Argument Browser, which is suitable for investigating the argument structure of verbs. It can answer the following typical research question: what are the salient words that can appear in a free position of a given verb frame? In other words: what are the most important collocates of a given verb (or verb frame) in a particular morphosyntactic position? At present, the Hungarian National Corpus is integrated, but the methodology can be extended to other languages and corpora. The application has been of significant help in building lexical resources (e.g. the Hungarian WordNet) and can be useful in any lexicographic work or even in language teaching. The tool is available online at corpus.nytud.hu/vab (username: tsd, password: vab).
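The tool's actual salience measure is not stated in the abstract; pointwise mutual information (PMI) is one common choice for ranking slot fillers, sketched here on a made-up handful of (verb, filler) pairs:

```python
import math
from collections import Counter

def salient_fillers(frames, verb):
    """Rank words filling a verb's slot by pointwise mutual information.
    frames: list of (verb, filler) pairs extracted from a corpus.
    PMI(v, w) = log( count(v, w) * N / (count(v) * count(w)) )."""
    pair = Counter(frames)
    verb_c = Counter(v for v, _ in frames)
    word_c = Counter(w for _, w in frames)
    n = len(frames)
    scores = {}
    for (v, w), c in pair.items():
        if v == verb:
            scores[w] = math.log((c * n) / (verb_c[v] * word_c[w]))
    return sorted(scores, key=scores.get, reverse=True)
```

PMI promotes fillers that are specific to the verb over fillers that are merely frequent everywhere, which is the intuition behind "salient" collocates.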


#127: A Comparison of Language Models for Dialog Act Segmentation...

Jáchym Kolář (University of West Bohemia, Pilsen, Czech Republic)

This paper compares language modeling techniques for dialog act segmentation of multiparty meetings. The evaluation is twofold; we search for a convenient representation of textual information and an efficient modeling approach. The textual features capture word identities, parts-of-speech, and automatically induced classes. The models under examination include hidden event language models, maximum entropy, and BoosTexter. All presented methods are tested using both human-generated reference transcripts and automatic transcripts obtained from a state-of-the-art speech recognizer.


#132: Prosody Evaluation for Speech-Synthesis Systems

France Mihelič, Boštjan Vesnicer, Janez Žibert (University of Ljubljana, Slovenia), Elmar Nöth (University of Erlangen-Nuremberg, Germany)

We present an objective evaluation method for prosody modeling in an HMM-based Slovene speech-synthesis system. The method is based on the automatic recognition of syntactic-prosodic boundary positions and accented words in the synthetic speech. We show that the recognition results closely match the prosodic notations labeled by a human expert on the natural-speech counterpart used to train the speech-synthesis system. The recognition rate of the prosodic events is proposed as an objective evaluation measure for the quality of prosodic modeling in the speech-synthesis system. The results of the proposed evaluation method are also in accordance with previous subjective listening assessments, in which high scores for naturalness were observed for this type of speech synthesis.


#135: Effects of Segmentation and F0 Errors on Emotion Recognition

Stefan Steidl, Anton Batliner, Elmar Nöth, Joachim Hornegger (University of Erlangen-Nuremberg, Germany)

Prosodic features modelling pitch, energy, and duration play a major role in speech emotion recognition. Our word level features, especially duration and pitch features, rely on correct word segmentation and F0 extraction. For the FAU Aibo Emotion Corpus, the automatic segmentation of a forced alignment of the spoken word sequence and the automatically extracted F0 values have been manually corrected. Frequencies of different types of segmentation and F0 errors are given and their influence on emotion recognition using different groups of prosodic features is evaluated. The classification results show that the impact of these errors on emotion recognition is small.


#137: A Computational Framework to Integrate Different Semantic Resources

Qiang Zhou (Tsinghua University, Beijing, China)

In recent years, many large-scale semantic resources have been built in the NLP community, but how to apply them to real text semantic parsing is still a big problem. In this paper, we propose a new computational framework to deal with this problem. Its key parts are a lexical semantic ontology (LSO) representation that integrates the abundant information contained in current semantic resources, and an LSO schema that automatically reorganizes all this semantic knowledge in a hierarchical network. We introduce an algorithm to build the LSO schema in a three-step procedure: build a knowledge base of lexical relationships, accumulate the information in it to generate basic LSO nodes, and build the LSO schema through hierarchical clustering based on different semantic relatedness measures among the nodes. Preliminary experiments have shown promising results indicating its computability and scalability. We hope it can play an important role in real-world semantic computation applications.


#138: Language Acquisition: the Emergence of Words

Louis ten Bosch, Lou Boves (Radboud University Nijmegen, Netherlands)

Young infants learn words by detecting patterns in the speech signal and by associating these patterns with stimuli provided by non-speech modalities (such as vision). In this paper, we discuss a computational model that is able to detect and build word-like representations on the basis of multimodal input data. Learning of words (and word-like entities) takes place within a communicative loop between a `carer' and the `learner'. Experiments carried out on three different European languages (Finnish, Swedish, and Dutch) show that a robust word representation can be learned using approximately 50 acoustic tokens (examples) of that word. The model is inspired by the memory structure assumed to be functional in human speech processing.


#139: Efficient Unit-Selection in Text-to-Speech Synthesis

Aleš Mihelič, Jerneja Žganec Gros (Alpineon d.o.o., Ljubljana, Slovenia)

This paper presents a method for selecting speech units for polyphone concatenative speech synthesis, in which simplifying the graph path-search procedures accelerates unit selection with minimal effect on speech quality. The speech units selected are still optimal; only the unit-join costs on which the selection is based are determined less accurately. Due to its low processing power and memory footprint requirements, the method is applicable in embedded speech synthesizers.


#140: MSD Recombination for Statistical Machine Translation

Jerneja Žganec-Gros, Stanislav Gruden (Alpineon d.o.o., Ljubljana, Slovenia)

Freely available tools and language resources were used to build the VoiceTRAN statistical machine translation (SMT) system. Various configuration variations of the system are presented and evaluated. The VoiceTRAN SMT system outperformed the baseline conventional rule-based MT system in both English-Slovenian in-domain test setups. To further increase the generalization capability of the translation model for lower-coverage out-of-domain test sentences, an "MSD-recombination" approach was proposed. This approach not only allows a better exploitation of conventional translation models, but also performs well in the more demanding translation direction; that is, into a highly inflectional language. Using this approach in the out-of-domain setup of the English-Slovenian JRC-ACQUIS task, we have achieved significant improvements in translation quality.


#141: Disyllabic Chinese Word Extraction Based on Character Thesaurus...

Sun Maosong, Xu Dongliang (Tsinghua University, Beijing, China), Benjamin K.Y. T'sou (City University of Hong Kong, China), Lu Huaming (Beijing Information Science and Technology University, Beijing, China)

This paper presents a novel approach to Chinese disyllabic word extraction based on the semantic information of characters. Two thesauri of Chinese characters, one manually crafted and one machine-generated, are constructed. A Chinese wordlist with 63,738 two-character words, together with the character thesauri, is explored to learn semantic constraints between characters in Chinese word formation, resulting in two types of semantic-tag-based HMM. Experiments show that: (1) both schemes outperform their character-based counterpart; (2) the machine-generated thesaurus outperforms the hand-crafted one to some extent in word extraction; and (3) a proper combination of semantic-tag-based and character-based methods can benefit word extraction.


#144: A Semi-Automatic Wizard of Oz Technique for Let'sFly System

Alexey Karpov, Andrey Ronzhin, Anastasia Leontyeva (St. Petersburg Institute for Informatics and Automation of RAS, Russia)

The paper presents the Let'sFly spoken dialogue system, intended for natural human-computer interaction over telephone lines in the travel planning domain. The system uses ASR, keyword spotting and TTS methods for continuous Russian speech, and a dialogue manager with a mixed-initiative strategy. The semi-automatic Wizard of Oz technique used for collecting speech data and real dialogues is described. The semi-automatic model is a tradeoff between a fully automatic spoken dialogue system and interaction with a hidden human operator working like a computer system. The experimental data obtained with this technique are discussed in the paper.


#146: Cognitive and Emotional Interaction

Amel Achour, Jeanne Villaneau, Dominique Duhaut (University of Southern Brittany, Lorient, France)

The ANR project EmotiRob aims at conceiving and realizing a companion robot which interacts emotionally with fragile children. The MAPH project, an extension of EmotiRob, extends the cognitive abilities of the robot to implement linguistic interaction with the child. To this end, we studied a corpus of child language and derived the semantic links that can exist between each pair of words. This corpus, elaborated by D. Bassano, has been used to evaluate language development among children under five. Using this corpus, we built a taxonomy in accordance with the conceptual world of children and tested its validity. Using the taxonomy and the semantic properties that we attributed to the corpus words, we defined rapprochement coefficients between words in order to generate new sentences, answer the child's questions and play with him or her. As a perspective, we envisage making the robot capable of enriching its vocabulary and of defining new learning patterns based on its reactions.


#148: Automatic Learning of Czech Intonation

Tomáš Duběda, Jan Raab (Charles University in Prague, Czech Republic)

The present paper examines three methods of intonational stylization in the Czech language: a sequence of pitch accents, a sequence of boundary tones, and a sequence of contours. The efficiency of these methods was compared by means of a neural network which predicted the f_0 curve from each of the three types of input, with subsequent perceptual assessment. The results show that Czech intonation can be learned with about the same success rate in all three situations. This speaks in favour of a rehabilitation of contours as a traditional means of describing Czech intonation, as well as the use of boundary tones as another possible local approach.


#149: Intelligent Voice Navigation of Spreadsheets: An Empirical Evaluation

Derek Flood, Kevin Mc Daid, Fergal Mc Caffery, Brian Bishop (Dundalk Institute of Technology, Ireland)

Interaction with software systems has become second nature to most computer users; however, when voice recognition is introduced, this simple procedure becomes quite complex. To reduce this complexity for spreadsheet users, the authors have developed an intelligent voice navigation system called iVoice. This paper outlines the iVoice system and details an experiment conducted to determine the efficiency of iVoice compared to a leading voice recognition technology.


#152: The Module of Morphological and Syntactic Analysis SMART

Anastasia Leontyeva, Ildar Kagirov (St. Petersburg Institute for Informatics and Automation of RAS, Russia)

This paper presents SMART, a module for morphological and syntactic analysis of the Russian language. The principles of the morphological analysis are based on ideas of A.A. Zaliznyak. The entire glossary is kept as sets of lexical stems and grammatical markers, such as flexions and suffixes. Many stems are represented by several variants, owing to in-stem phonological alternations. This approach helps to reduce the glossary and shorten the time needed to search for a particular word form. The syntactic analysis is based on the rules of Russian; during the processing of texts, basic syntactic groups were singled out. The module can be used as a morphological analyzer or as a syntactic analyzer. It provides information about the initial word form, its paradigm and its transcription, and composes a vocabulary for a Russian continuous speech recognizer. The module is also used for the syntactic analysis of Russian sentences.


#153: Forced-Alignment and Edit-Distance Scoring

Serguei Pakhomov (University of Minnesota, United States), Jayson Richardson (University of North Carolina, United States), Matt Finholt-Daniel, Gregory Sales (Seward Incorporated, United States)

We demonstrate an application of Automatic Speech Recognition (ASR) technology to the assessment of young children's basic English vocabulary. We use a test set of 2935 speech samples manually rated by 3 reviewers to compare several approaches to measuring and classifying the accuracy of the children's pronunciation of words, including acoustic confidence scoring obtained by forced alignment and edit distance between the expected and actual ASR output. We show that phoneme-level language modeling can be used to obtain good classification results even with a relatively small amount of acoustic training data. The area under the ROC curve of the ASR-based classifier that uses a bi-phone language model interpolated with a general English bi-phone model is 0.80 (95% CI 0.78--0.82). The point where sensitivity and specificity are jointly maximized is at sensitivity 0.74 and specificity 0.80, with a harmonic mean of 0.77, which is comparable to human performance (ICC = 0.75; absolute agreement = 81%).
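As an aside (not part of the paper), the edit-distance scoring the abstract mentions is typically the standard Levenshtein distance between the expected and recognized phoneme sequences; a minimal dynamic-programming sketch, with a hypothetical normalized pronunciation score, might look like this:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token (e.g. phoneme) sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # deleting all of ref[:i]
    for j in range(n + 1):
        d[0][j] = j                      # inserting all of hyp[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[m][n]

def pronunciation_score(expected, actual):
    """Hypothetical normalization to [0, 1]: 1.0 = exact match."""
    if not expected and not actual:
        return 1.0
    return 1.0 - edit_distance(expected, actual) / max(len(expected), len(actual))
```

The paper's actual classifier combines such scores with acoustic confidence from forced alignment; the normalization above is only one plausible choice.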


#157: Czech Pitch Contour Modeling Using Linear Prediction

Petr Horák (Academy of Sciences of the Czech Republic, Praha, Czech Republic)

Current Czech TTS systems can produce synthetic speech with high intelligibility but low naturalness; the difference between natural and synthetic speech is still too large. The naturalness of synthetic speech is determined by signal modeling and by prosody modeling. This paper deals with improving synthetic prosody modeling, especially intonation modeling. A mathematical model of the pitch contour can significantly limit the complexity of creating intonational rules and increase the naturalness of the resulting synthetic speech. The linear prediction intonational model has been implemented in the TTS system Epos for practical use. This built-in intonational model uses excitation by rules and, in conjunction with new triphone time-domain inventories, provides more natural synthetic speech than the previous direct intonational rules.


#160: A Pattern-Based Framework for Uncertainty Representation ...

Miroslav Vacura, Vojtěch Svátek (University of Economics, Prague, Czech Republic), Pavel Smrž (Brno University of Technology, Czech Republic)

We present a novel approach to representing uncertain information in ontologies based on design patterns. We provide a brief description of our approach, present its use in connection with fuzzy information and probabilistic information, and describe the possibility to model multiple types of uncertainty in a single ontology. We also shortly present an appropriate fuzzy reasoning tool and define a complex ontology architecture for well-founded handling of uncertain information.


#161: Reverse Correlation for Analyzing MLP Posterior Features in ASR

Joel Pinto, Garimella S.V.S. Sivaram, Hynek Hermansky (IDIAP Research Institute, Martigny, Swiss Federal Institute of Technology, Lausanne, Switzerland)

In this work, we investigate the reverse correlation technique for analyzing posterior feature extraction using a multilayer perceptron (MLP) trained on multi-resolution RASTA (MRASTA) features. The filter bank in MRASTA feature extraction is motivated by human auditory modeling. The MLP is trained based on an error criterion and is purely data driven. We analyze the functionality of the combined system using reverse correlation analysis.


#166: New Methods for Pruning and Ordering of Syntax Parsing Trees

Vojtěch Kovář, Aleš Horák, Vladimír Kadlec (Masaryk University, Brno, Czech Republic)

Most robust rule-based syntax parsing techniques face the problem of a high number of possible syntax trees at the output. There are two possible solutions: either relax the requirement of robustness and provide special rules for uncovered phenomena, or equip the parser with filtering and ordering techniques. We describe the implementation and evaluation of the latter approach. In this paper, we present new techniques for pruning and ordering the resulting syntax trees in the Czech parser synt. We describe the principles of the methods and present measurements of their effectiveness, both per method and in combination, computed on 10,000 corpus sentences.


#167: SSML for Arabic Language

Noor Shaker, Mohamed Abou-Zleikha (University of Damascus, Syria), Oumayma Al Dakkak (HIAST, Damascus, Syria)

This paper introduces SSML for use with the Arabic language. SSML is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output, such as pronunciation, volume, pitch and rate, across different synthesis-capable platforms. We study SSML and the feasibility of extending it to Arabic by building an Arabic SSML project that parses SSML documents and extracts the speech output.


#168: Improving Unsupervised WSD with a Dynamic Thesaurus

Javier Tejada-Cárcamo (National Polytechnic Institute, Mexico City, Mexico & Peruvian Computer Society, Arequipa, Peru), Hiram Calvo, Alexander Gelbukh (National Polytechnic Institute, Mexico City, Mexico)

The method proposed by Diana McCarthy et al. obtains the predominant sense of an ambiguous word from a weighted list of terms related to it. This list of terms is obtained using the distributional similarity method proposed by Lin to build a thesaurus. In that method, every occurrence of the ambiguous word uses the same thesaurus, regardless of the context in which it occurs. In this paper we explore a different method that accounts for the context of a word when determining its most frequent sense: the list of distributionally similar words is built from the syntactic context of the ambiguous word. We attain a precision of 69.86%, which is 7% higher than the supervised baseline of applying the MFS learned from 90% of SemCor to the remaining 10% of SemCor.
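For readers unfamiliar with the McCarthy et al. scheme the abstract builds on, a generic sketch (not from the paper; the function and parameter names are illustrative) of predominant-sense ranking: each candidate sense is scored by summing, over the weighted thesaurus neighbors, the neighbor's distributional weight times the maximum semantic similarity between that sense and any sense of the neighbor.

```python
def predominant_sense(senses, neighbors, similarity):
    """Rank candidate senses of an ambiguous word, McCarthy-style.

    senses     : list of sense identifiers for the target word
    neighbors  : list of (list_of_neighbor_senses, distributional_weight) pairs
    similarity : function (sense_a, sense_b) -> float semantic similarity
    """
    def score(sense):
        total = 0.0
        for neighbor_senses, weight in neighbors:
            # Best similarity between this sense and any sense of the neighbor.
            best = max((similarity(sense, ns) for ns in neighbor_senses),
                       default=0.0)
            total += weight * best
        return total
    return max(senses, key=score)
```

In the paper's variant, the neighbor list would be derived from the syntactic context of each occurrence rather than from a single global thesaurus.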


#170: Toward the Ultimate ASR Language Model

Frederick Jelinek, Carolina Parada (Johns Hopkins University, Baltimore, United States)

The n-gram model is the standard for large-vocabulary speech recognizers. Many attempts have been made to improve on it: language models have been proposed based on grammatical analysis, artificial neural networks, random forests, etc. While the latter give somewhat better recognition results than the n-gram model, they are not practical, particularly when large training databases (e.g., from the world wide web) are available. So should language model research be abandoned as a hopeless endeavor? This talk will discuss a plan to determine how large a decrease in recognition error rate is conceivable, and propose a game-based method to determine which parameters the ultimate language model should depend on.
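For context, the n-gram baseline the talk refers to is conceptually very simple; a minimal sketch of a bigram (n = 2) model with add-one smoothing (a toy illustration, not the talk's material) is:

```python
from collections import defaultdict

class BigramModel:
    """Minimal bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.unigrams = defaultdict(int)   # counts of history words
        self.bigrams = defaultdict(int)    # counts of (w1, w2) pairs
        self.vocab = set()
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]   # sentence boundary markers
            self.vocab.update(tokens)
            for w1, w2 in zip(tokens, tokens[1:]):
                self.unigrams[w1] += 1
                self.bigrams[(w1, w2)] += 1

    def prob(self, w1, w2):
        # Smoothed conditional probability P(w2 | w1).
        return (self.bigrams[(w1, w2)] + 1) / (self.unigrams[w1] + len(self.vocab))
```

Production recognizers use far larger n, heavier smoothing (e.g. Kneser-Ney) and pruning, but the counting scheme is the same.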


#171: The Future of Text-Meaning in Computational Linguistics

Graeme Hirst (University of Toronto, Canada)

Writer-based and reader-based views of text-meaning are reflected by the respective questions "What is the author trying to tell me?" and "What does this text mean to me personally?" Contemporary computational linguistics, however, generally takes neither view. But this is not adequate for the development of sophisticated applications such as intelligence gathering and question answering. I discuss different views of text-meaning from the perspective of the needs of computational text analysis and the collaborative repair of misunderstanding.


#172: Deep Lexical Semantics

Jerry Hobbs (University of Southern California, United States)

The link between words and the world is made easier if we have conceptualized the world in a way that language indicates. In the effort I will describe, we have constructed a number of core formal theories, trying to capture the abstract structure that underlies language and enable literal and metaphorical readings to be seen as specializations of the abstract structures. In the core theories, we have axiomatized composite entities (or things made out of other things), the figure-ground relation, scalar notions (of which space, time and number are specializations), change of state, causality, and the structure of complex events and processes. These theories explicate the basic predicates in terms of which the most common word senses need to be defined or characterized. We are now encoding axioms that link the word senses to the core theories, focusing on 450 word senses in Core WordNet that are primarily concerned with events and their structure. This may be thought of as a kind of "advanced lexical decomposition", where the "primitives" into which words are "decomposed" are elements in coherently worked-out theories.


#173: Practical Prosody: Modeling Language Beyond the Words

Elizabeth Shriberg (SRI International, United States)

Prosody is clearly valuable for human understanding, but can be difficult to model in spoken language technology. This talk describes a "direct modeling" approach, which does not require any hand-labeling of prosodic events. Instead, prosodic features are extracted directly from the speech signal, based on time alignments from automatic speech recognition. Machine learning techniques then determine a prosodic model, and the model is integrated with lexical and other information to predict the target classes of interest. The talk presents a general method for prosodic feature extraction and design (including a special-purpose tool developed at SRI), and illustrates how it can be successfully applied in three different types of tasks: (1) detection of sentence or dialog act boundaries; (2) classification of emotion and affect; and (3) speaker classification.


#175: ♠ Demo: An Open Source Tool for Partial Parsing and Morphosyntactic Disambiguation

Aleksander Buczynski and Adam Przepiórkowski (Polish Academy of Sciences & Warsaw University, Warsaw, Poland)

The paper presents Spejd, an Open Source Shallow Parsing and Disambiguation Engine. Spejd (abbreviated to ♠) is based on a fully uniform formalism both for constituency partial parsing and for morphosyntactic disambiguation -- the same grammar rule may contain structure-building operations, as well as morphosyntactic correction and disambiguation operations. The formalism and the engine are more flexible than either the usual shallow parsing formalisms, which assume disambiguated input, or the usual unification-based formalisms, which couple disambiguation (via unification) with structure building. Current applications of Spejd include rule-based disambiguation, detection of multiword expressions, valence acquisition, and sentiment analysis. The functionality can be further extended by adding external lexical resources. While the examples are based on the set of rules prepared for the parsing of the IPI PAN Corpus of Polish, ♠ is fully language-independent and we hope it will also be useful in the processing of other languages.


#176: On-line and Off-line Speech Recognition Systems for Czech Language

Petr Červa, Jindřich Ždánský, Jan Nouza (Technical University of Liberec, Czech Republic)

Czech is a language with a high degree of inflection. This means that most Czech words (namely all nouns, pronouns, adjectives, numerals and verbs) appear in various morphological forms depending on grammatical and semantic context; the total number of possible word forms exceeds 1 million. This fact makes speech recognition of Czech a challenging task, because a designer must solve problems associated with very large vocabularies, even larger (and at the same time sparser) language models and, last but not least, the complex problem of efficient decoding. During the last 3 years we have been developing speech recognition modules that can be applied in various areas where spoken Czech is to be converted into text, namely voice dictation into a PC and transcription of broadcast programs. Systems developed for both domains will be demonstrated at TSD 2008. Our dictation system operates with a vocabulary whose current size is 320 thousand items. This size allows the system to be used for domain-independent dictation, e.g. by journalists. Special vocabularies and language models have also been compiled for judicature and selected areas of medicine. The system is speaker independent, and hence everybody who speaks Czech will have a chance to test it. The second application area is automatic transcription of broadcast programs. We will show both off-line and on-line processing of broadcast data: the former will be demonstrated on a large archive of already transcribed and indexed programs of Czech TV, the latter as live subtitling of currently broadcast TV stations. (The latter demonstration can run live if a fast internet connection is available on site.)


#177: Python Module and GUI for EuroWordNet

Neeme Kahusk (Institute of Computer Science, University of Tartu, Estonia)

The software demonstration includes a Python module, TEKSaurus (Estonian WordNet online) and a GUI written in Qt. The subject of this demo is open-source tools for editing and managing EuroWordNet database files. The Python module serves as an API, and the graphical user interface is implemented in Qt. These tools run on a broad range of platforms, including Windows, MacOS, Linux and Unix. The EWN module enables the programmer to handle EuroWordNet synsets and semantic relations easily: synsets are implemented as objects, and operations on them as methods. Calculations on synsets can be used both in interactive Python sessions and by importing the module into other programs, such as word sense disambiguation tools. TEKSaurus is a web application that provides users with an interface to browse and search EuroWordNet files. Synsets can be queried by keyword, part of speech, and even by synset number. The results are displayed as tables of synsets, with literal, gloss and examples; semantic relations are displayed as clickable links. The GUI, written in Qt, runs on several platforms. It is designed for lexicographers, who can use it like the Polaris tool to browse, manage and edit EuroWordNet database files. The GUI part is at an early development stage, but should have at least the same functionality as Polaris when finished. The GPL licence enables users to modify the source according to their needs. The tools have been tested on the Estonian WordNet, but should work for other languages as well.


#178: Unintrusive IT for Philological Editing

Joerg Ritter, Susanne Schütz (Martin-Luther-University Halle-Wittenberg, Germany)

Digital methods are essential for the production, revision and publication of philological editions. The TEI guidelines for XML encoding are considered an approved standard for the representation of texts. We have developed an unintrusive environment (kronos.uzi.uni-halle.de) that simplifies and automates the digitization and XML encoding of literary texts. The tool offers a graphical user interface adapted to the structure of the underlying genre. Encoding is as easy as formatting text, even without knowledge of XML/TEI. After configuration, the environment provides the extraction of publication formats (HTML, PDF) at the push of a button.

