TSD 2010

Much recent research in natural language parsing takes as input carefully crafted, edited text, often from newspapers. However, many real-world applications involve processing text which is not written carefully by a native speaker, is produced for an eventual audience of only one, and is in essence ephemeral. In this talk I will present a number of research and commercial applications of this type which I and collaborators are developing, in which we process text as diverse as mobile phone text messages, non-native language learner essays, and primary care medical notes. I will discuss the problems these types of text pose, and outline how we integrate information from parsing into applications.

#101: Evolution of the ASR Decoder Design

Miroslav Novak (IBM Watson Research Center, Yorktown Heights, NY, USA)

The ASR decoder is one of the fundamental components of an ASR system and has been evolving over the years to address the increasing demands for larger domains as well as the availability of more powerful hardware. Though the basic search algorithm (i.e. Viterbi search) is relatively simple, implementing a decoder which can handle hundreds of thousands of words in the active vocabulary and hundreds of millions of n-grams in the language model in real time is no simple task. With the emergence of embedded platforms, some of the design concepts used in the past to cope with limitations of the available hardware can become relevant again, since such limitations are similar to those of workstations in the early days of ASR. In this paper we describe basic design concepts encountered in various decoder implementations, with a focus on those which are relevant today across the fairly large spectrum of available hardware platforms.
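As a rough illustration of the basic Viterbi search mentioned in the abstract, the following sketch decodes a toy two-state HMM in the log domain. The states, observation symbols, and probabilities are invented for illustration and are not taken from the paper; a real ASR decoder adds pruning, lexicon and language-model handling that are omitted here.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence and its log-probability."""
    # best[t][s] holds the best log-score of any path ending in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Pick the predecessor state that maximizes the path score into s.
            prev, score = max(
                ((p, best[-1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = score + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

# Hypothetical toy model (all numbers are assumptions for the example).
states = ["A", "B"]
log_start = {"A": math.log(0.6), "B": math.log(0.4)}
log_trans = {
    "A": {"A": math.log(0.7), "B": math.log(0.3)},
    "B": {"A": math.log(0.4), "B": math.log(0.6)},
}
log_emit = {
    "A": {"x": math.log(0.9), "y": math.log(0.1)},
    "B": {"x": math.log(0.2), "y": math.log(0.8)},
}
path, log_prob = viterbi(["x", "x", "y"], states, log_start, log_trans, log_emit)
```

Working in log-probabilities avoids numerical underflow on long utterances, which is one reason practical decoders use sums of log-scores rather than products.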

#102: Knowledge for Everyman

Christiane Fellbaum (Princeton University, Princeton, USA)

Increasing globalization creates situations with wide-ranging effects on large communities, often requiring global responses and innovative solutions. Timely examples are climate and environmental changes related to rapid growth and economic development. Natural and man-made unforeseen catastrophes like oil spills, landslides and floods require immediate action that might crucially rely on information and expertise available only from sources far removed from the crisis site. Knowledge sharing and transfer are also essential for sustainable long-term growth and development. In both kinds of cases, it is important that information and experience be made available and widely shared, communicated and encoded for future re-use. The global scope of many problems and their solutions requires furthermore that information and communication be accessible to communities crossing languages and cultures. Finally, an appropriate system for recording, maintaining and sharing information must be accessible to both experts and laymen. The goal of the European Union-funded KYOTO project (Knowledge-Yielding Ontologies for Transition-Based Organization, www.kyoto-project.eu) is to develop an information and knowledge sharing system that relates documents in several languages to lexical resources and a common central ontology and allows for deep semantic analysis. KYOTO facilitates the crosslinguistic and crosscultural construction and maintenance of a sophisticated knowledge system among the members of domain-specific communities. Representation, storage and retrieval of a shared terminology takes place via a Wiki platform. Relevant terms are anchored in a language-independent, customizable formal ontology that connects the lexicons of seven languages (Basque, Chinese, Dutch, English, Italian, Japanese, and Spanish) and that guarantees a uniform interpretation of terms across languages. 
The semantic representations in the ontology are accessible to a computer and allow deep textual analysis and reasoning operations. KYOTO's target domains are the environment and biodiversity, with appropriate experts acting as "users". Once developed, the system will be available for extension to any domain.

#283: A Compositional Model for a Lexical Emotion Detection

Marc Le Tallec, Jeanne Villaneau (Université de Tours, Tours, France), Jean-Yves Antoine (Université de Bretagne-Sud, Vannes, France), Agata Savary (Université de Tours, Tours, France), Arielle Syssau (Université de Montpellier III, Montpellier, France)

The ANR EmotiRob project aims at detecting emotions in an original application context: realizing an emotional companion robot for weakened children. This paper presents a system which aims at characterizing emotions by only considering the linguistic content of utterances. It is based on the assumption of compositionality: simple lexical words have an intrinsic emotional value, while verbal and adjectival predicates act as a function on the emotional values of their arguments. The paper describes the semantic component of the system, the algorithm of compositional computation of the emotion value and the lexical emotional norm used by this algorithm. A quantitative and qualitative analysis of the differences between system outputs and expert annotations is given, which shows satisfactory results, with the right detection of emotional valency in 90% of the test utterances.

#245: A Methodology for Learning Optimal Dialog Strategies

David Griol (Carlos III University of Madrid, Madrid, Spain), Michael F. McTear (University of Ulster, Jordanstown, Northern Ireland, UK), Zoraida Callejas, Ramón López-Cózar, Nieves Ábalos, Gonzalo Espejo (University of Granada, Granada, Spain)

In this paper, we present a technique for learning new dialog strategies by using a statistical dialog manager that is trained from a dialog corpus. A dialog simulation technique has been developed to acquire data required to train the dialog model and then explore new dialog strategies. A set of measures has also been defined to evaluate the dialog strategy that is automatically learned. We have applied this technique to explore the space of possible dialog strategies for a dialog system that collects monitored data from patients suffering from diabetes.

#229: A Multimodal DS for an AmI Application in Home Environments

Nieves Ábalos, Gonzalo Espejo, Ramón López-Cózar, Zoraida Callejas (University of Granada, Granada, Spain), David Griol (Carlos III University of Madrid, Madrid, Spain)

This paper presents a multimodal dialogue system called Mayordomo, which aims at easing the interaction with home appliances using speech and a graphical interface within an Ambient Intelligence environment. We present the methods employed for implementing the system, describing the design of the user-system interactions as well as additional features, such as the management of user profiles to restrict access to domestic appliances and to customize the recognition grammars and the generated responses.

#244: A Priori and A Posteriori Machine Learning...

Jan Zelinka, Jan Romportl (University of West Bohemia, Pilsen, Czech Republic), Luděk Müller (SpeechTech s.r.o., Pilsen, Czech Republic)

The main idea of a priori machine learning is to apply a machine learning method on a machine learning problem itself. We call it "a priori" because the processed data set does not originate from any measurement or other observation. Machine learning which deals with any observation is called "posterior". The paper describes how posterior machine learning can be modified by a priori machine learning. A priori and posterior machine learning algorithms are proposed for artificial neural network training and are tested in the task of audio-visual phoneme classification.

#232: ASR Based on Multiple Level Units in Spoken Dialogue System

Masafumi Nishida (Doshisha University, Kyoto, Japan), Yasuo Horiuchi, Shingo Kuroiwa (Chiba University, Chiba, Japan), Akira Ichikawa (Waseda University, Saitama, Japan)

The purpose of our study is to develop a spoken dialogue system for in-vehicle appliances. Such a multi-domain dialogue system should be capable of reacting to a change of topic and of recognizing both separate words and whole sentences quickly and accurately. We propose a novel recognition method that integrates sentences, partial words, and phonemes. The degree of confidence is determined by the degree to which recognition results match on these three levels. We conducted speech recognition experiments for in-vehicle appliances. In the case of sentence units, the recognition accuracy was 96.2% with the proposed method and 92.9% with the conventional word bigram. As for word units, the recognition accuracy of the proposed method was 86.2%, while that of whole word recognition was 75.1%. We therefore conclude that our method can be effectively applied in spoken dialogue systems for in-vehicle appliances.

#212: ASR Transcription Based Unsupervised Speaker Adaptation

Bálint Tóth, Tibor Fegyó, Géza Németh (Budapest University of Technology and Economics, Budapest, Hungary)

Statistical parametric synthesis offers numerous techniques to create new voices. Speaker adaptation is one of the most exciting ones. However, it still requires high quality audio data with low signal-to-noise ratio and precise labeling. This paper presents an automatic speech recognition based unsupervised adaptation method for Hidden Markov Model (HMM) speech synthesis and its quality evaluation. The adaptation technique automatically controls the number of phone mismatches. The evaluation involves eight different HMM voices, including supervised and unsupervised speaker adaptation. The effects of segmentation and linguistic labeling errors in adaptation data are also investigated. The results show that unsupervised adaptation can contribute to speeding up the creation of new HMM voices with comparable quality to supervised adaptation.

#216: Adaptation of a Feedforward Artificial Neural Network

Jan Trmal, Jan Zelinka, Luděk Müller (University of West Bohemia, Pilsen, Czech Republic)

In this paper we present a novel method for adaptation of a multi-layer perceptron neural network (MLP ANN). Nowadays, the adaptation of the ANN is usually done as an incremental retraining of either a subset or the complete set of the ANN parameters. However, since the amount of adaptation data is sometimes quite small, such an approach has a fundamental drawback: during retraining, the network parameters can easily be overfitted to the new data. There certainly are techniques that can help overcome this problem (early stopping, cross-validation); however, applying such techniques leads to a more complex and possibly more data-hungry training procedure. The proposed method approaches the problem from a different perspective. We use the fact that in many cases we have additional knowledge about the problem. Such additional knowledge can be used to limit the dimensionality of the adaptation problem. We applied the proposed method to speaker adaptation of a phoneme recognizer based on TRAPS (Temporal Patterns) parameters. We exploited the fact that the employed TRAPS parameters are constructed using log-outputs of a mel-filter bank; by reformulating the first-layer weight matrix adaptation problem as a mel-filter bank output adaptation problem, we were able to significantly limit the number of free variables. Adaptation using the proposed method resulted in a substantial improvement of phoneme recognizer accuracy.

#239: Adapting Lexical and Language Models for Spontaneous Czech

Jan Nouza, Jan Silovský (Technical University of Liberec, Liberec, Czech Republic)

The paper deals with the problem of automatic transcription of spontaneous conversations in Czech. That type of speech is informal, with many colloquial words. It is difficult to create an appropriate lexicon and language model when linguistic resources representing colloquial Czech are limited to several small corpora collected by the Institute of Czech National Corpus. To overcome this, we introduce transformations between the most frequent colloquial words and their counterparts in formal Czech. This allows us a) to combine the small spoken corpora with much larger corpora of more formal texts, b) to optimize the recognizer's lexicon, and c) to solve the data sparsity problem when computing a probabilistic language model. We have applied this approach in the design of a system for transcription of spontaneous telephone conversations. Its recent version operates with an accuracy of about 48%, and the proposed transformations together with corpora mixing contributed a 9% improvement compared to the baseline system.

#291: Advanced Searching in the Valency Lexicons

Eduard Bejček, Václava Kettnerová, Markéta Lopatková (Charles University, Prague, Czech Republic)

This paper presents a sophisticated way to search valency lexicons. We provide a visualization of lexicons with built-in searching that allows users to draw sophisticated queries in a graphical mode. We exploit the PML-TQ, a query language based on the tree editor TrEd. For demonstration purposes, we focus on VALLEX and PDT-VALLEX, two Czech valency lexicons of verbs. We propose a common lexicon data format supported by PML-TQ. This format offers easy viewing of both lexicons, searching them in parallel, and interlinking them. The proposed method is universal and can be used for other hierarchically structured lexicons.

#290: An NLP-Oriented Analysis of the Instant Messaging Discourse

Justyna Walkowska (Adam Mickiewicz University, Poznan, Poland)

This paper describes the results of the analysis of an experimentally collected small corpus of messages exchanged through an instant messaging (IM) programme. The data is analysed from the point of view of automatic parsing. Special attention is paid to two problems associated with IM discourse: the semantic multi-tasking (or the interweaving of topics) of conversation partners, and the non-standard spelling found in such dialogues. The contents of the corpus are also compared with other types of written dialogues, i.e. SMS messages and conversations between human users and chatterbots. Finally, some solutions are proposed to facilitate the process of automatic parsing of IM messages.

#316: Automatic Acquisition of Wordnet Relations...

Roman Kurc, Maciej Piasecki (Wrocław University of Technology, Wrocław, Poland), Stan Szpakowicz (University of Ottawa, Ottawa, Canada)

Espresso is a pattern-based algorithm for extracting lexical-semantic relations, defined for English. We present its adaptation to Polish. We consider not only the technicalities such as the availability of language-processing tools for Polish, but also pattern structures which leverage the specificity of a strongly inflected language. We propose a new method of computing the reliability measure of extraction; this leads to a modified algorithm which we have named Estratto. In this paper we investigate the influence of additional lexico-semantic data and information from generic patterns.

#341: Automatic Detection and Evaluation of Edentulous Speakers...

Tobias Bocklet, Florian Hönig, Tino Haderlein, Florian Stelzle, Christian Knipfer, Elmar Nöth (Universität Erlangen-Nürnberg, Erlangen, Germany)

Dental rehabilitation by complete dentures is a state-of-the-art approach to improve functional aspects of the oral cavity of edentulous patients. It is important to ensure that these dentures have a sufficient fit. We introduce a dataset of 13 edentulous patients who were recorded with and without complete dentures in situ. The dentures of these patients were rated as having an insufficient fit, so additional (sufficient) dentures and additional speech recordings were prepared. In this paper we show that sufficient dentures increase the performance of an ASR system by ca. 27%. Based on these results, we present and discuss three different systems that automatically determine whether the dentures of an edentulous person have a sufficient fit or not. The system with the best performance models the recordings by GMMs and uses the mean vectors of these GMMs as features in an SVM. With this system we were able to achieve a recognition rate of 80%.

#213: Automatic Lip Reading Using AAM on High Speed Recordings

Alin Gavril Chitu, Karin Driel, Leon J. M. Rothkrantz (Delft University of Technology, Delft, The Netherlands)

This paper presents our work on lip reading in the Dutch language. The results are based on a new data corpus recorded at 100 Hz in our group. The NDUTAVSC corpus is to date the largest corpus built for lip reading in Dutch. For parameterising the input data we use Active Appearance Models (AAM). Based on the results of AAM we define a set of high-level geometric features which are used for training recognizer systems for different recognition tasks, such as fixed-length digit strings, random-length letter strings, random word sequences, fixed-topic continuous speech and random continuous speech. We show that our approach gives great improvements compared to previous results. We also investigate the influence of the high-speed recordings on the performance of the recognition. We show that in the case of a high speech rate the use of higher-speed recordings is compulsory.

#249: Automatic Segmentation of Parasitic Sounds in Speech Corpora for TTS Synthesis

Jindřich Matoušek (University of West Bohemia, Pilsen, Czech Republic)

In this paper, automatic segmentation of parasitic speech sounds in speech corpora for text-to-speech (TTS) synthesis is presented. Besides the automatic detection of the presence of such sounds in speech corpora, automatic segmentation is an important step in the precise localisation of parasitic sounds in speech corpora. The main goal of this study is to find out whether the segmentation of these sounds is accurate enough to enable cutting the sounds out of synthetic speech or explicit modelling of these sounds during synthesis. An HMM-based classifier was employed to detect the parasitic sounds and, simultaneously, to find the boundaries between these sounds and the surrounding phones. The results show that the automatic segmentation of parasitic sounds is comparable to the segmentation of other phones, which indicates that the cutting out or the explicit usage of parasitic sounds should be possible.

#222: Automatic Sentiment Analysis by Textual Similarity

Jan Žižka, František Dařena (Mendel University in Brno, Brno, Czech Republic)

The paper investigates a problem connected with automatic analysis of sentiment (opinion) in textual natural-language documents. The initial situation works on the assumption that a user has many documents centered around a certain topic with different opinions of it. The user wants to pick out only relevant documents that represent a certain sentiment -- for example, only positive reviews of a certain subject. Having not too many typical patterns of the desired document type, the user needs a tool that can collect documents which are similar to the patterns. The suggested procedure is based on computing the similarity degree between patterns and unlabeled documents, which are then ranked according to their similarity to the patterns. The similarity is calculated as a distance between patterns and unlabeled items. The results are shown for publicly accessible downloaded real-world data in two languages, English and Czech.
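The ranking step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify the document representation or the distance measure, so the bag-of-words vectors, the cosine similarity, and the choice of scoring each document by its closest pattern are all assumptions, and the example texts are invented.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a document as lower-cased word counts (an assumed representation)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(patterns, unlabeled):
    """Order unlabeled documents by their similarity to the closest pattern."""
    pattern_vectors = [bag_of_words(p) for p in patterns]
    scored = []
    for doc in unlabeled:
        vec = bag_of_words(doc)
        score = max(cosine_similarity(vec, p) for p in pattern_vectors)
        scored.append((score, doc))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored]

# Hypothetical example: one positive pattern and three unlabeled reviews.
patterns = ["great hotel lovely staff"]
unlabeled = [
    "terrible room rude staff",
    "lovely hotel great view",
    "the weather was cold",
]
ranked = rank_by_similarity(patterns, unlabeled)
```

The user would then inspect the top of the ranked list, which is where documents resembling the supplied patterns are expected to appear.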

#274: Borda-Based Voting Schemes for Semantic Role Labeling

Vladimir Robles, Antonio Molina, Paolo Rosso (Universidad Politécnica de Valencia, Valencia, Spain)

In this article, we have studied the possibility of applying Borda and Fuzzy Borda voting schemes to combine semantic role labeling systems. To better select the correct semantic role, among those provided by different experts, we have introduced two measures: the first one calculates the overlap between labeled sentences, whereas the second one adds different scoring levels depending on the verbs that have been parsed.
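For readers unfamiliar with Borda voting, a minimal sketch of the plain Borda count applied to combining role labels is given below. It assumes each labeling system supplies a ranked list of candidate roles for the same argument; the label names and rankings are invented, and neither the Fuzzy Borda variant nor the two measures introduced in the paper are modelled here.

```python
from collections import defaultdict

def borda_winner(rankings):
    """Combine ranked candidate lists with a plain Borda count.

    Each ranking lists candidates from most to least preferred; a candidate
    in position i of a list of length n receives n - 1 - i points.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += n - 1 - position
    # Iterating over sorted labels makes tie-breaking deterministic (alphabetical).
    winner = max(sorted(scores), key=lambda label: scores[label])
    return winner, dict(scores)

# Hypothetical rankings from three labeling systems for one argument.
rankings = [
    ["A0", "A1", "AM-LOC"],
    ["A1", "A0", "AM-LOC"],
    ["A1", "AM-LOC", "A0"],
]
winner, scores = borda_winner(rankings)
```

Unlike simple majority voting, the Borda count uses the full preference order of every system, so a label that is consistently ranked second can still win.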

#276: CORPRES: Corpus of Russian Professionally Read Speech

Pavel Skrelin, Nina Volskaya, Daniil Kocharov, Karina Evgrafova, Olga Glotova, Vera Evdokimova (Saint-Petersburg State University, St. Petersburg, Russia)

The paper introduces CORPRES - COrpus of Russian Professionally REad Speech developed at the Department of Phonetics, Saint Petersburg State University, as a result of a three-year project. The corpus includes samples of different speaking styles produced by 4 male and 4 female speakers. Six levels of annotation cover all phonetic and prosodic information about the recorded speech data, including labels for pitch marks, phonetic events, phonetic, orthographic and prosodic transcription. Precise phonetic transcription of the data provides an especially valuable resource for both research and development purposes. Overall corpus size is 60 hours of speech. The paper contains information about CORPRES design and annotation principles, and overall data description. Also, we discuss possible use of the corpus in phonetic research and speech technology as well as some findings on the Russian sound system obtained from the corpus data.

#318: Can Corpus Pattern Analysis Be Used in NLP?

Silvie Cinková, Martin Holub (Charles University, Prague, Czech Republic), Pavel Rychlý (Masaryk University, Brno, Czech Republic), Lenka Smejkalová, Jana Šindlerová (Charles University, Prague, Czech Republic)

Corpus Pattern Analysis (CPA), coined and implemented by Hanks as the Pattern Dictionary of English Verbs (PDEV), appears to be the only deliberate and consistent implementation of Sinclair's concept of Lexical Item. In his theoretical inquiries Hanks hypothesizes that the pattern repository produced by CPA can also support the word sense disambiguation task. Although more than 670 verb entries have already been compiled in PDEV, no systematic evaluation of this ambitious project has been reported yet. Assuming that the Sinclairian concept of the Lexical Item is correct, we started to closely examine PDEV with its possible NLP application in mind. Our experiments presented in this paper have been performed on a pilot sample of English verbs to provide a first reliable view on whether humans can agree in assigning PDEV patterns to verbs in a corpus. As a conclusion we suggest procedures for future development of PDEV.

#275: Client and Speech Detection System for Intelligent Infokiosk

Andrey Ronzhin, Alexey Karpov, Irina Kipyatkova (Russian Academy of Sciences, St. Petersburg, Russia), Miloš Železný (University of West Bohemia, Pilsen, Czech Republic)

Timely attraction of a client and detection of his/her speech message in real noisy conditions are the main difficulties in deploying speech and multimodal interfaces in information kiosks. A combination of sound source localization, voice activity detection and face detection technologies made it possible to determine the client's mouth coordinates and to extract the boundaries of the speech signal appearing in the kiosk dialogue area. A talking head model based on audio-visual speech synthesis immediately greets the client when his/her face is captured in the video-monitoring area, in order to attract him/her to the information service before he/she leaves the interaction area. Tracking of the client's face is also used for turning the talking head in the direction of the client, which significantly improves the naturalness of interaction. The developed infokiosk, installed in the institute hall, provides information about the structure and staff of laboratories. Statistics of human-kiosk interaction were accumulated over the last six months of 2009.

#256: Comparison of Different Lemmatization Approaches

Jakub Kanis, Lucie Skorkovská (University of West Bohemia, Pilsen, Czech Republic)

This paper presents a quantitative performance analysis of two different approaches to the lemmatization of Czech text data. The first one is based on a manually prepared dictionary of lemmas and a set of derivation rules, while the second one is based on automatic inference of the dictionary and the rules from training data. The comparison is done by evaluating the mean Generalized Average Precision (mGAP) measure of the lemmatized documents and search queries in a set of information retrieval (IR) experiments. Such a method is suitable for efficient and rather reliable comparison of the lemmatization performance, since correct lemmatization has proven to be crucial for IR effectiveness in highly inflected languages. Moreover, the proposed indirect comparison of the lemmatizers circumvents the need for manually lemmatized test data, which are hard to obtain and also face the problem of incompatible sets of lemmas across different systems.

#285: Comparison of Web 1T 5-gram Corpus to Czech National Corpus

Václav Procházka, Petr Pollák (Czech Technical University, Prague, Czech Republic)

In this paper, the newly issued Czech Web 1T 5-gram corpus created by Google and LDC is analysed and compared with a reference n-gram corpus obtained from the Czech National Corpus. Original 5-grams from both corpora were post-processed, and statistical trigram language models of various vocabulary sizes and parameters were created. The comparison of various corpus statistics, such as unique and total word and n-gram counts before and after post-processing, is presented and discussed, with a particular focus on clearing the Web 1T data of invalid tokens. Tools from the HTK Toolkit were used for the evaluation; accuracy, OOV rates and perplexity were measured using sentence transcriptions from the Czech SPEECON database.

#210: Correlation Features and a Linear Transform Specific...

Andreas Beschorner, Dietrich Klakow (Saarland University, Saarbrücken, Germany)

In this paper we introduce three ideas for phoneme classification: First, we derive the necessary steps to integrate linear transforms into the computation of reproducing kernels. This concept is not restricted to phoneme classification and can be applied to a wider range of research subjects. Second, in the context of support vector machine (SVM) classification, correlation features based on MFCC-vectors are proposed as a substitute for the common first and second derivatives, and the theory of the first part is applied to the new features. Third, an SVM structure in the spirit of phoneme states is introduced. Relative classification improvements of 40.67% compared to stacked MFCC features of equal dimension encourage further research in this direction.

#319: Coverage-Based Methods for Distributional Stopword Selection...

Joe Vasak, Fei Song (University of Guelph, Guelph, Canada)

Unlike the common stopwords in information retrieval, distributional stopwords are document-specific and refer to the words that are more or less evenly distributed across a document. Isolating distributional stopwords has been shown to be useful for text segmentation, since it helps improve the representation of a segment by reducing the overlapped words between neighboring segments. In this paper, we propose three new measures for distributional stopword selection and expand the notion of distributional stopwords from the document level to a topic level. Two of our new measures are based on the distributional coverage of a word and the other one is extended from an existing measure called distribution difference by relying on the density of words in a way similar to another measure called distribution significance. Our experiments show that these new measures are not only efficient to compute, but also more accurate than or comparable to the existing measures for distributional stopword selection and that distributional stopword selection at a topic level is more accurate than document level selection for subtopic segmentation.

#272: Czech HMM-Based Speech Synthesis

Zdeněk Hanzlíček (University of West Bohemia, Pilsen, Czech Republic)

In this paper, first experiments on statistical parametric HMM-based speech synthesis for the Czech language are described. In this synthesis method, trajectories of speech parameters are generated from the trained hidden Markov models. A final speech waveform is synthesized from those speech parameters. In our experiments, spectral properties were represented by mel cepstrum coefficients. For the waveform synthesis, the corresponding MLSA filter excited by pulses or noise was utilized. Besides that basic setup, the high-quality analysis/synthesis system STRAIGHT was employed for more sophisticated speech representation. For a more robust model parameter estimation, HMMs are clustered using a decision tree-based context clustering algorithm. For this purpose, phonetic and prosodic contextual factors proposed for the Czech language are taken into account. The created clustering trees are also employed for synthesis of speech units unseen within the training stage. The evaluation by subjective listening tests showed that speech produced by the combination of the HMM-based TTS system and STRAIGHT is of comparable quality to speech synthesised by the unit selection TTS system trained from the same speech data.

#242: Czech Spoken Dialog System with Mixed Initiative

Jan Švec, Luboš Šmídl (University of West Bohemia, Pilsen, Czech Republic)

This paper describes a prototype of a Czech dialog system with a mixed dialog initiative and a natural language understanding module. The described dialog system is designed for providing railway information such as arrivals, departures, prices and train types. The dialog can be driven by both a user of the system and a dialog manager to accomplish the dialog goal. In addition, the user can use an almost arbitrary Czech utterance consistent with the dialog domain to interact with the system. The system accesses the train database on-line via the Internet. The version described in this paper works as a desktop computer application and communicates with the user through a headset. The paper describes the modules of the dialog system, including automatic speech recognition, natural language understanding, dialog manager, speech generation and speech synthesis.

#217: Data for Evaluation of Concatenation Cost Functions

Milan Legát, Jindřich Matoušek (University of West Bohemia, Pilsen, Czech Republic)

This paper describes the collection and analysis of data intended for the evaluation and development of concatenation cost functions for unit selection based TTS systems. The data, collected via listening tests following published recommendations, were analyzed in a variety of ways to identify and possibly exclude "malicious" listeners as well as to demonstrate their sufficient "richness" for the intended use. This study was limited to five Czech vowels, as these sounds are characterized by being highly energetic and having rich spectral content, which induces complexity and a wide range of possible discontinuities at concatenation points.

#246: Design and Implementation of a Bayesian Network Speech Recognizer

Pascal Wiggers, Leon J. M. Rothkrantz, Rob van de Lisdonk (Delft University of Technology, Delft, The Netherlands)

In this paper we describe a speech recognition system implemented with generalized dynamic Bayesian networks (BNs). We discuss the design of the system and the features of the underlying toolkit we constructed that makes efficient processing of speech and language data with Bayesian networks possible. Features include: sparse representations of probability tables, a fast algorithm for inference with probability tables, lazy evaluation of probability tables, algorithms for calculations with tree-shaped distributions, the ability to change distributions on the fly, and a generalization of BN model structure.

#187: Diagnostics for Debugging Speech Recognition Systems

Miloš Cerňak (Slovak Academy of Sciences, Bratislava, Slovakia)

Modern speech recognition applications are becoming very complex program packages. To understand the error behaviour of ASR systems, a special diagnosis (a procedure or a tool) is needed. Many ASR users and developers have developed their own expert diagnostic rules that can be successfully applied to a system. There are also several explicit approaches in the literature for determining the problems related to application errors. The approaches are based on error and ablative analyses of the ASR components, with a blame assignment to a problematic component. The disadvantage of those methods is that they either require time-consuming acquisition of expert diagnostic knowledge or offer only very coarse-grained localization of a problematic ASR part. This paper proposes fine-grained diagnostics for debugging ASR by applying program-spectra based failure localization, which directly localizes a part of the ASR implementation. We designed a toy experiment with the diagnostic database OLLO to show that our method is very easy to use and that it provides good localization accuracy. Because it is not able to localize all the errors, an issue that we discuss, we recommend using it together with other coarse-grained localization methods for a complex ASR diagnosis.
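To give a feel for program-spectra based failure localization in general, the sketch below ranks components by how strongly their execution correlates with failing runs. The abstract does not say which similarity coefficient the paper uses; the Ochiai coefficient is assumed here as one common choice, and the component names and run data are invented.

```python
import math

def ochiai_ranking(spectra, failed):
    """Rank components by suspiciousness using the Ochiai coefficient.

    spectra: list of sets, the components executed in each run.
    failed:  list of booleans, True where the corresponding run failed.
    """
    components = set().union(*spectra)
    total_failed = sum(failed)
    scores = {}
    for comp in components:
        # ef: failing runs that executed comp; ep: passing runs that executed comp.
        ef = sum(1 for run, bad in zip(spectra, failed) if bad and comp in run)
        ep = sum(1 for run, bad in zip(spectra, failed) if not bad and comp in run)
        denom = math.sqrt(total_failed * (ef + ep))
        scores[comp] = ef / denom if denom else 0.0
    # Most suspicious first; ties broken by name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical spectra from four recognition runs (two of them erroneous).
spectra = [
    {"frontend", "acoustic_model"},
    {"frontend", "language_model"},
    {"frontend", "acoustic_model", "language_model"},
    {"frontend", "language_model"},
]
failed = [True, False, True, False]
ranking = ochiai_ranking(spectra, failed)
```

A component that runs in every execution, such as the hypothetical `frontend` here, receives a lower score than one that runs only in failing executions, which is what makes the ranking useful for narrowing down the faulty part.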

#227: Dialogue System Based on EDECÁN Architecture

Javier Mikel Olaso, María Inés Torres (Universidad del País Vasco, Leioa, Basque Country, Spain)

Interactive and multimodal interfaces have proved helpful in human-machine interactive systems such as dialogue systems. Facial animation, specifically lip motion, helps to make speech comprehensible and dialogue turns intuitive. The dialogue system under consideration consists of a stand that lets users retrieve current and past news published on the Internet by several newspapers and sites, and also obtain information about the weather, initially for Spanish cities, although it can easily be extended to other cities around the world. The final goal is to provide valuable information and entertainment to people queuing or just passing by. The system is also aimed at disabled people, thanks to the different multimodal inputs and outputs taken into consideration. This work describes the different modules that are part of the dialogue system. These modules were developed under the EDECÁN architecture specifications.

#192: Embedded Speech Recognition in UPnP (DLNA) Environment

Jozef Ivanecký, Radek Hampl (European Media Laboratory, Heidelberg, Germany)

In the past decade great technological advances have been made in internet services, personal computers, telecommunications, media and entertainment. Many of these advances have benefited from sharing technologies across those industries. This influences how Digital Home Entertainment products are designed to follow the overall "Media Convergence" trend. Existing Universal Plug and Play (UPnP) or DLNA specifications are often used for these purposes. These specifications permit electronic devices to be simply plugged into home and local networks for access and exchange of shared data like music, video or photos. The number of media items in a user library can then easily exceed 10,000 elements. In addition, these specifications are used by manufacturers of consumer electronics to ensure interoperability of different consumer electronic devices. In this paper, we describe our efforts towards introducing speech recognition to control electronic devices in UPnP (DLNA) environments. We give an overview of the content structure and media information available in the UPnP (DLNA) network. We also analyze the use of available information for speech recognition. The main focus will be on the possibility of designing and implementing a voice-enabled UPnP (DLNA) Control Point, and the introduction of one particular solution.

#233: Emotion Recognition from Speech

Iulia Lefter, Leon J. M. Rothkrantz, Pascal Wiggers (Delft University of Technology, Delft, The Netherlands), David A. van Leeuwen (TNO Human Factors, Delft, The Netherlands)

We explore possibilities for enhancing the generality, portability and robustness of emotion recognition systems by combining databases and by fusion of classifiers. In a first experiment, we investigate the performance of an emotion detection system tested on a certain database given that it is trained on speech from either the same database, a different database or a mix of both. We observe that generally there is a drop in performance when the test database does not match the training material, but there are a few exceptions. Furthermore, the performance drops when a mixed corpus of acted databases is used for training and testing is carried out on real-life recordings. In a second experiment we investigate the effect of training multiple emotion detectors, and fusing these into a single detection system. We observe a drop in the Equal Error Rate (EER) from 19.0% on average for 4 individual detectors to 4.2% when fused using FoCal.
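For readers unfamiliar with the metric, the following minimal sketch shows one common way to approximate an Equal Error Rate from detector scores: sweep a threshold and report the point where the miss rate and the false-alarm rate are closest. The function name and the toy scores are assumptions for illustration; this is not the FoCal toolkit and does not reproduce the fusion step.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over the observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # Miss: a target trial scored below the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: a non-target trial scored at or above the threshold.
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - fa)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + fa) / 2
    return eer

# Made-up scores: higher means "emotion present" is more likely.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
```

With only a handful of scores the two error rates rarely cross exactly, which is why the sketch averages them at the closest point instead of interpolating.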

#320: Emotion Recognition: Decoupling Emotion and Speaker Information

Rok Gajšek, Vitomir Štruc, France Mihelič (University of Ljubljana, Ljubljana, Slovenia)

The standard features used in emotion recognition carry, besides the emotion-related information, also cues about the speaker. This is expected, since the nature of emotionally colored speech is similar to the variations in the speech signal caused by different speakers. Therefore, we present a gradient descent derived transformation for the decoupling of emotion and speaker information contained in the acoustic features. The Interspeech '09 Emotion Challenge feature set is used as the baseline for the audio part. A similar procedure is employed on the video signal, where the nuisance attribute projection (NAP) is used to derive the transformation matrix, which contains information about the emotional state of the speaker. Ultimately, different NAP transformation matrices are compared using canonical correlations. The audio and video sub-systems are combined at the matching score level using different fusion techniques. The presented system is assessed on the publicly available eNTERFACE '05 database, where significant improvements in the recognition performance are observed when compared to the state-of-the-art baseline.

#287: Encoding Event and Argument Structures in Wordnets

Raquel Amaro, Sara Mendes, Palmira Marrafa (University of Lisbon, Lisbon, Portugal)

In this paper we propose the codification of argument and event structures in wordnets, providing information on selection properties, semantic incorporation phenomena and internal properties of events, in what we claim to be an affordable procedure. We propose an explicit expression of argument structure, including default and shadow arguments, through three new relations and a new order feature. As synsets in wordnets are associated with a given POS, information on the selection properties of lexical items is added. We show that the systematic encoding of event structure information, through five new features at synset level, besides providing the grounds for describing the order of arguments, enriches the descriptive power of these resources. In doing so, we crucially contribute to making wordnets rich and structured repositories of lexical semantic information that allow for the extraction of argument and event structures of lexical items, thus enhancing their usability in NLP systems.

#205: Enhancing Emotion Recognition from Speech...

Theodoros Kostoulas, Todor Ganchev, Alexandros Lazaridis, Nikos Fakotakis (University of Patras, Rion-Patras, Greece)

In the present work we aim at performance optimization of a speaker-independent emotion recognition system through a speech feature selection process. Specifically, relying on the speech feature set defined in the Interspeech 2009 Emotion Challenge, we studied the relative importance of the individual speech parameters, and based on their ranking, a subset of speech parameters that offered advantageous performance was selected. The affect-emotion recognizer utilized here relies on a GMM-UBM-based classifier. In all experiments, we followed the experimental setup defined by the Interspeech 2009 Emotion Challenge, utilizing the FAU Aibo Emotion Corpus of spontaneous, emotionally coloured speech. The experimental results indicate that the correct choice of the speech parameters can lead to better performance than the baseline.

#305: Estonian: Some Findings for Modelling Speech Rhythmicity and...

Mari-Liis Kalvik, Meelis Mihkla, Indrek Kiissel, Indrek Hein (Institute of the Estonian Language, Tallinn, Estonia)

This paper presents the results of two studies with a common aim: to improve the quality of synthetic speech. The study of the parameters of the three quantity degrees that carry the Estonian stress structure reveals that the durational ratio of the vowels of stressed and unstressed syllables is the most appropriate distinctive feature of quantity opposition. Investigation of the perception of different speech rates in blind and sighted listeners shows that blind listeners trained with screen readers prefer a considerably higher speech rate.

#278: Evaluation of a Sentence Ranker for Text Summarization

Alistair Kennedy, Stan Szpakowicz (University of Ottawa, Ottawa, Canada)

Evaluation is one of the hardest tasks in automatic text summarization. It is perhaps even harder to determine how much a particular component of a summarization system contributes to the success of the whole system. We examine how to evaluate the sentence ranking component using a corpus which has been partially labelled with Summary Content Units. To demonstrate this technique, we apply it to the evaluation of a new sentence-ranking system which uses a thesaurus. This corpus provides a quick and nearly automatic method of evaluating the quality of sentence ranking.

#336: Event-Time Relation Identification Using Machine Learning and Rules

Anup Kumar Kolya (Jadavpur University, Kolkata, India), Asif Ekbal (Heidelberg University, Heidelberg, Germany), Sivaji Bandyopadhyay (Jadavpur University, Kolkata, India)

Temporal information extraction is a popular and interesting research field in the area of Natural Language Processing (NLP). In this paper, we report our work on temporal relation identification within the TimeML framework. We worked on TempEval-2007 Task B, which involves identification of relations between events and document creation time. Two different systems, one based on machine learning and the other based on handcrafted rules, are developed. The machine learning system is based on a Conditional Random Field (CRF) that makes use of only some of the features available in the TimeBank corpus in order to infer temporal relations. The second system is developed using a set of manually constructed rules. Evaluation results show that the rule-based system performs better than the machine learning based system, with precision, recall and F-score values of 75.9%, 75.9% and 75.9%, respectively, under the strict evaluation scheme and 77.1%, 77.1% and 77.1%, respectively, under the relaxed evaluation scheme. In contrast, the CRF-based system yields precision, recall and F-score values of 74.1%, 73.6% and 73.8%, respectively, under the strict evaluation scheme and 75.1%, 74.6% and 74.8%, respectively, under the relaxed evaluation scheme.
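The relationship between the three reported figures can be made concrete with a small sketch. It assumes exact-match scoring over event-to-document-time relations, in the spirit of a strict scheme; the function name, the identifiers and the labels are hypothetical, and the partial-credit logic of the relaxed scheme is not modelled.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scoring; gold and predicted map instance ids to labels.

    A system may leave some gold instances unlabelled, in which case
    precision and recall differ.
    """
    correct = sum(1 for key, label in predicted.items() if gold.get(key) == label)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical annotations: four gold relations, three system answers.
gold = {"e1": "BEFORE", "e2": "AFTER", "e3": "OVERLAP", "e4": "BEFORE"}
system = {"e1": "BEFORE", "e2": "BEFORE", "e3": "OVERLAP"}
p, r, f = precision_recall_f1(gold, system)
```

When a system labels every gold instance, the two denominators coincide and precision, recall and F-score are equal, which would be consistent with the identical values reported for the rule-based system.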

#204: Expressive Gibberish Speech Synthesis for Affective HCI

Selma Yilmazyildiz, Lukas Latacz, Wesley Mattheyses, Werner Verhelst (Vrije Universiteit Brussel, Brussel, Belgium)

In this paper we present our study on expressive gibberish speech synthesis as a means for affective communication between computing devices, such as a robot or an avatar, and their users. Gibberish speech consists of vocalizations of meaningless strings of speech sounds and is sometimes used by performing artists to express intended (and often exaggerated) emotions and affect, such as anger and surprise, without actually pronouncing any understandable word. The advantage of gibberish in affective computing lies in the fact that no understandable text has to be pronounced and that only affect is conveyed. This can be used to test the effectiveness of affective prosodic strategies, for example, but it can also be applied in actual systems.

#307: Extracting Human Spanish Nouns

Sofia N. Galicia-Haro (Universidad Nacional Autónoma de México, Mexico City, Mexico), Alexander F. Gelbukh (Instituto Politécnico Nacional, Mexico City, Mexico)

In this article we present a simple method to extract Spanish nouns with the linguistic property of "human" animacy. We describe a non-supervised method based on lexical patterns and on a person name list enlarged from a collection of newspaper texts. Results were obtained from the Web; filters and estimation methods are proposed to validate them.

#259: Final Experiments with Czech MALACH Project

Josef Psutka, Jan Švec, Josef V. Psutka, Jan Vaněk, Aleš Pražák, Luboš Šmídl (University of West Bohemia, Pilsen, Czech Republic)

In this paper we describe a system for fast phonetic/lexical search in the large archives of Czech Holocaust testimonies. The developed system is the first step towards fulfilling the visions of the MALACH project, at least with regard to easier and faster access to the Czech part of the archives. More than one thousand hours of spontaneous, accented and highly emotional speech of Czech Holocaust survivors, stored at the USC Shoah Foundation Institute as video interviews, were automatically transcribed and phonetically/lexically indexed. Special attention was paid to the processing of colloquial words, which appear very frequently in spontaneous Czech speech. The final access to the archives is very fast, allowing detection of interview segments containing pronounced words, clusters of words present in pre-defined time intervals, and also words that were not included in the working vocabulary (OOV words).

#203: GD AMs Fusion for Parliament Subtitling

Jan Vaněk, Josef V. Psutka (University of West Bohemia, Pilsen, Czech Republic)

Gender-dependent (male/female) acoustic models are more acoustically homogeneous and therefore give better recognition performance than a single gender-independent model. This paper addresses the problem of how to use these gender-based acoustic models in a real-time LVCSR (Large Vocabulary Continuous Speech Recognition) system that has been used for more than a year by Czech TV for automatic subtitling of Parliament meetings broadcast on the channel ČT24. Frequent changes of speakers and the direct connection of the LVCSR system to the TV audio stream require switching/fusion of models automatically and as soon as possible. The paper presents various techniques based on using the output probabilities for quick selection of the better model or their combination. The best proposed method achieved over 11% relative WER reduction in comparison with the GI model.
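Since the result is stated as a relative rather than an absolute reduction, a short worked example may help. The word error rates below are invented purely to show the arithmetic; the abstract does not report the underlying absolute figures.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction: the share of the baseline error that was removed."""
    return (baseline_wer - new_wer) / baseline_wer

# Invented numbers: a gender-independent baseline at 20.0% WER and a
# fused gender-dependent system at 17.8% WER.
gain = relative_reduction(20.0, 17.8)
absolute_gain = 20.0 - 17.8
```

Here the absolute improvement is 2.2 percentage points, which corresponds to an 11% relative reduction.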

#197: Gradient Descent Optimization in Training from Heterogeneous Data

Martin Karafiát, Igor Szoeke, Jan Černocký (Brno University of Technology, Brno, Czech Republic)

In this paper, we study the use of heterogeneous data for training of acoustic models. In initial experiments, a significant drop in accuracy was observed on the in-domain test set if the data was added without any regularization. A solution is proposed that gains control over the training data by optimizing the weights of the different data sets. The final models show good performance on all the tests linked to various speaking styles. Furthermore, we used this approach to increase the performance on just the main test set. We obtained a 0.3% absolute improvement on the basic system and 0.4% on the HLDA system, although the size of the heterogeneous data set was quite small.

#211: Hybrid HMM/BLSTM-RNN for Robust Speech Recognition

Yang Sun, Louis ten Bosch, Lou Boves (Radboud University, Nijmegen, The Netherlands)

The question of how to integrate information from different sources in speech decoding is still only partially solved (layered architecture versus integrated search). We investigate the optimal integration of information from Artificial Neural Nets in a speech decoding scheme based on a Dynamic Bayesian Network for noise robust ASR. An HMM implemented by the DBN cooperates with a novel Recurrent Neural Network (BLSTM-RNN), which exploits long-range context information to predict a phoneme for each MFCC frame. When using the identity of the most likely phoneme as a direct observation, such a hybrid system has been shown to improve noise robustness. In this paper, we use the complete BLSTM-RNN output, which is presented to the DBN as Virtual Evidence. This allows the hybrid system to use information about all phoneme candidates, which was not possible in previous experiments. Our approach improved word accuracy on the Aurora 2 Corpus by 8%.

#237: Improving Image Captioning Using Text Summarization

Laura Plaza (Universidad Complutense de Madrid, Madrid, Spain), Elena Lloret (University of Alicante, Alicante, Spain), Ahmet Aker (University of Sheffield, Sheffield, UK)

This paper presents two different approaches to automatic captioning of geo-tagged images by summarizing multiple web documents that contain information related to an image's location: a graph-based and a statistical-based approach. The graph-based method uses text cohesion techniques to identify information relevant to a location. The statistical-based technique relies on counting the frequencies of different words or noun phrases to identify pieces of information relevant to a location. Our results show that summaries generated using these two approaches indeed lead to higher ROUGE scores than the n-gram language models reported in previous work.

#251: Integrating Aggregation Strategies

Pablo Gervás (Universidad Complutense de Madrid, Madrid, Spain), Gabriel Amores (Universidad de Sevilla, Sevilla, Spain), Raquel Hervás (Universidad Complutense de Madrid, Madrid, Spain), Guillermo Pérez (Universidad de Sevilla, Sevilla, Spain), Susana Bautista, Virginia Francisco (Universidad Complutense de Madrid, Madrid, Spain), Pilar Manchón (Universidad de Sevilla, Sevilla, Spain)

This paper presents the integration of a natural language generation system into an In-Home Domain Dialogue System to achieve fluent, non-redundant verbal descriptions of the state of the environment. Three important contributions are brought together in this integration: an in-depth study of aggregation strategies preferred by users in the In-Home Domain, a fully operational dialogue system, and a natural language generation system capable of implementing the required aggregation strategies. The integration is validated by means of acceptance tests with human evaluators. In this paper we show how the aggregation strategies remove redundancies and provide a description that is assigned higher scores by human evaluators than prior descriptions.

#321: Integration of Speech and Text Processing Modules

Jan Ptáček (Charles University, Prague, Czech Republic), Pavel Ircing (University of West Bohemia, Pilsen, Czech Republic), Miroslav Spousta (Charles University, Prague, Czech Republic), Jan Romportl, Zdeněk Loose (University of West Bohemia, Pilsen, Czech Republic), Silvie Cinková (Charles University, Prague, Czech Republic), José Relaño Gil, Raúl Santos (Telefónica I+D, Madrid, Spain)

This paper presents a real-time implementation of an automatic dialogue system called 'Senior Companion', which is not strictly task-oriented, but is instead designed to 'chat' with elderly users about their family photographs. To a large extent, this task has lost the usual restriction of dialogue systems to a particular (narrow) domain, and thus the speech and natural language processing components had to be designed to cover a broad range of possible user and system utterances.

#103: Iterative Decoding for Speech Recognition

Frederick Jelinek (Johns Hopkins University, USA)

Recently many improvements in speech recognition performance have been obtained by means of re-scoring either a lattice or a confusion network generated by the primary recognizer. In this talk we will show performance gains obtained by applying iterative decoding to the confusion network using a more sophisticated language model than the one basic to the primary recognizer. Iterative decoding results in further gains if rescoring involves also a more sophisticated acoustic model, and/or if the re-estimation is carried out under a minimum Bayes risk regime based on simulated annealing. The description of the latter algorithms is beyond the scope of this lecture.

#201: Lexical-Conceptual Relations as Qualia Role Encoders

Raquel Amaro, Sara Mendes, Palmira Marrafa (University of Lisbon, Lisbon, Portugal)

In this paper we show how wordnets can be used for building computational lexica that support generative processes accounting for phenomena such as the creation of meaning in context. We propose the integration of qualia information in wordnets through the association of lexical-conceptual relations to qualia roles, in what is a simple and low cost procedure, as it makes use of information already encoded in wordnets. This association between lexical-conceptual relations and qualia aspects allows us to describe the qualia structure of lexical items in a consistent way, without any loss of information and with the advantage of identifying the semantic predicates that can be values of qualia roles.

#292: Linguistic Adaptation in Semi-natural Dialogues: Age Comparison

Marie Nilsenová, Palesa Nolting (Tilburg University, Tilburg, The Netherlands)

Speaker adaptation in dialogues appears to support not only dialogue coordination, but also language processing, learning and in/out-group manifestation. Presumably, speakers in various stages of their language development might exploit different functions and types of adaptation, but conclusive research in this area has so far been lacking. In the present study, we compare structural, lexical and prosodic adaptation in a semi-natural dialogue across two age groups, in adult-child and adult-adult dyads. The results of our experiments indicate that children take over the structural and lexical forms used by their dialogue partner more frequently than adults. Children also adapt to the pitch of the speaker they interact with more than adult participants. Irrespective of age, we found longer onset latencies following the experimenter's question if the question had a non-canonical (declarative) form compared to a question with a canonical (interrogative) form. This can be seen as a manifestation of a processing advantage typically associated with the long-term effects of adaptation-as-learning.

#268: Listening-Test-Based Annotation of Expressive Speech Corpus

Martin Grůber, Jindřich Matoušek (University of West Bohemia, Pilsen, Czech Republic)

This paper focuses on the evaluation of a listening test that was conducted in order to objectively annotate expressive speech recordings and further develop a limited domain expressive speech synthesis system. There are two main issues to face in this task. The first is that expressivity in speech has to be defined in some way. The second is that the perception of expressive speech is subjective. However, for the purposes of expressive speech synthesis using unit selection algorithms, the expressive speech corpus has to be objectively and unambiguously annotated. First, a classification of expressivity was determined making use of communicative functions. These are supposed to describe the type of expressivity and/or the speaker's attitude. Further, to achieve objectivity at a significant level, a listening test with a relatively large number of listeners was conducted. The listeners were asked to mark sentences in the corpus using communicative functions. The aim of the test was to acquire a sufficient number of subjective annotations of the expressive recordings so that we would be able to create an "objective" annotation. There are several methods for obtaining an objective evaluation from many subjective ones; two of them are presented.

#241: Online TV Captioning of Czech Parliamentary Sessions

Jan Trmal (University of West Bohemia, Pilsen, Czech Republic), Aleš Pražák (SpeechTech, s.r.o, Pilsen, Czech Republic), Zdeněk Loose, Josef Psutka (University of West Bohemia, Pilsen, Czech Republic)

In this paper we introduce the on-line captioning system developed by our teams and used by Czech Television (CTV), the public service broadcaster in the Czech Republic. The research project is targeted at the incorporation of speech technologies into the CTV environment. One of the key missions is the development of a captioning system supporting captioning of a "live" acoustic track. It can be either the real audio stream or the audio stream produced by a shadow speaker. Another key mission is to develop software tools and techniques usable for training the shadow speakers. During the initial phases of the project we concluded that the broadcasting of the Parliamentary meetings of the Chamber of Deputies fulfills the necessary conditions that enable it to be captioned without the aid of a shadow speaker. We developed a fully automatic captioning pilot system making the broadcasting of Parliamentary meetings of the Chamber of Deputies accessible to hearing-impaired viewers. The pilot run enabled us and our partners in Czech TV to develop and evaluate the complete captioning infrastructure and to collect, review and possibly implement opinions and suggestions of the targeted audience. This paper presents the experience we gathered during the first years of the project to the public audience.

#235: Opinion Mining by Transformation-Based Domain Adaptation

Róbert Ormándi, István Hegedüs (University of Szeged, Szeged, Hungary), Richárd Farkas (Hungarian Academy of Sciences, Budapest, Hungary)

Here we propose a novel approach for the task of domain adaptation for Natural Language Processing. Our approach captures relations between the source and target domains by applying a model transformation mechanism which can be learnt by using labeled data of limited size taken from the target domain. Experimental results on several Opinion Mining datasets show that our approach significantly outperforms baselines and published systems when the amount of labeled data is extremely small.

#200: Optimal Minimization of a Pronunciation Dictionary Model

Simon Dobrišek (University of Ljubljana, Ljubljana, Slovenia), Janez Žibert (University of Primorska, Koper, Slovenia), France Mihelič (University of Ljubljana, Ljubljana, Slovenia)

This paper presents the results of our efforts to obtain the minimum possible finite-state representation of a pronunciation dictionary. Finite-state transducers are widely used to encode word pronunciations and our experiments revealed that the conventional redundancy-reduction algorithms developed within this framework yield suboptimal solutions. We found that the incremental construction and redundancy reduction of acyclic finite-state transducers creates considerably smaller models (up to 60%) than the conventional, non-incremental (batch) algorithms implemented in the OpenFST toolkit.

#228: Parallel Training of Neural Networks for Speech Recognition

Karel Veselý, Lukáš Burget, František Grézl (Brno University of Technology, Brno, Czech Republic)

Feed-forward multi-layer neural networks are of significant importance in speech recognition. A new parallel-training tool, TNet, was designed and optimized for multiprocessor computers. The training acceleration rates are reported on a phoneme-state classification task.

#296: Perplexity of n-gram and Dependency Language Models

Martin Popel, David Mareček (Charles University, Prague, Czech Republic)

Language models (LMs) are essential components of many applications such as speech recognition or machine translation. LMs factorize the probability of a string of words into a product of P(w_i|h_i), where h_i is the context (history) of word w_i. Most LMs use previous words as the context. The paper presents two alternative approaches: post-ngram LMs (which use following words as context) and dependency LMs (which exploit the dependency structure of a sentence and can use e.g. the governing word as context). Dependency LMs could be useful whenever the topology of a dependency tree is available, but its lexical labels are unknown, e.g. in tree-to-tree machine translation. In comparison with a baseline interpolated trigram LM, both approaches achieve significantly lower perplexity for all seven tested languages (Arabic, Catalan, Czech, English, Hungarian, Italian, Turkish).
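The factorization above translates directly into a perplexity computation in which the choice of context h_i is a pluggable function. The sketch below is only an illustration under that assumption: `perplexity`, `previous_word`, `following_word` and the boundary symbols are hypothetical names, and the toy uniform model stands in for a trained LM.

```python
import math

def perplexity(words, cond_prob, context_of):
    """Perplexity of a sentence under P(w_i | h_i), with h_i = context_of(words, i)."""
    log_sum = sum(
        math.log2(cond_prob(w, context_of(words, i))) for i, w in enumerate(words)
    )
    return 2 ** (-log_sum / len(words))

def previous_word(words, i):
    # Conventional bigram-style context: the word to the left.
    return words[i - 1] if i > 0 else "<s>"

def following_word(words, i):
    # Post-ngram-style context: the word to the right.
    return words[i + 1] if i + 1 < len(words) else "</s>"

def uniform_model(word, context):
    # Toy stand-in for a trained model: four equally likely words.
    return 0.25

sentence = ["the", "dog", "barks"]
ppl_prev = perplexity(sentence, uniform_model, previous_word)
ppl_post = perplexity(sentence, uniform_model, following_word)
```

A dependency LM would fit the same interface with a `context_of` function that returns the governing word, which requires a parse tree and is therefore left out of this sketch.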

#243: Posterior Estimates and Transforms for Speech Recognition

Jan Zelinka, Luboš Šmídl, Jan Trmal, Luděk Müller (University of West Bohemia, Pilsen, Czech Republic)

This paper describes ANN-based posterior estimates and their application to speech recognition. We replaced the standard back-propagation with the L-BFGS quasi-Newton method. We have focused only on posterior-based feature vector extraction. Our goal was a reduction of the feature vector dimension. Thus we designed three posterior transforms to a space with dimensionality 1 or 2. The designed transforms were tested on the SpeechDat-East corpus. We also applied the introduced method to a Czech audio-visual corpus. In both cases the methods lead to a significant word error rate decrease.

#327: Preliminary Study on HMM-based NER for Polish

Michał Marcińczuk, Maciej Piasecki (Wrocław University of Technology, Wrocław, Poland)

Accuracy of a Named Entity Recognition algorithm based on the Hidden Markov Model is investigated. The algorithm was limited to recognition and classification of Named Entities representing persons. The algorithm was tested on two small Polish domain corpora of stock exchange and police reports. A comparison with baseline algorithms based on the case of the first letter and a gazetteer is presented. The algorithm achieved 62% precision and 93% recall for the domain of the training data. Introduction of simple hand-written post-processing rules increased precision up to 89%. We also discuss the problem of the method's portability. A model of combined knowledge sources is also sketched in the conclusions as a possible way to overcome the portability problem.

#289: Question Answering for Not Quite Semantic Web

Miloslav Konopík, Ondřej Rohlík (University of West Bohemia, Pilsen, Czech Republic)

In this paper we present a prototype implementation of a question answering system for one of the inflectional languages, Czech. The presented open domain system is especially effective in answering factual wh-questions about people, dates, names and locations. The answer is constructed on-the-fly from data gathered from the Internet, public ontologies, knowledge of the Czech language, and an extensible template system. The system is capable of semiautomatic learning of new templates as well as both statistical and semantic processing of Internet content.

#263: Real Anaphora Resolution is Hard: The Case of German

Manfred Klenner, Angela Fahrni, Rico Sennrich (University of Zurich, Zurich, Switzerland)

We introduce a system for anaphora resolution for German that uses various resources in order to develop a real system as opposed to systems based on idealized assumptions, e.g. the use of true mentions only or perfect parse trees and perfect morphology. The components that we use to replace such idealizations comprise a full-fledged morphology, a Wikipedia-based named entity recognition, a rule-based dependency parser and a German wordnet. We show that under these conditions coreference resolution is (at least for German) still far from being perfect.

#238: Recovery of Rare Words in Lecture Speech

Stefan Kombrink, Mirko Hannemann, Lukáš Burget, Hynek Heřmanský (Brno University of Technology, Brno, Czech Republic)

The vocabulary used in speech usually consists of two types of words: a limited set of common words, shared across multiple documents, and a virtually unlimited set of rare words, each of which might appear a few times only in particular documents. In most documents, however, these rare words are not seen at all. The first type of words is typically included in the language model of an automatic speech recognizer (ASR) and is thus widely referred to as in-vocabulary (IV). Words of the second type are missing in the language model and thus are called out-of-vocabulary (OOV). However, these words usually carry important information. We use a hybrid word/sub-word recognizer to detect OOV words occurring in English talks and describe them as sequences of sub-words. We detected about one third of all OOV words, and were able to recover the correct spelling for 26.2% of all detections by using a phoneme-to-grapheme (P2G) conversion trained on the recognition dictionary. By omitting detections corresponding to recovered IV words, we were able to increase the precision of the OOV detection substantially.

#196: Robust Statistic Estimates for Adaptation in Speech Recognition

Zbyněk Zajíc, Lukáš Machlica, Luděk Müller (University of West Bohemia, Pilsen, Czech Republic)

This paper deals with robust estimation of the data statistics used for adaptation. The statistics are accumulated from the available adaptation data before the adaptation process. In general, only a small amount of adaptation data is assumed. These data are often corrupted by noise and channel effects and do not contain only clean speech. Moreover, training Hidden Markov Models (HMM) relies on several assumptions that may not be fulfilled in practice. Therefore, we describe several techniques that aim to make the adaptation as robust as possible in order to increase the accuracy of the adapted system. One of the methods consists in initializing the adaptation statistics in order to prevent ill-conditioned transformation matrices. Another problem arises when an acoustic feature is assigned to an improper HMM state even if the reference transcription is available. Such situations can occur because of the forced alignment process used to align frames to states. Thus, it is useful to accumulate data statistics utilizing only reliable frames (in the sense of data likelihood). We focus on Maximum Likelihood Linear Transformations, and the experiments were performed utilizing feature Maximum Likelihood Linear Regression (fMLLR). The experiments aim to describe the behavior of the system extended by the proposed methods.

#264: Semantic Duplicate Identification with Parsing and Machine Learning

Sven Hartrumpf, Tim vor der Brück (FernUniversität in Hagen, Hagen, Germany), Christian Eichhorn (Technische Universität Dortmund, Dortmund, Germany)

Identifying duplicate texts is important in many areas like plagiarism detection, information retrieval, text summarization, and question answering. Current approaches are mostly surface-oriented (or use only shallow syntactic representations) and see each text only as a token list. In this work, however, we describe a deep, semantically oriented method based on semantic networks which are derived by a syntactico-semantic parser. Semantically identical or similar semantic networks for each sentence of a given base text are efficiently retrieved by using a specialized index. In order to detect many kinds of paraphrases, the semantic networks of a candidate text are varied by applying inferences: lexico-semantic relations, relation axioms, and meaning postulates. Important phenomena occurring in difficult duplicates are discussed. The deep approach profits from background knowledge, whose acquisition from corpora like Wikipedia is explained briefly. The deep duplicate recognizer is combined with two shallow duplicate recognizers in order to guarantee a high recall for texts which are not fully parsable. The evaluation shows that the combined approach preserves recall and increases precision considerably in comparison to traditional shallow methods.

#298: Semantic Role Patterns and Verb Classes in Verb Valency Lexicon

Zuzana Nevěřilová (Masaryk University, Brno, Czech Republic)

For the Czech language there is a large valency frame lexicon: VerbaLex. It contains verbs, slots related to the verbs, and information about the semantic role each slot plays. This paper discusses observations made on VerbaLex frames related to verb classification. It shows that for particular classes of verbs (e.g. verbs describing weather) some semantic role patterns are typical. It also tries to reveal these patterns in less obvious cases. Currently, verb frames in VerbaLex are not interconnected. This paper outlines how such connections can be made. We expect that verb frames of the same class or with the same semantic role patterns are semantically close and therefore propose similar types of interconnection. We expect to create a relatively small set of inference rules that influence a large number of verb frames.

#282: Special Speech Synthesis for Social Network Websites

Csaba Zainkó, Tamás Gábor Csapó, Géza Németh (Budapest University of Technology and Economics, Budapest, Hungary)

This paper gives an overview of the design concepts and implementation of a Hungarian microblog reading system. Speech synthesis of such special text requires some special components. First, an efficient diacritic reconstruction algorithm was applied. The accuracy of a former dictionary-based method was improved by machine learning to handle ambiguous cases properly. Second, an unlimited-domain text-to-speech synthesizer was applied with extensions for emotional and spontaneous styles. Chat or blog texts often contain "emoticons" which mark the emotional state of the user. Therefore, an expressive speech synthesis method was adapted to a corpus-based synthesizer. Four emotions were generated and evaluated in a listening test: neutral, happy, angry and sad. The results of the experiments showed that happy and sad emotions can be generated with this algorithm, with the best accuracy for the female voice.

#215: The Structure of a Discontinuous Dialogue

Tiit Hennoste, Olga Gerassimenko, Riina Kasterpalu, Mare Koit, Kirsi Laanesoo, Anni Oja, Andriela Rääbis, Krista Strandson (University of Tartu, Tartu, Estonia)

We are studying how a dialogue structure is established by an Internet opinion article and its anonymous comments. We are using the methodology of conversation analysis with the focus on membership categorization analysis. The study shows that the core structure of the dialogue is formed by many parallel micro-dialogues. Besides the linear micro-dialogue structure there is a structure layer which is formed by the complex category sets built by participants using membership categorization of the agents of the article as well as the commentators themselves. We investigate the strategies and the linguistic means used by the participants for creating interrelations in the complex multilayered structure.

#219: These Nouns that Hide Events: an Initial Detection

Amaria Adila Bouabdallah, Tassadit Amghar, Bernard Levrat (University of Angers, Angers, France)

Many studies have been devoted to the temporal analysis of texts, and more precisely to the tagging of temporal entities and relations occurring in texts. Among the latter, the various guises of events in their multiple forms of occurrence have been tackled by numerous works. We describe here a method for the detection of noun phrases denoting events. Our approach is based on the implementation of a simple linguistic test proposed by linguists for this task. Our method is applied to two different corpora: the first is composed of newspaper articles and the second, a much larger one, rests on an interface for automatically querying the Yahoo search engine. Preliminary results are encouraging, and increasing the size of the learning corpus should allow for a real statistical validation of the results.

#254: Towards Disambiguation of Word Sketches

Vít Baisa (Masaryk University, Brno, Czech Republic)

A word sketch is a source of valuable information both for linguists and lexicographers, but it consists of lemmas which are not disambiguated. In this paper we describe a method which can partially disambiguate these lemmas and increase the quality of the information contained in word sketches. For the disambiguation we exploit intersections of English and Czech word sketches using an English-Czech dictionary.

#230: Towards a Bank of Constituent Parse Trees for Polish

Marek Swidzinski (Warsaw University, Warsaw, Poland), Marcin Wolinski (Polish Academy of Sciences, Warsaw, Poland)

We present a project aimed at the construction of a bank of constituent parse trees for 20,000 Polish sentences taken from the balanced hand-annotated subcorpus of the National Corpus of Polish (NKJP). The treebank is to be obtained by automatic parsing and manual disambiguation of the resulting trees. The grammar applied by the project is a new version of Swidzinski's formal definition of Polish. Each sentence is disambiguated independently by two linguists and, if needed, adjudicated by a supervisor. The feedback from this process is used to iteratively improve the grammar. In the paper, we describe both the linguistic and the technical decisions made in the project. We discuss the overall shape of the parse trees, including the extent of encoded grammatical information. We also delve into the problem of syntactic disambiguation as a challenge for our work.

#208: Towards an N-version Dependency Parser

Miguel Ballesteros, Jesús Herrera, Virginia Francisco, Pablo Gervás (Universidad Complutense de Madrid, Madrid, Spain)

MaltParser is a contemporary machine-learning-based dependency parsing system that shows great accuracy. However, 90% for Labelled Attachment Score (LAS) seems to be a de facto limit for such kinds of parsers. Since such systems generally cannot be modified, previous work has studied what can be done with the training corpora in order to improve parsing accuracy. High-level techniques, such as controlling sentence length or corpus size, seem useless for these purposes. But low-level techniques, based on an in-depth study of the errors produced by the parser at the word level, seem promising. Prospective low-level studies suggested the development of n-version parsers. Each of these n versions should be able to tackle a specific kind of dependency parsing at the word level, and the combined action of all of them should yield more accurate parses. In this paper we present an extensive study of the usefulness and the expected limits of n-version parsers for improving parsing accuracy. This work has been developed specifically for Spanish using MaltParser.
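For readers unfamiliar with the metric, Labelled Attachment Score is the proportion of tokens whose predicted head and dependency label both match the gold annotation; the unlabelled variant (UAS) checks the head only. The sketch below assumes each token is represented as a `(head_index, label)` pair; the function name `attachment_scores` and the toy sentence are hypothetical and are not taken from MaltParser or from the paper.

```python
from typing import List, Tuple

# One arc per token: (index of the head token, dependency label).
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions in [0, 1] for token-aligned arc lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("at least one token is required")
    total = len(gold)
    # UAS: the head is correct, the label is ignored.
    correct_heads = sum(1 for g, p in zip(gold, predicted) if g[0] == p[0])
    # LAS: both the head and the label are correct.
    correct_arcs = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct_heads / total, correct_arcs / total


# Toy four-token sentence, invented for illustration (head 0 marks the root).
gold_arcs = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
predicted_arcs = [(2, "det"), (3, "obj"), (0, "root"), (2, "obj")]
uas, las = attachment_scores(gold_arcs, predicted_arcs)
```

Here three of the four heads are correct (UAS 0.75), but only two tokens have both the correct head and the correct label (LAS 0.5). Published evaluations often apply further conventions, such as excluding punctuation, which this sketch does not model.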

#207: Using Knowledge about Misunder. to Increase the Robustness of SDSs

Ramón López-Cózar, Zoraida Callejas, Nieves Ábalos, Gonzalo Espejo (University of Granada, Granada, Spain), David Griol (Carlos III University of Madrid, Madrid, Spain)

This paper proposes a new technique to enhance the performance of spoken dialogue systems employing a method that automatically corrects semantic frames which are incorrectly generated by the semantic analyser of these systems. Experiments have been carried out using two spoken dialogue systems previously developed in our lab: Saplen and Viajero, which employ prompt-dependent and prompt-independent language models for speech recognition. The results obtained from 10,000 simulated dialogues show that the technique improves the performance of the two systems for both kinds of language modelling, especially for the prompt-independent language model. Using this type of model the Saplen system increased sentence understanding by 19.54%, task completion by 26.25%, word accuracy by 7.53%, and implicit recovery of speech recognition errors by 20.30%, whereas for the Viajero system these figures increased by 14.93%, 18.06%, 6.98% and 15.63%, respectively.

#315: Using Syllables as Acoustic Units for Spontaneous Speech Recognition

Jan Hejtmánek (University of West Bohemia, Pilsen, Czech Republic)

In this work, we deal with advanced context-dependent automatic speech recognition (ASR) of Czech spontaneous talk using hidden Markov models (HMM). Context-dependent units (e.g. triphones, diphones) in ASR systems provide significant improvement over simple non-context-dependent units. However, for spontaneous speech recognition we had to overcome some very challenging problems. For one, the number of syllables compared to the size of the spontaneous speech corpus makes the usage of context-dependent units very difficult. The main part of this article shows problems and procedures for effectively building and using a syllable-based ASR with LASER (an ASR system developed at the Department of Computer Science and Engineering, Faculty of Applied Sciences). The procedures are usable with virtually any modern ASR system.

#267: Using TectoMT as a Preprocessing Tool for Phrase-Based SMT

Daniel Zeman (Charles University, Prague, Czech Republic)

We present a systematic comparison of preprocessing techniques for two language pairs: English-Czech and English-Hindi. The two target languages, although both belonging to the Indo-European language family, show significant differences in morphology, syntax and word order. We describe how TectoMT, a successful framework for analysis and generation of language, can be used as a preprocessor for a phrase-based MT system. We compare the two language pairs and the optimal sets of source-language transformations applied to them. The following transformations are examples of possible preprocessing steps: lemmatization; retokenization, compound splitting; removing/adding words lacking counterparts in the other language; phrase reordering to resemble the target word order; marking syntactic functions. TectoMT, as well as all other tools and data sets we use, is freely available on the Web.

#180: ABBYY Lingvo Pro - a global network for building collaborative multilingual tools and resources

Polina Ilugdina, Eugene Pakhomov

ABBYY is a leading international developer of linguistic software. In the 20 years since its foundation, the company has been active in software development and in linguistics, semantics, syntax and lexicography research, offering professional solutions and services in the industry. This long-term experience enabled ABBYY to create ABBYY Lingvo Pro - an efficient and unique web solution for translators, linguists, lexicographers, and everyone else involved in the language industry. We would like to announce this solution at TSD 2010, demonstrate its main functionality and discuss the most important issues with the professional community, which will be the main users of the portal. The main purpose of ABBYY Lingvo Pro is to become a global community-based multilingual network which will accumulate collaborative language resources and instruments within one web space. The product combines a global online source of dictionaries, glossaries and parallel corpora, language instruments, and a web space for the professional community where its members can find each other, discuss professional issues, and share useful materials and experience. ABBYY Lingvo Pro will be launched in Spring 2010 for beta testing. ABBYY Lingvo Pro currently holds about 2.5 million dictionary entries and 1.5 million parallel sentences for the Russian <-> English, Russian <-> German and Russian <-> French translation directions. The list of languages will be extended in the future. ABBYY Lingvo Pro also includes various professional tools, for example ABBYY Aligner Online, a service for aligning parallel texts. The next step, planned for the end of 2010, will be the implementation of professional instruments for lexicographers' work (ABBYY Lingvo Content - a solution for dictionary creation), translation (CAT, quality assurance, etc.) and translation management.

#345: An On-Line Game Instead of the Annotation Editor

Barbora Hladka, Jiri Mirovsky, Pavel Schlesinger

We present the PlayCoref game, an on-line internet game, whose purpose is to enrich text data with coreference annotation. We provide a detailed description of the game, especially of its course and its implementation, and we mention the processing of the data and the scoring function.

#346: Visualization of VerbaLex Semantic Role Patterns

Zuzana Neverilova

VerbaLex is currently the largest verb valency lexicon in Czech. It consists of verb frames, and each frame consists of slots. Each slot represents a semantic role (such as agent, patient or location) and a link to Czech WordNet (e.g. agent has to be a person). Slots for particular frames repeat and create patterns of semantic roles. In some cases, verbs with the same pattern (such as AGent-LOCation-LOCation) are semantically related (e.g. verbs of motion). A visual representation of such patterns can help users to orient themselves in the lexicon. Visualization makes it possible to display large amounts of data. Users can pick out details and keep an overview at the same time. On the one hand, the visualization can be used for observing the patterns in VerbaLex. On the other, the visual representation can serve as feedback for authors and administrators of VerbaLex, since users can spot potential errors more easily.

#401: DEB System - Platform for Storage of Digital Knowledge

Ales Horak, Adam Rambousek

The DEB platform, developed by the NLP Centre, provides libraries and tools for easy development of dictionary editors and browsers. The platform is a modular and extensible client-server system, with most of the functionality handled by the server. All the data are stored in a native XML database. The server also provides an API to allow interaction with external applications. Ten applications have already been built on the DEB platform, for example the wordnet editor and browser DEBVisDic (also used in the KYOTO project), the general dictionary browser DEBDict, the art glossary editor TeDi, and an editor for the project Family Names in UK.

#402: Advanced Features of the Sketch Engine Corpus System

Vojtech Kovar

In the demo we present the leading corpus management tool -- the Sketch Engine. The system implements a wide variety of functions including corpus building, concordance viewing, frequency computing, extracting collocations, word sketches, statistical thesaurus and many others. First, we give a short overview of the functionality and the query language. Then we move to the recently added advanced features, namely new constructions of the query language, a new design, new corpus-building functionality and Tickbox lexicography -- a tool for drafting dictionary entries with a few clicks.

#403: Corpus Pattern Analysis: new light on words and meanings

Patrick Hanks

This is a demonstration of a long-term research project, Corpus Pattern Analysis (CPA), whose goals include: a) shedding light on how people use words to make meanings; b) providing evidence to support a new theory of linguistic behaviour, called the Theory of Norms and Exploitations - TNE; and c) creating a resource that will be useful for tasks such as improving a learner's (or a computer's) command of idiomatic phraseology and the associated meanings.

Traditionally, the analysis of meaning was thought to proceed word by word, like a child building a toy house with Lego bricks. TNE proposes instead that words do not have meaning - they only have 'meaning potential'. In TNE, meanings are associated with patterns - prototypical phraseological patterns of word use. Words are highly ambiguous, but most patterns are unambiguous.

CPA is a technique for identifying meaningful patterns of word use, drawing on prototype theory, valencies, and collocational analysis.

#404: Internet Language Service

Karel Pala, Pavel Smerk

Internet Language Service (ILS) is a language tool based on the technologies of web interfaces, particularly on the DEB system. It allows users to look up information on various linguistic issues and provides expert linguistic information related to rules of orthography, spelling, meanings of words, usage, grammar and style. The system now works with Czech, but it could be used for other languages with rich morphology as well. The size of the lexical database is more than 60 000 items (entries). The explanatory part includes more than 150 items (sections) in which the individual orthographical topics are explained. The ILS answers users' questions, with approx. 10 000-20 000 accesses per day on average, depending on the day of the week. The ILS has been running for more than two years for users in the whole Czech Republic (the Czech Republic has approx. 10 million inhabitants). Access to the Internet Language Service is free. The ILS can be accessed at the web address http://prirucka.ujc.cas.cz/. The ILS has been prepared in cooperation with colleagues from the Institute of the Czech Language of the Czech Academy of Sciences, who prepared and checked the linguistic data.

#405: WordnetLoom: a Graph-based Visual Wordnet Development Framework

Maciej Piasecki, Michał Marcińczuk, Adam Musiał, Radosław Ramocki, Marek Maziarz

The paper presents WordnetLoom - a new version of an application supporting the development of the Polish wordnet called plWordNet. The primary user interface of WordnetLoom is a graph-based, graphical, active presentation of wordnet structure. Linguists can work directly on the structure of synsets linked by relation links. The new version is compared with the previous one in order to show the lines of development and to illustrate the differences introduced. A new version of WordnetWeaver - a tool supporting semi-automated expansion of a wordnet - is also presented. The new version is based on the same user interface as WordnetLoom, utilises all types of wordnet relations and is tightly integrated with the rest of the wordnet editor. The role of the system in the wordnet development process, as well as experience from its application, are discussed. A set of WWW-based tools supporting coordination of team work and verification is also presented.

#406: Synt - a robust web-enabled parser for Czech

Milos Jakubicek, Ales Horak

We will present several features of an agenda-based chart parser for Czech with a PCFG backbone -- the Synt parser -- in a web environment which enables the user to choose advanced parsing options and display parsing results in a number of ways.

#407: Web-Based Lecture Browser

Josef Zizka

Superlectures.com is an innovative lecture video portal that enables users to search in speech. This brings a significant speed-up in accessing lecture video recordings. The aim of this portal is to make video as easily searchable as any textual document. The speech processing system automatically recognizes and indexes spoken Czech and English.
