TSD 2016

Natural language input in deep learning is commonly represented as embeddings. While embeddings are widely used, fundamental questions about the nature and purpose of embeddings remain. Drawing on traditional computational linguistics as well as parallels between language and vision, I will address two of these questions in this talk. (1) Which linguistic units should be represented as embeddings? (2) What are we trying to achieve using embeddings and how do we measure success?

Invited talk: Natural Language Knowledge Graphs

Ido Dagan

How can we capture the information expressed in large amounts of text? And how can we allow people, as well as computer applications, to easily explore it? When comparing textual knowledge to formal knowledge representation (KR) paradigms, two prominent differences arise. First, typical KR paradigms rely on pre-specified vocabularies, which are limited in their scope, while natural language is inherently open. Second, in a formal knowledge base each fact is encoded in a single canonical manner, while in multiple texts a fact may be repeated with some redundant, complementary or even contradictory information.

Invited talk: Remote Monitoring of Neurodegeneration through Speech

Elmar Nöth

Alzheimer’s disease (AD) is the most common neurodegenerative disorder. It generally deteriorates memory function, then language, then executive function to the point where simple activities of daily living (ADLs) become difficult (e.g. taking medicine or turning off a stove). Parkinson’s disease (PD) is the second most common neurodegenerative disease, also primarily affecting individuals of advanced age. Its cardinal symptoms include akinesia, tremor, rigidity, and postural imbalance. Together, AD and PD afflict approximately 55 million people, and there is no cure. Currently, professional or informal caregivers look after these individuals, either at home or in long-term care facilities. Caregiving is already a great, expensive burden on the system, but things will soon become far worse. Populations of many nations are aging rapidly and, with over 12% of people above the age of 65 having either AD or PD, incidence rates are set to triple over the next few decades.
Monitoring and assessment are vital, but current models are unsustainable. Patients need to be monitored regularly (e.g. to check if medication needs to be updated), which is expensive, time-consuming, and especially difficult when travelling to the closest neurologist is unrealistic. Monitoring patients using non-intrusive sensors to collect data during ADLs from speech, gait, and handwriting, can help to reduce the burden.
In this talk I will report on the results of the workshop on "Remote Monitoring of Neurodegeneration through Speech", which was part of the “Third Frederick Jelinek Memorial Summer Workshop”.

#725: Russian Deception Bank: A Corpus for Automated Deception Detection in Text

Tatiana Litvinova, Olga Litvinova

full paper

Automatic lie detection in written text is an increasingly pressing problem given the growing volume of Internet communication. It has been studied over the last decade, but mainly using English-language materials. Addressing this gap requires special text corpora, which are challenging to design. The article presents a specially designed corpus of Russian-language texts – “Russian Deception Bank” – with a “truthful” and a “deceptive” text by the same author as well as metadata with information about the authors (gender, age, psychological testing data, etc.). This is the first Russian corpus of this type. The article also reports the results of an analysis conducted to identify the differences between “truthful” and “deceptive” texts along a range of linguistic characteristics that were extracted using the Linguistic Inquiry and Word Count software.

#750: A Modular Chain of NLP Tools for Basque

Arantxa Otegi, Nerea Ezeiza, Iakes Goenaga, Gorka Labaka

This work describes the initial stage of designing and implementing a modular chain of Natural Language Processing tools for Basque. The main characteristic of this chain is the deep morphosyntactic analysis carried out by the first tool of the chain and the use of these morphologically rich annotations by the subsequent linguistic processing tools. The chain follows a modular design, which makes its processors easy to use. Two tools have been adapted and integrated into the chain so far, and are ready to use and freely available, namely the morphosyntactic analyzer and PoS tagger, and the dependency parser. We have evaluated these tools and obtained competitive results. Furthermore, we have tested the robustness of the tools on an extensive processing of Basque documents in various research projects.

#777: A unified parser for developing Indian language text to speech synthesizers

Arun Baby, Nishanthi N L, Anju Leela Thomas, Hema A Murthy

This paper describes the design of a language independent parser for text-to-speech synthesis in Indian languages. Indian languages come from 5-6 different language families of the world. Most Indian languages have their own scripts. This makes parsing for text-to-speech systems for Indian languages a difficult task. In spite of the number of different families, which leads to divergence, there is a convergence owing to borrowings across language families. Most importantly, Indian languages are more or less phonetic and can be considered to consist broadly of about 35-38 consonants and 15-18 vowels. In this paper, an attempt is made to unify the languages based on this broad list of phones. A common label set is defined to represent the various phones in Indian languages. A uniform parser is designed across all the languages, capitalising on the syllable structure of Indian languages. The proposed parser converts UTF-8 text to the common label set, applies letter-to-sound rules and generates the corresponding phoneme sequences. The parser is tested against the custom-built parsers for multiple Indian languages. The TTS results show that the phoneme sequences generated by the proposed parser are more accurate than those of the language specific parsers.

#746: Vive la petite difference! Exploiting small differences for gender attribution of short texts

Filip Gralinski, Rafal Jaworski, Lukasz Borchmann, Piotr Wierzchon

This article describes a series of experiments on gender attribution of Polish texts. The research was conducted on the publicly available corpus called “He Said She Said”, consisting of a large number of short texts from the Polish version of Common Crawl. As opposed to other experiments on gender attribution, this research takes on the task of classifying relatively short texts, authored by many different people. For the sake of this work, the original “He Said She Said” corpus was filtered in order to eliminate noise and apparent errors in the training data. In the next step, various machine learning algorithms were developed in order to achieve better classification accuracy. Interestingly, the results of the experiments presented in this paper are fully reproducible, as all the source code was deposited on the open platform Gonito.net. Gonito.net allows for defining machine learning tasks to be tackled by multiple researchers and provides the researchers with easy access to each other's results.

#829: A Dynamic Programming Approach to Improving Translation Memory Matching and Retrieval using Paraphrases

Rohit Gupta, Constantin Orasan, Qun Liu, Ruslan Mitkov

Translation memory tools lack semantic knowledge like paraphrasing when they perform matching and retrieval. As a result, paraphrased segments are often not retrieved. One of the primary reasons for this is the lack of a simple and efficient algorithm to incorporate paraphrasing in the TM matching process. Gupta and Orasan proposed an algorithm which incorporates paraphrasing based on greedy approximation and dynamic programming. However, because of greedy approximation, their approach does not make full use of the paraphrases available. In this paper we propose an efficient method for incorporating paraphrasing in matching and retrieval based on dynamic programming only. We tested our approach on English-German, English-Spanish and English-French language pairs and retrieved better results for all three language pairs compared to the earlier approach.

#740: A Sentiment-aware Topic Model for Extracting Failures from Product Reviews

Elena Tutubalina

This paper describes a probabilistic model that aims to extract different kinds of product difficulties conditioned on users' dissatisfaction through the use of sentiment information. The proposed model learns a distribution over words, associated with topics, sentiment and problem labels. The results were evaluated on reviews of products, randomly sampled from several domains (automobiles, home tools, electronics, and baby products), and user comments about mobile applications, in English and Russian. The model obtains a better performance than several state-of-the-art models in terms of the likelihood of a held-out test and outperforms these models in a classification task.

#831: AQA: Automatic Question Answering System for Czech

Marek Medveď, Aleš Horák

Question answering (QA) systems have become popular. However, a majority of them concentrate on the English language and most of them are oriented to a specific limited problem domain. In this paper, we present a new question answering system called AQA (Automatic Question Answering). AQA is an open-domain QA system which allows users to ask all common questions related to a selected text collection. The first version of the AQA system is developed and tested for the Czech language, but we also plan to include more languages in future versions. The AQA strategy consists of three main parts: question processing, answer selection and answer extraction. All modules are syntax-based with advanced scoring obtained by a combination of TF-IDF, tree distance between the question and candidate answers, and other selected criteria. The answer extraction module utilizes a named entity recognizer, which allows the system to catch entities that are most likely to answer the question. Evaluation of the AQA system is performed on a previously published Simple Question-Answering Database, or SQAD, with more than 3,000 question-answer pairs.
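The TF-IDF part of such a scoring scheme can be illustrated with a minimal Python sketch. This is not the AQA implementation: the function names are hypothetical, tokenization is plain whitespace splitting, the idf variant log(N/df) + 1 is an assumption, and the tree distance and the other criteria mentioned in the abstract are left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed idf variant: log(N / df) + 1, so shared terms keep a small weight.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(question, candidates):
    """Indices of candidate sentences, best TF-IDF match to the question first."""
    docs = [question.lower().split()] + [c.lower().split() for c in candidates]
    vectors = tfidf_vectors(docs)
    scores = [cosine(vectors[0], v) for v in vectors[1:]]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
```

Calling `rank_candidates("who wrote the novel", ["the river flows north", "the novel was written by an author"])` puts the second candidate first, because it shares the rarer term "novel" with the question.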

#830: An Efficient Method for Vocabulary Addition to WFST Graphs

Anna Bulusheva, Alexander Zatvornitskiy, Maxim Korenevsky

A successful application of automatic speech recognition often requires the ability to recognize words that are not known during the system building phase. Modern speech recognition decoders employ large WFST graphs, so adding new words requires recompiling the graph. The compilation process requires a lot of memory and consumes a lot of time. In this paper, a method to add new words into a speech recognition graph is presented. The method requires significantly less memory, takes less time than a full recompilation and does not affect recognition accuracy or speed.

#838: Annotation of Czech Texts with Language Mixing

Zuzana Nevěřilová

Language mixing (using chunks of foreign language in a native language utterance) occurs frequently. Foreign language chunks have to be detected because their annotation is often incorrect. In the standard pipelines for Czech text annotation, no such detection exists. Before morphological disambiguation, unrecognized words are processed by a Czech guesser, which is successful on Czech words (e.g. neologisms, typos) but makes no sense for foreign words. We propose a new pipeline that adds foreign language chunk and multi-word expression (MWE) detection. We experimented with a small corpus where we compared the original (semi-automatic) annotation (including foreign words and MWEs) with the results of the new pipelines. As a result, we reduced the number of incorrect annotations of interlingual homographs and foreign language chunks in the new pipeline compared to the standard one. We also reduced the number of tokens that have to be processed by the guesser. The aim was to use the guesser solely on potentially Czech words.

#765: Assessing Context for Extraction of near Synonyms from Product Reviews in Spanish

Sofía N. Galicia-Haro, Alexander F. Gelbukh

This paper reports ongoing research on near synonym extraction. The aim of our work is to identify the near synonyms of multiword terms related to an electro-domestic (household appliance) product domain. The state-of-the-art approaches for identification of single word synonyms are based on distributional methods. For this method, we analyzed contexts of different sizes and types, drawn from a collection of Spanish reviews and from the Web. We present some results and discuss the relations found.

#805: Automatic Question Generation Based on Analysis of Sentence Structure

Miroslav Blšták and Viera Rozinajová

This paper presents a novel approach to the area of automated factual question generation. We propose a template-based method which uses the structure of sentences to create multiple sentence patterns on various levels of abstraction. The patterns are used to classify the sentences and to generate questions. Our approach allows questions to be created on different levels of difficulty and generality, e.g. from general questions to specific ones. Other advantages lie in the simple expansion of patterns and in increased text coverage. We also suggest a new way of storing patterns which significantly improves the pattern matching process. Our first results indicate that the proposed method can be an interesting direction in the research of automated question generation.

#800: Automatic Restoration of Diacritics for Igbo Language

I. Ezeani et al.

Igbo is a low-resource African language with orthographic and tonal diacritics, which capture distinctions between words that are important for both meaning and pronunciation, and hence of potential value for a range of language processing tasks. Such diacritics, however, are often largely absent from the electronic texts we might want to process, or assemble into corpora, and so the need arises for effective methods for automatic diacritic restoration for Igbo. In this paper, we experiment using an Igbo bible corpus, which is extensively marked for vowel distinctions, and partially for tonal distinctions, and attempt the task of reinstating these diacritics when they have been deleted. We investigate a number of word-level diacritic restoration methods, based on n-grams, under a closed-world assumption, achieving an accuracy of 98.83% with our most effective method.
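The simplest word-level method of this kind, a unigram lookup, can be sketched in Python as follows. This is a minimal illustration and not the authors' system: the function names are hypothetical, and the model only maps each stripped word form to its most frequent marked variant in the training data, which is the n = 1 case of an n-gram approach under a closed-world assumption.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word):
    """Remove combining marks, e.g. tone marks and dots below."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(marked_tokens):
    """Map each stripped form to its most frequent marked variant."""
    counts = defaultdict(Counter)
    for token in marked_tokens:
        counts[strip_diacritics(token)][token] += 1
    return {plain: variants.most_common(1)[0][0]
            for plain, variants in counts.items()}


def restore(plain_tokens, model):
    """Look each token up in the model; unseen tokens are returned unchanged."""
    return [model.get(token, token) for token in plain_tokens]
```

A bigram or trigram variant would condition the choice of variant on the neighbouring words instead of using the overall most frequent form.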

#739: Automatic scoring of a Sentence Repetition Task from Voice Recordings

Meysam Asgari, Allison Sliter, Jan Van Santen

In this paper, we propose an automatic scoring approach for assessing the language deficit in a sentence repetition task used to evaluate children with language disorders. From ASR-transcribed sentences, we extract sentence similarity measures, including WER and Levenshtein distance, and use them as the input features in a regression model to predict the reference scores manually rated by experts. Our experimental analysis on subject-level scores of 46 children, 33 diagnosed with autism spectrum disorders (ASD) and 13 with specific language impairment (SLI), shows that the proposed approach is successful in predicting scores, with averaged product-moment correlations of 0.84 between observed and predicted ratings across test folds.
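The two similarity measures named above can be computed as in the following Python sketch. It assumes whitespace tokenization and unit costs for insertion, deletion and substitution; the function names are illustrative, and the regression model that consumes these features is not shown.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences with unit costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hyp_words) / len(ref_words)
```

Passing strings to `levenshtein` gives a character-level distance, while `wer` applies the same routine to word lists, so both features come from one dynamic programming function.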

#799: Automatic syllabification and syllable timing of automatically recognized speech - for Czech

Marek Boháč, Lukáš Matějů, Michal Rott, Radek Šafařík

Our recent work has focused on automatic speech recognition (ASR) of spoken word archive documents. One of the important tasks was to structure the recognized document (to segment the document and to detect sentence boundaries). Prosodic features play a significant role in spoken document structuring. In our previous work we bound the prosodic information to the ASR events, i.e. words and noises. Many prosodic features (e.g. speech rate, vowel prominence or prolongation of last syllables) require higher time resolution than the word level. For that reason we propose a scheme that automatically syllabifies the recognized words and, by forced alignment of their phonetic content, provides the syllables (and their phonemes) with time-stamps. We presume that words, non-speech events, syllables and phonemes represent an appropriate hierarchical set of structuring units for processing various prosodic features.

#731: Building Corpora for Stylometric Research

Jan Svec, Jan Rygl

Authorship recognition, machine translation detection, pedophile identification and other stylometry techniques are used daily in applications for the most widely used languages. On the other hand, under-represented languages lack data sources usable for stylometry research. In this paper, we propose a novel algorithm to build corpora containing the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We modify data-cleaning techniques for the purposes of the stylometry field and add a heuristic layer to detect and extract valuable meta-information. The system was evaluated on Czech and Slovak web domains. The collected data have been published, and we are planning to build collections for other languages and gradually extend the existing ones.

#824: Classification of Speaker Intoxication Using a Bidirectional Recurrent Neural Network

Kim Berninger, Jannis Hoppe, Benjamin Milde

With the increasing popularity of deep learning approaches in the field of speech recognition and classification, many such problems are undergoing a paradigm shift from classic approaches, such as hidden Markov models, to recurrent neural networks (RNN). In this paper we examine that transition for the ALC corpus, which was used in the Interspeech 2011 Speaker State Challenge. Filter bank (FBANK) features are used alongside two types of bidirectional RNNs, each using gated recurrent units (GRU). These models are used to classify the intoxication state of people solely from recordings of their voices, and they outperform humans with state-of-the-art results.

#827: Collecting Facebook Posts and WhatsApp Chats: Corpus Compilation of Private Social Media Messages

Lieke Verheijen, Wessel Stoop

This paper describes the compilation of a social media corpus with Facebook posts and WhatsApp chats. Authentic messages were voluntarily donated by Dutch youths between 12 and 23 years old. Social media nowadays constitute a fundamental part of youths' private lives, constantly connecting them to friends and family via computer-mediated communication (CMC). The social networking site Facebook and mobile phone chat application WhatsApp are currently quite popular in the Netherlands. Several relevant issues concerning corpus compilation are discussed, including website creation, promotion, metadata collection, and intellectual property rights / ethical approval. The application that was created for scraping Facebook posts from users' timelines, of course with their consent, can serve as an example for future data collection. The Facebook and WhatsApp messages are collected for a sociolinguistic study into Dutch youths' written CMC, of which a preliminary analysis is presented, but also present a valuable data source for further research.

#749: Combining dependency parsers using error rates

Tomáš Jelínek

In this paper, we present a method of improving dependency parsing accuracy by combining parsers using error rates. We use four parsers: MSTParser, MaltParser, TurboParser and MateParser, and the data of the analytical layer of the Prague Dependency Treebank. We parse data with each of the parsers and calculate error rates for several parameters such as POS of dependent tokens. These error rates are then used to determine weights of edges in an oriented graph created by merging all the parses of a sentence provided by the parsers. We find the maximum spanning tree in this graph (a dependency tree without cycles), and achieve a 1.3% UAS / 1.1% LAS improvement compared to the best parser in our experiment.
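How error rates could turn into edge weights can be sketched in Python as follows. Everything here is an assumption made for illustration: the weighting (one minus the parser's error rate for the POS tag of the dependent token), the parser names and the function names are hypothetical. The final step is also simplified: the sketch picks the heaviest incoming edge for each token, whereas the abstract describes a maximum spanning tree search, which additionally rules out cycles.

```python
from collections import defaultdict


def merge_parses(parses, pos_tags, error_rates):
    """Sum per-parser edge weights into one weighted graph.

    parses: {parser_name: [head index of each token]}; tokens are numbered
            from 1 and head 0 is the artificial root.
    pos_tags: POS tag of each token.
    error_rates: {parser_name: {pos_tag: error rate in [0, 1]}}.
    Returns {(head, dependent): weight}.
    """
    weights = defaultdict(float)
    for parser, heads in parses.items():
        for dep, head in enumerate(heads, start=1):
            pos = pos_tags[dep - 1]
            # Assumed weighting: a parser's vote counts more where it errs less.
            weights[(head, dep)] += 1.0 - error_rates[parser][pos]
    return dict(weights)


def best_heads(weights, n_tokens):
    """Pick the heaviest incoming edge per token (no cycle check, unlike an MST)."""
    heads = []
    for dep in range(1, n_tokens + 1):
        candidates = {h: w for (h, d), w in weights.items() if d == dep}
        heads.append(max(candidates, key=candidates.get))
    return heads
```

Replacing `best_heads` with a Chu-Liu/Edmonds maximum spanning arborescence search over the same `weights` would guarantee a well-formed dependency tree.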

#735: Constraint-Based Open-domain Question Answering Using Knowledge Graph Search

Ahmad Aghaebrahimian, Filip Jurčíček

We introduce a highly scalable approach for open-domain question answering with no dependence on any data set mapping logical forms to surface forms or on any linguistic analysis tool such as a POS tagger or named entity recognizer. We define our approach under the Constrained Conditional Models framework, which lets us scale to a full knowledge graph with no limitation on its size. On a standard benchmark, we obtained results competitive with the state of the art in the open-domain question answering task.

#796: Correction of prosodic phrases in large speech corpora

Zdeněk Hanzlíček

Nowadays, many speech processing tasks, such as speech recognition and synthesis, rely on very large speech corpora, which usually contain several hours of speech or more. To achieve the best possible results, an appropriate annotation of the recorded utterances is often necessary. This paper focuses on problems related to the prosodic annotation of Czech speech corpora. In Czech, utterances are supposed to be split by pauses into so-called prosodic clauses containing one or more prosodic phrases. The types of particular phrases are linked to their last prosodic words corresponding to various functionally involved prosodemes. The clause/phrase structure is substantially determined by the sentence composition. However, in real speech data, a different prosodeme type or even different phrase/clause borders can be present. This paper deals with two basic problems: the correction of an improper prosodeme/phrase type and the detection of new phrase borders. For both tasks, we propose new procedures utilizing hidden Markov models. Experiments were performed on four large speech corpora recorded by professional speakers for the purpose of speech synthesis. These experiments were limited to declarative sentences. The results were successfully verified by listening tests.

#753: Cross-language dependency parsing using Part-Of-Speech patterns

Peter Bednár

This paper describes a simple instance-based learning method for dependency parsing, which is based solely on the part-of-speech n-grams extracted from training data. The method does not depend on any lexical features (i.e. words or lemmas) or other morphological categories, so a model trained on one language can be directly applied to another similar language with a harmonized tagset of coarse-grained part-of-speech categories. Using instance-based learning allows us to directly evaluate the predictive power of part-of-speech patterns on evaluation data from Czech and Slovak treebanks.

#813: CzEng 1.6: Czech-English Parallel Corpus with Tools Dockered

O. Bojar et al.

We present a new release of the Czech-English parallel corpus CzEng. CzEng 1.6 consists of about 0.5 billion words (gigaword) in each language. The corpus is equipped with automatic annotation at a deep syntactic level of representation and alternatively in Universal Dependencies. Additionally, we release the complete annotation pipeline as a virtual machine in the Docker virtualization toolkit.

#745: Difficulties with wh-questions in Czech TTS System

Markéta Juzová, Daniel Tihelka

Sentence intonation is very important for differentiating sentence types (declarative sentences, questions, etc.), especially in languages without fixed word order. It is therefore important to handle it in text-to-speech systems as well. This paper concerns the problem of wh-questions, whose intonation differs from the intonation of the other basic question type -- the yes/no question. We discuss the possibility of using wh-questions (recorded during the speech corpus preparation) in speech synthesis. The inclusion and appropriate usage of these recordings is tested in a real text-to-speech system and evaluated by listening tests. Furthermore, we focus on the problem of the perception of wh-questions by listeners, with the aim of revealing whether listeners really prefer phonologically correct (falling) intonation in this type of question.

#744: Digging Language model -- Maximum Entropy Phrase Extraction

Jakub Kanis

This work introduces our maximum entropy phrase extraction method for the Czech -- English translation task. Two different corpora and language models of different sizes were used to explore the potential of the maximum entropy phrase extraction method and of phrase table content optimization. Additionally, two different maximum entropy estimation criteria were compared with the state-of-the-art phrase extraction method. In the case of domain-oriented translation, maximum entropy phrase extraction significantly improves translation precision.

#772: Embedded learning segmentation approach for Arabic speech recognition

Hamza Frihia, Halima Bahi

Building an Automatic Speech Recognition (ASR) system requires a well segmented and labeled speech corpus (the transcription is often made by an expert). These resources are not always available for languages such as Arabic. This paper presents a system for automatic Arabic speech segmentation for speech recognition purposes. Hidden Markov Models (HMMs) are the state-of-the-art models in ASR systems, so for the segmentation we adopt an embedded learning approach in which the alignment between speech segments and HMMs is performed iteratively to refine the segmentation. This approach requires transcribed and labelled data; for this purpose, we built a dedicated corpus. The obtained results are close to those described in the literature and could be improved by handling more specificities of Arabic speech.

#840: Evaluation and Improvements in Punctuation Detection for Czech

Vojtěch Kovář, Jakub Machura, Kristýna Zemková and Michal Rott

Punctuation detection and correction belong to the hardest automatic grammar checking tasks for the Czech language. The paper compares available grammar and punctuation correction programs on several data sets. It also describes a set of improvements to one of the available tools, leading to significantly better recall as well as precision.

#719: Evaluation of TTS Personification by GMM-based Speaker Gender and Age Classifier

Jiří Přibil et al.

This paper describes an experiment using Gaussian mixture model (GMM)-based speaker gender and age classification for automatic evaluation of the success achieved in text-to-speech (TTS) system personification. The proposed two-level GMM classifier detects four age categories (child, young, adult, senior) and also discriminates gender for adult voices. This classifier is applied for gender/age estimation of synthetic speech in the Czech and Slovak languages produced by different TTS systems with several voices, using different speech inventories and speech modelling methods. The obtained results confirm the hypothesis that this type of classifier can be used as an alternative to the conventional listening test in the area of speech evaluation.

#748: FAQIR -- A Frequently Asked Questions Retrieval Test Collection

Mladen Karan, Jan Šnajder

Frequently asked question (FAQ) collections are commonly used across the web to provide information about a specific domain (e.g., services of a company). With respect to traditional information retrieval, FAQ retrieval introduces additional challenges, the main ones being (1) the brevity of FAQ texts and (2) the need for topic-specific knowledge. The primary contribution of our work is a new domain-specific FAQ collection, providing a large number of queries with manually annotated relevance judgments. On this collection, we test several unsupervised baseline models, including both count-based and semantic embedding-based models, as well as a combined model. We evaluate the performance across different setups and identify potential avenues for improvement. The collection constitutes a solid basis for research in supervised machine-learning-based FAQ retrieval.

#769: Finite-State Super Transducers for Grapheme-to-Phoneme Conversion

Žiga Golob, Jerneja Žganec Gros, Vitomir Štruc, France Mihelič, Simon Dobrišek

Minimal deterministic finite-state transducers (MDFSTs) are powerful models that can be used to represent pronunciation dictionaries in a compact form. Intuitively, we would assume that by increasing the size of the dictionary, the size of the MDFSTs would increase as well. However, as we show in the paper, this intuition does not hold for highly inflected languages. With such languages the size of the MDFSTs begins to decrease once the number of words in the represented dictionary reaches a certain threshold. Motivated by this observation, we have developed a new type of FST, called a finite-state super transducer (FSST), and show experimentally that the FSST is capable of representing pronunciation dictionaries with fewer states and transitions than MDFSTs. Furthermore, we show that (unlike MDFSTs) our FSSTs can also accept words that are not part of the represented dictionary. The phonetic transcriptions of these out-of-dictionary words may not always be correct, but the observed error rates are comparable to the error rates of the traditional methods for grapheme-to-phoneme conversion.

#795: From dialogue corpora to dialogue systems: Generating a chatbot with teenager personality for preventing cyber-pedophilia

A. Callejas-Rodríguez, E. Villatoro-Tello, I. Meza et al.

A conversational agent, also known as a chatbot, is a machine conversational system which interacts with human users via natural language. Traditionally, chatbot technology is built on a set of "manually" elaborated conversational rules. However, given the availability of large numbers of real examples of human interactions on the web, automatically generating these rules is becoming a more feasible option. In this paper we describe an approach for building and training a conversational agent which has a teenager personality and is able to converse in Mexican Spanish. By means of this chatbot we aim at assisting law enforcement officers in the prevention of cyber-pedophilia. Our experiments demonstrate that the developed chatbot is able to produce lexical and syntactic constructions comparable to those a teenager would produce. As an additional contribution, we compile and release a large dialogue corpus containing real examples of conversations among teenagers.

#770: Gathering Information about Word Similarity from Neighbor Sentences

Natalia Loukachevitch, Aleksei Alekseev

In this paper we present the first results of detecting word semantic similarity on the Russian translations of Miller-Charles and Rubenstein-Goodenough sets prepared for the first Russian word semantic evaluation Russe-2015. The experiments were carried out on three text collections: Russian Wikipedia, a news collection, and their united collection. We found that the best results in detection of lexical paradigmatic relations are achieved using the combination of word2vec with the new type of features based on word co-occurrences in neighbor sentences.

#727: Generating of Events Dictionaries from Polish WordNet for the Recognition of Events in Polish Documents

Jan Kocoń, Michał Marcińczuk

In this article we present the results of recent research on the recognition of events in Polish. Event recognition plays a major role in many natural language processing applications such as question answering or automatic summarization. We adapted the TimeML specification (the well-known guideline for English) to the Polish language. We annotated 540 documents in the Polish Corpus of Wrocław University of Technology (KPWr) using our specification. Here we describe the results achieved by Liner2 (a machine learning toolkit) adapted to the recognition of events in Polish texts.

#790: Glottal Flow Patterns Analyses for Parkinson's Disease Detection: Acoustic and Nonlinear Approaches

Belalcázar-Bolaños et al.

In this paper we propose a methodology for the automatic detection of Parkinson's Disease (PD) by using several glottal flow measures including different time-frequency (TF) parameters and nonlinear behavior of the vocal folds. Additionally, the nonlinear behavior of the vocal tract is characterized using the residual wave. The proposed approach allows modeling phonation (glottal flow) and articulation (residual wave) properties of speech separately, which opens the possibility to address symptoms related to dysphonia and dysarthria in PD, independently. Speech recordings of the five Spanish vowels uttered by a total of 100 speakers (50 with PD and 50 Healthy Controls) are considered. The results indicate that the proposed approach allows the automatic discrimination of PD patients and healthy controls with accuracies of up to 78% when using the TF-based measures.

#721: Grammatical Annotation of Historical Portuguese: Generating a Corpus-based Diachronic Dictionary

Bick and Zampieri

In this paper, we present an automatic system for the morphosyntactic annotation and lexicographical evaluation of historical Portuguese corpora. Using rule-based orthographical normalization, we were able to apply a standard parser (PALAVRAS) to historical data (Colonia corpus) and to achieve accurate annotation for both POS and syntax. By aligning original and standardized word forms, our method allows the creation of tailor-made standardization dictionaries for historical Portuguese with optional period or author frequencies.

#837: Grapheme to Phoneme Translation using Conditional Random Fields with Re-ranking

Stephen Ash, David Lin

Grapheme to phoneme (G2P) translation is an important part of many applications including text to speech, automatic speech recognition, and phonetic similarity matching. Although G2P models have been studied thoroughly in the literature, we propose a G2P system which is optimized for producing a high-quality top-k list of candidate pronunciations for an input grapheme string. Our pipeline approach uses Conditional Random Fields (CRF) to predict phonemes from graphemes and a discriminative re-ranker, which incorporates information from previous stages in the pipeline with a graphone language model to construct a high-quality ranked list of results. We evaluate our findings against the widely used CMUDict dataset and demonstrate competitive performance with state-of-the-art G2P methods. Additionally, using entries with multiple valid pronunciations, we show that our re-ranking approach out-performs ranking using only a smoothed graphone language model, a technique employed by many recent publications. Lastly, we released our system as an open-source G2P toolkit.

#752: Homonymy and Polysemy in the Czech Morphological Dictionary

We focus on the problem of homonymy and polysemy in morphological dictionaries, using the Czech morphological dictionary MorfFlex CZ as an example. It is not necessary to distinguish meanings in morphological dictionaries unless the distinction has consequences in word formation or syntax. The contribution proposes several important rules and principles for achieving consistency.

#756: How to Add Word Classes to the Kaldi Speech Recognition Toolkit

Axel Horndasch, Caroline Kaufhold, Elmar Nöth

The paper explains and illustrates how the concept of word classes can be added to the widely used open-source speech recognition toolkit Kaldi. The suggested extensions to existing Kaldi recipes are limited to the word-level grammar (G) and the pronunciation lexicon (L) models. The implementation to modify the weighted finite state transducers employed in Kaldi makes use of the OpenFST library. In experiments on small and mid-sized corpora with vocabulary sizes of 1.5K and 5.5K respectively a slight improvement of the word error rate is observed when the approach is tested with (hand-crafted) word classes. Furthermore it is shown that the introduction of sub-word unit models for open word classes can help to robustly detect and classify out-of-vocabulary words without impairing word recognition accuracy.

#720: Influence of Reverberation on Automatic Evaluation of Intelligibility with Prosodic Features

Tino Haderlein, Michael Doellinger, Anne Schützenberger, Elmar Noeth

Objective analysis of intelligibility by a speech recognizer and prosodic features has previously been performed for close-talking recordings. This study examined whether this is also possible for reverberated speech. In order to ensure that only the room acoustics differ, artificial reverberation was used. 82 patients after partial laryngectomy read a standardized text, and 5 experienced raters assessed intelligibility perceptually on a 5-point scale. The best feature subset, determined by Support Vector Regression, consists of the word correctness of a speech recognizer, the average duration of silent pauses, the standard deviation of the F_0 on the entire sample, the standard deviation of jitter, and the ratio of the durations of the voiced sections and the entire recording. A human-machine correlation of r = 0.80 was achieved for the close-talking recordings and r = 0.72 for the worst case of the examined signal qualities. By adding three more features, r = 0.80 was also reached for the reverberated scenario.

#783: Influence of expressive speech on ASR performances: application to elderly assistance in smart home

Frédéric Aman, Véronique Aubergé, Michel Vacher

Smart homes are discussed as a win-win solution for keeping elderly people at home, as a better alternative to care homes for dependent elderly people. Such smart homes are characterized by rich domestic commands devoted to the safety and comfort of the elderly. Voice command has been identified as an efficient and well-accepted means of interaction; it can be addressed directly to the "habitat" or through a robotic interface. In daily use, the challenges of voice command recognition are the noisy environment and, even more, the reformulation and expressive variation of the strictly authorized commands. This paper (1) shows, on the basis of an elicited corpus, that expressive speech, in particular distress speech, strongly affects generic state-of-the-art ASR systems (20 to 30%), and (2) shows how ASR adaptation can mitigate this degradation (15%). We conclude that ASR systems need to be adapted to expressive speech when they are designed for personal assistance.

#823: Investigation of Bottle-Neck Features for Emotion Recognition

Anna Popková, Filip Povolný, Pavel Matějka, Ondřej Glembek, František Grézl, Jan "Honza" Černocký

This paper describes several systems for emotion recognition developed for the AV+EC 2015 Emotion Recognition Challenge. A complete system, making use of all three modalities (audio, video, and physiological data), was submitted to the evaluation. The focus of our work was, however, on the so called Bottle-Neck features used to complement the audio features. For the recognition of arousal, we improved the results of the delivered audio features and combined them favorably with the Bottle-Neck features. For valence, the best results were obtained with video, but a two-output Bottle-Neck structure is not far behind, which is especially appealing for applications where only audio is available.

#789: KALDI Recipes for the Czech Speech Recognition Under Various Conditions

Petr Mizera, Jiri Fiala, Ales Brich, Petr Pollak

The paper presents the implementation of a Czech ASR system under various conditions using the KALDI speech recognition toolkit in two standard state-of-the-art architectures (GMM-HMM and DNN-HMM). We present recipes for building LVCSR systems using the SpeechDat, SPEECON, CZKCC, and NCCCz corpora, together with a new update of the feature extraction tool CtuCopy, which now supports the KALDI format. All presented recipes, as well as the CtuCopy tool, are publicly available under the Apache license v2.0. Finally, an extension of the KALDI toolkit which supports running the described LVCSR recipes on MetaCentrum computing facilities (the Czech National Grid Infrastructure operated by CESNET) is described. In the experimental part, the baseline performance of both GMM-HMM and DNN-HMM LVCSR systems applied to the given Czech corpora is presented. These results also demonstrate the behaviour of the designed LVCSR systems under various acoustic conditions as well as various speaking styles.

#802: Morphosyntactic analyzer for the Tibetan language

Dobrov A., Dobrova A., Grokhovskiy P., Soms N., Zakharov V.

The paper deals with the development of a morphosyntactic analyzer for the Tibetan language. It aims to create a consistent formal grammatical description (formal grammar) of the Tibetan language, including all grammar levels of the language system from morphosyntax (syntactics of morphemes) to the syntax of composite sentences and supra-phrasal entities. Syntactic annotation was created on the basis of morphologically tagged corpora of Tibetan texts. The peculiarity of the annotation consists in combining both the immediate constituents structure and the dependency one. An individual (basic) grammar module of Tibetan grammatical categories, its possible values, and restrictions on their combination are created. Types of tokens and their grammatical features form the basis of the formal grammar being produced, allowing linguistic processor to build syntactic trees of various kinds. Methods of avoiding redundant structural ambiguity are proposed.

#793: Neural Networks for Featureless NER in Czech

Straková et al.

We present a completely featureless, language agnostic named entity recognition system. Following recent advances in artificial neural network research, the recognizer employs parametric rectified linear units (PReLU), word embeddings and character-level embeddings based on gated recurrent units (GRU). Without any feature engineering, only with surface forms, lemmas and tags as input, the network achieves excellent results in Czech NER and surpasses the current state of the art of previously published Czech NER systems, which use manually designed rule-based orthographic classification features. Furthermore, the neural network achieves robust results even when only surface forms are available as input. In addition, the proposed artificial neural network can use the manually designed rule-based orthographic classification features, and in such a combination it exceeds the current state of the art by a wide margin.

#726: On the Influence of the Number of Anomalous and Normal Examples in Anomaly-Based Annotation Errors Detection

Jindřich Matoušek, Daniel Tihelka

Anomaly detection techniques were shown to help in detecting word-level annotation errors in read-speech corpora for text-to-speech synthesis. In this framework, correctly annotated words are considered as normal examples on which the detection methods are trained. Misannotated words are then taken as anomalous examples which do not conform to normal patterns of the trained detection models. As it could be hard to collect a sufficient number of examples to train and optimize an anomaly detector, in this paper we investigate the influence of the number of anomalous and normal examples on the detection accuracy of several anomaly detection models: Gaussian distribution based models, one-class support vector machines, and Grubbs' test based model. Our experiments show that the number of examples can be significantly reduced without a large drop in detection accuracy.
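
The idea of training only on normal examples can be illustrated with a deliberately simplified sketch: a univariate Gaussian is fitted to one feature of correctly annotated words, and a new word is flagged when it lies too far from the mean. The feature, the example values, and the threshold of 3 standard deviations are assumptions made for illustration; the abstract does not specify the features or settings used in the paper.

```python
import statistics


def fit_gaussian(normal_values):
    """Estimate mean and sample standard deviation from normal examples only."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)


def is_anomalous(value, mean, std, threshold=3.0):
    """Flag a value whose distance from the mean exceeds `threshold` std devs."""
    return abs(value - mean) / std > threshold


# Hypothetical feature values (e.g. a normalized word duration) of
# correctly annotated words; the numbers are invented.
normal_examples = [1.0, 1.2, 0.9, 1.1, 0.8]
mean, std = fit_gaussian(normal_examples)

print(is_anomalous(1.1, mean, std))  # close to the normal pattern
print(is_anomalous(2.5, mean, std))  # far from the normal pattern
```

In a sketch like this, anomalous examples are not needed to fit the model; they would only be needed to tune the threshold.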

#741: Platon: Dialog Management and Rapid Prototyping for Multilingual Multi-User Dialog Systems

Martin Gropp, Anna Schmidt, Thomas Kleinbauer, Dietrich Klakow

We introduce Platon, a domain-specific language for authoring dialog systems based on Groovy, a dynamic programming language for the Java Virtual Machine (JVM). It is a fully-featured tool for dialog management that is also particularly suitable for, but not limited to, rapid prototyping making it possible to create a basic multilingual dialog system with minimal overhead and then gradually extend it to a complete system. It supports multilinguality, multiple users in a single session, and has built-in support for interacting with objects in the dialog environment. It is possible to integrate external components for natural language understanding and generation, while Platon can itself be integrated even in non-JVM projects or run in a stand-alone debugging tool for testing. In this paper we describe important elements of the language and present two scenarios Platon has been used in.

#801: Predicting Morphologically-Complex Unknown Words in Igbo

Ikechukwu E. Onyenwe, Mark Hepple

The effective handling of previously unseen words is an important factor in the performance of part-of-speech taggers. Some trainable POS taggers use suffix (sometimes prefix) strings as cues in handling unknown words (in effect serving as a proxy for actual linguistic affixes). In the context of creating a tagger for the African language Igbo, we compare the performance of some existing taggers, implementing such an approach, to a novel method for handling morphologically complex unknown words, based on morphological reconstruction (i.e. a linguistically-informed segmentation into root and affixes). The novel method outperforms these other systems by several percentage points, achieving accuracies of around 92% on morphologically-complex unknown words.

#776: Preliminary study on automatic recognition of spatial expressions in Polish texts

In the paper we address the problem of recognizing spatial expressions in Polish text. A spatial expression is a text fragment which describes the relative location of two or more physical objects to each other. The first part of the paper describes a Polish corpus annotated with spatial expressions and the inter-annotator agreement. In the second part we analyse the feasibility of spatial expression recognition by reviewing relevant tools and resources for text processing for Polish. Then we present a knowledge-based approach which utilizes the existing tools and resources for Polish, including a morpho-syntactic tagger, shallow parsers, a dependency parser, a named entity recognizer, a general ontology, a wordnet and a wordnet-to-ontology mapping. We also present a dedicated set of manually created syntactic and semantic patterns for generating and filtering candidate spatial expressions. In the last part we discuss the results obtained on the reference corpus with the proposed method and present a detailed error analysis.

#816: Relevant Documents Selection for Blind Relevance Feedback in Speech Information Retrieval

Lucie Skorkovská

The experiments presented in this paper were aimed at the selection of documents to be used in the blind or pseudo relevance feedback in spoken document retrieval. The previous experiments with the automatic selection of the relevant documents for the blind relevance feedback method have shown the possibilities of the dynamical selection of the relevant documents for each query depending on the content of the retrieved documents instead of just blindly defining the number of the relevant documents to be used in advance. The score normalization techniques commonly used in the speaker identification task are used for the dynamical selection of the relevant documents. In the previous experiments, the language modeling information retrieval method was used. In the experiments presented in this paper, we have derived the score normalization technique also for the vector space information retrieval method. The results of our experiments show, that these normalization techniques are not method-dependent and can be successfully used in several information retrieval system settings.

#771: Short Messages Spam Filtering Using Sentiment Analysis

Enaitz Ezpeleta, Urko Zurutuza, José María Gómez Hidalgo

As short instant messages are used more and more, spam and non-legitimate campaigns through this type of communication system are growing. Those campaigns, besides being an illegal online activity, are a direct threat to the privacy of users. Previous short-message spam filtering techniques focus on automatic text classification and do not take message polarity into account. Focusing on phone SMS messages, this work demonstrates that it is possible to improve spam filtering in short message services using sentiment analysis techniques. Using a publicly available labelled (spam/legitimate) SMS dataset, we calculate the polarity of each message and add the polarity score to the original dataset, creating new datasets. We compare the results of the best classifiers and filters over the different datasets (with and without polarity) in order to demonstrate the influence of the polarity. Experiments show that the polarity score improves SMS spam classification, on the one hand reaching an accuracy of 98.91% and, on the other hand, obtaining 0 false positives with an accuracy of 98.67%.

#751: Speech-to-Text Summarization Using Automatic Phrase Extraction from Recognized Text

Michal Rott, Petr Červa

This paper describes a summarization system that was developed in order to summarize news delivered orally. The system generates text summaries from input audio using three independent components: an automatic speech recognizer, a syntactic analyzer, and a summarizer. The absence of sentence boundaries in the recognized text complicates the summarization process. Therefore, we use a syntactic analyzer to identify continuous segments in the recognized text. We used 50 reference articles to perform our evaluation. The data are publicly available at http://nlp.ite.tul.cz/sumarizace. The results of the proposed system were compared with the results of sentence summarization in the reference articles. The evaluation was performed using co-occurrence of n-grams in the reference and generated summaries, and by readers' mark-ups. The readers marked two aspects of the summaries: readability and information relevance. Experiments confirm that the generated summaries have the same information value as the reference summaries. However, readers state that phrase summaries are hard to read without the whole sentence context.

#760: Starting a Conversation: Indexical Rhythmical Features across Age and Gender (a corpus study)

Tatiana Sokoreva, Tatiana Shevchenko

The study investigates indexical rhythmical features in the first dozen words in 102 American English adult speakers' telephone talks. The main goal is to explore age- and gender-related changes in prosodic characteristics of accented syllables (AS) and non-accented syllables (NAS) which affect speech rhythm in dialogue. The rhythm measures include duration, fundamental frequency and intensity, both mean values and PVI scores in adjacent syllables. The results suggest increasing accentual prominence achieved through growing values of foot and AS mean duration, increasing F0 range values, as well as higher PVI scores for F0 maxima values across three age groups. The accent-based (prototypical "stress-timed") pattern of English proves to be developing with age and varying with gender in AmE spontaneous speech.
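
For readers unfamiliar with PVI (pairwise variability index) scores, the following minimal sketch computes the two commonly used variants over a sequence of adjacent-syllable measurements. The abstract does not say which variant the study uses, and the duration values are invented for illustration.

```python
def raw_pvi(values):
    """Raw PVI: mean absolute difference between successive values."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)


def normalized_pvi(values):
    """Normalized PVI: each difference is divided by the mean of the pair,
    then the average is scaled by 100."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    terms = [abs(a - b) / ((a + b) / 2) for a, b in zip(values, values[1:])]
    return 100 * sum(terms) / len(terms)


# Hypothetical syllable durations in milliseconds.
durations_ms = [120, 80, 200, 90]
print(raw_pvi(durations_ms))
print(normalized_pvi(durations_ms))
```

The same functions can be applied to F0 or intensity values of adjacent syllables; the normalized variant reduces the influence of overall speaking rate.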

#797: SubGram: Extending Skip-gram Word Representation with Substrings

Tom Kocmi, Ondřej Bojar

Skip-gram (word2vec) is a recent method for creating vector representations of words (distributed word representations) using a neural network. The representation gained popularity in various areas of natural language processing, because it seems to capture syntactic and semantic information about words without any explicit supervision in this respect. We propose SubGram, a refinement of the Skip-gram model to consider also the word structure during the training process, achieving large gains on the Skip-gram original test set.

#733: The custom decay language model for long range dependencies

Mittul Singh, Clayton Greenberg, Dietrich Klakow

Significant correlations between words can be observed over long distances, but contemporary language models like N-grams, Skip grams, and recurrent neural network language models (RNNLMs) require a large number of parameters to capture these dependencies, if the models can do so at all. In this paper, we propose the Custom Decay Language Model (CDLM), which captures long range correlations while maintaining a sub-linear increase in parameters with vocabulary size. This model has a robust and stable training procedure (unlike RNNLMs), a more powerful modeling scheme than the Skip models, and a customizable representation. In perplexity experiments, CDLMs outperform the Skip models using fewer parameters. A CDLM also nominally outperformed a similar-sized RNNLM, meaning that it learned as much as the RNNLM but without recurrence.

#759: Tools rPraat and mPraat: Interfacing phonetic analyses with signal processing

Tomáš Bořil and Radek Skarnitzl

The paper presents the rPraat package for R and the mPraat toolbox for Matlab, which constitute an interface between Praat, the most popular software for phonetic analyses, and the two more general programmes. The package adds to the functionality of Praat and is shown to be superior in terms of processing speed to other tools, while maintaining the interconnection with the data structures of R and Matlab, which provides a wide range of subsequent processing possibilities. The use of the proposed tool is demonstrated on a comparison of real speech data with synthetic speech generated by means of dynamic unit selection.

#782: Topic Modeling over Text Streams from Social Media

M. Smatana, J. Paralič, P. Butka

Topic modeling has become a popular research area that offers new ways to search, browse and summarize large amounts of text. Methods of topic modeling try to uncover the hidden thematic structure in document collections. Topic modeling in connection with social networks, which are among the strongest communication tools and produce large amounts of opinions and attitudes on world events, can be useful for analysis in crisis situations, elections, the launch of a new product on the market, etc. For that reason, in this paper we propose a tool for topic modeling over text streams from social networks. The description of the proposed tool is complemented with practical experiments. The experiments showed promising results when using our tool on real data in comparison to state-of-the-art methods.

#747: Towards It-CMC: a fine-grained POS tagset for Italian linguistic analysis

Claudio Russo

The present work introduces It-CMC, a fine-grained POS tagset that aims at combining linguistic accuracy and computational sustainability. It-CMC is tailored to Italian data from Computer-Mediated Communication (CMC) and, across the sections of the paper, a systematic comparison with the current tagset of the La Repubblica corpus is provided. After an early stage of performance monitoring carried out with Schmid's TreeTagger, the tagset is currently involved in a workflow that aims at creating an Italian parameter file for RFTagger.

#828: Training maxout neural networks for speech recognition tasks

Aleksey Prudnikov, Maxim Korenevsky

The topic of the paper is the training of deep neural networks which use tunable piecewise-linear activation functions called "maxout" for speech recognition tasks. Maxout networks are compared to the conventional fully-connected DNNs in case of training with both cross-entropy and sequence discriminative (sMBR) criteria. Experiments are carried out on the CHiME Challenge 2015 corpus of multi-microphone noisy dictation speech and the Switchboard corpus of conversational telephone speech. The clear advantage of maxout networks over DNNs is demonstrated when using the cross-entropy criterion on both corpora. It is also argued that maxout networks are prone to overfitting during sequence training but in some cases it can be successfully overcome with the use of the KL-divergence based regularization.
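
As a brief illustration of why maxout is described as a tunable piecewise-linear activation, the sketch below computes a single maxout unit as the maximum over several affine pieces. The number of pieces and the weights are hypothetical and chosen only to show that familiar activations arise as special cases; nothing here reflects the network configurations used in the paper.

```python
def maxout_unit(x, weights, biases):
    """One maxout unit: the maximum over k affine pieces w_j . x + b_j.

    x       -- input vector (list of floats)
    weights -- k weight vectors, one per linear piece
    biases  -- k biases, one per linear piece
    """
    return max(
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )


# With pieces x and -x the unit behaves like the absolute value.
print(maxout_unit([-3.0], [[1.0], [-1.0]], [0.0, 0.0]))

# With pieces x and 0 the unit behaves like ReLU.
print(maxout_unit([-2.0], [[1.0], [0.0]], [0.0, 0.0]))
print(maxout_unit([2.0], [[1.0], [0.0]], [0.0, 0.0]))
```

Because the pieces are learned, the shape of the activation is itself a trainable part of the model.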

#729: Unit-selection speech synthesis adjustments for audiobook-based voices

Jakub Vít, Jindřich Matoušek

This paper presents easy-to-use modifications to the unit-selection speech synthesis algorithm for voices built from audiobooks. Audiobooks are a very good source of large and high-quality audio data for speech synthesis; however, they usually do not meet the basic requirements for standard unit-selection synthesis: "neutral" speech properties with no expressive or spontaneous expressions, stable prosodic patterns, careful pronunciation, and consistent voice style during recording. Nevertheless, if these conditions are taken into consideration, a few modifications can be made to adjust the general unit-selection algorithm and make it more robust for synthesis from such audiobook data. A listening test shows that these adjustments increased perceived speech quality and acceptability compared to a baseline TTS system. The modifications presented here also make it possible to exploit the variability of the audio data to control the pitch and tempo of synthesized speech.

#819: Using Alliteration in Authorship Attribution of Historical Texts

Lubomir Ivanov

The paper describes the use of alliteration, by itself or in combination with other features, in training machine learning algorithms to perform attribution of texts of unknown/disputed authorship. The methodology is applied to a corpus of 18th century political writings, and used to improve the attribution accuracy.

#761: Utterance Classification

Reiko Kuwa

We propose a novel method for classifying recognized second language learners' utterances into three classes of acceptability for dialogue-based computer assisted language learning (CALL) systems. Our method uses a linear classifier trained with three types of bilingual evaluation understudy (BLEU) scores. The three BLEU scores are calculated with reference to three subsets of a learner corpus divided according to the quality of sentences. Our method classifies learner utterances into three classes (correct, acceptable with some modifications, and out-of-the-scope of assumed erroneous sentences), since this is suitable for providing effective feedback. Experimental results showed that our proposed classification method could distinguish utterance acceptability with 75.8% accuracy.

#742: Voice Activity Detector (VAD) based on long-term Mel frequency band features

Sergey Salishev, Andrey Barabanov, Daniil Kocharov, Pavel Skrelin, Mikhail Moiseev

We propose a VAD using long-term 200 ms Mel frequency band statistics, auditory masking, and a pre-trained two-level decision tree ensemble based classifier, which allows capturing the syllable-level structure of speech and discriminating it from common noises. On the test dataset, the proposed algorithm demonstrates almost 100% acceptance of clear voice for English, Chinese, Russian, and Polish speech and 100% rejection of stationary noises independently of loudness. The algorithm is intended to be used as a trigger for ASR. It reuses short-term FFT analysis (STFFT) from the ASR frontend with an additional 2 KB of memory and 15% complexity overhead.

#798: WordSim353 for Czech

Silvie Cinková

Human judgments of lexical similarity/relatedness are used as evaluation data for Vector Space Models, helping to judge how the distributional similarity captured by a given Vector Space Model correlates with human intuitions. A well established data set for the evaluation of lexical similarity/relatedness is WordSim353, along with its translations into several other languages. This paper presents its Czech translation and annotation, which is publicly available via the LINDAT-CLARIN repository at hdl.handle.net/11234/1-1713.
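
A typical use of such a data set can be sketched as follows: the similarity scores produced by a model for each word pair are compared with the human scores using a rank correlation. Spearman's coefficient is a common choice for this kind of evaluation, but the abstract does not say which coefficient is intended, and all scores below are invented.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for four word pairs.
human_scores = [9.1, 7.4, 3.2, 0.5]
model_scores = [0.83, 0.40, 0.55, 0.02]
print(spearman(human_scores, model_scores))
```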

#849: Persian Linguistic Database (PLDB) and its historical corpus

Seyed Mostafa Assi, Saeedeh Ghandi

Relying on real and objective data is a common prerequisite to almost all aspects of theoretical and applied linguistics research. This can be achieved by developing a large corpus: "a body of written texts or transcribed speech which can serve as a basis for linguistic analysis and description". Depending on the type of research, different specialized corpora have been compiled. Historical corpora consist of texts from one or more periods in the past. The corpus presented in this paper is the first attempt to build a corpus of prose texts from the 5th to 7th centuries AH (10th to 12th centuries AD) which consists of 50 authentic full texts with about 4 million words. The number of types in the corpus is 883216 and the number of tokens is 3766927. To build it, a selection of important texts of this era was collected based on certain criteria and typed in a predefined format. In the second step, all texts were integrated and indexed in the Persian Linguistic Database (PLDB) system, which facilitates text processing, faster information retrieval, and compiling frequency wordlists, statistics and concordances. The corpus is not just a raw corpus, as the texts are annotated with bibliographic headers and part-of-speech and phonetic tags. Keywords: Corpus linguistics, Linguistic corpora, part of speech (POS) tagging, Historical corpus of the Persian language

#843: Annotated Amharic Corpora

Pavel Rychlý and Vít Suchomel

Amharic is one of the under-resourced languages. The paper presents two text corpora. The first one is a substantially cleaned version of the existing morphologically annotated WIC Corpus (210,000 words). The second one is the largest Amharic text corpus (17 million words). It was created from Web pages automatically crawled in 2013, 2015 and 2016. It is part-of-speech annotated by a tagger trained and evaluated on the WIC Corpus.

#848: Multi-label Topic Classification of Turkish Sentences Using Cascaded Approach for Dialog Management System

Gizem Soğancıoğlu, Bilge Köroğlu and Onur Ağın

In this paper, we propose a two-stage system which aims to classify customer utterances into 10 categories including daily language and specific banking problems. Detecting the topic of a customer question would enable the dialogue system to find the corresponding answer more effectively. In order to identify the topic of customer questions, we built a machine-learning-based model consisting of two stages, which uses feature sets extracted from text and dialogue attributes. In the first stage, utterances are categorized into two classes, namely daily language and banking domain; this binary classification problem was studied and different learning algorithms such as Naive Bayes and C4.5 were evaluated with different feature sets. Utterances classified as banking domain by the first classifier are then classified by the second-stage classifier, which aims at automatic detection of the specialty of banking-related sentences. We approached this as a multi-label classification problem. Our proposed cascaded approach has shown quite good performance, with a micro-averaged F-score of 0.558.
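
For reference, the micro-averaged F-score used for multi-label classification can be computed as in the minimal sketch below: true positives, false positives and false negatives are pooled over all utterances and labels before precision and recall are calculated. The label names and predictions are hypothetical and are not taken from the paper's data.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F-score over parallel lists of label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels correctly predicted
        fp += len(p - g)   # labels predicted but not in the gold set
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical banking topics for three utterances.
gold = [{"card", "fees"}, {"loan"}, {"card"}]
predicted = [{"card"}, {"loan", "fees"}, {"transfer"}]
print(micro_f1(gold, predicted))
```

Because counts are pooled before averaging, frequent labels influence the micro-averaged score more than rare ones.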

#850: Resources for Indian languages

Arun Baby, Anju Leela Thomas, Nishanthi N L and TTS Consortium

This paper discusses a consortium effort on the design of a database for a high-quality corpus, primarily for building text-to-speech (TTS) synthesis systems for 13 major Indian languages. The importance of language corpora has long been recognized in many countries. The amount of work in the speech domain for Indian languages is comparatively lower than that for other languages, which creates a need for speech corpora for Indian languages. The corpus presented here is a database of speech audio files and corresponding text transcriptions. Various criteria are addressed while building the database for these languages, namely optimal text selection, speaker selection, pronunciation variation, recording specification, text correction for handling out-of-vocabulary words, and so on. Furthermore, various characteristics that affect speech synthesis quality, such as encoding, sampling rate, and channel, are considered so that the collected data will be of high quality with defined standards. Databases and text-to-speech synthesizers are built for all 13 languages, namely Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Malayalam, Manipuri, Marathi, Odiya, Rajasthani, Tamil and Telugu.

#852: Complete Analysis from Unstructured Text to Dependency Tree with UDPipe and NameTag

Milan Straka and Jana Straková and Jan Hajič

We present our in-house software tools for NLP: UDPipe, an open-source tool for processing CoNLL-U files which performs tokenization, morphological analysis, POS tagging and dependency parsing for 32 languages; and NameTag, a named entity recognizer for Czech and English. Both tools achieve excellent performance and are distributed with pretrained models, while running with minimal time and memory requirements. UDPipe and NameTag can also be trained on your own data. Both tools are open-source and distributed under the Mozilla Public License 2.0 (software) and CC BY-NC-SA (data). The binaries, a C++ library with Python, Perl, Java and C# bindings, an online web service and a demo are available at http://ufal.mff.cuni.cz/udpipe and http://ufal.mff.cuni.cz/nametag.
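As an illustration of the CoNLL-U format mentioned above, here is a minimal sketch that reads CoNLL-U text into Python dictionaries using only the standard library. It does not use UDPipe or its bindings; the function `parse_conllu` and the sample sentence are hypothetical, and the sample was written by hand for this example rather than produced by UDPipe. The sketch assumes the ten standard tab-separated CoNLL-U columns.

```python
from typing import Dict, List

# The ten tab-separated columns of a CoNLL-U token line.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():           # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):       # sentence-level comment line
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(dict(zip(CONLLU_FIELDS, cols)))
    if current:                        # input without a trailing blank line
        sentences.append(current)
    return sentences


# Hand-written sample sentence (an assumption, not actual tool output).
SAMPLE = "\n".join([
    "# text = Dogs bark.",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

sentences = parse_conllu(SAMPLE)
root = next(tok for tok in sentences[0] if tok["head"] == "0")
```

Here `root["form"]` is the word whose `head` column is `0`, which is how the format marks the root of the dependency tree.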

#853: Text Corpus for Gender Attribution Studies

Tatiana Litvinova, Olga Litvinova, Olga Zagorovskaya, Ekaterina Ryzhkova

The problem of identifying an author's gender by analyzing the language parameters of a written text with machine learning and NLP is currently gaining momentum. However, many issues remain to be addressed, and special text corpora are therefore being designed. We present a text corpus designed for studying gender identification of authors of written texts and the differences between male and female authors. The corpus is unique in the following respects: a) it consists of texts of different genres written for the experiment and Internet texts; b) it includes extensive metadata about the authors (age, psychological characteristics, lateral organization profile, etc.); c) it includes a subcorpus of intentionally deceptive speech (gender imitation); d) it contains linguistic labeling; e) it consists mainly of Russian texts, with added tweets by Russian-English bilinguals; f) it is a database that allows searching by authors' personality traits and the linguistic characteristics of texts.

#854: Lexicographic Software with the DEB platform

Adam Rambousek

We present a wide range of lexicographic software based on the DEB platform, developed at the Natural Language Processing Centre, Masaryk University. Notable examples of tools built using the DEB platform are DEBVisDic (a wordnet and ontology browser and editor), the Dictionary of Czech Sign Language, and DEBWrite (a free tool for building a dictionary in a user-friendly way).

#855: Lexonomy: a cloud-based dictionary writing system

Michal Měchura

Lexonomy, http://www.lexonomy.eu/, is a web-based system for writing and publishing dictionaries and other collections of XML-encoded entries. This demo will be a guided tour of Lexonomy's user interface and will cover: (1) creating a new dictionary in Lexonomy, (2) editing entries in Lexonomy's built-in XML editor, (3) configuring the structure of entries in Lexonomy's schema editor and (4) publishing your dictionary on the Lexonomy website. Lexonomy is free to use and participants in the demo will be able to get a user account on the spot.

#856: Multiword thesaurus

Miloš Jakubíček

We will present a recent development in the Sketch Engine corpus management system focusing on a thesaurus of multi-word units. The distributional thesaurus (of words) based on word sketch relations has been part of Sketch Engine for a very long time. Thanks to the extension of word sketches to multi-word sketches, we were able to carry out experiments on using multi-word sketches to compute a multi-word thesaurus in a fashion similar to the single-word one.
