21st International Conference on Text, Speech and Dialogue
TSD 2018, Brno, Czech Republic, September 11–14 2018
TSD 2018 Paper Abstracts

#101: Minsky, Chomsky & Deep Nets

Kenneth Ward Church

When Minsky and Chomsky were at Harvard in the 1950s, they started their careers by questioning a number of machine learning methods that have since regained popularity. Minsky's Perceptrons was a reaction to neural nets and Chomsky's Syntactic Structures was a reaction to n-gram language models. Many of their objections are being ignored and forgotten (perhaps for good reasons, and perhaps not). While their arguments may sound negative, I believe there is a more constructive way to think about their efforts: they were both attempting to organize computational tasks into larger frameworks such as what is now known as the Chomsky Hierarchy and algorithmic complexity. Section 5 will propose an organizing framework for deep nets. Deep nets are probably not the solution to all the world's problems. They don't do the impossible (solve the halting problem), and they probably aren't great at tasks such as sorting large vectors and multiplying large matrices. In practice, deep nets have produced extremely exciting results in vision and speech, though other tasks may be more challenging for them.

#102: Leolani: a Reference Machine with a Theory of Mind for Social Communication

Piek Vossen, Selene Baez, Lenka Bajcetić, Bram Kraaijeveld

Our state of mind is based on experiences and what other people tell us. This may result in conflicting information, uncertainty, and alternative facts. We present a robot that models the relativity of knowledge and perception within social interaction, following principles of the theory of mind. We used the vision and speech capabilities of a Pepper robot to build an interaction model that stores the interpretations of perceptions and conversations in combination with provenance on its sources. The robot learns directly from what people tell it, possibly in relation to its perception. We demonstrate how the robot's communication is driven by a hunger to acquire more knowledge from and about people and objects, to resolve uncertainties and conflicts, and to share awareness of the perceived environment. Likewise, the robot can make reference to the world, to its knowledge about the world, and to the encounters with people that yielded this knowledge.

#103: Speech Analytics for Medical Applications

Isabel Trancoso, Joana Correia, Francisco Teixeira, Bhiksha Raj, and Alberto Abad

Speech has the potential to provide a rich bio-marker for health, allowing a non-invasive route to early diagnosis and monitoring of a range of conditions related to human physiology and cognition. With the rise of speech-related machine learning applications over the last decade, there has been a growing interest in developing speech-based tools that perform non-invasive diagnosis. This talk covers two aspects of this growing trend. One is the collection of large in-the-wild multimodal datasets in which the speech of the subject is affected by certain medical conditions. Our mining effort has focused on video blogs (vlogs) and explores audio, video, text and metadata cues in order to retrieve vlogs that include a single speaker who, at some point, states that he/she is currently affected by a given disease. The second aspect is patient privacy. In this context, we explore recent developments in cryptography, in particular Fully Homomorphic Encryption, to develop an encrypted version of a neural network trained with unencrypted data, in order to produce encrypted predictions of health-related labels. As a proof of concept, we have selected two target diseases, cold and depression, to show our results and discuss these two aspects.

#104: Speech and Language Based Human Decisions - What is next?

Elmar Nöth

In this talk we will present methods to automatically analyze speaker traits and speaker states based on speech and language and discuss potential applications of the technology. We will show that these analyses are in the process of hitting the market and discuss ethical questions.

#890: Manócska: A Unified Verb Frame Database for Hungarian

Ágnes Kalivoda, Noémi Vadász, Balázs Indig

This paper presents Manócska, a verb frame database for Hungarian. It is called unified because it was built by merging all available verb frame resources. To be able to merge these, we had to cope with their structural and conceptual differences. After that, we transformed them into two easy-to-use formats: a TSV and an XML file. Manócska is open-access; the whole resource and the scripts used to create it are available in a GitHub repository. This makes Manócska reproducible and easy to access, version, fix and develop in the future. During the merging process, several errors came to light. These were corrected as systematically as possible. Thus, by integrating and harmonizing the resources, we produced a Hungarian verb frame database of higher quality.

#948: A Cross-Lingual Approach for Building Multilingual Sentiment Lexicons

Behzad Naderalvojou, Behrang Qasemizadeh, Laura Kallmeyer, Ebru Akcapinar Sezer

We propose a cross-lingual distributional model to build sentiment lexicons in many languages from resources available in English. We evaluate this method for two languages, German and Turkish, and on several datasets. We show that the sentiment lexicons built using our method remarkably improve the performance of a state-of-the-art lexicon-based BiLSTM sentiment classifier.

#943: A Dataset and a Novel Neural Approach for Optical Gregg Shorthand Recognition

Fangzhou Zhai, Yue Fan, Tejaswani Verma, Rupali Sinha, Dietrich Klakow

Gregg shorthand is the most popular form of pen stenography in the United States and has been adapted for many other languages. In order to explore the potential of optical recognition of Gregg shorthand, we develop and present Gregg-1916, a dataset comprising Gregg shorthand scripts of about 16 thousand common English words. In addition, we present a novel architecture for shorthand recognition which exhibits promising performance and opens up the path for various further directions.

#944: A Lattice Based Algebraic Model for Verb Centered Constructions

Bálint Sass

In this paper we present a new, abstract, mathematical model for verb centered constructions (VCCs). After defining the concept of VCC we introduce proper VCCs which are roughly the ones to be included in dictionaries. First, we build a simple model for one VCC utilizing lattice theory, and then a more complex model for all the VCCs of a whole corpus combining representations of single VCCs in a certain way. We hope that this model will stimulate a new way of thinking about VCCs and will also be a solid foundation for developing new algorithms handling them.

#924: A Survey of Recent DNN Architectures on the TIMIT Phone Recognition Task

J. Michálek, J. Vaněk

In this survey paper, we evaluate several recent deep neural network (DNN) architectures on the TIMIT phone recognition task. We chose the TIMIT corpus due to its popularity and broad availability in the community. It also simulates a low-resource scenario, which is helpful for minor languages. We prefer the phone recognition task because it is much more sensitive to acoustic model quality than a large vocabulary continuous speech recognition (LVCSR) task. In recent years, many published DNN papers have reported results on TIMIT. However, the reported phone error rates (PERs) were often much higher than the PER of a simple feed-forward (FF) DNN. That was the main motivation of this paper: to provide baseline DNNs, with open-source scripts, that achieve the lowest possible PERs and make the baseline results easy to replicate in future papers. To the best of our knowledge, the best PER achieved in this survey is better than the best PER published to date.

#887: Adaptation of Algorithms for Medical Information Retrieval for Working on Russian-Language Text Content

Aleksandra Vatian, Natalia Dobrenko, Anastasia Makarenko, Niyaz Nigmatullin, Nikolay Vedernikov, Artem Vasilev, Andrey Stankevich, Natalia Gusarova, Anatoly Shalyto

The paper investigates the possibilities of adapting various ADR (adverse drug reaction) detection algorithms to the Russian-language environment. In general, the ADR detection process consists of 4 steps: (1) data collection from social media; (2) classification / filtering of ADR-assertive text segments; (3) extraction of ADR mentions from text segments; (4) analysis of extracted ADR mentions for signal generation. The implementation of each step in the Russian-language environment involves a number of difficulties in comparison with the traditional English-language environment. First of all, these are connected with the lack of necessary databases and specialized language resources. In addition, an important negative role is played by the complex grammatical structure of the Russian language. The authors present various methods of adapting machine learning algorithms in order to overcome these difficulties. For step 3, an accuracy of 0.805 was obtained on Russian-language text forums using an ensemble classifier. For step 4, on Russian-language EHRs, an F-measure of 0.935 was obtained by adapting pyConTextNLP, and an F-measure of 0.92–0.95 by adapting the ConText algorithm. A method for performing step 4 in full was developed using cue-based and rule-based approaches, and an F-measure of 67.5% was obtained, which is quite comparable to the baseline.

#882: Adjusting Machine Translation Datasets for Document-Level Cross-Language Information Retrieval: Methodology

Gennady Shtekh, Polina Kazakova, Nikita Nikitinsky

Evaluating the performance of Cross-Language Information Retrieval (CLIR) models is a rather difficult task, since collecting and assessing a substantial amount of data for CLIR system evaluation can be a non-trivial and expensive process. At the same time, a substantial number of machine translation datasets are available now. In the present paper we attempt to solve the problem stated above by suggesting a strict workflow for transforming machine translation datasets into a CLIR evaluation dataset (with automatically obtained relevance assessments), as well as a workflow for extracting a representative subsample from the initial large corpus of documents so that it is appropriate for further manual assessment. We also hypothesize, and then confirm by a number of experiments on the United Nations Parallel Corpus data, that the quality of an information retrieval algorithm on the automatically assessed sample can in fact be treated as a reasonable metric.

#911: Annotated Clause Boundaries' Influence on Parsing Results

Dage Särg, Kadri Muischnek, Kaili Müürisep

The aim of the paper is to study the effect of pre-annotated clause boundaries on dependency parsing of Estonian new media texts. Our hypothesis is that correct identification of clause boundaries helps to improve parsing: as the text is split into smaller syntactically meaningful units, it should be easier for the parser to determine the syntactic structure of a given unit. To test the hypothesis, we performed two experiments on a 14,000-word corpus of Estonian web texts whose morphological analysis had been manually validated. In the first experiment, the corpus with gold standard morphological tags was parsed with MaltParser both with and without the manually annotated clause boundaries. In the second experiment, only the segmentation of the text was preserved and the morphological analysis was done automatically before parsing. The experiments confirmed our hypothesis about the influence of correct clause boundaries, though by a small margin: in both experiments, the improvement in LAS was 0.6%.

#945: Annotated Corpus of Czech Case Law for Reference Recognition Tasks

Jakub Harašta, Jaromír Šavelka, František Kasl, Adéla Kotková, Pavel Loutocký, Jakub Míšek, Daniela Procházková, Helena Pullmannová, Petr Semenišín, Tamara Šejnová, Nikola Šimková, Michal Vosinek, Lucie Zavadilová, Jan Zibner

We describe an annotated corpus of 350 decisions of Czech top-tier courts, gathered for a project assessing the relevance of court decisions in Czech law. We describe two layers of processing of the corpus: every decision was annotated by two trained annotators and then manually adjudicated by one trained curator to resolve possible disagreements between annotators. The corpus was developed as training and testing material for reference recognition tasks, which will further be used for research on the assessment of legal importance. Moreover, the overall shortage of available research corpora of annotated legal texts, particularly in the Czech language, leads us to believe that other research teams may find it useful.

#862: Automatic Evaluation of Synthetic Speech Quality by a System Based on Statistical Analysis

Jiří Přibil, Anna Přibilová, and Jindřich Matoušek

The paper describes a system for automatic evaluation of speech quality based on statistical analysis of differences in spectral properties, prosodic parameters, and time structuring within the speech signal. The proposed system was successfully tested in the evaluation of sentences originating from male and female voices and produced by a speech synthesizer using the unit selection method with two different approaches to prosody manipulation. The experiments show the necessity of all three types of speech features for obtaining correct, sharp, and stable results. A detailed analysis shows the great influence of the number of statistical parameters on the correctness and precision of the evaluated results. A larger size of the processed speech material has a positive impact on the stability of the evaluation process. A final comparison documents a basic correlation with the results obtained from a standard listening test.

#905: Building the Tatar-Russian NMT System

Aidar Khusainov, Dzhavdet Suleymanov, Rinat Gilmullin, Ajrat Gatiatullin

This paper assesses the possibility of combining the rule-based and the neural network approaches to the construction of a machine translation system for the Tatar-Russian language pair. We propose a rule-based system that allows using parallel data from a group of 6 Turkic languages (Tatar, Kazakh, Kyrgyz, Crimean Tatar, Uzbek, Turkish) and the Russian language to overcome the problem of limited Tatar-Russian data. We incorporated modern approaches to data augmentation and neural network training, as well as linguistically motivated rule-based methods. The main results of the work are the creation of the first neural Tatar-Russian translation system and the improvement of translation quality in this language pair, in terms of BLEU scores, from 12 to 39 and from 17 to 45 for the two translation directions (compared to the existing translation system). Translation between any of the Tatar, Kazakh, Kyrgyz, Crimean Tatar, Uzbek and Turkish languages also becomes possible, which makes it possible to translate from all of these Turkic languages into Russian using Tatar as an intermediate language.

#942: Classification of Formal and Informal Dialogues Based on Emotion Recognition Features

György Kovács

Social context is an important part of human communication; hence it is also important for improved human-computer interaction. One aspect of social context is the level of formality. Here, motivated by the difference observed between the emotional annotation of formal and informal dialogues in the HuComTech corpus, we introduce a content-free classification scheme based on feature sets designed for emotion recognition. With this method we attain an error rate of 8.8% in the classification of formal and informal dialogues, which means a relative error rate reduction of more than 40% compared to earlier results. By combining our proposed method with earlier models, we were able to further reduce the error rate to below 7%.

#954: CoRTE: A Corpus of Recognizing Textual Entailment Data Annotated for Coreference and Bridging Relations

Afifah Waseem

This paper presents CoRTE, an English corpus annotated with coreference and bridging relations, where the dataset is taken from the main task of recognizing textual entailment (RTE). Our annotation scheme elaborates existing schemes by introducing subcategories. Each coreference and bridging relation has been assigned a category. CoRTE is a useful resource for researchers working on coreference and bridging resolution, as well as recognizing textual entailment (RTE) task. RTE has its applications in many NLP domains. CoRTE would thus provide contextual information readily available to the NLP systems being developed for domains requiring textual inference and discourse understanding. The paper describes the annotation scheme with examples. We have annotated 340 text-hypothesis pairs, consisting of 24,742 tokens and 8,072 markables.

#957: Corpus Annotation Pipeline for Non-standard Texts

Zuzana Pelikánová, Zuzana Nevěřilová

According to some estimates, web corpora contain over 6% foreign material (borrowings, language mixing, named entities). Since annotation pipelines are usually built on standard and correct data, the resulting annotation of web corpora often contains serious errors. We studied in depth the annotation errors of the web corpus czTenTen 12 and proposed an extension to the tagger desamb, which had been used for the czTenTen annotation. First, a subcorpus was made from the most problematic documents of czTenTen. Second, measures were established for the most frequent annotation errors. Third, we carried out several experiments in which we extended the annotation pipeline so it could annotate foreign material and multi-word expressions. Finally, we compared the new annotations of the subcorpus with the original ones.

#901: Current State of Text-to-Speech System ARTIC: A Decade of Research on the Field of Speech Technologies

Daniel Tihelka, Zdeněk Hanzlíček, Markéta Jůzová, Jakub Vít, Jindřich Matoušek, Martin Grůber

This paper provides a survey of the current state of ARTIC -- the modern Czech concatenative corpus-based text-to-speech system. Through more than a decade of research and development in the field of speech technologies and applications, the system has been enriched with new languages (and, as a consequence, language-dependent NLP methods), and its speech generation capabilities have been significantly improved as new progressive speech generation modules (SPS, DNN, HSS) were (and still are being) designed and incorporated into it. ARTIC also has to deal with various requirements on the data used to generate speech, ranging in size, quality and domain of the output speech, while there has always been the requirement to achieve the highest quality in terms of both naturalness and intelligibility. Thus, the paper summarizes some of the most significant achievements and demanding tasks which the system had to tackle, illustrating the universality and flexibility of this Czech TTS system.

#941: Czech Dataset for Semantic Textual Similarity

Lukáš Svoboda and Tomáš Brychcín

Semantic textual similarity is the core shared task at the International Workshop on Semantic Evaluation (SemEval). It focuses on sentence meaning comparison. So far, most of the research has been devoted to English. In this paper we present the first Czech dataset for semantic textual similarity. The dataset contains 1425 manually annotated pairs. Czech is a highly inflected language and is considered challenging for many natural language processing tasks. The dataset is publicly available for the research community. In 2016 we participated in the SemEval competition, and our UWB system was ranked second among 113 submitted systems in the monolingual subtask and first among 26 systems in the cross-lingual subtask. We adapt the UWB system (originally for English) to Czech and experiment with the new Czech dataset. Our system achieves very promising results and can serve as a strong baseline for future research.

#908: Data Augmentation and Teacher-Student Training for LF-MMI Based Robust Speech Recognition

Asadullah and Tanel Alumäe

Deep neural networks (DNNs) have played a key role in the development of state-of-the-art speech recognition systems. In recent years, the lattice-free MMI (LF-MMI) objective has become a popular method for training DNN acoustic models. However, domain adaptation of DNNs from clean to noisy data still remains a challenging problem. In this paper, we compare and combine two methods for adapting LF-MMI-based models to a noisy domain that do not require transcribed noisy data: multi-condition training and teacher-student style domain adaptation. For teacher-student training, we use lattices obtained by decoding untranscribed clean speech as supervision for adapting the model to the noisy domain. We use in-domain noise, extracted from a large untranscribed speech corpus using voice activity detection, for noise augmentation in both multi-condition training and teacher-student training. We show that combining multi-condition training and lattice-based teacher-student training gives better results than either of the methods alone. Furthermore, we show the benefits of using in-domain noise instead of general noise profiles for noise augmentation. Overall, we obtain a 7.4% relative improvement in word error rate over a standard multi-condition baseline.
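The teacher-student criterion can be illustrated at the frame level: the student, fed noisy speech, is trained to match the teacher's posteriors computed on the parallel clean speech. This is only a minimal sketch; the paper's actual supervision comes from decoded lattices rather than frame posteriors, and the logits below are invented.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def teacher_student_loss(teacher_logits, student_logits):
    """Cross-entropy between teacher posteriors (soft targets, from clean
    speech) and student posteriors (from noisy speech), averaged over frames."""
    total = 0.0
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        p_t = softmax(t_frame)
        log_p_s = [math.log(p) for p in softmax(s_frame)]
        total += -sum(p * lp for p, lp in zip(p_t, log_p_s))
    return total / len(teacher_logits)

# Two frames, three senone classes; "noisy" logits are perturbed copies.
clean = [[2.0, 0.5, -1.0], [0.0, 1.5, 0.3]]
noisy = [[1.6, 0.9, -0.7], [0.4, 1.1, 0.5]]
loss_matched = teacher_student_loss(clean, clean)
loss_noisy = teacher_student_loss(clean, noisy)
```

Since cross-entropy is minimized when the two distributions coincide, the loss against the perturbed student is strictly larger than the matched case, which is what drives the student toward the teacher during adaptation.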

#921: Deep Learning and Online Speech Activity Detection for Czech Radio Broadcasting

Jan Zelinka

In this paper, enhancements of online speech activity detection (SAD) are presented. Our proposed approach combines standard signal processing methods with modern deep-learning methods, which allows simultaneous training of the detector's parts that are usually trained or designed separately. In our SAD, an NN-based early score computation system, an NN-based score smoothing system and the proposed online decoding system were incorporated into a single training process. Besides the CNN and DNN, spectral flux and spectral variance features are also investigated. The proposed approach was tested on a Czech Radio broadcasting corpus. The corpus was used to investigate both supervised and semi-supervised machine learning.

#885: Deriving Enhanced Universal Dependencies from a Hybrid Dependency-Constituency Treebank

Lauma Pretkalnina, Laura Rituma, Baiba Saulite

The treebanks provided by the Universal Dependencies (UD) initiative are a state-of-the-art resource for cross-lingual and monolingual syntax-based linguistic studies, as well as for multilingual dependency parsing. Creating a UD treebank for a language helps further the UD initiative by providing an important dataset for research and natural language processing in that language. In this paper, we describe how we created a UD treebank for Latvian, and how we obtained both the basic and enhanced UD representations from the data in Latvian Treebank which is annotated according to a hybrid dependency-constituency grammar model. The hybrid model was inspired by Lucien Tesniere's dependency grammar theory and its notion of a syntactic nucleus. While the basic UD representation is already a de facto standard in NLP, the enhanced UD representation is just emerging, and the treebank described here is among the first to provide both representations.

#930: Do We Need Word Sense Disambiguation for LCM Tagging?

Aleksander Wawer, Justyna Sarzynska

Observing the current state of natural language processing, especially for the Polish language, one notices that sense-level dictionaries are becoming increasingly popular. For instance, the largest manually annotated sentiment dictionary for Polish is now based on plWordNet (the Polish WordNet), and the Polish Linguistic Category Model (LCM-PL) dictionary has a significant part annotated at the sense level. Our paper addresses an important question: what is the influence of word sense disambiguation in real-world scenarios, and how does it compare to the simpler baseline of labeling with just the tag of the most frequent sense? We evaluate both approaches on data sets compiled for studies on fake opinion detection and on predicting levels of self-esteem in the area of social psychology. Our conclusion is that the baseline method vastly outperforms its competitor.

#889: Evaluating Distributional Features for Multiword Expression Recognition

Natalia Loukachevitch, Ekaterina Parkhomenko

In this paper we consider the task of extracting multiword expressions for the Russian thesaurus RuThes, which contains various types of phrases, including non-compositional phrases, multiword terms and their variants, light verb constructions, and others. We study several embedding-based features for phrases and their components and estimate their contribution to finding multiword expressions of different types, comparing them with traditional association and context measures. We found that one of the distributional features achieves relatively high MWE extraction results even when used alone. Combining it with other features (phrase frequency, association measures) in different ways improves both initial orderings.
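As a point of comparison, a traditional association measure such as pointwise mutual information can be computed directly from corpus counts. The sketch below is ours, not the paper's code, and the toy corpus is invented:

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Score adjacent word pairs by pointwise mutual information:
    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ).
    High PMI means the pair co-occurs more often than chance predicts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), f in bigrams.items():
        p_xy = f / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

tokens = "hot dog stand sells hot dog buns and cold drinks".split()
scores = pmi_scores(tokens)
```

In a real MWE pipeline such scores would be computed over a large corpus and used to rank candidate phrases for inclusion in the thesaurus.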

#900: F_0 Post-Stress Rise Trends Consideration in Unit Selection TTS

Markéta Jůzová, Jan Volín

In spoken Czech, the stressed and post-stress syllables in human speech are usually characterized by an increase in the fundamental frequency F0 (except for phrase-final stress groups). In unit selection text-to-speech systems, however, where no F0 contour is generated to be followed, the F0 behaviour is usually handled only vaguely. The paper presents an experiment in making a unit selection TTS system follow the trends of fundamental frequency rise in synthesized speech, to achieve higher naturalness and overall quality of the speech synthesis itself.

#933: Generation of Arabic Broken Plural within LKB

Samia Ben Ismail, Sirine Boukedi, Kais Haddar

The treatment of the Broken Plural (BP) of Arabic nouns using a unification grammar is an important task in Natural Language Processing (NLP). This treatment contributes to constructing extensional lexicons with large coverage. In this context, the main objective of this work is to develop a morphological analyzer for Arabic that treats the BP with Head-driven Phrase Structure Grammar (HPSG). Therefore, after a linguistic study, we start by identifying different patterns of the BP and representing them in HPSG. The designed grammar was specified in the Type Description Language (TDL) and then tested with the LKB system. The obtained results were encouraging and satisfactory, because our system can generate all the BP forms that an Arabic singular noun can have.

#902: Identifying and Linking Participant Mentions in Legal Court Judgments

Ajay Gupta, Devendra Verma, Sachin Pawar, Sangameshwar Patil, Swapnil Hingmire, Girish K. Palshikar, Pushpak Bhattacharya

Legal court judgements have multiple participants (e.g. judge, complainant, petitioner, lawyer, etc.). They may be referred to in multiple ways; e.g., the same person may be referred to as lawyer, counsel, learned counsel or advocate, as well as by his/her proper name. For any analysis of legal texts, it is important to resolve such multiple mentions, which are coreferences of the same participant. In this paper, we propose a supervised approach to this challenging task. To avoid human annotation effort for legal-domain data, we exploit the ACE 2005 dataset by mapping its entities to participants in the legal domain. We use the basic transfer learning paradigm of training classification models on general-purpose text (news in the ACE 2005 data) and applying them to legal-domain text. We evaluate our approach on a sample annotated test dataset in the legal domain and demonstrate that it outperforms state-of-the-art baselines.

#879: Idioms Modeling in a Computer Ontology as a Morphosyntactic Disambiguation Strategy (the Case of Tibetan Corpus of Grammar Treatises)

Alexei Dobrov, Anastasia Dobrova, Pavel Grokhovskiy, Maria Smirnova, Nikolay Soms

The article presents the experience of developing a computer ontology as one of the tools for processing Tibetan idioms. A computer ontology that contains a consistent specification of the meanings of lexical units, with different relations between them, represents a model of lexical semantics and of both syntactic and semantic valencies, reflecting the Tibetan linguistic picture of the world. The article presents an attempt to classify Tibetan idioms, including compounds, which are idiomatized clips of syntactic groups that have frozen inner syntactic relations and are often characterized by the omission of grammatical morphemes, and the application of this classification to idiom processing in the computer ontology. The article also proposes methods of using the computer ontology to avoid ambiguity in idiom processing.

#897: Improving Part-of-Speech Tagging by Meta-Learning

Łukasz Kobylinski, Michał Wasiluk, Grzegorz Wojdyga

Recently, we have observed rapid progress in the state of part-of-speech tagging for Polish. Thanks to PolEval --- a shared task organized in late 2017 --- many new approaches to this problem have been proposed. New deep learning paradigms have helped to narrow the gap between the accuracy of POS tagging methods for Polish and for English. Still, the number of errors made by the taggers on large corpora is very high: even the currently best performing tagger reaches an accuracy of ca. 94.5%, which translates to millions of errors in a billion-word corpus. To further improve the accuracy of Polish POS tagging, we propose to employ a meta-learning approach on top of several existing taggers. This approach is inspired by the fact that the taggers, while often similar in terms of accuracy, make different errors, which leads to the conclusion that some of the methods are better in specific contexts than others. We thus train a machine learning method that captures the relationship between a particular tagger's accuracy and the language context, and in this way create a model which selects between several taggers in each context to maximize the expected tagging accuracy.
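The selection idea can be sketched with a deliberately simple meta-learner: for each context feature, remember which tagger is most often correct, and pick that tagger at prediction time. The features, tagger names and data below are invented stand-ins; the paper's actual meta-model and feature set are richer.

```python
from collections import Counter, defaultdict

def train_selector(contexts, tagger_correct):
    """For each context feature, learn which tagger is most often correct.
    contexts: list of feature strings (e.g. coarse POS of the previous word);
    tagger_correct: list of {tagger_name: was_correct} dicts, aligned."""
    votes = defaultdict(Counter)
    for feat, outcome in zip(contexts, tagger_correct):
        for tagger, ok in outcome.items():
            if ok:
                votes[feat][tagger] += 1
    return {feat: c.most_common(1)[0][0] for feat, c in votes.items()}

def select(selector, feat, default="A"):
    """Pick the tagger learned for this context, falling back to a default."""
    return selector.get(feat, default)

# Toy data: tagger A tends to be right after nouns, tagger B after verbs.
contexts = ["noun", "noun", "verb", "verb", "noun"]
outcomes = [
    {"A": True, "B": False},
    {"A": True, "B": True},
    {"A": False, "B": True},
    {"A": False, "B": True},
    {"A": True, "B": False},
]
selector = train_selector(contexts, outcomes)
```

A production meta-learner would replace the per-feature vote count with a trained classifier over many contextual features, but the selection logic is the same.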

#876: LDA in Character-LSTM-CRF Named Entity Recognition

M. Konopík, O. Pražák

In this paper, we present an NER system based on deep learning models with character sequence encoding and word sequence encoding in LSTM layers. The results are boosted with LDA topic models and linear-chain CRF sequence tagging. We reach new state-of-the-art NER performance of 81.77 F-measure for Czech and 85.91 F-measure for Spanish.

#903: Learning to Interrupt the User at the Right Time in Incremental Dialogue Systems

Adam Chýlek, Jan Švec, Luboš Šmídl

Continuous processing of input in incremental dialogue systems might result in the need to interrupt a user's utterance when clarification or rapport is needed. Being able to predict the right time to interrupt the utterance can be another step towards a more human-like dialogue. On the other hand, annotating corpora with different types of possible interruptions requires additional human resources. In this paper, we discuss how to process a corpus that does not have interruptions specifically annotated. We also present initial experiments on two corpora and show that it is possible to model the desired behaviour from these corpora.

#878: Lexical Stress-Based Authorship Attribution with Accurate Pronunciation Patterns Selection

Lubomir Ivanov, Amanda Aebig, Stephen Meerman

This paper presents a feature selection methodology for authorship attribution based on lexical stress patterns of words in text. The methodology uses part-of-speech information to make the proper selection of a lexical stress pattern when multiple possible pronunciations of the word exist. The selected lexical stress patterns are used to train machine learning classifiers to perform author attribution. The methodology is applied to a corpus of 18th century political texts, achieving a significant improvement in performance compared to previous work.

#915: Morphological Analyzer for the Tunisian Dialect

Roua Torjmen, Kais Haddar

Morphological analysis is an important task for Tunisian dialect processing because the dialect does not follow any standard and differs from Modern Standard Arabic. In order to propose a morphological analysis method, we studied many Tunisian dialect texts to identify the different forms of written words. The proposed method is based on a self-constructed dictionary extracted from a corpus and a set of morphological local grammars implemented in the NooJ linguistic platform. The morphological grammars are transformed into finite transducers using NooJ's new technologies. To test and evaluate the designed analyzer, we applied it to a Tunisian test corpus containing over 18,000 words. The obtained results are promising.

#952: Morphological and Language-Agnostic Word Segmentation for NMT

Dominik Macháček, Jonáš Vidra, Ondřej Bojar

The state of the art in handling rich morphology in neural machine translation (NMT) is to break word forms into subword units, so that the overall vocabulary size of these units fits the practical limits given by the NMT model and GPU memory capacity. In this paper, we compare two common but linguistically uninformed methods of subword construction (BPE and STE, the method implemented in the Tensor2Tensor toolkit) and two linguistically motivated methods: Morfessor and a novel method based on a derivational dictionary. Our experiments with German-to-Czech translation, both languages morphologically rich, document that so far the linguistically uninformed methods perform better. Furthermore, we identify a critical difference between BPE and STE and show a simple pre-processing step for BPE that considerably increases translation quality as evaluated by automatic measures.
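
For illustration, the linguistically uninformed side of this comparison, BPE, boils down to a greedy loop that repeatedly merges the most frequent adjacent symbol pair in the vocabulary. The following is a minimal sketch of that loop, not the toolkit implementations compared in the paper.

```python
from collections import Counter

def get_pairs(word):
    """Adjacent symbol pairs of a word given as a tuple of symbols."""
    return [(word[i], word[i + 1]) for i in range(len(word) - 1)]

def learn_bpe(vocab, num_merges):
    """Learn BPE merges from `vocab`, a dict mapping a word
    (tuple of symbols) to its corpus frequency."""
    merges = []
    vocab = dict(vocab)
    for _ in range(num_merges):
        # Count all adjacent pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for p in get_pairs(word):
                pairs[p] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        new_vocab = {}
        for word, freq in vocab.items():
            w, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    w.append(word[i] + word[i + 1])
                    i += 2
                else:
                    w.append(word[i])
                    i += 1
            new_vocab[tuple(w)] = freq
        vocab = new_vocab
    return merges, vocab
```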

#920: Morphosyntactic Disambiguation and Segmentation for Historical Polish with Graph-Based Conditional Random Fields

Jakub Waszczuk, Witold Kieraś, Marcin Woliński

The paper presents a system for joint morphosyntactic disambiguation and segmentation of Polish based on conditional random fields (CRFs). The system is coupled with Morfeusz, a morphosyntactic analyzer for Polish, which represents both morphosyntactic and segmentation ambiguities in the form of a directed acyclic graph (DAG). We rely on constrained linear-chain CRFs generalized to work directly on DAGs, which allows us to perform segmentation as a by-product of morphosyntactic disambiguation. This is in contrast with other existing taggers for Polish, which either neglect the problem of segmentation or rely on heuristics to perform it in a pre-processing stage. We evaluate our system on historical corpora of Polish, where segmentation ambiguities are more prominent than in contemporary Polish, and show that our system significantly outperforms several baseline segmentation methods.

#955: Multi-task Projected Embedding for Igbo

Ignatius Ezeani, Mark Hepple, Ikechukwu Onyenwe, Chioma Enemuo

NLP research on low-resource African languages is often impeded by the unavailability of basic resources: tools, techniques, annotated corpora, and datasets. Besides the lack of funding for the manual development of these resources, building from scratch would amount to reinventing the wheel. Therefore, adapting existing techniques and models from well-resourced languages is often an attractive option. One of the most widely applied NLP models is word embeddings. Embedding models often require large amounts of training data, which are not available for most African languages. In this work, we adopt an alignment-based projection method to transfer trained English embeddings to the Igbo language. Various English embedding models were projected and evaluated intrinsically on the odd-word, analogy and word-similarity tasks, and also on the diacritic restoration task. Our results show that the projected embeddings performed very well across these tasks.

#899: On the Extension of the Formal Prosody Model for TTS

Markéta Jůzová, Daniel Tihelka, Jan Volín

The formal prosody grammar used for TTS focuses mainly on the description of the final prosodic words in phrases/sentences, which characterize a special prosodic phenomenon representing a certain communication function within the language system. This paper introduces an extension of the prosody model which also takes into account the importance and distinctness of the first prosodic words in prosodic phrases. This phenomenon cannot change the semantic interpretation of the phrase, but for higher naturalness, the beginnings of prosodic phrases differ from subsequent words and should, based on the phonetic background, be dealt with separately.

#866: Online LDA-Based Language Model Adaptation

Jan Lehečka, Aleš Pražák

In this paper, we present our improvements in online topic-based language model adaptation. Our aim is to enhance the automatic speech recognition of multi-topic speech which is to be recognized in real time (online). Latent Dirichlet Allocation (LDA) is an unsupervised topic model designed to uncover hidden semantic relationships between words and documents in a text corpus and thus reveal latent topics automatically. We use LDA to cluster the text corpus and to predict topics online from partial hypotheses during real-time speech recognition. Based on detected topic changes in the speech, we adapt the language model on the fly. We demonstrate the improvement of our system on the task of online subtitling of TV news, where we achieved an 18% relative reduction of perplexity and a 3.52% relative reduction of WER over the non-adapted system.
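
The on-the-fly adaptation can be pictured as linear interpolation of a background language model with a topic-specific one, with the mixture weight raised when a topic change is detected. The toy unigram version below is a sketch only; the probabilities and weight are hypothetical, not the paper's actual models.

```python
import math

def interpolate(background, topic, lam):
    """P_adapted(w) = (1 - lam) * P_bg(w) + lam * P_topic(w),
    for unigram models given as word -> probability dicts."""
    words = set(background) | set(topic)
    return {w: (1 - lam) * background.get(w, 0.0) + lam * topic.get(w, 0.0)
            for w in words}

def perplexity(model, tokens):
    """Perplexity of a unigram model on a token sequence."""
    logprob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-logprob / len(tokens))
```

With a topic model that concentrates probability mass on in-topic words, the adapted mixture lowers perplexity on in-topic speech while never zeroing out background words.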

#934: Phonological Posteriors and GRU Recurrent Units to Assess Speech Impairments of Patients with Parkinson's Disease

Juan Camillo Vásquez-Correa, Nicanor Garcia-Ospina, Juan Rafael Orozco-Arroyave, Milos Cernak, Elmar Nöth

Parkinson's disease is a neurodegenerative disorder characterized by a variety of motor symptoms, including several impairments in the speech production process. Recent studies show that deep learning models are highly accurate in assessing the speech deficits of the patients; however, most of the architectures consider static features computed from a complete utterance. Such an approach is not suitable for modeling the dynamics of the speech signal when the patients pronounce different sounds. Phonological features can be used to characterize the voice quality of speech, which is highly impaired in patients suffering from Parkinson's disease. This study proposes a deep architecture based on recurrent neural networks with gated recurrent units combined with phonological posteriors to assess the speech deficits of Parkinson's patients. The aim is to model the time-dependence of consecutive phonological posteriors, which follow the Sound Pattern of English phonological model. The results show that the proposed approach is more accurate than a baseline based on standard acoustic features in assessing the speech deficits of the patients.

#936: Phonological i-Vectors to Detect Parkinson's Disease

N. Garcia-Ospina, T. Arias-Vergara, J. C. Vásquez-Correa, J. R. Orozco-Arroyave, M. Cernak, and E. Nöth

Speech disorders are common symptoms among Parkinson's disease patients and affect the speech of patients in different aspects. Currently, there are few studies that consider the phonological dimension of Parkinson's speech. In this work, we use a recently developed method to extract phonological features from speech signals. These features are based on the Sound Pattern of English phonological model. The extraction is performed using pre-trained Deep Neural Networks to infer the probabilities of phonological features from short-time acoustic features. An i-vector extractor is trained with the phonological features. The extracted i-vectors are used to classify patients and healthy speakers and to assess their neurological state and dysarthria level. This approach could be helpful for assessing new specific speech aspects, such as the movement of the different articulators involved in the speech production process.

#875: Prefixal Morphemes of Czech Verbs

Jaroslava Hlaváčová

The paper presents an analysis of Czech verbal prefixes, which is the first step of a project whose ultimate goal is automatic morphemic analysis of Czech. We studied the prefixes that may occur in Czech verbs, especially their possible and impossible combinations. We describe a procedure for prefix recognition and derive several general rules for selecting the correct result. The analysis of "double" prefixes enables us to draw conclusions about the universality of the first prefix. We also add linguistic comments on several types of prefixes.

#892: Prosodic Features' Criterion For Hebrew

Ben Fishman, Itshak Lapidot, Irit Opher

Prosody provides important information about intention and meaning, and carries clues regarding dialogue turns, phrase emphasis and even the physiological or emotional condition of the speaker. Prosody has been researched extensively by linguists and speech scientists; however, little attention has been given to formulating and ranking the acoustic features that represent prosodic information. This paper aims at defining a simple methodology that allows us to test whether a feature conveys prosodic information. This way, we can compare different features and rate them as prosodic or content-related (in this paper, the word "content" refers to the verbal information of the utterance). We explore many features using a Hebrew dataset especially designed for validating prosodic features, and as the first step of our research we chose two prosody classes: neutral and question. We apply our methodology successfully and find that prosodic features are indeed invariant to the content of the utterance, while correlating with prosodic manifestations. We validate our methodology by showing that our ranking of prosodic features yields results similar to classification-based feature selection.

#961: Recognition of OCR Invoice Metadata Block Types

Hien T. Ha, Marek Medveď, Zuzana Nevěřilová, Aleš Horák

Automatic cataloging of thousands of paper-based structured documents is a crucial fund-saving task for future document management systems. Current optical character recognition (OCR) systems process tabular data with a sufficient level of character-level accuracy; however, recovering the overall structure of the document metadata is still an open practical task. In this paper, we introduce the OCRMiner system, designed to extract the indexing metadata of structured documents obtained from an image scanning process and OCR. We present the details of the system's modular architecture and evaluate the detection of the text block types that appear within invoice documents. The system is based on text analysis in combination with layout features, and is developed and tested in cooperation with a renowned copy machine producer. The system uses an open-source OCR and reaches an overall accuracy of 80.1%.

#947: Recognition of the Logical Structure of Arabic Newspaper Pages

Hassina Bouressace, Janos Csirik

In document analysis and recognition, we seek to apply methods of automatic document identification. The main goal is to go from a simple image to a structured set of information exploitable by a machine. Here, we present a system for recognizing the logical structure (hierarchical organization) of Arabic newspaper pages. These are characterized by a rich and variable structure: they may contain several articles composed of titles, figures, authors' names and figure captions. However, the logical structure recognition of a newspaper page must be preceded by the extraction of its physical structure. In our system, this extraction is performed using a combined method based essentially on the RLSA (Run Length Smearing/Smoothing Algorithm), projection profile analysis, and connected component labeling. Logical structure extraction is then performed based on rules about the sizes and positions of the physical elements extracted earlier, and also on a priori knowledge of certain properties of logical entities (titles, figures, authors, captions, etc.). Lastly, the hierarchical organization of the document is represented as an automatically generated XML file. To evaluate the performance of our system, we tested it on a set of images, and the results are encouraging.
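
The RLSA step mentioned above can be sketched on a single binary row: runs of white pixels shorter than a threshold are flipped to black, fusing nearby characters into word and line blocks. A minimal one-dimensional sketch, not the authors' implementation:

```python
def rlsa_row(row, threshold):
    """Horizontal run-length smearing on one binary row (1 = black,
    0 = white): white runs shorter than `threshold` that lie between
    black pixels are flipped to black, merging nearby components."""
    out = list(row)
    n = len(row)
    i = 0
    while i < n:
        if row[i] == 0:
            # Find the end of this white run.
            j = i
            while j < n and row[j] == 0:
                j += 1
            # Smear only interior gaps, not the page margins.
            if 0 < i and j < n and (j - i) < threshold:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out
```

The full algorithm applies the same smearing vertically and combines the two results before connected component labeling.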

#870: Recurrent Neural Network Based Speaker Change Detection from Text Transcription Applied in Telephone Speaker Diarization System

Zbyněk Zajíc, Daniel Soutner, Marek Hrúz, Luděk Müller, Vlasta Radová

In this paper, we propose a speaker change detection system based on lexical information from the transcribed speech. For this purpose, we applied a recurrent neural network to decide whether there is an end of an utterance at the end of a spoken word. Our motivation is to use the transcription of the conversation as an additional feature for a speaker diarization system, refining the segmentation step to achieve better accuracy of the whole diarization system. We compare the proposed speaker change detection system based on the transcription (text) with our previous system based on information from the spectrogram (audio), and combine these two modalities to improve the results of diarization. We cut the conversation into segments according to the detected changes and represent each segment by an i-vector. We conducted experiments on the English part of the CallHome corpus. The results indicate improvement in speaker change detection (by 0.5% relative) and also in speaker diarization (by 1% relative) when both modalities are used.

#863: Robust Recognition of Conversational Telephone Speech via Multi-Condition Training and Data Augmentation

Jiří Málek, Jindřich Ždánský, Petr Červa

In this paper, we focus on automatic recognition of conversational telephone speech in a scenario where no genuine telephone recordings are available for training. The training set contains only data from a significantly different domain, such as recordings of broadcast news. A significant mismatch arises between training and test conditions, which leads to deteriorated performance of the resulting recognition system. We aim to diminish this mismatch using data augmentation. Speech compression and a narrow-band spectrum are significant features of telephone speech. We apply these effects to the training dataset artificially, in order to make it more similar to the desired test conditions. Using such an augmented dataset, we subsequently train an acoustic model. Our experiments show that the augmented models achieve accuracy close to the results of a model trained on genuine telephone data. Moreover, when the augmentation is applied to real-world telephone data, further accuracy gains are achieved.

#949: Semantic Question Matching in Data Constrained Environment

Anutosh Maitra, Shubhashis Sengupta, Abhisek Mukhopadhyay, Deepak Gupta, Rajkumar Pujari, Pushpak Bhattacharya, Asif Ekbal, Tom Geo Jain

Machine comprehension of various forms of semantically similar questions with the same or similar answers has been an ongoing challenge. Especially in many industrial domains with a limited set of questions, it is hard to identify the proper semantic match for a newly asked question that has the same answer but is presented in a different lexical form. This paper proposes a linguistically motivated taxonomy for English questions and an effective approach to question matching that combines deep learning models for question representation with general taxonomy-based features. Experiments performed on short datasets demonstrate the effectiveness of the proposed approach, as better matching classification was observed when coupling the standard distributional features with knowledge-based methods.

#904: Semantic Role Labeling of Speech Transcripts without Sentence Boundaries

Niraj Shrestha, Marie-Francine Moens

Speech data is an extremely rich and important source of information. However, we lack suitable methods for the semantic annotation of speech data. For instance, semantic role labeling (SRL) of speech that has been transcribed by an automated speech recognition (ASR) system is still an unsolved problem. SRL of ASR data is difficult and complex due to the absence of sentence boundaries, punctuation, grammar errors, words that are wrongly transcribed, and word deletions and insertions. In this paper we propose a novel approach to SRL of ASR data based on the following idea: (1) train the SRL system on data segmented into frames, where each frame consists of a predicate and its semantic roles without considering sentence boundaries; (2) label it with the semantics of PropBank roles; and to assist the above (3) train a part-of-speech (POS) tagger to work on noisy and error prone ASR data. Experiments with the OntoNotes corpus show improvements compared to the state-of-the-art SRL applied on ASR data.

#874: Sentiment Attitudes and Their Extraction from Analytical Texts

N.L. Rusnachenko, and Natalia Loukachevitch

In this paper, we study the task of extracting sentiment attitudes from analytical texts. We experiment with the RuSentRel corpus containing annotated Russian analytical texts in the sphere of international relations. Each document in the corpus is annotated with sentiments from the author towards mentioned named entities, and with attitudes between mentioned entities. We consider the problem of extracting sentiment relations between entities in whole documents as a three-class machine learning task.

#877: Subtext Word Accuracy and Prosodic Features for Automatic Intelligibility Assessment

Tino Haderlein, Anne Schützenberger, Michael Doellinger, Elmar Nöth

Speech intelligibility in voice rehabilitation can successfully be evaluated by automatic prosodic analysis. In this paper, we examine the influence of reading errors and of the selection of certain words (nouns only, nouns and verbs, the beginning of each sentence, the beginnings of sentences and subclauses) on the computation of word accuracy (WA) and prosodic features. 73 hoarse patients read the German version of the text "The North Wind and the Sun". Their intelligibility was evaluated perceptually by 5 trained experts according to a 5-point scale. Combining prosodic features and WA by Support Vector Regression showed human-machine correlations of up to r = 0.86. However, the correlations drop for files with few reading errors; this can largely be evened out by adjusting the feature set. WA should be computed on the whole text, but for some prosodic features, a subset of words may be sufficient.
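
For reference, the word accuracy (WA) used here is the usual ASR-style measure, WA = (N - S - D - I) / N, which can be computed from a word-level edit distance between the read text and the reference. A minimal sketch:

```python
def word_accuracy(reference, hypothesis):
    """WA = (N - S - D - I) / N via word-level edit distance,
    where N is the number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return (len(r) - d[len(r)][len(h)]) / len(r)
```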

#917: The Influence of Errors in Phonetic Annotations on Performance of Speech Recognition System

R. Šafařík, L. Matějů, L. Weingartova

This paper deals with errors in acoustic training data and their influence on speech recognition performance. The training data can be prepared manually, automatically or by a combination of the two. In all cases, some mislabeled phonemes can appear in the phonetic annotations. We conducted a series of experiments simulating some common errors. The experiments deal with varying amounts of changes in the phonetic annotations, such as different types of changes in the voicing of obstruents, random substitution of consonants or vowels, and random deletion of phonemes. All experiments were done for the Czech language using the GlobalPhone speech data set, and both Gaussian mixture models and deep neural networks were used for acoustic modeling. The results show that a certain amount of such errors in the training data does not influence speech recognition accuracy; the accuracy is significantly influenced only by a large amount of errors (more than 50%).

#894: The Retention Effect of Learning Grammatical Patterns Implicitly Using Joining-in-Type Robot-Assisted Language-Learning System

AlBara Khalifa, Tsuneo Kato, Seiichi Yamamoto

Conducting a multiparty conversation among two robots and a human learner for the purpose of language learning is a novel idea. It can help convey grammatical information to the human learner in an implicit manner. The main focus of this paper is the quantification of the level of retention of what was learned implicitly over a period of four weeks. We evaluated the utterances of the human learners on the level of n-gram similarity with a reference answer, and on the basis of grammatically correct use. The experiments revealed the effect of repetition in implicit learning on the correct use of grammatical patterns.
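
The n-gram similarity scoring of learner utterances against a reference answer can be sketched as an average n-gram overlap. The function below is a hypothetical stand-in for illustration, not the authors' exact metric.

```python
def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_similarity(answer, reference, max_n=3):
    """Average n-gram overlap (precision-like) between a learner's
    utterance and the reference answer, for n = 1..max_n."""
    a, r = answer.split(), reference.split()
    scores = []
    for n in range(1, max_n + 1):
        a_ngrams, r_ngrams = ngrams(a, n), set(ngrams(r, n))
        if not a_ngrams:
            continue
        scores.append(sum(g in r_ngrams for g in a_ngrams) / len(a_ngrams))
    return sum(scores) / len(scores) if scores else 0.0
```

An identical utterance scores 1.0, while partial matches are discounted more heavily at higher n, rewarding longer correct grammatical patterns.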

#912: Towards a French Smart-Home Voice Command Corpus: Design and NLU Experiments

Thierry Desot, Stefania Raimondo, Anastasia Mishakova, François Portet and Michel Vacher

Despite growing interest in smart homes, semantically annotated large voice command corpora for Natural Language Understanding (NLU) development are scarce, especially for languages other than English. In this paper, we present an approach to generating customizable synthetic corpora of semantically annotated French commands for a smart home. This corpus was used to train three NLU models -- a triangular CRF, an attention-based RNN and the Rasa framework -- evaluated using a small corpus of real users interacting with a smart home. While the attention model performs best on another large French dataset, on the small smart-home corpus the models' performance varies across intent, slot and slot-value classification. To the best of our knowledge, no other French corpus of semantically annotated voice commands is currently publicly available.

#913: Using Anomaly Detection for Fine Tuning of Formal Prosodic Structures in Speech Synthesis

Martin Matura, Markéta Jůzová

Consistent prosody description of speech corpora is a fundamental requirement for high-quality speech synthesis generated by current TTS systems. In this preliminary study, we use a One-class SVM anomaly detection approach to predict formal prosodic structure outliers (a prosodic mismatch) in recorded utterances that can negatively influence the overall quality of synthesized speech, especially in unit selection. To evaluate the outcome of our detection system, we performed a listening test with encouraging results.

#907: Voice Control in a Real Flight Deck Environment

M. Trzos, M. Dostál, P. Macháčková, and J. Eitlerová

In this paper, we present a methodology for implementing multimodal voice-controlled systems by means of automatic speech recognition. The real flight deck environment brings many challenges, such as high accuracy requirements, high-noise conditions, non-native English-speaking users, and limited hardware and software resources. We present the design of an automatic speech recognition system based on the freely available AMI Meeting Corpus and a proprietary corpus provided by Airbus. Then we describe how we trained and evaluated the speech recognition models in a simulated environment using an anechoic chamber laboratory. The tuned speech recognition models were tested in a real flight environment on two Honeywell experimental airplanes: a Dassault Falcon 900 and a Boeing 757.

#932: WaveNet-Based Speech Synthesis Applied to Czech: A Comparison with the Traditional Synthesis Methods

Zdeněk Hanzlíček, Jakub Vít, Daniel Tihelka

WaveNet is a recently developed deep neural network for generating high-quality synthetic speech; it directly produces raw audio samples. This paper describes the first application of WaveNet-based speech synthesis to the Czech language. We used the basic WaveNet architecture. The duration of particular phones and the fundamental frequency required for local conditioning were estimated by additional LSTM networks. We conducted a MUSHRA listening test to compare WaveNet with 2 traditional synthesis methods: unit selection and HMM-based synthesis. Experiments were performed on 4 large speech corpora. Though our implementation of WaveNet did not outperform the unit selection method as reported in other studies, there is still a lot of scope for improvement, while unit selection TTS has probably reached its quality limit.

#968: Automated Legal Research for German Law

Thejeswi Nagendra Kamatchi, Jelena Mitrović, Siegfried Handschuh

This demonstration is based on a system for performing automated legal research in civil law. The dataset organizes the legal text according to legal code, sections, paragraphs and sentence numbers. Relevant links connecting related laws are present. Supporting information such as POS tags, parse trees, synonyms and similar words (found using Wikipedia word embeddings) is used to enrich the dataset with features. The user can input one or more simple sentences describing the case, according to which the legal case is classified to a specific part of the law. Interactive fact collection is then performed. Once enough facts are collected and particular legal texts can be matched with sufficient confidence, judgment prediction is performed. All the collected facts, the matching legal text with justification, and the predictions are compiled into a report for the user. Future work on this system will include an argument mining system based on rhetorical relations and figures in legal text.

#969: Computer model of the Tibetan language morphology

Aleksei Dobrov, Anastasia Dobrova, Pavel Grokhovskiy, Maria Smirnova, Nikolay Soms

This research describes the development of a computer model of Tibetan morphology which can be used to explain the phenomena of positional morpheme interchange in the Tibetan language. The work included the following main stages: development of a faceted classification of the observed interchanges according to the type of variation, the types of initials and finals of morphemes, and other possible factors; development of an object-oriented model reflecting the created classification and automating the application of the observed gradation rules; and development of a system of automatic regression testing of the model, which makes it possible to guarantee its compliance with the linguistic material. The created computer model of Tibetan morphology was evaluated using the regression testing system, which ensures that the model conforms to the observed morphological phenomena.

#970: SMACC - text analyzer for legal assistance

Basile Audard, Elena Manishina, Joao Pedro Campello

In our daily life we regularly come across various contracts and agreements: a real estate lease, an electricity or mobile phone plan, an insurance contract, general conditions of sale, etc. These contracts, in paper or online versions, may contain thousands of lines; reading and understanding them is a real challenge for many people, the language of legal-domain texts being notoriously hard to digest for non-professionals. The SMACC (Smart Contract Checker) textual analyzer was developed to address those issues. SMACC is a tool that offers legal assistance to users who are faced with a binding contract and its consequent obligations and who would like to get a better overview of the document at hand, as well as an idea of its legality vis-à-vis the existing legislation in the specific legal domain.

#971: Text Embeddings Based on Synonyms

Magdalena Wiercioch

Searching for a text representation is one of the main tasks in the information retrieval domain. The appropriate model has an impact on sentiment analysis, also known as opinion mining. Take, for instance, book review sentiment studies: the goal is to assess people's opinions or emotions towards a book. Obviously, this may be applied in various fields, such as recommendation systems. However, the quality of the text representation affects the performance of such tasks.

#972: An interface between the Czech valency lexicon PDT-Vallex and corpus manager KonText

Kira Droganova, Eva Fučíková, Anša Vernerová

We present a user interface between the Czech valency lexicon, PDT-Vallex, and KonText -- a web application for querying corpora available within the LINDAT/CLARIN project.

#973: Korpusomat: a quick way to create searchable, annotated corpora

Witold Kieraś, Łukasz Kobyliński, Maciej Ogrodniczuk, Zbigniew Gawłowicz, Michał Wasiluk

Korpusomat is a web application aimed at building annotated corpora of Polish for the purpose of corpus-linguistic studies. Korpusomat combines existing tools, such as a morphological analyser, a tagger and a corpus search engine, and provides an easy-to-use environment for building corpora from almost any text, including texts in binary formats.

#974: Sketch Engine: all-new user experience and richer API

Vít Baisa, Tomáš Svoboda, Jan Michelfeit, Miloš Jakubíček, Vojtěch Kovář, Ondřej Matuška

Recently we introduced substantial changes to the corpus manager Sketch Engine. The goal was to make the language technology disappear and to provide a smooth user experience for all: from teachers, translators and terminologists to linguists and programmers. During the demo you will have the opportunity to try the completely new user interface, with all previous features preserved. The frontend is now more separated from the backend, which means that the backend API is richer too (creating corpora, uploading files, authentication, terminology extraction, ...).
