TSD 2022

Spectrograms provide a visual representation of the time-frequency variations of a speech signal. Furthermore, the color scales can be used as a pre-processing normalization step. In this study, we investigated the suitability of using different color scales for the reconstruction of spectrograms together with bottleneck features extracted from Convolutional AutoEncoders (CAEs). We trained several CAEs considering different parameters such as the number of channels, wideband/narrowband spectrograms, and different color scales. Additionally, we tested the suitability of the proposed CAE architecture for the prediction of the severity of Parkinson’s Disease (PD) and for the nasality level in children with Cleft Lip and Palate (CLP). The results showed that it is possible to estimate the neurological state for PD with Spearman's correlations of up to 0.71 using the Grayscale, and the nasality level in CLP with F-scores of up to 0.58 using the raw spectrogram. Although the color scales improved performance in some cases, it is not clear which color scale is the most suitable for the selected application, as we did not find significant differences in the results for each color scale.

#1116: A Novel Hybrid Framework to Enhance Zero-shot Classification

Yanan Chen and Yang Liu

As manually labelling data can be error-prone and labour-intensive, some recent studies automatically classify documents without any training on labelled data and directly exploit pre-trained language models (PLMs) for many downstream tasks, an approach known as zero-shot text classification. In the same vein, we propose a novel framework that aims at improving zero-shot learning and enriching the domain-specific information required by PLMs with transformer models. To unleash the power of PLMs pre-trained on massive cross-section corpora, the framework unifies two transformers for different purposes: 1) expanding categorical labels required by PLMs by creating coherent representative samples with GPT2, which is a language model acclaimed for generating sensible text outputs, and 2) augmenting documents with T5, which has the virtue of synthesizing high-quality new samples similar to the original text. The proposed framework can be easily integrated into different general testbeds. Extensive experiments on two popular topic classification datasets have proved its effectiveness.

#1109: A Self-Evaluating Architecture for Describing Data

George A. Wright and Matthew Purver

This paper introduces Linguoplotter, a workspace-based architecture for generating short natural language descriptions. All processes within Linguoplotter are carried out by codelets, small pieces of code each responsible for making incremental changes to the program's state, the idea of which is borrowed from Hofstadter et al. Codelets in Linguoplotter gradually transform a representation of temperatures on a map into a description which can be output. Many processes emerge in the program out of the actions of many codelets, including language generation, self-evaluation, and higher-level decisions such as when to stop a given process, and when to end all processing and publish a final text. The program outputs a piece of text along with a satisfaction score indicating how good the program judges the text to be. The iteration of the program described in this paper is capable of linguistically more diverse outputs than a previous version; human judges rate the outputs of this version more highly than those of the last; and there is some correlation between rankings by human judges and the program's own satisfaction score. However, the program still publishes disappointingly short and simple texts (despite being capable of longer, more complete descriptions). This paper describes: the workings of the program; a recent evaluation of its performance; and possible improvements for a future iteration.

#1112: ANTILLES: An Open French Linguistically Enriched Part-of-Speech Corpus

Yanis Labrak and Richard Dufour

Part-of-speech (POS) tagging is a classical natural language processing (NLP) task. Although many tools and corpora have been proposed, especially for the most widely spoken languages, these suffer from limitations concerning their user license, the size of their tagset, or their reliance on approaches that are no longer state of the art. In this article, we propose ANTILLES, an extended version of an existing French corpus (UD French-GSD) comprising an original set of labels obtained with the aid of morphological characteristics (gender, number, tense, etc.). This extended version includes a set of 65 labels, against 16 in the initial version. We also implemented several POS tools for French from this corpus, incorporating the latest state-of-the-art advances in this area. The corpus as well as the POS labeling tools are fully open and freely available.

#1177: Attention-based Model for Accurate Stance Detection

Omama Hamad, Ali Hamdi and Khaled Shaban

Effective representation learning is an essential building block for achieving many natural language processing tasks such as stance detection as performed implicitly by humans. Stance detection can assist in understanding how individuals react to certain information by revealing the user’s stance on a particular topic. In this work, we propose a new attention-based model for learning feature representations and show its effectiveness in the task of stance detection. The proposed model is based on transfer learning and multi-head attention mechanisms. Specifically, we use BERT and word2vec models to learn text representation vectors from the data and pass both of them simultaneously to the multi-head attention layer to help focus on the best learning features. We present five variations of the model, each with a different combination of BERT and word2vec embeddings for the query and value parameters of the attention layer. The performance of the proposed model is evaluated against multiple baseline and state-of-the-art models. The best of the five proposed variations of the model improved the accuracy on average by 0.4% and achieved 68.4% accuracy for multi-classification, while the best accuracy for binary classification is 86.1% with a 1.3% improvement.

#1113: Autoblog 2021: The Importance of Language Models for Spontaneous Lecture Speech

Abner Hernandez, Philipp Klumpp, Badhan Das, Andreas Maier, and Seung Hee Yang

The demand for both quantity and quality of online educational resources has skyrocketed during the pandemic of the last two years. Entire course series have since been recorded and distributed online. To reach a broader audience, videos could be transcribed, combined with supplementary material (e.g. lecture slides) and published in the style of blog posts. This had been done previously for Autoblog 2020, a corpus of lecture recordings that had been converted to blog posts, using automated speech recognition (ASR) for subtitle creation. This work aims to introduce a second series of recorded and manually transcribed lecture videos. The corresponding data includes lecture videos, slides, and blog posts / transcripts with aligned slide images and is published under a Creative Commons license. A state-of-the-art Wav2Vec ASR model was used for automatic transcription of the content, using different n-gram language models (LM). The results were compared to the human ground truth annotation. Findings indicated that the ASR performed well on spontaneous lecture speech. Furthermore, LMs trained on large amounts of data with fewer out-of-vocabulary words were outperformed by much smaller LMs estimated over in-domain language. Annotated lecture recordings were deemed helpful for the creation of task-specific ASR solutions as well as their validation against a human ground truth.

#1136: Automatic Grammar Correction of Commas in Czech Written Texts: Comparative Study

Jakub Machura, Adam Frémund and Jan Švec

The task of grammatical error correction is a widely studied field of natural language processing where the traditional rule-based approaches compete with the machine learning methods. The rule-based approach benefits mainly from a wide knowledge base available for a given language. In contrast, the transfer learning methods, and especially pre-trained Transformers, can be trained on a huge number of texts in a given language. In this paper, we focus on the task of automatic correction of missing commas in Czech written texts and we compare the rule-based approach with the Transformer-based model trained for this task.

#1174: BERT-based Classifiers for Fake News Detection on Short and Long Texts with Noisy Data: a Comparative Analysis

Elena Shushkevich, Mikhail Alexandrov and John Cardiff

Free uncontrolled access to the Internet is the main reason for fake news propagation on the Internet both in social media and in regular Internet publications. In this paper we study the potential of several BERT-based models to detect fake news related to politics. Our contribution to the area consists of testing BERT, RoBERTa and MNLI RoBERTa models with (a) short and long texts; (b) ensembling with the best models; (c) noisy texts. To improve ensembling, we introduce an additional class ‘Doubtful news’. To create noisy data we use cross-translation. For the experiments we consider the well-known FRN (Fake vs. Real News, long texts) and LIAR (short texts) datasets. The results we obtained on the long texts dataset are higher than the results we obtained on the short texts dataset. The proposed approach to ensembling provided significant improvement of the results. The experiments with noisy data demonstrated high noise immunity of the BERT model with long news and the RoBERTa model with short news.

#1131: Balancing the Style-Content Trade-Off in Sentiment Transfer Using Polarity-Aware Denoising

Sourabrata Mukherjee, Zdeněk Kasner and Ondřej Dušek

Text sentiment transfer aims to flip the sentiment polarity of a sentence (positive to negative or vice versa) while preserving its sentiment-independent content. Although current models show good results at changing the sentiment, content preservation in transferred sentences is insufficient. In this paper, we present a sentiment transfer model based on polarity-aware denoising, which accurately controls the sentiment attributes in generated text, preserving the content to a great extent and helping to balance the style-content trade-off. Our proposed model is structured around two key stages in the sentiment transfer process: better representation learning using a shared encoder and sentiment-controlled generation using separate sentiment-specific decoders. Empirical results show that our method outperforms state-of-the-art baselines in terms of content preservation while staying competitive in terms of style transfer accuracy and fluency. Source code, data, and all other related details are available on GitHub (https://github.com/SOURO/polarity-denoising-sentiment-transfer).

#1183: Can a Machine Generate a Meta-Review? How far are we?

Prabhat Kumar Bharti, Asheesh Kumar, Tirthankar Ghosal, Mayank Agrawal, and Asif Ekbal

A meta-review, usually written by the editor of a journal or the area/program chair of a conference, is a summary of the peer reviews and a concise interpretation of the editor's/chair's decision. Although the task closely resembles a multi-document summarization problem, automatically writing reviews on top of human-generated reviews remains largely unexplored. In this paper, we investigate how current state-of-the-art summarization techniques fare on this problem. We present a qualitative and quantitative evaluation of four radically different summarization approaches on this problem. We explore how the summarization models perform on preserving aspects and sentiments in original peer reviews and meta-reviews. Finally, we conclude with our observations on why the task is challenging, how it differs from simple summarization, and how one should approach the design of a meta-review generation model. We provide a link to our Git repository, https://github.com/PrabhatkrBharti/MetaGen.git, to enable readers to replicate the findings.

#1155: Computational Approaches for Understanding Semantic Constraints on Two-termed Coordination Structures

Julie Kallini and Christiane Fellbaum

Coordination is a linguistic phenomenon where two or more terms or phrases, called conjuncts, are conjoined by a coordinating conjunction, such as and, or, or but. Well-formed coordination structures seem to require that the conjuncts are semantically similar or related. In this paper, we utilize English corpus data to examine the semantic constraints on syntactically like coordinations, which link constituents with the same lexical or syntactic categories. We examine the extent to which these semantic constraints depend on the type of conjunction or on the lexical or syntactic category of the conjuncts. We employ two distinct, independent metrics to measure the semantic similarity of conjuncts: WordNet relations and semantic word embeddings. Our results indicate that both measures of similarity have varying distributions depending on the particular conjunction and the conjuncts' lexical or syntactic categories.
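The embedding-based similarity measure mentioned above can be illustrated with a minimal sketch. It assumes cosine similarity between conjunct vectors, which is a common choice but is not confirmed by the abstract; the function name and any vectors passed to it are hypothetical, not the authors' code or data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Under this assumption, the similarity of a coordination such as "cats and dogs" would be the score of the two conjuncts' vectors, with values near 1 indicating closely related conjuncts.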

#1111: DaFNeGE: Dataset of French Newsletters with Graph Representation and Embedding

Alexis Blandin, Farida Saïd, Jeanne Villaneau and Pierre-François Marteau

Natural language resources are essential for integrating linguistic engineering components into information processing suites. However, the resources available in French are scarce and do not cover all possible tasks, especially for specific business applications. In this context, we present a dataset of French newsletters and their use to predict their impact, good or bad, on readers. We propose an original representation of newsletters in the form of graphs that take into account the layout of the newsletters. We then evaluate the interest of such a representation in predicting a newsletter's performance in terms of open and click rates using graph convolution network models.

#1104: Detection of Prosodic Boundaries in Speech Using Wav2Vec 2.0

Marie Kunešová and Markéta Řezáčková

Prosodic boundaries in speech are of great relevance to both speech synthesis and audio annotation. In this paper, we apply the wav2vec 2.0 framework to the task of detecting these boundaries in the speech signal, using only acoustic information. We test the approach on a set of recordings of Czech broadcast news, labeled by phonetic experts, and compare it to an existing text-based predictor, which uses the transcripts of the same data. Despite using a relatively small amount of labeled data, the wav2vec2 model achieves an accuracy of 94% and an F1 measure of 83% on within-sentence prosodic boundaries (or 95% and 89% on all prosodic boundaries), outperforming the text-based approach. However, by combining the outputs of the two different models we can improve the results even further.

#1181: Empathy and Persona of English vs. Arabic Chatbots: A Survey and Future Directions

Omama Hamad, Ali Hamdi and Khaled Shaban

There is a high demand for chatbots across a wide range of sectors. Human-like chatbots engage meaningfully in dialogues while interpreting and expressing emotions and being consistent through understanding the user's personality. Though substantial progress has been achieved in developing empathetic chatbots for English, work on Arabic chatbots is still in its early stages due to various challenges associated with the language constructs and dialects. This survey reviews recent literature on approaches to empathetic response generation, persona modelling and datasets for developing chatbots in the English language. In addition, it presents the challenges of applying these approaches to Arabic and outlines some solutions. We focus on open-domain chatbots developed as end-to-end generative systems due to their capabilities to learn and infer language and emotions. Accordingly, we formulate four open problems pertaining to gaps in Arabic and English work; namely, (1) feature representation learning based on multiple dialects; (2) modelling the various facets of a persona and emotions; (3) datasets; and (4) evaluation metrics.

#1117: End-to-End Parkinson's Disease Detection Using a Deep Convolutional Recurrent Network

Cristian David Rios-Urrego, Santiago Andres Moreno-Acevedo, Elmar Nöth and Juan Rafael Orozco-Arroyave

Deep Learning (DL) has enabled the development of accurate computational models to evaluate and monitor the neurological state of different disorders including Parkinson’s Disease (PD). Although researchers have used different DL architectures including Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM) units, fully connected networks, combinations of them, and others, few works have correctly analyzed and optimized the input size of the network and how the network processes the information. This study proposes the classification of patients suffering from PD vs. healthy subjects using a 1D CNN followed by an LSTM. We show how the network behaves when its input and the kernel size in different layers are modified. In addition, we evaluate how the network discriminates between PD patients and healthy controls based on several speech tasks. The fusion of tasks yielded the best results in the classification experiments and showed promising results when classifying patients in different stages of the disease, which suggests the introduced approach is suitable to monitor the disease progression.

#1097: Evaluating Attribution Methods for Explainable NLP with Transformers

Vojtěch Bartička, Ondřej Pražák, Miloslav Konopík and Jakub Sido

This paper describes the experimental evaluation of several attribution methods on two NLP tasks: sentiment analysis and multi-label document classification. Our motivation is to find the best method to use with Transformers to interpret model decisions. For this purpose, we introduce two new evaluation datasets. The first one is derived from Stanford Sentiment Treebank, where the sentiment of individual words is annotated along with the sentiment of the whole sentence. The second dataset comes from Czech Text Document Corpus, where we added keyword information assigned to each category. The keywords were manually assigned to each document and automatically propagated to categories via pointwise mutual information (PMI). We evaluate each attribution method on several models of different sizes. The evaluation results are reasonably consistent across all models and both datasets. This indicates that both datasets with the proposed evaluation metrics are suitable for interpretability evaluation. We show how the attribution methods behave with respect to model size and task. We also consider practical applications -- we show that while some methods perform well, they can be replaced with slightly worse-performing methods requiring significantly less time to compute.
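The PMI-based propagation of keywords to categories can be sketched as follows. This is only an illustration under stated assumptions: probabilities are estimated from document counts, each document carries a set of categories and a set of keywords, and the function name and data layout are hypothetical. The abstract does not describe the authors' actual estimation procedure or any thresholding.

```python
import math
from collections import Counter


def keyword_category_pmi(docs):
    """Return PMI(keyword, category) for every co-occurring pair.

    docs is a list of (categories, keywords) pairs, each element a set of
    strings. Probabilities are estimated as document frequencies (assumption).
    """
    n = len(docs)
    kw_count, cat_count, joint = Counter(), Counter(), Counter()
    for categories, keywords in docs:
        kw_count.update(keywords)
        cat_count.update(categories)
        joint.update((k, c) for k in keywords for c in categories)
    # PMI = log2( p(k, c) / (p(k) * p(c)) ) with p(x) = count(x) / n
    return {
        (k, c): math.log2(joint[k, c] * n / (kw_count[k] * cat_count[c]))
        for (k, c) in joint
    }
```

A positive score means the keyword appears with the category more often than independence would predict, so one plausible propagation rule is to attach a keyword to the categories where its score is highest.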

#1139: Evaluation of Wav2Vec Speech Recognition for Speakers with Cognitive Disorders

Jan Švec, Filip Polák, Aleš Bartoš, Michaela Zapletalová and Martin Víta

In this paper, we present a spoken dialog system used for collecting data for future research in the field of dementia prediction from speech. The dialog system was used to collect the speech data of patients with mild cognitive deficits. The core task solved by the dialog system was the spoken description of the vivid shore picture for one minute. The patients also performed other simple speech-based tasks. All utterances were recorded and manually transcribed to obtain a ground-truth reference. We describe the architecture of the dialog system as well as the results of the first speech recognition experiments. The zero-shot Wav2Vec 2.0 speech recognizer was used, and the recognition accuracy was evaluated at the word and character level.

#1140: Exploration of Multi-Corpus Learning for Hate Speech Classification in Low Resource Scenarios

Ashwin Geet D'Sa, Irina Illina, Dominique Fohr and Awais Akbar

The dramatic increase in social media has given rise to the problem of online hate speech. Deep neural network-based classifiers have become the state-of-the-art for automatic hate speech classification. The performance of these classifiers depends on the amount of available labelled training data. However, most hate speech corpora have a small number of hate speech samples. In this article, we aim to jointly use multiple hate speech corpora to improve hate speech classification performance in low-resource scenarios. We harness different hate speech corpora in a multi-task learning setup by associating one task to one corpus. This multi-corpus learning scheme is expected to improve the generalization, the latent representations, and domain adaptation of the model. Our work evaluates multi-corpus learning for hate speech classification and domain adaptation. We show significant improvements in classification and domain adaptation in low-resource scenarios.

#1098: Federated Learning in Heterogeneous Data Settings for Virtual Assistants -- A Case Study

Paweł Pardela, Anna Fajfer, Mateusz Góra and Artur Janicki

Due to recent increased interest in data privacy, it is important to consider how personal virtual assistants (VA) handle data. The established design of VAs makes data sharing mandatory. Federated learning (FL) appears to be the most suitable way of increasing the privacy of data processed by VAs, as in FL, models are trained directly on users' devices, without sending the data to a centralized server. However, VAs operate in a heterogeneous environment -- they are installed on various devices and acquire various quantities of data. In our work, we check how FL performs in such heterogeneous settings. We compare the performance of several optimizers for data of various levels of heterogeneity and various percentages of stragglers. As an FL algorithm, we used FedProx, proposed by Sahu et al. in 2018. For a test database, we use a publicly available Leyzer corpus, dedicated to VA-related experiments. We show that skewed quantity and label distributions affect the quality of VA models trained to solve intent classification problems. We conclude by showing that a carefully selected local optimizer can successfully mitigate this effect, yielding 99% accuracy for the ADAM and RMSProp optimizers even for highly skewed distributions and a high share of stragglers.
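The core idea of FedProx, a proximal term that keeps each device's model close to the current global model, can be sketched as a single local update. This is a minimal illustration using plain gradient descent on flat weight lists; the function name, the learning rate and the value of mu are hypothetical, and the study itself compares other local optimizers such as ADAM and RMSProp.

```python
def fedprox_local_step(weights, global_weights, grads, lr=0.1, mu=0.01):
    """One gradient step on local_loss(w) + (mu / 2) * ||w - w_global||^2.

    grads holds the gradient of the local loss at `weights`; the proximal
    term contributes the extra gradient mu * (w - w_global).
    """
    return [
        w - lr * (g + mu * (w - wg))
        for w, wg, g in zip(weights, global_weights, grads)
    ]
```

With mu set to 0 the step reduces to ordinary local gradient descent, while larger values pull the local model more strongly toward the global one, which limits drift on devices with skewed data.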

#1179: Fine-Tuning BERT for Generative Dialogue Domain Adaptation

Tiziano Labruna and Bernardo Magnini

Current data-driven Dialogue State Tracking (DST) models exhibit a poor capacity to adapt themselves to domain changes, resulting in a significant degradation in performance. We propose a methodology, called Generative Dialogue Domain Adaptation, which significantly simplifies the creation of training data when a number of changes (e.g., new slot-values or new instances) occur in a domain Knowledge Base. We start from dialogues for a source domain and apply generative methods based on language models such as BERT, fine-tuned on task-related data, and generate slot-value substitutions for a target domain. We have experimented with dialogue domain adaptation in a few-shot setting, showing promising results, although the task is still very challenging. We provide a deep analysis of the quality of the generated data and of the features that affect this task, and we emphasise that DST models are very sensitive to the distribution of slot-values in the corpus.

#1158: Going Beyond the Cookie Theft Picture Test: Detecting Cognitive Impairments using Acoustic Features

Franziska Braun, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Korbinian Riedhammer and Sebastian P. Bayerl

Standardized tests play a crucial role in the detection of cognitive impairment. Previous work demonstrated that automatic detection of cognitive impairment is possible using audio data from a standardized picture description task. The presented study goes beyond that, evaluating our methods on data taken from two standardized neuropsychological tests, namely the German SKT and a German version of the CERAD-NB, and a semi-structured clinical interview between a patient and a psychologist. For the tests, we focus on speech recordings of three sub-tests: reading numbers (SKT 3), interference (SKT 7), and verbal fluency (CERAD-NB 1). We show that acoustic features from standardized tests can be used to reliably discriminate cognitively impaired individuals from non-impaired ones. Furthermore, we provide evidence that even features extracted from random speech samples of the interview can be a discriminator of cognitive impairment. In our baseline experiments, we use OpenSMILE features and Support Vector Machine classifiers. In an improved setup, we show that using wav2vec 2.0 features instead, we can achieve an accuracy of up to 85%.

#1182: Identification of Metaphorical Collocations in Different Languages -- Similarities and Differences

Lucia Nacinovic Prskalo and Marija Brkic Bakaric

Metaphorical collocations are a subset of collocations in which a semantic shift has occurred in one of the components. The main goal of this paper is to describe the process of identifying metaphorical collocations in different languages -- English, German and Croatian. Approaches to annotating metaphorical collocations from a list of word sketches for the three languages are presented using one of the most common nouns for all three languages -- "year" for English, "Jahr" (Engl. year) for German, and "godina" (Engl. year) for Croatian. The compilation of a list of relevant grammatical relations in the identification of metaphorical collocations for each language is also described. Finally, the procedures for automatic classification of metaphorical collocations for Croatian, German and English are performed and compared.

#1119: Investigating Paraphrasing-based Data Augmentation for Task-Oriented Dialogue Systems

Liane Vogel and Lucie Flek

With synthetic data generation, the required amount of human-generated training data can be reduced significantly. In this work, we explore the usage of automatic paraphrasing models such as GPT-2 and CVAE to augment template phrases for task-oriented dialogue systems while preserving the slots. Additionally, we systematically analyze how far manually annotated training data can be reduced. We extrinsically evaluate the performance of a natural language understanding system on augmented data on various levels of data availability, reducing manually written templates by up to 75 percent while preserving the same level of accuracy. We further point out that the typical NLG quality metrics such as BLEU or utterance similarity are not suitable to assess the intrinsic quality of NLU paraphrases, and that public task-oriented NLU datasets such as ATIS and SNIPS have severe limitations.

#1118: Lexical Bundle Variation in Business Actors' Public Communications

Brett Drury and Samuel Morais Drury

Business Actors communicate to audiences via the mass media through public statements or informal interviews with journalists. This information is directly quoted in news stories about financially significant events. The motivation for speaking to the mass media varies from job role to job role, and therefore the vocabulary of a job role and the delivery of the information to the press vary as well. This paper provides a comprehensive analysis using lexical bundles and sentimental lexical bundles to discover the common vocabulary of four selected job roles: Analyst, CEO, CFO and Economist, and their similarity with other job roles. This work demonstrates that the CEO job role makes ample use of highly positive repetitive lexical bundles, whereas the Economist holds a unique role, with a vocabulary that has less of a positive skew and few lexical bundles shared with other job roles.

#1096: Lexicon-based vs. Lexicon-free ASR for Norwegian Parliament Speech Transcription

Jan Nouza, Petr Červa and Jindřich Žďánský

Norwegian is a challenging language for automatic speech recognition research because it has two written standards (Bokmål and Nynorsk) and a large number of distinct dialects, none of which has the status of an official spoken norm. A traditional lexicon-based approach to ASR leads to a huge lexicon (because of the two standards and also due to compound words) with many spelling and pronunciation variants, and consequently to a large (and sparse) language model (LM). We have built a system with a 601k-word lexicon and an acoustic model (AM) based on several types of neural networks and compare its performance with a lexicon-free end-to-end system developed in the ESPnet framework. For evaluation we use a publicly available dataset of Norwegian parliament speeches that offers 100 hours for training and 12 hours for testing. In spite of this rather limited training resource, the lexicon-free approach yields significantly better results (13.0% word-error rate) compared to the best system with the lexicon, LM and neural network AM (that achieved 22.5% WER).
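For readers unfamiliar with the metric, the word-error rate (WER) quoted above is conventionally the word-level edit distance between the recognizer output and the reference transcript, divided by the number of reference words. The sketch below shows that standard definition with whitespace tokenization; it is not the scoring tool used in the paper, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, a five-word reference recognized with one substitution and one deletion gives a WER of 2/5, i.e. 40%. Because insertions are counted, WER can exceed 100% for very poor output.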

#1108: Linear Transformations for Cross-lingual Sentiment Analysis

Pavel Přibáň, Jakub Šmíd, Adam Mištera and Pavel Král

This paper deals with cross-lingual sentiment analysis for the Czech, English and French languages. We perform zero-shot cross-lingual classification using five linear transformations combined with LSTM- and CNN-based classifiers. We compare the performance of the individual transformations and, in addition, we compare the transformation-based approach with existing state-of-the-art BERT-like models. We show that the pre-trained embeddings from the target domain are crucial to improving the cross-lingual classification results, unlike in the monolingual classification, where the effect is not so distinctive.

#1164: New Language Identification and Sentiment Analysis Modules for Social Media Communication

Radoslav Sabol and Aleš Horák

The style and vocabulary of social media communication, such as chats, discussions or comments, differ vastly from standard languages. Specifically in internal business communication, the texts contain large amounts of language mixins, professional jargon and occupational slang, or colloquial expressions. Standard natural language processing tools thus mostly fail to detect basic text processing attributes such as the prevalent language of a message or communication or their sentiment. In the presented paper, we describe the development and evaluation of new modules specifically designed for language identification and sentiment analysis of informal business communication inside a large international company. Besides the details of the module architectures, we offer a detailed comparison with other state-of-the-art tools for the same purpose and achieve an improvement of 10--13% in accuracy with selected problematic datasets.

#1133: OPTICS: Automatic MT Evaluation based on Optimal Transport by Integration of Contextual Representations and Static Word Embeddings

Hiroshi Echizen'ya, Kenji Araki and Eduard Hovy

Automatic MT metrics using word embeddings are extremely effective. Semantic word similarities are obtained using word embeddings. However, similarities using only static word embeddings are insufficient because they lack contextual information. Automatic metrics using fine-tuned models can adapt to a specific domain using contextual representations obtained by learning, but that adaptation requires large amounts of data to learn suitable models. As described herein, we propose an automatic MT metric based on optimal transport using both contextual representations and static word embeddings. The contextual representations are obtained by learning the neural models. In that case, our proposed metric requires no other data except source sentences and references, which correspond to the evaluation target hypotheses, to learn the models that are used to extract the contextual representations. Therefore, our proposed metric can adapt to the domain appropriately without requiring large amounts of learning data. Experimental results obtained using the WMT 20 metric shared task data indicated that correlations with human judgment using our proposed metric are higher than those using a metric based only on static word embeddings. Moreover, our proposed metric achieved state-of-the-art performance with system-level correlation and to-English segment-level correlation.

#1102: On Comparison of Phonetic Representations for Czech Neural Speech Synthesis

Jindřich Matoušek and Daniel Tihelka

In this paper, we investigate two research questions related to the phonetic representation of input text in Czech neural speech synthesis: 1) whether we can afford to reduce the phonetic alphabet, and 2) whether we can remove pauses from phonetic transcription and let the speech synthesis model predict the pause positions itself. In our experiments, three different modern speech synthesis models (FastSpeech 2 + Multi-band MelGAN, Glow-TTS + UnivNet, and VITS) were employed. We have found that the reduced phonetic alphabet outperforms the traditionally used full phonetic alphabet. On the other hand, removing pauses does not help. The presence of pauses (predicted by an external pause prediction tool) in phonetic transcription leads to a slightly better quality of synthetic speech.

#1170: On the Importance of Word Embedding in Automated Harmful Information Detection

Salar Mohtaj and Sebastian Möller

Social media have been growing rapidly in recent years. They have changed different aspects of human life, especially how people communicate and how they access information. However, along with the important benefits, social media have posed a number of significant challenges since they were introduced. The spreading of fake news and hate speech are among the most challenging issues, and they have attracted a lot of attention from researchers in recent years. Different models based on natural language processing have been developed to combat these phenomena and stop them in the early stages before mass spreading. Considering the difficulty of the task of automated harmful information detection (i.e., fake news and hate speech detection), every single step of the detection process could have a noticeable impact on the performance of models. In this paper, we study the importance of word embedding for the overall performance of a deep neural network architecture on the detection of fake news and hate speech on social media. We test various approaches for converting raw input text into vectors, from random weighting to state-of-the-art contextual word embedding models. In addition to comparing different word embedding approaches, we also analyze different strategies to get the vectors from contextual word embedding models (i.e., taking the weights from the last layer versus averaging the weights of the last layers). Our results show that XLNet embedding outperforms the other embedding approaches on both tasks related to harmful information identification.

#1172: Ontology-Aware Biomedical Relation Extraction

Ahmad Aghaebrahimian, Maria Anisimova and Manuel Gil

Automatically extracting relationships from biomedical texts among multiple sorts of entities is an essential task in biomedical natural language processing with numerous applications, such as drug development or repurposing, precision medicine, and other biomedical tasks requiring knowledge discovery. Current Relation Extraction systems mostly use one set of features, either as text, or more recently, as graph structures. The state-of-the-art systems often use resource-intensive hence slow algorithms and largely work for a particular type of relationship. However, a simple yet agile system that learns from different sets of features has the advantage of adaptability over different relationship types without an extra burden required for system re-design. We model RE as a classification task and propose a new multi-channel deep neural network designed to process textual and graph structures in separate input channels. We extend a Recurrent Neural Network with a Convolutional Neural Network to process three sets of features, namely, tokens, types, and graphs. We demonstrate that entity type and ontology graph structure provide better representations than simple token-based representations for Relation Extraction. We also experiment with various sources of knowledge, including data resources in the Unified Medical Language System to test our hypothesis. Extensive experiments on four well-studied biomedical benchmarks with different relationship types show that our system outperforms earlier ones. Thus, our system has state-of-the-art performance and allows processing millions of full-text scientific articles in a few days on one typical machine.

#1105: PoCaP Corpus: A Multimodal Dataset for Smart Operating Room Speech Assistant using Interventional Radiology Workflow Analysis

Kubilay Can Demir, Matthias May, Axel Schmid, Michael Uder, Katharina Breininger, Tobias Weise, Andreas Maier and Seung Hee Yang

This paper presents a new multimodal interventional radiology dataset, called PoCaP (Port Catheter Placement) Corpus. This corpus consists of speech and audio signals in German, X-ray images, and system commands collected from 31 PoCaP interventions by six surgeons with an average duration of 81.4 ± 41.0 minutes. The corpus aims to provide a resource for developing a smart speech assistant in operating rooms. In particular, it may be used to develop a speech-controlled system that enables surgeons to control operation parameters such as C-arm movements and table positions. In order to record the dataset, we obtained consent from the institutional review board and workers' council of the University Hospital Erlangen and from the patients for data privacy. We describe the recording set-up, data structure, workflow and preprocessing steps, and report the first PoCaP Corpus speech recognition analysis results with 11.52% word error rate using pretrained models. The findings suggest that the data has the potential to build a robust command recognition system and will allow the development of novel intervention support systems using speech and image processing in the medical domain.
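For readers unfamiliar with the word error rate reported above, here is a minimal sketch of the usual definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name and the English example command are made up for illustration and are not taken from the corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical command: one of four reference words is missing.
wer = word_error_rate("move the table up", "move table up")
```

Note that the value can exceed 1.0 when the hypothesis contains many inserted words.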

#1123: Quality Assessment of Subtitles -- Challenges and Strategies

Julia Brendel and Mihaela Vela

This paper describes a novel approach for assessing the quality of machine-translated subtitles. Although machine translation (MT) is widely used for subtitling, there is little research in this area in comparison to text translation. For our investigation, we use the English-to-German machine-translated subtitles from the SubCo corpus, a corpus consisting of human and machine-translated subtitles from English. In order to provide information about the quality of the machine-produced subtitles, error annotation and evaluation are performed manually. Both the applied error annotation and evaluation schemes cover the four dimensions content, language, format and semiotics, allowing for a fine-grained detection of errors and weaknesses of the MT engine. Besides the human assessment of the subtitles, our approach also comprises the measurement of the inter-annotator agreement (IAA) of the human error annotation and evaluation, as well as the estimation of post-editing effort. The combination of these three steps represents a novel evaluation method that finds its use in improving both the subtitling quality assessment process and the machine translation systems.

#1178: Review of Practices of Collecting and Annotating Texts in the Learner Corpus REALEC

Olga Vinogradova and Olga Lyashevskaya

REALEC, a learner corpus released in open access, had received 6,054 essays written in English by HSE undergraduate students in their English university-level examination by the year 2020. This paper reports on the data collection and manual annotation approaches for the texts of 2014-2019 and discusses the computer tools available for working with the corpus. This provides the basis for the ongoing development of automated annotation for the new portions of learner texts in the corpus. The observations in the first part concern the reliability of the total of 134,608 error tags manually annotated across the texts in the corpus. Some examples are given in the paper to emphasize the role of interference from the learners’ L1 (Russian), which is another direction for future corpus research. A number of studies carried out by the research team on the basis of the REALEC data are listed as examples of the research potential that the corpus provides.

#1115: Statistical and Neural Methods for Cross-lingual Entity Label Mapping in Knowledge Graphs

Gabriel Amaral, Mārcis Pinnis, Inguna Skadiņa, Odinaldo Rodrigues and Elena Simperl

Knowledge bases such as Wikidata amass vast amounts of named entity information, such as multilingual labels, which can be extremely useful for various multilingual and cross-lingual applications. However, such labels are not guaranteed to match across languages from an information consistency standpoint, greatly compromising their usefulness for fields such as machine translation. In this work, we investigate the application of word and sentence alignment techniques coupled with a matching algorithm to align cross-lingual entity labels extracted from Wikidata in 10 languages. Our results indicate that mapping between Wikidata's main labels stands to be considerably improved (up to 20 points in F1-score) by any of the employed methods. We show how methods relying on sentence embeddings outperform all others, even across different scripts. We believe the application of such techniques to measure the similarity of label pairs, coupled with a knowledge base rich in high-quality entity labels, to be an excellent asset to machine translation.

#1127: Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets

Lu Zeng, Sree Hari Krishnan Parthasarathi, Yuzong Liu, Alex Escott, Santosh Cheekatmalla, Nikko Strom, and Shiv Vitaladevuni

We propose a novel 2-stage sub 8-bit quantization aware training algorithm for all components of a 250K parameter feedforward, streaming, state-free keyword spotting model. For the first stage, we adapt a recently proposed quantization technique using a non-linear transformation with tanh(.) on dense layer weights. In the second stage, we use linear quantization methods on the rest of the network, including other parameters (bias, gain, batchnorm), inputs, and activations. We conduct large scale experiments, training on 26,000 hours of de-identified production, far-field and near-field audio data (evaluating on 4,000 hours of data). We organize our results in two embedded chipset settings: a) with commodity ARM NEON instruction set and 8-bit containers, we present accuracy, CPU, and memory results using sub 8-bit weights (4, 5, 8-bit) and 8-bit quantization of the rest of the network; b) with off-the-shelf neural network accelerators, for a range of weight bit widths (1 and 5-bit), while presenting accuracy results, we project reduction in memory utilization. In both configurations, our results show that the proposed algorithm can achieve: a) parity with a full floating point model's operating point on a detection error tradeoff (DET) curve in terms of false detection rate (FDR) at false rejection rate (FRR); b) significant reduction in compute and memory, yielding up to 3 times improvement in CPU consumption and more than 4 times improvement in memory consumption.
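To make the first stage more concrete, the sketch below shows one plausible way to combine a tanh(.) transformation with uniform quantization of weights. The exact formulation used in the paper is not given in the abstract, so the normalization by the largest |tanh(w)|, the mapping to [-1, 1], and the function names are assumptions of this example. It shows only the forward mapping, not the quantization aware training procedure.

```python
import math

def quantize_unit(x: float, bits: int) -> float:
    """Round x in [0, 1] to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return round(x * levels) / levels

def tanh_quantize(weights, bits):
    """Squash weights with tanh, rescale to [0, 1], quantize, map back to [-1, 1]."""
    squashed = [math.tanh(w) for w in weights]
    scale = max(abs(s) for s in squashed)
    if scale == 0.0:
        scale = 1.0  # all-zero weights: avoid division by zero
    return [2.0 * quantize_unit(s / (2.0 * scale) + 0.5, bits) - 1.0
            for s in squashed]

# Hypothetical dense-layer weights quantized to 4 bits (at most 16 distinct values).
example_weights = [-2.0, -0.3, 0.0, 0.4, 1.5]
quantized_4bit = tanh_quantize(example_weights, bits=4)
quantized_1bit = tanh_quantize(example_weights, bits=1)
```

The tanh step compresses large-magnitude weights before the uniform grid is applied, so more of the quantization levels are spent on the small-to-moderate weight values.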

#1107: TOKEN is a MASK: Few-shot Named Entity Recognition with Pre-trained Language Models

Ali Davody, David Ifeoluwa Adelani, Thomas Kleinbauer and Dietrich Klakow

Transferring knowledge from one domain to another is of practical importance for many tasks in natural language processing, especially when the amount of available data in the target domain is limited. In this work, we propose a novel few-shot approach to domain adaptation in the context of Named Entity Recognition (NER). We propose a two-step approach consisting of a variable base module and a template module that leverages the knowledge captured in pre-trained language models with the help of simple descriptive patterns. Our approach is simple yet versatile, and can be applied in few-shot and zero-shot settings. Evaluating our lightweight approach across a number of different datasets shows that it can boost the performance of state-of-the-art baselines by 2-5% F1-score.

#1095: Text-to-Text Transfer Transformer Phrasing Model Using Enriched Text Input

Markéta Řezáčková and Jindřich Matoušek

Appropriate prosodic phrasing of the input text is crucial for natural speech synthesis outputs. The presented paper focuses on using a Text-to-Text Transfer Transformer for predicting phrase boundaries in text and inspects the possibility of enriching the input text with more detailed information to improve the success rate of the phrasing model trained on plain text. This idea came from our previous research on phrasing that showed that more detailed syntactic/semantic information might lead to more accurate predicting of phrase boundaries.

#1130: The Influence of Dataset Partitioning on Dysfluency Detection Systems

Sebastian P. Bayerl, Dominik Wagner, Elmar Nöth, Tobias Bocklet and Korbinian Riedhammer

This paper empirically investigates the influence of different data splits and splitting strategies on the performance of dysfluency detection systems. For this, we perform experiments using wav2vec 2.0 models with a classification head as well as support vector machines (SVM) in conjunction with the features extracted from the wav2vec 2.0 model to detect dysfluencies. We train and evaluate the systems with different non-speaker-exclusive and speaker-exclusive splits of the Stuttering Events in Podcasts (SEP-28k) dataset to shed some light on the variability of results w.r.t. the partition method used. Furthermore, we show that the SEP-28k dataset is dominated by only a few speakers, making it difficult to evaluate. To remedy this problem, we created SEP-28k-Extended (SEP-28k-E), containing semi-automatically generated speaker and gender information for the SEP-28k corpus, and suggest different data splits, each useful for evaluating other aspects of methods for dysfluency detection.
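The difference between speaker-exclusive and non-speaker-exclusive splits can be illustrated with a short sketch. It assigns whole speakers, rather than individual clips, to the test set, so no speaker appears on both sides. The function name, the `(clip_id, speaker_id)` input format, and the greedy fill strategy are assumptions for this example and do not reproduce the splits proposed for SEP-28k-E.

```python
import random
from collections import defaultdict

def speaker_exclusive_split(clips, test_fraction=0.2, seed=0):
    """Split (clip_id, speaker_id) pairs so that no speaker is in both sets.

    Speakers are shuffled and added to the test set until it holds at least
    test_fraction of all clips; the remaining speakers go to training.
    """
    by_speaker = defaultdict(list)
    for clip_id, speaker_id in clips:
        by_speaker[speaker_id].append(clip_id)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    target = test_fraction * len(clips)
    train, test = [], []
    for speaker_id in speakers:
        bucket = test if len(test) < target else train
        bucket.extend(by_speaker[speaker_id])
    return train, test

# Hypothetical data: 50 clips spread evenly over 5 speakers.
clips = [(i, f"speaker{i % 5}") for i in range(50)]
train_ids, test_ids = speaker_exclusive_split(clips)
```

When a few speakers dominate the data, a single speaker can overshoot the requested test size considerably, which is one reason such datasets are hard to partition evenly.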

#1135: Transfer Learning of Transformers for Spoken Language Understanding

Jan Švec, Adam Frémund, Martin Bulín and Jan Lehečka

Pre-trained models used in the transfer-learning scenario have recently become very popular. Such models benefit from the availability of large sets of unlabeled data. Two such models are the Wav2Vec 2.0 speech recognizer and the T5 text-to-text transformer. In this paper, we describe a novel application of such models for dialog systems, where both the speech recognizer and the spoken language understanding modules are represented as Transformer models. Such a composition outperforms the baseline based on the DNN-HMM speech recognizer and CNN understanding.

#1163: Transformer-based Automatic Speech Recognition of Formal and Colloquial Czech in MALACH Project

Jan Lehečka, Josef V. Psutka and Josef Psutka

Czech is a very specific language due to its large differences between the formal and the colloquial form of speech. While the formal (written) form is used mainly in official documents, literature, and public speeches, the colloquial (spoken) form is used widely among people in casual speeches. This gap introduces serious problems for ASR systems, especially when training or evaluating ASR models on datasets containing a lot of colloquial speech, such as the MALACH project. In this paper, we are addressing this problem in the light of a new paradigm in end-to-end ASR systems -- recently introduced self-supervised audio Transformers. Specifically, we are investigating the influence of colloquial speech on the performance of Wav2Vec 2.0 models and their ability to transcribe colloquial speech directly into formal transcripts. We are presenting results with both formal and colloquial forms in the training transcripts, language models, and evaluation transcripts.

#1147: Use of Machine Learning Methods in the Assessment of Programming Assignments

Botond Tarcsay, Jelena Vasić and Fernando Perez-Tellez

Programming has become an important skill in today’s world and is taught widely both in traditional and online settings. Educators need to grade increasing numbers of student submissions. Unit testing can contribute to the automation of the grading process; however, it cannot assess the structure, or partial correctness, which are needed for finely differentiated grading. This paper builds on previous research that investigated several machine learning models for determining the correctness of source code. It was found that some such models can be successful, provided that the code samples used for fitting and prediction fulfil the same sets of requirements (corresponding to coding assignments). The hypothesis investigated in this paper is that code samples can be grouped by similarity of the requirements that they fulfil and that models built with samples of code from such a group can be used for determining the quality of new samples that belong to the same group, even if they do not correspond to the same coding assignment, which would make for a much more useful predictive model in practice. The investigation involved ten different machine learning algorithms used on over four hundred thousand student code submissions and it confirmed the hypothesis.

#1093: Wakeword Detection under Distribution Shifts

Sree Hari Krishnan Parthasarathi, Lu Zeng, Christin Jose and Joseph Wang

We propose a novel approach for semi-supervised learning (SSL) designed to overcome distribution shifts between training and real-world data arising in the keyword spotting (KWS) task. Shifts from training data distribution are a key challenge for real-world KWS tasks: when a new model is deployed on device, the gating of the accepted data undergoes a shift in distribution, making the problem of timely updates via subsequent deployments hard. Despite the shift, we assume that the marginal distributions on labels do not change. We utilize a modified teacher/student training framework, where labeled training data is augmented with unlabeled data. Note that the teacher does not have access to the new distribution either. To train effectively with a mix of human and teacher labeled data, we develop a teacher labeling strategy based on confidence heuristics to reduce entropy on the label distribution from the teacher model; the data is then sampled to match the marginal distribution on the labels. Large scale experimental results show that a convolutional neural network (CNN) trained on far-field audio, and evaluated on far-field audio drawn from a different distribution, obtains a 14.3% relative improvement in false discovery rate (FDR) at equal false reject rate (FRR), while yielding a 5% improvement in FDR under no distribution shift. Under a more severe distribution shift from far-field to near-field audio with a smaller fully connected network (FCN) our approach achieves a 52% relative improvement in FDR at equal FRR, while yielding a 20% relative improvement in FDR on the original distribution.
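The two-step idea described above (keep only confident teacher labels, then sample to match the label marginal) can be sketched as follows. The abstract does not specify the actual confidence heuristics, so the fixed thresholds, the function names, and the subsampling rule are assumptions made purely for illustration.

```python
import random

def confident_labels(scores, low=0.1, high=0.9):
    """Keep teacher scores outside the uncertain band and turn them into labels.

    Returns (index, label) pairs; scores strictly between low and high are dropped.
    """
    labeled = []
    for index, score in enumerate(scores):
        if score >= high:
            labeled.append((index, 1))
        elif score <= low:
            labeled.append((index, 0))
    return labeled

def match_marginal(labeled, positive_fraction, seed=0):
    """Subsample so positives make up roughly positive_fraction of the result."""
    if not 0.0 < positive_fraction < 1.0:
        raise ValueError("positive_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    positives = [item for item in labeled if item[1] == 1]
    negatives = [item for item in labeled if item[1] == 0]
    # Largest total size for which both classes still have enough examples.
    total = min(len(positives) / positive_fraction,
                len(negatives) / (1.0 - positive_fraction))
    n_pos = min(len(positives), round(total * positive_fraction))
    n_neg = min(len(negatives), round(total * (1.0 - positive_fraction)))
    return rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)

# Hypothetical teacher scores: six confident positives, two confident negatives,
# and two uncertain examples that are discarded.
teacher_scores = [0.95, 0.99, 0.92, 0.97, 0.91, 0.93, 0.05, 0.02, 0.5, 0.6]
kept = confident_labels(teacher_scores)
balanced = match_marginal(kept, positive_fraction=0.5)
```

Dropping the uncertain band lowers the entropy of the teacher's labels, and the subsampling step keeps the label proportions in line with the assumed unchanged marginal.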

#1185: Fairslator

Michal Měchura

This demo will introduce Fairslator, an experimental application for removing bias from machine translation. Translations produced by machines – especially when the source language is English – are often biased because of ambiguities in gender, number and forms of address. Fairslator resolves these by examining the output of machine translation, detecting the presence of any bias-triggering ambiguities, and asking the human user how they wish to resolve them: for example, whether gender-ambiguous English words such as ‘student’ and ‘doctor’ should be translated as male or female, or whether the English pronoun ‘you’ should be translated as singular or plural, as formal or informal.

Extended abstract

#1186: Digital Primer v1 :: One Goal, Two Prototypes

Daniel D. Hromada, Hyungjoong Kim

The ultimate aim of the “Digital Primer” (DP) project is the development, optimization and deployment of a digital education instrument (Bildunginstrument) for fostering the acquisition of basic literacy in primary school pupils. DP has two sub-projects: the “physical” Personal Primer (π2) branch focuses on the design of a post-smartphone open hardware artefact based on “Raspberry Pi Zero” technology, while the “Web Primer” sub-project provides extended functionality in the browser. Both sub-projects provide audiotext support, implement human-machine peer learning curricula and use Mozilla’s DeepSpeech acoustic models embellished with our own exercise-specific language models.

Extended abstract

#1187: A tool for terminology extraction - OneClick Terms

Ondřej Matuška

OneClick Terms is a powerful online term extractor with monolingual and bilingual term extraction capabilities. It is powered by the unique term extraction technology in the Sketch Engine corpus query system.

#1188: Corpora for language learning - SkELL

Ondřej Matuška

SkELL (an abbreviation of Sketch Engine for Language Learning) is a free corpus-based web tool that allows language learners and teachers to find authentic sentences for specific target word(s). For any word or phrase, SkELL displays a concordance that lists example sentences drawn from a special text corpus crawled from the internet, which was cleaned of spam and only includes authentic texts suitable for language learning. There are versions of SkELL for English, Russian, German, Italian, Czech and Estonian.

#1189: Opravidlo.cz – New Online Proofreader of Czech Language

Jakub Machura

Writing without any grammatical, spelling, or typographical mistakes is one of the main features of high-standard typed text. Nowadays, language users increasingly demand software that would reliably detect and correct various kinds of mistakes in texts. Since June this year, a beta version of the new web proofreader for Czech has been available at Opravidlo.cz. The individual suggestions for correction are based on formal rules that identify mistakes in spelling (punctuation, capitalization, common spelling mistakes), grammar (sentence commas, grammatical consistency, ungrammatical sentence structures) and typography (according to standard CSN 01 6910). The tool is available as a freely accessible web interface, where texts can be inserted or written directly and then checked. Currently, the beta version is available for the public to test. Based on feedback, Opravidlo will be fixed and modified if necessary.
