ENOM: Embedded, non-obstructive monitoring of voice and speech impediments

Project goals

Stuttering is a commonly known speech disorder with a prevalence of about 5% in children and 1% in adults. Males are far more likely to be affected than females. The cause of this speech disorder is still not fully understood.
It is considered incurable, but it is treatable. Treatment approaches have therefore been established that aim to treat the symptoms, such as technological speech aids and rhythm exercises. Although this often leads to rapid improvements, lasting therapeutic success cannot be proven. The behavioural therapy method used by our project partner, Kasseler Stottertherapie, specifically aims to achieve more fluent speech. This is achieved by completely modifying the client’s speech using a technique called “fluency shaping”.

This speech technique is typically learned during an intensive two-week course.
To start with, visual biofeedback is used. Pronunciation is analyzed in real time and graphic feedback is displayed to the speaker. The technique is first learnt on individual sounds and words before being applied to coherent sentences in spontaneous everyday situations. After the initial in-person phase, clients continue to receive telemedical support.

Leaving the therapy environment is a critical time, as it understandably causes stress for clients. At this point it would be desirable to have a continuous therapy monitoring system that can be used in clients’ daily lives without disturbing them. The aim of the project is therefore the non-obstructive analysis of clients’ speech, i.e. analysis that runs in the background, records any abnormalities, and provides therapists and clients with feedback on speech behaviour in everyday situations that is as objective as possible. To meet the requirements for processing medical data under the General Data Protection Regulation (GDPR), data must be processed only locally and must not be forwarded to third parties for processing. Such a system must therefore be able to process and classify speech on the device, and a system capable of this classification must be trained using previously recorded and tagged data.
This places specific demands on the dataset. It must contain modified speech, the abnormal speech typically seen with stuttering, and fluent spontaneous speech, all recorded under realistic conditions. An intermediate goal of the project is therefore to create an extensively tagged dataset that allows a classifier to be trained. The classifier must be able to distinguish between fluent speech, modified speech, and symptoms typically seen with stuttering. The classifier must then be adapted to run on mobile devices. Finally, the findings obtained in this way will be combined to create an overall system.

Research work carried out so far and results

The basis used for automated speech analysis is often a speech recognition system. Therefore, a first step was to select a dataset containing German spontaneous speech. A modern end-to-end speech recognizer was also trained with it and compared to proven methods for speech recognition. This comparison was specifically aimed at assessing the usability of small to medium-sized datasets for syllable recognition (Bayerl and Riedhammer, 2019). The use of word subunits is especially important in paralinguistic analyses. The usability of different speech and syllable recognizers for the purpose of paralinguistic analysis was also compared and evaluated.

One goal of the research project is to create an extensively tagged dataset, which should be of use in training machine learning models. Thanks to the excellent collaboration with our project partner, Kasseler Stottertherapie, we succeeded in completing an initial, elaborately tagged version of this dataset in February 2019. This contains data from 37 stutterers (9 female, 28 male). The recordings of the clients were made at different points in the therapy plan, namely before learning the speech technique in the intensive course, after the intensive course, and following completion of the therapy. General incoherence, interjections, silent blockages, repetition of syllables and words, and broken words and sentences were explicitly tagged. To our knowledge, this is the most comprehensive dataset of its kind and it makes possible a variety of experiments and paralinguistic analyses.

Working together with Kasseler Stottertherapie, various metrics linked to stuttering were evaluated and a link to prosodic features, in particular average sound duration, was established. Statistical analyses were carried out for this purpose. It was possible to demonstrate that a modified speech recognizer can in principle be used to identify non-fluent parts of speech. In this context, the Speech Control Index (SCI) was introduced and compared with the Speech Efficiency Score (SES, Amir et al., 2018) to assess their suitability for use in stuttering therapy (Bayerl et al., 2020).

Time series are sequences of values ordered in time. Classifying and distinguishing them poses a particular challenge. Since audio data can also be treated as time series, there was a digression into the general analysis of time series before the tagged dataset became available. This resulted in the development of an innovative, highly robust system that can handle a variety of different time series (Wenninger et al., 2019). The method renders time series as recurrence plots and then classifies these images with deep convolutional neural networks (CNNs). Applied to a large, widely used benchmarking dataset, the system obtained excellent results: its classification accuracy was largely in the same range as the best methods used to date, at times even surpassing the best system. In principle, this method can be used in a slightly modified form to distinguish between stuttered and fluent speech.
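
To illustrate what a recurrence plot is, the following minimal sketch builds one for a short toy signal. It assumes the simplest textbook variant, a binary matrix whose entry (i, j) is 1 when the values at times i and j differ by at most a threshold; the function name, the threshold `eps`, and the toy values are illustrative assumptions and are not taken from the published pipeline, which may use a different (for example unthresholded) form.

```python
def recurrence_plot(series, eps):
    """Return a binary recurrence matrix for a 1-D series.

    Entry [i][j] is 1 when |series[i] - series[j]| <= eps, else 0.
    This is a textbook variant chosen for illustration only.
    """
    n = len(series)
    return [
        [1 if abs(series[i] - series[j]) <= eps else 0 for j in range(n)]
        for i in range(n)
    ]


# Toy periodic signal (illustrative values, not project data).
signal = [0.0, 1.0, 0.0, 1.0, 0.0]
rp = recurrence_plot(signal, eps=0.1)

# Each row can be read as one line of pixels in the resulting image.
for row in rp:
    print(row)
```

The periodic toy signal produces a checkerboard pattern. Regular structure of this kind in the image is what an image classifier such as a CNN could pick up on.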

In order to investigate embedded and on-device recognition, we worked together with security researchers at the Technical University of Darmstadt. This involved adapting a speech recognizer to run on an embedded system in an encapsulated, secure environment. The resulting system resolves the conflict of interest that arises between a service provider and the user of a service. On the one hand, the user has a legitimate interest in keeping his or her data protected from access by the service provider; on the other hand, the service provider does not want to reveal its intellectual property. If machine learning models are delivered to users rather than hidden behind an application programming interface (API), the intellectual property behind such a model can be stolen. If, instead, data is sent to the service provider, there is a possibility it will be misused. Running the models in a secure environment on the device resolves this dilemma.
The data contained within such an environment are protected from access by the service provider, and the model is protected from theft and manipulation by the user. This secures the interests of both the service provider and the user (Bayerl et al., 2020, Offline Model Guard). This point is of particular interest for the commercial use of the findings of this research project.

For an Android-based prototype of the system described in the project proposal, a proof of concept was created in the course of two workshops. The findings of the first workshop were published in the “Show and Tell” session of the Interspeech 2019 conference in Graz (Vasquez-Correa et al., 2019). While the application focuses on the recognition of Parkinson’s disease, the findings are transferable to stuttering by replacing the recognition models with models that recognize stuttering and by incorporating stuttering-specific speech exercises. The findings of the second workshop are still being processed and will be published soon.

Details of the findings can be found in the attached scientific papers.

Specific challenges

Changes in the Android operating system in particular mean that it is no longer possible to record speech continuously in everyday situations or during phone calls. An embedded smartphone analysis has therefore been ruled out as an option for now, although its feasibility has been demonstrated in principle (Vasquez-Correa et al., 2019). A possible alternative to the smartphone would be devices comparable to modern voice assistants such as Alexa or Google Home, which are always listening. Unlike those assistants, however, such devices would process the voice data purely locally rather than in the cloud. They would have to meet the same requirements as the app described in the research proposal: speaker identification, speech recognition, and categorization of speech into the classes fluent, modified, and non-fluent.

Creating and tagging the data in particular proved to be difficult and made for slow progress to begin with. The accuracy and variety of tags we were seeking, which are necessary to make classification possible outside of laboratory conditions, delayed the dataset creation process. The pool of individuals who are able to consistently tag stutterer-specific speech and modified speech is very small, and the time required makes the tagging process expensive and slow.


The findings obtained in (Wenninger et al., 2019) for general classification of time series can most likely be applied to stutterer-specific voice patterns. We are currently working on a system of this type for the classification of stuttering. As soon as a satisfactory classification rate is achieved, the results will be published in a scientific journal and development work will begin on a prototype device to support therapy monitoring.

A series of experiments based on the system described in (Wenninger et al., 2019) produced some interesting findings. The system was adapted to use spectrograms instead of recurrence plots for preprocessing and classification. Spectrograms are closer to speech than recurrence plots, although the latter could still conceivably be used in a modified form. Initial experiments on the classification of stuttering can be seen in the table below.

The experiments described briefly below were evaluated using a subset of the data created. Experiments one and two use Mel spectrograms, i.e. spectrograms mapped to a perceptual scale that more closely matches the human perception of frequencies. The aim is to represent the frequencies needed to identify speech with higher precision than those not required for understanding speech. Experiments three and four use normal spectrograms. The findings can be seen in the table below; the reported accuracies are averages across five trained models.
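
The effect of the Mel scale can be illustrated with a short sketch. It assumes the widely used conversion mel = 2595 · log10(1 + f / 700); which Mel variant, frequency range, and number of bands were used in the experiments is not stated here, so the range of 0 to 8000 Hz and the 10 bands below are purely illustrative assumptions.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mel (common 2595 * log10 formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(f_min, f_max, n_bands):
    """Centre frequencies (Hz) of n_bands bands equally spaced in mel."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + step * (k + 1)) for k in range(n_bands)]


# Assumed analysis range and band count, for illustration only.
centers = mel_center_frequencies(0.0, 8000.0, 10)
for c in centers:
    print(round(c, 1))
```

The printed centre frequencies lie close together at the low end and spread out towards 8000 Hz. This is how a Mel spectrogram gives finer resolution to the lower frequencies that matter most for speech.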

Preliminary classification results
Experiment            | Number of classes | Classes                      | Avg. accuracy
Mel spectrogram exp 1 | 4                 | uf, f, m, P                  | 50.27%
Mel spectrogram exp 2 | 8                 | uf, wm, ws, uW, uS, I, sb, m | 66.59%
Spectrogram exp 3     | 4                 | uf, f, m, P                  | 58.10%
Spectrogram exp 4     | 8                 | uf, wm, ws, uW, uS, I, sb, m | 69.20%

The findings are promising, and further experiments can now be built on these initial ones. Normal spectrograms in particular appear to work well. Further preprocessing steps that enhance the features needed to classify stuttering can be applied to improve the results step by step.

Findings, particularly those obtained during the development of the monitoring device, can also be applied to other conditions that can be monitored by means of continuous voice monitoring, such as Parkinson’s disease, Alzheimer’s disease, or depression.


  • S. P. Bayerl and K. Riedhammer, 2019. A comparison of hybrid and end-to-end models for syllable recognition. In: Proc. Int'l Conference on Text, Speech, and Dialogue (TSD).
  • S. P. Bayerl et al., 2020. Towards automated assessment of stuttering and stuttering therapy. In: Proc. Int'l Conference on Text, Speech, and Dialogue (TSD).
  • J. C. Vasquez-Correa, T. Arias-Vergara, P. Klumpp, M. Strauss, A. Küderle, N. Roth, S. Bayerl, N. Garcia-Ospina, P. A. Perez-Toro, L. F. Parra-Gallego, C. D. Rios-Urrego, D. Escobar-Grisales, J. R. Orozco-Arroyave, B. Eskofier, and E. Nöth, 2019. Apkinson: a mobile solution for multimodal assessment of patients with Parkinson’s disease. In: Proc. Interspeech 2019.
  • S. P. Bayerl, T. Frassetto, P. Jauernig, K. Riedhammer, A.-R. Sadeghi, T. Schneider, E. Stapf, and C. Weinert, 2020. Offline model guard: secure and private ML on mobile devices. In: Proc. ACM SIGDA Conference on Design, Automation, and Test in Europe (DATE).
  • M. Wenninger, S. P. Bayerl, J. Schmidt, and K. Riedhammer, 2019. Timage – a robust time series classification pipeline. In: Proc. Int'l Conference on Artificial Neural Networks (ICANN).

Project start/end

May 2018 - April 2021.



Prof. Korbinian Riedhammer
(Machine learning, voice recognition, and understanding)

Sebastian P. Bayerl
(Research associate)

Prof. Elmar Nöth
(Chair for Pattern Recognition, Friedrich-Alexander-University of Erlangen-Nuremberg)

Dr Florian Hönig
(Kasseler Stottertherapie)


  • Bavarian State Ministry for Education and Culture, Science, and the Arts
  • Bavarian Scientific Forum (BayWISS)