Slovenian automatic speech recognition for broadcast news

Authors

  • Lucija Gril, University of Maribor, Faculty of Electrical Engineering and Computer Science, Slovenia
  • Mirjam Sepesy Maučec, University of Maribor, Faculty of Electrical Engineering and Computer Science, Slovenia
  • Gregor Donaj, University of Maribor, Faculty of Electrical Engineering and Computer Science, Slovenia
  • Andrej Žgank, University of Maribor, Faculty of Electrical Engineering and Computer Science, Slovenia

DOI:

https://doi.org/10.4312/slo2.0.2021.1.60-89

Keywords:

automatic speech recognition, characteristics of Slovenian language, broadcast news, deep neural networks, lossy speech codecs

Abstract

Automatic speech recognition is one of the key building blocks of speech and language technologies. In this article, we describe the development of an automatic speech recognizer for Slovenian in the domain of daily news broadcasts. The system architecture is based on a deep neural network. Given the available speech resources, we trained models with various activation functions. During development, we also examined the impact of lossy speech codecs on recognition results. We trained the recognizer on the UMB BNSI Broadcast News and IETK-TV databases, which together contain 66 hours of speech recordings. Alongside the work on the deep neural networks, we enlarged the recognition vocabulary to 250,000 words, which reduced the out-of-vocabulary (OOV) rate to 1.33%. The best word error rate (WER) achieved on the test set was 15.17%. In evaluating the results, we also carried out a more detailed analysis of recognition errors based on lemmas and F-conditions, which to some extent shows how complex the Slovenian language is for such usage scenarios.
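
The two headline figures in the abstract, the word error rate and the out-of-vocabulary rate, follow standard definitions. The sketch below illustrates how such figures are commonly computed; it is not the authors' evaluation code. The function names `word_error_rate` and `oov_rate` and the toy sentences are hypothetical and chosen only for illustration.

```python
from typing import List, Set


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Computed as the word-level Levenshtein distance divided by the
    reference length. Illustrative sketch, not the article's scoring tool.
    """
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(reference)


def oov_rate(tokens: List[str], vocabulary: Set[str]) -> float:
    """Share of running words in the text that are missing from the vocabulary."""
    if not tokens:
        raise ValueError("tokens must contain at least one word")
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


if __name__ == "__main__":
    # Toy data, made up for this example only.
    reference = "danes bo sončno vreme".split()
    hypothesis = "danes je sončno".split()
    vocabulary = {"danes", "bo", "sončno"}
    print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
    print(f"OOV rate: {oov_rate(reference, vocabulary):.2%}")
```

In this toy case the hypothesis has one substituted word and one missing word against a four-word reference, and one of the four reference words is absent from the vocabulary. Both rates are normalized by the number of running words in the reference text, so a larger vocabulary lowers the OOV rate only to the extent that the added words actually occur in the test data.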

Published

01.07.2021 — Updated on 06.07.2021

How to Cite

Gril, L., Sepesy Maučec, M., Donaj, G., & Žgank, A. (2021). Slovenian automatic speech recognition for broadcast news. Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research, 9(1), 60–89. https://doi.org/10.4312/slo2.0.2021.1.60-89 (Original work published July 1, 2021)