Development of online models for automatic speech recognition systems with a low data level
Abstract
Speech recognition is a rapidly growing field of machine learning. Conventional automatic speech recognition systems are built from independent components: an acoustic model, a language model, and a lexicon, each tuned and trained separately. The acoustic model predicts the context-dependent states of phonemes, while the language model and lexicon determine the most probable sequences of spoken words. Advances in deep learning have improved many scientific areas, including speech recognition. Today, the most popular speech recognition systems are based on an end-to-end (E2E) structure, which trains the components of the traditional pipeline jointly, without isolating individual elements, and represents the system as a single neural network. An E2E system maps acoustic signals directly to a sequence of labels, with no intermediate states and no need for post-processing at the output, which makes it easy to implement. Especially popular today are models that output the word sequence directly from the input audio in real time, that is, online end-to-end models. This article provides a detailed overview of popular online models for E2E systems: RNN-T, Neural Transducer (NT), and Monotonic Chunkwise Attention (MoChA). It should be emphasized that online models for Kazakh speech recognition have not yet been developed; for low-resource languages such as Kazakh, the above models have not been studied. Systems based on these models were therefore trained to recognize Kazakh speech. The results show that all three models recognize Kazakh speech well without the use of external additions.
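Recognition quality for systems like those surveyed here is conventionally reported as word error rate (WER), computed from the Levenshtein edit distance between the reference transcript and the recognizer's hypothesis. A minimal sketch in Python (illustrative only; the function names are our own, not from the article):

```python
def levenshtein(ref, hyp):
    """Edit distance (substitutions, insertions, deletions) via dynamic programming."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j] for the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion from ref
                            curr[j - 1] + 1,          # insertion into ref
                            prev[j - 1] + (r != h)))  # substitution (0 if match)
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

# One deleted word out of three reference words gives WER = 1/3.
print(wer("бұл мысал сөйлем", "бұл сөйлем"))
```

Because WER divides by the reference length, it can exceed 1.0 when the hypothesis contains many insertions; scoring is done on whitespace-separated words, so text normalization (case, punctuation) should be applied consistently to both strings beforehand.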
Article Details
Copyright (c) 2022 Mamyrbayev ОZh, et al.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Licensing and protecting author rights are central to the publishing business. Peertechz dedicates itself to making it easier for people to share and build upon the work of others while remaining consistent with copyright rules. Peertechz licensing terms are formulated to facilitate reuse of the manuscripts published in its journals, to take maximum advantage of Open Access publication, and to disseminate knowledge.
We support 'libre' open access, which defines Open Access in its true sense: free-of-charge online access along with usage rights. The usage rights are granted through the use of a specific Creative Commons license.
Peertechz complies with [CC BY 4.0]
Explanation
'CC' stands for Creative Commons license. 'BY' indicates that users must give attribution to the creator when the published manuscripts are used or shared. This license allows redistribution and reuse, commercial and non-commercial, as long as credit is given to the author.
Please note that Creative Commons user licenses are non-revocable. We recommend that authors check whether their funding body requires a specific license.
Under this license, after publishing with Peertechz, authors may share their research by posting a free draft copy of their article to any repository or website.
'CC BY' license permissions:

| License Name | Permission to read and download | Permission to display in a repository | Permission to translate | Commercial use of manuscript |
| --- | --- | --- | --- | --- |
| CC BY 4.0 | Yes | Yes | Yes | Yes |
Authors should note that Creative Commons licenses are focused on making creative works available for discovery and reuse. Creative Commons licenses provide an alternative to standard copyright, allowing authors to specify the ways their works may be used without having to grant permission for each individual request. Authors who wish to reserve all of their rights under copyright law should not use CC licenses.