Enhancing recurrent neural network-based language models by word tokenization
- Hatem M. Noaman†1, 2Email authorView ORCID ID profile,
- Shahenda S. Sarhan†1 and
- Mohsen. A. A. Rashwan†3
© The Author(s) 2018
Received: 26 January 2018
Accepted: 9 April 2018
Published: 27 April 2018
Different approaches have been used to estimate language models from a given corpus. Recently, researchers have used different neural network architectures to estimate the language models from a given corpus using unsupervised learning neural networks capabilities. Generally, neural networks have demonstrated success compared to conventional n-gram language models. With languages that have a rich morphological system and a huge number of vocabulary words, the major trade-off with neural network language models is the size of the network. This paper presents a recurrent neural network language model based on the tokenization of words into three parts: the prefix, the stem, and the suffix. The proposed model is tested with the English AMI speech recognition dataset and outperforms the baseline n-gram model, the basic recurrent neural network language models (RNNLM) and the GPU-based recurrent neural network language models (CUED-RNNLM) in perplexity and word error rate. The automatic spelling correction accuracy was enhanced by approximately 3.5% for Arabic language misspelling mistakes dataset.
The second problem is the memory size. Recurrent neural networks-based language models with large numbers of neurons are expected to need larger memory sizes than that of other traditional language models. Researchers tried to find solutions to these problems through merging all words that occur less than a given threshold into a special rare token or by adding classes of neurons in the output layer and factorizing the output layer into classes . The main contribution of this work is building a recurrent neural network language modeling model that outperforms the basic RNNLM . It is faster and consumes less memory than the RNNLM and its enhanced versions (the factored recurrent neural network language model (fRNNLM)  and the CUED-RNNLM ). It also adds word features implicitly with no need to add a different vector for each word, as proposed by the related work in the fRNNLM  or the FNLM . These features make the proposed model suitable for highly inflected languages and in building models with dynamic vocabulary expansion. It also decreases the number of vocabulary words since unseen words can be inferred from other seen words if the words have the same stem. This paper is organized as follows. “Related work” section includes an overview of the related works. In “Proposed model” section , the researchers discuss the word tokenization process for English and Arabic then the proposed model is presented. The experimental results are discussed in “Experiments and results” section . Finally, the conclusion is presented in “Conclusion” section .
Neural networks different architectures have been investigated and applied to language model estimations by many researchers. Feed forward neural networks  have been adapted in language modeling estimation ; feed forward neural network language models simultaneously learn the probability functions for word sequences and build the distributed representation for individual words, but this model has a drawback in that a fixed number of words can be considered as a context window for the current or target word. To enhance the conventional feed forward neural network language models training time, researchers proposed continuous space language modeling (CSLM), which is a modular open-source toolkit of feed forward neural network language models ; this model introduces support for GPU cards that enable neural networks to build models with corpora that contain more than five billion words in less than 24 hours with about a 20% perplexity reduction . Recurrent neural networks have been applied to estimate language models. However, with this model, there is no need to specify the context window size by using feedback from the hidden to the input layer as a kind of network memory for the word context. Experiment results have proved that recurrent neural networks in language models outperform n-gram language models [5, 6, 9, 14, 15]. An RNNLM toolkit was designed to estimate the class-based language model using recurrent neural networks [5, 6]. It can also provide functions such as an internist model evaluation using perplexity, N-best rescoring and model-based text generation. The training speed is the main RNNLM drawback, especially with large vocabulary sizes and large hidden layers. The RWTHLM  is another recurrent neural network-based toolkit with long short-term memory (LSTM) implementation, and the RWTHLM toolkits BLAS library was used to support reduced training time and efficient network training. The CUED-RNNLM  provides an implementation for the recurrent neural network-based model, and it has GPU support to achieve a more efficient training speed. Both the basic feed forward network and the recurrent neural network-based language models do not include any type of word level morphological features, but some researchers tried to add this type of word feature explicitly by input layer factorization. Factored neural language models (FNLM)  add word features explicitly in the neural network input layer in the feed-forward based neural network language model and the factored recurrent neural network language model (fRNNLM) . They also add word features to the recurrent neural network input layer to model the results better than the basic model. Their complexity is higher than that of the original models since they add word features explicitly to the input layer. While adding these features improves network performance, it adds more complexity to the models estimation and the application performance, especially when applying it to large size vocabulary applications or language with rich morphological features. Researches tries to build RNNLM personalization models  using dataset collected from social media networks, model-based RNNLM personalization aims to captures patterns posted by used and his/her related friends while another approach is feature-based where RNNLM parameters are static throw users. Recently neural-based language modeling models added as an extension to Kaldi automatic speech recognition (Kaldi-RNNLM)  software, this architecture combines the use of subword features and one-hot encoding of words with high frequency to handle large vocabularies containing infrequent words. Also Kaldi-RNNLM architecture improves cross-entropy objective function to train unnormalized probabilities. In addition to feed forward network and the recurrent neural network-based language models architectures convolution neural network (CNN)  was applied to estimate language models with inputs to the network in the form of character and output predictions is at the word-level.
The proposed model is a modified version of the basic recurrent neural network language model . Instead of presenting the full word to the network input layer, we split the word into three parts: the prefix, the stem and the suffix. Both the prefix and the suffix may or may not exist. “Word tokenization” section presents a full description about word tokenization and how we implement it with English and Arabic text using modified versions of two free open source stemmers in “English word tokenization” and “Arabic word tokenization” sections. Next, the proposed models architecture is discussed in “The proposed model” section with full details about the model’s components, inputs and outputs.
English word tokenization
English stemmer original input and output after converting it into prefix, stem and suffix form
Arabic word tokenization
Arabic stemmer original input and output after scanning words and convert it into prefix + stem + suffix form
The proposed model
Experiments and results
The proposed model is compared with other language modeling approaches based on three different evaluation approaches. The first one is the model computation complexity presented in “Model complexity” section. The second one in “Models perplexity results” section is the proposed model perplexity enhancement and word error rate. The results are shown for English automatic speech recognition using the AMI meeting corpus and the model perplexity enhancement plus the entropy reduction result for the Online Open Source Arabic language corpuss experimental results. Finally, in “Arabic automatic spelling correction application” section, the Arabic automatic spelling error correction results are presented using the Qatar Arabic Language Bank (QALP) corpus.
Complexity of the RNNLM and factored RNNLM models are \((|V| + H) * H + H * (C + |V|)\) and \((|f1| + ... + |fK| +H) * H + H * (C + |V|)\), respectively. While proposed model complexity is \((|pr|+|stem|+|suff|+ H) * H + H * (C + |V|)\), where V, H, C fi are the vocabulary count, the hidden layer size, the classes count and the ith feature vector respectively. The researchers observe that the sum of the prefix number and the stems count and the suffixes count \((|pr|+|stem|+|suff|)\) will be much less than vocabulary words count (|V|) especially for highly inflected language and language with rich morphological system. Also proposed model does not need extra GPU processing capabilities as needed with CUED-RNNLM system.
Models perplexity results
English AMI meeting corpus experiments
English AMI meeting corpus perplexity and WER results using RNNLM and CUED-RNNLM against our proposed model
The results show that the proposed token-based recurrent neural network language model has outperformed the n-gram LM by approximately 3% and enhances the basic RNNLM and its GPU version CUED-RNNLM by approximately 1.5% when using the English AMI meeting corpus dataset. While the proposed approach is relatively close to the CUED-RNNLMs reported results with the same dataset, the proposed systems training and decoding times are much improved compared to those of the RNNLM and CUED-RNNLM. Moreover, memory consumption is much lower with our proposed model, which make it much more applicable to be used for rescoring tasks rather than for the models generated by the RNNLM (that have no GPU support and have high memory needs) and the CUED-RNNLM (that relies on the GPU architecture and needs more computational resources).
Online Open Source Arabic language corpus experiments
Perplexity on 70K word as test from Arabic Open Corpus using different smoothing techniques against proposed algorithm
Entropy reduction (%)
Arabic automatic spelling correction application
Automatic Arabic spelling correction application
Test set 1 (%)
Test set 2 (%)
Test set 3 (%)
In this paper, we have introduced a modified recurrent neural network-based language model for language modeling. The modification was to segment the network input into three parts. It is observed that the computational complexity is much lower than that in the basic recurrent neural network model. This outcome makes it possible to build language models for highly inflective languages (such as Arabic) from large corpora with smaller training times and memory costs. Using the intrinsic evaluation (perplexity), it is observed that our proposed model outperforms the baseline n-gram model by up to 30% based on the Arabic Open Corpus experimental results shown in Table 4. The results obtained from applying the proposed model to the Arabic automatic spelling correction problem show about a 3.5% total accuracy enhancement. This finding indicates that more complex and advanced Arabic language applications (such as speech recognition and automatic machine translation) can make use of the model described in this paper.
HN is the corresponding author, and SS and MR are the co-authors. HN has made substantial contributions in the design and implementation of proposed algorithm. SH was involved in drafting the manuscript or critically revising it. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Ethics approval and consent to participate
We confirm that this manuscript has not been published elsewhere and is not under consideration by another journal. All authors have approved the manuscript and agree with its submission.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155MATHGoogle Scholar
- Abramowitz M, Stegun IA (1964) Handbook of mathematical functions: with formulas. Graphs Math Tables 55:83MATHGoogle Scholar
- Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press Cambridge, Cambridge, p 30Google Scholar
- Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533View ArticleMATHGoogle Scholar
- Kombrink S, Mikolov T, Karafiát M, Burget L (2011) Recurrent neural network based language modeling in meeting recognition. In: Twelfth annual conference of the international speech communication associationGoogle Scholar
- Mikolov T, Karafiát M, Burget L, Černockỳ J, Khudanpur S (2010) Recurrent neural network based language model. In: Eleventh annual conference of the international speech communication associationGoogle Scholar
- Bousmaha KZ, Rahmouni MK, Kouninef B, Hadrich LB (2016) A hybrid approach for the morpho-lexical disambiguation of arabic. J Inf Proces Syst 12(3):358–380Google Scholar
- Saad MK, Ashour W (2010) Osac: open source arabic corpora. In: 6th ArchEng international symposiums, EEECS, vol. 10Google Scholar
- Mikolov T, Kopecky J, Burget L, Glembek O et al (2009) Neural network based language models for highly inflective languages. In: IEEE international conference on acoustics, speech and signal processing. ICASSP 2009, pp 4725–4728Google Scholar
- Wu Y, Yamamoto H, Lu X, Matsuda S, Hori, C, Kashioka H (2012) Factored recurrent neural network language model in ted lecture transcription. In: International workshop on spoken language translation (IWSLT)Google Scholar
- Chen X, Liu X, Qian Y, Gales M, Woodland PC (2016) Cued-rnnlman open-source toolkit for efficient training and evaluation of recurrent neural network language models. In: 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 6000–6004Google Scholar
- Alexandrescu A, Kirchhoff K (2006) Factored neural language models. In: Proceedings of the human language technology conference of the NAACL, companion volume: short papers. Association for computational linguistics, pp 1–4Google Scholar
- Schwenk H (2013) Cslm-a modular open-source continuous space language modeling toolkit. In: INTERSPEECH, pp 1198–1202Google Scholar
- Devlin J, Zbib R, Huang Z, Lamar T, Schwartz R, Makhoul J (2014) Fast and robust neural network joint models for statistical machine translation. In: Proceedings of the 52nd annual meeting of the association for computational linguistics (Vol. 1: long papers), vol. 1, pp 1370–1380Google Scholar
- De Mulder W, Bethard S, Moens M-F (2015) A survey on the application of recurrent neural networks to statistical language modeling. Comput Speech Lang 30(1):61–98View ArticleGoogle Scholar
- Sundermeyer M, Schlüter R, Ney H (2014) rwthlmthe RWTH Aachen University neural network language modeling toolkit. In: Fifteenth annual conference of the international speech communication associationGoogle Scholar
- Tseng B-H, Wen T-H (2017) Personalizing recurrent-neural-network-based language model by social network. IEEE/ACM Trans Audio Speech Lang Proces (TASLP) 25(3):519–530View ArticleGoogle Scholar
- Xu H, Li K, Wang Y, Wang J, Kang S, Chen X, Povey D, Khudanpur S (2018) Neural network language modeling with letter-based features and importance sampling. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP)Google Scholar
- Kim Y, Jernite Y, Sontag D, Rush AM (2016) Character-aware neural language models. In: AAAI, pp 2741–2749Google Scholar
- Kernighan MD, Church KW, Gale WA (1990) A spelling correction program based on a noisy channel model. In: Proceedings of the 13th conference on computational linguistics. Association for computational linguistics, vol. 2, pp 205–210Google Scholar
- Buckwalter T (2004) Buckwalter arabic morphological analyzer version 2.0. linguistic data consortium, University of Pennsylvania, 2002. ldc cat alog no.: Ldc2004l02. Technical report, ISBN 1-58563-324-0Google Scholar
- Han J, Moraga C (1995) The influence of the sigmoid function parameters on the speed of backpropagation learning. In: International workshop on artificial neural networks. Springer, Berlin, pp 195–201Google Scholar
- Carletta J, Ashby S, Bourban S, Flynn M, Guillemot M, Hain T, Kadlec J, Karaiskos V, Kraaij W, Kronenthal M et al (2005) The ami meeting corpus: a pre-announcement. In: International workshop on machine learning for multimodal interaction. Springer, Berlin, pp 28–39Google Scholar
- Alumäe T, Kurimo M (2010) Efficient estimation of maximum entropy language models with n-gram features: An srilm extension. In: Eleventh annual conference of the international speech communication associationGoogle Scholar
- Chen X (2015) Cued rnnlm toolkitGoogle Scholar
- Noaman HM, Sarhan SS, Rashwan M (2016) Automatic arabic spelling errors detection and correction based on confusion matrix-noisy channel hybrid system. Egypt Comput Sci J 40(2):2016Google Scholar
- Attia M, Al-Badrashiny M, Diab M (2014) Gwu-hasp: hybrid arabic spelling and punctuation corrector. In: Proceedings of the EMNLP 2014 workshop on Arabic natural language processing (ANLP), pp 148–154Google Scholar
- Zaghouani W, Mohit B, Habash N, Obeid O, Tomeh N, Rozovskaya A, Farra N, Alkuhlani S, Oflazer K (2014) Large scale Arabic error annotation: guidelines and framework. In: LREC, pp 2362–2369Google Scholar