TY - JOUR
AU - Noaman, Hatem M.
AU - Sarhan, Shahenda S.
AU - Rashwan, Mohsen A. A.
PY - 2018
DA - 2018/04/27
TI - Enhancing recurrent neural network-based language models by word tokenization
JO - Human-centric Computing and Information Sciences
SP - 12
VL - 8
IS - 1
AB - Different approaches have been used to estimate language models from a given corpus. Recently, researchers have used various neural network architectures to estimate language models from a given corpus, exploiting the unsupervised learning capabilities of neural networks. Generally, neural networks have demonstrated success compared to conventional n-gram language models. For morphologically rich languages with very large vocabularies, the major trade-off with neural network language models is the size of the network. This paper presents a recurrent neural network language model based on the tokenization of words into three parts: the prefix, the stem, and the suffix. The proposed model is tested on the English AMI speech recognition dataset and outperforms the baseline n-gram model, the basic recurrent neural network language model (RNNLM), and the GPU-based recurrent neural network language model (CUED-RNNLM) in both perplexity and word error rate. Automatic spelling correction accuracy was also improved by approximately 3.5% on an Arabic-language misspelling dataset.
SN - 2192-1962
UR - https://doi.org/10.1186/s13673-018-0133-x
DO - 10.1186/s13673-018-0133-x
ID - Noaman2018
ER -