Fig. 3
From: Enhancing recurrent neural network-based language models by word tokenization

Word tokenization process flowchart. The proposed approach uses a stemmer to split each word into three parts: a prefix, a stem, and a suffix. The input to the stemmer is a complete surface word, and the output is the stemmed word vector, consisting of a prefix ID, a stem ID, and a suffix ID. After splitting the word into its component parts, the stemmer assigns each part a unique ID. If a word has no prefix or no suffix, the corresponding ID is set to −1 to indicate that the part is absent.
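
The mapping described in the caption, from a surface word to a (prefix ID, stem ID, suffix ID) vector, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the stemmer used in the article: the affix lists `PREFIXES` and `SUFFIXES`, the class `ToyTokenizer`, the naive first-match splitting rule, and the choice of a separate ID table per part are all hypothetical. Only the three-part output and the use of −1 for a missing prefix or suffix come from the caption.

```python
from typing import Dict, List, Tuple

# Hypothetical affix inventories, for illustration only; the caption does not
# describe the actual stemmer or its affix lists.
PREFIXES: List[str] = ["un", "re"]
SUFFIXES: List[str] = ["ing", "ed", "s"]

ABSENT = -1  # ID used when a word has no prefix or no suffix


class ToyTokenizer:
    """Maps a surface word to a (prefix ID, stem ID, suffix ID) vector."""

    def __init__(self) -> None:
        # Assumption: one ID table per part, with IDs assigned on first sight.
        self.prefix_ids: Dict[str, int] = {}
        self.stem_ids: Dict[str, int] = {}
        self.suffix_ids: Dict[str, int] = {}

    @staticmethod
    def split(word: str) -> Tuple[str, str, str]:
        """Strip the first matching prefix and suffix, keeping a non-empty stem."""
        prefix = next(
            (p for p in PREFIXES if word.startswith(p) and len(word) > len(p)), ""
        )
        rest = word[len(prefix):]
        suffix = next(
            (s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s)), ""
        )
        stem = rest[: len(rest) - len(suffix)]
        return prefix, stem, suffix

    @staticmethod
    def _lookup(table: Dict[str, int], part: str) -> int:
        """Return the ID of a part, assigning the next free ID if it is new."""
        if not part:
            return ABSENT
        return table.setdefault(part, len(table))

    def encode(self, word: str) -> Tuple[int, int, int]:
        prefix, stem, suffix = self.split(word)
        return (
            self._lookup(self.prefix_ids, prefix),
            self._lookup(self.stem_ids, stem),
            self._lookup(self.suffix_ids, suffix),
        )


tok = ToyTokenizer()
print(tok.encode("replaying"))  # (0, 0, 0): re + play + ing
print(tok.encode("played"))     # (-1, 0, 1): no prefix, same stem, new suffix
print(tok.encode("unplug"))     # (1, 1, -1): new prefix, new stem, no suffix
```

Because "replaying" and "played" share the stem ID 0 in this sketch, a model consuming these vectors would see the two words as related, which is the kind of sharing a stem-based tokenization is meant to expose.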