Fig. 3 | Human-centric Computing and Information Sciences

Fig. 3

From: Enhancing recurrent neural network-based language models by word tokenization

Word tokenization process flowchart. The proposed approach uses a stemmer to split each word into three parts: a prefix, a stem, and a suffix. The input to the stemmer is a complete surface word, and the output is the stemmed word vector, consisting of a prefix ID, a stem ID, and a suffix ID. After splitting the word into its component parts, the stemmer assigns each part a unique ID. If a word has no prefix or no suffix, the corresponding ID is set to −1 to indicate that the part is absent.
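
The flow described in the caption can be illustrated with a minimal sketch. It is not the authors' stemmer: the affix lists, the `split_word` helper, the `Vocab` and `WordTokenizer` classes, and the choice to keep a separate ID table for each part are all assumptions made for illustration. Only the three-part output vector and the −1 convention for a missing prefix or suffix come from the caption.

```python
from typing import Dict, Tuple

# Hypothetical affix inventories, for illustration only; a real stemmer
# would use language-specific morphological rules.
PREFIXES = ("un", "re")
SUFFIXES = ("ing", "ed", "s")

MISSING = -1  # marks an absent prefix or suffix, as described in the caption


class Vocab:
    """Assigns each distinct string a unique integer ID on first sight."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def id_of(self, part: str) -> int:
        if part not in self._ids:
            self._ids[part] = len(self._ids)
        return self._ids[part]


def split_word(word: str) -> Tuple[str, str, str]:
    """Toy splitter: strip at most one known prefix and one known suffix."""
    prefix = next(
        (p for p in PREFIXES if word.startswith(p) and len(word) > len(p)), ""
    )
    rest = word[len(prefix):]
    suffix = next(
        (s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s)), ""
    )
    stem = rest[: len(rest) - len(suffix)]
    return prefix, stem, suffix


class WordTokenizer:
    """Maps a surface word to a (prefix ID, stem ID, suffix ID) vector."""

    def __init__(self) -> None:
        # Assumption: each part type has its own ID table.
        self.prefix_vocab = Vocab()
        self.stem_vocab = Vocab()
        self.suffix_vocab = Vocab()

    def tokenize(self, word: str) -> Tuple[int, int, int]:
        prefix, stem, suffix = split_word(word)
        prefix_id = self.prefix_vocab.id_of(prefix) if prefix else MISSING
        stem_id = self.stem_vocab.id_of(stem)
        suffix_id = self.suffix_vocab.id_of(suffix) if suffix else MISSING
        return prefix_id, stem_id, suffix_id


tokenizer = WordTokenizer()
print(tokenizer.tokenize("replayed"))  # (0, 0, 0): "re" + "play" + "ed"
print(tokenizer.tokenize("play"))      # (-1, 0, -1): no prefix, no suffix
print(tokenizer.tokenize("unplays"))   # (1, 0, 1): same stem ID, new affix IDs
```

In this sketch, words that share a stem receive the same stem ID regardless of their affixes, which is the property that lets a language model reuse one stem representation across inflected forms.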
