Word clustering based on POS feature for efficient twitter sentiment analysis

With the rapid growth of social networking services on the Internet, a huge amount of information is continuously generated in real time. As a result, sentiment analysis of online reviews and messages has become a popular research issue [1]. In this paper a novel modified Chi Square-based feature clustering and weighting scheme is proposed for the sentiment analysis of twitter messages. Along with part-of-speech tagging, the discriminability and dependency of the words in the tagged training dataset are taken into account in the clustering and weighting process. The multinomial Naïve Bayes model is also employed to handle redundant features, and the influence of emotional words is raised to maximize the accuracy. Computer simulation with the Sentiment 140 workload shows that the proposed scheme significantly outperforms four existing representative sentiment analysis schemes in terms of accuracy, regardless of the size of the training and test data.

tree of sentence, which is constructed to indicate the relationship between the words by parsing the sentences [12][13][14][15]. Then the sentiment classifier is built based on the syntax relations, polarity, and features of the words [11]. There exist various challenges in sentiment analysis. The primary issue is the extraction of effective model. Typically, a machine learning algorithm is applied to the classification model extracted from the training dataset having manually tagged class labels [3]. Therefore, proper implementation of the classification model plays a crucial role in deciding the performance of sentiment analysis. Another issue is feature weighting. Assigning appropriate weight to relevant and discriminative features (attribute) is crucial to achieve the sentiment analysis of high accuracy using the classifier.
Among the feature weighting schemes proposed for sentiment analysis, the most widely used one is based on feature frequency (FF) due to its simplicity and effectiveness [10]. Here, the frequency of a word appearing in a document is utilized as the value of the feature of the document, and the highest of these values over all the documents is regarded as the feature value of the whole training dataset. FF shows reasonable performance in many cases. However, if the feature values are uniformly distributed, it is difficult to properly analyze the feature information. The scheme based on document frequency (DF) effectively handles the issue of uniformly distributed features. Here, the number of documents containing the target word is counted from the training dataset, which effectively represents the statistical information of the feature even in the case of uniform distribution. The DF scheme has the advantage of simplicity and applicability to training data of a huge volume at reasonable computational complexity [16]. However, rare words are treated as useless data, which degrades the performance of sentiment analysis [17]. Part of speech-based weighting (PSW) [18] is a recently proposed feature weighting scheme for twitter sentiment analysis, which is a kind of word frequency (WF)-based approach considering the frequency of each unique word in each category. The relevance of the word within the training dataset is also considered. As the weights for the words are set empirically, however, its performance may not be robust. Term frequency and inverse document frequency (TF-IDF) [19,20] is a commonly adopted feature weighting scheme owing to its efficiency and robustness. It assumes that the importance of a word is highly dependent on its frequency of occurrence in the document and the ratio of the total number of documents to the number of documents containing the word. It is effective in measuring the importance of the words among the documents of the training dataset, which greatly increases the accuracy of sentiment analysis. However, the dimensionality is still inflated, since the feature size equals the number of words in the entire training dataset. This causes a large computation overhead for weighting all the words [21].
Although a variety of feature weighting schemes for sentiment analysis have been developed, few of them investigate the relevancy between the clustered features and the class in assigning the weights. In this paper a novel feature weighting approach is proposed, which is inspired by the expectation that enhancing the strength of the words of strong discriminability may allow higher accuracy of sentiment analysis [22,23]. In the proposed scheme the words of the same type of POS feature of the classes are clustered into predefined sets. The dependency between the clustered set and the corresponding class is measured by the modified Chi Square technique [24]. It serves as a criterion for weighting the emotional words along with the discriminability of the words. The proposed scheme is extensively evaluated by computer simulation and compared with other schemes proposed for twitter sentiment analysis using the workloads of Sentiment 140 [25]. The simulation results reveal that the proposed scheme greatly improves on the accuracy of the existing schemes. The main contributions of the paper are summarized below.
• A novel feature reduction method is proposed to reduce the dimensionality (size of features) [26], which omits irrelevant data in classifying the training dataset into a small number of features and achieves a reasonable computational complexity when weighting the words [27,28].
• A modified Chi Square method is employed since the conventional Chi Square method suffers from the shortcoming of overemphasizing the role of the words of low frequency and measuring the class of a word based on DF. Therefore, WF is proposed to serve as the input to the Chi Square method to avoid such weakness. In addition, the traditional Chi Square method investigates the independency between a single feature and the class in the text classification. In the proposed scheme the dependency of the clustered feature set on the class is explored. The importance of the words is also characterized by the dependency derived from the modified Chi Square method.
• A novel composite feature weighting technique is proposed, which considers the dependency derived using the modified Chi Square technique and the discriminability of the clustered feature set. In addition, the influence of the dependency on the weighting is also taken into account. Meanwhile, the importance of the words of strong discriminability is emphasized in the weighting process so that they can take a more significant role in the sentiment analysis.
The rest of the paper is organized as follows: "Related work" section discusses the background of sentiment analysis. In "The proposed scheme" section the proposed scheme is presented, and its performance is evaluated in "Performance evaluation" section. Finally, the paper is concluded in "Conclusion" section.

Naive Bayes classifier (NBC)
NBC is commonly employed for text classification due to its robust performance for various data, especially for high dimensional text data. It is a probability-based classifier employing Bayes' theorem with the naïve assumption of independence between the predictors [29]. Here the properties of each predictor are analyzed to contribute the probability of the category of each predictor to its class. A classifier is constructed based on Bayes' theorem of Eq. (1) [30]. With NBC the influence of predictor_x on a given class_c is estimated, assuming that the predictors are independent of each other.
$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)} \tag{1}$$
Here P(c|x) is the probability of class_c given predictor_x, which is called the posterior probability. P(x|c) is the probability of predictor_x given class_c. P(c) is the probability of class_c being true, which is called the prior probability of class_c. P(x) is the prior probability of predictor_x. With n predictors, P(x|c) is defined as,
$$P(x \mid c) = \prod_{k=1}^{n} P(x_k \mid c) \tag{2}$$
The selective Bayes classifier (SBC) is an enhancement of NBC, which displays good performance when redundant attributes exist. With SBC highly correlated redundant attributes are excluded if the assumption of attribute independency is taken. In [31] greedy search is performed to select all the subsets of the attributes using the forward selection technique, which raises the accuracy of the classifier obtained from the training set.

Twitter sentiment analysis
Sentiment analysis involves language processing, text classification, and computational linguistics to extract emotional information from the source data. It is broadly employed to review the social media used in various fields such as marketing and customer service [32]. Typically, the intention of sentiment analysis is to estimate the mood of the user concerning the target object, and the basic task is to determine the polarity of the given text [32]. The approaches employed for sentiment analysis are roughly categorized into two types: machine learning-based and lexicon-based. With the machine learning-based approach the sentiment classifier is trained using a machine learning algorithm [1]. The lexicon-based approach focuses on the evaluation of the polarity of the text using the lexicons collected from various sources such as the MPQA lexicon [33], WordNet [34] and SentiWordNet [35]. The machine learning-based approach is commonly adopted for twitter sentiment analysis, where a binary classifier typically categorizes the target text as positive or negative. The basic structure of twitter sentiment analysis is shown in Fig. 1.
Recent studies of twitter sentiment analysis focus on usage of various feature sets and methods [36][37][38][39]. In [40], the emotional state of tweets is visualized into specific feelings such as sadness, joy, and anger by employing the theory of Naïve Bayesian. SVM and MaxEntropy classifiers are used as competitors to compare the performance. In [41], the authors analyze the emoticons of sport fans using a lexicon-based approach. In [42], the prediction of stock market was analyzed by SVM approach. Tweets derived from the University and financial companies are utilized as source dataset in the performance evaluations. The results proved that SVM achieved best performance compared to KNN and Naïve Bayes classifiers. In [43], a hybrid approach combining several classifiers is investigated. Various cross-validated experiments were conducted, and the results reveal that the hybrid approach greatly improves the accuracy of classification.
Feature selection is one of the key steps in data pre-processing employed to maximize the performance of text classification, and it utilizes a machine learning technique [44]. It eliminates irrelevant or redundant attributes from the original feature space, and selects a relevant subset based on the target evaluation criterion to reduce the complexity of the analysis [2,45]. Twitter is a popular online social networking service (SNS) platform that enables the users to share their thoughts or opinions in a 140-character message. Sentiment 140 lets the users discover the sentiment on people, products, etc. on Twitter. It also provides APIs for analyzing the tweets, and supports the integration of the sentiment analysis classifier with other personal sites or platforms [25].

Part of speech
The part of speech (POS) tagging is a method of splitting the sentences into words and attaching a proper tag such as noun, verb, adjective and adverb to each word based on the POS tagging rules [46]. Figure 2 lists the POS tag, and Fig. 3 shows three examples of tagging [47]. POS tagging has been widely used in various tasks including text classification, speech recognition, automatic machine translation, and so on. A variety of POS taggers are available for English such as Brill tagger, Tree tagger, and CLAWS tagger. The POS tagging operation consists of two stages, training stage and tagging stage, which are shown in Figs. 4 and 5, respectively.
In the training stage, the corpus is employed to supply words in different context environments, and the contextual information is used as a clue to construct the rules required to decide the lexical classes of the words. Then the most likely tag for a word is selected by calculating the probability of the appearance of the context of the word and its immediate neighbors in the tagging stage [48].

Feature weighting
In sentiment analysis the training data are classified into features (attributes) based on the content, and then weights are assigned to the features to distinguish their importance. Various feature weighting schemes have been proposed, and the most commonly used one is term frequency and inverse document frequency (TF-IDF). TF-IDF consists of two parts, term frequency and inverse document frequency. Term frequency, tf(w,d), represents the frequency of word_w appearing in document_d. The inverse document frequency, idf(w,D), is a measurement showing how much information word_w provides over the set of documents, D. It is obtained by dividing the total number of documents by the number of documents in which word_w appears, and then taking the logarithm as,
$$\mathrm{idf}(w, D) = \log \frac{|D|}{|\{d \in D : w \in d\}|}$$
The TF-IDF value is then obtained by TF-IDF(w,d,D) = tf(w,d)·idf(w,D). A high TF-IDF value of a word denotes a high frequency of the word in the document and a small number of documents containing the word in the entire set of documents. On the contrary, a low value indicates that the word appears evenly in every document. TF-IDF is useful for selecting the words important for a document and evicting common words [49]. FF is a popular feature weighting scheme because of its simplicity and efficiency, which expresses a document as a vector of features. The method utilizes the frequency of a word appearing in a certain document as the value of the feature of the document [49]. DF is another important feature weighting method used in a variety of applications of text classification and other related tasks, which counts the number of documents in which the target word_w appears within the entire set of documents, DF(w) = |{d ∈ D : w ∈ d}|. Only the words of a high DF value are kept. Part of speech-based weighting (PSW) is a recently proposed feature weighting scheme for improving the accuracy of twitter sentiment analysis. The method utilizes a POS tagger, and the words are classified into three predefined subclasses as shown in Table 1.
The importance of a word is measured based on its POS tag, as listed in Table 1. Words tagged as Adverb, Adjective, or Verb are regarded as related to emotion, and thus retained in the Emotion subclass. In addition, a weight value, wt i,j (j = 1, 2, 3), is assigned to reflect the importance of the words.
Here f i is the frequency of word_i appearing in the training dataset, and x is a constant factor used to adjust the degree of influence of the words of different property in deciding the sentiment. It is 2, 1.5, and 1 for the emotion, normal and remain subclass, respectively. E[F j ] is obtained as follows, where F j (j = 1, 2, 3) represents the subclass of Emotion, Normal, and Remain, respectively.
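The TF-IDF and DF definitions given earlier in this section can be illustrated with a minimal sketch. The function names and the three toy documents are hypothetical, raw counts are assumed for tf, and the natural logarithm is assumed for idf since the text does not fix a base.

```python
import math
from collections import Counter


def term_frequency(word, document):
    """tf(w, d): raw count of `word` in one tokenized document."""
    return Counter(document)[word]


def document_frequency(word, documents):
    """DF(w): number of documents that contain `word` at least once."""
    return sum(1 for doc in documents if word in doc)


def inverse_document_frequency(word, documents):
    """idf(w, D) = log(|D| / DF(w)); assumes the word occurs in some document."""
    return math.log(len(documents) / document_frequency(word, documents))


def tf_idf(word, document, documents):
    """TF-IDF(w, d, D) = tf(w, d) * idf(w, D)."""
    return term_frequency(word, document) * inverse_document_frequency(word, documents)


# Hypothetical toy corpus of tokenized documents.
docs = [
    ["good", "movie", "good", "cast"],
    ["bad", "movie"],
    ["movie", "was", "ok"],
]

print(tf_idf("good", docs[0], docs))   # 2 * log(3 / 1)
print(tf_idf("movie", docs[0], docs))  # 1 * log(3 / 3) = 0.0
```

The word "movie" appears in every document, so its idf is zero and it is effectively evicted as a common word, whereas "good" is frequent in one document only and receives a high value.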

Basic operation
The overall operation flow of the proposed scheme is as follows. For the sentences of twitter, POS tagging is first performed. Some sentences are selected as the training dataset based on the criteria, and then categorized into two classes, positive and negative, according to their polarity. The words in the classes are clustered using their POS tags. A weight value is then assigned to every word based on the dependency and word discriminability of the clustered feature set to which the word belongs. When the training stage is over, a table of statistical data is obtained. The sentiment of the sentences of twitter in the test document is judged based on the statistics table. The overall operation flow of the proposed scheme is depicted in Fig. 6. Generally, the objective of the proposed scheme is to reinforce the strength of emotional words through the weighting and make them more influential in sentiment analysis, where the dependency between the clustered feature sets and the classes serves as a criterion for the weighting. The detailed implementation is presented in the following subsections.

Preprocessing
Sentiment analysis mainly depends on the availability of an initial corpus, $\Phi = \{d_1, \ldots, d_{|\Phi|}\}$, $d_i = \{s_1, \ldots, s_{|S|}\}$, and predefined classes, $C = \{c_1, \ldots, c_{|C|}\}$. Here $d_i$ represents a document consisting of $|S|$ sentences, out of the $|\Phi|$ documents of the original corpus, and $|C|$ is the number of classes. First, the components of $\Phi$ are classified into the predefined set of categories, $C$. The task can be formalized as a function $\Psi$ [50]. Then the initial corpus $\Phi$ is classified into two class sets, $D_{neg}$ and $D_{pos}$, where sn and sp are the sizes of the documents consisting of negative and positive sentences, respectively. Then every sentence in $D_{neg}$ and $D_{pos}$ is parsed through the POS tagger, and every word of the sentence is assigned a corresponding POS tag serving as its feature. Refer to the example of Fig. 3.

Dimensionality reduction by feature selection
The sentences of twitter in the training data are classified into positive and negative sentences, while they can also be classified into subjective or objective. Note that the emotional words in a sentence are important in judging the sentiment of the sentence. Therefore, removing the sentences having few emotional words can improve the accuracy of sentiment analysis. In this paper, Adverb, Adjective and Verb are regarded as emotional features important in deciding the sentiment. A sentence containing fewer than two types of emotional feature is regarded as irrelevant to sentiment analysis, and thus removed from the training dataset [51]. In the example of Fig. 3, even though the meaning of the third sentence seems negative, it is not used for training because only one type of emotional feature, Verb, appears in the sentence.
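The filtering rule above can be sketched as follows. This is only an illustration: the grouping of Penn Treebank tags into the three emotional types is an assumption about the tag set, and both tagged sentences are hypothetical examples, not the sentences of Fig. 3.

```python
# Penn Treebank tag groups for the three emotional feature types
# (assumed grouping; the paper's own tag list is given in its figures).
EMOTIONAL_TYPES = {
    "Adjective": {"JJ", "JJR", "JJS"},
    "Adverb": {"RB", "RBR", "RBS"},
    "Verb": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
}


def emotional_types_in(tagged_sentence):
    """Return the set of emotional feature types present in a tagged sentence."""
    tags = {tag for _, tag in tagged_sentence}
    return {name for name, group in EMOTIONAL_TYPES.items() if tags & group}


def keep_for_training(tagged_sentence, min_types=2):
    """Keep a sentence only if it has at least `min_types` emotional feature types."""
    return len(emotional_types_in(tagged_sentence)) >= min_types


# Hypothetical (word, tag) sequences.
kept = [("I", "PRP"), ("really", "RB"), ("love", "VBP"), ("this", "DT"), ("phone", "NN")]
dropped = [("The", "DT"), ("battery", "NN"), ("died", "VBD")]

print(keep_for_training(kept))     # Adverb and Verb are present
print(keep_for_training(dropped))  # only Verb is present
```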

Feature clustering
The unigram feature extractor is utilized to retrieve the features from the tweets due to its simplicity and efficiency, which treats each unique word in the training dataset as a unit representing a separate feature [52]. Therefore, the set of document-built classes, $D_{\alpha}$, is expressed as a feature-based array consisting of unique words excluding the stop words. It is represented as [53], $D_{i,\alpha} = \{w_1, w_2, \ldots, w_{|w|} \mid \alpha = neg \vee pos\}$, where $w_i$ is a unique word occurring in the class set, $D_{i,\alpha}$. Each $w_i$ is associated with $f_i$, $t_i$, and $NW_i$. Here $f_i$ is the number of occurrences of the unique word, $w_i$, in the entire documents of $D_{i,\alpha}$ with its POS feature tag, $t_i$. $NW_i$ is the weighted frequency of $w_i$ reflecting the importance of the word as discussed in the following subsection. Different from the existing schemes counting the term frequency in every document and choosing the largest value to represent the feature of the document, the frequency of the words of the documents is computed with the class dataset to avoid the problem of exaggerating the role of low-frequency terms [22,49]. A novel feature clustering method is proposed to aggregate the words of the same POS features of $D_{i,neg}$ and $D_{i,pos}$ into the clustered feature sets, $C_E$ and $C_N$, as follows [54][55][56]. Here $C_E$ is the clustered set of emotional features maintaining the words of the POS tags of Adjective (JJ, JJR, JJS in Fig. 2), Adverb and Verb in $D_{i,\alpha}$. $C_N$ serves as the normal feature set keeping the words of the remaining tags [57]. The detailed classification is shown in Table 2. This process is formulated as,

Measuring importance
The emotional words are classified into the clustered feature set, $C_E$. Here it is crucial to reflect the importance of the words to decide whether the tagged emotional words are actually important to the class or not [58]. Typically, word discrimination (WD) is applied to measure how much discriminative information a word owns with respect to the class [59]. The importance of a word for the class is quantified as,
$$WD_{i,\mu} = f_{i,\mu} - f_{i,\nu}$$
where $WD_{i,\mu}$ represents the WD of word_i of class_μ against class_ν, which is measured by the difference between the frequency of word_i appearing in class_μ, $f_{i,\mu}$, and that in class_ν, $f_{i,\nu}$. Intuitively, a word of high WD is regarded as important to the class as it carries a strong flavor of the class, differentiating it from other classes. This in turn greatly facilitates the judgement of the sentiment of the sentences. If $WD_{i,\mu} > 0$, word_i has positive correlation with class_μ. Otherwise, it is deemed unrelated to class_μ. For instance, assume that two words, $w_1$ and $w_2$, coexist in $D_{i,neg}$ and $D_{i,pos}$ with the frequencies listed in Table 3.
Here $f_{1,neg}$ is the frequency of $w_1$ appearing in $D_{i,neg}$, and so on. The WD of $w_1$ and $w_2$ of $D_{i,neg}$ is calculated as $WD_{1,neg} = f_{1,neg} - f_{1,pos}$ and $WD_{2,neg} = f_{2,neg} - f_{2,pos}$. If $(f_{1,neg} - f_{1,pos}) > 0$, $w_1$ is regarded as an essential word of $D_{i,neg}$ instead of $D_{i,pos}$. Meanwhile, if $WD_{1,neg} > WD_{2,neg}$, $w_1$ is regarded as the word containing more information on $D_{i,neg}$ than $w_2$ for the analysis of the sentiment. Recall that two feature sets, $C_E$ and $C_N$, were built considering the POS feature of every word in $D_{i,\alpha}$, where $C_E$ retains the emotional words and $C_N$ the others. Intuitively, the emotional words are expected to carry more information regarding the polarity of the sentiment compared to other words [10]. In addition, emotional words are not likely to occur in more than one class owing to their strong discriminability. For example, "Like" is an emotional word representing positive sentiment, and thus $WD_{\text{'Like'},pos}$ will be much larger than $WD_{\text{'Like'},neg}$. Since the emotional words are classified into $C_E$ based on the POS tag, the mean value of WD of the words in $C_E$ will be greater than that of $C_N$. If p and q are the number of unique words of $C_E$ and $C_N$ in $D_{i,\alpha}$, respectively,
$$\frac{1}{p}\sum_{w_i \in C_E} WD_{i,\alpha} > \frac{1}{q}\sum_{w_j \in C_N} WD_{j,\alpha}$$
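The clustering into $C_E$ and $C_N$ and the WD measure can be sketched together as below. The tag set, the helper names, and the tagged word lists are hypothetical; the sketch keys the frequency tables by word only, which is a simplification when the same word carries different tags.

```python
from collections import Counter

# Assumed Penn Treebank tags for Adjective, Adverb and Verb.
EMOTIONAL_TAGS = {
    "JJ", "JJR", "JJS", "RB", "RBR", "RBS",
    "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
}


def cluster_by_pos(tagged_words):
    """Split (word, tag) pairs into frequency tables for C_E and C_N."""
    c_e, c_n = Counter(), Counter()
    for word, tag in tagged_words:
        target = c_e if tag in EMOTIONAL_TAGS else c_n
        target[word] += 1
    return c_e, c_n


def word_discrimination(word, freq_mu, freq_nu):
    """WD of `word` for class mu against class nu: f_mu - f_nu."""
    return freq_mu[word] - freq_nu[word]


# Hypothetical tagged words of the negative and positive class.
neg = [("hate", "VBP"), ("hate", "VBP"), ("bad", "JJ"), ("phone", "NN"), ("like", "VBP")]
pos = [("like", "VBP"), ("like", "VBP"), ("like", "VBP"), ("good", "JJ"), ("phone", "NN")]

neg_e, neg_n = cluster_by_pos(neg)
pos_e, pos_n = cluster_by_pos(pos)

print(word_discrimination("like", pos_e, neg_e))   # 3 - 1 = 2
print(word_discrimination("like", neg_e, pos_e))   # 1 - 3 = -2
print(word_discrimination("phone", neg_n, pos_n))  # 1 - 1 = 0
```

In this toy data the emotional word "like" has a positive WD only for the positive class, while the non-emotional word "phone" has a WD of zero for both classes.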

Measuring dependency
The conventional Chi Square method is commonly used for feature selection, which ranks the words and selects the words of the highest $\chi^2$ value. Meanwhile, it suffers from the weakness of overemphasizing low-frequency words, as only DF is considered without WF. Different from DF, WF is the frequency with which a word appears in the entire dataset, and it measures the importance of a word in the whole data space. With the traditional Chi Square method the uniformly distributed words are deemed to best represent the class. Specifically, the method is based on the intuition that the optimal words for a specific class are the ones distributed most evenly among the documents of the training dataset [50]. Therefore, only the evenly distributed words of high DF and low WF value are selected from the dataset as the features representing the class, while the words of low DF and high WF value are discarded as they may cause inefficiency. However, some words of low DF and high WF might also be important and thus need to be retained. In particular, the emotional words are regarded as rare with the conventional Chi Square method because they usually have an uneven distribution. They also have high WF and low DF values [22]. By incorporating WF as an input to the Chi Square method, the dependency of $C_E$ on $D_{i,\alpha}$ ($\alpha = neg \vee pos$) is measured as follows. First, the hypothesis statements are set up as follows.
Null Hypothesis_1: C E and D i,α are independent.
Alternative Hypothesis_1: C E and D i,α are dependent.
Then the r × c contingency table is constructed as in Table 4, where r and c are the number of rows and columns, respectively. The Chi Square value, $\chi^2$, is calculated using
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Here O is the observed frequency and E is the expected frequency under the hypothesis. Table 4 lists different cases of feature selections with respect to $C_E$.

Table 4
Feature selection       ∈ D i,α     ∉ D i,α     Sum
Containing C E          A           B           A + B
Not containing C E      C           D           C + D
Sum                     A + C       B + D       N

In Table 4, A is the sum of the WF of the words in $C_E$, which is expressed as $A = \sum_{i=1}^{p} WF_i$, where p is the number of words in $C_E$. Similarly, the value of C is calculated as $C = \sum_{i=1}^{q} WF_i$ with q as the number of words in $C_N$. B is the corresponding value for the opposite class, $D_{i,\bar{\alpha}}$, which is taken over the z words of $C_E$ in class $D_{i,\alpha}$ that also appear in $D_{i,\bar{\alpha}}$. As the number of words in $D_{i,\alpha}$ is q, z ≤ q. D is calculated from the words of $D_{i,\bar{\alpha}}$, where p′ and q′ are the number of words in $C_E$ and $C_N$ in $D_{i,\bar{\alpha}}$, respectively. For example, if the dependency of $D_{i,neg}$ on its corresponding clustered feature set, $C_E$, is measured, A is the total WF of the words in $C_E$ of $D_{i,neg}$ and B is the sum of the WF with which the words of $C_E$ appear in $D_{i,pos}$. C and D are obtained by subtracting A and B from the total WF of $D_{i,neg}$ and $D_{i,pos}$, respectively. With $N = A + B + C + D$, the expected frequency of the words of $C_E$ belonging to $D_{i,\alpha}$ is obtained as $E_{11} = (A + C)(A + B)/N$. Here $E_{11}$ is the expectation of A in the first row and first column of Table 4, and the deviation between the observation and the expectation is calculated as $D_{11} = (A - E_{11})^2 / E_{11}$. Similarly, the other values are obtained as follows.
$$E_{12} = \frac{(B + D)(A + B)}{N}, \qquad D_{12} = \frac{(B - E_{12})^2}{E_{12}}$$
$$E_{21} = \frac{(A + C)(C + D)}{N}, \qquad D_{21} = \frac{(C - E_{21})^2}{E_{21}}$$
$$E_{22} = \frac{(B + D)(C + D)}{N}, \qquad D_{22} = \frac{(D - E_{22})^2}{E_{22}}$$
And then, $\chi^2(C_E, D_{i,\alpha})$ is derived as
$$\chi^2(C_E, D_{i,\alpha}) = D_{11} + D_{12} + D_{21} + D_{22}$$
In addition, the value of $\chi^2(C_N, D_{i,\alpha})$ can also be computed by constructing a table similar to Table 4, and the hypotheses are shown below.
Null Hypothesis_2: C N and D i,α are independent.
Alternative Hypothesis_2: C N and D i,α are dependent.
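The Chi Square computation on the 2 × 2 table of A, B, C and D can be sketched as follows. The function name and the four input values are hypothetical, and every expected frequency is assumed to be non-zero.

```python
def chi_square_2x2(a, b, c, d):
    """Chi Square statistic of a 2x2 contingency table.

    a, b are the first row (feature set present), c, d the second row;
    columns are "in the class" and "not in the class".
    """
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_sums = [a + b, c + d]
    col_sums = [a + c, b + d]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n        # e.g. E11 = (A+B)(A+C)/N
            chi2 += (observed[i][j] - expected) ** 2 / expected  # e.g. D11
    return chi2


# Hypothetical summed word frequencies.
print(chi_square_2x2(30, 10, 20, 40))  # about 16.67, above the critical value 3.84
print(chi_square_2x2(10, 10, 20, 20))  # 0.0, rows are proportional (independent)
```

The first table yields expected frequencies of 20, 20, 30 and 30, so the four deviations are 5, 5, 10/3 and 10/3.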

Weighting of the word
Word weighting is performed based on the importance of the word in the training dataset. The proposed weighting scheme considers the dependency of the clustered feature set on the class, which is measured by the value of $\chi^2$. The greater the value, the stronger the dependency. For measuring the dependency, the critical value (CV) of Chi Square is given. In this paper 95% is taken as the metric for the measurement, which indicates that the null hypothesis is wrong with the probability of 0.95 or more. CV is computed as 3.84 with one degree of freedom, $(r - 1)(c - 1) = 1$, based on the cumulative distribution function of Chi Square expressed as [60],
$$F(x; k) = \frac{\gamma\left(\frac{k}{2}, \frac{x}{2}\right)}{\Gamma\left(\frac{k}{2}\right)}$$
Here k is the degree of freedom, γ is the lower incomplete gamma function, and Γ is the gamma function represented as,
$$\Gamma(k) = \int_0^{\infty} t^{k-1} e^{-t}\, dt$$
The probability density function (PDF) of the $\chi^2$ distribution, $f(\chi^2)$, is drawn in Fig. 7, which shows a two-sided test of the $\chi^2$ distribution with the CV of 3.84. If the given $\chi^2$ value is greater than the CV, the null hypothesis is rejected, and the tested clustered feature set is regarded as dependent on the class. Otherwise, they are treated as independent of each other. The CV of $\chi^2$ is set high to ensure the reliability of the dependency. The region for the probability of one degree of freedom with the CV is marked with slashed lines.
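The critical value of 3.84 can be checked numerically with a short sketch. It relies on the fact that, for one degree of freedom, the Chi Square CDF reduces to erf(√(x/2)); the bisection search and the function names are illustrative choices, not the procedure of the paper.

```python
import math


def chi2_cdf_1dof(x):
    """CDF of the Chi Square distribution with one degree of freedom."""
    return math.erf(math.sqrt(x / 2.0))


def critical_value_1dof(confidence=0.95, lo=0.0, hi=50.0, tol=1e-10):
    """Find x with CDF(x) = confidence by bisection (the CDF is increasing)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_cdf_1dof(mid) < confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


cv = critical_value_1dof()
print(round(cv, 2))  # 3.84
```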
Then $\chi^2(C_E, D_{i,\alpha})$ is compared with the CV. Null Hypothesis_1 is rejected if $\chi^2(C_E, D_{i,\alpha}) > CV$, and Alternative Hypothesis_1 is chosen, which indicates that $C_E$ is highly dependent on $D_{i,\alpha}$ with the probability of 95%. Otherwise, they are regarded as independent of each other. Similarly, $\chi^2(C_N, D_{i,\alpha})$ is compared with the CV, and Null Hypothesis_2 or Alternative Hypothesis_2 is chosen based on the result of the comparison. As $C_E$ holds emotional words of relatively strong discriminability, which are more likely to represent the class, the proposed weighting scheme strengthens the discriminability to highlight the role of emotional words in category prediction. Recall that the larger the $\chi^2$ value is, the more discriminative information of the class the feature holds [22]. If $\chi^2(C_E, D_{i,\alpha}) < CV$, $C_E$ and $D_{i,\alpha}$ are regarded as independent of each other, which indicates that $C_E$ does not contain enough information on class $D_{i,\alpha}$, and thus no weighting is applied. Only when $\chi^2(C_E, D_{i,\alpha}) > CV$ are the words in $C_E$ supposed to be highly dependent on $D_{i,\alpha}$, and the proposed weighting scheme is performed. Observe from Table 4 that $\chi^2(C_E, D_{i,\alpha})$ increases with the growth of A because B, C and D are constant values. Therefore, the value of A is increased to make the words in $C_E$ more discriminative. First, the distortion of the importance of a word, $\Theta(w_i, D_{i,\alpha})$, is defined, which measures the importance of a word between the observation and prediction. $\alpha(w_i, D_{i,\alpha})$ is the observed importance of word_$w_i$ for class_$D_{i,\alpha}$ measured by the relative frequency, $WF_{w_i}$, of word_$w_i$. $\beta(w_i, D_{i,\alpha})$ represents the predicted importance, and it is computed by the deviation between the expectation of the Chi Square method. The distortion of the importance for the clustered set, $C_\alpha$ ($\alpha = E \vee N$), is obtained as $\Theta(C_\alpha, c_k)$, which measures the distortion of the importance between cluster_$C_\alpha$ and class_$D_{i,\alpha}$. The increment rate of A, $r_A$, is computed as,
$$r_A = \frac{\Theta(C_E, c_k)}{A} \tag{17}$$
Since $\Theta(C_\alpha, c_k)$ is equal to $|A - E_{11}|$, Eq. (17) can be rewritten as,
$$r_A = \frac{\sqrt{D_{11} \cdot E_{11}}}{A} \tag{18}$$
Here $D_{11}$ and $E_{11}$ are the deviation and expected frequency, and $\sqrt{D_{11} \cdot E_{11}}$ measures the difference between the observed frequency and expected frequency in $D_{i,\alpha}$ ($\alpha = neg \vee pos$). By dividing it by the value of A, the increment rate $r_A$ is calculated. Meanwhile, $\chi^2(C_N, D_{i,\alpha}) > \chi^2_\alpha$ might be possible since some non-emotional words also have small discriminability. Moreover, as the volume of data of $C_N$ is much larger than that of $C_E$, the value of $\chi^2(C_N, D_{i,\alpha})$ might be greater than the CV of $\chi^2$. Therefore, $r_N$ is computed based on Table 5 and Eq. (18). The increment rate, $r_D$, is then determined. Here $r_D$ increases the frequency of A when $C_E$ holds enough words of strong discriminability and contains more class information than $C_N$. Otherwise, it is set to be zero. In addition, as the value of A is the sum of the WF of the words of all the documents in $C_E$

Testing
Bayes theorem is widely used in supervised learning for text classification. In this paper Multinomial Naïve Bayes (MNB) model is employed as the classifier for the given text, which is based on naïve assumption of conditional independence for the features [61]. Specifically, in the text classification of sentiment analysis, the goal is to find the best matching class for the tested sentences. It is the most likely or maximum a posteriori (MAP) class, c map , which is calculated as, where c is a class in the total classes in training dataset, C, and P(c|S) is posterior probability of class_c measuring the probability of sentence_S being in class_c as computed by Eq. (21). In the proposed scheme two classes are defined, D neg and D pos . Therefore, C = {D neg , D pos }, and the objective is to find the best matching class among C for every tested sentence.
Since the probability of the sentence, P(S), is a constant, it can be discarded. Equation (21) can then be expressed as, S i = {w 1 ,…,w |k| } represents one sentence composed of |k| words, and w j (j = 1,…,n) is a word in the sentence. For example, for the sentence of "peace is important", S = {peace, is, important}, with |k| = 3. P(w i |c) is the conditional probability of w j occurring in a sentence of class_c, which measures how much w j contributes for class_c to be the matching class. P(c) is the prior probability of a sentence in class_c. In Eq. (22), multiplying many conditional probabilities may lead to the problem of floating point underflow. Therefore, adding the logarithms of the probabilities instead of the multiplication is carried out. As the logarithmic function is monotonic, the class of the highest probability can still be selected as the target class. Equation (22) is thus converted to, (19) Here count(w i ,c) is the number of appearances of w i in class_c of the training dataset and count(c) is the total number of words in class_c. Meanwhile, the problem of zero probability can occur if a word in a sentence does not appear in the training dataset. Then, no matter how strong evidence could be gained from other words for the class, the estimation becomes zero. Laplace smoothing is employed to avoid this issue as [62], P(w i |c) = (count(w i ,c) + 1)/(count(c) + |V|). Where |V| is the number of distinct words in the training dataset. Recall that, in the proposed scheme, the weighted frequency (NW) of every word in the training dataset has been adjusted considering its relative popularity. Using it, P(w i |c) is obtained as, P(w i |c) = (NW i + 1)/ (count(c) + |V|).
Recall that $NW_i$ was obtained based on the dependency and discriminability of the word in the target class, where the influence of emotional words was strengthened by assigning them more weight than the others for more accurate prediction. Note that redundant feature words are also handled with the MNB model. For instance, assume that sentence $S$ is composed of four words as $S = \{w_1, w_2, w_3, w_1\}$. Then the product in Eq. (22) becomes $P(c) \cdot P(w_1|c)^2 \cdot P(w_2|c) \cdot P(w_3|c)$, so that $w_1$ has twice as much influence as the other words. The redundant word is therefore given more weight, which leads to biased and less accurate prediction. In the proposed scheme, thus, only the distinct words in the tested sentence are counted.
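The effect of counting only distinct words can be seen in a small Python sketch. The conditional probabilities below are hypothetical values chosen for illustration, and `score_sentence` is an assumed helper name rather than part of the original simulator.

```python
import math


def score_sentence(words, log_prior, log_cond, distinct_only=True):
    """Return log P(c) plus the sum of log P(w|c) over the words of a sentence."""
    if distinct_only:
        # dict.fromkeys drops repeated words while keeping first-occurrence order.
        words = list(dict.fromkeys(words))
    return log_prior + sum(log_cond[w] for w in words)


# Hypothetical conditional probabilities for one class.
log_cond = {"w1": math.log(0.2), "w2": math.log(0.1), "w3": math.log(0.05)}
log_prior = math.log(0.5)
sentence = ["w1", "w2", "w3", "w1"]

with_repeats = score_sentence(sentence, log_prior, log_cond, distinct_only=False)
distinct = score_sentence(sentence, log_prior, log_cond, distinct_only=True)
```

The two scores differ by exactly one extra `log P(w1|c)` term, which is the additional influence that the repeated word would otherwise receive.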

Performance evaluation
In this section the proposed scheme is evaluated by computer simulation using Matlab. For this, the workload obtained from Sentiment 140 [25] is used to analyze the accuracy of the proposed scheme for twitter sentiment analysis. It is also compared against the existing FF, PSW, DF, and TF-IDF schemes.
The simulator consists of three parts: preprocessor, POS tagging API, and Bayes-based classifier. The preprocessor classifies the data of the training data set, and converts them to the customized format accessible by the API of POS tagging [46]. A Matlab function is implemented for accessing the Stanford POS tagger [63], which provides the API for the data in the workload. The Bayesian classifier is used to classify the tested document and predict the sentiment of the target sentences. A Multinomial Naïve Bayes (MNB) model is employed as the classifier in the simulation. The workload used in the simulation is extracted from Sentiment 140, which contains 1,600,000 lines of tweet data. Of these, 800,000 belong to the negative class and the others to the positive class. In the simulation the documents used for the training process are also classified into two categories: positive and negative. Figure 8 shows two examples of data in the data set. There are six components in one tweet: polarity (0 = negative, 4 = positive), id, date, query condition, user name, and text of the tweet. The text is used in the training stage of the simulation.
Fig. 8 Two examples of tweets in the data set
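As an illustration of the six-component record described above, the following Python sketch reads tweets and maps the polarity code to a class label. It assumes the data are stored as comma-separated, double-quoted fields in the order listed; the `Tweet` type, the function names, and the two sample rows are made up for illustration and are not taken from the workload or from the authors' Matlab preprocessor.

```python
import csv
import io
from typing import NamedTuple


class Tweet(NamedTuple):
    polarity: int   # 0 = negative, 4 = positive
    tweet_id: str
    date: str
    query: str
    user: str
    text: str


def read_tweets(handle):
    """Yield one Tweet per well-formed six-field row."""
    for row in csv.reader(handle):
        if len(row) != 6:
            continue  # skip malformed rows
        polarity, tweet_id, date, query, user, text = row
        yield Tweet(int(polarity), tweet_id, date, query, user, text)


def label_of(tweet):
    """Map the polarity code to a class label; other codes raise KeyError."""
    return {0: "neg", 4: "pos"}[tweet.polarity]


# Two made-up rows in the assumed layout.
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","so sad today"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","user_b","love this, really"\n'
)
tweets = list(read_tweets(sample))
labels = [label_of(t) for t in tweets]
```

Using the `csv` module rather than splitting on commas keeps commas inside the quoted tweet text intact.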
Extensive simulations are run to obtain reliable performance data. Six training data sets of 5000, 10,000, 15,000, 20,000, 25,000, and 30,000 tweets are built, each of which consists of an equal number of positive and negative tweets randomly selected from Sentiment 140. In addition, a series of test data sets are formed to verify the performance. First, three test data sets of 2000, 4000 and 6000 tweets, each with equal numbers of negative and positive documents, are built to compare the accuracy of the schemes with the six sizes of training data ranging from 5000 to 30,000. The results are shown in Fig. 9. Observe from the figure that the proposed scheme consistently outperforms the other schemes regardless of the size of the training and test data sets. Intuitively, the accuracy of sentiment analysis increases as the volume of training data grows, because larger training data provide more evidence for sentiment judgement. Also notice that the accuracy of the proposed scheme gradually increases with the size of the training data set. However, apart from the PSW scheme, the other schemes show no such improvement. This is because the parameters of the feature classification model were decided empirically. Moreover, the accuracy generally drops as the test data increase because training data of limited size cannot consistently provide robust evidence for sentiment analysis. Observe from Fig. 9c that the proposed scheme is substantially more accurate than the others even in the worst condition of 'minimum training data (5000) and maximum test data (6000)'. This is because WF is utilized as a parameter in the Chi Square method of the proposed scheme, which overcomes the drawback of the traditional Chi Square method in analyzing low frequency terms.
Moreover, as a large Chi Square value implies that the feature (attribute) carries more class information, the weight applied to the words properly takes the interclass dependency into consideration. This strengthens the features of important words of high discriminability, which in turn produces higher accuracy than the other schemes.
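The balanced training and test sets used in these experiments could be drawn as in the following Python sketch. The helper name `balanced_sample`, the fixed seed, and the placeholder tweet pools are assumptions for illustration; the original simulator was written in Matlab and its sampling procedure is not detailed in the text.

```python
import random


def balanced_sample(positive, negative, size, seed=None):
    """Draw `size` items, half from each polarity, without replacement."""
    if size % 2:
        raise ValueError("size must be even for an equal split")
    half = size // 2
    rng = random.Random(seed)
    sample = rng.sample(positive, half) + rng.sample(negative, half)
    rng.shuffle(sample)  # avoid having all positives first
    return sample


# Placeholder pools standing in for the positive and negative tweets.
positive_pool = [("positive tweet %d" % i, "pos") for i in range(100)]
negative_pool = [("negative tweet %d" % i, "neg") for i in range(100)]

training_sizes = (10, 20, 30)  # the paper uses 5000 to 30,000
training_sets = {
    size: balanced_sample(positive_pool, negative_pool, size, seed=size)
    for size in training_sizes
}
```

Seeding the generator makes each draw repeatable, which helps when the same training set has to be reused across the compared schemes.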
In order to check the robustness of the proposed scheme, the accuracy is also measured with three randomly selected test data sets, each containing 3000 positive and 3000 negative data. The outcomes are shown in Fig. 10. Observe from the figure that the proposed scheme displays substantially higher accuracy regardless of the volume of training and test data. It is also worth noting that the accuracy decreases as the number of test data increases from 1000 to 6000. This is because the accuracy of sentiment analysis is significantly affected by the pattern of the test sentences. TF-IDF also shows reasonable performance since it employs feature weighting. The PSW scheme offers good accuracy when the size of the training data is large. The DF scheme is consistently superior to the FF scheme. Pan et al. [10] found that considering the presence or absence of features can yield higher accuracy than considering only the feature frequency. This is the reason why the DF scheme outperforms the FF scheme.
In the previous simulations the test data consist of equal numbers of positive and negative tweets. In order to evaluate the sensitivity of the proposed scheme with respect to the polarity, simulations are made with the test documents randomly selected from the test dataset without considering the polarity. Three different sizes of training data set, 5000, 20,000 and 30,000, are taken to handle the test data. The average accuracy is shown in Fig. 11, revealing that the proposed scheme consistently outperforms the others. Figure 12 shows the performance of the five schemes when the test dataset is randomly selected from Sentiment 140. Note that the proposed method significantly outperforms the other schemes for the three well-known benchmarks. The proposed scheme produces the best performance in terms of precision, recall, and F1-measure when the size of the test dataset varies from 500 to 6000. It reveals that the proposed scheme is very sensitive to the sentiment of the test documents and is capable of classifying the test data into the correct category.
Fig. 9 The comparison of accuracies with different size test data
Fig. 10 The comparison of accuracies with the test data of balanced polarity
Fig. 11 The comparison of accuracies with the test data of unbalanced polarity
Fig. 12 The comparison of three benchmarks with the randomly selected test data
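For reference, the benchmarks named above (accuracy, precision, recall, and F1-measure) can be computed from the predicted and actual labels as in the following Python sketch. The standard definitions are used, with the positive class taken as the target; the function name and the five example labels are hypothetical and do not reproduce any result of the paper.

```python
def evaluate(actual, predicted, positive_label="pos"):
    """Return accuracy, precision, recall, and F1 for a binary labelling."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive_label and p == positive_label)
    fp = sum(1 for a, p in pairs if a != positive_label and p == positive_label)
    fn = sum(1 for a, p in pairs if a == positive_label and p != positive_label)
    correct = sum(1 for a, p in pairs if a == p)

    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for five test tweets.
actual = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos"]
metrics = evaluate(actual, predicted)
```

With unbalanced test data, accuracy alone can be misleading, which is why precision, recall, and F1 are reported alongside it.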

Conclusion
Twitter sentiment analysis has become a promising technique for industry and academia. In this paper a novel feature weighting approach for sentiment analysis of twitter data has been proposed using a Bayes-based text classifier. An effective feature selection strategy that recognizes sentiment sentences is presented to select informative data for classification. Moreover, each term is grouped into a target cluster considering the POS property of the term. A novel feature weighting scheme considering discriminability and dependency derived from modified Chi Square statistics is introduced, which computes a proper weight value for each term reflecting the degree of importance of the term. Extensive experiments were conducted on Sentiment 140, and four representative feature weighting schemes were also tested for comparison. The experimental results show that the proposed scheme consistently outperforms the others in terms of accuracy, precision, recall, and F1-measure. In the future, a fine-grained clustering strategy will be developed to accurately define the margins of the clusters. Moreover, unsupervised learning techniques will be incorporated into the proposed scheme to further improve the performance of sentiment analysis. In addition, the proposed scheme will be tested using various classifiers such as SVM, decision tree, and neural network.