Sentiment Analysis for Twitter using WEKA


Introduction

A strong signal of a successful product is that users want to use it because it fulfills their needs. To reach that point, company executives need to evaluate the product's performance once it is officially released to the public. One way to do that is to learn how users react to the product's quality. Knowing the users' reactions lets the company improve production quality, since it learns about user expectations; it also encourages paying more attention to user-driven development (UDD), a process in which the needs of a product's users become the top priority in the product development process.

When it comes to evaluating a product's quality, executives can find public opinion on social media such as Twitter, Facebook, and Instagram. They can search for their product and analyze users' opinions about it. They could do this manually, but of course it would take a long time to finish the evaluation. So they need something that does this job automatically, letting them be more productive than if they merely labeled every tweet as a positive or negative opinion by hand.

One of the common applications used for this job is a sentiment analyzer. This kind of app determines how positive or negative a piece of text is, using Natural Language Processing as the core technology for analyzing and understanding the content of a sentence. In this brief article, I will explain how this application classifies the sentiment of tweets extracted from Twitter. So, let's start!


Tech Stack

Programming Languages:

  • Java (JSP, Servlet)

Frameworks and Libraries:

  • WEKA
  • AngularJS
  • Bootstrap
  • jQuery
  • twitter4j (Java library for the Twitter API)

Supporting Model

This application uses a classification scheme called SlidingWindow, which determines the model to be used when the “disagreement” condition is met. “Disagreement” is the condition in which the core classifier cannot decide whether a tweet belongs to the positive or the negative class.

If we choose to use SlidingWindow, there are two possible conditions when we receive the classification result, namely:

  • If the core classifier returns positive or negative as the predicted class,
    • put the corresponding instance into the training data. We should also check whether the total number of instances exceeds 1000; if so, remove the first (oldest) instance
  • If the core classifier returns nan as the predicted class,
    • put the corresponding instance into the test data; it will be classified at the end of the process
    • For this condition, we should also consider these possibilities:
      • If the number of instances in the training data is more than 0,
        • decide upon the "disagreed" document by applying a model learned from the last 1000 "agreed" documents (stored in the "train" instances)
      • Else,
        • the class is unknown, because the training data is empty and the training process cannot run. In this condition, we can just return a random class as the final result

If we choose not to use SlidingWindow, then there are also two possible conditions, namely:

  • If the core classifier returns positive or negative as the predicted class,
    • set the list of predicted classes from the three possible outcomes
  • If the core classifier returns nan as the predicted class,
    • decide upon the "disagreed" (out = nan) document by applying a previously built model

General Concept

In this part, I will explain the core principles of this sentiment analysis, along with some code snippets, so that you can understand the backend logic better.

Extracts Twitter Data

  • Sets the query
    • We begin by fetching the user's query and assigning the value to a global variable.
  • Extracts tweets
    • Afterwards, we fetch the tweets that contain the keyword entered by the user. This step uses the Twitter API, so it needs some authentication tokens (you can get them from your Twitter developer account). A sketch of this step is shown after this list.
  • Sets the tweet text
    • To achieve optimal classification results, we only need the text of each tweet. That means we do not need the other supporting features, such as the publication time, the username and display name of the person posting the tweet, the location it was posted from, etc. So, we just store the tweet text in a list. Simple!
  • Sets the number of tweets
    • The last step in this stage is simply to store the number of extracted tweets.
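
Below is a minimal sketch of this extraction stage using twitter4j. The OAuth credentials are placeholders, and the class and method names (TweetExtractor, fetchTweetTexts) are illustrative rather than the application's actual ones:

```java
import java.util.ArrayList;
import java.util.List;

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;
import twitter4j.conf.ConfigurationBuilder;

public class TweetExtractor {

    public static List<String> fetchTweetTexts(String keyword, int maxTweets) throws Exception {
        // Placeholder credentials; use the tokens from your Twitter developer account.
        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey("CONSUMER_KEY")
                .setOAuthConsumerSecret("CONSUMER_SECRET")
                .setOAuthAccessToken("ACCESS_TOKEN")
                .setOAuthAccessTokenSecret("ACCESS_TOKEN_SECRET");
        Twitter twitter = new TwitterFactory(cb.build()).getInstance();

        // Search for tweets containing the user's keyword.
        Query query = new Query(keyword);
        query.setCount(Math.min(maxTweets, 100)); // the API caps one page at 100 tweets
        QueryResult result = twitter.search(query);

        // Keep only the text of each tweet; the other features are not needed.
        List<String> texts = new ArrayList<>();
        for (Status status : result.getTweets()) {
            texts.add(status.getText());
        }
        return texts;
    }
}
```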

Sentiment Analysis

  • Initializes Sentiment Processor
    • Sentiment Analyser
      • Trainer
        • Initializes BidiMap objects for text, feature, and complex representation
          • We use a DualHashBidiMap that stores pairs of String and Integer. Its content is all of the attributes from the dataset for each tweet representation, namely text, feature, and complex. I will explain these three representations in the next sub-points.
        • Trains lexicon model
          • This is supervised learning using the LibSVM classifier
          • It trains the model for the lexicon-based representation
          • It saves the model in order to use it on the provided test sets
          • The rest of the representation models will be created on-the-fly, because the minimum term frequency threshold takes both the train and test sets into consideration
        • Trains text model
          • The idea is that this model determines the sentiment value based on the whole opinion or on divisions of it
          • It builds and saves the text-based model built on the training set
          • The training set has already been preprocessed: every special symbol is converted into a string representing that symbol (for example, @user becomes 'usermentionsymbol', www.example.com becomes 'urllinksymbol', etc.)
          • The general steps are:
            • Retrieves a new Instances object with a filter applied
            • Saves the instances to a file named according to the type of representation; in this case we use the text representation, so the name will be 'T.arff'
            • Writes the attributes from the filtered instances (which can be retrieved from 'T.arff') to a file in the 'attributes' folder (text.tsv). The attributes are the tokenized words, and the representation depends on the tokenizer used
            • Creates the classifier (NaiveBayesMultinomial), builds the model, and saves it
        • Trains feature model
          • The idea is that this model determines the sentiment value based on Twitter features, such as links, mentions, and hash-tags
          • It builds and saves the feature-based model built on the training set
          • The training set has already been preprocessed: every special symbol is converted into a string representing that symbol (for example, @user becomes 'usermentionsymbol', www.example.com becomes 'urllinksymbol', etc.)
          • The general steps are:
            • Retrieves a new Instances object with a filter applied
            • Saves the instances to a file named according to the type of representation; in this case we use the feature representation, so the name will be 'F.arff'
            • Writes the attributes from the filtered instances (which can be retrieved from 'F.arff') to a file in the 'attributes' folder (feature.tsv). The attributes are the tokenized words, and the representation depends on the tokenizer used
            • Creates the classifier (NaiveBayesMultinomial), builds the model, and saves it
        • Trains complex model
          • The idea is that this model determines the sentiment value based on the combination of text and POS (Part of Speech); each word is assigned its corresponding POS
          • It builds and saves the complex-based model built on the training set
          • The training set has already been preprocessed: every special symbol is converted into a string representing that symbol (for example, @user becomes 'usermentionsymbol', www.example.com becomes 'urllinksymbol', etc.)
          • The general steps are:
            • Retrieves a new Instances object with a filter applied
            • Saves the instances to a file named according to the type of representation; in this case we use the complex representation, so the name will be 'C.arff'
            • Writes the attributes from the filtered instances (which can be retrieved from 'C.arff') to a file in the 'attributes' folder (complex.tsv). The attributes are the tokenized words, and the representation depends on the tokenizer used
            • Creates the classifier (NaiveBayesMultinomial), builds the model, and saves it
        Below is a sketch of what this text-based training step could look like. The WEKA classes and calls are real, but the class name, method name, and file paths are illustrative:
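
```java
import java.io.File;

import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ArffSaver;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TextModelTrainer {

    // Trains the text-based representation model. Assumes 'train' already has
    // its class attribute set and holds the preprocessed tweet strings, and
    // that the models/ directory exists.
    public static void trainTextModel(Instances train) throws Exception {
        // Tokenize the tweet text into bigrams (min and max n-gram size of 2).
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(2);
        tokenizer.setNGramMaxSize(2);

        // Convert the string attribute into a bag-of-words vector.
        StringToWordVector filter = new StringToWordVector();
        filter.setTokenizer(tokenizer);
        filter.setLowerCaseTokens(true);
        filter.setInputFormat(train);
        Instances filtered = Filter.useFilter(train, filter);

        // Save the filtered instances as 'T.arff' (text representation).
        ArffSaver saver = new ArffSaver();
        saver.setInstances(filtered);
        saver.setFile(new File("T.arff"));
        saver.writeBatch();

        // Create the classifier, build the model, and save it.
        NaiveBayesMultinomial mnb = new NaiveBayesMultinomial();
        mnb.buildClassifier(filtered);
        SerializationHelper.write("models/T.model", mnb);
    }
}
```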
      • Polarity Classifier
        • Initializes BidiMap objects for text, feature, and complex representation
          • We create DualHashBidiMap objects and initialize them with the values of the DualHashBidiMaps described in the Trainer sub-point
        • Initializes filter (StringToWordVector) and tokenizer (NGramTokenizer)
          • We choose StringToWordVector as the filter and NGramTokenizer as the tokenizer; the minimum and maximum n-gram size for the tokenizer is 2
        • Initializes classifiers (MNB and LibSVM)
          • Reads the models which were built previously
          • We use NaiveBayesMultinomial for the text, feature, and complex representations
          • We use LibSVM for the lexicon representation
          • We also build an Instances object for every representation, created from the training data (T.arff, F.arff, and C.arff)
        Below is a sketch of initializing the filter and tokenizer (the helper class and variable names are illustrative):
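
```java
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class FilterSetup {

    public static StringToWordVector buildFilter() {
        // NGramTokenizer with a minimum and maximum n-gram size of 2.
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(2);
        tokenizer.setNGramMaxSize(2);

        // StringToWordVector turns the raw tweet strings into word-vector attributes.
        StringToWordVector filter = new StringToWordVector();
        filter.setTokenizer(tokenizer);
        filter.setLowerCaseTokens(true);
        return filter;
    }
}
```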
        And below is a sketch of initializing the classifiers; the model and ARFF paths are illustrative:
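
```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierSetup {

    public static void initialize() throws Exception {
        // Read the three NaiveBayesMultinomial models (text, feature, complex)
        // and the LibSVM lexicon model saved by the Trainer.
        NaiveBayesMultinomial[] mnbClassifiers = new NaiveBayesMultinomial[3];
        mnbClassifiers[0] = (NaiveBayesMultinomial) SerializationHelper.read("models/T.model");
        mnbClassifiers[1] = (NaiveBayesMultinomial) SerializationHelper.read("models/F.model");
        mnbClassifiers[2] = (NaiveBayesMultinomial) SerializationHelper.read("models/C.model");
        Classifier lexiconClassifier = (Classifier) SerializationHelper.read("models/LEX.model");

        // Build an Instances object for every representation from the training
        // data, so the test data can later be made compatible with it.
        Instances textTrain = DataSource.read("T.arff");
        Instances featureTrain = DataSource.read("F.arff");
        Instances complexTrain = DataSource.read("C.arff");
        textTrain.setClassIndex(textTrain.numAttributes() - 1);
        featureTrain.setClassIndex(featureTrain.numAttributes() - 1);
        complexTrain.setClassIndex(complexTrain.numAttributes() - 1);
    }
}
```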
      • Tweet Preprocessor
        • Initializes preprocessor for lexicon, text, feature, and complex representation
          • We create the attribute objects that will be used to preprocess all the representations before classification. By preprocessing the words, we normalize them into a new sentence without ambiguities. Here are the preprocessors for every representation:
            • Lexicon Preprocessor
              • Sets the general score for every lexicon (using SentiWordNet as the resource)
              • Gets the abbreviations and stores them in a hash-table
                • Fetches the list of abbreviations and returns its contents in a hash-table
                • Here are some examples: afk=away from keyboard, aka=also known as, asap=as soon as possible
                • From these examples, the fetched results will be stored in a hash-table: <afk, away from keyboard>, <aka, also known as>, <asap, as soon as possible>
              • Gets the happy and sad emoticons and stores them in a linked list
                • Fetches the list of emoticons from a dictionary and stores them in a linked list
                • Here are some examples: :-), :), ;)
                • From these examples, the fetched results will be stored in a linked list: [:-), :), ;)]
              • Gets the lists of positive and negative words
            • Text Preprocessor
              • Gets the abbreviations and stores them in a hash-table
                • Fetches the list of abbreviations and returns its contents in a hash-table
                • Here are some examples: afk=away from keyboard, aka=also known as, asap=as soon as possible
                • From these examples, the fetched results will be stored in a hash-table: <afk, away from keyboard>, <aka, also known as>, <asap, as soon as possible>
              • Gets the happy and sad emoticons and stores them in a linked list
                • Fetches the list of emoticons from a dictionary and stores them in a linked list
                • Here are some examples: :-), :), ;)
                • From these examples, the fetched results will be stored in a linked list: [:-), :), ;)]
            • Feature Preprocessor
              • Gets the abbreviations and stores them in a hash-table
              • Gets the happy and sad emoticons and stores them in a linked list
              • Creates the combinations of the dot symbol and stores them in a linked list
              • Creates the combinations of the exclamation symbol and stores them in a linked list
            • Complex Preprocessor
              • This preprocessor only returns the Part of Speech (POS) of tweets
          Below is a sketch of how these attribute dictionaries (abbreviations, happy and sad emoticons, etc.) might be loaded; the file format and helper names are illustrative:
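
```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Hashtable;
import java.util.LinkedList;

public class FeatureAttributes {

    // Hypothetical helper: reads "abbreviation=expansion" lines from a dictionary
    // file and returns them in a hash-table, e.g. <afk, away from keyboard>.
    public static Hashtable<String, String> getAbbreviations(String path) throws Exception {
        Hashtable<String, String> abbreviations = new Hashtable<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] pair = line.split("=", 2);
                if (pair.length == 2) {
                    abbreviations.put(pair[0].trim(), pair[1].trim());
                }
            }
        }
        return abbreviations;
    }

    // Hypothetical helper: reads one emoticon per line, e.g. ":-)", ":)", ";)",
    // and returns them in a linked list.
    public static LinkedList<String> getEmoticons(String path) throws Exception {
        LinkedList<String> emoticons = new LinkedList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                emoticons.add(line.trim());
            }
        }
        return emoticons;
    }
}
```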
        • Initializes Part of Speech (POS) tagger
          • We create an object that will be used to analyze the POS using this tagger model: wsj-0-18-left3words-distsim.tagger. A small sketch follows.
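          Below is a sketch of initializing the Stanford POS tagger with that model file; the path must point to where the model is stored, and the example sentence is illustrative:

```java
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class PosTaggerSetup {

    public static void main(String[] args) {
        // Load the Stanford model file used by the application.
        MaxentTagger tagger = new MaxentTagger("wsj-0-18-left3words-distsim.tagger");

        // tagString appends each token's POS tag,
        // e.g. "I_PRP love_VBP this_DT phone_NN".
        System.out.println(tagger.tagString("I love this phone"));
    }
}
```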
      • Initializes filterer (StringToWordVector) and tokenizer (NGramTokenizer)
        • We use StringToWordVector as the filterer and NGramTokenizer as the tokenizer
      • Initializes the training and test data in case we need to use the sliding window
        • In the Supporting Model section, I explained briefly the classification scheme that this application uses
        • When we choose this scheme, we need to provide an alternative path in case we receive nan as the predicted class. To do that, we have to create a new training dataset which will store the current instance and be used to train a new model based on the "agreed" instances it contains
        Below is a sketch of creating the new training and test datasets for the sliding-window case; the attribute layout is a guess, not the application's exact schema:
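
```java
import java.util.ArrayList;

import weka.core.Attribute;
import weka.core.Instances;

public class SlidingWindowData {

    // Creates an empty dataset with a string attribute for the tweet text and
    // a nominal class {positive, negative}.
    public static Instances createEmptyDataset(String name) {
        ArrayList<Attribute> attributes = new ArrayList<>();
        attributes.add(new Attribute("text", (ArrayList<String>) null)); // string attribute

        ArrayList<String> classValues = new ArrayList<>();
        classValues.add("positive");
        classValues.add("negative");
        attributes.add(new Attribute("class", classValues));

        Instances data = new Instances(name, attributes, 1000); // capacity hint of 1000
        data.setClassIndex(data.numAttributes() - 1);
        return data;
    }

    public static void main(String[] args) {
        Instances train = createEmptyDataset("train");
        Instances test = createEmptyDataset("test");
        System.out.println(train.numInstances() + " train / " + test.numInstances() + " test");
    }
}
```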
    • Tweet initialization
      • Initializes the list of tweet sentiments
        • Empties the list which will store the final predicted classes
      • Initializes the list of preprocessed tweets
        • Empties the list which will store the preprocessed tweets
      • Initializes the list of class distributions
        • Empties the list which will store the class probabilities
    • Get Polarity
      • Starts the whole process, including preprocessing the given tweet, creating its different representations (stored in the "all[]" Instances), and testing it in the PolarityClassifier class
        • Preprocesses the lexicon, text, feature, and complex
          • Generally, the steps for preprocessing are:
            • Replaces emoticons - the current token is altered to "happy"/"sad"
            • Replaces Twitter features, such as links, mentions, and hash-tags
            • Replaces consecutive letters - more than 2 repeated letters are replaced with 2
            • Replaces negation - if the current token is a negation word, then current = "not"
            • Replaces abbreviations - if the current token is an abbreviation, it is replaced with its expansion
          Below is a sketch of this preprocessing chain; the replacement tokens follow the list above, while the negation list, the hashtag token, and the dictionary contents are illustrative:
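
```java
import java.util.HashMap;
import java.util.Map;

public class TweetPreprocessor {

    public static String preprocess(String tweet, Map<String, String> abbreviations) {
        String text = tweet.toLowerCase();

        // Replace emoticons - the current token is altered to "happy"/"sad".
        text = text.replace(":-)", " happy ").replace(":)", " happy ").replace(";)", " happy ");
        text = text.replace(":-(", " sad ").replace(":(", " sad ");

        // Replace Twitter features: links, mentions, hash-tags.
        text = text.replaceAll("(https?://|www\\.)\\S+", " urllinksymbol ");
        text = text.replaceAll("@\\w+", " usermentionsymbol ");
        text = text.replaceAll("#\\w+", " hashtagsymbol "); // token name is a guess

        // Replace consecutive letters - more than 2 repeated letters become 2.
        text = text.replaceAll("(\\w)\\1{2,}", "$1$1");

        // Replace negation words with "not" and expand abbreviations.
        StringBuilder out = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.equals("don't") || token.equals("never") || token.equals("no")) {
                out.append("not ");
            } else if (abbreviations.containsKey(token)) {
                out.append(abbreviations.get(token)).append(' ');
            } else {
                out.append(token).append(' ');
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        Map<String, String> abbr = new HashMap<>();
        abbr.put("asap", "as soon as possible");
        System.out.println(preprocess("@user I loooove this :) asap www.example.com", abbr));
    }
}
```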
        • Initializes the instances (test data) of lexicon, text, feature, and complex
          • Fetches all the instances created before (the instances that contain the test data)
    • Execution of Algorithm
      • Gets the instances of every test-data representation and applies a filter to them
        • Returns the instances of the text-based representation and applies the filter so that the format of the test data matches the format of the training data
      • Reformats the instances of lexicon, text, feature, and complex
        • Removes the attributes from the test data that are not used in the training data
        • Moreover, it also reorders every representation's attributes according to the train files
      • Applies the classifier
        • This is the main method that sets up all the processes of the Ensemble classifier. It returns the decision made by the two classifiers, namely:
          • Classifier for text, feature, and complex representation (HC)
            • Gets the probability for each class (positive/negative)
            • Uses double[] preds = mnb_classifiers[i].distributionForInstance(test.get(0))
            • Text-based representation
              • Distribution for positive (hc[0]): preds[0]*31.07
              • Distribution for negative (hc[1]): preds[1]*31.07
            • Feature-based representation
              • Distribution for positive (hc[2]): preds[0]*11.95
              • Distribution for negative (hc[3]): preds[1]*11.95
            • Complex-based representation
              • Distribution for positive (hc[4]): preds[0]*30.95
              • Distribution for negative (hc[5]): preds[1]*30.95
          • Classifier for lexicon representation (LC)
            • Gets the probability for each class (positive/negative)
            • Assume the resulting value is stored in a variable called lc_value
          • Counts the probabilities
            • HC classifier
              • Positive score (ps): (hc[0] + hc[2] + hc[4]) / 73.97
              • Negative score (ns): (hc[1] + hc[3] + hc[5]) / 73.97
              • Final score (hc_value): (1 + ps - ns) / 2
            • Comparison between HC and LC
              • If hc_value < 0.5 AND lc_value > 0.5: the output is negative
              • Else if hc_value > 0.5 AND lc_value < 0.5: the output is positive
              • Else: the output is nan
        Below is a sketch of the core ensemble step; the weights and decision rule follow the description above, while the method and variable names are illustrative:
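
```java
import weka.classifiers.Classifier;
import weka.core.Instances;

public class EnsembleClassifier {

    // Weights from the description above: text 31.07, feature 11.95,
    // complex 30.95 (which sum to 73.97).
    private static final double[] WEIGHTS = {31.07, 11.95, 30.95};

    // mnbClassifiers holds the three NaiveBayesMultinomial models; tests holds
    // the matching test Instances (text, feature, complex); lcValue is the
    // lexicon classifier's positive score.
    public static String apply(Classifier[] mnbClassifiers, Instances[] tests,
                               double lcValue) throws Exception {
        double[] hc = new double[6];
        for (int i = 0; i < 3; i++) {
            double[] preds = mnbClassifiers[i].distributionForInstance(tests[i].get(0));
            hc[2 * i] = preds[0] * WEIGHTS[i];     // positive distribution
            hc[2 * i + 1] = preds[1] * WEIGHTS[i]; // negative distribution
        }

        // Combine the weighted distributions into a single HC score.
        double ps = (hc[0] + hc[2] + hc[4]) / 73.97;
        double ns = (hc[1] + hc[3] + hc[5]) / 73.97;
        double hcValue = (1 + ps - ns) / 2;

        // Compare HC and LC, exactly as in the decision rule above.
        if (hcValue < 0.5 && lcValue > 0.5) {
            return "negative";
        } else if (hcValue > 0.5 && lcValue < 0.5) {
            return "positive";
        }
        return "nan"; // disagreement
    }
}
```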
      • Checks the sliding window
        • Case 0: uses the sliding window
          • If HC and LC agree (positive/negative), then put this document into the training data
          • Else, add the document to the test data; it will be classified at the end of the process
            • If the number of instances in the training data is more than 0
              • Calls clarifyOnSlidingWindow
                • Adds the instance to the end of the training data
                • Sets a filter for the training data and stores the filtered training data in a new Instances object
                • Prepares the training data and the test data
                • Builds a classifier from the training data (up to 1000 training instances)
                • Predicts the class based on the model just created
            • Else, the final class is unknown, because the training data is empty and the training process cannot run
          Below is a sketch of the sliding-window path; the window size and steps follow the description above, while the class and variable names are illustrative:
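
```java
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class SlidingWindow {

    private static final int WINDOW_SIZE = 1000;

    // Trains a fresh model on the last 1000 "agreed" documents and uses it to
    // classify a "disagreed" one. Assumes 'train' has its class index set and
    // that 'disagreed' shares its attribute structure.
    public static double classify(Instances train, Instance disagreed) throws Exception {
        // Keep only the most recent WINDOW_SIZE agreed instances.
        while (train.numInstances() > WINDOW_SIZE) {
            train.delete(0);
        }

        // Filter the training window into word vectors.
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(train);
        Instances filteredTrain = Filter.useFilter(train, filter);

        // Build a classifier on the window...
        NaiveBayesMultinomial mnb = new NaiveBayesMultinomial();
        mnb.buildClassifier(filteredTrain);

        // ...and push the disagreed document through the same filter
        // before predicting its class.
        filter.input(disagreed);
        Instance filteredTest = filter.output();
        filteredTest.setDataset(filteredTrain);
        return mnb.classifyInstance(filteredTest);
    }
}
```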
        • Case 1: does not use the sliding window
          • If HC and LC agree (positive/negative), then set the list of predicted classes from the three possible outcomes
          • Else,
            • Calls clarifyOnModel, which decides upon a "disagreed" (output class = nan) document by applying a previously built model
              • Gets the text-based representation of the document
              • Re-orders the attributes so that they are compatible with the training data
              • Finds the polarity of the document based on the previously built model, namely Liebherr, goethe, or Cisco
          Below is a sketch of this model-based path; the model path and method signature are illustrative:
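
```java
import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SerializationHelper;

public class ModelClarifier {

    // Classifies a "disagreed" document with a previously built model.
    // Assumes the instance's attributes were already re-ordered to be
    // compatible with the training data (see the step above).
    public static String classify(Instance textRepresentation, Instances trainHeader)
            throws Exception {
        Classifier model = (Classifier) SerializationHelper.read("models/prebuilt.model");

        // Attach the training header so the class attribute is known.
        textRepresentation.setDataset(trainHeader);

        double pred = model.classifyInstance(textRepresentation);
        return trainHeader.classAttribute().value((int) pred);
    }
}
```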
      • Done

References

Thanks to Petter Törnberg (copyright 2013) for the demo code used to analyze the sentiment value of a word. I implemented the code in SWN3.java with several modifications.

You can read the theories behind the production of this application from these resources:


Project Display and Code

You can find the entire source code on my GitHub.