Sentiment Analysis for Twitter using WEKA


Introduction

A strong signal of a successful product is that users want to use it because it fulfills their needs. To reach that point, company executives need to evaluate the product's performance once it is officially released to the public. One way to do that is to learn how users react to the product's quality. Knowing the users' reactions lets the company improve production quality, since it learns about user expectations; it also encourages paying more attention to user-driven development (UDD), a process in which the needs of a product's users become the top priority in the product development process.

When it comes to evaluating a product's quality, executives can find public opinion on social media such as Twitter, Facebook, and Instagram. They can search for their product and analyze users' opinions about it. They could do this manually, but of course it would take a long time to finish the evaluation. So they need something that does this job automatically, letting them be more productive than if they merely labeled every tweet as a positive or negative opinion by hand.

One of the common applications used for this job is a sentiment analyzer. This kind of app determines how positive or negative a piece of text is, using Natural Language Processing as the core technology for analyzing and understanding the content of a sentence. In this brief article, I will explain how this application classifies the sentiment of tweets extracted from Twitter. So, let's start!


Tech Stack

Programming Languages:

  • Java (JSP, Servlet)

Frameworks and Libraries:

  • WEKA
  • AngularJS
  • Bootstrap
  • jQuery
  • twitter4j (Java library for the Twitter API)

Supporting Model

This application uses a classification scheme called SlidingWindow, which determines the model to be used when the “disagreement” condition is met. “Disagreement” is the condition in which the core classifier cannot decide whether a tweet belongs to the positive or the negative class.

If we choose to use SlidingWindow, there are two possible conditions when we receive the classification result, namely:

  • If the core classifier returns positive or negative as the predicted class,
    • put the corresponding instance into the training data. We should also check whether the total number of instances exceeds 1000; if so, remove the first (oldest) instance
  • If the core classifier returns nan as the predicted class,
    • put the corresponding instance into the test data; it will be classified at the end of the process
    • For this condition, we should also consider these possibilities:
      • If the number of instances in the training data is more than 0,
        • decide upon the "disagreed" document by applying a model learned from the last 1000 "agreed" documents (stored in the "train" instances)
      • Else,
        • the class is unknown, because the training data is empty and the training process cannot run. In this condition, we can just return a random class as the final result

If we choose not to use SlidingWindow, then there are also two possible conditions, namely:

  • If the core classifier returns positive or negative as the predicted class,
    • set the list of predicted classes from the three possible outcomes
  • If the core classifier returns nan as the predicted class,
    • decide upon the "disagreed" (out = nan) document by applying a previously built model

General Concept

In this part, I will explain the core principles of this sentiment analysis, along with some code snippets, so that you can understand the backend logic better.

Extracts Twitter Data

  • Sets the query
    • We begin by fetching the user's query and assigning the value to a global variable.
  • Extracts tweets
    • Afterwards, we fetch the tweets that contain the keyword entered by the user. This step uses the Twitter API, so it needs some authentication tokens (you can get them from your Twitter developer account). A sketch of this step is shown after this list.
  • Sets the tweet text
    • To achieve optimal classification results, we only need the text of each tweet. That means we do not need the other supporting features, such as the publication time, the username and display name of the person posting the tweet, the location it was posted from, etc. So, we just store the tweet text in a list. Simple!
  • Sets the number of tweets
    • The last step in this stage is simply to store the number of extracted tweets.
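
Below is a minimal sketch of this extraction stage using twitter4j. The OAuth credentials are placeholders, and the class and method names (TweetExtractor, fetchTweetTexts) are illustrative rather than the application's actual ones:

```java
import java.util.ArrayList;
import java.util.List;

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;
import twitter4j.conf.ConfigurationBuilder;

public class TweetExtractor {

    public static List<String> fetchTweetTexts(String keyword, int maxTweets) throws Exception {
        // Placeholder credentials; use the tokens from your Twitter developer account.
        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey("CONSUMER_KEY")
                .setOAuthConsumerSecret("CONSUMER_SECRET")
                .setOAuthAccessToken("ACCESS_TOKEN")
                .setOAuthAccessTokenSecret("ACCESS_TOKEN_SECRET");
        Twitter twitter = new TwitterFactory(cb.build()).getInstance();

        // Search for tweets containing the user's keyword.
        Query query = new Query(keyword);
        query.setCount(Math.min(maxTweets, 100)); // the API caps one page at 100 tweets
        QueryResult result = twitter.search(query);

        // Keep only the text of each tweet; the other features are not needed.
        List<String> texts = new ArrayList<>();
        for (Status status : result.getTweets()) {
            texts.add(status.getText());
        }
        return texts;
    }
}
```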

Sentiment Analysis

  • Initializes Sentiment Processor
    • Sentiment Analyser
      • Trainer
        • Initializes BidiMap objects for text, feature, and complex representation
          • We use a DualHashBidiMap that stores pairs of String and Integer. Its content is all of the attributes from the dataset for each tweet representation, namely text, feature, and complex. I will explain these three representations in the next sub-points.
        • Trains lexicon model
          • This is supervised learning using the LibSVM classifier
          • It trains the model for the lexicon-based representation
          • It saves the model in order to use it on the provided test sets
          • The rest of the representation models will be created on-the-fly, because the minimum term frequency threshold takes both the train and test sets into consideration
        • Trains text model
          • The idea is that this model determines the sentiment value based on the whole opinion or on divisions of it
          • It builds and saves the text-based model built on the training set
          • The training set has already been preprocessed: every special symbol is converted into a string representing that symbol (for example, @user becomes 'usermentionsymbol', www.example.com becomes 'urllinksymbol', etc.)
          • The general steps are:
            • Retrieves a new Instances object with a filter applied
            • Saves the instances to a file named according to the type of representation; in this case we use the text representation, so the name will be 'T.arff'
            • Writes the attributes from the filtered instances (which can be retrieved from 'T.arff') to a file in the 'attributes' folder (text.tsv). The attributes are the tokenized words, and the representation depends on the tokenizer used
            • Creates the classifier (NaiveBayesMultinomial), builds the model, and saves it
        • Trains feature model
          • The idea is that this model determines the sentiment value based on Twitter features, such as links, mentions, and hash-tags
          • It builds and saves the feature-based model built on the training set
          • The training set has already been preprocessed: every special symbol is converted into a string representing that symbol (for example, @user becomes 'usermentionsymbol', www.example.com becomes 'urllinksymbol', etc.)
          • The general steps are:
            • Retrieves a new Instances object with a filter applied
            • Saves the instances to a file named according to the type of representation; in this case we use the feature representation, so the name will be 'F.arff'
            • Writes the attributes from the filtered instances (which can be retrieved from 'F.arff') to a file in the 'attributes' folder (feature.tsv). The attributes are the tokenized words, and the representation depends on the tokenizer used
            • Creates the classifier (NaiveBayesMultinomial), builds the model, and saves it
        • Trains complex model
          • The idea is that this model determines the sentiment value based on the combination of text and POS (Part of Speech); each word is assigned its corresponding POS
          • It builds and saves the complex-based model built on the training set
          • The training set has already been preprocessed: every special symbol is converted into a string representing that symbol (for example, @user becomes 'usermentionsymbol', www.example.com becomes 'urllinksymbol', etc.)
          • The general steps are:
            • Retrieves a new Instances object with a filter applied
            • Saves the instances to a file named according to the type of representation; in this case we use the complex representation, so the name will be 'C.arff'
            • Writes the attributes from the filtered instances (which can be retrieved from 'C.arff') to a file in the 'attributes' folder (complex.tsv). The attributes are the tokenized words, and the representation depends on the tokenizer used
            • Creates the classifier (NaiveBayesMultinomial), builds the model, and saves it
        Below is a sketch of what this text-based training step could look like. The WEKA classes and calls are real, but the class name, method name, and file paths are illustrative:
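
```java
import java.io.File;

import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ArffSaver;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TextModelTrainer {

    // Trains the text-based representation model. Assumes 'train' already has
    // its class attribute set and holds the preprocessed tweet strings, and
    // that the models/ directory exists.
    public static void trainTextModel(Instances train) throws Exception {
        // Tokenize the tweet text into bigrams (min and max n-gram size of 2).
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(2);
        tokenizer.setNGramMaxSize(2);

        // Convert the string attribute into a bag-of-words vector.
        StringToWordVector filter = new StringToWordVector();
        filter.setTokenizer(tokenizer);
        filter.setLowerCaseTokens(true);
        filter.setInputFormat(train);
        Instances filtered = Filter.useFilter(train, filter);

        // Save the filtered instances as 'T.arff' (text representation).
        ArffSaver saver = new ArffSaver();
        saver.setInstances(filtered);
        saver.setFile(new File("T.arff"));
        saver.writeBatch();

        // Create the classifier, build the model, and save it.
        NaiveBayesMultinomial mnb = new NaiveBayesMultinomial();
        mnb.buildClassifier(filtered);
        SerializationHelper.write("models/T.model", mnb);
    }
}
```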
      • Polarity Classifier
        • Initializes BidiMap objects for text, feature, and complex representation
          • We create DualHashBidiMap objects and initialize them with the values of the DualHashBidiMaps described in the Trainer sub-point
        • Initializes filter (StringToWordVector) and tokenizer (NGramTokenizer)
          • We choose StringToWordVector as the filter and NGramTokenizer as the tokenizer; the minimum and maximum n-gram size for the tokenizer is 2
        • Initializes classifiers (MNB and LibSVM)
          • Reads the models which were built previously
          • We use NaiveBayesMultinomial for the text, feature, and complex representations
          • We use LibSVM for the lexicon representation
          • We also build an Instances object for every representation, created from the training data (T.arff, F.arff, and C.arff)
        Below is a sketch of initializing the filter and tokenizer (the helper class and variable names are illustrative):
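
```java
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class FilterSetup {

    public static StringToWordVector buildFilter() {
        // NGramTokenizer with a minimum and maximum n-gram size of 2.
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(2);
        tokenizer.setNGramMaxSize(2);

        // StringToWordVector turns the raw tweet strings into word-vector attributes.
        StringToWordVector filter = new StringToWordVector();
        filter.setTokenizer(tokenizer);
        filter.setLowerCaseTokens(true);
        return filter;
    }
}
```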
        And below is a sketch of initializing the classifiers; the model and ARFF paths are illustrative:
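
```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierSetup {

    public static void initialize() throws Exception {
        // Read the three NaiveBayesMultinomial models (text, feature, complex)
        // and the LibSVM lexicon model saved by the Trainer.
        NaiveBayesMultinomial[] mnbClassifiers = new NaiveBayesMultinomial[3];
        mnbClassifiers[0] = (NaiveBayesMultinomial) SerializationHelper.read("models/T.model");
        mnbClassifiers[1] = (NaiveBayesMultinomial) SerializationHelper.read("models/F.model");
        mnbClassifiers[2] = (NaiveBayesMultinomial) SerializationHelper.read("models/C.model");
        Classifier lexiconClassifier = (Classifier) SerializationHelper.read("models/LEX.model");

        // Build an Instances object for every representation from the training
        // data, so the test data can later be made compatible with it.
        Instances textTrain = DataSource.read("T.arff");
        Instances featureTrain = DataSource.read("F.arff");
        Instances complexTrain = DataSource.read("C.arff");
        textTrain.setClassIndex(textTrain.numAttributes() - 1);
        featureTrain.setClassIndex(featureTrain.numAttributes() - 1);
        complexTrain.setClassIndex(complexTrain.numAttributes() - 1);
    }
}
```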
      • Tweet Preprocessor
        • Initializes preprocessor for lexicon, text, feature, and complex representation
          • We create the attribute objects that will be used to preprocess all the representations before classification. By preprocessing the words, we normalize them into a new sentence without ambiguities. Here are the preprocessors for every representation:
            • Lexicon Preprocessor
              • Sets the general score for every lexicon (using SentiWordNet as the resource)
              • Gets the abbreviations and stores them in a hash-table
                • Fetches the list of abbreviations and returns its contents in a hash-table
                • Here are some examples: afk=away from keyboard, aka=also known as, asap=as soon as possible
                • From these examples, the fetched results will be stored in a hash-table: <afk, away from keyboard>, <aka, also known as>, <asap, as soon as possible>
              • Gets the happy and sad emoticons and stores them in a linked list
                • Fetches the list of emoticons from a dictionary and stores them in a linked list
                • Here are some examples: :-), :), ;)
                • From these examples, the fetched results will be stored in a linked list: [:-), :), ;)]
              • Gets the lists of positive and negative words
            • Text Preprocessor
              • Gets the abbreviations and stores them in a hash-table
                • Fetches the list of abbreviations and returns its contents in a hash-table
                • Here are some examples: afk=away from keyboard, aka=also known as, asap=as soon as possible
                • From these examples, the fetched results will be stored in a hash-table: <afk, away from keyboard>, <aka, also known as>, <asap, as soon as possible>
              • Gets the happy and sad emoticons and stores them in a linked list
                • Fetches the list of emoticons from a dictionary and stores them in a linked list
                • Here are some examples: :-), :), ;)
                • From these examples, the fetched results will be stored in a linked list: [:-), :), ;)]
            • Feature Preprocessor
              • Gets the abbreviations and stores them in a hash-table
              • Gets the happy and sad emoticons and stores them in a linked list
              • Creates the combinations of the dot symbol and stores them in a linked list
              • Creates the combinations of the exclamation symbol and stores them in a linked list
            • Complex Preprocessor
              • This preprocessor only returns the Part of Speech (POS) of tweets
          Below is a sketch of how these attribute dictionaries (abbreviations, happy and sad emoticons, etc.) might be loaded; the file format and helper names are illustrative:
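
```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Hashtable;
import java.util.LinkedList;

public class FeatureAttributes {

    // Hypothetical helper: reads "abbreviation=expansion" lines from a dictionary
    // file and returns them in a hash-table, e.g. <afk, away from keyboard>.
    public static Hashtable<String, String> getAbbreviations(String path) throws Exception {
        Hashtable<String, String> abbreviations = new Hashtable<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] pair = line.split("=", 2);
                if (pair.length == 2) {
                    abbreviations.put(pair[0].trim(), pair[1].trim());
                }
            }
        }
        return abbreviations;
    }

    // Hypothetical helper: reads one emoticon per line, e.g. ":-)", ":)", ";)",
    // and returns them in a linked list.
    public static LinkedList<String> getEmoticons(String path) throws Exception {
        LinkedList<String> emoticons = new LinkedList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                emoticons.add(line.trim());
            }
        }
        return emoticons;
    }
}
```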
        • Initializes Part of Speech (POS) tagger
          • We create an object that will be used to analyze the POS using this tagger model: wsj-0-18-left3words-distsim.tagger. A small sketch follows.
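          Below is a sketch of initializing the Stanford POS tagger with that model file; the path must point to where the model is stored, and the example sentence is illustrative:

```java
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class PosTaggerSetup {

    public static void main(String[] args) {
        // Load the Stanford model file used by the application.
        MaxentTagger tagger = new MaxentTagger("wsj-0-18-left3words-distsim.tagger");

        // tagString appends each token's POS tag,
        // e.g. "I_PRP love_VBP this_DT phone_NN".
        System.out.println(tagger.tagString("I love this phone"));
    }
}
```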
      • Initializes filterer (StringToWordVector) and tokenizer (NGramTokenizer)
        • We use StringToWordVector as the filterer and NGramTokenizer as the tokenizer
      • Initializes the training and test data in case we need to use the sliding window
        • In the Supporting Model section, I explained briefly the classification scheme that this application uses
        • When we choose this scheme, we need to provide an alternative path in case we receive nan as the predicted class. To do that, we have to create a new training dataset which will store the current instance and be used to train a new model based on the "agreed" instances it contains
        Below is a sketch of creating the new training and test datasets for the sliding-window case; the attribute layout is a guess, not the application's exact schema:
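
```java
import java.util.ArrayList;

import weka.core.Attribute;
import weka.core.Instances;

public class SlidingWindowData {

    // Creates an empty dataset with a string attribute for the tweet text and
    // a nominal class {positive, negative}.
    public static Instances createEmptyDataset(String name) {
        ArrayList<Attribute> attributes = new ArrayList<>();
        attributes.add(new Attribute("text", (ArrayList<String>) null)); // string attribute

        ArrayList<String> classValues = new ArrayList<>();
        classValues.add("positive");
        classValues.add("negative");
        attributes.add(new Attribute("class", classValues));

        Instances data = new Instances(name, attributes, 1000); // capacity hint of 1000
        data.setClassIndex(data.numAttributes() - 1);
        return data;
    }

    public static void main(String[] args) {
        Instances train = createEmptyDataset("train");
        Instances test = createEmptyDataset("test");
        System.out.println(train.numInstances() + " train / " + test.numInstances() + " test");
    }
}
```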
    • Tweet initialization
      • Initializes the list of tweet sentiments
        • Empties the list which will store the final predicted classes
      • Initializes the list of preprocessed tweets
        • Empties the list which will store the preprocessed tweets
      • Initializes the list of class distributions
        • Empties the list which will store the class probabilities
    • Get Polarity
      • Starts the whole process, including preprocessing the given tweet, creating its different representations (stored in the "all[]" Instances), and testing it in the PolarityClassifier class
        • Preprocesses the lexicon, text, feature, and complex
          • Generally, the steps for preprocessing are:
            • Replaces emoticons - the current token is altered to "happy"/"sad"
            • Replaces Twitter features, such as links, mentions, and hash-tags
            • Replaces consecutive letters - more than 2 repeated letters are replaced with 2
            • Replaces negation - if the current token is a negation word, then current = "not"
            • Replaces abbreviations - if the current token is an abbreviation, it is replaced with its expansion
          Below is a sketch of this preprocessing chain; the replacement tokens follow the list above, while the negation list, the hashtag token, and the dictionary contents are illustrative:
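
```java
import java.util.HashMap;
import java.util.Map;

public class TweetPreprocessor {

    public static String preprocess(String tweet, Map<String, String> abbreviations) {
        String text = tweet.toLowerCase();

        // Replace emoticons - the current token is altered to "happy"/"sad".
        text = text.replace(":-)", " happy ").replace(":)", " happy ").replace(";)", " happy ");
        text = text.replace(":-(", " sad ").replace(":(", " sad ");

        // Replace Twitter features: links, mentions, hash-tags.
        text = text.replaceAll("(https?://|www\\.)\\S+", " urllinksymbol ");
        text = text.replaceAll("@\\w+", " usermentionsymbol ");
        text = text.replaceAll("#\\w+", " hashtagsymbol "); // token name is a guess

        // Replace consecutive letters - more than 2 repeated letters become 2.
        text = text.replaceAll("(\\w)\\1{2,}", "$1$1");

        // Replace negation words with "not" and expand abbreviations.
        StringBuilder out = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.equals("don't") || token.equals("never") || token.equals("no")) {
                out.append("not ");
            } else if (abbreviations.containsKey(token)) {
                out.append(abbreviations.get(token)).append(' ');
            } else {
                out.append(token).append(' ');
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        Map<String, String> abbr = new HashMap<>();
        abbr.put("asap", "as soon as possible");
        System.out.println(preprocess("@user I loooove this :) asap www.example.com", abbr));
    }
}
```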
        • Initializes the instances (test data) of lexicon, text, feature, and complex
          • Fetches all the instances created before (the instances that contain the test data)
    • Execution of Algorithm
      • Gets the instances of every test-data representation and applies a filter to them
        • Returns the instances of the text-based representation and applies the filter so that the format of the test data matches the format of the training data
      • Reformats the instances of lexicon, text, feature, and complex
        • Removes the attributes from the test data that are not used in the training data
        • Moreover, it also reorders every representation's attributes according to the train files
      • Applies the classifier
        • This is the main method that sets up all the processes of the Ensemble classifier. It returns the decision made by the two classifiers, namely:
          • Classifier for text, feature, and complex representation (HC)
            • Gets the probability for each class (positive/negative)
            • Uses double[] preds = mnb_classifiers[i].distributionForInstance(test.get(0))
            • Text-based representation
              • Distribution for positive (hc[0]): preds[0]*31.07
              • Distribution for negative (hc[1]): preds[1]*31.07
            • Feature-based representation
              • Distribution for positive (hc[2]): preds[0]*11.95
              • Distribution for negative (hc[3]): preds[1]*11.95
            • Complex-based representation
              • Distribution for positive (hc[4]): preds[0]*30.95
              • Distribution for negative (hc[5]): preds[1]*30.95
          • Classifier for lexicon representation (LC)
            • Gets the probability for each class (positive/negative)
            • Assume the resulting value is stored in a variable called lc_value
          • Counts the probabilities
            • HC classifier
              • Positive score (ps): (hc[0] + hc[2] + hc[4]) / 73.97
              • Negative score (ns): (hc[1] + hc[3] + hc[5]) / 73.97
              • Final score (hc_value): (1 + ps - ns) / 2
            • Comparison between HC and LC
              • If hc_value < 0.5 AND lc_value > 0.5: the output is negative
              • Else if hc_value > 0.5 AND lc_value < 0.5: the output is positive
              • Else: the output is nan
        Below is a sketch of the core ensemble step; the weights and decision rule follow the description above, while the method and variable names are illustrative:
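
```java
import weka.classifiers.Classifier;
import weka.core.Instances;

public class EnsembleClassifier {

    // Weights from the description above: text 31.07, feature 11.95,
    // complex 30.95 (which sum to 73.97).
    private static final double[] WEIGHTS = {31.07, 11.95, 30.95};

    // mnbClassifiers holds the three NaiveBayesMultinomial models; tests holds
    // the matching test Instances (text, feature, complex); lcValue is the
    // lexicon classifier's positive score.
    public static String apply(Classifier[] mnbClassifiers, Instances[] tests,
                               double lcValue) throws Exception {
        double[] hc = new double[6];
        for (int i = 0; i < 3; i++) {
            double[] preds = mnbClassifiers[i].distributionForInstance(tests[i].get(0));
            hc[2 * i] = preds[0] * WEIGHTS[i];     // positive distribution
            hc[2 * i + 1] = preds[1] * WEIGHTS[i]; // negative distribution
        }

        // Combine the weighted distributions into a single HC score.
        double ps = (hc[0] + hc[2] + hc[4]) / 73.97;
        double ns = (hc[1] + hc[3] + hc[5]) / 73.97;
        double hcValue = (1 + ps - ns) / 2;

        // Compare HC and LC, exactly as in the decision rule above.
        if (hcValue < 0.5 && lcValue > 0.5) {
            return "negative";
        } else if (hcValue > 0.5 && lcValue < 0.5) {
            return "positive";
        }
        return "nan"; // disagreement
    }
}
```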
      • Checks the sliding window
        • Case 0: uses the sliding window
          • If HC and LC agree (positive/negative), then put this document into the training data
          • Else, add the document to the test data; it will be classified at the end of the process
            • If the number of instances in the training data is more than 0
              • Calls clarifyOnSlidingWindow
                • Adds the instance to the end of the training data
                • Sets a filter for the training data and stores the filtered training data in a new Instances object
                • Prepares the training data and the test data
                • Builds a classifier from the training data (up to 1000 training instances)
                • Predicts the class based on the model just created
            • Else, the final class is unknown, because the training data is empty and the training process cannot run
          Below is a sketch of the sliding-window path; the window size and steps follow the description above, while the class and variable names are illustrative:
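
```java
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class SlidingWindow {

    private static final int WINDOW_SIZE = 1000;

    // Trains a fresh model on the last 1000 "agreed" documents and uses it to
    // classify a "disagreed" one. Assumes 'train' has its class index set and
    // that 'disagreed' shares its attribute structure.
    public static double classify(Instances train, Instance disagreed) throws Exception {
        // Keep only the most recent WINDOW_SIZE agreed instances.
        while (train.numInstances() > WINDOW_SIZE) {
            train.delete(0);
        }

        // Filter the training window into word vectors.
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(train);
        Instances filteredTrain = Filter.useFilter(train, filter);

        // Build a classifier on the window...
        NaiveBayesMultinomial mnb = new NaiveBayesMultinomial();
        mnb.buildClassifier(filteredTrain);

        // ...and push the disagreed document through the same filter
        // before predicting its class.
        filter.input(disagreed);
        Instance filteredTest = filter.output();
        filteredTest.setDataset(filteredTrain);
        return mnb.classifyInstance(filteredTest);
    }
}
```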
        • Case 1: does not use the sliding window
          • If HC and LC agree (positive/negative), then set the list of predicted classes from the three possible outcomes
          • Else,
            • Calls clarifyOnModel, which decides upon a "disagreed" (output class = nan) document by applying a previously built model
              • Gets the text-based representation of the document
              • Re-orders the attributes so that they are compatible with the training data
              • Finds the polarity of the document based on the previously built model, namely Liebherr, goethe, or Cisco
          Below is a sketch of this model-based path; the model path and method signature are illustrative:
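
```java
import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SerializationHelper;

public class ModelClarifier {

    // Classifies a "disagreed" document with a previously built model.
    // Assumes the instance's attributes were already re-ordered to be
    // compatible with the training data (see the step above).
    public static String classify(Instance textRepresentation, Instances trainHeader)
            throws Exception {
        Classifier model = (Classifier) SerializationHelper.read("models/prebuilt.model");

        // Attach the training header so the class attribute is known.
        textRepresentation.setDataset(trainHeader);

        double pred = model.classifyInstance(textRepresentation);
        return trainHeader.classAttribute().value((int) pred);
    }
}
```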
      • Done

References

Thanks to Petter Törnberg (copyright 2013) for the demo code used to analyze the sentiment value of a word. I implemented the code in SWN3.java with several modifications.

You can read the theories behind the production of this application from these resources:


Project Display and Code

You can find the entire source code on my GitHub.