Sentiment Analysis of Twitter Data


Sentiment Analysis of Twitter Data – Twitter is a microblogging service founded in 2006, on which users post short messages called tweets.

It was originally designed as an SMS-based service, in which a message is restricted to 160 characters: 140 characters for the tweet itself and the remaining 20 reserved for the username.

Twitter users may subscribe to the tweets posted by other users, referred to as “following”.

The service can be accessed through the Twitter website or through the applications for smartphones and tablets.

Twitter users have adopted different conventions such as replies, retweets, and hashtags in their tweets.

A Twitter reply, denoted by @username, indicates that the tweet is a response to a tweet posted by another user.

 

Retweets on Twitter

Retweets are used to republish the content of another tweet using the format RT @username.

A word prefixed with the hash symbol (#) represents the context of the message, e.g., #demonetization, #PMModi, etc.

The size restriction and content-sharing mechanisms of Twitter have created a unique dialect that includes many abbreviations, acronyms, misspelled words, and emoticons that are not used in traditional media, e.g., hbd, gr8, lv.

 


Trending Topics on Twitter

Words and phrases that are frequently used during a specific time period are known as “trending topics”.

These topics are listed by the platform for different regions of the world, and can also be personal to a user.

 

Nowadays, Twitter has become the most popular microblogging platform with hundreds of millions of users spreading millions of personal posts on a daily basis.

The rich, high-volume data propagated on Twitter provides huge opportunities for studying public sentiment and for analyzing consumer opinions, behaviors, and trends.

Tweets published by public accounts can be freely retrieved using one of the two Twitter APIs:

  1. The REST search API, which allows the submission of queries composed of key terms, and
  2. The Streaming API, from which a real-time sample of public posts can be retrieved.

These APIs enable retrieval of domain-specific tweets restricted to certain words, users, geographical location, or time periods, in order to analyze tweets associated with a particular event and location.
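As a rough illustration only, the two retrieval modes might look like this with the rtweet package (the package choice and the “demonetization” query are assumptions, and valid OAuth credentials are assumed to be in place, as described later in this post):

```r
# Illustrative sketch, assuming the rtweet package and existing OAuth credentials.
library(rtweet)

# 1. REST search API: submit a query composed of key terms.
recent <- search_tweets("demonetization", n = 1000, lang = "en")

# 2. Streaming API: collect a real-time sample of matching public posts
#    for 60 seconds.
live <- stream_tweets("demonetization", timeout = 60)
```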

 

Twitter Popularity

As of July 2014, Twitter had about 270 million active users generating 500 million tweets per day.

Sentiment Analysis

Tweet messages usually convey sentiment and emotions.

However, unlike conventional text, tweets have some specific characteristics that make sentiment analysis over Twitter data more difficult than sentiment analysis of conventional text.

Tweet messages are very short, which encourages the use of abbreviations, irregular expressions, poor grammar, and misspellings.

Such characteristics make it hard to identify the sentiment content of tweets using sentiment analysis approaches designed for conventional text.

Similar to conventional sentiment analysis, existing approaches to Twitter sentiment analysis can be divided into two groups:

 

  • Machine learning and
  • Lexicon-based approaches

 

The machine learning approach to sentiment analysis of Twitter data

The machine learning approach to sentiment analysis is characterized by the use of a machine learning algorithm and a training corpus.

The approach usually operates in two phases: a training phase and an inference phase.

In the training phase, a machine learning algorithm is used to train the classification model using a set of features extracted from the tweets of the training corpus.

In the inference phase, the trained model is used to infer the sentiment labels of unseen tweets, i.e., tweets whose sentiment classes are unknown.
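As a minimal sketch of these two phases (not the exact pipeline used later in this post), a bag-of-words Naive Bayes classifier in R could be trained and applied as follows; the tm and e1071 packages and the tiny labelled corpus are assumptions made purely for illustration:

```r
# Minimal two-phase sketch: training on a labelled corpus, then inference on
# an unseen tweet. The corpus and labels below are hypothetical.
library(tm)
library(e1071)

train_text  <- c("i love this decision", "worst move ever",
                 "great step forward", "really bad idea")
train_label <- factor(c("positive", "negative", "positive", "negative"))

# Bag-of-words features: binary presence/absence of each term.
train_dtm <- DocumentTermMatrix(Corpus(VectorSource(train_text)))
terms     <- Terms(train_dtm)
as_features <- function(m) {
  # Convert a term matrix into TRUE/FALSE factors with fixed levels.
  data.frame(lapply(as.data.frame(as.matrix(m) > 0),
                    factor, levels = c(FALSE, TRUE)))
}

# Training phase: fit the classifier on features extracted from the corpus.
model <- naiveBayes(as_features(train_dtm), train_label, laplace = 1)

# Inference phase: label an unseen tweet using the same term dictionary.
new_dtm <- DocumentTermMatrix(Corpus(VectorSource("what a great decision")),
                              control = list(dictionary = terms))
predict(model, as_features(new_dtm))
```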

 

Authentication handler

Twitter provides an authentication mechanism to allow controlled and limited access to user data.

Twitter currently uses the OAuth authorization mechanism, which allows users to grant third-party applications access to their data without sharing their username and password.

When an application is registered, Twitter provides a consumer key and consumer secret that uniquely identify the application; these can be used to retrieve an access token and access secret from Twitter, which are later used to obtain authorization to access the user’s data.
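A sketch of this handshake with the rtweet package (an assumption; any OAuth-capable Twitter client would work), with placeholder credentials standing in for the values Twitter issues on application registration:

```r
# OAuth setup sketch, assuming the rtweet package; the app name and the four
# credential strings are placeholders.
library(rtweet)

token <- create_token(
  app             = "my_sentiment_app",   # hypothetical application name
  consumer_key    = "CONSUMER_KEY",
  consumer_secret = "CONSUMER_SECRET",
  access_token    = "ACCESS_TOKEN",
  access_secret   = "ACCESS_SECRET"
)
# Later search/streaming calls in the session pick up this token automatically.
```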

 

Twitter streaming API

Once authorization has been completed, the streaming API instance filters real-time public tweets containing the keywords or topics of interest.

Twitter provides tweets on the basis of keywords, a particular period of time, and trending hashtags related to current issues.

Using the streaming API, tweets can be extracted according to requirements such as keyword, time period, location, or a particular user.

Queries based on the time at which a tweet was created, the language in which it is written, and the location of the user can be executed to extract tweets.

The streamed data is stored as a CSV file for further use and processing in the sentiment analysis pipeline.
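For example, a keyword-filtered stream could be collected and written to CSV roughly as follows (the rtweet package and the column names of its output are assumptions):

```r
# Sketch: stream tweets matching a keyword for two minutes and store the
# fields needed later as a CSV file.
library(rtweet)

demonet <- stream_tweets("demonetization", timeout = 120)

keep <- demonet[, c("created_at", "screen_name", "text", "lang")]
write.csv(keep, "demonetization_tweets.csv", row.names = FALSE)
```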

 

Filtering Tweets

The extracted streaming tweets may be in multiple languages, but in this work only English-language tweets are selected.

The language restriction is applied by Twitter itself: in the query for fetching tweets, the value ‘en’ for the language attribute requests tweets that Twitter has identified as English.

The query requests tweets about demonetization, in English, within a particular period of time.

By default, the location of tweets is taken to be the user’s current location.
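If the language field is kept when the stream is saved, the same restriction can also be applied after the fact, as in this small sketch (file and column names follow the CSV assumed earlier):

```r
# Keep only tweets that Twitter itself identified as English ('en').
tweets <- read.csv("demonetization_tweets.csv", stringsAsFactors = FALSE)
tweets <- tweets[tweets$lang == "en", ]
```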

 

Data Transformation (Cleaning and Parsing)

The information extracted from the tweets of the selected dataset was in raw form.

In order to use them, the raw tweets had to be cleaned and transformed into a more usable, structured dataset so that the subsequent phases of the process are smooth, efficient, and effective.

In this step, a set of rules is applied to remove retweets, near-duplicate tweets, very short tweets, and symbol-only tweets.

 

Feature Extraction

The following pre-processing steps have been used to facilitate and simplify automated feature extraction:

Tweet cleaning

Tweet cleaning is the first step of data transformation. This task consists of several subtasks.

An overview of the subtasks is given below, followed by a rough R sketch.

  • Remove retweets: All tweets that contain the commonly used string “RT” (denoting a retweet or repost) in the body of the tweet are removed.
  • Remove uninformative tweets: Short tweets that are unlikely to contain informative data are removed; a minimum tweet length of 20 characters is used.
  • Remove non-English tweets: Each word in a tweet is compared against a common English word list, and tweets with less than 15% of their content matching are removed.
  • Remove duplicate tweets: After comparing tweets with one another, tweets whose content matches another tweet by 90% or more are discarded.
  • Converting all tweets to lower case: The next step is to convert all tweets to lower case to bring them into a consistent form, so that further transformation and classification do not have to deal with inconsistent casing. This is done with the tolower() function in R, which converts all letters to lower case and removes the case-sensitivity problem.
  • Removing emoticons and punctuation: Emoticons and punctuation are removed because they are not needed in the analysis. When extracted, the emoticons appeared as square boxes rather than proper symbols; these garbage values were replaced with empty spaces, clearing the data of the emoticon clutter.
  • Removing URLs: URLs provide no information for the analysis; they only link to other webpages and websites. They are removed from all tweets by replacing every substring starting with “http” with a blank space, using gsub() in R.
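A rough R sketch of these subtasks is given below; the regular expressions and the exact-duplicate check are simplifications of the rules described above (the 90% near-duplicate rule and the English word-list check would need an additional similarity measure):

```r
# Simplified cleaning sketch over the tweet text kept from the previous step.
text <- tweets$text

text <- text[!grepl("\\bRT\\b", text)]        # remove retweets ("RT ...")
text <- text[nchar(text) >= 20]               # remove short, uninformative tweets
text <- tolower(text)                         # consistent lower case
text <- gsub("http\\S+", " ", text)           # remove URLs
text <- iconv(text, to = "ASCII", sub = " ")  # drop emoticon "boxes" / non-ASCII
text <- gsub("[[:punct:]]+", " ", text)       # remove punctuation
text <- text[!duplicated(text)]               # drop exact duplicates
```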

Tweet parsing

After cleaning the data, the selected tweets are parsed. This is the second and final step of data transformation; the steps taken are explained below, with a rough R sketch after the list.

  • URL Extraction: All URLs should be removed from the tweet corpus.
  • Punctuation removal: All punctuation should be removed before feature selection.
  • Lowercase conversion: To simplify feature selection, all the data should be converted to lower case.
  • Non-ASCII removal: All non-ASCII characters are removed from the tweets.
  • Stemming: Some words in the dataset share a root and differ only in their affixes, e.g., accounting, accounts, accountable, and accountability all have the root account. Stemming keeps only the root word, which reduces the feature set and improves classification performance.
  • Stop words removal: Stop words in English such as “a”, “is”, “the”, and “an” have been removed using a standard English stop-word list in R.
  • Extracting hashtags: As hashtags are a popular way of voicing opinions and gaining public attention, their content is captured to help determine the value of the tweet at hand. Hashtags are words that start with the number symbol (#), so all words starting with “#” are extracted from the tweets.

Extracting change-of-direction indicators: words that can change the meaning and context of a whole sentence, such as “and”, “or”, and “but”, also need to be saved. These words are searched for and stored in a separate column for more accurate analysis.
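A sketch of the parsing steps, assuming the tm and SnowballC packages for stop-word removal and stemming; hashtags are pulled from the raw tweet text, since the “#” symbol was stripped along with the punctuation:

```r
# Parsing sketch: extract hashtags and change-of-direction words, then
# remove stop words and stem the cleaned text.
library(tm)
library(SnowballC)

hashtags  <- regmatches(tweets$text, gregexpr("#\\w+", tweets$text))
contrasts <- regmatches(text, gregexpr("\\b(and|or|but)\\b", text))

corpus <- Corpus(VectorSource(text))
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # "a", "is", "the", ...
corpus <- tm_map(corpus, stemDocument)                       # accounts/accounting -> account
corpus <- tm_map(corpus, stripWhitespace)
```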

 

TWEET TOKENIZATION

After cleaning the text, the tweets are tokenized: each word in a tweet is represented by a token, and a numerical value is assigned to each token so that the frequency of words present in the tweets can be calculated.

The text data needs to be converted into numerical vectors. This can be done with a count vectorizer, which tokenizes each tweet and builds a sparse token-count matrix based on the occurrence of each token in each tweet.

Each tweet represents a document and each word in the tweet represents a feature; if a feature is present in a tweet, the corresponding matrix entry is 1, otherwise it is 0.

This matrix makes it possible to calculate which words occur most frequently across the tweets.

After creating this matrix, the overall frequency of the features present in the tweets is analyzed, and low-frequency features are removed, as they do not play a major role in sentiment analysis.

Low-frequency features, those whose sparsity exceeds the 0.995 threshold across the whole corpus, have been removed.
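In R, the same token-count matrix can be built as a document-term matrix with the tm package (a rough equivalent of the count vectorizer mentioned above); the 0.995 value below matches the threshold stated here:

```r
# Tokenization sketch: build the document-term matrix over the parsed corpus,
# drop very rare terms, and inspect the most frequent tokens.
dtm <- DocumentTermMatrix(corpus)       # rows = tweets, columns = tokens
dtm <- removeSparseTerms(dtm, 0.995)    # remove low-frequency (very sparse) terms

freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 10)                          # tokens occurring most often in the tweets
```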

 

SCORE TWEETS: Emotion Lexicon (NRC)

The score of each tweet is calculated with the get_nrc_sentiment function, which implements the NRC Emotion Lexicon defined by Saif Mohammad (Mohammad 2010) and is provided by the syuzhet package in R.

According to Mohammad, the NRC Emotion Lexicon associates words with eight emotions (disgust, anticipation, anger, fear, surprise, trust, sadness, and joy) and two sentiments (positive and negative).

For each tweet, the get_nrc_sentiment function uses the NRC sentiment dictionary to evaluate the presence and strength of the eight emotions and two sentiments in the text, returning a count of the words associated with each of them.
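A minimal sketch of this scoring step, run on the cleaned tweet text from the earlier steps:

```r
# NRC emotion scoring with syuzhet's get_nrc_sentiment(), as described above.
library(syuzhet)

nrc <- get_nrc_sentiment(text)
head(nrc)   # one row per tweet; columns for the eight emotions plus negative/positive
```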

 

Performing sentiment analysis on Twitter data

The sentiment graph shows the eight emotions and two polarity dimensions expressed in users’ tweets about demonetization.

The result shows that people trust the decision taken by the government: the public has taken demonetization as a positive step by the Indian Government and supports it.

The polarity of tweets can be positive, negative, or neutral.
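A sketch of how such a sentiment graph could be drawn from the NRC scores, using base R’s barplot (the plotting choice and title are assumptions):

```r
# Total word counts per emotion/polarity across all tweets, as a bar chart.
totals <- colSums(nrc)
barplot(sort(totals, decreasing = TRUE),
        las = 2, col = "steelblue",
        main = "Sentiment of demonetization tweets",
        ylab = "Word count")
```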


Thanks for reading. Your comments on this post are welcome.
