Introduction

Recently, the intuitive Keras API to TensorFlow was expanded beyond Python and released as the keras package in R. The R implementation is directly supported by François Chollet, the creator of Keras for Python. As an R user this is exciting because Keras provides an interpretable interface to TensorFlow, the powerful open-source machine learning library developed by Google. TensorFlow serves as the backend engine, supplying the efficient numerical computation that deep learning models require.

With these new tools available, I decided to build a neural network that classifies the binary sentiment of tweets. To be clear, there are more efficient models for natural language processing of short-sequence text. The time cost of training a deep neural network (especially on a CPU…) can be avoided with alternatives such as GloVe embeddings paired with weighted bag-of-words models. I chose a deep network architecture for this data for my own experimentation and learning.


Data

The foundation of any supervised model is the quantity and quality of available data. The data set of tweets I used to build my model is publicly available through Sentiment140, an open-source project put together by Stanford graduate students. The training/testing file contains 800k tweets labeled negative and 800k labeled positive, 1.6 million tweets in total. The tweets were not labeled by hand, but rather machine-classified based on the emoticons within each tweet (see the project site for more details). Even with this limitation, the relatively large dataset provides a great opportunity for training a deep neural network.
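
For reference, the raw file can be read in along these lines. This is a sketch, not my exact code: the column layout follows the Sentiment140 documentation, and the latin1 encoding is my assumption.

library(readr)

## Sentiment140 columns per the project's documentation:
## polarity (0 = negative, 4 = positive), id, date, query, user, text
cols_140 <- c("polarity", "id", "date", "query", "user", "text")
tweets_raw <- read_csv(
  "training.1600000.processed.noemoticon.csv",
  col_names = cols_140,
  locale = locale(encoding = "latin1")  # assumed; the file is not UTF-8
)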


Preprocessing

Twitter data is interesting because it captures the thoughts, ideas, and opinions of so many people. That same fact makes it quite difficult to clean and process efficiently. Most tweets contain misspellings, errors, and non-standard grammatical structure, and many contain colloquial words and phrases used only within specific demographics. In theory a neural network can learn these patterns, but a specific pattern must appear at a sufficiently high rate for the corresponding network connections to contribute to learning. When a misspelling such as “thaaaaat” appears once and another variation, “thaat”, appears once, the model cannot generalize their meaning the way a human can because it has too few examples. I performed rudimentary data cleaning by eliminating textual features that would not contribute to model learning (e.g., links to outside domains, text conversion errors).
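
A minimal sketch of that cleaning step (the patterns here are illustrative, not my exact rules):

library(stringr)

## Strip features that carry no sentiment signal
clean_tweet <- function(x) {
  x <- str_remove_all(x, "https?://\\S+")    # links to outside domains
  x <- str_remove_all(x, "&amp;|&lt;|&gt;")  # common text conversion artifacts
  str_squish(x)                              # collapse leftover whitespace
}

clean_tweet("Check this out http://bit.ly/xyz &amp; tell me what you think")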

I used the tidytext package to process the data, following the same tidy methodology of data exploration and analysis that I used in my exploratory analysis of Kanye’s lyrics. I decided against eliminating stop words (i.e., the most commonly represented English words) within each tweet due to the type of neural network I chose to employ. The challenge here was to create a numeric lexicon that represents each textual feature as a unique integer while preserving the word sequence of each individual tweet. So if I have tweets that read, “The Warriors are the best basketball team” and, “The Clippers are a poor basketball franchise,” a unique integer maps to each textual feature. Below is an example of this representation, prior to log transformation, followed by a code sketch of the tokenize-and-index step.

Example A     Value A     Example B     Value B
The           123         The           123
Warriors      42          Clippers      2342
are           22          are           22
the           245         a             7878
best          6303        poor          900
basketball    31203       basketball    31203
team          5950        franchise     10430
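
A minimal sketch of that step, assuming tidytext and dplyr (the object names are illustrative, not my exact code):

library(tidytext)
library(dplyr)

## Two toy tweets standing in for the full dataset
tweets <- tibble(
  id   = 1:2,
  text = c("The Warriors are the best basketball team",
           "The Clippers are a poor basketball franchise")
)

## Tokenize into words, keeping each tweet's id and word order;
## to_lower = FALSE keeps "The" and "the" distinct, as in the table above
tokens <- tweets %>% unnest_tokens(word, text, to_lower = FALSE)

## Map each unique word to an integer index
lexicon <- tokens %>% distinct(word) %>% mutate(word_id = row_number())

## Replace words with their integer ids, preserving within-tweet sequence
sequences <- tokens %>%
  left_join(lexicon, by = "word") %>%
  group_by(id) %>%
  summarise(seq = list(word_id))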

Model

Due to the sequential nature of text data, the logical model type to train on this data is a recurrent neural network (RNN). A specific RNN architecture, the Long Short-Term Memory network (LSTM), more accurately synthesizes the context of sequential data by learning long-term dependencies. You can check out this blog post to read more about how LSTMs work and why they are used extensively for making predictions about sequentially dependent data. I built a simple three-layer neural network with a single bidirectional LSTM layer, which I compiled and trained on a subset of the classified tweets. The bidirectional wrapper lets the LSTM process text in both the forward and reverse directions, which has been shown to produce better predictions for contextually dependent data.
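
Such a model can be defined in a few lines with the keras package. The sketch below is consistent with the example summary printed further down; the optimizer and loss are my assumptions.

library(keras)

## Vocabulary size implied by the example summary below (4480 / 128 = 35);
## the full model uses the size of the tweet lexicon instead
vocab_size <- 35

model_struct <- keras_model_sequential() %>%
  layer_embedding(input_dim = vocab_size, output_dim = 128) %>%
  bidirectional(layer_lstm(units = 128)) %>%       # 128 units per direction
  layer_dense(units = 1, activation = "sigmoid")   # binary sentiment output

model_struct %>% compile(
  optimizer = "adam",               # assumed
  loss = "binary_crossentropy",
  metrics = "accuracy"
)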

To train the final model, I randomly sampled 85% of the 1.6 million tweets (~1.36 million) and fed the processed text data through the network. The remaining 15% of the tweets served as a validation set to assess the accuracy of the model's predictions on unseen data. Optimal prediction accuracy, prior to the onset of overfitting, occurred after about 3 epochs and yielded a validation accuracy of 82.53%.
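
The fit call looked roughly like this (a sketch: the batch size and object names are assumptions, and note that validation_split holds out the last 15% of rows rather than a random sample, so the data should be shuffled first):

## x_train: padded integer sequences; y_train: 0/1 sentiment labels (assumed)
history <- model_struct %>% fit(
  x_train, y_train,
  batch_size = 128,          # assumed
  epochs = 3,                # accuracy plateaued here before overfitting
  validation_split = 0.15    # last 15% of (pre-shuffled) rows held out
)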

## Example model structure
summary(model_struct)
## Model
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## embedding_6 (Embedding)          (None, None, 128)             4480        
## ___________________________________________________________________________
## bidirectional_6 (Bidirectional)  (None, 256)                   263168      
## ___________________________________________________________________________
## dense_6 (Dense)                  (None, 1)                     257         
## ===========================================================================
## Total params: 267,905
## Trainable params: 267,905
## Non-trainable params: 0
## ___________________________________________________________________________


Tuning

Model tuning is where I hit a bottleneck in predictive capacity due to my available computational power. In order to rapidly deploy model iterations and assess them for accuracy improvements, I down-sampled the dataset to a random 10% of observations. I tested different batch sizes, numbers of epochs, and layer configurations. This was great practice for me, but ultimately it was difficult to detect meaningful differences in model accuracy. The power of deep neural networks comes from data volume, so it was hard to extrapolate how model variations would perform when trained on the full dataset.
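
The tuning loop looked roughly like this. This is a sketch: build_model() is a hypothetical helper that returns a freshly compiled copy of the architecture above, and x_all/y_all stand in for the full processed dataset.

## Down-sample to a random 10% of observations
set.seed(42)
idx <- sample(seq_len(nrow(x_all)), size = floor(0.10 * nrow(x_all)))

for (bs in c(32, 64, 128)) {
  m <- build_model()   # hypothetical: rebuild and recompile the network
  h <- m %>% fit(
    x_all[idx, ], y_all[idx],
    batch_size = bs, epochs = 5,
    validation_split = 0.15, verbose = 0
  )
  val_acc <- h$metrics$val_acc  # "val_accuracy" in newer Keras versions
  cat("batch size", bs, "- best validation accuracy:", max(val_acc), "\n")
}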


Results

On August 5, 2017, around 1:30 PM, I tested my model on real-time Twitter data. The percent positive sentiment for each query is plotted below (n = 500 tweets/query).
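
Scoring a live query looked roughly like this (a sketch: rtweet is my assumption for the Twitter client, and prep_sequences() is a hypothetical helper that cleans, tokenizes, and pads new text using the training lexicon):

library(rtweet)

tw <- search_tweets("warriors", n = 500, include_rts = FALSE)

## prep_sequences(): hypothetical helper reusing the training-time preprocessing
x_new <- prep_sequences(tw$text)

## Percent of tweets the model scores as positive
pct_positive <- mean(predict(model_struct, x_new) > 0.5) * 100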


Considerations

The biggest consideration here is the use of machine-labeled data to train the model. There is some inherent bias in this methodology, since all of the training tweets originally contained emoticons (the emoticons themselves were removed before training). Human-labeled data would give a more accurate depiction of true tweet sentiment, and I will continue to optimize the model with my own classifications. In the future I would like to include a neutral sentiment category, and eventually attempt to model correlations between predicted topic sentiments and real-life trends. Please contact me for business inquiries.


Resources