Following up on my earlier post: as the frequency-based models were not very accurate and a good rule-based model was very hard to elaborate, we implemented what we know to be state-of-the-art methods for sentiment analysis on short sentences, and made a list of the pros and cons of these methods. We trained all of them on a dataset of 10,000 sentences, classified as positive, neutral, or negative by human experts, and benchmarked the models on a hold-out sample of 500 sentences.

To build a deep-learning model for sentiment analysis, we first have to represent our sentences in a vector space. We studied frequency-based methods in a previous post. They represent a sentence either by a bag-of-words, which is a list of the words that appear in the sentence with their frequencies, or by a term frequency-inverse document frequency (tf-idf) vector, where the word frequencies in our sentences are weighted by their frequencies in the entire corpus.
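To make the word-order problem concrete, here is a toy tf-idf implementation in plain Python (a sketch for illustration only, not the vectorizer we actually use):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute a tf-idf vector (as a dict) for each tokenized sentence."""
    n_docs = len(corpus)
    # Document frequency: in how many sentences each word appears.
    df = Counter(word for doc in corpus for word in set(doc))
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vectors.append({word: (count / len(doc)) * math.log(n_docs / df[word])
                        for word, count in tf.items()})
    return vectors

corpus = [
    "very good food but bad for service".split(),
    "bad for food but very good service".split(),
    "the pasta was great".split(),
]
vecs = tf_idf(corpus)
# The first two sentences contain exactly the same words, so their
# tf-idf representations are identical: word order is lost.
print(vecs[0] == vecs[1])  # True
```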

These methods are very useful for long texts. For example, we can describe a newspaper article or a book very precisely by its most frequent words. For very short sentences, however, they are not accurate at all. First, because ten words or so are not enough to aggregate meaningful statistics. But also because the structure of the sentence is very important for analyzing sentiment, and tf-idf models hardly capture negations, amplifications, and concessions. For instance, "Very good food, but bad for service" would have the same representation as "Bad for food, but very good service!".

Instead, we represent our sentences with vectors that take into account both the words that appear and the semantic structure. A first way to do this is to represent every word with an n-feature vector, and to represent our sentence with an n × length matrix. For instance, we can build vectors of the same size as the vocabulary (say 10,000 words), and represent the i-th word with a 1 in the i-th position and 0 elsewhere.
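As a quick sketch (with a tiny made-up vocabulary), this one-hot encoding looks like the following:

```python
def one_hot(index, size):
    """Index vector: 1 at the word's position, 0 elsewhere."""
    vector = [0] * size
    vector[index] = 1
    return vector

vocabulary = ["bad", "but", "food", "good", "service", "very"]
word_index = {word: i for i, word in enumerate(vocabulary)}

sentence = "very good food but bad service".split()
# One row per word: a len(sentence) x len(vocabulary) matrix.
matrix = [one_hot(word_index[word], len(vocabulary)) for word in sentence]
print(len(matrix), "words,", len(matrix[0]), "features each")
```

With a real 10,000-word vocabulary these vectors become very sparse and very large, which is exactly why the next methods compress them.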

Tomas Mikolov developed another way to represent words in a vector space, with features that capture semantic compositionality. He trains the following neural network on a very large corpus:

He trains this model and represents the word "ants" by the output vector of the hidden layer. The features of these word vectors capture most of the semantic information, because they encode enough information to evaluate the statistical distribution of the words that can follow "ants" in a sentence.

What we do is similar. We represent every word by an index vector, and we integrate into our deep learning model a hidden layer of linear neurons that transforms these big vectors into much smaller ones. We take these smaller vectors as the input of a convolutional neural network. We train the model as a whole, so that the word vectors we use are trained to fit the sentiment information of the words, i.e. so that the features we get capture enough information about the words to predict the sentiment of the sentence.
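Because each input is a one-hot index vector, multiplying it by the hidden layer's weight matrix amounts to selecting a single row. A minimal sketch of this embedding lookup (with random, untrained weights just for illustration):

```python
import random

random.seed(0)
vocab_size, embedding_dim = 10_000, 50

# The hidden layer of linear neurons is in effect a lookup table
# mapping each word index to a small dense vector; in the real model
# these weights are trained jointly with the rest of the network.
embeddings = [[random.gauss(0, 0.1) for _ in range(embedding_dim)]
              for _ in range(vocab_size)]

def embed(word_indices):
    """One-hot vector times the weight matrix == picking one row."""
    return [embeddings[i] for i in word_indices]

sentence_matrix = embed([42, 7, 1337])  # a 3-word sentence
print(len(sentence_matrix), "x", len(sentence_matrix[0]))  # 3 x 50
```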

We want to build a representation of a sentence that takes into account not only the words that appear, but also the sentence's semantic structure. The easiest way to do this is to stack these word vectors and build a matrix that represents the sentence. There is another way to do it, also developed by Tomas Mikolov, usually called Doc2Vec.

He modifies the neural network we used for Word2Vec so that it takes as input both the word vectors that come before and a vector that depends on the sentence they are in. We take the features of this sentence vector as parameters of our model and optimize them using gradient descent. Doing so, we get for every sentence a set of features that represent the structure of the sentence. These features capture most of the useful information about how the words follow each other.

These document vectors are very useful for us, because the sentiment of a sentence can be deduced very precisely from these semantic features. As a matter of fact, users writing reviews with positive or negative sentiments will compose their words in completely different ways. Feeding a logistic regression with these vectors and training the regression to predict sentiment is known to be one of the best methods for sentiment analysis, both for fine-grained (Very negative / Negative / Neutral / Positive / Very positive) classification and for the more general Negative / Positive classification.
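The classification step itself is simple once the document vector exists. Here is a bare-bones sketch of scoring a document vector with a logistic regression (the 4-dimensional vector and the weights below are made up for illustration; real document vectors have hundreds of dimensions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(doc_vector, weights, bias):
    """Probability that the sentence is positive."""
    z = sum(w * x for w, x in zip(weights, doc_vector)) + bias
    return sigmoid(z)

# Hypothetical trained weights and a hypothetical document vector.
weights, bias = [1.5, -2.0, 0.5, 0.1], -0.2
doc_vector = [0.8, -0.3, 0.4, 0.0]
probability = predict(doc_vector, weights, bias)
print(probability > 0.5)  # True: classified as positive
```

The expensive part is upstream: producing `doc_vector` for a new sentence requires its own gradient descent, which is the cost discussed below.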

We implemented and benchmarked such a method, but we chose not to productionize it. As a matter of fact, building the document vector of a sentence is not an easy operation: for every sentence, we have to run a gradient descent in order to find the right coefficients for this vector. Compared to our other methods for sentiment analysis, where the preprocessing is a very short algorithm (a matter of milliseconds) and the evaluation is almost instantaneous, Doc2Vec classification requires a significant hardware investment and/or takes much longer to process. Before taking that leap, we decided to explore representing our sentences by a matrix of word vectors and classifying sentiments with a deep learning model.

The next method we explored for sentiment classification uses a multi-layer neural network with a convolutional layer, multiple dense layers of neurons with a sigmoid activation function, and additional layers designed to prevent overfitting. We explained how convolutional layers work in a previous article. It is a technique that was designed for computer vision, and that improves the accuracy of most image classification and object detection models.

The idea is to apply convolutions to the image with a set of filters, and to take the new images they produce as inputs of the next layer. Depending on the filter we apply, the output image will either capture the edges, smooth the image, or sharpen the key patterns. Training the filter coefficients helps our model build extremely relevant features to feed the next layers. These features work like local patches that learn compositionality: during training, the model automatically learns the best patches for the classification problem we want to solve. The features it learns are location-invariant: it convolves an object at the bottom of the frame exactly the same way as an object at the top of the frame. This is key not only for object detection, but for sentiment analysis as well.
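To illustrate how a single filter slides over a sentence matrix, here is a minimal 1D convolution in plain Python (the 2-dimensional word vectors and filter weights are made up; a real model has many filters with learned coefficients):

```python
def convolve_sentence(sentence_matrix, filt):
    """Slide a filter covering len(filt) consecutive word vectors
    over the sentence, emitting one activation per position."""
    k = len(filt)  # filter height, in words
    outputs = []
    for i in range(len(sentence_matrix) - k + 1):
        window = sentence_matrix[i:i + k]
        activation = sum(f * x
                         for f_row, w_row in zip(filt, window)
                         for f, x in zip(f_row, w_row))
        outputs.append(activation)
    return outputs

# A 4-word sentence with toy 2-dimensional word vectors.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# A bigram filter: the SAME weights are applied at every position,
# which is what makes the learned features location-invariant.
bigram_filter = [[1.0, -1.0], [0.5, 0.5]]
print(convolve_sentence(sentence, bigram_filter))  # [1.5, 0.0, 0.0]
```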

As these models became more and more popular in computer vision, a lot of people tried to apply them to other fields, with very good results in speech recognition and in natural language processing. In speech recognition, the trick is to build the frequency-intensity distribution of the signal for every timestamp and to convolve these images.

For NLP tasks like sentiment analysis, we do something very similar: we build word vectors and convolve the image formed by juxtaposing these vectors in order to build relevant features.

Intuitively, the filters enable us to highlight the intensely positive or intensely negative words, and to understand the relation between negations and what follows them. They capture relevant information about how the words follow each other, and they also learn particular words or n-grams that bear sentiment information. We then feed a fully connected deep neural network with the outputs of these convolutions; it selects the best of these features in order to classify the sentiment of the sentence. The results on our datasets are pretty good.

We also studied, implemented, and benchmarked the Long Short-Term Memory (LSTM) recurrent neural network model. It has a very interesting architecture for processing natural language: it works much as we do, reading the sentence from the first word to the last and trying to figure out the sentiment after each step. For example, for the sentence "The food sucks, the wine was worse.", it will read "The", then "food", then "sucks", "the", and "wine". It keeps in mind both a vector that represents what came before (memory) and a partial output. For instance, it will already think that the sentence is negative halfway through, and then continue to update as it processes more data.
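The word-by-word reading can be sketched with a drastically simplified recurrent cell (the per-word sentiment scores and mixing weights below are invented for illustration; a real LSTM learns gated updates instead of this single tanh):

```python
import math

def recurrent_step(memory, word_score, w_mem=0.7, w_in=1.0):
    """One simplified recurrent step: blend the running memory with
    the current word's score. (A real LSTM adds input, forget, and
    output gates to decide what to keep and what to discard.)"""
    return math.tanh(w_mem * memory + w_in * word_score)

# Hypothetical per-word sentiment scores.
scores = {"the": 0.0, "food": 0.1, "sucks": -1.5,
          "wine": 0.1, "was": 0.0, "worse": -1.8}

memory = 0.0
for word in "the food sucks the wine was worse".split():
    memory = recurrent_step(memory, scores[word])
    print(f"after {word!r}: {memory:+.2f}")

print("negative" if memory < 0 else "positive")
```

Note how the running state already turns negative at "sucks", halfway through the sentence, just as described above.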

This is the general idea, but the implementation of these networks is much more complex, because it is easy to keep recent information in mind but very difficult to build a model that captures most of the useful long-term dependencies while avoiding the problems linked to vanishing gradients.

This RNN structure looks very promising for sentiment analysis tasks, and it performs well for speech recognition and for translation. However, it slows down the evaluation process considerably and doesn't improve accuracy that much in our application, so it should be implemented with care.

Richard Socher et al. describe another interesting method for sentiment analysis in the paper "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank". They argue that every word has a sentiment meaning, and that the structure of the sentence should enable us to compose these sentiments in order to get the overall sentiment of the sentence.

They implement a model called the Recursive Neural Tensor Network (RNTN). It represents the words by vectors and uses a class of tensor-multiplication-based functions to describe compositionality. Stanford has a very large corpus of movie reviews turned into trees by their NLP libraries, where every node is classified from very negative to very positive by a human annotator. They trained the RNTN model on this corpus and got very good results. Unfortunately, the model is trained on movie reviews, and it doesn't perform quite as well on our own reviews.
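The heart of the RNTN is the composition function that merges two child vectors into a parent vector. A toy version in plain Python (the tiny dimension and the parameter values are made up; the real model learns V and W by backpropagation through the parse tree):

```python
import math

def compose(a, b, V, W):
    """RNTN-style composition of two child vectors of dimension d:
    p_i = tanh( [a;b]^T V_i [a;b] + (W [a;b])_i )
    where [a;b] is the concatenation of the children."""
    x = a + b  # list concatenation == vector concatenation [a;b]
    d = len(a)
    parent = []
    for i in range(d):
        tensor_term = sum(x[j] * V[i][j][k] * x[k]
                          for j in range(2 * d)
                          for k in range(2 * d))
        linear_term = sum(W[i][j] * x[j] for j in range(2 * d))
        parent.append(math.tanh(tensor_term + linear_term))
    return parent

# Toy d = 2 example with made-up parameters.
d = 2
V = [[[0.1 if j == k else 0.0 for k in range(2 * d)]
      for j in range(2 * d)] for _ in range(d)]
W = [[0.5] * (2 * d) for _ in range(d)]
parent = compose([0.2, -0.1], [0.4, 0.3], V, W)
print(parent)
```

Applying this function bottom-up along the parse tree yields one vector per node, each of which is then classified, which is what makes the per-node sentiment predictions so interpretable.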

The big advantage of this model is that it is very interpretable: we can understand very precisely how it works, visualize which words it detects as positive or negative, and see how it understands compositions. However, we need to build an extremely large training set (around 10,000 sentences with fine-grained annotations on every node) for every specific application. As we continue to gather more and more detailed training data, this is just one of the types of models we are exploring to keep improving the sentiment models we have in production!
