Il Blog di Lorenzo Vainigli

Classification of Reviews with Yelp Open Dataset

6 October 2020
Programming Projects

This article is available in Italian

The Yelp Open Dataset is a collection of data about users who write reviews of businesses belonging to different commercial categories (e.g. restaurants, car rental, etc.).

Using this dataset I built a neural network model that automatically classifies reviews with natural language processing and machine learning techniques.

The trained model can be tried on Google Colab by clicking on the button below.

Open in Colab

In the next paragraphs I describe how the project was carried out.

Building the model

The input values were computed by applying word embedding techniques to the text of the reviews. The output value for each input was derived from the number of stars associated with the review and one-hot encoded as a binary vector, so that the model can classify into categories (e.g. “4 stars” is encoded as the vector [0, 0, 0, 1, 0]).
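
The encoding described above can be sketched in plain Python. This is only an illustration under assumptions: the whitespace tokenizer, the reserved ids (0 for padding, 1 for unknown words), the padding at the end of the sequence and the helper names `one_hot_stars`, `build_vocabulary` and `encode_text` are hypothetical and do not come from the original notebook. The integer sequences produced this way are the kind of input an embedding layer expects.

```python
from collections import Counter


def one_hot_stars(stars, num_classes=5):
    """Encode a star rating (1..num_classes) as a binary vector."""
    if not 1 <= stars <= num_classes:
        raise ValueError("stars must be between 1 and num_classes")
    vector = [0] * num_classes
    vector[stars - 1] = 1
    return vector


def build_vocabulary(texts, vocab_size):
    """Map the most frequent words to ids 2, 3, ...
    Id 0 is reserved for padding and id 1 for unknown words (assumption)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(vocab_size - 2)
    return {word: index + 2 for index, (word, _) in enumerate(most_common)}


def encode_text(text, vocabulary, max_length):
    """Turn a review into a fixed-length list of word ids."""
    ids = [vocabulary.get(word, 1) for word in text.lower().split()]
    ids = ids[:max_length]                       # truncate long reviews
    return ids + [0] * (max_length - len(ids))   # pad short reviews at the end


reviews = ["great food great service", "great bad food"]
vocabulary = build_vocabulary(reviews, vocab_size=10)
print(one_hot_stars(4))
print(encode_text("great pizza food", vocabulary, max_length=4))
```

Words that are not in the vocabulary ("pizza" in this example) all share the same "unknown" id, which is why the vocabulary size limits how much of the review text the model can distinguish.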

The steps described in this course have been followed.

After numerous attempts with different combinations of layers and parameters, the following configuration was chosen:

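import tensorflow as tf

# Parameters for word embedding
vocab_size = 10000     # number of distinct tokens kept in the vocabulary
embedding_dim = 128    # size of each word vector
max_length = 256       # number of tokens per review

# Model configuration
model = tf.keras.Sequential([
    # Map each token id to a dense vector
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              input_length=max_length),
    # Average the word vectors over the whole review
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dropout(0.4),
    # One output probability per star rating (1 to 5)
    tf.keras.layers.Dense(5, activation='softmax')
])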
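
As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes the usual Keras parameter counts: one weight per vocabulary entry and embedding dimension for the embedding layer, weights plus biases for the dense layer, and no trainable parameters for pooling and dropout. The variable names other than those in the configuration above are illustrative.

```python
vocab_size = 10000
embedding_dim = 128
num_classes = 5

# Embedding: one vector of size embedding_dim for each vocabulary entry
embedding_params = vocab_size * embedding_dim

# Dense: one weight per (input, output) pair plus one bias per output
dense_params = embedding_dim * num_classes + num_classes

# GlobalAveragePooling1D and Dropout have no trainable parameters
total_params = embedding_params + dense_params

print(embedding_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size and the embedding dimension are the settings that most affect the size of the model.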

Training and validation

The first 10,000 reviews were selected, without any filtering criteria, and split 80%-20% into a training set and a test set, respectively. Early stopping was used during training to prevent overfitting.
The resulting accuracy is 79.8% on the training set and 63.7% on the test set (Figure 1).
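
The split and the early stopping rule can be illustrated with a minimal sketch. It makes several assumptions that the text does not confirm: the split is taken in order without shuffling, the monitored quantity is the validation loss, and the patience is 3 epochs. The function names and the loss values are made up for illustration.

```python
def split_train_test(records, train_fraction=0.8):
    """Split records in order: the first part for training, the rest for testing."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.
    Training stops once the loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


records = list(range(10000))  # stand-in for the first 10,000 reviews
train_set, test_set = split_train_test(records)
print(len(train_set), len(test_set))

# Made-up validation losses: improvement stops after epoch 2
stop_epoch, best_epoch = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.9, 0.6])
print(stop_epoch, best_epoch)
```

In Keras the same idea is available as the `tf.keras.callbacks.EarlyStopping` callback, which can be passed to `model.fit` through the `callbacks` argument.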

Figure 1: Accuracy and loss values for the first run, i.e. before filtering the data.

It would have been very interesting to use 100,000 reviews instead of only a tenth of that number, but this was not done because of limited computational capacity.

Improving performance

Since changing the parameters or the composition of the model did not significantly improve the results, the data was filtered by removing the examples that produced very high loss values. The assumption behind this choice is that such input-output pairs may not be trustworthy.
The loss was computed for each of the 10,000 selected records, and the records with a loss higher than 2 were discarded.
After this operation the accuracy is 88.9% on the training set and 70.3% on the test set (Figure 2).
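
The filtering step can be sketched as follows, assuming that the loss is the categorical cross-entropy of a single example, i.e. the negative logarithm of the probability the model assigns to the true star rating. Under that assumption a loss above 2 means the true class received less than about 13.5% probability (e^-2 ≈ 0.135). The `predict` callable and the example data are hypothetical stand-ins for the trained model and the reviews.

```python
import math


def cross_entropy(probabilities, stars):
    """Categorical cross-entropy of one example with a one-hot target."""
    true_class_probability = probabilities[stars - 1]
    # Clamp to avoid log(0) when the model assigns zero probability
    return -math.log(max(true_class_probability, 1e-12))


def filter_by_loss(examples, predict, max_loss=2.0):
    """Keep only the (text, stars) pairs whose loss does not exceed max_loss."""
    kept = []
    for text, stars in examples:
        if cross_entropy(predict(text), stars) <= max_loss:
            kept.append((text, stars))
    return kept


# Hypothetical predictions of an already trained model
fake_predictions = {
    "terrible place": [0.70, 0.10, 0.10, 0.05, 0.05],
    "loved it": [0.70, 0.10, 0.10, 0.05, 0.05],
}
examples = [("terrible place", 1), ("loved it", 5)]

filtered = filter_by_loss(examples, fake_predictions.get)
print(filtered)
```

In this example the second review is discarded: the model gives only 5% probability to its 5-star label, so its loss is about 3.0.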

Figure 2: Accuracy and loss values for the second run, i.e. after removing the examples with high loss values.

Final results

The graphs below show data on which a final evaluation of the model for automatic review classification can be based.
To classify reviews as positive or negative, we can treat reviews with 4 or 5 stars as positive and those with 1 or 2 stars as negative. The case of 3 stars is arbitrary.
The graphs in Figures 3, 4, 5 and 6 show how the accuracy of the model has improved.
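
The mapping from the model output to a positive or negative label can be sketched like this. The function names are illustrative, and treating 3 stars as "neutral" by default is an assumption, since the text leaves that case open.

```python
def predicted_stars(probabilities):
    """Return the star rating (1..5) with the highest predicted probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best_index + 1


def sentiment_from_stars(stars, three_stars_as="neutral"):
    """Map a star rating to a sentiment label; the 3-star case is configurable."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return three_stars_as


# Made-up softmax output with most of the probability on 5 stars
probabilities = [0.05, 0.03, 0.05, 0.05, 0.82]
stars = predicted_stars(probabilities)
print(stars, sentiment_from_stars(stars))
```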

Figure 3: This sentence is definitely one of the worst you can write about a business. In the first run the model erroneously assigned 5 stars, although with considerable uncertainty, while in the second run the best choice (1 star) is clearly ahead of the others.
Figure 4: This is undoubtedly a positive review, and even in the first run the 5-star option is the clear choice, which is further strengthened in the second run. 4 stars would also be a fair prediction.
Figure 5: An absolutely positive review. The first run already gives a clear choice, and the 82% probability in the second run is a very good result.
Figure 6: This sentence is one of the most interesting because it expresses a balanced opinion, for which 3 stars would be preferable (since it is the middle choice). Unfortunately, in this case the model is not very accurate even after the second run: assigning one star is definitely wrong. However, once again the second run looks more plausible than the first.

Further analysis

Other notebooks containing additional analyses of the Yelp Open Dataset are available in my repository.

Lorenzo Vainigli

Passionate about computers, technology and information technology since I was a child. I have a bachelor's degree in Computer Science from the University of Bologna and I am now studying for a master's degree. Often I like to put my knowledge into practice by creating something new. Here I tell a bit about me.