Classification of Reviews with Yelp Open Dataset
This article is also available in Italian.
The Yelp Open Dataset is a collection of data about users who write reviews of businesses belonging to different commercial categories (e.g. restaurants, car rental, etc.).
Using this dataset, I built a neural network model that automatically classifies reviews using natural language processing and machine learning techniques.
The trained model can be used on Google Colab by clicking on the button below.
In the next paragraphs I describe how the project was carried out.
Building the model
The input values were computed by applying word embedding techniques to the text of the reviews. The output value for each input was derived from the number of stars associated with the review and transformed into a one-hot (binary) vector, so that the model can perform a classification over categories (e.g. “4 stars” is encoded as the vector [0, 0, 0, 1, 0]).
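As a minimal sketch of the label encoding described above, the function below maps a star rating to a one-hot vector. The name `stars_to_one_hot` is hypothetical and not taken from the project; in a Keras pipeline, `tf.keras.utils.to_categorical` would usually do the same job on zero-based class indices.

```python
def stars_to_one_hot(stars, num_classes=5):
    """Encode a star rating (1..num_classes) as a one-hot list."""
    if not 1 <= stars <= num_classes:
        raise ValueError(f"stars must be between 1 and {num_classes}")
    vector = [0] * num_classes
    vector[stars - 1] = 1  # star k is stored at index k - 1
    return vector


print(stars_to_one_hot(4))  # [0, 0, 0, 1, 0]
```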
I followed the steps described in this course.
After numerous attempts with different combinations of layers and parameters, the following configuration was chosen:
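import tensorflow as tf

# Parameters for word embedding
vocab_size = 10000      # number of distinct tokens kept in the vocabulary
embedding_dim = 128     # size of each word vector
max_length = 256        # number of tokens per review fed to the model

# Model configuration
model = tf.keras.Sequential([
    # Map each token id to a trainable vector of size embedding_dim
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              input_length=max_length),
    # Average the word vectors over the whole sequence
    tf.keras.layers.GlobalAveragePooling1D(),
    # Randomly drop 40% of the features during training
    tf.keras.layers.Dropout(0.4),
    # One output probability per star rating (1 to 5)
    tf.keras.layers.Dense(5, activation='softmax')
])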
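The article does not show how the review text is turned into the integer sequences that the embedding layer expects, so the following is only a sketch under stated assumptions: whitespace tokenization, id 0 reserved for padding, id 1 reserved for out-of-vocabulary words, and padding or truncation at the end of the sequence. The helper names `build_vocab` and `encode` are hypothetical.

```python
from collections import Counter

PAD_ID = 0  # assumption: id reserved for padding
OOV_ID = 1  # assumption: id reserved for out-of-vocabulary words


def build_vocab(texts, vocab_size):
    """Map the most frequent words to integer ids starting at 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(vocab_size - 2)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=2)}


def encode(text, vocab, max_length):
    """Turn a text into a list of exactly max_length integer ids."""
    ids = [vocab.get(word, OOV_ID) for word in text.lower().split()]
    ids = ids[:max_length]                           # truncate long reviews
    return ids + [PAD_ID] * (max_length - len(ids))  # pad short reviews


reviews = ["great food and great service", "terrible service"]
vocab = build_vocab(reviews, vocab_size=10000)
print(encode("great pizza", vocab, max_length=4))
```

With `vocab_size = 10000` and `max_length = 256`, every review would become a list of 256 ids below 10,000, which matches the input shape of the model above.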
Training and validation
The first 10,000 reviews were selected, without any filtering criteria, and split 80%-20% into a training set and a test set, respectively. Early stopping was used during training to prevent overfitting.
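The split and the early stopping rule can be sketched in plain Python as follows. This is an illustration, not the project's code: the function names are hypothetical, and the monitored quantity (validation loss) and the patience of 3 epochs are assumptions, since the article does not state them. In Keras the same behaviour is normally obtained with the `tf.keras.callbacks.EarlyStopping` callback.

```python
def split_train_test(records, train_fraction=0.8):
    """Split records in order: first part for training, rest for testing."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


def early_stopping_epoch(val_losses, patience=3):
    """Return the 0-based epoch at which training would stop.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs; if that never happens, the index
    of the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


train, test = split_train_test(list(range(10000)))
print(len(train), len(test))  # 8000 2000
print(early_stopping_epoch([1.2, 1.0, 0.9, 0.95, 0.97, 0.99, 0.8]))  # 5
```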
The resulting accuracy values are 79.8% for the training set and 63.7% for the test set (Figure 1).
It would have been very interesting to use 100,000 reviews instead of only a tenth of that number, but this was not done because of limited computational capacity.
Improving performance
Since changing the parameters or the model composition did not significantly improve the results, I decided to filter the data by removing the records that produced excessively high loss values. The assumption behind this choice is that such input-output pairs may not be reliable, that is, the star rating may not match what the text of the review actually says.
The loss was computed for each of the 10,000 selected records, and those with a value higher than 2 were discarded.
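The filtering step can be illustrated with a small sketch. It assumes that the loss is the categorical cross-entropy, which the article does not state explicitly but which is the usual choice for a softmax output with one-hot labels. Under that assumption, a loss above 2 means that the model assigned a probability below e^-2 (about 0.135) to the true star rating. The function names are hypothetical.

```python
import math


def cross_entropy(one_hot_label, probabilities, eps=1e-7):
    """Categorical cross-entropy for one sample: -log(p of the true class)."""
    true_index = one_hot_label.index(1)
    p_true = max(probabilities[true_index], eps)  # avoid log(0)
    return -math.log(p_true)


def filter_by_loss(samples, max_loss=2.0):
    """Keep only (label, probabilities) pairs whose loss is <= max_loss."""
    return [s for s in samples if cross_entropy(*s) <= max_loss]


samples = [
    ([0, 0, 0, 1, 0], [0.05, 0.05, 0.10, 0.70, 0.10]),  # loss ~ 0.36, kept
    ([1, 0, 0, 0, 0], [0.05, 0.05, 0.10, 0.30, 0.50]),  # loss ~ 3.00, dropped
]
print(len(filter_by_loss(samples)))  # 1
```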
The resulting accuracy values after this operation are 88.9% for the training set and 70.3% for the test set (Figure 2).
Final results
The graphs below show the data on which a final evaluation of the model for automatic review classification can be based.
To classify reviews as positive or negative, we can treat reviews with 4 or 5 stars as positive and those with 1 or 2 stars as negative. The case of 3 stars is arbitrary.
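A minimal sketch of this mapping, applied to the five-way softmax output of the model, could look like the following. The function names and the `"neutral"` label for 3 stars are illustrative choices, not part of the original project.

```python
def predicted_stars(probabilities):
    """Return the star rating (1-5) with the highest predicted probability."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best_index + 1


def stars_to_sentiment(stars, three_star_label="neutral"):
    """Map a star rating to a sentiment label."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return three_star_label  # 3 stars: the choice is arbitrary


stars = predicted_stars([0.05, 0.05, 0.10, 0.70, 0.10])
print(stars, stars_to_sentiment(stars))  # 4 positive
```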
Looking at the graphs shown in Figures 3, 4, 5 and 6 we can see how the accuracy of the model has improved.
Further analysis
Other notebooks containing additional analyses of the Yelp Open Dataset are available in my repository.
Lorenzo Vainigli
Android developer and passionate about computers and technology since I was a child.
I have a master's degree in Computer Science obtained at the University of Bologna.
I like astronomy 🌌 and Japanese culture ⛩️.