
Classification of Reviews with Yelp Open Dataset

The Yelp Open Dataset is a collection of data about users who write reviews of businesses belonging to different commercial categories (e.g. restaurants, car rentals, etc.).

Using this dataset, I was able to build a neural-network model for the automatic classification of reviews, applying natural language processing and machine learning techniques.

The trained model can be used on Google Colab by clicking the button below.

Open in Colab

The following paragraphs describe how the project was carried out.

Building the model

The input values were computed by applying word-embedding techniques to the text of the reviews. The output value for each input was derived from the number of stars associated with the review and transformed into a binary vector, so that the model performs a classification over categories (e.g. “4 stars” is encoded as the vector [0, 0, 0, 1, 0]).
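As a minimal sketch of this encoding (a hypothetical helper, not the project's exact code):

import numpy as np

def encode_stars(stars):
    # Encode a star rating (1-5) as a one-hot vector of length 5,
    # e.g. 4 stars -> [0, 0, 0, 1, 0]
    vector = np.zeros(5, dtype=int)
    vector[stars - 1] = 1
    return vector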

I followed the steps described in this course.

After numerous attempts with different combinations of layers and parameters, the following configuration was chosen:

import tensorflow as tf

# Parameters for word embedding
vocab_size = 10000
embedding_dim = 128
max_length = 256

# Model configuration
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(5, activation='softmax')
])

Training and validation

The first 10,000 reviews were selected, without any filtering criteria, and split 80%-20% to build the training set and the test set respectively. Early stopping was used during training to prevent overfitting.
The resulting accuracy values are 79.8% on the training set and 63.7% on the test set (Figure 1).

Figure 1: Accuracy and loss values for the first run, i.e. before filtering the data.
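As a minimal sketch of the training setup with early stopping (the data variables x_train, y_train, x_test, y_test and the callback parameters, such as patience, are assumptions, not the exact values used in the project):

# Stop training when the validation loss stops improving.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=3, restore_best_weights=True)

model.compile(optimizer='adam',
              loss='categorical_crossentropy',  # targets are one-hot vectors
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    validation_data=(x_test, y_test),
                    epochs=30,
                    callbacks=[early_stopping])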

It would have been very interesting to select 100,000 reviews instead of only a tenth of that, but this was not done due to limits on computational capacity.

Improving performance

Since changing the parameters or the model composition did not significantly improve the results, it was decided to filter the data by eliminating the examples that produced excessively high loss values. The assumption behind this choice is that these input-output pairs may not be very truthful (e.g. the text of the review may not match the assigned stars).
For each of the 10,000 selected records the loss value was computed, and those with a value higher than 2 were discarded.
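As a minimal sketch of this filtering step (variable names are assumptions; the model is the one trained in the first run):

# Per-example loss of the trained model on the training data.
loss_fn = tf.keras.losses.CategoricalCrossentropy(reduction='none')
per_example_loss = loss_fn(y_train, model.predict(x_train)).numpy()

# Keep only the examples whose loss does not exceed the threshold.
keep = per_example_loss <= 2.0
x_filtered, y_filtered = x_train[keep], y_train[keep]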
The resulting accuracy values after this operation are 88.9% for the training set and 70.3% for the test set (Figure 2).

Figure 2: Accuracy and loss values for the second run, i.e. after filtering the data where the examples with high loss values have been deleted.

Final results

Below are some graphs with the data on which a final evaluation of the model for the automatic recognition of a review can be made.
In order to classify between positive and negative, we can consider as positive the reviews with 4 or 5 stars and as negative those with 1 or 2 stars. The 3-star case is arbitrary.
Looking at the graphs shown in Figures 3, 4, 5 and 6, we can see how the accuracy of the model has improved.
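As a minimal sketch of this mapping (a hypothetical helper, not part of the project code):

def sentiment(stars):
    # 4-5 stars -> positive, 1-2 stars -> negative, 3 stars -> arbitrary
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "ambiguous"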

Figure 3: This sentence is definitely one of the worst that can be written about a business. At the first run the model erroneously assigned 5 stars, albeit with considerable uncertainty, while in the second run the best choice (1 star) clearly dominates the others.
Figure 4: This is undoubtedly a positive review, and already at the first run there is a clear preference for the 5-star option, which is further strengthened in the second run. 4 stars would also have been a fair prediction.
Figure 5: An absolutely positive review. The first run already makes a clear choice, and the 82% probability of the second run is an even better result.
Figure 6: This sentence is one of the most interesting because it expresses a balance for which 3 stars would be preferable (being the middle choice). Unfortunately the model is not very accurate here even after the second run; the choice of 1 star is clearly wrong. It can be observed, however, that once again the second run looks more plausible than the first.

Further analysis

Other notebooks containing additional analyses of the Yelp Open Dataset are available in my repository.


Facebook Hacker Cup 2020 Qualification Round Solutions

What is the Facebook Hacker Cup?

The Facebook Hacker Cup is a worldwide competition for programmers promoted by Facebook. The aim of the competition is to find the best programmers in the world, who are awarded cash prizes. The final winner also wins the cup.

The problems proposed during the Facebook Hacker Cup rounds are typical programming problems that a programmer may have to solve in real life: often it is not difficult to write an algorithm, test it on the sample input and verify that the output produced is correct. Each problem provides two text files as example input/output with which to test the program before the official submission. At submission time, a longer input is downloaded and must be used to produce the final output, but no more than six minutes may elapse between downloading the input and uploading the output.
Here comes the best part of the Hacker Cup: a correct algorithm may not be enough to produce the full output required by the problem if its time complexity is not optimal! An algorithm with exponential complexity is fine for small inputs but not for large ones (and in practice the second case is the predominant one).

The algorithms must not only be correct, but also efficient.

As for the programming language, there is complete freedom as long as a compiler or interpreter for that language is publicly available.

The competition starts with a 72-hour qualification round in which 3, 4 or 5 problems are proposed.
Participants who have solved at least one problem correctly pass to the first round, and to enter the second round a minimum score must be obtained (each problem is associated with a score, for a total of 100 points per round).
From the second round onward only the first N participants pass, and here the penalty comes into play, i.e. the sum of the times taken to submit the various answers.

For all other details please refer to the official page.

Qualification round 2020

Travel Restrictions (10 points)

There are N cities, each of which can allow or block incoming and outgoing flights. For each pair of cities, determine whether it is possible to travel from one to the other.

Problem statement – My solution (Python)

Alchemy (15 points)

N stones, each of type A or B, must be fused together to produce a single final stone. A fusion can be performed on a triple of stones that are not all of the same type. Determine whether, given a set of stones, they can all be fused together.

Problem statement – My solution (Python)

Timber (21 points)

In a forest there are trees that can be cut down to form a “timber interval”. Each tree has a specific position and height and can be felled to the right or to the left. Calculate the longest interval that can be obtained.

Problem statement – My solution (Python)

Running on Fumes Chapter 1 (16 points)

A delivery man has to go from one city to another, and along the way there are other cities he will necessarily have to cross. The vehicle he travels in has a maximum fuel capacity, and he has to decide in which cities to refuel in order to complete the journey at the lowest possible cost. Calculate the minimum cost to go from the first city to the last.

Problem statement – My solution (Python)

Other solutions

All the previous solutions, along with some from past years, are available in my GitHub repository.


A Little Blockchain in Erlang

Erlang is a functional programming language that allows you to build concurrent programs in a relatively simple way; its actor-based paradigm offers primitives for creating processes and making them communicate with each other in a way that other languages do not.

Another peculiarity of Erlang, probably the most important one, is the possibility of updating the code of a program without interrupting its execution. An explanation of this feature can be found here.

Among the software written in Erlang is WhatsApp.

A blockchain, a word in vogue in recent years, is a data structure that provides guarantees about the authenticity of the data it contains. A blockchain is a chain of blocks of variable length: you can add blocks only at the top (or only at the bottom) of the chain, but once added they can no longer be modified, only read.
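As a minimal illustration of this append-only, hash-linked structure (the project itself is written in Erlang; this Python sketch only shows the data-structure idea, with hypothetical field names):

import hashlib
import json

def block_hash(content):
    # Hash of a block's content, which includes the previous block's hash.
    return hashlib.sha256(json.dumps(content, sort_keys=True).encode()).hexdigest()

def add_block(chain, data):
    # Blocks can only be appended at one end of the chain.
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    block = {"index": len(chain), "data": data, "prev_hash": prev_hash}
    block["hash"] = block_hash(block)
    chain.append(block)
    return chain

def is_valid(chain):
    # Any modification of an earlier block breaks all the hashes that follow.
    for i, block in enumerate(chain):
        content = {k: v for k, v in block.items() if k != "hash"}
        if block["hash"] != block_hash(content):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True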

During my university career I had to implement a blockchain in Erlang. In this case it is a blockchain distributed over several nodes of a network, so algorithms were also implemented for the exchange of information between nodes (gossiping) and for the resolution of chain forks.
The entire project code, together with documentation on how it works, is available in this repository on GitHub.

Thanks to my colleague Lorenzo for his collaboration on this project.


Three Word-Sense Disambiguation Algorithms

Word-sense disambiguation (WSD) is the part of natural language processing that deals with attributing to a word (token) its meaning in the context in which it appears.

This article presents three probabilistic algorithms to solve this problem, together with their implementations in Python.

The algorithms are:

  • Naive Bayes;
  • Lesk algorithm;
  • EM (Expectation-Maximization) algorithm;

and they are tested on this corpus, appropriately pre-processed with the code available in corpus_preprocessing.py.

Naive Bayes Algorithm

This algorithm belongs to the class of supervised learning algorithms and is based on three factors:

w_i: \text{word} \\ s_i: \text{sense} \\ f_i: \text{feature}

To simplify, we define as features fj of a word wi all the other words wj that appear in the same sentence, with i ≠ j.

count(x): \text{number of occurrences of } x \\ count(s_i, w_j): \text{number of times in which } s_i \text{ is attributed to } w_j \\ count(f_j, s_i): \text{number of times in which } f_j \text{ appears in a } \\ \text{sentence in which there’s a word that } s_i \text{ is attributed to}

So we have:

P(s_i)=\frac{count(s_i,w)}{count(w)} \\ P(f_j|s_i)=\frac{count(f_j,s_i)}{count(s_i)} \\ \hat{s} = \argmax_{s_i \in S} P(s_i) \prod_j P(f_j|s_i)

The result of the last expression represents the most suitable meaning for the word w.
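As a minimal sketch of this decision rule (not the linked implementation; the counts are assumed to be pre-computed from the training corpus, and Laplace smoothing is added to avoid zero probabilities):

import math

# senses: candidate senses of the word w
# count_w: occurrences of w; count_s[s]: occurrences of sense s
# count_fs[(f, s)]: co-occurrences of feature f with sense s
# vocab: number of distinct features (for Laplace smoothing)
def naive_bayes(features, senses, count_w, count_s, count_fs, vocab, alpha=1.0):
    best, best_score = None, -math.inf
    for s in senses:
        # log P(s_i) + sum_j log P(f_j | s_i), with smoothing
        score = math.log(count_s[s] / count_w)
        for f in features:
            score += math.log((count_fs.get((f, s), 0) + alpha)
                              / (count_s[s] + alpha * vocab))
        if score > best_score:
            best, best_score = s, score
    return best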

Python implementation: naive_bayes.py

Lesk Algorithm

The Lesk algorithm uses a dictionary to disambiguate words: WordNet is the database we need.

The idea behind Lesk's algorithm is the overlap between the dictionary definition and the context in which the word to be disambiguated appears. More precisely, the sense chosen by the algorithm is the one in whose definition (possibly extended with some example sentences) the largest number of context words (features) appears.

\overrightarrow{d_s}: \text{words that make up the definition of } s \\ \overrightarrow{e_s}: \text{words that make up the examples for } s \\ \overrightarrow{f}: \text{words that appear in the context} \\ \hat{s} = \argmax_{s \in S} |(\overrightarrow{d_s} \cup \overrightarrow{e_s}) \cap \overrightarrow{f}|
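As a minimal sketch of this rule using NLTK's WordNet interface (assuming nltk and its wordnet data are installed; tokenization is reduced to a simple split):

from nltk.corpus import wordnet as wn

def lesk(word, context_words):
    # Choose the synset whose definition and examples share
    # the most words with the context.
    context = set(context_words)
    best, best_overlap = None, -1
    for synset in wn.synsets(word):
        signature = set(synset.definition().split())
        for example in synset.examples():
            signature |= set(example.split())
        overlap = len(signature & context)
        if overlap > best_overlap:
            best, best_overlap = synset, overlap
    return best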

Python implementation: lesk.py

EM Algorithm

The Expectation-Maximization algorithm belongs to the class of unsupervised learning algorithms.

The goal is to maximize the likelihood of the corpus:

l(C) = \log \prod_i P(c_i)

where C is the corpus and ci are the individual contexts that make up the corpus, i.e. the sentences.

P(c_i) = \sum_k P(c_i|s_k)P(s_k) \\ P(c_i|s_k) = \prod_{f_j \in c_i} P(f_j|s_k)

At the beginning the parameters P(sk) and P(fj|sk) are initialized randomly. As the name suggests, the iterative algorithm is composed of the following phases:

  • E-step: estimate the posterior probability that the sense sk generates the context ci.
h_{ik} = \frac{P(c_i|s_k)P(s_k)}{\sum_{k'=1}^{K}P(c_i|s_{k'})P(s_{k'})}
  • M-step: re-estimation of the parameters P(sk) and P(fj|sk).
P(s_k) = \frac{\sum_{i=1}^{I}h_{ik}}{\sum_{k=1}^{K}\sum_{i=1}^{I}h_{ik}} \\ P(w_j|s_k) = \frac{\sum_{\{c_i|w_j \in c_i\}}h_{ik}}{Z_k} \\ Z_k = \sum_{j} \sum_{\{c_i|w_j \in c_i\}}h_{ik}

The iterative cycle continues as long as l(C) improves.
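As a minimal sketch of the two steps with NumPy (not the linked implementation; contexts are assumed to be given as lists of word indices, and the initialization and stopping criterion follow the description above):

import numpy as np

def em_wsd(contexts, vocab_size, n_senses, max_iter=100, tol=1e-6):
    rng = np.random.default_rng(0)
    # Random initialization of P(s_k) and P(w_j | s_k)
    p_s = rng.dirichlet(np.ones(n_senses))
    p_w = rng.dirichlet(np.ones(vocab_size), size=n_senses)  # shape (K, V)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: h_ik proportional to P(s_k) * prod_{w_j in c_i} P(w_j | s_k),
        # computed in log space to avoid underflow
        log_h = np.array([np.log(p_s) + np.log(p_w[:, c]).sum(axis=1)
                          for c in contexts])          # shape (I, K)
        log_pc = np.logaddexp.reduce(log_h, axis=1)    # log P(c_i)
        h = np.exp(log_h - log_pc[:, None])            # normalized posteriors
        # M-step: re-estimate P(s_k) and P(w_j | s_k) from the posteriors
        p_s = h.sum(axis=0) / h.sum()
        counts = np.full((n_senses, vocab_size), 1e-12)  # avoid log(0)
        for i, c in enumerate(contexts):
            for j in c:
                counts[:, j] += h[i]
        p_w = counts / counts.sum(axis=1, keepdims=True)
        ll = log_pc.sum()                              # l(C)
        if ll - prev_ll < tol:                         # stop when l(C) stops improving
            break
        prev_ll = ll
    return p_s, p_w, h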

Python implementation: em.py

Notes

Apart from the issues reported in the comments within the files themselves, I have not found errors in the source code; however, I do not guarantee its complete correctness.



Advanced Pagination with Bootstrap

Among the many components of Bootstrap there is one for building a block of page indexes to implement pagination. This component is typically used when you have a list of elements that, for one reason or another, cannot all be shown on a single page. In this case, as with search engines, several pages are created with the same template but different elements, sorted by some criterion.

The Pagination Component Documentation Page presents the basic version of this tool and some variants, but without the Javascript code needed to make the buttons work.

If we only need pagination for a few pages, that is enough.
In many other cases, however, we have to deal with the problem that the number of pages is too large to show all the indexes in the user interface.

Having run into this problem, I implemented a simple algorithm that “collapses” the unnecessary page indexes. The result is shown immediately below.

See the Pen Advanced Pagination in Bootstrap by Lorenzo Vainigli (@lorenzovngl) on CodePen.

This algorithm uses three variables:

  • count: total number of pages to be listed.
  • range: number of indexes to be shown to the left and to the right of the current page.
  • page: current page, the one the user is viewing.

Javascript code

let count = 50;
let range = 3;
let page = 15;

if (page > 1){
  $('.pagination').append('<li class="page-item"><a class="page-link" href="./' + (page-1) + '">Previous</a></li>');
}
for (let i = 1; i <= count; i++){
  if (i == 1 && page-range > 1){
    $('.pagination').append('<li class="page-item"><a class="page-link" href="./1">1</a></li>');
    if (page-range > 2){
      $('.pagination').append('<li class="page-item disabled"><a class="page-link" href="#">...</a></li>');
    }
  }
  if (i >= page-range && i <= page+range){
    if (i != page){
      $('.pagination').append('<li class="page-item"><a class="page-link" href="./' + i + '">' + i + '</a></li>');
    } else {
      $('.pagination').append('<li class="page-item active"><a class="page-link" href=""./' + i + '"">' + i + '</a></li>');
    }
  }
  if (i == count && page+range < count){
    if (page+range < count-1){
      $('.pagination').append('<li class="page-item disabled"><a class="page-link" href="#">...</a></li>');
    }
    $('.pagination').append('<li class="page-item"><a class="page-link" href="./' + count + '">' + count + '</a></li>'); 
  }
}
if (page < count){
  $('.pagination').append('<li class="page-item"><a class="page-link" href="./' + (page+1) + '">Next</a></li>');
}

PHP code

$count = 50;
$range = 3;
$page = 15;

if ($page > 1){
  echo '<li class="page-item"><a class="page-link" href="./'.($page-1).'">Previous</a></li>';
}
for ($i=1; $i<=$count; $i++){
  if ($i == 1 && $page-$range > 1){
    echo '<li class="page-item"><a class="page-link" href="./1">1</a></li>';
    if ($page-$range > 2){
      echo '<li class="page-item disabled"><a class="page-link" href="#">...</a></li>';
    }
  }
  if ($i >= $page-$range && $i <= $page+$range){
    if ($i != $page){
      echo '<li class="page-item"><a class="page-link" href="./'.$i.'">'.$i.'</a></li>';
    } else {
      echo '<li class="page-item active"><a class="page-link" href=""./'.$i.'"">'.$i.'</a></li>';
    }
  }
  if ($i == $count && $page+$range < $count){
    if ($page+$range < $count-1){
      echo '<li class="page-item disabled"><a class="page-link" href="#">...</a></li>';
    }
    echo '<li class="page-item"><a class="page-link" href="./'.$count.'">'.$count.'</a></li>';
  }
}
if ($page < $count){
  echo '<li class="page-item"><a class="page-link" href="./'.($page+1).'">Successiva</a></li>';
}

Dependencies (Bootstrap 4.0.0)

CSS

https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css

Javascript

https://code.jquery.com/jquery-3.2.1.slim.min.js
https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.12.9/umd/popper.min.js
https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/js/bootstrap.min.js


Disable Google Analytics only for site’s administrators

Services used to track visitors to your site, such as Google Analytics, are essential tools to get an idea of the performance of your web pages.

The problem is that administrators, content creators and any other profiles that manage the content of a site are among its most active users. In the long run this activity, which Analytics records as that of a “visitor” without distinguishing between users and administrators, distorts the numbers of real users, i.e. those who have no interest in the growth or decline of a site's numbers.

This is very important, especially if the visitor numbers will be the basis for choosing marketing strategies.

How to do that?

Using cookies

To track visitors, Analytics uses a short Javascript snippet like the one below, included in every single page of the site and executed by each visitor's browser:

window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-XXXXXXXX-X');

Quite simply, we can define a cookie to be set in the administrator's browser (for example no-tracking) and, with a conditional statement, make the code execute the previous instructions only if the cookie is not present. The value of the cookie is therefore not relevant.

// indexOf(str) returns -1 if there is no occurrence of string str,
// the index of the first occurrence of str otherwise
if (document.cookie.indexOf("no-tracking") === -1){
    window.dataLayer = window.dataLayer || [];
    function gtag(){dataLayer.push(arguments);}
    gtag('js', new Date());
    gtag('config', 'UA-XXXXXXXX-X');
}

The cookie can also be set manually in the administrator’s browser.

Once this operation has been carried out, Google Analytics will only track the real users who visit the site, making the collected data much more reliable.