Voice recordings for the diagnosis of COVID-19 with Deep Convolutional Neural Networks

In this article I describe my master's thesis project in Computer Science at the University of Bologna, in which I used Deep Convolutional Neural Networks to build machine learning models able to detect cases of COVID-19 by analyzing voice recordings.

In the medical field, artificial intelligence is now widely used, particularly as an aid in detecting diseases such as Parkinson's disease and post-traumatic stress disorder. Among their consequences, diseases like these alter the voice of the people affected. The voice thus becomes a so-called biomarker of the disease: we can record the voice of the people we want to diagnose and obtain audio sources that contain useful information. In the case of new diseases we know little about, such as COVID-19, this information is hard to find precisely because we do not know what to look for. The appropriate tool for this kind of problem, where we have the data (input-output combinations) but not the "rules", is machine learning.

The study started with the analysis of the article "Exploring Automatic Diagnosis of COVID-19 from Crowdsourced Respiratory Sound Data", published by a team of researchers at the University of Cambridge. The goal of their research is to build a dataset of cough and breath recordings that can be used to train machine learning models to discriminate cases of COVID-19. To collect the data, a crowdsourcing campaign was launched, with voluntary participation open to everyone via mobile apps and a web form to facilitate its scalability. In addition to recording their voice, users enter some data about their clinical status and, in particular, whether they have tested positive for COVID-19.
The collected audio samples were used to train models such as Logistic Regression, Gradient Boosting Trees, and Support Vector Machines.

The researchers divided the work into three different tasks, and a different combination of inputs was used for each.

  • In the first task the model must distinguish patients with COVID-19 from healthy patients. Both cough and breath recordings were used for this task.
  • In the second task the model must distinguish patients with COVID-19 who reported cough among their symptoms from patients who do not have COVID-19 but reported having a cough. For this task, only the cough recordings were used.
  • In the third task the distinction to be made is the same as in the second task, with the difference that the COVID-19-negative patients do not simply have a cough but an asthmatic cough. For this task, only the breath recordings were used.

The dataset contains sensitive data and is not publicly available, but thanks to a sharing agreement between the University of Cambridge and the University of Bologna I was able to access it.

While replicating their experiment, I noticed that the feature extraction part was very complex: the models used (such as LR or SVM) cannot process raw audio files, so features must be extracted manually. The approach used by the Cambridge researchers is the following.

  • Handcrafted features: 477 features were generated by extracting from each source the following properties: duration, onsets, time, period, RMS energy, spectral centroid, roll-off frequency, zero-crossing rate, MFCC, δ-MFCC, δ²-MFCC. This method requires very specific knowledge of audio processing techniques and would therefore require an expert in this field.
  • VGGish features: 256 features were extracted automatically using VGGish, a neural network specialized in audio processing and trained on a dataset sampled from YouTube. VGGish divides each audio file into sub-samples of 0.96 seconds and extracts 128 features from each of them. The final vector of 256 values was obtained by concatenating the vector of averages (128 values) and the vector of standard deviations (another 128 values) computed over the vectors generated by VGGish for each individual file (Figure 1).
Figure 1: feature extraction scheme with VGGish for a single file. M is the vector of averages and S is the vector of standard deviations.

The method just described yields a final vector of 733 features to be used as input for the machine learning models. However, the great complexity of this operation led me to look for an alternative.
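The VGGish pooling step described above can be sketched in a few lines; this is a minimal NumPy illustration, assuming the per-segment 128-dimensional embeddings have already been extracted by VGGish:

```python
import numpy as np

def pool_vggish_embeddings(segment_embeddings):
    """Collapse per-segment VGGish embeddings (n_segments x 128) into a
    single 256-value feature vector: the per-dimension means (M)
    followed by the per-dimension standard deviations (S)."""
    E = np.asarray(segment_embeddings, dtype=float)
    means = E.mean(axis=0)                 # M: 128 values
    stds = E.std(axis=0)                   # S: 128 values
    return np.concatenate([means, stds])   # 256 values

# Example: a ~5-second clip yields 5 sub-samples of 0.96 s each
embeddings = np.random.rand(5, 128)        # stand-in for VGGish output
features = pool_vggish_embeddings(embeddings)
print(features.shape)  # (256,)
```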

Among the various solutions available are Deep Convolutional Neural Networks: neural networks equipped with "convolutions", i.e. filters that transform the pixels of an image through matrix multiplications. The image is decomposed into smaller and smaller sub-images from which features are extracted and then fed into a series of dense layers responsible for the actual classification (Figure 2).

Figure 2: architecture of a Deep Convolutional Neural Network.

This solution would allow us to automate the feature extraction part, but how can audio recordings be combined with models that process images? We can rely on the concept of the spectrogram.
The spectrogram can be considered "the visual fingerprint of a sound": it is a graph with frequencies on the y-axis (often on a logarithmic scale) and time on the x-axis. Each point of the graph is colored according to the intensity of the sound at that frequency at that moment in time (Figure 3).

Figure 3: example of a spectrogram.

With this method, the feature extraction step reduces to the generation of spectrograms alone.
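As a minimal illustration of the idea (in practice audio libraries such as librosa are typically used), a magnitude spectrogram can be computed with nothing more than NumPy's FFT; the window size and hop length below are illustrative choices, not the values used in the experiments:

```python
import numpy as np

def spectrogram(signal, n_fft=512, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time
    frames, values are intensities on a decibel scale."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    stft = np.abs(np.fft.rfft(frames, axis=1)).T  # keep non-negative freqs
    return 20 * np.log10(stft + 1e-10)            # avoid log(0)

# One second of a 440 Hz tone sampled at 16 kHz
t = np.linspace(0, 1, 16000, endpoint=False)
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (257, 122): n_fft // 2 + 1 bins, one column per frame
```

Plotting this matrix as an image, with the color mapped to the dB values, gives exactly the kind of picture shown in Figure 3.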

Unfortunately, there is a trade-off inherent in neural networks: the countless possible choices of hyperparameters and network configurations. Below are just a few of the variables to set.

  • Number of epochs
  • Batch size
  • Learning rate
  • Input shape
  • Number of nodes in the hidden layers
  • Number of hidden layers
  • Dropout rate
  • Topology
  • Online data augmentation
  • Offline data augmentation
  • Class weights
  • Fine tuning

The countless possible combinations, combined with long training times (which in this case can take minutes or even hours), made an exhaustive search impossible. I therefore opted for a local search approach: starting from hyperparameter values and configurations commonly used in the literature, I varied these values and observed how the behavior of the model changed.
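The local search idea can be sketched as follows. This is an illustration only: `neighbors` and `evaluate` are hypothetical stand-ins, and in the real setting `evaluate` would train the model and return the validation ROC-AUC, which is why each step is so expensive:

```python
def local_search(start, neighbors, evaluate):
    """Greedy local search: move to the best-scoring neighboring
    configuration until no neighbor improves the score."""
    best, best_score = start, evaluate(start)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(best):
            score = evaluate(candidate)
            if score > best_score:
                best, best_score, improved = candidate, score, True
    return best, best_score

# Stand-in example tuning two hyperparameters around common defaults
def neighbors(cfg):
    yield {**cfg, "batch_size": cfg["batch_size"] // 2}
    yield {**cfg, "batch_size": cfg["batch_size"] * 2}
    yield {**cfg, "epochs": cfg["epochs"] - 10}
    yield {**cfg, "epochs": cfg["epochs"] + 10}

evaluate = lambda cfg: -abs(cfg["batch_size"] - 32) - abs(cfg["epochs"] - 30)
best, _ = local_search({"batch_size": 16, "epochs": 50}, neighbors, evaluate)
print(best)  # {'batch_size': 32, 'epochs': 30}
```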

In the end I used a model like the one shown in Figure 4. The "core" of the architecture encapsulates one of the four networks used for the experiments (MobileNetV2, ResNet50, VGG16 or Xception), with pre-trained weights, to which two dense layers were added: one of 256 neurons and another of 128 neurons, both with a 20% dropout for regularization purposes.

Figure 4: architecture of the model used for the experiments.
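In Keras, the architecture just described can be sketched as follows. This is a minimal illustration: the input shape is a hypothetical choice, and `weights=None` keeps the sketch self-contained, whereas the experiments started from pre-trained weights:

```python
import tensorflow as tf

def build_model(input_shape=(224, 224, 3), num_classes=2):
    """Convolutional core (ResNet50 here; MobileNetV2, VGG16 or Xception
    are drop-in alternatives) followed by two dense layers, each with a
    20% dropout for regularization."""
    core = tf.keras.applications.ResNet50(
        include_top=False, weights=None,   # pre-trained weights in the experiments
        input_shape=input_shape, pooling="avg")
    return tf.keras.Sequential([
        core,
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_model()
print(model.output_shape)  # (None, 2)
```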


Figure 5: ROC curve obtained for task 1.
Figure 6: comparison of results for task 1.

Regarding the first task, as shown in Figure 5, the four networks achieve roughly the same ROC-AUC, about 80%. ResNet50 and Xception do slightly better, with the former also obtaining a slightly lower standard deviation. In Figure 6, the result of ResNet50 is compared with the results published by the Cambridge researchers, who used Logistic Regression for this task. The comparison shows that, although there is a small improvement in ROC-AUC, the CNN-based method loses a lot in precision and recall.

Figure 7: comparison of results for task 2.
Figure 8: comparison of results for task 3.

In the article from the English university, the results obtained with Support Vector Machines were published for the second and third tasks. A comparison between SVM and MobileNetV2 is shown in Figures 7 and 8; in these two tasks, MobileNetV2 was the best-performing CNN. Considering that ROC-AUC is the most important metric for the purpose of the experiment, SVM obtains better results than the CNNs, even though the CNNs outperform SVM by a wide margin in precision and recall in task 3.


After comparing the shallow learning approach (Logistic Regression and SVM) with the deep learning approach (Deep Convolutional Neural Networks), we can conclude that the two solutions are roughly equivalent, even if the first seems to achieve slightly better results.

However, it is widely agreed in the literature that deep learning models struggle when trained on limited datasets, where shallow learning models perform better. The dataset release (dated May 2020) provided by the University of Cambridge is indeed a first version with few data points. However, as mentioned at the beginning, the goal of the data collection campaign is to create a scalable dataset, so it is plausible to expect CNNs to perform progressively better than LR and SVM as more samples become available.

A new paper was recently published by the same team of researchers: Exploring Automatic COVID-19 Diagnosis via Voice and Symptoms from Crowdsourced Data, in which they repeat the experiment with an updated version of the dataset. It would be interesting to replicate the approach with Deep Convolutional Neural Network on this new data release.

The complete thesis, in Italian, is available at this link.

Cover image: Bagad, Piyush & Dalmia, Aman & Doshi, Jigar & Nagrani, Arsha & Bhamare, Parag & Mahale, Amrita & Rane, Saurabh & Agarwal, Neeraj & Panicker, Rahul. (2020). Cough Against COVID: Evidence of COVID-19 Signature in Cough Sounds.

Programming Projects

Classification of Reviews with Yelp Open Dataset

Yelp Open Dataset is a collection of data about users who write reviews on businesses belonging to different commercial categories (e.g. restaurants, car rentals, etc.).

Thanks to this database I was able to build a model based on a neural network for the automatic classification of reviews, using natural language processing and machine learning techniques.

The trained model can be used on Google Colab by clicking the button below.

Open in Colab

The following sections describe how the project was carried out.

Building the model

The input values were computed by applying word embedding techniques to the text of the reviews. The output value for each input was derived from the number of stars associated with the review and transformed into a binary vector, so that the model can perform a category-based classification (e.g. "4 stars" was encoded as the vector [0, 0, 0, 1, 0]).
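The encoding just described can be sketched as:

```python
def encode_stars(stars, num_classes=5):
    """Encode a 1-5 star rating as a binary vector for
    category-based classification."""
    vector = [0] * num_classes
    vector[stars - 1] = 1
    return vector

print(encode_stars(4))  # [0, 0, 0, 1, 0]
```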

I followed the steps described in this course.

After numerous attempts, trying different configurations of different layers and parameters, the following configuration was chosen:

import tensorflow as tf

# Parameters for word embedding
vocab_size = 10000
embedding_dim = 128
max_length = 256

# Model configuration
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')  # one class per star rating
])

Training and validation

The first 10,000 reviews were selected, without any filtering criteria, and split 80%-20% to build the training and test sets, respectively. Early stopping was added during training to prevent overfitting.
The resulting accuracy values are 79.8% on the training set and 63.7% on the test set (Figure 1).
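The split and the stopping criterion can be sketched in plain Python. This is an illustration of the idea only (in the project itself Keras's EarlyStopping callback does this job), and the patience of 3 epochs is a hypothetical value:

```python
def split_80_20(examples):
    """Hold out the last 20% of the examples as the test set."""
    cut = int(len(examples) * 0.8)
    return examples[:cut], examples[cut:]

def should_stop(val_losses, patience=3):
    """Stop when the validation loss has not improved for `patience`
    consecutive epochs."""
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience

train, test = split_80_20(list(range(10000)))
print(len(train), len(test))  # 8000 2000
print(should_stop([0.9, 0.7, 0.8, 0.81, 0.82]))  # True: 3 epochs without improvement
```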

Figure 1: Accuracy and loss values for the first run, i.e. before filtering the data.

It would have been very interesting to select 100,000 reviews instead of only a tenth of them, but this was not feasible due to limited computational capacity.

Improving performances

Considering that changing the parameters or the composition of the model did not significantly improve the results, I decided to filter the data by removing the examples that produced excessively high loss values. The assumption behind this choice is that these input-output pairs may not be very truthful.
For each of the 10,000 selected records, the loss value was calculated, and those with a value higher than 2 were discarded.
The resulting accuracy values after this operation are 88.9% on the training set and 70.3% on the test set (Figure 2).

Figure 2: Accuracy and loss values for the second run, i.e. after filtering the data where the examples with high loss values have been deleted.
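The filtering step can be sketched with NumPy as follows; `filter_by_loss` and the toy predictions are hypothetical names and values for illustration, not the project's actual code:

```python
import numpy as np

def filter_by_loss(X, y_true, y_pred, threshold=2.0):
    """Per-example categorical cross-entropy; keep only the examples
    whose loss does not exceed the threshold."""
    p_true = y_pred[np.arange(len(X)), y_true.argmax(axis=1)]
    losses = -np.log(p_true + 1e-10)
    keep = losses <= threshold
    return X[keep], y_true[keep]

# Toy data: 3 reviews, 5 star classes; the middle prediction assigns the
# true class a probability of 0.05, i.e. a loss of about 3.0, so it is dropped
X = np.array([[0], [1], [2]])
y_true = np.eye(5)[[3, 0, 2]]
y_pred = np.array([[0.02, 0.02, 0.02, 0.90, 0.04],
                   [0.05, 0.40, 0.30, 0.15, 0.10],
                   [0.10, 0.10, 0.50, 0.20, 0.10]])
X_kept, y_kept = filter_by_loss(X, y_true, y_pred)
print(len(X_kept))  # 2
```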

Final results

Below are some graphs showing the data on which a final evaluation of the model for the automatic recognition of a review can be made.
To classify reviews as positive or negative, we can consider as positive those with 4 or 5 stars and as negative those with 1 or 2 stars. The 3-star case is arbitrary.
Looking at the graphs in Figures 3, 4, 5 and 6 we can see how the accuracy of the model has improved.

Figure 3: This sentence is definitely one of the worst you can write about a business. At the first run the model erroneously assigned 5 stars, although with considerable uncertainty, while in the second run the best choice (1 star) clearly dominates the others.
Figure 4: This is undoubtedly a positive review, and even at the first run there is a clear choice for the 5-star option, which is further strengthened in the second run. Expecting 4 stars would also have been fair.
Figure 5: An absolutely positive review. Already at the first run there is a clear choice, but the 82% probability of the second run is a remarkably accurate result.
Figure 6: This sentence is one of the most interesting because it represents a balanced case for which 3 stars would be preferable (being the middle choice). Unfortunately the model is not very accurate here even after the second run; the choice to assign one star is definitely wrong. However, once again, the second run looks more truthful than the first.

Further analysis

Other notebooks are available in my repository that contain additional analysis of Yelp Open Dataset data.


A Little Blockchain in Erlang

Erlang is a functional programming language that makes it relatively simple to build concurrent programs; its actor-based paradigm offers primitives to create processes and make them communicate with each other in a way that other languages do not.

Another peculiarity of Erlang, probably the most important one, is the possibility of updating the code of a program without interrupting its execution. Here is an explanation of this functionality.

Among the software written in Erlang is WhatsApp.

A blockchain, a word in vogue in recent years, is a data structure that provides guarantees about the authenticity of the data it contains. A blockchain is in fact a chain of blocks of variable length: blocks can be added only at the top (or only at the bottom) of the chain, and once added they can no longer be modified, only read.
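The append-only hash chain just described can be sketched in a few lines; here in Python rather than Erlang, purely to keep the example self-contained:

```python
import hashlib

def make_block(data, prev_hash):
    """A block stores its data, the hash of the previous block, and its
    own hash computed over both."""
    digest = hashlib.sha256((prev_hash + data).encode()).hexdigest()
    return {"data": data, "prev_hash": prev_hash, "hash": digest}

def append_block(chain, data):
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    chain.append(make_block(data, prev_hash))

def is_valid(chain):
    """Modifying an earlier block breaks every hash after it, which is
    what makes tampering detectable."""
    prev_hash = "0" * 64
    for block in chain:
        expected = hashlib.sha256((prev_hash + block["data"]).encode()).hexdigest()
        if block["prev_hash"] != prev_hash or block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True

chain = []
append_block(chain, "first entry")
append_block(chain, "second entry")
print(is_valid(chain))   # True
chain[0]["data"] = "tampered entry"
print(is_valid(chain))   # False
```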

During my university career I had to implement a blockchain in Erlang. In this case it is a blockchain distributed over several nodes of a network, so algorithms were implemented for the exchange of information between nodes (gossiping) and for the resolution of chain forks.
The entire project code, together with documentation on how it works, is available in this repository on GitHub.

Thanks to my colleague Lorenzo for the collaboration in the realization of this project.


The new website of Pro Loco Sovicille

In recent months I have been working on the design and implementation of a new site for the Pro Loco of Sovicille, a non-profit association that aims to enhance the places, traditions and resources of the territory of the municipality of Sovicille, Siena (Italy), promoting initiatives among the locals and welcoming tourists.

During a first study phase of the old site, which I had already had the opportunity to look at for Nice Places, I immediately noticed a large amount of content: in-depth descriptions of the places in the area, local products and a database with many accommodation facilities.

Content organization

Since it is by definition a static site that does not involve user interaction, I thought the best way to organize this information was to design the composition of the URLs so as to uniquely identify each page. In addition, this leads to better indexing on search engines and easier sharing on social media.

All URLs of the site have the form: [subsection/]resource

without any numerical indexes or parameters, so as to be as human-readable as possible.

Information and translations are organized in a MySQL database.

Articles publication

A CMS (Content Management System) was needed to enable the Pro Loco to publish news and communications about new events. I chose WordPress without hesitation, for several reasons:

  • I know it pretty well because I use it for both this blog and the Nice Places blog;
  • it is written in PHP, my favorite server-side language and the one I always use;
  • it is a tool supported by thousands of people and, in my opinion (and not only mine), the best existing CMS.

My only concern was writing the scripts that would allow me to integrate the articles into the layout of the site, as a sort of ad-hoc theme.


Responsive design

Internet traffic from mobile devices surpassed desktop traffic a few years ago, making it mandatory for any site to be mobile-ready.
With Bootstrap everything is easier: only a few tweaks were necessary, and the framework gives me the certainty of a site that fits every type of screen.

Social sharing

In my opinion, a site that aims both to disseminate historical information and to promote a territory goes very well with the possibility of creating pages optimized for sharing on social networks.

In the meta tags the information on the page has been inserted according to the guidelines of Open Graph, while for the sharing buttons I have relied on AddThis.

Translations management

The site has been translated into four languages: Italian, English, French and German.
Obviously it was unthinkable to create four files for each page, so I thought of using a simple function

function t($slug) {...}

that returns the right translation for the input slug identifier, retrieved from a dictionary. This way, in the source code I only had to write

<?php echo t("slug-parola") ?>

so that a single file serves all four language versions.


Parallax effect

I really like the parallax effect because it gives a web page a three-dimensional feel, so I decided to implement it in all the pages that have photos accompanied by a description, as is the case for places and accommodation facilities.

Stylized map

On the homepage there is a map of the places of interest around Sovicille, which I made with the HTML5 drawing tools.
Making that map was very challenging because the only reference I could use was the geographical point, expressed as latitude and longitude, of each single place. I could not use these numbers without any adaptation, since we go from terrestrial coordinates expressed in degrees to screen coordinates expressed in pixels.

The resulting conversion was very interesting: I had never faced a similar problem before.
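The conversion can be sketched as a linear mapping between a geographic bounding box and the canvas; the bounding box below is a hypothetical one around Sovicille, not the values actually used on the site:

```python
def geo_to_pixel(lat, lon, bounds, width, height):
    """Linearly map a (latitude, longitude) point inside a bounding box
    to (x, y) canvas pixels. Canvas y grows downwards, so the latitude
    axis is flipped."""
    lat_min, lat_max, lon_min, lon_max = bounds
    x = (lon - lon_min) / (lon_max - lon_min) * width
    y = (lat_max - lat) / (lat_max - lat_min) * height
    return round(x), round(y)

# Hypothetical bounding box around the municipality of Sovicille
bounds = (43.20, 43.30, 11.15, 11.30)
print(geo_to_pixel(43.30, 11.15, bounds, 800, 600))  # (0, 0): top-left
print(geo_to_pixel(43.20, 11.30, bounds, 800, 600))  # (800, 600): bottom-right
```

Over an area of a few kilometers this linear mapping is accurate enough; at larger scales a proper map projection would be needed.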


The creation of the new site of the Pro Loco of Sovicille was a long and demanding job, but it gave me the opportunity to better express my vision for this type of website, making use of all my skills.

Hours of work: about 150.

Here is an introduction article for the new site (in Italian).