Lorenzo Vainigli

Software developer, passionate about programming

Voice recordings for the diagnosis of COVID-19 with Deep Convolutional Neural Networks

11 April 2021
8 min.


In this article I describe my master’s thesis project in Computer Science at the University of Bologna, in which I used Deep Convolutional Neural Networks to build machine learning models able to detect cases of COVID-19 by analyzing voice recordings.

In the medical field, artificial intelligence is now widely used, in particular as an aid to detecting diseases such as Parkinson’s disease and post-traumatic stress disorder. Among their consequences, these diseases alter the voice of the people affected. The voice thus becomes a so-called biomarker of the disease, so we can record the voice of the people we want to diagnose and obtain audio sources that contain useful information. For new diseases we know little about, such as COVID-19, this information is difficult to find precisely because we do not know what to look for. Machine learning is the appropriate tool for this kind of problem, in which we have the data (input-output pairs) but not the “rules”.

My study started from the article “Exploring Automatic Diagnosis of COVID-19 from Crowdsourced Respiratory Sound Data”, published by a team of researchers at the University of Cambridge. The goal of their research is to build a dataset of cough and breath recordings that can be used to train machine learning models to discriminate cases of COVID-19. To collect the data, they launched a crowdsourcing campaign with voluntary participation, open to everyone through mobile apps and a web form (www.covid-19-sounds.org) to make it easier to scale. In addition to recording their voice, users enter some data about their clinical status and, in particular, whether they have tested positive for COVID-19.
The collected audio samples were used to train models such as Logistic Regression, Gradient Boosting Trees, and Support Vector Machines.

The researchers divided the work into three different tasks, and a different combination of inputs was used for each.

  • In the first task the model must distinguish patients with COVID-19 from healthy patients. Both cough and breath recordings were used for this task.
  • In the second task the model must distinguish patients with COVID-19 who reported a cough as one of their symptoms from patients who do not have COVID-19 but indicated that they had a cough. For this task, only the cough recordings were used.
  • In the third task the distinction to be made is the same as in the second task, except that the COVID-19 negative patients have not just a cough but an asthmatic cough. For this task, only the breath recordings were used.
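The three tasks can be read as three different filters over the same pool of recordings. The sketch below is only an illustration of that reading: the `Sample` record and its fields (`covid_positive`, `has_cough`, `has_asthma`, `healthy`) are hypothetical names and do not reflect the actual schema of the Cambridge dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    # All field names are hypothetical, chosen only for this illustration.
    covid_positive: bool
    has_cough: bool = False
    has_asthma: bool = False
    healthy: bool = False  # COVID-negative user counted as healthy


# Recording types used as input for each task, as listed above.
TASK_INPUTS = {1: ("cough", "breath"), 2: ("cough",), 3: ("breath",)}


def task_label(sample: Sample, task: int) -> Optional[int]:
    """Return 1 (COVID-19), 0 (negative class) or None (sample not used in the task)."""
    if task == 1:
        if sample.covid_positive:
            return 1
        return 0 if sample.healthy else None
    if task in (2, 3):
        if not sample.has_cough:
            return None  # both classes must have reported a cough
        if sample.covid_positive:
            return 1
        if task == 3 and not sample.has_asthma:
            return None  # task 3 negatives must also have asthma
        return 0
    raise ValueError(f"unknown task: {task}")
```

For example, a COVID-negative user with a cough but no asthma would be a negative example in task 2 and would be left out of task 3.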

The dataset contains sensitive data and is not publicly available, but thanks to a data-sharing agreement between the University of Cambridge and the University of Bologna I was able to access it.

While replicating their experiment I noticed that the feature extraction part was very complex: the models used (such as LR or SVM) cannot process raw audio files, so features must be extracted manually. The Cambridge researchers used the following approach.

  • Handcrafted features: 477 features were generated by extracting the following properties from each source: duration, onsets, tempo, period, RMS energy, spectral centroid, roll-off frequency, zero crossing rate, MFCC, δ-MFCC, δ²-MFCC. This method requires very specific knowledge of audio processing techniques and would therefore require an expert in the field.
  • VGGish features: 256 features were extracted automatically using VGGish, a neural network specialized in audio processing and trained on a dataset sampled from YouTube. VGGish divides each audio file into sub-samples of 0.96 seconds and extracts 128 features for each of them. The final vector of 256 values was obtained by computing, over the vectors generated by VGGish for each individual file, the vector of averages (128 values) and the vector of standard deviations (another 128 values) (Figure 1).

Figure 1: feature extraction scheme with VGGish for a single file. M is the vector of averages and S is the vector of standard deviations.
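The aggregation step in Figure 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the researchers' code: random numbers stand in for real VGGish output, and both the order of concatenation (M before S) and the use of the population standard deviation are assumptions.

```python
import numpy as np


def aggregate_embeddings(embeddings: np.ndarray) -> np.ndarray:
    """Collapse a (n_frames, 128) array of per-frame embeddings into one 256-value vector."""
    if embeddings.ndim != 2 or embeddings.shape[0] == 0:
        raise ValueError("expected a non-empty array of shape (n_frames, n_features)")
    m = embeddings.mean(axis=0)  # M: vector of averages
    s = embeddings.std(axis=0)   # S: standard deviations (population, ddof=0: an assumption)
    return np.concatenate([m, s])  # M followed by S (order is an assumption)


# Stand-in data: 5 frames of 0.96 s each, 128 values per frame.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 128))
features = aggregate_embeddings(fake_embeddings)
print(features.shape)  # (256,)
```

Whatever the length of the recording, the result always has 256 values, which is what makes it usable as input for models that expect fixed-size vectors.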

The method just described yields a final vector of 733 features (477 handcrafted plus 256 from VGGish) to be used as input for the machine learning models. However, the great complexity of this operation led me to look for an alternative.

Among the available solutions are Deep Convolutional Neural Networks: neural networks equipped with “convolutions”, i.e. filters that transform the pixels of an image through matrix multiplications. The image is decomposed into smaller and smaller sub-images from which features are extracted; these are then passed as input to a series of dense layers responsible for the actual classification (Figure 2).

Figure 2: architecture of a Deep Convolutional Neural Network. Image from becominghuman.ai.
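To make the idea of a convolution filter concrete, here is a minimal NumPy sketch of a single filter sliding over a tiny image. The image and the kernel are made up for illustration. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np


def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `image` (no padding, stride 1) and sum the element-wise products."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    if out_h < 1 or out_w < 1:
        raise ValueError("kernel is larger than the image")
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# A 4x4 image that is dark on the left and bright on the right.
image = np.array([[0, 0, 1, 1]] * 4)
edge_kernel = np.array([[-1, 1]])  # responds where brightness increases left to right
result = conv2d_valid(image, edge_kernel)
print(result)  # each row is [0. 1. 0.]: the filter fires only at the vertical edge
```

In a real network the kernel values are not chosen by hand as they are here; they are learned during training.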

This solution would allow us to automate the feature extraction part, but how can we combine audio recordings with models that process images? We can rely on the concept of the spectrogram.
The spectrogram can be considered “the visual fingerprint of a sound”: it is a graph with frequency on the y-axis (often on a logarithmic scale) and time on the x-axis. Each point on the graph is colored according to the intensity of the sound at that frequency at that moment in time (Figure 3).

Figure 3: example of a spectrogram.

With this method, the feature extraction part is reduced to generating spectrograms.
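As a rough idea of what generating a spectrogram involves, the sketch below computes one with a short-time Fourier transform in plain NumPy. It is not the pipeline used in the thesis: the frame length, hop size and Hann window are arbitrary choices, the frequency axis is linear rather than logarithmic, and the input is a synthetic tone instead of a voice recording.

```python
import numpy as np


def log_spectrogram(signal, frame_len=256, hop=128):
    """Return a (n_freq_bins, n_frames) array of magnitudes in dB (short-time Fourier transform)."""
    signal = np.asarray(signal, dtype=float)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (signal.size - frame_len) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len // 2 + 1)
    db = 20.0 * np.log10(magnitude + 1e-10)          # small offset avoids log(0)
    return db.T  # rows = frequency (y-axis), columns = time (x-axis)


# Synthetic input: one second of a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = log_spectrogram(tone)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 256
print(spec.shape, peak_hz)  # (129, 61) 1000.0
```

The resulting 2-D array can be rendered as an image, with each value mapped to a color, and that image is what the convolutional network receives as input.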

Unfortunately, neural networks come with a cost of their own: the countless possible choices of hyperparameters and network configurations. Below are just a few of the variables to set.

  • Number of epochs
  • Batch size
  • Learning rate
  • Input shape
  • Number of nodes in the hidden layers
  • Number of hidden layers
  • Dropout rate
  • Topology
  • Online data augmentation
  • Offline data augmentation
  • Class weights
  • Fine tuning

The number of possible combinations, together with long training times (which in this case can range from minutes to hours), made it impossible to perform an exhaustive search. I therefore opted for a local search approach: starting from the hyperparameter values and configurations commonly used in the literature, I varied them and observed how the behavior of the model changed.
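The local search idea can be sketched as follows. This is a generic greedy variant, not a record of the exact procedure I followed: the grid values are invented, and `toy_score` is a stand-in for a full training run that returns a validation metric.

```python
from typing import Callable, Dict, List

Config = Dict[str, float]


def local_search(start: Config, grid: Dict[str, List[float]],
                 evaluate: Callable[[Config], float], max_rounds: int = 10) -> Config:
    """Greedy local search: change one hyperparameter at a time and keep any improvement."""
    best, best_score = dict(start), evaluate(start)
    for _ in range(max_rounds):
        improved = False
        for name, values in grid.items():
            for value in values:
                if value == best[name]:
                    continue
                candidate = {**best, name: value}
                score = evaluate(candidate)  # in practice: one full training run
                if score > best_score:
                    best, best_score, improved = candidate, score, True
        if not improved:
            break  # no single change helps any more: local optimum
    return best


def toy_score(cfg: Config) -> float:
    """Made-up objective that peaks at dropout=0.2 and batch_size=32."""
    return -abs(cfg["dropout"] - 0.2) - abs(cfg["batch_size"] - 32) / 100


grid = {"dropout": [0.0, 0.2, 0.5], "batch_size": [16, 32, 64]}
best = local_search({"dropout": 0.5, "batch_size": 64}, grid, toy_score)
print(best)  # {'dropout': 0.2, 'batch_size': 32}
```

Compared with an exhaustive grid search, this evaluates far fewer configurations, at the price of possibly stopping at a local optimum.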

In the end I used a model like the one shown in Figure 4. The “core” of the architecture is one of the four networks used for the experiments (MobileNetV2, ResNet50, VGG16 or Xception), with its weights already initialized, to which two dense layers were added: one of 256 neurons and another of 128 neurons, both with a 20% dropout for regularization.

Figure 4: architecture of the model used for the experiments.
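An architecture of this kind could be written with Keras roughly as follows. This is a sketch under several assumptions that the text above does not confirm: the use of TensorFlow/Keras itself, ImageNet as the source of the initial weights, the 224×224×3 input shape, global average pooling after the core, ReLU activations, a single sigmoid output unit, and the optimizer and loss. Per-network input preprocessing is left out.

```python
import tensorflow as tf

# The four cores mentioned in the text, as exposed by tf.keras.applications.
BACKBONES = {
    "MobileNetV2": tf.keras.applications.MobileNetV2,
    "ResNet50": tf.keras.applications.ResNet50,
    "VGG16": tf.keras.applications.VGG16,
    "Xception": tf.keras.applications.Xception,
}


def build_model(backbone="MobileNetV2", input_shape=(224, 224, 3), weights="imagenet"):
    """Convolutional core followed by dense layers of 256 and 128 units, each with 20% dropout."""
    core = BACKBONES[backbone](
        include_top=False, weights=weights, input_shape=input_shape, pooling="avg"
    )
    inputs = tf.keras.Input(shape=input_shape)
    x = core(inputs)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # assumed binary output
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",              # assumption
        loss="binary_crossentropy",    # assumption
        metrics=[tf.keras.metrics.AUC(name="roc_auc")],
    )
    return model
```

Calling `build_model("ResNet50")` swaps the core while keeping the same dense head. Passing `weights=None` builds the network with random weights instead of downloading pretrained ones.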


Figure 5: ROC curve obtained for task 1.
Figure 6: comparison of results for task 1.

Regarding the first task, as shown in Figure 5, the four networks score about the same in ROC-AUC, i.e., about 80%. ResNet50 and Xception do slightly better, with the former achieving a slightly lower standard deviation. In Figure 6, the result of ResNet50 is compared with the results published by the Cambridge researchers, who in this case used Logistic Regression. The comparison shows that, although there is a small improvement in ROC-AUC, the CNN-based method loses considerably in precision and recall.
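Since the comparison relies on ROC-AUC, precision and recall, it may help to see how they are computed. The sketch below uses made-up labels and scores, not results from the thesis, and the 0.5 decision threshold is an assumption. It shows that ROC-AUC depends only on how the samples are ranked, while precision and recall depend on the chosen threshold.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(labels, scores, threshold=0.5):
    """Precision and recall after turning scores into 0/1 predictions at `threshold`."""
    predicted = [int(s >= threshold) for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: every positive is ranked above every negative,
# but only one positive clears the 0.5 threshold.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.45, 0.4, 0.35, 0.3, 0.1]

auc = roc_auc(labels, scores)
precision, recall = precision_recall(labels, scores)
print(auc, precision, round(recall, 3))  # 1.0 1.0 0.333
```

In this toy case the ranking is perfect (ROC-AUC of 1.0) while recall is only one third, so a model can do well on one metric and poorly on another.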

Figure 7: comparison of results for task 2.
Figure 8: comparison of results for task 3.

In the Cambridge article, the results published for the second and third tasks were obtained with a Support Vector Machine. Figures 7 and 8 compare SVM with MobileNetV2, which was the best-performing CNN in these two tasks. Since ROC-AUC is the most important metric for the purpose of the experiment, SVM can be said to obtain better results than the CNNs, even though the CNNs outperform SVM by a wide margin in precision and recall in task 3.


Comparing the shallow learning approach (Logistic Regression and SVM) with the deep learning approach (Deep Convolutional Neural Networks), we can conclude that the two solutions are roughly equivalent, even if the first seems to get slightly better results.

However, it is widely accepted in the literature that deep learning models struggle when trained on limited datasets, where shallow learning models perform better. The dataset release provided by the University of Cambridge (dated May 2020) is indeed a first version with few data points. As mentioned at the beginning, though, the goal of the data collection campaign is to create a scalable dataset, so it is plausible to expect that CNNs will progressively outperform LR and SVM as more samples become available.

The same team of researchers recently published a new paper, “Exploring Automatic COVID-19 Diagnosis via Voice and Symptoms from Crowdsourced Data”, in which they repeat the experiment with an updated version of the dataset. It would be interesting to replicate the approach based on Deep Convolutional Neural Networks on this new data release.

The complete thesis, in Italian, is available at this link.

Cover image: Bagad, Piyush & Dalmia, Aman & Doshi, Jigar & Nagrani, Arsha & Bhamare, Parag & Mahale, Amrita & Rane, Saurabh & Agarwal, Neeraj & Panicker, Rahul. (2020). Cough Against COVID: Evidence of COVID-19 Signature in Cough Sounds. https://arxiv.org/abs/2009.08790
