Interpreting Emotions from Raw Speech: A Deep Learning Perspective
By Narendra N, Anubhav Anand, Shubham Negi
What do we do when we hit a technical glitch on a laptop, or when there is a problem with an order placed on an e-commerce website? The answer is simple: we end up talking to a customer care representative. But did you know that these calls are recorded for analysis, as the disclaimer at the start of the call reminds us? It is important for these companies to assess the quality of the call, whether in terms of feedback on the product, issues raised by the customer, the behavior of the representative, and so on. We are often asked to provide a rating after the call has ended, which people rarely do. What if we could automate this process? Is it possible to assess the emotion of the speaker and provide a suitable rating for the call? This brings us to the problem of developing a Speech Emotion Recognition (SER) system. SER systems are also applicable in a variety of other scenarios, such as health-care systems, monitoring systems, and automated driving systems, to name a few.
The Problem
Most existing emotion recognition systems rely on transcripts of the speech to assess emotion. However, we observed that the performance of Automatic Speech Recognition (ASR) systems is very poor in noisy environments. How do we resolve this? Is it possible to identify emotions directly from the speech signal, without using an ASR system? In this article, we describe our effort toward identifying emotions from raw speech signals. We also look at explaining the reasoning behind the classification.
The Approach
We experimented with a deep neural network, SincNet, to solve this problem. It is a Convolutional Neural Network (CNN) based approach that was originally proposed for speaker recognition. As the filters learned in the first layer of this network are interpretable, we use them to provide an explanation for the classified emotions.
About SincNet
Convolutional Neural Network (CNN) based filtering is becoming popular in speech-related tasks. CNNs learn low-level speech representations directly from waveforms, potentially capturing important characteristics such as pitch and formants. However, there are two problems with this. First, the design of the network becomes crucial for obtaining a good representation. Second, the dimensionality of the input waveform is very high. Since the first layer of any speech model is crucial for extracting representations, SincNet reduces this dimensionality in a deterministic way by replacing the first convolutional layer with meaningful filters. This offers a compact and efficient way to derive a customized filter bank tuned specifically for the desired application. Fig. 1 captures the SincNet architecture proposed in [1]. The first layer of a standard CNN performs a set of time-domain convolutions between the input waveform x[n] and a set of Finite Impulse Response (FIR) filters h[n]:

y[n] = x[n] * h[n] = \sum_{l=0}^{L-1} x[l] \, h[n-l],

where L is the length of the filter and every tap of h is learned from the data. SincNet, however, performs the convolution with a pre-defined parametric function g:

y[n] = x[n] * g[n, \theta].

A reasonable choice of g is a filter bank of rectangular band-pass filters. A band-pass filter can be designed as the difference of two low-pass filters with cut-off frequencies f_1 and f_2, and its time-domain representation is the difference of two sinc functions:

g[n, f_1, f_2] = 2 f_2 \, \mathrm{sinc}(2\pi f_2 n) - 2 f_1 \, \mathrm{sinc}(2\pi f_1 n), \qquad \mathrm{sinc}(x) = \sin(x)/x.

The function g is fully differentiable with respect to f_1 and f_2, so the cut-off frequencies can be jointly optimized with the rest of the network through gradient descent. The result is a first layer whose response to the input speech is directly interpretable.
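To make the idea concrete, the following is a minimal sketch of such a sinc-based first layer in PyTorch (our choice of framework for illustration). The class name SincConv, the uniform initialization of the cut-offs, and the 100 Hz starting bandwidth are our own simplifications; the original SincNet additionally uses a mel-scale initialization and enforces a minimum bandwidth.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SincConv(nn.Module):
    """Sinc-based first convolution layer: only the band edges (f1, f2) of each
    band-pass filter are learnable; the filter shape itself is fixed by g."""

    def __init__(self, num_filters=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        # Initialize the low cut-offs roughly uniformly and give every filter a
        # 100 Hz starting bandwidth (a simplification of the original init).
        low_hz = torch.linspace(30.0, sample_rate / 2 - 200.0, num_filters)
        band_hz = torch.full((num_filters,), 100.0)
        self.low_hz = nn.Parameter(low_hz)    # learnable lower cut-offs f1 (Hz)
        self.band_hz = nn.Parameter(band_hz)  # learnable bandwidths f2 - f1 (Hz)
        # Fixed time axis (in seconds) centred at zero, plus a Hamming window
        # to smooth the band edges.
        n = (torch.arange(kernel_size) - (kernel_size - 1) / 2) / sample_rate
        self.register_buffer("n", n)
        self.register_buffer("window", torch.hamming_window(kernel_size, periodic=False))

    @staticmethod
    def _sinc(t):
        # sinc(x) = sin(x) / x, with the value at 0 defined as 1.
        safe_t = torch.where(t == 0, torch.ones_like(t), t)
        return torch.where(t == 0, torch.ones_like(t), torch.sin(safe_t) / safe_t)

    def forward(self, x):  # x: (batch, 1, time)
        f1 = torch.abs(self.low_hz)           # keep the cut-offs positive
        f2 = f1 + torch.abs(self.band_hz)
        n = self.n.unsqueeze(0)               # (1, kernel_size)
        # g[n, f1, f2] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n)
        g = (2 * f2.unsqueeze(1) * self._sinc(2 * torch.pi * f2.unsqueeze(1) * n)
             - 2 * f1.unsqueeze(1) * self._sinc(2 * torch.pi * f1.unsqueeze(1) * n))
        filters = (g * self.window).unsqueeze(1)  # (num_filters, 1, kernel_size)
        return F.conv1d(x, filters, padding=self.kernel_size // 2)
```

During training only low_hz and band_hz receive gradients; the sinc shape, the window, and the time axis stay fixed, which is exactly why the layer has so few parameters.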
Advantages of SincNet:
1. Fast Convergence: Since only the cut-off frequencies are learned in the first layer of the network, training is faster. The architecture also forces the network to focus on filters that have a high impact on performance.
2. Fewer Parameters: A standard CNN with F filters, each of length L, has to learn F·L parameters, whereas SincNet learns only 2F parameters, independent of the filter length (a quick count for our configuration follows this list).
3. Interpretability: The convolutional filters learned in traditional approaches are overlapping, which does not help in understanding the features learned by the system. The filters learned by SincNet provide non-overlapping band-pass cut-offs that can be understood easily.
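To put the parameter-count advantage in perspective, here is the arithmetic for the first-layer configuration we use later (80 filters of 251 taps each); the snippet is purely illustrative.

```python
# First-layer parameter count: a standard CNN learns every filter tap,
# while SincNet learns only two cut-off frequencies per filter.
num_filters, filter_len = 80, 251
standard_cnn_params = num_filters * filter_len  # 20,080 learnable taps
sincnet_params = 2 * num_filters                # 160 learnable cut-offs
print(standard_cnn_params, sincnet_params)      # -> 20080 160
```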
Our Technique
A. Dataset
We have used the IEMOCAP dataset provided by the University of Southern California for our experiments.
The IEMOCAP dataset contains 12 hours of audio-visual data, including video, speech, text transcriptions, and motion capture of the face. It consists of sessions in which hired actors perform improvisations or read scripted dialogues specifically chosen to elicit emotional expressions. The dataset is annotated by multiple annotators and covers 10 emotion classes, such as anger, happiness, and neutral.
B. Implementation
We restricted our analysis to four emotion classes, viz. angry, happy, sad, and neutral, so that our results can be compared with state-of-the-art algorithms.
The official IEMOCAP release provides labels at the dialogue level. There are approximately 10,000 dialogues in the whole dataset with their respective classes. Because these four classes are imbalanced, we merge the happy and excited labels into a single class to balance them; the resulting four classes contain approximately 7,000 utterances. To be consistent with the state of the art, we consider only the improvised conversations, which leaves a dataset of 2,943 utterances. The duration of the utterances ranges from as little as 0.5 seconds to as long as 37 seconds, and each utterance carries one of the four labels.
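As a rough illustration of this selection step, a sketch follows. The KEEP mapping and the in-memory `utterances` structure (a list of (wav_path, label, dialogue_id) tuples produced by whatever code parses the official annotation files) are our own assumptions, not part of the IEMOCAP release; the short label codes and the "impro" substring in dialogue names follow IEMOCAP's naming conventions.

```python
# Keep only angry / happy / sad / neutral, merging "excited" into "happy",
# and restrict to improvised (non-scripted) dialogues.
KEEP = {"ang": "angry", "hap": "happy", "exc": "happy",
        "sad": "sad", "neu": "neutral"}


def select_utterances(utterances):
    """utterances: list of (wav_path, label, dialogue_id) tuples (assumed format)."""
    selected = []
    for wav_path, label, dialogue_id in utterances:
        if label not in KEEP:
            continue                    # drop the remaining emotion classes
        if "impro" not in dialogue_id:
            continue                    # keep only improvised conversations
        selected.append((wav_path, KEEP[label]))
    return selected
```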
We slide a 200 ms window with 50% overlap over the input speech and treat each resulting chunk as having the same label as the overall utterance. The first layer performs convolutions with the sinc filters learned during training; we use 80 such filters of length 251 samples, followed by two convolutional layers of 60 filters each with a length of 5 samples, and then three fully connected layers of 2048 units each. All layers use leaky ReLU as the activation function. The model is trained with the RMSprop optimizer and a learning rate of 0.001.
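A minimal sketch of the chunking step, assuming 16 kHz audio already loaded as a one-dimensional NumPy array (the helper name and defaults are ours):

```python
import numpy as np


def chunk_waveform(wave, sample_rate=16000, win_ms=200, overlap=0.5):
    """Split an utterance into 200 ms chunks with 50% overlap; every chunk
    inherits the emotion label of the whole utterance."""
    win = int(sample_rate * win_ms / 1000)   # 3200 samples at 16 kHz
    hop = int(win * (1 - overlap))           # 1600 samples
    chunks = [wave[s:s + win] for s in range(0, len(wave) - win + 1, hop)]
    return np.stack(chunks) if chunks else np.empty((0, win))
```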
We divided the dataset in an 80:20 ratio, using 80 percent for training and 20 percent for validation. Sentence-level classification is obtained by averaging the predicted class probabilities over all the chunks of an utterance and choosing the class with the maximum average posterior (a short sketch of this step follows below). We also performed a 3-fold validation on the dataset and achieved an overall accuracy of 77.19%. Table I summarizes the comparison of our implementation with the state of the art.
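The utterance-level decision can be sketched as follows, where `chunk_probs` stands for the per-chunk softmax outputs of the trained network (an assumed input, not something the earlier snippets produce):

```python
import numpy as np


def utterance_prediction(chunk_probs):
    """chunk_probs: (num_chunks, num_classes) array of per-chunk softmax outputs.
    Average the posteriors over the chunks and pick the class with the maximum."""
    return int(np.argmax(chunk_probs.mean(axis=0)))
```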
Takeaways
- The results are encouraging: the model betters the state of the art while using only the speech samples. The other proposed methods use a combination of speech and phonemes along with the transcripts to reach their accuracy levels.
- We observed that the learned model is sensitive to the accent of the speech. Since the model was trained on the IEMOCAP dataset, whose speakers have US accents, testing it on conversations with Indian accents met with limited success.
- Through random sampling we observed some inaccuracies in the labeling of the dataset. For example, the sad and neutral emotions overlap within a person's speech; that is, the labels vary from sentence to sentence within a single continuous conversation.
For Further Reading
[1] M. Ravanelli and Y. Bengio, “Speaker recognition from raw waveform with SincNet,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 1021–1028.
[2] P. Yenigalla, A. Kumar, S. Tripathi, C. Singh, S. Kar, and J. Vepa, “Speech emotion recognition using spectrogram & phoneme embedding.” in Interspeech, 2018, pp. 3688–3692.
[3] A. Nediyanchath, P. Paramasivam, and P. Yenigalla, “Multi-head attention for speech emotion recognition with auxiliary learning of gender recognition,” in ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7179–7183.
[4] J. Lee and I. Tashev, “High-level feature representation using recurrent neural network for speech emotion recognition,” in Sixteenth annual conference of the international speech communication association, 2015.
[5] A. Satt, S. Rozenberg, and R. Hoory, “Efficient emotion recognition from speech using deep learning on spectrograms.” in Interspeech, 2017, pp. 1089–1093.