Speech Emotion Recognition Using 2D CNN-LSTM

Abhishek B Udnur; Prathmesh M Babel; Akash P Gaikwad; Tushar S Satpute

Abhishek B Udnur, Research Scholar, Department of Computer Science Engineering, Sharad Institute of Technology College of Engineering
Prathmesh M Babel, Department of Artificial Intelligence and Data Science, Sharad Institute of Technology College of Engineering
Akash P Gaikwad, Research Scholar, Department of Artificial Intelligence and Data Science, Sharad Institute of Technology College of Engineering
Tushar S Satpute, Research Scholar, Department of Artificial Intelligence and Data Science, Sharad Institute of Technology College of Engineering

Abstract

This study addresses Speech Emotion Recognition (SER) as a tool for supporting mental well-being in an era of extensive technology use. SER has applications in healthcare, entertainment, education, and other fields. The research explores deep learning-based techniques for detecting emotion in speech, while noting that the abstract features learned by deep neural networks remain hard to interpret, the so-called “black box” problem. The study underscores the importance of SER for understanding human behavior in digital settings and its potential for designing supportive media architectures. The dataset used, RAVDESS, is described in detail. The implementation covers preprocessing with the librosa library, including data exploration and feature extraction: Hamming windowing, the Fast Fourier Transform (FFT), Mel spectrogram computation, and Mel Frequency Cepstral Coefficients (MFCCs). The paper then describes data augmentation, model building with a CNN-LSTM architecture, and model evaluation, reporting an accuracy of 94.02% in emotion recognition. The deployment section describes serving the trained model for emotion detection through the Flask web framework. The discussion highlights the success of CNN-LSTM networks in extracting emotional information from speech signals and explores techniques to combat overfitting and improve generalization. The conclusion stresses the ongoing pursuit of higher accuracy in SER and suggests avenues for future research, including novel network architectures and feature merging methods. Overall, the study gives a broad view of SER techniques and their potential for addressing mental well-being in the digital age.
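The feature-extraction chain named in the abstract (Hamming window, FFT, Mel filterbank, MFCCs) can be illustrated for a single frame with the rough sketch below. It is not the authors' code and does not reproduce librosa's implementation: it uses only the Python standard library, a naive DFT in place of a fast FFT, and assumed values for the sample rate, frame length, and filter counts. All function names are hypothetical.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def power_spectrum(frame):
    """One-sided power spectrum via a naive O(n^2) DFT (kept simple for clarity)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spec.append(abs(s) ** 2 / n)
    return spec


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    mel_max = hz_to_mel(sr / 2)
    # n_mels + 2 edge frequencies in Hz: lower edge, centre, upper edge per filter
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sr / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def mfcc_frame(frame, sr, n_mels=20, n_mfcc=13):
    """MFCCs of one frame: window -> power spectrum -> log-mel energies -> DCT-II."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hamming(n))]
    power = power_spectrum(windowed)
    bank = mel_filterbank(n_mels, n, sr)
    # Small constant avoids log(0) for filters that capture no energy
    log_mel = [math.log(sum(w * p for w, p in zip(row, power)) + 1e-10) for row in bank]
    return [
        sum(e * math.cos(math.pi * c * (j + 0.5) / n_mels) for j, e in enumerate(log_mel))
        for c in range(n_mfcc)
    ]


# Assumed toy input: one 256-sample frame of a 440 Hz tone sampled at 8 kHz
sr = 8000
frame = [math.sin(2 * math.pi * 440 * t / sr) for t in range(256)]
coeffs = mfcc_frame(frame, sr)
print(len(coeffs))
```

In practice a library routine such as librosa's MFCC extraction would process every frame of a recording and return a coefficients-by-frames matrix, which is the kind of 2D input a CNN-LSTM model consumes.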
