Heart rate monitoring using human speech spectral features


This paper attempts to establish a correlation between human speech, emotions and heart rate. The study highlights a possible contactless heart rate measurement technique that could be useful for monitoring a patient's condition from real-time speech recordings. The average peak-to-peak distances in the speech Mel-frequency cepstral coefficients are used as the speech features. When tested on 20 classifiers using data collected from 30 subjects, the features indicate a non-separable classification problem; however, the classification accuracies indicate the existence of a strong correlation between human speech, emotion and heart rate.


Heart rate is the number of times the heart contracts and relaxes per minute, and is expressed in beats per minute (bpm) [1]. Many factors contribute to variation in heart rate, such as level of physical activity, fitness, temperature, body position, emotions, body size and medication. Heart rate also depends on the body’s need to absorb oxygen and excrete carbon dioxide. Traditionally, heart rate is measured by detecting arterial pulsation. The electrical activity of the heart is measured by a non-invasive technique called electrocardiography (ECG) and is used for assessing the condition of the heart. The use of technology is gaining significance, especially in telehealth applications for the continuous monitoring of cardiac patients and for special situations such as treating a burn victim or monitoring infants at risk of sudden infant death syndrome.

A device used to monitor and record the heart rate in real time is generally referred to as a heart rate monitor. Early models of heart rate monitors used electrode leads attached to the chest. In contrast, modern heart rate monitors use a chest strap transmitter and a wrist receiver or a mobile phone. Strapless heart rate monitors allow the user to simply touch two sensors on a wristwatch display for a few seconds to display the heart rate. Heart rate has also been detected from human emotions based on the modelling of vowel speech signals.

In many biotelemetry applications, heart rate monitoring of ambulatory patients becomes extremely difficult if the mobility of the patient is restricted due to paralysis and/or where continuous paramedic attention and monitoring with sensors become unaffordable. Through this research, we propose to overcome such restrictions by implementing a contactless heart rate monitoring method that uses real-time speech recordings based on different emotions of the patient. The system developed is based on a database of emotions and heart rates obtained from multiple subjects.


Most past research on heart rate detection is based on the modelling of vowel speech signals and on processing different physiological parameters. The short-time Fourier transform (STFT) [2, 3] has been used to detect the maximum peaks of the formants in the process of detecting the heart rate.

McCraty et al. [4] related the mathematical transformation of heart rate variability to the power spectral density of speech. They found that positive emotions result in the alteration of heart rate variability. Kim et al. [5] developed a new emotion recognition system by processing different physiological parameters. The electrocardiogram, skin temperature and electrodermal activity were used as input signals. This system consisted of preprocessing, feature extraction and pattern classification stages. The characteristics of the emotion are identified from short-segment signals.

Anttonen et al. [6] developed an EMFi chair that measures heart rate. The chair is embedded with electromechanical film and traditional earlobe photoplethysmography (PPG) for measuring the heart rate. They used this setup to study the impact of emotional changes on the heart rate. Mesleh et al. [7] developed a non-contact method for the detection of heart rate from human emotions, based on modelling the relationship between the production of vowel speech signals and heart activity in humans. It uses the STFT to estimate heart rates from vowel speech signals. This method is not extensively applicable in the medical field, since the representation of human emotions through vowel speech cannot be clearly illustrated. The method also does not work for persons with artificial or transplanted hearts.


Data collection

Thirty subjects aged between 20 and 45 participated in the data collection. The speech and the corresponding heart rate were recorded in a sound-proof room using a high-quality microphone to obtain the speech signals with minimum noise. Three emotions, namely anger, neutral and joy, were selected for the study. During data collection, the subjects were asked to express specific emotions through words. The emotions were self-induced by instructing the subjects to recollect past situations. The recorded utterance was ‘It Begins In Seven Hours’.

Before the start of the recording, the subjects were given a brief introduction to the purpose and procedure of the experiment and were shown how to express the emotions. The recording was conducted over 60 days and the subjects were selected randomly for the study. For each subject, three sessions were conducted on two different days for the three emotions. Each recording instance lasted 30 seconds, and 30 such instances were taken for each emotion. From each recorded instance, only ten seconds were used for the feature extraction process.

  1. Hand-held ECG monitor

    The device used for measuring heart rate was a handheld ECG monitor, MD100 A1. This device detects and displays the ECG waveform non-invasively and is intended for routine monitoring of heart rate. Strict instructions were given to the subjects about handling the device. It has two modes of display, easy mode and continuous mode. Easy mode was chosen because it is uncomplicated and suitable for the measurement. Subjects were seated in straight and stable chairs to reduce motion. The subjects were instructed to hold the 8 metal electrodes of the device firmly with the right index finger, and to place the 3 electrodes against the centre of the left palm. The recorded ECGs were transferred to the computer using software named ‘Keep-It-Easy’.

  2. Recording room

    Recording rooms are usually multi-room facilities that require high sound quality and less noise compared to other rooms. The factors that receive most attention during the construction of an audio recording room are the ceiling and walls. Porous absorbers such as melamine sponges and wood are used to build the walls of a voice recording room. Concrete walls and floors are usually avoided to keep the audio recording room free of noise.

The speech recording room used was as acoustically quiet as possible. Only the subject and the person assisting with the recording were present inside the recording room. The voice was recorded on a mobile phone (Sony Ericsson Xperia Mini Pro) with a hi-fi stereo headset MH 710 and a noise level of \({-}\)89.7 dB. The microphone was kept close to the subject's mouth to reduce the ambient noise and to increase the sound level of the speech signals, since the level of sound drops as the microphone moves away. A computer in the recording room is very often a disturbing factor in recording speech signals. Hence a silent processor fan and power supply were used to reduce the sounds from the computer. A carpet helps to reduce the reflection of the sound signals.

Feature extraction

Extracting features from the emotional speech signal is an important task for understanding human speech behaviour. The Mel-frequency cepstrum is considered a robust and widely adopted method for speech processing.

  1. Mel-frequency cepstral coefficient analysis

    The Mel-frequency cepstrum [8] is a mathematical representation of the short-term power spectrum of speech. Mel-frequency cepstral coefficients [8] are based on a standard power spectrum estimate, which is first subjected to a log-based transform on the Mel-frequency scale and then decorrelated using a modified discrete cosine transform.

Mel-frequency cepstral coefficient extraction includes six important steps: (1) pre-emphasis, (2) framing, (3) Hamming windowing, (4) fast Fourier transform, (5) Mel filter bank processing, and (6) discrete cosine transform.

Pre-emphasis boosts the magnitude of the higher frequencies in the signal to improve the signal-to-noise ratio. It is the initial step of the noise reduction technique [9]. The recorded emotional speech is emphasized by passing it through a filter that increases the magnitude of the higher frequencies of the speech signal.

$$\begin{aligned} Y[n]=X[n]-0.95X[n-1] \end{aligned}$$

Here a = 0.95, which means that 95 % of any one sample is presumed to originate from the previous sample [9].
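The pre-emphasis filter above can be sketched in a few lines of Python. This is only an illustration of the difference equation with a = 0.95; the function name `pre_emphasis` and the choice to pass the first sample through unchanged (treating X[-1] as zero) are assumptions, not details taken from the study.

```python
def pre_emphasis(x, a=0.95):
    """Apply Y[n] = X[n] - a * X[n-1] to a list of samples."""
    if not x:
        return []
    y = [x[0]]  # assumption: X[-1] is treated as 0, so Y[0] = X[0]
    for n in range(1, len(x)):
        y.append(x[n] - a * x[n - 1])
    return y


# A constant (low-frequency) signal is strongly attenuated after the first sample.
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
# An alternating (high-frequency) signal is boosted in magnitude.
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```

The two toy signals show the intended effect: slowly varying content is suppressed while rapidly varying content is amplified.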

Framing is the segmentation of the speech samples obtained from the analog-to-digital converter into small frames. The speech signal is divided into frames of N samples. Neighbouring frames are separated by M samples, with \(M < N\), where M = 100 and N = 256 [9].
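A minimal sketch of this framing step, using N = 256 and M = 100 as defaults, is given below. The function name `frame_signal` is hypothetical, and dropping trailing samples that do not fill a complete frame is an assumption (zero-padding the last frame would be an equally reasonable choice).

```python
def frame_signal(x, frame_len=256, hop=100):
    """Split x into overlapping frames of frame_len samples, advancing by hop samples."""
    if hop <= 0 or hop >= frame_len:
        raise ValueError("expected 0 < hop < frame_len (that is, M < N)")
    frames = []
    start = 0
    while start + frame_len <= len(x):
        frames.append(x[start:start + frame_len])
        start += hop
    # Assumption: samples at the end that do not fill a whole frame are dropped.
    return frames


signal = list(range(1000))  # stand-in for 1000 speech samples
frames = frame_signal(signal)
```

With these values, consecutive frames share N - M = 156 samples.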

The Hamming window is one of the simplest window functions. It reduces the effect of spectral leakage, giving a better representation of the frequency spectrum of the speech signals. The frames obtained are multiplied by the window function W[n] to reduce the discontinuities of the speech signal at the frame edges in the time domain, which in turn reduces spectral artifacts [9].

$$\begin{aligned} W[n]=0.54-0.46~\cos \left(\frac{2\pi n}{N-1}\right), \quad 0\le n\le N-1 \end{aligned}$$

where N is the number of samples per frame and W[n] is the Hamming window. With X[n] as the input signal and Y[n] as the output signal, the result of windowing can be represented as

$$\begin{aligned} Y[n]=X[n]\times W[n] \end{aligned}$$
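The windowing step can be sketched as follows. The code uses the conventional Hamming coefficients 0.54 and 0.46; the function names `hamming` and `apply_window` are illustrative only.

```python
import math


def hamming(N):
    """Hamming window W[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)) for n = 0..N-1 (N >= 2)."""
    if N < 2:
        raise ValueError("window length must be at least 2")
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]


def apply_window(frame):
    """Return Y[n] = X[n] * W[n] for one frame of samples."""
    window = hamming(len(frame))
    return [x * w for x, w in zip(frame, window)]


window = hamming(256)
tapered = apply_window([1.0] * 256)
```

The window tapers each frame towards its edges (the end values are about 0.08), which is what reduces the discontinuities between neighbouring frames.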

The fast Fourier transform (FFT) is an efficient algorithm for computing the discrete Fourier transform (DFT) and greatly reduces the computation time compared with evaluating the DFT directly. The DFT can be computed efficiently in MATLAB using the function fft [9].

The discrete Fourier transform of a discrete signal X[n] can be defined as,

$$\begin{aligned} X[k]=\sum \limits _{n=0}^{N-1} X[n]~e^{-j[\frac{2\pi }{N}]nk}, \quad k=\{0,1,\ldots ,N-1\} \end{aligned}$$

where X[k] is the Fourier transform of the input signal X[n], k is the frequency-bin index, and f denotes frequency in Hz.

The DFT is mostly used in frequency spectrum analysis, since it transforms a discrete signal in the time domain into its discrete frequency-domain components. DSP-based systems work with discrete signals, so the Fourier transform can only be computed in this discrete form [9].
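To make the relationship between the DFT definition and the FFT concrete, the sketch below evaluates the sum directly and compares it with a textbook recursive radix-2 FFT. It is a pure-Python illustration, not the MATLAB routine used in the study, and the function names `dft` and `fft` are chosen here for convenience.

```python
import cmath
import math


def dft(x):
    """Direct evaluation of X[k] = sum_n X[n] * exp(-j*2*pi*n*k/N); O(N^2) operations."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 0 or (N & (N - 1)) != 0:
        raise ValueError("length must be a non-zero power of two")
    if N == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * N
    for k in range(N // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / N) * odd[k]
        out[k] = even[k] + twiddle
        out[k + N // 2] = even[k] - twiddle
    return out


# A cosine with exactly two cycles in eight samples concentrates in bins 2 and 6.
tone = [math.cos(2.0 * math.pi * 2 * n / 8) for n in range(8)]
spectrum = fft(tone)
```

Both functions return the same values up to rounding error; the FFT simply reaches them in O(N log N) operations, which is why a power-of-two frame length such as 256 is convenient.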

Mel filter bank processing: The range of frequencies in the FFT spectrum obtained is very broad. Since speech signals do not follow a linear frequency scale, the Mel scale is used for the filtering process. The Mel filter bank consists of a set of overlapping triangular filters that compute a weighted sum of the spectral components, so that the output approximates a Mel scale. The centre frequencies of the Mel filters are linearly spaced and the bandwidth is fixed on the Mel scale [9].

The most popular equation for computing the Mel frequency from a frequency f in Hz is

$$\begin{aligned} F(mel)=2595\,\log _{10}\left(1+\frac{f}{700}\right) \end{aligned}$$
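The Mel mapping and the placement of filter centre frequencies can be sketched as below. The inverse mapping follows algebraically from the formula above. The frequency range (0 to 4000 Hz) and the number of filters (10) are hypothetical values chosen only for illustration, since the text does not state the ones used in the study.

```python
import math


def hz_to_mel(f):
    """Mel value of a frequency f in Hz: 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(f_low, f_high, n_filters):
    """Centre frequencies in Hz that are equally spaced on the Mel scale."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_filters + 1)
    return [mel_to_hz(m_low + step * (i + 1)) for i in range(n_filters)]


# Hypothetical configuration: 10 filters between 0 Hz and 4000 Hz.
centers = mel_center_frequencies(0.0, 4000.0, 10)
gaps = [b - a for a, b in zip(centers, centers[1:])]
```

Equal spacing on the Mel scale translates into filters that are packed closely at low frequencies and spread out at high frequencies when viewed in Hz.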

The discrete cosine transform converts the log Mel spectrum into the time domain. The result of the conversion is called the Mel-frequency cepstral coefficients. It is a real transform and has better computational efficiency. The set of coefficients is called an acoustic vector. Therefore, each input utterance is transformed into a sequence of acoustic vectors [9].

The discrete cosine transform of a signal can be defined as,

$$\begin{aligned} C_x[k]&=\Biggl \lbrace { \begin{array}{ll} \sum \limits _{n=0}^{N-1} 2X[n]~\cos \left(\frac{\pi }{2N}k(2n+1)\right) & 0 \le k \le N-1 \\ 0 &~\text {otherwise} \end{array} } \end{aligned}$$

where \(C_x[k]\) is the output signal and X[n] is the input signal.
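The transform above can be evaluated directly from its definition. The sketch below is an unoptimised O(N^2) version intended only to show what the formula computes; the function name `dct2` is an assumption.

```python
import math


def dct2(x):
    """C[k] = sum_{n=0}^{N-1} 2*x[n]*cos(pi*k*(2n+1)/(2N)) for k = 0..N-1."""
    N = len(x)
    return [sum(2.0 * x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N))
            for k in range(N)]


# A constant input has all of its energy in the k = 0 coefficient.
coefficients = dct2([1.0, 1.0, 1.0, 1.0])
```

Applied to the log Mel filter bank outputs of one frame, the leading coefficients of this transform form the acoustic vector for that frame.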

The 12 coefficients obtained after these steps are the Mel-frequency cepstral coefficients. The sum of the Mel-frequency cepstral coefficients is taken and its peaks are located. Each 10 s audio segment has a minimum of three repeated utterances; for each plot of the utterance, the average of two peak-to-peak distances is taken, and the corresponding heart rate is measured from the ECG of the same 10 s audio segment. The peak-to-peak distance between the MFCC peaks is taken as the feature that correlates with the variability in the heart rate.
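One possible reading of this feature is sketched below: sum the coefficients of each frame, locate the local maxima of the resulting sequence, and average the distances between consecutive peaks. The text does not specify the peak-picking rule or the units of the distance, so the strict local-maximum rule, the distance in frames, and all function names here are assumptions.

```python
def summed_mfcc(mfcc_frames):
    """Sum the cepstral coefficients of each frame into a single value per frame."""
    return [sum(frame) for frame in mfcc_frames]


def find_peaks(values):
    """Indices of strict local maxima (larger than both neighbours)."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]


def mean_peak_to_peak(values):
    """Average distance (in frames) between consecutive peaks, or None if fewer than two."""
    peaks = find_peaks(values)
    if len(peaks) < 2:
        return None
    distances = [b - a for a, b in zip(peaks, peaks[1:])]
    return sum(distances) / len(distances)


# Toy sequence with peaks at indices 1, 3 and 6 (not data from the study).
toy = [0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 3.0, 0.0]
feature = mean_peak_to_peak(toy)
```

For the toy sequence the two peak-to-peak distances are 2 and 3 frames, so the feature value is their mean.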

  2. Heart rate detection

    A typical human heart rate is in the range of 60–100 bpm, but this may change according to the age, sex and size of the person. The QRS complex represents the depolarization of the ventricles. The ventricular rate, or heart rate, is the number of QRS complexes per unit time and is calculated from the time interval between two consecutive QRS complexes. The duration of a QRS complex is generally 0.06–0.1 s.

To record the ECG during the recording of emotional speech, the MD100 A1 handheld ECG monitor was used. The device has a 30 s recording capacity in its easy mode. The heart rate is obtained from the recorded ECGs using the 1500 rule. According to basic dysrhythmias interpretation and management, each small horizontal box equals 0.04 s, so one minute spans 1500 small boxes. Hence, count the number of small squares between two neighbouring R-waves and divide 1500 by that number. The obtained value is the heart rate, H, of the respective emotional speech.

$$\begin{aligned} H=\frac{1500}{\text {No. of small boxes between two R-waves}} \end{aligned}$$
Fig. 1


From Fig. 1,

$$\begin{aligned} H=\frac{1500}{17}\approx 88.2~\text {bpm} \end{aligned}$$

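The 1500 rule is a unit conversion: with each small box equal to 0.04 s, one minute contains 60 / 0.04 = 1500 boxes, so dividing 1500 by the number of boxes between two R-waves is the same as dividing 60 by the R-R interval in seconds. A short sketch (function names are illustrative):

```python
def heart_rate_1500(small_boxes):
    """Heart rate in bpm from the number of small ECG boxes between two R-waves."""
    if small_boxes <= 0:
        raise ValueError("number of small boxes must be positive")
    return 1500.0 / small_boxes


def heart_rate_from_rr(rr_seconds):
    """Heart rate in bpm from the R-R interval in seconds."""
    if rr_seconds <= 0:
        raise ValueError("R-R interval must be positive")
    return 60.0 / rr_seconds


rate = heart_rate_1500(17)                   # about 88.2 bpm
same_rate = heart_rate_from_rr(17 * 0.04)    # 17 boxes of 0.04 s each
```

Both functions give the same value for the 17-box example, which confirms that the rule is consistent with the 0.04 s box width.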
  3. Classification techniques

    Naive Bayes classifier: Naive Bayes classification is a supervised method of classification based on the Bayes rule of conditional probability. It makes use of all the attributes contained in the data sample and analyses them individually, as though they are equally important and independent of each other [10]. It gives an accuracy of 40.09 % and a precision of 42 % for the 5 % training and 95 % test dataset.

CVParameterSelection is a meta-classifier that can optimize over an arbitrary number of parameters, with the scheme parameters selected by cross-validation. It cannot optimize nested options; only direct options of the base classifier can be optimized. The CVParameterSelection classifier gives an accuracy of 33.33 % and a precision of 33 % for the 5 % training and 95 % test dataset [11].

The filtered classifier is a class for running an arbitrary classifier on data that has been passed through an arbitrary filter. The meta filtered classifier gives an accuracy of 33.33 % and a precision of 33 % for the 5 % training and 95 % test dataset [12].

J48: J48 is an open source Java implementation of the C4.5 algorithm in the Weka data mining tool. C4.5, developed by Ross Quinlan, builds a decision tree from a set of labelled training data. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier. The J48 classifier gives an accuracy of 39.24 % and a precision of 42 % for the 5 % training and 95 % test dataset [13].
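To illustrate the independence assumption behind the Naive Bayes classifier, the sketch below implements a small Gaussian variant in plain Python. It is not the Weka implementation used in the study, and the feature values (a peak-to-peak distance and a heart rate per sample) are synthetic numbers invented for the example, not measurements from the dataset.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict(model, row):
    """Pick the class with the highest log posterior, treating features as independent."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2.0 * math.pi * var) - (v - m) ** 2 / (2.0 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(model, row) == label for row, label in zip(X, y)) / len(y)


# Synthetic toy data: [peak-to-peak distance, heart rate in bpm].
X_train = [[1.0, 70.0], [1.2, 72.0], [3.0, 90.0], [3.2, 95.0]]
y_train = ["neutral", "neutral", "anger", "anger"]
model = fit_gaussian_nb(X_train, y_train)
```

Each feature contributes its own log-likelihood term to the score, which is exactly the "analysed individually" behaviour described above.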

Results and discussion

The average feature difference of the Mel-frequency cepstral coefficients and the heart rate measured from the recorded ECG are classified using 20 different classifiers. Figures 2, 3 and 4 show the ECG and speech MFCC features, recorded simultaneously, for the common set of basic emotions. The distance features drawn from the MFCC and the heart rate extracted from the ECG are correlated by the classifiers to predict the emotions. The output obtained from the classifiers for the 5 % training and 95 % test dataset is shown in Table 1. The results show that there exists some correlation between heart rate, emotion and human speech that can be relevant for building a contactless, speech-sensor-based heart rate measurement. It can be noted that, from a classification perspective, this is a small-sample classification problem, which can prove to be an extremely challenging prediction problem in real-time recording situations. The scatter diagram in Fig. 5 shows the inseparable nature of the problem and indicates that MFCC features are not robust under the variations of inter-class emotions.

Fig. 2

a MFCC of emotion ‘neutral’ and b corresponding ECG obtained from the handheld ECG monitor

Fig. 3

a MFCC of emotion ‘joy’ and b corresponding ECG obtained from the handheld ECG monitor

Fig. 4

a MFCC of emotion ‘anger’ and b corresponding ECG obtained from the handheld ECG monitor

Fig. 5

Scatter plot of the dataset

Table 1 Comparison of different classifiers on the emotion database when 5 % of the class samples are used as training data and the remaining 95 % of samples are used as test data

Since the recordings were made using actors, the induced emotions may depend on each actor's ability to mimic an emotion. In addition, the scatter plots indicate that emotions in their pure form are rare and are often a combination of subtle emotions. Other basic emotions, such as boredom, disgust, fear and sadness, should also be considered for the classification, which would help in better analysis of ECG measurement from the corresponding emotions. Table 2 shows the heart rate detection accuracy for the 30 individuals who took part in this study. As can be clearly seen, there is a large variation in detection accuracy between individuals, which indicates a subtle mixing of emotions leading to overlap between classes. Nonetheless, it can be seen from the classification accuracies that there exist strong correlations between the emotions in the speech and the heart rates.

Table 2 Individual classification accuracy (%) of 30 subjects using 66 % of the data points for training and the remainder for testing


We have demonstrated the possibility of a contactless human heart rate monitoring system based on the variation in human emotions. The idea is tested using a range of well-known classification techniques. The precision, accuracy, F-measure and percentage recall were also analysed. The classification is done for three different emotions, and the emotional level of the subject could be identified from the corresponding heart rates. The database was created from thirty different subjects for three different emotions. The classification analysis indicated a strong correlation between heart rate, emotion and human speech, which can be further explored to create contactless real-time heart rate detection devices.


  1. Sundnes J, Lines GT, Cai X, Nielsen BF, Mardal K-A, Tveito A (2007) Computing the electrical activity in the heart, vol 1. Springer, Netherlands

  2. Milacic M, James AP, Dimitrijev S (2013) Biologically inspired features used for robust phoneme recognition. Int J Mach Intell Sens Signal Process 1(1):46–54

  3. Davletcharova A, Sugathan S, Abraham B, James AP (2015) Detection and analysis of emotion from speech signals. Procedia Comput Sci 58:91–96

  4. McCraty R, Atkinson M, Tiller WA, Rein G, Watkins AD (1995) The effects of emotions on short-term power spectrum analysis of heart rate variability. Am J Cardiol 76(14):1089–1093

  5. Kim KH, Bang S, Kim S (2004) Emotion recognition system using short-term monitoring of physiological signals. Med Biol Eng Comput 42(3):419–427

  6. Anttonen J, Surakka V (2005) Emotions and heart rate while sitting on a chair. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, pp 491–499

  7. Mesleh A, Skopin D, Baglikov S, Quteishat A (2012) Heart rate extraction from vowel speech signals. J Comput Sci Technol 27(6):1243–1251

  8. Sigurdsson S, Petersen K, Lehn-Schiøler T (2006) Mel frequency cepstral coefficients: An evaluation of robustness of mp3 encoded music. In: Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR). Victoria

  9. Muda L, Begam M, Elamvazuthi I (2010) Voice recognition algorithms using mel frequency cepstral coefficient (mfcc) and dynamic time warping (dtw) techniques. arXiv preprint arXiv:1003.4083

  10. Kim S-B, Han K-S, Rim H-C, Myaeng SH (2006) Some effective techniques for naive bayes text classification. Knowl Data Eng IEEE Trans 18(11):1457–1466

  11. Staelin C (2003) Parameter selection for support vector machines. Hewlett-Packard Company, Tech. Rep. HPL-2002-354R1

  12. Meraoumia A, Chitroub S, Bouridane A (2012) Multimodal biometric person recognition system based on fingerprint & finger-knuckle-print using correlation filter classifier. In: IEEE International Conference on Communications (ICC), 2012, pp 820–824

  13. Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140

  14. Singhal A, Brown CR (1997) Dynamic bayes net approach to multimodal sensor fusion. In: Proceedings of SPIE 3209, Sensor Fusion and Decentralized Control in Autonomous Robotic Systems, vol. 3209. pp 2–10. doi:10.1117/12.287628

  15. Zhang H (2004) The optimality of naive bayes. AA 1(2):3

  16. Kibriya AM, Frank E, Pfahringer B, Holmes G (2005) Multinomial naive bayes for text categorization revisited. In: AI 2004: Advances in Artificial Intelligence. Springer, Germany, pp 488–499

  17. Mitchell T (2005) Generative and discriminative classifiers: naive bayes and logistic regression. Manuscript available at

  18. Longstaff ID, Cross JF (1987) A pattern recognition approach to understanding the multi-layer perception. Pattern Recognit Lett 5(5):315–319

  19. Aha DW, Kibler D, Albert MK (1991) Instance-based learning algorithms. Mach Learn 6(1):37–66

  20. Buscema M, Tastle WJ, Terzi S (2013) Meta net: A new meta-classifier family. In: Data Mining Applications Using Artificial Adaptive Systems. Springer, New York, pp 141–182

  21. Atkeson CG, Moore AW, Schaal S (1997) Locally weighted learning for control. In: Lazy Learning. Springer, Netherlands, pp 75–113

  22. Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41(8):578–588

  23. Hou C, Nie F, Yi D, Wu Y (2013) Efficient image classification via multiple rank regression. Image Process IEEE Trans 22(1):340–352

  24. Ayu MA, Ismail SA, Matin AFA, Mantoro T (2012) A comparison study of classifier algorithms for mobile-phone’s accelerometer based activity recognition. Procedia Eng 41:224–229

  25. Shahzad W, Asad S, Khan MA (2013) Feature subset selection using association rule mining and jrip classifier. Int J 8(18):885–896

  26. Kotsiantis SB, Zaharakis I, Pintelas P (2007) Supervised machine learning: a review of classification techniques. In: Proceedings of the 2007 Conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies. IOS press, Amsterdam, pp 3–44


The author would like to acknowledge Bibia Abraham from Kannur Medical College for helping with the preparation of the dataset required for this study.

Competing interests

The authors declare that they have no competing interests.

Author information


Corresponding author

Correspondence to Alex Pappachen James.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Cite this article

James, A.P. Heart rate monitoring using human speech spectral features. Hum. Cent. Comput. Inf. Sci. 5, 33 (2015).
