### The framework of the attention calculation method

The proposed computational method, based on the information entropy of two channels and an exponential moving average (EMA), simulates the trigger mechanism of the human auditory system and the persistence of the attention process. In information theory, information entropy represents the average rate at which information is produced. An audio signal with prominent features, such as high short-term average energy, a high average zero-crossing rate, a large binaural intensity difference, or a short binaural time difference, has a large information entropy. Compared with models that combine many features of the audio signal, this entropy-based attention model avoids the complexity of multi-feature calculation, and the experiments in this paper verify that it achieves good detection accuracy.

The block diagram of the calculation model is shown in Fig. 2. We first apply a set of Gammatone filters and the Meddis inner hair cell model to perform auditory peripheral processing on the audio signal. Two channels then compute the local entropy. The image channel, based on an image-saliency method, first obtains a spectrogram containing frequency, time and loudness information; because some information in the initial spectrogram is not clear, the image is converted to grayscale and enhanced, and the resulting clear grayscale image is processed by correlation calculation to obtain the local entropy of the image channel. The other channel is the audio calculation channel: we first frame the speech signal and then calculate the local information entropy of each frame.
The local entropies obtained through the audio channel and the image channel are linearly merged, and the auditory attention is then calculated by an EMA model, which makes full use of the characteristics of the audio signal to reflect the persistence and attenuation of the human attention mechanism in the time dimension. Finally, we determine the area of interest of the audio signal from the degree of attention.

### The computational model of auditory periphery

In order to simulate the path of sound signals from the basilar membrane to the cochlear nucleus, the frequency selectivity of the basilar membrane is simulated by a set of bandpass filter banks that extract sound parameters. At the same time, we simulate the generation and transport of neurotransmitters in the cleft between the hair cell and the auditory nerve fiber with an inner hair cell and auditory nerve synapse model. The computational model of the auditory periphery splits the audio signal according to different center frequencies and performs half-wave rectification and non-linear compression. The bandpass filter banks commonly used for auditory peripheral processing are the Mel filter bank and the Gammatone (GT) filter bank. Mel Frequency Cepstral Coefficients (MFCCs) are widely used, but MFCCs are sensitive to noise and have poor noise resistance. Researchers have demonstrated the superior noise immunity of GT filter banks used in place of Mel filter banks, which have a certain suppressive effect on Gaussian white noise and additive background noise [20, 21]. Finally, the auditory periphery model uses the Meddis model [22, 23] to describe the inner hair cells and the auditory nerve synapse.

#### The Gammatone filter bank to simulate basilar membrane

Compared with an ordinary spectrogram, the cochlea-like map obtained by the GT filter bank has better low-frequency resolution than high-frequency resolution [24]. The impulse response of a GT filter can be regarded as a Gamma function multiplied by a cosine signal; the time-domain impulse response is given in formula (1), where *N* is the number of filters, *t* is the time of the audio, *n* is the index of the filter, \(f_n\) is its center frequency, \(\phi \) is the starting phase, \(\alpha \) is the order of the filter, and *A* is a constant. The equivalent rectangular bandwidth (ERB) is a psychoacoustic measure of the bandwidth of the auditory filter at each point along the cochlea; with \(\alpha =4\) and \(b=1.1019\), the GT filter represents the human auditory filter well [25]. For convenience, we set \(A=1\) and \(\phi =0\) [26].

$$\begin{aligned}g(n,t)&=A{{t}^{\alpha -1}}{{e}^{-2\pi Bt}}\cos (2\pi {{f}_{n}}t+\phi )u(t), \quad t\ge 0,\ 1\le n\le N \nonumber \\ B&={{b}}\times ERB({{f}}_{{n}}) \nonumber \\ ERB({{f}_{n}})&=24.7\times \left( 4.37\times \frac{{{f}_{n}}}{1000}+1\right) \end{aligned}$$

(1)

In this paper, we set \(N=25\), and the output of the GT filter bank can be formulated as

$$\begin{aligned} y(n,t)=s({{t}})*g({{n}},{{t}}) \end{aligned}$$

(2)

where \(*\) denotes convolution, *y*(*n*, *t*) is the filtered signal, *s*(*t*) is the input audio signal, and *g*(*n*, *t*) is the impulse response of the *n*th Gammatone filter.
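As a concrete illustration, the filter bank of Eqs. (1)-(2) can be sketched in numpy. The frequency range, the ERB-number spacing of the center frequencies, and the truncation of the impulse response are our own illustrative choices; the paper only fixes \(N=25\), \(\alpha =4\), \(b=1.1019\), \(A=1\), \(\phi =0\).

```python
import numpy as np

def gammatone_filterbank(signal, fs, n_filters=25, f_low=100.0, f_high=4000.0,
                         ir_dur=0.064):
    """Filter `signal` through a bank of Gammatone filters (Eqs. 1-2).

    Center frequencies are spaced on the ERB-number scale between f_low
    and f_high (an assumed convention, not stated in the paper); the
    impulse response is truncated to ir_dur seconds for efficiency.
    """
    t = np.arange(int(ir_dur * fs)) / fs          # truncated time axis

    def erb(f):                                   # Eq. (1): ERB(f_n)
        return 24.7 * (4.37 * f / 1000.0 + 1.0)

    def hz_to_erbn(f):                            # common ERB-number scale
        return 21.4 * np.log10(4.37e-3 * f + 1.0)

    def erbn_to_hz(e):
        return (10 ** (e / 21.4) - 1.0) / 4.37e-3

    centers = erbn_to_hz(
        np.linspace(hz_to_erbn(f_low), hz_to_erbn(f_high), n_filters))

    out = np.empty((n_filters, len(signal)))
    for n, fc in enumerate(centers):
        B = 1.1019 * erb(fc)                      # b * ERB(f_n)
        # g(n, t) with alpha = 4, A = 1, phi = 0:
        g = t ** 3 * np.exp(-2 * np.pi * B * t) * np.cos(2 * np.pi * fc * t)
        g /= np.max(np.abs(g)) + 1e-12            # normalize impulse response
        out[n] = np.convolve(signal, g)[: len(signal)]   # y(n, t) = s * g
    return centers, out
```

Each row of the returned array is one cochlear channel, ready for the inner hair cell stage.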

#### The Meddis inner hair cell model

The inner hair cell is the transducer element of the cochlea; its function is to convert the mechanical vibration of the basilar membrane into a potential within the cell membrane. The audio signal filtered by the GT filter bank is processed by the inner hair cell mathematical model, whose functions are half-wave rectification, non-linear compression and adaptive adjustment. The Meddis model is a composite model commonly used in auditory peripheral processing. The permeability of the cell membrane changes with the instantaneous intensity of the sound wave: neurotransmitter penetrates from the pool into the cleft through the cell membrane; part of the neurotransmitter in the cleft is collected back into the pool through the reprocessing store, while the rest is freely lost; and the factory in the inner hair cell constantly manufactures neurotransmitter to replenish the depletion.

The Meddis model describes the generation, transmission and diffusion processes by which acoustic signals are converted into potential signals, and this mathematical model is simple and easy to implement on a computer [27]. The differential equations describing the model are shown below:

$$\begin{aligned}k(t)&={\left\{ \begin{array}{ll} \frac{A+s(t)}{A+B+s(t)}g,&\quad A+s(t)\ge 0 \\ 0, &\quad A+s(t)<0 \end{array}\right. } \nonumber \\\frac{dq(t)}{dt}&=y(1-q(t))-k(t)q(t)+xw(t) \nonumber \\ \frac{dc(t)}{dt} &=k(t)q(t)-lc(t)-rc(t) \nonumber \\\frac{dw(t)}{dt} &=rc(t)-xw(t) \end{aligned}$$

(3)

where *k*(*t*) is the osmotic pressure (permeability) of the cell membrane, *s*(*t*) is the output of the basilar membrane, and *A*, *B*, *g*, *y*, *x*, *l*, *r* are constants. The cleft contents *c*(*t*) determine the probability of nerve fiber activity, scaled by a constant *h*. The output of the Meddis model is computed as Eq. (4).

$$\begin{aligned}V(t)=h\cdot c(t) \end{aligned}$$

(4)
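The differential equations (3)-(4) can be integrated numerically, e.g. with a forward Euler step at the audio sample rate. The sketch below assumes Meddis-style constants of the kind published with this model; the paper itself does not list parameter values, so treat them as illustrative.

```python
import numpy as np

def meddis_ihc(s, fs, A=5.0, B=300.0, g=2000.0, y=5.05, l=2500.0,
               r=6580.0, x=66.31, h=50000.0):
    """Forward-Euler integration of the Meddis equations (Eqs. 3-4).

    `s` is one basilar-membrane (GT-filtered) channel; the constants
    are assumed illustrative values, not taken from the paper. The
    output V = h * c(t) is the firing-probability signal of Eq. (4).
    """
    dt = 1.0 / fs
    q, c, w = 1.0, 0.0, 0.0               # transmitter pool, cleft, store
    V = np.empty(len(s))
    for i, st in enumerate(s):
        # membrane permeability k(t), half-wave rectified:
        k = g * (A + st) / (A + B + st) if A + st > 0 else 0.0
        dq = y * (1.0 - q) - k * q + x * w    # free transmitter pool q(t)
        dc = k * q - l * c - r * c            # cleft contents c(t)
        dw = r * c - x * w                    # reprocessing store w(t)
        q, c, w = q + dt * dq, c + dt * dc, w + dt * dw
        V[i] = h * c                          # Eq. (4)
    return V
```

A smaller step (or a stiff ODE solver) would be needed for very low sample rates, since the cleft loss rates *l* and *r* are large.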

### The two channels of getting local information entropy

An audio attention model generally locates a salient attention area by linearly combining multi-dimensional features of the signal. In this paper, the signal is processed by an image channel and an audio channel to obtain the local information entropy, which compensates for the incompleteness of a single-channel calculation model; our experiments also show that combining the two channels improves the accuracy of attention extraction.

#### The audio channel processing

The audio processing channel frames the audio signal, normalizes the amplitude and finally calculates the local information entropy. Information entropy is the average rate at which information is produced by a stochastic source of data. If a random signal has high short-term frequency content and unevenly distributed energy, its information entropy is larger; conversely, a uniform signal has a lower information entropy. The signals perceived by the human ear are rarely stationary in real environments: their frequency and loudness vary over a certain range, so they have a higher information entropy. Since the amplitude of the audio signal is normalized to the range \(-1\) to 1, we divide this interval into *n* consecutive cells, giving a vector \(Y=\{y_1,y_2,\ldots ,y_n\}\) of sub-intervals. The information entropy of each frame is given by

$$\begin{aligned}H(k)&=-\sum \limits _{i=1}^{n}{{{p}_{i}}\log {{p}_{i}}} \nonumber \\{{p}_{i}}&=\frac{count({{y}_{i}})}{\sum \nolimits _{j=1}^{n}{count({{y}_{j}})}},\quad {{y}_{i}}\subset [-1,1] \nonumber \\{{H}_{aud}}(t)&=\sum \limits _{k=0}^{t}{H(k)} \end{aligned}$$

(5)

where *H*(*k*) is the information entropy of the *k*th frame, \(p_i\) is the probability that the amplitude falls in the sub-interval \(y_i\), and \(count(y_i)\) is the number of discrete points in \(y_i\); we set \(n=20\). Finally, \(H_{aud}(t)\) is the local information entropy produced by the audio channel, where *t* is time.
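The per-frame entropy of Eq. (5) amounts to a 20-bin amplitude histogram per frame. A minimal sketch (non-overlapping frames and the natural logarithm are our assumptions; the log base only rescales the entropy):

```python
import numpy as np

def frame_entropy(frame, n_bins=20):
    """Eq. (5): information entropy of one amplitude-normalized frame.

    The interval [-1, 1] is split into n_bins cells and the entropy of
    the amplitude histogram is returned; empty bins contribute 0.
    """
    counts, _ = np.histogram(frame, bins=n_bins, range=(-1.0, 1.0))
    p = counts / counts.sum()        # p_i = count(y_i) / sum_j count(y_j)
    p = p[p > 0]                     # 0 * log 0 := 0
    return float(-np.sum(p * np.log(p)))

def audio_channel_entropy(signal, frame_len):
    """H(k) for each non-overlapping frame of the signal."""
    n_frames = len(signal) // frame_len
    return np.array([frame_entropy(signal[i * frame_len:(i + 1) * frame_len])
                     for i in range(n_frames)])
```

A constant frame lands in a single bin and gives zero entropy, while a frame whose amplitudes spread evenly over all 20 bins approaches the maximum, \(\log 20\).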

#### The image channel processing

The main purpose of the image channel is to obtain high-attention areas on the spectrogram, where we can extract information different from that of the audio channel. This channel uses an image-saliency algorithm to calculate the local information entropy of the spectrogram, so the first step is to obtain the spectrogram of the one-dimensional audio signal with a bank of band-pass filters. A spectrogram is a visual representation of the spectrum of a sound signal as it varies with time. In its common format, one axis represents time and the other axis represents frequency, while a third dimension, the amplitude of a particular frequency at a particular time, is represented by the intensity or color of each point in the image. Here the spectrogram is denoted *Spec*(*t*), where *t* is the time of the audio signal, and *Spec*(*t*) is in the RGB domain. The next calculation of the image channel is described by:

$$\begin{aligned}&Spec{{(t)}_{G\text {ray}}}=\frac{(299\times Spec{{(t)}_{R}}+587\times Spec{{(t)}_{G}}+114\times Spec{{(t)}_{B}})}{1000} \end{aligned}$$

(6)

where \(Spec(t)_{Gray}\) is the grayscale image associated with the spectrogram in the RGB domain (\(Spec(t)_R\) stands for red, \(Spec(t)_G\) for green and \(Spec(t)_B\) for blue). In order to extract more useful and clearer information from the grayscale image, we enhance its visual representation, formulated as the following equation:

$$\begin{aligned}&Eh(t)=mean(Spec{{(t)}_{G\text {ray}}})+\frac{(Spec{{(t)}_{Gray}}- mean(Spec{{(t)}_{Gray}}))}{\frac{contrast}{100}} \end{aligned}$$

(7)

where *Eh*(*t*) is the enhanced image, *contrast* is the contrast degree used to improve the image quality, and \(mean(Spec(t)_{Gray})\) denotes the average value of \(Spec(t)_{Gray}\) calculated by Eq. (6). The internal computing process of this channel is formulated by:

$$\begin{aligned}{{p}_{i}}&=\frac{f(i)}{{{N}^{2}}} \nonumber \\H({{t}})&=-\sum \limits _{i=0}^{255}{{{p}_{i}}\log {{p}_{i}}} \end{aligned}$$

(8)

where *H*(*t*) is an array in which each output pixel contains the entropy value of the 9-by-9 neighborhood around the corresponding pixel in the input image, *f*(*i*) is the number of times a pixel with gray value *i* (\(i\in [0,255]\)) appears in the 9-by-9 neighborhood, and \(p_i\) is the corresponding probability; we set \(N=9\). Finally, the local entropy of the image is reduced to a one-dimensional entropy, defined as:

$$\begin{aligned}{{H}_{img}}(t)=mean(H(t)) \end{aligned}$$

(9)

where \({{H}}_{img}(t)\) is the local information entropy of the image channel, a one-dimensional vector obtained by averaging over the columns.
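The whole image channel, Eqs. (6)-(9), can be sketched in plain numpy. The `contrast` setting, the edge padding of the neighborhood, and the base-2 logarithm are our own assumptions; the local-entropy loop computes what an `entropyfilt`-style routine would.

```python
import numpy as np

def spec_to_gray(spec_rgb):
    """Eq. (6): luminance grayscale of an RGB spectrogram (levels 0-255)."""
    r, g, b = spec_rgb[..., 0], spec_rgb[..., 1], spec_rgb[..., 2]
    return (299 * r + 587 * g + 114 * b) / 1000.0

def enhance(gray, contrast=50.0):
    """Eq. (7): stretch the image about its mean.

    contrast < 100 amplifies deviations from the mean; 50 here is an
    illustrative setting, not a value given in the paper.
    """
    m = gray.mean()
    return m + (gray - m) / (contrast / 100.0)

def local_entropy(gray, n=9):
    """Eq. (8): entropy of the n-by-n neighborhood around each pixel.

    Gray levels are clipped to [0, 255] and the image is edge-padded so
    every pixel has a full neighborhood (both are our assumptions).
    """
    rad = n // 2
    padded = np.pad(np.clip(gray, 0, 255).astype(int), rad, mode='edge')
    H = np.zeros(gray.shape)
    for i in range(gray.shape[0]):
        for j in range(gray.shape[1]):
            win = padded[i:i + n, j:j + n]
            p = np.bincount(win.ravel(), minlength=256) / (n * n)  # p_i = f(i)/N^2
            p = p[p > 0]                                           # 0 * log 0 := 0
            H[i, j] = -np.sum(p * np.log2(p))
    return H

def image_channel_entropy(spec_rgb):
    """Eq. (9): H_img(t), one entropy value per time column."""
    eh = enhance(spec_to_gray(spec_rgb))
    return local_entropy(eh).mean(axis=0)       # average over the columns
```

Averaging over axis 0 collapses the frequency dimension, leaving one entropy value per time column, matching the one-dimensional \(H_{aud}(t)\) of the audio channel for the later linear merge.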

### The exponential moving average (EMA) correlation process

In statistics, a moving average (MA) analyzes data points by creating a series of averages of different subsets of the full data set; it is commonly used with time-series data to smooth out short-term fluctuations [28]. The EMA is a type of MA that places greater weight and significance on the most recent data points; its principle matches the human auditory system, which pays more attention to what happened most recently.

The EMA for a series audio signal can be formulated as

$$\begin{aligned}EMA(k,n)&=h(k)\cdot a+EMA(k-1,n)\cdot (1-a) \nonumber \\a&=\frac{2}{n+1} \end{aligned}$$

(10)

where *h*(*k*) is the value of the information entropy, and *EMA*(*k*, *n*) is the value of the EMA at the *k*th frame computed over the scale *n*; different values of *n* represent different calculation scales, with a smaller *n* reflecting short-term trends in the information entropy. The coefficient *a* is determined by *n*.
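The recursion of Eq. (10) is a few lines of code; only the initialization of the very first value is left open by the paper, so the common choice \(EMA(0,n)=h(0)\) below is an assumption.

```python
import numpy as np

def ema(h, n):
    """Eq. (10): EMA(k, n) over an entropy series h, with a = 2/(n+1).

    The series is initialized with EMA(0, n) = h[0], a common choice
    that the paper does not specify.
    """
    a = 2.0 / (n + 1)
    out = np.empty(len(h))
    out[0] = h[0]
    for k in range(1, len(h)):
        out[k] = a * h[k] + (1 - a) * out[k - 1]   # recent data weighted by a
    return out
```

With a step input the EMA rises geometrically toward the new level, and a smaller *n* (larger *a*) tracks the input faster, which is exactly the short-term/long-term contrast exploited next.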

In this paper, the short-term exponential moving average is denoted \(EMA(k,s_n)\) and the long-term exponential moving average \(EMA(k,l_n)\); the values of the coefficients \(s_n\) and \(l_n\) are related to the duration of the audio. In the same coordinate system, when the short-term EMA line crosses upward through the long-term EMA line, the information entropy of the short-term audio signal has increased, which can be regarded as the beginning of an event with a high degree of attention. Conversely, when the short-term EMA line crosses downward through the long-term EMA line, the information entropy of the short-term audio signal has decreased, which is the sign that the human auditory system is losing attention to this event. The differential EMA is defined as

$$\begin{aligned}&dif(k)=EMA(k,{{s}_{n}})-EMA(k,{{l}_{n}}),\quad {{l}_{n}}>{{s}_{n}} \end{aligned}$$

(11)

where *dif* is the difference between the short-term EMA and the long-term EMA; a higher value of *dif* indicates higher attention at that time. Values of *dif* greater than 0 confirm a frame as high attention, while values of *dif* less than 0 mean the frame does not attract our attention. We can also plot *dif* as a column chart that displays the data as vertical bars. Some segments on this chart differ in attention from their surroundings (the sign of *dif* differs from the segments before and after) and last only a very short time, so we fuse these segments to make the sign of their *dif* match the surrounding segments. Namely, we have the following:

$$\begin{aligned}&dif{{(k)}_{\text {deal}}}=\text {seg}\_ful(dif(k)) \end{aligned}$$

(12)

where \(dif(k)_{deal}\) is the result after such short segments have been fused. The degree of attention is calculated as shown in formula (13), where *attention*(*k*) represents the degree of attention of the *k*th frame.

$$\begin{aligned}&attention(k)=EMA(dif{{(k)}_{deal}},0.1({{l}_{n}}-{{s}_{n}})) \end{aligned}$$

(13)
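Eqs. (11)-(13) chain together as follows. The paper does not define \(\text{seg}\_ful\) precisely, so the sign-flipping rule and the `min_len` threshold below are one plausible reading of "fuse segments whose sign differs from their surroundings and whose duration is very short", and are labeled as assumptions.

```python
import numpy as np

def ema(h, n):
    """Eq. (10), a = 2/(n+1), initialized at h[0] (our choice)."""
    a = 2.0 / (n + 1)
    out = np.empty(len(h))
    out[0] = h[0]
    for k in range(1, len(h)):
        out[k] = a * h[k] + (1 - a) * out[k - 1]
    return out

def seg_ful(dif, min_len=5):
    """Sketch of the segment fusion in Eq. (12): a run whose sign
    differs from its neighbours and is shorter than min_len frames is
    flipped to match them. min_len is an assumed threshold; the paper
    only says such segments are very short.
    """
    d = np.asarray(dif, dtype=float).copy()
    sign = np.sign(d)
    bounds = ([0] + [i for i in range(1, len(d)) if sign[i] != sign[i - 1]]
              + [len(d)])
    for s, e in zip(bounds[:-1], bounds[1:]):
        if e - s < min_len and s > 0 and e < len(d):
            d[s:e] = -d[s:e]          # take the sign of the surroundings
    return d

def attention(h, s_n, l_n, min_len=5):
    """Eqs. (11)-(13): degree of attention from the entropy series h."""
    dif = ema(h, s_n) - ema(h, l_n)            # Eq. (11), l_n > s_n
    dif_deal = seg_ful(dif, min_len)           # Eq. (12)
    return ema(dif_deal, 0.1 * (l_n - s_n))    # Eq. (13)
```

On a series whose entropy jumps from low to high, the short-term EMA overtakes the long-term one, so *dif* and the smoothed *attention* turn positive shortly after the jump, which is the crossover behaviour the text describes.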