- Open Access
Comparative study of singing voice detection based on deep neural networks and ensemble learning
© The Author(s) 2018
- Received: 5 September 2018
- Accepted: 8 November 2018
- Published: 26 November 2018
This paper investigates various structures of neural network models and various types of stacked ensembles for singing voice detection. The studied models include convolutional neural networks (CNN), the long short-term memory (LSTM) model, the convolutional LSTM model, and the capsule net. The input features to the network models are MFCC (mel-frequency cepstral coefficients), spectrograms from the short-time Fourier transform, or raw PCM samples. The simulation results show that the CNN model with spectrogram inputs yields the highest detection accuracy, up to 91.8% for the Jamendo dataset. Among the studied stacked ensemble methods, the voting strategy yields performance comparable to that of the other methods, but at a much lower computational cost. By voting with five models, the accuracy reaches 94.2% for the Jamendo dataset.
- Singing voice detection
- Convolutional neural networks
Detecting singing voice in a piece of music (soundtrack) has been studied for many years because this technique is the foundation for many advanced applications. In the following, we briefly describe some of these applications. First, if we intend to remove the vocal sound from a singing soundtrack for karaoke singers, the pre-processing step needs to pinpoint the audio segments with singing voice. Second, the most well-known portion of a western popular song is usually the verse part, which almost always contains singing performance. Therefore, music summarization as well as melody extraction can also benefit from knowing the segments with singing voice. Third, if we want to identify the singer in a piece of music, we need to have the singing segments before conducting recognition. In addition to the above applications, if we intend to perform a lyrics-to-melody conversion, we also need to know the singing segments. These examples show that singing voice detection is a fundamental pre-processing step for many applications.
There are two types of problems in detecting singing voices in a piece of audio work. The first type is to mark the starting and ending points of all vocal segments on the soundtrack, referred to as the singing voice segmentation problem. The second type is to determine whether a short audio clip (e.g., 2 s) contains any human-perceivable vocal sound, including the vocal sound of the background vocalists. This type of problem is referred to as the singing voice detection problem. At first thought, the singing voice segmentation problem seems more difficult than the detection problem. However, the segmentation problem has the entire soundtrack available; therefore, some post-processing steps, such as smoothing, can be conducted to correct labeling errors by using the information of the preceding and succeeding pieces of music. Such a technique, however, cannot be applied to the singing voice detection problem. In this paper, we focus on the second type of problem. As these two types of problems are somewhat different, the accuracy reported for one type of problem cannot be directly compared with that of the other.
Recently, new types of neural network structures have been widely applied to many difficult problems [8, 9]. For example, the CNN structure is now widely used in image recognition problems [10, 11]. Therefore, it is a natural extension to apply the CNN structure to the singing voice detection problem. Conceptually, if we construct the spectral-temporal features of an audio clip as a two-dimensional feature plane, we should be able to use CNN structures originally proposed for images in this problem. One well-known spectral-temporal feature is the MFCC (mel-frequency cepstral coefficients). However, there are also other types of spectral-temporal features available, such as the spectrogram based on the STFT (short-time Fourier transform). In addition, directly applying raw PCM (pulse code modulation) samples to a CNN-based structure is also a possible alternative. With all these possibilities, it is useful to know which type of features yields better accuracy. For a fair and meaningful comparison, the classifiers should preferably all be based on CNN and have similar structures. This issue is the first subject we investigate in this paper.
As there are many different types of neural networks available in the literature, it is also important to investigate whether one particular type of neural network has higher accuracy than others. For example, it is known that the LSTM-based neural network is also effective for audio genre classification problems. Thus, LSTM may also perform well in singing voice detection. In addition to the conventional LSTM, we also investigate one variation of LSTM, called convolutional LSTM, which combines the convolution layer into the LSTM structure to adapt to both spatial and temporal features. This network structure may also be a good candidate for singing voice detection. In addition, we also attempt to apply the capsule network (capsulenet) to this problem. Overall, we report the results of using CNN, LSTM, convolutional LSTM, and capsulenet in this paper.
Finally, it is well known that ensemble learning can improve the detection accuracy in many instances. Thus, we would like to study whether this approach also works for the singing voice detection problem. In this paper, we apply ensemble learning with three different approaches, namely, voting, post-classifier, and fusion, to the singing voice detection problem and report the respective accuracy. Overall, the goal of this paper is to present a comprehensive comparison of the relative accuracy of various types of features, network models, and ensemble learning techniques for the singing voice detection problem, so that researchers and practicing engineers facing this problem can follow our findings without repeating all of these experiments.
This paper is organized as follows. The “Related work” section surveys the literature, covering related papers and their reported accuracy. The “Neural network structures with various types of features” section describes all of the network structures and ensemble learning approaches used in the experiments with various spectral-temporal features. The “Experiments and results” section covers the experimental setting, the datasets used, and the experimental results. Finally, the “Conclusion” section concludes the paper.
To locate singing voice segments, researchers usually extract one or more types of features from the audio signal and then use a classifier for detection. One widely used feature for audio applications is the MFCC (mel-frequency cepstral coefficients). To investigate whether this type of feature is better than others, Kim et al. compared MFCC with MPEG-7 ASP (Audio Spectrum Projection) features and found that MFCC was better. Rocamora and Herrera reached the same finding, but their accuracy was only around 78%. To further increase the accuracy, Dittmar et al. proposed to combine MFCC features with vocal variation and fluctogram variation. When using a random forest as the classifier, the F-measure could reach 87%.
Other than the MFCC features, Berenzweig and Ellis used statistical features with the HMM (Hidden Markov Model) as the classifier. They reported an accuracy of around 80%, indicating that the statistical features may not be much better than the MFCC feature.
Another attempt to improve the accuracy for the singing voice segmentation problem is through a post-processing step. To this end, Lukashevich et al. proposed to use the ARMA (autoregressive moving average) smoothing model as the post-processor. With this technique, they obtained an average accuracy of 82.5%. Vembu and Baumann also combined several features with a smoothing technique. They achieved an accuracy of 84% for singing voice segmentation.
Nwe et al. proposed the bootstrapping technique to further improve the accuracy of singing voice segmentation. They incorporated musical features and musical structure as features and used a Multi-Model HMM as the classifier. With the bootstrapping technique, they obtained an accuracy of 86.7%.
In terms of classifiers, Leglaive et al. compared a neural network model called BLSTM (Bidirectional Long Short-Term Memory) with traditional classifiers, such as SVM (support vector machine), for singing voice segmentation. In their setting, the features are MFCC-like features derived from two HPSS (Harmonic/Percussive Source Separation) layers. According to their simulations, BLSTM was better, with the F-measure reaching 91%.
On the singing voice detection problem, Schlüter and Grill proposed a model using three-layer convolutional neural networks (CNN) for singing voice detection. The input to the CNN is a 2-D spectral-temporal feature plane, obtained with a procedure similar to that of the MFCC. When applying data augmentation, they reached an accuracy of 91%.
To remove the need for feature extraction in singing voice detection, Dieleman and Schrauwen used a unified network for both feature extraction and classification. Hypothetically, a learnable network for feature extraction should be able to extract better features than existing ones, as suggested by Humphrey et al. However, simulation results in Dieleman and Schrauwen’s report showed that this type of unified network did not provide higher accuracy when compared with networks using traditional features, such as MFCC. Recently, Lee et al. proposed a new end-to-end network model with many layers of small filters. The authors claimed that their model was better than the conventional end-to-end model of Dieleman and Schrauwen. As their experiments did not cover singing voice detection, whether this structure is better for this particular problem remains open.
This section describes all network structures studied in this paper. To have a fair comparison, we try our best either to use the same type of features for some network structures, or similar network structures for different types of features.
CNN with MFCC features
CNN with FFT features
In Fig. 2, we use the square layer to square the outputs from sin MYP1D and cos MYP1D, and then take the square root of their sum. The reason for squaring is to avoid producing negative values. Pointwise convolving the input signal with a sine or cosine function may produce either a positive or a negative number, depending on the phase of the input signal. In our case, we are not concerned with the phase of the signal, but only with the relative “strength” (energy) of the signal. Therefore, we use the square function. We also tried removing the square and square-root functions in the experiments, but the accuracy with such an arrangement was much lower.
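The effect of the square and square-root steps can be illustrated outside the network with a small sketch. It is not the layer used in the paper: `band_strength` and `tone` are hypothetical helper names, and the 16 kHz rate, 512-sample window, and 1 kHz test tone are assumptions chosen only so that the tone fits the window exactly.

```python
import math


def band_strength(signal, freq_hz, sample_rate):
    """Phase-independent strength of one frequency component.

    Projects the signal onto a sine and a cosine of the same frequency,
    squares both projections, adds them, and takes the square root.
    """
    sin_sum = 0.0
    cos_sum = 0.0
    for n, x in enumerate(signal):
        angle = 2.0 * math.pi * freq_hz * n / sample_rate
        sin_sum += x * math.sin(angle)  # sign and size depend on the phase
        cos_sum += x * math.cos(angle)  # sign and size depend on the phase
    return math.sqrt(sin_sum ** 2 + cos_sum ** 2)


def tone(freq_hz, phase, sample_rate=16000, num_samples=512):
    """Unit-amplitude test tone (hypothetical helper for this sketch)."""
    return [
        math.sin(2.0 * math.pi * freq_hz * n / sample_rate + phase)
        for n in range(num_samples)
    ]


# The same 1 kHz tone with three different starting phases.
strengths = [
    band_strength(tone(1000.0, p), 1000.0, 16000) for p in (0.0, 1.0, math.pi)
]
```

In this setting the three strengths agree (half the window length) even though the individual sine and cosine projections change sign with the phase, which is the behavior the paragraph above relies on.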
CNN with raw PCM
LSTM for singing voice detection
Ensemble learning by stacking
It is known that ensemble learning generally can improve the accuracy for many prediction (classification) problems. There are many different types of ensembles for machine learning. In this paper, we consider only the stacking type of ensemble learning. In this subsection, we briefly describe the methods used in the experiments.
This section covers the experimental procedures and results. For ease of reading, we divide the experimental results into two subsections, one for the accuracy of each type of network model, and the other for results using stacked ensembles.
As mentioned previously, each sample used in the experiments is an audio clip with a duration of 2 s. The network models are trained to detect whether any vocal sound (singing voice) is present in the clip. Note that the vocal signal may occupy only a portion of the audio clip, as long as its presence is distinguishable. Furthermore, audio clips containing vocal sounds from backing vocalists are also classified as vocal clips. In the experiments, the ground truth (vocal or non-vocal status) of the samples is annotated by human listeners.
In the experiments, we use two datasets to assess the accuracy. The first dataset is the Jamendo dataset, containing 93 soundtracks, equivalent to 6 h of playing time. The soundtracks are divided into training, validation, and testing sets, with 61, 16, and 16 soundtracks, respectively. One advantage of the Jamendo dataset is that the singing segments in each soundtrack have been manually annotated. Therefore, all we need to do is to partition each soundtrack into many 2 s audio clips. In the experiments, the training audio clips are from both the training and validation sets, leaving only the testing set for testing audio clips. Overall, we have about 13,000 training samples and 3000 testing samples. The training set has 8480 vocal segments and 7868 non-vocal segments, and the test set has 1487 vocal segments and 1499 non-vocal segments.
The second dataset is derived from the FMA (free music archive) website. The site contains more than 100,000 contributed soundtracks. In the experiments, we randomly pick about 18,000 soundtracks covering all types of music genres. Since the FMA dataset is not balanced in terms of music genre, the genre types of Rock, Electronic, and Experimental (out of 20+ genres specified on the FMA website) cover about 60% of the chosen soundtracks.
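The partitioning step can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `label_clips`, the rule that any overlap with an annotated vocal segment makes a clip vocal, and the dropping of a trailing partial clip are all assumptions.

```python
def label_clips(track_duration_s, vocal_segments, clip_s=2.0, min_overlap_s=0.0):
    """Cut one soundtrack into fixed-length clips and label each clip.

    vocal_segments: list of (start_s, end_s) pairs from the manual annotation.
    A clip is labeled vocal when its total overlap with the annotated vocal
    segments exceeds min_overlap_s (an assumed rule for this sketch).
    Returns a list of (clip_start_s, clip_end_s, is_vocal) tuples.
    """
    clips = []
    num_clips = int(track_duration_s // clip_s)  # trailing partial clip is dropped
    for i in range(num_clips):
        start = i * clip_s
        end = start + clip_s
        overlap = sum(
            max(0.0, min(end, seg_end) - max(start, seg_start))
            for seg_start, seg_end in vocal_segments
        )
        clips.append((start, end, overlap > min_overlap_s))
    return clips


# A 7 s track with singing annotated from 1.5 s to 3.0 s.
example = label_clips(7.0, [(1.5, 3.0)])
```

Raising `min_overlap_s` would make the labeling stricter for clips that contain only a very short vocal fragment.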
For each soundtrack, we randomly take a 2-s excerpt as one sample for the experiments. In this arrangement, no two audio clips in this dataset are from the same soundtrack. As the FMA dataset does not contain any annotation about the vocal/non-vocal information, we obtain the ground truth from human listeners. Specifically, more than 10 graduate students were involved in the listening work. Each student was given around 1500 excerpted (2 s) segments. After listening to a segment, the listener was asked to identify whether the segment contains vocal sound or not. If in doubt, the listener was allowed to listen to the segment repeatedly. If the listener still could not determine whether the segment was vocal or not, the segment was labeled as “undetermined” and was not used in the experiments. With this procedure, we finally have 4783 vocal segments and 7451 non-vocal segments in the training set, and 1660 vocal segments and 2485 non-vocal segments in the testing set. The annotated dataset is available online (https://github.com/NTUT-LabASPL/FMA-C-DataSet-for-Vocal-Detection).
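The class balance implied by the segment counts above can be checked with a few lines of arithmetic. The helper name `class_balance` is hypothetical; the counts are the testing-set figures quoted in the text.

```python
def class_balance(num_vocal, num_non_vocal):
    """Summarize how balanced a two-class set of clips is."""
    total = num_vocal + num_non_vocal
    return {
        "total": total,
        "vocal_fraction": num_vocal / total,
        # Accuracy of a trivial classifier that always predicts the larger class.
        "majority_baseline": max(num_vocal, num_non_vocal) / total,
    }


fma_test = class_balance(1660, 2485)      # FMA testing set
jamendo_test = class_balance(1487, 1499)  # Jamendo testing set
```

With these counts, always answering "non-vocal" would already be right on about 60% of the FMA testing clips, against about 50% for Jamendo, so accuracies on the two datasets start from different trivial baselines.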
Experimental setting and environment
GPU: 1080 Ti × 3
In terms of training, we use the back propagation algorithm with ADADELTA to adapt the learning rate. In addition, to avoid overfitting, we also use dropout regularization for all layers, except the output layer, in the CNN structures. The dropout probability is set to 0.5.
The source audio clips are stereo with a sample rate of 44.1 kHz. The pre-processing steps for the clips include down-mixing to one channel (mono) and downsampling to 16 kHz. After the pre-processing, different types of features are extracted according to the network models described previously.
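For readers unfamiliar with these two techniques, the sketch below shows the ADADELTA update rule for a single scalar parameter and a simple form of dropout. It is an illustration only, not the training code used in the experiments: the decay rate `rho=0.95` and `eps=1e-6` are assumed values, the function names are hypothetical, and the rescaling of surviving units ("inverted" dropout) is one common convention rather than something stated in the text.

```python
import math
import random


def adadelta_step(param, grad, state, rho=0.95, eps=1e-6):
    """One ADADELTA update for a scalar parameter.

    state holds running averages of squared gradients and squared updates;
    the ratio of their roots acts as a per-parameter learning rate.
    """
    state["avg_sq_grad"] = rho * state["avg_sq_grad"] + (1.0 - rho) * grad * grad
    delta = (
        -math.sqrt(state["avg_sq_delta"] + eps)
        / math.sqrt(state["avg_sq_grad"] + eps)
        * grad
    )
    state["avg_sq_delta"] = rho * state["avg_sq_delta"] + (1.0 - rho) * delta * delta
    return param + delta


def dropout(activations, drop_prob=0.5, rng=random):
    """Training-time dropout: zero each unit with probability drop_prob.

    Surviving units are scaled by 1 / (1 - drop_prob) so the expected
    activation is unchanged.
    """
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Toy usage: a few steps on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0
state = {"avg_sq_grad": 0.0, "avg_sq_delta": 0.0}
for _ in range(100):
    w = adadelta_step(w, 2.0 * (w - 3.0), state)

masked = dropout([1.0] * 8, drop_prob=0.5, rng=random.Random(0))
```

No learning rate appears in the update: the step size is derived from the running averages, which is why ADADELTA is described as adapting the learning rate.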
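The two pre-processing steps can be sketched in plain Python. This is a simplified illustration under stated assumptions: the paper does not say which tool performs the conversion, the function names are hypothetical, and the linear-interpolation resampler below omits the anti-aliasing low-pass filter that a production resampler would apply before reducing the rate.

```python
from fractions import Fraction


def downmix_to_mono(left, right):
    """Average the two stereo channels sample by sample."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]


def resample_linear(samples, src_rate=44100, dst_rate=16000):
    """Naive resampler using linear interpolation (no anti-aliasing filter).

    Output sample k is read at source position k * src_rate / dst_rate,
    interpolating between the two nearest source samples.
    """
    step = Fraction(src_rate, dst_rate)  # source samples per output sample
    num_out = len(samples) * dst_rate // src_rate
    out = []
    for k in range(num_out):
        pos = k * step
        i = int(pos)            # index of the source sample at or before pos
        frac = float(pos - i)   # fractional distance to the next sample
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
    return out


# A 2 s stereo clip at 44.1 kHz becomes 32,000 mono samples at 16 kHz.
mono = downmix_to_mono([1.0] * 88200, [0.0] * 88200)
clip_16k = resample_linear(mono)
```

The resulting 32,000 samples per clip are what a raw-PCM model would see as input, while the spectral features are computed from the same down-sampled signal.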
Experimental results for various network models
This section contains experiments intended to (a) compare the detection accuracy of the various types of models given in the previous section, (b) observe how the accuracy is influenced by the unequal number of segments in the vocal and non-vocal classes, and (c) evaluate the generalization capability of the studied networks when trained on one dataset and tested on another. In these experiments, the training and testing samples are either from the Jamendo dataset or from the FMA dataset.
Table: Accuracy for various network models trained and tested using the Jamendo dataset; average accuracy over 10 trials (%)
As mentioned in the “Related work” section, a new deep network structure for end-to-end audio recognition with many small filters has been proposed. Its authors showed that their model was better than the conventional end-to-end model given in Fig. 5. When using this deep structure in the experiments, the average accuracy did increase from 77% to around 82%. However, it is still significantly lower than that of the SCNN structure. Due to the low accuracy and prolonged training time, we decided not to investigate the end-to-end approach further in the following experiments.
Other than the ECNN, we also notice that the capsule net has relatively lower accuracy. This phenomenon could be partially due to sub-optimal hyper-parameters of the network model. We have also tried several different sets of hyper-parameters; unfortunately, we are still unable to find a good set of hyper-parameters for this particular type of model. We still report our results, hoping that other researchers may benefit from our findings.
When comparing the SLSTM and CLSTM models, we notice that CLSTM is slightly better. This result seems to indicate that using convolutional layers is beneficial for 2-D feature planes, such as spectrograms.
Since the MCNN and SCNN have similar network structures, the performance difference is mainly due to the type of input features. Generally speaking, the spectrogram is a lower-level spectral-temporal feature than the MFCC. Therefore, the spectrogram possesses more information for the CNN to explore, if training is successful. This result confirms the common conjecture that lower-level features are usually preferable as CNN inputs.
Finally, when comparing the accuracy between SCNN and CLSTM, the SCNN is better although both use convolutional layers. Because the SCNN structure has more layers, the simulation results, in a sense, indicate that the network architecture is more important than the processing power of each node.
Table: Accuracy for various network models trained and tested using the FMA dataset; average accuracy over 10 trials (%)
Table: Results of misclassification for various network models using the FMA dataset. Columns: misclassify vocal as non-vocal; misclassify non-vocal as vocal
Table: Accuracy for interchanging training and testing dataset. Columns: Testing: FMA (%); Testing: Jamendo (%)
Simulation results for ensemble learning
Experimental results for voting
To test the detection accuracy using voting, we perform the following experiment with two settings: one with all five models involved in voting, and the other with only the three models with higher accuracy (namely, SCNN, MCNN, and CLSTM). The experimental results are shown in Table 4. In this experiment, the model weights are taken directly from the previous experiment. Because weight initialization introduces accuracy fluctuations, reusing the previously trained weights gives a fair comparison. Comparing the accuracy of the individual models with the voting accuracy, we notice that voting improves the accuracy by 2.4% for the Jamendo dataset and by 0.7% for the FMA dataset. The improvement from voting therefore depends on the dataset: the accuracy for the FMA dataset appears to be more difficult to improve, even with the voting type of stacked ensembles.
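A voting ensemble of this kind can be sketched in a few lines. The sketch assumes hard majority voting over binary vocal/non-vocal decisions with an odd number of models; the exact voting rule is not spelled out in this section, and the function names and toy predictions are hypothetical.

```python
def majority_vote(decisions):
    """Combine one clip's 0/1 decisions (1 = vocal) from several models."""
    votes_for_vocal = sum(decisions)
    return 1 if 2 * votes_for_vocal > len(decisions) else 0


def vote_over_clips(per_model_predictions):
    """per_model_predictions: one list of 0/1 labels per model, equal lengths.

    Returns one combined label per clip.
    """
    return [
        majority_vote(list(clip_votes))
        for clip_votes in zip(*per_model_predictions)
    ]


# Toy predictions of five already-trained models on three clips.
five_model_predictions = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
combined = vote_over_clips(five_model_predictions)
```

Because the combination step only counts votes, its cost is negligible next to running the models themselves, and no additional training is needed once the individual models are available.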
Experimental results for post classifier
We also follow similar experimental steps for the post classifier method. To train the post classifier, we tried two approaches. Approach one is to reuse the training dataset to train the post classifier. The advantage of this approach is that no additional training samples are needed to train the post classifier. However, the downside is that the trained classifiers are extremely accurate when classifying the previously seen training samples, typically around 99%. This leaves the post classifier little room to judge which classifier is more reliable for a certain type of data when making the final decision.
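The idea of a post classifier can be sketched as a second-stage model whose inputs are the outputs of the base models. The sketch below uses a small logistic regression trained by gradient descent as a stand-in; the paper does not state here which classifier serves as the post classifier, and the function names, hyper-parameters, and toy scores are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_post_classifier(meta_features, labels, epochs=500, lr=0.5):
    """Fit a logistic regression on base-model outputs.

    meta_features: one row per clip, one column per base model
                   (each entry is that model's vocal score in [0, 1]).
    labels:        ground-truth 0/1 label per clip.
    """
    weights = [0.0] * len(meta_features[0])
    bias = 0.0
    for _ in range(epochs):
        for row, y in zip(meta_features, labels):
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * x for w, x in zip(weights, row)]
            bias -= lr * err
    return weights, bias


def post_classify(weights, bias, row):
    """Final vocal (1) / non-vocal (0) decision for one clip."""
    score = sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
    return 1 if score >= 0.5 else 0


# Toy scores from three base models; the third model is made unreliable.
scores = [
    [0.9, 0.8, 0.2],
    [0.8, 0.7, 0.9],
    [0.1, 0.2, 0.8],
    [0.2, 0.1, 0.1],
    [0.7, 0.9, 0.4],
    [0.3, 0.2, 0.6],
]
truth = [1, 1, 0, 0, 1, 0]
weights, bias = train_post_classifier(scores, truth)
decisions = [post_classify(weights, bias, row) for row in scores]
```

The downside described above shows up directly in such a setup: if every base model is about 99% accurate on the clips used to fit the post classifier, the columns of `meta_features` are nearly identical, and the second stage has little evidence for weighting one model over another.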
Table: Accuracy for voting and post-classifier types of stacked ensembles. Columns: FMA dataset (%). Rows: voting, 5 models; voting, 3 models; post classifier, 5 models, approach 1; post classifier, 3 models, approach 1; post classifier, 5 models, approach 2; post classifier, 3 models, approach 2
Experimental results for fusion
In this paper, we study six different models, namely MCNN, SCNN, ECNN, SLSTM, CLSTM, and capsule net, for singing voice detection. In addition, we also study voting, post classifier, and fusion types of stacked ensembles. The simulation results show that the end-to-end approach (ECNN) has lower accuracy than other models presented in this paper, possibly due to insufficient training samples. Among the network models, the SCNN yields the best accuracy. In addition, a simple voting based on the decisions of five models can increase the accuracy up to 2.4% for the Jamendo dataset. In conclusion, if the accuracy is the top concern, then using multiple network structures with voting is a promising method. In the present study, networks involved in voting are heterogeneous. In the future, we plan to study whether higher accuracy could be achieved by voting with multiple identical network structures, where each structure is trained with either different frequency resolution or a subset of the entire training dataset.
All of the authors supervised the experiments and analyzed the results. SDY initiated the study and wrote the draft version of the manuscript, and the other two authors revised the manuscript. All authors read and approved the final manuscript.
The authors would like to thank Mr. Chih-Chun Liu and Mr. Ren-Jie Liu for conducting the experiments.
A portion of the materials in this paper was presented at the 7th IEEE International Symposium on Next-Generation Electronics (ISNE 2018), Taipei, Taiwan, May 7–9, 2018.
The authors declare that they have no competing interests.
Availability of data and materials
The research was supported in part by the Ministry of Science and Technology (MOST) of Taiwan through Grants MOST 106-2221-E-027-127 and MOST 107-2221-E-027-100.
- You SD, Wu Y-C, Peng S-H (2016) Comparative study of singing voice detection methods. Multimedia Tools Appl 75(23):15509–15524
- Hsu C-L, Wang D, Jang JSR, Hu K (2012) A tandem algorithm for singing pitch extraction and voice separation from music accompaniment. IEEE Trans Audio Speech Lang Process 20(5):1482–1491
- Logan B, Chu S (2000) Music summarization using key phrases. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing, 2000
- Salamon J, Gómez E, Ellis DP, Richard G (2014) Melody extraction from polyphonic music signals: approaches, applications, and challenges. IEEE Signal Process Mag 31(2):118–134
- Kim YE, Whitman B (2002) Singer identification in popular music recordings using voice coding features. In: Proceedings of the 3rd international conference on music information retrieval, 2002
- Berenzweig AL, Ellis DP (2001) Locating singing voice segments within music signals. In: IEEE workshop on the applications of signal processing to audio and acoustics, 2001
- Lukashevich H, Gruhne M, Dittmar C (2007) Effective singing voice detection in popular music using ARMA filtering. In: Workshop on Digital Audio Effects (DAFx’07), 2007
- Song Y, Kim I (2018) DeepAct: a deep neural network model for activity detection in untrimmed videos. J Inform Process Syst 14(1):150–161. https://doi.org/10.3745/JIPS.04.0059
- Yu N, Yu Z, Gu F, Li T, Tian X, Pan Y (2017) Deep learning in genomic and medical image data analysis: challenges and approaches. J Inform Process Syst 13(2):204–214. https://doi.org/10.3745/JIPS.04.0029
- Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems, 2012
- Koo KM, Cha EY (2017) Image recognition performance enhancements using image normalization. Human-centric Comput Inform Sci 7(1):33
- Davis SB, Mermelstein P (1980) Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans Acoust Speech Signal Process 28(4):357–366
- Dieleman S, Schrauwen B (2014) End-to-end learning for music audio. In: IEEE international conference on acoustics, speech and signal processing, 2014
- Dai J, Liang S, Xue W, Ni C, Liu W (2016) Long short-term memory recurrent neural network based segment features for music genre classification. In: 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2016
- Shi X, Chen Z, Wang H, Yeung DY, Wong WK, Woo WC (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in neural information processing systems, 2015
- Sabour S, Frosst N, Hinton GE (2017) Dynamic routing between capsules. In: Advances in neural information processing systems, 2017
- Kim HG, Sikora T (2004) Comparison of MPEG-7 audio spectrum projection features and MFCC applied to speaker recognition, sound classification and audio segmentation. In: IEEE international conference on acoustics, speech, and signal processing, 2004
- Rocamora M, Herrera P (2007) Comparing audio descriptors for singing voice detection in music audio files. In: 11th Brazilian symposium on computer music, San Pablo, Brazil, 2007
- Dittmar C, Lehner B, Prätzlich T, Müller M, Widmer G (2015) Cross-version singing voice detection in classical opera recordings. In: International society for music information retrieval conference (ISMIR), Malaga, Spain, 2015
- Vembu S, Baumann S (2005) Separation of vocals from polyphonic audio recordings. In: 6th international conference on music information retrieval (ISMIR 2005), London, 2005
- Nwe TL, Shenoy A, Wang Y (2004) Singing voice detection in popular music. In: Proceedings of the 12th annual ACM international conference on Multimedia, 2004
- Leglaive S, Hennequin R, Badeau R (2015) Singing voice detection with deep recurrent neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015
- Schlüter J, Grill T (2015) Exploring data augmentation for improved singing voice detection with neural networks. In: Proc. of the 16th International Society for Music Information Retrieval Conference, 2015
- Humphrey EJ, Bello JP, LeCun Y (2012) Moving beyond feature design: deep architectures and automatic feature learning in music informatics. In: Proceedings of the 13th International Society for Music Information Retrieval Conference, Porto, Portugal, 2012
- Lee J, Park J, Kim KL, Nam J (2018) SampleCNN: end-to-end deep convolutional neural networks using very small filters for music classification. Appl Sci 8(1):150
- Lim M, Lee D, Park H, Kang Y, Oh J, Park JS, Kim JH (2018) Convolutional neural network based audio event classification. KSII Trans Internet Inform Syst 12(6):2748–2760
- Huang HM, Chen WK, Liu CH, You SD (2018) Singing voice detection based on convolutional neural networks. In: 2018 7th international symposium on next generation electronics, Taipei, 2018
- Wu YC, Chang PC, Wang CY, Wang JC (2017) A symmetric kernel convolutional neural network for acoustic scenes classification. In: IEEE international symposium on consumer electronics, Kuala Lumpur, Malaysia, 2017
- Available: https://en.wikipedia.org/wiki/Spectrogram. Accessed 6 Sept 2018
- Ramona M, Richard G, David B (2008) Vocal detection in music with support vector machines. In: IEEE international conference on acoustics, speech and signal processing, 2008
- Defferrard M, Benzi K, Vandergheynst P, Bresson X (2017) FMA: a dataset for music analysis. In: 18th international society for music information retrieval conference, 2017
- Available: https://github.com/NTUT-LabASPL/FMA-C-DataSet-for-Vocal-Detection. Accessed 18 Oct 2018
- Available: https://www.tensorflow.org/. Accessed 6 Sept 2018
- Available: https://keras.io/. Accessed 6 Sept 2018
- Zeiler MD (2012) ADADELTA: an adaptive learning rate method. In: arXiv preprint arXiv:1212.5701
- Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958