Image recognition performance enhancements using image normalization
© The Author(s) 2017
Received: 25 May 2017
Accepted: 29 August 2017
Published: 19 November 2017
When recognizing a specific object in an image captured by a camera, we extract local descriptors to compare it with or try direct comparison of images through learning methods using convolutional neural networks. The more the number of objects with many features, the greater the number of images used in learning, the easier it is to compare features. It also makes it easier to detect if the image contains the feature, thus helping generate accurate recognition results. However, there are limitations in improving the recognition performance when the feature of the object to be recognized in the image is significantly smaller than the background area or when the area of the image to be learned is insufficient. In this paper, we propose a method to enhance the image recognition performance through feature extraction and image normalization called the preprocessing process, especially useful for electronic objects with few distinct recognition characteristics due to functional/material specificity.
As the performance of mobile devices improves and the number of functions included in the devices increases, technologies that can be implemented using these devices are also diversifying. Especially, the technology utilizing cameras has been expanding its application field, ranging from augmented reality to recognizing wine labels, book covers, packaged goods, and similar clothes employing a form of style search.
When recognizing a specific object in an image captured by a camera, it is possible to compare the existing indexed content with a local descriptor that can extract the same feature repeatedly without being affected by the size change and the shooting angle. For instance, features such as SIFT [1, 2], SURF , BRIEF , ORB , MSER [6, 7] or the image of the region (or object) estimated by a saliency map  or selective search  are learned and recognized using the convolutional neural networks (CNN). The more the number of extractable features in the region (or object), the easier it is to compare the presence or absence of each feature and to deliver the accurate recognition result.
In the case of printed photographs, printed book covers, and industrial packaging materials, there are many features that are easy to extract from the local descriptors such as the image itself, the logo using various colors and patterns, and the packaging design, so that a relatively accurate recognition result can be obtained. However, for consumer goods such as TV, refrigerator, washing machine, air conditioner, etc., it is difficult to extract the local descriptors that can be used to easily compare the characteristics because of the functional (TVs, monitors, etc., whose main purpose is screen output, do not have any features on the screen. Once they are turned on, other features not related to the object to be recognized may interfere with the recognition) and material (the surfaces of refrigerators and air conditioners may be coated with light decoration) specificity.
Recently, image recognition by deep running has been getting popular. While it is true that are producing effective for some images, but they depend on the settings of a number of learning parameters in complex, nonlinear ways. Selecting good parameters is critical to the performance of the learning algorithm, but it is largely a black art [10–13].
In this paper, we propose a technique to improve the recognition performance using the preprocessing process that detects the distinguishable features of each product and normalizes them, with the aim of recognizing the manufacturer and the product name of the electronic product. In “Extraction of features”, we describe the feature extraction method and normalization method for each product. In “Convolutional neural networks”, we describe the method of neural network construction for normalized image recognition. The experimental results are described in “Results” and conclusions and future research plans are described in “Conclusions”.
Extraction of features
As mentioned earlier, electronics are limited in their possibility to extract comparable features because of their functional/material specificity. Another limitation is that the filtering/pooling layer results for the entire image are obtained only in the non-electronic components during the CNN learning process. Thus, the process of normalization of images for recognition is different for each different type of the electronic product. This process includes a preprocessing process for extracting features, a process for extracting additional individual features that facilitate recognition of the electronic product, and a process for normalizing the extracted image features.
In the preprocessing step, preparations are made to extract features that are easy to recognize in each image.
We can also check if the refrigerator has a built-in dispenser using an edge distribution of the top-left portion. This process is covered in more detail in the next chapter.
For the two aforementioned categories of electronics, the preprocessed image is used for learning, and the size is normalized only after extracting the image outlines, a process used for learning to recognize the remaining categories of objects.
Extraction of additional features
Some electronic objects look exactly the same with only one or two options being different. Sometimes, even different manufacturers need to introduce different features to distinguish their products from the competition. In this paper, we discuss how to extract features that can provide additional information on recognition results in further detail.
In case of refrigerators, the recognition of the model should not be affected by the presence of a water purifier (or ice dispenser). Hence, we attach a different label to the top-left edge enhanced image when we perform the learning.
Convolutional neural networks
In this paper, we use a fine-tuned CaffeNet  model for recognition of normalized images. The CaffeNet model is based on 1.2 million high-quality images classified into 1000 categories during the Large Scale Visual Recognition Challenge (LSVRC)-2010 competition. It has updated its existing record with 37.5 and 17.0% error rates in the top-1 and top-5 categories respectively.
The model structure has the following characteristics: first, the gradient vanishing problem is solved by using a rectified linear unit (ReLU) nonlinearity activation function with non-saturating characteristics and a learning speed that is faster than the activation function of the existing saturating nonlinearity characteristic.
Local reaction normalization is performed after ReLU nonlinearity. This is the brightness normalization process that is affected by the actual neurons, thereby reducing the error rates of 1.4 and 1.2% in the top-1 and top-5 categories respectively.
When the pooling size is z and the interval between the pooling units is s pixels, with 0.4 and 0.3% error rates in top-1 and top-5 categories respectively overlapping pooling is performed with the knowledge that s < z.
Using two GPUs reduces the error rates by 1.7 and 1.2% in top-1 and top-5 categories, respectively, when compared to usage of one GPU.
To solve the over fitting problem, set the result value of any hidden neuron to 0 so that it does not affect the learning. We use the dropout method in which all structures share the weights while learning the models of different structures every time. Combine the arbitrary partial neurons to learn more robust and useful features.
The fine tuning process transforms the architecture for a new purpose based on the previously learned model, and updates the weights of the learning based on the previously learned model weights. We tuned into more than 5800 pre-processed electronic object image-sets to recognize 55 home appliances instead of object category recognition through the BVLC CaffeNet model. The CaffeNet model works well for object classification and we want to use it to recognize our electronic objects in detail.
We have more than 5800 pre-processed images to learn and have begun fine-tuning with the parameters learnt from 1,000,000 image-net images. If we provide the weights argument to the Caffe train command, the previously learned weights melt into our model, and the layers will match by name. In other words, a new data classifier will be created based on previously learned models. We changed the last layer’s name of the existing CaffeNet model from fc8 to fc8_television, fc8_ refrigerator, and so on. Since there is no layer name in the existing bvlc_reference_ caffenet layer, this layer starts learning with random weights. We have created new models for all the eight categories of home appliances using fine-tuning. The results are discussed in the following section.
In this paper, 55 kinds of home appliances preprocessed by the proposed method were recognized. In this section, we describe the learning set used in the test, and evaluate the proposed algorithm by comparing the recognition results of the original image, the cropped image of the recognition target part, and the preprocessed recognition target image.
Full datasets—8 categories, 55 kinds of home appliances
We use preprocessing especially for television and refrigerator as we described. It related to recognition accuracy, of course the better normalization will make the better the recognition rate.
The number of images used for learning and recognition performance are described for each category
# of images
Stand-type air conditioner
Wall-mounted air conditioner
Robotic vacuum cleaner
In this paper, we have discussed the importance of preprocessing and evaluated the improvement in recognition performance when applying deep learning to the recognition of home appliances. Convolutional neural networks is a model that is optimized for vision while minimizing the complexity of the model based on three ideas: sparse weight, tied weight, and equivariant representation. The process can recognize many objects with its complex capabilities. Many types of improved techniques are being introduced routinely and will continue to be introduced.
However, most techniques do not take rotation invariance into account, for which a large amount of well-formed datasets are required, or unnecessary information has to be manually excluded from the learning data, which is not an ideal algorithm that can easily be applied to all areas. It is more desirable to specify the problem using human intelligence and the computer is supposed to do the work to help it. Therefore, it is necessary to continue the process of extracting and recognizing various features that cannot be extracted by the convolutional neural networks.
Future work on this topic could include the exploration of extracting meaningful features not only from visual images but also based on material and atmosphere, especially in the field of fashion, to generate a model with enhanced performance.
KMK and EYC designed the study, developed the study and the methodology, collected the data, performed the analysis, and wrote the manuscript together. Both authors read and approved the final manuscript.
This work was supported by a 2-Year Research Grant of Pusan National University.
The authors declare that they have no competing interests.
Availability of data and materials
Consent for publication
Ethics approval and consent to participate
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Lowe DG (1990) Object recognition from local scale-invariant features. In: The proceedings of the seventh IEEE international conference on, vol 2. IEEE, New York, pp 1150–1157Google Scholar
- Lindeberg Ty (1994) Scale-space theory: a basic tool for analysing structures at different scales. J Appl Stat 21(2):224–270Google Scholar
- Bay H, Tuytelaars T, Van Gool L (2006) Surf: speeded up robust features. In: Computer vision–ECCV 2006. pp 404–417Google Scholar
- Calonder M, Lepetit V, Strecha C, Fua P (2010) BRIEF—binary robust independent elementary features. In: Computer vision–ECCV 2010. pp 778–792Google Scholar
- Rublee E, Rabaud V, Konolige K, Bradski G (2011) ORB—an efficient alternative to SIFT or SURF. In: 2011 IEEE international conference on computer vision (ICCV). IEEE, New York, pp 2564–2571Google Scholar
- Donoser M, Horst B (2006) Efficient maximally stable extremal region (MSER) tracking. In: Proc. of IEEE computer society conference on computer vision and pattern recognition (CVPR). pp 553–560Google Scholar
- Forssen P-E, Lowe DG (2007) Shape descriptors for maximally stable extremal regions. IEEE 11th international conference on computer vision, 2007. ICCV 2007. pp 1–8Google Scholar
- Courty N, Marchand E (2003) Visual perception based on salient features. In: Proceedings. International conference on 2003 IEEE/RSJ, vol 1. IEEE, New York, pp 1024–1029Google Scholar
- Uijlings JRR, van de Sande KE, Gevers T, Smeulders AWM (2013) Selective search for object recognition. Int J Comput Vis 104(2):154–171View ArticleGoogle Scholar
- Maninis KK, Pont-Tuset J, Arbeláez P, Van Gool L (2016) Deep retinal image understanding. In: International conference on medical image computing and computer assisted intervention. Springer International Publishing, Berlin, pp 140–148Google Scholar
- Cichy RM, Khosla A, Pantazis D, Torralba A, Oliva A (2016) Deep neural networks predict hierarchical spatio-temporal cortical dynamics of human visual object recognition. arXiv preprint arXiv:1601.02970
- Zufferey D, Gisler C, Khaled OA, Hennebert J (2012) Machine learning approaches for electric appliance classification. In: 2012 11th international conference on information science, signal processing and their applications (ISSPA). IEEE, New York, pp 740–745Google Scholar
- Gajowniczek K, Ząbkowski T (2015) Data mining techniques for detecting household characteristics based on smart meter data. Energies 8:7407–7427View ArticleGoogle Scholar
- Canny J (1986) A computational approach to edge detection. IEEE Trans Pattern Anal Mach Intell 8(6):679–698View ArticleGoogle Scholar
- Fitzgibbon AW, Fisher RB (1995) A buyer’s guide to conic fitting. In: Proc. 5th British machine vision conference, Birmingham, pp 513–522Google Scholar
- Thomas W, Klein JC (2001) Segmentation of color fundus images of the human retina—detection of the optic disc and the vascular tree using morphological techniques. In: International symposium on medical data analysis. Springer, Berlin, pp 282–287Google Scholar
- Sari S, Asahrori SE, Roslan H, Ibrahim N (2015) Gabor edge detection method based on bilateral filter and otsu threshold for noisy ultrasound image. In: Proceedings of recent advances in mathematical and computational methods. pp 88–95Google Scholar
- Koo KM, Cha EY (2013) A novel container ISO-code recognition method using texture clustering with a spatial structure window. Int J Softw Eng Appl 7(3):51–61Google Scholar
- Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25:1097–1105Google Scholar