What is different about deep learning in image recognition?
Deep learning is one of the most important breakthroughs in artificial intelligence of the past decade. It has achieved great success in speech recognition, natural language processing, computer vision, image and video analysis, multimedia, and other applications. Existing deep learning models are neural networks. Neural networks date back to the 1940s and were popular in the 1980s and 1990s; they attempt to solve various machine learning problems by simulating the brain's cognitive mechanisms. In 1986, Rumelhart, Hinton, and Williams published the famous back-propagation algorithm for training neural networks in Nature, and the algorithm is still widely used today.
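As a concrete illustration of back-propagation, here is a minimal NumPy sketch that trains a one-hidden-layer network on a toy task; the network size, learning rate, and data are illustrative assumptions, not anything from the 1986 paper.

```python
import numpy as np

# Toy one-hidden-layer network trained with back-propagation.
# Sizes, learning rate, and data are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))                         # 64 samples, 2 features
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]   # XOR-like labels

W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(2000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error layer by layer
    dp = (p - y) / len(X)                  # gradient of cross-entropy w.r.t. logits
    dW2 = h.T @ dp; db2 = dp.sum(0)
    dh = (dp @ W2.T) * (1 - h ** 2)        # back through the tanh nonlinearity
    dW1 = X.T @ dh; db1 = dh.sum(0)
    # Gradient-descent update
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print("training accuracy:", ((p > 0.5) == y).mean())
```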
What are the key differences between deep learning and other machine learning methods, and why has it succeeded in so many areas?
Feature learning
The biggest difference between deep learning and traditional pattern recognition methods is that its features are learned automatically from big data rather than designed by hand. Good features can improve the performance of a pattern recognition system. Over the past few decades, hand-designed features dominated the various applications of pattern recognition. Manual design depends mainly on the designer's prior knowledge and has difficulty exploiting big data; because the parameters must be tuned by hand, the number of parameters allowed in a hand-designed feature is very limited. Deep learning automatically learns feature representations from big data, and these representations can contain thousands or even millions of parameters.
Designing an effective feature by hand can take five to ten years, whereas deep learning can quickly learn new, effective feature representations from training data for a new application.
A pattern recognition system consists of features and a classifier. In traditional methods, features and the classifier are optimized separately. In the neural network framework, the feature representation and the classifier are optimized jointly, maximizing their combined performance.
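To make the contrast concrete, the sketch below (assuming PyTorch and a hypothetical image-classification setup; all sizes are illustrative) trains a feature extractor and a classifier end to end, so a single loss gradient updates both jointly rather than fixing the features first.

```python
import torch
import torch.nn as nn

# Feature extractor and classifier as one network: back-propagating a
# single loss optimizes both jointly, unlike the traditional pipeline
# where features are fixed before the classifier is trained.
features = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
classifier = nn.Linear(32, 10)
model = nn.Sequential(features, classifier)

opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 3, 32, 32)          # dummy batch
y = torch.randint(0, 10, (8,))
loss = loss_fn(model(x), y)
opt.zero_grad()
loss.backward()                        # gradients flow into both parts
opt.step()
```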
The feature representation of the convolutional network model used by Hinton's group in the 2012 ImageNet competition contained 60 million parameters learned from millions of samples. Features learned from ImageNet have very strong generalization ability and can be applied successfully to other datasets and tasks, such as object detection, tracking, and retrieval. Another famous competition in computer vision is PASCAL VOC, but its training set is small and unsuitable for training deep learning models. When researchers used features learned from ImageNet for object detection on PASCAL VOC, the detection rate increased by 20%.
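This kind of transfer can be sketched as follows, assuming a recent torchvision and its ImageNet-pretrained ResNet-18 as a stand-in for the 2012 model (the 20-class target task is hypothetical): the pretrained backbone is frozen as a feature extractor and only a new head is trained.

```python
import torch
import torch.nn as nn
from torchvision import models

# Reuse ImageNet-learned features for a new task: freeze the pretrained
# backbone and train only a new classification head. The 20-class
# target task and the dummy batch are assumptions for illustration.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in backbone.parameters():
    p.requires_grad = False                   # keep ImageNet features fixed
backbone.fc = nn.Linear(backbone.fc.in_features, 20)   # new task head

opt = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
x = torch.randn(4, 3, 224, 224)               # dummy target-domain batch
y = torch.randint(0, 20, (4,))
loss = nn.functional.cross_entropy(backbone(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```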
Since feature learning is so important, what makes a good feature? In an image, complex factors are usually combined in a non-linear way. A face image, for example, entangles identity, pose, age, expression, illumination, and other information. The key to deep learning is that its multi-layer non-linear mappings successfully disentangle these factors: in the last hidden layer of a deep model, different neurons come to represent different factors. If this hidden layer is taken as the feature representation, face recognition, pose estimation, and age estimation all become very simple, because each factor reduces to a simple linear relationship and the factors no longer interfere with one another.
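The claim that each factor becomes linearly separable at the top hidden layer can be tested with linear probes: fit one plain linear classifier per factor on the hidden-layer activations. A minimal scikit-learn sketch, where the feature matrix and labels are hypothetical placeholders for real activations and annotations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Linear probes: if the top hidden layer disentangles the factors, a
# simple linear classifier per factor should suffice. `feats` stands in
# for last-hidden-layer activations; labels are hypothetical.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 512))     # placeholder for real activations
identity = rng.integers(0, 50, 1000)     # hypothetical identity labels
pose = rng.integers(0, 5, 1000)          # hypothetical pose labels

for name, labels in [("identity", identity), ("pose", pose)]:
    probe = LogisticRegression(max_iter=1000).fit(feats, labels)
    print(name, "probe accuracy:", probe.score(feats, labels))
```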
The advantages of a deep structure
The "deep" word of the deep learning model means that the structure of the neural network is deep and consists of many layers. Other common machine learning models such as support vector machine and Boosting are shallow structures. The three-layer neural network model (including the input layer, the output layer, and a hidden layer) can approximate any classification function. If so, why do you need a deep model?
Research shows that for a given task, if the model is not deep enough, the number of computational units it requires grows exponentially. In other words, although a shallow model can express the same classification function, it needs many more parameters and training samples. A shallow model provides a local representation: it divides the high-dimensional image space into several local regions, and each local region stores at least one template obtained from the training data. The shallow model matches a test sample against these templates one by one and predicts its category from the matching results. In a support vector machine, for example, the templates are the support vectors; in a nearest-neighbor classifier, the templates are all the training samples. As the complexity of the classification problem increases, the image space must be divided into more and more local regions, requiring more and more parameters and training samples. Although many deep models already have a large number of parameters, a shallow network achieving the same data-fitting effect would need several orders of magnitude more parameters, which is difficult to achieve in practice.
The key to the deep model's parameter efficiency is the reuse of computational units in the middle layers. Take face recognition as an example: deep learning builds a hierarchical feature representation of the face image. The bottom layer learns filters from raw pixels that characterize local edges and textures; the middle layers combine these edge filters to describe different facial parts; and the top layer describes the global features of the entire face.
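This hierarchy can be read directly off a convolutional architecture. In the illustrative PyTorch sketch below, the layer roles are annotated per the description above; channel counts and depths are assumptions, not a published face model.

```python
import torch.nn as nn

# A small convolutional net annotated with the hierarchy described
# above; channel counts and depths are illustrative assumptions.
face_net = nn.Sequential(
    # Bottom: filters learned from raw pixels capture edges and textures
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    # Middle: combinations of edge filters respond to facial parts
    # (eyes, nose, mouth); the same units are reused across the image
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    # Top: large receptive fields describe the global face
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(128, 256), nn.ReLU(),
)
```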
Deep learning provides a distributed feature representation. At the highest hidden layer, each neuron acts as an attribute classifier, for example for gender, ethnicity, or hair color. Each neuron divides the image space in two, so a combination of N neurons can express 2^N local regions, whereas expressing these regions with a shallow model would require at least 2^N templates. The deep model thus has stronger expressive power and higher efficiency.
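The 2^N claim can be checked numerically: N binary neurons (hyperplanes) assign each input an N-bit sign pattern, and the number of distinct patterns grows far faster than N. A small NumPy sketch, with the dimensions chosen as convenient assumptions:

```python
import numpy as np

# N binary neurons carve the input space into up to 2^N cells: each
# input gets an N-bit code recording which side of each hyperplane
# it falls on. Dimensions are illustrative assumptions.
rng = np.random.default_rng(0)
N, dim = 12, 12
W = rng.normal(size=(dim, N))           # N random hyperplanes
x = rng.normal(size=(100_000, dim))     # many random inputs
codes = (x @ W > 0)                     # N-bit code per input
distinct = len(np.unique(codes, axis=0))
print(f"{N} neurons produced {distinct} distinct regions (max {2**N})")
```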
Ability to extract global features and contextual information
Deep models have powerful learning ability and efficient feature representation. More importantly, they extract information layer by layer, from pixel-level raw data up to abstract semantic concepts, which gives them a prominent advantage in extracting the global features and contextual information of an image. These advantages bring new ideas to traditional computer vision problems such as image segmentation and key-point detection.
Take the segmentation of a face image as an example. To predict which facial part (eye, nose, mouth) each pixel belongs to, common practice is to take a small region around the pixel, extract texture features from it (e.g., a local binary pattern), and then classify it with a shallow model such as a support vector machine. Because a local region contains only a limited amount of information, classification errors are frequent, so smoothing and shape-prior constraints must be imposed on the segmented image afterwards.
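A sketch of that traditional pipeline, assuming scikit-image's local_binary_pattern and scikit-learn's SVC; the patches and part labels are hypothetical stand-ins for regions cropped around real pixels.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

# Traditional pipeline: describe the small patch around each pixel with
# a local-binary-pattern histogram, then classify the center pixel with
# a shallow model (an SVM). Patches and labels are hypothetical.
def lbp_histogram(patch, P=8, R=1.0):
    lbp = local_binary_pattern(patch, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

rng = np.random.default_rng(0)
patches = rng.integers(0, 256, (500, 15, 15), dtype=np.uint8)  # toy patches
labels = rng.integers(0, 3, 500)        # 0=eye, 1=nose, 2=mouth (toy)

X = np.stack([lbp_histogram(p) for p in patches])
svm = SVC(kernel="rbf").fit(X, labels)
print("toy training accuracy:", svm.score(X, labels))
```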
Even under partial occlusion, the human eye can estimate the labels of the occluded portion from the information in other regions of the face. This shows that global and contextual information is crucial for local judgments, yet in methods based on local features this information is lost from the very beginning. Ideally, the model should take the entire image as input and directly predict the entire segmentation map. Image segmentation can then be viewed as a high-dimensional data-conversion problem: the model not only uses contextual information, it also implicitly incorporates shape priors during the conversion. Because the content of a whole image is too complex, however, shallow models have difficulty capturing global features effectively. The emergence of deep learning has made this possible, with success in face segmentation, human-body segmentation, face image registration, human pose estimation, and other tasks.
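A minimal sketch of treating segmentation as whole-image-to-whole-map conversion: a tiny fully convolutional network whose encoder shrinks the image so top features see global context, then upsamples back to a per-pixel prediction. The architecture and the four part labels are illustrative assumptions, not a published model.

```python
import torch
import torch.nn as nn

# Whole image in, whole label map out: the encoder shrinks the image so
# top features see global context, and the decoder upsamples back to a
# per-pixel prediction. Sizes and the 4 classes are assumptions.
seg_net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
    nn.Conv2d(64, 4, 1),                # 4 part labels per pixel
)
image = torch.randn(1, 3, 128, 128)
logits = seg_net(image)                 # (1, 4, 128, 128)
print(logits.shape)
```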
Joint deep learning
Some computer vision researchers regard the deep learning model as a black box, but this view is incomplete. Traditional computer vision systems and deep learning models are closely connected, and this connection can be used to propose new deep models and training methods. Joint deep learning for pedestrian detection is a successful example. A computer vision system contains several key components; a pedestrian detector, for example, includes modules for feature extraction, part detectors, modeling of the geometric deformation of parts, inference of part occlusion, and classification. In joint deep learning, a correspondence can be established between each layer of the deep model and each module of the vision system. If a key module of the vision system has no corresponding layer in existing deep models, this can inspire a new deep model. For example, many object detection studies have shown that modeling the geometric deformation of object parts effectively improves the detection rate, but commonly used deep models had no corresponding layer, so joint deep learning and subsequent work proposed new deformation and deformation-pooling layers to provide this function.
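Following the description above, such a layer can be sketched as follows. This is a simplified reading of the joint-deep-learning idea, not the published design: each part's detection map is penalized by a learned quadratic cost for deviating from the part's anchor position, and the best-scoring location is pooled out. Map sizes, the penalty form, and the module interface are assumptions; it also assumes a recent PyTorch (for torch.meshgrid with indexing="ij").

```python
import torch
import torch.nn as nn

class DeformationPooling(nn.Module):
    """Simplified deformation + pooling layer (a sketch inspired by joint
    deep learning for pedestrian detection): each part detection map is
    penalized for deviating from the part's anchor, then max-pooled."""
    def __init__(self, n_parts, h, w):
        super().__init__()
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        self.register_buffer("ys", ys.float())
        self.register_buffer("xs", xs.float())
        scale = torch.tensor([h, w], dtype=torch.float)
        self.anchor = nn.Parameter(torch.rand(n_parts, 2) * scale)
        self.cost = nn.Parameter(torch.ones(n_parts))   # deformation weight

    def forward(self, part_maps):           # (batch, n_parts, h, w)
        ay = self.anchor[:, 0].view(1, -1, 1, 1)
        ax = self.anchor[:, 1].view(1, -1, 1, 1)
        # Quadratic penalty for placing a part far from its anchor
        penalty = self.cost.view(1, -1, 1, 1) * (
            (self.ys - ay) ** 2 + (self.xs - ax) ** 2)
        scores = part_maps - penalty
        return scores.flatten(2).max(dim=2).values      # best location per part

pool = DeformationPooling(n_parts=5, h=16, w=16)
print(pool(torch.randn(2, 5, 16, 16)).shape)            # (2, 5)
```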
From a training perspective, the modules of a computer vision system are trained separately or designed by hand, and in the pre-training stage of a deep model the layers are likewise trained one by one. If a correspondence is established between the computer vision system and the deep model, the experience accumulated in vision research can guide the pre-training of the deep model, and the pre-trained model can already achieve results comparable to a traditional computer vision system. On this basis, deep learning uses back-propagation to jointly optimize all layers, so that their cooperation reaches an optimum and the performance of the whole network improves greatly.
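A schematic of that two-stage recipe, with every component a placeholder: stage 1 initializes each module separately (here only indicated by a commented-out weight load from a hypothetical file), and stage 2 fine-tunes everything jointly by back-propagation.

```python
import torch
import torch.nn as nn

# Two-stage training: modules are first trained or initialized
# separately (stage 1), then all layers are fine-tuned jointly by
# back-propagation (stage 2). Modules, data, and the weight file
# are placeholders.
feature_module = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                               nn.AdaptiveAvgPool2d(1), nn.Flatten())
classifier_module = nn.Linear(16, 2)
model = nn.Sequential(feature_module, classifier_module)

# Stage 1 (hypothetical): load weights produced by separate pre-training
# feature_module.load_state_dict(torch.load("pretrained_features.pt"))

# Stage 2: joint fine-tuning of every layer, typically at a small rate
opt = torch.optim.SGD(model.parameters(), lr=1e-4)
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```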