Remote sensing is a technique that uses sensors on satellites, aircraft or other platforms to collect the radiation information of targets, from which specific information about them can be derived. In recent years, with the rapid development of remote sensing technology, the capacity to acquire remote sensing data has grown steadily. Meanwhile, the spectral, spatial and temporal resolution of remote sensing imagery has kept improving, providing a solid data basis for remote sensing applications. Although imagery of ever higher quality can be acquired, in practice the application of remote sensing imagery still relies heavily on manual processing, with machine interpretation serving only as an aid to manual work. Traditionally, machine interpretation of remote sensing imagery has been achieved through statistical methods such as maximum likelihood classification and K-means clustering, which are based on image features such as spectrum and texture. In the past few years, methods including artificial neural networks, support vector machines, genetic algorithms and object-oriented methods have developed rapidly and achieved some success. However, all these methods generally require manual extraction of image features or manual design of interpretation rules, which leads to long design cycles and limits the potential for algorithm improvement. Moreover, the accuracy and efficiency of automatic interpretation of remote sensing imagery cannot meet the needs of most applications. Because remote sensing applications depend so heavily on manual work, their effectiveness is severely restricted by the experience and expertise of the operator.
Deep learning is an important area of machine learning research. In contrast to traditional machine learning, deep learning is a representation-learning method with multiple layers. Abstraction and extraction of data from lower layers to higher layers are accomplished through simple nonlinear modules. Current deep learning usually uses a deep neural network (DNN), a stack of simple nonlinear modules, to construct the layers. Input data is passed from layer to layer, and the mapping at each layer reduces the dimension of the data and extracts its key characteristics. Relying on the deep convolutional neural network (DCNN), deep learning provides an end-to-end machine learning model that can automatically extract image features without hand-designed extraction algorithms. Compared with traditional methods, deep learning is completely data-driven and can automatically find the best way to extract image features through learning.
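The idea of a stack of simple nonlinear modules that progressively reduces the dimension of the data can be illustrated with a minimal NumPy sketch. The layer widths (64, 32, 8), the ReLU nonlinearity and the random, untrained weights are assumptions chosen only for illustration; a real network would learn its weights from data.

```python
import numpy as np

def relu(x):
    """Simple nonlinear module: element-wise max(x, 0)."""
    return np.maximum(x, 0.0)

def forward(x, layers):
    """Pass x through a stack of (weight, bias) layers, each followed by ReLU."""
    h = x
    for weight, bias in layers:
        h = relu(h @ weight + bias)
    return h

rng = np.random.default_rng(0)
# Hypothetical layer widths: 64 raw features reduced to 32, then to 8.
sizes = [64, 32, 8]
layers = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(5, 64))    # 5 samples with 64 raw features each
features = forward(x, layers)   # 5 samples with 8 abstract features each
```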
This paper briefly introduces the development of deep learning and analyzes in detail its current applications in remote sensing land cover classification, object detection and change detection. It expounds the main deep learning methods and research progress in these three fields and summarizes the current research work and main models. Finally, the application of deep learning is summarized, the existing problems are pointed out, and the future development direction of deep learning for remote sensing is discussed.
2. Common Deep Learning Methods in Remote Sensing Application
Deep learning in remote sensing applications is mainly used for three tasks, namely land cover classification, object detection and change detection. A review of current research results indicates that the major technical approach is to translate a specific problem into a classification or object detection task, which is then solved with a computer vision deep learning model that has been redesigned and adjusted for the remote sensing target. The main structure is shown in Figure 1.
2.1. Land Cover Classification Methods of Remote Sensing Image
Land cover classification is a major field of remote sensing application. Its main task is to divide the pixels or regions of remote sensing imagery into several categories according to application requirements. Deep learning models for land cover classification are generally based on the deep belief network (DBN), the convolutional neural network (CNN) and the sparse autoencoder (SAE), among which the deep convolutional neural network is currently the most popular approach.
Many early studies used deep CNNs such as AlexNet and VGGNet and achieved some success. However, the AlexNet and VGGNet classification methods essentially transform an image into a corresponding feature vector through convolution, pooling and fully connected layers, and output a single value representing the class of the image based on that vector. Such approaches therefore address classification of the whole image at the image level. Land cover classification, by contrast, is an image segmentation problem: what must be solved is the multi-class labelling of every pixel within a single image, that is, semantic segmentation.
To solve the problem of semantic segmentation and multi-class classification, Long et al. proposed the fully convolutional network (FCN) for semantic segmentation. Starting from a CNN, FCN replaces the fully connected layers with convolutional layers. At the end of the network, FCN introduces a transposed convolution layer that upsamples the feature maps back to the size of the input image, so that a prediction is made for every input pixel and the image is classified pixel by pixel. FCN realizes end-to-end semantic segmentation, but its edge handling and classification accuracy are limited.
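The upsampling performed by a transposed convolution can be sketched in a few lines of NumPy. This is a single-channel illustration under simplifying assumptions: the kernel is fixed to all ones rather than learned as in FCN, and the function name `transposed_conv2d` is hypothetical. Each input value is multiplied by the kernel and added into the output at a stride-spaced position, so a map of size h × w grows to ((h − 1)·stride + k) × ((w − 1)·stride + k).

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Single-channel transposed convolution (no padding).

    Every input value scatters a scaled copy of the kernel into the output.
    """
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + k,
                j * stride:j * stride + k] += x[i, j] * kernel
    return out

coarse = np.array([[1.0, 2.0],
                   [3.0, 4.0]])      # low-resolution feature map
kernel = np.ones((2, 2))             # fixed weights, for illustration only
up = transposed_conv2d(coarse, kernel, stride=2)
# up is 4 x 4; with this kernel and stride each input value fills a 2 x 2 block
```

With an all-ones 2 × 2 kernel and stride 2 the operation reduces to nearest-neighbour upsampling; learned kernels let the network choose a better interpolation.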
Figure 1. Structural diagram of deep learning model.
Building on a further optimized network, Badrinarayanan et al. proposed SegNet. SegNet’s encoder is based on the first 13 layers of VGG-16, and the upsampling in the decoding stage is improved. In addition, each decoder has a corresponding encoder, so the same segmentation accuracy can be achieved with fewer trainable parameters and lower memory overhead. To address the reduced resolution caused by subsampling or pooling, DeepLab builds on the advantages of the above networks and adopts atrous convolution to expand the receptive field and acquire more contextual information. The latest version, DeepLab v3+, comes with an improved atrous convolution scheme. ResNet, pre-trained on ImageNet, is used as the main network for feature extraction. In the ResNet residual blocks, atrous convolution with different dilation rates is used to capture multi-scale contextual information. To integrate multi-scale information, DeepLab v3+ introduces an encoder-decoder architecture and adopts the Xception model. With these improvements, segmentation accuracy is maintained while the back-end dense conditional random field (CRF) is discarded.
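How atrous (dilated) convolution enlarges the receptive field without adding weights can be shown with a minimal single-channel sketch. The function name `atrous_conv2d` and the all-ones kernel are assumptions for illustration, and the operation is written as the cross-correlation commonly used in deep learning. A k × k kernel with dilation rate r samples the input every r pixels, so it covers an extent of (k − 1)·r + 1 pixels per side.

```python
import numpy as np

def atrous_conv2d(x, kernel, rate=1):
    """Single-channel 'valid' atrous convolution (cross-correlation form)."""
    k = kernel.shape[0]
    span = (k - 1) * rate + 1            # effective extent of the kernel
    h, w = x.shape
    out = np.zeros((h - span + 1, w - span + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # sample the input every `rate` pixels inside the span
            patch = x[i:i + span:rate, j:j + span:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3))                     # 9 weights in both cases
dense = atrous_conv2d(x, kernel, rate=1)     # each output sees a 3 x 3 area
dilated = atrous_conv2d(x, kernel, rate=2)   # each output sees a 5 x 5 area
```

Using several rates in parallel on the same feature map is one way to gather context at multiple scales, which is the role the dilation rates play in the text above.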
At present, although there are a variety of deep learning models for land cover classification, their main body is in every case an encoder-decoder structure (Figure 2). In the encoding stage, convolution, pooling and subsampling are used to acquire segmentation features. In the decoding stage, transposed convolution, unpooling and upsampling are used to label image regions that share the same features, so that land cover classification is achieved through semantic segmentation. In addition, to improve classification accuracy, some deep learning models introduce a post-processing stage to remove noise and refine edges. A comparison of representative image classification methods is shown in Table 1.
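The pairing of an encoder step with a decoder step can be sketched with max pooling and its inverse. The sketch below follows the SegNet-style idea of remembering where each maximum came from during pooling and writing it back to the same position during upsampling; it is a single-channel illustration with hypothetical function names, and it assumes the input height and width are divisible by the pooling size.

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """Encoder step: non-overlapping max pooling that remembers the argmax."""
    h, w = x.shape
    blocks = x.reshape(h // size, size, w // size, size).transpose(0, 2, 1, 3)
    flat = blocks.reshape(h // size, w // size, size * size)
    return flat.max(axis=-1), flat.argmax(axis=-1)

def max_unpool(pooled, idx, size=2):
    """Decoder step: put each pooled value back at its remembered position."""
    ph, pw = pooled.shape
    flat = np.zeros((ph, pw, size * size))
    rows, cols = np.indices((ph, pw))
    flat[rows, cols, idx] = pooled
    blocks = flat.reshape(ph, pw, size, size).transpose(0, 2, 1, 3)
    return blocks.reshape(ph * size, pw * size)

x = np.array([[1.0, 2.0, 0.0, 1.0],
              [3.0, 0.0, 5.0, 0.0],
              [0.0, 0.0, 2.0, 1.0],
              [4.0, 1.0, 0.0, 3.0]])
pooled, idx = max_pool_with_indices(x)   # 2 x 2 map of block maxima
restored = max_unpool(pooled, idx)       # 4 x 4 sparse map, maxima in place
```

The restored map is sparse; in a full decoder it would be followed by convolutions that fill in the remaining positions.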
Figure 2. Remote sensing image semantic segmentation flow chart.
Table 1. Comparison of representative image classification methods.
2.2. Object Detection
Object detection is another common application of remote sensing. Deep learning models for object detection are mainly based on region-based convolutional neural networks (R-CNN), one of the earliest deep learning methods proposed for object detection. The main idea is to transform the object detection problem into a classification problem. The image is divided into a large number of candidate regions by the selective search algorithm, a CNN is then applied to obtain the feature vectors of the candidate regions, and finally a classifier determines the class of each candidate region to complete the detection. R-CNN greatly improved the success rate of image object detection, but it generates partially overlapping candidate regions for each detection target. These regions are fed into the CNN one by one, so the same features are computed repeatedly and detection efficiency drops. To avoid this repeated computation, He et al. proposed Spatial Pyramid Pooling Networks (SPP-Net), which introduce a spatial pyramid pooling layer after the last convolutional layer. This eliminates the repetitive processing and allows images of any size to be processed by the CNN. With these improvements, SPP-Net greatly increased the speed of object detection. Based on SPP-Net, Girshick proposed Fast R-CNN, which simplifies the spatial pyramid pooling layer of SPP-Net into an RoI pooling layer for feature extraction. Replacing the SVM classifier with Softmax greatly improves the speed of training and detection. Fast R-CNN is more accurate than R-CNN and about 213 times faster at test time. To further improve the efficiency of Fast R-CNN in generating candidate regions, Ren et al. proposed Faster R-CNN, which introduces the Region Proposal Network (RPN) to generate candidate regions and combines the RPN and Fast R-CNN into a single integrated network.
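The role of the RoI pooling layer can be illustrated with a small sketch: the feature map of the whole image is computed once, and each candidate region, whatever its size, is reduced to a fixed-size grid by taking the maximum within each grid cell. The function name `roi_max_pool`, the 2 × 2 output size and the integer bin edges are simplifying assumptions for illustration, not the exact quantization used in any particular implementation; the RoI is assumed to be non-empty and given in feature-map cells.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one region of interest to a fixed out_size x out_size grid.

    roi = (row0, col0, row1, col1), end-exclusive, in feature-map cells.
    """
    r0, c0, r1, c1 = roi
    region = feature_map[r0:r1, c0:c1]
    row_edges = np.linspace(0, region.shape[0], out_size + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_size + 1).astype(int)
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # guarantee at least one cell per bin, even for tiny regions
            r_end = max(row_edges[i + 1], row_edges[i] + 1)
            c_end = max(col_edges[j + 1], col_edges[j] + 1)
            out[i, j] = region[row_edges[i]:r_end, col_edges[j]:c_end].max()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)   # shared feature map
pooled_a = roi_max_pool(fmap, (0, 0, 4, 4))       # 4 x 4 region -> 2 x 2
pooled_b = roi_max_pool(fmap, (1, 2, 6, 5))       # 5 x 3 region -> 2 x 2
```

Because both regions end up with the same shape, they can be passed to the same fully connected classifier without recomputing the convolutional features.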
With further improved network structures, YOLO and the Single Shot MultiBox Detector (SSD) maintain almost the same detection accuracy while significantly improving detection speed. A comparison of representative image object detection methods is shown in Table 2.
Table 2. Comparison of representative image object detection methods.
2.3. Change Detection
Change detection is the process of detecting changes using remote sensing imagery obtained at different times. These changes are due partly to natural phenomena, such as droughts, floods and landslides, and partly to human activities, such as new roads, surface excavation or the construction of new houses. Compared with land cover classification and object detection, there are fewer deep learning models for image change detection. Current change detection based on deep learning mainly adopts two technical approaches. One is to detect the corresponding points of two images through deep learning and determine whether these points have changed. The other is to translate the change detection problem into a land cover classification problem and obtain the changed regions through semantic segmentation followed by comparison and classification of map patches. Experimental results indicate that the semantic segmentation approach is easier to implement, faster and more accurate.
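The second approach, comparing the classification results of two dates, can be sketched as follows. The sketch assumes that a segmentation model has already produced one integer label map per date; the class codes (0 water, 1 forest, 2 farmland, 3 built-up) and the function name are hypothetical. It returns a per-pixel change mask and a from-to matrix counting how many pixels moved between each pair of classes.

```python
import numpy as np

def post_classification_change(labels_t1, labels_t2, n_classes):
    """Compare two label maps of the same area taken at different times."""
    changed = labels_t1 != labels_t2
    # transitions[a, b] = pixels that were class a at t1 and class b at t2
    transitions = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(transitions, (labels_t1.ravel(), labels_t2.ravel()), 1)
    return changed, transitions

# Hypothetical class codes: 0 water, 1 forest, 2 farmland, 3 built-up
labels_t1 = np.array([[0, 0, 1],
                      [2, 2, 1]])
labels_t2 = np.array([[0, 3, 1],
                      [2, 3, 3]])
changed, transitions = post_classification_change(labels_t1, labels_t2, 4)
```

The off-diagonal entries of the from-to matrix describe what kind of change occurred, which a plain difference of the two images does not provide.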
3. Research Progress of Deep Learning in Remote Sensing Application
With constant optimization of deep learning models for remote sensing, deep learning is gradually being applied to land cover classification, object detection and change detection in remote sensing imagery. The results of various applications show that, compared with traditional methods, new breakthroughs have been made in accuracy and efficiency.
3.1. Imagery Based Land Cover Classification
Fu et al. extended the FCN for land cover classification of remote sensing images by adding a skip-layer structure that enables multi-resolution image classification. Atrous convolution is introduced to increase the density of the output features, and a CRF is applied during prediction to refine the output classes, which improves the accuracy of high-resolution image classification. To address two problems in vegetation classification, namely the small differences between object features and the loss of features in the encoding stage of FCN, Zhang et al. added a feature extraction layer whose convolution kernels contain the features of the vegetation to be extracted, together with an encoding layer that adopts a nonlinear activation function; as a result, the accuracy of vegetation classification is improved. Sharma et al. proposed a deep learning land cover classification method for medium-resolution imagery. Taking Landsat 8 imagery as the research object, this method changes the CNN input from a single pixel to a 5 × 5 pixel image patch. The patch contains not only the band information of the image but also the spatial relation of adjacent pixels. The experimental data show that, compared with the pixel-based CNN, the patch-based deep learning method increased the overall classification accuracy for farmland, wetland, forest, water bodies and other classes by 24.23%. Zhang et al. proposed a deep learning land cover classification method for high-resolution imagery that integrates a CNN and a multilayer perceptron (MLP). By combining the image features extracted by the CNN and the MLP through integration rules, the overall classification accuracy reaches 90.56%, higher than that of the CNN or the MLP used alone. Zhao et al. proposed a deep learning network suitable for multi-scale imagery classification; by combining spectral and spatial features with improved classifiers, it achieves multi-scale land cover classification with sound accuracy.
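The change from single-pixel input to 5 × 5 patch input can be sketched with NumPy. The sketch is an assumption-laden illustration rather than the cited authors' pipeline: the image is a random array with a hypothetical 7 bands, borders are handled by reflect padding (the original treatment of borders is not stated), and the function name is made up. Each pixel receives the patch centred on it, so the classifier sees the neighbouring pixels as well as the band values.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_patches(image, size=5):
    """Return one size x size patch per pixel of an (H, W, bands) image.

    Output shape is (H, W, size, size, bands); borders use reflect padding.
    """
    pad = size // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    # windows has shape (H, W, bands, size, size)
    windows = sliding_window_view(padded, (size, size), axis=(0, 1))
    return np.moveaxis(windows, 2, -1)

rng = np.random.default_rng(0)
image = rng.random((6, 8, 7))        # hypothetical 6 x 8 scene with 7 bands
patches = extract_patches(image)     # one 5 x 5 x 7 patch per pixel
pixel_input = image[3, 4]            # pixel-based input: 7 band values
patch_input = patches[3, 4]          # patch-based input: 5 x 5 x 7 values
```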
In agricultural applications, Cai et al. proposed a high-performance crop classification method that takes both time and space into account. Based on Common Land Unit (CLU) data, the spectral information of field blocks from long time series of multiple images is combined. A spectral image stack and a deep learning algorithm are applied to eliminate the interference of cloud, fog and shadow in individual images. Compared with USDA crop data, the overall accuracy of this method for the classification of soybeans and corn reached 96%. Wei et al. proposed a cube-pair-based deep convolutional neural network architecture for hyperspectral crop image classification. By using cube pairs, it exploits the data of different bands of hyperspectral imagery and greatly reduces the number of training samples required. Experiments show that, compared with ordinary deep convolutional neural networks, the cube-pair network architecture effectively improves classification accuracy.
3.2. Object Extraction
Chen et al. proposed an urban water body detection method based on deep learning. In this approach, A-SLIC is first applied to segment the remote sensing imagery into superpixels, and a purpose-designed deep convolutional neural network is then used to extract the high-level features of water bodies. Experiments on several types of water bodies in three cities gave overall detection accuracies between 98.31% and 99.81%.
Zhong et al. proposed a position-sensitive balancing (PSB) object detection method and designed a detection framework for high spatial resolution (HSR) remote sensing imagery. The framework combines a Region Proposal Network (RPN) with ResNet. A position-sensitive pooling layer is added to enhance translation invariance, improving the performance of object detection. Experiments show that the accuracy and speed of detecting aircraft, vehicles, bridges, ships, sports grounds and other objects in high-resolution remote sensing imagery are significantly improved.
Tian et al. proposed an urban area detection method based on deep learning. It constructs a visual dictionary on the basis of a pre-trained deep neural network and then trains on labeled urban area imagery. The key to this method is how the visual dictionary is constructed and how detection is performed with deep neural networks. Experiments show that, with a small training sample, this scheme can accurately distinguish urban from non-urban areas.
3.3. Change Detection
To obtain the spectral and texture changes of corresponding points between images, Zhang Xinlong et al. applied a modified change vector analysis algorithm and the grey level co-occurrence matrix, both of which take spatial-contextual information into account. By setting adaptive sampling intervals, samples of the most likely changed and unchanged areas are extracted. A Gaussian-Bernoulli Deep Boltzmann Machine model containing a label layer is then constructed and trained to extract the deep features of changed and unchanged areas, so that changed areas are effectively identified.
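The sample-selection step can be illustrated with plain change vector analysis. This sketch is not the modified algorithm of the cited work: it omits the spatial-contextual terms and the texture features, and it replaces the adaptive sampling intervals with fixed percentile thresholds, which are an assumption made here for illustration. The change magnitude of a pixel is the Euclidean length of the difference between its band vectors at the two dates; pixels with the smallest magnitudes are taken as likely unchanged samples and those with the largest as likely changed samples.

```python
import numpy as np

def change_vector_magnitude(img_t1, img_t2):
    """Per-pixel Euclidean length of the spectral difference vector."""
    diff = img_t2.astype(float) - img_t1.astype(float)
    return np.sqrt((diff ** 2).sum(axis=-1))

def select_samples(magnitude, low_pct=20, high_pct=80):
    """Label likely unchanged (0) and likely changed (1) pixels; -1 = unused."""
    lo, hi = np.percentile(magnitude, [low_pct, high_pct])
    labels = np.full(magnitude.shape, -1)
    labels[magnitude <= lo] = 0
    labels[magnitude >= hi] = 1
    return labels

rng = np.random.default_rng(0)
img_t1 = rng.integers(0, 50, size=(20, 20, 4))             # 4-band image, date 1
img_t2 = img_t1 + rng.integers(-2, 3, size=img_t1.shape)   # date 2, small noise
img_t2[5:10, 5:10] += 100                                  # simulated change
magnitude = change_vector_magnitude(img_t1, img_t2)
labels = select_samples(magnitude, low_pct=20, high_pct=95)
```

The labelled pixels would then serve as training samples for the deep model, while the pixels marked -1 are left for the model to decide.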
Khan et al. proposed a forest change detection method that transforms the change detection task into a region classification problem. Features of change are extracted through a deep neural network. Based on these features, a multiresolution profile (MRP) of the target area is built and a candidate set of bounding boxes is generated to detect potentially changed areas. The detection accuracy of the improved model reached 91.6%, which is 16% higher than that of traditional methods. The model generalizes well and can be widely used in change detection tasks in various regions.
Although great progress has been made in the application of deep learning methods in remote sensing, the following shortcomings remain:
1) Lack of strict mathematical interpretation. Deep learning is merely a process of fitting the input data to the output result, and there is no strict mathematical basis for the design and improvement of the networks.
2) The requirements for training samples are high. To achieve good results in application, both the quantity and the quality of training samples must be very high. Although some scholars have made progress in small-sample training, practical applications in specific areas still require a large number of training samples to reach high accuracy.
3) The interpretability of network features is poor. Features extracted by the network lack practical meaning once they have been passed to the deep layers. Although visualization tools are available, the specific meaning of the automatically extracted features cannot be designed in advance. The construction, adjustment and improvement of deep networks still rely on the experience of developers.
4) Few engineering applications. Most research focuses on network architecture and algorithm verification, and there is little research on cloud computing architecture or on data storage and retrieval mechanisms for engineering applications. Few engineering projects have been completed and put into practical use.
5) Image recognition based on deep learning relies only on sample training, and an image is mapped to a specific result through complex computations. In this process, deep learning does not really understand the specific meaning of the mapping, so prior knowledge cannot be used for image recognition and judgment.
Although the application of deep learning in remote sensing is still in its infancy, a large number of studies have shown that deep learning methods can be widely combined with remote sensing applications and achieve higher accuracy and efficiency than traditional methods in land cover classification, object detection and change detection. With the continuous improvement of remote sensing deep learning models, end-to-end application frameworks free of manual feature design will become an important direction for the development of smart remote sensing applications.
This study is supported by the High Resolution Earth Observation System Regional Industrial Project “Guangxi Beibu Gulf Economic Region Remote Sensing Integrated Service Platform Construction and Application” (84-Y40G07-9010-15/18), the Guangxi National Geo-survey project of Guangxi Bureau of Surveying, Mapping and Geo-information Agency, and the “Guangxi Sugar Industry Development Big-data Platform” of the Guangxi innovation-driven development project (major science and technology special project).