Land use change is a complex process. The use of land use change models to analyze the causes and consequences of land use dynamics has been a popular topic in recent years. However, quantifying the relationship between land use pattern and driving forces (RLPDF) is difficult when land use change models are applied. In some land use change models, e.g., Dyna_CLUE, RLPDF is used to calculate the probability of land use suitability, a value between 0 and 1. The accuracy of this probability directly affects the predictive accuracy of land use change models. Therefore, the robust and accurate quantification of RLPDF has been the subject of numerous investigations of land use change modeling. Spatial modeling techniques are increasingly used in land use change modeling. However, the implemented techniques differ in their modeling preference, and consensus methods are needed to reduce the uncertainty of predictions. In this study, apart from assessing the performance of eight state-of-the-art single computing algorithms, we tested the predictive accuracy of four consensus methods.
One of the common approaches to analyzing RLPDF is the statistical fit of correlative computing algorithms, where land use types are fitted by regression to land use driving forces. Because land use types are categorical variables, the distribution of a land use type is usually quantified with a dummy variable (a binary variable) that takes a value of 0 or 1 to indicate the absence or presence of that land use type. Dummy variables are widely used in statistics, econometrics and species distribution analysis. Computing algorithms that use dummy variables as dependent variables include generalized linear models (GLM), generalized additive models (GAM), classification tree analysis (CTA), random forest (RF) and others, such as parametric, semi-parametric and nonparametric methods, data mining, and intelligent algorithms. Several investigators have applied some of these methods, among others, to studies of RLPDF, such as nested neighborhood spaces and distance decay functions, predefined parameter matrices, linear equations of multi-criteria evaluation (MCE), grey-cell or fuzzy states, logistic models, neural networks, GLM, and machine learning. Inevitably, each computing algorithm has its advantages and disadvantages. For instance, the coefficients in a logistic regression equation are easy to interpret, but some authors indicate that such models are limited by the assumption of a linear response to environmental predictors. Artificial neural networks (ANN) provide an increasingly advocated alternative because they accommodate nonlinear influences on land use distribution. However, it is difficult to comprehend the meanings of the parameters used in ANN because of the black-box nature of neural networks. These results provide an interesting but still limited exploration of which method performs best for each of several different goals or study areas.
One of the problems with these analyses is that the results depend on the multivariate analysis computing algorithms used. The number of available techniques is large and is increasing steadily, making it difficult for novices to select appropriate computing algorithms for their needs. Recent analyses have also demonstrated that the discrepancies that arise from using different techniques can be large, which makes the choice of an appropriate computing algorithm more difficult.
The various modeling techniques available utilize a variety of algorithms to calculate the probability that a land use type can occur in a given area. The vast and growing literature on distribution modeling suggests that some techniques are typically more effective than others, but there is no one superior algorithm that performs best for all land use types, all data sets, or all research objectives. A number of studies have addressed the errors and uncertainties embedded in the above models. The sources of uncertainty are diverse, ranging from small sample sizes, missing determinants and nonlinear relationships to uncertainties in model building procedures. There are two main approaches to reducing the model-based uncertainty in land use pattern distribution simulations: 1) comparing a wide range of models and determining which of them generally provides the best predictive performance; 2) using consensus methods, which are based on combining the predictions provided by different single computing algorithms. The consensus approach is based on the idea that different predictions are copies of possible states of the real distribution and that together they form an ensemble; combining several unbiased model results (probabilities) should yield a more accurate prediction. There are different ways to build a consensus prediction, and it has rarely if ever been tested which of the consensus methods consistently generate more accurate land use pattern distributions than the novel single-model methods available for land use modeling.
In addition to depending on the selection of an appropriate algorithm, model results also depend on the complexity of the research subject and the quality of the environmental data used. A catchment is a geophysical functional region widely used in hydrology and ecology. Catchments have also been advocated as appropriate units for ecological planning. Moreover, quantifying landscape pattern changes within catchment extents is more ecologically meaningful than using extents delimited by rectangular boundaries or administrative units. Generally, a catchment is considered a relatively closed ecosystem. Studying RLPDF at the catchment scale can better demonstrate the importance of selecting a model that performs well. The relative importance of driving forces in determining land use type distributions also varies with the spatial scale.
For the above reasons, we chose a small hilly catchment (the Jinjing river catchment) and applied eight computing algorithms [i.e., GLM, GAM, CTA, ANN, flexible discriminant analysis (FDA), multivariate adaptive regression splines (MARS), generalized boosting models (GBM), and RF] and four consensus methods [two methods (Median and Mean) are based on the outputs of all eight single-models, whereas the Weighted Average (WA) method uses the four single-models with the highest AUC values and the PCA (median) method is based on the median of half of the single-models, chosen by a principal component analysis] to investigate the relationship between land use pattern and driving forces (i.e., elevation, slope, aspect, distance to residential areas, distance to roads, distance to rivers, and distance to lakes or ponds) at two spatial scales: the entire catchment and the subcatchment. The main objectives of this study were: (i) to analyze the statistical differences in the predictive ability of the eight single computing algorithms and identify the best-performing computing algorithms for the RLPDF study, and (ii) to investigate which of the consensus methods could improve the accuracy of predictions from single computing algorithms.
2. Materials and Methods
2.1. Study Area
The Jinjing river catchment, located in the town of Jinjing, near Changsha in Hunan Province, China (Figure 1), has a population of 41,618 people and an area of 135 km². It is one of the headwater catchments of the Jinjing river catchment system, which is one of the major tributaries of the Xiangjiang river watershed system.
The region has a subtropical monsoon climate with a mean annual air temperature of 17.5 °C and a mean annual precipitation of 1330 mm (1968-2015). On average, 70% of the annual precipitation falls during the warm season in April, May, and June. The elevation is between 56 and
2.2. Data Preparation
In this study, we used the land use data of 1990, 2005 and 2012 to analyze RLPDF. The historical cadastral maps (including land use types and digital elevation model, or DEM, data) were obtained from the Hunan Provincial Geomatics Information Center (http://www.hnpgc.com). The land use types in the area are woodland, paddy fields, tea fields, roads, residential areas and water bodies (e.g., drainage and irrigation channels, rivers and reservoirs) (Figure 2).
Figure 1. Geographical location of the Jinjing river catchment, 50 km north of Changsha (the capital city of Hunan province), China. Subcatchments 1 and 39 are highlighted.
Figure 2. Land use data for 1990, 2005 and 2012.
From 1990 to 2012, the shrinking of paddy fields and the expansion of woodland were the major landscape changes in the Jinjing catchment (Table 1). Because paddy field and woodland are the two main land use types, accounting for 26.65% - 28.72% and 65.45% - 71.28% of the total area of the Jinjing catchment, respectively, we chose these two land use types to analyze RLPDF. The land use type map was converted to a grid format from the available vector map at a spatial resolution of
To analyze RLPDF at the subcatchment scale, two of the 40 subcatchments were chosen for their specific distributions of woodland and paddy fields. Subcatchment 1 has an undulating terrain and is dominated by woodland (87.41%). Subcatchment 39 has a gently undulating terrain and is dominated by paddy field (43.23%).
Generally, the driving forces of land use change can be grouped into two categories: biophysical factors and socio-economic drivers. Although biophysical factors, such as elevation and slope, mostly do not directly drive land use change, they can influence land use allocation decisions and thereby lead to land use changes. Some socio-economic drivers, e.g., GDP and population, can hardly show spatial variability within a catchment such as the Jinjing river catchment, which lies within a township, the lowest level of governmental administration for national statistical purposes. For these reasons, seven representative driving forces (elevation, slope, aspect, distance to residential areas, distance to lakes or ponds, distance to rivers, and distance to roads) were chosen in this study to explore the relationship between land use pattern and driving forces. Table 2 summarizes the seven driving forces. We calculated the importance of the driving forces for woodland and paddy field at the two spatial scales using the best-performing of the eight computing algorithms mentioned above.
Table 1. Temporal change in the area of each land use type (km²).
Table 2. Descriptive statistics of driving forces in the RLPDF analysis.
2.3. Single Computing Algorithms and Consensus Methods for the RLPDF Analysis
In this study, to take the variation in algorithm performance into account, a multi-model approach was taken, using the BIOMOD package implemented in R software. The eight computing algorithms considered (GLM, GAM, CTA, ANN, FDA, MARS, GBM and RF) are listed in Table 3. Each of the eight computing algorithms was run independently. We used a dummy variable (a binary variable) that takes a value of 0 or 1 to indicate the absence or presence of a land use type as the dependent variable, and took the seven driving forces as the independent variables for each computing algorithm.
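The dummy-variable coding described above can be illustrated with a minimal Python sketch. The study itself used the BIOMOD package in R; the cell labels and the helper name `presence_absence` below are hypothetical and serve only to show how a categorical land use map becomes one 0/1 response per land use type.

```python
# Hypothetical land use labels for six grid cells (illustrative values only,
# not taken from the Jinjing data set).
cells = ["woodland", "paddy field", "woodland", "tea field", "paddy field", "road"]


def presence_absence(labels, target):
    """Return 1 where a cell holds the target land use type and 0 elsewhere."""
    return [1 if label == target else 0 for label in labels]


# One binary response vector per land use type of interest.
woodland = presence_absence(cells, "woodland")
paddy_field = presence_absence(cells, "paddy field")
```

Each binary vector would then serve as the dependent variable of a separate model, with the driving forces as predictors.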
Eight single computing algorithms were first built separately for each land use type. Combining the outputs of the single computing algorithms then provided the ensemble of predictions, which contains eight forecasted probability distributions for the land use pattern. The Median and Mean consensus methods take the median and the mean, respectively, of the outputs of all eight single-models. The WA consensus method ranks the single computing algorithms according to their predictive performance and assigns a weight (0 ~ 1) to their probability values. The PCA (median) method calculates the median of a subset of the single computing algorithms, selected by a PCA from all models for each land use type. The PCA is run with the projected probabilities of all single computing algorithms and provides a rating for each single computing algorithm that reflects its ability to explain the variance of the general trend of the eight single computing algorithms.
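As a rough illustration of the Mean, Median and WA consensus methods, the following Python sketch combines per-cell probabilities from several models. It is not the BIOMOD implementation: the weighting scheme (weights proportional to AUC among the retained models) is an assumption, the model names and values are hypothetical, and the PCA (median) method is omitted.

```python
from statistics import mean, median


def consensus_mean(preds):
    """Cell-wise mean of the probabilities predicted by all models."""
    return [mean(cell) for cell in zip(*preds.values())]


def consensus_median(preds):
    """Cell-wise median of the probabilities predicted by all models."""
    return [median(cell) for cell in zip(*preds.values())]


def consensus_weighted_average(preds, auc, top_k=4):
    """Weighted average of the top_k models ranked by AUC.

    Assumption: weights are proportional to AUC and sum to 1.
    """
    best = sorted(preds, key=lambda name: auc[name], reverse=True)[:top_k]
    total = sum(auc[name] for name in best)
    weights = {name: auc[name] / total for name in best}
    n_cells = len(next(iter(preds.values())))
    return [sum(weights[name] * preds[name][i] for name in best)
            for i in range(n_cells)]


# Hypothetical probabilities for two grid cells from three models.
preds = {"A": [0.2, 0.8], "B": [0.4, 0.6], "C": [0.9, 0.1]}
auc = {"A": 0.9, "B": 0.6, "C": 0.5}

mean_map = consensus_mean(preds)
median_map = consensus_median(preds)
wa_map = consensus_weighted_average(preds, auc, top_k=2)
```

In this toy example the WA map keeps only models A and B, so the poorly performing model C no longer pulls the first cell toward a high probability.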
Table 3. Description of eight algorithms for researching RLPDF.
We implemented a cross-validation procedure to evaluate the computing algorithms. Because there was no independent data set containing the same type of data that could be used for evaluation purposes, the computing algorithms were calibrated using a random subset of 80% of the available data and evaluated using the remaining 20%. The area under the curve (AUC) of the receiver operating characteristic (ROC) has been used to assess the predictive performance of the distribution models. An evaluation system based on the calculated AUC values was developed: 0.5 - 0.7 = low accuracy, 0.7 - 0.9 = potentially useful, and >0.9 = high accuracy.
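The evaluation procedure can be sketched in Python as follows. This is a minimal illustration, not the code used in the study: the AUC is computed with the rank-based (Mann-Whitney) formulation, the function names are hypothetical, the assignment of the boundary values 0.7 and 0.9 to a class is an assumption, and the label for AUC below 0.5 is added only so that the function is total.

```python
import random


def split_80_20(n_cells, seed=0):
    """Randomly split cell indices into 80% calibration and 20% evaluation."""
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)
    cut = (n_cells * 4) // 5
    return indices[:cut], indices[cut:]


def auc_score(observed, predicted):
    """AUC as the probability that a presence cell outranks an absence cell."""
    pos = [p for o, p in zip(observed, predicted) if o == 1]
    neg = [p for o, p in zip(observed, predicted) if o == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one presence and one absence")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def accuracy_class(auc):
    """Map an AUC value to the accuracy classes used in the text."""
    if auc > 0.9:
        return "high accuracy"
    if auc >= 0.7:  # assumption: 0.7 itself counts as potentially useful
        return "potentially useful"
    if auc >= 0.5:
        return "low accuracy"
    return "no better than random"  # label not defined in the text


calibration, evaluation = split_80_20(10)

# Hypothetical evaluation cells: observed presence/absence and predictions.
observed = [1, 1, 1, 0, 0]
predicted = [0.9, 0.8, 0.4, 0.5, 0.2]
auc = auc_score(observed, predicted)
label = accuracy_class(auc)
```

Here five of the six presence/absence pairs are ranked correctly, so the AUC is 5/6 and the model falls in the "potentially useful" class.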
2.4. Predicting Results Analysis of Computing Algorithms for the RLPDF Analysis
Probability thresholds for transforming the continuous results of the computing algorithms into binary values were set for each computing algorithm. We overlaid the probability thresholds on the histogram of predicted probability values (HPPV) of the land use pattern to illustrate the significant differences among the outcomes of the eight computing algorithms at the two spatial scales considered.
To show the spatial contrast between the predicted probabilities (0 - 1) of the best- and worst-performing algorithms, we computed the spatial prediction difference for each grid cell, e.g., for the entire catchment, as the value (0 - 1) simulated by the best-performing algorithm minus the value (0 - 1) simulated by the worst-performing algorithm. Because these calculations are repetitive across the three discrete years, we chose only the 2012 data to investigate the impact of different computing algorithms on the RLPDF analysis.
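The cell-wise comparison described above can be sketched in Python. The grids and function names are hypothetical, and summarizing the differences by their absolute value is an assumption; the text only specifies that the worst-performing prediction is subtracted from the best-performing one.

```python
def difference_map(best, worst):
    """Cell-wise difference: best-performing minus worst-performing prediction."""
    return [[b - w for b, w in zip(row_b, row_w)]
            for row_b, row_w in zip(best, worst)]


def share_above(diff, threshold):
    """Fraction of cells whose absolute difference exceeds the threshold."""
    values = [abs(v) for row in diff for v in row]
    return sum(v > threshold for v in values) / len(values)


def share_below(diff, threshold):
    """Fraction of cells whose absolute difference is below the threshold."""
    values = [abs(v) for row in diff for v in row]
    return sum(v < threshold for v in values) / len(values)


# Hypothetical 2 x 2 probability grids from two algorithms.
best = [[0.9, 0.2], [0.6, 0.1]]
worst = [[0.3, 0.25], [0.55, 0.8]]

diff = difference_map(best, worst)
large = share_above(diff, 0.5)   # cells where the algorithms disagree strongly
small = share_below(diff, 0.1)   # cells where they nearly agree
```

With a binarization threshold of 0.5, the share of cells with a difference above 0.5 gives a simple upper bound on how often the two algorithms could assign a cell to opposite classes.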
3.1. Performance of Computing Algorithms at Different Spatial Scales
The prediction performance of the single computing algorithms and consensus methods varied for the different land use types at the two spatial scales (Figure 3). At the entire catchment scale, the mean AUC values were between 0.715 (ANN) and 0.948 (RF) for the single algorithms, and from 0.764 to 0.962 for the consensus methods. At the subcatchment scale, the mean AUC values were between 0.624 (CTA) and 0.972 (RF) for the single algorithms, and from 0.758 to 0.979 for the consensus methods. These results suggest that (i) among the eight single computing algorithms, RF performed the best overall for woodland and paddy field; and (ii) WA showed higher predictive performance for the woodland and paddy field models than did the single computing algorithms. The eight single computing algorithms and four consensus methods performed differently at the two spatial scales for the RLPDF analysis for 2012 (Figure 4). The significant differences among the HPPVs and threshold values reflect the statistical distribution of the predicted probability values and the importance of choosing appropriate computing algorithms to analyze RLPDF.
Figure 3. ROC index-based evaluation of simulation results by using eight algorithms and four consensus methods.
3.2. Spatial Predicted Error Analysis of Selecting Computing Algorithms
In comparing the AUC values of the eight computing algorithms and four consensus methods for RLPDF in 2012 (Figure 3), we found that for woodland, WA and CTA were the best- and worst-performing algorithms, respectively, and that for paddy fields, WA and ANN were the best- and worst-performing algorithms, respectively. Figure 5 shows the predicted probability (0 - 1) of the best- and worst-performing algorithms for woodland and paddy field at the entire catchment scale for 2012. For approximately 72.5% of woodland and 72.4% of paddy field, the difference in predicted probability of occurrence between the two algorithms was less than 0.1, and for 3.6% of woodland and 14.5% of paddy field it was more than 0.5. In other words, the simulation errors associated with the selection of the computing algorithm can be up to 14.5% if 0.5 is chosen as the threshold value. In 1990 and 2005, the differences in predicted probabilities (0 - 1) produced by the best- and worst-performing algorithms were similar to those in 2012.
Figure 5. Comparison of the predicted probability (0 - 1) of the best- and worst-performing algorithms for woodland and paddy field at the entire catchment scale for 2012. Woodland (WA): woodland simulation by Weighted average; Woodland (CTA): woodland simulation by classification tree analysis; Woodland (WA-CTA): Woodland (WA) minus Woodland (CTA); Paddy field (WA): Paddy field simulation by Weighted average; Paddy field (ANN): paddy field simulation by artificial neural networks; and Paddy field (WA-ANN): Paddy field (WA) minus Paddy field (ANN).
3.3. Importance of Driving Forces at Different Spatial Scales
The importance of the driving forces calculated by the stably performing computing algorithm (i.e., RF) differed among the land use types at the two spatial scales (Figure 6). Elevation showed high importance values at both spatial scales in 1990, 2005 and 2012, especially for Subcatchment 39, indicating that elevation was the most important driving force for the RLPDF analysis in this study. Other driving forces, e.g., distance to residential areas, were the second most important. The dominance of elevation was mainly due to the Jinjing river catchment possessing small plains mixed with hills.
Figure 6. Importance values for seven driving forces (elevation, aspect, slope, distance to lakes, distance to residential areas, distance to rivers and distance to roads) in the RLPDF analysis. (a) woodland for the entire catchment; (b) paddy field for the entire catchment; (c) woodland for Subcatchment 1; (d) paddy field for Subcatchment 1; (e) woodland for Subcatchment 39; and (f) paddy field for Subcatchment 39.
4.1. Effects of Computing Algorithm Selection for RLPDF
The processes of land use change are complex. RLPDF research has gained momentum with recent developments in multivariate analysis methods applied to ecological analysis. Statistical models for RLPDF are increasingly being used, but systematic comparisons of alternative methods are still limited. In particular, only a few studies have explored the effect of the spatial scale on the model outputs. In this study, we investigated the predictive ability of eight computing algorithms using data on land use distribution and driving forces at two scales: an entire catchment and subcatchments. The results obtained provide useful information for other RLPDF researchers.
The ROC curve is a graphical method for representing the relationship between the false positive fraction and the sensitivity for a range of thresholds. Our results indicate that all of the eight models considered performed well at predicting land use distributions, with AUC values ranging from 0.654 to 0.963 at the two different spatial scales considered. Of the eight computing models considered, the non-parametric approaches (i.e., RF, GBM, MARS, CTA, and ANN), and particularly RF and GBM, produced better results for very complex systems than parametric algorithms such as GLM. Based on our observations in this study, the performance of ANN was judged to be unreliable for full-scale data sets. One possible reason for the poor performance of ANN is that the spatial correlation between land use pattern and driving forces may be over-fitted. Many researchers have shown that ANN is capable of analyzing RLPDF, and we do not doubt the predictive ability of ANN for the RLPDF analysis. However, compared with RF and GBM, and considering the complexity of its parameter setting, ANN may not be suitable for the RLPDF analysis, especially for places with high spatial heterogeneity. In our study, the most efficient consensus method was the WA consensus method, which significantly improved the predictive accuracy of the eight single computing algorithms. The good performance of the WA consensus method was primarily due to the low-pass filtering ability of the average function. This result is similar to those of other studies of predictive species distribution modeling.
Although the ROC curve is not dependent on the probability threshold, the selection of thresholds for land use pattern prediction was important because the determination of the presence or absence of a given land use type is largely dependent on the threshold value selected. When a model yields good performance, the predicted probability varies between 0 (true negative) and 1 (true positive). Predicting the continuous probability between 0 and 1 is one of the important purposes of using various algorithms to analyze RLPDF. In general, the continuous 0-1 probability value is considered an index of suitability for land use types and a core component of some land use models. In this study, the characteristics of the HPPVs were significantly different for the two spatial scales considered (Figure 4). This phenomenon could be attributed to the quality of the environmental data used and the complexity of the studied objects. The structural complexity, spatial heterogeneity, and factorial co-linearity of spatial catchment data can make the RLPDF analysis and land use distribution simulation more difficult. This was also the main reason that some algorithms (such as CTA and ANN) performed poorly in this study. We compared the simulation results of the best- and worst-performing algorithms. The simulation error caused by the algorithm selection can be up to 14.5% if a threshold value of 0.5 is used. In general, probability values are used to quantify the stability of land use, which is considered a very important part of a land use model. When we use land use models (e.g., Dyna_CLUE and GeoSOS) to simulate land use and cover change, we may confuse the simulation errors resulting from the choice of the computing algorithms embedded in these land use models with the uncertainty of the internal parameters of the land use models.
Simulation errors can be reduced by repeatedly and carefully choosing and optimizing algorithms before we use them to quantify the stability of land use spatial distributions. One may argue that the characteristics and limitations of the algorithms may be the key reason for their poor or good performance. Most of those algorithms were not developed for land use assessment purposes, and they have rarely been used in the RLPDF analysis. When they are used in the RLPDF analysis, some improvements need to be made to them.
4.2. Selection of Driving Forces with Spatial Scales
The analysis of the importance of driving forces is an important part of studying RLPDF and using land use models, especially in places characterized by intense spatial heterogeneity, high sensitivity to changes in spatial scale and a complicated land use change process. The importance of the driving forces varied with land use type at multiple spatial scales in the Jinjing river catchment. However, the most important factor was elevation in the entire catchment and in Subcatchments 1 and 39. This was mainly due to the small plains interspersed with hills that characterize the Jinjing river catchment. The less important factors were different for woodland and paddy field in Subcatchment 1. One possible reason for this difference concerns the natural environment: woodland covered more than 70% of the total area of Subcatchment 1, with paddy field embedded in the hills (Figure 1). This embedded distribution increased the dependence of paddy field on water resources, and thus distance to lakes was found to be the most important of these factors for paddy field in this subcatchment. In this study, we used the best-performing algorithm to quantify the importance of the driving forces for land use change at the entire catchment and subcatchment scales. When poorly performing algorithms were used instead, the importance values of the driving forces were obviously different from those quantified by the best-performing algorithm (results not shown). This phenomenon also demonstrates the importance of selecting computing algorithms for analyzing RLPDF.
4.3. Other Algorithms for Future RLPDF Analysis
Our results indicate that different land use types require different computing algorithms, depending on the spatial scale. Such differences may be obstacles to the development of a single all-purpose land use model for land use planning. Thus, it is desirable that land use decision-making be based on a set of alternative source maps and a set of predictions obtained using multiple models.
There are also other algorithms available for use in the RLPDF analysis, e.g., geographically weighted regression (GWR), generalized linear mixed models, and nonlinear mixed models. For finite data, using multiple computing algorithms may yield more accurate and clearer insights in the RLPDF analysis. Future challenges may include combining multiple computing algorithms with the theories of cellular automata (CA) and multi-agent systems to build a land use structure optimization model at the catchment scale.
Eight computing algorithms and four consensus methods were used to investigate RLPDF at two spatial scales (i.e., an entire catchment and a subcatchment) in the town of Jinjing, northeast of Changsha in Hunan Province, China. The WA consensus method performed the best overall for woodland and paddy field in the catchment. However, ANN performed inconsistently, especially for Subcatchment 1 with its high spatial heterogeneity. Taking the 2012 data as an example and comparing the probabilities predicted by the best- and worst-performing computing algorithms, the difference in predicted probability of occurrence was less than 0.1 for approximately 72.5% of woodland and 72.4% of paddy field, and more than 0.5 for 3.6% of woodland and 14.5% of paddy field. The consensus methods based on average function algorithms may significantly increase the accuracy of land use distribution predictions, and thus they show considerable promise for land use change modeling and planning.
This research was financially supported by the National Natural Science Foundation of China (41301202).