The Normalized Difference Vegetation Index (NDVI) is a widely used vegetation index (VI) that provides a way of evaluating biophysical or biochemical information related to vegetation growth. Long-term NDVI time-series datasets have been widely used to monitor ecosystem dynamics and to understand ecosystem responses to climate change. However, due to financial and technical constraints, it is difficult for a single remote sensing instrument to provide NDVI data with both high spatial and high temporal resolution. In addition, long periods of cloud cover in some regions aggravate this problem. Spatiotemporal fusion techniques, which combine NDVI data from multiple sensors to obtain both high spatial and high temporal resolution, are therefore a feasible way to acquire remote sensing time series for monitoring surface vegetation dynamics.
Several spatiotemporal fusion models have been proposed to date. Gao et al. proposed the spatial and temporal adaptive reflectance fusion model (STARFM), which blends MODIS and Landsat images to produce a synthetic surface reflectance product at 30 m spatial resolution. Building on STARFM, Zhu et al. developed the enhanced spatial and temporal adaptive reflectance fusion model (ESTARFM), which introduces a conversion coefficient between pixels and improves prediction accuracy. Zhu et al. proposed the flexible spatiotemporal data fusion model (FSDAF), which performs better in predicting abrupt land cover changes. Liao et al. developed the spatiotemporal vegetation index image fusion model (STVIFM) to generate NDVI time series images with high spatial and temporal resolution in heterogeneous regions. In this study, we compare the STARFM, ESTARFM, FSDAF, and STVIFM methods using Landsat and MODIS data acquired over the same site, and quantitatively assess the accuracy of the image predicted by each fusion model.
2. Materials and Methods
2.1. Study Site and Data Preparation
The study area, shown in Figure 1, is located in Banan District (29˚34'10''N, 106˚57'35''E), Chongqing, China, and was used to compare the spatiotemporal fusion models. We selected MODIS daily surface reflectance images and Landsat-8 images acquired on three dates: April 28, 2015, August 02, 2015, and October 21, 2015. All images were pre-processed and converted to NDVI. The scene subset is shown in Figure 2.
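As an illustration of the NDVI conversion mentioned above, the following minimal sketch assumes the standard definition NDVI = (NIR - Red) / (NIR + Red) applied to surface reflectance. The function names, the toy values, and the choice to return None for pixels with zero total reflectance are assumptions made for this example; the pre-processing chain used in the study is not described here.

```python
def ndvi(nir, red):
    """NDVI for a single pixel from near-infrared and red reflectance.

    Returns None when the two reflectances sum to zero (an assumed
    convention for no-data pixels, not taken from the study).
    """
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total


def ndvi_image(nir_band, red_band):
    """Apply ndvi() pixel by pixel to two equally sized 2-D lists."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Toy 2 x 2 reflectance grids (illustrative values, not real data).
nir = [[0.50, 0.30], [0.05, 0.00]]
red = [[0.10, 0.10], [0.08, 0.00]]
image = ndvi_image(nir, red)
```

In this toy grid the strongly vegetated pixel has the highest NDVI, the pixel whose red reflectance exceeds its NIR reflectance (typical of water) is negative, and the empty pixel is flagged as no-data.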
Figure 1. Location of the study area.
Figure 2. Landsat NDVI (upper row) and MODIS NDVI (lower row) images. From left to right, the images were acquired on April 28, 2015, August 02, 2015, and October 21, 2015.
2.2. Selected Spatiotemporal Fusion Models
STARFM is based on a moving-window technique and requires at least one pair of high- and coarse-resolution images at the base time and one coarse-resolution image at the prediction time. A weight function that combines the spectral, temporal, and spatial differences determines the contribution of the other pixels in the window to the central pixel. A synthetic image with high spatial and temporal resolution, F(t2), is then predicted from the high- and coarse-resolution data through this weight function. The model can be written as Equation (1).
where F(t1) and M(t1) denote the high- and coarse-resolution data at the base date, M(t2) is the coarse-resolution data at the prediction date, and Wi is the weight function.
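With the symbols defined above, the STARFM prediction for a central pixel is commonly written as F(t2) = Σ Wi × (M(t2) + F(t1) - M(t1)), summed over the similar pixels i in the moving window. The sketch below illustrates that form for a single window. The weight function shown (inverse of the product of spectral, temporal, and spatial differences, normalized to sum to one) is a simplified assumption, and the function names, the `scale` and `eps` parameters, and the sample values are hypothetical.

```python
def starfm_weights(fine_t1, coarse_t1, coarse_t2, distances, scale=1.0, eps=1e-6):
    """Normalized weights for the candidate pixels in one window.

    Smaller spectral, temporal, and spatial differences give larger
    weights. eps avoids division by zero (an assumption of this sketch).
    """
    inverse = []
    for f1, m1, m2, d in zip(fine_t1, coarse_t1, coarse_t2, distances):
        spectral = abs(f1 - m1) + eps   # fine vs. coarse at the base date
        temporal = abs(m2 - m1) + eps   # coarse change between the dates
        spatial = 1.0 + d / scale       # distance to the central pixel
        inverse.append(1.0 / (spectral * temporal * spatial))
    total = sum(inverse)
    return [v / total for v in inverse]


def starfm_predict(fine_t1, coarse_t1, coarse_t2, weights):
    """F(t2) = sum_i W_i * (M_i(t2) + F_i(t1) - M_i(t1))."""
    return sum(
        w * (m2 + f1 - m1)
        for w, f1, m1, m2 in zip(weights, fine_t1, coarse_t1, coarse_t2)
    )


# Three candidate pixels in one window (illustrative NDVI values).
fine_t1 = [0.60, 0.55, 0.20]
coarse_t1 = [0.58, 0.50, 0.45]
coarse_t2 = [0.68, 0.61, 0.50]
distances = [0.0, 1.0, 2.0]

weights = starfm_weights(fine_t1, coarse_t1, coarse_t2, distances)
prediction = starfm_predict(fine_t1, coarse_t1, coarse_t2, weights)
```

Because the weights sum to one, the prediction is a weighted average of the per-pixel candidates M(t2) + F(t1) - M(t1), dominated here by the first pixel, which is closest and most similar to its coarse counterpart.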
ESTARFM needs at least two pairs of high- and coarse-resolution images at the base times and one coarse-resolution image at the prediction time. Compared with STARFM, this method not only considers the spatial and spectral similarity between pixels but also introduces a conversion coefficient, which is derived by linear regression from the high- and coarse-resolution data over the observation period. The final high-resolution prediction is computed as in Equation (2).
where F(t1) and M(t1) denote the high- and coarse-resolution data at the base date, M(t2) is the coarse-resolution data at the prediction date, and Wi and Vi denote the weight function and the conversion coefficient, respectively.
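A form consistent with these definitions is F(t2) = F(t1) + Σ Wi × Vi × (M(t2) - M(t1)). The following sketch shows how a conversion coefficient could be estimated as the least-squares slope of fine-resolution values regressed on coarse-resolution values, and how it scales the coarse-resolution change. The function names, the fallback value of 1.0, and the sample numbers are assumptions for illustration; the published model additionally combines the predictions from the two base dates, which this sketch omits.

```python
def conversion_coefficient(coarse_values, fine_values):
    """Least-squares slope of fine values regressed on coarse values."""
    n = len(coarse_values)
    mean_c = sum(coarse_values) / n
    mean_f = sum(fine_values) / n
    var_c = sum((c - mean_c) ** 2 for c in coarse_values)
    if var_c == 0:
        return 1.0  # assumed fallback when the coarse values do not vary
    cov = sum(
        (c - mean_c) * (f - mean_f)
        for c, f in zip(coarse_values, fine_values)
    )
    return cov / var_c


def estarfm_predict(fine_center_t1, coarse_t1, coarse_t2, weights, coefficients):
    """F(t2) = F(t1) + sum_i W_i * V_i * (M_i(t2) - M_i(t1))."""
    change = sum(
        w * v * (m2 - m1)
        for w, v, m1, m2 in zip(weights, coefficients, coarse_t1, coarse_t2)
    )
    return fine_center_t1 + change


# Fine values change twice as fast as coarse values in this toy sample.
v = conversion_coefficient([0.2, 0.4, 0.6], [0.1, 0.5, 0.9])

prediction = estarfm_predict(
    fine_center_t1=0.5,
    coarse_t1=[0.40, 0.40],
    coarse_t2=[0.50, 0.45],
    weights=[0.5, 0.5],
    coefficients=[v, v],
)
```

A coefficient above one means the fine-resolution pixel is expected to change faster than the coarse pixel that contains it, which is how mixed coarse pixels are accounted for in this formulation.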
FSDAF uses one pair of high- and coarse-resolution images at the base time and one coarse-resolution image at the prediction time, and it also requires a land cover map. The model integrates STARFM, the linear unmixing method, and the thin plate spline (TPS) interpolator, which preserves land cover change signals and local variability. It combines the temporal prediction from linear unmixing with the spatial prediction obtained by TPS and distributes the residual to the fine pixels to obtain the final prediction. It can be written as Equation (3).
where F(t1) and F(t2) denote the high-resolution images at the base time and the prediction time, respectively. The change term refers to the change between t1 and t2, which is computed by the linear unmixing method and TPS, and Wi is the weight function.
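The linear unmixing step can be illustrated in isolation. The sketch below assumes only two land cover classes and models the change observed in each coarse pixel as the fraction-weighted sum of the class-level changes, then solves for those class changes by least squares and adds them to the base-date fine pixels. The two-class restriction, the function names, and the sample values are assumptions; the TPS prediction and the residual distribution of the full model are not shown.

```python
def unmix_two_classes(fractions, coarse_changes):
    """Solve min sum_i (a_i*x + b_i*y - d_i)^2 for class changes (x, y).

    fractions: list of (a_i, b_i) class fractions per coarse pixel.
    coarse_changes: observed coarse change d_i = M_i(t2) - M_i(t1).
    """
    s_aa = s_bb = s_ab = s_ad = s_bd = 0.0
    for (a, b), d in zip(fractions, coarse_changes):
        s_aa += a * a
        s_bb += b * b
        s_ab += a * b
        s_ad += a * d
        s_bd += b * d
    det = s_aa * s_bb - s_ab * s_ab
    if abs(det) < 1e-12:
        raise ValueError("class fractions are collinear; unmixing is ill-posed")
    change_a = (s_ad * s_bb - s_ab * s_bd) / det
    change_b = (s_aa * s_bd - s_ab * s_ad) / det
    return change_a, change_b


def temporal_prediction(fine_t1, class_labels, class_changes):
    """Add the class-level change to each base-date fine pixel."""
    return [f + class_changes[label] for f, label in zip(fine_t1, class_labels)]


# Three coarse pixels: pure class A, pure class B, and an even mixture.
fractions = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
coarse_changes = [0.20, -0.10, 0.05]

change_a, change_b = unmix_two_classes(fractions, coarse_changes)
fine_t2 = temporal_prediction(
    [0.40, 0.30], ["A", "B"], {"A": change_a, "B": change_b}
)
```

At least as many coarse pixels as classes are needed, and their class fractions must differ, otherwise the system cannot separate the class-level changes.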
STVIFM requires two pairs of high- and coarse-resolution images acquired at the base times and one coarse-resolution image at the prediction date. On the one hand, the model links the mean NDVI change of the high-resolution pixels to the mean NDVI change of the coarse-resolution pixels within a moving window. On the other hand, it considers the difference in NDVI change rates at different growing stages. The final prediction can be written as Equation (4).
where NDVI(t2) and NDVI(t1) are the high-resolution data at the prediction time and the base time, respectively, ΔNDVI denotes the change between t1 and t2 calculated by the model, and Wi is the weight function.
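The constraint that the mean fine-resolution NDVI change matches the mean coarse-resolution NDVI change within a window can be sketched as follows. This is a hedged illustration only: the per-pixel weights stand in for relative change rates and are supplied by hand, which is an assumption of this example and not the published STVIFM weighting. The function name and the sample values are hypothetical.

```python
def distribute_change(fine_t1, coarse_t1, coarse_t2, weights):
    """Distribute the mean coarse NDVI change over the fine pixels.

    Each fine pixel receives a share of the total change proportional
    to its weight, so the mean fine change equals the mean coarse change.
    """
    n = len(fine_t1)
    mean_coarse_change = sum(
        m2 - m1 for m1, m2 in zip(coarse_t1, coarse_t2)
    ) / len(coarse_t1)
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    changes = [n * mean_coarse_change * w / total_weight for w in weights]
    return [f + d for f, d in zip(fine_t1, changes)]


# Two fine pixels inside one coarse pixel; the second changes three
# times as fast as the first (illustrative values).
fine_t1 = [0.20, 0.40]
fine_t2 = distribute_change(fine_t1, [0.30], [0.50], [1.0, 3.0])
mean_fine_change = sum(b - a for a, b in zip(fine_t1, fine_t2)) / len(fine_t1)
```

Pixels with a larger weight, for example crops in a rapid growth stage, absorb more of the window-level change, while the window mean stays tied to the coarse observation.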
2.3. Assessing Prediction Accuracy
The prediction performance of each model is evaluated quantitatively with two representative metrics: the correlation coefficient (r) and the root mean square error (RMSE), which measure the difference between the predicted image and the actual image. The metrics are formulated as follows:
where N is the total number of pixels in the predicted image, and xj and yj are the values of the jth pixel in the predicted image and the actual image, respectively; x̄ and ȳ represent the mean gray values of the predicted image and the actual image, respectively.
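Assuming r is the Pearson correlation coefficient, r = Σ(xj - x̄)(yj - ȳ) / sqrt(Σ(xj - x̄)² × Σ(yj - ȳ)²), and RMSE = sqrt(Σ(xj - yj)² / N), the two metrics can be sketched as below. The function names and the sample pixel values are illustrative, and images are assumed to be flattened into equally long lists of valid pixels.

```python
import math


def _check(predicted, actual):
    """Both inputs must be non-empty and of equal length."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")


def rmse(predicted, actual):
    """Root mean square error between predicted and actual pixel values."""
    _check(predicted, actual)
    n = len(predicted)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(predicted, actual)) / n)


def pearson_r(predicted, actual):
    """Pearson correlation coefficient between predicted and actual values."""
    _check(predicted, actual)
    n = len(predicted)
    mean_x = sum(predicted) / n
    mean_y = sum(actual) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(predicted, actual))
    var_x = sum((x - mean_x) ** 2 for x in predicted)
    var_y = sum((y - mean_y) ** 2 for y in actual)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined for a constant image")
    return cov / math.sqrt(var_x * var_y)


# Predicted values are exactly half of the actual values here, so the
# correlation is perfect even though the error is not zero.
predicted = [0.1, 0.2, 0.3]
actual = [0.2, 0.4, 0.6]
r_value = pearson_r(predicted, actual)
rmse_value = rmse(predicted, actual)
```

The example shows why the two metrics are reported together: r measures how well the spatial pattern is reproduced, while RMSE also penalizes a systematic bias in magnitude.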
3. Results and Discussion
3.1. Prediction Performance
We used the August 02 Landsat NDVI image for validation and used the April 28 and October 21 images to predict the August 02 image. Figure 3 shows the actual NDVI image and the NDVI images predicted by the four spatiotemporal fusion models for August 02, 2015. Visually, all predicted NDVI images are consistent with the actual image, and water boundaries and clear land are distinctly reproduced, which demonstrates the practicality of these spatiotemporal fusion models.
3.2. Quantitative Assessment
The scatter plots in Figure 4 compare the actual and predicted NDVI values for August 02, 2015. The NDVI values predicted by all four spatiotemporal fusion models fall close to the 1:1 line, which shows that all four models can capture changes in phenology. The predictions of ESTARFM and STVIFM, which use two input pairs, are more accurate than those of STARFM and FSDAF, which use one input pair, because two input pairs provide more spatial detail.
To better assess the accuracy of the predictions, r and RMSE were calculated and are listed in Table 1. All four methods add the estimated change to the base-date image to obtain the prediction. The NDVI image predicted by STVIFM has the lowest RMSE (r = 0.864, RMSE = 0.1191) and is slightly more accurate by that measure than the image predicted by ESTARFM, which has a marginally higher r (r = 0.867, RMSE = 0.1247). STARFM (r = 0.804, RMSE = 0.1626) and FSDAF (r = 0.810, RMSE = 0.1446) also produce reasonably accurate results, but these two models give inaccurate predictions for some pixels (Figure 3(b), Figure 3(d)), which indicates that predictions using two input pairs are relatively more accurate.
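The reported metrics can be compared directly. The short sketch below uses only the r and RMSE values stated in the text and ranks the models by each metric; the dictionary layout and variable names are illustrative.

```python
# r and RMSE values as reported in the text for August 02, 2015.
reported = {
    "STARFM": {"r": 0.804, "rmse": 0.1626},
    "ESTARFM": {"r": 0.867, "rmse": 0.1247},
    "FSDAF": {"r": 0.810, "rmse": 0.1446},
    "STVIFM": {"r": 0.864, "rmse": 0.1191},
}

# Lower RMSE is better; higher r is better.
by_rmse = sorted(reported, key=lambda name: reported[name]["rmse"])
by_r = sorted(reported, key=lambda name: reported[name]["r"], reverse=True)

best_by_rmse = by_rmse[0]
best_by_r = by_r[0]
```

The two rankings agree that the two-pair models lead, but they differ at the top: STVIFM has the lowest RMSE and ESTARFM the highest r, with a difference in r of only 0.003.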
Table 1. Comparison of r and RMSE between actual NDVI and NDVI predicted by the STARFM, ESTARFM, FSDAF, and STVIFM models in the study area on August 02, 2015.
Figure 3. (a) Actual Landsat-8 NDVI image; (b)-(e) NDVI images predicted by STARFM, ESTARFM, FSDAF, and STVIFM, respectively.
Figure 4. Scatter plots of the actual and predicted NDVI values (darker areas indicate higher density, and the line is the 1:1 line).
4. Conclusions
This study compared four spatiotemporal fusion models, STARFM, ESTARFM, FSDAF, and STVIFM, using high- and coarse-resolution NDVI data, and quantitatively analyzed their performance using r and RMSE. Across the four models, r varied between 0.804 and 0.867 and RMSE varied between 0.1191 and 0.1626, which shows that all the selected models can produce reasonable predictions. We found that STVIFM captures vegetation change and yields predictions closer to the actual NDVI image (lowest RMSE) than the other three methods. In conclusion, STVIFM is more suitable for producing NDVI time series with high spatial and temporal resolution, especially for vegetation types with different growing periods.
References
Busetto, L., Meroni, M. and Colombo, R. (2008) Combining Medium and Coarse Spatial Resolution Satellite Data to Improve the Estimation of Sub-Pixel NDVI Time Series. Remote Sens. Environ., 112, 118-131. https://doi.org/10.1016/j.rse.2007.04.004
Tewes, A., Thonfeld, F., Schmidt, M., Oomen, R., Zhu, X., Dubovyk, O., Menz, G. and Schellberg, J. (2015) Using RapidEye and MODIS Data Fusion to Monitor Vegetation Dynamics in Semi-Arid Rangelands in South Africa. Remote Sens., 7, 6510-6534. https://doi.org/10.3390/rs70606510
Bhandari, S., Phinn, S. and Gill, T. (2012) Preparing Landsat Image Time Series (LITS) for Monitoring Changes in Vegetation Phenology in Queensland, Australia. Remote Sens., 4, 1856-1886. https://doi.org/10.3390/rs4061856
Gevaert, C.M. and García-Haro, F.J. (2015) A Comparison of STARFM and an Unmixing Based Algorithm for Landsat and MODIS Data Fusion. Remote Sens. Environ., 156, 34-44. https://doi.org/10.1016/j.rse.2014.09.012
Schmidt, M., Udelhoven, T., Gill, T. and Röder, A. (2012) Long Term Data Fusion for a Dense Time Series Analysis with MODIS and Landsat Imagery in an Australian Savanna. J. Appl. Remote Sens., 6, 63512. https://doi.org/10.1117/1.JRS.6.063512
Fensholt, R. (2004) Earth Observation of Vegetation Status in the Sahelian and Sudanian West Africa: Comparison of Terra MODIS and NOAA AVHRR Satellite Data. Int. J. Remote Sens., 25, 1641-1659. https://doi.org/10.1080/01431160310001598999
Zurita-Milla, R., Clevers, J., van Gijsel, J. and Schaepman, M. (2011) Using MERIS Fused Images for Classification Mapping and Vegetation Status Assessment in Heterogeneous Landscapes. Int. J. Remote Sens., 32, 973-991. https://doi.org/10.1080/01431160903505286
Gao, F., Masek, J., Schwaller, M. and Hall, F. (2006) On the Blending of the MODIS and Landsat ETM+ Surface Reflectance. IEEE Trans. Geosci. Remote Sens., 44, 2207-2218. https://doi.org/10.1109/TGRS.2006.872081
Zhu, X., Chen, J., Gao, F., Chen, X. and Masek, J.G. (2010) An Enhanced Spatial and Temporal Adaptive Reflectance Fusion Model for Complex Heterogeneous Regions. Remote Sens. Environ., 114, 2610-2623. https://doi.org/10.1016/j.rse.2010.05.032
Zhu, X., Helmer, E.H., Gao, F., Liu, D., Chen, J. and Lefsky, M.A. (2016) A Flexible Spatiotemporal Method for Fusing Satellite Images with Different Resolutions. Remote Sens. Environ., 172, 165-177. https://doi.org/10.1016/j.rse.2015.11.016
Liao, C., Wang, J., Pritchard, I., et al. (2017) A Spatio-Temporal Data Fusion Model for Generating NDVI Time Series in Heterogeneous Regions. Remote Sens., 9, 1125. https://doi.org/10.3390/rs9111125