Missing data is a major problem encountered repeatedly in environmental research. Many causes, such as routine maintenance, sampling errors in satellite sensors, sensor failures during observations, meteorological abnormalities and human errors, are responsible for discontinuities in data sets. Geopotential height is the height of a pressure surface in the atmosphere above mean sea level [MSL]. The geopotential height data gathered from the AQUA satellite contain incomplete data matrices at 24 standard pressure levels. Research can become inaccurate if data sets with missing values are used. Geopotential height is a function of air temperature, pressure, winds, and the topography of the area, so its imputation requires a careful method. One of the oldest and most frequently suggested methods for filling in such missing information is to replace it with the mean value of neighboring samples.
Many different interpolation techniques have been developed. The best method depends upon the spatial and temporal variations of geopotential height in the atmosphere. Shen, Reiter applied different interpolations to geopotential height, keeping in view its variations in the atmosphere. Knox, Higuchi investigated secular variations, Shabbar, Higuchi carried out a regional analysis, and Griesser, Brönnimann reconstructed geopotential height for 850, 700, 500, 300, 200 and 100 hPa. White calculated statistics and climatology for the Northern Hemisphere’s geopotential height at 1000 and 500 hPa. Wallace, Zhang investigated interdecadal variability and teleconnections in the Northern Hemisphere’s geopotential height at 500 and 700 hPa, respectively.
Pakistan is a central country of South Asia, bordered by India to the east, China to the north, the Arabian Sea to the south and Afghanistan to the west (Figure 1 and Figure 2). It is an arid to semi-arid country, except in the northern areas, which receive 760 mm to 2000 mm of rainfall annually. Pakistan has four provinces, of which Baluchistan is the driest, a desert area receiving about 210 mm of rain on average. Three-quarters of the country receives no more than 250 mm of rain annually. In the summer season relative humidity remains between 20% and 50%. In winter the average temperature varies from 4˚C to 20˚C in most areas, while a temperature increase of 0.6˚C to 1.0˚C is found along the coastal areas.
The main thrust of this research is to devise a workable methodology for carrying out scientific observations of upper-atmosphere meteorology in Pakistan despite the lack of modern equipment and technological resources in the relevant departments. Published literature on this subject is scarce in Pakistan; however, Saleem and Ahmed; Saleem; Saleem are a few initiatives on upper-level atmospheric observations.
2. Material and Methods
2.1. Data Used
In this research, the monthly mean geopotential height [in meters] for the past 13 years, obtained from the Atmospheric Infrared Sounder [AIRS] level 3 product, was used. AIRS is an instrument on the AQUA satellite, which was launched in May 2002.
Figure 1. Location map of Pakistan with its neighboring regions (20).
Figure 2. Altitude map of Pakistan showing elevation [in meters] depicted in different color scales (20).
This instrument has very high spectral resolution: it captures climate data through nearly 2382 bands of the electromagnetic spectrum, and its geopotential height product has a high spatial resolution of 0.5˚ × 0.5˚ per grid cell. Version 6 of the product contains fewer biases in geopotential height. Besides the good quality of its climate data, GESDISC1 provides geopotential height data for the whole globe.
2.2. Spatial Interpolations of Missing Geopotential Height
A random 30% of the 324 samples were treated as missing data and were then estimated from the remaining 70% of known data using different interpolation techniques, namely IDW, NN, BI and NI. Robeson; Price, McKenney; Perry and Hollis; Yozgatligil, Aslan considered performance parameters such as Mean Absolute Error [MAE], Root Mean Square Error [RMSE], Coefficient of Determination [R2] and Correlation Coefficient [Corr] to find the best interpolation technique for missing climatic data sets.
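The random 30%/70% division described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_known_missing`, the fixed seed, and the synthetic 18 × 18 grid of (longitude, latitude, height) samples are hypothetical and are not taken from the AIRS data used in this research.

```python
import random


def split_known_missing(samples, missing_fraction=0.3, seed=0):
    """Randomly hide a fraction of the samples.

    Returns (known, hidden): the samples kept for interpolation and the
    samples withheld so that their estimates can be checked afterwards.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_missing = round(missing_fraction * len(samples))
    hidden_idx = set(indices[:n_missing])
    known = [s for i, s in enumerate(samples) if i not in hidden_idx]
    hidden = [s for i, s in enumerate(samples) if i in hidden_idx]
    return known, hidden


# Hypothetical grid: 18 x 18 = 324 samples of (lon, lat, height in m).
samples = [
    (60.0 + 0.5 * i, 23.0 + 0.5 * j, 100.0 + i + j)
    for i in range(18)
    for j in range(18)
]
known, hidden = split_known_missing(samples)
print(len(known), len(hidden))
```

The withheld samples keep their true values, so each interpolation technique can be scored against them with the performance parameters.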
1) INVERSE DISTANCE WEIGHTING
This imputation follows Tobler’s first law of geography: the weight of each known sample is determined by its distance from the imputed sample [Robeson, 1994]. The greater the distance of a neighbor from the predicted sample, the smaller its weight in the interpolation. Ferrari and Ozaki used Equation (1), which is given below:
where the weighting factor depends on the distance between the a-th original neighbor sample and the j-th point to be estimated, n is the total number of samples used, and r is the weighting exponent. The IDW formula of Langella was used for the missing data imputations.
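A minimal sketch of IDW in its common textbook form is given here for orientation. It assumes weights of the form 1/d^r and Euclidean distance in grid coordinates; the function name `idw_estimate` and the sample layout are hypothetical, and the exact form used by Ferrari and Ozaki or Langella may differ.

```python
import math


def idw_estimate(known, target, power=2.0):
    """Estimate the value at `target` from known samples by IDW.

    known  : list of (x, y, value) tuples
    target : (x, y) of the point to be estimated
    power  : weighting exponent r (assumed default of 2)
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in known:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            return value  # target coincides with a known sample
        weight = 1.0 / distance ** power  # nearer samples weigh more
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Two equidistant neighbors contribute equally to the estimate.
print(idw_estimate([(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)], (1.0, 0.0)))
```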
2) NEAREST NEIGHBORS INTERPOLATION [NN]
In this interpolation technique, each missing value was imputed directly with the value of the most suitable neighbor around the missing sample.
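As an illustration, the sketch below takes the "most suitable" neighbor to be the closest known sample in Euclidean distance, which is an assumption; the function name `nearest_neighbor_estimate` is hypothetical.

```python
import math


def nearest_neighbor_estimate(known, target):
    """Return the value of the known sample closest to `target`.

    known  : list of (x, y, value) tuples
    target : (x, y) of the missing sample
    """
    nearest = min(
        known,
        key=lambda s: math.hypot(s[0] - target[0], s[1] - target[1]),
    )
    return nearest[2]


known = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
print(nearest_neighbor_estimate(known, (0.4, 0.0)))
```

Because the imputed value is copied unchanged from one neighbor, the estimate cannot follow gradients between samples.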
3) BILINEAR INTERPOLATION [BI]
Junninen et al. used Equations (2) and (3) for Bilinear Interpolation.
Each is a linear equation between two known sample values, with m being the gradient of the line.
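The idea of combining linear equations can be sketched for one rectangular grid cell. This is the standard bilinear scheme (linear interpolation along x on two edges, then along y between the results), shown as an assumption rather than as the exact Equations (2) and (3); the function and argument names are hypothetical.

```python
def bilinear_estimate(x, y, x1, x2, y1, y2, z11, z21, z12, z22):
    """Bilinear estimate at (x, y) inside the cell [x1, x2] x [y1, y2].

    z11, z21 are the known values at (x1, y1) and (x2, y1);
    z12, z22 are the known values at (x1, y2) and (x2, y2).
    """
    t = (x - x1) / (x2 - x1)  # fractional position along x
    u = (y - y1) / (y2 - y1)  # fractional position along y
    lower = z11 + t * (z21 - z11)  # line along the lower edge
    upper = z12 + t * (z22 - z12)  # line along the upper edge
    return lower + u * (upper - lower)  # line between the two edges


# Centre of a unit cell with corner values 10, 20, 30 and 40.
print(bilinear_estimate(0.5, 0.5, 0.0, 1.0, 0.0, 1.0, 10.0, 20.0, 30.0, 40.0))
```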
4) NATURAL NEIGHBORS INTERPOLATION [NI]
This spatial interpolation estimates the missing geopotential height from the natural neighbors of the missing sample. D. and Boissonnat and Cazals explain the selection of such natural neighbors for randomly missing data based on Delaunay triangulation.
2.3. Performance Indicators for Interpolations
The following performance parameters have been frequently used by Robeson; Price, McKenney; Junninen, Niska; Perry and Hollis; Stahl, Moore; Norazian; Ferrari and Ozaki; Saleem and Ahmed to assess the imputation of missing climate data sets.
1) ROOT MEAN SQUARE ERROR [RMSE]
RMSE was calculated by dividing the sum of the squared differences between the imputed and actual geopotential heights by the total number of samples, and then taking the square root of this term. Smaller values indicate a better estimation of the missing data set. Equation (4) was the mathematical formula used in this research.
This parameter measures the overall difference [±] between the original and interpolated geopotential heights.
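The verbal definition of RMSE given above translates directly into a few lines of code. The function name `rmse` and the sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between actual and imputed values."""
    n = len(actual)
    squared_sum = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared_sum / n)


# Every imputed height is off by 2 m, so the RMSE is 2 m.
print(rmse([100.0, 110.0, 120.0, 130.0], [102.0, 108.0, 122.0, 128.0]))
```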
2) MEAN ABSOLUTE ERROR [MAE]
This provides more information about the residual error than RMSE. Junninen, Niska and Norazian provided Equation (5) for MAE.
MAE values range from 0 to infinity. A value close to 0 indicates a more accurate imputation of the missing data set.
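A short sketch of MAE in its usual form, the mean of the absolute differences between actual and imputed values, is given below. The function name `mae` and the sample values are hypothetical.

```python
def mae(actual, predicted):
    """Mean absolute error between actual and imputed values."""
    n = len(actual)
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / n


# Errors of +1 m and -3 m give a mean absolute error of 2 m.
print(mae([100.0, 110.0], [101.0, 107.0]))
```

Unlike RMSE, the errors are not squared, so a single large residual has less influence on the result.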
3) CORRELATION COEFFICIENT [Corr]
A value of +1 indicates a very strong correlation, while a value near 0 signifies a poor correlation between the actual and predicted geopotential heights. Equation (6) was used for the correlation coefficient in this research.
In Equation (6) the numerator represents the covariance, while the denominator represents the product of the standard deviations of the two data sets.
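The description of Equation (6), covariance divided by the product of the standard deviations, corresponds to the Pearson correlation coefficient. The sketch below assumes that form and uses population (divide by n) statistics; the function name `correlation` is hypothetical.

```python
import math


def correlation(actual, predicted):
    """Pearson correlation: covariance over product of std deviations."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    covariance = sum(
        (a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted)
    ) / n
    std_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual) / n)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    return covariance / (std_a * std_p)


# Predictions that rise in step with the actual values correlate strongly.
print(correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```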
4) COEFFICIENT OF DETERMINATION [R2]
This parameter measures the degree of correlation between the actual and predicted sample geopotential heights, and varies between 0 and 1. Noor, Abdullah suggested that values closer to 1 indicate a better fit to the data set. Rahman and Islam used the following formula for R2.
In Equation (7), Ai is the average of the predicted samples and Ao is the average of the sample values before prediction.
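Since Equation (7) is defined through the averages of the predicted and original samples, one form consistent with it is the squared correlation between the two series. The sketch below assumes that form, which may differ from the exact expression of Rahman and Islam; the function name `r_squared` is hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination as the squared Pearson correlation."""
    n = len(actual)
    mean_o = sum(actual) / n      # average of original samples (Ao)
    mean_i = sum(predicted) / n   # average of predicted samples (Ai)
    # Unnormalised sums: the 1/n factors cancel in the ratio below.
    cross = sum(
        (a - mean_o) * (p - mean_i) for a, p in zip(actual, predicted)
    )
    spread_o = sum((a - mean_o) ** 2 for a in actual)
    spread_i = sum((p - mean_i) ** 2 for p in predicted)
    return cross ** 2 / (spread_o * spread_i)


print(r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]))
```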
The results of the performance parameters for each interpolation technique are presented below.
3.1. Performance Parameters from IDW
At all pressure levels IDW showed strongly biased results. IDW produced its highest RMSE of ±14.45 m at 1 hPa, while the lowest value of this error was ±3.66 m at 925 hPa. The very low correlation coefficient between actual and predicted values indicated a low quality of interpolation for missing values of geopotential height with IDW (Table 1).
3.2. Performance Parameters from Nearest Neighbor Interpolation
The RMSE remained between ±4.925 and ±11.369 m with Nearest Neighbor Interpolation. Such a large RMSE, poor correlation, and poor fit to the surface indicated poor refilling of the data with this interpolation technique (Table 2).
3.3. Performance Parameters from Bilinear Interpolation
Bilinear Interpolation performed better than the two interpolations mentioned above. The RMSE ranged from ±2.461 to ±5.241 m in refilling gaps in the data up to 1000 hPa. The MAE remained less than 1 and a strong correlation (0.98) was found in the imputation of geopotential height. The Coefficient of Determination was close to 0.98 for imputation at 1, 1.5, 2, 3, 5, 7, 10, 15, 70, 100, 150, 200, 250 and 300 hPa (Table 3).
3.4. Performance Parameters from Natural Neighbor Interpolation
A reasonably low RMSE was obtained in refilling geopotential height at 2, 3, 5, 7, 30, 50, 70, 200, 250, 400, 500 and 600 hPa. The largest RMSE was ±5.10 m at 10 hPa and the lowest RMSE was ±2.2 m for refilling gaps in the data at 850, 925 and 1000 hPa. A good correlation coefficient [near 0.99] was obtained in the refilling of geopotential height. R2 was near 1, indicating a good line of fit between the actual and predicted data sets (Table 4).
Refilling of geopotential height at the 24 pressure levels was good with Bilinear and Natural Neighbor Imputations (Tables 1-4). In order to select the optimum interpolation of the two, scatter plots of original and estimated geopotential heights were investigated. With Bilinear Interpolation, poor data refilling occurred in February and March (Figure 3(a)).
Table 1. Results indicating poor performance parameters with Inverse Distance Weighting Interpolation.
Table 2. Results indicating poor performance indicators from Nearest Neighbor Interpolation.
Table 3. Results indicating good performance parameters for refilling of gaps in data with Bilinear Interpolation.
Table 4. Good results of performance indicators with Natural Neighbor Interpolation.
Figure 3. (a) Results of interpolation of geopotential height [1 hPa to 15 hPa] with Bilinear Interpolation; (b) Bilinear Interpolation for geopotential height imputation from 20 hPa to 250 hPa; (c) Imputation of geopotential height from 300 hPa to 1000 hPa with Bilinear Interpolation.
The imputations for the months of January, February and March were not precise at 20, 30, 50, 70, 100, 150, 200 and 250 hPa with Bilinear Interpolation. For the remaining pressure levels, Bilinear Interpolation accurately filled the gaps in the geopotential height (Figure 3(b)).
The original and imputed samples for each month were plotted together to create these scatter plots. However, at 500, 600, 850 and 1000 hPa, Bilinear Interpolation poorly filled the months of February, March and April (Figure 3(c)).
The same technique of plotting original samples against imputed samples was used to create a scatter plot for each month with Natural Neighbor Interpolation. Natural Neighbor Interpolation imputations were more precise than those of Bilinear Interpolation. Only the month of February was not filled well by Natural Neighbor Interpolation. Natural Neighbor Interpolation precisely filled the geopotential height at 1, 1.5, 2, 3, 5, 7, 10 and 15 hPa (Figure 4(a)). The imputations with Natural Neighbor Interpolation for 20 hPa to 250 hPa and for 300 hPa to 1000 hPa are illustrated in Figure 4(b) and Figure 4(c), respectively.
Figure 4. (a) Imputation of geopotential height (1 hPa to 15 hPa) with Natural Neighbor Interpolation; (b) Imputation of geopotential height (20 hPa to 250 hPa) with Natural Neighbor Interpolation; (c) Imputation of geopotential height (300 hPa to 1000 hPa) with Natural Neighbor Interpolation.
AQUA satellite data were interpolated to fill missing geopotential height data. Based on critical checks and evaluation of the interpolation products, it is concluded that the NN and IDW interpolations did not perform well in filling missing geopotential height data (Table 1 and Table 2). Good results were found with BI and NI. However, after examining the scatter plots of each month, it was found that NI was more accurate and reliable for missing geopotential height data at the 24 pressure levels.
The authors wish to acknowledge the valuable guidance provided by Mr. Thomas Hearty and Mr. Edward T Olsen to refill gaps in the AIRS relative humidity data set. The valuable suggestions of Mr. Alessio Martion, University of Rome La Sapienza, Italy, which helped to improve this research, are also appreciated.
1 GESDISC stands for Goddard Earth Sciences Data and Information Services Center.