The Water Quality Evaluation in Balihe Lake Based on Principal Component Analysis


1. Introduction

As water pollution becomes increasingly prominent, research on comprehensive water quality evaluation methods has become particularly important (Sina, 2017; Noori et al., 2019). In comprehensive water quality evaluation, the large number of complex environmental factors makes the work laborious and the data analysis difficult, and the root cause of water quality deterioration may still go unidentified (Mena-Rivera et al., 2017; Yang, 2010).

At present, many comprehensive water quality evaluation methods are in common use, such as the comprehensive index method, the fuzzy evaluation method and the neural network method (Deng & Li, 2010; Wong & Hu, 2013). Although these methods can evaluate water quality status well, they cannot determine the main factors affecting water quality (Chang et al., 2011). Principal Component Analysis (PCA) can extract the relevant factors from many variables, determine the main factors affecting water quality, and thus support a reasonable explanation (Friedman, Hastie, & Tibshirani, 2010; Olsen et al., 2012; Zhong et al., 2018; Sun et al., 2019). For example, Sun et al. (2019) used PCA to analyze the temporal and spatial patterns of river water quality and to evaluate the pollution status of a natural river. Similar studies can also be found for lake ecosystems (Zhong et al., 2018).

In this study, field sampling data from a freshwater lake were used to evaluate the water quality in the sampling area with the PCA method, in order to identify the main factors affecting water quality in the area and to provide guidance for water environmental governance and improvement.

2. Materials and Method

2.1. Study Area

Balihe Lake is located at the intersection area of the Huaihe River and the Shaying River in Fuyang City, Anhui Province, P. R. China. Balihe Lake, an artificially excavated lake, was originally the largest tributary of the Shaying River Basin in Anhui Province. With the geographical coordinates of E116˚14'-116˚19' and N32˚33'-32˚36', it belongs to the semi-humid monsoon climate zone of the subtropical and warm temperate zones. The total drainage area of Balihe Lake is about 500 km^{2}, accounting for about one-eighth of the total area of the Anhui section of the Shaying River Basin. In addition, as can be seen in Figure 1, three rivers, the Disanhugou River, the Liugou River and the Wulihugou River, flow into the lake.

Water pollution in the Balihe Lake Basin has not only seriously affected the economic development of the basin and the stability of the ecosystem, but has also affected the ecological environment and water quality of the Shaying River, bringing tremendous pressure to the improvement of water quality in the Huaihe River Basin. Generally speaking, there are two main sources of water pollution in the lake drainage area: first, non-point source pollution carried along the lake coastline by rainfall runoff; second, pollutants from the rivers flowing into the lake. Therefore, the analysis of water quality at different locations in Balihe Lake is of great significance to water pollution control in the Huaihe River Basin. In this study, the PCA method was applied to the comprehensive water quality evaluation of Balihe Lake. The water pollution status of Balihe Lake was then analyzed comprehensively, and the main pollution factors were identified, which may provide some guidance for water pollution control in Balihe Lake and the Huaihe River Basin.

Figure 1. Balihe Lake and the distribution of sampling sites.

2.2. Sample Collection and Analysis

In October 2017, a field sampling survey was carried out at 15 sampling sites in Balihe Lake (see Figure 1). Only surface water samples were collected because the water depth at the surveyed sites was within 10 m. Following the indicators of greatest concern in water environment monitoring in China, 11 water quality indicators were measured according to the national standard methods (Environmental Protection Administration of Peoples Republic of China, 2002, 2009): dissolved oxygen (DO), total nitrogen (TN), total dissolved nitrogen (TDN), ammonia nitrogen (NH_{3}-N), nitrate nitrogen (${\text{NO}}_{3}^{-}$-N), nitrite nitrogen (${\text{NO}}_{2}^{-}$-N), total phosphorus (TP), total dissolved phosphorus (TDP), phosphate (${\text{PO}}_{4}^{3-}$), chemical oxygen demand (COD) and chlorophyll a (Chl-a).

2.3. PCA Method

Principal component analysis (PCA), also known as principal variable analysis, uses the idea of dimensionality reduction to transform multiple indicators into a few comprehensive indicators while minimizing the loss of information in the data (Debels et al., 2005; Ouyang, 2005). In PCA, each comprehensive indicator obtained by the transformation is called a principal component. Each principal component is a linear combination of the original variables, and the principal components are uncorrelated with one another. Therefore, only a few principal components need to be considered to capture the main features of a complex problem and to avoid collinearity between variables, while most of the information in the original data is retained. As a result, the efficiency of the analysis can be improved significantly. Using IBM SPSS Statistics 25.0, PCA was carried out on the 11 water quality indicators measured at the 15 sampling sites mentioned above.
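The properties described above can be illustrated with a minimal NumPy sketch. It is not the SPSS procedure used in this study; the function name `pca_correlation` and the random input data are assumptions made purely for illustration. The sketch standardizes the data, decomposes the correlation matrix, and checks that the resulting component scores are mutually uncorrelated with variances equal to the eigenvalues.

```python
import numpy as np

def pca_correlation(X):
    """PCA on the correlation matrix of X (rows = sites, columns = indicators)."""
    X = np.asarray(X, dtype=float)
    # Standardize each column with the population standard deviation (ddof=0).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Correlation matrix of the indicators.
    R = (Z.T @ Z) / Z.shape[0]
    # eigh is meant for symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]          # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Each principal component is a linear combination of the standardized indicators.
    scores = Z @ eigvecs
    return eigvals, eigvecs, scores

# Hypothetical data: 15 sites x 4 indicators, with two correlated indicators.
rng = np.random.default_rng(0)
data = rng.normal(size=(15, 4))
data[:, 1] = data[:, 0] + 0.3 * data[:, 1]

eigvals, eigvecs, scores = pca_correlation(data)
# Covariance of the scores: diagonal = eigenvalues, off-diagonal = 0.
score_cov = (scores.T @ scores) / scores.shape[0]
```

Because the decomposition is applied to a correlation matrix, the eigenvalues sum to the number of indicators, which is why an eigenvalue can be read as "how many standardized indicators' worth of variance" a component carries.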

3. Results and Discussion

3.1. Standardization of the Experimental Data

The original data of the 11 indicators were standardized to eliminate the influence of differing magnitudes and dimensions. After standardization, each indicator has a mean of 0 and a standard deviation of 1. Equation (1) gives the calculation formula, and the results are shown in Table 1.

${Z}_{i}=\frac{{x}_{i}-\bar{x}}{\sqrt{\frac{1}{m}{\displaystyle {\sum}_{j=1}^{m}{\left({x}_{j}-\bar{x}\right)}^{2}}}},\quad \bar{x}=\frac{1}{m}{\displaystyle {\sum}_{j=1}^{m}{x}_{j}}$ , (1)

where m is the number of sampling sites, x_{i} is the original value of the indicator at site i, $\bar{x}$ is the mean of the indicator over all sites, and Z_{i} is the standardized value (Yang, 2010; Wu, 2019).

Table 1. Standardized data.
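As a worked illustration of Equation (1), the short sketch below standardizes one indicator using the population standard deviation (the 1/m form in the equation). The function name `standardize` and the five DO readings are hypothetical and are not taken from Table 1.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Z-score standardization following Equation (1) (population form, 1/m)."""
    mean = fmean(values)
    sd = pstdev(values)  # divides by m, not m - 1
    if sd == 0:
        raise ValueError("indicator is constant; standardization is undefined")
    return [(x - mean) / sd for x in values]

# Hypothetical DO readings (mg/L) at five sites, for illustration only.
do_values = [6.0, 7.0, 8.0, 9.0, 10.0]
z = standardize(do_values)
# z is approximately [-1.414, -0.707, 0.0, 0.707, 1.414]
```

Standardization fixes the mean and the spread of each indicator but does not change the shape of its distribution.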

3.2. Principal Component Analysis

The standardized data were analyzed by the PCA method. Table 2 shows that the Kaiser-Meyer-Olkin (KMO) statistic is 0.624 (>0.500) and that the significance level of Bartlett's test of sphericity is less than 0.001. This indicates that the variables are interrelated and that the data meet the basic requirements of PCA.
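For readers who want to see what these two suitability checks compute, the following is a minimal sketch based on the textbook definitions, not on the SPSS output in Table 2. The function names and the 3 × 3 equicorrelation matrix are assumptions for illustration. Bartlett's statistic is compared against a chi-square distribution with p(p-1)/2 degrees of freedom; the p-value lookup is omitted here to keep the sketch free of extra dependencies.

```python
import numpy as np

def bartlett_sphericity(R, n):
    """Bartlett's test statistic and degrees of freedom for a p x p correlation matrix R
    estimated from n observations."""
    p = R.shape[0]
    _, logdet = np.linalg.slogdet(R)
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) // 2
    return chi2, dof

def kmo(R):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)            # partial correlations
    off = ~np.eye(R.shape[0], dtype=bool)      # mask of off-diagonal entries
    r2 = np.sum(R[off] ** 2)
    a2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + a2)

# Hypothetical correlation matrix: three indicators, all pairwise r = 0.5.
R_demo = np.array([[1.0, 0.5, 0.5],
                   [0.5, 1.0, 0.5],
                   [0.5, 0.5, 1.0]])
chi2, dof = bartlett_sphericity(R_demo, n=15)
kmo_value = kmo(R_demo)
```

A KMO value close to 1 means that partial correlations are small relative to the simple correlations, so shared components are plausible; values below 0.5 are usually taken as unsuitable.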

Spearman correlation analysis was used to analyze the correlations between the 11 indicators, and the resulting correlation coefficients are shown in Table 3. The greater the absolute value of the correlation coefficient between two indicators, the stronger the correlation between them. A positive coefficient indicates a positive correlation between two indicators, and a negative coefficient a negative one. As can be seen in Table 3, there are strong correlations between some indicators. For example, 7 indicators are negatively correlated with DO, which indicates that these indicators may be oxygen-consuming. Chl-a is positively correlated with DO, which is consistent with the understanding that Chl-a is the main pigment for photosynthesis. In addition, there are strong positive correlations between TP, TDP and ${\text{PO}}_{4}^{3-}$, which indicates that the water quality information reflected by these indicators overlaps, so the data are suitable for principal component analysis (Singh et al., 2004).
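The Spearman coefficient is the Pearson correlation computed on ranks, which the following dependency-free sketch makes explicit. The helper names and the DO and COD readings are hypothetical and do not come from Table 3; tied values receive their average rank.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical readings at five sites: DO falls as COD rises.
do_demo = [8.1, 7.4, 6.9, 6.2, 5.8]
cod_demo = [12.0, 15.5, 14.8, 19.1, 22.3]
rho = spearman(do_demo, cod_demo)  # strongly negative
```

A negative coefficient of this kind is what the text interprets as a sign of an oxygen-consuming indicator.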

Table 2. KMO and Bartlett’s test results.

Table 3. Correlation matrix.

Generally speaking, three principles are used to determine the number of principal components: 1) the eigenvalue λ of a retained principal component should be larger than 1; 2) the cumulative percentage of variance explained by the retained principal components should exceed 80% - 85%; 3) the number of principal components should correspond to the inflection point of the eigenvalue curve, where its slope changes abruptly. The eigenvalue represents how strongly a principal component influences the selected indicators; a component with an eigenvalue below 1 explains less variance than a single standardized indicator and is therefore not informative enough.

According to the table of total variance explained (Table 4), the eigenvalues λ_{1}, λ_{2} and λ_{3} of the 1^{st}, 2^{nd} and 3^{rd} principal components are 5.466, 2.999 and 1.657, respectively. The corresponding percentages of variance are 49.689%, 27.265% and 15.062%, and the cumulative contribution of these three components reaches 92.016%, which satisfies the first two extraction principles mentioned above. These three principal components can therefore be considered to retain most of the information in the 11 environmental indicators. According to the scree plot of the PCA eigenvalues (Figure 2), λ_{4} is less than 1 and the curve flattens after this eigenvalue, which means that the explanatory power of the later components is small and that the inflection occurs at λ_{4}. Taken together, the number of principal components was set to 3.
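The arithmetic behind the first two selection principles can be checked with a few lines of Python. This sketch assumes that PCA was run on 11 standardized indicators, so the total variance is 11, and it takes the first eigenvalue as 5.466, the value consistent with the reported 49.689% share. The function name `variance_explained` is illustrative, and the small differences from the reported percentages come from rounding of the eigenvalues.

```python
def variance_explained(eigenvalues, n_variables):
    """Percentage and cumulative percentage of variance for each eigenvalue."""
    pct = [100.0 * ev / n_variables for ev in eigenvalues]
    cumulative, running = [], 0.0
    for p in pct:
        running += p
        cumulative.append(running)
    return pct, cumulative

reported = [5.466, 2.999, 1.657]   # eigenvalues of the three retained components
pct, cum = variance_explained(reported, n_variables=11)

kaiser_ok = all(ev > 1.0 for ev in reported)   # principle 1: eigenvalue > 1
cumulative_ok = cum[-1] >= 85.0                # principle 2: cumulative share threshold
```

Both checks pass for the three retained components, in line with the choice made in the text.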

Figure 2. Scree plot of component variance eigenvalues.

Table 4. Total variance explained.

The factorial load matrix (Table 5) is calculated directly by SPSS; its values are the correlation coefficients between the principal components and the original variables. The greater the absolute value of a coefficient, the stronger the correlation and the closer the relationship. DO, TDN, ${\text{NO}}_{3}^{-}$-N, ${\text{NO}}_{2}^{-}$-N, TP, TDP, ${\text{PO}}_{4}^{3-}$ and COD have high loads on the 1^{st} principal component, which indicates that this principal component comprehensively reflects the information of these eight indicators and can be interpreted as the level of oxygen-consuming pollutants in Balihe Lake. Similarly, the 2^{nd} principal component comprehensively reflects the information of TN and Chl-a and can be interpreted as the level of water eutrophication. The 3^{rd} principal component mainly reflects the information of NH_{3}-N.
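The statement that loads are correlations between components and original variables can be verified numerically. The sketch below is an unrotated NumPy illustration on hypothetical random data, not a reproduction of Table 5; all variable names are assumptions. It builds the loads as eigenvectors scaled by the square roots of the eigenvalues and compares them with directly computed correlations.

```python
import numpy as np

# Hypothetical data: 15 sites x 5 indicators with some induced correlation.
rng = np.random.default_rng(1)
X = rng.normal(size=(15, 5))
X[:, 1] += 0.8 * X[:, 0]
X[:, 2] -= 0.6 * X[:, 0]

Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardized indicators
R = (Z.T @ Z) / len(Z)                         # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Load of indicator j on component k = eigenvector entry * sqrt(eigenvalue k).
loadings = eigvecs * np.sqrt(eigvals)

# Direct correlations between indicators (rows) and component scores (columns).
scores = Z @ eigvecs
p = Z.shape[1]
direct = np.corrcoef(Z.T, scores.T)[:p, p:]

# Component on which each indicator loads most heavily (by absolute value).
dominant = np.argmax(np.abs(loadings), axis=1)
```

Grouping indicators by their dominant component, as `dominant` does here, is the same reading of the load matrix that the text uses to name the three components.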

Table 5. The factorial load matrix.

Table 6. The principal component coefficient matrix.

The factorial load matrix is not the principal component coefficient matrix. Dividing each column of the factorial load matrix by the square root of the corresponding principal component eigenvalue gives the principal component coefficient matrix (Table 6). Multiplying the component coefficient matrix by the standardized data gives the evaluation functions F1, F2 and F3 corresponding to each principal component, from which the comprehensive evaluation function F is obtained. Based on these evaluation functions, the pollution score of each sampling site can be described quantitatively. The higher the score, the more serious the pollution. The expressions of the functions are as follows, where ZX1 to ZX11 denote the standardized values of the 11 indicators:

$\begin{array}{l}F1=-0.181ZX1+0.032ZX2+0.036ZX3+0.172ZX4-0.051ZX5-0.168ZX6\\ \qquad+0.227ZX7+0.275ZX8+0.261ZX9-0.002ZX10-0.085ZX11\end{array}$ (2)

$\begin{array}{l}F2=0.171ZX1+0.302ZX2+0.249ZX3+0.013ZX4+0.215ZX5+0.121ZX6\\ \qquad-0.024ZX7+0.041ZX8+0.041ZX9-0.038ZX10+0.316ZX11\end{array}$ (3)

$\begin{array}{l}F3=-0.013ZX1-0.015ZX2+0.144ZX3-0.454ZX4+0.300ZX5-0.027ZX6\\ \qquad-0.042ZX7-0.175ZX8-0.131ZX9+0.307ZX10+0.039ZX11\end{array}$ (4)

$\begin{array}{l}F=\left[{\lambda}_{1}/\left({\lambda}_{1}+{\lambda}_{2}+{\lambda}_{3}\right)\right]F1+\left[{\lambda}_{2}/\left({\lambda}_{1}+{\lambda}_{2}+{\lambda}_{3}\right)\right]F2+\left[{\lambda}_{3}/\left({\lambda}_{1}+{\lambda}_{2}+{\lambda}_{3}\right)\right]F3\\ \quad=0.540F1+0.296F2+0.164F3\end{array}$ (5)
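The weights in Equation (5) can be reproduced from the three retained eigenvalues. In the sketch below, the first eigenvalue is taken as 5.466 (the value consistent with the reported 49.689% variance share), and the component scores for the example site are hypothetical numbers chosen only to show how F is assembled.

```python
def composite_weights(eigenvalues):
    """Weight of each component = its eigenvalue divided by the sum of retained eigenvalues."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

def composite_score(component_scores, weights):
    """Comprehensive score F as the weighted sum of F1, F2, F3."""
    return sum(w * f for w, f in zip(weights, component_scores))

weights = composite_weights([5.466, 2.999, 1.657])
rounded = [round(w, 3) for w in weights]       # [0.54, 0.296, 0.164]

# Hypothetical (F1, F2, F3) scores for a single site, for illustration only.
example_site = (1.20, -0.35, 0.10)
F_example = composite_score(example_site, weights)
```

Because the first weight is more than half of the total, the ranking by F is expected to follow the ranking by F1 closely.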

3.3. Comprehensive Water Quality Evaluation

The scores and ranks of principal components 1, 2 and 3 and of the comprehensive principal component were calculated and are shown in Table 7. As can be seen in the table: 1) according to the ranks for the 1^{st} principal component, the top five sampling sites are 4, 5, 6, 3 and 2, indicating that oxygen-consuming pollution is relatively serious at these places, with site 4 scoring highest; 2) for principal component 2, the top five sites are 8, 7, 6, 5 and 10, which means that eutrophication may be more serious there than at the other sites; 3) the top five sites for principal component 3 are 1, 2, 3, 8 and 4, where ammonia nitrogen pollution may be serious; 4) the top five sites for the comprehensive principal component, 4, 5, 6, 7 and 3, are similar to those for the 1^{st} principal component. Taking items 1) and 4) together, it can be concluded that the upper part of Balihe Lake is likely to be seriously affected by oxygen-consuming pollution and should be treated adequately. In addition, the main treatment measures should target the sources of oxygen-consuming pollutants, such as domestic sources and non-point sources. The lower ranks of sites 11, 12, 13, 14 and 15 indicate that the water quality of the lower part of the lake is better, and conservative measures may be taken in this area.
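The ranking step is a simple descending sort of the scores, since a higher score indicates more serious pollution. The sketch below uses hypothetical scores for five sites rather than the values in Table 7, and the function name `rank_sites` is illustrative.

```python
def rank_sites(scores):
    """Return site identifiers ordered from highest to lowest score."""
    return sorted(scores, key=lambda site: scores[site], reverse=True)

# Hypothetical comprehensive scores keyed by site number, for illustration only.
hypothetical_scores = {1: 0.42, 2: 0.55, 3: 0.91, 4: 1.80, 5: 1.35}

ranking = rank_sites(hypothetical_scores)                         # most polluted first
ranks = {site: pos for pos, site in enumerate(ranking, start=1)}  # site -> rank
```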

4. Conclusion

Based on field measurements of 11 environmental factors at 15 sampling sites, the water quality in Balihe Lake was evaluated using the PCA method. The following conclusions can be drawn.

1) There are obvious correlations among some of the 11 environmental factors. The 3 extracted principal components, which account for 92.016% of the total variance, can well explain the water quality status in Balihe Lake. The 1^{st}, 2^{nd} and 3^{rd} principal components represent pollution by oxygen-consuming pollutants, eutrophication and ammonia nitrogen, respectively.

2) Sampling sites 4, 5, 6, 7 and 3, which have relatively high PCA scores, are all located in the upper part of the lake. Water pollution at these sites is more serious, and the main pollutants are oxygen-consuming. Therefore, more attention should be paid to these areas in future water pollution prevention and treatment.

3) Sites 11, 12, 13, 14 and 15, which have lower PCA scores, are concentrated in the lower part of the lake. Given the better water quality in this area, conservative measures may be taken there.

Table 7. Score and ranks of the principal components.

Although the results of this study may provide some guidance for water pollution prevention and treatment in Balihe Lake, further research on this topic using other methods is necessary in the future.

Acknowledgements

The authors gratefully acknowledge the financial support provided by the Chinese National Major Science and Technology Program for Water Pollution Control and Treatment (No. 2015ZX07204-007) and the Chinese Fundamental Research Funds for the Central Universities (No. 2017MS055).

References

[1] Chang, K., Gao, J. L., Wu, W. Y., & Yuan, Y. X. (2011). Water Quality Comprehensive Evaluation Method for Large Water Distribution Network Based on Clustering Analysis. Journal of Hydroinformatics, 13, 390.

https://doi.org/10.2166/hydro.2011.021

[2] Debels, P., Figueroa, R., Urrutia, R., Barra, R., & Niell, X. (2005). Evaluation of Water Quality in the Chillán River (Central Chile) Using Physicochemical Parameters and a Modified Water Quality Index. Environmental Monitoring and Assessment, 110, 301-322.

https://doi.org/10.1007/s10661-005-8064-1

[3] Deng, D., & Li, Y. J. (2010). Application of Rough Set and Fuzzy Comprehensive Evaluation Method in Water Quality Assessment. In International Conference on Computing, Control and Industrial Engineering (pp. 126-128). Wuhan: IEEE.

https://doi.org/10.1109/CCIE.2010.150

[4] Environmental Protection Administration of Peoples Republic of China (2002). Surface Water Environmental Quality Standards (GB 3838-2002). (In Chinese)

[5] Environmental Protection Administration of Peoples Republic of China (2009). Monitoring and Analysis Methods of Water and Wastewater (4th ed.). Beijing: China Environmental Science Press. (In Chinese)

[6] Friedman, J., Hastie, T., & Tibshirani, R. (2010). The Elements of Statistical Learning (Vol. 1, No. 10). New York: Springer.

[7] Mena-Rivera, L., Salgado-Silva, V., Benavides-Benavides, C., Coto-Campos, J. M., & Swinscoe, T. H. A. (2017). Spatial and Seasonal Surface Water Quality Assessment in a Tropical Urban Catchment: Burío River, Costa Rica. Water, 9, 558.

https://doi.org/10.3390/w9080558

[8] Noori, R., Berndtsson, R., Hosseinzadeh, M., Adamowski, J. F., & Abyaneh, M. R. (2019). A Critical Review on the Application of the National Sanitation Foundation Water Quality Index. Environmental Pollution, 244, 575-587.

https://doi.org/10.1016/j.envpol.2018.10.076

[9] Olsen, R. L., Chappell, R. W., & Loftis, J. C. (2012). Water Quality Sample Collection, Data Treatment and Results Presentation for Principal Components Analysis—Literature Review and Illinois River Watershed Case Study. Water Research, 46, 3110-3122.

https://doi.org/10.1016/j.watres.2012.03.028

[10] Ouyang, Y. (2005). Evaluation of River Water Quality Monitoring Stations by Principal Component Analysis. Water Research, 39, 2621-2635.

https://doi.org/10.1016/j.watres.2005.04.024

[11] Sina, Z. (2017). Modification of Expected Conflicts between Drinking Water Quality Index and Irrigation Water Quality Index in Water Quality Ranking of Shared Extraction Wells Using Multi Criteria Decision Making Techniques. Ecological Indicators, 83, 368-379.

https://doi.org/10.1016/j.ecolind.2017.08.017

[12] Singh, K. P., Malik, A., Mohan, D. et al. (2004). Multivariate Statistical Techniques for the Evaluation of Spatial and Temporal Variations in Water Quality of Gomti River (India)—A Case Study. Water Research, 38, 3980-3992.

https://doi.org/10.1016/j.watres.2004.06.011

[13] Sun, X. W., Zhang, H. Y., Zhong, M. F., Wang, Z. Y., Liang, X. Q., Huang, T. S., & Huang, H. (2019). Analyses on the Temporal and Spatial Characteristics of Water Quality in a Seagoing River Using Multivariate Statistical Techniques: A Case Study in the Duliujian River, China. International Journal of Environmental Research and Public Health, 16, 1020.

https://doi.org/10.3390/ijerph16061020

[14] Wong, H., & Hu, B. Q. (2013). Application of Interval Clustering Approach to Water Quality Evaluation. Journal of Hydrology, 491, 1-12.

https://doi.org/10.1016/j.jhydrol.2013.03.009

[15] Wu, D. F. (2019). Application of Set Pair Model Based on Principal Component Analysis in Water Quality Evaluation. Water Conservancy Science and Technology and Economy, 25, 1-7. (In Chinese)

[16] Yang, Y. (2010). Management of Agricultural Pollution in China: Current Status and International Experience. In International Conference on Management and Service Science (pp. 939-946). Wuhan: IEEE.

https://doi.org/10.1109/ICMSS.2010.5576720

[17] Zhong, M. F., Zhang, H. Y., Sun, X. W., Wang, Z. Y., Tian, W., & Huang, H. (2018). Analyzing the Significant Environmental Factors on the Spatial and Temporal Distribution of Water Quality Utilizing Multivariate Statistical Techniques: A Case Study in the Balihe Lake, China. Environmental Science and Pollution Research, 25, 29418-29432.

https://doi.org/10.1007/s11356-018-2943-9