Received 1 December 2015; accepted 26 April 2016; published 29 April 2016
Recently, there has been important progress in techniques for observing the environment, and as a result the volume of remote sensing data, such as satellite data and point data acquired from various kinds of land observation instruments, has increased. Much useful and detailed data that was previously very difficult to obtain is now available, which contributes to progress in various research fields. However, because of the sheer volume of new data, researchers have not been able to process it using traditional analytical techniques. Much of the information collected in the past few years is simply being stored.
“If you torture the data long enough, nature will confess,” said the economist and 1991 Nobel laureate Ronald Coase. Although this statement still holds, the objective is not easy to achieve. First, “long enough” can in practice be “too long” in many applications, and therefore unacceptable. Second, to obtain “confessions” from large data sets we must use state-of-the-art tools of “torture”. Third, as nature has confessed some of her most hidden secrets, it seems that she has become more stubborn and unwilling to share any more.
In our view, many of the best techniques of “torture” can be found in what is now known as data mining. Data mining is the essential ingredient in the more general process of knowledge discovery in databases (KDD). The idea is that, by automatically sifting through large amounts of data, it should be possible to extract the nontrivial knowledge inherent in them. Data mining has become fashionable not only in computer science but especially in business; why, then, is it different from statistics? Data mining certainly uses statistics and is even based on it, but it also incorporates database management techniques and modern artificial intelligence algorithms. It includes all of these, yet it differs from each of them. More than a collection of many types of analysis, data mining is distinguished by a distinctive approach, a new attitude toward data analysis. The emphasis is not so much on the extraction of facts as on the generation of hypotheses. Its aim has more to do with generating new and better questions than with refining answers to traditional ones. To achieve this, data mining uses a vast collection of statistical techniques and artificial intelligence methods, such as neural networks, factor analysis, time series analysis, Bayesian networks, decision trees, statistical models, multivariate statistical analysis and clustering analysis.
Data mining has been applied with success in many areas of science, such as biology, astronomy and medicine, to mention just a few. In particular, given the multisystem nature of the Earth sciences, we consider that incorporating data mining into this discipline would provide a new perspective from which to analyze old problems and would likely suggest new ones.
2. Site of Study
The Toluca Valley is located in the State of Mexico (Figure 1), within the Lerma watershed in the south of the Mexican Plateau. It is bounded to the north by the Atlacomulco-Ixtlahuaca aquifer, to the south by the Tenango hill, to the southwest by the Nevado de Toluca volcano and to the east by the Sierra de las Cruces mountain chain, and it covers approximately 2738 km².
Figure 1. Location map of the Toluca Valley in Mexico. The Toluca Valley is a valley in central Mexico, just west of Mexico City. Since the 1940s, there has been significant environmental degradation in the valley, with the loss of forests, soil erosion, falling water tables and water pollution due to growth in industry and population.
The Toluca Valley is part of the Rio Lerma basin, which has good potential for groundwater exploitation. Its groundwater is used not only by local farmers and by Toluca and other small cities; large volumes are also exported through the Lerma well battery system to supply Mexico City, which makes the basin a strategic source of water.
According to INEGI (National Institute of Statistics and Geography) census data, the State of Mexico has 1,107,964 inhabitants, accounting for 13% of Mexico’s population, and is the entity with the highest population density. The state’s population growth has been uneven and concentrated in narrowly defined areas, including the municipalities of Toluca, Metepec, Lerma and Zinacantepec. From 1990 to 1995 the population grew nearly 17% owing to rapid industrial growth and residential development, a trend that continues today. Between 1950 and the 1980s, the State of Mexico moved from seventh to first place among the 32 states in total population. Much of this increase, both state and regional, occurred during the 1960s and 1970s, when average annual growth rates were 7% and 4%, respectively.
Outside the metropolitan area, the economy is still based on agriculture and livestock, with some income from tourism. Only a little over 4% of the total municipal population engages in agriculture, raising corn, wheat, beans, potatoes, peas, fava beans and oats on a little over half of the municipality’s territory. Livestock raising is a greater source of income, with 10,286 sites producing cattle, pigs, sheep and domestic fowl.
In this work we used data-mining techniques to analyze a 40-year piezometric level data set from the Toluca Valley in central Mexico. The monitoring network was built in the late 1960s to record hydraulic head in the aquifer. Each monitoring location in the network consists of a nest of piezometers (bores), with up to eight piezometers installed at depths ranging from 10 to 200 m (Figure 1). Hydraulic head has been measured in the network on a monthly basis since 1969 (Figure 2). The network is currently operated by the National Water Commission (CONAGUA) and provides information for analyzing the space-time response of the hydrogeologic system to external forcing, including pumping.
To explore the relations between the evolution of the groundwater system and socioeconomic factors, we selected seven socioeconomic variables from INEGI, Mexico’s National Institute of Statistics and Geography. The variables selected are: 1) gross national product (GNP), 2) total country population, 3) urban country population, 4) rural country population, 5) total state population, 6) urban state population, and 7) rural state population.
Figure 2. Boxplots of the piezometric readings [m] of the eight bores over all wells and years. Bores are ordered from deepest to shallowest.
Merging large databases that were acquired from different sources and for different purposes, and that use different data representations, has become such an important problem in data mining that data cleaning is considered a crucial first step in knowledge discovery. The current work started with extensive data cleaning (the original data set is available at http://www.geologia-feflow.unam.mx/documentos/base_toluca_original.ods), which included removing text and undesired headers, standardizing annotations, searching for missing and misleading data, and standardizing the data. In general, missing and misleading data were ignored. For this purpose we used awk scripts, SQL management and manual supervision. We then performed exploratory statistical analysis, clustering and analysis of variance (ANOVA) to understand the internal dependence and structure of the data. We merged the two databases into one, which we analyzed from a multivariate statistical point of view. Principal component analysis and canonical correspondence analysis were performed to identify suitable variables for classification and clustering analysis.
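The cleaning and merging steps described above can be illustrated with a minimal Python sketch. It is not the authors' awk/SQL workflow: the three-column layout (year, bore, head), the sample rows and the GNP values are all hypothetical and serve only to show the pattern of dropping headers and text, ignoring missing values, standardizing a variable, and joining piezometric records to yearly socioeconomic records.

```python
import csv
import io
import statistics


def clean_rows(raw_text):
    """Parse CSV text, dropping headers, stray text and records with missing values."""
    records = []
    for row in csv.reader(io.StringIO(raw_text)):
        if len(row) != 3:
            continue  # stray text or malformed line
        try:
            year, bore, head = int(row[0]), int(row[1]), float(row[2])
        except ValueError:
            continue  # header, annotation, or missing/non-numeric value
        records.append({"year": year, "bore": bore, "head": head})
    return records


def standardize(values):
    """Return z-scores (zero mean, unit sample standard deviation)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def merge_by_year(records, yearly):
    """Attach yearly socioeconomic values to each record (inner join on year)."""
    return [{**r, **yearly[r["year"]]} for r in records if r["year"] in yearly]


# Hypothetical raw export: one header, one free-text note, two unusable rows.
raw = """Year,Bore,Head
# station notes (hypothetical free text)
1969,1,2570.5
1969,2,
1970,1,2569.0
1971,1,n/a
1971,2,2567.5
"""
gnp = {1969: {"gnp": 1.0}, 1970: {"gnp": 1.1}}  # hypothetical yearly values

records = clean_rows(raw)
merged = merge_by_year(records, gnp)
z = standardize([r["head"] for r in records])
```

Rows that fail to parse are skipped rather than imputed, which mirrors the statement that missing and misleading data were in general ignored.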
We consider it relevant to point out that only free software was used for this work: a Debian Squeeze GNU/Linux operating system; awk for preprocessing; MySQL, OpenOffice and Gnumeric for data management; R for exploratory and multivariate analysis; Weka for classification (J48 algorithm) and clustering analysis (k-means); and LyX for text processing of the paper. This note is important because software license prices may impose severe limitations on science in developing countries such as Mexico and, more importantly, because it is part of the struggle to ensure free access to scientific products financed with government money.
The exploratory statistical analysis shown in Table 1 and in Figure 2, Figure 3 and Figure 4 reveals that only the year and piezometer variables are statistically independent, whereas the socioeconomic variables are very highly correlated (Table 2). Piezometers were grouped according to depth as deep (bore 1), medium (bores 2 - 5) and shallow (bores 6 - 8). Similarly, years were grouped into four time periods: 1969 to 1977, 1978 to 1989, 1990 to 1996 and 1997 to 2002.
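As a rough illustration of the grouping and correlation steps, the following Python sketch encodes the depth classes and time periods given above and computes a Pearson correlation coefficient. The function names and the two population series are hypothetical; the paper's own analysis was carried out in R.

```python
import math

# Time periods as listed in the text (inclusive bounds).
PERIODS = [(1969, 1977), (1978, 1989), (1990, 1996), (1997, 2002)]


def depth_group(bore):
    """Map a bore number (1 = deepest, 8 = shallowest) to its depth class."""
    if bore == 1:
        return "deep"
    if 2 <= bore <= 5:
        return "medium"
    if 6 <= bore <= 8:
        return "shallow"
    raise ValueError(f"unknown bore number: {bore}")


def period(year):
    """Return the label of the time period that contains the given year."""
    for start, end in PERIODS:
        if start <= year <= end:
            return f"{start}-{end}"
    raise ValueError(f"year outside the grouped range: {year}")


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical series: urban population tracks total population exactly.
total_pop = [10.0, 12.0, 15.0, 19.0]
urban_pop = [4.0, 6.0, 9.0, 13.0]
r = pearson(total_pop, urban_pop)
```

A coefficient close to 1, as in this toy series, is the kind of near-collinearity that makes most of the socioeconomic variables redundant for later classification.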
Table 1. ANOVA results.
Table 2. Correlation matrix.
a Abbreviations: MTP is Mexico’s total population, MRP its rural population and MUP its urban population; STP is the state’s total population, SRP its rural population and SUP its urban population.
Figure 3. Dendrogram of years.
Figure 4. Time evolution of the different socioeconomic indicators considered in this work.
From the multivariate analysis (Figure 5, Figure 6) we observe that data variance is mostly explained by state population structure and GNP. In Figure 6, years are highlighted in different colors corresponding to decades, and the space is divided into four quadrants that correspond to different combinations of population structure and GNP increases. For clarity, the labels were omitted in Figure 5, in which time begins in the bottom-right quadrant (IV), rises to quadrant I, falls back to IV and crosses to III, rises again to II and finally falls to III. The final state is characterized by a high urban population and a decreasing GNP trend (Figures 7-9). In general, the data arrange themselves so that, as time advances, they move from a predominantly rural population structure to a basically urban one, experiencing GNP ups and downs that correspond to different periods of the country’s economy. It is even possible to identify the crisis years of 1987 and 1994.
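The quadrant reading of the ordination plot can be sketched in Python as follows. The sketch assumes the usual counterclockwise numbering (I top-right, II top-left, III bottom-left, IV bottom-right), which is consistent with the bottom-right quadrant being IV in the text. The axis scores are invented solely to reproduce the sequence described above; they are not the CCA scores of the study.

```python
from itertools import groupby


def quadrant(x, y):
    """Return the quadrant (I to IV, counterclockwise) of a point on two ordination axes."""
    if x >= 0 and y >= 0:
        return "I"
    if x < 0 and y >= 0:
        return "II"
    if x < 0 and y < 0:
        return "III"
    return "IV"


# Hypothetical axis scores in time order (not the study's values).
trajectory = [
    (0.8, -0.5),
    (0.6, 0.4),
    (0.3, -0.2),
    (-0.2, -0.6),
    (-0.5, 0.3),
    (-0.7, -0.4),
]

# Collapse consecutive repeats to obtain the sequence of quadrants visited.
path = [q for q, _ in groupby(quadrant(x, y) for x, y in trajectory)]
```

Labelling each year with its quadrant in this way yields the class used to colour Figures 7-9.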
Figure 5. Canonical correspondence analysis for years with socioeconomic vectors.
Figure 6. Canonical correspondence analysis for years without socio-economic vectors.
Figure 7. State rural population percentage vs time. Class colour corresponds to CCA quadrant.
Figure 8. State urban population percentage vs time. Class colour corresponds to CCA quadrant.
Figure 9. GNP vs time. Class colour corresponds to CCA quadrant.
For example, the blue group shows a break that correlates positively with GNP (possibly related to the oil boom); the 1980s show another break, this time correlating negatively with GNP (the 1982 crisis). In the transition from the 1980s to the 1990s the correlation with GNP becomes positive again (coinciding with the creation of the National Water Commission, CONAGUA), and in the mid-1990s it switches back to negative (the 1994 crisis).
Weka’s clustering analysis (Figures 10-14) shows that all socioeconomic factors are arranged in two temporal groups: before and after 1989. For the classification analysis, because the population factors and GNP are highly correlated, only GNP and time were needed to construct a decision tree (Figure 15) with the J48 algorithm implemented in the Weka software.
Figure 10. Cluster analysis for years.
Figure 11. Cluster analysis for GNP.
Figure 12. Cluster analysis for state urban population percentage.
Figure 13. Cluster analysis for state rural population percentage.
Figure 14. Weka statistical results for clustering analysis using k-means.
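For readers unfamiliar with k-means, the following is a minimal Python sketch of Lloyd's algorithm with two clusters. It is not Weka's implementation: the initial centres are chosen by hand rather than at random, and the six two-dimensional points are hypothetical stand-ins for standardized (time, GNP) pairs.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))


def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = list(initial_centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Mean of each cluster; an empty cluster keeps its previous centre.
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical standardized (time, GNP) pairs forming two well-separated groups.
points = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (1.0, 0.9), (1.1, 1.0), (0.9, 1.1)]
centroids, labels = kmeans(points, [points[0], points[-1]])
```

With two centres, the labels split the toy observations into an early and a late group, analogous to the two temporal groups reported above.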
5. Discussion and Conclusions
Data mining of historic hydrogeological data for the Toluca Valley has proved to be able to generate new knowledge, making clear that groundwater management has been influenced by socioeconomic factors such as GNP and population structure.
Figure 15. Decision tree for socioeconomic and hydrogeological data, using the J48 algorithm in Weka.
Interestingly, the years used by the algorithm as decision nodes are, with the exception of 2003, years of economic crisis and presidential transitions.
・ In 1976, Mexico suffered the consequences of the oil embargo imposed by the Organization of Petroleum Exporting Countries (OPEC) against all the countries that supported Israel in the Yom Kippur War against Syria and Egypt.
・ Monday, October 19, 1987, better known as Black Monday, was the day when stock markets around the world crashed, collapsing in a very short time. The crash began in Hong Kong and spread west through international time zones to Europe, hitting the United States after other markets had already declined by a significant margin. Mexico was no exception, and this crash devalued the Mexican currency by 400% against the US dollar. One year later, in this highly complex economic scenario and after an electoral process that involved two suspicious shutdowns of the computer system used to count the votes, Carlos Salinas took office as President of Mexico.
・ The economic crisis that began in Mexico in 1994 had a global impact and was called the “Tequila Effect”. It was caused by the lack of international reserves, which led to the devaluation of the peso during the first days of the presidency of Ernesto Zedillo, one of the legacies of the Salinas administration. A few weeks after the devaluation of the Mexican currency began, the then President of the United States, Bill Clinton, asked the U.S. Congress to authorize a credit line of 20 billion US dollars for the Mexican Government.
Furthermore, Goicochea shows that 1987 and 1989, years that the data mining analysis uses for clustering and as one of the year nodes in the decision tree, in fact mark the beginning of the collapse of agricultural and livestock GNP. These results make it clear that groundwater and the economy may be much more closely linked than previously thought.
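How a decision-tree learner arrives at year nodes such as those discussed above can be sketched in Python with the information-gain criterion that underlies C4.5-style trees. This is a simplified illustration, not Weka's J48 code: it evaluates a single binary split, places the threshold at the midpoint between neighbouring values, and uses invented years and class labels.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) of the split `value <= threshold` with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical data: years labelled with an invented two-class outcome.
years = [1985, 1986, 1987, 1988, 1990, 1991, 1992, 1993]
classes = ["early"] * 4 + ["late"] * 4
threshold, gain = best_threshold(years, classes)
```

A year that cleanly separates two regimes yields the maximum gain of one bit for two balanced classes, which is why abrupt changes such as crisis years are natural candidates for decision nodes.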
We found that the hydrogeological data evolve in the direction of the population's transformation from rural to urban. This finding, which at first sight may seem trivial, raises a fundamental question about groundwater management. In hydrogeology it is widely thought that groundwater management problems have to do mainly with agricultural issues, such as crop selection and irrigation methods. This kind of thinking could mislead us, because most of those problems can be solved by modernizing agricultural technology. But what happens if, as data mining suggests, groundwater today has much more to do with the urban environment? Problems related to cities, such as drinking water demand, sanitation and distribution, are more complex, and their possible solutions are more expensive than those required in agriculture. This could represent a paradigm shift in groundwater management, with profound repercussions for policy making.
Finally, it is in general difficult to provide, for a given regional economic environment, clear and simple criteria that are useful in the design and implementation of suitable private and public policies. Decisions are usually based on macroeconomic variables, and regional and local trends are ignored. Moreover, even when the relevant variables have been identified, it is often unclear how to interpret or establish the causal and interdependence structure among them. Although many methodologies are available, the results they provide frequently remain difficult to interpret. Data mining as presented in this paper is a useful alternative that gives insight into the dynamics of the system being studied and helps validate hypotheses that can be useful in the actual process of decision making. In this paper, for instance, it is shown that the relationship between rural and urban economic environments is changing in terms of water needs and use, and that an appropriate allocation of this resource has to take this change into account. Moreover, the connection with global macroeconomic variables is made apparent in such a way that seasonal changes (in these cases related to the political situation, namely presidential elections) can be taken into account.
In summary, data mining techniques can provide, in the relevant economic context, a useful methodological alternative by giving simple criteria for decision and policy making.
OLC thanks the Fondo Capital Semilla at Universidad Iberoamericana and the SNI program (number 62929).