As reported by Chibanga et al., the ability to predict events with reasonable accuracy enables one to plan a course of action in advance, to make the best of favourable situations, and to avert probable deleterious ones. To achieve this, the system analyst might prefer not to spend the time and effort required to develop and implement a conceptual model, and instead opt for a simpler system-theoretic model (Chibanga et al.). System-theoretic models include artificial neural networks (ANNs); such models apply difference or differential equations to identify a direct mapping between inputs and outputs without detailed consideration of the internal structure of the physical processes (Abrahart and See). Thus, in the view of Chibanga et al., the principal characteristic of the ANN model structure is that it is a flexible mathematical structure capable of identifying complex nonlinear relationships between input and output data sets.
According to Abrahart, system-theoretic (black-box) or neural network forecasting and prediction offer various benefits over traditional conceptual modelling; in line with the guiding principles of modelling, these include parsimony, modesty, and testability (Hillel). However, although an ANN is a flexible mathematical structure capable of identifying complex nonlinear relationships between input and output data sets, its overall performance depends on many associated variables. Because an ANN does not depend solely upon the physical parameters used in the analytical approach, it can be designed with quite different architectures to achieve optimal performance. As reported in Balkhair, the performance of an ANN is sensitive to its architecture, such as the number of input nodes, hidden layer nodes, and output nodes; i.e., the appropriate architecture is highly problem dependent. This implies that, as in the identification of other nonlinear models (system-theoretic or conceptual), 1) a model structure must be identified, and 2) the model parameters must be calibrated through an iterative procedure that searches an objective function surface for an optimum (Hsu et al.).
The selection of an appropriate architecture is usually problematic. In the view of Abrahart and See, there is no single correct procedure for determining the optimum number of units or layers, although one or two “rules of thumb” have been put forward (e.g. Sarle) and various automated growing, pruning, and network breeding algorithms exist (e.g. Fahlman and Lebiere; SNN Group). In this regard, the selection of the optimal number of hidden units is worth mentioning; it is often considered to be problem dependent. The number of hidden units and layers controls the power of the model to perform more complex modelling but, as noted by Abrahart and See, this is not without an associated trade-off. The use of large hidden layers can be counter-productive because an excessive number of free parameters encourages over-fitting of the network solution to the training data and thus reduces the generalisation capability of the final product (Abrahart and See). Thus, the general concern is how to balance the number of hidden units across the layers as well as the number of hidden layers.
In light of the foregoing, and considering that the ANN model structure is ideally suited to modelling highly nonlinear input-output relationships, the central thrust of this study is to assess the implications of some latent issues in the application of the neural network modelling paradigm. Specifically, the emphasis is on model structural complexity and, implicitly, on how it interacts with the choice of optimisation algorithm and the data pre-processing regime.
2.1. Data Collection and Management
For this study, the daily streamflow sequence of the River Benue at the Makurdi hydrometric station was obtained from the Benue State Water Works and the National Inland Waterways Authority (Makurdi Office); the data sequence spanned a period of thirty (30) years. Consistency and continuity tests were carried out; based on these tests, non-continuous data years were removed, reducing the record length to 26 years (i.e., 9490 data elements). The time series of 9490 daily values was then partitioned into two sets of 8760 and 730 data points for the training and validation phases, respectively, i.e., a split-sampling approach. Figure 1 shows the traverse of the River Benue and the location of the Makurdi hydrometric station. The river is perennial and exhibits high seasonality, with peak flow regimes usually in the months of September and October. It takes its source from the Cameroon highlands; together with the River Niger, it drains the larger part of the country, and the two rivers strongly influence the hydrology of their respective flood plains and, to a great extent, that of the entire country.
2.2. Development of ANN Model
2.2.1. Preliminary Analysis of the Daily Streamflow/Discharge Dynamics
The mean daily discharges are shown in Figure 2; the discernible features are the flood peaks and the seasonal periodicity, with both large and small discharges during the year. Besides these general characteristics, the autocorrelation and the spectral analysis provide the first important indications of the aperiodicity of the signal under examination. Figure 3 shows the autocorrelation for delay times (lags) between 1 day and about 5 years; after an initial decrease, the autocorrelation function displays a regular behaviour, which represents the effect of the seasonal character of the discharges, due, besides the rainfall regime, to other hydrological forcings that vary with the season. Contrary to expectation, the autocorrelation does not fall briskly at short lags (Figure 4); such a fall would have indicated complex behaviour with characteristic timescales of a few days.
Figure 1. Basic details of River Benue: (a) Map of Nigeria showing River Benue and its traverse; (b) Flow regime of River Benue at Makurdi hydrometric section.
Figure 2. Time series plot of the daily discharge.
Figure 3. Autocorrelation of the daily discharge series showing sinusoidal pattern.
Figure 4. Blow-up of the autocorrelation of the discharge series and the first-differenced series.
Similarly, the autocorrelation of the first-differenced signal (Figure 4) does not show a rapid monotonic decrease. Despite this, apart from the periodicity, the behaviour of the series appears essentially aperiodic and erratic; for example, once the annual periodic component has been filtered out of the series by least squares methods, the spectral density (Figure 5) exhibits typical broadband behaviour. The spectrum does not show any privileged frequencies, but rather a linear decay that links the whole range of frequency components. The fact that the spectrum is continuous with a pronounced and wide base underscores the aperiodicity of the series; the question, however, is how far this complex aperiodicity and irregularity translates into complex nonlinearity. This points to the need to investigate whether the dynamics of the discharges of a river could have a dominant chaotic signature on which high-dimensional linear and nonlinear dynamics may be grafted.
Figure 5. Spectral density of the filtered discharge series (i.e., after the removal of annual periodicity).
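The filtering step mentioned above (removal of the annual periodic component by least squares) can be sketched as follows. This is only an illustration under stated assumptions: it fits a single annual harmonic plus a mean, it assumes a 365-day year, and it requires a record made up of whole years so that the regressors are orthogonal and the least-squares coefficients reduce to simple projections. The function name `remove_annual_harmonic` is hypothetical, and the number of harmonics actually removed in the study is not stated.

```python
import math


def remove_annual_harmonic(series, period=365):
    """Subtract the least-squares fit of mean + a*cos(wt) + b*sin(wt).

    Assumes len(series) is a whole number of periods (and period >= 3), in
    which case the constant, cosine and sine regressors are mutually
    orthogonal and the least-squares coefficients are plain projections.
    """
    n = len(series)
    if n == 0 or n % period != 0:
        raise ValueError("series length must be a whole number of periods")
    w = 2.0 * math.pi / period
    mean = sum(series) / n
    a = 2.0 / n * sum(v * math.cos(w * t) for t, v in enumerate(series))
    b = 2.0 / n * sum(v * math.sin(w * t) for t, v in enumerate(series))
    return [v - mean - a * math.cos(w * t) - b * math.sin(w * t)
            for t, v in enumerate(series)]


# Synthetic check: a pure "annual" cycle with period 10 is removed entirely.
demo = [5.0 + 3.0 * math.cos(2.0 * math.pi * t / 10) + 2.0 * math.sin(2.0 * math.pi * t / 10)
        for t in range(20)]
residual = remove_annual_harmonic(demo, period=10)
```

The residual series is what would then be passed to a spectral estimate; for records that are not whole years, a general least-squares solver would be needed instead of the projections used here.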
2.2.2. Reconstruction of Phase-Space (Attractor) by Time Delay Embedding
The first step in the search for deterministic behaviour is to attempt to reconstruct the dynamics in phase space. Since the time series of only one of the variables involved in the phenomenon is available, that is, the discharge $Q_t$, the delay time method proposed by Takens and Packard et al. can be used to reconstruct the attractor; this rests on the fact that the interaction between the variables is such that every component contains information on the complex dynamics of the system. Choosing a delay time τ, usually a multiple of the sampling period $\Delta t$, the method entails the construction of a series of vectors of dimension m of the form

$$\mathbf{Y}_t = \{Q_t,\; Q_{t-\tau},\; Q_{t-2\tau},\; \ldots,\; Q_{t-(m-1)\tau}\} \qquad (1)$$

where m is called the embedding dimension.
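A minimal sketch of the delay-vector construction is given below. The function name `delay_embed` is hypothetical, and the vectors are built from lagged (past) values, which is an assumption consistent with the use of lagged discharges as model inputs later in the paper.

```python
def delay_embed(series, m, tau):
    """Return delay vectors [Q_t, Q_{t-tau}, ..., Q_{t-(m-1)tau}].

    One vector is produced for every time index t that has enough
    history, i.e. t >= (m - 1) * tau.
    """
    start = (m - 1) * tau
    if start >= len(series):
        raise ValueError("series too short for this m and tau")
    return [[series[t - k * tau] for k in range(m)]
            for t in range(start, len(series))]


# Toy example: ten values, embedding dimension 3, delay 2.
vectors = delay_embed(list(range(10)), m=3, tau=2)
```

Plotting the first two (or three) coordinates of each vector against one another gives the kind of phase-space map used to judge how well the attractor is unfolded.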
To construct a well-behaved phase-space by time delay, a careful choice of τ is critical. The delay time τ is commonly selected either by the autocorrelation function (ACF) method, as the lag at which the ACF first attains zero or drops below a small value, say 1/e4, or by the mutual information (MI) method of Fraser and Swinney, as the lag at which the MI attains its first minimum. Here, the delay time τ is taken as the lag at which the autocorrelation function first crosses the zero line (Mpitsos et al.). In practice, the estimate of τ is usually application and author dependent; for instance, some authors take the delay time as 1 day (Porporato and Ridolfi), 2 days (Jayawardena and Lai), 7 days (Islam and Sivakumar), 10 days (Elshorbagy et al.), 20 days (Wilcox et al.), 91 days (Wang), and 146 days (Pasternack). These differences may arise from the nature of the autocorrelation function. To compare the influence of τ on the reconstruction, phase-space maps can be plotted for differing τ values; the best τ is the one for which the phase-space plot is best unfolded. To this end, phase-space maps were constructed for τ = 1, 7, 10, 30, and 78 days; these are displayed in Figure 6, along with the 3-dimensional phase-space map for τ = 78, the lag at which the autocorrelation function first crossed the zero line. Figure 6 shows that the best unfolding is obtained when τ = 78; therefore, τ = 78 is adopted for estimating the correlation dimension of the daily streamflow process in this study.
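The ACF-based choice of delay time can be sketched as below. The helper names `autocorrelation` and `first_zero_crossing` are hypothetical, and the series is a synthetic sinusoid standing in for a seasonal discharge record, not the Makurdi data; the sample ACF used is the common biased estimator.

```python
import math


def autocorrelation(series, max_lag):
    """Sample autocorrelation r_k for lags k = 0 .. max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    total = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / total
            for k in range(max_lag + 1)]


def first_zero_crossing(acf):
    """Smallest positive lag at which the ACF is zero or negative."""
    for lag in range(1, len(acf)):
        if acf[lag] <= 0.0:
            return lag
    return None  # no crossing within the lags examined


# Synthetic stand-in for a seasonal record: three 365-day cycles.
synthetic_flow = [100.0 + 50.0 * math.sin(2.0 * math.pi * t / 365.0)
                  for t in range(3 * 365)]
acf = autocorrelation(synthetic_flow, max_lag=200)
tau = first_zero_crossing(acf)
print("delay time from first zero crossing:", tau)
```

For a purely annual cycle the crossing falls near a quarter of the period; a real discharge series, with its asymmetric flood peaks, need not behave the same way.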
2.2.3. Network Topology
The time delay coordinate method (Packard et al.; Takens) was used to reconstruct the phase-space from the scalar time series because the data form a univariate series. To describe the temporal evolution of a dynamical system in a multi-dimensional phase-space from a scalar time series, some technique is needed to unfold the multi-dimensional structure using the available data. Thus, the approach in the preceding section is complemented by applying the method for determining the minimal sufficient embedding dimension (m) proposed by Kennel et al., called the “False Nearest Neighbour (FNN)” method. That is, supposing the point $\mathbf{Y}_i = \{Q_i, Q_{i-\tau}, \ldots, Q_{i-(p-1)\tau}\}$ has a nearest neighbour $\mathbf{Y}_j$ in a p-dimensional space, the distance $\|\mathbf{Y}_i - \mathbf{Y}_j\|$ is calculated in order to compute

$$R_i = \frac{\left|Q_{i-p\tau} - Q_{j-p\tau}\right|}{\|\mathbf{Y}_i - \mathbf{Y}_j\|} \qquad (2)$$

where $Q_{i-p\tau}$ and $Q_{j-p\tau}$ are the coordinates added to the two points when the dimension is raised from p to p + 1. If $R_i$ exceeds a given threshold $R_T$, the point is marked as having a False Nearest Neighbour. Consequently, the embedding dimension p is high enough if the fraction of points that have False Nearest Neighbours is zero, or sufficiently small, say, smaller than a criterion $R_f$. In this case, the False Nearest Neighbour threshold was set to 10 (as reported in Wang). On this basis, the fraction of False Nearest Neighbours was calculated as a function of the embedding dimension used for the phase-space reconstruction. Here, the minimal embedding dimension was taken as 8; this implies that the state of the streamflow process can be determined by eight lagged observed values, as shown in Figure 7.
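The False Nearest Neighbour test can be sketched with a brute-force neighbour search, as below. The function name `fnn_fraction` is hypothetical, only the distance-ratio criterion with a threshold of 10 is implemented, lagged (past) coordinates are assumed, and the check uses a synthetic sinusoid rather than the discharge data. The quadratic search is adequate for short series but would be slow for the full record.

```python
import math


def fnn_fraction(series, p, tau, threshold=10.0):
    """Fraction of delay vectors whose nearest neighbour in dimension p is false."""
    times = range(p * tau, len(series))  # keeps the value at i - p*tau available
    if len(times) < 2:
        raise ValueError("series too short for this p and tau")
    vectors = {i: [series[i - k * tau] for k in range(p)] for i in times}
    false_count = 0
    tested = 0
    for i in times:
        nearest, nearest_dist = None, math.inf
        for j in times:
            if j == i:
                continue
            dist = math.dist(vectors[i], vectors[j])
            if dist < nearest_dist:
                nearest, nearest_dist = j, dist
        if nearest_dist == 0.0:
            continue  # coincident points: the ratio is undefined, so skip
        # Separation in the coordinate gained by going from p to p + 1.
        extra = abs(series[i - p * tau] - series[nearest - p * tau])
        tested += 1
        if extra / nearest_dist > threshold:
            false_count += 1
    return false_count / tested if tested else 0.0


# Synthetic check: a sinusoid folds onto itself in one dimension
# but is fully unfolded in two.
wave = [math.sin(0.3 * t) for t in range(300)]
fraction_p1 = fnn_fraction(wave, p=1, tau=5)
fraction_p2 = fnn_fraction(wave, p=2, tau=5)
```

Repeating the call for p = 1, 2, 3, ... and plotting the result against p gives a curve of the kind shown in Figure 7; the minimal embedding dimension is read off where the fraction becomes negligible.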
Following from this analysis, eight lagged values of the input variable were used when fitting the ANN model to the series; specifically, based on the phase-space reconstruction, the inputs were the discharges of day t − 7 to day t. The eight lagged input values were used to forecast the discharge from time t + 1 (the next day) to t + 5, i.e., five values ahead, using a multiple-output approach rather than a single-output one. The idea is simply to explore the ANN model forecast behaviour over a long lead time (Figure 8).
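The input-output pairing described above (eight lagged discharges in, five lead times out) can be sketched as a sliding window. The function name `make_windows` is hypothetical, and consecutive daily lags are assumed, as the text indicates.

```python
def make_windows(series, n_lags=8, horizon=5):
    """Pair each window Q(t-7)..Q(t) with its targets Q(t+1)..Q(t+5)."""
    inputs, targets = [], []
    for t in range(n_lags - 1, len(series) - horizon):
        inputs.append(series[t - n_lags + 1:t + 1])
        targets.append(series[t + 1:t + horizon + 1])
    return inputs, targets


# Toy example with 20 consecutive "daily" values.
inputs, targets = make_windows(list(range(20)))
```

Each row of `inputs` feeds the eight input nodes, and the matching row of `targets` supplies the five output nodes in the multiple-output setting.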
Figure 6. Phase-space schematic of raw average daily discharge series: (a) Delay time = 1 day; (b) Delay time = 7 days; (c) Delay time = 10 days; (d) Delay time = 30 days; (e) Delay time = 78 days; (f) 3-D phase space map using delay time = 78 days.
Figure 7. Fraction of false nearest neighbours as a function of embedding dimension.
Figure 8. Schematic of three-layer feedforward artificial neural network architecture.
To address the thrust of the study, two architectural variants with different nodal configurations were considered: single and double hidden layers, chosen to examine the implications of model structural complexity. After several trials aimed at choosing comparable network structures, the ANN models adopted were 1) 8 7 5, a single-hidden-layer network with 8 input nodes, 7 hidden nodes, and 5 output nodes; 2) 8 5 2 5, a double-hidden-layer network with 5 and 2 hidden nodes, respectively; and 3) 8 4 3 5, a double-hidden-layer network with 4 and 3 hidden nodes, respectively.
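One way to compare the size of the three structures is to count their free parameters (weights plus biases). The sketch below assumes fully connected layers with one bias per non-input node; the function name `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected feedforward network."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


architectures = {
    "8 7 5": [8, 7, 5],
    "8 5 2 5": [8, 5, 2, 5],
    "8 4 3 5": [8, 4, 3, 5],
}
for name, sizes in architectures.items():
    print(name, count_parameters(sizes))
```

Under these assumptions the single-hidden-layer network has 103 free parameters, while the two double-hidden-layer networks have 72 and 71, so the latter two are almost identical in size and differ mainly in how the hidden nodes are distributed between the layers.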
2.2.4. Network Training
For the purposes of the study, the multi-layer feedforward back-propagation network was used. Specifically, network training was implemented using the trainbr (Bayesian regularisation: Br), traingdm (gradient descent with momentum: GDM), and trainlm (Levenberg-Marquardt: LM) functions in the MATLAB Neural Network Toolbox. Since the transfer function is of critical relevance in neural network training, and predictability of future behaviour is a direct consequence of its correct identification, the tan-sigmoid (tansig) and purelin transfer functions were used in the hidden and output layers, respectively, of the identified network structures. The purelin transfer function was considered for the output layer because it allows the network outputs to take on any value, whereas a sigmoid last layer in a multi-layer network constrains the network outputs to a small range.
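The role of the two transfer functions can be illustrated with a plain forward pass. This is a sketch in Python rather than the MATLAB toolbox used in the study: the names `init_network` and `forward` are hypothetical, the weights are random placeholders rather than trained values, and `math.tanh` stands in for the tan-sigmoid function (the two are mathematically identical).

```python
import math
import random


def init_network(layer_sizes, seed=0):
    """Random placeholder weights and biases for each layer (not trained)."""
    rng = random.Random(seed)
    network = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
        network.append((weights, biases))
    return network


def forward(network, inputs):
    """Tan-sigmoid hidden layers followed by a linear output layer."""
    activation = list(inputs)
    last = len(network) - 1
    for index, (weights, biases) in enumerate(network):
        net_input = [sum(w * a for w, a in zip(row, activation)) + b
                     for row, b in zip(weights, biases)]
        if index == last:
            activation = net_input  # linear output: unbounded
        else:
            activation = [math.tanh(v) for v in net_input]  # bounded in (-1, 1)
    return activation


outputs = forward(init_network([8, 4, 3, 5]), [0.5] * 8)

# A one-input toy network shows that the linear output can leave [-1, 1].
toy = [([[1.0]], [0.0]), ([[10.0]], [0.0])]
toy_output = forward(toy, [1.0])[0]
```

The toy network returns ten times the hidden activation, which lies well outside [−1, 1]; with a sigmoid output layer the same value would have been squashed back into that interval.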
Before applying the ANN, both input and output data were pre-processed and normalised to the range [−1, 1]. The scaling strategy was adopted based on the findings of Wang and Otache; rescaling was done so that the data series fall within this bound. Scaling of the original input data, say $x$, to the network range was done by

$$x_s = L_x + (U_x - L_x)\,\frac{x - x_{\min}}{x_{\max} - x_{\min}} \qquad (3)$$

where $x$ = the original input data, $x_s$ = the input data scaled to the network range, $x_{\max}$ and $x_{\min}$ are respectively the maximum and the minimum of the original input data, while $U_x$ and $L_x$ are the upper and the lower network ranges for the network input, respectively. Similarly, the original output, say $y$, is scaled to the network range by

$$y_s = L_y + (U_y - L_y)\,\frac{y - y_{\min}}{y_{\max} - y_{\min}} \qquad (4)$$

where $y_s$ is the system's output scaled to the network range, $y_{\max}$ and $y_{\min}$ are respectively the maximum and minimum values of the original output data $y$, whereas $U_y$ and $L_y$ are respectively the upper and the lower network ranges for the network output. After scaling the inputs and outputs, the resulting network output, say $\hat{y}_s$, is in the scaled domain. Hence, there is need to rescale the output back to its original domain; this is done by inverting Equation (4) and using $\hat{y}_s$ as

$$\hat{y} = y_{\min} + (\hat{y}_s - L_y)\,\frac{y_{\max} - y_{\min}}{U_y - L_y} \qquad (5)$$
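The scaling and its inverse can be sketched as a pair of functions. The names `scale` and `unscale` are hypothetical, the default network range is [−1, 1] as stated in the text, and the minimum and maximum are assumed to come from the data being scaled; in practice they would normally be taken from the training set and reused for validation.

```python
def scale(values, v_min, v_max, lower=-1.0, upper=1.0):
    """Map values from [v_min, v_max] linearly onto [lower, upper]."""
    span = v_max - v_min
    if span == 0:
        raise ValueError("v_max and v_min must differ")
    return [lower + (upper - lower) * (v - v_min) / span for v in values]


def unscale(scaled, v_min, v_max, lower=-1.0, upper=1.0):
    """Invert scale(): map network-range values back to original units."""
    return [v_min + (s - lower) * (v_max - v_min) / (upper - lower)
            for s in scaled]


flows = [10.0, 20.0, 30.0]
scaled_flows = scale(flows, min(flows), max(flows))
restored = unscale(scaled_flows, min(flows), max(flows))
```

Narrower ranges such as [0.1, 0.9] are obtained simply by changing `lower` and `upper`, which leaves headroom for values outside the calibration range.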
2.3. Forecast Analysis
In order to draw conclusions on ANN model performance, attention is focused on extreme events, that is, maximum and minimum flows. In this regard, the coefficient of correlation R as in Equation (6) was employed.

$$R = \frac{\sum_{t=1}^{v}\left(Q_t - \bar{Q}\right)\left(\hat{Q}_t - \bar{\hat{Q}}\right)}{\sqrt{\sum_{t=1}^{v}\left(Q_t - \bar{Q}\right)^2 \, \sum_{t=1}^{v}\left(\hat{Q}_t - \bar{\hat{Q}}\right)^2}} \qquad (6)$$

where v = the number of output data points, $Q_t$ = the observed flow, $\hat{Q}_t$ = predicted flow, $\bar{Q}$ = mean of observed flow, and $\bar{\hat{Q}}$ = mean of predicted flows. In terms of the measures of forecast accuracy with respect to extreme values, the ratio of the forecasted maximum to the observed maximum (peak) was determined as

$$R_{\max} = \frac{\hat{Q}_{\max}}{Q_{\max}} \times 100\% \qquad (7)$$

where $Q_{\max} = \max\{Q_t\}$ and $\hat{Q}_{\max}$ is the forecast corresponding to such maximum; $R_{\max} = 100\%$ means that the observed peak is perfectly reproduced by the model. Forecasts with values of about 100% are considered to be very accurate, while $R_{\max} < 100\%$ indicates that the model underestimates the peak value and $R_{\max} > 100\%$ indicates overestimation. Similarly, the ratio of the forecasted to the observed minimum

$$R_{\min} = \frac{\hat{Q}_{\min}}{Q_{\min}} \times 100\% \qquad (8)$$

where $\hat{Q}_{\min}$ represents the forecast corresponding to the minimum observed value $Q_{\min}$, was used to judge the forecasting capability of the model. In addition, specific-event prediction was also considered by looking at low and high flows.
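The three performance measures can be sketched as follows. The function names are hypothetical; the peak and low-flow ratios take the forecast at the time step of the observed extreme, following the wording of the text, and a zero observed minimum is treated as an error because the ratio is then undefined.

```python
import math


def correlation(observed, predicted):
    """Coefficient of correlation R between observed and predicted flows."""
    mean_o = sum(observed) / len(observed)
    mean_p = sum(predicted) / len(predicted)
    num = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    den = math.sqrt(sum((o - mean_o) ** 2 for o in observed)
                    * sum((p - mean_p) ** 2 for p in predicted))
    return num / den


def extreme_ratio(observed, predicted, pick):
    """Forecast at the observed extreme, as a percentage of that extreme."""
    index = observed.index(pick(observed))
    if observed[index] == 0:
        raise ValueError("ratio undefined for a zero observed flow")
    return 100.0 * predicted[index] / observed[index]


def peak_ratio(observed, predicted):
    return extreme_ratio(observed, predicted, max)


def low_ratio(observed, predicted):
    return extreme_ratio(observed, predicted, min)


obs = [1.0, 2.0, 3.0, 4.0]
pred = [2.0, 4.0, 6.0, 8.0]
r_value = correlation(obs, pred)
r_max = peak_ratio(obs, pred)
r_min = low_ratio(obs, pred)
```

In this toy case the forecasts are perfectly correlated with the observations yet double every flow, which illustrates why the extreme-value ratios are reported alongside R.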
3.1. Effects of Model Structural Complexity
Figures 9-12 and Tables 1 and 2 show the performance of the network configurations in terms of correlation, extreme flow, and event-specific evaluation for lead times of one and five days ahead. It is evident that the complexity of the network structure may impair network performance, as depicted by the contrasting results in both the training and validation phases. Overall, Figures 9-12 show that while the networks predicted low flows, and elements of high flows, fairly well, the converse is the case for a range of medium flows. In view of the statistics in Tables 1 and 2 as well as Figures 9-12, the aggregate performance of ANN Model 8 4 3 5 is relatively better, followed by ANN Model 8 7 5. ANN Model 8 5 2 5 performed very poorly, especially with Rmax % values of 489.5 and 423 and Rmin % values of 0.0 and 0.0 in the validation phase (see Table 1). Figure 10 clearly shows this behaviour: there is strong evidence of zero flow prediction as well as high spikes, probably due to poor data quality leading to outliers. Nevertheless, in all instances, as noted in Figure 9 and Figure 11, the models could explain between 60% and 97% (as exemplified by R² values) of the variability in the streamflow dynamics.
Structural complexity is defined here to connote the size of the hidden layers relative to the traditional single hidden layer commonly employed. The selection of the optimal number of hidden units (nodes) for a hidden layer is often considered to be problem dependent. Intuition suggests that more is better but, as reported by Abrahart and See, this is not always the case; i.e., the number of hidden units and layers controls the power of the network to perform complex modelling, but with an associated trade-off between training time and network performance. The findings here accord with the submission of Abrahart and See that the use of large hidden layers could be counter-productive because an excessive number of free parameters encourages over-fitting of the network solution to the training data and invariably reduces the generalisation capability of the final product. Hence, what matters is the effective number of hidden layers and how to balance the hidden units across these layers, as shown by the results in Table 1 and Table 2. As pointed out by Hornik et al., however, the performance of a one-hidden-layer network could be compromised by data quality, i.e., if the data contain an insufficient deterministic relationship. This places a searchlight on the viability of a univariate series as employed here; the possibility of compromising any meaningful dependence in terms of causal correlation between input variables is high. Nevertheless, in line with the findings of Openshaw and Openshaw as reported in Abrahart and See, there could be some advantage in adopting more than one hidden layer; as noted here (see Table 1, Table 2, Figure 9, and Figure 12), the use of two hidden layers could provide an additional degree of representational power.
The selection of too many hidden units or neurons may increase the training time without significant improvement in the training results. Generally, in agreement with the findings of Ranjithan and Eheart, too many hidden neurons may encourage each hidden neuron to memorise one of the input patterns; consequently, the training error decreases gradually with an increasing number of intermediate units, but the generalisation of the network may reach an optimum and does not necessarily improve indefinitely as more intermediate units or nodes are added.
Figure 9. ANN 8 5 2 5 Model prediction correlation plot for (a) Br, (b) LM, and (c) GDM in the validation.
Figure 10. Characteristic ANN 8 5 2 5 Model flow simulation pattern for 1-day ahead lead time for (a) Br, (b) LM, and (c) GDM optimisation algorithms.
Figure 11. ANN 8 4 3 5 Model prediction correlation plot for (a) Br, (b) LM, and (c) GDM in the validation.
Figure 12. Characteristic ANN 8 4 3 5 Model flow simulation pattern for 1-day ahead lead time for (a) Br, (b) LM, and (c) GDM optimisation algorithms.
Table 1. Effective correlation (R) statistics of extreme flow predictions.
Table 2. Event-specific evaluation.
3.2. Implications of Data Pre-Processing Strategy
It is important not only to evaluate model forecast performance on the basis of statistical parameters, but also to consider the impact that data pre-processing may have on ANN model forecasts. It is recognised that data pre-processing can have a significant effect on model performance (e.g. Maier and Dandy). Because the outputs of some transfer functions are bounded, the outputs of a Multi-Layer Perceptron (MLP) ANN will lie in the interval [0, 1] or [−1, 1], depending on the transfer function used in the neurons. Reports in the literature (e.g. Otache; Maier and Dandy; Wang) suggest that using smaller intervals for streamflow modelling, such as [0.1, 0.85] and [0.1, 0.9], could allow extreme (low and high) flow events occurring outside the range of the calibration data to be accommodated. However, the advantage of rescaling the data into a small interval is supported here only to a varying degree: fairly well in some instances and very poorly in others. The prediction capability of the different models for extreme flow regimes is illustrated in both Table 1 and Table 2.
The behaviour depicted in Table 1 and Table 2 can be explained against the backdrop of the behaviour of transfer functions. For instance, rescaling the input data to [−1, 1] would limit the output range of the tan-sigmoid function approximately to [−0.7616, 0.7616], i.e., tanh(±1). Similarly, rescaling the input range to [−0.9, 0.9] would further shrink the output range approximately to [−0.7163, 0.7163]. Both 0.7616 and 0.7163 are still far from the extreme limits of the function; such a small output range makes the output less sensitive to changes in the weights between the hidden layer and the output layer, and may therefore make the training process more difficult. As reported by Wang and affirmed in Otache, since the neurons in an ANN structure are combined linearly through many weights, any rescaling of the input vector can be offset by corresponding changes in the weights and biases.
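The quoted output limits can be checked directly, since the tan-sigmoid function is the hyperbolic tangent. The sketch assumes a single neuron with unit weight and zero bias, which is the case the quoted figures correspond to; the last line uses an arbitrary weight of 3 to illustrate how a larger weight offsets the input rescaling.

```python
import math


def tansig(n):
    """Tan-sigmoid transfer function, mathematically equal to tanh."""
    return math.tanh(n)


for bound in (1.0, 0.9):
    print(bound, round(tansig(bound), 4))

# With a weight of 3 the same scaled input drives the neuron
# much closer to its saturation limit of 1.
weighted_output = tansig(3.0 * 1.0)
print(round(weighted_output, 4))
```

The first two values reproduce the limits of 0.7616 and 0.7163, while the weighted case exceeds 0.99, consistent with the remark that changes in weights and biases can compensate for the choice of input range.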
Based on the results obtained, it can be inferred that the adoption of large hidden layers could be counter-productive: an excessive number of free parameters encourages over-fitting of the network, even though it may provide additional representational power. In the same context, rescaling of the ANN input regime adversely limits the output of the transfer function. It follows that an ANN model is by no means a substitute for conceptual watershed modelling. Exogenous variables should therefore be incorporated in streamflow modelling and forecasting because of their hydrological significance. Effort should also be geared towards using hybrid models, such as Fuzzy-Neural Network and Wavelet models, in a coupling strategy with ANN in the modelling of streamflow. Similarly, because of volatility and nonlinear deterministic problems, ARMA-GARCH models should be considered a viable complement.
Chibanga, R., Berlamont, J. and Vandewalle, J. (2003) Modelling and Forecasting of Hydrological Variables Using Artificial Neural Networks: The Kafue River Sub-Basin. Hydrological Sciences Journal, 48, 363-379.
Abrahart, R.J. and See, L. (2000) Comparing Neural Network and Autoregressive Moving Average Techniques for Provision of Continuous River Flow Forecasts in Two Contrasting Catchments. Hydrological Processes, 14, 2157-2172.
Hillel, D. (1986) Modelling in Soil Physics: A Critical Review. In: Boersma, L.L., Ed., Future Developments in Soil Science Research, Soil Science Society of America, Madison, WI, 35-42.
Takens, F. (1981) Detecting Strange Attractors in Turbulence. In: Rand, D.A. and Young, L.S., Eds., Lecture Notes in Mathematics, Springer-Verlag, New York, 366-381.
Mpitsos, G.J., Creech, H.C., Cohan, C.S. and Mendelson, M. (1987) Variability and Chaos: Neuron-Integrative Principles in Self-Organization of Motor Patterns. In: Hao, B.L., Ed., Directions in Chaos, World Scientific, Singapore, 162-190.
Islam, M.N. and Sivakumar, B. (2002) Characterisation and Prediction of Runoff Dynamics: A Nonlinear Dynamical View. Advances in Water Resources, 25, 179-190.
Wilcox, B.P., Rawls, W.J., Brakensiek, D.L. and Wight, J.R. (1990) Predicting Runoff from Rangeland Catchments: A Comparison of Two Models. Water Resources Research, 26, 2401-2410.
Kennel, M.B., Brown, R. and Abarbanel, H.D. (1992) Determining Embedding Dimension for Phase-Space Reconstruction Using a Geometrical Construction. Physical Review A, 45, 3403-3411.
Maier, H.R. and Dandy, G.C. (2000) Neural Networks for the Prediction and Forecasting of Water Resources Variables: A Review of Modelling Issues and Applications. Environmental Modelling & Software, 15, 101-124.