As software systems play an increasingly important role in our lives, their complexity and size continue to grow. This increased complexity makes the assurance of software quality much more difficult. In fact, many software applications implement critical functionality even as their size and complexity keep increasing. Software reliability is an important facet of software quality, and is defined as the probability of failure-free software operation for a specified period of time in a specified environment. For the purpose of quantitative assessment, software reliability growth models (SRGMs) have been widely used during the last four decades. It is worth mentioning that SRGMs specify a parametric form of the random process that describes the behavior of software failure occurrence with respect to time. Since SRGMs are essentially stochastic models with abstractions, they must be built under several simplified mathematical assumptions, and, at the same time, their parameter estimation with the fault count data observed in software testing is not a trivial task, because the maximum likelihood estimation, which is commonly used, reduces to a multimodal nonlinear optimization problem with constraints and requires considerable computational effort. Another problem with SRGMs is model selection from a great number of SRGM candidates. During the last four decades, over three hundred SRGMs have been proposed in the literature. The conclusion from empirical research is that the best SRGM in terms of goodness-of-fit performance depends on the kind of software fault count data. In other words, there does not exist a single best SRGM which can fit every software fault count data set. Sharma et al. proposed a method for selecting the best SRGM by using the concept of distance.
Unfortunately, it is noted that the best SRGM fitting the past observations does not always provide the best prediction model for the future (or remaining) testing period.
Apart from SRGMs based on stochastic modeling, the artificial neural network (ANN) has gained much popularity for dealing with non-linear phenomena arising in applications such as time series forecasting, pattern recognition, and function approximation. Compared with non-trivial stochastic models, it is easy to implement the ANN for software fault prediction, since the feed-forward back-propagation (BP) type of learning algorithm can be widely used to estimate the internal parameters, such as connection weights. Software fault prediction using the ANN was first proposed by Karunanithi et al. and Karunanithi and Malaiya. They applied simple multi-layered perceptron (MLP) feed-forward neural networks, which enjoy a universal approximation ability to represent an arbitrary nonlinear mapping with any degree of accuracy, to the prediction of software fault-detection time in software testing. Since their seminal contributions, the ANN approach has been frequently applied to different estimation/prediction problems in software engineering. Khoshgoftaar et al. and Khoshgoftaar and Szabo considered different problems of identifying fault-prone modules in software quality assessment. For a more recent survey, see Vashisht et al.
However, it should be pointed out that the ANN approach has some drawbacks in application to software fault prediction. First, there is no efficient way to determine the best neural network architecture in each application domain. Even though the number of input neurons and output neurons may be determined from physical requirements, the number of hidden layers and hidden neurons significantly influences the prediction performance of MLP feed-forward neural networks. In many applications, these sizes must be determined through trial-and-error heuristics in pre-experiments. In other words, the predictive performance for software fault counts strongly depends on the ANN architecture assumed for the analysis. Second, the ANN is a kind of nonlinear regression model, but it can be regarded as a deterministic model that outputs deterministic values as estimates or predictions. Unlike the familiar SRGMs, it is impossible to quantify software reliability as a probability by applying the common ANN approach. This limitation prevents us from using the ANN to quantify software reliability measures such as the software reliability itself, the mean time to software failure, etc. On the other hand, though the ANN is a simple connectionist model depending on the architecture, it can be considered a statistically nonparametric model without specific model assumptions. As mentioned above, a huge number of SRGMs have been developed in the literature, but almost all of them are based on some parametric assumptions which cannot be validated for every fault count data set. In that sense, the ANN approach can be viewed as a nonparametric model with no specific model assumptions. In fact, the ANN approach provides a data-driven modeling framework and can bridge several kinds of machine learning techniques. Yang et al. applied a model mining technique to provide a generic data-driven SRGM.
Xiao and Dohi proposed a nonparametric wavelet method to estimate nonhomogeneous Poisson process (NHPP)-based SRGMs. Cheng et al. considered a multistep-ahead time series prediction with the MLP approach. Park et al. also compared several data-driven approaches with the existing SRGMs. Recently, Begum and Dohi applied an idea on multiple-input multiple-output (MIMO) neural computation by Park et al., and proposed a refined neural network approach to predict the long-term behavior of software fault counts with grouped data. They imposed the assumption that the underlying fault count process obeys the Poisson law with an unknown mean value function, and proposed to utilize several methods for transforming the Poisson count data to Gaussian data. However, it is worth noting that even the Poisson assumption can be regarded as a simplified assumption in SRGM modeling and has not been validated empirically.
In this paper we refine the existing MLP approach as a data-driven model from the viewpoint of software fault prediction. In particular, we deal with grouped data, which consist of the number of software faults detected on each testing date, whereas the existing MLP approaches focus only on software fault-detection time data, which are not easily available in actual software testing. We apply the well-known data transformation technique called the Box-Cox power transformation, from an arbitrary probability law to the Gaussian law, and transform the underlying software fault prediction problem into a nonlinear Gaussian regression problem with the MLP. Historically, Tukey first introduced a family of power transformations such that the transformed values are a monotonic function of the observations over some admissible range, where an arbitrary transformation parameter is involved in the power transformations. Box and Cox proposed the maximum likelihood as well as Bayesian methods for estimating the parameter, and derived the asymptotic distribution of the likelihood ratio to test some hypotheses about the parameter. The main contributions by Box and Cox are twofold: the first is to compute the profile likelihood function and to obtain an approximate confidence interval from the asymptotic property of the maximum likelihood estimator; the second is to ensure that the probability model is fully identifiable in the Bayesian approach. In neural computation contexts, Dashora et al. used the Box-Cox power transformation for data-driven prediction models with a single output neuron in the MLP and analyzed the intermittent stream flow of the Narmada river basin. They compared the MLP with seasonal autoregressive integrated moving average and Thomas-Fiering models.
Sennaroglu and Senvar evaluated process capability indices in industry and compared the Box-Cox power transformation with weighted variance methods in the Weibull distribution model in terms of the process variation with the product specifications. In this way, the applicability of the Box-Cox power transform can be recognized in many research fields.
In this paper we transform the discrete integer-valued data, which denote the cumulative number of software faults detected in software system testing, to Gaussian data by the Box-Cox transform. Next we input the transformed data into the MIMO MLP to make a long-term prediction of the software fault count. Note that almost all past papers considered only the one-stage-look-ahead prediction. Hence, our challenge with MIMO and the Box-Cox power transformation overcomes this limitation of the existing neuro-based approaches. Recently, the same authors considered a different transformation technique for software fault prediction, where only the one-stage-look-ahead prediction is made in terms of a prediction interval. Also, though they implicitly assume the Poisson law for the underlying fault count data, we do not restrict the software fault count data to the Poisson law, because the Box-Cox transform is a general data transformation scheme. The present paper is organized as follows. In Section 2, we overview the SRGMs based on the nonhomogeneous Poisson process (NHPP). Section 3 introduces a refined ANN for the purpose of prediction, where the Box-Cox power transformation is applied in the pre-processing of the input data. In Section 4, we analyze two actual software fault count data sets and compare our refined neural network approach with eleven NHPP-based SRGMs from the viewpoint of predictive performance with the relative average error. We show that our ANN approach affords a more appropriate prediction device and tends to have enhanced performance from the standpoint of predictability. Finally, the paper is concluded with remarks in Section 5.
2. NHPP-Based Software Reliability Modeling
We summarize software reliability growth modeling with the nonhomogeneous Poisson process (NHPP). Suppose that the software system test starts at time t = 0. Let {N(t), t ≥ 0} denote the cumulative number of software faults detected by time t, where {N(t), t ≥ 0} is a stochastic (non-decreasing) counting process in continuous time. In particular, {N(t), t ≥ 0} is said to be an NHPP if the following conditions hold:
・ N(0) = 0,
・ N(t) has independent increments,
・ Pr{N(t + Δt) − N(t) = 1} = λ(t; θ)Δt + o(Δt),
・ Pr{N(t + Δt) − N(t) ≥ 2} = o(Δt),

where o(Δt) is the higher-order term of the infinitesimal time Δt, and λ(t; θ) is the intensity function of an NHPP, which denotes the instantaneous fault-detection rate per fault. In the above definition, θ is the model parameter (vector) included in the intensity function. Then, the probability that the cumulative number of software faults detected by time t equals n is given by

Pr{N(t) = n} = {Λ(t; θ)}^n / n! × exp(−Λ(t; θ)), n = 0, 1, 2, ... (1)
Λ(t; θ) is called the mean value function and indicates the expected cumulative number of software faults up to time t, say,

Λ(t; θ) = E[N(t)] = ∫_0^t λ(x; θ) dx. (2)
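As an illustrative sketch (not taken from the paper), the Poisson probability Pr{N(t) = n} above can be evaluated numerically once a mean value function is fixed; here we assume the exponential (Goel-Okumoto) form Λ(t) = a(1 − exp(−bt)) with hypothetical parameters a and b, and work in log space to avoid overflow for large n:

```python
import math

def mean_value(t, a=100.0, b=0.05):
    """Exponential (Goel-Okumoto) mean value function: expected faults by time t."""
    return a * (1.0 - math.exp(-b * t))

def nhpp_pmf(n, t, a=100.0, b=0.05):
    """Pr{N(t) = n}: Poisson pmf with mean Lambda(t), computed in log space."""
    lam = mean_value(t, a, b)
    if lam == 0.0:
        return 1.0 if n == 0 else 0.0
    # log pmf = n*log(lam) - lam - log(n!), with log(n!) = lgamma(n + 1)
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))

# For fixed t the probabilities over n sum to (nearly) one.
total = sum(nhpp_pmf(n, 30.0) for n in range(300))
```

The log-space form matters in practice: for moderate means, lam**n and math.factorial(n) overflow a float long before the pmf itself becomes negligible.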
If the mean value function or the intensity function is specified, then the identification problem of the NHPP is reduced to a statistical estimation problem for the unknown model parameter θ. In this way, when the parametric form of the mean value function or the intensity function is given, the resulting models are called parametric NHPP-based SRGMs. Table 1 contains the representative NHPP-based SRGMs and their mean value functions. Okamura and Dohi summarized these eleven parametric NHPP-based SRGMs and developed a parameter estimation tool, SRATS, based on the maximum likelihood method and the EM (Expectation-Maximization) algorithm. In SRATS, the best SRGM with the smallest AIC (Akaike Information Criterion) is automatically selected, so the resulting best SRGM fits the past observation data on software fault counts best among the eleven models.
Suppose that n realizations (t_i, y_i), i = 1, 2, ..., n, of the cumulative number of software faults y_i = N(t_i) are observed up to the observation point t_n. We estimate the model parameter θ by means of the maximum likelihood method. Then, the log likelihood function for the grouped data is given by

ln L(θ) = Σ_{i=1}^{n} (y_i − y_{i−1}) ln[Λ(t_i; θ) − Λ(t_{i−1}; θ)] − Λ(t_n; θ) − Σ_{i=1}^{n} ln[(y_i − y_{i−1})!], (3)
where t_0 = 0 and y_0 = 0 for simplification. The maximum likelihood estimate of the model parameter, θ̂, can be obtained by maximizing Equation (3) with respect to the model parameter θ. Once the model parameter is estimated, our next concern is to predict the future value of the intensity function or the mean value function at an arbitrary time t_n + l, where l (> 0) denotes the prediction length. In parametric modeling, the prediction at time t_n + l is easily done by substituting the estimated model parameter θ̂ into the time evolution Λ(t; θ̂), where the unconditional and conditional mean value functions at an arbitrary future time t_n + l are given by

Λ(t_n + l; θ̂) (4)

and

E[N(t_n + l) | N(t_n) = y_n] = y_n + Λ(t_n + l; θ̂) − Λ(t_n; θ̂), (5)

respectively.
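As a minimal illustration of grouped-data maximum likelihood estimation, the following sketch fits the exponential SRGM Λ(t) = a(1 − exp(−bt)) to synthetic grouped data with a coarse grid search standing in for a full nonlinear optimizer; the data, parameter grid, and the dropped constant term ln[(y_i − y_{i−1})!] (which does not depend on θ) are our own assumptions for illustration:

```python
import math
import random

def grouped_loglik(a, b, t, y):
    """Log-likelihood of Equation (3) for the exponential SRGM,
    dropping the additive constant that does not depend on (a, b)."""
    ll, prev_y, prev_L = 0.0, 0, 0.0
    for ti, yi in zip(t, y):
        L = a * (1.0 - math.exp(-b * ti))
        dy, dL = yi - prev_y, L - prev_L          # faults / expected faults per interval
        if dL <= 0.0:
            return float("-inf")
        ll += dy * math.log(dL)
        prev_y, prev_L = yi, L
    return ll - prev_L                             # subtract Lambda(t_n)

# Synthetic grouped (cumulative) fault count data, for illustration only.
random.seed(1)
t = list(range(1, 21))
y, c = [], 0
for _ in t:
    c += sum(1 for _ in range(6) if random.random() < 0.5)  # crude daily count
    y.append(c)

# Coarse grid search over (a, b) in place of a multimodal nonlinear optimizer.
best = max(((grouped_loglik(a, b, t, y), a, b)
            for a in range(y[-1], y[-1] * 3, 5)
            for b in [i / 100.0 for i in range(1, 50)]),
           key=lambda x: x[0])
ll_hat, a_hat, b_hat = best
```

In practice one would use a proper constrained optimizer (and several starting points, because of the multimodality mentioned in Section 1) rather than a grid.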
Table 1. NHPP-based SRGMs.
When the mean value function is unknown, a few nonparametric approaches have been developed. However, it should be noted that those approaches can deal with the fault-detection time data, but do not work for future prediction. The wavelet-based method can treat the grouped data, but inherently fails to make long-term predictions. In the following section, we use an elementary MLP for the purpose of long-term software fault prediction.
3. A Refined MLP Architecture
The artificial neural network is a computational metaphor inspired by studies of the brain and nervous system, and consists of an input layer with some inputs, one or more hidden layers with hidden neurons, and one output layer. The input layer of neurons captures the inputs from the outside world. While the hidden layer of neurons has no communication with the external world, the output layer of neurons sends the final output to the external world. Hence, determining an appropriate number of hidden neurons is an important design issue in neural computation. In this paper we consider a MIMO type of MLP with only one hidden layer. Similar to Section 2, suppose that n software fault count data are observed at the observation point t_n. Our concern is the future prediction of the cumulative number of software faults at time t_n + l.
3.1. Preliminary Set-Up
In common neural computation, it is noted that a neural network, including the simplest MLP with only one output neuron, is regarded as a nonlinear regression model, where the explanatory variables are randomized by Gaussian white noise. In other words, the output data of the MLP are implicitly assumed to be realizations of a nonlinear Gaussian model. On the other hand, since one handles the fault count data as integer values in software fault prediction, the underlying data must be transformed to Gaussian data in advance. Such data pre-processing is common in wavelet shrinkage estimation, although there the underlying data are specified as Poisson data. Following the idea in the literature, we apply the Box-Cox power transformation technique from arbitrary random data to Gaussian data. As mentioned in Section 1, Box and Cox developed a procedure to identify an appropriate exponent λ to transform data into a "normal shape". Table 2 presents the Box-Cox power transformation and its inverse transform formula. In this table, y_i denotes the cumulative number of software faults detected by the i-th testing day. Then, we obtain the transformed data z_i by means of the Box-Cox power transformation. The transformation parameter λ indicates
Table 2. Box-Cox power transformation formulae.
the power to which all data should be raised, and this parameter λ has to be adjusted in the Box-Cox power transformation. It is common to determine the optimal λ in pre-experiments before time series prediction.
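The Box-Cox formulae referenced in Table 2 follow the standard form z = (y^λ − 1)/λ for λ ≠ 0 and z = ln y for λ = 0, with the corresponding inverses; a minimal sketch with illustrative data and an assumed λ:

```python
import math

def boxcox(y, lam):
    """Box-Cox power transform for y > 0: (y**lam - 1)/lam, or ln(y) when lam == 0."""
    if lam == 0.0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def boxcox_inv(z, lam):
    """Inverse transform back to the original scale."""
    if lam == 0.0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)

data = [1, 4, 9, 15, 22, 30]   # cumulative fault counts (illustrative values)
lam = 0.5                      # transformation parameter chosen in pre-experiments (assumed)
z = [boxcox(v, lam) for v in data]
back = [boxcox_inv(v, lam) for v in z]
```

The round trip through the inverse recovers the original counts, which is what makes the final prediction step (inverting the transformed network outputs) possible.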
Let (z_1, z_2, ..., z_n) and (ẑ_{n+1}, ..., ẑ_{n+l}) be the input and output of the MIMO type of MLP, respectively. Then, the prediction of the cumulative number of software faults is given by the inversion of the data transform. Figure 1 depicts the architecture of the back-propagation-type MIMO, where n is the number of software fault count data experienced before the observation point t_n and l is the prediction length. We suppose that there is only one hidden layer with k hidden neurons in our MIMO type of MLP.
3.2. Training Phase
Suppose that all the connection weights (the weights from the input to the hidden layer and from the hidden to the output layer in Figure 1) are first given by uniformly distributed pseudo-random variates. In the MIMO, if these weights were completely known, then it would be possible to calculate the output from the input directly. However, since it is impossible to train all the weights, including unknown patterns, in principle via the common BP algorithm, it is necessary to develop a new long-term prediction scheme for the MIMO. Below, we briefly introduce the long-term prediction scheme developed by Begum and Dohi. Suppose that l < n without any loss of generality. In Figure 2, we illustrate the configuration of our prediction scheme. In order to predict the cumulative number of software faults for l testing days from the observation point t_n, the prediction has to be made at the point t_{n−l}. This implies that only the weights associated with the first n − l input neurons can be estimated with the training data experienced for the period (0, t_{n−l}], and that the remaining weights are not trained at time t_n. We call these weights the non-estimable weights in this paper. As the prediction length l becomes longer, the number of non-estimable weights becomes greater, and the prediction uncertainty also increases. In this scheme, the Box-Cox transformed data with given λ are used as the input to the MIMO, and the remaining l data are used as the teaching signals in the training phase.
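The data split in this scheme can be sketched as follows; the variable names and the weight-count arithmetic are our own illustration of the description above (assuming one input neuron per observed data point and k hidden neurons), not code from the paper:

```python
# Assumed shapes: n observed data points, prediction length l, k hidden neurons.
# Training happens at the virtual observation point t_{n-l}.
n, l, k = 20, 5, 10

inputs = list(range(1, n - l + 1))        # indices of z_1..z_{n-l} fed as training inputs
teachers = list(range(n - l + 1, n + 1))  # indices of z_{n-l+1}..z_n used as teaching signals

trainable = (n - l) * k                   # input-to-hidden weights that BP can estimate
non_estimable = l * k                     # weights tied to the l newest inputs: untrained
```

The non_estimable count grows linearly in l, which mirrors the statement that a longer prediction length inflates the prediction uncertainty.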
The BP algorithm is the well-known gradient descent method to update the connection weights so as to minimize the squared error between the network output values and the teaching signals. In addition to the values coming out of the input neurons, z_i, it is common to add two special inputs: bias units
Figure 1. Architecture of back propagation type MIMO.
Figure 2. Configuration of prediction scheme via MIMO.
which always take the unit value. These inputs are used to evaluate the bias to the hidden neurons and output neurons, respectively. Let w_{ij} be the connection weight from the i-th input neuron to the j-th hidden neuron, and let w_{0j} and v_{0s} denote the bias weights for the j-th hidden neuron and the s-th output neuron, respectively, for the training phase with i = 1, 2, ..., n − l, j = 1, 2, ..., k, and s = 1, 2, ..., l. Each hidden neuron calculates the weighted sum of the input neurons, x_j, in the following equation:

x_j = Σ_{i=1}^{n−l} w_{ij} z_i + w_{0j}. (6)
Since there is no universal method to determine the number of hidden neurons, we change k in the pre-experiments and choose an appropriate value. After calculating x_j for each hidden neuron, we apply a sigmoid function f(x) = 1/(1 + exp(−x)) as the threshold function in the MIMO, so that the output of the j-th hidden neuron is h_j = f(x_j). Since Σ_j v_{js} h_j are the summative and weighted inputs from the respective hidden neurons, the s-th output in the output layer is given by

ẑ_s = f(Σ_{j=1}^{k} v_{js} h_j + v_{0s}). (7)
Here v_{js} is the weight connecting the j-th hidden neuron to the s-th output neuron, and Σ_j v_{js} h_j + v_{0s} is the summative and weighted input to the s-th output neuron. The output value of the network in the training phase, ẑ_s, is calculated by Equation (7). In the BP algorithm, the error is propagated from the output layer to the preceding hidden layer by updating the weights, where the error function is defined by

SSE = (1/2) Σ_{s=1}^{l} (ẑ_s − z_{n−l+s})², (8)
with the prediction value ẑ_s and the teaching signal z_{n−l+s} observed for the period (t_{n−l}, t_n].
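A minimal forward-pass sketch of the single-hidden-layer MIMO MLP with sigmoid units and the squared-error objective; the weight shapes and input values here are hypothetical, chosen only to make the sketch runnable:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(z, W, w0, V, v0):
    """One forward pass of the single-hidden-layer MLP.
    z: inputs; W[i][j], w0[j]: input-to-hidden weights and biases;
    V[j][s], v0[s]: hidden-to-output weights and biases."""
    h = [sigmoid(sum(W[i][j] * z[i] for i in range(len(z))) + w0[j])
         for j in range(len(w0))]
    out = [sigmoid(sum(V[j][s] * h[j] for j in range(len(h))) + v0[s])
           for s in range(len(v0))]
    return h, out

def sse(out, teach):
    """Squared-error objective between network outputs and teaching signals."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, teach))

random.seed(0)
n_in, k, n_out = 4, 3, 2                               # assumed layer sizes
W = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(n_in)]
w0 = [random.uniform(-1, 1) for _ in range(k)]
V = [[random.uniform(-1, 1) for _ in range(n_out)] for _ in range(k)]
v0 = [random.uniform(-1, 1) for _ in range(n_out)]
h, out = forward([0.1, 0.2, 0.3, 0.4], W, w0, V, v0)
```

Because every unit is a sigmoid, all outputs lie strictly in (0, 1); in the full scheme the (Box-Cox transformed) data would be scaled accordingly before being used as teaching signals.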
Next we overview the BP algorithm. It updates the weight parameters so as to minimize the SSE between the network output values and the teaching signals, where each connection weight is adjusted using the gradient descent algorithm according to its contribution to the SSE in Equation (8). The momentum, α, and the learning rate, η, are controlled to adjust the weight changes and the convergence speed in the BP algorithm, respectively. Since these are the most important tuning parameters in the BP algorithm, we carefully examine them in pre-experiments. In this paper we set α = 0.25 ~ 0.90 and η = 0.001 ~ 0.500. Then, the connection weights are updated as follows:

w_{ij} ← w_{ij} + η δ_j z_i + α Δw_{ij},
v_{js} ← v_{js} + η δ_s h_j + α Δv_{js},
where Δw_{ij} and Δv_{js} denote the previous weight changes, and δ_j and δ_s are the output gradient of the j-th hidden neuron and the output gradient in the output layer, defined by

δ_s = (z_{n−l+s} − ẑ_s) ẑ_s (1 − ẑ_s),
δ_j = h_j (1 − h_j) Σ_{s=1}^{l} δ_s v_{js},
respectively. Also, the updated bias weights for the hidden and output neurons are respectively given by

w_{0j} ← w_{0j} + η δ_j,
v_{0s} ← v_{0s} + η δ_s.
The above procedure is repeated until the desired output is achieved.
3.3. Prediction Phase
Once the weights are estimated with the training data experienced for the period (0, t_{n−l}] through the BP algorithm, we need to obtain the remaining non-estimable weights for prediction. Unfortunately, since these cannot be trained with the information available at time t_n, we give these values by uniform pseudo-random variates in the prescribed range. Given the random connection weights, the outputs as the predictions of the cumulative number of software faults, ẑ_{n+s}, are calculated by replacing Equations (6) and (7) by

x_j = Σ_{i=1}^{n} w_{ij} z_i + w_{0j},

ẑ_{n+s} = f(Σ_{j=1}^{k} v_{js} h_j + v_{0s}),
respectively, for i = 1, 2, ..., n, j = 1, 2, ..., k, and s = 1, 2, ..., l. Note that the resulting output is based on one sample generated from a set of uniform pseudo-random variates. In order to obtain the prediction of the expected cumulative number of software faults, we generate m sets of random variates and take the arithmetic mean of the m predictions, where m = 1000 is confirmed to be enough in our preliminary experiments. In other words, the prediction in the MIMO type of MLP is reduced to a combination of BP learning and a Monte Carlo simulation on the connection weights.
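The prediction phase can be sketched as follows: the trained input-to-hidden weights are combined with freshly drawn non-estimable weights in each replication, and the m outputs are averaged. All sizes, weight values, and the range [-1, 1] for the random weights are our own assumptions for illustration:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_once(z, W_trained, W_extra, w0, V, v0):
    """Forward pass where W_extra holds the randomly drawn non-estimable
    input-to-hidden weights for the l most recent input neurons."""
    W = W_trained + W_extra                       # one weight row per input neuron
    k = len(w0)
    h = [sigmoid(sum(W[i][j] * z[i] for i in range(len(z))) + w0[j]) for j in range(k)]
    return [sigmoid(sum(V[j][s] * h[j] for j in range(k)) + v0[s]) for s in range(len(v0))]

random.seed(3)
n, l, k = 8, 2, 3                                 # assumed sizes
z = [i / n for i in range(1, n + 1)]              # all n transformed inputs
W_trained = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(n - l)]
w0 = [0.0] * k
V = [[random.uniform(-1, 1) for _ in range(l)] for _ in range(k)]
v0 = [0.0] * l

m = 1000
sums = [0.0] * l
for _ in range(m):
    # Non-estimable weights: re-drawn uniformly in [-1, 1] (assumed range).
    W_extra = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(l)]
    out = predict_once(z, W_trained=W_trained, W_extra=W_extra, w0=w0, V=V, v0=v0)
    sums = [a + b for a, b in zip(sums, out)]
pred = [s / m for s in sums]                      # Monte Carlo average over m replications
```

In the full method, each averaged output would then be mapped back to a fault count through the inverse Box-Cox transform.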
4. Numerical Experiments
4.1. Data Sets
We use two real project data sets cited in the references, DS1 and DS2, which consist of software fault count (grouped) data. In these data sets, the length of software testing and the total number of detected software faults are given by (62, 133) and (41, 351), respectively. To find the desired output via the BP algorithm, we need much computation cost to calculate the gradient descent, where the initial guesses of the weights, w_{ij}, v_{js}, w_{0j} and v_{0s}, are given by uniform random variates, the total number of iterations in a BP algorithm run is 1000, and the convergence criterion on the minimum error is 0.001, the same as in our previous paper. In Figure 3 and Figure 4, we give two examples of how to determine the optimal transformation parameter λ. In our experiments, the appropriate search range of λ is identified from such plots.
4.2. Predictive Performance Criterion
Suppose that the observation point is given by the n-th testing day, t_n. In this case, n software fault count data are used for training the MIMO type of MLP. The capability of the prediction model is measured by the average error (AE):

AE = (1/l) Σ_{s=1}^{l} RE_s,
where RE_s is called the relative error for the future time t_{n+s} and is given by

RE_s = |y_{n+s} − ŷ_{n+s}| / y_{n+s},

with the observed cumulative number of software faults y_{n+s} and its prediction ŷ_{n+s}.
So we regard the prediction model with smaller AE as a better prediction model.
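The AE criterion is a straightforward average of relative errors over the prediction horizon; a small sketch with made-up numbers (not values from DS1 or DS2):

```python
def average_error(actual, predicted):
    """AE: mean of the relative errors |y_hat - y| / y over the prediction horizon."""
    assert len(actual) == len(predicted) and all(a > 0 for a in actual)
    rel_errors = [abs(p - a) / a for a, p in zip(actual, predicted)]
    return sum(rel_errors) / len(rel_errors)

actual = [110, 115, 121, 126, 130]      # observed cumulative faults (illustrative)
predicted = [108, 116, 119, 128, 133]   # model predictions (illustrative)
ae = average_error(actual, predicted)
```

Because AE normalizes by the observed counts, it allows a fair comparison across observation points where the absolute fault counts differ.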
Figure 3. Determination of the transformation parameter λ (DS1 with 50% observation point).
Figure 4. Determination of the transformation parameter λ (DS2 with 50% observation point).
Tables 3-6 summarize the results on AE for the underlying data set DS1 at the 50% ~ 90% observation points of the whole data for the prediction lengths l = 5, 10, 15 and 20 days, where λ* denotes the optimal transformation parameter in the sense of minimum AE, and the bold number indicates the best prediction model in the same category. For instance, Table 3 gives the prediction results on the cumulative number of software faults for the 5-day prediction at the respective observation points, when the number of hidden neurons changes from k = 10 to 50. In the MIMO type of MLP neural network, we compare the Box-Cox power transformation with the non-transformed case (Normal) and the best SRGM in Table 1. In the SRGM column, we denote the best SRGMs in terms of predictive performance (in the sense of minimum AE) and estimation performance (in the sense of minimum AIC) by P and E, respectively.
It is seen that our MIMO-based approaches provide smaller AEs than the common SRGMs in almost all cases when the observation points are 50% and 70%. At the 60% observation point, the best prediction model is the non-transformed MIMO (Normal) with k = 50. On the other hand, in the latter phase of software testing, i.e., at the 80% ~ 90% observation points, SRGMs such as txvmax and txvmin offer smaller AEs than the MIMO type of MLPs. Even in these cases, it should be noted that the best SRGM with the minimum AIC is not always equivalent to the best SRGM with the minimum AE. This fact tells us that one cannot know exactly the best SRGM in advance in terms of predictive performance. Comparing the data transform methods with the non-transformed one, we can find only one case where Normal provides the best prediction result, in Table 3 for DS1 and Table 8 for DS2. In the other early prediction phases, however, it is seen that the data transform works well to give more accurate prediction results in the MIMO type of MLPs.
Tables 7-10 present the prediction results on AE for DS2. Similar to DS1, the MIMO type of MLPs can predict the cumulative number of software faults more accurately than SRGMs in the early testing phase, say, at the 50% ~ 60% observation points. Focusing on the number of hidden neurons in the MIMO type of MLPs, we first expected that a larger k might lead to better predictive performance. However, this is not true from the results in Tables 4-6. In the MIMO-based approach, it is essential to determine feasible k and λ values, because a larger number of hidden neurons incurs an expensive computation cost for each prediction length l. In their original paper, Box and Cox suggest that one should "fix one, or possibly a small number, of λ's and go ahead with the detailed estimation". In their examples, they use what is usually called the "snap to the grid" method to choose the transformation parameter. Unfortunately, no universal method to determine the optimal λ has been reported in the literature yet. Hence, it is necessary to choose an appropriate λ even though it may not be optimal. From Figure 3 and Figure 4, it can be recognized that the adjustment of λ is quite sensitive to the predictive performance and has to be done through trial-and-error heuristics. However, for an arbitrary λ, we can see that the multistage look-ahead
Table 3. Comparison of average relative errors for five days prediction with DS1 (l = 5).
Table 4. Comparison of average relative errors for ten days prediction with DS1 (l = 10).
Table 5. Comparison of average relative errors for fifteen days prediction with DS1 (l = 15).
Table 6. Comparison of average relative errors for twenty days prediction with DS1 (l = 20).
Table 7. Comparison of average relative errors for five days prediction with DS2 (l = 5).
Table 8. Comparison of average relative errors for ten days prediction with DS2 (l = 10).
Table 9. Comparison of average relative errors for fifteen days prediction with DS2 (l = 15).
Table 10. Comparison of average relative errors for twenty days prediction with DS2 (l = 20).
prediction of the software fault count is possible with the MIMO type of MLP and that the Box-Cox power transformation improves the predictive accuracy, especially in the early prediction phase.
5. Concluding Remarks
In this paper we have investigated the applicability of the Box-Cox power transformation to neuro-based software fault prediction. The ANN employed in this paper is a MIMO type of MLP, which can handle grouped data on software fault counts as well as make long-term predictions. To the best of our knowledge, this paper is the first attempt to treat the long-term prediction of software faults with grouped data in the ANN approach. Through a comprehensive comparison with the existing SRGMs, it has been shown that our MIMO type of MLP works well to predict the cumulative number of software faults in the early testing phase. In the future, we will apply the proposed neural network approach to other software fault count data and conduct a more comprehensive data analysis to validate our method with data transformation. Especially, a challenging issue is to develop the prediction interval of the cumulative number of software faults. Even if SRGMs are assumed, it is almost impossible to obtain exact prediction intervals of the cumulative number of software faults without applying an approximation method. We will extend our prediction scheme based on the MIMO type of MLP to the interval prediction problem. We will also consider how to select the optimal transformation parameter in the Box-Cox transformation.
Acknowledgements

The first author (M.B.) was partially supported by the MEXT (Ministry of Education, Culture, Sports, Science, and Technology) Japan Government Scholarship.