In the pursuit of system design and system engineering, it is important to construct a reliable system. One general way to do this is to include redundancies for each component. This is done to overcome failures of individual components. Each redundancy can be thought of as a backup unit. When one unit fails, a backup unit is activated, so that the component does not fail. The more redundancies, or backups a component has, the less likely the component is to fail. At the same time, however, the cost increases for each redundancy, or backup unit that is added to the component. As such, it is desired to construct a reliability system that maximizes system reliability while simultaneously adheres to a limited budget.
Unfortunately, problems of this nature are difficult to solve for a couple of reasons: they are combinatorial in nature, making it difficult to enumerate all possible solutions. Additionally, these problems are highly nonlinear, with a polynomial order equal to the number of decision variables, thereby preventing the use of traditional mathematical programming approaches.
Because of the challenging mathematics associated with these types of problems, heuristic approaches can be employed to find reasonable, if not optimal solutions. Additionally, if the heuristic approach is well thought out, the approach can find near-optimal solutions with a reasonable computational effort. For this research effort, an ant-colony optimization approach is employed as a sort of traversal through each system component with the intent of finding a “traversal” that provides, for each component, a high overall-system reliability that meets a specified budget.
The subsequent sections of the paper present the details of the problem at hand, describe the heuristic methodology used, present a simple example problem, discuss applying the heuristic to some test problems, and compare these results to the optimal solution. General observations, and opportunities for subsequent research are then offered.
The reliability of a system is the mathematical product of the effective reliability for each individual component. The effective reliability of an individual component is a function of the number of backup or parallel units. With more parallel units, the component’s effective reliability increases. This principle is well-do- cumented    . The effective cost also rises when more parallel units are added. Figure 1 details how the effective component reliability and effective cost works across three components―for each backup units/component combination, the first line shows reliability, while the second lines shows cost.
For this example, we assume the reliability for a single parallel unit for component j is Rj, where . We also assume the cost for a single parallel unit for component j is Cj. The effective reliability for component j is , where i is the number of units in parallel. The effective cost for component j is , where D is the quantity discount factor ( ), with i units in parallel. The boxes in orange show the number of parallel units used for each component: three units for component (1), two units for component (2), and four units for component (3). The system reliability and system cost
Figure 1. Components in series and parallel.
are shown as follows:
Selection of the number of units to employ for each component is the difficult part of this problem. Of course, we’d like to have as many parallel units as possible, as that increases the system reliability. Unfortunately, it also increases the system cost, and in reality, we are likely to have a limited budget. As such, we need to maximize reliability while staying within some specified budget.
Optimization via mainstream mathematical programming approaches is not possible here, due to the highly nonlinear property of the problem. Additionally, the problem has an enormous search space, rendering enumeration impractical. In order to address this problem, we need to select a search heuristic to find a desirable solution requiring a reasonable computational effort. While there are many heuristic search approaches possible, ant-colony optimization is the search approach for this particular research effort.
Ant-colony optimization is based on the concept of ants foraging for food. Ants essentially communicate with each other via secreting a substance called “pheromone.” Pheromone attracts other ants. A trail rife with pheromone will likely induce other ants to follow its path, while at the same time, there is a relatively small probability for subsequent ants to deviate from this established path. This concept is illustrated by Figure 2. Initially, ants are confined to their nest (t = 0). At t = 1, the ants pursue the food in what is seen as random fashion. At t = 2, there emerges a pattern where the ants start to enforce a direct route to the food. At steady state (t = ∞), the ants have seemingly found the shortest (best) route to the food, but the possibility of finding alternate routes still exists   .
Figure 2. Traversal of ants  .
For this research effort, we will simulate the behavior of the ants so that they traverse the individual components of a system with the intent of maximizing system reliability within budget-that is, having ants finding the “best” path during their traversals. Dealing with a limited budget and quantity discounts is a unique feature of this effort  . This is analogous to ants finding the most efficient route to a food source. The next section formalizes the search process.
Here, we first formulate the general problem, detail the search heuristic, and offer a simple example to illustrate the methodology.
Before presenting the formulation, the following terms in Table 1 are introduced.
For this problem, our objective is to maximize the system reliability-the pro- duct of all component reliabilities. Our individual component reliabilities are dependent on the number of parallel units used for each component, which is a decision variable (xj). The number of parallel units used for each component determines the cost of each individual component. More units used in parallel for each component increases the cost for the component, subsequently increasing the total cost. The total cost is limited by the budget, B. From a mathematical standpoint, we can represent our model in mathematical programming form via the following objective function and constraint, where decision variables are specified as integer, and in the range [1, n]:
Table 1. Data for example problem.
From inspection of the above model it is clear that our objective function is highly nonlinear, with a polynomial order of m. Additionally, our budget constraint is not continuous. Because of this, traditional mathematical programming approaches do not guarantee optimality. As such, we need to pursue a heuristic to provide a reasonable optimal solution via a reasonable computational effort.
Several heuristic approaches are available to address this type of problem. A genetic algorithm can be used to address this problem. Genetic Algorithms have proven effective in solving difficult optimization problems. The down side of Genetic Algorithms is computational inefficiency. A Genetic Algorithm generates binary solutions, performs analyses, manipulates the binary characteristics of the problem, converts binary solutions back into decimal solutions, and repeats this process until some pre-specified stopping criterion is met. This puts a lot of strain on the processor, resulting in a protracted search.
For this research effort, an artificial agent approach is used-something similar to ant colony optimization. An “ant” will traverse each component, starting with component 1 and concluding with component m. The choice of how many parallel units to select for each component during the traversal is probabilistic in nature, and proportional to the relative desirability of component i for each component j. For each unique traversal, the system reliability is computed, and if feasible, the system reliability is compared to the best system reliability found thus far. If the new solution is found to be the best so far, it becomes the new best solution, and the attractiveness of the solution components are enhanced accordingly. If the new solution is not the best found thus far, the attractiveness of the solution is diluted accordingly.
3.2. Ant-Colony Heuristic
Prior to detailing the optimization algorithm, the following terms in Table 2
Table 2. Data for example problem.
3.2.1. Step 1: Initialization
First of all, the reliability and cost matrices are determined. The reliability of associated with employing i parallel units for component j is determined as follows:
The cost associated with employing i parallel units for component j is determined as follows:
Because we desire high values of reliability and low values of cost when choosing the number of parallel units for each component, pheromone is a function of the ratio of reliability to cost for each parallel unit/component combination. As such, pheromone is as follows:
Our level of pheromone, along with the initialized improvement matrix, is then used to determine the probability of selecting i parallel units for component j. This is determined as follows:
The probability calculation above combines both pheromone and the number of overall best-solution improvements associated with selecting i parallel units for component j. The values of α and β are amplifiers, or weights for pheromone and improvements, respectively. MaxR is initialized to zero. Thus concludes the initialization process. The traversal process is now described.
3.2.2. Step 2: Traversal
An artificial agent, or “ant” traverses all m components, and at each component, a number of backup units are selected. We use the decision variable xj to represent the number of backup units selected for component j. This selection is done via Monte-Carlo simulation using the probabilities shown in Equation (10). A uniformly-distributed random number on the [0, 1] interval is generated and the appropriate value of xj is determined by where this random number lies in accordance with Equation (10). At the completion of the traversal, we have a solution.
3.2.3. Step 3: Assessment
Our traversal of the m components provides us with a solution, where xj represents the number of parallel units employed for component j. We need to determine the system reliability and system cost of our recently-obtained solution. System reliability is obtained as follows:
System cost is defined as follows:
If the system reliability obtained via Equation (11) exceeds MaxR, and the system cost obtained via Equation (12) is less than the budgeted amount (B), the solution is the best one found thus far, and this solution becomes the best solution thus far:
3.2.4. Step 4: Pheromone Update
If the recently found solution is the best thus far as defined above, the pheromone is updated accordingly:
Additionally, the number of improvements associated with this new best solution is incremented as follows:
Otherwise-that is, if the new solution is not an improvement over the best- found solution thus far, the pheromone is updated accordingly.
Equations (14) and (15) enhance a solution that was found to be desirable-the characteristics of this solution are made more likely to occur in future traversals. Equation (16) dilutes a solution that was not found to be desirable―the characteristics of this solution are made less likely to occur in future traversals. Regardless of the type of pheromone adjustment made above, probabilities must be re-adjusted for the next traversal. This is done via Equation (10).
3.2.5. Step 5: Repeat
Steps 2, 3 and 4 are repeated until some stopping condition is reached, which is typically a user-defined condition. The last reported “best solution” reported is considered the best solution.
To illustrate the presented methodology, we present a small example. Consider a problem with eight components (m = 8) and up to six units in parallel (n = 6) are permitted for each component. Table 3 below shows the given Rj and Cj values.
Table 4 shows the reliability matrix (rij values) for up to six units in parallel for each of the eight components. Equation (6) was used to determine these values. Please note that for each individual component, the reliability increases with
Table 3. Data for example problem.
Table 4. Reliability matrix.
each additional unit in parallel. That is, the more units used in parallel, the higher the component’s reliability. As an example, if three parallel units are used for component six, the component reliability is 0.999578 (r36 = 0.999578).
Table 5 shows the cost matrix for i units in parallel for each component (cij values). These values are determined by Equation (7), along with a discount factor (D) of 0.97. Note that for each component, using more units in parallel increases the component cost by a factor of D. As such, there is a tradeoff with using more units in parallel―the more units used in parallel, the higher the cost. Because of this, one must carefully choose the number of parallel units to optimize the system reliability with a limited budget, B. As an example, if we use three parallel units for component two, our cost is $10.19 (r32 = 10.19).
As an agent or “ant” traverses the m components of the system, they must make a decision on how many parallel components to employ for the next component. This is a stochastic process, and the level of pheromone is influential in the selection process. For this research effort, the level of pheromone is determined via Equation (8), which is the ratio of rij to cij for all i,j combinations. It should be noted that since the impij values are initialized to one, they do not affect the probabilities at this point in the search.
For this example, for three parallel units being used for component one is . Additionally, the values of and α and β and both assumed to equal one. These values are shown in Table 6. For each component, the total amount of (pherij)(impij) is summed on the bottom line.
These values are converted to probabilities via Equation (10), and are shown in Table 7. For each individual component, one will note that the sum of probabilities is unity. Monte Carlo simulation is used to select the number of parallel units employed for each component. For example, there is a 0.1057 probability that there will be four parallel units employed for component eight (prob48 = 0.1057).
For the purpose of our example, let us assume that we have a discount factor
Table 5. Cost matrix.
Table 6. (pherij)(impij) matrix.
Table 7. Probability matrix.
of D = 0.97, with a budget of B = $200. We also assume Monte Carlo simulation resulted in a solution of x = [3, 4, 3, 3, 2, 3, 2, 2]. That is three parallel units for component one, four parallel units for component two, etc. Tables 4-6 are enhanced to show the associated component reliabilities and costs associated with this solution-these values are shaded. With this solution, we compute the system reliability as follows:
The system cost is determined as follows:
Given that this is our first solution, it provides us with the maximum system reliability (maxR) thus far. Our system cost is also within our budget of $200 (cs < B). As such, this is our best solution thus far. The value of MaxR is replaced with rs. Because this solution is an improvement over the previous best solution, the pheromone must be updated accordingly. Table 8 and Table 9 show how the pheromone matrix is updated by the pre-selected amplifier value equal to 0.01 (A = 0.01) for the appropriate solution values-affected values are shaded. Non- affected values are the same as before.
Because this solution is the best found thus far, our improvement matrix is also updated in accordance with Equation (15).
Because the pheromone and improvement matrices have been adjusted, the corresponding probability matrix must also be adjusted. All probability values will change. Probabilities associated with the recently-obtained best solution will increase, while the other probabilities will decrease―this is the result of enhancing the pheromone and improvement values for the recently found best solution. Table 10 shows the updated probability matrix.
Table 8. Updated pheromone matrix.
Table 9. Updated improvement matrix.
Table 10. Updated probability matrix.
The iterative process: traversal, assessment, and updating of pheromone, improvements and probabilities continues until some stopping criterion is met. For this simple example, 1000 iterations are used, resulting in the solution x = [5, 5, 4, 6, 4, 4, 4, 3], with a system reliability of rs = 0.999804, and a system cost of cs = $198.68. This solution was also determined to be optimal via enumeration of all solutions.
The presented methodology is used on four test problems. Details of these test problems are shown in Table 11. The number of possible solutions is equivalent to the following:
It should be noted that using more than eight parallel units (backups) for each component results in diminishing utility-system reliability gains become negligible, and only the cost increases, leading to suboptimal solutions.
For each of the test problems, the Rj and Cj values are provided in Tables 12-15. This information enables us to pursue the presented heuristic, in our effort to find good solutions for these test problems.
Table 11. Test problems.
Table 12. Test problem ACO-1.
Table 13. Test problem ACO-2.
Table 14. Test problem ACO-3.
Table 15. (a) Test Problem ACO-4. (b) Test Problem ACO-4.
5. Experimental Results
Table 16 shows the results of the heuristic approach as compared to the optimal solution as obtained via complete enumeration. For all problems, the search parameters for α and β were set to 1 and 1.5 respectively. For each problem, the ant-colony optimization heuristic was performed using the number of iterations specified in Table 16. For each problem, the heuristic was repeated ten times, and the mean and standard deviation of the objective function value were calculated. From inspection of Table 16, it should be noted that the heuristic performed, on average, no worse than 0.03% inferior to optimal.
Figure 3 shows a “snapshot” of the improvement matrix for ACO-4. Each column represents one of the (14) components, while each row represents the number of units used in parallel for the component of interest. The intensity of each circle represents the frequency that i parallel units were used for component j when a new “best solution” was found. The darker the circle, the more often that particular number of parallel units was found to be beneficial for the best solution. For example, (7) parallel units is a desirable number value for component (1), while (1) parallel unit was not desirable for component (1). This improvement matrix is used in the probability calculations to “reinforce” desirable solutions for subsequent ant traversals.
6. Concluding Comments
We have presented an ant-colony optimization approach to address a system reliability problem with quantity discounts and a constrained budget. Because conventional optimization approaches are not feasible for such large, nonlinear problems, we must pursue search heuristics. Of course, ant-colony optimization is one of many options. Tabu Search, Simulated Annealing and Genetic Algorithms are possible search candidates. For this type of problem, however, ant- colony optimization serves us well due to the fact that we are able to model our desire for a solution by simulating an ant traversing the search space in a logical fashion. This would be more difficult with the aforementioned approaches. Furthermore, these other search approaches can be computationally expensive due to the management details associated with these approaches.
Of course, there are always opportunities for additional research, and this effort
Table 16. Heuristic solutions compared to optimal.
Figure 3. Snapshot of improvement matrix for ACO-4.
is no exception. The largest problem addressed has 4.4 trillion possible solutions. Comparing our best solution found via the heuristic to the optimal solution with 4.4 trillion possibilities required some work, but with today’s computational resources, it was surprisingly possible. In fact, the size of the experimental problems is considered one of the unique features associated with this effort―other efforts work with are unable to provide solutions to problems of this practical size   . To this end, larger problems can be pursued in the future. Additionally, other heuristic approaches can be tried, such as those aforementioned.