Received 29 March 2016; accepted 20 May 2016; published 23 May 2016
It is widely agreed that the problem of determining upper bounds on execution times for sequential programs has been more or less solved. With the advent of multi-core processors, scientific and industrial interest focuses on the analysis and verification of multi-threaded applications. The scientific challenge comes from the fact that synchronization has to be taken into account. In this paper, we focus on how to incorporate stalling times in a WCET analysis of shared memory concurrent programs running on a multi-core architecture. The emphasis is on a formal definition and description of both our graph model and the dataflow equations for timing analysis.
Previous work in the field of timing analysis for multi-core systems (e.g.  ) assumes that the threads are executed more or less in parallel and do not heavily synchronize with each other, except when forking and joining. Our approach supports critical sections and the corresponding stalling times (e.g. caused by lock contention) at the heart of its matrix operations. Forking and joining of threads can also easily be modeled. Thus, our model is suitable for systems ranging from a concurrent to a (fork and join) parallel execution model. However, the focus of this paper is on a concurrent execution model.
The idea that thread interleavings and synchronization between threads can be studied with a matrix calculus is novel in this research area. Our sparse matrix representations of the program are manipulated using a lazy implementation of our extended Kronecker algebra. In  , the Kronecker product is used to model synchronization. Similar to  , we describe synchronization by our selective Kronecker product and thread interleavings by Kronecker sums. The first goal is the generation of a data structure called concurrent program graph (CPG), which describes all possible interleavings and incorporates synchronization while preserving completeness. In general, our model can be represented by sparse adjacency matrices. The number of entries in the matrices is linear in their number of rows. In the worst case, the number of rows increases exponentially in the number of threads. The CPG, however, contains many nodes and edges unreachable from the entry node. If the program contains a lot of synchronization, only a very small part of the CPG is reachable. Our lazy implementation computes only this part, which we call the reachable CPG (RCPG). The implementation is very memory-efficient and has been parallelized to exploit modern many-core hardware architectures. These optimizations speed up processing significantly.
RCPGs represent concurrent and parallel programs in a similar way as control flow graphs (CFGs) do for sequential programs. In this paper, we use RCPGs to calculate the WCET of the underlying concurrent system. In contrast to  , we (1) adopt the generating-function-based approach of  for timing analysis and (2) are able to handle loops. For timing analysis, we set up a dataflow equation for each RCPG node. It turns out that at certain synchronizing nodes, stalling times (e.g. caused by lock contention) can be formulated within dataflow equations as simple maximum operations. With this approach, the calculated WCET includes stalling time. This is in contrast to most of the work done in this field (e.g.  ), which usually adopts a partial approach, where stalling times are calculated in a second step. We successively apply the following steps:
1. Generate CFGs out of binary or program code (cf. Subsection 2.1).
2. Generate RCPG out of the CFGs (cf. Section 3).
3. Apply hardware-level analysis based on the RCPG. Such an analysis may take into account shared resources like memory, data caches, and buses, as well as other hardware components like instruction caches and pipelines. Annotate this information at the corresponding RCPG edges. As mentioned above, this step is out of the scope of this paper. Nevertheless, it is necessary in order to obtain tight bounds (cf.  ). Some of these analyses (e.g. cache analysis) may be performed together with the next step.
4. Establish and solve dataflow equations based on the RCPG (cf. Section 4). Stalling times are incorporated via the equations.
Similar to  and  , which provide exact WCETs for sequential programs, our approach calculates an exact worst-case execution time for concurrent programs running on a multi-core CPU (not only an upper bound), provided that the number of iterations of each loop, the execution frequencies and execution times of the basic blocks (including those of the semaphore operations p and v)1 on the RCFG level are known, and the hardware impact is given. We assume timing predictability on the hardware level as discussed e.g. in  .
The outline of our paper is as follows. In Section 2, refined CFGs and Kronecker algebra are introduced. Our model of concurrency, some properties, and our lazy approach are presented in Section 3. Section 4 is devoted to WCET analysis of multi-threaded programs. An example is presented in Section 5. In Section 6, we survey related work. Finally, we draw our conclusion in Section 7.
In this paper, we refer to both a processor and a core as a processor. Our computational model can be described as follows. We model concurrent programs by threads which use semaphores for synchronization. We assume that exactly one thread is running on each processor and that each thread immediately executes its next statement if the thread is not stalled. Stalling may occur only as a consequence of semaphore calls.
Threads and semaphores are represented by slightly adapted CFGs. Each CFG is represented by an adjacency matrix. We assume that the edges of CFGs are labeled by elements of a semiring. A prominent example of such a semiring is the set of regular expressions  describing the behavior of finite state automata.
The set of labels L is defined by L = L_O ∪ L_S, where L_O is the set of ordinary (non-synchronization) labels and L_S is the set of labels representing semaphore calls (L_O and L_S are disjoint). In order to model e.g. a critical section, usually two or more distinct thread CFGs refer to the same semaphore  . The operations on the basic blocks are +, ·, and * from a semiring  . Intuitively, these operations model consecutive program parts, conditionals, and loops, respectively.
2.1. Refined Control Flow Graphs
CFG nodes usually represent basic blocks  . Because our matrix calculus manipulates the edges, we need to have the basic blocks on the (incoming) edges. A basic block consists of multiple consecutive statements without jumps. For our purpose, we need a finer granularity, which we achieve by splitting edges. We apply this to basic blocks containing semaphore calls (e.g. p and v) and require that a semaphore call has to be the only statement on the corresponding edge. Roughly speaking, edge splitting maps a CFG edge e whose corresponding basic block contains k semaphore calls to a subgraph in which each semaphore call is represented by an edge of its own, while additional edges represent the consecutive parts before, between, and after the calls. Applying edge splitting to a CFG results in a refined control flow graph (RCFG). Note that a shared memory access aware analysis requires additional edge splitting for e.g. shared variables as done in  .
In the following, we use the labels as defined above as representatives for the basic blocks of RCFGs. To keep things simple, we refer to edges, their labels, the corresponding basic blocks and the corresponding entries of the adjacency matrices synonymously. In a similar fashion, we refer to nodes, their row and column numbers in the corresponding adjacency matrix synonymously. A matrix entry a in row i and column j is interpreted as a directed edge from node i to node j labeled by a.
In Figure 1(a), a binary semaphore is depicted. In a similar way, it is possible to model counting semaphores allowing n non-blocking p-calls. Entry nodes have an incoming edge with no source node. A double-circled node indicates a final node. In the remainder, we use the RCFGs of the threads A and B presented in Figure 1(b) and Figure 1(c), respectively, as a running example.
2.2. Modeling Synchronization and Interleavings
The Kronecker product and the Kronecker sum form Kronecker algebra. In the following, we define both operations. Proofs, additional properties, and examples can be found in  . From now on, we use matrices with entries from the semiring defined above only.
Figure 1. RCFGs of a binary semaphore and the threads A and B. (a) Binary semaphore; (b) RCFG of thread A; (c) RCFG of thread B.
Definition 1 (Kronecker Product) Given an m-by-n matrix A = (a_ij) and a p-by-q matrix B, their Kronecker product A ⊗ B is an mp-by-nq block matrix defined by A ⊗ B = (a_ij · B), i.e., the block in block-row i and block-column j is a_ij · B.
The Kronecker product allows us to model synchronization  . Properties concerning connectedness when applied to directed graphs can be found in  .
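The blockwise definition above can be sketched in a few lines of plain Python for numeric matrices (an illustrative sketch using nested lists, not the sparse lazy implementation described later):

```python
def kron_product(A, B):
    """Kronecker product of two matrices given as nested lists.

    For an m-by-n matrix A and a p-by-q matrix B the result is the
    mp-by-nq block matrix whose (i, j)-th block is A[i][j] * B.
    """
    m, n = len(A), len(A[0])
    p, q = len(B), len(B[0])
    C = [[0] * (n * q) for _ in range(m * p)]
    for i in range(m):
        for j in range(n):
            for k in range(p):
                for l in range(q):
                    # entry of block (i, j) at inner position (k, l)
                    C[i * p + k][j * q + l] = A[i][j] * B[k][l]
    return C
```

For instance, `kron_product([[1, 2], [3, 4]], [[0, 1], [1, 0]])` yields the 4-by-4 block matrix whose upper-left block is `1 * B` and upper-right block is `2 * B`.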
Definition 2 (Kronecker Sum) Given a matrix A of order2 m and a matrix B of order n, their Kronecker sum A ⊕ B is a matrix of order mn defined by A ⊕ B = A ⊗ I_n + I_m ⊗ B, where I_m and I_n denote identity matrices of order m and n, respectively.
The Kronecker sum calculates all possible interleavings of two concurrently executing automata (see  for a proof), even for general CFGs including conditionals and loops. In Figure 2, the Kronecker sum of the threads A and B depicted in Figure 1 is shown. It can be seen that the Kronecker sum calculates all possible interleavings of the two threads. In particular, note that thread A's loop is copied five times (B's number of nodes). We write l_i to refer to the i-th copy of label l. If it is not clear from the context to which thread a label l belongs, we write l_X to denote that l belongs to thread X. In particular, this is necessary for semaphore operations, which are usually called by at least two threads. Otherwise the executing thread would be unknown.
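On numeric adjacency matrices, Definition 2 translates directly to code. The following sketch (plain Python, our own helper names) builds A ⊕ B = A ⊗ I_n + I_m ⊗ B; applied to two single-edge adjacency matrices it indeed produces all interleavings of the two transitions:

```python
def kron_product(A, B):
    # Kronecker product via index arithmetic: entry (i, j) of the
    # result is A[i // p][j // q] * B[i % p][j % q].
    m, n, p, q = len(A), len(A[0]), len(B), len(B[0])
    return [[A[i // p][j // q] * B[i % p][j % q]
             for j in range(n * q)] for i in range(m * p)]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def kron_sum(A, B):
    """Kronecker sum A (+) B = A x I_n + I_m x B for a square matrix A
    of order m and a square matrix B of order n; on adjacency matrices
    it yields the interleavings of the two automata."""
    m, n = len(A), len(B)
    P = kron_product(A, identity(n))
    Q = kron_product(identity(m), B)
    return [[P[i][j] + Q[i][j] for j in range(m * n)] for i in range(m * n)]
```

For two automata each consisting of a single edge 0 → 1, the sum is an order-4 matrix forming the familiar interleaving "diamond".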
3. Concurrent Program Graphs
Our system model consists of a finite number of threads and semaphores, which are both represented by RCFGs. Threads call semaphores in order to implement synchronization or locks, i.e. mutual exclusion for access to shared resources like shared variables or shared buses. The RCFGs are stored in the form of adjacency matrices. The matrices have entries which are referred to as labels as defined in Section 2.
Formally, the system model consists of a tuple comprising the set of RCFG adjacency matrices describing threads, the set of RCFG adjacency matrices describing semaphores, and the set of labels out of the semiring defined in the previous section. The labels (or matrix entries) of the i-th thread's adjacency matrix are elements of the full label set, whereas the labels (or matrix entries) of the j-th synchronization primitive's adjacency matrix are elements of the set of labels representing semaphore calls. The matrices are manipulated by using conventional Kronecker algebra operations together with extensions which we define in the course of this section.

Figure 2. Kronecker sum of threads A and B.
A concurrent program graph (CPG) is a graph with a set of nodes V, a set of directed edges E ⊆ V × V, a so-called entry node, and a set of final nodes. The sets V and E are constructed out of the elements of the thread and synchronization primitive matrices. Details on how we generate the sets V and E follow below. Similar to RCFGs, the edges of CPGs are labeled by the labels introduced in Section 2.
In general, a thread's CPG may have several final nodes. We refer to a node without outgoing edges as a sink node. A sink node appears as a zero row in the corresponding adjacency matrix. A CPG's final node may also be a sink node (if the program terminates). However, sink nodes and final nodes can be distinguished as follows. We use a vector determining the final nodes of thread i, and, in addition, a vector determining the final node of synchronization primitive j. Both have ones at places q, where node q is a final node, and zeros elsewhere. Then the Kronecker product of all these vectors determines the final nodes of the CPG.
In the remainder of this paper, we assume that all threads have one single final node. Our results, however, can easily be generalized to an arbitrary number of final nodes.
3.1. Generating a Concurrent Program’s Matrix
Let T_i and S_i refer to the matrices representing thread i and synchronization primitive (e.g. semaphore) i, respectively. According to Figure 1(a), binary semaphore i has an adjacency matrix of order two. We obtain the matrix T representing k interleaved threads and the matrix S representing r interleaved synchronization primitives by
T = T_1 ⊕ ... ⊕ T_k and S = S_1 ⊕ ... ⊕ S_r.
Because the operations ⊕ and ⊗ are associative  , the corresponding n-fold versions are well defined. Hence, we can apply the operations to multiple matrices (representing threads and synchronization primitives).
In the following, we define the selective Kronecker product, which we denote by ⊗_L. This operator synchronizes only labels identical in the two input matrices.
Definition 3 (Selective Kronecker Product) Given an m-by-n matrix A and a p-by-q matrix B together with a set of labels L, we call A ⊗_L B their selective Kronecker product. It is defined like the ordinary Kronecker product, except that an entry of the result is l if the corresponding entries of A and B are both equal to l ∈ L, and zero otherwise.
The ordinary Kronecker product on the automata level calculates the product automaton. In contrast, the selective Kronecker product is defined for only one active component: one thread executes the synchronization primitives' operations, while the synchronization primitive itself is a passive component. In contrast to the ordinary Kronecker product, the selective Kronecker product is defined such that a label l in the left operand is paired with the same label in the right operand and not with any other label in the right operand, and the resulting entry is l itself and not the product l · l. Definition 3 is stated for an arbitrary set of labels L. In the following, we use it exclusively for the set of labels referring to synchronization primitive calls. Used that way, the selective Kronecker product ensures that, e.g., a p-call to semaphore i in the left operand is paired with the corresponding p-operation of semaphore i in the right operand and not with any other label (e.g. of a different semaphore) in the right operand.
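A minimal sketch of Definition 3 in plain Python, with labels modelled as strings and 0 as the zero of the semiring (an illustrative encoding of our own, not the paper's implementation):

```python
def selective_kron(A, B, L):
    """Selective Kronecker product: an entry of the result is the label
    l itself when the corresponding entries of A and B both equal the
    same l in L, and 0 otherwise."""
    m, n = len(A), len(A[0])
    p, q = len(B), len(B[0])
    C = [[0] * (n * q) for _ in range(m * p)]
    for i in range(m):
        for j in range(n):
            for k in range(p):
                for h in range(q):
                    a, b = A[i][j], B[k][h]
                    if a != 0 and a == b and a in L:
                        # matching synchronization labels are paired;
                        # the entry stays l, not a product of labels
                        C[i * p + k][j * q + h] = a
    return C
```

Pairing a thread fragment containing a p-call with a semaphore matrix then yields exactly one synchronized entry; mismatching labels contribute nothing.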
Definition 4 (Filtered Matrix) Given a matrix A and a set of labels L, we call A_L a filtered matrix and define it as a matrix of the same order as A containing those entries of A that are elements of L, and zeros elsewhere.
The adjacency matrix representing a program is referred to as P. As stated in  , P can be computed efficiently by combining the selective Kronecker product of T and S (restricted to the semaphore-call labels) with the ordinary Kronecker product of the filtered matrix of T (restricted to the non-synchronization labels) and an identity matrix of the order of S.
Intuitively, the selective Kronecker product term on the left allows for synchronization between the threads represented by T and the synchronization primitives S. Both T and S are Kronecker sums of the involved threads and synchronization primitives, respectively, in order to represent all possible interleavings of the concurrently executing threads. The right term allows the threads to perform steps that are not involved in synchronization. Summarizing, the threads (represented by T) may perform their steps concurrently, where all interleavings are allowed, except when they call synchronization primitives. In the latter case the synchronization primitives (represented by S) together with Kronecker product ensure that these calls are executed in the order prescribed by the finite automata (FA) of the synchronization primitives. So, for example, a thread cannot do semaphore calls in the order v followed by p when the semaphore FA only allows a p-call before a v-call. The CPG of such an erroneous program will contain a node from which the final node of the CPG cannot be reached. This node is the one preceding the v-call. Such nodes can easily be found by traversing CPGs. Thus, deadlocks of concurrent systems can be detected with little effort   .
Until now, the following synchronization primitives have been successfully applied. In  , semaphores are the only synchronization primitives. In  , the approach is extended in order to also model Ada's protected objects. Finally, in  it is shown that barriers can be used as synchronization primitives. In the latter paper, it is also shown that initially locked and unlocked semaphores can be incorporated into our Kronecker algebra-based approach.
It can easily be shown that CPGs have at most n^k nodes and a number of edges proportional to the number of nodes, if k is the number of threads and each thread has n nodes in its RCFG. Hence, each CPG has a sparse adjacency matrix. Thus, memory-saving data structures and efficient algorithms suggest themselves. In the worst case, however, the number of CPG nodes increases exponentially in k.
3.2. Lazy Implementation of Kronecker Algebra
In general, a CPG contains unreachable parts if a concurrent program contains synchronization. This can be summarized as follows: the way we adopt the Kronecker product limits the number of possible paths such that the p- and v-operations are present in correct p-v-pairs in the RCPG. In contrast, the plain Kronecker sum of the threads contains all possible paths, even those containing semantically wrong uses of the synchronization primitive (e.g. semaphore) operations. This contrast can be seen in our running example in Figure 2 and Figure 3. The Kronecker sum of threads A and B in Figure 2 contains five copies of thread A's loop, whereas the RCPG in Figure 3 contains this loop only three times. It can easily be seen that the latter reflects the correct use of the semaphore operations.
Choosing a lazy implementation for the matrix operations ensures that, when extracting the reachable parts of the underlying graph, the overall effort is reduced to exactly these parts. By starting from the RCPG’s entry node and calculating all reachable successor nodes, our lazy implementation exactly does this  . Thus, for example, if the resulting RCPG’s size is linear in terms of the involved threads, only linear effort will be necessary to generate the RCPG.
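The essence of the lazy construction is a worklist traversal over a successor function that is evaluated on demand, so unreachable parts of the (possibly huge) CPG are never materialized. A sketch of this idea under our own naming, not the paper's API:

```python
from collections import deque

def reachable_graph(entry, successors):
    """Lazily build only the reachable part of a graph.

    `successors(node)` returns the labelled outgoing edges of a node on
    demand.  Starting from the entry node, a breadth-first worklist
    visits each reachable node exactly once, so the effort is
    proportional to the reachable part only."""
    edges, seen, work = [], {entry}, deque([entry])
    while work:
        node = work.popleft()
        for label, succ in successors(node):
            edges.append((node, label, succ))
            if succ not in seen:
                seen.add(succ)
                work.append(succ)
    return seen, edges
```

If the resulting RCPG is, say, linear in the number of threads, this traversal touches only that linear portion, regardless of how large the full product space would be.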
4. WCET Analysis on RCPGs
In order to calculate the WCET of a concurrent program, we adopt the generating-function-based approach introduced in  . We generalize this approach such that we are able to analyze multi-threaded programs. Each node of the RCPG is assigned a dataflow variable, and a dataflow equation is set up based on the predecessors of the RCPG node. A dataflow variable is represented by a vector. Each component of the vector reflects a processor and is used to calculate the WCET of the corresponding thread. Recall that only one single thread is allocated to a processor. Even though RCPGs also support multiple concurrent threads on one CPU, we restrict the WCET analysis to one thread per processor. This assumption eases the definition of the dataflow equations; it is not a restriction of our approach itself.

Figure 3. RCPG.
4.1. Execution Frequencies
In the remaining part of the paper, we will use execution frequencies  . The execution frequency of an edge is a measure of how often the edge is taken compared to the other outgoing edges of node k. Thus, each execution frequency is a rational number. Its values range from 0 (which models a dead path) to 1. For each node k, it is required that the execution frequencies of all outgoing edges sum to 1. If node k has at least two outgoing edges, then we have a so-called node constraint. We assign a variable to each such execution frequency. A concrete value is assigned to each of these variables during the maximization process, which is described below in Subsection 4.6. If a node m has only one outgoing edge to node n, then the execution frequency is statically known, and neither a node constraint nor an additional variable for the execution frequency is needed.
Let N_0 refer to the set of natural numbers including zero. From now on, we use a dedicated variable to refer to the number of iterations of loop i at the CFG level. For each loop, we require that this number is constant3 and statically known. As we have seen in Figure 2, RCPGs contain several copies of basic blocks (in our case edges) and loops in different places.
Since RCPGs model all interleavings of the involved threads, a certain execution of the underlying concurrent program (a certain path in the RCPG) may divide the code of a loop in the CFG among all its copies in the RCPG. In particular, we do not know a priori how a loop will be split among its copies in the RCPG for the path producing the WCET. For this reason, we assign variables (with unknown values) to the number of loop iterations of the loop copies in the RCPG. Later on (during the maximization process), concrete values for this loop iteration variables are chosen such that the execution time is maximized. Note that assigning variables to loop iteration numbers implies that some execution frequencies have also to be considered variable. These execution frequencies also get concrete values during the maximization process.
We refer to the number of loop iterations of the j-th copy of loop i as t_{i,j}. This variable denotes how often the loop entry edge of the j-th copy of loop i is executed. The loop entry of the j-th copy (out of n) of loop i gets assigned an execution frequency variable, where 1 ≤ j ≤ n. Note that the variables t_{i,j} get numerical values during the maximization process. Thus, the execution frequency of each loop entry edge is calculated automatically.
If node m has multiple outgoing loop entry edges for the loops and there exists exactly one outgoing non loop entry edge, then the execution frequency for the loop entry edge of loop i is.
Loop Iteration Constraints. Assume CFG loop i is executed t_i times and n copies (as mentioned above, due to the Kronecker sum) of that loop are in the RCPG; then, with t_{i,j} denoting the number of iterations of the j-th copy, we have the constraint t_{i,1} + t_{i,2} + ... + t_{i,n} = t_i. We assume that the value of t_i, i.e., the number of loop iterations on the thread (CFG) level, is known a priori. The variables t_{i,j} are used as variables during the maximization process. During the generation of the RCPG, it is possible to remember each copy of a CFG loop entry edge. In order to establish the loop iteration constraints, we go through this information.
Loop Exit Constraints. For loop i's j-th copy, let t_{i,j} denote its number of iterations. Then we have a loop exit constraint relating t_{i,j} to the sum of the execution frequencies of all loop exiting edges of the j-th copy of loop i. In general, such loop exiting edges also include edges from other threads which do not execute any part of loop i. Note that we can calculate the loop exit constraints automatically. Our approach does support nested loops  , which result in non-linear constraints. This is one reason that prohibits applying an ILP-based approach like  for solving the concurrent WCET problem.
4.3. Synchronizing Nodes
A thread calling a semaphore’s p-operation potentially blocks  . On the other hand, a thread calling a semaphore’s v-operation may unblock a waiting thread  . In RCPGs, blocking occurs at what we call synchronizing nodes. We distinguish two types of synchronizing nodes, namely vp- and pp-synchronizing nodes.
Each vp-synchronizing node has an incoming edge labeled by a semaphore v-operation, an outgoing edge labeled by a p-operation of the same semaphore, and these two edges are part of different threads. In this case, the thread calling the p-operation (potentially) has to wait until the other thread’s v-operation is finished.
Definition 5. A vp-synchronizing node is an RCPG node s such that
• there exists an incoming edge of s labeled by the v-operation of a semaphore k and
• there exists an outgoing edge of s labeled by the p-operation of the same semaphore k,
where the two edges are mapped to different processors.
For vp-synchronizing nodes, we establish specific data flow equations as described in the following subsection.
Definition 6. A pp-synchronizing node is an RCPG node s such that
• there exists an outgoing edge of s labeled by the p-operation of a semaphore k and
• there exists another outgoing edge of s labeled by the p-operation of the same semaphore k,
where the two edges are mapped to different processors.
For pp-synchronizing nodes, we establish fairness constraints ensuring a deterministic choice when, e.g., the times of both involved CPUs at node s are exactly the same.
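Both kinds of synchronizing nodes can be detected by a straightforward scan over the RCPG edge list. The following sketch uses an edge encoding of our own choosing, with semaphore labels as ("p", k) or ("v", k) tuples and an explicit processor per edge:

```python
def synchronizing_nodes(edges):
    """Classify RCPG nodes as vp- or pp-synchronizing.

    `edges` is a list of (src, dst, label, processor) tuples.  A
    vp-synchronizing node has an incoming v-edge and an outgoing p-edge
    of the same semaphore on different processors; a pp-synchronizing
    node has two outgoing p-edges of the same semaphore on different
    processors.  (Illustrative encoding, not the paper's data model.)"""
    vp, pp = set(), set()
    for (s1, d1, (op1, k1), c1) in edges:
        for (s2, d2, (op2, k2), c2) in edges:
            if k1 != k2 or c1 == c2:
                continue  # different semaphore or same processor
            if op1 == "v" and op2 == "p" and d1 == s2:
                vp.add(d1)  # v-edge into the node, p-edge out of it
            if op1 == "p" and op2 == "p" and s1 == s2:
                pp.add(s1)  # two competing outgoing p-edges
    return vp, pp
```

The quadratic pairwise scan is fine here since the number of semaphore-labeled edges is small compared to the RCPG itself.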
4.4. Setting Up and Solving Dataflow Equations
In this section, we extend the generating-function-based approach of Section 4 of  such that we are able to calculate the WCET of concurrent programs modeled by RCPGs. Each RCPG node's dataflow equation is set up according to its predecessors and the incoming edges (including execution frequency, execution time, and, in the case of vp-synchronizing nodes, stalling time).
Each dataflow variable is a vector with one component per processor; we write an index to refer to the i-th component of a vector, and each node x has its own vector. Let a be a scalar and let u and v be two n-dimensional vectors. The addition and multiplication of vectors and the multiplication of a scalar with a vector are defined component-wise.
Definition 7 (Setting up Dataflow Equations) Let t_e refer to the time assigned to edge e. In addition, the set of predecessor nodes of node n is referred to as Pred(n).
If n is a non-vp-synchronizing node and edge (m, n) is mapped to processor k, then the equation of n sums, over all predecessors m, the predecessor's vector multiplied by the edge's execution frequency and by a time vector whose k-th component accounts for the execution time of (m, n) and whose other components are equal to 1.
Let s be a vp-synchronizing node. In addition, let k_1 and k_2 be the processors to which the two involved synchronization edges are mapped. Then the equation of s takes a maximum of two terms, where the first term considers the incoming p-edge and the second term takes into account all other incoming edges of the blocking thread running on processor k_1.
The max-operator in Definition 7 is not an ordinary maximum operation on numbers. During the maximization process, we actually do the whole calculation twice: once replacing the max-operator by its first argument, and once using the second argument. In the end, the solution with the highest WCET value is taken.
The entry node's equation follows the rules above and, in addition, adds an initial term for the n threads.
The dataflow equations can intuitively be explained as follows. We cumulate the execution times in an interleavings semantics fashion. One can think of taking one edge after the other. Nevertheless, edges may be executed in parallel and the execution and stalling times are added to the corresponding vector components. In the overall process, we get the WCET of the concurrent program.
The system of dataflow equations can be solved efficiently by applying an algorithm presented in  . As a result, we get explicit formulas for the final node. In order to double-check that we calculate a correct solution, we also used Mathematica® to solve the node equations. Both approaches for solving the node equations calculate the same, correct results.
4.5. Partial Loop Unrolling
For vp-synchronizing nodes having at least one outgoing loop entry edge5, we have to partly unroll the corresponding loop such that one iteration is statically present in the RCPG's equations. Partial loop unrolling ensures that synchronization is modelled correctly. Only the unrolled part contains a synchronizing node. Some execution frequencies and equations have to be added or adapted. Edges have to be added to ensure that the original and the unrolled loop are semantically equivalent. For example, if the original loop was able to iterate a certain number of times, then the new construct must allow the same number of iterations. In order to define some execution frequencies correctly, we use the Kronecker delta function. We do such partial loop unrolling for our example in Subsection 5.2. In our example, we e.g. have to add an edge to allow a zero number of iterations (compare Figure 3 and Figure 4). Note that partial loop unrolling can be fully automated.
4.6. Maximization Process
In order to determine the WCET, we have to differentiate the solution for the final node with respect to z and after that set z = 1.
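For a polynomial solution this final step is plain calculus. A toy sketch, representing a polynomial as an exponent-to-coefficient dict (an illustration of the differentiate-and-evaluate step only; in the paper, the solved dataflow equations play the role of this polynomial):

```python
def d_dz_at_1(poly):
    """Differentiate a polynomial sum(c * z**e) with respect to z and
    evaluate at z = 1; the result is sum(c * e).

    `poly` maps each exponent e (carrying time) to its coefficient c
    (carrying frequency mass)."""
    return sum(c * e for e, c in poly.items())

# Example: f(z) = 0.5 * z**3 + 0.5 * z**7, i.e. two paths with times
# 3 and 7 taken with frequency 0.5 each; f'(1) = 0.5*3 + 0.5*7 = 5.0
```

With the execution frequencies fixed at their maximizing values, the path distribution that produces the largest such value yields the WCET.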
Let f_k(z) refer to the function representing the solution of the k-th component of the final node. The WCET contribution of the k-th processor is then obtained according to well-known facts of generating functions  .
Figure 4. Adapted RCPG.
In order to calculate the loop iteration count for all loop copies and to calculate the undefined execution frequencies within the given constraints, we maximize this function. This goes beyond the approaches given in  . During this maximization step, for which we used NMaximize of Mathematica®, all loop iteration variables are treated as variables. For each of these variables, Mathematica finds values within the given constraints. Thus, Mathematica assigns valid values to all loop iteration variables and all the unknown execution frequencies. Of course, instead of Mathematica, any non-linear function solver capable of handling constraints can be used. The WCET of the k-th CPU core is given by the resulting maximum.
In the following, the variable configuration found during this maximization is used. The WCET of a concurrent program consisting of n threads is defined as the maximum of the per-core WCET values, where max is the ordinary maximum operator for numbers.
If the RCPG contains s vp-synchronizing nodes, then the maximization process has to be done 2^s times, once for each possible resolution of the max-operators originating from the vp-synchronizing nodes. Finally, the largest of those results represents the WCET of the concurrent program. Hence, the computational complexity may increase exponentially in s. However, s is usually small. For n threads and r semaphores, the number of vp-synchronizing nodes in the CPG is bounded above by the sum, over all semaphores j and all pairs of distinct threads i and k, of the product of the number of v-operations of semaphore j in thread i and the number of p-operations of semaphore j in thread k. Depending on how the semaphores are used, not all vp-synchronizing nodes may be part of the RCPG. In addition, information may be available which allows concluding that even some of the present cases cannot result in the final WCET value. Then these cases need not be considered in the maximization process. In  , an example with a CPG matrix size of 298721280 has been analyzed within 400 ms. It contained 13 semaphores and only 15 synchronizing nodes. Even though the CPGs for travel time analysis do not contain loops, the number of synchronizing nodes is comparable.
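The 2^s case enumeration can be sketched as a loop over all resolutions of the max-operators; `solve_case` below is a hypothetical stand-in for solving and maximizing the equation system under one fixed resolution:

```python
from itertools import product

def resolve_max_cases(num_sync_nodes, solve_case):
    """Enumerate all 2**s resolutions of the s max-operators.

    Each vp-synchronizing node's max is replaced by its first (0) or
    second (1) argument; `solve_case(choices)` solves the dataflow
    equations under that fixed resolution, and the largest result is
    the WCET.  (`solve_case` is a hypothetical hook, not the paper's
    API.)"""
    return max(solve_case(choices)
               for choices in product((0, 1), repeat=num_sync_nodes))
```

Since s is usually small (15 synchronizing nodes in the large example above), the exponential factor stays manageable in practice.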
5. Example

This small example includes synchronization and one single loop. We use two threads, namely A and B, sharing one single semaphore with the operations p and v. The CFGs of the two threads are depicted in Figure 1(b) and Figure 1(c). Each edge is labeled by a basic block l. Together with an RCFG of a binary semaphore, we calculate the adjacency matrix P of the corresponding RCPG in the following steps. The interleaved threads are given by the Kronecker sum of the matrices of A and B.
Because we have only one semaphore, the interleaved semaphores are trivially given by the semaphore's matrix itself. The program's matrix P is computed as described in Subsection 3.1. The RCPG of the A-B-system is depicted in Figure 3. The edges of the RCPG are labeled by their execution frequencies on the RCPG level. In addition, we indicate for each execution frequency that it is the execution frequency for the x-th copy of basic block l.
We assume that both threads access shared variables in the basic blocks a and d. Thus, the basic blocks a and d are only allowed to be executed in a mutually exclusive fashion. This is ensured by using a semaphore. The basic blocks a and d are protected by p-calls. After the corresponding thread finishes the execution of a or d, the semaphore is released by a v-call. We assume that all the other basic blocks do not access shared variables. Note that the threads are mapped to distinct processors and that these mappings are immutable.
Each variable x in this example (except the loop iteration variables) is a rational number such that 0 ≤ x ≤ 1. We assume that thread A's loop is executed t times and the three copies of the loop are executed t_1, t_2, and t_3 times, respectively. Hence, we have the loop iteration constraint t_1 + t_2 + t_3 = t, where t_i ∈ N_0.
5.1. Equations Not Affected by Partial Loop Unrolling
Following the rules of Section 4, we obtain the following equations.
Note again that the max-operators in the equations stated in Subsection 5.2 originate because the original nodes 9 and 14, respectively, are vp-synchronizing nodes, and that max is not the ordinary maximum operation on numbers. During the maximization process, for each max-operator we do the whole calculation twice, once for each possible solution.
5.2. Partial Loop Unrolling
Node 13 is a vp-synchronizing node and one of its outgoing edges constitutes a loop entry edge. Thus, we have to apply partial loop unrolling.
The changes in the equations can be interpreted on the RCPG level as depicted in Figure 4 (compare to Figure 3). For edges whose execution frequency is 1, we write 1(a) in order to state that the edge refers to the basic block a. For these edges, the execution time would otherwise be unclear. In the following, we use the Kronecker delta function, defined as δ_{x,y} = 1 if x = y and δ_{x,y} = 0 otherwise.
By partially unrolling the loop, we get the execution frequencies:
Note that the non-linear function solver employed for the maximization process must be able to handle and case functions (like that used in the right hand side of) correctly.
5.3. Execution Frequencies and Constraints
The following execution frequencies and constraints are extracted out of the RCPG. The execution frequencies of the loop entry edges and are established as follows:
From the node constraints, we get, , , , , , and statically set to 1. The remaining node constraints contribute execution frequency variables and the corresponding constraints for the final maximization process.
The loop exit constraints are as follows:
The time needed for executing basic block b is referred to as. We assume that all copies of a certain basic block lead to the same execution time. Thus, e.g., each one out of, and has an execution time of. Finally, for node 5, which is a pp-synchronizing node, we have the following constraints. These conditions follow from our computational model described in Section 2 and the fairness constraints from Subsection 4.3:
5.4. Solving the Equations
For a concise presentation, we use the notation. We used two approaches to solve the equations. First, we applied  . To double-check the solution, we also used Mathematica®.
The resulting equations for the final node 16 are:
5.5. Maximization Process
Finally, we have to differentiate with respect to z and then set.
where the set constraints consists of the constraints set up in Section 5.3. The WCET of the concurrent program consisting of two threads is defined by
Table 1 shows some WCET values of the program and of its components, namely the threads A and B. The time needed for executing basic block b is referred to as. We assume that all copies of a certain basic block lead to the same execution time. Further, we set, , and let range from 1 to 10. As described above, during the maximization process, we let Mathematica® choose the values of the variables and all the unknown execution frequencies. We used the execution time as an input parameter to see how it affects the WCET of the program. Note that the calculated values are exact WCET values. In the rightmost column of Table 1, we present the average time Mathematica needed to calculate the time of the component leading to the WCET. Note that the maximization dominates the overall CPU time; generating the RCPG and solving the dataflow equations takes only a few milliseconds. Mathematica 10 was executed on CentOS 6.0 on an Intel Core i7 870 CPU (2.96 GHz, 8 MB cache) with 4 GB RAM. Until now, our focus has not been on using specialized non-linear solvers, which would probably lead to much better maximization times. Finding the best non-linear function solver is ongoing research.
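For small instances, the maximization can also be carried out exactly by brute force, which illustrates where the non-linearity comes from. The following sketch uses made-up execution times and a hypothetical iteration constraint n1 + n2 + n3 = N over the frequencies of three loop copies; the product term mimics the non-linear constraints caused by loop copies and does not correspond to the actual equations of the example:

```python
def exact_wcet(N, t_loop=(5, 7, 4), t_base=12):
    """Brute-force maximization over the integer frequencies n1, n2, n3
    of three loop copies, subject to n1 + n2 + n3 = N.  The n1 * n2
    cross term makes the objective non-linear."""
    best = float("-inf")
    for n1 in range(N + 1):
        for n2 in range(N + 1 - n1):
            n3 = N - n1 - n2
            t = (t_base + t_loop[0] * n1 + t_loop[1] * n2
                 + t_loop[2] * n3 + n1 * n2)  # non-linear cross term
            best = max(best, t)
    return best
```

Enumerating all frequency assignments is exponential in general, which is exactly why a non-linear function solver is employed instead.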
6. Related Work
Our approach is the first one capable of handling parallel and concurrent software. There exist several approaches for parallel systems which we will discuss in the following (see e.g.  for an overview).
In  an IPET-based approach is presented. Communication between code regions in the form of message passing is detected via source code annotations specifying the recipient and the latency of the communication. For each communication between code regions, the corresponding CFGs are connected via an additional edge. Hence, the data structure consists of CFGs connected via communication edges. This is not sufficient for programs containing recurring communication between threads. In contrast, our approach generates a new data structure (RCPG) out of the input CFGs in a fully automated way. The RCPG incorporates thread synchronization of the multi-threaded program and thus contains only the reachable interleavings. Our approach is not limited to one single synchronization mechanism; it can be used to model, e.g., semaphores or locks. In addition, RCPGs play a similar role for multi-threaded programs as CFGs do for sequential programs and can be used for further analysis purposes. The hardware analysis on basic block level of  can be applied to our approach too.

Table 1. WCET for and multiple values of.
Current steps towards multi-core analysis including hardware modelling try to restrict interleavings and use a rigorous bus protocol (e.g. TDMA) that increases predictability  . A worst-case resource usage bound is used to compute the WCET overlap. Hence, this approach finds only a WCET upper bound, while our approach determines the exact WCET including stalling times.
7. Conclusion and Future Work
In this paper, we focused on calculating stalling times automatically in an exact WCET analysis of shared memory concurrent programs running on a multi-core architecture. The stress was on a formal definition of both our graph model and the dataflow equations for timing analysis. This is the first approach suited for parallel and concurrent systems.
We established a generic graph model for multi-threaded programs. Thread synchronization is modeled by semaphores. Our graph representation of multi-threaded programs plays a similar role for concurrent programs as control flow graphs do for sequential programs. Thus, a suitable graph model for timing analysis of multi-threaded software has been set up; it serves as a framework for WCET analysis of multi-threaded concurrent programs. The usefulness of our approach has been demonstrated by a lazy implementation of our extended Kronecker algebra. The implementation is very memory-efficient and has been parallelized to exploit modern many-core hardware architectures. Currently, work is in progress on a GPGPU implementation for generating RCPGs. The first results are very promising.
We applied a generating functions approach. Dataflow equations are set up. The WCET is calculated by a non-linear function solver. Non-linearity is inherent to the multi-threaded WCET problem. The reasons are that (1) several copies of loops show up in the RCPG and (2) partial loop unrolling has to be done in certain cases. (1) implies that loop iteration numbers for loop copies have to be considered variable until the maximization process takes place. Thus, nested loops cause non-linear constraints to be handed to the function solver. (2) generates additional non-linear constraints.
In terms of WCET analysis, a lot of work remains to be done. The focus of this paper is on how to model concurrent programs. One direction of future work is modelling hardware features. In general, without taking into account, e.g., pipelining, shared caches, shared buses, branch prediction, and prefetching, we might overestimate the WCET. Our approach could benefit from, e.g.,  -  , which support shared L2 instruction caches.
Finding the best non-linear function solver is ongoing research; Mathematica® was just the first attempt, and a specialized solver will probably lead to better maximization times. A direction of future work is to generalize to multiple threads running on one CPU core. We will investigate how an implicit path enumeration technique (IPET) approach  together with non-linear solvers can produce results similar to ours. Finally, a possible direction for future work could be a WCET analysis of semaphore-based barrier implementations  .
1These execution times do not include stalling time which we calculate automatically.
2A k-by-k matrix is known as a square matrix of order k.
3For simplicity, we chose constant; our approach supports upper bounds too.
4According to the definition of vp-synchronizing nodes.
5Note that we can detect these nodes when generating the RCPG.
6A hardware analysis may detect that the copies of a basic block have different execution times due to e.g. shared data caches or instruction pipelining.