Received 13 February 2016; accepted 26 March 2016; published 31 March 2016
Network-on-chip (NoC) is a rapidly growing area for designing applications such as multimedia, telecommunication, and real-time tasks. Previous research has mainly focused on low power, high speed and scalability in NoC. Algorithmic and architectural models have been developed and implemented in NoC to improve performance over existing NoC designs. Current NoC designers have made considerable progress at the architectural level by introducing external or internal sense amplifiers (SA) in on-chip communication. At the transmitter section (TXS), a pre-emphasis capacitance (PEC) is added for high speed and energy reduction in on-chip communication, but it requires DC bias circuits at the receiver section (RXS). To overcome this issue, a voltage sense amplifier was introduced and tested in a 90 nm CMOS cross-coupled circuit. In small circuit applications the benefit of the voltage SA is difficult to identify, so it was refined into the Double Tail Sense Amplifier (DTSA). The DTSA-based transceiver consists of a PEC at the TXS and a DTSA at the RXS. In a recent paper, we presented a transceiver with a Reconfigurable DTSA (R-DTSA) to improve performance. Both works achieved improvements in data rate and link power. In this paper, we concentrate on improving latency by applying the bootable concept to the DTSA. The bootable concept is the combination of clock enable and clock gating (CG). Low-power models have been developed and implemented in many real-time applications. A CG low-power design approach at the register-transfer level (RTL) has been tested in a TSMC 45 nm CMOS application. CMOS VLSI design has produced real working chips that rely on controlled charge recovery to operate at significantly lower power dissipation levels than their existing counterparts. Energy recovery circuits are applied in microcontrollers, memory devices, display drivers, grouped clock networks and other real-time applications.
The CG-SAFF (sense-amplifier flip-flop) circuit exhibits high speed and low energy. The switching activity and delay of various flip-flops have been compared with those of the CG-SAFF. Performance improvement in networks has also been achieved through network traffic modeling based on synthetic traffic. A traffic estimator and generator have been introduced for QoS to generate and estimate real-time traffic data in on-chip communication. In the proposed VELAN design, we follow the traffic generator and the Graph Theory based Traffic Estimator (GTE). To achieve higher performance under various traffic conditions in on-chip networks, the energy recovery clocking concept is introduced into the DTSA, giving the M-DTSA, and the clock gating concept is introduced into the M-DTSA, giving the MCG-DTSA. The VELAN design is validated in the same way as in Schinkel et al. and Sakthivel et al. In the proposed analysis, the functionalities of the individual and proposed sense amplifiers are validated. The proposed V-DTSA module is encapsulated into the receiver section of the router in the NoC. Performance metrics such as delay, data rate, energy and static power consumption are observed and compared with conventional works.
The key contributions of this work are as follows.
・ Schinkel et al. proposed the DTSA, a low-power and high-speed architecture.
・ The conventional DTSA had not been evaluated under high traffic, so we examined the same DTSA under various traffic scenarios and found that traffic affects its performance. The R-DTSA was proposed to solve this issue.
・ The R-DTSA is a combination of four DTSAs. It was validated under various traffic scenarios, and the results reported in Sakthivel et al. are better than those of Schinkel et al.
・ Although the R-DTSA is more advantageous than the DTSA, some issues remain, such as higher latency, large area and high cost. To overcome these issues and reduce the complexity of the R-DTSA, we developed a new DTSA called the V-DTSA.
・ The V-DTSA provides better performance than the DTSA and R-DTSA in terms of all the parameters mentioned in the conventional work.
・ We introduce a V-DTSA based NoC that provides better performance than the conventional DTSA and R-DTSA under various traffic scenarios.
・ To construct the V-DTSA, we applied the bootable concept to both the DTSA and CG-DTSA circuits.
・ As in the previous R-DTSA work, we use the same Graph Theory based Traffic Estimator (GTE).
・ The top modules of the V-DTSA are the B-DTSA, BCG-DTSA, GTE and controller.
・ The primary part of this work is an analysis of various sense amplifiers and the selection of a suitable DTSA for performance comparison.
・ This work aims to produce a new latency-aware NoC design based on the V-DTSA under various traffic scenarios.
・ Performance parameters such as delay, data rate, power, energy and area are observed and reported in Section 4.
The rest of this paper is organized as follows. Section 2 addresses the system model. Proposed work and its module details are discussed in Section 3. The proposed results of various architectures are presented in Section 4. Finally, the conclusion is presented in Section 5.
2. System Model
For better data communication in the NoC architecture, the conventional transceiver has a PEC in the TXS and a DTSA circuit in the RXS. The transceivers of Schinkel et al. and Sakthivel et al. for NoCs and the proposed VELAN design are shown in Figure 1. The capacitance in the TXS is used to reduce power dissipation. In NoC circuitry, communication disturbance occurs because of noise and crosstalk. A transceiver with a Differential Interconnect Twist (DIT) provides a high performance improvement. In the early stage, bidirectional interconnects were used. An EM field solver is used to analyze the interconnects. A 1.2 V, 6 M CMOS technology is used for the interconnects. Table 1 shows the concepts involved in VELAN and compares them with the existing designs.
3. Proposed System
The conventional transceiver configuration is compared with the proposed transceiver configuration in Figure 1. The proposed VELAN design uses V-DTSA circuitry to reduce the power consumption and latency of data transmission. The proposed work consists of four stages, namely selection, analysis, design and performance comparison.
In the first stage, a suitable SA is selected based on a power comparison in both sleep and active modes (the M-DTSA and MCG-DTSA are chosen for further processing). After clock enable is applied, the two DTSAs are refined into the B-DTSA and BCG-DTSA. In the second stage, the selected SAs (DTSA, B-DTSA and BCG-DTSA) are subjected to high traffic (HT) and low traffic (LT), and their energy is compared. In the third stage, we design the V-DTSA circuitry for the complete transceiver. Finally, we compare our results with the conventional works. The block diagram of the VELAN design is shown in Figure 2. The proposed system consists of a PEC with the TXS, the GTE, the V-DTSA circuitry and the RXS. The Graph Theory based Traffic Estimator (GTE) estimates the traffic rate of the transmitted data. Based on this estimate, the traffic controller selects the corresponding DTSA available in the V-DTSA circuitry.
Figure 1. Conventional and proposed transceiver configuration.
Table 1. The concept involved in VELAN.
Figure 2. Proposed Transceiver configuration.
Clock gating has proved best when the circuit contains a large number of flip-flops (coarse grained), since it is independent of the circuit size. In a fine-grained system (fewer flip-flops), clock enable achieves better energy conservation, since its power consumption is linear in the number of flip-flops. Because clock enable activates only a part of the circuit, it works better on a partially active task. Because clock gating activates the complete circuit, it works well with tasks that need the whole circuit. This has been experimentally validated on an FPGA platform by Oliver et al. Based on these modules, we have constructed our proposed circuit.
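The trade-off described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the function names, the unit cost per flip-flop and the fixed clock-gating cost are hypothetical values chosen for illustration and are not measurements from this work or from Oliver et al.

```python
def clock_enable_power(n_flip_flops: int, per_ff_cost: float = 1.0) -> float:
    """Clock-enable power modeled as linear in the number of flip-flops.

    per_ff_cost is a hypothetical unit cost, not a measured value.
    """
    return per_ff_cost * n_flip_flops


def clock_gating_power(n_flip_flops: int, fixed_cost: float = 16.0) -> float:
    """Clock-gating power modeled as independent of circuit size.

    fixed_cost is a hypothetical constant, not a measured value.
    """
    return fixed_cost


def preferred_scheme(n_flip_flops: int,
                     per_ff_cost: float = 1.0,
                     fixed_cost: float = 16.0) -> str:
    """Return the scheme with the lower modeled power for a given circuit size."""
    ce = clock_enable_power(n_flip_flops, per_ff_cost)
    cg = clock_gating_power(n_flip_flops, fixed_cost)
    return "clock_enable" if ce < cg else "clock_gating"


# Fine-grained circuit (few flip-flops) versus coarse-grained circuit (many).
small_choice = preferred_scheme(4)
large_choice = preferred_scheme(64)
```

With these illustrative constants the crossover sits at 16 flip-flops; the real crossover depends on the technology and on the circuit.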
4. Proposed Work and Its Module Details
This section discusses the energy recovery clocking circuit, the clock gating circuit, the bootable concept, the low swing transmitter, the optimal swing receiver, the V-DTSA components, the graph theory based traffic estimator and controller, and the complete transceiver for the proposed DTSA.
4.1. Energy Recovery Clocking (ERC) Circuit
Mahmoodi et al. introduced an energy recovery clocking technique for flip-flops that operate with single-phase sinusoidal clocks. In the ERC circuit, an AC supply voltage is used to recycle the energy stored on the capacitances, while the standard supply voltage is used for the rest of the circuit. The schematic representation of the ERC for energy recovery clock generation is adopted from the cited work. The energy recovery technique is implemented in the DTSA circuit to reduce power in the NoC architecture.
4.2. Clock Gating (CG) Circuit
Tirumalashetty et al. introduced a clock gating technique for low-power design of sequential circuits. In the CG circuit, a universal logic gate masks the local clock signal, which removes the energy recovery scheme from the remaining capacitances in the fan-out circuit. The schematic representation of the CG circuit for clock gating generation is adopted from the cited work. Energy loss occurs because of non-adiabatic switching between the device oscillators and the resistance of the clock circuit; it can be eliminated by applying the clock gating technique to the DTSA circuit.
4.3. Bootable Concept
In general, DTSA has precharge and evaluation phases of operation. The slow rising and falling transitions of the resonant clock will cause overlap between these two phases, which results in short-circuit current. The main purpose of the bootable clocking scheme is to reduce short-circuit power by switching the precharging transistors for a portion of clock period.
4.4. Low Swing Transmitter
In a low swing transmitter, large transmitters are required to drive the bus with adequate speed, which reduces transmitter efficiency. To overcome this issue and achieve a high data rate, Schinkel et al. and Sakthivel et al. used a capacitive pre-emphasis transmitter that uses a series capacitance to drive the bus with low swing. The series capacitance in the transmitter drives the bus and reduces the swing factor. The technical concepts of the proposed low swing transmitter with PEC are similar to those of Schinkel et al. and Sakthivel et al. The technical parameters of the Full Swing (FS), Multi VDD Mode (MVM) and Capacitive Low Swing (CLS) transmitters are tabulated in Table 2.
4.5. Optimal Swing Receiver
The most commonly used data receivers in a low swing transceiver are clocked comparators and sense amplifiers. Comparators are used to regenerate the voltage to full swing. The sense amplifier, however, is a very fast circuit that regenerates the voltage, samples the incoming data and realigns it at the receiving end with respect to the clock signal. A sense amplifier that is split into two tails to avoid a transistor stack is called a DTSA, and it is used in the receiver section of the proposed transceiver.
Because the DTSA is a basic building block of the clock distribution network, its latency and power dissipation play a vital role in the NoC, so a careful design is needed to achieve low power and small latency.
To gain maximum power reduction of data transmission in the NoC architecture, the proposed work presents a variable energy aware sense amplifier design with a V-DTSA circuit. The purpose of the V-DTSA circuit is to vary the DTSA module according to the traffic rate of the data. It consists of the Graph Theory based Traffic Estimator, a controller and two DTSA modules, namely the B-DTSA and BCG-DTSA. The DTSA circuit with the ERC concept implemented is called the Modified-DTSA (M-DTSA). The clock gating technique is implemented in the M-DTSA module by adding a logical NOR gate to the circuit; the result is called the Clock Gating Modified-DTSA (MCG-DTSA). After clock enable is applied, the two circuits are called the B-DTSA and BCG-DTSA. The functional diagram of the V-DTSA module and its simulation result are shown in Figure 3; the module consists of transistors S1-S12 with the S-pulse signal, logical NOR gating and the controller. The GTE estimates the traffic rate of the data with respect to the specification given in Table 3. After the traffic rate is estimated, a control signal is sent to the controller to activate the DTSA module according to the traffic rate (LT/HT). If the input data is estimated as low traffic (LT), the controller applies the S-pulse (ERC output) as the input to the DTSA circuit. For HT, the controller enables the output of the logical NOR gate to the DTSA circuit. Therefore, the transistor dimensions of the proposed double-tail sense amplifier are optimized relative to each other to obtain the lowest offset standard deviation per unit of power cost. Width scaling (or impedance or area scaling) can then be applied to all the transistors together to match the offset standard deviation to the required value while preserving the original speed characteristics.
Table 2. The technical concept involved in VELAN.
Table 3. The overall transceiver performance comparison.
Figure 3. V-DTSA Design with simulation results.
4.7. Graph Theory Based Traffic Estimator and Controller
The optimal weight equation used for the TE design follows the earlier work. The GTE estimates the traffic rate, compares it with the given threshold value, and then selects the corresponding DTSA module in the V-DTSA circuitry via the controller. To reduce complexity relative to the earlier design, two DTSA modules are eliminated and the traffic modes are merged from four states into two, namely HIGH (HT) and LOW (LT).
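The selection step described above can be sketched behaviorally as follows. This is a minimal illustration, not the authors' controller: the function names, the numeric threshold and the treatment of a rate exactly equal to the threshold are assumptions made here. The mapping of LT to the B-DTSA and HT to the BCG-DTSA follows the description in this paper.

```python
def classify_traffic(estimated_rate: float, threshold: float) -> str:
    """Merge the traffic estimate into the two states used by the V-DTSA.

    Rates strictly above the threshold are treated as HT; treating a rate
    equal to the threshold as LT is an assumption of this sketch.
    """
    return "HT" if estimated_rate > threshold else "LT"


def select_dtsa(estimated_rate: float, threshold: float) -> str:
    """Return the DTSA module the controller would enable for this estimate."""
    module_for_mode = {
        "LT": "B-DTSA",    # low traffic: bootable DTSA driven by the S-pulse
        "HT": "BCG-DTSA",  # high traffic: bootable clock-gated DTSA
    }
    return module_for_mode[classify_traffic(estimated_rate, threshold)]


# Hypothetical normalized traffic rates and threshold, for illustration only.
low_choice = select_dtsa(estimated_rate=0.2, threshold=0.5)
high_choice = select_dtsa(estimated_rate=0.8, threshold=0.5)
```

Keeping the classification separate from the module lookup mirrors the split between the traffic estimator and the controller.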
4.8. Complete Transceiver
The complete transceiver circuit consists of a transmitter with pre-emphasis capacitance connected to the receiver with the V-DTSA module via the DIT. The V-DTSA circuitry consists of the B-DTSA and BCG-DTSA, which receive the input data through the bus. The traffic estimator classifies the traffic as low or high using the graph theory method and enables the suitable DTSA by sending a select signal to the MUX. All other techniques are adopted unchanged from the earlier work. The complete transceiver is shown in Figure 4, and simulation results are shown in Figure 5.
5. Results and Discussion
5.1. Performance Measure-Analytical Model
Figure 4. Complete transceiver.
Figure 5. Complete transceiver experimental result.
To measure the performance of the proposed work, we use the following metrics, which are widely used for performance measurement in NoC: delay, data rate, energy, static power consumption, average latency, throughput, energy per useful flit and switching factor. The definitions of these metrics are summarized here.
1) To measure the latency of flows, it is essential to evaluate the packet waiting periods for routers.
2) The power consumption and Link power are considered recursively for every communication path starting from the terminus section.
3) The Energy consumption for each core in router is determined.
4) Data rate is measured, based on all communication paths beginning of the terminus section.
5) Average latency is a time interval between the stimulation and response.
6) Throughput is the rate of production or the rate at which something can be processed.
7) Energy per useful flits is obtained with respect to the number of flits.
8) Switching factor is the probability of output switching.
In each node Ni, the latency LnocNi is defined using network calculus from Bhat et al.  and Sakthivel et al.  as follows.
Swit is the service bandwidth and is the latency.
1) Link Power
Following Bhat et al. and Sakthivel et al., a practical power model is used for the NoC router. The model considers the cross-coupling effect for N-wire interconnects. The total power per unit length of an N-wire link is calculated as follows:
Nwire is the total number of wires in the link
Cself and Ccoupl are the self and coupling capacitance of a wire and neighboring nodes respectively,
αsaw is the switching activity on a wire,
αCou is the switching activity with respect to the adjacent wires,
τ is the short circuit period,
Vsv is the supply voltage,
Ishort, bias, wire and Ileak, gate are currents.
2) Static Power consumption
The static power dissipation can be defined from Bhat et al.  and Sakthivel et al.  as follows in Equation (3).
The energy spent when one bit of data is transferred from one router (R1) to another router (R2) via the links is a function of the number of routers and the number of links. The total energy can be calculated as follows (Bhat et al.; Sakthivel et al.)
where is the energy spent, at time t, on the link , is the energy consumed inside the switch and are the number of links and switches respectively involved in transporting the application flows. The total energy consumption can be calculated using Network Calculus arrival curves as follows (Bhat et al. and Sakthivel et al.).
5.1.4. Data Rate
Following Sakthivel et al., a FIFO buffer with a known capacity absorbs a data burst of assumed size, and the arrival data rate is defined as follows.
where Psize is the packet size, Pinterval_time is the packet interval time and N is the total number of input flits. The smallest data unit is a bit in the analytical model, and it is a frame with bounded size in the simulation model.
5.1.5. Average Flit Latency
Average Flit Latency is defined as the ratio between the flit delay and the number of flits received. It is given in Equation (6) (Yu & Ampadu)
where, M = Number of flits received
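From the verbal definition above, the average flit latency can be written in the following form. The symbol $D_i$, the delay of the $i$-th received flit, is notation assumed here for illustration and may differ from that of Yu & Ampadu.

$$\mathrm{AFL} = \frac{1}{M}\sum_{i=1}^{M} D_i$$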
5.1.6. Average Throughput
Average Throughput is defined as the ratio between P and Number of IP cores. It is given in Equation (8) (Yu & Ampadu  )
P is defined as the ratio between Total Received Flit and Total Simulation Time. It is given in Equation (9)
where N = Number of IP cores
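Read together, the two definitions above suggest the following form. The symbols $F_{\mathrm{recv}}$ (total received flits) and $T_{\mathrm{sim}}$ (total simulation time) are names assumed here for illustration; only $P$ and $N$ appear in the text.

$$P = \frac{F_{\mathrm{recv}}}{T_{\mathrm{sim}}}, \qquad \mathrm{AT} = \frac{P}{N} = \frac{F_{\mathrm{recv}}}{N \, T_{\mathrm{sim}}}$$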
5.1.7. Switching Factor
The ratio between the number of switched input ports and the total simulation cycle count is called the switching factor. It is given in Equation (10) (Yu & Ampadu)
where N = Number of IP cores/Routers and C = Total Simulation Cycle Count
5.1.8. Energy per Useful Flit
Energy per Useful Flit is defined as the ratio between energy and Total Error Free Flits Received. It is given in Equation (11) (Yu & Ampadu  )
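The ratio-style metrics defined in this section can be sketched in code as follows. This is an illustrative sketch based only on the verbal definitions; the function names, argument names and sample numbers are assumptions and are not taken from the experiments. The switching factor is omitted because its exact normalization is not recoverable from the text.

```python
from typing import Sequence


def average_flit_latency(flit_delays: Sequence[float]) -> float:
    """Total flit delay divided by the number of flits received (M)."""
    m = len(flit_delays)
    if m == 0:
        raise ValueError("no flits received")
    return sum(flit_delays) / m


def average_throughput(total_received_flits: float,
                       total_simulation_time: float,
                       n_ip_cores: int) -> float:
    """P divided by the number of IP cores, with P = received flits / time."""
    p = total_received_flits / total_simulation_time
    return p / n_ip_cores


def energy_per_useful_flit(total_energy: float,
                           error_free_flits: int) -> float:
    """Energy divided by the total number of error-free flits received."""
    if error_free_flits == 0:
        raise ValueError("no error-free flits received")
    return total_energy / error_free_flits


# Made-up sample values, used only to show the calls.
afl = average_flit_latency([10.0, 20.0, 30.0])   # delays in cycles
at = average_throughput(4800, 1200, 16)          # flits, cycles, cores
epf = energy_per_useful_flit(500.0, 100)         # pJ, flits
```

The units of each result follow from the units of the inputs, for example cycles per flit for the latency and pJ per flit for the energy metric.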
5.2. Experimental Section Analysis
To evaluate the performance of the proposed link, each component is modeled. For this experiment, the source router sends packets to the sink router, and a FIFO is located between these routers. The implementation of the NoC architecture started with an RTL description of the DTSA components. To achieve power reduction, we focused on the bootable concept (clock gating). The RTL description is written to evaluate clock gating and is synthesized to a gate-level netlist with Synopsys Design Compiler. From the resulting layout, the switching factor and power consumption are estimated. The switching factors reported for the proposed work were obtained on an Intel® Core i3-2100 processor (3.1 GHz, LGA 1155). The total simulation cycle count for each experiment is 1,200,000. The power consumption of the interconnection network is extracted using 90 nm technology. Power analysis is carried out using the Synopsys PrimeTime PX tool. In this analysis, the power consumption under a given traffic pattern is investigated. The conventional traffic approach cannot realistically reveal all types of traffic that will traverse the network, but the GT-based traffic pattern provides a reasonable measurement of the performance of this method.
The NoC VHDL code is synthesized and evaluated in 90 nm TSMC CMOS technology at a 500 MHz operating frequency, a supply voltage of 1.8 V and a switching factor of 0.5. In the V-DTSA module, the controller is modeled and synthesized in 90 nm TSMC CMOS technology. To evaluate the performance of the proposed V-DTSA circuitry, comparisons are made with other recent works, including the DTSA and the reconfigurable DTSA. The sleep mode and active mode power consumption are tested with and without CG, and the results are reported in the cited work. The power is compared across DTSA modules such as Single-ended Conditional Capturing Energy Recovery (SCCER), DCCER, Static Differential Energy Recovery (SDER), Pulsed Flip Flop (PFF), M-DTSA and MCG-DTSA. The clock enable concept (bootable) is applied to the conventional DTSA circuitries (M-DTSA and MCG-DTSA).
The mathematical expressions used for technical evaluation are similar to those of the earlier work. The energy consumption, delay, data rate and static power consumption results are presented in Figures 6-9. The DTSA, R-DTSA and V-DTSA circuitry results are estimated under HT and LT. The overall comparison of various parameters (energy consumption, static power, delay and data rate) with existing work is shown in Table 3.
Overall, the VELAN design gives better results than the conventional design. The conventional method achieved a latency of 300/1500 ps under single/five stage operation. The latency result of the proposed work under the average traffic condition is better than that of the earlier work (354/1770 ps).
Figure 6. Energy comparison of DTSA modules.
Figure 7. Delay comparison of DTSA modules.
Figure 8. Data Rate comparison of DTSA modules.
Figure 9. Static Power comparison of DTSA modules.
The following experimental parameters are used to characterize the NoC: Average Flit Latency (AFL), Average Throughput (AT), Switching Factor (SF) and energy per useful flit. These parameters are obtained from the mathematical equations given above. The flit rate is the rate at which packets are injected into the system, expressed in flit/node/cycle.
The dynamic power and the leakage power are measured at different terminals, namely the traffic generator, the GT-based traffic estimator, the router, the input buffer, the output buffer and the links, for the conventional works and the proposed work. The results are presented in Table 4 and Table 5, and the comparison is plotted in Figure 10. The proposed work gives lower power consumption than the conventional works.
The throughput and the average flit latency are evaluated against the traffic injection rate. The results are presented in Table 6 and Table 7, and the comparison is plotted in Figure 11. The throughput and the average flit latency are also evaluated against the flit rate. The results are presented in Table 8 and Table 9, and the comparison is plotted in Figure 12. The proposed work gives better throughput and latency than the conventional works. The energy per useful flit is evaluated against the flit rate. The results are presented in Table 10, and the comparison is plotted in Figure 13. The proposed work gives lower energy consumption than the conventional works.
Table 4. Dynamic power of various approaches with transceiver.
Table 5. Leakage power of various approaches with transceiver.
Table 6. Traffic injection rate vs. throughput.
Table 7. Traffic injection rate vs. average flits latency.
Table 8. Average latency vs. flit rate.
Table 9. Throughput vs. flit rate.
Table 10. Flit rate vs. average energy per useful flits (pJ).
Figure 10. Dynamic, leakage power of various approaches.
Figure 11. Traffic injection rate vs. throughput, average flits latency.
Figure 12. Flit rate vs. average latency, throughput.
Figure 13. Flit rate vs. average energy per useful flits.
Schinkel et al. estimated the bandwidth per cross-sectional area (BW/CSA). The proposed design uses differential wires that operate at high speed and low swing. The technical specification used for the proposed work is given in Table 2. A 1.2 V, 6 M, 90 nm CMOS process is used, with metal-4 wires of 0.54 µm width and 0.32 µm spacing. The conventional system has the highest BW/CSA in the total NoC core. To estimate the total wire length, the following constraints are taken from Schinkel et al.: Rwire = 200 ohm/mm2; Cwire = 280/mm2; single differential channel (SDC) = 1.72 µm. For a link of length L = 2 mm and width W = 64 bits, the area occupied in both directions is 2 × W × L × SDC = 2 × 64 × 2 mm × 1.72 µm = 0.44 mm2 when placed in one metal layer. For five metal layers with a mesh topology configuration, the total link area in Schinkel et al. is 3.5 mm2, only about 4% of the tile area of 100 mm2. Sakthivel et al. designed the R-DTSA, a combination of four DTSAs, which occupies approximately 12%. The proposed V-DTSA is similar in size to a single DTSA element, but it provides better performance than the DTSA and R-DTSA. It occupies 4% of the tile area of 100 mm2, the same as the DTSA and less than the R-DTSA based NoC. Therefore, the V-DTSA based NoC uses less area and costs less than the R-DTSA.
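The link area arithmetic above can be checked with a short calculation. This is a sketch using only the figures quoted in the text (64-bit link, 2 mm length, 1.72 µm per differential channel, two directions); the function and variable names are illustrative. Reading the 1.72 µm channel as two wires of 0.54 µm width and 0.32 µm spacing each is an interpretation made here, although the numbers agree.

```python
def link_area_mm2(width_bits: int, length_mm: float,
                  channel_pitch_um: float, directions: int = 2) -> float:
    """Area of a parallel link in one metal layer, in mm^2.

    area = directions x width (bits) x length x pitch per differential channel
    """
    pitch_mm = channel_pitch_um * 1e-3  # convert micrometres to millimetres
    return directions * width_bits * length_mm * pitch_mm


# Interpretation (assumption): one differential channel is two wires,
# each taking one wire width plus one spacing.
WIRE_WIDTH_UM = 0.54
WIRE_SPACING_UM = 0.32
pitch_from_geometry_um = 2 * (WIRE_WIDTH_UM + WIRE_SPACING_UM)

area = link_area_mm2(width_bits=64, length_mm=2.0, channel_pitch_um=1.72)
print(f"pitch from geometry: {pitch_from_geometry_um:.2f} um")
print(f"link area: {area:.2f} mm^2")
```

The result, about 0.44 mm2, matches the single-layer figure quoted in the text.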
The power consumption and latency are estimated through the synthesizable VHDL model in the Synopsys environment with 90 nm technology. The performance metrics energy, static power and dynamic power are measured and compared with the conventional works. The experimental results of the VELAN design show better performance than those works.
The proposed work is summarized in four stages, namely selection, analysis, design and performance comparison. In the first stage, a few sense amplifiers are selected from among various sense amplifier circuits to form the V-DTSA, and a power comparison is made in both active and sleep modes (the M-DTSA and MCG-DTSA are selected and refined into the B-DTSA and BCG-DTSA). In the second stage, the energy is compared by applying LT (18 Gb/s, 113 fJ) and HT (12.8 Gb/s, 164 fJ) traffic to the selected DTSA modules (DTSA, B-DTSA and BCG-DTSA). The analysis shows that power reduction is achieved by the B-DTSA for LT and by the BCG-DTSA for HT. In the third stage, we design the V-DTSA circuit with the GTE, the controller and the DTSA modules. The GTE estimates the traffic rate and directs the controller to select the B-DTSA for LT and the BCG-DTSA for HT. In the final stage, the overall transceiver circuit (VELAN) under average traffic mode achieves a data rate of 6.157 Gb/s, a link power of 0.27 W and a latency of 440 ps/2200 ps for single/five stage operation. Compared with conventional methods, the VELAN design shows a 98.512% improvement in data rate and an 18.51% reduction in link power.