An energy-delay product study on chip multiprocessors for variable stage pipelining
 Vijayalakshmi Saravanan^{1},
 Alagan Anpalagan^{1} and
 Isaac Woungang^{2}
https://doi.org/10.1186/s136730150046x
© Saravanan et al. 2015
Received: 31 May 2015
Accepted: 23 August 2015
Published: 21 September 2015
Abstract
Power management is a major concern for computer architects and system designers. As reported by the International Technology Roadmap for Semiconductors (ITRS), energy consumption has become one of the most dominant issues for the semiconductor industry as transistor sizes scale down from the 22 nm to the 11 nm node. In this regard, existing techniques such as dynamic voltage scaling, clock gating, and complementary metal-oxide semiconductor (CMOS) technology have shown their physical limits; therefore, scaling will no longer be a valid strategy for achieving power-performance improvement. To overcome this critical issue in energy-efficient processor design, there is a clear demand for alternative solutions. In this paper, an approach that provides a promising solution for energy reduction is proposed. It uses a microarchitectural technique referred to as variable stage pipelining, which can be further validated and extended to different application domains such as mobile and desktop. An analytical model for evaluating the relationship between the number of cores and the pipeline stage depth in a chip multiprocessor is also proposed, based on which the optimal pipeline depth for various metrics is calculated.
Background
In recent years, there has been a growing demand for more efficient power management schemes for computing domains such as mobile, enterprise, and cloud computing, to name a few, and energy efficiency has become one of the main design goals for such schemes. For mobile processors, high performance and low energy have been among the main design targets for computer architects and hardware designers. However, the design process of such processors is yet to be fully adjusted to meet these goals. Power efficiency refers not only to low leakage currents and small switched capacitance, but also to the efficiency of the power distribution network, the power conversion circuitry, and the heat removal. Therefore, there is a clear demand for a holistic approach to power optimization and management that considers all these factors, and a general shift away from CPU-centric design thinking is taking place. In mobile systems, the focus should be on display, radio, and sensors, whereas in enterprise systems, memory, storage and networks are becoming increasingly important in terms of their power usage.
Recently, in order to save battery life while yielding high performance, handheld devices such as laptops and mobile processors have been required to exhibit low power consumption. Even though complementary metal-oxide semiconductor (CMOS) and dynamic voltage scaling (DVS) technologies have the potential to deliver the required power-performance, their effectiveness is expected to be greatly reduced as process technology advances. With shrinking CMOS minimum feature sizes, higher chip densities and lower operating voltages have led to the issue of process, voltage and temperature (PVT) variability, which is one of the main challenges in the design of power-efficient processors. With respect to power-aware processor design, the challenge is to develop an energy-delay optimization framework, which includes some methods for the energy-delay product (EDP). Other issues that stand out among the design challenges of power optimization and management are: (1) the statistical uncertainty about the workload, parameters and target system, and (2) a lack of benchmarks and evaluation techniques.
The important questions that have motivated our study of multicore power-performance tradeoffs are: (1) how will the technology transition from 22 to 11 nm nodes affect the power-efficient design of integrated multiprocessors? (2) what is the impact of power-performance on the power efficiency of the chip? and (3) what are the optimal pipeline depth and number of cores needed to achieve a balanced tradeoff between power efficiency and computational speed?

State-of-the-art pipelining techniques in chip multiprocessor (CMP) design are utilized and simulated, and their power and performance results are presented. It should be noted that so far, these techniques have mostly been tested on simple and well-known theoretical functions. To the best of our knowledge, earlier studies on power and performance in real-time implementations are rare.

A mathematical model for the evaluation of the proposed approach is introduced, and then solved numerically by executing a software program.

Different aspects of power-performance efficiency are compared against each other using existing techniques and four performance metrics.
The rest of the paper is organized as follows. In "Related works", some related works on optimal pipelining are presented. In "Analytical modeling", an analytical model and a description of our architectural framework are given. In "Experimental modeling", the experimental model and the simulation results are presented. In "Conclusions", we conclude our work.
Related works
In recent years, with the exponential growth of process technology, energy consumption has become the major constraint in the chip manufacturing industry. It has become the major obstacle to performance improvement in processors ranging from desktop to high-end. In classic CMOS scaling, the increase in performance has been mostly achieved by increasing the instructions per cycle (IPC) and the clock frequency. These improvements arise from a substantial increase in the pipeline depth (so-called deeper pipelines). On the other hand, high-end data center designs have been driven by performance alone, whereas power has become the main concern in microprocessor design. Therefore, both power and performance have to be taken into consideration even at the microarchitectural level, and appropriate power/performance metrics have to be devised for simulation purposes. In [4], Kunkel and Smith studied the optimum pipeline depth and defined a set of performance metrics. Recently, this work has been revisited using performance-only metrics [5–7, 11], and some pipelined processor power models have been formulated.
There is a growing demand for low-power design, while the increasing demand for processor performance leads to deeper pipelines at processor design points. In [8], the authors studied the optimum metric for various workloads and proposed a theoretical approach to find the power and performance tradeoff; dynamically changing the pipeline depth during program execution is described in [9–11]. In [12], the authors discussed minimizing the power consumption to obtain the optimal power-performance under throughput constraints.
A measure of the performance increase with pipeline depth is the change in CPI (cycles per instruction) [13]. Two points are worth noting. (1) Adding more pipeline stages, together with higher clock speeds and lower supply voltages, generally leads to a shorter logic depth per stage, so the delay is considerably reduced. However, the real measure of performance optimization with pipeline depth is MIPS or BIPS, where MIPS stands for million instructions per second and BIPS stands for billion instructions per second, or the time per instruction (TPI) of the machine. (2) TPI is the product of CPI and the cycle time of the processor. As the pipeline depth is increased, the cycle time goes down. This is because the entire logic time is broken down into a larger number of intervals, while the total time taken to process an instruction is not increased. It should be noted that the results obtained in our previous study [3] are based on simulations using a particular system configuration, whereas in this paper, the microarchitecture is taken into consideration.
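The TPI relation stated above (TPI is the product of CPI and cycle time) can be illustrated with a minimal Python sketch. The logic delay and CPI values below are hypothetical and chosen only for illustration; they are not measurements from this study, and the even split of logic delay across stages is a simplifying assumption.

```python
def cycle_time_ns(total_logic_ns, depth):
    """Cycle time when the total logic delay is split evenly into `depth` stages.

    Assumption: no per-stage latch overhead is modeled.
    """
    return total_logic_ns / depth


def tpi_ns(cpi, cycle_ns):
    """Time per instruction = CPI x cycle time."""
    return cpi * cycle_ns


# Hypothetical numbers: a deeper pipeline shortens the cycle time,
# but hazards typically raise the CPI.
shallow = tpi_ns(cpi=1.2, cycle_ns=cycle_time_ns(10.0, 5))
deep = tpi_ns(cpi=1.5, cycle_ns=cycle_time_ns(10.0, 10))
print(f"shallow pipeline TPI: {shallow:.2f} ns")
print(f"deep pipeline TPI:    {deep:.2f} ns")
```

In this made-up case the deeper pipeline still gives the lower TPI, because the cycle time halves while the CPI grows by only 25 %.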
Nowadays, voltage scaling is a widely used technique for power savings. However, there are serious concerns about future scaling due to advances in process technology. Energy reduction techniques have been investigated, and some studies on variable stage pipelining as an alternative to existing dynamic voltage and frequency scaling (DVFS) techniques have been conducted in [3, 14, 15]. However, these studies focused on a fixed pipeline depth during program execution. Such a pipeline depth may be efficient for certain programs, but may be suboptimal for other programs with different behaviors. In [14], it is argued that deactivating pipeline stages and using a shallow pipeline can help reduce the processor's power consumption. Most existing representative schemes for reducing the processor's power consumption are localized and often application specific. On the other hand, properly understanding and modeling the emerging applications and mobile user behaviors, as well as developing metrics for the user's experience, are essential, because this allows a system to decide how much power or energy should be allocated to a given computation [16].
In this paper, the approach used shows that the optimal pipeline depth and the corresponding number of cores vary with the workload intensity. Our focus is on finding the relationship between pipeline depth and power consumption under the adopted variable stage pipelining. The pipeline stage unification degree and the number of cores of a chip multiprocessor are optimized simultaneously in order to derive the best possible power-performance ratio.
Analytical modeling
But how are pipelining and parallel implementations useful for reducing power consumption? The study in [20] illustrates that leakage current is the dominant source of energy consumption in scaled transistors. Because subthreshold and leakage currents both depend on the total gate count, the number of transistors and the gate width, a pipelined approach makes a substantial contribution to reducing the leakage current. As noted in [21], pipelining gives a low-power processor solution because it always runs at low voltage. With this insight, we propose an alternative solution for power-efficient processor design using a pipelining concept called variable stage pipelining (VSP).
Power-performance vs. pipeline stage unification degree
Energy reduction with VSP
 1. Higher workload: metric \(\alpha _1\) and corresponding unification degree \(\beta _1\)
 2. Medium workload: metric \(\alpha _2\) and corresponding unification degree \(\beta _2\)
 3. Light workload: metric \(\alpha _3\) and corresponding unification degree \(\beta _3\)
In order to estimate the energy consumption using the VSP approach, we have investigated how the performance and power change as the unification degree varies for the diverse processor cores. The relationships between the various metrics and their power-performance implications have been derived as follows.
 (a) IPS To find the optimal depth, Eq. (9) is considered as the basis for further analysis of the various power-performance metrics. For an N-core processor, the relationship of IPS with respect to the metric is obtained from Eq. (9) as:
$$\begin{aligned} {\frac{\alpha \times N \times \beta _N}{f_{max}}} \le 10.245. \end{aligned}$$ (10)
For \(\beta _1 > \beta _2 > \beta _3\), and \(\alpha _1 < \alpha _2 < \alpha _3\), assuming \(\alpha\) and \(f_{max}\) are constant, the relationship between N and \(\beta _N\) is obtained as:
$$\begin{aligned} IPS= N \times \beta. \end{aligned}$$ (11)
In Eq. (11), as the number of cores increases, the metric also increases. However, N cannot be increased beyond a certain value due to energy constraints. Thus, IPS is not a fully reliable metric: it is a performance-only metric that does not take power into account. This motivates the analysis of the energy-delay product (EDP) as the power-performance metric, which is suitable for most modern processor platforms such as laptops and mobile phones.
 (b) E Metric We assume the following for the energy-delay product (EDP) analysis: for an N-core processor, assuming that N, \(f_{max}\) and \(\alpha\) are constants, the metric is proportional to the square of the unification degree, with \(\beta _1 >\beta _2 >\beta _3\) and \(\alpha _1 > \alpha _2 > \alpha _3\). By considering Eq. (9) and by assuming that the voltage scales linearly with the frequency with respect to power, the following relationship is derived:
$$\begin{aligned} \frac{N \times \beta _N \times {f^2_{max}}}{\alpha ^2} \le 10.245. \end{aligned}$$ (12)
In Eq. (12), we see that as the metric increases, the number of pipeline stages decreases. When N increases, IPS also increases; however, IPS/W is a power-performance metric, and the power increases too. Therefore, N cannot exceed a certain value. If it does, the metric decreases. For a higher metric, N should be neither high nor low.
 (c) EDP Metric By considering Eq. (9) and assuming that the voltage scales linearly with the frequency with respect to (\(Power \times IPS_N\)), the following relationship can be obtained:
$$\begin{aligned} \frac{N \times \beta _N \times f_{max}}{\alpha } \le 10.245. \end{aligned}$$ (13)
In Eq. (13), we see that as the metric increases, the number of pipeline stages decreases. When N increases, IPS increases; however, the metric is \(IPS^2/W\), and the power increases too. Therefore, N cannot go beyond a certain value. If it exceeds this limit, the metric decreases. Hence, for the higher metric, N should be either high or low.
 (d) ED\(^2\)P Metric By considering Eq. (9) and by assuming that the voltage scales linearly with the frequency with respect to (\(Power \times {IPS_N}^2\)), the following relationship can be obtained:
$$\begin{aligned} N \times \beta _N \times K \le 10.245, \end{aligned}$$ (14)
where K is a constant used in the experiment. According to \({BIPS}^3/W\), the CMP (chip multiprocessor) configuration consists of a large number of fairly narrow cores. Wider cores are considered too power hungry to be competitive.
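The four metrics discussed in (a) to (d) can be computed side by side, as in the minimal Python sketch below. The mapping of the E, EDP and ED\(^2\)P metrics to IPS/W, \(IPS^2/W\) and \(IPS^3/W\) follows the text; the two configurations and their IPS and power values are hypothetical and serve only to show that different metrics can favor different design points.

```python
def power_performance_metrics(ips, watts):
    """Return the four metrics used in the text for one configuration."""
    return {
        "IPS": ips,                  # performance-only metric
        "IPS/W": ips / watts,        # E metric
        "IPS^2/W": ips ** 2 / watts, # EDP metric
        "IPS^3/W": ips ** 3 / watts, # ED^2P metric
    }


# Hypothetical configurations (arbitrary units), not results from this study.
config_a = power_performance_metrics(ips=2.0, watts=4.0)
config_b = power_performance_metrics(ips=3.0, watts=12.0)

for name in config_a:
    winner = "A" if config_a[name] > config_b[name] else "B"
    print(f"{name:8s} A={config_a[name]:.2f} B={config_b[name]:.2f} -> {winner}")
```

With these made-up numbers, the faster but more power-hungry configuration B wins on IPS and \(IPS^3/W\), while configuration A wins on IPS/W and \(IPS^2/W\).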
Proposed low power architectural technique

The fulltime clock signal is always active regardless of the unification.

The parttime clock signal is deactivated when the pipeline stages are unified. It is active when they are not unified.

The unification signal indicates the pipeline stage unification. Since the pipeline register between two adjacent combinatorial logic circuits is inactive or bypassed, the two logic circuits operate together as a single stage.
In order to bypass a pipeline register, we have used two methods. In the first one, the pipeline register logic is organized in such a way that a signal can pass through it regardless of the clock signal when the VSP is enabled. This solution can be implemented if the pipeline registers are made up of transparent latches. It is simple, but its drawback is the cost-effectiveness of using transparent latches in the pipeline. In the second method, logic gates and multiplexors are involved. The multiplexors decide which pipeline registers are active and which ones are to be shut down when the unification signal is applied. An example of this solution is shown in [24].
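The behavior of the three signals and of the multiplexor-based bypass described above can be sketched as a small behavioral model in Python. This is only an illustrative sketch under simplifying assumptions (Boolean signal levels, one register between two stages); the function names are hypothetical and it is not the authors' hardware implementation.

```python
def clock_signals(clock, unified):
    """Return (full_time_clock, part_time_clock) for one clock level.

    The full-time clock always follows the clock; the part-time clock
    is gated off while the pipeline stages are unified.
    """
    full_time = clock
    part_time = clock and not unified
    return full_time, part_time


def stage_output(register_value, logic_value, unified):
    """Multiplexor between the registered value and the bypassed logic value.

    When the stages are unified, the pipeline register is bypassed and the
    upstream logic output feeds the next logic block directly.
    """
    return logic_value if unified else register_value


for unified in (False, True):
    full, part = clock_signals(clock=True, unified=unified)
    out = stage_output("registered", "bypassed", unified)
    print(f"unified={unified}: full-time={full}, part-time={part}, next stage sees {out}")
```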
Pseudocode of the variable stage pipelining algorithm
Experimental modeling
Processor configuration and clock frequency assumptions
Parameters | Alpha 21264 processor
Fetch, issue, commit width | 4,4 (int), 2 (float), 11
Reorder buffer size | 80
Issue window | 20 (int), 15 (float)
Load/store queue | 32 (load), 32 (store)
Register file | 160
Floating-point ALU | 1 adder, 1 multiplier
Integer ALU | 4 adder, 4 multiplier
L1 data, instruction cache | 512642
Dtlb, Itlb | 164128, 132128 (fully associative)
Clock frequency rate f(\(\beta _1=1\), \(\beta _2=1.5\), \(\beta _3=2\)) | 100 %, 66.7 %, 50 %
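The clock frequency rates in the last row of the table are consistent with the frequency scaling as the reciprocal of the unification degree. The following minimal Python sketch reproduces the listed percentages under that assumption, which is inferred from the table values and not stated explicitly in the text; the function name is hypothetical.

```python
def clock_frequency_rate(unification_degree):
    """Relative clock frequency, in percent of f_max.

    Assumption: frequency scales as 1 / unification degree.
    """
    return 100.0 / unification_degree


for beta in (1, 1.5, 2):
    print(f"beta = {beta}: {clock_frequency_rate(beta):.1f} %")
```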
Results of VSP and optimum analysis
 (a) IPS Metric For configurations with 2, 4 and 8 cores, the corresponding pipeline stage unification degrees are 2, 1.5 and 1.25, respectively. From the obtained IPS metric results and the \(f_{max}\) values of each configuration, we obtained the following relationship:
$$\begin{aligned} \frac{N \times IPS_N \times \alpha }{f_{max}} \le 10.245. \end{aligned}$$ (15)
 (b) E Metric For configurations with 2, 4, 8, 16 and 32 cores, the pipeline stage unification degrees are 1, 1, 1.25, 1.5 and 2, respectively. From the obtained IPS/W metric results, and the \(f_{max}\) and power consumption (\(\rho\)) values of each configuration, we have obtained the following relationship:
$$\begin{aligned} \frac{N \times \beta _N \times k_1 \times {f^2_{max}}}{\alpha ^2} \le 10.245, \end{aligned}$$ (16)
where \(\beta = \frac{IPS_N}{\rho }\) and \(k_1\) (where \(k_{1} = 2.8 \times 10^3\)) is the technology design parameter.
 (c) EDP Metric For configurations with 2, 4, 8, 16 and 32 cores, the pipeline stage unification (VSP) degrees are 1, 1, 1.25, 2 and 3.5, respectively. From the obtained \(IPS^2/W\) metric results, and the \(f_{max}\) and power consumption (\(\rho\)) values of each configuration, we have obtained the following relationship:
$$\begin{aligned} \frac{N \times \beta _N \times {k_1} \times {f_{max} }}{\alpha } \le 10.245, \end{aligned}$$ (17)
where \(\beta = \frac{IPS^2_N}{\rho }\) and \(k_1\) is the technology design parameter.
 (d) ED\(^2\)P Metric For configurations with 2, 4 and 8 cores, the pipeline stage unification degrees are 2, 1.5 and 1.25, respectively. From the obtained \(IPS^3/W\) metric results, and the \(f_{max}\) and power consumption (\(\rho\)) values of each configuration, we have obtained the following relationship:
$$\begin{aligned} N \times \beta _N \times k_1 \le 10.245, \end{aligned}$$ (18)
where \(\beta = \frac{IPS^3_N}{\rho }\) and \(k_1\) is the technology design parameter.
Table 2
EDP results vs. VSP unification degree
No. of cores | VSP degree (IPS) | VSP degree (E Metric) | VSP degree (EDP) | VSP degree (ED\(^2\)P)
2 | 2 | 1 | 1 | 2
4 | 1.5 | 1 | 1 | 1.5
8 | 1.25 | 1.25 | 1.25 | 1.25
16 | 0 | 1.5 | 2 | 0
32 | 0 | 2 | 3.5 | 0
Analysis and discussions
In this section, we present an analysis of the relationship between the number of cores and pipeline stage depths. The results have been recorded for the different metrics with their various VSP unification degrees as shown in Table 2.
In Fig. 4, it can be observed that an increase in the number (N) of cores leads to an increase in the performance and in the unification degree as well. Thus, the optimal point shifts towards deeper pipelines with a higher number of cores. In Fig. 5, it can be observed that an increase in the number (N) of processor cores leads to an increase in the E metric (IPS/W) and a decrease in the unification degree. Thus, the optimal point shifts towards shallower pipelines with a medium number of cores. Also, in Figs. 6 and 7, it can be observed that an increase in the number (N) of cores and a decrease in the unification degree lead to an increase in both the energy-delay product (\(IPS^2/W\)) and \(ED^2P\) (\(IPS^3/W\)). Thus, the optimal point shifts towards a medium number of cores with very few pipeline stages for the EDP, and towards a higher number of cores with shallower pipeline stages for the \(ED^2P\). Overall, from the simulation results, it was observed that the optimal combination of number of cores and pipeline stage depth is an 8-core processor with a VSP unification degree of 1.25 for all the metrics. Such a configuration will yield the maximum performance without compromising the power consumption. As observed from the results obtained for \(ED^2P\), going beyond 16 cores gives rise to inconsistent results due to memory coherency.
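The observation that the 8-core configuration with a VSP unification degree of 1.25 is shared by all metrics can be checked directly against the values in Table 2. The following short Python sketch transcribes the table and searches for core counts whose VSP degree is identical under every metric; the variable and function names are illustrative.

```python
# VSP unification degree per metric and core count, transcribed from Table 2.
VSP_DEGREE = {
    "IPS": {2: 2, 4: 1.5, 8: 1.25, 16: 0, 32: 0},
    "E": {2: 1, 4: 1, 8: 1.25, 16: 1.5, 32: 2},
    "EDP": {2: 1, 4: 1, 8: 1.25, 16: 2, 32: 3.5},
    "ED2P": {2: 2, 4: 1.5, 8: 1.25, 16: 0, 32: 0},
}


def common_configurations(table):
    """Return {cores: degree} for core counts where all metrics agree."""
    core_counts = sorted(next(iter(table.values())))
    agreed = {}
    for n in core_counts:
        degrees = {per_metric[n] for per_metric in table.values()}
        if len(degrees) == 1:
            agreed[n] = degrees.pop()
    return agreed


print(common_configurations(VSP_DEGREE))  # {8: 1.25}
```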
From the analysis of the performance and the energy-delay product metrics, it can be argued that the optimal point varies for different metrics. When considering performance-only metrics such as IPS and \(ED^2P\), the optimal number of cores lies between higher and lower pipeline stage unification degrees. When considering power-performance metrics such as E and EDP, the optimal configuration has a low-to-medium number of cores with a low-to-moderate number of pipeline stages. Thus, it can be concluded that when both power and performance are taken into account, a medium number of cores with a moderate number of pipeline stages will be the optimal configuration.
Limitations and future enhancement
The proposed VSP configuration requires knowledge of the workload characteristics. Successful detection of the workload characteristics may help to optimize the power modeling. The current limitation of the proposed approach is that the workload characteristics must be predicted from history information, and the power reconfigurations are usually programmed well in advance of the actual program execution, much as branch prediction is required in out-of-order processors. Hence, the future development of a hardware prototype of the VSP power model will be based on understanding the workload characteristics and on an optimized hardware scheduling algorithm, in order to improve the power/performance efficiency.
Conclusions
In this paper, a VSP-based microarchitectural power saving technique for balanced power-performance tradeoffs is proposed, and its efficiency in terms of several performance metrics is demonstrated by experiments. An analytical model is also proposed to analyze the relationship between the number of cores and the pipeline stage depth. The proposed method can be applied to explore energy-efficient design points in multithreaded multicore CPUs. The simulation findings have revealed that the optimal combination of number of cores and pipeline stage depth is an 8-core processor with a VSP unification degree of 1.25 among all the studied metrics. Such a configuration gives a good tradeoff between power and performance. In future, we intend to compare the proposed VSP-based scheme against some benchmark schemes, using the multithreaded microarchitectural simulation environment and performance metrics. Our estimates show that the VSP technique reduces energy consumption by approximately 2 %, as shown in Fig. 3. Though the improvement is moderate, the VSP technique is expected to be useful as technology progresses in future mobile, laptop and desktop processors.
Declarations
Authors’ contributions
VS investigated the state-of-the-art pipelining techniques in chip multiprocessors (CMP) and proposed a way to utilize them. AA proposed the mathematical model to sustain the above-mentioned pipeline techniques. IW investigated the different simulation scenarios and helped in conducting the associated simulations. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 1. Chakraborty K, Roy S (2012) Architecturally homogeneous power-performance heterogeneous multicore processor. US Patent App. 13/495,961. (Online). https://www.google.com/patents/US20120324250
 2. (2012) The ITRS Technology Working Groups, International Technology Roadmap for Semiconductors (ITRS). (Online) http://www.itrs.net/ITRS%201999014%20Mtgs,%20Presentations%20&%20Links/2013ITRS/2013Chapters/2013ExecutiveSummary.pdf. Accessed 2013
 3. Vijayalakshmi S, Anpalagan A, Woungang I, Kothari D (2013) Power management in multicore processors using automatic dynamic pipeline stage unification. In: 2013 International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS), July 2013, pp 120–127
 4. Kunkel SR, Smith JE (1986) Optimal pipelining in supercomputers. In: Proceedings of the 13th annual international symposium on Computer architecture, ser. ISCA '86. IEEE Computer Society Press, Los Alamitos, pp 404–411. (Online). http://dl.acm.org/citation.cfm?id=17407.17403
 5. Hartstein A, Puzak TR (2002) Optimum Power/Performance Pipeline Depth. In: IEEE Computer Society, IBM T. J. Watson Research Center
 6. Hrishikesh MS, Farkas KI, Burgert D, Keckler SW, Shivakumar P (2002) The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays. In: Proceedings of the 29th Annual International Symposium on Computer Architecture, pp 14–24
 7. Srinivasan V, Brooks D, Gschwind M, Bose P, Zyuban V, Strenski PN, Emma PG (2002) Optimizing pipelines for power and performance. In: Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture, ser. MICRO 35. IEEE Computer Society Press, Los Alamitos, pp 333–344. (Online). http://dl.acm.org/citation.cfm?id=774861.774897
 8. Hartstein A, Puzak TR (2002) The optimum pipeline depth for a microprocessor. SIGARCH Comput Archit News 30(2):7–13. doi:10.1145/545214.545217
 9. Borkar S (1999) Design challenges of technology scaling. IEEE Micro 19(4):23–29. doi:10.1109/40.782564
 10. Srinivasan V, Brooks D, Gschwind M, Bose P, Zyuban V, Strenski PN, Emma PG (2002) Optimizing pipelines for power and performance. In: International Symposium on Microarchitecture (MICRO35), Nov. 2002. Selected as one of the four Best IBM Research Papers in Computer Science, Electrical Engineering and Math published in
 11. Sprangle E, Carmean D (2002) Increasing processor performance by implementing deeper pipelines. In: Proceedings of the 29th Annual International Symposium on Computer Architecture, ser. ISCA '02. IEEE Computer Society, Washington, DC, pp 25–34. (Online). http://dl.acm.org/citation.cfm?id=545215.545219
 12. Ghasemazar M, Pakbaznia E, Pedram M (2010) Minimizing the power consumption of a chip multiprocessor under an average throughput constraint. In: IEEE ISQED, pp 362–371. (Online). http://dblp.unitrier.de/db/conf/isqed/isqed2010.html#GhasemazarPP10
 13. Hennessy JL, Patterson DA (2006) Computer architecture: a quantitative approach, fourth edn. Morgan Kaufmann Publishers Inc., San Francisco
 14. Koppanalil J, Ramrakhyani P, Desai S, Vaidyanathan A, Rotenberg E (2002) A case for dynamic pipeline scaling. In: Proceedings of the 5th International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES'02), pp 1–8
 15. Yao J, Miwa S, Shimada H, Tomita S (2007) Optimal pipeline depth with pipeline stage unification adoption. SIGARCH Comput Archit News 35(5):3–9. doi:10.1145/1360464.1360470
 16. Vijayalakshmi S, Aniket S, Sudeep C (2014) Reducing power dissipation in multicore processors using effective core switching. IJCIT 3(6):1435–1442
 17. Wang A, Chandrakasan A, Kosonocky S (2002) Optimal supply and threshold scaling for subthreshold CMOS circuits. In: Proceedings IEEE Computer Society Annual Symposium on VLSI, 2002, pp 5–9
 18. Calhoun B, Wang A, Chandrakasan A (2005) Modeling and sizing for minimum energy operation in subthreshold circuits. IEEE J Solid State Circuits 40(9):1778–1786
 19. Rabaey J (2009) Low power design essentials, 1st edn. Springer Publishing Company, Incorporated
 20. Taur Y, Ning TH (2009) Fundamentals of modern VLSI devices, 2nd edn. Cambridge University Press, New York
 21. Kim NS, Austin T, Blaauw D, Mudge T, Flautner K, Hu JS, Irwin MJ, Kandemir M, Narayanan V (2003) Leakage current: Moore's law meets static power. Computer 36(12):68–75. doi:10.1109/MC.2003.1250885
 22. Herbert S, Marculescu D (2007) Analysis of dynamic voltage/frequency scaling in chip-multiprocessors. In: Proceedings of the 2007 International Symposium on Low Power Electronics and Design, ser. ISLPED '07. ACM, New York, pp 38–43. (Online). doi:10.1145/1283780.1283790
 23. Tsai Y, TPS (2005) University, Tools and Techniques for Leakage Power Analysis. Pennsylvania State University. (Online). https://books.google.com/books?id=W89BjkKz7hMC
 24. Boucaron J, Coadou A (2009) Dynamic variable stage pipeline: an implementation of its control. INRIA, Rapport de recherche RR6918, rR6918. http://hal.inria.fr/inria00381563/PDF/RR6918.pdf
 25. Joseph KG, Sharkey J, Ponomarev D (2005) Abstract MSIM: a flexible, multithreaded architectural simulation environment. Tech. Rep