Projects: Difference between revisions

From ArcoWiki
Cmolina (talk | contribs)
= Clustered Processors =


The increasing number of transistors and the faster clock speeds of current processors prevent signals from reaching every point of a chip in a single clock cycle. Moreover, in recent process technologies interconnect latencies keep growing relative to logic-gate delays [AHK00], and power consumption has become a first-order design constraint. To address these problems, a clustered design can be adopted: the processor is divided into subunits (clusters) that work largely independently. The smaller size of these clusters allows signals to reach all of their parts in a single clock cycle. Additionally, the design complexity is reduced and the energy efficiency is improved [ZK01]. On the other hand, this paradigm also introduces a new problem: the workload has to be distributed among all clusters in order to achieve the maximum computing potential. Hence, the different clusters need to communicate with each other, which requires an interconnection network among clusters with significant latency and limited bandwidth.
Several clustered superscalar architectures have been proposed over time. They can be classified by how they map instructions to clusters. Some use control dependences (and could alternatively be classified as speculative multithreaded) [SBV95] [RJS97], while others are based on data dependences. The latter group can be further classified by the time at which instructions are mapped: one group relies on the compiler to perform this work before the program is executed [NSB01] [FCJ97]; another maps instructions dynamically in hardware during program execution [KF96] [PJS97] [K99] [ZK01] [KS02] [BDA03]. The memory unit has received special attention in the literature because it is one of the most complex and power-consuming structures of a superscalar processor. Most research focuses on the memory disambiguation unit and the first-level data cache. Some proposals include a predictor [YMR+99] to find an instruction mapping that minimizes communication delays [ZK01] [RP03] [B04], while others do not [FS96] [SNM+06].
For VLIW architectures, the work concentrates on the code-generation stage of the compiler, which distributes instructions among clusters, schedules instructions and the necessary communications, and assigns registers. The various proposals found in the literature can be distinguished by the order in which these tasks are executed. Some proposals execute the tasks independently [E86] [CDN92] [D98] [JCSK98] [NE98] [CFM03]; others unify the last two tasks [OBC98] [SG00a] [LPSA02]; and recently some groups have proposed techniques that unify all three [KEA01] [CSG01] [ACSG01] [ZLAV01] [ACS+02]. In parallel, techniques have been studied to improve register assignment and spill code [ACGK05] and to reduce the impact of communications by replicating some code [SG00a] [ACGK03a]. On the other hand, some recent works propose alternatives for distributing the memory hierarchy in a clustered architecture [WTS+97] [SG00b] [GSG02a], as well as code-generation techniques for these new organizations [GSG02b] [GSG03a] [GSG03b].


= Characterization and Acceleration of Emerging Applications =


GPUs are specialized hardware cores designed to accelerate rendering and display, but they are moving in the direction of general-purpose accelerators [BoB09]. GPU vendors have recently introduced new programming models and associated hardware support to broaden the class of non-graphics applications that may efficiently use GPU hardware [NVI08] [OHL+08]. There is a large, emerging and commercially relevant class of applications enabled by the significant increase in GPU computing density, such as graphics and physics for gaming, interactive simulation, data analysis, scientific computing, 3D modeling for CAD, signal processing, digital content creation, and financial analytics. The PARSEC benchmark suite [BKS+08] is a good proxy for this kind of application. Applications in these domains benefit from architectural approaches that provide higher performance through parallelism. <br>GPU capabilities excel for applications that exhibit extensive data parallelism. GPUs typically operate on a large number of data points, where the same operation is conducted simultaneously on all of them in the form of continuously running vectors or streams. Furthermore, to exploit data-level parallelism, modern GPUs typically batch together groups of individual threads (called warps) running the same shader program, and execute them in lockstep on a SIMD pipeline [LuH07] [LNO+08]. However, even with a general-purpose programming interface, mapping existing applications to the parallel architecture of a GPU is a non-trivial task. <br>Vectorization is an optimization technique that has traditionally targeted vector processors. The importance of this optimization has increased in recent years with the introduction of SIMD extensions such as Intel's SSE or IBM/Motorola's AltiVec in general-purpose processors, and with the growing significance of applications that can benefit from this functionality.
However, achieving high performance on modern architectures requires efficient utilization of the SIMD units. This requires algorithms that take full advantage of the SIMD width offered and do not waste SIMD instructions on low-utilization cases. Both Intel SSE and PowerPC AltiVec expose a relatively small SIMD width of four. It is often complicated to apply vectorization techniques to architectures with such SIMD extensions because these extensions are largely non-uniform, supporting specialized functionality and a limited set of data types. Vectorization is often further impeded by the SIMD memory architecture, which typically provides access to contiguous memory items only, often with additional alignment restrictions. Computations, on the other hand, may access data elements in an order that is neither contiguous nor adequately aligned. Bridging this gap efficiently requires careful use of special mechanisms, including permute, pack/unpack, and other instructions that incur additional performance penalties and complexity. <br>However, given the small cost and potentially high benefit of increasing the SIMD width, it seems likely that future architectures will explore larger SIMD widths, as Nvidia's Fermi and Intel's Larrabee [LCS+08] do. Larrabee greatly increases the flexibility and programmability of the architecture compared to standard GPUs. Its approach is based on extending each CPU core with a wide vector unit featuring scatter/gather capability, as well as predicated-execution support. On the other hand, available compilers have limitations that prevent loops from being vectorized, such as control flow, non-contiguous and irregular data accesses, data dependences, nested loops and an undefined number of loop iterations [RWP05], all of which are present in most of the main loops of emerging applications. Some works address the control-flow problems [SFS00] [Shi07], as well as irregular data access [ChS08]. <br>


= Intrusion Detection Systems =

Computing systems today operate in an environment of seamless connectivity, with attacks continuously created and propagated through the Internet. There is a clear need to provide an additional layer of security in routers. Intrusion Detection Systems (IDS) are emerging as one of the most promising ways of protecting systems on the network against suspicious activities. By monitoring traffic in real time, an IDS can detect, and also take preventive actions against, suspicious activities. Network-based IDS have emerged as an effective and efficient defense for systems on the network, and they are usually deployed in routers.<br>This deployment poses a very interesting challenge: with line rates doubling roughly every 24 months, especially in backbone routers, IDS performance needs to scale accordingly. For example, an IDS deployed in a state-of-the-art backbone router inspects packets streaming at 40 Gbps and scans them for more than 23,000 attack signatures. This is a tremendous performance challenge, so performance is a key factor for the efficient functioning of an IDS.<br>An IDS detects attacks by scanning packets for attack patterns, performing multiple pattern matching. Patterns can be expressed either as fixed strings or as regular expressions. The Aho-Corasick algorithm [AC71] is commonly used by IDS [Ro99] for fixed-string matching: a finite-state machine (FSM) is constructed from the attack signatures and subsequently traversed using the bytes of each packet. The main advantage of the Aho-Corasick algorithm is that it runs in time linear in the input bytes regardless of the number of attack signatures. The main disadvantage lies in devising a practical implementation, due to the large memory needed to store the FSM.
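The FSM construction and traversal can be sketched in a few lines; the following is a minimal, illustrative Python version (the pattern set used below is a classic textbook example, not an actual signature set):

```python
from collections import deque

def build_automaton(patterns):
    """Build the Aho-Corasick FSM: a trie (goto), failure links (fail),
    and per-state output sets. The FSM grows with the signature set,
    which is exactly the memory problem discussed above."""
    goto, output = [{}], [set()]
    for pat in patterns:                       # phase 1: trie of all patterns
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(pat)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())            # phase 2: BFS for failure links
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:     # follow fails until ch matches
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            output[t] |= output[fail[t]]       # inherit patterns ending here
    return goto, fail, output

def scan(data, goto, fail, output):
    """Traverse the FSM over the input; amortized linear in len(data)."""
    state, hits = 0, []
    for i, ch in enumerate(data):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in output[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

Scanning "ushers" against the set {he, she, his, hers} reports she at offset 1 and he and hers at offset 2; the traversal touches each input byte a bounded amortized number of times, which is the linear-time property noted above.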
Hence, one of the primary areas of focus in the IDS research community is devising a performance- and area-efficient architecture for the Aho-Corasick algorithm.<br>IDS also increasingly use regular expressions to specify attack signatures, owing to their rich expressive power. To be matched, regular expressions are first converted to finite automata (deterministic or non-deterministic), and these automata are then traversed using bytes from packets. However, these automata are either inefficient with respect to chip area (deterministic finite automata, DFA) or inefficient with respect to performance (non-deterministic finite automata, NFA).<br>A key requirement for the effectiveness of an IDS is that it process packets at the rate at which they stream. Failing to do so results either in undetected malicious packets or in expensive packet drops. An adversary can deliberately drive the IDS into this state of not being able to process packets at wire speed. Such attempts are commonly referred to as evasion [CW03, PN98], and they stem from weaknesses in some part of IDS processing. The nature and ease of evasion make it very appealing for malicious hosts to bypass the IDS.<br>There have been numerous works on improving the performance and area efficiency of pattern matching (fixed strings and regular expressions). [TS05, PY08] propose novel techniques to significantly improve the performance and area efficiency of pattern matching in IDS. In the area of regular-expression matching, numerous works [HST09, SEJ08] have proposed improvements to DFA storage and DFA traversal, and [BP05] has proposed techniques for NFAs using reconfigurable hardware. [CW03, SEJ06] have studied various sophisticated attacks against IDS and secure defense mechanisms.
Additionally, [CGJ09, MO07] have studied similar attack and defense mechanisms against the Unix file system and against banked memory in multi-cores, respectively.<br>Broadly, we plan to address the aforementioned issues using a hardware/software approach: the software approach focuses on improving area efficiency, while the hardware approach improves performance.
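The DFA area problem mentioned above can be made concrete with a toy subset construction. The pattern family below (an 'a' exactly k+1 symbols before the end of the input) is a standard worst case chosen for illustration: its NFA has k+2 states, while the equivalent DFA needs 2^(k+1):

```python
def nfa_for_kth_from_end(k):
    """NFA (k+2 states) over {a, b} for 'the symbol k+1 positions before
    the end is an a'; state 0 is the start, state k+1 is accepting."""
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}  # state 0 loops; 'a' also spawns state 1
    for s in range(1, k + 1):
        for sym in "ab":
            delta[(s, sym)] = {s + 1}          # states 1..k advance on any symbol
    return delta

def dfa_state_count(delta, alphabet="ab"):
    """Subset construction from NFA start state 0: each DFA state is a
    set of simultaneously active NFA states; count the reachable ones."""
    start = frozenset({0})
    seen, stack = {start}, [start]
    while stack:
        cur = stack.pop()
        for sym in alphabet:
            nxt = frozenset(t for s in cur for t in delta.get((s, sym), ()))
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen)
```

For k = 3 the 5-state NFA determinizes into 16 DFA states. This exponential gap is why DFA-based matchers trade chip area for the guarantee of one transition per input byte, while NFA-based matchers stay small but must track many active states per byte.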


= Memory Hierarchy =


Innovation and technological improvements in processor design have outpaced advances in memory design over the last ten years, causing an increasing gap between processor and memory speeds. During the last decade this has led to an approach based on concurrent execution, initially through the execution of multiple threads in one processor and now with the inclusion of multiple cores in a single chip. Unfortunately, the advent of chip multiprocessors (CMPs) has made the problem even worse, due to increased bandwidth requirements and contention at the memory controller. This growing speed gap has motivated current high-performance processors to focus on cache organization, the register file and prefetching techniques to tolerate growing memory latencies [BCS09], [BGK96], [SPN96]. Furthermore, power dissipation is becoming a critical issue for microprocessors: it determines the cost of the cooling system and ultimately may limit the performance of the microprocessor. <br>Prefetching, which decouples and overlaps the computation and the transfer of data, is a well-known technique commonly employed to hide memory latencies [BCS09]. However, although aggressive prefetching mechanisms are, for most applications, beneficial for tolerating memory latencies in single-core processors, when prefetching is performed in multiple cores of a CMP the performance gains of individual cores can be greatly reduced compared to systems without prefetching [EMJ09]. This is caused by interference among prefetching mechanisms in the shared resources. <br>One of the greatest challenges raised by this shift in chip organization lies in how users will exploit CMPs. Parallel programming models, which divide an application into several tasks that can be executed concurrently, seem to be the best alternative to take advantage of CMP resources.
Unfortunately, current programming models implement blocking synchronization, where critical sections are serialized in order to ensure mutual exclusion. Blocking synchronization increases the complexity of parallel programming and significantly degrades the performance of parallel applications. This has encouraged the development of optimistic programming models that use non-blocking synchronization. In these models, critical sections are executed simultaneously, which requires modifications in the memory hierarchy to guarantee the correctness of the execution [HWC04] [DLM09]. <br>On the other hand, the increasing influence of wire delay on cache design means that access latencies to the last-level cache banks are no longer constant [AHK00], [M97]. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem [KBK02]. A NUCA divides the cache into smaller banks and allows nearer banks to have lower access latencies than farther ones, thus mitigating the effect of the cache's internal wires. <br>We will propose several microarchitectural techniques that can be applied to various parts of current microprocessor designs to improve the memory system and to boost the execution of instructions. Some techniques will attempt to ease the gap between processor and memory speeds, while others will attempt to alleviate the serialization caused by data dependences. <br>
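As a minimal illustration of the prefetching mechanisms discussed above, a per-PC stride prefetcher can be sketched as follows (the table layout and the two-access confirmation policy are simplifying assumptions, not a description of any particular design):

```python
class StridePrefetcher:
    """Per-PC stride prefetcher: after two consecutive accesses by the
    same instruction with the same stride, predict the next address."""

    def __init__(self):
        self.table = {}  # pc -> (last_addr, stride, confirmed)

    def access(self, pc, addr):
        """Record one memory access; return an address to prefetch, or None."""
        prefetch = None
        if pc in self.table:
            last, stride, confirmed = self.table[pc]
            new_stride = addr - last
            if new_stride == stride and stride != 0:
                prefetch = addr + stride            # stride confirmed: run ahead
                self.table[pc] = (addr, stride, True)
            else:
                self.table[pc] = (addr, new_stride, False)
        else:
            self.table[pc] = (addr, 0, False)       # first sighting of this pc
        return prefetch
```

For a load walking an array with stride 8, the third access confirms the stride and triggers a prefetch of the next element; the interference problem cited above appears when many cores issue such prefetches into shared caches and memory controllers at once.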


= Multithreaded Processors =


Industry and researchers are shifting towards multi-core architectures [WWW07] [WP07] [WWW207] [F07]. This shift is mainly motivated by two factors. On the one hand, we have reached a point where further exploiting instruction-level parallelism (ILP) gives diminishing returns, so other types of parallelism are needed. On the other hand, new feature sizes allow a greater number of transistors to be implemented on a chip, which opens the possibility of integrating multiple cores on a die, so that multiple applications, or threads from the same application, can run in parallel and achieve good performance by exploiting thread-level parallelism (TLP) [KFJ+04]. <br>The ability to execute multiple threads in parallel is called multithreading, and it can be implemented in several ways. Implementing multiple cores allows multiple threads to run in parallel; in addition, each of these cores can execute more than one thread at the same time using techniques such as simultaneous multithreading, fine-grain multithreading [TEL95] or switch-on-event multithreading [MB05]. <br>Implementing multiple simple cores on a chip makes the number of cores available in a processor grow every year, and companies are sometimes making strong bets, designing processors that exploit TLP very efficiently at the expense of sacrificing ILP [H05]. However, these novel architectures consisting of simple cores will have to compete with current out-of-order processors, which clearly outperform them in the ILP arena. <br>Speculative multithreading is a paradigm in which single-threaded applications are split into multiple threads that can be executed in parallel. These threads are generated using speculative optimizations, such as control speculation and dependence breaking, that maximize the number of instructions that can be executed in parallel.
Unfortunately, since the optimizations are speculative, the paradigm also requires hardware mechanisms to detect and recover from misspeculations. In exchange, multi-core architectures built from simple cores can take advantage of this paradigm to reach performance similar to that of conventional out-of-order cores on single-threaded applications. Typical implementations of speculative multithreading can be found in [SBV95] [SCZ+02] [MG99] [GMS+05] [CMT00]. These implementations usually generate speculative threads in which every thread represents a set of consecutive instructions from the original application, plus some extra instructions to handle the speculative optimizations. More recent proposals refine these models, generating threads in which the original instructions are more aggressively distributed among threads [MLC+09].
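The squash-and-recover mechanism can be illustrated with a toy sequential model (the task interface and the read/write-set conflict rule are illustrative assumptions; real designs track these sets in hardware):

```python
def tls_execute(tasks, memory):
    """Toy thread-level speculation: all tasks first run optimistically
    against the initial memory snapshot; at in-order commit time, a task
    that read a location written by an earlier task is squashed and
    re-executed with the committed state."""
    snapshot = dict(memory)
    results = [task(dict(snapshot)) for task in tasks]  # optimistic "parallel" run
    committed_writes, squashes = set(), 0
    for task, (reads, writes) in zip(tasks, results):
        if reads & committed_writes:            # cross-thread dependence violated
            squashes += 1
            reads, writes = task(dict(memory))  # squash: redo with real state
        memory.update(writes)                   # commit this task's writes in order
        committed_writes |= set(writes)
    return memory, squashes

def make_accumulate(i):
    # Each task reads and writes "sum": a fully serializing dependence chain.
    return lambda mem: ({"sum"}, {"sum": mem["sum"] + i})
```

Three accumulating tasks produce the correct sequential result (sum = 6) at the cost of two squashes, while fully independent tasks commit with none; the hardware mechanisms mentioned above play the role of the conflict check and re-execution here.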


= Reliability =


Circuit behavior is no longer deterministic in current and future technologies due to variations. Limitations in the fabrication process introduce indeterminism in the geometry (length, width and thickness) of transistors, and hence in their behavior. Moreover, voltage and temperature oscillate, and the inputs of circuits change. Thus, the delay and power of circuits change dynamically, yet circuits must work in any feasible scenario. This issue is typically addressed by designing for the worst-case scenario to ensure functionality, but that assumption is pessimistic most of the time and very inefficient in terms of power and delay. The aim of our research is to find solutions that make circuits more efficient in the presence of variations. The result will be a set of strategies, techniques and circuit designs that improve the performance and power of such circuits by adapting their operation to the common case instead of assuming worst-case conditions.<br>Some works have already been proposed to exploit variations. TIMERRTOL [Uht00] and Razor [EKD+03] do so by using results after the common-case delay and checking their correctness after the worst-case delay. There has also been work on new memory circuit design [VSP+09], as well as on techniques to reduce the performance impact of process variations [LCW+07]. <br>The main groups working in the area of variations are the Department of Electrical and Computer Engineering at the University of Rhode Island, the Department of Electrical Engineering at Princeton University, the Division of Engineering and Applied Sciences at Harvard University, and the Department of Electrical Engineering and Computer Science at the University of Michigan.
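A first-order model makes the common-case-versus-worst-case trade-off concrete (all numbers below are assumed for illustration; real margins combine process, voltage and temperature effects):

```python
import math

def tail_prob(clock, mu, sigma):
    """P(path delay > clock) for a normally distributed path delay."""
    z = (clock - mu) / sigma
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

def avg_time_per_op(clock, mu, sigma, recovery_cycles):
    """Razor-style operation: one cycle per operation, plus a replay
    penalty of recovery_cycles whenever the sampled delay exceeds the
    clock period (the error is detected and the operation re-executed)."""
    return clock * (1.0 + tail_prob(clock, mu, sigma) * recovery_cycles)

mu, sigma = 1.0, 0.05              # assumed mean and spread of the critical path
worst_case = mu + 4 * sigma        # guard-banded clock: never fails (1.20 units/op)
common_case = avg_time_per_op(mu + 2 * sigma, mu, sigma, recovery_cycles=2)
# common_case is about 1.15 units/op: clocking for the common case wins
# despite paying an occasional replay penalty.
```

The same arithmetic also shows the limit of the idea: with a long recovery penalty or a wide delay distribution, the expected replay cost exceeds the margin saved, which is why detection and recovery must be cheap.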


= Virtual Machines =


Co-designed virtual machines [SN05] are an attractive vehicle for designing complexity- and power-effective processors. In this paradigm, a processor is a co-design effort between hardware and software. The software layer wraps the hardware and provides a common interface to the outside world (operating system and applications). This software layer allows us to perform dynamic translation from a source ISA (visible to the operating system and applications), as well as to optimize the source code to better exploit the capabilities of the hardware layer underneath. <br>Several proposals in the research arena have shown the potential benefits of a co-designed virtual machine, as well as the benefits of dynamic optimization. In Transmeta Crusoe [Kla00], IBM DAISY [EA96] and IBM BOA [AGS+99], the concept of a co-designed virtual machine is leveraged to design a system based on a low-complexity, low-power VLIW hardware layer able to execute general-purpose x86 code. In these proposals, the dynamic translation from x86 to the VLIW ISA is an essential feature, and it imposes a significant overhead on the system. <br>RePlay [FBC+01] and PARROT [RAM+04] remove the translation overhead and concentrate their efforts on dynamically optimizing the most frequently executed sections of applications. However, these projects rely on hardware to perform code optimization, which limits the flexibility of the system. A software optimizer can perform more complex analyses and optimizations than a hardware-based scheme, and it may be updated many times, even after the chip is built. Moreover, a hardware optimizer adds complexity to the hardware, which may increase power consumption and validation cost. <br>The goal of our research in this arena is to propose a complete design of a system based on a combined hardware/software effort.
To do so, we will first investigate an efficient and flexible co-designed system that overcomes the limitations of previous proposals. Then, we will investigate novel techniques to perform dynamic optimization in the co-designed virtual machine software layer. These techniques must be able to adapt applications to the hardware underneath. Finally, we will investigate enhancements to the hardware layer that allow better interaction with the wrapper software layer. <br>Numerous research groups have focused on generic dynamic binary optimization. However, very few have concentrated their efforts on the concept of co-designed virtual machines, where the final functionality of a processor is transparently provided by the most efficient balance between hardware and software. In addition to the aforementioned projects, the research group led by the recently retired professor Jim Smith [SN05], with whom we have closely collaborated for more than 10 years, has also worked on this topic. <br>Given the increasing complexity of current processors in terms of energy consumption, area and validation, we strongly believe that this research topic will gain prominence in the research agendas of many groups in the coming years. In fact, more and more groups advocate using software to perform tasks that are too complex to implement in hardware, even if these proposals are not aligned with the concept of co-designed virtual machines.


= ISA Extensions for Dynamically Scheduled Processors =


The main objective of scheduling is to obtain a high level of parallelism in order to maximize the use of processor resources and minimize execution time. During static scheduling, the compiler has access to all the information contained in the program, which allows it to extract parallelism at different granularities. However, the amount of parallelism it can extract is limited, since some information is available only at execution time. To overcome such limitations, special instruction-set-architecture (ISA) extensions such as predication and register windows have been introduced; they enable techniques that help the program adapt to the execution environment and improve performance without changing the semantics of the overall execution. Such extensions are mainly implemented in in-order processors [MD03] [HL99]; however, they may also be complemented with dynamic scheduling techniques to exploit parallelism at various granularities.
If-conversion [JK83] is a compiler technique that takes full advantage of predication. Several studies have shown that if-conversion may alleviate the severe performance penalties caused by mispredictions of hard-to-predict branches [MBG+94] [CHPC95] [AHM97]. Many research groups have developed techniques to execute predicated code on out-of-order processors: the generation of micro-operations to disambiguate multiple register definitions [WWK+01], predicate value prediction [CC03] [QPG06], and the introduction of new ISA extensions [KMSP05]. Another problem associated with if-conversion is the loss of the correlation information needed by the most common branch predictors [ACGH97]. Many studies have proposed using predicate information in branch prediction to recover this lost correlation information [ACGH97] [SCF03] [QPG07].
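The transformation itself can be illustrated with a scalar sketch (a hypothetical example, not tied to any particular ISA): the hard-to-predict branch disappears, and a predicate guards the side effect instead.

```python
def branchy_sum(xs, threshold):
    """Control-flow version: one hard-to-predict branch per element."""
    total = 0
    for x in xs:
        if x < threshold:
            total += x
    return total

def if_converted_sum(xs, threshold):
    """If-converted version: both paths execute unconditionally and the
    predicate selects the effect, removing the branch (and, with it, the
    correlation information a branch predictor would have observed)."""
    total = 0
    for x in xs:
        p = 1 if x < threshold else 0   # predicate register
        total += p * x                  # predicated add, executed every iteration
    return total
```

Both versions compute the same result; the if-converted form trades a misprediction-prone branch for unconditional work on both paths, which pays off exactly when the branch is hard to predict.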
Another ISA extension is register windows. A register window [Sit79] is the set of private logical registers accessed by a function. Parameters are passed from one function to another through the overlapping of windows [MD03] [HL99]. When the number of free logical registers is insufficient, a spill mechanism must be activated to preserve the values from outer functions that are still live. Several studies use register windows to give the processor the illusion of an unlimited number of registers [DM82] [HL91] [ND95] [OBMR05].


= References =

[AC71] A. V. Aho, M. J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333–340, 1975.<br>[AGS+99] E. Altman, M. Gschwind, S. Sathaye, S. Kosonocky, A. Bright, J. Fritts, P. Ledak, D. Appenzeller, C. Agricola, Z. Filan, “BOA: The Architecture of a Binary Translation Engine”, IBM Research Report RC 21665 (97500), 1999 <br>[AHK00] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, Clock rate vs. ipc: The end of the road for conventional microprocessors, in Proceedings of the 27th International Symposium on Computer Architecture, 2000. <br>[BCS09] S. Byna, Y. Chen and X. Sun. “Taxonomy of Data Prefetching for Multicore Processors”. Journal of Computer Science and Technology, 2009 24 (3): 405-417. <br>[BGK96] D. Burger, J. R. Goodman, and A. Kägi, “Quantifying Memory Bandwidth Limitations of Current and Future Microprocessors”, In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996. <br>[BKS+08] C.Bienia, S.Kumar, J.P.Singh, and K.Li. “The PARSEC Benchmark Suite: Characterization and Architectural Implications”, In Proceedings of International Conference on Parallel Architectures and Compilation Techniques, pages 72–81, 2008. <br>[BoB09] R. Borgo, K. Brodlie,. “State of The Art Report on GPU”, Visualization and Virtual Reality Research Group, School of Computing - University of Leeds, - VizNET REPORT - Ref: VIZNET-WP4-24-LEEDS-GPU, 2009. <br>[BP05] Z. K. Baker, V. K. Prasanna: High-throughput linked-pattern matching for intrusion detection systems. ANCS 2005.<br>[CGJ09] Q. Cai, Y. Gui, and R. Johnson. Exploiting Unix File-system Races via Algorithmic Complexity Attacks. Proceedings of IEEE Symposium on Security and Privacy, 2009.<br>[ChS08] H.Chang and W.Sung, “Efficient Vectorization of SIMD Programs with Non-aligned and Irregular Data Access Hardware”, CASES’08, October 19–24, 2008. <br>[CMT00] M. Cintra, J.F. Martinez and J. 
Torrellas, “Architectural Support for Scalable Speculative Parallelization in Shared-Memory Systems”, in Proc. of the 27th Int. Symp. on Computer Architecture, 2000. <br>[CW03] S. A. Crosby and D. S. Wallach. Denial of Service via Algorithmic Complexity Attacks. In USENIX Security, Aug. 2003.<br>[DLM09] D. Dice, Y. Lev, M. Moir, and D. Nussbaum, Early Experience with a Commercial Hardware Transactional Memory Implementation, in Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, 2009. <br>[EA96] K. Ebcioglu, E. R. Altman, “DAISY: Dynamic Compilation for 100% Architectural Compatibility”, IBM Research Report RC 20538, 1996. <br>[EKD+03] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, T. Mudge. Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation. Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. <br>[EMJ09] E. Ebrahimi, O. Mutlu, C. J. Lee, and Y. N. Patt. “Coordinated control of multiple prefetchers in multi-core systems”. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009. <br>[FBC+01] B. Fahs, S. Bose, M. Crum, B. Slechta, F. Spadini, T. Tung, S. Patel, S. Lumetta, “Performance Characterization of a Hardware Mechanism for Dynamic Optimization”, in Procs. of the 34th International Symposium on Microarchitecture (MICRO-34), 2001. <br>[F07] J. Fang. Parallel Programming Environment: A Key to Translating Tera-Scale Platforms into a Big Success. Key Note at International Symposium on Code Generation and Optimization. March 2007. <br>[GMS+05] C. García, C. Madriles, J. Sánchez, P. Marcuello, A. González and D.M. Tullsen, “Mitosis Compiler: An Infrastructure for Speculative Threading Based on Pre-computation Slices”, in Proc. of the Int. Conf. on Programming Language Design and Implementation, pp.
269-278, 2005. <br>[H05] H.P. Hofstee, Power efficient processor architecture and the cell processor. In Proceedings of 11th International Symposium on High-Performance Computer Architecture HPCA-11, February 2005. <br>[HST09] N. Hua, H. Song, and T.V. Lakshman. Variable-Stride Multi-Pattern Matching For Scalable Deep Packet Inspection. Proceedings of IEEE INFOCOM, April 2009.<br>[HWC04] L. Hammond, V. Wong, M. Chen, B. Carlstrom, J. Davis, B. Hertzberg, M. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, Transactional Memory Coherence and Consistency, in Proceedings of the 31st International Symposium on Computer Architecture, 2004. <br>[KBK02] C. Kim, D. Burger, and S.W. Keckler, An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches, In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002. <br>[KFJ04] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. In Proceedings of the 31st annual International Symposium on Computer Architecture, 2004. <br>[Kla00] A. Klaiber, “The Technology Behind the CrusoeTM Processors”, white paper, January 2000 <br>[LCW+07] X. Liang, R. Canal, G.Y. Wei, D. Brooks. Process Variation Tolerant 3T1D-Based Cache Architectures. Proceedings of the 40th International Symposium on Microarchitecture (MICRO-40), December 2007 <br>[LNO+08] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym. “NVIDIA TESLA: A Unified Graphics and Computing Architecture”, IEEE Micro, 28(2):39-55, March-April 2008. <br>[LuH07] D. Luebke and G. Humphreys. “How GPUs work”, Computer, 40(2):96-100, 2007. <br>[M97] D. Matzke, Will physical scalability sabotage performance gains?, IEEE Computer, September 1997. <br>[MB05] C. McNairy, and R. Bhatia. Montecito: A Dual-Core, Dual-Thread Itanium Processor. IEEE Micro, Volume 25, Issue 2, Pages 10–20. March-April 2005. 
<br>[MG99] P. Marcuello and A. González, “Clustered Speculative Multithreaded Processors”, in Proc. of the 13th Int. Conf. on Supercomputing, pp. 365-372, 1999. <br>[MLC+09] C. Madriles, F. Latorre, J.M. Codina, E. Gibert, P. López, A. Martínez, R. Martínez and A. González, "Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading", in proceedings of International Conference on Parallel Architectures and Compiler Techniques, September 2009. <br>[MO07] T. Moscibroda, O. Mutlu. Memory Performance Attacks: Denial of Memory Service in Multi-core Systems. In 16th USENIX Security Symposium 2007.<br>[NVI08] NVIDIA Corporation. “NVIDIA CUDA Programming Guide”, 2.0 edition, 2008. <br>[OHL+08] J.D. Owens, M. Houston, D. Luebke, S. Green, J.E. Stone, and J.C. Phillips. “GPU Computing”, In Proceedings of the IEEE, vol. 96, pp. 879-899, 2008. <br>[PN98] T. Ptacek and T. Newsham. Insertion, Evasion and Denial of Service: Eluding Network Intrusion Detection. In Secure Networks, Inc., January 1998.<br>[PY08] P. Piyachon, Y. Luo. Design of a High Performance Pattern Matching Engine Through Compact Deterministic Finite Automata. Proceedings of the ACM DAC 2008.<br>[RAM+04] R. Rosner, Y. Almog, M. Moffie, N. Schwartz, A. Mendelson, “Power Awareness through Selective Dynamically Optimized Traces”, in Procs. of 31st International Symposium on Computer Architecture (ISCA-31), 2004 <br>[Ro99] M. Roesch. SNORT - Lightweight Intrusion Detection for Networks. In LISA '99: USENIX 13th Systems Administration Conference 1999.<br>[RWP05] G. Ren, P. Wu, D. Padua. "An Empirical Study on Vectorization of Multimedia Applications for Multimedia Extensions", Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05), 2005. <br>[SBV95] G.S. Sohi, S.E. Breach and T.N. Vijaykumar, “Multiscalar Processors”, in Proc. of the 22nd Int. Symp. on Computer Architecture, pp. 414-425, 1995. 
<br>[SCS+08] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, P. Hanrahan, S. Junkins, A. Lake, J. Sugerman, “Larrabee: A Many-core x86 Architecture for Visual Computing”, In SIGGRAPH ’08: ACM SIGGRAPH 2008 Papers, ACM Press, New York, NY, 2008. <br>[SCZ+02] J. Steffan, C. Colohan, A. Zhai and T. Mowry, “Improving Value Communication for Thread-Level Speculation”, in Proc. of the 8th Int. Symp. on High Performance Computer Architecture, pp. 58-62, 2002. <br>[SEJ06] R. Smith, C. Estan, and S. Jha. Backtracking Algorithmic Complexity Attacks against a NIDS. In ACSAC 2006.<br>[SEJ08] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. IEEE Symposium on Security and Privacy, May 2008.<br>[SFS00] J.E. Smith, G. Faanes, R. Sugumar, “Vector Instruction Set Support for Conditional Operations”, International Symposium on Computer Architecture, pp. 260-269, 2000. <br>[Shi07] J. Shin, “Introducing Control Flow into Vectorized Code”, IEEE, 16th International Conference on Parallel Architecture and Compilation Techniques, 2007. <br>[SN05] J. E. Smith, and R. Nair, “Virtual Machines: Versatile Platforms for Systems and Processes”, Morgan Kaufmann Publishers, 2005 <br>[SPN96] A. Saulsbury, F. Pong and A. Nowatzyk, “Missing the Memory Wall: The Case for Processor/Memory Integration”, In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996. <br>[TEL95] D. Tullsen, S.J. Eggers, H.M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995. <br>[TS05] L. Tan, T. Sherwood. A High Throughput String Matching Architecture for Intrusion Detection and Prevention. Proceedings of ISCA 2005, pages 112-122.<br>[Uht00] A. K. Uht. Achieving typical delays in synchronous systems via timing error toleration. Tech. Rep. Dept. of Electrical and Computer Engineering, No. 
032000-0100, University of Rhode Island. 2000<br>[VSP+09] A. Valero, J. Sahuquillo, S. Petit, V. Lorente, R. Canal, P. Lopez, J. Duato. An hybrid eDRAM/SRAM macrocell to implement first-level data caches. Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), December 2009<br>[WWW07] Quad-Core Intel® Xeon® Processor 5300 Series. August 2007. <br>[WP07] White paper. The Manycore Shift Microsoft: Parallel Computing Initiative Ushers Computing into the Next Era. November 2007. <br>[WWW207] Quad-Core AMD Opteron processors for Server and Workstation, 2007.
 
= Memory Hierarchy and Register File Architecture  =
 
Innovation and technological improvements in processor design have outpaced advances in memory design over the last ten years. The resulting gap between processor and memory speeds has led current high-performance processors to rely on register file and cache organizations to tolerate growing memory latencies [BGK96], [BKG95], [SD95], [SPN96]. These organizations attempt to bridge the gap, but they do so at the expense of large amounts of die area, increased energy consumption and a higher demand for memory bandwidth, which can progressively become a greater limit to high performance.
 
Current processor designs assume that register file access latencies will exceed the latency of functional-unit operations, which implies that it will be difficult to read the register file in a single clock cycle. Today, a single-cycle access is generally assumed [BDA01], [BS03], [MGV99], [STR02]. Alternative designs allow register file latencies greater than one cycle, accepting a small reduction in performance in exchange for significant savings in energy consumption [BM99], [PSR00], [YZG00], [ZYG00]. Current cache organizations, on the other hand, try to strike a good balance between cost and performance. For high production volumes, cost can be equated with chip area, so one way to reduce cost is to reduce area requirements. Furthermore, power dissipation is becoming a critical issue for microprocessors: it determines the cost of the cooling system and may ultimately limit the performance of the microprocessor.
 
We will propose several microarchitectural techniques that can be applied to various parts of current microprocessor designs to improve the memory system and to speed up the execution of instructions. Some techniques will attempt to narrow the gap between processor and memory speeds, while others will attempt to alleviate the serialization caused by data dependences.
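As a back-of-the-envelope illustration of why these cache and register organizations matter, the classic average memory access time (AMAT) model, AMAT = hit time + miss rate × miss penalty, can be sketched as follows. All latencies are assumed round numbers for illustration, not figures from any particular processor:

```python
# Average Memory Access Time (AMAT) sketch: how a cache masks the
# processor/memory speed gap. Latencies below are hypothetical values.

def amat(hit_time, miss_rate, miss_penalty):
    """Classic AMAT model: every access pays the hit time, and a
    fraction miss_rate additionally pays the miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Without a cache, every access pays the full memory latency.
memory_latency = 200  # cycles (assumed)

# With a small L1 cache: 2-cycle hits, 5% of accesses miss to memory.
with_cache = amat(hit_time=2, miss_rate=0.05, miss_penalty=memory_latency)

print(f"no cache: {memory_latency} cycles/access")
print(f"with L1:  {with_cache} cycles/access")  # 2 + 0.05*200 = 12.0
```

Even a modest hit rate collapses the effective latency, which is why the die area and energy spent on these structures have so far been considered worthwhile.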
 
= Speculative Multithreaded Processors  =
 
With the limited performance benefits of frequency scaling and the increasing complexity of single-core superscalar microprocessors, several microprocessor vendors have started migrating to multicore chips. While multicore chips clearly benefit applications with explicit thread-level parallelism, such as server workloads, the performance of single-threaded applications will not improve without novel innovations.
 
Speculative multithreading attempts to fill this void for single-threaded applications. A speculative multithreaded processor logically consists of multiple cores running chunks of a single-threaded application in parallel. Key challenges in this execution paradigm include (1) effectively partitioning the program by speculating on control flow and data dependences, and (2) supporting efficient recovery from misspeculations in order to restore the correct sequential machine state.
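The paradigm can be illustrated with a toy model (not any of the designs cited below): chunks of a loop with a loop-carried dependence execute speculatively using a guessed live-in value, and at commit time, in sequential order, each chunk is validated and re-executed if its guess was wrong. The chunking scheme and the trivial always-zero value predictor are hypothetical illustration choices:

```python
# Toy model of speculative multithreading on a loop with a loop-carried
# dependence (a running sum). Chunks run speculatively with a predicted
# live-in; at commit time the prediction is validated and mis-speculated
# chunks are squashed and re-executed. Purely illustrative.

def run_chunk(values, live_in):
    acc = live_in
    for v in values:
        acc += v
    return acc

def speculative_sum(data, chunk_size, predict=lambda: 0):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Speculative phase: every chunk (except the first) guesses its live-in.
    spec_out = [run_chunk(c, 0 if i == 0 else predict())
                for i, c in enumerate(chunks)]
    # Commit phase (sequential order): validate, squash on mismatch.
    live_in, squashes = 0, 0
    for i, c in enumerate(chunks):
        predicted = 0 if i == 0 else predict()
        if predicted == live_in:
            out = spec_out[i]            # prediction correct: keep result
        else:
            out = run_chunk(c, live_in)  # mis-speculation: re-execute
            squashes += 1
        live_in = out
    return live_in, squashes

total, squashes = speculative_sum(list(range(10)), chunk_size=3)
print(total, squashes)  # 45 3: result is correct, but 3 chunks were squashed
```

A real design would use far better value and control-flow predictors; the point of the sketch is only that correctness is preserved regardless of prediction accuracy, while performance depends on the squash rate.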
 
Speculative multithreading has been an active area of research for the past few years. Early work in this area includes the Multiscalar research at the University of Wisconsin [SBV95], SpMT processors at UPC [MG99][GMS+05], Stampede at CMU [SCZ+02], and the I-ACOMA project at the University of Illinois [CMT00].
 
Our research focuses on extending the state of the art in this area and finding efficient design solutions to the problems, such as code partitioning and inter-thread data dependences, that keep speculative multithreading from being a viable future processor design.
 
= Temperature and Power Consumption Control  =
 
One of the key elements for future processors is temperature and power-consumption control [Bor99]. The high frequencies at which they will operate will not be accompanied by a comparable voltage reduction to keep power under control. Thus, processors will have to implement mechanisms to control dissipated energy as well as temperature. These mechanisms will have to be more or less aggressive depending on the target market segment of the processor (server, desktop, or laptop). The energy consumption of a processor can be divided into two main parts: dynamic and static consumption.
 
Dynamic consumption depends on the silicon technology used, the frequency of the processor, its switching activity and the supply voltage. Advances in silicon technology shrink transistor size, and with it the power dissipated per switching event; on the other hand, the increase in transistor count per chip and the steady increase in frequency make the overall dynamic energy consumption grow with every new processor generation [DB00].
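The well-known first-order model of switching power, P = α · C · V² · f (activity factor, switched capacitance, supply voltage, frequency), makes these dependencies explicit. The sketch below uses normalized, assumed values:

```python
# First-order CMOS dynamic (switching) power model. All values are
# normalized illustration numbers, not data for any real processor.

def dynamic_power(activity, capacitance, voltage, frequency):
    """Dynamic power: P = alpha * C * V^2 * f."""
    return activity * capacitance * voltage**2 * frequency

base = dynamic_power(0.5, 1.0, 1.0, 1.0)    # normalized baseline
scaled = dynamic_power(0.5, 1.0, 0.8, 0.8)  # V and f both reduced by 20%
print(scaled / base)  # 0.8**3 = 0.512: roughly cubic benefit
```

Because voltage enters quadratically and also constrains frequency, scaling both together yields a roughly cubic power reduction, which is the basis of the dynamic voltage and frequency scaling techniques discussed below.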
 
The main component of static consumption is leakage power. Until recently, static energy consumption was a minor part of the overall energy budget. However, static consumption has an exponential relationship with the threshold voltage. To reduce dynamic energy consumption, the supply voltage decreases from one generation to the next, and to maintain or increase frequency the threshold voltage has to decrease too, which makes static energy consumption grow exponentially. Static energy consumption is expected to reach around 50% of overall energy consumption in future generations (in less than a decade) [Bor99][DB00].
 
Finally, power density increases with each generation, mainly due to higher frequencies and leakage currents. Power density translates directly into heat, and this heat has to be dissipated somehow. In fact, the cost of dissipating the processor's heat grows in the same proportion as the power density: it is predicted that above 40W, each additional watt costs between 1 and 3 dollars to dissipate [GBC+01]. A drastic increase in temperature in one area of the processor may cause a transient failure or even an unrecoverable error. Furthermore, the static power consumption due to leakage currents has an exponential relationship with temperature; thus, an increase in temperature implies an increase in power consumption, which in turn increases temperature, resulting in a dangerous feedback loop.
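This feedback loop can be caricatured with a tiny fixed-point iteration: leakage grows exponentially with temperature, and temperature grows with total power through a thermal resistance. All constants below are invented for illustration, not physical data:

```python
import math

# Toy model of the leakage/temperature feedback loop. Dissipated power
# raises temperature through a thermal resistance; leakage grows
# exponentially with temperature, raising power again. Constants are
# illustrative, not physical.

def settle(p_dyn, t_ambient=45.0, r_thermal=0.5, leak0=10.0, k=0.02,
           t_max=125.0, iters=100):
    t = t_ambient
    for _ in range(iters):
        leakage = leak0 * math.exp(k * (t - t_ambient))
        t_new = t_ambient + r_thermal * (p_dyn + leakage)
        if t_new > t_max:
            return None          # thermal runaway: no safe operating point
        if abs(t_new - t) < 1e-6:
            return t_new         # converged to a stable temperature
        t = t_new
    return t

print(settle(p_dyn=40.0))   # moderate power: settles to a stable temperature
print(settle(p_dyn=120.0))  # high power: runaway, returns None
```

Below some power level the loop settles to a stable temperature; above it, the iteration escalates past the thermal limit, which is the runaway scenario described above.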
 
Nowadays, techniques that try to reduce dynamic consumption focus on reducing the activity of the processor when maximum throughput is not needed (turning off unused units or changing the frequency and voltage of the processor [PKG00][CG01][CGS00][SAD+02]). To reduce static energy consumption, the proposed techniques completely shut down zones of the cache memory or implement circuits with different frequencies or threshold voltages [FKM+00][KC00][KMN+01].
 
The topic of temperature reduction is very new; the techniques proposed so far focus on reducing the number of times a thermal emergency is detected. Every time an emergency is detected, the OS takes control of the processor and reduces its frequency until the processor is cool enough to resume the normal operating frequency, thus incurring a performance penalty. The proposed techniques try to avoid this situation through strict control of processor activity when it is close to the temperature limit [HB04][SAS02][SSH+03].
 
The group has already done some work on value compression for power reduction [CG00] [CG01] [CGS00] [CGS04] and also on evaluation of multicore architectures [MCG06].
 
= Variations  =
 
Circuit behavior is no longer deterministic in current and future technologies due to variations. Limitations in the fabrication process introduce indeterminism in the geometry (length, width and thickness) of transistors, and hence in their behavior. Moreover, voltage and temperature oscillate, and circuit inputs change. Thus, the delay and power of circuits change dynamically, yet circuits must work in any feasible scenario. This issue is usually addressed by assuming the worst-case scenario to ensure circuit functionality, but such an assumption is pessimistic most of the time and very inefficient in terms of power and delay.
 
The aim of our research is to find solutions that make circuits more efficient in the presence of variations. The result will be a set of strategies, techniques and circuit designs that improve the performance and power of such circuits by adapting their operation to the common case instead of always assuming worst-case conditions.
 
Some techniques have already been proposed to exploit variations. TIMERRTOL [Uht00] and Razor [EKD+03] use results after the common-case delay and check their correctness after the worst-case delay. Input variations are also exploited by means of narrow values [BM99], which can be operated on with shorter latencies.
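A back-of-the-envelope model shows why Razor-style common-case clocking can pay off: the expected time per operation is the common-case delay plus the error rate times the replay penalty, to be compared against always clocking at the worst-case delay. The numbers below are assumptions for illustration only:

```python
# Expected-cycle-time model of timing speculation: clock at the
# common-case delay and pay a replay penalty on rare timing errors,
# instead of always clocking at the worst-case delay. Numbers assumed.

def avg_cycle_time(common_delay, error_rate, replay_penalty):
    """Expected time per operation when speculating on the common case."""
    return common_delay + error_rate * replay_penalty

worst_case = 1.00                                 # ns, conservative period
speculative = avg_cycle_time(common_delay=0.70,   # ns, common-case period
                             error_rate=0.01,     # 1% of ops violate timing
                             replay_penalty=5.0)  # ns per recovery
print(speculative)  # 0.70 + 0.01*5.0 = 0.75 ns, beating the 1.00 ns clock
```

The approach wins only while errors stay rare; past a break-even error rate the replay penalty erases the gain, which is why these schemes monitor the error rate and back off the clock when it grows.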
 
The main groups working in the area of variations are the Department of Electrical and Computer Engineering at the University of Rhode Island, the Department of Electrical Engineering at Princeton University, the Division of Engineering and Applied Sciences at Harvard University, and the Department of Electrical Engineering and Computer Science at the University of Michigan.
 
= References  =
 
[AHK00] V. Agarwal, M.S. Hrishikesh, S.W. Keckler and D. Burger. "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures". In Proc. of the 27th Ann. Int. Symp. on Computer Architecture, June 2000.  
 
[ACGK03a] A. Aletà, J.M. Codina, A. González and D. Kaeli, "Instruction Replication for Clustered Microarchitectures", in Procs. of 36th Int. Symp. on Microarchitecture (MICRO-36), Dec. 2003.
 
[ACGK05] A. Aletà, J.M. Codina, A. González and D. Kaeli. “Demystifying On-the-fly Spill Code”, in Procs. of the Conf. on Programming Language Design and Implementation, 2005.  
 
[ACSG01] A. Aletà, J.M. Codina, J. Sánchez and A. González. "Graph-Partitioning Based Instruction Scheduling for Clustered Processors", in Proc. of 34th Int. Symp. On Microarchitecture, Dec 2001.  
 
[ACS+02] A. Aletà, J.M. Codina, J. Sánchez, A. González and D. Kaeli. "Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning", in Proc. of the Int. Conf. on Parallel Architectures and Compilation Techniques (PACT'02), Sept 2002.  
 
[JKW83] J. R. Allen, K. Kennedy, C. P. Warren. “Conversion of Control Dependence to Data Dependence”. POPL'83: Proceedings of 10th ACM SIGACT-SIGPLAN symposium on Principles of Programming Languages, pages 177-189.  
 
[AGS+99] E. Altman, M. Gschwind, S. Sathaye, S. Kosonocky, A. Bright, J. Fritts, P. Ledak, D. Appenzeller, C. Agricola, Z. Filan, “BOA: The Architecture of a Binary Translation Engine”, IBM Research Report RC 21665 (97500), 1999
 
[ACGH97] D. August, D. Connors, J. Gyllenhaal, W. M. Hwu, “Architectural Support for Compiler-Synthesized Dynamic Branch Prediction Strategies: Rationale and Initial Results”. In HPCA '97: Proceedings of the 3rd International Symposium on High-Performance Computer Architecture, pages 84-93, 1997.  
 
[AHM97] D. August, W. M. Hwu, S. A. Mahlke. “A Framework for Balancing Control Flow and Predication”. In MICRO 30: International Symposium on Microarchitecture, pages 92-103, 1997.
 
[BS03] S. Balakrishnan and G. S. Sohi. "Exploiting Value Locality in Physical Register Files", Proceedings of the 36th International Symposium on Microarchitecture, 2003
 
[B04] R. Balasubramonian, "Cluster prefetch: tolerating on-chip wire delays in clustered microarchitectures". 18th Annual International Conference on Supercomputing (ICS), pp. 326-335, June, 2004.  
 
[BDA01] R. Balasubramonian, S. Dwarkadas, and D. Albonesi. "Reducing the complexity of the register file in dynamic superscalar processors". In Proc. of the 34th Annual Intl. Symp. On Microarchitecture, pages 237-248, 2001.  
 
[BDA03] R. Balasubramonian, S. Dwarkadas and D. Albonesi. “Dynamically Managing the Communication-Parallelism Trade-off in Future Clustered Processors”. In Proc. of the 30th. Ann. Intl. Symp. on Computer Architecture, pp. 275-286, June 2003.  
 
[Bor99] S. Borkar. “Design Challenges of Technology Scaling”. IEEE Micro, 19(4), 1999.  
 
[BTM00] D. Brooks, V. Tiwari and M. Martonosi, “Wattch: a framework for architectural-level power analysis and optimizations”, 27th Annual International Symposium on Computer Architecture, pp. 83-94, 2000.  
 
[BGK96] D. Burger, J. R. Goodman, and A. Kägi, “Quantifying Memory Bandwidth Limitations of Current and Future Microprocessors”, In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.  
 
[BKG95] D. Burger, A. Kägi and J. R. Goodman, “The Declining Effectiveness of Dynamic Caching for General Purpose Microprocessors”, Technical Report 1261, Computer Sciences Department, University of Wisconsin, Madison, WI, January 1995.  
 
[BM99] D. Brooks and M. Martonosi, “Dynamically Exploiting Narrow Width Operands to Improve Processor Power and Performance”, In Proceedings of the 5th International Symposium on High-Performance Computer Architecture, January 1999.  
 
[CG00] R. Canal, A. González. “A Low Complexity Issue Logic”. Proceedings of the 2000 International Conference on Supercomputing, June 2000.  

[CG01] R. Canal, A. González. “Reducing the Complexity of the Issue Logic”. Proceedings of the 2001 International Conference on Supercomputing, pp. 312-320, June 2001.  
 
[CGS00] R. Canal, A. González and J.E. Smith, “Very Low Power Pipelines using Significance Compression”, in Proceedings of the 33rd International Symposium on Microarchitecture, pp. 181-190, December 2000.
 
[CGS04] R. Canal, A. González and J. E. Smith, "Software- Controlled Operand Gating", Proc. of the International Symposium on Code Generation and Optimization (CGO-2), Palo Alto (CA-USA), pp. 125-136, March 2004
 
[CDN92] A. Capitanio, N. Dutt and A. Nicolau, "Partitioned Register Files for VLIWs: A Preliminary Analysis of Tradeoffs", in Procs. of 25th. Int. Symp. on Microarchitecture, pp. 292-300, 1992.  
 
[CHPC95] P. Chang, E. Hao, Y. Patt, P. Chang. “Using Predicate Execution to Improve the Performance of a Dynamically Scheduled Machine with Speculative Execution”. In PACT '95: Proceedings of the IFIP WG10.3 working conference on Parallel Architecture and Compilation Techniques, pages 99-108, UK, 1995.  
 
[CFM03] M. Chu, K. Fan and S. Mahlke, "Region-based Hierarchical Operation Partitioning for Multicluster Processors", in Procs. of the Conf. on Programming Language Design and Implementation, 2003.  
 
[CC03] W. Chuang, B. Calder. “Predicate Prediction for Efficient Out-of-Order Execution”. In ICS'03: Proceedings of the 17th annual international conference on Supercomputing, pages 183-192, 2003.  
 
[CMT00] M. Cintra, J.F. Martinez and J. Torrellas, “Architectural Support for Scalable Speculative Parallelization in Shared-Memory Systems”, in Proc. of the 27th Int. Symp. on Computer Architecture, 2000.  
 
[CSG01] J.M. Codina, J. Sánchez and A. González, "A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors", in Procs. of Int. Conf. on Parallel Architectures and Compilation Techniques (PACT'01), Sept. 2001.
 
[DB00] V. De and S. Borkar. “Technology and Design Challenges for Low Power and High Performance”. Proceedings of the International Symposium on Low Power Electronics Design, 2000.  
 
[D98] G. Desoli, "Instruction Assignment for Clustered VLIW DSP Compilers: A New Approach", Technical Report HPL-98-13, HP Laboratories, February 1998.  
 
[DM82] D. R. Ditzel, H. R. McLellan. “Register Allocation for Free: The C Machine Stack Cache”. In Proceeding of Symposium on Architectural Support for Programming Languages and Operating Systems, pages 48-56, 1982
 
[EA96] K. Ebcioglu, E. R. Altman, “DAISY: Dynamic Compilation for 100% Architectural Compatibility”, IBM Research Report RC 20538, 1996.  
 
[E86] R. Ellis, "Bulldog: A Compiler for VLIW Architectures", MIT Press, pp. 180-184, 1986.  
 
[EKD+03] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, T. Mudge. Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation. Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003.  
 
[FBC+01] B. Fahs, S. Bose, M. Crum, B. Slechta, F. Spadini, T. Tung, S. Patel, S. Lumetta, “Performance Characterization of a Hardware Mechanism for Dynamic Optimization”, in Procs. of 34th International Symposium on Microarchitecture (MICRO-34), 2001
 
[FCJ97] K. Farkas, P. Chow, N.P. Jouppi, and Z. Vranesic, “The Multicluster Architecture: Reducing Cycle Time Through Partitioning”, In Proc. of the 30th. Int. Symp. on Microarchitecture, pp. 149-159, Dec. 1997.
 
[FKM+00] K. Flautner, N.S. Kim, S. Martin, D. Blaauw and T. Mudge. “Drowsy Caches: Simple Techniques for Reducing Leakage Power”. Proceedings of the International Symposium on Computer Architecture, 2002.
 
[FS96] M. Franklin, G.S. Sohi, "ARB: A Hardware Mechanism for Dynamic Reordering of Memory References". IEEE Trans. Computers 45(5), pp. 552-571, May, 1996.
 
[GMS+05] C. García, C. Madriles, J. Sánchez, P. Marcuello, A. González and D.M. Tullsen, “Mitosis Compiler: An Infrastructure for Speculative Threading Based on Pre-computation Slices”, in Proc. of the Int. Conf. on Programming Language Design and Implementation, pp. 269-278, 2005
 
[GSG02a] E. Gibert, J. Sánchez and A. González, "An Interleaved Cache Clustered VLIW Processor", in Procs. of 16th Int. Conf. on Supercomputing, June 2002.
 
[GSG02b] E. Gibert, J. Sánchez and A. González, "Effective Instruction Scheduling Techniques for an Interleaved Cache Clustered VLIW Processor", in Procs. of 35th Int. Symp. on Microarchitecture, December 2002.
 
[GSG03a] E. Gibert, J. Sánchez and A. González, "Local Scheduling Techniques for Memory Coherence in a Clustered VLIW Processor with a Distributed Data Cache", in Procs. of 1st Int. Symp. on Code Generation and Optimization (CGO'03), March 2003.
 
[GSG03b] E. Gibert, J. Sánchez and A. González, "Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors", in Procs. of 36th Int. Symp. on Microarchitecture (MICRO-36), Dec. 2003.
 
[GBC+01] S. Gunther, F. Binns, D. M. Carmean and J.C. Hall. “Managing the Impact of Increasing Microprocessor Power Consumption”. Intel Technology Journal, Q1, 2001.
 
[HB04] K. Hazelwood and D. Brooks, “Eliminating Voltage Emergencies via Microarchitectural Voltage Control Feedback and Dynamic Optimization”, Proceedings of the International Symposium on Low-Power Electronics and Design, pp. 326-331, August 2004.
 
[HL99] T. Horel, G. Lauterbach. “UltraSparc-III: Designing Third-Generation 64-Bit Performance”. IEEE MICRO. May – June 1999.
 
[HL91] M. Huguet, T. Lang. “Architectural Support for Reduced Register Saving/Restoring in Single-Window Register Files”. ACM Trans. Computer Systems, 9(1):66 – 67. Feb. 1991.
 
[JCSK98] S. Jang, S. Carr, P. Sweany and D. Kuras, "A Code Generation Framework for VLIW Architectures with Partitioned Register Banks", in Procs. of 3rd. Int. Conf. on Massively Parallel Computing Systems, April 1998.
 
[KEA01] K. Kailas, K. Ebcioglu and A. Agrawala, "CARS: A New Code Generation Framework for Clustered ILP Processors", in Procs. of the 7th Int. Symp. on High-Performance Computer Architecture, Jan. 2001.
 
[KC00] J.T. Kao and A. P. Chandrakasan. “Dual-Threshold Voltage Techniques for Low-Power Digital Circuits”. IEEE Journal of Solid State Circuits, 37(5), 2000.
 
[KF96] G.A. Kemp, and M.Franklin. “PEWs: A Decentralized Dynamic Scheduler for ILP Processing". In Proc. of Int. Conf. on Parallel Processing, pp. 239-246, August 1996.
 
[KMN+01] A. Keshavarzi, S. Ma, S. Naredra, B. Bloechel, K. Mistry, T. Ghani, S. Borkar and V. De. “Effectiveness of Reverse Body Bias for Leakage Control in Scaled Dual Vt CMOS ICs.” Proc. of the International Symposium on Low Power Electronics Design, 2001.
 
[K99] R.E. Kessler. "The Alpha 21264 Microprocessor”. IEEE Micro, 19(2):24-36, 1999.
 
[KMSP05] H. Kim, O. Mutlu, J. Stark, Y. N. Patt. “Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution”. In MICRO 38: International Symposium on Microarchitecture, pages 43-54, 2005.
 
[KS02] H-S.Kim and J.E.Smith. “An Instruction Set and Microarchitecture for Instruction Level Distributed Processing”. In Proc. of the 29th Ann. Intl. Symp. on Computer Architecture, 2002.
 
[Kla00] A. Klaiber, “The Technology Behind the CrusoeTM Processors”, white paper, http://www.transmeta.com/pdfs/paper_aklaiber_19jan00.pdf, January 2000
 
[LPSA02] W. Lee, D. Puppin, S. Swenson, S. Amarasinghe, "Convergent Scheduling", in Procs. of 35th Int. Symp. on Microarchitecture, December 2002.
 
[MBG+94] S. A. Mahlke, R.H. Bringmann, J. Gyllenhaal, D. Gallagher, W. M. Hwu. “Characterizing the impact of predicated execution on branch prediction”. In MICRO 27: Proceedings of the 27th annual ACM/IEEE international symposium on Microarchitecture, pages 217-227, 1994.
 
[MG99] P. Marcuello and A. González, “Clustered Speculative Multithreaded Processors”, in Proc. of the 13th Int. Conf. on Supercomputing, pp. 365-372, 1999.
 
[MD03] C. McNairy, D. Soltis. “Itanium 2 Processor Microarchitecture”. IEEE MICRO, pp. 44-55. March-April 2003.
 
[MCG06] M. Monchiero, R. Canal and A. González , " Design Space Exploration for Multicore Architectures: A Power/Performance/Thermal View", 20th ACM International Conference on Supercomputing (ICS'06), Cairns (Australia)
 
[MGV99] T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez and V. Viñals, "Delaying Physical Register Allocation Through Virtual-Physical Registers", Proceedings of the 32nd International Symposium on Microarchitecture, 1999
 
[NSB01] R. Nagarajan, K. Sankaralingam, D. Burger, S.W. Keckler. “A Design Space Evaluation of Grid Processor Architectures”. Proc. 34th International Symposium on Microarchitecture. 2001.
 
[ND95] R. Nuth, W. J. Dally. “The Named-State Register File: Implementation and Performance”. In proceedings of international symposium on High Performance Computer Architecture, pages 4 – 13, 1995
 
[NE98] E. Nystrom and A.E. Eichenberger, "Effective Cluster Assignment for Modulo Scheduling", in Procs. of the 31st Int. Symp. on Microarchitecture, pp. 103-114, 1998.
 
[OBMR05] D. W. Oehmke, N. L. Binkert, T. Mudge, S. K. Reinhardt, “How to Fake 1000 Registers”. In MICRO 38: International Symposium on Microarchitecture, 2005.
 
[OBC98] E. Özer, S. Banerjia, T. Conte, "Unified Assign and Schedule: A New Approach to Scheduling for Clustered Register File Microarchitectures", in Procs. of the 31st Int. Symp. on Microarchitecture, 1998.
 
[PJS97] A.S. Palacharla, N.P. Jouppi, and J.E. Smith, “Complexity-Effective Superscalar Processors”, In Proc. of the 24th. Int. Symp. on Computer Architecture, pp. 206-218, June 1997.
 
[PKG00] D. Ponomarev, G. Kucuk, K. Ghose. “Reducing Power Requirements of Instruction Scheduling through Dynamic Allocation of Multiple Datapath Resources”. Proc of International. Symposium on Microarchitecture, 2000.
 
[PSR00] Z. Purser, K. Sundaramoorthy, and E. Rotenberg. "A Study of Slipstream Processors". In Proceedings of the 33rd International Symposium on Microarchitecture, December 2000.
 
[QPG06] E. Quiñones, J.M. Parcerisa, A. Gonzalez. “Selective Predicate Prediction for Out-of-Order Processors”. In ICS'06: Proceedings of the 20th annual international conference on Supercomputing, pages 46 – 66. 2006
 
[QPG07] E. Quiñones, J.M. Parcerisa, A. Gonzalez. “Improving Branch Prediction and Predicated Execution in Out-of-Order Processors”. In HPCA '07: Proceedings of the 13th international Symposium on High-Performance Computer Architecture, 2007.
 
[RP03] P. Racunas, Yale N. Patt, "Partitioned first-level cache design for clustered microarchitectures". 17th annual international conference on Supercomputing (ICS), June, 2003.
 
[RAM+04] R. Rosner, Y. Almog, M. Moffie, N. Schwartz, A. Mendelson, “Power Awareness through Selective Dynamically Optimized Traces”, in Procs. of 31st International Symposium on Computer Architecture (ISCA-31), 2004
 
[RJS97] E. Rotenberg, Q. Jacobson, Y. Sazeides, J. Smith. “Trace Processors”. Proc. 30th International Symposium on Microarchitecture. 1997.
 
[SG00a] J. Sánchez and A. González, "The Effectiveness of Loop Unrolling for Modulo Scheduling in Clustered VLIW Architectures", in Procs. of the 29th Int. Conf. on Parallel Processing, Aug. 2000.
 
[SG00b] J. Sánchez, and A. González, "Modulo Scheduling for a Fully-Distributed Clustered VLIW Architecture", in Procs. of 33rd Int. Symp. on Microarchitecture, Dec. 2000.
 
[SNM+06] K. Sankaralingam, R. Nagarajan, R. McDonald, et al. "Distributed Microarchitectural Protocols in the TRIPS Prototype Processor". 39th International Symposium on Microarchitecture (MICRO), December, 2006.
 
[SPN96] A. Saulsbury, F. Pong and A. Nowatzyk, “Missing the Memory Wall: The Case for Processor/Memory Integration”, In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.
 
[SAD+02] G. Semeraro, D. H. Albonesi, S.G.Dropsho, G. Magklis, S. Dwarkadas and M.L. Scott. “Dynamic Frequency and Voltage Control for a Multiple Clock Domain Microarchitecture”. Proc. Int. Symposium on Microarchitecture, 2002.
 
[SIA97] Semiconductor Industry Association, “The National Technology Roadmap for Semiconductors”, 1997.
 
[STR02] A. Seznec, E. Toullec and O. Rochecouste, "Register Write Specialization Register Read Specialization: A Path to Complexity-Effective Wide-Issue Superscalar Processors", Proceedings of the 35th International Symposium on Microarchitecture, 2002.
 
[SCF03] B. Simon, B. Calder, J. Ferrante. “Incorporating Predicate Information into Branch Predictors”. In HPCA '03: Proceedings of the 9th international Symposium on High-Performance Computer Architecture, 2003.
 
[Sit79] R. L. Sites. “How to Use 1000 Registers”. In Caltech Conference on VLSI, pages 527-532. 1979
 
[SAS02] K. Skadron, T. Abdelzahe amd M. R. Stan. “Control-Theoretic Techniques and Thermal-Rc Modeling for Accurate and Localized Dynamic Thermal Management”. Proceedings of the International Symposium on High Performance Computing, 2002.
 
[SSH+03] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan and D. Tarjan. “Temperature-Aware Microarchitecture”. Proceedings of the International Symposium on Computer Architecture, pp. 2-13 2003.
 
[SN05] J. E. Smith, and R. Nair, “Virtual Machines: Versatile Platforms for Systems and Processes”, Morgan Kaufmann Publishers, 2005
 
[SBV95] G.S. Sohi, S.E. Breach and T.N. Vijaykumar, “Multiscalar Processors”, in Proc. of the 22nd Int. Symp. on Computer Architecture, pp.414-425, 1995.
 
[SCZ+02] J. Steffan, C. Colohan, A. Zhai and T. Mowry, “Improving Value Communication for Thread-Level Speculation”, in Proc. of the 8th Int. Symp. on High Performance Computer Architecture, pp. 58-62, 2002.
 
[SD95] C. L. Su and A. M. Despain, “Cache Design Tradeoffs for Power and Performance Optimization: A Case Study”, In Proceedings of the International Symposium on Low Power Electronics and Design, April 1995.
 
[Uht00] Uht, A. K. Achieving typical delays in synchronous systems via timing error toleration. Tech. Rep. Dept. of Electrical and Computer Engineering, No. 032000-0100, University of Rhode Island. 2000
 
[WTS+97] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal, "Baring it all to Software: Raw Machines", IEEE Computer, pp. 86-93, September 1997.
 
[WWK+01] P. H. Wang, H. Wang, R. M. Klim, K. Ramakrishnan, J. P. Shen. “Register Renaming and Scheduling for Dynamic Execution of Predicated Code”. In HPCA '01: Proceedings of the 7th international Symposium on High-Performance Computer Architecture, page 15, 2001.
 
[YZG00] J. Yang, Y. Zhang, and R. Gupta, “Frequent Value Compression in Data Caches”, In Proceedings of the 33rd International Symposium on Microarchitecture, December 2000.
 
[YMR+99] A. Yoaz, E. Mattan, R. Ronen, and S. Jourdan., “Speculation Techniques for Improving Load Related Instruction Scheduling,“ in Proc. of 26th ISCA, pp. 42-53, May 1999.
 
[ZLAV01] J. Zalamea, J. Llosa, E. Ayguadé, and M. Valero, "Modulo Scheduling with integrated register spilling for Clustered VLIW Architectures," Proc. 34th Ann. Int'l Symp. on Microarchitecture (MICRO-34), December 2001.
 
[ZYG00] Y. Zhang, J. Yang, and R. Gupta, “Frequent Value Locality and Value-Centric Data Cache Design”, In Proceedings of the 33rd International Conference on Architectural Support for Programming Languages and Operating Systems, November 2000.
 
[ZK01] V. Zyuban and P.M. Kogge, “Inherently Lower-Power High-Performance Superscalar Architectures,“ IEEE Trans. on Computers, vol. 50, no. 3, pp. 268-285, March 2001.
 
<br>

Revision as of 19:49, 19 November 2010

Characterization and Acceleration of Emerging Applications

GPUs were originally specialized hardware designed to accelerate rendering and display, but they are moving in the direction of general-purpose accelerators [BoB09]. GPU vendors have recently introduced new programming models and associated hardware support to broaden the class of non-graphics applications that can efficiently use GPU hardware [NVI08][OHL+08]. There is a large, emerging and commercially relevant class of applications enabled by the significant increase in GPU computing density, such as graphics and physics for gaming, interactive simulation, data analysis, scientific computing, 3D modeling for CAD, signal processing, digital content creation, and financial analytics. The PARSEC benchmark suite [BKS+08] is a good proxy for this kind of application. Applications in these domains benefit from architectural approaches that provide higher performance through parallelism.
GPU capabilities excel for applications that exhibit extensive data parallelism. GPUs typically operate on a large number of data points, where the same operation is simultaneously applied to all of them in the form of vectors or streams. Furthermore, to exploit data-level parallelism, modern GPUs batch together groups of individual threads (called warps) running the same shader program and execute them in lockstep on a SIMD pipeline [LuH07][LNO+08]. However, even with a general-purpose programming interface, mapping existing applications to the parallel architecture of a GPU is a non-trivial task.
Vectorization is an optimization technique that has traditionally targeted vector processors. The importance of this optimization has increased in recent years with the introduction of SIMD extensions such as Intel's SSE or IBM/Motorola's AltiVec to general-purpose processors, and with the growing significance of applications that can benefit from this functionality. However, achieving high performance on modern architectures requires efficient utilization of SIMD units. This requires that algorithms be able to take full advantage of the SIMD width offered and not waste SIMD instructions on low-utilization cases. Both Intel SSE and PowerPC AltiVec expose a relatively small SIMD width of four. It is often complicated to apply vectorization techniques to architectures with such SIMD extensions because these extensions are largely non-uniform, supporting specialized functionality and a limited set of data types. Vectorization is often further impeded by the SIMD memory architecture, which typically provides access to contiguous memory items only, often with additional alignment restrictions. Computations, on the other hand, may access data elements in an order that is neither contiguous nor adequately aligned. Bridging this gap efficiently requires careful use of special mechanisms, including permute, pack/unpack and other instructions, that incur additional performance penalties and complexity.
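The utilization problem described above can be sketched with a toy model of a 4-wide SIMD loop: trip counts that are not a multiple of the SIMD width leave a scalar epilogue whose iterations get no SIMD benefit. The width and data are illustrative, not tied to any real intrinsics.

```python
# Toy model of a 4-wide SIMD add: the loop body processes 4 lanes per
# "vector iteration", and a scalar epilogue handles trailing elements --
# the low-utilization case mentioned in the text.

WIDTH = 4  # SSE and AltiVec expose 4 x 32-bit lanes

def simd_add(a, b):
    n = len(a)
    out = [0] * n
    i = 0
    while i + WIDTH <= n:          # full vector iterations
        out[i:i + WIDTH] = [x + y for x, y in zip(a[i:i + WIDTH], b[i:i + WIDTH])]
        i += WIDTH
    while i < n:                   # scalar epilogue for the remainder
        out[i] = a[i] + b[i]
        i += 1
    return out

print(simd_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # [11, 22, 33, 44, 55]
```

With five elements, one vector iteration covers four lanes and the fifth element falls to the scalar loop, wasting the SIMD unit for that iteration.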
However, given the small cost and potentially high benefit of increasing the SIMD width, it seems likely that future architectures will explore larger SIMD widths, as Nvidia's Fermi and Intel's Larrabee [SCS+08] already do. Larrabee greatly increases the flexibility and programmability of the architecture compared to standard GPUs. Its approach is based on extending each CPU core with a wide vector unit featuring scatter-gather capability, as well as predicated execution support. On the other hand, available compilers have limitations that prevent loop vectorization, such as control flow, non-contiguous and irregular data access, data dependences, nested loops and an undefined number of loop iterations [RWP05], which are present in most of the main loops of emerging applications. Some works address the control-flow problem [SFS00][Shi07], as well as irregular data access [ChS08].
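The control-flow obstacle above is usually handled by if-conversion: both arms of a branch are evaluated across all lanes and a boolean mask selects per-lane results, in the spirit of the conditional-operation support of [SFS00] and the predicated execution mentioned for Larrabee. A toy sketch with plain lists, not real vector instructions:

```python
# If-conversion sketch: instead of branching per element, compute a mask
# with a vector compare and blend the two arms. Function names and data
# are illustrative only.

def vector_select(cond, then_vals, else_vals):
    # blend: lane i takes then_vals[i] where cond[i] holds
    return [t if c else e for c, t, e in zip(cond, then_vals, else_vals)]

def saturate(xs, limit):
    # scalar loop being vectorized: out = limit if x > limit else x
    mask = [x > limit for x in xs]            # vector compare -> mask
    return vector_select(mask, [limit] * len(xs), xs)

print(saturate([3, 9, 1, 12], 8))  # [3, 8, 1, 8]
```

Note the cost model this implies: both arms execute for every lane, so heavily divergent branches waste work even though the branch itself disappears.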

Intrusion Detection Systems

Computing systems today operate in an environment of seamless connectivity, with attacks being continuously created and propagated through the Internet. There is a clear need to provide an additional layer of security in routers. Intrusion Detection Systems (IDS) are emerging as one of the most promising ways of protecting systems on the network against suspicious activities. By monitoring traffic in real time, an IDS can detect, and also take preventive actions against, suspicious activities. Network-based IDS have proven an effective and efficient defense for systems on the network, and they are usually deployed in routers.
The deployment of network-based IDS in routers poses a very interesting challenge. With line rates doubling every 24 months, especially in backbone routers, IDS performance needs to keep pace. For example, an IDS deployed in a state-of-the-art backbone router inspects packets streaming at 40 Gbps and scans these packets for more than 23,000 attack signatures. This is clearly a tremendous performance challenge, so performance is a key factor for the efficient functioning of an IDS.
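The 40 Gbps figure translates into a very tight per-byte budget, which a quick back-of-the-envelope calculation makes concrete:

```python
# Per-byte time budget at the 40 Gbps line rate quoted above.
line_rate_bps = 40e9                   # 40 Gbps
bytes_per_sec = line_rate_bps / 8
ns_per_byte = 1e9 / bytes_per_sec
print(f"{ns_per_byte:.2f} ns per byte")  # 0.20 ns per byte
```

At 0.2 ns per byte, a multi-GHz core has on the order of one cycle to process each input byte against all signatures, which is why per-byte pattern-matching cost dominates IDS design.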
An IDS detects attacks by scanning packets for attack patterns, performing multiple-pattern matching. Patterns can be expressed either as fixed strings or as regular expressions. The Aho-Corasick algorithm [AC71] is commonly used by IDS [Ro99] for fixed-string matching. In the Aho-Corasick algorithm, a finite state machine (FSM) is constructed from the attack signatures and subsequently traversed using bytes from packets. The main advantage of the Aho-Corasick algorithm is that it runs in time linear in the input bytes regardless of the number of attack signatures. The main disadvantage, however, lies in devising a practical implementation, due to the large memory needed to store the FSM. Hence, one of the primary areas of focus in the IDS research community is devising a performance- and area-efficient architecture for the Aho-Corasick algorithm.
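The construct-then-traverse structure described above can be shown in a minimal Aho-Corasick sketch: a goto/failure/output FSM is built from the patterns, and the scan then takes one transition per input byte regardless of how many patterns were compiled in. This is an educational sketch; real IDS engines compress these tables precisely because of the memory problem the text mentions.

```python
from collections import deque

def build(patterns):
    # goto/fail/output tables of the Aho-Corasick FSM
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    q = deque(goto[0].values())
    while q:                                  # BFS fills failure links
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]            # inherit matches via failure link
    return goto, fail, out

def scan(text, fsm):
    # linear in len(text), independent of the number of patterns
    goto, fail, out = fsm
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits += sorted((i, p) for p in out[s])
    return hits

fsm = build(["he", "she", "his", "hers"])
print(scan("ushers", fsm))  # [(3, 'he'), (3, 'she'), (5, 'hers')]
```

Even this toy version makes the area problem visible: every state carries a transition map over the byte alphabet, so the FSM footprint grows with the signature set rather than with the traffic.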
IDS also increasingly use regular expressions to specify attack signatures, owing to their rich expressive power. In order for regular expressions to be matched, they are first converted to finite automata (deterministic or non-deterministic), and these automata are then traversed using bytes from packets. However, these automata are either inefficient with respect to chip area (in the case of deterministic finite automata) or inefficient in performance (in the case of non-deterministic finite automata).
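The NFA side of that area/performance trade-off is easy to sketch: simulating an NFA keeps a *set* of active states and advances the whole set per input byte, so per-byte cost grows with the automaton size, whereas a DFA precomputes every such state set at the cost of potentially exponential storage. The automaton below is hand-built for the illustrative pattern a(b|c)*d.

```python
# Tiny hand-built NFA for the pattern a(b|c)*d, simulated the textbook way.
NFA = {                     # state -> {input byte: set of next states}
    0: {"a": {1}},
    1: {"b": {1}, "c": {1}, "d": {2}},
    2: {},                  # accepting state
}
ACCEPT = {2}

def nfa_match(s):
    active = {0}            # advance the whole set of active states per byte
    for ch in s:
        active = set().union(*[NFA[q].get(ch, set()) for q in active])
    return bool(active & ACCEPT)

print(nfa_match("abcbd"), nfa_match("abx"))  # True False
```

A DFA built by subset construction would replace the inner set union with a single table lookup per byte, which is exactly the storage-for-speed trade discussed above.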
A key requirement for the effectiveness of an IDS is that it has to process packets at the rate they are streaming. The consequences of not doing so are either undetected malicious packets or expensive packet drops. An adversary can also deliberately drive the IDS into this state of not being able to process packets at wire speed. Such attempts are commonly referred to as evasion [CW03, PN98], and they stem from weaknesses in some part of IDS processing. The nature and ease of evasion make it very appealing for malicious hosts to try to bypass the IDS.
There have been numerous works in the area of improving the performance and area efficiency of pattern matching algorithms (fixed strings and regular expressions). [TS05, PY08] propose various novel techniques to significantly improve the performance and area efficiency of pattern matching in IDS. In the area of regular expression matching, numerous works [HST09, SEJ08] have studied and proposed improvements to DFA storage and DFA traversal. Additionally, [BP05] have proposed techniques for NFAs using reconfigurable hardware. [CW03, SEJ06] have studied various sophisticated attacks against IDS and secure defense mechanisms. [CGJ09, MO07] have studied similar attack and defense mechanisms against the Unix file system and banked memory in multi-cores, respectively.
Broadly, we plan to address the aforementioned issues using a hardware/software approach: the software approach focuses on improving area efficiency, while the hardware approach improves performance.

Memory Hierarchy

Innovation and technological improvements in processor design have outpaced advances in memory design over the last ten years, causing an increasing gap between processor and memory speeds. During the last decade this has led to an approach involving concurrent execution, initially through the execution of multiple threads in one processor and now through the inclusion of multiple cores on a single chip. Unfortunately, the advent of chip multiprocessors (CMPs) has made the problem even worse due to increased bandwidth requirements and contention on the memory controller. This widening speed gap has motivated current high-performance processors to focus on cache organization, register files and prefetching techniques to tolerate growing memory latencies [BCS09], [BGK96], [SPN96]. Furthermore, power dissipation is becoming a critical issue for microprocessors: it determines the cost of the cooling system and ultimately may limit the performance of the microprocessor.
Prefetching, which decouples and overlaps computation and data transfer, is a well-known technique commonly employed to hide memory latencies [BCS09]. However, although aggressive prefetching mechanisms are beneficial for tolerating memory latencies in single-core processors for most applications, when prefetching is done in multiple cores of a CMP the performance gains of the individual cores can be greatly reduced compared to systems without prefetching [EMJ09]. This is caused by interference among prefetching mechanisms in the shared resources.
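A minimal example of the kind of mechanism surveyed in [BCS09] is a per-PC stride prefetcher: a table records the last address and stride seen by each load, and once a stride repeats, the next address is prefetched. The table format and confirmation policy below are illustrative, not any specific published design.

```python
# Toy stride prefetcher: pc -> (last_addr, stride, confident). A stride must
# be observed twice before a prefetch of addr + stride is issued. Real
# prefetchers also manage degree, distance and -- in CMPs -- contention on
# the shared memory system, which this sketch ignores.

class StridePrefetcher:
    def __init__(self):
        self.table = {}

    def access(self, pc, addr):
        last, stride, conf = self.table.get(pc, (None, 0, False))
        issued = None
        if last is not None:
            new_stride = addr - last
            if conf and new_stride == stride:
                issued = addr + stride        # confident: issue a prefetch
            self.table[pc] = (addr, new_stride, new_stride == stride)
        else:
            self.table[pc] = (addr, 0, False)
        return issued

pf = StridePrefetcher()
print([pf.access(0x40, a) for a in (100, 164, 228, 292)])  # [None, None, None, 356]
```

The CMP problem noted above arises because each core's prefetcher issues such extra requests independently, and they compete for the same memory controller bandwidth.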
One of the greatest challenges that has appeared with this shift in chip organization lies in how users will exploit CMPs. Parallel programming models, which divide an application into several tasks that can be executed concurrently, seem to be the best alternative for taking advantage of CMP resources. Unfortunately, current programming models implement blocking synchronization, where critical sections are serialized in order to ensure mutual exclusion. Blocking synchronization increases the complexity of parallel programming and significantly degrades the performance of parallel applications. This fact has encouraged the development of optimistic programming models that use non-blocking synchronization. In these programming models, critical sections are executed simultaneously, which requires modifications in the memory hierarchy to guarantee the correctness of the execution [HWC04] [DLM09].
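The optimistic execution model can be sketched with the validate-and-retry core of transactional memory [HWC04]: a "transaction" reads a versioned location, computes, and commits only if no conflicting commit happened in between, retrying otherwise. This is a single-threaded toy model of the idea, not an STM implementation; all names are made up.

```python
# Optimistic (non-blocking) synchronization sketch: commit succeeds only if
# the version observed at the transactional read is still current.

class VersionedCell:
    def __init__(self, value):
        self.value, self.version = value, 0

    def commit(self, seen_version, new_value):
        if self.version != seen_version:       # conflicting commit happened
            return False                       # abort: caller must retry
        self.value, self.version = new_value, self.version + 1
        return True

def atomic_add(cell, delta, interfere=None):
    while True:
        v, val = cell.version, cell.value      # transactional read
        if interfere:
            interfere(); interfere = None      # simulate one conflicting writer
        if cell.commit(v, val + delta):        # validate + commit
            return cell.value

cell = VersionedCell(10)
atomic_add(cell, 5, interfere=lambda: cell.commit(cell.version, 100))
print(cell.value)  # 105: the first attempt aborted, the retry saw 100
```

Unlike a lock, no thread ever blocks here; the cost is moved to the abort-and-retry path, which hardware transactional memory makes cheap by tracking read/write sets in the cache hierarchy.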
On the other hand, the increasing influence of wire delay on cache design means that access latencies to the last-level cache banks are no longer constant [AHK00], [M97]. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem [KBK02]. A NUCA divides the cache into smaller banks and allows nearer banks to have lower access latencies than farther ones, thus mitigating the effects of the cache's internal wires.
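The NUCA latency model reduces to a base bank access cost plus a per-hop wire delay, so latency depends on where the requested line happens to reside. The grid layout and cycle counts below are made-up illustrative numbers, not figures from [KBK02].

```python
# NUCA sketch: bank latency = base access cost + wire delay per hop from
# the controller to the bank (Manhattan distance on a bank grid).

BASE, PER_HOP = 4, 2        # cycles (illustrative)

def nuca_latency(bank_xy, ctrl_xy=(0, 0)):
    hops = abs(bank_xy[0] - ctrl_xy[0]) + abs(bank_xy[1] - ctrl_xy[1])
    return BASE + PER_HOP * hops

print(nuca_latency((0, 1)), nuca_latency((3, 3)))  # 6 16
```

The design question NUCA research explores is placement: migrating hot lines into the near, fast banks shifts most accesses toward the low end of this latency range.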
We will propose several microarchitectural techniques, applicable to various parts of current microprocessor designs, to improve the memory system and to speed up the execution of instructions. Some techniques will attempt to narrow the gap between processor and memory speeds, while others will attempt to alleviate the serialization caused by data dependences.

Multithreaded Processors

Industry and researchers are shifting towards multi-core architectures [WWW07] [WP07] [WWW207] [F07]. This shift is mainly motivated by two factors: on the one hand, we have reached a point where further exploiting instruction-level parallelism (ILP) gives diminishing returns, so other types of parallelism are needed. On the other hand, new feature sizes allow a greater number of transistors to be implemented on a chip. This increase in transistor count opens the possibility of integrating multiple cores on die, so that multiple applications, and threads from the same application, can run in parallel, obtaining good performance by exploiting thread-level parallelism (TLP) [KFJ+04].
The ability to execute multiple threads in parallel is called multithreading, and it can be implemented in several ways. On the one hand, implementing multiple cores supports multiple threads in parallel. On the other hand, each of these cores can also execute more than one thread at a time using techniques such as simultaneous multithreading, fine-grain multithreading [TEL95] or switch-on-event multithreading [MB05].
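Fine-grain multithreading, the simplest of the techniques just listed, can be sketched as a per-cycle round-robin issue policy: each cycle the core picks the next thread that still has work, so one thread's stall cycles can be filled with another's instructions. A pure scheduling illustration with made-up instruction labels:

```python
# Fine-grain multithreading sketch: one instruction issued per cycle,
# rotating round-robin over the threads and skipping finished ones.

def fine_grain_schedule(threads, cycles):
    # threads: list of per-thread instruction lists; returns the issue order
    trace, idx = [], [0] * len(threads)
    t = 0
    for _ in range(cycles):
        for off in range(len(threads)):
            cand = (t + off) % len(threads)
            if idx[cand] < len(threads[cand]):
                trace.append(threads[cand][idx[cand]])
                idx[cand] += 1
                t = (cand + 1) % len(threads)
                break
    return trace

print(fine_grain_schedule([["A0", "A1"], ["B0", "B1"]], 4))  # ['A0', 'B0', 'A1', 'B1']
```

Simultaneous multithreading [TEL95] generalizes this by letting instructions from several threads share the issue slots of the *same* cycle rather than alternating whole cycles.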
Implementing multiple simple cores on a chip makes the number of cores available in a processor grow every year, and companies are sometimes making strong bets, designing processors able to exploit TLP very efficiently at the expense of sacrificing ILP [H05]. However, these novel architectures consisting of simple cores will have to compete with current out-of-order processors, which clearly outperform them in the ILP arena.
Speculative multithreading is a paradigm where single-threaded applications are split into multiple threads that can be executed in parallel. These threads are generated using speculative optimizations, such as control speculation and dependence breaking, that maximize the number of instructions that can be executed in parallel. Since the optimizations are speculative, hardware mechanisms are required to detect and recover from misspeculations. Multi-core architectures comprising simple cores can take advantage of this paradigm to reach performance similar to that of conventional out-of-order cores on single-threaded applications. Typical implementations of speculative multithreading can be found in [SBV95][SCZ+02][MG99][GMS+05][CMT00]. These implementations usually generate speculative threads where every thread represents a set of consecutive instructions from the original application plus some extra instructions to handle the speculative optimizations. More recent proposals refine these models, generating threads where the original instructions are more aggressively distributed among threads [MLC+09].
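The detect-and-recover cycle common to these designs can be modeled in a few lines: loop chunks first execute against a stale memory snapshot; at in-order commit, a chunk that read a location an earlier chunk wrote is squashed and re-executed with current values. This captures only the dependence-violation mechanism of [SBV95]-style designs; the chunk interface and memory model are invented for the sketch.

```python
# Toy speculative-multithreading model. A chunk is a function taking a
# memory dict and returning (read_set, write_map).

def run_speculative(chunks, memory):
    snapshot = dict(memory)
    speculative = [chunk(snapshot) for chunk in chunks]   # "parallel" phase
    dirty, squashes = set(), 0
    for i, (reads, writes) in enumerate(speculative):
        if reads & dirty:                  # read a location a prior chunk wrote
            squashes += 1
            reads, writes = chunks[i](memory)   # squash + re-execute
        memory.update(writes)              # in-order commit
        dirty |= writes.keys()
    return squashes

# x += 1 executed as two chunks: the second speculatively reads stale x
def inc_x(mem):
    return ({"x"}, {"x": mem["x"] + 1})

mem = {"x": 0}
print(run_speculative([inc_x, inc_x], mem), mem["x"])  # 1 2
```

The one squash in the example is exactly the cost of the mispredicted dependence: the result is still the sequential one, but one chunk's work was thrown away and redone.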

Reliability

Circuit behavior is no longer deterministic in current and future technologies due to variations. Limitations in the fabrication process introduce indeterminism in the geometry (length, width and thickness) of transistors, and hence in their behavior. Moreover, voltage and temperature oscillate, and the inputs of circuits change. Thus, the delay and power of circuits change dynamically, yet circuits must work in any feasible scenario. This issue is typically addressed by assuming the worst-case scenario to ensure circuit functionality, but such an assumption is pessimistic most of the time and very inefficient in terms of power and delay. The aim of our research is to find solutions that make circuits more efficient in the presence of variations. The result will be a set of strategies, techniques and circuit designs that improve the performance and power of such circuits by adapting their operation to the common case instead of assuming worst-case conditions.
Some works have already been proposed to exploit variations. TIMERRTOL [Uht00] and Razor [EKD+03] exploit variations by using results after the common-case delay and checking their correctness after the worst-case delay. There has also been work on new memory circuit design [VSP+09], as well as techniques to reduce the performance impact of process variations [LCW+07].
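The common-case-then-check scheme of [Uht00] and [EKD+03] can be abstracted as timing speculation: consume a result latched at the fast, common-case delay, validate it against a value captured at the safe worst-case delay, and pay a recovery penalty only on a mismatch. The cycle counts and the error model below are entirely made up for illustration.

```python
# Razor-style timing speculation sketch: fast path always pays FAST cycles;
# a detected timing error adds a RECOVERY (flush-and-replay) penalty.

FAST, RECOVERY = 1, 3              # illustrative cycle counts

def run(ops, fast_result):
    # fast_result(op) models what the speculative short-latency path latches;
    # the checker compares it with the always-correct value op() produces.
    cycles = 0
    for op in ops:
        speculative, correct = fast_result(op), op()
        cycles += FAST
        if speculative != correct:          # timing error detected
            cycles += RECOVERY              # recover at safe timing
    return cycles

ops = [lambda: 4, lambda: 7, lambda: 9]
flaky = lambda op: 0 if op() == 7 else op()   # the middle op misses its timing
print(run(ops, flaky))  # 3 fast cycles + 1 recovery = 6
```

The scheme wins whenever the error rate is low enough that the occasional recovery penalty is smaller than the cycles saved by not clocking every operation at worst-case delay.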
The main groups working in the area of variations are the Department of Electrical and Computer Engineering at the University of Rhode Island, the Department of Electrical Engineering at Princeton University, the Division of Engineering and Applied Sciences at Harvard University, and the Department of Electrical Engineering and Computer Science at the University of Michigan.

Virtual Machines

Co-designed virtual machines [SN05] are an attractive vehicle for designing complexity- and power-effective processors. In this paradigm, a processor is a co-design effort between hardware and software. The software layer wraps the hardware and provides a common interface to the outside world (operating system and applications). This software layer performs dynamic translation from a source ISA (the one visible to the operating system and applications) to the ISA of the underlying hardware, as well as optimization of the source code, adapting it to better exploit the capabilities of the hardware layer underneath.
Several proposals in the research arena have shown the potential benefits of a co-designed virtual machine, as well as the benefits of dynamic optimization. In Transmeta Crusoe [Kla00], IBM DAISY [EA96] and IBM BOA [AGS+99], the concept of a co-designed virtual machine is leveraged to design systems based on a low-complexity, low-power VLIW hardware layer able to execute general-purpose x86 code. In these proposals, the translation from x86 to the VLIW ISA is an important feature, and it imposes a significant overhead on the system.
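The reason translation overhead is bearable in these systems is amortization through a translation cache: a block of source-ISA code is translated once and the cached translation is reused on every later execution. The sketch below uses an invented two-operation "ISA" and Python closures as stand-ins for translated host code; every name in it is hypothetical.

```python
# Toy dynamic translator with a translation cache, in the spirit of
# Crusoe/DAISY-style co-designed VMs. Source blocks are lists of
# (op, operand) pairs in a made-up ISA with just "add" and "mul".

def translate(block):
    # "translate" the block into host code (here, a Python closure)
    def run(acc):
        for op, n in block:
            acc = acc + n if op == "add" else acc * n
        return acc
    return run

tcache = {}

def execute(block_id, block, acc):
    if block_id not in tcache:             # cold: pay translation cost once
        tcache[block_id] = translate(block)
    return tcache[block_id](acc)           # hot: reuse the cached translation

hot = [("add", 2), ("mul", 3)]
print(execute("b0", hot, 1), execute("b0", hot, 10), len(tcache))  # 9 36 1
```

Hot code thus runs at translated speed while only cold code pays the translation cost, which is why these systems also invest in dynamic optimization of exactly the hottest cached blocks.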
RePlay [FBC+01] and PARROT [RAM+04] remove the translation overhead and concentrate their efforts on dynamically optimizing the most frequently executed sections of applications. However, these projects rely on hardware to perform code optimization, which limits the flexibility of the system. A software optimizer can perform more complex analyses and optimizations than a hardware-based scheme, and it may be updated many times even after the chip is built. Moreover, a hardware optimizer adds complexity to the hardware, which may result in increased power consumption and additional validation cost.
The goal of our research in this arena is to propose a complete design of a system based on a combined hardware/software effort. To do so, we will first investigate an efficient and flexible co-designed system that overcomes the limitations of previous proposals. Then, we will investigate novel techniques to perform dynamic optimization through the co-designed virtual machine software layer. These techniques must be able to adapt applications to the hardware underneath. Finally, we will investigate enhancements to the hardware layer to allow better interaction with the software wrapper layer.
Numerous research groups have focused on generic dynamic binary optimization. However, very few of them have concentrated their efforts on the concept of co-designed virtual machines, where the final functionality of a processor is transparently provided by the most efficient balance between hardware and software. In addition to the aforementioned projects, the research group led by the recently retired professor Jim Smith [SN05], with whom we have closely collaborated for more than 10 years, has worked on this topic.
Due to the increasing complexity of current processors in terms of energy consumption, area and validation effort, we strongly believe that this research topic will become very prominent in the research agendas of many groups in the coming years. In fact, more and more groups already advocate using software to perform tasks that are too complex to perform in hardware, even when these proposals are not aligned with the concept of co-designed virtual machines.

References

[AC71] A. V. Aho, M. J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333–340, 1975.
[AGS+99] E. Altman, M. Gschwind, S. Sathaye, S. Kosonocky, A. Bright, J. Fritts, P. Ledak, D. Appenzeller, C. Agricola, Z. Filan, “BOA: The Architecture of a Binary Translation Engine”, IBM Research Report RC 21665 (97500), 1999
[AHK00] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, Clock rate vs. ipc: The end of the road for conventional microprocessors, in Proceedings of the 27th International Symposium on Computer Architecture, 2000.
[BCS09] S. Byna, Y. Chen and X. Sun. “Taxonomy of Data Prefetching for Multicore Processors”. Journal of Computer Science and Technology, 2009 24 (3): 405-417.
[BGK96] D. Burger, J. R. Goodman, and A. Kägi, “Quantifying Memory Bandwidth Limitations of Current and Future Microprocessors”, In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.
[BKS+08] C. Bienia, S. Kumar, J. P. Singh, and K. Li. “The PARSEC Benchmark Suite: Characterization and Architectural Implications”, In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 72–81, 2008.
[BoB09] R. Borgo, K. Brodlie. “State of The Art Report on GPU”, Visualization and Virtual Reality Research Group, School of Computing, University of Leeds, VizNET Report, Ref: VIZNET-WP4-24-LEEDS-GPU, 2009.
[BP05] Z. K. Baker, V. K. Prasanna: High-throughput linked-pattern matching for intrusion detection systems. ANCS 2005.
[CGJ09] Q. Cai, Y. Gui, and R. Johnson. Exploiting Unix File-system Races via Algorithmic Complexity Attacks. Proceedings of IEEE Symposium on Security and Privacy, 2009.
[ChS08] H. Chang and W. Sung, “Efficient Vectorization of SIMD Programs with Non-aligned and Irregular Data Access Hardware”, CASES’08, October 19–24, 2008.
[CMT00] M. Cintra, J.F. Martinez and J. Torrellas, “Architectural Support for Scalable Speculative Parallelization in Shared-Memory Systems”, in Proc. of the 27th Int. Symp. on Computer Architecture, 2000.         
[CW03] S. A. Crosby and D. S. Wallach. Denial of Service via Algorithmic Complexity Attacks. In USENIX Security, Aug. 2003.
[DLM09] D. Dice, Y. Lev, M. Moir, and D. Nussbaum, Early Experience with a Commercial Hardware Transactional Memory Implementation, in Proceedings of the 14th International Conference onArchitectural Support for Programming Languages and Operating Systems, 2009.
[EA96] K. Ebcioglu, E. R. Altman, “DAISY: Dynamic Compilation for 100% Architectural Compatibility”, IBM Research Report RC 20538, 1996.
[EKD+03] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, T. Mudge. Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation. Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003.
[EMJ09] E. Ebrahimi, O. Mutlu, C. J. Lee, and Y. N. Patt. “Coordinated control of multiple prefetchers in multi-core systems”. In Proceedings of the 42nd Annual IEEE/ACM international Symposium on Microarchitecture, 2009.
[FBC+01] B. Fahs. S. Bose, M. Crum, B. Slechta, F. Spadini, T. Tung, S. Patel, S. Lumetta, “Performance Characterization of a Hardware Mechanism for Dynamic Optimization”, in Procs. of 34th International Symposium On Microarchitecture (MICRO-34), 2001
[F07] J. Fang. Parallel Programming Environment: A Key to Translating Tera-Scale Platforms into a Big Success. Key Note at International Symposium on Code Generation and Optimization. March 2007.
[GMS+05] C. García, C. Madriles, J. Sánchez, P. Marcuello, A. González and D.M. Tullsen, “Mitosis Compiler: An Infrastructure for Speculative Threading Based on Pre-computation Slices”, in Proc. of the Int. Conf. on Programming Language Design and Implementation, pp. 269-278, 2005.
[H05] H.P. Hofstee, Power efficient processor architecture and the cell processor. In Proceedings of 11th International Symposium on High-Performance Computer Architecture HPCA-11, February 2005.
[HST09] N. Hua, H. Song, and T.V. Lakshman. Variable-Stride Multi-Pattern Matching For Scalable Deep Packet Inspection. Proceedings of IEEE INFOCOM, April 2009.
[HWC04] L. Hammond, V. Wong, M. Chen, B. Carlstrom, J. Davis, B. Hertzberg, M. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, Transactional Memory Coherence and Consistency, in Proceedings of the 31st International Symposium on Computer Architecture, 2004.
[KBK02] C. Kim, D. Burger, and S.W.Keckler, An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches, In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[KFJ+04] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. In Proceedings of the 31st Annual International Symposium on Computer Architecture, 2004.
[Kla00] A. Klaiber, “The Technology Behind the CrusoeTM Processors”, white paper, January 2000
[LCW+07] X. Liang, R. Canal, G.Y. Wei, D. Brooks. Process Variation Tolerant 3T1D-Based Cache Architectures. Proceedings of the 40th International Symposium on Microarchitecture (MICRO-40), December 2007
[LNO+08] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym. “NVIDIA Tesla: A Unified Graphics and Computing Architecture”, IEEE Micro, 28(2):39–55, March-April 2008.
[LuH07] D. Luebke and G. Humphreys. “How GPUs Work”, IEEE Computer, 40(2):96–100, 2007.
[M97] D. Matzke, Will physical scalability sabotage performance gains?, IEEE Computer, September 1997.
[MB05] C. McNairy, and R. Bhatia. Montecito: A Dual-Core, Dual-Thread Itanium Processor. IEEE Micro, Volume 25, Issue 2, Pages 10–20. March-April 2005.
[MG99] P. Marcuello and A. González, “Clustered Speculative Multithreaded Processors”, in Proc. of the 13th Int. Conf. on Supercomputing, pp. 365-372, 1999.
[MLC+09] C. Madriles, F. Latorre, J.M. Codina, E. Gibert, P. López, A. Martínez, R. Martínez and A. González, "Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading", in proceedings of International Conference on Parallel Architectures and Compiler Techniques, September 2009.
[MO07] T. Moscibroda, O. Mutlu. Memory Performance Attacks: Denial of Memory Service in Multi-core Systems. In 16th USENIX Security Symposium 2007.
[NVI08] NVIDIA Corporation. “NVIDIA CUDA Programming Guide”, 2.0 edition, 2008.
[OHL+08] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. “GPU Computing”, Proceedings of the IEEE, vol. 96, pp. 879–899, 2008.
[PN98] T. Ptacek and T. Newsham. Insertion, Evasion and Denial of Service: Eluding Network Intrusion Detection. In Secure Networks, Inc., January 1998.
[PY08] P. Piyachon, Y. Luo. Design of a High Performance Pattern Matching Engine Through Compact Deterministic Finite Automata. Proceedings of the ACM DAC 2008.
[RAM+04] R. Rosner, Y. Almog, Micha Moffie, Naftali Schwartz, Avi Mendelson, “Power Awareness through Selective Dynamically Optimized Traces”, in Procs. of 31st International Symposium on Computer Architecture (ISCA-21), 2004
[Ro99] M. Roesch. SNORT - Lightweight Intrusion Detection for Networks. In LISA '99: USENIX 13th Systems Administration Conference 1999.
[RWP05] G. Ren, P. Wu, D. Padua. “An Empirical Study on Vectorization of Multimedia Applications for Multimedia Extensions”, Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05), 2005.
[SBV95] G.S. Sohi, S.E. Breach and T.N. Vijaykumar, “Multiscalar Processors”, in Proc. of the 22nd Int. Symp. on Computer Architecture, pp.414-425, 1995.
[SCS+08] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, P. Hanrahan, S. Junkins, A. Lake, J. Sugerman, “Larrabee: A Many-core x86 Architecture for Visual Computing”, In SIGGRAPH ’08: ACM SIGGRAPH 2008 Papers, ACM Press, New York, NY, 2008.
[SCZ+02] J. Steffan, C. Colohan, A. Zhai and T. Mowry, “Improving Value Communication for Thread-Level Speculation”, in Proc. of the 8th Int. Symp. on High Performance Computer Architecture, pp. 58-62, 2002.
[SEJ06] R. Smith, C. Estan, and S. Jha. Backtracking Algorithmic Complexity Attacks against a NIDS. In ACSAC 2006.
[SEJ08] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. IEEE Symposium on Security and Privacy, May 2008.
[SFS00] J. E. Smith, G. Faanes, R. Sugumar, “Vector Instruction Set Support for Conditional Operations”, International Symposium on Computer Architecture, pages 260–269, 2000.
[Shi07] J. Shin, “Introducing Control Flow into Vectorized Code”, 16th International Conference on Parallel Architectures and Compilation Techniques, 2007.
[SN05] J. E. Smith, and R. Nair, “Virtual Machines: Versatile Platforms for Systems and Processes”, Morgan Kaufmann Publishers, 2005
[SPN96] A. Saulsbury, F. Pong and A. Nowatzyk, “Missing the Memory Wall: The Case for Processor/Memory Integration”, In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.
[TEL95] D. Tullsen, S. J. Eggers, H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995.
[TS05] L. Tan, T. Sherwood. A High Throughput String Matching Architecture for Intrusion Detection and Prevention. Proceedings of ISCA 2005, pages 112-122.
[Uht00] Uht, A. K. Achieving typical delays in synchronous systems via timing error toleration. Tech. Rep. Dept. of Electrical and Computer Engineering, No. 032000-0100, University of Rhode Island. 2000
[VSP+09] A. Valero, J. Sahuquillo, S. Petit, V. Lorente, R. Canal, P. Lopez, J. Duato. An hybrid eDRAM/SRAM macrocell to implement first-level data caches. Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), December 2009
[WWW07] Quad-Core Intel® Xeon® Processor 5300 Series. August 2007.
[WP07] White paper. The Manycore Shift Microsoft: Parallel Computing Initiative Ushers Computing into the Next Era. November 2007.
[WWW207] Quad-Core AMD Opteron processors for Server and Workstation., 2007.