Projects

From ArcoWiki
= Characterization and Acceleration of Emerging Applications =
  
GPUs are specialized hardware cores designed to accelerate rendering and display, but they are moving in the direction of general-purpose accelerators [BoB09]. GPU vendors have recently introduced new programming models and associated hardware support to broaden the class of non-graphics applications that can efficiently use GPU hardware [NVI08][OHL+08]. There is a large, emerging and commercially relevant class of applications enabled by the significant increase in GPU computing density, such as graphics and physics for gaming, interactive simulation, data analysis, scientific computing, 3D modeling for CAD, signal processing, digital content creation, and financial analytics. The PARSEC benchmark suite [BKS+08] is a good proxy for this class of applications. Applications in these domains benefit from architectural approaches that provide higher performance through parallelism. <br>GPU capabilities excel for applications that exhibit extensive data parallelism. GPUs typically operate on a large number of data points, where the same operation is conducted simultaneously on all of them in the form of continuously running vectors or streams. Furthermore, to exploit data-level parallelism, modern GPUs typically batch together groups of individual threads (called warps) running the same shader program and execute them in lock step on a SIMD pipeline [LuH07][LNO+08]. However, even with a general-purpose programming interface, mapping existing applications to the parallel architecture of a GPU is a non-trivial task. <br>Vectorization is an optimization technique that has traditionally targeted vector processors. The importance of this optimization has increased in recent years with the introduction of SIMD extensions such as Intel's SSE and IBM/Motorola's AltiVec to general-purpose processors, and with the growing significance of applications that can benefit from this functionality.
However, achieving high performance on modern architectures requires efficient utilization of SIMD units. This requires algorithms that can take full advantage of the SIMD width offered and that do not waste SIMD instructions on low-utilization cases. Both Intel SSE and PowerPC AltiVec expose a relatively small SIMD width of four. It is often complicated to apply vectorization techniques to architectures with such SIMD extensions because these extensions are largely non-uniform, supporting specialized functionality and a limited set of data types. Vectorization is often further impeded by the SIMD memory architecture, which typically provides access to contiguous memory items only, often with additional alignment restrictions. Computations, on the other hand, may access data elements in an order that is neither contiguous nor adequately aligned. Bridging this gap efficiently requires careful use of special mechanisms, including permute, pack/unpack, and other instructions that incur additional performance penalties and complexity. <br>However, given the small cost and potentially high benefit of increasing the SIMD width, it seems likely that future architectures will explore larger SIMD widths, as Nvidia's Fermi and Intel's Larrabee [LCS+08] do. Larrabee greatly increases the flexibility and programmability of the architecture compared to standard GPUs. Its approach is based on extending each CPU core with a wide vector unit featuring scatter-gather capability, as well as predicated-execution support. On the other hand, available compilers have limitations that prevent loop vectorization, such as control flow, non-contiguous and irregular data access, data dependencies, nested loops and an undefined number of loop iterations [RWP05], which are present in most of the main loops of emerging applications. Some works address the control-flow problems [SFS00][Shi07], as well as irregular data access [ChS08].
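The lock-step execution of a warp with divergence handled by predication can be sketched in a few lines of Python. This is a purely illustrative software model (the 4-lane warp width and the example operation are invented for this sketch; real GPUs do this in hardware): on a divergent branch, both paths are executed by all lanes and a per-lane mask selects the result, which is why low-utilization branches waste SIMD slots.

```python
# Illustrative model of lock-step SIMD execution with predication.
# Not any vendor's actual microarchitecture: both branch paths run for
# every lane, and a per-lane predicate selects which result survives.

def warp_execute(data, width=4):
    """Compute abs() over all elements, one 'warp' of `width` lanes
    at a time, executing both branch paths under a mask."""
    out = []
    for base in range(0, len(data), width):
        lanes = data[base:base + width]
        mask = [x < 0 for x in lanes]       # predicate per lane
        neg_path = [-x for x in lanes]      # "taken" path, all lanes run it
        pos_path = [x for x in lanes]       # "not taken" path, all lanes run it
        # per-lane select, analogous to a vector blend instruction
        out.extend(n if m else p for m, n, p in zip(mask, neg_path, pos_path))
    return out

print(warp_execute([3, -1, -7, 2, -5]))  # [3, 1, 7, 2, 5]
```

Note that both `neg_path` and `pos_path` are computed for every lane regardless of the predicate; this mirrors the utilization loss the text describes when control flow diverges within a warp.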
  
= Intrusion Detection Systems =
  
Computing systems today operate in an environment of seamless connectivity, with attacks continuously created and propagated through the Internet. There is a clear need to provide an additional layer of security in routers. Intrusion Detection Systems (IDS) are emerging as one of the most promising ways of protecting systems on the network against suspicious activities. By monitoring traffic in real time, an IDS can detect and also take preventive actions against suspicious activities. Network-based IDS have emerged as an effective and efficient defense for systems in the network. They are usually deployed in routers.<br>The deployment of network-based IDS in routers poses a very interesting challenge. With line rates doubling roughly every 24 months, especially in backbone routers, the performance of the IDS needs to keep pace. For example, an IDS deployed in a state-of-the-art backbone router inspects packets streaming at 40 Gbps and scans these packets for more than 23,000 attack signatures. This is clearly a tremendous performance challenge, so performance is a key factor for the efficient functioning of an IDS.<br>An IDS detects attacks by scanning packets for attack patterns, performing multiple-pattern matching. Patterns can be expressed either as fixed strings or as regular expressions. The Aho-Corasick [AC71] algorithm is commonly used by IDS [Ro99] for fixed-string matching. In the Aho-Corasick algorithm, a finite state machine (FSM) is constructed from the attack signatures, and this FSM is subsequently traversed using the bytes of incoming packets. The main advantage of the Aho-Corasick algorithm is that it runs in time linear in the input bytes, regardless of the number of attack signatures. Its main disadvantage lies in devising a practical implementation, due to the large memory needed to store the FSM.
Consequently, one of the primary areas of focus in the IDS research community is devising a performance- and area-efficient architecture for the Aho-Corasick algorithm.<br>IDS also increasingly use regular expressions to specify attack signatures, owing to their rich expressive power. To be matched, regular expressions are first converted into finite automata (deterministic or non-deterministic), and these automata are later traversed using the bytes of incoming packets. However, these automata are either area-inefficient (in the case of Deterministic Finite Automata) or performance-inefficient (in the case of Non-deterministic Finite Automata).<br>A key requirement for the effectiveness of an IDS is that it processes packets at the rate at which they stream. Failing to do so results either in undetected malicious packets or in expensive packet drops. An adversary can also deliberately drive the IDS into this state of not being able to process packets at wire speed. Such attempts are commonly referred to as evasion [CW03, PN98] and stem from weaknesses in some part of IDS processing. The nature and ease of evasion make it very appealing for malicious hosts to bypass the IDS.<br>There have been numerous works in the area of improving the performance and area efficiency of pattern-matching algorithms (fixed strings and regular expressions). [TS05, PY08] propose novel techniques to significantly improve the performance and area efficiency of pattern matching in IDS. In the area of regular-expression matching, numerous works [HST09, SEJ08] have studied and proposed improvements to DFA storage and DFA traversal. Additionally, [BP05] have proposed techniques for NFAs using reconfigurable hardware. [CW03, SEJ06] have studied various sophisticated attacks against IDS and secure defense mechanisms.
Additionally, [CGJ09, MO07] have studied similar attack and defense mechanisms against the Unix file system and against banked memory in multi-cores, respectively.<br>Broadly, we plan to address these issues using a combined hardware/software approach: the software part focuses on improving area efficiency, while the hardware part improves performance efficiency.
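The Aho-Corasick construction and traversal described above can be sketched compactly in Python. This is a simplified illustration, not an IDS-grade implementation (the example signatures are the classic textbook set, not real attack signatures); it shows both properties the text highlights: scanning time linear in the input, and FSM memory that grows with the total size of the signature set.

```python
from collections import deque

def build_automaton(patterns):
    """Build the Aho-Corasick FSM: goto transitions, failure links and
    output sets. State count grows with total pattern length, which is
    the memory-footprint concern discussed in the text."""
    goto = [{}]          # goto[state][char] -> next state
    fail = [0]           # failure link per state
    out = [set()]        # patterns recognized at each state
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)
    q = deque(goto[0].values())      # BFS from the root's children
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][ch] if ch in goto[f] and goto[f][ch] != t else 0
            out[t] |= out[fail[t]]   # inherit matches from the suffix state
            q.append(t)
    return goto, fail, out

def scan(text, automaton):
    """Traverse the FSM with the input; time is linear in len(text),
    independent of the number of signatures."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]              # follow failure links on a mismatch
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            hits.append((i - len(pat) + 1, pat))
    return hits

ac = build_automaton(["he", "she", "his", "hers"])
print(sorted(scan("ushers", ac)))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```

In a real IDS the per-state transition tables (here Python dictionaries) dominate memory, which is why compressed FSM representations are an active research area.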
  
= Memory Hierarchy =
  
 
Innovation and technological improvements in processor design have outpaced advances in memory design over the last ten years. This imbalance has been causing an increasing gap between processor and memory speeds. During the last decade this has led to an approach based on concurrent execution, initially through the execution of multiple threads in one processor and now with the inclusion of multiple cores in a single chip. Unfortunately, the advent of chip multiprocessors (CMPs) in recent years has made the problem even worse due to increased bandwidth requirements and contention on the memory controller. This increasing speed gap has therefore motivated current high-performance processors to focus on cache organization, the register file and prefetching techniques to tolerate growing memory latencies [BCS09], [BGK96], [SPN96]. Furthermore, power dissipation is becoming a critical issue for microprocessors: it determines the cost of the cooling system and may ultimately limit the performance of the microprocessor. <br>Prefetching, which decouples and overlaps the computation and the transfer of data, is a well-known technique commonly employed to hide memory latencies [BCS09]. However, although aggressive prefetching mechanisms are, for most applications, beneficial for tolerating memory latencies in single-core processors, when prefetching is done in multiple cores of a CMP the performance gain of individual cores can be greatly reduced compared to systems without prefetching [EMJ09]. This is caused by interference between prefetching mechanisms in the shared resources. <br>One of the greatest challenges that appeared with this shift in chip configuration is how users will exploit CMPs. Parallel programming models, which divide an application into several tasks that can be executed concurrently, seem to be the best alternative for taking advantage of CMP resources.
Unfortunately, current programming models implement blocking synchronization, where critical sections are serialized in order to ensure mutual exclusion. Blocking synchronization increases the complexity of parallel programming and significantly degrades the performance of parallel applications. This fact has encouraged the development of optimistic programming models that use non-blocking synchronization. In these programming models, critical sections are executed simultaneously, which requires modifications in the memory hierarchy to guarantee the correctness of the execution [HWC04] [DLM09]. <br>On the other hand, the increasing influence of wire delay on cache design means that access latencies to the last-level cache banks are no longer constant [AHK00], [M97]. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem [KBK02]. A NUCA divides the cache memory into smaller banks and allows nearer banks to have lower access latencies than farther ones, thus mitigating the effects of the cache's internal wires. <br>We will propose several microarchitectural techniques that can be applied to various parts of current microprocessor designs to improve the memory system and to speed up the execution of instructions. Some techniques will attempt to narrow the gap between processor and memory speeds, while others will attempt to alleviate the serialization caused by data dependencies.
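The non-uniform access latency of a NUCA can be illustrated with a toy model in Python. All parameters here (bank access time, per-hop wire/router cost, grid size) are invented for illustration and do not come from any real design; the point is simply that latency grows with the physical distance between the requesting core and the bank holding the line.

```python
# Toy NUCA latency model: a last-level cache arranged as a grid of
# banks, with latency = fixed bank access time + cost per network hop.
# All constants are invented for illustration.

def nuca_latency(core_xy, bank_xy, bank_access=8, hop_cost=2):
    """Access latency in cycles from a core to a given bank, using
    Manhattan distance as the hop count on an on-chip mesh."""
    hops = abs(core_xy[0] - bank_xy[0]) + abs(core_xy[1] - bank_xy[1])
    return bank_access + hop_cost * hops

# A core in one corner of a 4x4 bank grid sees non-uniform latencies:
core = (0, 0)
print(nuca_latency(core, (0, 0)))  # nearest bank: 8 cycles
print(nuca_latency(core, (3, 3)))  # farthest bank: 20 cycles
```

Placement policies that migrate frequently used lines toward the nearer (cheaper) banks are precisely what exploits this latency gradient.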
 
  
= Multithreaded Processors =
  
 
Industry and researchers are shifting towards multi-core architectures [WWW07] [WP07] [WWW207] [F07]. This shift is mainly motivated by two factors. On the one hand, we have reached a point where further exploiting instruction-level parallelism (ILP) gives diminishing returns, so other types of parallelism are needed. On the other hand, new feature sizes allow a greater number of transistors to be implemented on a chip. This increase in the number of transistors opens the possibility of integrating multiple cores on die, so that multiple applications, or threads from the same application, can run in parallel, achieving good performance by exploiting thread-level parallelism (TLP) [KFJ+04]. <br>The ability to execute multiple threads in parallel is called multithreading, and there are several ways of implementing it. On the one hand, implementing multiple cores allows multiple threads to be supported in parallel. On the other hand, each of these cores could also execute more than one thread at the same time using techniques such as simultaneous multithreading, fine-grain multithreading [TEL95] or switch-on-event multithreading [MB05]. <br>Implementing multiple simple cores on a chip makes the number of cores available in a processor increase every year, and companies are sometimes making strong bets, designing processors able to exploit TLP very efficiently at the expense of sacrificing ILP [H05]. However, these novel architectures consisting of simple cores will have to compete with current out-of-order processors, which clearly outperform them in the ILP arena. <br>Speculative multithreading is a paradigm where single-threaded applications are split into multiple threads that can be executed in parallel. These threads are generated using speculative optimizations, such as control speculation and dependence breaking, that maximize the number of instructions that can be executed in parallel.
Unfortunately, since the optimizations are speculative, this paradigm also requires hardware mechanisms to detect and recover from misspeculations. In return, multi-core architectures comprising simple cores can take advantage of this paradigm to reach performance similar to that of conventional out-of-order cores on single-threaded applications. Typical implementations of speculative multithreading can be found in [SBV95][SCZ+02][MG99][GMS+05][CMT00]. These implementations usually generate speculative threads where every thread represents a set of consecutive instructions from the original application, plus some extra instructions to handle the speculative optimizations. More recent proposals refine these models, generating threads where the original instructions are distributed among threads more aggressively [MLC+09].
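The detect-and-recover cycle at the heart of speculative multithreading can be sketched with a small Python model. This is purely illustrative (real implementations validate and squash in hardware, across cores, not in software): a speculative thread runs ahead on a predicted input value; if validation against the actual value fails, its work is squashed and re-executed.

```python
# Illustrative model of thread-level speculation: run ahead on a
# predicted value, validate against the real one, and squash/re-execute
# on a misspeculation. Real systems do this in hardware.

def run_speculative(task, predicted_input, actual_input):
    """Return (result, squashed): execute `task` speculatively on the
    prediction, then validate and roll back if the prediction was wrong."""
    result = task(predicted_input)        # speculative execution
    if predicted_input != actual_input:   # validation step
        result = task(actual_input)       # squash and re-execute
        return result, True
    return result, False

square = lambda x: x * x
print(run_speculative(square, 5, 5))   # (25, False): prediction held
print(run_speculative(square, 5, 6))   # (36, True): misspeculation, squashed
```

When predictions are usually right, the speculative work overlaps with the producer of the real value and the net effect is a speedup; the squash path is the cost of being wrong.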
 
  
= Reliability =
  
Circuit behavior is no longer deterministic in current and future technologies due to variations. Limitations in the fabrication process introduce indeterminism in the geometry (length, width and thickness) of transistors and, hence, in their behavior. Moreover, voltage and temperature oscillate, and the inputs of circuits change. Thus, the delay and power of circuits change dynamically, yet circuits must work in any feasible scenario. This issue is addressed by assuming the worst-case scenario by default to ensure circuit functionality, but such an assumption is pessimistic most of the time and very inefficient in terms of power and delay. The aim of our research is to find solutions that make circuits more efficient in the presence of variations. The result will be a set of strategies, techniques and circuit designs that improve the performance and power of such circuits by adapting their operation to the common case instead of assuming worst-case conditions.<br>Some works have already been proposed to exploit variations. TIMERRTOL [Uht00] and Razor [EKD+03] exploit variations by using results after the common-case delay and checking their correctness after the worst-case delay. There have been works on new memory circuit designs [VSP+09], as well as techniques to reduce the performance impact of process variations [LCW+07]. <br>The main groups working in the area of variations are the Department of Electrical and Computer Engineering at the University of Rhode Island, the Department of Electrical Engineering at Princeton University, the Division of Engineering and Applied Sciences at Harvard University, and the Department of Electrical Engineering and Computer Science at the University of Michigan.
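The common-case versus worst-case trade-off that TIMERRTOL and Razor exploit can be illustrated with a toy latency model in Python. All numbers (common-case delay, worst-case delay, recovery penalty, path delays) are invented for illustration: an operation is sampled at the common-case delay and a check at the worst-case delay triggers recovery only when the path was actually slow.

```python
# Toy model of Razor-style common-case timing (invented parameters):
# use the fast sample when the actual path delay fits in the common
# case; otherwise detect the timing error and pay a recovery penalty.

def razor_latency(path_delay, common_case=10, worst_case=15, recovery=5):
    """Cycles spent on one operation under common-case clocking."""
    if path_delay <= common_case:
        return common_case               # fast sample was correct
    return worst_case + recovery         # error detected, re-execute

# With mostly fast paths and one slow outlier, the average beats the
# pessimistic worst-case design point of 15 cycles:
delays = [8, 9, 7, 14, 8]
avg = sum(razor_latency(d) for d in delays) / len(delays)
print(avg)  # 12.0
```

The win depends on the slow case being rare: if many paths exceed the common-case delay, the recovery penalty erases the benefit, which is why these schemes adapt the operating point to observed error rates.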
  
= Virtual Machines =
  
 
Co-designed virtual machines [SN05] are an attractive vehicle for designing complexity- and power-effective processors. In this paradigm, a processor is a co-design effort between hardware and software. The software layer wraps the hardware and provides a common interface to the outside world (operating system and applications). This layer performs dynamic translation from a source ISA (the one visible to the operating system and applications), and optimizes the source code to better exploit the capabilities of the hardware layer underneath. <br>Several research proposals have shown the potential benefits of co-designed virtual machines, as well as the benefits of dynamic optimization. Transmeta Crusoe [Kla00], IBM DAISY [EA96] and IBM BOA [AGS+99] leverage the concept to build systems around a low-complexity, low-power VLIW hardware layer able to execute general-purpose x86 code. In these proposals, the translation from x86 to the VLIW ISA is a central feature, and it imposes a significant overhead on the system. <br>RePlay [FBC+01] and PARROT [RAM+04] remove the translation overhead and concentrate on dynamically optimizing the most frequently executed sections of applications. However, these projects rely on hardware to perform code optimization, which limits the flexibility of the system. A software optimizer can perform more complex analyses and optimizations than a hardware-based scheme, and it can be updated even after the chip is built. Moreover, a hardware optimizer adds complexity to the hardware, which may increase power consumption and validation cost. <br>The goal of our research in this arena is to propose a complete design of a system based on a combined hardware/software effort. To do so, we will first investigate an efficient and flexible co-designed system that overcomes the limitations of previous proposals. Then, we will investigate novel techniques to perform dynamic optimization in the software layer of the co-designed virtual machine. These techniques must be able to adapt applications to the hardware underneath. Finally, we will investigate enhancements to the hardware layer that allow better interaction with the wrapping software layer. <br>Numerous research groups have focused on generic dynamic binary optimization. However, very few have concentrated on co-designed virtual machines, where the final functionality of a processor is transparently provided by the most efficient balance between hardware and software. In addition to the aforementioned projects, the research group led by the recently retired professor Jim Smith [SN05], with whom we have collaborated closely for more than 10 years, has worked on this topic. <br>Given the increasing complexity of current processors in terms of energy consumption, area and validation, we strongly believe that this research topic will move up the research agenda of many groups in the coming years. In fact, more and more groups already advocate using software to perform tasks that are too complex to implement in hardware, even if these proposals are not aligned with the concept of co-designed virtual machines.
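The translate-cache-reoptimize loop of such a software layer can be sketched in miniature. The toy stack ISA, the hotness threshold and the constant-folding pass below are invented for illustration and do not correspond to Crusoe, DAISY, BOA or any other cited system.

```python
from collections import Counter

def translate(block):
    """Naive translation: one host closure per guest instruction."""
    host = []
    for op, *args in block:
        if op == "push":
            host.append(lambda st, v=args[0]: st.append(v))
        elif op == "add":
            host.append(lambda st: st.append(st.pop() + st.pop()))
    return host

def optimize(block):
    """Toy optimizer: fold push/push/add into one push of the precomputed sum."""
    out, i = [], 0
    while i < len(block):
        if (i + 2 < len(block) and block[i][0] == block[i + 1][0] == "push"
                and block[i + 2][0] == "add"):
            out.append(("push", block[i][1] + block[i + 1][1]))
            i += 3
        else:
            out.append(block[i])
            i += 1
    return out

HOT_THRESHOLD = 2            # re-optimize a block once it proves hot
cache, counts = {}, Counter()

def execute(name, block):
    counts[name] += 1
    if name not in cache:
        cache[name] = translate(block)            # cold: quick naive translation
    elif counts[name] == HOT_THRESHOLD:
        cache[name] = translate(optimize(block))  # hot: spend effort optimizing
    stack = []
    for host_op in cache[name]:
        host_op(stack)
    return stack[-1]

block = [("push", 2), ("push", 3), ("add",)]
for _ in range(3):
    result = execute("b0", block)
print(result)  # 5 on every run, before and after re-optimization
```

The key property, mirrored from the text above, is that the optimizer lives in software: it can be arbitrarily sophisticated and updated after the chip ships, while the hardware only ever sees the translated host operations.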
 
  
= References =
  
[AC71] A. V. Aho, M. J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333–340, 1975.<br>
[AGS+99] E. Altman, M. Gschwind, S. Sathaye, S. Kosonocky, A. Bright, J. Fritts, P. Ledak, D. Appenzeller, C. Agricola, Z. Filan, “BOA: The Architecture of a Binary Translation Engine”, IBM Research Report RC 21665 (97500), 1999.<br>
[AHK00] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, “Clock Rate vs. IPC: The End of the Road for Conventional Microprocessors”, in Proceedings of the 27th International Symposium on Computer Architecture, 2000.<br>
[BCS09] S. Byna, Y. Chen and X. Sun. “Taxonomy of Data Prefetching for Multicore Processors”. Journal of Computer Science and Technology, 24(3):405-417, 2009.<br>
[BGK96] D. Burger, J. R. Goodman, and A. Kägi, “Quantifying Memory Bandwidth Limitations of Current and Future Microprocessors”, in Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.<br>
[BKS+08] C. Bienia, S. Kumar, J. P. Singh, and K. Li. “The PARSEC Benchmark Suite: Characterization and Architectural Implications”, in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 72–81, 2008.<br>
[BoB09] R. Borgo, K. Brodlie. “State of The Art Report on GPU”, Visualization and Virtual Reality Research Group, School of Computing, University of Leeds, VizNET Report Ref: VIZNET-WP4-24-LEEDS-GPU, 2009.<br>
[BP05] Z. K. Baker, V. K. Prasanna. High-throughput Linked-Pattern Matching for Intrusion Detection Systems. ANCS 2005.<br>
[CGJ09] Q. Cai, Y. Gui, and R. Johnson. Exploiting Unix File-system Races via Algorithmic Complexity Attacks. Proceedings of the IEEE Symposium on Security and Privacy, 2009.<br>
[ChS08] H. Chang and W. Sung, “Efficient Vectorization of SIMD Programs with Non-aligned and Irregular Data Access Hardware”, CASES’08, October 19–24, 2008.<br>
[CMT00] M. Cintra, J. F. Martinez and J. Torrellas, “Architectural Support for Scalable Speculative Parallelization in Shared-Memory Systems”, in Proc. of the 27th Int. Symp. on Computer Architecture, 2000.<br>
[CW03] S. A. Crosby and D. S. Wallach. Denial of Service via Algorithmic Complexity Attacks. In USENIX Security, Aug. 2003.<br>
[DLM09] D. Dice, Y. Lev, M. Moir, and D. Nussbaum, “Early Experience with a Commercial Hardware Transactional Memory Implementation”, in Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, 2009.<br>
[EA96] K. Ebcioglu, E. R. Altman, “DAISY: Dynamic Compilation for 100% Architectural Compatibility”, IBM Research Report RC 20538, 1996.<br>
[EKD+03] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, T. Mudge. Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation. Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003.<br>
[EMJ09] E. Ebrahimi, O. Mutlu, C. J. Lee, and Y. N. Patt. “Coordinated Control of Multiple Prefetchers in Multi-core Systems”. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009.<br>
[FBC+01] B. Fahs, S. Bose, M. Crum, B. Slechta, F. Spadini, T. Tung, S. Patel, S. Lumetta, “Performance Characterization of a Hardware Mechanism for Dynamic Optimization”, in Procs. of the 34th International Symposium on Microarchitecture (MICRO-34), 2001.<br>
[F07] J. Fang. Parallel Programming Environment: A Key to Translating Tera-Scale Platforms into a Big Success. Keynote at the International Symposium on Code Generation and Optimization, March 2007.<br>
[GMS+05] C. García, C. Madriles, J. Sánchez, P. Marcuello, A. González and D. M. Tullsen, “Mitosis Compiler: An Infrastructure for Speculative Threading Based on Pre-computation Slices”, in Proc. of the Int. Conf. on Programming Language Design and Implementation, pp. 269-278, 2005.<br>
[H05] H. P. Hofstee. Power Efficient Processor Architecture and the Cell Processor. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA-11), February 2005.<br>
[HST09] N. Hua, H. Song, and T. V. Lakshman. Variable-Stride Multi-Pattern Matching For Scalable Deep Packet Inspection. Proceedings of IEEE INFOCOM, April 2009.<br>
[HWC04] L. Hammond, V. Wong, M. Chen, B. Carlstrom, J. Davis, B. Hertzberg, M. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, “Transactional Memory Coherence and Consistency”, in Proceedings of the 31st International Symposium on Computer Architecture, 2004.<br>
[KBK02] C. Kim, D. Burger, and S. W. Keckler, “An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches”, in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.<br>
[KFJ04] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. In Proceedings of the 31st Annual International Symposium on Computer Architecture, 2004.<br>
[Kla00] A. Klaiber, “The Technology Behind the Crusoe Processors”, white paper, January 2000.<br>
[LCW+07] X. Liang, R. Canal, G.-Y. Wei, D. Brooks. Process Variation Tolerant 3T1D-Based Cache Architectures. Proceedings of the 40th International Symposium on Microarchitecture (MICRO-40), December 2007.<br>
[LNO+08] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym. “NVIDIA Tesla: A Unified Graphics and Computing Architecture”, IEEE Micro, 28(2):39-55, March-April 2008.<br>
[LuH07] D. Luebke and G. Humphreys. “How GPUs Work”, Computer, 40(2):96-100, 2007.<br>
[M97] D. Matzke. “Will Physical Scalability Sabotage Performance Gains?”, IEEE Computer, September 1997.<br>
[MB05] C. McNairy and R. Bhatia. Montecito: A Dual-Core, Dual-Thread Itanium Processor. IEEE Micro, 25(2):10–20, March-April 2005.<br>
[MG99] P. Marcuello and A. González, “Clustered Speculative Multithreaded Processors”, in Proc. of the 13th Int. Conf. on Supercomputing, pp. 365-372, 1999.<br>
[MLC+09] C. Madriles, F. Latorre, J. M. Codina, E. Gibert, P. López, A. Martínez, R. Martínez and A. González, “Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading”, in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, September 2009.<br>
[MO07] T. Moscibroda, O. Mutlu. Memory Performance Attacks: Denial of Memory Service in Multi-core Systems. In the 16th USENIX Security Symposium, 2007.<br>
[NVI08] NVIDIA Corporation. “NVIDIA CUDA Programming Guide”, 2.0 edition, 2008.<br>
[OHL+08] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. “GPU Computing”, Proceedings of the IEEE, vol. 96, pp. 879-899, 2008.<br>
[PN98] T. Ptacek and T. Newsham. Insertion, Evasion and Denial of Service: Eluding Network Intrusion Detection. Secure Networks, Inc., January 1998.<br>
[PY08] P. Piyachon, Y. Luo. Design of a High Performance Pattern Matching Engine Through Compact Deterministic Finite Automata. Proceedings of ACM DAC 2008.<br>
[RAM+04] R. Rosner, Y. Almog, M. Moffie, N. Schwartz, A. Mendelson, “Power Awareness through Selective Dynamically Optimized Traces”, in Procs. of the 31st International Symposium on Computer Architecture (ISCA-31), 2004.<br>
[Ro99] M. Roesch. SNORT - Lightweight Intrusion Detection for Networks. In LISA '99: USENIX 13th Systems Administration Conference, 1999.<br>
[RWP05] G. Ren, P. Wu, D. Padua. “An Empirical Study on Vectorization of Multimedia Applications for Multimedia Extensions”, Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05), 2005.<br>
[SBV95] G. S. Sohi, S. E. Breach and T. N. Vijaykumar, “Multiscalar Processors”, in Proc. of the 22nd Int. Symp. on Computer Architecture, pp. 414-425, 1995.<br>
[SCS+08] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, P. Hanrahan, S. Junkins, A. Lake, J. Sugerman, “Larrabee: A Many-core x86 Architecture for Visual Computing”, in SIGGRAPH ’08: ACM SIGGRAPH 2008 Papers, ACM Press, New York, NY, 2008.<br>
[SCZ+02] J. Steffan, C. Colohan, A. Zhai and T. Mowry, “Improving Value Communication for Thread-Level Speculation”, in Proc. of the 8th Int. Symp. on High Performance Computer Architecture, pp. 58-62, 2002.<br>
[SEJ06] R. Smith, C. Estan, and S. Jha. Backtracking Algorithmic Complexity Attacks against a NIDS. In ACSAC 2006.<br>
[SEJ08] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. IEEE Symposium on Security and Privacy, May 2008.<br>
[SFS00] J. E. Smith, G. Faanes, R. Sugumar, “Vector Instruction Set Support for Conditional Operations”, International Symposium on Computer Architecture, pages 260-269, 2000.<br>
[Shi07] J. Shin, “Introducing Control Flow into Vectorized Code”, 16th International Conference on Parallel Architectures and Compilation Techniques, 2007.<br>
[SN05] J. E. Smith and R. Nair, “Virtual Machines: Versatile Platforms for Systems and Processes”, Morgan Kaufmann Publishers, 2005.<br>
[SPN96] A. Saulsbury, F. Pong and A. Nowatzyk, “Missing the Memory Wall: The Case for Processor/Memory Integration”, in Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.<br>
[TEL95] D. Tullsen, S. J. Eggers, H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995.<br>
[TS05] L. Tan, T. Sherwood. A High Throughput String Matching Architecture for Intrusion Detection and Prevention. Proceedings of ISCA 2005, pages 112-122.<br>
[Uht00] A. K. Uht. Achieving Typical Delays in Synchronous Systems via Timing Error Toleration. Tech. Rep. No. 032000-0100, Dept. of Electrical and Computer Engineering, University of Rhode Island, 2000.<br>
[VSP+09] A. Valero, J. Sahuquillo, S. Petit, V. Lorente, R. Canal, P. Lopez, J. Duato. A Hybrid eDRAM/SRAM Macrocell to Implement First-Level Data Caches. Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), December 2009.<br>
[WWW07] Quad-Core Intel® Xeon® Processor 5300 Series. August 2007.<br>
[WP07] White paper. The Manycore Shift: Microsoft Parallel Computing Initiative Ushers Computing into the Next Era. November 2007.<br>
[WWW207] Quad-Core AMD Opteron Processors for Server and Workstation, 2007.
= Projects Bureaucracy =

Revision as of 21:21, 30 January 2013

However, achieving high performance on modern architectures requires efficient utilization of SIMD units: algorithms must be able to exploit the full SIMD width and avoid wasting SIMD instructions on low-utilization cases. Both Intel SSE and PowerPC AltiVec expose a relatively small SIMD width of four. Applying vectorization techniques to architectures with such SIMD extensions is often complicated because these extensions are largely non-uniform, supporting specialized functionality and a limited set of data types. Vectorization is often further impeded by the SIMD memory architecture, which typically provides access to contiguous memory items only, often with additional alignment restrictions. Computations, on the other hand, may access data elements in an order that is neither contiguous nor adequately aligned. Bridging this gap efficiently requires careful use of special mechanisms, including permute, pack/unpack and other instructions, which incur additional performance penalties and complexity.
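One standard way to vectorize a loop that contains control flow is if-conversion with per-lane masks, the idea behind the conditional-operation support cited below [SFS00][Shi07]: both branch arms execute on full vectors and a blend selects the right result per lane. The sketch below models WIDTH-element vector registers with plain Python lists; WIDTH = 4 mirrors the SSE/AltiVec case discussed above, and the helper names are illustrative, not real intrinsics.

```python
WIDTH = 4  # lanes per vector register, as in 4-wide SSE/AltiVec

def scalar(xs):
    """The original scalar loop: a data-dependent branch per element."""
    return [x * 2 if x > 0 else -x for x in xs]

def vector_blend(mask, a, b):
    """blendv-style select: take a[i] where mask[i] is set, else b[i]."""
    return [ai if m else bi for m, ai, bi in zip(mask, a, b)]

def vectorized(xs):
    assert len(xs) % WIDTH == 0, "assume input padded to a multiple of WIDTH"
    out = []
    for i in range(0, len(xs), WIDTH):
        v = xs[i:i + WIDTH]
        mask = [x > 0 for x in v]     # vector compare -> per-lane predicate
        then_v = [x * 2 for x in v]   # both arms computed on all lanes...
        else_v = [-x for x in v]
        out.extend(vector_blend(mask, then_v, else_v))  # ...then blended
    return out

data = [3, -1, 0, 5, -2, 7, -4, 1]
assert vectorized(data) == scalar(data)
```

The cost noted in the text is visible here: both arms are always computed, so lanes with a rarely-taken branch waste work — which is why low SIMD utilization hurts.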
However, given the small cost and potentially high benefit of increasing the SIMD width, it seems likely that future architectures will explore larger SIMD widths, as Nvidia’s Fermi and Intel’s Larrabee [SCS+08] already do. Larrabee greatly increases the flexibility and programmability of the architecture compared to standard GPUs. Its approach is based on extending each CPU core with a wide vector unit featuring scatter-gather capability, as well as support for predicated execution. On the other hand, available compilers have limitations that prevent vectorizing loops with control flow, non-contiguous and irregular data accesses, data dependences, nested loops, or an unknown number of iterations [RWP05], all of which are present in most of the main loops of emerging applications. Some works address the control flow problems [SFS00][Shi07], as well as irregular data access [ChS08].
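Gather and scatter, the capability highlighted above for Larrabee, can be modeled in a few lines: a gather reads one element per lane from arbitrary indices, a scatter writes lanes back, letting indirect accesses such as `a[idx[i]]` vectorize. The helper functions are hypothetical stand-ins, not any real intrinsic API.

```python
def gather(mem, indices):
    """One wide load from non-contiguous addresses: mem[indices[lane]] per lane."""
    return [mem[i] for i in indices]

def scatter(mem, indices, values):
    """One wide store to non-contiguous addresses (later lanes win on conflicts)."""
    for i, v in zip(indices, values):
        mem[i] = v

table = [10, 20, 30, 40, 50]
idx = [4, 0, 2, 0]            # non-contiguous, even duplicated, indices
v = gather(table, idx)        # replaces 4 scalar loads with one vector op
print(v)  # [50, 10, 30, 10]
```

Without such hardware support, each of these lanes would need a separate scalar memory access plus pack/unpack shuffles, which is exactly the alignment/contiguity gap described above.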

= Intrusion Detection Systems =

Computing systems today operate in an environment of seamless connectivity, with attacks being continuously created and propagated through the Internet. There is a clear need to provide an additional layer of security in routers. Intrusion Detection Systems (IDS) are emerging as one of the most promising ways of protecting systems on the network against suspicious activities. By monitoring traffic in real time, an IDS can detect and also take preventive actions against suspicious activities. Network-based IDS, usually deployed in routers, have emerged as an effective and efficient defense for systems on the network.
The deployment of network-based IDS in routers poses a very interesting challenge. With line rates doubling roughly every 24 months, especially in backbone routers, the IDS must scale in performance accordingly. For example, an IDS deployed in a state-of-the-art backbone router inspects packets streaming at 40 Gbps and scans them for more than 23,000 attack signatures. This is clearly a tremendous performance challenge, and performance is therefore a key factor for the efficient functioning of an IDS.
An IDS detects attacks by scanning packets for attack patterns through multiple-pattern matching. Patterns can be expressed either as fixed strings or as regular expressions. The Aho-Corasick algorithm [AC71] is commonly used by IDS [Ro99] for fixed-string matching. In the Aho-Corasick algorithm, a finite state machine (FSM) is constructed from the attack signatures and subsequently traversed using the bytes of each packet. The main advantage of the Aho-Corasick algorithm is that it runs in time linear in the number of input bytes, regardless of the number of attack signatures. Its main disadvantage lies in devising a practical implementation, due to the large memory needed to store the FSM. Hence one of the primary areas of focus in the IDS research community is devising performance- and area-efficient architectures for the Aho-Corasick algorithm.
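The construction and traversal can be sketched as follows (a compact illustration of the algorithm, not an IDS-grade implementation; production engines compress the transition table precisely because of the memory problem noted above):

```python
from collections import deque

def build_aho_corasick(patterns):
    """Build the Aho-Corasick FSM: a trie over the patterns plus
    failure links computed breadth-first."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:                       # 1. build the trie
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    queue = deque(goto[0].values())            # 2. BFS for failure links
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]             # inherit matches ending here
            queue.append(t)
    return goto, fail, out

def scan(text, fsm):
    """One FSM transition per input byte: time linear in len(text),
    independent of the number of patterns."""
    goto, fail, out = fsm
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits += [(i - len(p) + 1, p) for p in out[state]]
    return hits
```

With tens of thousands of signatures the `goto` tables dominate storage, which is why area-efficient encodings of this FSM are a research topic in their own right.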
IDS also increasingly use regular expressions to specify attack signatures, owing to their rich expressive power. To be matched, regular expressions are first converted to finite automata (deterministic or non-deterministic), and these automata are then traversed using the bytes of each packet. However, these automata are either inefficient with respect to chip area (in the case of Deterministic Finite Automata) or inefficient in performance (in the case of Non-deterministic Finite Automata).
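The space/time trade-off can be seen in miniature with a hand-built DFA for a single expression (an illustrative toy; a real IDS compiles thousands of signatures into one automaton, which is exactly where the DFA state table explodes):

```python
# Hand-built DFA (an assumption for illustration) for the regex 'ab+c':
# state -> {byte: next_state}; state 3 is accepting. A DFA needs exactly
# one table lookup per input byte (fast), while the equivalent NFA is
# compact but may have to track several active states per byte.
DFA = {
    0: {'a': 1},
    1: {'b': 2},
    2: {'b': 2, 'c': 3},
    3: {},
}
ACCEPT = {3}

def dfa_search(text):
    """Report end positions where 'ab+c' matches. The restart on a failed
    transition is deliberately simplistic (no failure links)."""
    matches, state = [], 0
    for i, ch in enumerate(text):
        state = DFA[state].get(ch)
        if state is None:
            state = DFA[0].get(ch, 0)
        if state in ACCEPT:
            matches.append(i)
            state = 0
    return matches
```

Each input byte costs one dictionary lookup regardless of pattern complexity; the price is the size of `DFA`, which for combined signature sets can grow exponentially.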
A key requirement for the effectiveness of an IDS is that it must process packets at the rate at which they stream. Failing to do so results in either undetected malicious packets or expensive packet drops. An adversary can also deliberately drive the IDS into this state of not being able to process packets at wire speed. Such attempts are commonly referred to as evasion [CW03, PN98], and stem from weaknesses in some part of IDS processing. The nature and ease of evasion make it very appealing for malicious hosts to bypass the IDS.
There have been numerous works in the area of improving the performance and area efficiency of pattern matching (fixed strings and regular expressions). [TS05, PY08] propose various novel techniques to significantly improve the performance and area efficiency of pattern matching in IDS. In the area of regular expression matching, numerous works [HST09, SEJ08] have proposed improvements to DFA storage and traversal, and [BP05] has proposed techniques for NFAs using reconfigurable hardware. [CW03, SEJ06] have studied various sophisticated attacks against IDS and secure defense mechanisms, while [CGJ09, MO07] have studied similar attack and defense mechanisms against the Unix file system and banked memory in multi-cores, respectively.
Broadly, we plan to address these issues using a combined hardware/software approach: the software approach focuses on improving area efficiency, while the hardware approach improves performance efficiency.

= Memory Hierarchy =

Innovation and technological improvements in processor design have outpaced advances in memory design over the last ten years, causing an increasing gap between processor and memory speeds. During the last decade this has led to an approach involving concurrent execution, initially through the execution of multiple threads in one processor and now with the inclusion of multiple cores in a single chip. Unfortunately, the advent of chip multiprocessors (CMPs) has made the problem even worse, due to increased bandwidth requirements and contention on the memory controller. This widening speed gap has motivated current high-performance processors to focus on cache organization, the register file and prefetching techniques to tolerate growing memory latencies [BCS09], [BGK96], [SPN96]. Furthermore, power dissipation is becoming a critical issue for microprocessors: it determines the cost of the cooling system and ultimately may limit the performance of the microprocessor.
Prefetching, which decouples and overlaps the computation and transfer of data, is a well-known technique commonly employed to hide memory latencies [BCS09]. However, although aggressive prefetching mechanisms are, for most applications, beneficial for tolerating memory latencies in single-core processors, when prefetching is performed on multiple cores of a CMP the performance gain of individual cores can be greatly reduced compared to systems without prefetching [EMJ09]. This is caused by the prefetching mechanisms interfering with each other in the shared resources.
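The single-core benefit can be illustrated with a toy direct-mapped cache plus a next-line prefetcher (an illustrative model with invented parameters, not any real hardware; the inter-core interference discussed above does not appear in a single-core model like this one):

```python
def run(addresses, num_lines=64, prefetch=False):
    """Count misses of a direct-mapped cache (4 words per line) over an
    address trace, optionally with a next-line prefetcher that fetches
    line N+1 whenever line N is touched."""
    cache = [None] * num_lines          # one tag per cache line
    misses = 0
    for addr in addresses:
        line = addr // 4
        if cache[line % num_lines] != line:
            misses += 1                 # demand miss: fetch the line
            cache[line % num_lines] = line
        if prefetch:                    # bring the next line in ahead of use
            nxt = line + 1
            cache[nxt % num_lines] = nxt
    return misses

seq = list(range(256))                  # sequential sweep: prefetch-friendly
misses_base = run(seq)
misses_pf = run(seq, prefetch=True)
```

On this sequential sweep the prefetcher hides all but the first miss; on a CMP, those same prefetches would compete for memory bandwidth and shared cache space with other cores, which is the interference problem noted above.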
One of the greatest challenges that appeared with this shift in chip organization lies in how users will exploit CMPs. Parallel programming models, which divide an application into several tasks that can be executed concurrently, seem to be the best alternative for taking advantage of CMP resources. Unfortunately, current programming models implement blocking synchronization, where critical sections are serialized in order to ensure mutual exclusion. Blocking synchronization increases the complexity of parallel programming and significantly degrades the performance of parallel applications. This fact has encouraged the development of optimistic programming models that use non-blocking synchronization. In these programming models, critical sections are executed simultaneously, requiring modifications in the memory hierarchy to guarantee the correctness of the execution [HWC04] [DLM09].
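The optimistic idea can be sketched in a few lines (a software simplification with invented names; real hardware transactional memory detects conflicts in the cache hierarchy rather than with a version counter): instead of serializing the whole critical section under a lock, each thread computes on a snapshot and retries if another thread committed in between.

```python
import threading

class VersionedCell:
    """Optimistic-concurrency sketch: the 'critical section' (fn) runs
    without holding a lock; only the tiny commit step is serialized."""
    def __init__(self, value=0):
        self.value, self.version = value, 0
        self._commit_lock = threading.Lock()

    def transact(self, fn):
        while True:
            seen = self.version            # read version BEFORE value
            snapshot = self.value
            new_value = fn(snapshot)       # speculative work, unlocked
            with self._commit_lock:
                if self.version == seen:   # no conflicting commit: publish
                    self.value = new_value
                    self.version = seen + 1
                    return new_value
            # conflict detected: retry with a fresh snapshot

cell = VersionedCell(0)
threads = [threading.Thread(target=lambda: [cell.transact(lambda v: v + 1)
                                            for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```

Every increment either commits against an unchanged version or retries, so no update is lost despite the critical sections running concurrently.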
On the other hand, the increasing influence of wire delay on cache design means that access latencies to the last-level cache banks are no longer constant [AHK00], [M97]. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem [KBK02]. A NUCA divides the whole cache memory into smaller banks and allows nearer cache banks to have lower access latencies than farther banks, thus mitigating the effect of the cache's internal wires.
We will propose several microarchitectural techniques that can be applied to various parts of current microprocessor designs to improve the memory system and to boost the execution of instructions. Some techniques will attempt to ease the gap between processor and memory speeds, while the others will attempt to alleviate the serialization caused by data dependences.

= Multithreaded Processors =

Industry and researchers are making a shift towards multi-core architectures [WWW07] [WP07] [WWW207] [F07]. This shift is mainly motivated by two factors: on the one hand, we have reached a point where further exploiting instruction-level parallelism (ILP) gives diminishing returns, so other types of parallelism are needed. On the other hand, new feature sizes allow a greater number of transistors to be implemented on a chip. This increase in the number of transistors opens the possibility of integrating multiple cores on die, so that multiple applications, or threads from the same application, can run in parallel and achieve good performance by exploiting thread-level parallelism (TLP) [KFJ+04].
The ability to execute multiple threads in parallel is called multithreading, and it can be implemented in several ways. On the one hand, implementing multiple cores allows multiple threads to run in parallel. On the other hand, each of these cores can also execute more than one thread at the same time using techniques such as simultaneous multithreading, fine-grain multithreading [TEL95] or switch-on-event multithreading [MB05].
Implementing multiple simple cores on a chip makes the number of cores available in a processor increase every year, and companies are sometimes making strong bets by designing processors able to exploit TLP very efficiently at the expense of sacrificing ILP [H05]. However, these novel architectures consisting of simple cores will have to compete with current out-of-order processors, which clearly outperform them in the ILP arena.
Speculative multithreading is a paradigm where single-threaded applications are split into multiple threads that can be executed in parallel. These threads are generated using speculative optimizations, such as control speculation and dependence breaking, that maximize the number of instructions that can be executed in parallel. Since the optimizations are speculative, hardware mechanisms are also required to detect and recover from misspeculations. In exchange, multi-core architectures built from simple cores can take advantage of this paradigm to reach performance similar to that of conventional out-of-order cores on single-threaded applications. Typical implementations of speculative multithreading can be found in [SBV95][SCZ+02][MG99][GMS+05][CMT00]. These implementations usually generate speculative threads where every thread represents a set of consecutive instructions from the original application, plus some extra instructions to handle the speculative optimizations. More recent proposals refine these models, generating threads where the original instructions are more aggressively distributed among threads [MLC+09].

= Reliability =

Circuit behavior is no longer deterministic in current and future technologies due to variations. Limitations in the fabrication process introduce indeterminism in the geometry (length, width and thickness) of transistors, and hence in their behavior. Moreover, voltage and temperature oscillate, and the inputs of circuits change. Thus, the delay and power of circuits change dynamically, yet circuits must work in any feasible scenario. This issue is typically addressed by assuming the worst-case scenario to ensure circuit functionality, but such an assumption is pessimistic most of the time and very inefficient in terms of power and delay. The aim of our research is to find solutions that make circuits more efficient in the presence of variations. The result of our research will be a set of strategies, techniques and circuit designs that improve the performance and power of such circuits by adapting their operation to the common case instead of assuming worst-case conditions.
Some works have already been proposed to exploit variations. TIMERRTOL [Uht00] and Razor [EKD+03] exploit variations by using results after the common-case delay and checking their correctness after the worst-case delay. There has also been work on new memory circuit designs [VSP+09], as well as techniques to reduce the performance impact of process variations [LCW+07].
The main groups working in the area of variations are the Department of Electrical and Computer Engineering at the University of Rhode Island, the Department of Electrical Engineering at Princeton University, the Division of Engineering and Applied Sciences at Harvard University, and the Department of Electrical Engineering and Computer Science at the University of Michigan.

= Virtual Machines =

Co-designed virtual machines [SN05] are an attractive vehicle for designing complexity- and power-effective processors. In this paradigm, a processor is a co-design effort between hardware and software. The software layer wraps the hardware and provides a common interface to the outside world (the operating system and applications). This software layer performs dynamic translation from a source ISA (the one visible to the operating system and applications) to the ISA implemented by the hardware layer underneath, and can also optimize the source code to better exploit the capabilities of that hardware.
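The interpret-then-translate flow at the heart of such a software layer can be sketched as follows (a toy with invented names and a trivial "source ISA"; a real co-designed VM translates to the hardware's native instructions and applies far more aggressive optimization):

```python
HOT_THRESHOLD = 2   # invented tuning parameter for this sketch

def interpret(block, x):
    """Slow path: decode every operation on every execution."""
    for op, arg in block:
        x = x + arg if op == "add" else x * arg
    return x

def translate(block):
    """One-time 'translation': decode once into host code (here, a
    Python closure standing in for native instructions)."""
    ops = list(block)
    def host_code(x):
        for op, arg in ops:
            x = x + arg if op == "add" else x * arg
        return x
    return host_code

class CodesignedVM:
    def __init__(self):
        self.counts, self.tcache = {}, {}   # profile counters, translation cache
    def run_block(self, name, block, x):
        if name in self.tcache:             # fast path: translated code
            return self.tcache[name](x)
        self.counts[name] = self.counts.get(name, 0) + 1
        if self.counts[name] >= HOT_THRESHOLD:
            self.tcache[name] = translate(block)
        return interpret(block, x)
```

Cold blocks are interpreted; once a block turns hot it is translated once, cached, and every later execution skips decoding, which is the basic amortization that makes the translation overhead discussed below acceptable.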
Several proposals in the research arena have shown the potential benefits of a co-designed virtual machine, as well as the benefits of dynamic optimization. Transmeta Crusoe [Kla00], IBM DAISY [EA96] and IBM BOA [AGS+99] leverage the concept of a co-designed virtual machine to build systems based on a low-complexity, low-power VLIW hardware layer able to execute general-purpose x86 code. In these proposals, the translation from x86 to the VLIW ISA is an important feature, and it imposes a significant overhead on the system.
RePlay [FBC+01] and PARROT [RAM+04] remove the translation overhead and concentrate their efforts on dynamically optimizing the most frequently executed sections of applications. However, these projects rely on hardware to perform code optimization, which limits the flexibility of the system. A software optimizer can perform more complex analyses and optimizations than a hardware-based scheme, and it can be updated many times even after the chip is built. Moreover, a hardware optimizer adds complexity to the hardware, which may result in increased power consumption and additional validation cost.
The goal of our research in this arena is to propose a complete design of a system based on a combined hardware/software effort. To do so, we will first investigate an efficient and flexible co-designed system that overcomes the limitations of previous proposals. Then, we will investigate novel techniques to perform dynamic optimization through the co-designed virtual machine software layer; these techniques must be able to adapt applications to the hardware underneath. Finally, we will investigate enhancements to the hardware layer that allow a better interaction with the wrapper software layer.
Numerous research groups have focused on generic dynamic binary optimization. However, very few of them have concentrated their efforts on the concept of co-designed virtual machines, where the final functionality of a processor is transparently provided by the most efficient balance between hardware and software. In addition to the aforementioned projects, the research group led by the recently retired professor Jim Smith [SN05], with whom we have closely collaborated for more than 10 years, pioneered work on this topic.
Due to the increasing complexity of current processors in terms of energy consumption, area and validation, we strongly believe that this research topic will gain prominence in the research agendas of many groups in the next years. In fact, more and more groups already advocate using software to perform tasks that are too complex to implement in hardware, even when these proposals are not aligned with the concept of co-designed virtual machines.

= References =

[AC71] A. V. Aho, M. J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333–340, 1975.
[AGS+99] E. Altman, M. Gschwind, S. Sathaye, S. Kosonocky, A. Bright, J. Fritts, P. Ledak, D. Appenzeller, C. Agricola, Z. Filan, “BOA: The Architecture of a Binary Translation Engine”, IBM Research Report RC 21665 (97500), 1999
[AHK00] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock Rate versus IPC: The End of the Road for Conventional Microprocessors. In Proceedings of the 27th International Symposium on Computer Architecture, 2000.
[BCS09] S. Byna, Y. Chen and X. Sun. “Taxonomy of Data Prefetching for Multicore Processors”. Journal of Computer Science and Technology, 2009 24 (3): 405-417.
[BGK96] D. Burger, J. R. Goodman, and A. Kägi, “Quantifying Memory Bandwidth Limitations of Current and Future Microprocessors”, In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.
[BKS+08] C. Bienia, S. Kumar, J. P. Singh, and K. Li. "The PARSEC Benchmark Suite: Characterization and Architectural Implications", In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 72–81, 2008.
[BoB09] R. Borgo and K. Brodlie. "State of The Art Report on GPU", Visualization and Virtual Reality Research Group, School of Computing, University of Leeds, VizNET Report Ref: VIZNET-WP4-24-LEEDS-GPU, 2009.
[BP05] Z. K. Baker, V. K. Prasanna: High-throughput linked-pattern matching for intrusion detection systems. ANCS 2005.
[CGJ09] Q. Cai, Y. Gui, and R. Johnson. Exploiting Unix File-system Races via Algorithmic Complexity Attacks. Proceedings of IEEE Symposium on Security and Privacy, 2009.
[ChS08] H. Chang and W. Sung, "Efficient Vectorization of SIMD Programs with Non-aligned and Irregular Data Access Hardware", CASES'08, October 19–24, 2008.
[CMT00] M. Cintra, J.F. Martinez and J. Torrellas, “Architectural Support for Scalable Speculative Parallelization in Shared-Memory Systems”, in Proc. of the 27th Int. Symp. on Computer Architecture, 2000.         
[CW03] S. A. Crosby and D. S. Wallach. Denial of Service via Algorithmic Complexity Attacks. In USENIX Security, Aug. 2003.
[DLM09] D. Dice, Y. Lev, M. Moir, and D. Nussbaum, Early Experience with a Commercial Hardware Transactional Memory Implementation, in Proceedings of the 14th International Conference onArchitectural Support for Programming Languages and Operating Systems, 2009.
[EA96] K. Ebcioglu, E. R. Altman, “DAISY: Dynamic Compilation for 100% Architectural Compatibility”, IBM Research Report RC 20538, 1996.
[EKD+03] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, T. Mudge. Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation. Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003.
[EMJ09] E. Ebrahimi, O. Mutlu, C. J. Lee, and Y. N. Patt. “Coordinated control of multiple prefetchers in multi-core systems”. In Proceedings of the 42nd Annual IEEE/ACM international Symposium on Microarchitecture, 2009.
[FBC+01] B. Fahs, S. Bose, M. Crum, B. Slechta, F. Spadini, T. Tung, S. Patel, S. Lumetta, "Performance Characterization of a Hardware Mechanism for Dynamic Optimization", in Proceedings of the 34th International Symposium on Microarchitecture (MICRO-34), 2001.
[F07] J. Fang. Parallel Programming Environment: A Key to Translating Tera-Scale Platforms into a Big Success. Key Note at International Symposium on Code Generation and Optimization. March 2007.
[GMS+05] C. García, C. Madriles, J. Sánchez, P. Marcuello, A. González and D.M. Tullsen, “Mitosis Compiler: An Infrastructure for Speculative Threading Based on Pre-computation Slices”, in Proc. of the Int. Conf. on Programming Language Design and Implementation, pp. 269-278, 2005.
[H05] H.P. Hofstee, Power efficient processor architecture and the cell processor. In Proceedings of 11th International Symposium on High-Performance Computer Architecture HPCA-11, February 2005.
[HST09] N. Hua, H. Song, and T.V. Lakshman. Variable-Stride Multi-Pattern Matching For Scalable Deep Packet Inspection. Proceedings of IEEE INFOCOM, April 2009.
[HWC04] L. Hammond, V. Wong, M. Chen, B. Carlstrom, J. Davis, B. Hertzberg, M. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, Transactional Memory Coherence and Consistency, in Proceedings of the 31st International Symposium on Computer Architecture, 2004.
[KBK02] C. Kim, D. Burger, and S.W.Keckler, An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches, In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[KFJ+04] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. In Proceedings of the 31st Annual International Symposium on Computer Architecture, 2004.
[Kla00] A. Klaiber, “The Technology Behind the CrusoeTM Processors”, white paper, January 2000
[LCW+07] X. Liang, R. Canal, G.Y. Wei, D. Brooks. Process Variation Tolerant 3T1D-Based Cache Architectures. Proceedings of the 40th International Symposium on Microarchitecture (MICRO-40), December 2007
[LNO+08] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro, 28(2):39–55, March-April 2008.
[LuH07] D. Luebke and G. Humphreys. "How GPUs Work", Computer, 40(2):96–100, 2007.
[M97] D. Matzke, Will physical scalability sabotage performance gains?, IEEE Computer, September 1997.
[MB05] C. McNairy, and R. Bhatia. Montecito: A Dual-Core, Dual-Thread Itanium Processor. IEEE Micro, Volume 25, Issue 2, Pages 10–20. March-April 2005.
[MG99] P. Marcuello and A. González, “Clustered Speculative Multithreaded Processors”, in Proc. of the 13th Int. Conf. on Supercomputing, pp. 365-372, 1999.
[MLC+09] C. Madriles, F. Latorre, J.M. Codina, E. Gibert, P. López, A. Martínez, R. Martínez and A. González, "Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading", in proceedings of International Conference on Parallel Architectures and Compiler Techniques, September 2009.
[MO07] T. Moscibroda, O. Mutlu. Memory Performance Attacks: Denial of Memory Service in Multi-core Systems. In 16th USENIX Security Symposium 2007.
[NVI08] NVIDIA Corporation. “NVIDIA CUDA Programming Guide”, 2.0 edition, 2008.
[OHL+08] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. "GPU Computing", Proceedings of the IEEE, vol. 96, pp. 879–899, 2008.
[PN98] T. Ptacek and T. Newsham. Insertion, Evasion and Denial of Service: Eluding Network Intrusion Detection. In Secure Networks, Inc., January 1998.
[PY08] P. Piyachon, Y. Luo. Design of a High Performance Pattern Matching Engine Through Compact Deterministic Finite Automata. Proceedings of the ACM DAC 2008.
[RAM+04] R. Rosner, Y. Almog, M. Moffie, N. Schwartz, A. Mendelson, "Power Awareness through Selective Dynamically Optimized Traces", in Proceedings of the 31st International Symposium on Computer Architecture (ISCA-31), 2004.
[Ro99] M. Roesch. SNORT - Lightweight Intrusion Detection for Networks. In LISA '99: USENIX 13th Systems Administration Conference 1999.
[RWP05] G.Ren, P.Wu, D.Padua. "An Empirical Study on Vectorization of Multimedia Applications for Multimedia Extensions", Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05), 2005.
[SBV95] G.S. Sohi, S.E. Breach and T.N. Vijaykumar, “Multiscalar Processors”, in Proc. of the 22nd Int. Symp. on Computer Architecture, pp.414-425, 1995.
[SCS+08] L.Seiler, D.Carmean, E.Sprangle, T.Forsyth, M.Abrash, P.Dubey, P.Hanrahan, S.Junkins, A.Lake, J. Sugerman, “Larrabee: A Many-core x86 Architecture for Visual Computing”, In SIGGRAPH ’08: ACM SIGGRAPH 2008 Papers, ACM Press, New York, NY, 2008.
[SCZ+02] J. Steffan, C. Colohan, A. Zhai and T. Mowry, “Improving Value Communication for Thread-Level Speculation”, in Proc. of the 8th Int. Symp. on High Performance Computer Architecture, pp. 58-62, 2002.
[SEJ06] R. Smith, C. Estan, and S. Jha. Backtracking Algorithmic Complexity Attacks against a NIDS. In ACSAC 2006.
[SEJ08] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. IEEE Symposium on Security and Privacy, May 2008.
[SFS00] J.E.Smith, G.Faanes, R.Sugumar, “Vector Instruction Set Support for Conditional Operations”, International Symposium on Computer Architecture, Pages: 260 - 269, 2000.
[Shi07] J.Shin, “Introducing Control Flow into Vectorized Code”, IEEE, 16th International Conference on Parallel Architecture and Compilation Techniques, 2007.
[SN05] J. E. Smith, and R. Nair, “Virtual Machines: Versatile Platforms for Systems and Processes”, Morgan Kaufmann Publishers, 2005
[SPN96] A. Saulsbury, F. Pong and A. Nowatzyk, “Missing the Memory Wall: The Case for Processor/Memory Integration”, In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.
[TEL95] D. Tullsen, S. J. Eggers, H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995.
[TS05] L. Tan, T. Sherwood. A High Throughput String Matching Architecture for Intrusion Detection and Prevention. Proceedings of ISCA 2005, pages 112-122.
[Uht00] Uht, A. K. Achieving typical delays in synchronous systems via timing error toleration. Tech. Rep. Dept. of Electrical and Computer Engineering, No. 032000-0100, University of Rhode Island. 2000
[VSP+09] A. Valero, J. Sahuquillo, S. Petit, V. Lorente, R. Canal, P. Lopez, J. Duato. An hybrid eDRAM/SRAM macrocell to implement first-level data caches. Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), December 2009
[WWW07] Quad-Core Intel® Xeon® Processor 5300 Series. August 2007.
[WP07] White paper. The Manycore Shift Microsoft: Parallel Computing Initiative Ushers Computing into the Next Era. November 2007.
[WWW207] Quad-Core AMD Opteron Processors for Server and Workstation, 2007.

Projects Bureaucracy