Projects (ArcoWiki)
= Characterization and Acceleration of Emerging Applications  =
= Domain-Specific Architectures for Energy-Efficient Computing Systems (2024-2028) =
We are entering a new era in computing, characterized by the integration of many diverse computing devices into our daily lives. These systems will encompass a range of functions akin to human cognitive tasks, such as comprehending our surroundings (e.g., vision and language processing), learning from data and experiences, and making proactive decisions and autonomous actions (e.g., self-driving cars). This era marks the dawn of ubiquitous intelligent computing.


GPUs are specialized hardware designed to accelerate rendering and display, but they are moving in the direction of general-purpose accelerators [BoB09]. GPU vendors have recently introduced new programming models and associated hardware support to broaden the class of non-graphics applications that can efficiently use GPU hardware [NVI08][OHL+08]. There is a large, emerging and commercially relevant class of applications enabled by the significant increase in GPU computing density, such as graphics and physics for gaming, interactive simulation, data analysis, scientific computing, 3D modeling for CAD, signal processing, digital content creation, and financial analytics. The PARSEC benchmark suite [BKS+08] is a good proxy for this class of applications. Applications in these domains benefit from architectural approaches that provide higher performance through parallelism. <br>GPU capabilities excel for applications that exhibit extensive data parallelism. GPUs typically operate on a large number of data points, where the same operation is simultaneously conducted on all of these data points in the form of continuously running vectors or streams. Furthermore, to exploit data-level parallelism, modern GPUs typically batch together groups of individual threads (called warps) running the same shader program, and execute them in lock step on a SIMD pipeline [LuH07][LNO+08]. However, even with a general-purpose programming interface, mapping existing applications to the parallel architecture of a GPU is a non-trivial task. <br>Vectorization is an optimization technique that has traditionally targeted vector processors. The importance of this optimization has increased in recent years with the introduction of SIMD extensions such as Intel's SSE or IBM/Motorola's AltiVec to general-purpose processors, and with the growing significance of applications that can benefit from this functionality. 
However, achieving high performance on modern architectures requires efficient utilization of SIMD units. Doing so requires that algorithms take full advantage of the SIMD width offered and do not waste SIMD instructions on low-utilization cases. Both Intel SSE and PowerPC AltiVec expose a relatively small SIMD width of four. It is often complicated to apply vectorization techniques to architectures with such SIMD extensions because these extensions are largely non-uniform, supporting specialized functionalities and a limited set of data types. Vectorization is often further impeded by the SIMD memory architecture, which typically provides access to contiguous memory items only, often with additional alignment restrictions. Computations, on the other hand, may access data elements in an order that is neither contiguous nor adequately aligned. Bridging this gap efficiently requires careful use of special mechanisms including permute, pack/unpack, and other instructions that incur additional performance penalties and complexity. <br>However, given the small cost and potentially high benefit of increasing the SIMD width, it seems likely that future architectures will explore larger SIMD widths, as in NVIDIA's Fermi and Intel's Larrabee [SCS+08]. Larrabee greatly increases the flexibility and programmability of the architecture compared to standard GPUs. Its approach is based on extending each CPU core with a wide vector unit featuring scatter-gather capability, as well as predicated execution support. On the other hand, available compilers have limitations that prevent loop vectorization, such as control flow, non-contiguous and irregular data access, data dependences, nested loops and an undefined number of loop iterations [RWP05], which are present in most of the main loops of emerging applications. Some works address the control-flow problems [SFS00][Shi07], as well as irregular data access [ChS08]. <br>
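The control-flow limitation mentioned above is typically tackled by if-conversion: the branch is replaced by a per-element predicate and a select, so every lane does the same work and a mask picks the result. A minimal scalar sketch of the transformation (illustrative only; the function names are ours, and real compilers emit SIMD select/blend instructions for the second form):

```c
/* Scalar loop with a data-dependent branch: not directly vectorizable. */
static void clamp_branchy(const float *in, float *out, int n, float lim) {
    for (int i = 0; i < n; i++) {
        if (in[i] > lim)
            out[i] = lim;
        else
            out[i] = in[i];
    }
}

/* If-converted form: both candidate values are produced and a mask-like
   select picks one, which is the shape that SIMD predication needs. */
static void clamp_selected(const float *in, float *out, int n, float lim) {
    for (int i = 0; i < n; i++) {
        float taken    = lim;        /* result if the condition holds */
        float fallthru = in[i];      /* result if it does not         */
        int   mask     = in[i] > lim; /* per-element predicate (0/1)  */
        out[i] = mask ? taken : fallthru;
    }
}
```

Both forms compute the same result; the second wastes work on untaken paths, which is exactly the low-utilization cost the text mentions.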
However, these computing systems will necessitate significant advancements in energy efficiency, as they will execute complex tasks under stringent power constraints. While chip manufacturing process technology has been instrumental in improving energy efficiency over generations, advances in dimension scaling have recently slowed down, leading many experts to believe that dimension scaling may soon reach a halt. In this context, disruptive innovations in architecture will become pivotal in enhancing energy efficiency and driving innovation. This project focuses on developing these disruptive architecture innovations.


= Intrusion Detection Systems  =
Our approach to designing these novel architectures will be grounded in three key pillars: simplicity, minimal data movement, and hardware-software specialization. This specialization will lead to the development of domain-specific architectures. In this project, we will concentrate on two domains that we believe will be highly popular and impactful in the future: cognitive computing and graphics.


Computing systems today operate in an environment of seamless connectivity, with attacks being continuously created and propagated through the Internet. There is a clear need to provide an additional layer of security in routers. Intrusion Detection Systems (IDS) are emerging as one of the most promising ways of protecting systems on the network against suspicious activities. By monitoring the traffic in real time, an IDS can detect and also take preventive actions against suspicious activities. Network-based IDS have emerged as an effective and efficient defense for systems in the network. They are usually deployed in routers.<br>The deployment of network-based IDS in routers poses a very interesting challenge. With line rates doubling every 24 months, especially in backbone routers, the performance of the IDS needs to scale likewise. For example, an IDS deployed in a state-of-the-art backbone router inspects packets streaming at 40 Gbps and scans for more than 23,000 attack signatures in these packets. This clearly is a tremendous performance challenge, so performance is a key factor for the efficient functioning of an IDS.<br>An IDS detects attacks by scanning packets for attack patterns using multiple pattern matching. Patterns can be expressed either as fixed strings or as regular expressions. The Aho-Corasick [AC71] algorithm is commonly used by IDS [Ro99] for fixed-string matching. In the Aho-Corasick algorithm, a finite-state machine (FSM) is constructed from the attack signatures and subsequently traversed using bytes from packets. The main advantage of the Aho-Corasick algorithm is that it runs in time linear in the input bytes regardless of the number of attack signatures. However, the main disadvantage lies in devising a practical implementation due to the large memory needed to store the FSM. 
So one of the primary areas of focus in the IDS research community is devising a performance- and area-efficient architecture for the Aho-Corasick algorithm.<br>IDS also increasingly use regular expressions to specify attack signatures. Their growing popularity is due to their rich expressive power in specifying attack signatures. In order to be parsed, regular expressions are first converted to finite automata (deterministic or non-deterministic), and these automata are later traversed using bytes from packets. However, these automata are either inefficient with respect to chip area (deterministic finite automata) or inefficient in performance (non-deterministic finite automata).<br>A key requirement for the effectiveness of an IDS is that it has to process packets at the rate they are streaming. The consequences of not doing so are either undetected malicious packets or expensive packet drops. An adversary can also deliberately drive the IDS into this state of not being able to process packets at wire speed. Such attempts are commonly referred to as evasion [CW03, PN98], and stem from weaknesses in some part of IDS processing. The nature and ease of evasion make it very appealing for malicious hosts to bypass the IDS.<br>There have been numerous works in the area of improving the performance and area efficiency of pattern-matching algorithms (fixed strings and regular expressions). [TS05, PY08] propose various novel techniques to significantly improve the performance and area efficiency of pattern matching in IDS. In the area of regular expression matching, numerous works [HST09, SEJ08] have studied and proposed improvements to DFA storage and DFA traversal. Additionally, [BP05] have studied and proposed techniques for NFAs using reconfigurable hardware. [CW03, SEJ06] have studied various sophisticated attacks against the IDS and secure defense mechanisms. 
Additionally, [CGJ09, MO07] have studied similar attack and defense mechanisms against the Unix file system and against banked memory in multi-cores, respectively.<br>Broadly, we plan to address these issues using a hardware/software approach: the software approach focuses on improving area efficiency, while the hardware approach improves performance.  
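As a rough illustration of the Aho-Corasick construction described above, the sketch below builds the trie, fills the failure links by breadth-first traversal, and determinizes the goto function into a full 256-way transition table. This is what makes the per-byte scan run in constant time, and also why the FSM tables get large. All names and the node limit are arbitrary choices for this sketch; production IDS implementations compress these tables.

```c
#include <string.h>

#define ALPHA 256
#define MAXN   64              /* max trie nodes, arbitrary for this sketch */

static int go_[MAXN][ALPHA];   /* goto/transition function (node 0 = root) */
static int fail_[MAXN];        /* failure links */
static int out_[MAXN];         /* pattern id ending at this node, 0 = none */
static int nnodes;

static void ac_init(void) {
    nnodes = 1;
    memset(go_, 0, sizeof go_);
    memset(fail_, 0, sizeof fail_);
    memset(out_, 0, sizeof out_);
}

/* Insert one pattern into the trie. */
static void ac_add(const char *pat, int id) {
    int s = 0;
    for (; *pat; pat++) {
        unsigned char c = (unsigned char)*pat;
        if (!go_[s][c]) go_[s][c] = nnodes++;
        s = go_[s][c];
    }
    out_[s] = id;
}

/* BFS over the trie: compute failure links and turn the goto function
   into a complete DFA table, so the scan needs one lookup per byte. */
static void ac_build(void) {
    int queue[MAXN], head = 0, tail = 0;
    for (int c = 0; c < ALPHA; c++)
        if (go_[0][c]) { fail_[go_[0][c]] = 0; queue[tail++] = go_[0][c]; }
    while (head < tail) {
        int s = queue[head++];
        for (int c = 0; c < ALPHA; c++) {
            int t = go_[s][c];
            if (t) {
                fail_[t] = go_[fail_[s]][c];
                if (!out_[t]) out_[t] = out_[fail_[t]]; /* inherit suffix match */
                queue[tail++] = t;
            } else {
                go_[s][c] = go_[fail_[s]][c];           /* determinize */
            }
        }
    }
}

/* Scan text once; return the id of the first pattern found, 0 if none. */
static int ac_scan(const char *text) {
    int s = 0;
    for (; *text; text++) {
        s = go_[s][(unsigned char)*text];
        if (out_[s]) return out_[s];
    }
    return 0;
}
```

Note the memory cost the text refers to: even this toy fixes a full 256-entry row per node, which is exactly what motivates the compressed-FSM research cited above.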
Cognitive computing encompasses a broad range of artificial intelligence techniques, including machine learning, that enable computers to interact and think like humans (perception, reasoning, learning, decision-making, etc.). Graphics are the primary means used by most applications to display data to users. Animated graphics applications, such as games and movies, demand exceptionally high-quality graphics, low latency, and often operate under tight power constraints (e.g., mobile devices). The ultimate objective of this research project is to design novel domain-specific architectures that provide exceptional user experiences in these two domains.


= Memory Hierarchy  =
= Domain-Specific Architectures for Energy-Efficient Computing Systems (2020-2025) =
We are on the verge of transitioning to a new era in computing, characterized by an abundance of very different computing devices integrated in most systems that surround us in our daily lives. Moreover, these computing systems will include a rich set of functions that are similar to human cognitive tasks, such as the ability to comprehend our surroundings (e.g., vision and language processing), learn from data and experiences, and proactively take decisions and autonomous actions (e.g., self-driving cars). This will be the era of ubiquitous intelligent computing.


Innovation and technological improvements in processor design have outpaced advances in memory design in the last ten years. This imbalanced advancement has been causing an increasing gap between processor and memory speeds. During the last decade this has led to an approach involving concurrent execution, initially through the execution of multiple threads in one processor and now with the inclusion of multiple cores in a single chip. Unfortunately, the advent of chip multiprocessors (CMPs) in recent years has made the problem even worse due to the increase in bandwidth requirements and contention on the memory controller. This increasing speed gap has therefore motivated current high-performance processors to focus on cache organizations, register files and prefetching techniques to tolerate growing memory latencies [BCS09], [BGK96], [SPN96]. Furthermore, power dissipation is becoming a critical issue for microprocessors. Power dissipation determines the cost of the cooling system and ultimately may limit the performance of the microprocessor. <br>Prefetching mechanisms that decouple and overlap the computation and transfer of data are a well-known technique commonly employed to hide memory latencies [BCS09]. However, although aggressive prefetching mechanisms are, for most applications, beneficial for tolerating memory latencies in single-core processors, when prefetching is done on multiple cores of a CMP, the performance increase of individual cores can be greatly reduced compared to systems without prefetching [EMJ09]. This is caused by interference of the prefetching mechanisms in the shared resources. <br>One of the greatest challenges that appeared with this twist in chip configuration lies in how users will exploit CMPs. Parallel programming models, which divide an application into several tasks that can be executed concurrently, seem to be the best alternative to take advantage of CMP resources. 
Unfortunately, current programming models implement blocking synchronization, where critical sections are serialized in order to ensure mutual exclusion. Blocking synchronization increases the complexity of parallel programming and significantly degrades the performance of parallel applications. This fact has encouraged the development of optimistic programming models that use non-blocking synchronization. In these programming models, critical sections are executed simultaneously, requiring modifications to the memory hierarchy to guarantee the correctness of the execution [HWC04] [DLM09]. <br>On the other hand, the increasing influence of wire delay in cache design means that access latencies to the last-level cache banks are no longer constant [AHK00], [M97]. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem [KBK02]. A NUCA divides the whole cache memory into smaller banks and allows nearer cache banks to have lower access latencies than farther banks, thus mitigating the effect of the cache's internal wires. <br>We will propose several microarchitectural techniques that can be applied to various parts of current microprocessor designs to improve the memory system and to boost the execution of instructions. Some techniques will attempt to ease the gap between processor and memory speeds, while others will attempt to alleviate the serialization caused by data dependences. <br>
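As an illustration of the kind of prefetching mechanism discussed above, here is a minimal stride-prefetcher model (the table size, indexing and confidence policy are our own simplifications, not any particular published design): it tracks, per load instruction, the last miss address and stride, and issues a prediction once the same stride repeats.

```c
#include <stdint.h>

/* One entry per load PC: last miss address, last stride, confidence bit.
   TABLE and the direct-mapped indexing are arbitrary for this sketch. */
#define TABLE 16

struct sp_entry { uint64_t pc, last; int64_t stride; int confident; };
static struct sp_entry sp_table[TABLE];

/* Record a miss at 'addr' by instruction 'pc'; return the address to
   prefetch, or 0 when no confident prediction exists yet. */
static uint64_t sp_prefetch(uint64_t pc, uint64_t addr) {
    struct sp_entry *e = &sp_table[pc % TABLE];
    uint64_t pred = 0;
    if (e->pc == pc) {
        int64_t s = (int64_t)(addr - e->last);
        e->confident = (s == e->stride && s != 0);  /* same stride twice */
        e->stride = s;
        if (e->confident) pred = addr + (uint64_t)s;
    } else {                                        /* new PC: reset entry */
        e->pc = pc; e->stride = 0; e->confident = 0;
    }
    e->last = addr;
    return pred;
}
```

The CMP interference problem cited from [EMJ09] arises precisely because several such predictors, each locally reasonable, compete for the same shared bandwidth.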
These computing systems will require dramatic improvements in energy efficiency, since they will perform very complex tasks under tightly constrained power budgets. Traditional approaches to improving energy efficiency are running out of gas, and disruptive innovations in architecture are going to be a main driving force for energy efficiency in the future. These disruptive architecture innovations are the main focus of this project.


= Multithreaded Processors =
Our approach to designing these novel architectures will rely on three main pillars: simplicity, minimal data movement, and specialization of both hardware and software. This specialization will give rise to different domain-specific architectures. In this project we will focus on two particular domains that we believe will be among the most popular in the forthcoming future: cognitive computing and graphics.


Industry and researchers are making a shift towards multi-core architectures [WWW07] [WP07] [WWW207] [F07]. This shift is mainly motivated by two factors. On the one hand, we have reached a point where further exploiting instruction-level parallelism (ILP) gives diminishing returns, so other types of parallelism are needed. On the other hand, new feature sizes allow a greater number of transistors to be implemented on a chip. This increase in the number of transistors opens the possibility of integrating multiple cores on die, so that multiple applications, and threads from the same application, can run in parallel and achieve good performance by exploiting thread-level parallelism (TLP) [KFJ+04]. <br>The ability to execute multiple threads in parallel is called multithreading. There are several ways of implementing multithreading. On the one hand, implementing multiple cores allows multiple threads to run in parallel. On the other hand, each of these cores can also execute more than one thread at the same time using techniques like simultaneous multithreading, fine-grain multithreading [TEL95] or switch-on-event multithreading [MB05]. <br>Implementing multiple simple cores on a chip makes the number of cores available in a processor increase every year, and companies are sometimes making strong bets, designing processors able to exploit TLP very efficiently at the expense of sacrificing ILP [H05]. However, these novel architectures consisting of simple cores will have to compete with current out-of-order processors, which clearly outperform them in the ILP arena. <br>Speculative multithreading is a paradigm where single-threaded applications are split into multiple threads that can be executed in parallel. These threads are generated using speculative optimizations like control speculation and dependence breaking that maximize the number of instructions that can be executed in parallel. 
Unfortunately, since the optimizations are speculative, this paradigm also requires hardware mechanisms to detect and recover from misspeculations. However, multi-core architectures comprising simple cores can take advantage of this paradigm to reach performance similar to that of conventional out-of-order cores on single-threaded applications. Typical implementations of speculative multithreading can be found in [SBV95][SCZ+02][MG99][GMS+05][CMT00]. These implementations usually generate speculative threads where every thread represents a set of consecutive instructions from the original application plus some extra instructions to handle the speculative optimizations. However, other more recent proposals refine these models, generating threads where the original instructions are more aggressively distributed among threads [MLC+09].  
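The detect-and-recover idea behind speculative multithreading can be sketched as follows. This is a sequential simulation of a two-thread decomposition, not a real hardware mechanism: the second half of a loop with a serial recurrence starts from a predicted live-in value, and the speculative work commits only if the prediction turns out to be correct; otherwise it is squashed and re-executed.

```c
/* Toy recurrence: each value depends on the previous one, so the loop
   cannot be split without knowing the mid-point live-in. */
static unsigned step(unsigned v, unsigned a) { return v * 33u + a; }

static unsigned run(const unsigned *a, int lo, int hi, unsigned v) {
    for (int i = lo; i < hi; i++) v = step(v, a[i]);
    return v;
}

/* Speculative split: "thread 2" runs the second half from a predicted
   live-in while "thread 1" runs the first half; the live-in is then
   validated, and on a mismatch the speculative work is discarded. */
static unsigned spec_run(const unsigned *a, int n, unsigned v0,
                         unsigned predicted_mid, int *squashed) {
    int mid = n / 2;
    unsigned spec = run(a, mid, n, predicted_mid);  /* speculative thread */
    unsigned actual_mid = run(a, 0, mid, v0);       /* non-speculative    */
    if (actual_mid == predicted_mid) {              /* validate live-in   */
        *squashed = 0;
        return spec;                                /* commit             */
    }
    *squashed = 1;
    return run(a, mid, n, actual_mid);              /* squash and redo    */
}
```

Either way the final result matches sequential execution; the prediction only decides whether the two halves overlapped usefully or the second one was wasted work.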
Cognitive computing is a broad area that includes machine learning and other artificial intelligence techniques that will allow computers to interact and think like humans (perception, reasoning, learning, decision making, etc.). Graphics are the common way used by most applications to display data to users. Applications such as games and movies demand extremely high-quality graphics and in many cases (e.g., mobile devices) have very tight power constraints. The ultimate goal of this project is to devise novel domain-specific architectures that provide rich user experiences in these two areas.


= Reliability =
= CoCoUnit: An Energy-Efficient Processing Unit for Cognitive Computing (2019-2025) =
There is a fast-growing interest in extending the capabilities of computing systems to perform human-like tasks in an intelligent way. These technologies are usually referred to as cognitive computing. We envision a next revolution in computing in the forthcoming years that will be driven by deploying many “intelligent” devices around us in all kind of environments (work, entertainment, transportation, health care, etc.) backed up by “intelligent” servers in the cloud. These cognitive computing systems will provide new user experiences by delivering new services or improving the operational efficiency of existing ones, and altogether will enrich our lives and our economy.


Circuit behavior is no longer deterministic in current and future technologies due to variations. Limitations in the fabrication process introduce indeterminism in the geometry (length, width and thickness) of transistors and, hence, in their behavior. Moreover, voltage and temperature oscillate, and the inputs of circuits change. Thus, the delay and power of circuits change dynamically, but circuits must work in any feasible scenario. This issue is addressed by assuming the worst-case scenario by default to ensure circuit functionality, but such an assumption is pessimistic most of the time and very inefficient in terms of power and delay. The aim of our research is to find solutions that make circuits more efficient in the presence of variations. The result of our research will be a set of strategies, techniques and circuit designs that improve the performance and power of such circuits by adapting their operation to the common case instead of operating under worst-case assumptions.<br>Some works have already been proposed to exploit variations. TIMERRTOL [Uht00] and Razor [EKD+03] exploit variations by using results after the common-case delay and checking their correctness after the worst-case delay. There has been work on new memory circuit design [VSP+09], as well as techniques to reduce the performance impact of process variations [LCW+07]. <br>The main groups working in the area of variations are the Department of Electrical and Computer Engineering at the University of Rhode Island, the Department of Electrical Engineering at Princeton University, the Division of Engineering and Applied Sciences at Harvard University, and the Department of Electrical Engineering and Computer Science at the University of Michigan.  
A key characteristic of cognitive computing systems will be their capability to process in real time large amounts of data coming from audio and vision devices and other types of sensors. This will demand very high computing power and, at the same time, extremely low energy consumption. This very challenging energy-efficiency requirement is a sine qua non for success, not only for mobile and wearable systems, where power dissipation and cost budgets are very low, but also for large data centers, where energy consumption is a main component of the total cost of ownership.


= Virtual Machines  =
Current processor architectures (including general-purpose cores and GPUs) are not a good fit for this type of system, since they keep the same basic organization as early computers, which were mainly optimized for “number crunching”. CoCoUnit will take a disruptive direction by investigating unconventional architectures that can offer orders of magnitude better efficiency in terms of performance per energy and cost for cognitive computing tasks. The ultimate goal of this project is to devise a novel processing unit that will be integrated with the existing units of a processor (general-purpose cores and GPUs) and that altogether will be able to deliver cognitive computing user experiences with extremely high energy efficiency.


Co-designed virtual machines [SN05] are an attractive vehicle for designing complexity- and power-effective processors. In this paradigm, a processor is a co-design effort between hardware and software. The software layer wraps the hardware and provides a common interface to the outside world (operating system and applications). This software layer allows us to perform dynamic translation from a source ISA (visible to the operating system and applications), as well as to optimize the source code and adapt it to better exploit the capabilities of the hardware layer underneath. <br>Several proposals in the research arena have shown the potential benefits of a co-designed virtual machine as well as the benefits of dynamic optimization. In Transmeta Crusoe [Kla00], IBM DAISY [EA96] and IBM BOA [AGS+99], the concept of a co-designed virtual machine is leveraged to design systems based on a low-complexity, low-power VLIW hardware layer able to execute general-purpose x86 code. In these proposals, the translation from x86 to the VLIW ISA is an important feature and imposes a significant overhead on the system. <br>RePlay [FBC+01] and PARROT [RAM+04] remove the translation overhead and concentrate their efforts on dynamically optimizing the most frequently executed sections of applications. However, these projects rely on hardware to perform code optimization, which limits the flexibility of the system. A software optimizer is able to perform more complex analyses and optimizations than a hardware-based scheme, and it can be updated many times even after the chip is built. Moreover, a hardware optimizer introduces additional complexity to the hardware, which may result in increased power consumption and additional validation cost. <br>The goal of our research in this arena is to propose a complete design of a system based on a combined hardware and software effort. 
To do so, we will first investigate an efficient and flexible co-designed system that overcomes the limitations of previous proposals. Then, we will investigate novel techniques to perform dynamic optimization through the co-designed virtual machine software layer. These techniques must be able to adapt applications to the hardware underneath. Finally, we will investigate enhancements to the hardware layer to allow better interaction with the wrapper software layer. <br>Numerous research groups have focused their research on generic dynamic binary optimization. However, very few of them have concentrated their efforts on the concept of co-designed virtual machines, where the final functionality of a processor is transparently provided by the most efficient balance between hardware and software. In addition to the aforementioned projects, the research group led by the recently retired professor Jim Smith [SN05], with whom we have collaborated closely for more than 10 years, has worked on this topic. <br>Due to the increasing complexity of current processors in terms of energy consumption, area and validation, we strongly believe that this research topic will become very prominent in the research agenda of many groups in the coming years. In fact, one can already observe that more and more groups advocate using software to perform tasks that are too complex to perform in hardware, even if these proposals are not aligned with the concept of co-designed virtual machines.  
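The translate-once, execute-many structure of a co-designed VM can be illustrated with a toy example (the "source ISA" and all names here are invented for this sketch): the source program is decoded a single time into host routines held in a translation cache, so repeated executions avoid the decode overhead that the text attributes to translation in systems like DAISY and BOA.

```c
#include <stddef.h>

/* Toy source ISA: INC adds 1, DBL doubles, HALT stops. */
enum { INC, DBL, HALT };

static int op_inc(int v) { return v + 1; }
static int op_dbl(int v) { return v * 2; }

typedef int (*host_op)(int);

/* One-time translation: decode each source opcode into a host routine
   stored in the translation cache. */
static size_t translate(const int *src, host_op *cache, size_t max) {
    size_t n = 0;
    while (n < max && src[n] != HALT) {
        cache[n] = (src[n] == INC) ? op_inc : op_dbl;
        n++;
    }
    return n;
}

/* Every execution reuses the cached translation: no decode on this path. */
static int execute(const host_op *cache, size_t n, int v) {
    for (size_t i = 0; i < n; i++) v = cache[i](v);
    return v;
}
```

A real co-designed VM additionally optimizes the cached code and falls back to interpretation for cold code, but the amortization argument is the same.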
This project is funded by the European Research Council through the ERC Advanced Grants program.


= References  =
= Intelligent, Ubiquitous and Energy-Efficient Computing Systems (2016-2020) =
The ultimate goal of this project is to devise novel platforms that provide rich user experiences in the areas of cognitive computing and computational intelligence in mobile devices such as smartphones or wearable devices. This project investigates novel unconventional architectures that can offer orders of magnitude better efficiency in terms of performance per energy, and at the same time important improvements in raw performance. These platforms will rely on various types of units specialized for different application domains. Special focus is paid to graphics processors and brain-inspired architectures (e.g. hardware neural networks) due to their potential to exploit high degrees of parallelism and their energy efficiency for this type of applications. Extensions to existing architectures combined with novel accelerators will be explored. We also investigate the use of resilient architectures that can allow computing systems to operate at very low supply voltage levels in order to optimize their energy consumption while not compromising their reliability by providing adequate fault tolerance solutions.


[AC71] A. V. Aho, M. J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333–340, 1975.<br>[AGS+99] E. Altman, M. Gschwind, S. Sathaye, S. Kosonocky, A. Bright, J. Fritts, P. Ledak, D. Appenzeller, C. Agricola, Z. Filan, “BOA: The Architecture of a Binary Translation Engine”, IBM Research Report RC 21665 (97500), 1999 <br>[AHK00] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, Clock rate vs. ipc: The end of the road for conventional microprocessors, in Proceedings of the 27th International Symposium on Computer Architecture, 2000. <br>[BCS09] S. Byna, Y. Chen and X. Sun. “Taxonomy of Data Prefetching for Multicore Processors”. Journal of Computer Science and Technology, 2009 24 (3): 405-417. <br>[BGK96] D. Burger, J. R. Goodman, and A. Kägi, “Quantifying Memory Bandwidth Limitations of Current and Future Microprocessors”, In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996. <br>[BKS+08] C.Bienia, S.Kumar, J.P.Singh, and K.Li. “The PARSEC Benchmark Suite: Characterization and Architectural Implications”, In Proceedings of International Conference on Parallel Architectures and Compilation Techniques, pages 72–81, 2008. <br>[BoB09] R. Borgo, K. Brodlie,. “State of The Art Report on GPU”, Visualization and Virtual Reality Research Group, School of Computing - University of Leeds, - VizNET REPORT - Ref: VIZNET-WP4-24-LEEDS-GPU, 2009. <br>[BP05] Z. K. Baker, V. K. Prasanna: High-throughput linked-pattern matching for intrusion detection systems. ANCS 2005.<br>[CGJ09] Q. Cai, Y. Gui, and R. Johnson. Exploiting Unix File-system Races via Algorithmic Complexity Attacks. Proceedings of IEEE Symposium on Security and Privacy, 2009.<br>[ChS08] H.Chang and W.Sung, “Efficient Vectorization of SIMD Programs with Non-aligned and Irregular Data Access Hardware”, CASES’08, October 19–24, 2008. <br>[CMT00] M. Cintra, J.F. Martinez and J. 
Torrellas, “Architectural Support for Scalable Speculative Parallelization in Shared-Memory Systems”, in Proc. of the 27th Int. Symp. on Computer Architecture, 2000.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br>[CW03] S. A. Crosby and D. S. Wallach. Denial of Service via Algorithmic Complexity Attacks. In USENIX Security, Aug. 2003.<br>[DLM09] D. Dice, Y. Lev, M. Moir, and D. Nussbaum, Early Experience with a Commercial Hardware Transactional Memory Implementation, in Proceedings of the 14th International Conference onArchitectural Support for Programming Languages and Operating Systems, 2009. <br>[EA96] K. Ebcioglu, E. R. Altman, “DAISY: Dynamic Compilation for 100% Architectural Compatibility”, IBM Research Report RC 20538, 1996. <br>[EKD+03] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, T. Mudge. Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation. Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. <br>[EMJ09] E. Ebrahimi, O. Mutlu, C. J. Lee, and Y. N. Patt. “Coordinated control of multiple prefetchers in multi-core systems”. In Proceedings of the 42nd Annual IEEE/ACM international Symposium on Microarchitecture, 2009. <br>[FBC+01] B. Fahs. S. Bose, M. Crum, B. Slechta, F. Spadini, T. Tung, S. Patel, S. Lumetta, “Performance Characterization of a Hardware Mechanism for Dynamic Optimization”, in Procs. of 34th International Symposium On Microarchitecture (MICRO-34), 2001 <br>[F07] J. Fang. Parallel Programming Environment: A Key to Translating Tera-Scale Platforms into a Big Success. Key Note at International Symposium on Code Generation and Optimization. March 2007. <br>[GMS+05] C. García, C. Madriles, J. Sánchez, P. Marcuello, A. González and D.M. Tullsen, “Mitosis Compiler: An Infrastructure for Speculative Threading Based on Pre-computation Slices”, in Proc. of the Int. Conf. on Programming Language Design and Implementation, pp. 
269-278, 2005. <br>[H05] H.P. Hofstee, Power efficient processor architecture and the cell processor. In Proceedings of 11th International Symposium on High-Performance Computer Architecture HPCA-11, February 2005. <br>[HST09] N. Hua, H. Song, and T.V. Lakshman. Variable-Stride Multi-Pattern Matching For Scalable Deep Packet Inspection. Proceedings of IEEE INFOCOM, April 2009.<br>[HWC04] L. Hammond, V. Wong, M. Chen, B. Carlstrom, J. Davis, B. Hertzberg, M. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, Transactional Memory Coherence and Consistency, in Proceedings of the 31st International Symposium on Computer Architecture, 2004. <br>[KBK02] C. Kim, D. Burger, and S.W.Keckler, An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches, In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002. <br>[KFJ04] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. In Proceedings of the 31st annual International Symposium on Computer Architecture, 2004. <br>[Kla00] A. Klaiber, “The Technology Behind the CrusoeTM Processors”, white paper, January 2000 <br>[LCW+07] X. Liang, R. Canal, G.Y. Wei, D. Brooks. Process Variation Tolerant 3T1D-Based Cache Architectures. Proceedings of the 40th International Symposium on Microarchitecture (MICRO-40), December 2007 <br>[LNO+08] E.Lindholm, J.Nickolls, S.Oberman, Jmontrym. “NVIDIA TESLA: A Unified Graphics and Computing Architecture”, IEEE Micro, 28, 2&nbsp;:39-55, March-April 2008. <br>[LuH07] D.Luebke and G.Humphreys. “How GPUs work”, Computer, 40(2): 96-11, 2007. <br>[M97] D. Matzke, Will physical scalability sabotage performance gains?, IEEE Computer, September 1997. <br>[MB05] C. McNairy, and R. Bhatia. Montecito: A Dual-Core, Dual-Thread Itanium Processor. IEEE Micro, Volume 25, Issue 2, Pages 10–20. March-April 2005. 
<br>[MG99] P. Marcuello and A. González, “Clustered Speculative Multithreaded Processors”, in Proc. of the 13th Int. Conf. on Supercomputing, pp. 365-372, 1999. <br>[MLC+09] C. Madriles, F. Latorre, J. M. Codina, E. Gibert, P. López, A. Martínez, R. Martínez and A. González, “Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading”, in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, September 2009. <br>[MO07] T. Moscibroda, O. Mutlu. Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems. In Proceedings of the 16th USENIX Security Symposium, 2007.<br>[NVI08] NVIDIA Corporation. “NVIDIA CUDA Programming Guide”, 2.0 edition, 2008. <br>[OHL+08] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. “GPU Computing”, Proceedings of the IEEE, vol. 96, pp. 879-899, 2008. <br>[PN98] T. Ptacek and T. Newsham. Insertion, Evasion and Denial of Service: Eluding Network Intrusion Detection. Secure Networks, Inc., January 1998.<br>[PY08] P. Piyachon, Y. Luo. Design of a High Performance Pattern Matching Engine Through Compact Deterministic Finite Automata. In Proceedings of the ACM Design Automation Conference (DAC), 2008.<br>[RAM+04] R. Rosner, Y. Almog, M. Moffie, N. Schwartz, A. Mendelson, “Power Awareness through Selective Dynamically Optimized Traces”, in Proc. of the 31st International Symposium on Computer Architecture (ISCA-31), 2004. <br>[Ro99] M. Roesch. SNORT - Lightweight Intrusion Detection for Networks. In LISA '99: 13th USENIX Systems Administration Conference, 1999.<br>[RWP05] G. Ren, P. Wu, D. Padua. “An Empirical Study on Vectorization of Multimedia Applications for Multimedia Extensions”, in Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05), 2005. <br>[SBV95] G. S. Sohi, S. E. Breach and T. N. Vijaykumar, “Multiscalar Processors”, in Proc. of the 22nd Int. Symp. on Computer Architecture, pp. 414-425, 1995. 
<br>[SCS+08] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, P. Hanrahan, S. Junkins, A. Lake, J. Sugerman, “Larrabee: A Many-Core x86 Architecture for Visual Computing”, in SIGGRAPH '08: ACM SIGGRAPH 2008 Papers, ACM Press, New York, NY, 2008. <br>[SCZ+02] J. Steffan, C. Colohan, A. Zhai and T. Mowry, “Improving Value Communication for Thread-Level Speculation”, in Proc. of the 8th Int. Symp. on High Performance Computer Architecture, pp. 58-62, 2002. <br>[SEJ06] R. Smith, C. Estan, and S. Jha. Backtracking Algorithmic Complexity Attacks against a NIDS. In ACSAC, 2006.<br>[SEJ08] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. In IEEE Symposium on Security and Privacy, May 2008.<br>[SFS00] J. E. Smith, G. Faanes, R. Sugumar, “Vector Instruction Set Support for Conditional Operations”, in Proc. of the Int. Symp. on Computer Architecture, pp. 260-269, 2000. <br>[Shi07] J. Shin, “Introducing Control Flow into Vectorized Code”, in Proc. of the 16th International Conference on Parallel Architecture and Compilation Techniques, 2007. <br>[SN05] J. E. Smith and R. Nair, “Virtual Machines: Versatile Platforms for Systems and Processes”, Morgan Kaufmann Publishers, 2005. <br>[SPN96] A. Saulsbury, F. Pong and A. Nowatzyk, “Missing the Memory Wall: The Case for Processor/Memory Integration”, in Proceedings of the 23rd International Symposium on Computer Architecture, May 1996. <br>[TEL95] D. Tullsen, S. J. Eggers, H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995. <br>[TS05] L. Tan, T. Sherwood. A High Throughput String Matching Architecture for Intrusion Detection and Prevention. In Proceedings of ISCA 2005, pp. 112-122.<br>[Uht00] A. K. Uht. Achieving Typical Delays in Synchronous Systems via Timing Error Toleration. Tech. Rep. No. 032000-0100, Dept. of Electrical and Computer Engineering, University of Rhode Island, 2000.<br>[VSP+09] A. Valero, J. Sahuquillo, S. Petit, V. Lorente, R. Canal, P. Lopez, J. Duato. A Hybrid eDRAM/SRAM Macrocell to Implement First-Level Data Caches. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), December 2009.<br>[WWW07] Quad-Core Intel® Xeon® Processor 5300 Series. August 2007. <br>[WP07] White paper. The Manycore Shift: Microsoft Parallel Computing Initiative Ushers Computing into the Next Era. November 2007. <br>[WWW207] Quad-Core AMD Opteron Processors for Server and Workstation, 2007.  
<br>
= Microarchitecture and Compilers for Future Processors III (2014-2016) =
The main objective of this project, for the researchers of the ARCO group, is to investigate the design of future microprocessors, taking into account the determining factors of future technology, both for high-performance processors and for commodity electronics.
Fundamentally, two factors have driven the performance increases of processors: on the one hand, the technological advances in microprocessor manufacturing and, on the other hand, the use of new and more efficient microarchitectural and compiler techniques. These improvements bring a number of challenges that are now considered key in designing the processors of the upcoming decade: the limited instruction-level parallelism, interconnection network delays, high power consumption, heat dissipation, and system reliability and security.
In this project we will address the influence of these issues on the design of future processors.
Specifically, we will address six areas that we consider fundamental:
# The efficient design of circuits in the presence of unexpected changes in their operating parameters
# The efficient design of graphics processors oriented to mobile devices
# The efficient implementation of virtual machines with low complexity but high computing power
# The characterization and acceleration of emerging applications
# The design of new heterogeneous multiprocessor architectures that optimize the use of the different processors depending on the types of application being executed
# The study of new techniques in the design of the memory hierarchy and interconnection networks to tolerate the increasing gap between the speeds of the various components of the computer


= Microarchitecture and Compilers for Future Processors II (2010-2014) =
The main objective of this project, for the researchers of the ARCO group, is to investigate the design of future microprocessors, taking into account the determining factors of future technology, both for high-performance processors and for commodity electronics.
Fundamentally, two factors have driven the performance increases of processors: on the one hand, the technological advances in microprocessor manufacturing and, on the other hand, the use of new and more efficient microarchitectural and compiler techniques. These improvements bring a number of challenges that are now considered key in designing the processors of the upcoming decade: the limited instruction-level parallelism, interconnection network delays, high power consumption, heat dissipation, and system reliability and security.
In this project we will address the influence of these issues on the design of future processors.
Specifically, we will focus on six areas which we consider to be fundamental:
# The study of new techniques in the memory hierarchy design to tolerate the increasing gap between processor and memory speeds
# The efficient design of circuits in the face of unexpected variations in their operating parameters
# The implementation of efficient virtual machines with low complexity but high computing power
# The implementation of intrusion detection systems to ensure a high level of computer security
# Characterization and acceleration of emerging applications
# The design of novel multithreaded processors to exploit thread-level parallelism


[[Project Bureaucracy|'''Go to Project Bureaucracy''']]
<br>
= Microarchitecture and Compilers for Future Processors (2006-2010) =
The main objective of this project is to research the design of next-decade processors, considering the requirements of the technology that is expected to be feasible in the coming years.
Until recently, processor performance was mainly determined by two factors: technological advances in microprocessor manufacturing and the use of new and more efficient microarchitectural and compiler techniques. Now, new challenges must be addressed, for instance: high power consumption, heat dissipation, wire delays, design complexity, and the limited instruction-level parallelism.
In this project we will address the influence of these issues on the design of future processors. Specifically, we will focus on seven areas which we consider to be fundamental:
# The reduction in power consumption and better approaches for heat dissipation
# The exploitation of thread-level speculative parallelism
# The design of clustered microarchitectures
# The efficient implementation of ISA extensions for out-of-order processors
# The efficient implementation of co-designed virtual machines
# The study of new techniques in the register file and cache memory design to tolerate the increasing gap between processor and memory speeds
# The efficient design of circuits in the face of unexpected variations in their operating parameters
 
<br>
= Project Bureaucracy =
*[[Project Bureaucracy|'''Go to Project Bureaucracy''']]

Latest revision as of 09:38, 27 April 2026

Domain-Specific Architectures for Energy-Efficient Computing Systems (2024-2028)


However, these computing systems will necessitate significant advancements in energy efficiency, as they will execute complex tasks under stringent power constraints. While chip manufacturing process technology has been instrumental in improving energy efficiency over generations, dimension scaling has recently slowed down, leading many experts to believe that it may soon come to a halt. In this context, disruptive innovations in architecture will become pivotal in enhancing energy efficiency and driving innovation. This project focuses on developing these disruptive architectural innovations.

Our approach to designing these novel architectures will be grounded in three key pillars: simplicity, minimal data movement, and hardware-software specialization. This specialization will lead to the development of domain-specific architectures. In this project, we will concentrate on two domains that we believe will be highly popular and impactful in the future: cognitive computing and graphics.

Cognitive computing encompasses a broad range of artificial intelligence techniques, including machine learning, that enable computers to interact and think like humans (perception, reasoning, learning, decision-making, etc.). Graphics are the primary means used by most applications to display data to users. Animated graphics applications, such as games and movies, demand exceptionally high-quality graphics, low latency, and often operate under tight power constraints (e.g., mobile devices). The ultimate objective of this research project is to design novel domain-specific architectures that provide exceptional user experiences in these two domains.


Domain-Specific Architectures for Energy-Efficient Computing Systems (2020-2025)

We are on the verge of transitioning to a new era in computing, characterized by an abundance of very different computing devices integrated into most of the systems that surround us in our daily lives. In addition, these computing systems will include a rich set of functions that are similar to human cognitive tasks, such as the ability to comprehend our surroundings (e.g., vision and language processing), learn from data and experiences, and proactively take decisions and autonomous actions (e.g., self-driving cars). This will be the era of ubiquitous intelligent computing.

These computing systems will require dramatic improvements in energy efficiency, since they will perform very complex tasks under tightly constrained power budgets. Traditional approaches to improving energy efficiency are running out of gas, and disruptive innovations in architecture are going to be a main driving force for energy efficiency in the future. These disruptive architecture innovations are the main focus of this project.

Our approach to designing these novel architectures will rely on three main pillars: simplicity, minimal data movement, and specialization of both hardware and software. This specialization will give rise to different domain-specific architectures. In this project we will focus on two particular domains that we believe will be among the most popular in the forthcoming future: cognitive computing and graphics.

Cognitive computing is a broad area that includes machine learning and other artificial intelligence techniques that will allow computers to interact and think like humans (perception, reasoning, learning, decision making, etc.). Graphics are the common way used by most applications to display data to users. Applications such as games and movies demand extremely high-quality graphics, and in many cases (e.g., mobile devices) operate under very tight power constraints. The ultimate goal of this project is to devise novel domain-specific architectures that provide rich user experiences in these two areas.


CoCoUnit: An Energy-Efficient Processing Unit for Cognitive Computing (2019-2025)

There is a fast-growing interest in extending the capabilities of computing systems to perform human-like tasks in an intelligent way. These technologies are usually referred to as cognitive computing. We envision a next revolution in computing in the forthcoming years that will be driven by deploying many “intelligent” devices around us in all kinds of environments (work, entertainment, transportation, health care, etc.), backed up by “intelligent” servers in the cloud. These cognitive computing systems will provide new user experiences by delivering new services or improving the operational efficiency of existing ones, and altogether will enrich our lives and our economy.

A key characteristic of cognitive computing systems will be their capability to process, in real time, large amounts of data coming from audio and vision devices, as well as other types of sensors. This will demand very high computing power and, at the same time, extremely low energy consumption. This very challenging energy-efficiency requirement is a sine qua non for success, not only for mobile and wearable systems, where power dissipation and cost budgets are very low, but also for large data centers, where energy consumption is a main component of the total cost of ownership.

Current processor architectures (including general-purpose cores and GPUs) are not a good fit for this type of system, since they keep the same basic organization as early computers, which were mainly optimized for “number crunching”. CoCoUnit will take a disruptive direction by investigating unconventional architectures that can offer orders of magnitude better efficiency in terms of performance per unit of energy and cost for cognitive computing tasks. The ultimate goal of this project is to devise a novel processing unit that will be integrated with the existing units of a processor (general-purpose cores and GPUs) and that, altogether, will be able to deliver cognitive computing user experiences with extremely high energy efficiency.

This project is funded by the European Research Council through the ERC Advanced Grants program.


Intelligent, Ubiquitous and Energy-Efficient Computing Systems (2016-2020)

The ultimate goal of this project is to devise novel platforms that provide rich user experiences in the areas of cognitive computing and computational intelligence on mobile devices such as smartphones or wearables. This project investigates novel unconventional architectures that can offer orders of magnitude better efficiency in terms of performance per unit of energy, together with important improvements in raw performance. These platforms will rely on various types of units specialized for different application domains. Particular attention is paid to graphics processors and brain-inspired architectures (e.g., hardware neural networks) due to their potential to exploit high degrees of parallelism and their energy efficiency for this type of application. Extensions to existing architectures combined with novel accelerators will be explored. We also investigate resilient architectures that allow computing systems to operate at very low supply voltage levels in order to optimize their energy consumption, without compromising reliability, by providing adequate fault-tolerance solutions.

