Projects

From ArcoWiki
= CoCoUnit: An Energy-Efficient Processing Unit for Cognitive Computing (2019-2025) =

There is fast-growing interest in extending the capabilities of computing systems to perform human-like tasks in an intelligent way. These technologies are usually referred to as cognitive computing. We envision a next revolution in computing in the forthcoming years, driven by the deployment of many “intelligent” devices around us in all kinds of environments (work, entertainment, transportation, health care, etc.), backed by “intelligent” servers in the cloud. These cognitive computing systems will provide new user experiences by delivering new services or improving the operational efficiency of existing ones, and altogether will enrich our lives and our economy.

A key characteristic of cognitive computing systems will be their capability to process in real time large amounts of data coming from audio and vision devices and other types of sensors. This will demand very high computing power together with extremely low energy consumption. This very challenging energy-efficiency requirement is a sine qua non for success, not only for mobile and wearable systems, where power dissipation and cost budgets are very low, but also for large data centers, where energy consumption is a main component of the total cost of ownership.

Current processor architectures (including general-purpose cores and GPUs) are not a good fit for this type of system, since they keep the same basic organization as early computers, which were mainly optimized for “number crunching”. CoCoUnit takes a disruptive direction by investigating unconventional architectures that can offer orders of magnitude better efficiency in terms of performance per energy and cost for cognitive computing tasks. The ultimate goal of this project is to devise a novel processing unit that will be integrated with the existing units of a processor (general-purpose cores and GPUs) and, altogether, will be able to deliver cognitive computing user experiences with extremely high energy efficiency.

This project is funded by the European Research Council through the ERC Advanced Grants program.

= Clustered Processors =

The increasing number of transistors and faster clock speeds of current processors inhibit signals from reaching every point of a chip in a single clock cycle. Moreover, recent process technologies exhibit ever higher interconnect latencies relative to logic gates [AHK00], and power consumption has become a first-order design constraint. To address this problem, clustered designs divide the processor into subunits (clusters) that work independently. The smaller size of these clusters allows signals to reach all of their parts in a single clock cycle. In addition, design complexity is reduced and energy efficiency is improved [ZK01]. On the other hand, this paradigm introduces a new problem: the workload has to be distributed among all clusters in order to achieve maximum computing potential, so the clusters need to communicate with each other. This requires an interconnection network among clusters, which has significant latencies and limited bandwidth.

Several clustered superscalar architectures have been proposed over time. They can be classified by how they map instructions to clusters. Some use control dependences (and could alternatively be classified as speculative multithreaded) [SBV95] [RJS97], while others are based on data dependences. The latter group can be further classified by when instructions are mapped: one group relies on the compiler to perform this work before the program is executed [NSB01] [FCJ97]; another maps instructions dynamically in hardware during program execution [KF96] [PJS97] [K99] [ZK01] [KS02] [BDA03]. The memory unit has received special attention in the literature because it is one of the most complex and power-consuming structures of a superscalar processor. Most research targets the memory disambiguation unit and the first-level data cache. Some proposals include a predictor [YMR+99] to find an instruction mapping that minimizes communication delays [ZK01] [RP03] [B04], while others do not [FS96] [SNM+06].

With respect to VLIW architectures, the work concentrates on the code generation stage of the compiler, which is used to distribute instructions among clusters, schedule instructions and the necessary communications, and assign registers. The various proposals in the literature can be distinguished by the order in which these tasks are executed. Some proposals execute them independently [E86] [CDN92] [D98] [JCSK98] [NE98] [CFM03]; others unify the last two tasks [OBC98] [SG00a] [LPSA02]; and recently some groups have proposed techniques to unify all three [KEA01] [CSG01] [ACSG01] [ZLAV01] [ACS+02]. In parallel, techniques have been studied to improve register assignment and spill code [ACGK05] and to reduce the impact of communications by replicating some code [SG00a] [ACGK03a]. Other recent works propose alternatives for distributing the memory hierarchy in a clustered architecture [WTS+97] [SG00b] [GSG02a], as well as code generation techniques for these new organizations [GSG02b] [GSG03a] [GSG03b].
  
= Intelligent, Ubiquitous and Energy-Efficient Computing Systems (2016-2020) =

The ultimate goal of this project is to devise novel platforms that provide rich user experiences in the areas of cognitive computing and computational intelligence on mobile devices such as smartphones or wearable devices. This project investigates novel unconventional architectures that can offer orders of magnitude better efficiency in terms of performance per energy, and at the same time important improvements in raw performance. These platforms will rely on various types of units specialized for different application domains. Special focus is placed on graphics processors and brain-inspired architectures (e.g. hardware neural networks) due to their potential to exploit high degrees of parallelism and their energy efficiency for these applications. Extensions to existing architectures combined with novel accelerators will be explored. We also investigate resilient architectures that allow computing systems to operate at very low supply voltage levels in order to optimize their energy consumption without compromising their reliability, by providing adequate fault-tolerance solutions.

= Microarchitecture and Compilers for Future Processors III (2014-2016) =

The main objective of this project for the researchers of the ARCO group is research into the design of future microprocessors, taking into account the determining factors of future technology, both for high-performance processors and for commodity electronics.

Fundamentally, two factors have determined the performance increases in processors: on one hand, the technological advances in microprocessor manufacturing and, on the other, the use of new and more efficient microarchitectural and compiler techniques. All these improvements bring a number of challenges that are now considered key in designing the processors of the upcoming decade: limited instruction-level parallelism, interconnection network delays, high power consumption, heat dissipation, and system reliability and security.

In this project we address the influence of these issues on the design of future processors.

Specifically, we will address six areas that we consider fundamental:

# The efficient design of circuits in the presence of unexpected changes in their operating parameters
# The efficient design of graphics processors oriented to mobile devices
# The efficient implementation of virtual machines with low complexity but high computing power
# The characterization and acceleration of emerging applications
# The design of new heterogeneous multiprocessor architectures that optimize the use of the different processors depending on the types of applications being executed
# The study of new techniques in the design of the memory hierarchy and interconnection networks to tolerate the increasing gap between the speeds of the various components of the computer

= Microarchitecture and Compilers for Future Processors II (2010-2014) =

The main objective of this project for the researchers of the ARCO group is research into the design of future microprocessors, taking into account the determining factors of future technology, both for high-performance processors and for commodity electronics.

Fundamentally, two factors have determined the performance increases in processors: on one hand, the technological advances in microprocessor manufacturing and, on the other, the use of new and more efficient microarchitectural and compiler techniques. All these improvements bring a number of challenges that are now considered key in designing the processors of the upcoming decade: limited instruction-level parallelism, interconnection network delays, high power consumption, heat dissipation, and system reliability and security.

In this project we address the influence of these issues on the design of future processors.

Specifically, we will focus on six areas that we consider fundamental:

# The study of new techniques in memory hierarchy design to tolerate the increasing gap between processor and memory speeds
# Efficient circuit design in the face of unexpected variations in operating parameters
# The implementation of efficient virtual machines with low complexity but high computing power
# The implementation of intrusion detection systems to ensure a high level of computer security
# The characterization and acceleration of emerging applications
# The design of novel multithreaded processors to exploit thread-level parallelism

= Microarchitecture and Compilers for Future Processors (2006-2010) =

The main objective of this project is to research the design of next-decade processors, considering the capabilities of the technology expected to be feasible in the coming years.

Until recently, processor performance was mainly determined by two factors: technological advances in microprocessor manufacturing and the use of new and more efficient microarchitectural and compiler techniques. Now new challenges arise, for instance high power consumption, heat dissipation, wire delays, design complexity, and limited instruction-level parallelism.

In this project we address the influence of these issues on the design of future processors. Specifically, we will focus on seven areas that we consider fundamental:

# The reduction of power consumption and better approaches to heat dissipation
# The exploitation of thread-level speculative parallelism
# The design of clustered microarchitectures
# The efficient implementation of ISA extensions for out-of-order processors
# The efficient implementation of co-designed virtual machines
# The study of new techniques in register file and cache memory design to tolerate the increasing gap between processor and memory speeds
# Efficient circuit design in the face of unexpected variations in operating parameters

= Co-designed Virtual Machines =

Co-designed Virtual Machines [SN05] are an attractive vehicle for designing complexity-effective processors. In such a scheme, a processor is a co-design effort between hardware and software. The software layer wraps the hardware and provides a common interface to the outside world (operating system and applications). This software layer allows us to perform dynamic translation from a source ISA (the one visible to the operating system and applications), as well as to optimize the code so that it better exploits the capabilities of the hardware layer underneath.

Several proposals in the research arena have shown the potential benefits of a co-designed virtual machine, as well as the benefits of dynamic optimization. In Transmeta Crusoe [Kla00] and the IBM DAISY [EA96] and BOA [AGS+99] projects, the concept of a co-designed virtual machine is leveraged to design systems based on a low-complexity, low-power VLIW hardware layer able to execute general-purpose x86 code. For these systems to work, x86 code must be dynamically translated into VLIW code. In these proposals, the translation from x86 to the VLIW ISA is an essential feature, and it imposes a significant overhead on the system.

RePlay [FBC+01] and PARROT [RAM+04] remove the translation overhead and concentrate on dynamic optimization of the most frequently executed sections of the applications. However, these projects rely on hardware to perform code optimization, which limits the flexibility of the system. A software optimizer can perform more complex analyses and optimizations than a hardware-based scheme, and it can be updated even after the chip is built. Moreover, a hardware optimizer adds complexity to the hardware, which may increase power consumption and make the system more difficult to test.

The goal of our research in this arena is to propose a complete design of a system based on a combined hardware/software effort. To that end, we will first investigate an efficient and flexible co-designed system that overcomes the limitations of previous proposals. Then, we will investigate novel techniques to perform dynamic optimization through the co-designed virtual machine software layer; these techniques must be able to adapt the applications to the hardware underneath. Finally, we will investigate enhancements to the hardware layer to allow better interaction with the wrapper software layer.

In addition to the above-mentioned related projects, Jim Smith's research group [SN05] is also very active in this area. Jim has been collaborating with our research group for more than 10 years.

= ISA Extensions for Dynamically Scheduled Processors =

The main objective of scheduling is to obtain a high level of parallelism, to maximize the use of processor resources and minimize execution time. During static scheduling, the compiler has access to all the information contained in the program, which allows it to extract parallelism of different granularities. However, the amount of parallelism is limited, since certain information is available only at execution time. To overcome such limitations, instruction set architecture (ISA) extensions such as predication and register windows have been introduced; they permit techniques that help the program adapt to the execution environment and improve performance, without introducing semantic changes in the overall execution. Such extensions are mainly implemented in in-order processors [MD03] [HL99]; however, they may also be combined with dynamic scheduling techniques to achieve various parallelism granularities.

If-conversion [JK83] is a compiler technique that takes full advantage of predication. Some studies have shown that if-conversion may alleviate the severe performance penalties caused by hard-to-predict branch mispredictions [MBG+94] [CHPC95] [AHM97]. Many research groups have developed techniques to execute predicated code on out-of-order processors: the generation of micro-operations to disambiguate multiple register definitions [WWK+01], predicate value prediction [CC03] [QPG06], or the introduction of new ISA extensions [KMSP05]. Another problem associated with if-conversion is the loss of the correlation information needed by the most common branch predictors [ACGH97]. Several studies have proposed using predicate information in branch prediction to recover the lost correlation information [ACGH97] [SCF03] [QPG07].
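As a minimal illustration of the if-conversion idea discussed above (our own sketch, not taken from any of the cited papers), the C fragment below shows a hard-to-predict branch and a branch-free equivalent in which the control dependence has become a data dependence; a compiler applying if-conversion performs an analogous transformation at the instruction level using predicated or conditional-move instructions.

```c
#include <assert.h>

/* Branchy version: contains the conditional jump a predictor may miss. */
int abs_branchy(int x) {
    if (x < 0)
        return -x;
    return x;
}

/* If-converted version: the predicate (x < 0) is expanded into an
   all-ones/all-zeros mask that selects between x and -x without any
   branch at all. */
int abs_ifconverted(int x) {
    int m = -(x < 0);       /* predicate as mask: -1 if taken, 0 if not */
    return (x ^ m) - m;     /* yields -x when m == -1, x when m == 0 */
}
```

Both functions compute the same result (for any x above INT_MIN); the second trades a possible misprediction penalty for a couple of extra ALU operations.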
Another ISA extension is register windows. A register window [Sit79] is the set of private logical registers accessed by a function. Parameters are passed from one function to another through the overlapping of windows [MD03] [HL99]. When the number of free logical registers is insufficient, a spill mechanism must be activated to preserve the values that are still alive in outer functions. Several studies use register windows to give the processor the illusion of an unlimited number of registers [DM82] [HL91] [ND95] [OBMR05].
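The overlap-based parameter passing can be sketched with a toy model (our own, loosely SPARC-like; all sizes are illustrative, and spilling the oldest window to memory on overflow is noted but not modeled):

```c
#include <assert.h>

#define NWINDOWS 4
#define WINSIZE  8      /* registers visible to one function */
#define OVERLAP  4      /* caller's "out" registers = callee's "in" registers */

/* Physical registers: each call advances the window by (WINSIZE - OVERLAP),
   so the top OVERLAP registers of the caller overlap the bottom of the
   callee's window. A real design spills the oldest window to memory when
   all NWINDOWS are in use; that is not modeled here. */
static int regs[NWINDOWS * (WINSIZE - OVERLAP) + OVERLAP];
static int cwp = 0;                       /* current window pointer */

static int *reg(int i) { return &regs[cwp + i]; }   /* window-relative access */
static void call(void) { cwp += WINSIZE - OVERLAP; }
static void ret(void)  { cwp -= WINSIZE - OVERLAP; }

int windowed_sum(void) {
    *reg(4) = 30;                 /* caller writes its "out" registers */
    *reg(5) = 12;
    call();
    int s = *reg(0) + *reg(1);    /* callee sees them as "in" registers */
    *reg(0) = s;                  /* return value goes back via the overlap */
    ret();
    return *reg(4);               /* caller reads the result from its "out" */
}
```

No values are copied on call or return: the overlap itself carries the arguments and the result.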
 
 
 
= Memory Hierarchy and Register File Architecture  =
 
 
 
Innovation and technological improvements in processor design have outpaced advances in memory design over the last ten years. The increasing gap between processor and memory speeds has therefore motivated current high-performance processors to focus on register file and cache organizations that tolerate growing memory latencies [BGK96] [BKG95] [SD95] [SPN96]. These organizations attempt to bridge the gap, but they do so at the expense of large amounts of die area, increased energy consumption, and a higher demand for memory bandwidth, which can progressively become a greater limit to high performance.
 
 
 
Trends in processor design suggest that register file access latencies will exceed the latencies of functional-unit operations, making it difficult to read the register file in a single clock cycle. At present, a single-cycle access is commonly assumed [BDA01] [BS03] [MGV99] [STR02]. Designs that allow register file latencies greater than one cycle accept a small reduction in performance but obtain significant savings in energy consumption [BM99] [PSR00] [YZG00] [ZYG00]. On the other hand, current cache organizations try to achieve a good balance between cost and performance. For high production volumes, cost can be associated with chip area, so one way to reduce cost is to reduce area requirements. Furthermore, power dissipation is becoming a critical issue for microprocessors: it determines the cost of the cooling system and may ultimately limit the performance of the microprocessor.
 
 
 
We will propose several microarchitectural techniques that can be applied to various parts of current microprocessor designs to improve the memory system and speed up instruction execution. Some techniques will attempt to narrow the gap between processor and memory speeds, while others will alleviate the serialization caused by data dependences.
 
 
 
= Speculative Multithreaded Processors  =
 
 
 
With the limited performance benefits from frequency scaling and the increasing complexity of single-core superscalar microprocessors, several microprocessor vendors have started migrating to multicore chips. While multicore chips clearly benefit applications with explicit thread-level parallelism, such as server workloads, the performance of single-threaded applications will not improve without novel innovations.
 
 
 
Speculative multithreading attempts to fill this void for single-threaded applications. A speculative multithreaded processor logically consists of multiple cores running chunks of a single-threaded application in parallel. Key challenges in this execution paradigm include (1) effectively partitioning the program using speculation on control flow and data dependences, and (2) supporting efficient recovery from misspeculations to restore the correct sequential machine state.
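The misspeculation-recovery idea can be sketched with a small sequential C model (our own illustration; there are no real threads here, and the value-prediction scheme is deliberately trivial): the second program chunk starts from a predicted live-in value instead of waiting for the first chunk, and is squashed and re-executed if the prediction turns out to be wrong.

```c
#include <assert.h>

/* Architecturally correct sequential work: sum a[lo..hi) onto acc. */
static int sum_range(const int *a, int lo, int hi, int acc) {
    for (int i = lo; i < hi; i++)
        acc += a[i];
    return acc;
}

/* "Thread 2" runs the second half speculatively using a predicted live-in
   (the running sum after the first half). "Thread 1" then produces the real
   live-in; on a mismatch the speculative chunk is squashed and re-executed. */
int speculative_sum(const int *a, int n, int predicted_livein, int *reexecs) {
    int mid = n / 2;
    int spec = sum_range(a, mid, n, predicted_livein);  /* speculative run */
    int livein = sum_range(a, 0, mid, 0);               /* correct live-in */
    if (livein != predicted_livein) {   /* misspeculation detected */
        (*reexecs)++;
        spec = sum_range(a, mid, n, livein);            /* recovery */
    }
    return spec;                        /* always the sequential result */
}
```

The result is always the sequential one; only the amount of wasted (re-executed) work depends on prediction accuracy, which is exactly the trade-off this paradigm manages.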
 
 
 
Speculative multithreading has been an active area of research for the past few years. Some of the early work in this area includes the research on Multiscalar at the University of Wisconsin [SBV95], SpMT processors at UPC [MG99] [GMS+05], Stampede at CMU [SCZ+02], and the I-ACOMA project at the University of Illinois [CMT00].
 
 
 
Our research focuses on extending the state of the art in this area and finding efficient solutions to the problems, such as code partitioning and inter-thread data dependences, that keep speculative multithreading from being a viable future processor design.
 
 
 
== Temperature and Power Consumption Control  ==
 
 
 
One of the key elements for future processors is temperature and power-consumption control [Bor99]. The high frequencies at which they will operate will not be accompanied by a comparable voltage reduction to keep power under control. Thus, processors will have to implement mechanisms to control the dissipated energy as well as the temperature. These mechanisms will have to be more or less aggressive depending on the target market segment (server, desktop, or laptop). The energy consumption of a processor can be divided into two main parts: dynamic and static consumption.
 
 
 
Dynamic consumption depends on the silicon technology used, the frequency of the processor, the activity of the processor, and the supply voltage. Advances in silicon technology make transistors smaller, and thus reduce the power dissipated when they switch; on the other hand, the increase in transistor count per chip and the constant increase in frequency make the overall dynamic energy consumption grow with every new generation of processors [DB00].
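The dependences listed above are usually captured by the first-order CMOS model P = a · C · V² · f. The sketch below (with illustrative constants of our own, not figures from the text) shows in particular the quadratic dependence on supply voltage:

```c
/* First-order dynamic power model for CMOS:
   P = a * C * V^2 * f, where
   a = activity factor, C = switched capacitance,
   V = supply voltage,  f = clock frequency.      */
double dynamic_power(double activity, double cap, double vdd, double freq) {
    return activity * cap * vdd * vdd * freq;
}
```

At a fixed frequency, halving the supply voltage cuts dynamic power by a factor of four, which is why voltage scaling is so attractive — and why the lack of further voltage reduction is the problem this section describes.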
 
 
 
The main component of static consumption is leakage power. Until recently, static energy consumption was a minor part of the overall energy budget. However, static consumption has an exponential relationship with the threshold voltage. To reduce dynamic energy consumption, the supply voltage decreases from one generation to the next; to keep or increase the frequency, the threshold voltage has to decrease too, which makes static energy consumption grow exponentially. Static energy consumption is expected to be around 50% of the overall energy consumption in future generations (in less than a decade) [Bor99] [DB00].
 
 
 
Finally, power density increases with each generation, mainly due to higher frequencies and leakage currents. Power density translates directly into heat, and this heat has to be dissipated somehow. In fact, the cost of dissipating the processor's heat is growing in the same proportion as the power density: it is predicted that above 40 W, the cost of dissipating 1 watt is between 1 and 3 dollars [GBC+01]. A drastic temperature increase in an area of the processor may cause a transient failure or even an unrecoverable error. Furthermore, the static power consumption due to leakage currents has an exponential relationship with temperature; thus, an increase in temperature implies an increase in power consumption, which in turn increases temperature, resulting in a dangerous feedback loop.
 
 
 
Nowadays, techniques to reduce dynamic consumption focus on reducing the activity of the processor when maximum throughput is not needed (turning off unused units or changing the frequency and voltage of the processor [PKG00] [CG01] [CGS00] [SAD+02]). To reduce static energy consumption, proposed techniques completely shut down zones of the cache memory or implement circuits with different frequencies or threshold voltages [FKM+00] [KC00] [KMN+01].
 
 
 
The topic of temperature reduction is relatively new; the techniques proposed so far focus on reducing the number of thermal emergencies. Every time an emergency is detected, the OS takes control of the processor and reduces its frequency until the chip is cool enough to resume normal operating frequency, thus incurring a performance penalty. The proposed techniques try to avoid this situation through strict control of the activity of the processor when it is close to the temperature limit [HB04] [SAS02] [SSH+03].
 
 
 
The group has already done some work on value compression for power reduction [CG00] [CG01] [CGS00] [CGS04] and on the evaluation of multicore architectures [MCG06].
 
 
 
== Variations  ==
 
 
 
Circuit behavior is no longer deterministic in current and future technologies due to variations. Limitations in the fabrication process introduce indeterminism in the geometry (length, width, and thickness) of transistors and, hence, in their behavior. Moreover, voltage and temperature oscillate, and circuit inputs change. Thus, the delay and power of circuits change dynamically, yet circuits must work in any feasible scenario. This issue is usually addressed by assuming the worst-case scenario to ensure circuit functionality, but that assumption is pessimistic most of the time and very inefficient in terms of power and delay.
 
 
 
The aim of our research is to find solutions that make circuits more efficient in the presence of variations. The result will be a set of strategies, techniques, and circuit designs that improve performance and power by adapting circuit operation to the common case instead of always assuming worst-case conditions.
 
 
 
Some works have already been proposed to exploit variations. TIMERRTOL [Uht00] and Razor [EKD+03] exploit variations by using results after the common-case delay and checking their correctness after the worst-case delay. Input variations are also exploited by means of narrow values [BM99], which can be operated on with shorter latencies.
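The narrow-value idea can be sketched as follows (our illustration; the 16-bit width and the cycle counts are arbitrary choices, not from the cited work): when both operands of an operation are representable in a narrow width, the hardware could produce the result with a shorter latency.

```c
#include <stdint.h>

/* A value is "narrow" if it survives a round-trip through 16 bits,
   i.e. it is representable in the narrow datapath width. */
int is_narrow(int32_t v) {
    return v == (int16_t)v;
}

/* Model of the latency decision: a narrow-narrow add could complete in
   one cycle on a short datapath, a full-width add needs the slow path.
   Cycle counts are illustrative. */
int add_latency(int32_t a, int32_t b) {
    return (is_narrow(a) && is_narrow(b)) ? 1 : 2;
}
```

Since most program values are small, the fast path is taken for the common case while the slow path preserves correctness, mirroring the common-case-versus-worst-case theme of this section.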
 
 
 
The main groups working in the area of variations are the Department of Electrical and Computer Engineering at the University of Rhode Island, the Department of Electrical Engineering at Princeton University, the Division of Engineering and Applied Sciences at Harvard University, and the Department of Electrical Engineering and Computer Science at the University of Michigan.
 
 
 
== References  ==
 
 
 
[AHK00] V. Agarwal, M.S. Hrishikesh, S.W. Keckler and D. Burger. "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures". In Proc. of the 27th Ann. Int. Symp. on Computer Architecture, June 2000.
 
 
 
[ACGK03a] A. Aletà, J.M. Codina, A. González and D. Kaeli, "Instruction Replication for Clustered Microarchitectures", in Procs. of 36th Int. Symp. on Microarchitecture (MICRO-36), Dec. 2003.
 
 
 
[ACGK05] A. Aletà, J.M. Codina, A. González and D. Kaeli. “Demystifying On-the-fly Spill Code”, in Procs. of the Conf. on Programming Language Design and Implementation (PLDI), 2005.
 
 
 
[ACSG01] A. Aletà, J.M. Codina, J. Sánchez and A. González. "Graph-Partitioning Based Instruction Scheduling for Clustered Processors", in Proc. of 34th Int. Symp. On Microarchitecture, Dec 2001.
 
 
 
[ACS+02] A. Aletà, J.M. Codina, J. Sánchez, A. González and D. Kaeli. "Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning", in Proc. of the Int. Conf. on Parallel Architectures and Compilation Techniques (PACT'02), Sept 2002.
 
 
 
[JK83] J. R. Allen, K. Kennedy, C. P. Warren. “Conversion of Control Dependence to Data Dependence”. POPL '83: Proceedings of the 10th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 177-189.
 
 
 
[AGS+99] E. Altman, M. Gschwind, S. Sathaye, S. Kosonocky, A. Bright, J. Fritts, P. Ledak, D. Appenzeller, C. Agricola, Z. Filan, “BOA: The Architecture of a Binary Translation Engine”, IBM Research Report RC 21665 (97500), 1999
 
 
 
[ACGH97] D. August, D. Connors, J. Gyllenhaal, W. M. Hwu, “Architectural Support for Compiler-Synthesized Dynamic Branch Prediction Strategies: Rationale and Initial Results”. In HPCA '97: Proceedings of the 3rd International Symposium on High-Performance Computer Architecture, pages 84-93, 1997.
 
 
 
[AHM97] D. August, W. M. W. Hwu, S. A. Mahlke. “A Framework for Balancing Control Flow and Predication”. In MICRO 30: International Symposium on Microarchitecture, pages 92-103, 1997.
 
 
 
[BS03] S. Balakrishnan and G. S. Sohi. "Exploiting Value Locality in Physical Register Files", Proceedings of the 36th International Symposium on Microarchitecture, 2003
 
 
 
[B04] R. Balasubramonian, "Cluster prefetch: tolerating on-chip wire delays in clustered microarchitectures". 18th Annual International Conference on Supercomputing (ICS), pp. 326-335, June, 2004.
 
 
 
[BDA01] R. Balasubramonian, S. Dwarkadas, and D. Albonesi. "Reducing the complexity of the register file in dynamic superscalar processors". In Proc. of the 34th Annual Intl. Symp. On Microarchitecture, pages 237-248, 2001.
 
 
 
[BDA03] R. Balasubramonian, S. Dwarkadas and D. Albonesi. “Dynamically Managing the Communication-Parallelism Trade-off in Future Clustered Processors”. In Proc. of the 30th. Ann. Intl. Symp. on Computer Architecture, pp. 275-286, June 2003.
 
 
 
[Bor99] S. Borkar. “Design Challenges of Technology Scaling”. IEEE Micro, 19(4), 1999.
 
 
 
[BTM00] D. Brooks, V. Tiwari and M. Martonosi, “Wattch: a framework for architectural-level power analysis and optimizations”, 27th Annual International Symposium on Computer Architecture, pp. 83-94, 2000.  
 
 
 
[BGK96] D. Burger, J. R. Goodman, and A. Kägi, “Quantifying Memory Bandwidth Limitations of Current and Future Microprocessors”, In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.
 
 
 
[BKG95] D. Burger, A. Kägi and J. R. Goodman, “The Declining Effectiveness of Dynamic Caching for General Purpose Microprocessors“, Technical Report 1261, Computer Sciences Department, University of Wisconsin, Madison, WI, January 1995.
 
 
 
[BM99] D. Brooks and M. Martonosi, “Dynamically Exploiting Narrow Width Operands to Improve Processor Power and Performance”, In Proceedings of the 5th International Symposium on High-Performance Computer Architecture, January 1999.
 
 
 
[CG00] R. Canal, A. González. “A Low Complexity Issue Logic”. Proceedings of the 2000 International Conference on Supercomputing, June 2000.



[CG01] R. Canal, A. González. “Reducing the Complexity of the Issue Logic”. Proceedings of the 2001 International Conference on Supercomputing, pp. 312-320, June 2001.
 
 
 
[CGS00] R. Canal, A. González and J.E. Smith, “Very Low Power Pipelines using Significance Compression”, in Proceedings of the 33rd International Symposium on Microarchitecture, pp. 181-190, December 2000.
 
 
 
[CGS04] R. Canal, A. González and J. E. Smith, "Software- Controlled Operand Gating", Proc. of the International Symposium on Code Generation and Optimization (CGO-2), Palo Alto (CA-USA), pp. 125-136, March 2004
 
 
 
[CDN92] A. Capitanio, N. Dutt and A. Nicolau, "Partitioned Register Files for VLIWs: A Preliminary Analysis of Tradeoffs", in Procs. of 25th Int. Symp. on Microarchitecture, pp. 192-300, 1992.
 
 
 
[CHPC95] P. Chang, E. Hao, Y. Patt, P. Chang. “Using Predicated Execution to Improve the Performance of a Dynamically Scheduled Machine with Speculative Execution”. In PACT '95: Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques, pages 99-108, UK, 1995.
 
 
 
[CFM03] M. Chu, K. Fan and S. Mahlke, "Region-based Hierarchical Operation Partitioning for Multicluster Processors", in Procs. of the Conf. on Programming Language Design and Implementation (PLDI), 2003.
 
 
 
[CC03] W. Chuang, B. Calder. “Predicate Prediction for Efficient Out-of-Order Execution”. In ICS '03: Proceedings of the 17th Annual International Conference on Supercomputing, pages 183-192, 2003.
 
 
 
[CMT00] M. Cintra, J.F. Martinez and J. Torrellas, “Architectural Support for Scalable Speculative Parallelization in Shared-Memory Systems”, in Proc. of the 27th Int. Symp. on Computer Architecture, 2000.
 
 
 
[CSG01] J.M. Codina, J. Sánchez and A. González, "A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors", in Procs. of Int. Conf. on Parallel Architectures and Compilation Techniques (PACT'01), Sept. 2001.
 
 
 
[DB00] V. De and S. Borkar. “Technology and Design Challenges for Low Power and High Performance”. Proceedings of the International Symposium on Low Power Electronics Design, 2000.
 
 
 
[D98] G. Desoli, "Instruction Assignment for Clustered VLIW DSP Compilers: A New Approach", Technical Report HPL-98-13, HP Laboratories, February 1998.
 
 
 
[DM82] D. R. Ditzel, H. R. McLellan. “Register Allocation for Free: The C Machine Stack Cache”. In Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems, pages 48-56, 1982.
 
 
 
[EA96] K. Ebcioglu, E. R. Altman, “DAISY: Dynamic Compilation for 100% Architectural Compatibility”, IBM Research Report RC 20538, 1996.
 
 
 
[E86] J. R. Ellis, "Bulldog: A Compiler for VLIW Architectures", MIT Press, pp. 180-184, 1986.
 
 
 
[EKD+03] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, T. Mudge. Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation. Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003.
 
 
 
[FBC+01] B. Fahs, S. Bose, M. Crum, B. Slechta, F. Spadini, T. Tung, S. Patel, S. Lumetta, “Performance Characterization of a Hardware Mechanism for Dynamic Optimization”, in Procs. of 34th International Symposium on Microarchitecture (MICRO-34), 2001.
 
 
 
[FCJ97] K. Farkas, P. Chow, N.P. Jouppi, and Z. Vranesic, “The Multicluster Architecture: Reducing Cycle Time Through Partitioning”, In Proc. of the 30th. Int. Symp. on Microarchitecture, pp. 149-159, Dec. 1997.
 
 
 
[FKM+00] K. Flautner, N.S. Kim, S. Martin, D. Blaauw and T. Mudge. “Drowsy Caches: Simple Techniques for Reducing Leakage Power”. Proceedings of the International Symposium on Computer Architecture, 2002.
 
 
 
[FS96] M. Franklin, G.S. Sohi, "ARB: A Hardware Mechanism for Dynamic Reordering of Memory References". IEEE Trans. Computers 45(5), pp. 552-571, May, 1996.
 
 
 
[GMS+05] C. García, C. Madriles, J. Sánchez, P. Marcuello, A. González and D.M. Tullsen, “Mitosis Compiler: An Infrastructure for Speculative Threading Based on Pre-computation Slices”, in Proc. of the Int. Conf. on Programming Language Design and Implementation, pp. 269-278, 2005
 
 
 
[GSG02a] E. Gibert, J. Sánchez and A. González, "An Interleaved Cache Clustered VLIW Processor", in Procs. of 16th Int. Conf. on Supercomputing, June 2002.
 
 
 
[GSG02b] E. Gibert, J. Sánchez and A. González, "Effective Instruction Scheduling Techniques for an Interleaved Cache Clustered VLIW Processor", in Procs. of 35th Int. Symp. on Microarchitecture, December 2002.
 
 
 
[GSG03a] E. Gibert, J. Sánchez and A. González, "Local Scheduling Techniques for Memory Coherence in a Clustered VLIW Processor with a Distributed Data Cache", in Procs. of 1st Int. Symp. on Code Generation and Optimization (CGO'03), March 2003.
 
 
 
[GSG03b] E. Gibert, J. Sánchez and A. González, "Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors", in Procs. of 36th Int. Symp. on Microarchitecture (MICRO-36), Dec. 2003.
 
 
 
[GBC+01] S. Gunther, F. Binns, D. M. Carmean and J.C. Hall. “Managing the Impact of Increasing Microprocessor Power Consumption”. Intel Technology Journal, Q1, 2001.
 
 
 
[HB04] K. Hazelwood and D. Brooks, “Eliminating Voltage Emergencies via Microarchitectural Voltage Control Feedback and Dynamic Optimization”, Proceedings of the International Symposium on Low-Power Electronics and Design, pp. 326-331, August 2004.
 
 
 
[HL99] T. Horel, G. Lauterbach. “UltraSPARC-III: Designing Third-Generation 64-Bit Performance”. IEEE Micro, May-June 1999.
 
 
 
[HL91] M. Huguet, T. Lang. “Architectural Support for Reduced Register Saving/Restoring in Single-Window Register Files”. ACM Trans. Computer Systems, 9(1):66 – 67. Feb. 1991.
 
 
 
[JCSK98] S. Jang, S. Carr, P. Sweany and D. Kuras, "A Code Generation Framework for VLIW Architectures with Partitioned Register Banks", in Procs. of 3rd. Int. Conf. on Massively Parallel Computing Systems, April 1998.
 
 
 
[KEA01] K. Kailas, K. Ebcioglu and A. Agrawala, "CARS: A New Code Generation Framework for Clustered ILP Processors", in Procs. of the 7th Int. Symp. on High-Performance Computer Architecture, Jan. 2001.
 
 
 
[KC00] J.T. Kao and A. P. Chandrakasan. “Dual-Threshold Voltage Techniques for Low-Power Digital Circuits”. IEEE Journal of Solid-State Circuits, 35(7), 2000.
 
 
 
[KF96] G. A. Kemp and M. Franklin. “PEWs: A Decentralized Dynamic Scheduler for ILP Processing". In Proc. of Int. Conf. on Parallel Processing, pp. 239-246, August 1996.
 
 
 
[KMN+01] A. Keshavarzi, S. Ma, S. Narendra, B. Bloechel, K. Mistry, T. Ghani, S. Borkar and V. De. “Effectiveness of Reverse Body Bias for Leakage Control in Scaled Dual Vt CMOS ICs.” Proc. of the International Symposium on Low Power Electronics Design, 2001.
 
 
 
[K99] R.E. Kessler. "The Alpha 21264 Microprocessor”. IEEE Micro, 19(2):24-36, 1999.
 
 
 
[KMSP05] H. Kim, O. Mutlu, J. Stark, Y. N. Patt. “Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution”. In MICRO 38: International Symposium on Microarchitecture, pages 43-54, 2005.
 
 
 
[KS02] H.-S. Kim and J. E. Smith. “An Instruction Set and Microarchitecture for Instruction Level Distributed Processing”. In Proc. of the 29th Ann. Intl. Symp. on Computer Architecture, 2002.
 
 
 
[Kla00] A. Klaiber, “The Technology Behind the CrusoeTM Processors”, white paper, http://www.transmeta.com/pdfs/paper_aklaiber_19jan00.pdf, January 2000
 
 
 
[LPSA02] W. Lee, D. Puppin, S. Swenson, S. Amarasinghe, "Convergent Scheduling", in Procs. of 35th Int. Symp. on Microarchitecture, December 2002.
 
 
 
[MBG+94] S. A. Mahlke, R. H. Bringmann, J. C. Gyllenhaal, D. M. Gallagher, W.-M. W. Hwu. “Characterizing the Impact of Predicated Execution on Branch Prediction”. In MICRO 27: Proceedings of the 27th Annual ACM/IEEE International Symposium on Microarchitecture, pages 217-227, 1994.
 
 
 
[MG99] P. Marcuello and A. González, “Clustered Speculative Multithreaded Processors”, in Proc. of the 13th Int. Conf. on Supercomputing, pp. 365-372, 1999.
 
 
 
[MD03] C. McNairy, D. Soltis. “Itanium 2 Processor Microarchitecture”. IEEE Micro, pp. 44-55, March-April 2003.
 
 
 
[MCG06] M. Monchiero, R. Canal and A. González, "Design Space Exploration for Multicore Architectures: A Power/Performance/Thermal View", 20th ACM International Conference on Supercomputing (ICS'06), Cairns (Australia), June 2006.
 
 
 
[MGV99] T. Monreal, A. González, M. Valero, J. González and V. Viñals, "Delaying Physical Register Allocation Through Virtual-Physical Registers", Proceedings of the 32nd International Symposium on Microarchitecture, 1999.
 
 
 
[NSB01] R. Nagarajan, K. Sankaralingam, D. Burger, S.W. Keckler. “A Design Space Evaluation of Grid Processor Architectures”. Proc. 34th International Symposium on Microarchitecture, 2001.
 
 
 
[ND95] P. R. Nuth, W. J. Dally. “The Named-State Register File: Implementation and Performance”. In Proceedings of the International Symposium on High-Performance Computer Architecture, pages 4-13, 1995.
 
 
 
[NE98] E. Nystrom and A.E. Eichenberger, "Effective Cluster Assignment for Modulo Scheduling", in Procs. of the 31st Int. Symp. on Microarchitecture, pp. 103-114, 1998.
 
 
 
[OBMR05] D. W. Oehmke, N. L. Binkert, T. Mudge, S. K. Reinhardt, “How to Fake 1000 Registers”. In MICRO 38: Proceedings of the 38th International Symposium on Microarchitecture, 2005.
 
 
 
[OBC98] E. Özer, S. Banerjia, T. Conte, "Unified Assign and Schedule: A New Approach to Scheduling for Clustered Register File Microarchitectures", in Procs. of the 31st Int. Symp. on Microarchitecture, 1998.
 
 
 
[PJS97] S. Palacharla, N.P. Jouppi, and J.E. Smith, “Complexity-Effective Superscalar Processors”, In Proc. of the 24th Int. Symp. on Computer Architecture, pp. 206-218, June 1997.
 
 
 
[PKG00] D. Ponomarev, G. Kucuk, K. Ghose. “Reducing Power Requirements of Instruction Scheduling through Dynamic Allocation of Multiple Datapath Resources”. Proc. of International Symposium on Microarchitecture, 2000.
 
 
 
[PSR00] Z. Purser, K. Sundaramoorthy, and E. Rotenberg. "A Study of Slipstream Processors". In Proceedings of the 33rd International Symposium on Microarchitecture, December 2000.
 
 
 
[QPG06] E. Quiñones, J.M. Parcerisa, A. González. “Selective Predicate Prediction for Out-of-Order Processors”. In ICS'06: Proceedings of the 20th Annual International Conference on Supercomputing, pages 46-66, 2006.
 
 
 
[QPG07] E. Quiñones, J.M. Parcerisa, A. Gonzalez. “Improving Branch Prediction and Predicated Execution in Out-of-Order Processors”. In HPCA '07: Proceedings of the 13th international Symposium on High-Performance Computer Architecture, 2007.
 
 
 
[RP03] P. Racunas, Y. N. Patt, "Partitioned First-Level Cache Design for Clustered Microarchitectures". 17th Annual International Conference on Supercomputing (ICS), June 2003.
 
 
 
[RAM+04] R. Rosner, Y. Almog, M. Moffie, N. Schwartz, A. Mendelson, “Power Awareness through Selective Dynamically Optimized Traces”, in Procs. of 31st International Symposium on Computer Architecture (ISCA-31), 2004.
 
 
 
[RJS97] E. Rotenberg, Q. Jacobson, Y. Sazeides, J. Smith. “Trace Processors”. Proc. 30th International Symposium on Microarchitecture, 1997.
 
 
 
[SG00a] J. Sánchez and A. González, "The Effectiveness of Loop Unrolling for Modulo Scheduling in Clustered VLIW Architectures", in Procs. of the 29th Int. Conf. on Parallel Processing, Aug. 2000.
 
 
 
[SG00b] J. Sánchez, and A. González, "Modulo Scheduling for a Fully-Distributed Clustered VLIW Architecture", in Procs. of 33rd Int. Symp. on Microarchitecture, Dec. 2000.
 
 
 
[SNM+06] K. Sankaralingam, R. Nagarajan, R. McDonald, et al. "Distributed Microarchitectural Protocols in the TRIPS Prototype Processor". 39th International Symposium on Microarchitecture (MICRO), December, 2006.
 
 
 
[SPN96] A. Saulsbury, F. Pong and A. Nowatzyk, “Missing the Memory Wall: The Case for Processor/Memory Integration”, In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.
 
 
 
[SAD+02] G. Semeraro, D. H. Albonesi, S. G. Dropsho, G. Magklis, S. Dwarkadas and M. L. Scott. “Dynamic Frequency and Voltage Control for a Multiple Clock Domain Microarchitecture”. Proc. Int. Symposium on Microarchitecture, 2002.
 
 
 
[SIA97] Semiconductor Industry Association, “The National Technology Roadmap for Semiconductors”, 1997.
 
 
 
[STR02] A. Seznec, E. Toullec and O. Rochecouste, "Register Write Specialization Register Read Specialization: A Path to Complexity-Effective Wide-Issue Superscalar Processors", Proceedings of the 35th International Symposium on Microarchitecture, 2002.
 
 
 
[SCF03] B. Simon, B. Calder, J. Ferrante. “Incorporating Predicate Information into Branch Predictors”. In HPCA '03: Proceedings of the 9th international Symposium on High-Performance Computer Architecture, 2003.
 
 
 
[Sit79] R. L. Sites. “How to Use 1000 Registers”. In Caltech Conference on VLSI, pages 527-532, 1979.
 
 
 
[SAS02] K. Skadron, T. Abdelzaher and M. R. Stan. “Control-Theoretic Techniques and Thermal-RC Modeling for Accurate and Localized Dynamic Thermal Management”. Proceedings of the International Symposium on High-Performance Computer Architecture, 2002.
 
 
 
[SSH+03] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan and D. Tarjan. “Temperature-Aware Microarchitecture”. Proceedings of the International Symposium on Computer Architecture, pp. 2-13, 2003.
 
 
 
[SN05] J. E. Smith, and R. Nair, “Virtual Machines: Versatile Platforms for Systems and Processes”, Morgan Kaufmann Publishers, 2005
 
 
 
[SBV95] G.S. Sohi, S.E. Breach and T.N. Vijaykumar, “Multiscalar Processors”, in Proc. of the 22nd Int. Symp. on Computer Architecture, pp.414-425, 1995.
 
 
 
[SCZ+02] J. Steffan, C. Colohan, A. Zhai and T. Mowry, “Improving Value Communication for Thread-Level Speculation”, in Proc. of the 8th Int. Symp. on High Performance Computer Architecture, pp. 58-62, 2002.
 
 
 
[SD95] C. L. Su and A. M. Despain, “Cache Design Tradeoffs for Power and Performance Optimization: A Case Study”, In Proceedings of the International Symposium on Low Power Electronics and Design, April 1995.
 
 
 
[Uht00] A. K. Uht. “Achieving Typical Delays in Synchronous Systems via Timing Error Toleration”. Tech. Rep. No. 032000-0100, Dept. of Electrical and Computer Engineering, University of Rhode Island, 2000.
 
 
 
[WTS+97] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal, "Baring it all to Software: Raw Machines", IEEE Computer, pp. 86-93, September 1997.
 
 
 
[WWK+01] P. H. Wang, H. Wang, R. M. Kling, K. Ramakrishnan, J. P. Shen. “Register Renaming and Scheduling for Dynamic Execution of Predicated Code”. In HPCA '01: Proceedings of the 7th International Symposium on High-Performance Computer Architecture, page 15, 2001.
 
 
 
[YZG00] J. Yang, Y. Zhang, and R. Gupta, “Frequent Value Compression in Data Caches”, In Proceedings of the 33rd International Symposium on Microarchitecture, December 2000.
 
 
 
[YMR+99] A. Yoaz, E. Mattan, R. Ronen, and S. Jourdan, “Speculation Techniques for Improving Load Related Instruction Scheduling”, in Proc. of 26th ISCA, pp. 42-53, May 1999.
 
 
 
[ZLAV01] J. Zalamea, J. Llosa, E. Ayguadé, and M. Valero, "Modulo Scheduling with Integrated Register Spilling for Clustered VLIW Architectures", Proc. 34th Ann. Int'l Symp. on Microarchitecture (MICRO-34), December 2001.
 
 
 
[ZYG00] Y. Zhang, J. Yang, and R. Gupta, “Frequent Value Locality and Value-Centric Data Cache Design”, In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, November 2000.
 
 
 
[ZK01] V. Zyuban and P.M. Kogge, “Inherently Lower-Power High-Performance Superscalar Architectures,“ IEEE Trans. on Computers, vol. 50, no. 3, pp. 268-285, March 2001.  
 
  
 
<br>
 
<br>
= Project Bureaucracy =
*[[Project Bureaucracy|'''Go to Project Bureaucracy''']]

Latest revision as of 17:34, 27 October 2020

CoCoUnit: An Energy-Efficient Processing Unit for Cognitive Computing (2019-2025)

There is a fast-growing interest in extending the capabilities of computing systems to perform human-like tasks in an intelligent way. These technologies are usually referred to as cognitive computing. We envision a next revolution in computing in the forthcoming years that will be driven by deploying many “intelligent” devices around us in all kind of environments (work, entertainment, transportation, health care, etc.) backed up by “intelligent” servers in the cloud. These cognitive computing systems will provide new user experiences by delivering new services or improving the operational efficiency of existing ones, and altogether will enrich our lives and our economy.

A key characteristic of cognitive computing systems will be their capability to process in real time large amounts of data coming from audio and vision devices, and other type of sensors. This will demand a very high computing power but at the same time an extremely low energy consumption. This very challenging energy-efficiency requirement is a sine qua non to success not only for mobile and wearable systems, where power dissipation and cost budgets are very low, but also for large data centers where energy consumption is a main component of the total cost of ownership.

Current processor architectures (including general-purpose cores and GPUs) are not a good fit for this type of system, since they keep the same basic organization as early computers, which were mainly optimized for “number crunching”. CoCoUnit will take a disruptive direction by investigating unconventional architectures that can offer orders of magnitude better efficiency in terms of performance per energy and cost for cognitive computing tasks. The ultimate goal of this project is to devise a novel processing unit that will be integrated with the existing units of a processor (general-purpose cores and GPUs) and altogether will be able to deliver cognitive computing user experiences with extremely high energy efficiency.

This project is funded by the European Research Council through the ERC Advanced Grants program.


Intelligent, Ubiquitous and Energy-Efficient Computing Systems (2016-2020)

The ultimate goal of this project is to devise novel platforms that provide rich user experiences in the areas of cognitive computing and computational intelligence on mobile devices such as smartphones and wearables. This project investigates novel unconventional architectures that can offer orders of magnitude better efficiency in terms of performance per energy, and at the same time important improvements in raw performance. These platforms will rely on various types of units specialized for different application domains. Special focus is placed on graphics processors and brain-inspired architectures (e.g. hardware neural networks), due to their potential to exploit high degrees of parallelism and their energy efficiency for this type of application. Extensions to existing architectures combined with novel accelerators will be explored. We also investigate resilient architectures that allow computing systems to operate at very low supply voltage levels in order to optimize their energy consumption, providing adequate fault-tolerance solutions so that reliability is not compromised.


Microarchitecture and Compilers for Future Processors III (2014-2016)

The main objective of this project for the researchers of the ARCO group is research into the design of future microprocessors, taking into account the constraints of future technology, both for high-performance processors and for commodity electronics. Two factors have fundamentally driven the increase in processor performance: on one hand, technological advances in microprocessor manufacturing and, on the other, new and more efficient microarchitectural and compiler techniques. These improvements bring a number of challenges that are now considered key in designing the processors of the upcoming decade: limited instruction-level parallelism, interconnection network delays, high power consumption, heat dissipation, and system reliability and security. In this project we address the influence of these issues on the research of future processors. Specifically, we will address six areas that we consider fundamental:

  1. The efficient design of circuits in the presence of unexpected changes in its operating parameters
  2. The efficient design of graphic processors oriented to mobile devices
  3. The efficient implementation of virtual machines with low complexity but high computing power
  4. The characterization and acceleration of emerging applications
  5. The design of new heterogeneous multiprocessor architectures that optimize the use of the different processors depending on the types of application being executed
  6. The study of new techniques in the design of the memory hierarchy and interconnection networks to tolerate the increasing gap between the speeds of the various components of the computer.


Microarchitecture and Compilers for Future Processors II (2010-2014)

The main objective of this project for the researchers of the ARCO group is research into the design of future microprocessors, taking into account the constraints of future technology, both for high-performance processors and for commodity electronics. Two factors have fundamentally driven the increase in processor performance: on one hand, technological advances in microprocessor manufacturing and, on the other, new and more efficient microarchitectural and compiler techniques. These improvements bring a number of challenges that are now considered key in designing the processors of the upcoming decade: limited instruction-level parallelism, interconnection network delays, high power consumption, heat dissipation, and system reliability and security. In this project we address the influence of these issues on the research of future processors. Specifically, we will focus on six areas which we consider to be fundamental:

  1. The study of new techniques in the memory hierarchy design to tolerate the increasing gap between processor and memory speeds
  2. The efficient circuit design in the face of unexpected variations of their working parameters
  3. The implementation of efficient virtual machines with low complexity but high computing power
  4. The implementation of intrusion detection systems to assure a high computer security level
  5. Characterization and acceleration of emerging applications
  6. The design of novel multithreaded processors to exploit thread-level parallelism


Microarchitecture and Compilers for Future Processors (2006-2010)

The main objective in this project is to research the design of next-decade processors, considering the constraints of the technology expected to be feasible in the coming years. Until recently, processor performance was mainly determined by two factors: technological advances in microprocessor manufacturing and the use of new and more efficient microarchitectural and compiler techniques. Now new challenges must be addressed, for instance: high power consumption, heat dissipation, wire delays, design complexity, and limited instruction-level parallelism. In this project we address the influence of these issues on the research of future processors. Specifically, we will focus on seven areas which we consider to be fundamental:

  1. The reduction in power consumption and better approaches for heat dissipation
  2. The exploitation of thread-level speculative parallelism
  3. The design of clustered microarchitectures
  4. The efficient implementation of ISA extensions for out-of-order processors
  5. The efficient implementation of co-designed virtual machines
  6. The study of new techniques in the register file and cache memory design to tolerate the increasing gap between processor and memory speeds
  7. The efficient circuit design in the face of unexpected variations of their working parameters.


Project Bureaucracy