Part 2 ― Summary of Recent Technical Papers

Higher Performance and Lower Power Consumption in a Multi-Core LSI Achieved in Conjunction with an Auto-Parallelizing Compiler

ISSCC 2008 Paper No. 4.5 describes the independent power supply isolation and CPU synchronization control methods used in a prototype chip containing eight CPU cores and eight RAMs.

Presented by: Waseda University, Hitachi Ltd., and Renesas Technology Corporation

Masayuki Ito
Senior Engineer
CPU Development Dept.1
System Core Technology Div.
System Solution Business Group
Renesas Technology Corp.

Toshihiro Hattori, Ph.D.
Department Manager
CPU Development Dept.1
System Core Technology Div.
System Solution Business Group
Renesas Technology Corp.

Renesas has worked with Waseda University and Hitachi to develop a prototype multi-core LSI with eight CPU cores, a device aimed at the digital consumer electronics sector where there is a strong demand for higher performance and more advanced functions. The LSI incorporates a mechanism that works in conjunction with an auto-parallelizing compiler to reduce unnecessary power consumption by turning off the power to idle CPU cores. It also implements a hardware-based barrier synchronization scheme that significantly improves the performance of parallel processing.

Using an auto-parallelizing compiler to control the power mode independently for each of the eight CPUs

Renesas, in collaboration with Waseda University (Laboratory of Hironori Kasahara and Keiji Kimura) and Hitachi, has developed a prototype multi-core LSI device with eight SH-4A CPU cores that consumes 2.8 W at 1.0 V. The CPUs operate at speeds up to 600 MHz, and at that clock frequency the total CPU performance is 8640 MIPS (see Figure 1). The chip is well suited to multi-function digital consumer electronics products because it can execute functions such as networking, music, and video playback in real time while keeping power consumption low.
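As a quick check on the performance figure, the quoted total can be broken down per core. The core count, clock frequency, and total MIPS below are taken from the text; the per-core and per-MHz values are simply derived from them and are not stated in the article.

```python
# Figures quoted in the article.
NUM_CORES = 8
CLOCK_MHZ = 600
TOTAL_MIPS = 8640

# Derived values (not stated in the article, only computed from the above).
mips_per_core = TOTAL_MIPS / NUM_CORES      # throughput of one core at 600 MHz
mips_per_mhz = mips_per_core / CLOCK_MHZ    # instructions per clock cycle, per core

print(f"{mips_per_core:.0f} MIPS per core, {mips_per_mhz:.1f} MIPS/MHz")
```

In other words, the 8640 MIPS total corresponds to 1080 MIPS per core, or 1.8 MIPS per MHz.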

A power supply isolation mechanism implemented in the new LSI device adds two new power modes, 'resume power-off' and 'full power-off', to the 'normal', 'light sleep', and 'sleep' modes used conventionally (see Table 1). The resume power-off mode leaves only the user RAM (URAM) power supply turned on, which allows a quick restart after the power is restored. The full power-off mode completely disconnects all power to the CPU cores (see Figure 2).
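The five modes and the per-core independence can be pictured with a small data-structure sketch. The mode names come from the text; the class, the helper functions, and the assumption that URAM stays powered in the three conventional modes are illustrative only and do not describe the chip's actual register interface.

```python
from enum import Enum

class PowerMode(Enum):
    NORMAL = "normal"
    LIGHT_SLEEP = "light sleep"
    SLEEP = "sleep"
    RESUME_POWER_OFF = "resume power-off"   # new: only URAM stays powered
    FULL_POWER_OFF = "full power-off"       # new: all power disconnected

# The two modes added by the power supply isolation mechanism.
POWER_ISOLATED = {PowerMode.RESUME_POWER_OFF, PowerMode.FULL_POWER_OFF}

def core_powered(mode):
    """True if the CPU core logic still has its power supply."""
    return mode not in POWER_ISOLATED

def uram_powered(mode):
    """True if the user RAM keeps its contents (assumed for conventional modes)."""
    return mode is not PowerMode.FULL_POWER_OFF

# Each of the eight cores holds its own mode (hypothetical bookkeeping).
modes = [PowerMode.NORMAL] * 8
modes[6] = PowerMode.RESUME_POWER_OFF
modes[7] = PowerMode.FULL_POWER_OFF

for cpu, mode in enumerate(modes):
    print(cpu, mode.value, core_powered(mode), uram_powered(mode))
```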

The auto-parallelizing compiler developed for the prototype multi-core LSI device at Waseda University can control both the power supply and clock frequency for each CPU core and its associated URAM. The compiler generates parallel C programs that perform power control and allocate processing efficiently between the CPUs. The compiler also automatically generates a power control program that manages the state of each CPU, aiming to keep power consumption down while still achieving the desired completion time. "Using this technology has greatly shortened the time required to produce a parallel processing application. Whereas it used to take weeks using our previous parallelization techniques, now it takes just minutes," commented Mr. Hattori.
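The article does not describe how the generated power control program chooses a mode, so the following is only a minimal sketch of one plausible policy: for each idle period, pick the deepest mode whose wake-up latency still fits, so the completion time is not affected. The latency values, the function name `choose_mode`, and the `needs_uram` flag are all hypothetical and are not taken from the paper.

```python
# Modes ordered from shallowest to deepest power saving (ordering assumed).
MODES = ["normal", "light sleep", "sleep", "resume power-off", "full power-off"]

# Hypothetical wake-up latencies in microseconds. NOT from the article.
WAKE_LATENCY_US = {
    "normal": 0,
    "light sleep": 1,
    "sleep": 10,
    "resume power-off": 100,
    "full power-off": 10_000,
}

def choose_mode(idle_us, needs_uram):
    """Return the deepest mode whose wake-up latency fits in the idle period.

    needs_uram: True if data in the core's URAM must survive the idle period,
    which rules out full power-off.
    """
    best = "normal"
    for mode in MODES:
        if mode == "full power-off" and needs_uram:
            continue
        if WAKE_LATENCY_US[mode] <= idle_us:
            best = mode
    return best

print(choose_mode(500, needs_uram=True))        # idle long enough for resume power-off
print(choose_mode(1_000_000, needs_uram=False)) # long idle, URAM contents not needed
```

A real compiler-generated schedule would also account for clock frequency control, which the text says the compiler manages alongside the power supply; that is omitted here.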

When the resume power-off mode was evaluated on the eight-CPU prototype chip, the development team measured a 70% reduction in power consumption (see Figure 3). During that evaluation, the device ran an AAC audio encoding program in real time.

Figure 1: Photograph of eight-core integrated multi-core LSI chip (left) and internal structure of each core.

Table 1: Prototype device's five power modes.
The power mode can be set independently for each of the eight CPUs. Two additional power supply isolation modes aimed at reducing leakage current, "resume power-off" and "full power-off", have been added to the three conventional modes.

Figure 2: Power consumption in five different power modes.
These evaluation results show that although the chip consumes 304 mW even when all CPUs are in sleep mode, the new resume power-off mode significantly reduces leakage current, decreasing overall power consumption by 88%.

Figure 3: Measured power consumption.
The data shown here is from an evaluation in which an AAC encoder program was executed on the eight CPU cores in real time. When the power control functions incorporated by the multi-core compiler were used, power consumption decreased by 70% compared to the level observed without those control functions.

Applying hardware barrier synchronization to get an eighteen-fold increase in speed

The work on the prototype chip has also involved developing new technology for synchronizing multiple CPU cores. One of the techniques traditionally used for this is barrier synchronization control in software. However, because all of the CPUs access memory resources via the same system bus, this method incurs large synchronization overheads, including shared-memory accesses, busy-waiting, and sending notifications to each processing core.
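To make those overheads concrete, here is a minimal sketch of a conventional software barrier, written with Python threads standing in for CPU cores. It is a generic counter-based barrier, not the software baseline used in the paper; the class name and the demo are illustrative. Every arrival updates a shared counter under a lock (the shared-memory access), and every waiter polls a shared variable (the busy-wait).

```python
import threading
import time

class SoftwareBarrier:
    """Counter-based barrier: shared state, a lock, and a polling loop."""

    def __init__(self, n):
        self.n = n
        self.count = 0
        self.generation = 0
        self.lock = threading.Lock()   # stands in for arbitration on the shared bus

    def wait(self):
        with self.lock:                # shared-memory access on every arrival
            gen = self.generation
            self.count += 1
            if self.count == self.n:   # last arrival releases everyone
                self.count = 0
                self.generation += 1
                return
        while self.generation == gen:  # busy-wait on shared state
            time.sleep(0)              # yield so other threads can run

def run_demo(num_workers=4, rounds=3):
    barrier = SoftwareBarrier(num_workers)
    log = []                           # list.append is atomic in CPython

    def worker(wid):
        for r in range(rounds):
            log.append((r, wid))       # the "parallel work" for this round
            barrier.wait()             # nobody starts round r+1 early

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

if __name__ == "__main__":
    print(run_demo())
```

Because every core both writes and repeatedly reads the same shared locations, the cost of this scheme grows with the number of cores contending for the bus.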

The hardware barrier synchronization control method that the engineers have developed avoids these problems and, when used with the auto-parallelizing compiler, can support a next-generation parallel processing technology called 'hierarchical coarse-grain task parallelization.' On-chip circuits quickly detect when all or some of the eight CPUs have completed executing those parts of the program that require synchronization control. Those circuits then immediately update this information in the synchronization read register for all CPUs, without using the system bus (see Figure 4). "We found an eighteen-fold increase in speed compared to software synchronization control when we evaluated our new method on an application program that made repeated use of barrier synchronization," noted Mr. Ito.

This development work was part of the NEDO "Real-time Digital Consumer Electronics Multi-Core" semiconductor application chip project. The goal of the project is to help make this technology a practical reality by adding the benefits of rapid development and low power consumption to the high-speed processing advantages of a multi-core system.

Figure 4: Newly devised hardware barrier synchronization control mechanism.
(For clarity, this example shows the synchronization of only four CPUs.) The synchronization write register value (BARW) is updated immediately in the synchronization read registers (BARR) on all CPUs to allow synchronization between all CPU cores or between multiple core groups. This synchronization technique can be used in conjunction with an auto-parallelizing compiler to support a next-generation parallel processing technology called hierarchical coarse-grain task parallelization.
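The register behavior described in this caption can be modeled in a few lines. This is a behavioral sketch only: the register names BARW and BARR come from the caption, but the one-flag-per-CPU layout, the class, and its method names are assumptions made for illustration, not the chip's documented interface.

```python
class HardwareBarrierModel:
    """Behavioral model: each CPU writes its own BARW flag, and the hardware
    mirrors all flags into every CPU's BARR without using the system bus."""

    def __init__(self, num_cpus=4):
        self.barw = [0] * num_cpus      # one write flag per CPU (assumed layout)

    def write_barw(self, cpu, value=1):
        """CPU signals that it has reached the synchronization point."""
        self.barw[cpu] = value

    def read_barr(self, cpu):
        """Every CPU sees the same mirrored copy of all BARW flags."""
        return list(self.barw)

    def group_done(self, cpu, group):
        """True once every CPU in the group has set its flag, as seen by cpu."""
        barr = self.read_barr(cpu)
        return all(barr[member] for member in group)

hw = HardwareBarrierModel(num_cpus=4)
group_a, group_b = [0, 1], [2, 3]       # two independent core groups

hw.write_barw(0)
hw.write_barw(1)
print(hw.group_done(0, group_a))        # group A has synchronized
print(hw.group_done(2, group_b))        # group B is still running
print(hw.group_done(0, [0, 1, 2, 3]))   # all-core barrier not yet reached
```

Checking a subset of flags is what lets separate core groups synchronize independently, which is the property hierarchical coarse-grain task parallelization relies on.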
