IBM POWER6 Microprocessor (64 bit) Essay
Abstract— This term paper examines the IBM POWER6 microprocessor. It covers an introduction; core chapters on the processor's definition, description, history, and design; its applications and future perspective; and a conclusion.
Index Terms— Introduction, Core chapters, Applications & Future perspective, Conclusion.
I. INTRODUCTION
A microprocessor is a silicon chip that contains a central processing unit (CPU). In the world of personal computers, the terms microprocessor and CPU are used interchangeably. A microprocessor sits at the heart of all personal computers and most workstations. Microprocessors also control the logic of almost all digital devices, from clock radios to automotive fuel-injection systems. A microprocessor is a multipurpose, programmable device that accepts digital data as input, processes it according to instructions stored in its memory, and provides results as output. Intel introduced its first 4-bit microprocessor, the 4004, in 1971 and its 8-bit microprocessor, the 8008, in 1972.
B. IBM POWER6 Microprocessors
The POWER6 is a microprocessor developed by IBM that implements the Power ISA v.2.03. When it became available in systems in 2007, it succeeded the POWER5+ as IBM’s flagship Power microprocessor and was, at that time, the latest generation in IBM’s POWER line of processors. Fabricated using IBM’s 65 nm partially-depleted SOI process, the 341 mm² POWER6 chip contains over 790 million transistors and 1953 signal I/Os connected using 4.5 km of wire on 10 copper metal layers (Fig. 1). Each chip includes two dual-threaded SMT processor cores implemented in a 13 FO4 design capable of running at speeds up to 5 GHz. In addition, the chip includes a private 4 MB L2 cache per core, a controller for a shared 32 MB L3 cache, two integrated memory controllers, an on-board I/O controller, and nest support for large-scale SMP. To provide mainframe-like reliability, enhanced error detection and system monitoring capabilities are managed through a new recovery unit that provides full checkpointing facilities. This is supplemented by complete ECC protection of large caches and architected state, parity protection on more than 99% of register files and 70% of dataflow circuits, and extensive control checkers. Improved virtualization support and decimal floating-point execution capability round out the feature set, while the design remains binary compatible with previous POWER designs.
II. CORE CHAPTERS
POWER6 reached first silicon in the middle of 2005. It was described at the International Solid-State Circuits Conference (ISSCC) in February 2006, and additional details were presented at the Microprocessor Forum in October 2006 and at the next ISSCC in February 2007. It was formally announced on May 21, 2007 and released on June 8, 2007 at speeds of 3.5, 4.2 and 4.7 GHz, although IBM has noted that prototypes reached 6 GHz. The top speed was raised to 5.0 GHz in May 2008 with the introduction of the P595.
POWER is a RISC instruction set architecture designed by IBM; the name stands for Performance Optimization With Enhanced RISC.
• It is based on IBM POWER5 microprocessor technology (SMT, dual core), with extensions to increase performance.
• Its core is fabricated in 65-nm silicon-on-insulator (SOI) technology and operates at frequencies of more than 4 GHz.
• The microprocessor is a 13-FO4 design containing more than 790 million transistors, 1,953 signal I/Os, and more than 4.5 km of wire on ten copper metal layers.
Fig: Power6 chip with cores and L2 latch highlighted
The IBM POWER6 microprocessor core is fabricated using the IBM 65-nm silicon-on-insulator (SOI) process and provides a significant boost in frequency and performance to pSeries systems. Core operating frequencies of more than 5 GHz have been demonstrated.
The processor chip contains two cores, 8 MB of on-chip level 2 (L2) cache, a directory for a 32-MB L3 cache, two memory controllers, a GX I/O controller, and nest support circuitry for a 128-way symmetric multiprocessor (SMP). The chip shown in Figure 1 has an area of 341 mm² and contains 790 million transistors, 1,953 signal I/Os, 5,399 power and ground I/Os, and more than 4.5 km of wire.
The on-chip circuits are connected via ten levels of copper wire and are powered through multiple voltage domains. The core logic, array, and I/O circuits are designed to operate at nominal voltages of 1.15, 1.3, and 1.2 V, respectively. However, the actual logic and array voltages delivered to each chip vary between 0.85 V and 1.3 V and between 1.0 V and 1.4 V, respectively, depending on the speed of the part. Chips with shorter channels typically run faster but use considerably more power because of higher leakage. In previous-generation processors, these parts would have been discarded because of excessive power dissipation but now are usable by operating at lowered voltages. In addition, chips with longer channels typically run slower, so some of these parts also would not have been used in earlier generation processors because of their low operating frequency, but now they also are made usable by increasing their operating voltages.
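The voltage adjustment described above can be illustrated with a minimal sketch. It assumes a hypothetical linear mapping from a normalized speed score (0 for the slowest usable part, 1 for the fastest) to a logic supply voltage inside the 0.85 V to 1.3 V range quoted in the text. The linear form, the score, and the function name are illustrative assumptions, not IBM's actual binning procedure.

```python
# Hypothetical voltage-binning sketch; only the voltage range comes from the text.
LOGIC_V_MIN = 0.85  # volts, lowest logic voltage quoted in the text
LOGIC_V_MAX = 1.30  # volts, highest logic voltage quoted in the text


def logic_voltage_for_part(speed_score: float) -> float:
    """Map a normalized speed score to a logic supply voltage.

    speed_score: 0.0 for the slowest usable part (long channels),
                 1.0 for the fastest, leakiest part (short channels).
    Assumption: a simple linear interpolation between the two limits.
    """
    if not 0.0 <= speed_score <= 1.0:
        raise ValueError("speed_score must be within [0, 1]")
    # Fast parts get a lower voltage to limit leakage power;
    # slow parts get a higher voltage to reach the target frequency.
    return LOGIC_V_MAX - speed_score * (LOGIC_V_MAX - LOGIC_V_MIN)


if __name__ == "__main__":
    for score in (0.0, 0.5, 1.0):
        print(f"speed score {score:.1f} -> {logic_voltage_for_part(score):.3f} V")
```

The same pattern would apply to the array voltage, with its own 1.0 V to 1.4 V limits.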
Fig: Architecture of POWER6
• The POWER6 chip operates at twice the frequency of the POWER5.
• In place of speculative out-of-order execution, which requires costly register-renaming circuitry, the design concentrates on providing data prefetch.
• Limited out-of-order execution is implemented for floating-point (FP) instructions.
• Improved dispatch and completion: up to seven instructions from both threads simultaneously.
• Better SMT speedup due to increased cache size and associativity.
• Designed to consume less power.
D. Circuit Design Methodologies
The majority of state-saving devices used in POWER6, outside register files and SRAMs, are scannable master–slave flip-flops (FFs). In normal operation, each of these is controlled by two opposite-phase, slightly skewed clocks, C1 and C2, that drive the master latch (L1) and slave latch (L2), respectively. To reduce chip power, most flip-flops can be run in pulsed mode, where C1 is held high while C2 is pulsed (Fig. 3). Since only one clock signal is active in this mode, switching power is reduced. Table I describes the various latch modes and their clocks. Delay C1 mode allows cycle stealing during the C2 rise and C1 fall overlap, which provides the capability to shift cycle boundaries and tune frequency in the hardware. Pulsed mode allows even more cycle stealing at the cost of extra padding needed to meet tighter hold time requirements. Designs were padded for minimum pulsewidth mode (2.9 FO4), while mid (4.2 FO4) and max (5.2 FO4) pulsewidth modes were supported to provide maximum flexibility when the chip was tuned in the lab (see Section IV). Finally, a Delay C2 mode, which delayed the C2 rise, was available for debugging frequency-limiting paths.
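As a rough worked example, the FO4 figures above can be converted to time. The sketch assumes that the 13 FO4 cycle corresponds to the 5 GHz top frequency quoted earlier, which gives an FO4 delay of about 15.4 ps. This pairing is an assumption made for illustration, since the actual FO4 delay depends on process, voltage, and temperature.

```python
# Convert the quoted FO4 figures to picoseconds.
# Assumption: the 13 FO4 cycle corresponds to a 5 GHz clock.
CYCLE_FO4 = 13.0
FREQ_GHZ = 5.0

cycle_ps = 1000.0 / FREQ_GHZ   # 5 GHz -> 200 ps per cycle
fo4_ps = cycle_ps / CYCLE_FO4  # about 15.4 ps per FO4 under this assumption


def fo4_to_ps(n_fo4: float) -> float:
    """Return the duration of n_fo4 FO4 delays in picoseconds."""
    return n_fo4 * fo4_ps


# Pulse widths of the three pulsed modes, as quoted in the text.
PULSE_WIDTHS_FO4 = {"min": 2.9, "mid": 4.2, "max": 5.2}

if __name__ == "__main__":
    print(f"1 FO4 is about {fo4_ps:.1f} ps")
    for mode, width in PULSE_WIDTHS_FO4.items():
        share = width / CYCLE_FO4
        print(f"{mode}: {fo4_to_ps(width):.1f} ps ({share:.0%} of the cycle)")
```

Under this assumption, even the widest pulse occupies well under half of the cycle, which is consistent with the text's point that wider pulses trade hold-time margin for more cycle stealing.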
One of the driving forces behind the efficient design methodology of POWER6 was the RodRunner pcell-based gate library, which provided fine device size granularity while retaining the advantages of cell-based layout design. In addition, the resources required to create the full cell library were greatly reduced because layouts for each cell were generated and updated automatically. For synthesized random logic macros, the use of RodRunner cells allowed a very large library of standard cells to be created, giving synthesis maximum flexibility. Over 500 unique cells were available for each of three types supported in the 65 nm technology, without the enormous overhead that would normally be associated with maintaining a library of that size. As many as four different beta ratios were available for each size cell, with two- and three-input cells usually having multiple tapering ratios available as well.
A key benefit of RodRunner was the ability to make any DRC or methodology (METH) updates in a single location within the RodRunner cell. This change was instantly picked up across all instantiations of the cell, including in the standard cell library. While this occasionally required minor updates to existing layouts to ensure compatibility, these could be performed with minimal effort. This also allowed technology updates, which could affect transistor strengths and beta ratios, to be easily compensated for by the designer or Einstuner (IBM’s device tuning tool).
Fig. 4. Custom design flow.
The tools used for the custom methodology maximized the possible number of iterations on a circuit, allowing the designer to rapidly approach an optimal solution. The methodology could be split into three design phases, as illustrated in Fig. 4: high level design, schematic entry and placement, and layout. During the high level design phase, physical abstracts were used to floorplan a macro as well as to develop a pin/wiring contract with integrators. Early timing abstracts were generated based on circuit designer estimates of logic implementation. Schematic entry and placement could be performed simultaneously with an innovative new tool called PIP (Placement with Instance Parameters), a GUI for a library of Skill functions used to place cells. PIP allowed circuit designers to floorplan and time their macros more accurately and easily. This, combined with STEP (STeiner Estimated Parasitic), allowed fairly accurate wire models to be included in early timing abstracts.
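The idea of estimating wire parasitics from placement alone, before any routing exists, can be sketched as follows. This is not the STEP tool: it is a minimal illustration that approximates the rectilinear Steiner tree length of a net by the half-perimeter of its bounding box (exact for nets with up to three pins, a lower bound otherwise) and scales it by per-unit-length resistance and capacitance. The `R_PER_UM` and `C_PER_UM` values are placeholders, not POWER6 technology data.

```python
from typing import Iterable, Tuple

Pin = Tuple[float, float]  # (x, y) placement coordinates in micrometres

# Placeholder per-unit-length parasitics (assumed values, not POWER6 data).
R_PER_UM = 0.5      # ohms per micrometre
C_PER_UM = 0.2e-15  # farads per micrometre


def half_perimeter_length(pins: Iterable[Pin]) -> float:
    """Half-perimeter of the bounding box around the pins.

    Equals the rectilinear Steiner tree length for nets with up to
    three pins and is a lower bound for larger nets.
    """
    pts = list(pins)
    if len(pts) < 2:
        return 0.0
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def estimate_wire_rc(pins: Iterable[Pin]) -> Tuple[float, float]:
    """Return (resistance in ohms, capacitance in farads) for a net."""
    length = half_perimeter_length(pins)
    return length * R_PER_UM, length * C_PER_UM
```

Because the estimate needs only cell placement, it can feed an early timing abstract long before the layout phase.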
A circuit topology checker could then be used to verify that the circuits met project design rules prior to layout implementation. During the layout phase, only routing was needed, as all cells had been previously placed. The use of RodRunner allowed automatic optimization tools, such as Einstuner, to easily update both schematics and layouts to improve timing, area and power by optimizing device sizing and beta ratios. Additionally, the LAVA engine, which performed leakage calculations based on analyzing channel-connected components, could change the threshold voltage (Vt) of cells, either to reduce leakage power on noncritical paths or to increase the speed of failing paths. Tools to add decoupling capacitors, gate arrays (see Section II-E) and redundant vias, or to tweak n-well/rox layers, could then be run on the completed layout to improve yield and performance. A number of physical checks, including DRC and LVS, methodology, DFT, extraction, power and transistor-level timing, were performed to validate the design, followed by electromigration, noise and IR drop analysis to ensure circuit reliability.
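The leakage-versus-speed trade-off behind this kind of cell substitution can be sketched with a simple greedy pass. The sketch is an illustration, not the LAVA algorithm. It assumes the cell property being swapped is the threshold voltage (Vt); the cell names, slack values, and delay penalty are hypothetical; and it treats each cell's slack independently, whereas a real tool would re-time the design after every swap.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    name: str
    slack_ps: float      # timing slack of the worst path through the cell
    vt: str = "regular"  # "low", "regular", or "high" threshold voltage


# Hypothetical delay cost of a high-Vt swap (assumed, not technology data).
HIGH_VT_DELAY_PENALTY_PS = 5.0


def assign_vt(cells: List[Cell]) -> None:
    """Greedy Vt substitution, updating the cells in place.

    - Cells on failing paths (negative slack) move to low Vt for speed.
    - Cells with enough positive slack move to high Vt to cut leakage,
      and their slack is reduced by the assumed delay penalty.
    - All other cells keep their current Vt.
    """
    for cell in cells:
        if cell.slack_ps < 0.0:
            cell.vt = "low"
        elif cell.slack_ps >= HIGH_VT_DELAY_PENALTY_PS:
            cell.vt = "high"
            cell.slack_ps -= HIGH_VT_DELAY_PENALTY_PS
```

For example, cells with slacks of -3 ps, 12 ps, and 2 ps would end up as low, high, and regular Vt, respectively.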
The synthesized random logic macro (RLM) methodology was designed to have as much commonality with customs as possible to allow maximum sharing of tools and checkers. RLMs were designed with the same bit image as customs and were generally allowed unrestricted use of M1–M3 while M4 (and higher for special cases) was shared with the unit via contracts. Pins were required to follow a more restrictive set of placement and spacing rules to provide the highest possible pin density while still ensuring accessibility to pins for both the unit and the RLM by automated routing tools.
The RLM process was broken into three major phases: synthesis and placement, routing, and physical validation. The first step was performed in an IBM tool suite that combined logical synthesis, mapping, placement and timing capabilities in an integrated framework called PDSRTL. Given a VHDL design, macro dimensions with pin locations, and a set of timing contracts, PDSRTL optimized the design for timing, power, area and electrical constraints. A carefully tuned set of default parameters yielded high quality results for the majority of the designs, while these parameters could still be customized to adapt to the characteristics of individual RLMs.
The second step took the fully placed RLM, pre-routed wide clock nets based on LCB and latch placement, and added fill cells (see Section II-E) before the design was run through a grid-based routing tool. A fully redundant via set could be used on 95% of the designs for increased yield, with the remaining RLMs using a mixed via set. The final routed design was translated into a standard layout by removing floorplanning information and replacing abstracts of all the standard cells with actual layouts.
At this point, the methodology aligned with the custom macro methodology, and the same physical checks were performed to validate the design. Typically, RLMs were clean by construction and required only minimal tweaking to pass all requirements. As with customs, Einstuner and LAVA substitution were available for post-layout tuning.
5) Filler Cell and ECO Methodology
The aggressive schedule of the POWER6 design required physical design (PD) to already be in late stages while verification work was still ongoing, resulting in an unusually large number of engineering change orders (ECOs). The RLM flow was capable of automatically taking a modified (and optionally placed) netlist, merging it into the existing design and running incremental routing to update the layout. In past designs, once the front-end-of-the-line (FEOL) layers were locked and no further changes to cell sizing or placement were possible, back-end-of-the-line (BEOL) or wire-only ECO capability was severely limited by the number of spare cells of each type and their location in the macro. In POWER6, a special set of gate-array cells, each containing a single PFET and NFET device, was used to fill the unused area in both RLM and custom designs. When used as fill, these cells remained disconnected to have no impact on power, but in a BEOL ECO, they could be combined and replaced by functional cells that had the exact same FEOL layers, but connected the transistors to form any type of static gate. This capability, combined with spare latches that were scattered throughout all RLMs and some customs, allowed even the most complex changes to be completed using BEOL layers only.
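A minimal sketch of the bookkeeping behind a BEOL-only ECO: each fill cell contributes one PFET and one NFET, and a static CMOS inverter, NAND, or NOR gate with n inputs needs n of each. The gate list and function names are hypothetical, and the sketch only counts devices; it ignores placement distance, device sizing, and routability.

```python
from typing import Dict, List

# PFET/NFET pairs needed per static CMOS gate (one pair per input).
PAIRS_PER_GATE: Dict[str, int] = {
    "INV": 1,
    "NAND2": 2,
    "NOR2": 2,
    "NAND3": 3,
    "NOR3": 3,
}


def fill_cells_needed(new_gates: List[str]) -> int:
    """Count the gate-array fill cells consumed by the requested gates."""
    return sum(PAIRS_PER_GATE[gate] for gate in new_gates)


def eco_fits(new_gates: List[str], free_fill_cells: int) -> bool:
    """Return True if the ECO can be built from the available fill cells."""
    return fill_cells_needed(new_gates) <= free_fill_cells
```

For instance, adding an inverter, a two-input NAND, and a three-input NOR would consume six fill cells under this counting rule.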
For extremely complex changes where VHDL did not easily correspond to the netlist, an experimental process was introduced for RLMs that made the PDSRTL tool aware of the gate-array methodology. By describing a delta-VHDL, a designer could essentially graft a new cone of logic into the design. PDSRTL would swap fill cells for gate-array cells as needed and, where possible, reuse existing cells to map the new logic. While results were generally not as efficient as manual ECOs, this new approach proved to be extremely valuable for complex situations where an ECO would have otherwise been
In order to achieve timing closure on the POWER6 chip, parallel development at all levels of the design, rapid iteration and early timing estimation were essential, in addition to highly accurate timing models. Hierarchies on POWER6 included chip, core, nest, unit, and macro levels. Using assertions or timing contracts to describe boundary conditions such as arrival times, slews and capacitance loads, a top-down methodology was used, allowing each level of the design to be analyzed and iterated independently of the others.
Only the basic best case and nominal case timing corners needed to be run separately to evaluate timing at any given level of hierarchy, due to parallel modeling of all clock phases, voltage levels and both pulsed and nonpulsed modes in a single run. A third run that modeled actual hostile capacitances and their impact on timing was used late in the design for detailed noise coupling analysis.
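The contract-based approach to hierarchical timing can be illustrated with a small sketch. The data structure and numbers are hypothetical: a contract gives an asserted arrival time at a macro input and a required arrival time at its output, so the macro can be checked in isolation by adding its internal delay to the asserted arrival and comparing the result with the requirement. Slews and capacitance loads, which the contracts also carry, are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingContract:
    """Boundary assertions for one input-to-output path of a macro."""
    input_arrival_ps: float    # asserted arrival time at the input pin
    output_required_ps: float  # required arrival time at the output pin


def path_slack(contract: TimingContract, macro_delay_ps: float) -> float:
    """Slack = required time - (asserted arrival + internal delay).

    Positive slack means the macro meets its contract without any
    knowledge of the logic on the other side of the boundary.
    """
    arrival_at_output = contract.input_arrival_ps + macro_delay_ps
    return contract.output_required_ps - arrival_at_output
```

With an asserted arrival of 40 ps and a required time of 150 ps, a macro delay of 90 ps leaves 20 ps of slack, while a delay of 120 ps fails the contract by 10 ps.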
It is not until you undertake a project like this one that you realize how massive the effort really is, or how much you must rely upon the selfless efforts and goodwill of others. Many people helped me with this project, and I want to thank them all from the core of my heart. I owe special thanks to my teacher for her vision, thoughtful counseling and encouragement at every step of the project. I am also thankful to the teachers of the Department for giving me the best of knowledge and guidance throughout the project. All this has become reality because of their blessings and, above all, by the grace of God.
REFERENCES
http://www.realworldtech.com/page.cfm?ArticleID=RWT121905001634, plus additional PDFs