Computer Architecture 2011 – Advanced Topics


Transcript

  • Slide 1
  • Computer Architecture 2011 – Advanced Topics (title slide)
  • Slide 2
  • Performance per Watt
    – Mobile: a smaller form factor decreases the power budget. Power generates heat, which must be dissipated to keep transistors within their allowed temperature; this limits the processor's peak power consumption.
    – Change the target. Old target: get maximum performance. New target: get maximum performance at a given power envelope.
    – Performance via frequency increase: P = C·V²·f, but increasing f also requires increasing V, so X% more performance costs about 3X% more power (assuming performance is linear in frequency; see the sketch below).
    – A power-efficient feature must therefore beat a 1:3 performance-to-power ratio; otherwise it is better to just increase frequency (and voltage). Microarchitectural performance features should be power efficient.
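A short worked form of the argument (a sketch; it assumes voltage must scale roughly linearly with frequency, as the slide does):

```latex
% Dynamic power: P = C V^2 f. If V must scale roughly linearly
% with f, then P grows with the cube of frequency:
P = C V^2 f, \qquad V \propto f
\;\Longrightarrow\; P \propto f^3
\;\Longrightarrow\; \frac{\Delta P}{P} \approx 3\,\frac{\Delta f}{f}
% i.e., an X% frequency (= performance) gain costs about 3X% power,
% which is the 1:3 exchange rate a power-efficient feature must beat.
```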
  • Slide 3
  • Higher Performance vs. Longer Battery Life
    – Processor average power is …
  • Slide 5
  • Enhanced SpeedStep Technology
    – The basic SpeedStep technology had 2 operating points and a non-transparent switch.
    – The Enhanced version provides multiple voltage/frequency operating points with transparent, frequent switches. For example, a 1.6 GHz Pentium M processor operates from 600 MHz @ 0.956 V up to 1.6 GHz @ 1.484 V.
    – Benefits: higher power efficiency and outstanding battery life. At 2.7× lower frequency the measured power drops 6.1×; with roughly 2× performance loss that is >2× energy gain, an efficiency ratio of 2.3 (worked through below).
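Reconstructing the slide's figures (a sketch; it assumes dynamic power P = C·V²·f and run time inversely proportional to frequency — close to, but not exactly, the slide's measured 6.1× power ratio):

```latex
% Operating points of the 1.6 GHz Pentium M:
%   high: 1.6 GHz @ 1.484 V,   low: 600 MHz @ 0.956 V
\frac{P_{hi}}{P_{lo}}
  = \frac{f_{hi}}{f_{lo}}\left(\frac{V_{hi}}{V_{lo}}\right)^2
  = \frac{1.6}{0.6}\left(\frac{1.484}{0.956}\right)^2 \approx 6.4
% Energy per task = power x time, with time ~ 1/f:
\frac{E_{hi}}{E_{lo}} = \frac{P_{hi}\,t_{hi}}{P_{lo}\,t_{lo}}
  \approx \frac{6.4}{2.67} \approx 2.4
% matching the slide's measured 6.1x power and ~2.3 efficiency ratio.
```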
  • Slide 6
  • 2nd Generation Intel Core™: Sandy Bridge
  • Slide 7
  • 2nd Gen Intel Core Microarchitecture: Overview
    – Integrates CPU, graphics, memory controller, and PCI Express* on a single chip
    – Integrated memory controller: 2-channel DDR3
    – High-bandwidth last-level cache
    – Next-generation processor graphics and media; embedded DisplayPort
    – Next-generation Intel Turbo Boost Technology
    – Intel Hyper-Threading Technology: 4 cores / 8 threads or 2 cores / 4 threads
    – Intel Advanced Vector Extensions (Intel AVX): substantial performance improvement
    – High-bandwidth, low-latency modular core/graphics interconnect
    – Discrete graphics support: 1x16 or 2x8 PCIe
    (Die diagram: four cores with LLC slices, graphics, and a system agent containing the display engine, DMI, PCI Express, and the IMC (2ch DDR3); x16 PCIe, PECI interface to the embedded controller, notebook DP port, and a link to the PCH.)
  • Slide 8
  • Core Block Diagram
    – In-order front end (IA instructions → uops): branch prediction, 32 KB L1 instruction cache, pre-decode, instruction queue, decoders, 1.5 K-uop cache
    – In-order allocation, rename, and retirement (zeroing idioms), with load buffers, store buffers, and reorder buffers
    – Out-of-order uop scheduling across six execution ports: Port 0 (ALU, SIMD MUL, DIV, FP MUL), Port 1 (ALU, SIMD ALU, FP ADD), Port 5 (ALU, Branch, FP Shuffle), Port 2 (Load), Port 3 (Store Address), Port 4 (Store Data)
    – Data cache unit: 32 KB L1 data cache at 48 bytes/cycle, fill buffers, L2 cache (MLC)
  • Slide 9
  • Front End: Instruction Fetch and Decode
    – 32 KB, 8-way associative instruction cache
    – 4 decoders, up to 4 instructions / cycle; the decode pipeline supports 16 bytes per cycle
    – Micro-fusion: bundles multiple instruction events into a single uop
    – Macro-fusion: fuses an instruction pair into one complex uop (illustrated below)
    (Diagram: branch prediction unit, 32 KB L1 I-cache, pre-decode, instruction queue, decoders.)
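As an illustration of macro-fusion (a sketch; whether fusion actually happens depends on the exact instruction pair the compiler emits), a hot loop like this typically compiles to a compare-and-branch pair at the back edge, which is a classic fusion candidate:

```c
/* The loop back edge typically compiles to something like
 *     cmp  rax, rcx
 *     jne  .loop
 * and that cmp+jcc pair can be macro-fused into a single uop,
 * so the decoders and execution ports see one operation. */
long sum_until(const long *a, long n) {
    long s = 0;
    for (long i = 0; i != n; i++)   /* i != n -> cmp + jne */
        s += a[i];
    return s;
}
```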
  • Slide 10
  • Decoded Uop Cache (~1.5 K uops)
    – An instruction cache that holds uops instead of instruction bytes; ~80% hit rate for most applications
    – Higher instruction bandwidth and lower latency: the decoded uop cache can deliver 32 bytes/cycle, sustaining 4 instructions/cycle on more cycles
    – Able to stitch across taken branches in the control flow
  • Slide 11
  • Branch Prediction Unit
    – New branch predictor: twice as many targets
    – Much more effective storage for history
    – Much longer history for data-dependent behaviors
  • Slide 12
  • Front End: Power Savings
    – The decoded uop cache lets the normal front end sleep: decode once instead of many times
    – Branch mispredictions are reduced substantially; the correct path is also the most efficient path
    – Save power while increasing performance
  • Slide 13
  • Out-of-Order Part of the Machine
    – Receives uops from the front end, sends them to the execution units when they are ready, and retires them in program order
    – In-order allocation, rename, and retirement; out-of-order uop scheduling
    – Increases performance by finding more instruction-level parallelism
    – Increasing the depth and width of the machine implies larger buffers: more data storage, more data movement, more power
    (Diagram: allocate/rename/retire with zeroing idioms, load/store/reorder buffers, and the scheduler over ports 0–5.)
  • Slide 14
  • Sandy Bridge Out-of-Order (OOO) Cluster
    – Method: a physical register file (PRF) instead of a centralized retirement register file — separate integer and FP/INT-vector PRFs
    – A single copy of every datum; no data movement after calculation
    – Allows a significant increase in buffer sizes: the dataflow window is ~33% larger
    – The PRF gives better-than-linear performance/power and is a key enabler for Intel AVX
  • Slide 15
  • Intel Advanced Vector Extensions (Intel AVX)
    – Vectors are a natural data type for many applications
    – Extends the SSE FP instruction set to 256-bit operand size; all 16 XMM registers are extended to 256-bit YMM registers (YMM0 extends the 128-bit XMM0 to 256 bits)
    – New, non-destructive source syntax: VADDPS ymm1, ymm2, ymm3
    – New operations to enhance vectorization: broadcasts, masked load & store (see the sketch below)
    – Wider vectors and non-destructive sources specify more work with fewer instructions; extending the existing state is area- and power-efficient
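A minimal sketch tying the named features together (the function and array names are illustrative, not from the slides): a scaled vector add using 256-bit ops with the non-destructive three-operand form, a broadcast, and a masked load/store for the loop tail.

```c
/* Compile with -mavx. Uses VBROADCASTSS, VMULPS/VADDPS on
 * 256-bit YMM registers, and VMASKMOVPS for the tail. */
#include <immintrin.h>

void saxpy_avx(float a, const float *x, float *y, long n) {
    __m256 va = _mm256_broadcast_ss(&a);     /* broadcast scalar */
    long i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        /* vy = va*vx + vy: three operands, sources not clobbered */
        vy = _mm256_add_ps(_mm256_mul_ps(va, vx), vy);
        _mm256_storeu_ps(y + i, vy);
    }
    if (i < n) {                             /* masked tail */
        int rem = (int)(n - i);
        /* lanes with the mask high bit set are loaded/stored */
        __m256i mask = _mm256_set_epi32(
            rem > 7 ? -1 : 0, rem > 6 ? -1 : 0, rem > 5 ? -1 : 0,
            rem > 4 ? -1 : 0, rem > 3 ? -1 : 0, rem > 2 ? -1 : 0,
            rem > 1 ? -1 : 0, rem > 0 ? -1 : 0);
        __m256 vx = _mm256_maskload_ps(x + i, mask);
        __m256 vy = _mm256_maskload_ps(y + i, mask);
        vy = _mm256_add_ps(_mm256_mul_ps(va, vx), vy);
        _mm256_maskstore_ps(y + i, mask, vy);
    }
}
```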
  • Slide 16
  • Execution Cluster
    – The scheduler sees a matrix: 3 ports (0, 1, 5) feeding 3 stacks of execution units — general-purpose integer (GPR), SIMD (vector) integer, and SIMD floating point
    – Challenge: double the output of one of these stacks in a manner that is invisible to the others
    (Diagram: execution units across the ports — ALU, JMP, FP Shuffle, FP Boolean, VI ADD, VI MUL, VI Shuffle, DIV, FP ADD, Blend, FP MUL.)
  • Slide 17
  • Execution Cluster: Solution
    – Repurpose the existing data paths for dual use
    – SIMD integer and legacy SIMD FP use the legacy 128-bit stack style; Intel AVX utilizes both 128-bit execution stacks
    – Doubles FLOPs: one 256-bit multiply + one 256-bit add + one 256-bit load per clock (a throughput sketch follows)
    (Diagram: the FP units — FP ADD, FP Multiply, FP Blend, FP Shuffle, FP Boolean — paired across the two 128-bit stacks.)
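A throughput sketch under the port model above (function name is mine, not from the slides): in steady state a 256-bit multiply can issue on one stack while a 256-bit add issues on the other in the same cycle.

```c
/* Compile with -mavx. Caveat kept honest in the comments: a
 * single dependent chain per unit, as shown here, is bound by
 * operation latency; real peak-FLOPs kernels unroll into several
 * independent accumulators per port to issue every cycle. */
#include <immintrin.h>

float flops_kernel(int iters) {
    __m256 m = _mm256_set1_ps(1.000001f);  /* multiply chain */
    __m256 a = _mm256_set1_ps(0.0f);       /* add chain      */
    const __m256 k = _mm256_set1_ps(1.0f);
    for (int i = 0; i < iters; i++) {
        m = _mm256_mul_ps(m, k);  /* 8 single-precision muls */
        a = _mm256_add_ps(a, k);  /* 8 single-precision adds */
    }
    float out[8];
    _mm256_storeu_ps(out, _mm256_add_ps(m, a));
    return out[0];  /* defeat dead-code elimination */
}
```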
  • Slide 18
  • Memory Cluster
    – The memory unit can service two memory requests per cycle: a 16-byte load and a 16-byte store
    – Goal: maintain the historic bytes/FLOP ratio of SSE for Intel AVX
    (Diagram: memory control; 32 KB 8-way L1 data cache at 32 bytes/cycle; load, store-address, and store-data ports; 256 KB L2 cache (MLC); fill buffers; store buffers.)
  • Slide 19
  • Memory Cluster: Solution
    – Dual-use the existing connections: make the load/store pipes symmetric
    – The memory unit then services three data accesses per cycle: 2 read requests of up to 16 bytes each AND 1 store of up to 16 bytes — 2 × 16 B + 16 B = 48 bytes/cycle
    – An internal sequencer deals with queued requests
    – The second load port is one of the highest-performance features: it is required to keep Intel AVX fed, at linear power/performance
    (Diagram: as before, with the L1 data cache bandwidth now 48 bytes/cycle.)
  • Slide 20
  • Putting It Together: the Sandy Bridge Microarchitecture
    (Full-core diagram: branch prediction, 32 KB L1 instruction cache, pre-decode, instruction queue, decoders, and the 1.5 K-uop cache in the in-order front end; allocate/rename/retire with zeroing idioms and load/store/reorder buffers; the out-of-order scheduler over ports 0–5 with ALU, VI ADD/MUL/Shuffle, DIV, JMP, and AVX/FP ADD/MUL/Blend/Shuffle/Boolean units; memory control, 32 KB L1 data cache at 48 bytes/cycle, fill buffers, and the L2 data cache (MLC).)
  • Slide 21
  • Other Architectural Extensions
    – Cryptography throughput: increased throughput for the AES instructions
    – Arithmetic throughput: ADC (add-with-carry) throughput doubled; a faster 64-bit × 64-bit multiply with a 128-bit product gives ~25% speedup on existing RSA binaries (see the sketch below)
    – State save/restore: Intel AVX adds new architectural state; hardware monitors which features an application uses and saves/restores only the state that is actually used
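A sketch of the widening multiply the slide refers to (the helper name is mine): RSA big-number inner loops are dominated by 64×64 → 128-bit multiply-accumulate steps like this one, which is why speeding up that multiply speeds up existing RSA binaries.

```c
/* unsigned __int128 is a GCC/Clang extension; on x86-64 the
 * multiply below compiles to a single MUL instruction whose
 * 128-bit product lands in the RDX:RAX register pair. */
#include <stdint.h>

/* *acc += a * b; returns the high 64 bits as the carry-out */
static uint64_t mac64(uint64_t a, uint64_t b, uint64_t *acc) {
    unsigned __int128 p = (unsigned __int128)a * b + *acc;
    *acc = (uint64_t)p;          /* low 64 bits of the product */
    return (uint64_t)(p >> 64);  /* high 64 bits (carry-out)   */
}
```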
  • Slide 22
  • System Agent, Ring Architecture and Other Innovations in the 2nd Generation Intel Core Microarchitecture (formerly codenamed Sandy Bridge)
    (Die diagram: four cores with LLC slices, graphics, and the system agent — display engine, DMI, PCI Express, IMC with 2ch DDR3 — plus x16 PCIe, the PECI interface to the embedded controller, a notebook DP port, and DMI to the 2011 PCH.)
  • Slide 23
  • Integration: Optimization Opportunities
    – Dynamically redistribute power between cores and graphics
    – Tight power-management control of all components, providing better granularity and deeper idle/sleep states
    – Three separate power/frequency domains: system agent (fixed); cores + ring (variable); graphics (variable)
    – High-bandwidth last-level cache, shared among cores and graphics: a significant performance boost that saves memory bandwidth and power
    – Integrated memory controller and PCI Express ports, tightly integrated with the core/graphics/LLC domain: low latency and low power by removing intermediate busses
    – Bandwidth is balanced across the whole machine, from the cores/graphics all the way to the memory controller
    – Modular microarchitecture for optimal cost/power/performance; derivative products are done with minimal effort and time
  • Slide 24
  • Scalable Ring On-die Interconnect
    – Ring-based interconnect between the cores, graphics, last-level cache (LLC), and the system agent domain
    – Composed of 4 rings: a 32-byte data ring, a request ring, an acknowledge ring, and a snoop ring
    – Fully pipelined at core frequency/voltage: bandwidth, latency, and power scale with the number of cores
    – The massive ring wire routing runs over the LLC with no area impact
    – An access on the ring always picks the shortest path, minimizing latency
    – Distributed arbitration; the ring protocol handles coherency, ordering, and the core interface
    – Scalable to servers with a large number of processors
    – High bandwidth, low latency, modular
  • Slide 25
  • Cache Box
    – The interface block between core/graphics/media and the ring, and between the cache controller and the ring; implements the ring logic, arbitration, and the cache controller
    – Communicates with the system agent for LLC misses, external snoops, and non-cacheable accesses; a full cache pipeline in each cache box
    – Physical addresses are hashed at the source to prevent hot spots and increase bandwidth (an illustrative sketch follows); each box maintains coherency and ordering for the addresses mapped to it
    – The LLC is fully inclusive, with core valid bits that eliminate unnecessary snoops to the cores
    – Runs at core voltage/frequency and scales with the cores
    – Distributed coherency and ordering; scalable bandwidth, latency, and power
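Purely illustrative sketch of the idea (Intel does not document the actual Sandy Bridge hash, so this is an invented stand-in, not the real function): fold many physical-address bits into the slice index so that consecutive lines and strided patterns spread evenly across the cache boxes.

```c
/* Toy slice-selection hash -- NOT the real, undocumented hash.
 * XOR-folding upper address bits into the line address spreads
 * both sequential and power-of-two-strided accesses across the
 * cache boxes, avoiding hot spots on any single slice. */
#include <stdint.h>

static unsigned llc_slice(uint64_t paddr, unsigned n_slices) {
    uint64_t line = paddr >> 6;      /* 64-byte line address */
    uint64_t h = line;
    h ^= h >> 12;                    /* XOR-fold upper bits  */
    h ^= h >> 24;
    return (unsigned)(h % n_slices); /* pick a cache box     */
}
```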
  • Slide 26
  • LLC Sharing
    – The LLC is shared among all cores, graphics, and media; the graphics driver controls which streams are cached/coherent
    – Any agent can access all data in the LLC, independent of who allocated the line, after memory-range checks
    – A controlled LLC way-allocation mechanism prevents thrashing between the cores and graphics
    – Result: much higher graphics performance, DRAM power savings, and more DRAM bandwidth available to the cores
    – Multiple coherency domains: the IA domain (fully coherent via cross-snoops); the graphics domain (graphics virtual caches, flushed to the IA domain by the graphics engine); the non-coherent domain (display data, flushed to memory by the graphics engine)
  • Slide 27
  • System Agent
    – Contains PCI Express, DMI, the memory controller, and the display engine
    – Contains the power control unit: a programmable microcontroller that handles all power-management and reset functions in the chip
    – Smart integration with the ring: provides cores/graphics/media with high-bandwidth, low-latency access to DRAM and I/O for best performance; handles I/O-to-cache coherency
    – Separate voltage and frequency from the ring/cores; display integration for better battery life
    – Extensive power and thermal management for PCI Express and DDR
    – Smart I/O integration
  • Slide 28
  • Hyper-Threading Technology
  • Slide 29
  • Thread-Level Parallelism
    – Multiprocessor systems have been used for many years, and there are known techniques to exploit them
    – Software trends: applications consist of multiple threads or processes that can be executed in parallel on multiple processors (a minimal example follows this list)
    – Thread-level parallelism (TLP): threads can come from the same application, from different applications running simultaneously, or from operating-system services
    – Increasing single-thread performance becomes harder and is less and less power efficient
    – Chip multi-processing (CMP): two (or more) processors are put on a single die
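A minimal sketch of software exposing TLP (POSIX threads; all names are illustrative): each thread sums half of an array, and the OS is free to place the two threads on different physical or logical processors.

```c
/* Compile with -pthread. Two threads of one application that
 * can run in parallel -- the TLP the slide describes. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static long data[N];

struct span { long *p, n, sum; };

static void *partial_sum(void *arg) {
    struct span *s = arg;
    s->sum = 0;
    for (long i = 0; i < s->n; i++)
        s->sum += s->p[i];
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) data[i] = i;
    struct span lo = { data,         N / 2,     0 };
    struct span hi = { data + N / 2, N - N / 2, 0 };
    pthread_t t;
    pthread_create(&t, NULL, partial_sum, &hi); /* second thread */
    partial_sum(&lo);                           /* this thread   */
    pthread_join(t, NULL);
    printf("%ld\n", lo.sum + hi.sum);
    return 0;
}
```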
  • Slide 30
  • Multi-Threading
    – Multi-threading: a single processor executes multiple threads
    – Time-slice multithreading: the processor switches between software threads after a fixed period; can effectively minimize the effects of long latencies to memory
    – Switch-on-event multithreading: switch threads on long-latency events such as cache misses; works well for server applications that have many cache misses
    – A deficiency of both time-slice MT and switch-on-event MT: they do not cover for branch mispredictions and long dependencies
    – Simultaneous multi-threading (SMT): multiple threads execute on a single processor simultaneously, without switching; makes the most effective use of processor resources and maximizes performance per transistor count and power
  • Slide 31
  • Hyper-Threading (HT) Technology
    – HT is SMT: it makes a single processor appear as 2 logical processors (threads)
    – Each thread keeps its own architectural state: general-purpose registers plus control and machine-state registers
    – Each thread has its own interrupt controller: interrupts sent to a specific logical processor are handled only by it
    – The OS views logical processors (threads) as physical processors and schedules threads onto them as in a multiprocessor system (a detection sketch follows)
    – From a microarchitecture perspective, the threads share a single set of physical resources: caches, execution units, branch predictors, control logic, and buses
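A sketch of how software can see HT (GCC/Clang on x86, using the compiler-provided <cpuid.h>): CPUID leaf 1 sets EDX bit 28 when the package supports multiple logical processors, and EBX[23:16] then reports the logical-processor count per package.

```c
#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;                         /* CPUID leaf 1 absent */
    int htt = (edx >> 28) & 1;            /* HTT capability bit  */
    int logical = (ebx >> 16) & 0xff;     /* valid when htt == 1 */
    printf("HTT: %d, logical processors per package: %d\n",
           htt, logical);
    return 0;
}
```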
  • Slide 32
  • Two Important Goals
    – When one thread is stalled, the other thread can continue to make progress. Independent progress is ensured either by partitioning buffering queues and limiting the number of entries each thread can use, or by duplicating buffering queues.
    – A single active thread running on a processor with HT runs at the same speed as without HT. Partitioned resources are recombined when only one thread is active.
  • Slide 33
  • Front End
    – Each thread manages its own next-instruction pointer
    – The threads arbitrate for uop-cache (UC) access every cycle (ping-pong): if both want to access the UC, access is granted in alternating cycles; if one thread is stalled, the other gets the full UC bandwidth
    – UC entries are tagged with a thread ID and allocated dynamically as needed, which allows one logical processor to have more entries than the other
  • Slide 34
  • Front End (cont.)
    – Branch-prediction structures are either duplicated or shared: the return stack buffer is duplicated; global history is tracked per thread; the large global-history array is shared, with entries tagged by a logical-processor ID
    – Each thread has its own ITLB
    – Both threads share the same decoder logic; if only one thread needs the decode logic, it gets the full decode bandwidth; the state needed by the decoders is duplicated
    – The uop queue is hard-partitioned, which allows both logical processors to make independent forward progress regardless of front-end stalls (e.g., a UC miss) or execution stalls
  • Slide 35
  • Out-of-Order Execution
    – The ROB and MOB are hard-partitioned to enforce fairness and prevent deadlocks
    – The allocator ping-pongs between the threads; a thread is selected for allocation if its uop queue is not empty, its buffers (ROB, RS) are not full, and it is that thread's turn or the other thread cannot be selected
  • Slide 36
  • Out-of-Order Execution (cont.)
    – Registers are renamed into a shared physical register pool, which stores results until retirement
    – After allocation and renaming, uops are placed into one of 2 hard-partitioned queues: a memory-instruction queue and a general-instruction queue
    – Uops are read from the queues and sent to the schedulers using ping-pong
    – The schedulers are oblivious to threads: they schedule uops based on dependencies and execution-resource availability, regardless of thread, so uops from the two threads can be dispatched in the same cycle
    – To avoid deadlock and ensure fairness, the number of active entries a thread can have in each scheduler's queue is limited
    – The forwarding logic compares physical register numbers and forwards results to other uops without any thread knowledge
  • Slide 37
  • Out-of-Order Execution (cont.)
    – Memory is largely thread-oblivious: the L1 data cache, L2 cache, and L3 cache all use physical addresses
    – The DTLB is shared; each DTLB entry includes a thread ID as part of the tag
    – Retirement ping-pongs between the threads; if one thread is not ready to retire uops, all retirement bandwidth is dedicated to the other thread
  • Slide 38
  • Single-Task and Multi-Task Modes
    – MT mode (multi-task mode): two active threads, with some resources partitioned as described earlier
    – ST mode (single-task mode) has two flavors: single-task thread 0 (ST0), in which only thread 0 is active, and single-task thread 1 (ST1), in which only thread 1 is active
    – Resources that were partitioned in MT mode are recombined to give the single active logical processor use of all of the resources
    (State diagram: from MT, thread 1 executing HALT moves to ST0 and thread 0 executing HALT moves to ST1; the remaining thread executing HALT enters low power; an interrupt moves the processor back to an active mode.)
  • Slide 39
  • Operating System and Applications
    – An HT processor appears to the OS and application software as 2 processors; the OS manages logical processors as it does physical processors
    – The OS should implement two optimizations:
    – First, use HALT if only one logical processor is active, allowing the processor to transition to ST0 or ST1 mode. Otherwise the OS would run, on the idle logical processor, a sequence of instructions that repeatedly checks for work to do; this so-called idle loop can consume significant execution resources that could otherwise be used by the other active logical processor (a user-level analogue is sketched below).
    – Second, on a multiprocessor system, schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor; this lets software threads use different physical resources when possible.
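A related user-level sketch (function name is mine): HLT is privileged, so applications cannot use it, but a busy spin-wait in application code burns shared HT resources for the same reason the OS idle loop does. The documented mitigation for spin loops is the PAUSE hint, exposed as _mm_pause, which de-pipelines the loop and frees resources for the sibling logical processor.

```c
#include <immintrin.h>   /* _mm_pause */
#include <stdatomic.h>   /* C11 atomics */

/* Spin until another thread sets *flag, yielding shared
 * pipeline resources to the sibling HT thread on each probe. */
void spin_wait(atomic_int *flag) {
    while (atomic_load_explicit(flag, memory_order_acquire) == 0)
        _mm_pause();     /* HT-friendly spin-wait hint */
}
```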
