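Overview of SMT Microprocessor Architecture

This lecture was produced with the support of C&S Technology by the laboratory of Professor Yong Surk Lee (Processor Laboratory) at Yonsei University, a National Research Laboratory designated by the Ministry of Science and Technology.

Byung In Moon, Ph.D. candidate, Processor Laboratory, Department of Electrical and Electronic Engineering, Yonsei University
E-mail: yonglee@yonsei.ac.kr
Homepage: http://mpu.yonsei.ac.kr

Lecture Series on High-Performance Microprocessor Architecture and Design (http://mpu.yonsei.ac.kr)
1. Measures for Fostering the Semiconductor Industry and the Non-Memory Sector (1998)
2. Overview of High-Performance Microprocessor Architecture (1998)
3. Structure of the Instruction Decoder in High-Performance Microprocessors (1998)
4. Execution of Branch Instructions in High-Performance Microprocessors (1998)
5. Structure of the Multiplier in High-Performance Microprocessors (1998)
6. Structure of the Floating-Point Unit in High-Performance Microprocessors (1999)
7. Structure of Cache Memory in High-Performance Microprocessors (1999)
8. Structure of the Divider in High-Performance Microprocessors (1999)
9. Structure of the Transcendental Function Unit in High-Performance Microprocessors (1999)
10. Structure of the ALU and Register File in High-Performance Microprocessors
11. Structure of the Direct Digital Frequency Synthesizer (DDFS)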
12. Overview of VLSI Architecture and Design for Cryptography
13. Structure of the Floating-Point Unit in High-Performance Microprocessors (continued)
14. Floating-Point Division: Goldschmidt's Algorithm
15. Overview of SMT Microprocessor Architecture

References
[1] Byung In Moon, "A Study on an In-Order SMT Architecture and Grouping Schemes," Ph.D. dissertation, Department of Electrical and Electronic Engineering, Graduate School, Yonsei University.
[2] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proceedings of the 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, 1995.
[3] Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, and Henry M. Levy, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, Pennsylvania, May 1996.
[4] Peter Song, "Multithreading Comes of Age," Microprocessor Report, No. 9, July 1997.

Lecture Contents
* Background and motivation
* Multithreading
  - Coarse multithreading
  - Fine multithreading
  - Simultaneous multithreading
* SMT vs. Superscalar
* Simulation results
Basic Terminology
* IPC - Instructions Per Cycle
* ILP - Instruction-Level Parallelism
* TLP - Thread-Level Parallelism
* Thread
  - A part of a program (parallel programming)
  - A single program (multiprogramming)

Superscalar Arrives at Its Limit
* ILP (Instruction-Level Parallelism) history
  - Past: single-issue microprocessors with a comparatively small transistor count
  - Present: out-of-order microprocessors with many times more transistors, yet their IPC is only a small multiple of the single-issue figure
* Theoretical ILP: 8 IPC or more
* Practical ILP: even a much lower IPC is hard to reach
* The yearly speed-up of semiconductor circuits is lower than the yearly performance growth that is demanded
  -> more and more parallelism is required

Waste of Superscalar Issue Slots
[Figure: issue slots (full or empty) plotted over cycles. Empty slots in a cycle that issues at least one instruction are horizontal waste (9 slots in the example); the slots of a cycle that issues nothing are vertical waste.]

Huge Transistor Count on Chip
* Advances in semiconductor process and packaging technology
  - Chips with many millions of transistors
  - The surplus transistors need to be put to use
* Larger on-chip memory
  - Performance becomes limited by the speed of the processor core
* CMP (Chip MultiProcessor)
  - Each of multiple processors on one chip executes the thread allocated to it
  - Fixed resource partitioning -> poor performance when the workload doesn't match

Multithreading Architectures I
* Attention to multithreading as the next-generation microprocessor architecture
  - TLP makes up for the shortage of ILP
* Coarse multithreading (CMT)
  - Only one active thread in the pipeline
  - Thread switch overhead is large
  - The events that cause a thread switch are limited to long-latency operations
  - Low complexity
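The horizontal and vertical waste defined on the "Waste of Superscalar Issue Slots" slide can be counted with a few lines of Python. This is only an illustrative sketch: the function name `slot_waste` and the example grid are assumptions made for this note, not data from the lecture.

```python
def slot_waste(issue_grid):
    """Count issue-slot waste.

    issue_grid: one list per cycle, one 0/1 flag per issue slot
    (1 = an instruction was issued in that slot). Hypothetical input format.
    """
    used = horizontal = vertical = 0
    for cycle in issue_grid:
        filled = sum(cycle)
        used += filled
        if filled == 0:
            # Nothing issued: every slot of the cycle is vertical waste.
            vertical += len(cycle)
        else:
            # Partly filled cycle: the empty slots are horizontal waste.
            horizontal += len(cycle) - filled
    return {
        "used": used,
        "horizontal": horizontal,
        "vertical": vertical,
        "ipc": used / len(issue_grid),
    }


# Made-up example: a 4-wide machine observed for 5 cycles.
example_grid = [
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
example_stats = slot_waste(example_grid)
print(example_stats)
```

In this made-up grid, 6 of the 20 slots are used, 6 are lost horizontally and 8 vertically, so the IPC is 1.2 on a machine that could issue 4 per cycle.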
Multithreading Architectures II
* Fine multithreading (FMT)
  - Multiple active threads in the pipeline
  - Issues from only one thread in a cycle
  - Thread switch overhead is small
  - Switches to other threads even on short-latency operations
  - High complexity
  - Examples: Tera, HEP

Multithreading Architectures III
* Simultaneous multithreading (SMT)
  - Issues and executes instructions from multiple threads simultaneously each cycle
  - No thread switch overhead
  - Threads dynamically share processor resources
  - Huge complexity
  - Can be derived from the conventional superscalar processor

Sequential vs. Multithreaded
[Figure: sequential execution and multithreaded execution of threads A and B. Legend: useful cycles, idle cycles, context-switch cycles, thread-switch cycles.]

Instruction Issue of Coarse, Fine and Simultaneous
[Figure: issue slots over cycles for the three schemes. Legend: unutilized slot, slot used by each thread.]
* Coarse: large thread switch overhead (tens or hundreds of cycles); horizontal waste remains
* Fine: small thread switch overhead (typically zero or one cycle); horizontal waste remains
* Simultaneous: no thread switch overhead; decreases the amount of horizontal waste

Instruction Issue of Superscalar, CMP and SMT
[Figure: issue slots over cycles for a superscalar processor, a CMP, and an SMT processor. Legend: unutilized slot, slot used by each thread.]

Out-of-order vs. In-order in SMT
* Out-of-order issue and completion
  - Register renaming increases design complexity
    - Needs one additional pipeline stage for register renaming
    - Needs additional registers, so access to the register file becomes slower
  - Complicated recovery and restart mechanism for branch misprediction and exceptions
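The difference between fine and simultaneous multithreading in the issue figure above can be made concrete with a toy model. The sketch below assumes that `ready[t][c]` is the number of instructions thread `t` could issue in cycle `c` and that unissued instructions are not carried over; both the model and the function names are simplifications chosen for illustration, not the simulator used in the lecture.

```python
def issue_fine(ready, width):
    """Fine multithreading: each cycle issues from exactly one thread,
    chosen round-robin, so the other threads' ready instructions are unused."""
    n_threads = len(ready)
    used = 0
    for cycle in range(len(ready[0])):
        thread = cycle % n_threads
        used += min(ready[thread][cycle], width)
    return used


def issue_smt(ready, width):
    """Simultaneous multithreading: each cycle fills the issue slots with
    instructions from all threads until the slots run out."""
    used = 0
    for cycle in range(len(ready[0])):
        free = width
        for thread_ready in ready:
            take = min(thread_ready[cycle], free)
            used += take
            free -= take
    return used


# Made-up workload: 2 threads, 4 cycles, 4 issue slots per cycle.
example_ready = [
    [2, 0, 1, 3],  # thread 0
    [1, 2, 0, 1],  # thread 1
]
fine_total = issue_fine(example_ready, width=4)
smt_total = issue_smt(example_ready, width=4)
print(fine_total, smt_total)
```

With this workload the fine-grained scheme fills 6 of the 16 slots and the SMT scheme fills 10, which is the reduction of horizontal waste that the slide attributes to SMT.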
Out-of-order vs. In-order in SMT
* In-order issue and completion
  - No need for register renaming
  - Simple recovery and restart mechanism for branch misprediction and exceptions
  - TLP compensates for the lack of ILP within each thread

A Proposed SMT Architecture
* Issues and executes instructions from multiple threads simultaneously each cycle
  - Utilizes TLP as well as ILP
* Most resources are shared dynamically among different threads
  - Minimizes resource waste and greatly improves utilization of the functional units
* Simple in-order issue and completion
  - Removes the need for register renaming
  - Simple recovery and restart mechanism for branch misprediction and exceptions
  - Reduces design complexity and hardware cost
* Only a few changes from the superscalar design

Overall Superscalar Architecture
[Figure: block diagram with Fetch Unit, PC, IMMU, Instruction Cache, Instruction Fetch Queue, Decode Unit, Issue Unit, Scoreboard Array, four ALUs, Multiplier and further functional units, DMMU, Register File, and Data Cache.]

Overall SMT Architecture
[Figure: the same block diagram with a Thread Selector added in front of the Fetch Unit and the single Register File replaced by four Thread Register Sets.]

Superscalar Pipeline Structure
  Fetch (F):   instruction fetch
  Decode (D):  instruction decode
  Issue (I):   instruction issue
  Read (R):    operand read
  Execute (E): ALU: execution; Multiply: multiplication; Load/store: memory address calculation
  Memory (M):  Multiply: accumulation; Load/store: memory access; branch and exception checks; ALU: no operation
  Write (W):   write back results to the register file
[Figure: the seven stages drawn in sequence, with a "No stall" label on part of the pipeline.]
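The idea that TLP compensates for the limited ILP of an in-order thread can be sketched with a scoreboard-style issue check. The code below is a minimal illustration under stated assumptions: the `Inst` record, the `pending` set of (thread, register) pairs, and the rule that a blocked instruction stalls only the younger instructions of its own thread are all hypothetical simplifications, not the lecture's actual issue logic.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Inst:
    thread: int                 # thread id carried with the instruction
    dst: Optional[int]          # destination register, if any
    srcs: Tuple[int, ...] = ()  # source registers


def issue_cycle(queue, pending, width):
    """Issue up to `width` instructions from `queue` (oldest first).

    pending: set of (thread, register) pairs whose results are not yet
    written back; it is updated in place for the instructions issued.
    """
    issued = []
    blocked_threads = set()
    for inst in queue:
        if len(issued) == width:
            break
        if inst.thread in blocked_threads:
            continue  # in-order: stay behind the blocked older instruction
        regs = set(inst.srcs)
        if inst.dst is not None:
            regs.add(inst.dst)
        if any((inst.thread, r) in pending for r in regs):
            blocked_threads.add(inst.thread)
            continue
        issued.append(inst)
        if inst.dst is not None:
            pending.add((inst.thread, inst.dst))
    return issued


# Made-up queue: thread 0 stalls on a dependency, thread 1 keeps issuing.
example_queue = [
    Inst(0, 1, (2,)),
    Inst(0, 3, (1,)),   # needs R1 of thread 0, produced just above
    Inst(1, 1, (2,)),
    Inst(0, 4, (5,)),   # independent, but behind a blocked instruction
    Inst(1, 6, (7,)),
]
example_issued = issue_cycle(example_queue, set(), width=4)
print([(i.thread, i.dst) for i in example_issued])
```

Because registers are tracked per thread, R1 of thread 1 is unaffected by the pending R1 of thread 0, so the second thread fills slots that a single in-order thread would leave empty.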
SMT Pipeline Structure
  Select (S):  thread selection for instruction fetch
  Fetch (F):   instruction fetch
  Decode (D):  instruction decode
  Issue (I):   instruction issue
  Read (R):    operand read
  Execute (E): ALU: execution; Multiply: multiplication; Load/store: memory address calculation
  Memory (M):  Multiply: accumulation; Load/store: memory access; branch and exception checks; ALU: no operation
  Write (W):   write back results to the register file
[Figure: the eight stages drawn in sequence, with a "No stall" label on part of the pipeline.]

Superscalar Instruction Fetch
[Figure: the PC supplies the virtual instruction address. The ITLB translates the virtual high-order address (VPN) into the physical high-order address (PPN), while the low-order address (page offset) goes to the Instruction Cache unchanged. The cache delivers instructions to the Instruction Fetch Queue, and the branch prediction history goes to the BHB.]

SMT Instruction Fetch
[Figure: T PCs feed the Thread Selector, whose select signals pick M of them. The ITLB translates M VPNs into M PPNs; the Instruction Cache receives M high-order and M low-order addresses and delivers N/M instructions for each selected thread. M branch prediction histories go to the BHBs.]

Superscalar Instruction Fetch Queue
[Figure: a K-entry queue shown at cycle i and at cycle i+1; each cycle it accepts N instructions from the Instruction Cache and sends N instructions to the Decode Unit.]

Empty Entries in the Middle of the Instruction Fetch Queue in the SMT Architecture
[Figure: the Instruction Fetch Queue before and after thread flushing. The entries of the flushed thread become empty entries scattered among the instructions of the other threads.]

Instruction Issue Queue Compressing in SMT
[Figure: the L-entry Instruction Issue Queue before and after compressing.]
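Thread flushing followed by queue compressing, as drawn in the two figures above, can be sketched as two list operations. The entry format `(thread_id, label)`, the use of `None` for an empty entry, and compressing toward entry 0 are assumptions made for this illustration; the lecture's figures do not fix these details.

```python
def flush_thread(queue, thread_id):
    """Empty every entry that belongs to the flushed thread.
    Entries are (thread_id, label) tuples; None marks an empty entry."""
    return [
        None if entry is not None and entry[0] == thread_id else entry
        for entry in queue
    ]


def compress(queue):
    """Close the gaps: keep the surviving entries in their original order
    and move all empty entries to the end of the queue."""
    kept = [entry for entry in queue if entry is not None]
    return kept + [None] * (len(queue) - len(kept))


# Made-up 8-entry queue holding instructions of threads 0, 1 and 2.
example_queue = [
    (0, "i0"), (1, "i0"), (2, "i0"), (1, "i1"),
    (0, "i1"), (1, "i2"), (2, "i1"), None,
]
after_flush = flush_thread(example_queue, thread_id=1)
after_compress = compress(after_flush)
print(after_flush)
print(after_compress)
```

Keeping the relative order of the surviving entries matters for an in-order design, because the queue position is what encodes program order within each thread.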
Superscalar (ARM) Register File vs. SMT Thread Register Sets
[Figure: the superscalar (ARM) register file holds the User/System registers including the PC and the CPSR, plus the banked FIQ, IRQ, Supervisor, Abort, and Undefined registers with their SPSRs. The SMT design replicates this layout as one Thread Register Set per thread, T sets in total.]

Superscalar Scoreboard Array vs. SMT Scoreboard Array
[Figure: the superscalar Scoreboard Array holds Scond and one scoreboard entry for each User/System, FIQ, IRQ, Supervisor, Abort, and Undefined register. The SMT Scoreboard Array replicates this layout as one Thread Scoreboard Set per thread, T sets in total.]

Changes from Superscalar I
* Multiple PCs: one PC per thread
* Thread Selector and S stage
* Instruction cache
  - Several banks, multi-ported, non-blocking
* Multi-ported ITLB
* A thread id with each entry
* Multiple BHBs (one BHB per thread)
* Each instruction accompanied by its thread id
* Fetch and issue

Changes from Superscalar II
* Instruction issue queue compressing
* Per-thread
  - Instruction dependency check
  - Branch misprediction check
  - Exception mechanism
  - Instruction flush
* Register file and scoreboard array
  - T (the number of threads) times larger than those of the conventional superscalar processor
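The Thread Selector added in the S stage picks which of the T per-thread PCs are used for fetch in a cycle. A round-robin (RR) selector is one simple way to do this; the sketch below assumes that up to `m` threads are chosen per cycle and that a thread can be marked as not fetchable (for example while it is stalled). The function name and the `fetchable` flags are hypothetical, and the lecture's actual selection logic may differ.

```python
def select_threads(last, fetchable, m):
    """Round-robin thread selection.

    last:      id of the thread that had highest priority last time
    fetchable: one boolean per thread (True = may fetch this cycle)
    m:         maximum number of threads to select
    Returns the selected thread ids in priority order.
    """
    n_threads = len(fetchable)
    chosen = []
    for step in range(1, n_threads + 1):
        tid = (last + step) % n_threads
        if fetchable[tid]:
            chosen.append(tid)
            if len(chosen) == m:
                break
    return chosen


# Made-up state: 4 threads, thread 1 cannot fetch, 2 threads per cycle.
example_fetchable = [True, False, True, True]
first_pick = select_threads(last=0, fetchable=example_fetchable, m=2)
second_pick = select_threads(last=first_pick[-1], fetchable=example_fetchable, m=2)
print(first_pick, second_pick)
```

Starting the search just after the previously served thread keeps the selection fair, while skipping threads that cannot fetch avoids wasting fetch bandwidth on them.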
Grouping in SMT
* All threads share all hardware resources
  - Enhances performance
  - Increases design complexity
* Grouping (sectioning or partitioning)
  - For each resource, the threads are divided into groups
  - Each group uses its portion of the resource exclusively
  - Decreases performance but simplifies design

Three Types of Grouping
[Figure: the SMT block diagram (Thread Selector, PC, Fetch Unit, IMMU, Instruction Cache, Instruction Fetch Queue, Decode Unit, Issue Unit, Scoreboard Array, four ALUs, Multiplier and further functional units, DMMU, four Thread Register Sets, Data Cache) with the three grouping regions marked.]

Grouping of Instruction Fetch
[Figure: two groups, each with its own PCs and its own select signal from the Thread Selector. The virtual address of the thread selected in each group goes to the ITLB, the physical address goes to the Instruction Cache, and each group receives its share of the N fetched instructions together with the branch prediction history of the selected thread.]

Grouping of Decode and Issue Queues
[Figure: each group receives its share of the N instructions from its own Instruction Fetch Queue, decodes them in its own Decode Slots, and places the decoded instructions in its own Instruction Issue Queue, from which its share of instructions is issued.]

Grouping of Issue Control and Execution
[Figure: each group has its own Instruction Issue Control Logic, Scoreboard Array (with recent update information), and Functional Units, with per-group operand read, issue, result, and write paths.]

Performance versus Number of Threads I
[Figure: two panels, one per issue width, plotted against the number of threads.]
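The performance cost of grouping described above comes from the fact that a group cannot borrow slots another group leaves idle. The sketch below shows this with a one-cycle toy model; it assumes that threads and issue slots are split evenly across the groups and that `ready[t]` is the number of instructions thread `t` could issue this cycle. The function name and the numbers are illustrative assumptions, not results from the lecture.

```python
def issued_slots(ready, width, groups):
    """Instructions issued in one cycle when threads and issue slots are
    split evenly into `groups` exclusive partitions (groups=1 means that
    all threads share all slots)."""
    threads_per_group = len(ready) // groups
    width_per_group = width // groups
    total = 0
    for g in range(groups):
        members = ready[g * threads_per_group:(g + 1) * threads_per_group]
        # A group can use only its own slots, even if others sit idle.
        total += min(sum(members), width_per_group)
    return total


# Made-up cycle: 4 threads with uneven demand on an 8-wide machine.
example_ready = [5, 1, 0, 1]
shared = issued_slots(example_ready, width=8, groups=1)
two_groups = issued_slots(example_ready, width=8, groups=2)
four_groups = issued_slots(example_ready, width=8, groups=4)
print(shared, two_groups, four_groups)
```

In this example full sharing issues 7 instructions, two groups issue 5, and four groups issue 4: the finer the partitioning, the less one thread's burst can use slots that other threads leave empty, which is the trade-off against the simpler per-group hardware.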
Performance versus Number of Threads II
[Figure: two panels, including one for 8-way issue, plotted against the number of threads.]

Cache Miss Rate versus Number of Threads (Set-Associative Caches)
[Figure: two panels, including one for 8-way issue. Y axis: Miss (%). X axis: number of threads.]

Performance versus Number of Cache Ways I to VI
[Figures: six slides, each with panels for different thread counts. X axis: number of cache ways (up to 8-way and beyond). Several panels plot Miss (%).]

Performance versus Cache Size I and II
[Figures: panels for set-associative caches, including 8-way set-associative. X axis: Cache size (KB). Several panels plot Miss (%).]

Frequency of Branch Mispredictions versus Number of Threads (8-entry)
[Figure: two panels, including one for 8-way issue. Y axis: Frequency (%). Series: frequency of branch mispredictions, frequency of replacements.]

Performance versus Number of Entries I
[Figure: panels for different thread counts. X axis: Number of entries. Y axis: Frequency (%). Series: frequency of branch mispredictions, frequency of replacements.]
Performance versus Number of Entries II
[Figure: X axis: Number of entries. Y axis: Frequency (%). Series: frequency of branch mispredictions, frequency of replacements.]

Performance by Priority Policy
[Tables: two slides, one per configuration. Columns: type of fetch/issue priority (RR, ICOUNT_IFQ, ICOUNT_Q, ICOUNT_ALL, ICOUNT_BR, ICOUNT_MISS, IIQOL, OLDEST, ICNT_FU, ICNT_MISS). Rows: Fetch, Issue, Execution.]

Performance by Grouping
[Table: one column per combination of the three grouping types. Rows: Fetch, Issue, Execution.]

Lecture Summary I
* Background of SMT architecture research
  - Superscalar arrives at its limit
  - Huge transistor count on chip
* Multithreading
  - CMP (chip multiprocessor)
  - CMT (coarse multithreading)
  - FMT (fine multithreading)
  - SMT (simultaneous multithreading)

Lecture Summary II
* SMT can be derived from the conventional superscalar
  - Superscalar vs. SMT
  - Grouping
* Simulation results
  - Inter-thread interference
  - Priority
  - Grouping