Microprocessor Technology: Heterogeneous Multi-Core Processors
Won Woo Ro

Computer Systems
Processor-Memory Systems
- 1990~: Server-Client Computing. Processor: High Clock Rate, ILP. Memory: SDRAM
- 2000~: Internet and Desktop Computing. Processor: Multi-Core/GPU. Memory: DDR/Cache
- 2010~: Cloud and Mobile Computing. Processor: Heterogeneous Multi-Core/AP. Memory: SSD/PRAM/MRAM/LPDDR/Wide I/O

The History of Breaking Walls
- Hardware steps: High Clock Rate, ILP & Cache, Multi-Core, Heterogeneous Multi-Core
- Walls: Thermal/Power, Dark Silicon, TLP, Efficiency (energy/speed)
Software Performance
- Software steps: Sequential Programming, Multitasking, Parallel Programming, OpenCL / CUDA
- Walls: Thermal/Power, Dark Silicon, TLP, Efficiency (energy/speed)

DARK SILICON
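Dark Silicon
- Dark silicon: the silicon area on a chip that cannot contribute to performance when the chip runs at peak performance
- In a multi-core processor: the total area of the cores left unused while a program executes
- Example: 65nm, 4 cores, 1.8 GHz vs. 32nm, 8 cores, >= 1.8 GHz
- As core scaling proceeds, the dark silicon area grows
- Main causes: parallelism limit, power limit
- Related theory: Amdahl's Law, end of Dennard scaling
* Esmaeilzadeh et al., Dark Silicon and the End of Multicore Scaling

Limits of Parallelism
- Amdahl's Law*: during sequential execution or limited parallel execution, the remaining cores are dark silicon
[Figure: sequential bottleneck vs. parallelizable portion; the sequential bottleneck is reduced]
- Should every core simply be made larger to reduce dark silicon?
- If the software has high parallelism, dark silicon decreases
[Figure: examples of multicore configurations that can be built from the same resources]
* Hill, M. D. et al., Amdahl's Law in the Multicore Era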
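The trade-off between many small cores and fewer large cores can be made concrete with a short calculation. The sketch below implements Amdahl's Law and the symmetric multicore model from the cited Hill and Marty paper, in which a chip of n base-core equivalents (BCEs) is split into n/r cores of r BCEs each. The performance function perf(r) = sqrt(r) is the modeling assumption used in that paper, not a measured value, and the function names are illustrative.

```python
import math


def amdahl_speedup(f, n):
    """Amdahl's Law: speedup on n cores when a fraction f of the work is parallelizable."""
    return 1.0 / ((1.0 - f) + f / n)


def symmetric_speedup(f, n, r):
    """Symmetric multicore model (Hill & Marty).

    The chip has n base-core equivalents (BCEs) of area, grouped into n/r
    identical cores of r BCEs each. perf(r) = sqrt(r) is an assumption:
    a core r times larger is taken to be sqrt(r) times faster.
    """
    perf = math.sqrt(r)
    serial_time = (1.0 - f) / perf        # one core runs the sequential part
    parallel_time = f * r / (perf * n)    # all n/r cores run the parallel part
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    n = 16
    for f in (0.5, 0.9, 0.999):
        small = symmetric_speedup(f, n, 1)    # 16 small cores
        large = symmetric_speedup(f, n, 16)   # 1 large core
        print(f"f={f}: 16 small cores -> {small:.2f}x, 1 large core -> {large:.2f}x")
```

With little parallelism (f = 0.5) the single large core wins, while with highly parallel software (f = 0.999) the sixteen small cores win, which matches the slide's point that higher software parallelism reduces dark silicon.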
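Limits of Power
- Moore's Law: the number of transistors doubles every 18 months; together with Dennard scaling, it is the law underlying multicore scaling

Failure of Dennard Scaling*
- Transistor count doubles as the technology (feature size) shrinks to 0.7x per generation (S: technology scaling factor = 1/0.7 ≈ 1.4)
- Per-generation scaling:
  - Transistor count: 2x (S^2)
  - Frequency: 1.4x (S)
  - Capacitance: 0.7x (1/S)
  - V_dd^2: 0.5x (1/S^2)
- Under Dennard scaling these factors multiply to S^2 x S x 1/S x 1/S^2 = 1, so chip power stays constant. Once V_dd stops scaling, power grows by S^2 per generation: Dennard scaling fails.
[Figure: power scaling across generations; axis marks 1, S, S^2, S^3]

Research Directions for Addressing Dark Silicon*
1. Shrinking the chip
   - Cost problems, and temperature rises as power density increases
2. Dim silicon
   - Use under-clocking to overcome the power limit as cores scale
   - Raise the clock frequency momentarily to boost performance
   - Examples: Turbo Boost 2.0 (Intel), Computational Sprinting (HPCA '12), big.LITTLE cores (ARM)
3. Specialized hardware
   - Embed special-purpose units and turn them on when needed
   - Examples: GreenDroid, heterogeneous processors
4. New devices
   - Develop new devices that overcome the limits of the MOSFET
   - Solve leakage, the root cause of the failure of Dennard scaling
* Michael B. Taylor, Is Dark Silicon Useful?: Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse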
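The per-generation scaling factors listed for Dennard scaling can be multiplied out directly. The following is a minimal sketch using the slide's S = 1/0.7; the function name and the simple on/off switch for voltage scaling are illustrative assumptions, and the "usable fraction" is just the reciprocal of the power growth under a fixed power budget.

```python
S = 1 / 0.7  # technology scaling factor per generation (about 1.4)


def power_scaling(s, vdd_scales):
    """Relative full-chip power after one generation of scaling by factor s.

    Power is modeled as transistors * frequency * capacitance * Vdd^2.
    If vdd_scales is False, the supply voltage is assumed to stay fixed
    (the post-Dennard case).
    """
    transistors = s ** 2                                # 2x
    frequency = s                                       # 1.4x
    capacitance = 1 / s                                 # 0.7x
    vdd_squared = 1 / s ** 2 if vdd_scales else 1.0     # 0.5x or unchanged
    return transistors * frequency * capacitance * vdd_squared


dennard = power_scaling(S, vdd_scales=True)        # power stays constant
post_dennard = power_scaling(S, vdd_scales=False)  # power grows by S^2
usable_fraction = 1 / post_dennard                 # share of the chip that fits the old budget

if __name__ == "__main__":
    print(f"Dennard scaling:      power x{dennard:.2f}")
    print(f"Post-Dennard scaling: power x{post_dennard:.2f}")
    print(f"Usable at full speed under a fixed budget: {usable_fraction:.0%}")
```

Under this model roughly half of the chip can run at full frequency after one generation with fixed voltage; the rest must be dark or dim.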
GPU

GPU: Graphics Only, Before
[Figure: graphics pipeline: Application Stage, Geometry Stage (vertex processing), Rasterization Stage (pixel processing)]
"A graphics processing unit (GPU), also occasionally called visual processing unit (VPU), is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the building of images in a frame buffer intended for output to a display." (from Wikipedia)
GPU: GPGPU Now!
General-Purpose Computation on GPU
- The unified shader architecture was introduced and programmability increased
- Compute performance improved dramatically
- GPUs began to be used for general-purpose computation, not only 3D graphics
"The utilization of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU)." (from Wikipedia)

NVIDIA GPUs
[Figure: Streaming Multiprocessor (SM) and SMX block diagrams: CUDA cores, load/store units, SFUs, scheduler, L1 cache, shared L2 cache]
Year | Microarchitecture | Cores per multiprocessor | Total CUDA cores | Caches | Single precision | Double precision
2008 | Tesla | 8 CUDA cores / SM | Up to 128 | - | 933 GFLOPS | 77 GFLOPS
2010 | Fermi | 32 CUDA cores / SM | Up to 512 | L1 cache, shared L2 cache | 1.3 TFLOPS | 0.6 TFLOPS
2012 | Kepler | 192 CUDA cores / SMX | Up to 2,688 | L1 cache, shared L2 cache | 4 TFLOPS | 1.3 TFLOPS
Fermi Architecture
[Figure: Fermi architecture block diagram]

Kepler GPU Architecture (GK110)
- 28nm, 561 mm^2, 7,080 million transistors
- Core clock: 837 MHz
- Memory: GDDR5, 6 GHz, 384-bit, 288.4 GB/s bandwidth
- TDP: 250 W
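The GK110 figures can be cross-checked with two back-of-the-envelope formulas. This is a rough sketch, not vendor data: the 2,688-core count is taken from the NVIDIA GPUs table in this deck, and the factor of 2 floating-point operations per core per cycle (one fused multiply-add) is an assumption.

```python
def memory_bandwidth_gb_s(data_rate_gbps, bus_width_bits):
    """Peak memory bandwidth in GB/s: per-pin data rate times bus width, in bytes."""
    return data_rate_gbps * bus_width_bits / 8


def peak_sp_gflops(cores, clock_ghz, flops_per_cycle=2):
    """Peak single-precision GFLOPS, assuming each core retires
    flops_per_cycle operations per cycle (2 = one fused multiply-add)."""
    return cores * clock_ghz * flops_per_cycle


bandwidth = memory_bandwidth_gb_s(6.0, 384)   # GDDR5 at 6 GHz effective, 384-bit bus
peak = peak_sp_gflops(2688, 0.837)            # 2,688 CUDA cores at 837 MHz

if __name__ == "__main__":
    print(f"Memory bandwidth: {bandwidth:.1f} GB/s")
    print(f"Peak single precision: {peak / 1000:.2f} TFLOPS")
```

The bandwidth estimate lands at 288 GB/s, close to the quoted 288.4 GB/s (which suggests an effective data rate slightly above 6 GHz), and the compute estimate is in the same range as the roughly 4 TFLOPS listed for Kepler.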
SMX: 192 CUDA Cores
- 14 SMX x 192 CUDA cores = 2,688 execution units
[Figure: GK110 die layout: Graphics Processing Clusters (GPCs), raster engines, GigaThread Engine, other logic, and a unit marked "Disabled"]

GPU Architecture Characteristics
- Many ALUs for fast per-pixel computation in 3D graphics
  - Tens to hundreds of times more processor cores than a typical CPU
- Simplified control logic
  - Very simple, unlike a CPU, whose control logic is bloated to support branch prediction and out-of-order execution
- Small cache memory, large register file
  - Only a minimal cache; instead, registers are allocated independently to every thread, so context-switching latency is close to zero
- High-speed, high-bandwidth DRAM
  - Offsets the performance loss caused by the small cache (NVIDIA Fermi GPU: 192.4 GB/s vs. Intel Core i7 CPU: 25.6 GB/s)
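CPU vs. GPU: Vector Addition
CPU:
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
GPU:
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
[Figure: vectors A and B added into C, on a single CPU core vs. on the many cores of a GPU]
- Many processor cores are used to process operations on large amounts of data in parallel

GPGPU Programming Model: CUDA
- A GPU programming model for general-purpose processing on NVIDIA GPUs (the GPU is used as a compute coprocessor for the CPU)
- The application runs on the CPU; data designated by the programmer is transferred to the GPU for processing, and the results are used on the CPU
- A very large number of threads is used to exploit the architecture of the GPGPU, a many-core processor
- GPU threads operate in a SIMT (Single-Instruction, Multiple-Thread) fashion, each performing the work assigned to it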
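The SIMT idea, where every thread runs the same kernel body on its own index, can be emulated on the CPU. The sketch below is plain Python, not CUDA: `launch` is a hypothetical helper that runs the threads one after another, whereas a GPU would run them in parallel. The index expression mirrors CUDA's usual `blockIdx.x * blockDim.x + threadIdx.x`, which extends the single-block `threadIdx.x` form on the slide to more than one block.

```python
def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, c, n):
    """Kernel body: every thread executes this same code on its own element."""
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < n:                                # guard threads past the end of the data
        c[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Hypothetical sequential stand-in for a kernel launch:
    calls the kernel once per (block, thread) pair."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


n = 10
a = list(range(n))
b = [10 * x for x in a]
c = [0] * n

block_dim = 4                                  # threads per block
grid_dim = (n + block_dim - 1) // block_dim    # enough blocks to cover n elements

launch(vector_add_kernel, grid_dim, block_dim, a, b, c, n)

if __name__ == "__main__":
    print(c)
```

Note that the loop over elements has disappeared from the kernel itself; the iteration space is expressed by the launch configuration, which is why the bounds check is needed when n is not a multiple of the block size.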
GPU API
- Graphics: OpenGL, DirectX/Direct3D
- GPGPU: OpenCL, CUDA

INTEL
Intel Processor: for Desktop/Server
- Tock: new architecture
- Tick: die shrink
Year | Step | Processor | Process | Microarchitecture
2008 | Tock | Nehalem | 45nm | Nehalem
2010 | Tick | Westmere | 32nm | Nehalem
2011 | Tock | Sandy Bridge | 32nm | Sandy Bridge
2012 | Tick | Ivy Bridge | 22nm | Sandy Bridge
2013 | Tock | Haswell | 22nm | Haswell
- | Tick | Broadwell | 14nm | Haswell
- Nehalem microarchitecture: up to 8 cores, native multicore structure, Hyper-Threading reintroduced, L3 cache, integrated memory controller
- Sandy Bridge microarchitecture: up to 8 cores, ring bus interconnect, integrated GPU
- Haswell microarchitecture (Core i7 4000 series): integrated voltage regulator, designed for power efficiency

Intel MIC
- Larrabee microarchitecture (2006~2010, canceled)
  - Designed for 3D graphics / GPGPU
  - P54C Pentium-based x86 cores
  - 512-bit vector processing unit
  - Up to 1 TFLOPS (single-precision FP)
- Intel Xeon Phi family: Intel Many Integrated Core Architecture (Intel MIC), designed for high-performance computing
  - Knights Ferry (2010): MIC prototype, 45nm, 32 cores, 4 threads/core, 2GB RAM, up to 750 GFLOPS (single precision)
  - Knights Corner (2012): commercial product, 22nm, 60 cores, 4 threads/core, 8GB RAM, up to 1 TFLOPS (double-precision FP)
  - Knights Landing (planned): 2nd-generation MIC
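HETEROGENEOUS SYSTEM ARCHITECTURE

Limitations of Current CPU+GPU Architecture
Separated Memory Address Space
- The memory address spaces used by the CPU and the GPU are separate
  - The programmer must handle all CPU-GPU data transfers, memory allocation, and so on
- The GPU's memory address space is smaller than the CPU's
  - Applications that use large amounts of memory are often hard to port to the GPU
Programming Model
- General-purpose programming APIs such as OpenCL or DirectCompute can use both the CPU and the GPU
  - However, data transfer and control between the processors are left entirely to the programmer
  - Controlling the CPU and GPU together efficiently is difficult, and the process is complex
- Task/thread creation and control from the GPU side are very limited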
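The burden that a separated address space places on the programmer can be illustrated with a toy model. In the sketch below, `DeviceMemory` is a hypothetical stand-in for a GPU's private memory (it is not a real API), and elements are assumed to be 4 bytes each. Every input and output has to be copied explicitly, and the copies are counted.

```python
BYTES_PER_ELEMENT = 4  # assumption: 32-bit values


class DeviceMemory:
    """Hypothetical model of a GPU's separate address space."""

    def __init__(self):
        self.buffers = {}        # device-side buffers, invisible to host code
        self.bytes_copied = 0    # running total of transferred bytes

    def copy_in(self, name, host_data):
        """Explicit host-to-device copy."""
        self.buffers[name] = list(host_data)
        self.bytes_copied += BYTES_PER_ELEMENT * len(host_data)

    def copy_out(self, name):
        """Explicit device-to-host copy."""
        data = self.buffers[name]
        self.bytes_copied += BYTES_PER_ELEMENT * len(data)
        return list(data)


def offload_vector_add(a, b):
    """Vector addition with programmer-managed transfers."""
    dev = DeviceMemory()
    dev.copy_in("a", a)                       # step 1: copy inputs to the device
    dev.copy_in("b", b)
    dev.buffers["c"] = [x + y for x, y in     # step 2: compute on device-side data
                        zip(dev.buffers["a"], dev.buffers["b"])]
    c = dev.copy_out("c")                     # step 3: copy the result back
    return c, dev.bytes_copied


if __name__ == "__main__":
    result, copied = offload_vector_add([1, 2, 3], [4, 5, 6])
    print(result, f"({copied} bytes moved for 3 additions)")
```

Even in this tiny case, three buffers cross the CPU-GPU boundary for one addition per element, which is the overhead and bookkeeping the slide attributes to the current CPU+GPU model.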
Heterogeneous System Architecture (HSA)*
- A next-generation computer architecture proposed by AMD
- HSA APU = LCU + TCU
  - Latency Compute Unit (the conventional CPU): handles serial and task-parallel workloads
  - Throughput Compute Unit (the conventional GPU): handles data-parallel workloads
- Unified memory address space
  - Uses the HMMU (HSA-specific Memory Management Unit) to unify the memory address space of the LCU and the TCU
  - No memory copies are needed as in a conventional CPU-GPU system; memory can be accessed by passing pointers, as in conventional programming
* Lisa T. Su, Architecting the Future through Heterogeneous Computing, ISSCC 2013

ARM'S BIG.LITTLE ARCHITECTURE
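ARM's big.LITTLE Processor*
Structure
- An architecture in which two cores with the same ISA but different performance form a group
- Consists of a high-performance, high-power core (big, out-of-order) and a low-performance, low-power core (LITTLE, in-order)
- A workload is assigned to one group and can run on only one of the two cores at a time
[Figure: Cortex-A15 - CCI - Cortex-A7 system]

Task Assignment: Performance vs. Power
- Performance vs. power graph with DVFS applied
- Hatched region: power saved at equal performance
[Figure: performance vs. power graph]

Task Migration
- Overhead is reduced by allowing the two clusters to snoop each other's caches
- At some point cache snooping ends and the outgoing core enters power-saving mode
[Figure: task migration process]
* ARM, big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7, White Paper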
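The task-assignment idea, running on the LITTLE core until more performance is needed and migrating back when demand drops, can be sketched as a simple policy with hysteresis. This is an illustrative model only, not ARM's switcher: the load metric, the two thresholds, and the function names are all assumptions.

```python
UP_THRESHOLD = 0.85     # assumed: migrate to the big core above this load
DOWN_THRESHOLD = 0.30   # assumed: migrate back to the LITTLE core below this load


def next_cluster(current, load):
    """Pick the core for the next interval.

    Two separate thresholds give hysteresis, so the task does not bounce
    between cores when the load hovers around a single cut-off.
    """
    if current == "LITTLE" and load > UP_THRESHOLD:
        return "big"
    if current == "big" and load < DOWN_THRESHOLD:
        return "LITTLE"
    return current


def schedule(loads, start="LITTLE"):
    """Return the core chosen for each interval of a load trace."""
    cluster = start
    trace = []
    for load in loads:
        cluster = next_cluster(cluster, load)
        trace.append(cluster)
    return trace


if __name__ == "__main__":
    for load, core in zip([0.2, 0.9, 0.5, 0.2, 0.4],
                          schedule([0.2, 0.9, 0.5, 0.2, 0.4])):
        print(f"load {load:.1f} -> {core}")
```

In the example trace the task moves to the big core when the load spikes to 0.9, stays there at 0.5 (inside the hysteresis band), and returns to the LITTLE core once the load falls to 0.2.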
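MOBILE GPU

Architecture Trend
Mobile GPU
- Aims to deliver the desktop PC graphics experience on mobile devices as well
- Focuses on achieving high 3D graphics and media processing performance while keeping power low enough for mobile use
- Support for the graphics APIs formerly handled by discrete GPUs (DirectX, OpenGL, etc.) is being strengthened
Discrete GPU
- Focuses on the high processing performance needed for real-time rendering of realistic, high-quality 3D graphics
- Architectural changes target high-performance computing as well as graphics
- Performance per watt is a key issue because of the power and heat problems caused by large numbers of compute units and high-speed memory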
Trends: High-Performance Mobile CPU + GPU
[Figure: various apps]
Mobile GPUs
- PowerVR SGXMP: multi-core, 60M tri/s, 400M pix/s
- Mali T450: quad core, 104M tri/s, 3.8G pix/s
- ...
Latest Mobile GPUs
- Mali T6xx (ARM): double-precision floating point, 5x performance improvement
- PowerVR 6 (Imagination Tech.): Tile Based Deferred Rendering, cluster GPU architecture
- Adreno GPU (Qualcomm): 128 shader cores, 2x performance improvement
Future trends for high-performance mobile applications: parallel computing on heterogeneous MP-SoC with GPGPU

Commercial Phones
Phone | Release Date | Application Processor | Technology | CPU | GPU
Samsung Galaxy Note 2 | Aug. 2012 | Exynos 4412 | 32nm | Quad-core Cortex-A9 @ 1.6GHz | Quad-core Mali-400MP @ 533MHz
Samsung Galaxy S3 | May 2012 | Exynos 4412 | 32nm | Quad-core Cortex-A9 @ 1.6GHz | Quad-core Mali-400MP
Samsung Galaxy S4 | Apr. 2013 | Exynos 5410 | 28nm | Octa-core Cortex-A7 & A15 @ 1.2 & 1.6GHz | Triple-core PowerVR SGX544MP3
Apple iPhone 5 | Sep. 2012 | Apple A6 | 32nm | Dual-core Cortex-A15 @ 1.2GHz | Triple-core PowerVR SGX543MP3
HTC One X+ | Oct. 2012 | Nvidia Tegra 3 AP37 | 40nm | Quad-core Cortex-A9 @ 1.7GHz | ULP GeForce @ 520MHz
LG Optimus G | Oct. 2012 | Qualcomm S4 APQ8064 | 28nm | Quad-core Cortex-A9 @ 1.5GHz | Adreno 320
LG Optimus G Pro | Apr. 2013 | Qualcomm APQ8064T | 28nm | Quad-core Cortex-A9 @ 1.7GHz | Adreno 320
Nokia Lumia 920 | Sep. 2012 | Qualcomm MSM8960 | 28nm | Dual-core Cortex-A9 @ 1.5GHz | Adreno 225
Motorola Droid Razr Maxx HD | Sep. 2012 | Qualcomm MSM8960 | 28nm | Dual-core Cortex-A9 @ 1.5GHz | Adreno 225
Sony Xperia T | Sep. 2012 | Qualcomm MSM8260A | 28nm | Dual-core Cortex-A9 @ 1.5GHz | Adreno 225
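Imagination
- The world's No. 1 vendor of embedded/mobile GPU cores
- PowerVR: Imagination's mobile GPU brand, with a high share of the mobile AP market (used in a wide range of products, including Apple A4, A5, A6 and Samsung Hummingbird/Exynos 5 Octa)
[Figure: PowerVR SGX GPGPU core architecture]

ARM
- Mali: the GPU brand ARM obtained by acquiring Falanx of Norway
- Released the graphics-only Mali-400 series and the Mali-600 series for the high-end smart device and tablet market; used in Samsung Exynos 4/5
[Figure: Mali-450MP core architecture]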
Qualcomm
- Adreno: Qualcomm's mobile GPU brand
- Adreno 320 is the most recently released GPU core; it has a unified shader architecture and supports dual memory channels and OpenCL 1.1 (used in Snapdragon 600 and Snapdragon 800)
[Figure: Qualcomm Adreno GPU roadmap]

NVIDIA Mobile GPU
- Targeting the mobile AP market with its own Tegra series; Tegra and the GPU it includes, ULP
CONCLUSION

Intel Processors
Heterogeneous
 | Sandy Bridge | Ivy Bridge
Die Size | 212 mm^2 | 160 mm^2
Total Transistors | 1.16 B | 1.40 B
Core Transistors | 79.4 M | 80.4 M

Commercial Processors
 | Sandy Bridge (Core i7 3970X) | Kepler (GK110) | Knights Corner (Xeon Phi SE10X)
Die Size | 435 mm^2 | 551 mm^2 | 350 mm^2
Total Transistors | 2.3 B | 7.1 B | 5 B
No. of Cores | 6 CPU cores | 2,688 CUDA cores | 61 cores
TDP | 150 W | 250 W | 300 W
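The Commercial Processors figures are easier to compare once they are normalized by die area. The short sketch below does only that arithmetic on the values listed in the table; the dictionary keys and helper names are illustrative, and TDP divided by die area is a crude proxy for power density, not a measured quantity.

```python
# Values copied from the Commercial Processors table.
chips = {
    "Sandy Bridge (Core i7 3970X)":    {"die_mm2": 435, "transistors": 2.3e9, "tdp_w": 150},
    "Kepler (GK110)":                  {"die_mm2": 551, "transistors": 7.1e9, "tdp_w": 250},
    "Knights Corner (Xeon Phi SE10X)": {"die_mm2": 350, "transistors": 5.0e9, "tdp_w": 300},
}


def transistor_density(chip):
    """Million transistors per square millimetre."""
    return chip["transistors"] / chip["die_mm2"] / 1e6


def power_density(chip):
    """TDP per square millimetre (W/mm^2), a rough proxy for power density."""
    return chip["tdp_w"] / chip["die_mm2"]


if __name__ == "__main__":
    for name, chip in chips.items():
        print(f"{name}: {transistor_density(chip):.1f} M transistors/mm^2, "
              f"{power_density(chip):.2f} W/mm^2")
```

By this measure the throughput-oriented parts pack more than twice as many transistors per unit area as the 6-core CPU, which is consistent with the deck's description of GPUs as many simple cores with little control logic.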
So Heterogeneous Multi-Core Looks Natural
- One-chip solution and dark silicon
- But programmability for GPGPU? Performance?
- Discrete GPUs will be there as well
- General-purpose processor (Intel)
  - Xeon Phi
  - Heterogeneous: Xeon/Core/Atom
- Mobile AP + GPU
  - Not for GPGPU but for graphics
  - ARM ISA
- Cache and memory

Q & A
Thank you!