The Power and Potential of Parallelism
이홍석, March 2010

The era of parallel processing systems has arrived
- Small-scale multi-core environments
  - Laptops, PCs, workstations, and servers are all multi-core
  - Operating systems support multi-core: POSIX threads and Win32 threads
  - The rise of GPGPU-based personal supercomputers
  - Rapid progress of CUDA / OpenCL / GPGPU
- Large-scale multi-core environments
  - Supercomputers: systems built from 10,000 to 200,000+ cores
  - Clusters: systems with good price/performance
  - Servers
  - Grid computing
Parallel computing vs. distributed computing
- Parallel computing
  - Uses multiple processors to solve a single problem
  - Requires a high-performance interconnect between processors
  - Supercomputing
- Distributed computing
  - Makes it convenient to use many processors
  - Suited to applications with low communication load: the parallel computation itself is short, but distributing the work takes long
  - Grid computing

Let's learn parallel programming techniques
- Writing a good parallel program means:
  - Correct answers (correctness)
  - Good parallel performance
  - Good scalability to many processors
  - Portability across heterogeneous parallel systems
- Parallel programs are closely tied to the hardware specification
- MPI, OpenMP, Hybrid (MPI+OpenMP), CUDA, Pthreads, Java, ... (a minimal OpenMP sketch follows below)
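As a first taste of the programming models listed above, here is a minimal OpenMP sketch, not taken from the slides: it counts the 3s in an array in parallel (the array size and fill pattern are assumptions made for illustration).

    /* Minimal OpenMP sketch (illustrative): parallel counting of 3s.
     * Build with, e.g., gcc -fopenmp count3s_omp.c -o count3s_omp     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        const int length = 1000000;              /* assumed array size */
        int *array = malloc(length * sizeof(int));
        for (int i = 0; i < length; i++)
            array[i] = i % 10;

        int count = 0;
        /* Each thread counts its share; OpenMP combines the partial counts. */
        #pragma omp parallel for reduction(+:count)
        for (int i = 0; i < length; i++)
            if (array[i] == 3)
                count++;

        printf("count = %d (max threads = %d)\n", count, omp_get_max_threads());
        free(array);
        return 0;
    }

The reduction clause is what keeps the shared counter correct; the same program appears again later as the "Counting 3s" example.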
Review of Moore's law
- 1965 paper: doubling of the number of transistors on integrated circuits every two years
- "Moore's law" has since become the name given to everything that changes exponentially
- http://news.cnet.com/images-moores-law-turns-40/2009-1041_3-5649019.html

Review of Moore's law (chart)
[Figure: peak performance, 1 KF to 1 PF, 1950-2010 - scalar machines (EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, CDC 7600, IBM 360/195), vector (Cray 1, Cray X-MP, Cray 2), parallel (TMC CM-2, TMC CM-5, Cray T3D), super scalar / special purpose / parallel (ASCI Red, ASCI White, ASCI Pacific, IBM Roadrunner); Moore's law: 2x transistors per chip every 1.5 years]
Accelerating change in supercomputing techniques
- The world's #1 supercomputer (January 2009): IBM Cell-based Roadrunner
  - More than 1 Pflop/s of performance
  - Developed for a special purpose
- The rise of high-speed accelerator boards
  - General-purpose computing on GPUs
  - NVIDIA Tesla S870 supercomputer

ORNL's newest system: Jaguar XT5

                                   Jaguar Total   XT5       XT4
    Peak Performance (Tflop/s)     1,645          1,382     263
    AMD Opteron Cores              181,504        150,176   31,328
    System Memory (TB)             362            300       62
    Disk Bandwidth (GB/s)          284            240       44
    Disk Space (TB)                10,750         10,000    750
    Interconnect Bandwidth (TB/s)  532            374       157

- The systems will be combined after acceptance of the new XT5 upgrade
- Each system will be linked to the file system through 4x-DDR InfiniBand
Top500 supercomputer analysis (November 2008)
[Figure: Rmax (Tflop/s) vs. rank 1-500 - top system at 1.1 Pflop/s; 2 systems > 1 Pflop/s; 19 systems > 100 Tflop/s; 51 systems > 50 Tflop/s; 119 systems > 25 Tflop/s]

Trends in computer hardware
- New constraints
  - Power consumption limits further clock-speed growth
  - Little further performance gain can be expected from instruction-level parallelism
- New trends
  - Transistor count still rising
  - Clock speed flattening sharply
  - Multiple instruction streams are needed: multicore programming
- Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith
CPU performance vs. power cost
- Power ∝ Voltage² × Frequency (V²F)
- Frequency ∝ Voltage
- Therefore Power ∝ Frequency³ (a short derivation follows below)

Moore's law revisited
- The number of cores per chip doubles every two years, but clock speed is shrinking
- Clock speed is no longer increasing
- Programming techniques that manage millions of threads concurrently are needed
- Future generations will run millions of threads
- Intra-chip parallelism must easily take over the role of inter-chip parallelism
- The number of processors doubles every two years
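The cubic relation on the slide follows from the standard dynamic-power model, under the common assumption that supply voltage scales roughly linearly with clock frequency:

    P \;\propto\; V^{2} f, \qquad V \;\propto\; f
    \;\;\Longrightarrow\;\; P \;\propto\; f^{3}

So halving the clock frequency cuts dynamic power by roughly a factor of eight, which is why several slower cores can beat one fast core in performance per watt.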
From multicore to manycore
- Multicore: an architecture design with 2-4 processors (cores) on a single computer chip
  - Chip multiprocessors (CMP)
- Manycore: 10-100+ cores integrated per chip
  - Heterogeneity is the dominant trend
  - Examples:
    - Intel 80-core TeraScale chip
    - Intel 32-core Larrabee chip
    - IBM Cyclops-64 chip with 160 thread units
    - ClearSpeed 96-core CSX chip
    - NVIDIA Tesla 240-core C870
    - AMD/ATI FireStream 9170 with 320 cores

Intel's Larrabee chip
- Many x86 IA cores
- Scalable to Tflop/s
- New cache architecture
- New vector instruction set
  - Vector memory operations
  - Conditionals
  - Integer and floating-point arithmetic
- New vector processing unit / wide SIMD
GPGPU acceleration: NVIDIA Tesla T10P
- T10P chip
  - 240 cores; 1.5 GHz
  - Tpeak 1 Tflop/s - 32-bit floating point
  - Tpeak 100 Gflop/s - 64-bit floating point
- S1070 board
  - 4 T10P devices; 700 watts
- GTX 280
  - 1 T10P; 1.3 GHz
  - Tpeak 864 Gflop/s - 32-bit floating point
  - Tpeak 86.4 Gflop/s - 64-bit floating point

Examples of Parallel Computers
Modern parallel computers
- Chip multiprocessors: Intel Core Duo, AMD dual core
- Symmetric multiprocessors: Sun Fire E25K
- Heterogeneous chips: Cell processor, NVIDIA 8800 GT
- Clusters
- Supercomputers: BlueGene/L

Intel Core Duo
- Two 32-bit Pentium processors
- Each has its own 32K L1 data and instruction cache
- Shared 2MB or 4MB L2 cache; fast communication through the shared L2
- Coherent shared memory
- Use: typical personal PCs
AMD dual-core Opteron
- Two AMD64 processors
- Each with 64K L1 data and instruction cache
- Each with 1MB L2 cache
- Coherent shared memory

Intel vs. AMD
- Main difference: position of the L2 cache
- AMD
  - More core-private memory
  - Easier to share cache-coherency information with other CPUs
  - More general SMP architecture
  - Preferred in multi-chip systems
- Intel
  - A core can use more of the shared L2 at times
  - Lower-latency communication between cores
  - Preferred in single-chip systems
Generic SMP
- Use: both multi-core and multi-CPU PCs
- Single logical memory image
- Uniform or non-uniform memory architecture
- The shared bus is often the bottleneck
- Small number of processors: 8, 16, 32

Sun Fire E25K
- Use: Sun's high-end servers
- 18 E25K boards, 4 cores each, two hardware threads per core
- Each core can access up to 16GB of memory
- Uniform memory access latency
- Directory-based cache coherence protocol
Cluster
- Low-cost, large-scale parallelism
- Parallel computers made of commodity parts
- A number of boards, each with:
  - A small number of CPUs
  - RAM
  - Disk
- Connected by a commodity interconnect, e.g. fast Ethernet
- Good price/performance
- Memory is not shared between boards: message passing

Cell processor
- SONY, IBM, and Toshiba
- 64-bit PowerPC with two hardware threads
- 8 Synergistic Processing Elements (SPEs)
- 256KB of on-chip memory per SPE
- Dual 12.8GB/s memory bandwidth
- No coherent memory for the SPEs; it must be managed by the programmer
- Use: PlayStation 3, Roadrunner supercomputer (with Opterons)
BlueGene/L
- IBM supercomputer, CPU clock 700MHz
- 65,536 dual-core nodes
- Each node: PowerPC 440 processor
  - 4 floating-point ops per node per clock
  - 32K L1 instruction and data cache per node
  - L2 prefetch buffer
  - 4MB of shared L3 cache
  - 512MB of shared off-chip RAM
- Use: supercomputing

Interconnection networks
- Each node has a PE and memory
- Routing follows the shortest path through the available links
- (a) 2D torus, (b) 3-cube
Interconnection networks
- Each node has a PE and memory
- Fat tree: multiple roots to avoid congestion

Interconnection networks
- Omega network
  - Processors on the left, memory modules on the right
  - Can block
Heterogeneous chips
- Special-purpose coprocessors instead of additional identical cores
  - GPU
  - FPGA
  - Cell
- Hybrid programming model
  - The main CPU performs the hard-to-parallelize portion
  - The attached processor (GPU) performs the compute-intensive parts

NVIDIA CUDA
- NVIDIA Compute Unified Device Architecture (CUDA)
- 8 SIMD cores per SM (streaming multiprocessor)
- 16KB of on-chip memory per SM, organized into 16 banks
- 8,192 registers per SM
- 32 threads processed in 4 clock cycles
- Different SMs cannot synchronize or communicate
- Atomic operations on global memory
Parallelization using multiple instruction streams

Why parallel processing systems?
- Supercomputers, i.e. high-performance computing systems
  - Can process very large computational problems quickly
  - Why? Computations requiring large amounts of memory become possible
- But it is not such a simple problem
  - There is no compiler that parallelizes code automatically
- [Figure: single-processor block diagram - fetch/decode unit, ALU, registers, L1 cache, L2 cache, main memory, I/O devices]
Parallel computing
- Shared memory
  - Intel quad-core Nehalem micro-architecture
  - AMD Barcelona (quad-core), AMD Shanghai (quad-core)
  - Sun Niagara
- Distributed memory
  - IBM BlueGene/L
  - Cell (see http://users.ece.utexas.edu/~adnan/vlsi-07/hofstee-cell.ppt)
- Mini-cores
  - Intel Teraflops
  - GPUs

Parallel programming models
- How do we transfer data between processors?
  - Can be a key scalability bottleneck
  - 1-sided models are better at latency hiding
- How do we program the parallel machine?
  - Single execution stream
    - Data parallel, e.g. HPF
  - Multiple execution streams
    - Partitioned local data access: MPI
    - Uniform global shared data access: OpenMP
    - Partitioned global shared data access: Co-Array Fortran
    - Uniform global shared + partitioned data access
The concept of a thread
- The thread is the basic unit of parallelization
- A thread contains everything needed to execute a stream of instructions:
  - Private program text
  - A call stack
  - A program counter
- Memory is shared among multiple threads (a minimal Pthreads sketch follows below)

Memory reference mechanisms: shared memory
- Single address space
- Easy to use
- Easy to create inefficient programs through non-local references
- Easy to create race conditions
- Typically requires hardware support to keep the memory coherent
- [Figure: P0 and P1 both load/store the shared variables A and B (A = B) - the 0-sided model]
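The following POSIX-threads sketch is not from the slides; it illustrates the model just described: every thread sees the same shared array, while the loop index and bounds live on each thread's private stack. The names worker, NTHREADS, and the array size are assumptions made for illustration.

    /* Shared-memory threading sketch (illustrative).
     * Build with, e.g., gcc count3s_pthread.c -lpthread                 */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define LENGTH   1000000

    static int array[LENGTH];          /* shared by every thread          */
    static int partial[NTHREADS];      /* one private counter per thread  */

    static void *worker(void *arg)
    {
        int id    = (int)(long)arg;            /* thread id: 0..NTHREADS-1 */
        int start = id * (LENGTH / NTHREADS);  /* this thread's block      */
        int end   = start + LENGTH / NTHREADS;
        for (int i = start; i < end; i++)      /* i lives on this thread's stack */
            if (array[i] == 3)
                partial[id]++;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < LENGTH; i++) array[i] = i % 10;

        for (int t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)(long)t);

        int count = 0;
        for (int t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            count += partial[t];               /* combine private tallies */
        }
        printf("count = %d\n", count);
        return 0;
    }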
Memory reference mechanisms: one-sided communication
- Single address space
- Direct load/store on local memory
- get/put on remote memory, not coherent
- Must use synchronization to ensure coherency
- Simpler hardware, more burden on software
- [Figure: P0 puts A into B on P1 - remote memory access (RMA), the 1-sided model]

Memory reference mechanisms: message passing
- No shared address space
- The most primitive model; least hardware support
- Each processor can directly access only its local memory
- send/recv used to exchange data between processors
- Claimed to be easier to debug: communication is more explicit and easier to find in programs
- [Figure: P0 sends A, P1 receives it into B - the 2-sided model] (a minimal MPI sketch follows below)
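A minimal MPI message-passing sketch of the 2-sided model above, not taken from the slides: rank 0 sends the value of A, rank 1 receives it into B, and the communication is fully explicit in the code.

    /* Build/run with, e.g., mpicc sendrecv.c && mpirun -np 2 ./a.out   */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            double A = 3.14;                        /* data lives only on rank 0 */
            MPI_Send(&A, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            double B;
            MPI_Recv(&B, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received B = %g\n", B);  /* explicit communication */
        }

        MPI_Finalize();
        return 0;
    }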
Models of computation
- The representative, successful sequential model: the Random Access Machine (RAM) model
  - Known as the von Neumann model
  - Here, RAM is not random access memory
- Not a good parallel model: PRAM
  - Extends the RAM to parallel computers
  - The PRAM doesn't work well because communication cost is ignored

Models of parallelism
- A good model: CTA
  - Stands for Candidate Type Architecture
  - Makes useful predictions about real performance
Programming on multicore architectures
- Good parallel performance on multicore/manycore requires:
  - Minimizing communication and synchronization between processors/cores
  - Designing efficient parallel algorithms
  - Exploiting concurrency at the algorithm level
- Memory bandwidth is very limited (bytes/flop)
  - I/O is the bottleneck
  - Flops come almost for free, but bandwidth is very expensive

Example program: Counting 3s
Example 1: Iterative sum - sequential code vs. pair-wise sum

    sum = 0;
    for (i = 0; i < n; i++)
        sum += x[i];

Sequential summing order: the order of combining a sequence of numbers (7, 3, 15, 10, 13, 18, 6, 4) when adding them to an accumulation variable.

Sequential code: counting 3s

    #include <stdlib.h>
    #include <stdio.h>
    #include <time.h>

    void dtime(double *t);            /* wall-clock timer; a sketch follows below */

    int *array;
    const int length = 100000;
    const int iters  = 10000;         /* artificially multiply the work */

    int count3s()
    {
        int i, j;
        int count = 0;
        for (j = 0; j < iters; j++) {
            for (i = 0; i < length; i++) {
                if (array[i] == 3) {
                    count++;
                }
            }
        }
        return count;
    }

    static void make_array()
    {
        int i;
        array = (int *)malloc(length * sizeof(int));
        for (i = 0; i < length; i++)
            array[i] = i % 10;
    }

    int main(int argc, char **argv)
    {
        double stime, etime;
        make_array();
        dtime(&stime);
        printf("%d\n", count3s());
        dtime(&etime);
        printf("Elapsed Time = %12.8lf (sec)\n", etime - stime);
        return 0;
    }
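The slides call a helper dtime() but never show its body. One possible implementation, assuming it simply stores the current wall-clock time in seconds through its pointer argument, would be:

    /* Assumed dtime() implementation (the original helper is not shown):
     * write the current wall-clock time, in seconds, into *t.           */
    #include <sys/time.h>

    void dtime(double *t)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        *t = (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
    }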
Typical compiler processing steps
[Figure: compilation pipeline]

Parallel processing system
- Organization of a multi-core computer system:
  - Each processor has a private L1 cache
  - It shares an L2 cache with its chip-mate
  - And shares an L3 cache with the other processors
Parallel sum: pair-wise sum
- Summing in pairs: the order of combining a sequence of numbers (7, 3, 15, 10, 13, 18, 6, 4) by (recursively) combining pairs of values, then pairs of results, and so on (see the sketch below)

Thread programming model
- Schematic diagram of data allocation to threads: each thread is allocated a block of consecutive indices
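A small sketch, not from the slides, of the pair-wise combining order just described: each pass adds element i+stride into element i, halving the number of partial sums, which is the same tree-shaped order a parallel reduction would use.

    /* Pair-wise (tree) summation sketch - illustrative only.            */
    #include <stdio.h>

    double pairwise_sum(double *x, int n)     /* works for any n >= 1 */
    {
        for (int stride = 1; stride < n; stride *= 2)
            for (int i = 0; i + stride < n; i += 2 * stride)
                x[i] += x[i + stride];        /* combine one pair     */
        return x[0];
    }

    int main(void)
    {
        double x[8] = { 7, 3, 15, 10, 13, 18, 6, 4 };   /* sequence from the slide */
        printf("sum = %g\n", pairwise_sum(x, 8));        /* prints 76 */
        return 0;
    }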
Multithreaded system
- Summing an array with 4 threads:

    sum = 0;
    for (i = start; i < end; i++)
        sum += x[i];

- Parallelization approach
  - The index i is local, on each thread's stack
  - sum and x are in shared memory
  - Each thread gets different values of start and end
- [Figure: the array 9 6 4 7 3 2 1 0 8 3 2 1 1 5 7 ... split into consecutive blocks for Thread 0 through Thread 3]

Race condition
- One of several possible interleavings of references to the unprotected variable count, illustrating a race condition: locking is needed (see the mutex sketch below)
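A minimal Pthreads sketch, not the slides' code, of protecting the shared counter with a mutex (the names worker and count_lock are assumptions). Without the lock, two threads can read the same old value of count and one increment is lost; the mutex makes the read-modify-write atomic.

    #include <pthread.h>
    #include <stdio.h>

    static int count = 0;                               /* shared counter */
    static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&count_lock);            /* enter critical section */
            count++;                                    /* safe read-modify-write */
            pthread_mutex_unlock(&count_lock);          /* leave critical section */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("count = %d (expected 200000)\n", count);
        return 0;
    }

Note that taking the lock on every increment serializes the threads (lock contention); the padded per-thread counters sketched after the next slide avoid this.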
Shared-memory terminology
- Thread: a unit containing everything needed to execute an instruction stream
- Race condition: the result of an execution depends on the relative timing of two executions
- Mutex: an object providing mutual exclusion; it has two states
- Lock contention: with multiple threads, the extra time a thread spends waiting to enter a critical section
- Granularity: how frequently the processes interact with each other
- False sharing: logically private data that physically shares a cache line (a padded-counter sketch follows below)

Tachyon: quad-core AMD processor (Barcelona)
- [Figure: Core 1-4, each with 64KB L1 and 512KB L2 cache; shared 2MB L3 cache; memory; I/O]
- Direct Connect Architecture, AMD CoolCore Technology
- Integrated DDR2 DRAM controller
- AMD Balanced Smart Cache, AMD Wide Floating Point Accelerator
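An illustrative data-layout fragment, not from the slides, showing how per-thread counters can be padded to avoid false sharing (the 64-byte line size and the struct name are assumptions). If the four int counters sat next to each other in one array, they would share a cache line and every increment by one thread would invalidate that line in the other cores' caches.

    #define CACHE_LINE 64                        /* assumed cache-line size */

    struct padded_counter {
        int  count;                              /* this thread's private tally */
        char pad[CACHE_LINE - sizeof(int)];      /* push the next counter onto
                                                    its own cache line          */
    };

    static struct padded_counter counters[4];    /* one per thread, no false sharing */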
Results on the KISTI supercomputer Tachyon
- Initializing array with 16 threads
- Array size: 2,000,000,000 elements, 16.000 gigabytes (with branch, with no 3s; count = 0)
- Elapsed time (sec) by number of threads:

    Version               Nthread=1    Nthread=2   Nthread=4   Nthread=8   Nthread=16
    Serial test           11.04099178
    False sharing         10.99804592  5.94327688  3.27491999  1.92523599  1.63911498
    Padded counters       12.23770237  6.38692808  3.26296592  1.73879397  1.49470901
    OpenMP parallel for   11.10780621  5.97191906  3.26038599  1.86092806  1.65373802
    OpenMP reduction      11.95866394  6.20743322  3.12957096  1.82370603  1.40708995
    bcopy                  6.74117422  6.76228523  6.73990822  6.74149418  6.74236917
    Local counters        11.10706520  5.96803093  3.26944804  1.78986001  1.59914899

- Total elapsed time (sec): 191.071401

AMD Barcelona chip: Tachyon
- 16GB array: counting 3s example
- Compile option: icc -mcmodel=medium is required
Multi-core and Parallel Performance Tuning Techniques

Parallel performance metrics
- The simplest parallel performance metric is always wall-clock time
  - It represents the "time to solution"
  - CPU time is even less useful here than in the single-processor case
  - Hardware performance counters can still be used to assess overall performance as well
- The parallelism in the application introduces the concept of scalability and two new metrics to consider:
  - Speedup
  - Parallel efficiency
Parallel performance metrics: speedup
- Speedup is the ratio of the time on a baseline of N0 processors to that on N processors, given the same input:
  S(N) = N0 * t(N0) / t(N)
- This is a measure of how much faster an application becomes as more processors are used
- Ideally, speedup would be exactly equal to the processor count: S_ideal(N) = N
- Superlinear speedup: the speedup on N processors is greater than N

Elapsed times and speedup
[Figure: elapsed time and speedup vs. number of processors]
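Parallel efficiency, mentioned on the previous slide, is the standard companion metric; its formula is not spelled out on the slides, so for reference:

    S(N) = \frac{N_0\, t(N_0)}{t(N)}, \qquad
    E(N) = \frac{S(N)}{N}, \qquad
    S_{\mathrm{ideal}}(N) = N, \;\; E_{\mathrm{ideal}}(N) = 1

Using the Tachyon numbers above as a rough example, the OpenMP reduction run gives S(16) ≈ 11.04 / 1.41 ≈ 7.8 and hence E(16) ≈ 0.49.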
Amdahl's law
- Amdahl's law (1967): an observation on the practical limits of speedup for a given problem
- If p is the fraction of the application that can be parallelized, the maximum speedup that can be achieved by that application is:
  S_max = 1 / (1 - p)
- Note that this does not consider other limiting factors such as memory, interconnect, or I/O bandwidth

Measuring the performance of parallel applications
- For parallel applications with strong scaling characteristics, timing is absolutely critical to assessing parallel performance
  - Timings are needed to compute speedups
  - Speedups are needed to compute parallel efficiency
- Profiling can be used to identify performance bottlenecks in parallel applications
- In a parallel run, the bottlenecks are particular parts of the code
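The limit quoted above is the asymptotic form of the standard finite-N statement of Amdahl's law, with p the parallelizable fraction and N the processor count:

    S(N) = \frac{1}{(1-p) + \dfrac{p}{N}}, \qquad
    \lim_{N\to\infty} S(N) = \frac{1}{1-p}

For example, with p = 0.95 the speedup can never exceed 20, no matter how many cores are used.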
Interconnect characteristics
- Performance characteristic: bandwidth
- [Figure: MPI1 and MPI2 bandwidth (MB/s) vs. message size (bytes, 10^4-10^6) for p595, p690, and x6420 systems]

Timing of parallel programs
- Wall-clock timing: /usr/bin/time
- It is usually a good idea to instrument your application with timing calls around important sections of the code
- MPI (C/C++/Fortran): MPI_Wtime (see the sketch below)
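A minimal, illustrative sketch of instrumenting a code section with MPI_Wtime (the timed loop is just a placeholder for real work):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        double t0 = MPI_Wtime();                 /* wall-clock time in seconds */
        double sum = 0.0;
        for (long i = 0; i < 100000000L; i++)    /* stand-in for real work     */
            sum += (double)i;
        double t1 = MPI_Wtime();

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d: section took %.6f s (sum=%g)\n", rank, t1 - t0, sum);

        MPI_Finalize();
        return 0;
    }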
Profiling of parallel programs
- Profiling a parallel program is often difficult
- The profiling tools included with compilers are typically not designed for use with concurrent programs
- Two parallel profiling tools:
  - TAU (Tuning and Analysis Utilities): a set of tools for analyzing the performance of C, C++ and Fortran programs - http://acts.nersc.gov/tau/
  - TotalView