
The Power and Potential of Parallelism
이홍석, 2010. 3 (lecture date: 2010-03-11)

The Era of Parallel Processing Systems Has Arrived
- Small-scale multi-core environments
  - Laptops, PCs, workstations, and servers are all multi-core
  - Operating systems support multi-core
  - POSIX threads and Win32 threads
  - The rise of GPGPU-based personal supercomputers
  - Rapid progress of CUDA/OpenCL/GPGPU
- Large-scale multi-core environments
  - Supercomputers: systems built from 10,000 to 200,000+ cores
  - Clusters: systems with good price/performance
  - Servers
  - Grid computing

Parallel Computing vs. Distributed Computing
- Parallel computing
  - Uses multiple processors to solve a single problem
  - Requires a fast interconnect between processes
  - Supercomputing
- Distributed computing
  - Provides convenient access to many processes
  - Suited to applications with low communication overhead
  - Computation time per task is short, but the time spent distributing work is long
  - Grid computing

Let's Learn Parallel Programming Techniques
Writing a good parallel program means:
- Correct answers
- Good parallel performance
- Good parallel scalability to many processes
- Portability across heterogeneous parallel systems
Parallel programs are closely tied to the hardware specification:
- MPI
- OpenMP
- Hybrid (MPI+OpenMP)
- CUDA
- Pthreads, Java, ...

Review of Moore's Law
- 1965 paper: doubling of the number of transistors on integrated circuits every two years
- "Moore's law" has since become the name given to everything that changes exponentially
http://news.cnet.com/images-moores-law-turns-40/2009-1041_3-5649019.html

Review of Moore's Law (cont.)
[Figure: peak performance from 1 KF to 1 PF, 1950-2010 — scalar machines (EDSAC 1, UNIVAC 1, IBM 7090, IBM 360/195, CDC 6600, CDC 7600), vector (Cray 1, Cray X-MP, Cray 2), parallel (TMC CM-2, TMC CM-5, Cray T3D), and super scalar/special purpose/parallel (ASCI Red, ASCI White Pacific, IBM RoadRunner); Moore's Law trend line: 2X transistors/chip every 1.5 years]

Accelerating Change in Supercomputing Techniques
- World's #1 supercomputer by performance (2009.1): IBM Cell-based Roadrunner
  - Over 1 Pflops performance
  - Developed for a special purpose
- The rise of high-speed accelerator boards
  - General-purpose computing on GPU (GPGPU)
  - NVIDIA Tesla S870 supercomputer

ORNL's Newest System: Jaguar XT5

  Jaguar                          Total      XT5      XT4
  Peak Performance (Tflops)       1,645    1,382      263
  AMD Opteron Cores             181,504  150,176   31,328
  System Memory (TB)                362      300       62
  Disk Bandwidth (GB/s)             284      240       44
  Disk Space (TB)                10,750   10,000      750
  Interconnect Bandwidth (TB/s)     532      374      157

The systems will be combined after acceptance of the new XT5 upgrade. Each system will be linked to the file system through 4x-DDR InfiniBand.

Top500 Supercomputer Analysis (2008.11)
[Figure: Rmax (Tflop/s) vs. rank for the Top500 list — 2 systems > 1 Pflop/s (top system 1.1 Pflop/s), 19 systems > 100 Tflop/s, 51 systems > 50 Tflop/s, 119 systems > 25 Tflop/s]

Trends in Computer Hardware
New constraints:
- Power consumption limits clock-speed growth
- Little further performance gain can be expected from instruction-level parallelism
New trends:
- Transistor count still rising
- Clock speed flattening sharply
- Multiple instruction streams are needed: multicore programming
(Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith)

CPU Performance vs. Power Cost
- Power ∝ Voltage² × Frequency (V²F)
- Frequency ∝ Voltage
- Therefore Power ∝ Frequency³

Moore's Law Revisited
- The number of cores per chip doubles every two years, while clock speed is decreasing
- Clock speed is no longer increasing
- Programming techniques that manage millions of concurrent threads are needed
- Future generations will have millions of threads
- Intra-chip parallelism must readily replace inter-chip parallelism
- Processor count doubles every two years

From Multicore to Manycore
- Multicore: an architecture design with 2-4 processors (cores) on one chip
  - Chip multiprocessors (CMP)
- Manycore: 10-100+ cores integrated per chip
- Heterogeneity is the trend. Examples:
  - Intel 80-core TeraScale chip
  - Intel 32-core Larrabee chip
  - IBM Cyclops-64 chip with 160 thread units
  - ClearSpeed 96-core CSX chip
  - NVIDIA Tesla 240-core C870
  - AMD/ATI FireStream 9170, 320 cores

Intel's Larrabee Chip
- Many x86 IA cores
- Scalable to Tflop/s
- New cache architecture
- New vector instruction set
  - Vector memory operations
  - Conditionals
  - Integer and floating-point arithmetic
- New vector processing unit / wide SIMD

GPGPU Acceleration: NVIDIA Tesla T10P
- T10P chip
  - 240 cores; 1.5 GHz
  - Tpeak 1 Tflop/s (32-bit floating point)
  - Tpeak 100 Gflop/s (64-bit floating point)
- S1070 board
  - 4 T10P devices; 700 Watts
- GTX 280
  - 1 T10P; 1.3 GHz
  - Tpeak 864 Gflop/s (32-bit floating point)
  - Tpeak 86.4 Gflop/s (64-bit floating point)

Examples of Parallel Computers

Modern Parallel Computers
- Chip multiprocessors: Intel Core Duo, AMD Dual Core
- Symmetric multiprocessor: Sun Fire E25K
- Heterogeneous chips: Cell processor, NVIDIA 8800 GT
- Clusters
- Supercomputers: BlueGene/L

Intel Core Duo
- Two 32-bit Pentium processors
- Each has its own 32K L1 D and I caches
- Shared 2MB or 4MB L2 cache
- Fast communication through the shared L2
- Coherent shared memory
- Use: ordinary personal PCs

AMD Dual Core Opteron
- Two AMD64 processors
- Each with 64K L1 D and I caches
- Each with 1MB L2 cache
- Coherent shared memory

Intel vs. AMD
Main difference: the position of the L2 cache.
- AMD:
  - More core-private memory
  - Easier to share cache-coherency information with other CPUs
  - More general SMP architecture
  - Preferred in multi-chip systems
- Intel:
  - A core can use more of the shared L2 at times
  - Lower-latency communication between cores
  - Preferred in single-chip systems

Generic SMP
- Use: both multi-core and multi-CPU PCs
- Single logical memory image
- Uniform or non-uniform memory architecture
- Shared bus is often the bottleneck
- Small number of processors: 8, 16, 32

Sun Fire E25K
- Use: Sun's high-end servers
- 18 E25K boards, 4 cores each, two hardware threads each
- A core can access up to 16GB of memory
- Uniform memory-access latency
- Directory-based cache-coherence protocol

Cluster
- Low-core, large-scale parallelism
- Parallel computers made of commodity parts
- A number of boards, each with a small number of CPUs, RAM, and disk
- Connected by a commodity interconnect, e.g. fast Ethernet
- Good price/performance
- Memory is not shared between boards: message passing

Cell Processor
- SONY, IBM, and Toshiba
- 64-bit PowerPC with two hardware threads
- 8 Synergistic Processing Elements (SPEs)
- 256KB of on-chip memory per SPE
- Dual 12.8GB/s memory bandwidth
- No coherent memory for the SPEs; it must be managed by the programmer
- Use: PlayStation 3, Roadrunner supercomputer (with Opterons)

BlueGene/L
- IBM supercomputer; CPU clock 770MHz
- 65,536 dual-core nodes
- Each node: PowerPC 440 processor
- 4 floating-point ops per node per clock
- 32K L1 I and D caches per node
- L2 prefetch buffer
- 4MB of shared L3 cache
- 512MB of shared off-chip RAM
- Use: supercomputing

Interconnection Networks
- Each node has a PE and memory
- Routing follows the shortest path through the available links
[Figure: (a) 2D torus, (b) 3-cube]

Interconnection Networks (cont.)
- Fat tree: multiple roots to avoid congestion

Interconnection Networks (cont.)
- Omega network: processors on the left, memory modules on the right
- Can block

Heterogeneous Chips
- A special coprocessor instead of more identical cores: GPU, FPGA, Cell
- Hybrid programming model
  - Main CPU performs the hard-to-parallelize portion
  - Attached processor (GPU) performs the compute-intensive parts

NVIDIA CUDA
- NVIDIA Compute Unified Device Architecture (CUDA)
- 8 SIMD cores per SM
- 16KB on-chip memory per SM, organized into 16 banks
- 8192 registers per SM
- 32 threads processed in 4 clock cycles
- Different SMs cannot synchronize or communicate
- Atomic operations on global memory

Parallelization Using Multiple Instruction Streams
[Figure]

Why Parallel Processing Systems?
- Supercomputers and high-performance computing systems
  - Can solve very large computational problems quickly. Why? Computations requiring large memory become possible.
- But it is not such a simple matter: there is no compiler that parallelizes code automatically
[Figure: single-processor organization — fetch/decode unit, ALU, registers, L1 cache, L2 cache, main memory, I/O devices]

Parallel Computing
- Shared memory
  - Intel quad-core Nehalem microarchitecture
  - AMD Barcelona (quad-core), AMD Shanghai (quad-core)
  - Sun Niagara
- Distributed memory
  - IBM BlueGene/L
  - Cell (see http://users.ece.utexas.edu/~adnan/vlsi-07/hofstee-cell.ppt)
- Mini-cores
  - Intel Teraflops
  - GPUs

Parallel Programming Models
- How do we transfer data between processors?
  - Can be a key scalability bottleneck
  - 1-sided models are better at latency hiding
- How do we program the parallel machine?
  - Single execution stream: data parallel, e.g. HPF
  - Multiple execution streams:
    - Partitioned local data access: MPI
    - Uniform global shared data access: OpenMP
    - Partitioned global shared data access: Co-Array Fortran
    - Uniform global shared + partitioned data access

The Concept of a Thread
- The thread is the basic unit of parallelization
- A thread contains everything needed to execute a stream of instructions:
  - Private program text
  - A call stack
  - A program counter
- Memory is shared among multiple threads

Memory Reference Mechanisms: Shared Memory
- Single address space; plain loads/stores (0-sided model)
- Easy to use
- Easy to create inefficient programs through non-local references
- Easy to create race conditions
- Typically requires hardware support to make it coherent
[Figure: P0 and P1 executing A = B against shared memory]

Memory Reference Mechanisms: One-Sided Communication
- Single address space
- Direct load/store on local memory
- get/put on remote memory; not coherent
- Must use synchronization to ensure coherency
- Simpler hardware; more burden on software
[Figure: P0 puts A into P1's B — remote memory access (RMA), 1-sided model]

Memory Reference Mechanisms: Message Passing
- No shared address space
- Most primitive; least hardware support
- Can directly access only local memory
- send/recv used to exchange data between processors
- Claimed to be easier to debug: communication is more explicit and easier to find in programs
[Figure: P0 sends A, P1 receives into B — message passing, 2-sided model]

Models of Computation
- The classic, successful model for sequential coding: the Random Access Machine (RAM) model
  - Known as the von Neumann model
  - Here, RAM does not mean "random access memory"
- Not a good model: PRAM
  - Extends the RAM to parallel computers
  - PRAM doesn't work well because communication cost is ignored

Models of Parallelism
- A good model: CTA
  - Stands for Candidate Type Architecture
  - Makes useful predictions about real performance

Programming on Multicore Architectures
Good parallel performance on multicore/manycore requires:
- Minimizing communication and synchronization between processors/cores
- Designing efficient parallel algorithms
- Exploiting concurrency at the algorithm level
- Remembering that memory bandwidth (bytes/flop) is very limited and I/O is a bottleneck: flops come almost for free, but bandwidth is very expensive

Example Program: Counting 3s

Example 1: Iterative Sum
Sequential code vs. pair-wise summation:

1: sum = 0;
2: for (i = 0; i < n; i++)
3:     sum += x[i];

Sequential summation order: the order of combining a sequence of numbers (7, 3, 15, 10, 13, 18, 6, 4) when adding them one at a time to an accumulation variable.

Sequential code: counting 3s

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int *array;
const int length = 100000;
const int iters = 10000;        /* artificially multiply the work */

/* Count how many elements of 'array' equal 3, 'iters' times over. */
int count3s(void)
{
    int i, j;
    int count = 0;
    for (j = 0; j < iters; j++) {
        for (i = 0; i < length; i++) {
            if (array[i] == 3) {
                count++;
            }
        }
    }
    return count;
}

static void make_array(void)
{
    int i;
    array = (int *)malloc(length * sizeof(int));
    for (i = 0; i < length; i++)
        array[i] = i % 10;
}

int main(int argc, char **argv)
{
    double stime, etime;
    make_array();
    stime = (double)clock() / CLOCKS_PER_SEC;
    printf("%d\n", count3s());
    etime = (double)clock() / CLOCKS_PER_SEC;
    printf("Elapsed Time = %12.8lf (sec)\n", etime - stime);
    return 0;
}

Typical Compiler Processing
[Figure: compilation pipeline]

Parallel Processing Systems
Organization of a multi-core computer system:
- Each processor has a private L1 cache
- It shares an L2 cache with its chip-mate
- And shares an L3 cache with the other processors

Parallel Summation
Summing in pairs: the order of combining a sequence of numbers (7, 3, 15, 10, 13, 18, 6, 4) by (recursively) combining pairs of values, then pairs of results, and so on.

Thread Programming Model
[Figure: schematic diagram of data allocation to threads; each thread's allocation is a run of consecutive indices]

Multithreaded Systems
Array summation using 4 threads:

1: sum = 0;
2: for (i = start; i < end; i++)
3:     sum += x[i];

Parallelization approach:
- The index i is local, on each thread's stack
- sum and x are in shared memory
- Each thread gets different values of start and end
[Figure: a 16-element array partitioned into four consecutive blocks, one per thread (Thread 0 .. Thread 3)]

Race Condition
[Figure: one of several possible interleavings of references to the unprotected variable count, illustrating a race condition — locking is required]

Shared-Memory Terminology
- Thread: a unit containing everything needed to execute an instruction stream
- Race condition: the result of an execution depends on the relative timing of two executions
- Mutex: an object providing mutual exclusion; it has two states
- Lock contention: with multiple threads, the extra time a thread spends waiting to enter a critical section
- Granularity: how frequently processes communicate with one another
- False sharing: logically local data physically share a cache line

Tachyon: Quad-Core AMD Processor (Barcelona)
- Cores 1-4, each with 64KB L1 and 512KB L2; a shared 2MB L3; memory and I/O
- Direct Connect Architecture, AMD CoolCore Technology
- Integrated DDR2 DRAM controller
- AMD Balanced Smart Cache, AMD Wide Floating Point Accelerator

Results on the KISTI Supercomputer Tachyon

Initializing array with 16 threads.
Array size: 2,000,000,000 elements, 16.000 Gigabytes (with branch, with no 3s).
Running serial test: 11.04099178 sec.

Elapsed time (sec) by thread count:

                        Nthread=1    Nthread=2   Nthread=4   Nthread=8   Nthread=16
  False sharing         10.99804592  5.94327688  3.27491999  1.92523599  1.63911498
  Padded counters       12.23770237  6.38692808  3.26296592  1.73879397  1.49470901
  OpenMP parallel for   11.10780621  5.97191906  3.26038599  1.86092806  1.65373802
  OpenMP reduction      11.95866394  6.20743322  3.12957096  1.82370603  1.40708995
  bcopy                  6.74117422  6.76228523  6.73990822  6.74149418  6.74236917
  Local counters        11.10706520  5.96803093  3.26944804  1.78986001  1.59914899

count = 0 in every run; total elapsed time: 191.071401 sec.

AMD Barcelona Chip: Tachyon
- 16GB array: the counting-3s example
- Compile option: icc -mcmodel=medium is required

Multi-core and Parallel Performance Tuning Techniques

Parallel Performance Metrics
- The simplest parallel performance metric is always wallclock time: it represents the "time to solution."
- CPU time is even less useful here than in the single-processor case.
- Hardware performance counters can still be used to assess overall performance as well.
- The parallelism in the application introduces the concept of scalability and two new metrics to consider:
  - Speedup
  - Parallel efficiency

Parallel Performance Metrics: Speedup
- Speedup is the ratio of the time on a baseline of N0 processors to the time on N processors, given the same input:
  S(N) = N0 * t(N0) / t(N)
  (with N0 = 1 this is the usual t(1)/t(N))
- This is a measure of how much faster an application becomes as more processors are used.
- Ideally, speedup would be exactly equal to the processor count: S_ideal(N) = N
- Superlinear speedup: the speedup on N processors is greater than N.

Elapsed Times and Speedup
[Figure]

Amdahl's Law
- Amdahl's law (1967): an observation on the practical limits of speedup for a given problem.
- If p is the fraction of the application that can be parallelized, the maximum speedup that can be achieved is:
  S_max = 1 / (1 - p)
- Note that this does not consider other limiting factors such as memory, interconnect, or I/O bandwidth.

Measuring the Performance of Parallel Applications
- For parallel applications with strong-scaling characteristics, timing is absolutely critical to assessing parallel performance.
  - Timings are needed to compute speedups; speedups are needed to compute parallel efficiency.
- Profiling can be used to identify which parts of the code are the performance bottlenecks in parallel applications.

Interconnect Characteristics
[Figure: MPI1 and MPI2 bandwidth (MB/s) vs. message size (bytes), roughly 10^4 to 10^6 bytes, for p595, p690, and x6420 systems]

Timing of Parallel Programs
- Wallclock timing: /usr/bin/time
- It is usually a good idea to instrument your application with timing calls around important sections of the code.
- MPI (C/C++/Fortran): MPI_Wtime

Profiling of Parallel Programs
- Profiling a parallel program is often difficult: the profiling tools included with compilers are typically not designed for use with concurrent programs.
- Two parallel profiling tools:
  - TAU (Tuning and Analysis Utilities): a set of tools for analyzing the performance of C, C++, and Fortran programs. http://acts.nersc.gov/tau/
  - TotalView