Building a National Supercomputing Shared-Use System


Parallel Programming
Partnership & Leadership for the nationwide SuperComputing Infrastructure (PLSI)
PLSI User Application Technical Support Team
Jinwoo Jung (정진우), Computing Bridge (컴퓨팅브릿지)

Outline
- Introduction to Parallel Programming
- Introduction to OpenMP
- OpenMP Programming
- OpenMP Applications
- Parallelization of Serial Code

Introduction to Parallel Programming

Parallel Processing (1/3)
Parallel processing is the ability to carry out multiple operations or tasks simultaneously. The term is used in the context of human cognition, particularly the brain's ability to process incoming stimuli simultaneously, and in the context of parallel computing by machines.

Parallel Processing (2/3)
[Figure: serial execution versus parallel execution, from inputs to outputs]

Parallel Processing (3/3)
Purpose: high-performance computing
- A decrease in wall-clock time
- An increase in the scale of problems that can be solved
Classes of parallel computers
- Multi-core computing: a multi-core processor contains multiple execution units (cores) on the same chip.
- Symmetric multiprocessing: a symmetric multiprocessor (SMP) is a computer system with multiple identical processors that share memory and connect via a bus.
- Distributed computing: a distributed computer is a distributed-memory system in which the processing elements are connected by a network.

Why parallelization?
- Limits of a high-performance single processor
  - Limit on signal transmission speed (copper: 9 cm/nanosecond)
  - Limit on miniaturization
- High-speed networks, distributed systems, and multiprocessor systems make parallel computing practical
- Cost-effective high-performance computing systems

Program
A computer program (also a software program, or just a program) is a sequence of instructions written to perform a specified task on a computer. The program has an executable form that the computer can use directly to execute the instructions.

Process
A process is an instance of a computer program that runs concurrently with other programs. A computer program is a passive collection of instructions; a process is the actual execution of those instructions. Several processes may be associated with the same program. For example, opening several instances of the same program usually means that more than one process is being executed.

Thread
A thread of execution is the smallest unit of processing that can be scheduled by an operating system. It generally results from a fork of a computer program into two or more concurrently running tasks. The implementation of threads and processes differs from one operating system to another, but in most cases a thread is contained inside a process.

Process and thread
[Figure: threads contained inside a process]

Shared memory system
[Figure: two timelines. Single thread: one process runs S1, P1, P2, P3, P4, S2 in sequence. Multi-thread: one process runs S1, forks into threads that run P1 to P4 concurrently in a shared address space, joins, then runs S2.]

Message-passing interface
[Figure: two timelines. Serial: one process runs S1, P1, P2, P3, P4, S2 in sequence. Message passing: processes 0 to 3 run on nodes 1 to 4; each runs S1, one of P1 to P4, then S2, and data is transmitted over the interconnect.]

Performance test
- Speedup
- Efficiency
- Cost

Speedup
Speed-up $S(n)$:
$$S(n) = \frac{\text{running time of the serial program}}{\text{running time of the parallel program on } n \text{ processes}} = \frac{t_s}{t_p}$$
Running time here means wall-clock time.

Maximum speedup
With $f$ the fraction of the serial running time that cannot be parallelized:
$$S(n) = \frac{t_s}{t_p} = \frac{t_s}{f\,t_s + (1-f)\,t_s/n} = \frac{1}{f + (1-f)/n}$$
Maximum speedup ($n \to \infty$):
$$S(n) = \frac{1}{f}$$

Example of speedup
$f = 0.2$, $n = 4$
- Serial: 20 time units that cannot be parallelized, followed by 80 that can.
- Parallel: the same 20 serial units, then the 80 parallelizable units split into 20 units on each of processes 1 to 4.
$$S(4) = \frac{1}{0.2 + (1-0.2)/4} = 2.5$$
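The speedup formula above is Amdahl's law, and it is easy to evaluate numerically. The following is a minimal sketch; the function names `amdahl_speedup` and `amdahl_limit` are made up for this illustration and do not come from the slides.

```c
/* Speedup predicted by S(n) = 1 / (f + (1 - f) / n), where f is the
   fraction of the serial running time that cannot be parallelized
   (0 < f <= 1) and n is the number of processes (n >= 1). */
double amdahl_speedup(double f, int n)
{
    return 1.0 / (f + (1.0 - f) / (double)n);
}

/* Upper bound on the speedup as n grows without limit: 1 / f. */
double amdahl_limit(double f)
{
    return 1.0 / f;
}
```

For the slide's values, `amdahl_speedup(0.2, 4)` evaluates to 2.5, while `amdahl_limit(0.2)` gives 5: no number of processes can push the speedup past 5 when 20% of the work stays serial.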

Efficiency
Efficiency $E(n)$:
$$E(n) = \frac{t_s}{t_p \times n} = \frac{S(n)}{n} \quad [\times 100\,\%]$$
- A speedup of 2 on 10 processes: $E(10) = 20\%$
- A speedup of 10 on 100 processes: $E(100) = 10\%$

Cost
Cost = running time × number of processes
- Serial program: $\text{Cost} = t_s$
- Parallel program: $\text{Cost} = t_p \times n = \dfrac{t_s\,n}{S(n)} = \dfrac{t_s}{E(n)}$

Examples:

    t_s    t_p    n      S(n)   E(n)   Cost
    100    50     10     2      0.2    500
    100    10     100    10     0.1    1000
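The three metrics can be computed directly from measured wall-clock times. This is a small sketch under the definitions above; the function names are illustrative only.

```c
/* t_s: serial wall-clock time, t_p: parallel wall-clock time,
   n: number of processes. */
double speedup(double t_s, double t_p)
{
    return t_s / t_p;
}

/* Efficiency as a fraction between 0 and 1 (multiply by 100 for %). */
double efficiency(double t_s, double t_p, int n)
{
    return t_s / (t_p * (double)n);
}

/* Cost of the parallel run: running time times number of processes. */
double parallel_cost(double t_p, int n)
{
    return t_p * (double)n;
}
```

With the first row of the example table, `efficiency(100, 50, 10)` is 0.2 and `parallel_cost(50, 10)` is 500, five times the serial cost of 100.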

Introduction to OpenMP

History of OpenMP
- Version 3.0 Complete Specifications (May 2008)
- Version 3.0 Summary Card C/C++ (November 2008)
- Version 3.0 Summary Card Fortran (revised March 2009)
- Version 2.5 (May 2005, combined C/C++ and Fortran)
- C/C++ version 2.0 (March 2002)
- C/C++ version 2.0 with change bars reflecting changes from 1.0 (March 2002)
- FORTRAN version 2.0 (November 2000)
- FORTRAN version 2.0 with change bars reflecting changes from 1.1 (November 2000)
- C/C++ version 1.0 (October 1998)
- FORTRAN version 1.1 (November 1999, incorporates April 1999 Interpretations and Errata)
- FORTRAN version 1.0 (October 1997)

What is OpenMP?
- Multi-threaded parallel programming
- An application programming interface for SMP systems
- Open specifications for Multi Processing

Shared memory system
[Figure: four processors, each with its own cache, connected through a bus or crossbar switch to shared memory and I/O]

Structure of OpenMP

OpenMP programming model
- Thread-based
- Fork-join model
[Figure: the master thread FORKs a team of threads at the start of each parallel region and JOINs them at its end]

Introduction to OpenMP (1/5)
Directives
- Directive-based programming
- Most compilers support OpenMP directives

Serial code:

    PROGRAM exam
    ialpha = 2
    DO i = 1, 100
       a(i) = a(i) + ialpha*b(i)
    ENDDO
    PRINT *, a
    END

Parallel code:

    PROGRAM exam
    ialpha = 2
    !$OMP PARALLEL DO
    DO i = 1, 100
       a(i) = a(i) + ialpha*b(i)
    ENDDO
    !$OMP END PARALLEL DO
    PRINT *, a
    END

Supercomputing Center

Introduction to OpenMP (2/5)
Fork-join, with export OMP_NUM_THREADS=4:
1. The master thread executes ialpha = 2.
2. Fork: the loop is split across four threads.
   - Master: DO i = 1, 25
   - Slave:  DO i = 26, 50
   - Slave:  DO i = 51, 75
   - Slave:  DO i = 76, 100
3. Join.
4. The master thread executes PRINT *, a.

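Introduction to OpenMP (3/5)
OpenMP directive grammar

Start of a directive
- Fortran (fixed form, f77): !$OMP <directive>, C$OMP <directive>, or *$OMP <directive>
- Fortran (free form, f90): !$OMP <directive>
- C: #pragma omp <directive>

Continuation
- Fortran (fixed form, f77): !$OMP <directive> on the first line, !$OMP& on the continuation line
- Fortran (free form, f90): !$OMP <directive> &
- C: #pragma omp ... \

Selective compilation
- Fortran (fixed form, f77): !$, C$, or *$
- Fortran (free form, f90): !$
- C: #ifdef _OPENMP ... #endif

Starting position
- Fortran (fixed form, f77): first column
- Fortran (free form, f90): no rule
- C: no rule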
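The selective-compilation row for C can be illustrated with a short sketch. An OpenMP compiler defines the `_OPENMP` macro, so code guarded by it is compiled only when OpenMP is enabled. The function name `available_threads` is hypothetical, chosen for this example.

```c
#ifdef _OPENMP
#include <omp.h>   /* only available/needed when OpenMP is enabled */
#endif

/* Returns the number of threads the next parallel region may use,
   or 1 when the file is compiled without OpenMP support. */
int available_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

The same source file then builds both as a serial program and as an OpenMP program, which is handy when comparing serial and parallel timings.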

Introduction to OpenMP (4/5)
Useful directives and clauses

    Fortran                 C
    !$OMP PARALLEL          #pragma omp parallel
    !$OMP DO                #pragma omp for
    !$OMP PARALLEL DO       #pragma omp parallel for
    !$OMP CRITICAL          #pragma omp critical
    PRIVATE/SHARED          private/shared
    DEFAULT                 default
    REDUCTION               reduction

Introduction to OpenMP (5/5)
Compiling and running OpenMP programs

Compile
- IBM, Fortran: xlf_r -qsmp=omp -o ompprog ompprog.f
- IBM, C: xlc_r -qsmp=omp -o ompprog ompprog.c
- Intel, Fortran: ifort -openmp -o ompprog ompprog.f
- Intel, C: icc -openmp -o ompprog ompprog.c

Run
- ./ompprog

Example: PI

Example: PI (serial)

Fortran:

    PARAMETER (NUM_STEPS=1000000)
    SUM = 0.0
    STEP = 1.0/REAL(NUM_STEPS)
    DO I = 1, NUM_STEPS
       X = (I-0.5)*STEP
       SUM = SUM + 4.0/(1.0+X*X)
    ENDDO
    PI = STEP*SUM

C:

    static long num_steps = 1000000;
    double step;

    int main(void)
    {
        int i;
        double x, pi, sum = 0.0;

        step = 1.0/(double) num_steps;
        for (i = 1; i <= num_steps; i++) {
            x = (i-0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
        return 0;
    }

Parallel: Fortran

    !$OMP PARALLEL PRIVATE(ID,X)
    ID = OMP_GET_THREAD_NUM()
    SUM(ID) = 0.0
    DO I = ID, NUM_STEPS-1, NUM_THREADS
       X = (I+0.5)*STEP
       SUM(ID) = SUM(ID) + 4.0/(1.0+X*X)
    ENDDO
    !$OMP END PARALLEL
    PI = 0.0
    DO I = 0, NUM_THREADS-1
       PI = PI + SUM(I)*STEP
    ENDDO

Parallel: C

    /* i is listed as private: in C a sequential loop index inside a
       parallel region is shared by default. */
    #pragma omp parallel private(i, id, x)
    {
        id = omp_get_thread_num();
        sum[id] = 0.0;
        for (i = id; i < num_steps; i = i+num_threads) {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    }
    for (i = 0, pi = 0.0; i < num_threads; i++)
        pi += sum[i] * step;

do/sections/single
[Figure: three fork-join diagrams, each running from master thread through FORK, a team of threads, and JOIN back to the master thread. do/for: the team shares a DO/for loop. sections: the team executes SECTIONS. single: the team encounters a SINGLE block.]

DO/for: Fortran

    !$OMP PARALLEL PRIVATE(ID)
    ID = OMP_GET_THREAD_NUM()
    SUM(ID) = 0.0
    !$OMP DO PRIVATE(X)
    DO I = 0, NUM_STEPS-1
       X = (I+0.5)*STEP
       SUM(ID) = SUM(ID) + 4.0/(1.0+X*X)
    ENDDO
    !$OMP END DO
    !$OMP END PARALLEL
    PI = 0.0
    DO I = 0, NUM_THREADS-1
       PI = PI + SUM(I)*STEP
    ENDDO

DO/for: C

    #pragma omp parallel private(id)
    {
        id = omp_get_thread_num();
        sum[id] = 0.0;
        #pragma omp for private(x)
        for (i = 0; i < num_steps; i++) {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    }
    for (i = 0, pi = 0.0; i < num_threads; i++)
        pi += sum[i] * step;

sections: Fortran

    !$OMP PARALLEL PRIVATE(ID)
    ID = OMP_GET_THREAD_NUM()
    SUM(ID) = 0.0
    !$OMP SECTIONS PRIVATE(X)
    !$OMP SECTION
    DO I = 0, (NUM_STEPS-1)/2
       X = (I+.5)*STEP
       SUM(ID) = SUM(ID) + 4.0/(1.0+X*X)
    ENDDO
    !$OMP SECTION
    DO I = (NUM_STEPS-1)/2+1, NUM_STEPS-1
       X = (I+.5)*STEP
       SUM(ID) = SUM(ID) + 4.0/(1.0+X*X)
    ENDDO
    !$OMP END SECTIONS
    !$OMP END PARALLEL

sections: C

    #pragma omp parallel private(id)
    {
        id = omp_get_thread_num();
        sum[id] = 0.0;
        /* i must be private: both sections use it as a loop index. */
        #pragma omp sections private(i, x)
        {
            #pragma omp section
            for (i = 0; i < num_steps/2; i++) {
                x = (i+0.5)*step;
                sum[id] += 4.0/(1.0+x*x);
            }
            #pragma omp section
            /* starts at num_steps/2 so that no iteration is skipped */
            for (i = num_steps/2; i < num_steps; i++) {
                x = (i+0.5)*step;
                sum[id] += 4.0/(1.0+x*x);
            }
        }
    }

critical (1/3)

Fortran:

    !$OMP CRITICAL [(name)]
       structured block
    !$OMP END CRITICAL [(name)]

C:

    #pragma omp critical [(name)]
       structured block

critical (3/3)

    !$OMP PARALLEL PRIVATE(i) SHARED(cnt1, cnt2)
    !$OMP DO
    DO i = 1, n
       do_work
       IF (condition1) THEN
    !$OMP CRITICAL (name1)
          cnt1 = cnt1 + 1
    !$OMP END CRITICAL (name1)
       ELSE
    !$OMP CRITICAL (name1)
          cnt1 = cnt1 - 1
    !$OMP END CRITICAL (name1)
          IF (condition2) THEN
    !$OMP CRITICAL (name2)
             cnt2 = cnt2 + 1
    !$OMP END CRITICAL (name2)
          ENDIF
       ENDIF
    ENDDO
    !$OMP END PARALLEL

critical: Fortran

    !$OMP PARALLEL PRIVATE(ID, X, SUM)
    ID = OMP_GET_THREAD_NUM()
    SUM = 0.0
    DO I = ID, NUM_STEPS-1, NUM_THREADS
       X = (I+0.5)*STEP
       SUM = SUM + 4.0/(1.0+X*X)
    ENDDO
    !$OMP CRITICAL
    PI = PI + SUM*STEP
    !$OMP END CRITICAL
    !$OMP END PARALLEL

critical: C

    /* i is private as well: a shared loop index would be a data race. */
    #pragma omp parallel private(i, id, x, sum)
    {
        id = omp_get_thread_num();
        sum = 0.0;
        for (i = id; i < num_steps; i = i+num_threads) {
            x = (i+0.5)*step;
            sum += 4.0/(1.0+x*x);
        }
        #pragma omp critical
        {
            pi += sum*step;
        }
    }
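The fragment above can be turned into a self-contained function. The sketch below is one possible arrangement, not the slides' own program: the name `pi_critical` is invented here, the thread count is queried with `omp_get_num_threads()` instead of a `num_threads` variable, and serial fallbacks are supplied so the file also builds without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks (an assumption of this sketch) so the code also
   compiles and runs when OpenMP is disabled. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Midpoint-rule approximation of the integral of 4/(1+x^2) on [0,1]. */
double pi_critical(long num_steps)
{
    double step = 1.0 / (double)num_steps;
    double pi = 0.0;                      /* shared accumulator */

    #pragma omp parallel
    {
        /* Variables declared inside the region are private. */
        int id = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        double x, sum = 0.0;
        long i;

        /* Cyclic distribution: thread id takes i = id, id+nthreads, ... */
        for (i = id; i < num_steps; i += nthreads) {
            x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }

        /* One thread at a time adds its partial result. */
        #pragma omp critical
        pi += sum * step;
    }
    return pi;
}
```

Each thread enters the critical section only once, after its loop, so the serialization cost stays small compared with protecting every addition inside the loop.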

Clauses
- private(var1, var2, ...)
- shared(var1, var2, ...)
- default(shared | private | none)
- firstprivate(var1, var2, ...)
- lastprivate(var1, var2, ...)
- reduction(operator | intrinsic : var1, var2, ...)
- schedule(type [, chunk])

Data scope clauses
Clauses that set the data scope explicitly:
- private
- shared
- default
- firstprivate
- lastprivate
- reduction

Default data scope: Fortran (1/2)

    SUBROUTINE CALLER(A,N)
    INTEGER N, A(N), I, J, M
    M = 3
    !$OMP PARALLEL DO
    DO I = 1, N
       DO J = 1, 5
          CALL CALLEE(A(I),M,J)
       ENDDO
    ENDDO
    END

    SUBROUTINE CALLEE(X,Y,Z)
    COMMON /COM/C
    INTEGER X,Y,Z,C,II,CNT
    SAVE CNT
    CNT = CNT+1
    DO II = 1, Z
       X = Y+C
    ENDDO
    END

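Default data scope: Fortran (2/2)

    Variable  Scope    Explanation
    A         shared   Declared outside the parallel region
    N         shared   Declared outside the parallel region
    I         private  Parallel loop index
    J         private  Sequential loop index (Fortran)
    M         shared   Declared outside the parallel region
    X         shared   Actual argument A is shared
    Y         shared   Actual argument M is shared
    Z         private  Actual argument J is private
    C         shared   Declared in a COMMON block
    II        private  Local variable of the called subroutine
    CNT       shared   Local variable with the SAVE attribute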

Default data scope: C (1/2)

    void callee(int *x, int *y, int z);

    void caller(int a[], int n)
    {
        int i, j, m = 3;
        #pragma omp parallel for
        for (i = 0; i < n; i++) {
            int k = m;
            for (j = 1; j <= 5; j++)
                callee(&a[i], &k, j);
        }
    }

    extern int c;

    void callee(int *x, int *y, int z)
    {
        int ii;
        static int cnt;
        cnt++;
        for (ii = 0; ii < z; ii++)
            *x = *y + c;
    }

Default data scope: C (2/2)

    Variable  Scope    Explanation
    a         shared   Declared outside the parallel region
    n         shared   Declared outside the parallel region
    i         private  Parallel loop index
    j         shared   Sequential loop index (in C)
    m         shared   Declared outside the parallel region
    k         private  Automatic variable declared inside the parallel region
    x         private  Value parameter
    *x        shared   Actual argument a is shared
    y         private  Value parameter
    *y        private  Actual argument k is private
    z         private  Value parameter
    c         shared   Declared extern
    ii        private  Local variable of the called function
    cnt       shared   Local variable declared static

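clause: private
private(var1, var2, ...)
- Prevents the listed variables from being shared between threads.
- A private variable is defined only inside the parallel region.
  - It cannot be initialized from outside the parallel region (see firstprivate).
  - It disappears when the parallel region ends (see lastprivate).
- Variables to consider declaring private: variables that are assigned a value inside the parallel region.

    !$OMP PARALLEL shared(a) private(myid, x)
    myid = OMP_GET_THREAD_NUM()
    x = work(myid)
    IF (x < 1.0) THEN
       a(myid) = x
    ENDIF
    !$OMP END PARALLEL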

clause: shared, default (1/2)
shared(var1, var2, ...)
- All threads share the listed variables.
default(private | shared | none)
- Sets the default scope of variables that are not declared private or shared.
- In a parallel do (for) construct, the loop index is always private regardless of the default setting.
- default(none): every variable must be declared shared or private.
- C supports only default(shared | none).

clause: shared, default (2/2)

With explicit shared and private clauses:

    !$OMP PARALLEL shared(a) private(myid, x)
    myid = OMP_GET_THREAD_NUM()
    x = work(myid)
    IF (x < 1.0) THEN
       a(myid) = x
    ENDIF
    !$OMP END PARALLEL

The same region with default(private):

    !$OMP PARALLEL default(private) shared(a)
    myid = OMP_GET_THREAD_NUM()
    x = work(myid)
    IF (x < 1.0) THEN
       a(myid) = x
    ENDIF
    !$OMP END PARALLEL

clause: firstprivate (1/2)
firstprivate(var1, var2, ...)
- Creates a separate copy of each variable for every thread, as private does.
- Each thread's copy is initialized with the value the variable had in the serial region.

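clause: firstprivate (2/2)

With PRIVATE(C):

    !$OMP PARALLEL
    !$OMP DO PRIVATE(C)
    DO J = 1, M
       DO I = 2, N-1
          C(I) = SQRT(1.0+B(I,J)**2)
       ENDDO
       DO I = 1, N
          A(I,J) = SQRT(B(I,J)**2+C(I)**2)
       ENDDO
    ENDDO
    !$OMP END PARALLEL

What are C(1) and C(N)? The first inner loop assigns only C(2) to C(N-1), so with PRIVATE(C) the elements C(1) and C(N) of each thread's copy are undefined when the second loop reads them.

With FIRSTPRIVATE(C), every copy starts from the values C had in the serial region:

    !$OMP PARALLEL
    !$OMP DO FIRSTPRIVATE(C)
    DO J = 1, M
       DO I = 2, N-1
          C(I) = SQRT(1.0+B(I,J)**2)
       ENDDO
       DO I = 1, N
          A(I,J) = SQRT(B(I,J)**2+C(I)**2)
       ENDDO
    ENDDO
    !$OMP END PARALLEL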

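clause: lastprivate (1/2)
lastprivate(var1, var2, ...)
- Creates a separate copy of each variable for every thread, as private does.
- Passes back to the master thread the value from the iteration that would run last in serial execution, that is, the value from the final iteration.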

clause: lastprivate (2/2)

Without LASTPRIVATE, J and C are not defined after the loop, so the test J.EQ.M+1 and the copy into X are unreliable:

    !$OMP PARALLEL
    !$OMP DO FIRSTPRIVATE(C)
    DO J = 1, M
       DO I = 2, N-1
          C(I) = SQRT(1.0+B(I,J)**2)
       ENDDO
       DO I = 1, N
          A(I,J) = SQRT(B(I,J)**2+C(I)**2)
       ENDDO
    ENDDO
    IF (J .EQ. M+1) THEN
       DO I = 1, N
          X(I) = C(I)
       ENDDO
    ENDIF
    !$OMP END PARALLEL

With LASTPRIVATE(J, C), the values from the final iteration are available after the loop:

    !$OMP PARALLEL
    !$OMP DO FIRSTPRIVATE(C) &
    !$OMP& LASTPRIVATE(J, C)
    DO J = 1, M
       DO I = 2, N-1
          C(I) = SQRT(1.0+B(I,J)**2)
       ENDDO
       DO I = 1, N
          A(I,J) = SQRT(B(I,J)**2+C(I)**2)
       ENDDO
    ENDDO
    IF (J .EQ. M+1) THEN
       DO I = 1, N
          X(I) = C(I)
       ENDDO
    ENDIF
    !$OMP END PARALLEL
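The same two clauses exist in C. The following is a minimal sketch of their combined effect, using a hypothetical function `last_value` that is not part of the slides.

```c
/* Runs x = offset + i for i = 0 .. n-1 in parallel and returns the
   value of x from the sequentially last iteration (i = n-1).
   Intended for n >= 1. */
int last_value(int n, int offset)
{
    int i, x = -1;

    /* firstprivate: each thread's copy of offset starts with the
       value set before the loop.
       lastprivate: after the loop, x holds the value assigned by
       the iteration i = n-1, whichever thread executed it. */
    #pragma omp parallel for firstprivate(offset) lastprivate(x)
    for (i = 0; i < n; i++)
        x = offset + i;

    return x;
}
```

For example, `last_value(10, 5)` returns 14, the same result a serial loop would leave in `x`. With a plain `private(x)` the value of `x` after the loop would be undefined.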

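clause: reduction (1/4)
reduction(operator | intrinsic : var1, var2, ...)
- A reduction variable must be shared.
  - Arrays are allowed in Fortran only; deferred-shape and assumed-shape arrays cannot be used.
  - C allows scalar variables only.
- The variable is replicated in each thread, initialized to a value that depends on the operator (see the operator tables), and used in the parallel computation.
- The results computed in parallel by the threads are combined, and the final result is delivered to the master thread.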

clause: reduction (2/4)

    !$OMP DO reduction(+:sum)
    DO i = 1, 100
       sum = sum + x(i)
    ENDDO

With two threads this behaves like:

    Thread 0                      Thread 1
    sum0 = 0                      sum1 = 0
    DO i = 1, 50                  DO i = 51, 100
       sum0 = sum0 + x(i)            sum1 = sum1 + x(i)
    ENDDO                         ENDDO

    sum = sum0 + sum1

clause: reduction (3/4)
Reduction operators: Fortran

    Operator  Data types                                  Initial value
    +         integer, floating point (complex or real)   0
    *         integer, floating point (complex or real)   1
    -         integer, floating point (complex or real)   0
    .AND.     logical                                     .TRUE.
    .OR.      logical                                     .FALSE.
    .EQV.     logical                                     .TRUE.
    .NEQV.    logical                                     .FALSE.
    MAX       integer, floating point (real only)         smallest possible value
    MIN       integer, floating point (real only)         largest possible value
    IAND      integer                                     all bits on
    IOR       integer                                     0
    IEOR      integer                                     0

clause: reduction (4/4)
Reduction operators: C

    Operator  Data types               Initial value
    +         integer, floating point  0
    *         integer, floating point  1
    -         integer, floating point  0
    &         integer                  all bits on
    |         integer                  0
    ^         integer                  0
    &&        integer                  1
    ||        integer                  0

reduction: Fortran

    !$OMP PARALLEL DO REDUCTION(+:SUM) PRIVATE(X)
    DO I = 1, NUM_STEPS
       X = (I-0.5)*STEP
       SUM = SUM + 4.0/(1.0+X*X)
    ENDDO
    !$OMP END PARALLEL DO
    PI = SUM*STEP

reduction: C

    #pragma omp parallel for reduction(+:sum) private(x)
    for (i = 1; i <= num_steps; i++) {
        x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
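Wrapped in a function, the reduction version of the PI example becomes a complete unit. This is a sketch; the function name `pi_reduction` and the choice to pass `num_steps` as a parameter are assumptions of this example.

```c
/* Midpoint-rule approximation of pi using an OpenMP reduction.
   Compiled without OpenMP, the pragma is ignored and the loop
   runs serially with the same result. */
double pi_reduction(long num_steps)
{
    double step = 1.0 / (double)num_steps;
    double x, sum = 0.0;
    long i;

    /* Each thread accumulates into its own copy of sum (initialized
       to 0 for the + operator); the copies are added together when
       the loop ends. */
    #pragma omp parallel for reduction(+:sum) private(x)
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

Compared with the per-thread array `sum[id]` or the critical section used earlier, the reduction clause needs neither the thread ID nor explicit synchronization in user code.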

clause: schedule (1/8)
schedule(type [, chunk_size])
- The schedule of a loop decides how its iterations are distributed among threads.
- Default policy: an equal number of iterations per thread.
- Use the schedule clause to balance the actual work.
- type (the default is implementation dependent): static, dynamic, guided, runtime

    !$OMP PARALLEL DO private(xkind)
    DO i = 1, N
       xkind = f(i)
       IF (xkind < 10) THEN
          CALL fast(x(i))
       ELSE
          CALL slow(x(i))
       ENDIF
    ENDDO

clause: schedule (2/8)
schedule(static)
- Iterations are assigned evenly to the threads.

    !$OMP PARALLEL DO shared(x) private(i) &
    !$OMP& schedule(static)
    DO i = 1, 1000
       x(i) = a
    ENDDO

With four threads:
- thread 0: i = 1 to 250
- thread 1: i = 251 to 500
- thread 2: i = 501 to 750
- thread 3: i = 751 to 1000

clause: schedule (3/8)
schedule(static, chunk_size)
- The total iteration count is divided into chunks of chunk_size iterations.
- The chunks are statically assigned to threads in round-robin order.

    !$OMP PARALLEL DO shared(x) private(i) &
    !$OMP& schedule(static, 100)
    DO i = 1, 1200
       work
    ENDDO

With four threads:
- thread 0: (1,100), (401,500), (801,900)
- thread 1: (101,200), (501,600), (901,1000)
- thread 2: (201,300), (601,700), (1001,1100)
- thread 3: (301,400), (701,800), (1101,1200)
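The round-robin assignment can be written as a small formula. The helper below is a sketch for reasoning about schedule(static, chunk_size); `static_owner` is a name invented for this example and is not an OpenMP routine.

```c
/* Thread that executes 1-based iteration i when chunks of `chunk`
   iterations are dealt out round-robin to `nthreads` threads. */
int static_owner(long i, long chunk, int nthreads)
{
    long chunk_index = (i - 1) / chunk;   /* 0-based chunk number */
    return (int)(chunk_index % nthreads);
}
```

With chunk = 100 and four threads, `static_owner(401, 100, 4)` is 0 and `static_owner(1101, 100, 4)` is 3, matching the assignment listed for the 1200-iteration loop.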

clause: schedule (4/8)

clause: schedule (5/8)
schedule(dynamic, chunk_size)
- The total iteration count is divided into chunks of chunk_size iterations.
- The chunks are assigned to threads dynamically.
- The thread that finishes its work first receives the next chunk.
- If chunk_size is omitted, it defaults to 1.

    !$OMP DO schedule(dynamic, 1000)
    DO i = 1, 10000
       work
    ENDDO

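clause: schedule (6/8)
schedule(guided, chunk_size)
- Dynamic scheduling in which the chunk size changes ($N_0$: number of iterations, $P$: number of threads):
$$N_n = \max\bigl(N_{n-1} - \mathrm{size}(\mathrm{chunk}_{n-1}),\; P \cdot \mathrm{chunk\_size}\bigr) \quad (n \ge 1)$$
$$\mathrm{size}(\mathrm{chunk}_n) = \lceil N_n / P \rceil$$
- Useful when the threads reach the DO construct at different times.

    !$OMP DO schedule(guided, 55)
    DO i = 1, 12000
       work
    ENDDO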

clause: schedule (7/8)

clause: schedule (8/8)
schedule(runtime)
- The value of the environment variable OMP_SCHEDULE is read while the program runs.
- Different scheduling methods can be tried without recompiling.

    export OMP_SCHEDULE="static,1000"
    export OMP_SCHEDULE="dynamic"

schedule: Fortran

    !$OMP PARALLEL
    !$OMP DO SCHEDULE(DYNAMIC,10)
    DO I = 1, 100
       PRINT*, 'I AM :', OMP_GET_THREAD_NUM(), 'I:', I
    ENDDO
    !$OMP END PARALLEL

schedule: C

    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic,10)
        for (i = 0; i < 100; i++)
            printf("I am %d, i = %d\n", omp_get_thread_num(), i);
    }

clause: if
if (logical expression)
- If the expression is true the region runs in parallel; if it is false the region runs serially.
- Example: parallelization pays off only for 800 or more iterations.

Fortran:

    !$OMP PARALLEL DO &
    !$OMP& if (n .ge. 800)
    DO i = 1, n
       z(i) = a*x(i) + y
    ENDDO
    !$OMP END PARALLEL DO

C:

    #pragma omp parallel for \
            if (800 <= n)
    for (i = 1; i <= n; i++)
        z[i] = a*x[i] + y;
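As a complete function, the if clause might be used as follows. This is only a sketch: the name `scaled_add`, the zero-based indexing, and the parameter list are choices made for this example, while the threshold of 800 comes from the slide.

```c
/* Computes z[i] = a*x[i] + y for i = 0 .. n-1. The loop runs in
   parallel only when n is at least 800; below that it runs on a
   single thread, avoiding the cost of starting a team. */
void scaled_add(int n, double a, const double *x, double y, double *z)
{
    int i;

    #pragma omp parallel for if (n >= 800)
    for (i = 0; i < n; i++)
        z[i] = a * x[i] + y;
}
```

The result is the same either way; the clause only decides whether a team of threads is created. A suitable threshold depends on the machine and the loop body, so it is best found by measurement.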

clause: ordered
ordered
- Indicates that an ordered directive will appear inside the loop.
- The ordered directive is used together with the ordered clause.
- The code enclosed by the directive is executed in loop-index order.
- See the synchronization directive ordered.

    !$OMP PARALLEL private(myid)
    myid = omp_get_thread_num()
    !$OMP DO private(i) ordered
    DO i = 1, 8
    !$OMP ORDERED
       PRINT*, 'T:', myid, 'i=', i
    !$OMP END ORDERED
    ENDDO
    !$OMP END PARALLEL

clause: copyin
- Fortran: copyin(/cb1/ [, /cb2/ ...])
- C: copyin(var1, var2, ...)
- Gives the other threads access to the master thread's threadprivate data.
- Used to initialize each thread's threadprivate variables with the master thread's threadprivate data.

copyin: Fortran & C

Fortran:

    INTEGER tid, x
    COMMON /mine/ x
    !$OMP THREADPRIVATE(/mine/)
    x = 33
    CALL omp_set_num_threads(4)
    !$OMP PARALLEL private(tid) &
    !$OMP& copyin(/mine/)
    tid = omp_get_thread_num()
    PRINT*, 'T:', tid, 'x=', x
    !$OMP END PARALLEL

C:

    int x;
    #pragma omp threadprivate(x)

    int main(void)
    {
        int tid;
        x = 33;
        omp_set_num_threads(4);
        #pragma omp parallel \
                private(tid) copyin(x)
        {
            tid = omp_get_thread_num();
            printf("T:%d,x=%d\n", tid, x);
        }
        return 0;
    }
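The effect of copyin can also be checked programmatically instead of by reading printed output. In this sketch, the names `tp_value` and `count_threads_seeing` are invented for the example; it counts how many threads of the team see the master thread's value.

```c
/* File-scope variable with one copy per thread. */
static int tp_value;
#pragma omp threadprivate(tp_value)

/* Sets the master thread's copy, then counts how many threads in the
   parallel region see that value after copyin. Without OpenMP the
   function runs serially and returns 1. */
int count_threads_seeing(int value)
{
    int matches = 0;

    tp_value = value;   /* serial region: the master thread's copy */

    #pragma omp parallel copyin(tp_value) reduction(+:matches)
    {
        if (tp_value == value)
            matches += 1;
    }
    return matches;
}
```

Because copyin initializes every thread's copy from the master's, the return value equals the number of threads in the team. Without the copyin clause, the copies of the non-master threads would keep their own initial value of zero.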

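Directives and clauses: summary (1/2)
- Directives (table columns): parallel, do/for, sections, single, parallel do/for, parallel sections
- Clauses (table rows): if, private, shared, default, firstprivate, lastprivate, reduction, copyin, schedule, ordered, nowait
[Table: which clauses each directive accepts; the cell marks did not survive extraction]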

Directives and clauses: summary (2/2)
The following directives take no clauses:
- master
- critical
- barrier
- atomic
- ordered
- threadprivate
- flush

Chapter III: Environment Variables and Run-Time Library
This chapter covers how to use the environment variables and the run-time library routines.

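Environment variables (1/3)
- OMP_NUM_THREADS
- OMP_SCHEDULE
- OMP_DYNAMIC
- OMP_NESTED

These variables control the execution of an OpenMP program.
- They are read once, before the program starts running.
- Their values can be changed inside the program through the run-time library.
- Variable names must be uppercase.
- Variable values may be lowercase.
- C: logical values are 1 (TRUE) and 0 (FALSE).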

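Environment variables (2/3)
- OMP_NUM_THREADS: sets the maximum number of threads available in a parallel region.
- OMP_SCHEDULE: sets the scheduling method for loops whose schedule type is runtime.
- OMP_DYNAMIC: decides whether the number of threads is adjusted dynamically.
  - TRUE: the number of threads actually used in a parallel region is chosen dynamically, up to the maximum.
- OMP_NESTED: decides whether nested parallelism is enabled.
  - The OpenMP standard supports it, but most vendors do not.
  - Default: FALSE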

Environment variables (3/3)
Usage examples (ksh)

    Variable          Usage
    OMP_SCHEDULE      export OMP_SCHEDULE="guided,4"
                      export OMP_SCHEDULE="dynamic"
    OMP_NUM_THREADS   export OMP_NUM_THREADS=32
    OMP_DYNAMIC       export OMP_DYNAMIC=TRUE
    OMP_NESTED        export OMP_NESTED=FALSE

Run-time library routines
Execution environment routines
- omp_set_num_threads()
- omp_get_num_threads()
- omp_get_thread_num()
- omp_get_max_threads()
- omp_set_nested()
- omp_set_dynamic()
- omp_get_nested()
- omp_get_dynamic()
- omp_in_parallel()
- omp_get_num_procs()
Lock routines

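omp_set_num_threads
- Fortran: CALL omp_set_num_threads(integer)
- C: void omp_set_num_threads(int)
- Sets the number of threads to use in the parallel regions that follow.
- Takes precedence over the environment variable OMP_NUM_THREADS.
- The number of threads stays fixed until the next call.
- If thread allocation is dynamic, the value is the maximum number of threads that may be used.
- It cannot be called inside a parallel region.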

omp_get_num_threads
- Fortran: INTEGER omp_get_num_threads()
- C: int omp_get_num_threads(void)
- Called inside a parallel region, it returns the number of threads that were created.
- Called in a serial region, it returns 1.

Fortran:

    INTEGER omp_get_num_threads
    nthreads = 16
    CALL omp_set_num_threads(nthreads)
    !$OMP PARALLEL
    PRINT *, '# of threads =', &
             omp_get_num_threads()
    !$OMP END PARALLEL

C:

    #include <omp.h>

    num_threads = 16;
    omp_set_num_threads(num_threads);
    #pragma omp parallel
    {
        printf("# of threads = %d\n",
               omp_get_num_threads());
    }

omp_get_thread_num
- Fortran: INTEGER omp_get_thread_num()
- C: int omp_get_thread_num(void)
- Inside a parallel region, it returns the ID of the calling thread.
- 0 <= thread ID <= omp_get_num_threads() - 1
- Called in a serial region, it returns 0 (the master thread).

Fortran:

    INTEGER omp_get_thread_num
    !$OMP PARALLEL
    PRINT *, 'thread ID =', &
             omp_get_thread_num()
    !$OMP END PARALLEL

C:

    #include <omp.h>

    #pragma omp parallel
    {
        printf("thread ID = %d\n",
               omp_get_thread_num());
    }

omp_in_parallel
- Fortran: LOGICAL omp_in_parallel()
- C: int omp_in_parallel(void)
- Checks whether the call site is in a serial region or a parallel region.
- Returns .TRUE. (1) in a parallel region and .FALSE. (0) in a serial region.

Fortran:

    LOGICAL omp_in_parallel
    PRINT *, 'parallel region?', &
             omp_in_parallel()
    !$OMP PARALLEL
    PRINT *, 'parallel region?', &
             omp_in_parallel()
    !$OMP END PARALLEL

C:

    #include <omp.h>

    printf("parallel region? = %d\n",
           omp_in_parallel());
    #pragma omp parallel
    {
        printf("parallel region? = %d\n",
               omp_in_parallel());
    }
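The return values described for these routines can be verified with a short self-check. This is a sketch only: `check_runtime_routines` is a hypothetical name, and the serial fallbacks in the `#else` branch are an assumption that lets the file build without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks for builds without OpenMP (assumption of this sketch). */
static int omp_in_parallel(void)     { return 0; }
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

/* Returns 1 if the routines behave as described, 0 otherwise. */
int check_runtime_routines(void)
{
    int ok = 1;

    /* Serial region: not in parallel, one thread, thread ID 0. */
    if (omp_in_parallel() || omp_get_num_threads() != 1 ||
        omp_get_thread_num() != 0)
        ok = 0;

    /* Parallel region: every ID lies in 0 .. omp_get_num_threads()-1.
       The && reduction combines the per-thread verdicts. */
    #pragma omp parallel reduction(&&:ok)
    {
        int id = omp_get_thread_num();
        if (id < 0 || id >= omp_get_num_threads())
            ok = 0;
    }
    return ok;
}
```

The check inside the parallel region deliberately does not test `omp_in_parallel()`, because a region that ends up running on a single thread may not be reported as parallel.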

Profiling tools
- prof, gprof (GNU Profiler)
- PAPI
- Dynaprof
- GuideView (OpenMP)
- Vampir (MPI)
- TAU (OpenMP, MPI, hybrid)
- Vprof
- Etc.

Profiling steps with gprof
GNU Profiler (gprof)
1. Compile: $ ifort -o [myprog] [myprog.f] -pg
2. Execute: $ ./[myprog]
3. Print the profile data: $ gprof [option] ./[myprog] gmon.out > myprog_prof.txt

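Main gprof options
GNU Profiler (gprof)

Output format options
- -A: prints the source code annotated with the analysis results
- -C[funcname]: uses only the specified symbols in the function analysis
- -J[funcname]: annotates the source code only for the specified symbols
- -p[funcname]: produces a flat profile; if [funcname] is given, only for the specified symbols
- -q[funcname]: produces a call graph profile; if [funcname] is given, only for the specified symbols

Analysis options
- -a: does not analyze functions defined as static
- -c: for functions not compiled with the profiling option, estimates the call relationships and analyzes them
- -l: line-by-line profiling
- -s: reads the profile data files and writes their sum to gmon.sum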

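GNU Profiler (gprof)
Analyzing the profiling data
- Flat profile
  - Reports the total execution time, average execution time, and similar figures for each function.
  - Contains no information about calls between functions.
  - The -z and -c options add information about functions that were never called.
- Call graph profile
  - Analyzes execution time through the call relationships between functions.
  - Gives a more detailed analysis than the flat profile.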

Reading the profiling output: GNU Profiler (gprof)
Flat profile

    Column               Meaning
    % time               Percentage of the program's total run time spent in the function
    cumulative seconds   Time spent in the function plus all functions listed above it in the table
    self seconds         Total time spent in the function itself
    calls                Number of times the function was called
    self ms/call         Average time per call, excluding time spent in functions it calls (= self seconds / calls)
    total ms/call        Average time per call, including time spent in functions it calls
    name                 Name of the function

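Reading the profiling output: GNU Profiler (gprof)
Call graph profile

    Column     Meaning
    index      Unique number assigned to each function
    % time     Share of the execution time taken by the function
    self       Time spent in the function itself
    children   Time spent in the functions it calls
    called     Total number of calls to the function and the number of calls made by the function under analysis
    name       Name of the function and its index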