Parallel Programming
박필성, Department of Computer Science, College of IT

Contents
- Why Parallel?
- Parallel Computers
- Parallel Processing
- Parallel Programming Models
- Parallel Programming
- OpenMP Standard
- MPI Standard
- Related Topics & Questions

Why Parallel? (1)
Why parallel computing?
- Traditional compute-intensive applications: weather forecasting, computational fluid dynamics, chemistry, astronomy, various engineering problems
- New data-intensive applications: video servers, data mining
- Future high-performance applications: VR, collaborative environments, CAD
- Save time and/or money.
- Solve large problems.
- Provide concurrency.
- Use of non-local resources
- Limits to serial computing
References: http://en.wikipedia.org/wiki/parallel_computer http://www.top500.org/

Why Parallel? (2)
- Atmosphere, Earth, Environment
- Physics: applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
- Bioscience, Biotechnology, Genetics
- Chemistry, Molecular Sciences
- Geology, Seismology
- Mechanical Engineering: from prosthetics to spacecraft
- Electrical Engineering, Circuit Design, Microelectronics
- Computer Science, Mathematics

Why Parallel? (3)
- Databases, data mining
- Oil exploration
- Web search engines, web-based business services
- Medical imaging and diagnosis
- Pharmaceutical design
- Management of national and multi-national corporations
- Financial and economic modeling
- Advanced graphics and virtual reality, particularly in the entertainment industry
- Networked video and multi-media technologies
- Collaborative work environments

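Why Parallel? (4)
Physical limits of computers
- Current CPU clock speeds are around 3 GHz, held back by heat dissipation and by the limits of semiconductor devices.
- Is a 1 THz (1 x 10^12 Hz) computer possible?
  * Speed of light: c = 3 x 10^8 m/s = 3 x 10^11 mm/s
  * If r is the distance between the CPU and memory, a signal must cover it within one clock cycle, so r < c / 10^12 = 0.3 mm!
  * To hold 1 terabyte (= 10^6 x 10^6 bytes) of memory within that distance, each byte would have to be stored within about 3 Å x 3 Å!
- As computer performance rises, price rises exponentially.
- The alternative is parallel processing on many low-cost systems rather than a single expensive system.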
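The two estimates on this slide can be checked with a short calculation. It is only a sketch, assuming that a signal must travel from the CPU to memory within one clock period and that the memory is laid out on a square of side r:

```latex
\begin{align}
r &< \frac{c}{f} = \frac{3\times10^{11}\ \text{mm/s}}{10^{12}\ \text{Hz}} = 0.3\ \text{mm}, \\
\text{area per byte} &= \frac{r^{2}}{10^{12}}
  = \frac{(3\times10^{-4}\ \text{m})^{2}}{10^{12}}
  = 9\times10^{-20}\ \text{m}^{2}
  = \left(3\times10^{-10}\ \text{m}\right)^{2}
  = (3\ \text{\AA})^{2}.
\end{align}
```

A square of side 3 Å is on the scale of a single atom, which is why the slide treats such a machine as physically out of reach.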

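Why Parallel? (5)
Physical limits of computers (continued)
- Until 2004, software benefited from hardware progress without any changes to the software itself:
  * CPU clock speed: grew exponentially until the early 2000s
  * CPU execution-time optimization: optimized sequential execution of instructions (pipelining, branch prediction, out-of-order execution, etc.)
  * Larger caches
- "The free lunch is over: A fundamental turn toward concurrency in software" (Herb Sutter, 2005)
  * Power consumption and heat: grow with the square (or the cube?) of the clock speed
  * Higher clock speed -> shorter wires -> lower CPU reliability
- Since 2004, CPU manufacturers have given up the clock-speed race.
  * CPU: single core -> multi-core
  * GPU: many-core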

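Parallel Computers (1)
Definition of a parallel computer
* A computer in which multiple CPUs process multiple programs, or the parts of one partitioned program, at the same time
* A computer architecture that combines multiple CPUs to overcome the performance limit of a single CPU
http://giyyon.tistory.com/44?srchid=br1http%3a%2f%2fgiyyon.tistory.com%2f44
There are many ways to classify parallel computers. Classified by how memory is shared:
* SMP (Symmetric Multi Processing)
* MPP (Massively Parallel Processing)
* NUMA (Non-Uniform Memory Access)
** Cluster computer (workstation, PC, ...)
   - A kind of supercomputer for parallel processing, built by connecting many personal PCs or small servers with network equipment
   - Low price (1/10 of a commercial supercomputer), scalability, flexibility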

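Parallel Computers (2)
[Figure: composition of the HPC market according to the November 2012 Top500 list (1993-)]

Parallel Computers (3)
[Figure: composition of the HPC market according to the November 2012 Top500 list (1993-)]

Parallel Computers (4)
[Figure: composition of the HPC market according to the November 2012 Top500 list (1993-)]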

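Parallel Computers (5)
PC cluster
- A technique for achieving high-performance computing (HPC) comparable to a supercomputer by connecting tens to thousands of small PCs or PC servers over a parallel network
- The first one: NASA CESDIS Beowulf, 16 nodes with Intel DX4 processors
  http://www.phy.duke.edu/~rgb/brahma/resources/beowulf/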

Parallel Computers (6)
[Figure: classification of computer architectures]

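Parallel Processing (1)
What is parallel processing?
- A method of raising processing speed by using multiple processing units, all of which simultaneously handle different tasks of a single program and so share its load
- This differs from multiprocessing, in which several separate programs are processed in parallel at the same time.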

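Parallel Processing (2)
Pros and cons of parallel processing
- Pros
  * Shorter actual execution time (wall-clock time) of a program
  * Larger problems become solvable
- Cons
  * Additional overhead
  * Programming is difficult
  * It cannot be applied efficiently to every problem
Examples of parallel processing, from everyday life or the workplace
- Painting a wall: easy to parallelize
- Solving an advanced mathematics problem: hard, impossible, or pointless to parallelize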

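Parallel Processing (3)
Terminology
- program: a collection of instructions stored as a file on disk
- processor (= CPU): single (1), dual (2), quad (4), hexa (6), octa (8), deca (10), magni (12)
  * CPU core: the central part of a CPU
- task: an independent unit of execution from the OS point of view
- process
  * An executable program that has been invoked and placed under the execution control of the OS
  * Each process has its own resources and address space: program code, data, stack, registers, etc.
  * In general, running one program may create multiple processes.
- thread
  * One of the smallest units of work controlled by the OS; it runs inside a process.
  * That is, a unit of work that uses the same text (context) within a single process
  * An independent flow of control, and the unit to which CPU time is allocated for a program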

Parallel Processing (4)
Processes and threads
- A processor can run only one process at a time.
- Handing the processor to another process requires a context switch: the state of the running process is saved so that another process can be run.
  * task context (process, thread): the minimal set of data used by the task that must be saved to allow a task interruption + the configuration of the task
- A process runs as an independent unit of work, but as computer technology advanced it became possible to split a process into smaller units of work, threads.
Process scheduling and thread scheduling
- Each process uses its own text, whereas threads share text.
- All threads inside one process share the same text, so switching between threads of the same process does not require swapping the text and is therefore faster.
- Thanks to this efficient resource allocation and cheaper context switching, thread-level execution of a program performs better than process-level execution.

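Parallel Programming Models (1)
Why a parallel programming model?
- Parallel computation requires data to be passed between processes (or threads).
- The programming model must be chosen to suit the hardware structure of the system.
Shared memory system
- All CPUs share memory.
- To pass data to another CPU, one CPU simply writes to shared memory and the other reads it -> programming is easy.
- Memory contention -> the number of CPUs is limited.
Distributed memory system
- Each CPU uses its own independent memory.
- Passing data to another CPU requires explicit communication -> programming is hard.
- No memory contention -> the system is easy to scale.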

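Parallel Programming Models (2)
1) Shared-memory parallel programming model
- Threads that belong to one process share system resources, so multithreaded programs suit a shared-memory architecture.
  * POSIX threads (pthreads): the user controls thread creation and work assignment
  * OpenMP: compiler directives inserted into the source program produce a multithreaded program
[Figure: single-threaded process and multithreaded process]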

Parallel Programming Models (3)
2) Message-passing parallel programming model
- If the nodes of a parallel system do not share an address space, each process must exchange data over the network to access data updated by other processes.
  * HPF (High Performance FORTRAN)
  * PVM (Parallel Virtual Machine)
  * MPI (Message Passing Interface)
[Figure: message passing]

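Parallel Programming Models (4)
3) Hybrid parallel programming model
- For SMP clusters and other systems with a distributed shared-memory architecture, a hybrid model that exploits the features of both the shared-memory model and the message-passing model is appropriate.
- It means programming with both the shared-memory model and the message-passing model within one program, e.g. OpenMP + MPI.
[Figure: example of creating several single-threaded processes on one node]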

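Parallel Programming Models (5)
SPMD vs. MPMD
SPMD (Single Program Multiple Data)
- Every process or thread of the parallel program runs the same single program, and either
  * runs the same function in parallel on different data (domain decomposition), or
  * takes charge of a different function and runs it in parallel (functional decomposition).
MPMD (Multiple Program Multiple Data)
- The application consists of several programs. Each process or thread runs a different program, and the programs exchange the data they need through communication.
[Figure: SPMD and MPMD]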

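Parallel Programming (1)
Issues to consider in parallel programming
- Data dependency: a fundamental dependency cannot be parallelized, so parallelize in a way that avoids dependencies.
- Deadlock
- Scalability: write the program so that it scales well.
- Load balancing: even distribution of work matters, since all workers should finish at the same time.
- I/O: input and output should be handled by a single worker.
- Synchronization: how are the work and progress of all workers controlled?
- How do workers exchange information (that is, data)? Shared memory vs. distributed memory
- How is the program written?
  * SPMD (single program multiple data)
  * MPMD (multiple program multiple data)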

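Parallel Programming (2)
A parallel processing problem: compute the sum of the integers from 1 to n.
- p: the total number of workers
- Each worker must know p and its own position among the workers (its rank).
Algorithm
1. The master worker reads n.
2. The master worker divides the work evenly and tells each slave worker the range it is responsible for.
3. Each worker independently computes the partial sum over its range.
4. Each slave worker reports its partial sum to the master worker.
5. The master worker combines the partial sums of all workers, including its own, and produces the answer.
Alternative to step 2: the master worker broadcasts n to the slave workers, and every worker works out its own range from its rank.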

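Parallel Programming (3)
SPMD vs. MPMD [1/2]
SPMD (single program multiple data)
- All workers use the same program; each worker's role is decided by its rank.
Algorithm (assuming the master worker has rank 0)
1. If my rank == 0, read n, decide the work range of each slave worker, and notify each of them.
   Else, receive my work range from the master worker (wait until it arrives).
2. Each worker independently adds up the numbers in its own range to obtain a partial sum.
3. If my rank == 0, receive the partial sums from all slave workers and compute the total.
   Else, report my partial sum to the master worker.
4. If my rank == 0, print the total.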

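Parallel Programming (4)
SPMD vs. MPMD [2/2]
MPMD (multiple program multiple data)
- Two or more programs are used, one per role.
Algorithm (master worker)
1. Read n, decide the work range of each slave worker, and notify each of them.
2. Add up the numbers in my own range to obtain a partial sum.
3. Receive the partial sums from all other workers and compute the total.
4. Print the total.
Algorithm (slave worker)
1. Receive my work range from the master worker.
2. Independently add up the numbers in my range to obtain a partial sum.
3. Report my partial sum to the master worker.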

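OpenMP Standard (1)
What is OpenMP?
- An API (application programming interface) for writing multithreaded parallel programs in a shared-memory environment
- Parallelization directives are added to an existing serial program, and the compiler uses them to parallelize it.
- An existing serial program can be parallelized as is, or with only minor changes.
- It can be used only on shared-memory systems.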

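OpenMP Standard (2)
Goals of OpenMP
- Standardization and portability
- The de facto standard for shared-memory multithreaded parallel programming
History of OpenMP
- 1990s: rapid progress in high-performance shared-memory systems; each vendor had its own set of directives, so standardization was needed
- 1996: openmp.org founded http://www.openmp.org/
- 1997: OpenMP API released
- March 2002: OpenMP 2.0 released (C/C++)
- May 2008: OpenMP 3.0 released
- As of March 2013: OpenMP 4.0 under development
- Most compilers now support OpenMP, e.g. GCC on Linux: OpenMP 2.5 since GCC 4.2, OpenMP 3.0 since GCC 4.4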

OpenMP Standard (3)
Components of OpenMP (#include <omp.h>)
Compiler directives
- Handle work sharing, communication, and synchronization among threads; this is OpenMP in the narrow sense.
- Form: #pragma omp directive
- e.g. inside a program:
    #pragma omp parallel
    #pragma omp for
Runtime library
- Sets and queries parallel parameters (number of participating threads, thread number, etc.)
- e.g. inside a program:
    omp_set_num_threads(8);    /* use 8 threads in parallel regions at run time */
Environment variables
- Define the parallel parameters of the execution system (number of threads to use, etc.)
- e.g. on a Linux system:
    $ export OMP_NUM_THREADS=8    # use 8 threads in parallel regions at run time

OpenMP Standard (4)
OpenMP programming model
- Thread based
- Fork-join model
- Compiler directive based: directives are inserted into sequential code
  * The compiler uses the directives to generate multiple threads.
  * A compiler that supports OpenMP is required.

OpenMP Standard (5)
Serial code
    ialpha = 2;
    for (i=1; i<=100; i++) {
        :
    }
Parallel code
    ialpha = 2;
    #pragma omp parallel for
    for (i=1; i<=100; i++) {
        :
    }

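OpenMP Standard (6)
Main OpenMP compiler directives (1)
Specifying a parallel block
    #pragma omp parallel       /* marks a parallel block in the program */
    {                          /* fork: the specified number of threads is created */
        /* parallel block: the threads execute it in parallel */
    }                          /* join: only the master thread remains */
Dividing work inside a parallel block
1) #pragma omp for: divides the iterations of a for loop among threads
    #pragma omp for
    for (i=0; i<n; i++)
2) #pragma omp sections: assigns work to threads section by section
    #pragma omp sections
    {
        #pragma omp section    /* defines one section */
        { }
        #pragma omp section    /* defines another section */
        { }
    }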

OpenMP Standard (7)
Main OpenMP compiler directives (2)
Dividing work inside a parallel block (continued)
3) #pragma omp single: marks a block to be executed by a single thread
    #pragma omp single
    {
        /* block executed by a single thread */
    }
4) #pragma omp task: explicit task definition, e.g.
    #pragma omp single private(i)
    {
        for (i=0; i<n; i++) {
            #pragma omp task
            {
                /* define the task here */
            }
        }
    }

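OpenMP Standard (8)
Main OpenMP compiler directives (3)
Mutual exclusion
- #pragma omp critical: only one thread executes the block at any moment; the others wait
- #pragma omp atomic: a "mini critical" for a single update
Event synchronization
- #pragma omp barrier: wait until all threads have reached the barrier
- #pragma omp ordered: execute the enclosed part of the loop in sequential order
- #pragma omp master: executed only by the master thread; other threads skip it
Others
- #pragma omp threadprivate(variables)
- #pragma omp taskwait
- #pragma omp flush
  :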

OpenMP Standard (9)
Clauses used together with OpenMP compiler directives
Data-scope clauses
- shared (variables): declares variables shared by all threads
- private (variables): declares variables of which each thread has its own copy
- firstprivate (variables)
- lastprivate (variables)
- default ()
- copyin (variables)
- reduction (operator : variable)
Others
- schedule
- nowait
- collapse
- ordered
  :

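OpenMP Standard (10)
Main OpenMP runtime library functions (called as functions inside the program)
- omp_get_thread_num(): get the thread ID
- omp_get_num_threads(): get the total number of threads
- omp_set_num_threads(): set the number of threads to use
- omp_get_max_threads(): get the maximum number of threads
- omp_get_num_procs(): get the number of available processors
- omp_in_parallel(): test whether the current code is running inside a parallel region
- omp_set_dynamic(): set whether the number of threads may be changed dynamically
Main OpenMP environment variables
- OMP_NUM_THREADS: maximum number of threads available at run time
- OMP_DYNAMIC: whether the number of threads may be changed dynamically at run time
- OMP_SCHEDULE: scheduling method
- OMP_NESTED: whether nested parallelism is allowed
On Linux, set the environment with a command like the following before running the program:
    $ export OMP_NUM_THREADS=16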

OpenMP Standard (11)
OpenMP example programs
Version 1: reduction clause
    #include <stdio.h>
    #include <omp.h>
    int main(void)
    {
        int i, sum=0;
        /* parallel for must be followed directly by the for loop */
        #pragma omp parallel for reduction(+:sum)
        for (i=1; i<=1000; i++)
            sum+=i;
        printf("sum from 1 to 1000 is %d.\n",sum);
        return 0;
    }
Version 2: manual work division and a critical section
    #include <stdio.h>
    #include <omp.h>
    int main(void)
    {
        int total=0;
        #pragma omp parallel
        {
            /* declared inside the region, so each thread has its own copy */
            int Nthreads = omp_get_num_threads();
            int th_id = omp_get_thread_num();
            int i, sum=0;
            for (i=th_id; i<1001; i=i+Nthreads)
                sum+=i;
            #pragma omp critical
            total=total+sum;
        }
        printf("sum from 1 to 1000 is %d.\n", total);
        return 0;
    }

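OpenMP Standard (12)
Pros and cons of OpenMP, and other notes
Pros
- Coding and debugging are relatively easy compared with MPI, where the user must handle communication directly.
- Less burden of data partitioning than MPI, where the user must partition the data directly.
- Incremental parallelization is possible by parallelizing loops one at a time.
- By adjusting compile options and using conditional compilation, one source can be compiled as either parallel or sequential code.
- The code is comparatively small.
Cons
- The resulting parallel program runs only on multiprocessor architectures with shared memory.
- The limits of the architecture (number of processors, memory) can make it hard to reach the desired performance.
- A compiler that supports OpenMP is required.
- The parallelism depends heavily on loops, so parallel efficiency can be low.
What you need to know
- Compiler directives, runtime library routines, and environment variables
- Whether each variable is shared or private
- A general feel for parallelism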

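MPI Standard (1)
What is message passing?
- A way of communicating in which processes with locally independent memory send and receive messages (data) in order to share data
- The programmer is responsible for everything needed for parallelization (work assignment, data distribution, management of communication): difficult but highly efficient.
- It can be implemented on a variety of hardware platforms:
  * distributed-memory multiprocessor systems
  * shared-memory multiprocessor systems
  * single-processor systems
What is MPI?
- Message Passing Interface, a de facto standard
- It defines the standard for a data communication library (message passing library) for message-passing parallel programming.
- Goals: portability, efficiency, functionality
  * Hardware vendors can provide a library optimized for their own hardware.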

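MPI Standard (2)
History of MPI
- From the 1980s to the early 1990s, a variety of distributed-memory parallel computing software appeared.
- MPI Forum: started in 1992 by government, academia, and industry out of the need for a standard http://www.mpi-forum.org/
- 1994: MPI-1 standard established (MPI Forum) http://www.mcs.anl.gov/mpi/index.html
- 1997: MPI-2 released
- September 2012: MPI-3.0 released http://www.mpi-forum.org/docs/docs.html
MPI libraries developed to the MPI standard
- MPICH, CHIMP, LAM/MPI, Open MPI, and the MPI implementations of individual hardware vendors (e.g. IBM MPI)
- Open MPI site: http://www.open-mpi.org/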

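MPI Standard (3)
Basic MPI concepts (1)
Process and processor
- MPI assigns work per process.
- Processor : process = 1:1 or 1:many
Communicator
- The set of all processes that are allowed to communicate with one another
Process rank
- An identifier that distinguishes the processes within the same communicator
- If p processes take part in the computation, the ranks are 0, 1, ..., p-1.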

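MPI Standard (4)
Basic MPI concepts (2)
Message [ = data + envelope ]
- Which process sends?
- Where is the data to be sent?
- What kind of data is sent?
- How much data is sent?
- Which process receives?
- Where will it be stored?
- How much must the receiver be prepared to accept?
Tag
- Used to match and distinguish messages
- Allows messages to be handled in order of arrival
- Uses a message buffer
- A wildcard is available: MPI_ANY_TAG
- e.g. when someone sends a message ...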

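MPI Standard (5)
Basic MPI concepts (3)
Point-to-point communication
- Communication between two processes
- One sending process corresponds to one receiving process.
Collective communication
- Several processes take part in the communication at the same time.
- One-to-many, many-to-one, and many-to-many patterns are possible.
- One collective call replaces many point-to-point calls:
  * programming is easier and simpler,
  * errors are less likely,
  * it is optimized and generally faster.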

MPI Standard (6)
The MPI (Message Passing Interface) standard
http://enc.daum.net/dic100/contents.do?query1=20xx221466
- In the definition by Gropp et al 96, MPI "is a message passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation", "MPI includes point-to-point message passing and collective (global) operations, all scoped to a user-specified group of processes."
- MPI is a language-independent communications protocol used to program parallel computers.
- MPI-1 consists of 127 functions.
- Most MPI implementations consist of library functions callable from Fortran, C, and C++.
- An ordinary Fortran, C, or C++ program simply calls the appropriate function when it needs to communicate with another process.
- Efforts are under way to make MPI usable from Python, OCaml, and Java as well.

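MPI Standard (7)
Basic structure of an MPI program
1. #include <mpi.h>
   - declares the prototypes of the MPI functions
   - defines macros, MPI-related arguments, and data types
2. Variable declarations
3. Initialization of the MPI environment
   - MPI_Init()
   - MPI_Comm_rank()
   - MPI_Comm_size()
4. Computation, calling MPI communication functions as needed
5. Shutdown of the MPI environment
   - MPI_Finalize()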

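MPI Standard (8)
About MPI functions
Name and form of MPI functions
- MPI_Xxxxxx(parameter, ...);
- The name starts with MPI_ and the first letter after it (X) is uppercase.
Calling MPI functions and their return values
- Call examples:
    err = MPI_Init(&argc, &argv);
    if (err == MPI_SUCCESS) {
        :
    }
  or
    MPI_Init(&argc, &argv);
- Return value: in the first example it is returned in err, which holds MPI_SUCCESS if the call succeeded.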

MPI Standard (9)
Basic MPI functions (1)
int MPI_Init(int *argc, char ***argv);
- Initializes the MPI environment
- Must be called exactly once, before any other MPI routine
- Defines the communicator MPI_COMM_WORLD
- Call example: MPI_Init(&argc, &argv);
int MPI_Comm_rank(MPI_Comm comm, int *rank);
- Returns in *rank the rank of the calling process within communicator comm
- With p processes, the rank is a value from 0 to p-1.
- Call example: MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

MPI Standard (10)
Basic MPI functions (2)
int MPI_Comm_size(MPI_Comm comm, int *size);
- Returns the total number of processes in communicator comm
- Call example: MPI_Comm_size(MPI_COMM_WORLD, &p);
int MPI_Finalize(void);
- Cleans up all MPI data structures
- Must be called once, as the last MPI call, by every process
- Does not terminate the process
- Call example: MPI_Finalize();

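MPI Standard (11)
MPI message = data + envelope
Data
- Buffer: the name of the variable holding the data to receive (send)
- Count: the number of data items to receive (send)
- Data type: the type of the data to receive (send)
Envelope
- Receiver (sender): rank of the receiving (sending) process
- Tag: an integer that identifies the data being sent (received)
- Communicator: the process group containing the sending and receiving processes
MPI data types
- Basic types and derived types
- Derived types can be defined freely by the user.
- The data types used for sending and receiving must match.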

MPI Standard (12)
[Table: basic MPI data types]

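MPI Standard (13)
Point-to-point communication [1/3]
- Exactly two processes take part.
- It takes place only within a communicator.
- The communicator and rank identify the sending and receiving processes.
Completion of communication
- Means that the memory locations used for the message transfer can be accessed safely
Blocking and non-blocking communication
- Blocking: the routine returns after the communication has completed.
- Non-blocking: the routine returns as soon as the communication has started, regardless of completion; completion is checked later.
- Communication modes are classified by the condition required for completion.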

MPI Standard (14)
Point-to-point communication [2/3]
[Table: communication modes]

MPI Standard (15)
Point-to-point communication [3/3]
int MPI_Send(void *message, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
- Sends count items of type datatype, starting at message, to the process with rank dest
- Return value: error code
- e.g. MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
int MPI_Recv(void *message, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
- Receives a message sent by the source process with the given tag
- source: a process rank, or MPI_ANY_SOURCE
- tag: the tag used by the sender, or MPI_ANY_TAG
- e.g. MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
There are many other point-to-point communication functions.

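MPI Standard (16)
Collective communication [1/3]
- A group of processes takes part.
- More convenient than an implementation built from point-to-point calls, and better in performance
Collective communication routines
- Must be called by all processes in the communicator
- Synchronization is not guaranteed.
- No non-blocking routines (prior to MPI-3.0)
- No tags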

MPI Standard (17)
Collective communication [2/3]
[Figure]

MPI Standard (18)
Collective communication [3/3]
int MPI_Bcast(void *message, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
- The sender and the receivers all use the same call.
- The process whose rank is root sends; all other processes receive.
- e.g. MPI_Bcast(&uplimit,1,MPI_LONG,0,MPI_COMM_WORLD);
int MPI_Reduce(void *operand, void *result, int count, MPI_Datatype datatype, MPI_Op operator, int root, MPI_Comm comm)
- Applies the operation given by operator to the data in operand across all processes and returns the result in result on the root process
- Many operations are available for operator: MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, etc.
- e.g. MPI_Reduce(&sum,&gsum,1,MPI_DOUBLE,MPI_SUM,0, MPI_COMM_WORLD);

MPI Standard (19)
Serial program that adds the integers from 1 to n
    #include <stdio.h>
    int main(int argc, char* argv[])
    {
        long int i,last;
        double sum;
        printf("this program computes the sum from 1 to a given number.\n");
        printf("enter a big number : ");
        scanf("%ld",&last);
        sum=0.;
        for (i=1; i<=last; i++) {
            sum=sum+i;
        }
        printf("Sum from 1 to %ld is %f\n",last,sum);
        return 0;
    }

MPI Standard (20)
Parallel program that adds the integers from 1 to n
    #include <stdio.h>
    #include "mpi.h"
    int main(int argc, char* argv[])
    {
        int MyRank, size;
        long int i, uplimit, mok, nam, start, last, count[16]; /* at most 16 processes */
        double sum, gsum;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        /* only rank 0 reads n, then broadcasts it to everyone */
        if (MyRank==0) {
            printf("enter a big number : ");
            fflush(stdout);
            scanf("%ld", &uplimit);
        }
        MPI_Bcast(&uplimit,1,MPI_LONG,0,MPI_COMM_WORLD);
        /* mok = quotient, nam = remainder: the first nam ranks get one extra number */
        mok=uplimit/size;
        nam=uplimit % size;
        for (i=0; i<size; i++) {
            count[i]=mok;
            if (i < nam) count[i]=count[i]+1;
        }
        /* each rank derives its own range from its rank */
        start=1;
        for (i=0; i<MyRank; i++) start=start+count[i];
        last=start+count[MyRank]-1;
        printf(" MyRank = %d : %ld - %ld\n",MyRank,start,last);
        /* partial sum, then combine the partial sums on rank 0 */
        sum=0.;
        for (i=start; i<=last; i++) sum=sum + (double) i;
        MPI_Reduce(&sum,&gsum,1,MPI_DOUBLE,MPI_SUM,0, MPI_COMM_WORLD);
        if (MyRank==0) printf(" Sum = %f\n",gsum);
        MPI_Finalize();
        return 0;
    }

Related Topics & Questions
Related topics
- GPU computing
- GPGPU (Wikipedia)
  * Korean: http://enc.daum.net/dic100/contents.do?query1=10xx236094
  * English: http://enc.daum.net/dic100/contents.do?query1=20x1268939
- CUDA
  * Miruware: http://www.miruware.com/
  * Korea CUDA User Group (KCUG): http://cafe.daum.net/kcug
- OpenCL: http://enc.daum.net/dic100/contents.do?query1=10xx281093
Questions?