Parallel Programming Basics with MPI: Point-to-Point Communication
Hongsuk Yi, KISTI Supercomputing Center
2009-11-09
What is MPI?
MPI = Message Passing Interface
- A standardized data-communication library for parallel programming
- The MPI-1 standard was established by the MPI Forum in 1994
[Figure: distributed-memory model - each process (Procs) has its own Memory, and the processes are connected by a Network]
MPI Forum
The MPI Forum defines the MPI standard:
- MPI 1.0 : June 1994
- MPI 1.1 : June 12, 1995
- MPI-2   : July 18, 1997
Goals and Scope of MPI
Goals of MPI:
- Guarantee source-code portability
- Allow efficient implementations
MPI includes many features:
- Support for heterogeneous parallel architectures (e.g., Grid environments)
MPI-2:
- Adds several important extensions
- Leaves MPI-1 unchanged
MPI_COMM_WORLD
- A communicator is a handle that denotes a group of processes that can communicate with each other
- Every MPI communication routine takes a communicator argument
- Only processes that share a communicator can communicate with each other
- MPI_COMM_WORLD is the predefined communicator that contains all processes available at program start
- It is defined when MPI_Init is called
[Figure: MPI_COMM_WORLD containing ranks 0-15, spread over sockets 1-4 of node tachyon189]
Message Passing
- In message passing, processes that each have their own local memory share data by exchanging messages (data)
- The programmer is responsible for everything: work decomposition, data distribution, and communication; coding is harder, but the model is very flexible
[Figure: several programs, each with local memory, connected by a communication network]
MPI Messages
- MPI data: an array of elements of a specific MPI data type
- The send and receive data types must match

  MPI Data Type     C Data Type
  MPI_CHAR          signed char
  MPI_SHORT         signed short int
  MPI_INT           signed int
  MPI_LONG          signed long int
  MPI_FLOAT         float
  MPI_DOUBLE        double
  MPI_LONG_DOUBLE   long double
Basic MPI Concepts
Processes and processors:
- MPI assigns work to processes
- processor : process = one-to-one or one-to-many
A message must answer:
- Which process sends?
- Where is the data to be sent?
- What kind of data is sent?
- How much is sent?
- Which process receives?
- Where should the received data be stored?
- How much should the receiver be prepared to accept?
Basic MPI Concepts
Tag:
- Used to match and distinguish messages
- Lets arriving messages be processed in the intended order
- Wildcards may be used
Communicator:
- The group of processes that are allowed to communicate with each other
Rank:
- An identifier that distinguishes the processes within a communicator
MPI Header Files
Include the header:
- Fortran: INCLUDE 'mpif.h'
- C: #include <mpi.h>
The header declares the prototypes of the MPI subroutines and functions and defines macros, MPI constants, and data types.
Basic MPI Concepts
Point-to-point communication:
- Communication between exactly two processes
- One sending process is matched with one receiving process
Collective communication:
- Several processes participate in the communication at the same time
- One-to-many, many-to-one, and many-to-many patterns are possible
- Replaces several point-to-point calls with a single collective call
- Less error-prone and, being optimized, usually faster
MPI References
- MPI: A Message-Passing Interface Standard (1.1, June 12, 1995)
- MPI-2: Extensions to the Message-Passing Interface (July 18, 1997)
- MPI: The Complete Reference
- Using MPI: Portable Parallel Programming with the Message-Passing Interface
- Using MPI-2: Advanced Features of the Message-Passing Interface
- Parallel Programming with MPI
Parallel Programming Basics with MPI
The 6 Essential MPI Commands
  int MPI_Init(int *argc, char ***argv)
  int MPI_Finalize(void)
  int MPI_Comm_size(MPI_Comm comm, int *size)
  int MPI_Comm_rank(MPI_Comm comm, int *rank)
  int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
  int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
Starting and Ending
MPI_Init(int *argc, char ***argv)
- Must be called exactly once, before any other MPI routine
- Place it right after the variable declarations
- Initializes the MPI environment
MPI_Finalize(void)
- Placed at the very end of the code
- Cleans up all MPI data structures
- Must be called exactly once, as the last MPI call, on every process
Setting Up MPI Processes
MPI_Comm_size(MPI_Comm comm, int *size)
- Returns the total number of processes in the communicator
MPI_Comm_rank(MPI_Comm comm, int *rank)
- Returns the ID of the calling process
- The rank identifies a process within its communicator
- With n processes, ranks 0 through n-1 are assigned: 0 <= rank <= size-1
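Putting the six calls together, here is a minimal sketch of a complete program (the file name and the printed text are illustrative choices, not from the course material):

  /* hello_mpi.c -- minimal sketch using only the essential calls.
   * Build: mpicc hello_mpi.c -o hello_mpi ; run: mpirun -np 4 ./hello_mpi */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);                 /* first MPI call               */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes?          */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which one am I? (0..size-1)  */

      printf("Hello from rank %d of %d\n", rank, size);

      MPI_Finalize();                         /* last MPI call                */
      return 0;
  }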
Message Passing: Send
MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
- buf      : starting address of the send buffer
- count    : number of elements to send
- datatype : MPI data type of each element (handle)
- dest     : rank of the receiving process
- tag      : message tag
- comm     : MPI communicator (handle)
Example: MPI_Send(&x, 1, MPI_DOUBLE, manager, me, MPI_COMM_WORLD)
Message Passing: Receive
MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
- buf      : starting address of the receive buffer
- count    : number of elements to receive
- datatype : MPI data type of each element (handle)
- source   : rank of the sending process
- tag      : message tag
- comm     : MPI communicator (handle)
- status   : holds information about the received message (in Fortran, an integer array status(MPI_STATUS_SIZE))
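The send and receive shown above fit together as in this minimal C sketch (the value 3.14, the tag 99, and the file name are illustrative choices), assuming the job runs with at least two processes:

  /* pingpong_min.c -- rank 0 sends one double to rank 1. */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, tag = 99;
      double x = 3.14;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          MPI_Send(&x, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&x, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
          printf("rank 1 received %f from rank %d\n", x, status.MPI_SOURCE);
      }

      MPI_Finalize();
      return 0;
  }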
Blocking Receive
- The receiver may use wildcards:
  - MPI_ANY_SOURCE : accept a message from any process
  - MPI_ANY_TAG    : accept a message with any tag
- The receiver's status argument records the actual sending process and tag
- MPI_GET_COUNT (MPI_Get_count in C) returns the number of elements in the received message
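A short C sketch of wildcard receives: every rank except 0 sends one integer, and rank 0 receives with MPI_ANY_SOURCE / MPI_ANY_TAG and then queries the status (the tag choice and names are illustrative):

  /* anysource.c -- rank 0 receives from any sender with any tag. */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size, value, nrecv, i;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (rank != 0) {
          value = rank;
          MPI_Send(&value, 1, MPI_INT, 0, rank /* tag */, MPI_COMM_WORLD);
      } else {
          for (i = 1; i < size; i++) {
              MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                       MPI_COMM_WORLD, &status);
              MPI_Get_count(&status, MPI_INT, &nrecv);   /* how many ints arrived */
              printf("got %d int(s) from rank %d, tag %d\n",
                     nrecv, status.MPI_SOURCE, status.MPI_TAG);
          }
      }

      MPI_Finalize();
      return 0;
  }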
Message Passing: Blocking Communication, Points of Caution
- The standard send and receive are blocking calls
- MPI_Recv completes only after the message has been fully received into the buffer
- MPI_Send may or may not block until the message is received; this is implementation-dependent
- Always watch out for deadlock
Deadlock
Code in each MPI process:
  MPI_Ssend(..., right_rank, ...)
  MPI_Recv(..., left_rank, ...)
- The process to the right can never receive a message, so every process blocks in MPI_Ssend
[Figure: 7 processes (ranks 0-6) arranged in a ring, each sending to its right neighbor]
- If the MPI implementation uses a synchronous protocol, the same deadlock also occurs with the standard send (MPI_Send)
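One standard fix is to let a single call handle both directions; the ring sketch below uses MPI_Sendrecv so the cyclic send-before-receive dependency never forms (variable names are illustrative):

  /* ring_sendrecv.c -- deadlock-free ring shift: each rank sends to its right
   * neighbor and receives from its left neighbor in one combined call. */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size, right, left, sendval, recvval, tag = 0;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      right = (rank + 1) % size;             /* neighbor to the right */
      left  = (rank - 1 + size) % size;      /* neighbor to the left  */
      sendval = rank;

      /* The library schedules the transfer, so "send first, then receive"
       * never turns into a cycle of processes waiting on each other. */
      MPI_Sendrecv(&sendval, 1, MPI_INT, right, tag,
                   &recvval, 1, MPI_INT, left,  tag,
                   MPI_COMM_WORLD, &status);

      printf("rank %d received %d from rank %d\n", rank, recvval, left);

      MPI_Finalize();
      return 0;
  }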
Point-to-Point Communication
- Exactly two processes participate in the communication
- Communication takes place only within a communicator
- The communicator and the ranks identify the source and destination processes
[Figure: a communicator containing ranks 0-5, with one rank marked source and another marked destination]
Point-to-Point Communication
Completion of a communication:
- Means the memory locations used in the transfer can be accessed safely again
- Send: the send variable may be reused once the communication has completed
- Receive: the receive variable may be used only after the communication has completed
Blocking vs. non-blocking communication:
- Blocking: the routine returns only after the communication has completed
- Non-blocking: the routine returns as soon as the communication has started, regardless of completion; completion is checked afterwards
Communication Modes

  Mode               Blocking     Non-blocking
  Synchronous send   MPI_SSEND    MPI_ISSEND
  Ready send         MPI_RSEND    MPI_IRSEND
  Buffered send      MPI_BSEND    MPI_IBSEND
  Standard send      MPI_SEND     MPI_ISEND
  Receive            MPI_RECV     MPI_IRECV
Communication Modes
Standard send (MPI_SEND):
- Minimal transfer time
- May block if the implementation falls back to synchronous mode -> same risks as the synchronous send
Synchronous send (MPI_SSEND):
- Risk of deadlock, serialization, and waiting -> idle time
- High latency / best bandwidth
Buffered send (MPI_BSEND):
- Low latency / bad bandwidth
Ready send (MPI_RSEND):
- Never use it, unless you have a 200% guarantee that the matching Recv is already posted, in the current version and all future versions of your code
Synchronous Send
MPI_SSEND (blocking synchronous send)
[Figure: the sending task (S) waits until the receiving task (R) calls MPI_RECV; only then does the data transfer from the source complete]
- Send start: independent of whether the matching receive has been posted
- Transfer: begins once the receiver is ready to receive
- Send completion: the receive has started and the transfer has finished
- The safest send mode; non-local
Ready Send
MPI_RSEND (blocking ready send)
[Figure: the data transfer from the source completes while the receiving task waits in MPI_RECV until its buffer is filled]
- The send starts under the assumption that the matching receive has already been posted
- Sending when the receive is not yet posted is an error
- Advantageous for performance; non-local
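Because MPI_Rsend is only correct when the matching receive has already been posted, programs normally add an explicit handshake first. A hedged sketch of that pattern, assuming two ranks and using an illustrative zero-byte "ready" message (not from the course material):

  /* Ready-send handshake sketch: the receiver posts its receive, then tells the
   * sender it is ready; only after that notification does the sender call MPI_Rsend. */
  #include <mpi.h>

  void exchange_with_rsend(int rank, double *data, int n)
  {
      MPI_Request req;
      MPI_Status  status;
      int ready = 0, tag_data = 1, tag_ready = 2;

      if (rank == 1) {                                   /* receiver */
          MPI_Irecv(data, n, MPI_DOUBLE, 0, tag_data,
                    MPI_COMM_WORLD, &req);               /* post the receive first   */
          MPI_Send(&ready, 0, MPI_INT, 0, tag_ready,
                   MPI_COMM_WORLD);                      /* zero-byte "I am ready"   */
          MPI_Wait(&req, &status);
      } else if (rank == 0) {                            /* sender */
          MPI_Recv(&ready, 0, MPI_INT, 1, tag_ready,
                   MPI_COMM_WORLD, &status);             /* wait for the handshake   */
          MPI_Rsend(data, n, MPI_DOUBLE, 1, tag_data,
                    MPI_COMM_WORLD);                     /* receive is surely posted */
      }
  }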
Buffered Send
MPI_BSEND (buffered send)
[Figure: the data is copied into a user-supplied buffer and transferred later, while the receiving task waits in MPI_RECV]
- Send start: independent of whether the matching receive has been posted
- Send completion: as soon as the copy into the buffer is finished, regardless of the receive
- The user manages the buffer space directly: MPI_Buffer_attach, MPI_Buffer_detach
- A local send mode
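A minimal sketch of the buffered-send sequence (attach the buffer, send with MPI_Bsend, then detach); the message size and names are illustrative:

  /* bsend_sketch.c -- buffered send with a user-supplied buffer (assumes >= 2 ranks). */
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, i, n = 1000, bufsize;
      double msg[1000];
      char *buffer;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      for (i = 0; i < n; i++) msg[i] = (double)i;

      if (rank == 0) {
          /* the buffer must hold the message plus MPI's per-message overhead */
          bufsize = n * sizeof(double) + MPI_BSEND_OVERHEAD;
          buffer  = (char *)malloc(bufsize);
          MPI_Buffer_attach(buffer, bufsize);

          MPI_Bsend(msg, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);  /* completes after the local copy */

          MPI_Buffer_detach(&buffer, &bufsize);  /* blocks until buffered messages are delivered */
          free(buffer);
      } else if (rank == 1) {
          MPI_Recv(msg, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
      }

      MPI_Finalize();
      return 0;
  }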
Checklist for Successful Communication
- The sender must specify the receiver's rank correctly
- The receiver must specify the sender's rank correctly
- The communicators must be the same
- The message tags must match
- The receive buffer must be large enough
Non-blocking Communication
A communication is split into three stages:
1. Initiate the non-blocking communication: post the send or the receive
2. Do other work that does not touch the transferred data, overlapping communication with computation
3. Complete the communication: wait or test
- Removes the possibility of deadlock and reduces communication overhead
- Wait: the routine blocks the process until the communication has completed (non-blocking call + wait = blocking call)
- Test: the routine returns true or false depending on whether the communication has completed
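The three stages look like this in C; the sketch posts a non-blocking receive, keeps computing while polling with MPI_Test, and finishes with MPI_Wait (the dummy work loop and all names are illustrative):

  /* overlap_sketch.c -- overlap communication and computation (assumes >= 2 ranks). */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, flag = 0, data = 0;
      double local = 0.0;
      MPI_Request req;
      MPI_Status  status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          data = 42;
          MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
          MPI_Wait(&req, &status);            /* send buffer reusable after this */
      } else if (rank == 1) {
          MPI_Irecv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
          while (!flag) {                     /* overlap: compute while testing  */
              local += 1.0;                   /* stands in for real computation  */
              MPI_Test(&req, &flag, &status);
          }
          printf("rank 1 received %d after %.0f work steps\n", data, local);
      }

      MPI_Finalize();
      return 0;
  }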
Using Point-to-Point Communication
- One-way and two-way communication
- Two-way communication: beware of deadlock (a two-way exchange sketch follows the one-way examples below)
[Figure: one-way - rank 0's sendbuf goes to rank 1's recvbuf; two-way - both ranks send and receive at the same time]
One-Way Communication (1/2)
Blocking send, blocking receive:
  IF (myrank==0) THEN
    CALL MPI_SEND(sendbuf, icount, MPI_REAL, 1, itag, MPI_COMM_WORLD, ierr)
  ELSEIF (myrank==1) THEN
    CALL MPI_RECV(recvbuf, icount, MPI_REAL, 0, itag, MPI_COMM_WORLD, istatus, ierr)
  ENDIF
Non-blocking send, blocking receive:
  IF (myrank==0) THEN
    CALL MPI_ISEND(sendbuf, icount, MPI_REAL, 1, itag, MPI_COMM_WORLD, ireq, ierr)
    CALL MPI_WAIT(ireq, istatus, ierr)
  ELSEIF (myrank==1) THEN
    CALL MPI_RECV(recvbuf, icount, MPI_REAL, 0, itag, MPI_COMM_WORLD, istatus, ierr)
  ENDIF
One-Way Communication (2/2)
Blocking send, non-blocking receive:
  IF (myrank==0) THEN
    CALL MPI_SEND(sendbuf, icount, MPI_REAL, 1, itag, MPI_COMM_WORLD, ierr)
  ELSEIF (myrank==1) THEN
    CALL MPI_IRECV(recvbuf, icount, MPI_REAL, 0, itag, MPI_COMM_WORLD, ireq, ierr)
    CALL MPI_WAIT(ireq, istatus, ierr)
  ENDIF
Non-blocking send, non-blocking receive:
  IF (myrank==0) THEN
    CALL MPI_ISEND(sendbuf, icount, MPI_REAL, 1, itag, MPI_COMM_WORLD, ireq, ierr)
  ELSEIF (myrank==1) THEN
    CALL MPI_IRECV(recvbuf, icount, MPI_REAL, 0, itag, MPI_COMM_WORLD, ireq, ierr)
  ENDIF
  CALL MPI_WAIT(ireq, istatus, ierr)
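For the two-way exchange mentioned two slides earlier, posting the non-blocking receive before the send keeps the exchange deadlock-safe. A C sketch assuming exactly two ranks (names are illustrative):

  /* exchange_sketch.c -- deadlock-safe two-way exchange between ranks 0 and 1:
   * post the non-blocking receive first, then send, then wait. */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, other, tag = 0;
      double sendbuf, recvbuf;
      MPI_Request req;
      MPI_Status  status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      other   = 1 - rank;            /* assumes exactly 2 ranks: 0 <-> 1 */
      sendbuf = (double)rank;

      MPI_Irecv(&recvbuf, 1, MPI_DOUBLE, other, tag, MPI_COMM_WORLD, &req);
      MPI_Send (&sendbuf, 1, MPI_DOUBLE, other, tag, MPI_COMM_WORLD);
      MPI_Wait (&req, &status);

      printf("rank %d received %f from rank %d\n", rank, recvbuf, other);

      MPI_Finalize();
      return 0;
  }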
Lab
1) MPI_Counting3s.c
2) Point-to-point communication performance test
Serial counting3s.c

  #include <stdlib.h>
  #include <stdio.h>
  #include <time.h>

  double dtime();   /* timer supplied separately; stores the wall-clock time via its pointer argument */

  int main(int argc, char **argv)
  {
      int i, j, *array, count = 0;
      const int length = 100000000, iters = 10;
      double stime, etime;

      array = (int *)malloc(length * sizeof(int));
      for (i = 0; i < length; i++)
          array[i] = i % 10;              /* every 10th element is a 3 */

      dtime(&stime);
      for (j = 0; j < iters; j++) {
          for (i = 0; i < length; i++) {
              if (array[i] == 3) { count++; }
          }
      }
      dtime(&etime);

      printf("serial: Number of 3's: %d \t Elapsed Time = %12.8lf (sec)\n",
             count, etime - stime);
      return 0;
  }

Sample run on tachyon190:
  $> ./c3s_serial.x
  serial: Number of 3's: 100000000   Elapsed Time = 3.13399506 (sec)
mpi_counting3s (1/4)
[Figure: the global array (values i % 10) split into chunks of length_per_processor, one chunk per rank 0-4]
1) Rank 0 initializes the global array globalArray[]
2) Work decomposition: each myArray[] chunk has length_per_processor elements
3) Rank 0 sends a myArray chunk to each rank (MPI_Send)
4) Each rank receives its myArray (MPI_Recv) and starts counting 3s
5) Each rank sends its local count back to the master
6) The master adds the counts and prints the final global count
7) Time is measured with MPI_Wtime()
mpi_counting3s.c (2/4)

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      const int length = 100000000, iters = 10;
      int myid, nprocs, length_per_process, i, j, p;
      int *myarray, *garray, root, tag, mycount, gcount;
      FILE *fp;
      double t1, t2;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);

      mycount = 0; root = 0; tag = 0;
      length_per_process = length / nprocs;
      myarray = (int *)malloc(length_per_process * sizeof(int));
mpi_counting3s.c (3/4)

      if (myid == root) {
          garray = (int *)malloc(length * sizeof(int));
          for (i = 0; i < length; i++)
              garray[i] = i % 10;
      }

      if (myid == root) {
          t1 = MPI_Wtime();
          for (p = 0; p < nprocs-1; p++) {
              for (i = 0; i < length_per_process; i++) {
                  j = i + p * length_per_process;
                  myarray[i] = garray[j];
              }
              MPI_Send(myarray, length_per_process, MPI_INT, p+1, tag,
                       MPI_COMM_WORLD);
          }
          /* the master keeps and counts the last chunk itself,
             so that no part of the array is skipped or double-counted */
          for (i = 0; i < length_per_process; i++)
              myarray[i] = garray[i + (nprocs-1) * length_per_process];
      } else {
          MPI_Recv(myarray, length_per_process, MPI_INT, root, tag,
                   MPI_COMM_WORLD, &status);
      }
mpi_counting3s.c (4/4)

      for (j = 0; j < iters; j++) {
          for (i = 0; i < length_per_process; i++) {
              if (myarray[i] == 3) mycount++;
          }
      }

      MPI_Reduce(&mycount, &gcount, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);

      if (myid == root) {
          t2 = MPI_Wtime();
          printf("nprocs=%d Number of 3's: %d Elapsed Time =%12.8lf(sec)\n",
                 nprocs, gcount, t2 - t1);
      }

      MPI_Finalize();
      return 0;
  }

Sample runs:
  $> ./serial.x
  serial=1  Number of 3's: 100000000   Elapsed Time = 3.05412793 (sec)
  $> mpirun -np 10 -machinefile hostname ./parallel.x
  nprocs=10 Number of 3's: 100000000   Elapsed Time = 0.22715100 (sec)
Example: mpi_multibandwidth.c (cases 1-4)

  case 1:  /* Send with Recv */
      MPI_Send(&msgbuf1, n, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
      MPI_Recv(&msgbuf1, n, MPI_CHAR, src, tag, MPI_COMM_WORLD, stats);
  case 2:  /* Send with Irecv */
      MPI_Send(&msgbuf1, n, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
      MPI_Irecv(&msgbuf1, n, MPI_CHAR, src, tag, MPI_COMM_WORLD, &reqs[0]);
      MPI_Wait(&reqs[0], stats);
  case 3:  /* Isend with Irecv */
      MPI_Isend(&msgbuf1, n, MPI_CHAR, dest, tag, MPI_COMM_WORLD, &reqs[0]);
      MPI_Irecv(&msgbuf1, n, MPI_CHAR, src, tag, MPI_COMM_WORLD, &reqs[1]);
      MPI_Waitall(2, reqs, stats);
  case 4:  /* Ssend with Recv */
      MPI_Ssend(&msgbuf1, n, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
      MPI_Recv(&msgbuf1, n, MPI_CHAR, src, tag, MPI_COMM_WORLD, stats);

(LAST REVISED: 12/27/2001, Blaise Barney)
Point-to-Point Communication (cases 5-9)

  case 5:  /* Ssend with Irecv */
      MPI_Ssend(&msgbuf1, n, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
      MPI_Irecv(&msgbuf1, n, MPI_CHAR, src, tag, MPI_COMM_WORLD, &reqs[0]);
      MPI_Wait(&reqs[0], stats);
  case 6:  /* Sendrecv */
      MPI_Sendrecv(&msgbuf1, n, MPI_CHAR, dest, tag,
                   &msgbuf2, n, MPI_CHAR, src, tag, MPI_COMM_WORLD, stats);
  case 7:  /* Issend with Irecv */
      MPI_Issend(&msgbuf1, n, MPI_CHAR, dest, tag, MPI_COMM_WORLD, &reqs[0]);
      MPI_Irecv(&msgbuf1, n, MPI_CHAR, src, tag, MPI_COMM_WORLD, &reqs[1]);
      MPI_Waitall(2, reqs, stats);
  case 8:  /* Issend with Recv */
      MPI_Issend(&msgbuf1, n, MPI_CHAR, dest, tag, MPI_COMM_WORLD, &reqs[0]);
      MPI_Recv(&msgbuf1, n, MPI_CHAR, src, tag, MPI_COMM_WORLD, stats);
      MPI_Wait(&reqs[0], stats);
  case 9:  /* Isend with Recv */
      MPI_Isend(&msgbuf1, n, MPI_CHAR, dest, tag, MPI_COMM_WORLD, &reqs[0]);
      MPI_Recv(&msgbuf1, n, MPI_CHAR, src, tag, MPI_COMM_WORLD, stats);
      MPI_Wait(&reqs[0], stats);
Time and Bandwidth

  t1 = MPI_Wtime();
  MPI_Isend(&msgbuf1, n, MPI_CHAR, dest, tag, MPI_COMM_WORLD, &reqs[0]);
  MPI_Recv(&msgbuf1, n, MPI_CHAR, src, tag, MPI_COMM_WORLD, stats);
  MPI_Wait(&reqs[0], stats);
  t2 = MPI_Wtime();

  thistime = t2 - t1;
  bw = ((double)nbytes * 2) / thistime;   /* factor of 2: the round trip moves nbytes twice */

Measured bandwidth: ~1200 MB/s
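As a rough sanity check (the round-trip time here is inferred from the quoted ~1200 MB/s, it is not given on the slide): with nbytes = 104,857,600 and a round trip of about 0.167 s, bw = 2 x 104,857,600 / 0.167 ≈ 1.26 x 10^9 bytes/s, i.e. roughly 1200 MB/s.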
Output: mpi_multibandwidth.c

  **** MPI/POE Bandwidth Test ****
  Message start size  = 104857600 bytes
  Message finish size = 104857600 bytes
  Incremented by 100000 bytes per iteration
  Roundtrips per iteration = 10
  MPI_Wtick resolution = 1.000000e-05
  ************************************
  task 0 is on tachyon192   partner = 1
  task 1 is on tachyon192   partner = 0
  ************************************
  *** Case 1: Send with Recv     Message size: 104857600   task pair 0-1: 1066.148 avg (MB/sec)
  *** Case 2: Send with Irecv    Message size: 104857600   task pair 0-1: 1063.166 avg (MB/sec)
  *** Case 3: Isend with Irecv   Message size: 104857600   task pair 0-1: 1062.683 avg (MB/sec)
  *** Case 4: Ssend with Recv    Message size: 104857600   task pair 0-1: 1069.341 avg (MB/sec)
  *** Case 5: Ssend with Irecv   Message size: 104857600   task pair 0-1: 1065.343 avg (MB/sec)
  *** Case 6: Sendrecv           Message size: 104857600   task pair 0-1: 1070.993 avg (MB/sec)
  *** Case 7: Issend with Irecv  Message size: 104857600   task pair 0-1: 1065.081 avg (MB/sec)
  *** Case 8: Issend with Recv   (result truncated on the slide)
Project 2: Bandwidth & Latency
- Measure the bandwidth of the multi-core Barcelona chip
[Figure: a tachyon189 node with four quad-core sockets (sockets 1-4); compare communication within a socket and between sockets]
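One way to carry out the measurement is a ping-pong between two ranks placed on chosen cores (e.g., via the MPI launcher's process-binding options). The sketch below is illustrative, not the course's reference code; the message size, repetition count, and names are our own choices:

  /* pingpong_bw.c -- rough ping-pong timing between rank 0 and rank 1.
   * Pin the two ranks to specific cores to compare intra-socket and
   * inter-socket paths. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      const int nbytes = 1 << 20;   /* 1 MB message (shrink it to probe latency) */
      const int reps   = 100;
      int rank, i;
      char *buf = (char *)malloc(nbytes);
      double t0, t1;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      t0 = MPI_Wtime();
      for (i = 0; i < reps; i++) {
          if (rank == 0) {
              MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
          } else if (rank == 1) {
              MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
              MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }
      t1 = MPI_Wtime();

      if (rank == 0) {
          double rtt = (t1 - t0) / reps;                     /* seconds per round trip */
          printf("latency (half round trip): %g us\n", 0.5e6 * rtt);
          printf("bandwidth: %g MB/s\n", 2.0 * nbytes / rtt / 1e6);
      }

      free(buf);
      MPI_Finalize();
      return 0;
  }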
Membench: What to Expect
[Plot: average cost per memory access as a function of stride s - roughly the L1 hit time while the total size fits in L1, rising toward the memory access time once the total size exceeds L1]
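The expected curve comes from a membench-style kernel: walk an array with a given stride and divide the elapsed time by the number of accesses. A minimal serial C sketch (array sizes, strides, and repetition count are illustrative):

  /* membench_sketch.c -- time strided accesses over arrays of different sizes. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  int main(void)
  {
      const int maxsize = 1 << 22;            /* 4M ints = 16 MB */
      int *a = (int *)malloc(maxsize * sizeof(int));
      int size, stride, i, reps;
      volatile int sink = 0;                  /* keep the loop from being optimized away */
      struct timespec t0, t1;

      for (i = 0; i < maxsize; i++) a[i] = 0;

      for (size = 1 << 10; size <= maxsize; size <<= 2) {
          for (stride = 1; stride <= size / 2; stride <<= 2) {
              int naccess = 0;
              clock_gettime(CLOCK_MONOTONIC, &t0);
              for (reps = 0; reps < 10; reps++)
                  for (i = 0; i < size; i += stride) { sink += a[i]; naccess++; }
              clock_gettime(CLOCK_MONOTONIC, &t1);
              double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
              printf("size=%8d ints  stride=%6d  %8.2f ns/access\n",
                     size, stride, ns / naccess);
          }
      }
      free(a);
      return 0;
  }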
Core-Memory Speed Gap: AMD 2350
[Diagram: approximate access costs of 1-2 cycles (L1), 10 cycles (L2), 50 cycles (L3), 500 cycles (memory)]
- L1 (64 KB): ~1.5 ns, 3 cycles
- L2 (512 KB): 15 cycles total = 3 cycles (L1) + 3 cycles (L1-L2) + 9 cycles (L2 only)
- L3 (2 MB): 47 cycles total = 15 cycles (L2) + 9 cycles (L2-L3) + 23 cycles (L3 only)
- Main memory: far slower still
AMD 2350 Membench Results
[Plot: measured access time vs. size and stride - plateaus at L1 ~1 ns (2 cycles), L2 ~8 ns (16 cycles), L3 ~24 ns (48 cycles), main memory ~176 ns (0.17 us); features visible at the 256 B line size and the 4 KB page / TLB reach]
Bandwidth
[Plot: measured memory bandwidth results (figure only on the original slide)]
Q & A