MPI for Physicists


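In-Ho Lee, Korea Research Institute of Standards and Science
Started on March 3, 2003.

What is MPI (Message Passing Interface)?

MPI itself is a standard specification for parallel libraries (it consists of 125 subprograms). An application written against a parallel library that follows MPI is guaranteed source-level portability. MPI is maintained by the MPI Forum, in which about 40 organizations participate. The standardization effort began in 1992 and led to MPI 1.0 (released in 1994); the standard has since been revised up to MPI 2.0. Widely used parallel libraries that follow MPI are LAM/MPI, developed at the Ohio Supercomputer Center, and MPICH, developed at Argonne National Laboratory. In MPI 2.0 the number of processes can change during a run, and parallel I/O is supported.

Message = data + addresses of sender and receiver

Models for parallel programming include the following:
* PVM
* MPI
* OpenMP: the de facto standard for shared-memory parallel computing
* UPC
* HPF

Why MPI?

From the standpoint of someone doing computational physics the short answer is: well, it is said to be the industry standard, so we follow it. For general parallel processing that does not depend on the type of machine, MPI is usually the better choice. There are tools that redesign source code for parallel execution almost automatically, as with OpenMP (kapf90 -conc -psyntax=openmp prog.f), but they work only on SMP machines and are not generally applicable. (An SMP has several processors connected through a bus to one large memory, that is, shared memory. Such machines are necessarily relatively expensive, and the number of processors cannot be extended very far. Examples: Compaq ES40, Sun E10000, HP N-class.) Moreover, automatic approaches of this kind do not parallelize the code the way the user wants. Unlike SMP machines, ordinary clusters do not have all memory connected to all processors. MPI, which is used in the same way from C (C++) and from Fortran, is the foundation of parallel computation. Some users even use PVM and MPI together.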

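Such highly specialized users could well be called parallel-computing specialists. MPI can also be used for distributed computing on geographically dispersed machines (which is to be distinguished from cluster computing). Related implementations: MPICH-GM, MPICH-G2.

Top 10 Reasons to Prefer MPI Over PVM
1. MPI has more than one freely available, quality implementation (LAM, MPICH, CHIMP).
2. MPI defines a 3rd party profiling mechanism.
3. MPI has full asynchronous communication.
4. MPI groups are solid and efficient.
5. MPI efficiently manages message buffers.
6. MPI synchronization protects 3rd party software.
7. MPI can efficiently program MPP and clusters.
8. MPI is totally portable. (It must be usable regardless of the type of machine.)
9. MPI is formally specified.
10. MPI is a standard.

Why is it needed in physics?

Many problems in computational physics become comparatively easy once they are treated with parallel algorithms; of course there are also cases where this does not work. The computer originated from sequential algorithms, and seen in that light times have changed a great deal: we now have to think about parallel algorithms alongside sequential ones. Parallel computation carries a considerable cost, but we do it whenever the gain exceeds that cost.

As you know, physics problems are generally not posed in a form that is easy to parallelize, and many cannot be parallelized at all. In many cases only part of the whole calculation can be parallelized. If x % of the calculation is parallelizable, then using P processors makes the whole calculation about 100/{x/P + (100 - x)} times faster than using one. In a real calculation information has to be exchanged for the parallel run, so the efficiency falls short of this expression: communication between processors lowers the parallel efficiency.

In practice a program is written so that, within one program, each CPU identifies the data assigned to it and processes them independently. This is called SPMD (single program, multiple data): all processes execute the same single program. The data can be decomposed, or different functions can be distributed among the processes. As a result, writing real applications has become considerably harder. In the end, the key question is whether the hot spot, the part that consumes most of the time, can be parallelized.

The development of parallel computers

Since the late 1990s single-processor computers have become hard to find in the list of the world's 500 fastest computers; they have all dropped below rank 500. Building a single ultra-high-performance CPU is quite uneconomical. In other words, using many CPUs of the current level at the same time looks like the better idea.

Parallel computer (Beowulf, summer 1994) {Thomas Sterling and Donald Becker, CESDIS, NASA}
* 16 486 DX4 100 MHz processors
* 16 MB of RAM each, 256 MB (total)
* channel-bonded Ethernet (2 x 10 Mbps)

The first Beowulf-type machine to be listed in the supercomputer performance ranking was Avalon, built from 140 Alpha processors.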

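Avalon: Alpha 21164A, 256 MB/node, Fast Ethernet, 12-port Gigabit Ethernet; 1998 Bell Price/Performance prize (low cost, high efficiency).

The Blue Gene computer aims to use one million CPUs. Blue Gene/L, a test version on the way to that goal, uses 32,768 (= 2^15) CPUs and is the fastest computer in existence at the time of writing. Wow: with 2^15 CPUs things are getting almost out of hand.

In June 2004 Thunder (LLNL), a cluster built from 4096 Intel Itanium2 (Tiger4, 1.4 GHz) processors with a Quadrics interconnect, was ranked second at www.top500.org. Beyond that, cluster-type computers keep enlarging their share of the top 500 supercomputers.

Now that the design and practical use of parallel computers is widespread, virtually every computer center is equipped with parallel computers. When many users run sequential programs, the center only has to assign each of them a suitable CPU, so from the center's point of view the users are supported efficiently. Things are changing on the user side as well. Many supercomputer users believed they would go on solving their own problems on ever larger machines provided to them, but recent developments are not going that way. Parallel algorithms that had once fallen out of use are being revived, and new forms of parallel computing are gaining ground in every group. Even if nothing has changed for the ordinary PC user, research and development projects that rely on computers are undergoing a major change.

Well-known projects that use distributed computing (as distinct from cluster computing) are seti@home and folding@home. This style of computing does not use machines set up in advance (a cluster) but computers volunteered by users outside the research group and connected through the Internet. These projects obtain scientific results from ordinary volunteered computers regardless of whether they run Windows, Mac, Linux or Unix. Because the number of computers is enormous (more than 1000), a large amount of CPU time can be secured.

Is parallelization always necessary? When is it necessary? What are its benefits and costs?

A real parallel calculation inevitably ends up optimized for a particular system. With ordinary Ethernet as the interconnect, communication often takes an excessive amount of time; it can be better simply to recompute something than to communicate it, because the CPUs in use today are fast compared with the speed of communication between computers. Parallel computing will therefore become more widespread only when cheap hardware for fast communication appears. Communication between processors is the core technology of parallel computing. Communication hardware is characterized first by its latency (measured in microseconds), the time spent before any data is actually exchanged; naturally, the shorter the better.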

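Together with latency, the second characteristic of communication hardware is its bandwidth (communication capacity, bits/s): how much information can be transferred at once and how fast. (Think of a phone call. The other party does not pick up the instant you dial, so you have to wait a little: latency. Once they have answered, how much information can be conveyed in a given time depends on the situation: bandwidth.)

Both decomposition of the data and decomposition by function can be considered. It is better to gather data and exchange them in one message than to exchange small pieces many times, because this reduces the latency that is paid for every communication. In the worst case (which happens often) a calculation on two CPUs runs slower than on a single CPU.

High-speed interconnects are expensive and cost about as much as the CPUs themselves. That is, {16 CPUs + Ethernet (slow interconnect)} gives roughly the same quotation as {8 CPUs + Myrinet (a fast interconnect)}. An ordinary wired network connection has a bandwidth of 100 Mbps. One might expect Gigabit Ethernet to help parallel computation because its bandwidth is superior to Fast Ethernet, but in many cases latency prevents much of a benefit. It is fine as long as large amounts of data are not exchanged often. To secure solid parallel efficiency for a general code, expensive hardware (Myrinet) is needed in the end.

               Fast Ethernet     Gigabit Ethernet   Myrinet         Quadrics 2
    latency    120 microsecond   120 microsecond    7 microsecond   0.5 microsecond
    bandwidth  100 Mbps          1 Gbps             1.98 Gbps       1 Gbyte/second

Do you really have to parallelize? If so, make an estimate based on the following two items: load balancing and speedup.

Sending and receiving the desired data at the desired, exact time (the time each CPU needs them) is the core of an MPI implementation. One more point: the algorithm may have to be changed. For the same task there can be algorithms that parallelize and algorithms that do not. Depending on the situation, an algorithm that is somewhat less efficient may still be preferred in parallel computing because it parallelizes cleanly, and there are many such algorithms. If that route has to be taken, one item must be checked: can the algorithm be built so that communication between CPUs is reduced as far as possible and CPU-bound computation dominates? Finally, can the multiprocessor hardware and the software (the program using MPI) raise the so-called speedup to the desired level?

Here speedup means the degree to which the actual total run time is shortened by using many CPUs, as seen in a plot with the number of CPUs on the x axis and, on the y axis, the inverse of the total time of one calculation measured in wall-clock time rather than CPU time. For example, if 8 CPUs finish 7 times faster than a single CPU, the speedup is 7 and the parallel efficiency is 7/8 = 87.5 %. At this rate a week's work can be done within a day. (Basic assumption: a job that needs 24 hours on a single CPU, which can be regarded as a reasonably large calculation or workload.)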
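The speedup and efficiency figures quoted above, and the estimate 100/{x/P + (100 - x)} given earlier for a code that is x % parallelizable, can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, the 95 % figure is an assumed example value, and the estimate ignores communication time, so real runs will do worse.

```python
def observed_speedup(t_serial, t_parallel):
    """Ratio of serial to parallel wall-clock time (not CPU time)."""
    return t_serial / t_parallel


def parallel_efficiency(speedup, ncpu):
    """Fraction of the ideal speedup (= ncpu) that was actually reached."""
    return speedup / ncpu


def estimated_speedup(x_percent, ncpu):
    """Estimate 100 / (x/P + (100 - x)) when x % of the work is parallel."""
    return 100.0 / (x_percent / ncpu + (100.0 - x_percent))


# Example from the text: 8 CPUs finish 7 times faster than one CPU.
s = observed_speedup(7.0, 1.0)        # e.g. 7 days versus 1 day
print(s, parallel_efficiency(s, 8))   # 7.0 0.875

# Assumed example: a code that is 95 % parallelizable, communication ignored.
for ncpu in (2, 8, 64):
    print(ncpu, round(estimated_speedup(95.0, ncpu), 2))
```

Even with 95 % of the work parallelized, the estimate can never exceed 100/(100 - 95) = 20, however many CPUs are added, which is why the remaining serial part and the hot spot matter so much.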

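(Many people are satisfied with this, but there are situations where it is not enough. In the end the problem is the size of the job, the compute time needed for a single run. Another kind of problem is a program that needs a great deal of memory; some jobs cannot be handled on one CPU at all.) A result like this can be called a very successful parallel calculation. Of course it is possible only when the algorithm allows it. Technically (in terms of hardware), the performance also depends decisively on whether a high-speed network is used.

It sounds absurd, but there are cases where a run on 2 CPUs takes longer in total than a run on 1 CPU. Excessive communication between CPUs is holding back the fast CPUs. It cannot get worse than this: it is the worst case in parallel computing, and one where a fast interconnect is indispensable. What has to be checked once more at this point is load balancing: how evenly is the work distributed over the CPUs, and do the CPU-bound parts finish at nearly the same time? Every processor involved in a communication knows exactly when, with what data type and with what size it must receive or send. Each CPU then silently carries out its calculations, truly independently. Some CPUs, however, may have been assigned little work and sit idle after finishing, while others are still computing. Considering how fast CPUs are, this is a considerable waste of resources, because the run time is determined by the single slowest CPU (or the one that has the most work relative to its capability). In other words, the better the load balancing, the more efficiently the parallel calculation proceeds.

One of the surest checks is to go into each compute node and analyze the compute time of the processes that are running. Log into each node through ssh and use the top command to look at the CPU time used so far on each node. Confirm that the CPU time on the nodes increases along with the real elapsed time (wall clock). Even if the load balancing looks good (the CPU time increases evenly on all nodes), you must also confirm that the CPU time grows in step with real time. Be aware that time spent waiting for results from a particular node is bad; if that node can compute during this time, it should. For whatever reason, it is not good for the computation on a node to stand idle. There is one thing to consider, though: anything that takes longer to communicate than to compute is better computed on the spot.

In top the CPU usage is displayed in %. A process that is not computing shows 0.0 %. During intensive computation it naturally shows something like 99.0 %. During communication various intermediate values appear. The longer the high CPU occupancy persists after rising from 0.0 to 99.0, the better the CPU-bound computation is going. Make the program behave this way as far as possible to obtain good parallel efficiency.

Below is an example of a parallel calculation that uses two CPUs on one node. (Similar data can be seen on the other nodes.) Distributing work over many nodes is not the goal; we do parallel computing only in order to finish sooner in wall-clock time. Of course fast processing requires that the work be shared out well among the nodes. Dividing the work matters, but what matters more is computing fast.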

    PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
    ihlee M 1380 S :27 action_lbfgs.x
    ihlee M 1380 S :27 action_lbfgs.x
    ihlee R :00 top

Naturally, when one executable runs a CPU-bound calculation on one CPU, the CPU occupancy is 99.0 % or higher.

    #!/bin/csh -f
    set nodes = (hp1 hp2 hp3 hp4 hp5 hp6 hp7 hp8 \
    hp9 hp10 hp11 hp12 hp13 hp14 hp15 hp16 \
    hp17 hp18 hp19 hp20 hp21 hp22 hp23 hp24 \
    hp25 hp26 hp27 hp28 hp29 hp30 hp31 hp32 \
    hp33 hp34 hp35 hp36 hp37 hp38 hp39 hp40 \
    hp41 hp42 hp43 hp44 hp45 hp46 hp47 hp48 \
    hp49 hp50 hp51 hp52 hp53 hp54 hp55 hp56 \
    hp57 hp58 hp59 hp60 hp61 hp62 hp63 hp64 \
    hp65 hp66 hp67 hp68 hp69 hp70 hp71 hp72 \
    hp73 hp74 hp75 hp76 hp77 hp78 hp79 hp80 \
    hp81 hp82 hp83 hp84 hp85 hp86 hp87 hp88 \
    hp89 hp90 hp91 hp92 hp93 hp94 hp95 hp96 \
    hp97 hp98 hp99 hp100 hp101 hp102 hp103 hp104 \
    hp105 hp106 hp107 hp108 hp109 hp110 hp111 hp112 \
    hp113 hp114 hp115 hp116 hp117 hp118 hp119 hp120 \
    hp121 hp122 hp123 hp124 hp125 hp126 hp127 hp128 )
    # run the given command on every node in turn
    foreach n ($nodes)
      echo ' ' $n ' '
      rsh $n $*
    end

If the calculation is spread over 61 nodes, the check described above is by no means easy. In that case the script above can be used (call it pexe, with the nodes named hp* as above) with a command like the one below, issued while the parallel calculation is running. It lets you check the CPU time actually consumed on each node. If the result looks like the following, the load balancing can be said to be good. The time needed to enter the nodes through rsh and to print is included, so the error this causes is ignored here. If exactly the same time were consumed on every node, the times would come out in ascending order. Serious load-balancing defects can be caught quickly in this way, both while the program is being developed and while it is being applied.

    $ pexe ps | grep admd > summary_file
    10487 ? 00:08:04 admd.x
    5696 ? 00:08:04 admd.x
    5174 ? 00:08:05 admd.x
    5166 ? 00:08:05 admd.x
    5159 ? 00:08:05 admd.x

    5150 ? 00:08:05 admd.x
    5150 ? 00:08:05 admd.x
    5150 ? 00:08:05 admd.x
    5150 ? 00:08:05 admd.x
    5223 ? 00:08:06 admd.x
    9241 ? 00:08:06 admd.x
    5502 ? 00:08:06 admd.x
    5339 ? 00:08:06 admd.x
    5392 ? 00:08:06 admd.x
    5355 ? 00:08:06 admd.x
    5639 ? 00:08:07 admd.x
    5336 ? 00:08:07 admd.x
    12646 ? 00:08:07 admd.x
    5296 ? 00:08:07 admd.x
    5543 ? 00:08:08 admd.x
    8310 ? 00:08:08 admd.x
    5790 ? 00:08:08 admd.x
    5360 ? 00:08:08 admd.x
    5360 ? 00:08:08 admd.x
    5360 ? 00:08:08 admd.x
    3118 ? 00:08:08 admd.x
    5336 ? 00:08:09 admd.x
    5334 ? 00:08:09 admd.x
    5334 ? 00:08:09 admd.x
    5334 ? 00:08:09 admd.x
    9306 ? 00:08:09 admd.x
    5736 ? 00:08:09 admd.x

As in the case above, even if each node computed for only 8 minutes, 8 minutes/node x 61 nodes gives a total CPU time of 488 minutes. That is, the calculation consumed 488 minutes (about 8.1 hours) of CPU time in 8 minutes. In other words, when the parallel efficiency is high, a job of about 8 hours can be completed in 8 minutes. That can fairly be called high efficiency. Fees for computer usage are normally based on the total CPU time consumed. Since a lot of CPU time was used per unit of real time, it is only natural that a lot has to be paid.
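The check described above (collect the per-process CPU times, then look at how evenly they are spread and at the total) can be automated once the ps lines are saved in a file. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the input is assumed to be in the `PID TTY TIME CMD` form shown above, and only four of the listed lines are used as sample data.

```python
def cpu_seconds(hms):
    """Convert an 'HH:MM:SS' CPU-time string from ps into seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return 3600 * h + 60 * m + s


def load_balance_report(ps_lines):
    """Summarize CPU times from lines such as '5174 ? 00:08:05 admd.x'."""
    times = [cpu_seconds(line.split()[2]) for line in ps_lines if line.strip()]
    slowest, fastest = max(times), min(times)
    return {
        "processes": len(times),
        "total_cpu_minutes": sum(times) / 60.0,
        "spread_seconds": slowest - fastest,
        # relative imbalance: 0.0 means every process used the same CPU time
        "imbalance": (slowest - fastest) / slowest,
    }


# Four of the lines listed above, used as sample input.
sample = [
    "10487 ? 00:08:04 admd.x",
    "5174 ? 00:08:05 admd.x",
    "5223 ? 00:08:06 admd.x",
    "5736 ? 00:08:09 admd.x",
]
print(load_balance_report(sample))
```

A spread of a few seconds out of eight minutes, as in the listing, corresponds to an imbalance of about 1 %; a much larger value would point to the load-balancing defects discussed earlier.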

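[Figure: example of a measured speedup as a function of the number of nodes; the diagonal marks the ideal speedup.]

The figure shows how much faster the calculation becomes in wall-clock time. The example is a calculation that takes about 30 minutes (wall clock) on a single node, and it can be taken as a case where the parallelization works quite well. For an embarrassingly parallel algorithm the speedup approaches the ideal value (the diagonal). The so-called observed speedup is the ratio of (wall-clock time of serial execution) to (wall-clock time of parallel execution). That is, it is defined as the gain in wall-clock time, not in CPU time.

What does the usual structure look like?

Of the 125 MPI functions, quite a few programs use only the following 6. These are very happy calculations, and many real applications get by with just these functions.

MPI_INIT: initializes the MPI environment. There is essentially nothing for the user to change. It is called on all CPUs.

MPI_COMM_SIZE: returns the number of processors in use. There is essentially nothing for the user to change.

MPI_COMM_RANK: returns the number of the current CPU (also called its rank).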

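With nproc processors the possible rank values are 0, 1, 2, 3, ..., nproc-1. There is essentially nothing for the user to change.

Communication between two processors: using the rank values, a process checks its own processor number and sends the prepared data to the desired processor. In the same way a process checks its own number and receives the data that are being sent to it. Of course, parallel computing is a "rigged card game": we know in advance which processor the data come from and which processor has to receive them. If information is sent to a particular node and that node does not receive it, nothing works; it must be received before the next steps can proceed. So while programming you must check whether a deadlock can occur. (A deadlock arises when at least two CPUs each keep waiting for data that the other has yet to send.) Communication, dependence on the order of the computation, synchronization, and the check for deadlock are the main concerns of parallel programming. Usually parallel programming is started after the sequential program has been completed and optimized.

MPI_SEND: used to send data to the desired processor. The user's specific purpose enters here (desired data type, size, ...).

MPI_RECV: used to receive data from the desired processor. The user's specific purpose enters here (desired data type, size, ...).

MPI_FINALIZE: terminates the MPI environment. There is essentially nothing for the user to change. It is called on all CPUs.

The MPI functions come in a Fortran version and a C version. Their concrete form differs because of the characteristics of each language, but what they do is essentially the same. This is how they look in Fortran. As is well known, Fortran does not distinguish between lower and upper case, so mpi_init and MPI_init refer to the same routine.

    program test
    USE important_module, ONLY : variables, sub_program_names
    implicit none
    include "mpif.h"
    integer istatus(MPI_STATUS_SIZE)
    ! MPI_STATUS_SIZE is already defined in the file pulled in by the include statement above
    integer nproc,myid,ierr,idestination,isource,iroot,kount
    INTEGER itemp,itemq,irate
    CHARACTER*8 fnnd ; CHARACTER*10 fnnt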

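    ! ...
    ! message = data + addresses of sender and receiver
    ! communicator = a set of processes that can communicate with each other (it is an MPI handle).
    ! MPI_COMM_WORLD is the default communicator; it is defined in the header file.
    ! The user can create communicators that consist of selected processes only.
    ! ...
    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD,nproc,ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,myid,ierr)
    if(myid == 0)then      ! -----[ PROCESS ID = 0
       CALL DATE_AND_TIME(date=fnnd,time=fnnt)
       write(6,'(1x,a10,2x,a8,2x,a10)') 'date,time ', fnnd,fnnt
       CALL SYSTEM_CLOCK(itemp,irate)
    end if                 ! -----] PROCESS ID = 0
    if(myid == 0)then      ! -----[ PROCESS ID = 0
    ! various inputs
    end if                 ! -----] PROCESS ID = 0
    ! when part of the information read in has to be "broadcast" to all nodes
    kount=1 ; iroot=0
    call MPI_BCAST(natom,kount,MPI_INTEGER,iroot,MPI_COMM_WORLD,ierr)

myid is the processor number. (It takes one of the values 0, 1, 2, 3, ..., nproc-1 and is different on every node.) nproc is the number of processors that are currently alive. (It has the same value on all nodes.) That is, under SPMD all computers execute the same program, so the number of live computers (nproc) is the same everywhere, while each computer has its own number (myid). The division of labor for a given problem is therefore designed in terms of nproc and myid. Naturally, memory can be allocated differently on each node or with the same size on all of them. Declarations have to be made identically on all nodes, for example real*8, allocatable :: abd(:,:,:). It can also happen that memory is allocated on one node only. Because the work has to be designed node by node, parallel programming is harder than sequential programming and needs more testing. In general the sequential program is finished first and the parallel program is written afterwards.

    itag=19
    idestination=1
    call MPI_Send(real_array_user,n_array_length,MPI_REAL8,idestination,itag,MPI_COMM_WORLD,ierr)

The destination of the real*8 data has to be specified, and the size of the data is also specified on the sending side. Since the information is sent from a particular node, this routine has to be called on that particular node.

    itag=19
    isource=0
    call MPI_Recv(real_array_user,n_array_length,MPI_REAL8,isource,itag,MPI_COMM_WORLD,istatus,ierr)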

The receiving side must know the type and size of the data and where they came from. Note that MPI_REAL (real, single precision) and MPI_REAL8 (real, double precision) are distinct. It is better to use MPI_DOUBLE_PRECISION than MPI_REAL8, because it also works with LAM; MPI_REAL8 is an optional type that was available only with MPICH.

Working in master/slave form, with the work divided among the nodes: myid=0 distributes and collects the work, and it also does part of the work (or an equal share). The total number of work units is (np+1) and the number of nodes used is nproc (> 1). The nodes myid=1,2,...,nproc-1 each do the same amount of work. Depending on the situation, node myid=0 does less work than these nodes, although it may also handle an equal amount.

    nblk=(np+1)/nproc+1 ; if( np+1 == (nblk-1)*nproc) nblk=nblk-1
    allocate(wk_input(natom,3,nblk),wk_output(natom,3,nblk),wk_output2(nblk))
    wk_input=0.0d0 ; wk_output=0.0d0 ; wk_output2=0.0d0
    if(myid == 0)then      ! -----[ PROCESS ID = 0
    ! split the information and send the pieces out
       do loop=0,nproc-2
          jj=0
          do j=(loop)*nblk,(loop+1)*nblk-1
             jj=jj+1 ; wk_input(:,:,jj)=qq(:,:,j)
          end do
          kount=3*natom*nblk ; idest=loop+1 ; itag=1
          call MPI_SEND(wk_input,kount,MPI_DOUBLE_PRECISION,idest,itag,MPI_COMM_WORLD,ierr)
       end do
    ! do the share of the work assigned to myid=0
       do j=(nproc-1)*nblk,np
          call sma_energy_force(qq(1,1,j),force(1,1,j),vofqj(j))
       end do
    ! receive, piece by piece, the results sent back from myid /= 0
       do loop=0,nproc-2
          kount=3*natom*nblk ; isour=loop+1 ; itag=3
          call MPI_RECV(wk_output,kount,MPI_DOUBLE_PRECISION,isour,itag,MPI_COMM_WORLD,istatus,ierr)
          jj=0
          do j=(loop)*nblk,(loop+1)*nblk-1
             jj=jj+1 ; force(:,:,j)=wk_output(:,:,jj)
          end do
          kount=nblk ; isour=loop+1 ; itag=4
          call MPI_RECV(wk_output2,kount,MPI_DOUBLE_PRECISION,isour,itag,MPI_COMM_WORLD,istatus,ierr)
          jj=0
          do j=(loop)*nblk,(loop+1)*nblk-1
             jj=jj+1 ; vofqj(j)=wk_output2(jj)
          end do
       end do
    else                   ! ----- PROCESS ID /= 0
    ! receive the information sent from myid=0
       kount=3*natom*nblk ; isour=0 ; itag=1
       call MPI_RECV(wk_input,kount,MPI_DOUBLE_PRECISION,isour,itag,MPI_COMM_WORLD,istatus,ierr)
    ! do the assigned work on myid /= 0
       do jj=1,nblk
          call sma_energy_force(wk_input(1,1,jj),wk_output(1,1,jj),wk_output2(jj))
       end do
    ! send the results to myid=0 (this is the send on myid /= 0)
       kount=3*natom*nblk ; idest=0 ; itag=3
       call MPI_SEND(wk_output,kount,MPI_DOUBLE_PRECISION,idest,itag,MPI_COMM_WORLD,ierr)
       kount=nblk ; idest=0 ; itag=4
       call MPI_SEND(wk_output2,kount,MPI_DOUBLE_PRECISION,idest,itag,MPI_COMM_WORLD,ierr)
    end if                 ! -----] PROCESS ID = 0

    ! elapsed (or wall) clock : DOUBLE PRECISION MPI_WTIME()
    t1=MPI_WTIME()         ! takes no input argument; the start time differs from node to node
    ! ... code to be timed ...
    t2=MPI_WTIME()
    if(myid == 0) write(6,*) t2-t1,' sec'
    if(myid == 0)then      ! -----[ PROCESS ID = 0
    ! various outputs
    end if                 ! -----] PROCESS ID = 0
    if(myid == 0)then      ! -----[ PROCESS ID = 0
       CALL SYSTEM_CLOCK(itemq)
       write(6,'(2e15.4,2x,a9)') float(itemq-itemp)/float(irate)/60.,float(itemq-itemp)/float(irate)/3600.,' min or h'
    end if                 ! -----] PROCESS ID = 0
    call MPI_Finalize(ierr)
    stop
    end program test

The function MPI_Wtime() can be used for timing. It returns a double precision value and is used to measure wall-clock time. It has a resolution at the microsecond level and is available from C, C++ and Fortran, regardless of the type of machine.

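    s1=MPI_WTIME()
    ! ...
    s2=MPI_WTIME()
    ! s2-s1 is the elapsed time in seconds

    subroutine equal_load(n1,n2,nproc,myid,istart,ifinish)
    implicit none
    integer nproc,myid,istart,ifinish,n1,n2
    integer iw1,iw2
    ! iw1 = base block length, iw2 = number of ranks that get one extra element
    iw1=(n2-n1+1)/nproc ; iw2=mod(n2-n1+1,nproc)
    istart=myid*iw1+n1+min(myid,iw2)
    ifinish=istart+iw1-1 ; if(iw2 > myid) ifinish=ifinish+1
    print*, n1,n2,myid,nproc,istart,ifinish
    return
    end

    program equal_load_sum
    implicit none
    include 'mpif.h'
    integer nn
    real*8, allocatable :: aa(:)
    integer nproc,myid,ierr,istart,ifinish
    integer i
    real*8 xsum,xxsum
    nn=10000
    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
    call equal_load(1,nn,nproc,myid,istart,ifinish)
    ! not only the index range but also the memory is divided, node by node
    allocate(aa(istart:ifinish))
    do i=istart,ifinish
       aa(i)=float(i)
    end do
    xsum=0.0d0
    do i=istart,ifinish
       xsum=xsum+aa(i)
    end do
    call MPI_REDUCE(xsum,xxsum,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierr)
    xsum=xxsum
    if(myid == 0)then
       write(6,*) xsum,' xsum'
    end if
    deallocate(aa)
    call MPI_FINALIZE(ierr)
    end program equal_load_sum

The scheme above is called a block partition. The cyclic partition is the contrasting scheme. The loop

    do i=n1,n2

is then changed into

    do i=n1+myid,n2,nproc

Block-cyclic partition: the indices are processed in chunks of size nblk.

    do j=n1+myid*nblk,n2,nproc*nblk
       do i=j,min(j+nblk-1,n2)
       ...

Behind the scenes of parallel computation: real computer arithmetic uses finite precision. A quantity computed sequentially by a serial program, such as aa(1)+aa(2)+aa(3)+...+aa(100), is computed by the parallel program as a sum of partial sums, [aa(1)+aa(2)+aa(3)]+[aa(4)+aa(5)+aa(6)]+[aa(7)+aa(8)+aa(9)]+..., and the order of the additions also differs from that of the serial program. As a result the two values can differ by an error of the order of the machine precision. In short, a result obtained with MPI_REDUCE is subject to rounding error. In practice this is rarely a problem.

Blocking and non-blocking communication: MPI_Send and MPI_Recv block the calling process until the communication is complete. With blocking communication a deadlock can occur; a deadlock is a kind of programming error. Non-blocking communication uses calls that separate the initiation of a send or receive operation from its completion. The advantage is that the program can do other work between the two calls. Initiating a non-blocking communication by calling it is known as posting.

In a real MPI program the exchange of data before and after the work done on each CPU is the essential part. The programmer therefore needs to verify with print statements that the data reach the right CPU at the right time, as intended. Accurate sending and receiving of data to the desired node at the desired time must be tested without fail, because this is the core of every MPI implementation. What happens on the different CPUs within the one program has to be checked together. And of course, where the problem allows it, the less often the computers communicate, the better the parallel efficiency.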
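The three ways of splitting an index range described above can be compared side by side without running MPI. The sketch below is serial Python, not the author's code: `block_range` repeats the arithmetic of `equal_load`, the other function names are hypothetical, and the final loop merely imitates what `MPI_REDUCE` with `MPI_SUM` does in `equal_load_sum` by adding up one partial sum per rank.

```python
def block_range(n1, n2, nproc, myid):
    """Block partition of n1..n2; same arithmetic as subroutine equal_load."""
    iw1, iw2 = divmod(n2 - n1 + 1, nproc)
    istart = myid * iw1 + n1 + min(myid, iw2)
    ifinish = istart + iw1 - 1 + (1 if iw2 > myid else 0)
    return istart, ifinish


def block_indices(n1, n2, nproc, myid):
    istart, ifinish = block_range(n1, n2, nproc, myid)
    return list(range(istart, ifinish + 1))


def cyclic_indices(n1, n2, nproc, myid):
    """Equivalent of: do i = n1+myid, n2, nproc"""
    return list(range(n1 + myid, n2 + 1, nproc))


def block_cyclic_indices(n1, n2, nproc, myid, nblk):
    """Equivalent of the nested loops over chunks of size nblk."""
    indices = []
    for j in range(n1 + myid * nblk, n2 + 1, nproc * nblk):
        indices.extend(range(j, min(j + nblk - 1, n2) + 1))
    return indices


# Ten indices shared by three ranks.
for myid in range(3):
    print(myid,
          block_indices(1, 10, 3, myid),
          cyclic_indices(1, 10, 3, myid),
          block_cyclic_indices(1, 10, 3, myid, nblk=2))

# Serial imitation of equal_load_sum: one partial sum per rank, then a "reduce".
nn, nproc = 10000, 7
total = sum(float(sum(block_indices(1, nn, nproc, r))) for r in range(nproc))
print(total)
```

Whatever scheme is chosen, every index must be owned by exactly one rank; printing the index lists for a small case like this is a cheap way to confirm that before the loop is moved into an MPI program.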

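One more frequently used routine is mpi_bcast, which stands for broadcast. It is used specifically to inform all nodes. The origin (root) of the information is specified, and of course so are the size and type of the information.

    iroot=0 ; kount=1
    call MPI_BCAST(l_pb,kount,MPI_LOGICAL,iroot,MPI_COMM_WORLD,ierr)

Because the information is broadcast so that it is shared, the routine above must be called on all nodes. It is used when the value of a variable has been newly read or computed on one particular node only and has to be made known to all nodes. This is collective communication, not point-to-point communication.

Master/slave form: the master node performs the important tasks and the other nodes carry out calculations under the direction of the master. This can be advantageous for implementing some algorithms, though there are also many cases where it is not. If the master node has almost nothing to do, the parallel efficiency is bound to drop. The master can of course raise the overall parallel efficiency by doing a share of the calculations just like a slave node. In master/slave form the limit on memory is usually not an issue. In very large calculations, however, it is very often unavoidable to distribute the variables over several nodes because of memory allocation problems. In such cases programming with the master/slave concept is not a good idea.

Further reading: Introduction to MPI; Introduction to MPI 2; Code Examples from the Cornell Theory Center.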

Simple examples

To print a program for analysis, the a2ps program can be used, as in a2ps -o output.ps < input.f90. It produces a nicely formatted PS file (Fortran keywords appear in bold). software/a2ps/

Example (1)

    PROGRAM hello
    IMPLICIT NONE
    INCLUDE "mpif.h"
    CHARACTER(LEN=12) :: inmsg,message
    INTEGER i,ierr,me,nproc,itag
    INTEGER istatus(MPI_STATUS_SIZE)
    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD,nproc,ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,me,ierr)
    if(me == 0 .and. nproc == 1) write(6,*) nproc, 'is alive'
    if(me == 0 .and. nproc > 1 ) write(6,*) nproc,'are alive'
    itag = 100
    if (me == 0) then
    ! rank 0 sends the greeting to every other rank
       message = "Hello, world"
       do i = 1,nproc-1
          call MPI_Send(message,12,MPI_CHARACTER,i,itag,MPI_COMM_WORLD,ierr)
       end do
       write(6,*) "process", me, ":", message
    else
    ! all other ranks receive it from rank 0
       call MPI_Recv(inmsg,12,MPI_CHARACTER,0,itag,MPI_COMM_WORLD,istatus,ierr)
       write(6,*) "process", me, ":", inmsg
    end if
    call MPI_Finalize(ierr)
    END PROGRAM hello

This program can be used to check whether MPICH or LAM is installed properly on a system. In the process the installation and use of the parallel compiler is tested automatically as well. (There are differences between MPICH and LAM, and often both are installed.) Submitting a job through PBS can also be tested with it. It is almost a toy program, but if it runs correctly the system is ready for parallel computation, which gives the program considerable significance. For example, installed files such as the following can be used.

    /usr/local/lam/bin/mpif90
    /usr/local/lam/lib
    /usr/local/mpich/bin/mpichf90
    /usr/local/mpich/lib

DQS

    #!/bin/bash
    #$ -cwd
    #$ -l qty.eq.16
    #$ -N Test
    #$ -A ihlee
    /usr/local/mpich/bin/mpirun -np $NUM_HOSTS -machinefile $HOSTS_FILE ../image_parallel_sma.x < input_file
    echo "End of Job"

    #!/bin/bash
    # This is an example DQS script for running a parallel MPI job
    # on MAIDROC cluster using gigabit ethernet interface
    #
    # start in the directory where the job was
    # submitted
    #$ -cwd
    #
    # specify number of processors and which set of nodes to use
    # we can request up to 48 processors because we use gigabit ethernet
    # Specify which set of nodes to use: fastnet_1 - for nodes 1-24
    #                                    fastnet_2 - for nodes
    # here we request 48 processors on nodes 1-24
    #$ -l qty.eq.12,fastnet_2
    #
    # name of the job
    #$ -N Tlam
    #
    # User specified environment variables are set with -v
    #$ -v NCPUS=12
    # commands to be executed
    # type your commands below
    #use mdo_mpi_fast command to run mpi job on gigabit ethernet interface
    mdo_mpi_fast ../admd_lam_sma.x < admd.i
    #NOTE: mdo_mpi_fast starts and runs your mpi program automagically.
    #Don't try to call mpirun yourself
    #mpirun -np 2 ./a.out

PBS

    #!/bin/sh
    ### Job name
    #PBS -N AM_A_1
    ### Declare job non-rerunable
    #PBS -r n
    ### Output files
    #PBS -j oe
    ### Mail to user
    #PBS -m ae
    ### Queue name (n2, n4, n8, n16, n32)

    #PBS -q n4
    # This job's working directory
    echo Working directory is $PBS_O_WORKDIR
    cd $PBS_O_WORKDIR
    echo Running on host `hostname`
    echo Time is `date`
    echo Directory is `pwd`
    echo This jobs runs on the following processors:
    echo `cat $PBS_NODEFILE`
    # Define number of processors
    NPROCS=`wc -l < $PBS_NODEFILE`
    echo This job has allocated $NPROCS nodes
    # your job
    # Run the parallel MPI executable "a.out"
    # mpirun -v -machinefile $PBS_NODEFILE -np $NPROCS a.out
    mpirun -machinefile $PBS_NODEFILE -np $NPROCS -nolocal action_lbfgs.x > out1

    #!/bin/bash
    #$ -cwd
    #$ -l qty.eq.12
    #$ -N Tmpich
    #$ -A jun
    /usr/local/mpich/bin/mpirun -np $NUM_HOSTS -machinefile $HOSTS_FILE ../admd_mpich_sma.x < admd.i
    echo "End of Job"

Output of qstat:

    gonza w v maido :3 r RUNNING 02/10/05 14:24: 1
    gonza w v maido :1 r RUNNING 02/10/05 14:05:56
    sha mmp.sh maido :1 r RUNNING 12/05/04 03:18:48
    gonza w v maido :4 r RUNNING 02/10/05 14:26:56
    olax benh-zab maido :1 r RUNNING 02/10/05 05:48:55
    gonza w v maido :2 r RUNNING 02/10/05 14:08: 3
    ihlee Tlam maido :1 r RUNNING 02/10/05 15:17:37
    ihlee Tlam maido :1 r RUNNING 02/10/05 15:17:37
    ihlee Tlam maido :1 r RUNNING 02/10/05 15:17:37
    ihlee Tlam maido :1 r RUNNING 02/10/05 15:17:37
    ihlee Tlam maido :1 r RUNNING 02/10/05 15:17:37
    ihlee Tlam maido :1 r RUNNING 02/10/05 15:17:37
    ihlee Tlam maido :1 r RUNNING 02/10/05 15:17:37
    ihlee Tlam maido :1 r RUNNING 02/10/05 15:17:37
    ihlee Tlam maido :1 r RUNNING 02/10/05 15:17:37
    ihlee Tlam maido42 :1 r RUNNING 02/10/05 15:17:37
    ihlee Tlam maido :1 r RUNNING 02/10/05 15:17:37
    ihlee Tlam maido :1 r RUNNING 02/10/05 15:17:37

Example (2)

    ! Program hello.ex1.f

    ! Parallel version using MPI calls
    ! Modified from basic version so that workers send back a message to the
    ! master, who prints out a message for each worker
    program hello
    implicit none
    integer, parameter:: DOUBLE=kind(1.0d0), SINGLE=kind(1.0)
    include "mpif.h"
    character(len=12) :: inmsg,message
    integer :: i,ierr,me,nproc,itag,iwrank
    integer, dimension(MPI_STATUS_SIZE) :: istatus
    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD,nproc,ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,me,ierr)
    ! the declared tag variable is itag (an undeclared "tag" would not compile under implicit none)
    itag = 100
    if (me == 0) then
       message = "Hello, world"
       do i = 1,nproc-1
          call MPI_Send(message,12,MPI_CHARACTER,i,itag,MPI_COMM_WORLD,ierr)
       end do
       write(*,*) "process", me, ":", message
    ! collect one reply per worker, in whatever order they arrive
       do i = 1,nproc-1
          call MPI_Recv(iwrank,1,MPI_INTEGER,MPI_ANY_SOURCE,itag,MPI_COMM_WORLD,istatus,ierr)
          write(*,*) "process", iwrank, ":Hello, back"
       end do
    else
       call MPI_Recv(inmsg,12,MPI_CHARACTER,0,itag,MPI_COMM_WORLD,istatus,ierr)
       call MPI_Send(me,1,MPI_INTEGER,0,itag,MPI_COMM_WORLD,ierr)
    end if
    call MPI_Finalize(ierr)
    end program hello

    ! Program hello.ex2.f
    ! Parallel version using MPI calls.
    ! Modified from basic version so that workers send back a message to the
    ! master, who prints out a message for each worker. In addition, the
    ! master now sends out two messages to each worker, with two different
    ! tags, and the worker receives the messages in reverse order.
    ! Note that this solution works only because the messages are small,
    ! and can fit into buffers. A later talk will provide details on how
    ! buffers are used in MPI_SEND and MPI_RECV.
    program hello
    implicit none
    integer, parameter:: DOUBLE=kind(1.0d0), SINGLE=kind(1.0)
    include "mpif.h"

    character(len=12) :: inmsg,message
    ! ierr (used in every call below) and istatus are declared here
    integer :: i,ierr,me,nproc,itag,itag2,iwrank
    integer istatus(MPI_STATUS_SIZE)
    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD,nproc,ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,me,ierr)
    itag = 100
    itag2 = 200
    if (me == 0) then
       message = "Hello, world"
       do i = 1,nproc-1
          call MPI_Send(message,12,MPI_CHARACTER,i,itag,MPI_COMM_WORLD,ierr)
          call MPI_Send(message,12,MPI_CHARACTER,i,itag2,MPI_COMM_WORLD,ierr)
       end do
       write(*,*) "process",me, ":", message
       do i = 1,nproc-1
          call MPI_Recv(iwrank,1,MPI_INTEGER,MPI_ANY_SOURCE,itag,MPI_COMM_WORLD,istatus,ierr)
          write(*,*) "process", iwrank, ":Hello, back"
       end do
    else
    ! the worker receives the two messages in reverse order of sending
       call MPI_Recv(inmsg,12,MPI_CHARACTER,0,itag2,MPI_COMM_WORLD,istatus,ierr)
       call MPI_Recv(inmsg,12,MPI_CHARACTER,0,itag,MPI_COMM_WORLD,istatus,ierr)
       call MPI_Send(me,1,MPI_INTEGER,0,itag,MPI_COMM_WORLD,ierr)
    end if
    call MPI_Finalize(ierr)
    end program hello

Example (3)

    program karp
    ! This simple program approximates pi by computing pi = integral
    ! from 0 to 1 of 4/(1+x*x)dx which is approximated by sum
    ! from k=1 to N of 4 / (1+((k-.5)/N)**2). The only input data
    ! required is N.
    !
    ! NOTE: Comments that begin with "spmd" are hints for part b of the
    ! lab exercise, where you convert this into an MPI program.
    ! spmd  Each process could be given a chunk of the interval to do.
    !
    ! RLF 3/21/97 Change floats to real*8
    ! SHM 8/29/97 Change input to read from a file to accommodate VW Companion
    ! SHM 8/29/97 Replaced goto with do while
    ! Nils Smeds Aug 14, 2000 Converted to F90
    implicit none
    integer, parameter:: DOUBLE=kind(1.0d0), SINGLE=kind(1.0)
    integer n,i
    real(double) :: err,pi,sum,w,x
    intrinsic atan
    pi = 4.0 * atan(1.0)
    open (unit = 20,file = "values")

    ! spmd  call startup routine that returns the number of tasks and the
    ! spmd  taskid of the current instance.
    ! Now read in a new value for N. When it is 0, then you should depart.
    read(20,*) n
    print *, "Number of approximation intervals = ", n
    do while (n > 0)
       w = 1.0 / n
       sum = 0.0
    ! midpoint rule: evaluate f at the centre of each of the n intervals
       do i = 1,n
          sum = sum + f((i-0.5)*w)
       end do
       sum = sum * w
       err = sum - pi
       print *, "sum = ", sum, " err =", err
       read (20,*) n
       print *, "Number of approximation intervals = ", n
    end do
    close (unit = 20)
    contains
    real(double) function f(x)
    implicit none
    real(double), intent(in) :: x
    f = 4.0 / (1.0+x*x)
    end function f
    end program karp

Example (4)

    program karp
    ! karp.soln.f
    ! This simple program approximates pi by computing pi = integral
    ! from 0 to 1 of 4/(1+x*x)dx which is approximated by sum
    ! from k=1 to N of 4 / (1+((k-.5)/N)**2). The only input data
    ! required is N.
    !
    ! 10/11/95 RLF MPI Parallel version 1
    ! 3/7/97 RLF Replace nprocs and mynum with size and rank
    ! 3/21/97 RLF Change floats to real*8
    ! SHM 8/29/97 Change input to read from a file to accommodate VW Companion
    ! SHM 8/29/97 Replaced goto with do while
    ! Nils Smeds Aug 14, 2000 Converted to F90
    !
    ! Uses only the 6 basic MPI calls
    implicit none
    integer, parameter:: DOUBLE=kind(1.0d0), SINGLE=kind(1.0)
    include "mpif.h"
    integer :: n,i,mpierr,rank,size,tag
    real(double) :: err,pi,sum,w,x

    integer, dimension(MPI_STATUS_SIZE) :: status
    intrinsic atan
    pi = 4.0 * atan(1.0)
    tag = 111
    open (unit = 20,file = "values")
    ! All processes call the startup routine to get their rank
    call MPI_Init(mpierr)
    call MPI_Comm_size(MPI_COMM_WORLD,size,mpierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,rank,mpierr)
    ! Each new approximation to pi begins here
    ! (Step 1) Get first value of N
    call solicit(n,size,rank)
    ! (Step 2): do the computation in N steps
    ! Parallel Version: there are "size" processes participating. Each
    ! process should do 1/size of the calculation. Since we want
    ! i = 1..n but rank = 0, 1, 2..., we start off with rank+1.
    do while (n > 0)
       w = 1.0 / n
       sum = 0.0
       do i = rank+1,n,size
          sum = sum + f((i-0.5)*w)
       end do
       sum = sum * w
    ! (Step 3): print the results
    ! (Parallel version: collect partial results and let master process print it)
       if (rank == 0) then
          print *, "host calculated x=", sum
          do i = 1,size-1
             call MPI_Recv(x,1,MPI_DOUBLE_PRECISION,i,tag,MPI_COMM_WORLD,status,mpierr)
             print *, "host got x=", x
             sum = sum + x
          end do
          err = sum - pi
          print *, "sum, err =", sum, err
       else
          call MPI_Send(sum,1,MPI_DOUBLE_PRECISION,0,tag,MPI_COMM_WORLD,mpierr)
       end if
    ! Get a new value of N
       call solicit(n,size,rank)
    end do
    call MPI_Finalize(mpierr)
    close (unit = 20)
    contains
    real(double) function f(x)
    implicit none
    real(double), intent(in) :: x
    f = 4.0 / (1.0+x*x)
    end function f
    subroutine solicit(n,nprocs,mynum)
    ! Get a value for N, the number of intervals in the approximation

    ! (Parallel versions: master process reads in N and then
    ! sends N to all the other processes)
    ! Note: A single broadcast operation could be used instead, but
    ! is not one of the 6 basic calls.
    implicit none
    include "mpif.h"
    integer, intent(inout) :: n
    integer, intent(in) :: mynum,nprocs
    integer :: i,mpierr,tag
    integer, dimension(MPI_STATUS_SIZE) :: status
    tag = 112
    if (mynum == 0) then
       read (20,*) n
       print *, "Number of approximation intervals = ", n
       do i = 1,nprocs-1
          call MPI_Send(n,1,MPI_INTEGER,i,tag,MPI_COMM_WORLD,mpierr)
       end do
    else
       call MPI_Recv(n,1,MPI_INTEGER,0,tag,MPI_COMM_WORLD,status,mpierr)
    end if
    end subroutine solicit
    end program karp

Example (5)

    !#########################################
    !
    ! This is an MPI example that solves Laplace's equation by using Jacobi
    ! iteration on a 1-D decomposition. Non-blocking communications routines
    ! are used.
    ! It demonstrates the use of :
    !
    ! * MPI_Init
    ! * MPI_Comm_rank
    ! * MPI_Comm_size
    ! * MPI_Cart_create
    ! * MPI_Cart_shift
    ! * MPI_Bcast
    ! * MPI_Allreduce
    ! * MPI_Isend
    ! * MPI_Irecv
    ! * MPI_Finalize
    !
    !####################################################
    program onedovlp
    include "mpif.h"
    integer maxn
    parameter (maxn = 128)
    double precision a(maxn,maxn), b(maxn,maxn), f(maxn,maxn)
    double precision diff, diffnorm, diffw

    integer nx, ny, myid, numprocs, comm1d
    integer nbrbottom, nbrtop, s, e, it, ierr
    call MPI_INIT( ierr )
    call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
    call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
    print *, "Process ", myid, " of ", numprocs, " is alive"
    if (myid .eq. 0) then
    ! Get the size of the problem
    !   print *, 'Enter nx'
    !   read *, nx
       nx = 110
    endif
    call MPI_BCAST(nx,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
    ny = nx
    ! Get a new communicator for a decomposition of the domain
    call MPI_CART_CREATE( MPI_COMM_WORLD, 1, numprocs, .false., .true., comm1d, ierr )
    ! Get my position in this communicator, and my neighbors
    call MPI_COMM_RANK( comm1d, myid, ierr )
    call MPI_Cart_shift( comm1d, 0, 1, nbrbottom, nbrtop, ierr )
    ! Compute the decomposition
    call MPE_DECOMP1D( ny, numprocs, myid, s, e )
    ! Initialize the right-hand-side (f) and the initial solution guess (a)
    call onedinit( a, b, f, nx, s, e )
    ! Actually do the computation. Note the use of a collective operation to
    ! check for convergence, and a do-loop to bound the number of iterations.
    do 10 it=1, 200
       call nbexchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop, 0 )
       call nbsweep( a, f, nx, s, e, b )
       call nbexchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop, 1 )
       call nbsweep( a, f, nx, s, e, b )
       call nbexchng1( b, nx, s, e, comm1d, nbrbottom, nbrtop, 0 )
       call nbsweep( b, f, nx, s, e, a )
       call nbexchng1( b, nx, s, e, comm1d, nbrbottom, nbrtop, 1 )
       call nbsweep( b, f, nx, s, e, a )
       diffw = diff( a, b, nx, s, e )
       call MPI_Allreduce( diffw, diffnorm, 1, MPI_DOUBLE_PRECISION, MPI_SUM, comm1d, ierr )
       if (diffnorm .lt. 1.0e-5) goto 20
    10 continue
    if (myid .eq. 0) print *, 'Failed to converge'
    20 continue

    if (myid .eq. 0) then
       print *, 'Converged after ', it, ' Iterations'
       do i = 1,nx
          do j = 1,nx
             print *,"i,j,b=",i,j,b(i,j)
          end do
       end do
    endif
    call MPI_FINALIZE(ierr)
    end

Example (6)

    !**********************************************************************
    ! matmul.f - matrix - vector multiply, simple self-scheduling version
    !**********************************************************************
    ! Program Matmult
    !########################################
    !
    ! This is an MPI example of multiplying a vector times a matrix
    ! It demonstrates the use of :
    !
    ! * MPI_Init
    ! * MPI_Comm_rank
    ! * MPI_Comm_size
    ! * MPI_Bcast
    ! * MPI_Recv
    ! * MPI_Send
    ! * MPI_Finalize
    ! * MPI_Abort
    !
    !#####################################
    program main
    include 'mpif.h'
    integer MAX_ROWS, MAX_COLS, rows, cols
    parameter (MAX_ROWS = 1000, MAX_COLS = 1000, MAX_PROCS =32)
    double precision a(MAX_ROWS,MAX_COLS), b(MAX_COLS), c(MAX_COLS)
    double precision buffer(MAX_COLS), ans
    integer procs(MAX_COLS), proc_totals(MAX_PROCS)
    integer myid, master, numprocs, ierr, status(MPI_STATUS_SIZE)
    integer i, j, numsent, numrcvd, sender, job(MAX_ROWS)
    integer rowtype, anstype, donetype
    call MPI_INIT( ierr )
    call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
    call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
    if (numprocs .lt. 2) then
       print *, "Must have at least 2 processes"

         call MPI_ABORT( MPI_COMM_WORLD, 1, ierr )
         stop
      endif
      if (numprocs .gt. MAX_PROCS) then
         print *, "Must have 32 processes or less."
         call MPI_ABORT( MPI_COMM_WORLD, 1, ierr )
         stop
      endif
      print *, "Process ", myid, " of ", numprocs, " is alive"

      rowtype  = 1
      anstype  = 2
      donetype = 3
      master   = 0
      rows     = 100
      cols     = 100

      if ( myid .eq. master ) then
c        master initializes and then dispatches
c        initialize a and b
         do 20 i = 1,cols
            b(i) = 1
            do 10 j = 1,rows
               a(i,j) = i
 10         continue
 20      continue

         numsent = 0
         numrcvd = 0
c        clear the per-process answer counters before they are used
         do 25 i = 1,numprocs
            proc_totals(i) = 0
 25      continue

c        send b to each other process
         call MPI_BCAST(b, cols, MPI_DOUBLE_PRECISION, master,
     $        MPI_COMM_WORLD, ierr)

c        send a row to each other process
         do 40 i = 1,numprocs-1
            do 30 j = 1,cols
               buffer(j) = a(i,j)
 30         continue
            call MPI_SEND(buffer, cols, MPI_DOUBLE_PRECISION, i,
     $           rowtype, MPI_COMM_WORLD, ierr)
            job(i)  = i
            numsent = numsent+1
 40      continue

         do 70 i = 1,rows
            call MPI_RECV(ans, 1, MPI_DOUBLE_PRECISION,
     $           MPI_ANY_SOURCE, anstype, MPI_COMM_WORLD, status,
     $           ierr)
            sender = status(MPI_SOURCE)
            c(job(sender)) = ans
            procs(job(sender)) = sender
            proc_totals(sender+1) = proc_totals(sender+1) + 1
            if (numsent .lt. rows) then
               do 50 j = 1,cols
                  buffer(j) = a(numsent+1,j)
 50            continue
               call MPI_SEND(buffer, cols, MPI_DOUBLE_PRECISION,
     $              sender, rowtype, MPI_COMM_WORLD, ierr)
               job(sender) = numsent+1
               numsent = numsent+1
            else
               call MPI_SEND(1, 1, MPI_INTEGER, sender, donetype,
     $              MPI_COMM_WORLD, ierr)
            endif
 70      continue

c        print out the answer
         do 80 i = 1,cols
            print *,"c(", i,") = ", c(i),
     $           " computed by proc #",procs(i)
 80      continue
         do 81 i=1,numprocs
            write(6,810) i-1,proc_totals(i)
 810        format('total answers computed by processor #',i2,
     $             ' were ',i3)
 81      continue
      else
c        slaves receive b, then compute dot products until done message
         call MPI_BCAST(b, cols, MPI_DOUBLE_PRECISION, master,
     $        MPI_COMM_WORLD, ierr)
 90      call MPI_RECV(buffer, cols, MPI_DOUBLE_PRECISION, master,
     $        MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
         if (status(MPI_TAG) .eq. donetype) then
            go to 200
         else
            ans = 0.0
            do 100 i = 1,cols
               ans = ans+buffer(i)*b(i)
 100        continue
            call MPI_SEND(ans, 1, MPI_DOUBLE_PRECISION, master,
     $           anstype, MPI_COMM_WORLD, ierr)
            go to 90
         endif
      endif

 200  call MPI_FINALIZE(ierr)
      stop
      end

Example (7)

      Program Example1
      implicit none
      integer n, p, i, j,num
      real h, result, a, b, integral, pi
      real my_a,my_range
      pi = acos(-1.0)   ! = 3.14159...
      a = 0.0           ! lower limit of integration
      b = pi*1./2.      ! upper limit of integration
      p = 4             ! number of processes (partitions)
c     the value of n is missing in the source; 500 is assumed here
      n = 500           ! total number of increments
      h = (b-a)/n       ! length of increment
      num = n/p         ! number of calculations done by each process
      result = 0.0      ! stores answer to the integral
      do i=0,p-1        ! sum of integrals over all processes
        my_range = (b-a)/p
        my_a = a + i*my_range
        result = result + integral(my_a,num,h)
      enddo
      print *,'The result =',result

      stop
      end

      real function integral(a,n,h)
      implicit none
      integer n, i, j
      real h, h2, aij, a
      real fct, x
      fct(x) = cos(x)       ! kernel of the integral
      integral = 0.0        ! initialize integral
      h2 = h/2.
      do j=0,n-1            ! sum over all "j" integrals
        aij = a+j*h         ! lower limit of "j" integral
        integral = integral + fct(aij+h2)*h
      enddo
      return
      end

Example (7)

      Program Example2
c######################################################################
c#
c# This is an MPI example on parallel integration
c# It demonstrates the use of :
c#
c# * MPI_Init
c# * MPI_Comm_rank
c# * MPI_Comm_size
c# * MPI_Recv
c# * MPI_Send
c# * MPI_Finalize
c# * MPI_WTime
c#
c######################################################################
      implicit none
      integer n, p, i, j, ierr, master,num
      real h, result, a, b, integral, pi
      double precision MPI_WTime,start_time,end_time
c     This brings in pre-defined MPI constants, ...
      include "mpif.h"
      integer Iam, source, dest, tag, status(MPI_STATUS_SIZE)
      real my_result
      real my_a, my_range
c     0 is defined as the master processor which will be
c     responsible for collecting integral sums...
      data master/0/
c     Placement of executable statements before MPI_Init is not
c     advisable as the side effect is implementation-dependent
      pi = acos(-1.0)   ! = 3.14159...
      a = 0.0           ! lower limit of integration
      b = pi*1./2.      ! upper limit of integration
c     the value of n is missing in the source; 500 is assumed here
      n = 500   ! total number of increments across all processors

c     define the process that computes the final result
      dest = master
c     set the tag to identify this particular job
      tag = 123
c**Starts MPI processes ...
      call MPI_Init(ierr)                   ! starts MPI
c     get current process id
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
c     get # procs from env variable or command line
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)
      print*,'process #',Iam, ' out of ',p,' total process'
      start_time = MPI_Wtime()
      h = (b-a)/n       ! length of increment
      num = n/p         ! number of increments for each processor
      my_range = (b-a)/p
      my_a = a + Iam*my_range
      my_result = integral(my_a,num,h)      ! compute local sum
      write(*,"('process ',i2,' has the partial result of',f10.6)")
     &      Iam,my_result
      if(Iam .eq. master) then
        result = my_result  ! initialize final result to master's
c       loop on sources (serialized) to collect local sum
        do source=1,p-1
          call MPI_Recv(my_result, 1, MPI_REAL, source, tag,
     &                  MPI_COMM_WORLD, status, ierr)
          result = result + my_result
        enddo
        print *,'The result =',result
        end_time = MPI_Wtime()
        print *, 'elapsed time is ',end_time-start_time,' seconds'
      else
c       send my_result to intended dest.
        call MPI_Send(my_result, 1, MPI_REAL, dest, tag,
     &                MPI_COMM_WORLD, ierr)
      endif
      call MPI_Finalize(ierr)               ! let MPI finish up ...
      stop
      end

      real function integral(a,n,h)
      implicit none
      integer n, i, j
      real h, h2, aij, a
      real fct, x
      fct(x) = cos(x)       ! kernel of the integral
      integral = 0.0        ! initialize integral
      h2 = h/2.
      do j=0,n-1            ! sum over all "j" integrals
        aij = a+j*h         ! lower limit of "j" integral
        integral = integral + fct(aij+h2)*h
      enddo
      return
      end

Example (8)

      Program Example2
c######################################################################
c#
c# This is an MPI example on parallel integration
c# It demonstrates the use of :

c#
c# * MPI_Init
c# * MPI_Comm_rank
c# * MPI_Comm_size
c# * MPI_Recv
c# * MPI_Send
c# * MPI_Finalize
c# * MPI_WTime
c#
c######################################################################
      implicit none
      integer n, p, i, j, ierr, master,num
      real h, result, a, b, integral, pi
      double precision MPI_WTime,start_time,end_time
c     This brings in pre-defined MPI constants, ...
      include "mpif.h"
      integer Iam, source, dest, tag, status(MPI_STATUS_SIZE)
      real my_result
      real my_a, my_range
c     0 is defined as the master processor which will be
c     responsible for collecting integral sums...
      data master/0/
c     Placement of executable statements before MPI_Init is not
c     advisable as the side effect is implementation-dependent
      pi = acos(-1.0)   ! = 3.14159...
      a = 0.0           ! lower limit of integration
      b = pi*1./2.      ! upper limit of integration
c     the value of n is missing in the source; 500 is assumed here
      n = 500   ! total number of increments across all processors
c     define the process that computes the final result
      dest = master
c     set the tag to identify this particular job
      tag = 123
c**Starts MPI processes ...
      call MPI_Init(ierr)                   ! starts MPI
c     get current process id
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
c     get # procs from env variable or command line
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)
      print*,'process #',Iam, ' out of ',p,' total process'
      start_time = MPI_Wtime()
      h = (b-a)/n       ! length of increment
      num = n/p         ! number of increments for each processor
      my_range = (b-a)/p
      my_a = a + Iam*my_range
      my_result = integral(my_a,num,h)      ! compute local sum
      write(*,"('process ',i2,' has the partial result of',f10.6)")
     &      Iam,my_result
      if(Iam .eq. master) then
        result = my_result  ! initialize final result to master's
c       loop on sources (serialized) to collect local sum
        do source=1,p-1
          call MPI_Recv(my_result, 1, MPI_REAL, source, tag,
     &                  MPI_COMM_WORLD, status, ierr)
          result = result + my_result
        enddo
        print *,'The result =',result
        end_time = MPI_Wtime()
        print *, 'elapsed time is ',end_time-start_time,' seconds'
      else
c       send my_result to intended dest.
        call MPI_Send(my_result, 1, MPI_REAL, dest, tag,
     &                MPI_COMM_WORLD, ierr)
      endif
      call MPI_Finalize(ierr)               ! let MPI finish up ...
      stop
      end
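
The integration programs in this part (Example1 and Example2) all rely on the same decomposition of the interval, which may be easier to follow when written out as a formula. The block below is only a sketch in the listings' own symbols (a, b, n, p, h, num); the index i stands for the process rank (the loop index i in Example1, Iam in Example2), and the subscripted names a_i and I_i are notation introduced here, not taken from the source.

```latex
h = \frac{b-a}{n}, \qquad
\mathrm{num} = \left\lfloor \frac{n}{p} \right\rfloor, \qquad
a_i = a + i\,\frac{b-a}{p}, \quad i = 0, \dots, p-1

I_i = \sum_{j=0}^{\mathrm{num}-1} h \,\cos\!\left( a_i + \left( j + \tfrac{1}{2} \right) h \right)

\int_a^b \cos x \, dx \;\approx\; \sum_{i=0}^{p-1} I_i ,
\qquad
\int_0^{\pi/2} \cos x \, dx = \sin\frac{\pi}{2} - \sin 0 = 1
```

Each I_i is the midpoint rule applied to the sub-interval that starts at a_i, which is what the function integral computes as my_result. Because num comes from integer division, the num steps of length h cover the whole sub-interval of length (b-a)/p only when p divides n. For instance, n = 10 and p = 4 give num = 2, so each process integrates over 0.2(b-a) instead of 0.25(b-a) and the total comes out too small.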


      real function integral(a,n,h)
      implicit none
      integer n, i, j
      real h, h2, aij, a
      real fct, x
      fct(x) = cos(x)       ! kernel of the integral
      integral = 0.0        ! initialize integral
      h2 = h/2.
      do j=0,n-1            ! sum over all "j" integrals
        aij = a+j*h         ! lower limit of "j" integral
        integral = integral + fct(aij+h2)*h
      enddo
      return
      end

Example (9)

      Program Example5
c######################################################################
c#
c# This is an MPI example on parallel integration
c# It demonstrates the use of :
c#
c# * MPI_Init
c# * MPI_Comm_rank
c# * MPI_Comm_size
c# * MPI_Bcast
c# * MPI_Reduce
c# * MPI_SUM
c# * MPI_Finalize
c# * MPI_WTime
c#
c######################################################################
      implicit none
      integer n, p, i, j, ierr, master,num
      real h, result, a, b, integral, pi
      real my_a, my_range
      double precision MPI_WTime,start_time,end_time
c     This brings in pre-defined MPI constants, ...
      include "mpif.h"
      integer Iam, source, dest, tag, status(MPI_STATUS_SIZE)
      real my_result
      data master/0/
c**Starts MPI processes ...
      call MPI_Init(ierr)                   ! starts MPI
c     get current process id
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
c     get number of processes
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)
      pi = acos(-1.0)   ! = 3.14159...
      a = 0.0           ! lower limit of integration
      b = pi*1./2.      ! upper limit of integration

c     define the process that computes the final result
      dest = 0
c     set the tag to identify this particular job
      tag = 123
      if(Iam .eq. master) then
        print *,'The requested number of processors =',p
        print *,'Enter total number of increments across all processors'
        read(*,*)n
        start_time = MPI_Wtime()
      endif
c**Broadcast "n" to all processes
      call MPI_Bcast(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      h = (b-a)/n       ! length of increment
      num = n/p         ! number of increments calculated by each process
      my_range = (b-a)/p
      my_a = a + Iam*my_range               ! lower limit of my integral
      my_result = integral(my_a,num,h)
      write(*,"('process ',i2,' has the partial result of',f10.6)")
     &      Iam,my_result
      call MPI_Reduce(my_result, result, 1, MPI_REAL, MPI_SUM, dest,
     &                MPI_COMM_WORLD, ierr)
      if(Iam .eq. master) then
        print *,'The result =',result
        end_time = MPI_Wtime()
        print *, 'elapsed time is ',end_time-start_time,' seconds'
      endif
      call MPI_Finalize(ierr)               ! let MPI finish up ...
      stop
      end

      real function integral(a,n,h)
      implicit none
      integer n, i, j
      real h, h2, aij, a
      real fct, x
      fct(x) = cos(x)       ! kernel of the integral
      integral = 0.0        ! initialize integral
      h2 = h/2.
      do j=0,n-1            ! sum over all "j" integrals
        aij = a + j*h       ! lower limit of "j" integral
        integral = integral + fct(aij+h2)*h
      enddo
      return
      end

Example (10)

      Program Example6
c######################################################################
c#
c# This is an MPI example on parallel integration

c# It demonstrates the use of :
c#
c# * MPI_Init
c# * MPI_Comm_rank
c# * MPI_Comm_size
c# * MPI_Pack
c# * MPI_Unpack
c# * MPI_Reduce
c# * MPI_SUM, MPI_MAXLOC, and MPI_MINLOC
c# * MPI_Finalize
c#
c######################################################################
      implicit none
      integer n, p, i, j, ierr, m, master
      real h, result, a, b, integral, pi
c     This brings in pre-defined MPI constants, ...
      include "mpif.h"
      integer Iam, source, dest, tag, status(MPI_STATUS_SIZE),num
      real my_result(2), min_result(2), max_result(2)
      real my_a, my_range
      double precision MPI_WTime,start_time,end_time
      integer Nbytes
      parameter (Nbytes=1000, master=0)
c     needed for MPI_pack/MPI_unpack; counted in bytes
      character scratch(Nbytes)
      integer index, minid, maxid
c**Starts MPI processes ...
      call MPI_Init(ierr)                   ! starts MPI
c     get current process id
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
c     get number of processes
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)
      pi = acos(-1.0)   ! = 3.14159...
c     define the process that computes the final result
      dest = 0
c     set the tag to identify this particular job
      tag = 123
      if(Iam .eq. 0) then
        print *,'The requested number of processors =',p
        print *,'Enter the total # of intervals over all processes'
        read(*,*)n
        print *,'enter a & m'
        print *,' a = lower limit of integration'
        print *,' b = upper limit of integration'
        print *,'   = m * pi/2'
        read(*,*)a,m
        start_time = MPI_Wtime()
        b = m * pi / 2.
c**to be efficient, pack all things into a buffer for broadcast
        index = 1
        call MPI_Pack(n, 1, MPI_INTEGER, scratch, Nbytes, index,
     &                MPI_COMM_WORLD, ierr)
        call MPI_Pack(a, 1, MPI_REAL, scratch, Nbytes, index,
     &                MPI_COMM_WORLD, ierr)
        call MPI_Pack(b, 1, MPI_REAL, scratch, Nbytes, index,
     &                MPI_COMM_WORLD, ierr)
        call MPI_Bcast(scratch, Nbytes, MPI_PACKED, 0,
     &                 MPI_COMM_WORLD, ierr)
      else
        call MPI_Bcast(scratch, Nbytes, MPI_PACKED, 0,
     &                 MPI_COMM_WORLD, ierr)
c**things received have been packed, unpack into expected locations
        index = 1
        call MPI_Unpack(scratch, Nbytes, index, n, 1, MPI_INTEGER,
     &                  MPI_COMM_WORLD, ierr)
        call MPI_Unpack(scratch, Nbytes, index, a, 1, MPI_REAL,
     &                  MPI_COMM_WORLD, ierr)
        call MPI_Unpack(scratch, Nbytes, index, b, 1, MPI_REAL,
     &                  MPI_COMM_WORLD, ierr)
      endif

      h = (b-a)/n       ! length of increment
      num = n/p         ! number of iterations on each processor
      my_range = (b-a)/p
      my_a = a + Iam*my_range
      my_result(1) = integral(my_a,num,h)
      my_result(2) = Iam
      write(*,"('process ',i2,' has the partial result of',f10.6)")
     &      Iam,my_result(1)
c     data reduction by way of MPI_SUM
      call MPI_Reduce(my_result, result, 1, MPI_REAL, MPI_SUM, dest,
     &                MPI_COMM_WORLD, ierr)
c     data reduction by way of MPI_MINLOC
      call MPI_Reduce(my_result, min_result, 1, MPI_2REAL, MPI_MINLOC,
     &                dest, MPI_COMM_WORLD, ierr)
c     data reduction by way of MPI_MAXLOC
      call MPI_Reduce(my_result, max_result, 1, MPI_2REAL, MPI_MAXLOC,
     &                dest, MPI_COMM_WORLD, ierr)
      if(Iam .eq. master) then
        print *,'The result =',result
        end_time = MPI_Wtime()
        print *, 'elapsed time is ',end_time-start_time,' seconds'
        maxid = max_result(2)
        print *,'Proc',maxid,' has largest integrated value of',
     &          max_result(1)
        minid = min_result(2)
        print *,'Proc',minid,' has smallest integrated value of',
     &          min_result(1)
      endif
      call MPI_Finalize(ierr)               ! let MPI finish up ...
      stop
      end

      real function integral(a,n,h)
      implicit none
      integer n, i, j
      real h, h2, aij, a
      real fct, x
      fct(x) = cos(x)       ! kernel of the integral
      integral = 0.0        ! initialize integral
      h2 = h/2.
      do j=0,n-1            ! sum over all "j" integrals
        aij = a + j*h       ! lower limit of "j" integral
        integral = integral + fct(aij+h2)*h
      enddo
      return
      end

Example (10)

c pi_send.f
c FILES: pi_send.f, dboard.f, make.pi.f
c DESCRIPTION: MPI pi calculation example program. Fortran version.
c This program calculates pi using a "dartboard" algorithm. See
