
KISTI Supercomputing Center, Education Support Team / Supercomputer (SUN Tachyon1&2 System): H/W and S/W Environment Introduction and Hands-on Practice / 안병선

INDEX 1. Supercomputer No. 4 2. Introduction to Tachyon 3. Tachyon hands-on system overview 4. Hands-on practice on the Tachyon compute nodes 5. Running jobs through SGE 6. Parallel code performance optimization techniques

01. Supercomputer No. 4: Introducing Tachyon. This section introduces the H/W and S/W of the KISTI supercomputers Tachyon1&2.

Agenda SUN Blade 타키온 (Tachyon) HW 소개 계산노드 프로세서 SUN Blade 타키온 (Tachyon) SW 소개 MVAPICH, OPENMPI MATH. Libraries 타키온시스템사용 무료계정발급 초보계정신청 전략과제소개

History of KISTI Supercomputers (timeline, 1988-2010): Cray 2S [1st], Cray C90 [2nd], Cray T3E, HP GS320, HPC 160/320, NEC SX-5 / SX-6 and IBM p690 [3rd], Tera and Pluto clusters, IBM p595 and SUN B6048 (Tachyon) / SUN B6275 (Tachyon2) [4th]; peak performance grew from 2 GFlops (1988) through 16, 131, 242, 306, 1,407 and 8,000 GFlops to 30 TFlops (2008) and 300 TFlops (2010).

System specifications and configuration: Supercomputer No. 4 Tachyon2 is a SUN Blade 6275 system with a theoretical peak performance (Rpeak) of 300 TFlops; Supercomputer No. 4 Tachyon is a SUN Blade 6048 system with a theoretical peak performance (Rpeak) of 24 TFlops.

Tachyon2 system specifications
- Processor: Intel Xeon X5570 2.93GHz (Nehalem); system bus: Intel QPI (6.4 GT/sec); L1/L2/L3: 128KB/1MB/8MB
- Nodes: 3,176 compute nodes; 4 login nodes (X4170); 24 debugging nodes (X6275)
- CPU cores: 25,408 (8 per node)
- Memory: DDR3/1333MHz, 76.8TB (24GB per node, 3GB per core)
- Disk storage: SUN X4270/STK6140, 1,061TB
- Tape storage: SUN SL8500, 2PB
- Interconnection: InfiniBand 40G 8X QDR, Sun Datacenter 648 switches
- Cooling: water-cooled (Liebert XDP/XDH)
- Operating system: RedHat Enterprise Linux 5.3, kernel 2.6.18-128.7.1.el5
- File system: Lustre 1.8.1.1 (scratch and home directories)
- Archiving program: SAM-QFS 5.0
- Job manager: SGE 6.2u5

Tachyon system specifications
- Processor: AMD Opteron 2.0GHz (Barcelona); system bus: HyperTransport (6.4GB/sec); L1/L2/L3: 64KB / 4*512KB / 2MB on-die
- Nodes: 188 compute nodes; 4 login nodes (X4600); 4 debugging nodes (6048)
- CPU cores: 3,008 (16 per node)
- Memory: DDR2/667MHz, 6TB (32GB per node, 2GB per core)
- Disk storage: SUN X4500/STK6140, 207TB
- Tape storage: SUN SL8500, 422TB
- Interconnection: InfiniBand 4X DDR, Voltaire ISR 2012 switch
- Cooling: water-cooled (Liebert XDP/XDH)
- Operating system: CentOS 4.6, kernel 2.6.9-67.0.4.ELsmp
- File system: Lustre 1.6.5.1 (scratch and home directories)
- Archiving program: SAM-QFS 4.6
- Job manager: SGE 6.1

Sun Blade 6275 Modular System. Tachyon2 (the 2nd-phase system) consists of 34 high-density Sun Blade 6048 racks (3,176 compute nodes). Each rack holds 4 shelves, and each shelf holds 24 X6275 nodes. Each node has 24GB of memory (3GB per core), a 24GB CF flash module in place of an HDD, and two x8 PCIe bridges. X6275 blade: two quad-core 2.93GHz Intel Xeon (Nehalem) CPUs.

Tachyon 2&1 performance. Tachyon2 compute performance: CPU Intel Xeon X5570 (2.93GHz) quad-core; 3,200 nodes = 25,600 cores (2 CPUs per node, 4 cores per CPU); 93.76 GFlops/node * 3,200 nodes ≈ 300.03 TFlops total. Total memory: 3,200 nodes * 8 cores * 3GB = 76.8TB. Tachyon compute performance: 188 compute nodes of type X6420; each X6420 node has 4 sockets * 4 cores (2.0GHz) = 16 cores, and 16 * 8 GFlops = 128 GFlops per node, for a total of about 24 TFlops. Total memory: 188 nodes * 16 cores * 2GB = 6.016TB.

타키온 2 구성도

Tachyon system configuration diagram: cluster service nodes (login servers, DataMover), management network (100M), integrated 10Gbps network, file service nodes, 24TFlops compute nodes, InfiniBand core switches (Voltaire ISR 2012, 288 ports, 100% non-blocking), Gigabit service network, backup server, SAM-QFS server, archiving server, SAN (Brocade SW48K), SL8500 tape library, scratch disk 130TB (X4500), home disk 50TB (ST6140), backup disk.

타키온 2 X6275 블레이드노드블록다이어그램

타키온 X6420 블레이드노드블록다이어그램

Introducing Nehalem 인텔네할렘프로세서의주요기능

Introducing Nehalem 네이티브쿼드코어 & 메모리컨트롤러내장

Introducing Nehalem 인텔네할렘퀵패스아키텍처

Introducing Nehalem 하이퍼스레딩의부활

Introducing Nehalem 향상된캐쉬 & SSE4 명령어추가

Intel Quad-Core Processor( Nehalem ) 효율적전원관리 & 터보모드

Interconnection network: InfiniBand is the main backbone network for inter-node communication. It is built as a non-blocking IB network using 8X IB QDR, providing 40GB/s of bandwidth per channel, and all compute nodes are connected to eight Sun Datacenter 648 switches.

Storage: the TACHYON storage system provides a home directory (59TB) and a global scratch directory (874TB), using 36 SUN X4270 servers and 72 J4400 storage arrays. The scratch and home directories are served to all nodes, including the compute nodes, through the Lustre file system.

Agenda SUN Blade 6275 타키온 2(Tachyon2) HW 소개 계산노드 Blade 6275 프로세서 Intel Xeon X5570 ( 네할렘 ) SUN Blade 6275 타키온 2(Tachyon2) SW 소개 MVAPICH, OPENMPI MATH. Libraries 타키온시스템사용 무료계정발급 초보계정신청 전략과제소개

Tachyon2 S/W list
- OS: RedHat Enterprise Linux 5.3
- Shared file system: Lustre 1.8.1.1
- Home shared file system: SAM-QFS 5.0
- Resource scheduler: Sun Grid Engine 6.2
- Cluster monitoring: Ganglia 3.0.7
- Backup for home of SMP: Veritas NetBackup 6.0
- Compilers: GCC 4.1.2-44, Intel Compiler 10.1 / 11.1, PGI CDK 8.0-6 / 9.0-4
- MPI libraries: MVAPICH 1.1.0 / 1.2.0, MVAPICH2 1.2p1 / 1.4, OpenMPI 1.3.2 / 1.3.3
- Profiler: TAU (CVS version for Barcelona) 2.17
- Debugger: TotalView 8.7.0

Tachyon S/W list
- OS: CentOS 4.4
- Shared file system: Lustre 1.6
- Home shared file system: SAM-QFS 4.6
- Resource scheduler: Sun Grid Engine 6.1
- Cluster monitoring: Ganglia 3.0.7
- Backup for home of SMP: Veritas NetBackup 6.0
- Compilers: GCC 3.4.6.9, Intel Compiler 10.1, PGI CDK 8.0
- MPI libraries: MVAPICH 1.0 (MVAPICH2 service planned), OpenMPI 1.2.5
- Profiler: TAU (CVS version for Barcelona) 2.17
- Debugger: TotalView 8.6.1

Mathematics and Statistics (T2)
- Numerical programs and linear algebra (/applic/compilers/{compiler}/{ver.}/applib1): ACML 4.4.0, ATLAS 3.6.0-15, GOTOBLAS2 1.13, HDF4/HDF5 4.4r4-4 / 1.8.3-2, LAPACK 3.2.1-3, NCARG4/NCARG5 4.4.2-5 / 5.2.1
- MPI libraries (/applic/compilers/{compiler}/{compiler_ver.}/mpi/{mpi}/{mpi_ver.}/applib2): AZTEC 2.1, BLACS 1.1-33, FFTW2/FFTW3 2.1.5-19 / 3.2.1-3, MPIP 3.1.2, SCALAPACK 1.7.5-5
- Commercial software (/applic/applications): Gaussian03, Gaussian09, Mathematica, QChem, ...
* See the installed-resources S/W information at http://www.ksc.re.kr

Mathematics and Statistics (Tachyon)
- Numerical programs and linear algebra (/applic/lib.{compiler}): ACML 4.0.1, FFTW 3.1.2, BLAS, BLACS, LAPACK, ATLAS 3.6, GotoBLAS 1.26, ScaLAPACK 1.8, PETSc 2.3.3
- Other libraries (/applic/lib.{compiler}): Aztec 2.1, NCAR 5.0.0-9, NetCDF 3.6.2-4, HDF4/HDF5 4.2/1.8
- Commercial software (/applic/applications): Gaussian 03
* See the installed-resources S/W information at http://www.ksc.re.kr

MPI Concepts and Interfaces. Concepts: there are several different modes of message passing: two-sided point-to-point (send/receive), one-sided point-to-point (put, get), and collective (barrier, broadcast). MPI 1.1: static process allocation, two-sided blocking and non-blocking communication, collective communication, derived data types, virtual topologies. MPI 2.0: all of MPI 1.1, dynamic process allocation, one-sided communication, parallel I/O.
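As a minimal illustration of the two-sided point-to-point mode listed above, the following Fortran sketch (illustrative only; the program name, tag and values are not from the course example files) has rank 1 send one integer to rank 0 using the blocking MPI 1.1 calls:

      program sendrecv_demo
      implicit none
      include 'mpif.h'
      integer :: ierr, nprocs, myrank, val
      integer :: status(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
      if (myrank == 1) then
         val = 42
         ! blocking two-sided send: destination rank 0, message tag 7
         call MPI_SEND(val, 1, MPI_INTEGER, 0, 7, MPI_COMM_WORLD, ierr)
      else if (myrank == 0) then
         ! matching blocking receive from rank 1, tag 7
         call MPI_RECV(val, 1, MPI_INTEGER, 1, 7, MPI_COMM_WORLD,
     +                 status, ierr)
         write(*,*) 'rank 0 received', val
      end if
      call MPI_FINALIZE(ierr)
      end program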

MVAPICH : Overview. What is MVAPICH/MVAPICH2? MVAPICH is pronounced "em-vah-pich". MVAPICH is a high performance implementation of MPI-1 over the InfiniBand VAPI interface based on MPICH1; MVAPICH2 is a high performance MPI-2 implementation based on MPICH2. These packages provide high performance and scalable support over the Verbs Level Interface (VAPI). The focus is on high performance implementations over emerging interconnects: InfiniBand, iWARP/10GigE, uDAPL over InfiniBand (OpenFabrics and Solaris), uDAPL over 10GigE (NetEffect). Both versions are available from OSU directly at http://mvapich.cse.ohio-state.edu, and the source tree is available from a public SVN.

OpenMPI : Overview. Design goals: full MPI-2 standard conformance and high performance; fault tolerance (optional); thread safety and concurrency (MPI_THREAD_MULTIPLE); based on a component architecture; flexible run-time instrumentation; portable, maintainable, production quality; a single library supports all networks. OpenMPI is a merger of ideas from prior implementations: FT-MPI (University of Tennessee), LA-MPI (Los Alamos), LAM/MPI (Indiana University), MVAPICH (The Ohio State University), PACX-MPI.

Linear Algebra Libraries BLAS ATLAS GotoBLAS LAPACK ACML ScaLAPACK

BLAS. The Basic Linear Algebra Subprograms (BLAS) are a set of low-level linear algebra routines: Level 1: vector-vector (e.g., dot product); Level 2: matrix-vector (e.g., matrix-vector multiply); Level 3: matrix-matrix (e.g., matrix-matrix multiply). Many linear algebra packages, including LAPACK, ScaLAPACK and PETSc, are built on top of BLAS. Most supercomputer vendors have versions of BLAS that are highly tuned for their platforms.
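For instance, a Level 3 matrix-matrix multiply C = alpha*A*B + beta*C is a single call to the BLAS routine DGEMM. A minimal sketch with illustrative 2x2 matrices (link against whichever BLAS is chosen, e.g. ACML, ATLAS or GotoBLAS from the applib directories listed earlier):

      program dgemm_demo
      implicit none
      double precision :: a(2,2), b(2,2), c(2,2)
      ! A and B given in column-major order
      data a /1.d0, 3.d0, 2.d0, 4.d0/
      data b /5.d0, 7.d0, 6.d0, 8.d0/
      c = 0.d0
      ! C := 1.0*A*B + 0.0*C  (Level 3 BLAS)
      call dgemm('N', 'N', 2, 2, 2, 1.d0, a, 2, b, 2, 0.d0, c, 2)
      write(*,*) c
      end program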

ATLAS. The Automatically Tuned Linear Algebra Software package (ATLAS) is a self-tuned version of BLAS (it also includes a few LAPACK routines). When it's installed, it tests and times a variety of approaches to each routine and selects the version that runs the fastest. ATLAS is substantially faster than the generic version of BLAS.

Goto BLAS GotoBLAS Developed by Kazushige Goto (currently at UT Austin). This version is unusual, because instead of optimizing for cache, it optimizes for the Translation Lookaside Buffer (TLB) which is a special little cache that often is ignored by software developers. Goto realized that optimizing for the TLB would be more effective than optimizing for cache.

LAPACK. The Linear Algebra PACKage solves dense or special-case sparse systems of equations depending on matrix properties such as: precision (single, double); data type (real, complex); shape (diagonal, bidiagonal, tridiagonal, banded, triangular, trapezoidal, Hessenberg, general dense); properties (orthogonal, positive definite, Hermitian (complex), symmetric, general). LAPACK is built on top of BLAS, which means it can benefit from ATLAS or GotoBLAS. Problems that LAPACK can solve: systems of linear equations, linear least squares problems, eigenvalue problems, singular value problems.
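As a concrete instance of the "systems of linear equations" case, LAPACK's DGESV solves A x = b with an LU factorization. A minimal sketch with illustrative values (the solution of this small system is x = 2, y = 3):

      program dgesv_demo
      implicit none
      integer, parameter :: n = 2
      double precision :: a(n,n), b(n)
      integer :: ipiv(n), info
      ! A = [3 1; 1 2] in column-major order, right-hand side b = [9; 8]
      data a /3.d0, 1.d0, 1.d0, 2.d0/
      data b /9.d0, 8.d0/
      ! solve A x = b; the solution overwrites b
      call dgesv(n, 1, a, n, ipiv, b, n, info)
      write(*,*) 'info =', info, '  x =', b
      end program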

AMD Core Math Library (ACML). BLAS (Basic Linear Algebra Subprograms): full Level 1, 2, and 3 support; highly optimized DGEMM and other Level 3 BLAS; OpenMP support for key routines. LAPACK (Linear Algebra package): uses calls to BLAS to solve linear algebra systems; matrix factorization/solve and eigenvalue solutions; OpenMP support for key routines. FFTs (Fast Fourier Transforms): time-to-frequency domain; hand-tuned assembly; OpenMP support for 2D and 3D transforms. Fast/vector transcendental math library: 1, 2, 4, or N values per call; single and double precision (IEEE754). RNGs (Random Number Generators): comprehensive reference implementation.

ScaLAPACK: Scalable Linear Algebra PACKage, http://www.netlib.org/scalapack. Developed by a team from the University of Tennessee, University of California Berkeley, ORNL, Rice University, UCLA, UIUC, etc. Supported in commercial packages: NAG Parallel Library, IBM PESSL, CRAY Scientific Library, VNI IMSL, Fujitsu, HP/Convex, Hitachi, NEC. Handles dense and banded matrices.

ScaLAPACK API: prepend the equivalent LAPACK names with P, so routine names have the form P + (data type) + (matrix type) + (computation performed). Data types: real or complex. Matrix types: GB (general band), GE (general matrices), ST (symmetric tridiagonal), SY (symmetric), UN (unitary, complex). Computations performed: SL linear equations (SVX), SV LU solver, VD singular value, EV eigenvalue (EVX), GVX generalized eigenvalue. For example, PDGESV is the parallel (P) double-precision (D) general-matrix (GE) linear-system solver (SV).

Agenda SUN Blade 6275 타키온 2(Tachyon2) HW 소개 계산노드 Blade 6275 프로세서 Intel Xeon X5570 ( 네할렘 ) SUN Blade 6275 타키온 2(Tachyon2) SW 소개 MVAPICH, OPENMPI MATH. Libraries 타키온시스템사용 무료계정발급 초보계정신청 전략과제소개

타키온시스템사용 초보계정신청 http://helpdesk.ksc.re.kr/index.htm 기본정보 계정신청 계정신청안내 전략적과제신청 http://www.ksc.re.kr/ 사용자지원 응용연구지원 유료사용 http://www.ksc.re.kr/ p// / 사용자지원 - 사용안내

Help Desk: http://helpdesk.ksc.re.kr. Account applications: 김성준, sjkim@kisti.re.kr; general technical support: 홍태영, tyhong@kisti.re.kr; education support: 이홍석, consult@kisti.re.kr. Main line: 080-041-1991 (toll-free). Homepage: http://www.ksc.re.kr. Education homepage: http://webedu.ksc.re.kr (use the Q&A board).

02. Tachyon2 hands-on system overview: an introduction to the Tachyon hands-on computing environment.

Agenda 실습시스템및계정 기본사용자환경 사용자셀변경 패스워드변경 작업수행시유의사항 로그인, 디버깅, 및계산노드사용 작업디렉토리 환경변수설정 컴파일러환경변수설정 라이브러리사용 디버깅노드에서실행

기본사용자환경 로그인 사용자의초기접속은총 4 대의로그인노드로한정됨 컴퓨팅노드로의액세스불가 액세스인터페이스는 ssh, sftp, ftp, X11 만허용됨 유닉스혹은리눅스에서 $ ssh -l 사용자 ID tachyon2.ksc.re.kr 또는 $ ssh -l 사용자 ID IP address 윈도우즈에서 putty 나 SSH Secure Shell Client 등의 ssh 접속유틸리티를이용함 프로그램은인터넷을통해무료로다운받을수있음

PuTTY 사용법 Host Name : tachyon2.ksc.re.kr 교육시스템 : 150.183.146.202

PuTTY 사용법 SSH -> X11 탭에서 Enable X11 forwarding 체크 X display location : localhost:0.0 Xming 실행필요

PuTTY 사용법 PuTTY Security Alert 에서 예 선택

PuTTY 사용법 login as : edunxx EdunXX@remotehost s password : secret

PuTTY 사용법 접속완료 Unix 명령입력

로그인로드 로그인화면

Basic user environment: Tachyon2 node configuration
- Representative DNS host name: tachyon2.ksc.re.kr
- Login nodes (4): tachyon2a.ksc.re.kr (150.183.175.101), tachyon2b.ksc.re.kr (150.183.175.102), tachyon2c.ksc.re.kr (150.183.175.103), tachyon2d.ksc.re.kr (150.183.175.104); interactive processes are limited to 10 minutes of CPU time
- Debugging nodes (24): s3177 ~ s3200, reachable through the login nodes (e.g. ssh s3189); same hardware as the compute nodes, for compiling and debugging; applications may run for up to 120 minutes
- Compute nodes (3,176): s0001 ~ s3176; jobs can be run only through SGE (the batch scheduler); ordinary users may log in for at most 1 minute, for monitoring

기본사용자환경 타키온노드구성 비고호스트이름 IP 주소기타사항 tachyon.ksc.re.kr DNS 대표호스트네임 로그인노드 (4 노드 ) tachyona.ksc.re.kr 150.183.147.213 tachyonb.ksc.re.kr 150.183.147.214 tachyonc.ksc.re.kr 150.183.147.215 tachyon189 Interactive process : CPU limit 10 분 *tachyond 노드현재로그인불가 디버깅노드 (4 노드 ) tachyon190 tachyon191 로그인노드를통해서접근가능 (e.g. ssh tachyon189) 컴파일및디버깅용컴퓨팅노드와동일한시스템 30 분간어플리케이션수행가능 tachyon192 컴퓨팅노드 (188 노드 ) tachyon001-188 SGE( 배치스케줄러 ) 를통해서만작업실행가능 일반사용자는모니터링을위해 1 분간접근허용

디버깅노드 교육시스템 - ssh node01( ~ node04) : tachyon - ssh s0001 ( ~ s0004) : tachyon2

기본사용자환경 사용자쉘변경 기본으로설정되는 bash 에서다른 shell 로변경했을경우 사용자의홈디렉터리에있는해당환경설정파일을적절히 수정하여사용함 원본이필요한경우사용자가직접 /applic/shell 디렉터리에서필요한쉘의환경설정파일을자신의홈디렉터리로복사하여적절히수정하여사용함 $ ldapchsh (bash 로변경 )

기본사용자환경 패스워드변경 사용자패스워드를변경하기위해서는 passwd 명령을사용함 패스워드관련보안정책 사용자 password 길이를 8 character 이상, 특수문자 2 자이상포함 사용자 password 변경기간을 2 개월 (62 일 ) 로설정 ( 로그인시공지 ) 새로운패스워드는이전패스워드와비교하여 2문자이상달라야함최대허용로그인재시도회수 :10 회사용자가 password 변경시새로운 password가사용자가계정을 갖고있는 KISTI 슈퍼컴퓨팅센터의 GAIA 등다른시스템에서 그대로적용

기본사용자환경 작업수행시유의사항 로그인노드에서는 CPU time 을 10 분으로제한 프로그램수정, 디버깅및배치작업제출등 CPU time으로 10분이상소요되는디버깅및기타인터랙티브작업은디버깅노드에서수행해야함 계산노드는홈디렉터리가마운트되어있지않음 모든계산작업은스크래치디렉터리사용스크래치디렉터리의경우 4일간사용하지않은데이터는자동삭제 SGE 을통해서작업수행전에 /scratch ( 타키온의경우 /work01) 디렉터리로작업에필요한데이터파일을복사해야함 또는 link 를이용

Basic user environment: Tachyon2 work directories
- Home directory /home01: 6GB per account; no automatic deletion; not mounted on the compute nodes
- Scratch directory /scratch: 1TB per user; files not accessed for 4 days are deleted automatically; Lustre file system; mounted on all nodes
- Applications: /applic

Basic user environment: Tachyon work directories
- Home directory /home01: 6GB per account; no automatic deletion; not mounted on the compute nodes
- Scratch directories /work01, /work02: 1TB per user; files not accessed for 20 days are deleted automatically; Lustre file system; mounted on all nodes
- Applications: /applic

기본사용자환경 홈및스크래치디렉터리용량제한및사용량확인 $ quotaprint $ quotaprint [ USER DISK USAGE IN THE HOME & SCRATCH DIR ] ====================================================== ID/GROUP DIR QUOTA_LIMIT USED_DISK AVAIL_DISK ====================================================== in1000 /home01 12573MB 10937MB 1636MB test /work01 1073740MB 54905MB 1018835MB test /work02 1073740MB 0MB 1073740MB ======================================================

사용자프로그래밍환경 프로그램컴파일 제공컴파일러 : GNU, Portland Group (PGI), Intel 모든프로그램은 PGI, Intel, GNU 컴파일러를사용하여컴파일가능 MPI 환경을이용한컴파일도가능함 각컴파일러에대한자세한내용은다음웹링크참조 GCC : http://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc.pdf PGI : http://www.pgroup.com/doc/pgiug.pdf Intel : http://www.intel.com/cd/software/products/asmona/eng/compilers/219831.htm 컴파일러 위치 PGI compiler Intel compiler GNU compiler /applic/compilers/pgi /applic/compilers/intel /applic/compilers/gcc

User programming environment: libraries. Libraries are installed under /applic/<compiler dir>/applib1 or /applic/<mpi dir>/applib2, depending on the compiler and MPI used:
- /applic/compilers/pgi/linux86-64/8.0-6/applib1: libraries compiled with PGI
- /applic/compilers/intel/10.1/applib1: libraries compiled with Intel
- /applic/compilers/gcc/4.1.2/applib1: libraries compiled with GCC
Libraries that use MPI are built separately per MPI library. For example, BLACS compiled with gcc:
- /applic/compilers/gcc/4.1.2/mpi/mvapich/1.1.0/applib2: BLACS compiled with MVAPICH
- /applic/compilers/gcc/4.1.2/mpi/openmpi/1.3.2/applib2: BLACS compiled with OpenMPI

사용자프로그래밍환경 컴파일러환경변수 각컴파일러에맞는환경변수의자동설정을위해.bashrc 와.switchenv 파일을제공함. path,manpath,ld_library_path, licence_file 등의환경이설정됨 컴파일러환경변경하기 switchenv 을통해쉽게다른컴파일러환경으로변경가능 switchenv 을이용한컴파일러및 MPI, shell 지정 사용법예제선택가능한 mpi & version 선택가능한 compiler & version 선택가능한 shell switchenv [mpi] [mpi_version] [compiler] [compiler_version] [shell] switchenv mvapich2 1.4 intel 11 bash mvapich 1.1 or 1.1.0 / 1.2 or 1.2.0 mvapich2 1.2 or 1.2p1 / 1.4 / 1.5 openmpi 1.3.2 / 1.3.3 / 1.4 / 1.4.1 / 1.4.2 gcc 4 or 4.1.2 intel 10 or 10.1 / 11 or 11.1 pgi 8 or 8.0 / 9 or 9.0 sun 12 or 12.1 bash, csh, tcsh, ksh

환경설정 : MPI 와컴파일러선택

사용자프로그래밍환경 순차프로그램컴파일 벤더컴파일러명령프로그램소스확장자 pgcc C.c pgcpp C++.c,.C,.cc,.cpp pgi pgf77 F77.f,.for,.fpp,.F,.FOR pgf90/pgf95 F90/95.f,.for,.f90,.f95,.fpp,.F,.FOR,.F90,.F95 icc C.c intel icc C++.c,.C,.cc,.cpp,.cxx,.c++ ifort F90.f,.for,.ftn,.f90,.fpp,.F,.FOR,.FTN,.FPP,.F90 gnu gcc C.c g++ C++.C,.cc,.cpp,.cxx

User programming environment: MPI parallel programming. The wrappers corresponding to the compiler selected through .bashrc compile the source:
- mpicc: C (.c)
- mpicxx/mpiCC: C++ (.cc, .c, .cpp, .cxx)
- mpif90: F77/F90 (.f, .for, .ftn, .f90, .f95, .fpp)
Even when compiling with mpicc, use the options of the underlying (wrapped) compiler. Examples:
intel: mpicc/mpif90 -o test.exe -O2 -xW -m64 test.cc/test.f90
pgi: mpicc/mpif90 -o test.exe -fast test.cc/test.f90

MPI parallel programming: running MPI programs on the compute nodes. To run an MPI program on the compute nodes (s0001-s3176, tachyon001-188), first copy the files the job needs into the scratch directory, then run the program in batch mode through the Sun Grid Engine (SGE, the batch job scheduler). Debugging MPI programs on the debugging nodes: to debug an MPI program on the debugging nodes (s3177-s3200, tachyon189-192), first copy the files needed to run the job into the scratch directory and then log in to a debugging node.

03. Hands-on practice on the Tachyon2 compute nodes

Parallel Performance Metrics The simplest parallel performance metric is always wallclock time. Represents the "time to solution CPU time is even less useful here than in the single processor case. Hardware performance counters can still be used to assess overall performance as well. The parallelism in the application introduces the concept of scalability and two new metrics to consider: Speedup Parallel efficiency

Parallel Performance Metrics : Speedup. Speedup is the ratio of the time for a single processor (or a baseline of N0 processors) to that for N processors, given the same input: S(N) = N0 * t(N0) / t(N). This is a measure of how much faster an application becomes as more processors are used. Ideally, speedup would be exactly equal to the processor count: S_ideal(N) = N. Superlinear speedup: the speedup on N processors is greater than N.
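A worked example with illustrative numbers: if the serial run takes t(1) = 100 s and the 16-processor run takes t(16) = 8 s, then

S(16) = t(1) / t(16) = 100 / 8 = 12.5
E(16) = S(16) / 16 ≈ 0.78  (78% parallel efficiency)

whereas the ideal speedup would be S_ideal(16) = 16.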

Elapsed Times and Speedup

Amdahl's Law (1967): an observation on the practical limits of speedup for a given problem. If p is the parallelizable fraction of the work, the maximum speedup that can be achieved by that application is S = 1/[(1-p) + p/N]. Note that this does not consider other limiting factors such as memory, interconnect, or I/O bandwidth.
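A worked example with illustrative numbers: with a parallel fraction p = 0.95 on N = 64 processors,

S = 1/[(1-0.95) + 0.95/64] = 1/[0.05 + 0.0148] ≈ 15.4

so even an ideally parallelized 95%-parallel code cannot exceed a speedup of 1/(1-p) = 20, no matter how many processors are added.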

Gustafson's Law (1988): Gustafson's Law contradicts Amdahl's law, which describes a limit on the speedup that parallelization can provide. Gustafson's law is stated as S(N) = N - α(N-1), where N is the number of processors, S the speedup, and α the non-parallelizable fraction of the work.
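With the same illustrative numbers as above, a serial fraction α = 0.05 on N = 64 processors gives a scaled speedup of

S(64) = 64 - 0.05*(64-1) = 64 - 3.15 = 60.85

which is much closer to ideal than Amdahl's bound because the problem size is assumed to grow with the processor count.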

Amdahl's law & Gustafson's law

Measuring the Performance of Parallel Applications: wallclock timing with /usr/bin/time. It is usually a good idea to instrument your application with timing calls around important sections of the code. MPI (C/C++/Fortran): MPI_Wtime.
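A minimal sketch of instrumenting a code section with MPI_Wtime (a fragment, assuming mpif.h is already included and myrank has been set by MPI_COMM_RANK, as in the PI example later in this material; the timed section is a placeholder):

      double precision :: t0, t1
      t0 = MPI_WTIME()
      ! ... the code section to be timed goes here ...
      t1 = MPI_WTIME()
      if (myrank == 0) write(*,*) 'elapsed wallclock [s]:', t1 - t0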

Profiling of Parallel Programs. Profiling a parallel program is often difficult: the profiling tools included with compilers are typically not designed for use with concurrent programs. Two parallel profiling tools: TAU (Tuning and Analysis Utilities), a set of tools for analyzing the performance of C, C++ and Fortran programs (http://acts.nersc.gov/tau/), and TotalView.

Example MPI parallel program: PI calculation algorithm (Monte Carlo). Inscribe a circle in a square, draw random points inside the square, count how many of the drawn points fall inside the circle, and estimate PI = 4 * Ac/As (the ratio of the circle's area to the square's).
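The following slides implement PI by numerical integration instead; as a complement, here is a minimal serial sketch of the Monte Carlo algorithm described above (the trial count is illustrative and the default random_number seeding is used):

      program pi_montecarlo
      implicit none
      integer*8, parameter :: num_trial = 100000000
      integer*8 :: i, ninside
      double precision :: x, y, pi
      ninside = 0
      do i = 1, num_trial
         call random_number(x)
         call random_number(y)
         ! (x,y) is uniform in the unit square; count points that
         ! fall inside the quarter circle of radius 1
         if (x*x + y*y <= 1.d0) ninside = ninside + 1
      enddo
      pi = 4.d0 * dble(ninside) / dble(num_trial)
      write(*,*) 'PI estimate =', pi
      end program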

Our running example: the PI program (numerical integration). Mathematically, we know that the integral of 4.0/(1+x^2) from 0 to 1 equals pi. We can approximate the integral as a sum of N rectangles, sum over i of F(x_i)*dx, where each rectangle has width dx = 1/N and height F(x_i) at the middle of interval i, x_i = (i - 0.5)/N. That is, pi ≈ (1/N) * [ f(x_1) + f(x_2) + ... + f(x_N) ].

PI Program : The sequential program! ----------------------------------------------------------------------! PI calculation : serial code! F(x) = 4.0/(1+x^2) : PI from integral calculation of F(x) from 0 to 1! ---------------------------------------------------------------------- program main implicit none integer*8,parameter :: num_step = 1000000000 integer*8 :: i double precision :: sum,step,pi,x double precision :: stime,etime,rtc step = (1.0d0/dble(num_step)) sum = 0.0d0 write(*,400) stime=rtc()!starting time do i=1,num_step x = (dble(i)-0.5d0)*step sum = sum + 4.d0/(1.d0+x*x)! F(x) enddo etime=rtc()!ending time pi = step * sum write(*,100) pi,dabs(dacos(-1.0d0)-pi) write(*,300) etime-stime write(*,400) 100 format(' PI = ', F17.15,' (Error =',E11.5,')') 300 format(' Elapsed Time = ',F8.3,' [sec] ') 400 format('-----------------------------------------------------') stop end program

Compiling and running (serial code). Create the practice directory and copy the example file:
$ mkdir pical
$ cd pical
$ cp /work01/edu_ex/tachyon/pical/serial.f ./
Compile the serial code:
$ pgf90 serial.f -o serial.x   (or: $ ifort serial.f -o serial.x)
Run:
$ ./serial.x

PI Program : OpenMP program program main implicit none integer*8,parameter :: num_step = 1000000000 integer*8 :: i,tid,num_threads integer*8 :: OMP_GET_THREAD_NUM,OMP_GET_NUM_THREADS double precision :: sum,step,pi,x pp double precision :: stime,etime,rtc step = (1.0d0/dble(num_step)) sum = 0.0d0 write(*,400) stime=rtc()!starting time!$omp PARALLEL PRIVATE(tid,x) tid = OMP_GET_THREAD_NUM() num_threads = OMP_GET_NUM_THREADS() write(*,10) tid,num_threads!$omp DO REDUCTION(+:sum) do i=1,num_step x = (dble(i)-0.5d0)*step sum = sum + 4.d0/(1.d0+x*x)! F(x) enddo!$omp END PARALLEL etime=rtc()!ending time pi = step * sum write(*,400) write(*,100) pi,dabs(dacos(-1.0d0)-pi) write(*,300) etime-stime write(*,400) 10 format(' My thread ID =',i3,', Total',i3,' threads are activated') 100 format(' PI = ', F17.15,' (Error =',E11.5,')') 300 format(' Elapsed Time = ',F8.3,' [sec] ') 400 format('-----------------------------------------------------') stop end program

PI Program : MPI program program main implicit none include 'mpif.h integer :: ierr, nprocs, myrank, tag integer :: status(mpi_status_size) _ integer*8,parameter :: num_step = 1000000000 integer*8 :: i,j double precision :: sum,step,pi,x,recv double precision :: stime,etime,rtc step = (1.0d0/dble(num_step)) sum = 0.0d0 CALL MPI_INIT(ierr) CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr) CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr) if(myrank==0) then write(*,400) Endif stime=rtc()!starting time do i=myrank+1,num_step,nprocs x = (dble(i)-0.5d0)*step sum = sum + 4.d0/(1.d0+x*x) d0+x*x)! F(x) enddo etime=rtc()!ending time pi = step * sum if(myrank /= 0) then CALL MPI_SEND(pi, 1, MPI_DOUBLE_PRECISION, 0, 1, + MPI_COMM_WORLD, ierr) endif if(myrank==0) then do j = 1, nprocs-1 CALL MPI_RECV(recv, 1, MPI_DOUBLE_PRECISION, j, 1, + MPI_COMM_WORLD, status, ierr) pi = pi + recv enddo write(*,100) pi,dabs(dacos(-1.0d0)-pi) write(*,300) etime-stime write(*,400) 100 format(' PI =', F17.15,' (Error =',E11.5,')') ) 300 format(' Elapsed Time = ',F8.3,' [sec] ') 400 format('-----------------------------------------------------') endif CALL MPI_FINALIZE(ierr) stop end program
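The explicit MPI_SEND/MPI_RECV loop at the end of this program gathers the partial sums by hand. The same result can be obtained with a single collective call; a hedged sketch of that alternative, reusing the variable names of the program above (pi_total would be declared with the other double precision variables):

      ! replace the send/receive loop with one collective reduction
      CALL MPI_REDUCE(pi, pi_total, 1, MPI_DOUBLE_PRECISION,
     +                MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (myrank == 0) then
         write(*,100) pi_total, dabs(dacos(-1.0d0) - pi_total)
      endif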

PI Program : MPI program program main implicit none include 'mpif.h' integer :: ierr, nprocs, myrank, tag integer :: status(mpi_status_size) _ integer*8,parameter :: num_step = 1000000000 integer*8 :: i,j,is, ie,tid, num_threads integer*8 :: OMP_GET_THREAD_NUM, + OMP_GET_NUM_THREADS double precision :: sum,step,pi,x,recv double precision :: stime,etime,rtc step = (1.0d0/dble(num_step)) sum = 0.0d0 CALL MPI_INIT(ierr) INIT(i CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr) CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr) IF(myrank==0) THEN write(*,400) ENDIF stime=rtc()!starting time CALL OMP_SET_NUM_THREADS(2)!$OMP PARALLEL PRIVATE(tid, x) tid = OMP_GET_THREAD_NUM() THREAD num_threads = OMP_GET_NUM_THREADS() write(*,10) tid, num_threads, myrank!$omp DO REDUCTION(+:sum) do i=myrank+1,num_step,nprocs x = (dble(i)-0.5d0)*step sum = sum + 4.d0/(1.d0+x*x)! F(x) enddo!$omp END PARALLEL etime=rtc()!ending time pi = step * sum if(myrank /= 0) then CALL MPI_SEND(pi, 1, MPI_DOUBLE_PRECISION, 0, 1, + MPI_COMM_WORLD, ierr) endif if(myrank==0) then do j = 1, nprocs-1 CALL MPI_RECV(recv, 1, MPI_DOUBLE_PRECISION, PRECISION, j, 1, + MPI_COMM_WORLD, status, ierr) pi = pi + recv enddo write(*,100) pi,dabs(dacos(-1.0d0)-pi) write(*,300) etime-stime write(*,400) endif 10 format(' My thread ID =',i3,', Total',i3,' threads are activated, ' in rank: ', i3) 100 format(' PI = ', F17.15, 15 ' (Error =',E11.5, ')') )) 300 format(' Elapsed Time = ',F8.3,' [sec] ') 400 format('-----------------------------------------------------') CALL MPI_FINALIZE(ierr) stop end program

PI Program : MPI program program main implicit none include 'mpif.h integer :: ierr, nprocs, myrank, tag integer :: status(mpi_status_size) integer*8,parameter :: num_step = 1000000000 integer*8 :: i,j,is, ie,tid, num_threads integer*8 :: OMP_GET_THREAD_NUM, + OMP_GET_NUM_THREADS double precision :: sum,step,pi,x,recv double precision :: stime,etime,rtcetime step = (1.0d0/dble(num_step)) sum = 0.0d0 CALL MPI_INIT(ierr) CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr) CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr) IF(myrank==0) THEN write(*,400) ENDIF stime=rtc()!starting time CALL OMP_SET_NUM_THREADS(2)!$OMP PARALLEL PRIVATE(tid, x) tid = OMP_GET_THREAD_NUM() num_threads = OMP_GET_NUM_THREADS() write(*,10) tid, num_threads, myrank!$omp DO REDUCTION(+:sum) do i=myrank+1,num_step,nprocs x = (dble(i)-0.5d0)*step sum = sum + 4.d0/(1.d0+x*x)! F(x) enddo!$omp END PARALLEL etime=rtc()!ending time pi = step * sum if(myrank /= 0) then CALL MPI_SEND(pi, 1, MPI_DOUBLE_PRECISION, 0, 1, + MPI_COMM_WORLD, ierr) endif if(myrank==0) then do j = 1, nprocs-1 CALL MPI_RECV(recv, 1, MPI_DOUBLE_PRECISION, PRECISION, j, 1, + MPI_COMM_WORLD, status, ierr) pi = pi + recv enddo write(*,100) pi,dabs(dacos(-1.0d0)-pi) write(*,300) etime-stime write(*,400) endif 10 format(' My thread ID =',i3,', Total',i3,' threads are activated, ' in rank: ', i3) 100 format(' PI = ', F17.15, 15 ' (Error =',E11.5, ')') )) 300 format(' Elapsed Time = ',F8.3,' [sec] ') 400 format('-----------------------------------------------------') CALL MPI_FINALIZE(ierr) stop end program

Compiling and running (OpenMP, MPI, Hybrid)
Compile:
$ pgf90 -mp openmp.f -o openmp.x   (PGI compiler)
$ ifort -openmp openmp.f -o openmp.x   (Intel compiler)
$ mpif90 mpi.f -o mpi.x
$ mpif90 -mp hybrid.f -o hybrid.x   (or: mpif90 -openmp hybrid.f -o hybrid.x)
Run:
$ export OMP_NUM_THREADS=4
$ ./openmp.x
$ mpirun -np 4 -machinefile hosts ./mpi.x   (needs a host file)
$ export OMP_NUM_THREADS=2
$ mpirun -np 2 -machinefile hosts ./hybrid.x   (needs a host file)
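The host file passed to -machinefile is a plain text list of node names, typically one process slot per line (repeat a node name to place several processes on it). A hypothetical example for 4 MPI processes spread over two debugging nodes (node names are placeholders):

$ cat hosts
s3177
s3177
s3178
s3178
$ mpirun -np 4 -machinefile hosts ./mpi.x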

04. SGE (Sun Grid Engine): how to run jobs through SGE.

SGE Overview: resource selection; enterprise allocation and prioritization policies; extensible workload-to-resource matching; resource control; customizable system load and access regulation; definable job execution contexts; resource accounting; web-based reporting and analysis; open and integratable data source; a proven solution for large cluster environments.

KISTI SGE Architecture ssh, sftp, scp, ftp etc Lic#1 License Server LDAP master UNIX ssh, sftp, scp, ftp etc Lic#2 Lic#3 LDAP slave Windows Exceed, X-manager, VNC, etc Sun Grid Engine Interactive Nodes Users Login Nodes Master Server Master Server (Active) Spool DB (standby) Internet Accounting server Batch Nodes Management Nodes Execution Nodes

Queue configuration (Tachyon2)
- normal: wall clock limit 48 h, nodes tachyon026-188, 17-1536 CPUs per job, normal priority, SU charge rate 1
- long: 168 h, nodes tachyon001-025, 1-128 CPUs per job, low priority, SU charge rate 1 (long-running jobs)
- strategy: 168 h, nodes tachyon001-188, 256-3008 CPUs per job, high priority, SU charge rate 1 (Grand Challenge jobs)
- special: 12 h, nodes tachyon001-188, 1537-3008 CPUs per job, SU charge rate 2 (dedicated large-scale resources, by prior reservation)
A user may have at most 20 running jobs (this may be adjusted at any time depending on load). Check the queue configuration with $ showq. Training system (master node): long (node01 ~ node04), normal (s0001 ~ s0004).

Queue configuration (Tachyon)
- normal: wall clock limit 48 h, nodes tachyon026-188, 17-1536 CPUs per job, normal priority, SU charge rate 1
- long: 168 h, nodes tachyon001-025, 1-128 CPUs per job, low priority, SU charge rate 1 (long-running jobs)
- strategy: 168 h, nodes tachyon001-188, 256-3008 CPUs per job, high priority, SU charge rate 1 (Grand Challenge jobs)
- special: 12 h, nodes tachyon001-188, 1537-3008 CPUs per job, SU charge rate 2 (dedicated large-scale resources, by prior reservation)
A user may have at most 20 running jobs (this may be adjusted at any time depending on load). Check the queue configuration with $ showq. Training system (master node): long (node01 ~ node04), normal (s0001 ~ s0004).

Job submission Job Submission $ qsub job_script Job script examples /applic/shell/job_examples/job_script 에서복사하여사용

Job Script Practice. Download the practice files:
$ cd /work01/edunxx   (if the directory does not exist, create it with mkdir edunxx)
$ mkdir SGE
$ cd SGE
$ tar zxvf /work01/edu_ex/tachyon/sge/sgejob.tar.gz
$ cp /applic/shell/job_examples/job_script/*.sh ./
Compile:
$ pgf90 serial.f -o serial.x   (or: ifort serial.f -o serial.x)
$ mpif90 mpi.f -o mpi.x
$ pgf90 -mp openmp.f -o openmp.x   (or: ifort -openmp openmp.f -o openmp.x)
$ mpif90 -mp hybrid.f -o hybrid.x   (or: mpif90 -openmp hybrid.f -o hybrid.x)

Job Script Serial Program Serial 프로그램 (1CPU) 작업스크립트작성예제 (serial.sh) #!/bin/bash #$ -V # 작업제출노드의쉘환경변수를컴퓨팅노드에도적용 (default) #$ -cwd # 현재디렉터리를작업디렉터리로사용 #$ -N serial_job # Job Name, 명시하지않으면 job_script 이름을가져옴 #$ -q special # Queue name #$ -R yes # Resource Reservation ##$ -wd /work01/<user01>/serialtest / # 작업디렉터리를설정. 현재디렉토리 (PWD) 가 # /work01/<user01>/serialtest 가아닌경우사용, # 그렇지않으면 cwd 로충분함 #$ -l h_rt=01:00:00 # 작업경과시간 (hh:mm:ss) (wall clock time), 누락시작업강제종료 #$ -l exclusive=true # 노드에서자신의작업만배타적으로수행할경우명시 #$ -M myemailaddress # 작업관련메일을보낼사용자메일주소 #$ -m e # 작업종료시에메일을보냄./serial.x

Job Script Practice (serial). Edit the script file (vi serial.sh):
#!/bin/bash
#$ -V
#$ -cwd
#$ -N serial_job
#$ -q long
#$ -R yes
#$ -wd /work01/edunxx/SGE
#$ -l h_rt=01:00:00
##$ -M myemailaddress
##$ -m e
./serial.x
Submit the job:
$ qsub serial.sh
Your job 134619 ("serial_job") has been submitted
$ qstat
$ more serial_job.o134619
$ more serial_job.e134619

Job Script: MPI Program (1). MPI program job script example (mpi.sh). Select the job execution environment with the select-mpi-[shell] command:
$ select-mpi-bash [mvapich | openmpi] [pgi | intel | gnu]
$ exit   (log in again after exit so that the selected environment takes effect)
Specify the number of MPI tasks (CPUs):
#$ -pe mpi_fu {Total_MPI_task(CPU)}
e.g. #$ -pe mpi_fu 32

Job Script MPI Program(2) #!/bin/bash #$ -V # 작업제출노드의쉘환경변수를컴퓨팅노드에도적용 (default) #$ -cwd # 현재디렉터리를작업디렉터리로사용 #$ -N mvapich_job # Job Name, 명시하지않으면 job_script 이름을가져옴 #$ -pe mpi_fu 32 # selec-bash-mpi에서선택한 mvapich로실행되며각노드의가용 cpu를 # 모두채워서 (fu : fill_up) 총 32 개의 MPI task 가실행됨. #$ -q normal # 큐이름 #$ -R yes # Resource Reservation #$ -wd /work01/<user01>/mvapich / / # 작업디렉터리를설정. 현재디렉토리 (PWD) 가 # /work01/<user01>/mvapich가아닌경우사용, # 그렇지않으면 cwd로충분함 #$-lh_rt=01:00:00 # 작업경과시간 (hh:mm:ss)(wallclocktime), 누락시강제작업종료 #$ -l exclusive=true # 노드에서자신의작업만배타적으로수행할경우명시 #$ -M myemailaddress # 작업관련메일을보낼사용자메일주소 #$ -m e # 작업종료시에메일을보냄 mpirun -machinefile $TMPDIR/machines -np $NSLOTS /work01/<user01>/mvapich/mpi.x

Job Script Practice (MPI) script 파일수정 ( vi mpi.sh ) #!/bin/bash #$ -V #$ -cwd #$ -N mvapich_job #$ -pe mpi_fu 4 #$ -q long #$ -R yes #$ -wd /work01/edun??/sge #$ -l h_rt=01:00:00 ##$ -M myemailaddress ##$ -m e mpirun -machinefile $TMPDIR/machines -np $NSLOTS./mpi.x job submit $ qsub mpi.sh Your job 134622 ("mpi_job job") has been submitted $ qstat $ more mpi_job.o134622 $ more mpi_job.e134622

Job Script OpenMp Program OpenMP 프로그램작업스크립트작성예제 (openmp.sh) #!/bin/bash #$ -V # 작업제출노드의쉘환경변수를컴퓨팅노드에도적용 (default) #$ -cwd # 현재디렉터리를작업디렉터리로사용 #$ -N openmp_job # Job Name, 명시하지않으면 job_script 이름을가져옴 #$ -pe openmp 4 # OpenMP thread 수 #$ -q small # Queue name(openmp 작업은 small or long 큐사용가능 ) #$ -R yes # Resource Reservation #$ -wd /work02/<user01>/openmp # 작업디렉터리를설정. 현재디렉토리 (PWD) 가 # /lustre1/<user01>/openmp가아닌경우사용, # 그렇지않으면 cwd로충분함 #$ -l h_rt=01:00:00 # 작업경과시간 (hh:mm:ss) (wall clock time), 누락시작업강제종료 #$ -l exclusive=true # 노드에서자신의작업만배타적으로수행할경우명시 #$ -M myemailaddress # 작업관련메일을보낼사용자메일주소 #$ -m e # 작업종료시에메일을보냄 export OMP_NUM_THREADS=4 /work02/<user01>/openmp.x

Job Script Practice (OpenMP) script 파일수정 ( vi openmp.sh ) #!/bin/bash #$ -V #$ -cwd #$ -N openmp_job #$ -pe openmp 4 #$ -q long #$ -R yes #$ -wd /work01/edun??/sge #$ -l h_rt=01:00:00 ##$ -M myemailaddress ##$ -m e export OMP_NUM_THREADS=4./openmp.x job submit $ qsub openmp.sh Your job 134624 ("openmp_job") has been submitted $ qstat $ more openmp_job.o134624 $ more openmp_job.e134624

Job Script: Hybrid Program (1). Hybrid (MPI+OpenMP) program job script example (hybrid.sh). Select the job execution environment with the select-mpi-[shell] command:
$ select-mpi-bash [mvapich | openmpi] [pgi | intel | gnu]
$ exit   (log in again after exit so that the selected environment takes effect)
Specify the number of MPI tasks (CPUs) per node and in total:
#$ -pe mpi_{MPI_task(CPU)_per_node}cpu {Total_MPI_task(CPU)}
e.g. #$ -pe mpi_4cpu 16
Specify the OMP_NUM_THREADS resource:
#$ -l OMP_NUM_THREADS={OpenMP_threads_per_MPI_task}
e.g. #$ -l OMP_NUM_THREADS=4
Set the OMP_NUM_THREADS environment variable:
export OMP_NUM_THREADS=4

Job Script Hybrid Program(2) #!/bin/bash #$ -V # 작업제출노드의쉘환경변수를컴퓨팅노드에도적용 (default) #$ -cwd # 현재디렉터리를작업디렉터리로사용 #$ -N hybrid_job # Job Name, 명시하지않으면 job_script 이름을가져옴 #$ -pe mpi_4cpu 8 # 전체 MPI task(cpu) 8개, 노드당 MPI task(cpu) 4개 #$ -q normal # Queue name #$ -R yes # Resource Reservation #$ -wd /work02/<user01>/hybrid # 작업디렉터리를설정. 현재디렉토리 (PWD) 가 # /lustre1/<user01>/hybrid가아닌경우사용, # 그렇지않으면 cwd 로충분함 #$-l h_rt=01:00:00 # 작업경과시간 (hh:mm:ss)(wall clock time), 누락시강제작업종료 #$ -l exclusive=true # 노드에서자신의작업만배타적으로수행할경우명시 #$ -l OMP_NUM_THREADS=2 # MPI 타스크당 OpenMP 쓰레드수를의미하며, 아래 # OMP_NUM_THREADS 환경변수에서명기한 OpenMP # 쓰레드숫자와동일한값지정. 누락시작업강제종료 #$ -M myemailaddress # 작업관련메일을보낼사용자메일주소 #$ -m e # 작업종료시에메일을보냄 export OMP_NUM_THREADS=2 mpirun -machinefile $TMPDIR/machines -np $NSLOTS /work02/<user01>/hybrid.x

Job Script Practice (Hybrid) script 파일수정 ( vi hybrid.sh ) #!/bin/bash #$ -V #$ -cwd #$ -N hybrid_job #$ -pe mpi_2cpu 2 #$ -q long #$ -R yes #$ -wd /work01/edun??/sge #$ -l h_rt=01:00:00 #$ -l OMP_NUM_THREADS=2 ##$ -M myemailaddress ##$ -m e export OMP_NUM_THREADS=2 mpirun machinefile $TMPDIR/machines np $NSLOTS./hybrid.x job submit $ qsub hybrid.sh Your job 134626 ("hybrid_job") has been submitted $ qstat $ more hybrid d_job.o134626 6 $ more hybrid_job.e134626

Job Script Option (1/2) 옵션 Argument 기능 -pe pe_name min_proc [-max_proc] mpi_rr : 라운드로빈방식으로노드의 CPU 할당 mpi_fu : 각노드의비어있는 CPU를꽉채워서할당 mpi_[1-16]cpu : 정해진 ( 범위 : 1-16) 숫자만큼노드의 CPU 할당 Openmp : 순수한 openmp프로그램의쓰레드를위한 CPU 할당 * mpi의종류 [mvapich, openmpi] 는 select-mpi-[bash,csh,ksh] 스크립트로미리선택 -N job_name Job 의이름을정해줌 -S Shell (absolute path) Batch 작업의 shell 을지정, 미지정시 SGE 가지정한 Shell 로수행 (/bin/bash) -M Email address 사용자의 email address 를명시 -m {b e a s n} 언제 email notification 을보낼지명시 - b : job beginning -e:jobending - a : job aborted or rescheduled - s : job suspended -n:nomailissent(default) -V 사용자의현재 shell 의모든환경변수가 qsub 시에 job 에적용되도록함

Job Script Option (2/2) 옵션 Argument 기능 -cwd 현재디렉터리를 job 의 working directory 로사용 (default) -o Output_file Job 의 stdout 결과를 output_file 로저장 -e Error_file Job 의 stderr 결과를 error_file 로저장 - h_rt : 작업경과예상시간 (hh:mm:ss) (wall clock time) - normal : normal 큐에작업제출시 Job이높은우선순위를얻기위해반드시명시 ( -l normal 혹은 -l normal=true ) - strategy : strategy 큐에작업제출시작업이높은우선순위를얻기위해 -l resource=value 반드시명시 ( -l strategy 혹은 -l strategy=true ) - OMP_NUM_THREADS : MPI 타스크당쓰레드수를의미하며, hybrid[mpi+openmp] 병렬작업실행시반드시명기 (-l OMP_NUM_THREADS=[MPI _ 타스크당 OpenMP 쓰레드수 ]) * normal, strategry 큐를제외한다른큐는 l 옵션으로기본 priority 이기 때문에큐이름을명시할필요없음, 추후변경시공지예정

Job Submission 연관된다수작업 Job_A 가끝난후 Job_B 가실행되어야하는경우 $ qsub Job_A.sh (Jobname은 Job_A라고가정 ) Your job 504 ("Job_A") has been submitted $ qsub -hold_jid Job_A job_b.sh 혹은 $ qsub -hold_jid 504 job_b.sh Job_A 와 job_b 가끝난후 Job_C 가실행되어야하는경우 $ qsub Job_A.sh (Jobname은 Job_A라고가정 ) Your job 504 ("Job_A") has been submitted $ qsub Job_B.sh (Jobname은 Job_B라고가정 ) Your job 505 ("Job_B") has been submitted $ qsub -hold_jid Job_A,Job_B Job_C.sh 혹은 $ qsub -hold_jid 504,505 Job_C.sh

Job monitoring (1/2) 기본작업정보 $ qstat ( 사용자자신 ) job-id prior name user state submit/start at queue slots ja-task-id ------------------------------------------------------------------------------ 254 0.55500 work6 user1 r 04/02/2008 10:13:09 bmt.q@s0087 1 253 0.55500 work5 user1 r 04/01/2008 03:44:20 bmt.q@s0035 1 252 0.55500 work7 user1 r 04/01/2008 11:54:34 bmt.q@s0035 1 $ qstat u * ( 모든사용자 ) 상세작업정보 $ qstat f u "*" queuename qtype used/tot. load_avgavg arch states ---------------------------------------------------------------- all.q@s0002 BIP 1/4 0.14 lx24-amd64 257 0.55500 sleep root r 04/01/2008 10:49:54 1 ---------------------------------------------------------------- all.q@s0003 BIP 1/4 0.13 lx24-amd64 258 0.55500 sleep root r 04/01/2008 10:49:54 1 ------------------------------------------------------

Job monitoring (2/2) Pending 작업에대한상세정보 [Pending 이유 ] 출력 $ qalter w p job_id qstat 옵션 Option Result no option 명령을실행한사용자 job 의상세 list 를보여줌 -f / -F [resource_attribute] -u user_list Full output / * qstat f grep long Full output and show (selected) resources of queue(s) 명시한 user_id 에대한상태를보여줌. u "*" 는전체사용자의상태를보여줌. 주로 f 옵션과함께쓰임. -r Job 의 resource requirement 를 display -ext Job 의 Extended d information i 을 display -j <jobid> Pending/running job 에대한 information 을보여줌 -t Job 의 subtask 에대한추가정보 display

Node monitoring: check node status with $ showhost
HOSTNAME ARCH NCPU(AVAIL/TOT) LOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------------------------------------------
s0001 lx24-amd64 0/8 16.03 31.4G 2.7G 0.0 0.0
s0002 lx24-amd64 0/8 16.00 31.4G 3.7G 0.0 0.0
s0003 lx24-amd64 0/8 16.04 31.4G 3.7G 0.0 0.0
s0004 lx24-amd64 0/8 16.03 31.4G 3.7G 0.0 0.0
s0005 lx24-amd64 1/8 15.53 31.4G 3.6G 0.0 0.0
s0006 lx24-amd64 0/8 16.00 31.4G 3.7G 0.0 0.0
s0007 lx24-amd64 0/8 16.01 31.4G 3.7G 0.0 0.0
s0008 lx24-amd64 2/8 14.03 31.4G 3.4G 0.0 0.0
s0009 lx24-amd64 0/8 16.00 31.4G 3.7G 0.0 0.0
s0010 lx24-amd64 0/8 16.03 31.4G 3.7G 0.0 0.0
s0011 lx24-amd64 0/8 16.03 31.4G 3.7G 0.0 0.0
s0012 lx24-amd64 0/8 16.02 31.4G 3.7G 0.0 0.0
s0013 lx24-amd64 0/8 16.03 31.4G 3.7G 0.0 0.0

Job control. Delete jobs: $ qdel <jobid> deletes the job with the given <jobid>; $ qdel -u <username> deletes all jobs of <username>. Suspend/resume jobs: $ qmod -sj <jobid> (suspend), $ qmod -usj <jobid> (unsuspend/resume).

QACCT: report and account for Sun Grid Engine usage
1068% [sunedu1@login02 SGE]$ qsub mpi.sh
Your job 140996 ("mvapich_job") has been submitted
1069% [sunedu1@login02 SGE]$ qstat
job-id prior name user state submit/start at queue slots ja-task-id
-----------------------------------------------------------------------------------------------------------------
140996 0.50007 mvapich_jo sunedu1 r 01/09/2009 15:36:33 long@s0017 4
1070% [sunedu1@login02 SGE]$ ls mvapich_job.*
mvapich_job.e140997 mvapich_job.o140997 mvapich_job.pe140997 mvapich_job.po140997
1071% [sunedu1@login02 SGE]$ qacct -j 140997   (JOB ID)

QMON: SGE can be used conveniently through a graphical user interface (GUI) with qmon. Among the facilities provided by qmon are submitting jobs, managing jobs, managing hosts, and managing job queues. X-Windows is required by qmon to provide the GUI. Start qmon by typing qmon.

Submitting a Job via QMON Click, the submit job window will show

Job Control via QMON Click for viewing job status and controlling jobs

Queue Control Only one compute node usually consists of one queue but you can add more queues or remove existing queues Slot management Slot is the capacity of a queue that can handle concurrent jobs May provide Number of slot of a queue = Number of processor of the compute node

Queue Control via SGE Click for control queues

Queue Control via SGE (Cont ) This icon present a queue named compute0 prepared for a host named comp-pvfs-0-0 This queue consists of only one slot You can modify properties of this queue by highlight its icon and click the Modify button * Normal user cannot control queues

Queue Control via SGE (Cont ) Modify the properties of a queue Try to modify the number of slot

05. Parallel code performance optimization techniques: introduces the Tachyon compiler options, the profilers (gprof and others), and the debugger (TotalView) used to optimize parallel code performance.

Agenda 순차컴파일러옵션소개및실습 - Intel, PGI, GNU 컴파일러주요옵션 - 실습및성능측정 디버거 & 프로파일러소개및실습 - Totalview -gprof, TAU, pgprof

컴파일러소개 벤더컴파일러명령프로그램소스확장자 pgcc C.c PGI pgcpp C++.c,.C,.cc,.cpp pgf77 F77.f,.for,.fpp,.F,.FOR pgf90/pgf95 F90/95.f,.for,.f90,.f95,.fpp,.F,.FOR,.F90,.F95 icc C.c Intel icc C++.c,.C,.cc,.cpp,.cxx,.c++ ifort F90.f,.for,.ftn,.f90,.fpp,.F,.FOR,.FTN,.FPP,.F90 GCC gcc C.c g++ C++.C,.cc,.cpp,.cxx 권장사항 :PGI 컴파일러

User programming environment: key GNU compiler options
-O[1|2|3]: object optimization; the number is the optimization level
-funroll-all-loops: unroll all loops
-ffast-math: use the fast floating-point model
-minline-all-stringops: allow more inlining
-g: generate debugging information
--help: print the list of options
Recommended options: -O3 -m64 -fpic

User programming environment: key Intel compiler options
-O[1|2|3]: object optimization; the number is the optimization level
-ip, -ipo: interprocedural optimization
-vec_report[0|1|2|3|4]: controls the amount of vectorization diagnostics
-xSSE4.2: generate code including SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 instructions for the target architecture
-fast: macro for -xT -O3 -ipo -no-prec-div -static
-g: generate debugging information
-openmp: build OpenMP-based multi-threaded code
-openmp_report[0|1|2]: controls the OpenMP parallelization diagnostic level
-help: print the list of options
Recommended options: -fast -xSSE4.2 -m64 (-mcmodel=medium) for version 11 and later; -fast -xS -m64 (-mcmodel=medium) for version 10

User programming environment: key PGI compiler options
-O[0|1|2|3|4]: object optimization; the number is the optimization level
-Mipa=fast: interprocedural optimization
-fast, -fastsse: macro for -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline; optimizations supporting SSE and SSE2
-g, -gopt: generate debugging information
-mp: build OpenMP-based multi-threaded code
-Minfo=mp,ipa: report OpenMP and interprocedural optimization information
-help: print the list of options
Recommended options: -fast -tp nehalem-64 (-mcmodel=medium)

TotalView 소개 Parallel and Distributed Debugging Tool Multi-process and multi-threaded Distributed, clusters MPI,OpenMP,PVM,PE, others C,C++,Fortran(F77),F90,HPF,assembler Memory debugging capabilities Powerful and Easy GUI

Compiling the program: the debugging option -g. Like most UNIX debuggers, it compiles the source together with symbolic debug information, and most compilers use this option. Caution: during debugging it is best not to combine -g with other optimization options, since optimizations may transform the source so that the executed code no longer matches it. Debugging a parallel program involves additional run-time flags (e.g. -np, -machinefile).

TotalView 실행 (1/2) TotalView 의실행 사용자의필요에따라다양한방법이있음 실행파일의디버깅실행프로세스의추가 (attach) Core 파일을이용한디버깅디버거의환경이나실행형태를정의

Running TotalView (2/2). Command / Action:
totalview: starts the debugger; you can then load a program or core file, or attach to a running process.
totalview filename: starts the debugger and loads the program specified by filename.
totalview filename corefile: starts the debugger and loads the program specified by filename and its core file specified by corefile.
totalview filename -a args: starts the debugger and passes all subsequent arguments (specified by args) to the program specified by filename; the -a option must appear after all other TotalView options on the command line.
totalview filename -remote hostname[:portnumber]: starts the debugger on the local host and the totalview debugger server on the remote host hostname, and loads the program specified by filename for remote debugging; you can specify a host name or TCP/IP address for hostname and, optionally, a TCP/IP port number for portnumber.

TotalView practice (1/4). Connect to a debugging node: $ ssh s3177 [s3177 ~ s3200] (training system: ssh s000x). Copy and extract the example files:
$ mkdir Totalview
$ cd Totalview
$ tar -xvf /work01/edu_ex/tachyon/totalview/extv.tar
Example files (Fortran):
$ ls
ex1.f ex2.f ex2.inp ex3.f exmpi.f exomp.f machines run

TotalView practice (2/4). Compile ex1.f (basic): $ pgf90 (or ifort) -g ex1.f -o ex1. Run TotalView: $ totalview ./ex1 &. TotalView windows: the root window and the process window.

TotalView 실습 (3/4) 기본 TotalView 창 Root Window Process Window

TotalView 실습 (4/4) TotalView s state codes State Code B E H Description Stopped at a breakpoint Stopped because of an error In a Hold state K Thread is executing within the kernel M R T W Mixed some threads in a process are running and some not Running Thread is stopped At a watchpoint

TotalView practice, example 1 (1/10): matrix computation C = A * B, where A and B are small 4x4 matrices whose entries are built from the row and column indices (the slide shows the example matrices and the resulting product).

TotalView 실습예제 1 (2/10) 프로그램실행 Process Window 의 Go 버튼클릭 단축키 : Process Window 에서 g 입력 실행 (Go) 의단축키가 g 이고이는 Window 에서 g 를누름으로써실행가능하다. or

TotalView 실습예제 1 (3/10) Breakpoint 설정 line 41 : printmatrix 에서왼쪽마우스버튼으로설정 Stop 아이콘 클릭

TotalView 실습예제 1 (4/10) 프로그램실행 Breakpoint 에서코드실행이잠시멈춤 Root window 에 B 상태로표시

TotalView 실습예제 1 (5/10) Dive : 실행코드서브루틴보기 Printmatrix line 에서마우스오른쪽클릭후 Dive 왼쪽마우스더블클릭 Undive : source 창 (process 창 ) 우측상단 더블클릭 클릭 (Undive) 우클릭 클릭

TotalView 실습예제 1 (6/10) Variable window 열기 line 36 a,b,c array 에서 dive( 마우스왼쪽더블클릭 ) Array정보를보여주는새창 (variable window) 더블클릭

TotalView 실습예제 1 (7/10) Array slice Slice 항목에서특정부분입력 Fortran (2:3,:)/ C [2:3][:]

TotalView 실습예제 1 (8/10) Modify variable value Value field 수정후 Enter Output result terminal 에서확인 클릭후임의의값입력

TotalView 실습예제 1 (9/10) Stepping 단축기 :s s s

TotalView 실습예제 1 (10/10) Quit TotalView Root window 에서 CTRL-Q Select yes

TotalView practice, example 2 (1/4). Compile and run TotalView (index bug):
$ pgf90 (or ifort) -g ex2.f -o ex2
$ totalview &
In the New Program window choose Browse, select ex2, and click OK to open the process window.

TotalView 실습예제 2 (2/4) 코드의실행 버그로인해코드에러발생 Root window 에서 E 상태로확인 Stack Trace에서 ex2 항목클릭하여 Error 위치확인 클릭

TotalView 실습예제 2 (3/4) 디버깅을위한 Evaluation point 설정 Line 29 : 오른쪽마우스클릭후 properties Evaluate click Expression 입력 Fortran : If(j.gt.4) $stop / C : if(j>4) $stop Eval 아이콘등록후 Restart Stack Frame 에서 j 값확인

TotalView 실습예제 2 (4/4) 디버깅 Action points list 에서 Eval 수정 Fortran : if(j.gt.4) goto $32 / C : if(j>4) goto 32 Restart 우클릭

TotalView practice, example 3 (1/2). Compile and run TotalView:
$ pgf90 (or ifort) -g ex3.f -o ex3   (compile)
$ ./ex3 &   (run in the background)
$ ps u   (check the process)
$ totalview &   (start TotalView)
In the New Program window choose Attach to process, select ex3, and click OK; the process window expands.

TotalView 실습예제 3 (2/2) 디버깅 Program halt ( 단축키 : h ) Dive index variable i (line 8, yellow arrow) New variable window Edit i value : 99 -> 101 Enter

TotalView practice, example 4 (1/8): debugging an OpenMP program. Compile and run TotalView:
$ pgf90 -g -mp exomp.f -o exomp   (or: ifort -g -openmp exomp.f -o exomp)
$ export OMP_NUM_THREADS=4   (environment variable)
$ totalview ./exomp &   (start TotalView)
Set breakpoints on lines 42 and 45 and press Go; check the thread information in the root window and the process window.

TotalView 실습예제 4 (2/8) OpenMP program의디버깅 새로운추가 process window열기 Root window thread list 에서 View - Dive in New window root window thread list 에서선택한 Thread 의새로운 process window 가열림

TotalView 실습예제 4 (3/8) 각 thread 에따른 variable 확인 Set breakpoint on line 35 Restart on process window Dive tid variable on line 28 New window에서 View-Show Across-Threads 각 thread에따른 variable 확인가능 threads value

TotalView practice, example 4 (4/8): debugging an MPI program. Compile and run TotalView:
$ mpif90 -g exmpi.f -o exmpi   (compile)
$ export TOTALVIEW=/applic/debuggers/toolworks/totalview.8.3.0-1/bin/totalview
$ ls machines   (set the environment variable and check the machine file)
$ mpirun -dbg=totalview -ssh -np 4 -machinefile ./machines ./exmpi
Check the root window and the process window; the 4 processes become visible after pressing Go.

TotalView 실습예제 4 (5/8) Breakpoint 설정 (line 73) Running program : Go Mpi process 임을확인하는대화창 : Check No MPI task 정보확인 Root window 및 process window 의 processes

TotalView 실습예제 4 (6/8) 영역분할확인 Dive js and je (line 60) New window에서 View-Show Across-Processes 각프로세서별로분할된영역인덱스 분할된영역인덱스

TotalView 실습예제 4 (7/8) Set barrier Breakpoint 설정과흡사 BARR 아이콘등록 View Message Queue and Message Queue Graph Process window 에서 Tools-Message Queue

TotalView 실습예제 4 (8/8) View Message Queue and Message Queue Graph Message Queue Graph 예제 Set breakpoint line 157 and 169 / Go Open Message Queue Graph Check option : pending sends Update : 통신정보 (tag)

프로파일링의목적 사용자코드의성능확인 프로그램의실행시간측정 특정함수가차지하는시간비율측정 병목과다른부분을기다리는코드의영역찾기

프로파일링 tools prof, gprof(gnu Profiler) PAPI Dynaprof GuideView (OpenMP) Vampir (MPI) TAU (OpenMP, MPI, Hybrid) Vprof Etc.

GNU Profiler (gprof): profiling steps with gprof.
1. Compile the program with the profiling option: $ gcc -o [myprog] [myprog.c] -pg
2. Run the program to generate profile data: $ ./[myprog]   (creates gmon.out, the profile data file)
3. Analyze the profile data with gprof: $ gprof [option] ./[myprog] gmon.out > myprog_prof.txt

GNU Profiler (gprof) gprof 의주요옵션 출력형식관련옵션 -A 소스코드에분석결과를삽입하여출력 -C[funcName] -J[funcName] -p[funcname] 지정된심볼만함수분석에사용지정된심볼만소스코드에분석결과를삽입평면프로파일생성, [funcname] 지정시지정된심볼만생성 -q[funcname] 호출그래프프로파일생성, [funcname] 지정시지정된심볼만생성 분석옵션 -a static 으로정의된함수를분석하지않음 -c 프로파일링옵션으로컴파일되지않은함수에대하여짐작하여호출관계를추정하여분석함 -l 줄단위프로파일링 -s 프로파일링데이터파일들을읽어서 gmon.sum 에합산하여기록

GNU Profiler (gprof) 프로파일링데이터분석 평면프로파일 (Flat Profile) 각함수의총수행시간, 평균수행시간등을분석 함수간호출정보는없음 -z, -c 옵션을통해호출되지않은함수의정보분석 호출그래프프로파일 (Call Graph) 각함수의호출관계를통한수행시간분석 평면프로파일보다자세한분석

GNU Profiler (gprof) 프로파일링출력분석 : 평면프로파일 (Flat Profile) % time cumulative seconds self seconds calls self ms/call total ms/call name 프로그램내에서함수가수행된전체시간의백분율 (self) 프로그램내에서함수와테이블의위함수들이수행된시간의합 프로그램내에서함수가수행된전체시간 프로그램내에서함수가호출된횟수 함수의호출당평균수행시간, 다른함수호출에의해사용된시간제외 ( = self seconds / calls ) 함수의호출당평균수행시간, 다른함수호출에의해사용된시간포함 함수의이름

GNU Profiler (gprof) 프로파일링출력분석 : 호출그래프프로파일 (Call Graph) index 각함수마다유일하게지정되는번호 % time 함수수행시간중차지하는비율 self 대상함수의순수수행시간 children called name 다른함수를호출하는데사용된시간총함수의호출횟수와분석함수에의한호출횟수표시함수의이름과 index

gprof practice (1/4). Connect to a debugging node: $ ssh s3177 [~ s3200]. Copy the example file:
$ mkdir gprof
$ cd gprof
$ cp /work01/edu_ex/tachyon/gprof/test.c ./
Example file (C): a repeated-computation code.

gprof 실습 (2/4) int main(void){ int i,x1=10,y1=3,r1=0; float x2=10,y2=3,r2=0; 0; } for(i=0;i<1000000;i++){ r1+=int_math(x1,y1); r2+=float_math(y2,y2); } int_math x 1,000,000 float_math int_submath x 1 int_power float_submath x 1 float_power x 2 x 2 int_power float_power

gprof practice (3/4). Compile: $ gcc test.c -o test.x -pg. Run: $ ./test.x   (creates the gmon.out file). Profile with gprof: $ gprof ./test.x gmon.out > profile.txt. Analyze the profile file and experiment with the various options.

gprof 실습 (4/4) 프로파일링결과확인 $ more profile.txt Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ns/call ns/call name 29.36 0.07 0.07 3000000 23.49 23.49 float_power 25.16 0.13 0.06 3000000 20.13 20.13 int_power 25.16 0.19 0.06 1000000 60.39 100.65 int_submath 12.58 0.22 0.03 1000000 30.20 77.17 float_submath 4.19 0.23 0.01 1000000 10.07 110.72 float_math 4.19 0.24 0.01 main 0.00 0.24 0.00 1000000 0.00 120.78 int_math. Call graph (explanation follows) granularity: each sample hit covers 2 byte(s) for 5.23% of 0.19 seconds index % time self children called name <spontaneous> [1] 100.0 0.01 0.23 main [1] 0.00 0.12 1000000/1000000 int_math [2] 0.0101 0.10 1000000/1000000 float_math [3] ----------------------------------------------- 0.00 0.12 1000000/1000000 main [1] [2] 50.0 0.00 0.12 1000000 int_math [2] 0.06 0.04 1000000/1000000 int_submath [4] 0.02 0.00 1000000/3000000 int_power [7] -----------------------------------------------. 0.06 0.04 1000000/1000000 int_math [2] [4] 41.7 0.06 0.04 1000000 int_submath [4] 0.0404 0.00 00 2000000/3000000 int_power [7] ----------------------------------------------- 0.02 0.00 1000000/3000000 int_math [2] 0.04 0.00 2000000/3000000 int_submath [4] [7] 25.0 0.06 0.00 3000000 int_power [7]

TAU Tuning and Analysis Utilities Performance tool suite that offers profiling and tracing of programs Multi-level performance instrumentation Multi-language automatic source instrumentation Flexible and configurable performance measurement Widely-ported parallel performance profiling system Computer system architectures and operating systems Different programming languages and compilers Support for multiple parallel programming paradigms Multi-threading, message passing, mixed-mode, hybrid Support for object-oriented and generic programming Integration in complex software systems and applications

TAU example practice: basic TAU usage.
1. Build an instrumented binary with the TAU compiler wrappers.
2. Run the binary built in the previous step.
3. After the job finishes, check the generated performance files (profile.*).
4. Analyze the performance files, e.g. through visualization:
   pprof   (text-based)
   paraprof   (GUI-based)

TAU example practice (1/5): using TAU (mvapich + pgi). Select mvapich & pgi (not needed on the training system):
$ cd ~
$ cp -a /applic/shell/.bash* ~
$ ./select-mpi-bash mvapich pgi
$ exit
Create the practice folder and copy the example files:
$ mkdir TAU
$ cd TAU
$ cp /work01/edu_ex/tachyon/tau/* ./
$ ls
exmpi.f90 machines
$ source tau_env

TAU 예제실습 (2/5) TAU 사용예 (mvapich + pgi) tau 컴파일러를이용한 compile $ tau_f90.sh -o test exmpi.f90 바이너리파일실행 $ mpirun -np 4 -machinefile machines./test 실행후생성된 trace file 확인 $ ls *.trc tautrace.0.0.0.trc tautrace.1.0.0.trc tautrace.2.0.0.trc tautrace.3.0.0.trc performance file 생성 $ tau_merge tautrace.*.trc exmpi.trc trc 파일생성 OR $ trace2profile exmpi.trc tau.edf GUI Mode

TAU 예제실습 (3/5) TAU 사용예 (mvapich + pgi) 실행후생성된 performance file 확인 $ cd MULTI GET_TIME_OF_DAY/ $ ls proifle.0.0.0 profile.1.0.0 profile.2.0.0 profile.3.0.0 profile.4.0.0 performance file 분석 $ pprof Text Mode OR $ paraprof GUI Mode

TAU 예제실습 (4/5) TAU 사용예 (mvapich + pgi) pprof NODE 0;CONTEXT 0;THREAD 0: --------------------------------------------------------------------------------------- %Time Exclusive Inclusive #Call #Subrs Inclusive Name msec total msec usec/call --------------------------------------------------------------------------------------- 100.0 64 3,002 1 48659 3002597 FDM2D 63.1 1,895 1,895 16218 0 117 FDM2D => JACOBI 63.1 1,895 1,895 16218 0 117 JACOBI 16.6 97 499 16218 64872 31 COMMUNICATE 16.6 97 499 16218 64872 31 FDM2D => COMMUNICATE 11.8 352 354 1 45 354422 FDM2D => MPI_Init() 11.8 352 354 1 45 354422 MPI_Init() 9.2 277 277 32436 0 9 COMMUNICATE => MPI_Wait() 9.2 277 277 32436 0 9 MPI_Wait() 6.1 184 184 16220 0 11 MPI_Allreduce() 6.1 184 184 16218 0 11 FDM2D => MPI_Allreduce() 3.5 105 105 16218 0 7 COMMUNICATE => MPI_Isend() 3.5 105 105 16218 0 7 MPI_Isend() 0.6 19 19 16218 0 1 COMMUNICATE => MPI_Irecv() 0.6 19 19 16218 0 1 MPI_Irecv() 0.1 3 3 1 5 3626 FDM2D => MPI_Finalize() 0.1 3 3 1 5 3626 MPI_Finalize() 0.0 0.623 0.623 1 0 623 MPI_Allgather() 0.0 0.623 0.623 1 0 623 MPI_Init() => MPI_Allgather() 0.0 0.318 0.396 2 6 198 MPI_Comm_create() 0.0 0.318 0.396 2 6 198 MPI_Init() => MPI_Comm_create() 0.0 0.269 0.269 2 0 134 MPI_Init() => MPI_Allreduce() 0.0 0.237 0.237 13 0 18 MPI_Errhandler_set() 00 0.0 0.22 022 0.22 022 4 0 55MPI_Comm_rank() 0.0 0.115 0.195 1 3 195 MPI_Comm_split() 0.0 0.115 0.195 1 3 195 MPI_Init() => MPI_Comm_split() 0.0 0.147 0.147 3 0 49 MPI_Group_free()

TAU 예제실습 (5/5) TAU 사용예 (mvapich + pgi) paraprof

PGI Parallel Profiler (PGPROF) PGPROF 소개 (v 7.1) C, C++, F77, F90으로컴파일된프로그램의수행으로생성된프로파일링데이터분석 프로파일링을위한 GUI 제공 프로파일링방법 Fine grain observations Different levels of intrusiveness 병렬프로파일링기능제공 MPI applications OpenMP multi-threaded applications http://www.pgroup.com/doc/pgitools.pdf

PGI Parallel Profiler (PGPROF): profiling steps with pgprof.
1. Compile the program with a PGI profiling option: $ pgf90 -o [myprog] [myprog.f90] -Mprof=[option]
2. Run the program to generate profile data: $ ./[myprog]   (creates pgprof.out, the profile data file)
3. Analyze the profile data with pgprof: $ pgprof [options] [profile data file], or $ pgprof -exe ./[myprog] [profile data file]

PGI Parallel Profiler (PGPROF) 프로파일링방법 프로파일링정보 프로그램에서가장큰수행시간을차지하는함수 각함수나라인이실행되는데걸리는시간 프로그램에서 hot-spots 병렬프로그램의효율 프로파일링방법에따라프로파일링은더혹은덜프로그램에끼어들수있다.

PGI Parallel Profiler (PGPROF) 프로파일링을위한주요컴파일옵션 Option -Mprof=lines -Mprof=func -Mprof=time Profiling Information Line Level profiling, 프로그램의각 line 에대한소요시간, 실행횟수계산 Function Level profiling, 프로그램의각 function에대한소요시간, 호출횟수계산 Sample-based profiling, sample-based method 를사용한명령어에대한소요시간계산 + intrusive -Mprof=hwcts x64 HW counters 에접근하는 PAPI 인터페이스를이용한 Sample-based profiling MPI profiling, MPI message size 와 sends/receives 횟수수 -Mprof=[mpi] 집, 다른프로파일링옵션들과함께사용되어야한다. [mpi] : mpich1, mpich2, (mvapich1) -

PGI Parallel Profiler (PGPROF) 프로파일링출력분석 Count Time Cost Cover Bytes Bytes Receive Bytes Send Messages Receives es Sends 프로그램내에서 line 혹은 function 이수행된횟수 프로그램내에서 line 혹은 function 이수행된시간 프로그램내에서 line 혹은 function 이수행된총시간, 함수내에서다른함수를수행한시간까지포함 프로그램내에서적어도한번이상수행되어진함수내의 Line들의길이와전체 Line에대한 percentage 송수신된메시지바이트수 수신된메시지바이트수 송신한메시지바이트수 송수신된메시지개수 수신된메시지개수 송신한메시지개수

PGPROF practice (1/3). Connect to a debugging node (150.183.143.89 ~ 92): $ ssh tachyon189 [190, 191, 192]. Copy the example files:
$ mkdir pgprof
$ cd pgprof
$ cp /work01/edu_ex/tachyon/pgprof/* ./
Example file (Fortran): a 2D FDM code.

PGPROF practice (2/3). Compile: $ mpif90 exmpi.f90 -o test.x -Mprof=func,mvapich. Run: $ mpirun -np 4 -machinefile hosts ./test.x   (creates pgprof.out, pgprof1.out, pgprof2.out, pgprof3.out). Profile and analyze with pgprof, and experiment with the various options: $ pgprof -exe ./test.x pgprof.out