Introduction to the Nurion Supercomputer and Hands-on Practice
2019. 2. 14
Intel Parallel Computing Center at KISTI
Agenda
  09:00-10:30  Introduction to Nurion
  10:45-12:15  Access and Nurion hands-on
  12:15-13:30  Lunch
  13:30-15:00  Performance optimization hands-on (I)
  15:15-16:45  Performance optimization hands-on (II)
History of KISTI Supercomputers
(Timeline figure.) From the Cray 2S [1st, 1988] through the Cray C90 [2nd, 1993], Cray T3E, HP GS320 / HPC 160/320, Tera and Pluto clusters, NEC SX-5 and SX-6 and IBM p690 [3rd], IBM p595 and SUN B6048/B6275 [4th], to the Cray CS500 [5th, 2018]. Peak performance grew from 2 GFlops (1988) through 16, 131, 242, 306, 1,407 and 8,000 GFlops, 30 TFlops (2008) and 300 TFlops (2010), to 25.7 PFlops (2018).
Nurion System
Nurion = nuri (the world; to enjoy together) + on (all, everyone): a national supercomputer for all citizens to share.

KNL nodes (Nurion)
  Nodes: 8,305
  Model: Cray 3112-AA000T
  OS: CentOS 7.4 (Linux, 64-bit)
  CPU: Intel Xeon Phi KNL 7250, 1.4 GHz (68 cores), 1 socket
  Main memory: 96 GB DDR4 + 16 GB MCDRAM per node
  Theoretical peak: 3.0464 TFlops per node

SKL nodes (Nurion)
  Nodes: 132
  Model: Cray 3111-BA000T
  OS: CentOS 7.4 (Linux, 64-bit)
  CPU: Intel Xeon Skylake (Gold 6148), 2.4 GHz (20 cores), 2 sockets
  Main memory: 192 GB DDR4 per node
  Theoretical peak: 3.072 TFlops per node
Nurion System Hardware
(Photos: rear cooling doors, front view, overhead OPA cabling.)
  8 rows, 126 racks; 15x CS500 racks
  12x DDN Lustre storage: parallel file system (20 PB)
  TS4500 tape library: tape storage (10 PB)
  OPA core switches, network switches, testbed
Compute Node (KNL)
  Cray 3112-AA000T
  1x Intel Xeon Phi KNL 7250 processor (68 cores per processor)
  96 GB (6x 16 GB) DDR4-2400 RAM
  1x single-port 100 Gbps OPA HFI card
  1x on-board GigE (RJ45) port
Compute Node (SKL)
  Cray 3111-BA000T
  2x Intel Xeon SKL 6148 processors
  192 GB (12x 16 GB) DDR4-2666 RAM
  1x single-port 100 Gbps OPA HFI card
  1x on-board GigE (RJ45) port
Performance (Flops)
Per-node performance
  KNL core: (8 doubles x 2 VPUs) x 2 (FMA) x 1.4 GHz = 44.8 GFlops
  KNL node: 44.8 x 68 = 3,046.4 GFlops
  SKL core: (8 doubles x 2 VPUs) x 2 (FMA) x 2.4 GHz = 76.8 GFlops
  SKL node: 2 sockets x 20 cores x 76.8 GFlops/core = 3.072 TFlops

                               KNL      SKL
  Number of cores              68       20
  SIMD width (doubles)         8 x 2    8 x 2
  Multiply/add in 1 cycle      2        2
  Clock speed (Gcycles/s)      1.4      2.4
  DP GFlops/core               44.8     76.8
  DP GFlops/processor          3,046    1,536

Nurion total
  KNL: 8,305 nodes x 3.0464 TFlops = 25.3 PFlops
  SKL: 132 nodes x 3.072 TFlops = 405.5 TFlops = 0.4 PFlops
  KNL + SKL: (25.3 + 0.4) PFlops = 25.7 PFlops
  (Tachyon2: 300.032 TFlops)

Benchmarks
  HPL 13.92 PF (No. 13), HPCG 391.45 TF (No. 8), Graph500 1,048.86 GTEPS (No. 23), IO-500 16.67 pt (No. 2)
Software List
  Cray-dependent libraries: cdt/17.10, cray-impi/1.1.4(default), mvapich2_cce/2.2rc1.0.3_noslurm(default), mvapich2_gnu/2.2rc1.0.3_noslurm, cray-ccdb/3.0.3(default), perftools-lite/6.5.2(default), cray-lgdb/3.0.7(default), PrgEnv-cray/1.0.2(default)
  Compilers: cce/8.6.3(default), gcc/6.1.0, gcc/7.2.0, intel/17.0.5(default), intel/18.0.1, intel/18.0.3
  MPI libraries: impi/17.0.5(default), impi/18.0.1, impi/18.0.3, mvapich2/2.3, openmpi/3.1.0, ime/mvapich-verbs/2.2.ddn1.4, ime/openmpi/1.10.ddn1.0
  MPI-dependent libraries: fftw_mpi/2.1.5, fftw_mpi/3.3.7, hdf5-parallel/1.10.2, netcdf-hdf5-parallel/4.6.1, parallel-netcdf/1.10.0, pio/2.3.1
  Libraries: hdf4/4.2.13, hdf5/1.10.2, lapack/3.7.0, ncl/6.5.0, ncview/2.1.7, netcdf/4.6.1
  Commercial applications: cfx/v145, cfx/v170, cfx/v181, cfx/v191, fluent/v145, fluent/v170, fluent/v181, fluent/v191, gaussian/g16.a03, gaussian/g16.a03.linda, lsdyna/mpp, lsdyna/smp
  Applications: advisor/17.0.5, advisor/18.0.1, advisor/18.0.3, cmake/3.12.3, forge/18.1.2, grads/2.2.0, gromacs/2016.4, gromacs/5.0.6, ImageMagick/7.0.8-20, lammps/8mar18, namd/2.12, python/2.7.15, python/3.7, qe/6.1, qt/4.8.7, qt/5.9.6, R/3.5.0, siesta/4.0.2, siesta/4.1-b3, singularity/2.4.2, singularity/2.5.1, singularity/2.5.2, singularity/3.0.1, tensorflow/1.12.0, vtune/17.0.5, vtune/18.0.1, vtune/18.0.3
KNL Architecture
  Up to 36 tiles (72 cores / 288 threads); Nurion's KNL 7250 uses 68 cores (272 threads)
  2 cores per tile; 1 MB shared L2 cache per tile
  2x 512-bit VPUs per core
  Cores based on the Intel Atom architecture
  2D mesh interconnect
  2 DDR memory controllers: 6 channels of DDR4, up to 90 GB/s
  16 GB MCDRAM: 8 embedded DRAM controllers, up to 450 GB/s
Vector Registers
  KNL has 512-bit registers: 512 bits x (1 byte / 8 bits) = 64 bytes
  Double precision: 64 bytes x (1 DP / 8 bytes) = 8 DP values per register
Instruction Set Architecture (ISA)
Intel AVX-512 instruction set architecture (ISA) variants
  AVX-512 Foundation instructions: AVX-512F
  AVX-512 Conflict Detection instructions: AVX-512CD
  AVX-512 Exponential and Reciprocal instructions: AVX-512ER
  AVX-512 Prefetch instructions: AVX-512PF
  AVX-512BW, AVX-512DQ, AVX-512VL (for Xeon processors)
Intel compiler options
  -xcommon-avx512 = AVX-512F + AVX-512CD
  -xmic-avx512    = AVX-512F + AVX-512CD + AVX-512ER + AVX-512PF
  -xcore-avx512   = AVX-512F + AVX-512CD + AVX-512BW + AVX-512DQ + AVX-512VL
Compile command examples
Serial
  SKL: icc -O3 -xcore-avx512 (-qopt-report=5) pi.c -o pi_skl.x
  KNL: icc -O3 -xmic-avx512 (-qopt-report=5) pi.c -o pi_knl.x
OpenMP
  SKL: icc -O3 -xcore-avx512 -qopenmp piOpenMP.c -o piopenmp_skl.x
  KNL: icc -O3 -xmic-avx512 -qopenmp piOpenMP.c -o piopenmp_knl.x
MPI
  SKL: mpiicc -O3 -xcore-avx512 piMPI.c -o pimpi_skl.x
  KNL: mpiicc -O3 -xmic-avx512 piMPI.c -o pimpi_knl.x
Hybrid
  SKL: mpiicc -O3 -xcore-avx512 -qopenmp piHybrid.c -o pihybrid_skl.x
  KNL: mpiicc -O3 -xmic-avx512 -qopenmp piHybrid.c -o pihybrid_knl.x
Cluster Modes
Three cluster modes are supported; each provides a different core-to-memory affinity for performance tuning:
  all-to-all mode
  quadrant mode (or hemisphere) (default)
  sub-NUMA clustering (SNC) mode (SNC-4 or SNC-2)
Cluster Modes
Quadrant mode
  Each memory type is UMA: the latency from any given core to any memory location within the same memory type (MCDRAM or DDR) is essentially the same.
SNC-4
  Each memory type is NUMA: the cores and memory are divided into four quadrants, with lower latency for near memory accesses (within the same quadrant) and higher latency for far memory accesses (within a different quadrant).
  SNC-4 is well suited for MPI applications that use four, or a multiple of four, ranks per KNL.
Cluster Modes
Hemisphere and SNC-2
  Variations on quadrant and SNC-4: identical except that the cores and memory are divided into halves instead of quadrants.
All-to-all
  Can be used with any DDR DIMM configuration.
  Generally the lowest-performing of the three modes.
MCDRAM and DDR
MCDRAM (Multi-Channel DRAM) is the high-bandwidth memory
  8 integrated MCDRAM devices: 8 x 2 GB = 16 GB
  Each device has its own memory controller (EDC)
  Bandwidth up to 475 GB/s
DDR offers high-capacity memory
  2 DDR4 memory controllers (2 x 3 = 6 channels)
  Max 64 GB per channel, so up to 384 GB
  Bandwidth up to 90 GB/s
Memory Modes
Cache (default)
  MCDRAM acts as an L3 cache.
Flat
  MCDRAM and DDR4 are both just RAM, exposed as different NUMA nodes; select placement with the numactl command or the memkind/autohbw library.
Hybrid
  Part of MCDRAM is used as an L3 cache and the rest as addressable (DDR-like) memory.
Default Cluster Mode and Memory Mode
Quadrant cluster mode with Cache memory mode is a good choice for most applications.
MPI+X (e.g., MPI+OpenMP) applications can perform well in SNC-4 cluster mode, but quadrant mode delivers sufficiently close performance, so for a uniform user environment Nurion does not offer SNC-4.
Cache memory mode is recommended for most applications, but Flat mode can perform better in some cases:
  when the working set is small enough to run entirely out of MCDRAM;
  when the program is not memory-bound and does not need an L3 cache (HPL is a representative example).
numactl
$ numactl -H
  quadrant (all-to-all or hemisphere) + cache: 1 NUMA node (DDR)
numactl
$ numactl -H
  quadrant (all-to-all or hemisphere) + flat: 2 NUMA nodes (MCDRAM and DDR)
numactl
$ numactl -H
  SNC-4 + flat: 8 NUMA nodes (4 MCDRAM and 4 DDR)
  The DDR nodes are listed first and the MCDRAM nodes last. The distances reflect the affinitization of DDR and MCDRAM to the divisions of the KNL in this mode.
  Example: 64 cores (4 threads/core), DDR: 64 GB, MCDRAM: 16 GB
numactl
Using MCDRAM (flat mode only)
  Specify the -m option with the corresponding NUMA node:
    numactl -m 1 ./a.out      (quadrant + flat; DDR is node 0, MCDRAM is node 1)
    numactl -m 4-7 ./a.out    (SNC-4 + flat; DDR is nodes 0-3, MCDRAM is nodes 4-7)
  The -p option is recommended instead of -m:
    -p means preference: MCDRAM is preferred rather than required.
    When MCDRAM is exhausted, allocation automatically falls back to DDR; with -m, the program is killed when MCDRAM runs out.
    numactl -p 1 ./a.out      (quadrant + flat)
    numactl -p 4-7 ./a.out    (SNC-4 + flat)
Basic Environment (1)
1. System access
2. Linux basics
   Basic commands
   vi editor
3. Environment Module
   module subcommands: avail, add, rm, list, purge
   Recommended compiler options
System Access
Node layout
  Login node: nurion.ksc.re.kr; CPU limit 20 min; for compiling and batch job submission; ssh/scp/sftp allowed, ftp not allowed
  Datamover node: nurion-dm.ksc.re.kr; no CPU limit; ssh/scp/sftp and ftp allowed; compiling and job submission not allowed
  Compute nodes: KNL node[0001-8305], CPU-only cpu[0001-0132]; jobs run through the PBS scheduler; no direct user access
System Access
Xming
  Required to run X applications.
Using PuTTY
  Host Name: nurion.ksc.re.kr (port: 22)
  Xming must be running.
System Access
Account ID & OTP
  sedu## (01-48)
  OTP: xxxx
  Passwd: xxxxxxxxx

Last login: Mon Jan 7 10:00:35 2019 from xxx.xxx.xxx.xx
================ KISTI 5th NURION System ====================
* Compute Nodes(node[0001-8305],cpu[0001-0132])
  - KNL(XeonPhi 7250 1.40GHz 68C) / 16GB(MCDRAM),96GB(DDR4)
  - CPU-only(XeonSKL6148 2.40GHz 20C x2) / 192GB(DDR4)
* Software
  - OS: CentOS 7.4(3.10.0-693.21.1.el7.x86_64)
  - System S/W: BCM v8.1, PBS v14.2, Lustre v2.10
* Current Configurations
  - All KNL Cluster modes - Quadrant
  - Memory modes : Cache-node[0001-7980,8281-8300]/Flat-node[7981-8280]
  : PBS job sharing mode-exclusive(running 1 job per node)
    (Except just the commercial queue)
* Policy on User Job.
  (Use the # showq & # pbs_status commands for more queue info.)
System Access
Policy on user jobs
  Queue          Wall-clock limit   Max running jobs   Max active jobs (running+waiting)
  exclusive      unlimited          30                 40
  normal         48 h               20                 40
  burst_buffer   48 h               10                 20
  long           120 h              10                 20
  flat           48 h               10                 20
  debug          48 h               2                  2
  commercial     48 h               5                  10
  norm_skl       48 h               10                 20
Linux Basics
File hierarchy
Paths
  Absolute path: /home/userid/mpi/examples
  Relative path: ../../MPI/example
Linux Basics
Command structure
  (command) + (options) + (arguments)
  ls
  ls -a
  ls -a /home
Manual pages
  Help provided by the system (man pages); essentially every command has one.
  Next page: space bar or f; previous page: b; quit: q

$ man who
WHO(1)                      User Commands                      WHO(1)
NAME
  who - show who is logged on
SYNOPSIS
  who [OPTION]... [ FILE | ARG1 ARG2 ]
DESCRIPTION
  -a, --all    same as -b -d --login -p -r -t -T -u
  -b, --boot   time of last system boot
  -d, --dead   print dead processes
Basic Commands
ls
  Lists the files in a directory.
Frequently used commands
  cd     change directory
  pwd    print the current working directory
  mkdir  create a new directory
  cp     copy files; use -a to preserve attributes
  rm     remove files or directories
  mv     rename files/directories or move them to another path
  cat    print the contents of a simple text file
  echo   print text to the screen
  diff   compare the contents of two text files; for binary files it only reports whether they differ
  file   identify a file's type (ASCII, binary)
Basic Commands
The tar command
  Bundles files or directories into an archive rather than compressing them itself.
  Commonly used together with compression programs such as gzip and zip/unzip.
Basic options
  -z : compress or decompress with gzip
  -f : operate on the named archive file; effectively always used
  x  : extract from a tar archive
  c  : create a tar archive
vi Editor
vim (vi)
  The most basic text editor; included with the OS by default.
  Stands for VIsual display editor.
Opening a file
  $ vi file    (edit mode)
  $ view file  (read-only mode)
Modes
  Insert mode
    Enter with: i (or I, a, A, o, O, R)
    Everything you type goes into the edit buffer.
    Leave insert mode (return to command mode) with the ESC key.
  Command mode
    Everything you type is interpreted as a command.
Saving / quitting (in command mode)
  :w (save), :q (quit), :wq (save and quit), :q! (quit without saving)
Environment Module
A tool that helps users manage their shell environment.
module subcommands
  avail (av)   show the available modulefiles
  add (load)   load modulefiles into the shell environment
  rm (unload)  remove loaded modulefiles from the shell environment
  li (list)    list the loaded modulefiles
  purge        remove all loaded modulefiles
Environment Module
Default modulefiles
  A default modulefile is loaded at login:
$ module list
Currently Loaded Modulefiles:
  1) craype-network-opa
Checking available modules (avail)
$ module avail
-------- /opt/cray/craype/default/modulefiles ---------------------
craype-mic-knl  craype-network-opa  craype-x86-skylake
---------------- /opt/cray/modulefiles ----------------------------
cdt/17.10  cray-impi/1.1.4(default)  perftools-base/6.5.2(default)
--------- /apps/modules/modulefiles/compilers ---------------------
cce/8.6.3(default)  gcc/6.1.0  gcc/7.2.0  intel/17.0.5(default)  intel/18.0.1  intel/18.0.3
Environment Module
module commands
Printing module information
$ module help impi/17.0.5
----------- Module Specific Help for 'impi/17.0.5' ----------------
This module is for use of impi/17.0.5
use example:
$ module load intel/17.0.5 impi/17.0.5
Loading modules
$ module load craype-mic-knl
$ module load intel/18.0.3
  (or $ module add craype-mic-knl intel/18.0.3)
Environment Module
Default modulefiles in Nurion
Checking loaded modulefiles (list subcommand)
$ module list
Currently Loaded Modulefiles:
  1) craype-network-opa  2) craype-mic-knl  3) intel/17.0.5
Removing / adding loaded modules (rm / add subcommands)
$ module rm craype-mic-knl
$ module add craype-x86-skylake
$ module list
Currently Loaded Modulefiles:
  1) craype-network-opa  2) intel/18.0.3  3) craype-x86-skylake
Removing all loaded modules (purge)
$ module purge
$ module li
No Modulefiles Currently Loaded.
Basic Environment
Installed programming tools: compiler and library modules
  Architecture modules: craype-mic-knl, craype-x86-skylake, craype-network-opa
  Cray modules: perftools/6.5.2, perftools-base/6.5.2, PrgEnv-cray/1.0.2
  Compilers: cce/8.6.3, gcc/6.1.0, gcc/7.2.0, intel/17.0.5(default), intel/18.0.1, intel/18.0.3
  Compiler-dependent libraries: hdf4/4.2.13, hdf5/1.10.2, lapack/3.7.0, ncl/6.5.0, ncview/2.1.7, netcdf/4.6.1
  MPI libraries: impi/17.0.5(default), impi/18.0.1, impi/18.0.3, openmpi/3.1.0, mvapich2/2.3
Basic Environment
Installed programming tools (continued)
  MPI-dependent libraries: fftw_mpi/2.1.5, fftw_mpi/3.3.7, hdf5-parallel/1.10.2, netcdf-hdf5-parallel/4.6.1, parallel-netcdf/1.10.0, pio/2.3.1
  Intel packages: advisor/17.0.5, advisor/18.0.1, advisor/18.0.3, vtune/17.0.5, vtune/18.0.1, vtune/18.0.3
  Application software: forge/18.1.2, ImageMagick/7.0.8-20, python/2.7.15, python/3.7, R/3.5.0, grads/2.2.0, lammps/8mar18, qe/6.1, gromacs/2016.4, gromacs/5.0.6, namd/2.12, qt/4.8.7, qt/5.9.6, siesta/4.0.2, siesta/4.1-b3, cmake/3.12.3, tensorflow/1.12.0
  Virtualization modules: singularity/2.4.2, singularity/2.5.1, singularity/2.5.2, singularity/3.0.1
Basic Environment
Installed commercial software (fields: area, software, versions, license, install directory)
  Structural mechanics
    Abaqus: 6.14-6, 2016, 2017, 2018; /apps/commercial/abaqus/
    MSC ONE (Nastran): 2018; /apps/commercial/msc/nastran
    LS-DYNA: R9.2.0, R10.1.0 (mpp, smp); /apps/commercial/lsdyna
  Thermal/fluid dynamics
    ANSYS CFX / Fluent: V145, V170, V181, V191; /apps/commercial/ansys/
  Chemistry/life science
    Gaussian: G16-a03, G16-a03.linda; /apps/commercial/g16/g16
  License notes: 60 tokens; 151 tokens (up to 128 cores); 17 solvers (HPC 640); no limit on the number of jobs or on CPUs within a single node
Basic Environment
Compiling programs
The Nurion system provides
  Intel, GNU, and Cray compilers
  Intel MPI (IMPI), MVAPICH2, and OpenMPI
Required base modules
  craype-network-opa
  craype-mic-knl (KNL) or craype-x86-skylake (SKL)
Basic Environment
Compiling sequential programs
  C / C++
    Intel: icc / icpc; sources .c, .cc, .cpp, .cxx, .c++; modules intel/17.0.5, intel/18.0.1, intel/18.0.3
    GNU: gcc / g++; same source extensions; modules gcc/6.1.0, gcc/7.2.0
    Cray: cc / CC; modules PrgEnv-cray/1.0.2 & cce/8.6.3
  F77/F90
    Intel: ifort; sources .f, .for, .ftn, .f90, .fpp, .F, .FOR, .FTN, .FPP, .F90; modules intel/17.0.5, intel/18.0.1, intel/18.0.3
    GNU: gfortran; same source extensions; modules gcc/6.1.0, gcc/7.2.0
    Cray: ftn; modules PrgEnv-cray/1.0.2 & cce/8.6.3
Basic Environment
Compiling sequential programs: Intel compiler main options
  -O[1|2|3]                  object optimization; the number is the optimization level
  -qopt-report=[0|1|2|3|4|5] controls the amount of vectorization diagnostics
  -xcore-avx512              targets CPUs with 512-bit registers (Skylake)
  -xmic-avx512               targets MICs with 512-bit registers (KNL)
  -qopenmp                   enables OpenMP-based multi-threaded code
  -fpic, -fPIC               generate position-independent code (PIC)
Recommended options
  -O3 -fpic -xcore-avx512    (Skylake)
  -O3 -fpic -xmic-avx512     (Knights Landing)
  -O3 -fpic -xcommon-avx512  (Skylake & Knights Landing)
$ icc|ifort -o test.exe -O3 -fpic -xmic-avx512 test.[c|cc|f90]
Basic Environment
Compiling sequential programs: GNU compiler main options
  -O[1|2|3]                        object optimization; the number is the optimization level
  -march=skylake-avx512            targets CPUs with 512-bit registers (Skylake)
  -march=knl                       targets MICs with 512-bit registers (KNL)
  -Ofast                           macro for -O3 -ffast-math
  -fopenmp                         enables OpenMP-based multi-threaded code
  -fpic                            generate position-independent code (PIC)
Recommended options
  -O3 -fpic -march=skylake-avx512  (Skylake)
  -O3 -fpic -march=knl             (Knights Landing)
  -O3 -fpic -mpku                  (Skylake & Knights Landing)
$ gcc|gfortran -o test.exe -O3 -fpic -march=knl test.[c|cc|f90]
Basic Environment
Compiling sequential programs: Cray compiler main options
  -O[1|2|3]        object optimization; the number is the optimization level
  -hcpu=mic-knl    targets MICs with 512-bit registers (KNL); without it, Skylake is targeted (default)
  -homp (default)  enables OpenMP-based multi-threaded code
  -h pic           use when more than 2 GB of static memory is needed (use together with -dynamic)
  -dynamic         link shared libraries
Recommended options
  Using the defaults is recommended.
$ cc|ftn -o test.exe -hcpu=mic-knl test.[c|cc|f90]
Basic Environment
Compiling parallel programs
OpenMP compilation
  OpenMP exploits multiple threads using only compiler directives.
  Add the compiler option to build in parallel:
    Intel compiler: -qopenmp
    GNU compiler:   -fopenmp
    Cray compiler:  -homp
$ icc|ifort -o test.exe -qopenmp -O3 -fpic -xmic-avx512 test.[c|cc|f90]
$ gcc|gfortran -o test.exe -fopenmp -O3 -fpic -march=knl test.[c|cc|f90]
$ cc|ftn -o test.exe -homp -hcpu=mic-knl test.[c|cc|f90]
Basic Environment
Compiling parallel programs
MPI compilation
  Compile with the MPI wrapper commands; each wrapper invokes the underlying compiler on the source.
                 Intel     GNU       Cray
  Fortran        ifort     gfortran  ftn
  Fortran + MPI  mpiifort  mpif90    ftn
  C              icc       gcc       cc
  C + MPI        mpiicc    mpicc     cc
  C++            icpc      g++       CC
  C++ + MPI      mpiicpc   mpicxx    CC
$ mpiicc|mpiifort -o test.exe -O3 -fpic -xmic-avx512 test.[c|f90]
$ mpicc|mpif90 -o test.exe -O3 -fpic -march=knl test.[c|f90]
$ cc|ftn -o test.exe -hcpu=mic-knl test.[c|f90]
Basic Environment
Work directories and quota policy
  Home directory: /home01; 64 GB quota; 100K file limit; no automatic deletion; Lustre; backed up
  Scratch directory: /scratch; 100 TB quota; 1M file limit; files not accessed for 15 days are deleted automatically; Lustre; not backed up
Checking current usage
$ lfs quota /home01
Disk quotas for usr sedu01 (uid 1000163):
  Filesystem kbytes quota limit grace files quota limit grace
  /home01 104 67108864 67108864 - 26 100000 100000 -
$ lfs quota /scratch
Disk quotas for usr sedu01 (uid 1000163):
  Filesystem kbytes quota limit grace files quota limit grace
  /scratch 4 107374182400 107374182400 - 1 1000000 1000000 -
Disk quotas for grp in0163 (gid 1000163):
Because the home directory is limited in both capacity and I/O performance, all computational work should be done in the scratch directory.
Copying the practice files
cp -r /home01/sedu49/01_testbed_usage ./
cp -r /home01/sedu49/02_knl_tutorial_src ./
Job Scheduler
1. PBS commands
2. Job script examples
   Serial code
   OpenMP code
   MPI code
   Hybrid code
3. Using PBS for interactive jobs
Scheduler Command Summary
Comparison of KISTI scheduler commands
Nurion uses the PBS (Portable Batch System) job scheduler.
  Task               PBS (Nurion)          SGE (Tachyon2)          Slurm (KAT)                      LoadLeveler (Sinbaram)
  Submit job         qsub [script_file]    qsub [script_file]      sbatch [script_file]             llsubmit [script_file]
  Delete job         qdel [job_id]         qdel [job_id]           scancel [job_id]                 llcancel [job_id]
  Query job (id)     qstat [job_id]        qstat -u\* [-j job_id]  squeue [job_id]                  llq -l [job_id]
  Query job (user)   qstat -u [user_name]  qstat [-u user_name]    squeue -u [user_name]            llq -u [user_name]
  Queue list         qstat -Q              qconf -sql              squeue                           llclass
  Node list          pbsnodes -as          qhost                   sinfo -N or scontrol show nodes  llstatus -L machine
  Cluster status     pbsnodes -asj         qhost -q                sinfo                            llstatus -L cluster
  GUI                xpbsmon               qmon                    sview                            xload
Nurion Queues
Queue policy (subject to change by KISTI queue policy)
  Queue          Wall-clock limit   Max running jobs   Max active jobs (running+waiting)
  exclusive      unlimited          30                 40
  normal         48 h               20                 40
  burst_buffer   48 h               10                 20
  long           120 h              10                 20
  flat           48 h               10                 20
  debug          48 h               2                  2
  commercial     48 h               5                  10
  norm_skl       48 h               10                 20
Nurion Queues
Queue policy
  Nurion uses exclusive node allocation by default: only one user's job may run on a node at a time.
  normal queue: for general users.
  commercial queue: for running commercial software.
    A shared-node policy applies: the commercial node pool is small, so sharing uses the resources more efficiently.
  debug queue:
    A shared-node policy applies.
    You are billed only for the resources you actually use.
    Interactive jobs can be submitted.
Nurion Queues
Querying the queues: showq, pbs_status
PBS command: listing queues
qstat
  Queue list: -Q
  Detailed queue information: -f
$ qstat -Q
Queue            Max   Tot Ena Str   Que   Run   Hld   Wat   Trn   Ext Type
---------------- ----- ----- --- --- ----- ----- ----- ----- ----- ----- ----
exclusive            0     1 yes yes     0     1     0     0     0     0 Exec
commercial           0     6 yes yes     0     6     0     0     0     0 Exec
norm_skl             0    56 yes yes     9    46     1     0     0     0 Exec
$ qstat -Qf normal
Queue: normal
    queue_type = Execution
    Priority = 100
    total_jobs = 143
    state_count = Transit:0 Queued:0 Held:8 Waiting:0 Running:135 Exiting:0 Begun:0
    max_queued = [u:PBS_GENERIC=40]
    acl_host_enable = False
    acl_user_enable = False
    resources_max.walltime = 48:00:00
    resources_min.walltime = 00:00:00
Nurion Queues
Listing the queues available to the current account: pbs_queue_check
PBS command: querying nodes
pbsnodes
  -a   : list the registered compute nodes
  -asj : show per-node usage
$ pbsnodes -asj
                                                  mem      ncpus  nmics  ngpus
vnode           state          njobs  run   susp  f/t      f/t    f/t    f/t    jobs
--------------- -------------- ------ ----- ----- -------- ------ ------ ------ ----
node0001        free                0     0     0 110gb/110gb 68/68  0/0    0/0   --
node0007        free                1     1     0 110gb/110gb  4/68  0/0    0/0   6615
node0008        free                0     0     0 110gb/110gb 68/68  0/0    0/0   --
node0009        free                0     0     0 110gb/110gb 68/68  0/0    0/0   --
node0010        free                0     0     0 110gb/110gb 68/68  0/0    0/0   --
cpu0004         job-busy            1     1     0 188gb/188gb  0/40  0/0    0/0   6643
cpu0003         job-busy            1     1     0 188gb/188gb  0/40  0/0    0/0   6644
cpu0002         job-busy            1     1     0 188gb/188gb  0/40  0/0    0/0   6628
cpu0001         job-busy            1     1     0 188gb/188gb  0/40  0/0    0/0   6627
(Example output from the pilot system.)
  Column  Description
  mem     amount of memory in gigabytes (GB)
  ncpus   total number of available CPUs
  nmics   total number of available Many Integrated Core (MIC) devices - Intel
  ngpus   total number of available GPUs
  f/t     f = free, t = total
PBS command: submitting jobs
Job submission
  User jobs can be submitted only from /scratch; submission from the /home directory is not possible.
  qsub {job_script_name}
  Dependent jobs can be submitted with the depend option:
    qsub -W depend={option}:{jobid} {job_script_name}
      afterok    : run the next job only if the dependency succeeds
      afternotok : run the next job only if the dependency fails
      afterany   : run the next job regardless of whether the dependency succeeds
$ qsub serial.sh
1820015.pbs
$ qsub -W depend=afterok:1820015.pbs serial.sh
1820017.pbs
$ qstat -u sedu01
pbs:
                                                    Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1820015.pbs     sedu01   normal   Serial_Job  47089   1   1    --  00:10 R 00:00
1820017.pbs     sedu01   normal   Serial_Job     --   1   1    --  00:10 H    --
PBS command: deleting submitted jobs
qdel
  Delete a submitted job: qdel {JOBID}
$ qstat -u sedu01
pbs:
                                                    Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1822816.pbs     sedu01   normal   Serial_Job  63673   1   1    --  00:10 R 00:00
1822817.pbs     sedu01   normal   Serial_Job     --   1   1    --  00:10 H    --
$ qdel 1822817.pbs
$ qstat -u sedu01
pbs:
                                                    Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1822816.pbs     sedu01   normal   Serial_Job  63673   1   1    --  00:10 R 00:00
PBS command: querying running jobs
qstat
  Lists running and waiting jobs.
  By default, jobs of all users are shown.
  Jobs of a specific account: -u
  Include the compute nodes running the job: -n
$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
1819461.pbs      G16-Si-b-TD      x1679a02         3756:42: R long
1819463.pbs      G16-Si-c-TD      x1679a02         3715:10: R long
1822818.pbs      Serial_Job       sedu01           00:00:00 R normal
$ qstat -u sedu01
pbcm:
                                                    Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1822818.pbs     sedu01   normal   Serial_Job  63895   1   1    --  00:10 R 00:00
$ qstat -n -u sedu01
pbcm:
                                                    Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1822818.pbs     sedu01   normal   Serial_Job  63895   1   1    --  00:10 R 00:00
   node2780/0
PBS command: querying finished jobs
qstat -x
  By default, finished jobs of all users are shown.
  -u : finished jobs of the given account
  -f {JOBID} : detailed information on a finished job
$ qstat -xu sedu01
pbcm:
                                                    Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1822810.pbs     sedu01   norm_skl Serial_Job 425430   1   1    --  00:10 F 00:00
1822818.pbs     sedu01   normal   Serial_Job  63895   1   1    --  00:10 F 00:01
$ qstat -xf 1822818.pbs
Job Id: 1822818.pbs
    Job_Name = Serial_Job
    Job_Owner = sedu01@login01
    resources_used.cpupercent = 99
    resources_used.cput = 00:00:57
    resources_used.mem = 3636kb
    resources_used.ncpus = 1
    resources_used.vmem = 250260kb
    resources_used.walltime = 00:01:11
Writing a Job Script
Options are specified with the #PBS directive.
Resources are allocated to hosts/vnodes in chunk units.
  Allocate chunks with -l select:
    -l select=<numerical>:<res1>=<value>:<res2>=<value>
  Resources are separated by colons (:).
  By default, 1 chunk == 1 task.
  #PBS -l select=128 : 128 chunks
  #PBS -l select=1:mem=16gb+15:mem=1gb : run with 1 chunk using 16 GB and 15 chunks using 1 GB each

#!/bin/sh
#PBS -V                    # export the submission shell's environment variables to the compute nodes
#PBS -N hybrid_node        # job name
#PBS -q workq              # job queue
#PBS -l walltime=01:00:00  # job walltime
#PBS -M abc@abc.com        # address to receive job-related mail
#PBS -m abe                # mail on a (abort) / b (begin) / e (end); n = no mail
#PBS -l select=2           # allocate 2 chunks

cd $PBS_O_WORKDIR          # PBS records the submission path in PBS_O_WORKDIR but runs the job
                           # from $HOME by default; cd there if you use files with relative paths.
mpirun -machinefile $PBS_NODEFILE ./hostname.x
Writing a Job Script
Main job script keywords
  Option  Format                   Description
  -V                               export environment variables
  -N      <alphanumeric>           job name
  -q      <queue_name>             server or queue name
  -l      <resource_list>          job resource request
  -M      <id@domain.xxx>          list of mail recipients
  -m      <string>                 mail notification events
  -W      sandbox=[HOME|PRIVATE]   staging and execution directory
  -X                               X output from an interactive job
For PBS batch jobs, STDOUT and STDERR are stored under a system directory and copied to the submission directory only after the job completes, so by default you cannot watch the job's progress while it runs.
If you add #PBS -W sandbox=PRIVATE to the script, STDOUT and STDERR can be inspected while the job is running.
Writing a Job Script
Available environment variables
  PBS_JOBID      identifier assigned to the job
  PBS_JOBNAME    job name supplied by the user
  PBS_NODEFILE   name of the file listing the compute nodes allocated to the job
  PBS_O_PATH     PATH value of the submission environment
  PBS_O_WORKDIR  absolute path where qsub was executed
  TMPDIR         temporary directory designated for the job
PBS Job Script Example (PI code)
Compiling the code
  Use the Intel compiler / MPI:
    $ module add intel/18.0.3 impi/18.0.3
  For KNL (Knights Landing) nodes, use the craype-mic-knl module:
    $ module add craype-mic-knl
    $ icc -xmic-avx512 source.c -o executable.x
  For SKL (Skylake) nodes, use the craype-x86-skylake module:
    $ module add craype-x86-skylake
    $ icc -xcore-avx512 source.c -o executable.x
  The craype-mic-knl and craype-x86-skylake modules cannot be loaded at the same time:
  unload the conflicting module, then load the one you want.
PBS Job Script Example (PI code)
Compiling pi.c
KNL
  $ module add craype-mic-knl
  $ icc pi.c -o pi_serial_no_vec_knl
  $ icc -xmic-avx512 pi.c -o pi_serial_vec_knl
SKL
  $ module rm craype-mic-knl
  $ module add craype-x86-skylake
  $ icc pi.c -o pi_serial_no_vec_skl
  $ icc -xcore-avx512 pi.c -o pi_serial_vec_skl
PBS Job Script Example (PI code)
serial.sh (KNL)
$ cat serial.sh
#!/bin/bash
#PBS -V
#PBS -N Serial_job
#PBS -q normal
#PBS -l walltime=00:05:00
#PBS -l select=1
cd $PBS_O_WORKDIR
./pi_serial_no_vec_knl
./pi_serial_vec_knl

serial.sh (SKL)
$ cat serial.sh
#!/bin/bash
#PBS -V
#PBS -N Serial_job
#PBS -q norm_skl
#PBS -l walltime=00:05:00
#PBS -l select=1
cd $PBS_O_WORKDIR
./pi_serial_no_vec_skl
./pi_serial_vec_skl

KNL results
  w/o AVX512: PI= 3.141592653589798 (Error = 4.440892e-15), Elapsed Time = 57.227066 [sec]
  w/ AVX512:  PI= 3.141592653589845 (Error = 5.151435e-14), Elapsed Time = 22.057640 [sec]
SKL results
  w/o AVX512: PI= 3.141592653589798 (Error = 4.440892e-15), Elapsed Time = 6.036585 [sec]
  w/ AVX512:  PI= 3.141592653589783 (Error = 9.769963e-15), Elapsed Time = 3.929958 [sec]
PBS Job Script Example (PI code)
OpenMP (piOpenMP.c)

#include <stdio.h>
#include <math.h>
#include <sys/time.h>
#include <omp.h>

inline double cputimer() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec*1e-6);
}

int main() {
    double istart, ElapsedTime;
    const long num_step = 5000000000;
    long i;
    double sum, step, pi, x;
    int num_threads;
    step = (1.0/(double)num_step);
    sum = 0.0;
    istart = cputimer();
    printf("-------------------------------------\n");
PBS Job Script Example (PI code)
OpenMP (piOpenMP.c), continued

    #pragma omp parallel
    {
        #pragma omp master
        {
            num_threads = omp_get_num_threads();
            printf("# of threads : %d\n", num_threads);
        }
        #pragma omp for reduction(+:sum), private(x)
        for(i=1; i<=num_step; i++){
            x = ((double)i-0.5)*step;
            sum += 4.0/(1.0+x*x);
        }
    }
    pi = step*sum;
    ElapsedTime = cputimer() - istart;
    printf("PI= %.15f (Error = %e)\n", pi, fabs(acos(-1)-pi));
    printf("Elapsed Time = %f, [sec]\n", ElapsedTime);
    printf("----------------------------------------\n");
    return 0;
}
PBS Job Script Example (PI code)
Compiling piOpenMP.c
KNL
  $ module add craype-mic-knl
  $ icc -qopenmp piOpenMP.c -o piOpenMP_no_vec
  $ icc -qopenmp -xmic-avx512 piOpenMP.c -o piOpenMP_vec
SKL
  $ module rm craype-mic-knl
  $ module add craype-x86-skylake
  $ icc -qopenmp piOpenMP.c -o piOpenMP_no_vec
  $ icc -qopenmp -xcore-avx512 piOpenMP.c -o piOpenMP_vec
PBS Job Script 사용예제 : (PI 코드 ) openmp.sh(knl) $ cat openmp.sh #!/bin/bash #PBS -V #PBS -N OMP_job #PBS -q normal #PBS -l walltime=00:02:00 #PBS -l select=1:ncpus=34:ompthreads=34 (#PBS -l select=1:ncpus=68:ompthreads=68) cd $PBS_O_WORKDIR./piOpenMP_no_vec./piOpenMP_vec KNL # of threads : 34 w/o AVX512 Elapsed Time = 1.647456, [sec] w/ AVX512 Elapsed Time = 0.626319, [sec] # of threads : 68 w/o AVX512 Elapsed Time = 0.868751, [sec] w/ AVX512 Elapsed Time = 0.350071, [sec] openmp.sh(skl) $ cat openmp.sh #!/bin/bash #PBS -V #PBS -N OMP_job #PBS -q norm_skl #PBS -l walltime=00:10:00 #PBS -l select=1:ncpus=20:ompthreads=20 (#PBS -l select=1:ncpus=40:ompthreads=40) cd $PBS_O_WORKDIR./piOpenMP_no_vec./piOpenMP_vec SKL # of threads : 20 w/o AVX512 Elapsed Time = 0.316807, [sec] w/ AVX512 Elapsed Time = 0.199470, [sec] # of threads : 40 w/o AVX512 Elapsed Time = 0.259656, [sec] w/ AVX512 Elapsed Time = 0.162671, [sec] 71
PBS Job Script Example (PI code)
MPI (piMPI.c)

#include <stdio.h>
#include <math.h>
#include "mpi.h"

int main(int argc, char *argv[]){
    long i;
    int myrank, nprocs;
    const long num_step = 5000000000;
    double mypi, x, pi, h, sum;
    double st, et;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if(myrank==0) printf("# of processes : %d\n", nprocs);

    h = 1.0/(double)num_step;
    sum = 0.0;
    st = MPI_Wtime();
    for(i=myrank; i<num_step; i+=nprocs){
        x = h*((double)i-0.5);
        sum += 4.0/(1.0+x*x);
    }
    mypi = h*sum;
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    et = MPI_Wtime();
    if(myrank==0){
        printf("PI= %.15f (Error = %e)\n", pi, fabs(acos(-1)-pi));
        printf("Elapsed Time = %f, [sec]\n", et-st);
        printf("----------------------------------------\n");
    }
    MPI_Finalize();
    return 0;
}
PBS Job Script Example (PI code)
Compiling piMPI.c
KNL
  $ module add craype-mic-knl
  $ mpiicc piMPI.c -o piMPI_no_vec
  $ mpiicc -xmic-avx512 piMPI.c -o piMPI_vec
SKL
  $ module rm craype-mic-knl
  $ module add craype-x86-skylake
  $ mpiicc piMPI.c -o piMPI_no_vec
  $ mpiicc -xcore-avx512 piMPI.c -o piMPI_vec
PBS Job Script 사용예제 : (PI 코드 ) mpi.sh(knl) $ cat mpi.sh #!/bin/bash #PBS -V #PBS -N MPI_job #PBS -q normal #PBS -l walltime=00:02:00 #PBS -l select=1:ncpus=68:mpiprocs=68:ompthreads=1 (#PBS -l select=2:ncpus=68:mpiprocs=68:ompthreads=1) cd $PBS_O_WORKDIR mpirun -machinefile $PBS_NODEFILE./piMPI_no_vec mpirun -machinefile $PBS_NODEFILE./piMPI_vec KNL # of processes : 68 w/o AVX512 Elapsed Time = 1.587632, [sec] w/ AVX512 Elapsed Time = 0.900600, [sec] # of processes : 136 w/o AVX512 Elapsed Time = 0.792766, [sec] w/ AVX512 Elapsed Time = 0.489747, [sec] 74
PBS Job Script 사용예제 : (PI 코드 ) mpi.sh(skl) $ cat mpi.sh #!/bin/bash #PBS -V #PBS -N MPI_job #PBS -q norm_skl #PBS -l walltime=00:10:00 #PBS -l select=1:ncpus=40:mpiprocs=40:ompthreads=1 (#PBS -l select=2:ncpus=40:mpiprocs=40:ompthreads=1) cd $PBS_O_WORKDIR mpirun -machinefile $PBS_NODEFILE./piMPI_no_vec mpirun -machinefile $PBS_NODEFILE./piMPI_vec SKL # of processes : 40 w/o AVX512 Elapsed Time = 0.176598, [sec] w/ AVX512 Elapsed Time = 0.162338, [sec] # of processes : 80 w/o AVX512 Elapsed Time = 0.094650, [sec] w/ AVX512 Elapsed Time = 0.327157, [sec] 75
PBS Job Script Example (PI code)
Hybrid (piHybrid.c)

#include <stdio.h>
#include <math.h>
#include "mpi.h"
#include "omp.h"

int main(int argc, char *argv[])
{
    long i;
    int myrank, nprocs, provide;
    const long num_step = 5000000000;
    double mypi, x, pi, h, sum;
    double st, et;
    int num_threads;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provide);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if(myrank==0) printf("# of processes : %d\n", nprocs);

    h = 1.0/(double)num_step;
    sum = 0.0;
    st = MPI_Wtime();
PBS Job Script Example (PI code)
Hybrid (piHybrid.c), continued

    #pragma omp parallel
    {
        #pragma omp master
        {
            num_threads = omp_get_num_threads();
            printf("# of threads : %d\n", num_threads);
        }
        #pragma omp for reduction(+:sum), private(x)
        for(i=1; i<=num_step; i+=nprocs) {
            x = h*((double)i-0.5);
            sum += 4.0/(1.0+x*x);
        }
    }
    mypi = h*sum;
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    et = MPI_Wtime();
    if(myrank==0){
        printf("PI= %.15f (Error = %e)\n", pi, fabs(acos(-1)-pi));
        printf("Elapsed Time = %f, [sec]\n", et-st);
        printf("----------------------------------------\n");
    }
    MPI_Finalize();
    return 0;
}
PBS Job Script Example (PI code)
Compiling piHybrid.c
KNL
  $ module add craype-mic-knl
  $ mpiicc -qopenmp piHybrid.c -o piHybrid_no_vec
  $ mpiicc -qopenmp -xmic-avx512 piHybrid.c -o piHybrid_vec
SKL
  $ module rm craype-mic-knl
  $ module add craype-x86-skylake
  $ mpiicc -qopenmp piHybrid.c -o piHybrid_no_vec
  $ mpiicc -qopenmp -xcore-avx512 piHybrid.c -o piHybrid_vec
PBS Job Script 사용예제 : (PI 코드 ) hybrid.sh(knl) $ cat hybrid.sh #!/bin/bash #PBS -V #PBS -N Hybrid_job #PBS -q normal #PBS -l walltime=00:02:00 #PBS -l select=2:ncpus=68:mpiprocs=2:ompthreads=34 cd $PBS_O_WORKDIR mpirun -machinefile $PBS_NODEFILE./piHybrid_no_vec mpirun -machinefile $PBS_NODEFILE./piHybrid_vec KNL # of processes : 4 w/o AVX512 Elapsed Time = 0.940793, [sec] ---------------------------------------- # of processes : 4 w/ AVX512 Elapsed Time = 0.562912, [sec] 79
PBS Job Script 사용예제 : (PI 코드 ) hybrid.sh(skl) $ cat hybrid.sh #!/bin/bash #PBS -V #PBS -N Hybrid_job #PBS -q norm_skl #PBS -l walltime=00:02:00 #PBS -l select=2:ncpus=40:mpiprocs=2:ompthreads=20 cd $PBS_O_WORKDIR mpirun -machinefile $PBS_NODEFILE./piHybrid_no_vec mpirun -machinefile $PBS_NODEFILE./piHybrid_vec SKL # of processes : 4 # of threads : 20 w/o AVX512 Elapsed Time = 0.117037, [sec] ---------------------------------------- # of processes : 4 w/ AVX512 Elapsed Time = 0.091773, [sec] 80
Submitting PBS Interactive Jobs
Nurion provides a debug queue instead of dedicated debug nodes.
Debugging can be done by submitting jobs to the debug queue.
qsub -I (capital i)
Interactive job example with qsub (MPI):
[sedu01@pbcm Pi_Calc]$ qsub -I -V -l select=1:ncpus=68:mpiprocs=68 -l walltime=00:10:00 -q debug
qsub: waiting for job 6719.pbcm to start
qsub: job 6719.pbcm ready

Intel(R) Parallel Studio XE 2017 Update 2 for Linux*
Copyright (C) 2009-2017 Intel Corporation. All rights reserved.

[sedu01@node8281 ~]$ cd $PBS_O_WORKDIR
[sedu01@node8281 Pi_Calc]$ mpirun -np 68 ./piMPI_vec
# of processes : 68
PI= 3.141592653989790 (Error = 3.999969e-10)
Elapsed Time = 3.176321, [sec]
----------------------------------------
[sedu01@node8281 Pi_Calc]$ exit
[sedu01@login04 ~]$
PBS Interactive Jobs
Querying interactive jobs: qstat, pbsnodes
$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
6538.pbcm        vasp_07          hskim0           11830:20 R knl
6615.pbcm        vasp_13          hskim0           4664:02: R knl
6628.pbcm        ESM_pos2_0.0139  hskim0           2387:09: R cpu
6638.pbcm        vasp_16          hskim0           2536:51: R knl
6641.pbcm        vasp_18          hskim0           2533:49: R knl
6643.pbcm        ESM_pos1_0.0139  hskim0           1177:07: R cpu
6644.pbcm        ESM_pos1_0.0559  hskim0           1176:39: R cpu
6719.pbcm        STDIN            sedu01           00:05:30 R knl
$ pbsnodes -asj
                                                  mem      ncpus  nmics  ngpus
vnode           state          njobs  run   susp  f/t      f/t    f/t    f/t    jobs
--------------- -------------- ------ ----- ----- -------- ------ ------ ------ ----
node8281        job-busy            1     1     0 110gb/110gb  0/68  0/0    0/0   6719
node8282        free                1     1     0 110gb/110gb  4/68  0/0    0/0   6638
node0010        free                0     0     0 110gb/110gb 68/68  0/0    0/0   --
cpu0004         job-busy            1     1     0 188gb/188gb  0/40  0/0    0/0   6643
cpu0003         job-busy            1     1     0 188gb/188gb  0/40  0/0    0/0   6644
cpu0002         job-busy            1     1     0 188gb/188gb  0/40  0/0    0/0   6628
cpu0001         free                0     0     0 188gb/188gb 40/40  0/0    0/0   --
Compile
Serial
  SKL: icc -O3 -xcore-avx512 (-qopt-report=5) pi.c -o pi_skl.x
  KNL: icc -O3 -xmic-avx512 (-qopt-report=5) pi.c -o pi_knl.x
OpenMP
  SKL: icc -O3 -xcore-avx512 -qopenmp piOpenMP.c -o piopenmp_skl.x
  KNL: icc -O3 -xmic-avx512 -qopenmp piOpenMP.c -o piopenmp_knl.x
MPI
  SKL: mpiicc -O3 -xcore-avx512 piMPI.c -o pimpi_skl.x
  KNL: mpiicc -O3 -xmic-avx512 piMPI.c -o pimpi_knl.x
Hybrid
  SKL: mpiicc -O3 -xcore-avx512 -qopenmp piHybrid.c -o pihybrid_skl.x
  KNL: mpiicc -O3 -xmic-avx512 -qopenmp piHybrid.c -o pihybrid_knl.x
Code Optimization
1. Vectorization
2. MCDRAM memory modes
3. Using MCDRAM via the numactl command
4. Using MCDRAM via the memkind library
5. 64 physical cores & 256 logical cores
6. Thread management
7. Setting KMP_AFFINITY
Vectorization
What is SIMD (Single Instruction Multiple Data)?
  Baking bungeoppang (fish-shaped bread): batter, red-bean filling, baking -> one bread; with an 8-slot mold, the same work yields 8 breads at once.
  Array operations: A, B, operation -> C; with an 8-slot operation space (a vector register), one vector operation produces 8 elements of C at once.
  (Diagram: A[0..3] and B[0..3] feed the ALUs together and C[0..3] comes out in one vector operation.)
2x 512-bit VPUs (AVX-512) per core
  Vector register size: 512 bits
  8 elements of a 64-bit type (double, int64_t) at a time
  16 elements of a 32-bit type (float, int) at a time
Vectorization
Memory alignment
Conditions for effective vectorization
  1. Memory alignment
  2. Memory access pattern
  3. Loop data dependencies
Memory alignment functions
  Intel:   _mm_malloc / _mm_free
  HBM:     hbw_posix_memalign
  POSIX:   posix_memalign
  C11:     aligned_alloc
  Windows: _aligned_malloc
(Diagram: an unaligned array straddles two cache blocks, while an aligned array starts on a cache-block boundary.)
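To make the first condition concrete, here is a minimal sketch (not from the original slides) pairing the _mm_malloc interface listed above with the Intel-specific #pragma vector aligned; the size N, the file name align.c, and the update loop are arbitrary choices for illustration.

#include <stdio.h>
#include <xmmintrin.h>                 /* _mm_malloc / _mm_free */

#define N 1024

int main(void)
{
    /* 64-byte alignment: each 512-bit vector load/store maps to one cache line. */
    double *a = (double*)_mm_malloc(sizeof(double)*N, 64);
    double *b = (double*)_mm_malloc(sizeof(double)*N, 64);
    for (int i = 0; i < N; i++) { a[i] = (double)i; b[i] = 2.0*(double)i; }

    /* The aligned hint is safe only because both buffers are 64-byte aligned. */
    #pragma vector aligned
    for (int i = 0; i < N; i++)
        a[i] += b[i];

    printf("a[N-1] = %f\n", a[N-1]);
    _mm_free(a);
    _mm_free(b);
    return 0;
}

Build it like the other examples, e.g. icc -std=c11 -O3 -xmic-avx512 align.c, and check the -qopt-report output for "aligned access" remarks.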
MCDRAM Memory Modes
Three modes, selected at boot:
  Cache mode: all 16 GB of MCDRAM is used as an L3 cache in front of DDR.
  Flat mode: all 16 GB of MCDRAM is part of the physical address space alongside DDR; reached via the numactl command or the memkind library.
  Hybrid mode: MCDRAM is split, part addressable memory and part L3 cache (8 or 12 GB flat plus 8 or 4 GB cache).
Using MCDRAM via the numactl command
Check the memory configuration with numactl:
  $ numactl --hardware
  (A KNL in flat mode shows 2 NUMA nodes: node 0 = DDR, node 1 = MCDRAM.)
We can simply place an application in MCDRAM with the membind option:
  $ numactl --membind=1 ./myapp.ex
Using MCDRAM via the memkind library
Use the hbw_malloc / hbw_free functions instead of malloc / free.
Add the memkind library to your compile options:
  CFLAGS = -O3 -std=c11 -qopenmp -qopt-report=5 -xmic-avx512 -lmemkind
Add the header <hbwmalloc.h> to your source code:
  #include <hbwmalloc.h>
https://github.com/memkind/memkind
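A minimal usage sketch, not from the original slides: it uses only the documented memkind entry points hbw_check_available(), hbw_malloc(), and hbw_free() from <hbwmalloc.h>; the array size is an arbitrary choice for the example.

#include <stdio.h>
#include <hbwmalloc.h>      /* from the memkind library; link with -lmemkind */

#define N (1 << 24)

int main(void)
{
    /* Returns 0 when flat-mode MCDRAM is visible on this node. */
    if (hbw_check_available() != 0)
        printf("no flat-mode MCDRAM visible on this node\n");

    /* Same calling convention as malloc/free; with the default policy the
     * pages preferentially come from MCDRAM, falling back to DDR if needed. */
    double *a = hbw_malloc(sizeof(double) * N);
    if (a == NULL) return 1;

    for (long i = 0; i < N; i++)
        a[i] = (double)i;
    printf("a[N-1] = %f\n", a[N-1]);

    hbw_free(a);            /* hbw_free() must pair with hbw_malloc() */
    return 0;
}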
64 Physical Cores & 256 Logical Cores
$ vi /proc/cpuinfo
Each KNL core exposes 4 hardware threads, so a 64-core KNL shows 256 logical processors in /proc/cpuinfo (Nurion's 68-core 7250 shows 272).
Thread Management
Thread binding
  How threads are placed on cores can seriously affect performance, especially when running many threads.
  export KMP_AFFINITY=compact,verbose
  Compact: threads are packed as close to each other as possible.
  Scatter: threads are spread as evenly as possible across the cores.
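A small probe, not from the original deck, that makes the binding visible: each OpenMP thread reports the OS processor it is running on via sched_getcpu() (glibc; the file name affinity.c is just for the example). Run it once with KMP_AFFINITY=compact,verbose and once with scatter,verbose and compare the mappings with the verbose log shown on the next slide.

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>          /* sched_getcpu(), glibc extension */
#include <omp.h>

int main(void)
{
    /* Each thread prints the logical processor it was bound to. */
    #pragma omp parallel
    {
        printf("thread %3d -> OS proc %3d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}

$ icc -qopenmp affinity.c -o affinity.x
$ KMP_AFFINITY=compact,verbose ./affinity.x
$ KMP_AFFINITY=scatter,verbose ./affinity.x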
Set KMP_AFFINITY
Can you guess the environment options of this process?
OMP: Info #156: KMP_AFFINITY: 256 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 64 cores/pkg x 4 threads/core (64 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 64 maps to package 0 core 0 thread 1
OMP: Info #242: KMP_AFFINITY: pid 4393 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 4660 thread 1 bound to OS proc set {64}

Answer: two threads bound to the two hardware threads of core 0, i.e.
export OMP_NUM_THREADS=2
export KMP_AFFINITY=compact,verbose
Examples
1. Dense matrix multiplication
2. Dot product
3. Histogram
4. Loop dependency
5. SoA vs. AoS
Code compile
Compile script (compile.sh)
$ cat compile.sh
if [ $# -lt 1 ]
then
    echo "please, give one of numbers; 1, 2, or 3"
fi
case "$1" in
1)  #01_MMmul
    icc 01_MMmul/MMmul.c -std=c11 -qopt-report=5 -O2 -no-vec -qopenmp -o 01_MMmul/MMmul_O2.ex
    mv 01_MMmul/MMmul.optrpt 01_MMmul/MMmul_O2.optrpt
    icc 01_MMmul/MMmul.c -std=c11 -qopt-report=5 -O3 -no-vec -qopenmp -o 01_MMmul/MMmul_O3.ex
    mv 01_MMmul/MMmul.optrpt 01_MMmul/MMmul_O3.optrpt
    icc 01_MMmul/MMmul.c -std=c11 -qopt-report=5 -O3 -qopenmp -xmic-avx512 -o 01_MMmul/MMmul_O3_AVX512.ex
    mv 01_MMmul/MMmul.optrpt 01_MMmul/MMmul_O3_AVX512.optrpt
    icc 01_MMmul/MMmul.c -std=c11 -qopt-report=5 -O3 -qopenmp -xmic-avx512 -DHAVE_CBLAS -mkl -o 01_MMmul/MMmul_MKL.ex
    mv 01_MMmul/MMmul.optrpt 01_MMmul/MMmul_MKL.optrpt
    ;;
2)  #02_VVdot
    icc 02_VVdot/VVdot.c -std=c11 -O0 -qopt-report=5 -qopenmp -no-vec -o 02_VVdot/VVdot_O0.ex
    mv 02_VVdot/VVdot.optrpt 02_VVdot/VVdot_O0.optrpt
    icc 02_VVdot/VVdot.c -std=c11 -O1 -qopt-report=5 -qopenmp -no-vec -o 02_VVdot/VVdot_O1.ex
    mv 02_VVdot/VVdot.optrpt 02_VVdot/VVdot_O1.optrpt
    icc 02_VVdot/VVdot.c -std=c11 -O2 -qopt-report=5 -qopenmp -no-vec -o 02_VVdot/VVdot_O2.ex
    mv 02_VVdot/VVdot.optrpt 02_VVdot/VVdot_O2.optrpt
    icc 02_VVdot/VVdot.c -std=c11 -O2 -qopt-report=5 -qopenmp -xmic-avx512 -o 02_VVdot/VVdot_O2_AVX512.ex
    mv 02_VVdot/VVdot.optrpt 02_VVdot/VVdot_O2_AVX512.optrpt
    icc 02_VVdot/VVdot.c -std=c11 -O3 -qopt-report=5 -qopenmp -xmic-avx512 -o 02_VVdot/VVdot_O3_AVX512.ex
    mv 02_VVdot/VVdot.optrpt 02_VVdot/VVdot_O3_AVX512.optrpt
    ;;
Code compile
Compile script (compile.sh), continued
3)  #03_Histogram
    icc 03_Histogram/Histogram.c -std=c11 -qopt-report=5 -O2 -qopenmp -no-vec -o 03_Histogram/Histogram_O2.ex
    mv 03_Histogram/Histogram.optrpt 03_Histogram/Histogram_O2.optrpt
    icc 03_Histogram/Histogram.c -std=c11 -qopt-report=5 -O2 -qopenmp -xmic-avx512 -o 03_Histogram/Histogram_O2_AVX512.ex
    mv 03_Histogram/Histogram.optrpt 03_Histogram/Histogram_O2_AVX512.optrpt
    icc 03_Histogram/Histogram.c -std=c11 -qopt-report=5 -O3 -qopenmp -xmic-avx512 -o 03_Histogram/Histogram_O3_AVX512.ex
    mv 03_Histogram/Histogram.optrpt 03_Histogram/Histogram_O3_AVX512.optrpt
    ;;
*)  echo "Wrong argument. please check"
    ;;
esac
For 04_loop and 05_soa, cd into the corresponding directory and run make.
Example 1: Dense Matrix Multiplication
Human-friendly code (i-j-k order)
for(int i=0; i<SIZE; i++) {
    for(int j=0; j<SIZE; j++) {
        double sum = 0;
        for(int k=0; k<SIZE; k++) {
            sum += A[i][k] * B[k][j];
        }
        C[i][j] = sum;
    }
}
For a 4 x 4 case, # of cache misses: 4 + 16 + 4 = 24
For a general SIZE x SIZE case, # of cache misses: SIZE + SIZE*SIZE + SIZE = SIZE * (SIZE + 2)
Example 1: Dense Matrix Multiplication
Cache- & vectorization-friendly code (i-k-j order)
for(int i=0; i<SIZE; i++) {
    for(int k=0; k<SIZE; k++) {
        double A_val = A[i][k];
        for(int j=0; j<SIZE; j++) {
            C[i][j] += A_val * B[k][j];
        }
    }
}
For a 4 x 4 case, # of cache misses: 4 + 4 + 4 = 12
For a general SIZE x SIZE case, # of cache misses: SIZE + SIZE + SIZE = 3 * SIZE
Example 1: Dense Matrix Multiplication
Source code - MMmul.c

#include <stdio.h>
#include <string.h>
#include <xmmintrin.h>   /* _mm_malloc / _mm_free */
#include <omp.h>

#define SIZE 4096

int main(int argc, char *argv[])
{
    double time;
    double *A = (double*)_mm_malloc(sizeof(double)*SIZE*SIZE, 64);
    double *B = (double*)_mm_malloc(sizeof(double)*SIZE*SIZE, 64);
    double *C = (double*)_mm_malloc(sizeof(double)*SIZE*SIZE, 64);

    #pragma omp parallel for
    for(int i=0; i<SIZE; i++) {
        #pragma vector aligned
        #pragma omp simd
        for(int j=0; j<SIZE; j++) {
            A[i*SIZE+j] = (double)(i + j);
            B[i*SIZE+j] = (double)(j - i);
        }
    }

    /////////////////////////////////////////////////
    memset(C, 0, sizeof(double)*SIZE*SIZE);
    time = -omp_get_wtime();
    #pragma omp parallel for
    for(int i=0; i<SIZE; i++) {
        #pragma omp simd
        #pragma vector aligned
        for(int j=0; j<SIZE; j++) {
            double sum = 0;
            for(int k=0; k<SIZE; k++) {
                sum += A[i*SIZE+k] * B[k*SIZE+j];
            }
            C[i*SIZE+j] = sum;
        }
    }
    time += omp_get_wtime();
    printf("\ti-j-k MMmul time: %lf (secs)\n", time);
    printf("\t\tlast element: %lf\n\n", C[(SIZE-1)*SIZE+SIZE-1]);

    /////////////////////////////////////////////////
    memset(C, 0, sizeof(double)*SIZE*SIZE);
    time = -omp_get_wtime();
    #pragma omp parallel for
    for(int i=0; i<SIZE; i++) {
        for(int k=0; k<SIZE; k++) {
            double A_val = A[i*SIZE+k];
            #pragma omp simd
            #pragma vector aligned
            for(int j=0; j<SIZE; j++) {
                C[i*SIZE+j] += A_val * B[k*SIZE+j];
            }
        }
    }
    time += omp_get_wtime();
    printf("\ti-k-j MMmul time: %lf (secs)\n", time);
    printf("\t\tlast element: %lf\n\n", C[(SIZE-1)*SIZE+SIZE-1]);

    /////////////////////////////////////////////////
    _mm_free(A); _mm_free(B); _mm_free(C);
    return 0;
}
Example 1: Dense Matrix Multiplication
Results (charts): i-j-k vs. i-k-j order, vectorized with the simd directives vs. auto-vectorization without them.
Example 2: Dot Product (Prefetch)
Dot product between a sparse vector and a dense vector

double *A = malloc(sizeof *A * N);      /* sparse-vector values */
double *B = malloc(sizeof *B * M);      /* dense vector */
double *B_ = malloc(sizeof *B_ * N);    /* packed copy of the needed B entries */
double *C = malloc(sizeof *C * N);
int *index = malloc(sizeof *index * N); /* positions of the nonzeros in B */

Version 1: gather from B on every multiply (indirect, non-unit-stride access):
for (int i = 0; i < N; i++)
    C[i] = A[i] * B[index[i]];

Version 2: first pack the needed entries of B into the contiguous buffer B_, then multiply with unit-stride accesses:
for (int i = 0; i < N; i++)
    B_[i] = B[index[i]];
for (int i = 0; i < N; i++)
    C[i] = A[i] * B_[i];

(Diagram: index = {1, 3, 6, 8, ...} selects scattered elements of B; in version 2 they are copied into B_ once, so the multiply loop streams through memory instead of gathering on every pass.)
Example 2: Dot Product (Prefetch)
Source code of the dot product (VVdot.c)

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

#define N 160000000
#define Nnz 64000

int main(int argc, char **argv){
    double time;
    int *index = malloc(sizeof *index * Nnz);
    double *svector_in = malloc(sizeof *svector_in * Nnz);
    double *fvector_in = malloc(sizeof *fvector_in * N);
    double *svector_out = malloc(sizeof *svector_out * Nnz);
    double *temp = malloc(sizeof *temp * Nnz);

    for (int i = 0; i < N; i++)
        fvector_in[i] = (double)(i);
    for (int i = 0; i < Nnz; i++) {
        svector_in[i] = (double)(i);
        index[i] = i * (int)(N / Nnz);
        svector_out[i] = 0.;
        temp[i] = fvector_in[index[i]];   /* packed copy (prefetch) */
    }

    time = -omp_get_wtime();
    #pragma omp parallel for
    for (int j = 0; j < 100000; j++)
        for (int i = 0; i < Nnz; i++)
            svector_out[i] = svector_in[i] * fvector_in[index[i]];
    time += omp_get_wtime();
    printf("\t1 VVdot time:%lf (secs)\n", time);

    time = -omp_get_wtime();
    #pragma omp parallel for
    for (int j = 0; j < 100000; j++)
        for (int i = 0; i < Nnz; i++)
            svector_out[i] = svector_in[i] * temp[i];
    time += omp_get_wtime();
    printf("\t2 VVdot time: %lf (secs)\n", time);

    free(index); free(svector_in);
    free(svector_out); free(fvector_in); free(temp);
    return 0;
}
Example 2: Dot Product (Prefetch)
Results without MCDRAM (charts): 34 threads w/ scatter and 68 threads w/ scatter.
$ cat openmp.sh
#!/bin/bash
#PBS -V
#PBS -N OMP_job
#PBS -q normal
#PBS -l walltime=00:10:00
#PBS -l select=1:ncpus=68:ompthreads=68
#PBS -l place=scatter
cd $PBS_O_WORKDIR
./02_VVdot/VVdot_O0.ex
./02_VVdot/VVdot_O1.ex
./02_VVdot/VVdot_O2.ex
./02_VVdot/VVdot_O2_AVX512.ex
./02_VVdot/VVdot_O3_AVX512.ex
Example 3: Histogram
Human-friendly code
for(int i=0; i<N; i++) {
    int index = (int)(age[i] / 20);
    hist[index]++;
}
(Walkthrough: each age is processed one element at a time, e.g. for age = {63, 29, 67, 46, 52, 19, 3, 22, 34}: 63/20 -> bin 3, 29/20 -> bin 1, 67/20 -> bin 3, 46/20 -> bin 2, 52/20 -> bin 2, and so on; one scalar increment of hist per iteration.)
Example 3: Histogram
Cache- & vectorization-friendly code
Compute the bin indices for a whole block of VL elements at once in a vectorizable loop, then increment the histogram from the precomputed index array.
(Walkthrough: for the same ages, one vector operation yields index = {3, 1, 3, 2, ...} for VL elements at a time, and the scalar increments then update hist from that block; see the full source on the next slide.)
Example 3: Histogram
Source code - Histogram.c

#include <stdio.h>
#include <time.h>
#include <xmmintrin.h>   /* _mm_malloc / _mm_free */
#include <omp.h>

#define N 960000000
#define VL 512

int main(int argc, char *argv[])
{
    srand(time(NULL));
    double time;
    int *age = (int*)_mm_malloc(sizeof(int)*N, 64);
    int hist[5];
    int randomnum = 0;

    #pragma omp parallel
    {
        randomnum = rand() % 100;
        #pragma omp for simd
        #pragma vector aligned
        for(int i=0; i<N; i++)
            age[i] = randomnum;
    }

    /////////////////////////////////////////////////
    for(int i=0; i<5; i++)
        hist[i] = 0;
    time = -omp_get_wtime();
    for(int i=0; i<N; i++) {
        int index = (int)(age[i] / 20);
        hist[index]++;
    }
    time += omp_get_wtime();
    printf("\t1 Histogram time: %lf (secs)\n", time);
    for(int i=0; i<5; i++)
        printf("\t\t%d\n", hist[i]);
    printf("\n");

    /////////////////////////////////////////////////
    for(int i=0; i < 5; i++)
        hist[i] = 0;
    time = -omp_get_wtime();
    #pragma omp parallel
    {
        int *index = (int*)_mm_malloc(sizeof(int)*VL, 64);
        int hist_private[5];
        for(int i=0; i<5; i++)
            hist_private[i] = 0;
        #pragma omp for
        for(int i=0; i<N; i+=VL) {
            #pragma omp simd
            #pragma vector aligned
            for(int j=i; j<i+VL; j++)
                index[j-i] = (int)(age[j] / 20);
            for(int j=0; j<VL; j++)
                hist_private[index[j]]++;
        }
        #pragma omp critical
        {
            for(int i=0; i<5; i++)
                hist[i] += hist_private[i];
        }
        _mm_free(index);
    }
    time += omp_get_wtime();
    printf("\t2 Histogram time: %lf (secs)\n", time);
    for(int i=0; i<5; i++)
        printf("\t\t%d\n", hist[i]);
    printf("\n");

    ////////////////////////////////
    _mm_free(age);
    return 0;
}
Example 3: Histogram
Results (chart).
Example 3: Histogram - PBS
Results (chart comparing Histogram_O2.ex, Histogram_O2_AVX512.ex, and Histogram_O3_AVX512.ex).
$ cat openmp.sh
#!/bin/bash
#PBS -V
#PBS -N OMP_job
#PBS -q normal
#PBS -l walltime=00:10:00
#PBS -l select=1:ncpus=68:ompthreads=68
cd $PBS_O_WORKDIR
./Histogram_O2.ex
./Histogram_O2_AVX512.ex
./Histogram_O3_AVX512.ex
Example 3: Histogram
Optimization report (see the generated Histogram_*.optrpt files; the report screenshots are omitted here).
Example 4: Loop Dependency
Loop dependency and vectorization

#define N 100000000
int *a = malloc(sizeof *a * (N + 1));

/* Loop 1: a[i] is READ by iteration i-1 before iteration i WRITES it. */
for (int i = 0; i < N; i++)
    a[i] = a[i + 1];

/* Loop 2: a[i] is WRITTEN by iteration i before iteration i+1 READS it. */
for (int i = 1; i <= N; i++)
    a[i] = a[i - 1];

In loop 1 every cross-iteration dependence is write-after-read (WAR): the old value of each element is consumed before it is overwritten. In loop 2 every dependence is read-after-write (RAW): iteration i+1 must see the value just written by iteration i, so the sequential order matters.
WAR: write after read. RAW: read after write. Which one is vectorizable?
Example 4: Loop Dependency
Source code (the listing on the original slide was an image and is not reproduced here).
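A minimal sketch consistent with the loops above and with the printed results on the next slide; the initialization a[i] = i and the final reduction printed with %e are assumptions reverse-engineered from those outputs (the shifted WAR array sums to N(N+1)/2 = 5.000000e+15, while the sequential RAW loop propagates a[0] = 0 everywhere).

#include <stdio.h>
#include <stdlib.h>

#define N 100000000

int main(void)
{
    int *a = malloc(sizeof *a * (N + 1));
    double sum;

    /* WAR loop: shifts the array left, so a[i] becomes i+1. */
    for (long i = 0; i <= N; i++) a[i] = (int)i;
    for (long i = 0; i < N; i++) a[i] = a[i + 1];
    sum = 0.0;
    for (long i = 0; i < N; i++) sum += a[i];
    printf("write after read : %e\n", sum);   /* N(N+1)/2 = 5.000000e+15 */

    /* RAW loop: run sequentially, a[0] propagates through the whole array. */
    for (long i = 0; i <= N; i++) a[i] = (int)i;
    for (long i = 1; i <= N; i++) a[i] = a[i - 1];
    sum = 0.0;
    for (long i = 1; i <= N; i++) sum += a[i];
    printf("read after write : %e\n", sum);   /* all copies of a[0] = 0 */

    /* The "(simd)" runs on the next slide repeat both loops with
     * #pragma omp simd forced on; vectorizing the RAW loop then gives
     * the wrong answer (5e+15 instead of 0). */
    free(a);
    return 0;
}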
Example 4: Loop Dependency
Results of the code run
icc -std=c99 -qopt-report=5 -xmic-avx512 -o loop loop.c
$ ./loop
write after read : 5.000000e+15
read after write : 0.000000e+00
write after read (simd) : 5.000000e+15
read after write (simd) : 5.000000e+15
Example 4: Loop Dependency - PBS
Results of the code run
icc -std=c99 -qopt-report=5 -xcommon-avx512 -o loop loop.c
$ cat serial.sh
#!/bin/bash
#PBS -V
#PBS -N Serial_job
#PBS -q normal
#PBS -l walltime=00:20:00
#PBS -l select=1
cd $PBS_O_WORKDIR
./loop
$ cat Serial_job.o6803
write after read : 5.000000e+15
read after write : 0.000000e+00
write after read (simd) : 5.000000e+15
read after write (simd) : 5.000000e+15
Example 4: Loop Dependency
Looking at the loops again in light of the results: the WAR loop (a[i] = a[i+1]) is safe to vectorize, but forcing simd on the RAW loop (a[i] = a[i-1]) changes its meaning; the run prints 5.000000e+15 instead of the correct 0.000000e+00.

#define N 100000000
int *a = malloc(sizeof *a * (N + 1));
for (int i = 0; i < N; i++)
    a[i] = a[i + 1];      /* WAR: vectorizable */
for (int i = 1; i <= N; i++)
    a[i] = a[i - 1];      /* RAW: NOT vectorizable (!!!!!) */
Example 4: Loop Dependency
Optimization report
LOOP BEGIN at loop.c(30,5)
  remark #25401: memcopy(with guard) generated
  remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
  LOOP BEGIN at loop.c(30,5) <Multiversioned v2>
    remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
    remark #25439: unrolled with remainder by 2
    remark #25456: Number of Array Refs Scalar Replaced In Loop: 2
  LOOP END
  LOOP BEGIN at loop.c(30,5) <Remainder, Multiversioned v2>
  LOOP END
LOOP END
Example 5: Structure of Arrays vs. Array of Structures
SoA and vectorization

#define N 100000000

/* Array of Structures (AoS): x and y interleave in memory */
struct {
    double x;
    double y;
} *point = malloc(sizeof *point * N);

/* Structure of Arrays (SoA): each field is a contiguous array */
struct {
    double *x;
    double *y;
} set;
set.x = malloc(sizeof *(set.x) * N);
set.y = malloc(sizeof *(set.y) * N);

(Diagram: in AoS the layout is x y x y x y ..., so touching only x or only y is a stride-2 access; in SoA each array is x x x ... / y y y ..., a unit-stride (stride-1) access.)
Example 5: Structure of Arrays vs. Array of Structures
Source code (the listing on the original slide was an image and is not reproduced here).
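A minimal sketch consistent with the declarations above, the printed output, and the optimization report (which shows a stride-2 store to point->y[i] fed by point->x[i], and unit-stride set.y[i] from set.x[i]); the kernel body (here y = 2x) and the timing via omp_get_wtime() are assumptions, the latter suggested by the -lgomp link flag on the next slide.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>   /* omp_get_wtime(); hence -lgomp on the compile line */

#define N 100000000

int main(void)
{
    struct { double x; double y; } *point = malloc(sizeof *point * N);
    struct { double *x; double *y; } set;
    set.x = malloc(sizeof *(set.x) * N);
    set.y = malloc(sizeof *(set.y) * N);

    for (int i = 0; i < N; i++) {
        point[i].x = set.x[i] = (double)i;
        point[i].y = set.y[i] = 0.0;
    }

    /* AoS: stride-2 loads/stores, as flagged in the optimization report. */
    double time = -omp_get_wtime();
    for (int i = 0; i < N; i++)
        point[i].y = 2.0 * point[i].x;
    time += omp_get_wtime();
    printf("Array of Structure: %lf (secs)\n", time);

    /* SoA: unit-stride streaming accesses. */
    time = -omp_get_wtime();
    for (int i = 0; i < N; i++)
        set.y[i] = 2.0 * set.x[i];
    time += omp_get_wtime();
    printf("Structure of Array: %lf (secs)\n", time);

    free(point); free(set.x); free(set.y);
    return 0;
}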
Example 5: Structure of Arrays vs. Array of Structures
Performance on a Xeon Phi Knights Landing 7210 (64 cores, 4 hyperthreads/core), without MCDRAM
icc -std=c99 -qopt-report=5 -xmic-avx512 -o soa soa.c -lgomp
$ ./soa
Array of Structure: 0.262025 (secs)
Structure of Array: 0.123625 (secs)
Example 5: Structure of Arrays vs. Array of Structures - PBS
Performance on a Xeon Phi Knights Landing 7250 (68 cores, 4 hyperthreads/core), without MCDRAM
icc -std=c99 -qopt-report=5 -xcommon-avx512 -o soa soa.c -lgomp
$ cat serial.sh
#!/bin/bash
#PBS -V
#PBS -N Serial_job
#PBS -q normal
#PBS -l walltime=00:20:00
#PBS -l select=1
cd $PBS_O_WORKDIR
./soa
$ cat Serial_job.o6804
Array of Structure: 0.222429 (secs)
Structure of Array: 0.104448 (secs)
Example 5: Structure of Arrays vs. Array of Structures
Optimization report (AoS loop)
LOOP BEGIN at soa.c(23,5)
  remark #15416: vectorization support: non-unit strided store was generated for the variable <point->y[i]>, stride is 2 [ soa.c(24,9) ]
  remark #15415: vectorization support: non-unit strided load was generated for the variable <point->x[i]>, stride is 2 [ soa.c(24,27) ]
  remark #15305: vectorization support: vector length 16
  remark #15399: vectorization support: unroll factor set to 2
  remark #15300: LOOP WAS VECTORIZED
  remark #15452: unmasked strided loads: 1
  remark #15453: unmasked strided stores: 1
  remark #15475: --- begin vector cost summary ---
  remark #15476: scalar cost: 7
  remark #15477: vector cost: 4.180
  remark #15478: estimated potential speedup: 1.670
  remark #15488: --- end vector cost summary ---
  remark #25015: Estimate of max trip count of loop=3125000
LOOP END
Example 5: Structure of Arrays vs. Array of Structures
Optimization report (SoA loop)
LOOP BEGIN at soa.c(29,5)
  remark #15388: vectorization support: reference set.y[i] has aligned access [ soa.c(30,9) ]
  remark #15389: vectorization support: reference set.x[i] has unaligned access [ soa.c(30,25) ]
  remark #15381: vectorization support: unaligned access used inside loop body
  remark #15412: vectorization support: streaming store was generated for set.y[i] [ soa.c(30,9) ]
  remark #15412: vectorization support: streaming store was generated for set.y[i] [ soa.c(30,9) ]
  remark #15305: vectorization support: vector length 16
  remark #15309: vectorization support: normalized vectorization overhead 1.182
  remark #15300: LOOP WAS VECTORIZED
  remark #15442: entire loop may be executed in remainder
  remark #15449: unmasked aligned unit stride stores: 1
  remark #15450: unmasked unaligned unit stride loads: 1
  remark #15467: unmasked aligned streaming stores: 2
  remark #15475: --- begin vector cost summary ---
  remark #15476: scalar cost: 7
  remark #15477: vector cost: 0.680
  remark #15478: estimated potential speedup: 10.180
  remark #15488: --- end vector cost summary ---
  remark #25015: Estimate of max trip count of loop=6250000
LOOP END
Q&A
Computational Science Application Center / Science Data School, KISTI