정기철 (kcjung@ssu.ac.kr / http://hci.ssu.ac.kr), School of Media, College of IT, Soongsil University (http://www.ssu.ac.kr / http://media.ssu.ac.kr)
VMD/NAMD Molecular Dynamics: Visual Molecular Dynamics (VMD) / Nanoscale Molecular Dynamics (NAMD) at the University of Illinois, 240x speedup. http://www.ks.uiuc.edu/research/vmd/projects/ece498/lecture/
Evolved Machines: neural circuit simulation, 130x speedup.
Etc.: MRI 40–170x, virus simulation 110x, EM 45x speedups.
What is a GPU? Graphics Processing Unit. First introduced by NVIDIA in August 1999; built to offload graphics work from the CPU, evolving from 3D graphics accelerator chips.
Why use a GPU? Compute throughput: 367 GFLOPS vs. 32 GFLOPS. Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s (GPU vs. CPU).
Why is the GPU so fast? The GPU is specialized for parallel processing and arithmetic: more of its transistors are devoted to data processing than to flow control or data caching.
GeForce 7800 GTX
GeForce 8800 GTX
[Figure: block diagram of the unified architecture — the host feeds an input assembler and thread execution manager, which issue vertex, geometry, and pixel work to arrays of streaming processors (SP); each SP group has texture fetch (TF) units and an L1 texture cache, with load/store paths through L2 caches to the framebuffer (FB) partitions that form global memory.]
Streaming Multiprocessor: composed of 8 streaming processors. Load/store architecture; 32-bit integer instructions; IEEE 754 32-bit floating point; branch, call, return, and predication. 8K registers partitioned among threads; 16 KB shared memory for data sharing among cooperating threads.
What is CUDA? Compute Unified Device Architecture. Feb 2007: first CUDA beta released. Jul 2007: CUDA 1.0 official release. Nov 2007: CUDA 1.1 released. Currently the CUDA 2.0 beta is available.
What is CUDA? (continued) A general-purpose programming model: the user launches batches of threads on the GPU, treating the GPU as a massively parallel processor, with a driver for loading programs onto the GPU. The driver is optimized for computation: it uses no graphics API, shares data with OpenGL buffers, guarantees maximum download and readback speeds, and provides explicit GPU memory management.
CUDA Advantages over Legacy GPGPU: unrestricted memory access — threads can read and write wherever needed. Shared memory and threads — threads cooperate by reading data from shared memory, and every thread in a block can access that block's shared memory. Relatively little background required — CUDA is an extension of C and needs no graphics knowledge. No graphics-API overhead.
Environment: a GeForce 8 series or later graphics card. Windows: Visual Studio 2003 or 2005 (Visual Studio 2008 support planned for Q4). Linux: Red Hat Enterprise Linux 3.x, 4.x, or 5.x; SUSE Linux Enterprise Desktop 10-SP1; openSUSE 10.1 or 10.2; Ubuntu 7.04. Mac OS X 10.5.2.
CUDA Programming Model: Kernel = a GPU program. Grid = the array of thread blocks that executes a kernel. Block = a group of SIMD threads that execute the kernel and communicate through shared memory.
[Figure: the host launches Kernel 1 on Grid 1 (a 3x2 array of blocks) and Kernel 2 on Grid 2 on the device; Block (1,1) is expanded into its 5x3 array of threads.]
Thread and Block IDs: threads and blocks have IDs, so each thread can decide which data to work on. Block ID: 1D or 2D. Thread ID: 1D, 2D, or 3D. Multidimensional IDs simplify memory addressing for multidimensional data: image processing, solving partial differential equations on volumes, etc.
[Figure: Grid 1 as a 3x2 array of blocks; Block (1,1) expanded into its 5x3 array of threads.]
CUDA Memory Spaces: access rights per space:
  Registers: R/W per thread
  Local memory: R/W per thread
  Shared memory: R/W per block
  Global memory: R/W per grid
  Constant memory: R/O per grid
  Texture memory: R/O per grid
The host can read and write global, constant, and texture memory.
[Figure: each block has its own shared memory and per-thread registers and local memory; the host and all blocks access the device's global, constant, and texture memory.]
Arrays of parallel threads: a kernel is executed by an array of threads. All threads run the same code; each thread has an ID it uses to make control decisions and compute memory addresses.
Thread Blocks: the kernel's thread array is divided into multiple blocks. Threads within a block cooperate through shared memory; threads in different blocks cannot cooperate.
Examples: Increment Array Elements; Neural Networks.
Example: Increment Array Elements

CPU program:

void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

void main()
{
    ...
    increment_cpu(a, b, N);
}

CUDA program:

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

void main()
{
    ...
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
Example: Increment Array Elements. Add b to each of the N elements of vector a. Assuming N = 16 and blockDim.x = 4:
  blockIdx.x = 0: threadIdx.x = 0,1,2,3 → idx = 0,1,2,3
  blockIdx.x = 1: threadIdx.x = 0,1,2,3 → idx = 4,5,6,7
  blockIdx.x = 2: threadIdx.x = 0,1,2,3 → idx = 8,9,10,11
  blockIdx.x = 3: threadIdx.x = 0,1,2,3 → idx = 12,13,14,15
where int idx = blockDim.x * blockIdx.x + threadIdx.x;
Neural Networks: an (artificial) neural network models the workings of the human brain as connections between neurons, i.e., it models human learning. Basic operations — learning: adjusting the network's connection weights according to pattern classes; recall: computing the distance between the learned weights and an input vector and assigning the input to the nearest class. Human-like learning abilities: pattern classification, recognition, optimization, prediction. Applications: image recognition, speech recognition, robot control, and many other AI fields.
Concept
[Figure: a biological neuron (nucleus, cell body, dendrites, axon) next to its artificial counterpart.] Inputs X_1 ... X_n are weighted by W_1 ... W_n and summed, X = Σ_i W_i X_i, and the output is Y = f(X), where f is the activation function.
Kinds of Neural Networks (by input type and learning method):
  Binary input, supervised learning: Hopfield network
  Binary input, combined supervised and unsupervised learning: Counterpropagation network
  Binary input, unsupervised learning: ART model
  Real-valued input, supervised learning: Perceptron, Multilayer Perceptron
  Real-valued input, unsupervised learning: Competitive learning, SOM
Multilayer Perceptron: after Minsky's 1969 attack, neural network research stagnated. From 1986, work by Rumelhart and others completed the error backpropagation algorithm for multilayer networks, which can learn non-linear problems — a breakthrough for neural network research.
Multilayer Perceptron Model
[Figure: a network with input layer i, hidden layer j, and output layer k; each unit computes net = Σ w·x from the previous layer's outputs through weights w, and emits o = σ(net), with superscripts denoting the layer.]
What do I say?
GPU Operation: the inner product computed at each node can be converted into a matrix multiplication by stacking the input vectors and weight vectors into matrices.
MLP Implementation
MLP Implementation
What is OpenMP? Open Multi-Processing. Aug 1997: FORTRAN 1.0 standard released. Aug 1998: C/C++ 1.0 standard released. Mar 2002: FORTRAN and C/C++ 2.0 standards released. May 2005: unified C/C++/FORTRAN 2.5 standard released.
Why OpenMP? Frequent data transfers to the GPU cause delays, so data is generated in bulk and transferred at once — but generating that bulk data incurs its own overhead. Parallelizing the generation on a multicore CPU reduces this overhead, and it requires only small modifications to the source.
OpenMP Settings (Visual Studio 2005 or later)
Model
Fork-Join Model
Parallel Section (for)

#pragma omp for [clause [clause] ...]
    for-loop
Parallel Section (sections)

#pragma omp sections [clause [clause] ...]
{
    [#pragma omp section]
    structured code block
}
Using Neural Network

#pragma omp parallel sections
{
    #pragma omp section
    {
        int y1 = y + count*input_width/(mwidth-10);
        int x1 = x + (count*input_width)%(mwidth-10);
        GetConfigMatrix(x1, y1, input1);
    }
    /* ... sections 2 and 3 omitted ... */
    #pragma omp section
    {
        int y4 = y + (count+3)*input_width/(mwidth-10);
        int x4 = x + ((count+3)*input_width)%(mwidth-10);
        GetConfigMatrix(x4, y4, input4);
    }
}
Experimental Results
Time Complexity
Time Complexity
References
"GPU implementation of neural networks," Pattern Recognition, Vol. 37, Issue 6, pp. 1311-1314, 2004 (SCIE).
"Neural Network Implementation Using CUDA and OpenMP," Korea Computer Congress 2008, to be presented.