Parallel Computation of Neural Networks

Sungjoo Ha

September 8th, 2016

Parallel Computation

- The resurgence of neural networks was driven by large amounts of data and parallel computation.
- Research has recently been shifting toward ASIC-based implementations, but GPGPU is still the most widely used platform.
- This talk covers what is needed to run neural network computations on a GPGPU:
  - Computational graphs
  - GPU-friendly formulations of neural networks: FFN, convolution
  - Optimized implementation examples

Computational Graph

- Represent a computation as an abstract DAG.
- With this representation, the computations of the forward pass and the backward pass become easy to understand.

Computational Graph Example

Forward Pass

Gradient Computation

Derivatives

- Simply put, computing a gradient means asking how another value changes when a particular variable is perturbed slightly.
- Between adjacent nodes the computation is simple.
- Between non-adjacent nodes, sum the gradients along every path through which the variable can reach the target.

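A small worked example, in the spirit of the Olah reference listed at the end: take e = (a + b)(b + 1) with intermediate nodes c = a + b and d = b + 1. The local derivatives between adjacent nodes are ∂e/∂c = d, ∂e/∂d = c, ∂c/∂a = 1, ∂c/∂b = 1, ∂d/∂b = 1. The variable b reaches e along two paths (through c and through d), so

    ∂e/∂b = (∂e/∂c)(∂c/∂b) + (∂e/∂d)(∂d/∂b) = d + c = (b + 1) + (a + b)
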
Combinatorial Explosion

- In a neural network, where the weights of one layer affect every node of the next layer, the number of possible paths explodes.
- This can be avoided by grouping terms appropriately during the computation.
- A form of dynamic programming.

Forward Mode Differentiation

Reverse Mode Differentiation

Backpropagation

- Backpropagation = reverse mode differentiation.
- The partial derivatives between adjacent nodes can be worked out ahead of time.
- Implementing these in advance is what makes automatic differentiation possible.

Deep Learning Library

- Modern deep learning libraries offer a fixed set of variable types, and the derivatives of the operations that can be applied to them are implemented in advance.
- The user builds a neural network out of the provided variables and operations, which implicitly forms a computational graph.
- With this computational graph, gradients are easily obtained through backpropagation.
- On top of that, the libraries also map each operation onto a GPU-friendly form.

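A minimal sketch (not the API of any particular library) of the idea that each predefined operation carries its own pre-implemented derivative, and that composing operations implicitly records the graph used for the reverse pass:

    #include <cstdio>

    // Each operation caches what it needs in forward() and exposes its local
    // derivatives in backward(), which maps the incoming gradient dL/d(output)
    // to gradients with respect to its inputs.
    struct Add {
        float forward(float x, float y) { return x + y; }
        void backward(float dz, float* dx, float* dy) { *dx = dz; *dy = dz; }
    };

    struct Mul {
        float x, y;                              // cached operands
        float forward(float x_, float y_) { x = x_; y = y_; return x * y; }
        void backward(float dz, float* dx, float* dy) { *dx = dz * y; *dy = dz * x; }
    };

    int main() {
        // e = (a + b)(b + 1), the same example as before; the sequence of
        // calls is the implicit computational graph.
        Add add1, add2; Mul mul;
        float a = 2.0f, b = 1.0f;
        float c = add1.forward(a, b);            // c = a + b
        float d = add2.forward(b, 1.0f);         // d = b + 1
        float e = mul.forward(c, d);             // e = c * d

        // Reverse pass: start from de/de = 1 and walk the graph backwards.
        float dc, dd, da, db1, db2, done;
        mul.backward(1.0f, &dc, &dd);
        add1.backward(dc, &da, &db1);
        add2.backward(dd, &db2, &done);
        float db = db1 + db2;                    // b reaches e along two paths

        printf("e = %.1f, de/da = %.1f, de/db = %.1f\n", e, da, db);  // 6.0, 2.0, 5.0
        return 0;
    }
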
SIMD

- Single instruction, multiple data.
- NVIDIA uses the term SIMT (single instruction, multiple threads) instead of SIMD.
- When the same operation can be applied to many data elements at once, large speedups are possible:
  - Matrix multiplication
  - Element-wise function application
  - Reduction
  - Scan (prefix sum)

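Element-wise function application, for instance, maps directly onto this model. A minimal CUDA sketch (the kernel and launch parameters are illustrative, not taken from the slides):

    // Every thread runs the same code on a different element (SIMT):
    // here each thread applies a ReLU to one entry of x.
    __global__ void relu_kernel(const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)                                       // guard the tail of the array
            y[i] = x[i] > 0.0f ? x[i] : 0.0f;
    }

    // Launch with enough 256-thread blocks (8 warps each) to cover n elements:
    //   int threads = 256;
    //   int blocks = (n + threads - 1) / threads;
    //   relu_kernel<<<blocks, threads>>>(d_x, d_y, n);
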
Neural Network Implementation on GPU

- Most neural network computations can be expressed as matrix multiplications, or as element-wise function applications, reductions, scans, and so on.
- Detailed explanations follow.
- Many samples are fed at once in matrix form (mini-batch).

FFN Forward Pass in Matrix Form

FFN Backpropagation in Matrix Form

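The equations on the two slides above did not survive text extraction. As a hedged reconstruction of the standard matrix form (in the spirit of the Dolhansky reference; the notation and the rows-as-samples convention are assumptions), for a mini-batch A^(0) = X, weights W^(l), biases b^(l), and activation σ:

    Forward pass:
        Z^(l) = A^(l-1) W^(l) + b^(l)            (bias broadcast over the rows)
        A^(l) = σ(Z^(l))

    Backward pass, with loss L and ⊙ the element-wise product:
        Δ^(L) = ∂L/∂A^(L) ⊙ σ'(Z^(L))
        Δ^(l) = (Δ^(l+1) (W^(l+1))^T) ⊙ σ'(Z^(l))
        ∂L/∂W^(l) = (A^(l-1))^T Δ^(l)
        ∂L/∂b^(l) = column-wise sum of Δ^(l) over the mini-batch
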
Remarks

- The intermediate results computed during the forward pass must be kept for backpropagation.
- Gathering the gradients efficiently (a reduction) is not trivial.
- This is why multi-GPU systems have a hard time shining: simply spreading the mini-batch over several machines does not buy much.

Convolution

- The most expensive operation in modern neural networks is convolution.
- Many convolution layers are stacked on top of one another.
- A naive implementation is very slow.

Convolution Implementation

- Lowering to matrix multiplication
  - A naive lowering may require extra memory and still fail to deliver enough of a speedup.
- FFT/Winograd-based methods
  - Use the convolution theorem: under suitable conditions the Fourier transform of a convolution is the pointwise product of Fourier transforms.
  - The FFT brings O(n^2) down to O(n log n).
  - Requires a lot of additional memory.
  - Strides and similar features are not easy to implement.
- Direct computation
  - Dedicated implementations are needed to handle the corner cases well.
  - Done well, it is very fast within its target parameter space.
  - Example: cuda-convnet, Alex Krizhevsky's implementation that won ImageNet 2012.

Convolution in Matrix Form

- Transform the input into a matrix by duplicating the data.
- Compute the convolution as the product of the filter matrix and this duplicated-data matrix.

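A minimal host-side sketch of this lowering (commonly called im2col; the function name, CHW layout, unit stride, and "valid" padding are assumptions for illustration). Each column of the resulting matrix holds the input patch seen by one output pixel, so the convolution becomes a (K x C*R*S) filter matrix times a (C*R*S x OH*OW) column matrix:

    // Expand a C x H x W input into a (C*R*S) x (OH*OW) matrix for an R x S
    // filter, stride 1, no padding.  Data is duplicated: overlapping patches
    // copy the same input values into several columns.
    void im2col(const float* in, int C, int H, int W, int R, int S, float* col) {
        int OH = H - R + 1, OW = W - S + 1;            // output spatial size
        for (int c = 0; c < C; ++c)
            for (int r = 0; r < R; ++r)
                for (int s = 0; s < S; ++s) {
                    int row = (c * R + r) * S + s;     // row index in the col matrix
                    for (int oh = 0; oh < OH; ++oh)
                        for (int ow = 0; ow < OW; ++ow)
                            col[row * (OH * OW) + oh * OW + ow] =
                                in[c * H * W + (oh + r) * W + (ow + s)];
                }
    }
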
Convolution and FFT

- Fast computation using the convolution theorem.
- Exploits the fact that convolution in the original domain corresponds to element-wise multiplication in the transformed domain.

Polynomial Multiplication

    f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3
    g(x) = b_0 + b_1 x + b_2 x^2
    h(x) = f(x) g(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + c_4 x^4 + c_5 x^5

    c_0 = a_0 b_0
    c_1 = a_0 b_1 + a_1 b_0
    c_2 = a_0 b_2 + a_1 b_1 + a_2 b_0
    c_3 = a_1 b_2 + a_2 b_1 + a_3 b_0
    c_4 = a_2 b_2 + a_3 b_1
    c_5 = a_3 b_2

1D Convolution

- The 1D convolution of the sequences (a_0, a_1, a_2, a_3) and (b_0, b_1, b_2) produces exactly the same coefficients:

    c_0 = a_0 b_0
    c_1 = a_0 b_1 + a_1 b_0
    c_2 = a_0 b_2 + a_1 b_1 + a_2 b_0
    c_3 = a_1 b_2 + a_2 b_1 + a_3 b_0
    c_4 = a_2 b_2 + a_3 b_1
    c_5 = a_3 b_2

Convolution and Polynomial Multiplication

- Convolution reduces to a polynomial multiplication problem.
- It is the same as taking the product of the two polynomials and reading off the coefficient of each term.
- If polynomial multiplication can be done quickly, so can convolution.

Polynomial Representation

- Two ways to represent a polynomial
- Coefficient representation
  - f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3
  - Knowing the coefficients a_i amounts to knowing the function.
- Point-value representation
  - Knowing n pairs (x_i, f(x_i)) uniquely determines a polynomial of degree n - 1.

Polynomial Multiplication

- Multiplying two polynomials in coefficient representation takes O(n^2) operations.
- In point-value representation it takes O(n) (using 2n - 1 points).
- If converting from coefficient to point-value representation were cheap, the O(n^2) could be reduced.
- The conversion actually costs O(n log n).

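A tiny worked example (numbers chosen for illustration, not from the slides): to multiply f(x) = 1 + x and g(x) = 2 + x, whose product has degree 2, evaluate both at three points and multiply pointwise:

    f(0), f(1), f(2) = 1, 2, 3
    g(0), g(1), g(2) = 2, 3, 4
    h(0), h(1), h(2) = 2, 6, 12

The unique degree-2 polynomial through (0, 2), (1, 6), (2, 12) is h(x) = 2 + 3x + x^2 = f(x) g(x). The pointwise products cost O(n); all the remaining work lies in moving between the two representations.
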
FFT Basic Idea

- Split the polynomial into its even- and odd-degree parts.
- Divide and conquer by reusing the shared sub-computations.

    A(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + a_6 x^6 + a_7 x^7
    A_even(x) = a_0 + a_2 x + a_4 x^2 + a_6 x^3
    A_odd(x) = a_1 + a_3 x + a_5 x^2 + a_7 x^3
    A(x) = A_even(x^2) + x A_odd(x^2)
    A(-x) = A_even(x^2) - x A_odd(x^2)

- Plugging in x = 1 and x = i, evaluating the two degree-n/2 polynomials at just two points each yields four points of the original polynomial.

FFT

- Following the idea above, divide and conquer by plugging the n-th roots of unity into x:

    import cmath

    def fft(A):                       # A: n complex coefficients, n a power of two
        n = len(A)
        if n == 1:
            return A[:]
        E = fft(A[0:n:2])             # even-indexed coefficients
        D = fft(A[1:n:2])             # odd-indexed coefficients
        Y = [0j] * n
        for k in range(n // 2):
            wk = cmath.exp(2j * cmath.pi * k / n)   # k-th power of the n-th root of unity
            Y[k] = E[k] + wk * D[k]
            Y[k + n // 2] = E[k] - wk * D[k]
        return Y

FFT Remarks

- The inverse transform has almost the same form.
- The transform costs O(n log n), multiplication in the transformed domain O(n), and the inverse transform O(n log n), for O(n log n) overall.
- The CUDA platform already provides cuFFT and related libraries.

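A sketch of how FFT-based convolution might look with cuFFT (error handling omitted; the pointwise-multiply kernel and the zero-padding of both signals to length N ≥ len(a) + len(b) - 1 are assumed to be supplied by the caller):

    #include <cufft.h>

    // d_a and d_b are device buffers holding N zero-padded complex samples.
    void fft_convolve(cufftComplex* d_a, cufftComplex* d_b, int N) {
        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);           // one N-point complex transform

        cufftExecC2C(plan, d_a, d_a, CUFFT_FORWARD);   // a -> FFT(a), in place
        cufftExecC2C(plan, d_b, d_b, CUFFT_FORWARD);   // b -> FFT(b), in place

        // pointwise_multiply<<<blocks, threads>>>(d_a, d_b, N);  // user-supplied kernel:
        //   d_a[i] *= d_b[i] / N   (cuFFT's inverse transform is unnormalized)

        cufftExecC2C(plan, d_a, d_a, CUFFT_INVERSE);   // back to the original domain
        cufftDestroy(plan);
    }
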
CUDA Execution Model

CUDA Execution Model

- A hierarchy of threads, blocks, and grids.
- The unit of execution is a warp of 32 threads.
- Threads in the same warp execute the same instruction.
  - When a branch diverges, both sides of the branch are executed redundantly.
- A block runs entirely on one streaming multiprocessor.
- When a warp stalls waiting for I/O, the warp scheduler runs another warp.

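A small illustration of the divergence point above (the kernel is made up for illustration): threads of the same warp that take different branches execute both branches one after the other, each side masked off for the threads that did not take it.

    __global__ void divergent(float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)        // even and odd lanes of one warp diverge:
            y[i] = 1.0f;                 // ...the warp runs this side first,
        else
            y[i] = 2.0f * y[i];          // ...then this side; it pays for both.
    }

    // Branching at warp granularity instead keeps each warp on a single path:
    //   if ((threadIdx.x / warpSize) % 2 == 0) { ... } else { ... }
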
CUDA Memory Model

CUDA Memory Model

- All memory resources are allocated dynamically.
- Each thread has access to its registers.
  - When the allocated registers are exhausted, spilling occurs and global memory is used.
- Threads in the same block share shared memory (on-chip memory of the same kind as the L1 cache).
  - Synchronization is possible with a special instruction.
- All blocks/grids share global memory.
  - There is no way to synchronize through global memory; a new kernel has to be launched.

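A minimal sketch of block-level shared memory and synchronization (the kernel is illustrative): each block stages its 256-element chunk in shared memory, waits at the block-wide barrier, then writes the chunk back reversed. There is no comparable barrier across blocks within a single kernel launch.

    __global__ void reverse_per_block(float* data) {
        __shared__ float tile[256];          // visible to this block only
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;

        tile[t] = data[base + t];            // stage the chunk in fast shared memory
        __syncthreads();                     // every thread's write is now visible
        data[base + t] = tile[255 - t];      // read back in reversed order
    }
    // Launch with blockDim.x == 256 so the block size matches the tile size.
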
CUDA Performance Optimization

- Make the most of the I/O bandwidth.
- Use the faster kinds of memory.
- Avoid bank conflicts.
- Maximize the number of concurrently running threads (thread occupancy).
- ...

Matrix Multiplication

- Matrix multiplication shows a large gap between a naive implementation and an optimized one.
- Coalesced memory access through shared memory is required.

Naive Matrix Multiplication

- Threads reading matrix A fetch the same data from global memory over and over.

Coalesced Access of Tile A

- Read a tile of matrix A into shared memory ahead of time and reuse it.

Coalesced Access of Tile B

- In the same way, read a tile of matrix B into shared memory and reuse it.

Matrix Multiplication Remarks

- Using shared memory this way roughly doubles the speed.
- Further gains come from techniques such as double buffering, which prefetches the next tiles while computing on the current ones.
- cuBLAS, implemented with even more efficient methods, is already provided.

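A simplified sketch of the tiling described on the previous slides (square N x N row-major matrices with N a multiple of TILE are assumed; this is the textbook tiled kernel, not the cuBLAS implementation):

    #define TILE 16

    __global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
        __shared__ float As[TILE][TILE];        // tile of A staged in shared memory
        __shared__ float Bs[TILE][TILE];        // tile of B staged in shared memory

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int k0 = 0; k0 < N; k0 += TILE) {
            // Coalesced loads: adjacent threads read adjacent global addresses.
            As[threadIdx.y][threadIdx.x] = A[row * N + (k0 + threadIdx.x)];
            Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
            __syncthreads();                    // both tiles loaded before use

            for (int k = 0; k < TILE; ++k)      // each staged value is reused TILE times
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                    // done with the tiles before overwriting
        }
        C[row * N + col] = acc;
    }

    // Launch: dim3 block(TILE, TILE); dim3 grid(N / TILE, N / TILE);
    //         matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, N);
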
Reduction

- An operation that combines many values into one: ∑_i x_i

Parallel Reduction

- Put the data into shared memory.
- At each step a thread combines two elements, and the stride between the elements it touches doubles every step.
- The reduction finishes in O(log n) steps.

Parallel Reduction 2

- When threads access shared memory at the same time, non-contiguous accesses cause bank conflicts.
- Change the order in which the data is accessed.

Reduction Remarks

- There are further optimization techniques:
  - First add during load
  - Loop unrolling

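A sketch of a block-level sum reduction in the spirit of the Harris reference: sequential addressing avoids shared-memory bank conflicts, and "first add during load" performs one addition while reading from global memory. Each 256-thread block reduces 512 inputs to one partial sum; combining the per-block results takes another kernel launch or a host-side sum.

    __global__ void reduce_sum(const float* in, float* out, int n) {
        __shared__ float sdata[256];
        int t = threadIdx.x;
        int i = blockIdx.x * blockDim.x * 2 + t;

        // First add during load: each thread sums two elements while loading.
        float v = (i < n) ? in[i] : 0.0f;
        if (i + blockDim.x < n) v += in[i + blockDim.x];
        sdata[t] = v;
        __syncthreads();

        // Tree reduction with sequential addressing: the active threads and the
        // shared-memory locations they touch stay contiguous, so no bank conflicts.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (t < s) sdata[t] += sdata[t + s];
            __syncthreads();
        }
        if (t == 0) out[blockIdx.x] = sdata[0];   // one partial sum per block
    }

    // Launch with 256 threads per block and (n + 511) / 512 blocks.
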
References

- Wayne, Convolution and FFT, http://www.cs.princeton.edu/courses/archive/spr05/cos423/lectures/05fft.pdf, 2005
- Tan et al., Fast Implementation of DGEMM on Fermi GPU, International Conference for High Performance Computing, Networking, Storage and Analysis, 2011
- Harris, Optimizing Parallel Reduction in CUDA
- Dolhansky, Artificial Neural Networks: Matrix Form (Part 5), http://briandolhansky.com/blog/2014/10/30/artificial-neural-networks-matrix-form-part-5, 2014
- Chetlur et al., cuDNN: Efficient Primitives for Deep Learning, arXiv preprint, 2014
- Jia, Learning Semantic Image Representations at a Large Scale, Technical Report, 2014
- Warden, Why GEMM is at the heart of deep learning, https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/, 2015
- Olah, Calculus on Computational Graphs: Backpropagation, http://colah.github.io/posts/2015-08-backprop/, 2015
- NVIDIA, CUDA C Best Practices Guide 7.5, 2015
- Lavin and Gray, Fast Algorithms for Convolutional Neural Networks, arXiv preprint, 2015