CUDA Programming Tutorial 2
Memory Management: Matrix Transpose
Sungjoo Ha
April 20th, 2017

Memory Management

- Exploiting a parallel compute device requires memory usage grounded in an understanding of the hardware architecture
- We introduce two memory characteristics you need to know when programming in CUDA
- Matrix transpose serves as the running example

CUDA Memory Model

(Figure: the CUDA memory hierarchy. Each thread has its own per-thread local memory, each block has per-block shared memory, and all blocks of all grids share global memory.)

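To make the three scopes concrete, a minimal sketch (the names are illustrative, not from the slides; assumes a launch with 256 threads per block):

    __global__ void scopes(float *gdata)   // gdata points into global memory
    {
        int t = threadIdx.x;

        // Per-thread local storage: registers or local memory.
        float local_val = gdata[t];

        // Per-block shared memory, visible to all threads of this block.
        __shared__ float block_buf[256];
        block_buf[t] = local_val;
        __syncthreads();

        // Global memory is visible to every block in the grid.
        gdata[t] = block_buf[255 - t];
    }
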
Measurement

    cudaEventRecord(start, 0);
    for (int i = 0; i < NUM_REPS; i++) {
        kernel<<<blocks, threads>>>(d_odata, d_idata, size_x, size_y);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float kernelTime;
    cudaEventElapsedTime(&kernelTime, start, stop);

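The events themselves must be created and destroyed around this loop; a minimal sketch of the full harness, assuming the same names (a warm-up launch is added so one-time startup costs don't skew the average):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch, excluded from the measurement.
    kernel<<<blocks, threads>>>(d_odata, d_idata, size_x, size_y);
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    for (int i = 0; i < NUM_REPS; i++) {
        kernel<<<blocks, threads>>>(d_odata, d_idata, size_x, size_y);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float kernelTime;
    cudaEventElapsedTime(&kernelTime, start, stop);  // in milliseconds
    float avgMs = kernelTime / NUM_REPS;             // per launch

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
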
Grid, block, thread

    dim3 blocks(size_x / TILE_DIM, size_y / TILE_DIM);
    dim3 threads(TILE_DIM, BLOCK_ROWS);

- Grids, blocks, and threads can each have up to three dimensions
- Global IDs are assigned in the obvious way (see the sketch below)

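A minimal sketch of that "obvious way" for a 2D launch (using the built-in blockDim; the kernels on the following slides use the TILE_DIM constant instead, since each thread covers several rows of a tile):

    // Global (x, y) coordinate of this thread across the whole grid.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Row-major flattened global ID.
    int gid = y * (gridDim.x * blockDim.x) + x;
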
Comparison Strategy

- Compare each implementation against code that achieves the best attainable performance
- Proceed incrementally, closing the performance gap step by step

Copy

- Computing a matrix transpose can never be faster than a plain memory copy
- The copy is therefore our upper bound on performance¹

¹ Of course, even the copy is not entirely trivial and leaves room for optimization

Copy

    __global__ void copy(float *odata, float *idata, int width, int height)
    {
        int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
        int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
        int index = xIndex + width * yIndex;

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            odata[index + i * width] = idata[index + i * width];
        }
    }

Copy Throughput

    Matrix Size   Operation   Throughput
    1024²         Copy        216.1302 GB/s
    2048²         Copy        219.5193 GB/s
    4096²         Copy        211.8688 GB/s
    8192²         Copy        219.6724 GB/s
    16384²        Copy        221.8129 GB/s

- Throughput is essentially independent of the matrix size²

² TILE_DIM = 16, BLOCK_ROWS = 16

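These figures are presumably derived as effective bandwidth: each element is read once and written once. A sketch using the names from the Measurement slide:

    // 2 * 4 bytes move per element (one read, one write).
    float seconds    = (kernelTime / NUM_REPS) * 1e-3f;               // ms -> s
    float gigabytes  = 2.0f * size_x * size_y * sizeof(float) / 1e9f;
    float throughput = gigabytes / seconds;                           // GB/s
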
Naive Transpose

    __global__ void transpose1(float *odata, float *idata, int width, int height)
    {
        int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
        int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
        int index_in  = xIndex + width * yIndex;
        int index_out = yIndex + height * xIndex;

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            odata[index_out + i] = idata[index_in + i * width];
        }
    }

Naive Transpose Throughput

    Matrix Size   Copy             Naive
    1024²         216.1302 GB/s     70.6491 GB/s
    2048²         219.5193 GB/s     87.6363 GB/s
    4096²         211.8688 GB/s     92.5567 GB/s
    8192²         219.6724 GB/s    105.7644 GB/s
    16384²        221.8129 GB/s     91.6658 GB/s

Transpose vs Copy

- The code differs very little, yet the performance gap is large
- The first technique to apply is global memory coalescing

Global Memory Coalescing

- Global memory is accessed in 128-byte transactions³
- When the threads of a warp access consecutive 4-byte values, only a single global memory transaction is issued
- When the accesses are not aligned to a cache line, as many transactions as the number of cache lines touched may be issued

³ Details vary with the compute capability

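An illustrative sketch of the two patterns (not from the slides; width is the row stride in elements):

    // Coalesced: consecutive threads read consecutive 4-byte floats,
    // so a warp's reads fall into a single 128-byte transaction.
    float a = idata[blockIdx.x * blockDim.x + threadIdx.x];

    // Strided: consecutive threads read floats 'width' elements apart,
    // so every thread of the warp touches a different cache line.
    float b = idata[(blockIdx.x * blockDim.x + threadIdx.x) * width];
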
Copy/Naive Transpose Memory Access Pattern

(Figure: global memory access patterns of the copy and the naive transpose kernels.)

Copy/Naive Transpose Memory Access Pattern

- In the copy kernel, TILE_DIM is 16, which exactly matches the size of a half warp
- Looking at the global memory accesses, one half warp needs exactly two memory transactions: one read and one write
- In the naive transpose, the writes are not coalesced: every thread of the warp accesses a different cache line
- When writing, a half warp therefore issues 16 memory transactions in total

Coalesced Transpose

    __global__ void transpose2(float *odata, float *idata, int width, int height)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM];

        int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
        int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
        int index_in = xIndex + yIndex * width;

        xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
        yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
        int index_out = xIndex + yIndex * height;

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i * width];
        }

        __syncthreads();

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            odata[index_out + i * height] = tile[threadIdx.x][threadIdx.y + i];
        }
    }

Coalesced Transpose Memory Access Pattern

(Figure: access pattern of the coalesced transpose, staging tiles through shared memory.)

Coalesced Transpose Memory Access Pattern

- Unlike global memory, shared memory does not require coalesced access
- Read from global memory with coalesced accesses and write the data into shared memory
- Then read noncontiguous locations of shared memory and write them to global memory, again with coalesced accesses

Coalesced Transpose Throughput

    Matrix Size   Copy             Naive            Coalesced
    1024²         216.1302 GB/s     70.6491 GB/s    211.9276 GB/s
    2048²         219.5193 GB/s     87.6363 GB/s    211.5603 GB/s
    4096²         211.8688 GB/s     92.5567 GB/s    196.7131 GB/s
    8192²         219.6724 GB/s    105.7644 GB/s    123.0030 GB/s
    16384²        221.8129 GB/s     91.6658 GB/s    110.4511 GB/s

Coalesced Transpose Memory Access Pattern

- For moderately small matrices, performance matches the copy⁴
- A few hypotheses about the remaining gap:
  - Does __syncthreads() cause the slowdown?
  - Do shared memory bank conflicts cause the slowdown?

⁴ For a fair comparison we should measure against a copy that also goes through shared memory; a sketch of such a copy follows

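A sketch of the copy suggested in footnote 4 (the kernel name and details are assumed here, mirroring the structure of transpose2): it stages each tile through shared memory and writes it back untransposed, isolating the cost of the tile and the barrier.

    __global__ void copySharedMem(float *odata, float *idata, int width, int height)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM];

        int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
        int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
        int index = xIndex + yIndex * width;

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            tile[threadIdx.y + i][threadIdx.x] = idata[index + i * width];
        }

        __syncthreads();

        // Same locations, same order: coalesced and conflict-free.
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            odata[index + i * width] = tile[threadIdx.y + i][threadIdx.x];
        }
    }
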
Shared Memory Access Pattern

- Shared memory consists of 32 banks of 32-bit words
- In principle, threads cannot access the same bank at the same time
- However, when they access the very same word, no bank conflict occurs

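A worked sketch of why the tile access in transpose2 conflicts (assuming 32 banks; on hardware with 16 banks the counts differ but the effect is the same):

    // Bank of a 4-byte shared-memory word, assuming 32 banks.
    __device__ int bankOf(int wordIndex) { return wordIndex % 32; }

    // In the second loop of transpose2, thread x of a half warp reads
    // tile[x][c], i.e. word index 16*x + c in a 16x16 tile:
    //   x = 0 -> bank c, x = 1 -> bank 16+c, x = 2 -> bank c, ...
    // so 16 threads land on only 2 banks: an 8-way bank conflict.
    //
    // With tile[16][17], the word index becomes 17*x + c; 17 is coprime
    // to 32, so the 16 threads hit 16 distinct banks.
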
Bank Conflict

(Figures: shared memory bank conflict illustrations.)

No Bank-Conflict Transpose

    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

- Add a column of padding so that the accesses avoid bank conflicts (full kernel sketch below)

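For reference, a sketch of the padded kernel (the name transpose3 is assumed; apart from the tile declaration it is identical to transpose2):

    __global__ void transpose3(float *odata, float *idata, int width, int height)
    {
        // The extra column shifts each tile row by one bank, so the
        // column reads in the second loop hit distinct banks.
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];

        int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
        int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
        int index_in = xIndex + yIndex * width;

        xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
        yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
        int index_out = xIndex + yIndex * height;

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i * width];
        }

        __syncthreads();

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            odata[index_out + i * height] = tile[threadIdx.x][threadIdx.y + i];
        }
    }
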
No Bank-Conflict Transpose Throughput

    Matrix Size   Copy             Coalesced        Padded
    1024²         216.1302 GB/s    211.9276 GB/s    211.1651 GB/s
    2048²         219.5193 GB/s    211.5603 GB/s    218.1229 GB/s
    4096²         211.8688 GB/s    196.7131 GB/s    193.3113 GB/s
    8192²         219.6724 GB/s    123.0030 GB/s    122.8659 GB/s
    16384²        221.8129 GB/s    110.4511 GB/s    110.4508 GB/s

Analysis

- Global memory coalescing matters a great deal
- In this case, removing the bank conflicts makes little difference
- Different hardware may yield different results
- To find out which part is the bottleneck, the code above can be decomposed and the pieces compared
- There is still room for optimization, in the transpose as well as in the copy:
  - Tile size, row size
  - Number of threads and blocks
  - Instruction-level optimization
  - Algorithmic changes

Conclusion

- Predicting performance before building real implementations and comparing various configurations is very hard
- Be patient and experiment with a variety of implementations and settings

Next Topic

- Parallel reduction
- Instruction-level parallelism
- Parallel scan algorithm

References

- Optimizing Matrix Transpose in CUDA, NVIDIA, 2010
- CUDA C Best Practices Guide, NVIDIA
- CUDA C Programming Guide, NVIDIA