2015-1 14. Hash Map 2015 년 6 월 1 일 교수김영탁 영남대학교공과대학정보통신공학과 (Tel : +82-53-810-2497; Fax : +82-53-810-4742 http://antl.yu.ac.kr/; E-mail : ytkim@yu.ac.kr)
Outline Hashing 이란? 사전 (dictionary), map, table과해싱 Hash map의구현방법 도서관의서적 (book) 들을위한 hash map의구현 ch14-2
해싱이란? 대부분의탐색 (search) 방법들은키값비교 (comparision) 로써탐색하고자하는항목에접근 해싱 (hashing) 키값에대한산술적연산에의해테이블의주소를계산하여항목에접근 해시테이블 (hash table) 키값의연산에의해직접접근이가능한구조 해싱은물건을정리하는것과같다 ch14-3
추상자료형사전구조 사전구조 (dictionary) 맵 (map) 또는테이블 (table) 로불리움 탐색키와관련된값의 2 가지필드로구성 영어단어나사람의이름같은탐색키 (search key) 단어의정의나주소또는전화번호같은탐색키와관련된값 (value) 객체 : 일련의 (key,value) 쌍의집합 연산 : add(key, value) ::= (key,value) 를사전에추가한다. delete(key) ::= key에해당되는 (key,value) 를찾아서삭제한다. 관련된 value를반환한다. 만약탐색이실패하면 NULL를반환한다. search(key) ::= key에해당되는 value를찾아서반환한다. 만약탐색이실패하면 NULL를반환한다. ch14-4
해싱의구조 해시함수 (hash function) 탐색키를입력받아해시코드 (hash code) 및해시주소생성 이해시주소가배열로구현된해시테이블 (hash table) 의인덱스 ch14-5
해시테이블의구조 해시테이블 ht M 개의버켓 (bucket) 으로구성된테이블 ht[0], ht[1],...,ht[m-1] 의원소를가짐 각버켓에 s 개의슬롯 (slot) 가능 충돌 (collision) 서로다른두개의탐색키 k1 과 k2 에대하여 h(k1) = h(k2) 인경우 오버플로우 (overflow) 충돌이버켓에할당된슬롯수보다많이발생하는것 오버플로우해결방법반드시필요 ch14-6
이상적인해싱 학생정보를해싱으로저장, 탐색해보자 5자리학번중에앞 2자리가학과번호, 뒤 3자리가각학과의학생번호 : h(k) = k % 1000; 같은학과학생들만가정하면뒤의 3자리만사용해서탐색가능 학번이 00023이라면이학생의인적사항은해시테이블 ht[23] 에저장 만약해시테이블이 1000개의공간을가지고있다면탐색시간이 O(1) 이되므로이상적임 ch14-7
실제의해싱 실제로는해시테이블의크기가제한되므로, 존재가능한모든키에대해저장공간을할당할수없음 h(k)= k mod M 의예에서보듯이필연적으로충돌과오버플로우발생함 M = 31 23 % 31 = 23 54 % 31 = 23 ch14-8
실제의해싱 (cont.) 알파벳문자열키의해시함수가키의첫번째문자의순서라고하자 h(keystr) = keystr[0] a ; 입력데이터 : array, binary, bubble, file, digit, direct, zero, bucket h("array")=0; h("binary")=1; h("bubble")=1; h( file ) = 5; h( digit ) = 3; h( direct ) = 3; h( zero ) = 25; h("bucket")=1; ch14-9
Bucket Array 기반의 Hash Map 구현 Bucket Array a bucket array for a hash table is an array A of size N, where each cell of A is thought of as a bucket (collection of keyvalue pairs), and the integer N defines the capacity of the array ch14-10
Hash 함수 (1) A hash function is usually specified as the composition of two functions: (i) Hash code: h 1 :keys integers (ii) Compression function: h 2 : integers [0, N 1] The hash code is applied first, and the compression function is applied next on the result, i.e., h(x) = h 2 (h 1 (x)) The goal of the hash function is to disperse the keys in an apparently random way ch14-11
Hash 함수 (2) 좋은해시함수의조건 충돌이적어야한다 해시함수값이해시테이블의주소영역내에서고르게분포되어야한다 계산이빨라야한다 ch14-12
제산 (modular) 함수 Hash 함수 (3) h(k)=k mod M 해시테이블의크기 M은소수 (prime number) ( 예 : 11, 101) 선택 폴딩함수 이동폴딩 (shift folding) 과경계폴딩 (boundary folding) ch14-13
Hash 함수 (4) 중간제곱함수 탐색키를제곱한다음, 중간의몇비트를취해서해시주소생성 비트추출함수 탐색키를이진수로간주하여임의의위치의 k 개의비트를해시주소로사용 숫자분석방법 키중에서편중되지않는수들을해시테이블의크기에적합하게조합하여사용 ch14-14
Hash 함수 (5) Cyclic Shift Hash Code int hashcode (const char* p, int len) unsigned int h = 0; for (int i=0; i< len; i++) h = (h << 5) (h >> 27); h += (unsigned int) p[i]; return h Experimental results shift comparisons of the number of collisions for various shift amounts for 25,000 English words collisions collisions shift total max total max ch14-15
압축함수 (Compress Function) Division h (k) k mod N The size N of the hash table is usually chosen to be a prime The reason has to do with number theory and is beyond the scope of this course Multiply, Add and Divide (MAD) h (k) ak b mod N a and b are nonnegative integers such that a mod N 0 Otherwise, every integer would map to the same value b ch14-16
충돌해결책 충돌 (collision) 서로다른탐색키를갖는항목들이같은해시주소를가지는현상 충돌이발생하면해시테이블에항목저장불가능 충돌을효과적으로해결하는방법반드시필요 충돌해결책 선형조사법 : 충돌이일어난항목을해시테이블의다른위치에저장 체이닝 : 각버켓에삽입과삭제가용이한연결리스트할당 ch14-17
선형조사법 (linear probing) 충돌이 ht[k] 에서발생했다면, ht[k+1] 이비어있는지조사 만약비어있지않다면 ht[k+2] 조사 비어있는공간이나올때까지계속조사 테이블의끝에도달하게되면다시테이블의처음부터조사 조사를시작했던곳으로다시되돌아오게되면테이블이가득찬것임 조사되는위치 : h(k), h(k)+1, h(k)+2, 군집화 (clustering) 과결합 (Coalescing) 문제발생 ch14-18
선형조사법 (linear probing) ( 예 ) h(k)=k mod 7 1 단계 (8) : h(8) = 8 mod 7 = 1( 저장 ) 2 단계 (1) : h(1) = 1 mod 7 = 1( 충돌발생 ) (h(1)+1) mod 7 = 2( 저장 ) 3 단계 (9) : h(9) = 9 mod 7 = 2( 충돌발생 ) (h(9)+1) mod 7 = 3( 저장 ) 4 단계 (6) : h(6) = 6 mod 7 = 6( 저장 ) 5 단계 (13) : h(13) = 13 mod 7 = 6( 충돌발생 ) (h(13)+1) mod 7 = 0( 저장 ) 1단계 2단계 3단계 4단계 5단계 [0] 13 [1] 8 8 8 8 8 [2] 1 1 1 1 [3] 9 9 9 [4] [5] [6] 6 6 ch14-19
Chaining 을사용한충돌해결 Collisions occur when different elements are mapped to the same cell Separate Chaining: let each cell in the table point to a linked list of entries that map there Separate chaining is simple, but requires additional memory outside the table ch14-20
체이닝 (chaining) 오버플로우문제를연결리스트 (linked list) 로해결 각버켓에고정된슬롯이할당되어있지않음 각버켓에, 삽입과삭제가용이한연결리스트할당 버켓내에서는연결리스트순차탐색 ( 예 ) 크기가 7 인해시테이블에서 h(k)=k mod 7의해시함수사용 입력 (8, 1, 9, 6, 13) 적용 ch14-21
체이닝 (chaining) 1 단계 (8) : h(8) = 8 mod 7 = 1( 저장 ) 2 단계 (1) : h(1) = 1 mod 7 = 1( 충돌발생 -> 새로운노드생성저장 ) 3 단계 (9) : h(9) = 9 mod 7 = 2( 저장 ) 4 단계 (6) : h(6) = 6 mod 7 = 6( 저장 ) 5 단계 (13) : h(13) = 13 mod 7 = 6( 충돌발생 -> 새로운노드생성저장 ) ch14-22
해싱과다른탐색방법의비교 ch14-23
Re-hashing Resizing if the load factor of a hash table goes significantly above a specified threshold, then it is common to resize the hash table with the new array s size be at least double the previous size once we have allocated a new bucket array, we must define a new hash function to go with it given this new hash function, we then reinsert every item from the old array into the new array using this new hash function ch14-24
도서관의서적 (Book) 들을위한 Hash Map 구현 /* Book.h */ #ifndef BOOK_H #define BOOK_H typedef struct Date int year; int month; int day; Date; typedef struct Book int isbn; char title[16]; char author_name[16]; double price; char publisher[16]; Date publication_date; Book; Book *genbook(int isbn); void printbook(book *pbk); #endif /* Book.cpp (1) */ #include <iostream> #include <stdio.h> #include "Book.h" using namespace std; extern void genbigrandarray(int ma[], int range); Book *genbook(int isbn) int name_len; int j; double br; Book *pbk; pbk = (Book *)malloc(sizeof(book)); pbk->isbn = isbn; name_len = rand() % 11 + 5; pbk->title[0] = rand() % 26 + 'A'; for (j = 1; j<name_len; j++) pbk->title[j] = rand() % 26 + 'a'; pbk->title[j] = ' 0'; ch14-25
/* Book.cpp (2) */ br = ((rand() << 15 rand()) % 100001) / 100.0; pbk->price = br; name_len = rand() % 11 + 5; pbk->publisher[0] = rand() % 26 + 'A'; for (j = 1; j<name_len; j++) pbk->publisher[j] = rand() % 26 + 'a'; pbk->publisher[j] = ' 0'; pbk->publication_date.year = rand() % 41 + 1960; pbk->publication_date.month = rand() % 12 + 1; pbk->publication_date.day = rand() % 30 + 1; return pbk; void printbook(book *pbk) printf("book[ isbn: %7d, title: %-15s, author: %-15s", pbk->isbn, pbk->title, pbk->author_name); printf(", price: %7.2f,, Publisher: %-15s", pbk->price, pbk->publisher); printf(", Publication date: %4d-%2d-%2d ]", pbk->publication_date.year, pbk->publication_date.month, pbk->publication_date.day); ch14-26
/* Entry.h */ #ifndef ENTRY_H #define ENTRY_H #include <string> #include "Book.h" using namespace std; #define Key char* typedef struct Entry Key pkey; // key string Book *pbk; Entry; /* Entry.cpp */ #include "Entry.h" #include "Book.h" void printentry(entry *pe) printf("[ key(title): %-15s, ", pe->pkey); printbook(pe->pbk); printf(" ], "); void printentry(entry *pe); #endif ch14-27
/* Bucket.h */ #ifndef BUCKET_H #define BUCKET_H #include "Entry.h" typedef struct ListNode Entry *pe; ListNode* pprev; ListNode* pnext; ListNode; typedef struct Bucket ListNode *pfirst; ListNode *plast; int num_node; Bucket; void initbucket(bucket *pbkt); int insertbucket(bucket *pbkt, Entry *pe); void printbucket(bucket *pbkt); bool searchbucket(bucket *pbkt, Entry *pe); void deletebucket(bucket *pbkt); #endif /* Bucket.cpp (1) */ #include "Bucket.h" void initbucket(bucket *pbkt) pbkt->num_node = 0; pbkt->pfirst = pbkt->plast = NULL; int insertbucket(bucket *pbkt, Entry *pe) ListNode *pnew; ch14-28 pnew = (ListNode *)malloc(sizeof(listnode)); pnew->pnext = pnew->pprev = NULL; pnew->pe = pe; if (pbkt->num_node == 0) pbkt->pfirst = pbkt->plast = pnew; pbkt->num_node++; else pnew->pprev = pbkt->plast; pbkt->plast->pnext = pnew; pbkt->plast = pnew; pnew->pnext = NULL; pbkt->num_node++; return pbkt->num_node;
/* Bucket.cpp (2) */ void printbucket(bucket *pbkt) Entry *pe; int num_entry = pbkt->num_node; ListNode *pln = pbkt->pfirst; if (pln!= NULL) printf(" n"); for (int i = 0; (i < num_entry) && (pln!= NULL); i++) pe = pln->pe; if (pe!= NULL) printf(" "); printentry(pe); printf(" n"); else printf("error in printentry() pe is NULL!! n"); exit; pln = pln->pnext; /* Bucket.cpp (3) */ bool searchbucket(bucket *pbkt, Entry *pe) ListNode *pln; Entry *pcure; int num_entry = pbkt->num_node; pln = pbkt->pfirst; for (int i = 0; (i < num_entry) && (pln!= NULL); i++) pcure = pln->pe; if ((pe!= NULL) && (pcure->pbk->isbn == pe->pbk->isbn)) return true; pln = pln->pnext; return false; ch14-29
/* Bucket.cpp (4) */ void deletebucket(bucket *pbkt) ListNode *pln, *pcurln; int num_entry = pbkt->num_node; pln = pbkt->pfirst; for (int i = 0; (i < num_entry) && (pln!= NULL); i++) pcurln = pln; pln = pln->pnext; free(pcurln->pe); free(pcurln); ch14-30
/* Hash.h */ #ifndef HASH_H #define HASH_H #include "Entry.h" // Key is defined in Entry.h unsigned int cyclicshifthashcode(key keystr); #endif /* Hash.cpp */ #include "Hash.h" #include "Entry.h" unsigned int cyclicshifthashcode(key keystr) unsigned int h = 0; int len = strlen(keystr); for (int i = 0; i < len; i++) h = (h << 5) (h >> 27); h += (unsigned int)keystr[i]; return h; ch14-31
/* HashMap.h */ #ifndef HASHMAP_H #define HASHMAP_H #include <stdio.h> #include "Bucket.h" #include "Entry.h" #include "Hash.h" typedef struct HashMap int bktarraysize; // Size of Bucket Array Bucket *bktarray; // BucketArray[] unsigned int(*hash)(key); HashMap(int bktarrsize); // constructor of HashMap HashMap; void printhashmap(hashmap *phm); void inserthashmap(hashmap *phm, Entry *pe); void searchhashmap(hashmap *phm, Entry *pe); void deletehashmap(hashmap *phm); #endif ch14-32
/* HashMap.cpp (1) */ #include "HashMap.h" #include <stdlib.h> HashMap::HashMap(int bktarrsize) bktarraysize = bktarrsize; bktarray = (Bucket *)malloc(sizeof(bucket)* bktarraysize); for (int i = 0; i < bktarraysize; i++) initbucket(&bktarray[i]); void inserthashmap(hashmap *phm, Entry *pe) unsigned int hashcode; unsigned int bucketindex; Bucket *pbkt; hashcode = phm->hash(pe->pkey); bucketindex = hashcode % phm->bktarraysize; pbkt = &phm->bktarray[bucketindex]; insertbucket(pbkt, pe); ch14-33
/* HashMap.cpp (2) */ void printhashmap(hashmap *phm) Bucket *pbkt; Entry *pe; int bktindex; for (bktindex = 0; bktindex < phm->bktarraysize; bktindex++) pbkt = &phm->bktarray[bktindex]; printf("bucket[%2d]: %2d entries: ", bktindex, pbkt->num_node); printbucket(pbkt); printf(" n"); ch14-34
/* HashMap.cpp (3) */ void searchhashmap(hashmap *phm, Entry *pe) unsigned int hashcode; unsigned int bucketindex; Bucket *pbkt; hashcode = phm->hash(pe->pkey); bucketindex = hashcode % phm->bktarraysize; pbkt = &phm->bktarray[bucketindex]; if (searchbucket(pbkt, pe)) printf("==> found in %2d_th bucket n", bucketindex); else printf("failed to find the entry from %2dd_th bucket n", bucketindex); void deletehashmap(hashmap *phm) Bucket *pbkt; int bktindex; for (bktindex = 0; bktindex < phm->bktarraysize; bktindex++) pbkt = &phm->bktarray[bktindex]; deletebucket(pbkt); ch14-35
/* main.cpp (1) */ #include <stdio.h> #include <stdlib.h> #include "HashMap.h" #include "Hash.h" #include "Star.h" #include "Book.h" #define NUM_BOOK 30 #define BUCKET_ARRAY_SIZE 17 extern void genbigrandarray(int ma[], int range); void main() int *starid; int *bookisbn; Book *books[num_book], *pb; Entry *pe; HashMap hashmap((int)bucket_array_size); hashmap.hash = cyclicshifthashcode; bookisbn = (int *)malloc(sizeof(int)* NUM_BOOK); genbigrandarray(bookisbn, NUM_BOOK); printf("generate and insert %d books into hash map... n", NUM_BOOK); ch14-36
/* main.cpp (2) */ for (int i = 0; i < NUM_BOOK; i++) books[i] = genbook(bookisbn[i]); pe = (Entry *)malloc(sizeof(entry)); pe->pbk= books[i]; pe->pkey = books[i]->title; inserthashmap(&hashmap, pe); printf("hash map status: n"); printhashmap(&hashmap); printf("searching star from the Hash Map... n"); for (int i = 0; i < NUM_BOOK; i++) pb = books[i]; pe = (Entry *)malloc(sizeof(entry)); pe->pbk = books[i]; pe->pkey = books[i]->title; printbook(pb); printf(" : "); searchhashmap(&hashmap, pe); free(pe); deletehashmap(&hashmap); free(bookisbn); ch14-37
실행결과 (1) ch14-38
실행결과 (2) ch14-39
Homework 14 14.1 별 (star) 들을관리하기위한 hash map 구현 (1) Star 들을관리하기위한구조체 Star 는다음과같이정의된다. typedef struct Star char name[star_name_len]; int id; double distance; double luminosity; double mass; double radius; int age; Star; (2) 구조체 Star 관련함수로 genstar() 와 printstar() 함수를작성하라. Star* genstar(int starid); void printstar(star *ps) (3) hash map 을구성하기위한구조체 Entry, ListNode, Bucket, HashMap 을작성하고, 관련함수들을작성하라. hash 함수로는 cyclic shift hash code 를사용할것. (4) 총 90 개의 star 를생성한후, bucket array size 101 인 hash map 에 star 들을추가한후, 각 bucket 에서의충돌발생회수를비교하라. (5) hash map 에포함된각 star 들을 searchhashmap(hashmap *phm, Entry *pe) 함수를사용하여탐색하고, 각 star 가정확하게탐색되는가를확인하라. ch14-40