Microsoft PowerPoint Index Structures.ppt

자료처리 () 26 년봄학기문양세강원대학교컴퓨터과학과 강의 강의내용 Binary Trees (Ch. 6 in 화일구조 ) B-tree, B*-tree, Trie (Ch. 6 in 화일구조 ) B + -tree (Ch. 7 in 화일구조 ) Indexed Sequential File (Ch. 7 in 화일구조 ) Page 2

색인, 인덱스 (Index) 특징 파일의레코드들에대한효율적접근구조 < 레코드키값, 레코드 ( 에대한 ) 포인터 > 쌍 종류 키에따른인덱스분류 기본인덱스 (Primary Index): 기본키 (Primary Key) 를포함하는인덱스 ( 키의순서가레코드의순서를결정지음 ) 보조인덱스 (Secondary Index): 기본인덱스이외의인덱스 ( 키의순서가레코드의순서를의미하지는않음 ) 파일조직에따른인덱스 집중인덱스 (Clustered Index): 데이타레코드의물리적순서가그파일에대한인덱스엔트리순서와동일하게 ( 유사하게 ) 유지되도록구성된인덱스 비집중인덱스 (Unclustered Index): 집중형태가아닌인덱스 데이타범위에따른인덱스분류 밀집인덱스 (Dense Index): 데이터레코드각각에대해하나의인덱스엔트리가만들어진인덱스 희소인덱스 (Sparse Index): 레코드그룹또는데이터블록에대해하나의엔트리가만들어지는인덱스 Page 3 Binary Search Tree (BST) (/2) Binary Tree ( 이진트리 ) 유한한수의노드를가진트리 왼쪽서브트리와오른쪽서브트리로구성 Binary Search Tree (BST, 이원탐색트리 ) 이진트리 각노드 N i : 레코드키 K i 와이키가가지고있는레코드포인터를포함 공백 (empty) 이아닌이원탐색트리의성질 모든노드는상이한키값을갖는다. 루트노드 N i 의왼쪽서브트리 (Left(N i )) 에있는모든노드의키값은루트노드의키값보다작다. (maximum key value of the left sub-tree is less than key value of the root node.) 루트노드 N i 의오른쪽서브트리 (Right(N i )) 에있는모든노드의키값은루트노드의키값보다크다. (minimum key value of the right sub-tree is greater than key value of the root node.) 왼쪽서브트리와오른쪽서브트리는모두이원탐색트리이다. (Both left sub-tree and right sub-tree are also binary search trees.) Page 4 2

Binary Search Tree (BST) (2/2) Binary Tree 와 Binary Search Tree 예제 Binary Trees Binary Search Trees 28 5 3 2 32 7 42 4 3 5 37 44 Page 5 BST 에서의검색 루트 N i 일때, 주어진키값 K 인노드검색과정. 트리가공백 : 검색은노드를찾지못하고실패 2. K = K i : 노드 N i 가원하는 ( 찾고자하는 ) 노드 검색종료 3. K < K i : N i 의왼쪽서브트리를검색 left sub-tree를대상으로 recursive하게검색 4. K > K i : N i 의오른쪽서브트리를검색 right sub-tree를대상으로 recursive하게검색 searchbst(b, s_key) // B = binary search tree, s_key = search key value p B; if (p = null) then // 검색실패 ( 해당키값을가진노드가없음 ) return null; if (p.key = s_key) then // 검색성공 return p; if (p.key < s_key) then // 오른쪽서브트리검색 return searchbst(p.right, s_key); else return searchbst(p.left, s_key); // 왼쪽서브트리검색 end searchbst() Page 6 3

BST 에서의삽입 루트 N i 인 BST 에, 키값 K 인새로운노드를삽입. 트리가공백 (empty): K를루트노드로삽입 2. K = K i : 트리에값이같은키값이이미존재하므로삽입을거부 (unique 보장 ) 3. K < K i : N i 의왼쪽서브트리로이동하여계속탐색 4. K > K i : N i 의오른쪽서브트리로이동하여계속탐색 BST 에삽입예 : 키값 3, 5 인노드를삽입 5 키값 3 인노드를삽입 5 키값 5 인노드를삽입 5 7 7 7 5 5 3 5 3 5 Page 7 BST 에서의삭제 (/2) 삭제될노드가가진자식수에따라다른연산을수행. 자식이없는단말노드 (leaf node) 의삭제 단순히해당단말노드를삭제함 2. 자식이하나인노드의삭제 (left or right child 하나만존재 ) 삭제되는노드자리에자식노드를위치시킴 3. 자식이둘인노드의삭제 - 삭제되는노드자리에왼쪽서브트리에서제일큰키값또는오른쪽서브트리에서제일작은키값으로대체선택 Max(left sub-tree) = key value of the right most node in the left sub-tree Min(right sub-tree) = key value of the left most node in the right sub-tree - 해당서브트리에서대체노드를삭제 Page 8 4

BST 에서의삭제 (2/2) BST 에삭제예 : 키값 4, 2, 5 인노드를삭제 5 3 55 키값 4 삭제 ( 단말노드삭제 ) 5 3 55 키값 2 삭제 ( 자식이하나 ) 5 3 55 2 4 52 6 2 52 6 25 52 6 25 25 키값 5 삭제 ( 자식이둘 ) 3 키값 5 삭제 ( 자식이둘 ) 52 25 55 3 55 키값 3 으로대체 3 = Max(left sub-tree) 52 6 키값 52 으로대체 52 = Min(right sub-tree) 25 6 Page 9 BST 의성능 (/2) 편향된 (Skewed) Binary Search Tree 편향되어있는경우, BST 의탐색시간은최악이됨 N 개의노드인 BST 에서최악의탐색시간 : N 번의노드탐색 Skewed BST 의예 6 25 55 3 52 52 3 55 25 6 Page 5

BST 의성능 (2/2) 성능 트리형태와 ( 개별 ) 노드에대한접근빈도에의존적임 가장자주접근되는노드는루트에가장가깝게유지하는것이유리함 균형트리 (balanced tree) 유지 : 모든노드에대해양쪽서브트리의노드수가가능한동일하게만들어트리의최대경로길이를최소화 BST 의단점 잦은삽입과삭제가발생하는경우, 균형트리를유지하기어려움 ( AVL-tree) 작은분기율 (branching factor) 에따른긴탐색경로와검색시간 ( B-tree) - 분기율이 2이므로, 각노드는많아야두개의서브트리를가짐 - N개의노드를갖는트리의최소높이 : log N 2 + Page AVL 트리 (/2) 높이균형 (height-balanced) BST 삽입, 삭제연산시간이짧음 검색시간 : O(logN) ( Skew가발생치않으므로, worst case = average case가됨 ) AVL 트리 : 고안자 Adelson-Velskii와 Landis 이름의 Initial에서유래 트리전체를재균형시키지않고도 ( 국부적으로재균형시키면서 ) 트리의균형유지 정의 AVL 트리 T: 공백이아닌 Binary (Search) Tree B f (= balance factor) = h(right(n i )) - h(left(n i )), N i T ( 모든노드의왼쪽서브트리와오른쪽서브트리의높이는그차가 이하임 ) - 오른쪽서브트리의높이가큰경우 : B f (N i ) = - 왼쪽서브트리의높이가큰경우 : B f (N i ) = - - 두서브트리의높이가같은경우 : B f (N i ) = 공백서브트리의높이 (B f (empty tree)): - 로정의 Page 2 6

AVL 트리 (2/2) 6 8 8 4 2 8 4 2 4 3 6 6 3 6 6 9 5 5 7 5 6 8 7 2 5 3 9 4 7 6 2 8 4 6 9 6 5 4 8 2 Page 3 AVL 트리에서의검색과삽입 (/2) 검색 일반 BST에서의검색연산과동일 시간복잡도 : O(log 2 N) 삽입 삽입되는위치에서루트로의경로에있는조상노드들의 B f 에영향을줄수있음 즉, 불균형이발생할수있으며, 따라서, 불균형이탐지된가장가까운조상노드의 B f 를 ± 이하로재균형시켜야함 2 2 2 2 8 6 8 6 4 4 4 4 2 6 2 6 신규노드삽입으로인한 AVL 파괴예 Page 4 7

AVL 트리에서의검색과삽입 (2/2) 삽입 ( 계속 ) 불균형으로판명된노드를 x라할때, x의두서브트리높이의차, 즉 B f 가 2가됨 다음네가지경우중하나로인해발생함 LL : x의왼쪽자식 (L) 의왼쪽서브트리 (L) 에삽입 RR : x의오른쪽자식 (R) 의오른쪽서브트리 (R) 에삽입 LR : x의왼쪽자식 (L) 의오른쪽서브트리 (R) 에삽입 RL : x의오른쪽자식의왼쪽서브트리 (L) 에삽입 Page 5 AVL 트리에서회전 (rotation) (/5) 회전이필요한이유 노드의삽입 ( 혹은삭제 ) 로인하여불균형이발생한경우 ( B f >), 트리의재균형을위하여회전이필요함 즉, 불균형이발생한노드 x를중심으로회전을수행하여균형을맞춤 회전의종류 단일회전 (single rotation) LL, RR 등이이에해당하며, 한번의회전만필요함 탐색순서를유지하면서부모와자식원소의위치를교환함으로써재균형이이루어짐 이중회전 (double rotation) LR, RL에서발생하며, 두번의회전을필요로함 Page 6 8

AVL 트리에서회전 (rotation) (2/5) LL Rotation 2 A LL B A B T 3 T T 2 T 3 T T 2 RR Rotation -2 A - B RR A B Page 7 AVL 트리에서회전 (rotation) (3/5) LR Rotation: case (a) - B 2 A C LR(a) B C A LR Rotation: case (b) B 의 left sub-tree 는 C 보다모두작음 - B C 2 A A 의 right sub-tree 는 C 보다모두큼 LR(b) B C A - T T 2 T 3 T 4 T T 2 T 3 T 4 Page 8 9

AVL 트리에서회전 (rotation) (4/5) LR Rotation: case (c) B 의 left sub-tree 는 C 보다모두작음 - B - C A 2 A 의 right sub-tree 는 C 보다모두큼 LR(c) B C A T T 2 T 3 T 4 T T 2 T 3 T4 RL Rotation: case (a) -2 A - B RL(a) A C B C Page 9 AVL 트리에서회전 (rotation) (5/5) RL Rotation: case (b) A 의 left sub-tree 는 C 보다모두작음 A -2 - C B의 right sub-tree는 C보다모두큼 B RL(b) A C B T T 2 T 3 T 4 T T 2 T 3 T4 RL Rotation: case (c) A 의 left sub-tree 는 C 보다모두작음 A -2 C B의 right sub-tree는 B C보다모두큼 RL(c) A C - B Page 2

AVL 트리구성예제 (/4) 키값 (8, 9,, 2,, 5, 3, 6, 4, 7,, 2) 을차례대로삽입하면서 AVL 트리를구성하는예 empty tree - 8 8 9 8 9-2 8-9 RR 9 8 2 9 8 2 Page 2 AVL 트리구성예제 (2/4) 9 2 8 LL 2 2 9 2 8 5 2 9-2 8 5 LR 8-2 9 5 3 8 - - 2 9 5 3 Page 22

AVL 트리구성예제 (3/4) 2-8 - -2 8-2 9 2 9 6 4 RL 5 5-3 6 3 6 4 2 5-3 9 5 4 6 2-8 - 7 3 9-2 5 - LR 4 6 7 5 3 8 - - 2 4 6 9 7 Page 23 AVL 트리구성예제 (4/4) - 5-3 8 - -2 RR 2 4 6 9-7 5 3 8-2 4 6 7 9 2-5 - 3 8 - - 2 4 6-7 9 2 Page 24 2

AVL 트리의높이 AVL 트리에서높이의범위 N개의노드를가진높이 AVL 트리는완전균형이진트리 ) 보다 45% 이상은높아지지않음이증명되었음 log 2 (N+) h.444 log 2 (N+2) -.328 AVL 트리 : 완전균형이진트리 ) O(.4 log N) : O(log N) AVL 트리는부분적인재구성만을수행하기때문에, 탐색시간측면에서는 AVL 트리가더길다. ( 그러나, 훨씬실용적이다.) ) 완전균형이진트리 (complete balanced binary tree): 주어진키값에대해서, ( 검색측면에서 ) 가장이상적으로좌우균형이잡힌트리 Page 25 강의 강의내용 Binary Trees (Ch. 6 in 화일구조 ) B-tree, B*-tree, Trie (Ch. 6 in 화일구조 ) B + -tree (Ch. 7 in 화일구조 ) Indexed Sequential File (Ch. 7 in 화일구조 ) Page 26 3

m-way Search Tree (/3) BST 보다분기율 (branch factor) 을높이면? 즉, m 개서브트리를형성하면? 장점 : 트리의높이가감소 ( 특정노드의탐색시간감소 ) 단점 : 삽입, 삭제시트리의균형을유지하기위해복잡한연산이필요 m-way search tree 의성질 노드구조 = <n, P, K, P, K 2, P 2,, P n-, K n, P n > (n: 키값의수, n m, P i : 서브트리에대한포인터, K i : 키값 ) 키는오름차순으로정렬 : K i < K i+, i n- P i ( i n-) 가가리키는서브트리내의키값 < K i+ P n 이가리키는서브트리노드들의키값 > K n P i 가가리키는서브트리 : m-way search tree Page 27 m-way Search Tree (2/3) 예제 (3-way search tree) 4 Internal Nodes 3 6 2 7 2 ^ ^ 2 ^ ^ 4 5 7 9 5 6 8 22 25 55 Leaf Nodes 실제로, 키값 K i 는 (K i, A i ) 를의미하며, 여기서 A i 는데이타레코드의주소를의미함 Page 28 4

m-way Search Tree (3/3) m-way search tree 의탐색시간 : 탐색경로길이 ( 높이 ) 에비례 각레벨에서는한개의노드만탐색 ( 높이가 h이면 h개노드탐색 ) 분기율 (m) 을크게하면할수록트리의높이가낮아짐 한노드에 m- 개키값을저장하는 m-way search tree 의경우, 높이 h이면 (m h -) 개의키값을저장 ((m-) + m(m-) + m(m(m -)) + = m + m 2 m + m 3 m 2 + = m h ) 예 : 4-way search tree에서높이 3이면, 4 3 -=63개키값저장 n 개의키를가진 m-way search tree 최소높이 h = log m (N+) 최대탐색시간 : O(h) = O(log m (N+)) 예 : m=2이면, BST의탐색시간 (= O(log 2 (N+)) Page 29 B-트리 Bayer & McCreight 가제안 R. Bayer and C. McCreight, Organization and Maintenance of Large Ordered Indexes, Acta Informatica, pp. 73-89, 972. B-tree?: balanced m-way search tree m-way search tree의균형을유지하기위하여효율적인알고리즘을제공 B + -tree와함께가장많이사용되는인덱스방법 차수 m 인 B- 트리의특성 B-트리는공백 (empty) 이거나높이가 이상인 m-way search tree임 루트와단말 (leaf) 을제외한내부노드 (internal node) 는다음특성을가짐 최소 m/2, 최대 m개의서브트리를가짐 ( 절반이상의 Utilization을보장 ) 적어도 m/2 -개의키값( 노드의반이상채워짐 ) 루트노드 : 단말이아니면적어도두개의서브트리를가짐 모든단말노드는같은레벨임 Page 3 5

B-트리의노드구조 (/2) 노드구조 (m-way) <n, p, <K, A >, P, <K 2, A 2 >, P 2,, P n-, <K n, A n >, P n > n = 키값의수 ( n<m), P,, P n = 서브트리에대한포인터, K n = 키값, A i = 키값K i 를가진레코드에대한포인터 한노드안의키값은오름차순으로정렬 ( i n- K i < K i+ ) P i 가가리키는서브트리의모든키값 < K i+ P n 이가리키는서브트리의모든키값 > K n P i ( i n) 가가리키는서브트리 : m-way B- 트리구조를만족함 B- 트리의장점 삽입과삭제가발생한후에도균형상태를유지 저장장치의효율성제공 - 각노드 ( 루트제외 ) 는반이상의 Storage Utilization을제공 - 최악의경우 O(log n (N+)) 의높이및탐색시간 (n = m/2) Page 3 B-트리의노드구조 (2/2) B- 트리의예 (3-way) a 69 ^ root node b 9 43 28 38 c d e f g h i 6 ^ 26 4 6 ^ ^ 32 ^ 45 ^ internal nodes j k l m n o p q r s t u v 7 5 8 2 3 36 42 5 58 62 65 7 2 3 36 4 5 leaf nodes Page 32 6

B-트리에서의연산 검색 : m-way search tree 의검색과같은과정 직접탐색 : 주어진키값을사용하여 tree traverse ( 중간에검색이종료되기도함 ) 순차탐색 : inorder traversal 을수행해야하므로성능이좋지않음 ( B + - 트리 ) 삽입 : 항상단말노드에삽입 단말노드에여유공간이있는경우 : 단순히순서에맞도록단말노드내에삽입 여유공간이없는경우 : overflow, split 발생 - 해당노드를두개의노드로분할해야함 - 새로운키값을포함하여키값들을정렬한후, 중간키값을중심으로큰키들은새로운노드에저장하고, 나머지는기존노드에저장 - 중간키값 : 분할된노드의부모노드로올라가삽입 (recursion) - 그결과, 다시 overflow 발생하면, 위와같은분할 (split) 작업을반복 Page 33 B-트리에서삽입의예 (/4) 앞서예를든 B- 트리에새로운키값 22, 4, 59, 57, 54, 44, 75, 24, 22, 23 을차례로삽입 l 2 l 2 22 n 42 n 4 42 (a) 단말노드 l 에키 22 삽입 (b) 단말노드 n 에키 4 삽입 b b f 6 ^ f 58 6 p o 5 58 p o 5 o 59 (c) 노드 o 에키 59 삽입 Split 발생 Page 34 7

B-트리에서삽입의예 (2/4) o 5 o 5 57 o 5 57 o 5 o 57 (d) 노드 o 에키 57 삽입 (d) 키 54 삽입으로, 노드 o 의분할 (54 는부모노드 f 로이동 ) f 58 6 o o p f 54 ^ o o f 6 ^ o p (e) 노드 f 에키 54 의삽입 (58 은부모노드 b 에삽입 ) Page 35 B-트리에서삽입의예 (3/4) b 9 43 d e f b 9 ^ (g) 노드 b 에키 58 삽입 (43 은부모노드 a 에삽입 ) d e b 58 ^ f f a a 69 ^ 43 69 b c b b (h) 노드 a 에키 43 삽입 c 나머지키값인 33, 75, 24, 22 를차례로삽입 : 문제 (overflow) 가발생하지않음 마지막키값인 23 을삽입 : B- 트리는한레벨증가됨 (simulation 해볼것 ) Page 36 8

B-트리에서삽입의예 (4/4) 반복된삽입에따라서한레벨이증가된 B- 트리 a o 69 a a 43 28 b b c c 9 33 58 2 38 d e e f f g g h i 6 26 4 54 6 23 32 45 j k l m m n o o o p q r r r s t u v 36 7 8 2 4 3 57 59 7 5 22 42 65 75 5 62 22 24 3 36 4 5 Page 37 B-트리에서의삭제 삭제될키값이내부노드에있는경우 이키값의후행 (successor) 키값과교환후단말노드에서삭제 단말노드에서의삭제연산이더간단 후행 (successor) 키값대신선행 (predecessor) 키값을사용할수도있음 최소키값수 ( m/2 -) 보다작은경우 : Underflow 재분배 (redistribution) - 최소키값보다많은키를가진형제노드 (sibling node) 를선택 - 해당노드와선택한노드의키값을재분배하고, 중간키값을부노노드의키값으로이동 - 트리구조를변경시키지않음 합병 (merge) - 재분배가불가능한경우에적용 ( 재분배해도조건인 m/2 - 를만족하지못하는경우 ) - 형제노드와합병을수행하여, 합병결과빈노드는제거 - 트리구조가변경됨 ( 부모노드로삭제가전이되며, recursive하게적용됨 ) Page 38 9

B-트리에서삭제의예 (/3) 앞서예를든 B- 트리에키값 58, 7, 6, 2, 5 를차례로삭제 o 5 58 o 5 j 7 5 j 5 노드 o 에서키값 58 의삭제 노드 j 에서키값 7 의삭제 b b b f 6 f 62 f 62 o 5 p 62 65 o 5 p 6 65 o 5 p 65 노드 f 에서키값 6 의삭제 Page 39 B-트리에서삭제의예 (2/3) b b l 2 e 26 4 m 3 36 n 42 l 26 e 3 4 m 36 n 42 노드 l 에서키값 2 의삭제 Page 4 2

B-트리에서삭제의예 (3/3) j 5 jk 6 d 6 k 8 d 9 l 8 26 b 9 43 l 26 a b 3 43 e f 4 m n 36 a f e 3 4 m n 36 42 42 jk 6 Page 4 d 8 b 9 노드 j 에서키값 5 의삭제 43 l 26 a e 3 4 m 36 f n 42 B*- 트리 (/2) B- 트리의문제점 구조유지를위해추가적인연산필요 ( 잦은분할, 재분배, 합병이발생 ) 삽입 노드의분할, 삭제 노드의합병또는재분배 B*- 트리 : Knuth 가제안한 B- 트리의변형 D. Knuth, The Art of Programming, Vol. 3, Sorting and Searching, Addison-Wesley Publishing Co. Inc., 973. B*-트리 : 공백 (empty) 이거나높이가 이상인 m-way search tree 루트 : 단말이아닌이상최소 2개, 최대 2 (2m-2)/3 + 개의서브트리를갖는다. 루트, 단말제외한 ( 내부 ) 노드 : 적어도 (2m-2)/3 + 개의서브트리, 즉, 최소 (2m- 2)/3 개의키값을갖는다. 모든단말노드는같은레벨에있다. 노드의밀도를높여 (branching factor를크게하여 ), 노드의분할빈도를줄임 각내부노드는 2/3 이상키값으로채워짐 B-트리보다적은수의노드필요 Page 42 2

B*- 트리 (2/2) B*- 트리에서연산의특징 삽입으로인한 Overflow 발생시, - 즉시분할하는대신, 인접한형제노드 (sibling node) 와의재분배를실시 - 재분배가불가한경우 ( 형제노드도 full인경우 ), 두노드와새로운노드의세개노드를사용하여재분배롤실시 분할을지연시키고, 노드가항시 2/3 이상채워지도록보장 삭제로인한 Underflow 발생시, - 기본적으로는 B-트리와유사함 ( 일단재분배이후필요시합병 ) - 합병시, B-트리와다른점은세개의노드를두개의노드로합병하는점임 Page 43 트라이 (Trie) retrieval의약자키를구성하는문자나숫자의순서로키값을표현한구조 m-ary trie 차수 m: 키값을표현하기위해사용하는문자의수, 즉기수 (radix) 예 ) 숫자 : 기수 (~9) 가 이므로 m=, 영문자 : 기수 (a~z) 가 26이므로 m = 26 m-ary trie: m개의포인터를표현하는 차원배열 트라이의높이 = 키필드의길이 ( 예 : 키 = 234 이면, 높이 4) 진트라이의노드구조 2 3 4 5 6 7 8 9 P P 2 P 3 P 4 P 5 P 6 P 7 P 8 P 9 P Page 44 22

높이가 4인 -ary Trie 레벨 r 23456789 P 3 레벨 2 23456789 a 23456789 23456789 23456789 P 4 23456789 23456789 23456789 23456789 23456789 레벨 3 b P 23456789 23456789 23456789 23456789 23456789 23456789 23456789 23456789 레벨 4 c ****** d ******* ******* ******* ********* **** ****** ***** : 널포인터 *: 해당키값을가지고있는데이타레코드의주소 Page 45 m-ary trie 연산 탐색 탐색종료 : 단말노드에서성공적으로종료하거나, 중간에키값이없을때실패로종료 탐색속도 키필드의길이 = 트라이의높이 ( 최대탐색비용 키필드의길이 ) 장점 : 균일한탐색시간 ( 단점 : 저장공간이크게필요 ) 선호하는이유 : 없는키에대한빠른탐색때문 (Sparse한경우에유리 ) 삽입 단말노드에새레코드의주소나마크를삽입 단말노드가없는경우, 새단말노드를생성하고내부노드에연결 노드의첨가나삭제는있으나분할이나병합은없음 삭제 노드와원소들을찾아서널값으로변경 노드의원소값들이모두널 ( 공백노드 ): 노드삭제 Page 46 23

강의 강의내용 Binary Trees (Ch. 6 in 화일구조 ) B-tree, B*-tree, Trie (Ch. 6 in 화일구조 ) B + -tree (Ch. 7 in 화일구조 ) Indexed Sequential File (Ch. 7 in 화일구조 ) Page 47 Basic Concepts Why we use indexing (or hashing)? We need to speed up access to desired data. Terminology (Search) key: attribute or set of attributes used to look up records in a file. An index file consists of records, called index entries, of the form search key pointer to the records Index files are typically much smaller than the original file. Indexing and Hashing (Database System Concepts) 48 24

Which one is the best? No one technique is the best. Evaluation criteria Access time (simple query, range query) Insertion time Deletion time Space overhead Indexing and Hashing (Database System Concepts) 49 Ordered Indexes In an ordered index, index entries are stored sorted on the search key value. Primary index: the index whose search key specifies the placement of the records. Secondary index: the index that does not determine the structure of the file or placement of the records. Index-sequential file: ordered sequential file with a primary index. good performance for sequential access and random access Indexing and Hashing (Database System Concepts) 5 25

Dense Index Files In a dense index, index record appears for every searchkey value. (i.e., an index entry for every search key) search key pointer branch-name account number customer name balance Brighton Brighton 27 Green 75 Downtown Downtown Johnson 5 Mianus Downtown Peterson 6 Mianus 25 Smith 7 Redwood 2 Hayes 4 Round Hill 2 Williams 9 28 Lyle 7 Redwood 222 Lindsay 7 Round Hill 35 Turner 35 To deletion To insertion Indexing and Hashing (Database System Concepts) 5 Sparse Index Files In a sparse index, index record appears for some searchkey values. (i.e., an index entry for multiple search keys) search key pointer branch-name account number customer name balance Brighton Brighton 27 Green 75 Mianus Downtown Johnson 5 Redwood Downtown Peterson 6 Mianus 25 Smith 7 2 Hayes 4 2 Williams 9 28 Lyle 7 Redwood 222 Lindsay 7 Round Hill 35 Turner 35 To deletion To insertion Indexing and Hashing (Database System Concepts) 52 26

Dense Index vs. Sparse Index Dense index faster access more space more maintenance overhead for insertions and deletions Sparse index slower access less space less maintenance overhead for insertions and deletions A good compromise multilevel index Dense index for index entries to data records Sparse index for index entries to index blocks Indexing and Hashing (Database System Concepts) 53 Multilevel Index (/2) Index itself can be large Index does not fit in memory Access becomes expensive Build an index on top of an index Treat primary index kept on disk as a sequential file and construct a sparse index on it. Outer index a sparse index of primary index Inner index the primary index file If even outer index is too large to fit in main memory, yet another level of index can be created, and so on. Indexing and Hashing (Database System Concepts) 54 27

Multilevel Index (2/2) outer index index block index block data block data block inner index To insertion Indexing and Hashing (Database System Concepts) 55 Index Update: Deletion If deleted record was the only record in the file with its particular search key value, the search key is deleted from the index also. Single-level index deletion: Dense index deletion of search key is similar to file record deletion. Sparse index If an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search key value in the file (in search key order). If the next search key value already has an index entry, the entry is deleted instead of being replaced. Multilevel deletion algorithms are simple extensions of the singlelevel algorithms. To dense index files To sparse index files Indexing and Hashing (Database System Concepts) 56 28

Index Update: Insertion Single-level index insertion: Perform a lookup using the search key value appearing in the record to be inserted. Dense index if the search key value does not appear in the index, insert it. Sparse index If index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. If a new block is created, the first search key value appearing in the block is inserted into the index. Multilevel insertion algorithms are simple extensions of the singlelevel algorithms. To dense index files To multilevel indexes Indexing and Hashing (Database System Concepts) 57 Secondary Indexes (/2) In general, only one primary index can be created for one data file. However, there are needs to find all the records whose values in a certain field, which is not the search key of the primary index, satisfy some condition. select * from account where customer-name = Johnson select * from account where balance > 7 search key pointer branch-name account number customer name balance Brighton Brighton 27 Green 75 Downtown Downtown Johnson 5 Mianus Downtown Peterson 6 Mianus 25 Smith 7 Redwood 2 Hayes 4 Round Hill 2 Williams 9 28 Lyle 7 Redwood 222 Lindsay 7 Round Hill 35 Turner 35 Indexing and Hashing (Database System Concepts) 58 29

Secondary Indexes (2/2) In a secondary index, an index record points to a bucket that contains pointers to all the actual records with that particular search key value. balance 35 4 5 6 7 75 9 branch-name Brighton Downtown Downtown Mianus Redwood account number 27 25 2 2 28 222 customer name Green Johnson Peterson Smith Hayes Williams Lyle Lindsay balance 75 5 6 7 4 9 7 7 Round Hill 35 Turner 35 Secondary indexes improve the performance. However, they impose a series overhead on modification of the database. Indexing and Hashing (Database System Concepts) 59 Why B + -Tree? Disadvantage of index-sequential files Performance degrades as file grows, since many overflow blocks get created. Periodic reorganization of entire file is required. As an alternative to index-sequential files, B + -tree automatically reorganizes itself with small, local, changes, in the face of insertions and deletions and does not require of reorganization of entire file to maintain performance. B + -tree index files are used extensively in practical areas. Indexing and Hashing (Database System Concepts) 6 3

B + -Tree Index Files A B+-tree is a rooted tree that takes the form of a balanced tree in which every path from the root to a leaf is of the same length. root node If the root is an internal node, it has at least 2 children. If the root is a leaf, it can have between and (n-) values. internal node leaf node Each node has between n/2 and n children. Each node has between (n-)/2 and n- values. data block (records) The block stores records. Indexing and Hashing (Database System Concepts) 6 B + -Tree Node Structure Typical node of a B + -tree P K P 2 P n- K n- P n K i are the search key values. P i are pointers. In case of internal nodes (non-leaf nodes), they point children nodes. In case of leaf nodes, they point records. The search keys in a node are ordered K < K 2 < K 3 <.... < K n- Indexing and Hashing (Database System Concepts) 62 3

Leaf Nodes in B + -Trees Properties of a leaf node: For i =, 2,, n-, P i points to a file record with search key value K i. If L i, L j are leaf nodes and i < j, L i s search key values are less than L i s search key values. P n points to next leaf node in search key order. leaf node Brighton Downtown next leaf node Brighton 27 account table Green 75 Downtown Johnson 5 Downtown Peterson 6 Indexing and Hashing (Database System Concepts) 63 Internal Nodes in B + -Trees Internal nodes form a multi-level sparse index on the leaf nodes. For an internal node with m pointers: P K P 2 K i- P i K i K m- P m X X X X < K K i- X < K i X K m- Indexing and Hashing (Database System Concepts) 64 32

Example of a B + -Tree (/2) B + -tree for account file (n = 3) Mianus Redwood Brighton Downtown Mianus Redwood Round Hill Brighton Downtown Downtown Mianus Redwood Round Hill 27 25 2 2 28 222 35 Green Johnson Peterson Smith Hayes Williams Lyle Lindsay Turner 75 5 6 7 4 9 7 7 35 Indexing and Hashing (Database System Concepts) 65 Example of a B + -Tree (2/2) B + -tree for account file (n = 5) Brighton Downtown Mianus Redwood Round Hill Leaf nodes must have between 2 and 4 values ( (n-)/2 and n-, with n = 5). Internal nodes other than root must have between 3 and 5 children ( n/2 and n, with n = 5). Indexing and Hashing (Database System Concepts) 66 33

Observations about B + -Trees Searches can be conducted efficiently since the B + -tree contains a relatively small number of levels. Set of leaf nodes forms a simple relation having the search key as an attribute. Indexing and Hashing (Database System Concepts) 67 Queries on B + -Trees (/2) Find all records with a search key value of k. Algorithm Search_B + -tree Start with the root node Examine the node for the smallest search key value > k. If such a value, K i, exists, then follow P i to the child node. Otherwise k K m-, then follow P m to the child node. If the child node is an internal node, repeat the above procedure on the node. Else, i.e., if the child node is a leaf node: If key K i = k, follow pointer P i to the desired record or bucket. Else, no record with search key value k exists. Indexing and Hashing (Database System Concepts) 68 34

Queries on B + -Trees (2/2) A path is traversed from the root to some leaf node. Searches can be conducted efficiently. For K search key values, the path will be about log n/2 (K). The size of a node 4 KB, n (4 bytes per index entry) For million search key values, at most 4 (= log 5,,) nodes are access in a lookup. In case of a binary tree with million search key values, around 2(= log 2,,) nodes are accessed in a lookup. The difference is significant since every node access may need a disk I/O. Indexing and Hashing (Database System Concepts) 69 Updates on B + -Trees: Insertion (/3) Insert a record with a search key value of k. Algorithm Insert_B + -tree Find the leaf node in which k would appear. If k is already there in the leaf node, the record is added to file. If k is not there, then add the record to the file. Then: if there is room in the leaf node, insert (k, pointer) pair into leaf node at appropriate position. if there is no room in the leaf node, split it and insert (k, pointer) pair. Indexing and Hashing (Database System Concepts) 7 35

Updates on B + -Trees: Insertion (2/3) Algorithm Splitting_Node Take the n (key, pointer) pairs in sorted order. Place the first n/2 in the original node, and the rest in a new node. Make an index entry (k, p) for the new node. Insert the entry (k, p) in the parent of the node being split. If the root node is split, then the height of the tree is increased by. Indexing and Hashing (Database System Concepts) 7 Updates on B + -Trees: Insertion (3/3) Mianus Redwood Brighton Downtown Mianus Redwood Round Hill Insert Clearview split of leaf node Brighton Clearview Downtown Downtown Mianus Redwood Brighton Clearview Downtown Mianus Redwood Round Hill Indexing and Hashing (Database System Concepts) 72 36

Updates on B + -Trees: Deletion (/2) Delete a record with a search key value of k. Algorithm Delete_B + -tree Find the record to be deleted, and remove it from the file. Remove (k, pointer) from the leaf node. If the node has too few entries(< (n-)/2 ) due to the removal, and if the entries in the node and a sibling fit into a single node, then merge these nodes into one. If the entries in the node and a sibling do not fit into a single node, then redistribute index entries into these nodes. Indexing and Hashing (Database System Concepts) 73 Updates on B + -Trees: Deletion (2/2) Downtown Mianus Redwood Brighton Clearview Downtown Mianus Redwood Round Hill Delete Downtown merge nodes Brighton Clearview Mianus Redwood Brighton Clearview Mianus Redwood Round Hill Indexing and Hashing (Database System Concepts) 74 37

B-tree Index Files (/2) Similar to B + -tree, but B-tree allows search key values to appear only once; eliminates redundant storage of search keys. Search keys in internal nodes appear nowhere else in the B-tree; an internal node includes an additional pointer field for each search key. Downtown Redwood Brighton Clearview Mianus Round Hill Brighton 27 Green 75 Clearview 225 George 85 Downtown Johnson 5 Downtown Peterson 6 Mianus 25 Smith 7 2 Hayes 4 2 Williams 9 Redwood 222 Lindsay 7 Round Hill 35 Turner 35 Indexing and Hashing (Database System Concepts) 75 B-tree Index Files (2/2) Advantages of B-tree indexes May use less tree nodes than a corresponding B + -tree. Sometimes possible to find search key value before reaching leaf node. Disadvantages of B-tree indexes Only small fraction of all search key values are found early. Since fan-out of internal nodes is reduced, depth is higher than B + -tree.. Insertion and deletion more complicated than in B + -trees. Implementation is harder than B + -trees. The advantages of B-tress are marginal for large indexes. The structural simplicity of a B + -tree is practically preferred. Indexing and Hashing (Database System Concepts) 76 38

강의 강의내용 Binary Trees (Ch. 6 in 화일구조 ) B-tree, B*-tree, Trie (Ch. 6 in 화일구조 ) B + -tree (Ch. 7 in 화일구조 ) Indexed Sequential File (Ch. 7 in 화일구조 ) Page 77 VSAM 파일 (/3) VSAM: Virtual Storage Access Method B + - 트리인덱스구조기법을이용하는대표적인 Index-Sequential File 구성방법 VSAM 파일의구조 인덱스세트 (index set): 순차세트의상위인덱스로서, 여러단계로구성 Internal Nodes in B + -tree 순차세트 (sequence set): 제어구역에대한인덱스저장 Leaf Nodes in B + -tree 제어구간 (control interval): 데이타레코드저장 Data Pages(or Blocks) in B + -tree c.f.) 다음슬라이드의예참조 Page 78 39

VSAM 파일 (2/3) VSAM 파일의예 레벨 3 인덱스세트... 레벨 2... 레벨 순차세트............ 제어구간 2 3 4.... 5 6 7 제어구역 제어구역 2 제어구역 n- 자유공간 Page 79 VSAM 파일 (3/3) 제어구간의정보및구조 데이타블록 : 하나이상레코드의저장공간 추가적인데이타레코드의삽입에대비하여, 자유공간 (free space) 을유지함 데이타블록내에서는, 키값에따라물리적순서로저장 제어정보 (control information): 데이타레코드와자유공간에대한정보 제어구간의구조 레코드 레코드 2 레코드 3 레코드 4 레코드 5 자유공간 레코드 6 자유공간 레코드제어정보 자유공간제어정보 Page 8 4

ISAM 파일 (/4) ISAM: Indexed Sequential Access Method 인덱스, 기본데이타구역 (prime data area) 과오버플로우구역 (overflow area) 으로구성 인덱스 : 마스터인덱스 (master index), 실린더인덱스 (cylinder index), 트랙인덱스 (track index) 로구성 효율적인삽입, 삭제처리를위하여오버플로우구역을별도로유지하는점이특이한점 많은 UNIX 시스템에서, ISAM 파일을기본적으로제공하며, 특히 C-ISAM이란이름으로, C 언어에서사용할수있는 ISAM Interface를제공함 Page 8 ISAM 파일 (2/4) ISAM 파일의예 기본데이터구역은 6개실린더로구성 실린더당 4개트랙 ( 마지막트랙은오버플로우구역 ) 트랙당 ( 최대 ) 5개의레코드수용 트랙당 4% 의자유공간 Page 82 4

ISAM 파일 (3/4) 오버플로우가발생한예 Page 83 ISAM 파일 (4/4) X/Open 규격 (X/Open Company, Ltd., X/Open Portability Guide, Prentice-Hall Inc., Dec. 988.) 에정의된 ISAM 인터페이스 isopen, isstart open a scan isclose close a scan isread fetch a tuple (read a tuple) iswrcurr, iswrite insert a tuple isdelcurr, isdelete delete a tuple isrewcur, isrewrite update a tuple isbuild create a relation isaddindex create a B + -tree index iserase destroy a relation isdelindex drop a B + -tree index isindexinfo, isrename, isunlock, and etc. Page 84 42