Recent Issues in Flash-based DBMSs Apr. 20, 2010 Sang-Won Lee http://icc.skku.ac.kr/~swlee 1
Table of Contents Flash Database Architecture FASTer FTL for OLTP workloads Flash as Extended Buffer Cache A Case for FlashSSD in Database Recovery 2
One FlashSSD beats Ten 15K rpm HDDs But 3
Flash Database Architectures 4
Page-Differential Logging Page-Differential Logging: An Efficient and DBMS-Independent Approach for Storing Data into Flash Memory, SIGMOD 2010 The difference between the old and new versions of a page is very small Sandforce-like approach? Assume page-mapping FTL? Differential = <physical page ID, creation time stamp, [offset, length, changed data]+>. At most one differential per page Physical changes vs. logical changes 5
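The differential format on this slide can be illustrated with a minimal Python sketch. It is not the SIGMOD 2010 implementation: the names `Differential`, `make_differential`, and `apply_differential` are hypothetical, the time stamp is an arbitrary integer, and the byte-by-byte comparison is only one assumed way to find the changed runs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Differential:
    physical_page_id: int
    created_at: int                          # creation time stamp (assumed: any integer clock)
    changes: List[Tuple[int, int, bytes]]    # [offset, length, changed data]+


def make_differential(page_id: int, timestamp: int, old: bytes, new: bytes) -> Differential:
    """Compare two versions of a page and keep only the changed byte runs."""
    if len(old) != len(new):
        raise ValueError("both page versions must have the same size")
    changes = []
    i, n = 0, len(old)
    while i < n:
        if old[i] != new[i]:
            start = i
            while i < n and old[i] != new[i]:
                i += 1
            changes.append((start, i - start, new[start:i]))
        else:
            i += 1
    return Differential(page_id, timestamp, changes)


def apply_differential(base: bytes, diff: Differential) -> bytes:
    """Rebuild the new page version from the base page plus its differential."""
    page = bytearray(base)
    for offset, length, data in diff.changes:
        page[offset:offset + length] = data
    return bytes(page)
```

Reading a page then means fetching the base page and its (at most one) differential and calling `apply_differential`; only the small differential has to be written when the page changes.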
IPL Basics, Beauty and Limitations Transactional extensions: submitted for publication Multi-version concurrency control (SI) and recovery IPL: larger flash page, less efficient 6
SCM Source: FAST 2009 tutorial by Dr. Winfried W. Wilcke 7
IPL + SCM: Opportunities Source: A Hybrid Solid-State Storage Architecture for the Performance, Energy Consumption, and Lifetime Improvement, HPCA 2010 8
Better Performance, Energy, Lifetime 9
Why do SandForce, IPL, and PDL work? In TPC-C, the average size of differentials is around 200B. 200B/4KB ≈ 5% Write amplification, performance, wear leveling 10
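The ratio on this slide can be checked with a few lines of Python. The sketch assumes that "4K" means a 4096-byte page; the variable names are illustrative only.

```python
avg_differential_bytes = 200      # average differential size in TPC-C (from the slide)
page_bytes = 4 * 1024             # assumption: "4K" means a 4096-byte page

ratio = avg_differential_bytes / page_bytes
differentials_per_page = page_bytes // avg_differential_bytes

print(f"{ratio:.1%} of a page")   # prints: 4.9% of a page
print(differentials_per_page)     # prints: 20
```

Under these assumptions, roughly 20 average-sized differentials fit in the space of one full page write.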
SCM Opportunities in DB Implications of Storage Class Memories (SCM) on Software Architectures, HPCA 2010 West Workshop, C. Mohan @ IBM Almaden PCM as disk, paging device, memory, extended memory SCM as log device Should log records be written directly to PCM? Or first to DRAM log buffers and then forced to PCM (rather than disk)? PCM replaces DRAM? Whole DB fits in PCM? No logging?.. SafeRAM @ VLDB 1988 11
SCM as Log Device SQL Buffer Log Buffer pi DB LOG 12
Future SW Architecture for NVRAM?? Need to learn from databases?? E.g., applications, file systems, or the OS should be able to capture the (logical or physical) differentials (or deltas) and then write only the differentials, not the new version itself. File as byte-stream vs. record-oriented page layout Can we model the changes in PPT or Word and save only the changes? What about multi-versioning? Rollback? It is time to rethink the paradigm of overwrite or single version 13
FASTer FTL for OLTP Workloads SNAPI 2010 Joint Work with Lim and Moon 14
Motivation FAST: originally designed for random writes; with a small log space, it aims at high log block utilization and reduced log block thrashing Large-scale SSD: for better performance, it can employ a larger log space FAST criticized in DFTL: with 3% over-provisioning, poor performance and fluctuation Revisit FAST with OLTP workloads High Resp. Time Variation 15
Skewed Write Patterns in OLTP Write pattern of PC and embedded applications: small-scale range; spatial (i.e., sequential write) / temporal (e.g., meta-data) locality How about OLTP applications? Large-scale small random writes (few sequential writes) Write skew: temporal locality Data Set: 8GB TPC-C Mixed Workload 16
Skewed Write Patterns in OLTP Temporal locality in OLTP: write arrival interval per page Hot / cold page 5% 17
FAST and Temporal Locality DFTL [Aayush Gupta, ASPLOS 09] FAST does not provide any special mechanism to handle temporal locality in random streams. With 3% over-provisioning, FAST shows poor performance and high variation Serious Fluctuation 18
FAST and Temporal Locality Log window data invalidation performance & fluctuation of response time Flash Memory Original Data Blocks Log Blocks = Log window <Temporal locality of OLTP Write patterns in FAST> <Merge Cost Estimation in FAST> 19
FASTer FTL for OLTP Workloads FASTer FTL Second chance policy Isolation area No complex processing or meta-information management overhead 20~40% performance improvement over FAST Even beats Greedy (pure page mapping) in some(?) cases More uniform response time 20
Second Chance Policy Give another chance to page in victim block, instead of immediate merge Just copy-back 21
Second Chance Policy(2) Pros: If a warm page is invalidated during its second chance, we can avoid costly merges. Cons: If the copied page is a cold page, we waste the copy time and a precious write buffer resource (reduced effective log block utilization) Pros >> Cons Log Blocks = Log window 22
Second Chance Policy(3) Double the effective size of the log window FASTer can skip numerous merges with the doubled log window Exploit the temporal locality further (1) FAST Log Window (2) Doubled Log Window 23
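The second chance policy can be sketched as a toy simulation in Python. This is a simplified model under stated assumptions, not the FASTer implementation: the log window is a FIFO of log blocks, merge cost is approximated by the number of valid pages merged (real merge cost depends on the data blocks involved), the isolation area is left out, and the names `LogWindow` and `run` are hypothetical.

```python
from collections import deque


class LogWindow:
    """Toy FIFO log window; second_chance=False approximates FAST's immediate merge."""

    def __init__(self, num_blocks, pages_per_block, second_chance):
        assert num_blocks >= 2
        self.num_blocks = num_blocks
        self.pages_per_block = pages_per_block
        self.second_chance = second_chance
        self.blocks = deque([[]])   # oldest log block on the left
        self.latest = {}            # logical page number -> entry holding its newest copy
        self.merged_pages = 0       # valid pages that had to be merged
        self.copy_backs = 0         # valid pages copied back to the head of the log

    def write(self, lpn):
        self._append(lpn, had_chance=False)

    def _append(self, lpn, had_chance):
        # Make room: open a new log block, or reclaim the oldest one when the log is full.
        while len(self.blocks[-1]) == self.pages_per_block:
            if len(self.blocks) == self.num_blocks:
                self._reclaim_oldest()
            else:
                self.blocks.append([])
        entry = [lpn, had_chance]
        self.blocks[-1].append(entry)
        self.latest[lpn] = entry    # any older copy of this page becomes invalid

    def _reclaim_oldest(self):
        victim = self.blocks.popleft()
        for entry in victim:
            lpn, had_chance = entry
            if self.latest.get(lpn) is not entry:
                continue            # already invalidated by a newer write: nothing to do
            if self.second_chance and not had_chance:
                self.copy_backs += 1
                self._append(lpn, had_chance=True)   # give the page one more round
            else:
                self.merged_pages += 1               # costly merge
                del self.latest[lpn]


def run(trace, second_chance, num_blocks=3, pages_per_block=2):
    log = LogWindow(num_blocks, pages_per_block, second_chance)
    for lpn in trace:
        log.write(lpn)
    return log


trace = [1, 2, 2, 3, 3, 4, 4, 1]
fast = run(trace, second_chance=False)
faster = run(trace, second_chance=True)
print(fast.merged_pages, faster.merged_pages, faster.copy_backs)   # prints: 1 0 2
```

In this small trace, page 1 is rewritten shortly after its log block becomes the victim. Without a second chance it is merged; with a second chance it is copied back and then invalidated by the rewrite, so no merge is needed, at the price of two copy-backs.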
Second Chance Policy Fluctuation goes down 24
Isolation Area Isolation area: write buffering for cold dirty pages Merge progressively in the background More uniform response time than FAST 25
Performance Evaluation FASTer w/ 10% > FAST w/ 20% log space W/ same log space, FASTer ~~ Greedy With less address mapping information and SRAM 26
Performance Evaluation(2) FASTer also reduces the average response time and its variation with less provisioning More uniform 27
Performance Evaluation(3) More skewed, better performance 28
Flash(SSD) as Extended Buffer Cache On-going work 29
Flash: Extended Disk vs. Extended Buffer Source: The Five-Minute Rule 20 Years Later, CACM 2009, Graefe Flash as extended disk approach: Flashing up the storage layer, VLDB 2009 30
Flash as Extended Buffer Cache LRU and 2Q Intel MLC SSD (80GB, $250): 30000 random reads, 3000 random 31
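One way to picture flash as an extended buffer cache is a two-tier LRU, sketched below in Python. The slides do not specify the design, so everything here is an assumption: pages evicted from the DRAM buffer are admitted to the flash cache, a flash hit promotes the page back to DRAM, only clean pages are modeled, and 2Q is not covered. `TwoTierCache` and `read_from_disk` are hypothetical names.

```python
from collections import OrderedDict


class TwoTierCache:
    """DRAM LRU buffer backed by a flash LRU cache (read path only, clean pages)."""

    def __init__(self, dram_pages, flash_pages, read_from_disk):
        self.dram = OrderedDict()    # least recently used page first
        self.flash = OrderedDict()
        self.dram_pages = dram_pages
        self.flash_pages = flash_pages
        self.read_from_disk = read_from_disk   # callable: page_id -> page data
        self.stats = {"dram": 0, "flash": 0, "disk": 0}

    def get(self, page_id):
        if page_id in self.dram:
            self.dram.move_to_end(page_id)      # refresh LRU position
            self.stats["dram"] += 1
            return self.dram[page_id]
        if page_id in self.flash:
            data = self.flash.pop(page_id)      # flash hit: promote to DRAM
            self.stats["flash"] += 1
        else:
            data = self.read_from_disk(page_id) # miss in both tiers
            self.stats["disk"] += 1
        self._admit_to_dram(page_id, data)
        return data

    def _admit_to_dram(self, page_id, data):
        self.dram[page_id] = data
        if len(self.dram) > self.dram_pages:
            old_id, old_data = self.dram.popitem(last=False)
            self.flash[old_id] = old_data       # DRAM victim goes to flash
            if len(self.flash) > self.flash_pages:
                self.flash.popitem(last=False)  # flash victim is simply dropped


cache = TwoTierCache(dram_pages=2, flash_pages=2, read_from_disk=lambda p: f"page-{p}")
for page_id in [1, 2, 3, 1]:
    cache.get(page_id)
print(cache.stats)   # prints: {'dram': 0, 'flash': 1, 'disk': 3}
```

The last access to page 1 is served from flash instead of disk, which is the kind of saving an extended buffer is meant to provide.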
Flash as Extended Buffer Cache(2) Benefits: Preliminary results 32
A Case for Flash SSD in Database Recovery On-going work 33
Database Recovery Buffer Cache Data File Redo Log 4 steps Log scan: seq. scan + CPU Read the pages to be redone/undone into buffers: random I/Os Log apply: seq. scan + CPU Write the updated pages to disk: random I/Os Then vs. now 34
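The four steps can be sketched as a toy redo-only recovery in Python. This is an illustration under assumptions, not any particular DBMS: log records are `(lsn, page_id, value)` tuples, each page stores the LSN of its last applied update, undo is omitted, and the function name `recover` is hypothetical.

```python
def recover(redo_log, data_file):
    """Toy redo pass; returns how many random page reads and writes it needed."""
    # Step 1: log scan (sequential read of the redo log).
    records = sorted(redo_log)                       # (lsn, page_id, value) in LSN order

    # Step 2: read every page named in the log into the buffer cache (random reads).
    touched = {page_id for _, page_id, _ in records}
    buffers = {pid: dict(data_file[pid]) for pid in touched}

    # Step 3: log apply; skip records whose effect is already on the page.
    dirty = set()
    for lsn, pid, value in records:
        page = buffers[pid]
        if lsn > page["lsn"]:
            page["value"] = value
            page["lsn"] = lsn
            dirty.add(pid)

    # Step 4: write the updated pages back to the data file (random writes).
    for pid in dirty:
        data_file[pid] = buffers[pid]
    return {"random_reads": len(touched), "random_writes": len(dirty)}


data_file = {1: {"lsn": 5, "value": "a"}, 2: {"lsn": 0, "value": "x"}}
redo_log = [(3, 1, "old"), (6, 1, "b"), (7, 2, "y")]
print(recover(redo_log, data_file))   # prints: {'random_reads': 2, 'random_writes': 2}
```

Steps 2 and 4 are the random I/O steps, which is where a flash SSD would be expected to help most.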
Recovery Performance Single 15K HDD, 8 HDDs vs. SLC SSD 35
Bill Gates 36
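Bill Gates TED SPEECH 2010 P: People S: Services / person E: Energy / service C: CO2 / unit energy P is the number of people. The more we succeed in eradicating poverty, the more this number will grow, because health problems in the Third World will be solved, fewer children will die of disease, and fewer adults will lose their lives to minor illnesses. S is the total amount of services, such as food, clothing, shelter, medical care, and education, that one person receives. The more we succeed in eradicating poverty, the more S will grow as well. E is the energy needed to produce one unit of service. From here on there is good news: thanks to technological progress, there are more and more ways to maintain the same quality of life while using less energy. Hybrid cars that use less oil are a typical example. What Bill Gates really wanted to talk about was C. As shown, the more poverty is eradicated, the more carbon emissions inevitably increase. Some savings are possible in E, but they are limited. His point is that the fundamental solution can only be to produce energy without emitting carbon, and that this follows plainly from the formula above. Bill Gates presents one new idea called TerraPower: nuclear power generation that uses depleted uranium. It is an innovative technology that emits little carbon and can be supplied cheaply. However, he himself says that this is only one idea. He stresses that, to overcome humanity's two great problems of poverty and climate change at the same time, innovation that reduces the carbon emitted in energy production is absolutely necessary, and that ideas like TerraPower must keep coming. (Source: Bill @ TED, www.ted.org, http://goodeconomy.hani.co.kr/blog/archives/788) 37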
Storage Metrics in OLTP In OLTP databases 2009, 1 flash SSD >> 10 15K rpm HDDs 2010, 1 flash SSD >> 20 15K rpm HDDs Storage Metrics = f(Performance (IOPS) × Cost × Energy × Endurance × People × ?????) 38