1 (, ), ( )
1.
2. (statistical disclosure control)
   - (Risk) and (Utility)
3. (Synthetic Data)
4.
5.
3 1.
+ 4 1. 2.,. 3. K
+ [ ] 5 ' ', " ", " ". (SNS), '. K KT,, KG (PG), 'CSS'(Credit Scoring System)....,,,.
+ 6? 1. ( ): 2. ( ) ( ). ( 2 2 ) 2. : 1. " ", ( ).( 2 1 ) 3. : 6. " " ( ). ( 2 6 )
+ k- 7 (2016.06.09) [ ]
9 / / :. (2013) 2,.
10 ( DB)/ KCB- : DB
3 KCB 1 ( 31 ) KCB KCB DB * DB ( ) ( ) [ : ] 11
12 / - KCB DB/.
3. K-anonymity
1: (disclosure risk)
k-anonymity, l-diversity, t-closeness ( )
k-anonymity, l-diversity, t-closeness: [ (2016.06.09) [ ],.]
: k-anonymity, l-diversity

No   Key 1   Key 2   Sensitive value   k-anonymity   l-diversity
1    1       1       50                3             2
2    1       1       50                3             2
3    1       1       42                3             2
4    1       2       42                1             1
5    2       2       62                2             1
6    2       2       62                2             1
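The k and l columns of the six-record example can be recomputed with a few lines of Python. This is only a sketch: the column name "sensitive value" is an assumed label for the unlabeled fourth column, and `k_and_l` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Records from the example: (No, Key 1, Key 2, sensitive value).
# "sensitive value" is an assumed name for the unlabeled column.
records = [
    (1, 1, 1, 50),
    (2, 1, 1, 50),
    (3, 1, 1, 42),
    (4, 1, 2, 42),
    (5, 2, 2, 62),
    (6, 2, 2, 62),
]

def k_and_l(rows):
    """Return {No: (k, l)}: k is the size of the record's key group,
    l is the number of distinct sensitive values in that group."""
    groups = defaultdict(list)
    for _, key1, key2, sensitive in rows:
        groups[(key1, key2)].append(sensitive)
    result = {}
    for no, key1, key2, _ in rows:
        values = groups[(key1, key2)]
        result[no] = (len(values), len(set(values)))
    return result

per_record = k_and_l(records)
# The data set as a whole satisfies only the smallest k and l over all records.
dataset_k = min(k for k, _ in per_record.values())
dataset_l = min(l for _, l in per_record.values())
```

Record 4 is unique on (Key 1, Key 2), so the table as a whole is only 1-anonymous even though the first three records form a group of size 3.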
+ (general) 17 : : (intruder) k- : =
+ 2: 18
1. (recoding):. 5. / (top-down coding). 100 1000 (top coding)
2. (local suppression): -.
3. (microaggregation): k..
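As a rough illustration of the first method, the sketch below recodes a numeric value into fixed-width intervals and applies top coding. The interval width of 5 and the cap of 1000 are assumptions chosen to echo the numbers on the slide; `recode_interval` and `top_code` are hypothetical helper names, and the input values are illustrative.

```python
def recode_interval(value, width=5):
    """Recoding: replace a number by the interval it falls in, e.g. 67 -> '65-69'."""
    low = (value // width) * width
    return f"{low}-{low + width - 1}"

def top_code(value, cap):
    """Top coding: every value above `cap` is reported as `cap`."""
    return min(value, cap)

ages = [67, 22, 49, 45, 53]                      # illustrative ages
recoded_ages = [recode_interval(a) for a in ages]

amounts = [100, 850, 1000, 12579]                # illustrative amounts
capped = [top_code(v, cap=1000) for v in amounts]
```

Both operations reduce the number of distinct key values, which is what makes larger groups of identical records possible.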
( ) k-anonymity
K-..
Charu C. Aggarwal, "On k-Anonymity and the Curse of Dimensionality", VLDB 2005 [G-citation number 572]
1: k,,
: German Credit Data
: 20, : 1000
: Random Forest
X1: account balance (4 levels), X2: credit history (5 levels), X3: purpose (10 levels), X4: savings account (5 levels)
R sdcMicro local suppression k-anonymity ( ) recoding k-anonymity
k=2, key = X1~X5: k-anonymity with 5 key variables

obs.no  X1            X2                          X3                   X4
1       <0DM          critical account            NA                   no savings account
2       0<=...<200DM  credits paid back till now  radio/television     <100DM
3       no account    critical account            NA                   <100DM
4       <0DM          credits paid back till now  furniture/equipment  <100DM
5       <0DM          delay in paying off         NA                   <100DM
6       no account    credits paid back till now  NA                   no savings account
7       no account    credits paid back till now  NA                   500<=...<1000 DM
8       0<=...<200DM  credits paid back till now  used car             <100DM
9       no account    credits paid back till now  NA                   >=1000DM
10      0<=...<200DM  critical account            new car              <100DM
11      0<=...<200DM  credits paid back till now  new car              <100DM
12      <0DM          credits paid back till now  business             <100DM
13      0<=...<200DM  credits paid back till now  radio/television     <100DM
14      <0DM          critical account            new car              <100DM
15      <0DM          credits paid back till now  new car              <100DM
16      <0DM          credits paid back till now  NA                   100<=...<500DM
17      no account    critical account            radio/television     no savings account
18      <0DM          no credit taken             business             NA
19      0<=...<200DM  credits paid back till now  used car             <100DM
20      no account    credits paid back till now  radio/television     500<=...<1000 DM
k=5, key = X1~X20: k-anonymity with 20 key variables

obs.no  X1            X2                          X3                   X4
1       <0DM          critical account            radio/television     no savings account
2       0<=...<200DM  credits paid back till now  NA                   <100DM
3       NA            critical account            NA                   <100DM
4       <0DM          credits paid back till now  furniture/equipment  <100DM
5       <0DM          NA                          NA                   <100DM
6       no account    credits paid back till now  NA                   no savings account
7       no account    credits paid back till now  NA                   500<=...<1000 DM
8       NA            credits paid back till now  used car             <100DM
9       no account    credits paid back till now  radio/television     NA
10      NA            critical account            NA                   <100DM
11      0<=...<200DM  credits paid back till now  new car              NA
12      <0DM          credits paid back till now  NA                   <100DM
13      0<=...<200DM  credits paid back till now  radio/television     <100DM
14      <0DM          critical account            NA                   NA
15      <0DM          credits paid back till now  new car              <100DM
16      <0DM          credits paid back till now  NA                   NA
17      no account    critical account            radio/television     no savings account
18      NA            NA                          NA                   NA
19      0<=...<200DM  credits paid back till now  NA                   <100DM
20      no account    credits paid back till now  radio/television     500<=...<1000 DM
Summary on missing rates: k-anonymity

Key      k     X1    X2    ( )    (%)
X1-X5    k=2   1     33    177    0.9
X1-X5    k=3   3     67    273    1.5
X1-X5    k=5   3     125   373    2.4
X1-X10   k=2   102   151   754    5.7
X1-X10   k=3   154   239   905    8.1
X1-X10   k=5   232   349   978    11.3
X1-X15   k=2   154   157   825    9.0
X1-X15   k=3   222   224   959    12.4
X1-X15   k=5   333   325   1000   16.8
X1-X20   k=2   183   132   935    15.6
X1-X20   k=3   222   209   1000   21.6
X1-X20   k=5   310   279   1000   27.2
2: k-anonymity, (microaggregation)
R sdcMicro microaggregation
: German credit data (3 ) simulated data
- X1, X2, X3: German credit data (X1: duration, X2: credit amount, X3: age)
- X4, X5, ..., X20: [1,10] uniform
: 1000
k=10, key = X1~X20

         (original)                   10-anonymized
obs. No  Du     CA         Age        Du     CA         Age
1        6.0    1,169.0    67.0       16.4   1,703.4    50.0
2        48.0   5,951.0    22.0       47.3   7,027.0    30.5
3        12.0   2,096.0    49.0       15.7   2,314.4    39.1
4        42.0   7,882.0    45.0       37.2   6,317.5    32.7
5        24.0   4,870.0    53.0       39.0   4,185.2    47.6
6        36.0   9,055.0    35.0       15.7   2,314.4    39.1
7        24.0   2,835.0    53.0       17.2   1,754.2    42.3
8        36.0   6,948.0    35.0       27.0   4,561.5    26.1
9        12.0   3,059.0    61.0       13.0   2,293.3    59.3
10       30.0   5,234.0    28.0       24.3   5,059.7    39.1
11       12.0   1,295.0    25.0       14.1   1,821.5    31.4
12       48.0   4,308.0    24.0       48.6   5,711.0    29.5
13       12.0   1,567.0    22.0       14.7   2,174.4    28.6
14       24.0   1,199.0    60.0       24.0   2,931.5    47.0
15       15.0   1,403.0    28.0       16.8   2,169.0    32.2
16       24.0   1,282.0    32.0       19.4   3,346.7    30.2
17       24.0   2,424.0    53.0       23.4   4,554.7    31.7
18       30.0   8,072.0    25.0       43.2   10,447.9   30.1
19       24.0   12,579.0   44.0       19.2   9,100.5    50.8
20       24.0   3,430.0    31.0       16.6   2,512.8    32.5
10-anonymized: one group of 10 records, before and after microaggregation

         (original)                  10-anonymized
obs.no   Du     CA        Age        Du     CA        Age
3        12.0   2,096.0   49.0       15.7   2,314.4   39.1
6        36.0   9,055.0   35.0       15.7   2,314.4   39.1
148      12.0   682.0     51.0       15.7   2,314.4   39.1
306      6.0    1,543.0   33.0       15.7   2,314.4   39.1
415      24.0   1,381.0   35.0       15.7   2,314.4   39.1
552      6.0    1,750.0   45.0       15.7   2,314.4   39.1
601      7.0    2,329.0   45.0       15.7   2,314.4   39.1
622      18.0   1,530.0   32.0       15.7   2,314.4   39.1
639      12.0   1,493.0   34.0       15.7   2,314.4   39.1
756      24.0   1,285.0   32.0       15.7   2,314.4   39.1
mean     15.7   2,314.4   39.1
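The replacement step for this group can be reproduced directly: every record in the group receives the group mean of each variable. The sketch below covers only that averaging step. How sdcMicro forms the groups of 10 is not shown here, and `microaggregate` is a hypothetical helper name.

```python
# The ten records of the group: (obs.no, duration, credit amount, age).
group = [
    (3, 12.0, 2096.0, 49.0),
    (6, 36.0, 9055.0, 35.0),
    (148, 12.0, 682.0, 51.0),
    (306, 6.0, 1543.0, 33.0),
    (415, 24.0, 1381.0, 35.0),
    (552, 6.0, 1750.0, 45.0),
    (601, 7.0, 2329.0, 45.0),
    (622, 18.0, 1530.0, 32.0),
    (639, 12.0, 1493.0, 34.0),
    (756, 24.0, 1285.0, 32.0),
]

def microaggregate(rows):
    """Replace every variable of every record by the group mean of that variable."""
    n = len(rows)
    n_vars = len(rows[0]) - 1          # first field is the observation number
    means = [sum(r[j + 1] for r in rows) / n for j in range(n_vars)]
    return [(r[0], *means) for r in rows], means

anonymized, means = microaggregate(group)
rounded_means = [round(m, 1) for m in means]
```

After this step the ten records are indistinguishable on duration, credit amount, and age, which is exactly the k=10 condition for those variables.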
( ) k=10 169, 726, 916
4. (synthetic data)
Synthetic Data (synthetic data, ),
- (partially synthetic data)
- (fully synthetic data)
(synthetic): =, =. [www.imbc.com]
-, key- synthesize -
Synthetic Data :
1) (re-identification)..
2) [ ]
3).
:. (. ) -., Synthetic data.
Synthetic data
1.
2. SSB (SIPP Synthetic Beta): Federal Privacy Council, Census Bureau
   Census Bureau Survey of Income and Program Participation (SIPP), Social Security Administration (SSA) / Internal Revenue Service (IRS)
   : https://www.census.gov/programs-surveys/sipp/
3. Census Bureau, /

Synthetic SIPP data
1. Application form,, SIPP (123 )
2. 5
3. account SAS Stata SSB data
4. (SAS Stata code)
Synthetic Data : German Credit Data
Data: German credit data ( = 1000)
- 900: training data
- 100: test data
Synthetic variables: y (credit status: good, bad), 3 (duration, credit amount, age)
Models for synthesis
- f(y ): logistic regression
- f(c1 ), f(c2 ), f(c3 ): linear regression

Synthetic data
1. 900 training synthesis
2. training synthetic data set
3. Synthetic data set y (logistic regression)
4. 3 model original training data Testing (100 )
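To make the linear-regression part of the synthesis concrete, here is a deliberately small sketch: it fits credit amount on duration by least squares and then draws synthetic credit amounts as fitted value plus Gaussian noise with the residual standard deviation. The single predictor, the normal error model, the nine (duration, credit amount) pairs, and the helper names `fit_line` and `synthesize` are assumptions for illustration; the logistic-regression step for y is not covered.

```python
import random
import statistics

# A few (duration, credit amount) pairs from the German credit data, for illustration.
train = [(6, 1169), (48, 5951), (12, 2096), (42, 7882), (36, 9055),
         (24, 2835), (36, 6948), (12, 3059), (30, 5234)]

def fit_line(pairs):
    """Least-squares fit y = intercept + slope * x; also return the residual SD."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    sigma = (sum(e * e for e in residuals) / (len(pairs) - 2)) ** 0.5
    return intercept, slope, sigma

def synthesize(pairs, rng):
    """Keep x as is and replace y by a draw from the fitted model."""
    intercept, slope, sigma = fit_line(pairs)
    return [(x, intercept + slope * x + rng.gauss(0.0, sigma)) for x, _ in pairs]

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
intercept, slope, sigma = fit_line(train)
synthetic = synthesize(train, rng)
```

Repeating the draw with different seeds gives multiple synthetic data sets, which is how an average over 100 synthetic sets could be formed.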
Synthetic data : Original Training Data

obs.no  y     duration  credit.amount  age  account balance  credit history              purpose
1       good  6         1,169          67   <0 DM            critical account            radio/television
2       bad   48        5,951          22   0<=...<200 DM    credits paid back till now  radio/television
3       good  12        2,096          49   no account       critical account            education
4       good  42        7,882          45   <0 DM            credits paid back till now  furniture/equipment
6       good  36        9,055          35   no account       credits paid back till now  education
7       good  24        2,835          53   no account       credits paid back till now  furniture/equipment
8       good  36        6,948          35   0<=...<200 DM    credits paid back till now  used car
9       good  12        3,059          61   no account       credits paid back till now  radio/television
10      bad   30        5,234          28   0<=...<200 DM    critical account            new car
Synthetic Data

obs.no  y     duration  credit.amount  age  account balance  credit history              purpose
1       good  33        2,672          43   <0 DM            critical account            radio/television
2       bad   32        4,751          39   0<=...<200 DM    credits paid back till now  radio/television
3       good  10        839            46   no account       critical account            education
4       good  41        4,753          23   <0 DM            credits paid back till now  furniture/equipment
6       good  24        8,606          31   no account       credits paid back till now  education
7       good  9         2,957          46   no account       credits paid back till now  furniture/equipment
8       good  29        5,420          42   0<=...<200 DM    credits paid back till now  used car
9       good  32        3,685          41   no account       credits paid back till now  radio/television
10      bad   19        3,407          33   0<=...<200 DM    critical account            new car
Synthetic data
Synthetic 100 set, ( ) (confusion matrix)

            original data          synthetic data
model       true bad   true good   true bad        true good
pred. bad   56         15          55.87 (0.418)   14.34 (0.497)
pred. good  13         16          13.13 (0.418)   16.66 (0.497)

model misclassification rate: original data 0.28, synthetic data 0.275
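The two rates reported above follow from the off-diagonal cells of each confusion matrix. The sketch below recomputes them; `misclassification_rate` is a hypothetical helper, and the matrices are entered with rows as predicted class and columns as true class.

```python
def misclassification_rate(matrix):
    """matrix[i][j] = number of test records with predicted class i and true class j."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total

# Rows: pred. bad, pred. good. Columns: true bad, true good.
original = [[56, 15], [13, 16]]
synthetic = [[55.87, 14.34], [13.13, 16.66]]   # averages over the synthetic sets

rate_original = misclassification_rate(original)     # (15 + 13) / 100
rate_synthetic = misclassification_rate(synthetic)   # (14.34 + 13.13) / 100
```

The model trained on synthetic data misclassifies about as often on the 100 test records as the model trained on the original training data.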
44 K- Synthetic Data K- 1. 2. 3.
5. (Differential Privacy)
(Differential Privacy)
1.
2. : Laplace mechanism
3. Local differential privacy
4.
5.
+ 47 : 1. ( ) (Q) 2. R
A,, ( ). : NOISE NOISE. 49
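One common way to add such noise is the Laplace mechanism named in the outline of this section. The following is a minimal sketch for a counting query, assuming a sensitivity of 1 (one person changes the count by at most 1). The true count of 300, the value ε = 0.5, and the function names are illustrative assumptions.

```python
import random

def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two independent
    exponential variables, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Answer a counting query with Laplace noise of scale sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)            # fixed seed so the sketch is repeatable
true_count = 300                   # hypothetical exact answer to the query
answers = [private_count(true_count, epsilon=0.5, rng=rng) for _ in range(20000)]
mean_answer = sum(answers) / len(answers)
```

A smaller ε means a larger noise scale and therefore stronger privacy but less accurate answers. The noise has mean zero, so the average of many noisy answers stays close to the true count, which is also why repeated queries consume privacy budget.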
+ 1: 50
: 2: : 51
Post-processing invariance: if an algorithm is ε-DP, the output of any further algorithm applied to its output (without accessing the data again) is also ε-DP. [K. Chaudhuri, A.D. Sarwate]
Composition: if A1 is ε1-DP and A2 is ε2-DP, then releasing both A1(D) and A2(D) is (ε1 + ε2)-DP. [K. Chaudhuri, A.D. Sarwate]
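For reference, the standard definition and the basic composition bound can be written out as follows. The notation (D and D' for neighbouring data sets, S for a set of outputs) is an assumption, since the original formulas are not preserved here, and the composition bound assumes A1 and A2 use independent randomness.

```latex
% \varepsilon-differential privacy of a randomized algorithm A:
% for all data sets D, D' differing in one record and all output sets S,
\Pr[A(D) \in S] \le e^{\varepsilon} \, \Pr[A(D') \in S]

% Basic composition, with A_1 being \varepsilon_1-DP and A_2 being \varepsilon_2-DP:
\Pr[(A_1(D), A_2(D)) \in S] \le e^{\varepsilon_1 + \varepsilon_2} \, \Pr[(A_1(D'), A_2(D')) \in S]
```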
Dependency to :.,, (classifier),. 55
+ Local Differential Privacy 56 DP( ) central. Epsilon noise. Local DP central noise central.
+ 3: randomized response 57 Randomized response: : / O/X. O. Google Chrome : /.
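A classic coin-flip version of randomized response can be sketched as follows. The specific design (answer truthfully on heads, otherwise let a second fair coin decide) is an assumption and may differ from the variant on the slide. The population of 50,000 respondents with 30% true "O" answers is hypothetical.

```python
import math
import random

def randomized_response(truth, rng):
    """First coin heads: report the truth. Tails: a second fair coin decides the report."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

def estimate_proportion(reports):
    """P(report O) = 0.25 + 0.5 * pi, so pi is estimated by (mean - 0.25) / 0.5."""
    mean = sum(reports) / len(reports)
    return (mean - 0.25) / 0.5

rng = random.Random(7)                                  # fixed seed for repeatability
true_answers = [i < 15000 for i in range(50000)]        # hypothetical: 30% answer "O"
reports = [randomized_response(t, rng) for t in true_answers]
estimate = estimate_proportion(reports)

# Each report is 3 times as likely under the true answer as under the other one,
# so this design satisfies local differential privacy with epsilon = ln 3.
epsilon = math.log(0.75 / 0.25)
```

No single report reveals a respondent's true answer with certainty, yet the aggregate proportion can still be estimated from the noisy reports.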
+ 4: NYC taxicab data set 58 Riding with the stars, from research.neustar.biz
+ 59 2013 NY taxicab dataset : pickup and drop off times, locations, fare and tip amounts de-anonymize, (Driver privacy) privacy (Passenger privacy in the NYC taxicab dataset). : the picture, some information from celebrity gossip blogs
+ 60 Jessica Alba,, ($9), ($0)
+ Differentially Privatized Trip (drop-off) 61 Total Fare: $25 - $30 Tip Amount: $6 - $8
+ NYC taxicab data: local DP 62
+ K- 63 Differentially Privatized Synthetic Data: 1. DB Local DP ε-ldp /
2. LDP. / / ε-ldp / ε-ldp / ε-ldp / 3. DP/local DP procedure 64
6.
66 1.. 2. - - / 3... K- 4.,
67-5. / - /
Thank you