
Ebiz Research Group (Ebiz 연구회), 21 September 2017
정의용, FrankJeong@systrangroupcom

Contents
- SYSTRAN History & Technology
- Natural Language Processing
- Machine Translation History
- MT Technique
- Neural Network
- Neural Machine Translation
- Data
- Landscape

SYSTRAN History & Technology

History

Technology Map

Tech depth (diagram layers): Base technologies, Core technologies, Products, Assets

Diagram labels:
- Strategic Alliance: Harvard, FaceBook, ETRI, CNRS~
- Professional Service: Software Development, Resources Development, Integration, Customization
- MT approaches: RBMT (Rule-Based Machine Translation, Rule), SMT (Statistical Machine Translation, Statistic), Hybrid MT, PNMT (Pure Neural Machine Translation)
- Machine Learning: DNN (Deep Neural Networks), RNN (Recurrent Neural Network), CNN (Convolutional Neural Network)
- Other technologies: ASR (Automatic Speech Recognition), OCR (Optical Character Recognition), LDK 20 (Natural Language Processing Modules), Customization technologies (SPE & other), Satellite Technologies
- Offerings and resources: Enterprise PN9, SYSTRANIO, Enterprise V8, Desktop, Desktop II, Mobile, Embedded, Links (Web), Training Server, Connectors, Corpus, Language Resources, Trained Models
- Oracle, Sales Force, Adobe, K->Cura

Natural Language Processing

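Use case & Solution (NLP)

NLP big data analysis: text simplification / key keyword extraction / domain analysis / sentiment analysis

Sources: News article, Customer Feedback, Online Information, Website

Questions:
- How can the overall content be grasped easily?
- What sentiment is hidden in it?
- What key content does it contain?
- What domain and key keywords does it have?

Solutions:
- Simplification: automatically summarizes long sentences, condensing them to the key content.
- Sentiment Analysis: analyzes user sentiment through keyword analysis of large volumes of data.
- Named Entity Recognition: automatically recognizes proper nouns such as person names and place names, based on the content of the document.
- Domain Detect: domain classification and key keyword extraction for a specific site.

Find the customer's hidden needs from the content!

Linguistic Development (NLP)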

Named Entity Recognition (NLP)

Domain Detection (NLP)
Diagram label: Contents

Simplification (NLP)

Sentiment Analysis (NLP)
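The use-case slide describes sentiment analysis as keyword analysis over large volumes of text. A minimal sketch of that idea follows, assuming a tiny hand-made keyword lexicon; the word lists and the function names `sentiment_score` and `classify` are hypothetical illustrations, not SYSTRAN's implementation.

```python
# Hypothetical keyword lexicon; a real system would rely on a much larger,
# curated or learned resource.
POSITIVE = {"good", "great", "excellent", "love", "fast"}
NEGATIVE = {"bad", "poor", "slow", "broken", "hate"}


def sentiment_score(text: str) -> int:
    """Return (#positive keywords - #negative keywords) found in the text."""
    # Naive tokenization: split on whitespace, strip punctuation, lowercase.
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    positives = sum(t in POSITIVE for t in tokens)
    negatives = sum(t in NEGATIVE for t in tokens)
    return positives - negatives


def classify(text: str) -> str:
    """Map the keyword score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for review in ["The delivery was fast and the support was great!",
                   "Broken on arrival, poor packaging.",
                   "It arrived on Tuesday."]:
        print(classify(review), "<-", review)
```

Counting keywords ignores negation and context ("not good" still scores as positive), which is one reason production systems move beyond plain lexicons.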

NLU vs NLP vs ASR (NLP)

Classical NLP vs Deep Learning NLP (NLP)

Machine Translation History

Progress of MT (MT History)

SYSTRAN Through Machine Translation History (MT History)

MT Technique
- Rule-Based MT
- Statistical MT
- Hybrid MT
- Customization cycles
- Optimize Translation Quality

Rule-Based MT (MT Technique)

Analysis: sentence and word segmentation, morphological analysis, lexical search, syntax analysis, semantic analysis, pronoun resolution
Transfer: lexicographic transfer, structural transfer
Synthesis: morphological generation, linearization

Resources: source analysis morphology, source-target lexicon, source analysis grammar, source-target transfer rules, target generation morphology

Statistical MT (MT Technique)
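The three stages on the Rule-Based MT slide (analysis, transfer, synthesis) can be illustrated with a toy pipeline. The sketch below is an assumption-laden miniature: the English-to-Spanish lexicon, the single plural rule, and the single adjective-noun reordering rule are all hypothetical and far simpler than a real rule-based engine.

```python
# Toy English->Spanish lexicon (hypothetical):
# source lemma -> (target lemma, part of speech).
LEXICON = {
    "red": ("rojo", "ADJ"),
    "small": ("pequeño", "ADJ"),
    "car": ("coche", "NOUN"),
    "book": ("libro", "NOUN"),
}


def analyze(sentence):
    """Analysis: segment into words, then reduce each word to lemma + number."""
    analyzed = []
    for word in sentence.lower().split():
        if word not in LEXICON and word.endswith("s") and word[:-1] in LEXICON:
            analyzed.append({"lemma": word[:-1], "plural": True})
        else:
            analyzed.append({"lemma": word, "plural": False})
    return analyzed


def transfer(analyzed):
    """Transfer: lexical lookup, then move each adjective after its noun."""
    units = []
    for item in analyzed:
        target, pos = LEXICON[item["lemma"]]  # KeyError for unknown words
        units.append({"lemma": target, "pos": pos, "plural": item["plural"]})
    i = 0
    while i < len(units) - 1:
        if units[i]["pos"] == "ADJ" and units[i + 1]["pos"] == "NOUN":
            units[i]["plural"] = units[i + 1]["plural"]  # number agreement
            units[i], units[i + 1] = units[i + 1], units[i]
            i += 2
        else:
            i += 1
    return units


def synthesize(units):
    """Synthesis: generate surface forms, then linearize them into a string."""
    return " ".join(u["lemma"] + ("s" if u["plural"] else "") for u in units)


def translate(sentence):
    return synthesize(transfer(analyze(sentence)))


if __name__ == "__main__":
    print(translate("red cars"))
    print(translate("small book"))
```

Each stage consumes only the output of the previous one, which is what lets a rule-based system keep separate resources for source analysis, source-target transfer, and target generation.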

Hybrid MT (MT Technique)

SYSTRAN Hybrid Engine: rules-based linguistic processing combined with corpus-based statistical processing

5 Types of Custom Resources:
- Monolingual Normalization Dictionaries
- Bilingual User Dictionaries
- Translation Memories
- Bilingual Translation Models
- Monolingual Language Models
Translation Profiles

Linguistic Customization Benefits: accuracy, predictability, consistency
Statistical Customization Benefits: translation fluency, ambiguity resolution, style

Customization cycles (MT Technique)

2 update cycles:
- Rules-based: manual updates applied in real time with SYSTRAN Expert Tools. Short cycle: a daily task, or several times a week as needed.
- Statistic/Hybrid: automated process using corpus updates with SYSTRAN Training Server. Long cycle: done once or twice a year, as needed.

Diagram labels:
- SYSTRAN Training Server: Corpus Manager, Training Manager
- Statistical Resources (models)
- Linguistic Resources (dictionaries) (translation memories)
- SYSTRAN Translation Server: Online Tools, SYSTRAN API
- User Tools: SYSTRAN Translator, Plugins
- SYSTRAN Expert Tools
- Source documents, Translated documents
- Translation memories & Training corpus
- Translation memories update

Optimize Translation Quality (MT Technique)

1. Increase user adoption: better quality results in more users.
2. ROI for translation projects: high translation quality reduces the post-editing effort.

Diagram axis: higher translation quality

Diagram labels:
- Machine Translation Automation, Manual Customization Services, Translation Services
- Training, Translation Profiles, 20 Specialized Dictionaries, User Dictionaries, Normalization Dictionaries, Translation Memories, Advanced Coding
- Source Language Models, Target Language Models, Bilingual Translation Models
- Post-editing

Neural Network

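Neuron & Network (NN)
About 100 billion (neurons)

Training (NN)
Diagram: the forward pass runs from Input through the weights w_ij to Output; the Output is compared with the Reference to give the Error Rate, which drives the backward pass.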

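Calculation Example (NN)

Three-layer network: input units i = 0, ..., N-1; hidden units j = 0, ..., L-1 with weights W_ji and thresholds θ_j; output units k = 0, ..., M-1 with weights W_kj and thresholds θ_k.

Input pattern and target for pattern p:
\[ X_p = (X_{p0}, X_{p1}, \ldots, X_{p,N-1})^t, \qquad D_p = (d_{p0}, d_{p1}, \ldots, d_{p,M-1})^t \]

Forward pass:
\[ net_{pj} = \sum_{i=0}^{N-1} W_{ji} X_{pi} + \theta_j, \qquad O_{pj} = f_j(net_{pj}) \]
\[ net_{pk} = \sum_{j=0}^{L-1} W_{kj} O_{pj} + \theta_k, \qquad O_{pk} = f_k(net_{pk}) \]

Activation function:
\[ f(x) = \frac{1}{1 + e^{-x}} \]

Error:
\[ E_p = \frac{1}{2} \sum_{k=0}^{M-1} (d_{pk} - O_{pk})^2, \qquad E = E + E_p \]

Backward pass:
\[ \delta_{pk} = (d_{pk} - O_{pk}) \, f_k'(net_{pk}) = (d_{pk} - O_{pk}) \, O_{pk} (1 - O_{pk}) \]
\[ \delta_{pj} = f_j'(net_{pj}) \sum_{k=0}^{M-1} \delta_{pk} W_{kj} = O_{pj} (1 - O_{pj}) \sum_{k=0}^{M-1} \delta_{pk} W_{kj} \]

Weight and threshold updates:
\[ W_{kj}(t+1) = W_{kj}(t) + \eta \, \delta_{pk} O_{pj}, \qquad \theta_k(t+1) = \theta_k(t) + \beta \, \delta_{pk} \]
\[ W_{ji}(t+1) = W_{ji}(t) + \eta \, \delta_{pj} X_{pi}, \qquad \theta_j(t+1) = \theta_j(t) + \beta \, \delta_{pj} \]

Example (NN)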
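As a worked illustration of the forward and backward passes on the Calculation Example slide, the sketch below implements one training step of a three-layer sigmoid network in plain Python. The network sizes, the random initialization range, the learning rates, and the helper names `init_network`, `forward`, and `train_step` are assumptions made for this example; the threshold is taken to enter the net input with a plus sign.

```python
import math
import random


def sigmoid(x):
    """f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_in, n_hidden, n_out, rng):
    """Small random weights and thresholds (assumed initialization)."""
    def rand():
        return rng.uniform(-0.5, 0.5)
    W_ji = [[rand() for _ in range(n_in)] for _ in range(n_hidden)]
    th_j = [rand() for _ in range(n_hidden)]
    W_kj = [[rand() for _ in range(n_hidden)] for _ in range(n_out)]
    th_k = [rand() for _ in range(n_out)]
    return W_ji, th_j, W_kj, th_k


def forward(x, W_ji, th_j, W_kj, th_k):
    """Forward pass: hidden outputs O_pj, then network outputs O_pk."""
    o_j = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + th)
           for row, th in zip(W_ji, th_j)]
    o_k = [sigmoid(sum(w * oj for w, oj in zip(row, o_j)) + th)
           for row, th in zip(W_kj, th_k)]
    return o_j, o_k


def train_step(x, d, W_ji, th_j, W_kj, th_k, eta=0.5, beta=0.5):
    """One backpropagation update for pattern (x, d); returns E_p before it."""
    o_j, o_k = forward(x, W_ji, th_j, W_kj, th_k)
    # Output deltas: (d_pk - O_pk) * O_pk * (1 - O_pk)
    delta_k = [(dk - ok) * ok * (1.0 - ok) for dk, ok in zip(d, o_k)]
    # Hidden deltas use the output weights from before they are updated.
    delta_j = [oj * (1.0 - oj) *
               sum(delta_k[k] * W_kj[k][j] for k in range(len(delta_k)))
               for j, oj in enumerate(o_j)]
    for k, dk in enumerate(delta_k):
        for j, oj in enumerate(o_j):
            W_kj[k][j] += eta * dk * oj
        th_k[k] += beta * dk
    for j, dj in enumerate(delta_j):
        for i, xi in enumerate(x):
            W_ji[j][i] += eta * dj * xi
        th_j[j] += beta * dj
    return 0.5 * sum((dk - ok) ** 2 for dk, ok in zip(d, o_k))


if __name__ == "__main__":
    net = init_network(2, 3, 1, random.Random(0))
    for step in range(201):
        error = train_step([1.0, 0.0], [1.0], *net)
        if step % 50 == 0:
            print(step, error)
```

The lists are updated in place, so repeated calls to `train_step` on the same `net` tuple keep refining the same weights.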

Neural Machine Translation

NMT Training (NMT)
Source sentence: This is then processed into fuel that can fly airplanes
Training: Encoder -> [Z1, Z2, Z3, ..., Zn] -> Decoder
Target sentence: 이것은 비행기를 조종할 수 있는 연료로 처리된다

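NMT advantage (NMT)
- Superior translation quality: when engines are trained on the same amount of data (corpus), NMT achieves far better translation quality than conventional RBMT and SMT.
- Fluent translated sentences: because NMT learns sentence by sentence rather than word by word, as earlier engine training methods did, its output reads quite fluently, much like a human translation.
- Focused training on a specific domain: starting from a base engine, focused training is possible with a small amount of domain-specific data (corpus).

NMT translation process (NMT)
Input: How are you? <eos>
Pipeline: Encoder -> Attention -> Decoder
Output: 어떻게 지내요? <eos>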
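The attention step between encoder and decoder in the translation-process slide can be sketched numerically. The slide does not say which scoring function is used, so the example below assumes simple dot-product attention; the vectors and the names `softmax` and `attend` are hypothetical illustrations, not the PNMT implementation.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Dot-product attention (assumed scoring function).

    Returns the attention weights over the source positions and the
    context vector, i.e. the weighted sum of the encoder states.
    """
    scores = [sum(d * z for d, z in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


if __name__ == "__main__":
    # Made-up 2-dimensional encoder states for: How / are / you? / <eos>
    encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
    decoder_state = [0.0, 2.0]  # made-up decoder state for one output step
    weights, context = attend(decoder_state, encoder_states)
    print("weights:", [round(w, 3) for w in weights])
    print("context:", [round(c, 3) for c in context])
```

At each output step the decoder recomputes these weights, so different target words can focus on different source words; plotting the weights per step gives an alignment view of the translation.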

NMT Alignment Visualization (NMT)

NMT Adaptation Model (NMT)

Data Contents

Finding Data Online (Data)
- Web Data
- e-commerce Catalog
- Open Source Data
- Forum and Blogs
- Corporate Website
- Daily news

Data produced worldwide in a one-minute period (Data)
- 3,000 words in newspaper (for about 30,000 newspapers worldwide)
- 570 new websites
- 277,000 tweets
- 500,000 reviews (products, hotels, restaurants)
- 72 hours of new video on YouTube
- 4M searches on Google
- Messenger applications: over 15M messages
- 204M emails
- Unquantified: Corporate Data, Traditional Publishing

Categories: Web Data, Open Source Data, Tweets, User Review, Videos, Online Requests, Messaging, e-mails, Corporate Data, Traditional Publishing

Unreachable data (Data)

Traditional Publishing:
- 334,000 words published per minute
- Novels, essays
- Patents

Internal private data:
- Internal documentation
- Meeting notes, etc.
- Private emails
- Trial recording
- Medical reports

Systems trained on generic open data will never be able to cover the variety of use cases where domain data is not available.

Big data-driven evolving NLP System (Data)

Landscape

AI Trends (https://trends.google.com/trends/)

Evolution of the Translation Technology (Landscape)

Open-Source Competition (Landscape)

Translation Technology (Landscape)
TAUS Translation Technology Landscape Report (September 2016)

