Introduction to Natural Language Processing
Contents
- Introduction to natural language processing
- History of natural language processing
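Natural Language Processing
- Natural language
  - Defined in contrast to artificial language
    - Artificial language: a language constructed deliberately for a specific purpose (e.g., programming languages); its syntax is much stricter than that of a natural language
  - The set of native languages used by particular communities
    - Korean, English, French, German, Spanish, Japanese, Chinese, etc.
- Natural language processing (NLP)
  - The field that studies how computers can understand and process human language
  - Applications: machine translation, automatic interpretation, information retrieval, question answering, document summarization, spelling correction, etc.
  - Examples: Google, Naver, IBM Watson, Apple Siri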
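IBM Watson
http://www.youtube.com/watch?v=repnuf8i_i0
- Watson is an artificial-intelligence computer system that can answer questions posed in natural language. It was developed in IBM's DeepQA project, led by principal investigator David Ferrucci.
- In 2011, as a test of its abilities, Watson competed on the quiz show Jeopardy!, in the show's only human-versus-computer match to date.
- In three Jeopardy! episodes broadcast from February 14 to 16, Watson played against Brad Rutter, the biggest all-time money winner, and Ken Jennings, holder of the longest championship streak (74 consecutive wins).
- Watson took the first prize of 1 million dollars, while Ken Jennings and Brad Rutter received 300,000 and 200,000 dollars respectively.
Chapter 1: Concepts of Natural Language Processing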
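Stages of Natural Language Analysis
Natural-language sentence → Morphological analysis → Syntactic analysis (Syntax Analysis) → Semantic analysis → Pragmatic analysis → Analysis result
- Morphological analysis: results for "감기는"
  - 감기 (noun: cold) + 는 (particle)
  - 감 (verb stem) + 기 (nominalizing ending) + 는 (particle)
  - 감 (verb stem) + 기는 (ending)
- Syntactic analysis: structural ambiguities
  - "Time flies like light": two or more trees
  - "A man sees a woman with a telescope": two or more trees
  - Example: 나는 사과를 먹었다 ("I ate an apple"), shown on the slide both as a phrase-structure tree (nodes S, NP, VP, N, N, V) and with dependency relations (sub: 나는, obj: 사과를)
- Semantic analysis: 말이 많다: does 말 mean "horse" or "speech"?
- Pragmatic analysis: A 씨는 ... B 씨는 ... 그는 ...: does 그 ("he") refer to A or B?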
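Morphological Diversity
- Agglutinative languages
  - Korean, Japanese, Turkish, etc.
  - Several morphemes combine to form one eojeol (spacing unit)
  - In Turkish, seven morphemes combine on average
- Inflectional languages
  - Latin (English, French, etc. show features of both agglutinative and inflectional languages)
  - The stem itself changes (English example: run, ran, run)
- Swahili
  - The morpheme marking number is attached to the front of the word
  - (e.g.) person: m+tu (singular), wa+tu (plural); tree: m+ti (singular), mi+ti (plural)
- Arabic
  - Consonants form the stem, and vowels express tense, number, etc.
  - (e.g.) ktb (write): katab (active), KUtIb (passive)
  - (e.g.) kttb (cause to write): kattab (active), KUttIb (passive)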
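Syntactic Diversity
- Postfix languages (head-final languages)
  - The verb comes at the end of the sentence
  - Korean, Japanese, etc.
- Infix languages
  - The verb comes in the middle of the sentence
  - English, French, etc.
- Prefix languages
  - The verb comes at the beginning of the sentence
  - Irish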
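Morphological Analysis
- Eojeol: a string delimited by whitespace on both sides, i.e. the unit of spacing in written text
- Word / morpheme: a unit with a single part of speech / the set of index terms registered in the dictionary
- Morphological analysis
  - Analyzes the input string and splits it into morphemes, the minimal units of meaning
  - Uses dictionary information and morpheme-combination information
  - Can be handled with a regular grammar
  - Difficulty depends on the language
    - English, French: easy
    - Korean, Japanese, Arabic, Turkish: hard
- Example: 나는 can be analyzed as 나 ("I") + 는, 날다 ("to fly") + 는, or 나다 ("to come out") + 는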
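Difficulties in Morphological Analysis
- Ambiguity
  - Analyses of "감기는":
    - 감기 (noun: cold) + 는 (particle)
    - 감 (verb stem) + 기 (nominalizing ending) + 는 (particle)
    - 감 (verb stem) + 기는 (ending)
- Handling of prefixes and suffixes
- Handling of proper nouns and words not registered in the dictionary
- Analysis becomes harder when the nouns inside a compound noun are not separated by spaces, as in Korean and German, or when there is no spacing at all, as in Japanese
- Example of Korean morpheme combination (친구에게서였었다라고):
  친구 (noun) + 에게 (particle) + 서 (particle) + 이 (predicative particle, copula) + 었 (past-tense ending) + 었 (retrospective ending) + 다 (final ending) + 라고 (quotative particle)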
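The ambiguity of "감기는" can be reproduced with a very small dictionary-driven analyzer. The sketch below is only an illustration: the five-entry lexicon, the tag names, and the table of allowed tag transitions are assumptions made for this example, not a real Korean morphological dictionary. Because the transitions form a finite-state machine, it also illustrates the earlier remark that morphological analysis can be handled with a regular grammar.

```python
# Toy morpheme dictionary: (surface form, tag). Illustrative entries only.
LEXICON = [
    ("감기", "noun"),
    ("감", "verb-stem"),
    ("기", "nominalizer"),
    ("는", "particle"),
    ("기는", "ending"),
]

# Hypothetical morpheme-combination rules: which tag may follow which.
ALLOWED = {
    ("START", "noun"),
    ("START", "verb-stem"),
    ("noun", "particle"),
    ("verb-stem", "nominalizer"),
    ("nominalizer", "particle"),
    ("verb-stem", "ending"),
}

# An analysis is complete only if it ends in one of these tags.
FINAL = {"particle", "ending"}


def analyze(text, prev="START"):
    """Yield every segmentation of text that the lexicon and rules allow."""
    if not text:
        if prev in FINAL:
            yield []
        return
    for form, tag in LEXICON:
        if text.startswith(form) and (prev, tag) in ALLOWED:
            for rest in analyze(text[len(form):], tag):
                yield [(form, tag)] + rest


analyses = list(analyze("감기는"))
for analysis in analyses:
    print(" + ".join(f"{form}/{tag}" for form, tag in analysis))
```

With this toy lexicon the analyzer returns the three analyses listed on the slide; choosing among them needs context that the analyzer does not have.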
Formal Grammars and Natural Language
Chomsky's classification of formal grammars

Type  Format of productions                     Remarks
0     α → β                                     Unrestricted substitution rules (contracting)
1     αAβ → αγβ (γ ≠ ε), S → ε                  Context-sensitive grammar
2     A → γ, S → ε                              Context-free grammar
3     A → aB, A → a, S → ε (right linear)       Regular grammar
      A → Ba, A → a, S → ε (left linear)

Whether the syntax of natural language can be expressed by a context-free grammar has not been settled.
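Grammar and Parsing
- Grammar: the structural properties of sentences expressed as rules
- Parser: the process that uses a grammar to find the structure of a sentence
- The syntactic structure of a sentence can be represented as a tree: several morphemes group into syntactic constituents (phrases), and the way those constituents combine forms the tree.
- Phrase-structure tree for "John ate the apple":
  [S [NP [N John]] [VP [V ate] [NP [ART the] [N apple]]]]
- Dependency relations for "John ate the apple":
  subject(ate, John), object(ate, apple), det(apple, the)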
Grammars
- Grammar: a set of rewrite rules
  (ex) S → NP VP
       NP → ART N
       NP → N
       VP → V NP
- Context-Free Grammar (CFG): a grammar in which the LHS (left-hand side) of every rule consists of a single symbol
- Grammar rules can be used both to generate sentences (sentence generation) and to analyze them (sentence parsing).
Sentence Generation
(ex) By rewrite rules:
S ⇒ NP VP
  ⇒ N VP
  ⇒ John VP
  ⇒ John V NP
  ⇒ John ate NP
  ⇒ John ate ART N
  ⇒ John ate the N
  ⇒ John ate the apple.
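Generation by rewriting can be sketched in a few lines. The example below uses the four rewrite rules from the Grammars slide; the small word lists per category (`WORDS`) are an assumption added for illustration. Because this grammar has no recursion, exhaustively expanding S terminates and enumerates every sentence the grammar can generate.

```python
from itertools import product

# Rewrite rules from the slide: S -> NP VP, NP -> ART N | N, VP -> V NP.
RULES = {
    "S": [("NP", "VP")],
    "NP": [("ART", "N"), ("N",)],
    "VP": [("V", "NP")],
}

# Hypothetical toy lexicon (not from the slide beyond these four words).
WORDS = {"N": ["John", "apple"], "V": ["ate"], "ART": ["the"]}


def expand(symbol):
    """Return every word sequence derivable from symbol."""
    if symbol in WORDS:
        return [[word] for word in WORDS[symbol]]
    sentences = []
    for rhs in RULES[symbol]:
        # Combine every expansion of each right-hand-side symbol.
        for parts in product(*(expand(s) for s in rhs)):
            sentences.append([word for part in parts for word in part])
    return sentences


sentences = [" ".join(words) for words in expand("S")]
print(len(sentences))
print("John ate the apple" in sentences)
```

The grammar also generates odd strings such as "apple ate the John", which shows that purely syntactic rules overgenerate; a recursive grammar would need a depth limit before it could be enumerated this way.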
Bottom-up Parsing
(ex) John ate the apple.
  N V ART N
  NP V ART N
  NP V NP
  NP VP
  S
Resulting tree: [S [NP [N John]] [VP [V ate] [NP [ART the] [N apple]]]]
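The reduction sequence above can be found mechanically by replacing a rule's right-hand side with its left-hand side until only S remains. The following is a minimal backtracking sketch under the slide's four rules; the function name `reduce_to` and the part-of-speech lookup table are hypothetical, and a practical parser would use a chart rather than plain backtracking.

```python
# Rules as (LHS, RHS) pairs, taken from the Grammars slide.
RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("ART", "N")),
    ("NP", ("N",)),
    ("VP", ("V", "NP")),
]

# Hypothetical part-of-speech lookup for the example sentence.
POS = {"John": "N", "ate": "V", "the": "ART", "apple": "N"}


def reduce_to(symbols, goal="S", failed=None):
    """Return the list of reduction steps from symbols to goal, or None."""
    failed = set() if failed is None else failed
    if symbols == (goal,):
        return [symbols]
    if symbols in failed:
        return None
    for i in range(len(symbols)):
        for lhs, rhs in RULES:
            if symbols[i:i + len(rhs)] == rhs:
                reduced = symbols[:i] + (lhs,) + symbols[i + len(rhs):]
                rest = reduce_to(reduced, goal, failed)
                if rest is not None:
                    return [symbols] + rest
    failed.add(symbols)  # remember dead ends so they are not retried
    return None


tags = tuple(POS[word] for word in "John ate the apple".split())
steps = reduce_to(tags)
for step in steps:
    print(" ".join(step))
```

For this sentence the search visits the same five forms as the slide, from N V ART N down to S.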
Parsing: Structural Ambiguities
- "Time flies like light." can be analyzed with two or more structures
  - Tree 1: [S [NP Time flies] [VP [V like] [NP light]]]
  - Tree 2: [S [NP Time] [VP [V flies] [PP [IN like] [NP light]]]]
  - flies (noun or verb), like (verb or preposition)
- "A man sees a woman with a telescope on the hill.": five or more structures
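One way to make the ambiguity concrete is to count how many trees a grammar assigns to the sentence. The sketch below is an assumption-laden toy: the rules NP → N N, VP → V PP and PP → IN NP and the four-word lexicon are added here just to cover the two readings shown above, and the function `count_parses` is a hypothetical name. It counts derivations over word spans with memoization, in the spirit of chart parsing.

```python
from functools import lru_cache

# Toy grammar covering both readings (an assumption for this example).
RULES = {
    "S": [("NP", "VP")],
    "NP": [("N",), ("N", "N")],
    "VP": [("V", "NP"), ("V", "PP")],
    "PP": [("IN", "NP")],
}

# Each word may carry several part-of-speech tags.
LEXICON = {
    "time": {"N"},
    "flies": {"N", "V"},
    "like": {"V", "IN"},
    "light": {"N"},
}


def count_parses(words, start="S"):
    """Count the distinct parse trees of words rooted in start."""
    words = tuple(word.lower() for word in words)

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        if symbol not in RULES:  # part-of-speech tag: must match one word
            return 1 if j - i == 1 and symbol in LEXICON.get(words[i], ()) else 0
        total = 0
        for rhs in RULES[symbol]:
            if len(rhs) == 1:
                total += count(rhs[0], i, j)
            else:
                left, right = rhs
                for k in range(i + 1, j):  # every split point of the span
                    total += count(left, i, k) * count(right, k, j)
        return total

    return count(start, 0, len(words))


print(count_parses("Time flies like light".split()))
```

Under these assumptions the count is 2, one tree per reading; a broader grammar or lexicon would raise the count, which is why realistic parsers need some way to rank analyses.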
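Semantic Analysis
- Interprets the result of syntactic analysis to determine the meaning of the sentence
- Requires a knowledge-representation technique that can express the meanings of morphemes
- A sentence can be syntactically correct but semantically wrong
  - 돌이 걸어간다 ("A stone walks"; cf. 사람이 걸어간다, "A person walks")
  - 바람이 달린다 ("The wind runs"; cf. 말이 달린다, "A horse runs")
- Ambiguity
  - 말이 많다 (말: horse or speech)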
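Semantic Analysis (cont'd)
- Sentences that are grammatically correct but semantically wrong
  - 사람이 사과를 먹는다. ("A person eats an apple.") (o)
  - 사람이 비행기를 먹는다. ("A person eats an airplane.") (x)
  - 비행기가 사과를 먹는다. ("An airplane eats an apple.") (x)
- Syntactic structure: S → NP VP, with subject N (사람 / 비행기), object N (사과 / 비행기), and V (먹다)
- Semantic constraint:
  [먹다 [agent: an entity that can eat
        object: something that can be eaten
        ...]]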
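The semantic constraint on 먹다 can be checked with a simple lookup of semantic features. The sketch below is illustrative only: the feature names (`can_eat`, `edible`, and so on) and the tiny feature table are assumptions invented for this example, standing in for the knowledge representation that a real system would need.

```python
# Hypothetical semantic features for the three nouns on the slide.
FEATURES = {
    "사람": {"animate", "can_eat"},      # person
    "사과": {"edible"},                  # apple
    "비행기": {"machine"},               # airplane
}

# Case frame for 먹다 (eat): the feature each role must carry.
FRAMES = {
    "먹다": {"agent": "can_eat", "object": "edible"},
}


def is_plausible(verb, agent, obj):
    """Check the verb's semantic constraints on its agent and object."""
    frame = FRAMES[verb]
    return frame["agent"] in FEATURES[agent] and frame["object"] in FEATURES[obj]


print(is_plausible("먹다", "사람", "사과"))    # person eats apple
print(is_plausible("먹다", "사람", "비행기"))  # person eats airplane
print(is_plausible("먹다", "비행기", "사과"))  # airplane eats apple
```

All three sentences share the same syntactic structure, so only the feature check separates the acceptable one from the other two.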
Semantic Role Labeling
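Korean Semantic Role Labeling (SRL)
- Figure labels: dependency parsing, semantic role labeling
- Example sentence: 그는 르노가 3월 말까지 인수 제의 시한을 갖고 있다고 덧붙였다 ("He added that Renault has until the end of March as the deadline for its takeover offer.")
- Predicate identification (PIC)
  - 그는 르노가 3월 말까지 인수 제의 시한을 [갖고]갖.1 있다고 [덧붙였다]덧붙.1
- Argument identification (AIC)
  - 그는 [르노가]ARG0 [3월 말까지]ARGM-TMP 인수 제의 [시한을]ARG1 [갖고]갖.1 [있다고]AUX 덧붙였다
  - [그는]ARG0 르노가 3월 말까지 인수 제의 시한을 갖고 [있다고]ARG1 [덧붙였다]덧붙.1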
Pragmatic Analysis
- Analyzes how a sentence relates to the real world
- Requires a representation of real-world knowledge and common sense
- Covers anaphora, indirect speech acts, etc.
- Anaphora: what a pronoun refers to
  The city councilmen refused the women a permit because
  (1) they feared violence.
  (2) they advocated revolution.
- Speech act: an utterance that asks the listener to perform an action
  Can you give me the salt?
  Would you mind opening the window?
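Korean Coreference Resolution
- Coreference: referring again, with a different expression, to an object already mentioned in the document
- Mention: every noun phrase that is a candidate for coreference resolution (nouns, compound nouns, noun phrases with modifying clauses, etc.)
- Entity: a set of mentions whose coreference has been resolved
- Mention detection example
  [[고양]에서 발생한 용오름]은 [토네이도]와 같은 것으로 [[[지상]의 뜨거운 공기]가 [[상층]의 찬 공기]와 갑자기 섞일 때] 발생합니다. [뜨거운 공기]가 빠르게 상승하고 [찬 공기]는 하강하면서 [[길다란 기둥] 모양의 구름]이 생겨나고 [[그] 안]에서 격렬한 [회오리바람]이 부는 겁니다.
  ("The spout that formed in Goyang is the same kind of thing as a tornado and occurs when hot air near the ground suddenly mixes with cold air aloft. The hot air rises quickly while the cold air sinks, forming a long column-shaped cloud, and a violent whirlwind blows inside it.")
- Entity example
  - [지상의 뜨거운 공기], [뜨거운 공기]
  - [상층의 찬 공기], [찬 공기]
  - [길다란 기둥 모양의 구름], [그]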
Characteristics of Natural Language Processing
- Natural languages are ambiguous
  - Rule → Classification (Maximum Entropy, SVM) → Deep Learning
  - 감기는: 감기 (noun) or 감다 (verb) + 기
  - 말이 많다: 말 = horse or speech?
  - A 씨는 ... B 씨는 ... 그는 ...: does 그 refer to A or B?
- NLP datasets are high dimensional
  - One-hot representation → Continuous representation (Word Embedding)
  - Ex. [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
  - Dimensionality: 50K (PTB), 500K (big vocab), 3M
- Many NLP problems can be viewed as sequence labeling tasks
  - Hidden Markov Model (HMM) → Conditional Random Fields (CRF) → Deep Learning (RNN)
- Many NLP problems can be posed as sequence-to-sequence tasks
  - Rule → Statistical Machine Translation → Neural MT
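The contrast between a one-hot representation and a continuous one can be shown with a toy vocabulary. In the sketch below the five-word vocabulary and the two-dimensional embedding values are made up for illustration; real embeddings are learned from data and have far more dimensions. The point is that a one-hot vector is as long as the vocabulary, while an embedding lookup is equivalent to multiplying that one-hot vector by a small dense table.

```python
# Toy vocabulary (an assumption); real vocabularies have 50K+ entries.
VOCAB = ["감기", "말", "사과", "사람", "비행기"]
INDEX = {word: i for i, word in enumerate(VOCAB)}

# Made-up 2-dimensional embedding table, one row per vocabulary word.
EMBEDDINGS = [
    [0.1, -0.3],
    [0.7, 0.2],
    [-0.5, 0.4],
    [0.0, 0.9],
    [0.6, -0.8],
]


def one_hot(word):
    """Vector of vocabulary size with a single 1 at the word's index."""
    vector = [0] * len(VOCAB)
    vector[INDEX[word]] = 1
    return vector


def embed(word):
    """Dense representation: a direct row lookup."""
    return EMBEDDINGS[INDEX[word]]


def embed_via_one_hot(word):
    """Same result computed as one-hot vector times embedding table."""
    v = one_hot(word)
    dims = len(EMBEDDINGS[0])
    return [sum(v[i] * EMBEDDINGS[i][d] for i in range(len(VOCAB)))
            for d in range(dims)]


print(one_hot("사과"))
print(embed("사과"))
print(embed_via_one_hot("사과"))
```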
Contents
- Introduction to natural language processing
- History of natural language processing
Early History (1)
- First attempts
  - Warren Weaver: proposed machine translation (1949)
  - Idea: Translation is a process of dictionary lookup, plus substitution, plus grammatical reordering.
  - Example: I must go home → Ich muss nach Hause gehen
- Early machine translation research
  - W. Weaver and A. D. Booth: English-French (early 1950s)
  - Georgetown University and IBM: Russian-English (1954)
Early History (2): Lessons from Early Machine Translation
- Translation is really not possible without understanding.
  - Example (English → Russian → English):
    The spirit is willing but the flesh is weak → The vodka is strong but the meat is rotten.
- A great amount of world knowledge was needed; a program had to understand what was being said in order to translate it properly.
  - The pen is in the box.
  - The box is in the pen.
- Syntactic ambiguities
  - They are flying planes.
  - Time flies like an arrow.
  - He saw a man on the hill with a telescope.
- These problems gave a great deal of impetus to work on syntactic theories.
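Early History (3): Information Retrieval
- IBM: in the late 1950s, began research on information retrieval over large collections of research papers
- 1964: MEDLARS, an information retrieval system for medical literature, began service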
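Early History (4): Other Related Research
- Automata theory
  - Various automata models were proposed from the late 1950s through the 1960s
  - Not only a foundation of the theory of computation but also an important model for language analysis
- Introduction of the idea of heuristic search: Newell and Simon (1956)
- Introduction of the LISP programming language: John McCarthy (1960)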
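Early History (5): Linguistic Theory
- Chomsky
  - Syntactic Structures (1957), Aspects of the Theory of Syntax (1965)
  - Transformational generative grammar
  - The concepts of phrase structure and transformation
  - The basis of a sentence is its phrase structure, and a sentence is a transformation of phrase structures.
- C. Hockett
  - Grammar for the Hearer (1961)
  - Human listeners do not wait until a sentence is finished before attempting to parse it. They understand the syntactic structure built up so far while listening, and they listen while anticipating which phrase or sentence structure will be uttered next.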
Natural Language Processing in the 1960s
- Ideas
  - The use of limited domains for language-understanding systems
  - The use of key words to trigger certain actions
  - The translation of English into formal languages
- Some systems
  - Key-word systems: ELIZA, DOCTOR, PARRY, etc.
  - Translating English into a formal system: STUDENT
  - Database question answering: BASEBALL
BASEBALL (1)
- Bert F. Green, Jr., Alice K. Wolf, Carol Chomsky, and Kenneth Laughery (1963)
- Database question answering system
- Database query generation from English
- A system for searching data on American professional baseball using natural language
BASEBALL (2)
BASEBALL's database

MONTH  PLACE      DAY  GAME  WINNER/SCORE  LOSER/SCORE
July   Cleveland  6    95    White Sox/2   Indians/0
July   Boston     7    96    Red Sox/5     Yankees/3
July   Detroit    7    97    Tigers/10     Athletics/2

Question: Who did the Yankees play on July 7?
After query generator:
  (OR (July 7 Yankees/? ANSWER/?)
      (July 7 ANSWER/? Yankees/?))
Answer: Red Sox
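The effect of the generated query can be imitated with a direct search over the three games in the table. This is a rough sketch of the idea only, not BASEBALL's actual query language or data format: the record layout and the function `opponents` are hypothetical. The OR in the query corresponds to checking the team in both the winner and the loser position.

```python
# The three games from the slide, stored as plain records (assumed layout).
GAMES = [
    {"month": "July", "place": "Cleveland", "day": 6, "game": 95,
     "winner": ("White Sox", 2), "loser": ("Indians", 0)},
    {"month": "July", "place": "Boston", "day": 7, "game": 96,
     "winner": ("Red Sox", 5), "loser": ("Yankees", 3)},
    {"month": "July", "place": "Detroit", "day": 7, "game": 97,
     "winner": ("Tigers", 10), "loser": ("Athletics", 2)},
]


def opponents(team, month, day):
    """Return the teams that played against team on the given date."""
    found = []
    for game in GAMES:
        if game["month"] != month or game["day"] != day:
            continue
        winner, loser = game["winner"][0], game["loser"][0]
        if winner == team:       # first branch of the OR
            found.append(loser)
        elif loser == team:      # second branch of the OR
            found.append(winner)
    return found


print(opponents("Yankees", "July", 7))
```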
ELIZA (1)
- Joseph Weizenbaum (at MIT), 1966
- A dialogue system between a human and a computer
- Uses pattern matching based on key words
- Gives the impression of a system with considerable intelligence

Pattern               Response
(* computers *)       Do computers frighten you?
(* mother *)          Tell me more about your family.
(* I believe that *)  How long have you believed that?
<nothing matched>     Please go on.
ELIZA (2)
User : Men are all alike.
Eliza : In what way?
User : They're always bugging us about something or other.
Eliza : Can you think of a specific example?
User : Well, my boyfriend made me come here.
Eliza : Your boyfriend made you come here.
User : He says I'm depressed much of the time.
Eliza : I am sorry to hear you are depressed.
User : It's true. I am unhappy.
Eliza : Do you think coming here will help you not to be unhappy?
User : I need some help; that much seems certain.
Eliza : What would it mean to you if you got some help?
User : Perhaps I could learn to get along with my mother.
Eliza : Tell me more about your family.
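The key-word pattern table from the ELIZA (1) slide maps almost directly onto regular expressions. The sketch below covers only those four patterns and is not Weizenbaum's implementation: the original also reassembled parts of the user's sentence (as in "Your boyfriend made you come here"), which this example leaves out. The function name `respond` is hypothetical.

```python
import re

# Key-word patterns and canned responses from the ELIZA (1) table.
RULES = [
    (re.compile(r"\bcomputers\b", re.IGNORECASE), "Do computers frighten you?"),
    (re.compile(r"\bmother\b", re.IGNORECASE), "Tell me more about your family."),
    (re.compile(r"\bI believe that\b", re.IGNORECASE), "How long have you believed that?"),
]

# Response used when no pattern matches.
DEFAULT = "Please go on."


def respond(utterance):
    """Return the response of the first rule whose key word appears."""
    for pattern, response in RULES:
        if pattern.search(utterance):
            return response
    return DEFAULT


print(respond("Perhaps I could learn to get along with my mother."))
print(respond("Men are all alike."))
```

The program has no model of what "mother" means; the apparent understanding comes entirely from the choice of key words and responses.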
STUDENT
- Developed by Daniel Bobrow, 1968.
- Translates algebra problems into a set of linear equations
  - John's age now is two times Mary's age.
    JA = 2 * MA
  - In three years John will be 6 years older than Mary.
    JA + 3 = MA + 6
- Uses pattern matching
  - The required elements are inserted into the empty slots of a pattern
- A program that showed that sentences can be analyzed with only simple pattern matching and a small number of heuristic rules
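Once the sentences have been turned into equations, the remaining work is ordinary algebra. As a small check, the sketch below solves the two equations exactly as they are written on the slide, rearranged to JA - 2*MA = 0 and JA - MA = 3, using Cramer's rule. The helper `solve_2x2` is a hypothetical name, and this is not how STUDENT itself solved its equation sets.

```python
from fractions import Fraction


def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("the system has no unique solution")
    x = Fraction(c1 * b2 - c2 * b1, det)
    y = Fraction(a1 * c2 - a2 * c1, det)
    return x, y


# JA = 2 * MA      ->  1*JA - 2*MA = 0
# JA + 3 = MA + 6  ->  1*JA - 1*MA = 3
ja, ma = solve_2x2(1, -2, 0, 1, -1, 3)
print(ja, ma)
```

With the slide's equations this gives JA = 6 and MA = 3, which satisfies both lines.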
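Case Grammar
- C. Fillmore (1968)
- Focuses on which case role each major noun phrase in a sentence plays with respect to the predicate verb
- Interprets case relations semantically
  - Agent, object, instrument, etc.
- The following two sentences have different surface structures but the same deep cases
  - He opened the door with the key.
  - A key opened the door.
- Very difficult to process mechanically
  - For every individual verb, the dictionary must describe in detail which semantic cases (noun phrases) the verb requires
  - Tens to hundreds of semantic primitives have to be defined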
Natural Language Processing in the 1970s
- The flowering of Semantic Information Processing and the seeds of Cognitive Science
- Systems
  - SHRDLU (1972)
  - LUNAR (1972)
  - MARGIE (1973)
  - NLPQ (1974)
SHRDLU
- Terry Winograd (1972)
- Transforms sentences into programs (in the blocks-world domain)
- Carries out various tasks (e.g., moving blocks on a table), searches for information in SHRDLU's database, or generates an answer for its user
- Can handle sentences exhibiting a wide variety of linguistic phenomena
- Interpreted declarative sentences as database updates, interrogative sentences as database searches, and imperative sentences as specifications for goals; these goals were achieved
- Linguistic coverage was very broad compared to previous programs
- Can handle quantification, generate natural-sounding dialogue, and answer questions about the history of its dialogue and plan execution
LUNAR
- Woods, Kaplan, and Nash-Webber (1972)
- A natural-language front end for a database containing moon rock sample analyses
- Uses ATNs (Augmented Transition Networks)
- Very general notion of quantification based on predicate calculus
- Uses sophisticated techniques to translate questions into database queries
SHRDLU and LUNAR
- Use relatively unconstrained language
- Work in a very narrow domain
  - SHRDLU: blocks world
  - LUNAR: moon-rock sample analysis
- Have complete, privileged knowledge of their work
MARGIE (1)
- Schank, Goldman, Rieger, and Riesbeck (1973)
- Deals with much more unconstrained language, particularly language about human actions
- Based on Conceptual Dependency (CD) Theory (by Schank)
- Every EVENT has:
  - an ACTOR
  - an ACTION performed by that actor
  - an OBJECT that the action is performed upon
  - a DIRECTION in which that action is oriented
- CD primitive actions
  ATRANS, PTRANS, PROPEL, MTRANS, MBUILD, ATTEND, SPEAK, GRASP, MOVE, INGEST, EXPEL
MARGIE (2)
(e.g.) John gave Mary a book.
  actor     : John
  action    : ATRANS /* transfer possession */
  object    : book
  direction : FROM John TO Mary
CD diagram: John <=(p)=> ATRANS <-O- book <-R- (to Mary, from John)
Lessons of the 1970s
- Knowledge representation
  - Of central importance to all natural language processing
  - Issues
    - How should items in memory be indexed and accessed?
    - How should context be represented?
    - How should memory be updated?
    - How can programs deal with inconsistency?
- Common sense
  - Knowledge of the outside world
  - (e.g.) The city councilmen refused the women a permit because
    they feared violence          // they : city councilmen
    they advocated revolution     // they : women
FRAMES
- Minsky, 1975
- Structures consisting of a core and slots
- Each slot corresponds to either a facet or participant of the concept embodied in the frame, or a space for a pointer to a related concept
- Provide a neat explanation for default reasoning
SCRIPTS
- Roger Schank and his collaborators at Yale (1977)
- (e.g.) Track: Coffee Shop
  - Props: Tables, Menu, F = Food, Check, Money
  - Roles: S = Customer, W = Waiter, C = Cook, M = Cashier, O = Owner
Unification-based Grammar Formalisms
- Grammatical theories
  - LFG (Lexical Functional Grammar): Bresnan (1982)
  - GPSG (Generalized Phrase Structure Grammar): Gazdar (1985)
  - HPSG (Head-driven Phrase Structure Grammar): Pollard (1985)
- Grammatical tools
  - DCG (Definite Clause Grammar): Pereira & Warren (1980)
  - FUG (Functional Unification Grammar): Kay (1983)
  - PATR-II: Shieber et al. (1983)
Unification-based Grammar Formalisms
- Augmented phrase structure grammar
  - Context-free based grammar rules
  - Uses feature structures instead of simple grammar symbols
- Feature structure
  - Complex-feature-based informational elements
  - Associations between features and values
- Unification
  - Information-combining operation
  - The main operation in unification-based grammar formalisms
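Feature Structure
Feature structures of the noun "철수" and the verb "먹다" (HPSG example), written as feature paths:
- 철수
  - PHON: "철수"
  - SYN|LOC|HEAD|MAJ: N
  - SYN|LOC|LEX
- 먹다
  - PHON: "먹다"
  - SYN|LOC|HEAD|MAJ: V
  - SYN|LOC|SUBCAT: two complements, one with MAJ N and GR SUBJ, one with MAJ N and GR OBJ
  - SYN|LOC|LEX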
Unification
  FS1 = [cat: NP]
  FS2 = [agreement: [number: singular, person: third]]
  FS1 ⊔ FS2 = FS3 = [cat: NP, agreement: [number: singular, person: third]]
Unification
  FS3 = [cat: NP, agreement: [number: singular, person: third]]
  FS4 = [cat: NP, agreement: [number: plural]]
  FS3 ⊔ FS4 = unification failed
The unification of FS3 and FS4 fails because their values for the agreement: number feature are not the same (conflict).
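Both unification examples can be reproduced with nested dictionaries. The sketch below is a simplified illustration: it treats feature structures as plain trees (no shared substructures or variables, which full unification grammars need), it uses `None` to signal failure, and it reads FS1 as carrying only `cat` and FS2 only `agreement`, which is an interpretation of the slide layout. The function name `unify` is hypothetical.

```python
def unify(a, b):
    """Unify two feature structures; return None if they conflict."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:   # conflict somewhere below this feature
                    return None
                result[feature] = merged
            else:
                result[feature] = value  # information only b provides
        return result
    # Atomic values unify only when they are identical.
    return a if a == b else None


FS1 = {"cat": "NP"}
FS2 = {"agreement": {"number": "singular", "person": "third"}}
FS4 = {"cat": "NP", "agreement": {"number": "plural"}}

FS3 = unify(FS1, FS2)
print(FS3)
print(unify(FS3, FS4))
```

Unification only ever adds compatible information, so the order of the arguments does not change whether it succeeds.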
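Recent Research Trends in Natural Language Processing
- Simpler grammar rules, larger dictionaries
  - Large-scale analysis dictionaries, thesauri, etc.
- Corpus-based language processing
  - Raw corpora, tagged corpora
  - Extraction of grammatical, lexical, and other linguistic information
  - Statistics-based language processing
  - Machine-learning-based language processing
- Development of practical natural language processing systems
  - Commercial machine translation systems
  - Information retrieval systems
  - Document classification and summarization systems, etc.
- Advances in deep learning
  - Deep learning has achieved the best performance in image recognition and speech recognition
  - In natural language processing, too, deep learning has recently achieved the best performance in many applications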
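History of Machine Translation (1)
- GAT
  - Started in 1952 and completed in 1965
  - Russian-English translation system
  - Translation domain: physics papers
  - Word-for-word translation with some idiom handling
  - Translation quality was very poor, but the system was used by the US Atomic Energy Commission until 1979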
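History of Machine Translation (2)
- CETA
  - Completed in 1967 and used until 1971
  - Started at the University of Grenoble, France
  - Translation based on linguistic theory
  - Interlingua approach (pivot approach)
    - Interlingua: a representation independent of any individual language
- GETA
  - Successor to CETA
  - Learning from the failure of CETA, adopted the transfer approach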
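History of Machine Translation (3)
- TAUM
  - Domain: weather forecasts
  - English-French translation system
  - Pure transfer approach
- METEO
  - A fully automatic translation system that extends TAUM
  - Translation success rate of 90-95%
  - Most of the failures are due to spelling errors and the like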
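History of Machine Translation (4)
- SYSTRAN
  - The first commercialized machine translation system
  - 1970: used by the FTD of the US federal government (Russian-English)
  - 1974: used by NASA (Russian-English)
  - 1976: used by the EC (English-French)
  - 1978: French-English
  - 1979: English-Italian
  - 1985: French-German, English-German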
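History of Machine Translation (5)
- METAL
  - A bidirectional German-English machine translation system developed in 1982
  - English analysis using GPSG
- EUROTRA
  - Attempted translation among the nine languages of the European Community
  - Phase 1 of the research ended in 1992: system development failed
  - Since translation accounts for about 40% of the European Community budget, research and development is expected to continue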
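History of Machine Translation (6)
- Research in Japan
  - Started in 1964 by Professor Nagao at Kyoto University
  - As of 1990, more than 20 systems had been commercialized
  - One of the countries most actively pursuing machine translation research
- Research in Korea
  - Started around 1980 at universities and research institutes
  - English-Korean, Japanese-Korean, and Korean-Japanese translation systems are now commercialized
  - Research and development is led by universities and companies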
History of Machine Translation (7)
- Statistical Machine Translation (SMT)
  - e.g., Google Translate
  - Word-based model: GIZA++ (IBM Models 1-6)
  - Phrase-based model: Moses
- Pipeline: parallel corpus (sentence-aligned corpus) → word alignment (GIZA++) → phrase extraction → reordering model → language model (SRILM) → decoding
SMT: example (figure)
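History of Machine Translation (8)
- Neural Machine Translation (NMT)
  - An end-to-end machine translation system based on deep learning
  - Consists of a word-based Recurrent Neural Network (RNN) encoder and an RNN decoder
  - Pipeline: parallel corpus (sentence-aligned corpus) → NMT training → RNN decoding
  - More recently, the attention mechanism has been introduced and yields even higher performance
  - Outperforms phrase-based MT and hierarchical phrase-based MT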
NMT: example (figure)