Tacotron Spoke
AiFrenz, April 24, 2019, 조희철

Contents
1. Speech Synthesis & Tacotron
2. RNN, Attention
3. Tacotron Analysis
4. Appendix

1. Speech Synthesis & Tacotron

Hello, AiFrenz members. Nice to meet you.

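Speech Synthesis Models
Traditional speech synthesis
- Concatenative TTS: synthesizes speech by joining recordings stored in a database -> good audio quality, but requires a very large DB
- Statistical parametric TTS: based on models such as HMMs
  - text analyzer, F0 generator, spectrum generator, pause estimator, vocoder
  - audio quality is poorer, but the acoustic features can be controlled
- Pipeline: Text -> Text Analyzer -> Linguistic Feature -> Acoustic Model -> Acoustic Feature -> (deterministic) Vocoder -> Speech
TACOTRON: End-To-End Speech Synthesis, March 2017
- Pipeline: Text -> Neural Network (Tacotron) -> (deterministic) Vocoder -> Speech
- The vocoder can be replaced with a Wavenet Vocoder.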

Tacotron Model Architecture
(Figure labels: Encoder, Decoder, Post-processing network, Wavenet Vocoder)
Synthesis stage
- Input: Text
- Outputs: Mel-spectrogram (predicted) -> (linear) Spectrogram (predicted) -> Speech (Audio)
Train stage
- Inputs: Text, Mel-spectrogram, (linear) Spectrogram
- Outputs: Mel-spectrogram (predicted) -> (linear) Spectrogram (predicted) -> Speech (Audio)

Tacotron Implementation Code
keithito (July 2017)
- The representative Tacotron implementation
- https://github.com/keithito/tacotron
Background timeline
- Wavenet (September 2016)
- ibab code released (September 2016)
- Tacotron paper published (March 2017)
carpedm20 (October 2017)
- Generates Korean speech with the Tacotron model, based on the keithito code
- Extended to the Multi-Speaker model proposed in DeepVoice 2
- Tensorflow 1.3 -> does not run on recent versions
- https://github.com/carpedm20/multi-speaker-tacotron-tensorflow
Background timeline
- Tacotron2 (December 2017)
- r9y9 code (wavenet vocoder) released (January 2018)
Rayhane-mamah (April 2018)
- The representative Tacotron 2 implementation, based on the keithito and r9y9 code
- Also includes a Wavenet implementation
- https://github.com/rayhane-mamah/tacotron-2
hccho2 (December 2018)
- Korean Tacotron + Wavenet, runs on recent Tensorflow versions
- Faster convergence
- https://github.com/hccho2/tacotron-wavenet-vocoder

Audio Samples
- train step: 106000 (GTX 1080ti, 18h)
- moon data: 1,125 examples (0.89 hours)
- son data: 20,105 examples (19.10 hours)
Synthesized sentences (Korean), one sample per speaker (Son, Moon):
- "This controversy disappeared after the Tacotron paper."
- "This coming June 6 is the 64th Memorial Day."

Audio Samples (Tacotron2 + Griffin-Lim)
- train step: 100,000 (27h)
- moon data: 1,125 examples (0.89 hours)
- son data: 20,105 examples (19.10 hours)
Synthesized sentences (Korean):
- "This controversy disappeared after the Tacotron paper."
- "This coming June 6 is the 64th Memorial Day."

Model                      # of trainable_variables()   sec/step (GTX1080ti)
Tacotron 1                 7M                           0.60
Tacotron 2 (Griffin-Lim)   29M                          0.98

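Preparing Korean Data: Audio / Text
Pairs of (audio file of about 12 seconds or less, script) are required.
Long audio files -> must be cut into clips of about 12 seconds or less.
- The cuts do not have to fall on sentence boundaries; cut at silent intervals.
- Keeping the cut audio files in sync with the script is laborious work.
- Cut audio files -> generate a script with the Google Speech API (STT).
- Compare the STT-generated script with the original text and correct it (manual work vs programming).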

Hangul Text Decomposition
Hangul text must be turned into a sequence of (initial / medial / final) jamo.
- The jamo package can be used for this.
- '존경하는' -> ['ᄌ', 'ᅩ', 'ᆫ', 'ᄀ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ', '~']
- -> [14, 29, 45, 2, 27, 62, 20, 21, 4, 39, 45, 1]
Initial and final consonants are treated as different characters.
Token table (80 tokens):
'_': 0, '~': 1,
'ᄀ': 2, 'ᄁ': 3, 'ᄂ': 4, 'ᄃ': 5, 'ᄄ': 6, 'ᄅ': 7, 'ᄆ': 8, 'ᄇ': 9, 'ᄈ': 10, 'ᄉ': 11, 'ᄊ': 12, 'ᄋ': 13, 'ᄌ': 14, 'ᄍ': 15, 'ᄎ': 16, 'ᄏ': 17, 'ᄐ': 18, 'ᄑ': 19, 'ᄒ': 20,
'ᅡ': 21, 'ᅢ': 22, 'ᅣ': 23, 'ᅤ': 24, 'ᅥ': 25, 'ᅦ': 26, 'ᅧ': 27, 'ᅨ': 28, 'ᅩ': 29, 'ᅪ': 30, 'ᅫ': 31, 'ᅬ': 32, 'ᅭ': 33, 'ᅮ': 34, 'ᅯ': 35, 'ᅰ': 36, 'ᅱ': 37, 'ᅲ': 38, 'ᅳ': 39, 'ᅴ': 40, 'ᅵ': 41,
'ᆨ': 42, 'ᆩ': 43, 'ᆪ': 44, 'ᆫ': 45, 'ᆬ': 46, 'ᆭ': 47, 'ᆮ': 48, 'ᆯ': 49, 'ᆰ': 50, 'ᆱ': 51, 'ᆲ': 52, 'ᆳ': 53, 'ᆴ': 54, 'ᆵ': 55, 'ᆶ': 56, 'ᆷ': 57, 'ᆸ': 58, 'ᆹ': 59, 'ᆺ': 60, 'ᆻ': 61, 'ᆼ': 62, 'ᆽ': 63, 'ᆾ': 64, 'ᆿ': 65, 'ᇀ': 66, 'ᇁ': 67, 'ᇂ': 68,
'!': 69, "'": 70, '(': 71, ')': 72, ',': 73, '-': 74, '.': 75, ':': 76, ';': 77, '?': 78, ' ': 79
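The decomposition and token lookup can be sketched as follows. This is a minimal sketch, not the author's code: instead of the jamo package it uses Unicode NFD normalization from the standard library, which splits precomposed Hangul syllables into conjoining jamo. The names SYMBOLS and text_to_sequence are chosen for illustration; the table is rebuilt from Unicode code-point ranges so that it matches the 80-token table on this slide.

```python
import unicodedata

PAD, EOS = "_", "~"
LEADS = [chr(c) for c in range(0x1100, 0x1113)]   # 19 initial consonants
VOWELS = [chr(c) for c in range(0x1161, 0x1176)]  # 21 medial vowels
TAILS = [chr(c) for c in range(0x11A8, 0x11C3)]   # 27 final consonants
PUNCT = list("!'(),-.:;? ")                       # 11 punctuation marks and space

SYMBOLS = [PAD, EOS] + LEADS + VOWELS + TAILS + PUNCT
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}


def text_to_sequence(text, add_eos=True):
    """Decompose Hangul into jamo and map each symbol to its token id."""
    jamo = unicodedata.normalize("NFD", text)
    # Characters outside the table are skipped in this sketch.
    ids = [SYMBOL_TO_ID[ch] for ch in jamo if ch in SYMBOL_TO_ID]
    if add_eos:
        ids.append(SYMBOL_TO_ID[EOS])
    return ids


print(len(SYMBOLS))
print(text_to_sequence("존경하는"))
```

Because initial and final consonants live in different Unicode ranges, the same consonant gets two different ids depending on its position in the syllable, which is the behavior the slide describes.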

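The Three Elements of Sound
Loudness
- Loudness is determined by the amplitude of the vibration: a loud sound has a large amplitude and a quiet sound a small one. Changing the loudness does not change the frequency. Loudness is measured in dB (decibels).
Pitch (high / low tones, frequency)
- Pitch is determined by the vibration frequency of the sound source: the higher the frequency, the higher the sound. The unit is Hz, the same as for frequency.
- Pitch Extraction (Detection) Algorithm
Timbre (tone color)
- Timbre is distinguished by the waveform (the shape of the wave).
- People's voices differ because their timbres differ.
- MFCC captures the characteristics of timbre well.
Which feature should we use?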

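MFCC (Mel-frequency cepstral coefficients)
(Figure labels: pure tone, overtones)
- A feature that represents the characteristics of timbre well.
- Distinguishes phonemes.
- Captures the overtone structure (fundamental + partials) well.
- Ignores differences in pitch -> unsuitable for transcribing sheet music.
- Suitable for telling instruments apart and for telling speakers' voices apart.
- "MFCC values are not very robust in the presence of additive noise, and so it is common to normalize their values in speech recognition systems to lessen the influence of noise." (Wikipedia)
The Tacotron model uses the Mel-Spectrogram rather than MFCC.
- Computing MFCC discards much of the information in the sound. -> The Mel-Spectrogram is chosen with reconstruction in mind.
- Linear spectrograms, Mel-Spectrograms, and MFCC all lack phase information. -> Reconstruct with Griffin-Lim.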

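Fourier Transform
The Fourier transform decomposes a function of time (or a signal) into its frequency components. Applying it to a function of time gives a complex-valued function of frequency. The absolute value of that function is the amount of each frequency component in the original function, and its argument is the phase offset relative to the basic sinusoid.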
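A small worked example of magnitude and phase, as a sketch under stated assumptions: the naive dft function below is written for illustration only (real code would use an FFT library), and the signal length, frequency bin, and phase are arbitrary values chosen here.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


N, k0, phase = 64, 5, math.pi / 3
# A cosine that completes k0 cycles over N samples, shifted by `phase`.
signal = [math.cos(2 * math.pi * k0 * t / N + phase) for t in range(N)]
spectrum = dft(signal)

magnitude = abs(spectrum[k0])      # amount of frequency k0: about N / 2
angle = cmath.phase(spectrum[k0])  # phase offset: about pi / 3
print(magnitude, angle)
```

A spectrogram keeps only the magnitude; the angle is the phase information that has to be estimated again at synthesis time.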

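STFT (Short-Time Fourier Transform)
y, sr = librosa.load(audio_clip, sr=sampling_rate)
D = librosa.stft(y, n_fft=fft_size, hop_length=hop_size, win_length=window_size)
e.g.
- sampling_rate = 24000 (samples per second)
- window_size = 1200 (= 0.05 sec)
- hop_size = 300 -> determines the length (number of frames T)
- fft_size = 2048 -> determines the output size
audio_clip (1D data) -> 2D data (T, fft_size/2 + 1), after transposing the (fft_size/2 + 1, T) array that librosa returns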
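The example values on this slide work out to the following frame durations and output width. This is plain arithmetic on the slide's numbers; the helper name stft_output_shape is made up for illustration.

```python
sampling_rate = 24000  # samples per second
window_size = 1200
hop_size = 300
fft_size = 2048

window_seconds = window_size / sampling_rate   # 0.05
hop_seconds = hop_size / sampling_rate         # 0.0125
frames_per_second = sampling_rate // hop_size  # 80
n_bins = fft_size // 2 + 1                     # 1025


def stft_output_shape(n_frames):
    """Shape of the (time, frequency) array for a given number of frames."""
    return (n_frames, n_bins)


print(window_seconds, hop_seconds, frames_per_second, stft_output_shape(233))
```

The hop size sets how many frames one second of audio produces, while the FFT size alone sets the number of frequency bins per frame.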

STFT (Short-Time Fourier Transform)
n_fft determines the output size.

From Audio To Mel-spectrogram
- Raw Audio Clip (73512 samples), sr = 24000
- Silence Trim -> (69880); 69880 / 300 = 232.93
- Padded Audio -> (69900); 233 x hop_size (300) = 69900
- STFT -> (233, 1025); fft_size = 2048 -> fft_size/2 + 1
- ABS -> (233, 1025)
Linear branch
- amp_to_db -> subtract ref_level_db -> normalize -> (linear) spectrogram
- Griffin-Lim goes the other way, from the linear spectrogram back to audio.
Mel branch
- Multiply by Mel_Basis (1025, 80), num_mels = 80 -> (233, 80)
- amp_to_db -> subtract ref_level_db -> normalize -> mel-spectrogram
Depending on the normalization scheme, the values lie in [0, 1], [0, 4], or [-4, 4].
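The padding arithmetic and the amp_to_db / normalize steps can be sketched as below. The values ref_level_db = 20 and min_level_db = -100, the 1e-5 amplitude floor, and the exact normalization formulas are assumptions chosen for illustration; the slides only name the steps and the possible output ranges.

```python
import math

REF_LEVEL_DB = 20.0    # assumed value, not given on the slide
MIN_LEVEL_DB = -100.0  # assumed value, not given on the slide


def pad_to_hop_multiple(n_samples, hop_size):
    """Number of frames and padded length so the audio is a whole number of hops."""
    n_frames = math.ceil(n_samples / hop_size)
    return n_frames, n_frames * hop_size


def amp_to_db(amplitude, floor=1e-5):
    """Convert a magnitude to decibels, clamping tiny values to avoid log(0)."""
    return 20.0 * math.log10(max(floor, amplitude))


def normalize(db):
    """Map [MIN_LEVEL_DB, 0] dB to [0, 1], clipping anything outside."""
    return min(1.0, max(0.0, (db - MIN_LEVEL_DB) / -MIN_LEVEL_DB))


def normalize_symmetric(db, max_abs=4.0):
    """Map the same dB range to [-max_abs, max_abs]."""
    return 2.0 * max_abs * normalize(db) - max_abs


print(pad_to_hop_multiple(69880, 300))
db = amp_to_db(1.0) - REF_LEVEL_DB
print(normalize(db), normalize_symmetric(db))
```

Whichever range is chosen has to be undone in the same way before Griffin-Lim or a vocoder is applied, which is one reason a change to these settings means regenerating the stored features.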

Generating Mini-batch Data
(audio, text, mel-spectrogram, linear-spectrogram, tokens, ...) are bundled and saved in advance as npz files.
A DataFeeder class supplies the data during training.
- Builds a suitable number (M, e.g. 32) of mini-batches (N examples each, e.g. 32) and stacks them in a queue.
- The N x M examples are sorted by length and then served N at a time -> minimizes padding.
- (input_data, input_length, mel_target, linear_target, speaker_id)
- Data is fed so that every speaker contributes the same proportion.
- If a hyperparameter changes (e.g. hop_size), the data must be generated again.
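The effect of sorting by length before batching can be sketched as follows. The function names and the random example lengths are illustrative and are not taken from the DataFeeder class itself.

```python
import random


def make_batches(lengths, batch_size):
    """Sort example indices by length, then cut them into consecutive batches."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padding_cost(lengths, batches):
    """Total padded positions when each batch is padded to its longest example."""
    return sum(max(lengths[i] for i in batch) - lengths[i]
               for batch in batches for i in batch)


rng = random.Random(0)
N, M = 32, 32  # batch size and number of batches per group
lengths = [rng.randint(20, 200) for _ in range(N * M)]

sorted_batches = make_batches(lengths, N)
unsorted_batches = [list(range(i, i + N)) for i in range(0, N * M, N)]
rng.shuffle(sorted_batches)  # batch order can still be shuffled for training

print(padding_cost(lengths, unsorted_batches), padding_cost(lengths, sorted_batches))
```

Sorting only within a group of N x M examples keeps some randomness between groups while still giving each batch examples of similar length.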

2. RNN, Attention

Tensorflow BasicRNNCell / dynamic_rnn
- Inside Tensorflow, U and W are not created as separate variables; they are merged into a single kernel. See Sample Code (C).
- dynamic_rnn can be used to implement teacher forcing, but it is inconvenient for free-running (inference).
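Why a single merged kernel is equivalent to separate U and W matrices can be checked with a few lines of plain Python. This is a sketch, not Tensorflow code: the kernel shape (10, 8) and bias shape (8,) come from Sample Code (C) in the appendix, input_dim = 2 is inferred from 10 = input_dim + 8, and the tanh activation and random weights are assumptions for illustration.

```python
import math
import random


def matvec(vec, mat):
    """Multiply a row vector (n,) by a matrix (n, m), giving (m,)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


input_dim, hidden_dim = 2, 8
rng = random.Random(0)
kernel = [[rng.uniform(-1, 1) for _ in range(hidden_dim)]
          for _ in range(input_dim + hidden_dim)]  # shape (10, 8)
bias = [0.0] * hidden_dim                           # shape (8,)

# Slicing the merged kernel recovers the two textbook matrices.
U, W = kernel[:input_dim], kernel[input_dim:]

x = [rng.uniform(-1, 1) for _ in range(input_dim)]
h = [rng.uniform(-1, 1) for _ in range(hidden_dim)]

# One matmul on the concatenated [x, h] ...
merged = [math.tanh(a + b) for a, b in zip(matvec(x + h, kernel), bias)]
# ... equals x @ U + h @ W computed separately.
separate = [math.tanh(a + c + b)
            for a, c, b in zip(matvec(x, U), matvec(h, W), bias)]

print(max(abs(m - s) for m, s in zip(merged, separate)))
```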

Tensorflow dynamic_decode
See Sample Code (A).
[Important] You need to be able to customize the cell, the Helper, and BasicDecoder so that you can implement the module you want even when Tensorflow does not provide an API for it.
Reference: Customization Tutorial

BasicDecoder
- The step function of BasicDecoder combines the cell and the Helper: it calls cell.call and Helper.next_inputs, and returns (outputs, next_state, next_inputs, finished).
- If the cell or the Helper is customized, BasicDecoder may need to be customized as well.
(Figure labels: input, cell.call, outputs, next_state, Helper, next_inputs)
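How a decoder's step function ties a cell and a helper together can be illustrated with toy classes in plain Python. Everything below (ToyCell, FeedbackHelper, ToyDecoder, and this dynamic_decode loop) is a hypothetical stand-in written for illustration; it mirrors the roles described on the slide, not the actual Tensorflow signatures.

```python
class ToyCell:
    """Stand-in for an RNN cell: call(inputs, state) -> (output, next_state)."""

    def call(self, inputs, state):
        next_state = state + inputs
        return next_state, next_state


class FeedbackHelper:
    """Free-running helper: feeds the previous output back in as the next input."""

    def __init__(self, start_input, stop_value):
        self.start_input = start_input
        self.stop_value = stop_value

    def initialize(self):
        return False, self.start_input

    def next_inputs(self, time, outputs, state):
        finished = outputs >= self.stop_value
        return finished, outputs


class ToyDecoder:
    def __init__(self, cell, helper, initial_state):
        self.cell, self.helper, self.initial_state = cell, helper, initial_state

    def initialize(self):
        finished, first_inputs = self.helper.initialize()
        return finished, first_inputs, self.initial_state

    def step(self, time, inputs, state):
        # The decoder only glues the two parts together.
        outputs, next_state = self.cell.call(inputs, state)
        finished, next_inputs = self.helper.next_inputs(time, outputs, next_state)
        return outputs, next_state, next_inputs, finished


def dynamic_decode(decoder, maximum_iterations=50):
    finished, inputs, state = decoder.initialize()
    collected, time = [], 0
    while not finished and time < maximum_iterations:
        outputs, state, inputs, finished = decoder.step(time, inputs, state)
        collected.append(outputs)
        time += 1
    return collected, state


outputs, final_state = dynamic_decode(
    ToyDecoder(ToyCell(), FeedbackHelper(start_input=1, stop_value=10), initial_state=0))
print(outputs, final_state)
```

Swapping FeedbackHelper for a helper that returns ground-truth frames would turn the same loop into teacher forcing, which is why the helper is kept separate from the cell.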

Tensorflow AttentionMechanism

Both the encoder hidden states and the decoder hidden state enter the computation of the attention score. Bahdanau Attention and Luong Attention differ only in how the score is computed.
For decoder time step $i$ (eh, dh: encoder and decoder hidden dimensions):
- $[h_1, h_2, \ldots, h_{T_e}]$: encoder hidden states, $h_j \in \mathbb{R}^{eh}$
- $s_i$: decoder hidden state, $s_i \in \mathbb{R}^{dh}$
- $e_i = [e_{i1}, e_{i2}, \ldots, e_{iT_e}]$: score
- $a_i = \mathrm{softmax}(e_i)$: alignment (weight), $a_{ij} \in \mathbb{R}$
- $c_i = \sum_{j=1}^{T_e} a_{ij} h_j$: context
Terminology: score, alignment, context, attention (attention vector)

Bahdanau & Luong Attention
(Figure: score formulas, labeled (2) Luong and (3) Bahdanau; the Bahdanau score corresponds to _bahdanau_score in the Tensorflow source.)
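The two score functions can be compared for a single decoder step with a plain-Python sketch. It uses the additive form v_a . tanh(W_m h_j + W_q s) for Bahdanau and the inner product (W_m h_j) . s for Luong, with the dimensions from the appendix slides (encoder_length = 20, encoder_hidden_dim = 30, decoder_hidden_dim = 6, num_units = 11). The random weights and helper names are illustrative assumptions; this is not the Tensorflow implementation.

```python
import math
import random

rng = random.Random(0)


def rand_vector(n):
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]


def rand_matrix(rows, cols):
    return [rand_vector(cols) for _ in range(rows)]


def matvec(vec, mat):
    """Multiply a row vector (n,) by a matrix (n, m), giving (m,)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bahdanau_scores(memory, query, W_m, W_q, v_a):
    q = matvec(query, W_q)  # processed query, (num_units,)
    return [sum(v * math.tanh(k + qi) for v, k, qi in zip(v_a, matvec(h, W_m), q))
            for h in memory]


def luong_scores(memory, query, W_m):
    # Keys must have the decoder's hidden size so the inner product is defined.
    return [sum(k * s for k, s in zip(matvec(h, W_m), query)) for h in memory]


def context_vector(alignment, memory):
    return [sum(a * h[d] for a, h in zip(alignment, memory))
            for d in range(len(memory[0]))]


T_e, enc_dim, dec_dim, num_units = 20, 30, 6, 11
memory = rand_matrix(T_e, enc_dim)  # encoder hidden states
query = rand_vector(dec_dim)        # decoder hidden state

b_align = softmax(bahdanau_scores(memory, query, rand_matrix(enc_dim, num_units),
                                  rand_matrix(dec_dim, num_units), rand_vector(num_units)))
l_align = softmax(luong_scores(memory, query, rand_matrix(enc_dim, dec_dim)))
b_context = context_vector(b_align, memory)
print(len(b_align), sum(b_align), len(b_context))
```

Everything after the score (softmax, weighted sum) is shared, which is the sense in which the two mechanisms differ only in the score.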

Tensorflow AttentionMechanism
- Bahdanau Attention -> tf.contrib.seq2seq.BahdanauAttention
- Luong Attention -> tf.contrib.seq2seq.LuongAttention
attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(num_units=11, memory=encoder_outputs, memory_sequence_length=input_lengths)
TIP
- memory = encoder hidden state -> memory_layer -> key
- query = decoder hidden state -> query_layer -> processed_query

AttentionWrapper & AttentionWrapperState
The relationship between AttentionWrapper and AttentionWrapperState is the same as the relationship between BasicRNNCell and hidden_state.
(Figure: data flow through AttentionWrapper. Labels: input, concat with state.attention, GRUCell, state.cell_state, next_state.cell_state, cell_output, Attention Mechanism, state.attention_state, alignment, next_attention_state, attention, concat, Output.)
- next_state.cell_state, attention, alignment, and next_attention_state together form the AttentionWrapperState.
- Why concat?

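AttentionWrapper
tf.contrib.seq2seq.AttentionWrapper(cell, attention_mechanism, attention_layer_size=13)
Following the data flow (see Sample Code (B)):
- cell_output (query), encoder_hidden, and attention_state (the previous alignment) go into the AttentionMechanism, which computes score -> alignment.
- _compute_attention: the weighted sum with the alignment gives the context c_i, from which attention is produced.
- The alignment is not used in any further computation; next_attention_state is passed on to the next step.
- Together these form the AttentionWrapperState.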

AttentionWrapper
AttentionWrapper(cell, attention_mechanism)
  self.__init__(cell, attention_mechanism)
    - cell is an RNNCell such as BasicRNNCell or BasicLSTMCell. For BasicRNNCell, the returned cell_output and next_cell_state have the same value (for an LSTM cell the state is a (c, h) tuple).
  self.call(input, state)    (state is an AttentionWrapperState)
    cell_output, next_cell_state = cell(input, state.cell_state)    (cell_state is the ordinary hidden_state)
    attention, alignments, next_attention_state = _compute_attention(attention_mechanism, cell_output, state.attention_state)
    ...
    next_state = AttentionWrapperState(cell_state=next_cell_state, attention=attention, attention_state=next_attention_state, alignments=alignments)
    return cell_output, next_state
  _compute_attention(attention_mechanism, cell_output, attention_state)
    alignments, next_attention_state = attention_mechanism(cell_output, attention_state)
    ...
    return attention, alignments, next_attention_state    (alignments and next_attention_state have the same value)

MonotonicAttention
- tf.contrib.seq2seq.BahdanauMonotonicAttention
- tf.contrib.seq2seq.LuongMonotonicAttention
(Figure: alignment plots, Non-Monotonic vs Monotonic; the monotonic alignment is computed from the score and the previous alignment.)
BahdanauMonotonicAttention(hp.attention_size, encoder_outputs, memory_sequence_length=input_lengths, normalize=True)
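How a monotonic alignment is derived from the score and the previous alignment can be sketched with the soft (expected) recurrence from the monotonic attention literature: q_j = (1 - p_{j-1}) q_{j-1} + previous_j and alignment_j = p_j q_j, where p_j = sigmoid(score_j). The code below is an illustrative plain-Python version under that assumption, not the Tensorflow implementation, and the example scores are made up.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def monotonic_alignment(p_choose, previous_alignment):
    """Expected alignment when attention may only stay in place or move forward."""
    alignment, q = [], 0.0
    for j, (p, prev) in enumerate(zip(p_choose, previous_alignment)):
        # Probability mass that skipped position j-1 flows on to position j.
        carry = 1.0 if j == 0 else 1.0 - p_choose[j - 1]
        q = carry * q + prev
        alignment.append(p * q)
    return alignment


previous = [1.0, 0.0, 0.0, 0.0]   # attention was on the first encoder step
p_choose = [sigmoid(0.0)] * 4     # score 0 everywhere -> p = 0.5
alignment = monotonic_alignment(p_choose, previous)
print(alignment, sum(alignment))
```

In this example the alignment sums to 0.9375 rather than 1: the remaining mass corresponds to moving past the last encoder step, so unlike a softmax the weights need not add up to one.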

3. Tacotron Analysis

Encoder
'국민과 함께하는'
-> ['ᄀ', 'ᅮ', 'ᆨ', 'ᄆ', 'ᅵ', 'ᆫ', 'ᄀ', 'ᅪ', ' ', 'ᄒ', 'ᅡ', 'ᆷ', 'ᄁ', 'ᅦ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ']
-> [2, 34, 42, 8, 41, 45, 2, 30, 79, 20, 21, 57, 3, 26, 20, 21, 4, 39, 45]
-> Character embedding
(Figure labels: (N, T, 256) ㄱㅜㅇㅁㅣㄴ; (N, T, 256) ㄱㅜㄱㅁㅣㄴ)

CBHG (Convolution Bank + Highway + GRU)
- Output (N, T, 256): the RNN is bidirectional, so 128 x 2 = 256.
- Batch normalization is used for all convolutional layers.
- The paper says that CBHG reduces overfitting and produces fewer mispronunciations than an ordinary RNN.
(Figure labels: Convolution Bank (N, T, 2048); max pooling, "Increase local invariances"; convolution projections, 2 layers, kernel_size=3, filters=128; Highway Net, 4 layers; Bidirectional RNN (GRU).)

Decoder
ConcatOutputAndAttentionWrapper: concatenates [attention(256), output(256)] -> you need to be able to write your own RNNWrapper class.

Using Tensorflow RNNCell Wrappers

Using Tensorflow RNNCell Wrappers
RNNWrapper classes already implemented in Tensorflow:
- OutputProjectionWrapper
- InputProjectionWrapper
- ResidualWrapper

Post-Processing
(Figure labels: Mel-Spectrogram, Linear-Spectrogram, audio)
- The Mel-Spectrogram is fed into a CBHG layer to produce the Linear-Spectrogram. This CBHG layer is the same kind of layer as the CBHG in the encoder, but with different hyperparameters.
- When training the CBHG, which Mel-Spectrogram should be used as its input?
  - The Mel-Spectrogram produced by the previous stage vs the Ground Truth Mel-Spectrogram
- Once the Tacotron model has produced the Linear-Spectrogram, the Griffin-Lim Algorithm is used to generate the audio.
Loss = |mel_output - mel_target| + |linear_output - linear_target|
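The loss on this slide can be written out as a short sketch, assuming both terms are mean absolute (L1) errors with equal weight, as in the Tacotron paper. The nested-list inputs and function names are for illustration only.

```python
def mean_abs_error(predicted, target):
    """Mean absolute error over two equally shaped 2-D lists (frames x bins)."""
    total, count = 0.0, 0
    for p_row, t_row in zip(predicted, target):
        for p, t in zip(p_row, t_row):
            total += abs(p - t)
            count += 1
    return total / count


def tacotron_loss(mel_output, mel_target, linear_output, linear_target):
    """Sum of the mel and linear spectrogram L1 terms."""
    return (mean_abs_error(mel_output, mel_target)
            + mean_abs_error(linear_output, linear_target))


loss = tacotron_loss([[1.0, 2.0]], [[0.0, 4.0]], [[1.0]], [[1.5]])
print(loss)
```

Because the mel term is part of the total, the decoder still receives a direct training signal even when the post-processing CBHG is fed ground-truth mel frames.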

Extending to a Multi-Speaker Model (DeepVoice 2)
- Some models concatenate the speaker embedding vector at several points in the middle of the network.
- The DeepVoice2 paper presents a way to extend the Tacotron model to a Multi-Speaker model.
- Here, the speaker embedding vector is fed in as the initial state of the RNNs.

Speaker Embedding -> resolving the data imbalance between speakers
- An embedding vector (dim 16) is first created for each speaker, passed through an FC layer to reach the required size, and then the softsign activation function is applied.
- softsign can be regarded as a replacement for tanh.
- Although the paper does not mention it, the initial state itself could be made the embedding vector, without creating a separate speaker embedding.
(Figure: speaker embedding(16) feeds four branches)
- dense(n1) -> softsign -> speaker embedding(n1)
- dense(n2) -> softsign -> speaker embedding(n2)
- dense(n3) -> softsign -> speaker embedding(n3)
- dense(n4) -> softsign -> speaker embedding(n4)
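The embedding -> dense -> softsign path can be sketched in plain Python. The 16-dimensional embedding comes from the slide; the number of speakers, the target sizes (128 and 256), the layer names, and the random initialization are hypothetical values chosen for illustration.

```python
import random


def softsign(x):
    """x / (1 + |x|): bounded in (-1, 1) like tanh, but with heavier tails."""
    return x / (1.0 + abs(x))


def dense_softsign(vec, weights, bias):
    """Fully connected layer followed by softsign; weights is (len(vec), len(bias))."""
    return [softsign(sum(v * row[j] for v, row in zip(vec, weights)) + bias[j])
            for j in range(len(bias))]


rng = random.Random(0)
EMBED_DIM = 16
speaker_table = [[rng.gauss(0.0, 0.5) for _ in range(EMBED_DIM)] for _ in range(2)]

# Hypothetical target sizes, one per RNN whose initial state is set.
target_sizes = {"encoder_rnn_init": 128, "decoder_rnn_init": 256}
layers = {name: ([[rng.gauss(0.0, 0.1) for _ in range(size)] for _ in range(EMBED_DIM)],
                 [0.0] * size)
          for name, size in target_sizes.items()}


def speaker_states(speaker_id):
    """Project one speaker's embedding to every required initial-state size."""
    embedding = speaker_table[speaker_id]
    return {name: dense_softsign(embedding, w, b) for name, (w, b) in layers.items()}


states = speaker_states(0)
print({name: len(vec) for name, vec in states.items()})
```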

Tacotron2
- Published December 2017
- CBHG removed
- Location Sensitive Attention and Residual Layer added
- Stop Token introduced
- Modified Wavenet used as the vocoder: MoL (mixture of logistics)
- L2 regularization
- Representative implementation: Rayhane-mamah

4. Appendix

Sample Code (A)

input_dim=8, hidden_dim=6
- (input_dim + hidden_dim) x (hidden_dim x 4): first LSTM
- (hidden_dim + hidden_dim) x (hidden_dim x 4): second LSTM
- FC(13), FC(19), FC(5)
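The kernel shapes above follow from the four LSTM gates sharing one matrix over the concatenated [input, hidden] vector. A short arithmetic sketch, with a helper name chosen for illustration:

```python
def lstm_kernel_shape(input_dim, hidden_dim):
    """Rows: concatenated [input, hidden]; columns: 4 gates x hidden_dim."""
    return (input_dim + hidden_dim, 4 * hidden_dim)


input_dim, hidden_dim = 8, 6
first_lstm = lstm_kernel_shape(input_dim, hidden_dim)    # (14, 24)
# The second layer's input is the first layer's hidden output.
second_lstm = lstm_kernel_shape(hidden_dim, hidden_dim)  # (12, 24)

print(first_lstm, second_lstm)
```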

Sample Code (B)
encoder_hidden_state: (N, encoder_length, encoder_hidden_dim)

Weight shapes
- decoder cell kernel: (input_dim + decoder_hidden_dim + attention_layer_size) x (decoder_hidden_dim)
- W_m: (encoder_hidden_dim x num_units)
- W_q: (decoder_hidden_dim x num_units)
- v_a: (num_units)
- W_a: (encoder_hidden_dim + decoder_hidden_dim) x attention_layer_size

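Summary: for decoder time step i
- cell_output (N,6) = [input, previous attention] (N,27) x kernel (27,6)
- score (N,20) = v_a (11,) applied to tanh of [memory (N,20,30) x W_m (30,11)] + [query (N,6) x W_q (6,11)]
- For the addition to work, the second term (N,11) must be converted to (N,1,11) with expand_dims.
- alignment (N,20) = softmax(score)
- context (N,30) = weighted sum of the memory with the alignment
- attention (N,13) = concat of cell_output and context (N,36) x W_a (36,13)
How would this be expressed as Tensorflow tensor operations?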
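The expand_dims remark can be checked with a small shape calculator that follows the usual NumPy/Tensorflow broadcasting rule (align shapes from the right; sizes must match or be 1). The functions below work on shape tuples only and are written for illustration; the batch size N = 4 is an arbitrary choice.

```python
def broadcast_shape(a, b):
    """Result shape of a + b under broadcasting, or ValueError if incompatible."""
    n = max(len(a), len(b))
    a = (1,) * (n - len(a)) + tuple(a)
    b = (1,) * (n - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x != y and 1 not in (x, y):
            raise ValueError(f"cannot broadcast {a} with {b}")
        out.append(max(x, y))
    return tuple(out)


def expand_dims(shape, axis):
    """Shape after inserting a length-1 axis at the given position."""
    shape = list(shape)
    shape.insert(axis, 1)
    return tuple(shape)


N = 4
keys = (N, 20, 11)   # memory (N,20,30) x W_m (30,11)
query = (N, 11)      # query (N,6) x W_q (6,11)

print(broadcast_shape(keys, expand_dims(query, 1)))  # (4, 20, 11)
try:
    broadcast_shape(keys, query)
except ValueError as error:
    print(error)
```

Without the extra axis the batch dimension of the query would be lined up against the encoder length, which either fails or, when N happens to equal 20, silently adds the wrong rows together.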

Sample Code (C)
[<tf.Variable 'rnn/basic_rnn_cell/kernel:0' shape=(10, 8) dtype=float32_ref>,
 <tf.Variable 'rnn/basic_rnn_cell/bias:0' shape=(8,) dtype=float32_ref>]

Bahdanau Attention
encoder_length=20, encoder_hidden_dim=30, decoder_hidden_dim=6
AttentionMechanism
- memory (N,20,30) -> W_m -> (N,20,11)
- query (N,6) -> W_q -> (N,11)
- sum, tanh -> (N,20,11) -> v_a -> score (N,20)
- softmax -> alignment (N,20)
After the AttentionMechanism
- weighted sum -> context (N,30)
- concat -> (N,36) -> W_a -> attention (N,13)

Luong Attention
encoder_length=20, encoder_hidden_dim=30, decoder_hidden_dim=6
AttentionMechanism
- memory (N,20,30) -> W_m -> (N,20,6)
- query (N,6)
- inner product -> score (N,20)
- softmax -> alignment (N,20)
After the AttentionMechanism
- weighted sum -> context (N,30)
- concat -> (N,36) -> W_a -> attention (N,13)

Tensorflow Dropout
In the keithito code, a dropout bug was fixed on August 31, 2018.

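Tacotron Training Process
September
- Tried a single-speaker model. Why doesn't it work?
- Applied MonotonicAttention as the attention model. Why doesn't the alignment sum to 1?
- librosa version 0.5.1 vs 0.6.1
- Switched to tensorflow 1.8
- Found the dropout bug -> compared with the keith ito code
- Corrected the order of AttentionWrapper and PrenetWrapper
October
- Keeping attention from going to the padding
  - Change batch_size to 1?
  - Tried customizing with an Attention class -> discovered that this is already implemented in Tensorflow
- Corrected the scripts by hand -> shared the data work with others
- Applied the stop token from the Tacotron2 model; tried location sensitive attention and GMM attention
November
- Changed the way the Mel Spectrogram is generated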