https://huggingface.co/microsoft/speecht5_tts

 


 

Something came up that I need to research urgently,

so the lectures are on hold for now.

 

 

SpeechT5 model : 

- Fine-tuned for speech synthesis (text-to-speech)

- Dataset: LibriTTS

- Paper: https://arxiv.org/abs/2110.07205

 

- Proposed as a unified-modal SpeechT5 framework, inspired by T5 (Text-To-Text Transfer Transformer), a pretrained NLP model

- SpeechT5 framework: a shared encoder-decoder network plus six modal-specific (speech/text) pre/post-nets

- (1) Preprocessing: the input speech/text first goes through the pre-nets

- (2) Shared encoder-decoder network: performs the sequence-to-sequence transformation

- (3) Generate output: the post-nets take the decoder output and produce the speech/text output (post-processing)

 

 

 

 

- Uses large unlabeled datasets: SpeechT5 is pretrained on large amounts of unlabeled speech and text data. This lets the model learn useful features and patterns from different kinds of data on its own.

- Unified modal representation: the model aims to learn a single unified representation that can handle both speech and text. This lets it capture and use the semantic properties the two modalities have in common.

- Cross-modal vector quantization approach: a technique for aligning speech and text in one unified semantic space. It mixes speech and text states through latent units placed between the encoder and the decoder, which lets the model pass and convert information between the two modalities more effectively.

 

 

 

 

 

Fine-tuning hands-on (provided on the Hugging Face model page)

 

[ Load the model ]

 

- from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech

- Preprocessing: SpeechT5Processor (wraps SpeechT5FeatureExtractor + SpeechT5Tokenizer)

- Model: SpeechT5ForTextToSpeech (transformer model)
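A minimal sketch of loading the two classes above, assuming the `transformers` library is installed and the checkpoint is `microsoft/speecht5_tts` (the model page linked at the top). The helper name `load_speecht5` is made up for this note.

```python
DEFAULT_CHECKPOINT = "microsoft/speecht5_tts"


def load_speecht5(checkpoint=DEFAULT_CHECKPOINT):
    """Return (processor, model) for the given checkpoint."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor

    # The processor bundles the tokenizer (text) and the feature extractor (audio).
    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)
    return processor, model
```

Calling `processor, model = load_speecht5()` downloads the weights on first use, so it needs network access.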

 

 

[ Dataset for fine-tuning ]

 

- SpeechT5 is currently trained only on English speech

- VoxPopuli dataset: provides audio-text speech data in 15 languages

- The goal here is to pull out only the Dutch subset and fine-tune on it (20,968 examples, which is plenty)

- cf) An ASR (automatic speech recognition) dataset like VoxPopuli is not necessarily the best dataset for TTS training. For an ASR dataset to work well for TTS, things like a low level of noise in the audio matter. Still, good multilingual, multi-speaker TTS datasets are hard to find, so an ASR dataset is reportedly about the best option available.

- Audio data has a 'sampling rate', which is simply the number of samples taken per second. For SpeechT5 the audio has to be at 16 kHz.

- Dataset structure: (screenshot in the original post, not preserved here)
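A rough sketch of the sampling-rate point, plus one way the Dutch subset could be loaded and resampled. The dataset id `facebook/voxpopuli` and the config name `nl` are assumptions not stated in these notes, and both helper names are made up.

```python
TARGET_SAMPLING_RATE = 16_000  # Hz, the rate SpeechT5 needs according to the notes above


def num_samples(duration_seconds, sampling_rate=TARGET_SAMPLING_RATE):
    """Number of audio samples in a clip of the given duration."""
    return int(round(duration_seconds * sampling_rate))


def load_dutch_voxpopuli():
    """Load the Dutch VoxPopuli training split with audio resampled to 16 kHz."""
    # Lazy import: needs the `datasets` library and network access.
    from datasets import Audio, load_dataset

    # Assumed dataset id and config name (not given in the notes).
    dataset = load_dataset("facebook/voxpopuli", "nl", split="train")
    # cast_column makes the audio get resampled when examples are read.
    return dataset.cast_column("audio", Audio(sampling_rate=TARGET_SAMPLING_RATE))
```

For example, a 2.5 second clip at 16 kHz holds `num_samples(2.5)` = 40000 samples.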

   

 

 

[ Clean up the text ]

 

- Because SpeechT5 is currently trained only on English speech, the new dataset will contain characters that are not in the SpeechT5Tokenizer vocabulary (current vocab size 79). Such characters get mapped to the <unk> token, so they have to be dealt with first.

- The dataset has both raw_text and normalized_text. normalized_text has already been normalized: abbreviations are written out properly, numbers are spelled out as words (18 becomes eighteen), uppercase is converted to lowercase, and so on.

- In normalized_text, replace the characters that are not in the original SpeechT5Tokenizer vocabulary
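The cleanup step above can be sketched in plain Python. The vocabulary and the replacement table below are toy stand-ins chosen for illustration, not the real tokenizer vocabulary or the exact mapping used in the hands-on.

```python
# Toy stand-in for the tokenizer vocabulary (assumption, for illustration only).
TOY_VOCAB = set("abcdefghijklmnopqrstuvwxyz' ")

# Hypothetical replacement table: accented character -> plain fallback.
REPLACEMENTS = {"à": "a", "ç": "c", "è": "e", "ë": "e", "í": "i", "ï": "i", "ö": "o", "ü": "u"}


def unsupported_chars(texts, vocab):
    """Return the characters that occur in texts but are missing from vocab."""
    seen = set()
    for text in texts:
        seen.update(text)
    return seen - vocab


def cleanup_text(example):
    """Return a copy of example with unsupported characters replaced."""
    text = example["normalized_text"]
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    return {**example, "normalized_text": text}
```

With a Hugging Face dataset, a function shaped like `cleanup_text` could be passed to `dataset.map`.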

 

 

[ Speakers ]

 

- VoxPopuli is a multi-speaker dataset

- Counting the examples per speaker shows that about 1/3 of the speakers have fewer than 100 samples

- Around 10 of the remaining speakers account for most of the samples (500+ each), so only speakers with between 100 and 400 samples are kept

 (this leaves 42 speakers and 9,973 samples: about half the data is thrown away, but it was judged to be enough)

- It is also a good idea to remove samples whose utterances are too long (skipped here)
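The speaker filtering above can be sketched as follows. The field name `speaker_id` and both helper names are assumptions for illustration; the 100 and 400 bounds come from the notes.

```python
from collections import Counter


def select_speakers(speaker_ids, min_examples=100, max_examples=400):
    """Return the speakers whose example count lies in [min_examples, max_examples]."""
    counts = Counter(speaker_ids)
    return {spk for spk, n in counts.items() if min_examples <= n <= max_examples}


def filter_by_speaker(examples, keep):
    """Keep only the examples spoken by a speaker in `keep`."""
    return [ex for ex in examples if ex["speaker_id"] in keep]
```

Speakers with very few samples and the handful of speakers with very many samples both drop out, which keeps the data more balanced across voices.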

 

 

[ Speaker Embeddings ]

 

- Goal: let the TTS model tell the voices of multiple speakers apart.

- Method: build an embedding for each speaker.

- How?: use SpeechBrain's spkrec-xvect-voxceleb model, which takes an audio waveform as input and outputs a vector of size 512

- Write a 'create_speaker_embedding' function (input: waveform / output: speaker_embeddings)

- cf) This model was actually trained on an English dataset, so an X-vector model trained on a Dutch dataset would probably work better
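A hedged sketch of what 'create_speaker_embedding' could look like, assuming `speechbrain` and `torch` are installed. The wrapper `build_speaker_embedder`, the `savedir` default, and the L2 normalization step are assumptions added here, not details from the notes.

```python
SPK_MODEL_NAME = "speechbrain/spkrec-xvect-voxceleb"


def build_speaker_embedder(device="cpu", savedir="pretrained_models/spkrec-xvect-voxceleb"):
    """Load the x-vector model once and return a create_speaker_embedding function."""
    # Lazy imports: both packages are heavy and the model needs a download.
    import torch

    try:
        from speechbrain.inference.speaker import EncoderClassifier  # speechbrain 1.x
    except ImportError:
        from speechbrain.pretrained import EncoderClassifier  # older releases

    speaker_model = EncoderClassifier.from_hparams(
        source=SPK_MODEL_NAME, run_opts={"device": device}, savedir=savedir
    )

    def create_speaker_embedding(waveform):
        """Map a 1-D waveform to a 512-dimensional speaker embedding."""
        with torch.no_grad():
            wav = torch.tensor(waveform, dtype=torch.float32)
            embedding = speaker_model.encode_batch(wav)
            # Normalization is an assumption here, not something the notes mention.
            embedding = torch.nn.functional.normalize(embedding, dim=2)
        return embedding.squeeze().cpu().numpy()

    return create_speaker_embedding
```

Building the embedder once and reusing the returned function avoids reloading the model for every example.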

 

 

[ Preparing the dataset ]

 

- Write a 'prepare_dataset' function: use "SpeechT5Processor" to tokenize the input text and convert the target audio into a log-mel spectrogram

- input: the dataset prepared above (see the Dataset structure screenshot)

- output: 'input_ids' (the tokenized input text) / 'speaker_embeddings' (the embedding of the speaker's voice) / 'labels' (the target spectrogram)
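A sketch of the input/output contract described above. To keep it runnable without downloading anything, the processor and the embedding function are passed in as arguments; the name `prepare_example` and this parameterization are assumptions, while the keyword arguments follow how `SpeechT5Processor` is called with `text` and `audio_target`.

```python
def prepare_example(example, processor, embed_fn):
    """Turn one raw example into input_ids, labels and speaker_embeddings."""
    audio = example["audio"]

    # Tokenize the text and compute the log-mel spectrogram of the target audio.
    features = processor(
        text=example["normalized_text"],
        audio_target=audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_attention_mask=False,
    )

    return {
        "input_ids": features["input_ids"],
        # The spectrogram comes back with a batch dimension; drop it.
        "labels": features["labels"][0],
        "speaker_embeddings": embed_fn(audio["array"]),
    }
```

In practice `processor` would be the loaded SpeechT5Processor and `embed_fn` the 'create_speaker_embedding' function.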

 

- TTS runs through processor + model + vocoder. Feeding the 'labels' (log-mel spectrogram) from the output above into the vocoder regenerates the original audio.

- Load SpeechT5HifiGan (HiFi-GAN vocoder) and run the spectrogram through it to hear the speech
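A minimal sketch of the vocoder step, assuming `transformers` and `torch` are installed. The checkpoint name `microsoft/speecht5_hifigan` and the helper name are assumptions not stated in the notes.

```python
DEFAULT_VOCODER = "microsoft/speecht5_hifigan"  # assumed checkpoint name


def spectrogram_to_waveform(spectrogram, vocoder_name=DEFAULT_VOCODER):
    """Convert a log-mel spectrogram (frames x mel bins) into a waveform tensor."""
    # Lazy imports: needs torch, transformers and a model download.
    import torch
    from transformers import SpeechT5HifiGan

    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_name)
    with torch.no_grad():
        return vocoder(torch.as_tensor(spectrogram, dtype=torch.float32))
```

Passing an example's 'labels' through this function is a quick way to check by ear that the preprocessing produced a sensible spectrogram.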

 

- The SpeechT5 model used here has a maximum input length of 600 tokens, so anything longer has to be removed. (In the hands-on, all samples longer than 200 tokens are removed so that a larger batch size can be used: 8,259 samples remain in the end)
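The length filter above can be sketched like this. The helper names are made up, and the choice to keep examples of exactly 200 tokens is an assumption based on the wording "longer than 200".

```python
MAX_TOKENS = 200  # threshold used in the hands-on, per the notes


def is_not_too_long(input_ids, max_tokens=MAX_TOKENS):
    """True if the tokenized text does not exceed max_tokens."""
    return len(input_ids) <= max_tokens


def drop_long_examples(examples, max_tokens=MAX_TOKENS):
    """Keep only the examples whose input_ids are short enough."""
    return [ex for ex in examples if is_not_too_long(ex["input_ids"], max_tokens)]
```

Shorter sequences mean less padding per batch, which is what makes the larger batch size affordable.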

 

 

[ Collate Function to make batches ]

 

- Pad the inputs with the padding token

- In the spectrogram labels, replace the padded positions with -100 (so they are ignored later when the loss is computed)
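A simplified, list-based sketch of the two padding rules above. It is not the collator from the hands-on; `collate`, `masked_l1` and the `pad_token_id` argument are illustrative names, and the L1 loss is only there to show how -100 positions get skipped.

```python
LABEL_PAD = -100.0  # marker for padded spectrogram positions


def collate(batch, pad_token_id, n_mels=80):
    """Pad input_ids with pad_token_id and spectrogram labels with -100."""
    max_ids = max(len(ex["input_ids"]) for ex in batch)
    max_frames = max(len(ex["labels"]) for ex in batch)

    input_ids, attention_mask, labels = [], [], []
    for ex in batch:
        ids = list(ex["input_ids"])
        pad = max_ids - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)

        frames = [list(frame) for frame in ex["labels"]]
        frames += [[LABEL_PAD] * n_mels for _ in range(max_frames - len(frames))]
        labels.append(frames)

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
        "speaker_embeddings": [ex["speaker_embeddings"] for ex in batch],
    }


def masked_l1(pred, target):
    """Mean absolute error over the positions whose target is not -100."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            if t != LABEL_PAD:
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0
```

Because padded frames never enter the average, shorter utterances are not penalized for the frames added to match the longest one in the batch.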

 

 

[ Training ]

 

- Push the model checkpoint to my Hugging Face repo (this keeps throwing errors, so it still needs debugging)

- Uses the Hugging Face 'Trainer' class

 

 

 
