FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO … · FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO...

FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO Departamento de Engenharia Electrotécnica e de Computadores A PROSODY MODEL TO TTS SYSTEMS João Paulo Ramos Teixeira The thesis fulfilled the degree of Doctor in Electrotechnical and Computer Engineering (Engenharia Electrotécnica e de Computadores) Supervisor: Diamantino Rui da Silva Freitas May 2004

FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Departamento de Engenharia Electrotécnica e de Computadores

A PROSODY MODEL TO TTS SYSTEMS

João Paulo Ramos Teixeira

The thesis fulfilled the degree of Doctor in Electrotechnical and Computer Engineering

(Engenharia Electrotécnica e de Computadores)

Supervisor:

Diamantino Rui da Silva Freitas

May 2004

Jury:

Professor Doutor Eugénio da Costa Oliveira (FEUP)

Professor Doutor Luís Miguel Caldas Oliveira (IST)

Professor Doutor Francisco José de Oliveira Restivo (FEUP)

Professor Emeritus Hiroya Fujisaki (University of Tokyo)

Professor Doutor Joaquim Pontes Marques de Sá (FEUP)

Professor Doutor Aníbal João de Sousa Ferreira (FEUP)

Professor Doutor Victor Manuel Cicouro Pêra (FEUP)

Professor Doutor Diamantino Rui da Silva Freitas (FEUP)

Thesis unanimously approved on 22 October 2004

This work is dedicated to my parents, as a testimony

to the wisdom they showed in giving me sufficient

conditions to study, perhaps even the optimum ones.

After all, the path that seems obvious today was not

the one followed by so many of my classmates.

Abstract

This PhD thesis presents the development of a prosody system for European Portuguese (EP) for text-to-speech (TTS) applications. TTS systems automatically read a text aloud and consist of a sequence of several modules. These modules implement the pre-processing of the input text, the phonetic transcription, and the supra-segmental processing, which consists of the inclusion of prosodic patterns. Prosody conveys a communicative intention and gives some naturalness to the uttered speech. The prosodic features consist of the timing, characterized by the segmental durations and pauses; the intonation, characterized by the fundamental frequency (F0) curve; and the intensity curve.

The preparatory work, which was fundamental for modelling and testing purposes, is presented first. It starts with a preliminary study of the stressed syllable. This study identifies the ranges of variation of the F0, duration and intensity features in the stressed syllable across contexts. Then the FEUP-IPB EP speech database, which was used in the subsequent studies, is presented. The database is labelled at the phoneme, word, sentence and F0 levels. The thesis continues with the presentation of two algorithms for the syllabic splitting of the text and of the phoneme sequences. This chapter ends with a proposed set of rules for the automatic phonetic transcription of the most problematic graphemes in EP.

The proposed prosody model consists of several sub-models, namely, the duration model to predict the segmental durations and the model to predict the F0 pattern.

Two proposals, based on artificial neural networks (ANNs), to predict the segmental durations are presented.

The first proposal consists of one ANN whose architecture, type and set of input features were carefully selected with the objective of minimizing the error between predicted and measured durations. The second proposal, called the alternative model, is based on the same considerations as the first proposal but uses one dedicated ANN for each phoneme, for a total of 44 ANNs. The alternative model, with dedicated ANNs, improved the final performance.
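As a rough illustration of how the error between predicted and measured durations might be quantified, the sketch below computes the three measures named under Model Evaluation in the table of contents: a standard deviation, the mean absolute error and the linear correlation coefficient. The function name `duration_errors`, the choice of the population standard deviation of the error, and the sample durations are assumptions made for illustration; this section does not give the exact definitions used in the thesis.

```python
import math

def duration_errors(predicted, measured):
    """Compare predicted and measured segment durations (same units, e.g. ms).

    Returns the population standard deviation of the error, the mean
    absolute error, and Pearson's linear correlation coefficient r.
    """
    n = len(predicted)
    errors = [p - m for p, m in zip(predicted, measured)]

    # Standard deviation of the error around its mean (population form, assumed).
    mean_err = sum(errors) / n
    std_err = math.sqrt(sum((e - mean_err) ** 2 for e in errors) / n)

    # Mean absolute error.
    mae = sum(abs(e) for e in errors) / n

    # Pearson correlation between predicted and measured durations.
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    r = cov / math.sqrt(var_p * var_m)

    return {"std": std_err, "mae": mae, "r": r}

# Made-up durations in milliseconds, not taken from the FEUP-IPB database.
metrics = duration_errors([60, 80, 100, 120], [50, 90, 100, 130])
```

With one dedicated model per phoneme, the same function could be applied to each phoneme's segments separately and to the pooled set, which makes the two proposals directly comparable.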

A model for the insertion of pauses and the prediction of their durations is proposed, based on a preliminary study of the FEUP-IPB database.

The proposed model to predict the F0 contour is based on the Fujisaki model and consists of two sub-models. One predicts the Phrase Commands’ (PCs) parameters and the other predicts the Accent Commands’ (ACs) parameters.

The PCs and the ACs were manually estimated in 101 paragraphs of the database under the criterion of the minimization of the error between estimated and measured F0 contours.

The prediction of the PCs is performed in two stages. The first stage is carried out by an algorithm that inserts the PCs connected to the text, based on a mathematical model obtained from experimental observations. The second stage predicts the PC amplitude, Ap, and anticipation, T0a, relative to the initial position. The anticipation allows the exact position in the speech signal to be determined. Both parameters are predicted with ANNs.

A strong connection between ACs and syllables was found in the database. This strong connection justified the adopted methodology of predicting ACs associated with syllables. Therefore, the AC model consists of one ANN that predicts the existence of an AC associated with the syllable and three further ANNs that predict the parameters amplitude (Aa) and anticipation of the onset (T1a) and offset (T2a) instants.
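To make the roles of the phrase and accent commands concrete, the following minimal sketch generates an F0 contour from the standard published form of the Fujisaki model: ln F0(t) = ln Fb + Σ Ap·Gp(t − T0) + Σ Aa·[Ga(t − T1) − Ga(t − T2)], with Gp(t) = α²·t·e^(−αt) and Ga(t) = min[1 − (1 + βt)·e^(−βt), γ] for t ≥ 0, and both zero for t < 0. The function names, the constants α = 2.0/s, β = 20.0/s and γ = 0.9, the base frequency and the command values are illustrative assumptions, not values from the thesis. The sketch also takes absolute command times T0, T1 and T2; converting the anticipation parameters T0a, T1a and T2a into absolute times is not shown.

```python
import math

def phrase_response(t, alpha=2.0):
    """Gp(t): impulse response of the phrase control mechanism."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_response(t, beta=20.0, gamma=0.9):
    """Ga(t): step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds,
                alpha=2.0, beta=20.0, gamma=0.9):
    """Return F0 in Hz at time t (seconds).

    phrase_cmds: list of (Ap, T0) pairs, impulse magnitude and onset time.
    accent_cmds: list of (Aa, T1, T2) triples, step amplitude, onset, offset.
    """
    log_f0 = math.log(fb)  # base frequency Fb sets the floor of the contour
    for ap, t0 in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0, alpha)
    for aa, t1, t2 in accent_cmds:
        log_f0 += aa * (accent_response(t - t1, beta, gamma)
                        - accent_response(t - t2, beta, gamma))
    return math.exp(log_f0)

# Illustrative commands only: one phrase command and one accent command.
phrase_cmds = [(0.5, 0.0)]
accent_cmds = [(0.4, 0.3, 0.6)]
contour = [fujisaki_f0(i * 0.01, 80.0, phrase_cmds, accent_cmds)
           for i in range(200)]  # 2 s sampled every 10 ms
```

Because the components add in the logarithmic domain, each command scales F0 multiplicatively, and the contour decays back to the base frequency once all commands have died out.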

The final perceptual test, using the category-judgment method and the MOS scale, resulted in a score of 4.6 for the natural speech, 4.4 for the estimated F0, 4.2 for the predicted durations, 3.1 for the predicted F0 and 2.9 for the complete proposed model (duration and F0 models). The MOS for the complete model is at the ‘Fair’ level.
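As a small worked example of how such scores relate to the five-point opinion scale, the sketch below averages a set of ratings and maps the reported scores to the nearest category label. The labels follow the usual ITU-T P.800 absolute category rating scale; rounding to the nearest category and the helper names are assumptions made here, not a procedure described in the thesis.

```python
# Five-point opinion scale labels (ITU-T P.800 absolute category rating).
MOS_LABELS = {1: "Bad", 2: "Poor", 3: "Fair", 4: "Good", 5: "Excellent"}

def mean_opinion_score(ratings):
    """Average of individual listener ratings, each an integer from 1 to 5."""
    return sum(ratings) / len(ratings)

def nearest_label(mos):
    """Map a MOS value to the closest category (rounding half up, assumed)."""
    level = min(5, max(1, int(mos + 0.5)))
    return MOS_LABELS[level]

# Scores reported in the abstract for each test condition.
reported = {
    "natural speech": 4.6,
    "estimated F0": 4.4,
    "predicted durations": 4.2,
    "predicted F0": 3.1,
    "complete model": 2.9,
}
labels = {name: nearest_label(score) for name, score in reported.items()}
```

Under this rounding assumption the complete model's 2.9 falls in the 'Fair' category, which agrees with the level stated above.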

Key words: TTS systems, speech synthesis, prosody, intonation, timing, F0, modeling, European Portuguese.

Resumo

This work presents the development of a prosody system for European Portuguese (EP) for use in text-to-speech (TTS) conversion systems. Basically, these systems read a written text aloud automatically and consist of a sequence of several modules. These modules implement the pre-processing of the input text, the phonetic transcription and the supra-segmental processing, which consists of introducing prosodic patterns. The prosodic features are responsible for marking a communicative intention and for giving naturalness to the way the text is read. These features consist of the imposition of a rhythm, characterized by the segmental durations and pauses, of an intonation, described by a fundamental frequency (F0) curve, and of the intensity curve.

The so-called preparatory work, which was fundamental for the study and development of the system, is presented first. It begins with a preliminary study of the tonic syllable. This study identifies the ranges of variation of the F0, duration and intensity parameters in the tonic syllable in several contexts. Then the FEUP-IPB DB speech database used in the subsequent studies is presented. This EP speech database is labelled at the phoneme, word, sentence and F0 levels. Next, two syllable-splitting algorithms are presented, one for written text and one for the phoneme sequence. This chapter ends with a proposed set of rules for automatically performing the phonetic transcription of the most problematic graphemes in EP.

The proposed prosody model consists of several sub-models, specifically a segmental duration model, which predicts the durations of the segments, and a model that predicts the F0 contour.

Two alternatives, based on artificial neural networks (ANNs), are proposed for predicting the segmental durations.

The first proposal consists of one ANN carefully selected with regard to its architecture and type, as well as the set of features used in the input vector, always with the aim of minimizing the error between predicted and measured durations. The second proposal, called the alternative model, rests on the same assumptions as the first, but with one ANN dedicated to predicting the duration of each phoneme, for a total of 44 ANNs. This model achieved better results than the previous one.

A model for the insertion of pauses and the prediction of their durations is also proposed, based on a preliminary study of the database used.

The proposed model for predicting the F0 contour is based on the Fujisaki model and is divided into two sub-models: one predicts the parameters of the phrase commands (PCs) and the other predicts the parameters of the accent commands (ACs).

The reference PCs and ACs were manually estimated in 101 paragraphs of the FEUP-IPB database so as to minimize the error between the estimated and measured F0 contours.

The prediction of the PCs is carried out in two stages. The first consists of an algorithm that inserts PCs associated with the text, based on a mathematical model obtained from the experimental results. The second predicts the amplitude of the PCs, Ap, and their anticipation relative to their initial position, T0a. This anticipation makes it possible to determine their exact location in the speech signal. These two parameters are predicted with ANNs.

A strong association between ACs and syllables was found. This association led to the adoption of the methodology of predicting ACs associated with syllables. Thus, the AC model consists of one ANN that predicts the existence of an AC associated with the syllable and three more that predict the parameters amplitude (Aa) and anticipation of the onset (T1a) and offset (T2a) instants.

The final perceptual tests, using the category-judgment method on the MOS scale, resulted in a score of 4.6 for the original speech, 4.4 for the estimated F0, 4.2 for the predicted durations, 3.1 for the predicted F0 and 2.9 for the complete model (duration and F0 models). The final value for the complete model is at the ‘Fair’ level.

Keywords: TTS systems, speech synthesis, prosody, intonation, rhythm, F0, modelling, Portuguese.

Résumé

This PhD thesis presents the development of a prosody system for European Portuguese (EP) for text-to-speech (TTS) applications. Fundamentally, TTS systems utter a text automatically and consist of a sequence of several modules. These modules implement the pre-processing of the text, the phonetic transcription and the supra-segmental processing, which consists of the inclusion of prosodic patterns. Prosody is responsible for the communicative intention and ensures naturalness in the spoken output. The prosodic features consist of the imposition of the timing, characterized by the segmental durations and pauses, the intonation, characterized by the fundamental frequency (F0) curve, and the intensity curve.

The so-called preparatory work, which was fundamental for the study and development of the system, is presented at the beginning. It starts with a preliminary study of the tonic syllable. This study identifies the ranges of variation of the F0, duration and intensity parameters in the tonic syllable in various contexts.

Then the FEUP-IPB DB speech database used in the subsequent studies is presented. This EP speech database is labelled at the level of the phoneme, word, sentence and F0. Next, two syllable-splitting algorithms are presented, for written text and for the phoneme sequence. This chapter of the thesis ends with a proposed set of rules for automatically performing the phonetic transcription of the most problematic graphemes in EP.

The proposed prosody model is composed of two sub-models: the duration model, which predicts the segmental durations, and the model that predicts the F0 contour.

Two proposals, based on artificial neural networks (ANNs), for predicting the segmental durations are presented.

The first consists of one ANN carefully chosen with regard to its architecture and type, as well as the set of input features, with the objective of minimizing the error between predicted and measured durations. The second proposal, called the alternative model, is based on the same considerations as the first but uses one dedicated ANN for each phoneme, for a total of 44 ANNs. The alternative model, with dedicated ANNs, improved the final performance.

A model for the insertion of pauses and the prediction of their durations is proposed, based on a preliminary study of the FEUP-IPB database.

The proposed model for predicting the F0 contour is based on the Fujisaki model and consists of two sub-models. One predicts the parameters of the Phrase Commands (PCs) and the other predicts the parameters of the Accent Commands (ACs). The reference PCs and ACs were manually estimated in 101 paragraphs of the database under the criterion of minimizing the error between the estimated and measured F0 curves.

The prediction of the PCs is performed in two stages. The first stage is carried out by an algorithm responsible for inserting the PCs linked to the text, based on a mathematical model obtained from experimental observations. The second stage of the model predicts the PC amplitude, Ap, and anticipation, T0a, relative to the initial position. The anticipation allows the exact position in the speech signal to be determined. Both parameters are predicted with ANNs.

A strong connection between ACs and syllables was found in the database. This strong connection justified the adopted methodology of predicting ACs associated with syllables. Consequently, the AC model consists of one ANN that predicts the existence of an AC associated with the syllable and three other ANNs that predict the amplitude parameter (Aa) and the anticipation of the onset (T1a) and offset (T2a).

The final perceptual test, using the category-judgment method and the MOS scale, resulted in a score of 4.6 for natural speech, 4.4 for the estimated F0, 4.2 for the predicted durations, 3.1 for the predicted F0 and 2.9 for the complete proposed model (duration and F0). The MOS for the complete model is at the ‘Fair’ level.

Keywords: TTS systems, speech synthesis, prosody, intonation, rhythm, F0, modelling, European Portuguese.

Acknowledgements

I would like to thank my supervisor, Diamantino Freitas, for his support and advice, always given with gentleness, and for the opportunities to be involved in national and international projects and to cooperate with other international researchers and research laboratories.

My gratitude also goes to the colleagues of the LPF-ESI laboratory who were directly or indirectly involved in the work, namely Daniela Braga, Paulo Gouveia, Maria João Barros, Vagner Latsch and Helder Ferreira. Thanks to Constança Homem, who helped me with some translations, and to Esmeralda Miguel for the printing. I also express my thanks to Irene Fernandes for all her diligence.

I would like to pay homage to Prof. Carlos Espain, a senior member of LPF-ESI and a friend, who left us during this work.

Special thanks to Prof. Hiroya Fujisaki from the University of Tokyo for the important discussions and advice during the development work, and for reviewing part of this document.

I am also very grateful to Daniel Hirst from CNRS, Nick Campbell from ATR and Mark Huckvale from UCL for their important comments on and reviews of this document.

I am grateful also to the participants in the COST 258 Action “Naturalness of Synthetic Speech” in the name of the chairperson, Eric Keller, for the shared experiences and contacts with some of the most important European researchers and research laboratories in this topic.

I also appreciated the discussions related to my work with Alex Mohanagan, Eduardo Banga from the University of Vigo, Hansjörg Mixdorff from the University of Berlin, J.-P. Martens from the University of Gent, Isabel Trancoso and Luis Oliveira from INESC-L2F, the colleagues of the Univ. of Aveiro, namely Lurdes Moutinho and António Teixeira, and João Veloso from FLUP. The discussions with Luis Calôba, Manuel Seixas, Fernando Gil and Sérgio Netto from UFRJ-LPS were also welcome.

My thanks to the colleagues of my Department who allowed me to be released from teaching duties in 1999-2000 to develop this work. My appreciation to the directors of the ESTiG-Bragança who gave me the conditions to develop this work, namely Rolando Dias and José Adriano. I would like to pay homage to the memory of the director in office at the beginning of this work, Prof. Alcínio Miguel. I am also grateful to the dean of the Polytechnic Institute of Bragança, Prof. Dionísio Gonçalves, for authorizing the application for the PRODEP scholarship.

I also express my thanks to RDP Porto for providing all the technical support for recording the database, and to the speaker, Diamantino Guedes, who gave his voice and attention to the recording of the database.

I would like to acknowledge my gratitude to the colleagues that participated in the perceptual tests.

A special hug to my friends who helped me unwind during the coffee breaks with their always interesting talks, particularly to my dear friend Paula Odete, and to Luis Alves, Carlos Balsa, Alcina, Ana Moura, Henrique Gonçalves, Avelino Marques, João Nunes, Florbela, Ramiro Martins, Fernando Monteiro, João Ribeiro, Pedro Oliveira and many others.

Recognition for my canary friends that allowed me to think beyond the PhD, keeping my mind healthy (I think).

Finally, last but not least, my thanks to my beloved wife Lina for any extra burden of tasks, responsibilities and fatigue that she has been exposed to during this work, and to Monica. My biggest thanks go to my lovely Dorothy Rita for making me proud every day I play with her.

This work was financed by the program Prodep (5.3/N/199.006/00).

Contents

Abstract
Resumo
Résumé
Acknowledgements
Contents
List of Figures
List of Tables
Abbreviations

1 INTRODUCTION

1.1 Foreword
1.2 What Is This Thesis About?
1.3 Motivation and Objectives
1.4 FEUP TTS System for European Portuguese
1.4.1 Pre-processing of text module
1.4.2 Linguistic analysis
1.4.3 Phonetic transcription of text
1.4.4 Prosody pattern determination
1.4.5 Production of speech signal waveform
1.5 Organization Aspects of the Thesis
1.6 Original Contributions

2 PREPARATORY WORK

2.1 Introduction
2.2 Preliminary Prosodic Study of the Tonic Syllable
2.2.1 Introduction
2.2.2 Method
2.2.2.1 Corpus
2.2.2.2 Recording conditions
2.2.2.3 Signal Analysis
2.2.3 Analysis and results
2.2.3.1 Fundamental frequency
2.2.3.2 Duration

2.2.3.3 Intensity
2.2.4 Comments and conclusion
2.2.4.1 Future developments
2.3 Speech Corpus - FEUP-IPB Database
2.3.1 Introduction
2.3.2 Speech corpus
2.3.3 Sound segmentation and labelling
2.3.4 Characteristics
2.3.5 Phonetic changing phenomena in database
2.3.5.1 Dialectal changing
2.3.5.1.1 “Dialectal slips”
2.3.5.2 Contextual changing
2.3.5.2.1 Suppressions or reductions
2.3.5.2.2 Vowel quality transformations
2.3.5.2.3 Additions
2.3.5.2.4 Allophones
2.3.5.2.5 Phonetic Changes
2.3.6 Final remarks
2.4 Syllabification
2.4.1 Introduction
2.4.2 Syllable splitting of written text
2.4.2.1 Rules
2.4.2.2 Algorithm
2.4.3 Syllabic splitting of spoken text
2.4.3.1 Rules
2.4.3.2 Algorithm
2.4.4 Analysis and results
2.4.5 Conclusions
2.5 Phonetic Transcription from Text
2.5.1 Dedicated ANN to transcribe the graphemes <a> and <e>
2.5.2 Rules to transcribe graphemes <a>, <e>, <o> and <x>
2.5.2.1 Rules for grapheme <a>
2.5.2.2 Rules for grapheme <e>
2.5.2.3 Rules for grapheme <o>
2.5.2.4 Rules for grapheme <x>
2.5.3 Co-articulation rules or post-lexical rules
2.5.4 Final remarks

3 DURATION MODEL

3.1 Introduction
3.2 Other Duration Models
3.2.1 The Klatt model
3.2.2 Sum-of-Products models
3.2.3 The Jan van Santen model
3.2.4 The Keller-Zellner algorithm

3.2.5 The Campbell model
3.2.6 The Barbosa-Bailly model – Inter-Perceptual-Centre-Groups
3.2.7 Model for the Hungarian language
3.2.8 Model for the Galician language
3.2.9 Model for the Castilian language

3.3 Duration Model for Standard European Portuguese
3.3.1 Considerations on the speech database
3.3.2 Network architecture
3.3.3 Neural network training
3.3.4 Features

3.4 Model Evaluation
3.4.1 Standard deviation (σ) or (std)
3.4.2 Mean absolute error (δ)
3.4.3 Linear correlation coefficient (r)
3.4.4 Results and discussion

3.5 Alternative Model
3.5.1 Alternative model results

3.6 Pauses
3.6.1 Pause occurrence
3.6.2 Pause duration
3.6.3 Final considerations on studying pauses

3.7 Conclusion

4 FUNDAMENTAL FREQUENCY

4.1 Introduction

4.2 The Fujisaki Model
4.2.1 Phrase component
4.2.2 Accent component

4.3 Parameters Estimation of Fujisaki Model
4.3.1 Tool to support the manual estimation of Fujisaki model parameters
4.3.2 Parameters estimation process
4.3.3 Evaluation of the estimated F0 contour in the Database

4.4 Application of the Model

4.5 Phrase Commands
4.5.1 PC positions in text
4.5.1.1 PCs linked with orthographic marks
4.5.1.2 PCs not linked with orthographic marks
4.5.1.3 Algorithm to insert PCs
4.5.2 Evaluation of preliminary inserted PC

4.5.3 Prediction of Ap and T0a parameters
4.5.3.1 Architecture of ANNs
4.5.3.2 Training the ANNs
4.5.3.3 Set of features for Ap and T0a
4.5.4 Evaluation of the prediction of Ap and T0a
4.5.5 Results of the PC model

4.6 Accent Commands
4.6.1 ANN architectures
4.6.2 Training
4.6.3 Features
4.6.4 Results of prediction with ANNs
4.6.4.1 Ca ANN results
4.6.4.2 Aa ANN results
4.6.4.3 T1a ANN results
4.6.4.4 T2a ANN results
4.6.5 Results of AC model

4.7 Results of the Predicted F0 Contour
4.7.1 F0 model
4.7.2 F0 model over segmental durations

4.8 Conclusion

5 PERCEPTUAL TESTS

5.1 Introduction

5.2 Perceptual Test of Duration Models
5.2.1 Discussion
5.2.1.1 Correlation between objective and subjective measurements

5.3 Perceptual Test of F0 Models
5.3.1 Discussion
5.3.1.1 Correlation between objective and subjective measurements

5.4 Conclusion

6 CONCLUSIONS AND FUTURE WORK

6.1 General Observations about the Tasks

6.2 General Conclusions
6.2.1 Preparatory work
6.2.2 Timing
6.2.3 Fundamental frequency
6.2.4 Complete prosody model

6.3 Final Considerations about the Error Contributions

6.4 Resume of Results and Conclusions

6.5 Future Work

BIBLIOGRAPHY

List of Figures

Fig. 1.1 – Architecture of the FEUP TTS system.
Fig. 1.2 – Sequence of 5 frames data of a diphone (F1 to F5, B1 to B5, voiced/unvoiced, amplitude).
Fig. 1.3 – Formant synthesizer block diagram. Ag and An mean the amplitude of excitation source.

Fig. 2.1 – Recorded parameters for tonic and reference syllables using the developed package for analysis. Top graph: waveform signal of the word “café” and its classifications, in red as 1 – silence; 2 – unvoiced; 3 – mixed; 4 – voiced. Middle graph: F0. Bottom graph: Intensity.
Fig. 2.2 – Relative variation of F0 in tonic syllable (95% confidence).
Fig. 2.3 – Standard deviation of F0 variation between the three speakers.
Fig. 2.4 – Relative duration of tonic syllable (95% confidence).
Fig. 2.5 – Standard deviation of average duration between the three speakers.
Fig. 2.6 – Average intensity variation of tonic syllable for all speakers (95% confidence).
Fig. 2.7 – Standard deviation of average intensity variation between the three speakers.
Fig. 2.8 – Above: representation of the acoustic signal in the phoneme sequence [lej] in the word ‘lei’ – ‘law’. Below: spectrogram.
Fig. 2.9 – Relative frequencies of the segments in the corpus.
Fig. 2.10 – Illustration of the speech rate for the different texts (here represented by the inverse, that is, average time per segment). The figure shows the accumulated duration of elapsed segments. Track one is displayed using a solid line, track two using a dotted line, and so on for the 7 tracks.
Fig. 2.11 – Flow chart for one-word syllabic splitting of a written text. V – vowel; C – consonant; ... – any sequence of graphemes; . – syllable boundary; ? – grapheme not determined yet; bold – grapheme already stored in the output string; underline – grapheme pointed to by index i.
Fig. 2.12 – Flowchart of a spoken text syllabic splitting.
Fig. 2.13 – Previous processing blocks of phonetic transcription.
Fig. 2.14 – Processing of phonetic transcription.

Fig. 3.1 – The van Santen category-distinction tree.
Fig. 3.2 – Relative frequency (%) of the phonemes in the training and test sets.
Fig. 3.3 – Network architecture for this model.
Fig. 3.4 – Error evolution in the performance function in the training and validation sets during a training session.

Fig. 3.5 – Sequence of processing blocks prior to the development stage of the duration model and its application to TTS.
Fig. 3.6 – Error histogram and normal distribution curve for every segment in both sets.
Fig. 3.7 – Normal probability distribution and absolute error curve for every segment in both sets.
Fig. 3.8 – Measured, predicted and average duration contours for the phoneme sequence in the sentence “Conhece a situação na pele. Aprendeu-a na idade em que se aprende e se não esquece.”. Meaning ‘Knows the situation on the skin. Learned it in the ages when we learn and don’t forget.’
Fig. 3.9 – Measured and predicted duration contours for the paragraph “Que igualdade perante a lei? João Amaral”. Meaning ‘How equal before the law? João Amaral’.
Fig. 3.10 – Histogram of measured and predicted durations for phoneme [a].
Fig. 3.11 – Histogram of measured and predicted durations for the burst part of phoneme [t].
Fig. 3.12 – Error histogram and normal distribution curve for all segments in both sets with the alternative model.
Fig. 3.13 – Normal probability distribution and absolute error curve for all segments in both sets with the alternative model.

Fig. 4.1 – Example of a ToBI intonation representation (taken from http://www.ling.ohio-state.edu/~tobi/).
Fig. 4.2 – Processes by which various types of information are manifested in the segmental and supra-segmental features of speech. (Figure published in [Fujisaki, 2002], edited with courtesy of Hiroya Fujisaki).
Fig. 4.3 – Functional model for the process of generating F0 contours. (Figure published in [Fujisaki, 2002], edited with courtesy of Hiroya Fujisaki).
Fig. 4.4 – Phrase component for PCs magnitude Ap=0.15, 0.30, 0.50 and 0.80 with α=2 /s, logarithmically added with Fb=75 Hz.
Fig. 4.5 – Phrase components for PCs with α=1, 2, 3 and 4 /s with Ap=0.5, logarithmically added with Fb=75 Hz.
Fig. 4.6 – Accent components for ACs with T1=0 s, T2=0.15 s, beta=30 /s and Aa=0.15, 0.30, 0.50 and 0.80, logarithmically added with Fb=75 Hz.
Fig. 4.7 – Accent components for ACs with beta=30 /s, Aa=0.60, T1=0 s, and T2=0.05, 0.1, 0.15 and 0.2 s, logarithmically added with Fb=75 Hz.
Fig. 4.8 – Accent components for ACs Aa=0.60, T1=0 s, T2=0.15 s and beta=20, 25, 30 and 35 /s, logarithmically added with Fb=75 Hz.
Fig. 4.9 – Example of the data provided by the tool to manually estimate the Fujisaki parameters.
Fig. 4.10 – Window with menus of the tool to manually estimate the Fujisaki parameters.
Fig. 4.11 – Example of the estimated parameters in first (black) and second (red) phases.
Fig. 4.12 – Example of the AC parameters correction done in the third phase of parameters estimation.

Fig. 4.13 – Flow chart of the algorithm to connect ACs to syllables.
Fig. 4.14 – Organization structures. On the top, the orthographic marks.
Fig. 4.15 – Representation of eligible positions, T0E, and anticipation, T0a, of PCs.
Fig. 4.16 – Histogram and Gaussian approximation of distances from PCs not linked with orthographic marks to previous PCs and next PCs.
Fig. 4.17 – Weight for length of previous word.
Fig. 4.18 – Flow chart to insert PC in text.
Fig. 4.19 – Eligible area and candidate positions.
Fig. 4.20 – Application example of the algorithm.
Fig. 4.21 – Comparison of histograms of estimated and inserted PC distances.
Fig. 4.22 – Comparison of estimated and inserted PC positions. Black arrows are the estimated PCs; magenta arrows are the inserted PCs.
Fig. 4.23 – Evolution of ANNs performances in test set, over the used extension of the training set.
Fig. 4.24 – Best linear fit between target (T) and predicted (A) values for Ap (left) and T0a (right).
Fig. 4.25 – Probability error in test set for predicted Ap and T0a. Lines show the adjusted normal probability distribution with a) µ=0.093, σ=0.075 and b) µ=0.148, σ=0.097.
Fig. 4.26 – Application example of the insertion PC model. PCs and components: black – estimated; green – initial position of estimated PCs with predicted Ap and T0a; magenta – predicted with PC model.
Fig. 4.27 – Evolution of average ANNs performances in the test set, over the dimension of training set.
Fig. 4.28 – Best linear fit between target (T) and predicted (A) values for Aa (left) and probability error (|Aa_target - Aa_predicted|) in test set for predicted Aa (right); red line shows the adjusted normal probability distribution with µ=0.12 and σ=0.12.
Fig. 4.29 – Best linear fit between target (T) and predicted (A) values for T1a (left) and probability error (|T1a_target - T1a_predicted|) in test set for the predicted T1a values (right); red line shows the adjusted normal probability distribution with µ=0.022 (s) and σ=0.024 (s).
Fig. 4.30 – Best linear fit between target (T) and predicted (A) values for T2a (left) and probability error in test set for predicted T2a (right); red line shows the adjusted normal probability distribution with µ=0.028 (s) and σ=0.026 (s).
Fig. 4.31 – Result of predicted ACs. In black, the estimated PCs, ACs and the associated F0 contour. In magenta, the predicted ACs, based on estimated PCs, and the corresponding F0 contour. Vertical lines represent word boundaries.
Fig. 4.32 – Application of the complete F0 model. In black the estimated PCs, ACs and F0 contour. In magenta the predicted ACs, PCs and F0 contour.
Fig. 4.33 – Application of the complete F0 model over the modified duration with the duration’s model. In magenta the predicted ACs, PCs and F0 contour.

Fig. 5.1 – Average opinion values of each subject for the 5 stimuli.
Fig. 5.2 – Average opinion values by paragraph for the 5 stimuli.
Fig. 5.3 – Analysis of opinion scores.
Fig. 5.4 – Comparison of measurement indicators by paragraph for Alternative Model.
Fig. 5.5 – Comparison of measurement indicators by paragraph for Model.
Fig. 5.6 – Comparison of measurement indicators by paragraph for No model.
Fig. 5.7 – Average opinion values for each subject in the 9 stimuli.
Fig. 5.8 – Average opinion values for each paragraph in the 9 stimuli.
Fig. 5.9 – Analysis of opinion scores by stimuli. Stimuli from 0 to 8 correspond to: 0 – No model; 1 – Natural; 2 – Durations; 3 – Estimated F0; 4 – Predicted ACs based on estimated ACs and PCs; 5 – Predicted ACs with estimated PCs; 6 – F0 Model; 7 – Duration + F0 model with Aa*0.75; 8 – Durations + F0 model.

Fig. 6.1 – PC and AC error components in stimuli 5 and 6, considering orthogonal axis.
Fig. 6.2 – PC and AC error components in stimuli 5 and 6, considering non-orthogonal axis.

List of Tables

Table 2.1: Consistent F0 variation in tonic syllable, values in %.
Table 2.2: Duration rules for tonic syllables, values in %.
Table 2.3: Change of intensity in the tonic syllable, values in dB.
Table 2.4: Summary of qualitative trends (varying the tonic position from beginning to the end of word) for all word positions in the phrase.
Table 2.5: Phoneme, word and sentence level labels used in labelling the database.
Table 2.6: Percentage of occurrences, average duration and standard deviation of all phones, considering general positions (including tonic) and just tonic syllable positions.
Table 2.7: Examples of dialectal slips.
Table 2.8: Examples of non-tonic vowels suppressions.
Table 2.9: List of phones used in Phonetic transcription.
Table 2.10: rules for conversion of grapheme <a>, presented by priority order.
Table 2.11: rules for conversion of grapheme <e>, presented by priority order.
Table 2.12: rules for conversion of grapheme <o>, presented by priority order.
Table 2.13: Co-articulation rules.

Table 3.1: ANN architectures and performances.
Table 3.2: codification of the ‘position’ feature in relation to the tonic syllable.
Table 3.3: Codification of the ‘syllable type’ and ‘previous syllable type’ features.
Table 3.4: Codification of the ‘syllable vowel’, ‘previous syllable vowel’ and ‘following syllable vowel’ features.
Table 3.5: Final feature set, the corresponding importance and the correlation with the segmental durations.
Table 3.6: Correlation between the segments and the surrounding segments with the segmental durations.
Table 3.7: Global results for the duration model.
Table 3.8: Values for each segment type (phone) in both sets: occurrence number (#); error standard deviation (σ); mean absolute error (δ); linear correlation coefficient (r); measured average (Av.) and predicted average (Pred. Av.); measured minimum value (Min.) and predicted minimum value (Pred. Min.); measured maximum value (Max.) and predicted maximum value (Pred. Max.).
Table 3.9: Global results for the alternative duration model.
Table 3.10: Values for each segment type (phone) in both sets of the alternative model: occurrence number (#); error standard deviation (σ); mean absolute error (δ); linear correlation coefficient (r); measured average (Av.) and predicted average (Pred. Av.); measured minimum value (Min.) and predicted minimum value (Pred. Min.); measured maximum value (Max.) and predicted maximum value (Pred. Max.).
Table 3.11: Statistics on pause occurrence.
Table 3.12: Parameters for the pause duration predictor.
Table 3.13: Best results for the intra-paragraph pause duration predictor.
Table 3.14: Marker type results for the pause duration predictor.

Table 4.1: Constant parameters.
Table 4.2: Root mean squared error and correlation coefficient between estimated F0 and post processed original F0 (non zero values).
Table 4.3: Numbers of occurrences of orthographic punctuation marks, associated PCs and percentages of punctuation marks with PCs associated.
Table 4.4: Statistical data of distance to previous and next PC.
Table 4.5: Weights for type of word.
Table 4.6: Comparison between estimated and inserted PCs. The number of PCs, the minimum, maximum and average distances and standard deviations in seconds.
Table 4.7: Numbers of correctly inserted PCs (C), insertion errors (I), deleted PCs (D), the recall rate (R) and precision rate (P), at a tolerance time distance X, from the labelled PCs.
Table 4.8: Correctly inserted (C), deletion errors (D), insertion errors (I), recall rate (R) and precision rate (P), for the positions of inserted PC compared to the positions of estimated PC considering the eligible position.
Table 4.9: Best performance (correlation coefficient), architectures and training algorithms to predict Ap.
Table 4.10: Best performance (correlation coefficient), architectures and training algorithms to predict T0a.
Table 4.11: Set of features and their correlations r with Ap and T0a.
Table 4.12: Linear correlation coefficient obtained in the test set for the predicted Ap and T0a values, relative to the estimated (labelled) values.
Table 4.13: Linear correlation coefficient between AC parameters calculated along the labelled database.
Table 4.14: List of features and their correlations, r, with Ca, Aa, T1a, and T2a.
Table 4.15: best performances (A and r) in Ca ANN with different architectures, activating functions, training algorithms, set of features, limit of decision L and output processing.
Table 4.16: Performance values for the best Ca ANN.
Table 4.17: best performance (correlation coefficient) of architectures to predict Aa.
Table 4.18: best performance (correlation coefficient) of architectures to predict T1a.
Table 4.19: best performance (correlation coefficient) of architectures to predict T2a.
Table 4.20: Final performance of prediction the model parameters for ACs.

Table 5.1: Portuguese and respective translation of the 5 paragraphs used in the perceptual test, and respective number of segments.

Table 5.2: Correlation coefficient, r, and rmse between original and the other three stimuli in each paragraph.

Table 5.3: Mean Opinion Score (MOS) and standard deviation of the perceptual test.

Table 5.4: Significance level between pairs of stimuli.

Table 5.5: Measurement indicators for models, by paragraph.

Table 5.6: Correlation coefficient along paragraphs between measurement indicators.

Table 5.7: Mean values along paragraphs of evaluation measurements.

Table 5.8: Correlation between mean values of evaluation measurements.

Table 5.9: Portuguese and respective translation of the 5 paragraphs used in the perceptual test.

Table 5.10: Objective measurements of each stimulus by paragraph. For each paragraph the first line represents the correlation coefficient and second line the rmse.

Table 5.11: Mean Opinion Score (MOS) and standard deviation of the perceptual test.

Table 5.12: Significance level between pairs of stimuli. Stimuli from 0 to 8 have the same correspondence as the ones in Fig. 5.9.

Table 5.13: Indicator measurements for stimuli by paragraph.

Table 5.14: Correlation coefficient along paragraphs between measurement indicators.

Table 5.15: Mean values along paragraphs of indicator parameters.

Table 5.16: Correlation between mean values along models of indicator parameters.

Table 6.1: Summary of the average (over the 5 paragraphs) evaluation parameters in the 4 stimuli types used for perceptual tests.

Abbreviations

Aa – Amplitude of AC;

ABU – Acoustic Building Unit;

AC – Accent Command;

ANN – Artificial Neural Network;

Ap – Magnitude of phrase command;

Ca – ANN that predicts the amplitude of the AC;

CA – ANN that predicts the existence of AC associated to the syllable;

CEFAT – Centro de Estudos de Física, Acústica e Telecomunicações;

EP – European Portuguese;

F0 – Fundamental frequency;

FEUP – Faculty of Engineering of the University of Porto;

FEUP-IPB DB – FEUP-IPB speech DataBase;

FEUP-TTS – FEUP Text-To-Speech system;

LPF-ESI – Research Laboratory for Speech Processing, Electroacoustics, Signals and Instrumentation of FEUP;

LSS – Laboratory of Signals and Systems research unit of FCT, hosted at LPF-ESI;

MOS – Mean Opinion Score;

PC – Phrase Command;

r – Linear correlation coefficient;

rmse – Root mean squared error;

std – Standard deviation;

T0 – Onset time of PC;

T0a – Anticipation of PC;

T0E – Beginning of accent group where PC was inserted;

T1 – Onset time of AC;

T1a – Anticipation of the onset time of the AC;

T2 – Offset time of AC;

T2a – Anticipation of the offset time of the AC;

TPML – Text Processing Markup Language;

TTS – Text-To-Speech;

UFRJ – Federal University of Rio de Janeiro;

XML – eXtensible Markup Language;

δ – Mean absolute error;

σ – Standard deviation.

1 Introduction

This introductory chapter gives a short overview of what prosody is and describes the motivations and objectives of this work. The FEUP-TTS system for European Portuguese, which will be the first host of the proposed prosody model, is briefly described. Finally, the chapter gives an overview of this document and lists the original contributions.

1.1 Foreword

This document reports and discusses the results of the work carried out in building a prosody model for European Portuguese (henceforth EP). The work was developed by the author under a PhD programme in electrotechnical and computer engineering at FEUP.

The object of study is the European Portuguese language. Portuguese belongs to the family of Romance languages and is the fifth most widely spoken language in the world, with more than 200 million speakers. It is the official language of Portugal (10 million) in Europe; Brazil (175 million) in South America; Angola (10 million), Mozambique (20 million), Guinea-Bissau (1.3 million), São Tomé and Príncipe (165 thousand) and Cape Verde (400 thousand) in Africa; and East Timor (800 thousand) in Asia. Each country has its own variety of Portuguese. Although any variety can be understood in any of these countries, the pronunciations differ, and a TTS system speaking the Brazilian variety is not easily accepted in Europe, nor the reverse.

The lack of resources for speech science in this language, such as tools or labelled databases, inevitably delayed the main objectives, because those resources first had to be created and prepared for this work.

The initial main objective of the work was to improve the naturalness of synthetic speech. This objective led to a major effort in prosodic modelling, since prosody was the main weakness of the existing TTS system.

There is no unique definition of prosody, but a broadly accepted view was summarised by Ladd and Cutler [1983] in two categories, “concrete” and “abstract”. The “concrete” definition relies on objective, physically measurable acoustic parameters such as F0, duration and intensity. The “abstract” definition takes the linguistic point of view and concerns structure, describing prosody “as phenomena that involve phonological organization at levels above the segment”. That is why prosody is considered a suprasegmental category. According to the caricature drawn by those authors, the first definition is closer to objective measurements and the second to theory building. A third definition, presented by Fujisaki [1997], combines the two previous ones and brings together, in a welcome way, the work usually done by engineers and linguists (who do not always work under the same notion of “prosody”):

“Prosody is the systematic organization of various linguistic units into an utterance or a coherent group of utterances in the process of speech production. Its organization involves both segmental and suprasegmental features of speech, and serves to convey not only linguistic information, but also paralinguistic and non-linguistic information.”

This definition is the most consistent with the present work. Nevertheless, the work concentrates on two suprasegmental features, duration and F0, which are broadly recognised as the most important perceptually. The proposed models for duration and F0 capture only part of the linguistic information, and neither paralinguistic nor non-linguistic information is available. No syntactic, morphological or semantic information is used, because no tool of the required quality was available to extract it automatically from text for use with the FEUP-TTS system. Paralinguistic and non-linguistic information is not related to the text at all, only to the speaker. Since not all the information conveyed by the speaker is used, the prosody model cannot be expected to reproduce exactly the same patterns as the speaker.

1.2 What Is This Thesis About?

This work attempts to produce a prosody module for TTS systems in EP. A TTS system is a program or machine that automatically converts written text into speech. To be intelligible, the synthetic speech must reproduce the correct sounds. To be accepted, however, the system must also produce the sequence of sounds in a natural, human-like way. Naturalness, a characteristic of human speech, can also improve the intelligibility of synthetic speech.

The same sequence of basic sounds can be produced with very different characteristics, depending on the intention of the speaker. These characteristics are called prosodic features and consist of the segmental durations (the duration of each sound), the intensity variation and the tone pattern. A good prosodic pattern is essential to reach naturalness in synthetic speech. In the extreme, a different prosodic pattern can even change the meaning of the utterance. The same utterance can be produced naturally in different ways, but not all prosodic patterns are natural, and an unnatural prosodic pattern can be a reason for rejecting the synthetic speech.

Every utterance carries a prosodic pattern, even if it is a theoretical constant pattern, which, incidentally, is also not well accepted. It is well known in the scientific community that timing and the tone, pitch or F0 pattern are the most important features of prosody.

This work mainly consists of a proposed model that automatically produces the segment durations and the F0 pattern for written EP text. These prosodic patterns vary widely with the neighbouring sounds, the meaning of the utterance, the sequence of words, the intention, the type of sentence and its length.

For humans, producing an utterance with natural prosody is very simple. People speak without thinking about prosody. They do not pay attention to the duration of segments, the intensity or the tone, nor do they think about sequences of segments. All this information is processed intuitively by the human mind. However, humans still cannot build systems that perform the same processing.

This is where the author found some of the reasons to use ANNs to produce prosody. The fundamentals of ANNs are based on human neurons [Rumelhart and McClelland, 1986]. ANNs, like humans, can produce a result based on previous experience. For instance, when the author's daughter was learning to speak, at about one year of age, she could not yet utter words correctly or know their meanings, but she already produced perfect prosody to express some feelings. Her neurons were in an intense learning process, and prosody was learned before vocabulary or sounds. This was one of the reasons for the heavy use of ANNs in this work. Another, more objective, reason was the ability of ANNs to process the input very quickly and determine the right output once the training phase is complete. Other pattern recognition techniques could be used to complement ANNs or even replace them, but this was considered outside the scope of the present thesis.

1.3 Motivation and Objectives

The author started in 1994 with the “Speech-Aid” project, which aimed to develop a speech interface for speech-impaired persons. The main work consisted of developing the Portuguese version of the MULTIVOX TTS system [Teixeira et al., 1998], that is, creating the linguistic and acoustic modules for EP. The linguistic module comprised the pre-processing block, the definition of the set of phonemes, the set of grapheme-phoneme conversion rules and the introduction of some prosodic markers at the phonetic code level. The work on the acoustic module consisted of creating the database of Acoustic Building Units (ABUs), rules for concatenating those ABUs to produce the acoustic structures of the phonemes, some basic prosodic rules, and the superimposition of those prosodic elements to correct the original F0, duration and intensity. The phonetic transcription rules already introduced several prosodic markers, such as pause deletion, pause insertion and stressed syllable markers. Some elementary prosodic patterns were implemented for declarative and interrogative sentences. The system achieved acceptable intelligibility for applications aimed at speech-impaired persons.

Later, in 1996, a follow-up project aimed at improving the MULTIVOX TTS system. This second version removed several restrictions of the first, but the main improvement was in the acoustic module, which introduced a formant-coded database of human speech. This second version allowed the use of a large number of prosodic markers, which demanded more prosodic knowledge to deal with them. A strong need for a prosodic model was felt in this project.

Meanwhile, the author had gained experience in speech analysis during his Master's dissertation [Teixeira, 1995], in which he developed several analysis tools, such as automatic extraction of F0, formant frequencies and respective bandwidths; classification into voiced, unvoiced, mixed and silence; and formant synthesis processing, among others. The preparation of this Master's dissertation took one year of research.

The field of the PhD work was defined by the strong need to improve the naturalness of European Portuguese TTS systems, including the indispensable prosody module.

The original main objective of this work was defined as the improvement of the naturalness of EP synthetic speech through prosodic and acoustic modelling of the EP language. The objects of the work were:

• prosodic models to be implemented in a TTS system;

• EP linguistic corpus for the prosody studies;

• EP phonetic corpus for extraction of prosodic parameters;

• set of instrumental and computational tools for speech signal analysis.

In order to fulfil the main goal, the originally planned tasks consisted of:

• study of the state of the art concerning TTS systems and prosody models;

• study and development of a consistent set of instrumental tools of analysis and synthesis for prosodic studies;

• creation of new models or adaptation of existing ones to the EP language, their validation and comparison with other models already known;

• study and development of a new TTS system for EP to incorporate the previously planned developments in speech naturalness at the acoustic and prosodic levels.

Meanwhile, the author participated in several projects as a member of, first, the CEFAT research laboratory and then the LSS research laboratory, namely:

• the project “Processamento Automático do Português” – ‘Automatic Processing of Portu-guese’, with the Universidade Federal do Rio de Janeiro (UFRJ) team [Souza et al., 1999];

• the successor of the previous project, the “SIRI” also with the UFRJ team;

• the “ANTÍGONA” project, which aimed to develop a speech interface for electronic commerce [Freitas et al., 2002];

• and, most important for this work, the COST 258 action “Naturalness of Synthetic Speech”, where the author and the laboratory had the opportunity to cooperate with several other European speech research laboratories and researchers [Keller et al., 2002].

Participation in these projects, mainly in the COST 258 action, provided a good background in the state of the art and helped to clarify the original objectives, narrowing the main purpose to a prosody model.

The participation of the laboratory in the “ANTÍGONA” project allowed the development of a robust EP TTS system, named FEUP-TTS, which is briefly described below.

With the aim of improving the naturalness of a TTS system, several modules were found to need improvement. Improvements were therefore made to those modules, some under this PhD work and others under the projects by other colleagues. The pre-processing module received several improvements made by Hélder Ferreira [Report of ANTIGONA project, 2001], [Braga et al., 2003] and [Ferreira, 2003]. The linguistic module received several improvements made under this work in collaboration with other researchers, namely Paulo Gouveia and Daniela Braga; these concern phonetic transcription, syllable division and labelling of the speech database, and are reported in the next chapter.

The acoustic module was also improved, and two alternatives exist. The first is a formant synthesiser (formant module) with five formants, implemented jointly with Vagner Latsch [Report of ANTIGONA project, 2001]. The second uses pitch-synchronous concatenative techniques and was developed by Barros [2002].

The prosody module is presented in this work. It consists essentially of a model that predicts segmental durations, based on ANNs, and an F0 prediction scheme based on the Fujisaki model, whose parameters are predicted from text, also by means of ANNs.
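To make the role of the Fujisaki parameters concrete, the following is a minimal sketch of the commonly cited formulation of the model, in which ln F0 is the sum of a base value, the responses to phrase commands (T0, Ap) and the responses to accent commands (T1, T2, Aa). The constants alpha, beta and gamma and all command values used here are illustrative assumptions, not values taken from this work, and the function names are hypothetical.

```python
import math

def phrase_response(t, alpha=2.0):
    """Gp(t): impulse response of the phrase control mechanism (assumed alpha in 1/s)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_response(t, beta=20.0, gamma=0.9):
    """Ga(t): step response of the accent control mechanism, limited by the ceiling gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds, alpha=2.0, beta=20.0, gamma=0.9):
    """F0 in Hz at time t (seconds).

    fb          -- base frequency in Hz
    phrase_cmds -- list of (T0, Ap) pairs
    accent_cmds -- list of (T1, T2, Aa) triples
    """
    ln_f0 = math.log(fb)
    for t0, ap in phrase_cmds:
        ln_f0 += ap * phrase_response(t - t0, alpha)
    for t1, t2, aa in accent_cmds:
        ln_f0 += aa * (accent_response(t - t1, beta, gamma)
                       - accent_response(t - t2, beta, gamma))
    return math.exp(ln_f0)

# Illustrative contour: one phrase command and one accent command, sampled every 10 ms.
contour = [fujisaki_f0(i * 0.01, 100.0, [(0.0, 0.5)], [(0.2, 0.5, 0.4)])
           for i in range(100)]
```

In such a scheme the ANNs would supply the command parameters, while the superposition itself stays a fixed analytical step.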

Since the prosody module was built to be integrated into the FEUP-TTS system, only automatically available information was used. Although this dissertation focuses on the duration and F0 models, several preparatory modules were fundamental and are also described, not only because they serve the prosody studies but also because they are resources for researchers of the EP language. Those resources are the FEUP-IPB database, the syllable division module and the set of phonetic transcription rules.

The heavy use of ANNs was motivated by the nature of the problems: no sets of rules are known as solutions, and a statistical tool such as an ANN can achieve good results efficiently when given a good representation and carefully prepared statistical information. ANNs make it possible to obtain a good solution from already known results alone, without knowing the underlying functional mechanism. Nevertheless, ANNs leave room for evolution towards a model in which the phenomena that govern the functional mechanism are known.

1.4 FEUP TTS System for European Portuguese

The FEUP TTS system is briefly described in this section to clarify the environment and the processing sequence in which the prosody model will first be hosted. Several researchers in the LSS laboratory contributed to the development of this TTS system, and it continues to receive contributions. For this reason no up-to-date reference can be given, but the report [Report of ANTIGONA project, 2001] contains the most complete description of the system.

Fig. 1.1 – Architecture of the FEUP TTS system.

Input: TEXT

• Pre-processing of text: conversion to plain text of numerals, acronyms, abbreviations, dates, etc.

• Linguistic analysis: morphology and syntactic structure; word, phrase and sentence boundaries.

• Phonetic transcription of text: pre-phonetic transcription (some digraphs); syllabic splitting; tonic syllable identification; table of exceptions; grapheme-phoneme rules; co-articulation rules.

• Prosody pattern determination: F0; segmental durations; intensity.

• Production of speech signal waveform: grapheme-diphone sequences conversion; concatenation; prosodic manipulation.

Output: SOUND

Another European Portuguese TTS system, DIXI, was reported by Oliveira and co-authors in [Oliveira et al., 1991, 1993] and [Oliveira, 1996]. This system is a rule-based synthesiser using the Klatt formant model. Later, the system evolved into DIXI+, a concatenation-based synthesiser [Carvalho et al., 1998].

The architecture of the FEUP TTS system can be described by the five combined modules presented in Fig. 1.1.

1.4.1 Pre-processing of text module

This is the input block of the TTS system. It receives the text to be synthesised through the Speech API. Pre-processing is a fundamental block of a TTS system. It converts numerals, abbreviations, acronyms, dates and other symbols into formatted text. Numerals, for instance, can be written in several forms (dates, telephone numbers, measurements, prices, etc.), and each class must be read out differently. Numbers are therefore converted in two phases: in the first phase the class is identified automatically; in the second phase the number is converted according to its class.

The first phase is implemented using linguistic context information of a morphological nature extracted from the text, including gender and number.

This classification can either immediately trigger the appropriate conversion or attach an intermediate label to the element so that it is converted later. This label triggers the second phase, in which a parser interprets the meaning and finally converts the number to its extended text form.
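The two-phase idea described above (classify first, then convert by class) can be sketched as follows. This is only an illustration under stated assumptions: the class names, the regular expressions and the helper names are hypothetical, the classification here relies on token shape and the next word rather than on the morphological context used in the actual system, and only digit-by-digit reading is implemented in the second phase.

```python
import re

# Portuguese names of the digits 0 to 9.
DIGITS = ["zero", "um", "dois", "três", "quatro",
          "cinco", "seis", "sete", "oito", "nove"]

def classify(token, next_word=""):
    """Phase 1: guess the class of a numeric token (hypothetical class set)."""
    if re.fullmatch(r"\d{1,2}/\d{1,2}/\d{4}", token):
        return "date"
    if re.fullmatch(r"\d{9}", token):
        return "telephone"
    if re.fullmatch(r"\d+", token) and next_word.lower() in ("euro", "euros"):
        return "price"
    if re.fullmatch(r"\d+", token):
        return "cardinal"
    return "unknown"

def convert(token, token_class):
    """Phase 2: expand the token according to its class."""
    if token_class == "telephone":
        # Telephone numbers are read digit by digit in this sketch.
        return " ".join(DIGITS[int(d)] for d in token)
    if token_class == "cardinal" and len(token) == 1:
        return DIGITS[int(token)]
    # Classes not covered by this sketch are returned unchanged.
    return token

label = classify("225")          # intermediate label, kept for later conversion
text = convert("225", "telephone")
```

Keeping the label separate from the conversion mirrors the option of labelling an element first and converting it later.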

After that, the text is organised into smaller units such as sentences and paragraphs. The text is also labelled with a mark-up language developed specially for this purpose and based on XML. This mark-up language, TPML, is also extended to allow the insertion of prosodic labels.

1.4.2 Linguistic analysis

This block can be understood as also including the phonetic transcription block. It is intended to extract the morphological and syntactic structure from the text. This information can be useful in the phonetic transcription and prosody blocks. Despite the effort put into the development of this block and the use of a commercial morphological analyser, it does not yet generate information reliable enough to be used in the prosody module.

The word and sentence boundaries are easily generated, but the phrase boundaries depend strongly on the syntactic structure.

The system organises its dynamic variables in a structure that is prepared to receive the information generated by the morpho-syntactic analysis and to store it for use in subsequent blocks.

In this phase, some morpho-prosodic labels should be introduced to be used in the grapheme-phoneme conversion and in the prosodic module.

1.4.3 Phonetic transcription of text

This block takes the text and the information generated by the previous block and converts them into the sequence of phonemes, organised in syllables and carrying additional information about the stressed syllable.

The pre-phonetic transcription converts some digraphs into an unambiguous phoneme representation, such as <rr> into [R], <lh> into [L], <nh> into [J], etc., in order to facilitate the subsequent processing.
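As an illustration of this step, a minimal sketch of a digraph substitution pass is given below. It covers only the three digraphs named in the text; the function name and the left-to-right scanning strategy are assumptions, and the remaining letters are simply copied through for the later grapheme-phoneme rules.

```python
# Digraphs listed in the text and their SAMPA-style codes.
DIGRAPHS = {"rr": "R", "lh": "L", "nh": "J"}

def pre_phonetic(word):
    """Replace known digraphs by a single phoneme code, scanning left to right."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2].lower()
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2          # consume both letters of the digraph
        else:
            out.append(word[i])
            i += 1
    return "".join(out)

examples = [pre_phonetic(w) for w in ("carro", "filho", "vinho")]
```

After this pass each digraph occupies a single position, so later rules can operate on one symbol at a time.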

The next task is splitting words into syllables. The rules and algorithm are described in section 2.4.

Then, the tonic syllable is determined using a set of rules already described in [Teixeira, 1995].

After that, the grapheme-phoneme conversion is performed, first using a table of exceptions, then a set of rules described in section 2.5, and finally the set of co-articulation rules described in the same section. The co-articulation rules are still being implemented.

The output of this block is the sequence of phoneme codes (using the SAMPA code [Wells, 2000]), the boundaries of syllables, words, phrases, sentences and paragraphs, and the identification of tonic syllables.

1.4.4 Prosody pattern determination

This block determines the segmental duration of each phoneme in the sequence as well as the F0 patterns. The intensity pattern is less important in terms of audibility, so its study has been set aside in favour of the F0 and duration features.

The block of prosody pattern determination consists of the prosody models developed in this work and described in later chapters.

1.4.5 Production of speech signal waveform

This block is responsible for the production of the speech sound waveform from the phoneme sequences and the prosodic information.

First, the phoneme sequences are converted into diphone sequences. Two alternative techniques are then available for the acoustic processing: the formant synthesizer and the concatenation synthesizer.
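The conversion from phonemes to diphones can be pictured with the short sketch below. It assumes that each diphone spans two adjacent phonemes and that the utterance is padded with a silence symbol at both ends; the symbol "_", the "a-b" naming and the function name are hypothetical choices, not the conventions of the FEUP TTS system.

```python
def phonemes_to_diphones(phonemes, silence="_"):
    """Pair each phoneme with its successor, padding with silence at both ends."""
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

# Illustrative input: a four-phoneme sequence written with SAMPA-style codes.
diphones = phonemes_to_diphones(["k", "a", "z", "6"])
```

A sequence of n phonemes yields n + 1 diphones under this padding assumption.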

The formant synthesizer retrieves the sequence of frames of each diphone from a dedicated database. This database consists of diphones of natural speech, each coded as a sequence of frames. Fig. 1.2 presents the information in the sequence of frames of one diphone. Each frame corresponds to 10 ms of speech coded by the following parameters: F1, F2, F3, F4, F5 (five formants), B1, B2, B3, B4, B5 (respective bandwidths), voicing information and the amplitude of the excitation source. The sequences of frames and the F0 and duration patterns are then used as inputs to the synthesizer module represented by the blocks of Fig. 1.3.
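One possible in-memory representation of such a frame is sketched below. The field names, the use of a boolean for the voicing information and the numeric values are assumptions made for illustration; only the list of parameters and the 10 ms frame length come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

FRAME_MS = 10  # each frame codes 10 ms of speech

@dataclass
class Frame:
    formants: Tuple[float, float, float, float, float]    # F1..F5 in Hz
    bandwidths: Tuple[float, float, float, float, float]  # B1..B5 in Hz
    voiced: bool       # voicing information (assumed binary here)
    amplitude: float   # amplitude of the excitation source

def diphone_duration_ms(frames: List[Frame]) -> int:
    """Natural duration of a diphone stored as a sequence of frames."""
    return len(frames) * FRAME_MS

# Illustrative diphone of 5 identical frames (values are made up).
example = [Frame((500.0, 1500.0, 2500.0, 3500.0, 4500.0),
                 (60.0, 90.0, 120.0, 150.0, 200.0),
                 True, 1.0) for _ in range(5)]
duration = diphone_duration_ms(example)
```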

Fig. 1.2 – Data of a sequence of 5 frames of a diphone (F1 to F5, B1 to B5, voiced/unvoiced, amplitude).

Fig. 1.3 – Formant synthesizer block diagram. Ag and An denote the amplitudes of the excitation sources.

The glottal excitation is produced with the LF model [Fant et al., 1985], which allows physical characteristics to be associated with parameters of the model and gives better control over the voice quality.

The noise generator produces a noise signal from random numbers with a Gaussian distribution.

The filter of spectral correction was introduced to compensate for the observed difference in spectral decay between natural human speech and synthetic speech, accounting for linear distortion in the coding phase and in the selection of the source signal parameters.

The prosody manipulation is performed parametrically. The F0 pattern is a sequence of values indexed like the sequence of frames and is used to control the frequency of the glottal pulses. The segment durations control the number of frames used to produce each segment, through the removal or insertion of frames.
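A minimal sketch of duration control by frame removal or insertion is shown below. The text does not say which frames are removed or repeated, so the even spacing used here is an assumption, as is the function name; frames are represented by arbitrary objects.

```python
def fit_frames(frames, target_count):
    """Return target_count frames by evenly repeating or dropping frames."""
    if not frames or target_count <= 0:
        return []
    n = len(frames)
    # Output position i is mapped back to the source frame index i * n // target_count.
    return [frames[(i * n) // target_count] for i in range(target_count)]

source = list("abcde")               # 5 frames, i.e. 50 ms at 10 ms per frame
lengthened = fit_frames(source, 10)  # 100 ms target: every frame is repeated
shortened = fit_frames(source, 3)    # 30 ms target: some frames are dropped
```

Because the F0 pattern is indexed like the frames, one F0 value can then be read for each position of the resulting sequence.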

The time-domain concatenation synthesizer described in [Barros, 2002] uses diphone units collected from the FEUP-IPB speech database. Segmental duration is controlled by repeating or deleting pitch periods, and F0 is controlled by shortening or lengthening the pitch periods.

Finally, the sound is produced through the MS Windows SAPI.

Blocks shown in Fig. 1.3: glottal excitation, noise generator, filter of spectral correction, vocal filter (5 formants and bandwidths), filter of radiation of lips. Control inputs: Ag, An, F0.

1.5 Organization Aspects of the Thesis

This document describes the prosody model developed within the PhD work. Only the experiments, programs, tools and sub-modules considered important are described, to avoid distracting from the main objective and results.

Several parts of the work reported here have already been published at specialised international conferences. This dissertation reports the work in more detail and, in some cases, with further developments.

The main prosody work is documented in chapters 3 and 4, covering the duration model and the F0 model, respectively. The accessory work, although also important for the main prosody model, is described in chapter 2, where several pieces of preparatory work are reported, and in chapter 5, where the perceptual evaluations of the models are discussed.

Chapter 2 describes several components and pieces of preparatory work leading to the main prosody model. The chapter starts, in section 2.2, with a preliminary study of the tonic syllable in EP. This study was not used directly in the present model, but it was developed under this PhD work and the resulting experience helped to clarify the research path. It presents preliminary measurements of how the prosodic features of the tonic syllable (syllable duration, F0 and intensity) change with its position in the word and in the phrase. Section 2.3 describes the speech corpus, the FEUP-IPB database, used in the development of the prosody model. The labelling of the database at segmental, word and phrase levels is described, several statistics of the database are presented, and several phonetic modification phenomena found in the database are reported. Section 2.4 describes two algorithms and sets of rules developed for syllabic splitting, both of the text and of the phonetic sequences resulting from the phonetic labelling of the database. The chapter ends with section 2.5, where some contributions to the EP phonetic transcription of text are presented.

Chapter 3 describes the proposed model to predict segmental durations. Section 3.2 gives an overview of other recent duration models. Section 3.3 proposes a model that predicts the segmental durations for EP based on one ANN, and presents the ANN architecture and training as well as a study of the important features. Section 3.4 presents some parameters used in the measurements and evaluates the proposed model. Section 3.5 proposes an alternative model that keeps the characteristics of the first one but uses dedicated ANNs for each segment type. Finally, section 3.6 presents a preliminary study of a model to predict pause insertion and pause duration.

Chapter 4 presents the proposed model to predict F0 patterns from text with dedicated ANNs. The introductory section 4.1 gives a short overview of F0 coding models. Section 4.2 discusses theoretical and practical aspects of Fujisaki modelling. Section 4.3 describes the estimation of the model parameters. Section 4.4 clarifies the sequence in which the model is applied. Section 4.5 presents the whole process of inserting phrase commands under the control of an algorithm and predicting their magnitudes and final positions with ANNs, as well as the study carried out to select the ANNs and the set of features. Section 4.6 presents the insertion of accent commands, the prediction of their parameters with ANNs, and the selection of the ANNs and features. Section 4.7 analyses the predicted F0 contour, separating the phrase and accent components. The predicted F0 contour over the predicted segmental durations is also analysed.

Chapter 5 presents and discusses the perceptual tests carried out to evaluate the proposed models. Section 5.2 compares the results of both proposed duration models with natural speech and with what is considered the absence of a duration model. Section 5.3 discusses the loss in naturalness after the application of each component of the prosody model (duration and F0 models).

Finally, Chapter 6 presents the detailed and summarised conclusions and future developments.

1.6 Original Contributions

The present work presents the first known prosody model for EP TTS systems. The usage of ANNs for segmental duration prediction, or for the prediction of Fujisaki model parameters, is not in itself innovative, but the particular implementations of the present solution and their application to the EP language, considering its peculiarities and features, are. Chapter 2 also presents some new developments that are original resources useful for researchers specialising in EP.

Concretely, the original resources presented in chapter 2 are:

• the study of the variation of prosodic acoustic parameters in the tonic syllable, already published in [Teixeira et al., 1999] and [Teixeira and Freitas, 2002];

• the speech labelled corpus FEUP-IPB database for EP, already published in [Teixeira et al., 2001];

• the algorithms for text syllabification and phonetic syllabification of EP, based on the consideration of grammatical sequences of vowels and consonants and some complementary rules, already published in [Gouveia et al., 2000];

• the contribution to the set of rules for the phonetic transcription of text for several graphemes in EP.

The usage of ANNs for the prediction of segmental duration, as in the models presented in chapter 3, had already been tried for other languages with good results. The original contributions in this model are the extended list of features and the dedicated ANNs for each type of segment proposed in the alternative model. Both contributions proved to improve the performance of the final model. This work was already partially published in [Teixeira and Freitas, 2002, 2003a, 2003b].

The estimation of the F0 contour with the Fujisaki model has already been published for several languages, each with its own peculiarities. The known published works reporting the prediction of the F0 contour from text are [Navas, 2003] for Basque, and [Mixdorff, 1998, 2002] and [Möbius et al., 1993] for German. Mixdorff presented in 2002 the prediction of parameters using one ANN; the other works predict the parameters by rules based on statistical analysis. The process of predicting the model parameters with dedicated ANNs is therefore also innovative. Other new contributions in this model are the process of insertion of the phrase commands and the association of accent commands to syllables. Some parts of this work were already published in [Teixeira et al., 2003, 2004].

2 Preparatory Work

This chapter describes several components developed during the work. These components are not directly related to the prosody model but are used by it and are essential to supply linguistic information about the text. A preliminary study of the tonic syllable was done before the prosody model. This study gave several hints for the duration and F0 models, and produced some quantitative information about the tonic syllable. The FEUP-IPB phonetically labelled speech database, which was used in all the following studies, is also described. The algorithms for syllabification of the written text and of the phoneme sequence produced by the speaker are also described. Finally, several rules for EP grapheme-phoneme transcription are presented, as an important part of producing accurate synthetic speech in TTS systems.

2.1 Introduction

This chapter presents several independent blocks developed during this work by the author in cooperation with other colleagues of the laboratory. Each section describes one independent study, but each of them makes an important or fundamental contribution to the final prosody model.

The first section, 2.2, describes the variation of the prosodic parameters duration, F0 and intensity introduced by the effect of the accented syllable. This study was done before the prosody model presented in the next chapters and was published in [Teixeira et al., 1999] and [Teixeira and Freitas, 2002]. Although the following model has a radically different methodology from the one followed in this study, very good suggestions for the development of the prosody model resulted from the experience and discussion of this work.

Section 2.3 presents the FEUP-IPB speech corpus database for EP, which was used in all subsequent developments. The database was produced with the main objective of developing the prosody model and as a source from which to extract speech segments for a TTS database. The database is phonetically labelled and also has several other labels, described in that section and in [Teixeira et al., 2001].

Section 2.4 describes the algorithms developed to split words into syllables, also presented in [Gouveia et al., 2000]. Two distinct algorithms were developed. The first one, with a very good performance, splits the written text and is intended to be applied in the TTS process. The second one splits phonetic words as they were produced by the speaker. This second algorithm has the additional difficulty of dealing with the suppressions that are very frequent in EP. It was used in all development studies of the prosodic model, since the source of information is the phoneme sequences as they were produced by the speaker. These algorithms can be considered part of the prosody model.

The last section presents a set of rules to be used in the grapheme-phoneme conversion process of the TTS system, and discusses the major difficulties for the EP language. The set of rules is already implemented in the FEUP-TTS system. Some post-lexical rules are also proposed in order to reduce the distance between phonetic transcription and phonological production.

2.2 Preliminary Prosodic Study of the Tonic Syllable

This study was carried out before the prosody model presented in the next chapters. In fact, it remained a preparatory work and its objective results were not directly used in the proposed prosody model. This is mainly because those results are in a quantified statistical model format, whereas the research that led to the proposed model progressed towards a complete ANN statistical model. Nevertheless, the results and methodology are reported here because of their validity and importance. This study was already reported in [Teixeira et al., 1999] and [Teixeira and Freitas, 2002].

2.2.1 Introduction

Some authors, for instance [Zellner, 1998], [Andrade and Viana, 1988] and [Mateus et al., 1990], assume that accurate modelling of tonic syllables is crucially important in the modulation of prosody, and specifically in developing prosodic models to improve the naturalness of synthetic speech. This requires the modification of the acoustic parameters duration, intensity and fundamental voicing frequency, F0, but there are no previously published works that systematically quantify the variation of these parameters for EP.

F0, duration or intensity variation in the tonic syllable may depend on its function in the context, the word length, the position of the tonic syllable in the word, or the position of this word in the sentence (initial, medial or final). The function of words will not be considered, since it is not generally predictable by a TTS system. The main objective was to develop a quantified statistical model to implement the necessary F0, intensity and duration variations on the tonic syllable for TTS synthesis, considering only the position dependency.

2.2.2 Method

2.2.2.1 Corpus

A short corpus was recorded with phrases of varying lengths with a selected tonic syllable always containing the phoneme [E] (SAMPA code). The syllables were analysed in various positions in the phrases and in isolated words. The short corpus was built bearing in mind that this study should be extended to a larger corpus with other phonemes and with refinements in the method resulting from this first stage.

Two words were considered for each of the three positions of the tonic syllable (final, penultimate and antepenultimate stress). Three sentences were created with each word, plus one sentence with the word isolated, giving a total of 24 sentences. The nonsense word “fefeto” was also included. The characteristics of the tonic syllable were then extracted and analysed in comparison to a neighbouring unstressed reference syllable in the same word (e.g. Amélia, ferro, café: bold = tonic syllable, underlined = reference syllable). The nonsense word is of particular interest because it contains the same syllable twice, in pre-tonic or post-tonic position, allowing the reference syllable to be the same as the tonic syllable.

2.2.2.2 Recording conditions

The 24 sentences were read by three speakers, two males and one female. Each speaker read the material three times. Recording was done directly to a PC hard disk using a 50 cm unidirectional microphone and a sound card (16 bits, 11 kHz). The room that was used was only moderately acoustically treated.

2.2.2.3 Signal Analysis

The MATLAB package was used for analysis, and appropriate measuring tools were created. All frames were first classified as voiced, unvoiced, mixed or silence. Intensity in dB was calculated as in [Rowden, 1992], and in voiced sections the F0 contour was extracted using a cepstral analysis technique [Rabiner and Schafer, 1978]. These three aspects of the signal were verified by eye and by ear. The following values were recorded for tonic syllables (T) and reference syllables (R), as depicted in Fig. 2.1: syllable duration (DT for the tonic and DR for the reference), maximum intensity (IT and IR), and initial (FA and FC) and final (FB and FD) F0 values, as well as the type of shape of the contour.

[Figure: three panels over time (ms): Signal, F0 (Hz) and Intensity (dB), annotated with DR, DT, FA, FB, FC, FD, IR and IT.]

Fig. 2.1 – Recorded parameters for tonic and reference syllables using the developed package for analysis. Top graph: waveform signal of the word “café” and its classifications, in red as 1 – silence; 2 – unvoiced; 3 – mixed; 4 – voiced. Middle graph: F0. Bottom graph: Intensity.

2.2.3 Analysis and results

The effects on the parameters fundamental frequency (F0), duration and intensity will be analysed and discussed. Each result presented for a given tonic position in the word and word position in the sentence results from the analysis of 18 utterances (two different words in each sentence type, read three times by three speakers: 2x3x3).

2.2.3.1 Fundamental frequency

The difference in F0 variation between tonic and reference syllables relative to the initial value of F0 in the tonic syllable, given by Eq. (2.1), was determined for all sentences. As these syllables are in neighbouring positions the common variation of F0 is the result of sentence intonation. The difference of F0 variation in these two syllables is due to the tonic position.

$$\text{Relative variation of F0} = \frac{(F_B - F_A) - (F_D - F_C)}{F_A} \times 100\ (\%) \qquad \text{Eq. (2.1)}$$
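
As an illustration only, Eq. (2.1) can be sketched in Python as below. The function name and the example values are hypothetical and not taken from the corpus; the sketch assumes that FA and FB are the initial and final F0 of the tonic syllable and that FC and FD are those of the reference syllable, as suggested by the description of the recorded values.

```python
def relative_f0_variation(f_a, f_b, f_c, f_d):
    """Relative F0 variation of the tonic syllable, in %, following Eq. (2.1).

    f_a, f_b: initial and final F0 of the tonic syllable (Hz).
    f_c, f_d: initial and final F0 of the reference syllable (Hz).
    """
    if f_a <= 0:
        raise ValueError("initial F0 of the tonic syllable must be positive")
    # F0 movement inside the tonic minus F0 movement inside the reference,
    # normalised by the initial F0 of the tonic syllable.
    return ((f_b - f_a) - (f_d - f_c)) / f_a * 100.0


# Hypothetical values in Hz (assumption, not measured data).
example = relative_f0_variation(f_a=120.0, f_b=138.0, f_c=118.0, f_d=124.0)
print(f"{example:.1f} %")
```

Subtracting the movement of the neighbouring reference syllable is what removes the common sentence intonation from the measure.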

There are some cross-speaker tendencies, and some minor variations that seem irrelevant. Fig. 2.2 shows average relative variation of F0, ± 2·σ (σ-standard deviation), of the tonic syllable for all speakers.
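
The mean ± 2·σ band used in the figures can be computed as in the minimal sketch below. The input values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used. Reading the band as a 95% interval further assumes approximately normally distributed values.

```python
from statistics import mean, stdev


def mean_two_sigma_band(values):
    """Return (mean, lower, upper), where the band is mean ± 2·σ."""
    if len(values) < 2:
        raise ValueError("at least two values are needed")
    m = mean(values)
    s = stdev(values)  # sample standard deviation (assumption)
    return m, m - 2 * s, m + 2 * s


# Hypothetical relative F0 variations (%) for one case, not measured data.
m, lower, upper = mean_two_sigma_band([8.0, 10.0, 12.0])
print(f"mean = {m:.1f} %, band = [{lower:.1f}, {upper:.1f}] %")
```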

Fig. 2.3 shows the standard deviation between the three speakers. In some cases (low standard deviation) the F0 variations in the tonic syllable are similar for the three speakers, but in other cases (high standard deviation) the F0 variations are very different. Reliable rules can therefore only be derived in a few cases. Table 2.1 shows the cases that can be taken as more consistent rules, taking the standard deviation into consideration. These rules can be interpreted as the situations where the F0 variation should be incremented by the stated percentage.

[Figure: “Relative variation of F0 in tonic syllable”; vertical axis: % of F0 variation (-40 to 50); horizontal axis: cases 1 to 12. Legend: Isolated word: 1. Beginning, 2. Middle, 3. End; Word in the beginning: 4. Beginning, 5. Middle, 6. End; Word in the middle: 7. Beginning, 8. Middle, 9. End; Word at the end: 10. Beginning, 11. Middle, 12. End.]

Fig. 2.2 – Relative variation of F0 in tonic syllable (95% confidence).

[Figure: “Standard deviation of F0 between speakers (%)”; axes: position of tonic in the word (Beginning, Middle, End), position of word in the phrase (Beginning, Middle, End, Isolated) and std (%), from 0 to 16.]

Fig. 2.3 – Standard deviation of F0 variation between the three speakers.

Table 2.1: Consistent F0 variation in tonic syllable, values in %.

Tonic syllable position in the word   Isolated word   Phrase initial   Phrase medial   Phrase final

Beginning 5

Middle 10

End -21 12.5 -12

Although only the values for F0 variation are reported here, the shape of the variation is also important. The patterns were observed and recorded. In most cases they can be approximated by exponential curves.

2.2.3.2 Duration

The relative duration for each tonic syllable was calculated by the relation in Eq. (2.2). For each speaker the average relative duration of the tonic syllable was determined and tendencies were observed for the position of the tonic syllable in the word and the position of this word in the phrase.

$$\text{Relative duration of tonic} = \frac{D_T}{D_R} \times 100\ (\%) \qquad \text{Eq. (2.2)}$$

Fig. 2.4 shows the average duration ± 2·σ (σ: standard deviation) of the tonic relative to the reference syllable for all speakers, at 95% confidence. A general increase can be seen in the duration of the tonic syllable from the beginning to the end of the word. The low values of standard deviation in Fig. 2.5 (compared to the ones of the previous figure) show that the patterns and ranges of variation are quite similar across the three speakers, leading to the conclusion that variation in relative duration of the tonic syllable is speaker independent.

Rules for tonic syllable duration can be derived from Fig. 2.4, based on position in the word and the position of the word in the phrase. Table 2.2 summarises these rules.

Note that when the relative duration is less than 100% the duration of the tonic syllable will be reduced.

[Figure: “Relative Duration of Tonic Syllable”; vertical axis: % of duration (0 to 450); horizontal axis: cases 1 to 12. Legend: Isolated word: 1. Beginning, 2. Middle, 3. End; Word in the beginning: 4. Beginning, 5. Middle, 6. End; Word in the middle: 7. Beginning, 8. Middle, 9. End; Word at the end: 10. Beginning, 11. Middle, 12. End.]

Fig. 2.4 – Relative Duration of tonic syllable (95% confidence).

[Figure: “Standard deviation of tonic duration between speakers”; axes: position of tonic in the word (Beginning, Middle, End), position of word in the phrase (Beginning, Middle, End, Isolated) and standard deviation in %, from 0 to 30.]

Fig. 2.5 – Standard deviation of average duration between the three speakers.

Table 2.2: Duration rules for tonic syllables, values in %.

Tonic syllable position in the word   Isolated word   Phrase initial   Phrase medial   Phrase final
Beginning                             69              140              210             120
Middle                                139             187              195             167
End                                   341             319              242             324
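
A minimal sketch of how the rules of Table 2.2 could be applied is given below. It inverts Eq. (2.2) to obtain a tonic duration from a reference duration. The dictionary keys, the function name and the 100 ms example are hypothetical; only the percentages come from the table.

```python
# Relative duration of the tonic syllable in %, copied from Table 2.2.
# Outer key: tonic position in the word; inner key: word position in the phrase.
DURATION_RULES = {
    "beginning": {"isolated": 69, "initial": 140, "medial": 210, "final": 120},
    "middle": {"isolated": 139, "initial": 187, "medial": 195, "final": 167},
    "end": {"isolated": 341, "initial": 319, "medial": 242, "final": 324},
}


def tonic_duration_ms(reference_ms, tonic_position, word_position):
    """Scale a reference-syllable duration by the Table 2.2 percentage."""
    percent = DURATION_RULES[tonic_position][word_position]
    return reference_ms * percent / 100.0


# Hypothetical reference syllable of 100 ms, word-final tonic, phrase-final word.
print(tonic_duration_ms(100.0, "end", "final"))
```

A percentage below 100, as in the isolated-word, word-initial case, shortens the tonic syllable relative to the reference.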

There are still some questions about these results. Firstly, the reference syllable differs segmentally from the tonic syllable. Secondly, the results were obtained for a specific set of syllables and may not apply to other syllables. Thirdly, in synthesising a longer syllable, which constituents should be lengthened: only the vowel, or also the consonants? Does the type of consonant (stop, fricative, nasal, lateral) matter? A future study with a much larger corpus will address these issues.

2.2.3.3 Intensity

For each speaker the average intensity variation between tonic and reference syllables was determined (Eq. (2.3)), in dB, according to the position of the tonic syllable in the word and the position of this word in the phrase. There are cross-speaker patterns of decreasing relative intensity in the tonic syllable from the beginning to the end of the word. Fig. 2.6 shows the average intensity variation, ± 2·σ (95% confidence).

$$\text{Intensity variation} = I_T\,(\text{dB}) - I_R\,(\text{dB}) \qquad \text{Eq. (2.3)}$$
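
Eq. (2.3) is a plain difference of levels in dB. The sketch below also converts such a difference into a linear amplitude ratio, which is what a synthesiser would typically need when scaling a waveform. The function names and example levels are hypothetical, and the conversion is the standard decibel relation rather than something stated in this study.

```python
def intensity_variation_db(i_tonic_db, i_reference_db):
    """Intensity variation between tonic and reference syllables, Eq. (2.3)."""
    return i_tonic_db - i_reference_db


def amplitude_ratio(delta_db):
    """Linear amplitude ratio corresponding to a level difference in dB."""
    return 10.0 ** (delta_db / 20.0)


# Hypothetical maximum intensities in dB (assumption, not measured data).
delta = intensity_variation_db(70.0, 64.0)
print(f"{delta:.1f} dB -> amplitude x {amplitude_ratio(delta):.2f}")
```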

The standard deviation between speakers is shown in Fig. 2.7. The pattern of variation for this parameter is consistent across speakers.

[Figure: “Variation of Intensity in Tonic Syllable”; vertical axis: dB (-10 to 40); horizontal axis: cases 1 to 12. Legend: Isolated word: 1. Beginning, 2. Middle, 3. End; Word in the beginning: 4. Beginning, 5. Middle, 6. End; Word in the middle: 7. Beginning, 8. Middle, 9. End; Word at the end: 10. Beginning, 11. Middle, 12. End.]

Fig. 2.6 – Average intensity variation of tonic syllable for all speakers (95% confidence).

[Figure: “Standard deviation of intensity between speakers”; axes: position of tonic in the word (Beginning, Middle, End), position of word in the phrase (Beginning, Middle, End, Isolated) and dB, from 0 to 8.]

Fig. 2.7 – Standard deviation of average intensity variation between the three speakers.

In contrast to the duration parameter, a general decreasing trend can be seen in the tonic syllable intensity variation as its position changes from the beginning to the end of the word. Again, a set of rules can be derived from Fig. 2.6, giving the change in intensity of the tonic syllable according to its position in the word and in the phrase. Table 2.3 shows these rules. It can be seen that in cases 1, 2, 10 and 11 the inter-speaker variability is high and the rules are therefore unreliable.

Table 2.3: Change of intensity in the tonic syllable, values in dB.

Tonic syllable position in the word   Isolated word   Phrase initial   Phrase medial   Phrase final
Beginning                             15.2            10.3             6.6             16.8
Middle                                9.2             4.6              3.0             7.2
End                                   -0.4            2.8              1.3             -0.4

In these experiments the tonic syllable always contains the phoneme [E], which is a rather open and strongly pronounced phoneme; how much does this affect the results? In order to eliminate this problem the reference syllable should ideally be the same as the tonic, even if nonsense words (like “fefeto”) have to be used.

2.2.4 Comments and conclusion

Some interesting variations of F0, duration and intensity in the tonic syllable have been shown as a function of its position in the word, for words in initial, medial and final position in the phrase and for isolated words. The analysis of the data is quite complex due to its multi-dimensional nature. The variations by position in the word are shown in Fig. 2.2, Fig. 2.4 and Fig. 2.6, comparing the sets [1,2,3], [4,5,6], [7,8,9] and [10,11,12]. The average values of these sets show the effect of the position of the word in the phrase.

Firstly, the variations of average relative duration and intensity of the tonic syllable are opposite in phrase-initial, phrase-final and isolated words. Secondly, comparing the average relative variation of F0 in Fig. 2.2 and the average relative duration in Fig. 2.4, the effect of syllable position in the word is similar in the cases of phrase-initial and phrase-medial words, but opposite in phrase-final words. Thirdly, for the relative F0 and intensity variations shown in Fig. 2.2 and Fig. 2.6 respectively, opposite trends can be observed for phrase-initial words but similar trends for phrase-final words. In phrase-medial and isolated words the results are too irregular for valid conclusions. These qualitative comparisons are summarised in Table 2.4.

Table 2.4: Summary of qualitative trends (varying the tonic position from beginning to the end of word) for all word positions in the phrase.

Word position Parameter Isolated Beginning Middle End

Relative F0 variation *

Relative duration

Intensity

* Irregular variation.

Finally, there are some general tendencies across all syllable and word positions. For F0 relative variation, the most significant tendency is a regular decrease from the initial to the final position in the phrase, but in isolated words the behaviour is irregular with an increase at the middle of the word. There is a regular increase in the relative duration of the tonic syllable, up to 200%. Less regular variation in intensity can be observed, moderately decreasing (2-3 dB) as the word varies from the initial to the medial position in the phrase, but increasing (2-4 dB) in phrase-final and in isolated words.

In informal listening tests of each individual characteristic in synthetic speech, the most important perceptual parameter is F0 and the least important is intensity. Duration and F0 are thus the most important parameters for a synthesiser.

2.2.4.1 Future developments

This preliminary study clarified some important issues. In future studies the reference syllable should be similar to the tonic syllable for comparisons of duration and intensity values, and should be contiguous to the tonic in a neutral context. Consonant duration should also be controlled. These conditions are quite hard to fulfil in general, leading to the use of nonsense words containing the same syllable twice.

For duration and F0 variations a larger corpus of text is needed in order to increase the confidence levels. The default duration of each syllable should be determined and compared to the duration in tonic position. The F0 variation in the tonic syllable is assumed to be independent of segmental characteristics. The number and variety of speakers should also be increased so that the results can be more generally applicable.

2.3 Speech Corpus - FEUP-IPB Database

This section documents the FEUP-IPB database, which was developed with the main purpose of being the object of analysis for the prosody model presented in the following chapters. The characteristics of the database allowed its use in other projects of the Laboratory, which in return provided resources for its development. Namely, the project ANTIGONA [Freitas et al., 2002] provided the resources for its manual phonetic labelling by a phonetician, and used it as a source to build the diphone database of a concatenative synthesiser [Barros, 2002]. The FEUP-IPB database was briefly presented in [Teixeira et al., 2001].

2.3.1 Introduction

The present database was built during this work because at the time there was no public phonetically labelled European Portuguese database (DB). The FEUP/IPB-DB, described below, was aimed at developing a new high quality EP TTS system, and serves two purposes. The first is to supply word and phrase level annotations that are used to study and build prosody models for EP read speech. The second is to provide a phonetically rich and natural database of EP phonemes and articulations, specifically recorded from the high quality voice of a skilled professional speaker. Because of its structural organization, this database was phonetically segmented, labelled and annotated in a way that allows it to be used for quasi-automatic construction of the segmental base of a TTS system.

It is also important to stress that this DB allows us to extract segmental and supra-segmental features for EP, which means that it is the basis for a broader knowledge of EP phonetics and prosody.

The voice recordings were done in an acoustically treated professional studio of RDP, the public national radio broadcast company. The professional male speaker read the text materials and the speech was digitally recorded using the regular studio equipment. The session had been carefully prepared with text preparations and trial readings. Different text materials serve different purposes of the database and the speaker was carefully instructed accordingly. After some editing of the digital sound records, such as cutting out mistakes, sound material with a total duration of approximately 100 minutes was produced, organized in a set of sound tracks with durations between 2 and 3 minutes each. An audio CD in cda format and a set of .wav files with a 44.1 kHz sampling rate, 16 bits, mono, were produced.

Section 2.3.2 describes the text corpus, section 2.3.3 the segmentation process, and section 2.3.4 reports several characteristics of the database. In section 2.3.5, some relevant phonetic aspects are presented that resulted from the phonetic inspection, segmentation and labelling of the database, as reported by the phonetician Daniela Braga [Teixeira et al., 2001] and incorporated here as an important piece to complete the description of the database.

2.3.2 Speech corpus

The text corpus of the speech database consists of 9 text excerpts from different articles published in the biggest nationwide newspaper in November 1999, 2 additional texts from another article and one interview published in the biggest weekly newspaper in the same month, 2 sets of specially prepared interrogative sentences, with and without interrogative words (who – “quem”, which – “qual”, how many – “quantos”, how – “como”, where – “onde”, etc.), and 1 set of phonetically engineered log-atoms carrying all standard Portuguese diphones and several triphones in a congruent context. Some text readings, due to their length, are divided into two or more sound tracks.

The set of log-atoms consists of syllables with vowels, nasal vowels and diphthongs, read continuously in concatenative alternation between vocalic sounds or between vocalic and consonantal sounds. This set was divided into 3 tracks. Its main purpose is to guarantee that some specimens of each rare diphone are present in the database, spoken as monotonously as possible, for use in speech synthesis.

Each track was later divided into files associated with text paragraphs.

2.3.3 Sound segmentation and labelling

Every track was carefully examined and segmentation marks were placed using the Speech Filing System (SFS) software tool from UCL [Huckvale]. PRAAT [Boersman and Weenink] was used as well. A log-book of events was maintained. Phrase, word and phone labels were then attached. The tonic syllable was also identified and labelled with a mark just before the first phone of the syllable. All annotations reside in a text file together with the time labels of the beginning instant of each element. The phonetic level labels are based on the SAMPA code [Wells, 2000], extended with some other necessary codes presented in Table 2.5. Occlusive consonants were labelled in two segments, the occlusion and burst parts. The segmentation labels used at the word and phrase levels are presented in the final rows.

When one word starts right after the previous symbol without a break, the start-of-word code was used to label simultaneously the end of one word and the beginning of the next. The same procedure is used for phrase and sentence boundaries.

All the word and phrase labelling and about half of the phonetic labelling were done manually. This task was accomplished by a professional phonetician at a production rate of about 1 day for 1 minute of sound material. The other half of the phonetic labelling was initially done using an automatic alignment tool from the University of Gent [Vorstermans et al., 1996], and the result was subsequently reviewed and corrected manually. This automatic alignment tool starts from the wave file and the phonetic transcription of the text, as well as the word and phrase labels in the phonetic transcription, and produces the phonetic labelling, inserting or removing some phones due to reduction phenomena. This process is strongly encouraged because of the time it saves.

In spite of the usage of specific tools for the labelling process, phone boundary identification is not always obvious or consensual. Fig. 2.8 shows an example of the difficulty of identifying the boundary between [e] and [j] in the word ‘lei’ – ‘law’. The transition between [e] and [j] occurs in the period of about 50 ms labelled as [ej] in the upper graph of the figure. It is clear that there is no precise location for the boundary.

To minimise this problem, the database was labelled by only one phonetician, so as to keep the identification of these boundaries consistent. No study was made to quantify the error in phoneme and boundary labelling, since a study of that kind would require more phoneticians to label a sample of the database and a comparison of the results from each labelling. However, some observations have pointed to an average labelling error of 5 to 10 ms.

Table 2.5: Phoneme, word and sentence level labels used in labelling the database.

Label                          Meaning
p, b, t, d, k, g               Burst segment of plosive consonants
!                              Occlusion segment of plosive consonants
f, v, s, z, S, Z               Fricatives
m, n, J                        Nasals
L, l, R, r                     Liquid consonants
                               (in SAMPA code)
l*                             l in syllable-final position (velar)
i, e, E, a, 6, O, o, u, @      Vowels
i~,e~,6~,o~,u~,w~,j~           Nasal vowels
w, j                           Glides
                               (in SAMPA code)
X                              Silence
XX                             Inhalation
“                              Beginning of tonic syllable

Word Level
p                              Beginning of word
f                              End of word

Sentence Level
i                              Beginning of sentence
.                              End of sentence
, ! () - ; : ... “             Every punctuation marker in the text

Language variation issues were taken into consideration in the construction of this DB, in particular those related to dialectal or geographic varieties, as well as those concerning individual tendencies, style or habits. These aspects will be described below.

Labelled files are read and processed with a Matlab function that makes all the labelling information available. Phone identity, phone duration, word boundary, sentence boundary and punctuation information can thus be extracted from the labelling files.


Fig. 2.8 – Above: representation of the acoustic signal in the phoneme sequence [lej] in the word ‘lei’ – ‘law’, with the segments labelled [l], [e], [ej] and [j]. Below: spectrogram. Both panels span 1.205 s to 1.45 s; the signal amplitude axis runs from -0.28 to 0.28 and the spectrogram axis from 0 to 4500.

2.3.4 Characteristics

Tracks 1, 2, 3, 4, 5, 7 and 8 were first labelled manually; in a second phase, the other tracks were labelled semi-automatically using the automatic alignment tool and then manually corrected. Only the seven manually labelled tracks were used in the following studies. These seven tracks give a total of 21 minutes of speech, consisting of 18,647 segments and 15,633 phones.

For each phone segment or phoneme considered, the relative occurrence frequency (in %), average duration and standard deviation were determined in general position and in tonic syllable position. These data are reported in Table 2.6. Fig. 2.9 displays the relative frequency of each segment.

Comparing the duration of each phone segment in general position with its duration in tonic syllable position, we can conclude that nearly all vowels and the phoneme [l*] are longer in tonic position (the exceptions in Table 2.6 are [@] and [u~]), the phoneme [L] is shorter, and all other consonants, including the occlusions of plosives [!], are not affected by the tonic syllable position.
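The comparison can be checked directly against the averages of Table 2.6. The following is a minimal sketch, not part of the original work: the dictionary simply copies the vowel rows of the table, and the function name `shorter_in_tonic` is a hypothetical helper introduced for illustration.

```python
# Average durations (ms) copied from the vowel rows of Table 2.6:
# phone -> (general position, tonic syllable position).
VOWEL_DURATIONS_MS = {
    "a": (110, 121), "6": (68, 75), "E": (97, 102), "e": (95, 102),
    "@": (53, 33), "i": (69, 85), "O": (106, 116), "o": (97, 103),
    "u": (57, 65), "6~": (75, 97), "e~": (107, 117), "i~": (109, 132),
    "o~": (98, 119), "u~": (86, 77),
}


def shorter_in_tonic(durations):
    """Return the phones whose tonic average is below the general average."""
    return [phone for phone, (general, tonic) in durations.items()
            if tonic < general]


print(shorter_in_tonic(VOWEL_DURATIONS_MS))  # ['@', 'u~']
```

With the table as printed, every vowel is longer in tonic position except [@] and [u~]. Only the averages are compared here; no significance test is implied.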

Fig. 2.10 presents a graph showing the regularity of the speech rate in the readings of tracks 1, 2, 3, 4, 5, 7 and 8. A different slope would indicate a distinct rate for that specific track. The speech rate of the readings varies between 11.6 and 13.0 phones/sec, with an average of 12.2 phones/sec.
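As an illustration of how a curve like the one in Fig. 2.10 and a rate figure can be derived from labelled durations, the sketch below accumulates segment durations and divides the segment count by the total time. It is an assumption-laden example: the function names and the duration values are hypothetical, and the original work may count phones, segments and pauses differently.

```python
from itertools import accumulate


def cumulative_time_s(durations_ms):
    """Accumulated duration (s) after each segment, as plotted in Fig. 2.10."""
    return [t / 1000.0 for t in accumulate(durations_ms)]


def rate_per_second(durations_ms):
    """Number of segments divided by their total duration in seconds."""
    return len(durations_ms) / (sum(durations_ms) / 1000.0)


# Hypothetical segment durations in ms (not taken from the database).
example = [60, 110, 50, 30, 70, 180]
print(cumulative_time_s(example)[-1])  # 0.5
print(rate_per_second(example))        # 12.0
```

A track read at a steady rate gives an approximately straight cumulative curve; its slope is the average time per segment, the inverse of the rate.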


Table 2.6: Percentage of occurrences, average duration and standard deviation of all phones, considering general positions (including tonic) and just tonic syllable positions.

         General position          Tonic syllable position
Phone    %      Av. (ms)   std     %      Av. (ms)   std
a        4.0    110        34      2.2    121        32
6        10.0   68         28      0.8    75         33
E        1.7    97         29      1.0    102        27
e        1.8    95         40      1.0    102        38
@        1.7    53         38      0.03   33         15
i        5.2    69         28      1.5    85         28
O        1.4    106        33      0.8    116        29
o        1.6    97         34      0.9    103        34
u        5.1    57         29      0.7    65         33
j        2.8    49         26      0.8    53         24
w        2.5    44         27      0.7    47         31
j~       0.1    64         20      0.03   61         21
w~       0.04   53         31      0.02   55         34
6~       2.9    75         35      0.9    97         38
e~       1.2    107        31      0.6    117        33
i~       0.7    109        42      0.2    132        49
o~       0.9    98         36      0.3    119        41
u~       0.6    86         45      0.2    77         43
p        3.3    20         9       1.0    18         6
!p       3.3    64         19      1.0    70         19
t        5.3    29         19      1.3    23         10
!t       5.3    48         20      1.3    49         20
k        4.1    37         16      1.0    36         11
!k       4.1    59         17      1.0    61         16
b        1.3    17         18      0.5    15         7
!b       1.3    43         16      0.5    44         15
d        4.7    20         17      0.8    15         5
!d       4.7    41         17      0.8    39         15
g        1.3    20         13      0.6    19         7
!g       1.3    44         13      0.6    43         12
m        2.8    62         19      0.7    63         19
n        2.0    54         19      0.4    51         15
J        0.4    68         18      0.1    67         19
l        1.8    52         20      0.4    53         20
l*       0.9    68         30      0.4    78         32
L        0.4    56         21      0.1    43         15
r        6.5    32         16      2.1    34         17
R        0.7    73         21      0.1    78         20
v        1.4    65         22      0.3    69         20
f        1.2    93         27      0.4    99         25
z        1.6    70         18      0.4    74         19
s        4.2    103        31      1.1    100        28
S        4.1    89         33      0.6    83         26
Z        1.9    78         25      0.4    79         25
XX       2.4    320        173
X        3.6    165        219

!p, !t, !k, !b, !d, !g denote the occlusion part of the respective consonants.


Fig. 2.9 – Relative frequencies of the segments in the corpus. (Bar chart with one bar per segment label; vertical axis from 0 to 9.)

Fig. 2.10 – Illustration of the speech rate for the different texts (here represented by the inverse, that is, the average time per segment). The figure shows the accumulated duration of elapsed segments. Track one is displayed using a solid line, track two using a dotted line, and so on for the 7 tracks. (Horizontal axis: nº of segments, 0 to 2 x 10^4; vertical axis: time in seconds, 0 to 1400.)


2.3.5 Phonetic changing phenomena in database

The distance between a phonetic and a phonological transcription closely mirrors the practice/theory binomial. Phonetics, as defined by Fromkin and Rodman [1983], “gives us the means that lead to the spoken sound description”, while Phonology “studies how the sounds of language form systems and models”. Therefore, any methodological option concerning speech labelling lies between these two classic linguistic fields: a phonological transcription, involving the interactions of phonemes and their distinctive and semantic importance; and a closer report of the physical reality of the language, that is, its phonetics.

Before any consideration of phonetic transcription, two main aspects must be taken into account: on the one hand, the inherent subjectivity of the transcriber when reporting the speech signal and, on the other, the factors of linguistic variation. With these conditions in mind, an attempt was made to carry out an accurate and close phonetic transcription of the DB, following coherent criteria. Some of the questions that have to be considered when labelling the speech signal are described next. They are of great importance to the quality of the synthetic speech subsequently produced, because of their strong impact on phonetic co-articulation events and, consequently, on prosodic aspects.

2.3.5.1 Dialectal changing

Sociolinguistics explains that each language has a range of regional varieties that may differ in phonetic, morphological, syntactic or even lexical aspects, though they still belong to the same language. Political, sociological and historical reasons decide which variety is elected as the standard and prestigious one. Hence, regional varieties tend to be regarded as deviations, outsiders or outcasts. Considering language as a social phenomenon, standard Portuguese was chosen for its official, institutional and academic importance and reach. Nevertheless, some of the “dialectal slips” that are legitimate and of some interest are described below.

2.3.5.1.1 “DIALECTAL SLIPS”

These “dialectal slips” originate in relaxed articulation habits that sometimes occur even in a professional speaker. Table 2.7 presents some of these habits that can be identified in the Oporto region.

Table 2.7: Examples of dialectal slips.

Example      Standard EP   Dialectal change
doutores     [o]           [ow] diphthongization
hoje, ele    [o], [e]      [oj], [je] diphthongization before palatal consonant
regressou    [R]           [r] multiple alveolar trill
embora       [e~]          [6~j]


2.3.5.2 Contextual changing

Linguistic variation is also related to phonetic context and inter-segmental co-articulation phenomena. Beyond the classic, well-known EP distribution features of the phonemes /l/ or /s/, this DB allows an experimental and faithful report of Portuguese phonetic reality, especially concerning suppressions, additions and allophones.

2.3.5.2.1 SUPPRESSIONS OR REDUCTIONS

From the labelling of this DB, it can be observed that the vowels [@] and [u] are often practically omitted at every possible position in the word (beginning, middle or end), except in a tonic syllable. These suppressions, which occur only in unstressed syllables, produce unexpected consonant clusters (Table 2.8).

Table 2.8: Examples of non-tonic vowel suppressions.

Suppressions        [@]                          [u]
In the beginning    <explorado> - [Splu"radu]    Not available
In the middle       <decisão> - [dsi"z6~w]       <português> - [prt"geS]
In the end          <sete> - ["sEt]              <Porto> - ["port]

2.3.5.2.2 VOWEL QUALITY TRANSFORMATIONS

These phenomena occur when two vowels of different qualities meet in an utterance. Two outcomes are expected:

- the two vowels merge and undergo a quality change; this occurs between non-closed vowels (e.g. <fica admirado> [fikadmiradu]; <contra o> [kõtrO]).

- one of the vowels, the closed one, [@] or [i], is reduced and becomes a semivowel; the result is a diphthong (e.g. <se aprende> [sj6pre~d]; <na idade> [n6jdad]).

The events described above are ancient: they have been consciously exploited since Latin literature, which used this knowledge for metrical and rhythmic purposes.

2.3.5.2.3 ADDITIONS

It is also common to produce reduced vocalic sounds, so-called “schwas”, within relaxed consonant groups such as the pair plosive/lateral (pl, tl, kl, bl, dl, gl) or plosive/trill (pr, tr, kr, br, dr, gr): e.g. <branco> [b@rãku].

2.3.5.2.4 ALLOPHONES

Using the common definition of an allophone in phonology as a variant of a phoneme, and analysing the physical and acoustic characteristics of the speech signal, it can be observed that no two identical phones can be found; they all show a certain degree of dispersion, varying with the speaker’s mood, age, health, condition or other factors. Nevertheless, some essential features remain intact and carry the information conveyed. Additionally, contextual interference from neighbouring phones can change a phone so much that it can only be recognized through the phonological structure of the word and its connections to the psycho-cognitive meaning. Some of the changes motivated by the articulatory context are listed and explained below:

- <-te> syllable in word-final position followed by a pause: the closed reduced vowel [@] is acoustically weak and its presence is not strictly necessary for successful communication, which causes its reduction; the plosive is “fricatized” towards the voiceless fricative consonant closest to its point of articulation, [s]; this phenomenon can be observed in the database: e.g. <sete> [sEt].

- <-r> in word-final position followed by a pause: it is a different [r], longer in duration and usually voiceless.

- <l> in closed syllable: as this contextual variant of the phoneme is already recognized in Portuguese phonetics, it was labelled with a dedicated code, [l*], because of its strong distinctive acoustic importance.

- The “fricatization” of voiced plosives in an intervocalic context (< -b- >→ β; < -d- >→ ð ;< -g- >→ γ) can also be observed.

2.3.5.2.5 PHONETIC CHANGES

Co-articulation phenomena and compensatory mechanisms sometimes introduce errors at the physical level, while still assuring successful communication at the perceptual level. This is what happens when a voiced sound, such as a vowel, transmits its voicing to a neighbouring voiceless consonant. This phenomenon, called sonorization, is one of a wide range of phonetic assimilations, which are responsible for diachronic linguistic change when they become habitual. Some of these occurrences can be found in this database (e.g. <ao contrário> [awgo~"trariu]; <quarenta e cinco> [kware~d6jsi~ku]).

2.3.6 Final remarks

It should be mentioned that only tracks 1, 2, 3, 4, 5, 7 and 8 were used in the prosody analyses presented in the next chapters, because they were the ones available at that time. Meanwhile, the entire database has been completely labelled.

Those seven tracks were also separated into paragraphs with their respective labels. Thus, the total of 21 minutes of labelled speech is available either as seven tracks of seven newspaper texts or as a set of 101 paragraphs.

Those 101 paragraphs were later prosodically labelled with accent commands and phrase commands according to the Fujisaki model of F0 [Fujisaki et al., 2001], as described in chapter 4.


2.4 Syllabification

This section describes an algorithm that carries out syllabic splitting automatically, developed as one stage of a more extensive work: the study of prosodic models for EP.

The syllabic splitting work [Gouveia et al., 2000] was conceived for two distinct situations: in the first it is applied to written text, and in the second to the sequence of phonemes actually produced when that text is spoken. Each application has its peculiarities and difficulties, which are described together with the solutions adopted. An error rate of 0.06% is obtained in the first case and 0.89% in the second. The algorithm assumes syllables of the types V, VC, VCC, CV, CVC, CCV and CCVC, where V is a vowel or diphthong and C a consonant. These categories are assumed to cover all existing syllable realizations in Portuguese.

2.4.1 Introduction

It is commonly accepted by authors of prosody models for other languages that the syllable is an important unit in the determination of prosodic parameters, such as phoneme durations or fundamental frequency variations, in speech synthesized from text. Since the aim of this work is the construction of prosody models for a TTS system, a process is needed to split words into syllables automatically. These words may come from a written text or from a sequence of phonemes produced by the speaker.

In preceding studies for Portuguese [Catarino, 2000] and for Spanish [Benenati, 2000], it was observed that there are several common rules for splitting syllables. It was also observed that the sets of rules in both references are not consistent enough to be implemented as an algorithm for automatic syllabic splitting, nor are they sufficient to solve all cases. The two references also contain contradictory rules, such as the splitting of the digraphs <rr> and <ss> in [Catarino, 2000] and the non-splitting of the same digraphs in [Benenati, 2000].

The rules presented in [Catarino, 2000] for Portuguese are the following:

− Diphthongs and triphthongs are not divided;
− Vowels in hiatus1 must be separated;
− The following digraphs are not split: <ch>, <lh>, <nh>, <qu>, <gu>;
− The following digraphs must be divided: <rr>, <ss>, <sc>, <sç>, <xc>;
− Impure consonantal jointures must be separated;
− Identical vowels and the consonant groups <cc> and <cç> must be separated;
− A consonant at the end of a prefix must be linked to the previous syllable if the word begins with a consonant, or to the next syllable if the word begins with a vowel.

The rules presented in [Benenati, 2000] for Spanish are the following:

− When possible, a syllable must end with a vowel;
− Two vowels must be separated unless one of them is a semi-vowel;
− In general, two consonants must be divided. The digraphs <ch>, <ll> and <rr> are considered just one consonant and should not be separated. Double c or double n must be separated;
− The consonants <r> and <l> should not be separated from the preceding consonant unless it is an <s>;
− In clusters of 3 or 4 consonants, the letter s must belong to the preceding syllable.

1 Hiatus - cluster of two vowels belonging to different syllables.

The previous sets of rules aim at the syllabic splitting of text. In contrast, the syllabification intended in this work must separate “phonetic syllables”. Syllables must therefore be separated the way they are pronounced; it makes no sense, for instance, to split the digraph <rr> across different syllables, because the two letters together are produced as a single phoneme.

Despite the previous contradictions, syllabic division is a more or less objective question, except where two vowels may form either a hiatus or a diphthong, and in some cases of consonant clusters. As rising diphthongs2 are unstable, according to [Cunha and Cintra, 1997] and [Bergström and Reis, 1997], and can be uttered as a hiatus or as a diphthong, the cases where two vowels may form a rising diphthong or a hiatus can always be treated as a hiatus. When the vowel sequence indicates a falling diphthong, it very frequently is one. As will be shown in the results section, the mistakes found are exclusively a few very rare situations in which a sequence of two vowels is erroneously interpreted as a falling diphthong.

Consonant clusters in word-medial position are, in several situations (<bc>, <bd>, <bj>, <bs>, <bt>, <cm>, <cn>, <ct>, <dj>, <dm>, <dq>, <cç>, <ds>, <dv>, <fn>, <ft>, <gd>, <gm>, <gn>, <mn>, <pç>, <pn>, <ps>, <pt>, <tm> and <tn>), an ambiguous question from the viewpoints of descriptive linguistics and psycholinguistics. From the point of view of descriptive linguistics, these consonant clusters should be divided into different syllables, generating divisions of the type <rit-mo> – ‘rhythm’. From the psycholinguistic point of view, on the other hand, they should not be divided, resulting in divisions of the type <ri-tmo>. According to [Cunha and Cintra, 1997], both are possible in a tense pronunciation. For the same authors, however, such consonant clusters at the beginning of words are indivisible (e.g. <psi-có-lo-go> – ‘psychologist’). Both points of view were implemented in different versions of the algorithm; only the results of the first one are reported.

As mentioned before, the syllabic division operations were applied to written text and to spoken text.

The first situation considers the sequence of graphemes as written in the text, after some pre-processing.

The second situation, aimed only at prosodic analysis, considers exactly the phones resulting from the phonetic transcription of the FEUP-IPB database, as described in the previous section. This case has the following type of sequence (using the symbols of Table 2.5):

u~ !"por!t vi~!"taZ 6!ke"sew !"ti!tlu X 6"Raj6 "miu!d6 !"krES@ X 6 "OLZ "viS!tS X

2 Rising diphthongs consist of a semi-vowel followed by a vowel. Falling diphthongs consist of the opposite sequence, a vowel followed by a semi-vowel.


This short example of ‘spoken text’ corresponds to the text “Um porto vintage aqueceu o título. Arraia miúda cresce a olhos vistos” – ‘A vintage port warmed the title. Small teams rise in classification’.

The semi-vowels are grouped with vowels to form diphthongs. A semi-vowel surrounded by two vowels is grouped with the previous vowel, forming a falling diphthong instead of a rising diphthong, for the reason pointed out before.

Syllables of the written text whose vowels are suppressed become quite difficult to identify (e.g. <futebol> [ftbOl] – ‘football’). New consonant clusters emerge from consonants that originally belonged to different syllables. As the objective is to identify the original sequence, this leads to the consideration of syllables formed only by consonants, where the vowels were suppressed.

The word boundary codes and the beginning-of-tonic-syllable codes are used as syllable boundaries, which facilitates the correct identification of those boundaries.

The merged vowels of two consecutive words raise the problem of automatically deciding which word they belong to.

The phenomena of addition, as discussed in 2.3.5.2.3, and suppression are also very frequent and form new legal syllables (e.g. <bran-co> [b@-r6~-ku], <pa-ra> [pr6]), which again hinders the identification of the syllable boundaries as produced in the written form.

Section 2.4.2 describes the set of rules and their implementation for the written text. Section 2.4.3 describes the set of rules and their implementation for the spoken text. Section 2.4.4 presents the results for both implementations and an error analysis. Section 2.4.5 presents the conclusions of the syllable splitting rules.

2.4.2 Syllable splitting of written text

This section presents the set of rules generically applied to the syllabic splitting of text and some pre-processing used to identify the diphthongs and replace them by a code, as well as to convert some digraphs corresponding to nasal vowels and phonetic consonants (e.g. <am>, <rr>, <ss>, <ch>, <lh>, <nh>) into their codes. Section 2.4.2.2 presents the algorithm that implements the rules.

2.4.2.1 Rules

The set of rules for syllabic splitting is based on the assumption that any EP syllable belongs to one of the following groups: V, VC, VCC, CV, CVC, CCV and CCVC, where C means a phonetic consonant and V a vowel or diphthong. This assumption contributes enormously to the process of detecting syllable boundaries. The small number of cases not solved by this assumption alone demand complementary rules. Just two types of situations remain unsolved: two consonants between vowels (...VCCV...) and three consonants between vowels (...VCCCV...).

The first case is solved using the rule that a syllable boundary cannot separate a consonant from a following vowel (C-V): if the two consonants form an inseparable pair, that is, the first consonant belongs to the group <b, p, d, t, g, k, v, f> (<k> corresponds to one of the letters <k>, <c> or <q>) and the second one belongs to the group <l, r> (e.g. <a-tlas>), then the two consonants start a new syllable; if not, the boundary is necessarily between the consonants (e.g. <al-tas>).

The second case, (...VCCCV...), is solved by the following rule: as a sequence of three consonants cannot belong to the same syllable, the boundary is placed between the second and the third consonants if the first two consonants form an inseparable pair or if the second consonant is an <s>, since an <s> preceded by another consonant places the boundary after it (e.g. <obs-tar>); if not, the boundary is necessarily between the first two consonants (e.g. <ul-tra>).

When two or more vowels follow in sequence, it is necessary to verify whether they form a falling diphthong or a hiatus3. To detect falling diphthongs, the sequence of a vowel followed by a semi-vowel is searched for. The letters <i> and <u> are considered phonetic semi-vowels when they do not precede an <r> or <l> that is the last letter of the word or the first of two or more consonants (e.g. semi-vowels: <cai>, <cai-ro>; hiatus: <ca-ir> and <ca-ir-mos>). They are not considered semi-vowels when preceded by the same letter (e.g. <ni-ilismo>), when preceding the vowel <u> (e.g. <ca-iu>), or in the case of nasalization (e.g. <a-in-da>). Finally, the letter <o> preceded by the letter <a> is also considered a semi-vowel (e.g. <ao>).

2.4.2.2 Algorithm

The implementation was done in C. Fig. 2.11 illustrates the flow chart responsible for splitting one word. The word to be processed is stored in the string pal, represented as in the C language: the character ‘\0’ marks the end of the word and the first character of the string has index 0. The variable i is the index of the grapheme being processed. The functions vowel(x) and semivow(x) identify whether character x is a phonetic vowel or semi-vowel, respectively. The function put(x) sends character x to the output string. The function seg(x,y) verifies whether x and y form a pair of inseparable consonants. Finally, the character ‘.’ is used as the syllable boundary (e.g. for the input string <fluxograma>, the result is the string <flu.xo.gra.ma>).
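The consonant-cluster rules of 2.4.2.1 can be sketched compactly. The example below is written in Python for brevity and is not the original C implementation or a transcription of its flow chart. It assumes a word that has already been pre-processed so that every diphthong and digraph is a single symbol; it therefore treats two adjacent vowels as a hiatus and omits the semi-vowel detection. The names `VOWELS` and `split_word`, and the reuse of the three-consonant rule for longer clusters, are assumptions made for illustration.

```python
VOWELS = set("aeiouáàâãéêíóôõú")
FIRST_OF_PAIR = set("bpdtgkcqvf")   # <k> stands for the letters k, c or q
SECOND_OF_PAIR = set("lr")


def vowel(ch):
    """True if ch is treated as a vowel (diphthongs assumed pre-coded)."""
    return ch in VOWELS


def seg(x, y):
    """True if x and y form an inseparable consonant pair, e.g. 'tl', 'gr'."""
    return x in FIRST_OF_PAIR and y in SECOND_OF_PAIR


def split_word(pal):
    """Insert '.' at syllable boundaries using the V.CV, VCCV and VCCCV rules."""
    nuclei = [i for i, ch in enumerate(pal) if vowel(ch)]
    cuts = []
    for left, right in zip(nuclei, nuclei[1:]):
        cluster = pal[left + 1:right]          # consonants between two vowels
        if len(cluster) <= 1:
            cut = left + 1                     # V.V or V.CV
        elif len(cluster) == 2:
            # inseparable pair: V.CCV, otherwise VC.CV
            cut = left + 1 if seg(cluster[0], cluster[1]) else left + 2
        elif seg(cluster[0], cluster[1]) or cluster[1] == "s":
            cut = left + 3                     # VCC.CV
        else:
            cut = left + 2                     # VC.CCV
        cuts.append(cut)
    pieces, start = [], 0
    for cut in cuts:
        pieces.append(pal[start:cut])
        start = cut
    pieces.append(pal[start:])
    return ".".join(pieces)


for word in ["fluxograma", "atlas", "altas", "obstar", "ultra"]:
    print(split_word(word))
```

On the examples quoted in the text, this sketch gives flu.xo.gra.ma, a.tlas, al.tas, obs.tar and ul.tra. Words containing diphthongs (e.g. <cairo>) would first need the pre-processing step described in 2.4.2.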

3 Rising diphthong can be interpreted as hiatus, as already mentioned in the introduction.


Fig. 2.11 – Flow chart for the syllabic splitting of one word of a written text. V-vowel; C-consonant; ...- any sequence of graphemes; .- syllable boundary; ?-grapheme not determined yet; bold- grapheme already stored in the output string; underline-pointed grapheme by index i.

[Flow chart not reproduced. Starting from i<-0 at the beginning of the word, its decision nodes test vowel(pal[i]), vowel(pal[i+1]), vowel(pal[i+2]), semivow(pal[i]), seg(pal[i-1], pal[i]), pal[i]='s' and pal[i]='\0'; its actions are put(pal[i]), put(pal[i-1]), put('.') and i++, until the end of the word.]


2.4.3 Syllabic splitting of spoken text

The next section presents the set of rules specifically applied to the syllabic splitting of the phonetically transcribed sequences, here called spoken text. It justifies some modifications introduced in the set of syllable types and in the treatment of diphthongs. Section 2.4.3.2 presents the algorithm that implements those rules.

2.4.3.1 Rules

In this case the distinction between diphthongs and hiatus is simplified, because the semivowels are already identified by their respective labels.

The major problem is the suppression of several vowels, which produces clusters of consonants from different syllables and makes it difficult to identify the syllable boundaries that correspond to the written text. The suppression phenomena lead to the consideration of two more abstract syllable types besides the ones listed in 2.4.2.1: C and CC. The new types arise from syllables of types CV, CVC and CCV whose vowel was suppressed. In syllables of type CCVC these suppression phenomena are very rare (e.g. [p@nEtr6S]). Thus, the option was not to consider syllables of type CCC, avoiding very frequent erroneous boundary identifications.

The syllabic splitting of the spoken text is also based on the assumption that any syllable belongs to one of the types V, VC, VCC, CV, CVC, CCV, CCVC, C and CC. However, due to the additional difficulties introduced by vowel suppression, an additional set of rules is needed for those specific situations:

− The consonants [l, r, S, z and Z] followed by another consonant always precede a syllable boundary (e.g. [sal-tu]). This group also includes the consonant [Z] resulting from the voicing of the unvoiced consonant [S] in a voiced context (e.g. [meZ-mu]).
− The consonants [S, z and Z] in word-final position belong to the previous syllable4.
− A vowel followed by one of the following pairs of consonants {[bk], [bd], [bZ], [bs], [bt], [km], [kn], [kt], [ks], [dZ], [dm], [dk], [ds], [dv], [fn], [ft], [gd], [gm], [gn], [mn], [ps], [pn], [pt], [tm] and [tn]} places a syllable boundary between the consonants, producing a syllable of the type VC (e.g. [ap-tu]). The same pairs of consonants at the beginning of a word are not separated (e.g. [pnew]) [Cunha and Cintra, 1997].

2.4.3.2 Algorithm

Before the syllabic splitting algorithm is applied to each word, the occlusion marks (!) and the semi-vowels are removed from the string of phonetic symbols and their original positions are stored. After the syllabic splitting, those marks are re-introduced at their original positions to restore the correct sequence of segments (occlusive consonants and diphthongs).

The algorithm inserts the next syllable boundary and then returns to its beginning to find another boundary, until the end of the word.
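The removal and re-insertion of marks can be sketched as follows. This is an illustrative assumption rather than the original implementation: the symbols are held in a Python list, the helper names `strip_marks` and `restore_marks` are hypothetical, and the sketch assumes that an occlusion mark attaches to the symbol that follows it while a glide stays with the symbol that precedes it. The syllable split itself is supplied by hand in the example.

```python
OCCLUSION = "!"
GLIDES = {"j", "w", "j~", "w~"}


def strip_marks(symbols):
    """Return the symbols without marks, plus a map position -> removed mark."""
    kept, removed = [], {}
    for pos, sym in enumerate(symbols):
        if sym == OCCLUSION or sym in GLIDES:
            removed[pos] = sym
        else:
            kept.append(sym)
    return kept, removed


def restore_marks(syllables, removed):
    """Re-insert the removed marks at their original positions.

    syllables must be a non-empty list of symbol lists."""
    restored, pos = [], 0
    for syllable in syllables:
        out = []
        for sym in syllable:
            # Marks stored before this symbol (typically "!") go in front of it.
            while pos in removed:
                out.append(removed[pos])
                pos += 1
            out.append(sym)
            pos += 1
            # Glides stored right after it stay in the same syllable.
            while removed.get(pos) in GLIDES:
                out.append(removed[pos])
                pos += 1
        restored.append(out)
    # Any mark left after the last symbol joins the last syllable.
    while pos in removed:
        restored[-1].append(removed[pos])
        pos += 1
    return restored


kept, removed = strip_marks(["!", "p", "o", "r", "!", "t", "u"])
print(kept)  # ['p', 'o', 'r', 't', 'u']
# Hypothetical output of the splitter for the stripped sequence:
print(restore_marks([["p", "o", "r"], ["t", "u"]], removed))
```

The tonic syllable marker is not handled here; in the text it is used directly as a syllable boundary.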

4 This rule produces some unrecognised syllables in words whose last syllable originally consisted of one of those consonants followed by a suppressed vowel (e.g. original phonetic word [Ri-a-Z@], unrecognised syllable [Ri-aZ]).


An execution cycle of the algorithm presented in Fig. 2.12 ends with a decision on the type of syllable detected: V, CV, CVC, etc.

The function F(x) reads the phoneme with index x. The phoneme can belong to one of the following groups: V – vowel; C – consonant; V1 – one of the vowels [a, 6, O, o]; C1 – one of the consonants [b, p, d, t, g, k, v, f]; C2 – one of the consonants [l, r]; C22 – one of the consonants [S, z, Z]; C3 – one of the consonants [l, r, S, z, Z]; C-C3 – a consonant not in group C3; ac – the phoneme carries the tonic syllable marker (only the first phoneme of a syllable can carry this marker); fp – end of word (the last phoneme of the word precedes this mark).

The function cond(a,b) returns the logical value 1 (yes) if [(F(a)=C1 and F(b)=C2) or ((syllable at the beginning of the word) and (F(a)F(b) is one of the sequences: {[bk], [bd], [bZ], [bs], [bt], [km], [kn], [kt], [ks], [dZ], [dm], [dk], [ds], [dv], [fn], [ft], [gd], [gm], [gn], [mn], [ps], [pn], [pt], [tm], [tn]}))]; otherwise it returns the logical value 0 (no).
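The test performed by cond(a,b) can be written down directly. The sketch below is one reading of the definition, in which the word-initial condition applies only to the listed pairs. Unlike the original function, which takes phoneme indices, this hypothetical version takes the two phonemes themselves and an explicit `word_initial` flag.

```python
C1 = {"b", "p", "d", "t", "g", "k", "v", "f"}
C2 = {"l", "r"}
# Pairs that are kept together only at the beginning of a word.
WORD_INITIAL_PAIRS = {
    "bk", "bd", "bZ", "bs", "bt", "km", "kn", "kt", "ks", "dZ", "dm", "dk",
    "ds", "dv", "fn", "ft", "gd", "gm", "gn", "mn", "ps", "pn", "pt", "tm",
    "tn",
}


def cond(first, second, word_initial):
    """True if the two consonants must stay in the same syllable."""
    inseparable = first in C1 and second in C2
    initial_cluster = word_initial and (first + second) in WORD_INITIAL_PAIRS
    return inseparable or initial_cluster


print(cond("t", "r", False))  # True: inseparable pair
print(cond("p", "n", True))   # True: word-initial cluster, as in [pnew]
print(cond("p", "t", False))  # False: split as in [ap-tu]
```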

The decision process for splitting the word [kaz6] is presented as an example. F(1) is the phoneme [k]; as it is a consonant, the algorithm takes the right branch and reads F(2). As F(2) is the phoneme [a], a vowel, the algorithm proceeds by the left branch and reads F(3). As F(3) is the consonant [z], which belongs to group C3, the left branch is taken and F(4) is read. As F(4) is the vowel [6], the decision is a syllable of type CV, and the boundary is inserted after [ka]. In the new cycle the algorithm reads the new F(1); as it is the consonant [z], the algorithm proceeds by the right branch and reads F(2). As F(2) is now the vowel [6], the algorithm follows the left branch and reads F(3). As F(3) is the end of word (fp), the new syllable is of type CV. The splitting of the word is finished, producing the syllables [ka-z6].


Fig. 2.12 – Flowchart of a spoken text syllabic splitting.

[Flowchart not reproduced. It starts at "Beginning of syllable", reads F(1) to F(5), tests cond(1,2), cond(2,3), cond(3,4) and "F(1)=V1 and F(2)='b'", branches on the phoneme groups V, C, C2, C22, C3, C-C3, ac and fp, and ends in syllable types such as V, VC, VCC, CV, CVC, CCV, CCVC, C, CC and CCC.]


2.4.4 Analysis and results

The error rate was determined by dividing the number of errors by the effective number of syllables, expressed as a percentage. The effective number of syllables and the effective boundaries were manually determined by a linguistic expert.

The presented results were measured using a test set of texts different from the one used in the development and refinement process. Both sets belong to the corpus used in the FEUP-IPB database.

The written text algorithm was tested with a set of non-repeated words taken from five texts of the mentioned corpus. Only words with more than two letters were considered. The algorithm made just two mistakes in a total of 1164 words and 3387 syllables, corresponding to an error rate of 0.06% per syllable. Both errors correspond to a hiatus wrongly interpreted as a falling diphthong (<cai-re-mos> and <reu-ni-ão>). This error takes place when a vowel is followed by the grapheme <i> or <u> that, exceptionally, does not behave as a semi-vowel (the rules classify this generic case as a falling diphthong). The identified errors have no immediate solution, since other words of the same kind behave differently (e.g. <Cai-ro> and <reu-má-ti-co>). Taking the syllabic context into account will probably help to solve these cases.

The spoken text algorithm was tested with two texts from the mentioned corpus. Monosyllabic words were not considered. The algorithm produced fourteen (14) mistakes in a total of 1569 syllables, corresponding to an error rate of 0.89%. The mistakes took place in seven different words (this test admitted repeated words): <futebol> [ft-“bol], <evidentemente> [iv-de~-t-“me~-t], <ministério> [mniS-“tE-riw], <irresponsabilidade> [iRS-po~-s6-bli-“da], <industrial> [i~d-S-tri-“al], <acusação> [6k-z6-“s6~w], <demonstração> [dmo~S-tr6-“s6~w]. Tonic syllables are identified by <”>. All mistakes took place in syllables where the vowel was suppressed and the consonants were associated with the neighbouring syllable. Situations where the spoken text boundaries do not follow the written text boundaries but the produced syllables are phonetically ‘admissible’ (e.g. pa-ra uttered as [pr6]) were not counted as errors.
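The two reported error rates can be checked with a few lines of arithmetic. The function name below is illustrative; the counts are the ones given in the text.

```python
def error_rate_percent(errors, total):
    """Number of errors divided by the number of reference units, in percent."""
    return 100.0 * errors / total


written = error_rate_percent(2, 3387)   # written text: 2 errors in 3387 syllables
spoken = error_rate_percent(14, 1569)   # spoken text: 14 errors in 1569 syllables
print(f"{written:.2f}% {spoken:.2f}%")  # prints "0.06% 0.89%"
```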

2.4.5 Conclusions

The developed algorithms have shown different error rates in the two applications (written and spoken text), as expected, due to the additional difficulty introduced in spoken text by the vowel suppression phenomenon. Nevertheless, in both cases the error rate is very low: 0.06% and 0.89% for written and spoken text, respectively.

Unfortunately, there are no other published works with measured results for comparison. In both cases, however, the results fulfil the objectives.


2.5 Phonetic Transcription from Text

An exercise in phonetic transcription of text for EP using grapheme-phone conversion rules was carried out in collaboration with Daniela Braga and Paulo Gouveia in an unpublished work.

Previously published work on grapheme-phone conversion for EP was presented in [Trancoso et al., 1994], [Teixeira, 1995] and, more recently, [Caseiro and Trancoso, 2002]. The work presented by Teixeira implements the grapheme-phoneme conversion in the EP version of the MULTIVOX TTS system [Teixeira et al., 1998], and proceeds in two phases. The first phase consists of applying a list of rules, presented in tabular format, that converts sequences of graphemes into sequences of phoneme codes. These rules specify the elementary conversion of graphemes, sequences of graphemes, words and parts of text. The second phase consists of a programmed application of the more complex rules, which corrects several sequences produced by the previous phase. Diamantino Caseiro and co-workers presented a description and comparison of a rule-based approach, a data-driven approach by means of Weighted Finite State Transducers (WFSTs) trained with automatically transcribed material, and a hybrid approach. The best score was achieved with the rule system, with an error rate per word of 3.25%, whereas the compilation of that set of rules into WFSTs scored 3.56%. The WFSTs implemented with a data-driven approach achieved a 9.02% error rate. Combining data-driven and knowledge-based approaches, the best score was 3.94%. Although the rule-based system achieved the best scores, the data-driven approach combined with a knowledge base is very promising.

Filipe Barbosa and co-authors presented a rule-based grapheme-phone transcription algorithm for a Brazilian Portuguese TTS system [Barbosa, Ferrari and Resende, 2003], with an accuracy rate of 98.4% per phone.

Despite the set of rules and table of exceptions presented, the problem of homograph words remains unsolved. These cases can often be solved by knowing the morphology of the word, for words like <espeto> verb – [SpEtu] and <espeto> noun – [Spetu], which have different morphological categories. In other words like <sede> noun ‘headquarters’ [sEd@] and <sede> noun ‘thirst’ [sed@], with identical morphological categories, even this information cannot help in the decision. Filipe Barbosa and co-authors presented a work [Barbosa et al., 2003] to disambiguate the word <sede> with an accuracy rate of 95%.

The set of grapheme-phoneme conversion rules described in this section was implemented in the FEUP-TTS system. The list of rules is not yet complete, but it solves almost all cases. Only some graphemes in EP, <a, e, o, x>, are more complex to convert by rules, or cannot even be completely described by rules alone without morphological knowledge or even knowledge of the origin of the word. Therefore a special list of rules and their exceptions was developed for those graphemes. The exceptions and other cases not solved by rules can be correctly converted using an additional conversion table.

The produced set of rules for EP incorporates the previous rules reported in [Teixeira, 1995]. As in other languages, EP has graphemes that are univocally converted into one phoneme, for which a simple rule suffices, sequences of two graphemes converted into just one phoneme (e.g. <ch> – [S] and <lh> – [L]), and even one grapheme converted into more than one phoneme (e.g. <têm> – [t6~j~6~j~]). These cases are well behaved and always have the same conversion. The major problems arise in the conversion of the graphemes <a, e, o, x>, which can be converted into different phonemes according to the specific case, and no known set of rules can solve all cases. This work concentrates on those graphemes, since the others can be transcribed using rather straightforward rules.
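The well-behaved cases can be handled by a longest-match lookup, as in the minimal sketch below. The digraphs <ch> and <lh> are taken from the text and <nh> – [J] from the rule tables; the small single-letter table and the convention of leaving unresolved graphemes in angle brackets for later context rules are assumptions made only for illustration.

```python
# Two-grapheme sequences that always map to one phone.
DIGRAPHS = {"ch": "S", "lh": "L", "nh": "J"}
# Assumed one-to-one graphemes (illustrative subset, not the full rule set).
SINGLE = {"p": "p", "b": "b", "t": "t", "d": "d", "f": "f", "v": "v"}


def convert_simple(word):
    """Convert the context-free graphemes of a word, longest match first."""
    phones = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        elif word[i] in SINGLE:
            phones.append(SINGLE[word[i]])
            i += 1
        else:
            # Context-dependent grapheme (e.g. <a>, <e>, <o>, <x>): left marked.
            phones.append("<" + word[i] + ">")
            i += 1
    return phones
```

With these tables, `convert_simple("filho")` resolves <f> and <lh> and leaves <i> and <o> marked for the context rules.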


The set of phones used for EP in SAMPA code is presented in Table 2.9.

Table 2.9: List of phones used in Phonetic transcription.

SAMPA Code | Example | SAMPA Code | Example
a | pato | k | casa
6 | bala | b | bata
E | terra | d | dado
e | Pedro | g | gato
@ | secar | m | ama
i | livro | n | nada
O | gola | J | pinho
o | poço | l | lado
u | pula | l* | alto
j | pai | L | filho
w | pau | r | caro
j~ | lições | R | carro
w~ | coração | v | vaca
6~ | canta | f | filho
e~ | dente | z | casa
i~ | Pinto | s | sábado
o~ | ponte | S | chama
u~ | fundo | Z | jardim
p | para | |

Fig. 2.13 displays the processing blocks sequence leading to the phonetic transcription in the FEUP-TTS system. Fig. 2.14 displays the processing sequence of the phonetic transcription block.

Fig. 2.13 – Previous processing blocks of phonetic transcription.

[Block labels: Word boundaries; Pre-phonetic transcription; Phonetic transcription; Tonic syllable detection; Syllabic splitting.]


Fig. 2.14 – Processing of phonetic transcription.

2.5.1 Dedicated ANN to transcribe the graphemes <a> and <e>

A very preliminary exercise was made with ANNs as an alternative way of transcribing the graphemes <a> and <e>. The experiment was not exhaustively explored or optimized, but it gives hints of a promising process to solve specific problems that are not easily solved by rules.

The experiment consisted of finding one of the possible phonemes to transcribe the specific grapheme <a> or <e>. For <a>, the possible phonemes considered were [a, 6], and for <e> the possible phonemes were [E, e, @, i]. A perceptron layer ANN was used for grapheme <a>, and a feed-forward ANN for grapheme <e>. For each grapheme, about 2 thousand non-repeated words were used in the training set and another 2 thousand in the test set. The features used in both cases are listed below:

• Position of the grapheme's syllable relative to the tonic syllable (5 possibilities: tonic syllable; previous syllable; syllable before the previous; next syllable; syllable after the next);

• Identity of previous phoneme;

• Identity of next phoneme;

• Position of syllable in word (3 possibilities: beginning; middle; end);

• Position of grapheme in syllable (3 possibilities: beginning; middle; end);

• Position of grapheme in word (3 possibilities: beginning; middle; end);

• Closed syllable finished with <al> sequence (used just in grapheme <a> ANN).
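The text does not state how these categorical features were presented to the networks. One common choice is one-hot coding, sketched below for the six features shared by both ANNs; the category names, the boundary symbol "#" and the function names are all assumptions, not the encoding actually used.

```python
TONIC_RELATIVE = ["before_previous", "previous", "tonic", "next", "after_next"]
THREE_WAY = ["beginning", "middle", "end"]


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of value."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def encode_features(tonic_rel, prev_phone, next_phone, syl_in_word,
                    graph_in_syl, graph_in_word, phones):
    """Concatenate the one-hot codes of the six shared features."""
    return (one_hot(tonic_rel, TONIC_RELATIVE)
            + one_hot(prev_phone, phones)
            + one_hot(next_phone, phones)
            + one_hot(syl_in_word, THREE_WAY)
            + one_hot(graph_in_syl, THREE_WAY)
            + one_hot(graph_in_word, THREE_WAY))


# Hypothetical tiny phone inventory; "#" stands for a word boundary.
phones = ["#", "p", "t", "u"]
vector = encode_features("tonic", "p", "t", "beginning", "end", "beginning", phones)
```

The vector length is 5 + 2 × (inventory size) + 9, so the input layer grows with the phone inventory.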

The relevance of each feature was measured by comparing the performance with and without that feature. The first three features (plus the last feature for grapheme <a>) clearly influence the performance. The other features do not influence the general performance individually, but including them together in the feature set improves it.

The output of a perceptron layer is binary, so each output level was associated with one of the target phones [a, 6]. The best (lowest) measured error rate (number of errors/number of graphemes) for the grapheme <a> in the test set was 1.7%.

The cases in which grapheme <e> must be converted into the nasal vowel [e~] were previously transcribed by rules. Several ANN output codifications could be used to select one of the four phonemes [E, e, @, i]. The option chosen was 4 output nodes, one per phoneme, selecting the node with the highest output. This grapheme is more difficult to transcribe correctly because four phonemes can be obtained. The best measured error rate was 6.4%.
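The selection of the highest output node can be sketched as follows. The order of the nodes and the function name are assumptions; only the one-node-per-phoneme scheme comes from the text.

```python
PHONES_E = ["E", "e", "@", "i"]  # assumed node order


def decode_e(outputs):
    """Return the phoneme associated with the highest of the four outputs."""
    if len(outputs) != len(PHONES_E):
        raise ValueError("expected one output value per phoneme")
    best = max(range(len(outputs)), key=lambda k: outputs[k])
    return PHONES_E[best]
```

For example, hypothetical outputs `[0.1, 0.7, 0.15, 0.05]` would select [e].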

[Fig. 2.14 block labels: Table of exceptions; Rules; Co-articulation rules.]

The solution of transcribing a specific grapheme with a dedicated ANN was not completely explored with respect to possible features, such as more phoneme context information, or to the implementation of some well-known rules. There is therefore room for improvement with the present approach that has not yet been exploited.

2.5.2 Rules to transcribe graphemes <a>, <e>, <o> and <x>

The rules presented in this section are ordered from the most specific to the most general, since they are applied in order of appearance. The set of rules does not solve all cases. The exceptions are converted by a table of exceptions.

The sequence of phones to be produced by the TTS system is finally established after the application of the co-articulation rules. This final block of rules attempts to reduce the distance between the phonetic transcription and the phone sequence actually produced by a speaker. This distance is often created by co-articulation effects between neighbouring sounds, which is why this block is called co-articulation rules.

2.5.2.1 Rules for grapheme <a>

Grapheme <a> can be produced as phone [a] or [6] in general. Table 2.10 displays the set of implemented rules in FEUP-TTS system.

Table 2.10: rules for conversion of grapheme <a>, presented by priority order.

Rule | Phone | Example
<al>, <ax> or <az> in end of syllable position | a | almirante, fax, goraz
<ar> in end of word position | a | armar
in end of syllable position followed by <m>, <n> or <nh> | 6 | lama, cama
in tonic syllable | a | pato, guarda
followed by semi-vowel | a | ao, lidai, paulada
in <acc> or <acç> | a | acção
in end of syllable position followed by <xai>, <xou> or <xei> | a | praxai, taxou, taxei
in end of syllable position followed by <x> | a | praxe
other cases | 6 | seta

The accuracy of the conversion of grapheme <a> was measured automatically using texts not seen in the development phase. The text contains 5619 <a> graphemes in non-repeated words. The set of rules failed in 19 cases, giving an error rate of 0.34%.
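The priority ordering of Table 2.10 amounts to a first-match rule list. The sketch below covers only four of the rules plus the default, and it assumes a hypothetical context record holding the written syllable that contains the <a>, the following written syllable and a tonic flag; it illustrates the mechanism and is not the FEUP-TTS implementation.

```python
from dataclasses import dataclass


@dataclass
class AContext:
    syllable: str        # written syllable containing the <a>
    next_syllable: str   # following written syllable, "" at end of word
    is_tonic: bool


# (condition, phone) pairs in priority order; the first match wins.
RULES = [
    (lambda c: c.syllable.endswith(("al", "ax", "az")), "a"),
    (lambda c: c.next_syllable == "" and c.syllable.endswith("ar"), "a"),
    (lambda c: c.syllable.endswith("a")
     and c.next_syllable.startswith(("m", "n")), "6"),
    (lambda c: c.is_tonic, "a"),
    (lambda c: True, "6"),   # other cases
]


def transcribe_a(ctx):
    """Return the phone of the first rule whose condition holds."""
    for matches, phone in RULES:
        if matches(ctx):
            return phone
```

Because the nasal-context rule precedes the tonic rule, the tonic <a> of <la-ma> yields [6] while the tonic <a> of <pa-to> yields [a], as in the table.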

2.5.2.2 Rules for grapheme <e>

Grapheme <e> can be produced as phonemes [E], [e], [@] or [i] in general. In some particular cases, due to the articulation process, as will be seen in section 2.5.3, this grapheme can also be produced as phoneme [6]. Table 2.11 displays the set of rules implemented in the FEUP-TTS system.


Table 2.11: rules for conversion of grapheme <e>, presented by priority order.

Rule | Phone | Example | Exception
words started by: <hiper>, <hetero>, <ferro>, <recti>, <enea>, <etno>, <helio>, <hemo>, <hemato>, <hemi>, <hepta>, <hexa>, <mega>, <neo>, <tetra>, <ego>, <ergo>, <hepa>, <herma>, <herb>, <hectares>, <super>, <aero> | E | hipermercado, heterogéneo, ferromagnético, rectilíneo, etc. | ferroso
<he> in beginning of word | i | herói, herança |
<é> | E | até, patético |
<en> in end of word | En | líquen, Cármen |
<ela> or <elas> in end of word | E | vela, velas, bela, belas | estrela (n)
<ês>, <esa>, <essa> or <er> in end of word | e | Portuguesa, condessa, espremer | quer, essa, mulher
tonic syllable <ez> in end of word | e | sensatez, timidez, fez |
<el> in end of syllable | E | papel, carrossel, relva, selvático |
<ect>, <ecç>, <e.x> (5) (<e> being in open syllable (6)), <egn>, <epç> or <ept> | E | dialecto, direcção, reflexo, interregno, acepção, susceptível |
in words with 3 or more syllables: <er> in tonic syllable in interior position | E | interno, deserto, moderno |
in words with 2 syllables <er> in tonic syllable | e | verde, nervo, perda | terno, servo, lerdo, perca, perco
<ê> | e | lê |
<êm> in end of syllable | 6~j~6~j~ | têm, vêm |
<e.m>, <e.n> in tonic syllable | e | cena, pena, comemos | higiene, solene, gene, creme, leme
<e> in open tonic syllable preceding <lh> [L], <nh> [J], <ch> [S], [Z] and [j] | 6 | telha, lenha, seja, fecha, senha, areia, sei |
<e> in first syllable in words of 2 syllables ending with <s>, <e> or <m> | E | deves, temes, segues, bebe, bebem | esses, estes, temos, esse, este
<e> in first syllable in words of 2 syllables ending with <o> | e | temo, devo, bebo, medo | Melo, prelo
<e> in first syllable in words of two syllables ending with <i> | @ | temi, bebi, meti |
<e> in end of word | @ | feche, vinte, que |
<e> in beginning of word in open non tonic syllable | i | economia, energia, evento | elite, emir
<e> in non tonic syllable preceding vowel to which they can make a diphthong | j or i | Leonor, real, veado, área |
<es> in beginning of word in closed non tonic syllable | none | estrada, esperto |
<ex> in beginning of word in closed non tonic syllable | none or 6j | extensão, excelente |
<ex> in closed tonic syllable | 6jS | pretexto, contexto |
<em> in end of word | 6~j~ | bem, ordem, homem |
<emp> or <emb> | e~ | empresa, empurrar, sempre |
<en> in closed syllable | e~ | centro, lento, legenda |
<ên> in closed syllable | e~ | existência, aparência, frequência |
<e> in tonic syllable | e | preto |
<e> in non tonic position | @ | apetecer, receber |

5 The symbol . represents the syllable boundary.

6 Open syllable finishes with vowel; closed syllable finishes with consonant.

2.5.2.3 Rules for grapheme <o>

Grapheme <o> can be produced as phonemes [O], [o], or [u] in general. Table 2.12 displays the set of implemented rules in FEUP-TTS system.

Table 2.12: rules for conversion of grapheme <o>, presented by priority order.

Rule | Phone | Example | Exception
words started by: <hemo>, <homeo>, <iso>, <lito>, <meso>, <neo>, <octo>, <oftalmo>, <onoma>, <oxi>, <orto>, <quiro>, <rino>, <rizo>, <xeno>, <xilo>, <megalo>, <eco>, <horós>, <lipo> | O | isotérmico, xilofone, horóscopo, etc. | ortografia, lito, litoral
words started by: <aero>, <agro>, <astro>, <bio>, <auto>, <electro>, <hetero>, <micro>, <poli>, <pseudo>, <proto>, <retro>, <termo> | O | polivalente, aeronáutica, agropecuária, etc. | biologia, biografia, astrologia, astronomia, automóvel, fonologia, fotografia, termo
<oo> in end of word, the first <o> being in tonic syllable | oow | voo, perdoo |
<oso> in the end of word | o | virtuoso, saudoso, Matoso |
<osa>, <osos>, <osas> in the end of word | O | virtuosa, virtuosos, virtuosas |
<oro>, <oros>, <ore>, <ores>, <ora>, <oras> in the end of word | O | melhore, melhoro, melhora, adoro, moro |
<oca>, <ocas> in tonic syllable | O | doca, toca, minhoca |
<ote>, <otes>, <ota>, <otas> in tonic syllable in the end of word | O | sacerdote, boicota, lotes | gota, rota, minhoto, minhota
<doxo>, <doxos> in end of word | O | paradoxo |
<onho>, <onhos>, <onha>, <onhas> in end of word | o | risonho, ponho |
after vowel [a] forming diphthong | w | ao, aos |
<o>, <os> in end of word | u | filhos, medo, nossos, o |
<ou> | o | estou, ouve, ouro, louro, pouco |
<oi> | o | boi, coito, oito | comboio
<oz> in end of word | O | voz, algoz, atroz | arroz
<oc>, <op> in beginning of word followed by consonant | O | octano, optimismo, opção, Octávio |
<or> in end of word | o | compor, repor, dor, amor, exterior | maior, menor, melhor, pior, major, suor, por
<or> in beginning of word | O | orca, orgânico, orla |
<or> in tonic syllable | O | sorte, morte, porte | força, acordo, aborto, desporto, Porto, gordo, forma (n)
<ol> in end of word or in tonic syllable | O | futebol, sol, folga, volta | solto, solta
<ol> in non tonic syllable | o | soltar, voltar |
<ô> | o | pôr, sôfrego, pôde |
<o.a> (7) forming a hiatus where <o> is in tonic syllable | o | Lisboa, pessoa, boa, perdoa, canoa |
<o.a> forming a hiatus where <o> is in non tonic syllable | u | soante, voar, Póvoa, mágoa |
<ob>, <obs> tonic syllable in beginning of word | O | obter, objecto, obstáculo |
<om>, <on> in closed syllable | o~ | com, som, comprar, contente |
<õe> | o~j~ | põe, lições, razões |
<ão> | 6~w~ | coração, violão, sensação |
<o> in tonic syllable | o | como |
<o> in non tonic syllable | u | comigo, posição, átono |

2.5.2.4 Rules for grapheme <x>

Grapheme <x> can be produced as phonemes [S], [z] or [ks] in general.

The rules implemented in the FEUP-TTS system for <x> are presented in algorithm format, which is easier to understand in this case:

The cases <proxi>, <próxi>, <maxim>, <auxili>, <troux>: [s]

<ex>
    in beginning of word
        as prefix: [S] (e.g. ex-ministro)
        followed by vowel: [z] (e.g. exemplo)
        followed by consonant: [S] (e.g. exposto)
    in middle word position
        followed by vowel
            preceded by <in> in beginning of word: [z] (e.g. inexistente)
            preceded by <s>: [ks] (e.g. sexualidade)
            other cases
                <e> in tonic syllable: [ks] (e.g. convexo)
                in non tonic syllable
                    preceded by cons + <l>: [ks] (e.g. flexível)
                    followed by stressed vowel: [ks] (e.g. conexão)
                    other cases: [S] (e.g. mexer)
        followed by consonant: [S] (e.g. pretexto)
    in end of word position: [ks] (e.g. duplex)

<x> (not preceded by <e>)
    beginning of word: [S] (e.g. xadrez)
    end of word: [ks] (e.g. tórax)
    in middle position
        followed by consonant: [S]
        followed by vowel
            in beginning of syllable preceded by diphthong: [S] (e.g. deixa)
            <axa>: [S] (e.g. taxa, axadrezado)
            <axe>: [S] (e.g. praxe)
            <ax>: [ks] (e.g. táxi, maxilar)
            <óx>: [ks] (e.g. dióxido)
            <ox> with <o> in tonic syllable: [ks] (e.g. paradoxo)
            <ox> in beginning of word: [ks] (e.g. oxigénio)
            <oxi>: [ks] (e.g. dioxina)
            <fix>, <fux>, <flix>, <flux>: [ks] (e.g. fixo, fluxo) (exception: fixe)
            other cases: [S] (e.g. rixa)

7 The symbol . means syllable boundary.

The accuracy of the conversion of grapheme <x> was measured automatically using texts not seen in the development phase. The text contains 1649 <x> cases in non-repeated words. The set of rules failed in 56 cases, giving an error rate of 3.4%.
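As an illustration of the algorithm format, the first branch of the <x> rules (words beginning with <ex>) can be sketched as below. Detecting the prefix use of <ex> through a hyphen, the vowel set and the function name are assumptions of this sketch; the remaining branches are not covered.

```python
VOWELS = set("aeiouáàâãéêíóôõú")


def x_in_initial_ex(word):
    """Phone for the <x> of a word beginning with <ex>, or None if not covered."""
    word = word.lower()
    if not word.startswith("ex") or len(word) == 2:
        return None
    following = word[2]
    if following == "-":       # assumption: a hyphen marks <ex> used as prefix
        return "S"
    if following in VOWELS:    # e.g. exemplo
        return "z"
    return "S"                 # followed by consonant, e.g. exposto
```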

2.5.3 Co-articulation rules or post-lexical rules

Co-articulation or post-lexical rules are included in the phonetic transcription block and are applied to the sequence of phonemes produced by the rules. It is well known that the sequence of phones uttered by a human speaker is not exactly the same as the sequence of phonemes resulting from the phonetic transcription of the individual words of the uttered text. Several phonetic change phenomena found in the FEUP-IPB database were already reported in section 2.3.5. The purpose of these rules is to model these phenomena in synthetic speech in order to obtain more human-like speech.

The phonetic transcription is applied to each individual word. The co-articulation rules take care of the phoneme modifications caused by co-articulation effects when words are spoken together.


Table 2.13: Co-articulation rules.

Grapheme context | Phone | Example
Suppression of glide [@]
<e> in end of word | none | dente, bebe, sete
<es> in end of word | S | abres, Lopes, leves
Voice quality transformation (crasis (8))
<a a> | a | fica admirado, deseja alterar
<o o> | o or O | muito obrigado
Transformation of hiatus into rising diphthong
<e *> where * can be: a, e, o, u, an/m, en/m, on/m, un/m | j (first grapheme) (9) | de ovo, de uva, se aprende, de um, se andar, de entre
<o *> where * can be: o, a, e, an/m, en/m | w | do óbvio, do amigo, quanto é, do antes, do entrudo
Transformation of hiatus into falling diphthong
phonemes: [u or a] [i], or graphemes: <g1 g2> where g1 is grapheme <a or o> and g2 is grapheme <e or i> | uj or aj | na idade, como exemplo
Allophone
<s> followed by voiced consonant | Z | olhos verdes, mesmo, cisma
<s> in end of word followed by vowel | z | alhos azuis
<l> velar in end of word followed by vowel | l | mal entrou
<te> in absolute final position | ts | sete
<r> in absolute final position | the phone is produced with a certain degree of devoicing |

8 Crasis – contraction of two non tonic vowels with similar or equal timbre into just one.

9 <e e> can follow this rule or the suppression of glide [@].

Brinckmann and Trouvain [2003] reported, based on their experiments, that a group of listeners clearly rejected synthetic speech produced with the lexical form resulting straight from phonetic transcription, which might sound too unnatural, but perceived no difference between the original form (as uttered by a speaker) and the form produced by post-lexical (co-articulation) rules.


The set of co-articulation rules presented in Table 2.13 will soon be implemented in the FEUP-TTS system.
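Two of the allophone rules of Table 2.13 can be sketched as a pass over adjacent words. The sketch assumes that each word arrives as a list of SAMPA phones in which a word-final <s> has already been transcribed as [S]; the function name, the phone sets and the example transcriptions are illustrative assumptions, not part of the described system.

```python
VOWELS = {"a", "6", "E", "e", "@", "i", "O", "o", "u",
          "6~", "e~", "i~", "o~", "u~"}
VOICED_CONSONANTS = {"b", "d", "g", "v", "z", "Z", "m", "n", "J",
                     "l", "L", "r", "R"}


def apply_final_s_rules(words):
    """Adjust a word-final [S] according to the first phone of the next word."""
    out = [list(w) for w in words]          # copy, input is left untouched
    for cur, nxt in zip(out, out[1:]):
        if cur and nxt and cur[-1] == "S":
            if nxt[0] in VOWELS:
                cur[-1] = "z"               # e.g. alhos azuis
            elif nxt[0] in VOICED_CONSONANTS:
                cur[-1] = "Z"               # e.g. olhos verdes
    return out


# Hypothetical phone lists for <alhos azuis> and <olhos verdes>.
alhos_azuis = [["a", "L", "u", "S"], ["6", "z", "u", "j", "S"]]
olhos_verdes = [["O", "L", "u", "S"], ["v", "e", "r", "d", "@", "S"]]
```

The last word of the utterance keeps its [S], since no following word triggers a change.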

2.5.4 Final remarks

A set of rules implemented in the FEUP-TTS system for the conversion of graphemes <a>, <e>, <o> and <x> was presented, as well as some proposed co-articulation rules to be applied after grapheme-phoneme conversion. The accuracy rate of the complete set of grapheme-phoneme conversion rules has not yet been measured. The co-articulation rules are not yet completely implemented in the FEUP-TTS system. A dedicated ANN to solve the particular remaining problems of some graphemes can be a valid solution, given the correct set of features.


3 Duration Model

This chapter describes one of the most important parts of the prosody model developed in this work, the segmental duration model. It starts with an overview of the most recent and prominent duration models. Then some considerations are made about the speech database concerning segmental durations, the architecture of the selected ANN is presented, some training functions are introduced, the training process is explained, and the set of features is detailed. Besides the proposed model with one ANN to predict segmental durations, an alternative model based on one ANN dedicated to each type of segment is also proposed. Results of both models are discussed. Finally, a simple model to insert pauses and predict their durations is presented.


3.1 Introduction

In this work, the word duration refers to the period of time that a given speech unit lasts or, in other words, its length. Distinct speech units have been considered across different models and languages. Some authors use highly distinctive units such as syllables [Campbell and Isard, 1991] or the Inter-Perceptual-Centre-Group, IPCG [Barbosa and Bailly, 1994]. Basically, IPCGs are speech segments between the beginnings of vowels or from the beginning of initial syllables. Others use models to estimate the length of the phonemes themselves [Córdoba et al., 1997]. Throughout this work, speech units will be regarded as segments. These segments are either a clearly indivisible part of the phoneme, e.g. the two segments into which plosives can be divided, occlusion and explosion, or the phoneme itself.

Ferreira [1998] claims that since phonological syllables in European Portuguese derive from the collapse of weaker syllables, they cannot be regarded as rhythmic units, as opposed to other languages.

Correct utterance requires the duration of each segment to have a suitable degree of harmony. It is accepted that this prosodic parameter follows the F0 contour as the second most important parameter to achieve naturalness in speech. If we consider it to belong to the rhythmic dimension of prosody, then we have to mention different types of breaks and their corresponding duration as part of this dimension.

Different types of phonemes have different elasticity degrees, concerning their durations. The standard deviation of a sufficiently vast amount of measured durations of a segment or phoneme is a reliable elasticity indicator [Campbell and Isard, 1991]. Thus, generally speaking, vowels have more elasticity than consonants. Exceptionally, some fricative consonants, [f], [s], [S], and velar [l], have similar elasticity to vowels in Portuguese [Teixeira et al., 2001].
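The elasticity indicator can be computed directly from a list of measured durations, as in the sketch below. The duration values are invented for illustration and the function name is an assumption.

```python
from statistics import stdev


def elasticity(durations_ms):
    """Sample standard deviation of the measured durations of one segment type."""
    return stdev(durations_ms)


# Hypothetical measurements in milliseconds, for illustration only.
vowel_a = [62, 95, 71, 120, 88]
plosive_t = [18, 22, 20, 25, 19]
more_elastic = "vowel" if elasticity(vowel_a) > elasticity(plosive_t) else "plosive"
```

In practice the indicator is only reliable with a sufficiently large number of measurements per segment type, as the text notes.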

The difficulty in handling this matter lies in the set of features that may influence the duration of a segment, the degree to which they influence one another and the way they correlate. Generally these features aren’t linearly independent and they cannot form an orthogonal basis [van Santen, 1994]. Segments occupying stressed-syllable positions have larger duration; therefore, this feature should be taken into consideration, as well as others which, with more or less detail, characterize the context, such as the identity of the surrounding segments, within-word position, phrase position, etc. Semantic features such as emphasis, intonation groups or sentence type, prosodic features such as pitch level accents, as well as syntactic features such as word class, may also be used. The choice of features shouldn’t disregard the fact that the system where the model is included may or may not be able to determine them.

Since some of the presented features aren’t linearly independent, their effects cannot be added to others’ when measuring duration [van Santen, 1994]. One way to deal with this problem is to use quasi-minimal sets of feature vectors and compare the average duration of the segments on those sets to acknowledge the dependency between features. These quasi-minimal sets of features consist of two sets of condition vectors in which all features but one match.
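The comparison over quasi-minimal sets can be sketched as a search for pairs of condition vectors that differ in exactly one feature, followed by a comparison of their average durations. The data layout (a dictionary from feature tuples to duration lists), the feature names and the duration values are assumptions for illustration.

```python
from itertools import combinations
from statistics import mean


def differ_in_one(u, v):
    """True when two equal-length feature vectors differ in exactly one position."""
    return len(u) == len(v) and sum(a != b for a, b in zip(u, v)) == 1


def quasi_minimal_effects(observations):
    """Return (u, v, mean(v) - mean(u)) for every pair differing in one feature."""
    pairs = combinations(observations.items(), 2)
    return [(u, v, mean(dv) - mean(du))
            for (u, du), (v, dv) in pairs if differ_in_one(u, v)]


# Hypothetical durations (ms) keyed by (segment, stressed, phrase_final).
observations = {
    ("a", False, False): [60, 64],
    ("a", True, False): [80, 84],
    ("a", True, True): [110, 114],
}
effects = quasi_minimal_effects(observations)
```

If the effect of one feature changes depending on the values of the others, the features interact and their effects cannot simply be added.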

Syntactic features have a strong influence on the prosodic structure. Since the duration structure is dependent on the latter [Zellner, 1994], that would be reason enough to add them to a list of useful features. However, they do not show in most TTS linguistic analysis models. Syntactic analysis tools are still very expensive to the system as a whole and that is why not all TTS systems include them.


Duration models use and handle relevant features in different ways, and the traditional models can be distinguished by the way they handle these features. Thus, there are rule-based models such as the Keller-Zellner model, which applies more or less complex rules to lengthen or shorten the duration of the segments; mathematical models such as the Klatt and the van Santen models, which combine multiple features into a single expression, usually a sum-of-products, that establishes the duration of the segments; and finally statistical models, which apply generic tools such as Classification and Regression Trees (CARTs) or ANNs, and consider the sets of features in their input to predict the duration of the segments. Some models combine several of those functionalities, as do the Campbell and the Barbosa-Bailly models, which combine neural nets and mathematical models.

The next section presents the state of the art in duration models. Section 3.3 describes the proposed model, based on ANNs. It starts with some considerations on the speech database concerning segmental durations, then describes the ANN architecture and the training process. The set of features is discussed in section 3.3.4. The model is evaluated and its results discussed in section 3.4. Section 3.5 presents a variation of the model as an alternative model. This variation basically consists in splitting the task of predicting segmental durations, performed by one ANN, among 44 dedicated ANNs. Once the results of this alternative model are improved, it will be taken as a serious model to be perceptually evaluated in chapter 5. The chapter ends with a simple proposed model to insert pauses and predict their durations.

3.2 Other Duration Models

Other duration models have been published in the literature for other languages; they can be grouped into rule-based models, mathematical models and statistical models.

Rule-based models should allow a straightforward knowledge of the effects of each feature in the duration of the segments. Examples of this type of models are the Klatt rule-based model [Klatt, 1976], the rule-based algorithm for French [Zellner, 1998], presented by Zellner for different speech rates, and the look-up-table for Galician [Salgado and Banga, 1999].

Mathematical models usually appear as a Sum-of-Products, where the features are statistically weighted and summed to produce the segmental duration [van Santen, 1994].

Statistical duration models have become more widely used with the availability of large phonetically labelled databases. Neural networks and regression trees are the most often used tools, applied in different ways for different languages and using different types of segments. Campbell [1993] introduced the concept of Z-score to distribute the duration estimated by a neural network for a syllable among its segments. He argued that the syllable is the more stable unit. Barbosa and Bailly also presented a two-step model for French [Barbosa and Bailly, 1997] and Brazilian Portuguese [Barbosa, 1997]. In the first step, using a neural network, they estimate the duration of the Inter-Perceptual Centre Group (IPCG), arguing that it is the more stable unit. In the second step they distribute the duration of the IPCG among its segments, using the Z-score concept. This model can deal with different speech rates and with pauses. Other neural network-based models were also presented for Spanish [Córdoba et al., 1999] and Arabic [Hifny, 2002]. An example of a CART-based model applied to Korean can be found in Chung [2002].

Some recent, successful duration models are now briefly described in terms of their results and application to Text-To-Speech systems.

3.2.1 The Klatt model

The Klatt duration model [Klatt, 1976] is possibly the best known. It has been implemented in English in MITalk and also adapted to other languages.

The model consists of an equation, Eq. (3.1), which is applied to a sequence of segments successively, starting with an initial or inherent segment.

$$D_p = D_{\min,p} + k \times \left(D_{in} - D_{\min,p}\right) \qquad \text{Eq. (3.1)}$$

Here, Dp is the predicted duration for segment p, Dmin, p is the minimum duration for segment p, Din is the output from preceding rules. For the first segment of the sequence, Din equals the inherent duration of segment p. Finally, k is a parameter reflecting the contribution to duration of a set of features expressed by the following rules:

• Insertion of pause at main clause boundary and comma;

• Lengthening of prepausal syllable at clause boundary;

• Lengthening of syllabic segments at the end of syntactic units;

• Shortening of segments in within-word position;

• Shortening of segments belonging to poly-syllabic words;

• Shortening of non-initial consonants;

• Shortening of unstressed segments;

• Lengthening of stressed vowels;

• Shortening of vowel when followed by voiceless consonant;

• Shortening of consonants in clusters;

• Lengthening of stressed vowels when preceded by voiceless plosive.

k is the feature product of each of these rules, obtained with:

$$k = \prod_{i=1}^{N} k_{fi} \qquad \text{Eq. (3.2)}$$

where kfi is the value of feature i. k has a value between 0 and 1 for shortening rules and greater than one for lengthening rules.

This model is based on both rules and mathematical modelling. Moreover, it requires minimum duration and inherent duration values for each segment.
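As a minimal sketch of Eq. (3.1) and Eq. (3.2), the following Python fragment computes the duration of one segment from a list of rule factors. The function names and all numeric values (inherent duration, minimum duration, factors) are illustrative assumptions and are not taken from [Klatt, 1976]. The second function shows that applying Eq. (3.1) once per rule, feeding each output back as Din, gives the same result as using the product k directly.

```python
from math import prod

def klatt_duration(inherent_ms, min_ms, rule_factors):
    """Eq. (3.1) with k taken as the product of the rule factors, Eq. (3.2)."""
    k = prod(rule_factors)  # factors below 1 shorten, factors above 1 lengthen
    return min_ms + k * (inherent_ms - min_ms)

def klatt_duration_stepwise(inherent_ms, min_ms, rule_factors):
    """Apply Eq. (3.1) once per rule, using the previous output as D_in."""
    d_in = inherent_ms  # for the first rule, D_in is the inherent duration
    for k_fi in rule_factors:
        d_in = min_ms + k_fi * (d_in - min_ms)
    return d_in

# Hypothetical vowel: inherent duration 120 ms, minimum duration 60 ms,
# one lengthening rule (1.4) and one shortening rule (0.8).
factors = [1.4, 0.8]
d_once = klatt_duration(120.0, 60.0, factors)
d_step = klatt_duration_stepwise(120.0, 60.0, factors)
```

Both calls return the same value up to rounding, because only the part of the duration above the minimum is scaled at each step.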

3.2.2 Sum-of-Products models

A sum-of-products model consists of an equation that combines features exclusively under the form of sums or products [van Santen, 1994].

$$Dur(p) = \sum_{i \in T} \prod_{j \in I_i} S_{i,j}(p_j) \qquad \text{Eq. (3.3)}$$

Si, j is the parameter which associates j, and possibly the correlation between features i and j, to the duration of segment p.

For a given set of features, several sum-of-products may be generated. The possibilities increase in proportion to the number of features.

This model is basically a generalization of several existing models, namely, of the previously described Klatt model. It is also used in the Jan van Santen model which will now be briefly described.
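To make the structure of Eq. (3.3) concrete, the sketch below evaluates a small sum-of-products in Python. The data layout (a dictionary of product terms, each holding one look-up table per feature), the feature names and every parameter value are hypothetical and serve only to illustrate the form of the expression.

```python
from math import prod

def sum_of_products(terms, features):
    """Eq. (3.3): sum over the product terms i of the product, over the
    features j in I_i, of the parameter S_ij selected by the feature value.

    terms    -- {term index i: {feature name j: look-up table S_ij}}
    features -- {feature name j: value of that feature for the segment}
    """
    return sum(
        prod(table[features[name]] for name, table in factors.items())
        for factors in terms.values()
    )

# Hypothetical two-term model: (vowel identity x stress) + phrase position.
terms = {
    1: {
        "vowel": {"a": 100.0, "i": 70.0},
        "stress": {"stressed": 1.3, "unstressed": 1.0},
    },
    2: {"position": {"final": 25.0, "medial": 0.0}},
}
segment = {"vowel": "a", "stress": "stressed", "position": "final"}
duration_ms = sum_of_products(terms, segment)  # 100 * 1.3 + 25
```

A purely multiplicative model corresponds to a single term containing all features, and a purely additive model to one term per feature.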

3.2.3 The Jan van Santen model

van Santen [1994] develops his model on the assumption that, on one hand, statistical models that apply generic statistical tools are hurt by the lack of balance in the frequency of the different feature sets that make up the database, and, on the other, that feature interaction was neglected by the existing duration models.

The system is composed of a tree (Fig. 3.1) that can handle the linguistic heterogeneity of the segments, allowing a separate treatment for each category and its own sum-of-products model at the end of the tree. Each model differs from the remaining because the features affecting each category also differ. For instance, the features affecting vowel duration are different from the ones affecting intervocalic consonants. A second category classification distinguishes consonants according to their articulation and voicing: there are voiceless plosives, voiceless affricates, liquid consonants and glides, voiceless fricatives, nasals, voiced plosives, voiced affricates, voiced fricatives and aspirate. In addition, plosives and affricates are divided into two moments: occlusion and burst part. There are tables of predicted parameter values for each model.

Fig. 3.1 – The van Santen category-distinction tree. [Figure: tree with node labels All cases, Vowels, Consonants, Intervocalic, In clusters, Onsets, Coda, Phrase-medial, Phrase-final, Consonant classes.]

The sum-of-products model for vowels employs the following features:

• Pitch accent;

• Syllabic stress;

• Vowel identity;

• Preceding consonant class;

• Postvocalic consonant class;

• Number of consonants preceding the vowel in the word;

• Number of syllables following the vowel in the word;

• Phrasal position.

The model for intervocalic consonants employs the following features:

• Stress levels of surrounding vowels;

• Within-word position;

• Accent status of the word;

• Phrasal position.

The models for consonants in clusters as syllable onsets, phrase-medial codas and phrase-final codas employ the following features distinctively:

• Class of following segment;

• Class of preceding segment;

• Stress accent of last vowel;

• Stress accent of next vowel;

• Syllable boundary;

• Stress accent of final vowel;

• Silence.

The system has a sum-of-products model for vowels and several models for consonants.

The reported correlation coefficient, considering all types of segments, is 0.93 for the database used to determine the parameters and 0.884 for other databases, which is excellent. In perceptual tests comparing this model with a previous one based on hundreds of duration rules, the van Santen model obtained an overall preference of 73%.

3.2.4 The Keller-Zellner algorithm

Eric Keller and Brigitte Zellner have developed a rule-based algorithm for French [Zellner, 1994]. The set of rules was fitted manually to meet the criteria of simplicity, respect for psycho-linguistic plausibility and high capacity to predict the duration of a segment. The rules were applied to prosodic components generated from a simple proximity syntax, as seen in the algorithm [Zellner, 1994]. The identification of the nucleus of the prosodic components is based on the grammatical information of the words (nouns, verbs, adjectives, adverbials and pronouns).

Final syllable duration and final segment duration increase according to the previous component. This increase goes from a minimum to a maximum empirical value, initially taking the same steps. It corresponds to the lengthening that is usually observed in speech.

Rhythmic variance was also observed in post-verb position and within 4-to-6-word components. Rhythmic variance occurs when one element is lengthened more than strictly necessary. Consequently, the following element must be shortened to end the component “in time”. This leads to the inversion of duration of variant word pairs.

The linear correlation between predicted and measured values reported by the author is never below 0.7 for final syllable plus pause and is usually around 0.8.

Later, in her PhD thesis, Brigitte Zellner [1998] suggested another duration model, also for French. This model proceeds in two phases. The first phase predicts the syllable duration based on the type of word the syllable belongs to (lexical vs. grammatical), the position of the syllable in the word, group, sentence, etc. The second phase distributes that duration among the component segments of each syllable. The logic of that distribution varies with the type of syllabic structure.

When estimating syllable duration on the first stage, the author employs the following six parameters:

• X1 – Segmental duration classes¹ (158 classes);

• X2 – Temporal groups (10 groups: minor initial; major initial in the beginning of the sentence; minor initial after pause; major initial after pause; intermediate position; minor final; minor final before pause; major final before pause; major final; major final in the end of the sentence);

• X3 – Number of syllable segments (5 modalities);

• X4 – Presence/absence of schwa (2 modalities);

• X5 – Grammatical/lexical word (2 modalities);

• X6 – Mono/Poly-syllabic word (2 modalities).

¹ Each syllable contains a set of segments. The author attributes a set of duration classes to each set of segments. It took 158 different combinations of segments to translate all the syllables in her study.

These parameters are combined into a sum-of-products, called a linear model, according to Eq. (3.4):

$$Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4 + b_5 X_5 + b_6 X_6 \qquad \text{Eq. (3.4)}$$

Two models are considered, one for fast speech rate and the other for slow speech rate. The statistically obtained coefficients are considerably different for the two models. For fast speech rate, the duration is strongly conditioned by segmental class types (the segments are intrinsically long or short), whereas for slow speech rate there is a higher degree of syllable elasticity, since duration is highly dependent on the number of segments. This work also tested a neural net model using the same parameters, but with worse results.

In the second stage, the syllable durations are distributed among the segments each syllable contains, according to the following algorithm [Zellner, 1998:139] for both speech rates:

    If the syllable has a single segment,
        Attribute duration to the segment
    Otherwise
        Add durations of intermediate segments according to their classes
        Determine the difference between the predicted syllable duration and the sum of segmental durations
        If the result is different, adjust:
            If the syllable has 2 segments,
                Attribute MAX or MIN value to the first segment
                Re-determine the difference between the predicted syllable duration and the sum of segmental durations
                If the result is different, adjust:
                    Attribute MAX or MIN value to the second segment
                    Re-determine the difference between the predicted syllable duration and the sum of segmental durations
                    If the result is different, adjust:
                        the nucleus so that the syllable has the predicted value.
            If the syllable has 3 segments
                Attribute MAX and MIN values to every segment
                Re-determine the difference between the predicted syllable duration and the sum of segmental durations
                If the result is different, adjust:
                    The nucleus in minor or major.

The author presents a duration model for two speech rates. The result is presented as the correlation coefficient between a sequence of predicted values and a sequence of values produced by a speaker. The results were presented separately for the two stages of the model. The correlation coefficients obtained for syllable duration are 0.80 and 0.73 for fast and slow speech rates, respectively, and 0.74 for the segmental duration prediction at both rates, based on the syllable durations produced by a speaker. For the two stages jointly, the results were never below 0.7 for either rate.

3.2.5 The Campbell model

Nick Campbell claimed that, in the logarithmic domain, the syllable is the most stable duration unit in the speech process. His algorithm [Campbell, 1992] proceeds in two stages: the first predicts the syllable duration from phonological information using a neural network; the second applies a mathematical model to distribute the duration among the syllable elements. The time scale is the logarithm of duration measured in ms, also known as transformed duration.

The first stage uses a multi-layer perceptron ANN which describes the syllable according to the following 10 features, presented in descending order of relevance:

• Number of syllable segments;

• Break index;

• Nature of the rhyme;

• Function/content distinction;

• Nature of the peak;

• Stress index;

• Type of foot;

• Number of syllables in the foot;

• Position in the Word;

• Position of the phrase in the utterance.

The second stage develops the elasticity concept, according to which the durations of the syllable segments are obtained by applying a single z-score (normalized duration), z, in Eq. (3.5), chosen so that the sum of segmental durations equals the syllable duration, Eq. (3.6).

$$Dur_i = \exp\left(\mu_i + z\,\sigma_i\right) \qquad \text{Eq. (3.5)}$$

$$\sum_i Dur_i = \text{syllable duration} \qquad \text{Eq. (3.6)}$$

µi and σi are, respectively, the mean and standard deviation of the transformed (logarithmic) durations of segment i.

The author noted that the model has difficulty in predicting segmental durations in final syllables, due to segmental lengthening in this position.
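The elasticity principle of Eq. (3.5) and Eq. (3.6) can be sketched as a one-dimensional search for z. The text does not state how z is solved for, so the bisection used below is an assumption, as are the function names and the (mean, standard deviation) values, which are invented for illustration.

```python
import math

def segment_durations(z, stats):
    """Eq. (3.5): duration of every segment for a common z-score.
    stats is a list of (mu_i, sigma_i) pairs in the log-duration domain."""
    return [math.exp(mu + z * sigma) for mu, sigma in stats]

def solve_z(syllable_ms, stats, lo=-10.0, hi=10.0, tol=1e-9):
    """Find z so that the segment durations add up to the syllable
    duration, Eq. (3.6). Bisection works because the sum grows
    monotonically with z when every sigma_i is positive; the target is
    assumed to lie between the sums obtained at lo and hi."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(segment_durations(mid, stats)) < syllable_ms:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Hypothetical consonant-vowel syllable: mean durations of 60 ms and 90 ms.
stats = [(math.log(60.0), 0.30), (math.log(90.0), 0.25)]
z = solve_z(180.0, stats)           # syllable predicted 30 ms above the mean sum
durations = segment_durations(z, stats)
```

Since a single z is shared by all segments, a segment with a larger σi absorbs proportionally more of the stretching or compression.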

Later, Campbell [1993] enhanced his model so that it could handle the final syllable lengthening problem, which affects the rhyme more than the onset. The modification consisted of multiplying the z-score by an alpha value that depends on the context. This improvement led to better results, and the author reported a correlation coefficient of 0.93 for syllable duration. When comparing the predicted durations with the measured durations produced by 4 speakers, the model achieved an average correlation coefficient of 0.71.

3.2.6 The Barbosa-Bailly model – Inter-Perceptual-Centre-Groups

Barbosa and Bailly [1994] proposed the use of a rhythmic unit alternative to the syllable: the Inter-Perceptual-Centre-Group (IPCG). Their duration predictor proceeds in two stages. Stage one determines the necessary duration for the group to be perceptible; stage two distributes the IPCG duration among the segmental constituents, including automatic pause generation according to a certain speech rate.

The perceptual centre (PCentre) is located at the vocalic onset when the syllable is not preceded by a silence. If there is a silence, the PCentre is usually placed earlier in the syllable. The gap between two perceptual centres is the Inter-Perceptual-Centre-Interval and the unit is known as the IPCG.

The authors used an internal clock to actively control the speech rate. A feed-forward ANN transforms simple ramps, indicating the length and function of each linguistic unit of the utterance, into rhythmic contours according to speech rate, prosodic markers, nature of the vowel, number of consonants in coda and number of consonants in IPCG [Barbosa and Bailly, 1997]. The ANN predicts the duration’s logarithm using the following parameters:

• Frequency of the internal clock;

• Sentence modality;

• Sentence extent (using a ramp with the number of IPCG in the phrase);

• Prosodic group extent (using a ramp with the number of IPCG in the group);

• Current prosodic marker;

• Next prosodic marker;

• Nature of the current vowel;

• Nature of the next vowel;

• Number of consonants in the IPCG;

• Number of consonants in coda.

The distribution of durations to the IPCG constituents is accomplished with a modification of the previously mentioned repartition algorithm, developed by Campbell and Isard [1991], so as to include emerging pauses. That modification, first justified with experimental results and then presented, is based on the assumption that a pause has a minimum duration of approximately 60 ms, which was experimentally confirmed for different speech rates. The modified algorithm consists of:

Computation of the z-score for a given IPCG;

If z is smaller or equal to the critical value of 0.79,

The procedure is over: segmental durations are obtained by using Eq. (3.5);

If z is greater than 0.79,

The z-score of the vowel, zv, is obtained from the regression in Eq. (3.7) by setting zvs = z, and the new z is set to z = zv

( ) ( ) ( )0.595 5 exp 0.72v vsz z+ = + Eq. (3.7)

The segmental durations are computed with the repartition Eq. (3.5) and added up. The difference between this result and the original IPCG duration gives the duration of the silence;

If the silence duration is greater than the minimum (approx. 60 ms),

The procedure is over,

If not,

No silence is inserted and the z-score of the IPCG is kept equal to z.

Values µi and σi in Eq. (3.5) were previously determined for each segment using the mean and standard deviation of a database with several occurrences of each segment.
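The control flow of the modified repartition algorithm can be sketched as follows. This is only an illustration under stated assumptions: the regression of Eq. (3.7) is not reproduced and is passed in as a caller-supplied function (`vowel_regression`, a hypothetical name), the z-score is found by bisection (an assumed solver), and the segment statistics are invented values.

```python
import math

Z_CRITICAL = 0.79      # critical z-score quoted in the text
MIN_PAUSE_MS = 60.0    # approximate minimum pause duration quoted in the text

def repartition(z, stats):
    """Eq. (3.5) for every segment of the IPCG; stats holds (mu_i, sigma_i)."""
    return [math.exp(mu + z * sigma) for mu, sigma in stats]

def solve_z(total_ms, stats, lo=-10.0, hi=10.0):
    """Bisection (an assumed solver) for the z that makes Eq. (3.5) sum to total_ms."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if sum(repartition(mid, stats)) < total_ms:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def distribute_ipcg(ipcg_ms, stats, vowel_regression):
    """Return (segment durations, pause duration) for one IPCG.
    vowel_regression stands in for Eq. (3.7): it maps z_vs to z_v."""
    z = solve_z(ipcg_ms, stats)
    if z <= Z_CRITICAL:
        return repartition(z, stats), 0.0       # no pause emerges
    z_v = vowel_regression(z)                   # reduced z-score, z_vs = z
    durations = repartition(z_v, stats)
    pause = ipcg_ms - sum(durations)            # remainder becomes silence
    if pause > MIN_PAUSE_MS:
        return durations, pause
    return repartition(z, stats), 0.0           # pause too short: keep original z

# Hypothetical vowel + consonant group and a placeholder regression.
stats = [(math.log(90.0), 0.25), (math.log(60.0), 0.30)]
halve = lambda z_vs: 0.5 * z_vs                 # NOT Eq. (3.7), illustration only
short_durs, short_pause = distribute_ipcg(150.0, stats, halve)
long_durs, long_pause = distribute_ipcg(400.0, stats, halve)
```

In both outcomes the segment durations plus the pause add up to the predicted IPCG duration, which is what preserves the rhythmic structure.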

The authors present the mean and standard deviation of the error, for segments in general and for pauses or silences, at 5 speech rates for the whole model. At a normal speech rate, the test set exhibits values of -105 ± 113 ms for silence and 5 ± 43 ms for the remaining segments. For fast speech rate, however, the test set displays values of 64 ± 144 ms for silence and 0 ± 28 ms for the remaining segments. Compared to the syllable as a unit, the IPCG has better results with pauses and consonants, but no advantage for vowels. The results show very low mean error values, but this mean error is different from the absolute mean error that other works mention. A null mean error indicates that the time unit (sentence, text, etc.) in which it was measured has the same duration as its reference. Using the IPCG would help maintain the rhythmic structure of the sentence, including its speech rate, i.e., its total duration.

Later, Barbosa [1997] applied this model to Brazilian Portuguese, with suitable adjustments, obtaining a mean error of 2 ms and a standard deviation of 36 ms.

3.2.7 Model for the Hungarian language

The Hungarian TTS prosodic model developed by Olaszy, Nemeth and Olaszy [2001] is a rule-based timing model structured in 3 levels:

• Level 1 - determines the specific segmental duration, influenced only by the articulation of adjacent sounds, with no supra-segmental effects;

• Level 2 - modifies the specific segmental duration, using a map of suprasegmental functions which depend on the type of sentence, the length of the word, the quantity of word class indicating phonologically short or long sounds, the type of sound and its order. At the end of this level, durations are very close to their final value;

• Level 3 - modifies the previous level’s durations to establish final durations according to the length of the word, the position of the word within the phrase and sentence boundaries. Pauses are separately inserted in sentence break markers and between phrases.

The authors reported that about 98% of the durations are set correctly after level 3.

3.2.8 Model for the Galician language

Salgado and Banga [1999] have also developed a duration model for the Galician language, which was included in their TTS system. They classified each phoneme according to the number of phonemes in the syllable, its position within the syllable, its phonetic class, stressed/non-stressed and pre-/post-pausal position. The model then proceeds in two stages. Firstly, it determines syllable duration, using information that concerns the phoneme itself and other features, such as accent and position in relation to the following pause. Secondly, it distributes the syllable duration among its phonemes, according to their average share (in percentage) of the syllable duration. This information is stored in a table of average duration values for each cell.

The reported absolute mean error and standard deviation values for allophones in the training set are 16.3 and 19.6 ms, respectively.

3.2.9 Model for the Castilian language

Córdoba et al. [1999] have also developed a duration model for the Castilian language. It is an ANN-based model that replaced the rule-based model in the TTS system of the Universidad Politécnica de Madrid.

The model uses the phoneme as its segmental unit. The ANN is a multi-layer perceptron. Several parameters were studied for the network input, but only the following ones were retained:

• Phoneme identity;

• Surrounding phonemes (previous and next);

• Accent (5 value window including context information);

• Syllable stress;

• Phoneme in function-word;

• Sentence type (4 types);

• Position in the sentence (position of the phoneme within the syllable, of the syllable within the word and of the word within the sentence);

• Number of phrasal units (number of phonemes in the syllable, number of syllables in the word and number of words in the sentence);

• Beginning of the sentence (up to the first accent) and end of the sentence (after final accent).

After suitable codification, these parameters had better results than the ones obtained for a reference set, composed of phoneme identity and accent, exclusively.

The network output corresponds to the duration of the phoneme, presented as a standard deviation logarithm, since it shows better results than those of other tested codifications.

The authors made their assessment according to the specifications of the database they used. Their best result is of an absolute error equivalent to 14.3 ms, which is far better than the results by their previous rule-based model.

3.3 Duration Model for Standard European Portuguese

This section describes every aspect related to the creation of a duration model for European Portuguese, a model based on artificial neural networks (ANN) [Rumelhart and McClelland, 1986].

If the ANN input contains all the features likely to influence segmental duration, and if its architecture is able to learn how each feature exerts its influence under different circumstances from a sufficiently large set of examples of natural utterances, then the network should be able to predict the sequence of durations that corresponds to the natural utterance of the segments resulting from the analysis of a text.

This was the basic idea for the creation of the model. The next sections will describe the process to choose a set of examples to “teach” the ANN, the choice of network architecture and its training, the selection of the most suitable set of features and its parameters. Finally, the model will be evaluated and criticized.

Initially, the model had a large number of parameters, which were then modelled and tested to find the set with the best results, bearing in mind the degree of relevance of each parameter. The aim was not so much to reduce the number of features as to reduce the error in the segmental duration prediction. The decision to include or exclude a particular feature was based on whether the correlation coefficient between predicted and original durations improved with that feature in the input vector. Since the correlation coefficient is very highly correlated (r=0.999) with the MOS of a perceptual test using the whole paragraphs of the test set, as described in section 5.2.1.1, this process can be considered capable of selecting the features by their perceptual relevance.
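The selection criterion described above can be illustrated with a short Python sketch that computes the linear correlation coefficient r between predicted and original durations and compares two values of r. The function names and the duration values are hypothetical; in practice each r would come from a network trained with or without the candidate feature.

```python
import math

def pearson_r(predicted, measured):
    """Linear correlation coefficient between two equally long sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov / math.sqrt(var_p * var_m)

def keep_feature(r_with, r_without):
    """Keep the candidate feature only if it raises r on the test set."""
    return r_with > r_without

# Hypothetical durations (ms) for four test segments.
measured = [60.0, 90.0, 105.0, 50.0]
predicted = [62.0, 85.0, 110.0, 48.0]
r_example = pearson_r(predicted, measured)
```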

There was an attempt to improve the structure of the ANN in terms of hidden layers, nodes per layer, learning functions, output functions of the final layer and codification of the input and output parameters.

High level linguistic features, such as morpho-syntactic features, were not considered due to the lack of automatically accessible information at this stage.

The chosen segmental unit is the phoneme, but plosive consonants are divided into their two moments: occlusion and burst part. A list of segments is presented in Table 2.6.

3.3.1 Considerations on the speech database

For the training and testing of the presented model, the speech database mentioned in section 2.3 was employed. This database was recorded by a professional speaker who read texts from newspapers. The waveform files were later labelled in three levels by a trained phonetician: phoneme, word, and sentence levels. Only phoneme identity, phoneme duration, word boundary, sentence boundary and punctuation information are extracted from the labelled files in this model.

The corpus used here consists of more than 100 paragraphs of every type and dimension divided into 7 texts, with a total of 18,700 sound segments for 21 minutes of speech uttered by the same speaker. Training was done using sentences from 6 texts, with approximately 15,000 phoneme segments, and testing was done with the remaining text, containing about 3,000 segments.

In spite of there being other features that the model should not disregard, the identity of the phoneme segment is the most important one. This fact justifies an analysis of the corpus composition in relation to that feature. Fig. 2.9 therefore shows the distribution of frequency of the phoneme segments in the corpus, which is identical in the training and test sets, as seen in Fig. 3.2.

Fig. 3.2 – Relative frequency (%) of the phonemes in the training and test sets. [Bar chart: vertical axis from 0 to 9 %; series Training set and Test set; horizontal axis with the segments a 6 E e @ i O o u j w j~ w~ 6~ e~ i~ o~ u~ p !p t !t k !k b !b d !d g !g m n J l l* L r R v f z s S Z.]

This duration model is valid for the speech rate at which the database was recorded. For a different speech rate, another database would have to be recorded and labelled at the chosen rate, and the ANN trained with new data. Some co-articulation phenomena, modelled for this speech rate at the grapheme-phoneme conversion level, may differ for other speech rates. In chapter 2 it is said that the speech rate for this database is of 12.2 phones per second, the equivalent to the normal reading of a news report.

Other phonetic changing phenomena likely to influence the model and documented in [Teixeira et al., 2001], such as dialectal changing, contextual changing (suppression and reduction, vowel quality transformation, addition, allophones and phonetic changes) are treated in the phonetic transcription and co-articulation events process and thus, supposedly, included in the model.

Considering the features established ahead and the way they are parameterized in the network, the number of possible combinations for different input vectors is about 10^16. However, only a minute part of those vectors is linguistically possible, since many combinations are merely hypothetical. Of a total of 18,700 sound segments, around 1,000 are pauses and silences; the remaining 17,700 phoneme segments were parameterized in vectors and used for the training and testing of the model. Of these 17,700, only about 2% are repeated, leaving 17,350 distinct vectors.

3.3.2 Network architecture

The tool used for neural nets was the toolbox included in version 4 of Matlab® [Demuth and Beale, 2000].

Several architectures were tested, with different network types, structures, number of hidden layers and corresponding transfer functions, as well as number of nodes per layer.

Perceptron networks and recurrent networks (Hopfield networks and Elman networks) were also tested, but the learning results were never satisfactory. The chosen network is a feed-forward network, trained using back-propagation algorithms, with good results from the beginning.

The network input has all the necessary nodes to codify the chosen parameters, discussed later. The output has one node, which indicates the segment duration value. Between one and four hidden layers were tested, but the best option is one or two layers. Table 3.1 displays the performance values for the best architectures, where Log, Tan and Lin mean hyperbolic logarithmic, hyperbolic tangent and linear transfer functions, respectively. The number of nodes in the input layer is not shown in the first column but will be discussed in detail in the following sections. A network with two hidden layers was chosen, with 4 nodes in the first hidden layer and 2 nodes in the second, because it gave the best results in the testing phase.

Fig. 3.3 shows the architecture of the chosen network, with n input nodes; a first hidden layer with 4 nodes, activated by the hyperbolic tangent transfer function; a second hidden layer with 2 nodes, activated by the log-sigmoid transfer function; and the duration, d, in the output layer, activated by the linear transfer function. The nodes of consecutive layers are fully connected. The polarisation value, or bias, of each node is denoted by b. To avoid clutter, the weights are not displayed in the figure, but one weight is associated with each filled arrow connecting one node to another, including the input nodes. The total number of weights and biases is $n \times 4 + 4 \times 2 + 2 \times 1 + 7 = 4n + 17$.
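The architecture just described can be illustrated with a minimal sketch in Python (not the Matlab toolbox used in this work). The function names, the random initialisation and the choice n = 99 (the number of input nodes listed in Table 3.5) are assumptions made only for illustration; the sketch shows the forward pass through the 4-2-1 layers and checks the 4n + 17 parameter count.

```python
import math
import random

def logsig(x):
    """Log-sigmoid transfer function, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases, transfer):
    """Fully connected layer: one weight row and one bias per node."""
    return [transfer(sum(w * p for w, p in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(p, net):
    """Propagate the input vector p through the 4-2-1 network and return d."""
    a1 = layer(p, net["W1"], net["b1"], math.tanh)           # 4 tanh nodes
    a2 = layer(a1, net["W2"], net["b2"], logsig)             # 2 log-sigmoid nodes
    return layer(a2, net["W3"], net["b3"], lambda x: x)[0]   # linear output node

def random_net(n, seed=0):
    """Hypothetical initialisation: small random weights, zero biases."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"W1": mat(4, n), "b1": [0.0] * 4,
            "W2": mat(2, 4), "b2": [0.0] * 2,
            "W3": mat(1, 2), "b3": [0.0]}

def count_parameters(net):
    """Number of adjustable values: all weights plus all biases."""
    weights = sum(len(row) for key in ("W1", "W2", "W3") for row in net[key])
    biases = sum(len(net[key]) for key in ("b1", "b2", "b3"))
    return weights + biases

net = random_net(n=99)
d = forward([0.0] * 99, net)
print(count_parameters(net) == 4 * 99 + 17)
```

With an all-zero input and zero biases, both log-sigmoid nodes output 0.5, so d is simply half the sum of the two output weights, which gives a quick sanity check of the layer ordering.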

Table 3.1: ANN architectures and performances.

| Nodes in layers | Activating functions | Training algorithm | Value of r in test set |
|---|---|---|---|
| 4-2-1 | Tan-Log-Lin | Levenberg-Marquardt | 0.839 |
| 2-4-1 | Tan-Log-Lin | Levenberg-Marquardt | 0.838 |
| 10-1 | Tan-Log | Resilient back-propagation | 0.837 |
| 2-4-1 | Log-Log-Lin | Levenberg-Marquardt | 0.836 |
| 2-4-1 | Tan-Log-Lin | Resilient back-propagation | 0.836 |
| 6-1 | Tan-Log | Resilient back-propagation | 0.836 |
| 10-1 | Tan-Lin | Resilient back-propagation | 0.835 |


Fig. 3.3 – Network architecture for this model.

The activating functions of each node are displayed graphically in Fig. 3.3. The activating functions of the first hidden layer, hyperbolic tangent functions, vary between -1 and +1; the log-sigmoid functions of the second hidden layer vary between 0 and +1. The activating function of the output node is strictly linear.

3.3.3 Neural network training

The training and test sets consist of sets of vectors. Each vector characterises one segment through the parameters described in the following sections, together with its target duration value.

The architecture of an ANN should be carefully designed to guarantee that the number of available training vectors is several times (at least 5 times) larger than the number of weights of the ANN. Otherwise the training set will not be enough to optimise all the ANN weights: even if the predictions for the training set are very good, the predictions for data from other sets will not reach the quality obtained in the training set. In the present case, considering the architecture described above and the number of features, discussed in the next section, the number of weights is 410 and the number of training vectors is about 15,000, about 36 times larger.

On the other hand, an over-fitting problem may occur, independently of the ratio between the number of training vectors and the number of ANN weights, if the number of training iterations is excessive. The network adapts itself perfectly to the training set (the smaller that ratio, the more easily), but fails to handle other input vectors. It ‘memorizes’ the training examples but doesn’t ‘learn’ how to deal with the problem.

In order to avoid over-fitting problems, three sets were initially used in the training process: the training set, used to train the ANN; the validation set, used to stop training early when further training on the training set would hurt generalisation to the validation set; and a test set, used to evaluate whether the training and validation sets are representative of the problem as a whole. If they are not, the performance in the test set does not follow the performance in the training and validation sets.

Consequently, the database was first divided into a training set of approximately 13,000 vectors, a validation set of about 3,000 vectors and a test set of about 2,000 vectors. These sets were organized by assigning 5 texts to the training set and one to each of the others. Later, the test set was eliminated, since its performance closely followed the performance of the other two sets, indicating that the training and validation sets are representative. The data of the test set was transferred to the training set, and the validation set was also used for testing. Hence, the final training set comprises approximately 15,000 vectors and the test set about 3,000.
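The role of the validation set in stopping the training early can be sketched as follows. This is a generic illustration in Python, not the toolbox routine used in this work; the function name, the `patience` parameter and the toy validation curve are assumptions introduced only for the example.

```python
def train_with_early_stopping(step, validation_error, max_epochs=1000, patience=5):
    """Run `step()` once per epoch and stop when the validation error has not
    improved for `patience` consecutive epochs.
    Returns (best_epoch, best_error)."""
    best_error = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        step()                      # one batch update of weights and biases
        error = validation_error()  # error measured on the validation set
        if error < best_error:
            best_error, best_epoch = error, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_error

# Toy stand-in for a real network: the validation error falls until epoch 10
# and rises afterwards, as it would when over-fitting sets in.
state = {"epoch": 0}

def toy_step():
    state["epoch"] += 1

def toy_validation_error():
    return (state["epoch"] - 10) ** 2 + 1.0

best_epoch, best_error = train_with_early_stopping(toy_step, toy_validation_error)
print(best_epoch, best_error)
```

A practical implementation would also keep a copy of the weights from the best epoch, so that the network returned is the one with the lowest validation error rather than the last one trained.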

The chosen performance function was the root-mean-square error between the predicted outputs and the target values.
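For reference, the two figures of merit used in this chapter, the root-mean-square error and the correlation coefficient r between predicted and target values, can be computed as in the sketch below. The function names are illustrative and the code is plain Python rather than the toolbox implementation.

```python
import math

def rmse(predicted, target):
    """Root-mean-square error between two equally long sequences."""
    n = len(target)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / n)

def pearson_r(x, y):
    """Linear (Pearson) correlation coefficient between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

print(rmse([0.0, 0.0], [3.0, 4.0]))
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```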

For the network’s training, several variants of the back-propagation training algorithm were tested, all available in Matlab®’s toolbox for neural networks [Demuth and Beale, 2000] and briefly described next.

Every tested algorithm is of the ‘batch training’ type, i.e., at every iteration, the weights and biases are only updated after every vector in the training set has been applied to the network.

Training algorithms tested:

• traingd - ‘Batch Gradient Descent’ – the weights and biases are updated in the direction of the negative gradient of the performance function. The learning rate is fixed;

• traingdm - ‘Batch Gradient Descent with Momentum’ – momentum allows the network to respond not only to the local gradient, but also to the recent trends in the error surface. It works like a low-pass filter, allowing the network to ignore minor features of the error surface.
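The difference between the two algorithms above can be sketched in a few lines of Python. The update rule used here, in which the new step is a weighted mix of the previous step and the current gradient step, is one common way of writing momentum and is an assumption of this sketch; the quadratic test function is likewise only illustrative.

```python
def gradient_descent(grad, w0, lr=0.1, momentum=0.0, epochs=500):
    """Batch gradient descent; momentum=0 gives the plain algorithm.
    `grad(w)` must return the gradient of the performance function at w."""
    w = list(w0)
    step = [0.0] * len(w)
    for _ in range(epochs):
        g = grad(w)
        # Low-pass filter: blend the previous step with the new gradient step.
        step = [momentum * s - (1.0 - momentum) * lr * gi
                for s, gi in zip(step, g)]
        w = [wi + si for wi, si in zip(w, step)]
    return w

def toy_gradient(w):
    """Gradient of (w0 - 3)^2 + (w1 + 1)^2, minimised at (3, -1)."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

plain = gradient_descent(toy_gradient, [0.0, 0.0])
with_momentum = gradient_descent(toy_gradient, [0.0, 0.0], momentum=0.9)
print(plain, with_momentum)
```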

These two algorithms are usually very slow on practical problems. The alternatives are faster training algorithms, which fall into two categories. The first uses heuristic techniques developed from the analysis of the performance of the standard steepest descent algorithm. One such technique is momentum, used in the previously mentioned algorithm; the other two are variable learning rate and resilient back-propagation. The second category uses standard numerical optimization techniques, of which there are three types for neural nets: conjugate gradient, ‘quasi-Newton’ and ‘Levenberg-Marquardt’.

Heuristic techniques – Variable learning rate:

• traingda - standard steepest descent algorithms use a constant learning rate throughout the training, but the performance of the algorithm is very sensitive to the setting of the learning rate. If the learning rate is too high, the algorithm may oscillate and become unstable; if it is too small, the algorithm will take too long to converge. This algorithm uses an adaptive learning rate, in order to keep the learning step as large as possible and make sure the algorithm remains stable;

• traingdx - this algorithm combines an adaptive learning rate with momentum.


Heuristic techniques – Resilient back-propagation:

• trainrp - multilayer networks typically use sigmoid transfer functions in the hidden layers. These functions compress an infinite input range into a finite output range, and one of their main features is that their slope approaches zero as the input gets large. Learning is then slow, because the weight update is proportional to the gradient of the performance function. To eliminate the harmful effect of these almost null partial derivatives, this algorithm uses only the sign of each derivative to determine the direction of the weight update; the magnitude of the derivative has no effect on the weight update [Riedmiller and Braun, 1993]. It requires only a small increase in memory resources.
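The sign-based update can be illustrated with the sketch below. It is a simplified variant without weight backtracking, so it is not necessarily the exact rule of [Riedmiller and Braun, 1993] or of the toolbox; the step-size constants and the badly scaled test function are assumptions chosen for the example.

```python
def sign(x):
    return (x > 0) - (x < 0)

def rprop(grad, w0, epochs=300, delta0=0.1, eta_plus=1.2, eta_minus=0.5,
          delta_min=1e-9, delta_max=50.0):
    """Resilient back-propagation sketch: each weight has its own step size,
    and only the sign of its partial derivative is used."""
    w = list(w0)
    delta = [delta0] * len(w)
    previous = [0.0] * len(w)
    for _ in range(epochs):
        g = list(grad(w))
        for i in range(len(w)):
            if g[i] * previous[i] > 0:
                # Same sign as before: accelerate.
                delta[i] = min(delta[i] * eta_plus, delta_max)
            elif g[i] * previous[i] < 0:
                # Sign change: the minimum was overshot, so shrink the step
                # and skip the update for this weight in this iteration.
                delta[i] = max(delta[i] * eta_minus, delta_min)
                g[i] = 0.0
            w[i] -= sign(g[i]) * delta[i]
            previous[i] = g[i]
    return w

def badly_scaled_gradient(w):
    """Gradient of 1000*(w0 - 3)^2 + 0.001*(w1 + 1)^2."""
    return [2000.0 * (w[0] - 3.0), 0.002 * (w[1] + 1.0)]

solution = rprop(badly_scaled_gradient, [0.0, 0.0])
print(solution)
```

Because the gradient magnitudes differ by six orders of magnitude, a single fixed learning rate would be either unstable for the first weight or extremely slow for the second, whereas the sign-based rule treats both alike.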

Numerical optimization techniques – conjugate gradient:

Basic back-propagation algorithms adjust the network weights in the steepest descent direction (the negative of the gradient). However, this does not necessarily lead to the fastest convergence. Conjugate gradient algorithms search along conjugate directions, and their step size is adjusted at each iteration so as to minimize the performance function along the current direction. These algorithms are usually faster than variable learning rate algorithms, and sometimes even faster than resilient back-propagation, but their results depend on the kind of problem being handled. They require only a little more storage than the simpler algorithms, so they are often a good choice for networks with a large number of weights. The alternatives available in the neural network toolbox [Demuth and Beale, 2000] are: traincgf – ‘Fletcher-Reeves Update’, traincgp – ‘Polak-Ribière Update’, traincgb – ‘Powell-Beale Restarts’ and trainscg – ‘Scaled Conjugate Gradient’. Essentially, they differ in the way the search for a new direction is done.

Numerical optimization techniques – ‘Quasi-Newton’:

• trainbfg - the Newton method uses, at each iteration, the matrix of second derivatives of the performance index, the Hessian matrix. However, this matrix is complex and expensive to compute. ‘Quasi-Newton’ algorithms instead update an approximation of that matrix at each iteration, computed as a function of the gradient, which makes the algorithm lighter. They require more computation per iteration and more storage than the conjugate gradient methods, although they generally converge in fewer iterations. This algorithm is recommended for smaller networks;

• trainoss - ‘One Step Secant Algorithm’ – this algorithm is an attempt to bridge the gap between the conjugate gradient algorithms and the previous algorithm, as far as storage and computation requirements are concerned. It doesn’t store the complete Hessian matrix; instead, it assumes that at each iteration the previous Hessian was the identity matrix.

Numerical optimization techniques – ‘Levenberg-Marquardt’:

• trainlm - ‘Levenberg-Marquardt’ – this algorithm was designed to approach second-order training speed without having to compute the Hessian matrix. Instead, it approximates that matrix using the Jacobian matrix and a vector of network errors, through a much less complex technique. It requires a lot of storage, although memory can be traded for computation time. Being very fast, it is recommended for medium-size networks (a few hundred weights).
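The idea of approximating the Hessian with the Jacobian can be made concrete with a small sketch. It applies the usual Levenberg-Marquardt step, $\Delta w = -(J^{T}J + \mu I)^{-1} J^{T} e$, to a two-parameter curve fit rather than to a neural network; the model $w_0 \tanh(w_1 x)$, the synthetic data, the starting point and the factor of 10 used to adapt $\mu$ are all assumptions of the example.

```python
import math

def levenberg_marquardt(residuals, jacobian, w0, mu=0.01, epochs=50):
    """Minimise the sum of squared residuals of a two-parameter model."""
    w = list(w0)
    e = residuals(w)
    sse = sum(r * r for r in e)
    for _ in range(epochs):
        J = jacobian(w)
        # Hessian approximation J^T J and gradient J^T e.
        h00 = sum(row[0] * row[0] for row in J)
        h01 = sum(row[0] * row[1] for row in J)
        h11 = sum(row[1] * row[1] for row in J)
        g0 = sum(row[0] * r for row, r in zip(J, e))
        g1 = sum(row[1] * r for row, r in zip(J, e))
        while mu < 1e10:
            # Solve the 2x2 system (J^T J + mu I) s = J^T e in closed form.
            a, b, d = h00 + mu, h01, h11 + mu
            det = a * d - b * b
            s0 = (d * g0 - b * g1) / det
            s1 = (a * g1 - b * g0) / det
            trial = [w[0] - s0, w[1] - s1]
            trial_e = residuals(trial)
            trial_sse = sum(r * r for r in trial_e)
            if trial_sse < sse:
                # Success: accept the step and move towards Gauss-Newton.
                w, e, sse = trial, trial_e, trial_sse
                mu = max(mu / 10.0, 1e-12)
                break
            mu *= 10.0  # failure: move towards small gradient-descent steps
        else:
            break  # no improving step was found
    return w

xs = [-2.0 + 0.5 * i for i in range(9)]
ys = [2.0 * math.tanh(0.5 * x) for x in xs]  # synthetic, noise-free data

def residuals(w):
    return [w[0] * math.tanh(w[1] * x) - y for x, y in zip(xs, ys)]

def jacobian(w):
    return [[math.tanh(w[1] * x), w[0] * x * (1.0 - math.tanh(w[1] * x) ** 2)]
            for x in xs]

fitted = levenberg_marquardt(residuals, jacobian, [1.0, 1.0])
print(fitted)
```

For a network, the Jacobian has one row per training vector and one column per weight, which is where the large storage requirement mentioned above comes from.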

To sum up:


The trainlm algorithm is recommended for networks with up to a few hundred weights; beyond that it becomes rather heavy in terms of memory. The trainrp algorithm is the fastest for pattern recognition, but is less effective for function approximation and degrades significantly when the error goal is small. The trainscg algorithm handles a wide range of problems well, especially on large networks. It doesn’t require much storage and is almost as fast as trainlm for function approximation (even faster for large networks). It is also almost as fast as trainrp for pattern recognition, and it does not degrade as much as trainrp with small error goals. The trainbfg algorithm is similar to trainlm in terms of performance and does not require as much storage. However, its computation needs increase geometrically with the network size. Finally, the traingdx algorithm is usually the slowest and requires as much storage as trainrp. It can still be useful in situations where slow convergence is preferable.

For the present problem, each of the mentioned algorithms was tested. In the initial phase, with a large number of features and, consequently, a large number of weights to adjust, the resilient back-propagation algorithm, trainrp, was selected. Once the feature set, and consequently the number of weights to adjust, was significantly reduced, the trainlm algorithm became more useful and produced better results. With the other algorithms, the performance values were far from those of the adopted solution, and in some cases the network was even unable to ‘learn’.

Fig. 3.4 displays the evolution of the error in the performance function for the training and validation sets in a training session with the trainlm algorithm. Training was automatically interrupted after 41 iterations, about 300 seconds (see footnote 2), to prevent over-fitting. With the resilient back-propagation algorithm, trainrp, the training time is approximately 10 times shorter, but the performance values are also worse.

[Figure 3.4: plot of the performance function (logarithmic scale, 10^-3 to 10^0) against training epochs (0 to 41) for the training set (blue) and the validation set (green), with the goal in black. Plot title: Performance is 0.00713325, Goal is 0.0016.]

Fig. 3.4 – Error evolution in the performance function in the training and validation sets during a training session.

The very close evolution of the performance in the training and validation sets indicates that the two sets are homogeneous. When the test set was used, it also followed the evolution of the performance of the training and validation sets.

2 The training process ran on a computer with a Pentium IV 1.8 GHz processor and 512 Mbytes of RAM.


3.3.4 Features

The basic idea behind the creation of this model was to collect every feature likely to influence the duration of a given segment, even if the influence of certain features is rather subtle or if some of them are strongly correlated with others.

This section will describe the set of tested features, how they are automatically extracted from the text, the best way to codify them, whether or not they are influential and how.

Fig. 3.5 – Sequence of processing blocks prior to the development stage of the duration model and its application to TTS.

Before describing the chosen features and the way they are automatically extracted from the data, it is convenient to describe the sequence of processing blocks presented in Fig. 3.5. On the left, drawn in continuous lines, are the processing blocks of the TTS converter. On the right, drawn in broken lines, are the processing blocks used for the development of the duration model. Below, in dotted lines, lie the common blocks and the model itself.

[Figure 3.5 blocks. Left route (TTS): pre-processed text; syllable division; tonic syllable labelling; phonetic transcription; co-articulation rules. Right route (model development): phoneme labels, word labels and sentence labels; word labelling and phoneme line-up; sentence labelling and phoneme line-up; syllable division. Common blocks: phrase and accent group identification; accent group syllable enumeration; feature extraction; duration model; phoneme duration sequence.]


The two processing sequences, for the TTS application stage and for the model development stage, are distinct because the object of the development stage is the labelled database and not the text. However, one may ask why the left sequence is not the only route: why not apply it to the texts read in the database and then align the phonetic transcription results with the database labelling results?

In fact, that option was not taken, because some TTS blocks, namely the phonetic transcription block, were developed simultaneously with the duration model and were therefore still unstable when the development of the model began. Moreover, using the database durations necessarily implies using a phoneme sequence that matches the database sequence, and there was no guarantee that the phonetic transcription results would be exactly the same as the labelling results. However, so that no specific aspect of the model is neglected when it is applied to TTS, the phonetic transcription block should be handled carefully with regard to post-lexical rules. The transcription should be phonological rather than phonetic, so that its results match those of the labelling for the same texts (see footnote 3).

The left block sequence in Fig. 3.5 has the pre-processed text at its input, so that any acronym, abbreviation or numeric character is already in full text form. A syllable division algorithm [Gouveia et al., 2000] is then applied to divide the text into syllables, and the tonic syllable is marked according to the rule set in [Teixeira, 1995]. Finally, the phonetic transcription is made and the co-articulation rules are applied, as described in the previous chapter.

The right block sequence, used in the model’s development stage, has at its input files containing three labelling levels: phonetic labelling, word labelling and sentence labelling, as seen in section 2.3. These files contain the time instant of each label and the corresponding label (Table 2.5). Phoneme segments are not yet grouped into words or sentences, so the first processing block lines up the word markers and the phoneme segment markers, allowing segments to be easily assigned to the words they belong to. The same happens with sentences, making it easy to group the words and phonemes belonging to the same sentence. The third processing block handles syllable division, but with a different algorithm from the one mentioned for the left side of the figure, since this division is phoneme-based instead of grapheme-based. Syllable identification became, in some cases, much harder, due to several phoneme reductions and suppressions in the spoken text. However, the database markers at the beginning of the tonic syllable are very useful at this stage. The algorithm used for the syllable division was described in the previous chapter.

One way or the other, once the phoneme sequences divided into syllables, the tonic syllables and the word and sentence boundaries are known, accent groups and phrases can be identified. Every sentence marker in Table 2.5 was considered a phrase boundary marker (see footnote 4). As for accent groups, the idea was to combine words with their neighbouring monosyllables in order to create groups of three or more syllables with only one tonic accent. These groups are built by a word combination process: each group should have more than two syllables and no fewer than two phonemes in the last one, unless it is the last word in the phrase. Because it lacks higher-level linguistic information, this process sometimes fails and separates words which belong together as a unit (e.g. Vieira da Silva, Vila Real). If there is more than one tonic syllable marker in the group, only the final one is valid. The final step is numbering the syllables in the accent groups. Accent groups usually have 3 to 5 syllables. The concept of accent groups is illustrated by the following sentence (‘a strong reservation regarding the situation of justice’): “uma forte / reserva / em relação / à situação / da justiça”.

3 The phoneme sequences should be very close to the one produced by the speaker, and not the full lexical form. For that, the set of post-lexical or co-articulation rules proposed in the previous chapter is very important. Using the sequence actually produced instead of the full lexical form is very important for naturalness, as reported in [Brinckmann and Trouvain, 2003].

4 Linguistically speaking, this is not an accurate process of phrase boundary identification. However, because it was rather difficult to group phrases automatically while respecting linguistic criteria, these text groupings were adopted as a less accurate approximation of what we call phrases.

The final set of features used in the model is now presented:

• Phoneme syllable position in relation to group’s tonic syllable – this feature was initially codified so as to activate the input node that corresponds to one of the following 5 categories: before prior to tonic; prior to tonic; tonic; subsequent to tonic; after subsequent to tonic. This feature was later re-codified into a single node with values obtained from the correlation, r, between each category and segmental duration, according to Table 3.2. The new codification reduces the number of input nodes without loss in final performance.

Table 3.2: Codification of the ‘position’ feature in relation to the tonic syllable.

| Position in relation to tonic syllable | r | Codification value |
|---|---|---|
| Before prior to tonic | -0.054 | 0.15 |
| Prior to tonic | -0.087 | 0 |
| Tonic | 0.131 | 1 |
| Subsequent to tonic | -0.021 | 0.3 |
| After subsequent to tonic | 0.042 | 0.6 |

• Phoneme syllable type – initially codified by activating one of the following categories (see footnote 5): V; C; VC; CV; CC; VCC; CVC; CCV; CCVC, where V stands for vowel or diphthong and C stands for consonant. This feature was later re-codified into a single node, with the values obtained from the correlation between each category and segmental duration, according to the third column in Table 3.3. Again, the new codification reduces the number of input nodes without loss in final performance.

• Type of previous syllable – processed like the ‘type of syllable’ feature, but the final codification uses different values, since the correlation values also differ. The last column in Table 3.3 displays the codification values for this feature.

• Type of syllable vowel – initially codified by activating one of the following types: long vowels – {a, E, e, O, o}; medial vowels – {6, i}; short vowels – {@, u}; diphthongs; nasals – {6~, e~, i~, o~, u~}. This feature was later re-codified into a single node with the values obtained from the correlation of each type with segmental duration, according to the third column in Table 3.4.

5 These categories were considered the only possible syllable types for Portuguese, section 2.4. Types C and CC are only possible in the phonetic sequence due to vowel suppression in syllables of the CV, CVC and CCV types.


• Type of previous syllable vowel – similar processing to the ‘type of syllable vowel’ feature, but final codification used different values, since the correlation values also differ. The fifth column in Table 3.4 displays the codification values for this feature.

• Type of following syllable vowel – similar processing to the ‘type of syllable vowel’ feature, but final codification used different values, since the correlation values also differ. The last column in Table 3.4 displays the codification values for this feature.

Table 3.3: Codification of the ‘syllable type’ and ‘previous syllable type’ features.

| Syllable type | r (syllable type) | Codification value (syllable type) | r (previous syllable type) | Codification value (previous syllable type) |
|---|---|---|---|---|
| V | 0.122 | 1 | 0.000 | 0.3 |
| C | -0.062 | 0.1 | 0.020 | 0 |
| VC | 0.069 | 0.8 | -0.015 | 0.5 |
| CV | -0.011 | 0.4 | 0.017 | 0 |
| CC | -0.048 | 0.2 | -0.022 | 0.6 |
| VCC | 0.029 | 0.6 | 0.002 | 0.3 |
| CVC | 0.027 | 0.6 | -0.045 | 1 |
| CCV | -0.091 | 0 | 0.000 | 0.3 |
| CCVC | -0.006 | 0.4 | -0.002 | 0.3 |

Table 3.4: Codification of the ‘syllable vowel’, ‘previous syllable vowel’ and ‘following syllable vowel’ features.

| Vowel type | r (syllable vowel) | Codification value (syllable vowel) | r (previous syllable vowel) | Codification value (previous syllable vowel) | r (following syllable vowel) | Codification value (following syllable vowel) |
|---|---|---|---|---|---|---|
| Long | 0.171 | 1 | -0.011 | 0.5 | -0.093 | 1 |
| Medium | -0.053 | 0.1 | 0.037 | 0 | -0.028 | 0.5 |
| Short | -0.069 | 0 | 0.023 | 0.1 | 0.035 | 0 |
| Diphthong | -0.051 | 0.1 | -0.037 | 0.8 | -0.040 | 0.6 |
| Nasal | 0.090 | 0.7 | -0.058 | 1 | -0.044 | 0.6 |

• Position in accent group – codified into two nodes, both showing normalized positions, one from the beginning and the other from the end of the group.

• Position in phrase – codified into two nodes, both showing normalized positions, one from the beginning and the other from the end of the phrase.


• Distance to next pause – measured in seconds and normalized by the maximum value in the database.

• Accent group length – codified into two nodes, both normalized, one showing the number of group segments and the other the number of group syllables.

• Accent group position in the phrase – Codified into three nodes, by activating the node that corresponds to the beginning, the middle or the end of the phrase.

• Final vowel suppression – codified (see footnote 6) into a single node, showing whether or not there is final vowel suppression. This feature is only used for the final phoneme in the word.

• Identity of the segment – coded in 44 nodes, by activating one of 44 given segments.

• Identity of the previous segment (-1) – After analysing the correlation between the identity of the previous segment (-1) and the duration of the current segment, a total of 20 relevant phones were found. Thus, this feature is codified into 20 nodes, by activating the node that corresponds to the previous segment.

• Identity of the following segment (+1) – After analysing the correlation between the identity of the following segment (+1) and the duration of the current segment, a total of 12 relevant phones was found. Thus, this feature is codified into 12 nodes, by activating the node that corresponds to the following segment.

• Identity of the segment subsequent to the following (+2) – After analysing the correlation between the identity of the segment subsequent to the following (+2) and the duration of the current segment, a total of 4 relevant phones was found. Thus, this feature is codified into 4 nodes, by activating the node that corresponds to the segment subsequent to the following.

• Identity of the segment (+3) – After analysing the correlation between the identity of the segment (+3) and the duration of the current segment, a total of 2 relevant phones was found. Thus, this feature is codified into 2 nodes, by activating the node that corresponds to the segment (+3).

In the first 6 features, the codification allowed a significant reduction in the number of network input nodes, without any consequences for the model’s performance. As for the last 4 features, concerning the neighbouring segments, the number of nodes was also considerably reduced: segment types showing a weak correlation with segmental durations were not considered, and this did not harm the model’s performance.
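The two codification strategies summarised above can be sketched as follows: a categorical feature mapped to a single node through the values of Table 3.2, and a neighbouring-segment feature coded with one node per relevant phone, here the four phones kept for position +2 (nodes 94-97 in Table 3.6). The function names are illustrative, and leaving all nodes inactive for a phone outside the relevant list is an assumption of this sketch.

```python
# Table 3.2: single-node codification of the position relative to the tonic syllable.
POSITION_CODE = {
    "before prior to tonic": 0.15,
    "prior to tonic": 0.0,
    "tonic": 1.0,
    "subsequent to tonic": 0.3,
    "after subsequent to tonic": 0.6,
}

# Table 3.6, nodes 94-97: phones kept for the segment at position +2.
RELEVANT_PLUS2 = ["t", "d", "r", "Pause"]

def one_hot(value, vocabulary):
    """Activate the node matching `value`; every node stays at 0 when the
    value is not in the vocabulary (assumed behaviour)."""
    return [1.0 if value == item else 0.0 for item in vocabulary]

def encode(position, phone_plus2):
    """Return the 1 + 4 input values for these two features."""
    return [POSITION_CODE[position]] + one_hot(phone_plus2, RELEVANT_PLUS2)

print(encode("tonic", "r"))
print(encode("prior to tonic", "a"))
```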

6 The codification of this feature differs between the two routes in Fig. 3.5. The left route, application to TTS, extracts this information from the co-articulation block, where the final vowel may or may not be suppressed due to co-articulation. In the right route, the training and model development stage, that information is harder to obtain because suppressions are not registered in the database. Therefore, the final consonant is used as an indicator of whether or not suppression occurred. If the consonant is one of {r, l*, S}, it is assumed that there was no suppression, since those consonants are likely to occur in word-final position; if the consonant is of a different type, it is assumed that suppression occurred, since no other consonant is likely to appear in word-final position unless the vowel was omitted.
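The rule used in the development route can be written as a short function. This is only a sketch: the vowel and glide inventory is taken from the symbols of Table 3.6 and is an assumption, as is the treatment of words ending in a vowel or glide (no suppression).

```python
# Symbols as in Table 3.6; the grouping into this set is assumed for the sketch.
VOWELS_AND_GLIDES = {"a", "6", "E", "e", "@", "i", "O", "o", "u", "j", "w",
                     "j~", "w~", "6~", "e~", "i~", "o~", "u~"}

# Consonants that are likely to occur in word-final position.
WORD_FINAL_CONSONANTS = {"r", "l*", "S"}

def final_vowel_suppressed(word_phones):
    """Guess whether the final vowel of a word was suppressed, given the
    labelled phone sequence of that word."""
    last = word_phones[-1]
    if last in VOWELS_AND_GLIDES:
        return False  # the word still ends in a vowel or glide
    return last not in WORD_FINAL_CONSONANTS

print(final_vowel_suppressed(["p", "a", "r", "t"]))
print(final_vowel_suppressed(["m", "a", "r"]))
```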


Table 3.5: Final feature set, the corresponding importance and the correlation with the segmental durations.

| # Node | Feature | Detail | r | Importance |
|---|---|---|---|---|
| 1 | Position in relation to tonic syllable | | 0.145 | Relevant |
| 2 | Type of syllable | | 0.175 | Slightly relevant |
| 3 | Type of previous syllable | | -0.055 | Slightly relevant |
| 4 | Syllable vowel | | 0.208 | Relevant |
| 5 | Previous syllable vowel | | -0.075 | Slightly relevant |
| 6 | Following syllable vowel | | -0.151 | Slightly relevant |
| 7 | Position in accent group | From beginning | 0.026 | Slightly relevant |
| 8 | Position in accent group | From end | -0.153 | Relevant |
| 9 | Position in phrase | From beginning | -0.038 | Slightly relevant |
| 10 | Position in phrase | From end | -0.244 | Relevant |
| 11 | Distance to next pause | | 0.203 | Relevant |
| 12 | Accent group length | # of syllables | 0.052 | Slightly relevant |
| 13 | Accent group length | # of phonemes | 0.026 | Slightly relevant |
| 14 | Accent group position in the phrase | Beginning | 0.015 | Slightly relevant |
| 15 | Accent group position in the phrase | Middle | -0.081 | Relevant |
| 16 | Accent group position in the phrase | End | 0.114 | Relevant |
| 17 | Final vowel suppression | | 0.082 | Slightly relevant |
| 18-61 | Identity of the segment | | Detail in Table 3.6 | Very relevant |
| 62-81 | Identity of the previous segment (-1) | | Detail in Table 3.6 | Relevant |
| 82-93 | Identity of the following segment (+1) | | Detail in Table 3.6 | Relevant |
| 94-97 | Identity of the segment subsequent to following (+2) | | Detail in Table 3.6 | Relevant |
| 98-99 | Identity of the segment (+3) | | Detail in Table 3.6 | Relevant |

The relative importance of each feature in the final set is presented in Table 3.5. The importance was measured by taking one feature out of the feature set and measuring the new performance of the model in the test set. The decrease in performance was the measure of importance for that particular feature. The value r presented in the table is the correlation between the input feature and the output; it was not directly used in the measure of importance. Table 3.6 shows in detail the correlation of each segment identity, and of the identity of the neighbouring segments, with the segmental durations.
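The leave-one-feature-out measure of importance described above can be sketched generically as follows. The `evaluate` callable stands for training the model with a given feature subset and returning its score in the test set (for instance r); it is hypothetical, and the toy scores used to exercise the function are invented for the example and are not results from this work.

```python
def feature_importance(features, evaluate):
    """Return {feature: drop in score} when each feature is removed in turn.
    `evaluate(feature_subset)` is assumed to train the model on that subset
    and return its test-set score."""
    baseline = evaluate(features)
    importance = {}
    for name in features:
        reduced = [f for f in features if f != name]
        importance[name] = baseline - evaluate(reduced)
    return importance

# Toy stand-in for training: each feature adds a fixed, made-up contribution.
TOY_CONTRIBUTION = {
    "segment identity": 0.5,
    "syllable vowel": 0.2,
    "type of previous syllable": 0.05,
}

def toy_evaluate(feature_subset):
    return sum(TOY_CONTRIBUTION[f] for f in feature_subset)

ranking = feature_importance(list(TOY_CONTRIBUTION), toy_evaluate)
most_important = max(ranking, key=ranking.get)
print(most_important)
```

In the real setting every call to `evaluate` involves a full training session, so the cost grows linearly with the number of features examined.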

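Table 3.6: Correlation between the segments and the surrounding segments with the segmental durations.

| # Node | Feature | Segment identity | r |
|---|---|---|---|
| 18 | Phone | a | 0.262 |
| 19 | Phone | 6 | 0.065 |
| 20 | Phone | E | 0.130 |
| 21 | Phone | e | 0.122 |
| 22 | Phone | @ | -0.019 |
| 23 | Phone | i | 0.052 |
| 24 | Phone | O | 0.143 |
| 25 | Phone | o | 0.118 |
| 26 | Phone | u | -0.025 |
| 27 | Phone | j | -0.050 |
| 28 | Phone | w | -0.072 |
| 29 | Phone | j~ | -0.005 |
| 30 | Phone | w~ | -0.012 |
| 31 | Phone | 6~ | 0.062 |
| 32 | Phone | e~ | 0.140 |
| 33 | Phone | i~ | 0.110 |
| 34 | Phone | o~ | 0.093 |
| 35 | Phone | u~ | 0.046 |
| 36 | Phone | p | -0.195 |
| 37 | Phone | !p | 0.019 |
| 38 | Phone | t | -0.187 |
| 39 | Phone | !t | -0.068 |
| 40 | Phone | k | -0.124 |
| 41 | Phone | !k | -0.005 |
| 42 | Phone | b | -0.121 |
| 43 | Phone | !b | -0.048 |
| 44 | Phone | d | -0.226 |
| 45 | Phone | !d | -0.110 |
| 46 | Phone | g | -0.123 |
| 47 | Phone | !g | -0.050 |
| 48 | Phone | m | 0.009 |
| 49 | Phone | n | -0.025 |
| 50 | Phone | J | 0.011 |
| 51 | Phone | l | -0.031 |
| 52 | Phone | l* | 0.025 |
| 53 | Phone | L | -0.002 |
| 54 | Phone | r | -0.189 |
| 55 | Phone | R | 0.026 |
| 56 | Phone | v | 0.014 |
| 57 | Phone | f | 0.097 |
| 58 | Phone | z | 0.032 |
| 59 | Phone | s | 0.235 |
| 60 | Phone | S | 0.151 |
| 61 | Phone | Z | 0.060 |
| 62 | Phone (-1) | !p | -0.189 |
| 63 | Phone (-1) | t | 0.083 |
| 64 | Phone (-1) | !t | -0.184 |
| 65 | Phone (-1) | k | 0.050 |
| 66 | Phone (-1) | !k | -0.121 |
| 67 | Phone (-1) | b | 0.042 |
| 68 | Phone (-1) | !b | -0.122 |
| 69 | Phone (-1) | d | 0.071 |
| 70 | Phone (-1) | !d | -0.227 |
| 71 | Phone (-1) | g | 0.053 |
| 72 | Phone (-1) | !g | -0.123 |
| 73 | Phone (-1) | n | 0.055 |
| 74 | Phone (-1) | J | 0.051 |
| 75 | Phone (-1) | l | 0.068 |
| 76 | Phone (-1) | r | 0.089 |
| 77 | Phone (-1) | R | 0.053 |
| 78 | Phone (-1) | v | 0.060 |
| 79 | Phone (-1) | z | 0.075 |
| 80 | Phone (-1) | S | -0.057 |
| 81 | Phone (-1) | Pause | 0.082 |
| 82 | Phone (+1) | a | -0.083 |
| 83 | Phone (+1) | 6 | -0.119 |
| 84 | Phone (+1) | u | -0.077 |
| 85 | Phone (+1) | 6~ | -0.056 |
| 86 | Phone (+1) | o~ | -0.052 |
| 87 | Phone (+1) | t | -0.063 |
| 88 | Phone (+1) | !t | 0.107 |
| 89 | Phone (+1) | d | -0.104 |
| 90 | Phone (+1) | !d | 0.095 |
| 91 | Phone (+1) | l* | 0.062 |
| 92 | Phone (+1) | v | 0.053 |
| 93 | Phone (+1) | Pause | 0.282 |
| 94 | Phone (+2) | t | 0.107 |
| 95 | Phone (+2) | d | 0.091 |
| 96 | Phone (+2) | r | -0.080 |
| 97 | Phone (+2) | Pause | 0.141 |
| 98 | Phone (+3) | u | 0.049 |
| 99 | Phone (+3) | Pause | 0.110 |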

Apart from the final feature set, other features, and other codifications of some features in the final set, were studied, but they brought no benefit to the model’s performance, either individually or as a whole. Nevertheless, those features are listed and briefly described below:

• Type of following syllable – parameterized like the ‘type of syllable’ feature.

• Phrase length – characterized by three values: number of phonemes, number of syllables and number of groups.

• Phrase-final boundary – by activating one of the following markers {f, ., ,, !, ?, ..., (, ), “, :, ;, -}.

• Previous segment duration – codified between 0 and 1 by dividing the duration in ms by 250. If the segment’s duration exceeds 250 ms, it is codified as 1, i.e. min{D(ms)/250, 1}.

• Previous segment type – by activating one of the following: vowel; glide; nasal vowel; plosive consonant; nasal consonant; lateral consonant; multiple vibrant (R); simple vibrant (r); fricative consonant and pause. Codifying the surrounding segments by type in this way proved less beneficial to the model.

• Following segment type – parameterized like the previous feature.

• Identity of previous segment (-2) – After analysing the correlation between the identity of previous segment (-2) and duration of current segment, no relevant phone identity was found in that position.


• Identity of previous segment (-3) – After analysing the correlation between the identity of previous segment (-3) and duration of current segment, no relevant phone identity was found in that position.

These results hold for the given database. For other data sets, the best results would not match these exactly. However, the best feature set would probably not produce considerably different results, because the features presented in Table 3.5 were confirmed on the test and validation data sets. Even if the best feature set differs for other databases, it would probably differ only in the ‘not relevant’ features, with no major change in the model’s performance.

The network’s output node represents the duration of a given segment, codified as ms/250 between 0 and 1, i.e., between 0 and 250 ms. Other codifications for segmental duration were tested, namely logarithmic ones, but the results did not improve. An ANN is able to model non-linear functions such as the logarithm, which may explain why a logarithmic codification of the segmental durations at the ANN output did not improve the model’s performance, as other authors have reported for other model types.
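As a minimal sketch of the output codification described above, the following Python functions map a duration in ms to the [0, 1] range and back. The function names and the clipping of out-of-range values are assumptions made for illustration; only the division by 250 and the saturation at 1 come from the text.

```python
MAX_DURATION_MS = 250.0  # saturation value stated in the text


def encode_duration(duration_ms: float) -> float:
    """Map a duration in ms to [0, 1]; durations above 250 ms saturate at 1."""
    return min(max(duration_ms, 0.0) / MAX_DURATION_MS, 1.0)


def decode_duration(code: float) -> float:
    """Map a network output back to ms, clipping it to [0, 1] first (assumption)."""
    return min(max(code, 0.0), 1.0) * MAX_DURATION_MS
```

For instance, a 125 ms segment would be codified as 0.5, and any segment longer than 250 ms as 1.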


3.4 Model Evaluation

This section evaluates the model’s results extensively. Since it is impossible to evaluate every possible input vector in the feature space, the evaluation is based on the database texts. The texts are analysed separately for the training set, used in the development of the model, and for the test set. For each set, three indicators are reported: the standard deviation of the differences between predicted and measured durations⁷ (or the root mean squared error), the mean absolute value of those differences, and the correlation coefficient between the predicted and measured duration vectors. The first two indicators are expressed in milliseconds (ms).

3.4.1 Standard deviation (σ) or (std)

The standard deviation (σ) or (std) is given by the expression:

$$\sigma = \sqrt{\frac{\sum_i x_i^2}{N}} \qquad \text{Eq. (3.8)}$$

where $N$ is the number of segments and $x_i$ is the difference between the error of each segment and the mean error:

$$x_i = e_i - \bar{e} \qquad \text{Eq. (3.9)}$$

where the error, $e_i$, equals the difference between the measured and the predicted duration of each segment:

$$e_i = d_{\text{measured},i} - d_{\text{predicted},i} \qquad \text{Eq. (3.10)}$$

If the average error, $\bar{e}$, is null, the standard deviation equals the root mean square error, rmse, used by some authors [Goubanova and Taylor, 2000], and given by the expression:

$$\text{rmse} = \sqrt{\frac{\sum_i e_i^2}{N}} \qquad \text{Eq. (3.11)}$$

3.4.2 Mean absolute error (δ)

The mean absolute error (δ), i.e. the mean magnitude of the error, is given by the following expression:

$$\delta = \frac{\sum_i |e_i|}{N} \qquad \text{Eq. (3.12)}$$
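The three error indicators of Eq. (3.8) to Eq. (3.12) can be sketched in a few lines of Python. The function name and the example values are hypothetical; the formulas follow the definitions above, with the error taken as measured minus predicted.

```python
import math


def error_metrics(measured, predicted):
    """Return (sigma, rmse, delta) for two equally long duration sequences."""
    errors = [m - p for m, p in zip(measured, predicted)]  # Eq. (3.10)
    n = len(errors)
    mean_error = sum(errors) / n
    # Eq. (3.8) with x_i = e_i - mean error, Eq. (3.9)
    sigma = math.sqrt(sum((e - mean_error) ** 2 for e in errors) / n)
    # Eq. (3.11)
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    # Eq. (3.12)
    delta = sum(abs(e) for e in errors) / n
    return sigma, rmse, delta


# Made-up durations in ms: the errors are 10 and 20, so the mean error is 15.
sigma, rmse, delta = error_metrics([110.0, 70.0], [100.0, 50.0])
```

In this made-up example the mean error is not null, so σ (5 ms) and rmse (about 15.8 ms) differ, which illustrates why the two indicators only coincide for an unbiased predictor.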

⁷ ‘Measured durations’ are the durations resulting from the reading of the texts in the labelled database.


3.4.3 Linear correlation coefficient (r)

The linear correlation coefficient (r) measures the degree of linear correlation between two variables [Guimarães, 1997], namely the predicted duration and the measured duration vectors.

The cross variance (covariance) between vectors A=[a1 a2 ... ai ...] and B=[b1 b2 ... bi ...], of the same dimension N, is:

$$V_{A,B} = \frac{\sum_i (a_i - \bar{a})(b_i - \bar{b})}{N} \qquad \text{Eq. (3.13)}$$

The variance of a vector X with itself is simply the squared standard deviation of that vector:

$$V_{X,X} = \sigma_X^2 \qquad \text{Eq. (3.14)}$$

The correlation coefficient between vectors A and B is then the cross variance of those vectors, divided by the product of their standard deviations:

$$r_{A,B} = \frac{V_{A,B}}{\sqrt{V_{A,A} \cdot V_{B,B}}} = \frac{V_{A,B}}{\sigma_A \cdot \sigma_B} \qquad \text{Eq. (3.15)}$$

r varies between -1 and 1 ($-1 \le r \le 1$).
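A direct transcription of Eq. (3.13) to Eq. (3.15) into Python might look as follows. This is only a sketch: the function names are hypothetical, and it assumes that neither vector is constant (otherwise the denominator is zero).

```python
import math


def cross_variance(a, b):
    """Cross variance of two equally long vectors, Eq. (3.13)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def correlation(a, b):
    """Linear correlation coefficient, Eq. (3.15); assumes non-constant vectors."""
    return cross_variance(a, b) / math.sqrt(cross_variance(a, a) * cross_variance(b, b))
```

Applied to the measured and predicted duration vectors of a set, `correlation` would give the r value reported for that set.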

3.4.4 Results and discussion

Table 3.7 displays the global standard deviation, mean absolute error and linear correlation coefficient values for the training and test sets. These values are determined using each set’s vectors, containing a sequence with all segments in the text except pauses.

Table 3.7: Global results for the duration model.

Set σ (ms) δ (ms) r

Training 19.85 14.17 0.834

Test 19.46 14.32 0.839

Fig. 3.6 shows the error histogram of every segment in both sets (pooled, since the two sets have similar errors), compared with the normal distribution. Very low errors are more concentrated than in the normal distribution. Since the error is the measured value minus the predicted value, the figure also shows the model’s difficulty in predicting high duration values and a slight overestimation of low values. This is typical of the outcome of a statistical model: the dynamics of the predicted durations are slightly reduced compared with the original.


Fig. 3.6 – Error histogram and normal distribution curve for every segment in both sets (horizontal axis: e in ms).

Fig. 3.7 – Normal probability distribution and absolute error curve for every segment in both sets (normal probability plot; horizontal axis: |e| in ms; vertical axis: probability).

Fig. 3.7 shows the normal probability curve and the absolute error probability distribution for every segment in both sets. If the error had a normal distribution, it would appear as a straight line in the chart.

The analysis of the charts in Fig. 3.6 and Fig. 3.7 indicates that, although the error distribution deviates from the normal pattern, it is fairly close to it. Very low errors are more concentrated than in the normal distribution and are therefore more likely to occur. Large positive errors (which occur when measured durations exceed predicted ones) are also more frequent than in the normal distribution.

Comparing the dispersion of measured and predicted durations through their standard deviations (σ=35.9 ms and σ=30.6 ms, respectively) shows a lower dispersion in the model’s predicted durations. This confirms the model’s difficulty in predicting the durations of segments that are very long in comparison to the average, in line with the impression given by visual inspection of some examples.

Fig. 3.7 shows that the model predicts 75% of the durations with an error below 20 ms, 90% with an error below 30 ms and 95% with an error below 40 ms.
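Figures of this kind can be read off the empirical distribution of the absolute error. The sketch below computes the fraction of segments whose absolute error lies below a threshold; the function name and the error values are invented for illustration and are not taken from the database.

```python
def fraction_below(errors_ms, threshold_ms):
    """Fraction of segments whose absolute error is below the threshold."""
    return sum(1 for e in errors_ms if abs(e) < threshold_ms) / len(errors_ms)


# Invented errors (measured minus predicted, in ms), for illustration only.
example_errors = [-35, -12, -4, 0, 3, 8, 15, 22, 28, 45]
coverage = {t: fraction_below(example_errors, t) for t in (20, 30, 40)}
```

With the real error vector in place of `example_errors`, the thresholds at which the coverage reaches 0.75, 0.90 and 0.95 would correspond to the values quoted above.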

Fig. 3.8 shows measured, predicted and average durations for the phoneme sequence of a given sentence. This example is not an attempt to evaluate how close the model’s durations are, but simply an application of the model to a sentence from the database. The average duration sequence is obtained by replacing the duration of each segment with the average duration of the corresponding phoneme in the database. The figure reveals the model’s difficulty in matching the highest measured duration values. Fig. 3.9 shows predicted and measured durations for a different sentence.
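The ‘average’ contour used as a reference can be sketched as follows, assuming the database is available as a list of (phone, duration in ms) pairs. That data layout and the function names are assumptions made for this example.

```python
from collections import defaultdict


def phoneme_averages(segments):
    """Average duration per phone from (phone, duration_ms) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # phone -> [sum of durations, count]
    for phone, duration_ms in segments:
        totals[phone][0] += duration_ms
        totals[phone][1] += 1
    return {phone: total / count for phone, (total, count) in totals.items()}


def average_contour(phones, averages):
    """Replace each segment of a sentence by the average duration of its phone."""
    return [averages[phone] for phone in phones]


# Tiny invented database, for illustration only.
averages = phoneme_averages([("a", 100.0), ("a", 120.0), ("t", 30.0)])
contour = average_contour(["t", "a"], averages)
```

Such a contour ignores context entirely, which is why it serves as a simple baseline against which the predicted contour can be compared visually.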

Fig. 3.8 – Measured, predicted and average duration contours (in ms) for the phoneme sequence in the sentence “Conhece a situação na pele. Aprendeu-a na idade em que se aprende e se não esquece.”, meaning ‘Knows the situation on the skin. Learned it in the ages when we learn and don’t forget.’.


Fig. 3.9 – Measured and predicted duration contours for the paragraph “Que igualdade perante a lei? João Amaral”, meaning ‘How equal before the law? João Amaral’.

Ideally, these results would be compared with those of other models, but such a comparison would be complex and probably not fair. Each model has its own particular characteristics that these parameters do not cover, such as the ability to predict duration at different speech rates, as in the Barbosa-Bailly and Keller-Zellner models, or the ability to insert pauses. Besides, models for other languages use different databases, and there is no common corpus for precise evaluation. The choice of the evaluation corpus matters, since results differ from sentence to sentence, even for sentences of the same type. The size of the database used in the model’s learning stage is also likely to influence the final results. Additionally, authors diverge in the indicators used to present results. The set of phoneme segments depends on the language and also varies from author to author: finer segments may be used, causing results to differ. Finally, the speech rate is not always the same, and is sometimes not mentioned, although model results are very sensitive to it.

Thus, for the reasons stated above, this model was not objectively compared with other duration models. Still, its standard deviation of approximately 20 ms and its linear correlation coefficient above 0.8 are at the state-of-the-art level of duration models, judging by the relevant papers from the bibliographical list and by the systems presented earlier in this chapter.

In spite of the wide range of features specifying each segment, the phoneme identity feature is clearly dominant. Consequently, an analysis of the model’s results by segment type is now presented. Table 3.8 displays, for each type of segment in both sets, the number of occurrences, the standard deviation of the error, the mean absolute error, the linear correlation coefficient, the measured and predicted average, the measured and predicted minimum⁸ and the measured and predicted maximum⁹.

⁸ Very low minimum values were caused either by a labelling error or by a sporadic situation. Consequently, they have little importance.

⁹ When the measured value exceeds 250 ms, it is limited to that value.


The best linear correlation value does not always correspond to the best standard deviation value, as seen in the phoneme segments of [E] and [i], for instance.

Table 3.8: Values for each segment type (phone) in both sets: occurrence number (#); error standard deviation (σ); mean absolute error (δ); linear correlation coefficient (r); measured average (Av.) and predicted average (Pred. Av.); measured minimum value (Min.) and predicted minimum value (Pred. Min.); measured maximum value (Max.) and predicted maximum value (Pred. Max.).

Phone  #  σ (ms)  δ (ms)  r  Av. (ms)  Pred. Av. (ms)  Min. (ms)  Pred. Min. (ms)  Max. (ms)  Pred. Max. (ms)

a 631 25.6 19.8 0.66 110 109 29 63 238 180

6 1559 20.6 15.0 0.67 68 67 15 38 232 133

E 269 21.7 17.0 0.68 97 96 35 45 195 177

e 283 29.0 21.6 0.68 95 93 31 52 250 177

@ 271 33.6 24.5 0.46 53 54 11 21 205 98

i 819 22.6 16.9 0.60 68 68 11 31 224 129

O 218 25.6 20.4 0.62 106 105 40 70 197 170

o 247 26.3 20.1 0.64 97 97 27 54 227 182

u 804 24.2 18.2 0.55 57 55 7 25 197 122

j 433 21.7 15.8 0.56 49 50 8 20 206 103

w 393 19.9 14.5 0.68 44 43 8 17 158 100

j~ 10 17.0 16.7 0.44 63 53 36 44 89 61

w~ 6 28.9 26.1 0.30 53 29 21 28 105 31

6~ 450 23.4 17.6 0.74 75 74 1 33 192 198

e~ 192 24.9 20.0 0.60 107 109 36 56 232 183

i~ 107 25.5 19.3 0.79 109 107 48 79 250 209

o~ 137 26.9 19.8 0.66 98 97 36 65 250 177

u~ 92 31.0 23.2 0.70 86 84 27 58 250 217

p 520 8.4 6.1 0.35 20 20 3 16 86 31

!p 493 17.8 13.7 0.39 64 63 18 42 159 101

t 824 12.4 8.4 0.78 29 29 3 18 160 102

!t 803 16.4 12.6 0.60 48 48 7 28 135 83

k 635 13.8 10.5 0.49 37 37 7 23 137 107

!k 594 16.2 12.5 0.34 59 59 20 36 134 86

b 196 11.6 6.1 0.88 17 16 3 13 188 98

!b 196 14.4 10.7 0.37 43 43 8 28 108 58


d 740 10.3 6.8 0.81 20 20 3 15 176 98

!d 723 16.8 12.6 0.26 41 40 6 26 153 56

g 207 8.1 6.1 0.77 20 20 5 17 102 82

!g 203 13.0 10.3 0.23 44 44 16 30 91 66

m 435 19.3 14.2 0.25 62 62 20 38 197 114

n 312 17.2 13.0 0.41 54 53 19 30 149 108

J 57 16.6 13.8 0.22 68 66 25 44 108 89

l 277 19.3 15.3 0.26 53 51 7 33 117 79

l* 146 23.3 19.4 0.64 68 73 22 54 182 131

L 56 14.7 12.0 0.72 56 57 18 38 137 110

r 1018 12.4 9.3 0.62 32 32 7 25 145 95

R 104 18.6 14.9 0.44 73 72 18 43 145 119

v 222 19.6 15.0 0.46 65 64 24 44 148 124

f 194 22.3 17.3 0.58 93 92 33 48 203 144

z 255 16.7 12.5 0.35 70 69 24 51 130 102

s 648 24.6 17.6 0.60 103 103 32 45 250 171

S 639 24.2 17.5 0.69 89 89 29 56 248 133

Z 294 22.7 16.4 0.44 78 76 33 45 194 139

Fig. 3.10 and Fig. 3.11 show examples of measured and predicted duration histograms for vowel [a] and consonant [t], respectively. The predicted and measured histograms are similar, and the same holds for the other segments. Some of the differences between measured and predicted maximum values in Table 3.8 are due to outliers in the measured values, as can be seen in Fig. 3.10 and Fig. 3.11.

Statistical comparison of the model’s results with the values measured in the utterances, so far named ‘real’ values, is not the only means of evaluation, since the model is being compared with an utterance which is possibly not the best one and certainly not the only acceptable one. Thus, chapter 5 presents a perceptual test, which is an additional and important evaluation indicator for the duration model.


Fig. 3.10 – Histogram of measured and predicted durations for phoneme [a].

Fig. 3.11 – Histogram of measured and predicted durations for the burst part of phoneme [t].


3.5 Alternative Model

The implementation of the model described above raises the following questions: to what extent is it beneficial to develop a single-network model that learns the influence of a certain feature across every type of phoneme segment and applies that knowledge to a specific phoneme segment? Should that feature influence every segment similarly? Or can its influence go in opposite directions for different segments?

To answer these questions, an alternative application of the neural network was tested, from now on referred to as the alternative model. The alternative model consists of one ANN for each type of segment, keeping all the features except the identity of the phoneme segment. The structure of each network is the same as in the previous model, although in this case each network only has access to the stimuli of each set that correspond to its segment. The networks were trained individually, using a process similar to the one described previously.

In short, each of the 44 ANNs has the structure presented in Fig. 3.3 and is composed of 55 input nodes, 4 nodes in the first hidden layer, activated by the hyperbolic tangent function, 2 nodes in the second hidden layer, activated by the hyperbolic logarithmic function, and 1 node in the output layer, activated by the linear function. The output node corresponds to the segmental duration.

One of the advantages of the alternative model is that the duration of a given phoneme segment cannot be “disturbed” in any direction by the influence of the other segments’ features. However, that may also be a disadvantage, since what is learnt about a feature for one segment is not applied to the others. This becomes more relevant when the number of stimuli for a segment is clearly not enough to train a network of this size.
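The data organisation behind the alternative model can be sketched as a partition of the stimuli by segment type, with one predictor per partition. The tuple layout, the function names and the use of plain callables as stand-ins for the trained networks are all assumptions made for illustration.

```python
from collections import defaultdict


def split_by_segment(stimuli):
    """Group (phone, features, duration_ms) stimuli by phone.

    The phone identity is used only as the grouping key, so each
    per-segment network would see the remaining features alone.
    """
    per_phone = defaultdict(list)
    for phone, features, duration_ms in stimuli:
        per_phone[phone].append((features, duration_ms))
    return dict(per_phone)


def predict(models, phone, features):
    """Dispatch to the model of the given phone (any callable, hypothetical)."""
    return models[phone](features)


# Invented stimuli with a single feature each, for illustration only.
groups = split_by_segment([("a", [0.1], 110.0), ("t", [0.2], 29.0), ("a", [0.3], 95.0)])
```

The size of each group makes the trade-off discussed above visible: a rare segment type yields a group with very few stimuli, all of which must train a full network.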

3.5.1 Alternative model results

Table 3.9 contains the global results for this model. Compared with those of Table 3.7, the alternative model yields slightly better results than the initially proposed model.

Table 3.9: Global results for the alternative duration model.

Set σ (ms) δ (ms) r

Training 19.0 13.3 0.850

Test 18.2 13.5 0.861

Fig. 3.12 shows an error histogram for all segments in both sets in comparison to the normal distribution plot.

The error distribution in Fig. 3.12 is very similar to the error distribution for the initial model, in Fig. 3.6.

Fig. 3.13 shows the normal probability distribution and absolute error curve for every segment in both sets with the alternative model.


The absolute error curve in Fig. 3.13 has a pattern similar to that of Fig. 3.7. However, this model predicts 75% of the duration values with an error below 18 ms, against 20 ms for the previous model, 90% with an error below 30 ms, and 95% with an error below 37 ms, against 40 ms for the previous model.

Fig. 3.12 – Error histogram and normal distribution curve for all segments in both sets with the alternative model (horizontal axis: e in ms).

Fig. 3.13 – Normal probability distribution and absolute error curve for all segments in both sets with the alternative model (normal probability plot; horizontal axis: |e| in ms; vertical axis: probability).


Table 3.10: Values for each segment type (phone) in both sets of the alternative model: occurrence number (#); error standard deviation (σ); mean absolute error (δ); linear correlation coefficient (r); measured average (Av.) and predicted average (Pred. Av.); measured minimum value (Min.) and predicted minimum value (Pred. Min.); measured maximum value (Max.) and predicted maximum value (Pred. Max.).

Phone  #  σ (ms)  δ (ms)  r  Av. (ms)  Pred. Av. (ms)  Min. (ms)  Pred. Min. (ms)  Max. (ms)  Pred. Max. (ms)

a 631 22.7 17.4 0.75 110 110 29 69 238 156

6 1559 18.6 13.6 0.74 68 68 15 51 232 121

E 269 24.7 18.7 0.54 97 92 35 74 195 116

e 283 22.1 16.3 0.83 95 95 31 65 250 181

@ 271 31.1 21.3 0.57 53 53 11 38 205 119

i 819 23.2 17.2 0.57 68 68 11 52 224 92

O 218 19.0 14.6 0.81 106 104 40 69 197 180

o 247 27.0 20.8 0.61 97 101 27 78 227 128

u 804 22.1 16.3 0.65 57 55 8 38 197 99

j 433 21.0 15.9 0.60 49 50 8 41 206 120

w 393 18.3 13.7 0.74 44 38 8 16 158 101

j~ 10 12.4 10.7 0.79 63 62 36 32 89 93

w~ 6 25.6 22.9 0.77 53 30 21 22 105 35

6~ 450 25.0 21.0 0.69 75 83 1 54 192 132

e~ 192 23.6 17.8 0.65 107 108 36 88 232 154

i~ 107 32.5 24.3 0.62 109 114 48 90 250 141

o~ 137 16.6 12.4 0.89 98 97 36 51 250 209

u~ 92 30.9 22.0 0.71 86 86 27 73 250 193

p 520 7.5 5.3 0.56 20 20 3 13 86 34

!p 493 16.2 12.3 0.55 64 64 18 52 159 106

t 824 10.8 7.5 0.83 29 29 3 18 160 114

!t 803 14.3 10.8 0.72 48 49 7 19 135 103

k 635 15.0 10.8 0.32 37 37 7 26 137 41

!k 594 16.3 12.8 0.30 59 59 20 20 134 63

b 196 16.8 6.6 0.55 17 15 3 3 188 28

!b 196 13.3 10.1 0.52 43 44 8 36 108 73

d 740 7.4 5.6 0.90 20 20 3 11 176 164


!d 723 16.9 12.6 0.27 41 41 6 38 153 92

g 207 6.0 4.1 0.88 20 20 5 12 102 91

!g 203 12.6 10.0 0.33 44 44 16 28 91 48

m 435 17.4 12.5 0.45 62 64 20 40 197 77

n 312 12.6 9.4 0.74 54 53 19 41 149 155

J 57 11.9 7.6 0.74 68 66 25 44 108 98

l 277 15.8 12.1 0.61 53 52 7 21 117 66

l* 146 23.1 16.9 0.64 68 69 22 29 182 99

L 56 9.4 5.5 0.90 56 57 18 33 137 102

r 1018 10.8 8.0 0.73 32 32 7 22 145 100

R 104 17.2 13.5 0.57 73 73 18 67 145 125

v 222 15.8 11.9 0.69 65 64 24 40 148 129

f 194 19.4 15.6 0.71 93 91 33 50 203 167

z 255 13.2 9.8 0.67 70 70 24 45 130 126

s 648 22.8 16.5 0.67 103 104 32 51 250 203

S 639 23.1 16.1 0.73 89 86 29 61 248 130

Z 294 22.8 17.1 0.44 78 79 33 53 194 105

Table 3.10 displays values concerning occurrence number, error standard deviation, mean absolute error, linear correlation coefficient, measured and predicted average, measured and predicted minimum and measured and predicted maximum values for each type of segment in both sets.

In comparison to Table 3.8, one can observe significantly different values for some phoneme segments in standard deviation, mean absolute error and linear correlation coefficient, usually with better results for this model. However, this model had greater difficulty estimating extreme segmental durations, whether very high or very low. As expected, it exhibits a lower dispersion of values for each phone, since the training set for each phone is also smaller.

This model is presented only as an alternative because, at the beginning of the study, the set of features and network input nodes was much larger than the current one, which required a larger set of network parameters to be estimated during training. Since the number of training cases should be at least 5 times the number of these parameters, most phones in the training set never reached that number, and the results were slightly worse than those of the original model. With a significant reduction in network input nodes, most phones were able to satisfy that requirement and consequently the model’s results improved significantly.


The alternative model achieved better results than the original one. The perceptual test in the following section will be decisive for the choice between the two.


3.6 Pauses

As far as pauses are concerned, it is important to distinguish intra-paragraph pauses from inter-paragraph pauses: the former occur within the paragraph, separating sentences or phrases, or are imposed by punctuation (e.g. . , ; : ! ?); the latter mark paragraph boundaries and are usually longer. Each type of pause has its own duration, which had to be studied.

This study used the speech corpus described in chapter 2. Due to recording and editing conditions (interruptions and cuts between paragraphs), inter-paragraph pause duration was not considered, since some of these pauses are artificial. It is known that there is always a relatively long pause between paragraphs, but its duration is not part of this study, which covers only intra-paragraph pauses.

Due to several restrictions, such as a database not well suited to this purpose and time constraints, only a very simple and incomplete model of pausing is presented.

This study has two tasks: the first is to predict the locations where pauses occur; the second is to model their durations.

3.6.1 Pause occurrence

A correlation between pauses and certain sentence markers was established, since these are directly available from the written text. However, pauses also occur between words where the text has no punctuation marker. Pauses are associated with prosodic phrasing, which more or less follows syntactic phrasing, as mentioned in the literature [Oliveira, 2002]. This study has to work around the lack of this information, since it uses no syntactic knowledge of the text.

Table 3.11 shows the number of occurrences of each type of sentence marker studied and the number of silences associated with those occurrences, only for the texts used in the training set (the same as for segmental durations). The table covers only markers that are not at paragraph endings. Apart from the cases reported in the table, 119 pauses occurred between words with no punctuation marker.

Table 3.11: Statistics on pause occurrence.

Orthographic marker  # Occurrences  # Associated pauses  Pause occurrence probability (%)

. 51 51 100

, 317 207 65.3

? 3 3 100

! 3 3 100

; 2 2 100

: 4 4 100

- *1 5 4 80

“ 24 5 20.8


... *2

( 2 2 100

) *3

*1 - Only hyphens between words were considered, and not within words (ex: ‘chamo-me’ – My name is).

*2 - This type of sentence marker is always associated to paragraph change, so there is no intra-paragraph occurrence. However, this marker is known to impose a pause.

*3 - This marker was always (just two cases) followed by another one (comma or full stop).
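The probabilities in the last column of Table 3.11 follow directly from the two count columns. The short Python sketch below reproduces them and also computes the share of pauses that are not tied to any marker, using the 119 unmarked pauses mentioned before the table. The variable and function names are hypothetical.

```python
# (occurrences, associated pauses) per marker, copied from Table 3.11
TABLE_3_11 = {
    ".": (51, 51), ",": (317, 207), "?": (3, 3), "!": (3, 3),
    ";": (2, 2), ":": (4, 4), "-": (5, 4), "“": (24, 5), "(": (2, 2),
}
UNMARKED_PAUSES = 119  # pauses between words with no punctuation marker


def pause_probability(occurrences, pauses):
    """Pause occurrence probability in percent."""
    return 100.0 * pauses / occurrences


probabilities = {m: round(pause_probability(o, p), 1) for m, (o, p) in TABLE_3_11.items()}
marked_pauses = sum(p for _, p in TABLE_3_11.values())
unmarked_share = 100.0 * UNMARKED_PAUSES / (marked_pauses + UNMARKED_PAUSES)
```

With these counts, the unmarked pauses come to a little under 30% of all intra-paragraph pauses in the training texts.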

For the sentence markers “.” and “,”, the number of occurrences in the database is statistically relevant, which allows the conclusion that there is always a pause associated with “.” and frequently a pause associated with “,”. For the other markers, the number of occurrences has no statistical significance, though it suggests that there is a pause associated with “?”, “!”, “;”, “:” and “(”. There is usually a pause associated with “-”, and usually no pause associated with quotation marks (“).

For pauses between words with no sentence marker, an attempt was made to identify words associated with pauses, before or after them, but no word occurred significantly often near a pause.

Thus, while acknowledging that this issue needs further study, a preliminary rule was established for pause insertion in a text to be synthesized:

The occurrence of at least one of the following markers imposes a pause: {. , ? ! ; : - ... (}. The other pauses (about 30%) were not considered.
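The preliminary rule can be sketched as a marker search over the input text. The regular expression below is an assumption about how the rule might be implemented: it treats “...” as a single marker and accepts a hyphen only when it stands between words (surrounded by whitespace), in line with the footnote on hyphens in Table 3.11. It also assumes a plain ASCII hyphen and does not decide whether the pause falls before or after the marker.

```python
import re

# "..." must be tried before "." so that it is matched as one marker.
# The hyphen counts only when surrounded by whitespace (between words).
PAUSE_PATTERN = re.compile(r"\.\.\.|[.,?!;:(]|(?<=\s)-(?=\s)")


def pause_markers(text):
    """Return (character index, marker) pairs where the rule imposes a pause."""
    return [(m.start(), m.group()) for m in PAUSE_PATTERN.finditer(text)]


# Invented example sentence; the hyphen in 'Chamo-me' is word-internal.
found = pause_markers("Chamo-me Ana, e tu? Bem - disse ele...")
```

In this invented example, the word-internal hyphen of ‘Chamo-me’ is ignored, while the comma, the question mark, the hyphen between words and the final ellipsis are each reported once.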

Pauses not associated with sentence markers would be expected to coincide with semantic group boundaries, should these be obtained by a process that determines prosodic phrasing. Otherwise, with the available tools for automatic identification of semantic groups, the door is open for future research in this area.

A pausing model should take prosodic phrasing into account, as in the work described by Viana and others [2003], whose phrasing markers are very strong candidates for pause locations.

3.6.2 Pause duration

The number of occurrences of each of the sentence markers {? ! ; : - “ ... ( )} is too small to determine the duration of the pauses associated with each of them individually. Therefore, the pauses associated with “.” and with “,” were separated from the remaining pauses, which make up a distinct set. An ANN was set up, by developing its architecture, topology, activation functions and training algorithm. The final network has 17 nodes in the input layer, 4 nodes in the hidden layer and 1 node in the output layer. The activation functions are the hyperbolic tangent in the hidden nodes and the hyperbolic logarithmic in the output node. The duration of each pause was linearly codified between 0 and 1, taking 565 ms as the longest pause (the measured duration of the longest intra-paragraph pause in the database). The selected training algorithm was the ‘Levenberg-Marquardt’ algorithm [Demuth and Beale, 2000] and [Hagan and Menhaj, 1994], because it led to better results than the others. The training process was identical to the one described for the duration model, using a validation set and interrupting the process once the performance began to degrade, so as to prevent over-fitting.


A set of automatically-extractable features was assembled, to enable the network to enhance its performance. This set of features is presented in Table 3.12.

Table 3.12: Parameters for the pause duration predictor.

Features Fields

Type of sentence marker associated to pause By activating 1 of 3{. , other}

Distance to previous pause Time (s), number of segment, number of intonation group

Sentence marker of the previous pause By activating 1 of 4{beginning of paragraph . , other}

Distance to following pause Time (s), number of segment, number of intonation group

Sentence marker of the following pause By activating 1 of 4{end of paragraph . , other}
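The fields of Table 3.12 add up to 3 + 3 + 4 + 3 + 4 = 17 values, which matches the 17 input nodes of the pause duration network. A possible assembly of that input vector is sketched below. The ordering of the fields, the absence of any scaling of the distances and all names are assumptions made for illustration.

```python
MARKER_CLASSES = [".", ",", "other"]
PREVIOUS_CLASSES = ["beginning of paragraph", ".", ",", "other"]
FOLLOWING_CLASSES = ["end of paragraph", ".", ",", "other"]


def one_hot(value, classes):
    """Activate exactly one class; unknown values fall into 'other'."""
    if value not in classes:
        value = "other"
    return [1.0 if value == c else 0.0 for c in classes]


def pause_features(marker, dist_prev, prev_marker, dist_next, next_marker):
    """Build the 17-value input vector.

    dist_prev and dist_next are (time in s, number of segments,
    number of intonation groups) triples.
    """
    return (one_hot(marker, MARKER_CLASSES) + list(dist_prev)
            + one_hot(prev_marker, PREVIOUS_CLASSES) + list(dist_next)
            + one_hot(next_marker, FOLLOWING_CLASSES))


# Invented example: a comma, first pause of its paragraph, followed by a full stop.
vector = pause_features(",", (1.2, 14, 2), "beginning of paragraph", (0.9, 11, 1), ".")
```

A marker such as “?” or “;” would be mapped to the ‘other’ class, in keeping with the grouping of the less frequent markers.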

Table 3.13: Best results for the intra-paragraph pause duration predictor.

Set rmse (ms) r

Training 87.2 0.59

Test 94.7 0.54

Table 3.14: Marker type results for the pause duration predictor.

Marker type  Training (Original: N. *1, d *2 (ms), σ *3 (ms); Predicted: rmse *4 (ms), r)  Test (Original: N. *1, d *2 (ms), σ *3 (ms); Predicted: rmse *4 (ms), r)

. 51 309 103 95 0.43 11 225 97 104 0.55

, 207 198 95 80 0.53 47 228 86 85 0.33

others 28 243 129 121 0.35 16 360 119 145 0.35

*1 - Number of cases. *2 - Average duration of the pauses produced by the speaker. *3 - Standard deviation of the pauses produced by the speaker. *4 – Root mean square error (Eq. (3.11)) of the difference between predicted durations and

durations produced by the speaker.

Table 3.13 exhibits the model’s best results for the training and test sets (see footnote 10). Table 3.14 displays the results by type of marker associated with the pause, for both the training and test sets.

In this case, the root mean square error (rmse) was used instead of the standard deviation of the difference between predicted and measured durations, because that difference does not have zero mean. In other words, the model failed to estimate the average duration, both over all durations and for durations grouped by marker type. This happens because the average duration differed significantly between the training and test sets. Moreover, the rmse of the predicted durations is large in comparison with the standard deviation (σ) of the measured durations, which shows that the results were not good.
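The reason for preferring the rmse when the error has a non-zero mean can be made concrete: the squared rmse equals the squared mean error (bias) plus the variance of the error, so the standard deviation alone hides the bias. The sketch below assumes the usual definition of rmse and uses made-up durations purely for illustration; they are not values from the database.

```python
import math


def error_stats(predicted, measured):
    """Return (bias, standard deviation of the error, rmse), all in input units."""
    errors = [p - m for p, m in zip(predicted, measured)]
    n = len(errors)
    bias = sum(errors) / n
    # Population variance of the error around its own mean.
    variance = sum((e - bias) ** 2 for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return bias, math.sqrt(variance), rmse


# Made-up pause durations in ms (illustrative only).
predicted = [200.0, 220.0, 250.0, 210.0]
measured = [260.0, 300.0, 280.0, 290.0]
bias, std, rmse = error_stats(predicted, measured)

# rmse^2 = bias^2 + std^2, so rmse exceeds std whenever the bias is not zero.
print(bias, std, rmse)
```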

A logarithmic codification for duration values was also tested, using Eq. (3.16), where D is the pause duration in seconds and D’ is the codified duration. Yet, there was no improvement.

$$D' = \log_2(1 + D)$$ Eq. (3.16)
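The two codifications of the network target can be sketched as follows. The logarithmic form follows Eq. (3.16) with D in seconds. The linear form assumes a direct mapping of 0 to 565 ms onto 0 to 1; the source only states that the coding is linear with 565 ms as the longest pause, so the lower bound of zero and the function names are assumptions.

```python
import math

MAX_PAUSE_S = 0.565  # longest intra-paragraph pause in the database, in seconds


def encode_linear(duration_s):
    """Linear coding: 0 s maps to 0 and 0.565 s maps to 1 (assumed mapping)."""
    return duration_s / MAX_PAUSE_S


def decode_linear(code):
    return code * MAX_PAUSE_S


def encode_log(duration_s):
    """Logarithmic coding of Eq. (3.16): D' = log2(1 + D), with D in seconds."""
    return math.log2(1.0 + duration_s)


def decode_log(code):
    """Inverse of Eq. (3.16): D = 2**D' - 1."""
    return 2.0 ** code - 1.0


for d in (0.1, 0.3, MAX_PAUSE_S):
    print(d, encode_linear(d), encode_log(d))
```

The network output is decoded with the matching inverse function to obtain a duration in seconds.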

3.6.3 Final considerations on studying pauses

Some aspects of pause insertion and pause duration were studied using the speech database. The available database was clearly unsuitable for this study. Nevertheless, the study of pauses and its basic procedures were worked through, so that the issue can be studied further in the future.

First, it was necessary to separate inter-paragraph pauses from intra-paragraph pauses. Inter-paragraph pauses were not studied because of cuts made during the recording process. As for intra-paragraph pauses, statistical data allowed some rules to be established regarding their location in relation to sentence markers, although there were not many occurrences in the database. There are also pauses between words that are not associated with any sentence marker. For these pauses the only available information was the words themselves, and it was impossible to establish a model, given the database restrictions and the amount of automatically-extractable information. In this case, linguistic information would ideally enable an automatic division of the sentences into semantic groups [Oliveira, 2002], [Masaki et al., 2002]. Viana and her collaborators [2003] report that, on correctly punctuated text, their phrasing module for European Portuguese should insert 61% of the pauses correctly on average, with no false insertions. When word information is added (whether a word is a function word or a content word), the performance level increases to 85%, but false insertions also increase from 0 to 17%. When punctuation information is combined with POS (part-of-speech) information, correct pause insertion reaches 92% and false insertions drop to 4%, for a set of 12 labels.

In this work a very simple pausing model was proposed to insert pauses and predict their durations in intra-paragraph breaks. The model inserts pauses only at orthographic text markers, disregarding other breaks that are also important. In the database considered, 70% of the pauses are associated with orthographic markers. An ANN was proposed that considers only the distances to, and the types of, the previous and following pauses. The rmse (95 ms) and correlation coefficient (0.54) achieved on the test set are comparable to the results reported by Navas [2003] for Basque using a CART-based approach: an rmse of 80 ms and a correlation of 0.50.

Footnote 10: These sets were obtained from the existing intra-paragraph pauses in the training and test texts of the duration model.

The pausing module can be improved with a dedicated database containing many more pauses, which would yield statistically relevant results, and a significant amount of syntactic information. For such a database there is no need to label the phonemes; it is sufficient to identify the pauses by type and to register the syntactic information.

3.7 Conclusion

A model and an alternative version of it were proposed to predict phoneme segmental durations from text, with a view to synthesising the speech for that text. Both are based on feed-forward artificial neural networks, with a set of input features that specify the identity and context of a given segment. The model consists of a single ANN, with the segment identity coded in the input features, while the alternative model consists of one ANN per segment identity, 44 ANNs in total. The remaining features are the same in both models. The feature set, ANN architecture and training alternatives were carefully optimised. Training used a database of read texts containing several types of sentences.

Both models achieved a very high performance level, but the alternative model, with specific training for each segment type, had slightly better final results. The alternative model achieved a standard deviation of the error of 18.2 ms and a correlation coefficient of 0.86, against 19.5 ms and 0.84 for the model. The perceptual relevance of this difference will be studied in chapter 5. These results show that the prediction of segmental duration benefits from splitting a large model into smaller dedicated model units.

The presented results were as good as the best presented in the literature for different models and other languages.

The model’s results were analysed by comparing them with the durations obtained from the speech labelling of a set of texts. The way the texts were read is certainly not the only possible one, and although the model tried to “imitate” a (professional) speaker, his reading rhythm is not always consistent from sentence to sentence. This becomes quite obvious when the model’s results are preferred to the original ones.

One should also take into consideration the error margin of the speech labelling itself. There are two types of error: gross errors, resulting from the incorrect marking of segments; and precision errors, resulting from a lack of consistency in marking every segment at the same moment of the cycle. The former were eliminated as they were found; the latter are typical of the manual labelling process and reflect the difficulty of identifying phoneme boundaries. Consequently, there is a certain error margin in the very identification of the original segment durations.

The purpose of the duration model is its application in a text-to-speech synthesis system, in which the durations of the voiced speech segments are always multiples of the fundamental period. There is thus no benefit in the model’s durations being more precise than the duration of the fundamental period. For a fundamental frequency of about 100 Hz this period is usually 10 ms and, in some cases, about half that.
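The effect of this constraint can be sketched in a few lines. The function below rounds a predicted duration to the nearest whole number of fundamental periods; the rounding rule, the minimum of one period and the function name are assumptions made for illustration, not a description of the synthesiser used in this work.

```python
def snap_to_pitch_periods(duration_ms, f0_hz):
    """Round a voiced-segment duration to a whole number of fundamental periods.

    Assumes a constant F0 over the segment and nearest-multiple rounding.
    """
    period_ms = 1000.0 / f0_hz
    n_periods = max(1, round(duration_ms / period_ms))
    return n_periods * period_ms


# At 100 Hz the period is 10 ms, so a predicted 87 ms segment is realised
# as 9 periods; sub-period precision in the prediction is lost anyway.
print(snap_to_pitch_periods(87.0, 100.0))
```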

As mentioned by several authors, namely Klatt [1976], there is a minimum difference in segmental duration that is perceived by the listener. It differs according to the length of the segment and its location within the word and sentence. In a summary of other studies for several languages, Klatt cites a value of 10 ms for segments of 100 ms duration, for vowels, fricatives, plosives and nasals in Japanese, a language in which some phonemes are distinguished by their duration. He also mentions values of 20 and 25 ms from studies of English by different authors. Lastly, he concludes that duration changes below the minimum perceived value of 25 ms are, from the perceptual point of view, considerably less relevant than those above that value.

The model is obviously not perfect. It can evolve, specifically in those rare situations where the error margin is large. However, for the model to bring improvements when applied to a text-to-speech synthesis system, other blocks of the system should also be improved. Brinckmann and Trouvain [2003] mention that, for TTS purposes, the quality of the symbolic representation (instead of full lexical representation) is more important than perceptually masked improvements in duration prediction models. At this stage, the focus of this work is therefore no longer on segmental modelling but on the improvement of other synthesis blocks.

Finally, a simple pausing model to insert pauses and predict their durations was presented. Pause insertion is determined only by orthographic punctuation markers, which cover about 70% of the existing breaks. Durations are predicted with an ANN using the text information and considering contextual aspects only. The duration prediction model achieved the promising results of an rmse of 95 ms and a correlation of 0.54. Still, the database was considered unsuitable for the purpose and the contextual information was not sufficient. A prosodic phrasing model with a larger linguistic basis is required.

4 Fundamental Frequency

In this chapter, some of the most relevant fundamental frequency (F0) models are reviewed. The Fujisaki F0 contour generation model is introduced, together with a description of the interactive tool that allows the F0 parameters to be estimated semi-automatically according to this model. The methodology adopted in the estimation process, which associates Accent Commands (henceforth ACs) with syllables, is discussed. Phrase Commands (henceforth PCs) are inserted by a rule-based method aligned with accent groups. The final position of each PC is determined by its anticipation relative to the accent group, which is predicted with an ANN, as is its magnitude. ACs are predicted with four ANNs: for amplitude, onset time, offset time, and whether or not an AC is associated with the syllable.

4.1 Introduction

The F0 contour has proven to be the most relevant prosodic parameter for conferring naturalness on synthetic speech. Owing to its complexity, it is also the topic that receives most attention in scientific publications on prosody.

There is no consensus in the literature when it comes to defining prosody or prosodic models. There are usually two views on these concepts, even if the definitions differ a lot. One of them is more concrete [Ladd and Cutler, 1983], and it conceives prosody from a physical point of view as a set of acoustic parameters that can be measured and modelled, including ‘pitch’ (F0), duration and intensity. This was the view adopted in this work.

Prosody may include non-lexical information regarding types of utterance (declarative, interrogative, etc.); it may also accumulate utterance functions such as sentence focus or prominence of certain sections of the sentence. Moreover, prosody may contain information on the potential emotions of the utterance.

Prosody, here expressed by F0, represents information at the linguistic, nonlinguistic and paralinguistic levels, as defined by Fujisaki [1997:28] and transcribed below:

“Here I define linguistic information as the symbolic information that is represented by a set of discrete symbols and rules for their combination. It can be represented either explicitly by the written language, or can be easily and uniquely inferred from context.”

“On the other hand, paralinguistic information is defined as the information that is not inferable from the written counterpart but is deliberately added by the speaker to modify or supplement the linguistic information.”.

“Nonlinguistic information concerns such factors as the age, gender, idiosyncrasy, physical, and emotional states of the speaker, etc. These factors are not directly related to the linguistic and paralinguistic contents of the utterances and cannot generally be controlled by the speaker, ...”.

Naturally, present TTS systems cannot handle paralinguistic and nonlinguistic information. However, this information is present in the databases from which prosodic models are built, and it is therefore also an uncontrolled part of these models. Thus, linguistic information is the only source of cues for controlling the F0 contour according to the model in use.

Most TTS systems divide the intonation generation task into a linguistic component and an F0 generation component [Sproat, 1998]. The linguistic component is responsible for analysing the text, processing the input text along with possible high-level markers. These markers, which cannot be deduced from the text, contain information on prosodic intentions, information that gives rise to prosodic events. The F0 generation component consists of the process of generating an F0 contour from linguistic representations or generated prosodic events. Traditionally, the F0 generation component is conceived to support a specific abstract representation.

Prosodic models become more effective the better the information they receive from the linguistic component. Syntactic and morphological analysers are likely to improve the linguistic component results significantly, since they allow more accurate decisions as far as the F0 generation component is concerned. The linguistic component alone is not sufficient to build a prosodic model, because the same sentence can be said in various ways, depending on the context. One can say:

• **I** didn’t eat the apple. (He did);

• I **didn’t** eat the apple. (Post-accusation denial);

• I didn’t **eat** the apple. (I picked it up, but gave it to the boy instead);

• I didn’t eat the **apple**. (I ate an orange and a pear, but not the apple).

In these examples of the same sentence, the linguistic information is precisely the same, but the emphasis lies on the word written in bold, according to the context. Cases such as these can only be solved with prosodic models prepared to handle high-level prosodic markers, so that linguistic features like emphasis or sentence focus can be identified. If a certain prosodic feature is not clearly inferable from the text, or if it lacks an identifying marker, a neutral production is preferred to a wrong one.

Moving towards interactive TTS systems requires more freedom for prosodic expression than is currently allowed [Kochanski and Shih, 2002]. Most TTS systems are conceived to handle little or no prosodic information marked outside the text. Kochanski and Shih believe that the next generation of TTS applications will not suffer from these constraints, as they will be directed at dialogue applications and will thus contain information regarding the goals and intentions of the utterance. This information must be expressed by prosody, so “concept to speech” should be seriously considered in speech synthesis. Moreover, some applications require emotional simulation, stylistic variation, etc., so this information should be provided to TTS systems by adding markers to the text. With these markers, the system would not have to infer so much from the text; it would consequently make fewer mistakes and could attempt a more daring, less neutral utterance.

For that matter, the model presented here can be easily adapted to handle a set of prosodic markers from a marking system.

There are different schools of intonation describing prosody. The best known are briefly described below:

• ToBI (Tone and Break Indices) – the most widely used basis for representing intonation and prosodic structure in several languages. It is based on thorough research of intonation systems and on the relation between intonation and prosodic structure for a given language. Each accent is represented by no more than two points, which specify the relative contrast between high (H) and low (L) tones abstractly [Pierrehumbert, 1980], [Hirschberg and Pierrehumbert, 1986] and [Silverman and Pierrehumbert, 1990] (Fig. 4.1). The ToBI system aims at specifying a minimal set of intonation category markers, which are usually interpreted as phonological distinctions of accent types. Frota [2000] made a prosodic characterisation of Standard European Portuguese using this model for the intonation description;

Fig. 4.1 – Example of a ToBI intonation representation. (taken from http://www.ling.ohio-state.edu/~tobi/).

• Tilt – Model that represents intonation in the shape of a linear sequence of events, which may be F0 accents or boundary tones. Each event is characterised according to continuous parameters representing amplitude, duration and ‘tilt’ (measure of the shape of the event) [Taylor, 2000];

• INTSINT (INternational Transcription System for INTonation) – proposed by Hirst and Di Cristo [1998], [Hirst et al., 2000], [Hirst, 2002]. It is an intonation transcription system which codifies F0 patterns using a set of abstract tone symbols. Those symbols can be absolute or relative. The {T, M, B} symbols (Top, Mid, Bottom) are absolute symbols for the F0 variation range of a speaker. The {H, S, L, U, D} symbols (Higher, Same, Lower, Upstepped, Downstepped) are relative to the previous target-point. Each symbol characterises a target-point in the phonetic transcript, and these points are later expanded by the MOMEL algorithm (MOdélisation de MELodie) [Hirst and Espesser, 1993]. This algorithm allows for automatic modelling of the macroprosodic component of the F0 contour with a sequence of points which define a quadratic spline function. Ferreira [1998] described the Standard European Portuguese intonation patterns according to this model. She found three descending tone patterns, one bidirectional pattern and two ascending tone patterns;

• Fujisaki – an F0 generation physiological model developed by Fujisaki and presented in several publications [Fujisaki, 1983, 1988, 1997, 2002]. This quantitative model divides F0 into three components added in the logarithmic domain: baseline frequency, phrase components and accent components. Phrase components are regulated by a set of impulse-like commands known as phrase commands; accent components are regulated by a step-wise set of commands known as accent commands.

The Fujisaki model was adopted in this work for the following reasons:

• It is both a physical and physiological model;

• It has been applied successfully to TTS systems in other languages, namely Japanese, German [Mixdorff, 1998, 2002] and Basque [Navas, 2003]. Those systems obtained improved results when they moved from their original prosodic models to the Fujisaki model;

• It allows precise modelling of the F0 contour with a relatively small number of parameters, as will be seen in 4.3;

• Separating the phrase and accent components divides the problem, which made a more rigorous analysis of each part possible;

• It is a mathematical F0 generation model which allows a discrete quantification of intonation events.

The physiological basis and the mathematical exploration of the Fujisaki model are presented in section 4.2. Section 4.3 presents the tool developed for estimating the Fujisaki parameters, together with the estimation process itself and the considerations made while estimating those data. The organisation of the data and the definition of the several parts of speech used are explained in section 4.4. In section 4.5 a PC insertion model is presented, as well as a model to predict the magnitude of these commands and their anticipation relative to accent groups. In section 4.6 an AC prediction model is documented, in which their location, amplitude and duration are discussed. Results for each part of the model are presented. The predicted F0 contour is obtained by applying every part of the model, as documented in section 4.7, and is imposed on a speech signal whose segmental durations have been modified.

4.2 The Fujisaki Model

The Fujisaki quantitative intonation model [Fujisaki and Hirose, 1984], initially developed for Japanese, has been extended to other languages by Fujisaki himself and his collaborators. As opposed to the previously mentioned models, which try to model F0 contours, this model seeks to model the very process of F0 generation, explaining the physical and physiological mechanisms that underlie it [Fujisaki, 2002]. The author considers that the speaker gathers several types of information in the process of communication, information that is manifested in the segmental and suprasegmental features of speech. Fig. 4.2 represents the inclusion of linguistic, paralinguistic and nonlinguistic information in the consecutive stages of speech feature processing. Each of those stages is ruled by a set of physical or physiological constraints. Fujisaki’s definition of linguistic, paralinguistic and nonlinguistic information was given in the previous section. Fig. 4.2 also helps to understand the difficulty of finding a clear and unique correspondence between the physical features detected in speech and the prosodic organisation of a sentence. The author follows these two steps to infer the prosodic organisation from the physical features detected in speech:

1. Inferring the commands from the speech characteristics;

2. Inferring the units and the structures of prosody from the commands.

Fig. 4.2 – Processes by which various types of information are manifested in the segmental and supra-segmental features of speech. (Figure published in [Fujisaki, 2002], edited with courtesy of Hiroya Fujisaki). [Block diagram. Input information: linguistic (lexical, syntactic, semantic, pragmatic), para-linguistic (intentional, attitudinal, stylistic) and non-linguistic (physical, emotional). Stages: message planning, utterance planning, motor command generation and speech sound production, yielding the segmental and supra-segmental features of speech. Constraints: rules of grammar, rules of prosody, physiological constraints and physical constraints.]

As step 1 is the inverse of the speech production process, it can be conducted more accurately and objectively if there is a quantitative model of the production stage. That model has been applied successfully to several languages. The process of inferring prosodic units and structures described in step 2 led, in this work, to the development of a statistical model able to generate the corresponding parameters from the text automatically.

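Fig. 4.3 – Functional model for the process of generating F0 contours. (Figure published in [Fujisaki, 2002], edited with courtesy of Hiroya Fujisaki). [Block diagram. Phrase commands (Ap) drive the phrase control mechanism Gp(t), producing the phrase components; accent commands (Aa) drive the accent control mechanism Ga(t), producing the accent components; both are added to loge Fb to give loge F0(t), the fundamental frequency contour.]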

Fig. 4.3 represents the process of generating F0 contours from PCs and ACs in Fujisaki’s model. The PCs are a set of impulses, and the ACs are a set of stepwise functions. The F0 contour can be expressed by Eq. (4.1), where Gp(t), Eq. (4.2), represents the impulse response function of the phrase control mechanism and Ga(t), Eq. (4.3), represents the step response function of the accent control mechanism.

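$$\log_e F_0(t) = \log_e F_b + \sum_{i=1}^{I} A_{pi}\, G_p(t - T_{0i}) + \sum_{j=1}^{J} A_{aj} \left\{ G_a(t - T_{1j}) - G_a(t - T_{2j}) \right\}$$ Eq. (4.1)

$$G_p(t) = \begin{cases} \alpha^2 t \exp(-\alpha t), & t \ge 0, \\ 0, & t < 0, \end{cases}$$ Eq. (4.2)

$$G_a(t) = \begin{cases} \min\left[ 1 - (1 + \beta t) \exp(-\beta t),\ \gamma \right], & t \ge 0, \\ 0, & t < 0, \end{cases}$$ Eq. (4.3)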

where,

Fb : baseline value of fundamental frequency;

I : number of phrase commands;

J : number of accent commands;

Api : magnitude of the ith phrase command;

Aaj : amplitude of the jth accent command;

T0i : timing of the ith phrase command;

T1j : onset of the jth accent command;

T2j : offset of the jth accent command;

α : natural angular frequency of the phrase control mechanism;

β : natural angular frequency of the accent control mechanism;

γ : relative ceiling level of accent components.
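Eqs. (4.1) to (4.3) translate almost directly into code. The sketch below is a minimal illustration written for this text, not the tool developed in this work; the function names and the representation of commands as tuples are assumptions. The example values (Fb = 75 Hz, α = 2 /s, β = 30 /s) are those used in the figures of this section.

```python
import math

GAMMA = 0.9  # relative ceiling level of the accent components


def g_p(t, alpha):
    """Impulse response of the phrase control mechanism, Eq. (4.2)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0


def g_a(t, beta, gamma=GAMMA):
    """Step response of the accent control mechanism, Eq. (4.3)."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, fb, phrase_cmds, accent_cmds, alpha, beta, gamma=GAMMA):
    """Evaluate Eq. (4.1) at each instant in `times` (seconds).

    phrase_cmds: list of (Ap, T0); accent_cmds: list of (Aa, T1, T2).
    Returns F0 in Hz.
    """
    contour = []
    for t in times:
        log_f0 = math.log(fb)
        log_f0 += sum(ap * g_p(t - t0, alpha) for ap, t0 in phrase_cmds)
        log_f0 += sum(
            aa * (g_a(t - t1, beta, gamma) - g_a(t - t2, beta, gamma))
            for aa, t1, t2 in accent_cmds
        )
        contour.append(math.exp(log_f0))
    return contour


# One phrase command at t = 0 s and one accent command from 0.5 s to 0.65 s.
times = [i * 0.01 for i in range(301)]
f0 = f0_contour(times, 75.0, [(0.5, 0.0)], [(0.6, 0.5, 0.65)], alpha=2.0, beta=30.0)
print(min(f0), max(f0))
```

Because the components are added in the logarithmic domain, each command multiplies the baseline frequency rather than adding a fixed number of hertz to it.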

Fujisaki assumes that parameters α and β are constant at least within an utterance, and the parameter γ is set equal to 0.9. The rapid fall of F0 often observed at the end of a sentence can be regarded as the response of the phrase control mechanism to a negative impulse that resets the phrase component.

Fujisaki [1988, 2002] presents the physiological and physical mechanism underlying the model.

The three components added on a logarithmic scale in the Fujisaki model are the F0 baseline, which depends on the speaker; the phrase component, which is related to prosodic phrasing; and the accent component, which is related to syllable or word accents. The first is constant; the models that produce the other two components from text are explored in the next sections.

4.2.1 Phrase component

The inputs of the mechanism to produce the phrase component are impulses defined by their magnitude Ap and onset time T0. The natural angular frequency, α, is assumed as constant within an utterance.

Fig. 4.4 displays the phrase components for different magnitudes Ap. The shape of the phrase component is the same in every case, but the higher the magnitude Ap, the higher the component and the steeper its rising and falling slopes; the falling slope models the declination line of the F0 contour.

Fig. 4.5 displays the phrase components for different values of the natural angular frequency α. The higher α is, the sharper the shape becomes, with faster rising and falling slopes and a higher peak of the phrase component.

α should be chosen according to the shape of the lower values of the F0 contour, and the magnitude Ap adjusted according to the F0 amplitude.
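The dependence of the peak on Ap and α described above follows directly from Eq. (4.2). The short derivation below is added for illustration and considers a single, isolated phrase command at T0 = 0 with no accent commands.

```latex
\frac{dG_p}{dt} = \alpha^2 (1 - \alpha t)\, e^{-\alpha t} = 0
\quad\Rightarrow\quad
t_{\max} = \frac{1}{\alpha},
\qquad
G_p(t_{\max}) = \frac{\alpha}{e}

F_{0,\max} = F_b \exp\!\left( \frac{A_p\, \alpha}{e} \right)
```

The peak is therefore reached 1/α after the command, and its height in the logarithmic domain grows linearly with both Ap and α. For example, with Fb = 75 Hz, Ap = 0.5 and α = 2 /s, the peak occurs at 0.5 s and is about 108 Hz.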

Fig. 4.4 – Phrase component for PC magnitudes Ap = 0.15, 0.30, 0.50 and 0.80 with α = 2 /s, logarithmically added to Fb = 75 Hz. [Plot: F0 (Hz), 70 to 140, against time (s), 0 to 3; one curve per value of Ap.]

Fig. 4.5 – Phrase components for PCs with α = 1, 2, 3 and 4 /s with Ap = 0.5, logarithmically added to Fb = 75 Hz. [Plot: F0 (Hz), 70 to 160, against time (s), 0 to 3; one curve per value of α.]

4.2.2 Accent component

The inputs of the mechanism to produce the accent component are step-wise signals defined by their amplitude Aa, onset time T1 and offset time T2. The natural angular frequency, β, is assumed as constant within an utterance.

Fig. 4.6, Fig. 4.7 and Fig. 4.8 display the accent component with the variation of Amplitude (Aa), accent command duration (T2-T1) and angular frequency (β), respectively.

The higher the amplitude Aa, the higher the accent component and the sharper its contour, with steeper rising and falling curves. The length of the component is independent of the amplitude.

The shape of the accent component depends on whether or not it reaches the maximum amplitude, which in turn depends on the duration of the accent command. The component is the sum of a rising part and a falling part, which start at the onset time (T1) and the offset time (T2), respectively. If the offset occurs before the end of the rising part, the rising and falling parts overlap and are added until the rising part ends. The interval between the end of the rising part and the offset time corresponds to the flat part of the accent component, whose level is controlled by the parameter γ.

If the offset time comes after the rising part is complete, the falling part has the shape of the inverted rising part; the two therefore have exactly the same duration, and the fall starts exactly at the offset time. If, however, the offset time comes before the rising part is complete, the rising part is shorter than the falling part, and the fall starts only after the offset time. This difference can be clearly observed in Fig. 4.7 in the component with T2 = 50 ms.

The accent component amplitude is limited to the value Aa·γ (on a logarithmic scale), as expressed by Eq. (4.3). The value of γ was set to 0.9, as proposed by Fujisaki in the above-mentioned references.

The duration of the rising and falling parts depends on β: the higher β is, the faster the accent component rises and falls.
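The time the rising part needs to reach the ceiling can be obtained from Eq. (4.3) by solving 1 − (1 + βt) exp(−βt) = γ. The sketch below does this numerically by bisection; the function name and the choice of bisection are illustrative assumptions. With γ = 0.9 the solution is βt ≈ 3.89, so for β = 30 /s the rise takes roughly 0.13 s, which is why an accent command of only 50 ms never reaches the flat part.

```python
import math


def accent_rise_time(beta, gamma=0.9, tol=1e-12):
    """Time (s) for the rising part of Eq. (4.3) to reach the ceiling gamma.

    Solves 1 - (1 + x) * exp(-x) = gamma for x = beta * t by bisection.
    Assumes 0 < gamma < 1.
    """
    def rise(x):
        return 1.0 - (1.0 + x) * math.exp(-x)

    lo, hi = 0.0, 50.0  # rise(0) = 0 and rise(50) is practically 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rise(mid) < gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) / beta


for beta in (20.0, 25.0, 30.0, 35.0):
    print(beta, accent_rise_time(beta))
```

Since the solution scales as 1/β, doubling β halves the duration of the rising and falling parts.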

Fig. 4.6 – Accent components for ACs with T1 = 0 s, T2 = 0.15 s, β = 30 /s and Aa = 0.15, 0.30, 0.50 and 0.80, logarithmically added to Fb = 75 Hz. [Plot: F0 (Hz), 70 to 160, against time (s), 0 to 0.3; one curve per value of Aa.]

Fig. 4.7 – Accent components for ACs with β = 30 /s, Aa = 0.60, T1 = 0 s, and T2 = 0.05, 0.1, 0.15 and 0.2 s, logarithmically added to Fb = 75 Hz. [Plot: F0 (Hz), 70 to 140, against time (s), 0 to 0.35; one curve per value of T2.]

Fig. 4.8 – Accent components for ACs with Aa = 0.60, T1 = 0 s, T2 = 0.15 s and β = 20, 25, 30 and 35 /s, logarithmically added to Fb = 75 Hz. [Plot: F0 (Hz), 70 to 140, against time (s), 0 to 0.35; one curve per value of β.]

4.3 Parameters Estimation of Fujisaki Model

This section explains the process of estimating the Fujisaki model parameters for the paragraphs of the database. It discusses the tool developed to support the manual estimation process, the comparison between estimated and original F0 contours, and some considerations resulting from the experience of the estimation process.

To start, it is important to note the difference between the terms estimation and prediction. Estimation refers to a bottom-up process that obtains parameters, commands in this case, from the F0 contour. Prediction refers to a top-down process that obtains parameters, commands in this case, from text.

The process of parameter estimation is very laborious. There are some algorithms that perform this task automatically, such as the ones presented by [Mixdorff, 2000], [Rossi et al., 2002], [Fujisaki and Narusawa, 2002] and [Narusawa et al., 2001, 2002a, 2002b]. In this work, the Mixdorff algorithm was used for a first approximation, and the parameters were then manually corrected using a tool specially developed for this task.

Fb is speaker dependent; it is not constant even for one speaker and can vary slightly from utterance to utterance.

Parameters α and β do not vary much from one speaker to another, nor from one utterance to another, according to Fujisaki’s experience with many languages and speakers [personally reported], and can be approximated by 3.0 /s and 20 /s, respectively. A smaller value for α tends to miss small and short phrases, and tends to approximate several small phrases by one long phrase.

There is a physiological reason to consider the value of β somewhat different for the onset and offset of the accent command [Fujisaki, 2002]. It is larger for the offset, but the same value is used for the sake of reducing the number of variables.

Since the database was recorded by a single speaker, the model developed here is optimised for the characteristics of that particular speaker. In order to reduce the number of variables of the model without loss of quality, some parameters were considered constant. The experience of estimating parameters for the speech of that speaker, based on a preliminary analysis of several utterances, showed that it would be appropriate to fix Fb, α and β at the values presented in Table 4.1.

Table 4.1: Constant parameters.

Parameter Value

Fb 75 Hz

α 2.0 /s

β 20 /s

A special-purpose tool was developed to support the manual estimation of parameters. The next section presents this tool.


4.3.1 Tool to support the manual estimation of Fujisaki model parameters

The tool was developed in the Matlab® environment and uses the re-synthesised speech signals produced with the PRAAT software [Boersman and Weenink].

The data provided by the tool is obtained from several modules described before, such as syllabification, intonation group rules and the original F0 contour supplied by PRAAT files, and from other data given by the labelled files of the database.

Fig. 4.9 displays the data provided to help the manual labelling. From top to bottom, the following are presented: the speech signal; the F0 determined by PRAAT (blue + signs); the estimated F0 produced with the labelled commands, i.e. accent components plus phrase components plus Fb (in black); the phrase components plus Fb (in black); PCs (black arrows); ACs (black pedestals); the syllables in descending lines (red: tonic syllable, blue: normal syllable, black: syllable without vowel), where each descending line is one accent group; the orthographic phrase marks (in red); the words (the beginning of each word is marked with a vertical cyan dotted line); and finally the sequence of phoneme segments. All data are synchronised with the speech signal waveform. The top of the figure gives the root mean squared error between estimated and original F0 contours, considering only the non-zero values.

[Figure: tool display over t (s), from 0 to 6 s; vertical axis in Hz, with commands scaled as Aa/100 and Ap/100; words and phoneme segments shown below the contours; rmse1=2.96 Hz.]

Fig. 4.9 – Example of the data provided by the tool to manually estimate the Fujisaki parameters.

The original F0 contour is determined using the PRAAT 4.0 software and saved into a file that is read by the tool. The PRAAT command ‘To Pitch…’ is used to determine the original F0 contour. This command performs a pitch analysis based on the autocorrelation method. The algorithm performs acoustic periodicity detection on the basis of an accurate autocorrelation method, as described in [Boersman, 1993]. The author claims that the method is more accurate, noise-resistant and robust than methods based on cepstrum or combs, or even the original autocorrelation methods. Nevertheless, there are several outliers in the F0 contour. Low-amplitude and not clearly voiced sounds are the most frequent situations where outliers appear. A post-processing step is performed to remove them.

The post-processing algorithm removes all F0 values above a maximum threshold, as well as sequences of one to four F0 values for which the variation before and after the sequence is higher than a chosen delta variation. For the present speaker the threshold is 200 Hz and the delta variation is 10 Hz.
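The post-processing rule above can be sketched in Python as follows. This is only one possible reading of the description: it assumes that unvoiced frames are stored as 0, that the variation is measured between consecutive voiced frames, and that removed values are also set to 0. The function name and these conventions are assumptions and are not taken from the original tool.

```python
def remove_f0_outliers(f0, max_f0=200.0, delta=10.0, max_run=4):
    """Zero F0 values above max_f0 and short runs bounded by jumps larger than delta."""
    cleaned = [0.0 if v > max_f0 else float(v) for v in f0]
    n = len(cleaned)
    i = 1
    while i < n - 1:
        removed = False
        jump_in = cleaned[i] > 0 and cleaned[i - 1] > 0 and abs(cleaned[i] - cleaned[i - 1]) > delta
        if jump_in:
            # Look for a jump back within the next one to four frames.
            for run in range(1, max_run + 1):
                end = i + run
                if end >= n or cleaned[end] == 0:
                    break
                if abs(cleaned[end] - cleaned[end - 1]) > delta:
                    for k in range(i, end):
                        cleaned[k] = 0.0
                    i = end
                    removed = True
                    break
        if not removed:
            i += 1
    return cleaned

# Hypothetical contour: a two-frame excursion and one value above 200 Hz.
example = remove_f0_outliers([100, 101, 102, 150, 151, 103, 104, 250, 105])
```

In the hypothetical contour, the value above 200 Hz and the two-frame excursion are both zeroed, while a change that lasts longer than four frames would be kept.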

Fig. 4.10 – Window with menus of the tool to manually estimate the Fujisaki parameters.

Fig. 4.10 displays the window of the tool. The left part of the menu bar contains the default items of a Matlab figure, and the right part contains the special-purpose tool menus. In the middle there is a menu with the identification of the paragraph (t2_p19, in this case). The Matlab figure toolbar, presented in a second line, provides zoom in and zoom out. The contents of the special-purpose menus are described below:

Play original:

• Play all – plays the entire paragraph speech signal;


• Select and play – allows the selection and playback of a part of the speech. The initial and final instants of the selected part remain available for the present signal and for the re-synthesised speech signal;

• Play selected – Plays the previously selected part of speech.

Play Re-synthesis:

• Load Re-synthesis – loads the file with the re-synthesised speech signal previously saved with PRAAT under a specific name;

• Play all – plays the entire speech signal of the paragraph;

• Select and play – allows the selection and playback of a part of the speech. The initial and final instants of the selected part remain available for the present signal and for the original speech signal;

• Play selected – Plays the previously selected part of speech.

C. Phrase:

• T0 – Change PC position;

• Ap – Change PC magnitude;

• Insert – Insert new PC;

• Delete – Delete PC.

C. Accent:

• T1 – Change onset time of AC;

• T2 – Change offset time of AC;

• Aa – Change amplitude of AC;

• Insert – Insertion of a new AC;

• Delete – Delete AC.

Options:

• Undo – Restores previous Commands;

• Save Commands – Saves the changed set of Commands;

• Load Commands – Load a saved set of commands and plots the Commands, their respective components and F0 contour.


The play menus are also available through shortcut keys. The change, insertion and deletion operations are done with the mouse.

The initial set of commands is plotted in black. The changed commands and the respective F0 contour are plotted in red. Fig. 4.10 displays an example in which the amplitude of the phrase command at about 3.3 s is changed. The new value of the rmse between the original F0 curve and the new estimated F0 curve is displayed beside the rmse of the initial set of commands.

Fujisaki recommends the use of the logarithmic scale, which has the advantage of making the addition of the phrase and accent components visible during the manual labelling of commands. However, a linear scale was used here, so that the phrase and accent commands (scaled by a factor of 1/100) and the speech waveform (shifted by a constant offset) could be represented in the same graphic. The linear scale also gives better resolution at higher frequencies, which is especially useful during manual labelling, and, in the author’s experience, the manual labelling of commands remains very intuitive.

4.3.2 Parameters estimation process

The estimation of the Fujisaki parameters was done in three phases. In the first phase, the Mixdorff [2000] algorithm (kindly provided by H. Mixdorff) was used to automatically extract the Fujisaki parameters based on the F0 contour alone. Since the program is optimised for German and not for Portuguese, this first estimation gave only a rough approximation to the F0 contour of the Portuguese utterances, with an rmse of 9.5 Hz. The estimated PCs were an acceptable approximation considering their position in the utterance.

In the second phase, the set of commands was manually optimised using the tool described above. No linguistic constraints were taken into consideration. The optimisation started by adjusting the position and amplitude of the PCs so that the phrase component touches the valleys of the F0 contour. Next, ACs were changed and/or introduced to produce an estimated F0 closely fitting the original.

Fig. 4.11 displays the first and second phases of the parameter estimation. The first estimation uses few ACs, so the estimated contour crosses the original F0 contour without following its exact shape. In the second phase, no restrictions on the proximity and number of ACs were kept, allowing a better fit and precise tracking of the original F0 shape. For the whole paragraph, partially presented in Fig. 4.11, the rmse improves from 8.97 Hz to 4.39 Hz between the first and the second phases of the estimation process.

After an attentive analysis of the ACs, a strong connection between ACs and syllables becomes clear. An AC is considered connected with a syllable if its accent component influences the F0 of the syllable, taking into account the delay between T1 of the AC and the effective contour of the respective accent component.

In order to decide objectively whether the accent component influences the F0 of the syllable, the concept of zone of influence was introduced: the interval between the instants where the accent component is higher than X% of its maximum value. If the zone of influence of the AC intersects the voiced part of the syllable, the AC is considered a candidate to be connected to the syllable. Several values of X between 35% and 60% were considered.
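A minimal Python sketch of the zone of influence is given below. It assumes the standard Fujisaki accent response with β = 20 /s and γ = 0.9, a positive amplitude Aa, and a numerical search on a 1 ms grid; the function names, the grid step and the 1 s search tail are illustrative assumptions.

```python
import math

def accent_component(t, t1, t2, aa, beta=20.0, gamma=0.9):
    """Accent component of one AC, assuming the standard Fujisaki response."""
    def ga(x):
        return 0.0 if x < 0 else min(1.0 - (1.0 + beta * x) * math.exp(-beta * x), gamma)
    return aa * (ga(t - t1) - ga(t - t2))

def zone_of_influence(t1, t2, aa, x_percent=35.0, beta=20.0, step=0.001, tail=1.0):
    """Return (start, end) of the interval where the component exceeds x_percent of its maximum."""
    n = int(round((t2 + tail - t1) / step))
    times = [t1 + i * step for i in range(n + 1)]
    values = [accent_component(t, t1, t2, aa, beta) for t in times]
    limit = max(values) * x_percent / 100.0
    inside = [t for t, v in zip(times, values) if v > limit]
    return inside[0], inside[-1]

def is_candidate(zone, voiced_start, voiced_end):
    """True if the zone of influence intersects the voiced part of the syllable."""
    return zone[0] < voiced_end and voiced_start < zone[1]

zone = zone_of_influence(0.0, 0.15, 0.5)
```

Because of the delay in the accent response, the zone starts some time after T1 and extends beyond T2, which is why an AC can reach into the following syllable.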


[Figure: tool display over t (s), from 6.5 s to 8.5 s; vertical axis in Hz, with commands scaled as Aa/100 and Ap/100; words and phoneme segments shown below the contours; rmse1=8.97 Hz, rmse2=4.39 Hz.]

Fig. 4.11 – Example of the estimated parameters in first (black) and second (red) phases.

Although the connection between ACs and syllables holds in general, there are some exceptions, discussed in the following topics:

• Syllables without any connected AC – These syllables either have no voiced sound or have voiced sound(s). The first case is obvious. In the second, the next syllable sometimes needs a long F0 excursion, leading to a longer AC whose onset time (T1) must be early and therefore falls in the current syllable. This AC must be associated with the next syllable and not with the current one. This case is very frequent in syllables with or without vowels. No rules have been found for these cases yet, but they should be considered in the model;

• One AC with a zone of influence spanning more than one syllable – This AC must be considered as a sequence of contiguous ACs with identical amplitudes, where each new AC is associated with the respective syllable. This does not alter the accent component, because the accent component of one AC is the same as the sum of the accent components of two ACs with the same amplitude and the same total duration, if T2 of the first AC coincides with T1 of the second AC (i.e. the system is linear);

• Several ACs with zones of influence in the same syllable – Usually two, and very rarely three, ACs appear in this case. These ACs are named candidates to be connected to the syllable. Candidates that could be connected to a neighbouring syllable that still has no connected AC must be connected to that syllable. These cases are solved by considering that the AC connected to a given syllable may influence the F0 of neighbouring syllables as well. There still remain unsolved cases where more than one AC is connected to just one syllable. These cases were analysed by inspecting the parameterisation of the ACs, to see whether two or more ACs were really needed to produce the F0 contour for these syllables. Since no significant loss in the accuracy of the F0 contour fit was observed with only one AC, a third phase of parameterisation was performed.
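The equivalence stated in the second topic can be checked directly. The derivation below is a sketch that assumes the accent component of one AC has the usual Fujisaki form, the amplitude times the difference of two shifted copies of a response function G_a (Eq. (4.3) is not reproduced in this passage); T_m is a hypothetical split instant between T_1 and T_2.

```latex
\begin{align*}
A_a\left[G_a(t-T_1)-G_a(t-T_m)\right] + A_a\left[G_a(t-T_m)-G_a(t-T_2)\right]
  &= A_a\left[G_a(t-T_1)-G_a(t-T_2)\right]
\end{align*}
```

The two G_a(t - T_m) terms cancel, so the sum of the two contiguous ACs reproduces the original accent component at every instant t, wherever T_m falls between T_1 and T_2.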

The third phase of parameterisation was just the correction of the cases where more than one AC was connected with the same syllable. This correction did not cause significant loss in the parameterised database: the global rmse changes from 3.94 Hz to 3.98 Hz and the correlation coefficient changes from 0.974 to 0.973. One example of this correction is presented in Fig. 4.12, where AC number 21 (in black, at 5.3 s) was deleted. The figure also shows the ACs (numbers in black) associated with syllables (numbers in blue), and the accent component corresponding to each AC.

It should be noted that the zone of influence of AC number 20 spans syllables 26 and 27 but, since syllable 26 already has one associated AC (number 19), AC number 20 becomes associated only with syllable 27. Also, the zone of influence of AC number 21 spans syllables 27 and 28, but it does not coincide with the voiced part of syllable 28. In this case the AC is, again, associated just with syllable 27. AC number 18 is associated with syllables 24 and 25, so this AC will be divided, exactly at the end of the voiced part of syllable 24, into two ACs, each associated with its respective syllable.

Fig. 4.12 – Example of the AC parameters correction done in the third phase of parameters estimation.

An algorithm was implemented to connect ACs with syllables and to identify syllables with more than one related AC. The flow chart of the sequence for each syllable is presented in Fig. 4.13. In the flow chart, the zone of influence of an AC lies between the time instants where its accent component keeps values greater than 35% of its maximum. The ACs whose zone of influence overlaps the voiced part of the current syllable and that are not yet related to previous syllables are candidates to be related to the current syllable.

4.3.3 Evaluation of the estimated F0 contour in the Database

The F0 generated by the present model using the estimated parameters was compared with the post-processed original F0 contour (non-zero values) in order to determine the rmse and the correlation coefficient between estimated and original F0. The rmse is given by Eq. 3.11, calculated per sample point on the F0 contour. The correlation coefficient is given by Eq. 3.15. Results are presented in Table 4.2.
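The comparison can be sketched as follows. Eq. 3.11 and Eq. 3.15 are not reproduced in this chapter, so the sketch assumes the usual definitions of the root mean squared error and of the Pearson correlation coefficient, computed only at sample points where the original F0 is non-zero; the function name is illustrative.

```python
import math

def rmse_and_correlation(original_f0, estimated_f0):
    """rmse (Hz) and correlation coefficient over samples with non-zero original F0."""
    pairs = [(o, e) for o, e in zip(original_f0, estimated_f0) if o > 0]
    n = len(pairs)
    rmse = math.sqrt(sum((e - o) ** 2 for o, e in pairs) / n)
    mean_o = sum(o for o, _ in pairs) / n
    mean_e = sum(e for _, e in pairs) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in pairs)
    var_o = sum((o - mean_o) ** 2 for o, _ in pairs)
    var_e = sum((e - mean_e) ** 2 for _, e in pairs)
    return rmse, cov / math.sqrt(var_o * var_e)

# Hypothetical contours: the second sample is unvoiced in the original and is ignored.
rmse, r = rmse_and_correlation([100, 0, 110, 120], [102, 90, 112, 122])
```

In this hypothetical case the estimated contour is the original shifted by 2 Hz, so the rmse is 2 Hz while the correlation coefficient is 1, which shows why both measures are reported.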

Table 4.2: Root mean squared error and correlation coefficient between estimated F0 and post-processed original F0 (non-zero values).

rmse (Hz)   r
3.98        0.973

No audibly perceptible difference seems to exist between the original speech and the speech re-synthesised with the estimated F0 contours. Nevertheless, some perceptual tests were made to confirm this finding, and the results are presented in chapter 5.


Fig. 4.13 – Flow chart of the algorithm to connect ACs to syllables.

[Flow chart. Processing boxes: Identify syllable boundaries; Identify type of syllable (1 - tonic; 2 - with vowel; 3 - without vowel; 4 - voiceless); Identify voiced limits of syllable; Identify AC candidates and respective zone of influence; Connect the AC candidates to the syllable; Second AC disconnects from previous syllable and connects to current syllable; Splits previous AC into two ACs at the end of the voiced part of the previous syllable and connects the two new ACs to previous and current syllables, respectively; Go to next syllable. Decision boxes (Yes/No): Is there any AC candidate?; Influence zone of previous AC intercepts current syllable?; Previous syllable has two ACs connected?; Previous syllable has voiced sounds?; Accent component crosses the end of voiced part of previous syllable?]


4.4 Application of the Model

To better understand the application of the model, this section presents the organisation of the considered structures, followed by a set of questions that remain to be solved after the second phase of the parameter estimation process. These questions help to follow the flow of the studies presented in the next two sections, where answers are given.

Fig. 4.14 presents the organisation of linguistic and prosodic structures in a paragraph, from lower to higher level: segment, syllable, word, accent group, prosodic phrase, phrase, sentence and paragraph. The segment is one of the 44 different phoneme segments that were considered (Table 2.6). The syllable is defined in chapter 2. Words are delimited by spaces. The accent group is intended to be a prosodic structure and is defined in the previous chapter. The prosodic phrase is also a prosodic structure and is delimited by PCs. Phrases are considered to be any part of the text between two orthographic marks (, ! ( ) - ; : … “ .) (including the beginning of the paragraph). Sentences are delimited by any of the following marks (. ? ! …). Paragraphs are delimited by a carriage return in the text. The orthographic marks presented at the top of the figure are boundaries for phrases, sentences and paragraphs.

Fig. 4.14 – Organization structures. On the top, the orthographic marks.

After the second phase of command estimation several questions arise. These questions result from the statistical analysis of the estimated commands and concern how to use them to build the model.

• Which orthographic marks, that are structure boundaries, generate PCs?

• What generates PCs, or prosodic phrases, inside linguistic phrases?

• How should connections between ACs and syllables be dealt with? Assuming that there is a generic connection between ACs and syllables, how should the situations where those connections are not direct be handled? More specifically, the following cases:


• Syllables without any AC associated;

• ACs linked to more than one syllable;

• More than one AC connected to the same syllable.

The third question, about ACs, led to the third phase of the estimation process, as described in section 4.3.2. The first two questions, related to PCs, will be studied and answered in the next two sections.


4.5 Phrase Commands

Predicting PCs in TTS involves two issues. The first is to determine the insertion position in the text, and the second is to predict the magnitude (Ap) and the anticipation distance (T0a) relative to the time point in speech associated with the text position [Teixeira et al., 2003].

Assuming that accent groups behave as prosodic words, the only eligible positions to insert PCs are the beginnings of these groups. The onset time of a PC, T0, is usually anticipated relative to the beginning of the accent group. This anticipation, denoted T0a, is subtracted from the time instant of the beginning of the accent group (eligible position), T0E, to produce T0, as depicted in Fig. 4.15 and represented by Eq. (4.4).

$T_0 = T_{0E} - T_{0a}$    Eq. (4.4)

[Figure: F0 contour and phrase commands over t (s), from 0 to 10 s; vertical axis in Hz, with commands scaled as Ap/100; words shown below the contour; labels: Estimated PCs, T0a, Eligible positions.]

Fig. 4.15 – Representation of Eligible positions, T0E, and anticipation, T0a, of PCs.

The following sections will deal with eligible positions in the text for inserting PCs, as well as the prediction of magnitude, Ap, and anticipation, T0a.


4.5.1 PC positions in text

From the analysis of the location of PCs it is quite obvious that some orthographic punctuation marks impose the presence of a PC.

Besides the PCs imposed by orthographic marks, about 70% of the total, there are other PCs, about 30% of the total, not linked with punctuation.

The algorithm described below was designed to govern the location of inserted PCs. In the first step, PCs linked with orthographic punctuation marks are inserted and, subsequently, several candidate positions to insert other PCs are considered. For each candidate position a score is determined by a mathematical model, as described in the next section.

4.5.1.1 PCs linked with orthographic marks

Table 4.3 presents the percentage of occurrences of orthographic punctuation marks that originate PCs, according to the parameters estimated from the database. Punctuation marks at the end of a paragraph are excluded from this table, because no PCs are inserted at the end of a paragraph. Although the punctuation marks “!”, “…”, “-”, “;” and “:” occur too rarely to be statistically relevant, the table suggests associating one PC with each orthographic punctuation mark. In the case of the comma “,” the percentage is lower, basically due to the proximity of some commas to other punctuation marks.

Table 4.3: Numbers of occurrences of orthographic punctuation marks, associated PCs and percentages of punctuation marks with PCs associated.

Orthographic punctuation   # of occurrences   # of PC   %
.                          67                 64        96
,                          379                261       69
?                          12                 12        100
!                          4                  3         75
…                          1                  1         100
-                          7                  6         86
;                          2                  2         100
:                          6                  5         83

4.5.1.2 PCs not linked with orthographic marks

This section discusses only the PCs not linked with orthographic marks. The objective is to find anchors with which to associate them. Firstly, every beginning of a paragraph should receive one PC, in the obvious absence of any punctuation. Next, the possible existence of additional PCs is analysed. Text and speech analysis of several of these PCs suggests that different factors contribute to their locations. Factors such as distance to the previous PC, distance to the next PC, presence of a pause, length of the previous word and type of the next word were statistically analysed and correlated with the presence of this type of PC.

For each candidate position, one score, S, will be calculated by Eq. (4.5) that combines the weights of each factor.

$S = W_{pPC} \times W_{nPC} \times (W_p + W_{lpw} + W_{tw})$    Eq. (4.5)


where S is the score for the candidate position, and WpPC, WnPC, Wp, Wlpw and Wtw are the weights for previous PC distance, next PC distance, pause, length of previous word and type of next word, respectively.

[Figure: two histograms with Gaussian approximations, titled "Previous PCs" and "Next PCs"; horizontal axes t (s).]

Fig. 4.16 – Histogram and Gaussian approximation of distances from PCs not linked with orthographic marks to previous PCs and next PCs.

The distances to the previous and next PC have different histograms, but both can be approximated by normal distributions, as can be seen in Fig. 4.16. Table 4.4 presents the relevant statistical data. The weights for the distances to the previous and next PC are given by the normal probability density function, with the respective means and standard deviations presented in Table 4.4.

Table 4.4: Statistical data of distance to previous and next PC.

Statistical data      Distance to previous PC (s)   Distance to next PC (s)
Minimum               0.55                          0.75
Maximum               3.26                          4.77
Mean                  1.70                          1.94
Standard deviation    0.53                          0.65

The distance to the next PC is calculated as the end time of a so-called eligible area (see Fig. 4.19) plus 0.75 s minus the candidate position. This procedure limits the value used for the distance to the next PC to 4 s. However, the next PC can occasionally be more than 4 s away, as depicted in Fig. 4.16.

The weight for the presence of a pause, Wp, is 1 if a pause is present and 0 otherwise.

For the weight relative to the length of the previous word, Wlpw, the length of the word plus the length of the following pause, if any, is considered. This factor shows a higher correlation with the presence of a PC for values above 0.5 s. The weight used for this factor is logarithmic and is given by the empirical Eq. (4.6).


$W_{lpw} = \log(5 \times (length + 0.2))$    Eq. (4.6)

[Figure: Wlpw versus length of previous word + pause (s).]

Fig. 4.17 – Weight for length of previous word.

The weight Wtw, for the type of the next word, was determined according to the correlation of some words with this type of PC and is given in Table 4.5. The words listed in this table, the most correlated ones, have weights between 0.7 and 1. For words not in the table, Wtw is 0.7, 0.5 or 0.2 for words with one, two, or more syllables, respectively. These values are empirical and based on several observations.

Table 4.5: Weights for type of word.

word     Wtw    word                 Wtw
é        1      o                    0.8
só       1      de                   0.8
quando   1      da                   0.8
em       0.9    para                 0.7
longe    0.9    uma                  0.7
mas      0.9    One syllable words   0.7
que      0.8    Two syllable words   0.5
a        0.8    Other words          0.2
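The weights defined in this section can be combined as in the Python sketch below. Several points are assumptions: Eq. (4.5) is read as the product of the two distance weights with the sum of the other three weights, the logarithm of Eq. (4.6) is taken as the natural logarithm (which is consistent with the range of Fig. 4.17), and the distance weights are taken as unscaled normal densities, so the absolute values produced here need not match the scores quoted in the text. Function and variable names are illustrative.

```python
import math

PREV_MEAN, PREV_STD = 1.70, 0.53   # Table 4.4, distance to previous PC (s)
NEXT_MEAN, NEXT_STD = 1.94, 0.65   # Table 4.4, distance to next PC (s)

# Table 4.5, words most correlated with PCs not linked to punctuation.
WORD_WEIGHTS = {"é": 1.0, "só": 1.0, "quando": 1.0, "em": 0.9, "longe": 0.9, "mas": 0.9,
                "que": 0.8, "a": 0.8, "o": 0.8, "de": 0.8, "da": 0.8, "para": 0.7, "uma": 0.7}

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def weight_word_length(length_s):
    """Eq. (4.6); the natural logarithm is an assumption."""
    return math.log(5.0 * (length_s + 0.2))

def weight_word_type(word, n_syllables):
    """Table 4.5 lookup, with the syllable-count fallback for unlisted words."""
    if word in WORD_WEIGHTS:
        return WORD_WEIGHTS[word]
    return {1: 0.7, 2: 0.5}.get(n_syllables, 0.2)

def score(dist_prev, dist_next, has_pause, prev_word_length, next_word, next_word_syllables):
    """Score S of a candidate position under the assumed reading of Eq. (4.5)."""
    w_ppc = normal_pdf(dist_prev, PREV_MEAN, PREV_STD)
    w_npc = normal_pdf(dist_next, NEXT_MEAN, NEXT_STD)
    w_p = 1.0 if has_pause else 0.0
    return w_ppc * w_npc * (w_p + weight_word_length(prev_word_length)
                            + weight_word_type(next_word, next_word_syllables))
```

Because the two distance weights multiply the sum, a candidate far from the typical distances scores low even when a pause is present, while the pause, word length and word type only rank candidates that are plausibly placed.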

4.5.1.3 Algorithm to insert PCs

A flow chart of the developed algorithm is presented in Fig. 4.18.


The method starts by inserting PCs at the beginning of the paragraph and just after the punctuation marks of Table 4.3. It then removes each PC whose distance to the previous PC is less than 1 s, unless the previous sentence is interrogative.
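The removal step can be sketched as below. The sketch assumes that each initial PC is represented by its time and by a flag telling whether it follows an interrogative sentence, and that distances are measured to the previous PC that was kept; both the representation and the function name are assumptions.

```python
def remove_close_pcs(pcs, min_distance=1.0):
    """pcs: list of (time_s, follows_question) tuples sorted by time."""
    kept = []
    for time_s, follows_question in pcs:
        too_close = bool(kept) and time_s - kept[-1][0] < min_distance
        if too_close and not follows_question:
            continue  # drop a PC that is closer than min_distance to the previous one
        kept.append((time_s, follows_question))
    return kept

# Hypothetical initial PCs: the one at 0.7 s is dropped, the one after a question is kept.
initial = [(0.0, False), (0.7, False), (2.0, False), (2.5, True)]
filtered = remove_close_pcs(initial)
```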

Then, for the intervals between PCs that are longer than 3 s, candidate positions inside the eligible area to insert a new PC are identified.

The candidate positions are the eligible positions inside the eligible area. The eligible area, as depicted in Fig. 4.19, starts 0.6 s after the previous PC and ends at the minimum of the next PC minus 0.75 s and the previous PC plus 3.25 s. These limits for the eligible area ensure the minimum distances to the previous and next PC, according to Table 4.4.

Then the score S is calculated for each candidate according to Eq. (4.5), and only the candidate with the maximum score is considered. If its score is greater than 1, one PC is inserted at its position. The process is repeated with the new set of PCs until the end of the paragraph.
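The insertion loop described in the preceding paragraphs can be sketched as follows. The scoring function is passed in as a parameter, so the sketch does not depend on any particular form of Eq. (4.5); the function names, the left-to-right traversal of the intervals, and the example positions and scores are hypothetical.

```python
def eligible_area(prev_pc, next_pc):
    """Start and end (s) of the eligible area between two PCs."""
    start = prev_pc + 0.6
    end = min(next_pc - 0.75, prev_pc + 3.25)
    return start, end

def insert_pcs(pcs, eligible_positions, score_fn, min_gap=3.0, threshold=1.0):
    """Insert at most one new PC per long interval, then re-examine the interval that follows it."""
    pcs = sorted(pcs)
    i = 0
    while i < len(pcs) - 1:
        prev_pc, next_pc = pcs[i], pcs[i + 1]
        if next_pc - prev_pc > min_gap:
            start, end = eligible_area(prev_pc, next_pc)
            candidates = [p for p in eligible_positions if start <= p <= end]
            if candidates:
                best = max(candidates, key=lambda p: score_fn(p, prev_pc, next_pc))
                if score_fn(best, prev_pc, next_pc) > threshold:
                    pcs.insert(i + 1, best)
        i += 1
    return pcs

# Hypothetical data loosely following the example of Fig. 4.20.
accent_group_starts = [1.5, 2.2, 2.9, 3.5, 4.4, 5.1, 5.8]
example_scores = {2.2: 2.7}
result = insert_pcs([0.6, 7.0], accent_group_starts,
                    lambda p, prev_pc, next_pc: example_scores.get(p, 0.5))
```

With the hypothetical scores above, the interval between 0.6 s and 7 s receives a new PC at 2.2 s, and the remaining interval from 2.2 s to 7 s is examined again but no candidate exceeds the threshold.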

Fig. 4.20 presents an example of the application of the algorithm. Two PCs, at about 0.6 s and 7 s, were initially inserted: one at the beginning of the paragraph and one at the eligible position near the orthographic mark. Because the distance between them is greater than 3 s, the eligible area was defined (orange box), and the four eligible positions inside it were taken as candidate positions. For each candidate the respective score was determined. The second candidate position (at about 2.2 s) has the greatest score, 2.7. Since this score is greater than 1, a new PC was inserted at that position.

Fig. 4.18 – Flow chart to insert PC in text.

[Flow chart boxes, in sequence: Insert PCs after any orthographic mark; Remove PC whose distance to previous is less than 1 s; For intervals between PCs longer than 3 s, identify candidate positions inside eligible area; Determine the Score, S, for all candidate positions; Select the candidate with the maximum Score; If S > 1 then insert a new PC at the candidate position.]


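[Figure: time axis t (s) showing the previous and next PC more than 3 s apart, the eligible area with the limits 0.6 s, 0.75 s and 3.25 s, the beginnings of accent groups (eligible positions) and the candidate positions.]

Fig. 4.19 – Eligible area and candidate positions.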

Fig. 4.20 – Application example of the algorithm.

4.5.2 Evaluation of preliminary inserted PC

A comparison between the positions and distances of the estimated (labelled) PCs and the ones inserted by the algorithm is given in Tables 4.6, 4.7 and 4.8. In this comparison, the position of a labelled PC is its onset time T0, whereas the position of an inserted PC is the eligible position T0E, which will later be affected by T0a. Although a final evaluation will show the final quality of PC modelling, after the preliminary insertion of PCs it is interesting to measure the appropriateness of the methodology by evaluating the closeness of the inserted PCs to the labelled ones.

Table 4.6 shows the total numbers of inserted and manually labelled (estimated) PCs, which are very close, as well as the averages of the respective distances. The standard deviations are somewhat different, as expected, because of the statistical nature of the insertion process, which generally reduces the variance of the model relative to the original. The histograms of distances between adjacent labelled and inserted PCs, presented in Fig. 4.21, have a similar basic shape.

Table 4.7 presents the number of correctly inserted PCs (C), determined as the number of inserted PCs located less than a tolerance X (for three values of X) from the nearest labelled PC; the number of insertion errors (I), as the number of inserted PCs whose distance to the nearest labelled PC is longer than X seconds; and the number of deleted PCs (D), as the number of labelled PCs without an inserted PC at distance X or less¹. The range X is a tolerance for T0a, which will affect the exact position of the inserted PC. The maximum anticipation T0a was experimentally observed to be almost 1 s. The recall rate (R) and the precision rate (P), as adopted by Hirose et al. [2003], are also presented and are determined by the expressions given in Table 4.7.
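The counting of C, I and D and the rates R and P can be illustrated with a short Python sketch. It assumes that both lists of positions are non-empty and that a distance exactly equal to X counts as correct; the function name and the toy positions used to exercise it are hypothetical.

```python
from typing import List, Tuple

def evaluate_insertion(inserted: List[float], labelled: List[float],
                       x: float) -> Tuple[int, int, int, float, float]:
    """Return (C, I, D, R, P) for a tolerance x in seconds."""
    def nearest(t: float, others: List[float]) -> float:
        # Distance from t to the closest position in the other list.
        return min(abs(t - o) for o in others)

    # C: inserted PCs with a labelled PC within the tolerance.
    c = sum(1 for t in inserted if nearest(t, labelled) <= x)
    # I: inserted PCs farther than x from every labelled PC.
    i = len(inserted) - c
    # D: labelled PCs with no inserted PC within the tolerance.
    d = sum(1 for t in labelled if nearest(t, inserted) > x)
    recall = c / (c + d)
    precision = c / (c + i)
    return c, i, d, recall, precision
```

With the toy positions `inserted = [0.5, 2.0, 2.3, 6.0]` and `labelled = [0.6, 2.1, 4.0]` at X = 0.6 s, two inserted PCs fall within the range of the same labelled PC, so C + D exceeds the number of labelled PCs while C + I equals the number of inserted PCs.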

Table 4.8 presents the recall rate and precision rate of the inserted PCs when only the PCs at the next eligible position just after the estimated PCs are considered correctly inserted. This measure is more demanding, since just the exact positions of the labelled PCs are considered as correct positions to insert PCs.

Table 4.6: Comparison between estimated and inserted PCs. The number of PCs, the minimum, maximum and average distances and standard deviations in seconds.

PCs        Number   Min. dist. (s)   Max. dist. (s)   Average (s)   Std. dev. (s)
Labelled   646      0.55             4.78             1.86          0.66
Inserted   643      0.50             2.99             1.83          0.48

Table 4.7: Numbers of correctly inserted PCs (C), insertion errors (I), deleted PCs (D), the recall rate (R) and precision rate (P), at a tolerance time distance X, from the labelled PCs.

              X=0.6 s   X=0.8 s   X=1 s
C             494       570       604
I             149       73        39
D             158       91        71
R=C/(C+D)     75.8%     86.2%     89.5%
P=C/(C+I)     76.8%     88.7%     93.9%

¹ It must be noted that, in the case of Table 4.8, C+I and C+D are exactly the numbers of inserted and labelled PCs, respectively. In the case of Table 4.7, C+I is also equal to the number of inserted PCs, but C+D is greater than the number of labelled PCs, because more than one inserted PC can lie within the range X of the same labelled PC, counting as two correctly inserted PCs for just one labelled PC.


Table 4.8: Correctly inserted (C), deletion errors (D), insertion errors (I), recall rate (R) and precision rate (P), for the positions of inserted PC compared to the positions of estimated PC considering the eligible position.

C D I R P

435 211 208 67.3% 67.7%

Hirose et al. [2003] reported a recall rate and a precision rate of 82% and 85%, respectively, for a process of automatic extraction of PCs from F0 contours (estimation process) using linguistic information. In that case the correct and incorrect positions are clearly known.

Taking into consideration the values reported by Hirose et al. [2003] for an automatic process of PC estimation using linguistic information, and bearing in mind the differences relative to the present process of predicting PCs from text, the recall and precision rates achieved are acceptable: they are in the same range in the case of Table 4.7 and relatively close in the case of Table 4.8.

Fig. 4.21 – Comparison of histograms of estimated and inserted PC distances. [Two histograms of the distances between adjacent PCs, one for the estimated PCs and one for the inserted PCs; horizontal axis t (s) from 0 to 5, vertical axis counts from 0 to 120.]

Visual inspection indicates that the inserted PCs are generally in a coherent position as can be observed in the example given in Fig. 4.22. The final exact position, T0, of the inserted PCs, will be affected by the anticipation T0a.


Fig. 4.22 – Comparison of estimated and inserted PC positions. Black arrows are the estimated PCs; magenta arrows are the inserted PCs. [Plot over t (s) from 0 to 12; vertical axis in Hz, also used for Ap/100, from -100 to 250. Text labels along the time axis: "apareceu-lhe agora uma situacao que nao esperava ele com muita honra antigo emigrante quase que diria la muito por dentro ainda emigrante ve-se agora patrao de emigrantes situacao nova que esta em melhores condicoes do que ninguem para compreender".]

4.5.3 Prediction of Ap and T0a parameters

The magnitudes, Ap, and anticipations, T0a, of the PCs are predicted in a second step by means of artificial neural networks (ANNs). Because of the low correlation (0.081) between Ap and T0a, one neural network was developed for each parameter. The performance for both parameters is better with the two ANNs than with one ANN with two outputs.

4.5.3.1 Architecture of ANNs

Several architectures, concerning type of network, structure, number of layers, number of nodes in each layer and activating functions, were considered, and the most appropriate were tested for both ANNs. For each ANN several thousand training sessions were run, and the best session was selected as the performance of that architecture. Feed-forward networks trained with back-propagation algorithms were selected as the type of network to solve the problem.

The input layer of each network has the nodes necessary to code the features discussed below. The output node codes the predicted parameter, Ap or T0a. The output is 85% of the parameter value divided by the maximum parameter value, normalised to have null average and standard deviation equal to 1.
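The coding of the output can be illustrated with the following Python sketch. It assumes that the maximum, average and standard deviation are taken from the training targets and that the population standard deviation is used; the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Stats = Tuple[float, float, float]  # (maximum, average, standard deviation)

def encode_targets(values: List[float]) -> Tuple[List[float], Stats]:
    """Scale targets to 85% of their maximum, then standardise them."""
    max_value = max(values)
    scaled = [0.85 * v / max_value for v in values]
    mu, sigma = mean(scaled), pstdev(scaled)
    encoded = [(s - mu) / sigma for s in scaled]
    return encoded, (max_value, mu, sigma)

def decode_outputs(outputs: List[float], stats: Stats) -> List[float]:
    """Invert encode_targets to recover values in the original units."""
    max_value, mu, sigma = stats
    return [(o * sigma + mu) * max_value / 0.85 for o in outputs]
```

The stored statistics allow the network output to be converted back to an Ap or T0a value at prediction time.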

Table 4.9 and Table 4.10 present the architectures, activating functions and training algorithms of the networks with the best correlation coefficients in the prediction of Ap and T0a, respectively. The column “Number of features” refers to the features presented later in 4.5.3.3, and corresponds to the number of nodes of the input layer. The additional feature between the situations with 20 and 21 features is the magnitude of the previous PC.

Table 4.9: Best performance (correlation coefficient), architectures and training algorithms to predict Ap.

Nodes in layers   Activating functions   Training algorithm   Number of features   Value of r in test set
20-2-2-1          Log-Log-Lin            Lev.-Marq.           20                   0.772
21-4-2-1          Tan-Log-Lin            Lev.-Marq.           21                   0.770
21-2-2-1          Tan-Log-Lin            Lev.-Marq.           21                   0.767
21-2-2-1          Log-Log-Lin            Lev.-Marq.           21                   0.764
20-4-2-1          Tan-Log-Lin            Lev.-Marq.           20                   0.763
20-10-1           Tan-Lin                Lev.-Marq.           20                   0.761

Table 4.10: Best performance (correlation coefficient), architectures and training algorithms to predict T0a.

Nodes in layers   Activating functions   Training algorithm   Number of features   Value of r in test set
21-4-2-1          Tan-Log-Lin            Lev.-Marq.           21                   0.649
21-6-1            Log-Lin                Lev.-Marq.           21                   0.634
21-4-2-1          Log-Log-Lin            Lev.-Marq.           21                   0.634
20-8-4-1          Log-Log-Lin            Lev.-Marq.           20                   0.627

ANNs with one or two hidden layers were used, with the number of nodes per hidden layer varying between 2 and 10. The output node always has the linear activating function (Lin); the last hidden layer has, in the best cases, the hyperbolic logarithmic activating function (Log); and the first hidden layer, when present, has the hyperbolic logarithmic or the hyperbolic tangent (Tan) activating function. The Levenberg-Marquardt back-propagation training algorithm [Hagan and Menhaj, 1994] always gives the best results, due to the relatively low number of nodes of the input layer (20 or 21).

The architecture selected to predict Ap is of the feed-forward type with two hidden layers of two nodes each, both with the hyperbolic logarithmic activating function.

The architecture of the T0a ANN is also of the feed-forward type with two hidden layers, but with four nodes and the hyperbolic tangent activating function in the first hidden layer, and two nodes activated by the hyperbolic logarithmic function in the second hidden layer.
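The two selected architectures (20-2-2-1 for Ap and 21-4-2-1 for T0a) can be sketched as plain feed-forward passes in Python. The sketch assumes that "Log" denotes the logistic sigmoid and "Tan" the hyperbolic tangent; the weights are random and untrained, and all names are illustrative only.

```python
import math
import random
from typing import Callable, List, Tuple

Layer = Tuple[List[List[float]], List[float], Callable[[float], float]]

def logsig(x: float) -> float:
    """Logistic sigmoid, assumed here to be the 'Log' function."""
    return 1.0 / (1.0 + math.exp(-x))

def linear(x: float) -> float:
    """Identity function of the output node ('Lin')."""
    return x

def build_network(sizes: List[int],
                  activations: List[Callable[[float], float]],
                  seed: int = 0) -> List[Layer]:
    """Create the layers with small random (untrained) weights."""
    rng = random.Random(seed)
    layers: List[Layer] = []
    for n_in, n_out, act in zip(sizes[:-1], sizes[1:], activations):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
        layers.append((weights, biases, act))
    return layers

def forward(layers: List[Layer], x: List[float]) -> List[float]:
    """Propagate one input pattern through the network."""
    for weights, biases, act in layers:
        x = [act(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x

def count_parameters(layers: List[Layer]) -> int:
    """Number of weights and biases of the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b, _ in layers)

ap_net = build_network([20, 2, 2, 1], [logsig, logsig, linear])
t0a_net = build_network([21, 4, 2, 1], [math.tanh, logsig, linear])
```

Under these assumptions the Ap network has 51 adjustable parameters and the T0a network 101, which is small compared with the 553 training patterns.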

4.5.3.2 Training the ANNs

The 101 paragraphs of the database were divided into the training set, with 91 paragraphs, and the test set, with the remaining 10 representative paragraphs picked from the original seven tracks. The training set has 553 patterns (85%) and the test set has 93 patterns (15%).


Training was done over the training set, using the test set for cross-validation in order to avoid over-training. The test vector was used to stop training early if further training on the training set would hurt generalisation to the test set. The cost function used for training was the mean squared error between output and target values.
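The early-stopping criterion can be illustrated with a minimal Python sketch. It deliberately replaces the ANN and its training algorithm by a one-input linear model trained with plain gradient descent, so that only the stopping logic is shown; the `patience` parameter and all names are assumptions of this sketch.

```python
from typing import List, Tuple

Data = Tuple[List[float], List[float]]  # (inputs, targets)

def mse(w: float, b: float, xs: List[float], ys: List[float]) -> float:
    """Mean squared error of the linear model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train_with_early_stopping(train: Data, valid: Data,
                              lr: float = 0.05,
                              max_epochs: int = 1000,
                              patience: int = 10
                              ) -> Tuple[float, float, float]:
    """Return (validation error, w, b) of the best epoch found."""
    (xs, ys), (xv, yv) = train, valid
    w = b = 0.0
    best = (mse(w, b, xv, yv), w, b)
    waited = 0
    n = len(xs)
    for _ in range(max_epochs):
        # Gradient of the training MSE with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w, b = w - lr * grad_w, b - lr * grad_b
        err = mse(w, b, xv, yv)
        if err < best[0]:
            best, waited = (err, w, b), 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation error stopped improving
    return best
```

The parameters returned are those of the epoch with the lowest error on the held-out set, not those of the last epoch.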

Training algorithms described in section 3.3.3 were used. The algorithms trainoss – ‘One Step Secant Algorithm’ and trainrp – ‘Resilient back-propagation’ give results with lower performance than trainlm – ‘Levenberg-Marquardt’. The latter is clearly the best algorithm for the dimension of the network, although its training process is slower.

For each variation of the ANN, concerning architecture, training algorithm, activating functions, and set of features and its codification, several thousand training sessions were run, and the best performance was selected as the performance of that variation.

Fig. 4.23 displays the average performance of several architectures of the Ap and T0a ANNs for different extensions of the training set. For the Ap ANNs the performance stabilises from 75% of the training set, and more patterns do not improve it. For the T0a ANNs, however, the performance is still increasing at 100% of the training set, which suggests that more training patterns could improve the performance for this parameter.

Fig. 4.23 – Evolution of ANNs performances in test set, over the used extension of the training set. [Line plot titled "Average performance": r, from 0.500 to 0.800, versus % of training set (25, 50, 75, 85, 95, 100), with one curve for Ap and one for T0a.]

4.5.3.3 Set of features for Ap and T0a

Several features and different codifications were considered in this study. The final set of features and its codification are presented in this section, together with a brief discussion of the excluded features.

Table 4.11 presents the list of features and their individual correlations with Ap and T0a. This set was built by selecting the features with higher correlation with the output parameters, taking into account the correlation between them, eliminating linearly correlated features, and, by a process of rejection, also eliminating the features that deteriorate the final performance.

Although some features presented in Table 4.11 do not have an individually significant correlation with T0a, which would suggest their exclusion from the ANN, their joint presence in fact improves the prediction performance.
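The first part of this selection can be sketched in Python as a screening by correlation. The threshold `max_mutual` and the function names are hypothetical, and the final rejection step, based on the performance of the trained ANN, is not covered by the sketch.

```python
from math import sqrt
from typing import Dict, List

def pearson_r(x: List[float], y: List[float]) -> float:
    """Linear correlation coefficient (inputs must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def screen_features(features: Dict[str, List[float]],
                    target: List[float],
                    max_mutual: float = 0.99) -> List[str]:
    """Rank features by |r| with the target and skip any feature that
    is almost linearly dependent on a feature already kept."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson_r(features[name], target)),
                    reverse=True)
    kept: List[str] = []
    for name in ranked:
        if all(abs(pearson_r(features[name], features[k])) < max_mutual
               for k in kept):
            kept.append(name)
    return kept
```

A high threshold keeps features that are strongly but not perfectly correlated, which is compatible with the observation that highly mutually correlated features can still improve performance when used together.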

Table 4.11: Set of features and their correlations r with Ap and T0a.

F #   r(F,Ap)   r(F,T0a)   Description of feature F
1     -0.470     0.010     Orthographic mark
2      0.075     0.330     Interrogative sentence (yes/no)
3     -0.380     0.025     Index # of PC in sentence, from beginning
4      0.177     0.041     Index # of PC in sentence, from end
5     -0.127     0.051     Length of sentence (in s)
6     -0.448     0.026     Index # of PC in paragraph, from beginning
7      0.239     0.027     Index # of PC in paragraph, from end
8     -0.185     0.020     Index # of sentence in paragraph, from beginning
9      0.117     0.008     Index # of sentence in paragraph, from end
10     0.569     0.067     Length of preceding pause (in s)
11     0.223    -0.017     PC in beginning position of phrase (yes/no)
12     0.460     0.025     PC in beginning position of sentence (yes/no)
13     0.572     0.030     PC in beginning position of paragraph (yes/no)
14     0.052     0.074     Tonic syllable in the beginning of the accent group (yes/no)
15     0.534     0.213     Distance to the preceding PC (in s)
16     0.525     0.126     Distance in syllables to the preceding PC
17     0.279     0.032     Orthographic mark of the preceding PC
18     0.221    -0.323     Distance to the next PC (in s)
19     0.241    -0.285     Distance in syllables to the next PC
20     0.140     0.003     Orthographic mark of the next PC
21    -0.212    -0.037     Magnitude of previous PC

Some features are highly mutually correlated, as is the case of features 3 and 6; 4 and 7; 11, 12 and 13; 15 and 16; and, finally, 18 and 19. Nevertheless, they do not carry exactly the same information, and their combined use improves the performance. An explanation of the features, as measured in Table 4.11, follows:


1. the correlation coefficient values between most of the orthographic marks and Ap are similar and important, whereas those with T0a are not relevant. Therefore just the comma and the full stop were classified separately. This feature was coded in four levels according to the correlation of each mark with Ap: other mark=0, full stop=1/3, comma=2/3, no mark=3/3. The correlation and the codification mean that PCs generated by another mark or by a full stop have higher Ap than PCs generated by commas or not associated with orthographic marks;

2. only the interrogative type of sentence showed a different correlation with Ap and T0a. Therefore this feature was coded in two levels: interrogative type, 1, or other type, 0. Different types of interrogatives were not distinguished. This is one of the most relevant features regarding T0a;

3. correlation with Ap indicates higher Ap in the beginning of sentences;

4. correlation indicates lower Ap in the end of sentences;

5. correlation shows lower Ap for long sentences;

6. correlation indicates higher Ap in the beginning of paragraph;

7. correlation indicates lower Ap in the end of paragraph;

8. correlation indicates higher Ap in the first sentences of paragraph;

9. correlation indicates lower Ap in the last sentences of paragraph;

10. is the length of the pause, if there is one, just before the PC. This feature is highly correlated with Ap;

11. indication if the PC is in beginning position of a phrase. This position is correlated with higher Ap;

12. indication if the PC is in beginning position of a sentence. This position is correlated with higher Ap;

13. indication if the PC is in beginning position of a paragraph. This position is correlated with higher Ap;

14. indication if the accent group starts with a tonic syllable. Slightly correlated with higher Ap and longer T0a;

15. measured in seconds. The longer the distance to the preceding PC, the higher are Ap and the anticipation T0a;

16. measured in number of syllables with correlation similar to the previous feature;

17. similar to feature 1, but coded in a different order due to different levels of correlation: other mark=0, comma=1/3, no mark=2/3, full stop=3/3. The correlation and the codification mean that Ap is higher for PCs following PCs generated by a full stop;

18. measured in seconds, it is the phrase component length. The longer the phrase component, the higher is Ap and the shorter is T0a. This is the most relevant feature for T0a;

19. measured in number of syllables, it has correlations similar to those of the previous feature;

20. in this feature other marks are relevant. Therefore, it is coded in the following seven levels according to the correlation of each mark with Ap: “other mark”=0, “.”=1/6, “…”=2/6, “;”=3/6, “?”=4/6, “no mark”=5/6, “:”=6/6. This means that Ap is higher for PCs preceding PCs generated by “:”, “no mark” and “?”;

21. is negatively correlated with Ap and slightly negatively correlated with T0a. However, it is not used in the Ap ANN because it deteriorates the performance of that ANN. On the other hand, in spite of its low correlation with T0a, this feature improves the performance of the T0a ANN.

All features are normalised to the range between 0 and 1 in the codification, after being divided by an established maximum limit for each feature.
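The codification of the orthographic-mark features (1, 17 and 20) and the normalisation by a maximum limit can be sketched in Python as follows. The level values are those listed above; the use of `None` for "no mark", the fallback of unlisted marks to "other", the clipping to the range 0 to 1 and the example limit are assumptions of this sketch.

```python
from typing import Dict, Optional

# Levels taken from the descriptions of features 1, 17 and 20.
MARK_CODE: Dict[Optional[str], float] = {
    "other": 0.0, ".": 1 / 3, ",": 2 / 3, None: 1.0}
PREVIOUS_MARK_CODE: Dict[Optional[str], float] = {
    "other": 0.0, ",": 1 / 3, None: 2 / 3, ".": 1.0}
NEXT_MARK_CODE: Dict[Optional[str], float] = {
    "other": 0.0, ".": 1 / 6, "...": 2 / 6, ";": 3 / 6,
    "?": 4 / 6, None: 5 / 6, ":": 1.0}

def code_mark(mark: Optional[str],
              table: Dict[Optional[str], float]) -> float:
    """Return the level of a mark; unlisted marks count as 'other'."""
    return table.get(mark, table["other"])

def normalise(value: float, limit: float) -> float:
    """Divide by the established maximum limit and clip to [0, 1]."""
    return min(max(value / limit, 0.0), 1.0)
```

For example, `normalise(1.5, 3.0)` would code a distance of 1.5 s as 0.5 under a hypothetical limit of 3 s.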

The final set of features for each ANN was established by the best performance achieved. For the Ap ANN the final set is composed of features 1 to 20, even though feature 21 has a non-negligible correlation with Ap. For the T0a ANN the final set also includes feature 21. For this ANN a set with only the most correlated features (features 2, 5, 10, 14, 15, 16, 18 and 19) was tried, but with worse performance.

The usage of almost the same set of features for both ANNs has the advantage that no further processing is needed to determine other features. Feature 21, the feature used only in the T0a ANN, is the output of the Ap ANN for the previous PC.

4.5.4 Evaluation of the prediction of Ap and T0a

The best linear correlation coefficients (r) between predicted and estimated Ap and T0a values, obtained for the test set are presented in Table 4.12.

Table 4.12: Linear correlation coefficient obtained in the test set for the predicted Ap and T0a values, relative to the estimated (labelled) values.

Ap T0a

r 0.772 0.649

Fig. 4.24 plots the best linear fit between target and predicted values of Ap and T0a in the test set, for the ANNs with correlation coefficients of 0.772 and 0.649, respectively. For both parameters the predicted values are concentrated in a shorter interval than the target values; that is, the ANNs impose narrower limits on the minimum and maximum predicted values.

Fig. 4.25 plots the probability of the error in the test set for the predicted Ap and T0a values, as well as the adjusted normal probability plot for the same data, in red. The figure shows an error of less than 0.12 for 80% of the predicted Ap values, and of less than 0.2 for 90%. The prediction of T0a has an error of less than 0.2 s in 75% of the cases and of less than 0.3 s in 95%.

The values of average and standard deviation of the estimated Ap in test set are 0.356 and 0.187 respectively. The same values for the predicted Ap in test set are 0.353 and 0.144, respectively.

The average and standard deviation of the estimated T0a in the test set are 0.367 and 0.233, respectively. The corresponding values for the predicted T0a are 0.364 and 0.146, respectively.
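These averages and standard deviations are consistent with the linear fits reported in Fig. 4.24 (A = 0.588 T + 0.143 for Ap and A = 0.406 T + 0.215 for T0a). For a least-squares line of the predicted values A on the target values T, the slope equals r multiplied by the ratio of the standard deviations of A and T, and the line passes through the two averages. The short Python check below uses only the values quoted in this section; the function name is illustrative.

```python
from typing import Tuple

def fit_from_moments(r: float, mean_t: float, std_t: float,
                     mean_a: float, std_a: float) -> Tuple[float, float]:
    """Slope and intercept of the least-squares line A = slope * T + b."""
    slope = r * std_a / std_t
    intercept = mean_a - slope * mean_t
    return slope, intercept

# Ap: r = 0.772; targets 0.356 / 0.187; predictions 0.353 / 0.144.
ap_slope, ap_intercept = fit_from_moments(0.772, 0.356, 0.187, 0.353, 0.144)
# T0a: r = 0.649; targets 0.367 / 0.233; predictions 0.364 / 0.146.
t0a_slope, t0a_intercept = fit_from_moments(0.649, 0.367, 0.233, 0.364, 0.146)
```

The resulting slopes and intercepts agree with those of Fig. 4.24 to within rounding. Since both r and the ratio of standard deviations are below 1, the slopes are below 1, which is the concentration of the predicted values in a shorter interval noted above.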


Fig. 4.24 – Best Linear fit between target (T) and predicted (A) values for Ap (left) and T0a (right). [Scatter plots of the data points with the best linear fit and the line A = T. Left, Ap: best linear fit A = 0.588 T + 0.143, R = 0.772. Right, T0a: best linear fit A = 0.406 T + 0.215, R = 0.649.]

Fig. 4.25 – Probability error in test set for predicted Ap and T0a. Lines show the adjusted normal probability distribution with a) µ=0.093, σ=0.075 and b) µ=0.148, σ=0.097. [Two normal probability plots: a) Ap prediction error; b) T0a prediction error (s). Horizontal axes from 0 to 0.4; probability axes from 0.003 to 0.997.]

4.5.5 Results of the PC model

The inserted PCs seem to be consistent both with the text and with the estimated PCs. The best linear correlation coefficients for the prediction of Ap and T0a are 0.772 and 0.649, respectively. The analysis of the predicted phrase components of several paragraphs led to the conclusion that, with a good set of ACs, the resulting F0 contour fits the original one closely; that is to say, it can produce a natural intonation.


Fig. 4.26 presents the predicted PCs for one example paragraph. The estimated PCs and phrase components are plotted in black. The results of the Ap and T0a ANNs can be observed in the green PCs and phrase components, which were predicted considering the initial positions of the estimated PCs. The results of the complete model can be observed in the magenta PCs and phrase components, for which Ap and T0a were predicted from the inserted PCs. The three plots allow the individual evaluation of each part of the model: only the prediction of the magnitudes, Ap, and anticipations, T0a, in green; and the insertion of PCs plus the prediction of Ap and T0a, in magenta.

Fig. 4.26 presents the paragraph: “Na passada quinta-feira, na RTP1 a jornalista Judite de Sousa entrevistou o senhor Procurador geral da República. O Senhor Doutor Cunha Rodrigues mostrou mais uma vez conhecimento profundo das matérias” (Last Thursday, on RTP1, the journalist Judite de Sousa interviewed the Attorney General of the Republic. Doctor Cunha Rodrigues showed once again a deep knowledge of the matters).

Fig. 4.26 – Application example of the insertion PC model. PCs and components: black – estimated; green – initial position of estimated PCs with predicted Ap and T0a; magenta – predicted with PC model. [Plot over t (s) from 0 to 12; vertical axis in Hz, also used for Ap/100, from -150 to 250; the orthographic and phonetic transcriptions of the paragraph are aligned with the time axis.]


4.6 Accent Commands

The model proposed in this section to predict Accent Commands admits that the PCs are already known.

In the process of estimating the commands, the F0 contour can be fitted with more or less precision according to the number of ACs used, as can be seen in Fig. 4.11, always without modelling micro-prosody. There seem to be two levels of fitting the F0 contour, a broader approximation and a narrower one. In the broader approximation the ACs, fewer in number, are associated with accent groups or with the accented syllables of the accent groups. In the narrower approximation the ACs, greater in number, can be associated with syllables. The best approximation may depend on the language and on the capacity of the model to accurately predict the AC parameters so as to produce a natural F0 contour.

As already discussed in 4.3.2, during the process of estimating the parameters of the Fujisaki model, the connection between ACs and syllables was followed. This approach is different from those used by Mixdorff [2002] or by Eva [2003], who consider one AC per accent group to be sufficient.

The present approach allows a more refined approximation of the F0 contour in the estimation process, but it does not guarantee a more reliable prediction of ACs, and the number of AC parameters to predict is larger. This approach led to the third phase of the estimation of parameters, as documented in 4.3.2.

After the third phase of estimation of commands:

• each AC is associated with just one syllable;

• syllables can have one or no associated ACs.

So, the model has to decide, for each syllable with voiced segments, whether it will have one associated AC or none, and then predict the parameters of the associated AC.

No ACs will be associated to syllables without voiced segments.

For each AC three parameters must be predicted: amplitude, Aa, onset time, T1, and offset time, T2. T1 and T2 are determined relative to the position of the syllable. Specifically, T1 is determined as the beginning of the voiced segments of the syllable minus an anticipation (Eq. (4.7)), and T2 as the end of the voiced segments of the syllable minus an anticipation (Eq. (4.8)). These anticipations, from now on T1a and T2a, are the timing parameters to be predicted once the beginning and end of the voiced segments are known.

$T1 = \mathit{Bvs} - T1a$    Eq. (4.7)

$T2 = \mathit{Evs} - T2a$    Eq. (4.8)

where:

Bvs – beginning of voiced segments;

Evs – end of voiced segments;

T1a – anticipation of T1;

T2a – anticipation of T2.
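As a minimal illustration of Eq. (4.7) and Eq. (4.8), the following Python sketch recovers the onset and offset times of an AC from the limits of the voiced segments and the two anticipations. The function name and the numeric values in the usage note are hypothetical.

```python
from typing import Tuple

def accent_command_times(bvs: float, evs: float,
                         t1a: float, t2a: float) -> Tuple[float, float]:
    """Return (T1, T2) in seconds from Eq. (4.7) and Eq. (4.8)."""
    t1 = bvs - t1a  # onset: beginning of voiced segments minus T1a
    t2 = evs - t2a  # offset: end of voiced segments minus T2a
    return t1, t2
```

For a syllable whose voiced segments span 1.20 s to 1.45 s, anticipations of 0.05 s and 0.03 s would place the AC between 1.15 s and 1.42 s.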

Therefore, the parameters to be predicted are:

• Ca – presence or absence of an AC associated to the syllable;

• Aa – amplitude of AC;

• T1a;

• T2a.

Next sections will discuss the architecture, training and optimization of the number of features for each parameter presented above. Section 4.6.4 will present tables with the best results in the test set, for each parameter.

4.6.1 ANN architectures

Initially one ANN with four outputs to predict the four parameters was trained. But soon it was realised that the best ANN according to performance was different for each output. Then, two ANNs were used, one to predict Ca, and another to predict the other three parameters. Again the best ANN for one output had not the best performance for the other parameters. In spite of this, a very good performance for each parameter could be achieved, but with different ANNs. This can be explained by the low correlation between the output parameters, as can be seen in Table 4.13. Therefore, in order to optimise the performance for all parameters, one ANN for each parameter was used.

Table 4.13: Linear correlation coefficient between AC parameters calculated along the labelled database.

       Aa     T1a    T2a    Ca
Aa     1      0.33   0.43   0.61
T1a           1      0.34   0.29
T2a                  1      0.49
Ca                          1

Several architectures in what concerns type of network, structure, number of layers, number of nodes in each layer, and activating functions, were considered and tested for the four ANNs. Feed-forward networks trained with back-propagation algorithms were selected as the type of network for the solution of the problem.

The networks’ inputs have the necessary number of nodes to code the features discussed in next section. The output node codes the parameter, Ca, Aa, T1a or T2a.


For the ANN dedicated to predicting Ca, from now on the Ca ANN, a perceptron layer with just one node [Demuth and Beale, 2000] and a hard-limit² transfer function (0/1 function) in the output was tested, since its output is binary, as are the Ca data, but with poor results.

The output of the Ca ANN was tested with and without a normalisation pre-processing, which converts the output to null average and standard deviation equal to 1. According to the results presented in Table 4.15, it can be concluded that the normalisation is recommended.

Because the activating function of the last layer of the Ca ANN is linear, a threshold, L, must be used to convert the output into a binary value. Values of L between 0.4 and 0.7 proved to be good candidates to optimise the performance of the Ca ANN, but an analysis of thousands of cases showed that 0.5 was, most frequently, the best value of L. Different alternatives for L are also presented in Table 4.15.
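The conversion of the linear output into a binary decision, and the comparison of candidate values of L, can be sketched in Python as follows. The sketch assumes that the output is expressed on the 0/1 scale of the Ca data, that an output equal to L counts as 1, and that the fraction of correct decisions is used as the performance measure; these choices and the function names are illustrative.

```python
from typing import List, Sequence

def binarise(outputs: List[float], threshold: float = 0.5) -> List[int]:
    """Convert linear outputs into 0/1 decisions with threshold L."""
    return [1 if o >= threshold else 0 for o in outputs]

def accuracy(outputs: List[float], targets: List[int],
             threshold: float) -> float:
    """Fraction of syllables whose decision matches the target."""
    decisions = binarise(outputs, threshold)
    return sum(d == t for d, t in zip(decisions, targets)) / len(targets)

def best_threshold(outputs: List[float], targets: List[int],
                   candidates: Sequence[float] = (0.4, 0.5, 0.6, 0.7)
                   ) -> float:
    """Return the candidate value of L with the highest accuracy."""
    return max(candidates, key=lambda l: accuracy(outputs, targets, l))
```

When several candidates give the same accuracy, the first one in `candidates` is returned.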

For the other three parameters the output is 85% of its value divided by the maximum parameter value and normalized to have null average and standard deviation equal to 1.

4.6.2 Training

Training was done over the training set, which consists of 6329 syllables (86%), using the test set, with 1026 syllables (14%), for cross-validation in order to avoid over-fitting. The test set was built by randomly picking some paragraphs from every text. The test vector was used to stop training early if further training on the training set would hurt generalisation to the test set. The cost function used for training was the mean squared error between output and target values.

The Ca ANN predicts, for each syllable, whether there will be an associated AC or not, so it is applied to all syllables. The Aa, T1a and T2a ANNs, on the other hand, predict the parameters just for the syllables that will have an associated AC. This leads to two alternatives for the training set: using all syllables, or using just the syllables that have an associated AC, because the values for the other syllables are zero in the training and test sets and so are irrelevant for the predicted ACs.

The first alternative has the advantage that the ANNs are trained to predict a very low value of Aa, T1a and T2a for syllables that should not have any associated AC, allowing the model to recover from an incorrect AC insertion by the Ca ANN.

The second alternative has the advantage of training ANNs with only the non null patterns.

Since it is not clear which one is preferable, both alternatives were used, and the results are reported in the fifth column of Table 4.17, Table 4.18 and Table 4.19. In the cases where the second alternative (training just with syllables with an associated AC) was used, marked in the tables with Y (yes), the correlation coefficient (r) was determined using just these syllables of the test set, and it is also presented in the tables. The last column of each table presents the r values determined over the test set using all syllables and taking the predicted values of Aa, T1a and T2a as null for syllables without an associated AC, as determined by the Ca ANN. The values of r in the last column were used to compare the performance of the alternative ANNs.
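The two alternatives for the training set, and the nulling of the predicted parameters for syllables without an associated AC, can be sketched in Python as follows. The pattern layout and the function names are hypothetical.

```python
from typing import List, Tuple

# One pattern per syllable: (features, target value, has associated AC).
Pattern = Tuple[List[float], float, bool]

def select_training_patterns(patterns: List[Pattern],
                             only_with_ac: bool) -> List[Pattern]:
    """First alternative keeps all syllables; the second keeps only
    the syllables that have an associated AC."""
    if only_with_ac:
        return [p for p in patterns if p[2]]
    return list(patterns)

def mask_predictions(predicted: List[float],
                     ca_flags: List[bool]) -> List[float]:
    """Take the predicted Aa, T1a or T2a as null wherever the Ca ANN
    decides that the syllable has no associated AC."""
    return [value if flag else 0.0
            for value, flag in zip(predicted, ca_flags)]
```

The masked predictions are the values that would be compared with the targets over all syllables of the test set.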

² Hard-limit is a function with output 0 if the input argument is less than 0, or 1 if the input argument is greater than or equal to 0.


Back-propagation training algorithms described in section 3.3.3 were used. The algorithm trainrp, ‘Resilient back-propagation’, gives results of inferior quality to those of trainlm, ‘Levenberg-Marquardt’. The latter is clearly the best algorithm for the dimension of the network, although its training process is slower.

For each variation of the ANN, concerning architecture, training algorithm, activating functions, set of features and its codification, as well as both alternatives in training set as described above, several hundreds of training sessions were ran and the best result was selected as the performance for this variation. Only the best performance solutions are presented in Table 4.15, Table 4.17, Table 4.18 and Table 4.19.

Fig. 4.27 displays the average performance (r) in the same test set of several training sessions for each parameter, considering different dimensions of the training set. It is visible that for Ca, Aa and T1a ANNs the performance is stabilised after 90% of the training set, and more patterns in training set should not improve the performance in test set. But, for T2a ANN the performance increased 0.01 from 90% to 100% of the training set. This leads to the expectation that more training patterns could improve performance of T2a ANN, but not much more than 0.01.

[Figure: average performance, r (vertical axis, 0.4 to 0.75), against % of training set (25, 50, 75, 90, 100), with one curve each for Ca, Aa, T1a and T2a.]

Fig. 4.27 – Evolution of the average ANN performance in the test set with the size of the training set.

4.6.3 Features

The feature sets were built taking into account known and foreseeable dependencies as well as local contextual information. The composition of the sets and the way the features are coded were then optimised.

Table 4.14 presents the list of features used and their linear correlation, r, with each output parameter. The correlation coefficient was used to select the set of the most correlated features for the respective ANN. Although some features have a very low correlation with the output parameter, using them together with the whole set of features improves the final performance.
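The selection step can be illustrated by ranking candidate features by the magnitude of their linear correlation with one output parameter. The sketch below is only an assumption about how such a ranking might be computed; the feature names and toy values are hypothetical and do not come from the database.

```python
import math

def pearson_r(x, y):
    """Linear correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank_features(features, target):
    """Return (name, r) pairs sorted by decreasing |r| with the target."""
    scored = [(name, pearson_r(values, target)) for name, values in features.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)

# Toy per-syllable values (hypothetical)
features = {
    "syllable_duration": [0.10, 0.25, 0.15, 0.30, 0.20],
    "tonic_syllable": [0, 1, 0, 1, 0],
    "syllables_to_word_end": [2, 0, 1, 0, 3],
}
target_aa = [0.0, 0.5, 0.1, 0.6, 0.0]

ranking = rank_features(features, target_aa)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

Such a ranking only gives a starting point: as noted above, weakly correlated features can still help when used together, so the final decision rests on test-set performance.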

Table 4.14: List of features and their correlations, r, with Ca, Aa, T1a, and T2a.

F # Feature description r(F,Ca) r(F,Aa) r(F,T1a) r(F,T2a)

1 Syllable duration 0.275 0.167 0.110 0.425

2 Duration of voiced part of syllable 0.398 0.172 -0.206 0.434

3 Vowel duration 0.414 0.217 0.038 0.423

4 Type of syllable 0.509 0.390 0.227 0.381

5 Tonic syllable (Y/N) 0.262 0.228 0.077 0.241

6 Type of vowel in syllable 0.415 0.305 0.119 0.388

7 Distance in sec. to the end of the sentence 0.050 0.065 0.011 -0.034

8 Distance in sec. to the beginning of the phrase -0.040 -0.050 -0.054 -0.002

9 Number of ACs from the beg. of the phrase -0.021 -0.041 -0.068 -0.003

10 Distance in sec. to the beginning of PC 0.017 0.087 0.009 0.036

11 Number of AC from the beginning of PC 0.016 0.055 -0.032 0.020

12 Distance in sec. to next PC -0.015 -0.004 0.006 -0.074

13 Last word of paragraph (Y/N) -0.063 -0.063 -0.020 0.079

14 Last syllable of paragraph (Y/N) -0.095 -0.095 -0.038 0.044

15 Last word of sentence (Y/N) -0.063 -0.083 -0.021 0.102

16 Last syllable of sentence (Y/N) -0.100 -0.131 -0.047 0.088

17 Syllable number in the word -0.082 -0.129 -0.107 0.000

18 Number of syllables to the end of the word 0.011 0.120 0.069 -0.053

19 Total number of syllables in the word -0.056 -0.007 -0.030 -0.042

20 Duration of the word in seconds 0.018 0.019 0.013 0.095

21 Amplitude of the previous AC -0.027 0.074 0.013 0.027

22 Duration in sec. of previous AC 0.014 -0.020 -0.052 -0.008

23 Distance in sec. to the previous T2 0.000 0.005 0.222 0.040

24 Distance in sec. to previous pause -0.047 -0.050 -0.075 -0.030

25 Distance in sec. to next pause 0.012 0.060 0.044 -0.033

26 Last tonic syllable of interrogative sentence type without interrogative word (Y/N) 0.016 0.019 -0.034 -0.011

27 Interrogative sentence without interrogative word (Y/N) -0.066 -0.039 -0.018 -0.031

A previous coding of some features, such as type of syllable, with one input node for each category was tried. However, in order to reduce the number of input nodes without loss of performance, those features were recoded into a single node. The new coding represents each category by its original correlation value.

The variation in test-set performance with and without a particular feature was the basis for the final decision on whether to include each listed feature in the input layers.

Each of the listed features is coded in one node of the input layer. An explanation of the features follows:

1. Duration of the syllable. It is significantly correlated with the presence of an AC and with its amplitude, T1a and T2a;

2. Each segment of the syllable is considered voiced or voiceless according to its identity. This feature is the length from the beginning of the first voiced segment to the end of the last voiced segment inside the syllable. It is more strongly correlated with the presence of an AC, because syllables without voiced segments do not have an associated AC. It is also strongly correlated with Aa and even more with T2a, but is negatively correlated with T1a. This means that the longer the voiced part of the syllable, the later the onset time of the AC and the earlier the offset time;

3. Duration of the vowel or diphthong of the syllable, or zero in syllables where the vowel was suppressed. It is also strongly correlated with the presence of an AC, its amplitude and T2a. In fact, these first three features are significantly correlated with one another, but do not carry exactly the same information;

4. This feature codes the type of syllable according to its vowel (V) and consonant (C) sequence. In a first phase of the work each type of syllable was coded in its own node, but later all types were coded in a single node according to their correlation with the output parameters, which was identical for all four parameters. The coding, from lower to higher correlation, is: 1-C; 2-CC; 3-V; 4-VC; 5-VCC; 6-CCVC; 7-CCV; 8-CVC; 9-CV. This feature is the most correlated with the presence of an AC, its amplitude and the onset time anticipation, T1a, and is also strongly correlated with T2a;

5. In EP every word has one tonic syllable, which is either orthographically marked or determined by a simple set of rules described in chapter 2. Theoretically this syllable should be prominent, although speakers sometimes do not realise it as a stressed or accented syllable. This feature signals whether or not the syllable is tonic. Tonic syllables have a significant correlation with the presence of an AC, its amplitude and T2a;

6. Vowels were divided into five groups according to average length and category. Again, all groups were coded in a single node according to their correlation with the output parameters. The coding, from lower to higher correlation, is: 1-short vowels (u and @); 2-medium vowels (i and 6); 3-diphthongs; 4-nasal vowels; 5-long vowels (a, E, e, o and O). The feature is strongly correlated with the presence of an AC, its amplitude and the anticipation of the offset instant, T2a, and moderately correlated with the anticipation of the onset instant, T1a;

7. Distance in sec. from the beginning of the syllable to the end of the sentence. It is slightly correlated with the presence of an AC and its amplitude, meaning fewer and weaker ACs at the end of sentences;

8. Distance in sec. from the beginning of the phrase to the beginning of the syllable. It is slightly negatively correlated with the presence of an AC, its amplitude and T1a, meaning more and stronger ACs at the beginning of phrases;

9. Number of ACs from the beginning of the phrase. It is slightly negatively correlated with T1a, meaning an earlier onset time for the first ACs in the phrase;

10. Distance in sec. from the beginning of the PC to the beginning of the syllable. It is slightly correlated with Aa, meaning slightly stronger ACs at the end of phrase components;

11. Number of ACs from the beginning of the PC. It is slightly correlated with Aa and has the same meaning as the previous feature;

12. Distance in sec. from the beginning of the syllable to the next PC. It is slightly negatively correlated with T2a, meaning later offset times for ACs far from the next PC;

13. Signals whether the present syllable belongs to the last word of the paragraph, coded as yes/no. It is slightly negatively correlated with the presence of an AC and its amplitude, and slightly correlated with a longer anticipation of the offset time. The meaning is fewer and weaker ACs in the last word of the paragraph;

14. Signals whether the present syllable is the last one of the paragraph, coded as yes/no. It is slightly negatively correlated with the presence of an AC and its amplitude, meaning fewer and weaker ACs in the last syllable of the paragraph;

15. Signals whether the present syllable belongs to the last word of the sentence, coded as yes/no. It is slightly negatively correlated with the presence of an AC and its amplitude, and slightly correlated with a longer anticipation of the offset time. The meaning is fewer and weaker ACs in the last word of the sentence;

16. Signals whether the present syllable is the last one of the sentence, coded as yes/no. It is slightly negatively correlated with the presence of an AC and its amplitude, and slightly correlated with a longer anticipation of the offset time. The meaning is fewer and weaker ACs in the last syllable of the sentence;

17. Position in word: the number of syllables from the beginning of the word. It is slightly negatively correlated with the presence of an AC, its amplitude and T1a, meaning fewer and weaker ACs in the last syllables of words;

18. Position in word: the number of syllables to the end of the word. It is slightly correlated with Aa and T1a, meaning stronger ACs in the first syllables of words;

19. Word length: total number of syllables in the word. It is slightly negatively correlated with the presence of an AC, meaning that the longer the word, the lower the number of ACs;

20. Word length: duration of the word in sec. It is slightly correlated with T2a, meaning that the longer the word's duration, the earlier the offset time of its ACs;

21. Amplitude of the previous AC. It is slightly correlated with Aa, meaning higher amplitudes for ACs that follow an AC of higher amplitude;

22. Length of the previous AC. It is slightly negatively correlated with T1a;

23. Distance in sec. to the offset instant of the previous AC. The greater the distance to the offset time of the previous AC, the earlier the onset time of the present AC;

24. Distance in sec. to the previous pause. It is slightly negatively correlated with all parameters;

25. Distance in sec. to the next pause. It is slightly correlated with Aa;

26. This feature codes whether the present syllable is the last tonic syllable of an interrogative sentence without an interrogative word. It is coded as yes/no;

27. This feature codes whether the syllable belongs to an interrogative sentence without an interrogative word. It is coded as yes/no. This and the previous feature are intended to capture the last tonic syllable of an interrogative sentence without an interrogative word, which is known to have a rising and falling F0 contour. Features 26 and 27 did not show a relevant correlation with the AC parameters, perhaps because sentences of this type are rare in the database.

All features are normalised to the range between 0 and 1 in the coding.
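As an illustration of the single-node coding of feature 4 followed by normalisation, the sketch below maps each syllable type to its rank (1 to 9, as listed above) and rescales it to the range 0 to 1. The use of min-max scaling and the function names are assumptions made for this example; the text only states that features are normalised between 0 and 1.

```python
# Rank of each syllable type, from lower to higher correlation (feature 4).
SYLLABLE_TYPE_CODE = {
    "C": 1, "CC": 2, "V": 3, "VC": 4, "VCC": 5,
    "CCVC": 6, "CCV": 7, "CVC": 8, "CV": 9,
}

def min_max(value, lowest, highest):
    """Rescale a value linearly so that lowest -> 0 and highest -> 1 (assumed scheme)."""
    return (value - lowest) / (highest - lowest)

def code_syllable_type(syllable_type):
    """Code a syllable type in one input node, normalised to [0, 1]."""
    code = SYLLABLE_TYPE_CODE[syllable_type]
    return min_max(code, 1, 9)

inputs = [code_syllable_type(s) for s in ["CV", "C", "VCC"]]
```

Compared with one node per category, this coding replaces nine input nodes by one while keeping the ordering that relates syllable type to the output parameters.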

Different groups of features were selected as inputs according to their correlation with the output parameter, and tested. The following tables present only the best-performing groups.

Table 4.15 presents three sets of features. The set of 6 features uses just the first 6 features (those most correlated with the Ca parameter). The set of 25 uses the first 25 features. Finally, the set of 27 features uses all the features presented.

In Table 4.17 the set of 25 features is the first 25, the set of 27 comprises all the features presented, and the set of 9 features contains just those most correlated with the Aa parameter (feature numbers 1, 2, 3, 4, 5, 6, 14, 16 and 17).

In Table 4.18 and Table 4.19, the set of 25 features is composed of the first 25 features presented.

4.6.4 Results of prediction with ANNs

This section discusses each ANN used to predict the output parameters.

For Aa, T1a and T2a two performance parameters are presented. Both are linear correlation coefficients (r) on the test set between the target and predicted vectors. The first is presented for the cases where training was done only with syllables with an associated AC (fifth column filled with Y); only the syllables with an AC predicted by the Ca ANN are used. The second r column uses all syllables. The target vectors hold exactly the values resulting from the estimation process in all syllables. The predicted vectors hold the values predicted by the corresponding ANN, but with null elements in syllables for which the Ca ANN predicted no AC.

In the following tables, the AF column gives the activation functions: L stands for hyperbolic-logarithmic, T for hyperbolic-tangent and Lin for the linear function. In the Training algorithm column, RP means the resilient back-propagation algorithm and LM means the Levenberg-Marquardt algorithm.

In the tables for the Aa, T1a and T2a ANNs some architectures appear with 3 nodes in the output layer. These ANNs predict the three output parameters Aa, T1a and T2a, but the performance presented is only that for the respective parameter. The output for the other two parameters is not very good and is discarded.

The first number in the architecture column of the following tables is the number of nodes in the input layer, which is equal to the number of features. The following numbers are the numbers of nodes in the hidden layers, and the last number is the number of output nodes.

Analysis of each individual predicted parameter follows.

4.6.4.1 Ca ANN results

To evaluate the performance of the Ca ANN four parameters were used: linear correlation coef-ficient (r), accuracy (A – given by Eq. (4.9)), recall rate (R – given by Eq. (4.10)) and precision rate (P – given by Eq. (4.11)).

$$A(\%) = \frac{\text{number of correct decisions}}{\text{number of syllables}} \times 100\% \qquad \text{Eq. (4.9)}$$

where the number of correct decisions is the number of times the output matches the target as to whether or not an AC is associated. The output is 0 or 1 according to whether the output of the ANN is lower or higher than the threshold L.

$$R(\%) = \frac{C}{C+D} \times 100\% \qquad \text{Eq. (4.10)}$$

$$P(\%) = \frac{C}{C+I} \times 100\% \qquad \text{Eq. (4.11)}$$

where C is the number of correctly inserted ACs, D is the number of deleted (i.e., not inserted) ACs, and I is the number of insertion errors (i.e., incorrectly inserted ACs).
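Eq. (4.9) to Eq. (4.11) can be computed from the thresholded network output as in the following sketch. The function name and toy data are hypothetical, and treating an output exactly equal to L as "no AC" is an assumption, since the text only distinguishes lower and higher outputs.

```python
def ca_scores(target, output, threshold):
    """Accuracy, recall and precision (in %) of the AC presence decision."""
    # Decision is 1 when the ANN output is higher than the threshold L.
    decisions = [1 if o > threshold else 0 for o in output]
    correct = sum(1 for t, d in zip(target, decisions) if t == d)
    c = sum(1 for t, d in zip(target, decisions) if t == 1 and d == 1)  # correct insertions
    dele = sum(1 for t, d in zip(target, decisions) if t == 1 and d == 0)  # deletions
    ins = sum(1 for t, d in zip(target, decisions) if t == 0 and d == 1)  # insertion errors
    accuracy = 100.0 * correct / len(target)
    recall = 100.0 * c / (c + dele)
    precision = 100.0 * c / (c + ins)
    return accuracy, recall, precision

# Toy example (hypothetical values)
target = [1, 0, 1, 1, 0, 0, 1, 0]
output = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3]
accuracy, recall, precision = ca_scores(target, output, threshold=0.5)
```

In this toy case there are 3 correct insertions, 1 deletion and 1 insertion error among 8 syllables, so all three measures evaluate to 75%.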

Table 4.15 presents the best ANNs according to the accuracy obtained; r is also presented. Note that architectures with better accuracy generally have better r values. The recall rate and the precision rate for the selected ANN are presented in Table 4.16.

As can be seen in Table 4.15, the accuracy varies very little among the architectures presented (between 88.60% and 89.28%). So, which architecture should be selected? A good choice would be the one with the lowest number of weights, because with similar performance it would be less computationally expensive. However, a very small difference in this parameter is more significant for the final F0 pattern than a similar difference in the other parameters, since this parameter decides whether or not the syllable has an associated AC. Moreover, determining the features requires no additional computation, since they must be determined for the other ANNs anyway. For these reasons, the selected architecture has 27 nodes in the input layer and 10 in the hidden layer, keeping the lighter architecture in mind in case improvements in computation time are needed. The accuracy of almost 90% achieved in predicting the existence of an AC is very promising for the final predicted F0 contour.
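To make the comparison of computational cost concrete, the number of adjustable parameters implied by an architecture string such as those in Table 4.15 can be counted as below. This is a sketch under the assumptions that the layers are fully connected and that every hidden and output node has one bias; the function name is hypothetical.

```python
def count_weights(architecture):
    """Count weights plus biases of a fully connected feed-forward network.

    `architecture` lists the nodes per layer, input layer first, e.g. "27-10-1".
    One bias per hidden and output node is assumed.
    """
    layers = [int(n) for n in architecture.split("-")]
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

selected = count_weights("27-10-1")  # selected Ca architecture
lighter = count_weights("6-6-1")     # lighter alternative from Table 4.15
```

Under these assumptions the selected 27-10-1 network has 291 parameters against 49 for the 6-6-1 network, which quantifies the trade-off discussed above.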

Table 4.15: best performances (A and r) in Ca ANN with different architectures, activating functions, training algorithms, set of features, limit of decision L and output processing.

Architecture AF Training Alg. # Features L Output processing r A(%)

27-10-1 L-Lin LM 27 0,5 Y 0,654 89,28

25-10-1 L-Lin LM 25 0,5 Y 0,652 89,18

27-10-1 T-Lin LM 27 0,5 Y 0,650 89,18

6-6-1 T-Lin LM 6 0,61 N 0,644 88,89

27-6-1 T-Lin LM 27 0,61 N 0,639 88,89

25-13-1 L-Lin LM 25 0,5 Y * 88,89

25-7-5-1 T-L-Lin LM 25 0,5 Y * 88,89

6-4-1 T-L RP 6 0,61 N 0,642 88,79

6-6-1 L-Lin LM 6 0,61 N 0,641 88,79

6-10-1 T-Lin LM 6 0,61 N 0,641 88,79

6-10-1 L-Lin LM 6 0,61 N 0,640 88,79

27-10-1 L-Lin LM 27 0,61 N 0,639 88,79

6-3-1 L-Lin RP 6 0,61 N 0,638 88,79

6-3-1 L-Lin RP 6 0,5 Y 0,637 88,79

6-6-4-1 T-L-Lin LM 6 0,61 N 0,642 88,69

6-4-1 L-T RP 6 0,61 N 0,639 88,69

6-6-1 L-Lin LM 6 0,5 Y 0,636 88,69

27-10-1 T-Lin LM 27 0,61 N 0,634 88,69

27-6-1 L-Lin LM 27 0,5 Y 0,630 88,69

25-6-4-1 T-L-Lin LM 25 0,5 Y * 88,69

25-10-1 L-T RP 25 0,5 Y * 88,69

25-13-1 L-T RP 25 0,5 Y * 88,69

25-6-1 L-Lin LM 25 0,5 Y * 88,69

25-4-4-4-1 L-T-L-T RP 25 0,5 Y * 88,69

27-6-1 L-Lin LM 27 0,5 Y 0,644 88,60

6-4-1 L-T RP 6 0,5 Y 0,634 88,60

27-6-1 L-Lin LM 27 0,61 N 0,634 88,60

27-3-1 L-Lin RP 27 0,5 Y 0,633 88,60

27-4-1 L-T RP 27 0,61 N 0,632 88,60

27-6-1 T-Lin LM 27 0,5 Y 0,631 88,60

27-3-1 L-Lin RP 27 0,61 N 0,631 88,60

6-3-1 L-Lin RP 6 0,5 Y 0,629 88,60

25-3-10-1 T-L-Lin LM 25 0,5 Y * 88,60

* Not measured value.

Table 4.16: Performance values for the best Ca ANN.

A(%) r P(%) R(%)

89,28 0,654 97,3 91,5

The strong correlation of the first 6 features with Ca, presented in Table 4.14, proved to be really important, since using more features brought no significant improvement. Nevertheless, introducing the other features caused no deterioration in performance.

4.6.4.2 Aa ANN results

Table 4.17 presents the best ANNs to predict Aa, according to the correlation coefficient.

Table 4.17: best performance (correlation coefficient) of architectures to predict Aa.

Architecture AF Training Alg. # Features

Training just with syll. with AC associated

r (just with syll. with AC associated)

r (with all syllables)

27-6-1 L-Lin LM 27 Y 0,507 0,602

25-13-1 L-Lin LM 25 N 0,598

25-7-5-1 T-L-Lin LM 25 N 0,596

25-13-3 L-Lin LM 25 N 0,587

27-10-1 L-Lin LM 27 Y 0,472 0,585

25-10-1 L-T RP 25 N 0,582

25-13-1 L-T RP 25 N 0,582

25-13-3 L-T RP 25 Y 0,483 0,577

25-6-4-3 T-L-Lin LM 25 Y 0,455 0,577

25-13-3 L-T RP 25 N 0,577

25-10-1 L-Lin LM 25 N 0,572

25-6-4-1 T-L-Lin LM 25 N 0,572

25-7-5-3 T-L-Lin LM 25 Y 0,441 0,567

25-10-3 L-T RP 25 N 0,566

25-10-3 L-Lin LM 25 N 0,566

9-7-1 L-Lin LM 9 Y 0,382 0,519

27-7-1 L-Lin LM 27 Y 0,504 *

27-10-1 L-Lin LM 27 Y 0,502 *

27-7-2-1 T-L-Lin LM 27 Y 0,488 *

27-7-5-1 L-L-Lin LM 27 Y 0,486 *

* Not measured value.

[Figure, left panel: predicted (A) against target (T) values, showing data points, the best linear fit and the line A = T; best linear fit: A = (0.416) T + (0.16), R = 0.602. Right panel: probability plot of the Aa prediction error.]

Fig. 4.28 – Best Linear fit between target (T) and predicted (A) values for Aa (left) and Probability error (|Aatarget-Aapredicted|) in test set for predicted Aa (right), red line shows the adjusted normal probability distribution with µ=0.12 and σ=0.12.

The selected architecture to predict the AC amplitudes has 27 nodes in the input layer and 6 in the hidden layer. The correlation of 0.602 is quite good compared with previous similar work for other languages, but is still in a low range. The major errors occur in focus positions, for which information is still lacking. Fig. 4.28 (left) displays the best linear fit between target and predicted values. The vertically aligned marks at the zero value of the target and the horizontally aligned marks at the zero value of the predicted variable correspond to the wrongly inserted ACs and the deleted ACs, respectively. The right side of the figure shows that 75% of the estimated Aa values have an error of less than 0.2 and 95% have an error of less than 0.35.
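The two quantities read from Fig. 4.28, the best linear fit between target and predicted values and the proportion of predictions below a given absolute error, can be obtained as in this sketch. The least-squares formulation and the function names are assumptions for illustration, and the toy vectors are not thesis data.

```python
def linear_fit(target, predicted):
    """Least-squares line predicted = slope * target + intercept."""
    n = len(target)
    mt = sum(target) / n
    mp = sum(predicted) / n
    slope = (sum((t - mt) * (p - mp) for t, p in zip(target, predicted))
             / sum((t - mt) ** 2 for t in target))
    intercept = mp - slope * mt
    return slope, intercept

def fraction_below(target, predicted, limit):
    """Fraction of syllables whose absolute prediction error is below `limit`."""
    errors = [abs(t - p) for t, p in zip(target, predicted)]
    return sum(1 for e in errors if e < limit) / len(errors)

# Toy values (hypothetical)
target = [0.0, 1.0, 2.0, 3.0]
predicted = [1.0, 3.0, 5.0, 7.0]
slope, intercept = linear_fit(target, predicted)
share = fraction_below(target, predicted, 2.5)
```

A slope below 1 with a positive intercept, as reported in the figure, indicates that large target amplitudes tend to be under-predicted and small ones over-predicted.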

4.6.4.3 T1a ANN results

Table 4.18 presents the best ANNs to predict T1a according to the correlation coefficient.

The selected ANN architecture to predict T1a has 25 nodes in the input layer and 10 in the hidden layer. The 25 input nodes receive the first 25 features of Table 4.14. The correlation of 0.743 is very good compared with previous similar work for other languages. Fig. 4.29 (left) displays the best linear fit between target and predicted values. The vertically aligned marks at the zero value of the target and the horizontal line at the zero value of the predicted variable, hidden by other marks, correspond to the wrongly inserted ACs and the deleted ACs, respectively. The graph on the right side of the figure shows that 90% of the T1a values have an error of less than 50 ms.

Table 4.18: best performance (correlation coefficient) of architectures to predict T1a.

Architecture AF Training Alg. # Features

Training just with syll. with AC associated

r (just with syll. with AC associated)

r (with all syllables)

25-10-1 L-Lin LM 25 N 0,743

27-6-4-1 T-L-Lin LM 27 Y 0,749 0,735

25-6-4-1 T-L-Lin LM 25 N 0,733

25-6-4-3 T-L-Lin LM 25 Y 0,724 0,728

25-7-5-3 T-L-Lin LM 25 Y 0,723 0,728

25-7-5-1 T-L-Lin LM 25 N 0,726

27-7-1 L-Lin LM 27 Y 0,730 0,723

25-6-4-3 T-L-Lin LM 25 N 0,722

25-10-1 L-T RP 25 N 0,722

25-6-1 L-Lin LM 25 N 0,722

25-13-1 L-Lin LM 25 N 0,722

27-10-1 L-Lin LM 27 Y 0,728 0,718

25-3-10-3 L-L-Lin LM 25 N 0,718

25-13-1 L-T RP 25 N 0,718

25-3-10-1 T-L-Lin LM 25 N 0,716

25-13-3 L-Lin LM 25 N 0,715

27-4-2-1 L-L-Lin LM 27 Y 0,746 *

27-7-2-1 T-L-Lin LM 27 Y 0,744 *

27-7-5-1 L-L-Lin LM 27 Y 0,744 *

27-7-1 L-Lin LM 27 Y 0,743 *

27-4-2-1 T-L-Lin LM 27 Y 0,741 *

27-6-1 L-Lin LM 27 Y 0,740 *

27-10-1 L-Lin LM 27 Y 0,735 *

* Not measured value.

[Figure, left panel: predicted (A) against target (T) values, showing data points, the best linear fit and the line A = T; best linear fit: A = (0.551) T + (0.00987), R = 0.743. Right panel: probability plot of the T1a prediction error (s).]

Fig. 4.29 – Best Linear fit between target (T) and predicted (A) values for T1a (left) and Probability error (|T1atarget-T1apredicted|) in test set for the predicted T1a values (right), red line shows the adjusted normal probability distribution with µ=0.022 (s) and σ=0.024 (s).

4.6.4.4 T2a ANN results

Table 4.19 presents the best ANNs to predict T2a according to the correlation coefficient. The best-performing architecture has 3 outputs, but only the one corresponding to T2a is used.

[Figure, left panel: predicted (A) against target (T) values, showing data points, the best linear fit and the line A = T; best linear fit: A = (0.442) T + (0.0283), R = 0.65. Right panel: probability plot of the T2a prediction error (s).]

Fig. 4.30 – Best Linear fit between target (T) and predicted (A) values for T2a (left) and Probability error in test set for predicted T2a (right), red line shows the adjusted normal probability distribution with µ=0.028 (s) and σ=0.026 (s).

The selected ANN architecture to predict T2a has 25 nodes in the input layer and 7 and 5 in the hidden layers. The 25 input nodes receive the first 25 features of Table 4.14. The correlation of 0.650 is quite good compared with previous similar work for other languages. Fig. 4.30 (left) displays the best linear fit between target and predicted values. The vertically aligned marks at the zero value of the target and the horizontally aligned marks at the zero value of the predicted variable correspond to the wrongly inserted ACs and the deleted ACs, respectively. The graph on the right side of the figure shows that 90% of the T2a values have an error of less than 60 ms.

Table 4.19: best performance (correlation coefficient) of architectures to predict T2a.

Architecture AF Training Alg. # Features

Training just with syll. with AC associated

r (just with syll. with AC associated)

r (with all syllables)

25-7-5-3 T-L-Lin LM 25 Y 0,593 0,650

25-13-1 L-Lin LM 25 N 0,646

27-6-1 L-Lin LM 27 Y 0,604 0,636

25-6-4-3 T-L-Lin LM 25 N 0,635

25-6-4-3 T-L-Lin LM 25 Y 0,574 0,634

25-4-4-4-3 L-T-L-Lin LM 25 Y 0,574 0,633

25-13-3 L-Lin LM 25 N 0,631

27-7-2-1 T-L-Lin LM 27 Y 0,581 0,629

25-7-5-1 T-L-Lin LM 25 N 0,628

25-10-3 L-Lin LM 25 N 0,622

25-6-1 L-Lin LM 25 N 0,622

25-3-10-1 T-L-Lin LM 25 N 0,622

25-10-1 L-Lin LM 25 N 0,621

25-6-4-1 T-L-Lin LM 25 N 0,616

25-6-3 L-Lin LM 25 N 0,613

25-10-1 L-T RP 25 N 0,611

25-13-1 L-T RP 25 N 0,610

27-4-2-1 L-L-Lin LM 27 Y 0,598 *

27-7-1 L-Lin LM 27 Y 0,597 *

27-7-5-1 L-L-Lin LM 27 Y 0,593 *

27-10-1 L-Lin LM 27 Y 0,600 *

27-7-2-1 T-L-Lin LM 27 Y 0,596 *

27-4-2-1 T-L-Lin LM 27 Y 0,594 *

* Not measured value.

4.6.5 Results of AC model

Table 4.20 summarises the objective evaluation of the best ANNs to predict ACs. Subjective evaluation will be discussed in the following chapter.

The application of the model to a sample paragraph is presented in Fig. 4.31. The figure represents the predicted F0 contour for the utterance corresponding to the text “…and are certainly important to everyone, particularly to those with responsibilities in on-going reformation.” The predicted F0 contour was determined with the set of predicted ACs and the estimated contour with the set of estimated ACs. The estimated PCs were used in the prediction of the ACs.

Table 4.20: Final performance in predicting the model parameters for ACs.

Performance parameter Ca Aa T1a T2a

r 0.654 0.602 0.743 0.650

A(%) 89.28 - - -

The major difficulties occur in the words “importantes” (important), “particularmente” (particularly), “responsabilidades” (responsibilities) and “reformas” (changes), on which the speaker placed focus. This paralinguistic information is not provided at the model input, which prevents the model from fitting those contours well. Nevertheless, it is visible that the F0 movement patterns are approximately well fitted.

Over the entire paragraph the rmse goes from 3.95 Hz for the estimated F0 contour to 14.3 Hz for the predicted one. This difference may be interpreted as the loss in naturalness introduced by the AC model. However, a re-synthesised paragraph with a lower rmse than another does not necessarily have better naturalness. The same observation was made for the linear correlation coefficient indicator.
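The rmse indicator used here can be computed between two F0 contours sampled at the same instants, as in the sketch below. The function name and the sample values are hypothetical, and the sketch assumes the caller has already aligned the contours and discarded frames without F0 (for example, unvoiced frames).

```python
import math

def rmse(contour_a, contour_b):
    """Root mean square error (same unit as the input, e.g. Hz) between two contours."""
    if len(contour_a) != len(contour_b):
        raise ValueError("contours must be sampled at the same instants")
    squared = [(a - b) ** 2 for a, b in zip(contour_a, contour_b)]
    return math.sqrt(sum(squared) / len(squared))

# Toy F0 samples in Hz (hypothetical)
original = [100.0, 104.0, 110.0, 98.0]
predicted = [103.0, 100.0, 110.0, 98.0]
error_hz = rmse(original, predicted)
```

Because the measure averages over the whole paragraph, a few large local mismatches, such as those in focused words, can dominate it without affecting the perceived naturalness in the same proportion.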

Fig. 4.31 – Result of predicted ACs. In black, the estimated PCs, ACs and the associated F0 contour. In magenta, the predicted ACs, based on estimated PCs, and the corresponding F0 contour. Vertical lines represent word boundaries.

4.7 Results of the Predicted F0 Contour

Here we discuss the results of applying the F0 model alone, and of applying the F0 model over the segmental durations model. The F0 model consists of the PC model and the AC model presented above. Applying the F0 model over the segmental durations model constitutes the full prosody model developed in this work.

4.7.1 F0 model

Joining the PC and AC models described above yields the F0 model, which predicts the F0 contour based on Fujisaki's proposal. The PCs are predicted first and then the ACs, because the AC model depends on the predicted PCs.

Fig. 4.32 depicts a sample application of the F0 model. The predicted phrase component, in magenta, allows a good fit between the predicted and original F0 contours. The addition of the accent components, once again, generates quite a good fit with the original F0 contour. In this case the rmse and the correlation coefficient vary from 3.95 Hz to 15.6 Hz and from 0.972 to 0.543, respectively. The only difference between Fig. 4.31 and Fig. 4.32 is the prediction of the PCs by the model, so the loss in naturalness between these two figures is due to the PC model alone. The rmse goes from 14.3 Hz to 15.6 Hz in this paragraph, meaning a loss in accuracy of just 1.3 Hz with the PC model. It should be mentioned that this loss is not additive in the final quality of the model. Nevertheless, the small difference in rmse of 1.3 Hz due to the PC model is certainly a good indicator for the PC model.

4.7.2 F0 model over segmental durations

The model developed to predict the F0 contour was also applied to the speech signal processed with the segmental durations model. In this case the original speech signal was re-synthesized with the predicted durations, and the input features of the F0 model were determined for the new timings. The signal was then re-synthesized again with the new F0 contour predicted by the present F0 model.

Fig. 4.33 displays the speech signal waveform after modification of the segmental durations with the PSOLA algorithm, together with the respective determined F0 contour, the predicted phrase component plus Fb, the predicted accent component, and the respective predicted PCs and ACs. All data, orthographic marks, words, syllables and phones are presented in synchronism with the speech signal waveform.


Fig. 4.32 – Application of the complete F0 model. In black the estimated PCs, ACs and F0 contour. In magenta the predicted ACs, PCs and F0 contour.


Fig. 4.33 – Application of the complete F0 model over the durations modified with the durations model. In magenta the predicted ACs, PCs and F0 contour.


4.8 Conclusion

This chapter presented a model, based on the Fujisaki model, to predict the F0 contour from text in European Portuguese. The F0 model is subdivided into two sub-models, one to predict the PCs and the other to predict the ACs. The parameters α, β and γ of the Fujisaki model are considered constant and equal to 2.0 s⁻¹, 20 s⁻¹ and 0.9, respectively. The baseline frequency Fb is also considered constant and equal to 75 Hz. These constant values were the ones that allowed the best fit between estimated and original F0 contours.
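As a reminder of how these constants enter the contour, the following minimal sketch evaluates the standard Fujisaki formulation (log F0 as the sum of the baseline, phrase components and accent components) with the values quoted above. The command lists `pcs` and `acs` are hypothetical values chosen only for illustration; they are not taken from the database, and the function names are assumptions of this sketch.

```python
import math

ALPHA = 2.0   # 1/s, phrase control constant (value quoted in the text)
BETA = 20.0   # 1/s, accent control constant (value quoted in the text)
GAMMA = 0.9   # ceiling of the accent component (value quoted in the text)
FB = 75.0     # Hz, baseline frequency (value quoted in the text)

def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism, Gp(t)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, Ga(t), capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def f0(t, phrase_cmds, accent_cmds, fb=FB):
    """F0 in Hz at time t (s).

    phrase_cmds: list of (T0, Ap); accent_cmds: list of (T1, T2, Aa).
    """
    log_f0 = math.log(fb)
    log_f0 += sum(ap * phrase_response(t - t0) for t0, ap in phrase_cmds)
    log_f0 += sum(aa * (accent_response(t - t1) - accent_response(t - t2))
                  for t1, t2, aa in accent_cmds)
    return math.exp(log_f0)

# Hypothetical commands, for illustration only.
pcs = [(0.0, 0.5)]           # one phrase command at t = 0 s with Ap = 0.5
acs = [(0.30, 0.55, 0.4)]    # one accent command from 0.30 s to 0.55 s, Aa = 0.4

# Contour sampled every 10 ms over 2 s.
contour = [f0(0.01 * i, pcs, acs) for i in range(200)]
```

Before any command starts, the contour sits at Fb; each phrase command then adds a slowly decaying bump and each accent command a plateau bounded by γ·Aa in the log domain.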

The database was parameterized with the Fujisaki model using a tool developed to manually insert and correct labelled PCs and ACs. The F0 contour estimated with this process has an average rmse of 3.97 Hz with respect to the determined F0 contour over the whole database. The speech signal re-synthesized with the estimated contour is difficult to distinguish perceptually from the original one.

The PC model works in two steps. In the first step it inserts PCs associated with the beginning of accent groups, based on orthographic marks and weighted candidates. In the second step, two dedicated ANNs determine the exact position T0 of each PC, by predicting an anticipation time T0a, and its amplitude Ap.

The locations of the inserted PCs seem to be consistent with the text and with the labelled PCs. The best linear correlation coefficients for the prediction of Ap and T0 are 0.772 and 0.646, respectively. These values compare well with the ones presented by Mixdorff [2002] for his Integrated German Model (IGM), 0.73 and 0.53, respectively.

The AC model allows one AC to be assigned to each syllable. For each syllable, a first ANN decides whether there will be an associated AC or not, with an accuracy of 89.3%. For syllables with an associated AC, the amplitude Aa, onset time T1 and offset time T2 of the AC have to be predicted. T1 and T2 are finally determined by subtracting the anticipation times T1a and T2a from the beginning and end of the voiced part of the syllable, respectively. One ANN was developed for each parameter, giving final linear correlation coefficients of 0.602, 0.743 and 0.650 for the amplitude, the anticipation of T1 and the anticipation of T2, respectively. Again, these values compare well with those of the IGM, which were 0.40, 0.61 and 0.63, respectively.
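The timing rule for T1 and T2 can be written down in a few lines. This is only a sketch: the function name and the numeric values are hypothetical, and the anticipation times would in practice come from the ANNs rather than being given by hand.

```python
def accent_command_times(voiced_start, voiced_end, t1a, t2a):
    """Return (T1, T2) in seconds for the accent command of one syllable.

    voiced_start, voiced_end: limits of the voiced part of the syllable (s).
    t1a, t2a: anticipation times (s), assumed here to be already predicted.
    """
    t1 = voiced_start - t1a   # onset anticipates the start of the voiced part
    t2 = voiced_end - t2a     # offset anticipates the end of the voiced part
    return t1, t2

# Hypothetical syllable: voiced part from 1.20 s to 1.45 s.
t1, t2 = accent_command_times(1.20, 1.45, t1a=0.05, t2a=0.02)
```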

A value of β = 30 s⁻¹ was also tried and gave a better fit between predicted and original F0 contours, although the re-synthesized speech did not sound quite natural in most of the utterances. With β = 20 s⁻¹ this problem seems to be reduced.

The F0 contour produced with the predicted parameters approximately follows the measured F0. The major differences come from the difficulty in emphasizing the “focus” word, because this information was absent in the training phase of the model. The final speech signal, produced by re-synthesis with the predicted F0 contour, is not yet completely natural but is considered acceptable.

It is fair to mention that the model uses just some of the available linguistic information. For instance, syntactic information has not been used. Moreover, paralinguistic information is not extracted by the model, and the speaker often produces a larger F0 movement that can be explained by this kind of information, which the model does not follow.


A perceptual test is necessary to properly evaluate the perceived naturalness at each stage of the entire model. The next chapter describes this test.


5 Perceptual Tests

This chapter presents two perceptual tests evaluating the duration models and F0 models developed in the previous chapters. The category-judgment method and the Mean Opinion Score (MOS) scale were used to evaluate the perceived distance between the proposed models and the original stimuli. A comparison between the two proposed models for predicting segmental durations is also described. The loss in naturalness along the components of the F0 model is measured in order to evaluate each component and identify which parts should be improved. Finally, the objective measurements, r and rmse, are compared with the subjective measure of perceived naturalness, MOS.


5.1 Introduction

The objective performance measurements presented for each part of the developed models are not by themselves enough to understand how acceptable the models are. Can perceptual tests show how acceptable a given objective performance is?

Two perceptual tests were carried out and are described in this chapter. The first one considers only the duration models described in chapter 3. The second test considers the F0 model, its sub-models, and the duration plus F0 models, using the best duration model as selected by the first perceptual test. The need to choose between the model and the alternative model for predicting segmental durations in the duration plus F0 stimuli was the main reason to perform the subjective evaluation in two perceptual tests instead of just one.

Both tests were done using five paragraphs of the test set, not used in training. Several stimuli made by copy-synthesis of original paragraphs were prepared to be presented to listeners. Copy-synthesis stimuli with predicted segmental durations and/or predicted F0 contour were prepared in time domain using a TD-PSOLA algorithm [Moulines and Charpentier, 1990] and [Moulines and Laroche, 1995] in the PRAAT software [Boersman and Weenink]. The pauses were kept the same as in the original stimulus.

The methodology described in [Standard Publication No. 297, IEEE, 1969] for category-judgment tests was generally followed, using the MOS scale.

Almost every listener was a college professor and ages ranged from 24 to 35. Some of them are involved in speech synthesis.

The perceptual tests were presented to groups of one to five listeners at a time. The tests were performed in an office room with a low level of environmental noise. Stimuli were played on a computer at the sound volume requested by the listeners.

Section 5.2 describes the perceptual test of stimuli with modified segmental durations, and section 5.3 the perceptual test of stimuli with modified F0 and with modified durations plus F0, according to the selected duration model and the F0 model. At the end of each section the correlation of the measured rmse and r with the perceived naturalness, MOS, is studied.


5.2 Perceptual Test of Duration Models

The aim of this perceptual test is to select the preferable duration model, to evaluate how acceptable the selected duration model is, and to clarify where the model stands in relation to a completely natural stimulus and to a stimulus with fixed segmental durations. All models evaluated in this test were presented in chapter 3.

The standard category-judgment test was changed in that no references of excellent and unsatisfactory were presented. Instead, two original stimuli (without modifications) and one stimulus in which every segment has the average duration of its segment type (henceforth “No model”) were used. The two original stimuli, being exactly the same, were used to evaluate the consistency of the listeners’ answers. The “No model” stimuli were produced by changing the original duration of each segment to the average duration in the database for that segment identity. These stimuli are called “No model” because the durations can be taken from a very simple table with the 44 different types of segments and their respective average durations. The “No model” stimuli are not equivalent to an unsatisfactory reference because, in fact, they produce a fair timing for several sentences with no emphatic prosody.
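The “No model” baseline amounts to a lookup table of mean durations per segment identity. A minimal sketch follows, assuming the labelled database is available as (label, duration) pairs; the toy corpus, labels and function names are hypothetical.

```python
from collections import defaultdict

def mean_duration_table(segments):
    """Build {segment label: mean duration in ms} from (label, duration_ms) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, duration_ms in segments:
        totals[label] += duration_ms
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}

def no_model_durations(labels, table):
    """Assign to every segment the average duration of its segment type."""
    return [table[label] for label in labels]

# Hypothetical toy corpus; the real table would hold the 44 segment types.
corpus = [("a", 80.0), ("a", 100.0), ("s", 110.0), ("s", 90.0), ("t", 60.0)]
table = mean_duration_table(corpus)
baseline = no_model_durations(["s", "a", "t", "a"], table)
```

The resulting durations ignore context entirely, which is why such stimuli serve as a floor against which the ANN-based models can be compared.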

A total of five stimuli per paragraph were presented in random order to the listeners in a blind test, so that they did not know whether they were listening to the original or to a manipulated version. Listeners were informed about the type of modifications introduced in the original sound and asked to concentrate on timing acceptability. They could hear the stimuli as many times as they wanted and were asked to classify each stimulus on a scale from 1 to 5 (1- Unsatisfactory, 2- Poor, 3- Fair, 4- Good, 5- Excellent).

The other stimuli presented were produced with the durations predicted by the model and by the alternative model, both presented in chapter 3.

Table 5.1 presents the text of the five paragraphs used in the perceptual test, and Table 5.2 the distance, measured by correlation coefficient and rmse, between the original and the alternative model, model and “No model” stimuli. The first paragraph is a short title, about two seconds long, while the others vary between 10 and 13 seconds. Paragraphs 1 and 5 contain interrogatives while the others contain only declarative sentences.

The stimuli of the alternative model have a correlation coefficient between 0.817 and 0.866 and an rmse between 17.7 and 23.7 ms. The stimuli of the model have a correlation coefficient between 0.790 and 0.882 and an rmse between 18.9 and 22.1 ms. Fairly close stimuli were produced by “No model”, with a correlation coefficient between 0.690 and 0.800 and an rmse between 21.2 and 28.0 ms. The values of correlation coefficient and rmse show that these indicators do not agree along paragraphs. For instance, the alternative model has its best correlation coefficient for paragraph 1 and its worst rmse for the same paragraph. Which indicator better reflects naturalness? Perhaps the perceptual test can give a hint.
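The two indicators measure different things, which is one reason they need not agree. The sketch below computes both for hypothetical duration sequences (the values are invented for illustration) and shows an extreme case: a constant offset leaves r at 1 while the rmse equals the offset.

```python
import math

def rmse(x, y):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical segmental durations in ms (not taken from the database).
original = [62.0, 95.0, 48.0, 120.0, 77.0]
predicted = [70.0, 88.0, 55.0, 101.0, 80.0]
shifted = [d + 20.0 for d in original]   # every segment 20 ms too long

print(f"predicted: r = {pearson_r(original, predicted):.3f}, "
      f"rmse = {rmse(original, predicted):.1f} ms")
print(f"shifted:   r = {pearson_r(original, shifted):.3f}, "
      f"rmse = {rmse(original, shifted):.1f} ms")
```

The correlation coefficient is insensitive to a uniform lengthening or scaling of all segments, whereas rmse penalizes it directly.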

Twenty subjects participated in the test, 7 female and 13 male. The listeners were divided into two groups. The first group was composed of 8 listeners, all experienced in speech related issues; the second comprised the remaining 12 listeners, who had no experience in the subject. The evaluation made by the experienced listeners did not differ from that of the others; therefore, results are displayed jointly for the whole set of listeners.


Table 5.1: Portuguese and respective translation of the 5 paragraphs used in the perceptual test, and respective number of segments.

| Parag. number | Text | Translation | Number of phones |
|---|---|---|---|
| 1 | Que igualdade perante a lei? João Amaral. | How equal in face of the law? João Amaral. | 36 |
| 2 | As suas opiniões sobre a situação da justiça revelam muita reflexão e são certamente importantes para todos, particularmente para os que têm responsabilidades nas reformas a fazer. | His opinions regarding the justice system reveal a lot of reflection and are certainly important to everyone, particularly to those with responsibilities in on-going reformation. | 164 |
| 3 | Evidentemente que quem exerce um cargo tão sensível há cerca de quinze anos está sujeito a um desgaste natural. Mais ainda, quando a justiça está muito longe de satisfazer as aspirações e interesses dos cidadãos. | It is obvious that someone with such high sensitive functions since fifteen years ago is exposed to natural strain. Moreover, when justice is far from satisfying the ambitions and interests of citizens. | 177 |
| 4 | Há os processos contra gente importante que nunca mais terminam. Há a situação de quem é pobre, e que está objectivamente em situação de inferioridade quando tem de enfrentar na justiça os mais ricos e poderosos, que podem pagar advogados de luxo. | There are lawsuits against important people, which are never-ending. There are poor people, clearly inferior when they have to face court against the richer and powerful, who can afford luxury lawyers. | 209 |
| 5 | Mas, que igualdade perante a lei? Que igualdade, quando para muitos a justiça é praticamente inacessível? Como podem esses reclamar o cumprimento da lei, sem dinheiro para pagar a bons advogados e os elevados custos de um processo? | But, how equal facing the law? How equal, if many still have almost no access to justice? How can they demand law enforcement, if they cannot afford good lawyers and the high costs of a lawsuit? | 204 |

Table 5.2: Correlation coefficient, r, and rmse between original and the other three stimuli in each paragraph.

| Paragraph number | Alt. Model r | Alt. Model rmse (ms) | Model r | Model rmse (ms) | No model r | No model rmse (ms) |
|---|---|---|---|---|---|---|
| 1 | 0.866 | 23.7 | 0.882 | 19.0 | 0.733 | 28.0 |
| 2 | 0.817 | 19.1 | 0.814 | 19.2 | 0.690 | 24.1 |
| 3 | 0.842 | 19.5 | 0.790 | 22.1 | 0.751 | 23.8 |
| 4 | 0.866 | 17.7 | 0.846 | 18.9 | 0.800 | 21.2 |
| 5 | 0.865 | 18.3 | 0.844 | 19.5 | 0.766 | 23.5 |


[Figure: bar chart of the average opinion (vertical axis) for subjects 1 to 20 (horizontal axis); series: Original1, Original2, Alt model, Model, No model.]

Fig. 5.1 – Average opinion values of each subject for the 5 stimuli.

[Figure: bar chart of the average opinion (vertical axis, 0.0 to 5.0) for paragraphs 1 to 5 (horizontal axis); series: Original1, Original2, Alt model, Model, No model.]

Fig. 5.2 – Average opinion values by paragraph for the 5 stimuli.

Fig. 5.1 shows the average opinion values of each listener for the 5 stimuli. Each bar is the average of five opinions. Original1 and original2 present the average opinion for the first and second original stimuli, respectively. They are treated separately to reveal the variation in the evaluation made by each subject. The original stimuli were globally the favourite, except in some cases where the stimuli with segmental durations predicted by the model or by the alternative model were preferred to the original ones. The “No model” stimuli never came close to the scores of the stimuli with durations imposed by the models.

The alternative model was rated better than the model by 12 subjects, and the model was rated better than the alternative model by 5 subjects. This denotes a clear preference for the alternative model.

Fig. 5.2 shows the average opinion values of each paragraph for the 5 stimuli. Each bar is the average of 20 opinions. Again, the opinion values for the duration models are very close to those of the original stimuli. The alternative model was even preferred in the first, second and fifth paragraphs. Again, the “No model” stimuli are far from both models.

The alternative model was rated better than the model in 3 paragraphs, and the model was rated better than the alternative model in the other 2 paragraphs. The preference for the alternative model is confirmed. The scores achieved by the two models show no particular difference in the paragraphs with interrogative sentences (paragraphs 1 and 5), although the “No model” stimuli have higher scores in those paragraphs.

Table 5.3: Mean Opinion Score (MOS) and standard deviation of the perceptual test.

| | Original 1 | Original 2 | Alt. Model | Model | No model |
|---|---|---|---|---|---|
| MOS | 4.13 | 4.27 | 3.93 | 3.78 | 2.88 |
| Std | 0.92 | 0.81 | 0.96 | 0.91 | 1.10 |

[Figure: box plot of opinion values (vertical axis, 1 to 5) for Original1, Original2, Alt model, Model and No model.]

Fig. 5.3 – Analysis of opinion scores.

Table 5.3 displays the MOS and respective standard deviation. Each MOS value is the average of 100 opinions. The analysis of variance over all types of stimuli resulted in a confidence level of practically 100% (p < 1e-12 for F = 33.5), showing a clear dependency of the results on the type of stimulus.
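For readers unfamiliar with the analysis of variance used here, the F statistic of a one-way ANOVA can be computed as below. This is a generic sketch, not the author's procedure; the opinion scores are hypothetical. In the test described above there are 5 stimulus types with 100 opinions each, so the degrees of freedom would be (4, 495). Obtaining the p-value additionally requires the F distribution, available for instance through `scipy.stats.f_oneway`.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of lists of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical opinion scores for three stimulus types.
scores = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value, df_b, df_w = one_way_anova_f(scores)
```

A large F means that the differences between the stimulus means are large compared with the spread of opinions within each stimulus type.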

Fig. 5.3 illustrates the analysis of the subjects’ opinions for each type of stimulus over 100 opinions. Mean Opinion Scores are represented by a thick black line. Blue boxes span the lower and upper quartiles. Red lines represent the median score. Minimum and maximum values are shown with thin black lines. Red plus signs represent the outliers. The figure shows that original1 and original2 are equivalent. The alternative model is close to the original and slightly better than the model. Finally, although “No model” achieves a fairly good score, it is still far from the model and even further from the alternative model.

Table 5.4 displays the significance level between each pair of stimuli given by analysis of variance. In each cell the first value is the significance level, p, and the second the respective confidence level given by Eq. (5.1). Cells with an orange background indicate a low confidence level, meaning that the hypothesis that the two stimuli are the same cannot be rejected. The other cells, in contrast, present a confidence level high enough to reject the hypothesis that the two stimuli are the same. In conclusion, the MOS differences between original1 and original2, between original1 and the alternative model, and between the alternative model and the model are not significant. The MOS differences for all other pairs of stimuli are significant.

$$CL\,(\%) = 100 \times (1 - p) \tag{5.1}$$

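Table 5.4: Significance level between pairs of stimuli.

| Stimuli | Original 1 | Original 2 | Alt. Model | Model | No model |
|---|---|---|---|---|---|
| Original 1 | - | 0.2552 (74%) | 0.1328 (87%) | 0.0072 (99%) | 0 |
| Original 2 | | - | 0.0074 (99%) | <0.001 | 0 |
| Alt. Model | | | - | 0.2560 (74%) | 0 |
| Model | | | | - | 0 |
| No model | | | | | - |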
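The confidence levels in Table 5.4 follow directly from Eq. (5.1). A short check, using the p-values listed in the table (the function and dictionary names are assumptions of this sketch):

```python
def confidence_level(p):
    """Confidence level in percent for a significance level p, as in Eq. (5.1)."""
    return 100.0 * (1.0 - p)

# p-values copied from Table 5.4.
p_values = {
    ("Original 1", "Original 2"): 0.2552,
    ("Original 1", "Alt. Model"): 0.1328,
    ("Original 1", "Model"): 0.0072,
    ("Original 2", "Alt. Model"): 0.0074,
    ("Alt. Model", "Model"): 0.2560,
}

levels = {pair: round(confidence_level(p)) for pair, p in p_values.items()}
for pair, level in levels.items():
    print(f"{pair[0]} vs {pair[1]}: CL = {level}%")
```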

5.2.1 Discussion

The original1 and original2 stimuli were rated between the Good and Excellent levels, with no significant difference between them.

The test confirmed a slight preference for the alternative model over the model. For some subjects the alternative model was even preferred to the original stimuli, and in some paragraphs it was also preferred to the original stimuli. In general the alternative model is very close to the original, at a MOS distance of 0.27 from the average of original1 and original2. The closeness is confirmed by the low confidence level (87%) between the alternative model and original1. This result is evidence of the improvement achieved by the use of dedicated ANNs.


The model also has a MOS close to the original (average distance of 0.42). It is still 0.15 points away from the alternative model, but the analysis of variance shows a low confidence level (74%) for the difference between them. For some subjects the model was preferred to the original stimuli and to the alternative model. In two paragraphs the model was preferred to the alternative model.

“No model” is still at a MOS distance of 0.90 and 1.05 from the model and the alternative model, respectively. Although “No model” was never preferred to the other stimuli by any subject or in any paragraph, it is still at the Fair level (2.88).

Both proposed models are at the Good level of acceptability, with MOS of 3.93 and 3.78. In spite of the low confidence level between the alternative model and the model, the alternative model is selected for the further developments concerning F0 modulation because of its slightly better scores in the perceptual test and in the objective measurements. The alternative model is also preferred by more subjects than the model.

5.2.1.1 Correlation between objective and subjective measurements

Some discussion follows about the correlation between the proximity measurements, given by correlation coefficient and rmse, and the perceived naturalness, given by MOS.

Table 5.5 presents the objective distance of the segmental durations between original and modified stimuli, and the subjective evaluation by means of MOS, for the five paragraphs.

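Table 5.5: Measurement indicators for models, by paragraph.

| Stimuli | Indicator | Parag. 1 | Parag. 2 | Parag. 3 | Parag. 4 | Parag. 5 |
|---|---|---|---|---|---|---|
| Alt. Model | r | 0.866 | 0.817 | 0.842 | 0.870 | 0.865 |
| Alt. Model | rmse (ms) | 23.7 | 19.1 | 19.5 | 17.7 | 18.3 |
| Alt. Model | MOS | 4.2 | 4.2 | 3.9 | 3.5 | 4.0 |
| Model | r | 0.882 | 0.814 | 0.790 | 0.850 | 0.844 |
| Model | rmse (ms) | 19.0 | 19.2 | 22.1 | 18.9 | 19.5 |
| Model | MOS | 3.8 | 3.6 | 4.0 | 3.8 | 3.9 |
| No model | r | 0.733 | 0.69 | 0.751 | 0.800 | 0.766 |
| No model | rmse (ms) | 28.0 | 24.1 | 23.8 | 21.2 | 23.5 |
| No model | MOS | 3.3 | 2.5 | 2.7 | 2.8 | 3.2 |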

Since the scales and meanings of the measurement indicators are different, rmse and MOS were rescaled so that they can be represented on a scale similar to that of the correlation coefficient. The modified rmse (mrmse) is given by Eq. (5.2) and the modified MOS (mMOS) by Eq. (5.3). The modification aims to represent all measurement indicators on an increasing scale with a maximum equal to one. Fig. 5.4, Fig. 5.5 and Fig. 5.6 display the measurements r, mrmse and mMOS along paragraphs for the alternative model, the model and “No model”, respectively.

$$mrmse = \frac{30 - rmse}{15} \tag{5.2}$$

$$mMOS = \frac{MOS}{5} \tag{5.3}$$
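Eq. (5.2) and Eq. (5.3) can be checked against the plotted values with a few lines of code. The function names are assumptions of this sketch; the inputs are the alternative model's rmse and MOS values by paragraph.

```python
def mrmse(rmse_ms):
    """Modified rmse, Eq. (5.2): 1 at rmse = 15 ms, 0 at rmse = 30 ms."""
    return (30.0 - rmse_ms) / 15.0

def mmos(mos):
    """Modified MOS, Eq. (5.3): MOS rescaled so that Excellent (5) maps to 1."""
    return mos / 5.0

# rmse (ms) and MOS of the alternative model by paragraph.
alt_rmse = [23.7, 19.1, 19.5, 17.7, 18.3]
alt_mos = [4.2, 4.2, 3.9, 3.5, 4.0]

alt_mrmse = [round(mrmse(v), 2) for v in alt_rmse]
alt_mmos = [round(mmos(v), 2) for v in alt_mos]
```

Note that the scaling maps an rmse of 15 ms to one, so mrmse stays below one only because every measured rmse here is above 15 ms.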

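[Figure: chart of r, mrmse and mMOS (vertical axis, 0.00 to 1.00) by paragraph for the alternative model. Plotted values:]

| Paragraph | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| r | 0.87 | 0.82 | 0.84 | 0.87 | 0.87 |
| mrmse | 0.42 | 0.73 | 0.70 | 0.82 | 0.78 |
| mMOS | 0.84 | 0.84 | 0.78 | 0.70 | 0.80 |

Fig. 5.4 – Comparison of measurement indicators by paragraph for Alternative Model.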

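[Figure: chart of r, mrmse and mMOS (vertical axis, 0.00 to 1.00) by paragraph for the model. Plotted values:]

| Paragraph | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| r | 0.88 | 0.81 | 0.79 | 0.85 | 0.84 |
| mrmse | 0.73 | 0.72 | 0.53 | 0.74 | 0.70 |
| mMOS | 0.76 | 0.72 | 0.80 | 0.76 | 0.78 |

Fig. 5.5 – Comparison of measurement indicators by paragraph for Model.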


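[Figure: chart of r, mrmse and mMOS (vertical axis, 0.00 to 1.00) by paragraph for “No model”. Plotted values:]

| Paragraph | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| r | 0.73 | 0.69 | 0.75 | 0.80 | 0.77 |
| mrmse | 0.13 | 0.39 | 0.41 | 0.59 | 0.43 |
| mMOS | 0.66 | 0.50 | 0.54 | 0.56 | 0.64 |

Fig. 5.6 – Comparison of measurement indicators by paragraph for No model.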

The correlation between measurement indicators is presented in Table 5.6. No significant correlation seems to exist between the subjective and objective measurements. In the case of rmse the correlation is even clearly positive (see footnote 1), which suggests that rmse varies in the opposite direction to what MOS indicates. The strongest agreement is found between rmse and r but, even so, at a low average level of -0.38.

In conclusion, for a given model no correlation seems to exist along paragraphs between objective and subjective measurements, and only a low correlation exists between the two objective measurements.

Table 5.6: Correlation coefficient along paragraphs between measurement indicators.

| Stimuli | r(r,rmse) | r(rmse,MOS) | r(r,MOS) |
|---|---|---|---|
| Alt. Model | 0.16 | 0.62 | -0.41 |
| Model | -0.75 | 0.70 | -0.19 |
| No model | -0.55 | 0.52 | 0.30 |
| Average | -0.38 | 0.61 | -0.10 |
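The r(rmse,MOS) column can be recomputed from the per-paragraph rmse and MOS values of Table 5.5. The sketch below is not the author's code; it uses a plain Pearson coefficient over the five paragraphs of each stimulus type and gives values close to the 0.62, 0.70 and 0.52 reported in Table 5.6.

```python
import math

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# rmse (ms) and MOS by paragraph, copied from Table 5.5.
table_5_5 = {
    "Alt. Model": {"rmse": [23.7, 19.1, 19.5, 17.7, 18.3],
                   "MOS": [4.2, 4.2, 3.9, 3.5, 4.0]},
    "Model": {"rmse": [19.0, 19.2, 22.1, 18.9, 19.5],
              "MOS": [3.8, 3.6, 4.0, 3.8, 3.9]},
    "No model": {"rmse": [28.0, 24.1, 23.8, 21.2, 23.5],
                 "MOS": [3.3, 2.5, 2.7, 2.8, 3.2]},
}

r_rmse_mos = {name: pearson_r(d["rmse"], d["MOS"]) for name, d in table_5_5.items()}
for name, value in r_rmse_mos.items():
    print(f"{name}: r(rmse, MOS) = {value:.2f}")
```

The positive sign for all three stimulus types means that, within one model, paragraphs with a larger duration error did not receive lower opinion scores.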

And what about the correlation between subjective and objective measurements across models?

Table 5.7 presents the mean values along paragraphs of the evaluation measurements r, rmse / mrmse and MOS / mMOS for the alternative model, the model and “No model”. Table 5.8 shows a very strong correlation between each pair of measurements. Therefore, a very strong correlation exists between objective and subjective measurement indicators when evaluating a model. In addition, the correlation coefficient, r, seems to be even more correlated with MOS (0.999) than rmse is (0.994 in absolute value). A strong correlation (0.992 in absolute value) also exists between the two objective measurements, correlation coefficient and rmse.

Footnote 1: Since naturalness is measured by MOS on an ascending scale and by rmse on a descending scale, similar indications by both measurements should show up as a negative correlation between them.

Table 5.7: Mean values along paragraphs of evaluation measurements.

| Stimuli | r | rmse (ms) / mrmse | MOS / mMOS |
|---|---|---|---|
| Alt. Model | 0.851 | 19.66 / 0.689 | 3.96 / 0.792 |
| Model | 0.835 | 19.74 / 0.684 | 3.82 / 0.764 |
| No model | 0.748 | 24.12 / 0.392 | 2.90 / 0.580 |

Table 5.8: Correlation between mean values of evaluation measurements.

| | (r,rmse) | (rmse,MOS) | (r,MOS) |
|---|---|---|---|
| r | -0.992 | -0.994 | 0.999 |
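The values in Table 5.8 can be approximated from the rounded means of Table 5.7, treating the three stimulus types as three data points. This is a sketch under that assumption; because the inputs are rounded, the (r,MOS) value may differ from the reported 0.999 in the last digit.

```python
import math

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mean_x) ** 2 for a in x) *
                           sum((b - mean_y) ** 2 for b in y))

# Mean values from Table 5.7, in the order Alt. Model, Model, No model.
mean_r = [0.851, 0.835, 0.748]
mean_rmse = [19.66, 19.74, 24.12]
mean_mos = [3.96, 3.82, 2.90]

across_models = {
    "(r,rmse)": pearson_r(mean_r, mean_rmse),
    "(rmse,MOS)": pearson_r(mean_rmse, mean_mos),
    "(r,MOS)": pearson_r(mean_r, mean_mos),
}
for pair, value in across_models.items():
    print(f"{pair}: {value:.3f}")
```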

In conclusion, the perceived naturalness, concerning segmental durations, of two paragraphs produced with the same model cannot be compared through their correlation coefficients or rmse. However, the general naturalness of a model, concerning segmental durations, can be evaluated by its rmse or, even better, by its correlation coefficient measured along several paragraphs.


5.3 Perceptual Test of F0 Models

In this test, stimuli were produced with different sub-models of the complete intonation model. Several stimuli were produced with F0 modifications according to the sub-models described in chapter 4, and others with the F0 model applied over the duration models presented in chapter 3. The perceptual test aims to answer several questions:

• How far from natural is the estimated F0 contour?

• How much naturalness does the prediction of a new AC cost?

• How much naturalness does the prediction of a set of ACs cost?

• How much naturalness does the prediction of the PCs cost?

• How much naturalness does the prediction of the whole F0 contour cost?

• How natural is the duration + F0 model?

• Do less emphatic accent components sound better?

The objective is to measure the loss of naturalness introduced by each component of the prosody model and evaluate the quality concerning naturalness of the F0 model and the complete prosody model (durations + F0).

The standard category-judgment test was followed. This test requires the presentation of scale references, specifically references of excellent and unsatisfactory. The reference of excellent was the original recorded sound. An unsatisfactory reference for the F0 contour is hard to define, so it was decided to produce a flat F0 contour at the average F0 value (103 Hz) and use it as the unsatisfactory reference. Apart from the reference stimuli presented to the subjects at the beginning of the evaluation of each paragraph, two of the stimuli to be evaluated were copies of the references. The original was named “1- Natural” and the one with constant F0 was named “0- No model”.

Besides these two reference stimuli, seven more stimuli were presented to the 19 subjects, for a total of 9 stimuli for each of the 5 paragraphs. The other 7 stimuli are:

2. Durations - Modified durations according to predicted duration by the alternative model presented in chapter 3;

3. Estimated F0 – Modified F0 contour imposing estimated F0 based on manually estimated Fujisaki commands as presented in 4.3;

4. Predicted ACs based on estimated ACs and PCs – Modified F0 contour imposed by predicted ACs. The AC parameters of each syllable are predicted by the ANNs using features determined from the estimated ACs and PCs, instead of from previously predicted ACs and PCs. Any badly predicted accent component is therefore due to its own prediction and not to bad previous predictions, since errors of previous predictions do not propagate;


5. Predicted ACs with estimated PCs – Modified F0 contour imposed by predicted ACs. In this case, estimated PCs are used, and the features concerning previous ACs are determined from the ACs predicted for the preceding syllables. In contrast to the previous stimulus, a badly predicted AC can influence the prediction of the current AC through the features concerning previous ACs. However, no prediction of PCs is used;

6. F0 Model – Modified F0 contour imposed by predicted PCs and ACs. The F0 contour is predicted entirely from the text. The complete model presented in chapter 4 is applied;

7. Durations + F0 model with 0.75*Aa – Durations modified according to the durations predicted by the alternative model, and then F0 contour modified according to the F0 model, with the AC amplitudes multiplied by 0.75 in order to de-emphasise the accent components. These stimuli were produced because several preliminary tests gave the impression of over-emphasised syllables, and a general reduction of the AC amplitudes seemed to reduce this impression;

8. Duration + F0 Model – Durations modified according to the durations predicted by the alternative model, and then F0 contour modified according to the F0 model. This corresponds to the complete developed model.

Table 5.9: Portuguese and respective translation of the 5 paragraphs used in the perceptual test.

| Parag. number | Text | Translation |
|---|---|---|
| 1 | Acusar os trabalhadores é uma chocante demonstração de que afinal a justiça não está cima das classes sociais. | Accusing the workers is a shocking demonstration that, at the end, justice is not above social classes. |
| 2 | Conhece a situação na pele. Aprendeu-a na idade em que se aprende e se não esquece. | Knows the situation on the skin. Learned it in the ages when we learn and don’t forget. |
| 3 | Por que é então que se não há-de regulamentar o problema? Segundo o meu amigo, bastariam autorizações de imigração temporárias. Como quiserem, mas tudo menos negar a estes homens a Carta dos Direitos do Homem que tanto proclamam defender. | Why do no regulate the problem? By the way of my friend, temporary immigration authorizations would be enough. As they want, but everything less than denied to these mans the Letter of Humans Rights which they so much claim to defend. |
| 4 | Mestre Cabrita, de 76 anos, martela, com gestos certeiros, a chapa de cobre que há-de tomar o jeito de uma panela que depois ninguém vai comprar. | Expert Cabrita, of 76 years old, strikes, with accurate gestures, the copper leaf that has to assume a pot shape that later no one’s gone buy. |
| 5 | Hoje, uma peça de cobre serve para decorar uma casa, fica feita num instantinho e é mais barata. “Mas não é tão perfeita e bonita”, protesta mestre Cabrita. | Today, a copper piece is to stand as a house interior decoration, is made very quickly and is cheaper. “But is not so perfect and beautiful”, claim expert Cabrita. |

Table 5.10: Objective measurements of each stimulus by paragraph. For each paragraph the first line represents the correlation coefficient (r) and the second line the rmse.

Parag.  Measure  Durations  Estimated  Pred. ACs      Pred. ACs  F0 Model  Durations +   Durations +  No model
number                      F0         based on est.  with est.            F0 Model      F0 Model
                                       ACs and PCs    PCs                  with 0.75*Aa
1       r        0.869      0.953      0.267          0.306      0.530     0.482         0.503        -
        rmse     13.8 ms    5.2 Hz     19.3 Hz        21.8 Hz    16.1 Hz   17.9 Hz       19.0 Hz      22.4 Hz
2       r        0.882      0.964      0.693          0.554      0.528     0.605         0.639        -
        rmse     26.0 ms    5.2 Hz     14.7 Hz        16.2 Hz    16.8 Hz   16.8 Hz       16.9 Hz      23.7 Hz
3       r        0.830      0.979      0.734          0.621      0.293     0.377         0.380        -
        rmse     18.1 ms    3.8 Hz     12.9 Hz        15.2 Hz    19.1 Hz   18.4 Hz       19.2 Hz      20.1 Hz
4       r        0.801      0.969      0.770          0.756      0.647     0.627         0.594        -
        rmse     21.2 ms    3.0 Hz     11.9 Hz        12.9 Hz    11.3 Hz   11.3 Hz       11.7 Hz      13.3 Hz
5       r        0.892      0.971      0.585          0.515      0.433     0.461         0.481        -
        rmse     14.2 ms    3.7 Hz     13.2 Hz        15.2 Hz    17.4 Hz   17.4 Hz       14.6 Hz      15.9 Hz

Listeners were informed about the type of modifications introduced in the original sound and asked to concentrate on intonation acceptability. For each paragraph, the references for excellent and unsatisfactory were presented, and then all the stimuli of the paragraph were presented in random order so that listeners could get a first impression of them. Then the nine stimuli were presented in the same order in a blind test, with listeners not knowing whether they were listening to the original or to a manipulated version. Subjects were asked to classify each stimulus on a scale from 1 to 5 (1 - Unsatisfactory, 2 - Poor, 3 - Fair, 4 - Good, 5 - Excellent). The test sheet used in the test is part of the appendix.

Nineteen subjects participated in the perceptual test, 7 female and 12 male. Seven subjects had already participated in the previous perceptual test.

Table 5.9 presents the text paragraphs used in the perceptual test. These paragraphs, taken from several pieces of news, belong to the test set of the database, not used in training. Mainly declarative sentences were used. The third paragraph starts with an interrogative, and the fifth paragraph contains one quotation.

Table 5.10 presents the measured correlation coefficient and rmse (compared to the original) for each stimulus in each paragraph. In the case of the “No model” stimuli, the correlation coefficients were not determined because F0 has a constant value. The F0 Model produced a correlation coefficient varying from 0.29 to 0.65 and an rmse varying from 11 to 19 Hz. The complete model produced a correlation coefficient varying from 0.38 to 0.64 and an rmse varying from 12 to 19 Hz. The complete model with Aa*0.75 has a correlation coefficient from 0.38 to 0.63 and an rmse from 11 to 18 Hz.
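As a minimal sketch of the two objective measurements, the following computes the rmse and the correlation coefficient between an original and a modified F0 contour sampled at the same instants. How the contours were sampled and aligned in this work is not restated here, so the sample values and function names are hypothetical. The sketch also shows why no correlation coefficient exists for a constant F0 contour.

```python
import math


def rmse(reference, predicted):
    """Root mean square error between two equally long sequences."""
    n = len(reference)
    return math.sqrt(sum((p - r) ** 2 for r, p in zip(reference, predicted)) / n)


def correlation(x, y):
    """Pearson correlation coefficient, or None when one sequence is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # zero variance: the coefficient is undefined
    return sxy / math.sqrt(sxx * syy)


# Hypothetical F0 samples in Hz.
original_f0 = [110.0, 120.0, 130.0, 125.0, 115.0, 105.0]
predicted_f0 = [112.0, 118.0, 126.0, 127.0, 117.0, 108.0]
flat_f0 = [117.5] * len(original_f0)  # constant F0, as in the "No model" stimuli

r_predicted = correlation(original_f0, predicted_f0)
rmse_predicted = rmse(original_f0, predicted_f0)
r_flat = correlation(original_f0, flat_f0)
rmse_flat = rmse(original_f0, flat_f0)
```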

Fig. 5.7 shows the average opinion values of each listener for the 9 stimuli. Each bar is the aver-age of five opinions.

[Figure 5.7: bar chart of the average opinion value (vertical axis, 0 to 5) given by each of the 19 subjects (horizontal axis) to each type of stimulus. Legend: No model; Natural; Durations; Estimated F0; Predicted ACs based on estimated ACs and PCs; Predicted ACs with estimated PCs; F0 Model; Durations + F0 Model with 0.75*Aa; Durations + F0 Model.]

Fig. 5.7 – Average opinion values for each subject in the 9 stimuli.

[Figure 5.8: bar chart of the average opinion value (vertical axis, 0 to 5) for each of the 5 paragraphs (horizontal axis) and each type of stimulus. Legend: No model; Natural; Durations; Estimated F0; Predicted ACs based on estimated ACs and PCs; Predicted ACs with estimated PCs; F0 Model; Durations + F0 Model with 0.75*Aa; Durations + F0 Model.]

Fig. 5.8 – Average opinion values for each paragraph in the 9 stimuli.

The duration model stimuli confirm their closeness to the original stimuli. In this test they were evaluated at the same level as the original by 5 subjects. The stimuli produced with the estimated F0 contour were rated better than the original by 3 subjects, and at the same level by another subject. The three types of stimuli with predicted F0 contour were at a similar level, with different subjects preferring different types. For all subjects, these types of stimuli remain at a lower level than the original, estimated F0 and duration model stimuli. The complete model without reduction of Aa was preferred to the complete model with reduction of Aa by 8 subjects, the opposite held for 6 subjects, and both were equally evaluated by 5 subjects. These stimuli remain at a lower level than the original, estimated F0 and duration model stimuli. The application of the F0 model over the duration model was preferred to the F0 model alone by 3 subjects, but the modification of durations generally imposes a slight decrease in naturalness. “No model” stimuli remain at a much lower level than any other stimuli.

Fig. 5.8 shows the average opinion values for the 9 stimuli in each paragraph. Each bar is the average of 19 opinions. Again, the stimuli produced with the duration model and with estimated F0 received opinions very close to those of the original stimuli, being even preferred in the third paragraph. The stimuli with the three levels of predicted F0 (stimuli 4, 5 and 6) remain at a lower level of opinion than the stimuli produced with estimated F0. No significant difference seems to exist between the three levels of predicted F0. The stimuli produced with the duration and F0 models, with or without Aa reduction, remain at a slightly lower level than the ones produced with predicted F0. The model with reduced Aa is slightly preferred in 3 paragraphs, while the model without reduction is preferred in the other two paragraphs. Stimuli produced with “No model” remain at a very low level compared with the others.

Table 5.11 displays the MOS and respective standard deviation for the 9 types of stimuli. Each MOS value is the average of 95 opinions (19 subjects and 5 paragraphs). The analysis of variance for all types of stimuli resulted in a confidence level of practically 100% (p < 10^-12 for F = 214), showing an evident dependency of the results on the type of stimulus.

Table 5.11: Mean Opinion Score (MOS) and standard deviation of the perceptual test.

Stimulus                                             MOS    Std
1 - Natural                                          4.61   0.57
2 - Durations                                        4.20   0.76
3 - Estimated F0                                     4.38   0.64
4 - Pred. ACs based on est. ACs and PCs              3.31   0.74
5 - Predicted ACs with estimated PCs                 3.14   0.80
6 - F0 Model                                         3.09   0.73
7 - Durations + F0 Model with 0.75*Aa                2.83   0.73
8 - Durations + F0 Model                             2.87   0.76
0 - No model                                         1.24   0.46
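The analysis of variance mentioned above can be sketched as a plain one-way ANOVA, where each group holds the opinions given to one type of stimulus. The exact design used in this work is not detailed in this section, so the sketch below is an assumption, and the scores are toy values rather than the collected opinions. The p value would follow from the F distribution with (k - 1, N - k) degrees of freedom and is not computed here.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of score lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the scores around their own group mean.
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, group_means))
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Toy opinion scores for three hypothetical types of stimuli.
toy_scores = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_value = one_way_anova_f(toy_scores)
```

A large F value, as reported for the 9 types of stimuli, means that the differences between the group means are large compared with the spread of opinions inside each group.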

Fig. 5.9 illustrates the analysis of the subjects’ opinions for each type of stimulus over 95 opinions. Mean Opinion Scores are represented by a thick black line. Blue boxes represent the lower and upper quartiles. Red lines represent the median score. Minimum and maximum values are represented by the thin black lines. Red plus signs represent outliers.

[Figure 5.9: box plot of the opinion scores (vertical axis, 1 to 5) for stimuli 0 to 8 (horizontal axis).]

Fig. 5.9 – Analysis of opinion scores by stimuli. Stimuli from 0 to 8 correspond to: 0 – No model; 1 – Natural; 2 – Durations; 3 – Estimated F0; 4 – Predicted ACs based on estimated ACs and PCs; 5 – Predicted ACs with estimated PCs; 6 – F0 Model; 7 – Durations + F0 model with Aa*0.75; 8 – Durations + F0 model.

Table 5.12: Significance level, p, between pairs of stimuli, with the respective confidence level in parentheses where given. Stimuli from 0 to 8 have the same correspondence as the ones in Fig. 5.9.

     0    1    2        3             4    5             6             7             8
0    -    0    0        0             0    0             0             0             0
1         -    <0.001   0.0095 (99%)  0    0             0             0             0
2              -        0.0879 (91%)  0    0             0             0             0
3                       -             0    0             0             0             0
4                                     -    0.1334 (86%)  0.04 (96%)    <0.001        <0.001
5                                          -             0.6355 (36%)  0.0051 (99%)  0.0157 (98%)
6                                                        -             0.0145 (98%)  0.0409 (96%)
7                                                                      -             0.7266 (27%)
8                                                                                    -

The original stimuli and the ones produced with the duration model and with estimated F0 have 75% of their opinions at level 4 or above, and a minimum of 3. The original has a MOS of 4.6, estimated F0 a MOS of 4.4 and the duration model a MOS of 4.2. Stimuli produced with predicted F0 contours using different blocks of the F0 model (stimuli 4, 5 and 6) have almost all their opinions between 3 and 4. The F0 model even has more than half of its opinions at level 3. Their MOS are 3.3, 3.1 and 3.1, respectively, for the model with predicted ACs using estimated ACs and PCs, the model with predicted ACs using estimated PCs, and the F0 model. The stimuli produced with the F0 model over the duration model have ¾ of their opinions between 2 and 3. Their MOS are 2.9 and 2.8, respectively, for Aa without and with reduction. “No model” has almost all its opinions at level 1.

Table 5.12 displays the significance level between each pair of types of stimuli, obtained by analysis of variance. Each cell presents the significance level, p, and, for selected pairs, the respective confidence level given by Eq. (5.1). Cells with an orange background in the original table signal a low confidence level, meaning strong evidence to accept the hypothesis that the stimuli are the same. In contrast, the other cells present a sufficient level of confidence to reject the hypothesis that the levels are the same. In conclusion, the difference in MOS between stimuli 4 (Predicted ACs based on estimated ACs and PCs) and 5 (Predicted ACs with estimated PCs) is not significant, and the differences between stimuli 5 and 6 (F0 Model) and between stimuli 7 (Durations + F0 model with Aa*0.75) and 8 (Durations + F0 model) are not significant at all. The MOS differences for all other pairs of stimuli are significant.

5.3.1 Discussion

Natural stimuli were very well classified, within the level of Excellent.

This second perceptual test confirmed the Good acceptability of the duration model (alternative model). The stimuli produced with durations modified according to this model achieved a MOS of 4.2, very close to the natural stimuli (4.6). The distance to the natural stimuli (0.41) was similar to the one in the first test (0.27).

The MOS of estimated F0 (4.4) is at the level of Good acceptability of naturalness. The distance to the MOS of natural stimuli was very low (0.23), confirming the closeness between the original and the re-synthesis with estimated F0, as pointed out by the preliminary tests. If there is a strong correlation between rmse and MOS, as will be shown below, this result can be extended to the complete database, since the mean rmse of the paragraphs in the perceptual test (4.18 Hz) is at the same level as the rmse in the complete database (3.97 Hz).

This perceptual test proves that the F0 contour of European Portuguese can be modelled by Fujisaki’s model with high closeness to the original intonation.

The MOS for stimuli 4 (Predicted ACs based on estimated ACs and PCs) was 3.3. These stimuli were produced with the same phrase components as the ones in estimated F0; only the accent component is predicted. Concretely, these stimuli evaluate the result of the four ANNs that predict whether or not an AC is associated with each syllable, and its respective onset time, offset time and amplitude. There is no interference from previous bad predictions, since the features related to previous ACs are based on estimated ACs. The degradation in perceived naturalness introduced by this prediction is the difference between the MOS of estimated F0 (4.38) and the present MOS (3.31). This degradation of more than one MOS point is significant. This step is the one that introduces the most perceived degradation in naturalness, and deserves further development to improve the prediction of ACs.

Stimuli 5 (Predicted ACs with estimated PCs) differ from the previous stimuli only in the ANN features related to previous ACs, which in this case are determined from previously predicted ACs. The MOS is 3.1, denoting a degradation of just 0.2 in perceived naturalness, but with a very low level of confidence in the difference between these two MOS values (86%).

The previous two paragraphs give the answers to the second and third questions. The prediction of a new AC, when the previous ACs and PCs are known, has a cost in perceived naturalness of 1 in 5, and the prediction of a set of ACs has a cost of 1.3 in 5.

The prediction of PCs can be evaluated by the degradation in perceived naturalness between stimuli 5 and 6 (F0 model), because the only difference between those two stimuli is the use of predicted instead of estimated PCs. Stimuli 6 also evaluate the complete F0 model. These stimuli achieved a MOS of 3.1, at the level of Fair naturalness. Concerning the PC model, its excellent performance is evident, because the degradation in perceived naturalness was just 0.05 in 5, with no significance at all (confidence level = 36%).

Stimuli 7 (Durations + F0 model with 0.75*Aa) and 8 (Durations + F0 model) constitute the proposed final version of the prosody model, the first with reduction of the accent components and the second just as predicted by the model. The degradation in perceived naturalness introduced by the inclusion of the duration model can be evaluated by the difference between the MOS values of stimuli 6 and 8. This difference is 0.2 in 5, similar to the difference between natural and durations stimuli (0.4). The smaller difference between stimuli 6 and 8 can be explained by the fact that some badly predicted durations can be masked by poor F0 intonation. The reduction of the accent components resulted in no significant difference in MOS (0.04). Moreover, the very low level of confidence (27%), given by analysis of variance, shows no evidence that the stimuli are different. So, the reduction of accent components did not prove to improve the perceived naturalness.

Finally, the complete model (Durations + F0 model) has a MOS of 2.9, at the level of Fair naturalness.
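The degradation figures quoted in this discussion follow directly from the MOS values of Table 5.11. The short sketch below recomputes them step by step; the step labels are only descriptive names chosen for this illustration.

```python
# MOS values per stimulus number, copied from Table 5.11.
mos = {1: 4.61, 2: 4.20, 3: 4.38, 4: 3.31, 5: 3.14, 6: 3.09, 7: 2.83, 8: 2.87, 0: 1.24}

# Each step compares two stimuli that differ in a single processing block.
steps = [
    ("F0 estimation (1 -> 3)", 1, 3),
    ("duration model alone (1 -> 2)", 1, 2),
    ("AC prediction with estimated previous ACs (3 -> 4)", 3, 4),
    ("AC prediction with predicted previous ACs (4 -> 5)", 4, 5),
    ("PC prediction (5 -> 6)", 5, 6),
    ("duration model over F0 model (6 -> 8)", 6, 8),
    ("Aa reduction (8 -> 7)", 8, 7),
]

# Loss in MOS points introduced by each step, rounded to two decimals.
degradation = {label: round(mos[a] - mos[b], 2) for label, a, b in steps}

for label, loss in degradation.items():
    print(f"{label}: {loss:.2f}")
```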

5.3.1.1 Correlation between objective and subjective measurements

As in the discussion of the previous perceptual test, a discussion follows about the correlation between the objective measurements, given by the correlation coefficient and the rmse, and the subjective evaluation, given by the MOS, for modified F0 contours.

Table 5.13 presents the objective distance of the F0 contours between original and modified stimuli, and the subjective evaluation by means of the MOS, for the five paragraphs. Stimuli 2 (Durations) are not included in this discussion because their modification is in the timing domain, whereas the other stimuli have their modifications in the F0 domain. The rmse and correlation coefficient of stimuli 7 and 8 are measured in comparison to the F0 determined after the timing modifications.

The correlation between measurements along paragraphs is presented in Table 5.14. As in the case of the duration models, no significant correlation seems to exist between subjective and objective measurements. In this case, the correlation between the two objective measurements, r and rmse, is -0.79. This value denotes a significant correlation between them. So, generally, the higher the correlation coefficient, the lower the rmse between the predicted and measured F0 contours along paragraphs. This correlation was not observed for segmental durations.

Table 5.13: Indicator measurements for stimuli by paragraph.

Stim.  Measure     Parag. 1   2       3       4       5       Av.
0      rmse (Hz)   22.4       23.7    20.1    13.3    15.9    19.1
       MOS         1.5        1.2     1.2     1.1     1.2     1.2
3      r           0.953      0.964   0.979   0.969   0.971   0.967
       rmse (Hz)   5.2        5.2     3.8     3.0     3.7     4.2
       MOS         4.2        4.4     4.7     4.2     4.4     4.38
4      r           0.267      0.693   0.734   0.770   0.585   0.610
       rmse (Hz)   19.3       14.7    12.9    11.9    13.2    14.4
       MOS         3.4        3.2     3.6     3.2     3.2     3.32
5      r           0.306      0.554   0.621   0.756   0.515   0.550
       rmse (Hz)   21.8       16.2    15.2    12.9    15.2    16.3
       MOS         2.9        2.8     3.4     3.3     3.3     3.14
6      r           0.530      0.528   0.293   0.647   0.433   0.486
       rmse (Hz)   16.1       16.8    19.1    11.3    17.4    16.1
       MOS         3.0        2.9     3.1     3.4     3.1     3.1
7      r           0.482      0.605   0.377   0.627   0.461   0.510
       rmse (Hz)   17.9       16.8    18.4    11.3    17.4    16.4
       MOS         3.1        2.7     3.0     2.6     2.7     2.8
8      r           0.503      0.639   0.380   0.594   0.481   0.519
       rmse (Hz)   19.0       16.9    19.2    11.7    14.6    16.3
       MOS         3.3        2.6     2.7     3.0     2.7     2.9

Table 5.14: Correlation coefficient along paragraphs between measurement indicators.

Stim.   r(r,rmse)   r(rmse,MOS)   r(r,MOS)
0                   0.67
3       -0.69       -0.03         0.70
4       -0.93       0.26          -0.19
5       -0.95       -0.64         0.56
6       -0.90       -0.65         0.27
7       -0.76       0.61          -0.66
8       -0.51       0.01          -0.01
Av.     -0.79       0.03          0.11

Concerning the correlation between objective and subjective measurements along models, Table 5.15 presents the mean values along paragraphs of the measurement indicators r, rmse and MOS, for the different types of stimuli with modified F0. Table 5.16 presents the correlation between the measurements of the previous table along stimuli 3 to 8. The correlation between rmse and MOS, r(rmse,MOS), also considering stimuli 0, is presented in the bottom line of the table. A very strong correlation exists between each pair of parameters. Therefore, a very strong correlation exists between objective and subjective parameters when evaluating a model, as well as, again, between the two objective parameters, r and rmse. In this case, both objective parameters have the same correlation, in absolute value (0.976), with the subjective parameter. The negative values in the correlations involving rmse are due to its scale, which decreases with better closeness, in contrast to MOS and r.

Table 5.15: Mean values along paragraphs of indicator parameters.

Stim.   r       rmse   MOS
0               19.1   1.2
3       0.967   4.2    4.4
4       0.610   14.4   3.3
5       0.550   16.3   3.1
6       0.486   16.1   3.1
7       0.510   16.4   2.8
8       0.519   16.3   2.9

Table 5.16: Correlation between mean values along models of indicator parameters.

                            r(r,rmse)   r(rmse,MOS)   r(r,MOS)
r (stimuli 3 to 8)          -0.991      -0.976        0.976
r (including stimuli 0)                 -0.836
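The correlations along models can be re-derived, approximately, from the rounded means of Table 5.15. The sketch below does so for stimuli 3 to 8; because it starts from rounded values, small differences from the figures of Table 5.16 are to be expected. The helper name is hypothetical.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Rows of Table 5.15 for stimuli 3, 4, 5, 6, 7 and 8 (rounded means).
r_means = [0.967, 0.610, 0.550, 0.486, 0.510, 0.519]
rmse_means = [4.2, 14.4, 16.3, 16.1, 16.4, 16.3]
mos_means = [4.4, 3.3, 3.1, 3.1, 2.8, 2.9]

corr_r_rmse = correlation(r_means, rmse_means)
corr_rmse_mos = correlation(rmse_means, mos_means)
corr_r_mos = correlation(r_means, mos_means)

print(f"r(r,rmse)   = {corr_r_rmse:.3f}")
print(f"r(rmse,MOS) = {corr_rmse_mos:.3f}")
print(f"r(r,MOS)    = {corr_r_mos:.3f}")
```

The signs match the explanation above: correlations involving rmse are negative, while r and MOS grow together.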

In conclusion, as with the duration models, the perceived naturalness of two paragraphs produced with the same F0 model cannot be evaluated by comparing their own correlation coefficients or rmse. But the general naturalness of an F0 model, as with the duration models, can be evaluated by its rmse or correlation coefficient measured along several paragraphs.

5.4 Conclusion

In general, the perceptual tests confirmed the objective results. Moreover, a high correlation was found between perceived naturalness and the rmse and correlation coefficient of segmental durations or F0 contours measured along several paragraphs. However, the perceived naturalness of two paragraphs produced with the same duration model or F0 model cannot be evaluated by comparing their own correlation coefficients or rmse.

Concerning the proposed duration models, the perceptual tests confirmed the improved results achieved by the use of dedicated ANNs. In view of the results, the alternative model, at the level of Good in the MOS scale, was selected for further developments.

The second perceptual test proved that the F0 contour of European Portuguese can be modelled by Fujisaki’s model with high closeness to the original intonation.

The F0 model achieved the level of Fair in the MOS scale, although a significant reduction in naturalness was perceived. The test was designed to separate the loss in naturalness after the application of each sub-model. Almost all the loss in naturalness was perceived after the AC model, and no additional significant loss was felt after the PC model. These results may indicate that the PC model is of rather good quality and that the AC model needs to be improved. However, the discussion in section 6.3 recommends a more detailed error contribution analysis.

The complete proposed model (durations + F0) achieves the level of Fair in the MOS scale.

6 Conclusions and Future Work

This chapter closes the thesis by making some observations about the time-consuming tasks and briefly presenting the issues documented in the previous chapters and their detailed conclusions. A discussion follows about the error contribution of the sub-models of a model, and a summary of the conclusions. The section on future work points out some possibilities to improve the proposed model and the way to be followed in the near future.

6.1 General Observations about the Tasks

This section, not usual in conclusion chapters, points out the very laborious tasks behind the visible modules, not reported before, but with an essential role in the proposed model. It is intended to give an idea of the work underlying the presented developments.

Several constraints, not identified at the beginning, appeared during the work and were successfully overcome with some additional working time. In particular, some tasks were very time consuming. The effort put into those tasks is not reflected in the previous descriptions. Those time-consuming, but not reported, tasks are mentioned here:

• construction of the speech database FEUP-IPB – the task consisted of preparing the corpus, finding a skilled professional speaker, recording the signal waveform, conveniently editing and storing the corresponding files, manually labelling the speech wave files at the phonetic, word and phrasal levels, and finally identifying the inevitable errors and mistakes and correcting them. This was one of the most time-consuming tasks due to the size of the speech database;

• estimation of the Fujisaki model parameters – this task consisted of creating a tool that allows an easy and intuitive way of manually estimating the PCs and ACs, and of manually estimating those parameters in all the tracks used;

• training and refinement of ANNs – a huge number of different ANNs were trained hundreds of times. Combinations of different types of ANNs, numbers of layers, numbers of nodes per layer, their respective activation functions, training algorithms, and different sets of features and their codifications were tested in order to accurately select the best architecture. Each candidate for best architecture was trained hundreds of times with different random seeds. Each seed leads to a good final solution, but they are all different. In any case, the best solutions of different sessions have very similar performance;

• programming the extraction of features – hundreds of features were used in the duration and F0 ANN models. Programming the automatic extraction of those features from the labelled files and testing the results was also a very laborious task;

• extraction of reported results – all the reported intermediate and final results were obtained with routines developed for the particular purpose of obtaining those measures over the whole database;

• development of visualization tools – special tools were developed to allow the visualization of the signal waveform, F0, commands, text, syllables, phones and other labels. Several figures presented in chapter 4 were produced with those tools;

• publication of papers – several scientific papers were published reporting several parts of this work. As is well known, the process of writing a scientific paper, preparing the presentation and presenting it at a scientific meeting takes several weeks of work. Despite the well-known richness of the knowledge acquired in this process, the total time taken by several publications is significant in a PhD task schedule.

Other tasks are visible in the main document, such as the writing of the document itself, and several other minor time-consuming tasks were also carried out. All those tasks together support this PhD thesis.

6.2 General Conclusions

This work consisted of the development of a prosody model for TTS systems for EP, using mainly ANNs. The read speech style was adopted.

The duration and F0 models, presented in chapters 3 and 4 respectively, involve several preparatory works and previous processing modules that were presented in chapter 2. A preliminary study of tonic syllable characteristics is also reported in chapter 2. A final evaluation of the model and its components was performed by means of perceptual tests and reported in chapter 5.

6.2.1 Preparatory work

Section 2.2 of chapter 2 presents a preliminary study of the changes in the prosodic features F0, syllable duration and intensity in the tonic syllable, depending on its position in the word and in the phrase. A short corpus read by three speakers was used. Several measurements of duration, F0 and intensity were made in the tonic syllable and in one neighbouring syllable, taken as the reference syllable. The measurements were used to determine the variation of F0 and duration of the tonic syllable relative to the reference one, and the variation of intensity in the tonic syllable. Some relevant variations of F0, duration and intensity in the tonic syllable were reported as a function of its position in the word (beginning, middle and end) for words in initial, medial and final position in the phrase and for isolated words. Beyond the particular variation values reported in detail, general trends were observed. An interesting contrast between the trends of relative duration and relative intensity was observed in the tonic syllable as its position changes from the beginning to the end of the word: while the relative duration has an increasing trend, the relative intensity has a decreasing trend. The relative variation of F0 tends to decrease regularly from the initial to the final position in the phrase. Finally, the features vary widely for words in the final position of the phrase.

The results of this study were not directly used in the proposed prosody model, because the model depends on more features. Nevertheless, the study had an important role in guiding the studies that followed.

The labelled speech database of EP that was created is an important resource, not available at the beginning of the work. Section 2.3 reports the FEUP-IPB speech database, specially developed for this work. The speech database consists of several tracks read by a skilled professional speaker, totalling approximately 100 minutes. The speech files were labelled at the phonetic, word and phrase levels. Later, the parameters of the Fujisaki model, PCs and ACs, were estimated in 101 paragraphs. Some phonetic statistics were reported. Several phonetic change phenomena found in the database, such as dialectal and contextual changes, were also reported. This database was used in the whole prosodic study.

Section 2.4 reports a module developed to run before the prosody model. This module splits words into syllables. Two algorithms were proposed, one to split written text and the other to split the ‘spoken text’, that is, the transcribed sequence of phonemes. Both algorithms are based on considering only syllables of the types V, VC, VCC, CV, CVC, CCV and CCVC as admissible in EP, plus a small additional set of rules. The second algorithm also considers syllables of the types C and CC, admitting that original types CV, CVC and CCV suffered vowel reduction. The error rates measured in a text not seen during development were 0.06% and 0.89% per division, respectively. The second algorithm understandably has a higher error rate than the first one, because of the additional difficulty of dealing with vowel reduction, which is very frequent in EP. Nevertheless, both solutions reached acceptably low error rates.
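The admissible syllable types listed above can be illustrated with a small check that maps a candidate syllable to its consonant/vowel pattern. This is only a sketch of the admissibility test, not the splitting algorithms of section 2.4: the vowel set, the treatment of a vowel sequence as a single nucleus, and the function names are assumptions, and digraphs such as <ch> or <nh> are not handled.

```python
# Syllable types taken from the text; the spoken set adds the vowel-reduced types.
ADMISSIBLE_WRITTEN = {"V", "VC", "VCC", "CV", "CVC", "CCV", "CCVC"}
ADMISSIBLE_SPOKEN = ADMISSIBLE_WRITTEN | {"C", "CC"}

# Assumed set of written vowels for this illustration.
VOWELS = set("aeiouáàâãéêíóôõú")


def cv_pattern(syllable):
    """Map a syllable to its C/V pattern, merging consecutive vowels into one V."""
    pattern = []
    for ch in syllable.lower():
        symbol = "V" if ch in VOWELS else "C"
        if symbol == "V" and pattern and pattern[-1] == "V":
            continue  # assumption: a vowel sequence counts as a single nucleus
        pattern.append(symbol)
    return "".join(pattern)


def is_admissible(syllable, spoken=False):
    """Check a candidate syllable against the admissible types."""
    allowed = ADMISSIBLE_SPOKEN if spoken else ADMISSIBLE_WRITTEN
    return cv_pattern(syllable) in allowed


examples = ["pa", "tra", "mão", "pr", "str"]
written_check = {s: is_admissible(s) for s in examples}
spoken_check = {s: is_admissible(s, spoken=True) for s in examples}
```

In this sketch a candidate such as "pr" is rejected for written text but accepted for the spoken sequence, where it can result from vowel reduction.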

Section 2.5 presents several proposals of sets of rules to convert graphemes into phonemes. This task is not part of the prosody model but is also very important to improve the naturalness of synthetic speech. The presented rules are not exhaustive, considering that most graphemes have well-known and stable rules. Only the graphemes <a>, <e>, <o> and <x> deserved special attention. For those graphemes an enlarged set of rules and the respective exceptions are documented. The process of phonetic transcription from text in FEUP-TTS is also described; it consults a table of exceptions before applying the rules. The measured error rates for the graphemes <a> and <x> were 0.34% and 3.4% per phoneme, respectively. The larger error rate in the case of grapheme <x> reflects the large number of unpredictable situations in the production of that grapheme’s sound. No error rate measurements were made for the other graphemes, but a large error rate is expected because of the large number of phonemes into which those graphemes can be converted. Those errors are eliminated by including the detected error situations in the table of exceptions. The problem of homograph words remains unsolved with the table of exceptions; for those cases, morphological and contextual information is needed. Post-lexical or co-articulation rules, to be applied after the grapheme-phoneme conversion rules, were also presented. Those rules are intended to reduce the unnatural distance between the formal lexical transcription of the text and the phonetic sequence usually produced in natural speech, which is justified by co-articulation effects.

6.2.2 Timing

Chapter 3 describes the segmental duration model. Two alternative models are proposed based on the same concepts. The first one uses one ANN to predict the duration of any segment. The sec-ond alternative uses one dedicated ANN for each segment type using the same set of features pro-posed in the first alternative. A preliminary model to insert and predict pauses is also proposed.

The chapter starts by describing the state of the art in segmental duration models. The development of the first proposed model then consisted in selecting the architecture of the ANN, the training algorithm, and the set of features and their codification. The architecture of the ANN was selected by experimenting with all relevant alternatives and rejecting the ones that produced poor results. For the ones with the best performance, several hundred training sessions with random initial weights were performed in order to get the very best performance. The architectures with better performance were then tested interactively with the different sets of features. The set of features was selected by initially including the features with a relevant correlation coefficient with the output, and then measuring the final performance with and without each feature or group of features. The architecture finally selected was a feed-forward ANN with 99 nodes in the input layer, 4 nodes in the first hidden layer activated with the hyperbolic tangent transfer function, 2 nodes in the second hidden layer activated with the hyperbolic logarithmic function, and one node in the output layer activated by a linear transfer function. The ANNs were trained with the Levenberg-Marquardt back-propagation algorithm. The features of the final set can be grouped in three levels of relevance:

• very relevant features: identity of segment;

• relevant features: position in relation to the tonic syllable; type of the vowel of the syllable; position relative to the end of the accent group and to the end of the phrase; distance to the next pause; position of the accent group in the phrase; identity of the previous segment; identity of the next three segments;

• slightly relevant features: type of syllable; type of previous syllable; type of vowel of the previous and next syllables; position relative to the beginning of the accent group and to the beginning of the phrase; length of the accent group; position of the accent group in the phrase from the beginning; suppression or not of the final vowel.

Including any single one of the slightly relevant features in the set of features does not improve the final performance. However, including several slightly relevant features together does improve the final performance.
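The 99-4-2-1 architecture described above can be sketched as a plain forward pass. This is only an illustration under stated assumptions: the "hyperbolic logarithmic" function is assumed here to be the log-sigmoid, the weights are random placeholders rather than trained values, and the names `dense`, `init_layer` and `predict_duration` are hypothetical.

```python
import math
import random

def logsig(x):
    """Log-sigmoid. ASSUMPTION: stands in for the 'hyperbolic logarithmic' function."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W.x + b) for every output node."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Random placeholder weights (hypothetical; the thesis trains them with
    Levenberg-Marquardt, which is not reproduced here)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.1, 0.1) for _ in range(n_out)]
    return weights, biases

def predict_duration(features, layers):
    """Forward pass: tanh hidden layer, log-sigmoid hidden layer, linear output."""
    h1 = dense(features, *layers[0], math.tanh)
    h2 = dense(h1, *layers[1], logsig)
    out = dense(h2, *layers[2], lambda v: v)
    return out[0]

rng = random.Random(0)
layers = [init_layer(99, 4, rng), init_layer(4, 2, rng), init_layer(2, 1, rng)]

features = [0.0] * 99   # one coded feature vector for a single segment
features[3] = 1.0       # e.g. one active node of the segment identity code
y = predict_duration(features, layers)
```

With untrained weights the output `y` has no meaning as a duration; the sketch only shows how the layer sizes and transfer functions fit together.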

Several other types of features concerning linguistic and contextual information were considered but did not prove relevant enough. Special attention was paid to the codification process in order to maximize performance while reducing the number of input nodes. In this way, features like position in relation to the tonic syllable, syllable type and type of the vowel of the syllable were each codified in just one node, without loss in performance. The values for those features are taken from a table that was built considering the correlation of their different possibilities with the output. However, the identity of the segment, for instance, was coded in 44 nodes, because any codification with a lower number of nodes reduces the final performance.
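The two codification styles mentioned above can be contrasted in a short sketch: one node per segment type for the identity (44 nodes) versus a single node whose value comes from a lookup table. The segment labels and the table values below are hypothetical placeholders, not the ones used in the thesis.

```python
# Hypothetical placeholder labels for the 44 segment types.
SEGMENTS = [f"seg{i:02d}" for i in range(44)]

# Hypothetical single-node code for syllable type; in the thesis the values come
# from a table built from correlations with the output, which is not shown here.
SYLLABLE_TYPE_VALUE = {"V": 0.1, "CV": 0.4, "CVC": 0.7}

def one_hot(segment):
    """Code the segment identity in 44 nodes, exactly one of them active."""
    vec = [0.0] * len(SEGMENTS)
    vec[SEGMENTS.index(segment)] = 1.0
    return vec

def encode(segment, syllable_type):
    """Concatenate the 44-node identity code with a one-node syllable-type code."""
    return one_hot(segment) + [SYLLABLE_TYPE_VALUE[syllable_type]]

vector = encode("seg05", "CV")
```

The single-node code keeps the input layer small, which is the trade-off the paragraph above describes.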

The unusual consideration of a large number of features contributed significantly to improving the final results.

The final results on a test set, comparing the predicted values with the measured ones as produced by the speaker, were a standard deviation of 19.46 ms and a correlation coefficient of 0.839. A statistical analysis of the prediction error shows that 75% of segments have an error below 20 ms, 90% an error below 30 ms and 95% an error below 40 ms.
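The kind of objective evaluation reported above can be sketched as follows. The durations are toy values invented for illustration, and the function names are hypothetical; it is assumed here that the standard deviation refers to the prediction error.

```python
import math

def error_std(measured, predicted):
    """Standard deviation of the prediction error (population form)."""
    errors = [p - m for m, p in zip(measured, predicted)]
    mean = sum(errors) / len(errors)
    return math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))

def pearson_r(x, y):
    """Linear correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def share_within(measured, predicted, limit_ms):
    """Fraction of segments whose absolute error is below limit_ms."""
    hits = sum(1 for m, p in zip(measured, predicted) if abs(p - m) < limit_ms)
    return hits / len(measured)

# Toy durations in ms (hypothetical, not thesis data).
measured = [50, 80, 120, 60]
predicted = [55, 70, 100, 62]

sd = error_std(measured, predicted)
r = pearson_r(measured, predicted)
within_20 = share_within(measured, predicted, 20)
```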

An alternative model was proposed using basically all attributes of the previous one, namely the basic architecture of the ANN and the set of features, but using one dedicated ANN for each type of segment. This alternative model has the advantage that each segment is predicted with an ANN trained only with segments of its own type, excluding the effects of other types of segments. The disadvantage is that the knowledge of other types of segments is not used in the training process of the dedicated ANN. Is the information from other types of segments useful for a different type of segment? The final objective results of this alternative model indicated that this information is not useful and should not be used.

The set of features of the alternative model is the same as that of the previous model, excluding the identity of the segment. Concretely, the final alternative model consists of 44 ANNs with 55 nodes in the input layer and the same hidden and output layers as the ANN of the previous model.

The final results of this alternative model on the same test set were a standard deviation of 18.2 ms and a correlation coefficient of 0.861. A statistical analysis of the prediction error shows that 75% of segments have an error below 18 ms, 90% an error below 30 ms and 95% an error below 37 ms.

The comparison of the standard deviations of the measured and predicted durations for both models, as well as some observations, confirms the lower dispersion of the values predicted by the models, as expected from a statistical model. The models have more difficulty in predicting the very long durations of segments. An analysis of predicted duration by phoneme (segment type) showed that the maximum predicted values are always lower than the measured ones and that the minimum predicted values are always higher than the measured ones. This shows once again the narrower range of the durations predicted by the statistical model.

The objective evaluation of both proposed models is at the same level as state-of-the-art models for other languages.

In chapter 5 a perceptual test was presented comparing both models with the natural speech produced by the speaker and with a ‘model’ called ‘no model’, which imposes on each segment a duration equal to the average duration of its segment type. The following conclusions resulted from the comparison of the MOS over 100 evaluations of each model:

• The alternative model was slightly preferred over the first proposed model, with an MOS of 3.93 against 3.78. Both models achieved the level of Good acceptability on the MOS scale. However, the low confidence level between the opinions on both models shows no significant evidence that the models are different.

• Two original stimuli of each sentence were presented to the subjects. Their MOS were 4.13 and 4.27, that is, at the level of Good acceptability on the MOS scale. The distance from the alternative model stimuli to the average of the original stimuli was 0.27. This closeness is confirmed by the very low confidence level between the alternative model and one of the original stimuli. The first proposed model stimuli are 0.42 away from the original stimuli.

• ‘No model’ stimuli had an MOS of 2.88, which is 0.90 and 1.05 below the first proposed model and the alternative model, respectively. Although the ‘no model’ stimuli were never preferred over the other stimuli, for any subject or paragraph, they are still at the Fair level on the MOS scale.

The perceptual tests confirmed the good acceptability of both proposed segmental duration models.

The slight preference for the alternative model justified its selection for further developments in the final proposed prosody model.
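The ‘no model’ baseline used in the perceptual test, which gives every segment the average duration of its type, can be sketched as below. The corpus entries are invented toy values and the function names are hypothetical.

```python
from collections import defaultdict

def mean_duration_by_type(corpus):
    """Average measured duration (ms) for each segment type in the corpus."""
    totals = defaultdict(lambda: [0.0, 0])
    for segment, duration in corpus:
        totals[segment][0] += duration
        totals[segment][1] += 1
    return {segment: total / count for segment, (total, count) in totals.items()}

def no_model(segments, means):
    """Baseline: impose on each segment the average duration of its type."""
    return [means[segment] for segment in segments]

# Toy labelled corpus: (segment type, measured duration in ms). Hypothetical data.
corpus = [("a", 80.0), ("a", 100.0), ("s", 120.0)]
means = mean_duration_by_type(corpus)
baseline = no_model(["s", "a", "a"], means)
```

Because the baseline ignores all context, every occurrence of a segment type gets the same duration, which is what the duration models are meant to improve upon.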

Also in chapter 3, a preliminary intra-paragraph pausing model is proposed. It consists of pause insertion rules and pause duration prediction with one ANN.

About 70% of pauses are imposed by orthographic punctuation marks; the other 30% occur between words and are usually associated with prosodic phrasing. Only the pauses associated with punctuation marks were studied for pause insertion, because of the absence of syntactic information to determine the semantic group boundaries.

The statistical analysis of the database showed that the sentence marker “.” always imposes the insertion of a pause. The comma, “,”, imposes the insertion of a pause in 65% of cases. Other punctuation markers, such as “?”, “!”, “;”, “:” and “(”, seem always to impose a pause, but no statistical significance exists. Finally, the marker “"” imposes a pause only 20% of the time.
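The observed rates above can be turned into a simple insertion rule, sketched below. The rates are the ones quoted in the text; the thresholding rule itself is an assumption made for illustration and is not claimed to be the rule used in FEUP-TTS. The names `PAUSE_RATE` and `insert_pause` are hypothetical.

```python
# Observed share of occurrences followed by a pause, as quoted in the text.
# The rates for ? ! ; : ( are stated without statistical significance.
PAUSE_RATE = {
    ".": 1.00,
    ",": 0.65,
    "?": 1.00,
    "!": 1.00,
    ";": 1.00,
    ":": 1.00,
    "(": 1.00,
    '"': 0.20,
}

def insert_pause(mark, threshold=0.5):
    """ASSUMED rule: insert a pause when the observed rate reaches the threshold."""
    return PAUSE_RATE.get(mark, 0.0) >= threshold

decisions = {mark: insert_pause(mark) for mark in PAUSE_RATE}
```

A deterministic threshold always inserts a pause at a comma, which over-generates relative to the 65% observed rate; a finer rule would need the contextual information discussed in the text.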

An ANN was proposed to predict the duration of pauses. The features used consisted of the type of sentence marker associated with the previous, current and next pauses and the distance to the previous and following pauses, in a total of 17 input nodes. The achieved results were an rmse of 95 ms and a correlation coefficient of 0.54 on the test set.

Although the final results on the test set are similar to the ones achieved in other works [Navas, 2003], the model is not considered reliable, because the rmse is significant compared with the standard deviation of the measured durations. A large database, built for the purpose of pause studies, is needed. This database does not need to be phonetically labelled, but needs to have a large number of pauses.

6.2.3 Fundamental frequency

Chapter 4 presents a proposed model to predict the F0 contour based on the prediction of the Fujisaki model parameters by means of ANNs.

The chapter begins with an overview of some ways of coding the F0 contour for prosodic modulation. It continues with the concepts behind the Fujisaki model and its mathematical formulation. The effects of varying each parameter of the phrase commands (PCs) and accent commands (ACs) are analysed. Then the process of estimating the parameters and the tool developed to carry out the process are presented. After that, a model for the prediction of PCs is proposed, consisting of the algorithm to control their insertion in the text and the ANNs to predict their magnitudes and final positions. Next, a model to control the ACs is proposed. This model predicts, using ANNs, whether or not an AC is associated with each syllable, and its amplitude, onset time and offset time. Finally, the results are analysed.

Seven tracks of the FEUP-IPB speech database were separated into 101 paragraphs of variable length. The values of the baseline frequency, Fb, the natural angular frequency of the phrase control mechanism, α, the natural angular frequency of the accent control mechanism, β, and the relative ceiling level of accent components, γ, were experimentally verified to be constant for the present speaker, at 75 Hz, 2.0 /s, 20 /s and 0.9, respectively. For each paragraph, the PCs were inserted so that the phrase component crosses the lower levels of the F0 contour. Then the ACs were estimated with the initial aim of reducing the distance between the original and estimated F0 contours. A strong relation between ACs and syllables was found, so the inserted ACs were associated with the syllables. Syllables with voiced sounds usually have one AC associated. Sometimes there is no AC associated with a syllable, but two ACs are never associated with one syllable. The final rmse between estimated and original F0 was 3.98 Hz and the correlation coefficient was 0.973. Usually, no perceptible differences exist between the original utterances and the ones re-synthesised with the estimated F0 contours. Later, the perceptual test confirmed this proximity, with an MOS of 4.38 for the estimated F0 contours and 4.61 for the original ones.
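As a reminder of how these constants enter the F0 contour, the sketch below evaluates the Fujisaki model in its commonly published form, with the speaker constants quoted above (Fb = 75 Hz, α = 2.0 /s, β = 20 /s, γ = 0.9). The command values in the example are hypothetical, and the notation of chapter 4 may differ in detail.

```python
import math

FB, ALPHA, BETA, GAMMA = 75.0, 2.0, 20.0, 0.9   # constants quoted in the text

def gp(t):
    """Phrase control mechanism: impulse response, zero before the command."""
    return ALPHA ** 2 * t * math.exp(-ALPHA * t) if t >= 0 else 0.0

def ga(t):
    """Accent control mechanism: step response with relative ceiling GAMMA."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + BETA * t) * math.exp(-BETA * t), GAMMA)

def f0(t, phrase_cmds, accent_cmds):
    """F0 in Hz at time t (s).

    phrase_cmds: list of (T0, Ap); accent_cmds: list of (T1, T2, Aa).
    ln F0 = ln Fb + sum Ap*Gp(t-T0) + sum Aa*(Ga(t-T1) - Ga(t-T2)).
    """
    ln_f0 = math.log(FB)
    ln_f0 += sum(ap * gp(t - t0) for t0, ap in phrase_cmds)
    ln_f0 += sum(aa * (ga(t - t1) - ga(t - t2)) for t1, t2, aa in accent_cmds)
    return math.exp(ln_f0)

# Hypothetical commands: one PC at 0 s and one AC from 0.3 s to 0.6 s.
pcs = [(0.0, 0.5)]
acs = [(0.3, 0.6, 0.4)]
contour = [f0(i * 0.01, pcs, acs) for i in range(150)]   # 1.5 s at 10 ms steps
```

Before any command starts the contour sits at Fb, and it decays back towards Fb after the phrase component dies out.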

It is important to mention that the Fujisaki model allows some degree of freedom between the PCs and the ACs used to produce a very similar F0 pattern, but this freedom is severely reduced by using rules or linguistic constraints.

The PC model operates in two phases. The first inserts PCs in the text, associated with the beginnings of accent groups. The second predicts the magnitude and the anticipation used to determine the final exact position in the speech timing.

The first phase consists of an algorithm to insert PCs in the text. According to experimental measurements, 70% of the PCs are associated with orthographic marks; the remaining 30% have no such association. Although the percentage of PCs associated with orthographic marks is very similar to the one presented for pauses, there is no full connection between pauses and PCs. The number of pauses is higher than the number of PCs. The only eligible positions to insert PCs are the beginnings of accent groups. The algorithm starts by inserting one PC at every orthographic mark. Then it removes the PCs that are very close to the previous one. Then, by means of a weighted score, it inserts PCs in the gaps between PCs that are longer than 3 s. The score considers the following factors: distance to the previous and next PC, the presence of a pause, the length of the previous word and the type of the previous word. The weight of every factor was determined experimentally. The positions of the inserted PCs are very consistent with the positions of the labelled ones.
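The three steps of the insertion algorithm can be sketched as follows. Only the 3 s gap limit comes from the text; the minimum distance `min_gap`, the precomputed `score` field (standing in for the weighted score, whose weights are not given here) and the function name are hypothetical simplifications.

```python
def place_phrase_commands(candidates, min_gap=0.5, max_gap=3.0):
    """Return the times (s) at which PCs are placed.

    candidates: accent-group beginnings sorted by time, each a dict with
    'time' (s), 'has_mark' (orthographic mark present) and 'score'
    (ASSUMED precomputed weighted score; higher is better).
    """
    # Step 1: one PC at every orthographic mark.
    marked = [c["time"] for c in candidates if c["has_mark"]]

    # Step 2: drop PCs that fall too close to the previously kept one.
    kept = []
    for t in marked:
        if not kept or t - kept[-1] >= min_gap:
            kept.append(t)

    # Step 3: fill gaps longer than max_gap with the best-scoring candidate.
    changed = True
    while changed:
        changed = False
        for left, right in zip(kept, kept[1:]):
            if right - left <= max_gap:
                continue
            inside = [c for c in candidates
                      if c["time"] - left >= min_gap and right - c["time"] >= min_gap]
            if inside:
                best = max(inside, key=lambda c: c["score"])
                kept.append(best["time"])
                kept.sort()
                changed = True
                break
    return kept

# Toy paragraph (hypothetical values).
candidates = [
    {"time": 0.0, "has_mark": True, "score": 0.0},
    {"time": 0.3, "has_mark": True, "score": 0.0},
    {"time": 1.5, "has_mark": False, "score": 0.2},
    {"time": 2.4, "has_mark": False, "score": 0.7},
    {"time": 4.0, "has_mark": True, "score": 0.0},
]
pc_times = place_phrase_commands(candidates)
```

In the thesis the score depends on the distances to the neighbouring PCs and so would be recomputed for each gap; a fixed score per candidate is used here only to keep the sketch short.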

The second phase consists of the prediction, by means of ANNs, of the magnitude and the anticipation of the PC relative to the initial eligible position (beginning of the accent group). Two ANNs were used because of the low correlation between the parameters Ap and T0a (anticipation). The process of selecting the architecture and the set of features was very similar to the one described for the duration model. The ANN selected to predict Ap is a feed-forward ANN with 20-2-2-1 nodes in its layers. The first and second hidden layers have the hyperbolic logarithmic transfer function, and the output node a linear transfer function. The ANN selected to predict T0a is a feed-forward ANN with 21-4-2-1 nodes in its layers. The first and second hidden layers have the hyperbolic tangent and hyperbolic logarithmic transfer functions, respectively, and the output node a linear transfer function. The Levenberg-Marquardt back-propagation training algorithm was used for both ANNs. A set of 20 features was used in the Ap ANN, and the magnitude of the previous PC was used as an additional feature in the T0a ANN. The final correlation coefficients on the test set were 0.772 and 0.649 for Ap and T0a, respectively. These values are the highest published in similar works, although the perceptual test results should not be disregarded.

The AC model predicts whether or not an AC is associated with a syllable and, if so, predicts its amplitude, onset time anticipation and offset time anticipation. The onset time and the offset time are determined by an anticipation relative to the beginning and end of the voiced part of speech in the syllable. Again, the low correlation between parameters led to the use of four ANNs. The process of selecting the architectures and the sets of features was similar to the one described for the duration model. A set of 25 or 27 features was used, depending on the parameter. The final performance on the test set for each parameter was:

• Ca (existence of AC): r=0.654, accuracy of 89.3%;

• Aa (amplitude of ACs): r=0.602;

• T1a (anticipation of onset time): r=0.743;

• T2a (anticipation of offset time): r=0.650.

Again, the present results are the highest published in similar works. However, some important parameters, like Aa, still show low correlation. Some information is still missing in the model to improve this parameter. Some observed results showed that most of the ACs produce an accent component that, added to the other components, closely fits the original F0 contour. Nevertheless, in some other cases, lower accent components did not follow the high values of the original F0 pattern. In these cases just a rough approximation is achieved, because of the amplitude, because of T1 or T2, or because of the closeness between ACs. Again, the perceptual test is important for the final judgement of the quality obtained.

6.2.4 Complete prosody model

The components of the complete model have already been described and individually evaluated. The complete model (durations and F0) was applied to several paragraphs, and the impact of each component of the model on the final result was measured. Table 6.1 summarizes the averages of the rmse, the correlation coefficient, r, and the Mean Opinion Score, MOS, over the 5 paragraphs used in a perceptual test, for stimuli with estimated F0 (stimuli 3), predicted ACs with estimated PCs (stimuli 5), predicted ACs and PCs (stimuli 6) and the duration + F0 models (stimuli 8).

In order to evaluate the loss introduced by the AC model, the distance between F0 contours when replacing the estimated ACs with the predicted ones was measured. This was the most significant measured loss in the whole model. In the paragraphs used, the rmse increased by about 12 Hz and the correlation coefficient decreased by about 0.4, as shown by comparing columns 2 and 3 of Table 6.1. This does not mean that the accent component deteriorates so much, as will be discussed in section 6.3.

To evaluate the loss introduced by the PC model, the F0 contour with predicted ACs is taken as the reference and compared with the F0 contour with predicted PCs and ACs. This model assumes that the ACs depend on the PCs. The new F0 contour is produced by predicting the ACs again, because the set of PCs is new and the AC model uses this information in its input features. The AC model remains exactly the same, and all features except the PC features also remain exactly the same. The new set of ACs differs from the reference one only as a result of the change in the PC features. No significant loss was measured between the new and reference F0 contours. Table 6.1 shows an insignificant reduction in rmse (-0.2 Hz) and a reduction in r of 0.06 between columns 3 and 4.

Table 6.1: Summary of the average (over the 5 paragraphs) evaluation parameters for the 4 stimuli types used in the perceptual tests.

             Estimated F0    Predicted ACs    Predicted ACs and PCs    Dur+F0 models
rmse (Hz)    4.2             16.3             16.1                     16.3
r            0.967           0.550            0.486                    0.519
MOS          4.4             3.1              3.1                      2.9

When the F0 model (AC and PC models) is applied on top of the duration model, no significant changes in the rmse and r of the predicted F0 contours exist, as can be observed in columns 4 and 5 of Table 6.1. This is consistent, because the duration model introduces no change in the F0 patterns. The change in MOS between those columns is due to the timing changes and not to F0 pattern changes.
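The column-to-column differences discussed around Table 6.1 can be recomputed directly from the table values. The sketch below only restates the table; the dictionary layout and the `delta` helper are hypothetical names.

```python
# Values of Table 6.1 (averages over the 5 paragraphs).
TABLE_6_1 = {
    "Estimated F0":          {"rmse": 4.2,  "r": 0.967, "mos": 4.4},
    "Predicted ACs":         {"rmse": 16.3, "r": 0.550, "mos": 3.1},
    "Predicted ACs and PCs": {"rmse": 16.1, "r": 0.486, "mos": 3.1},
    "Dur+F0 models":         {"rmse": 16.3, "r": 0.519, "mos": 2.9},
}

def delta(from_column, to_column, key):
    """Change in one evaluation parameter when moving between two columns."""
    return round(TABLE_6_1[to_column][key] - TABLE_6_1[from_column][key], 3)

# Loss attributed to the AC model (columns 2 and 3).
ac_rmse = delta("Estimated F0", "Predicted ACs", "rmse")
ac_r = delta("Estimated F0", "Predicted ACs", "r")

# Change when the PCs are also predicted (columns 3 and 4).
pc_rmse = delta("Predicted ACs", "Predicted ACs and PCs", "rmse")
pc_r = delta("Predicted ACs", "Predicted ACs and PCs", "r")

# Change in MOS when the duration model is added (columns 4 and 5).
dur_mos = delta("Predicted ACs and PCs", "Dur+F0 models", "mos")
```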

The perceptual test presented in chapter 5 compares 9 stimuli, in order to measure the audible degradation in naturalness introduced by each component of the whole prosody model. In general, the MOS of the perceptual test for each type of stimuli confirms the objective measured results commented on above. The following main observations resulted from the comparison of the MOS over 95 evaluations of each type of stimuli:

• Stimuli with the estimated F0 contour, obtained from the manually labelled PCs and ACs, were relatively close to the original stimuli: 4.4 and 4.6 on the MOS scale, respectively.

• Stimuli with the predicted F0 contour, obtained from estimated PCs and predicted ACs, got a score of 3.1, denoting a significant degradation in perceived naturalness. The degradation from the estimated F0 stimuli on the MOS scale was almost 1.3. This subjective evaluation confirms the previously discussed objective results. Again, this degradation may not be due only to the accent component.

• Stimuli with the F0 contour predicted from predicted PCs and predicted ACs, in other words the complete F0 model, got a score of 3.1, denoting no additional degradation introduced by the prediction of the PCs. The very low confidence level (36%) between stimuli with estimated and predicted PCs denotes that no evidence exists to consider them different types of stimuli.
Generally, the results of the perceptual test confirm the objective measured results.

The information about focus is decisive for improving the correctness of the accent components.

The duration model was also considered in this second test with the complete model. Again, one stimulus with only the segmental durations predicted by the alternative model was used, together with another stimulus in which the predicted F0 contour was imposed on that speech signal. This last stimulus corresponds to the proposed complete prosody model. The following main observations resulted:

• Stimuli with segmental durations predicted by the alternative model achieved an even better score (4.2) than in the previous test. This small change could be caused by the presence in the same test of other stimuli (F0-modified stimuli) with less naturalness. The distance from these stimuli to the original ones was 0.4.

• The complete model achieved a final score of 2.9, which is at the Fair level on the MOS scale. This score is 0.2 below the score of the stimuli with predicted F0, corresponding to the loss resulting from the introduction of the segmental duration model.

A similar decrease of 1.3 in the MOS occurred with the introduction of the predicted F0. This decrease can be observed in two comparisons. The first is between the stimuli with estimated and predicted F0 contours (a decrease from 4.4 to 3.1). The second is between the stimuli with only the segmental durations modified and the ones with the complete model (a decrease from 4.2 to 2.9).

However, the decrease in MOS caused by the introduction of the duration model was not the same in both comparisons. On the one hand, there is the comparison of the stimuli with original and predicted durations (a decrease from 4.6 to 4.2); on the other hand, the comparison of the stimuli with predicted F0 and the ones with the complete prosody model (a decrease from 3.1 to 2.9). The smaller decrease in the second case can be explained by the lower level of naturalness of the F0 contour.

Finally, a comparison between the objective and subjective measurements used to evaluate the model was made in chapter 5. This comparison led to the conclusion that “perceived naturalness in two paragraphs produced with same model can not be evaluated comparing their own correlations coefficient or rmse. But the general naturalness of a model can be evaluated by the rmse or correlation coefficient measured along several paragraphs”. The rmse and r along several paragraphs are highly correlated with the MOS of the perceptual test.

6.3 Final Considerations about the Error Contributions

The final evaluation of the components of the F0 model (AC model and PC model) cannot be made only from the degradation in the measured rmse or r of the F0 contours, nor from the degradation in MOS introduced by each component. What follows is an analysis of the error introduced by the AC and PC components in two situations, S5 and S6. S5 corresponds to stimuli 5 of the perceptual tests, where the F0 contour was determined by predicting only the ACs and using the estimated PCs; S6 corresponds to stimuli 6, where the F0 contour was determined by predicting both the PCs and the ACs. It is considered that no AC error is introduced between situations S5 and S6. The error considered in the following analysis can be any of the parameters rmse, r, or MOS.

Fig. 6.1 represents the error in S5, eS5, and in S6, eS6, considering that the PC and AC error axes are orthogonal. The eS5 has only an AC error component (ACe), with the value eAC, and no PC error component (PCe), since the estimated PCs were used and are assumed to have no error. S6 has the same AC component, eAC, but a different absolute (measured) error, eS6. The figure shows that even a rather small increase in the absolute error between S5 and S6, δe, can correspond to a significant increase in the PC error component, δePC.

In fact, the present model considers that the PC and AC axes are not orthogonal, because the ACs are considered to depend on the PCs. It must be remembered that the set of features of the AC model contains features related to the PCs. Therefore, Fig. 6.2 presents the same analysis, now considering non-orthogonal axes. The angle represented between the axes was selected to produce a clear figure and was not measured. It can be observed that now even a lower absolute (measured) error in situation S6 can correspond to a significant PC error component.

A group of experiments and measurements can be devised with the objective of measuring the angle between the PC and AC components.

Fig. 6.1 – PC and AC error components in stimuli 5 and 6, considering orthogonal axes.

[Figure: axes PCe and ACe; error vectors eS5 and eS6; AC component eAC; increments δe and δePC.]

Fig. 6.2 – PC and AC error components in stimuli 5 and 6, considering non-orthogonal axes.

[Figure: axes PCe and ACe; error vectors eS5 and eS6; AC component eAC; increments δe and δePC.]

This analysis demonstrated that a non-significant change in the error of the F0 pattern when the PC model is applied does not mean that the PC model introduces no degradation.

A similar conclusion can be drawn from the analysis of the error components of the F0 model and the duration model.
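The geometric argument of Figs. 6.1 and 6.2 can be made concrete with a small calculation. Treating the AC and PC errors as two vectors with an angle θ between their axes, the measured error satisfies eS6² = eAC² + ePC² + 2·eAC·ePC·cos θ, which reduces to the Pythagorean relation for θ = 90°. The sketch below solves this for ePC. The rmse values are used only as an illustration, the 120° angle is hypothetical (the text states that the angle was not measured), and the function name is an assumption.

```python
import math

def pc_error_candidates(e_ac, e_s6, angle_deg=90.0):
    """Non-negative PC error components compatible with the measured errors.

    Solves e_s6^2 = e_ac^2 + p^2 + 2*e_ac*p*cos(theta) for p, where theta is
    the angle between the AC and PC error axes (90 degrees = orthogonal).
    Returns an empty list when no real solution exists.
    """
    theta = math.radians(angle_deg)
    disc = e_s6 ** 2 - (e_ac * math.sin(theta)) ** 2
    if disc < 0:
        return []
    root = math.sqrt(disc)
    base = -e_ac * math.cos(theta)
    return sorted({p for p in (base - root, base + root) if p >= 0})

# Orthogonal axes: a 0.2 Hz rise in measured error (illustrative numbers)
# already corresponds to a PC component of about 2.6 Hz.
orthogonal = pc_error_candidates(16.3, 16.5)

# With the Table 6.1 rmse values (16.3 then 16.1) orthogonal axes admit no solution...
impossible = pc_error_candidates(16.3, 16.1)

# ...whereas a HYPOTHETICAL obtuse angle allows a non-zero PC component
# even though the measured error went down.
oblique = pc_error_candidates(16.3, 16.1, angle_deg=120.0)
```

With an obtuse angle two solutions can exist, so the measured errors alone do not fix the PC component; this is consistent with the remark that the angle would have to be measured experimentally.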

6.4 Summary of Results and Conclusions

The most important results of this work can be summarized in the following items:

• a prosody model for EP, for TTS purposes, to be implemented in the FEUP-TTS system;

• a speech database in EP, FEUP-IPB, labelled at the phoneme, word, phrase and F0 levels;

• two syllable division algorithms for text and phoneme sequences;

• a set of phonetic transcription rules.

Two segmental duration models based on ANNs were proposed. The most important conclusions can be summarized in the following items:

• the use of a large number of features contributed to improve the final results;

• both proposed segmental duration models have Good acceptability in the objective and subjective measurements;

• the use of one dedicated ANN for each type of segment improves the final performance of the model, because the knowledge carried by other types of segments may damage the learning process of the ANN;

• the level of Good was achieved by the duration model in the perceptual tests;

• ANNs proved, once again, their ability to predict segmental durations.

A model to predict the F0 contour based on the Fujisaki model, basically using ANNs to predict the Phrase Commands and the Accent Commands, was developed. First, the PC sub-model proceeds in two phases. The first phase associates PCs with the text, based on a mathematical model obtained from the experimental data. The second phase predicts the PC magnitudes and exact positions in the speech signal using ANNs. Then the AC model associates ACs with syllables and predicts their amplitudes, onset times and offset times, using ANNs. The following main conclusions can be pointed out:

• the process and features ensure good correlation coefficients for the predicted parameters;

• the loss in naturalness measured by the MOS is significant when the AC model is applied and is not significant when the PC model is applied, but this does not necessarily mean that the AC model is solely responsible;

• the level of Fair was achieved by the F0 model in the perceptual tests.

The complete model achieved a final score of 2.9, which is at the Fair level on the MOS scale.

Finally, a high correlation was found between the MOS of a perceptual test and the rmse and r measured between predicted and original values of segmental duration and F0 along several paragraphs. This leads to the conclusion that the rmse and r are very good evaluators of the perceived naturalness of a model (duration model or F0 model) when measured along several paragraphs.
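As a rough illustration of this relation, the correlation between the average rmse and the MOS of the four stimuli types in Table 6.1 can be computed as below. These four points are not the data set from which the conclusion of chapter 5 was drawn, so the result is only indicative; the `pearson_r` helper is a hypothetical name.

```python
import math

def pearson_r(x, y):
    """Linear correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Average rmse (Hz) and MOS per stimuli type, taken from Table 6.1.
rmse_hz = [4.2, 16.3, 16.1, 16.3]
mos = [4.4, 3.1, 3.1, 2.9]

r_rmse_mos = pearson_r(rmse_hz, mos)   # strongly negative: higher rmse, lower MOS
```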

6.5 Future Work

The FEUP-TTS system, like all other TTS systems for any language, can be further developed in any of its modules. Unfortunately, no TTS module can be considered perfect enough not to need further development. Until now, no TTS system can produce speech natural enough to be used without limitations. When that happens, the number of applications using speech will grow explosively. Obviously, some improvements are more relevant than others to the final quality of the synthetic speech. The modules whose improvement matters most deserve special attention.

Since this work was mainly dedicated to the prosody module, the suggested further developments focus on this module. Some hints for improving the performance of the duration and F0 models are discussed.

For instance, the identification and special treatment of longer segments in the duration model could introduce a small improvement in the final performance. However, using only the same restricted information, no significant improvement is expected in the duration model.

A special-purpose database for pausing studies would allow the development of a better pausing or phrasing module. The use of reliable prosodic phrasing could, perhaps, introduce some improvements in the final model.

Concerning the F0 module, some improvements may be introduced in the AC model, such as considering some additional restriction on the proximity between ACs, or associating ACs with units other than syllables. Both suggestions lead to a smaller number of ACs, thus producing a looser fit to the original F0 and a flatter prosody. A flatter F0 pattern, although less interesting, can attenuate possible wrong movements or even hide them. However, real improvements can only be achieved by breaking the restrictions of the presently available information. The identification of prominent words or syllables, or of the focus, is the most relevant information needed. For this, several other kinds of input information must be used, because the prominence information cannot be taken only from the text morphology. Syntactic knowledge can probably give some additional information useful for the segmental duration model [Ribeiro et al., 2003]. However, semantic knowledge is more reliable for producing the focus information.

As was written before, only part of the linguistic information has been used. Further significant developments will surely need other kinds of information, for instance a module to produce non-linguistic and/or paralinguistic information.

The idea of using new kinds of information in prosodic systems is not new, but it is limited by the difficulty of obtaining non-linguistic or paralinguistic information dynamically. Some scientists have pointed to concept-to-speech, rather than text-to-speech, as the new generation of synthesizers. Possibly, this approach intends to avoid the need to extract the non-linguistic and paralinguistic information that the speaker introduces in speech production. Additionally, concept-to-speech already contains the semantic knowledge. It is well known that speech conveys information or concepts by way of words; without words there is no speech. The new difficulty introduced by the prospective new generation of synthesizers will be the concept-to-text or concept-to-words processing.

The present prosody module was produced for read speech based on a particular speaker. It can be developed further by the introduction of several other functionalities. This prosody model can, in the future, incorporate different speech rates, different prosodic styles, emotions and different text types besides read text. Perhaps it can be optimised for the theme of the text, depending on whether it is news, a weather forecast, a scientific document, mathematical formulae, etc. It can also be developed to supply features for facial modelling.

This prosody module cannot be considered complete without a model to predict the intensity pattern.

Unfortunately, no prosody module can produce truly natural patterns yet. It is possible to find some very specialised applications with reasonably natural synthetic speech but, despite the long evolution of TTS systems over the years, no TTS system yet exists that can produce natural speech for all applications.

If we look into the future, we see the long way still to go in order to reach the pursued objective of obtaining really natural synthetic speech, but if we look backwards we can also see the long way already covered. This gives hope of reaching the objective in a not too distant future.

Bibliography

Allen, J.; Hunnicutt, S. and Klatt, D. H.. (1987). From Text to Speech: The MITalk System. Cambridge University Press, Cambridge.

Andrade, E. and Viana, M.. (1988). Ainda Sobre o Ritmo e o Acento em Português. In actas do 4º Encontro da Associação Portuguesa de Linguística. Lisbon, 3-5.

Barbosa, F.; Ferrari, L. and Resende, F. G.. (2003). A Methodology to Analyse Homographs for a Brazilian Portuguese TTS System. In Computational Processing of the Portuguese Language, 6th International Workshop, PROPOR Proceedings. Faro, pp.57-61.

Barbosa, F.; Pinto, G.; Resende, F. G.; Gonçalves, C. A.; Monserrat, R. and Rosa, M. C.. (2003). Grapheme-Phone Transcription for a Brazilian Portuguese TTS. In Computational Processing of the Portuguese Language, 6th International Workshop, PROPOR Proceedings. Faro, pp.23-30.

Barbosa, P. and Bailly, G.. (1994). Characterisation of rhythmic patterns for text-to-speech synthesis. Speech Communication, 15: 127-137.

Barbosa, P. and Bailly, G.. (1997). Generation of pauses within the z-score model. In Progress in Speech Synthesis, Van Santen, J. P. H., Sproat, R. W., Olive, J. P. and Hirschberg, J., Editors. Springer Verlag, New York, pages 365-381.

Barbosa, P.. (1994). Caractérisation et génération automatique de la structuration rythmique du français. Doctoral thesis, Institut National Polytechnique de Grenoble.

Barbosa, P.. (1997). A Model of Segment (and Pause) Duration Generation for Brazilian Portuguese Text-to-Speech Synthesis. Proceedings of Eurospeech’97, Rhodes, pages 2655-2658.

Barros, M. J.. (2002). Estudo Comparativo e Técnicas de Geração de Sinal para Síntese da Fala. Master Thesis, Faculdade de Engenharia da Universidade do Porto.

Benenati, C. (2000). Separación en Silabas. http://www.lclark.edu/~benenati/silabacento/silabas.html.

Bergström, M. and Reis, N.. (1997). Prontuário Ortográfico e Guia da Língua Portuguesa. Editorial Notícias.

Boersma, P. and Weenink, D.. "Praat doing Phonetics by Computer", http://www.fon.hum.uva.nl/praat/

Boersma, P.. (1993). Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Proceedings of the Institute of Phonetic Sciences of the University of Amsterdam 17: 97-110.

Braga, D.; Freitas, D.; Teixeira, J. P. and Marques, A.. (2003). On the Use of Prosodic Labelling in Corpus-Based Linguistic Studies of Spontaneous Speech. In Proceedings of Text Speech and Dialogue, Ceske Budejovice, Czech Republic, pages 388-394.

Braga, D.; Freitas, D. and Ferreira, H.. (2003). Processamento Linguístico Aplicado à Síntese da Fala. In Proceedings of III Congresso Luso-Moçambicano de Engenharia, Maputo/Moçambique. Vol. 2, pages 1349-1360.

Brinckmann, C. and Trouvain, J.. (2003). The Role of Duration Models and Symbolic Representation for Timing in Synthetic Speech. International Journal of Speech Technology 6, 21-31.

Campbell, W. N. and Isard, S. D.. (1991). Segment durations in a syllable frame. Journal of Phonetics, 19: 37-47.

Campbell, W. N.. (1992). Syllable-based Segmental Duration. In Talking Machines: Theories, Models and Designs, by G. Bailly, C. Benoit and T. Sawallis, Elsevier, Oxford, pages 211-224.

Campbell, W. N.. (1993). Predicting Segmental Durations for Accommodation Within a Syllable-Level Timing Framework. Proceedings of Eurospeech’93, vol. 2, pages 1081-1084.

Campbell, W. N.. (2000). Timing in Speech: A Multi-Level Process. In Prosody: Theory and Experiment. Edited by Merle Horne, Kluwer Academic Publishers, pages 281-334.

Carvalho, P.; Oliveira, L.; Trancoso, I. and Viana, M.. (1998). Concatenative Speech Synthesis for European Portuguese. Proc. of the third ESCA/COCOSDA International Workshop on Speech Synthesis. Jenolan Caves, Australia.

Caseiro, D. and Trancoso, I.. (2002). Grapheme-to-Phone Using Finite State Transducers. Proc. 2002 IEEE Workshop on Speech Synthesis. Santa Monica, California.

Catarino, D.. (2000). Separação Silábica, http://www.option-line.com/members/dilson/Silabas.htm.

Chu, M. and Feng, Y.. (2001). Study on Factors Influencing Durations of Syllables in Mandarin. Proceedings of Eurospeech’01, Scandinavia, pages 927-930.

Córdoba, R.; Vallejo, J. A.; Montero, J. M.; Gutierrez-Arriola, J.; López, M. A. and Pardo, J. M.. (1997). Automatic Modelling of Duration in a Spanish Text-to-Speech System Using Neural Networks. Proceedings of Eurospeech’99, vol. 4, pages 1619-1622.

Cunha, C. and Cintra, L.. (1997). Nova Gramática do Português Contemporâneo, Edições João Sá da Costa.

D’Alessandro, C. and Mertens, P.. (1995). Automatic pitch contour stylization using a model of tonal perception. Computer Speech and Language 9, 257-288.

Demuth, H. and Beale, M.. (2000). Neural Network Toolbox, for use with Matlab – User’s Guide, version 4, by the Math Works.

Dutoit, T. and Leich, H.. (1992). Improving the TD-PSOLA Text-to-Speech Synthesizer with a Specially Designed MBE Re-Synthesis of the Segments Database. In Vandewalle, J., Boite, R., Moonen, M. and Oosterlinck, A. (eds), SIGNAL PROCESSING VI: Theories and Applications. Elsevier Science Publishers B. V.

Dutoit, T.. (1997). An Introduction to Text-To-Speech Synthesis. Kluwer A. P., Dordrecht.

Fackrell, J.; Vereecken, H.; Martens, J.-P. and Van Coile, B.. (1999). Multilingual prosody modelling using cascades of regression trees and neural networks. Proceedings of Eurospeech’99, Budapest, pp. 1835-1838.

Fackrell, J.; Vereecken, H.; Grover, C.; Martens, J.-P. and Van Coile, B.. (2002). Corpus-based Development of Prosodic Models Across Six Languages, pages 120-128, in E. Keller, G. Bailly, A. Monaghan, J. Terken, & M. Huckvale (editors), Improvements in Speech Synthesis, Edited by John Wiley & Sons, West Sussex.

Fant, G.; Liljencrants, J. and Lin, Q.. (1985). A four parameter model of glottal flow. In Speech Transmission Laboratory – QPSR, 1:1-12.

Ferreira, H.. (2003). Contributo para a leitura automática de textos científicos. Graduation final project /FEUP. July, 2003. http://www.fe.up.pt/~hfilipe/projecto

Ferreira, M. C.. (1998). Intonation in European Portuguese. In Intonation Systems – A Survey of Twenty Languages, by Daniel Hirst and Albert Di Cristo, Cambridge University Press, pages 167-178.

Freitas, D.; Moura, A.; Braga, D.; Ferreira, H.; Teixeira, J. P.; Barros, M. J.; Gouveia, P. and Latsch, V.. (2002). A Project of Speech Input and Output in an E-commerce Application. In Advances in Natural Language Processing, Proceedings of Third International Conference, PorTAL 2002. Faro, Portugal.

Fromkin, V. and Rodman, R.. (1983). Introdução à Linguagem. Editora Almedina. Coimbra. Portugal.

Frota, S.. (1991). Para a Prosódia da Frase: Quantificador, Advérbio e Marcação Prosódica (Somente alguns tópicos em foco). Masters Dissertation, Faculdade de Letras da Universidade de Lisboa.

Frota, S.. (2000). Prosody and Focus in European Portuguese, Phonological Phrasing and Intonation. Garland Publishing Inc., New York.

Fujisaki, H. and Hirose, K.. (1984). Analysis of voice fundamental frequency contours for declarative sentences of Japanese. Journal of the Acoustical Society of Japan (E), 5(4):233-241.

Fujisaki, H. and Narusawa, S.. (2002). Automatic Extraction of Model Parameters from Fundamental Frequency Contours of Speech. Proceedings for 2001 2nd Plenary Meeting and Symposium on Prosody and Speech Processing, pp. 133-138. Sanjo-Kaikan, University of Tokyo.

Fujisaki, H.; Narusawa, S.; Ohno, S. and Freitas, D.. (2003). Analysis and Modeling of F0 Contours of Portuguese Utterances Based on the Command-Response Model. Proceedings of Eurospeech’03, Geneva. Pages 2317-2320.

Fujisaki, H.. (1983). Dynamic characteristics of voice fundamental frequency in speech and singing. In MacNeilage, P. F., Editor, The Production of Speech, pages 39-55. Springer-Verlag.

Fujisaki, H.. (1988). A note on the physiological and physical basis for the phrase and accent components in the voice fundamental frequency contour. In Fujimura, O., Editor, Vocal Fold Physiology: Voice Production, Mechanisms and Functions, pages 347-355. Raven, New York.

Fujisaki, H.. (1997). Prosody, Models, and Spontaneous Speech. In Sagisaka, Y., Campbell, N. and Higuchi, N., Computing Prosody, edited by Springer-Verlag New York, Inc. Pages 27-42.

Fujisaki, H.. (2002). Modeling in study of Tonal Features of Speech with Application to Multilingual Speech Synthesis. Proceedings of Joint International Conference of SNLP and Oriental COCOSDA. Thailand.

Goubanova, O. and Taylor, P.. (2000). Using Bayesian Belief Networks for model duration in text-to-speech systems. Proceedings of ICSLP 2000, Beijing.

Goubanova, O.. (2001). Predicting segmental duration using Bayesian belief network. Proceedings 4th ISCA Tutorial and Research Workshop on Speech Synthesis, Scotland.

Gouveia, P. D.; Teixeira, J. P. and Freitas, D.. (2000). Divisão Silábica Automática do Texto Escrito e Falado. Actas do V PROPOR, Processamento Computacional da Língua Portuguesa Escrita e Falada, Atibaia – S. Paulo. Pages 65-74.

Granqvist, S.. (1996). Enhancements to the Visual Analogue Scale, VAS, for listening tests. Speech, Music and Hearing, Quarterly Progress and Status Report, Royal Institute of Technology. Pages 61-65.

Guimarães, R. C. and Cabral, J. A. S.. (1997). Estatística. Edição Revista, McGraw Hill de Portugal.

Hagan, M. T. and Menhaj, M.. (1994). Training feedforward networks with the Marquardt algorithm, IEEE Transactions on Neural Networks, vol. 5, nº 6, pp.989-993.

Hirose, K.; Furuyama, Y.; Narusawa, S.; Minematsu, N. and Fujisaki, H.. (2003). Use of Linguistic Information for Automatic Extraction of F0 Contour Generation Process Model Parameters. Proceedings of Eurospeech 2003, Geneva. Pages 141-144.

Hirschberg, J. and Pierrehumbert, J. B.. (1986). The intonational structuring of discourse. Proceedings of the 24th ACL Meeting. Pages 136-144, New York.

Hirst, D. and Di Cristo, A.. (1998). Intonation Systems – A Survey of Twenty Languages. Cambridge University Press.

Hirst, D. and Espesser, R.. (1993). Automatic modelling of fundamental frequency using a quadratic spline function. Travaux de l’Institut de Phonétique d’Aix, 15, 71-85.

Hirst, D.; Di Cristo, A. and Espesser, R.. (2000). Levels of Representation and Levels of Analysis for the Description of Intonation Systems. In Merle Horne, Prosody: Theory and Experiment. Edited by Kluwer Academic Publishers, Dordrecht, pages 51-87.

Hirst, D.. (2002). Automatic Analysis of Prosody for Multi-lingual Speech Corpora. In E. Keller, G. Bailly, A. Monaghan, J. Terken and M. Huckvale, Improvements in Speech Synthesis, Cost 258: The naturalness of synthetic speech, edited by John Wiley & Sons, West Sussex. Pages 320-327.

Horne, M.. (2000). Prosody: Theory and Experiment. Kluwer Academic Publishers. Dordrecht.

Huang, X.; Acero, A. and Hon, H.. (2001). Spoken Language Processing – A guide to Theory, Algorithm, and System Development. Prentice Hall, New Jersey.

Huckvale, M.. Speech Filing System Tools for Speech Research http://www.phon.ucl.ac.uk/resource/sfs/

Keller, E. and Zellner, B.. (1997). Les Défis Actuels en Synthèse de la Parole, Etudes de Lettres. Revue de la Faculté des Lettres de l’Université de Lausanne.

Keller, E.; Bailly, G.; Monaghan, A.; Terken, J. and Huckvale, M.. (2002). Improvements in Speech Synthesis, Cost 258: The naturalness of synthetic speech. Edited by John Wiley & Sons, West Sussex.

Klatt, D. H.. (1976). Linguistic uses of segmental duration in English: Acoustic and perceptual evidence. Journal of the Acoustical Society of America, 59, 1208-1220.

Kochanski, G. and Shih, C.. (2002). Prosody and Prosodic Models. Tutorial of ICSLP 2002 Denver.

Ladd, D. R. and Cutler, A.. (1983). Models and Measurements in the study of prosody. In Cutler, A. and Ladd, D. R., Prosody: Models and Measurements. Springer-Verlag, Berlin.

LiMin, Fu. (1994). Neural Networks in Computer Intelligence. McGraw-Hill International Editions, Computer Science Series.

Masaki, M.; Kashiola, H. and Campbell, N.. (2002). Modeling the Timing Characteristics of Different Speaking Styles. Proceedings of IEEE 2002 Workshop on Speech Synthesis.

Mateus, M.; Andrade, A.; Viana, M. and Villalva, A.. (1990). Fonética, Fonologia e Morfologia do Português. Universidade Aberta, Lisbon.

McClelland, J. L. and Rumelhart, D. E.. (1986). Parallel Distributed Processing – Explorations in the Microstructure of Cognition. Volume 2 – Psychological and Biological Models. The Massachusetts Institute of Technology Press.

Mixdorff, H. and Jokisch, O.. (2001). Building An Integrated Prosodic Model of German. Proceedings of Eurospeech’01, Aalborg. Pages 947-950.

Mixdorff, H.. (1998). Intonation Patterns of German – Model-based Quantitative Analysis and Synthesis of F0 Contours. Doktor-Ingenieurs Dissertation, Technische Universität Dresden.

Mixdorff, H.. (2000). A Novel Approach to the Fully Automatic Extraction of Fujisaki Model Parameters. Proceedings of ICASSP 2000, vol. 3, pages 1285 – 1288, Istanbul.

Mixdorff, H.. (2002). An Integrated Approach to Modeling German Prosody. Doktor-Ingenieur habilitatus Dissertation, Technische Universität Dresden.

Möbius, B.; Pätzold, M. and Hess, W.. (1993). Analysis and synthesis of German F0 contours by means of Fujisaki’s model. Speech Communication 13, 53-61.

Moulines, E. and Charpentier, F.. (1990). Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones. Speech Communication 9, 453-467.

Moulines, E. and Laroche, J.. (1995). Non-Parametric techniques for pitch-scale and time-scale modification of speech. Speech Communication 16, 175-205.

Narusawa, S.; Minematsu, N.; Hirose, K. and Fujisaki, H.. (2001). Automatic Extraction of Parameters from Fundamental Frequency Contours of Speech. Proceedings of ICSP 2001, Daejon Korea.

Narusawa, S.; Minematsu, N.; Hirose, K. and Fujisaki, H.. (2002a). A Method for Automatic Extraction of Model Parameters from Fundamental Frequency Contours of Speech. Proceedings of ICASSP 2002, vol. 1 pp.509-512, Orlando, USA.

Narusawa, S.; Minematsu, N.; Hirose, K. and Fujisaki, H.. (2002b). Automatic Extraction of Model Parameters from Fundamental Frequency Contours of English Utterances. Proceedings of ICSLP 2002, vol. 3 pp. 1725-1728, Denver USA.

Navas, E.; Hernáez, I. and Sánchez, J.. (2002a). Basque Intonation Modelling for Text To Speech Conversion. Proceedings of ICSLP’02, Denver, USA.

Navas, E.; Hernáez, I. and Sánchez, J.. (2002b). Subjective Evaluation of Synthetic Intonation. IEEE 2002 Workshop on Speech Synthesis. Santa Monica, USA.

Navas, E.; Hernáez, I.; Armenta, A.; Etxebarria, B. and Salaberria, J.. (2000). Modelling Basque Intonation Using Fujisaki’s Model and CARTs. In state of the art in speech synthesis digest, 3/1 – 3/6.

Navas, E.. (2003). Modelado Prosódico del Euskera Batúa para Conversión de Texto a Habla. PhD thesis, Universidad del País Vasco, Escuela Superior de Ingenieros de Bilbao.

Olaszy, G.. (1991). The inherent time structure of speech sounds. In Mária Gósy, Temporal Factors in Speech, a collection of papers, edited by Research Institute for Linguistics, Hungarian Academy of Sciences.

Olaszy, G.; Németh, G. and Olaszy, P.. (2001). Automatic Prosody Generation – a Model for Hungarian. Proceedings of Eurospeech’01, Aalborg. Pages 525-528.

Oliveira, L.; Viana, M. and Trancoso, I.. (1991). DIXI – Portuguese Text-to-Speech System. Proc. of Eurospeech’91. Genoa, Italy.

Oliveira, L.; Viana, M. and Trancoso, I.. (1993). DIXI: Sistema de Síntese da Fala a Partir do Texto para o Português. Proc. EPLP’93 – 1º Encontro de Processamento da Língua Portuguesa Escrita e Falada. Lisboa.

Oliveira, L.. (1996). Síntese de Fala a Partir de Texto. PhD thesis, Universidade Técnica de Lisboa.

Oliveira, M.. (2002). Pausing Strategies as Means of Information Processing in Spontaneous Narratives. Proceedings of Speech Prosody 2002, Aix-En-Provence. Pages 539-542.

Pierrehumbert, J. B.. (1980). The Phonology and Phonetics of English Intonation. PhD thesis, Massachusetts Institute of Technology.

Rabiner, L. and Schafer, R.. (1978). Digital Processing of Speech Signals. Prentice-Hall.

Ribeiro, R.; Oliveira, L. and Trancoso, I.. (2003). Using Morphossyntactic Information in TTS Systems: Comparing Strategies for European Portuguese. In Proc. PROPOR 2003. Faro, Portugal. Pages 143-150.

Riedmiller, M. and Braun, H.. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. Proceedings of the IEEE International Conference on Neural Networks.

Rossi, P.; Palmieri, F. and Cutugno, F.. (2002). A Method for Automatic Extraction of Fujisaki-Model Parameters. Proceedings of Speech Prosody 2002, Aix-En-Provence. Pages 615-618.

Rowden, C.. (1992). Speech Processing. McGraw-Hill.

Rumelhart, D. E. and McClelland, J. L.. (1986). Parallel Distributed Processing – Explorations in the Microstructure of Cognition. Volume 1 – Foundations, The Massachusetts Institute of Technology Press.

Salgado, X. F. and Banga, E. R.. (1999). Segmental Duration Modelling in a Text-to-Speech System for the Galician Language. Proceedings of Eurospeech’99, Budapest. Pages 1635-1638.

Schalkoff, R. J.. (1997). Artificial Neural Networks. McGraw-Hill, Singapore.

Silverman, K. and Pierrehumbert, J.. (1990). The Timing of Prenuclear High Accents in English. Papers in Laboratory Phonology I, J. Kingston and M. Beckman (eds), Cambridge University Press, Cambridge UK. 72-106.

Souza, M. N.; Caprini, E. J.; Machado, C. G.; Ludolf, M. V.; Calôba, L. P.; Seixas, J. M.; Resende, F. G.; Netto, S. L.; Freitas, D.; Teixeira, J. P.; Espain, C.; Pêra, V. and Moreira, F.. (1999). Developing a Voiced Information Retrieval System for the Portuguese Language Capable to Handle Both Brazilian and Portuguese Spoken Versions. Proceedings of the Eurospeech’99, Budapest.

Sproat, R.. (1998). Multilingual Text-To-Speech Synthesis. Kluwer A. P., Dordrecht.

Taylor, P.. (1994). The rise / fall / connection model of intonation. Speech Communication 15, 169-186.

Taylor, P.. (2000). Analysis and Synthesis of Intonation using the Tilt Model. Journal of the Acoustical Society of America, vol. 107(3), pp. 1697-1714.

Teixeira, J. P. and Freitas, D.. (2002). Acoustic Characterisation of the Tonic Syllable In Portuguese, pages 120-128, in E. Keller, G. Bailly, A. Monaghan, J. Terken, & M. Huckvale (Editors), Improvements in Speech Synthesis, Edited by John Wiley & Sons, West Sussex.

Teixeira, J. P. and Freitas, D.. (2003a). Evaluation of a Segmental Durations Model for TTS. In Computational Processing of the Portuguese Language, 6th International Workshop, PROPOR Proceedings. Faro, pp.40-48.

Teixeira, J. P. and Freitas, D.. (2003b). Segmental Durations Predicted With a Neural Network. Proceedings of Eurospeech’03, Geneva. Pages 169-172.

Teixeira, J. P.; Freitas, D. and Fujisaki, H.. (2003). Prediction of Fujisaki Model’s Phrase Commands. Proceedings of Eurospeech’03, Geneva. Pages 397-400.

Teixeira, J. P.; Freitas, D. and Fujisaki, H.. (2004). Prediction of Accent Commands for the Fujisaki Intonation Model. Proceedings of Speech Prosody 2004, Nara - Japan. Pages 451-455.

Teixeira, J. P.; Freitas, D.; Braga, D.; Barros, M. J. and Latsch, V.. (2001). Phonetic Events from the Labeling the European Portuguese Database for Speech Synthesis, FEUP/IPB-DB. Proceedings of Eurospeech’01, Aalborg. Pages 1707-1710.

Teixeira, J. P.; Freitas, D.; Gouveia, P.; Olaszy, G. and Németh, G.. (1998). MULTIVOX – Conversor Texto Fala Para Português. In III Encontro Para o Processamento Computacional da Língua Portuguesa Escrita e Falada - PROPOR, Porto Alegre – Brasil.

Teixeira, J. P.; Rosa, E.; Freitas, D. and Pinto, M. da G.. (1999). Acoustical Characterization of the Accented Syllable in Portuguese, A Contribution to the Naturalness of Speech Synthesis, Proceedings of the Eurospeech’99, Budapest. Volume 4, Pages 1651-1654.

Teixeira, J. P.. (1995). Modelização Paramétrica de Sinais Para Aplicação em Sistemas de Conversão Texto-Fala. Masters dissertation, Faculdade de Engenharia da Universidade do Porto.

Trancoso, I.; Viana, M.; Silva, M.; Marques, G. and Oliveira, L.. (1994). Rule-Based versus Neural Network Based Approaches to Letter-to-Phone Conversion for Portuguese Common and Proper Names. In Proc. International Conference on Spoken Language Processing. Yokohama, Japan.

Van Santen, J. P. H.. (1992). Contextual Effects on Vowel Duration. Speech Communication. 11(6):513-546.

Van Santen, J. P. H.. (1994). Assignment of segmental duration in text-to-speech synthesis. Computer Speech and Language, 8, 95-128.

Van Santen, J. P. H.. (1997). Segmental Duration and Speech Timing. In Sagisaka, Y., Campbell, N. and Higuchi, N., Computing Prosody, edited by Springer Verlag, New York.

Vereecken, H.; Martens, J.-P.; Grover, C.; Fackrell, J. and Van Coile, B.. (1998). Automatic Prosodic Labeling of 6 Languages. Proceedings of ICSLP’98, Sydney, Australia, Vol. 4 pp. 1399-1402.

Viana, M. C.; Oliveira, L. and Mata, A. I.. (2001). Prosodic Phrasing: Machine and Human Evaluation. TTS workshop 2001, Edinburgh.

Viana, M. C.; Oliveira, L. and Mata, A. I.. (2003). Prosodic Phrasing: Machine and Human Evaluation. International Journal of Speech Technology 6, 83-94.

Vorstermans, A.; Martens, J.P. and Bert, V. C.. (1996). Automatic segmentation and labeling of multi-lingual speech data, Speech Communication, 271-293.

Wells, J.. (2000). SAMPA computer readable phonetic alphabet. http://www.phon.ucl.ac.uk/home/sampa/home.htm

Zellner, B.. (1994). Pauses and the Temporal Structure of Speech. In Eric Keller, Fundamentals of Synthesis and Speech Recognition, Basic Concepts, State-of-the-Art and Future Challenges, by John Wiley & Sons, Chichester.

Zellner, B.. (1998). Caractérisation et prédiction du débit de parole en français – Une étude de cas. Doctoral thesis (Docteur en Lettres), Université de Lausanne.

Zellner, B.. (2001). Les enjeux de la simulation scientifique L’exemple du rythme de la parole. Actes des Journées Prosodie 10-11 Octobre 2001.

Zvonik, E. and Cummins, F.. (2002). Pause Duration and Variability in Read Texts. Proceedings of ICSLP’02, Denver, USA.

Matlab® – The Language of Technical Computing, Using Matlab, version 6, 2000. Math Works.

Standard Publication No. 297, IEEE, (1969). IEEE Recommended Practice for Speech Quality Measurements. IEEE Transactions on Audio and Electroacoustics. Vol. AU-17, no.3. 1969.

Report of ANTIGONA project: Relatório do Projecto ANTIGONA. October, 2001.

The Tilt Intonation Model. Paul Taylor : http://festvox.org/docs/speech_tools-1.2.0/c16909.htm

INSINT Homepage, Daniel Hirst : http://www.lpl.univ-aix.fr/~hirst/AMI.html

ToBI Homepage, Mary Beckman et al. : http://www.ling.ohio-state.edu/~tobi/
