Text normalization in text-to-speech system

Text normalization is an important component of text-to-speech (TTS) systems, and its main difficulty lies in disambiguating non-standard words (NSWs). This paper develops a taxonomy of NSWs on the basis of a large-scale Chinese corpus and proposes a two-stage NSW disambiguation strategy: finite state automata (FSA) for initial classification and maximum entropy (ME) classifiers for subclass disambiguation. Based on this NSW taxonomy, the two-stage approach achieves an F-score of 98.53% in an open test, 5.23% higher than that of the FSA-based approach alone. Experiments show that the NSW taxonomy gives the FSA a high baseline performance, that the ME classifiers yield a considerable further improvement, and that the two-stage approach adapts well to new domains.

Text normalization is a crucial component of text analysis in TTS systems. Real text contains many Non-Standard Words (NSWs), whose properties cannot be found in a dictionary, nor can their pronunciations be obtained by applying "letter-to-sound" rules [1]. NSWs need to be converted into their corresponding standard words, and this process is called text normalization. In English, number expressions, abbreviations, and acronyms are NSWs; even sentence segmentation can be viewed as a text normalization task. For Chinese, non-Chinese tokens such as numbers, symbols, and alphabetic strings need to be normalized into Chinese forms. An NSW can be converted into different standard words depending on both the local context and the text genre, so text normalization is in general a very hard homograph disambiguation task [2]. In Nuance Vocalizer, over 20% of the core application code (by the lines-of-code metric) is devoted to text normalization, and new input forms continue to be added [3].

Typical methods for text normalization are based on hand-crafted rules, but such rules are difficult to write, maintain, and adapt to new domains. Viewed as homograph disambiguation, on the other hand, the problem can be attacked with machine learning methods, which have shown clear advantages: decision trees and decision lists have been used in English and Hindi text normalization [4], Support Vector Machines have been applied to Persian NSW classification [5], and Winnow has been used for homograph disambiguation in Thai text analysis.

The text normalization approach proposed in this paper requires no word segmentation step. Finite state automata detect NSWs in the raw text and make an initial classification; maximum entropy classifiers are then used for further classification. The rest of this paper is organized as follows: Section 2 describes the proposed approach in detail, and Section 3 gives experimental results and analysis.
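As an illustration of the first stage, the sketch below uses regular expressions as a convenient stand-in for finite state automata to detect NSWs and assign coarse class tags. The class names and patterns here are invented for this example, not taken from the paper.

```python
import re

# Illustrative sketch (not the paper's implementation): regular
# expressions stand in for the finite state automata that detect NSWs
# in raw text and assign an initial, coarse class tag.
INITIAL_PATTERNS = [
    ("NUM",   re.compile(r"\d+(?:\.\d+)?")),   # digit strings, decimals
    ("ALPHA", re.compile(r"[A-Za-z]+")),       # Latin-alphabet strings
    ("SYM",   re.compile(r"[%$#&@]")),         # common symbols
]

def detect_nsws(text):
    """Return (span, surface form, coarse class) triples for each NSW."""
    hits = []
    for tag, pat in INITIAL_PATTERNS:
        for m in pat.finditer(text):
            hits.append((m.span(), m.group(), tag))
    return sorted(hits)  # order by position in the text

print(detect_nsws("北京2008年GDP增长8.5%"))
```

Each detected NSW keeps its span and coarse tag, so later stages can inspect the surrounding characters without any word segmentation, which matches the character-based design of the approach.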

Based on the above taxonomy, the text normalization process consists of three stages. The first stage uses finite state automata to detect NSWs in the raw text and make an initial classification; the classification of BNSWs is completed in this stage. For an ANSW, the output of the initial classification is passed to the subclass disambiguation module, where maximum entropy classifiers are applied. Once an NSW is labeled with a class tag, a finite state transducer transforms it into standard words.
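The subclass disambiguation stage can be sketched with a tiny hand-rolled maximum entropy model (multinomial logistic regression trained by gradient ascent). The feature names, class tags, and training examples below are invented for illustration; the paper's actual feature set and classes are not reproduced here.

```python
import math
from collections import defaultdict

# Minimal maximum entropy classifier sketch (not the paper's code).
# Features are context characters around the NSW, in the spirit of the
# character-based approach; the toy data below is invented.
def train_maxent(data, classes, epochs=200, lr=0.5):
    w = defaultdict(float)  # weights indexed by (class, feature)
    for _ in range(epochs):
        for feats, gold in data:
            # p(c | feats) under the current weights
            scores = {c: math.exp(sum(w[(c, f)] for f in feats))
                      for c in classes}
            z = sum(scores.values())
            for c in classes:
                p = scores[c] / z
                for f in feats:
                    w[(c, f)] += lr * ((c == gold) - p)
    return w

def predict(w, feats, classes):
    return max(classes, key=lambda c: sum(w[(c, f)] for f in feats))

# Toy subclass task: is a digit string a YEAR (read digit by digit)
# or a CARDINAL (read as a quantity)? The right-context character is
# the only feature.
train = [
    (["R=年"], "YEAR"), (["R=人"], "CARDINAL"),
    (["R=年"], "YEAR"), (["R=个"], "CARDINAL"),
]
classes = ["YEAR", "CARDINAL"]
w = train_maxent(train, classes)
print(predict(w, ["R=年"], classes))  # YEAR
```

The same training loop extends to richer feature sets (wider context windows, the coarse FSA tag itself); only the feature extraction changes, not the model.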

Standard word generation is the last module of text normalization; it is a generation step, whereas the preceding steps are analysis steps. The input of this module is the NSW itself together with its class tag, and the output is the corresponding Chinese words. Since the conversion is a one-to-one correspondence, finite state transducers are applicable here.
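A sketch of this generation step, assuming hypothetical class tags: because each (NSW, tag) pair has exactly one expansion, a table of deterministic expanders is enough, which mirrors why finite state transducers suffice in the paper.

```python
# Sketch of standard word generation (hypothetical tags, not the
# paper's inventory): each class tag selects one deterministic
# expansion of the NSW into Chinese standard words.
DIGITS = "零一二三四五六七八九"

def expand_year(s):
    """Digit-by-digit reading, e.g. "2008" -> 二零零八."""
    return "".join(DIGITS[int(c)] for c in s)

def expand_percent(s):
    """Single-digit percentage only, e.g. "8%" -> 百分之八."""
    return "百分之" + DIGITS[int(s.rstrip("%"))]

GENERATORS = {"YEAR": expand_year, "PERCENT": expand_percent}

def generate(nsw, tag):
    """One-to-one conversion from (NSW, class tag) to Chinese words."""
    return GENERATORS[tag](nsw)

print(generate("2008", "YEAR"))   # 二零零八
print(generate("8%", "PERCENT"))  # 百分之八
```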

This paper makes an extensive investigation of Chinese text normalization. An NSW taxonomy is developed based on a large-scale corpus. After a systematic analysis of the taxonomy, a two-stage NSW classification strategy is proposed: finite state automata for initial classification and maximum entropy classifiers for further classification. Experimental results show that this approach achieves good performance and generalizes well to new domains. In addition, the approach is character-based and needs no word segmentation preprocessing.
