Manual for the MONET(1) speech synthesis
engine and parameter editor
(GnuSpeech Version 1.0)

David R. Hill, P.Eng.

© 1993, 1994, 1995, 2001, 2002 David R. Hill. All rights reserved.

Note: this manual is a draft, under development.

The most obvious omission in this draft is a
definition/description of the input and output syntax.

Permission is granted to anyone to copy, distribute and/or modify this document under the terms of the GNU Free Documentation Licence, Version 1.1 or any later version published by the Free Software Foundation; with the invariant sections being Appendices A, B, C and D and all copyright information; with the front-cover text being: (1) Manual for the MONET speech synthesis engine and parameter editor; (2) original author David R. Hill; and (3) a list of all revision authors; and with the back-cover text being: (1) the ISBN; (2) the statement of the purpose of the MONET system; and (3) a summary of the revisions made. A copy of the licence is included in the section entitled "GNU Free Documentation Licence".



MONET is a generalised speech synthesis database editor and manager, and a synthesiser parameter-generation engine, forming part of the TextToSpeech Experimenter Kit--a tool designed for speech researchers. Those interested in speech synthesis, speech production, or speech perception research will find that the tool provides precise access to the timing and nature of the speech events that, put together, form spoken language. This tool also opens up the possibility of experiments and training programs related to speech pathology, speech therapy and the process of acquiring spoken language. Although designed with the richness needed to manage an articulatory synthesiser, it can equally well be used to manage the databases associated with other parametric synthesis methods.

MONET is designed around a conceptual speech synthesis framework of speech postures, speech events, precisely controllable timing, and precise interpolation specifications.

MONET allows researchers to create new speech databases for synthesising speech (fragments or complete utterances), using a variety of synthesisers, by providing tools for:

MONET automatically provides defaults for any component that has not been explicitly defined.

This manual outlines the background needed to understand MONET, describes how to use it to create and modify synthesis databases, and explains how to generate utterances for test and demonstration purposes. The same synthesis engine is used in the TextToSpeech kits intended for developers and end users, and in particular for GnuSpeech, the speech synthesis program that forms part of the GNU Project. The whole software suite--kits, developer tools, experimenter tools, databases, and proprietary software for building and testing these items--which was originally developed by Trillium Sound Research Inc. for the NeXT, and for NeXTSTEP for Intel (NSFIP) operating systems, has been donated to the Free Software Foundation and is now being released under a General Public Licence as GnuSpeech. For those with NeXTSTEP Versions 3.0 to 3.3, the original software just runs. For the rest, the software will be ported over time, initially to the FSF's GNUstep and possibly to OpenStep, and in principle to any operating system. Volunteers for this work are welcome. Real-time performance and sophistication of the computer-human interface will, of course, depend on the power and facilities of the platform hardware and OS to which any port is made.

Note that both MONET and this manual are in the process of development. Both have a number of flaws, but the system has successfully been used to create a complete database for articulatory synthesis of spoken English that has received favourable reviews. Comments, suggestions and help in ongoing development are solicited.

There are many images in the manual which require reasonable resolution to be visible. All the diagrams have been presented as thumbnails in the text. Clicking on a thumbnail will bring up the full-resolution image in a new browser window. The window may be closed or hidden when no longer needed. If it is hidden, remember that it is likely to remain hidden even if a new image is brought up, and it will be necessary to unhide the window to see the new image. Footnotes provide explicit backward links so that the referents of footnotes may be traced from the footnotes themselves. Of course, the back button has its usual effect if a footnote is accessed from the text in the normal way.

Some appendices related to the original research and development are attached, and will be found helpful. Readers are also strongly recommended to read "Real-time articulatory speech-synthesis-by-rules" by Hill, Manzara & Schock (1995) as background to this manual, and to GnuSpeech in general.

Purpose of the system

MONET allows speech researchers to understand and create the databases needed to drive a variety of parametrically driven speech synthesisers in order to test theories of speech production and perception, or to synthesise different languages. The original purpose of MONET was to develop a system to allow spoken language to be synthesised automatically by machines with greater fidelity and control than has previously been possible, based on a new articulatory model derived from work by Fant, Carré and others (Hill, Manzara & Schock 1995).

The development system that had to be created provides close control over arbitrary speech postures, timing, and interpolation. All relevant information used for this purpose is formalised and quantified by rules, equations and values that can be extended and edited. As a result MONET also provides a very useful tool for psychophysical experiments related to speech production, and speech perception. Apart from creating the detailed parameter variations needed to produce the best speech, the system may also be used for the generation of speech stimuli based on systematic changes in arbitrary parameters. Because the articulatory configurations involved in speech can be controlled directly and simply, instead of being approximated by spectral derivations, there exist the further possibilities of experiments and training programs related to speech pathology, speech therapy and the process of acquiring spoken language.


Unrestricted speech synthesis-by-rules conventionally uses methods developed in the 50s and 60s to simulate speech production by feeding information about spectral features to a source-filter model of the vocal apparatus comprising a pulsed energy source, and a set of filters that approximate the resonant properties of the oral and possibly nasal passages of the human head [e.g. Lawrence (1953); Fant (1956)]. Following Tatham (see below), modelling at this level may be termed "low-level" synthesis. "High-level" synthesis is then needed to provide the data required to make the low-level synthesiser speak a particular language. Work by other researchers provided the data and methods (high-level synthesis) needed to drive these models to produce synthetic speech [e.g. Liberman et al. (1959); Holmes et al. (1964)]. The overall approach has been given a variety of names, but formant synthesis seems the most descriptive, since the variable driving data mainly comprises the centre frequencies of the resonances of the vocal tract (correlated with the output spectrum frequency peaks, or formants) interacting with the energy input from the vocal folds and various noise sources formed by the passage of air through constrictions in the tract, perhaps associated with vibrations of fleshy parts such as the tongue. DECtalk and its look-alikes are based on the formant synthesis approach, which is widely used.

More recently, techniques for concatenating waveform segments derived from natural speech have been developed that partially overcome the problems encountered when concatenation was first tried in the early days (problems of joining fixed segments, and of managing pitch independently of the sound spectra).

Both formant and concatenation methods still suffer from restrictions of various kinds that interfere with the potential for naturalness in unrestricted speech--though, for restricted purposes, concatenated natural speech can be very effective. Waveform concatenation is the principal method underlying the Festival system at Edinburgh University's Centre for Speech Technology Research. The Festival project modularises the synthesis process so that researchers can work on a single module (say, intonation management) without having to create and manage the entire synthesis process.

Mark Tatham's SPRUCE project at the University of Essex (UK) is described as a system that provides the high-level synthesis needed to drive both formant synthesis and concatenation synthesis. The emphasis appears to be on concatenation synthesis.

In 1993/4, building on fundamental work by Carré, Fant and others (e.g. Fant & Pauli 1974; Carré & Mrayati 1994), Hill, Manzara and Schock (1995) developed an improved version of the source-filter model that uses a waveguide approximation to the vocal tract and thereby provides an articulatory model, also called a tube model or waveguide model. Such a model emulates rather than simulates the resonant behaviour of the vocal tract, because the tube behaviour modelled maps directly onto the articulatory and acoustic characteristics of the real vocal tract and nasal passage tube structures rather than simply imitating the resonance-mediated output.

The previous barriers to using a tube model for speech were two-fold. First there was the problem of controlling the many sections required for the tube approximation, in real time, without instability or self-defeating approximations [2]; and secondly there was the problem of providing the complete database that represented the behaviour of a real vocal apparatus [3] speaking a particular language.

The work of Fant, Carré and their colleagues provided the theoretical basis for solving the control problem for the tube model. Based on a formant sensitivity analysis by Fant and his colleagues, Carré and Mrayati devised what they call the Distinctive Region Model (DRM), which provides an accurate model of articulation, related to the known properties of the real vocal tract, that requires only eight independently controlled sections instead of the forty or so that seem to be needed if the properties of speech are ignored. The topic is discussed more fully in the paper by Hill, Manzara & Taube-Schock (1995) "Real-time articulatory speech synthesis by rules". The controlled sections correspond closely to the distribution of articulatory possibilities in the vocal tract so that, even though the traditional parameters such as jaw rotation, tongue height, and so on are not used directly, the model is truly an articulatory model, and the traditional parameters could be used to define the changes in the DRM regions. Provision for this intended extension has been made in the basic framework of the system now to be described. It was necessary to make the DRM model practical, and to create the database needed to provide the high-level synthesis component to complement the low-level synthesis possibilities of the DRM-based Tube Model. MONET, originally developed by the author and his colleagues at Trillium Sound Research Inc., was the tool created for this high-level synthesis component.
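The eight-region control scheme can be pictured with a small sketch. The data structure, the region names (r1..r8), the radii, and the category labels below are purely illustrative assumptions; they do not reflect MONET's actual database format:

```python
# Illustrative sketch only: a speech posture represented by its eight DRM
# region radii plus the posture categories used for rule matching. All
# names and numbers here are invented for explanation.
from dataclasses import dataclass, field

@dataclass
class Posture:
    symbol: str                                  # e.g. "aa"
    radii: dict = field(default_factory=dict)    # r1..r8 region radii
    categories: frozenset = frozenset()          # e.g. {"vowel", "voiced"}

# A hypothetical vowel posture: eight independently controlled regions.
aa = Posture(
    "aa",
    radii={f"r{i}": r for i, r in enumerate(
        [0.8, 0.9, 1.0, 1.5, 1.6, 1.4, 1.2, 1.0], start=1)},
    categories=frozenset({"vowel", "voiced"}),
)

assert len(aa.radii) == 8    # the DRM needs only eight controlled sections
```

A forty-section tube would need a value per section per frame; the DRM insight is that these eight region values suffice to capture the articulatory possibilities.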

A paper is in preparation to describe in detail how the database was built, what it contains, and (to some extent) why [4]. However, since all the kits, tools, and databases used in the development are now available under a General Public Licence (http://www/), the databases can already be examined and modified by those interested, using MONET. The relevant standard file for the speech sounds in the current version of these kits is diphones.monet. If any use is made of this file, it is recommended that a copy be made to avoid corrupting the original. This is a normal precaution when experimenting with any computer data. When MONET is started, the first action (unless an entirely new database is being built) should be to open the file diphones.monet or whichever <filename>.monet file is being worked on. The <filename>.monet file contains definitions of the steady articulatory configurations of the tube model for each of the sounds to be synthesised (we call the articulatory configurations postures--see below), along with rules describing how to co-articulate the various postures as diphones, triphones or tetraphones.

MONET also makes provision for creating pitch contours to control the variation in pitch (intonation) over an utterance comprising a number of successive postures, and has a built-in rhythm model that deals with relative and absolute time duration of postures. The rhythm model is derived from research on British English speech by Hill, Jassem & Witten (e.g. Hill, Jassem & Witten 1979, Jassem, Hill & Witten 1984). The intonation model is broadly based on Halliday's model of intonation in spoken English (Halliday 1970), and is integrated with the rhythm model as it has to be. Indeed, it is not possible to describe the Halliday intonation model without also specifying the rhythmic structure.

In dealing with the machine perception and production of speech, a number of technical terms must inevitably be used in order to achieve precision of expression. The reader's attention is drawn particularly to the terms associated with speech sounds (phones, phonemes, postures, etc.) and the basic concepts associated with rhythm and intonation. "A conceptionary for speech and hearing in the context of machines and experimentation" (Hill 1991) provides a source of such conceptual knowledge.

Phonemes are not speech sounds. They are categories of speech sounds. Sounds fall in the same phoneme category for a given language if the difference between them does not distinguish words in that language. Thus the sounds in a given phoneme category--called allophones--may be quite varied acoustically, and may result from a variety of quite different articulatory causes. Sounds in the same phoneme category may be as different acoustically as upper- and lower-case letters are visually (consider the acoustic realisation of the English /r/ sound across various dialects). Equally, allophones from different phoneme categories may be rather similar acoustically (for example, the actual sounds produced as different vowel phonemes may overlap for different speakers and phonetic contexts). This is why we prefer to work from the concrete basis of speech postures. Speech postures can easily be related to the phoneme categories of a language, but they are not phonemes. A series of postures, articulated in succession, will produce a series of phones (instantiations of phonemes--particular allophones of each phoneme). Thus the term phone is sometimes used interchangeably with posture, but the postures in any series interact with and modify each other, which is why the phones representing the same phoneme in different contexts are different. Thus, especially for an articulatory speech synthesiser, the postures and associated interpolation rules, plus special events, time quantities, and intonation contours (or, following Halliday, tones), are the truly basic entities. The notation /r/ represents the phoneme for the "r" sound in any English dialect, while [r] represents a particular allophone of the /r/ phoneme. These are called broad and narrow transcriptions respectively when the notation is used to transcribe utterances.
The broad transcription is a very high-level transcription that assumes an understanding of a particular dialect to fill in the missing details that describe the sounds accurately. The narrow transcription uses all kinds of additional annotations, called diacritical marks, to indicate the exact properties of the sounds of a given utterance. Thus a broad transcription is phonemic while a narrow transcription is phonetic and describes the individual allophones explicitly. The full gory details of this topic may be pursued in any decent text on phonetics or phonology.

Version 1.0 of MONET is complete in the sense that it has been successfully used to create a complete articulatory database for spoken English, including rhythm and intonation, but is still under development to provide additional productivity-enhancing features such as links between various data views and editing tools. Various bugs also need to be fixed, and all components are the subject of ongoing research. However, as noted, MONET was one of the tools used to create the databases associated with Trillium's unique TextToSpeech system based on articulatory synthesis. The other components used in that work included Trillium's interactive tube-model-based Synthesiser system, together with spectrographic analysis and display tools, dictionaries, and so on. Trillium Sound Research Inc. no longer exists, but as a last act of the company, the software and databases developed by Trillium were donated to the Free Software Foundation, under a General Public Licence, and have collectively been dubbed GnuSpeech. The port to GNU/Linux and the further development of the software are now part of the GNU Project. The system is suitable for speech output for end users, for incorporation of speech output into applications by software developers, for use in speech research by university and industry research laboratories, as well as for further development of speech synthesis methods and additional languages.

System overview and rationale


MONET is organised around a time/parameter-value framework that assumes speech is formed by a vocal apparatus moving successively from one vocal posture to another, under contextual influences. Sounds affect neighbouring sounds even when they are not immediately adjacent. Silence (as when not speaking) is a posture, just as much as any vowel or consonant posture, and its specification may equally depend on the context. Postures are also referred to as phones (but not phonemes, which are categories of sound whose realisations vary according to their specific phonetic context and other factors).

MONET assumes that speech is to be produced by a speech synthesiser that is controlled by feeding varying parameters to it at some time rate. No assumptions are made about the nature of the synthesiser, except for the assumption that there is a special parameter controlling pitch variation that will be manipulated specially in order to provide intonation contours. This special manipulation can be turned off, and pitch then treated like any other parameter, but that is not usual. The small pitch variations associated with articulation, which arise from the effect of changes in air pressure across the vocal folds (glottis), changes in vocal fold tension caused by articulation, and the like (microintonation), are handled by a mechanism called Special Parameter Prototypes. The same mechanism handles special variations to other parameters according to the postures involved. Special parameter variations are superimposed on the general parameter variations by "linear superposition" (that is, the effects of parameter specifications from all sources are added together linearly to produce the final values sent to the synthesiser). Thus Special Parameter Prototypes are used to handle microintonation and to introduce phenomena such as noise bursts, which must be superimposed on any existing noise parameter variations.
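The "linear superposition" idea above can be sketched in a few lines. The function and the sample values below are hypothetical, intended only to make the pointwise addition concrete; this is not MONET code:

```python
# Illustrative sketch of linear superposition: special-parameter tracks
# (microintonation ripple, noise bursts, etc.) are added point-by-point
# to the base parameter track before frames are sent to the synthesiser.
def superpose(base, *special_tracks):
    """Add any number of special tracks onto the base track, pointwise."""
    out = list(base)
    for track in special_tracks:
        for i, v in enumerate(track):
            out[i] += v
    return out

base_pitch = [0.0, 0.5, 1.0, 1.0]       # smooth base contour (invented units)
micro      = [0.25, -0.25, 0.0, 0.5]    # microintonation deviations

assert superpose(base_pitch, micro) == [0.25, 0.25, 1.0, 1.5]
```

Because the combination is a simple sum, any number of special tracks (a noise burst track, a microintonation track) can be layered on the same base without the sources needing to know about each other.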

It is assumed that each posture (corresponding to a vocal tract configuration) can be defined in terms of the parameters used to control the synthesiser, but that the parameter values so defined are not necessarily the same for all instantiations (realisations) of the posture--they likely vary with context and other factors; nor do they necessarily take on their characteristic values at the same time--for an articulatory synthesiser, the lips and tongue move independently, whilst for a formant synthesiser, the formant transitions may not be synchronised.

The time framework is constructed starting with a framework (Major Event Times and Posture Targets) that is based on fixed posture targets occurring at fixed times, but the system then provides mechanisms for specifying departures from the framework in a very flexible and comprehensive manner. In particular, although the underlying time framework exists, the main governing principle for time rests on the occurrence of speech events--times at which changes in parameter rates (and therefore perceptual manifestations) begin and end, based on the changing acoustics. The target-time framework is simply a foundation for building the more complex reality. The view is related to research on muscle action groups by William Condon and his associates (e.g. "Speech makes babies move", William Condon, New Scientist, 6 June 1974, which provides a good popular summary).
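As a rough illustration of the target-time foundation, the sketch below interpolates a parameter between fixed targets at major event times, using plain linear interpolation as the simplest possible stand-in for MONET's editable transition profiles. All names and numbers here are invented:

```python
# Illustrative sketch only: a parameter track defined by targets at major
# event times, sampled by interpolation. MONET's real transition profiles
# are far richer (editable shapes, special events); linear interpolation
# is used here purely as the simplest stand-in.
def value_at(t, events):
    """events: time-sorted list of (time_ms, target_value) pairs."""
    for (t0, v0), (t1, v1) in zip(events, events[1:]):
        if t0 <= t <= t1:
            frac = (t - t0) / (t1 - t0)
            return v0 + frac * (v1 - v0)
    raise ValueError("t lies outside the event framework")

# Hypothetical targets for one parameter across two transitions.
events = [(0, 0.75), (100, 1.75), (250, 1.25)]

assert value_at(50, events) == 1.25   # halfway through the first transition
```

The point of the framework is that these event times themselves are rule-governed and movable, so departures from the fixed-target skeleton can be specified per parameter and per context.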

Since we have used MONET exclusively for working with our tube-model-based articulatory synthesiser, the remainder of this document will assume such a synthesiser, in order to provide concrete examples when discussing synthesiser-related concepts and actions. An account of the synthesiser appears in (Hill, Manzara & Schock 1995).

Main components, subsystems and databases

It should be emphasised that GnuSpeech includes a complete database as required for MONET and the tube model synthesiser, and it is not necessary to create databases in order to produce reasonable speech output (including intonation and rhythm). In fact, although some MONET components provide the "brain" that translates input text into the parameters to drive the synthesiser (Tatham's "high-level" synthesis), the end user interested only in the existing speech output capability of GnuSpeech will not need to have any understanding of MONET at all. For such purposes, the system is well hidden. However, the capabilities of MONET, as already discussed, go far beyond providing a fixed speech output means.

The broad divisions of MONET include:

Thus, beyond the general framework of postures having parameter target values and transitional specifications, and universal facilities for adding explanatory comments to the database elements (accessible through the inspector panel in its various forms), the MONET database comprises in detail:

These broad divisions correspond to the various subsystems and data structures that together comprise MONET. The overall database itself is keyed by the postures and posture combinations that, accessed sequentially, create continuous speech. MONET allows for the contextual effects of up to four adjacent postures (tetraphones). The system could be modified to take more context into account, if necessary. For the GnuSpeech system, this has so far proved unnecessary. Context-dependency is equivalent to using diphones, triphones or tetraphones as a basis for synthesis, and allows various co-articulation effects and specific consonant clusters to be handled effectively and efficiently. Context matching is based on logical operations on combinations of postures (phones) and categories of postures (such as "nasal", or "voiceless stop", in the current database).
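Context matching on postures and posture categories might be sketched as follows. The category table, the rule format, and the helper names are assumptions made for illustration; they are not MONET's actual rule syntax:

```python
# Hypothetical sketch of rule matching: a rule is a list of slot
# expressions, one per posture position (2 slots = diphone, up to 4 =
# tetraphone). A slot matches if it names the posture symbol itself or
# a category the posture belongs to. Symbols/categories are invented.
CATEGORIES = {
    "t":  {"voiceless stop", "consonant"},
    "n":  {"nasal", "consonant"},
    "aa": {"vowel"},
}

def slot_matches(expr, posture):
    return expr == posture or expr in CATEGORIES.get(posture, set())

def rule_matches(rule, postures):
    return (len(rule) == len(postures) and
            all(slot_matches(e, p) for e, p in zip(rule, postures)))

# A diphone rule: any vowel followed by any nasal.
assert rule_matches(["vowel", "nasal"], ["aa", "n"])
assert not rule_matches(["vowel", "nasal"], ["aa", "t"])
```

Matching on categories rather than only on individual symbols is what lets one rule cover a whole class of contexts (every vowel-nasal diphone, say) while a more specific symbol-based rule can still override it.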

Fig 1: Full screen view of MONET in use

Figure 1 shows the appearance of the full MONET screen during normal operation. Only the Data Entry window, Inspector Panel, and Rule Builder window have been opened, using the Panels menu, which has been "torn off" the MONET (Main) menu at the top left of the screen. The phone (posture) "aa" was selected, and the Inspector Panel has become a Phone Inspector showing the parameter values for the phone target. The radius of DRM region 5 ("r5") was selected, and the precise value is shown in the box at the bottom of the Inspector. At the moment of the screen shot, the operator had just selected the pull-down menu on the Data Entry window, which therefore shows all the data entry possibilities (the cursor is visible on the pull-down menu).

The major subsystems in MONET corresponding to the above divisions/functions, and accessible from the Panels menu, are:

  1. Inspector (Brings up a standard Inspector panel which fulfills many rôles, depending on the window that is selected as key window, and also on the pull-down menu selection on the inspector itself; it is fundamental to most MONET operations);
  2. Data Entry (Allows: phone symbols to be defined; categories for the symbols to be set up; parameters and meta-parameters relevant to the synthesiser used to be designated; and symbols to be defined for use in computational formulae. Using the inspector, actual parameter targets for each posture can be defined for both normal and special parameters);
  3. the Rule Builder (Rule Window and Rule Parser) (Allows: parsing rules to be defined that determine what transition profiles are used to control the different parameters that must be specified for groups of phones, from diphones to tetraphones--time-synchronisation of the phone target values only occurs at n-phone boundaries);
  4. the Prototype Manager (Has three modes that allow: (1) equations governing necessary computations to be defined; and both (2) normal and (3) special parameter transition profiles to be defined, named and categorised. The named entities may then be assigned to govern appropriate rule processing components in an arbitrary way. (1) involves the use of the Inspector Panel while (2) and (3) involve the use of the Transition Builder (see next) and Inspector together);
  5. the Transition Builder (Allows the form of transition, including deviations in time course and target value, to be defined; particular transition profiles (Transition Prototypes) are selected for a particular parameter variation in a particular context by the rule that matches the phone context. Different profiles may be used for each parameter in a given context. Symbolic references to prototypes for both equations and prototypes are accessed with the help of the inspector.)
  6. the Special Transition Builder (Similar to Transition Builder but for Special Transition Profiles--such as those required to control microintonation and produce noise bursts which must be accurately superimposed on the base parameter movements generated by the normal Transition Profiles selected);
  7. the Synthesiser Engine and related interactive controls (Synthesis Window, Synthesiser Control Panel and Intonation Control) (This component is used mainly in experimental mode, to generate test utterances of various combinations of phones so that fidelity and naturalness can be assessed. The Intonation Control allows modifications to be made to the intonation contours produced by the intonation model based on M.A.K. Halliday's work [Halliday 1970, for example] and thesis work by Craig Schock [Taube-Schock 1993] which is built into the system. The Synthesiser input format also allows various mark-ups so that the metric structure of utterances being used for test purposes can be varied. Such variation affects both rhythm and intonation.)
  8. in addition, there is the Tube Model of the vocal and nasal tracts which is necessary for sound output, but is not considered part of MONET (The tube model, which provides a direct emulation of the physical vocal and nasal tracts, could be replaced by an arbitrary synthesiser, provided a suitable new database was supplied to specify the kind of phones, targets, rules, parameter variations and so on required by the new synthesiser. The tube model used in GnuSpeech and the underlying theory are described in Hill, Manzara & Schock [1995]. Interactive access to the Tube Model is provided by Synthesizer, a separate application. A screen shot of the Synthesizer application appears in the section Import TRM Data below, but is not discussed in any detail.)

The interfaces to these components and sub-components (except the Synthesizer application related to the Tube Model itself, not to be confused with the Synthesis component of the MONET application) are accessed through the Panels menu, brought up using the selection of that name in the MONET (Main) menu. The menu can be torn off, so that it stays around, as shown in the screen shot (Figure 1, above).

In the Speech Server--which provides speech output facilities for applications, as a service, or as embedded capability in the host system, based on the Tube Model low-level synthesis driven by the high-level synthesis engine from MONET--these interactive components are unnecessary, because the server runs autonomously as a daemon in the background, using a database for spoken English that has already been created. However, to create the database of posture targets, and rules for the dynamic parameter variations needed for a new language, different English dialect, or modification of the existing database; to create specific speech stimuli for psychophysical experiments; to experiment with different intonation possibilities; and the like; it is necessary to provide interactive access to the various components above, together with facilities for managing and updating the various databases. Thus MONET has two "personae" (characters)--the visible persona provided by the system that is the subject of this manual; and the invisible persona that is provided by the Speech Server. The two personae have much in common at the core, but are distinct.

Fig 2: ServerTest window

The Speech Server itself will also not be discussed in detail in this manual. When the server is running under GNU/Linux, its use will be documented in the usual man pages. However, there is a simple Graphical User Interface (GUI) for the Speech Server--ServerTestPlus--which allows all the facilities provided by the server to be tested directly (Figure 2). Since this includes the actual synthesis of utterances, as well as manipulation of the fixed parameter settings, the test system is able to generate syntactically correct input for the MONET engine. This is useful when experimenting with the full MONET system because it facilitates creating correct input for the Synthesis Window subsystem when editing the database and testing the result of modifications. An earlier server testing module that hid certain components--ServerTest--is obsolete now that the system is Free Software and fully open.

Thus, using the MONET facilities, a set of posture symbols may be defined, and their DRM model target values specified. Time values may be associated with the postures, and used in setting up the time-event framework for parameter interpolation and special event creation. Rules may be defined for recognising particular phone contexts (posture configurations) and for selecting appropriate interpolation methods (transition profiles and special event profiles). Finally, speech may be synthesised (with or without intonation) by passing the output from the Synthesiser module to the Tube Model, producing a waveform that may be written to a file or sent to a sound card.

This organisation was chosen to allow precise control without loss of generality, and to ensure that any knowledge used to produce parameter variations needed for synthesis, or for the correction of imperfections in synthesis, would be precisely specified, recorded and reproducible. It was designed as a very general research tool and language database creation tool as well as a speech output method. Apart from creating the detailed parameter variations needed to produce the best speech, the system may be used for a variety of purposes including the generation of speech stimuli based on systematic changes in arbitrary parameters for use in speech therapy, language training, or psychophysical experiments (for example).

Parameter computations

The parameter computations necessary for synthesis are carried out "on the fly" (in real time), as needed. The same synthesis engine that powers MONET's synthesis (though without the graphical interface and editing facilities) is built into the GnuSpeech User and Software Developer Kits. The concept is similar to the original NeXT computer philosophy of using the same PostScript system for screen display and printing. The strategy ensures that any speech synthesis is carried out in the same way.

Using the system to view and edit an existing database

Getting started

It is assumed that you can install MONET correctly on your system. For help with this, please refer to the installation manual. Double-click the MONET icon on your desktop, or start the program from the command line. You will see the main menu for MONET (Figure 3).

Fig 3: Main Menu

If you intend to work on an existing database, click on MONET/Document and select the <filename>.monet file you wish to work on. If you wish to work on diphones.monet, which is the file supplied with the GnuSpeech kit, make a copy and work on that. You will find the file in /usr/local/lib/system. You could rename the original diphones.monet file as diphones.orig and place a copy of your working file in /usr/local/lib/system under the name diphones.monet. It would then replace the original database for all synthesis operations in the system. But you would not do that until you were reasonably satisfied with the new dialect or language you had thereby created.

For all these operations, you will need to set up appropriate file permissions. Take care that you do not lose track of which file is the original and which file(s) is (are) the ones you wish to experiment with. By changing diphones.monet "on the fly" you could have the system speak with different voices depending on which database was referenced. It would be better to arrange for a file name to be passed to the Speech Server as a parameter--a feature that should be added.

Facilities accessed via the Panels Menu

Fig 4: Panels sub-menu

Most of the facilities you require to create and modify databases are accessed via the MONET/Panels menu selection (see Figure 3), which brings up the Panels sub-menu shown here (Figure 4):

The Inspector Panel

(1) The Inspector selection brings up a normal GNUstep Inspector Panel (Figure 5(b)), but one which is constrained always to remain in front of other screen objects. It allows attributes and values for whatever entity is selected to be viewed and changed. Its appearance and specific function change, as identified in the title bar, according to the context and current active selection. In Figure 5(b), because the key window is the Data Entry panel, the Inspector has become a Phone Inspector.

The Data Entry Panel

Fig 5: (a)Data Entry panel with; (b) Phone Inspector showing target values

(2) Data Entry brings up a panel that allows the names for basic posture symbols (phones) to be defined and characterised. This panel appears on the right in Figure 1 with the posture "aa" highlighted, and also by itself in Figure 5(a). Figure 5(b) shows the Phone Inspector that allows inspection of the parameter values associated with the phone. Buttons on the Data Entry panel allow a posture to be added, renamed or removed (deleted). Double-clicking a defined posture symbol will automatically open the Inspector panel as a Phone Inspector (if it is not already open), as in Figure 5(b), allowing access, through a pull-down menu at the top of the Phone Inspector panel, to: the Parameter Target values; Meta Parameter Target values; Categories into which the posture falls; and Symbols associated with the posture, as well as Comment space for entering comments to help document the postures, their attributes, and associated values. The pull-down menus for the Data Entry panel and the Phone Inspector are shown in Figure 6. More precisely, the Data Entry pull-down accesses:
(2a) Categories of postures which help in expressing generalised rules and may be used interchangeably with postures in rule definitions (the non-exclusive categories currently defined appear in Figure 6(a), and include phone, asp, vocoid, etc);
(2b) the basic Parameter Targets which specify the values needed to drive the synthesiser to realise the particular nominal posture (for our articulatory synthesiser database these include: cross-sectional radii, r1 through r8; the state of the velum; specification of superimposed microintonation; and so on, as shown in Figure 6(b));
(2c) Meta Parameter Targets which, for our articulatory synthesiser, would relate to parameters such as jaw rotation and lip opening, which can be converted to the basic parameters such as cross-sectional areas by means of defined equations expressing mutual constraints; although central to future plans, no meta-parameters are currently defined;
(2d) Symbols, arbitrarily named, which represent the time specifications associated with each posture and which are usable both directly and as quantities in equations. Equations allow the definition of derivative symbols. Any defined symbols may be used to define the time of occurrence of events which--in turn--define the time framework for the definition of transition profiles (trajectories) between successive nominal posture targets. In the existing database, as shown in Figure 6(c), only very basic symbols have been defined: the overall duration of phones (duration); the proportion allocated to transitions (transition); and the quasi-steady-state portions of phones (qssa and qssb). The remaining symbols (a fairly numerous set) are derived from these to control the transition profiles (interpolation methods, parameter trajectories) and are defined using the Prototype Manager in the equation mode for definition, or in the transition or special event mode for use in determining time positions within transition profiles.

Fig 6: (a) Data Entry Panel and (b) Phone Inspector panel, each with pull-down menu activated

The pull-down menu of the Data Entry and Phone Inspector panels (Figure 6) allows further data types needed for synthesis to be defined, set and viewed to facilitate management of the overall database creation and synthesis processes. The types cover Categories for grouping phones; the (named) Parameters required for feeding to the actual synthesiser; the Meta Parameters that might be used in the synthesis engine process (MONET), as opposed to the physical parameters fed to the synthesiser itself; and the basic Formula symbols that are needed in the various timings and computations, as shown in Figures 7(a), (b) and (c). Thus the Data Entry options allow the named variables of the synthesis process to be set up, while the Phone Inspector allows the values to be assigned for particular instances (such as the DRM model radii corresponding to the "target" values needed for our form of synthesis). Provision for comments is made.

The only formula symbols defined for the current GnuSpeech process are duration, transition, qssa and qssb (the nominal posture duration, the internal divisions associated with the major transition to/from the neighbouring posture, and the division of the quasi-steady-state portion into a and b components). This forms the underlying framework to which the actual movements and timings are related and anchored. The categories are self explanatory, except that the two "hack" categories are added to simplify some rewrite rules that insert extra pseudo-postures to avoid synthesiser anomalies. The parameters are those required for dynamic operation of the tube model (as opposed to the static parameters that control synthesiser qualities associated with the speaker, such as vocal tract length, pitch range, nasal cavity shape, glottal pulse shape, air stream temperature and so on). These may be set by the synthesiser Control Panel. Double clicking on an entry in any of the Data Entry panels opens the Inspector panel (if it is not already open) which then allows maximum, minimum and default values to be assigned where appropriate, or comments to be added to explain the purpose or use of the various items that can be defined/selected.
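The relationship between the basic formula symbols and the equation-derived symbols can be sketched as a tiny evaluator. This is a hypothetical illustration: the derived symbol names and equations shown are invented, not entries from the actual database:

```python
# Hypothetical sketch: derived timing symbols are defined by equations
# over the basic symbols (duration, transition, qssa, qssb).  The
# derived names and equations below are invented for illustration.
basic = {"duration": 100.0, "transition": 40.0, "qssa": 30.0, "qssb": 30.0}

equations = {
    "halfTransition": "transition / 2",
    "qssStart": "duration - qssb - transition / 2",
}

def evaluate(symbol, env=basic):
    """Return a basic symbol's value, or evaluate its defining equation."""
    expr = equations.get(symbol)
    if expr is None:
        return env[symbol]
    # restrict eval to the known symbols -- adequate for a sketch
    return eval(expr, {"__builtins__": {}}, dict(env))
```

Each derived symbol can then be used wherever a time value is needed, for example to place a rate-change point within a transition profile.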

Fig 7: Data Entry Panel showing (a) Categories, (b) Parameters and (c) Symbols selections (as defined for GnuSpeech)

The Rule Builder and associated Inspector Panel selections

Fig 8: Rule Builder Window

(3) The Rule Window selection in the Panels menu brings up a Rule Builder window (Fig. 8) to allow the rules--used to consume tokens from a phonetic input string fed to MONET--to be created, displayed and edited.

The consumption of one or more tokens5 by a rule causes appropriate target and interpolation data to be selected to construct the next portion of parameter input for the synthesiser. To show what tokens can fill a given expression (Total Matches column) it is necessary to place the cursor at the end of the expression and hit Enter.

It is very important that the rules are ordered from the more specific to the more general. Thus specific postures (phones), narrower categories, and larger context, will appear in rules near the start, while broader categories, and smaller context will occur towards the end. The last rule simply recognises "phone >> phone" as the default, and is present when MONET is first invoked. In this way, even before any rules have been created by the user, the system will not fail for lack of a rule. This is a general philosophy. The system should function and produce parameters, even if they are totally generic, while the database is under construction.
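The specific-before-general ordering amounts to first-match dispatch over an ordered rule list. A minimal sketch follows; the category table and rule patterns are invented examples, not the shipped database:

```python
# First-match rule dispatch over a most-specific-first rule list
# (hypothetical sketch; categories and patterns are invented examples).
CATEGORIES = {"p": {"phone", "stopped"}, "aa": {"phone", "vocoid"}}

def element_matches(element, token):
    # a pattern element matches a token if it names that posture
    # directly, or names a category the posture belongs to
    return element == token or element in CATEGORIES.get(token, {"phone"})

def apply_first_rule(rules, tokens):
    # the last rule is the catch-all "phone >> phone", so a match
    # always exists even while the database is under construction
    for pattern in rules:
        if len(tokens) >= len(pattern) and all(
                element_matches(e, t) for e, t in zip(pattern, tokens)):
            return pattern
    raise AssertionError("the catch-all rule should always match")

rules = [["stopped", "vocoid", "phone"],   # specific triphone rule first
         ["phone", "phone"]]               # default diphone rule last
```

Placing the triphone rule after the default would make it unreachable, which is why the ordering controls facilitate moving rules up and down the list.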

Note that if the Rule Builder Window is opened prior to opening a .monet file, only the "phone>>phone" rule will show. Opening the <filename>.monet file should therefore be the first action when using MONET, unless the user is starting a database from scratch.

Figure 9: Inspector Panel during selection of a rule, showing the pull-down menu

If the Inspector panel is open, then double clicking on a rule will cause it to become a Rule Inspector and display various data associated with the rule, depending on what has been selected from the pull-down menu that forms part of the inspector panel. The possibilities include: General Information (as actually shown in the figure); the Equations involved in timing (categories and names); the categories and names of the Parameter Prototypes (for regular parameter movements) and Special Prototypes (for additional parameter movements that will be superimposed on the basic structure) as necessary to implement the rule; the names of the Meta Parameter Prototypes; and Comments (again, to help with documentation).

General Information: causes the Rule Inspector to display the number of tokens (postures, phones) consumed by the rule along with the order in which the rule appears in the list of rules. It also provides a way of changing where the rule is placed in the list. Note that when a new rule is added, it is added immediately before the last rule (which the reader will remember is "phone>>phone"). The ability to re-order the rules is important, since it determines their precedence. The more particular rules generally take precedence over the more general and the ordering should reflect this.

Fig 10: Rule Inspector: Equations (a) before clicking "Rule Duration" and (b) after clicking "Rule Duration"

The Equations selection displays a list of the major timing framework symbols (Major Events) comprising: Rule Duration (the nominal diphone, triphone or tetraphone duration); Beat (the location of the rhythmic beat); plus Mark1 and Mark2 (which specify the nominal time for achieving the target values for the second and third phones in triphones and tetraphones respectively). Rule Duration signifies the nominal time when the target values for the last posture in an n-phone are achieved. Points may be assigned arbitrary times by choosing the governing symbol; the Major Event times simply provide a framework. Clicking on a Major Event highlights the category and symbolic name of the timing value, derived from an equation which was previously defined using the Prototype Manager6. Many categories and names are defined within this module since they are also used to specify the timing of Minor Events. Minor Events are the times of additional changes in parameters as needed to form the detailed parameter movements corresponding to the articulations associated with the language and accent being synthesised. Thus changes in parameter rates can be associated with either Major or Minor Event times, as set by the user. In this way, the Major Event timing framework is only a nominal framework, deviations from which are the norm rather than the exception. (It will be noted that some items have a symbolic name of "dud#". This is because equation symbols can currently be renamed, but not deleted--an obvious problem. This bug needs to be fixed!)
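Assuming, purely for illustration, that each Major Event falls at the cumulative nominal posture duration (the real times come from the user-defined equations), the shape of the framework could be sketched as:

```python
# Hypothetical layout of the Major Event framework for an n-phone:
# the target of posture k+1 is nominally reached once the durations of
# postures 1..k have elapsed.  The real times are set by user-defined
# equations; this only illustrates the framework's shape.
def framework_times(durations):
    times, t = {}, 0.0
    for i, d in enumerate(durations[:-1], start=1):
        t += d
        name = "RuleDuration" if i == len(durations) - 1 else "Mark%d" % i
        times[name] = t
    return times
```

So a triphone yields Mark1 and Rule Duration, and a tetraphone yields Mark1, Mark2 and Rule Duration, matching the description above.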

Fig 11: Rule Inspector: Parameter Prototypes (a) before finding and clicking "r7"; and (b) after finding and clicking "r7"

Parameter, Special Parameter and Meta Parameter Prototypes: If Parameter or Special Parameter or Meta Parameter Prototype is selected from the pull-down menu in the Rule Inspector window the top sub-panel now displays the parameter names while the two lower sub-panels continue to display the timing value symbol categories and names, as derived from the equations set up by the user from the Prototype Manager in the appropriate mode. Those in use are bolded. Those defined but not in use are in normal type face. Clicking on the parameter name in the top sub-panel highlights the category and name of the relevant transition prototype, in much the same way that the category and name of the timing value were highlighted for the Equation selection. Double clicking on a parameter will bring up the Transition Builder or Special Transition Builder window for that particular parameter, allowing the particular interpolation prototype specification to be edited by selecting points, changing the value or timing, or by other measures (see Transition Builder, below). No meta parameters are currently defined and the meta parameter part of the MONET system has therefore not been completed. Meta parameter transition profiles are currently inaccessible/unusable. This facility needs to be implemented soon, to allow normal articulatory parameter driving schemes to be developed.

Checking which rule applies: the Rule Parser

Fig 12: Rule Parser Window

(4) The Rule Parser is a utility window that allows a string of up to four specified MONET input symbols to be tested to see which rule will apply if they are fed to MONET. This facilitates ordering and debugging the rule set. Enter four valid phone (posture) symbols, hitting "Enter" after each entry until four are entered, and the first applicable rule will be displayed in the dark window together with the actual values of the framework times as calculated according to the equations associated with the rule, and the number of tokens (phone/posture symbols) consumed by the rule. If fewer than four phones are involved, it is still necessary to enter all four fields, using blanks as needed. The new display appears when "Enter" terminates the last of the four entries. When testing a string, the unused tokens from one rule, plus any further tokens, must be re-entered to determine which is the next rule to apply. Note that unused framework fields show very small values. They should show some symbol indicating they are not applicable. This problem needs to be fixed.

Prototype Manager for Equations and Transitions (Interpolation) with associated Inspector Panels

Fig 13: Prototype manager window: (a) in Equation mode; and (b) in Transition Prototype mode
(Note (a) shows the time point group (category) Carré with endOfStopClosureOnset as the specific time symbol highlighted, showing the governing equation of the symbol; (b) shows the Transition Profile group (category) Carré with voicedStopClosure as the specific Transition Profile highlighted, identified as a Triphone Transition Profile.)

(5) The two Prototype Manager windows are associated with the selected point (box round the selected point) and the actual Transition Profile respectively, as shown in Figure 13. The Prototype Manager is used to create, name and edit the symbol and profile names as well as the Transition and Special Prototypes (interpolation templates) themselves. The Inspector panel becomes a Point Inspector when a point is selected in the transition profile, so that the name and governing equation of each point may be checked, its percentage value entered and adjusted, or the time symbol defining its time of occurrence changed. Note that there should be a more direct connection between the point and the Prototype Manager window. At present, the latter shows the timing symbol in full, but changes must be made in the Inspector Panel, which often entails counting down the symbols because they are truncated to fit the panel.

Typing a name into the white text field defines a new category name, time-point name, or profile name and enters it into the appropriate part of the database, which changes the display accordingly.

By defining timing symbols in a regular succession, points in the Transition Profiles may be systematically moved (by selecting different symbols for successive synthesis trials of a given utterance). The values achieved may similarly be varied using the Point Inspector. In this somewhat tedious manner, systematically varying stimuli may be produced for various purposes, such as psychophysical experiments (e.g. voice onset time experiments). The process should be automated.

The Transition Builder and associated Point Inspector

Fig 14: Transition Builder Window
(The Transition Prototype Manager window shows parameter r7 for Rule 6, which is governed by VoicedStopClosure from the Carré group (category) of rules. The point selected is at endOfStopClosure from the Carré group (category) of time symbols, as revealed by the Point Inspector.) This example was deliberately chosen to fit in with the Rule Builder window highlight of Figure 10 and the associated Inspector Panel of Figure 13(b).

Figure 14 shows a Transition Builder window7. Symbolically named points mark the beginnings and ends of segments during which a given rate of change of parameter value applies. The rate of change that applies during any given segment is represented by the percentage value of the point between 0% (the actual value being determined by the value achieved during the previous transition--normally the nominal target value for that parameter for the previous posture) and 100%, which represents the target value for the posture to which the current transition is taking the parameter. Only one Transition Builder may be open at a given time. To see changes being made, or to view a different prototype, it is necessary to double-click the relevant Transition Prototype name. If the Transition Builder is selected from the Panels Menu, a blank prototype is displayed. Note that the associated Point Inspector displays information about the rate-change point that has been selected in the currently active Transition Builder window, showing the type of point, the percentage value of the total parameter change in progress that is to be reached at that point, and also the category and name of the time value at which the point occurs. The type of point refers to whether the point is part of the transition to the second posture (disc), the third (triangle), or the fourth (square) as appropriate in a diphone, triphone or tetraphone. Obviously a diphone would not have a third or fourth posture, and a triphone would not have a fourth posture.

It must be emphasised that the Transition Prototype shows, for the parameter concerned, how the value changes between the nominal target time of one posture and the next, the percentage being a percentage of the difference between the two target values. If the values happened to be the same, the actual parameter value sent to the synthesiser would not change at all (any percentage of zero is zero). If the difference is positive, the actual change would resemble the prototype in form over a particular transition, but the amplitude would depend on the absolute difference. If the difference were negative, then the actual parameter change would resemble an inverted form of the prototype at an appropriate amplitude. The actual parameters are continuous. The apparent discontinuities in the Transition Prototype arise because having achieved 100% of one parameter value change (from around one steady state to the next), the next transition (normally) starts at 0% again, only reaching 100% of the change around the next steady state time. It is possible to use percentages outside 0% and 100%, but this would clearly cause problems if the values were not 0% at the start, and 100% at the end of a given n-phone. However, this method of representing the parameter changes is clearly necessary, and explains the non-intuitive character of the profiles, which embody an apparent step at the time when control changes from one target to the next. This step leads to the need for "Phantom Points" since there are two points at this change of control boundary, but the MONET engine has to deal with only one point.
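The percentage semantics described above reduce to a simple linear mapping, sketched here with a hypothetical function name:

```python
def actual_value(prev_target, next_target, pct):
    """Map a transition-profile percentage to an actual parameter value
    (hypothetical function name).  0% is the value carried over from the
    previous transition (here taken as the previous target); 100% is the
    target of the posture the transition is moving towards."""
    return prev_target + (pct / 100.0) * (next_target - prev_target)
```

The special cases in the paragraph above fall out automatically: equal targets produce no movement (any percentage of zero is zero), and a negative difference produces the inverted form of the prototype at the appropriate amplitude.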

It is possible to determine the shape of transitions between postures by imposing a slope ratio on the segments of the transition rather than specifying the value of each point. This produces a similar "shape" despite changes in the length assigned to an n-phone, and is used frequently.
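One plausible reading of slope ratios (an assumption for illustration, not a statement of MONET's exact algorithm) is that each segment's share of the total 100% change is proportional to its slope ratio times its duration, so the point percentages follow by normalisation:

```python
# Assumed realisation of slope ratios: each segment's share of the
# total 100% change is proportional to its slope ratio times its
# duration, so the shape survives changes in the n-phone's length.
def points_from_slope_ratios(ratios, seg_durations):
    weights = [r * d for r, d in zip(ratios, seg_durations)]
    total = sum(weights)
    pct, points = 0.0, []
    for w in weights:
        pct += 100.0 * w / total
        points.append(pct)
    return points   # cumulative percentages; the last is always 100%
```

Because only the ratios are fixed, stretching or shrinking the segment durations rescales the trajectory while preserving its overall shape.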

The Transition Builder window shows vertical boundary lines representing the conceptual values of the Major Event times for the diphone, triphone, or tetraphone to which the prototype parameter transition profile applies. These are for reference only since the actual points may be placed anywhere, according to the times calculated by the user-defined equations and referenced by the user-defined time symbols at which actual change-of-rate-of-change points are placed, as previously noted. These user-defined event times are represented by short line segments at the bottom of the transition graphs. In the Synthesis window, they appear as full lines across the parameter track display, as shown in Figure 21.

Fig 15: The Point Inspector panel associated with Figure 14

The Point Inspector pull-down menu has only one entry at present--General Information. The value, type of point, and applicable time symbol may be changed. If there is no time symbol that specifies the time at which the user wishes to place a new point, it is necessary to define a new time symbol through the Equation Prototype process, and then return to the relevant Transition Profile and Point Inspector to use it. By creating and using the values for points, arbitrary transition profiles may be constructed for Parameters and Special Parameters to control the construction of the actual parameter tracks sent to the synthesiser. This interaction should be improved by linking the two processes together in an intuitive way.

Relation between the Rule Builder, Equation Prototypes and timing/duration symbols. How do we know where a given equation is used?

Fig 16: Rule Builder Window

If "Equations" is selected on the Rule Inspector panel associated with the Rule Builder window, the major event timing symbols are listed in the top part of the Inspector panel, as already discussed in connection with Figure 10.

Clicking on a symbol displays the category and name of the symbolic duration. In Figure 17, Rule 11 has been highlighted, Rule Duration has been clicked, and the Rule Duration symbolic duration turns out to be the same as for Rule 6 (it is actually used for many rules) as shown in Figure 18.

Fig 17: (a) Rule Inspector with "Rule Duration" selected; (b) Prototype Equation Manager with the same rule selected

Fig 18: (a) Prototype Equation Inspector with "Equation" selected; (b) Prototype Equation Inspector with "Usage" selected

For Figure 18 (a) and (b), the Prototype Manager (for Equations) of Figure 17 has been made the key window. As a result, the inspector has changed to a Prototype Equation Inspector and provides details of the highlighted rule, which has been manually selected to be the same as the rule highlighted in the Rule Builder window of Figure 16. Choosing "Equation" from the pull-down menu on the inspector allows the equation governing the symbol value to be edited. Choosing "Usage" from the pull-down shows in which other rules the same timing symbol is used, which should make the user think carefully before changing the symbol value. It may be necessary to create a new symbol, specific to the changed value, to be used only in the places where the new value is appropriate.

Note: At present, there is unfortunately no mechanism for moving directly to equation editing by double-clicking the timing symbol name in the Rule Inspector. This is a deficiency that needs to be fixed! At present, as just described, it is necessary, by a manual process, to: (1) note the equation category and name from the Rule Inspector while the Rule Builder is the key window; (2) make the Prototype Manager the key window; (3) select Equation Prototypes on what has now become the Prototype Equation Inspector; and (4), in the Prototype Equation Manager, select the duration symbol category and double-click the symbol name (in this case VStopVTotalDur, according to what was displayed in the Rule Inspector as above). Then the equation appears as the Selected Prototype Equation in the white text field in the Prototype Equation Inspector panel, which allows editing to be carried out. The process is not made easier by the fact that names are truncated to fit the inspector window, and what is seen may be ambiguous, so it is often necessary to count from a non-truncated symbol to ensure that the correct name is transferred from the Rule Inspector to the equation prototype management process. Not user friendly!

The task of entering a new timing symbol (including possibly a new category) and defining the governing equation is similar, except that there is the extra step, in the Equation Prototype Manager, of defining the timing symbol (category and name) by selection and/or typing.

"Usage" from the same pull-down menu shows in which rules the selected equation is used (Figure 18(b)). "Comment" (not shown) allows explanatory comments about the equation to be viewed, added or edited. More information on these topics is provided below under Rule Development and Sample Dialogues.

Synthesising speech from within MONET: the Synthesis Window

Fig 19: The Synthesis Window with parameters r3, r7, r8 and velum selected for display

(8) Selecting Synthesis Window opens a window used to accept a phonetic input string for synthesis, display the parameters constructed from the input string in graphical form, and to run the synthesis. The syntax for input strings is currently defined only in the source code for the GnuSpeech system. However, the input string needed may easily be constructed using the Line Pronunciation "hidden" method in the ServerTestPlus application8. Simply type in the text on the top line of the ServerTestPlus application window and click on the Line Pronunciation method under Hidden Methods. Once a valid input string for MONET has been constructed, it may easily be edited. Care is necessary when constructing input strings for MONET because, in the present version, no error checking of the input is performed, and bad strings will cause the application to crash. The syntax should be properly documented, and error checking added to MONET. The "Hidden Methods" selection is a historical carry-over from the time before the text-to-speech system was put out under a GPL as GnuSpeech and should be changed, now that the work is Free Software. Originally there was a "ServerTest" utility for the public to use, which hid the Hidden Methods, and a "ServerTestPlus" utility for use by the originators of the system, which revealed the Hidden Methods. The former can be dropped, and ServerTestPlus updated to present all test methods on an equal footing and renamed ServerTest.

A selection list at the top left-hand side of the window allows the user to choose which parameter tracks will be displayed. Only four may be seen at the same time due to space limitations. It would be possible to allow scrolling, and also allow the order of the parameter displays to be changed, but this is not a priority. Double clicking on a named parameter will toggle the graphical display of the variation in that parameter "on" or "off". The parameter variation (parameter track--constructed from successive transition and special prototypes applicable to the parameters) is presented against a framework of value and event times. The major events are shown as black vertical lines and the minor events as grey vertical lines. Cursor tracking is provided to allow the value at particular times to be read directly, but is only partially implemented in this pre-release version. Note that there are two possible displays for each synthesiser parameter. One shows the track constructed from the Transition Prototypes. The other shows the deviations from that track constructed from the Special Prototypes. This approach allows special events to be inspected and included by linear superposition, regardless of what the basic parameter profile (interpolation template) may be doing.
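The superposition described above is plain element-wise addition of the two tracks, as this minimal sketch shows (hypothetical function name):

```python
# Sketch: the final parameter track is the basic profile track with the
# special-event deviations linearly superimposed, sample by sample.
def compose_track(base_track, special_track):
    return [b + s for b, s in zip(base_track, special_track)]
```

Because the two contributions are kept separate until this final addition, a special event can be inspected or edited without disturbing the basic interpolation template, and vice versa.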

Double clicking and dragging on the second click9, within the parameter display window, allows the time scale to be compressed to show a longer section of parameters. There is currently no mechanism for scrolling so that the time scale for lengthy pieces of speech becomes quite small. There is a limit on the total length, which is not likely to be encountered in the short utterances normally used in testing. It would be useful to add a scroll bar to allow scrolling while keeping the same scale, but also retain the current mechanism for changing the scale. Both would be useful. For the same reason, the scale changing mechanism could usefully be added to the Intonation Control window (see below) while retaining the scrolling mechanism.

Certain other fields and controls are provided in the Synthesis Window to help control intonation, divert the output speech to a sound file (.snd), control the tempo (speed) of the speech, use the software synthesiser (written in "C") instead of the DSP synthesiser (written in Motorola 56001 assembler), and dump the values stored in the database. The DSP version is actually obsolete now that synthesis can be achieved in real time on current host processors, so these selections need revising. The current development version has some inconsistencies in the intonation controls, but they should not cause problems. The toggle for intonation is inactive, as is the field for tonic movement. A separate Intonation panel has been added which is accessed by a selection in the main menu (Figure 3). It will now be discussed.

Fig 20: The Intonation Control window
(Showing the basic intonation pattern for the utterance "GnuSpeech" without smoothing. Note, the initial pitch rise occurs when there is no voicing.)

The Intonation Control window allows selection of Macro Intonation, Micro Intonation, Drift and Smoothing. A full discussion of the rhythm and intonation models, plus ongoing research, is to appear in papers in preparation. Suffice it to say here that the rhythm is based on the assumption that English is a stress-timed language, which means that successive word stresses10 in spoken English fall (or are perceived to fall) at somewhat regular intervals (so-called isochrony). This contrasts with a language like French which is described as "syllable-timed", meaning that it is the syllables which fall (or are perceived to fall) at somewhat regular intervals.

In M.A.K. Halliday's model of intonation, which we follow closely, the most highly stressed word in a phrase or sentence conveying information is called the tonic and provides the information focus of the utterance. A significant body of spoken English was statistically analysed in order to determine the source(s) of the regularity, which turns out to be related to constraints on the length of phones in different contexts, both at the segmental level (the level of phones) and the suprasegmental level (the level of prosody--which covers many phones and relates to the speaker's resources for identifying the important content of an utterance). The work has been reported in several papers (e.g. Jassem, Hill & Witten 1984; Hill, Jassem & Witten 1979; Hill, Manzara & Schock 1992). According to the model developed, rhythm is determined by choosing the lengths of the phones according to these contexts. The intonation is then fitted to the sequence of phones according to the stressed and tonic syllables in a method closely based on Michael Halliday's work (e.g. Halliday 1970). Feet are derived from the intended utterance by placing a boundary before each stressed syllable. These feet form the basis of the partial isochrony as noted above. The tonic syllables are the stressed syllables of the main content words in the utterance and receive special treatment in both rhythmic and intonational terms. All the segments in the tonic foot are given extra lengthening; and the main pitch movement of the intonation contour occurs over the foot, with the greatest movement on the tonic syllable. Pre-tonic and post-tonic feet in a phrase or short sentence (a "phrase or short sentence" would loosely correspond to Halliday's tone group) receive some pitch movement, but not as much as, or necessarily even the same form as, the tonic foot.

There are five main tone groups, each receiving a different form of pitch movement, with further variations within these groups. This is what we call Macro Intonation. In addition, we have to add Micro Intonation, which reflects the pitch changes resulting from changing air pressure across the vocal folds (glottis) as the constrictions in the vocal tract come and go; there is also some Drift, or variation in the absolute pitch values making up the contour, which simulates the natural variation that occurs when someone says the same utterance several times in succession. Finally, since the basic model is rather simple, and only defines straight-line segments for the pitch variations required, Smoothing is added; but not just arbitrary smoothing. The form of the smoothing turns out to be quite critical to the quality of the intonation and is an area we are continuing to investigate. Figure 21 shows the same contour as Figure 20, but with smoothing added according to the current model.
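As a rough illustration of how these four components combine, the sketch below builds a contour from straight-line macro segments and then optionally adds drift and a smoothing pass. The control-point format, the drift magnitude, and the moving-average smoothing are all assumptions for illustration; as noted above, MONET's actual smoothing is more subtle and still under investigation.

```python
import random

def macro_contour(points, t):
    """Piecewise-linear macro intonation: interpolate pitch between
    (time, pitch) control points (the straight-line segments above)."""
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return p0 + (p1 - p0) * (t - t0) / (t1 - t0)
    return points[-1][1]

def contour(points, times, drift=True, smoothing=True, seed=0):
    """Sample the macro contour, then optionally add drift and smoothing.
    Both the drift step size and the 3-point moving average are stand-ins
    for MONET's actual models."""
    rng = random.Random(seed)
    raw = [macro_contour(points, t) for t in times]
    if drift:  # slow random wander of the absolute pitch values
        d = 0.0
        for i in range(len(raw)):
            d += rng.uniform(-0.05, 0.05)
            raw[i] += d
    if smoothing:  # placeholder smoothing; endpoints kept fixed
        raw = [raw[0]] + [(a + b + c) / 3
                          for a, b, c in zip(raw, raw[1:], raw[2:])] + [raw[-1]]
    return raw
```

With drift and smoothing switched off (as the Intonation Window toggles allow), the function reduces to the bare straight-line macro contour, which is useful for the same kind of testing and comparison described in the text.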

Fig 21: The Intonation Control window
(As in Figure 20, but with smoothing added to the contour segments that actually have an effect)

The controls provided allow these four aspects of the intonation contour to be switched on and off (for testing and comparison) by means of toggles on the Intonation Window. The values used in the model are entered in the Synthesis Window (only four of the five fields are used).

When preparing to synthesise with a specified intonation contour, it is necessary to use the Generate Contour button before pressing the Synthesise button. The contour generated is displayed in the window and, by scrolling, the whole contour may be examined. A flat intonation contour may be used for testing the phone specifications during development to highlight any inadequacies of the segmental synthesis. In this case, do not generate an intonation contour.

Points may be added to or deleted from the initial contour, and the contour is then regenerated. However, if the utterance is changed, the contour must be regenerated from scratch or some very strange things may happen! Utterances with associated intonation contours may be saved or restored using the appropriate options from the Intonation selection in the Main Menu.

The default tone group used for utterances is Halliday's tone group 1, which is associated with statements and "wh--" questions. The default placement for the tonic is then on the last stressed syllable of the utterance, and the tone group comprises a falling contour, especially over the tonic syllable. The string in the Synthesis Window may be edited to break an utterance into more tone groups, relocate the tonic(s), and so on. The subsystems provided for going from text to speech use the original text punctuation clues to vary the tone group applied to utterances, but only three of Halliday's five tone groups are currently used, with naive assumptions about the location of the tonic, because more sophisticated algorithms would require both grammatical and semantic analysis of the utterances. Such analysis should be added in the future, and a more sophisticated use made of Halliday's intonation model. The high-rising tone group 2 is associated with questions expecting the answer 'yes' or 'no'. Tone group 3 (low-rising) indicates some uncertainty. Tone groups 4 and 5, and the compounds 1-3 and 5-3 (which we do not use), are complex, but involve similar elements. Readers are advised to study Halliday's course book (Halliday 1970) for a simple introduction to his model of British English intonation. It is noticeable that modern British English seems to have shifted to a somewhat new paradigm, as evidenced by BBC radio reporters (though news anchors still tend to the form described by Halliday). This illustrates the difficulty of studying intonation--it is a moving target, even amongst a well-defined social subgroup. The main purpose of intonation, regardless of its precise form, is to reveal what is important in an utterance and to clarify which of a number of alternative meanings is intended by the speaker. It is based on the customary behaviour of the native speakers in a group, and on what works. To sound natural, it has to be credible, consistent, and effective for its purpose.

A simple example may convey a flavour of the difficulties. Consider a situation where persons A and B are discussing an appointment for the future. A says: "Shall we meet at five then?" B replies either "No earlier" or "No, earlier." The first reply has one stressed syllable and one tone group (1); the second has two stressed syllables and two tone groups (both 1). This slight change in rhythm and associated intonation completely reverses the semantics of the reply. The first states that five is the earliest possible time, and the meeting could be later. The second states that five is too late, and the meeting must be arranged earlier.

The Server Test application can be used to generate the required breakdown into feet and tone groups for GnuSpeech, but Halliday's book provides an excellent tutorial.

Fig 22: The Synthesiser Control panel

(9) The Synthesizer Control Panel (Figure 22) provides an interface to the synthesizer utterance rate controls. These are parameters that control attributes characteristic of the speaker, rather than what the speaker is saying. They are very similar to the equivalent parameter controls provided in the Synthesiser application (which provides GUI access to the tube model synthesiser, and is the subject of a separate document yet to be written, though the use of the system is reasonably intuitive). They include: volume, vocal tract length, stereo balance, temperature of the air in the vocal tract, breathiness, losses in the tube, mean pitch, and coefficients associated with throat, mouth and nose radiation, as well as nose passage shape and glottal pulse shape. The sample rate may be varied (the higher sampling rate is necessary at shorter vocal tract lengths), and the modulation of frication noise by glottal pulse function may be varied. This last feature is associated with voiced fricatives such as /z/. It is a subtle acoustic feature that has the perceptual effect of fusing the noise and voiced energy into one percept, thereby improving the perceived quality of the speech. It is generally not necessary to adjust these parameters for normal use of MONET.
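The speaker-characteristic controls listed above can be thought of as one configuration record, distinct from the utterance itself. The sketch below groups them that way; the field names, types and default values are all illustrative assumptions, not MONET's actual parameter set or internal representation.

```python
from dataclasses import dataclass

@dataclass
class SpeakerControls:
    # Illustrative fields only; names and defaults are assumptions.
    volume: float = 60.0                   # overall level
    vocal_tract_length_cm: float = 17.0
    stereo_balance: float = 0.0            # -1 = left, +1 = right
    air_temperature_c: float = 32.0        # temperature of air in the tract
    breathiness: float = 0.5
    loss_factor: float = 0.5               # losses in the tube
    mean_pitch: float = -12.0              # relative to a reference pitch
    sample_rate_hz: int = 22050            # higher rate needed at shorter tract lengths
    glottal_noise_modulation: bool = True  # fuses frication noise with voicing (e.g. /z/)
```

Bundling these as a single record reflects the point made in the text: they characterise the speaker, not the utterance, and normally stay fixed during ordinary use of MONET.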

Fig 23: The Document Menu

(11) The Document Menu (Figure 23) allows a variety of options for loading and saving data, some of which are obsolete.

Open is used to load an existing <name>.monet file at the start of an editing session, and must be the first action in a session, before any panels or windows are opened.

New allows a new <name>.monet file to be created from scratch, as might be required to define the phones and rules for a language other than English, or for a different accent/dialect. As noted, the current GnuSpeech accent is not pure, and represents a compromise between US and British English which is not entirely satisfactory, but has been judged the best available by independent listeners, especially in terms of extended listening sessions.

Import DEGAS File is obsolete, and was implemented to allow data to be imported from a previous system ("Diphone Editor and Generator for Animation and Speech"--described in Manzara & Hill 1992).

Fig 24: Full screen view of Synthesizer in use

Import TRM Data allows the parameters defining a posture to be imported from the independent Synthesizer application that allows direct manipulation access to the tube synthesiser (i.e. through a Graphical User Interface). The system is documented separately, but Figure 24 provides a screen shot of the Synthesizer in operation. It allows tube model configurations to be explored in real time. When a satisfactory set of steady state values defining a posture have been determined (perhaps using the Sonogram application as an aid), the data may be saved as a <name>.trm file and then imported to MONET. This is convenient, saves work, and avoids errors. Of course, the dynamic variations are equally--if not more--important. MONET allows the dynamics to be created and checked according to the rules developed and selected.

Export Data produces a text file containing data about the phones/postures. It includes: the categories to which phones may belong; the parameters which may be used to define postures, along with the allowed ranges and the default values; the symbols related to the major event times, again with the allowed ranges and the default values; and a listing of the phones/postures showing the symbols, parameter values and times associated with them.

Save and Save As perform the usual operations of saving the <name>.monet file depending on whether the edited file is to replace the original file, or be saved under a new name.

Save Default Prototypes and Load Default Prototypes save and load the prototype databases--Equation, Parameter, and Special Prototypes. Together with the posture/phone database saved by the Export Data selection, these comprise the entire database on which MONET operates. An entire <name>.monet database (comprising all the components) may, as noted previously, be loaded by the Open command or saved by the Save or Save As commands.

How to add a new rule to the database and manage prototypes using MONET facilities

This section is provided as a separate document.

How to add a new posture to the database using MONET facilities

This section is provided as a separate document.

Help facilities available

This manual comprises the only help facilities currently available.

If you have comments, suggestions or questions about any aspect of the MONET synthesizer database editing application, the Synthesizer application (that allows direct access to the tube model controls), or the various related applications and software, please contact the author (David Hill) in the first instance.


Acknowledgements

The author wishes to acknowledge the fundamental work performed by Craig Schock in his amazing feat of implementing MONET--the most complete realisation so far of the author's speech synthesis approach, based on the management of a sequence of events controlling the time progression of perceptually relevant acoustic parameters--including the interface to Leonard Manzara's tube model synthesizer; and his original work on English intonation based on M.A.K. Halliday's formulation. Craig Schock received the Governor-General's Gold Medal for his thesis on the topic, though the research has progressed beyond the point reached by the thesis (Taube-Schock 1993). Craig also acted as the system architect for the overall synthesis system, including the associated mini-apps, and developed a number of essential tools for dictionary management and the like. The TrilliumSoundEditor application, intended to replace the Sonogram application, was also Craig's baby, but is currently incomplete.

The author also wishes to acknowledge the fundamental work performed by Leonard Manzara on designing and implementing the tube model synthesis system, and the Synthesiser application, based on earlier work by Perry Cook of the Center for Computer Research in Music and Acoustics (CCRMA) at Stanford University (who used waveguide synthesis as a basis for emulating musical instruments). Leonard also worked with the author on the creation of a complete database for spoken English using MONET, and acted as critic for the development of the dictionary pronunciations (around 70,000 of them, not counting derivatives) generated by the author. The one point of disagreement was the use of the rhotic /r/ in the dictionary. Leonard insisted that it was necessary in order to approximate US speech, even though the rhythm model, intonation model and vowel posture parameters more closely approximate British English. The resulting pronunciations for the speech system as a whole are therefore a curious mixture of British (RP) and American (General American) accents. It is not an unpleasant accent, but more work is needed on the posture and rule databases, as well as the intonation model and the dictionary.

The author acted as research director for the development of the various systems and enjoyed many stimulating discussions on the problems we all encountered.


References

CARRE, R & MRAYATI, M (1994) Vowel transitions, vowel systems, and the Distinctive Region Model. In Levels in Speech Communication: Relations and Interactions. Elsevier: New York

FANT, G (1956) On the predictability of formant levels and spectrum envelopes from formant frequencies. In For Roman Jakobson. Mouton: The Hague, 109-120

FANT, G & PAULI, S (1974) Spatial characteristics of vocal tract resonance models. Proceedings of the Stockholm Speech Communication Seminar, KTH, Stockholm, Sweden

HALLIDAY, MAK (1970) A course in spoken English: intonation. Oxford University Press 134pp

HILL, DR, MANZARA, L & TAUBE-SCHOCK, C-R (1995) Real-time articulatory speech-synthesis-by-rules. Proc. AVIOS '95 14th Annual International Voice Technologies Conf, San Jose, 12-14 September 1995, 27-44.

HILL, DR, SCHOCK, C-R & MANZARA, L (1992) Unrestricted text-to-speech revisited: rhythm and intonation. Proc. 2nd. Int. Conf. on Spoken Language Processing, Banff, Alberta, Canada, October 12th.-16th., 1219-1222

HILL, DR (1991) A conceptionary for speech and hearing in the context of machines and experimentation.

HILL, DR, JASSEM, W & WITTEN, IH (1979) A statistical approach to the problem of isochrony in spoken British English. In Current Issues in Linguistic Theory 9 (eds. H. & P. Hollien), 285-294, Amsterdam: John Benjamins B.V.

HOLMES, JN, MATTINGLY, IG & SHEARME, JN (1964) Speech synthesis by rule. Language & Speech 7, 127-143

JASSEM, W, HILL, DR & WITTEN, IH (1984) Isochrony in English speech: its statistical validity and linguistic relevance. Pattern, Process and Function in Discourse Phonology (ed. Davydd Gibbon), Berlin: de Gruyter, 203-225

LAWRENCE, W (1953) The synthesis of speech from signals which have a low information rate. In Communication Theory, Butterworth: London, 460-469

LIBERMAN, AM, INGEMANN, F, LISKER, L, DELATTRE, P & COOPER, FS (1959) Minimal rules for synthesising speech. J. Acoust. Soc. Amer. 31 (11), 1490-1499, Nov

MANZARA, L & HILL, DR (1992) DEGAS: A system for rule-based diphone synthesis. Proc. 2nd. Int. Conf. on Spoken Language Processing, Banff, Alberta, Canada, October 12th.-16th., 117-120

MANZARA, L & HILL, DR (2002) Pronunciation Guide

TAUBE-SCHOCK, C-R (1993) Synthesizing Intonation for Computer Speech Output. M.Sc. Thesis, Department of Computer Science, University of Calgary

Appendix A: Posture data for tube model synthesiser -- timings and spectral features

(Note: the MONET database contains the raw tube model parameters, such as tube radii)

            Unmarked                   Marked                     Parameter values
Posture     transition  qss  duration  transition  qss  duration  (nasal F1 F2 F3 F4 FH2 BW AX)
ah_uu           20       -    112.4        40       -    148.2    See component sounds
e_i             20       -     99          40       -    132.1    See component sounds
o_i             20       -     92.5        40       -    135      See component sounds
uh_uu           20       -    104.8        40       -    168      See component sounds


Assume that we are constructing a parameter transition from posture p to posture p+1. The diphone duration DD defaults to:

DD = QSS(p)/J + TT(p to p+1) + QSS(p+1)/K = qssb1 + transition(x) + qssa2    -- (A)

for the time from steady-state target to steady-state target, for the basic framework. We have to compute or specify qssb1, qssa2 and transition(x). This must be done for two successive diphones in the case of triphones, and three successive diphones in the case of tetraphones. What follows describes basic defaults that can be used for any diphone component. Special measures may be needed for any given unit, but that is not of concern in specifying the defaults. In the case of tetraphones, the notation would be extended to give, in order: qssb1, transition(x), qssa2, qssb2, transition(y), qssa3, qssb3, transition(z), qssa4. All examples in what follows are worked in diphone terms, and are constructed to allow for problems with the original data.
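Equation (A) can be written directly as a small function. The divisors J and K are not given numeric values in the text; J = K = 2 is assumed here (half of each posture's quasi-steady state falls inside the diphone), which is consistent with the later result that DD defaults to (duration1 + duration2)/2.

```python
def diphone_duration(qss_p, transition_x, qss_p1, J=2.0, K=2.0):
    """Equation (A): default diphone duration, steady-state target to
    steady-state target.  J = K = 2 is an assumption (see lead-in)."""
    qssb1 = qss_p / J         # trailing portion of posture p's QSS
    qssa2 = qss_p1 / K        # leading portion of posture p+1's QSS
    return qssb1 + transition_x + qssa2
```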

(a) The transition time for a vocoid >> contoid, or contoid >> vocoid, is specified by the contoid. Thus transition(x) defaults to:

transition(x) = TT(p+1) for vocoid to contoid = transition(contoid) = transition2

transition(x) = TT(p) for contoid to vocoid = transition(contoid) = transition1

The transition(contoid) time defaults to that which appears in the tabulated data for each contoid posture. The value of qss(n) defaults to that which is specified for each posture. For the present, qssa(n) is assumed to be equal to qssb(n). The actual qssa and qssb times for a vocoid to contoid or contoid to vocoid default to those obtained by taking the transition time already allocated out of the vocoid total time. Thus:

qssa(vocoid) = duration(vocoid)/2 - transition(contoid)

If this produces a value less than 10 msecs for qssa(vocoid), it defaults to 10 msecs.

The transition(x) time for vocoid to vocoid as occurs in diphthongs, triphthongs, etc. defaults to:

transition(x) = (duration1 + duration2)/2 - 20

If this produces a transition time of less than 40 msecs, it defaults to 40 msecs. The qssa and qssb times default to:

qssb1 = qssa2 = 10.

If the diphthong exists in the table, the duration is taken from the table, and then:

transition(x) = duration(diphthong) - 20

The transition time from contoid to contoid defaults to a fixed 12 msecs, taken from the durations of the two qss's involved (this is pretty arbitrary, but is probably not critical):

transition(x) = 12

qssb1 = (qss1)/2 - 6

qssa2 = (qss2)/2 - 6

Note that, as a result of the above, the default total diphone duration (DD) for any diphone which is not a diphthong is simply:

DD = (duration1 + duration2)/2

for diphthongs:

DD = duration(diphthong)
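The defaults above can be collected into one function. Everything here restates the rules given; the 'kind' tags and the assumption that a contoid's tabulated duration equals its qss are mine. Note that, floors aside, each case sums to (duration1 + duration2)/2, as just claimed.

```python
def diphone_defaults(kind, dur1, dur2, tt_contoid=None):
    """Return (qssb1, transition_x, qssa2) in msecs for one diphone.
    kind: 'vc' vocoid->contoid, 'cv' contoid->vocoid, 'vv' vocoid->vocoid,
          'cc' contoid->contoid, 'diphthong' (dur1 = tabulated duration).
    For contoids, dur is taken to equal the tabulated qss (an assumption
    consistent with the measurement discussion in the text)."""
    if kind == 'vc':
        trans = tt_contoid                       # transition set by the contoid
        qssb1 = max(dur1 / 2 - trans, 10.0)      # vocoid side, floored at 10
        qssa2 = dur2 / 2
    elif kind == 'cv':
        trans = tt_contoid
        qssb1 = dur1 / 2
        qssa2 = max(dur2 / 2 - trans, 10.0)
    elif kind == 'vv':
        trans = max((dur1 + dur2) / 2 - 20.0, 40.0)  # floored at 40
        qssb1 = qssa2 = 10.0
    elif kind == 'cc':
        trans = 12.0                             # fixed, fairly arbitrary
        qssb1 = dur1 / 2 - 6.0
        qssa2 = dur2 / 2 - 6.0
    elif kind == 'diphthong':
        trans = dur1 - 20.0                      # dur1 = tabulated diphthong duration
        qssb1 = qssa2 = 10.0
    return qssb1, trans, qssa2
```

Summing the three components for any non-diphthong case (with the floors inactive) reproduces DD = (duration1 + duration2)/2; for the diphthong case the sum is simply the tabulated duration.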

These equations will produce results that are somewhat inaccurate compared to real speech because, when the durations of the phones were measured as a basis for setting up the rhythmic framework, what was really measured was [qss(contoid)] for contoids, [transition(x) + duration(diphthong) + transition(y)] for diphthongs, and [transition(x) + qss(vocoid) + transition(y)] for vocoids, with no record of what preceded and followed the diphthong or vocoid to produce the transitions. Thus there was an uncontrolled source of variation in the durations measured for vocoids, because a variety of transitions of different types were likely included in the measures. The mitigating factor is that diphthongs were measured separately, so that the transition times included in vowels were biased towards contoids. However, the glides "w", "y" (/j/), "r", and "l" would produce long transitions compared to other contoids, and certainly increase the variability. A study of consonant transitions by Green (1959) showed a variation in consonant transitions (included as part of the vowel duration in traditional phonetic analysis) ranging from 41 msecs to 78 msecs, even without including the glides. Interestingly, the transition time for "l" is relatively short at 58 msecs (94 in and 74 out for "w"; 96.7 for "y" (/j/); and 75.7 for "r"). The corpus of material needs to be re-analysed in diphone terms, and the transition types specified. This will likely require an expansion of the corpus, but even a re-analysis of the existing corpus would provide some insight into the kind of data that are missing. The equations above, specifying the component durations, have attempted to unscramble the egg a little, and should produce speech that approximates the rhythm of real speech quite well, despite the confounded data. The framework will allow better data to be used when they are available.

(b) The target values for "k", "g", and "ng" must be modified before the back vowels "o", "u", "uu", "aw", "ar". The second formant target should be put 300 Hz above the value for the vowel.
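Rule (b) amounts to a context-dependent target adjustment. A minimal sketch (the consonant and vowel sets are from the rule; the function shape is mine):

```python
VELARS = {"k", "g", "ng"}
BACK_VOWELS = {"o", "u", "uu", "aw", "ar"}

def velar_f2_target(consonant, vowel, vowel_f2_hz, tabulated_f2_hz):
    """Rule (b): before a back vowel, a velar's F2 target is placed
    300 Hz above the vowel's F2; otherwise the tabulated target stands."""
    if consonant in VELARS and vowel in BACK_VOWELS:
        return vowel_f2_hz + 300.0
    return tabulated_f2_hz
```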

(c) The target values for h/silence/glottal-stop to posture, or posture to h/silence/glottal-stop, are taken from the posture (i.e. the transitions are flat).

(d) The shape of a given transition is determined by whether the postures involved are checked or free (i.e. is the articulation determined by physical constraints on the articulators, as in tongue against alveolar ridge -- checked; or more by acoustic feedback from the quality of the sound, as in vowel-like sounds -- free). There is very little deviation from the steady-state targets of a checked posture during the QSS, and the transition begins or ends fairly abruptly. For a free posture, there is considerable deviation from the target values during the QSS.

Fig A1: Diagram of parameter movement between two nominal posture targets

To obtain an appropriate shape, default slopes are computed so that if the slope during a checked posture is m, then the slope during a free posture is 3m and the slope during the transition is 6-12m. Then, if the total movement from the target in posture 1 to the target in posture 2 is D:

D = a1·QSS(p) + a2·TT(p to p+1) + a3·QSS(p+1)

where a1, a2 and a3 are the slopes as described, according to the types of p and p+1. Since the relationships between a1, a2 and a3 are known in terms of m, as is the value of D, the value of m may be calculated and the individual segment slopes obtained.
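Since all three slopes are fixed multiples of m, the equation for D is linear in m and solves directly. The sketch below uses 1m for checked postures, 3m for free postures, and 9m for the transition (the text says 6-12m; 9 is an assumed midpoint).

```python
def solve_slopes(D, qss_p, tt, qss_p1, type_p, type_p1, trans_mult=9.0):
    """Solve D = a1*QSS(p) + a2*TT + a3*QSS(p+1) for the base slope m,
    where each a_i is a fixed multiple of m set by the posture type.
    Returns (a1, a2, a3)."""
    mult = {"checked": 1.0, "free": 3.0}
    c1, c3 = mult[type_p], mult[type_p1]
    m = D / (c1 * qss_p + trans_mult * tt + c3 * qss_p1)
    return c1 * m, trans_mult * m, c3 * m
```

The returned slopes reproduce the total movement D exactly, while preserving the relative steepness of the checked, free, and transition segments.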

(e) The glides are similar, but fall into two groups: "w" & "y" (/j/) versus "r" & "l". The former pair are close to the vowels "uu" and "ee" respectively, and are adequately synthesised by two formants. The other pair require a third formant (if only two formants are used, whether "r" or "l" is perceived depends on the identity of the following vowel). A steady-state onset of at least 50-60 msecs is required to avoid hearing some sort of plosive. Extending that to 70 msecs or more produces a syllabic consonant. For "w" and "y", 30 msecs of steady-state onset avoided the perception of explosion, and more than 40 msecs produced the perception of the associated vowel.

The first formant onset for l should be raised before the vowel "ar" (to 480 Hz); otherwise 360 Hz is adequate. This represents the top of the acceptable range for w and y, which break down if F1 is placed higher. The second formant onset should be lower for back vowels than for front vowels for both r and l (is this related to clear and dark versions of these two glides, the former of which is not generally documented?). The onset for w needs to be lower for vowels with lower second formants. All in all, there is clear evidence in the glides for changes in the formant onset locations according to the phonetic context. Green's study provides further specific evidence (Green 1959). Interestingly, Lisker (1957) found that F2 in the vowel u had to be raised to 840 Hz from the value selected for an American u in order to synthesise a convincing u-w-u sequence. As we use 950 Hz, this should not be a problem.

The transition duration can be similar at 100 msecs for all four glides, but shorter transitions favour l and longer favour r. If the l transition is as brief as 30 msecs, it may be confused with nasals; around 60-70 msecs was found most satisfactory. The duration of the first formant transition is exceptional, and is an important aid to perceiving l. Using a 10 msec transition did not adversely affect w, r and y, but definitely improved l. We can afford to make a fast F1 transition specific to l, and use the measured transition durations from Green (1959)--as documented in the tabulated data above.


GREEN, P.S. (1959) Consonant-Vowel Transitions: a Spectrographic Study. Studia Linguistica XII (2), Travaux de l'Institut de Phonetique de Lund, Bertil Malmberg: University of Lund 53pp.

HILL, D.R., WITTEN, I.H. & JASSEM, W. (1978) Some results from a preliminary study of British English speech rhythm. Research Report 78/26/5, Dept. of Computer Science, U of Calgary: Calgary, Canada, 34pp. (currently under revision and not available 04-03-24)

O'CONNOR, J.D., GERSTMAN, L.J., LIBERMAN, A.M., DeLATTRE, P.C. & COOPER, F.S. (1957) Acoustic cues for the perception of initial /w, j, r, l/ in English. Word 13 (1), 24-43

LISKER, L. (1957) Minimal cues for separating /w, r, l, y/ in intervocalic position. Word 13 (2), 256-267

Appendix B: Vowel-to-vowel transition and other miscellaneous rewrite rules

(Working notes only)

Key: "-" = do nothing; "1" = insert a "gs" (glottal stop); "2" = insert an "r" (Source: Kenyon & Knott, A Pronouncing Dictionary of American English)

Note 1: K&K also tell us that Eastern & Southern American drops utterance final "r" after the same vowels. Brad Hodges phoned me today about the tape I sent him. Impressed, but he says our supposedly General American "r" sounds like a Chicago teenager before maturity!

Note 2: We could probably dispense with the diphthongs if the second components are easily available.

aa ah a e i o uh u ar aw ee er uu ah_i ah_uu e_i o_i uh_uu

Other rewrite rules (* is a wild card for the stress indicator--indicates stressed or unstressed)

[stop]>>[h* or hv*] becomes [stop]>> q?* >> [h* or hv*]

[stop] >> [stop] >> [stop] becomes [stop] >> [stop] >> q?* >> [stop]

[affricate] >> [stop|affricate|hlike] becomes [affricate] >> qc* >> [stop|affricate|hlike]

[vowel(i) & end-of-word] >> [vowel(i)] becomes [vowel(i)] >> gs* >> [vowel(i)]
{i.e. the gs only gets inserted for same vowels in succession -- see table above}

[l* & end-of-word] >> [contoid] becomes [ll*] >> [contoid] {this is the dark /l/ re-write rule}
{we may need a similar rule for r to rr, but at present r and rr are the same: tried and added}

[affricate]>>[stop]>>[stop|affricate|hlike] becomes [affricate]>>qc*>>[stop]>>q?*>>[stop|affricate|hlike]
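These rewrite rules are ordinary pattern-driven insertions over the phone string. A minimal sketch, with a toy category table standing in for the MONET posture database (stress/wildcard markers and word boundaries are ignored here, and the triple-stop and dark-/l/ rules are omitted):

```python
# Toy category table standing in for the MONET posture database.
CATEGORIES = {
    "p": {"stop"}, "t": {"stop"}, "k": {"stop"},
    "ch": {"affricate"}, "j": {"affricate"},
    "h": {"hlike"}, "hv": {"hlike"},
    "aa": {"vowel"}, "i": {"vowel"}, "uu": {"vowel"},
}

def has_cat(phone, cat):
    # Strip stress/wildcard markers such as ' and * before lookup.
    return cat in CATEGORIES.get(phone.rstrip("'*"), set())

def apply_rewrites(phones):
    """Apply a subset of the Appendix B rules:
    qc after an affricate before stop/affricate/h-like;
    q? between a stop and an h-like;
    gs between identical adjacent vowels."""
    out = []
    for cur, nxt in zip(phones, phones[1:] + [None]):
        out.append(cur)
        if nxt is None:
            continue
        if has_cat(cur, "affricate") and any(
                has_cat(nxt, c) for c in ("stop", "affricate", "hlike")):
            out.append("qc")
        elif has_cat(cur, "stop") and has_cat(nxt, "hlike"):
            out.append("q?")
        elif has_cat(cur, "vowel") and cur == nxt:
            out.append("gs")
    return out
```

A fuller implementation would also track word boundaries (for the gs and dark-/l/ rules) and look two phones ahead (for the triple-stop and affricate-stop-stop rules).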

Last changed Wed Jan 18th 1995

drh: 95-05-23

Appendix C: Some internal notes from target/rule construction September 19th 1994 onwards


1. Uniform vocal tract at shortest real-time-computable length (~16 cm) gives approximation to 500, 1500, 2500 series of formants.
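Note 1's figures follow from the quarter-wavelength resonances of a uniform tube closed at one end: F_k = (2k-1)·c/(4L). A quick check (c = 35,000 cm/s is assumed here for warm, moist air):

```python
def uniform_tube_formants(length_cm, n=3, c_cm_per_s=35000.0):
    """Quarter-wavelength resonances of a uniform tube closed at the
    glottis end: F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * c_cm_per_s / (4.0 * length_cm)
            for k in range(1, n + 1)]
```

At 17.5 cm this gives exactly 500, 1500, 2500 Hz; at the ~16 cm mentioned in the note the series is about 547, 1641, 2734 Hz, i.e. an approximation to the 500/1500/2500 pattern, as the note says.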

2. When an area figure is entered, it frequently gets slightly changed (e.g. entering "1.1" displays "1.0").

3. F1 has a propensity to come out at 435 Hz. What kind of artifact (if it is one) is this? When checking spectra, to get a clearer idea of the formant peaks, it seems helpful to set the rise time to 25 and the fall time min to 5. This flattens the glottal source spectrum in the low end and distorts the peaks less. May not be good for listening though.

4. There is ambiguity about whether the current state is saved or not. The close box shows a solid cross even when the configuration has been modified.

5. Should be able to import a group of .trm files to MONET

6. MONET "New File" option disabled. Why?

7. In some spectra, the amplitude of the formants is greater at higher than at mid frequencies.

8. Could we have an attachable/detachable section 8, as per Carré model. Would this give us anything (e.g. in "put" versus "poot")?

9. Whenever there is a chance to select something & perform an action, it should be possible to select several things & perform the same action (e.g. deleting items from the data entry window).

10. Monet does not appear to add the file extension automatically. I had to add the ".monet" manually.

20th September
11. Had a real problem getting MONET to start up fresh. Wouldn't read the .monet file until I undeleted my mail and got a fresh copy of MonetBlank.monet. Then wouldn't import .trm files until I had manually made one phone entry. It allowed .trm files to be imported without having opened a .monet file, but wouldn't show any parameters.

12. Need to show which file is currently being worked on and give proper "Save" feedback. At present, the solid cross appears even after changes (e.g. in the Rule Window).

13. Why does the glottal source analysis still show a spectrum at zero amplitude?

21st September
14. Not at all clear how to raise a panel showing the posture durations in MONET. The name "Symbols" does not relate well to durations. Also, it would be very nice to be able to import and replace, and select that only certain things be replaced, using .trm files.

15. In MONET, when entering data, it would be better to keep the old value in the entry field to allow repeated entry of data. The actual value of the selected item could show in a different field (or even be visible in the main window).

16. Inspectors should continue to show selection even when main window is not the key window.

17. The mods made last time (94-09-20) seem to have fouled up the parameter display.

22nd September
18. Need a switch to go to file, on MONET.

19. ah_uu diphthong. Check Synthesizer 2ms output samples for discontinuity.

20. (Len) Try interpolation on all scattering coefficients.

21. It would be nice to have a recursive rule interpreter such that a set of rules could themselves form a category to be used recursively to reduce the number of rules needed to deal with combinations. E.g. we have ten rules for diphthongs. We also need the triphthongs (as in RP "fire"). Nice if there was an element to call the appropriate diphthong rule into effect & then add the "uh".

22. Utterance rate settings: Breath: 0.5, Loss Factor 0.1, Mouth & Nose 0.601071, Pitch: -14.2857

23. The click at the end of utterances can become very severe (try w' er' r' k'). It seems to be a bug, rather than an artifact. "^ ah i ah i ah i ah i ^ ^ uu ^" produced very little click, but lost the "uu", while "^ ah i ah i ah i ah i ^" produced a very loud click. On the other hand "^ d' e' i' v' i d ^ ^ uu ^" produced both clicks and part of the "uu". With Temp.monet values, "t" by itself comes out silent. Add an initial "i" and the t-burst is heard (& very convincing too). But the special parameter displays showed nothing!

24. "n" comes out sounding like "l" and "m" is little better.

23rd September 1994
25. We really should have designed the system so that we have a single database and can run either MONET or the interactive synthesiser on it--each would deal only with those parts that were appropriate, but all file conversions, difficulties of synchronisation, etc. would go away!!

26. There is a problem in that synthesising "// / ^ p aa ^ p ah ^ p a ^ / //" loses the final "pa" in real-time synthesis, but the console, the display, and recording to a file all produce everything. If this is symptomatic of some input pages to the synthesiser being overwritten under pressure (or organisation) of real time synthesis, it could also explain inconsistent clicks and other anomalies. Some inconsistencies also with the use of double slashes to enable tempo. Sometimes worked with, sometimes not, but also tied in with whether final syllable above was cut or not!! Have we got an initialisation problem?? Syllable lost after fresh start-up.

Note that changing whether "//" surrounded the string changed whether the syllable was lost or not. Adding silences would bring the syllable back, and the click ultimately went away permanently. Removing silences with "//" surrounding would bring the problem back. Looking at the console showed:


Tone Groups 1

0 start: 0 end: 1 type: 0

Feet 2

0 tempo: 1.000000 start: 0 end: 10 marked: 0 last: 0

1 tempo: 1.000000 start: 11 end: 11 marked: 0 last: 0

Phones 11

0 "^" tempo: 1.000000 syllable: 0
1 "p" tempo: 1.000000 syllable: 0
2 "aa" tempo: 1.000000 syllable: 0
3 "^" tempo: 1.000000 syllable: 0
4 "p" tempo: 1.000000 syllable: 0
5 "ah" tempo: 1.000000 syllable: 0
6 "^" tempo: 1.000000 syllable: 0
7 "p" tempo: 1.000000 syllable: 0
8 "a" tempo: 1.000000 syllable: 0
9 "^" tempo: 1.000000 syllable: 0
10 "^" tempo: 1.000000 syllable: 0
CurrentIndex = 7936 Zeroing from 360192... 256 bytes


Tone Groups 1

0 start: 0 end: 1 type: 0

Feet 2

0 tempo: 1.000000 start: 0 end: 11 marked: 0 last: 0
1 tempo: 1.000000 start: 12 end: 12 marked: 0 last: 0

Phones 12
0 "^" tempo: 1.000000 syllable: 0
1 "p" tempo: 1.000000 syllable: 0
2 "aa" tempo: 1.000000 syllable: 0
3 "^" tempo: 1.000000 syllable: 0
4 "p" tempo: 1.000000 syllable: 0
5 "ah" tempo: 1.000000 syllable: 0
6 "^" tempo: 1.000000 syllable: 0
7 "p" tempo: 1.000000 syllable: 0
8 "a" tempo: 1.000000 syllable: 0
9 "^" tempo: 1.000000 syllable: 0
10 "^" tempo: 1.000000 syllable: 0
11 "^" tempo: 1.000000 syllable: 0


Tone Groups 1
0 start: 0 end: 1 type: 0

Feet 2
0 tempo: 1.000000 start: 0 end: 10 marked: 0 last: 0
1 tempo: 1.000000 start: 11 end: 11 marked: 0 last: 0

Phones 11
0 "^" tempo: 1.000000 syllable: 0
1 "p" tempo: 1.000000 syllable: 0
2 "aa" tempo: 1.000000 syllable: 0
3 "^" tempo: 1.000000 syllable: 0
4 "p" tempo: 1.000000 syllable: 0
5 "ah" tempo: 1.000000 syllable: 0
6 "^" tempo: 1.000000 syllable: 0
7 "p" tempo: 1.000000 syllable: 0
8 "a" tempo: 1.000000 syllable: 0
9 "^" tempo: 1.000000 syllable: 0
10 "^" tempo: 1.000000 syllable: 0
CurrentIndex = 7936 Zeroing from 360192... 256 bytes

Tone Groups 1

0 start: 0 end: 1 type: 0

Feet 2
0 tempo: 1.000000 start: 0 end: 9 marked: 0 last: 0
1 tempo: 1.000000 start: 10 end: 10 marked: 0 last: 0

Phones 10
0 "^" tempo: 1.000000 syllable: 0
1 "p" tempo: 1.000000 syllable: 0
2 "aa" tempo: 1.000000 syllable: 0
3 "^" tempo: 1.000000 syllable: 0
4 "p" tempo: 1.000000 syllable: 0
5 "ah" tempo: 1.000000 syllable: 0
6 "^" tempo: 1.000000 syllable: 0
7 "p" tempo: 1.000000 syllable: 0
8 "a" tempo: 1.000000 syllable: 0
9 "^" tempo: 1.000000 syllable: 0


But tests showed that the production of the "zeroing" message was not consistently tied to the loss of the last syllable.
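One thing the dumps above do show consistently is that the final foot's start and end values index one past the last phone (e.g. foot 1 at 11..11 against phones 0..10 in the first dump). Whether or not this is connected to the lost syllable, it is the kind of inconsistency that can be checked mechanically. A minimal sketch of such a check (Python; the function and data names are illustrative, not MONET's):

```python
# Hypothetical consistency check over the console dump values, assuming a
# foot's "start"/"end" fields index into the phone list (0-based).

def check_feet(feet, phone_count):
    """Return any feet whose phone indices fall outside the phone list."""
    problems = []
    for i, (start, end) in enumerate(feet):
        if start >= phone_count or end >= phone_count:
            problems.append((i, start, end))
    return problems

# Values copied from the first console dump: 11 phones, two feet.
feet = [(0, 10), (11, 11)]
print(check_feet(feet, 11))   # prints [(1, 11, 11)]: foot 1 is out of range
```

The same check flags the final foot in all four dumps, whether the syllable was lost or not.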

27. We really need better cross-checking between inspector values for point characteristics etc. and the system that defines the symbols. At least we should be able to see the whole name. Preferably, we should be able to double click and bring up the prototype manager with the specific item highlighted.

28. Could put in a piece of initialisation code that creates the pad page values for the synthesiser from the MONET values for silence.

29. Looked at voiced stops. Horrible noises due to parameter transitions. Tried out the Carré scheme: "adu" sounded like "abu". Played with combinations of transition types. The noises came back when the transitions were set to triphone defaults and the volume raised. Nothing seemed to produce a d-like sound from the crude Carré scheme; the formant transitions are wrong. When all transitions except R8 were put to the triphone default and the volume kept down, something like a "d" appeared in "aa d aa".

30. It is not possible to move a rule to the last position in the list directly. You have to move it to the penultimate position and then move the last one up.

31. It would be less confusing if the Fricative Centre Frequency value equalled the section number.

32. I find it annoying that MONET only displays one decimal place in parameter values, and that a value in the synthesiser will show discrepancies (e.g. 1.05 versus 1.045).

33. See this! I failed to load a .monet file.

[] 16=> MONET

Now First Responder
Points = 3
Points = 3
Points = 4
Points = 3
Points = 3
Points = 3
Points = 3



Tone Groups 0

Feet 0

Phones 0

TTS Server: Sound Driver failed under heavy load (iw).

[] 17=>

34. The special symbols RuleDuration, Mark1, Mark2, Beat, rd etc. should be handled as a separate option under the Inspector (a second option besides "General Information" in the Inspector for the Rule Window). Also, setting them up should be made more intuitive (i.e. a separate exercise). At present these important symbols are mixed up with everything else, which is confusing.

We also need some notes on phantom points, which behave inconsistently: for the points at an internal major event, the second point (if added) is a phantom, but at the end, the first point is the phantom, and the next diphone has a non-phantom point to start. I realise there are reasons for this, but the user interface is very confusing!

35. It is important to develop rules in a rational order. Get the voiced stops, without any bursts, sounding right before converting them to voiceless stops. The m and n count as voiced stops, of course. We'll add other constraints later.

36. When selecting points in the prototype diagram, one really should have to click twice to select, and creating a new point should be done differently. I'm not sure a triple click is appropriate, but it may well be better than the current method. Len suggests CMD-click, which seems a good idea to me.

37. I ought to be able to select a group in the middle for slope ratio.

38. Remember to make the microintonation transitions for voiceless stops do the opposite things on either side of the closure.

39. The input "// / ^ ^ b ah i d ee ^ ^ / //" produced some trailing hiss after everything was supposedly off. Is this related to the spurious events at the end, problems with special event points, and the like? We got around a spurious parameter value problem in the voiced stop fricative burst by setting the initial point in the special event to a triangle type instead of a circle type. Try "butn"!

40. In the data entry process, it would be nice if, when a new posture is accessed, the inspector (if up) were already highlighted on the item that was selected in the last access, since one often wishes to go through and change a series of values in different postures (e.g. adjusting all the glottal volumes in voiced stops).

41. "// / ^ ^ aa d ^ / //" and "// / ^ ^ aa d / //" screw up really badly at the end. Probably a good test case for the end bug.

42. Need another text field on the inspector that shows the full name of the item selected, plus an easier way of reading the whole formula!

43. Bug in entering equations. Sometimes a totally wrong equation gets substituted when changing an equation (this has happened two or three times, and it is not that "set" was not used). The equation changes, but it seems to change to a standard simple equation (like "qssb1/tempo1").

44. The scale of the fricative (at least for special events) needs to be modified so we can see what is going on by viewing the parameters.

45. We need a special way of handling the schwa vowel. It is short, and the normal rules fall foul of the timing limits, but making it longer wrecks the rhythm.

46. It would be nice to be able to choose scales for the parameter display. Fricative volume is so low that you can't see the timing of the noise relative to other things.

47. Need to generate the padding values for the synthesiser from the target values supplied for silence in MONET, at initialisation time.

48. Should consider allowing the fricative position to move just in front of R8 so that the bursts associated with bilabial articulations may be heard more easily, and controlled more effectively. Len thinks this is a very bad idea, and that we should open the closure before making the noise (true, but that's version 2!).

49. Need an extension in order to:

  • get onto a faster DSP so we can shorten the tube length in real time and get a better approximation to the aspiration spectrum, and generally reduce the compromises made in the existing prototype (split R4/R5 to see if it allows more accurate simulation of velar closure)
  • do more rule development (we don't know if rules for some sounds can be done); we think it may be possible to develop a better rule structure in our new framework, use more than 4 postures, etc.
  • develop a better user interface as a basis for a lab software package
  • incorporate driver software available from Stanford U to complete the port to PC hardware
  • we mistakenly put extra ideas into the first proposal which are more appropriate to an extension, once the basic ideas were proven
  • do some experimental validation of the rules etc. we have developed, as a basis for improvement
  • include intonation and rhythm components in MONET for more accurate testing and development of sounds, and of the rhythm and intonation work
  • tackle the unresolved problems of moving the system onto white hardware
  • ask what problems we have unearthed that require work, and what new ideas can be developed

50. It would be nice to have a system to allow us to represent a vowel (or whatever) as an unused symbol and be able to link that symbol to an arbitrary symbol for synthesis.

51. In setting up "uu z uu" under MONET, we noticed that R6 had a slight change in the middle of the closure. Looking at the MONET parameter file showed that after the target, the value decreased slightly from 0.200 to 0.194. This is clearly not correct, and may relate to the negative values we saw in the glottal volume of a voiceless stop example reported earlier.
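It is perhaps worth noting that 0.194 is exactly 97% of 0.200, so one plausible (unverified) mechanism is a transition-profile point whose percentage is 97 rather than 100. A hypothetical sketch of how such a point would surface in the absolute parameter track (the function is illustrative, not MONET code):

```python
# Hypothetical sketch (not MONET code): a transition profile stores points as
# percentages of the total parameter movement; any point that is not exactly
# 100% at or after the target produces a dip (or, near zero, a negative value).

def apply_profile(start, target, profile):
    """Map (time, percent) profile points onto absolute parameter values."""
    return [(t, start + (pct / 100.0) * (target - start)) for t, pct in profile]

# A profile with a 97% point after the target has been reached:
profile = [(0, 0), (40, 100), (60, 97), (100, 100)]
track = apply_profile(start=0.0, target=0.2, profile=profile)
# the 97% point yields the observed 0.194 against a 0.200 target
```

The same arithmetic with a near-zero target would produce small negative values, which may account for the glottal-volume anomaly mentioned above.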

52. There's a bug in the Trillium Sound Editor that causes clipping as a result of normalisation (seen on the "We were away a year ago" sound output file during investigation of other matters).

53. The error message in the rule window should be cleared the next time anything is done in the window, or the window is closed, if the error has been corrected. The same applies to the inspector.

54. Strange occurrence: j's stopped working after we'd played around with the rules and database, even though we had apparently restored everything. Fortunately we had a back-up which really did restore things, but there was no obvious reason why the j's would have ceased working. We checked targets and rules, and everything looked right. Maybe screwing up the rule field, so that it took blanks (because of extra parentheses), did damage that restoring the rule didn't undo.

55. When applying the "Group for Slope Ratio" to a set of points, if the points are too close together, the slope ratio legends get superimposed, even though they work correctly. Also, it would be better to allow a more direct method of selecting multiple points (such as holding the Shift key down whilst selecting whatever points are to participate in the slope ratio). The current system of boxing the points is awkward to the point of being almost unusable: sometimes it is necessary to move points in time just to be able to select the points wanted without selecting unwanted points, and the points then have to be moved back.

--------------(end of development notes section)------------

Appendix D: GNU Free Documentation Licence

GNU Free Documentation License

Version 1.1, March 2000

Copyright (C) 2000  Free Software Foundation, Inc.
59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.

0. PREAMBLE


The purpose of this License is to make a manual, textbook, or other written document "free" in the sense of freedom: to assure everyone the effective freedom to copy and redistribute it, with or without modifying it, either commercially or noncommercially. Secondarily, this License preserves for the author and publisher a way to get credit for their work, while not being considered responsible for modifications made by others.

This License is a kind of "copyleft", which means that derivative works of the document must themselves be free in the same sense. It complements the GNU General Public License, which is a copyleft license designed for free software.

We have designed this License in order to use it for manuals for free software, because free software needs free documentation: a free program should come with manuals providing the same freedoms that the software does. But this License is not limited to software manuals; it can be used for any textual work, regardless of subject matter or whether it is published as a printed book. We recommend this License principally for works whose purpose is instruction or reference.

1. APPLICABILITY AND DEFINITIONS


This License applies to any manual or other work that contains a notice placed by the copyright holder saying it can be distributed under the terms of this License. The "Document", below, refers to any such manual or work. Any member of the public is a licensee, and is addressed as "you".

A "Modified Version" of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language.

A "Secondary Section" is a named appendix or a front-matter section of the Document that deals exclusively with the relationship of the publishers or authors of the Document to the Document's overall subject (or to related matters) and contains nothing that could fall directly within that overall subject. (For example, if the Document is in part a textbook of mathematics, a Secondary Section may not explain any mathematics.) The relationship could be a matter of historical connection with the subject or with related matters, or of legal, commercial, philosophical, ethical or political position regarding them.

The "Invariant Sections" are certain Secondary Sections whose titles are designated, as being those of Invariant Sections, in the notice that says that the Document is released under this License.

The "Cover Texts" are certain short passages of text that are listed, as Front-Cover Texts or Back-Cover Texts, in the notice that says that the Document is released under this License.

A "Transparent" copy of the Document means a machine-readable copy, represented in a format whose specification is available to the general public, whose contents can be viewed and edited directly and straightforwardly with generic text editors or (for images composed of pixels) generic paint programs or (for drawings) some widely available drawing editor, and that is suitable for input to text formatters or for automatic translation to a variety of formats suitable for input to text formatters. A copy made in an otherwise Transparent file format whose markup has been designed to thwart or discourage subsequent modification by readers is not Transparent. A copy that is not "Transparent" is called "Opaque".

Examples of suitable formats for Transparent copies include plain ASCII without markup, Texinfo input format, LaTeX input format, SGML or XML using a publicly available DTD, and standard-conforming simple HTML designed for human modification. Opaque formats include PostScript, PDF, proprietary formats that can be read and edited only by proprietary word processors, SGML or XML for which the DTD and/or processing tools are not generally available, and the machine-generated HTML produced by some word processors for output purposes only.

The "Title Page" means, for a printed book, the title page itself, plus such following pages as are needed to hold, legibly, the material this License requires to appear in the title page. For works in formats which do not have any title page as such, "Title Page" means the text near the most prominent appearance of the work's title, preceding the beginning of the body of the text.

2. VERBATIM COPYING


You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies, and that you add no other conditions whatsoever to those of this License. You may not use technical measures to obstruct or control the reading or further copying of the copies you make or distribute. However, you may accept compensation in exchange for copies. If you distribute a large enough number of copies you must also follow the conditions in section 3.

You may also lend copies, under the same conditions stated above, and you may publicly display copies.

3. COPYING IN QUANTITY


If you publish printed copies of the Document numbering more than 100, and the Document's license notice requires Cover Texts, you must enclose the copies in covers that carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on the back cover. Both covers must also clearly and legibly identify you as the publisher of these copies. The front cover must present the full title with all words of the title equally prominent and visible. You may add other material on the covers in addition. Copying with changes limited to the covers, as long as they preserve the title of the Document and satisfy these conditions, can be treated as verbatim copying in other respects.

If the required texts for either cover are too voluminous to fit legibly, you should put the first ones listed (as many as fit reasonably) on the actual cover, and continue the rest onto adjacent pages.

If you publish or distribute Opaque copies of the Document numbering more than 100, you must either include a machine-readable Transparent copy along with each Opaque copy, or state in or with each Opaque copy a publicly-accessible computer-network location containing a complete Transparent copy of the Document, free of added material, which the general network-using public has access to download anonymously at no charge using public-standard network protocols. If you use the latter option, you must take reasonably prudent steps, when you begin distribution of Opaque copies in quantity, to ensure that this Transparent copy will remain thus accessible at the stated location until at least one year after the last time you distribute an Opaque copy (directly or through your agents or retailers) of that edition to the public.

It is requested, but not required, that you contact the authors of the Document well before redistributing any large number of copies, to give them a chance to provide you with an updated version of the Document.

4. MODIFICATIONS


You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version:

  • A. Use in the Title Page (and on the covers, if any) a title distinct from that of the Document, and from those of previous versions (which should, if there were any, be listed in the History section of the Document). You may use the same title as a previous version if the original publisher of that version gives permission.
  • B. List on the Title Page, as authors, one or more persons or entities responsible for authorship of the modifications in the Modified Version, together with at least five of the principal authors of the Document (all of its principal authors, if it has less than five).
  • C. State on the Title page the name of the publisher of the Modified Version, as the publisher.
  • D. Preserve all the copyright notices of the Document.
  • E. Add an appropriate copyright notice for your modifications adjacent to the other copyright notices.
  • F. Include, immediately after the copyright notices, a license notice giving the public permission to use the Modified Version under the terms of this License, in the form shown in the Addendum below.
  • G. Preserve in that license notice the full lists of Invariant Sections and required Cover Texts given in the Document's license notice.
  • H. Include an unaltered copy of this License.
  • I. Preserve the section entitled "History", and its title, and add to it an item stating at least the title, year, new authors, and publisher of the Modified Version as given on the Title Page. If there is no section entitled "History" in the Document, create one stating the title, year, authors, and publisher of the Document as given on its Title Page, then add an item describing the Modified Version as stated in the previous sentence.
  • J. Preserve the network location, if any, given in the Document for public access to a Transparent copy of the Document, and likewise the network locations given in the Document for previous versions it was based on. These may be placed in the "History" section. You may omit a network location for a work that was published at least four years before the Document itself, or if the original publisher of the version it refers to gives permission.
  • K. In any section entitled "Acknowledgements" or "Dedications", preserve the section's title, and preserve in the section all the substance and tone of each of the contributor acknowledgements and/or dedications given therein.
  • L. Preserve all the Invariant Sections of the Document, unaltered in their text and in their titles. Section numbers or the equivalent are not considered part of the section titles.
  • M. Delete any section entitled "Endorsements". Such a section may not be included in the Modified Version.
  • N. Do not retitle any existing section as "Endorsements" or to conflict in title with any Invariant Section.

If the Modified Version includes new front-matter sections or appendices that qualify as Secondary Sections and contain no material copied from the Document, you may at your option designate some or all of these sections as invariant. To do this, add their titles to the list of Invariant Sections in the Modified Version's license notice. These titles must be distinct from any other section titles.

You may add a section entitled "Endorsements", provided it contains nothing but endorsements of your Modified Version by various parties--for example, statements of peer review or that the text has been approved by an organization as the authoritative definition of a standard.

You may add a passage of up to five words as a Front-Cover Text, and a passage of up to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified Version. Only one passage of Front-Cover Text and one of Back-Cover Text may be added by (or through arrangements made by) any one entity. If the Document already includes a cover text for the same cover, previously added by you or by arrangement made by the same entity you are acting on behalf of, you may not add another; but you may replace the old one, on explicit permission from the previous publisher that added the old one.

The author(s) and publisher(s) of the Document do not by this License give permission to use their names for publicity for or to assert or imply endorsement of any Modified Version.

5. COMBINING DOCUMENTS


You may combine the Document with other documents released under this License, under the terms defined in section 4 above for modified versions, provided that you include in the combination all of the Invariant Sections of all of the original documents, unmodified, and list them all as Invariant Sections of your combined work in its license notice.

The combined work need only contain one copy of this License, and multiple identical Invariant Sections may be replaced with a single copy. If there are multiple Invariant Sections with the same name but different contents, make the title of each such section unique by adding at the end of it, in parentheses, the name of the original author or publisher of that section if known, or else a unique number. Make the same adjustment to the section titles in the list of Invariant Sections in the license notice of the combined work.

In the combination, you must combine any sections entitled "History" in the various original documents, forming one section entitled "History"; likewise combine any sections entitled "Acknowledgements", and any sections entitled "Dedications". You must delete all sections entitled "Endorsements."

6. COLLECTIONS OF DOCUMENTS


You may make a collection consisting of the Document and other documents released under this License, and replace the individual copies of this License in the various documents with a single copy that is included in the collection, provided that you follow the rules of this License for verbatim copying of each of the documents in all other respects.

You may extract a single document from such a collection, and distribute it individually under this License, provided you insert a copy of this License into the extracted document, and follow this License in all other respects regarding verbatim copying of that document.

7. AGGREGATION WITH INDEPENDENT WORKS


A compilation of the Document or its derivatives with other separate and independent documents or works, in or on a volume of a storage or distribution medium, does not as a whole count as a Modified Version of the Document, provided no compilation copyright is claimed for the compilation. Such a compilation is called an "aggregate", and this License does not apply to the other self-contained works thus compiled with the Document, on account of their being thus compiled, if they are not themselves derivative works of the Document.

If the Cover Text requirement of section 3 is applicable to these copies of the Document, then if the Document is less than one quarter of the entire aggregate, the Document's Cover Texts may be placed on covers that surround only the Document within the aggregate. Otherwise they must appear on covers around the whole aggregate.

8. TRANSLATION


Translation is considered a kind of modification, so you may distribute translations of the Document under the terms of section 4. Replacing Invariant Sections with translations requires special permission from their copyright holders, but you may include translations of some or all Invariant Sections in addition to the original versions of these Invariant Sections. You may include a translation of this License provided that you also include the original English version of this License. In case of a disagreement between the translation and the original English version of this License, the original English version will prevail.

9. TERMINATION


You may not copy, modify, sublicense, or distribute the Document except as expressly provided for under this License. Any other attempt to copy, modify, sublicense or distribute the Document is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.

10. FUTURE REVISIONS OF THIS LICENSE


The Free Software Foundation may publish new, revised versions of the GNU Free Documentation License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. See http://www.gnu.org/copyleft/.

Each version of the License is given a distinguishing version number. If the Document specifies that a particular numbered version of this License "or any later version" applies to it, you have the option of following the terms and conditions either of that specified version or of any later version that has been published (not as a draft) by the Free Software Foundation. If the Document does not specify a version number of this License, you may choose any version ever published (not as a draft) by the Free Software Foundation.

--------------end of Free Documentation Licence copy------


1. The MONET speech database creation, editing and production system was designed and programmed by Craig Schock whilst at Trillium Sound Research Incorporated, based on the original ideas developed by the author of this manual. The name originally stood for "My Own Nifty Editing Tool" (for speech parameters), but the tool took on a much more complete set of functions. Both the creation of the system and its use involve(d) sufficient artistry that it seems very appropriate to maintain the connection with the French impressionist of the same name, especially as that school of painting is my favourite. The section delimiters augment this connection. (back)

2. The underlying control model used for the tube model is derived from research carried out at the École Nationale Supérieure des Télécommunications (ENST), Laboratoire de Traitement et Communication de l'Information (LTCI) (Department of Signals), in Paris, by Dr. René Carré. This work in turn built on earlier work by Fant and his colleagues at the Speech Technology Laboratory, KTH, in Stockholm. Background on this research and the authors' developments from it are provided in Hill, Manzara and Taube-Schock (1995). (back)

3. Lungs, trachea, vocal folds, oral tube, tongue, teeth, cheeks, velum, nasal tube, lips, mouth orifice, and nostrils. (back)

4. Other papers describing the tube model itself and other aspects of the work are also in preparation. (back)

5. MONET's input string comprises mainly characters from the phonetic font named Trillium Phonetic. This font provides ASCII equivalents for the International Phonetic Association (IPA) phonetic symbol set used for broad (phonemic) transcription of languages (see "References"). Only the postures relevant to English and French are defined in the current diphones.monet file, and these are mapped onto the specific IPA symbols that are relevant (see the note on the title page for comments on the term phoneme). The font was developed using the Fontographer package that ran on Macintosh computers. For MONET, only the Trillium ASCII font is needed. (back)

6. The Prototype Manager has three modes: Equations, Transition, and Special. In the Equation mode, equations may be defined to allow timing values to be computed from the basic phone timing data and associated with meaningful symbols, grouped into categories with meaningful names. This helps keep track of their use, and helps with documentation of the database. In the Transition and Special modes, provision is made for setting up a parameter transition profile in which the points at which the parameter rate changes have a defined time, based on the timing symbols set up from the equations, and a defined percentage of the total parameter movement. The parameter transition profile does not hold absolute parameter values: it is used to produce the absolute values needed to drive the synthesiser by applying the percentages and rates to the absolute data defined by the actual postures (phones) in a given string. (back)
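The relative-profile scheme this note describes can be sketched as follows (Python; the function, the symbol times, and the target values are illustrative assumptions, not MONET's actual data, although Mark1-style timing symbols are described in the development notes above):

```python
# Sketch of rendering a relative transition profile, with hypothetical names:
# a profile stores (timing-symbol, percentage) points; rendering binds each
# symbol to a concrete time computed from the posture timing equations, then
# maps the percentage onto the two postures' absolute parameter targets.

def render_transition(profile, symbol_times, from_target, to_target):
    """Turn a relative profile into absolute (time, value) control points."""
    span = to_target - from_target
    return [(symbol_times[sym], from_target + (pct / 100.0) * span)
            for sym, pct in profile]

# Illustrative values only -- the symbol times and targets are invented.
profile = [("start", 0), ("mark1", 50), ("end", 100)]
times = {"start": 0.0, "mark1": 33.0, "end": 100.0}
points = render_transition(profile, times, from_target=10.0, to_target=18.0)
# points == [(0.0, 10.0), (33.0, 14.0), (100.0, 18.0)]
```

The same profile can thus be reused for any pair of postures: only the bound symbol times and the absolute targets change.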

7. The window that opens does not display the name of the prototype being edited in the Title bar. This problem requires correction. (back)

8. Not ServerTest, which does not allow access to the "hidden" methods--which is why they are called hidden methods. (back)

9. I.e. on double-clicking, don't release the button after the second press until the desired scale change has been achieved by dragging with the button held down. (back)

10. Stressed and tonic syllables have longer phones. Tonic syllables fall in the words representing the major content items of a phrase and have lengthening even beyond that of the normally stressed syllables. The synthesis system provides for the various influences. (back)

11. This is probably an important factor in the variance that was "missing" in the study of British English speech rhythm by Hill, Witten and Jassem (1978). (back)

12. This note is based on material in O'Connor, Gerstman, Liberman, Delattre and Cooper (1957) and Lisker (1957). (back)

Please email any comments or questions about this manual to the author (David Hill)

Page last updated 04-06-22.