Hi John,
The original text-to-speech system on the NeXT, on which the port is
based, did address the "question" intonation pattern.
The intonation patterns are affected by the punctuation and
intonation control parameters. But, properly speaking, only questions
expecting the answer "Yes" or "No", or statements expressing
uncertainty, really have rising intonation at the end.
The rampant "up-talk" by the younger generation in Canada is an
exception -- everything in "up-talk" gets a rising intonation at the
end, perhaps a sign of insecurity in the speaker! :-).
Wh- questions don't show the rising intonation. The system did not
make allowance for this distinction -- it would have required some
grammatical analysis which we had not tackled, though it should be
done. It isn't just a matter of detecting the presence of words like
"why", "when", "who", "what", and "how", because it is fairly easy to
frame a "Yes/No" question that also contains one or more of these
words (for example: "Did you tell her when we were supposed to
meet?").
The system also had regular statements and emphatic statements.
There should have been a lot more, and the plan was to implement the
whole of Michael Halliday's description of the intonation of British
English (he wrote an excellent tutorial book, with accompanying taped
examples: "A course in spoken English: Intonation" -- Oxford U.
Press, 1970, SBN [sic] 19 453066 3).
The intonation system was tied to the metrical aspects of English
described by a number of British linguists -- most notably Professor
David Abercrombie, who was at Edinburgh University. We carried out
significant research at the U of Calgary on the rhythm and intonation
of British English and this was used when we spun off Trillium Sound
Research and built the original NeXT system. The rhythm and
intonation were regarded as especially effective features of the
text-to-speech system, even though the research results and
Halliday's description were only partially implemented. The speech
was found to be much less
tiring to listen to for long periods than, for example, DECtalk
(which was based on MITalk, developed at MIT: "From text to speech:
the MITalk system," Allen, Hunnicutt & Klatt, Cambridge University
Press, 1987, ISBN 0-521-30641-8).
Abercrombie's claim was that spoken British English had "a tendency
towards isochrony". Specifically, spoken phrases and sentences can
be split into "feet", rather like the bars in music, with the
rhythmic "beat" falling on the first syllable of each foot (the
stressed syllables dictate where the foot boundaries fall). A tendency
towards isochrony then asserts that the beats fall at more regular
intervals than would be expected from the differing number of
syllables in each foot, and this is because the syllables become
shorter as their number increases. American linguists
are skeptical about this idea but our analyses of a corpus of English
spoken for purposes of illustrating intonation revealed that such a
tendency definitely exists. You'd think it was an easy enough
question to resolve one way or the other, but if you think this you
don't know linguists! :-)
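To see what the claim amounts to numerically, here is a toy
illustration -- the base duration and the compression exponent are
invented for the purpose, not our corpus results. With no compression,
foot duration grows linearly with syllable count and the beats drift
apart; with compression, the beat intervals stay much closer to
regular:

#include <stdio.h>
#include <math.h>

int main(void) {
    double baseMs = 200.0;  // hypothetical one-syllable foot duration
    double alpha  = 0.55;   // compression exponent, purely illustrative
    for (int n = 1; n <= 5; n++) {
        double strict   = baseMs * n;             // fixed syllable length
        double isochron = baseMs * pow(n, alpha); // syllables shorten
        printf("%d-syllable foot: %4.0f ms uncompressed, %4.0f ms "
               "with compression\n", n, strict, isochron);
    }
    return 0;
}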
There are several descriptions of the rhythm work we did. The most
complete one, though very academic, is:
JASSEM, W., HILL, D.R. & WITTEN, I.H. (1984) Isochrony in English
speech: its statistical validity and linguistic relevance. In
Pattern, Process and Function in Discourse Phonology (collection ed.
Dafydd Gibbon), Berlin: de Gruyter, 203-225.
but there is a shorter version that summarises the actual research data:
HILL, D.R., WITTEN, I.H. & JASSEM, W. (1977) Some results from a
preliminary study of British English speech rhythm. Presented at the
94th Meeting of the Acoustical Society of America, Miami, Dec 12-16,
but it only appears as a summary in the proceedings. The full text is
available as U of Calgary Computer Science Dept. Report 78/26/5.
I could send you a draft electronic copy, as I am currently working
on putting one on the web; there is also a hard-copy version
published as a departmental report.
The intonation work is best accessed through Halliday's book, though
Craig Taube-Schock's thesis (for which he received the Governor
General of Canada's Gold Medal) reports the initial experimental work
we did to validate and extend Halliday's descriptions for purposes of
computer speech intonation:
"Synthesizing intonation for computer speech output" Craig-Richard
Taube-Schock. M.Sc. Thesis, Department of Computer Science, The
University of Calgary 1993, 109 pages.
It is available from ProQuest (who archive all university theses in
North America), though they have the date as 1994. In implementing the
intonation for the TextToSpeech kit, a number of improvements were
made that are not written up in the thesis, especially the smoothing
of contours.
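The smoothing itself was never written up, so I can't point you at
the exact algorithm; purely to show the kind of thing involved, here
is a toy three-point moving average over pitch targets (my own
sketch, not the kit's method):

#include <stdio.h>

// Toy contour smoothing: each interior pitch target is replaced by
// the mean of itself and its two neighbours; endpoints are left
// untouched. NOT the TextToSpeech kit's smoothing -- illustration only.
static void smoothContour(double *pitch, int count) {
    if (count < 3) return;
    double prev = pitch[0];
    for (int i = 1; i < count - 1; i++) {
        double current = pitch[i];
        pitch[i] = (prev + current + pitch[i + 1]) / 3.0;
        prev = current;  // use unsmoothed value for the next window
    }
}

int main(void) {
    double contour[] = { 110.0, 150.0, 120.0, 180.0, 95.0 };  // Hz
    smoothContour(contour, 5);
    for (int i = 0; i < 5; i++)
        printf("%.1f ", contour[i]);
    printf("\n");
    return 0;
}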
From the original Developer TextToSpeech kit manual:
The Parser Module takes the text supplied by the client application
(using the speakText: or speakStream: methods) and converts it into
an equivalent phonetic representation. The input text is parsed,
where possible, into sentences and tone groups. This subdivision is
done primarily by examining the punctuation. Each word or number or
symbol within a tone group is converted to a phoneme string which
indicates how the word is to be pronounced. The pronunciation is
retrieved from one of five pronunciation knowledge bases.
The Parser must also deal with text entered in any of the special
text modes. For example, a word may be marked in letter mode, which
means the word is to be spelled out a letter at a time, or in
emphasis mode, which means the word is to receive special emphasis by
lengthening it and altering its pitch. The Parser marks the phonetic
representation appropriately in these cases.
...
The system attempts to speak the text as a person would. Punctuation
is not pronounced, but is used as a guide to pronounce the text it
marks. For example, a period that marks the end of a sentence is not
pronounced, but does indicate that a pause occurs before proceeding
to the next sentence.
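To make that pipeline concrete, here is a schematic sketch of the
stages the manual describes -- my own illustration, not the kit's
code, with lookupPhonemes as a stand-in for the five pronunciation
knowledge bases:

#import <Foundation/Foundation.h>

// Stand-in for the five pronunciation knowledge bases.
static NSString *lookupPhonemes(NSString *word) {
    return [NSString stringWithFormat:@"/%@/", [word lowercaseString]];
}

// Split text into tone groups on punctuation, then convert each word
// to a phoneme string -- the first-order behaviour the manual gives.
static NSArray *parseToToneGroups(NSString *text) {
    NSMutableArray *groups = [NSMutableArray array];
    NSCharacterSet *punct =
        [NSCharacterSet characterSetWithCharactersInString:@".?!,;:"];
    for (NSString *piece in
         [text componentsSeparatedByCharactersInSet:punct]) {
        NSString *trimmed = [piece stringByTrimmingCharactersInSet:
            [NSCharacterSet whitespaceAndNewlineCharacterSet]];
        if ([trimmed length] == 0) continue;
        NSMutableArray *phonemes = [NSMutableArray array];
        for (NSString *word in
             [trimmed componentsSeparatedByString:@" "]) {
            if ([word length]) [phonemes addObject:lookupPhonemes(word)];
        }
        [groups addObject:[phonemes componentsJoinedByString:@" "]];
    }
    return groups;
}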
A question mark at the end of a sentence caused the rising intonation
of a question to be selected. Another special mode allowed
punctuation to be spoken, rather than used to control how the text
was spoken. I have put the whole manual on my university web site,
where it is easier to find than by digging through the Savannah
repository. It doesn't really address these issues completely, but it
is useful for many purposes and will give you helpful background. Go
to:
http://pages.cpsc.ucalgary.ca/~hill
Select "Published papers" from the left-hand menu, scroll down to
section "E. Other publications" and you'll find a whole lot of
Gnuspeech-related documents there. The sixth item is "Manual for the
original NeXT Developer TextToSpeech kit". Clicking the link will
allow you to download a .pdf file of the whole manual. The five
previous links in that section are also useful references for
Gnuspeech and will help you in your work on porting the server.
Many thanks for your willingness to get involved. Very much
appreciated. Feel free to bug me with any questions/problems that
come up.
HTH. All good wishes.
david
---------
David Hill
[email protected]
http://savannah.gnu.org/projects/gnuspeech
--------
The only function of economic forecasting is to make astrology look
respectable. (J.K. Galbraith)
--------
On Nov 4, 2009, at 6:21 PM, John Delaney wrote:
Here I was trying to implement a speech synthesis API for a
graduate musical synthesis class, and now I'm getting roped into
actually working on the project. I'll implement some sort of
Parameter class to hold the current intonation parameters; that
should be pretty simple.
Would it be possible for the synthesis engine to ramp up the
intonation at the end of a sentence whenever there is a question
mark? I don't think I have seen a synthesis engine do this yet, and
it seems like such a small/easy thing to do.
Perhaps I'll revisit this when I eventually take machine learning
classes.
Thank you,
John Delaney
On Wed, Nov 4, 2009 at 5:09 PM, Dalmazio Brisinda
<[email protected]> wrote:
Yes, you are correct. All those server methods are yet to be
implemented. Currently the server just supports speaking text with
the defaults that were taken from Monet. This is certainly one area
that could use some filling out, and any contribution would be more
than welcome.
Best,
Dalmazio
On 2009-11-04, at 5:54 PM, John Delaney wrote:
Thank you all for your help. I have switched to using the server
method because it's very easy and functional. Am I mistaken, though,
that many of the parameters, such as pitch and intonation, have not
yet been implemented in the server? I am looking at the server and
all the get/set methods return zero. I suppose I will need to
implement those if this is the case.
On Wed, Nov 4, 2009 at 12:37 PM, Dalmazio Brisinda
<[email protected]> wrote:
Have a look at the Linked Frameworks section in the Xcode Groups &
Files pane. I've found in the past that when setting up the project
on a different system, I've often had to remove the custom
frameworks (Tube and GnuSpeech) and then add them again, so Xcode
correctly picks up the new locations -- unless they're in standard
system Framework folders. If you would like additional information
on Xcode, have a look at the book "Xcode Unleashed" -- there may
be others.
[snip]
---------
_______________________________________________
gnuspeech-contact mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/gnuspeech-contact