Official teaser for the Vocaloid software:
The phonetic transcription system used by Vocaloid isn't the IPA but a modified version of X-SAMPA, an ASCII-based alternative, since IPA symbols aren't widely supported on computer keyboards. As shown in the video, instead of typing in phonemes, one inputs the lyrics, and a built-in dictionary transcribes them. (It is possible to edit the phonemes afterward, however.) For example, if I typed in "hide" as the lyrics for a note, it would automatically be converted to the phonemes [h aI d].
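To make that dictionary step concrete, here is a tiny sketch in Python. The entries and their X-SAMPA transcriptions are illustrative stand-ins, not Vocaloid's real dictionary (the "hide" entry just mirrors the example above):

```python
# Toy grapheme-to-phoneme lookup. Entries are illustrative only;
# Vocaloid's actual dictionary and phoneme set are far larger.
LYRIC_DICT = {
    "hide": ["h", "aI", "d"],      # matches the example in the post
    "water": ["w", "O:", "t", "@"],
}

def transcribe(lyric: str) -> list[str]:
    """Return the phoneme sequence for one note's lyric."""
    return LYRIC_DICT[lyric.lower()]

print(transcribe("hide"))  # ['h', 'aI', 'd']
```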
The system handles polysyllabic words by automatically splitting them across the inputted notes. However, the Vocaloid software was originally developed by teams from Spain, Sweden, and Japan, so the English phonetic dictionary - which has barely changed since the software's first release - tends to deviate quite often from the phonemes a native speaker would sing.
While the phonology of singing certainly differs from that of speech, there are various phonological rules that the dictionary doesn't even begin to consider - for example, many English voicebanks have no phoneme for the alveolar flap (/ɾ/) or for glottalized/unreleased syllable-final plosives. Thus, after inputting lyrics, many users edit the phonemes by hand to sound more natural. At the same time, Vocaloid only uses a limited set of phonemes, so many sounds must be simulated with another phoneme (for example, where an alveolar flap would occur, users often substitute an unaspirated [d]).
But what if Vocaloid could do those transformations automatically? It wouldn't be impossible to write an algorithm that determines the stress of each syllable in the context of its phrase, then uses regular expressions to parse the phonemes (and stress marks) and apply the transformations. While the result would still be imperfect, especially given the missing phonemes, it would mean much less manual editing for the user.
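As a rough illustration of that idea, here is a minimal Python sketch. It runs a single regex rule over a space-separated phoneme string - an assumed notation where a "1" appended to a vowel marks stress, not Vocaloid's actual internal format - rewriting /t/ between a stressed vowel and an unstressed vowel as [d], the flap substitution mentioned above:

```python
import re

# Illustrative sketch only: the phoneme notation (space-separated
# X-SAMPA with "1" appended to a stressed vowel) and the single rule
# below are assumptions, not Vocaloid's actual internals.

# A vowel symbol: a vowel letter optionally followed by more letters
# or a length mark (e.g. "aI", "O:"), but never a stress digit.
VOWEL = r"[aeiouAEIOU@{V][a-zA-Z:]*"

def apply_flapping(phonemes: str) -> str:
    """Rewrite /t/ between a stressed vowel and an unstressed vowel
    as [d], simulating the alveolar flap the phoneme set lacks."""
    pattern = re.compile(rf"({VOWEL}1) t ({VOWEL})(?=\s|$)")
    return pattern.sub(r"\1 d \2", phonemes)

print(apply_flapping("w O:1 t @"))  # "water": -> "w O:1 d @"
print(apply_flapping("h aI1 d"))    # "hide": no /t/, unchanged
```

A real implementation would chain many such rules and would need the stress marks to come from a separate stress-assignment pass, as described above.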
However, having read Kenstowicz's article arguing against a one-level model of phonology, I suspect such an algorithm may never actually work. After all, the only information the algorithm would have is the surface phonemes from the dictionary, so it would be operating on a single level of rules, with no access to the underlying representations that condition many of them.
Of course, improving the automatic transcription of lyrics is just one consideration for a piece of software that is already amazing, given the complexity of the theory and optimization that go into it. And, sure, the phonetic system may be lacking in several ways, but there's always a point where more precision isn't necessary - even an imprecise system already produces very good results:
This is such an interesting lens through which to view last week's reading. Using an existing technology to give the idea of phonemes some practical relevance in everyday life is a great way to learn about the concept.
You do an amazing job giving context to the complexity of phonology and phonetics by trying to solve a problem and then using Kenstowicz's arguments to show how difficult the problem actually is.
What I do have to ask is: if a one-level model of phonology might not be complex enough to realise language, but yields a seemingly very good approximation (that video was unexpectedly awesome), wouldn't just one more level of user interaction, to produce more specific sounds, be sufficient to make the final product completely indistinguishable from real speech or singing (even without the complexity of the real phonological system)?