Endemic Problems in Timbre Morphing Processes: Causes and Cures

C. J. Hope and D. J. Furlong, Dept. of Electronic and Electrical Engineering, Trinity College Dublin, Dublin 1. DIT Conservatory of Music, Adelaide Road, Dublin. email: Dermot.Furlong@tcd.ie, ciaran@ciaranhope.com

This paper was published in the Proceedings of the ISSC '98 (Irish Signals and Systems Conference), DIT Kevin Street, Dublin, Ireland, 25-26 June 1998.

Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the ISSC. Contact: A. O'Dwyer, School of Control Systems and Electrical Engineering, Dublin Institute of Technology (DIT), Kevin Street, Dublin 8. Telephone: (00)3531 -402 4992. email: aodwyer@dit.ie

Abstract

The objective of this paper is to investigate and compare the endemic problems that are encountered when attempting to morph two or more tones together to create a 'mongrel' sound in the process known as timbre morphing. Three fundamental difficulties are analysed, namely the identification of harmonics in a time-frequency distribution, the identification of onset and decay times of these harmonics, and the problem of incorrect magnitudes due to the obvious limitation caused by spectral leakage. Solutions to these problems, based on the use of a time-frequency distribution other than the traditional spectrogram, are proposed in this paper, thereby providing the basis for improved timbre feature analysis and processing.

1 Introduction

When the term 'morphing' is mentioned to most people, all thoughts race to the 'Visual Morphing' that can be seen in many ads on TV, in prominent films such as 'Terminator 2: Judgment Day', or in pop videos such as Michael Jackson's 'Black or White'. Few people realise that morphing can be done with sound as well. There are two main 'camps' who have been working on what is referred to as 'Timbre' or 'Audio' Morphing.

Malcolm Slaney [1] led a team which investigated the possibilities inherent in Audio Morphing through the representation of sound in a multi-dimensional space. They developed a new approach based on separate spectrograms to encode the pitch and broad spectral shapes of the sound. These spectrograms are independently modified to create pleasing morphs between many sounds. The key to this approach to morphing is to identify pitches correctly. Slaney concludes that there is room for improvement through the development of better representations, better matching techniques and more natural sounding interpolation schemes - especially perceptually optimal interpolation functions.

The second team who developed a practical approach to Timbre Morphing are the CERL Sound Group at the University of Illinois [2]. They developed a package called Lemur, which outputs a file containing a sequential list of frames, each describing a small portion of the sound. The time-frequency analysis method that Lemur uses to generate these frames is effective, but not optimal in its representation of a sound due to both temporal and spectral smearing.

In the Audio Laboratory in TCD, a software package called 'Mongrel' is being developed which takes two tones and cross-breeds them to generate a mongrel tone which contains characteristics of both parent tones. The motivation behind this development is to aid contemporary electro-acoustic composers in their exploration of new timbres in their compositional process. Two of the main differences between this work and that already completed by other teams are the use of improved time-frequency analysis representations, and a simplified software interface which will allow a composer to generate new sounds at leisure, hard disk space permitting!

To achieve the objective of morphing two tones into a third new tone, the tones must be processed correctly to ensure accurate representation. The accepted and logical method for ensuring that both parent tones contribute as fully as possible is to identify certain 'features' in the contributing spectrograms, and morph these characteristic features into one another. In the program under development all signal manipulation and morphing is to be implemented automatically, keeping user interaction to a minimum. The user simply chooses the two tones to morph and how much of each tone is required in the new sound. This paper sets out to show solutions to some of the problems that are encountered in attempting to arrive at an effective implementation.

2 Timbre Representation

Timbre is defined as that part of a sound which is neither loudness nor pitch. Despite this negative definition, timbre is one of the primal attributes by which we characterise sounds. It is the fundamental reason that a saxophone, for example, will be identifiable whether it is heard on the radio of the most modern hi-fi system, or over a battered old amoebic radio on a remote desert island with diabolical reception. Historically, research has identified that 'timbre space' contains both spectral and temporal dimensions. The importance of the temporal aspect can be appreciated from the fact that any harmonic tone - such as a piano tone - played backwards is perceived differently from the same tone played in its normal sequence, yet the long-term magnitude spectrum of both is identical. Thus, to allow the temporal development of spectral components in a musical sound to be observed, it is necessary to combine time-amplitude and frequency-amplitude representations. When the signal is observed using a time-frequency representation, it is imperative that these components can be effectively quantified.

The CERL Sound Group define a feature as being "any portion of the sound that is important in the morphing process" [3]. In order to facilitate a user-friendly package, a primary requirement is the use of automatic feature identification. During the morphing process, two kinds of feature that are known to contribute perceptually to what a listener hears need to be identified: spectral features, which include the fundamental and harmonics, and transient features such as the onset, steady state and decay.

The Short-Time Fourier Transform (STFT), whose squared magnitude is displayed as the Spectrogram [4], is an extension of the Fourier Transform (FT), where the FT is repeatedly evaluated for a running windowed version of the time domain signal. Each FT gives a frequency domain 'slice' associated with the time value at the window centre. The STFT enables us to observe the temporal and spectral characteristics at any point in the 'timbre space'. However, the STFT has an intrinsic problem in that it introduces both spectral and temporal smearing. As such, its use as an analysis tool for feature identification is somewhat questionable.
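The running windowed transform described above can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch: the function name `stft_magnitude`, the Hann window, and the default window length and hop size are assumptions made here, not parameters taken from the paper.

```python
import numpy as np

def stft_magnitude(x, fs, win_len=128, hop=32):
    """Magnitude STFT: one FFT 'slice' per position of a sliding window.

    Returns (times, freqs, mag), where times holds the window-centre time
    of each slice and mag has shape (n_slices, win_len // 2 + 1).
    The Hann window and default sizes are illustrative assumptions.
    """
    window = np.hanning(win_len)
    starts = np.arange(0, len(x) - win_len + 1, hop)
    # Cut the signal into overlapping frames and taper each one.
    frames = np.stack([x[s:s + win_len] * window for s in starts])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    times = (starts + win_len / 2) / fs          # time at each window centre
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)  # frequency of each bin
    return times, freqs, mag
```

A longer window narrows each spectral peak but blurs events in time, and a shorter window does the opposite, which is the smearing trade-off referred to in the text.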

Recently, the Wigner Distribution (WD) has been examined as an alternative to the STFT for the time-frequency analysis of musical signals. It contributes less spectral smearing than the spectrogram, and no temporal smearing, and as such is a more accurate representation for the purpose of musical synthesis. The WD does, however, introduce unwanted cross-terms into the time-frequency representation. Fortunately, the use of an appropriate smoothing window in the smoothed discrete Pseudo Wigner Distribution (SPWD) alleviates this problem, while still not introducing smearing to the extent evident in the spectrogram. In Figure 1, both the spectral and temporal resolution differences of the STFT and SPWD can be clearly seen.
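For readers who want to experiment, the following is a minimal sketch of a discrete pseudo Wigner distribution with optional smoothing along time. It is not the authors' implementation: the function names, the Hann lag and time windows, and the default sizes are assumptions chosen for illustration. The input is assumed to be a complex (analytic) signal; a real recording would normally be converted to analytic form first.

```python
import numpy as np

def pseudo_wigner(z, half_len=31, nfft=128):
    """Discrete pseudo Wigner distribution of a complex signal z.

    For each time n, the lag product z[n+m] * conj(z[n-m]) is tapered by a
    lag window and Fourier transformed over m. Because the lag step is two
    samples, bin k corresponds to frequency k * fs / (2 * nfft).
    Requires nfft >= 2 * half_len + 1. Window choice is an assumption.
    """
    n_samples = len(z)
    m = np.arange(-half_len, half_len + 1)
    h = np.hanning(2 * half_len + 1)              # symmetric lag window
    W = np.zeros((n_samples, nfft))
    for n in range(n_samples):
        # Keep only lags for which both n+m and n-m lie inside the signal.
        valid = (n + m >= 0) & (n + m < n_samples) & (n - m >= 0) & (n - m < n_samples)
        mv = m[valid]
        kernel = np.zeros(nfft, dtype=complex)
        kernel[mv % nfft] = h[valid] * z[n + mv] * np.conj(z[n - mv])
        # The kernel is conjugate-symmetric in m, so its FFT is real.
        W[n] = np.fft.fft(kernel).real
    return W

def smooth_in_time(W, g_len=5):
    """Smooth each frequency bin along time to attenuate cross-terms."""
    g = np.hanning(g_len + 2)[1:-1]               # drop the zero end points
    g = g / g.sum()
    return np.apply_along_axis(lambda col: np.convolve(col, g, mode="same"), 0, W)
```

Increasing `g_len` suppresses more of the oscillating cross-terms at the cost of some temporal resolution, so the time window is the parameter that trades cross-term rejection against smearing.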

Figure 1. The STFT (left) and SPWD (right) of a synthetic dual chirp tone duration 0.7sec, sampled at 511Hz - rising in frequency from 10Hz to 100Hz, and falling in frequency from 220Hz to 140Hz.

3 Feature Extraction

To date, the basis for timbre representation and manipulation has been what is referred to as the 'classical' concept of tone quality which relates timbre primarily to its spectral composition, i.e. to the pattern of harmonics inherent in different instrument tones. The more modern position gives full acknowledgement to the importance of the details of the temporal evolution of individual harmonics. The issues involved in 'error free' feature detection that require care include the spectral identification of partials, and the temporal identification of salient points, such as the onset and decay of harmonics. To quote Malcolm Slaney, "Every time you make a decision, i.e. 'this is a feature', you have the chance to make an error." In other words, you can identify it incorrectly. There is also the concern of a mismatch in the number of features in two tones to be morphed. There may be a partial in one sound with no corresponding partial in the second sound. In this case, the existing partial is morphed with a zero-magnitude partial. When we are faced with a tone to morph, what is required is to choose a select number of features, identify them in the distributions, and ultimately interpolate between the start and end tones to locate any chosen intermediate timbre.
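The pairing and interpolation step, including the case of a partial with no counterpart in the other tone, can be sketched as follows. The data layout (a dictionary keyed by harmonic number), the function name `morph_partials`, and the use of plain linear interpolation are all assumptions made for illustration; the text above does not prescribe a particular interpolation function.

```python
def morph_partials(partials_a, partials_b, mix):
    """Interpolate between two sets of partials.

    partials_a, partials_b: dict mapping harmonic number to a
        (frequency_hz, magnitude) tuple for one analysis frame.
    mix: 0.0 gives tone A, 1.0 gives tone B, 0.5 an equal blend.
    A partial missing from one tone is paired with a zero-magnitude
    partial at the same frequency. Linear interpolation is an assumption.
    """
    morphed = {}
    for h in sorted(set(partials_a) | set(partials_b)):
        freq_a, mag_a = partials_a.get(h, (None, 0.0))
        freq_b, mag_b = partials_b.get(h, (None, 0.0))
        # Borrow the frequency of the existing partial for the missing one.
        if freq_a is None:
            freq_a = freq_b
        if freq_b is None:
            freq_b = freq_a
        morphed[h] = ((1.0 - mix) * freq_a + mix * freq_b,
                      (1.0 - mix) * mag_a + mix * mag_b)
    return morphed
```

With `mix = 0.5`, a partial present in only one parent appears in the result at half its original magnitude, which matches the zero-magnitude pairing described in the text.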

3.1 Fundamental and Harmonic Identification

The SPWD can be interpreted as a distribution of a signal's energy in time and frequency. Just as the STFT contains leakage due to the windowing process involved in generating the FFT, so will the SPWD, although the leakage will be less pronounced. The first important step in the morphing process is to identify which partials are going to be paired for morphing. The accuracy of the spectrum clearly depends on the number of samples analysed. Nevertheless, when the spectrum is analysed to locate the partials - fundamental and harmonics - it should be possible to locate them regardless of the FFT length. We can calculate the frequency of any component in the FFT output using

F_k = (K/N) * F_s    (1)

where F_k is the frequency value at index or sample number K, N is the length of the spectrum, and F_s is the sampling frequency. However, we are not interested in just any frequency, but rather in the harmonic component frequencies, assuming we are dealing with harmonic spectra. To locate these frequencies, we need to scan the spectrum for peaks at every temporal location in the time-frequency representation, because the harmonics vary in amplitude and frequency over time. To speed up this process, we estimate the separation of the harmonics, in other words, an index corresponding to the frequency of the fundamental. This index is our margin of error, or 'error index'. To speed up the search further, the algorithm should allow us to jump forward a few samples in the spectrum if there are no significant magnitudes after a few trials, so that we do not have to analyse every single value individually. This 'skip' amount varies relative to the length of the sample and the sampling frequency. This harmonic search algorithm has been found to work successfully and efficiently. Moreover, the use of the SPWD rather than the STFT allows for more accurate partial identification, as can be seen in Figure 2.
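A simplified sketch of this search for a single spectral slice is given below. It applies equation (1) directly and, rather than reproducing the skip heuristic exactly, jumps straight to the neighbourhood of each expected harmonic. The function names, the `margin` search width and the `threshold` parameter are assumptions introduced for illustration.

```python
import numpy as np

def bin_frequency(k, n_fft, fs):
    """Equation (1): frequency in Hz of spectrum index k."""
    return k / n_fft * fs

def find_harmonics(mag, k0, margin, threshold):
    """Locate harmonic peaks in one magnitude spectrum slice.

    mag: magnitude spectrum for one temporal location.
    k0: estimated index of the fundamental (the harmonic spacing).
    margin: number of bins searched either side of each expected location.
    threshold: peaks below this magnitude are ignored.
    Returns the list of peak indices found.
    """
    peaks = []
    h = 1
    while h * k0 + margin < len(mag):
        centre = h * k0
        lo = max(centre - margin, 0)
        hi = centre + margin + 1
        # Only bins near the expected harmonic are examined; the rest are skipped.
        k = lo + int(np.argmax(mag[lo:hi]))
        if mag[k] >= threshold:
            peaks.append(k)
        h += 1
    return peaks
```

Because the integer estimate `k0` drifts from the true harmonic spacing as the harmonic number grows, `margin` has to be wide enough to absorb that accumulated error at the highest harmonic of interest.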

Figure 2. The STFT (left) and SPWD (right) of a section of Clarinet tone - pitch E (660Hz) - duration 0.05sec, sampled at 11025Hz.

3.2 Onset/Decay Identification

After locating the spectral features of the signal, the next thing that must be done before morphing can be effected is to locate the temporal features (onset and decay). Using the STFT, this analysis can be inaccurate, due to the intrinsic temporal spreading. By replacing the STFT with the SPWD, this problem is eased, and an environment for accurate temporal identification can be said to exist. In this analysis, each harmonic must be analysed individually to locate its salient features. When the magnitude of the harmonic rises beyond 10% of its steady state value, the attack, or onset, is considered to have begun, ending at 90% of the steady state magnitude. Conversely, when the harmonic drops to 90% of its steady state magnitude, it is deemed to have begun to decay. It must also be considered that the harmonic is prone to 'frequency wobble', a phenomenon that cannot be described as 'vibrato' because it is not as audible, yet is still present. This can be easily compensated for by analysing slightly above and below the nominal frequency location - a quarter-tone variance, for example - at each temporal analysis point.
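The 10% and 90% rules, together with the quarter-tone tolerance for frequency wobble, can be sketched as follows. The function names are hypothetical, and taking the envelope maximum as the steady state magnitude is an assumption, since the text does not say how that value is estimated. The time-frequency array is assumed to have shape (time frames, frequency bins).

```python
import numpy as np

QUARTER_TONE = 2.0 ** (1.0 / 24.0)  # frequency ratio of a quarter tone

def harmonic_envelope(tfr, freqs, f_nominal):
    """Track one harmonic over time, tolerating frequency wobble.

    At every frame, takes the largest magnitude found within a quarter
    tone either side of the nominal harmonic frequency.
    """
    band = (freqs >= f_nominal / QUARTER_TONE) & (freqs <= f_nominal * QUARTER_TONE)
    return tfr[:, band].max(axis=1)

def onset_and_decay(envelope, low=0.1, high=0.9):
    """Return (onset_start, onset_end, decay_start) as frame indices.

    onset_start: first frame above 10% of the steady state magnitude.
    onset_end:   first frame at or above 90% of it.
    decay_start: last frame still at or above 90%; decay follows it.
    The steady state magnitude is taken as the envelope maximum (assumption).
    """
    steady = envelope.max()
    onset_start = int(np.argmax(envelope > low * steady))
    onset_end = int(np.argmax(envelope >= high * steady))
    decay_start = int(np.nonzero(envelope >= high * steady)[0][-1])
    return onset_start, onset_end, decay_start
```

For tones with a pronounced overshoot in the attack, a median over the sustained portion may be a more robust steady state estimate than the maximum used here.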

3.3 Amplitude Identification

To correctly identify the amplitude of the signal, the effects of sampling upon the magnitude representation of the signal must be considered. As is well documented, sampling potentially introduces an error due to the creation of frequency bins. This error shows up in the form of a frequency spread (spectral leakage) that can become rather dramatic as the critical midpoint of the FFT bin is reached. Analysis shows that this error is proportional to the distance from the edge of the FFT bin. Consequently, by setting up an analysis program, the correct amplitude of any spectral value can be estimated simply by using the relationship between leakage and proximity to the bin edges.
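The paper does not spell out the relationship it uses, so the sketch below substitutes a standard two-bin interpolation for an unwindowed (rectangular window) FFT as one possible illustration. It estimates how far the true frequency lies from the peak bin from the ratio of the peak to its larger neighbour, then divides out the corresponding leakage loss. The function name `corrected_peak` and the single isolated sinusoid assumption are introduced here and are not part of the original method.

```python
import numpy as np

def corrected_peak(x, fs):
    """Estimate frequency and amplitude of the strongest sinusoid in x.

    Assumes a rectangular window and one well-separated sinusoid.
    Returns (frequency_hz, corrected_amplitude, raw_peak_amplitude).
    """
    n = len(x)
    mag = np.abs(np.fft.rfft(x)) * 2.0 / n     # scale so a unit sinusoid reads 1
    k = int(np.argmax(mag[1:-1])) + 1          # peak bin, guaranteed two neighbours
    # The larger neighbour indicates on which side the true frequency lies.
    if mag[k + 1] >= mag[k - 1]:
        ratio, sign = mag[k + 1] / mag[k], 1.0
    else:
        ratio, sign = mag[k - 1] / mag[k], -1.0
    offset = sign * ratio / (1.0 + ratio)      # fractional bin offset, |offset| <= 0.5
    # Leakage scales the peak by approximately sinc(offset); undo it.
    amplitude = mag[k] / np.sinc(offset)
    frequency = (k + offset) * fs / n
    return frequency, amplitude, mag[k]
```

For a 660Hz tone sampled at 11025Hz and analysed with a 1024-point FFT, the true frequency falls about 0.3 of a bin above the nearest bin centre, so the uncorrected peak under-reads the amplitude noticeably while the corrected value is close to the true one.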

Figure 3.1. A Violin Tone - pitch E (660Hz) (left), and a Clarinet Tone - pitch E (660Hz) (right), duration 1sec, both sampled at 11025Hz.

Figure 3.2. A Morphed Tone - containing 50% Clarinet and 50% Violin, duration 1sec, sampled at 11025Hz.

In Figures 3.1 and 3.2 above, we can see two tones and a mongrel tone which contains 50% of each of the two original tones. The morph was created using the ideas that have been laid out in this paper. Looking closely at the mongrel tone, it can be seen that the pureness of the clarinet sound has blended itself into the vibrato and jaggedness of the violin tone. It can also be seen that the magnitude of the mongrel tone is approximately halfway between the two parent tones for all temporal and spectral content. It must be remembered, however, that the alterations that can be seen are not always the most perceptually identifiable changes, so to ensure that the morph is in fact approaching a compromise between the two parent tones, listening tests will have to be carried out.

4 Conclusions

As mentioned in section 3.3, an intrinsic problem with the FFT is that we are dealing with sampled signals, which introduces an error due to the creation of frequency bins. By using the SPWD instead of the STFT, this error can be reduced, allowing for improved spectral analysis. Similarly, as the SPWD does not cause the temporal smearing of the STFT, the dynamic attributes which have been shown to be perceptually significant can be identified more accurately. As a result of the use of more appropriate joint time-frequency (JTF) distributions combined with an analysis of leakage, the problems of identifying individual frequencies in any given signal, and of identifying the onset and decay locations of these frequencies, can be better approached. For the purposes of musical synthesis deriving from timbre morphing, the improved accuracy of the methods discussed will allow for more accurate extraction and manipulation of those features which characterise musical timbre.

5 References

[1] M. Slaney, M. Covell and B. Lassiter, "Automatic Audio Morphing," in Proc. ICASSP, IEEE, Atlanta, GA, May 7-10, 1996.
[2] B. Walker and K. Fitz, "Lemur," CERL Sound Group, University of Illinois, Urbana, IL 61801, USA.
[3] B. Holloway, E. Tellman and L. Haken, "Timbre Morphing of Sounds with Unequal Numbers of Features," J. Audio Eng. Soc., vol. 43, no. 9, pp. 678-689, September 1995.
[4] D. J. Furlong and C. J. Hope, "Time-Frequency Distributions for Timbre Morphing: The Wigner Distribution versus the STFT," Proc. SBCMIV, Brasilia, Brasil, August 1997.