.nr PO 1i
.po 1i
.nr LL 6.5i
.ll 6.5i
.ds CF
.hy 14
.TL
The Phase Vocoder:  A Tutorial
.AU
Mark Dolson
.AI
Computer Audio Research Laboratory
Center for Music Experiment, Q-037 
University of California, San Diego
La Jolla, California  92093
.AB
The phase vocoder is a digital signal processing technique of potentially
great musical significance.  It can be used to perform very high fidelity
time scaling, pitch transposition, and myriad other modifications of recorded
sounds.  In this tutorial, I attempt to explain the operation of the
phase vocoder in terms that musicians can understand.
.AE
.RT
.bp 1
.SH
Introduction
.PP
For composers interested in the modification of natural sounds, the
phase vocoder is a digital signal processing technique of potentially
great significance.  By itself, the phase vocoder can perform very
high fidelity time-scale modification or pitch transposition of a
wide range of sounds.  In conjunction with a standard software
synthesis program, the phase vocoder can provide the composer with
arbitrary control of individual harmonics. But use of the phase vocoder
to date has been limited primarily to experts in digital signal processing.
Consequently, its musical potential has remained largely untapped.  
.PP
In this article, I attempt to explain the operation of the phase vocoder
in terms that musicians can understand.  I rely heavily on the familiar
concepts of sine waves, filters, and additive synthesis, and I employ a
minimum of mathematics.  My hope is that this tutorial will lay
the groundwork for widespread use of the phase vocoder, both as
a tool for sound analysis and modification, and as a catalyst for
continued musical exploration.
.SH
Overview
.PP
Historically, the phase vocoder comes from a long line of voice
coding techniques which were developed primarily for the electronic
processing of speech.  Indeed, the word ``vocoder'' is simply a
contraction of the term ``voice coder.''  There are many different
types of vocoders.  The \fIphase\fR vocoder was first described in 1966
in an article by Flanagan and Golden.  However, it is only in the
past ten years that this technique has really become popular and
well understood.
.PP
The phase vocoder is one of a number of digital signal processing algorithms
which can be categorized as analysis-synthesis techniques.  Mathematically,
these techniques are sophisticated algorithms which take an input
signal and produce an output signal which is either identical to the input
signal or a modified version of it.   The underlying assumption
is that the input signal can be well represented by a model whose
parameters are varying with time.  The analysis is devoted to 
determining the values of these parameters for the signal in question,
and the synthesis is simply the output of the model itself.  For example,
in linear prediction the signal is modeled as the output of a time-varying
filter whose input and frequency-response are determined by the analysis.
.PP
The benefits of analysis-synthesis formulations are considerable.  
Since the synthesis is based on an analysis of a specific signal, the
synthesized output can be virtually identical to the original input; this
can occur even when the signal in question bears little relation to the
assumed model.  Furthermore, the parameter values which are derived
from the analysis can be modified to synthesize any number of useful
modifications of the original signal.  In this case, however,
the perceptual significance and musical utility of the result
depend critically on the degree to which the assumed model matches the
signal to be modified.
.PP
In the phase vocoder, the signal is modeled as a sum of sine waves, and the
parameters to be determined by analysis are the time-varying amplitudes and
frequencies of each sine wave.  Since these sine waves are not required 
to be harmonically related, this model is appropriate for a wide variety
of musical signals.  As a result, the phase vocoder can serve as the basis
for a variety of very high fidelity modifications.  It is this feature
which makes the phase vocoder so useful in computer music.
.PP
In the sections which follow, I show in detail how the phase vocoder
analysis-synthesis is actually performed.  In particular, I show that
there are two complementary (but mathematically equivalent) viewpoints
which may be adopted.  I refer to these as the \fIFilter Bank\fR interpretation
and the \fIFourier Transform\fR interpretation, and I discuss each in turn.
Lastly, I show how the results of the phase vocoder analysis can be
used musically to effect useful modifications of recorded sounds.
.SH
The Filter Bank Interpretation
.PP
The simplest view of the phase vocoder analysis is that it consists
of a fixed bank of bandpass filters with the output of each filter
expressed as a time-varying amplitude and a time-varying frequency
(see Figure 1).  The synthesis is then literally a sum of
sine waves with the time-varying amplitude and frequency
of each sine wave being obtained directly from the corresponding
bandpass filter.  If the center frequencies of the individual
bandpass filters happen to align with the harmonics of a musical
signal, then the outputs of the phase vocoder analysis are essentially
the time-varying amplitudes and frequencies of each harmonic.
But even when this situation does not obtain, the phase vocoder
analysis is still surprisingly useful.
.KF
.sp
.sp
.IS
...libfile arc
...libfile arrow
...libfile circle
...libfile rect
...width 4.
osc {
    var w;
    conn w+(0,.5) to w+(0,-.5);
    conn w+(0,.5) to w+(.25,.5);
    conn w+(0,-.5) to w+(.25,-.5);
    put arc {
	center = w+.25;
	radius = .5;
	startang = 270;
	endang = 90;
    } ;
    conn w+.75 to w+1.5;
}
chan {
    var orig;
    put arrow {
	tl = orig;
	hd = tl+1;
    } ;
    put filt: rect {
	ht = 1;
	wd = 1.25;
	w = orig+1;
    } ;
    put arrow {
	tl = .7[filt.se,filt.ne];
	hd = tl+1.75;
    }
    put arrow {
	tl = .3[filt.se,filt.ne];
	hd = tl+1.75;
    }
    put osci: osc {
	w = 1.75+filt.e;
    } ;
}
main {
    var lj,rj;
    lj = 1;
    rj = 7.5;
    right "input " at 0;
    "amplitude" at (4.12,4.3);
    "frequency" at (4.12,3.6);
    conn 0 to lj;
    conn (lj,4) to (lj,-4);
    put chan {
	orig = (lj,4);
    } ;
    put chan {
	orig = (lj,2);
    } ;
    put chan {
	orig = (lj,0);
    } ;
    put chan {
	orig = (lj,-2);
    } ;
    put chan {
	orig = (lj,-4);
    } ;
    put arrow {
	tl = (rj-1,4);
	hd = (rj-.1,.22);
    } ;
    put arrow {
	tl = (rj-1,2);
	hd = (rj-.2,.14);
    } ;
    put arrow {
	tl = (rj-1,0);
	hd = (rj-.25,0);
    } ;
    put arrow {
	tl = (rj-1,-2);
	hd = (rj-.2,-.14);
    } ;
    put arrow {
	tl = (rj-1,-4);
	hd = (rj-.1,-.22);
    } ;
    put circle {
	center = rj;
	radius = .25;
    } ;
    conn rj-.05 to rj+.05;
    conn (rj,.05) to (rj,-.05);
    put arrow {
	tl = rj+.25;
	hd = tl+.75;
    } ;
    left " output" at rj+1;
    "filters" at (2.62,-5);
    "oscillators" at (5.25,-5);
}
.IE
.sp
.ce
Figure 1.  The Filter Bank Interpretation
.sp
.sp
.KE
.PP
The filter bank itself has only three constraints.  First, the 
frequency response characteristics of the individual bandpass
filters are identical except that each filter has its
passband centered at a different frequency.  Second, these
center frequencies are equally spaced across the entire spectrum
from 0 Hz to half the sampling rate.  Third, the individual bandpass
frequency response is such that the combined frequency response
of all the filters in parallel is essentially flat across the
entire spectrum.  This ensures that no frequency component
is given disproportionate weight in the analysis, and that the
phase vocoder is in fact an analysis-synthesis identity.  As a
consequence of these constraints, the only issues in the design
of the filter bank are the number of filters and the individual
bandpass frequency response.
.PP
The number of filters must be sufficiently large so that there
is never more than one partial within the passband of any single
filter.  For harmonic sounds, this amounts to saying that the
number of filters must be greater than the sampling rate divided
by the pitch.  For inharmonic and polyphonic sounds, the number
of filters may need to be much greater.  If this condition is not
satisfied, then the phase vocoder will not function as intended
because the partials within a single filter will constructively
and destructively interfere with each other, and the
information about their individual frequencies will be
coded as an unintended temporal variation in a single composite signal.
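.PP
As a rough illustration of this rule for harmonic sounds, the following
Python sketch picks a filter count from the sampling rate and the pitch.
The function name min_filters and the restriction to powers of two
(convenient if the filter bank is later computed with an FFT) are
assumptions made for this example only.
.DS L
```python
def min_filters(sample_rate_hz, pitch_hz):
    # Hypothetical helper: smallest power of two that is strictly greater
    # than sample_rate / pitch, so the filter spacing is below the pitch.
    ratio = sample_rate_hz / pitch_hz
    n = 1
    while n <= ratio:
        n *= 2
    return n

# A 220 Hz tone sampled at 10 kHz: 10000 / 220 is about 45.5.
print(min_filters(10000, 220))
```
.DE
.PP
For inharmonic or polyphonic material the same helper would have to be
given the smallest spacing between any two partials instead of the pitch.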
.PP
The design of the representative bandpass filter is 
dominated by a single consideration:  the sharper the
filter frequency response cuts off at the band edges (i.e., the
less overlap between adjacent bandpass filters), the longer its
impulse response will be (i.e., the longer the filter will ``ring'').
Thus, to get sharp cut-offs with minimal overlap, one must use
filters whose time response is very sluggish.  In the phase
vocoder, this tradeoff is ever-present, and the best solution
is generally discovered experimentally by simply trying different
filter settings for the sound in question.
.SH
A Closer Look at the Filter Bank
.PP
The above paragraphs provide an adequate description of the phase
vocoder from the standpoint of the user, but they leave unanswered
the question of how it actually works.  In this section, I 
show in detail how the output of a single bandpass filter is
expressed as a time-varying amplitude and a time-varying frequency.
.PP
The actual operation of a single phase-vocoder bandpass filter is
shown in Figure 2.  This diagram may appear complicated, but it can
easily be broken down into a series of fairly simple mathematical
steps.
.KF
.sp
.sp
.IS
...libfile arrow
...libfile circle
...libfile rect
...width 4.
main {
    var p1,p2,p3,p4,p5,p6,p7,p8,p9,p10;
    p1 = 1;
    p2 = (p1,1);
    p3 = (p1,-1);
    p4 = p2+1;
    p5 = p3+1;
    p6 = p4+1;
    p7 = p5+1;
    p8 = p6+2;
    p9 = p7+2;
    p10 = .5[p8,p9];
    conn 0 to p1;
    conn p2 to p3;
    put arrow {
	tl = p2;
	hd = p4-.25;
    } ;
    put arrow {
	tl = p3;
	hd = p5-.25;
    } ;
    put arrow {
	tl = p4 + (0,1);
	hd = p4 + (0,.25);
    } ;
    put arrow {
	tl = p5 + (0,-1); 
	hd = p5 + (0,-.25);
    } ;
    conn p4-(.1,.1) to p4+(.1,.1);
    conn p4+(-.1,.1) to p4+(.1,-.1);
    conn p5-(.1,.1) to p5+(.1,.1);
    conn p5+(-.1,.1) to p5+(.1,-.1);
    put circle {
	center = p4;
	radius = .25;
    } ;
    put circle {
	center = p5;
	radius = .25;
    } ;
    put arrow {
	tl = p4+.25;
	hd = p6;
    } ;
    put arrow {
	tl = p5+.25;
	hd = p7;
    } ;
    put rect {
	ht = 1;
	wd = 1.25;
	w = p6;
    } ;
    put rect {
	ht = 1;
	wd = 1.25;
	w = p7;
    } ;
    put arrow {
	tl = p8-.75;
	hd = p8;
    } ;
    put arrow {
	tl = p9-.75;
	hd = p9;
    } ;
    put rect {
	ht = 3;
	wd = 2;
	w = p10;
    } ;
    put arrow {
	tl = p10+(2,.55);
	hd = tl+1;
    } ;
    put arrow {
	tl = p10+(2,-.55);
	hd = tl+1;
    } ;
    right "input " at 0;
    "sin(2\(*pft)" at p4+(0,1.2);
    "cos(2\(*pft)" at p5+(0,-1.4);
    "lowpass" at p6+(.6,.1);
    "lowpass" at p7+(.6,.1);
    "filter" at p6+(.6,-.2);
    "filter" at p7+(.6,-.2);
    "rectangular" at p10+(1,.3);
    "to" at p10+1;
    "polar" at p10+(1,-.3);
    left " magnitude" at p10+(3,.55);
    left " phase" at p10+(3,-.55);
}
.IE
.sp
.ce
Figure 2.  An Individual Bandpass Filter
.sp
.sp
.KE
.PP
In the first step, the incoming signal is routed into two parallel
paths.  In one path, the signal is multiplied by a sine wave with an
amplitude of 1.0 and a frequency equal to the center frequency of the
bandpass filter; in the other path, the signal is multiplied by
a cosine wave of the same amplitude and frequency.  Thus, the two
parallel paths are identical except for the phase of the multiplying
waveform. Then, in each path, the result of the multiplication is fed
into a lowpass filter.
.PP
The multiplication operation itself should
be familiar to musicians as simple ring modulation.
Multiplying any signal by a sine (or cosine) wave of constant frequency
has the effect of simultaneously shifting all the frequency components
in the original signal by both plus and minus the frequency of the sine
wave.  An example of this is shown in Figure 3 in which a 100 Hz
sine wave multiplies an input signal of 101 Hz.  The result
is a sine wave at 1 Hz (i.e., 101 Hz \(mi 100 Hz) and a sine
wave at 201 Hz (i.e., 101 Hz + 100 Hz).  Furthermore, if this
result is now passed through an appropriate lowpass filter,
only the 1 Hz sine wave will remain.  This sequence of operations
(i.e., multiplying by a sine wave of frequency \fIf\fR and then lowpass
filtering) is useful in a variety of signal processing applications and
is known as \fIheterodyning\fR.  Any input frequency components in the
vicinity of frequency \fIf\fR are shifted down to the vicinity of 0 Hz
and allowed to pass; input frequency components not in the vicinity of
frequency \fIf\fR are similarly shifted but not by enough to get through
the lowpass filter.  The result is a type of bandpass filtering
in which the passband is frequency-shifted down to very low frequencies.
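.PP
The frequency shifting can be checked numerically.  The Python sketch
below multiplies a 101 Hz sine by a 100 Hz sine and compares the product
with the sum of a 1 Hz and a 201 Hz component, using the standard identity
sin(A) sin(B) = 1/2 [cos(A\(miB) \(mi cos(A+B)].  The 10 kHz sampling rate
and the function names are assumptions chosen for the illustration.
.DS L
```python
import math

def ring_modulate(f_in, f_osc, t):
    # Product of the input sine and the multiplying sine at time t.
    return math.sin(2.0 * math.pi * f_in * t) * math.sin(2.0 * math.pi * f_osc * t)

def difference_and_sum(f_in, f_osc, t):
    # Half the difference-frequency cosine minus half the sum-frequency cosine.
    low = math.cos(2.0 * math.pi * (f_in - f_osc) * t)
    high = math.cos(2.0 * math.pi * (f_in + f_osc) * t)
    return 0.5 * (low - high)

sample_rate = 10000  # assumed sampling rate in Hz
max_error = max(
    abs(ring_modulate(101.0, 100.0, n / sample_rate)
        - difference_and_sum(101.0, 100.0, n / sample_rate))
    for n in range(sample_rate)
)
print(max_error < 1e-9)
```
.DE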
.KF
.sp
.sp
.ce
sin(A) sin(B) = 1/2 [cos(A\(miB) \(mi cos(A+B)]
.sp
.IS
...libfile arrow
...width 4.
axis {
    var orig;
    put arrow {
	head = .5;
	tl = orig;
	hd = tl+25;
    } ;
    left " frequency (Hz)" at orig+25;
    conn orig to orig+(0,-.3);
    "0" at orig+(0,-1);
    conn orig+10 to orig+(10,-.3);
    "100" at orig+(10,-1);
    conn orig+20 to orig+(20,-.3);
    "200" at orig+(20,-1);
}
main {
    put axis {
	orig = (0,10);
    } ;
    conn (10,10) to (10,12);
    put axis {
	orig = (0,5);
    } ;
    conn (10.1,5) to (10.1,7);
    put axis {
	orig = (0,0);
    } ;
    conn (.1,0) to (.1,2);
    conn (20.1,0) to (20.1,2);
}
.IE
.sp
.ce
Figure 3.  Multiplying Two Sine Waves
.sp
.sp
.KE
.PP
In the phase vocoder, heterodyning is performed in each of the two
parallel paths.  But since one path heterodynes with a sine wave while
the other path uses a cosine wave, the resulting heterodyned
signals in the two paths are out of phase by 90 degrees.  Thus, in the
above example, both paths will produce a 1 Hz sinusoidal wave at the outputs
of their respective lowpass filters, but the two sinusoids will be 90
degrees out of phase with respect to each other.  To understand what the
phase vocoder does next with these signals, we can consider the
rotating wheel illustrated in Figure 4.
.KF
.sp
.sp
.EQ
delim $$
.EN
.IS
...libfile arc
...libfile circle
...width 4.
dash {
    var beg,end,mid;
    mid = .5[beg,end];
    conn beg to mid;
}
main {
    put circle {
	center = 0;
	radius = 2;
    } ;
    conn -3 to 3;
    conn (0,-3) to (0,3);
    conn 0 to (1.732,1)
	using 10 dash {
	} <beg,end>;
    conn (1.732,1) to (1.732,0)
	using 10 dash {
	} <beg,end>;
    conn (1.732,1) to (0,1)
	using 10 dash {
	} <beg,end>;
    put arc {
	center = 0;
	radius = .75;
	startang = 0;
	endang = 30;
    } ;
    left " x" at (3,-.05);
    "y" at (0,3.1);
    "$x sub o$" at (1.732,-.2);
    right "$y sub o$ " at (0,1);
    "$r$" at (.8,.6);
    "$theta$" at (.9,.15);
}
.IE
.EQ
size 12 r~=~size +2 sqrt size -2 { {{x sub o} sup 2} + {{y sub o} sup 2}}
.EN
.EQ
size 12 theta~=~arctan left ( {y sub o} over {x sub o} right )
.EN
.sp
.ce
Figure 4.  Rectangular and Polar Coordinates
.sp
.sp
.KE
.PP
Suppose that we wish to plot the position 
of some point on the wheel as a function of time.
We have a choice of using ``rectangular'' coordinates (e.g., horizontal
position and vertical position) or ``polar'' coordinates (e.g., radial
position and angular position\(emalso known as magnitude and phase).
With rectangular coordinates we find that both the
horizontal position and the vertical position are varying sinusoidally,
but the maximum vertical displacement occurs one quarter cycle later than
the maximum horizontal displacement.  With polar coordinates we simply
have a linearly increasing angular position and a constant radius.
Clearly, the latter description is simpler.
.PP
The situation within the phase vocoder is very much analogous.
The two heterodyned signals can be viewed as the horizontal and
vertical signals of the rectangular representation,
whereas the desired representation is in terms
of a time-varying amplitude (i.e., radius) and a time-varying
frequency (i.e., rate of angular rotation).  Happily, the translation
between the two different representations is easily accomplished.
As shown in Figure 4, the amplitude at each point in time is simply the
square root of the sum of the squares of the two rectangular coordinates.
The frequency cannot be calculated directly, but it can be very well
approximated by taking the difference in successive values of angular
position and then dividing by the time between these successive values.  
To see this, we can note that the difference between two successive
values of angular position is some fraction of an entire cycle (i.e.,
a complete revolution), and that ``frequency'' is simply the number of
cycles which occur during some unit time interval.  As a result,
we need only worry about how to calculate the angular position.
.PP
Figure 4 also gives a formula for the angular position, but (even when
the signs of both coordinates are taken into account) it
produces answers only in the range of 0 to 360 degrees.  Thus, if
we examine successive values of angular position, we may find a
sequence such as 180, 225, 270, 315, 0, 45, 90.  This suggests
that the instantaneous frequency (i.e., rate of angular rotation)
is given by the sequence: (225 \(mi 180)/T = 45/T, (270 \(mi 225)/T = 45/T,
(315 \(mi 270)/T = 45/T, (0 \(mi 315)/T = \(mi315/T, (45 \(mi 0)/T = 45/T,
(90 \(mi 45)/T = 45/T, where T is the time between successive values.
But the \(mi315/T element is clearly not quite right. 
.PP
What has actually happened is that we have gone through more than
a single cycle.  Therefore, if we want our frequency calculation
to work properly, we should really write the sequence as 180, 225,
270, 315, 360, 405, 450.  Now the result of the frequency calculation
is, as it should be, a sequence of (45/T)'s.  This process of adding in 360
degrees whenever a full cycle has been completed is known as 
\fIphase unwrapping\fR (see Figure 5).  It is the final necessary step
in the sequence of operations which makes the phase vocoder work.
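.PP
A minimal Python sketch of phase unwrapping, applied to the sequence
used above, follows.  It assumes that the true angular position never
moves by half a cycle or more between successive values, so that the
smallest possible step is always the right one; the function name
unwrap_degrees is hypothetical.
.DS L
```python
def unwrap_degrees(wrapped):
    # Rebuild a continuous phase track from values confined to one cycle.
    # Assumes each true step is smaller than 180 degrees in magnitude.
    unwrapped = [wrapped[0]]
    for value in wrapped[1:]:
        step = (value - unwrapped[-1] + 180.0) % 360.0 - 180.0
        unwrapped.append(unwrapped[-1] + step)
    return unwrapped

phases = unwrap_degrees([180, 225, 270, 315, 0, 45, 90])
steps = [b - a for a, b in zip(phases, phases[1:])]
print(phases)
print(steps)
```
.DE
.PP
Dividing each entry of steps by T then gives the rate of angular rotation.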
.KF
.sp
.sp
.EQ
delim $$
.EN
.IS
...libfile arrow
...width 4.
main {
    put arrow {
	head = .25;
	tl = (0,4);
	hd = tl+10;
    } ;
    conn (0,4) to (0,7);
    conn (-.2,4) to (0,4);
    conn (-.2,5) to (0,5);
    conn (-.2,6) to (0,6);
    conn (0,4) to (3,5);
    conn (3,4) to (6,5);
    conn (6,4) to (9,5);
    put arrow {
	head = .25;
	tl = (0,0);
	hd = tl+10;
    } ;
    conn 0 to (0,3);
    conn (-.2,0) to (0,0);
    conn (-.2,1) to (0,1);
    conn (-.2,2) to (0,2);
    conn (0,0) to (9,3);
    left " time" at (10,4);
    left " time" at (10,0);
    right "0 " at (-.2,3.9);
    right "0 " at (-.2,-.1);
    right "360 " at (-.2,4.9);
    right "360 " at (-.2,.9);
    right "720 " at (-.2,5.9);
    right "720 " at (-.2,1.9);
    "$theta$ (degrees)" at (0,7.15);
    "$theta$ (degrees)" at (0,3.15);
}
.IE
.sp
.ce
Figure 5.  Phase Unwrapping
.sp
.sp
.KE
.PP
Thus, the internal operation of a single phase vocoder bandpass filter
consists of (1) heterodyning the input with both a sine wave and a cosine
wave in parallel, (2) lowpass filtering each result, (3) converting the
two parallel lowpass filtered signals to radius and angular-position
signals, (4) unwrapping the angular-position values, and (5) subtracting
successive unwrapped angular-position values and dividing by the time to
obtain a rate-of-angular-rotation signal.  But it should be noted that
this rate-of-rotation signal (i.e., the instantaneous frequency) actually
refers only to the difference frequency between the heterodyning sinusoid
(i.e., the filter center frequency) and the input signal.  Therefore the
final step is simply to add the filter center frequency back in.
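.PP
The whole sequence for one filter band can be sketched in a few lines of
Python.  This is an illustration under stated assumptions rather than a
reference implementation: the lowpass filter is a Hann-shaped weighted
average, its length (400 samples) and the spacing between successive
outputs (50 samples) are arbitrary choices, and the minus sign on the
sine path is a convention chosen so that the phase advances when the
input lies above the center frequency.
.DS L
```python
import math

def analyze_channel(signal, sample_rate, center_hz, window_len, hop):
    # Hann-shaped lowpass taps (an assumed filter choice).
    taps = [0.5 - 0.5 * math.cos(2.0 * math.pi * (k + 0.5) / window_len)
            for k in range(window_len)]
    gain = sum(taps)
    omega = 2.0 * math.pi * center_hz / sample_rate
    amplitudes, frequencies = [], []
    previous_phase = None
    for start in range(0, len(signal) - window_len + 1, hop):
        re = im = 0.0
        for k in range(window_len):
            n = start + k
            # Steps 1 and 2: heterodyne with cosine and sine, then lowpass.
            re += taps[k] * signal[n] * math.cos(omega * n)
            im -= taps[k] * signal[n] * math.sin(omega * n)
        re /= gain
        im /= gain
        # Step 3: rectangular to polar; the factor 2 restores the amplitude.
        amplitudes.append(2.0 * math.hypot(re, im))
        phase = math.atan2(im, re)
        if previous_phase is not None:
            # Steps 4 and 5: unwrapped phase step, converted to Hz,
            # with the filter center frequency added back in.
            step = (phase - previous_phase + math.pi) % (2.0 * math.pi) - math.pi
            frequencies.append(center_hz + step * sample_rate / (2.0 * math.pi * hop))
        previous_phase = phase
    return amplitudes, frequencies

sample_rate = 10000
tone = [0.8 * math.sin(2.0 * math.pi * 101.0 * n / sample_rate) for n in range(5000)]
amps, freqs = analyze_channel(tone, sample_rate, 100.0, 400, 50)
print(round(amps[10], 2), round(freqs[10], 1))
```
.DE
.PP
For the 101 Hz test tone, the amplitude track should stay close to 0.8 and
the frequency track close to 101 Hz, even though the band is centered at
100 Hz.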
.SH
The Fourier Transform Interpretation
.PP
A complementary (and equally correct) view of the phase-vocoder analysis
is that it consists of a succession of overlapping Fourier transforms
taken over finite-duration windows in time.  It is interesting to compare
this perspective to that of the Filter Bank interpretation.
In the latter, the emphasis is on the temporal succession
of magnitude and phase values in a single filter band.  In contrast, the
Fourier Transform interpretation focuses attention on the magnitude and phase
values for all of the different filter bands or \fIfrequency bins\fR
at a single point in time (see Figure 6).
.KF
.sp
.sp
.IS
...libfile arc
...libfile arrow
...libfile circle
...width 4.
row {
    var st;
    put circle {
	center = st;
	radius = .05;
    } ;
    put circle {
	center = st+2;
	radius = .05;
    } ;
    put circle {
	center = st+4;
	radius = .05;
    } ;
    put circle {
	center = st+6;
	radius = .05;
    } ;
}
main {
    put arrow {
	head = .25;
	tl = 0;
	hd = tl+10;
    } ;
    put arrow {
	head = .25;
	tl = 0;
	hd = tl+(0,10);
    } ;
    put row {
	st = (2,8);
    } ;
    put row {
	st = (2,6);
    } ;
    put row {
	st = (2,4);
    } ;
    put row {
	st = (2,2);
    } ;
    conn (3.8,1) to (3.8,9);
    conn (4.2,1) to (4.2,9);
    put arc {
	center = (4,1);
	radius = .2;
	startang = 180;
	endang = 0;
    } ;
    put arc {
	center = (4,9);
	radius = .2;
	startang = 0;
	endang = 180;
    } ;
    conn (1,6.2) to (9,6.2);
    conn (1,5.8) to (9,5.8);
    put arc {
	center = (1,6);
	radius = .2;
	startang = 90;
	endang = 270;
    } ;
    put arc {
	center = (9,6);
	radius = .2;
	startang = 270;
	endang = 90;
    } ;
    left " time" at (10,0);
    "frequency" at (0,10.1);
    "Fourier view" at (4,.4);
    left " Filter view" at (9.5,5.9);
}
.IE
.sp
.ce
Figure 6.  Filter Bank Interpretation vs. Fourier Transform Interpretation
.sp
.sp
.KE
.PP
These two differing views of the phase-vocoder analysis suggest two equally
divergent interpretations of the resynthesis.  In the Filter Bank
interpretation (as noted above), the resynthesis can be viewed as a
classic example of additive synthesis with time-varying amplitude
and frequency controls for each oscillator.  In the
Fourier view, the synthesis is accomplished by converting back to
real-and-imaginary form and overlap-adding the successive inverse
Fourier transforms.  This is a first indication that the phase vocoder
representation may actually be more generally applicable than would
be expected of an additive-synthesis technique.
.PP
In the Fourier interpretation, the number of filter bands in the phase vocoder
is simply the number of points in the Fourier transform.   Similarly, the equal
spacing in frequency of the individual filters can be recognized as a
fundamental feature of the Fourier transform.  On the other hand, the shape of
the filter passbands (e.g., the steepness of the cutoff at the band edges)
is determined by the shape of the window function which is applied
prior to calculating the transform.  For a particular characteristic
shape (e.g., a Hamming window), the steepness of the filter cutoff
increases in direct proportion to the duration of the window.  Thus, again,
we see the fundamental tradeoff between rapid time response and narrow
frequency response.
.PP
It is important to understand that the two different interpretations of
the phase vocoder analysis apply only to the implementation of the bank
of bandpass filters.  The operation (described in the previous section)
by which the outputs of these filters are expressed as time-varying
amplitudes and frequencies is the same for each.  However, a particular
advantage of the Fourier interpretation is that it leads to the
implementation of the filter bank via the more efficient Fast Fourier Transform
(FFT) technique.  The FFT produces an output value for each of N filters
with (on the order of) N log\s-2\d2\s0\uN multiplications, while the direct
implementation of the filter bank requires N\s-2\u2\s0\d multiplications.
Thus, the Fourier interpretation can lead to a substantial increase in
computational efficiency when the number of filters is large (e.g., N = 1024).
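.PP
The size of this saving is easy to tabulate.  The Python sketch below
simply evaluates the two order-of-magnitude estimates given above; the
function name is hypothetical, and constant factors that depend on the
particular FFT implementation are ignored.
.DS L
```python
import math

def multiplication_counts(n_filters):
    # Order-of-magnitude estimates only: N log2 N for the FFT,
    # N squared for the direct filter bank.
    fft_cost = n_filters * math.log2(n_filters)
    direct_cost = n_filters ** 2
    return fft_cost, direct_cost

fft_cost, direct_cost = multiplication_counts(1024)
print(fft_cost, direct_cost, direct_cost / fft_cost)
```
.DE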
.PP
The Fourier interpretation has also been the key to much of the recent
progress in phase-vocoder-like techniques.  Mathematically, these techniques
are described as Short-Time Fourier-Transform techniques [Rabiner &
Schafer, 1978; Crochiere, 1980; Portnoff, 1980; Portnoff, 1981a,b; Griffin
& Lim, 1984].  Such algorithms may also be referred to as Multirate
Digital Signal Processing techniques (for reasons which will be made clear
below) [Crochiere & Rabiner, 1983].
.SH
Sample-Rate Considerations
.PP
The input and output signals to and from the phase vocoder are always
assumed to be digital signals with a sampling rate of at least twice
the highest frequency in the associated analog signal (e.g., a speech
signal with a highest frequency of 5 kHz might be digitized, at
least in principle, at 10 kHz
and fed into the phase vocoder).  However, the sample rates within the
individual filter bands of the phase vocoder do not need to be nearly
so high.  This is most easily understood via the Filter Bank interpretation.
.PP
Within any given filter band, the result of the heterodyning and lowpass
filtering operation is a signal whose highest frequency is equal to the
cutoff frequency of the lowpass filter.  For instance, in the above
example, the lowpass filter may only pass frequencies up to 50 Hz.  Thus,
although the input to the filter was a speech signal sampled at 10 kHz,
the output of the filter can be sampled (at least in the ideal case) at
as little as 100 Hz without any aliasing error.  This is true for each
of the bandpass filters, because each filter operates by heterodyning a
certain frequency region down to the 0 \(mi 50 Hz region.
.PP
In practice, the lowpass filter can never have an infinitely steep cutoff.
Therefore to really avoid aliasing error, it is advisable to sample the
output of the filter at four times the cutoff frequency (e.g., 200 Hz) as
opposed to two.  Still, this represents an enormous savings in computation
(e.g., the filter output is calculated 200 times per second instead of
10,000 times).  A detail worth noting here is that this savings is only
possible because the filter is a finite impulse
response (FIR) filter (i.e., the present output is
calculated entirely on the basis of present and past inputs, never past outputs).
.PP
If we now seek to resynthesize the original input from the
phase vocoder analysis signals, we face a minor problem.  The analysis
signals (which in the Filter Bank interpretation are thought of as
providing the instantaneous amplitude and frequency values for a bank
of sine-wave oscillators) are no longer at the same sample rate as the
desired output signal.  Thus, an additional interpolation operation is
required to convert the analysis signals back up to the original sample
rate.  Even so, this is a lot more computationally efficient
than avoiding the sample-rate reduction in the first place.
.PP
In the Fourier Transform interpretation the details of these multiple
sample rates within the phase vocoder are less apparent.  In the above
example, where the internal sample rate is only 2% (200/10000) of the
external sample rate, we simply skip 10000/200 = 50 samples between
successive FFT's.  As a result, the FFT values are computed only
10000/50 = 200 times per second.  In this interpretation, the interpolation
operation is automatically incorporated in the overlap-addition of the
inverse FFT's.
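.PP
The arithmetic of this example can be collected in a short Python sketch.
The function name decimation_plan and its oversampling parameter (4 for
the practical case described above, 2 for the ideal case) are assumptions
introduced for the illustration.
.DS L
```python
def decimation_plan(sample_rate_hz, lowpass_cutoff_hz, oversampling=4):
    # Internal rate is a multiple of the lowpass cutoff frequency.
    internal_rate_hz = oversampling * lowpass_cutoff_hz
    # Number of input samples skipped between successive FFTs.
    hop = sample_rate_hz // internal_rate_hz
    frames_per_second = sample_rate_hz / hop
    return internal_rate_hz, hop, frames_per_second

print(decimation_plan(10000, 50))
print(decimation_plan(10000, 50, oversampling=2))
```
.DE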
.PP
Lastly, it should be noted that we have so far considered the bandwidth
of the output of the lowpass filter without any mention of the conversion
from rectangular to polar coordinates.  This conversion involves highly
nonlinear operations which (at least in principle) can significantly
increase the bandwidth of the signals to which they are applied.  Fortunately,
this effect is usually small enough in practice that it can generally be
ignored.
.SH
Applications
.PP
The basic goal of the phase vocoder is to separate (as much as possible)
temporal information from spectral information.  The operative strategy
is to divide the signal into a number of spectral bands, and to characterize
the time-varying signal in each band.  This strategy succeeds to the
extent that this bandpass signal is itself slowly varying.  It fails
when there is more than a single partial in a given band,
or when the time-varying amplitude or frequency
of the bandpass signal changes too rapidly.  ``Too rapidly'' means
that the amplitude and frequency are not relatively constant over the
duration of the FFT.  This is equivalent to saying that the amplitude or
frequency changes considerably over durations which are small compared
to the inverse of the lowpass filter bandwidth.
.PP
To the extent that the phase vocoder does succeed in separating temporal
and spectral information, it provides the basis for an impressive array of
musical applications.  Historically, the first of these to be explored was
that of analyzing instrumental tones to determine the time-varying amplitudes
and frequencies of individual partials. This application was pioneered
by Moorer and Grey at Stanford in the mid-1970s in a landmark series of
investigations of the perception of timbre [Grey & Moorer, 1977; Grey,
1977; Grey & Gordon, 1978; Moorer, 1978].  (The
``heterodyne filter'' technique developed by Moorer is essentially
a special case of the phase vocoder.)
.PP
More recently, interest in the phase vocoder has focused more on its
ability to modify and transform recorded sound materials in musically
useful ways.  The possibilities in this realm are myriad.
However, two basic operations stand out as particularly significant.
These operations are \fItime scaling\fR and \fIpitch transposition\fR.
.SH
Time Scaling
.PP
It is always possible to slow down a recorded sound simply by playing
it back at a lower sample rate; this is analogous to playing a
tape recording at a lower playback speed.  But this kind of simplistic
time expansion simultaneously lowers the pitch by the same factor as
the time expansion.  Slowing down the temporal evolution of a sound
without altering its pitch requires an explicit separation of temporal
and spectral information.  As noted above, this is precisely what the
phase vocoder attempts to do.
.PP
To understand the use of the phase vocoder for time scaling, it is
helpful to once again consider the two basic interpretations described
above.  In the Filter Bank interpretation, the operation
is simplicity itself.  The time-varying amplitude and frequency
signals for each oscillator are control signals which (hopefully)
carry only temporal information.  Stretching out these control
signals (via interpolation) does not change the frequency of the
individual oscillators at all, but it does slow down the temporal
evolution of the composite sound.  The result is a time-expanded
sound with the original pitch.
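.PP
A toy Python sketch of this view follows.  It stretches the control
signals of a single oscillator by linear interpolation and then runs the
oscillator.  For brevity the control signals are assumed to be given one
value per output sample, and the four control values are invented test
data; in practice they would come from the analysis at the lower internal
sample rate.
.DS L
```python
import math

def stretch_control(values, factor):
    # Linear interpolation: about `factor` output values per input value.
    # Requires at least two input values.
    n_out = int((len(values) - 1) * factor) + 1
    out = []
    for n in range(n_out):
        position = n / factor
        i = min(int(position), len(values) - 2)
        frac = position - i
        out.append((1.0 - frac) * values[i] + frac * values[i + 1])
    return out

def oscillator(amplitudes, frequencies_hz, sample_rate):
    # One sine-wave oscillator with per-sample amplitude and frequency.
    phase = 0.0
    out = []
    for amp, freq in zip(amplitudes, frequencies_hz):
        out.append(amp * math.sin(phase))
        phase += 2.0 * math.pi * freq / sample_rate
    return out

amp_control = [0.0, 1.0, 0.5, 0.0]            # invented test data
freq_control = [440.0, 442.0, 441.0, 440.0]   # invented test data
slow_amp = stretch_control(amp_control, 2.0)
slow_freq = stretch_control(freq_control, 2.0)
stretched_tone = oscillator(slow_amp, slow_freq, 10000)
print(len(amp_control), len(slow_amp), min(slow_freq), max(slow_freq))
```
.DE
.PP
The stretched controls take longer to run their course, but the
frequency values they pass through stay within the original range.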
.PP
The Fourier transform view of time scaling is more complicated, but
it is no less instructive.  The basic idea is that in order to time-expand
a sound, the inverse FFT's can simply be spaced further apart than
the analysis FFT's.  As a result, spectral changes occur more slowly
in the synthesized sound than in the original.  But this overlooks
the details of the magnitude and phase signals in the middle.
.PP
Consider a single bin within the FFT for which successive phase values
are incremented by 45 degrees.  This implies that the signal within
that filter band is increasing in phase at a rate of 1/8 cycle (i.e.,
45 degrees) per time interval, where the time interval in question
is the time between successive FFT's.  Spacing the inverse FFT's
further apart means that the 45 degree increase now occurs over a
longer time interval.  Hence, the frequency of the signal has been
inadvertently altered.  The solution is to rescale the phase by
precisely the same factor by which the sound is being time-expanded.
This ensures that the signal in any given filter band has the same
frequency variation in the resynthesis as in the original (though
it occurs more slowly).  
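.PP
The arithmetic of this example can be checked with a short Python
sketch.  The sample rate and the FFT spacings below are arbitrary
values chosen for illustration, and the 45 degree increment is treated
as the total (unwrapped) phase advance of the bin; a practical
implementation would need to unwrap the measured phases before
rescaling them.
.DS L
```python
import math

def bin_frequency(phase_increment, hop, sample_rate):
    """Frequency in Hz implied by a phase advance (radians) over `hop` samples."""
    return phase_increment / (2 * math.pi) * sample_rate / hop

sample_rate = 8000        # assumed for illustration
analysis_hop = 100        # samples between successive analysis FFTs
expansion = 2.0           # time-expansion factor
synthesis_hop = int(analysis_hop * expansion)

increment = math.radians(45)   # 1/8 cycle per analysis interval

# Frequency of the signal in this bin as analyzed.
original = bin_frequency(increment, analysis_hop, sample_rate)
# Same increment spread over the longer synthesis interval: frequency drops.
unscaled = bin_frequency(increment, synthesis_hop, sample_rate)
# Increment multiplied by the expansion factor: frequency is restored.
rescaled = bin_frequency(increment * expansion, synthesis_hop, sample_rate)
```
.DE
.PP
With these values the unscaled frequency is half the original, while
the rescaled frequency equals it.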
.PP
The reason that the problem of rescaling the phase does not appear in
the Filter Bank interpretation is that the interpolation there
is assumed to be performed on the frequency control signal
as opposed to the phase.  This is perfectly correct conceptually,
but the actual implementation generally conforms more closely to
the Fourier interpretation.  Also, by emphasizing that the time
expansion amounts to spacing out successive ``snapshots'' of the
evolving spectrum, the Fourier view makes it easier to understand
how the phase vocoder can perform equally well with non-harmonic
material.
.PP
Of course, the phase vocoder is not the only technique which can be
employed for this kind of time scaling.  Indeed, from the standpoint
of computational efficiency, it is probably the very least attractive.
But from the standpoint of fidelity (i.e., the relative absence of
objectionable artifacts), it is decidedly the most desirable.
.SH
Pitch Transposition
.PP
Since the phase vocoder can be used to change the temporal evolution
of a sound without changing its pitch, it should also be possible to
do the reverse (i.e., change the pitch without changing the duration).
In fact, this operation is trivially accomplished.  The trick is
simply to time scale by the desired pitch-change factor, and then
to play the resulting sound back at the wrong sample rate.  For
example, to raise the pitch by an octave, the sound is first
time-expanded by a factor of two, and the time-expansion is then played at
twice the original sample rate.  This shrinks the sound back to its
original duration while simultaneously doubling all frequencies.
In practice, however, there are also some additional concerns.
.PP
First, instead of changing the clock rate on the playback
digital-to-analog converters, it is more convenient to simply
do a sample-rate conversion on the time-scaled sound via software.  Thus,
in the above example, we would simply designate a higher
sample rate for the time-expanded sound, and then sample-rate
convert it down by a factor of two so that it could be played
at the normal sample rate.  It is possible to embed this sample-rate
conversion within the phase vocoder itself, but this proves to be
of only marginal utility and will not be further discussed.
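.PP
The sample-rate conversion step can be illustrated with the following
Python sketch.  A plain sine wave stands in for the time-expanded
output of the phase vocoder, and the conversion is done by linear
interpolation with no anti-aliasing filter, which a real converter
would require.  The function names are hypothetical.
.DS L
```python
import math

def resample(samples, ratio):
    """Read through `samples` `ratio` times faster using linear interpolation.
    No anti-aliasing filter is applied; a real converter would need one."""
    out = []
    pos = 0.0
    while pos <= len(samples) - 1:
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
        pos += ratio
    return out

def estimate_frequency(samples, sample_rate):
    """Rough frequency estimate from the number of upward zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(samples)

sample_rate = 8000
# Stand-in for the phase vocoder output: a 440 Hz tone that has already
# been time-expanded from 1 second to 2 seconds.
expanded = [math.sin(2 * math.pi * 440 * n / sample_rate + 0.1)
            for n in range(2 * sample_rate)]

# Convert down by a factor of two and play at the normal sample rate.
transposed = resample(expanded, 2.0)
```
.DE
.PP
The converted sound is back to one second in length, and its estimated
frequency is close to 880 Hz, an octave above the original.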
.PP
Second, upon closer examination it can be seen that only time-scale
factors which are ratios of integers are actually allowed.  This is
clearest in the Fourier view because the expansion factor is simply
the ratio of the number of samples between successive synthesis FFT's
to the number of samples between successive analysis FFT's.
However, it is equally true of the Filter Bank interpretation
because it turns out that the control signals can only be interpolated
by factors which are ratios of two integers.  Of course,
this has little significance for time scaling because,
while it may be impossible to find two suitable integers with precisely
the desired ratio, the error is perceptually negligible.  However,
when time scaling is performed as a prelude to pitch transposition,
the perceptual consequences of such errors are greatly magnified
(by virtue of the ear's sensitivity to small pitch differences),
and considerable care may be required in the selection of two appropriate
integers.
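.PP
One way to search for a suitable pair of integers is sketched below in
Python, using the standard \fIfractions\fR module.  The target (an
equal-tempered semitone) and the limits on the FFT spacing are
assumptions chosen for illustration; the numerator is read as the
synthesis spacing and the denominator as the analysis spacing, and only
the denominator is bounded.
.DS L
```python
from fractions import Fraction
import math

def best_hop_ratio(target, max_hop):
    """Closest ratio of integers to `target` with denominator <= max_hop."""
    return Fraction(target).limit_denominator(max_hop)

def error_in_cents(ratio, target):
    """Pitch error of `ratio` relative to `target`, in cents."""
    return 1200 * math.log2(float(ratio) / target)

semitone = 2 ** (1 / 12)                  # desired pitch-change factor

coarse = best_hop_ratio(semitone, 16)     # small spacings only
fine = best_hop_ratio(semitone, 256)      # larger spacings allowed

coarse_error = error_in_cents(coarse, semitone)
fine_error = error_in_cents(fine, semitone)
```
.DE
.PP
With spacings limited to 16 samples, the nearest ratio is 17/16, which
is roughly 5 cents sharp.  Allowing larger integers reduces the error
substantially.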
.PP
An additional complication arises when modifying the pitch of
speech signals because the transposition process changes not only
the pitch, but also the frequency of the vocal tract resonances (i.e.,
the formants).  For shifts of an octave or more, this considerably
reduces the intelligibility of the speech.  (This same phenomenon occurs
in the pitch transposition of non-speech sounds as well, but for these
sounds intelligibility is not an issue.)  To correct for this,
an additional operation may be inserted into the phase vocoder
algorithm as shown in Figure 7.  For each FFT, this additional
operation determines the spectral envelope (i.e., the shape traced
out by the peaks of the harmonics as a function of frequency), and
then distorts this envelope in such a way that the subsequent
sample-rate conversion brings it back precisely to its original
shape.
.KF
.sp
.sp
.IS
...libfile arrow
...width 4.
axis {
    var orig;
    put arrow {
	head = .25;
	tl = orig;
	hd = tl+7.5;
    } ;
    put arrow {
	head = .25;
	tl = orig;
	hd = tl+(0,2);
    } ;
    left " frequency" at orig+7.5;
}
main {
    put axis {
	orig = (0,0);
    } ;
    put axis {
	orig = (0,4);
    } ;
    put axis {
	orig = (0,8);
    } ;
    conn (.25,8) to (.25,9);
    conn (.5,8) to (.5,9.25);
    conn (.75,8) to (.75,9.5);
    conn (1,8) to (1,9.75);
    conn (1.25,8) to (1.25,9.5);
    conn (1.5,8) to (1.5,9.25);
    conn (1.75,8) to (1.75,9);
    conn (2,8) to (2,8.75);
    conn (2.25,8) to (2.25,8.5);
    conn (2.5,8) to (2.5,8.75);
    conn (2.75,8) to (2.75,9);
    conn (3,8) to (3,9.25);
    conn (3.25,8) to (3.25,9);
    conn (3.5,8) to (3.5,8.75);
    conn (3.75,8) to (3.75,8.5);
    conn (4,8) to (4,8.25);
    conn (.5,4) to (.5,5);
    conn (1,4) to (1,5.25);
    conn (1.5,4) to (1.5,5.5);
    conn (2,4) to (2,5.75);
    conn (2.5,4) to (2.5,5.5);
    conn (3,4) to (3,5.25);
    conn (3.5,4) to (3.5,5);
    conn (4,4) to (4,4.75);
    conn (4.5,4) to (4.5,4.5);
    conn (5,4) to (5,4.75);
    conn (5.5,4) to (5.5,5);
    conn (6,4) to (6,5.25);
    conn (6.5,4) to (6.5,5);
    conn (7,4) to (7,4.75);
    conn (.5,0) to (.5,1.25);
    conn (1,0) to (1,1.75);
    conn (1.5,0) to (1.5,1.25);
    conn (2,0) to (2,.75);
    conn (2.5,0) to (2.5,.75);
    conn (3,0) to (3,1.25);
    conn (3.5,0) to (3.5,.75);
    conn (4,0) to (4,.25);
    left "Original Spectrum" at (5,9.75);
    left "Transposed Spectrum" at (5,5.75);
    left "Transposed Spectrum with" at (5,1.75);
    left "Original Spectral Envelope" at (4.95,1.45);
}
.IE
.sp
.ce
Figure 7.  Spectral Envelope Correction
.sp
.sp
.KE
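.PP
The effect shown in Figure 7 can be imitated with a small Python
sketch.  It makes several simplifying assumptions: the spectrum is
given as a list of (frequency, amplitude) pairs for the harmonic peaks,
the envelope is a straight-line interpolation through those peaks, and
the correction is applied directly to the transposed harmonics rather
than by predistorting the envelope before sample-rate conversion (the
net result is intended to be the same).  All names and values are
hypothetical.
.DS L
```python
def envelope_at(freq, peaks):
    """Linearly interpolate the spectral envelope through (freq, amp) peaks.
    Frequencies outside the measured range take the nearest end value."""
    if freq <= peaks[0][0]:
        return peaks[0][1]
    for (f0, a0), (f1, a1) in zip(peaks, peaks[1:]):
        if freq <= f1:
            return a0 + (a1 - a0) * (freq - f0) / (f1 - f0)
    return peaks[-1][1]

def transpose(peaks, factor):
    """Plain transposition: each harmonic moves and keeps its amplitude,
    so the envelope is stretched along with the harmonics."""
    return [(f * factor, a) for f, a in peaks]

def transpose_with_correction(peaks, factor):
    """Move the harmonics, but read their amplitudes off the original envelope."""
    return [(f * factor, envelope_at(f * factor, peaks)) for f, _ in peaks]

# Harmonics of 100 Hz with an envelope peak (a formant) at 300 Hz.
peaks = [(100.0, 0.4), (200.0, 0.6), (300.0, 1.0),
         (400.0, 0.6), (500.0, 0.4), (600.0, 0.2)]

plain = transpose(peaks, 2.0)                      # formant moves to 600 Hz
corrected = transpose_with_correction(peaks, 2.0)  # formant stays near 300 Hz
```
.DE
.PP
In the plain transposition the strongest harmonic lands at 600 Hz,
whereas in the corrected version the harmonic at 600 Hz has the small
amplitude that the original envelope had at that frequency.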
.SH
Summary
.PP 
The above descriptions address only the most elementary
possibilities of the phase vocoder technique.  In addition
to simple time scaling and pitch transposition, it is also possible
to perform time-varying time scaling and pitch transposition,
time-varying filtering (e.g., cross synthesis), and nonlinear
filtering (e.g., noise reduction), all with very high fidelity.
A detailed description of these applications, and of their musical
implications, is the subject of a separate paper.
.SH
References
.LP
.de BB
.sp .05i
.ti -5n
..
.sp .1i
.in 5n
.BB
Crochiere, R. E. (1980).  A weighted overlap-add method of short-time Fourier analysis/synthesis.  \fIIEEE Transactions on Acoustics, Speech, and Signal Processing\fR, \fIASSP-28\fR(1), 99-102.
.BB
Crochiere, R. E. & Rabiner, L. R. (1983).  \fIMultirate Digital Signal Processing\fR. Englewood Cliffs, NJ: Prentice-Hall.
.BB
Flanagan, J. L. & Golden, R. M. (1966).  Phase vocoder. \fIBell System Technical Journal\fR, \fI45\fR, 1493-1509.
.BB
Grey, J. M., & Moorer, J. A.  (1977).  Perceptual evaluations of synthesized musical instrument tones.  \fIJournal of the Acoustical Society of America\fR, \fI62\fR, 454-462.
.BB
Grey, J. M. (1977).  Multidimensional perceptual scaling of musical timbres. \fIJournal of the Acoustical Society of America\fR, \fI61\fR, 1270-1277.
.BB
Grey, J. M., & Gordon, J. W.  (1978).  Perceptual effects of spectral modifications on musical timbres.  \fIJournal of the Acoustical Society of America\fR, \fI63\fR, 1493-1500.
.BB
Griffin, D. W. & Lim, J. S. (1984).  Signal estimation from modified short-time Fourier transform.  \fIIEEE Transactions on Acoustics, Speech, and Signal Processing\fR, \fIASSP-32\fR(2), 236-243.
.BB
Moorer, J. A. (1978).  The use of the phase vocoder in computer music applications. \fIJournal of the Audio Engineering Society\fR, \fI26\fR(1/2), 42-45.
.BB
Portnoff, M. R. (1980).  Time-frequency representation of digital signals and systems based on short-time Fourier analysis. \fIIEEE Transactions on Acoustics, Speech, and Signal Processing\fR, \fIASSP-28\fR(1), 55-69.
.BB
Portnoff, M. R. (1981a).  Short-time Fourier analysis of sampled speech. \fIIEEE Transactions on Acoustics, Speech, and Signal Processing\fR, \fIASSP-29\fR(3), 364-373.
.BB
Portnoff, M. R. (1981b).  Time-scale modification of speech based on short-time Fourier analysis.  \fIIEEE Transactions on Acoustics, Speech, and Signal Processing\fR, \fIASSP-29\fR(3), 374-390.
