# Unit 1: The Auditory Stimulus

# Synopsis

Our bodies contain complex organs that are sensitive to rapid variations in the air pressure around us. These pressure variations, along with the different representations of them, are what we might call the *auditory stimulus*, that is, the source of stimulation for the auditory sensors our bodies are equipped with. The auditory stimulus is the aspect of the complex phenomenon of sound that we all share, whether we hear it or not. It is the physical component of sound that travels from sources to receivers. It can be captured with microphones, recorded using analog and digital technologies, and played back with amplifiers and speakers. It is what we represent when we read and write music notation. Because the auditory stimulus consists of fluctuations in the pressure of a medium over time, it cannot be observed directly; that is to say, the study of the auditory stimulus is really the study of the persistent physical traces it leaves behind as it interacts with objects sensitive enough to be displaced by it.

# What Do We Hear?

Sound is all around us, invisible, yet ever present in our daily lives, but what exactly is it that we hear? In this unit, we focus on the physical phenomena that affect our perception of sound, and different ways we represent those phenomena.

When I make a sound, say I clap my hands together, I’ve produced a
disturbance in the air that eventually travels to your ears,
resulting in an effect that we ultimately describe as sound.
This disturbance travels in every direction, reflecting
differently off of every surface, the walls, the floor, the ceiling, bodies,
furniture, etc. depending on their materials. This is all caused by
air molecules bumping into each other; as they move away from the
source of the sound, they create a region of higher pressure
by *compressing* the space between themselves and their neighbors
in the direction they are traveling.
At the same time, the area in the opposite direction becomes a
region of lower pressure, or a *rarefaction*. Our ears are particularly
sensitive to these very fast and very small changes in
pressure—you can think of our ears as extremely sensitive barometers.

When those changes in air pressure propagate into our ears, they
transfer their energy through a complex array of anatomy, ultimately
resulting in electrical impulses that travel up the nerve fibers
from our inner ear to our brain, where we cognitively process
this information as “sound”. The transfer of energy from one
medium to another is called *transduction*, and the different parts
of our anatomy through which these pressure waves are *transduced*
as they make their way to our brain will be discussed in
unit 2.
For now, we will focus on the physical characteristics of
sound waves, and the ways we represent them.

# Recording and Representing Sound

(Figure 1.1: time-domain waveform)

The image in figure 1.1 may be a familiar
one—we might say we see a sound there, but it is more accurate
to call it a *waveform*, as sound is something we *perceive*
in time, rather than a static object we can look at as in this image.
To produce a visual representation of a sound,
we usually transduce the sound mechanically, such that its movements
produce a visible trace on some medium. In the earliest recording devices,
such as Thomas Edison’s
phonograph cylinders,
a needle, moved
by sound vibrations, traced grooves in a wax cylinder. By reversing the
process, i.e. spinning the cylinder while touching the needle to it
and transferring the movement of the needle to a diaphragm, so that the
tiny movements of the needle displace a larger amount of air,
the recorded sound could be played back, as shown in the video below:

The playback in such a device relies on analog amplification: a small but heavy needle modulates a larger but thin and light membrane that displaces a larger volume of air. Other recording and playback technologies involve the transduction of air movement into an electrical signal. This is done by using a small, light diaphragm to displace a coiled wire surrounded by a magnet. These small movements of the coil relative to the magnet produce a very small electrical current, the voltage of which varies proportionally to the compressions and rarefactions of the sound.

Speakers work in the opposite way: an electrical signal is sent to a coil surrounded by a magnet, and the modulations in voltage displace the coil and a membrane, or speaker cone, attached to it, which moves the air back and forth.

*Digital* recordings make use of the *analog* electrical signal described
above, but record the value of the voltage *periodically*, by *sampling* it,
commonly at 44100 times per second. This is done using an
*Analog to Digital Converter* or *ADC*, and the process is reversed
by a *Digital to Analog Converter* or *DAC*, which modulates
the voltage of an electrical signal proportional to the sampled values.
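The sampling step can be sketched in a few lines of Python. This is only an illustration, assuming the common 44100 Hz rate mentioned above; the function name is our own:

```python
import math

SAMPLE_RATE = 44100  # samples per second, the common rate mentioned above

def sample_sine(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Record the value of a sine wave periodically, once per sample interval."""
    n_samples = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

# One millisecond of a 440 Hz tone becomes 44 discrete values.
samples = sample_sine(440, 0.001)
print(len(samples))  # 44
```

Each value in `samples` stands in for one voltage measurement taken by an ADC; a DAC would turn such a list back into a continuously varying voltage.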

Analog and digital representations of sound are similar, but have some important
differences. Perhaps most importantly, analog signals are *continuous*,
while digital signals are *discrete*, i.e. they
can be thought of as consisting of individual values
that were recorded by taking samples of an electrical voltage a finite number
of times, at equally spaced intervals in time.

(Image by Janina Luckow)

This process comes with advantages and disadvantages, the main disadvantage being the loss of information between samples. During playback, this lost information is reconstructed during the digital to analog conversion process using a reconstruction filter. The advantage of digital representation, however, is that we may operate on the signal numerically and computationally.

# Quantitative and Visual Descriptions of Sound in the Time Domain

Let’s return to figure 1.1 and discuss its properties in more
detail. The horizontal dimension represents time, and the vertical dimension
represents *amplitude*, a physical property related to the perceptual
experience of *loudness*. This representation of a sound is said to be in
the *time domain*, because it shows us how the signal varies as a function
of time. We should keep in mind that what we experience as *sound*
involves the higher-order cognitive processing of a dynamically changing
three-dimensional signal as it interacts with our environment over time. By
contrast, figure 1.1 is a rather simple, two-dimensional signal
that can more accurately be thought of as the digitally sampled voltage
of a transducer modulated over time by changes in air pressure at a particular
location in space.

Let’s take a closer look at the details of a time-domain waveform.

## Time Domain Representations of Signals

### Frequency

(Image by Janina Luckow)

In figure 1.5, we see a special waveform called a *sine wave*.
A sine wave is special because it is *periodic*, meaning that its shape
repeats over and over again. The amount of time it takes for it to repeat
is called its *period*, and the number of periods that occur in one
unit of time, say a second, is called its *frequency*.
A sine wave is a plot of the position
along the y-axis of a point as it travels counter-clockwise around a
unit circle, starting from the point (1, 0).

(Animation by Janina Luckow)
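The relationship between the circle, the period, and the frequency can be expressed directly in code. A minimal sketch using Python's standard math module (the names are our own):

```python
import math

frequency = 440.0         # trips around the circle (periods) per second
period = 1.0 / frequency  # seconds per trip

def sine(t):
    """y-position of a point moving counter-clockwise around the unit
    circle, starting from (1, 0), at `frequency` revolutions per second."""
    return math.sin(2 * math.pi * frequency * t)

# The waveform repeats itself exactly one period later:
t = 0.000123
print(abs(sine(t) - sine(t + period)) < 1e-9)  # True
```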

As we mentioned above, the physical property *amplitude* is related to the
perceptual experience of *loudness*. Additionally, the *frequency* of this
sine wave is related to the perceptual experience of *pitch*, with
higher frequencies and shorter periods correlating to higher
*sounding* pitches, or notes.

### Phase

The phase is the position in the period relative to some point that we determine to be the beginning. The phase tells us how much of the periodic waveform has elapsed, and consequently, how much will have to occur before the waveform repeats. Since phase is a representation of something that repeats or loops over and over again, we can think of it as a point traveling around a circle, and indeed, phase is usually represented as an angle expressed in degrees ($0–360^{\circ}$) or in radians ($0–2\pi$).
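Because phase loops around after each full cycle, it behaves like clock arithmetic. A tiny sketch (the helper name is our own):

```python
def phase_degrees(elapsed_periods):
    """Phase angle, in degrees, of a point partway through its cycle,
    wrapped into the range 0-360 after each complete revolution."""
    return (elapsed_periods * 360.0) % 360.0

print(phase_degrees(0.25))  # 90.0  -- a quarter of the way through the cycle
print(phase_degrees(1.75))  # 270.0 -- one full cycle plus three quarters
```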

A *cosine wave* looks exactly like a sine wave shifted $1/4$
of a period to the left.

(Image by Janina Luckow)

Where a sine wave was the position along the y-axis, a cosine wave is the position of the same point along the x-axis.

(Animation by Janina Luckow)

Since the phase difference between a sine wave and a cosine wave is $1/4$ of a period, and that corresponds to our point having traveled $1/4$ of the way around the circle, or $90^{\circ}$, we can say that the two have a phase difference of $90^{\circ}$.

#### How Does Phase Affect Sound?

If a sine wave and a cosine wave are identical with the exception of a time shift, they should sound identical, right? Well, they do. That doesn’t mean that phase isn’t important, but its effect on sound is not always intuitive. For a sinusoid by itself, it is impossible to hear any difference, no matter where it started in its cycle. Where you can begin to hear a difference is when you start to combine different sine waves.

When two sinusoids are added together, the relationship of their phase and frequency determines how they sound. When they have the same frequency and phase, the result is a sinusoid with an amplitude equal to the sum of the two. But when their phases differ by half a period ($180^{\circ}$), two sinusoids of equal amplitude cancel each other out and produce a flat waveform of 0 amplitude. Try shifting the phase of one sinusoid against another in the example below.

(Applet by John MacCallum)
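Both cases described above, identical phase and a half-period offset, can be checked numerically. A minimal sketch with our own helper function:

```python
import math

def sinusoid(t, freq_hz, phase_deg=0.0):
    """Value at time t of a sinusoid with the given frequency and phase."""
    return math.sin(2 * math.pi * freq_hz * t + math.radians(phase_deg))

freq = 440.0
times = [n / 44100 for n in range(100)]  # roughly one period of samples

# Same frequency and phase: the amplitudes add (peak approaches 2.0).
in_phase = [sinusoid(t, freq) + sinusoid(t, freq) for t in times]
print(max(abs(v) for v in in_phase))  # close to 2.0

# 180 degrees apart: the two cancel, leaving a flat waveform of 0 amplitude.
out_of_phase = [sinusoid(t, freq) + sinusoid(t, freq, 180.0) for t in times]
print(max(abs(v) for v in out_of_phase))  # effectively 0.0
```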

You might notice a subtle change in pitch when you move the slider quickly—this is because a shift in phase requires the waveform to “jump” forward (or backward) faster (or slower) than it would have normally, and that change in the speed of the evolution of the waveform is a temporary change in frequency.

In the figure below we have a more complex example containing
four sinusoids of different frequencies. The phases can be randomized,
but the frequencies stay the same. Note that although the waveform
changes dramatically when the phases change, the *sound* doesn’t change
much. This is important, because it tells us that while the time-domain
representation of a sound is certainly important, it does not often
tell us much about how a signal will actually *sound*. Below, we will
look at another representation of signals, the *frequency-domain*, but
before that, let’s look at more complex waveforms.


(Applet by John MacCallum)
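The claim that randomizing phases changes the waveform dramatically while barely changing the magnitude spectrum can be checked with NumPy (assumed available). The four frequencies below are our own harmonically related choices, picked so that each lands exactly on a DFT bin:

```python
import numpy as np

rng = np.random.default_rng(0)
sr = 44100
t = np.arange(2205) / sr      # 50 ms; bin spacing is 44100 / 2205 = 20 Hz
freqs = [220, 440, 660, 880]  # all exact multiples of 20 Hz

def mixture(phases):
    """Sum of four sinusoids, one phase per frequency component."""
    return sum(np.sin(2 * np.pi * f * t + p) for f, p in zip(freqs, phases))

original = mixture([0.0] * 4)
randomized = mixture(rng.uniform(0, 2 * np.pi, 4))

# The time-domain waveforms differ dramatically...
print(np.allclose(original, randomized))  # False

# ...but the magnitude spectra are essentially identical.
mag_orig = np.abs(np.fft.rfft(original))
mag_rand = np.abs(np.fft.rfft(randomized))
print(np.max(np.abs(mag_orig - mag_rand)) < 1e-6 * np.max(mag_orig))  # True
```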

### Summary

We have been looking at different *representations* of sounds, and
before moving on, we should be clear how these relate to the physical
phenomena they describe. The time-domain representation of a signal
can be thought of as representing changes in intensity that propagate
through some medium, and the path through different media
from some event in the world to your brain is a complex one:
objects collide to produce fluctuations in air pressure, which enter
your ear and are transferred through bone, fluid, hair cells, and finally
into the electrical signals of your nervous system. The 2-D drawings above
could just as easily represent any of these, just as they could also represent
the voltage as a function of time that moves a speaker cone back and forth
(that produces changes in air pressure), the displacement of that speaker cone,
etc.

# Pure and Complex Sounds

So far, we have been looking at simple, “pure” waveforms, i.e. signals
that can be characterized as having only a single frequency or pitch.
When we add simple sounds together, as we did in figure
1.9, we produce *complex* sounds. The configuration
of the different frequency components relative to one another account for
much of the ways in which we experience *timbre* or the *tone color* of
a sound, that is, the thing that makes a clarinet sound like a clarinet and
a piano sound like a piano (more on this in the unit on timbre).

When pure tones with frequencies that are integer multiples of some
frequency are added together, we say that they are in a *harmonic* relationship;
all other sounds are said to be *inharmonic*.
For example, if you had a tone that played at 440 cycles per second – the
pitch concert A that many orchestras tune to– and added another tone that
played at 880 cycles per second, this would create a harmonic relationship.
If you instead added a tone that played at 457 cycles per second, this would be
inharmonic.
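The integer-multiple test from the example above is easy to write down in code; the helper name is our own:

```python
def is_harmonic(f1_hz, f2_hz, tolerance=1e-9):
    """True if the higher frequency is an integer multiple of the lower."""
    lo, hi = sorted((f1_hz, f2_hz))
    ratio = hi / lo
    return abs(ratio - round(ratio)) < tolerance

print(is_harmonic(440, 880))  # True:  880 = 2 x 440
print(is_harmonic(440, 457))  # False: 457 / 440 is not an integer
```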

In reality, all sounds are complex—there are no truly pure sounds.
The sinusoids we discussed above are only theoretical, and any physical
representation of them contains other frequency components, no matter how small,
that distort their shape to some degree due to nonlinearities in the media
that carry them. Consider air, for example, which is not at a uniform temperature
and humidity everywhere.
Since sound travels at a speed that is dependent upon temperature and
humidity, small changes in the air produce small changes that manifest as
*distortion* or *noise*.

# Frequency Domain Representations of Signals

As we saw in figure 1.9, the time-domain representation of a signal is useful for certain things. For example, if you are editing some audio for a track you are working on in a digital audio workstation like Audacity, Garage Band, or ProTools, it makes sense to be able to “see” the waveform of the sound as it progresses over time.

From the amplitude of the wave, you can see where it gets louder, and depending on where you scroll, you can find a specific point in the track’s timeline. But there is still a lot of information in that complex wave that might be useful.

What if we could take a waveform and suspend it in time, then look at a portion of it to get a better idea of all the component parts that make up the complex wave? In the next section, we look at a mathematical technique called the Fourier transform that allows us to do just that.

## Fourier Transform

We can move from the time-domain into the *frequency-domain* through
a transformation of the signal known as the
*Fourier Transform*, after
Joseph Fourier.
We’re going to take a brief dive into the math behind it, but don’t worry if
it all doesn’t click on your first read. Wrapping your head around what the
Fourier Transform is doing can be difficult, but once you understand it, it
can be used as a very powerful tool.

The Fourier Transform (FT) is a function that takes a time-domain signal as its input and produces a function of frequency as its output: \begin{equation} F(x)=\int_{-\infty}^{\infty}f(t)e^{-i2\pi xt}dt \label{eq:fouriertransform} \end{equation} where $f(t)$ is our time-domain function, and $F(x)$ is our frequency-domain function of the frequency $x$.

You might note those infinity signs above and below the integral symbol—
indeed, from the perspective of the FT, a *periodic* signal can only be
considered periodic if it is periodic forever! Since we can’t wait that long
to compute the FT, we can use the Discrete Fourier Transform (DFT), which is
a computation-friendly version of the continuous FT (sometimes you’ll
hear people refer to the FFT which stands for the Fast Fourier Transform—
this is simply a particular *algorithm* for computing the DFT):

\begin{equation} X(k)=\sum^{N-1}_{n=0}x(n)e^{-i2\pi\frac{kn}{N}} \label{eq:dft} \end{equation}

This may or may not look familiar, but just bear with me. $x(n)$ is our digitally-sampled time-domain signal, and $X(k)$ is the resulting frequency-domain representation of it. $e^{-i2\pi\frac{kn}{N}}$ can be rewritten using the (trigonometric) $\sin$ and $\cos$ functions: \begin{equation} e^{-i2\pi\frac{kn}{N}} = \cos\left(2\pi \frac{kn}{N}\right) - i \sin\left(2\pi \frac{kn}{N}\right) \label{eq:trig} \end{equation} So, when we rewrite the DFT in its less compact form: \begin{equation} X(k)=\sum^{N-1}_{n=0}x(n)\left[ \cos\left(2\pi \frac{kn}{N}\right) - i \sin\left(2\pi \frac{kn}{N}\right) \right] \label{eq:dft-trig} \end{equation} we can see that what the DFT is actually doing is multiplying our original signal by a bunch of sinusoids of different frequencies and summing the results, giving us a signal that tells us effectively “how much” of these frequencies the signal “contains”.
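The DFT sum above translates almost line for line into code. A direct (and deliberately slow) sketch using only the standard library:

```python
import cmath
import math

def dft(x):
    """X(k) = sum over n of x(n) * e^(-i 2 pi k n / N), for k = 0 .. N-1."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

# A sine completing exactly 3 cycles in N samples "contains" only the
# frequency of bin k = 3 (and its mirror image at bin N - 3).
N = 32
x = [math.sin(2 * math.pi * 3 * n / N) for n in range(N)]
magnitudes = [abs(X_k) for X_k in dft(x)]
print(round(magnitudes[3], 6))  # 16.0, i.e. N / 2
```

Every other bin comes out essentially zero, which is exactly the "how much of each frequency" reading described above.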

How many frequencies, and which ones, depends on the value $N$, which is up to you to decide. There is a tradeoff, however: the larger $N$ is, the more time it takes to collect those samples. You get better frequency resolution when $N$ is large, but poorer time resolution, and vice versa.
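This tradeoff is easy to quantify: adjacent DFT bins are spaced `sample_rate / N` apart in frequency, while the analysis window spans `N / sample_rate` seconds. A quick illustration at the 44100 Hz rate used earlier:

```python
sample_rate = 44100  # samples per second

for N in (256, 1024, 4096):
    freq_resolution_hz = sample_rate / N  # spacing between adjacent bins
    window_ms = 1000.0 * N / sample_rate  # time needed to collect N samples
    print(f"N={N:5d}  bin spacing={freq_resolution_hz:7.2f} Hz  "
          f"window={window_ms:6.2f} ms")
```

Quadrupling $N$ makes the frequency grid four times finer, but each analysis frame covers four times as much time.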

An important (and incredible) property of both the FT and the DFT is that they are reversible without any loss of information or change to the signal: you can convert from the time-domain to the frequency-domain and back and recover your original signal exactly. The inverse DFT (IDFT) looks remarkably similar to the DFT: \begin{equation} x(k)=\frac{1}{N}\sum^{N-1}_{n=0}X(n)e^{i2\pi\frac{kn}{N}} \label{eq:idft} \end{equation} (Note that $X(n)$ and $x(n)$ are swapped relative to the DFT, by convention there is a $1/N$ normalization, and, if you missed it, the “-” sign in the exponent is gone.)
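This losslessness is easy to demonstrate with NumPy's FFT routines (the FFT, again, being just a fast algorithm for the DFT and IDFT above):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1024)  # any signal will do; here, random noise

X = np.fft.fft(x)        # time domain -> frequency domain
x_back = np.fft.ifft(X)  # frequency domain -> back to time domain

# The round trip recovers the original signal to floating-point precision;
# the tiny imaginary parts of x_back are numerical noise.
print(np.allclose(x_back.real, x))          # True
print(np.max(np.abs(x_back.imag)) < 1e-12)  # True
```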

It can be difficult to form an intuition for the FT and DFT based on the equations alone. The following video provides a really nice visual demonstration of how this all works:

While that is probably more math than you might normally come into contact with in a course on music psychology, the FT allows us to perform very clever types of analyses of sounds as we will see below.

## Spectrum

Once a waveform has been converted from the time domain to the frequency domain
using a Fourier transformation, the resulting representation is often referred to as its
*spectrum*.
The spectrum is basically a snapshot of the component parts of whatever waveform
that you fed into the Fourier transform.
You could do this with something as short as a recording of a pluck of a single note
on a banjo or an entire symphony.

(Audio by John MacCallum)
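We can sketch the idea with a synthetic "pluck" (a decaying sum of harmonics, our own toy construction, not a real banjo recording) and NumPy's real-input FFT:

```python
import numpy as np

sr = 44100
t = np.arange(sr) / sr  # one second of time points

# A toy pluck: four harmonics of 220 Hz, each quieter than the last,
# all dying away under a shared exponential decay.
pluck = sum(0.5 ** h * np.sin(2 * np.pi * 220 * (h + 1) * t)
            for h in range(4)) * np.exp(-3 * t)

spectrum = np.abs(np.fft.rfft(pluck))
freqs = np.fft.rfftfreq(len(pluck), d=1 / sr)

# The strongest component sits at the 220 Hz fundamental.
peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)
```

Plotting `spectrum` against `freqs` would show one peak per harmonic, which is exactly the "snapshot of component parts" described above.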

Looking at a sound’s spectrum, we now have a clearer view of the frequency composition of a signal. Although the DFT does actually produce the phase of each component, that information is often thrown away in the name of efficiency, which is also a tacit statement that it is a significantly less important feature than the frequency information.

Another representation of the spectrum over time is called a *sonogram* (also known as a *spectrogram*):

(Audio by John MacCallum)

Here, we can see a history of the spectrum over time, with the height
representing frequency, the brightness *magnitude* (related to amplitude),
and time along the x-axis.

One reason to make a sonogram is to visually investigate the component parts of a sound. For example, above we learned that a sound’s timbre or tone color is partially determined by the combination of waveforms that make up the sound. Looking at the images below, we are unable to see anything distinct in the timbres of a short clip of a clarinet versus an oboe in the time domain.

(Time-domain waveforms: clarinet and oboe)

(Frequency-domain views: clarinet and oboe)

Looking at these exact same clips of sound, but now in the frequency domain, we can see that the two sounds are very different. Inspecting these audio clips in the frequency domain allows us to see an audio fingerprint of each instrument’s timbre. Spectrograms are made possible by the Fourier transform and form the basis for investigating questions of musical timbre, automatic chord transcription, and many other tasks of musical analysis.
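A sonogram like those above is built by taking DFTs of successive short slices of the signal, a short-time Fourier transform. A minimal sketch with NumPy (window and hop sizes here are our own typical choices):

```python
import numpy as np

def sonogram(signal, win=1024, hop=512):
    """Magnitude spectra of successive windowed slices of the signal.
    Rows are time frames; columns are frequency bins."""
    window = np.hanning(win)
    frames = [np.abs(np.fft.rfft(signal[start:start + win] * window))
              for start in range(0, len(signal) - win + 1, hop)]
    return np.array(frames)

sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)  # one second of a 440 Hz sine

S = sonogram(tone)
print(S.shape)  # (frames, bins) -- here (85, 513)
```

Drawing `S` as an image, with frame index along x, bin along y, and magnitude as brightness, reproduces the sonogram layout described above.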

# Links and Downloads

As part of MUTOR 2.0, the Multimedia Kontor Hamburg (https://www.mmkh.de) has developed a “Virtual Harp” that may be played while wearing an Oculus Quest. Instructions for download and installation may be found here.

# Quiz

Briefly explain how a microphone works.

What is the term for the way in which energy moves between different media?

What were the two signal domains discussed?

What is the name of the process by which a signal is converted from one of these domains to the other?

Describe how time, frequency, and magnitude (amplitude) are represented in a sonogram.