Prosodic Measures
What do the Voxit prosodic measures quantify?
*These measures work best for 20-second windows of audio. Rigorous testing shows that a 20-second window (audio sample) most consistently matches the prosodic measures of the entire recording, so downloaded windowed data also defaults to 20-second windows.
To learn more about the prosodic measures and how much they might differ for a given group of recordings, see "Beyond Poet Voice: Sampling the (Non)-Performance Style of 100 American Poets."
Words Per Minute
The average number of words per minute. The number of words read is
taken from the transcript of the recording created by Gentle,
corrected when necessary. That count is divided by the length of the
recording and, if the recording is longer or shorter than one
minute, scaled to reflect the speaking rate for 60 seconds.
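As a minimal sketch of the normalization described above (the function name and arguments are illustrative, not part of Voxit or Drift), the word count is divided by the duration in seconds and scaled to 60 seconds:

```python
def words_per_minute(word_count: int, duration_seconds: float) -> float:
    """Scale a word count to a 60-second speaking rate."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return word_count / duration_seconds * 60.0


# 50 words read in a 20-second window correspond to 150 words per minute.
rate = words_per_minute(50, 20.0)
```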
f0 Mean
Average pitch. f0, or fundamental frequency, is the number of times
the vocal cords vibrate per second, measured in Hertz (cycles per
second). It is sampled every 10 milliseconds and averaged, excluding
outliers.
f0 Range
Range of pitches measured in octaves, excluding outliers.
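A range in octaves is the base-2 logarithm of the ratio between the highest and lowest pitch. The sketch below is illustrative only: the function name is hypothetical, treating 0 Hz as an unvoiced frame is an assumption, and the outlier exclusion is left out because its method is not described here.

```python
import math


def f0_range_octaves(f0_hz):
    """Distance between the lowest and highest pitch, in octaves."""
    # Assumption: frames with f0 <= 0 are unvoiced and are skipped.
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced frames")
    return math.log2(max(voiced) / min(voiced))


# A voice moving between 110 Hz and 220 Hz spans exactly one octave.
span = f0_range_octaves([110.0, 0.0, 165.0, 220.0])
```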
f0 Mean Absolute Velocity
Speed of f0 in octaves per second. This is simply a measure of how
fast pitch is changing. While we calculate this with attention to
direction of pitch speed, or velocity—that is, whether pitch is
typically rising or falling, which can indicate a questioning,
leading, or tentative tone, versus an authoritative, declarative or
conclusive tone—we remove the +/- sign to compare the absolute value
of the pitch speed among speakers.
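The effect of dropping the sign can be sketched as follows. This is not the Voxit implementation: the function name is hypothetical, and the input is assumed to be one continuous run of voiced frames sampled every 10 milliseconds.

```python
import math


def mean_absolute_velocity(f0_hz, frame_seconds=0.010):
    """Mean absolute pitch speed, in octaves per second."""
    octaves = [math.log2(f) for f in f0_hz]
    # Frame-to-frame change in octaves, divided by the frame spacing.
    speeds = [(b - a) / frame_seconds for a, b in zip(octaves, octaves[1:])]
    if not speeds:
        raise ValueError("need at least two voiced frames")
    return sum(abs(s) for s in speeds) / len(speeds)


# Rising one octave and then falling one octave: the signed changes cancel,
# but the absolute speed does not.
speed = mean_absolute_velocity([100.0, 200.0, 100.0], frame_seconds=1.0)
```

With the sign kept, the rise and the fall in this example would average to zero, which is why the absolute value is used when comparing speakers.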
f0 Mean Absolute Acceleration
Acceleration of f0 in octaves per second squared. Acceleration is
the rate of change of pitch velocity, that is how rapidly the
changes in pitch change, which we perceive as the lilt of a voice.
Again, while we calculate the rate of change in speed with attention
to direction—that is, whether pitch change is typically curving
upward or curving downward —we compare pitch acceleration among
speakers without regard to direction. This is also a very useful
measure for expressivity.
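Acceleration can be sketched as the change in velocity from one frame to the next. As before, this is an illustration under assumptions (a hypothetical function name, one continuous voiced run, 10-millisecond frames), not the Voxit code.

```python
import math


def mean_absolute_acceleration(f0_hz, frame_seconds=0.010):
    """Mean absolute pitch acceleration, in octaves per second squared."""
    octaves = [math.log2(f) for f in f0_hz]
    velocity = [(b - a) / frame_seconds for a, b in zip(octaves, octaves[1:])]
    accel = [(b - a) / frame_seconds for a, b in zip(velocity, velocity[1:])]
    if not accel:
        raise ValueError("need at least three voiced frames")
    return sum(abs(a) for a in accel) / len(accel)


# A steady glide (one octave per frame) has constant velocity, so no acceleration.
steady = mean_absolute_acceleration([100.0, 200.0, 400.0], frame_seconds=1.0)
# A rise followed by a fall reverses the velocity, so acceleration is large.
turning = mean_absolute_acceleration([100.0, 200.0, 100.0], frame_seconds=1.0)
```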
f0 Entropy
Entropy for f0, indicating the predictability of pitch patterns.
Entropy is an information theoretic measure of predictability (or
strictly speaking, its opposite – unpredictability or disorder).
Mathematically, if P(f) is the probability that a given talker uses
a particular pitch frequency f, then entropy is given by the
negative of the sum of all pitch probabilities times the log (base
2) of those pitch probabilities:
$H = -\sum_f P(f) \log_2 P(f)$. In this
case, entropy quantifies how uniformly a speaker uses all possible
pitches within +/-1 octave of their mean—that is, how much of the
available pitch values the speaker uses. Staying narrowly close to
the mean most of the time (even with high f0 speed) gives low
entropy. Using all pitches with equal probability gives high
entropy.
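The entropy definition above can be sketched by binning pitch values and applying the formula to the bin frequencies. The bin width (12 bins per octave), the use of the mean in Hertz as the reference, and the function name are all assumptions made for illustration; the text does not say how Voxit discretizes pitch.

```python
import math
from collections import Counter


def f0_entropy_bits(f0_hz, bins_per_octave=12):
    """Shannon entropy (bits) of pitch usage within +/-1 octave of the mean."""
    if not f0_hz:
        raise ValueError("no voiced frames")
    mean_hz = sum(f0_hz) / len(f0_hz)
    # Pitch of each frame relative to the mean, in octaves.
    relative = [math.log2(f / mean_hz) for f in f0_hz]
    kept = [r for r in relative if -1.0 <= r <= 1.0]
    # Count how often each pitch bin is used.
    counts = Counter(math.floor(r * bins_per_octave) for r in kept)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# One pitch used all the time is fully predictable.
monotone = f0_entropy_bits([200.0] * 10)
# Two pitches used equally often carry one bit of uncertainty.
two_tone = f0_entropy_bits([100.0] * 5 + [200.0] * 5)
```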
Rhythmic Complexity All Pauses
This measure is unitless, calculated using the Lempel-Ziv algorithm
to estimate Kolmogorov complexity. The same algorithm is used for
compression, as with GIF or ZIP files. A higher value indicates less
predictable and less repetitive pauses, normalized for audio length.
The idea is to find any and all repeated temporal patterns, coding
pause vs. speech, or voiced vs. unvoiced, as 1's and 0's
respectively over time. For pause complexity we used pause (1) vs.
speech (0). This measure reflects how many unique speech-pause
patterns must be combined to reproduce the observed speech-pause
signal. The more easily one can reconstruct the data with a set of
repeated patterns, the simpler it is, i.e. the lower the rhythmic
complexity and the more predictable and regular the rhythm. As a
generalization, the more predictable a poet’s rhythmic measures, the
more formally they may read, while the less predictable, the more
conversationally they may read.
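The idea of counting the patterns needed to rebuild a pause/speech signal can be sketched with a simple dictionary-based (LZ78-style) parse. This is one member of the Lempel-Ziv family chosen for brevity; the text does not say which variant Voxit uses, and the function name and the example strings are hypothetical. Normalization for audio length is omitted.

```python
def lempel_ziv_phrase_count(bits: str) -> int:
    """Count the phrases in an LZ78-style parse of a 0/1 string.

    Each phrase is the shortest run of symbols not already seen as a phrase.
    """
    phrases = set()
    current = ""
    for symbol in bits:
        current += symbol
        if current not in phrases:
            phrases.add(current)
            current = ""
    # A trailing fragment that repeats an earlier phrase still counts once.
    return len(phrases) + (1 if current else 0)


# pause = 1, speech = 0, one symbol per time step
regular = lempel_ziv_phrase_count("0101010101010101")
irregular = lempel_ziv_phrase_count("0010111001101000")
```

On strings this short the counts differ only slightly; the difference between regular and irregular signals grows with the length of the sequence.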
Average Pause Rate
Average number of pauses greater than 100, 250 and 500 ms,
normalized for recording length.
Average Pause Duration
Average length of pauses.
Pause Count
The number of pauses between words greater than 100, 500, 1000, and
2000 milliseconds, per minute, normalized for recording length. We
do not consider pauses shorter than 100 ms because fully continuous
speech also naturally has such brief gaps in energy, nor do we
consider pauses that exceed 1,999 ms (that is, 2 seconds or more),
because they are quite rare within the reading of a poem.
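A rough sketch of counting pauses between words follows. It assumes word timings are available as (start, end) pairs in seconds, which is a simplification of the aligner output, and it applies only the lower thresholds; how the upper cutoff is applied is not sketched here. The function name is hypothetical.

```python
def pause_counts_per_minute(words, duration_seconds,
                            thresholds_ms=(100, 500, 1000, 2000)):
    """Count gaps between consecutive words longer than each threshold.

    words: list of (start_seconds, end_seconds), sorted by start time.
    Returns {threshold_ms: pauses per minute}.
    """
    gaps_ms = [(nxt[0] - cur[1]) * 1000.0 for cur, nxt in zip(words, words[1:])]
    minutes = duration_seconds / 60.0
    return {t: sum(1 for g in gaps_ms if g > t) / minutes for t in thresholds_ms}


# Gaps of roughly 300 ms, 700 ms and 50 ms in a 30-second recording.
timings = [(0.0, 1.0), (1.3, 2.0), (2.7, 3.0), (3.05, 4.0)]
counts = pause_counts_per_minute(timings, duration_seconds=30.0)
```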
Intensity Mean Absolute Velocity (DMG)
Average speed of change of intensity/volume. Sound intensity or volume is measured in decibels (dB), a
logarithmic unit of power that correlates with our subjective impression of loudness. Intensity speed measures
how rapidly a talker modulates the volume of her or his voice within each voiced segment (which may include part
of a word, one entire word, or several words in close succession). For example, an even utterance “hel-lo”
would have low intensity speed, whereas a sudden stress on the end “hel-LO” or the beginning “HEL-lo” would
have high intensity speed. The unit of measurement is decibels per second (dB/s).
Intensity Mean Absolute Acceleration (DMG)
Average acceleration of change of intensity/volume in decibels per second squared (dB/s^2).
Intensity Segment Range (DMG)
Range of intensity values, excluding outliers. The unit of measurement is decibels or dB.
Dynamism (DMG)
This is a composite measure, multiplying average pitch speed by average pitch entropy, and adding the average
of syllabic and phrasal rhythmic complexity, with the terms weighted to have equal influence: dynamism =
(abs(f0speed) * f0entropy) + (complexitySyllables + complexityPhrases) * 0.2195. This is a measure-in-progress, an
attempt to put a number on how predictable or repetitive a speaker’s pitch, or intonation, and rhythmic
patterns are in combination. We note that a predictable rhythmic pattern may be characteristic of more formal
poetry that uses repetition and anaphora when the poet emphasizes those patterns vocally, while pitch and
intonation patterns would seem to be more independent of poetic form.
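The formula above translates directly into a small function. The argument names mirror the terms in the formula; the function itself is only an illustration, and the input values in the example are made up.

```python
def dynamism(f0_speed, f0_entropy, complexity_syllables, complexity_phrases,
             rhythm_weight=0.2195):
    """Composite of pitch movement and rhythmic complexity, per the formula above."""
    pitch_term = abs(f0_speed) * f0_entropy
    rhythm_term = (complexity_syllables + complexity_phrases) * rhythm_weight
    return pitch_term + rhythm_term


# Made-up inputs: 2.0 * 3.0 + (10.0 + 10.0) * 0.2195 = 10.39
score = dynamism(2.0, 3.0, 10.0, 10.0)
```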
Note: The graph is scaled in log-based hertz.
Note: Measures listed with DMG are resource-intensive. They are only available on the downloadable DMG and are turned off by default.
Full Recording Duration vs. Selection
Drift calculates prosodic measurements for both a selected timeframe and a “full recording duration”. The full recording duration is not the duration of the full audio file, but the time between the first and last words of the transcript as Gentle aligns them to the audio file. For example, given a long audio clip with a short transcript, “full recording duration” refers only to the period between the first word and the last word of the transcript corresponding to the audio clip. However, you can make a selection that extends outside the transcript duration, and Drift will calculate Voxit prosodic measurements for the full selection, regardless of whether a transcript is found in the selection.
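The distinction can be sketched as follows. The word entries are assumed to be dictionaries with "case", "start" and "end" keys, in the style of Gentle's alignment output; treat that layout, and the function name, as assumptions rather than a description of Drift's code.

```python
def transcript_span(aligned_words):
    """Return (start, end) in seconds covered by successfully aligned words."""
    # Assumption: words the aligner could not place carry no usable timings.
    found = [w for w in aligned_words if w.get("case") == "success"]
    if not found:
        return None
    return min(w["start"] for w in found), max(w["end"] for w in found)


words = [
    {"word": "um", "case": "not-found-in-audio"},
    {"word": "hello", "case": "success", "start": 5.0, "end": 5.4},
    {"word": "world", "case": "success", "start": 9.0, "end": 9.5},
]
start, end = transcript_span(words)
full_recording_duration = end - start  # 4.5 s, however long the audio file is
```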
Downloadable Data
Audio Transcript
Audio transcript is the uploaded transcript associated with this audio file.
Voxit Data
This is the same data found in the prosodic measures table on a given Drift document.
Drift Data
This contains pitch, word, and phoneme data for every 10 milliseconds.
Gentle Align
This is the Gentle data used to calculate Voxit measurements, and contains word and phoneme data, as well as
their timings.
Windowed Voxit Data
Prosodic measures taken for consecutive 20-second slices of the audio file; 20 seconds is the default window
(audio sample) length. Rigorous testing showed that a 20-second window is ideal for its consistency in matching
the prosodic measures of the entire audio length.
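Consecutive windows can be sketched as a list of (start, end) times. Keeping a shorter final window is an assumption; the text does not say how Drift treats a remainder shorter than 20 seconds. The function name is hypothetical.

```python
def window_bounds(duration_seconds, window_seconds=20.0):
    """Split a recording into consecutive (start, end) windows, in seconds."""
    bounds = []
    start = 0.0
    while start < duration_seconds:
        # Assumption: the last window is kept even if it is shorter.
        bounds.append((start, min(start + window_seconds, duration_seconds)))
        start += window_seconds
    return bounds


# A 50-second recording yields two full windows and one 10-second remainder.
slices = window_bounds(50.0)
```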