## Prosodic Measures

### What do the Voxit prosodic measures quantify?

*These measures work best on 20-second windows of audio. In testing, a 20-second window most consistently matched the prosodic measures of the entire recording, so windowed data downloads also default to 20-second windows.*

**Words Per Minute**

The average number of words spoken per minute. The word count is
taken from the transcript of the recording created by Gentle
(corrected when necessary). It is divided by the length of the
recording and scaled to 60 seconds, so that recordings longer or
shorter than one minute yield a comparable speaking rate.
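
As a minimal sketch of the normalization described above (the function name and arguments are illustrative, not part of Drift or Voxit), the rate is the word count scaled to a 60-second span:

```python
def words_per_minute(word_count: int, duration_seconds: float) -> float:
    """Scale a word count to a 60-second speaking rate."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return word_count * 60.0 / duration_seconds


# 50 words read in a 20-second window correspond to 150 words per minute.
rate = words_per_minute(50, 20.0)
```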

**f0 Mean**

Average pitch. The fundamental frequency (f0) of the voice is
sampled every 10 milliseconds and averaged, excluding outliers. It
is measured in hertz (cycles per second) and corresponds to the
number of times the vocal cords vibrate per second.

**f0 Range**

Range of pitches measured in octaves, excluding outliers.
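
The sketch below illustrates how a mean in hertz and a range in octaves could be computed from a list of voiced f0 samples. It is an illustration only: the function names are hypothetical, and it does not reproduce the outlier exclusion that Voxit applies, since the exact rule is not stated here.

```python
import math


def f0_mean_hz(f0_hz):
    """Arithmetic mean of voiced f0 samples, in hertz (no outlier filtering)."""
    if not f0_hz:
        raise ValueError("need at least one voiced sample")
    return sum(f0_hz) / len(f0_hz)


def f0_range_octaves(f0_hz):
    """Distance between the highest and lowest pitch, in octaves."""
    if not f0_hz:
        raise ValueError("need at least one voiced sample")
    # One octave is a doubling of frequency, hence the base-2 logarithm.
    return math.log2(max(f0_hz) / min(f0_hz))


samples = [110.0, 220.0, 440.0]  # each step is one octave higher
span = f0_range_octaves(samples)
```

Expressing the range in octaves rather than hertz makes speakers with low and high voices directly comparable.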

**f0 Mean Absolute Velocity**

Speed of f0 in octaves per second. This is simply a measure of how
fast pitch is changing. While we calculate this with attention to
direction of pitch speed, or velocity—that is, whether pitch is
typically rising or falling, which can indicate a questioning,
leading, or tentative tone, versus an authoritative, declarative or
conclusive tone—we remove the +/- sign to compare the absolute value
of the pitch speed among speakers.

**f0 Mean Absolute Acceleration**

Acceleration of f0 in octaves per second squared. Acceleration is
the rate of change of pitch velocity, that is, how rapidly the
changes in pitch themselves change, which we perceive as the lilt of
a voice. Again, while we calculate the rate of change in speed with
attention to direction (that is, whether pitch change is typically
curving upward or curving downward), we compare pitch acceleration
among speakers without regard to direction. This is also a very
useful measure for expressivity.
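
The velocity and acceleration measures can be sketched with finite differences over a pitch contour expressed in octaves. This is a simplified illustration, assuming one unbroken voiced stretch sampled at a fixed step `dt`; the helper names and the choice of reference frequency are assumptions, not Voxit's implementation.

```python
import math


def f0_to_octaves(f0_hz, ref_hz):
    """Express each f0 sample as octaves above or below a reference."""
    return [math.log2(f / ref_hz) for f in f0_hz]


def mean_abs_derivative(values, dt, order):
    """Mean absolute value of the finite-difference derivative.

    order=1 gives velocity (units per second),
    order=2 gives acceleration (units per second squared).
    """
    for _ in range(order):
        values = [(b - a) / dt for a, b in zip(values, values[1:])]
    if not values:
        raise ValueError("not enough samples for the requested order")
    return sum(abs(v) for v in values) / len(values)


# A toy contour sampled once per second (real f0 tracks use dt = 0.01 s).
octaves = f0_to_octaves([100.0, 200.0, 800.0], ref_hz=100.0)
velocity = mean_abs_derivative(octaves, dt=1.0, order=1)
acceleration = mean_abs_derivative(octaves, dt=1.0, order=2)
```

Taking the absolute value before averaging is what discards the rising or falling direction, so a contour that rises and falls equally does not average out to zero.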

**f0 Entropy**

Entropy of f0, indicating the predictability of pitch patterns.
Entropy is an information-theoretic measure of predictability (or,
strictly speaking, its opposite: unpredictability or disorder).
Mathematically, if P(f) is the probability that a given talker uses
a particular pitch frequency f, then entropy is the negative of the
sum, over all pitches, of each pitch probability times the log (base
2) of that probability: $H = -\sum_f P(f) \log_2 P(f)$. In this
case, entropy quantifies how uniformly a speaker uses all possible
pitches within +/-1 octave of their mean, that is, how much of the
available pitch range the speaker uses. Staying narrowly close to
the mean most of the time (even with high f0 speed) gives low
entropy. Using all pitches with equal probability gives high
entropy.
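
A minimal sketch of this calculation follows. The bin count, the use of the mean of log-frequency as the centre, and the function names are all assumptions made for illustration; Voxit's actual binning is not specified here.

```python
import math
from collections import Counter


def entropy_bits(probabilities):
    """Shannon entropy in bits: the negated sum of p * log2(p)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def f0_entropy(f0_hz, n_bins=24):
    """Entropy of the pitch histogram within +/-1 octave of the mean.

    Assumptions: the mean is taken in log2 frequency, and the 2-octave
    span is split into n_bins equal-width bins.
    """
    logs = [math.log2(f) for f in f0_hz if f > 0]
    if not logs:
        return 0.0
    mean_log = sum(logs) / len(logs)
    counts = Counter()
    for value in logs:
        offset = value - mean_log  # octaves above or below the mean
        if -1.0 <= offset <= 1.0:
            index = min(int((offset + 1.0) / 2.0 * n_bins), n_bins - 1)
            counts[index] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return entropy_bits(c / total for c in counts.values())


flat = f0_entropy([200.0] * 10)              # one pitch only: no uncertainty
uniform = entropy_bits([0.25, 0.25, 0.25, 0.25])  # four equally likely pitches
```

A monotone voice lands in a single bin and scores zero, while four equally used pitch bins score 2 bits, the maximum for four outcomes.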

**Rhythmic Complexity All Pauses**

This measure is unitless, calculated using the Lempel-Ziv algorithm
to estimate Kolmogorov complexity; the same family of algorithms is
used for compression, as with GIF or ZIP files. A higher value
indicates less predictable and less repetitive pauses, normalized
for audio length. The idea is to find any and all repeated temporal
patterns, coding pause vs. speech (or voiced vs. unvoiced) as 1's
and 0's respectively over time. For pause complexity we used pause
(1) vs. speech (0). This measure reflects how many unique
speech-pause patterns must be combined in order to reproduce the
observed speech-pause signal. The more easily one can reconstruct
the data with a set of repeated patterns, the simpler it is, i.e.
lower rhythmic complexity, or a more predictable, regular rhythm. As
a generalization, the more predictable a poet’s rhythmic measures,
the more formally they may read, while the less predictable, the
more conversationally they may read.
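
To make the idea concrete, here is a small sketch of a Lempel-Ziv (1976-style) phrase count over a pause/speech bit string. It is an illustration under assumptions: the input is already a string of `1` (pause) and `0` (speech) frames, and the normalization by `log2(n) / n` is a common convention chosen here, not necessarily the one Voxit uses.

```python
import math


def lempel_ziv_phrases(bits: str) -> int:
    """Count the phrases in a Lempel-Ziv parsing of a symbol string.

    Each new phrase is extended for as long as it can be copied from
    the material that precedes its final symbol.
    """
    i, count, n = 0, 0, len(bits)
    while i < n:
        length = 1
        while i + length <= n and bits[i:i + length] in bits[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def normalized_complexity(bits: str) -> float:
    """Phrase count scaled for sequence length (assumed normalization)."""
    n = len(bits)
    if n < 2:
        return 0.0
    return lempel_ziv_phrases(bits) * math.log2(n) / n


regular = "01" * 20          # strictly alternating pause/speech frames
regular_phrases = lempel_ziv_phrases(regular)
```

A perfectly alternating signal needs only three phrases however long it runs, whereas an irregular signal of the same length needs more, which is what a higher complexity value reflects.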

**Average Pause Rate**

Average number of pauses greater than 100, 250 and 500 ms,
normalized for recording length.

**Average Pause Duration**

Average length of pauses.

**Pause Count**

The number of pauses between words greater than 100, 500, 1000, and
2000 milliseconds, per minute, normalized for recording length. We
do not consider pauses less than 100ms because fully continuous
speech also naturally has such brief gaps in energy, nor do we
consider pauses that exceed 1,999 ms (that is, 2 seconds), because
they are quite rare within the reading of a poem.
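
The pause measures above can be sketched from word timings alone. In this illustration the input is a list of hypothetical `(start, end)` word intervals in seconds; the 100 ms floor follows the text, while the function names and the omission of any upper cutoff are simplifying assumptions.

```python
def pause_durations(word_intervals, min_pause=0.1):
    """Gaps between consecutive words that exceed min_pause seconds."""
    pauses = []
    for (_, prev_end), (next_start, _) in zip(word_intervals, word_intervals[1:]):
        gap = next_start - prev_end
        if gap > min_pause:
            pauses.append(gap)
    return pauses


def pauses_per_minute(pauses, threshold, duration_seconds):
    """Number of pauses longer than threshold, scaled to 60 seconds."""
    count = sum(1 for p in pauses if p > threshold)
    return count * 60.0 / duration_seconds


def average_pause_duration(pauses):
    """Mean pause length in seconds, or 0.0 when there are no pauses."""
    return sum(pauses) / len(pauses) if pauses else 0.0


# Four words over five seconds; the 50 ms gap is too short to count.
words = [(0.0, 1.0), (1.3, 2.0), (2.05, 3.0), (4.0, 5.0)]
pauses = pause_durations(words)
```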

**Intensity Mean Absolute Velocity** DMG

Average speed of change of intensity/volume. Sound intensity or volume is measured in decibels (dB), a
logarithmic unit of power that correlates with our subjective impression of loudness. Intensity speed measures
how rapidly a talker modulates the volume of her/his voice within each voiced segment (which may include part
of a word, one entire word, or several words in close succession). For example, an even utterance “hel-lo”
would have low intensity speed, whereas a sudden stress on the end “hel-LO” or the beginning “HEL-lo” would
have high intensity speed. The unit of measurement is decibels per second (dB/s).
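
As a rough sketch of the idea, the snippet below converts an amplitude envelope to decibels and averages the absolute rate of change. It assumes a single voiced segment sampled at a fixed step `dt` and the amplitude convention of 20 times the base-10 logarithm; the names are hypothetical and this is not Voxit's implementation.

```python
import math


def amplitude_to_db(amplitudes, ref=1.0):
    """Convert linear amplitudes to decibels relative to ref."""
    return [20.0 * math.log10(a / ref) for a in amplitudes]


def mean_abs_velocity(values, dt):
    """Mean absolute change per second between consecutive samples."""
    steps = [abs(b - a) / dt for a, b in zip(values, values[1:])]
    if not steps:
        raise ValueError("need at least two samples")
    return sum(steps) / len(steps)


# A tenfold amplitude jump is a 20 dB step; over 0.5 s that is 40 dB/s.
levels_db = amplitude_to_db([1.0, 10.0, 100.0])
speed = mean_abs_velocity(levels_db, dt=0.5)
```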

**Intensity Mean Absolute Acceleration** DMG

Average acceleration of change of intensity/volume in decibels per second squared (dB/s^2).

**Intensity Segment Range** DMG

Range of intensity values, excluding outliers. The unit of measurement is decibels or dB.

**Dynamism** DMG

This is a composite measure, multiplying average pitch speed by average pitch entropy, and adding the average
of syllabic and phrasal rhythmic complexity, with the terms weighted to have equal influence (dynamism =
(abs(f0speed) * f0entropy) + (complexitySyllables + complexityPhrases) * 0.2195). This is a measure-in-progress,
an attempt to put a number on how predictable or repetitive a speaker’s pitch, or intonation, and rhythmic
patterns are in combination. We note that a predictable rhythmic pattern may be characteristic of more formal
poetry that uses repetition and anaphora when the poet emphasizes those patterns vocally, while pitch and
intonation patterns would seem to be more independent of poetic form.
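
The formula can be transcribed directly, as in the sketch below. The argument names mirror the terms in the formula; the function itself is illustrative and the input values are made up.

```python
def dynamism(f0_speed, f0_entropy, complexity_syllables, complexity_phrases):
    """Composite of a pitch term and a weighted rhythm term."""
    pitch_term = abs(f0_speed) * f0_entropy
    rhythm_term = (complexity_syllables + complexity_phrases) * 0.2195
    return pitch_term + rhythm_term


# Made-up inputs: pitch term 2.0 * 3.0, rhythm term (10.0 + 10.0) * 0.2195.
score = dynamism(2.0, 3.0, 10.0, 10.0)
```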

*Note: The graph is scaled in log-based hertz.*

*Note: Measures listed with DMG are resource-intensive. They
are only available on the downloadable DMG and
are turned off by default.*

### Full Recording Duration vs. Selection

Drift calculates prosodic measurements for both a selected timeframe
and a “full recording duration”. A full recording duration is
**not**
defined as the duration of the full audio file, but the duration
between when Gentle aligns the first word of the transcript and the
last word of the transcript to the audio file. For example, given a
long audio clip with a short transcript, “full recording duration”
refers only to the period between the first word and the last word
of the transcript corresponding to the audio clip. However, you can
make a selection outside a transcript duration and Drift will
calculate Voxit prosodic measurements for that full selection
duration, regardless of whether a transcript is found in the selection.
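
As an illustration of this definition, the sketch below derives the span from a list of aligned words. It assumes word entries shaped like Gentle's JSON output, with a `case` field equal to `"success"` for aligned words and `start`/`end` times in seconds; the function name is hypothetical.

```python
def full_recording_span(gentle_words):
    """Return (start, end, duration) between the first and last aligned word.

    Words that Gentle could not align carry no timings and are skipped.
    Returns None when no word was aligned.
    """
    aligned = [w for w in gentle_words if w.get("case") == "success"]
    if not aligned:
        return None
    start = min(w["start"] for w in aligned)
    end = max(w["end"] for w in aligned)
    return start, end, end - start


words = [
    {"word": "hello", "case": "success", "start": 2.5, "end": 3.0},
    {"word": "um", "case": "not-found-in-audio"},
    {"word": "world", "case": "success", "start": 3.5, "end": 4.5},
]
span = full_recording_span(words)
```

Here the audio file may be much longer, but the span covers only the 2 seconds between the first and last aligned words.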

### Downloadable Data

**Audio Transcript**

Audio transcript is the uploaded transcript associated with this audio file.

**Voxit Data**

This is the same data found in the prosodic measures table on a given Drift document.

**Drift Data**

This contains pitch, word, and phoneme data for every 10 milliseconds.

**Gentle Align**

This is the Gentle data used to calculate Voxit measurements, and contains word and phoneme data, as well as
their timings.

**Windowed Voxit Data**

Prosodic measures taken for consecutive 20-second slices of the audio file. The default window (audio sample)
length is 20 seconds because, in testing, windows of that length most consistently matched the prosodic
measures of the entire recording.
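
A minimal sketch of how consecutive windows could be laid out over a recording follows. The function name is hypothetical, and keeping a shorter final window is an assumption; Drift may treat the remainder differently.

```python
def window_bounds(duration_seconds, window_seconds=20.0):
    """Consecutive (start, end) windows covering the recording.

    Assumption: the last window is kept even if shorter than the rest.
    """
    bounds = []
    start = 0.0
    while start < duration_seconds:
        bounds.append((start, min(start + window_seconds, duration_seconds)))
        start += window_seconds
    return bounds


# A 50-second recording yields two full windows and a 10-second remainder.
windows = window_bounds(50.0)
```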