Prosodic Measures

What do the Voxit prosodic measures quantify?

Note: These measures work best on 20-second windows of audio. Rigorous testing shows that measures computed over a 20-second window most consistently match the measures of the entire recording, so windowed data downloads also default to 20-second windows.

Words Per Minute
The average number of words per minute. The number of words read is taken from the transcript of the recording as aligned by Gentle (corrected when necessary). That count is divided by the length of the recording and, if the recording is longer or shorter than one minute, normalized to reflect the speaking rate for 60 seconds.
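The normalization described above amounts to scaling the word count to a 60-second rate. A minimal sketch, assuming the word count and the recording length in seconds are already known (the function name is hypothetical, not part of Voxit):

```python
def words_per_minute(word_count: int, duration_seconds: float) -> float:
    """Scale a word count to a 60-second speaking rate."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return word_count * 60.0 / duration_seconds


# 45 words read in 20 seconds corresponds to 135 words per minute.
print(words_per_minute(45, 20.0))
```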

f0 Mean
Average pitch. The fundamental frequency (f0) of the voice is sampled every 10 milliseconds and averaged, excluding outliers. It is measured in hertz (cycles per second) and corresponds to the number of times the vocal cords vibrate per second.

f0 Range
Range of pitches measured in octaves, excluding outliers.
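A range in octaves is the base-2 logarithm of the ratio between the highest and lowest pitch. The sketch below illustrates this; the percentile trimming used to exclude outliers is an assumption for illustration, since the exact outlier rule is not stated here, and the function name is hypothetical.

```python
import math


def f0_range_octaves(f0_hz, lower_pct=5.0, upper_pct=95.0):
    """Pitch range in octaves between two percentiles of the voiced samples.

    The percentile cutoffs are an assumed stand-in for outlier exclusion.
    """
    voiced = sorted(f for f in f0_hz if f > 0)  # drop unvoiced (zero) samples
    if len(voiced) < 2:
        return 0.0

    def percentile(p):
        # Nearest-rank style lookup into the sorted samples.
        return voiced[round(p / 100.0 * (len(voiced) - 1))]

    return math.log2(percentile(upper_pct) / percentile(lower_pct))


# With no trimming, 110 Hz to 440 Hz spans two octaves.
print(f0_range_octaves([110.0, 220.0, 440.0], lower_pct=0.0, upper_pct=100.0))
```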

f0 Mean Absolute Velocity
Speed of f0 in octaves per second. This is simply a measure of how fast pitch is changing. While we calculate this with attention to direction of pitch speed, or velocity—that is, whether pitch is typically rising or falling, which can indicate a questioning, leading, or tentative tone, versus an authoritative, declarative or conclusive tone—we remove the +/- sign to compare the absolute value of the pitch speed among speakers.

f0 Mean Absolute Acceleration
Acceleration of f0 in octaves per second squared. Acceleration is the rate of change of pitch velocity, that is, how rapidly the changes in pitch themselves change, which we perceive as the lilt of a voice. Again, while we calculate the rate of change in speed with attention to direction (whether pitch change is typically curving upward or curving downward), we compare pitch acceleration among speakers without regard to direction. This is also a very useful measure of expressivity.
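Both the velocity and acceleration measures can be illustrated with finite differences over the pitch track. This is a minimal sketch, assuming a single contiguous voiced segment sampled every 10 ms with all values above zero; the function name and the differencing scheme are assumptions, not necessarily what Voxit does internally.

```python
import math

FRAME_SECONDS = 0.010  # f0 is sampled every 10 ms


def mean_abs_velocity_and_acceleration(f0_hz, dt=FRAME_SECONDS):
    """Return (mean |velocity| in octaves/s, mean |acceleration| in octaves/s^2)."""
    octaves = [math.log2(f) for f in f0_hz]  # pitch on an octave scale
    velocity = [(b - a) / dt for a, b in zip(octaves, octaves[1:])]
    acceleration = [(b - a) / dt for a, b in zip(velocity, velocity[1:])]

    def mean_abs(values):
        # Dropping the sign compares speed regardless of direction.
        return sum(abs(v) for v in values) / len(values) if values else 0.0

    return mean_abs(velocity), mean_abs(acceleration)


# A pitch that rises one octave and falls back, one sample per second.
print(mean_abs_velocity_and_acceleration([100.0, 200.0, 100.0], dt=1.0))
```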

f0 Entropy
Entropy of f0, indicating the predictability of pitch patterns. Entropy is an information-theoretic measure of predictability (or, strictly speaking, its opposite: unpredictability or disorder). Mathematically, if P(f) is the probability that a given talker uses a particular pitch frequency f, then entropy is the negative of the sum, over all pitches, of each pitch probability times the log (base 2) of that probability: $H = -\sum_{f} P(f) \log_2 P(f)$. In this case, entropy quantifies how uniformly a speaker uses all possible pitches within +/-1 octave of their mean, that is, how much of the available pitch values the speaker uses. Staying narrowly close to the mean most of the time (even with high f0 speed) gives low entropy. Using all pitches with equal probability gives high entropy.
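To make the entropy definition concrete, the sketch below estimates P(f) from a histogram of pitch samples within one octave of the speaker's mean. The bin width (12 bins per octave), the use of the arithmetic mean in hertz, and the function name are assumptions chosen for illustration.

```python
import math
from collections import Counter


def f0_entropy(f0_hz, bins_per_octave=12):
    """Shannon entropy (bits) of pitch use within +/-1 octave of the mean."""
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        return 0.0
    mean_f0 = sum(voiced) / len(voiced)
    # Distance of each sample from the mean, in octaves.
    offsets = [math.log2(f / mean_f0) for f in voiced]
    in_range = [o for o in offsets if -1.0 <= o <= 1.0]
    if not in_range:
        return 0.0
    # Histogram the offsets to estimate the probability of each pitch bin.
    counts = Counter(math.floor(o * bins_per_octave) for o in in_range)
    total = len(in_range)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# One repeated pitch is fully predictable; two equally likely pitches give 1 bit.
print(f0_entropy([200.0] * 5))
print(f0_entropy([100.0, 100.0, 200.0, 200.0]))
```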

Rhythmic Complexity All Pauses
This measure is unitless. It is calculated using the Lempel-Ziv algorithm, which is also used for compression (as with GIF or ZIP files), to estimate Kolmogorov complexity. A higher value indicates less predictable and less repetitive pauses, normalized for audio length. The idea is to find any and all repeated temporal patterns, coding the signal over time as 1's and 0's: pause vs. speech, or voiced vs. unvoiced, respectively. For pause complexity we used pause (1) vs. speech (0). This measure reflects how many unique speech-pause patterns must be combined in order to reproduce the observed speech-pause signal. The more easily one can reconstruct the data with a set of repeated patterns, the simpler it is, i.e. lower rhythmic complexity, or a more predictable, regular rhythm. As a generalization, the more predictable a poet’s rhythmic measures, the more formally they may read, while the less predictable, the more conversationally they may read.
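The idea of counting the distinct patterns needed to rebuild a pause/speech signal can be sketched with an LZ78-style dictionary parse. This is one member of the Lempel-Ziv family and is used here only as an illustration; the specific variant, the frame rate of the 1/0 signal, and the length normalization that Voxit applies are not shown and are not assumed to match.

```python
def lz78_phrase_count(bits: str) -> int:
    """Count the phrases in an LZ78-style parse of a 0/1 string.

    Each phrase is the shortest run of symbols not yet seen as a phrase.
    Fewer phrases means the signal is more repetitive.
    """
    phrases = set()
    current = ""
    for symbol in bits:
        current += symbol
        if current not in phrases:
            phrases.add(current)  # a new pattern is needed here
            current = ""
    # A trailing, already-seen fragment still has to be encoded once.
    return len(phrases) + (1 if current else 0)


# pause = 1, speech = 0
regular = "0000000000"    # no pauses at all: highly repetitive
irregular = "0110100110"  # pauses at uneven intervals
print(lz78_phrase_count(regular), lz78_phrase_count(irregular))
```

In this example the irregular sequence needs more phrases than the regular one of the same length, which is the sense in which a less predictable rhythm scores as more complex.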

Average Pause Rate
Average number of pauses greater than 100, 250 and 500 ms, normalized for recording length.

Average Pause Duration
Average length of pauses.

Pause Count
The number of pauses between words greater than 100, 500, 1000, and 2000 milliseconds, per minute, normalized for recording length. We do not consider pauses shorter than 100 ms, because fully continuous speech also naturally has such brief gaps in energy, nor do we consider pauses that exceed 1,999 ms (that is, 2 seconds), because they are quite rare within the reading of a poem.
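Counting pauses above each threshold and normalizing per minute can be sketched from word timings. This example assumes a list of (start, end) times in seconds for consecutive words, measures a pause as the gap between one word's end and the next word's start, and applies no upper cutoff; the function name and input shape are hypothetical.

```python
def pause_counts_per_minute(word_times, thresholds_ms=(100, 500, 1000, 2000)):
    """Pauses per minute longer than each threshold, from (start, end) word times."""
    if len(word_times) < 2:
        return {t: 0.0 for t in thresholds_ms}
    # Gap between the end of each word and the start of the next, in ms.
    gaps_ms = [
        (next_start - prev_end) * 1000.0
        for (_, prev_end), (next_start, _) in zip(word_times, word_times[1:])
    ]
    minutes = (word_times[-1][1] - word_times[0][0]) / 60.0
    return {t: sum(g > t for g in gaps_ms) / minutes for t in thresholds_ms}


# Four words over 30 seconds with gaps of roughly 200 ms, 600 ms and 2500 ms.
words = [(0.0, 1.0), (1.2, 2.0), (2.6, 3.0), (5.5, 30.0)]
print(pause_counts_per_minute(words))
```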

Intensity Mean Absolute Velocity (DMG)
Average speed of change of intensity/volume. Sound intensity or volume is measured in decibels (dB), a logarithmic unit of power that correlates with our subjective impression of loudness. Intensity speed measures how rapidly a talker modulates the volume of their voice within each voiced segment (which may include part of a word, one entire word, or several words in close succession). For example, an even utterance “hel-lo” would have low intensity speed, whereas a sudden stress on the end “hel-LO” or the beginning “HEL-lo” would have high intensity speed. The unit of measurement is decibels per second (dB/s).

Intensity Mean Absolute Acceleration (DMG)
Average acceleration of change of intensity/volume in decibels per second squared (dB/s^2).

Intensity Segment Range (DMG)
Range of intensity values, excluding outliers. The unit of measurement is decibels or dB.

Dynamism
This is a composite measure, multiplying average pitch speed by average pitch entropy, and adding the average of syllabic and phrasal rhythmic complexity, with the terms weighted to have equal influence (dynamism = (abs(f0speed) * f0entropy) + (complexitySyllables + complexityPhrases) * 0.2195). This is a measure-in-progress, an attempt to put a number on how predictable or repetitive a speaker’s pitch, or intonation, and rhythmic patterns are in combination. We note that a predictable rhythmic pattern may be characteristic of more formal poetry that uses repetition and anaphora when the poet emphasizes those patterns vocally, while pitch and intonation patterns would seem to be more independent of poetic form.
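The dynamism formula can be written out directly. A minimal sketch, assuming the four input values have already been computed; the function and parameter names are hypothetical and simply mirror the terms in the formula.

```python
def dynamism(f0_speed, f0_entropy, complexity_syllables, complexity_phrases,
             weight=0.2195):
    """Composite of pitch dynamics and rhythmic complexity, per the formula above."""
    pitch_term = abs(f0_speed) * f0_entropy
    rhythm_term = (complexity_syllables + complexity_phrases) * weight
    return pitch_term + rhythm_term


# Illustrative values only, not measurements from a real recording.
print(dynamism(f0_speed=2.0, f0_entropy=3.0,
               complexity_syllables=10.0, complexity_phrases=10.0))
```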

Note: The graph is scaled in log-based hertz.

Note: Measures listed with DMG are resource-intensive. They are only available on the downloadable DMG and are turned off by default.

Full Recording Duration vs. Selection

Drift calculates prosodic measurements for both a selected timeframe and a “full recording duration”. The full recording duration is not the duration of the full audio file, but the duration between the points where Gentle aligns the first word and the last word of the transcript to the audio file. For example, given a long audio clip with a short transcript, “full recording duration” refers only to the period between the first word and the last word of the transcript corresponding to the audio clip. However, you can make a selection outside the transcript duration, and Drift will calculate Voxit prosodic measurements for that full selection duration, regardless of whether a transcript is found in the selection.
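The full recording duration can be sketched as the span from the first aligned word to the last. This example assumes an alignment shaped like Gentle's JSON output, in which each word entry has a "case" field and, when aligned successfully, "start" and "end" times in seconds; those field names and the function name are assumptions for illustration, not something confirmed by this page.

```python
def full_recording_span(aligned_words):
    """Return (start, end, duration) in seconds spanning the aligned words.

    Words without a successful alignment are ignored. Returns None when
    nothing was aligned.
    """
    timed = [w for w in aligned_words if w.get("case") == "success"]
    if not timed:
        return None
    start = min(w["start"] for w in timed)
    end = max(w["end"] for w in timed)
    return start, end, end - start


# A long clip where the transcript only covers 12.5 s to 41.5 s.
words = [
    {"word": "intro", "case": "not-found-in-audio"},
    {"word": "first", "case": "success", "start": 12.5, "end": 13.0},
    {"word": "last", "case": "success", "start": 40.0, "end": 41.5},
]
print(full_recording_span(words))
```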

Downloadable Data

Audio Transcript
Audio transcript is the uploaded transcript associated with this audio file.

Voxit Data
This is the same data found in the prosodic measures table on a given Drift document.

Drift Data
This contains pitch, word, and phoneme data for every 10 milliseconds.

Gentle Align
This is the Gentle data used to calculate Voxit measurements, and contains word and phoneme data, as well as their timings.

Windowed Voxit Data
Prosodic measures taken for consecutive 20-second slices of the audio file. The default window (audio sample) length is 20 seconds, because rigorous testing showed that measures computed over a 20-second window most consistently match the prosodic measures of the entire recording.
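Slicing a recording into consecutive windows can be sketched as below. How Drift treats a final window shorter than 20 seconds is not stated here; this example keeps it as a shorter slice, which is an assumption, and the function name is hypothetical.

```python
def window_bounds(duration_s, window_s=20.0):
    """Return (start, end) times in seconds for consecutive windows."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    bounds = []
    start = 0.0
    while start < duration_s:
        # The last window is clipped to the end of the recording.
        bounds.append((start, min(start + window_s, duration_s)))
        start += window_s
    return bounds


# A 50-second recording yields two full windows and one 10-second remainder.
print(window_bounds(50.0))
```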